[
  {
    "path": ".astylerc",
    "content": "# astyle -n -r \"benchmark/*.h,*.cpp\" \"src/*.h,*.cpp\" \"tests/*.h,*.cpp\" \"tools/*.h,*.cpp\" \"examples/*.h,*.cpp\"\n\n# brace style\n--style=allman\n\n# tab\n--attach-namespaces\n--attach-extern-c\n--attach-closing-while\n\n# indentation\n--indent-preproc-define\n--indent-col1-comments\n--min-conditional-indent=0\n--max-continuation-indent=120\n\n# padding\n--pad-oper\n--pad-comma\n--pad-header\n--align-pointer=type\n--align-reference=type\n\n# formatting\n--break-closing-braces\n--attach-return-type\n--attach-return-type-decl\n--keep-one-line-blocks\n--keep-one-line-statements\n--convert-tabs\n--max-code-length=200\n--mode=c\n\n# other\n--lineend=linux\n"
  },
  {
    "path": ".clang-format",
    "content": "# find src/ tools/ tests/ examples/ benchmark/ -type f -name '*.c' -o -name '*.cpp' -o -name '*.h' | xargs -i clang-format -i {}\n\n# need clang-format >= 10.0\n\nAccessModifierOffset: -4\nAlignAfterOpenBracket: Align\nAlignConsecutiveAssignments: false\n# AlignConsecutiveBitFields: true\nAlignConsecutiveDeclarations: false\nAlignConsecutiveMacros: true\nAlignEscapedNewlines: Left\n# AlignOperands: AlignAfterOperator\nAlignTrailingComments: true\nAllowAllArgumentsOnNextLine: true\nAllowAllConstructorInitializersOnNextLine: true\nAllowAllParametersOfDeclarationOnNextLine: true\nAllowShortBlocksOnASingleLine: Always\nAllowShortCaseLabelsOnASingleLine: true\n# AllowShortEnumsOnASingleLine: true\nAllowShortFunctionsOnASingleLine: None\nAllowShortIfStatementsOnASingleLine: WithoutElse\nAllowShortLambdasOnASingleLine: All\nAllowShortLoopsOnASingleLine: true\nAlwaysBreakAfterReturnType: None\nAlwaysBreakBeforeMultilineStrings: false\nAlwaysBreakTemplateDeclarations: Yes\nBinPackArguments: true\nBinPackParameters: true\nBraceWrapping:\n  AfterCaseLabel: true\n  AfterClass: true\n  AfterControlStatement: Always\n  AfterEnum: true\n  AfterFunction: true\n  AfterNamespace: false\n  AfterObjCDeclaration: false\n  AfterStruct: true\n  AfterUnion: true\n  AfterExternBlock: false\n  BeforeCatch: true\n  BeforeElse: true\n#  BeforeLambdaBody: false\n#  BeforeWhile: false\n  IndentBraces: false\n  SplitEmptyFunction: true\n  SplitEmptyRecord: true\n  SplitEmptyNamespace: false\nBreakAfterJavaFieldAnnotations: true\nBreakBeforeBinaryOperators: All\nBreakBeforeBraces: Custom\nBreakBeforeTernaryOperators: true\nBreakConstructorInitializers: BeforeColon\nBreakInheritanceList: BeforeColon\nBreakStringLiterals: false\nColumnLimit: 0\n# CommentPragmas:\nCompactNamespaces: false\nConstructorInitializerAllOnOneLineOrOnePerLine: true\nConstructorInitializerIndentWidth: 4\nContinuationIndentWidth: 4\nCpp11BracedListStyle: true\nDeriveLineEnding: false\nDerivePointerAlignment: 
false\n# DisableFormat:\n# ExperimentalAutoDetectBinPacking:\nFixNamespaceComments: true\n# ForEachMacros:\nIncludeBlocks: Regroup\n# IncludeCategories:\n# IncludeIsMainRegex:\n# IncludeIsMainSourceRegex:\n# IndentCaseBlocks: false\nIndentCaseLabels: false\n# IndentExternBlock: NoIndent\nIndentGotoLabels: false\nIndentPPDirectives: None\nIndentWidth: 4\n# IndentWrappedFunctionNames: 4\n# InsertTrailingCommas: None\n# JavaImportGroups:\n# JavaScriptQuotes\n# JavaScriptWrapImports:\nKeepEmptyLinesAtTheStartOfBlocks: false\nLanguage: Cpp\n# MacroBlockBegin:\n# MacroBlockEnd:\nMaxEmptyLinesToKeep: 1\nNamespaceIndentation: None\n# NamespaceMacros:\n# ObjCBinPackProtocolList:\n# ObjCBlockIndentWidth:\n# ObjCBreakBeforeNestedBlockParam:\n# ObjCSpaceAfterProperty:\n# ObjCSpaceBeforeProtocolList:\n# PenaltyBreakAssignment:\n# PenaltyBreakBeforeFirstCallParameter:\n# PenaltyBreakComment:\n# PenaltyBreakFirstLessLess:\n# PenaltyBreakString:\n# PenaltyBreakTemplateDeclaration:\n# PenaltyExcessCharacter:\n# PenaltyReturnTypeOnItsOwnLine:\nPointerAlignment: Left\n# RawStringFormats:\nReflowComments: false\nSortIncludes: false\nSortUsingDeclarations: true\nSpaceAfterCStyleCast: false\nSpaceAfterLogicalNot: false\nSpaceAfterTemplateKeyword: false\nSpaceBeforeAssignmentOperators: true\nSpaceBeforeCpp11BracedList: false\nSpaceBeforeCtorInitializerColon: true\nSpaceBeforeInheritanceColon: true\nSpaceBeforeParens: ControlStatements\nSpaceBeforeRangeBasedForLoopColon: true\nSpaceBeforeSquareBrackets: false\nSpaceInEmptyBlock: false\nSpaceInEmptyParentheses: false\nSpacesBeforeTrailingComments: 1\nSpacesInAngles: false\nSpacesInCStyleCastParentheses: false\nSpacesInConditionalStatement: false\nSpacesInContainerLiterals: false\nSpacesInParentheses: false\nSpacesInSquareBrackets: false\nStandard: c++03\n#StatementMacros:\nTabWidth: 4\n# TypenameMacros:\nUseCRLF: false\nUseTab: Never\n"
  },
  {
    "path": ".gitattributes",
    "content": "*.comp linguist-language=GLSL\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug.md",
    "content": "---\nname: \"\\U0001F41B bug issue\"\nabout: submit a bug report +_+\n---\n\n## error log | 日志或报错信息 | ログ\n\n## context | 编译/运行环境 | バックグラウンド\n\n## how to reproduce | 复现步骤 | 再現方法\n1.\n2.\n3.\n\n## more | 其他 | その他\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/model-convert.md",
    "content": "---\nname: \"\\U0001F6B8 model convert issue\"\nabout: \"Life is Short, Use pnnx and convertmodel.com\"\n---\n\n## error log | 日志或报错信息 | ログ\n\n## model | 模型 | モデル\n1. original model\n\n## how to reproduce | 复现步骤 | 再現方法\n1.\n2.\n3.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/others.md",
    "content": "---\nname: \"\\U0001F4DD others\"\nabout: discussion, suggestion and question\n---\n\n## detail | 详细描述 | 詳細な説明\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/quantization.md",
    "content": "---\nname: \"\\U0001F4C8 quantization\"\nabout: best wishes for your low-bit quantization to have a low accuracy loss...\\(^▽^)/...2333...\n---\n\n## expectation | 诉求 | 期待する\n1. speed\n2. precision\n\n## model | 模型 | モデル\n1. model.param and model.bin\n\n## detail | 详细描述 | 詳細な説明\n"
  },
  {
    "path": ".github/dependabot.yml",
    "content": "version: 2\nupdates:\n  - package-ecosystem: \"github-actions\"\n    directory: \"/\"\n    schedule:\n      interval: \"daily\"\n"
  },
  {
    "path": ".github/labeler.yml",
    "content": "cmake:\n- changed-files:\n  - any-glob-to-any-file: ['cmake/**', 'toolchains/**']\n\ndoc:\n- changed-files:\n  - any-glob-to-any-file: docs/**\n\npython:\n- changed-files:\n  - any-glob-to-any-file: python/**\n\nexample:\n- changed-files:\n  - any-glob-to-any-file: examples/**\n\ntest:\n- changed-files:\n  - any-glob-to-any-file: tests/**\n\ntool:\n- changed-files:\n  - any-glob-to-any-file: tools/**\npnnx:\n- changed-files:\n  - any-glob-to-any-file: tools/pnnx/**\n\ncore:\n- changed-files:\n  - any-glob-to-any-file: src/*\nlayer:\n- changed-files:\n  - any-glob-to-any-file: src/layer/*\n\narm:\n- changed-files:\n  - any-glob-to-any-file: src/layer/arm/**\nloongarch:\n- changed-files:\n  - any-glob-to-any-file: src/layer/loongarch/**\nmips:\n- changed-files:\n  - any-glob-to-any-file: src/layer/mips/**\nriscv:\n- changed-files:\n  - any-glob-to-any-file: src/layer/riscv/**\nvulkan:\n- changed-files:\n  - any-glob-to-any-file: src/layer/vulkan/**\nx86:\n- changed-files:\n  - any-glob-to-any-file: src/layer/x86/**\n"
  },
  {
    "path": ".github/workflows/android.yml",
    "content": "name: android\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/android.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/riscv/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/android.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/riscv/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\nconcurrency:\n  group: android-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    env:\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK_LATEST_HOME/build/cmake/android.toolchain.cmake \\\n        -DANDROID_PLATFORM=android-21 \\\n        -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: armeabi-v7a\n      run: |\n        mkdir build-armeabi-v7a && cd build-armeabi-v7a\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON ..\n        cmake --build . -j $(nproc)\n    - name: arm64-v8a\n      run: |\n        mkdir build-arm64-v8a && cd build-arm64-v8a\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"arm64-v8a\" ..\n        cmake --build . -j $(nproc)\n    - name: x86\n      run: |\n        mkdir build-x86 && cd build-x86\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"x86\" ..\n        cmake --build . -j $(nproc)\n    - name: x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"x86_64\" ..\n        cmake --build . 
-j $(nproc)\n    - name: riscv64\n      run: |\n        mkdir build-riscv64 && cd build-riscv64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"riscv64\" ..\n        cmake --build . -j $(nproc)\n\n    - name: armeabi-v7a-shared\n      run: |\n        mkdir build-armeabi-v7a-shared && cd build-armeabi-v7a-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n    - name: arm64-v8a-shared\n      run: |\n        mkdir build-arm64-v8a-shared && cd build-arm64-v8a-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"arm64-v8a\" -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n    - name: x86-shared\n      run: |\n        mkdir build-x86-shared && cd build-x86-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"x86\" -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n    - name: x86_64-shared\n      run: |\n        mkdir build-x86_64-shared && cd build-x86_64-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"x86_64\" -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n    - name: riscv64-shared\n      run: |\n        mkdir build-riscv64-shared && cd build-riscv64-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"riscv64\" -DNCNN_SHARED_LIB=ON ..\n        cmake --build . 
-j $(nproc)\n\n  ndk-r16b:\n    runs-on: ubuntu-latest\n    env:\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=$GITHUB_WORKSPACE/android-ndk-r16b/build/cmake/android.toolchain.cmake \\\n        -DANDROID_PLATFORM=android-21 \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: ndk-r16b\n      env:\n        DEBIAN_FRONTEND: noninteractive\n      run: |\n        pushd /usr/lib/x86_64-linux-gnu/\n        sudo ln -s libncurses.so.6 libncurses.so.5\n        sudo ln -s libtinfo.so.6 libtinfo.so.5\n        popd\n        wget -q https://dl.google.com/android/repository/android-ndk-r16b-linux-x86_64.zip -O $GITHUB_WORKSPACE/android-ndk-r16b-linux-x86_64.zip\n        cd $GITHUB_WORKSPACE && unzip -q android-ndk-r16b-linux-x86_64.zip\n\n    - name: armeabi-v7a\n      run: |\n        mkdir build-armeabi-v7a && cd build-armeabi-v7a\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON ..\n        cmake --build . -j $(nproc)\n    - name: armeabi-v7a-no-neon\n      run: |\n        mkdir build-armeabi-v7a-no-neon && cd build-armeabi-v7a-no-neon\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=OFF ..\n        cmake --build . -j $(nproc)\n    - name: arm64-v8a\n      run: |\n        mkdir build-arm64-v8a && cd build-arm64-v8a\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"arm64-v8a\" ..\n        cmake --build . -j $(nproc)\n\n    - name: armeabi-v7a-shared\n      run: |\n        mkdir build-armeabi-v7a-shared && cd build-armeabi-v7a-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: armeabi-v7a-no-neon-shared\n      run: |\n        mkdir build-armeabi-v7a-no-neon-shared && cd build-armeabi-v7a-no-neon-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=OFF -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n    - name: arm64-v8a-shared\n      run: |\n        mkdir build-arm64-v8a-shared && cd build-arm64-v8a-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"arm64-v8a\" -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/code-format-msg.yml",
    "content": "name: code-format-msg\n\non:\n  workflow_run:\n    workflows: [code-format]\n    types: [completed]\n\nconcurrency:\n  group: code-format-msg-${{ github.head_ref || github.run_id }}\n\npermissions:\n  contents: read\n  pull-requests: write\n\njobs:\n  pr-context:\n    name: acquire-pr-context\n    runs-on: ubuntu-latest\n    outputs:\n      PR_HEADSHA: ${{ steps.set-pr-context.outputs.head-sha }}\n      PR_NUMBER:  ${{ steps.set-pr-context.outputs.number   }}\n    if: ${{ github.event.workflow_run.event == 'pull_request' }}\n    steps:\n    - name: get-pr-context\n      id: set-pr-context\n      env:\n        GH_TOKEN: ${{ github.token }}\n        PR_TARGET_REPO: ${{ github.repository }}\n        PR_BRANCH: |-\n          ${{\n            (github.event.workflow_run.head_repository.owner.login != github.event.workflow_run.repository.owner.login)\n              && format('{0}:{1}', github.event.workflow_run.head_repository.owner.login, github.event.workflow_run.head_branch)\n              || github.event.workflow_run.head_branch\n          }}\n      run: |\n        gh pr view --repo \"${PR_TARGET_REPO}\" \"${PR_BRANCH}\" \\\n          --json 'number,headRefOid' \\\n          --jq '\"number=\\(.number)\\nhead-sha=\\(.headRefOid)\"' \\\n          >> $GITHUB_OUTPUT\n\n  remove-comment-if-success:\n    if: ${{ github.event.workflow_run.conclusion == 'success' }}\n    runs-on: ubuntu-latest\n    needs: [pr-context]\n    env:\n      PR_HEADSHA: ${{ needs.pr-context.outputs.PR_HEADSHA }}\n      PR_NUMBER:  ${{ needs.pr-context.outputs.PR_NUMBER  }}\n    steps:\n    - name: Remove existing \"format check failed\" comment\n      uses: actions/github-script@v8\n      with:\n        script: |\n          const owner = context.repo.owner;\n          const repo = context.repo.repo;\n          const { data: comments } = await github.rest.issues.listComments({\n            owner,\n            repo,\n            issue_number: ${{ env.PR_NUMBER }},\n          });\n\n    
      const targetComment = comments.find(comment =>\n            comment.body.includes(\"Please enable github action in **YOUR FORKED REPO** to make code-format workflow work\")\n          );\n\n          if (targetComment) {\n            await github.rest.issues.deleteComment({\n              owner,\n              repo,\n              comment_id: targetComment.id,\n            });\n            console.log(\"Removed existing code-format failure comment.\");\n          } else {\n            console.log(\"No existing format failure comment to remove.\");\n          }\n\n  post-comment-if-failure:\n    if: ${{ github.event.workflow_run.conclusion == 'failure' }}\n    runs-on: ubuntu-latest\n    needs: [pr-context]\n    env:\n      PR_HEADSHA: ${{ needs.pr-context.outputs.PR_HEADSHA }}\n      PR_NUMBER:  ${{ needs.pr-context.outputs.PR_NUMBER  }}\n    steps:\n    - name: Post comment on failed code-format if not existing\n      uses: actions/github-script@v8\n      with:\n        script: |\n          const owner = context.repo.owner;\n          const repo = context.repo.repo;\n          const { data: comments } = await github.rest.issues.listComments({\n            owner,\n            repo,\n            issue_number: ${{ env.PR_NUMBER }},\n          });\n\n          const existingComment = comments.find(comment =>\n            comment.body.includes(\"Please enable github action in **YOUR FORKED REPO** to make code-format workflow work\")\n          );\n\n          if (existingComment) {\n            console.log(\"A code-format failure comment already exists.\");\n          } else {\n            await github.rest.issues.createComment({\n              owner,\n              repo,\n              issue_number: ${{ env.PR_NUMBER }},\n              body: \"Please enable github action in **YOUR FORKED REPO** to make code-format workflow work\",\n            });\n            console.log(\"Created code-format failure comment.\");\n          }\n"
  },
  {
    "path": ".github/workflows/code-format.yml",
    "content": "name: code-format\n\non: [push, pull_request]\n\nconcurrency:\n  group: code-format-${{ github.ref }}\n  cancel-in-progress: true\n\npermissions:\n  contents: write\n\njobs:\n  code-format:\n    runs-on: ubuntu-latest\n    container: ubuntu:20.04\n    steps:\n    - name: astyle\n      run: |\n        export DEBIAN_FRONTEND=noninteractive\n        apt-get update -y\n        apt-get install -y astyle git\n\n    - uses: actions/checkout@v6\n\n    - name: cache-clang-format\n      id: cache-clang-format\n      uses: actions/cache@v5\n      with:\n        path: clang-format-install\n        key: clang-format-install-5\n    - name: clang-format\n      if: steps.cache-clang-format.outputs.cache-hit != 'true'\n      run: |\n        export DEBIAN_FRONTEND=noninteractive\n        apt-get update -y\n        apt-get install -y build-essential wget curl cmake unzip zip python3-pip\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-10.0.1/llvm-project-10.0.1.tar.xz\n        tar -xf llvm-project-10.0.1.tar.xz\n        cd llvm-project-10.0.1\n        mkdir build\n        cd build\n        cmake -DCMAKE_INSTALL_PREFIX=install -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF -DLLVM_ENABLE_PROJECTS=\"clang\" -DLLVM_TARGETS_TO_BUILD=\"\" -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_INCLUDE_TESTS=OFF -DLLVM_INCLUDE_DOCS=OFF ../llvm/\n        make -j4 clang-format\n        mkdir $GITHUB_WORKSPACE/clang-format-install\n        cp -r bin/clang-format $GITHUB_WORKSPACE/clang-format-install\n        cd ../../\n        rm -rf llvm-project-10.0.1\n        rm llvm-project-10.0.1.tar.xz\n\n    - name: cache-clang-format-21\n      id: cache-clang-format-21\n      uses: actions/cache@v5\n      with:\n        path: clang-format-21-install\n        key: clang-format-21-install\n    - name: clang-format-21\n      if: steps.cache-clang-format-21.outputs.cache-hit != 'true'\n      run: |\n        export DEBIAN_FRONTEND=noninteractive\n        apt-get update -y\n       
 apt-get install -y build-essential wget curl cmake unzip zip python3-pip\n        pip install cmake\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-21.1.8/llvm-project-21.1.8.src.tar.xz\n        tar -xf llvm-project-21.1.8.src.tar.xz\n        cd llvm-project-21.1.8.src\n        mkdir build\n        cd build\n        cmake -DCMAKE_INSTALL_PREFIX=install -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF -DLLVM_ENABLE_PROJECTS=\"clang\" -DLLVM_TARGETS_TO_BUILD=\"\" -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_INCLUDE_TESTS=OFF -DLLVM_INCLUDE_DOCS=OFF ../llvm/\n        make -j4 clang-format\n        mkdir $GITHUB_WORKSPACE/clang-format-21-install\n        cp -r bin/clang-format $GITHUB_WORKSPACE/clang-format-21-install\n        cd ../../\n        rm -rf llvm-project-21.1.8.src\n        rm llvm-project-21.1.8.src.tar.xz\n\n    - name: code-format\n      run: |\n        mv $GITHUB_WORKSPACE/clang-format-install/clang-format /usr/local/bin/clang-format\n        rm -rf $GITHUB_WORKSPACE/clang-format-install\n        sh codeformat.sh\n\n    - name: code-format-glsl\n      run: |\n        mv $GITHUB_WORKSPACE/clang-format-21-install/clang-format /usr/local/bin/clang-format-21\n        rm -rf $GITHUB_WORKSPACE/clang-format-21-install\n        cd src/layer/vulkan/shader\n        find . -type f -name '*.comp' | xargs -i clang-format-21 -i -assume-filename=main.cpp {}\n\n    - name: configure-git-safe-directory\n      run: git config --global --add safe.directory /__w/ncnn/ncnn\n\n    - uses: stefanzweifel/git-auto-commit-action@v7\n      with:\n        commit_message: apply code-format changes\n\n    - name: restore-clang-format-cache\n      run: |\n        mkdir $GITHUB_WORKSPACE/clang-format-install\n        cp -r /usr/local/bin/clang-format $GITHUB_WORKSPACE/clang-format-install\n        mkdir $GITHUB_WORKSPACE/clang-format-21-install\n        cp -r /usr/local/bin/clang-format-21 $GITHUB_WORKSPACE/clang-format-21-install/clang-format\n"
  },
  {
    "path": ".github/workflows/codeql-analysis.yml",
    "content": "# For most projects, this workflow file will not need changing; you simply need\n# to commit it to your repository.\n#\n# You may wish to alter this file to override the set of languages analyzed,\n# or to provide custom queries or build logic.\nname: \"CodeQL\"\n\non:\n  push:\n    branches: [master]\n    paths-ignore: ['**.md']\n  pull_request:\n    # The branches below must be a subset of the branches above\n    branches: [master]\n    paths-ignore: ['**.md']\n  schedule:\n    - cron: '0 20 * * 4'\n\nconcurrency:\n  group: CodeQL-${{ github.ref }}\n  cancel-in-progress: true\n\npermissions:\n  contents: read\n\njobs:\n  analyze:\n    permissions:\n      actions: read  # for github/codeql-action/init to get workflow details\n      contents: read  # for actions/checkout to fetch code\n      security-events: write  # for github/codeql-action/autobuild to send a status report\n    name: Analyze\n    runs-on: ubuntu-latest\n\n    strategy:\n      fail-fast: false\n      matrix:\n        # Override automatic language detection by changing the below list\n        # Supported options are ['csharp', 'cpp', 'go', 'java', 'javascript', 'python']\n        language: ['cpp']\n        # Learn more...\n        # https://docs.github.com/en/github/finding-security-vulnerabilities-and-errors-in-your-code/configuring-code-scanning#overriding-automatic-language-detection\n\n    steps:\n    - name: Checkout repository\n      uses: actions/checkout@v6\n      with:\n        # We must fetch at least the immediate parents so that if this is\n        # a pull request then we can checkout the head.\n        fetch-depth: 2\n\n    # If this run was triggered by a pull request event, then checkout\n    # the head of the pull request instead of the merge commit.\n    - run: git checkout HEAD^2\n      if: ${{ github.event_name == 'pull_request' }}\n\n    # Initializes the CodeQL tools for scanning.\n    - name: Initialize CodeQL\n      uses: github/codeql-action/init@v4\n      
with:\n        languages: ${{ matrix.language }}\n        # If you wish to specify custom queries, you can do so here or in a config file.\n        # By default, queries listed here will override any specified in a config file. \n        # Prefix the list here with \"+\" to use these queries and those in the config file.\n        # queries: ./path/to/local/query, your-org/your-repo/queries@main\n\n    # Autobuild attempts to build any compiled languages  (C/C++, C#, or Java).\n    # If this step fails, then you should remove it and run the build manually (see below)\n    - name: Autobuild\n      uses: github/codeql-action/autobuild@v4\n\n    # ℹ️ Command-line programs to run using the OS shell.\n    # 📚 https://git.io/JvXDl\n\n    # ✏️ If the Autobuild fails above, remove it and uncomment the following three lines\n    #    and modify them (or add more) to build your code if your project\n    #    uses a compiled language\n\n    #- run: |\n    #   make bootstrap\n    #   make release\n\n    - name: Perform CodeQL Analysis\n      uses: github/codeql-action/analyze@v4\n"
  },
  {
    "path": ".github/workflows/compare-binary-size-pr-comment.yml",
    "content": "name: compare-binary-size-pr-comment\non:\n  workflow_run:\n    workflows: [\"compare-binary-size\"]\n    types:\n      - completed\n\npermissions:\n  actions: read\n  contents: read\n  pull-requests: write\n\njobs:\n  pr-comment:\n    runs-on: ubuntu-latest\n    steps:\n    - name: Setup tools\n      run: |\n        sudo apt-get update\n        sudo apt-get install -y jq unzip\n\n    - name: Ensure workflow_run is for a PR\n      id: validate\n      run: |\n        # Use the event payload file provided by GitHub Actions directly\n        echo \"Using event payload from: $GITHUB_EVENT_PATH\"\n        echo \"Event file size: $(wc -c < \"$GITHUB_EVENT_PATH\") bytes\"\n\n        # Safely compute number of associated PRs (use // 0 to default if missing)\n        PR_COUNT=$(jq -r '.workflow_run.pull_requests | length // 0' \"$GITHUB_EVENT_PATH\")\n        echo \"Associated pull_request count: $PR_COUNT\"\n\n        if [ \"$PR_COUNT\" -eq 0 ]; then\n          echo \"No pull_request associated with this workflow_run; nothing to do.\"\n          echo \"skip=true\" >> $GITHUB_OUTPUT\n          exit 0\n        fi\n\n        echo \"skip=false\" >> $GITHUB_OUTPUT\n\n    - name: Download artifact zip for this run\n      if: steps.validate.outputs.skip != 'true'\n      env:\n        RUN_ID: ${{ github.event.workflow_run.id }}\n        OWNER: ${{ github.repository_owner }}\n        REPO: ${{ github.repository }}\n        TOKEN: ${{ secrets.COMMENTER_PAT }}\n        ART_NAME: \"compare-binary-size.md\"\n      run: |\n        echo \"Listing artifacts for run $RUN_ID\"\n        API=\"https://api.github.com/repos/$OWNER/${REPO#*/}/actions/runs/$RUN_ID/artifacts\"\n\n        # Save artifact list to a file (avoid pipe/echo issues)\n        curl -s -H \"Authorization: token $TOKEN\" \"$API\" -o /tmp/art_list.json\n        echo \"Art list size: $(wc -c < /tmp/art_list.json) bytes\"\n        if ! jq . 
/tmp/art_list.json; then\n          echo \"Failed to parse /tmp/art_list.json with jq; aborting for safety.\"\n          exit 1\n        fi\n\n        # find artifact archive_download_url by name (first match)\n        ARCHIVE_URL=$(jq -r --arg name \"$ART_NAME\" '.artifacts[] | select(.name==$name) | .archive_download_url' /tmp/art_list.json | head -n1)\n        if [ -z \"$ARCHIVE_URL\" ] || [ \"$ARCHIVE_URL\" = \"null\" ]; then\n          echo \"Artifact named '$ART_NAME' not found for run $RUN_ID. Exiting.\"\n          exit 0\n        fi\n        echo \"Downloading artifact from: $ARCHIVE_URL\"\n\n        # download and unzip to temp dir\n        mkdir -p /tmp/artifact_contents\n        curl -L -H \"Authorization: token $TOKEN\" -o /tmp/artifact.zip \"$ARCHIVE_URL\"\n        if ! unzip -q /tmp/artifact.zip -d /tmp/artifact_contents; then\n          echo \"Failed to unzip /tmp/artifact.zip\"; exit 1\n        fi\n        ls -la /tmp/artifact_contents\n\n    - name: Read compare-binary-size.md content\n      if: steps.validate.outputs.skip != 'true'\n      id: read\n      run: |\n        # find file inside artifact_contents\n        FILE=$(find /tmp/artifact_contents -type f -name \"compare-binary-size.md\" | head -n1 || true)\n        if [ -z \"$FILE\" ]; then\n          # If artifact name matched but internal filename differs, try any .md\n          FILE=$(find /tmp/artifact_contents -type f -name \"*.md\" | head -n1 || true)\n        fi\n\n        if [ -z \"$FILE\" ]; then\n          echo \"compare_content<<EOF\" >> $GITHUB_OUTPUT\n          echo \"No compare-binary-size.md found in artifact.\" >> $GITHUB_OUTPUT\n          echo \"EOF\" >> $GITHUB_OUTPUT\n        else\n          # Truncate to avoid overly long comments (adjust lines as needed)\n          head -n 1000 \"$FILE\" > /tmp/compare-truncated.md || true\n          echo \"compare_content<<EOF\" >> $GITHUB_OUTPUT\n          cat /tmp/compare-truncated.md >> $GITHUB_OUTPUT\n          echo \"EOF\" >> 
$GITHUB_OUTPUT\n        fi\n\n    - name: Post or update PR comment via actions/github-script\n      if: steps.validate.outputs.skip != 'true'\n      uses: actions/github-script@v8\n      with:\n        github-token: ${{ secrets.COMMENTER_PAT }}\n        script: |\n          const pr = context.payload.workflow_run.pull_requests[0];\n          if (!pr) {\n            core.info(\"No pull request found in workflow_run payload; skipping.\");\n            return;\n          }\n\n          const owner = context.repo.owner;\n          const repo = context.repo.repo;\n          const issue_number = pr.number;\n          const marker = '<!-- compare-binary-size-bot -->';\n\n          // Read the compare content from env (set in previous step outputs)\n          const compare = process.env.COMPARE_CONTENT || \"\";\n\n          const body = `${marker}\\n**Binary size comparison** (from artifact)\\n\\n\\`\\`\\`markdown\\n${compare}\\n\\`\\`\\``;\n\n          // List existing comments and find our bot comment (by marker)\n          const { data: comments } = await github.rest.issues.listComments({\n            owner,\n            repo,\n            issue_number,\n            per_page: 100\n          });\n\n          const existing = comments.find(c => c.body && c.body.includes(marker));\n\n          if (existing) {\n            await github.rest.issues.updateComment({\n              owner,\n              repo,\n              comment_id: existing.id,\n              body\n            });\n            core.info(`Updated comment id=${existing.id}`);\n          } else {\n            await github.rest.issues.createComment({\n              owner,\n              repo,\n              issue_number,\n              body\n            });\n            core.info(\"Created new comment\");\n          }\n      env:\n        # pass the content from previous step into the github-script environment\n        COMPARE_CONTENT: ${{ steps.read.outputs.compare_content }}\n"
  },
  {
    "path": ".github/workflows/compare-binary-size.yml",
    "content": "name: compare-binary-size\non:\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/compare-binary-size.yml'\n    - 'toolchains/**'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/**'\n    - 'glslang'\n\nconcurrency:\n  group: compare-binary-size-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n  actions: read\n\njobs:\n  compare-size:\n    runs-on: ubuntu-latest\n    steps:\n    - name: checkout-pr-branch\n      uses: actions/checkout@v6\n      with:\n        ref: refs/pull/${{ github.event.pull_request.number }}/merge\n        submodules: true\n        path: pr\n\n    - name: checkout-base-branch\n      uses: actions/checkout@v6\n      with:\n        ref: ${{ github.event.pull_request.base.ref }}\n        repository: ${{ github.event.pull_request.base.repo.full_name }}\n        submodules: true\n        path: base\n\n    - name: install-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-arm-linux-gnueabihf g++-aarch64-linux-gnu\n\n    - name: compare-sizes\n      env:\n        COMMON_CMAKE_ARGS: -DNCNN_SHARED_LIB=ON -DNCNN_VULKAN=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF\n      run: |\n        # define target architectures\n        archs=(\"x86_64\" \"armhf\" \"aarch64\")\n\n        # generate table\n        echo \"The binary size change of libncnn.so (bytes)\" >> compare-binary-size.md\n        echo \"| architecture | base size | pr size | difference |\" >> compare-binary-size.md\n        echo \"|--------------|-----------|---------|------------|\" >> compare-binary-size.md\n\n        for arch in \"${archs[@]}\"; do\n\n          mkdir -p pr/build_$arch\n          pushd pr/build_$arch\n          if [ \"$arch\" = \"armhf\" ]; then\n            cmake ${{env.COMMON_CMAKE_ARGS}} -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf.toolchain.cmake ..\n          elif [ \"$arch\" = \"aarch64\" ]; then\n            
cmake ${{env.COMMON_CMAKE_ARGS}} -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake ..\n          else\n            cmake ${{env.COMMON_CMAKE_ARGS}} ..\n          fi\n          cmake --build . -j $(nproc)\n          PR_SIZE=$(stat -c%s $(readlink -f src/libncnn.so))\n          popd\n\n          mkdir -p base/build_$arch\n          pushd base/build_$arch\n          if [ \"$arch\" = \"armhf\" ]; then\n            cmake ${{env.COMMON_CMAKE_ARGS}} -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf.toolchain.cmake ..\n          elif [ \"$arch\" = \"aarch64\" ]; then\n            cmake ${{env.COMMON_CMAKE_ARGS}} -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake ..\n          else\n            cmake ${{env.COMMON_CMAKE_ARGS}} ..\n          fi\n          cmake --build . -j $(nproc)\n          BASE_SIZE=$(stat -c%s $(readlink -f src/libncnn.so))\n          popd\n\n          DIFF=$(($PR_SIZE - $BASE_SIZE))\n          if [ $DIFF -gt 0 ]; then\n            DIFF_STR=\"+$DIFF :warning:\"\n          else\n            DIFF_STR=\"$DIFF :kissing_heart:\"\n          fi\n\n          echo \"| $arch | $BASE_SIZE | $PR_SIZE | $DIFF_STR |\" >> compare-binary-size.md\n        done\n\n        cat compare-binary-size.md\n\n    - name: upload-compare-binary-size-md\n      uses: actions/upload-artifact@v6\n      with:\n        name: compare-binary-size.md\n        path: compare-binary-size.md\n"
  },
  {
    "path": ".github/workflows/elf-riscv32.yml",
    "content": "name: elf-riscv32\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/elf-riscv32.yml'\n    - 'toolchains/riscv32-unknown-elf.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/elf-riscv32.yml'\n    - 'toolchains/riscv32-unknown-elf.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\nconcurrency:\n  group: elf-riscv32-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  rv32gc:\n    runs-on: [self-hosted, linux, centos]\n    steps:\n    - uses: actions/checkout@v6\n\n    #- name: riscv-gnu-toolchain\n      #run: |\n        #wget -c https://github.com/riscv-collab/riscv-gnu-toolchain/releases/download/2025.01.20/riscv32-elf-ubuntu-22.04-gcc-nightly-2025.01.20-nightly.tar.xz\n        #tar -xf riscv32-elf-ubuntu-22.04-gcc-nightly-2025.01.20-nightly.tar.xz\n        #mv riscv riscv32-elf\n\n    #- name: checkout-riscv-pk\n      #uses: actions/checkout@v6\n      #with:\n        #repository: riscv/riscv-pk\n        #path: riscv-pk\n        #ref: d8659a4e8e888bdc9caf840ad17bfe83239b1d64\n    #- name: riscv-pk\n      #run: |\n        #cd riscv-pk\n        #mkdir build && cd build\n        #export PATH=$GITHUB_WORKSPACE/riscv32-elf/bin:$PATH\n        #export CFLAGS=\"-O3\"\n        #export CXXFLAGS=\"-O3\"\n        #../configure --prefix=$GITHUB_WORKSPACE/riscv32-elf --with-arch=rv32gc_zicsr_zifencei --host=riscv32-unknown-elf --with-abi=ilp32d\n        #make -j4\n        #make install\n\n    #- name: checkout-riscv-isa-sim\n      #uses: actions/checkout@v6\n      #with:\n        #repository: riscv-software-src/riscv-isa-sim\n        #path: riscv-isa-sim\n        #ref: 5ef9a61f5fecdb9bf77da155172c8018ce820308\n    #- name: riscv-isa-sim\n      
#run: |\n        #cd riscv-isa-sim\n        #mkdir build && cd build\n        #export PATH=$GITHUB_WORKSPACE/riscv32-elf/bin:$PATH\n        #export CFLAGS=\"-O3\"\n        #export CXXFLAGS=\"-O3\"\n        #../configure --prefix=$GITHUB_WORKSPACE/riscv32-elf\n        #make -j4\n        #make install\n\n    #- name: riscv-strip-install\n      #run: find $GITHUB_WORKSPACE/riscv32-elf -type f | xargs -i strip -g {} || true\n\n    - name: build\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/riscv32-elf\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv32-unknown-elf.toolchain.cmake -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_RVV=OFF -DNCNN_XTHEADVECTOR=OFF -DNCNN_ZFH=OFF -DNCNN_ZVFH=OFF ..\n        cmake --build . -j 4\n\n    - name: test\n      run: |\n        export PATH=/data/action/osd/riscv32-elf/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=spike TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"--isa=rv32gc;/data/action/osd/riscv32-elf/riscv32-unknown-elf/bin/pk\" ctest --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/elf-riscv64.yml",
    "content": "name: elf-riscv64\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/elf-riscv64.yml'\n    - 'toolchains/riscv64-unknown-elf.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/elf-riscv64.yml'\n    - 'toolchains/riscv64-unknown-elf.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\nconcurrency:\n  group: elf-riscv64-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  rv64gc:\n    runs-on: [self-hosted, linux, centos]\n    steps:\n    - uses: actions/checkout@v6\n\n    #- name: riscv-gnu-toolchain\n      #run: |\n        #wget -c https://github.com/riscv-collab/riscv-gnu-toolchain/releases/download/2025.01.20/riscv64-elf-ubuntu-22.04-gcc-nightly-2025.01.20-nightly.tar.xz\n        #tar -xf riscv64-elf-ubuntu-22.04-gcc-nightly-2025.01.20-nightly.tar.xz\n        #mv riscv riscv64-elf\n\n    #- name: checkout-riscv-pk\n      #uses: actions/checkout@v6\n      #with:\n        #repository: riscv/riscv-pk\n        #path: riscv-pk\n        #ref: d8659a4e8e888bdc9caf840ad17bfe83239b1d64\n    #- name: riscv-pk\n      #run: |\n        #cd riscv-pk\n        #mkdir build && cd build\n        #export PATH=$GITHUB_WORKSPACE/riscv64-elf/bin:$PATH\n        #export CFLAGS=\"-O3\"\n        #export CXXFLAGS=\"-O3\"\n        #../configure --prefix=$GITHUB_WORKSPACE/riscv64-elf --with-arch=rv64gc_zicsr_zifencei --host=riscv64-unknown-elf --with-abi=lp64d\n        #make -j4\n        #make install\n\n    #- name: checkout-riscv-isa-sim\n      #uses: actions/checkout@v6\n      #with:\n        #repository: riscv-software-src/riscv-isa-sim\n        #path: riscv-isa-sim\n        #ref: 5ef9a61f5fecdb9bf77da155172c8018ce820308\n    #- name: riscv-isa-sim\n      
#run: |\n        #cd riscv-isa-sim\n        #mkdir build && cd build\n        #export PATH=$GITHUB_WORKSPACE/riscv64-elf/bin:$PATH\n        #export CFLAGS=\"-O3\"\n        #export CXXFLAGS=\"-O3\"\n        #../configure --prefix=$GITHUB_WORKSPACE/riscv64-elf\n        #make -j4\n        #make install\n\n    #- name: riscv-strip-install\n      #run: find $GITHUB_WORKSPACE/riscv64-elf -type f | xargs -i strip -g {} || true\n\n    - name: build\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/riscv64-elf\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv64-unknown-elf.toolchain.cmake -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_XTHEADVECTOR=OFF ..\n        cmake --build . -j 4\n\n    - name: test\n      run: |\n        export PATH=/data/action/osd/riscv64-elf/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=spike TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"--isa=rv64gc;/data/action/osd/riscv64-elf/riscv64-unknown-elf/bin/pk\" ctest --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/esp32.yml",
    "content": "name: ESP32\non:\n  push:\n    branches: [master]\n    paths:\n      - '.github/workflows/esp32.yml'\n      - 'CMakeLists.txt'\n      - 'cmake/**'\n      - 'src/*'\n      - 'src/layer/*'\n  pull_request:\n    branches: [master]\n    paths:\n      - '.github/workflows/esp32.yml'\n      - 'CMakeLists.txt'\n      - 'cmake/**'\n      - 'src/*'\n      - 'src/layer/*'\n\nconcurrency:\n  group: esp32-${{ github.ref }}\n  cancel-in-progress: true\n\npermissions:\n  contents: read\n\njobs:\n  build:\n    name: ESP32\n    runs-on: ubuntu-latest\n\n    steps:\n      - uses: actions/checkout@v6\n        with:\n          submodules: true\n\n      - name: Setup Python\n        uses: actions/setup-python@v6\n        with:\n          python-version: '3.8'\n\n      - name: Install dependencies\n        run: |\n          sudo apt-get update\n          sudo apt-get install -y cmake ninja-build ccache\n\n      - name: Checkout ESP-IDF\n        uses: actions/checkout@v6\n        with:\n          repository: espressif/esp-idf\n          path: esp-idf-install\n          ref: release/v5.3\n\n      - name: Install ESP-IDF\n        run: |\n          cd esp-idf-install\n          git submodule update --init --recursive\n          ./install.sh\n\n      - name: Set environment and build NCNN for ESP32\n        run: |\n          source esp-idf-install/export.sh\n          echo \"IDF_PATH=$IDF_PATH\" >> $GITHUB_ENV\n          echo \"${IDF_PATH}/tools\" >> $GITHUB_PATH\n          echo \"${IDF_PATH}/components\" >> $GITHUB_PATH\n          mkdir -p build-esp32 && cd build-esp32\n          cmake -DCMAKE_TOOLCHAIN_FILE=\"../toolchains/esp32.toolchain.cmake\" -DCMAKE_BUILD_TYPE=Release -DNCNN_BUILD_EXAMPLES=OFF ..\n          make -j 4\n          make install\n"
  },
  {
    "path": ".github/workflows/harmonyos.yml",
    "content": "name: harmonyos\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/harmonyos.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/harmonyos.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\nconcurrency:\n  group: harmonyos-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: [self-hosted, linux, centos]\n\n    env:\n      OHOS_NDK_HOME: /data/action/osd/ohos-sdk/linux/native\n      OHOS_NDK_CMAKE: /data/action/osd/ohos-sdk/linux/native/build-tools/cmake/bin/cmake\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=/data/action/osd/ohos-sdk/linux/native/build/cmake/ohos.toolchain.cmake \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DNCNN_SIMPLEOMP=ON \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    # - name: setup-sdk\n    #   run: |\n    #     cd /data/action/osd\n    #     wget -q https://repo.huaweicloud.com/harmonyos/os/4.1.1-Release/ohos-sdk-windows_linux-public.tar.gz\n    #     tar -xf ohos-sdk-windows_linux-public.tar.gz\n    #     cd ohos-sdk/linux\n    #     unzip -q native-linux-x64-4.1.7.8-Release.zip\n\n    - name: armeabi-v7a\n      run: |\n        mkdir build-armeabi-v7a && cd build-armeabi-v7a\n        ${{ env.OHOS_NDK_CMAKE }} ${{ env.NCNN_CMAKE_OPTIONS }} -DOHOS_ARCH=\"armeabi-v7a\" ..\n        ${{ env.OHOS_NDK_CMAKE }} --build . 
-j 4\n    - name: arm64-v8a\n      run: |\n        mkdir build-arm64-v8a && cd build-arm64-v8a\n        ${{ env.OHOS_NDK_CMAKE }} ${{ env.NCNN_CMAKE_OPTIONS }} -DOHOS_ARCH=\"arm64-v8a\" ..\n        ${{ env.OHOS_NDK_CMAKE }} --build . -j 4\n    - name: x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        ${{ env.OHOS_NDK_CMAKE }} ${{ env.NCNN_CMAKE_OPTIONS }} -DOHOS_ARCH=\"x86_64\" ..\n        ${{ env.OHOS_NDK_CMAKE }} --build . -j 4\n\n    - name: armeabi-v7a-shared\n      run: |\n        mkdir build-armeabi-v7a-shared && cd build-armeabi-v7a-shared\n        ${{ env.OHOS_NDK_CMAKE }} ${{ env.NCNN_CMAKE_OPTIONS }} -DOHOS_ARCH=\"armeabi-v7a\" -DNCNN_SHARED_LIB=ON ..\n        ${{ env.OHOS_NDK_CMAKE }} --build . -j 4\n    - name: arm64-v8a-shared\n      run: |\n        mkdir build-arm64-v8a-shared && cd build-arm64-v8a-shared\n        ${{ env.OHOS_NDK_CMAKE }} ${{ env.NCNN_CMAKE_OPTIONS }} -DOHOS_ARCH=\"arm64-v8a\" -DNCNN_SHARED_LIB=ON ..\n        ${{ env.OHOS_NDK_CMAKE }} --build . -j 4\n    - name: x86_64-shared\n      run: |\n        mkdir build-x86_64-shared && cd build-x86_64-shared\n        ${{ env.OHOS_NDK_CMAKE }} ${{ env.NCNN_CMAKE_OPTIONS }} -DOHOS_ARCH=\"x86_64\" -DNCNN_SHARED_LIB=ON ..\n        ${{ env.OHOS_NDK_CMAKE }} --build . -j 4\n"
  },
  {
    "path": ".github/workflows/ios.yml",
    "content": "name: ios\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/ios.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/ios.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\nconcurrency:\n  group: ios-${{ github.ref }}\n  cancel-in-progress: true\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  IOS_DEPLOYMENT_TARGET: '13.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$IOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$IOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" 
-DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-ios-install-20251004\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: openmp-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=OS64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n    - name: openmp-simulator-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR64 -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-simulator-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATORARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/ios\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/ios-simulator\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/include $GITHUB_WORKSPACE/openmp-install/ios\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/ios/lib\n        cp openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a $GITHUB_WORKSPACE/openmp-install/ios/lib/libomp.a\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/include $GITHUB_WORKSPACE/openmp-install/ios-simulator\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/ios-simulator/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/ios-simulator/lib/libomp.a\n\n    - name: install-openmp\n      run: |\n        sudo cp 
$GITHUB_WORKSPACE/openmp-install/ios/include/* $DEVELOPER_DIR/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/ios/lib/libomp.a $DEVELOPER_DIR/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/lib\n\n        sudo cp $GITHUB_WORKSPACE/openmp-install/ios-simulator/include/* $DEVELOPER_DIR/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/ios-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/lib\n\n    - name: arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=OS64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n    - name: simulator-x86_64\n      run: |\n        mkdir build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR64 -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n    - name: simulator-arm64\n      run: |\n        mkdir build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATORARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n"
  },
  {
    "path": ".github/workflows/labeler.yml",
    "content": "name: labeler\non: [pull_request_target]\n\npermissions:\n  contents: read\n  pull-requests: write\n\njobs:\n  label:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/labeler@v6\n"
  },
  {
    "path": ".github/workflows/linux-aarch64.yml",
    "content": "name: linux-aarch64\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-aarch64.yml'\n    - 'toolchains/aarch64-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-aarch64.yml'\n    - 'toolchains/aarch64-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-aarch64-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  aarch64-native:\n    runs-on: ubuntu-24.04-arm\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test\n      run: cd build && ctest --output-on-failure -j $(nproc)\n\n    - name: build-noint8\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_INT8=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-noint8\n      run: cd build-noint8 && ctest --output-on-failure -j $(nproc)\n\n    - name: build-simplestl-simplemath\n      run: |\n        mkdir build-simplestl-simplemath && cd build-simplestl-simplemath \n        cmake -DNCNN_STDIO=ON -DNCNN_STRING=ON -DNCNN_SIMPLESTL=ON -DNCNN_SIMPLEMATH=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_BENCHMARK=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . 
-j $(nproc)\n    - name: test-simplestl-simplemath\n      run: cd build-simplestl-simplemath && ctest --output-on-failure -j $(nproc)\n\n  asan:\n    runs-on: ubuntu-24.04-arm\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=relwithdebinfo -DNCNN_ASAN=ON -DNCNN_BUILD_TESTS=ON -DNCNN_SHARED_LIB=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test\n      run: |\n        cd build\n        ctest --output-on-failure -j $(nproc)\n\n  aarch64:\n    runs-on: ubuntu-24.04\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: aarch64-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-aarch64-linux-gnu qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n\n    - name: test-a53\n      run: cd build && TESTS_EXECUTABLE_LOADER=qemu-aarch64-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/aarch64-linux-gnu;-cpu;cortex-a53\" ctest --output-on-failure -j $(nproc)\n\n    - name: test-a55\n      run: cd build && TESTS_EXECUTABLE_LOADER=qemu-aarch64-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/aarch64-linux-gnu;-cpu;cortex-a55\" ctest --output-on-failure -j $(nproc)\n\n    - name: test-a72\n      run: cd build && TESTS_EXECUTABLE_LOADER=qemu-aarch64-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/aarch64-linux-gnu;-cpu;cortex-a72\" ctest --output-on-failure -j $(nproc)\n\n    - name: test-a76\n      run: cd build && TESTS_EXECUTABLE_LOADER=qemu-aarch64-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/aarch64-linux-gnu;-cpu;cortex-a76\" ctest --output-on-failure -j $(nproc)\n\n    - name: test-a710\n      run: cd build && TESTS_EXECUTABLE_LOADER=qemu-aarch64-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/aarch64-linux-gnu;-cpu;cortex-a710\" ctest --output-on-failure -j $(nproc)\n\n    - name: test-max\n      run: cd build && TESTS_EXECUTABLE_LOADER=qemu-aarch64-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/aarch64-linux-gnu;-cpu;max\" ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-arm.yml",
    "content": "name: linux-arm\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-arm.yml'\n    - 'toolchains/arm-linux-gnueabi.toolchain.cmake'\n    - 'toolchains/arm-linux-gnueabihf.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-arm.yml'\n    - 'toolchains/arm-linux-gnueabi.toolchain.cmake'\n    - 'toolchains/arm-linux-gnueabihf.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-arm-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  arm:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: arm-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-arm-linux-gnueabi qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabi.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-arm-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/arm-linux-gnueabi\" ctest --output-on-failure -j $(nproc)\n\n    - name: build-noint8\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabi.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_INT8=OFF ..\n        cmake --build . 
-j $(nproc)\n    - name: test-noint8\n      run: |\n        cd build-noint8\n        TESTS_EXECUTABLE_LOADER=qemu-arm-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/arm-linux-gnueabi\" ctest --output-on-failure -j $(nproc)\n\n  armhf:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: arm-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-arm-linux-gnueabihf qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-arm-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/arm-linux-gnueabihf\" ctest --output-on-failure -j $(nproc)\n\n    - name: build-noint8\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_INT8=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-noint8\n      run: |\n        cd build-noint8\n        TESTS_EXECUTABLE_LOADER=qemu-arm-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/arm-linux-gnueabihf\" ctest --output-on-failure -j $(nproc)\n\n  armhf-vfpv3-d16:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: arm-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-arm-linux-gnueabihf qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf-vfpv3-d16.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-arm-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/arm-linux-gnueabihf\" ctest --output-on-failure -j $(nproc)\n\n    - name: build-noint8\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf-vfpv3-d16.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_INT8=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-noint8\n      run: |\n        cd build-noint8\n        TESTS_EXECUTABLE_LOADER=qemu-arm-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/arm-linux-gnueabihf\" ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-loongarch64.yml",
    "content": "name: linux-loongarch64\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-loongarch64.yml'\n    - 'toolchains/loongarch64-linux-gnu.toolchain.cmake'\n    - 'toolchains/loongarch64-unknown-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/loongarch/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-loongarch64.yml'\n    - 'toolchains/loongarch64-linux-gnu.toolchain.cmake'\n    - 'toolchains/loongarch64-unknown-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/loongarch/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-loongarch64-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  gcc-loongarch64:\n    runs-on: [self-hosted, linux, centos]\n\n    steps:\n    - uses: actions/checkout@v6\n\n    # - name: qemu\n    #   run: |\n    #     sudo apt-get update\n    #     sudo apt-get install -y qemu-user-static\n\n    # - name: loongarch64-toolchain\n    #   run: |\n    #     wget https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/download/8.0/loongarch64-clfs-8.0-cross-tools-gcc-full.tar.xz\n    #     tar -xf loongarch64-clfs-8.0-cross-tools-gcc-full.tar.xz\n\n    - name: build\n      run: |\n        export LOONGARCH64_ROOT_PATH=/data/action/osd/cross-tools\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/loongarch64-unknown-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 4\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-loongarch64-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/data/action/osd/cross-tools/target\" ctest --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/linux-mips.yml",
    "content": "name: linux-mips\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-mips.yml'\n    - 'toolchains/mipsel-linux-gnu.toolchain.cmake'\n    - 'toolchains/mipsisa32r6el-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/mips/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-mips.yml'\n    - 'toolchains/mipsel-linux-gnu.toolchain.cmake'\n    - 'toolchains/mipsisa32r6el-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/mips/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-mips-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  mipsel:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: mipsel-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-mipsel-linux-gnu qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/mipsel-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-mipsel-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/mipsel-linux-gnu\" ctest --output-on-failure -j $(nproc)\n\n  mipsisa32r6el:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: mipsisa32r6el-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-mipsisa32r6el-linux-gnu qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/mipsisa32r6el-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-mipsel-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/mipsisa32r6el-linux-gnu\" ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-mips64.yml",
    "content": "name: linux-mips64\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-mips64.yml'\n    - 'toolchains/mips64el-linux-gnuabi64.toolchain.cmake'\n    - 'toolchains/mipsisa64r6el-linux-gnuabi64.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/mips/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-mips64.yml'\n    - 'toolchains/mips64el-linux-gnuabi64.toolchain.cmake'\n    - 'toolchains/mipsisa64r6el-linux-gnuabi64.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/mips/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-mips64-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  mips64el:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: mips64el-gnuabi64-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-mips64el-linux-gnuabi64 qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/mips64el-linux-gnuabi64.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-mips64el-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/mips64el-linux-gnuabi64\" ctest --output-on-failure -j $(nproc)\n\n  mipsisa64r6el:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: mipsisa64r6el-gnuabi64-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-mipsisa64r6el-linux-gnuabi64 qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/mipsisa64r6el-linux-gnuabi64.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-mips64el-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/mipsisa64r6el-linux-gnuabi64\" ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-ppc64.yml",
    "content": "name: linux-ppc64\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-ppc64.yml'\n    - 'toolchains/powerpc64le-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/*'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-ppc64.yml'\n    - 'toolchains/powerpc64le-linux-gnu.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/*'\n    - 'tests/**'\nconcurrency:\n  group: linux-ppc64-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  ppc:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: powerpc-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-powerpc-linux-gnu qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/powerpc-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-ppc-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/powerpc-linux-gnu\" ctest --output-on-failure -j $(nproc)\n\n  ppc64le:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: powerpc64le-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-powerpc64le-linux-gnu qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/powerpc64le-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-ppc64le-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/powerpc64le-linux-gnu\" ctest --output-on-failure -j $(nproc)\n\n  power8le-vsx:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: powerpc64le-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-powerpc64le-linux-gnu qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/power8le-linux-gnu-vsx.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-ppc64le-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/powerpc64le-linux-gnu\" ctest --output-on-failure -j $(nproc)\n\n  power9le-vsx:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: powerpc64le-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-powerpc64le-linux-gnu qemu-user-static\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/power9le-linux-gnu-vsx.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-ppc64le-static TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/powerpc64le-linux-gnu;-cpu;power9_v2.0\" ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-riscv32.yml",
    "content": "name: linux-riscv32\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-riscv32.yml'\n    - 'toolchains/c907-rv32-v310.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-riscv32.yml'\n    - 'toolchains/c907-rv32-v310.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-riscv32-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  xuantie:\n    name: xuantie-${{ matrix.cpu }}\n    runs-on: [self-hosted, linux, ubuntu]\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n          - { cpu: c907-rv32, QEMU_CPU: c907fdv-rv32,   OPENMP: ON,  RVV: ON,  XTHEADVECTOR: OFF, ZFH: ON, ZVFH: ON  }\n\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: build\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/Xuantie-900-gcc-linux-6.6.36-glibc-x86_64-V3.3.0\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/${{ matrix.cpu }}-v310.toolchain.cmake -DCMAKE_BUILD_TYPE=release \\\n            -DNCNN_OPENMP=${{ matrix.OPENMP }} -DNCNN_THREADS=${{ matrix.OPENMP }} \\\n            -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_RVV=${{ matrix.RVV }} \\\n            -DNCNN_XTHEADVECTOR=${{ matrix.XTHEADVECTOR }} \\\n            -DNCNN_ZFH=${{ matrix.ZFH }} \\\n            -DNCNN_ZVFH=${{ matrix.ZVFH }} \\\n            -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j 8\n\n    - name: test\n      run: |\n        export PATH=/data/action/osd/Xuantie-qemu-x86_64-Ubuntu-20.04-V5.2.8-B20250721-0303/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv32 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;${{ matrix.QEMU_CPU }}\" ctest --output-on-failure -j 8\n"
  },
  {
    "path": ".github/workflows/linux-riscv64.yml",
    "content": "name: linux-riscv64\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-riscv64.yml'\n    - 'toolchains/riscv64-linux-gnu.toolchain.cmake'\n    - 'toolchains/riscv64-unknown-linux-gnu.toolchain.cmake'\n    - 'toolchains/riscv64-unknown-linux-gnu.llvm-toolchain.cmake'\n    - 'toolchains/c906-v310.toolchain.cmake'\n    - 'toolchains/c908-v310.toolchain.cmake'\n    - 'toolchains/c910-v310.toolchain.cmake'\n    - 'toolchains/k1.toolchain.cmake'\n    - 'toolchains/k1.llvm.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\n    - 'examples/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-riscv64.yml'\n    - 'toolchains/riscv64-linux-gnu.toolchain.cmake'\n    - 'toolchains/riscv64-unknown-linux-gnu.toolchain.cmake'\n    - 'toolchains/riscv64-unknown-linux-gnu.llvm-toolchain.cmake'\n    - 'toolchains/c906-v310.toolchain.cmake'\n    - 'toolchains/c908-v310.toolchain.cmake'\n    - 'toolchains/c910-v310.toolchain.cmake'\n    - 'toolchains/k1.toolchain.cmake'\n    - 'toolchains/k1.llvm.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/riscv/**'\n    - 'tests/**'\n    - 'examples/**'\nconcurrency:\n  group: linux-riscv64-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  gcc-riscv64:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: cache-qemu\n      id: cache-qemu\n      uses: actions/cache@v5\n      with:\n        path: qemu-install\n        key: qemu-riscv64-install-20220502-4\n    - name: install-qemu-build-deps\n      if: steps.cache-qemu.outputs.cache-hit != 'true'\n      run: |\n        sudo apt-get update\n        sudo apt-get install autoconf automake autotools-dev ninja-build build-essential pkg-config libglib2.0-dev libpixman-1-dev zlib1g-dev python3\n    - 
name: checkout-qemu\n      if: steps.cache-qemu.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n      with:\n        repository: qemu/qemu\n        path: qemu\n        ref: f5643914a9e8f79c606a76e6a9d7ea82a3fc3e65\n    - name: qemu\n      if: steps.cache-qemu.outputs.cache-hit != 'true'\n      run: |\n        cd qemu\n        wget https://raw.githubusercontent.com/nihui/ncnn-assets/master/qemu-patches/0007-linux-user-Expose-risc-v-V-isa-bit-in-get_elf_hwcap.patch\n        patch -p1 -i 0007-linux-user-Expose-risc-v-V-isa-bit-in-get_elf_hwcap.patch\n        ./configure --prefix=$GITHUB_WORKSPACE/qemu-install --target-list=riscv64-linux-user --disable-system\n        make -j$(nproc)\n        make install\n\n    - name: riscv64-gnu-toolchain\n      run: |\n        sudo apt-get update\n        sudo apt-get install g++-riscv64-linux-gnu\n\n    - name: configure\n      run: mkdir build && cd build && cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv64-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n    - name: build\n      run: cmake --build build -j $(nproc)\n\n    - name: test\n      run: |\n        export PATH=$GITHUB_WORKSPACE/qemu-install/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-L;/usr/riscv64-linux-gnu\" ctest --output-on-failure -j $(nproc)\n\n  xuantie:\n    name: xuantie-${{ matrix.cpu }}\n    runs-on: [self-hosted, linux, ubuntu]\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n          - { cpu: c906, QEMU_CPU: c906fdv, OPENMP: OFF, RVV: OFF, XTHEADVECTOR: ON,  ZFH: ON, ZVFH: OFF }\n          - { cpu: c910, QEMU_CPU: c910v,   OPENMP: ON,  RVV: OFF, XTHEADVECTOR: ON,  ZFH: ON, ZVFH: OFF }\n          - { cpu: c908, QEMU_CPU: c908v,   OPENMP: ON,  RVV: ON,  XTHEADVECTOR: OFF, ZFH: ON, ZVFH: ON  }\n          - { cpu: c907, QEMU_CPU: c907fdv-rv64,   OPENMP: ON,  RVV: ON,  XTHEADVECTOR: OFF, ZFH: ON, ZVFH: ON  
}\n\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: build\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/Xuantie-900-gcc-linux-6.6.36-glibc-x86_64-V3.3.0\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/${{ matrix.cpu }}-v310.toolchain.cmake -DCMAKE_BUILD_TYPE=release \\\n            -DNCNN_OPENMP=${{ matrix.OPENMP }} -DNCNN_THREADS=${{ matrix.OPENMP }} \\\n            -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_RVV=${{ matrix.RVV }} \\\n            -DNCNN_XTHEADVECTOR=${{ matrix.XTHEADVECTOR }} \\\n            -DNCNN_ZFH=${{ matrix.ZFH }} \\\n            -DNCNN_ZVFH=${{ matrix.ZVFH }} \\\n            -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n\n    - name: test\n      run: |\n        export PATH=/data/action/osd/Xuantie-qemu-x86_64-Ubuntu-20.04-V5.2.8-B20250721-0303/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;${{ matrix.QEMU_CPU }}\" ctest --output-on-failure -j 8\n\n  spacemit:\n    name: spacemit-${{ matrix.cpu }}\n    runs-on: [self-hosted, linux, ubuntu]\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n          - { cpu: x60, QEMU_CPU: \"max,vlen=256,elen=64,vext_spec=v1.0\", OPENMP: ON, RVV: ON, XTHEADVECTOR: OFF, ZFH: ON, ZVFH: ON }\n\n    steps:\n    - uses: actions/checkout@v6\n\n    # https://archive.spacemit.com/toolchain/spacemit-toolchain-linux-glibc-x86_64-v1.1.2.tar.xz\n    - name: build-gcc\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/spacemit-toolchain-linux-glibc-x86_64-v1.1.2\n        mkdir build-gcc && cd build-gcc\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/k1.toolchain.cmake -DCMAKE_BUILD_TYPE=release \\\n            -DNCNN_OPENMP=${{ matrix.OPENMP }} -DNCNN_THREADS=${{ matrix.OPENMP }} \\\n            -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_RVV=${{ matrix.RVV }} \\\n            
-DNCNN_XTHEADVECTOR=${{ matrix.XTHEADVECTOR }} \\\n            -DNCNN_ZFH=${{ matrix.ZFH }} \\\n            -DNCNN_ZVFH=${{ matrix.ZVFH }} \\\n            -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n\n    - name: build-llvm\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/spacemit-toolchain-linux-glibc-x86_64-v1.1.2\n        mkdir build-llvm && cd build-llvm\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/k1.llvm.toolchain.cmake -DCMAKE_BUILD_TYPE=release \\\n            -DNCNN_OPENMP=${{ matrix.OPENMP }} -DNCNN_THREADS=${{ matrix.OPENMP }} \\\n            -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_RVV=${{ matrix.RVV }} \\\n            -DNCNN_XTHEADVECTOR=${{ matrix.XTHEADVECTOR }} \\\n            -DNCNN_ZFH=${{ matrix.ZFH }} \\\n            -DNCNN_ZVFH=${{ matrix.ZVFH }} \\\n            -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n\n    # https://archive.spacemit.com/spacemit-ai/qemu/jdsk-qemu-v0.0.14.tar.gz\n    - name: test-gcc\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/spacemit-toolchain-linux-glibc-x86_64-v1.1.2\n        export PATH=/data/action/osd/jdsk-qemu/bin:$PATH\n        cd build-gcc\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;${{ matrix.QEMU_CPU }};-L;${RISCV_ROOT_PATH}/sysroot\" ctest --output-on-failure -j 8\n\n    - name: test-llvm\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/spacemit-toolchain-linux-glibc-x86_64-v1.1.2\n        export PATH=/data/action/osd/jdsk-qemu/bin:$PATH\n        cd build-llvm\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;${{ matrix.QEMU_CPU }};-L;${RISCV_ROOT_PATH}/sysroot\" ctest --output-on-failure -j 8\n\n  gcc-rvv:\n    runs-on: [self-hosted, linux, ubuntu]\n    steps:\n    - uses: actions/checkout@v6\n\n    #- name: cache-qemu\n      #id: cache-qemu\n      #uses: 
actions/cache@v5\n      #with:\n        #path: qemu-install\n        #key: qemu-riscv64-install-20241202\n    #- name: install-qemu-build-deps\n      #if: steps.cache-qemu.outputs.cache-hit != 'true'\n      #run: |\n        #sudo apt-get update\n        #sudo apt-get install autoconf automake autotools-dev ninja-build\n    #- name: checkout-qemu\n      #if: steps.cache-qemu.outputs.cache-hit != 'true'\n      #uses: actions/checkout@v6\n      #with:\n        #repository: qemu/qemu\n        #path: qemu\n        #ref: 72b88908d12ee9347d13539c7dd9a252625158d1\n    #- name: qemu\n      #if: steps.cache-qemu.outputs.cache-hit != 'true'\n      #run: |\n        #cd qemu\n        #./configure --prefix=$GITHUB_WORKSPACE/qemu-install --target-list=riscv64-linux-user --disable-system\n        #make -j4\n        #make install\n\n    #- name: cache-riscv\n      #id: cache-riscv\n      #uses: actions/cache@v5\n      #with:\n        #path: riscv-install\n        #key: riscv-linux-install-20241202\n\n    #- name: install-riscv-build-deps\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: |\n        #sudo apt-get update\n        #sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev device-tree-compiler\n\n    #- name: checkout-riscv-gnu-toolchain\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #uses: actions/checkout@v6\n      #with:\n        #repository: riscv-collab/riscv-gnu-toolchain\n        #path: riscv-gnu-toolchain\n        #ref: 20f615317e2ce888dfc11b29ccde4a649494b654\n    #- name: checkout-riscv-gnu-toolchain-submodules\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: |\n        #cd riscv-gnu-toolchain\n        #git submodule update --init --recursive --depth 1 glibc\n        #git submodule update --init --recursive --depth 1 newlib\n        #git submodule update --init 
--recursive --depth 1 riscv-binutils\n        #git submodule update --init --recursive --depth 1 riscv-gcc\n        #git submodule update --init --recursive --depth 1 riscv-dejagnu\n        #git submodule update --init --recursive --depth 1 riscv-gdb\n    #- name: riscv-gnu-toolchain\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: |\n        #cd riscv-gnu-toolchain\n        #./configure --prefix=$GITHUB_WORKSPACE/riscv\n        #make linux -j4\n\n    #- name: riscv-strip-install\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: find $GITHUB_WORKSPACE/riscv -type f | xargs -i strip -g {} || true\n\n    - name: configure\n      run: export RISCV_ROOT_PATH=/data/action/osd/riscv && mkdir build && cd build && cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv64-unknown-linux-gnu.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n    - name: build\n      run: cmake --build build -j 8\n\n    - name: test-vlen256\n      run: |\n        export PATH=/data/action/osd/qemu-install/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;rv64,v=true,zfh=true,zvfh=true,vlen=256,elen=64,vext_spec=v1.0;-L;/data/action/osd/riscv/sysroot\" ctest --output-on-failure -j 8\n\n    - name: test-vlen128\n      run: |\n        export PATH=/data/action/osd/qemu-install/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;rv64,v=true,zfh=true,zvfh=true,vlen=128,elen=64,vext_spec=v1.0;-L;/data/action/osd/riscv/sysroot\" ctest --output-on-failure -j 8\n\n  clang-rvv:\n    runs-on: [self-hosted, linux, ubuntu]\n    steps:\n    - uses: actions/checkout@v6\n\n    #- name: cache-qemu\n      #id: cache-qemu\n      #uses: actions/cache@v5\n      #with:\n        #path: qemu-install\n        #key: qemu-riscv64-install-20241202\n    #- name: install-qemu-build-deps\n      #if: 
steps.cache-qemu.outputs.cache-hit != 'true'\n      #run: |\n        #sudo apt-get update\n        #sudo apt-get install autoconf automake autotools-dev ninja-build\n    #- name: checkout-qemu\n      #if: steps.cache-qemu.outputs.cache-hit != 'true'\n      #uses: actions/checkout@v6\n      #with:\n        #repository: qemu/qemu\n        #path: qemu\n        #ref: 72b88908d12ee9347d13539c7dd9a252625158d1\n    #- name: qemu\n      #if: steps.cache-qemu.outputs.cache-hit != 'true'\n      #run: |\n        #cd qemu\n        #./configure --prefix=$GITHUB_WORKSPACE/qemu-install --target-list=riscv64-linux-user --disable-system\n        #make -j4\n        #make install\n\n    #- name: cache-riscv\n      #id: cache-riscv\n      #uses: actions/cache@v5\n      #with:\n        #path: riscv-install\n        #key: riscv-linux-install-20241202\n\n    #- name: install-riscv-build-deps\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: |\n        #sudo apt-get update\n        #sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev device-tree-compiler\n\n    #- name: checkout-riscv-gnu-toolchain\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #uses: actions/checkout@v6\n      #with:\n        #repository: riscv-collab/riscv-gnu-toolchain\n        #path: riscv-gnu-toolchain\n        #ref: 20f615317e2ce888dfc11b29ccde4a649494b654\n    #- name: checkout-riscv-gnu-toolchain-submodules\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: |\n        #cd riscv-gnu-toolchain\n        #git submodule update --init --recursive --depth 1 glibc\n        #git submodule update --init --recursive --depth 1 newlib\n        #git submodule update --init --recursive --depth 1 riscv-binutils\n        #git submodule update --init --recursive --depth 1 riscv-gcc\n        #git submodule update --init --recursive 
--depth 1 riscv-dejagnu\n        #git submodule update --init --recursive --depth 1 riscv-gdb\n    #- name: riscv-gnu-toolchain\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: |\n        #cd riscv-gnu-toolchain\n        #./configure --prefix=$GITHUB_WORKSPACE/riscv\n        #make linux -j4\n\n    #- name: riscv-strip-install\n      #if: steps.cache-riscv.outputs.cache-hit != 'true'\n      #run: find $GITHUB_WORKSPACE/riscv -type f | xargs -i strip -g {} || true\n\n    # - name: install-clang\n    #   run: |\n    #     wget https://github.com/llvm/llvm-project/releases/download/llvmorg-19.1.4/llvm-project-19.1.4.src.tar.xz\n    #     tar -xf llvm-project-19.1.4.src.tar.xz\n    #     cd llvm-project-19.1.4.src\n    #     mkdir build\n    #     cd build\n    #     cmake -DCMAKE_INSTALL_PREFIX=$GITHUB_WORKSPACE/riscv -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -DLLVM_ENABLE_PROJECTS=\"clang\" -DLLVM_TARGETS_TO_BUILD=\"RISCV\" -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_INCLUDE_TESTS=OFF ../llvm/\n    #     make -j16\n    #     make install\n\n    - name: build\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/riscv\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv64-unknown-linux-gnu.llvm-toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j 8\n\n    - name: test-vlen256\n      run: |\n        export PATH=/data/action/osd/qemu-install/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;rv64,v=true,zfh=true,zvfh=true,vlen=256,elen=64,vext_spec=v1.0;-L;/data/action/osd/riscv/sysroot\" ctest --output-on-failure -j 8\n\n    - name: test-vlen128\n      run: |\n        export PATH=/data/action/osd/qemu-install/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;rv64,v=true,zfh=true,zvfh=true,vlen=128,elen=64,vext_spec=v1.0;-L;/data/action/osd/riscv/sysroot\" ctest --output-on-failure -j 8\n"
  },
  {
    "path": ".github/workflows/linux-x64-cpu-clang.yml",
    "content": "name: linux-x64-cpu-clang\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-cpu-clang.yml'\n    - 'toolchains/host-c.clang.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-cpu-clang.yml'\n    - 'toolchains/host-c.clang.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\nconcurrency:\n  group: linux-x64-cpu-clang-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-clang:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: update\n      run: sudo apt-get update\n    - name: protobuf\n      run: sudo apt-get install libprotobuf-dev protobuf-compiler libopencv-dev\n    - name: build-sse2\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-sse2 && cd build-sse2\n        cmake -DNCNN_AVX=OFF -DNCNN_AVX2=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-sse2\n      run: cd build-sse2 && ctest --output-on-failure -j $(nproc)\n    - name: build-shared\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DNCNN_AVX2=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n    - name: build-avx2\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-avx2 && cd build-avx2\n        cmake -DNCNN_AVX2=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: test-avx2\n      run: cd build-avx2 && ctest --output-on-failure -j $(nproc)\n    - name: build-avx\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-avx && cd build-avx\n        cmake -DNCNN_AVX2=OFF -DNCNN_AVX=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-avx\n      run: cd build-avx && ctest --output-on-failure -j $(nproc)\n    - name: build-avx1-2\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-avx1-2 && cd build-avx1-2\n        cmake -DNCNN_AVX2=ON -DNCNN_AVX=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-avx1-2\n      run: cd build-avx1-2 && ctest --output-on-failure -j $(nproc)\n    - name: build-noint8\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DNCNN_INT8=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-noint8\n      run: cd build-noint8 && ctest --output-on-failure -j $(nproc)\n\n  linux-clang-simplestl:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: build-simplestl\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-simplestl && cd build-simplestl\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host-c.clang.toolchain.cmake -DNCNN_STDIO=ON -DNCNN_STRING=ON -DNCNN_SIMPLESTL=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_BENCHMARK=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . 
-j $(nproc)\n    - name: test-simplestl\n      run: cd build-simplestl && ctest --output-on-failure -j $(nproc)\n    - name: build-simplestl-simpleomp\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-simplestl-simpleomp && cd build-simplestl-simpleomp\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host-c.clang.toolchain.cmake -DNCNN_STDIO=ON -DNCNN_STRING=ON -DNCNN_SIMPLESTL=ON -DNCNN_SIMPLEOMP=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_BENCHMARK=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-simplestl-simpleomp\n      run: cd build-simplestl-simpleomp && ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-x64-cpu-gcc-musl.yml",
    "content": "name: linux-x64-cpu-gcc-musl\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-cpu-gcc-musl.yml'\n    - 'toolchains/host-c.gcc.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-cpu-gcc-musl.yml'\n    - 'toolchains/host-c.gcc.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\nconcurrency:\n  group: linux-x64-cpu-gcc-musl-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-gcc-musl:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: jirutka/setup-alpine@v1\n      with:\n        packages: >\n          cmake\n          clang\n          clang-dev\n          make\n          gcc\n          g++\n          libc-dev\n          linux-headers\n\n    - uses: actions/checkout@v6\n    - name: build\n      shell: alpine.sh {0}\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test\n      shell: alpine.sh {0}\n      run: cd build && ctest --output-on-failure -j $(nproc)\n    - name: build-shared\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-x64-cpu-gcc.yml",
    "content": "name: linux-x64-cpu-gcc\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-cpu-gcc.yml'\n    - 'toolchains/host-c.gcc.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-cpu-gcc.yml'\n    - 'toolchains/host-c.gcc.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\nconcurrency:\n  group: linux-x64-cpu-gcc-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-gcc:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: update\n      run: sudo apt-get update\n    - name: protobuf\n      run: sudo apt-get install libprotobuf-dev protobuf-compiler libopencv-dev\n    - name: build-sse2\n      run: |\n        mkdir build-sse2 && cd build-sse2\n        cmake -DNCNN_AVX=OFF -DNCNN_AVX2=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-sse2\n      run: cd build-sse2 && ctest --output-on-failure -j $(nproc)\n    - name: build-shared\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DNCNN_AVX2=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n    - name: build-avx2\n      run: |\n        mkdir build-avx2 && cd build-avx2\n        cmake -DNCNN_AVX2=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-avx2\n      run: cd build-avx2 && ctest --output-on-failure -j $(nproc)\n    - name: build-avx\n      run: |\n        mkdir build-avx && cd build-avx\n        cmake -DNCNN_AVX2=OFF -DNCNN_AVX=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: test-avx\n      run: cd build-avx && ctest --output-on-failure -j $(nproc)\n    - name: build-avx1-2\n      run: |\n        mkdir build-avx1-2 && cd build-avx1-2\n        cmake -DNCNN_AVX2=ON -DNCNN_AVX=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-avx1-2\n      run: cd build-avx1-2 && ctest --output-on-failure -j $(nproc)\n    - name: build-noint8\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DNCNN_INT8=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-noint8\n      run: cd build-noint8 && ctest --output-on-failure -j $(nproc)\n\n  asan:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=relwithdebinfo -DNCNN_ASAN=ON -DNCNN_BUILD_TESTS=ON -DNCNN_SHARED_LIB=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test\n      run: |\n        cd build\n        ctest --output-on-failure -j $(nproc)\n\n  linux-gcc-cpp03-nostdio-nostring-simplestl:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: build-nostdio\n      run: |\n        mkdir build-nostdio && cd build-nostdio\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.gcc-c++03.toolchain.cmake -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-nostdio\n      run: cd build-nostdio && ctest --output-on-failure -j $(nproc)\n    - name: build-nostdio-nostring\n      run: |\n        mkdir build-nostdio-nostring && cd build-nostdio-nostring\n        cmake -DNCNN_STDIO=OFF -DNCNN_STRING=OFF -DNCNN_BUILD_TESTS=OFF -DNCNN_BUILD_BENCHMARK=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . 
-j $(nproc)\n    - name: build-simplestl\n      run: |\n        mkdir build-simplestl && cd build-simplestl\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host-c.gcc.toolchain.cmake -DNCNN_STDIO=ON -DNCNN_STRING=ON -DNCNN_SIMPLESTL=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_BENCHMARK=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-simplestl\n      run: cd build-simplestl && ctest --output-on-failure -j $(nproc)\n    - name: build-simplestl-simpleomp\n      run: |\n        mkdir build-simplestl-simpleomp && cd build-simplestl-simpleomp\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host-c.gcc.toolchain.cmake -DNCNN_STDIO=ON -DNCNN_STRING=ON -DNCNN_SIMPLESTL=ON -DNCNN_SIMPLEOMP=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_BENCHMARK=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-simplestl-simpleomp\n      run: cd build-simplestl-simpleomp && ctest --output-on-failure -j $(nproc)\n\n  linux-gcc-avx512:\n    runs-on: [self-hosted, linux, t4]\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      env:\n        CC: gcc\n        CXX: g++\n        LD_LIBRARY_PATH: /data/action/install/lib64\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_AVX2=ON -DNCNN_AVX512=ON -DNCNN_AVX512VNNI=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j 4\n    - name: test\n      env:\n        LD_LIBRARY_PATH: /data/action/install/lib64\n      run: cd build && ctest --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/linux-x64-gpu-clang.yml",
    "content": "name: linux-x64-gpu-clang\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-gpu-clang.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-gpu-clang.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\nconcurrency:\n  group: linux-x64-gpu-clang-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-clang-gpu:\n    runs-on: [self-hosted, linux, ubuntu25]\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-swiftshader\n      id: cache-swiftshader\n      uses: actions/cache@v5\n      with:\n        path: swiftshader-install\n        key: swiftshader-linux-install-20250508\n    - name: checkout-swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n      with:\n        repository: google/swiftshader\n        path: swiftshader\n        ref: 930d46d31b5d637f313fd5ef55da2bbf053c26c1\n    - name: swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        git -c submodule.\"third_party/git-hooks\".update=none submodule update --init --recursive\n        mkdir -p build; cd build\n        cmake -DCMAKE_INSTALL_PREFIX=install -DSWIFTSHADER_BUILD_PVR=FALSE -DSWIFTSHADER_BUILD_TESTS=FALSE -DSWIFTSHADER_ENABLE_ASTC=FALSE -DSWIFTSHADER_WARNINGS_AS_ERRORS=FALSE -DREACTOR_BACKEND=Subzero -DREACTOR_DEFAULT_OPT_LEVEL=Default -DCMAKE_BUILD_TYPE=Release ..\n        cmake --build . 
-j 8\n        mkdir $GITHUB_WORKSPACE/swiftshader-install\n        cp Linux/* $GITHUB_WORKSPACE/swiftshader-install\n\n    - name: build\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n    - name: test\n      run: |\n        printf \"[Processor]\\nThreadCount=1\\n\" > build/tests/SwiftShader.ini\n        export VK_ICD_FILENAMES=\"$GITHUB_WORKSPACE/swiftshader-install/vk_swiftshader_icd.json\"\n        cd build && ctest --output-on-failure -j 8\n    - name: build-shared\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DNCNN_VULKAN=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j 8\n"
  },
  {
    "path": ".github/workflows/linux-x64-gpu-gcc.yml",
    "content": "name: linux-x64-gpu-gcc\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-gpu-gcc.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-gpu-gcc.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\nconcurrency:\n  group: linux-x64-gpu-gcc-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-gcc-gpu:\n    runs-on: [self-hosted, linux, ubuntu25]\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-swiftshader\n      id: cache-swiftshader\n      uses: actions/cache@v5\n      with:\n        path: swiftshader-install\n        key: swiftshader-linux-install-20250508\n    - name: checkout-swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n      with:\n        repository: google/swiftshader\n        path: swiftshader\n        ref: 930d46d31b5d637f313fd5ef55da2bbf053c26c1\n    - name: swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        git -c submodule.\"third_party/git-hooks\".update=none submodule update --init --recursive\n        mkdir -p build; cd build\n        cmake -DCMAKE_INSTALL_PREFIX=install -DSWIFTSHADER_BUILD_PVR=FALSE -DSWIFTSHADER_BUILD_TESTS=FALSE -DSWIFTSHADER_ENABLE_ASTC=FALSE -DSWIFTSHADER_WARNINGS_AS_ERRORS=FALSE -DREACTOR_BACKEND=Subzero -DREACTOR_DEFAULT_OPT_LEVEL=Default -DCMAKE_BUILD_TYPE=Release ..\n        cmake --build . 
-j 8\n        mkdir $GITHUB_WORKSPACE/swiftshader-install\n        cp Linux/* $GITHUB_WORKSPACE/swiftshader-install\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n    - name: test\n      run: |\n        printf \"[Processor]\\nThreadCount=1\\n\" > build/tests/SwiftShader.ini\n        export VK_ICD_FILENAMES=\"$GITHUB_WORKSPACE/swiftshader-install/vk_swiftshader_icd.json\"\n        cd build && ctest --output-on-failure -j 8\n    - name: build-shared\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DNCNN_VULKAN=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j 8\n\n  linux-gcc-gpu-system-glslang:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: install-deps\n      run: |\n        sudo apt-get update\n        sudo apt-get install libprotobuf-dev protobuf-compiler libopencv-dev libvulkan-dev glslang-dev glslang-tools spirv-tools\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_VULKAN=ON -DNCNN_SYSTEM_GLSLANG=ON -DGLSLANG_TARGET_DIR=/usr/lib/x86_64-linux-gnu/cmake ..\n        cmake --build . -j $(nproc)\n    - name: build-shared\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DNCNN_VULKAN=ON -DNCNN_SYSTEM_GLSLANG=ON -DGLSLANG_TARGET_DIR=/usr/lib/x86_64-linux-gnu/cmake -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j $(nproc)\n\n  linux-gcc-gpu-t4:\n    runs-on: [self-hosted, linux, t4]\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: build\n      env:\n        CC: gcc\n        CXX: g++\n        LD_LIBRARY_PATH: /data/action/install/lib64\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . 
-j 4\n    - name: test\n      env:\n        LD_LIBRARY_PATH: /data/action/install/lib64\n      run: |\n        cd build && ctest --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/linux-x64-sde.yml",
    "content": "name: linux-x64-sde\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-sde.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x64-sde.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\nconcurrency:\n  group: linux-x64-sde-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  gcc-sde:\n    runs-on: ubuntu-24.04\n    steps:\n    - uses: actions/checkout@v6\n    - name: update\n      run: sudo apt-get update\n    - name: gcc14\n      run: sudo apt-get install gcc-14 g++-14\n    - name: Setup SDE binaries\n      uses: petarpetrovt/setup-sde@v3.0\n    - name: build\n      env:\n        CC: gcc-14\n        CXX: g++-14\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: test-p4p\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-p4p;--\" ctest --output-on-failure -j $(nproc)\n    - name: test-snb\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-snb;--\" ctest --output-on-failure -j $(nproc)\n    - name: test-hsw\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-hsw;--\" ctest --output-on-failure -j $(nproc)\n    - name: test-adl\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-adl;--\" ctest --output-on-failure -j $(nproc)\n    - name: test-arl\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-arl;--\" ctest --output-on-failure -j $(nproc)\n    - name: test-skx\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-skx;--\" ctest --output-on-failure -j $(nproc)\n    - name: test-spr\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-spr;--\" ctest --output-on-failure -j $(nproc)\n    - name: test-gnr\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-gnr;--\" ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-x86-cpu-clang.yml",
    "content": "name: linux-x86-cpu-clang\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x86-cpu-clang.yml'\n    - 'toolchains/host.clang-m32.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x86-cpu-clang.yml'\n    - 'toolchains/host.clang-m32.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-x86-cpu-clang-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-clang:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: update\n      run: sudo apt-get update\n    - name: gcc-multilib\n      run: sudo apt-get install gcc-multilib g++-multilib\n    - name: build\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.clang-m32.toolchain.cmake -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test\n      run: cd build && ctest --output-on-failure -j $(nproc)\n    - name: build-shared\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.clang-m32.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_SHARED_LIB=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: build-noint8\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.clang-m32.toolchain.cmake -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_INT8=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-noint8\n      run: cd build-noint8 && ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/linux-x86-cpu-gcc.yml",
    "content": "name: linux-x86-cpu-gcc\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x86-cpu-gcc.yml'\n    - 'toolchains/host.gcc-m32.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/linux-x86-cpu-gcc.yml'\n    - 'toolchains/host.gcc-m32.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\nconcurrency:\n  group: linux-x86-cpu-gcc-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-gcc:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: update\n      run: sudo apt-get update\n    - name: gcc-multilib\n      run: sudo apt-get install gcc-multilib g++-multilib\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.gcc-m32.toolchain.cmake -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test\n      run: cd build && ctest --output-on-failure -j $(nproc)\n    - name: build-nosse\n      run: |\n        mkdir build-nosse && cd build-nosse\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.gcc-m32.toolchain.cmake -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-nosse\n      run: cd build-nosse && ctest --output-on-failure -j $(nproc)\n    - name: build-shared\n      run: |\n        mkdir build-shared && cd build-shared\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.gcc-m32.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_SHARED_LIB=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: build-noint8\n      run: |\n        mkdir build-noint8 && cd build-noint8\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.gcc-m32.toolchain.cmake -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_INT8=OFF ..\n        cmake --build . -j $(nproc)\n    - name: test-noint8\n      run: cd build-noint8 && ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/mac-catalyst.yml",
    "content": "name: mac-catalyst\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/mac-catalyst.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/mac-catalyst.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\nconcurrency:\n  group: mac-catalyst-${{ github.ref }}\n  cancel-in-progress: true\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  MAC_CATALYST_DEPLOYMENT_TARGET: '13.1'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_CATALYST_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_CATALYST_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang 
-fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-mac-catalyst-install-20251004\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: openmp-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n    - name: openmp-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/mac-catalyst\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install/mac-catalyst\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/mac-catalyst/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/mac-catalyst/lib/libomp.a\n\n    - name: install-openmp\n      run: |\n        sudo cp $GITHUB_WORKSPACE/openmp-install/mac-catalyst/include/* $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/mac-catalyst/lib/libomp.a $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/lib\n\n    - name: x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n    - name: arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n"
  },
  {
    "path": ".github/workflows/macos.yml",
    "content": "name: macos\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/macos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/macos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\nconcurrency:\n  group: macos-${{ github.ref }}\n  cancel-in-progress: true\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  MAC_DEPLOYMENT_TARGET: '11.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        
-DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-macos-install-20251004\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: openmp-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n    - name: openmp-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n\n    - name: install-openmp\n      run: |\n        sudo cp $GITHUB_WORKSPACE/openmp-install/include/* $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/lib/libomp.a $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/lib\n\n    - name: cache-swiftshader\n      id: cache-swiftshader\n      uses: actions/cache@v5\n      with:\n        path: swiftshader-install\n        key: swiftshader-macos-install-20251004\n    - name: checkout-swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n      with:\n        repository: google/swiftshader\n        path: swiftshader\n        ref: de870ac7518fe2b6bb651ecc22fc36647cf7b986\n    - name: checkout-swiftshader-submodules\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        git -c submodule.\"third_party/git-hooks\".update=none submodule update --init --recursive\n    - name: 
swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        mkdir -p build; cd build\n        cmake -DCMAKE_POLICY_VERSION_MINIMUM=3.5 -DCMAKE_INSTALL_PREFIX=install -DSWIFTSHADER_BUILD_EGL=FALSE -DSWIFTSHADER_BUILD_GLESv2=FALSE -DSWIFTSHADER_BUILD_GLES_CM=FALSE -DSWIFTSHADER_BUILD_VULKAN=TRUE -DSWIFTSHADER_BUILD_PVR=FALSE -DSWIFTSHADER_BUILD_TESTS=FALSE -DSWIFTSHADER_ENABLE_ASTC=FALSE -DSWIFTSHADER_WARNINGS_AS_ERRORS=FALSE -DREACTOR_BACKEND=Subzero -DREACTOR_DEFAULT_OPT_LEVEL=Default -DCMAKE_BUILD_TYPE=Release ..\n        cmake --build . -j 4\n        mkdir $GITHUB_WORKSPACE/swiftshader-install\n        cp Darwin/* $GITHUB_WORKSPACE/swiftshader-install\n\n    - name: arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n    - name: x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC -DARCHS=\"x86_64\" -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 4\n\n    - name: arm64-shared\n      run: |\n        mkdir build-arm64-shared && cd build-arm64-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC_ARM64 -DARCHS=\"arm64\" -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j 4\n    - name: x86_64-shared\n      run: |\n        mkdir build-x86_64-shared && cd build-x86_64-shared\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC -DARCHS=\"x86_64\" -DNCNN_SHARED_LIB=ON ..\n        cmake --build . -j 4\n\n    - name: x86_64-test\n      run: |\n        printf \"[Processor]\\nThreadCount=1\\n\" > build-x86_64/tests/SwiftShader.ini\n        export VK_ICD_FILENAMES=\"$GITHUB_WORKSPACE/swiftshader-install/vk_swiftshader_icd.json\"\n        cd build-x86_64 && ctest --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/pnnx.yml",
    "content": "name: pnnx\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/pnnx.yml'\n    - 'src/layer/*'\n    - 'tools/pnnx/**'\n    - '!tools/pnnx/README.md'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/pnnx.yml'\n    - 'src/layer/*'\n    - 'tools/pnnx/**'\n    - '!tools/pnnx/README.md'\nconcurrency:\n  group: pnnx-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\nenv:\n  LIBTORCH_VERSION: 2.10.0\n  TORCHVISION_VERSION: 0.25.0\n  PROTOBUF_VERSION: 21.12\n  ONNXRUNTIME_VERSION: 1.24.3\n  CACHE_DATE: 20260309\n  SEGMENT_DOWNLOAD_TIMEOUT_MINS: 15\n\njobs:\n  quick-test:\n    runs-on: ${{ matrix.os }}\n    strategy:\n      fail-fast: false\n      matrix:\n        os: [ubuntu-latest, macos-latest, windows-latest]\n\n    env:\n      PYTHONUSERBASE: ${{ github.workspace }}/torch\n      UseMultiToolTask: true\n    steps:\n    - uses: actions/checkout@v6\n\n    - uses: actions/setup-python@v6\n      with:\n        python-version: 3.12\n\n    - name: setup-pytorch\n      run: |\n        python3 -m pip config set global.break-system-packages true\n        pip3 install --user torch --index-url https://download.pytorch.org/whl/cpu\n        pip3 install --user numpy packaging\n\n    - name: build-pnnx\n      run: |\n        cd tools/pnnx\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=Release ..\n        cmake --build . 
--config Release -j 4\n\n    - name: quick-test\n      if: matrix.os != 'windows-latest'\n      run: |\n        cd tools/pnnx\n        cd build && ctest -C Release --output-on-failure -R test_nn_Conv\n\n  build:\n    runs-on: [self-hosted, linux, ubuntu25]\n\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: local-cache-libtorch\n      id: local-cache-libtorch\n      uses: maxnowack/local-cache@v2\n      with:\n        path: libtorch-${{ env.LIBTORCH_VERSION }}-install\n        key: libtorch-${{ env.LIBTORCH_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: local-cache-torchvision\n      id: local-cache-torchvision\n      uses: maxnowack/local-cache@v2\n      with:\n        path: torchvision-${{ env.TORCHVISION_VERSION }}-install\n        key: torchvision-${{ env.TORCHVISION_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: local-cache-onnxruntime\n      id: local-cache-onnxruntime\n      uses: maxnowack/local-cache@v2\n      with:\n        path: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install\n        key: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: cache-libtorch\n      id: cache-libtorch\n      uses: actions/cache@v4\n      with:\n        path: libtorch-${{ env.LIBTORCH_VERSION }}-install\n        key: libtorch-${{ env.LIBTORCH_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: cache-torchvision\n      id: cache-torchvision\n      uses: actions/cache@v4\n      with:\n        path: torchvision-${{ env.TORCHVISION_VERSION }}-install\n        key: torchvision-${{ env.TORCHVISION_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: cache-onnxruntime\n      id: cache-onnxruntime\n      uses: actions/cache@v4\n      with:\n        path: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install\n        key: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: pnnx-patches\n      if: (steps.local-cache-libtorch.outputs.cache-hit != 
'true' && steps.cache-libtorch.outputs.cache-hit != 'true') || (steps.local-cache-torchvision.outputs.cache-hit != 'true' && steps.cache-torchvision.outputs.cache-hit != 'true') || (steps.local-cache-onnxruntime.outputs.cache-hit != 'true' && steps.cache-onnxruntime.outputs.cache-hit != 'true')\n      uses: actions/checkout@v6\n      with:\n        repository: pnnx/pnnx\n        path: pnnx-patches\n\n    - name: libtorch\n      if: steps.local-cache-libtorch.outputs.cache-hit != 'true' && steps.cache-libtorch.outputs.cache-hit != 'true'\n      run: |\n        wget -q https://github.com/pytorch/pytorch/releases/download/v${{ env.LIBTORCH_VERSION }}/pytorch-v${{ env.LIBTORCH_VERSION }}.tar.gz\n        tar -xf pytorch-v${{ env.LIBTORCH_VERSION }}.tar.gz\n        cd pytorch-v${{ env.LIBTORCH_VERSION }}\n        pip3 install -r requirements.txt --break-system-packages\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/pytorch-v${{ env.LIBTORCH_VERSION }}-fix-mobile-build.patch\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/pytorch-v${{ env.LIBTORCH_VERSION }}-no-link-system-lib.patch\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/pytorch-v${{ env.LIBTORCH_VERSION }}-fix-eigen-build.patch\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/pytorch-v${{ env.LIBTORCH_VERSION }}-fix-link-local-sleef.patch\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/pytorch-v${{ env.LIBTORCH_VERSION }}-revert-nativert-api.patch\n        mkdir -p build && cd build\n        cmake -DCMAKE_INSTALL_PREFIX=$GITHUB_WORKSPACE/libtorch-${{ env.LIBTORCH_VERSION }}-install \\\n            -DCMAKE_BUILD_TYPE=MinSizeRel \\\n            -DBUILD_SHARED_LIBS=OFF \\\n            -DCMAKE_POLICY_VERSION_MINIMUM=3.5 \\\n            -DBUILD_CUSTOM_PROTOBUF=OFF \\\n            -DBUILD_LITE_INTERPRETER=OFF \\\n            -DBUILD_PYTHON=OFF \\\n            -DINTERN_BUILD_MOBILE=ON \\\n            -DINTERN_DISABLE_AUTOGRAD=ON \\\n            -DINTERN_DISABLE_ONNX=ON \\\n            
-DUSE_CUDA=OFF \\\n            -DUSE_DISTRIBUTED=OFF \\\n            -DUSE_ITT=OFF \\\n            -DUSE_KINETO=OFF \\\n            -DUSE_LITE_INTERPRETER_PROFILER=OFF \\\n            -DUSE_MKLDNN=OFF \\\n            -DUSE_MPS=OFF \\\n            -DUSE_NUMPY=OFF \\\n            -DUSE_OPENMP=OFF \\\n            -DUSE_SOURCE_DEBUG_ON_MOBILE=OFF \\\n            -DUSE_XNNPACK=OFF \\\n            -DBUILD_TEST=OFF \\\n            -DATEN_NO_TEST=ON \\\n            ..\n        cmake --build . -j 8\n        cmake --build . -j 8 --target install/strip\n\n    - name: torchvision\n      if: steps.local-cache-torchvision.outputs.cache-hit != 'true' && steps.cache-torchvision.outputs.cache-hit != 'true'\n      run: |\n        wget -q https://github.com/pytorch/vision/archive/v${{ env.TORCHVISION_VERSION }}.zip -O vision-${{ env.TORCHVISION_VERSION }}.zip\n        unzip -q vision-${{ env.TORCHVISION_VERSION }}.zip\n        cd vision-${{ env.TORCHVISION_VERSION }}\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/vision-${{ env.TORCHVISION_VERSION }}-ops-only.patch\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/vision-${{ env.TORCHVISION_VERSION }}-no-cuda-version.patch\n        mkdir -p build && cd build\n        cmake -DCMAKE_INSTALL_PREFIX=$GITHUB_WORKSPACE/torchvision-${{ env.TORCHVISION_VERSION }}-install \\\n            -DTorch_DIR=$GITHUB_WORKSPACE/libtorch-${{ env.LIBTORCH_VERSION }}-install/share/cmake/Torch \\\n            -DCMAKE_BUILD_TYPE=MinSizeRel \\\n            -DWITH_PNG=OFF \\\n            -DWITH_JPEG=OFF ..\n        cmake --build . -j 8\n        cmake --build . 
-j 8 --target install/strip\n\n    - name: onnxruntime\n      if: steps.local-cache-onnxruntime.outputs.cache-hit != 'true' && steps.cache-onnxruntime.outputs.cache-hit != 'true'\n      run: |\n        wget -q https://github.com/protocolbuffers/protobuf/archive/v${{ env.PROTOBUF_VERSION }}.zip -O protobuf-${{ env.PROTOBUF_VERSION }}.zip\n        unzip -q protobuf-${{ env.PROTOBUF_VERSION }}.zip\n        cd protobuf-${{ env.PROTOBUF_VERSION }}\n        mkdir -p build2 && cd build2\n        cmake -DCMAKE_INSTALL_PREFIX=$GITHUB_WORKSPACE/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install \\\n            -Dprotobuf_BUILD_TESTS=OFF \\\n            -DCMAKE_BUILD_TYPE=MinSizeRel \\\n            -DCMAKE_POSITION_INDEPENDENT_CODE=ON ..\n        cmake --build . -j 8\n        cmake --build . -j 8 --target install/strip\n\n        cd ../../\n        wget -q https://github.com/microsoft/onnxruntime/archive/v${{ env.ONNXRUNTIME_VERSION }}.zip -O onnxruntime-${{ env.ONNXRUNTIME_VERSION }}.zip\n        unzip -q onnxruntime-${{ env.ONNXRUNTIME_VERSION }}.zip\n        cd onnxruntime-${{ env.ONNXRUNTIME_VERSION }}\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-less-mlas-features.patch\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-monolithic-static-library.patch\n        patch -p1 -i $GITHUB_WORKSPACE/pnnx-patches/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-use-clog.patch\n        mkdir -p build2 && cd build2\n        cmake -DCMAKE_INSTALL_PREFIX=$GITHUB_WORKSPACE/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install \\\n            -DCMAKE_BUILD_TYPE=MinSizeRel \\\n            -Donnxruntime_USE_FULL_PROTOBUF=ON \\\n            -Donnxruntime_BUILD_SHARED_LIB=ON \\\n            -Donnxruntime_BUILD_UNIT_TESTS=OFF \\\n            -Donnxruntime_ENABLE_CPUINFO=OFF \\\n            -Donnxruntime_DISABLE_CONTRIB_OPS=ON \\\n            -Donnxruntime_DISABLE_ML_OPS=ON \\\n            
-Donnxruntime_DISABLE_SPARSE_TENSORS=ON \\\n            -DCMAKE_POSITION_INDEPENDENT_CODE=ON \\\n            --compile-no-warning-as-error ../cmake\n        cmake --build . -j 8\n        cmake --build . -j 8 --target install/strip\n\n    - name: pnnx\n      run: |\n        cd tools/pnnx\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=MinSizeRel \\\n            -DTorch_INSTALL_DIR=$GITHUB_WORKSPACE/libtorch-${{ env.LIBTORCH_VERSION }}-install \\\n            -DTorchVision_INSTALL_DIR=$GITHUB_WORKSPACE/torchvision-${{ env.TORCHVISION_VERSION }}-install \\\n            -Donnxruntime_INSTALL_DIR=$GITHUB_WORKSPACE/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install \\\n            -Dprotobuf_DIR=$GITHUB_WORKSPACE/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install/lib/cmake/protobuf ..\n        cmake --build . -j 8\n        strip src/pnnx\n\n    - name: upload-pnnx\n      uses: actions/upload-artifact@v5\n      with:\n        name: pnnx\n        path: tools/pnnx/build/src/pnnx\n        compression-level: 9\n\n  test:\n    needs: [build]\n    runs-on: [self-hosted, linux, ubuntu25]\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n          - { python: '3.8',  numpy: '1.24.4', opencv: '4.5.*',  torch: '1.8.1',  torchvision: '0.9.1',  torchaudio: '0.8.1',      transformers: '4.52.1' }\n          - { python: '3.8',  numpy: '1.24.4', opencv: '4.5.*',  torch: '1.9.1',  torchvision: '0.10.1', torchaudio: '0.9.1',      transformers: '4.52.1' }\n          - { python: '3.8',  numpy: '1.24.4', opencv: '4.6.*',  torch: '1.10.0', torchvision: '0.11.1', torchaudio: '0.10.0+cpu', transformers: '4.52.1' }\n          - { python: '3.9',  numpy: '1.26.4', opencv: '4.6.*',  torch: '1.11.0', torchvision: '0.12.0', torchaudio: '0.11.0+cpu', transformers: '4.52.1' }\n          - { python: '3.9',  numpy: '1.26.4', opencv: '4.7.*',  torch: '1.12.0', torchvision: '0.13.0', torchaudio: '0.12.0+cpu', transformers: '4.52.1' }\n          - { python: '3.10', 
numpy: '1.26.4', opencv: '4.7.*',  torch: '1.13.0', torchvision: '0.14.0', torchaudio: '0.13.0+cpu', transformers: '4.52.1' }\n          - { python: '3.10', numpy: '1.26.4', opencv: '4.8.*',  torch: '2.0.0',  torchvision: '0.15.1', torchaudio: '2.0.0+cpu',  transformers: '4.52.1' }\n          - { python: '3.10', numpy: '1.26.4', opencv: '4.8.*',  torch: '2.1.0',  torchvision: '0.16.0', torchaudio: '2.1.0+cpu',  transformers: '4.52.1' }\n          - { python: '3.11', numpy: '1.26.4', opencv: '4.9.*',  torch: '2.2.1',  torchvision: '0.17.1', torchaudio: '2.2.1+cpu',  transformers: '4.52.1' }\n          - { python: '3.11', numpy: '1.26.4', opencv: '4.9.*',  torch: '2.3.0',  torchvision: '0.18.0', torchaudio: '2.3.0+cpu',  transformers: '4.52.1' }\n          - { python: '3.11', numpy: '2.2.5',  opencv: '4.10.*', torch: '2.4.0',  torchvision: '0.19.0', torchaudio: '2.4.0+cpu',  transformers: '4.52.1' }\n          - { python: '3.12', numpy: '2.2.5',  opencv: '4.10.*', torch: '2.5.0',  torchvision: '0.20.0', torchaudio: '2.5.0+cpu',  transformers: '4.52.1' }\n          - { python: '3.12', numpy: '2.2.5',  opencv: '4.11.*', torch: '2.6.0',  torchvision: '0.21.0', torchaudio: '2.6.0+cpu',  transformers: '4.52.1' }\n          - { python: '3.12', numpy: '2.2.5',  opencv: '4.11.*', torch: '2.7.0',  torchvision: '0.22.0', torchaudio: '2.7.0+cpu',  transformers: '4.52.1' }\n          - { python: '3.13', numpy: '2.2.5',  opencv: '4.12.*', torch: '2.8.0',  torchvision: '0.23.0', torchaudio: '2.8.0+cpu',  transformers: '4.56.2' }\n          - { python: '3.13', numpy: '2.2.5',  opencv: '4.12.*', torch: '2.9.0',  torchvision: '0.24.0', torchaudio: '2.9.0+cpu',  transformers: '4.56.2' }\n          - { python: '3.13', numpy: '2.2.5',  opencv: '4.12.*', torch: '2.10.0', torchvision: '0.25.0', torchaudio: '2.10.0+cpu', transformers: '4.56.2' }\n\n    name: test-${{ matrix.torch }}-py${{ matrix.python }}\n\n    env:\n      PYTHONUSERBASE: ${{ github.workspace }}/python-${{ matrix.python 
}}\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: local-cache-libtorch\n      id: local-cache-libtorch\n      uses: maxnowack/local-cache@v2\n      with:\n        path: libtorch-${{ env.LIBTORCH_VERSION }}-install\n        key: libtorch-${{ env.LIBTORCH_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: local-cache-torchvision\n      id: local-cache-torchvision\n      uses: maxnowack/local-cache@v2\n      with:\n        path: torchvision-${{ env.TORCHVISION_VERSION }}-install\n        key: torchvision-${{ env.TORCHVISION_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: local-cache-onnxruntime\n      id: local-cache-onnxruntime\n      uses: maxnowack/local-cache@v2\n      with:\n        path: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install\n        key: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n\n    - name: cache-libtorch\n      if: steps.local-cache-libtorch.outputs.cache-hit != 'true'\n      id: cache-libtorch\n      uses: actions/cache/restore@v5\n      with:\n        path: libtorch-${{ env.LIBTORCH_VERSION }}-install\n        key: libtorch-${{ env.LIBTORCH_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n        fail-on-cache-miss: true\n\n    - name: cache-torchvision\n      if: steps.local-cache-torchvision.outputs.cache-hit != 'true'\n      id: cache-torchvision\n      uses: actions/cache/restore@v5\n      with:\n        path: torchvision-${{ env.TORCHVISION_VERSION }}-install\n        key: torchvision-${{ env.TORCHVISION_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n        fail-on-cache-miss: true\n\n    - name: cache-onnxruntime\n      if: steps.local-cache-onnxruntime.outputs.cache-hit != 'true'\n      id: cache-onnxruntime\n      uses: actions/cache/restore@v5\n      with:\n        path: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install\n        key: onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-linux-install-${{ env.CACHE_DATE }}\n      
  fail-on-cache-miss: true\n\n    - uses: actions/setup-python@v6\n      with:\n        python-version: ${{ matrix.python }}\n\n    - name: setup-pytorch\n      run: |\n        export PATH=${{ env.PYTHONUSERBASE }}/bin:$PATH\n        pip3 install --user pytest wheel twine requests einops numpy==${{ matrix.numpy }} opencv-python==${{ matrix.opencv }}\n        pip3 install --user torch==${{ matrix.torch }}+cpu torchvision==${{ matrix.torchvision }}+cpu torchaudio==${{ matrix.torchaudio }} --index-url https://download.pytorch.org/whl/cpu\n        pip3 install --user onnx onnxscript onnxruntime\n        pip3 install --user \"transformers<=${{ matrix.transformers }}\" diffusers \"safetensors<=0.6.2\"\n\n    - name: setup-pytorch-execstack-or-patchelf\n      if: matrix.python == '3.8' || matrix.python == '3.9'\n      run: |\n        execstack -c ${{ env.PYTHONUSERBASE }}/lib/python${{ matrix.python }}/site-packages/torch/lib/libtorch_cpu.so || true\n        patchelf --clear-execstack ${{ env.PYTHONUSERBASE }}/lib/python${{ matrix.python }}/site-packages/torch/lib/libtorch_cpu.so || true\n\n    - name: python-ncnn\n      run: |\n        export CMAKE_BUILD_PARALLEL_LEVEL=8\n        pip3 install --user . 
--verbose\n\n    - name: pnnx\n      run: |\n        cd tools/pnnx\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=Release \\\n            -DTorch_INSTALL_DIR=$GITHUB_WORKSPACE/libtorch-${{ env.LIBTORCH_VERSION }}-install \\\n            -DTorchVision_INSTALL_DIR=$GITHUB_WORKSPACE/torchvision-${{ env.TORCHVISION_VERSION }}-install \\\n            -Donnxruntime_INSTALL_DIR=$GITHUB_WORKSPACE/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install \\\n            -Dprotobuf_DIR=$GITHUB_WORKSPACE/onnxruntime-${{ env.ONNXRUNTIME_VERSION }}-install/lib/cmake/protobuf ..\n\n    - name: download-pnnx\n      uses: actions/download-artifact@v8\n      with:\n        name: pnnx\n        path: tools/pnnx/build/src\n\n    - name: test\n      run: |\n        export PATH=${{ env.PYTHONUSERBASE }}/bin:$PATH\n        chmod +x tools/pnnx/build/src/pnnx\n        export OMP_THREAD_LIMIT=1\n        export OMP_NUM_THREADS=1\n        export MKL_NUM_THREADS=1\n        export MKL_ENABLE_INSTRUCTIONS=SSE4_2\n        cd tools/pnnx/build\n        ctest --output-on-failure -j 8\n\n    - name: python-pnnx\n      run: |\n        export PATH=${{ env.PYTHONUSERBASE }}/bin:$PATH\n        export PNNX_WHEEL_WITHOUT_BUILD=ON\n        cd tools/pnnx/python\n        cp ../build/src/pnnx pnnx/\n        python3 setup.py install --user\n        pytest tests\n"
  },
  {
    "path": ".github/workflows/python.yml",
    "content": "name: python\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/python.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'python/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/python.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'python/**'\n    - 'glslang'\nconcurrency:\n  group: python-${{ github.ref }}\n  cancel-in-progress: true\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  MAC_DEPLOYMENT_TARGET: '11.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\n  CMAKE_BUILD_PARALLEL_LEVEL: 4\n  UseMultiToolTask: true\npermissions:\n  contents: read\n\njobs:\n  build:\n    strategy:\n      matrix:\n        os: [ubuntu-latest, macos-15-intel, windows-latest]\n        python-version: [3.9, 3.12]\n\n    runs-on: ${{ matrix.os }}\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-swiftshader\n      if: matrix.os == 'ubuntu-latest'\n      id: cache-swiftshader\n      uses: actions/cache@v5\n      with:\n        path: swiftshader-install\n        key: swiftshader-linux-install-20240622\n    - name: checkout-swiftshader\n      if: matrix.os == 'ubuntu-latest' && steps.cache-swiftshader.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n      with:\n        repository: google/swiftshader\n        path: swiftshader\n        ref: de870ac7518fe2b6bb651ecc22fc36647cf7b986\n    - name: checkout-swiftshader-submodules\n      if: matrix.os == 'ubuntu-latest' && steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        git -c submodule.\"third_party/git-hooks\".update=none submodule update --init --recursive\n    - name: swiftshader\n      if: 
matrix.os == 'ubuntu-latest' && steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        mkdir -p build; cd build\n        cmake -DCMAKE_INSTALL_PREFIX=install -DSWIFTSHADER_BUILD_EGL=FALSE -DSWIFTSHADER_BUILD_GLESv2=FALSE -DSWIFTSHADER_BUILD_GLES_CM=FALSE -DSWIFTSHADER_BUILD_VULKAN=TRUE -DSWIFTSHADER_BUILD_PVR=FALSE -DSWIFTSHADER_BUILD_TESTS=FALSE -DSWIFTSHADER_ENABLE_ASTC=FALSE -DSWIFTSHADER_WARNINGS_AS_ERRORS=FALSE -DREACTOR_BACKEND=Subzero -DREACTOR_DEFAULT_OPT_LEVEL=Default -DCMAKE_BUILD_TYPE=Release ..\n        cmake --build . -j $(nproc)\n        mkdir $GITHUB_WORKSPACE/swiftshader-install\n        cp Linux/* $GITHUB_WORKSPACE/swiftshader-install\n\n    - name: setup-python\n      uses: actions/setup-python@v6\n      with:\n        python-version: ${{ matrix.python-version }}\n    - name: install-deps\n      run: |\n        python -m pip install --upgrade pip\n        pip install pytest setuptools wheel twine importlib-metadata\n\n    - name: build\n      if: matrix.os == 'ubuntu-latest'\n      env:\n        CC: clang\n        CXX: clang++\n      run: |\n        mkdir build && cd build\n        cmake -DNCNN_VULKAN=ON -DNCNN_PYTHON=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . -j $(nproc)\n    - name: build\n      if: matrix.os == 'macos-15-intel'\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake -DPLATFORM=MAC -DARCHS=\"x86_64\" \\\n            -DDEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET -DENABLE_BITCODE=$ENABLE_BITCODE -DENABLE_ARC=$ENABLE_ARC -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n            -DNCNN_VULKAN=OFF -DNCNN_PYTHON=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . 
-j 4\n    - name: build\n      if: matrix.os == 'windows-latest'\n      run: |\n        mkdir build; cd build\n        cmake -T v142,host=x64 -A x64 -DNCNN_VULKAN=OFF -DNCNN_PYTHON=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . --config Release -j 4\n    - name: build-python\n      run: cd python && pip install .\n    - name: test\n      if: matrix.os == 'ubuntu-latest'\n      run: |\n        export VK_ICD_FILENAMES=\"$GITHUB_WORKSPACE/swiftshader-install/vk_swiftshader_icd.json\"\n        cd python && pytest tests\n    - name: test\n      if: matrix.os != 'ubuntu-latest'\n      run: |\n        cd python && pytest tests\n"
  },
  {
    "path": ".github/workflows/release-python.yml",
    "content": "name: release-python\non:\n  push:\n    tags:\n      - '*'\n  workflow_dispatch:\n\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  MAC_DEPLOYMENT_TARGET: '11.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\n  CIBW_SKIP: \"cp3??t-*\"\n\njobs:\n  build_sdist:\n    name: Build SDist\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - uses: actions/setup-python@v6\n      with:\n        python-version: '3.x'\n\n    - name: Install deps\n      run: python -m pip install twine build\n\n    - name: Build SDist\n      run: python -m build -s\n\n    - name: Check metadata\n      run: twine check dist/*\n\n    - uses: actions/upload-artifact@v6\n      with:\n        name: sdist\n        path: dist/*.tar.gz\n\n  build_wheels:\n    name: ${{ matrix.arch }} ${{ matrix.build_id }} on ${{ matrix.os }}\n    runs-on: ${{ matrix.os }}\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n          - { os: ubuntu-24.04,     arch: x86_64,     build: 'cp*-manylinux*', build_id: cp-manylinux }\n          - { os: ubuntu-24.04,     arch: x86_64,     build: 'cp*-musllinux*', build_id: cp-musllinux }\n          - { os: ubuntu-24.04,     arch: x86_64,     build: 'pp*',            build_id: pp           }\n          - { os: ubuntu-24.04,     arch: i686,       build: 'cp*-manylinux*', build_id: cp-manylinux }\n          - { os: ubuntu-24.04,     arch: i686,       build: 'cp*-musllinux*', build_id: cp-musllinux }\n          - { os: ubuntu-24.04,     arch: i686,       build: 'pp*',            build_id: pp           }\n          - { os: windows-2025,     arch: x86,        build: 'cp*',            build_id: cp           }\n          - { os: windows-2025,     arch: AMD64,      build: 'cp*',            build_id: cp           }\n          - { os: windows-2025,     arch: AMD64,      build: 'pp*',            build_id: pp           }\n          - { 
os: windows-11-arm,   arch: ARM64,      build: 'cp*',            build_id: cp           }\n          - { os: macos-15-intel,   arch: x86_64,     build: 'cp*',            build_id: cp           }\n          - { os: macos-15,         arch: arm64,      build: 'cp*',            build_id: cp           }\n          - { os: ubuntu-24.04-arm, arch: armv7l,     build: 'cp*-manylinux*', build_id: cp-manylinux }\n          - { os: ubuntu-24.04-arm, arch: armv7l,     build: 'cp*-musllinux*', build_id: cp-musllinux }\n          - { os: ubuntu-24.04-arm, arch: aarch64,    build: 'cp*-manylinux*', build_id: cp-manylinux }\n          - { os: ubuntu-24.04-arm, arch: aarch64,    build: 'cp*-musllinux*', build_id: cp-musllinux }\n          - { os: ubuntu-24.04-arm, arch: aarch64,    build: 'pp*',            build_id: pp           }\n\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    # build wheels for ubuntu\n    - name: Build wheels for ubuntu\n      if: matrix.os == 'ubuntu-24.04'\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_LINUX: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build }}\n        CIBW_ENABLE: pypy\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT: CMAKE_BUILD_PARALLEL_LEVEL=4\n      with:\n        output-dir: wheelhouse\n\n    # build wheels for ubuntu armv7l\n    - name: Build wheels for ubuntu armv7l\n      if: matrix.os == 'ubuntu-24.04-arm' && (matrix.arch == 
'armv7l')\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_LINUX: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build }}\n        CIBW_ENABLE: pypy\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT: CMAKE_BUILD_PARALLEL_LEVEL=4\n          CFLAGS=\"-mfpu=neon\" CXXFLAGS=\"-mfpu=neon\"\n      with:\n        output-dir: wheelhouse\n\n    # build wheels for ubuntu aarch64\n    - name: Build wheels for ubuntu aarch64\n      if: matrix.os == 'ubuntu-24.04-arm' && (matrix.arch == 'aarch64')\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_LINUX: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build }}\n        CIBW_ENABLE: pypy\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT: CMAKE_BUILD_PARALLEL_LEVEL=4\n      with:\n        output-dir: wheelhouse\n\n    # build wheels for windows\n    - name: Build wheels for windows\n      if: matrix.os == 'windows-2025' && (matrix.arch == 'AMD64' || matrix.arch == 'x86')\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_WINDOWS: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build }}\n        CIBW_ENABLE: pypy\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT_WINDOWS: CMAKE_BUILD_PARALLEL_LEVEL=4\n        CIBW_BEFORE_BUILD: pip install delvewheel\n        CIBW_REPAIR_WHEEL_COMMAND: delvewheel repair -w {dest_dir} {wheel}\n      with:\n        output-dir: wheelhouse\n\n    - name: Build wheels for windows ARM64\n      if: matrix.os == 'windows-11-arm' && matrix.arch == 'ARM64'\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_WINDOWS: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build }}\n        CIBW_ENABLE: pypy\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT_WINDOWS: CMAKE_BUILD_PARALLEL_LEVEL=4\n        CIBW_BEFORE_BUILD: pip install delvewheel\n        CIBW_REPAIR_WHEEL_COMMAND: delvewheel repair -w {dest_dir} {wheel} --no-dll \"msvcp140.dll;vcomp140.dll\"\n      with:\n        output-dir: 
wheelhouse\n\n    # build wheels for macos\n    - name: cache-openmp for macos\n      if: matrix.os == 'macos-15-intel' || matrix.os == 'macos-15'\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-macos-install-20251004\n\n    - name: openmp for macos\n      if: (matrix.os == 'macos-15-intel' || matrix.os == 'macos-15') && steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n\n    - name: openmp-build-x86_64 for macos\n      if: (matrix.os == 'macos-15-intel' || matrix.os == 'macos-15') && steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n\n    - name: openmp-build-arm64 for macos\n      if: (matrix.os == 'macos-15-intel' || matrix.os == 'macos-15') && steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n\n    - name: openmp-merge-fat-library for macos\n      if: (matrix.os == 'macos-15-intel' || matrix.os == 'macos-15') && steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n\n    - name: install-openmp for macos\n      if: matrix.os == 'macos-15-intel' || matrix.os == 'macos-15'\n      run: |\n        sudo cp $GITHUB_WORKSPACE/openmp-install/include/* $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/lib/libomp.a $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/lib\n\n    - name: vulkansdk for macos\n      if: matrix.os == 'macos-15-intel' || matrix.os == 'macos-15'\n      run: |\n        wget -q https://sdk.lunarg.com/sdk/download/1.4.335.1/mac/vulkansdk-macos-1.4.335.1.zip?Human=true -O vulkansdk-macos-1.4.335.1.zip\n        unzip vulkansdk-macos-1.4.335.1.zip\n        sudo vulkansdk-macOS-1.4.335.1.app/Contents/MacOS/vulkansdk-macOS-1.4.335.1 --root $GITHUB_WORKSPACE/vulkansdk-macos-1.4.335.1 --accept-licenses --default-answer 
--confirm-command install\n\n    - name: Build wheels for macos x86_64\n      if: matrix.os == 'macos-15-intel' && matrix.arch == 'x86_64'\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_MACOS: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build }}\n        CIBW_ENABLE: pypy\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT: CMAKE_BUILD_PARALLEL_LEVEL=4\n          CMAKE_TOOLCHAIN_FILE=$GITHUB_WORKSPACE/toolchains/ios.toolchain.cmake PLATFORM=MAC ARCHS=\"x86_64\"\n          DEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET ENABLE_BITCODE=OFF ENABLE_ARC=OFF ENABLE_VISIBILITY=OFF\n          OpenMP_C_FLAGS=\"-Xclang -fopenmp\" OpenMP_CXX_FLAGS=\"-Xclang -fopenmp\"\n          OpenMP_C_LIB_NAMES=\"libomp\" OpenMP_CXX_LIB_NAMES=\"libomp\"\n          OpenMP_libomp_LIBRARY=\"libomp.a\"\n          Vulkan_LIBRARY=$GITHUB_WORKSPACE/vulkansdk-macos-1.4.335.1/macOS/lib/libMoltenVK.dylib\n          MACOSX_DEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET\n      with:\n        output-dir: wheelhouse\n\n    - name: Build wheels for macos arm64\n      if: matrix.os == 'macos-15' && matrix.arch == 'arm64'\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_MACOS: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build }}\n        CIBW_ENABLE: pypy\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT: CMAKE_BUILD_PARALLEL_LEVEL=4\n          CMAKE_TOOLCHAIN_FILE=$GITHUB_WORKSPACE/toolchains/ios.toolchain.cmake PLATFORM=MAC_ARM64 ARCHS=\"arm64\"\n          DEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET ENABLE_BITCODE=OFF ENABLE_ARC=OFF ENABLE_VISIBILITY=OFF\n          OpenMP_C_FLAGS=\"-Xclang -fopenmp\" OpenMP_CXX_FLAGS=\"-Xclang -fopenmp\"\n          OpenMP_C_LIB_NAMES=\"libomp\" OpenMP_CXX_LIB_NAMES=\"libomp\"\n          OpenMP_libomp_LIBRARY=\"libomp.a\"\n          Vulkan_LIBRARY=$GITHUB_WORKSPACE/vulkansdk-macos-1.4.335.1/macOS/lib/libMoltenVK.dylib\n          MACOSX_DEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET\n      with:\n        output-dir: 
wheelhouse\n\n    - name: Show files\n      run: ls -lh wheelhouse\n      shell: bash\n\n    - name: Verify clean directory\n      run: git diff --exit-code\n      shell: bash\n\n    - name: Upload wheels\n      uses: actions/upload-artifact@v6\n      with:\n        name: wheels-${{ matrix.os }}-${{ matrix.arch }}-${{ matrix.build_id }}\n        path: wheelhouse/*.whl\n\n  build_wheels_qemu_cp:\n    name: ${{ matrix.arch }} ${{ matrix.build_cp }} ${{ matrix.build_sub }}\n    runs-on: ubuntu-24.04\n\n    strategy:\n      fail-fast: false\n      matrix:\n        arch: [riscv64]\n        build_cp: [cp38, cp39, cp310, cp311, cp312, cp313, cp314]\n        build_sub: [manylinux, musllinux]\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: Set up QEMU\n      uses: docker/setup-qemu-action@v3\n      with:\n        platforms: all\n\n    - name: Build wheels with qemu\n      uses: pypa/cibuildwheel@v3.3.1\n      env:\n        CIBW_ARCHS_LINUX: ${{ matrix.arch }}\n        CIBW_BUILD: ${{ matrix.build_cp }}-${{ matrix.build_sub }}*\n        CIBW_BUILD_VERBOSITY: 1\n        CIBW_ENVIRONMENT: CMAKE_BUILD_PARALLEL_LEVEL=4 EXTRA_CMAKE_ARGS=\"-DNCNN_XTHEADVECTOR=OFF\"\n      with:\n        output-dir: wheelhouse\n\n    - name: Show files\n      run: ls -lh wheelhouse\n      shell: bash\n\n    - name: Verify clean directory\n      run: git diff --exit-code\n      shell: bash\n\n    - name: Upload wheels\n      uses: actions/upload-artifact@v6\n      with:\n        name: wheels_qemu_cp-${{ matrix.arch }}-${{ matrix.build_cp }}-${{ matrix.build_sub }}\n        path: wheelhouse/*.whl\n\n  upload_all:\n    permissions:\n      contents: none\n    name: Upload\n    needs: [build_wheels, build_wheels_qemu_cp, build_sdist]\n    runs-on: ubuntu-latest\n\n    steps:\n    - uses: actions/download-artifact@v8\n      with:\n        path: dist\n        merge-multiple: true\n\n    - uses: pypa/gh-action-pypi-publish@release/v1\n      with:\n       
 user: __token__\n        password: ${{ secrets.PYPI_API_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/release.yml",
    "content": "name: release\non:\n  push:\n    tags:\n      - '*'\n\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  IOS_DEPLOYMENT_TARGET: '13.0'\n  MAC_DEPLOYMENT_TARGET: '11.0'\n  MAC_CATALYST_DEPLOYMENT_TARGET: '13.1'\n  WATCHOS_DEPLOYMENT_TARGET: '6.0'\n  TVOS_DEPLOYMENT_TARGET: '11.0'\n  VISIONOS_DEPLOYMENT_TARGET: '1.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\n  EMSCRIPTEN_VERSION: 3.1.28\n\npermissions:\n  contents: read\n\njobs:\n\n  setup:\n    permissions:\n      contents: none\n    runs-on: ubuntu-latest\n    outputs:\n      VERSION: ${{ steps.get_version.outputs.VERSION }}\n    steps:\n    - name: get-version\n      id: get_version\n      run: echo \"VERSION=${GITHUB_REF/refs\\/tags\\//}\" >> $GITHUB_OUTPUT\n\n  full-source:\n    needs: [setup]\n    runs-on: ubuntu-latest\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-full-source\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: package\n      run: |\n        rm -rf .git\n        rm -f /tmp/${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r /tmp/${{ env.PACKAGENAME }}.zip .\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: /tmp/${{ env.PACKAGENAME }}.zip\n\n  ubuntu:\n    needs: [setup]\n    strategy:\n      matrix:\n        opt:\n          - { shared-lib: OFF, os: ubuntu-22.04, id: ubuntu-2204        }\n          - { shared-lib: OFF, os: ubuntu-24.04, id: ubuntu-2404        }\n          - { shared-lib: ON,  os: ubuntu-22.04, id: ubuntu-2204-shared }\n          - { shared-lib: ON,  os: ubuntu-24.04, id: ubuntu-2404-shared }\n    runs-on: ${{ matrix.opt.os }}\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: apt\n      run: |\n        sudo apt-get install -y 
libprotobuf-dev protobuf-compiler\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n            -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TOOLS=ON -DNCNN_BUILD_BENCHMARK=OFF -DNCNN_SHARED_LIB=${{ matrix.opt.shared-lib }} ..\n        cmake --build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: package\n      run: |\n        rm -rf ${{ env.PACKAGENAME }}\n        mkdir -p ${{ env.PACKAGENAME }}\n        cp -a build/install/* ${{ env.PACKAGENAME }}\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip ${{ env.PACKAGENAME }}\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-macos:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-macos-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ 
env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-macos\n        path: openmp-install\n\n  macos:\n    needs: [setup, openmp-macos]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: macos        }\n          - { vulkan: ON,  id: macos-vulkan }\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_TOOLS=OFF \\\n        -DNCNN_BUILD_EXAMPLES=OFF \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n  
  - name: download-openmp-macos\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-macos\n        path: openmp-macos\n    - name: install-openmp\n      run: |\n        sudo cp openmp-macos/include/* $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include\n        sudo cp openmp-macos/lib/libomp.a $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/lib\n    - name: build-x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-macos/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-macos/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln -s A glslang.framework/Versions/Current\n        ln -s Versions/Current/Headers 
glslang.framework/Headers\n        ln -s Versions/Current/Resources glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-x86_64/install/lib/libglslang.a \\\n            build-x86_64/install/lib/libSPIRV.a \\\n            -o build-x86_64/install/lib/libglslang_combined.a\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        lipo -create build-x86_64/install/lib/libglslang_combined.a build-arm64/install/lib/libglslang_combined.a -o glslang.framework/Versions/A/glslang\n        cp -a build-x86_64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create build-x86_64/install/lib/libncnn.a build-arm64/install/lib/libncnn.a -o ncnn.framework/Versions/A/ncnn\n        cp -a build-x86_64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      if: matrix.opt.vulkan == 
'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-ios:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$IOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-ios-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i 
ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=OS64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        cp openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-ios\n        path: openmp-install\n\n  ios:\n    needs: [setup, openmp-ios]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: ios        }\n          - { vulkan: ON,  id: ios-vulkan }\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$IOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        
-DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: download-openmp-ios\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-ios\n        path: openmp-ios\n    - name: install-openmp\n      run: |\n        sudo cp openmp-ios/include/* $DEVELOPER_DIR/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/include\n        sudo cp openmp-ios/lib/libomp.a $DEVELOPER_DIR/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/lib\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=OS64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-ios/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-ios/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln -s A glslang.framework/Versions/Current\n        ln -s Versions/Current/Headers glslang.framework/Headers\n        ln -s Versions/Current/Resources glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        cp build-arm64/install/lib/libglslang_combined.a glslang.framework/Versions/A/glslang\n        cp -a build-arm64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A 
ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        cp build-arm64/install/lib/libncnn.a ncnn.framework/Versions/A/ncnn\n        cp -a build-arm64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-ios-simulator:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$IOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-ios-simulator-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    
- name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR64 -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATORARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-ios-simulator\n        path: openmp-install\n\n  ios-simulator:\n    needs: [setup, openmp-ios-simulator]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: ios-simulator        }\n          - { vulkan: ON,  id: ios-simulator-vulkan }\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$IOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: 
download-openmp-ios-simulator\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-ios-simulator\n        path: openmp-ios-simulator\n    - name: install-openmp\n      run: |\n        sudo cp openmp-ios-simulator/include/* $DEVELOPER_DIR/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/include\n        sudo cp openmp-ios-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/lib\n    - name: build-x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR64 -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATORARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-ios-simulator/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-ios-simulator/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln 
-s A glslang.framework/Versions/Current\n        ln -s Versions/Current/Headers glslang.framework/Headers\n        ln -s Versions/Current/Resources glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-x86_64/install/lib/libglslang.a \\\n            build-x86_64/install/lib/libSPIRV.a \\\n            -o build-x86_64/install/lib/libglslang_combined.a\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        lipo -create \\\n            build-x86_64/install/lib/libglslang_combined.a \\\n            build-arm64/install/lib/libglslang_combined.a \\\n            -o glslang.framework/Versions/A/glslang\n        cp -a build-x86_64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create \\\n            build-x86_64/install/lib/libncnn.a \\\n            build-arm64/install/lib/libncnn.a \\\n            -o ncnn.framework/Versions/A/ncnn\n        cp -a build-x86_64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: 
|\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-mac-catalyst:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_CATALYST_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-mac-catalyst-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION 
}}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST_ARM64 -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-mac-catalyst\n        path: openmp-install\n\n  mac-catalyst:\n    needs: [setup, openmp-mac-catalyst]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: mac-catalyst        }\n          - { vulkan: ON,  id: mac-catalyst-vulkan }\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$MAC_CATALYST_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: 
download-openmp-mac-catalyst\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-mac-catalyst\n        path: openmp-mac-catalyst\n    - name: install-openmp\n      run: |\n        sudo cp openmp-mac-catalyst/include/* $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include\n        sudo cp openmp-mac-catalyst/lib/libomp.a $DEVELOPER_DIR/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/lib\n    - name: build-x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=MAC_CATALYST -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-mac-catalyst/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-mac-catalyst/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln -s A glslang.framework/Versions/Current\n     
   ln -s Versions/Current/Headers glslang.framework/Headers\n        ln -s Versions/Current/Resources glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-x86_64/install/lib/libglslang.a \\\n            build-x86_64/install/lib/libSPIRV.a \\\n            -o build-x86_64/install/lib/libglslang_combined.a\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        lipo -create \\\n            build-x86_64/install/lib/libglslang_combined.a \\\n            build-arm64/install/lib/libglslang_combined.a \\\n            -o glslang.framework/Versions/A/glslang\n        cp -a build-x86_64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create \\\n            build-x86_64/install/lib/libncnn.a \\\n            build-arm64/install/lib/libncnn.a \\\n            -o ncnn.framework/Versions/A/ncnn\n        cp -a build-x86_64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n    
    zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-watchos:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$WATCHOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-watchos-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget 
https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-armv7k\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-armv7k && cd build-armv7k\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"armv7k\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64_32\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64_32 && cd build-arm64_32\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"arm64_32\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64_32/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-armv7k/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64_32/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-watchos\n        path: openmp-install\n\n  watchos:\n    needs: [setup, openmp-watchos]\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-watchos\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$WATCHOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n\n    steps:\n    - uses: actions/checkout@v6\n    - name: download-openmp-watchos\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-watchos\n        path: openmp-watchos\n    - name: install-openmp\n      run: |\n        sudo cp openmp-watchos/include/* 
$DEVELOPER_DIR/Platforms/WatchOS.platform/Developer/SDKs/WatchOS.sdk/usr/include\n        sudo cp openmp-watchos/lib/libomp.a $DEVELOPER_DIR/Platforms/WatchOS.platform/Developer/SDKs/WatchOS.sdk/usr/lib\n    - name: build-armv7k\n      run: |\n        mkdir build-armv7k && cd build-armv7k\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"armv7k\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64_32\n      run: |\n        mkdir build-arm64_32 && cd build-arm64_32\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"arm64_32\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-watchos/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-watchos/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create \\\n            build-armv7k/install/lib/libncnn.a \\\n            build-arm64_32/install/lib/libncnn.a 
\\\n            -o ncnn.framework/Versions/A/ncnn\n        cp -a build-arm64_32/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-watchos-simulator:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$WATCHOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-watchos-simulator-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv 
cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-watchos-simulator\n        path: openmp-install\n\n  watchos-simulator:\n    needs: [setup, openmp-watchos-simulator]\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-watchos-simulator\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$WATCHOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n\n    steps:\n    - uses: actions/checkout@v6\n    - name: download-openmp-watchos-simulator\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-watchos-simulator\n        path: openmp-watchos-simulator\n    - name: install-openmp\n      run: |\n        sudo cp 
openmp-watchos-simulator/include/* $DEVELOPER_DIR/Platforms/WatchSimulator.platform/Developer/SDKs/WatchSimulator.sdk/usr/include\n        sudo cp openmp-watchos-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/WatchSimulator.platform/Developer/SDKs/WatchSimulator.sdk/usr/lib\n    - name: build-x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-watchos-simulator/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-watchos-simulator/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create \\\n 
           build-x86_64/install/lib/libncnn.a \\\n            build-arm64/install/lib/libncnn.a \\\n            -o ncnn.framework/Versions/A/ncnn\n        cp -a build-x86_64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-tvos:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$TVOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-tvos-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar 
-xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64e\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64e && cd build-arm64e\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64e\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64e/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-tvos\n        path: openmp-install\n\n  tvos:\n    needs: [setup, openmp-tvos]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: tvos        }\n          - { vulkan: ON,  id: tvos-vulkan }\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$TVOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: download-openmp-tvos\n      uses: actions/download-artifact@v8\n    
  with:\n        name: openmp-tvos\n        path: openmp-tvos\n    - name: install-openmp\n      run: |\n        sudo cp openmp-tvos/include/* $DEVELOPER_DIR/Platforms/AppleTVOS.platform/Developer/SDKs/AppleTVOS.sdk/usr/include\n        sudo cp openmp-tvos/lib/libomp.a $DEVELOPER_DIR/Platforms/AppleTVOS.platform/Developer/SDKs/AppleTVOS.sdk/usr/lib\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64e\n      run: |\n        mkdir build-arm64e && cd build-arm64e\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64e\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-tvos/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-tvos/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln -s A glslang.framework/Versions/Current\n        ln -s Versions/Current/Headers glslang.framework/Headers\n        ln -s Versions/Current/Resources 
glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        libtool -static \\\n            build-arm64e/install/lib/libglslang.a \\\n            build-arm64e/install/lib/libSPIRV.a \\\n            -o build-arm64e/install/lib/libglslang_combined.a\n        lipo -create \\\n            build-arm64/install/lib/libglslang_combined.a \\\n            build-arm64e/install/lib/libglslang_combined.a \\\n            -o glslang.framework/Versions/A/glslang\n        cp -a build-arm64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create \\\n            build-arm64/install/lib/libncnn.a \\\n            build-arm64e/install/lib/libncnn.a \\\n            -o ncnn.framework/Versions/A/ncnn\n        cp -a build-arm64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      
if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-tvos-simulator:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$TVOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-tvos-simulator-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n 
       patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_TVOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-tvos-simulator\n        path: openmp-install\n\n  tvos-simulator:\n    needs: [setup, openmp-tvos-simulator]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: tvos-simulator        }\n          - { vulkan: ON,  id: tvos-simulator-vulkan }\n    runs-on: 
macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$TVOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: download-openmp-tvos-simulator\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-tvos-simulator\n        path: openmp-tvos-simulator\n    - name: install-openmp\n      run: |\n        sudo cp openmp-tvos-simulator/include/* $DEVELOPER_DIR/Platforms/AppleTVSimulator.platform/Developer/SDKs/AppleTVSimulator.sdk/usr/include\n        sudo cp openmp-tvos-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/AppleTVSimulator.platform/Developer/SDKs/AppleTVSimulator.sdk/usr/lib\n    - name: build-x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_TVOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-tvos-simulator/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-tvos-simulator/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln -s A glslang.framework/Versions/Current\n        ln -s Versions/Current/Headers glslang.framework/Headers\n        ln -s Versions/Current/Resources glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-x86_64/install/lib/libglslang.a \\\n            build-x86_64/install/lib/libSPIRV.a \\\n            -o build-x86_64/install/lib/libglslang_combined.a\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        lipo -create \\\n            build-x86_64/install/lib/libglslang_combined.a \\\n            build-arm64/install/lib/libglslang_combined.a \\\n            -o glslang.framework/Versions/A/glslang\n        cp -a build-x86_64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' -e 
's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create \\\n            build-x86_64/install/lib/libncnn.a \\\n            build-arm64/install/lib/libncnn.a \\\n            -o ncnn.framework/Versions/A/ncnn\n        cp -a build-x86_64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-visionos:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$VISIONOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n    
    -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-visionos-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        cp openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-visionos\n        path: openmp-install\n\n  visionos:\n    needs: [setup, openmp-visionos]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: visionos        }\n          - { vulkan: ON,  id: visionos-vulkan }\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$VISIONOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: download-openmp-visionos\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-visionos\n        path: openmp-visionos\n    - name: install-openmp\n     
 run: |\n        sudo cp openmp-visionos/include/* $DEVELOPER_DIR/Platforms/XROS.platform/Developer/SDKs/XROS.sdk/usr/include\n        sudo cp openmp-visionos/lib/libomp.a $DEVELOPER_DIR/Platforms/XROS.platform/Developer/SDKs/XROS.sdk/usr/lib\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-visionos/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-visionos/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln -s A glslang.framework/Versions/Current\n        ln -s Versions/Current/Headers glslang.framework/Headers\n        ln -s Versions/Current/Resources glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        cp build-arm64/install/lib/libglslang_combined.a 
glslang.framework/Versions/A/glslang\n        cp -a build-arm64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        cp build-arm64/install/lib/libncnn.a ncnn.framework/Versions/A/ncnn\n        cp -a build-arm64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  openmp-visionos-simulator:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$VISIONOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n     
   -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n    steps:\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-visionos-simulator-release-18.1.2-20251004\n    - name: checkout\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: build-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-x86_64 && cd build-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: build-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        rm -rf $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/include $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/lib/libomp.a\n    - name: upload\n      uses: actions/upload-artifact@v6\n      with:\n        name: openmp-visionos-simulator\n        path: openmp-install\n\n  visionos-simulator:\n    needs: [setup, openmp-visionos-simulator]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, id: visionos-simulator        }\n          - { vulkan: ON,  id: visionos-simulator-vulkan }\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$VISIONOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        
-DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: download-openmp-visionos-simulator\n      uses: actions/download-artifact@v8\n      with:\n        name: openmp-visionos-simulator\n        path: openmp-visionos-simulator\n    - name: install-openmp\n      run: |\n        sudo cp openmp-visionos-simulator/include/* $DEVELOPER_DIR/Platforms/XRSimulator.platform/Developer/SDKs/XRSimulator.sdk/usr/include\n        sudo cp openmp-visionos-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/XRSimulator.platform/Developer/SDKs/XRSimulator.sdk/usr/lib\n    - name: build-x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install/strip\n    - name: build-arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install/strip\n    - name: package-openmp\n      run: |\n        rm -rf openmp.framework\n        mkdir -p openmp.framework/Versions/A/Headers\n        mkdir -p openmp.framework/Versions/A/Resources\n        ln -s A openmp.framework/Versions/Current\n        ln -s Versions/Current/Headers openmp.framework/Headers\n        ln -s Versions/Current/Resources openmp.framework/Resources\n        ln -s Versions/Current/openmp openmp.framework/openmp\n        cp openmp-visionos-simulator/lib/libomp.a openmp.framework/Versions/A/openmp\n        cp -a openmp-visionos-simulator/include/* openmp.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/18.1/g' Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n    - name: package-glslang\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -rf glslang.framework\n        mkdir -p glslang.framework/Versions/A/Headers\n        mkdir -p glslang.framework/Versions/A/Resources\n        ln -s A glslang.framework/Versions/Current\n        ln -s Versions/Current/Headers glslang.framework/Headers\n        ln -s Versions/Current/Resources glslang.framework/Resources\n        ln -s Versions/Current/glslang glslang.framework/glslang\n        libtool -static \\\n            build-x86_64/install/lib/libglslang.a \\\n            build-x86_64/install/lib/libSPIRV.a \\\n            -o build-x86_64/install/lib/libglslang_combined.a\n        libtool -static \\\n            build-arm64/install/lib/libglslang.a \\\n            build-arm64/install/lib/libSPIRV.a \\\n            -o build-arm64/install/lib/libglslang_combined.a\n        lipo -create \\\n            build-x86_64/install/lib/libglslang_combined.a \\\n            build-arm64/install/lib/libglslang_combined.a \\\n            -o glslang.framework/Versions/A/glslang\n        cp -a build-x86_64/install/include/glslang glslang.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/glslang/g' 
-e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n    - name: package-ncnn\n      run: |\n        rm -rf ncnn.framework\n        mkdir -p ncnn.framework/Versions/A/Headers\n        mkdir -p ncnn.framework/Versions/A/Resources\n        ln -s A ncnn.framework/Versions/Current\n        ln -s Versions/Current/Headers ncnn.framework/Headers\n        ln -s Versions/Current/Resources ncnn.framework/Resources\n        ln -s Versions/Current/ncnn ncnn.framework/ncnn\n        lipo -create \\\n            build-x86_64/install/lib/libncnn.a \\\n            build-arm64/install/lib/libncnn.a \\\n            -o ncnn.framework/Versions/A/ncnn\n        cp -a build-x86_64/install/include/* ncnn.framework/Versions/A/Headers/\n        sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n    - name: package\n      if: matrix.opt.vulkan == 'OFF'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework ncnn.framework\n    - name: package\n      if: matrix.opt.vulkan == 'ON'\n      run: |\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.framework glslang.framework ncnn.framework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  android:\n    needs: [setup]\n    strategy:\n      matrix:\n        opt:\n          - { vulkan: OFF, shared-lib: OFF, id: android               }\n          - { vulkan: OFF, shared-lib: ON,  id: android-shared        }\n          - { vulkan: ON,  shared-lib: OFF, id: android-vulkan        }\n          - { vulkan: ON,  shared-lib: ON,  id: android-vulkan-shared }\n    runs-on: ubuntu-latest\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-${{ 
matrix.opt.id }}\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK_LATEST_HOME/build/cmake/android.toolchain.cmake \\\n        -DANDROID_PLATFORM=android-21 \\\n        -DANDROID_USE_LEGACY_TOOLCHAIN_FILE=False \\\n        -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n        -DNCNN_BUILD_BENCHMARK=OFF \\\n        -DNCNN_VULKAN=${{ matrix.opt.vulkan }} \\\n        -DNCNN_SHARED_LIB=${{ matrix.opt.shared-lib }} \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: ndk-fix-debug\n      run: sed -i -e '/^  -g$/d' $ANDROID_NDK_LATEST_HOME/build/cmake/android-legacy.toolchain.cmake\n    - name: build-armeabi-v7a\n      run: |\n        mkdir build-armeabi-v7a && cd build-armeabi-v7a\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON ..\n        cmake --build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: build-arm64-v8a\n      run: |\n        mkdir build-arm64-v8a && cd build-arm64-v8a\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }}-DANDROID_ABI=\"arm64-v8a\" ..\n        cmake --build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: build-x86\n      run: |\n        mkdir build-x86 && cd build-x86\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }}-DANDROID_ABI=\"x86\" ..\n        cmake --build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: build-x86_64\n      run: |\n        mkdir build-x86_64 && cd build-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }}-DANDROID_ABI=\"x86_64\" ..\n        cmake --build . -j $(nproc)\n        cmake --build . 
--target install/strip\n    - name: build-riscv64\n      run: |\n        mkdir build-riscv64 && cd build-riscv64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }}-DANDROID_ABI=\"riscv64\" ..\n        cmake --build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: package\n      run: |\n        rm -rf ${{ env.PACKAGENAME }}\n        mkdir -p ${{ env.PACKAGENAME }}\n        cp -a build-armeabi-v7a/install ${{ env.PACKAGENAME }}/armeabi-v7a\n        cp -a build-arm64-v8a/install ${{ env.PACKAGENAME }}/arm64-v8a\n        cp -a build-x86/install ${{ env.PACKAGENAME }}/x86\n        cp -a build-x86_64/install ${{ env.PACKAGENAME }}/x86_64\n        cp -a build-riscv64/install ${{ env.PACKAGENAME }}/riscv64\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip ${{ env.PACKAGENAME }}\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  webassembly:\n    needs: [setup]\n    runs-on: ubuntu-latest\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-webassembly\n    steps:\n    - uses: actions/checkout@v6\n    - name: emsdk\n      run: |\n        git clone https://github.com/emscripten-core/emsdk.git\n        cd emsdk\n        ./emsdk install $EMSCRIPTEN_VERSION\n        ./emsdk activate $EMSCRIPTEN_VERSION\n    - name: build\n      run: |\n        source emsdk/emsdk_env.sh\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n            -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n            -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\n        cmake 
--build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: build-simd\n      run: |\n        source emsdk/emsdk_env.sh\n        mkdir build-simd && cd build-simd\n        cmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n            -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n            -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\n        cmake --build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: build-threads\n      run: |\n        source emsdk/emsdk_env.sh\n        mkdir build-threads && cd build-threads\n        cmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n            -DNCNN_THREADS=ON -DNCNN_OPENMP=ON -DNCNN_SIMPLEOMP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n            -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\n        cmake --build . -j $(nproc)\n        cmake --build . 
--target install/strip\n    - name: build-simd-threads\n      run: |\n        source emsdk/emsdk_env.sh\n        mkdir build-simd-threads && cd build-simd-threads\n        cmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" \\\n            -DNCNN_THREADS=ON -DNCNN_OPENMP=ON -DNCNN_SIMPLEOMP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n            -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\n        cmake --build . -j $(nproc)\n        cmake --build . --target install/strip\n    - name: package\n      run: |\n        rm -rf ${{ env.PACKAGENAME }}\n        mkdir -p ${{ env.PACKAGENAME }}\n        cp -a build/install ${{ env.PACKAGENAME }}/basic\n        cp -a build-simd/install ${{ env.PACKAGENAME }}/simd\n        cp -a build-threads/install ${{ env.PACKAGENAME }}/threads\n        cp -a build-simd-threads/install ${{ env.PACKAGENAME }}/simd-threads\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip ${{ env.PACKAGENAME }}\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  windows:\n    needs: [setup]\n    strategy:\n      matrix:\n        opt:\n          - { shared-lib: OFF, os: windows-2022, toolset-version: v140, windows-sdk-version: 22621, id: vs2015 }\n          - { shared-lib: OFF, os: windows-2022, toolset-version: v141, windows-sdk-version: 22621, id: vs2017 }\n          - { shared-lib: OFF, os: windows-2022, toolset-version: v142, windows-sdk-version: 22621, id: vs2019 }\n          - { shared-lib: OFF, os: windows-2022, toolset-version: v143, windows-sdk-version: 26100, id: vs2022 }\n          - { shared-lib: ON,  os: windows-2022, toolset-version: v140, 
windows-sdk-version: 22621, id: vs2015-shared }\n          - { shared-lib: ON,  os: windows-2022, toolset-version: v141, windows-sdk-version: 22621, id: vs2017-shared }\n          - { shared-lib: ON,  os: windows-2022, toolset-version: v142, windows-sdk-version: 22621, id: vs2019-shared }\n          - { shared-lib: ON,  os: windows-2022, toolset-version: v143, windows-sdk-version: 26100, id: vs2022-shared }\n    runs-on: ${{ matrix.opt.os }}\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-windows-${{ matrix.opt.id }}\n      UseMultiToolTask: true\n      NCNN_CMAKE_OPTIONS: |\n        -T ${{ matrix.opt.toolset-version }},host=x64 `\n        -DCMAKE_BUILD_TYPE=Release `\n        -DCMAKE_INSTALL_PREFIX=install `\n        -DNCNN_VERSION_STRING=\"${{ needs.setup.outputs.VERSION }}\" `\n        -DNCNN_BUILD_EXAMPLES=OFF `\n        -DNCNN_BUILD_TOOLS=ON `\n        -DNCNN_BUILD_BENCHMARK=OFF `\n        -DNCNN_VULKAN=ON `\n        -DNCNN_SHARED_LIB=${{ matrix.opt.shared-lib }} `\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: Install VS 2017 (v141) Build Tools\n      if: matrix.opt.toolset-version == 'v141'\n      run: |\n        $vsInstallPath = & \"${env:ProgramFiles(x86)}\\Microsoft Visual Studio\\Installer\\vswhere.exe\" -latest -property installationPath\n        Start-Process -FilePath \"${env:ProgramFiles(x86)}\\Microsoft Visual Studio\\Installer\\vs_installer.exe\" -ArgumentList \"modify --installPath `\"$vsInstallPath`\" --add Microsoft.VisualStudio.Component.VC.v141.x86.x64 --quiet --norestart --nocache\" -Wait\n    - name: Install and Setup VS 2015 (v140) Build Tools\n      if: matrix.opt.toolset-version == 'v140'\n      run: |\n        $vs140Path = \"C:/vs140_build_tools\"\n        Invoke-WebRequest -Uri \"https://aka.ms/vs/15/release/vs_buildtools.exe\" -OutFile vs_buildtools.exe\n        Start-Process -FilePath \"vs_buildtools.exe\" -ArgumentList \"--installPath `\"$vs140Path`\" --add 
Microsoft.VisualStudio.Workload.VCTools --add Microsoft.VisualStudio.Component.VC.140 --quiet --wait --norestart --nocache\" -Wait\n\n        $vcvarsPath = (Get-ChildItem -Path $vs140Path -Filter \"vcvars64.bat\" -Recurse | Select-Object -First 1).FullName\n        $cmd = \"`\"$vcvarsPath`\" && powershell -Command `\"`$env:PATH;`$env:INCLUDE;`$env:LIB`\"\"\n        $output = cmd.exe /c $cmd\n        $lines = $output -split \"`r`n\"\n\n        echo \"PATH=$($lines[0]);$($env:PATH)\" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append\n        echo \"INCLUDE=$($lines[1])\" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append\n        echo \"LIB=$($lines[2])\" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append\n\n    - uses: GuillaumeFalourd/setup-windows10-sdk-action@v2.4\n      with:\n        sdk-version: ${{ matrix.opt.windows-sdk-version }}\n\n    - name: cache-protobuf\n      id: cache-protobuf\n      uses: actions/cache@v5\n      with:\n        path: \"protobuf-install\"\n        key: protobuf-${{ matrix.opt.toolset-version }}-x86-x64-install\n    - name: protobuf\n      if: steps.cache-protobuf.outputs.cache-hit != 'true'\n      run: |\n        Invoke-WebRequest -Uri https://github.com/protocolbuffers/protobuf/archive/v3.11.2.zip -OutFile protobuf-3.11.2.zip\n        7z x ./protobuf-3.11.2.zip\n        cd protobuf-3.11.2\n        mkdir build-x86; cd build-x86;\n        cmake -T ${{ matrix.opt.toolset-version }},host=x64 -A Win32,version=10.0.${{ matrix.opt.windows-sdk-version }}.0 -DCMAKE_INSTALL_PREFIX=\"$env:GITHUB_WORKSPACE\\protobuf-install\\x86\" -Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_MSVC_STATIC_RUNTIME=OFF ../cmake\n        cmake --build . --config Release -j 4\n        cmake --build . 
--config Release --target install\n        cd ..\n        mkdir build-x64; cd build-x64;\n        cmake -T ${{ matrix.opt.toolset-version }},host=x64 -A x64,version=10.0.${{ matrix.opt.windows-sdk-version }}.0 -DCMAKE_INSTALL_PREFIX=\"$env:GITHUB_WORKSPACE\\protobuf-install\\x64\" -Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_MSVC_STATIC_RUNTIME=OFF ../cmake\n        cmake --build . --config Release -j 4\n        cmake --build . --config Release --target install\n    - name: build-x86\n      run: |\n        mkdir build-x86; cd build-x86\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -A Win32,version=10.0.${{ matrix.opt.windows-sdk-version }}.0 -Dprotobuf_DIR=\"$env:GITHUB_WORKSPACE\\protobuf-install\\x86\\cmake\" ..\n        cmake --build . --config Release -j 4\n        cmake --build . --config Release --target install\n    - name: build-x64\n      run: |\n        mkdir build-x64; cd build-x64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -A x64,version=10.0.${{ matrix.opt.windows-sdk-version }}.0 -Dprotobuf_DIR=\"$env:GITHUB_WORKSPACE\\protobuf-install\\x64\\cmake\" ..\n        cmake --build . --config Release -j 4\n        cmake --build . --config Release --target install\n    - name: build-arm64\n      if: matrix.opt.toolset-version == 'v142' || matrix.opt.toolset-version == 'v143'\n      run: |\n        mkdir build-arm64; cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -A arm64,version=10.0.${{ matrix.opt.windows-sdk-version }}.0 ..\n        cmake --build . --config Release -j 4\n        cmake --build . 
--config Release --target install\n    - name: package\n      if: matrix.opt.toolset-version == 'v140' || matrix.opt.toolset-version == 'v141'\n      run: |\n        mkdir ${{ env.PACKAGENAME }}\n        mkdir ${{ env.PACKAGENAME }}/x86\n        mkdir ${{ env.PACKAGENAME }}/x64\n        Copy-Item -Verbose -Recurse -Path \"build-x86\\install\\*\" -Destination \"${{ env.PACKAGENAME }}\\x86\"\n        Copy-Item -Verbose -Recurse -Path \"build-x64\\install\\*\" -Destination \"${{ env.PACKAGENAME }}\\x64\"\n        7z a -r ${{ env.PACKAGENAME }}.zip ${{ env.PACKAGENAME }}\n    - name: package\n      if: matrix.opt.toolset-version == 'v142' || matrix.opt.toolset-version == 'v143'\n      run: |\n        mkdir ${{ env.PACKAGENAME }}\n        mkdir ${{ env.PACKAGENAME }}/x86\n        mkdir ${{ env.PACKAGENAME }}/x64\n        mkdir ${{ env.PACKAGENAME }}/arm64\n        Copy-Item -Verbose -Recurse -Path \"build-x86\\install\\*\" -Destination \"${{ env.PACKAGENAME }}\\x86\"\n        Copy-Item -Verbose -Recurse -Path \"build-x64\\install\\*\" -Destination \"${{ env.PACKAGENAME }}\\x64\"\n        Copy-Item -Verbose -Recurse -Path \"build-arm64\\install\\*\" -Destination \"${{ env.PACKAGENAME }}\\arm64\"\n        7z a -r ${{ env.PACKAGENAME }}.zip ${{ env.PACKAGENAME }}\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n\n  apple:\n    needs: [setup, macos, ios, ios-simulator, mac-catalyst, watchos, watchos-simulator, tvos, tvos-simulator, visionos, visionos-simulator]\n    runs-on: macos-15-intel\n    env:\n      PACKAGENAME: ncnn-${{ needs.setup.outputs.VERSION }}-apple\n    steps:\n    - run: sudo xcode-select --switch /Applications/Xcode_16.4.0.app\n    - name: download\n      uses: actions/download-artifact@v8\n      with:\n        path: artifacts\n\n    - name: unzip\n      run: |\n        mkdir -p ncnn-ios\n        mkdir -p ncnn-ios-vulkan\n        mkdir -p 
ncnn-ios-simulator\n        mkdir -p ncnn-ios-simulator-vulkan\n        mkdir -p ncnn-mac-catalyst\n        mkdir -p ncnn-mac-catalyst-vulkan\n        mkdir -p ncnn-macos\n        mkdir -p ncnn-macos-vulkan\n        mkdir -p ncnn-tvos\n        mkdir -p ncnn-tvos-vulkan\n        mkdir -p ncnn-tvos-simulator\n        mkdir -p ncnn-tvos-simulator-vulkan\n        mkdir -p ncnn-visionos\n        mkdir -p ncnn-visionos-vulkan\n        mkdir -p ncnn-visionos-simulator\n        mkdir -p ncnn-visionos-simulator-vulkan\n        mkdir -p ncnn-watchos\n        mkdir -p ncnn-watchos-simulator\n\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-ios/ncnn-${{ needs.setup.outputs.VERSION }}-ios.zip -d ncnn-ios\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-ios-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-ios-vulkan.zip -d ncnn-ios-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-ios-simulator/ncnn-${{ needs.setup.outputs.VERSION }}-ios-simulator.zip -d ncnn-ios-simulator\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-ios-simulator-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-ios-simulator-vulkan.zip -d ncnn-ios-simulator-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-mac-catalyst/ncnn-${{ needs.setup.outputs.VERSION }}-mac-catalyst.zip -d ncnn-mac-catalyst\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-mac-catalyst-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-mac-catalyst-vulkan.zip -d ncnn-mac-catalyst-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-macos/ncnn-${{ needs.setup.outputs.VERSION }}-macos.zip -d ncnn-macos\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-macos-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-macos-vulkan.zip -d ncnn-macos-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-tvos/ncnn-${{ needs.setup.outputs.VERSION }}-tvos.zip -d ncnn-tvos\n        
unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-tvos-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-tvos-vulkan.zip -d ncnn-tvos-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-tvos-simulator/ncnn-${{ needs.setup.outputs.VERSION }}-tvos-simulator.zip -d ncnn-tvos-simulator\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-tvos-simulator-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-tvos-simulator-vulkan.zip -d ncnn-tvos-simulator-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-visionos/ncnn-${{ needs.setup.outputs.VERSION }}-visionos.zip -d ncnn-visionos\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-visionos-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-visionos-vulkan.zip -d ncnn-visionos-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-visionos-simulator/ncnn-${{ needs.setup.outputs.VERSION }}-visionos-simulator.zip -d ncnn-visionos-simulator\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-visionos-simulator-vulkan/ncnn-${{ needs.setup.outputs.VERSION }}-visionos-simulator-vulkan.zip -d ncnn-visionos-simulator-vulkan\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-watchos/ncnn-${{ needs.setup.outputs.VERSION }}-watchos.zip -d ncnn-watchos\n        unzip -q artifacts/ncnn-${{ needs.setup.outputs.VERSION }}-watchos-simulator/ncnn-${{ needs.setup.outputs.VERSION }}-watchos-simulator.zip -d ncnn-watchos-simulator\n\n    - name: create-xcframwork\n      run: |\n        rm -rf openmp.xcframework\n        xcodebuild -create-xcframework \\\n            -framework ncnn-macos/openmp.framework \\\n            -framework ncnn-ios/openmp.framework \\\n            -framework ncnn-ios-simulator/openmp.framework \\\n            -framework ncnn-mac-catalyst/openmp.framework \\\n            -framework ncnn-watchos/openmp.framework \\\n            -framework ncnn-watchos-simulator/openmp.framework \\\n           
 -framework ncnn-tvos/openmp.framework \\\n            -framework ncnn-tvos-simulator/openmp.framework \\\n            -framework ncnn-visionos/openmp.framework \\\n            -framework ncnn-visionos-simulator/openmp.framework \\\n            -output openmp.xcframework\n\n        rm -rf ncnn.xcframework\n        xcodebuild -create-xcframework \\\n            -framework ncnn-macos/ncnn.framework \\\n            -framework ncnn-ios/ncnn.framework \\\n            -framework ncnn-ios-simulator/ncnn.framework \\\n            -framework ncnn-mac-catalyst/ncnn.framework \\\n            -framework ncnn-watchos/ncnn.framework \\\n            -framework ncnn-watchos-simulator/ncnn.framework \\\n            -framework ncnn-tvos/ncnn.framework \\\n            -framework ncnn-tvos-simulator/ncnn.framework \\\n            -framework ncnn-visionos/ncnn.framework \\\n            -framework ncnn-visionos-simulator/ncnn.framework \\\n            -output ncnn.xcframework\n\n        rm -f ${{ env.PACKAGENAME }}.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}.zip openmp.xcframework ncnn.xcframework\n    - name: create-xcframwork-vulkan\n      run: |\n        rm -rf openmp.xcframework\n        xcodebuild -create-xcframework \\\n            -framework ncnn-macos-vulkan/openmp.framework \\\n            -framework ncnn-ios-vulkan/openmp.framework \\\n            -framework ncnn-ios-simulator-vulkan/openmp.framework \\\n            -framework ncnn-mac-catalyst-vulkan/openmp.framework \\\n            -framework ncnn-watchos/openmp.framework \\\n            -framework ncnn-watchos-simulator/openmp.framework \\\n            -framework ncnn-tvos-vulkan/openmp.framework \\\n            -framework ncnn-tvos-simulator-vulkan/openmp.framework \\\n            -framework ncnn-visionos/openmp.framework \\\n            -framework ncnn-visionos-simulator/openmp.framework \\\n            -output openmp.xcframework\n\n        rm -rf glslang.xcframework\n        xcodebuild -create-xcframework \\\n       
     -framework ncnn-macos-vulkan/glslang.framework \\\n            -framework ncnn-ios-vulkan/glslang.framework \\\n            -framework ncnn-ios-simulator-vulkan/glslang.framework \\\n            -framework ncnn-mac-catalyst-vulkan/glslang.framework \\\n            -framework ncnn-tvos-vulkan/glslang.framework \\\n            -framework ncnn-tvos-simulator-vulkan/glslang.framework \\\n            -framework ncnn-visionos-vulkan/glslang.framework \\\n            -framework ncnn-visionos-simulator-vulkan/glslang.framework \\\n            -output glslang.xcframework\n\n        rm -rf ncnn.xcframework\n        xcodebuild -create-xcframework \\\n            -framework ncnn-macos-vulkan/ncnn.framework \\\n            -framework ncnn-ios-vulkan/ncnn.framework \\\n            -framework ncnn-ios-simulator-vulkan/ncnn.framework \\\n            -framework ncnn-mac-catalyst-vulkan/ncnn.framework \\\n            -framework ncnn-watchos/ncnn.framework \\\n            -framework ncnn-watchos-simulator/ncnn.framework \\\n            -framework ncnn-tvos-vulkan/ncnn.framework \\\n            -framework ncnn-tvos-simulator-vulkan/ncnn.framework \\\n            -framework ncnn-visionos-vulkan/ncnn.framework \\\n            -framework ncnn-visionos-simulator-vulkan/ncnn.framework \\\n            -output ncnn.xcframework\n\n        rm -f ${{ env.PACKAGENAME }}-vulkan.zip\n        zip -9 -y -r ${{ env.PACKAGENAME }}-vulkan.zip openmp.xcframework glslang.xcframework ncnn.xcframework\n    - name: upload-zip\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}\n        path: ${{ env.PACKAGENAME }}.zip\n    - name: upload-zip-vulkan\n      uses: actions/upload-artifact@v6\n      with:\n        name: ${{ env.PACKAGENAME }}-vulkan\n        path: ${{ env.PACKAGENAME }}-vulkan.zip\n\n  release:\n    permissions:\n      contents: write  # for softprops/action-gh-release to create a release\n    needs: [setup, full-source, ubuntu, macos, ios, 
ios-simulator, mac-catalyst, watchos, watchos-simulator, tvos, tvos-simulator, android, webassembly, windows, apple]\n    runs-on: ubuntu-latest\n    steps:\n    - name: download\n      uses: actions/download-artifact@v8\n      with:\n        path: artifacts\n\n    - name: create-release\n      uses: softprops/action-gh-release@v2\n      with:\n        token: ${{ secrets.GITHUB_TOKEN }}\n        tag_name: ${{ needs.setup.outputs.VERSION }}\n        name: Release ${{ needs.setup.outputs.VERSION }}\n        files: artifacts/*/*.zip\n"
  },
  {
    "path": ".github/workflows/sync-wiki.yml",
    "content": "name: sync-wiki\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/sync-wiki.yml'\n    - 'docs/**'\nconcurrency:\n  group: sync-wiki-${{ github.ref }}\n  cancel-in-progress: true\n\npermissions:\n  contents: read\n\njobs:\n  sync-wiki:\n    permissions:\n      contents: write  # for Git to git push\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: sync\n      run: |\n        cp -r docs $GITHUB_WORKSPACE/ncnn.wiki\n        cd $GITHUB_WORKSPACE/ncnn.wiki\n        git config --global user.name \"wiki-sync-bot\"\n        git config --global user.email \"wiki-sync-bot@qq.com\"\n        git init\n        git add .\n        git commit -m \"sync\"\n        git remote add upstream https://${{ secrets.WIKI_SYNC_BOT_TOKEN }}@github.com/Tencent/ncnn.wiki.git\n        git push upstream master -f\n"
  },
  {
    "path": ".github/workflows/test-coverage.yml",
    "content": "name: test-coverage\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/test-coverage.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/**'\n    - 'tests/**'\n    - 'toolchains/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/test-coverage.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/**'\n    - 'tests/**'\n    - 'toolchains/**'\n    - 'glslang'\nconcurrency:\n  group: test-coverage-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  linux-gcc-gpu-t4:\n    runs-on: [self-hosted, linux, t4]\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: build\n      env:\n        CC: gcc\n        CXX: g++\n        LD_LIBRARY_PATH: /data/action/install/lib64\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_VULKAN=ON -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX2=ON -DNCNN_XOP=OFF -DNCNN_AVXVNNI=OFF -DNCNN_AVXNECONVERT=OFF -DNCNN_AVX512=OFF -DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j 4\n    - name: test\n      env:\n        LD_LIBRARY_PATH: /data/action/install/lib64\n      run: cd build && ctest --output-on-failure -j 4\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov -d ./src -c -o lcov.info\n        lcov -r lcov.info '/usr/*' -o lcov.info\n        lcov -r lcov.info '*/install/*' -o lcov.info\n        lcov -r lcov.info '*/build/*' -o lcov.info\n        lcov --list lcov.info\n\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: /data/action/.local/bin/codecov\n        files: build/lcov.info\n\n  linux-gcc-x64:\n    name: x64-${{ matrix.name }}\n    runs-on: [self-hosted, linux, ubuntu25]\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n          - { name: 'none',       SSE2: OFF, AVX: OFF, F16C: OFF, FMA: OFF, AVX2: OFF, AVX512: OFF, AVX512VNNI: OFF }\n          - { name: 'sse2',       SSE2: ON,  AVX: OFF, F16C: OFF, FMA: OFF, AVX2: OFF, AVX512: OFF, AVX512VNNI: OFF }\n          - { name: 'avx',        SSE2: ON,  AVX: ON,  F16C: OFF, FMA: OFF, AVX2: OFF, AVX512: OFF, AVX512VNNI: OFF }\n          - { name: 'avx2',       SSE2: ON,  AVX: ON,  F16C: ON,  FMA: ON,  AVX2: ON,  AVX512: OFF, AVX512VNNI: OFF }\n          - { name: 'avx512',     SSE2: ON,  AVX: ON,  F16C: ON,  FMA: ON,  AVX2: ON,  AVX512: ON,  AVX512VNNI: OFF }\n          - { name: 'avx512vnni', SSE2: ON,  AVX: ON,  F16C: ON,  FMA: ON,  AVX2: ON,  AVX512: ON,  AVX512VNNI: ON  }\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_SSE2=${{ matrix.SSE2 }} \\\n            -DNCNN_AVX=${{ matrix.AVX }} \\\n            -DNCNN_F16C=${{ matrix.F16C }} \\\n            -DNCNN_FMA=${{ matrix.FMA }} \\\n            -DNCNN_AVX2=${{ matrix.AVX2 }} \\\n       
     -DNCNN_AVX512=${{ matrix.AVX512 }} \\\n            -DNCNN_AVX512VNNI=${{ matrix.AVX512VNNI }} \\\n            -DNCNN_XOP=OFF \\\n            -DNCNN_AVXVNNI=OFF \\\n            -DNCNN_AVX512BF16=OFF \\\n            -DNCNN_AVX512FP16=OFF \\\n            -DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n    - name: test\n      run: |\n        cd build\n        ctest --output-on-failure -j 8\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n    - name: build-openmp\n      run: |\n        mkdir build-openmp && cd build-openmp\n        cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_SSE2=${{ matrix.SSE2 }} \\\n            -DNCNN_AVX=${{ matrix.AVX }} \\\n            -DNCNN_F16C=${{ matrix.F16C }} \\\n            -DNCNN_FMA=${{ matrix.FMA }} \\\n            -DNCNN_AVX2=${{ matrix.AVX2 }} \\\n            -DNCNN_AVX512=${{ matrix.AVX512 }} \\\n            -DNCNN_AVX512VNNI=${{ matrix.AVX512VNNI }} \\\n            -DNCNN_XOP=OFF \\\n            -DNCNN_AVXVNNI=OFF \\\n            -DNCNN_AVX512BF16=OFF \\\n            -DNCNN_AVX512FP16=OFF \\\n            -DNCNN_OPENMP=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j 8\n    - name: test\n      run: |\n        export OMP_THREAD_LIMIT=1\n        export OMP_NUM_THREADS=1\n        cd build-openmp\n        ctest --output-on-failure -j 8\n    - name: lcov-collect\n      run: |\n        cd build-openmp\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build-openmp/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: /data/action/osd/codecov\n        files: build/lcov.info,build-openmp/lcov.info\n\n  linux-gcc-x64-simplestl-simplemath:\n    name: simplestl-simplemath\n    runs-on: [self-hosted, linux, ubuntu25]\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host-c.gcc.toolchain.cmake \\\n            -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_SIMPLESTL=ON -DNCNN_SIMPLEMATH=ON \\\n            -DNCNN_OPENMP=OFF -DNCNN_THREADS=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j 8\n    - name: test\n      run: |\n        cd build\n        ctest --output-on-failure -j 8\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: /data/action/osd/codecov\n        files: build/lcov.info\n\n  linux-gcc-x64-sde:\n    name: sde-${{ matrix.cpu }}\n    runs-on: [self-hosted, linux, ubuntu25]\n    env:\n      SDE_PATH: /data/action/osd/sde-external-9.33.0-2024-01-07-lin\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n          - { cpu: hsw, AVX2: ON, AVXVNNI: OFF, AVXVNNIINT8: OFF, AVXNECONVERT: OFF, AVX512: OFF, AVX512VNNI: OFF, AVX512BF16: OFF, AVX512FP16: OFF }\n          - { cpu: adl, AVX2: ON, AVXVNNI: ON,  AVXVNNIINT8: OFF, AVXNECONVERT: OFF, AVX512: OFF, AVX512VNNI: OFF, AVX512BF16: OFF, AVX512FP16: OFF }\n          - { cpu: arl, AVX2: ON, AVXVNNI: ON,  AVXVNNIINT8: ON,  AVXNECONVERT: ON,  AVX512: OFF, AVX512VNNI: OFF, AVX512BF16: OFF, AVX512FP16: OFF }\n          - { cpu: spr, AVX2: ON, AVXVNNI: OFF, AVXVNNIINT8: OFF, AVXNECONVERT: OFF, AVX512: ON,  AVX512VNNI: ON,  AVX512BF16: ON,  AVX512FP16: ON  }\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_AVX=ON \\\n            -DNCNN_F16C=ON \\\n            -DNCNN_XOP=OFF \\\n            -DNCNN_AVX2=${{ matrix.AVX2 }} \\\n            -DNCNN_AVXVNNI=${{ matrix.AVXVNNI }} \\\n            -DNCNN_AVXVNNIINT8=${{ matrix.AVXVNNIINT8 }} \\\n        
    -DNCNN_AVXNECONVERT=${{ matrix.AVXNECONVERT }} \\\n            -DNCNN_AVX512=${{ matrix.AVX512 }} \\\n            -DNCNN_AVX512VNNI=${{ matrix.AVX512VNNI }} \\\n            -DNCNN_AVX512BF16=${{ matrix.AVX512BF16 }} \\\n            -DNCNN_AVX512FP16=${{ matrix.AVX512FP16 }} \\\n            -DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-${{ matrix.cpu }};--\" ctest --output-on-failure -j 8\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: /data/action/osd/codecov\n        files: build/lcov.info\n\n  linux-gcc-x64-sde-combined:\n    name: sde-combined\n    runs-on: [self-hosted, linux, ubuntu25]\n    env:\n      SDE_PATH: /data/action/osd/sde-external-9.33.0-2024-01-07-lin\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j 8\n    - name: test-p4p\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-p4p;--\" ctest --output-on-failure -j 8\n    - name: test-snb\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-snb;--\" ctest --output-on-failure -j 8\n    - name: test-hsw\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-hsw;--\" ctest --output-on-failure -j 8\n    - name: test-adl\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-adl;--\" ctest --output-on-failure -j 8\n    - name: test-arl\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-arl;--\" ctest --output-on-failure -j 8\n    - name: test-skx\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-skx;--\" ctest --output-on-failure -j 8\n    - name: test-spr\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-spr;--\" ctest --output-on-failure -j 8\n    - name: test-gnr\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-gnr;--\" ctest --output-on-failure -j 8\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: 
/data/action/osd/codecov\n        files: build/lcov.info\n\n  linux-gcc-riscv64-rvv:\n    strategy:\n      matrix:\n        openmp: [ON, OFF]\n    runs-on: [self-hosted, linux, ubuntu]\n    steps:\n    - uses: actions/checkout@v6\n    - name: build\n      run: |\n        export RISCV_ROOT_PATH=/data/action/osd/riscv\n        mkdir build\n        cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv64-unknown-linux-gnu.toolchain.cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_RVV=ON -DNCNN_ZFH=ON -DNCNN_ZVFH=ON -DNCNN_OPENMP=${{ matrix.openmp }} -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n\n    - name: test-vlen256\n      run: |\n        export PATH=/data/action/osd/qemu-install/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;rv64,v=true,zfh=true,zvfh=true,vlen=256,elen=64,vext_spec=v1.0;-L;/data/action/osd/riscv/sysroot\" ctest --output-on-failure -j 8\n\n    - name: test-vlen128\n      run: |\n        export PATH=/data/action/osd/qemu-install/bin:$PATH\n        cd build\n        TESTS_EXECUTABLE_LOADER=qemu-riscv64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"-cpu;rv64,v=true,zfh=true,zvfh=true,vlen=128,elen=64,vext_spec=v1.0;-L;/data/action/osd/riscv/sysroot\" ctest --output-on-failure -j 8\n\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --gcov-tool /data/action/osd/riscv/bin/riscv64-unknown-linux-gnu-gcov -d ./src -c -o lcov.info\n        lcov -r lcov.info '/usr/*' -o lcov.info\n        lcov -r lcov.info '*/install/*' -o lcov.info\n        lcov -r lcov.info '*/build/*' -o lcov.info\n        lcov --list lcov.info\n\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        files: build/lcov.info\n\n  linux-gpu-llvmpipe:\n    runs-on: [self-hosted, linux, 
ubuntu25]\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX2=ON -DNCNN_AVXVNNI=OFF -DNCNN_AVXNECONVERT=OFF -DNCNN_AVX512=ON -DNCNN_AVX512VNNI=ON -DNCNN_AVX512BF16=OFF -DNCNN_AVX512FP16=OFF -DNCNN_XOP=OFF -DNCNN_OPENMP=OFF -DNCNN_VULKAN=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n    - name: test\n      run: |\n        export LP_NUM_THREADS=4\n        cd build && ctest --output-on-failure -j 8\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: /data/action/osd/codecov\n        files: build/lcov.info\n\n  linux-gpu-swiftshader:\n    runs-on: [self-hosted, linux, ubuntu25]\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-swiftshader\n      id: cache-swiftshader\n      uses: actions/cache@v5\n      with:\n        path: swiftshader-install\n        key: swiftshader-linux-install-20250508\n    - name: checkout-swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n      with:\n        repository: google/swiftshader\n        path: swiftshader\n        ref: 930d46d31b5d637f313fd5ef55da2bbf053c26c1\n    - name: swiftshader\n      if: steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        git -c 
submodule.\"third_party/git-hooks\".update=none submodule update --init --recursive\n        mkdir -p build; cd build\n        cmake -DCMAKE_INSTALL_PREFIX=install -DSWIFTSHADER_BUILD_PVR=FALSE -DSWIFTSHADER_BUILD_TESTS=FALSE -DSWIFTSHADER_ENABLE_ASTC=FALSE -DSWIFTSHADER_WARNINGS_AS_ERRORS=FALSE -DREACTOR_BACKEND=Subzero -DREACTOR_DEFAULT_OPT_LEVEL=Default -DCMAKE_BUILD_TYPE=Release ..\n        cmake --build . -j 8\n        mkdir $GITHUB_WORKSPACE/swiftshader-install\n        cp Linux/* $GITHUB_WORKSPACE/swiftshader-install\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX2=ON -DNCNN_AVXVNNI=OFF -DNCNN_AVXNECONVERT=OFF -DNCNN_AVX512=ON -DNCNN_AVX512VNNI=ON -DNCNN_AVX512BF16=OFF -DNCNN_AVX512FP16=OFF -DNCNN_XOP=OFF -DNCNN_OPENMP=OFF -DNCNN_VULKAN=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n    - name: test\n      run: |\n        printf \"[Processor]\\nThreadCount=1\\n\" > build/tests/SwiftShader.ini\n        export VK_ICD_FILENAMES=\"$GITHUB_WORKSPACE/swiftshader-install/vk_swiftshader_icd.json\"\n        cd build && ctest --output-on-failure -j 8\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: /data/action/osd/codecov\n        files: build/lcov.info\n\n  linux-gcc-cross:\n    name: ${{ matrix.arch }}\n    runs-on: [self-hosted, linux, ubuntu25]\n    strategy:\n      fail-fast: false\n      matrix:\n        
include:\n          - arch: arm\n            toolchain: arm-linux-gnueabi.toolchain.cmake\n            extra-cmake-args: -DNCNN_VFPV4=ON\n            qemu: qemu-arm-static\n            qemu-args: \"-L;/usr/arm-linux-gnueabi\"\n\n          - arch: arm-noinlineasm\n            toolchain: arm-linux-gnueabi.toolchain.cmake\n            extra-cmake-args: -DNCNN_GNU_INLINE_ASM=OFF -DNCNN_VFPV4=ON\n            qemu: qemu-arm-static\n            qemu-args: \"-L;/usr/arm-linux-gnueabi\"\n\n          - arch: armhf-vfpv3-d16\n            toolchain: arm-linux-gnueabihf-vfpv3-d16.toolchain.cmake\n            extra-cmake-args: -DNCNN_VFPV4=OFF\n            qemu: qemu-arm-static\n            qemu-args: \"-L;/usr/arm-linux-gnueabihf\"\n\n          - arch: armhf-vfpv3-d16-noinlineasm\n            toolchain: arm-linux-gnueabihf-vfpv3-d16.toolchain.cmake\n            extra-cmake-args: -DNCNN_GNU_INLINE_ASM=OFF -DNCNN_VFPV4=OFF\n            qemu: qemu-arm-static\n            qemu-args: \"-L;/usr/arm-linux-gnueabihf\"\n\n          - arch: aarch64-armv8.0\n            toolchain: aarch64-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_ARM82=OFF\n            qemu: qemu-aarch64-static\n            qemu-args: \"-L;/usr/aarch64-linux-gnu\"\n\n          - arch: aarch64-armv8.2\n            toolchain: aarch64-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_ARM82DOT=OFF -DNCNN_ARM82FP16FML=OFF\n            qemu: qemu-aarch64-static\n            qemu-args: \"-L;/usr/aarch64-linux-gnu\"\n\n          - arch: aarch64-armv8.4\n            toolchain: aarch64-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_ARM84BF16=OFF -DNCNN_ARM84I8MM=OFF\n            qemu: qemu-aarch64-static\n            qemu-args: \"-L;/usr/aarch64-linux-gnu\"\n\n          - arch: aarch64-armv8.6\n            toolchain: aarch64-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_ARM86SVE=OFF\n            qemu: qemu-aarch64-static\n            qemu-args: 
\"-L;/usr/aarch64-linux-gnu\"\n\n          - arch: aarch64-armv8.6-noinlineasm\n            toolchain: aarch64-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_GNU_INLINE_ASM=OFF -DNCNN_ARM86SVE=OFF\n            qemu: qemu-aarch64-static\n            qemu-args: \"-L;/usr/aarch64-linux-gnu\"\n\n          - arch: mipsisa32r6el\n            toolchain: mipsisa32r6el-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_MSA=OFF -DNCNN_MMI=OFF\n            qemu: qemu-mipsel-static\n            qemu-args: \"-L;/usr/mipsisa32r6el-linux-gnu\"\n\n          - arch: mipsisa64r6el\n            toolchain: mipsisa64r6el-linux-gnuabi64.toolchain.cmake\n            extra-cmake-args: -DNCNN_MSA=ON -DNCNN_MMI=OFF\n            qemu: qemu-mips64el-static\n            qemu-args: \"-L;/usr/mipsisa64r6el-linux-gnuabi64\"\n\n          - arch: powerpc\n            toolchain: powerpc-linux-gnu.toolchain.cmake\n            extra-cmake-args:\n            qemu: qemu-ppc-static\n            qemu-args: \"-L;/usr/powerpc-linux-gnu\"\n\n          - arch: powerpc64le\n            toolchain: powerpc64le-linux-gnu.toolchain.cmake\n            extra-cmake-args:\n            qemu: qemu-ppc64le-static\n            qemu-args: \"-L;/usr/powerpc64le-linux-gnu\"\n\n          - arch: riscv64\n            toolchain: riscv64-linux-gnu.toolchain.cmake\n            extra-cmake-args:\n            qemu: qemu-riscv64-static\n            qemu-args: \"-L;/usr/riscv64-linux-gnu\"\n\n          - arch: loongarch64-la264\n            toolchain: loongarch64-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_LSX=ON -DNCNN_LASX=OFF\n            qemu: qemu-loongarch64-static\n            qemu-args: \"-L;/usr/loongarch64-linux-gnu\"\n\n          - arch: loongarch64-la664\n            toolchain: loongarch64-linux-gnu.toolchain.cmake\n            extra-cmake-args: -DNCNN_LSX=ON -DNCNN_LASX=ON\n            qemu: qemu-loongarch64-static\n            qemu-args: 
\"-L;/usr/loongarch64-linux-gnu\"\n\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: build\n      run: |\n        mkdir build && cd build\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/${{ matrix.toolchain }} ${{ matrix.extra-cmake-args }} -DNCNN_OPENMP=OFF \\\n            -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j 8\n\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=${{ matrix.qemu }} TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"${{ matrix.qemu-args }}\" ctest --output-on-failure -j 8\n\n    - name: lcov-collect\n      run: |\n        cd build\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n\n    - name: build-openmp\n      run: |\n        mkdir build-openmp && cd build-openmp\n        cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/${{ matrix.toolchain }} ${{ matrix.extra-cmake-args }} -DNCNN_OPENMP=ON \\\n            -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF \\\n            -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j 8\n\n    - name: test-openmp\n      run: |\n        export OMP_THREAD_LIMIT=1\n        export OMP_NUM_THREADS=1\n        cd build-openmp\n        TESTS_EXECUTABLE_LOADER=${{ matrix.qemu }} TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"${{ matrix.qemu-args }}\" ctest --output-on-failure -j 8\n\n    - name: lcov-collect-openmp\n      run: |\n        cd build-openmp\n        lcov --ignore-errors inconsistent -d ./src -c -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '/usr/*' -o lcov.info\n        lcov --ignore-errors inconsistent -r lcov.info '*/build-openmp/*' -o lcov.info\n        lcov --ignore-errors inconsistent --list lcov.info\n\n    - name: codecov\n      uses: codecov/codecov-action@v5\n      with:\n        token: ${{ secrets.CODECOV_TOKEN }}\n        disable_search: true\n        plugins: noop\n        binary: /data/action/osd/codecov\n        files: build/lcov.info,build-openmp/lcov.info\n"
  },
  {
    "path": ".github/workflows/tvos.yml",
    "content": "name: tvos\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/tvos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/tvos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'glslang'\nconcurrency:\n  group: tvos-${{ github.ref }}\n  cancel-in-progress: true\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  TVOS_DEPLOYMENT_TARGET: '11.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$TVOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$TVOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" 
-DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        -DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-tvos-install-20251004\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: openmp-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n    - name: openmp-arm64e\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64e && cd build-arm64e\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64e\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-simulator-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_TVOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-simulator-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n    - name: openmp-merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/tvos\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/tvos-simulator\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/include $GITHUB_WORKSPACE/openmp-install/tvos\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/tvos/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64e/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/tvos/lib/libomp.a\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/include $GITHUB_WORKSPACE/openmp-install/tvos-simulator\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/tvos-simulator/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/tvos-simulator/lib/libomp.a\n\n    - name: install-openmp\n      run: |\n        sudo cp $GITHUB_WORKSPACE/openmp-install/tvos/include/* $DEVELOPER_DIR/Platforms/AppleTVOS.platform/Developer/SDKs/AppleTVOS.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/tvos/lib/libomp.a $DEVELOPER_DIR/Platforms/AppleTVOS.platform/Developer/SDKs/AppleTVOS.sdk/usr/lib\n\n        sudo cp $GITHUB_WORKSPACE/openmp-install/tvos-simulator/include/* $DEVELOPER_DIR/Platforms/AppleTVSimulator.platform/Developer/SDKs/AppleTVSimulator.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/tvos-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/AppleTVSimulator.platform/Developer/SDKs/AppleTVSimulator.sdk/usr/lib\n\n    - name: arm64\n      run: |\n        mkdir 
build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n    - name: arm64e\n      run: |\n        mkdir build-arm64e && cd build-arm64e\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=TVOS -DARCHS=\"arm64e\" ..\n        cmake --build . -j 4\n    - name: simulator-x86_64\n      run: |\n        mkdir build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_TVOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n    - name: simulator-arm64\n      run: |\n        mkdir build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATORARM64_TVOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n"
  },
  {
    "path": ".github/workflows/visionos.yml",
    "content": "name: visionos\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/visionos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/visionos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\nconcurrency:\n  group: visionos-${{ github.ref }}\n  cancel-in-progress: true\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  VISIONOS_DEPLOYMENT_TARGET: '1.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$VISIONOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$VISIONOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        
-DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n        -DNCNN_VULKAN=ON \\\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-visionos-install-20251004\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: openmp-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64 && cd build-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n    - name: openmp-simulator-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-simulator-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/visionos\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/visionos-simulator\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/include $GITHUB_WORKSPACE/openmp-install/visionos\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/visionos/lib\n        cp openmp-${{ env.OPENMP_VERSION }}.src/build-arm64/install/lib/libomp.a $GITHUB_WORKSPACE/openmp-install/visionos/lib/libomp.a\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/include $GITHUB_WORKSPACE/openmp-install/visionos-simulator\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/visionos-simulator/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-arm64/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/visionos-simulator/lib/libomp.a\n\n    - name: install-openmp\n      run: |\n        
sudo cp $GITHUB_WORKSPACE/openmp-install/visionos/include/* $DEVELOPER_DIR/Platforms/XROS.platform/Developer/SDKs/XROS.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/visionos/lib/libomp.a $DEVELOPER_DIR/Platforms/XROS.platform/Developer/SDKs/XROS.sdk/usr/lib\n\n        sudo cp $GITHUB_WORKSPACE/openmp-install/visionos-simulator/include/* $DEVELOPER_DIR/Platforms/XRSimulator.platform/Developer/SDKs/XRSimulator.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/visionos-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/XRSimulator.platform/Developer/SDKs/XRSimulator.sdk/usr/lib\n\n    - name: arm64\n      run: |\n        mkdir build-arm64 && cd build-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n    - name: simulator-x86_64\n      run: |\n        mkdir build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n    - name: simulator-arm64\n      run: |\n        mkdir build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_VISIONOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n"
  },
  {
    "path": ".github/workflows/watchos.yml",
    "content": "name: watchos\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/watchos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/watchos.yml'\n    - 'toolchains/ios.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\nconcurrency:\n  group: watchos-${{ github.ref }}\n  cancel-in-progress: true\nenv:\n  DEVELOPER_DIR: /Applications/Xcode_16.4.0.app/Contents/Developer\n  WATCHOS_DEPLOYMENT_TARGET: '6.0'\n  ENABLE_BITCODE: OFF\n  ENABLE_ARC: OFF\n  ENABLE_VISIBILITY: OFF\npermissions:\n  contents: read\n\njobs:\n  build:\n    runs-on: macos-15-intel\n    env:\n      OPENMP_VERSION: '18.1.2'\n      OPENMP_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$WATCHOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DLIBOMP_ENABLE_SHARED=OFF \\\n        -DLIBOMP_OMPT_SUPPORT=OFF \\\n        -DLIBOMP_USE_HWLOC=OFF \\\n\n      NCNN_CMAKE_OPTIONS: |\n        -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake \\\n        -DDEPLOYMENT_TARGET=$WATCHOS_DEPLOYMENT_TARGET \\\n        -DENABLE_BITCODE=$ENABLE_BITCODE \\\n        -DENABLE_ARC=$ENABLE_ARC \\\n        -DENABLE_VISIBILITY=$ENABLE_VISIBILITY \\\n        -DCMAKE_INSTALL_PREFIX=install \\\n        -DCMAKE_BUILD_TYPE=Release \\\n        -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n        -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n        
-DOpenMP_libomp_LIBRARY=\"libomp.a\" \\\n\n    steps:\n    - uses: actions/checkout@v6\n\n    - name: cache-openmp\n      id: cache-openmp\n      uses: actions/cache@v5\n      with:\n        path: openmp-install\n        key: openmp-watchos-install-20251004\n    - name: openmp\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf cmake-${{ env.OPENMP_VERSION }}.src.tar.xz\n        wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${{ env.OPENMP_VERSION }}/openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        tar -xf openmp-${{ env.OPENMP_VERSION }}.src.tar.xz\n        mv cmake-${{ env.OPENMP_VERSION }}.src/Modules/* openmp-${{ env.OPENMP_VERSION }}.src/cmake/\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        wget https://github.com/nihui/llvm-project/commit/ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        patch -p2 -i ef8c35bcf5d9cfdb0764ffde6a63c04ec715bc37.patch\n        wget https://github.com/nihui/llvm-project/commit/5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n        patch -p2 -i 5c12711f9a21f41bea70566bf15a4026804d6b20.patch\n    - name: openmp-armv7k\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-armv7k && cd build-armv7k\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"armv7k\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-arm64_32\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-arm64_32 && cd build-arm64_32\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"arm64_32\" ..\n        cmake --build . -j 4\n        cmake --build . 
--target install\n    - name: openmp-simulator-x86_64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-simulator-arm64\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        cd openmp-${{ env.OPENMP_VERSION }}.src\n        mkdir -p build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.OPENMP_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n        cmake --build . --target install\n    - name: openmp-merge-fat-library\n      if: steps.cache-openmp.outputs.cache-hit != 'true'\n      run: |\n        mkdir -p $GITHUB_WORKSPACE/openmp-install\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/watchos\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/watchos-simulator\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-arm64_32/install/include $GITHUB_WORKSPACE/openmp-install/watchos\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/watchos/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-armv7k/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-arm64_32/install/lib/libomp.a \\\n            -o $GITHUB_WORKSPACE/openmp-install/watchos/lib/libomp.a\n\n        cp -a openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/include $GITHUB_WORKSPACE/openmp-install/watchos-simulator\n        mkdir -p $GITHUB_WORKSPACE/openmp-install/watchos-simulator/lib\n        lipo -create \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-x86_64/install/lib/libomp.a \\\n            openmp-${{ env.OPENMP_VERSION }}.src/build-simulator-arm64/install/lib/libomp.a \\\n      
      -o $GITHUB_WORKSPACE/openmp-install/watchos-simulator/lib/libomp.a\n\n    - name: install-openmp\n      run: |\n        sudo cp $GITHUB_WORKSPACE/openmp-install/watchos/include/* $DEVELOPER_DIR/Platforms/WatchOS.platform/Developer/SDKs/WatchOS.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/watchos/lib/libomp.a $DEVELOPER_DIR/Platforms/WatchOS.platform/Developer/SDKs/WatchOS.sdk/usr/lib\n\n        sudo cp $GITHUB_WORKSPACE/openmp-install/watchos-simulator/include/* $DEVELOPER_DIR/Platforms/WatchSimulator.platform/Developer/SDKs/WatchSimulator.sdk/usr/include\n        sudo cp $GITHUB_WORKSPACE/openmp-install/watchos-simulator/lib/libomp.a $DEVELOPER_DIR/Platforms/WatchSimulator.platform/Developer/SDKs/WatchSimulator.sdk/usr/lib\n\n    - name: armv7k\n      run: |\n        mkdir build-armv7k && cd build-armv7k\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"armv7k\" ..\n        cmake --build . -j 4\n    - name: arm64_32\n      run: |\n        mkdir build-arm64_32 && cd build-arm64_32\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=WATCHOS -DARCHS=\"arm64_32\" ..\n        cmake --build . -j 4\n\n    - name: simulator-x86_64\n      run: |\n        mkdir build-simulator-x86_64 && cd build-simulator-x86_64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"x86_64\" ..\n        cmake --build . -j 4\n    - name: simulator-arm64\n      run: |\n        mkdir build-simulator-arm64 && cd build-simulator-arm64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DPLATFORM=SIMULATOR_WATCHOS -DARCHS=\"arm64\" ..\n        cmake --build . -j 4\n"
  },
  {
    "path": ".github/workflows/web-assembly.yml",
    "content": "name: web-assembly\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/web-assembly.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/web-assembly.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n\nenv:\n  EMSCRIPTEN_VERSION: 3.1.28\n\nconcurrency:\n  group: web-assembly-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  webassembly:\n    runs-on: ubuntu-latest\n    steps:\n    - uses: actions/checkout@v6\n    - name: emsdk\n      run: |\n        git clone https://github.com/emscripten-core/emsdk.git\n        cd emsdk\n        ./emsdk install $EMSCRIPTEN_VERSION\n        ./emsdk activate $EMSCRIPTEN_VERSION\n    - name: build-basic\n      run: |\n        source emsdk/emsdk_env.sh\n        export LDFLAGS=\"-sERROR_ON_WASM_CHANGES_AFTER_LINK -sWASM_BIGINT -O1\"\n        mkdir build-basic && cd build-basic\n        cmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX2=OFF -DNCNN_AVX=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . 
-j $(nproc)\n    - name: test-basic\n      run: |\n        cd build-basic\n        TESTS_EXECUTABLE_LOADER=node ctest --output-on-failure -j $(nproc)\n    - name: build-simd\n      run: |\n        source emsdk/emsdk_env.sh\n        export LDFLAGS=\"-sERROR_ON_WASM_CHANGES_AFTER_LINK -sWASM_BIGINT -O1\"\n        mkdir build-simd && cd build-simd\n        cmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-simd\n      run: |\n        cd build-simd\n        TESTS_EXECUTABLE_LOADER=node ctest --output-on-failure -j $(nproc)\n    - name: build-simd-omp\n      run: |\n        source emsdk/emsdk_env.sh\n        export LDFLAGS=\"-sERROR_ON_WASM_CHANGES_AFTER_LINK -sWASM_BIGINT -O1\"\n        mkdir build-simd-omp && cd build-simd-omp\n        cmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake -DNCNN_THREADS=ON -DNCNN_OPENMP=ON -DNCNN_SIMPLEOMP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . -j $(nproc)\n    - name: test-simd-omp\n      run: |\n        cd build-simd-omp\n        TESTS_EXECUTABLE_LOADER=node ctest --output-on-failure -j $(nproc)\n"
  },
  {
    "path": ".github/workflows/windows-arm.yml",
    "content": "name: windows-arm\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-arm.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-arm.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'glslang'\nconcurrency:\n  group: windows-arm-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  windows:\n    name: ${{ matrix.vs-version }}\n    runs-on: windows-2022\n    strategy:\n      matrix:\n        include:\n          - vs-version: vs2019\n            toolset-version: v142\n            windows-sdk-version: 22621\n\n          - vs-version: vs2022\n            toolset-version: v143\n            windows-sdk-version: 26100\n\n    env:\n      UseMultiToolTask: true\n      NCNN_CMAKE_OPTIONS: -DNCNN_BUILD_TESTS=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_VULKAN=ON\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - uses: GuillaumeFalourd/setup-windows10-sdk-action@v2.4\n      with:\n        sdk-version: ${{ matrix.windows-sdk-version }}\n    - name: arm64\n      run: |\n        mkdir build-arm64; cd build-arm64\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A arm64,version=10.0.${{ matrix.windows-sdk-version }}.0 ${{ env.NCNN_CMAKE_OPTIONS }} ..\n        cmake --build . --config Release -j 4\n    - name: arm64-shared\n      run: |\n        mkdir build-arm64-shared; cd build-arm64-shared\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A arm64,version=10.0.${{ matrix.windows-sdk-version }}.0 ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_SHARED_LIB=ON ..\n        cmake --build . 
--config Release -j 4\n\n  woa-linux:\n    name: woa-linux\n    runs-on: ubuntu-latest\n    container: linaro/wine-arm64\n    steps:\n    - uses: actions/checkout@v6\n    - name: msvc-wine\n      env:\n        WINEPREFIX: /tmp/wine-x64-prefix/\n      run: |\n        apt-get update\n        apt-get install -y wine64 python3 msitools python3-simplejson python3-six ca-certificates winbind cmake ninja-build meson\n        ln -s /usr/bin/wine /usr/bin/wine64\n        xvfb-run winecfg &\n        git clone --depth 1 https://github.com/mstorsjo/msvc-wine\n        msvc-wine/vsdownload.py --accept-license --dest /msvc\n        msvc-wine/install.sh /msvc\n    - name: build\n      env:\n        WINEPREFIX: /tmp/wine-x64-prefix/\n        CC: cl\n        CXX: cl\n      run: |\n        export PATH=/msvc/bin/arm64:$PATH\n        mkdir build && cd build\n        cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_SYSTEM_NAME=Windows -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . --config Release -j $(nproc)\n    - name: test\n      run: |\n        cd build\n        TESTS_EXECUTABLE_LOADER=wine-arm64 TESTS_EXECUTABLE_LOADER_ARGUMENTS=\"\" ctest --output-on-failure -j $(nproc)\n\n  windows-arm:\n    runs-on: windows-11-arm\n    env:\n      UseMultiToolTask: true\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: build\n      run: |\n        mkdir build; cd build\n        cmake -A arm64 -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_VULKAN=OFF -DNCNN_ARM82=OFF ..\n        cmake --build . --config Release -j 4\n    - name: test\n      run: cd build; ctest -C Release --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/windows-clang.yml",
    "content": "name: windows-clang\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-clang.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-clang.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/arm/**'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'glslang'\nconcurrency:\n  group: windows-clang-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  windows:\n    name: ClangCL\n    runs-on: windows-2022\n\n    env:\n      UseMultiToolTask: true\n      NCNN_CMAKE_OPTIONS: -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: arm64\n      run: |\n        mkdir build-arm64; cd build-arm64\n        cmake -T ClangCL -A arm64 ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_VULKAN=OFF ..\n        cmake --build . --config Release -j 4\n\n    - name: arm64-vulkan\n      run: |\n        mkdir build-arm64-vulkan; cd build-arm64-vulkan\n        cmake -T ClangCL -A arm64 ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_VULKAN=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . --config Release -j 4\n\n    - name: x86\n      run: |\n        mkdir build-x86; cd build-x86\n        cmake -T ClangCL -A Win32 ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_BUILD_TESTS=ON -DNCNN_VULKAN=OFF ..\n        cmake --build . 
--config Release -j 4\n    - name: x86-test\n      run: cd build-x86; ctest -C Release --output-on-failure -j 4\n\n    - name: x86-vulkan\n      run: |\n        mkdir build-x86-vulkan; cd build-x86-vulkan\n        cmake -T ClangCL -A Win32 ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_VULKAN=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . --config Release -j 4\n\n    - name: x64\n      run: |\n        mkdir build-x64; cd build-x64\n        cmake -T ClangCL -A x64 ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_BUILD_TESTS=ON -DNCNN_VULKAN=OFF ..\n        cmake --build . --config Release -j 4\n    - name: x64-test\n      run: cd build-x64; ctest -C Release --output-on-failure -j 4\n\n    - name: x64-vulkan\n      run: |\n        mkdir build-x64-vulkan; cd build-x64-vulkan\n        cmake -T ClangCL -A x64 ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_VULKAN=ON -DNCNN_SHARED_LIB=ON ..\n        cmake --build . --config Release -j 4\n"
  },
  {
    "path": ".github/workflows/windows-mingw.yml",
    "content": "name: windows-mingw\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-mingw.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-mingw.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'glslang'\nconcurrency:\n  group: windows-mingw-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  windows:\n    name: MinGW-w64\n    runs-on: windows-2022\n\n    env:\n      UseMultiToolTask: true\n      NCNN_CMAKE_OPTIONS: -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: x64\n      run: |\n        mkdir build-x64; cd build-x64\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_BUILD_TESTS=ON -DNCNN_VULKAN=OFF -G \"MinGW Makefiles\" ..\n        cmake --build . --config Release -j 4\n    - name: x64-test\n      run: cd build-x64; ctest -C Release --output-on-failure -j 4\n\n    - name: x64-vulkan\n      run: |\n        mkdir build-x64-vulkan; cd build-x64-vulkan\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DNCNN_VULKAN=ON -DNCNN_SHARED_LIB=ON -G \"MinGW Makefiles\" ..\n        cmake --build . --config Release -j 4\n"
  },
  {
    "path": ".github/workflows/windows-xp.yml",
    "content": "name: windows-xp\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-xp.yml'\n    - 'toolchains/windows-xp-msvc.toolchain.cmake'\n    - 'toolchains/windows-xp-mingw.toolchain.cmake'\n    - 'toolchains/windows-xp-clang.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows-xp.yml'\n    - 'toolchains/windows-xp-msvc.toolchain.cmake'\n    - 'toolchains/windows-xp-mingw.toolchain.cmake'\n    - 'toolchains/windows-xp-clang.toolchain.cmake'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'tests/**'\nconcurrency:\n  group: windows-xp-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  MSVC:\n    runs-on: windows-2025\n\n    env:\n      VS_INSTALL_DIR: C:\\Program Files\\Microsoft Visual Studio\\2022\\Enterprise\n      UseMultiToolTask: true\n      NCNN_CMAKE_OPTIONS: -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: config\n      shell: cmd\n      run: |\n        \"C:\\Program Files (x86)\\Microsoft Visual Studio\\Installer\\setup.exe\" modify --installPath \"${{ env.VS_INSTALL_DIR }}\" --channelId VisualStudio.17.Release --add Microsoft.VisualStudio.Component.WinXP --add Microsoft.VisualStudio.Component.VC.Tools.X86.X64.Spectre --add Microsoft.VisualStudio.Component.VC.Tools.X86.X64 --add Microsoft.VisualStudio.Component.VC.v141.xp --nocache --quiet\n        call \"${{ env.VS_INSTALL_DIR }}\\VC\\Auxiliary\\Build\\vcvarsall.bat\" x86\n    - name: build\n      run: |\n        mkdir build; cd build\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -A WIN32 -G \"Visual Studio 17 2022\" -T 
v141_xp -DNCNN_WINXP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_OPENMP=OFF -DNCNN_BUILD_WITH_STATIC_CRT=ON -DNCNN_AVX=OFF -DCMAKE_TOOLCHAIN_FILE=\"../toolchains/windows-xp-msvc.toolchain.cmake\" ..\n        cmake --build . --config Release -j 4\n    - name: test\n      run: cd build; ctest -C Release --output-on-failure -j 4\n\n  MinGW-w32:\n    runs-on: windows-2025\n\n    env:\n      UseMultiToolTask: true\n      NCNN_CMAKE_OPTIONS: -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n\n    - name: config\n      run: |\n        Invoke-WebRequest -Uri https://github.com/nihui/ncnn-assets/releases/download/toolchain/i686-8.1.0-release-posix-dwarf-rt_v6-rev0.7z -OutFile i686-8.1.0-release-posix-dwarf-rt_v6-rev0.7z\n        7z x ./i686-8.1.0-release-posix-dwarf-rt_v6-rev0.7z\n        Add-Content -Path $env:GITHUB_ENV -Value \"MINGW32_ROOT_PATH=${{ github.workspace }}\\mingw32\"\n        Add-Content -Path $env:GITHUB_PATH -Value \"${{ github.workspace }}\\mingw32\\bin\"\n    - name: build\n      run: |\n        mkdir build; cd build\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DCMAKE_TOOLCHAIN_FILE=\"../toolchains/windows-xp-mingw.toolchain.cmake\" -DNCNN_WINXP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_AVX=OFF .. -G \"MinGW Makefiles\"\n        cmake --build . 
--config Release -j 4\n    - name: test\n      run: cd build; ctest -C Release --output-on-failure -j 4\n\n  Clang:\n    runs-on: windows-2022\n\n    env:\n      UseMultiToolTask: true\n      NCNN_CMAKE_OPTIONS: -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON\n\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n    - name: Set up Clang\n      run: choco install llvm --version=6.0.0 --allow-downgrade\n    - name: Verify Clang\n      run: |\n        clang --version\n        clang++ --version\n    - name: config\n      run: |\n        Invoke-WebRequest -Uri https://github.com/nihui/ncnn-assets/releases/download/toolchain/i686-8.1.0-release-posix-dwarf-rt_v6-rev0.7z -OutFile i686-8.1.0-release-posix-dwarf-rt_v6-rev0.7z\n        7z x ./i686-8.1.0-release-posix-dwarf-rt_v6-rev0.7z\n        Add-Content -Path $env:GITHUB_ENV -Value \"MINGW32_ROOT_PATH=${{ github.workspace }}\\mingw32\"\n        echo \"${{ github.workspace }}\\mingw32\\bin;$env:PATH\" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8\n    - name: build\n      run: |\n        mkdir build; cd build\n        cmake ${{ env.NCNN_CMAKE_OPTIONS }} -DCMAKE_TOOLCHAIN_FILE=\"../toolchains/windows-xp-clang.toolchain.cmake\" -DNCNN_WINXP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_AVX=OFF .. -G \"MinGW Makefiles\"\n        cmake --build . --config Release -j 4\n    - name: test\n      run: cd build; ctest -C Release --output-on-failure -j 4\n"
  },
  {
    "path": ".github/workflows/windows.yml",
    "content": "name: windows\non:\n  push:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\n  pull_request:\n    branches: [master]\n    paths:\n    - '.github/workflows/windows.yml'\n    - 'CMakeLists.txt'\n    - 'cmake/**'\n    - 'src/*'\n    - 'src/layer/*'\n    - 'src/layer/x86/**'\n    - 'src/layer/vulkan/**'\n    - 'tests/**'\n    - 'tools/**'\n    - '!tools/pnnx/**'\n    - 'examples/**'\n    - 'glslang'\nconcurrency:\n  group: windows-${{ github.ref }}\n  cancel-in-progress: true\npermissions:\n  contents: read\n\njobs:\n  msvc:\n    name: ${{ matrix.vs-version }}\n    runs-on: windows-2022\n    strategy:\n      matrix:\n        include:\n          - vs-version: vs2015\n            toolset-version: v140\n            windows-sdk-version: 22621\n          - vs-version: vs2017\n            toolset-version: v141\n            windows-sdk-version: 22621\n          - vs-version: vs2019\n            toolset-version: v142\n            windows-sdk-version: 26100\n          - vs-version: vs2022\n            toolset-version: v143\n            windows-sdk-version: 26100\n\n    env:\n      UseMultiToolTask: true\n    steps:\n    - uses: actions/checkout@v6\n      with:\n        submodules: true\n        \n    - name: Install VS 2017 (v141) Build Tools\n      if: matrix.vs-version == 'vs2017'\n      run: |\n        $vsInstallPath = & \"${env:ProgramFiles(x86)}\\Microsoft Visual Studio\\Installer\\vswhere.exe\" -latest -property installationPath\n        Start-Process -FilePath \"${env:ProgramFiles(x86)}\\Microsoft Visual Studio\\Installer\\vs_installer.exe\" -ArgumentList \"modify --installPath `\"$vsInstallPath`\" --add Microsoft.VisualStudio.Component.VC.v141.x86.x64 --quiet --norestart --nocache\" -Wait\n    - name: 
Install and Setup VS 2015 (v140) Build Tools\n      if: matrix.vs-version == 'vs2015'\n      run: |\n        $vs140Path = \"C:/vs140_build_tools\"\n        Invoke-WebRequest -Uri \"https://aka.ms/vs/15/release/vs_buildtools.exe\" -OutFile vs_buildtools.exe\n        Start-Process -FilePath \"vs_buildtools.exe\" -ArgumentList \"--installPath `\"$vs140Path`\" --add Microsoft.VisualStudio.Workload.VCTools --add Microsoft.VisualStudio.Component.VC.140 --quiet --wait --norestart --nocache\" -Wait\n\n        $vcvarsPath = (Get-ChildItem -Path $vs140Path -Filter \"vcvars64.bat\" -Recurse | Select-Object -First 1).FullName\n        $cmd = \"`\"$vcvarsPath`\" && powershell -Command `\"`$env:PATH;`$env:INCLUDE;`$env:LIB`\"\"\n        $output = cmd.exe /c $cmd\n        $lines = $output -split \"`r`n\"\n        \n        echo \"PATH=$($lines[0]);$($env:PATH)\" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append\n        echo \"INCLUDE=$($lines[1])\" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append\n        echo \"LIB=$($lines[2])\" | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append\n\n    - uses: GuillaumeFalourd/setup-windows10-sdk-action@v2.4\n      with:\n        sdk-version: ${{ matrix.windows-sdk-version }}\n\n    - name: cache-protobuf\n      id: cache-protobuf\n      uses: actions/cache@v5\n      with:\n        path: \"protobuf-install\"\n        key: protobuf-${{ matrix.vs-version }}-x64-install-3\n    - name: protobuf\n      if: steps.cache-protobuf.outputs.cache-hit != 'true'\n      run: |\n        Invoke-WebRequest -Uri https://github.com/protocolbuffers/protobuf/archive/v3.11.2.zip -OutFile protobuf-3.11.2.zip\n        7z x ./protobuf-3.11.2.zip\n        cd protobuf-3.11.2\n        mkdir build-${{ matrix.vs-version }}; cd build-${{ matrix.vs-version }}\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A x64,version=10.0.${{ matrix.windows-sdk-version }}.0 -DCMAKE_INSTALL_PREFIX=\"$env:GITHUB_WORKSPACE\\protobuf-install\" 
-Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_MSVC_STATIC_RUNTIME=OFF -DNCNN_BUILD_TESTS=ON ../cmake\n        cmake --build . --config Release -j 4\n        cmake --build . --config Release --target install\n    - name: cache-swiftshader\n      if: matrix.vs-version != 'vs2015' && matrix.vs-version != 'vs2017'\n      id: cache-swiftshader\n      uses: actions/cache@v5\n      with:\n        path: swiftshader-install\n        key: swiftshader-${{ matrix.vs-version }}-x64-install-20251010\n    - name: checkout-swiftshader\n      if: matrix.vs-version != 'vs2015' && matrix.vs-version != 'vs2017' && steps.cache-swiftshader.outputs.cache-hit != 'true'\n      uses: actions/checkout@v6\n      with:\n        repository: google/swiftshader\n        path: swiftshader\n        ref: de870ac7518fe2b6bb651ecc22fc36647cf7b986\n    - name: checkout-swiftshader-submodules\n      if: matrix.vs-version != 'vs2015' && matrix.vs-version != 'vs2017' && steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        git -c submodule.\"third_party/git-hooks\".update=none submodule update --init --recursive\n    - name: swiftshader\n      if: matrix.vs-version != 'vs2015' && matrix.vs-version != 'vs2017' && steps.cache-swiftshader.outputs.cache-hit != 'true'\n      run: |\n        cd swiftshader\n        mkdir build-${{ matrix.vs-version }}; cd build-${{ matrix.vs-version }}\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A x64,version=10.0.${{ matrix.windows-sdk-version }}.0 -DCMAKE_INSTALL_PREFIX=install -DSWIFTSHADER_BUILD_EGL=FALSE -DSWIFTSHADER_BUILD_GLESv2=FALSE -DSWIFTSHADER_BUILD_GLES_CM=FALSE -DSWIFTSHADER_BUILD_VULKAN=TRUE -DSWIFTSHADER_BUILD_PVR=FALSE -DSWIFTSHADER_BUILD_TESTS=FALSE -DSWIFTSHADER_ENABLE_ASTC=FALSE -DSWIFTSHADER_WARNINGS_AS_ERRORS=FALSE -DREACTOR_BACKEND=Subzero -DREACTOR_DEFAULT_OPT_LEVEL=Default -DCMAKE_BUILD_TYPE=Release ..\n        cmake --build . 
--config Release -j 4\n        mkdir \"$env:GITHUB_WORKSPACE/swiftshader-install\"\n        Copy-Item -Path \"Windows\\*\" -Destination \"$env:GITHUB_WORKSPACE\\swiftshader-install\"\n\n    - name: x64\n      run: |\n        mkdir build-x64; cd build-x64\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A x64,version=10.0.${{ matrix.windows-sdk-version }}.0 -Dprotobuf_DIR=\"$env:GITHUB_WORKSPACE\\protobuf-install\\cmake\" -DNCNN_VULKAN=ON -DNCNN_BUILD_TESTS=ON ..\n        cmake --build . --config Release -j 4\n    - name: x64-test\n      if: matrix.vs-version != 'vs2015' && matrix.vs-version != 'vs2017'\n      run: |\n        echo \"[Processor]`nThreadCount=1`n\" > build-x64/tests/Release/SwiftShader.ini\n        Copy-Item -Path \"$env:GITHUB_WORKSPACE\\swiftshader-install\\vulkan-1.dll\" -Destination 'build-x64\\tests'\n        cd build-x64; ctest -C Release --output-on-failure -j 4\n\n    - name: x64-sse2\n      run: |\n        mkdir build-x64-sse2; cd build-x64-sse2\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A x64,version=10.0.${{ matrix.windows-sdk-version }}.0 -DNCNN_RUNTIME_CPU=OFF -DNCNN_XOP=OFF -DNCNN_AVX=OFF -DNCNN_AVX2=OFF -DNCNN_AVX512=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . --config Release -j 4\n    - name: x64-sse2-test\n      run: cd build-x64-sse2; ctest -C Release --output-on-failure -j 4\n\n    - name: x64-avx\n      run: |\n        mkdir build-x64-avx; cd build-x64-avx\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A x64,version=10.0.${{ matrix.windows-sdk-version }}.0 -DNCNN_RUNTIME_CPU=OFF -DNCNN_XOP=OFF -DNCNN_AVX=ON -DNCNN_AVX2=OFF -DNCNN_AVX512=OFF -DNCNN_BUILD_TESTS=ON -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . 
--config Release -j 4\n    - name: x64-avx-test\n      run: cd build-x64-avx; ctest -C Release --output-on-failure -j 4\n\n    - name: x86\n      run: |\n        mkdir build-x86; cd build-x86\n        cmake -T ${{ matrix.toolset-version }},host=x64 -A Win32,version=10.0.${{ matrix.windows-sdk-version }}.0 -DNCNN_SHARED_LIB=ON -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..\n        cmake --build . --config Release -j 4\n    - name: x86-test\n      run: |\n        Copy-Item -Path \"build-x86\\src\\Release\\ncnn.dll\" -Destination 'build-x86\\tests'\n        cd build-x86; ctest -C Release --output-on-failure -j 4\n"
  },
  {
    "path": ".gitignore",
    "content": "# CMake build directory\nbuild*/\n\n# Backup files.\n*~\n\n# Prerequisites\n*.d\n\n# Compiled Object files\n*.slo\n*.lo\n*.o\n*.obj\n\n# Precompiled Headers\n*.gch\n*.pch\n\n# Compiled Dynamic libraries\n*.so\n*.dylib\n*.dll\n\n# Fortran module files\n*.mod\n*.smod\n\n# Compiled Static libraries\n*.lai\n*.la\n*.a\n*.lib\n\n# Executables\n*.exe\n*.out\n*.app\n\n# MACOSX\n.DS_Store\n\n# IDE\n.vs\n.vscode\n.idea\ncmake-build-debug\ncmake-build-release\nCMakeSettings.json\n\n# Compiled python\n__pycache__\n*.pyc\n*.pyd\n*.egg-info/\npython/setup.py\n\n# Clangd\n.cache/\n\n# Xmake\n.xmake/\n"
  },
  {
    "path": ".gitmodules",
    "content": "[submodule \"glslang\"]\n\tpath = glslang\n\turl = https://github.com/nihui/glslang\n[submodule \"python/pybind11\"]\n\tpath = python/pybind11\n\turl = https://github.com/pybind/pybind11.git\n"
  },
  {
    "path": "CITATION.cff",
    "content": "cff-version: 1.2.0\ntitle: ncnn\nmessage: >-\n  If you use this software, please cite it using the\n  metadata from this file.\ntype: software\nauthors:\n  - family-names: \"Ni\"\n    given-names: \"Hui\"\n  - name: \"The ncnn contributors\"\nabstract: >-\n  ncnn is a high-performance neural network inference\n  computing framework optimized for mobile platforms. \ndate-released: 2017-06-30\nkeywords:\n  - \"neural network\"\n  - \"artificial intelligence\"\n  - \"deep learning\"\n  - android\n  - ios\n  - windows\n  - linux\n  - macos\n  - pnnx\n  - simd\n  - vulkan\n  - riscv\n  - x86\n  - arm\n  - mips\n  - loongarch\nlicense: BSD-3-Clause\nrepository-code: \"https://github.com/Tencent/ncnn\"\n"
  },
  {
    "path": "CMakeLists.txt",
    "content": "if(CMAKE_TOOLCHAIN_FILE)\n    set(LIBRARY_OUTPUT_PATH_ROOT ${CMAKE_BINARY_DIR} CACHE PATH \"root for library output, set this to change where android libs are compiled to\")\n    # get the absolute path; get_filename_component ABSOLUTE only resolves relative to the source dir, so use find_file instead\n    get_filename_component(CMAKE_TOOLCHAIN_FILE_NAME ${CMAKE_TOOLCHAIN_FILE} NAME)\n    find_file(CMAKE_TOOLCHAIN_FILE ${CMAKE_TOOLCHAIN_FILE_NAME} PATHS ${CMAKE_SOURCE_DIR} NO_DEFAULT_PATH)\n    message(STATUS \"CMAKE_TOOLCHAIN_FILE = ${CMAKE_TOOLCHAIN_FILE}\")\nendif()\n\nif(NOT DEFINED CMAKE_INSTALL_PREFIX)\n    set(CMAKE_INSTALL_PREFIX \"${CMAKE_BINARY_DIR}/install\" CACHE PATH \"Installation Directory\")\nendif()\nmessage(STATUS \"CMAKE_INSTALL_PREFIX = ${CMAKE_INSTALL_PREFIX}\")\n\nif(NOT DEFINED NCNN_VERSION)\n    string(TIMESTAMP NCNN_VERSION \"%Y%m%d\")\nendif()\n\nset(NCNN_VERSION_MAJOR 1)\nset(NCNN_VERSION_MINOR 0)\nset(NCNN_VERSION_PATCH ${NCNN_VERSION})\nset(NCNN_VERSION_STRING ${NCNN_VERSION_MAJOR}.${NCNN_VERSION_MINOR}.${NCNN_VERSION_PATCH})\nset(NCNN_VERSION_NUMBER ${NCNN_VERSION})\nmessage(STATUS \"NCNN_VERSION_STRING = ${NCNN_VERSION_STRING}\")\n\ncmake_minimum_required(VERSION 2.8.12...3.10)\n\nif(NOT CMAKE_BUILD_TYPE)\n    set(CMAKE_BUILD_TYPE release CACHE STRING \"Choose the type of build\" FORCE)\nendif()\n\nif(NOT CMAKE_VERSION VERSION_LESS \"3.15\")\n    # enable CMAKE_MSVC_RUNTIME_LIBRARY\n    cmake_policy(SET CMP0091 NEW)\nendif()\n\nif(POLICY CMP0025)\n    # reference from https://cmake.org/cmake/help/latest/policy/CMP0025.html\n    cmake_policy(SET CMP0025 NEW)\nendif()\n\nif(POLICY CMP0057)\n    # reference from https://cmake.org/cmake/help/latest/policy/CMP0057.html\n    cmake_policy(SET CMP0057 NEW)\nendif()\n\nproject(ncnn)\n\nif(MSVC AND NOT CMAKE_VERSION VERSION_LESS \"3.15\")\n    option(NCNN_BUILD_WITH_STATIC_CRT \"Enables use of statically linked CRT for statically linked ncnn\" OFF)\n    if(NCNN_BUILD_WITH_STATIC_CRT)\n        # 
cmake before version 3.15 not work\n        set(CMAKE_MSVC_RUNTIME_LIBRARY \"MultiThreaded$<$<CONFIG:Debug>:Debug>\")\n    endif()\nendif()\n\nif(CMAKE_FIND_LIBRARY_SUFFIXES_INIT)\n    # project() overwrite CMAKE_FIND_LIBRARY_SUFFIXES in toolchain, restore it\n    set(CMAKE_FIND_LIBRARY_SUFFIXES ${CMAKE_FIND_LIBRARY_SUFFIXES_INIT})\nendif()\n\noption(NCNN_SHARED_LIB \"shared library support\" OFF)\noption(NCNN_ENABLE_LTO \"enable link-time optimization\" OFF)\noption(NCNN_OPENMP \"openmp support\" ON)\noption(NCNN_STDIO \"load model from external file\" ON)\noption(NCNN_STRING \"plain and verbose string\" ON)\noption(NCNN_INSTALL_SDK \"install ncnn library and headers\" ON)\noption(NCNN_SIMPLEOCV \"minimal opencv structure emulation\" OFF)\noption(NCNN_SIMPLEOMP \"minimal openmp runtime emulation\" OFF)\noption(NCNN_SIMPLESTL \"minimal cpp stl structure emulation\" OFF)\noption(NCNN_SIMPLEMATH \"minimal cmath\" OFF)\noption(NCNN_THREADS \"build with threads\" ON)\noption(NCNN_BENCHMARK \"print benchmark information for every layer\" OFF)\noption(NCNN_C_API \"build with C api\" ON)\noption(NCNN_PLATFORM_API \"build with platform api candy\" ON)\noption(NCNN_WINXP \"build with windows xp compatibility\" OFF)\noption(NCNN_PIXEL \"convert and resize from/to image pixel\" ON)\noption(NCNN_PIXEL_ROTATE \"rotate image pixel orientation\" ON)\noption(NCNN_PIXEL_AFFINE \"warp affine image pixel\" ON)\noption(NCNN_PIXEL_DRAWING \"draw basic figure and text\" ON)\noption(NCNN_CMAKE_VERBOSE \"print verbose cmake messages\" OFF)\noption(NCNN_VULKAN \"vulkan compute support\" OFF)\noption(NCNN_SIMPLEVK \"minimal in-house vulkan loader\" ON)\noption(NCNN_SYSTEM_GLSLANG \"use system glslang library\" OFF)\noption(NCNN_RUNTIME_CPU \"runtime dispatch cpu routines\" ON)\noption(NCNN_DISABLE_PIC \"disable position-independent code\" OFF)\noption(NCNN_BUILD_TESTS \"build tests\" OFF)\noption(NCNN_COVERAGE \"build for coverage\" OFF)\noption(NCNN_ASAN \"build for address sanitizer\" 
OFF)\noption(NCNN_BUILD_BENCHMARK \"build benchmark\" ON)\noption(NCNN_PYTHON \"build python api\" OFF)\noption(NCNN_INT8 \"int8 inference\" ON)\noption(NCNN_BF16 \"bf16 inference\" ON)\noption(NCNN_FORCE_INLINE \"force inline some function\" ON)\n\nif(ANDROID OR IOS OR NCNN_SIMPLESTL)\n    option(NCNN_DISABLE_RTTI \"disable rtti\" ON)\n    option(NCNN_DISABLE_EXCEPTION \"disable exception\" ON)\nelse()\n    option(NCNN_DISABLE_RTTI \"disable rtti\" OFF)\n    option(NCNN_DISABLE_EXCEPTION \"disable exception\" OFF)\nendif()\n\nif(ANDROID OR IOS OR NCNN_SIMPLESTL OR CMAKE_CROSSCOMPILING)\n    option(NCNN_BUILD_TOOLS \"build tools\" OFF)\n    option(NCNN_BUILD_EXAMPLES \"build examples\" OFF)\nelse()\n    option(NCNN_BUILD_TOOLS \"build tools\" ON)\n    option(NCNN_BUILD_EXAMPLES \"build examples\" ON)\nendif()\n\nif(NCNN_SHARED_LIB)\n    if(NCNN_ENABLE_LTO)\n        # enable global link time optimization\n        cmake_policy(SET CMP0069 NEW)\n        set(CMAKE_POLICY_DEFAULT_CMP0069 NEW)\n        include(CheckIPOSupported)\n        check_ipo_supported(RESULT ipo_supported OUTPUT ipo_supported_output)\n        if(ipo_supported)\n            set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)\n        else()\n            message(WARNING \"IPO is not supported: ${ipo_supported_output}\")\n            set(NCNN_ENABLE_LTO OFF)\n        endif()\n    endif()\nendif()\n\nif(NOT NCNN_STDIO OR NOT NCNN_STRING)\n    if(NCNN_BUILD_TOOLS)\n        message(WARNING \"NCNN_STDIO or NCNN_STRING disabled, NCNN_BUILD_TOOLS will be turned off.\")\n        set(NCNN_BUILD_TOOLS OFF)\n    endif()\n    if(NCNN_BUILD_EXAMPLES)\n        message(WARNING \"NCNN_STDIO or NCNN_STRING disabled, NCNN_BUILD_EXAMPLES will be turned off.\")\n        set(NCNN_BUILD_EXAMPLES OFF)\n    endif()\n    if(NCNN_BUILD_BENCHMARK)\n        message(WARNING \"NCNN_STDIO or NCNN_STRING disabled, NCNN_BUILD_BENCHMARK will be turned off.\")\n        set(NCNN_BUILD_BENCHMARK OFF)\n    endif()\n    if(NCNN_BUILD_TESTS)\n    
    message(WARNING \"NCNN_STDIO or NCNN_STRING disabled, NCNN_BUILD_TESTS will be turned off.\")\n        set(NCNN_BUILD_TESTS OFF)\n    endif()\nendif()\n\n##############################################\n\ninclude(CheckCXXCompilerFlag)\nset(CMAKE_TRY_COMPILE_CONFIGURATION release)\nset(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)\n\n# gnu inline assembly in clang msvc does not work actually\nif(NOT (CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")))\n    check_cxx_source_compiles(\"int test(int a) { asm volatile(\\\"\\\" : \\\"=r\\\"(a) : \\\"0\\\"(a) : \\\"memory\\\"); return a; }\" NCNN_COMPILER_SUPPORT_GNU_INLINE_ASM)\n    if(NCNN_COMPILER_SUPPORT_GNU_INLINE_ASM)\n        option(NCNN_GNU_INLINE_ASM \"optimize platform with gnu style inline assembly\" ON)\n    else()\n        message(WARNING \"The compiler does not support gnu style inline assembly. 
NCNN_GNU_INLINE_ASM will be OFF.\")\n    endif()\nendif()\n\nif((IOS AND CMAKE_OSX_ARCHITECTURES MATCHES \"arm\")\n    OR (APPLE AND CMAKE_OSX_ARCHITECTURES MATCHES \"arm64\")\n    OR (CMAKE_SYSTEM_PROCESSOR MATCHES \"^(arm|aarch64)\")\n    OR (CMAKE_CXX_COMPILER_ARCHITECTURE_ID MATCHES \"(ARMV7|ARM64)\")\n    OR ((CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")) AND (${CMAKE_GENERATOR_PLATFORM} MATCHES \"^(arm|arm64)\")))\n    set(NCNN_TARGET_ARCH arm)\n\n    if(APPLE AND CMAKE_OSX_ARCHITECTURES STREQUAL \"arm64_32\")\n        set(NCNN_TARGET_ILP32 TRUE)\n    endif()\n\n    if(CMAKE_SIZEOF_VOID_P EQUAL 4 AND NOT NCNN_TARGET_ILP32)\n        check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat32x4_t test(float32x4_t s, float32x4_t a, float32x4_t b) { return vmlaq_f32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM_NEON)\n\n        if(NCNN_COMPILER_SUPPORT_ARM_NEON)\n            if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n                set(CMAKE_REQUIRED_FLAGS \"/arch:VFPv4\")\n                check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x4_t test(float32x4_t a) { return vcvt_f16_f32(a); }\" NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n\n                unset(CMAKE_REQUIRED_FLAGS)\n            else()\n                set(CMAKE_REQUIRED_FLAGS \"-mfpu=neon-vfpv4\")\n                check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x4_t test(float32x4_t a) { return vcvt_f16_f32(a); }\" NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n\n                if(NOT NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n                    set(CMAKE_REQUIRED_FLAGS \"-mfpu=neon-vfpv4 -mfp16-format=ieee\")\n                    check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x4_t test(float32x4_t a) { return 
vcvt_f16_f32(a); }\" NCNN_COMPILER_SUPPORT_ARM_VFPV4_FP16)\n                endif()\n\n                unset(CMAKE_REQUIRED_FLAGS)\n            endif()\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_ARM_VFPV4 OR NCNN_COMPILER_SUPPORT_ARM_VFPV4_FP16)\n            option(NCNN_VFPV4 \"optimize armv7 platform with vfpv4\" ON)\n        else()\n            message(WARNING \"The compiler does not support arm vfpv4. NCNN_VFPV4 will be OFF.\")\n        endif()\n    endif()\n\n    if(CMAKE_SIZEOF_VOID_P EQUAL 8 OR NCNN_TARGET_ILP32)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.0\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x4_t test(float32x4_t a) { return vcvt_f16_f32(a); }\" NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.2\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x8_t test(float16x8_t s, float16x8_t a, float16x8_t b) { return vfmaq_f16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_FP16)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.2\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nint32x4_t test(int32x4_t s, int8x16_t a, int8x16_t b) { return vdotq_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_DOTPROD)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.2\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat32x4_t test(float32x4_t s, float16x8_t a, float16x8_t b) { return vfmlalq_low_f16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_FP16FML)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.4\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat32x4_t test(float32x4_t s, bfloat16x8_t a, bfloat16x8_t b) { return vcvt_f32_bf16(vcvt_bf16_f32(vbfmmlaq_f32(s, a, b))); }\" NCNN_COMPILER_SUPPORT_ARM84_BF16)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.4\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nint32x4_t 
test(int32x4_t s, int8x16_t a, int8x16_t b) { return vmmlaq_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM84_I8MM)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat16_t test(svfloat16_t s, svfloat16_t a, svfloat16_t b, svbool_t bp) { return svmla_f16_z(bp, s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVE)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvint16_t test(svint16_t s, svint8_t a, svint8_t b) { return svmlslb_s16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVE2)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat32_t test(svfloat32_t s, svbfloat16_t a, svbfloat16_t b) { return svbfmmla_f32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEBF16)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvint32_t test(svint32_t s, svint8_t a, svint8_t b) { return svmmla_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEI8MM)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat32_t test(svfloat32_t s, svfloat32_t a, svfloat32_t b) { return svmmla_f32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEF32MM)\n\n            unset(CMAKE_REQUIRED_FLAGS)\n        elseif(CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.0\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x4_t test(float32x4_t a) { return vcvt_f16_f32(a); }\" NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.2 -march=armv8.2-a+fp16\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x8_t test(float16x8_t s, float16x8_t a, 
float16x8_t b) { return vfmaq_f16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_FP16)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.2 -march=armv8.2-a+dotprod\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nint32x4_t test(int32x4_t s, int8x16_t a, int8x16_t b) { return vdotq_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_DOTPROD)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.2 -march=armv8.2-a+fp16fml\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat32x4_t test(float32x4_t s, float16x8_t a, float16x8_t b) { return vfmlalq_low_f16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_FP16FML)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.4 -march=armv8.4-a+bf16\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat32x4_t test(float32x4_t s, bfloat16x8_t a, bfloat16x8_t b) { return vcvt_f32_bf16(vcvt_bf16_f32(vbfmmlaq_f32(s, a, b))); }\" NCNN_COMPILER_SUPPORT_ARM84_BF16)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.4 -march=armv8.4-a+i8mm\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nint32x4_t test(int32x4_t s, int8x16_t a, int8x16_t b) { return vmmlaq_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM84_I8MM)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6 -march=armv8.6-a+sve\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat16_t test(svfloat16_t s, svfloat16_t a, svfloat16_t b, svbool_t bp) { return svmla_f16_z(bp, s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVE)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6 -march=armv8.6-a+sve2\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvint16_t test(svint16_t s, svint8_t a, svint8_t b) { return svmlslb_s16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVE2)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6 -march=armv8.6-a+sve+bf16\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat32_t test(svfloat32_t s, svbfloat16_t a, svbfloat16_t b) { return 
svbfmmla_f32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEBF16)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6 -march=armv8.6-a+sve+i8mm\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvint32_t test(svint32_t s, svint8_t a, svint8_t b) { return svmmla_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEI8MM)\n\n            set(CMAKE_REQUIRED_FLAGS \"/arch:armv8.6 -march=armv8.6-a+sve+f32mm\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat32_t test(svfloat32_t s, svfloat32_t a, svfloat32_t b) { return svmmla_f32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEF32MM)\n\n            unset(CMAKE_REQUIRED_FLAGS)\n        else()\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8-a\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x4_t test(float32x4_t a) { return vcvt_f16_f32(a); }\" NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.2-a+fp16\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat16x8_t test(float16x8_t s, float16x8_t a, float16x8_t b) { return vfmaq_f16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_FP16)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.2-a+dotprod\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nint32x4_t test(int32x4_t s, int8x16_t a, int8x16_t b) { return vdotq_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_DOTPROD)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.2-a+fp16fml\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat32x4_t test(float32x4_t s, float16x8_t a, float16x8_t b) { return vfmlalq_low_f16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM82_FP16FML)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.4-a+bf16\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nfloat32x4_t test(float32x4_t s, bfloat16x8_t a, bfloat16x8_t b) { return vcvt_f32_bf16(vcvt_bf16_f32(vbfmmlaq_f32(s, a, b))); }\" NCNN_COMPILER_SUPPORT_ARM84_BF16)\n\n           
 set(CMAKE_REQUIRED_FLAGS \"-march=armv8.4-a+i8mm\")\n            check_cxx_source_compiles(\"#include <arm_neon.h>\\nint32x4_t test(int32x4_t s, int8x16_t a, int8x16_t b) { return vmmlaq_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM84_I8MM)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.6-a+sve\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat16_t test(svfloat16_t s, svfloat16_t a, svfloat16_t b, svbool_t bp) { return svmla_f16_z(bp, s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVE)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.6-a+sve2\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvint16_t test(svint16_t s, svint8_t a, svint8_t b) { return svmlslb_s16(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVE2)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.6-a+sve+bf16\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat32_t test(svfloat32_t s, svbfloat16_t a, svbfloat16_t b) { return svbfmmla_f32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEBF16)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.6-a+sve+i8mm\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvint32_t test(svint32_t s, svint8_t a, svint8_t b) { return svmmla_s32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEI8MM)\n\n            set(CMAKE_REQUIRED_FLAGS \"-march=armv8.6-a+sve+f32mm\")\n            check_cxx_source_compiles(\"#include <arm_sve.h>\\nsvfloat32_t test(svfloat32_t s, svfloat32_t a, svfloat32_t b) { return svmmla_f32(s, a, b); }\" NCNN_COMPILER_SUPPORT_ARM86_SVEF32MM)\n\n            unset(CMAKE_REQUIRED_FLAGS)\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n            option(NCNN_VFPV4 \"optimize aarch64 platform with vfpv4\" ON)\n        else()\n            message(WARNING \"The compiler does not support arm vfpv4. 
NCNN_VFPV4 will be OFF.\")\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_ARM82_FP16)\n            option(NCNN_ARM82 \"optimize aarch64 platform with armv8.2 fp16\" ON)\n            if(NCNN_COMPILER_SUPPORT_ARM82_DOTPROD)\n                if(NCNN_ARM82)\n                    option(NCNN_ARM82DOT \"optimize aarch64 platform with armv8.2 dotprod\" ON)\n                endif()\n            else()\n                message(WARNING \"The compiler does not support armv8.2 dotprod. NCNN_ARM82DOT will be OFF.\")\n            endif()\n            if(NCNN_COMPILER_SUPPORT_ARM82_FP16FML)\n                if(NCNN_ARM82)\n                    option(NCNN_ARM82FP16FML \"optimize aarch64 platform with armv8.2 fp16fml\" ON)\n                endif()\n            else()\n                message(WARNING \"The compiler does not support armv8.2 fp16fml. NCNN_ARM82FP16FML will be OFF.\")\n            endif()\n            if(NCNN_COMPILER_SUPPORT_ARM84_BF16)\n                if(NCNN_ARM82DOT AND NCNN_ARM82FP16FML)\n                    option(NCNN_ARM84BF16 \"optimize aarch64 platform with armv8.4 bf16\" ON)\n                endif()\n            else()\n                message(WARNING \"The compiler does not support armv8.4 bf16. NCNN_ARM84BF16 will be OFF.\")\n            endif()\n            if(NCNN_COMPILER_SUPPORT_ARM84_I8MM)\n                if(NCNN_ARM82DOT AND NCNN_ARM82FP16FML)\n                    option(NCNN_ARM84I8MM \"optimize aarch64 platform with armv8.4 i8mm\" ON)\n                endif()\n            else()\n                message(WARNING \"The compiler does not support armv8.4 i8mm. 
NCNN_ARM84I8MM will be OFF.\")\n            endif()\n            if(NCNN_COMPILER_SUPPORT_ARM86_SVE)\n                if(NCNN_ARM84BF16 AND NCNN_ARM84I8MM)\n                    option(NCNN_ARM86SVE \"optimize aarch64 platform with armv8.6 sve\" ON)\n                    if(NCNN_COMPILER_SUPPORT_ARM86_SVE2)\n                        if(NCNN_ARM86SVE)\n                            option(NCNN_ARM86SVE2 \"optimize aarch64 platform with armv8.6 sve2\" ON)\n                        endif()\n                    else()\n                        message(WARNING \"The compiler does not support armv8.6 sve2. NCNN_ARM86SVE2 will be OFF.\")\n                    endif()\n                    if(NCNN_COMPILER_SUPPORT_ARM86_SVEBF16)\n                        if(NCNN_ARM86SVE)\n                            option(NCNN_ARM86SVEBF16 \"optimize aarch64 platform with armv8.6 sve bf16\" ON)\n                        endif()\n                    else()\n                        message(WARNING \"The compiler does not support armv8.6 sve bf16. NCNN_ARM86SVEBF16 will be OFF.\")\n                    endif()\n                    if(NCNN_COMPILER_SUPPORT_ARM86_SVEI8MM)\n                        if(NCNN_ARM86SVE)\n                            option(NCNN_ARM86SVEI8MM \"optimize aarch64 platform with armv8.6 sve i8mm\" ON)\n                        endif()\n                    else()\n                        message(WARNING \"The compiler does not support armv8.6 sve i8mm. NCNN_ARM86SVEI8MM will be OFF.\")\n                    endif()\n                    if(NCNN_COMPILER_SUPPORT_ARM86_SVEF32MM)\n                        if(NCNN_ARM86SVE)\n                            option(NCNN_ARM86SVEF32MM \"optimize aarch64 platform with armv8.6 sve f32mm\" ON)\n                        endif()\n                    else()\n                        message(WARNING \"The compiler does not support armv8.6 sve f32mm. 
NCNN_ARM86SVEF32MM will be OFF.\")\n                    endif()\n                endif()\n            else()\n                message(WARNING \"The compiler does not support armv8.6 sve. NCNN_ARM86SVE will be OFF.\")\n            endif()\n        else()\n            message(WARNING \"The compiler does not support armv8.2 fp16. NCNN_ARM82 will be OFF.\")\n        endif()\n    endif()\nelseif(CMAKE_SYSTEM_PROCESSOR MATCHES \"^(mips)\")\n    set(NCNN_TARGET_ARCH mips)\n\n    check_cxx_compiler_flag(\"-mmsa\" NCNN_COMPILER_SUPPORT_MIPS_MSA)\n\n    set(CMAKE_REQUIRED_FLAGS \"-mloongson-mmi -I${CMAKE_CURRENT_SOURCE_DIR}/src/layer/mips\")\n    check_cxx_source_compiles(\"#include \\\"loongson_mmi.h\\\"\\nint32x2_t test(int16x4_t a, int16x4_t b) { return __mmi_pmaddhw(a, b); }\" NCNN_COMPILER_SUPPORT_LOONGSON_MMI)\n\n    unset(CMAKE_REQUIRED_FLAGS)\n\n    if(NCNN_COMPILER_SUPPORT_MIPS_MSA)\n        option(NCNN_MSA \"optimize mips platform with msa extension\" ON)\n    else()\n        message(WARNING \"The compiler does not support msa extension. NCNN_MSA will be OFF.\")\n    endif()\n    if(NCNN_COMPILER_SUPPORT_LOONGSON_MMI)\n        option(NCNN_MMI \"optimize mips platform with loongson mmi extension\" ON)\n    else()\n        message(WARNING \"The compiler does not support loongson mmi extension. 
NCNN_MMI will be OFF.\")\n    endif()\nelseif(CMAKE_SYSTEM_PROCESSOR MATCHES \"^(loongarch64|loongarch32)\")\n    set(NCNN_TARGET_ARCH loongarch)\n\n    set(CMAKE_REQUIRED_FLAGS \"-mlsx\")\n    check_cxx_source_compiles(\"#include <lsxintrin.h>\\n__m128 test(__m128 a, __m128 b, __m128 c) { return __lsx_vfmadd_s(a, b, c); }\" NCNN_COMPILER_SUPPORT_LOONGARCH_LSX)\n\n    set(CMAKE_REQUIRED_FLAGS \"-mlasx\")\n    check_cxx_source_compiles(\"#include <lasxintrin.h>\\n__m256 test(__m256 a, __m256 b, __m256 c) { return __lasx_xvfmadd_s(a, b, c); }\" NCNN_COMPILER_SUPPORT_LOONGARCH_LASX)\n\n    unset(CMAKE_REQUIRED_FLAGS)\n\n    if(NCNN_COMPILER_SUPPORT_LOONGARCH_LSX)\n        option(NCNN_LSX \"optimize loongarch platform with lsx extension\" ON)\n        if(NCNN_COMPILER_SUPPORT_LOONGARCH_LASX)\n            option(NCNN_LASX \"optimize loongarch platform with lasx extension\" ON)\n        else()\n            message(WARNING \"The compiler does not support lasx extension. NCNN_LASX will be OFF.\")\n        endif()\n    else()\n        message(WARNING \"The compiler does not support lsx extension. 
NCNN_LSX will be OFF.\")\n    endif()\n\nelseif(CMAKE_SYSTEM_PROCESSOR MATCHES \"^(riscv)\")\n    set(NCNN_TARGET_ARCH riscv)\n\n    if(CMAKE_SIZEOF_VOID_P EQUAL 8)\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv64gcv\")\n        check_cxx_source_compiles(\"#include <riscv_vector.h>\\nvfloat32m8_t test(vfloat32m8_t s, vfloat32m8_t w, float v, size_t vl) { return __riscv_vfmacc_vf_f32m8(s, v, w, vl); }\\nvfloat32m1x2_t test2(vfloat32m1_t x) { return __riscv_vcreate_v_f32m1x2(x, x); }\" NCNN_COMPILER_SUPPORT_RISCV_V)\n\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv64gc_zfh -D__fp16=_Float16\")\n        check_cxx_source_compiles(\"__fp16 test(__fp16 a) { return a * a; }\" NCNN_COMPILER_SUPPORT_RISCV_ZFH)\n\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv64gcv_zfh_zvfh -D__fp16=_Float16\")\n        check_cxx_source_compiles(\"#include <riscv_vector.h>\\nvfloat16m8_t test(vfloat16m8_t s, vfloat16m8_t w, __fp16 v, size_t vl) { return __riscv_vfmacc_vf_f16m8(s, v, w, vl); }\\nvfloat16m1x2_t test2(vfloat16m1_t x){ return __riscv_vcreate_v_f16m1x2(x, x); }\" NCNN_COMPILER_SUPPORT_RISCV_ZVFH)\n\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv64gc_zfh_xtheadvector -D__fp16=_Float16\")\n        check_cxx_source_compiles(\"#include <riscv_vector.h>\\nvfloat16m8_t test(vfloat16m8_t s, vfloat16m8_t w, __fp16 v, size_t vl) { return __riscv_vfmacc_vf_f16m8(s, v, w, vl); }\\nvfloat16m1x2_t test2(vfloat16m1_t x){ return __riscv_vcreate_v_f16m1x2(x, x); }\" NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n\n        unset(CMAKE_REQUIRED_FLAGS)\n\n        if(NCNN_COMPILER_SUPPORT_RISCV_V OR NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n            option(NCNN_RVV \"optimize risc-v platform with v extension\" ON)\n        else()\n            message(WARNING \"The compiler does not support risc-v v or xtheadvector extension. 
NCNN_RVV will be OFF.\")\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n            option(NCNN_XTHEADVECTOR \"optimize risc-v platform with xtheadvector extension\" ON)\n        else()\n            message(WARNING \"The compiler does not support risc-v xtheadvector extension. NCNN_XTHEADVECTOR will be OFF.\")\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_RISCV_ZFH)\n            option(NCNN_ZFH \"optimize risc-v platform with zfh extension\" ON)\n            if(NCNN_COMPILER_SUPPORT_RISCV_ZVFH OR NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n                if(NCNN_RVV AND NCNN_ZFH)\n                    option(NCNN_ZVFH \"optimize risc-v platform with zvfh extension\" ON)\n                endif()\n            else()\n                message(WARNING \"The compiler does not support zvfh extension. NCNN_ZVFH will be OFF.\")\n            endif()\n        else()\n            message(WARNING \"The compiler does not support risc-v zfh extension. NCNN_ZFH will be OFF.\")\n        endif()\n\n    elseif(CMAKE_SIZEOF_VOID_P EQUAL 4)\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv32gcv\")\n        check_cxx_source_compiles(\"#include <riscv_vector.h>\\nvfloat32m8_t test(vfloat32m8_t s, vfloat32m8_t w, float v, size_t vl) { return __riscv_vfmacc_vf_f32m8(s, v, w, vl); }\\nvfloat32m1x2_t test2(vfloat32m1_t x) { return __riscv_vcreate_v_f32m1x2(x, x); }\" NCNN_COMPILER_SUPPORT_RISCV_V)\n\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv32gc_zfh -D__fp16=_Float16\")\n        check_cxx_source_compiles(\"__fp16 test(__fp16 a) { return a * a; }\" NCNN_COMPILER_SUPPORT_RISCV_ZFH)\n\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv32gcv_zfh_zvfh -D__fp16=_Float16\")\n        check_cxx_source_compiles(\"#include <riscv_vector.h>\\nvfloat16m8_t test(vfloat16m8_t s, vfloat16m8_t w, __fp16 v, size_t vl) { return __riscv_vfmacc_vf_f16m8(s, v, w, vl); }\\nvfloat16m1x2_t test2(vfloat16m1_t x){ return __riscv_vcreate_v_f16m1x2(x, x); }\" 
NCNN_COMPILER_SUPPORT_RISCV_ZVFH)\n\n        set(CMAKE_REQUIRED_FLAGS \"-march=rv32gc_zfh_xtheadvector -D__fp16=_Float16\")\n        check_cxx_source_compiles(\"#include <riscv_vector.h>\\nvfloat16m8_t test(vfloat16m8_t s, vfloat16m8_t w, __fp16 v, size_t vl) { return __riscv_vfmacc_vf_f16m8(s, v, w, vl); }\\nvfloat16m1x2_t test2(vfloat16m1_t x){ return __riscv_vcreate_v_f16m1x2(x, x); }\" NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n\n        unset(CMAKE_REQUIRED_FLAGS)\n\n        if(NCNN_COMPILER_SUPPORT_RISCV_V OR NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n            option(NCNN_RVV \"optimize risc-v platform with v extension\" ON)\n        else()\n            message(WARNING \"The compiler does not support risc-v v or xtheadvector extension. NCNN_RVV will be OFF.\")\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n            option(NCNN_XTHEADVECTOR \"optimize risc-v platform with xtheadvector extension\" ON)\n        else()\n            message(WARNING \"The compiler does not support risc-v xtheadvector extension. NCNN_XTHEADVECTOR will be OFF.\")\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_RISCV_ZFH)\n            option(NCNN_ZFH \"optimize risc-v platform with zfh extension\" ON)\n            if(NCNN_COMPILER_SUPPORT_RISCV_ZVFH OR NCNN_COMPILER_SUPPORT_RISCV_XTHEADVECTOR)\n                if(NCNN_RVV AND NCNN_ZFH)\n                    option(NCNN_ZVFH \"optimize risc-v platform with zvfh extension\" ON)\n                endif()\n            else()\n                message(WARNING \"The compiler does not support zvfh extension. NCNN_ZVFH will be OFF.\")\n            endif()\n        else()\n            message(WARNING \"The compiler does not support risc-v zfh extension. 
NCNN_ZFH will be OFF.\")\n        endif()\n\n    endif()\nelseif(CMAKE_SYSTEM_PROCESSOR MATCHES \"^(powerpc|ppc)\")\n    set(NCNN_TARGET_ARCH powerpc)\n\n    if(NCNN_PPC64LE_VSX)\n        set(NCNN_TARGET_ARCH x86)\n\n        set(CMAKE_REQUIRED_FLAGS \"-DNO_WARN_X86_INTRINSICS -D__SSE2__\")\n        check_cxx_source_compiles(\"#include <emmintrin.h>\\n__m128i test(__m128i a, __m128i b) { return _mm_madd_epi16(a, b); }\" NCNN_COMPILER_SUPPORT_PPC64LE_SSE2)\n        unset(CMAKE_REQUIRED_FLAGS)\n\n        set(CMAKE_REQUIRED_FLAGS \"-DNO_WARN_X86_INTRINSICS -D__SSE4_1__\")\n        check_cxx_source_compiles(\"#include <smmintrin.h>\\n__m128i test(__m128i a, __m128i b) { return _mm_packus_epi32(a, b); }\" NCNN_COMPILER_SUPPORT_PPC64LE_SSE41)\n        unset(CMAKE_REQUIRED_FLAGS)\n\n        if(NCNN_COMPILER_SUPPORT_PPC64LE_SSE2)\n            option(NCNN_VSX_SSE2 \"optimize ppc64le platform with sse2 extension\" ON)\n        else()\n            message(WARNING \"The compiler does not support sse2 extension. NCNN_VSX_SSE2 will be OFF.\")\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_PPC64LE_SSE41)\n            option(NCNN_VSX_SSE41 \"optimize ppc64le platform with sse4.1 extension\" ON)\n        else()\n            message(WARNING \"The compiler does not support sse4.1 extension. 
NCNN_VSX_SSE41 will be OFF.\")\n        endif()\n    endif()\nelseif(CMAKE_SYSTEM_PROCESSOR MATCHES \"^(xtensa)\")\n    set(NCNN_TARGET_ARCH xtensa)\nelseif(CMAKE_SYSTEM_PROCESSOR MATCHES \"^(s390x)\")\n    set(NCNN_TARGET_ARCH s390x)\nelseif(CMAKE_SYSTEM_PROCESSOR MATCHES \"^(sw_64)\")\n    set(NCNN_TARGET_ARCH sw_64)\n    #sw_64 is alpha-like platform\n    set(CMAKE_C_FLAGS \"${CMAKE_C_FLAGS} -mieee\")\n    set(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -mieee\")\nelse()\n    set(NCNN_TARGET_ARCH x86)\n\n    option(NCNN_SSE2 \"optimize x86 platform with sse2 extension\" ON)\n\n    if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m256 a, __m256 b) { return _mm256_mul_ps(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m256 s, __m256 a, __m256 b) { return _mm256_fmadd_ps(a, b, s); }\" NCNN_COMPILER_SUPPORT_X86_FMA)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n#include <ammintrin.h>\\n__m128i test(__m128i s, __m128i a, __m128i b) { return _mm_maddd_epi16(a, b, s); }\" NCNN_COMPILER_SUPPORT_X86_XOP)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m128i a) { return _mm256_cvtph_ps(a); }\" NCNN_COMPILER_SUPPORT_X86_F16C)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i a, __m256i b) { return _mm256_madd_epi16(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX2)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512i test(__m512i a, __m512i b) { return _mm512_madd_epi16(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX512)\n\n        set(CMAKE_REQUIRED_FLAGS 
\"/arch:AVX2\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpwssd_avx_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpbssd_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT8)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpwsud_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT16)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m128bh test(__m256 a) { return _mm256_cvtneps_avx_pbh(a); }\" NCNN_COMPILER_SUPPORT_X86_AVX_NE_CONVERT)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512i test(__m512i s, __m512i a, __m512i b) { return _mm512_dpwssd_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_VNNI)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256bh test(__m256bh s, __m512bh a, __m512bh b) { return _mm512_cvtneps_pbh(_mm512_dpbf16_ps(_mm512_cvtpbh_ps(s), a, b)); }\\n__m512i test2(__m512 a) { __m256i _a = (__m256i)_mm512_cvtneps_pbh(a); return _mm512_inserti32x8(_mm512_castsi256_si512(_a), _a, 1); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_BF16)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512h test(__m512h s, __m512h a, __m512h b) { return _mm512_fmadd_ph(s, a, b); }\\n__m512 test2(__m512 a) { return _mm512_cvtxph_ps(_mm512_cvtxps_ph(a)); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_FP16)\n\n        unset(CMAKE_REQUIRED_FLAGS)\n    elseif(CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND 
CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")\n        check_cxx_compiler_flag(\"-mrecip=none\" NCNN_COMPILER_SUPPORT_X86_RECIP_NONE)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m256 a, __m256 b) { return _mm256_mul_ps(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX -mfma -mf16c\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m256 s, __m256 a, __m256 b) { return _mm256_fmadd_ps(a, b, s); }\" NCNN_COMPILER_SUPPORT_X86_FMA)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX -mxop\")\n        check_cxx_source_compiles(\"#include <x86intrin.h>\\n__m128i test(__m128i s, __m128i a, __m128i b) { return _mm_maddd_epi16(a, b, s); }\" NCNN_COMPILER_SUPPORT_X86_XOP)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX -mf16c\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m128i a) { return _mm256_cvtph_ps(a); }\" NCNN_COMPILER_SUPPORT_X86_F16C)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2 -mfma -mf16c\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i a, __m256i b) { return _mm256_madd_epi16(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX2)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512 -mfma -mf16c -mavx512cd -mavx512bw -mavx512dq -mavx512vl\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512i test(__m512i a, __m512i b) { return _mm512_madd_epi16(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX512)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2 -mfma -mf16c -mavxvnni\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpwssd_avx_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2 -mfma -mf16c -mavxvnni -mavxvnniint8\")\n        check_cxx_source_compiles(\"#include 
<immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpbssd_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT8)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2 -mfma -mf16c -mavxvnni -mavxvnniint16\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpwsud_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT16)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX2 -mfma -mf16c -mavxneconvert\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m128bh test(__m256 a) { return _mm256_cvtneps_avx_pbh(a); }\" NCNN_COMPILER_SUPPORT_X86_AVX_NE_CONVERT)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512 -mfma -mf16c -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mavx512vnni\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512i test(__m512i s, __m512i a, __m512i b) { return _mm512_dpwssd_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_VNNI)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512 -mfma -mf16c -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mavx512bf16\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256bh test(__m256bh s, __m512bh a, __m512bh b) { return _mm512_cvtneps_pbh(_mm512_dpbf16_ps(_mm512_cvtpbh_ps(s), a, b)); }\\n__m512i test2(__m512 a) { __m256i _a = (__m256i)_mm512_cvtneps_pbh(a); return _mm512_inserti32x8(_mm512_castsi256_si512(_a), _a, 1); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_BF16)\n\n        set(CMAKE_REQUIRED_FLAGS \"/arch:AVX512 -mfma -mf16c -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mavx512fp16\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512h test(__m512h s, __m512h a, __m512h b) { return _mm512_fmadd_ph(s, a, b); }\\n__m512 test2(__m512 a) { return _mm512_cvtxph_ps(_mm512_cvtxps_ph(a)); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_FP16)\n\n        unset(CMAKE_REQUIRED_FLAGS)\n    else()\n        check_cxx_compiler_flag(\"-mrecip=none\" 
NCNN_COMPILER_SUPPORT_X86_RECIP_NONE)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mavx\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m256 a, __m256 b) { return _mm256_mul_ps(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m256 s, __m256 a, __m256 b) { return _mm256_fmadd_ps(a, b, s); }\" NCNN_COMPILER_SUPPORT_X86_FMA)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mxop\")\n        check_cxx_source_compiles(\"#include <x86intrin.h>\\n__m128i test(__m128i s, __m128i a, __m128i b) { return _mm_maddd_epi16(a, b, s); }\" NCNN_COMPILER_SUPPORT_X86_XOP)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mf16c\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256 test(__m128i a) { return _mm256_cvtph_ps(a); }\" NCNN_COMPILER_SUPPORT_X86_F16C)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx2\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i a, __m256i b) { return _mm256_madd_epi16(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX2)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512i test(__m512i a, __m512i b) { return _mm512_madd_epi16(a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX512)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx2 -mavxvnni\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpwssd_avx_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx2 -mavxvnni -mavxvnniint8\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpbssd_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT8)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c 
-mavx2 -mavxvnni -mavxvnniint16\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256i test(__m256i s, __m256i a, __m256i b) { return _mm256_dpwsud_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT16)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx2 -mavxneconvert\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m128bh test(__m256 a) { return _mm256_cvtneps_avx_pbh(a); }\" NCNN_COMPILER_SUPPORT_X86_AVX_NE_CONVERT)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mavx512vnni\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512i test(__m512i s, __m512i a, __m512i b) { return _mm512_dpwssd_epi32(s, a, b); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_VNNI)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mavx512bf16\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m256bh test(__m256bh s, __m512bh a, __m512bh b) { return _mm512_cvtneps_pbh(_mm512_dpbf16_ps(_mm512_cvtpbh_ps(s), a, b)); }\\n__m512i test2(__m512 a) { __m256i _a = (__m256i)_mm512_cvtneps_pbh(a); return _mm512_inserti32x8(_mm512_castsi256_si512(_a), _a, 1); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_BF16)\n\n        set(CMAKE_REQUIRED_FLAGS \"-mfma -mf16c -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mavx512fp16\")\n        check_cxx_source_compiles(\"#include <immintrin.h>\\n__m512h test(__m512h s, __m512h a, __m512h b) { return _mm512_fmadd_ph(s, a, b); }\\n__m512 test2(__m512 a) { return _mm512_cvtxph_ps(_mm512_cvtxps_ph(a)); }\" NCNN_COMPILER_SUPPORT_X86_AVX512_FP16)\n\n        unset(CMAKE_REQUIRED_FLAGS)\n    endif()\n\n    if(NOT CMAKE_SYSTEM_NAME MATCHES \"Emscripten|WASI\" AND NCNN_COMPILER_SUPPORT_X86_AVX)\n        option(NCNN_AVX \"optimize x86 platform with avx extension\" ON)\n        if(NCNN_COMPILER_SUPPORT_X86_FMA)\n            if(NCNN_AVX)\n                option(NCNN_FMA \"optimize x86 platform 
with fma extension\" ON)\n            endif()\n        else()\n            message(WARNING \"The compiler does not support fma extension. NCNN_FMA will be OFF.\")\n        endif()\n        if(NCNN_COMPILER_SUPPORT_X86_XOP)\n            if(NCNN_AVX)\n                option(NCNN_XOP \"optimize x86 platform with xop extension\" ON)\n            endif()\n        else()\n            message(WARNING \"The compiler does not support xop extension. NCNN_XOP will be OFF.\")\n        endif()\n        if(NCNN_COMPILER_SUPPORT_X86_F16C)\n            if(NCNN_AVX)\n                option(NCNN_F16C \"optimize x86 platform with f16c extension\" ON)\n            endif()\n        else()\n            message(WARNING \"The compiler does not support f16c extension. NCNN_F16C will be OFF.\")\n        endif()\n        if(NCNN_COMPILER_SUPPORT_X86_AVX2)\n            if(NCNN_AVX)\n                option(NCNN_AVX2 \"optimize x86 platform with avx2 extension\" ON)\n            endif()\n            if(NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)\n                if(NCNN_AVX2)\n                    option(NCNN_AVXVNNI \"optimize x86 platform with avx vnni extension\" ON)\n                endif()\n                if(NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT8)\n                    if(NCNN_AVXVNNI)\n                        option(NCNN_AVXVNNIINT8 \"optimize x86 platform with avx vnni int8 extension\" ON)\n                    endif()\n                else()\n                    message(WARNING \"The compiler does not support avx vnni int8 extension. NCNN_AVXVNNIINT8 will be OFF.\")\n                endif()\n                if(NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT16)\n                    if(NCNN_AVXVNNI)\n                        option(NCNN_AVXVNNIINT16 \"optimize x86 platform with avx vnni int16 extension\" ON)\n                    endif()\n                else()\n                    message(WARNING \"The compiler does not support avx vnni int16 extension. 
NCNN_AVXVNNIINT16 will be OFF.\")\n                endif()\n            else()\n                message(WARNING \"The compiler does not support avx vnni extension. NCNN_AVXVNNI will be OFF.\")\n            endif()\n            if(NCNN_COMPILER_SUPPORT_X86_AVX_NE_CONVERT)\n                if(NCNN_AVX2)\n                    option(NCNN_AVXNECONVERT \"optimize x86 platform with avx ne convert extension\" ON)\n                endif()\n            else()\n                message(WARNING \"The compiler does not support avx ne convert extension. NCNN_AVXNECONVERT will be OFF.\")\n            endif()\n            if(NCNN_COMPILER_SUPPORT_X86_AVX512)\n                if(NCNN_AVX2)\n                    option(NCNN_AVX512 \"optimize x86 platform with avx512 extension\" ON)\n                endif()\n                if(NCNN_COMPILER_SUPPORT_X86_AVX512_VNNI)\n                    if(NCNN_AVX512)\n                        option(NCNN_AVX512VNNI \"optimize x86 platform with avx512 vnni extension\" ON)\n                    endif()\n                else()\n                    message(WARNING \"The compiler does not support avx512 vnni extension. NCNN_AVX512VNNI will be OFF.\")\n                endif()\n                if(NCNN_COMPILER_SUPPORT_X86_AVX512_BF16)\n                    if(NCNN_AVX512)\n                        option(NCNN_AVX512BF16 \"optimize x86 platform with avx512 bf16 extension\" ON)\n                    endif()\n                else()\n                    message(WARNING \"The compiler does not support avx512 bf16 extension. NCNN_AVX512BF16 will be OFF.\")\n                endif()\n                if(NCNN_COMPILER_SUPPORT_X86_AVX512_FP16)\n                    if(NCNN_AVX512)\n                        option(NCNN_AVX512FP16 \"optimize x86 platform with avx512 fp16 extension\" ON)\n                    endif()\n                else()\n                    message(WARNING \"The compiler does not support avx512 fp16 extension. 
NCNN_AVX512FP16 will be OFF.\")\n                endif()\n            else()\n                message(WARNING \"The compiler does not support avx512 extension. NCNN_AVX512 will be OFF.\")\n            endif()\n        else()\n            message(WARNING \"The compiler does not support avx2 extension. NCNN_AVX2 will be OFF.\")\n        endif()\n    else()\n        message(WARNING \"The compiler does not support avx extension. NCNN_AVX will be OFF.\")\n    endif()\nendif()\n\nunset(CMAKE_TRY_COMPILE_CONFIGURATION)\nunset(CMAKE_TRY_COMPILE_TARGET_TYPE)\n\nif(NCNN_TARGET_ILP32)\n    message(STATUS \"Target arch: ${NCNN_TARGET_ARCH} 64bit ilp32\")\nelseif(CMAKE_SIZEOF_VOID_P EQUAL 8)\n    message(STATUS \"Target arch: ${NCNN_TARGET_ARCH} 64bit\")\nelse()\n    message(STATUS \"Target arch: ${NCNN_TARGET_ARCH} 32bit\")\nendif()\n\n##############################################\n\n# set cmake default folder name\nset_property(GLOBAL PROPERTY USE_FOLDERS ON)\nset_property(GLOBAL PROPERTY PREDEFINED_TARGETS_FOLDER \"cmake\")\n\nif(CMAKE_SYSTEM_NAME STREQUAL \"Emscripten\")\n    set(CMAKE_C_FLAGS \"${CMAKE_C_FLAGS} -s FORCE_FILESYSTEM=1 -s INITIAL_MEMORY=256MB -s EXIT_RUNTIME=1\")\n    set(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -s FORCE_FILESYSTEM=1 -s INITIAL_MEMORY=256MB -s EXIT_RUNTIME=1\")\n    set(CMAKE_EXE_LINKER_FLAGS \"${CMAKE_EXE_LINKER_FLAGS} -s FORCE_FILESYSTEM=1 -s INITIAL_MEMORY=256MB -s EXIT_RUNTIME=1\")\n\n    if(NCNN_OPENMP AND NCNN_SIMPLEOMP)\n        # TODO better flags for emscripten\n        # node --experimental-wasm-threads xxx.js\n        set(CMAKE_C_FLAGS \"${CMAKE_C_FLAGS} -s USE_PTHREADS=1 -s PTHREAD_POOL_SIZE=15\")\n        set(CMAKE_CXX_FLAGS \"${CMAKE_CXX_FLAGS} -s USE_PTHREADS=1 -s PTHREAD_POOL_SIZE=15\")\n        set(CMAKE_EXE_LINKER_FLAGS \"${CMAKE_EXE_LINKER_FLAGS} -s USE_PTHREADS=1 -s PTHREAD_POOL_SIZE=15\")\n    endif()\nendif()\n\nif(NCNN_VULKAN)\n    if(NCNN_SYSTEM_GLSLANG)\n        find_package(Threads)\n        
find_package(SPIRV-Tools QUIET)\n        find_package(SPIRV-Tools-opt QUIET)\n        find_package(glslang QUIET)\n        if(glslang_FOUND)\n            add_library(glslang ALIAS glslang::glslang)\n            add_library(SPIRV ALIAS glslang::SPIRV)\n        else()\n            set(GLSLANG_TARGET_DIR \"GLSLANG-NOTFOUND\" CACHE PATH \"Absolute path to glslangTargets.cmake directory\")\n            if(NOT GLSLANG_TARGET_DIR AND NOT DEFINED ENV{GLSLANG_TARGET_DIR})\n                message(WARNING \"set glslang_DIR to glslang-config.cmake directory for using system glslang.\")\n                message(WARNING \"GLSLANG_TARGET_DIR must be defined! NCNN_SYSTEM_GLSLANG will be turned off.\")\n                set(NCNN_SYSTEM_GLSLANG OFF)\n            else()\n                include(\"${GLSLANG_TARGET_DIR}/OSDependentTargets.cmake\")\n                include(\"${GLSLANG_TARGET_DIR}/OGLCompilerTargets.cmake\")\n                if(EXISTS \"${GLSLANG_TARGET_DIR}/HLSLTargets.cmake\")\n                    # hlsl support can be optional\n                    include(\"${GLSLANG_TARGET_DIR}/HLSLTargets.cmake\")\n                endif()\n                include(\"${GLSLANG_TARGET_DIR}/glslangTargets.cmake\")\n                include(\"${GLSLANG_TARGET_DIR}/SPIRVTargets.cmake\")\n            endif()\n        endif()\n\n        if(TARGET glslang AND TARGET SPIRV)\n            get_property(glslang_location TARGET glslang PROPERTY LOCATION)\n            get_property(SPIRV_location TARGET SPIRV PROPERTY LOCATION)\n            message(STATUS \"Found glslang: ${glslang_location} (found version \\\"${glslang_VERSION}\\\")\")\n            message(STATUS \"Found SPIRV: ${SPIRV_location} (found version \\\"${glslang_VERSION}\\\")\")\n        else()\n            message(WARNING \"glslang or SPIRV target not found! 
NCNN_SYSTEM_GLSLANG will be turned off.\")\n            set(NCNN_SYSTEM_GLSLANG OFF)\n        endif()\n    endif()\n\n    if(NOT NCNN_SYSTEM_GLSLANG)\n        if(NOT EXISTS \"${CMAKE_CURRENT_SOURCE_DIR}/glslang/CMakeLists.txt\")\n            message(FATAL_ERROR \"The submodules were not downloaded! Please update submodules with \\\"git submodule update --init\\\" and try again.\")\n        else()\n            # glslang requires c++11\n            set(CMAKE_CXX_STANDARD 11)\n\n            option(BUILD_EXTERNAL \"\" OFF)\n            option(ENABLE_SPVREMAPPER \"\" OFF)\n            option(ENABLE_GLSLANG_BINARIES \"\" OFF)\n            option(ENABLE_HLSL \"\" OFF)\n            option(ENABLE_RTTI \"\" OFF)\n            option(ENABLE_EXCEPTIONS \"\" OFF)\n            option(ENABLE_OPT \"\" OFF)\n            option(ENABLE_PCH \"\" OFF)\n            option(GLSLANG_TESTS \"\" OFF)\n            if(NCNN_SHARED_LIB)\n                option(GLSLANG_ENABLE_INSTALL \"\" OFF)\n            else()\n                option(GLSLANG_ENABLE_INSTALL \"\" ON)\n            endif()\n            add_subdirectory(glslang)\n            if(NCNN_SHARED_LIB)\n                if(CMAKE_CXX_COMPILER_ID MATCHES \"GNU\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND NOT CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n                    target_compile_options(glslang PRIVATE -fvisibility=hidden -fvisibility-inlines-hidden)\n                    target_compile_options(glslang-default-resource-limits PRIVATE -fvisibility=hidden -fvisibility-inlines-hidden)\n                endif()\n                if(NCNN_ENABLE_LTO)\n                    set_target_properties(glslang PROPERTIES INTERPROCEDURAL_OPTIMIZATION ON)\n                    set_target_properties(glslang-default-resource-limits PROPERTIES INTERPROCEDURAL_OPTIMIZATION ON)\n                endif()\n            endif()\n        endif()\n    endif()\nendif()\n\nadd_subdirectory(src)\nif(NCNN_BUILD_BENCHMARK)\n    
add_subdirectory(benchmark)\nendif()\nif(NCNN_BUILD_EXAMPLES)\n    add_subdirectory(examples)\nendif()\nif(NCNN_BUILD_TOOLS)\n    add_subdirectory(tools)\nendif()\nif(NCNN_BUILD_TESTS)\n    enable_testing()\n    add_subdirectory(tests)\n    add_subdirectory(tests/perf)\nendif()\nif(NCNN_PYTHON)\n    add_subdirectory(python)\nendif()\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "\n# Acknowledgements\n\n- Thanks to bug1989 [https://github.com/bug1989] for contributing the initial quantized int8 inference code and a large variety of device benchmark\n- Thanks to zhiliu6 [https://github.com/zhiliu6] for contributing the darknet conversion tool, operators and YOLO examples\n- Thanks to Tijmen Verhulsdonck [https://github.com/Timen] for contributing the massive AVX optimization for x86 platform\n"
  },
  {
    "path": "Info.plist",
    "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE plist PUBLIC \"-//Apple//DTD PLIST 1.0//EN\" \"http://www.apple.com/DTDs/PropertyList-1.0.dtd\">\n<plist version=\"1.0\">\n<dict>\n    <key>CFBundleName</key>\n    <string>__NAME__</string>\n    <key>CFBundleIdentifier</key>\n    <string>__IDENTIFIER__</string>\n    <key>CFBundleVersion</key>\n    <string>__VERSION__</string>\n    <key>CFBundleShortVersionString</key>\n    <string>__VERSION__</string>\n    <key>CFBundleSignature</key>\n    <string>????</string>\n    <key>CFBundlePackageType</key>\n    <string>FMWK</string>\n</dict>\n</plist>\n"
  },
  {
    "path": "LICENSE.txt",
    "content": "Tencent is pleased to support the open source community by making ncnn available.\nCopyright (C) 2017 Tencent.  All rights reserved.\nIf you have downloaded a copy of the ncnn binary from Tencent, please note that the ncnn binary is licensed under the BSD 3-Clause License.\nIf you have downloaded a copy of the ncnn source code from Tencent, please note that ncnn source code is licensed under the BSD 3-Clause License, except for the third-party components listed below which are subject to different license terms.  Your integration of ncnn into your own projects may require compliance with the BSD 3-Clause License, as well as the other licenses applicable to the third-party components included within ncnn.\nA copy of the BSD 3-Clause License is included in this file.\n\nOther dependencies and licenses:\n\nOpen Source Software Licensed Under the zlib License:\nThe below software in this distribution may have been modified by Tencent (“Tencent Modifications”). All Tencent Modifications are Copyright (C) 2017 Tencent.\n----------------------------------------------------------------------------------------\n1. neon_mathfun.h\nCopyright (C) 2011 Julien Pommier\n\n2. sse_mathfun.h\nCopyright (C) 2007 Julien Pommier\n\n3. avx_mathfun.h\nCopyright (C) 2012 Giovanni Garberoglio\nInterdisciplinary Laboratory for Computational Science (LISC)\nFondazione Bruno Kessler and University of Trento\nvia Sommarive, 18\nI-38123 Trento (Italy)\n\n\nTerms of the zlib License:\n---------------------------------------------------\nCopyright (c) <year> <copyright holders>\n\nThis software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.\n\nPermission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:\n\n1. 
The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.\n2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.\n3. This notice may not be removed or altered from any source distribution.\n\n\n\nOpen Source Software Licensed Under the BSD 2-Clause License:\nThe below software in this distribution may have been modified by Tencent (“Tencent Modifications”). All Tencent Modifications are Copyright (C) 2017 Tencent.\n----------------------------------------------------------------------------------------\n1. squeezenet  1.1\nCopyright (c) 2016 Forrest N. Iandola and Matthew W. Moskewicz and Khalid Ashraf and Song Han and William J. Dally and Kurt Keutzer\nAll rights reserved.\n\n2. caffe.proto  master\nAll contributions by the University of California:\nCopyright (c) 2014-2017 The Regents of the University of California (Regents)\nAll rights reserved.\n\nAll other contributions:\nCopyright (c) 2014-2017, the respective contributors\nAll rights reserved.\n\n\nTerms of the BSD 2-Clause License:\n--------------------------------------------------------------------\nRedistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:\n\nRedistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.\nRedistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND 
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n\n\n\nOpen Source Software Licensed Under the BSD 3-Clause License:\nThe below software in this distribution may have been modified by Tencent (“Tencent Modifications”). All Tencent Modifications are Copyright (C) 2017 Tencent.\n----------------------------------------------------------------------------------------\n1. android.toolchain.cmake  master\nCopyright (c) 2010-2011, Ethan Rublee\nCopyright (c) 2011-2014, Andrey Kamaev\nAll rights reserved.\n\n\nTerms of the BSD 3-Clause License:\n--------------------------------------------------------------------\n\nRedistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:\n\nRedistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.\nRedistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.\nNeither the name of [copyright holder] nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "recursive-include cmake *\n\nrecursive-include glslang *\nprune glslang/Test\n\nrecursive-include src *\n\nrecursive-include python *\nprune python/pybind11/tests\n\ninclude CMakeLists.txt\n"
  },
  {
    "path": "README.md",
    "content": "![ncnn](https://raw.githubusercontent.com/Tencent/ncnn/master/images/256-ncnn.png)\n\n# ncnn\n\n[![License](https://img.shields.io/badge/license-BSD_3_Clause-blue.svg?style=for-the-badge)](LICENSE.txt)\n[![Download Total Count](https://img.shields.io/github/downloads/Tencent/ncnn/total.svg?style=for-the-badge)](https://github.com/Tencent/ncnn/releases)\n[![codecov](https://img.shields.io/codecov/c/github/Tencent/ncnn/master?style=for-the-badge)](https://codecov.io/gh/Tencent/ncnn)\n\nncnn is a high-performance neural network inference computing framework optimized for mobile platforms.\nncnn is deeply considerate about deployment and uses on mobile phones from the beginning of design.\nncnn does not have third-party dependencies.\nIt is cross-platform and runs faster than all known open-source frameworks on mobile phone cpu.\nDevelopers can easily deploy deep learning algorithm models to the mobile platform by using efficient ncnn implementation, creating intelligent APPs, and bringing artificial intelligence to your fingertips.\nncnn is currently being used in many Tencent applications, such as QQ, Qzone, WeChat, Pitu, and so on.\n\nncnn 是一个为手机端极致优化的高性能神经网络前向计算框架。\nncnn 从设计之初深刻考虑手机端的部署和使用。\n无第三方依赖，跨平台，手机端 cpu 的速度快于目前所有已知的开源框架。\n基于 ncnn，开发者能够将深度学习算法轻松移植到手机端高效执行，\n开发出人工智能 APP，将 AI 带到你的指尖。\nncnn 目前已在腾讯多款应用中使用，如：QQ，Qzone，微信，天天 P 图等。\n\n---\n\n<table>\n<tr>\n<td>\n<b>技术交流 QQ 群</b><br />\n637093648 (超多大佬)<br />\n答案：卷卷卷卷卷（已满）\n</td>\n<td rowspan=3>\n<b>Telegram Group</b>\n\n<https://t.me/ncnnyes>\n</td>\n<td rowspan=3>\n<b>Discord Channel</b>\n\n<https://discord.gg/YRsxgmF>\n</td>\n</tr>\n<tr>\n<td>\n<b>Pocky QQ 群（MLIR YES!）</b><br />\n677104663 (超多大佬)<br />\n答案：multi-level intermediate representation\n</td>\n</tr>\n<tr>\n<td>\n<b>他们都不知道 pnnx 有多好用群</b><br />\n818998520 (新群！)\n</td>\n</tr>\n</table>\n\n---\n\n## Download & Build status\n\nhttps://github.com/Tencent/ncnn/releases/latest\n\n\n<table>\n<tr>\n<td rowspan=2>\n  <img 
src=\"https://user-images.githubusercontent.com/25181517/192108372-f71d70ac-7ae6-4c0d-8395-51d8870c2ef0.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n  **[how to build ncnn library](https://github.com/Tencent/ncnn/wiki/how-to-build) on Linux / Windows / macOS / Raspberry Pi3, Pi4 / POWER / Android / NVIDIA Jetson / iOS / WebAssembly / AllWinner D1 / Loongson 2K1000**\n\n</td>\n</tr>\n<tr>\n<td>Source</td>\n<td colspan=2>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-full-source.zip)\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=3>\n  <img src=\"https://user-images.githubusercontent.com/25181517/117269608-b7dcfb80-ae58-11eb-8e66-6cc8753553f0.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for Android](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-android)\n- [Build for Termux on Android](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-termux-on-android)\n\n</td>\n</tr>\n<tr>\n<td>Android</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-android-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-android.zip)\n\n</td>\n<td rowspan=2>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/android.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Aandroid)\n\n</td>\n</tr>\n<tr>\n<td>Android shared</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-android-vulkan-shared.zip)\n  [<img 
src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-android-shared.zip)\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=3>\n  <img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/HMOS_Logo_Icon.svg/240px-HMOS_Logo_Icon.svg.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for HarmonyOS with cross-compiling](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-harmonyos-with-cross-compiling)\n\n</td>\n</tr>\n<tr>\n<td>HarmonyOS</td>\n<td>\n\n</td>\n<td rowspan=2>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/harmonyos.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Aharmonyos)\n\n</td>\n</tr>\n<tr>\n<td>HarmonyOS shared</td>\n<td>\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=3>\n  <img src=\"https://user-images.githubusercontent.com/25181517/121406611-a8246b80-c95e-11eb-9b11-b771486377f6.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for iOS on macOS with xcode](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-ios-on-macos-with-xcode)\n\n</td>\n</tr>\n<tr>\n<td>iOS</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ios-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ios.zip)\n\n</td>\n<td rowspan=2>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/ios.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Aios)\n\n</td>\n</tr>\n<tr>\n<td>iOS-Simulator</td>\n<td>\n\n  [<img 
src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ios-simulator-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ios-simulator.zip)\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=10>\n  <img src=\"https://user-images.githubusercontent.com/25181517/186884152-ae609cca-8cf1-4175-8d60-1ce1fa078ca2.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for macOS](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-macos)\n\n</td>\n</tr>\n<tr>\n<td>macOS</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-macos-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-macos.zip)\n\n</td>\n<td rowspan=1>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/macos.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Amacos)\n\n</td>\n</tr>\n<tr>\n<td>Mac-Catalyst</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-mac-catalyst-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-mac-catalyst.zip)\n\n</td>\n<td rowspan=1>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/mac-catalyst.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Amac-catalyst)\n\n</td>\n</tr>\n<tr>\n<td>watchOS</td>\n<td>\n\n  [<img 
src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-watchos.zip)\n\n</td>\n<td rowspan=2>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/watchos.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Awatchos)\n\n</td>\n</tr>\n<tr>\n<td>watchOS-Simulator</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-watchos-simulator.zip)\n\n</td>\n</tr>\n<tr>\n<td>tvOS</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-tvos-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-tvos.zip)\n\n</td>\n<td rowspan=2>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/tvos.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Atvos)\n\n</td>\n</tr>\n<tr>\n<td>tvOS-Simulator</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-tvos-simulator-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-tvos-simulator.zip)\n\n</td>\n</tr>\n<tr>\n<td>visionOS</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-visionos-vulkan.zip)\n  [<img 
src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-visionos.zip)\n\n</td>\n<td rowspan=2>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/visionos.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Avisionos)\n\n</td>\n</tr>\n<tr>\n<td>visionOS-Simulator</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-visionos-simulator-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-visionos-simulator.zip)\n\n</td>\n</tr>\n<tr>\n<td>Apple xcframework</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-apple-vulkan.zip)\n  [<img src=\"https://img.shields.io/badge/+cpuonly-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-apple.zip)\n\n</td>\n<td rowspan=1>\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=3>\n  <img src=\"https://user-images.githubusercontent.com/25181517/186884153-99edc188-e4aa-4c84-91b0-e2df260ebc33.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for Linux / NVIDIA Jetson / Raspberry Pi3, Pi4 / POWER](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-linux)\n\n</td>\n</tr>\n<tr>\n<td>Ubuntu 22.04</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ubuntu-2204.zip)\n  [<img src=\"https://img.shields.io/badge/+shared-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ubuntu-2204-shared.zip)\n\n</td>\n<td 
rowspan=2>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-x64-gpu-gcc.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-x64-gpu-gcc)\n\n</td>\n</tr>\n<tr>\n<td>Ubuntu 24.04</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ubuntu-2404.zip)\n  [<img src=\"https://img.shields.io/badge/+shared-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-ubuntu-2404-shared.zip)\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=5>\n  <img alt=\"windows\" src=\"https://user-images.githubusercontent.com/25181517/186884150-05e9ff6d-340e-4802-9533-2c3f02363ee3.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for Windows x64 using VS2017](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-windows-x64-using-visual-studio-community-2017)\n- [Build for Windows x64 using MinGW-w64](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-windows-x64-using-mingw-w64)\n\n</td>\n</tr>\n<tr>\n<td>VS2015</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2015.zip)\n  [<img src=\"https://img.shields.io/badge/+shared-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2015-shared.zip)\n\n</td>\n<td rowspan=4>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/windows.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Awindows)\n\n</td>\n</tr>\n<tr>\n<td>VS2017</td>\n<td>\n\n  [<img 
src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2017.zip)\n  [<img src=\"https://img.shields.io/badge/+shared-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2017-shared.zip)\n\n</td>\n</tr>\n<tr>\n<td>VS2019</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2019.zip)\n  [<img src=\"https://img.shields.io/badge/+shared-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2019-shared.zip)\n\n</td>\n</tr>\n<tr>\n<td>VS2022</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2022.zip)\n  [<img src=\"https://img.shields.io/badge/+shared-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-windows-vs2022-shared.zip)\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=2>\n  <img src=\"https://user-images.githubusercontent.com/25181517/188324036-d704ac9a-6e61-4722-b978-254b25b61bed.png\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for WebAssembly](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-webassembly)\n\n</td>\n</tr>\n<tr>\n<td>WebAssembly</td>\n<td>\n\n  [<img src=\"https://img.shields.io/badge/download-blue?style=for-the-badge\">](https://github.com/Tencent/ncnn/releases/latest/download/ncnn-20260113-webassembly.zip)\n\n</td>\n<td>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/web-assembly.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Aweb-assembly)\n\n</td>\n</tr>\n\n<tr>\n<td rowspan=8>\n  <img 
src=\"https://github.com/marwin1991/profile-technology-icons/assets/76662862/2481dc48-be6b-4ebb-9e8c-3b957efe69fa\" width=\"120\" height=\"auto\">\n</td>\n<td colspan=3>\n\n- [Build for ARM Cortex-A family with cross-compiling](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-arm-cortex-a-family-with-cross-compiling)\n- [Build for Hisilicon platform with cross-compiling](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-hisilicon-platform-with-cross-compiling)\n- [Build for AllWinner D1](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-allwinner-d1)\n- [Build for Loongson 2K1000](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-loongson-2k1000)\n- [Build for QNX](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-qnx)\n\n</td>\n</tr>\n<tr>\n<td>Linux (arm)</td>\n<td></td>\n<td>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-arm.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-arm)\n\n</td>\n</tr>\n<tr>\n<td>Linux (aarch64)</td>\n<td></td>\n<td>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-aarch64.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-aarch64)\n\n</td>\n</tr>\n<tr>\n<td>Linux (mips)</td>\n<td></td>\n<td>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-mips.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-mips)\n\n</td>\n</tr>\n<tr>\n<td>Linux (mips64)</td>\n<td></td>\n<td>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-mips64.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-mips64)\n\n</td>\n</tr>\n<tr>\n<td>Linux (ppc64)</td>\n<td></td>\n<td>\n\n  [<img 
src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-ppc64.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-ppc64)\n\n</td>\n</tr>\n<tr>\n<td>Linux (riscv64)</td>\n<td></td>\n<td>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-riscv64.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-riscv64)\n\n</td>\n</tr>\n<tr>\n<td>Linux (loongarch64)</td>\n<td></td>\n<td>\n\n  [<img src=\"https://img.shields.io/github/actions/workflow/status/Tencent/ncnn/linux-loongarch64.yml?branch=master&style=for-the-badge&label=build\">](https://github.com/Tencent/ncnn/actions?query=workflow%3Alinux-loongarch64)\n\n</td>\n</tr>\n\n</table>\n\n\n---\n\n## Support most commonly used CNN network\n\n## 支持大部分常用的 CNN 网络\n\n- Classical CNN:\n  [VGG](https://github.com/BVLC/caffe/wiki/Model-Zoo#models-used-by-the-vgg-team-in-ilsvrc-2014)\n  [AlexNet](https://github.com/BVLC/caffe/tree/9b891540183ddc834a02b2bd81b31afae71b2153/models/bvlc_alexnet)\n  [GoogleNet](https://github.com/BVLC/caffe/tree/9b891540183ddc834a02b2bd81b31afae71b2153/models/bvlc_googlenet)\n  Inception\n  ...\n- Practical CNN:\n  [ResNet](https://github.com/tornadomeet/ResNet)\n  [DenseNet](https://github.com/liuzhuang13/DenseNet)\n  [SENet](https://github.com/hujie-frank/SENet)\n  [FPN](https://github.com/unsky/FPN)\n  ...\n- Light-weight CNN:\n  [SqueezeNet](https://github.com/forresti/SqueezeNet)\n  [MobileNetV1](https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md)\n  [MobileNetV2/V3](https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/README.md)\n  [ShuffleNetV1](https://github.com/farmingyard/ShuffleNet)\n  [ShuffleNetV2](https://github.com/opconty/keras-shufflenetV2)\n  [MNasNet](https://github.com/tensorflow/models/tree/master/research/slim/nets/nasnet)\n  ...\n- 
Face Detection:\n  [MTCNN](https://github.com/ipazc/mtcnn)\n  [RetinaFace](https://github.com/biubug6/Pytorch_Retinaface)\n  [scrfd](https://github.com/nihui/ncnn-android-scrfd)\n  ...\n- Detection:\n  [VGG-SSD](https://github.com/lzx1413/CAFFE_SSD)\n  [MobileNet-SSD](https://github.com/chuanqi305/MobileNet-SSD)\n  [SqueezeNet-SSD](https://github.com/chuanqi305/SqueezeNet-SSD)\n  [MobileNetV2-SSDLite](https://github.com/chuanqi305/MobileNetv2-SSDLite)\n  [MobileNetV3-SSDLite](https://github.com/XiaoyuHuang96/MobilenetV3SSDLite-tfkeras)\n  ...\n- Detection:\n  [Faster-RCNN](https://github.com/rbgirshick/py-faster-rcnn)\n  [R-FCN](https://github.com/daijifeng001/R-FCN)\n  ...\n- Detection:\n  [YOLOv2](https://github.com/longcw/yolo2-pytorch)\n  [YOLOv3](https://github.com/ultralytics/yolov3)\n  [MobileNet-YOLOv3](https://github.com/eric612/MobileNet-YOLO)\n  [YOLOv4](https://github.com/Tianxiaomo/pytorch-YOLOv4)\n  [YOLOv5](https://github.com/ultralytics/yolov5)\n  [YOLOv7](https://github.com/WongKinYiu/yolov7)\n  [YOLOX](https://github.com/Megvii-BaseDetection/YOLOX)\n  [YOLOv8](https://github.com/nihui/ncnn-android-yolov8)\n  ...\n- Detection:\n  [NanoDet](https://github.com/RangiLyu/nanodet)\n- Segmentation:\n  [FCN](https://github.com/unsky/FPN)\n  [PSPNet](https://github.com/hszhao/PSPNet)\n  [UNet](https://github.com/zhixuhao/unet)\n  [YOLACT](https://github.com/dbolya/yolact)\n  ...\n- Pose Estimation:\n  [SimplePose](https://github.com/dog-qiuqiu/Ultralight-SimplePose)\n  ...\n\n---\n\n## HowTo\n\n**[use ncnn with alexnet](https://github.com/Tencent/ncnn/wiki/use-ncnn-with-alexnet) with detailed steps, recommended for beginners :)**\n\n**[ncnn 组件使用指北 alexnet](https://github.com/Tencent/ncnn/wiki/use-ncnn-with-alexnet.zh) 附带详细步骤，新人强烈推荐 :)**\n\n**[use netron for ncnn model visualization](https://netron.app)**\n\n**[use ncnn with pytorch or onnx](https://github.com/Tencent/ncnn/wiki/use-ncnn-with-pytorch-or-onnx)**\n\n[ncnn low-level operation 
api](https://github.com/Tencent/ncnn/wiki/low-level-operation-api)\n\n[ncnn param and model file spec](https://github.com/Tencent/ncnn/wiki/param-and-model-file-structure)\n\n[ncnn operation param weight table](https://github.com/Tencent/ncnn/wiki/operation-param-weight-table)\n\n[how to implement custom layer step by step](https://github.com/Tencent/ncnn/wiki/how-to-implement-custom-layer-step-by-step)\n\n---\n\n## FAQ\n\n**[ncnn deepwiki](https://deepwiki.com/Tencent/ncnn) LLM Answering Questions ;)** \n\n**[ncnn throw error](https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-throw-error)**\n\n**[ncnn produce wrong result](https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result)**\n\n**[ncnn vulkan](https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-vulkan)**\n\n---\n\n## Features\n\n- Supports convolutional neural networks, supports multiple input and multi-branch structure, can calculate part of the branch\n- No third-party library dependencies, does not rely on BLAS / NNPACK or any other computing framework\n- Pure C++ implementation, cross-platform, supports Android, iOS and so on\n- ARM NEON assembly level of careful optimization, calculation speed is extremely high\n- Sophisticated memory management and data structure design, very low memory footprint\n- Supports multi-core parallel computing acceleration, ARM big.LITTLE CPU scheduling optimization\n- Supports GPU acceleration via the next-generation low-overhead Vulkan API\n- Extensible model design, supports 8bit [quantization](https://github.com/Tencent/ncnn/wiki/quantized-int8-inference) and half-precision floating point storage, can import caffe/pytorch/mxnet/onnx/darknet/keras/tensorflow(mlir) models\n- Support direct memory zero copy reference load network model\n- Can be registered with custom layer implementation and extended\n- Well, it is strong, not afraid of being stuffed with 卷 QvQ\n\n## 功能概述\n\n- 支持卷积神经网络，支持多输入和多分支结构，可计算部分分支\n- 无任何第三方库依赖，不依赖 BLAS/NNPACK 等计算框架\n- 纯 C++ 实现，跨平台，支持 Android / iOS 
等\n- ARM Neon 汇编级良心优化，计算速度极快\n- 精细的内存管理和数据结构设计，内存占用极低\n- 支持多核并行计算加速，ARM big.LITTLE CPU 调度优化\n- 支持基于全新低消耗的 Vulkan API GPU 加速\n- 可扩展的模型设计，支持 8bit [量化](tools/quantize) 和半精度浮点存储，可导入 caffe/pytorch/mxnet/onnx/darknet/keras/tensorflow(mlir) 模型\n- 支持直接内存零拷贝引用加载网络模型\n- 可注册自定义层实现并扩展\n- 恩，很强就是了，不怕被塞卷 QvQ\n\n---\n\n## supported platform matrix\n\n- ✅ = known work and runs fast with good optimization\n- ✔️ = known work, but speed may not be fast enough\n- ❔ = shall work, not confirmed\n- / = not applied\n\n|            | Windows | Linux | Android | macOS | iOS |\n| ---------- | ------- | ----- | ------- | ----- | --- |\n| intel-cpu  | ✔️      | ✔️    | ✔️      | ✔️    | /   |\n| intel-gpu  | ✔️      | ✔️    | ✔️      | ✔️    | /   |\n| amd-cpu    | ✔️      | ✔️    | ✔️      | ✔️    | /   |\n| amd-gpu    | ✔️      | ✔️    | ✔️      | ✔️    | /   |\n| nvidia-gpu | ✔️      | ✔️    | ✔️      | ✔️    | /   |\n| qcom-cpu   | ✅      | ✅    | ✅      | /     | /   |\n| qcom-gpu   | ✔️      | ✔️    | ✔️      | /     | /   |\n| arm-cpu    | ✅      | ✅    | ✅      | /     | /   |\n| arm-gpu    | ❔      | ✔️    | ✔️      | /     | /   |\n| apple-cpu  | /       | /     | /       | ✔️    | ✅  |\n| apple-gpu  | /       | /     | /       | ✔️    | ✔️  |\n| ibm-cpu    | /       | ✔️     | /       | /    | /  |\n\n---\n\n## Project examples\n\n- <https://github.com/nihui/ncnn-android-squeezenet>\n- <https://github.com/nihui/ncnn-android-styletransfer>\n- <https://github.com/nihui/ncnn-android-mobilenetssd>\n- <https://github.com/moli232777144/mtcnn_ncnn>\n- <https://github.com/nihui/ncnn-android-yolov5>\n- <https://github.com/xiang-wuu/ncnn-android-yolov7>\n- <https://github.com/nihui/ncnn-android-scrfd> 🤩\n- <https://github.com/shaoshengsong/qt_android_ncnn_lib_encrypt_example>\n\n<img src=\"https://github.com/nihui/ncnn-assets/raw/master/20181217/ncnn-2.jpg\" height =\"230\"/><img src=\"https://github.com/nihui/ncnn-assets/raw/master/20181217/4.jpg\" height =\"230\"/><img 
src=\"https://github.com/nihui/ncnn-assets/raw/master/20181217/ncnn-33.jpg\" height =\"230\"/><img src=\"https://github.com/nihui/ncnn-assets/raw/master/20181217/ncnn-m.png\" height =\"230\"/><img src=\"https://github.com/nihui/ncnn-android-yolov5/raw/master/screenshot.jpg\" height =\"230\"/><img src=\"https://github.com/nihui/ncnn-android-scrfd/raw/master/screenshot.jpg\" height =\"230\"/><br>\n\n- <https://github.com/magicse/ncnn-colorization-siggraph17><br>\n<img src=\"https://user-images.githubusercontent.com/13585785/189326958-f5a8d6f8-caef-49bf-88da-ae494371195d.jpg\" width =\"700\"/>\n\n- <https://github.com/mizu-bai/ncnn-fortran> Call ncnn from Fortran\n\n- <https://github.com/k2-fsa/sherpa> Use ncnn for real-time speech\n  recognition (i.e., speech-to-text); also support embedded devices and provide\n  mobile Apps (e.g., Android App)\n\n---\n\n## License\n\n[BSD 3 Clause](LICENSE.txt)\n"
  },
  {
    "path": "benchmark/CMakeLists.txt",
    "content": "\nif(MSVC)\n    # warning C4996: 'fopen': This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.\n    add_definitions(/wd4996)\nendif()\n\n# ncnn macro\ninclude(${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_add_param.cmake)\n\nset(benchncnn_PARAMS\n    alexnet.param\n    blazeface.param\n    efficientnet_b0.param\n    efficientnetv2_b0.param\n    FastestDet.param\n    googlenet_int8.param\n    googlenet.param\n    mnasnet.param\n    mobilenet_int8.param\n    mobilenet_ssd_int8.param\n    mobilenet_ssd.param\n    mobilenet_v2.param\n    mobilenet_v3.param\n    mobilenet_yolo.param\n    mobilenet.param\n    mobilenetv2_yolov3.param\n    nanodet_m.param\n    proxylessnasnet.param\n    regnety_400m.param\n    resnet18_int8.param\n    resnet18.param\n    resnet50_int8.param\n    resnet50.param\n    shufflenet_v2.param\n    shufflenet.param\n    squeezenet_int8.param\n    squeezenet_ssd_int8.param\n    squeezenet_ssd.param\n    squeezenet.param\n    vgg16_int8.param\n    vgg16.param\n    vision_transformer.param\n    yolo-fastest-1.1.param\n    yolo-fastestv2.param\n    yolov4-tiny.param\n)\n\nforeach(PARAM_FILE ${benchncnn_PARAMS})\n    ncnn_add_param(\"${CMAKE_CURRENT_SOURCE_DIR}/${PARAM_FILE}\")\nendforeach()\n\nadd_custom_target(ncnn-generate-param DEPENDS ${NCNN_PARAM_HEX_FILES})\n\nconfigure_file(benchncnn_param_data.h.in ${CMAKE_CURRENT_BINARY_DIR}/benchncnn_param_data.h)\n\nadd_executable(benchncnn benchncnn.cpp)\ntarget_link_libraries(benchncnn PRIVATE ncnn)\n\ntarget_include_directories(benchncnn PRIVATE ${CMAKE_CURRENT_BINARY_DIR})\n\nif(CMAKE_SYSTEM_NAME STREQUAL \"Emscripten\")\n    target_link_libraries(benchncnn PRIVATE nodefs.js)\nendif()\n\nadd_dependencies(benchncnn ncnn-generate-param)\n\n# add benchncnn to a virtual project group\nset_property(TARGET benchncnn PROPERTY FOLDER \"benchmark\")\n"
  },
  {
    "path": "benchmark/FastestDet.param",
    "content": "7767517\n127 150\nInput                    in0                      0 1 in0\nConvolution              convrelu_0               1 1 in0 1 0=24 1=3 11=3 12=1 13=2 14=1 2=1 3=2 4=1 5=1 6=648 9=1\nPooling                  maxpool2d_43             1 1 1 2 0=0 1=3 11=3 12=2 13=1 2=2 3=1 5=1\nSplit                    splitncnn_0              1 2 2 3 4\nConvolutionDepthWise     convdw_95                1 1 4 5 0=24 1=3 11=3 12=1 13=2 14=1 2=1 3=2 4=1 5=1 6=216 7=24\nConvolution              convrelu_1               1 1 3 6 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConvolutionDepthWise     convdw_96                1 1 6 7 0=24 1=3 11=3 12=1 13=2 14=1 2=1 3=2 4=1 5=1 6=216 7=24\nConvolution              convrelu_3               1 1 5 8 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConvolution              convrelu_2               1 1 7 9 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConcat                   cat_0                    2 1 8 9 10 0=0\nShuffleChannel           shufflechannel_0         1 1 10 11 0=2 1=1\nSlice                    shufflechannel_0_slice   1 2 11 12 13 -23300=2,-233,-233 1=0\nConvolution              convrelu_4               1 1 13 14 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConvolutionDepthWise     convdw_97                1 1 14 15 0=24 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=216 7=24\nConvolution              convrelu_5               1 1 15 16 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConcat                   cat_1                    2 1 12 16 17 0=0\nShuffleChannel           shufflechannel_1         1 1 17 18 0=2 1=1\nSlice                    shufflechannel_1_slice   1 2 18 19 20 -23300=2,-233,-233 1=0\nConvolution              convrelu_6               1 1 20 21 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConvolutionDepthWise     convdw_98                1 1 21 22 0=24 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=216 7=24\nConvolution              convrelu_7 
              1 1 22 23 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConcat                   cat_2                    2 1 19 23 24 0=0\nShuffleChannel           shufflechannel_2         1 1 24 25 0=2 1=1\nSlice                    shufflechannel_2_slice   1 2 25 26 27 -23300=2,-233,-233 1=0\nConvolution              convrelu_8               1 1 27 28 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConvolutionDepthWise     convdw_99                1 1 28 29 0=24 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=216 7=24\nConvolution              convrelu_9               1 1 29 30 0=24 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=576 9=1\nConcat                   cat_3                    2 1 26 30 31 0=0\nSplit                    splitncnn_1              1 3 31 32 33 34\nConvolutionDepthWise     convdw_100               1 1 34 35 0=48 1=3 11=3 12=1 13=2 14=1 2=1 3=2 4=1 5=1 6=432 7=48\nConvolution              convrelu_10              1 1 33 36 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolutionDepthWise     convdw_101               1 1 36 37 0=48 1=3 11=3 12=1 13=2 14=1 2=1 3=2 4=1 5=1 6=432 7=48\nConvolution              convrelu_12              1 1 35 38 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolution              convrelu_11              1 1 37 39 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_4                    2 1 38 39 40 0=0\nShuffleChannel           shufflechannel_3         1 1 40 41 0=2 1=1\nSlice                    shufflechannel_3_slice   1 2 41 42 43 -23300=2,-233,-233 1=0\nConvolution              convrelu_13              1 1 43 44 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolutionDepthWise     convdw_102               1 1 44 45 0=48 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=432 7=48\nConvolution              convrelu_14              1 1 45 46 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_5                    2 
1 42 46 47 0=0\nShuffleChannel           shufflechannel_4         1 1 47 48 0=2 1=1\nSlice                    shufflechannel_4_slice   1 2 48 49 50 -23300=2,-233,-233 1=0\nConvolution              convrelu_15              1 1 50 51 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolutionDepthWise     convdw_103               1 1 51 52 0=48 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=432 7=48\nConvolution              convrelu_16              1 1 52 53 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_6                    2 1 49 53 54 0=0\nShuffleChannel           shufflechannel_5         1 1 54 55 0=2 1=1\nSlice                    shufflechannel_5_slice   1 2 55 56 57 -23300=2,-233,-233 1=0\nConvolution              convrelu_17              1 1 57 58 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolutionDepthWise     convdw_104               1 1 58 59 0=48 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=432 7=48\nConvolution              convrelu_18              1 1 59 60 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_7                    2 1 56 60 61 0=0\nShuffleChannel           shufflechannel_6         1 1 61 62 0=2 1=1\nSlice                    shufflechannel_6_slice   1 2 62 63 64 -23300=2,-233,-233 1=0\nConvolution              convrelu_19              1 1 64 65 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolutionDepthWise     convdw_105               1 1 65 66 0=48 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=432 7=48\nConvolution              convrelu_20              1 1 66 67 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_8                    2 1 63 67 68 0=0\nShuffleChannel           shufflechannel_7         1 1 68 69 0=2 1=1\nSlice                    shufflechannel_7_slice   1 2 69 70 71 -23300=2,-233,-233 1=0\nConvolution              convrelu_21              1 1 71 72 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 
5=1 6=2304 9=1\nConvolutionDepthWise     convdw_106               1 1 72 73 0=48 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=432 7=48\nConvolution              convrelu_22              1 1 73 74 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_9                    2 1 70 74 75 0=0\nShuffleChannel           shufflechannel_8         1 1 75 76 0=2 1=1\nSlice                    shufflechannel_8_slice   1 2 76 77 78 -23300=2,-233,-233 1=0\nConvolution              convrelu_23              1 1 78 79 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolutionDepthWise     convdw_107               1 1 79 80 0=48 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=432 7=48\nConvolution              convrelu_24              1 1 80 81 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_10                   2 1 77 81 82 0=0\nShuffleChannel           shufflechannel_9         1 1 82 83 0=2 1=1\nSlice                    shufflechannel_9_slice   1 2 83 84 85 -23300=2,-233,-233 1=0\nConvolution              convrelu_25              1 1 85 86 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConvolutionDepthWise     convdw_108               1 1 86 87 0=48 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=432 7=48\nConvolution              convrelu_26              1 1 87 88 0=48 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=2304 9=1\nConcat                   cat_11                   2 1 84 88 89 0=0\nSplit                    splitncnn_2              1 3 89 90 91 92\nConvolutionDepthWise     convdw_109               1 1 92 93 0=96 1=3 11=3 12=1 13=2 14=1 2=1 3=2 4=1 5=1 6=864 7=96\nConvolution              convrelu_27              1 1 91 94 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConvolutionDepthWise     convdw_110               1 1 94 95 0=96 1=3 11=3 12=1 13=2 14=1 2=1 3=2 4=1 5=1 6=864 7=96\nConvolution              convrelu_29              1 1 93 96 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 
9=1\nConvolution              convrelu_28              1 1 95 97 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConcat                   cat_12                   2 1 96 97 98 0=0\nShuffleChannel           shufflechannel_10        1 1 98 99 0=2 1=1\nSlice                    shufflechannel_10_slice  1 2 99 100 101 -23300=2,-233,-233 1=0\nConvolution              convrelu_30              1 1 101 102 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConvolutionDepthWise     convdw_111               1 1 102 103 0=96 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=864 7=96\nConvolution              convrelu_31              1 1 103 104 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConcat                   cat_13                   2 1 100 104 105 0=0\nShuffleChannel           shufflechannel_11        1 1 105 106 0=2 1=1\nSlice                    shufflechannel_11_slice  1 2 106 107 108 -23300=2,-233,-233 1=0\nConvolution              convrelu_32              1 1 108 109 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConvolutionDepthWise     convdw_112               1 1 109 110 0=96 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=864 7=96\nConvolution              convrelu_33              1 1 110 111 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConcat                   cat_14                   2 1 107 111 112 0=0\nShuffleChannel           shufflechannel_12        1 1 112 113 0=2 1=1\nSlice                    shufflechannel_12_slice  1 2 113 114 115 -23300=2,-233,-233 1=0\nConvolution              convrelu_34              1 1 115 116 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConvolutionDepthWise     convdw_113               1 1 116 117 0=96 1=3 11=3 12=1 13=1 14=1 2=1 3=1 4=1 5=1 6=864 7=96\nConvolution              convrelu_35              1 1 117 118 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nConcat                   cat_15                   2 1 114 118 119 0=0\nPooling                  avgpool2d_0          
    1 1 32 120 0=1 1=3 11=3 12=2 13=1 2=2 3=1 5=1 6=1\nInterp                   upsample_94              1 1 119 121 0=1 1=2.000000e+00 2=2.000000e+00 6=0\nConcat                   cat_16                   3 1 120 90 121 122 0=0\nConvolution              convrelu_36              1 1 122 123 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=32256 9=1\nSplit                    splitncnn_3              1 4 123 124 125 126 127\nConvolutionDepthWise     convdwrelu_5             1 1 127 128 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolutionDepthWise     convdwrelu_0             1 1 126 129 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolutionDepthWise     convdwrelu_4             1 1 129 130 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolutionDepthWise     convdwrelu_1             1 1 125 131 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolutionDepthWise     convdwrelu_2             1 1 131 132 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolutionDepthWise     convdwrelu_3             1 1 132 133 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConcat                   cat_17                   3 1 128 130 133 134 0=0\nConvolution              conv_38                  1 1 134 135 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=27648\nBinaryOp                 add_0                    2 1 124 135 136 0=0\nReLU                     relu_87                  1 1 136 137\nConvolution              convrelu_37              1 1 137 138 0=96 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=9216 9=1\nSplit                    splitncnn_4              1 3 138 139 140 141\nConvolutionDepthWise     convdwrelu_7             1 1 139 142 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolution              conv_41                  1 1 142 143 0=80 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=7680\nConvolutionDepthWise     convdwrelu_8             1 1 140 144 0=96 1=5 11=5 
12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolution              conv_42                  1 1 144 145 0=4 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=384\nSoftmax                  softmax_93               1 1 143 146 0=0 1=1\nConvolutionDepthWise     convdwrelu_6             1 1 141 147 0=96 1=5 11=5 12=1 13=1 14=2 2=1 3=1 4=2 5=1 6=2400 7=96 9=1\nConvolution              convsigmoid_38           1 1 147 148 0=1 1=1 11=1 12=1 13=1 14=0 2=1 3=1 4=0 5=1 6=96 9=4\nConcat                   cat_18                   3 1 148 145 146 out0 0=0\n"
  },
  {
    "path": "benchmark/README.md",
    "content": "benchncnn can be used to test neural network inference performance.\n\nOnly the network definition files (ncnn param) are required.\n\nThe large model binary files (ncnn bin) are not loaded; weights are generated randomly for the speed test.\n\nIf no model is specified, the default built-in models are benchmarked. More model networks may be added later.\n\n---\nBuild\n```shell\n# assumes you have already built the ncnn library successfully\n# uncomment the following line in <ncnn-root-dir>/CMakeLists.txt with your favorite editor\n\n# add_subdirectory(benchmark)\n\ncd <ncnn-root-dir>/<your-build-dir>\nmake -j4\n\n# you can find the benchncnn binary in <ncnn-root-dir>/<your-build-dir>/benchmark\n```\n\nUsage\n```shell\n# copy all param files to the current directory\n./benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down] [(key=value)...]\n  param=model.param\n  shape=[227,227,3],..\n```\nRun benchncnn on an Android device\n```shell\n# for running on an android device, upload to the /data/local/tmp/ folder\nadb push benchncnn /data/local/tmp/\n\n# (optional) upload your ncnn model param to the /data/local/tmp/ folder\nadb push model.param /data/local/tmp/\n\n# executed in android adb shell\nadb shell\ncd /data/local/tmp/\n\n# sample: benchmark built-in models on cpu, with 4 threads on big cores, 4 loops and cooling_down\n./benchncnn 4 4 2 -1 1\n\n# sample: benchmark built-in models on gpu id 0, with 1 thread on big cores, 8 loops, without cooling_down\n./benchncnn 8 1 2 0 0\n\n./benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down] [(key=value)...]\n  param=model.param\n  shape=[227,227,3],..\n```\n\nParameter\n\n|param|options|default|\n|---|---|---|\n|loop count|1~N|4|\n|num threads|1~N|max_cpu_count|\n|powersave|0=all cores, 1=little cores only, 2=big cores only|0|\n|gpu device|-1=cpu-only, 0=gpu0, 1=gpu1 ...|-1|\n|cooling down|0=disable, 1=enable|1|\n|param|ncnn model.param filepath|-|\n|shape|model input shapes, in whc format|-|\n\nTips: 
Disable the Android UI server and set CPU and GPU to max frequency\n```shell\n# stop the android ui server; it can be restarted later via adb shell start\nadb root\nadb shell stop\n\n# executed in android adb shell\n# set cpu performance mode\necho \"performance\" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor\necho \"performance\" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor\necho \"performance\" > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor\necho \"performance\" > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor\necho \"performance\" > /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor\necho \"performance\" > /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor\n\n# set gpu performance mode (e.g. RK3399)\necho \"performance\" > /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor\n\n# set gpu performance mode (e.g. Android Adreno)\necho 1 > /sys/class/kgsl/kgsl-3d0/force_clk_on\necho 10000000 > /sys/class/kgsl/kgsl-3d0/idle_timer\necho \"performance\" > /sys/class/kgsl/kgsl-3d0/devfreq/governor\necho <max freq> > /sys/class/kgsl/kgsl-3d0/gpuclk\n```\n\n---\n\nTypical output (executed in android adb shell)\n\n### NVIDIA Jetson AGX Orin (Cortex-A78AE 2.2 GHz x 12 + Ampere@1.3 GHz Tensor Cores 64)\n```\ni@orin:~/projects/ncnn/benchmark$ ./benchncnn 64 1 0 -1 0\nloop_count = 64\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   11.66  max =   11.80  avg =   11.74\n     squeezenet_int8  min =   12.24  max =   12.39  avg =   12.31\n           mobilenet  min =   19.56  max =   19.73  avg =   19.65\n      mobilenet_int8  min =   16.06  max =   16.25  avg =   16.14\n        mobilenet_v2  min =   13.20  max =   13.41  avg =   13.29\n        mobilenet_v3  min =   11.39  max =   11.57  avg =   11.48\n          shufflenet  min =    8.07  max =    8.18  avg =    8.11\n       shufflenet_v2  min =    8.41  max =    8.51  avg =    8.45\n             mnasnet  min =   12.74  max =   12.91  avg =   
12.79\n     proxylessnasnet  min =   15.18  max =   15.32  avg =   15.25\n     efficientnet_b0  min =   26.86  max =   26.96  avg =   26.90\n   efficientnetv2_b0  min =   35.99  max =   36.15  avg =   36.07\n        regnety_400m  min =   16.81  max =   16.98  avg =   16.87\n           blazeface  min =    4.25  max =    4.37  avg =    4.29\n           googlenet  min =   48.73  max =   48.98  avg =   48.87\n      googlenet_int8  min =   47.39  max =   47.60  avg =   47.49\n            resnet18  min =   30.93  max =   31.24  avg =   31.08\n       resnet18_int8  min =   55.44  max =   55.70  avg =   55.56\n             alexnet  min =   44.19  max =   44.43  avg =   44.33\n               vgg16  min =  173.94  max =  174.97  avg =  174.46\n          vgg16_int8  min =  475.10  max =  479.37  avg =  477.33\n            resnet50  min =   89.50  max =   90.11  avg =   89.80\n       resnet50_int8  min =  106.77  max =  107.14  avg =  106.96\n      squeezenet_ssd  min =   37.78  max =   38.35  avg =   37.93\n squeezenet_ssd_int8  min =   50.48  max =   50.88  avg =   50.74\n       mobilenet_ssd  min =   45.62  max =   46.12  avg =   45.74\n  mobilenet_ssd_int8  min =   37.77  max =   38.00  avg =   37.88\n      mobilenet_yolo  min =   90.23  max =   90.49  avg =   90.35\n  mobilenetv2_yolov3  min =   47.27  max =   47.48  avg =   47.33\n         yolov4-tiny  min =   60.41  max =   60.75  avg =   60.57\n           nanodet_m  min =   19.26  max =   19.43  avg =   19.35\n    yolo-fastest-1.1  min =    8.16  max =    8.31  avg =    8.20\n      yolo-fastestv2  min =    8.26  max =    8.39  avg =    8.32\ni@orin:~/projects/ncnn/benchmark$ ./benchncnn 64 2 0 -1 0\nloop_count = 64\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.83  max =    6.98  avg =    6.90\n     squeezenet_int8  min =    7.39  max =    7.50  avg =    7.45\n           mobilenet  min =   10.40  max =   10.50  avg =   10.45\n      mobilenet_int8  min =    8.92  max 
=    9.09  avg =    8.99\n        mobilenet_v2  min =    7.67  max =    7.80  avg =    7.74\n        mobilenet_v3  min =    6.86  max =    7.01  avg =    6.93\n          shufflenet  min =    6.34  max =    6.44  avg =    6.39\n       shufflenet_v2  min =    5.71  max =    5.83  avg =    5.76\n             mnasnet  min =    7.47  max =    7.58  avg =    7.53\n     proxylessnasnet  min =    8.73  max =    8.83  avg =    8.78\n     efficientnet_b0  min =   14.93  max =   15.13  avg =   15.03\n   efficientnetv2_b0  min =   20.17  max =   20.70  avg =   20.29\n        regnety_400m  min =   12.50  max =   12.62  avg =   12.57\n           blazeface  min =    2.95  max =    3.06  avg =    3.00\n           googlenet  min =   26.25  max =   26.53  avg =   26.37\n      googlenet_int8  min =   26.54  max =   26.79  avg =   26.66\n            resnet18  min =   16.69  max =   16.90  avg =   16.80\n       resnet18_int8  min =   29.70  max =   29.93  avg =   29.81\n             alexnet  min =   22.96  max =   23.12  avg =   23.03\n               vgg16  min =   88.39  max =   89.16  avg =   88.79\n          vgg16_int8  min =  245.86  max =  247.55  avg =  246.62\n            resnet50  min =   46.55  max =   46.86  avg =   46.70\n       resnet50_int8  min =   56.28  max =   56.63  avg =   56.43\n      squeezenet_ssd  min =   23.65  max =   24.29  avg =   23.81\n squeezenet_ssd_int8  min =   30.86  max =   31.27  avg =   30.99\n       mobilenet_ssd  min =   25.17  max =   25.31  avg =   25.24\n  mobilenet_ssd_int8  min =   21.77  max =   21.97  avg =   21.84\n      mobilenet_yolo  min =   48.03  max =   48.33  avg =   48.14\n  mobilenetv2_yolov3  min =   26.58  max =   26.81  avg =   26.66\n         yolov4-tiny  min =   35.31  max =   35.53  avg =   35.41\n           nanodet_m  min =   12.93  max =   13.08  avg =   13.01\n    yolo-fastest-1.1  min =    6.00  max =    6.10  avg =    6.04\n      yolo-fastestv2  min =    6.46  max =    6.61  avg =    
6.52\ni@orin:~/projects/ncnn/benchmark$ ./benchncnn 64 4 0 -1 0\nloop_count = 64\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    4.54  max =    4.84  avg =    4.61\n     squeezenet_int8  min =    4.96  max =    5.41  avg =    5.05\n           mobilenet  min =    5.96  max =    6.23  avg =    6.04\n      mobilenet_int8  min =    5.21  max =    5.50  avg =    5.30\n        mobilenet_v2  min =    5.05  max =    5.26  avg =    5.15\n        mobilenet_v3  min =    4.83  max =    5.14  avg =    4.90\n          shufflenet  min =    5.11  max =    5.34  avg =    5.18\n       shufflenet_v2  min =    4.13  max =    4.44  avg =    4.18\n             mnasnet  min =    4.93  max =    5.27  avg =    5.01\n     proxylessnasnet  min =    5.64  max =    5.89  avg =    5.72\n     efficientnet_b0  min =    9.47  max =   10.60  avg =    9.60\n   efficientnetv2_b0  min =   12.67  max =   13.06  avg =   12.82\n        regnety_400m  min =   10.27  max =   10.58  avg =   10.38\n           blazeface  min =    2.05  max =    2.27  avg =    2.10\n           googlenet  min =   15.57  max =   15.96  avg =   15.68\n      googlenet_int8  min =   16.19  max =   16.65  avg =   16.32\n            resnet18  min =   10.20  max =   11.76  avg =   10.35\n       resnet18_int8  min =   16.89  max =   17.31  avg =   17.03\n             alexnet  min =   13.13  max =   13.70  avg =   13.32\n               vgg16  min =   51.03  max =   52.46  avg =   51.35\n          vgg16_int8  min =  131.08  max =  139.44  avg =  133.78\n            resnet50  min =   26.74  max =   28.32  avg =   26.91\n       resnet50_int8  min =   32.15  max =   32.74  avg =   32.38\n      squeezenet_ssd  min =   16.58  max =   16.99  avg =   16.70\n squeezenet_ssd_int8  min =   20.22  max =   21.67  avg =   20.51\n       mobilenet_ssd  min =   14.68  max =   16.07  avg =   14.83\n  mobilenet_ssd_int8  min =   12.89  max =   13.27  avg =   13.01\n      mobilenet_yolo  min =   28.44  max 
=   28.85  avg =   28.58\n  mobilenetv2_yolov3  min =   17.21  max =   21.31  avg =   17.44\n         yolov4-tiny  min =   23.68  max =   24.38  avg =   23.88\n           nanodet_m  min =    8.76  max =    9.17  avg =    8.86\n    yolo-fastest-1.1  min =    4.83  max =    5.04  avg =    4.88\n      yolo-fastestv2  min =    4.93  max =    5.17  avg =    5.00\ni@orin:~/projects/ncnn/benchmark$ ./benchncnn 64 8 0 -1 0\nloop_count = 64\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    3.52  max =    4.28  avg =    3.65\n     squeezenet_int8  min =    3.85  max =    4.11  avg =    3.93\n           mobilenet  min =    3.78  max =    4.12  avg =    3.85\n      mobilenet_int8  min =    3.57  max =    3.85  avg =    3.63\n        mobilenet_v2  min =    4.14  max =    4.44  avg =    4.22\n        mobilenet_v3  min =    3.89  max =    4.26  avg =    3.97\n          shufflenet  min =    4.78  max =    4.95  avg =    4.84\n       shufflenet_v2  min =    3.49  max =    3.84  avg =    3.54\n             mnasnet  min =    3.94  max =    4.09  avg =    3.99\n     proxylessnasnet  min =    4.41  max =    4.68  avg =    4.47\n     efficientnet_b0  min =    7.01  max =    7.85  avg =    7.13\n   efficientnetv2_b0  min =    9.22  max =    9.46  avg =    9.32\n        regnety_400m  min =    9.34  max =    9.66  avg =    9.44\n           blazeface  min =    1.86  max =    1.98  avg =    1.89\n           googlenet  min =   10.37  max =   10.76  avg =   10.48\n      googlenet_int8  min =   11.03  max =   11.34  avg =   11.16\n            resnet18  min =    6.83  max =    7.12  avg =    6.93\n       resnet18_int8  min =   10.25  max =   11.50  avg =   10.42\n             alexnet  min =    8.88  max =    9.71  avg =    9.01\n               vgg16  min =   31.26  max =   31.97  avg =   31.44\n          vgg16_int8  min =   71.31  max =   74.53  avg =   72.18\n            resnet50  min =   16.43  max =   16.84  avg =   16.52\n       resnet50_int8  
min =   19.07  max =   20.28  avg =   19.42\n      squeezenet_ssd  min =   13.50  max =   13.69  avg =   13.56\n squeezenet_ssd_int8  min =   15.16  max =   16.06  avg =   15.30\n       mobilenet_ssd  min =    9.73  max =   10.85  avg =    9.90\n  mobilenet_ssd_int8  min =    9.27  max =    9.46  avg =    9.36\n      mobilenet_yolo  min =   17.58  max =   17.79  avg =   17.67\n  mobilenetv2_yolov3  min =   12.80  max =   13.50  avg =   12.90\n         yolov4-tiny  min =   17.98  max =   21.31  avg =   18.24\n           nanodet_m  min =    7.01  max =    7.18  avg =    7.09\n    yolo-fastest-1.1  min =    4.76  max =    4.86  avg =    4.80\n      yolo-fastestv2  min =    4.76  max =    4.88  avg =    4.82\ni@orin:~/projects/ncnn/benchmark$ ./benchncnn 64 12 0 -1 0\nloop_count = 64\nnum_threads = 12\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    3.50  max =    5.21  avg =    3.65\n     squeezenet_int8  min =    3.97  max =    4.44  avg =    4.12\n           mobilenet  min =    3.49  max =    7.73  avg =    3.78\n      mobilenet_int8  min =    3.40  max =    3.86  avg =    3.49\n        mobilenet_v2  min =    4.07  max =    4.39  avg =    4.17\n        mobilenet_v3  min =    3.92  max =    4.17  avg =    4.03\n          shufflenet  min =    5.08  max =    6.63  avg =    5.18\n       shufflenet_v2  min =    3.64  max =    5.11  avg =    3.75\n             mnasnet  min =    3.86  max =    4.16  avg =    3.95\n     proxylessnasnet  min =    4.30  max =    5.39  avg =    4.38\n     efficientnet_b0  min =    6.42  max =    9.19  avg =    6.61\n   efficientnetv2_b0  min =    8.96  max =    9.43  avg =    9.12\n        regnety_400m  min =   10.11  max =   10.89  avg =   10.27\n           blazeface  min =    1.93  max =    2.16  avg =    1.99\n           googlenet  min =    9.72  max =   10.84  avg =   10.01\n      googlenet_int8  min =   10.91  max =   13.03  avg =   11.17\n            resnet18  min =    6.70  max =    7.27  avg =    6.92\n 
      resnet18_int8  min =    9.62  max =   12.93  avg =   10.14\n             alexnet  min =    7.21  max =    7.47  avg =    7.32\n               vgg16  min =   29.61  max =   63.73  avg =   30.86\n          vgg16_int8  min =   64.91  max =   75.06  avg =   68.72\n            resnet50  min =   15.35  max =   16.28  avg =   15.73\n       resnet50_int8  min =   17.47  max =   18.98  avg =   18.09\n      squeezenet_ssd  min =   13.40  max =   28.74  avg =   14.07\n squeezenet_ssd_int8  min =   15.35  max =   16.77  avg =   15.67\n       mobilenet_ssd  min =    9.51  max =   11.49  avg =    9.88\n  mobilenet_ssd_int8  min =    9.43  max =   10.08  avg =    9.58\n      mobilenet_yolo  min =   16.88  max =   17.45  avg =   17.09\n  mobilenetv2_yolov3  min =   11.91  max =   31.90  avg =   12.50\n         yolov4-tiny  min =   17.85  max =   18.87  avg =   18.36\n           nanodet_m  min =    6.88  max =    7.64  avg =    7.06\n    yolo-fastest-1.1  min =    5.02  max =    5.53  avg =    5.12\n      yolo-fastestv2  min =    4.95  max =    5.60  avg =    5.05\ni@orin:~/projects/ncnn/benchmark$ ./benchncnn 128 1 0 0 0\n[0 NVIDIA Tegra Orin (nvgpu)]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 NVIDIA Tegra Orin (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra Orin (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra Orin (nvgpu)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 128\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    2.13  max =    3.37  avg =    2.31\n     squeezenet_int8  min =   12.31  max =   12.51  avg =   12.42\n           mobilenet  min =    2.03  max =    2.73  avg =    2.23\n      mobilenet_int8  min =   16.86  max =   17.91  avg =   16.99\n        mobilenet_v2  min =    2.59  max =    3.59  avg =    2.91\n        mobilenet_v3  min =    3.22  max =    4.23  avg =    3.71\n          shufflenet  min =    2.57  max =    3.27  avg =    2.80\n       shufflenet_v2  
min =    3.20  max =    4.03  avg =    3.47\n             mnasnet  min =    2.45  max =    3.06  avg =    2.69\n     proxylessnasnet  min =    2.50  max =    3.14  avg =    2.72\n     efficientnet_b0  min =    4.23  max =    8.73  avg =    4.85\n   efficientnetv2_b0  min =    8.15  max =    8.60  avg =    8.41\n        regnety_400m  min =    3.25  max =    4.17  avg =    3.54\n           blazeface  min =    1.29  max =    1.48  avg =    1.33\n           googlenet  min =    4.95  max =   12.34  avg =    6.36\n      googlenet_int8  min =   47.49  max =   47.78  avg =   47.61\n            resnet18  min =    3.18  max =    9.49  avg =    4.04\n       resnet18_int8  min =   55.57  max =   55.88  avg =   55.73\n             alexnet  min =    3.22  max =   14.56  avg =    4.25\n               vgg16  min =    6.82  max =   14.75  avg =    8.18\n          vgg16_int8  min =  473.55  max =  479.07  avg =  476.22\n            resnet50  min =    4.75  max =   15.06  avg =    6.08\n       resnet50_int8  min =  106.99  max =  107.48  avg =  107.22\n      squeezenet_ssd  min =    6.87  max =    9.12  avg =    7.76\n squeezenet_ssd_int8  min =   50.87  max =   51.17  avg =   51.01\n       mobilenet_ssd  min =    4.44  max =    6.22  avg =    5.23\n  mobilenet_ssd_int8  min =   37.80  max =   38.03  avg =   37.92\n      mobilenet_yolo  min =    5.41  max =    7.36  avg =    6.29\n  mobilenetv2_yolov3  min =    7.20  max =    9.96  avg =    7.30\n         yolov4-tiny  min =   16.48  max =   28.81  avg =   18.40\n           nanodet_m  min =    5.75  max =    8.54  avg =    6.85\n    yolo-fastest-1.1  min =    4.03  max =    4.75  avg =    4.35\n      yolo-fastestv2  min =    4.27  max =    5.23  avg =    4.71\n```\n\n### AMD Ryzen Threadripper 3970X (Zen2 3.7 GHz ~ 4.5 GHz x 32)\n```\ni@s:~/qtang/ncnn/benchmark$ ../build-vulkan/benchmark/benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   11.73  max = 
  11.88  avg =   11.78\n           mobilenet  min =   21.63  max =   21.73  avg =   21.68\n        mobilenet_v2  min =   14.70  max =   14.95  avg =   14.82\n        mobilenet_v3  min =   12.12  max =   12.17  avg =   12.15\n          shufflenet  min =   14.08  max =   14.16  avg =   14.12\n       shufflenet_v2  min =   25.99  max =   26.13  avg =   26.06\n             mnasnet  min =   14.12  max =   14.17  avg =   14.14\n     proxylessnasnet  min =   16.51  max =   16.71  avg =   16.61\n     efficientnet_b0  min =   22.88  max =   22.97  avg =   22.93\n        regnety_400m  min =   18.50  max =   18.61  avg =   18.56\n           blazeface  min =    6.18  max =    6.27  avg =    6.21\n           googlenet  min =   58.42  max =   58.60  avg =   58.49\n            resnet18  min =   61.13  max =   61.84  avg =   61.40\n             alexnet  min =   50.82  max =   50.98  avg =   50.92\n               vgg16  min =  217.19  max =  218.40  avg =  217.87\n            resnet50  min =  126.84  max =  137.46  avg =  128.21\n      squeezenet_ssd  min =  114.24  max =  114.57  avg =  114.47\n       mobilenet_ssd  min =   51.60  max =   51.89  avg =   51.77\n      mobilenet_yolo  min =  125.09  max =  126.33  avg =  125.83\n  mobilenetv2_yolov3  min =   57.51  max =   57.79  avg =   57.65\n         yolov4-tiny  min =   85.65  max =   85.97  avg =   85.79\n```\n\n### NVIDIA Quadro RTX 8000 (TU102 SM x 72 + Tensor Core x 576)\n```\ni@s:~/qtang/ncnn/benchmark$ ../build-vulkan/benchmark/benchncnn 256 1 0 1 0\n[0 Quadro RTX 8000]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 Quadro RTX 8000]  bugsbn1=0  bugcopc=0  bugihfa=0\n[0 Quadro RTX 8000]  fp16p=1  fp16s=1  fp16a=1  int8s=1  int8a=1\n[0 Quadro RTX 8000]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\n[1 Quadro RTX 8000]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[1 Quadro RTX 8000]  bugsbn1=0  bugcopc=0  bugihfa=0\n[1 Quadro RTX 8000]  fp16p=1  fp16s=1  fp16a=1  int8s=1  int8a=1\n[1 Quadro RTX 8000]  subgroup=32  basic=1  
vote=1  ballot=1  shuffle=1\nloop_count = 256\nnum_threads = 1\npowersave = 0\ngpu_device = 1\ncooling_down = 0\n          squeezenet  min =    0.84  max =    1.39  avg =    0.93\n           mobilenet  min =    0.90  max =    2.30  avg =    0.91\n        mobilenet_v2  min =    1.35  max =    9.59  avg =    1.46\n        mobilenet_v3  min =    1.60  max =   77.94  avg =    2.12\n          shufflenet  min =    0.86  max =    2.27  avg =    0.88\n       shufflenet_v2  min =    1.25  max =    1.47  avg =    1.27\n             mnasnet  min =    1.42  max =   20.77  avg =    1.72\n     proxylessnasnet  min =    1.48  max =    1.67  avg =    1.49\n     efficientnet_b0  min =    2.56  max =   12.86  avg =    2.77\n        regnety_400m  min =    1.84  max =   14.98  avg =    2.42\n           blazeface  min =    0.64  max =    0.90  avg =    0.65\n           googlenet  min =    2.94  max =   76.82  avg =    3.45\n            resnet18  min =    1.27  max =   10.56  avg =    1.56\n             alexnet  min =    1.53  max =   71.76  avg =    1.96\n               vgg16  min =    4.90  max =   78.12  avg =    5.80\n            resnet50  min =    3.00  max =   12.51  avg =    3.07\n      squeezenet_ssd  min =    5.60  max =   97.09  avg =    6.50\n       mobilenet_ssd  min =    2.40  max =   93.64  avg =    3.30\n      mobilenet_yolo  min =    2.96  max =   19.15  avg =    3.25\n  mobilenetv2_yolov3  min =    4.52  max =   66.96  avg =    5.32\n         yolov4-tiny  min =    9.32  max =   72.92  avg =   14.01\n\n```\n\n### NVIDIA RTX3090 (GA102 SM x 82 + Tensor Core 328)\n```\n(base) i@t:~/wls/ncnn/benchmark$ ../build/benchmark/benchncnn 32 1 0 0 0\n[0 GeForce RTX 3090]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 GeForce RTX 3090]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 GeForce RTX 3090]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 GeForce RTX 3090]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\n[1 GeForce RTX 3090]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[1 GeForce RTX 
3090]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 GeForce RTX 3090]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[1 GeForce RTX 3090]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 32\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    1.76  max =    2.74  avg =    1.80\n     squeezenet_int8  min =   47.10  max =   47.75  avg =   47.21\n           mobilenet  min =    4.77  max =    5.79  avg =    5.20\n      mobilenet_int8  min =   64.19  max =   67.05  avg =   64.39\n        mobilenet_v2  min =    2.44  max =   20.89  avg =    6.98\n        mobilenet_v3  min =    2.75  max =    2.87  avg =    2.77\n          shufflenet  min =    2.20  max =    2.62  avg =    2.46\n       shufflenet_v2  min =    5.10  max =    7.43  avg =    5.75\n             mnasnet  min =    3.47  max =    3.50  avg =    3.48\n     proxylessnasnet  min =    2.59  max =    9.08  avg =    7.28\n     efficientnet_b0  min =    3.87  max =    4.65  avg =    3.91\n   efficientnetv2_b0  min =   29.48  max =   41.90  avg =   30.14\n        regnety_400m  min =    2.89  max =    2.99  avg =    2.91\n           blazeface  min =    1.55  max =    2.14  avg =    1.60\n           googlenet  min =    4.33  max =   17.89  avg =    6.05\n      googlenet_int8  min =  174.46  max =  178.19  avg =  174.74\n            resnet18  min =    2.14  max =   11.04  avg =    5.33\n       resnet18_int8  min =  193.37  max =  193.83  avg =  193.55\n             alexnet  min =    2.37  max =   15.99  avg =    4.50\n               vgg16  min =    4.55  max =   16.65  avg =    5.22\n          vgg16_int8  min = 1538.76  max = 1544.81  avg = 1540.79\n            resnet50  min =    4.13  max =   25.86  avg =    5.80\n       resnet50_int8  min =  400.89  max =  401.72  avg =  401.29\n      squeezenet_ssd  min =    6.95  max =    7.81  avg =    7.07\n squeezenet_ssd_int8  min =  158.51  max =  159.04  avg =  158.68\n       mobilenet_ssd  min =    4.36  max =   18.98  avg =   
 9.40\n  mobilenet_ssd_int8  min =  130.74  max =  130.92  avg =  130.83\n      mobilenet_yolo  min =    3.96  max =   11.94  avg =    6.48\n  mobilenetv2_yolov3  min =    6.07  max =    6.21  avg =    6.13\n         yolov4-tiny  min =   13.01  max =   26.78  avg =   14.87\n\nroot@3090:~/Desktop/ncnn-20221128/build/benchmark$ ./benchncnn 100 10 2 0 0\n[0 NVIDIA GeForce RTX 3090]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 NVIDIA GeForce RTX 3090]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA GeForce RTX 3090]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA GeForce RTX 3090]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 100\nnum_threads = 10\npowersave = 2\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    0.64  max =    0.66  avg =    0.65\n     squeezenet_int8  min =    4.30  max =    4.93  avg =    4.45\n           mobilenet  min =    0.60  max =    1.85  avg =    1.32\n      mobilenet_int8  min =    3.08  max =    3.17  avg =    3.12\n        mobilenet_v2  min =    1.40  max =    1.46  avg =    1.42\n        mobilenet_v3  min =    1.22  max =    6.10  avg =    3.02\n          shufflenet  min =    0.90  max =    0.97  avg =    0.92\n       shufflenet_v2  min =    1.06  max =    1.13  avg =    1.09\n             mnasnet  min =    0.84  max =    0.98  avg =    0.91\n     proxylessnasnet  min =    0.99  max =    3.01  avg =    2.45\n     efficientnet_b0  min =    2.11  max =    2.85  avg =    2.16\n   efficientnetv2_b0  min =    7.46  max =   28.58  avg =    8.55\n        regnety_400m  min =    1.53  max =    1.75  avg =    1.59\n           blazeface  min =    0.59  max =    0.94  avg =    0.63\n           googlenet  min =    1.90  max =   12.22  avg =    2.63\n      googlenet_int8  min =   17.45  max =   18.69  avg =   17.81\n            resnet18  min =    0.90  max =   13.14  avg =    3.09\n       resnet18_int8  min =   16.25  max =   17.34  avg =   16.50\n             alexnet  min =    0.86  max =    4.77  avg =    2.59\n 
              vgg16  min =    1.38  max =   11.20  avg =    2.91\n          vgg16_int8  min =   47.17  max =   49.02  avg =   47.57\n            resnet50  min =    1.54  max =    2.16  avg =    1.64\n       resnet50_int8  min =   22.90  max =   24.46  avg =   23.23\n      squeezenet_ssd  min =    2.25  max =   10.91  avg =    4.12\n squeezenet_ssd_int8  min =   11.98  max =   14.54  avg =   12.31\n       mobilenet_ssd  min =    1.46  max =    8.98  avg =    3.38\n  mobilenet_ssd_int8  min =    6.13  max =    6.65  avg =    6.23\n      mobilenet_yolo  min =    1.29  max =    1.43  avg =    1.34\n  mobilenetv2_yolov3  min =    3.64  max =    6.66  avg =    3.77\n         yolov4-tiny  min =    9.04  max =   11.65  avg =    9.54\n           nanodet_m  min =    1.43  max =   11.90  avg =    3.16\n    yolo-fastest-1.1  min =    1.40  max =    1.82  avg =    1.57\n      yolo-fastestv2  min =    1.36  max =    2.30  avg =    1.42\n  vision_transformer  min =  202.71  max =  244.47  avg =  218.69\n          FastestDet  min =    1.37  max =    5.37  avg =    2.77\n```\n\n### AMD Ryzen Embedded V1605B (Zen 2.0 GHz ~ 3.6 GHz x 4 + Radeon Vega 8 1.1GHz 8CU)\n```\nC:\\Users\\i\\Desktop\\benchmark>benchncnn.exe 32 1 0 -1 0\nloop_count = 32\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   22.13  max =   24.07  avg =   22.88\n     squeezenet_int8  min =   58.54  max =   62.21  avg =   59.55\n           mobilenet  min =   40.99  max =   43.67  avg =   41.70\n      mobilenet_int8  min =   98.06  max =  111.37  avg =  101.15\n        mobilenet_v2  min =   26.53  max =   28.96  avg =   27.81\n        mobilenet_v3  min =   22.96  max =   25.25  avg =   23.30\n          shufflenet  min =   20.17  max =   28.78  avg =   21.09\n       shufflenet_v2  min =   19.06  max =   19.72  avg =   19.47\n             mnasnet  min =   25.11  max =   39.53  avg =   27.54\n     proxylessnasnet  min =   28.84  max =   35.16  avg =   30.03\n     
efficientnet_b0  min =   43.16  max =   46.03  avg =   43.65\n   efficientnetv2_b0  min =   48.64  max =   52.07  avg =   49.62\n        regnety_400m  min =   33.43  max =   35.87  avg =   33.97\n           blazeface  min =    5.43  max =    6.04  avg =    5.56\n           googlenet  min =   85.80  max =   90.93  avg =   87.65\n      googlenet_int8  min =  214.37  max =  230.75  avg =  219.50\n            resnet18  min =   76.58  max =   80.38  avg =   77.34\n       resnet18_int8  min =  231.16  max =  255.22  avg =  236.65\n             alexnet  min =   60.69  max =   64.06  avg =   61.34\n               vgg16  min =  286.45  max =  307.04  avg =  290.86\n          vgg16_int8  min = 1797.58  max = 2079.73  avg = 1844.78\n            resnet50  min =  198.27  max =  215.03  avg =  201.37\n       resnet50_int8  min =  493.52  max =  499.67  avg =  496.95\n      squeezenet_ssd  min =  189.97  max =  198.53  avg =  192.10\n squeezenet_ssd_int8  min =  198.81  max =  214.55  avg =  203.59\n       mobilenet_ssd  min =   87.56  max =   92.72  avg =   89.03\n  mobilenet_ssd_int8  min =  196.97  max =  209.51  avg =  201.95\n      mobilenet_yolo  min =  206.87  max =  218.48  avg =  210.84\n  mobilenetv2_yolov3  min =  102.72  max =  108.18  avg =  104.62\n         yolov4-tiny  min =  117.97  max =  134.73  avg =  121.26\n\nC:\\Users\\i\\Desktop\\benchmark>benchncnn.exe 32 2 0 -1 0\nloop_count = 32\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   13.43  max =   14.35  avg =   13.62\n     squeezenet_int8  min =   32.29  max =   50.76  avg =   33.56\n           mobilenet  min =   23.42  max =   25.10  avg =   24.09\n      mobilenet_int8  min =   51.99  max =   55.42  avg =   53.01\n        mobilenet_v2  min =   15.45  max =   15.75  avg =   15.59\n        mobilenet_v3  min =   14.32  max =   14.75  avg =   14.39\n          shufflenet  min =   12.64  max =   12.83  avg =   12.69\n       shufflenet_v2  min =   11.45  max =   
12.44  avg =   11.60\n             mnasnet  min =   14.43  max =   20.45  avg =   15.11\n     proxylessnasnet  min =   16.18  max =   16.38  avg =   16.24\n     efficientnet_b0  min =   25.25  max =   28.42  avg =   26.59\n   efficientnetv2_b0  min =   27.57  max =   32.05  avg =   30.04\n        regnety_400m  min =   22.74  max =   24.75  avg =   23.31\n           blazeface  min =    3.44  max =    3.83  avg =    3.62\n           googlenet  min =   49.39  max =   66.76  avg =   53.76\n      googlenet_int8  min =  113.89  max =  136.75  avg =  119.29\n            resnet18  min =   43.77  max =   67.24  avg =   46.14\n       resnet18_int8  min =  121.44  max =  148.01  avg =  126.95\n             alexnet  min =   34.46  max =   37.38  avg =   35.50\n               vgg16  min =  177.16  max =  207.25  avg =  184.19\n          vgg16_int8  min =  951.86  max = 1155.60  avg =  990.51\n            resnet50  min =  112.28  max =  137.18  avg =  115.64\n       resnet50_int8  min =  260.69  max =  272.26  avg =  265.89\n      squeezenet_ssd  min =  108.07  max =  121.66  avg =  110.35\n squeezenet_ssd_int8  min =  109.01  max =  126.86  avg =  111.96\n       mobilenet_ssd  min =   49.60  max =   52.62  avg =   50.46\n  mobilenet_ssd_int8  min =  104.22  max =  111.07  avg =  106.33\n      mobilenet_yolo  min =  117.42  max =  136.73  avg =  122.92\n  mobilenetv2_yolov3  min =   61.66  max =   65.22  avg =   63.01\n         yolov4-tiny  min =   72.64  max =   77.09  avg =   74.30\n\nC:\\Users\\i\\Desktop\\benchmark>benchncnn.exe 32 4 0 -1 0\nloop_count = 32\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    9.19  max =   14.82  avg =   11.15\n     squeezenet_int8  min =   19.00  max =   40.30  avg =   24.80\n           mobilenet  min =   18.02  max =   39.84  avg =   27.38\n      mobilenet_int8  min =   28.04  max =   57.59  avg =   34.15\n        mobilenet_v2  min =   10.26  max =   17.79  avg =   13.36\n        mobilenet_v3  
min =    8.87  max =   10.87  avg =    9.11\n          shufflenet  min =    8.93  max =   11.96  avg =    9.34\n       shufflenet_v2  min =    7.37  max =   13.10  avg =    8.72\n             mnasnet  min =    9.24  max =   14.90  avg =   11.32\n     proxylessnasnet  min =   10.21  max =   11.89  avg =   10.39\n     efficientnet_b0  min =   16.22  max =   23.71  avg =   16.59\n   efficientnetv2_b0  min =   17.44  max =   31.42  avg =   22.85\n        regnety_400m  min =   18.32  max =   24.02  avg =   18.90\n           blazeface  min =    2.22  max =    2.81  avg =    2.30\n           googlenet  min =   31.52  max =   51.80  avg =   42.11\n      googlenet_int8  min =   65.47  max =  114.41  avg =   75.98\n            resnet18  min =   28.90  max =   64.62  avg =   37.58\n       resnet18_int8  min =   71.29  max =  136.67  avg =  103.03\n             alexnet  min =   23.67  max =   34.01  avg =   29.78\n               vgg16  min =  142.18  max =  211.00  avg =  170.46\n          vgg16_int8  min =  531.36  max =  871.25  avg =  625.60\n            resnet50  min =   69.23  max =  108.67  avg =   73.68\n       resnet50_int8  min =  149.18  max =  309.88  avg =  168.68\n      squeezenet_ssd  min =   68.83  max =   81.70  avg =   71.01\n squeezenet_ssd_int8  min =   66.34  max =  118.16  avg =   74.34\n       mobilenet_ssd  min =   29.96  max =   34.32  avg =   30.74\n  mobilenet_ssd_int8  min =   56.87  max =   92.24  avg =   65.57\n      mobilenet_yolo  min =   74.26  max =  113.91  avg =   81.28\n  mobilenetv2_yolov3  min =   42.16  max =   63.49  avg =   45.34\n         yolov4-tiny  min =   53.06  max =   69.84  avg =   55.81\n\nC:\\Users\\i\\Desktop\\benchmark>benchncnn.exe 32 1 0 0 0\n[0 AMD Radeon(TM) Vega 8 Graphics]  queueC=1[2]  queueG=0[1]  queueT=2[1]\n[0 AMD Radeon(TM) Vega 8 Graphics]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 AMD Radeon(TM) Vega 8 Graphics]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 AMD Radeon(TM) Vega 8 Graphics]  subgroup=64  basic=1  
vote=1  ballot=1  shuffle=1\nloop_count = 32\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    6.78  max =    7.09  avg =    6.91\n     squeezenet_int8  min =   58.93  max =   62.53  avg =   60.11\n           mobilenet  min =    8.08  max =    8.39  avg =    8.25\n      mobilenet_int8  min =   97.74  max =  116.77  avg =  100.17\n        mobilenet_v2  min =    7.95  max =    8.27  avg =    8.14\n        mobilenet_v3  min =    8.70  max =    9.70  avg =    9.02\n          shufflenet  min =    6.36  max =    7.64  avg =    7.01\n       shufflenet_v2  min =    7.04  max =    8.12  avg =    7.50\n             mnasnet  min =    8.07  max =    9.08  avg =    8.38\n     proxylessnasnet  min =    8.56  max =    9.66  avg =    8.81\n     efficientnet_b0  min =   16.68  max =   18.00  avg =   17.30\n   efficientnetv2_b0  min =  394.82  max =  404.88  avg =  401.05\n        regnety_400m  min =   11.92  max =   12.17  avg =   12.03\n           blazeface  min =    4.82  max =    6.50  avg =    5.42\n           googlenet  min =   18.44  max =   19.66  avg =   19.18\n      googlenet_int8  min =  213.41  max =  231.79  avg =  218.31\n            resnet18  min =   14.27  max =   14.72  avg =   14.44\n       resnet18_int8  min =  228.79  max =  249.65  avg =  236.06\n             alexnet  min =   17.31  max =   18.31  avg =   17.69\n               vgg16  min =  111.85  max =  123.35  avg =  112.98\n          vgg16_int8  min = 1789.64  max = 1838.84  avg = 1826.05\n            resnet50  min =   31.61  max =   32.86  avg =   32.12\n       resnet50_int8  min =  483.57  max =  505.72  avg =  491.76\n      squeezenet_ssd  min =   99.66  max =  105.68  avg =  104.57\n squeezenet_ssd_int8  min =  200.48  max =  208.71  avg =  203.02\n       mobilenet_ssd  min =   33.45  max =   35.64  avg =   34.75\n  mobilenet_ssd_int8  min =  195.14  max =  205.35  avg =  200.18\n      mobilenet_yolo  min =   59.20  max =   61.06  avg =   60.47\n  
mobilenetv2_yolov3  min =   31.48  max =   33.25  avg =   32.84\n         yolov4-tiny  min =   93.75  max =   97.45  avg =   96.00\n```\n\n### Qualcomm SM8150-AC Snapdragon 855+ (Kryo 485 2.96 GHz + 2.42 GHz x 3 + 1.80 GHz x 4 + Adreno 640)\n```\nOnePlus7T:/data/local/tmp # ./benchncnn 8 4 2 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    3.60  max =    3.70  avg =    3.64\n     squeezenet_int8  min =    3.67  max =    3.78  avg =    3.71\n           mobilenet  min =    5.32  max =    5.42  avg =    5.38\n      mobilenet_int8  min =    4.20  max =    4.28  avg =    4.23\n        mobilenet_v2  min =    4.64  max =    4.73  avg =    4.68\n        mobilenet_v3  min =    4.13  max =    4.25  avg =    4.18\n          shufflenet  min =    3.29  max =    3.40  avg =    3.33\n       shufflenet_v2  min =    2.98  max =    3.07  avg =    3.01\n             mnasnet  min =    4.26  max =    4.37  avg =    4.31\n     proxylessnasnet  min =    4.67  max =    4.78  avg =    4.72\n     efficientnet_b0  min =    7.23  max =    7.34  avg =    7.30\n   efficientnetv2_b0  min =    8.74  max =    8.87  avg =    8.81\n        regnety_400m  min =    7.88  max =    7.99  avg =    7.95\n           blazeface  min =    1.19  max =    1.30  avg =    1.22\n           googlenet  min =   13.07  max =   13.20  avg =   13.12\n      googlenet_int8  min =   12.86  max =   12.98  avg =   12.93\n            resnet18  min =   10.33  max =   10.36  avg =   10.35\n       resnet18_int8  min =    9.42  max =    9.45  avg =    9.43\n             alexnet  min =   11.88  max =   11.95  avg =   11.91\n               vgg16  min =   59.34  max =   60.69  avg =   60.19\n          vgg16_int8  min =   68.78  max =   69.07  avg =   68.93\n            resnet50  min =   
26.18  max =   26.28  avg =   26.24\n       resnet50_int8  min =   20.86  max =   20.95  avg =   20.91\n      squeezenet_ssd  min =   12.00  max =   12.76  avg =   12.19\n squeezenet_ssd_int8  min =   11.67  max =   13.13  avg =   12.03\n       mobilenet_ssd  min =   11.88  max =   12.68  avg =   12.03\n  mobilenet_ssd_int8  min =    9.28  max =    9.68  avg =    9.35\n      mobilenet_yolo  min =   27.89  max =   28.06  avg =   27.96\n  mobilenetv2_yolov3  min =   18.00  max =   18.13  avg =   18.06\n         yolov4-tiny  min =   25.25  max =   25.36  avg =   25.29\n           nanodet_m  min =    8.93  max =    9.00  avg =    8.96\n    yolo-fastest-1.1  min =    3.73  max =    3.83  avg =    3.77\n      yolo-fastestv2  min =    3.38  max =    3.47  avg =    3.41\n  vision_transformer  min =  567.94  max =  572.31  avg =  569.66\n          FastestDet  min =    3.28  max =    3.37  avg =    3.32\n\nOnePlus7T:/data/local/tmp # ./benchncnn 8 1 2 -1 1\nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    8.24  max =    8.34  avg =    8.31\n     squeezenet_int8  min =    8.23  max =    8.34  avg =    8.30\n           mobilenet  min =   14.38  max =   14.56  avg =   14.45\n      mobilenet_int8  min =   11.12  max =   11.24  avg =   11.17\n        mobilenet_v2  min =    9.82  max =    9.88  avg =    9.84\n        mobilenet_v3  min =    8.15  max =    8.24  avg =    8.21\n          shufflenet  min =    5.32  max =    5.44  avg =    5.37\n       shufflenet_v2  min =    5.38  max =    5.51  avg =    5.44\n             mnasnet  min =    9.25  max =    9.36  avg =    9.31\n     proxylessnasnet  min =   10.95  max =   11.01  avg =   10.98\n     efficientnet_b0  min =   17.67  max =   17.79  avg =   17.73\n   efficientnetv2_b0  min =   
20.56  max =   20.70  avg =   20.60\n        regnety_400m  min =   11.96  max =   12.07  avg =   12.00\n           blazeface  min =    2.19  max =    2.87  avg =    2.47\n           googlenet  min =   32.10  max =   32.20  avg =   32.15\n      googlenet_int8  min =   32.00  max =   32.15  avg =   32.07\n            resnet18  min =   22.02  max =   22.28  avg =   22.12\n       resnet18_int8  min =   26.17  max =   26.26  avg =   26.22\n             alexnet  min =   24.83  max =   24.99  avg =   24.92\n               vgg16  min =  129.57  max =  129.95  avg =  129.78\n          vgg16_int8  min =  202.08  max =  202.34  avg =  202.19\n            resnet50  min =   65.85  max =   66.01  avg =   65.93\n       resnet50_int8  min =   56.33  max =   56.49  avg =   56.42\n      squeezenet_ssd  min =   22.52  max =   24.50  avg =   22.93\n squeezenet_ssd_int8  min =   24.51  max =   26.83  avg =   24.98\n       mobilenet_ssd  min =   30.55  max =   32.68  avg =   30.85\n  mobilenet_ssd_int8  min =   22.96  max =   23.75  avg =   23.09\n      mobilenet_yolo  min =   68.74  max =   69.01  avg =   68.88\n  mobilenetv2_yolov3  min =   36.98  max =   37.16  avg =   37.06\n         yolov4-tiny  min =   47.36  max =   47.45  avg =   47.41\n           nanodet_m  min =   15.08  max =   15.30  avg =   15.17\n    yolo-fastest-1.1  min =    5.51  max =    5.61  avg =    5.55\n      yolo-fastestv2  min =    4.92  max =    5.02  avg =    4.97\n  vision_transformer  min =  990.13  max =  994.45  avg =  991.95\n          FastestDet  min =    5.06  max =    5.17  avg =    5.11\n\nOnePlus7T:/data/local/tmp $ ./benchncnn 8 1 2 0 1\n[0 Adreno (TM) 640]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 640]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=1\n[0 Adreno (TM) 640]  fp16-p/s/a=1/0/1  int8-p/s/a=1/0/0\n[0 Adreno (TM) 640]  subgroup=64  basic=1  vote=1  ballot=0  shuffle=0\nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =    
8.59  max =    9.51  avg =    9.09\n           mobilenet  min =   13.04  max =   13.45  avg =   13.22\n        mobilenet_v2  min =   10.68  max =   11.38  avg =   10.85\n        mobilenet_v3  min =   11.86  max =   12.37  avg =   12.08\n          shufflenet  min =    8.21  max =    8.40  avg =    8.25\n       shufflenet_v2  min =    8.84  max =    9.13  avg =    8.97\n             mnasnet  min =   11.32  max =   11.72  avg =   11.45\n     proxylessnasnet  min =   12.27  max =   12.86  avg =   12.55\n     efficientnet_b0  min =   22.64  max =   22.82  avg =   22.75\n   efficientnetv2_b0  min =   32.32  max =   38.20  avg =   35.79\n        regnety_400m  min =   15.35  max =   15.86  avg =   15.64\n           blazeface  min =    2.82  max =    2.93  avg =    2.86\n           googlenet  min =   28.22  max =   28.34  avg =   28.26\n            resnet18  min =   24.71  max =   24.96  avg =   24.82\n             alexnet  min =   27.94  max =   28.10  avg =   28.01\n               vgg16  min =  106.08  max =  106.53  avg =  106.30\n            resnet50  min =   55.28  max =   56.03  avg =   55.68\n      squeezenet_ssd  min =   29.77  max =   30.65  avg =   30.05\n       mobilenet_ssd  min =   29.14  max =   29.39  avg =   29.25\n      mobilenet_yolo  min =   49.78  max =   50.09  avg =   49.94\n  mobilenetv2_yolov3  min =   31.11  max =   31.97  avg =   31.60\n         yolov4-tiny  min =   46.22  max =   46.90  avg =   46.63\n           nanodet_m  min =   15.96  max =   16.52  avg =   16.13\n    yolo-fastest-1.1  min =    9.59  max =    9.66  avg =    9.61\n      yolo-fastestv2  min =    7.99  max =    8.23  avg =    8.13\n```\n\n### Qualcomm MSM6150 Snapdragon 675 (Kryo460 2.0GHz x 2 + Kryo460 1.7GHz x 6 + Adreno 612)\n```\nviolet:/data/local/tmp/ncnn $ ./benchncnn 8 2 0\nloop_count = 8\nnum_threads = 2\npowersave = 0\ngpu_device = -1\n          squeezenet  min =   23.29  max =   24.65  avg =   23.95\n     squeezenet_int8  min =   23.24  max =   61.55  avg =   31.20\n    
       mobilenet  min =   31.60  max =   32.10  avg =   31.80\n      mobilenet_int8  min =   30.35  max =   32.03  avg =   30.95\n        mobilenet_v2  min =   25.92  max =   26.45  avg =   26.08\n          shufflenet  min =   11.91  max =   12.11  avg =   12.00\n             mnasnet  min =   21.38  max =   21.71  avg =   21.51\n     proxylessnasnet  min =   25.53  max =   25.78  avg =   25.62\n           googlenet  min =   93.62  max =  100.67  avg =   94.86\n      googlenet_int8  min =   90.74  max =   91.06  avg =   90.87\n            resnet18  min =   85.84  max =   87.37  avg =   86.50\n       resnet18_int8  min =   77.88  max =   78.11  avg =   78.00\n             alexnet  min =  196.33  max =  201.73  avg =  200.19\n               vgg16  min =  560.71  max =  571.75  avg =  564.84\n          vgg16_int8  min =  651.51  max =  652.68  avg =  652.12\n            resnet50  min =  178.25  max =  179.86  avg =  178.77\n       resnet50_int8  min =  181.07  max =  183.26  avg =  181.64\n      squeezenet_ssd  min =   64.86  max =   68.39  avg =   66.05\n squeezenet_ssd_int8  min =   69.61  max =   70.37  avg =   69.93\n       mobilenet_ssd  min =   65.92  max =   67.03  avg =   66.41\n  mobilenet_ssd_int8  min =   61.54  max =   63.38  avg =   62.27\n      mobilenet_yolo  min =  143.42  max =  146.69  avg =  144.33\n    mobilenet_yolov3  min =  150.45  max =  152.30  avg =  151.36\n\nviolet:/data/local/tmp/ncnn $ ./benchncnn 8 1 0\nloop_count = 8\nnum_threads = 1\npowersave = 0\ngpu_device = -1\n          squeezenet  min =   36.04  max =   37.25  avg =   36.48\n     squeezenet_int8  min =   37.82  max =   79.20  avg =   43.13\n           mobilenet  min =   54.29  max =   54.73  avg =   54.41\n      mobilenet_int8  min =   58.90  max =   60.11  avg =   59.39\n        mobilenet_v2  min =   38.64  max =   40.22  avg =   38.97\n          shufflenet  min =   18.05  max =   18.39  avg =   18.19\n             mnasnet  min =   34.65  max =   34.98  avg =   34.79\n     
proxylessnasnet  min =   42.61  max =   43.12  avg =   42.80\n           googlenet  min =  164.74  max =  165.89  avg =  165.34\n      googlenet_int8  min =  159.93  max =  160.38  avg =  160.12\n            resnet18  min =  135.76  max =  137.93  avg =  136.98\n       resnet18_int8  min =  140.22  max =  144.06  avg =  141.92\n             alexnet  min =  391.01  max =  396.85  avg =  392.74\n               vgg16  min = 1019.35  max = 1022.75  avg = 1021.26\n          vgg16_int8  min = 1122.25  max = 1137.99  avg = 1124.78\n            resnet50  min =  302.16  max =  304.22  avg =  303.05\n       resnet50_int8  min =  318.35  max =  319.50  avg =  318.84\n      squeezenet_ssd  min =   91.26  max =   94.86  avg =   92.39\n squeezenet_ssd_int8  min =  105.06  max =  106.17  avg =  105.56\n       mobilenet_ssd  min =  105.01  max =  105.95  avg =  105.40\n  mobilenet_ssd_int8  min =  119.93  max =  120.50  avg =  120.19\n      mobilenet_yolo  min =  229.87  max =  230.76  avg =  230.21\n    mobilenet_yolov3  min =  242.10  max =  242.91  avg =  242.47\n```\n\n### Kirin 970 (Cortex-A73 2.4GHz x 4 + Cortex-A53 1.8GHz x 4)\n```\nHWEML:/data/local/tmp/ncnnbench $ ./benchncnn 8 4 2 -1 1\n[0 Mali-G72]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G72]  buglssc=0  bugsbn1=0  buglbia=0  bugihfa=1\n[0 Mali-G72]  fp16p=1  fp16s=0  fp16a=1  int8s=0  int8a=0\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   24.38  max =   28.03  avg =   25.83\n     squeezenet_int8  min =   21.79  max =   24.80  avg =   22.60\n           mobilenet  min =   34.09  max =   36.88  avg =   35.93\n      mobilenet_int8  min =   52.62  max =   61.70  avg =   55.38\n        mobilenet_v2  min =   23.71  max =   25.70  avg =   24.49\n        mobilenet_v3  min =   20.66  max =   25.68  avg =   23.07\n          shufflenet  min =   17.89  max =   19.91  avg =   18.53\n       shufflenet_v2  min =   13.73  max =   16.54  avg =   15.37\n           
  mnasnet  min =   24.36  max =   27.14  avg =   25.58\n     proxylessnasnet  min =   27.19  max =   29.70  avg =   28.59\n     efficientnet_b0  min =   49.31  max =   50.26  avg =   49.70\n        regnety_400m  min =   42.54  max =   51.22  avg =   46.71\n           blazeface  min =    5.49  max =    7.67  avg =    6.27\n           googlenet  min =   72.67  max =   81.22  avg =   75.92\n      googlenet_int8  min =   67.60  max =   74.50  avg =   71.21\n            resnet18  min =   69.32  max =   81.59  avg =   73.45\n       resnet18_int8  min =   60.92  max =   68.11  avg =   64.18\n             alexnet  min =   60.90  max =   79.28  avg =   66.72\n               vgg16  min =  337.01  max =  378.89  avg =  352.37\n          vgg16_int8  min =  465.88  max =  505.19  avg =  489.76\n            resnet50  min =  207.75  max =  220.74  avg =  214.42\n       resnet50_int8  min =  165.67  max =  183.80  avg =  171.27\n      squeezenet_ssd  min =   72.77  max =   84.45  avg =   79.09\n squeezenet_ssd_int8  min =   75.37  max =   86.58  avg =   78.70\n       mobilenet_ssd  min =   88.88  max =   96.43  avg =   92.02\n  mobilenet_ssd_int8  min =   89.04  max =  101.35  avg =   92.23\n      mobilenet_yolo  min =  189.73  max =  206.55  avg =  193.64\n  mobilenetv2_yolov3  min =   99.08  max =  111.64  avg =  104.23\n\nHWEML:/data/local/tmp/ncnnbench $ ./benchncnn 8 1 2 -1 1\n[0 Mali-G72]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G72]  buglssc=0  bugsbn1=0  buglbia=0  bugihfa=1\n[0 Mali-G72]  fp16p=1  fp16s=0  fp16a=1  int8s=0  int8a=0\nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   73.47  max =   81.39  avg =   76.06\n     squeezenet_int8  min =   62.63  max =   73.66  avg =   66.52\n           mobilenet  min =  103.85  max =  112.83  avg =  108.98\n      mobilenet_int8  min =  152.27  max =  161.26  avg =  157.17\n        mobilenet_v2  min =   70.53  max =   87.26  avg =   76.67\n        mobilenet_v3  
min =   59.87  max =   68.59  avg =   63.08\n          shufflenet  min =   36.69  max =   41.45  avg =   39.24\n       shufflenet_v2  min =   33.97  max =   37.84  avg =   35.03\n             mnasnet  min =   69.24  max =   79.73  avg =   74.20\n     proxylessnasnet  min =   78.63  max =   88.57  avg =   81.83\n     efficientnet_b0  min =  147.45  max =  159.07  avg =  152.09\n        regnety_400m  min =   90.83  max =   98.51  avg =   93.82\n           blazeface  min =   10.05  max =   11.59  avg =   10.78\n           googlenet  min =  240.26  max =  277.71  avg =  259.61\n      googlenet_int8  min =  214.64  max =  233.56  avg =  225.01\n            resnet18  min =  245.62  max =  268.49  avg =  260.37\n       resnet18_int8  min =  184.85  max =  194.91  avg =  190.60\n             alexnet  min =  202.52  max =  241.12  avg =  211.51\n               vgg16  min = 1632.98  max = 1769.05  avg = 1710.89\n          vgg16_int8  min = 1237.01  max = 1316.40  avg = 1273.44\n            resnet50  min =  558.41  max =  601.59  avg =  581.26\n       resnet50_int8  min =  425.26  max =  445.19  avg =  436.22\n      squeezenet_ssd  min =  228.50  max =  255.89  avg =  244.63\n squeezenet_ssd_int8  min =  166.97  max =  193.77  avg =  180.22\n       mobilenet_ssd  min =  226.54  max =  246.62  avg =  235.75\n  mobilenet_ssd_int8  min =  231.35  max =  249.63  avg =  241.29\n      mobilenet_yolo  min =  469.71  max =  508.79  avg =  497.50\n  mobilenetv2_yolov3  min =  242.88  max =  265.30  avg =  254.68\n\nHWEML:/data/local/tmp/ncnnbench $ ./benchncnn 4 1 2 0 1\n[0 Mali-G72]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G72]  buglssc=0  bugsbn1=0  buglbia=0  bugihfa=1\n[0 Mali-G72]  fp16p=1  fp16s=0  fp16a=1  int8s=0  int8a=0\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   24.54  max =   25.75  avg =   25.16\n           mobilenet  min =   22.03  max =   29.61  avg =   27.31\n        mobilenet_v2  min =   20.15 
 max =   28.05  avg =   25.35\n        mobilenet_v3  min =   34.26  max =   37.49  avg =   35.51\n          shufflenet  min =   26.29  max =   27.68  avg =   26.86\n       shufflenet_v2  min =   29.60  max =   32.08  avg =   31.27\n             mnasnet  min =   25.85  max =   29.38  avg =   27.98\n     proxylessnasnet  min =   23.64  max =   30.09  avg =   26.36\n     efficientnet_b0  min =   52.55  max =   58.51  avg =   55.56\n        regnety_400m  min =   37.81  max =   43.22  avg =   40.30\n           blazeface  min =    9.14  max =   10.93  avg =   10.08\n           googlenet  min =   60.19  max =   62.84  avg =   61.51\n            resnet18  min =   50.42  max =   52.93  avg =   51.70\n             alexnet  min =  195.34  max =  196.98  avg =  196.14\n               vgg16  min =  725.88  max =  751.20  avg =  739.99\n            resnet50  min =  124.47  max =  125.93  avg =  125.02\n      squeezenet_ssd  min =   91.79  max =   97.04  avg =   93.56\n       mobilenet_ssd  min =   51.81  max =   59.31  avg =   54.09\n      mobilenet_yolo  min =  124.67  max =  127.62  avg =  126.53\n  mobilenetv2_yolov3  min =   53.11  max =   54.81  avg =   54.11\n```\n\n### Qualcomm MSM8998 Snapdragon 835 (Kryo 2.45GHz x 4 + Kryo 1.9GHz x 4 + Adreno 540)\n```\ntaimen:/data/local/tmp/ncnnbench $ ./benchncnn 8 4 2 -1 0\n[0 Adreno (TM) 540]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 540]  buglssc=0  bugsbn1=1  buglbia=0  bugihfa=0\n[0 Adreno (TM) 540]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   28.46  max =   30.89  avg =   29.77\n     squeezenet_int8  min =   30.32  max =   32.92  avg =   31.68\n           mobilenet  min =   36.65  max =   38.37  avg =   37.32\n      mobilenet_int8  min =   62.91  max =   66.71  avg =   64.49\n        mobilenet_v2  min =   27.85  max =   31.21  avg =   29.41\n        mobilenet_v3  min =   23.83  max =   26.40  avg =   
24.79\n          shufflenet  min =   15.65  max =   16.88  avg =   16.27\n       shufflenet_v2  min =   13.70  max =   14.49  avg =   14.08\n             mnasnet  min =   25.04  max =   28.35  avg =   26.45\n     proxylessnasnet  min =   27.49  max =   29.58  avg =   28.62\n     efficientnet_b0  min =   48.43  max =   49.41  avg =   48.85\n        regnety_400m  min =   42.48  max =   43.78  avg =   43.18\n           blazeface  min =    4.39  max =    4.68  avg =    4.51\n           googlenet  min =   75.98  max =   78.40  avg =   77.37\n      googlenet_int8  min =   79.26  max =   83.20  avg =   80.55\n            resnet18  min =   73.60  max =   76.97  avg =   75.63\n       resnet18_int8  min =   62.93  max =   65.94  avg =   64.50\n             alexnet  min =   64.18  max =   67.02  avg =   65.49\n               vgg16  min =  389.39  max =  399.13  avg =  394.09\n          vgg16_int8  min =  509.06  max =  524.41  avg =  514.76\n            resnet50  min =  188.21  max =  194.58  avg =  191.98\n       resnet50_int8  min =  182.84  max =  187.22  avg =  184.23\n      squeezenet_ssd  min =   77.69  max =   81.17  avg =   79.24\n squeezenet_ssd_int8  min =   81.71  max =   84.12  avg =   82.90\n       mobilenet_ssd  min =   78.35  max =   81.50  avg =   79.82\n  mobilenet_ssd_int8  min =   96.84  max =  100.97  avg =   98.42\n      mobilenet_yolo  min =  167.32  max =  170.71  avg =  168.87\n  mobilenetv2_yolov3  min =   97.00  max =  102.11  avg =   99.01\n\ntaimen:/data/local/tmp/ncnnbench $ ./benchncnn 8 1 2 -1 1\n[0 Adreno (TM) 540]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 540]  buglssc=0  bugsbn1=1  buglbia=0  bugihfa=0\n[0 Adreno (TM) 540]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   67.25  max =   71.39  avg =   69.35\n     squeezenet_int8  min =   62.12  max =   66.35  avg =   63.73\n           mobilenet  min =  103.30  max =  
110.39  avg =  107.13\n      mobilenet_int8  min =  155.24  max =  161.42  avg =  157.82\n        mobilenet_v2  min =   71.89  max =   74.73  avg =   73.48\n        mobilenet_v3  min =   58.35  max =   63.43  avg =   60.68\n          shufflenet  min =   35.96  max =   39.43  avg =   36.94\n       shufflenet_v2  min =   35.53  max =   39.86  avg =   37.10\n             mnasnet  min =   66.71  max =   74.00  avg =   68.65\n     proxylessnasnet  min =   76.50  max =   82.20  avg =   78.57\n     efficientnet_b0  min =  142.32  max =  152.17  avg =  146.14\n        regnety_400m  min =   89.60  max =   98.27  avg =   92.62\n           blazeface  min =   10.45  max =   12.81  avg =   11.07\n           googlenet  min =  222.75  max =  233.61  avg =  228.38\n      googlenet_int8  min =  206.70  max =  212.20  avg =  209.24\n            resnet18  min =  210.86  max =  220.25  avg =  213.65\n       resnet18_int8  min =  176.04  max =  183.58  avg =  178.71\n             alexnet  min =  185.97  max =  195.91  avg =  191.40\n               vgg16  min = 1176.82  max = 1200.64  avg = 1187.88\n          vgg16_int8  min = 1086.52  max = 1105.00  avg = 1095.53\n            resnet50  min =  517.48  max =  533.99  avg =  526.04\n       resnet50_int8  min =  417.30  max =  435.81  avg =  422.36\n      squeezenet_ssd  min =  164.88  max =  171.21  avg =  167.51\n squeezenet_ssd_int8  min =  164.78  max =  171.77  avg =  168.36\n       mobilenet_ssd  min =  221.41  max =  229.13  avg =  226.18\n  mobilenet_ssd_int8  min =  234.15  max =  245.91  avg =  239.01\n      mobilenet_yolo  min =  471.34  max =  484.99  avg =  477.15\n  mobilenetv2_yolov3  min =  249.14  max =  257.61  avg =  252.54\n\ntaimen:/data/local/tmp/ncnnbench $ ./benchncnn 8 1 2 0 1\n[0 Adreno (TM) 540]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 540]  buglssc=0  bugsbn1=1  buglbia=0  bugihfa=0\n[0 Adreno (TM) 540]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 8\nnum_threads = 1\npowersave = 
2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   18.74  max =   19.89  avg =   19.22\n           mobilenet  min =   21.19  max =   25.61  avg =   22.94\n        mobilenet_v2  min =   24.15  max =   34.68  avg =   30.12\n        mobilenet_v3  min =   25.94  max =   33.15  avg =   30.09\n          shufflenet  min =   25.05  max =   31.41  avg =   27.85\n       shufflenet_v2  min =   28.82  max =   32.04  avg =   30.95\n             mnasnet  min =   21.34  max =   27.69  avg =   24.17\n     proxylessnasnet  min =   25.51  max =   30.03  avg =   28.01\n     efficientnet_b0  min =   42.94  max =   47.44  avg =   45.28\n        regnety_400m  min =   36.36  max =   55.73  avg =   41.82\n           blazeface  min =   11.14  max =   13.11  avg =   12.20\n           googlenet  min =   49.72  max =   56.92  avg =   51.79\n            resnet18  min =   44.63  max =   47.37  avg =   45.86\n             alexnet  min =   42.83  max =   46.34  avg =   44.63\n               vgg16  min =  568.82  max =  586.75  avg =  578.60\n            resnet50  min =  108.63  max =  115.76  avg =  110.38\n      squeezenet_ssd  min =   85.22  max =  104.73  avg =   93.14\n       mobilenet_ssd  min =   49.91  max =   56.86  avg =   52.33\n      mobilenet_yolo  min =   98.76  max =  109.37  avg =  102.27\n  mobilenetv2_yolov3  min =   57.49  max =   61.15  avg =   58.74\n```\n\n### Qualcomm SDM765G Snapdragon 765G (Kryo 1.8GHz x 6 + Kryo 2.2GHz x 2 + Adreno 620)\n```\n130|bramble:/data/local/tmp $ ./benchncnn 8 4 2 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    9.84  max =   11.72  avg =   10.36\n     squeezenet_int8  min =   10.80  max =   11.13  avg =   10.96\n           mobilenet  min =   14.04  max =   14.37  avg =   14.20\n      mobilenet_int8  min =   13.39  max =   13.75  avg =   13.59\n        mobilenet_v2  min =   13.04  max =   13.51  avg =   13.27\n        mobilenet_v3  min =   11.00  max =   
13.21  avg =   12.54\n          shufflenet  min =   11.08  max =   11.22  avg =   11.16\n       shufflenet_v2  min =    8.45  max =    8.50  avg =    8.47\n             mnasnet  min =   14.15  max =   14.69  avg =   14.38\n     proxylessnasnet  min =   14.49  max =   15.07  avg =   14.83\n     efficientnet_b0  min =   28.99  max =   29.53  avg =   29.24\n   efficientnetv2_b0  min =   38.92  max =   39.34  avg =   39.14\n        regnety_400m  min =   33.46  max =   33.81  avg =   33.62\n           blazeface  min =    4.22  max =    4.30  avg =    4.27\n           googlenet  min =   35.24  max =   36.94  avg =   35.57\n      googlenet_int8  min =   45.26  max =   46.46  avg =   45.78\n            resnet18  min =   33.14  max =   33.75  avg =   33.31\n       resnet18_int8  min =   43.26  max =   43.50  avg =   43.35\n             alexnet  min =   25.40  max =   26.19  avg =   25.74\n               vgg16  min =  121.39  max =  122.35  avg =  121.78\n          vgg16_int8  min =  243.47  max =  249.94  avg =  245.56\n            resnet50  min =   67.05  max =   70.16  avg =   68.20\n       resnet50_int8  min =   76.95  max =   80.23  avg =   78.18\n      squeezenet_ssd  min =   32.02  max =   33.27  avg =   32.51\n squeezenet_ssd_int8  min =   36.31  max =   38.35  avg =   37.09\n       mobilenet_ssd  min =   32.02  max =   34.55  avg =   32.99\n  mobilenet_ssd_int8  min =   32.31  max =   33.92  avg =   32.77\n      mobilenet_yolo  min =   99.12  max =  109.81  avg =  103.00\n  mobilenetv2_yolov3  min =   59.74  max =   60.95  avg =   60.21\n         yolov4-tiny  min =   57.83  max =   72.15  avg =   68.75\n           nanodet_m  min =   22.76  max =   22.97  avg =   22.85\n    yolo-fastest-1.1  min =   13.58  max =   13.93  avg =   13.80\n      yolo-fastestv2  min =   12.06  max =   12.27  avg =   12.15\n  vision_transformer  min = 1274.67  max = 1597.52  avg = 1363.14\n          FastestDet  min =    9.75  max =    9.86  avg =    9.81\n\n130|bramble:/data/local/tmp $ 
./benchncnn 8 4 2 0 1\n[0 Adreno (TM) 620]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 620]  bugsbn1=1  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Adreno (TM) 620]  fp16-p/s/u/a=1/1/0/1  int8-p/s/u/a=1/0/0/1\n[0 Adreno (TM) 620]  subgroup=64  basic/vote/ballot/shuffle=1/1/1/1\n[0 Adreno (TM) 620]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   25.06  max =   25.80  avg =   25.53\n     squeezenet_int8  min =    9.75  max =    9.82  avg =    9.78\n           mobilenet  min =   43.43  max =   44.04  avg =   43.71\n      mobilenet_int8  min =   11.12  max =   11.59  avg =   11.34\n        mobilenet_v2  min =   32.14  max =   32.58  avg =   32.40\n        mobilenet_v3  min =   32.75  max =   32.98  avg =   32.87\n          shufflenet  min =   29.29  max =   29.63  avg =   29.40\n       shufflenet_v2  min =   32.43  max =   33.18  avg =   32.69\n             mnasnet  min =   34.58  max =   35.24  avg =   35.00\n     proxylessnasnet  min =   40.61  max =   41.40  avg =   40.98\n     efficientnet_b0  min =   49.44  max =   50.46  avg =   49.95\n   efficientnetv2_b0  min =  185.31  max =  187.37  avg =  186.24\n        regnety_400m  min =   41.43  max =   42.75  avg =   41.84\n           blazeface  min =   13.47  max =   14.07  avg =   13.72\n           googlenet  min =   78.12  max =   79.06  avg =   78.56\n      googlenet_int8  min =   48.73  max =   50.13  avg =   49.20\n            resnet18  min =   73.61  max =   74.05  avg =   73.75\n       resnet18_int8  min =   21.87  max =   22.05  avg =   21.95\n             alexnet  min =  128.58  max =  129.51  avg =  128.97\n               vgg16  min =  437.64  max =  439.12  avg =  438.28\n          vgg16_int8  min =  232.77  max =  243.06  avg =  239.54\n            resnet50  min =  187.36  max =  188.47  avg =  188.01\n       resnet50_int8  min =   75.79  max =   77.33  avg =   76.64\n      squeezenet_ssd  min =  
 80.68  max =   84.50  avg =   81.93\n squeezenet_ssd_int8  min =   29.88  max =   30.77  avg =   30.30\n       mobilenet_ssd  min =   94.77  max =   96.46  avg =   95.79\n  mobilenet_ssd_int8  min =   29.03  max =   30.07  avg =   29.53\n      mobilenet_yolo  min =  185.97  max =  188.11  avg =  186.59\n  mobilenetv2_yolov3  min =  108.43  max =  164.75  avg =  121.55\n         yolov4-tiny  min =  149.38  max =  158.39  avg =  153.92\n           nanodet_m  min =   46.73  max =   48.85  avg =   47.73\n    yolo-fastest-1.1  min =   26.32  max =   26.77  avg =   26.54\n      yolo-fastestv2  min =   38.87  max =   39.31  avg =   39.13\n  vision_transformer  min = 3392.80  max = 3397.79  avg = 3396.09\n          FastestDet  min =   43.05  max =   43.81  avg =   43.45\n```\n\n### Qualcomm SDM660 Snapdragon 660 (Kryo260 2.2GHz x 4 + Kryo260 1.84GHz x 4 + Adreno 512)\n```\nlavender:/data/local/tmp/ncnnbench $ ./benchncnn 8 8 0 -1 1\n[0 Adreno (TM) 512]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 512]  buglssc=0  bugsbn1=1  buglbia=0  bugihfa=0\n[0 Adreno (TM) 512]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 8\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   29.05  max =   44.86  avg =   33.26\n     squeezenet_int8  min =   35.47  max =   37.10  avg =   36.09\n           mobilenet  min =   31.59  max =   33.47  avg =   32.33\n      mobilenet_int8  min =   77.50  max =   91.15  avg =   82.98\n        mobilenet_v2  min =   33.63  max =   35.43  avg =   34.54\n        mobilenet_v3  min =   29.97  max =   49.80  avg =   34.81\n          shufflenet  min =   28.52  max =   30.09  avg =   29.09\n       shufflenet_v2  min =   19.15  max =   21.15  avg =   19.99\n             mnasnet  min =   29.91  max =   35.11  avg =   31.46\n     proxylessnasnet  min =   33.28  max =  117.09  avg =   55.22\n     efficientnet_b0  min =   52.29  max =   57.93  avg =   55.04\n        regnety_400m  min =   96.05  max =  
116.42  avg =  102.07\n           blazeface  min =    7.98  max =   11.83  avg =    8.89\n           googlenet  min =   76.88  max =  103.99  avg =   84.54\n      googlenet_int8  min =   97.68  max =  118.56  avg =  104.92\n            resnet18  min =   75.93  max =   89.31  avg =   80.00\n       resnet18_int8  min =   73.27  max =   80.84  avg =   76.19\n             alexnet  min =   90.94  max =  114.57  avg =   96.42\n               vgg16  min =  381.30  max =  615.62  avg =  555.96\n          vgg16_int8  min =  803.75  max = 1126.53  avg =  886.03\n            resnet50  min =  257.38  max =  285.19  avg =  266.59\n       resnet50_int8  min =  304.81  max =  338.01  avg =  314.84\n      squeezenet_ssd  min =  117.59  max =  145.79  avg =  123.79\n squeezenet_ssd_int8  min =  132.80  max =  163.00  avg =  149.99\n       mobilenet_ssd  min =  103.98  max =  126.90  avg =  113.10\n  mobilenet_ssd_int8  min =  167.86  max =  188.46  avg =  180.56\n      mobilenet_yolo  min =  201.75  max =  263.92  avg =  240.17\n  mobilenetv2_yolov3  min =  143.76  max =  167.77  avg =  151.94\n\nlavender:/data/local/tmp/ncnnbench $ ./benchncnn 4 1 2 -1 1\n[0 Adreno (TM) 512]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 512]  buglssc=0  bugsbn1=1  buglbia=0  bugihfa=0\n[0 Adreno (TM) 512]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   69.75  max =   71.33  avg =   70.38\n     squeezenet_int8  min =   67.12  max =   68.07  avg =   67.59\n           mobilenet  min =  107.65  max =  110.48  avg =  108.82\n      mobilenet_int8  min =  163.13  max =  164.74  avg =  164.24\n        mobilenet_v2  min =   75.50  max =   77.36  avg =   76.38\n        mobilenet_v3  min =   59.05  max =   59.36  avg =   59.23\n          shufflenet  min =   38.33  max =   38.74  avg =   38.57\n       shufflenet_v2  min =   37.43  max =   38.97  avg =   38.32\n             mnasnet  min =   
69.29  max =   73.20  avg =   70.73\n     proxylessnasnet  min =   80.81  max =   82.66  avg =   81.52\n     efficientnet_b0  min =  151.20  max =  152.38  avg =  151.72\n        regnety_400m  min =   93.53  max =   94.53  avg =   94.19\n           blazeface  min =   12.15  max =   12.82  avg =   12.46\n           googlenet  min =  239.63  max =  242.64  avg =  241.06\n      googlenet_int8  min =  214.71  max =  216.53  avg =  215.79\n            resnet18  min =  234.20  max =  238.74  avg =  236.90\n       resnet18_int8  min =  181.57  max =  183.97  avg =  182.66\n             alexnet  min =  205.94  max =  207.44  avg =  206.63\n               vgg16  min = 1188.14  max = 1201.95  avg = 1196.93\n          vgg16_int8  min = 1081.21  max = 1087.84  avg = 1085.17\n            resnet50  min =  556.54  max =  566.68  avg =  561.21\n       resnet50_int8  min =  433.19  max =  433.93  avg =  433.48\n      squeezenet_ssd  min =  169.02  max =  170.54  avg =  169.73\n squeezenet_ssd_int8  min =  176.28  max =  177.90  avg =  176.87\n       mobilenet_ssd  min =  228.15  max =  232.69  avg =  230.38\n  mobilenet_ssd_int8  min =  236.97  max =  239.69  avg =  238.35\n      mobilenet_yolo  min =  493.33  max =  506.34  avg =  499.79\n  mobilenetv2_yolov3  min =  252.53  max =  261.58  avg =  256.30\n\nlavender:/data/local/tmp/ncnnbench $ ./benchncnn 4 1 2 0 1\n[0 Adreno (TM) 512]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 512]  buglssc=0  bugsbn1=1  buglbia=0  bugihfa=0\n[0 Adreno (TM) 512]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   34.49  max =   34.65  avg =   34.55\n           mobilenet  min =   54.45  max =   55.52  avg =   54.75\n        mobilenet_v2  min =   39.32  max =   39.58  avg =   39.50\n        mobilenet_v3  min =   36.13  max =   36.28  avg =   36.19\n          shufflenet  min =   35.25  max =   35.42  avg =   35.31\n       
shufflenet_v2  min =   31.38  max =   31.70  avg =   31.53\n             mnasnet  min =   40.95  max =   41.32  avg =   41.13\n     proxylessnasnet  min =   43.81  max =   44.05  avg =   43.90\n     efficientnet_b0  min =   68.34  max =   68.56  avg =   68.47\n        regnety_400m  min =   53.89  max =   54.23  avg =   54.02\n           blazeface  min =   19.82  max =   27.74  avg =   22.01\n           googlenet  min =  119.46  max =  119.98  avg =  119.80\n            resnet18  min =  115.56  max =  120.28  avg =  116.88\n             alexnet  min =  102.06  max =  105.56  avg =  102.97\n               vgg16  min = 1192.29  max = 1202.17  avg = 1197.03\n            resnet50  min =  294.87  max =  298.79  avg =  296.05\n      squeezenet_ssd  min =  167.85  max =  168.42  avg =  168.09\n       mobilenet_ssd  min =  120.30  max =  120.37  avg =  120.34\n      mobilenet_yolo  min =  256.60  max =  260.21  avg =  257.54\n  mobilenetv2_yolov3  min =  121.48  max =  125.22  avg =  122.53\n```\n\n### Qualcomm MSM8996 Pro Snapdragon 821 (Kryo 2.35GHz x 2 + Kryo 2.19GHz x 2)\n```\nnatrium:/data/local/tmp # ./benchncnn 8 4 0 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   18.46  max =   19.12  avg =   18.78\n     squeezenet_int8  min =   16.69  max =   17.22  avg =   16.95\n           mobilenet  min =   27.33  max =   28.74  avg =   27.88\n      mobilenet_int8  min =   20.14  max =   20.71  avg =   20.46\n        mobilenet_v2  min =   21.94  max =   23.09  avg =   22.38\n        mobilenet_v3  min =   18.81  max =   19.45  avg =   19.04\n          shufflenet  min =   14.07  max =   14.75  avg =   14.29\n       shufflenet_v2  min =   11.52  max =   11.92  avg =   11.71\n             mnasnet  min =   20.41  max =   21.75  avg =   20.74\n     proxylessnasnet  min =   22.99  max =   23.63  avg =   23.13\n     efficientnet_b0  min =   34.74  max =   35.26  avg =   34.91\n   efficientnetv2_b0  min =   41.16  max 
=   41.60  avg =   41.39\n        regnety_400m  min =   44.27  max =   45.01  avg =   44.69\n           blazeface  min =    4.25  max =    4.71  avg =    4.43\n           googlenet  min =   54.88  max =   55.55  avg =   55.12\n      googlenet_int8  min =   51.88  max =   52.72  avg =   52.25\n            resnet18  min =   44.33  max =   45.44  avg =   44.88\n       resnet18_int8  min =   51.24  max =   51.94  avg =   51.54\n             alexnet  min =   38.62  max =   39.31  avg =   38.88\n               vgg16  min =  242.53  max =  244.23  avg =  243.16\n          vgg16_int8  min =  183.15  max =  204.96  avg =  192.16\n            resnet50  min =  122.14  max =  124.29  avg =  122.94\n       resnet50_int8  min =  116.61  max =  118.47  avg =  117.56\n      squeezenet_ssd  min =   47.92  max =   49.01  avg =   48.45\n squeezenet_ssd_int8  min =   43.21  max =   44.45  avg =   43.76\n       mobilenet_ssd  min =   56.92  max =   58.21  avg =   57.56\n  mobilenet_ssd_int8  min =   42.26  max =   42.92  avg =   42.48\n      mobilenet_yolo  min =  126.20  max =  128.50  avg =  127.10\n  mobilenetv2_yolov3  min =   75.49  max =   76.50  avg =   76.01\n         yolov4-tiny  min =   94.24  max =   95.75  avg =   94.83\n           nanodet_m  min =   31.30  max =   31.93  avg =   31.62\n    yolo-fastest-1.1  min =   16.89  max =   17.56  avg =   17.23\n      yolo-fastestv2  min =   12.97  max =   13.50  avg =   13.15\n\nnatrium:/data/local/tmp # ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   46.27  max =   46.60  avg =   46.45\n     squeezenet_int8  min =   41.33  max =   41.73  avg =   41.56\n           mobilenet  min =   80.89  max =   81.16  avg =   81.00\n      mobilenet_int8  min =   60.33  max =   62.29  avg =   61.33\n        mobilenet_v2  min =   51.78  max =   52.02  avg =   51.88\n        mobilenet_v3  min =   43.71  max =   44.17  avg =   43.91\n          shufflenet  min =   
24.96  max =   25.08  avg =   25.02\n       shufflenet_v2  min =   24.09  max =   24.26  avg =   24.17\n             mnasnet  min =   51.28  max =   51.42  avg =   51.35\n     proxylessnasnet  min =   59.25  max =   59.66  avg =   59.48\n     efficientnet_b0  min =   92.16  max =   92.34  avg =   92.22\n   efficientnetv2_b0  min =  112.27  max =  113.63  avg =  113.17\n        regnety_400m  min =   68.59  max =   68.85  avg =   68.75\n           blazeface  min =    7.36  max =    7.83  avg =    7.59\n           googlenet  min =  151.15  max =  151.53  avg =  151.37\n      googlenet_int8  min =  152.01  max =  158.63  avg =  154.18\n            resnet18  min =  121.49  max =  121.90  avg =  121.77\n       resnet18_int8  min =  154.54  max =  166.73  avg =  161.30\n             alexnet  min =   97.41  max =   97.74  avg =   97.62\n               vgg16  min =  674.80  max =  675.86  avg =  675.38\n          vgg16_int8  min =  593.42  max =  602.98  avg =  596.93\n            resnet50  min =  360.44  max =  364.31  avg =  362.01\n       resnet50_int8  min =  371.21  max =  386.24  avg =  381.53\n      squeezenet_ssd  min =   97.72  max =   98.32  avg =   98.01\n squeezenet_ssd_int8  min =   98.33  max =   99.15  avg =   98.63\n       mobilenet_ssd  min =  161.72  max =  161.89  avg =  161.79\n  mobilenet_ssd_int8  min =  122.44  max =  123.38  avg =  123.00\n      mobilenet_yolo  min =  367.34  max =  369.59  avg =  368.97\n  mobilenetv2_yolov3  min =  190.09  max =  190.77  avg =  190.31\n         yolov4-tiny  min =  241.59  max =  242.29  avg =  241.81\n           nanodet_m  min =   63.03  max =   63.22  avg =   63.12\n    yolo-fastest-1.1  min =   29.06  max =   29.22  avg =   29.12\n      yolo-fastestv2  min =   22.72  max =   22.80  avg =   22.77\n```\n\n### Qualcomm MSM8994 Snapdragon 810 (Cortex-A57 2.0GHz x 4 + Cortex-A53 1.55GHz x 4)\n```\nangler:/data/local/tmp $ ./benchncnn 8 8 0 -1 1\nloop_count = 8\nnum_threads = 8\npowersave = 0\ngpu_device = 
-1\ncooling_down = 1\n          squeezenet  min =   25.83  max =   29.17  avg =   27.69\n     squeezenet_int8  min =   24.18  max =   26.31  avg =   25.18\n           mobilenet  min =   33.94  max =   35.29  avg =   34.44\n      mobilenet_int8  min =   24.99  max =   26.12  avg =   25.46\n        mobilenet_v2  min =   32.63  max =   34.44  avg =   33.56\n        mobilenet_v3  min =   27.72  max =   30.14  avg =   29.35\n          shufflenet  min =   23.23  max =   26.78  avg =   24.58\n       shufflenet_v2  min =   21.04  max =   22.25  avg =   21.68\n             mnasnet  min =   29.51  max =   31.26  avg =   30.27\n     proxylessnasnet  min =   34.21  max =   37.55  avg =   35.20\n     efficientnet_b0  min =   54.75  max =   60.45  avg =   56.38\n   efficientnetv2_b0  min =   63.60  max =   67.51  avg =   64.81\n        regnety_400m  min =   60.80  max =   72.33  avg =   68.27\n           blazeface  min =    5.96  max =    7.22  avg =    6.41\n           googlenet  min =   80.62  max =   94.46  avg =   86.50\n      googlenet_int8  min =   69.05  max =   75.75  avg =   71.47\n            resnet18  min =   63.90  max =   75.96  avg =   69.64\n       resnet18_int8  min =   46.43  max =   62.23  avg =   53.22\n             alexnet  min =   82.67  max =   90.25  avg =   87.03\n               vgg16  min =  562.23  max =  636.26  avg =  594.82\n          vgg16_int8  min =  303.42  max =  358.03  avg =  325.60\n            resnet50  min =  233.47  max =  279.99  avg =  248.49\n       resnet50_int8  min =  170.11  max =  198.27  avg =  183.35\n      squeezenet_ssd  min =   86.97  max =  112.21  avg =   96.84\n squeezenet_ssd_int8  min =   66.09  max =   77.00  avg =   70.57\n       mobilenet_ssd  min =   76.95  max =  101.74  avg =   87.73\n  mobilenet_ssd_int8  min =   53.27  max =   60.50  avg =   57.46\n      mobilenet_yolo  min =  206.42  max =  260.06  avg =  227.84\n  mobilenetv2_yolov3  min =  129.32  max =  147.76  avg =  138.90\n         yolov4-tiny  min =  
184.85  max =  213.03  avg =  203.52\n           nanodet_m  min =   47.66  max =   60.55  avg =   53.00\n\nangler:/data/local/tmp # ./benchncnn 4 4 2 -1 1\nloop_count = 4\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   41.39  max =   47.64  avg =   43.08\n     squeezenet_int8  min =   36.92  max =   37.59  avg =   37.24\n           mobilenet  min =   59.04  max =   59.43  avg =   59.22\n      mobilenet_int8  min =   44.67  max =   46.60  avg =   45.58\n        mobilenet_v2  min =   43.38  max =   43.71  avg =   43.62\n        mobilenet_v3  min =   37.57  max =   37.82  avg =   37.65\n          shufflenet  min =   30.67  max =   30.86  avg =   30.76\n       shufflenet_v2  min =   27.80  max =   28.12  avg =   27.97\n             mnasnet  min =   42.99  max =   46.41  avg =   44.21\n     proxylessnasnet  min =   51.26  max =   53.52  avg =   52.04\n     efficientnet_b0  min =   81.58  max =   82.30  avg =   82.03\n   efficientnetv2_b0  min =   94.01  max =   94.48  avg =   94.27\n        regnety_400m  min =   82.38  max =   83.86  avg =   82.95\n           blazeface  min =   10.02  max =   10.42  avg =   10.18\n           googlenet  min =  125.47  max =  126.72  avg =  125.92\n      googlenet_int8  min =  109.92  max =  111.65  avg =  110.44\n            resnet18  min =  110.14  max =  111.95  avg =  110.76\n       resnet18_int8  min =   78.21  max =   79.65  avg =   79.07\n             alexnet  min =   78.09  max =   80.34  avg =   78.87\n               vgg16  min =  486.69  max =  494.97  avg =  490.35\n          vgg16_int8  min =  370.66  max =  377.64  avg =  373.78\n            resnet50  min =  272.31  max =  278.64  avg =  274.10\n       resnet50_int8  min =  215.57  max =  218.55  avg =  217.27\n      squeezenet_ssd  min =  112.98  max =  114.75  avg =  113.60\n squeezenet_ssd_int8  min =   91.85  max =   94.82  avg =   93.13\n       mobilenet_ssd  min =  115.18  max =  116.56  avg =  115.95\n  
mobilenet_ssd_int8  min =   90.95  max =   92.21  avg =   91.39\n      mobilenet_yolo  min =  255.07  max =  259.01  avg =  256.18\n  mobilenetv2_yolov3  min =  155.52  max =  156.58  avg =  156.09\n         yolov4-tiny  min =  231.89  max =  234.14  avg =  232.97\n           nanodet_m  min =   72.74  max =   74.71  avg =   73.52\n    yolo-fastest-1.1  min =   35.25  max =   36.51  avg =   35.77\n      yolo-fastestv2  min =   29.94  max =   31.09  avg =   30.75\n\nangler:/data/local/tmp # ./benchncnn 4 1 2 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   70.83  max =   72.68  avg =   71.77\n     squeezenet_int8  min =   59.27  max =   59.60  avg =   59.51\n           mobilenet  min =  110.70  max =  112.72  avg =  111.48\n      mobilenet_int8  min =   79.69  max =   80.01  avg =   79.81\n        mobilenet_v2  min =   77.85  max =   78.19  avg =   78.03\n        mobilenet_v3  min =   63.49  max =   63.92  avg =   63.73\n          shufflenet  min =   41.43  max =   41.60  avg =   41.49\n       shufflenet_v2  min =   37.49  max =   38.26  avg =   37.97\n             mnasnet  min =   73.91  max =   75.91  avg =   74.59\n     proxylessnasnet  min =   94.13  max =   94.53  avg =   94.37\n     efficientnet_b0  min =  161.91  max =  162.38  avg =  162.10\n   efficientnetv2_b0  min =  179.33  max =  180.26  avg =  179.67\n        regnety_400m  min =  100.35  max =  100.76  avg =  100.53\n           blazeface  min =   12.57  max =   12.76  avg =   12.66\n           googlenet  min =  232.77  max =  233.08  avg =  232.91\n      googlenet_int8  min =  203.39  max =  205.25  avg =  204.77\n            resnet18  min =  182.58  max =  183.17  avg =  182.91\n       resnet18_int8  min =  150.40  max =  152.07  avg =  151.35\n             alexnet  min =  147.27  max =  149.00  avg =  148.06\n               vgg16  min =  986.93  max =  988.35  avg =  987.47\n          vgg16_int8  min =  816.37  max =  819.93  avg =  
817.79\n            resnet50  min =  502.77  max =  510.88  avg =  508.53\n       resnet50_int8  min =  393.33  max =  398.07  avg =  395.86\n      squeezenet_ssd  min =  175.01  max =  175.61  avg =  175.32\n squeezenet_ssd_int8  min =  145.19  max =  145.94  avg =  145.66\n       mobilenet_ssd  min =  231.04  max =  231.25  avg =  231.13\n  mobilenet_ssd_int8  min =  159.81  max =  160.52  avg =  160.13\n      mobilenet_yolo  min =  517.86  max =  523.71  avg =  521.85\n  mobilenetv2_yolov3  min =  275.84  max =  279.16  avg =  277.13\n         yolov4-tiny  min =  363.71  max =  366.14  avg =  364.56\n           nanodet_m  min =   93.90  max =   95.09  avg =   94.40\n    yolo-fastest-1.1  min =   45.94  max =   46.09  avg =   46.01\n      yolo-fastestv2  min =   38.23  max =   38.33  avg =   38.29\n\nangler:/data/local/tmp $ ./benchncnn 4 1 2 0 1\n[0 Adreno (TM) 430]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 430]  buglssc=0  bugsbn1=1  buglbia=0  bugihfa=0\n[0 Adreno (TM) 430]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   39.49  max =   41.93  avg =   40.62\n           mobilenet  min =   60.30  max =   61.81  avg =   60.88\n        mobilenet_v2  min =   45.38  max =   47.10  avg =   45.88\n        mobilenet_v3  min =   45.97  max =   47.39  avg =   46.69\n          shufflenet  min =   29.12  max =   31.02  avg =   29.91\n       shufflenet_v2  min =   47.58  max =   50.06  avg =   48.26\n             mnasnet  min =   47.84  max =   49.17  avg =   48.26\n     proxylessnasnet  min =   49.51  max =   51.03  avg =   49.97\n     efficientnet_b0  min =  100.56  max =  105.60  avg =  102.45\n        regnety_400m  min =   59.67  max =   61.24  avg =   60.56\n           blazeface  min =   13.87  max =   13.98  avg =   13.93\n           googlenet  min =  131.26  max =  136.33  avg =  133.40\n            resnet18  min =  116.38  max =  117.92  avg =  
116.93\n             alexnet  min =   72.59  max =   73.94  avg =   73.29\n               vgg16  min = 1090.07  max = 1101.71  avg = 1096.34\n            resnet50  min =  299.76  max =  300.78  avg =  300.40\n      squeezenet_ssd  min =  181.95  max =  182.83  avg =  182.39\n       mobilenet_ssd  min =  148.44  max =  151.07  avg =  149.75\n      mobilenet_yolo  min =  284.46  max =  285.74  avg =  285.39\n  mobilenetv2_yolov3  min =  140.28  max =  148.62  avg =  144.83\n```\n\n### Qualcomm MSM8916 Snapdragon 410 (Cortex-A53 1.2GHz x 4)\n```\nHM2014812:/data/local/tmp # ./benchncnn 8 4 0 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   65.45  max =   73.59  avg =   68.10\n     squeezenet_int8  min =   59.39  max =   65.54  avg =   61.14\n           mobilenet  min =   86.69  max =   94.10  avg =   90.03\n      mobilenet_int8  min =   62.22  max =   69.67  avg =   64.13\n        mobilenet_v2  min =   77.98  max =   89.53  avg =   82.00\n        mobilenet_v3  min =   62.17  max =   68.31  avg =   63.90\n          shufflenet  min =   47.52  max =   53.76  avg =   49.92\n       shufflenet_v2  min =   39.77  max =   46.08  avg =   40.66\n             mnasnet  min =   69.27  max =   75.73  avg =   71.73\n     proxylessnasnet  min =   78.72  max =   85.37  avg =   81.33\n     efficientnet_b0  min =  126.62  max =  136.67  avg =  130.69\n   efficientnetv2_b0  min =  143.24  max =  150.97  avg =  146.89\n        regnety_400m  min =  108.79  max =  116.22  avg =  112.99\n           blazeface  min =   14.85  max =   15.02  avg =   14.94\n           googlenet  min =  180.91  max =  190.37  avg =  186.36\n      googlenet_int8  min =  160.07  max =  170.86  avg =  165.05\n            resnet18  min =  137.91  max =  155.37  avg =  144.99\n       resnet18_int8  min =  104.34  max =  110.20  avg =  106.76\n             alexnet  min =  105.30  max =  114.73  avg =  109.53\n               vgg16  min =  829.16  max 
=  942.94  avg =  853.28\n          vgg16_int8  min =  515.61  max =  547.32  avg =  526.50\n            resnet50  min =  380.46  max =  443.90  avg =  393.71\n       resnet50_int8  min =  318.06  max =  327.13  avg =  323.23\n      squeezenet_ssd  min =  178.22  max =  189.02  avg =  184.51\n squeezenet_ssd_int8  min =  153.75  max =  163.44  avg =  158.05\n       mobilenet_ssd  min =  189.45  max =  195.17  avg =  193.10\n  mobilenet_ssd_int8  min =  132.59  max =  139.63  avg =  137.23\n      mobilenet_yolo  min =  404.52  max =  414.20  avg =  409.97\n  mobilenetv2_yolov3  min =  271.33  max =  279.98  avg =  275.08\n         yolov4-tiny  min =  349.36  max =  372.54  avg =  357.98\n           nanodet_m  min =  103.01  max =  111.71  avg =  105.82\n\nHM2014812:/data/local/tmp # ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  147.48  max =  149.35  avg =  148.40\n     squeezenet_int8  min =  143.20  max =  144.55  avg =  143.98\n           mobilenet  min =  243.78  max =  244.33  avg =  244.08\n      mobilenet_int8  min =  206.23  max =  207.13  avg =  206.55\n        mobilenet_v2  min =  168.04  max =  170.37  avg =  169.06\n        mobilenet_v3  min =  147.10  max =  147.91  avg =  147.55\n          shufflenet  min =   88.47  max =   89.31  avg =   88.85\n       shufflenet_v2  min =   84.47  max =   84.80  avg =   84.60\n             mnasnet  min =  162.81  max =  163.93  avg =  163.22\n     proxylessnasnet  min =  208.18  max =  209.15  avg =  208.61\n     efficientnet_b0  min =  370.06  max =  371.14  avg =  370.64\n   efficientnetv2_b0  min =  418.28  max =  429.68  avg =  423.01\n        regnety_400m  min =  216.42  max =  217.19  avg =  216.71\n           blazeface  min =   27.63  max =   28.67  avg =   28.00\n           googlenet  min =  525.25  max =  528.83  avg =  526.23\n      googlenet_int8  min =  469.78  max =  472.51  avg =  470.76\n            resnet18  min = 
 396.46  max =  399.66  avg =  397.57\n       resnet18_int8  min =  324.07  max =  326.64  avg =  325.34\n             alexnet  min =  362.44  max =  363.02  avg =  362.68\n               vgg16  min = 2174.86  max = 2252.92  avg = 2215.62\n          vgg16_int8  min = 1726.07  max = 1732.69  avg = 1729.18\n            resnet50  min = 1136.96  max = 1142.94  avg = 1139.91\n       resnet50_int8  min =  977.73  max =  983.64  avg =  980.71\n      squeezenet_ssd  min =  350.46  max =  353.35  avg =  351.37\n squeezenet_ssd_int8  min =  333.91  max =  336.59  avg =  334.77\n       mobilenet_ssd  min =  513.18  max =  519.05  avg =  516.22\n  mobilenet_ssd_int8  min =  424.37  max =  426.89  avg =  426.03\n      mobilenet_yolo  min = 1143.20  max = 1145.04  avg = 1144.31\n  mobilenetv2_yolov3  min =  617.45  max =  619.30  avg =  618.37\n         yolov4-tiny  min =  839.32  max =  847.57  avg =  844.61\n           nanodet_m  min =  208.41  max =  211.31  avg =  210.03\n```\n\n### Qualcomm Snapdragon 888 (Cortex-X1 2.84GHz x1 + Cortex-A78 2.4GHz x3 + Cortex-A55 1.8GHz x4 + Adreno 660)\n```\nvenus:/data/local/tmp $ ./benchncnn 8 8 2 -1 1\nloop_count = 8\nnum_threads = 8\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    5.89  max =    6.04  avg =    5.98\n     squeezenet_int8  min =    6.09  max =    6.29  avg =    6.25\n           mobilenet  min =    9.27  max =   10.22  avg =    9.64\n      mobilenet_int8  min =    5.90  max =    6.05  avg =    5.97\n        mobilenet_v2  min =    6.87  max =    8.42  avg =    7.63\n        mobilenet_v3  min =    8.93  max =   12.22  avg =    9.55\n          shufflenet  min =    8.72  max =   11.44  avg =    9.20\n       shufflenet_v2  min =    6.05  max =    8.24  avg =    7.40\n             mnasnet  min =    7.83  max =    9.03  avg =    8.53\n     proxylessnasnet  min =    7.03  max =    9.62  avg =    7.88\n     efficientnet_b0  min =   12.62  max =   18.01  avg =   15.51\n   efficientnetv2_b0  min =   
14.96  max =   23.75  avg =   19.61\n        regnety_400m  min =   23.58  max =   23.87  avg =   23.72\n           blazeface  min =    4.62  max =    4.87  avg =    4.73\n           googlenet  min =   17.23  max =   25.41  avg =   19.83\n      googlenet_int8  min =   16.91  max =   17.05  avg =   16.99\n            resnet18  min =   12.05  max =   14.90  avg =   13.47\n       resnet18_int8  min =   15.10  max =   15.42  avg =   15.27\n             alexnet  min =   13.85  max =   15.73  avg =   14.50\n               vgg16  min =   56.85  max =   57.88  avg =   57.32\n          vgg16_int8  min =   70.12  max =   72.99  avg =   71.53\n            resnet50  min =   29.45  max =   29.78  avg =   29.64\n       resnet50_int8  min =   24.99  max =   25.31  avg =   25.16\n      squeezenet_ssd  min =   17.51  max =   22.63  avg =   19.25\n squeezenet_ssd_int8  min =   16.81  max =   17.26  avg =   16.98\n       mobilenet_ssd  min =   15.96  max =   16.52  avg =   16.11\n  mobilenet_ssd_int8  min =   13.70  max =   14.26  avg =   13.95\n      mobilenet_yolo  min =   50.48  max =   52.88  avg =   51.76\n  mobilenetv2_yolov3  min =   22.63  max =   22.99  avg =   22.85\n         yolov4-tiny  min =   29.01  max =   38.20  avg =   32.50\n           nanodet_m  min =   12.58  max =   15.53  avg =   13.86\n    yolo-fastest-1.1  min =    8.57  max =    9.18  avg =    8.86\n      yolo-fastestv2  min =    6.85  max =    8.47  avg =    8.05\n  vision_transformer  min =  548.48  max =  703.29  avg =  614.47\n          FastestDet  min =    7.71  max =    9.31  avg =    8.15\n\nvenus:/data/local/tmp $ ./benchncnn 8 8 2 0 1\n[0 Adreno (TM) 660]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 660]  bugsbn1=1  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Adreno (TM) 660]  fp16-p/s/u/a=1/1/0/1  int8-p/s/u/a=1/0/0/1\n[0 Adreno (TM) 660]  subgroup=64  basic/vote/ballot/shuffle=1/1/1/1\n[0 Adreno (TM) 660]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 
8\nnum_threads = 8\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   10.63  max =   12.41  avg =   11.80\n     squeezenet_int8  min =    6.93  max =    8.82  avg =    7.86\n           mobilenet  min =   12.79  max =   14.12  avg =   13.48\n      mobilenet_int8  min =    9.18  max =    9.70  avg =    9.44\n        mobilenet_v2  min =   14.73  max =   15.62  avg =   15.13\n        mobilenet_v3  min =   14.68  max =   16.72  avg =   15.70\n          shufflenet  min =   11.28  max =   12.75  avg =   12.17\n       shufflenet_v2  min =   11.44  max =   14.27  avg =   12.07\n             mnasnet  min =   14.54  max =   15.94  avg =   15.35\n     proxylessnasnet  min =   16.33  max =   17.31  avg =   16.71\n     efficientnet_b0  min =   22.64  max =   25.42  avg =   24.35\n   efficientnetv2_b0  min =   41.16  max =   52.08  avg =   45.61\n        regnety_400m  min =   17.56  max =   18.08  avg =   17.85\n           blazeface  min =    2.87  max =    3.89  avg =    3.34\n           googlenet  min =   31.64  max =   33.38  avg =   32.14\n      googlenet_int8  min =   18.29  max =   19.15  avg =   18.73\n            resnet18  min =   23.47  max =   24.60  avg =   23.85\n       resnet18_int8  min =   11.89  max =   17.17  avg =   14.54\n             alexnet  min =   25.62  max =   26.23  avg =   25.98\n               vgg16  min =   41.81  max =   42.69  avg =   42.12\n          vgg16_int8  min =   79.43  max =  123.88  avg =   93.17\n            resnet50  min =   41.28  max =   43.27  avg =   41.79\n       resnet50_int8  min =   25.55  max =   26.34  avg =   25.97\n      squeezenet_ssd  min =   30.10  max =   33.64  avg =   31.39\n squeezenet_ssd_int8  min =   18.12  max =   18.58  avg =   18.30\n       mobilenet_ssd  min =   28.29  max =   28.90  avg =   28.66\n  mobilenet_ssd_int8  min =   13.90  max =   14.31  avg =   14.02\n      mobilenet_yolo  min =   43.88  max =   45.43  avg =   44.58\n  mobilenetv2_yolov3  min =   16.49  max =   37.05  
avg =   19.32\n         yolov4-tiny  min =   22.70  max =   50.58  avg =   34.92\n           nanodet_m  min =   19.31  max =   19.88  avg =   19.57\n    yolo-fastest-1.1  min =   11.17  max =   11.33  avg =   11.26\n      yolo-fastestv2  min =    9.72  max =   10.04  avg =    9.85\n  vision_transformer  min =  744.98  max =  758.15  avg =  751.62\n          FastestDet  min =   11.95  max =   13.12  avg =   12.46\n```\n\n### Qualcomm Snapdragon X Elite (X1E78100), Oryon 3.4GHz x 12 + Adreno X1-85\n\nTest on Oryon CPU\n\n```\nloop_count = 10\nnum_threads = 12\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    5.13  max =    5.19  avg =    5.16\n     squeezenet_int8  min =    4.31  max =    4.81  avg =    4.67\n           mobilenet  min =    3.73  max =    3.85  avg =    3.78\n      mobilenet_int8  min =    2.51  max =    3.11  avg =    2.64\n        mobilenet_v2  min =    3.55  max =    3.70  avg =    3.60\n        mobilenet_v3  min =    3.28  max =    3.88  avg =    3.40\n          shufflenet  min =    3.77  max =    5.07  avg =    4.02\n       shufflenet_v2  min =    3.24  max =    3.34  avg =    3.29\n             mnasnet  min =    3.49  max =    4.09  avg =    3.58\n     proxylessnasnet  min =    4.30  max =    4.93  avg =    4.41\n     efficientnet_b0  min =    4.97  max =   17.26  avg =    6.28\n   efficientnetv2_b0  min =    6.85  max =   10.19  avg =    7.39\n        regnety_400m  min =   11.26  max =   11.36  avg =   11.31\n           blazeface  min =    1.43  max =    1.48  avg =    1.44\n           googlenet  min =    9.84  max =    9.96  avg =    9.89\n      googlenet_int8  min =    8.04  max =    8.33  avg =    8.13\n            resnet18  min =    6.63  max =    9.34  avg =    6.94\n       resnet18_int8  min =    5.47  max =    6.24  avg =    5.59\n             alexnet  min =    7.52  max =    7.61  avg =    7.54\n               vgg16  min =   29.66  max =   32.27  avg =   30.07\n          vgg16_int8  min =   32.97  max =   
34.43  avg =   33.32\n            resnet50  min =   16.54  max =   16.68  avg =   16.63\n       resnet50_int8  min =   11.12  max =   13.84  avg =   11.42\n      squeezenet_ssd  min =    9.20  max =    9.77  avg =    9.39\n squeezenet_ssd_int8  min =    8.50  max =    9.17  avg =    8.73\n       mobilenet_ssd  min =    8.28  max =    8.67  avg =    8.36\n  mobilenet_ssd_int8  min =    5.59  max =    6.25  avg =    5.74\n      mobilenet_yolo  min =   21.42  max =   22.77  avg =   21.65\n  mobilenetv2_yolov3  min =   14.03  max =   14.34  avg =   14.13\n         yolov4-tiny  min =   23.60  max =   23.84  avg =   23.70\n           nanodet_m  min =    6.64  max =    7.40  avg =    6.77\n    yolo-fastest-1.1  min =    4.14  max =    7.15  avg =    4.53\n      yolo-fastestv2  min =    3.63  max =    3.70  avg =    3.66\n  vision_transformer  min =  384.74  max =  415.74  avg =  391.28\n          FastestDet  min =    4.29  max =    4.94  avg =    4.40\n```\n\nTest on X1-85 GPU\n\n```\n[0 Adreno X1-85]  queueC=0[1]  queueT=0[1]\n[0 Adreno X1-85]  fp16-p/s/u/a=1/1/0/1  int8-p/s/u/a=1/0/0/1  bf16-p/s=1/0\n[0 Adreno X1-85]  subgroup=128(64~128)  ops=1/1/1/1/1/1/1/1/1/1\n[0 Adreno X1-85]  fp16-cm=0  int8-cm=0  bf16-cm=0  fp8-cm=0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    3.23  max =    3.99  avg =    3.63\n           mobilenet  min =    3.33  max =    5.86  avg =    5.20\n        mobilenet_v2  min =    4.06  max =    4.77  avg =    4.52\n        mobilenet_v3  min =    4.61  max =    8.12  avg =    6.60\n          shufflenet  min =    3.16  max =    7.45  avg =    4.65\n       shufflenet_v2  min =    3.90  max =    6.00  avg =    5.02\n             mnasnet  min =    4.44  max =    5.12  avg =    4.81\n     proxylessnasnet  min =    4.91  max =    7.02  avg =    6.15\n     efficientnet_b0  min =    6.61  max =    7.25  avg =    7.04\n   efficientnetv2_b0  min =   21.48  max =   56.52  avg =   39.03\n        
regnety_400m  min =    7.33  max =    7.60  avg =    7.44\n           blazeface  min =    2.83  max =    4.59  avg =    4.30\n           googlenet  min =   11.00  max =   12.98  avg =   12.60\n            resnet18  min =   12.11  max =   14.59  avg =   13.27\n             alexnet  min =   11.64  max =   12.18  avg =   11.96\n               vgg16  min =   40.06  max =   45.62  avg =   42.88\n            resnet50  min =   18.99  max =   21.93  avg =   20.88\n      squeezenet_ssd  min =   10.95  max =   14.73  avg =   13.03\n       mobilenet_ssd  min =    7.92  max =    9.75  avg =    9.46\n      mobilenet_yolo  min =    9.02  max =   12.54  avg =   11.38\n  mobilenetv2_yolov3  min =   12.70  max =   14.70  avg =   13.95\n         yolov4-tiny  min =   25.88  max =   30.26  avg =   28.12\n           nanodet_m  min =    9.38  max =   33.46  avg =   20.29\n    yolo-fastest-1.1  min =    6.08  max =    6.75  avg =    6.43\n      yolo-fastestv2  min =    4.50  max =    6.47  avg =    6.04\n  vision_transformer  min =  184.89  max =  191.78  avg =  189.07\n          FastestDet  min =    6.01  max =    7.83  avg =    6.43\n```\n\n### Raspberry Pi 3 Model B+ Broadcom BCM2837B0, Cortex-A53 (ARMv8) (1.4GHz x 4)\n```\npi@raspberrypi:~/ncnn/build/benchmark $ ./benchncnn 4 4 0 -1 1\nloop_count = 4\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   84.74  max =   85.60  avg =   85.22\n     squeezenet_int8  min =   74.48  max =   74.80  avg =   74.68\n           mobilenet  min =  107.84  max =  110.13  avg =  108.66\n      mobilenet_int8  min =   66.91  max =   67.12  avg =   67.03\n        mobilenet_v2  min =  110.64  max =  112.73  avg =  111.68\n        mobilenet_v3  min =   85.78  max =   86.74  avg =   86.44\n          shufflenet  min =   58.38  max =   60.32  avg =   59.33\n       shufflenet_v2  min =   46.76  max =   47.53  avg =   47.19\n             mnasnet  min =   95.53  max =   95.88  avg =   95.78\n     proxylessnasnet  min 
=  102.24  max =  105.58  avg =  103.38\n     efficientnet_b0  min =  134.87  max =  136.98  avg =  135.86\n   efficientnetv2_b0  min =  146.62  max =  148.06  avg =  147.13\n        regnety_400m  min =  118.60  max =  119.51  avg =  119.03\n           blazeface  min =   15.42  max =   15.61  avg =   15.52\n           googlenet  min =  223.78  max =  224.85  avg =  224.22\n      googlenet_int8  min =  188.23  max =  190.15  avg =  189.21\n            resnet18  min =  270.86  max =  272.66  avg =  271.93\n       resnet18_int8  min =  159.57  max =  160.39  avg =  160.07\n             alexnet  min =  157.79  max =  160.77  avg =  159.09\n            resnet50  min =  583.57  max =  591.41  avg =  587.42\n       resnet50_int8  min =  383.96  max =  401.37  avg =  391.87\n      squeezenet_ssd  min =  247.90  max =  249.77  avg =  248.98\n squeezenet_ssd_int8  min =  191.65  max =  192.81  avg =  192.17\n       mobilenet_ssd  min =  240.11  max =  241.02  avg =  240.62\n  mobilenet_ssd_int8  min =  136.30  max =  137.26  avg =  136.73\n      mobilenet_yolo  min =  523.59  max =  539.91  avg =  529.98\n  mobilenetv2_yolov3  min =  356.44  max =  366.85  avg =  362.06\n         yolov4-tiny  min =  410.25  max =  422.18  avg =  417.17\n           nanodet_m  min =  114.98  max =  115.83  avg =  115.40\n    yolo-fastest-1.1  min =   79.85  max =   80.83  avg =   80.28\n      yolo-fastestv2  min =   62.36  max =   62.91  avg =   62.60\n          FastestDet  min =   67.11  max =   68.51  avg =   67.98\n\npi@raspberrypi:~/ncnn/build/benchmark $ ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  125.34  max =  125.81  avg =  125.58\n     squeezenet_int8  min =  135.56  max =  136.34  avg =  135.98\n           mobilenet  min =  204.62  max =  207.06  avg =  205.65\n      mobilenet_int8  min =  181.34  max =  182.46  avg =  181.91\n        mobilenet_v2  min =  158.69  max =  158.94  avg =  158.80\n 
       mobilenet_v3  min =  127.13  max =  127.31  avg =  127.23\n          shufflenet  min =   84.64  max =   85.29  avg =   84.89\n       shufflenet_v2  min =   74.28  max =   74.64  avg =   74.44\n             mnasnet  min =  148.12  max =  148.65  avg =  148.42\n     proxylessnasnet  min =  199.56  max =  201.99  avg =  200.42\n     efficientnet_b0  min =  240.94  max =  241.75  avg =  241.27\n   efficientnetv2_b0  min =  270.71  max =  270.90  avg =  270.83\n        regnety_400m  min =  186.89  max =  187.08  avg =  187.01\n           blazeface  min =   22.75  max =   23.24  avg =   22.95\n           googlenet  min =  450.64  max =  450.96  avg =  450.79\n      googlenet_int8  min =  424.66  max =  426.83  avg =  425.78\n            resnet18  min =  379.21  max =  380.01  avg =  379.57\n       resnet18_int8  min =  312.23  max =  313.21  avg =  312.68\n             alexnet  min =  270.13  max =  270.88  avg =  270.55\n            resnet50  min =  977.51  max =  981.89  avg =  979.75\n       resnet50_int8  min =  890.77  max =  896.89  avg =  893.83\n      squeezenet_ssd  min =  331.52  max =  333.47  avg =  332.46\n squeezenet_ssd_int8  min =  317.71  max =  319.64  avg =  318.62\n       mobilenet_ssd  min =  425.42  max =  426.52  avg =  425.93\n  mobilenet_ssd_int8  min =  370.17  max =  370.90  avg =  370.66\n      mobilenet_yolo  min =  930.40  max =  932.24  avg =  931.46\n  mobilenetv2_yolov3  min =  534.79  max =  543.56  avg =  539.20\n         yolov4-tiny  min =  675.33  max =  676.83  avg =  676.14\n           nanodet_m  min =  178.13  max =  178.98  avg =  178.64\n    yolo-fastest-1.1  min =  100.83  max =  101.96  avg =  101.49\n      yolo-fastestv2  min =   79.73  max =   79.94  avg =   79.84\n          FastestDet  min =   89.09  max =   90.07  avg =   89.78\n```\n\n### Raspberry Pi 4 Model B Broadcom BCM2711B0, Cortex-A72 (ARMv8) (1.8GHz x 4)\n```\npi@raspberrypi:~/ncnn/build/benchmark $ ./benchncnn 10 4 0 -1 1\nloop_count = 10\nnum_threads = 
4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   46.28  max =   46.91  avg =   46.65\n     squeezenet_int8  min =   42.18  max =   44.98  avg =   42.59\n           mobilenet  min =   60.74  max =   61.79  avg =   61.17\n      mobilenet_int8  min =   34.19  max =   34.55  avg =   34.37\n        mobilenet_v2  min =   61.63  max =   62.02  avg =   61.88\n        mobilenet_v3  min =   47.08  max =   48.40  avg =   47.53\n          shufflenet  min =   32.91  max =   33.30  avg =   33.09\n       shufflenet_v2  min =   24.37  max =   24.73  avg =   24.56\n             mnasnet  min =   51.80  max =   52.14  avg =   51.98\n     proxylessnasnet  min =   53.02  max =   53.58  avg =   53.32\n     efficientnet_b0  min =   73.92  max =   74.44  avg =   74.19\n   efficientnetv2_b0  min =   79.10  max =   79.60  avg =   79.34\n        regnety_400m  min =   65.27  max =   66.12  avg =   65.70\n           blazeface  min =    8.62  max =    8.75  avg =    8.69\n           googlenet  min =  113.74  max =  115.14  avg =  114.35\n      googlenet_int8  min =  100.87  max =  101.71  avg =  101.25\n            resnet18  min =  122.27  max =  125.39  avg =  123.12\n       resnet18_int8  min =   82.19  max =   94.12  avg =   83.92\n             alexnet  min =   75.75  max =   78.08  avg =   76.40\n               vgg16  min =  541.66  max =  552.56  avg =  547.09\n          vgg16_int8  min =  391.44  max =  395.73  avg =  394.23\n            resnet50  min =  261.90  max =  263.91  avg =  262.83\n       resnet50_int8  min =  195.60  max =  198.08  avg =  196.65\n      squeezenet_ssd  min =  127.01  max =  129.85  avg =  127.61\n squeezenet_ssd_int8  min =  104.98  max =  107.67  avg =  105.47\n       mobilenet_ssd  min =  120.43  max =  123.28  avg =  121.46\n  mobilenet_ssd_int8  min =   70.70  max =   72.85  avg =   71.14\n      mobilenet_yolo  min =  270.89  max =  273.42  avg =  272.33\n  mobilenetv2_yolov3  min =  183.85  max =  185.73  avg =  184.88\n    
     yolov4-tiny  min =  205.95  max =  209.90  avg =  207.22\n           nanodet_m  min =   68.08  max =   68.69  avg =   68.38\n    yolo-fastest-1.1  min =   47.97  max =   48.20  avg =   48.06\n      yolo-fastestv2  min =   37.17  max =   37.69  avg =   37.47\n  vision_transformer  min = 1872.31  max = 1964.95  avg = 1909.21\n          FastestDet  min =   38.39  max =   39.17  avg =   38.69\n\npi@raspberrypi:~/ncnn/build/benchmark $ ./benchncnn 10 1 0 -1 1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   73.35  max =   75.10  avg =   73.96\n     squeezenet_int8  min =   69.17  max =   69.66  avg =   69.42\n           mobilenet  min =  123.76  max =  125.35  avg =  124.32\n      mobilenet_int8  min =   84.66  max =   85.24  avg =   84.82\n        mobilenet_v2  min =   92.98  max =   94.05  avg =   93.48\n        mobilenet_v3  min =   72.48  max =   73.14  avg =   72.81\n          shufflenet  min =   47.17  max =   47.83  avg =   47.51\n       shufflenet_v2  min =   41.62  max =   42.60  avg =   42.12\n             mnasnet  min =   83.60  max =   84.35  avg =   83.98\n     proxylessnasnet  min =   98.48  max =   99.33  avg =   98.78\n     efficientnet_b0  min =  129.45  max =  130.02  avg =  129.73\n   efficientnetv2_b0  min =  155.06  max =  156.70  avg =  155.76\n        regnety_400m  min =  105.39  max =  106.03  avg =  105.70\n           blazeface  min =   12.54  max =   12.84  avg =   12.65\n           googlenet  min =  235.38  max =  236.34  avg =  235.94\n      googlenet_int8  min =  209.63  max =  210.39  avg =  210.00\n            resnet18  min =  190.80  max =  191.43  avg =  191.10\n       resnet18_int8  min =  157.92  max =  158.97  avg =  158.50\n             alexnet  min =  139.34  max =  139.44  avg =  139.40\n               vgg16  min = 1066.58  max = 1079.30  avg = 1071.85\n          vgg16_int8  min =  866.15  max =  873.75  avg =  869.84\n            resnet50  min =  533.15  max =  
535.12  avg =  534.11\n       resnet50_int8  min =  423.72  max =  424.24  avg =  423.96\n      squeezenet_ssd  min =  178.90  max =  179.53  avg =  179.30\n squeezenet_ssd_int8  min =  157.05  max =  159.06  avg =  157.89\n       mobilenet_ssd  min =  250.71  max =  251.26  avg =  251.00\n  mobilenet_ssd_int8  min =  170.21  max =  170.96  avg =  170.56\n      mobilenet_yolo  min =  557.48  max =  560.08  avg =  558.80\n  mobilenetv2_yolov3  min =  301.60  max =  307.98  avg =  306.52\n         yolov4-tiny  min =  370.55  max =  375.69  avg =  372.99\n           nanodet_m  min =  103.05  max =  103.74  avg =  103.45\n    yolo-fastest-1.1  min =   56.58  max =   57.44  avg =   57.01\n      yolo-fastestv2  min =   46.69  max =   47.34  avg =   47.03\n  vision_transformer  min = 6605.19  max = 6606.66  avg = 6605.73\n          FastestDet  min =   52.11  max =   52.97  avg =   52.61\n```\n### Raspberry Pi 5 Broadcom BCM2712, Cortex-A76 (ARMv8) (2.4GHz x 4)\n```\npi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 10 4 0 -1 -1 >> text.out\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    6.74  max =    8.16  avg =    7.38\n     squeezenet_int8  min =    6.97  max =    7.67  avg =    7.21\n           mobilenet  min =    9.00  max =   72.98  avg =   33.88\n      mobilenet_int8  min =    8.68  max =    8.80  avg =    8.74\n        mobilenet_v2  min =   10.46  max =   10.63  avg =   10.52\n        mobilenet_v3  min =    7.30  max =    7.44  avg =    7.35\n          shufflenet  min =    4.14  max =    4.18  avg =    4.16\n       shufflenet_v2  min =    3.37  max =    3.41  avg =    3.39\n             mnasnet  min =    6.83  max =    8.55  avg =    7.10\n     proxylessnasnet  min =    7.85  max =    7.97  avg =    7.88\n     efficientnet_b0  min =   12.28  max =   12.37  avg =   12.33\n   efficientnetv2_b0  min =   13.54  max =   13.84  avg =   13.69\n        regnety_400m  min =   10.93  max =   11.07  avg =   
10.99\n           blazeface  min =    1.45  max =    1.48  avg =    1.47\n           googlenet  min =   25.13  max =   25.47  avg =   25.35\n      googlenet_int8  min =   24.00  max =   24.23  avg =   24.12\n            resnet18  min =   19.84  max =   20.19  avg =   19.96\n       resnet18_int8  min =   16.68  max =   16.83  avg =   16.74\n             alexnet  min =   21.21  max =   21.54  avg =   21.36\n               vgg16  min =  127.75  max =  134.00  avg =  129.24\n          vgg16_int8  min =  106.39  max =  110.66  avg =  107.01\n            resnet50  min =   45.94  max =   46.54  avg =   46.21\n       resnet50_int8  min =   40.16  max =   42.58  avg =   40.75\n      squeezenet_ssd  min =   30.10  max =   30.95  avg =   30.37\n squeezenet_ssd_int8  min =   27.71  max =   29.03  avg =   28.15\n       mobilenet_ssd  min =   24.16  max =   24.89  avg =   24.52\n  mobilenet_ssd_int8  min =   21.79  max =   22.37  avg =   22.05\n      mobilenet_yolo  min =   58.06  max =   58.45  avg =   58.19\n  mobilenetv2_yolov3  min =   37.49  max =   37.94  avg =   37.68\n         yolov4-tiny  min =   44.45  max =   60.58  avg =   46.29\n           nanodet_m  min =   11.01  max =   11.28  avg =   11.18\n    yolo-fastest-1.1  min =    5.53  max =    5.97  avg =    5.62\n      yolo-fastestv2  min =    4.76  max =    4.84  avg =    4.80\n  vision_transformer  min =  600.65  max =  622.47  avg =  611.65\n          FastestDet  min =    4.83  max =    6.94  avg =    5.34\n\n\npi@raspberrypi:~/ncnn/benchmark $ ./benchncnn 10 1 0 -1 -1 \nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   11.77  max =   12.18  avg =   11.87\n     squeezenet_int8  min =   11.67  max =   11.98  avg =   11.82\n           mobilenet  min =   20.24  max =   20.59  avg =   20.30\n      mobilenet_int8  min =   14.38  max =   14.51  avg =   14.44\n        mobilenet_v2  min =   16.21  max =   16.49  avg =   16.38\n        mobilenet_v3  min =   
11.64  max =   12.12  avg =   11.80\n          shufflenet  min =    7.17  max =    7.24  avg =    7.20\n       shufflenet_v2  min =    7.07  max =    7.21  avg =    7.14\n             mnasnet  min =   12.93  max =   13.03  avg =   12.99\n     proxylessnasnet  min =   15.72  max =   15.80  avg =   15.74\n     efficientnet_b0  min =   24.12  max =   24.53  avg =   24.20\n   efficientnetv2_b0  min =   27.59  max =   28.04  avg =   27.75\n        regnety_400m  min =   16.41  max =   16.66  avg =   16.49\n           blazeface  min =    2.98  max =    3.04  avg =    3.02\n           googlenet  min =   48.62  max =   48.87  avg =   48.71\n      googlenet_int8  min =   49.07  max =   49.26  avg =   49.15\n            resnet18  min =   29.54  max =   30.17  avg =   29.68\n       resnet18_int8  min =   36.30  max =   36.55  avg =   36.42\n             alexnet  min =   35.24  max =   35.86  avg =   35.62\n               vgg16  min =  188.84  max =  190.87  avg =  189.63\n          vgg16_int8  min =  272.27  max =  274.15  avg =  273.10\n            resnet50  min =   89.04  max =   89.87  avg =   89.43\n       resnet50_int8  min =   80.00  max =   80.50  avg =   80.16\n      squeezenet_ssd  min =   38.02  max =   38.69  avg =   38.29\n squeezenet_ssd_int8  min =   40.58  max =   41.17  avg =   40.94\n       mobilenet_ssd  min =   45.42  max =   47.08  avg =   45.90\n  mobilenet_ssd_int8  min =   36.05  max =   37.02  avg =   36.35\n      mobilenet_yolo  min =  104.82  max =  106.56  avg =  105.69\n  mobilenetv2_yolov3  min =   60.11  max =   60.29  avg =   60.19\n         yolov4-tiny  min =   67.61  max =   69.05  avg =   68.02\n           nanodet_m  min =   19.63  max =   19.81  avg =   19.69\n    yolo-fastest-1.1  min =    8.10  max =    8.14  avg =    8.12\n      yolo-fastestv2  min =    7.21  max =    7.26  avg =    7.24\n  vision_transformer  min = 1249.08  max = 1253.32  avg = 1250.30\n          FastestDet  min =    7.33  max =    7.44  avg =    7.38\n```\n### Raspberry 
Pi 5 Broadcom BCM2712, VideoCore VII Graphics (Vulkan 1.2)\n```\nfan@raspberrypi:~/ncnn/benchmark $ ../build/benchmark/benchncnn 10 $(nproc) 0 0\n[0 V3D 7.1.7]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 V3D 7.1.7]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 V3D 7.1.7]  fp16-p/s/a=1/1/0  int8-p/s/a=1/1/0\n[0 V3D 7.1.7]  subgroup=16  basic/vote/ballot/shuffle=1/0/0/0\n[0 V3D 7.1.7]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  subgroup=4  basic/vote/ballot/shuffle=1/1/1/1\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =  120.75  max =  121.31  avg =  120.94\n     squeezenet_int8  min =    9.57  max =   24.49  avg =   11.23\n           mobilenet  min =  160.32  max =  160.75  avg =  160.53\n      mobilenet_int8  min =   11.29  max =   11.47  avg =   11.37\n        mobilenet_v2  min =  121.05  max =  121.93  avg =  121.46\n        mobilenet_v3  min =  117.90  max =  119.20  avg =  118.48\n          shufflenet  min =   70.82  max =   71.55  avg =   71.04\n       shufflenet_v2  min =   97.74  max =   98.58  avg =   98.00\n             mnasnet  min =  118.21  max =  118.76  avg =  118.44\n     proxylessnasnet  min =  124.28  max =  124.92  avg =  124.52\n     efficientnet_b0  min =  187.48  max =  188.38  avg =  187.93\n   efficientnetv2_b0  min =  270.11  max =  280.80  avg =  272.26\n        regnety_400m  min =  142.14  max =  143.25  avg =  142.66\n           blazeface  min =   31.97  max =   32.41  avg =   32.17\n           googlenet  min =  346.30  max =  347.47  avg =  346.81\n      googlenet_int8  min =   30.77  max =   32.26  avg =   31.52\n            
resnet18  min =  346.96  max =  347.50  avg =  347.26\n       resnet18_int8  min =   19.95  max =   20.95  avg =   20.48\n             alexnet  min =  181.57  max =  182.03  avg =  181.75\n               vgg16  min = 1776.00  max = 1776.66  avg = 1776.40\n          vgg16_int8  min =  134.10  max =  141.76  avg =  136.32\n            resnet50  min =  841.90  max =  842.50  avg =  842.16\n       resnet50_int8  min =   54.29  max =   55.22  avg =   54.54\n      squeezenet_ssd  min =  461.71  max =  468.09  avg =  466.97\n squeezenet_ssd_int8  min =   38.05  max =   39.00  avg =   38.58\n       mobilenet_ssd  min =  379.50  max =  381.66  avg =  380.14\n  mobilenet_ssd_int8  min =   29.91  max =   30.77  avg =   30.13\n      mobilenet_yolo  min =  753.61  max =  755.06  avg =  753.97\n  mobilenetv2_yolov3  min =  382.18  max =  389.90  avg =  386.97\n         yolov4-tiny  min =  673.87  max =  674.71  avg =  674.07\n           nanodet_m  min =  206.55  max =  210.48  avg =  209.69\n    yolo-fastest-1.1  min =  109.98  max =  111.18  avg =  110.45\n      yolo-fastestv2  min =   86.07  max =   87.16  avg =   86.51\n  vision_transformer  min = 20594.51  max = 20601.53  avg = 20596.59\n          FastestDet  min =   90.25  max =   91.00  avg =   90.64\n```\n\n### Raspberry Pi 5 Broadcom BCM2712 overclocked to 2.9GHz, VideoCore VII Graphics overclocked to 1.1GHz (Vulkan 1.2)\n```\npi@raspberrypi:~/ncnn/build/benchmark $ echo \"arm_freq=2900\" | sudo tee -a /boot/firmware/config.txt\npi@raspberrypi:~/ncnn/build/benchmark $ echo \"gpu_freq=1100\" | sudo tee -a /boot/firmware/config.txt\npi@raspberrypi:~/ncnn/build/benchmark $ sudo reboot\n\npi@raspberrypi:~/ncnn/build/benchmark $ ./benchncnn 10 4 0 0\n[0 V3D 7.1.7]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 V3D 7.1.7]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 V3D 7.1.7]  fp16-p/s/u/a=1/1/1/0  int8-p/s/u/a=1/1/1/0\n[0 V3D 7.1.7]  subgroup=16  basic/vote/ballot/shuffle=1/0/0/0\n[0 V3D 7.1.7]  
fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  subgroup=4  basic/vote/ballot/shuffle=1/1/1/1\n[1 llvmpipe (LLVM 15.0.6, 128 bits)]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =  106.98  max =  107.05  avg =  107.02\n     squeezenet_int8  min =    8.51  max =    8.83  avg =    8.65\n           mobilenet  min =  147.66  max =  147.71  avg =  147.68\n      mobilenet_int8  min =   10.21  max =   10.54  avg =   10.37\n        mobilenet_v2  min =  110.11  max =  110.23  avg =  110.18\n        mobilenet_v3  min =  101.84  max =  102.03  avg =  101.92\n          shufflenet  min =   59.77  max =   59.84  avg =   59.80\n       shufflenet_v2  min =   81.46  max =   81.60  avg =   81.51\n             mnasnet  min =  105.88  max =  105.98  avg =  105.94\n     proxylessnasnet  min =  108.82  max =  108.89  avg =  108.86\n     efficientnet_b0  min =  168.79  max =  168.93  avg =  168.87\n   efficientnetv2_b0  min =  232.52  max =  232.80  avg =  232.65\n        regnety_400m  min =  130.33  max =  130.49  avg =  130.36\n           blazeface  min =   22.23  max =   22.49  avg =   22.39\n           googlenet  min =  299.25  max =  299.37  avg =  299.31\n      googlenet_int8  min =   29.21  max =   29.97  avg =   29.58\n            resnet18  min =  304.47  max =  304.64  avg =  304.58\n       resnet18_int8  min =   19.31  max =   20.77  avg =   20.24\n             alexnet  min =  203.68  max =  203.79  avg =  203.76\n               vgg16  min = 1571.91  max = 1572.22  avg = 1572.06\n          vgg16_int8  min =  128.46  max =  130.89  avg =  129.96\n            resnet50  min =  754.16  max =  754.33  avg 
=  754.26\n       resnet50_int8  min =   52.65  max =   53.48  avg =   53.09\n      squeezenet_ssd  min =  398.22  max =  398.36  avg =  398.28\n squeezenet_ssd_int8  min =   34.26  max =   34.67  avg =   34.51\n       mobilenet_ssd  min =  344.81  max =  344.99  avg =  344.89\n  mobilenet_ssd_int8  min =   27.59  max =   28.01  avg =   27.77\n      mobilenet_yolo  min =  712.53  max =  712.63  avg =  712.59\n  mobilenetv2_yolov3  min =  362.81  max =  363.11  avg =  362.90\n         yolov4-tiny  min =  589.30  max =  589.51  avg =  589.39\n           nanodet_m  min =  178.83  max =  178.97  avg =  178.88\n    yolo-fastest-1.1  min =   92.36  max =   92.58  avg =   92.45\n      yolo-fastestv2  min =   70.68  max =   70.84  avg =   70.74\n  vision_transformer  min = 18615.94  max = 18648.17  avg = 18633.77\n          FastestDet  min =   74.59  max =   74.68  avg =   74.63\n\npi@raspberrypi:~/ncnn/build/benchmark $ ./benchncnn 10 4 0 -1\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    7.61  max =    7.76  avg =    7.70\n     squeezenet_int8  min =    7.97  max =    8.68  avg =    8.23\n           mobilenet  min =    9.65  max =    9.91  avg =    9.80\n      mobilenet_int8  min =   10.60  max =   36.93  avg =   13.29\n        mobilenet_v2  min =   12.25  max =   12.64  avg =   12.40\n        mobilenet_v3  min =    8.14  max =    8.26  avg =    8.20\n          shufflenet  min =    3.72  max =    3.82  avg =    3.77\n       shufflenet_v2  min =    2.99  max =    3.10  avg =    3.05\n             mnasnet  min =    7.27  max =    7.46  avg =    7.37\n     proxylessnasnet  min =    8.39  max =    8.55  avg =    8.48\n     efficientnet_b0  min =   13.15  max =   13.59  avg =   13.39\n   efficientnetv2_b0  min =   14.79  max =   15.30  avg =   14.91\n        regnety_400m  min =    9.49  max =    9.71  avg =    9.57\n           blazeface  min =    1.41  max =    1.46  avg =    1.43\n           googlenet  min = 
  28.60  max =   28.87  avg =   28.73\n      googlenet_int8  min =   27.09  max =   27.77  avg =   27.47\n            resnet18  min =   21.47  max =   21.88  avg =   21.65\n       resnet18_int8  min =   20.07  max =   20.30  avg =   20.24\n             alexnet  min =   22.75  max =   23.47  avg =   23.05\n               vgg16  min =  154.32  max =  158.51  avg =  157.40\n          vgg16_int8  min =  127.78  max =  162.60  avg =  133.21\n            resnet50  min =   49.36  max =   49.86  avg =   49.63\n       resnet50_int8  min =   46.44  max =   46.89  avg =   46.74\n      squeezenet_ssd  min =   37.31  max =   74.95  avg =   41.30\n squeezenet_ssd_int8  min =   32.62  max =   33.63  avg =   33.09\n       mobilenet_ssd  min =   27.40  max =   27.99  avg =   27.68\n  mobilenet_ssd_int8  min =   26.70  max =   27.71  avg =   27.23\n      mobilenet_yolo  min =   60.25  max =   61.10  avg =   60.67\n  mobilenetv2_yolov3  min =   43.51  max =   44.29  avg =   43.87\n         yolov4-tiny  min =   51.63  max =   52.64  avg =   52.24\n           nanodet_m  min =   11.89  max =   12.06  avg =   11.97\n    yolo-fastest-1.1  min =    5.63  max =    5.78  avg =    5.69\n      yolo-fastestv2  min =    5.34  max =    5.48  avg =    5.40\n  vision_transformer  min =  481.78  max =  506.72  avg =  493.05\n          FastestDet  min =    4.91  max =    5.14  avg =    5.01\n```\n### Raspberry Pi Zero 2 W Broadcom BCM2710A1, Cortex-A53 (ARMv8) (1.0GHz x 4)\n\n```\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  119.52  max =  120.29  avg =  119.93\n     squeezenet_int8  min =   96.32  max =   96.96  avg =   96.55\n           mobilenet  min =  162.60  max =  165.49  avg =  163.19\n      mobilenet_int8  min =   90.78  max =   91.39  avg =   91.03\n        mobilenet_v2  min =  145.71  max =  148.83  avg =  147.39\n        mobilenet_v3  min =  113.89  max =  151.95  avg =  119.04\n          shufflenet  min =   72.72  max =   
73.27  avg =   72.96\n       shufflenet_v2  min =   63.64  max =   64.50  avg =   64.13\n             mnasnet  min =  126.07  max =  126.93  avg =  126.53\n     proxylessnasnet  min =  139.90  max =  140.84  avg =  140.35\n     efficientnet_b0  min =  201.88  max =  202.55  avg =  202.14\n   efficientnetv2_b0  min =  227.22  max =  228.84  avg =  228.09\n        regnety_400m  min =  156.49  max =  157.47  avg =  156.96\n           blazeface  min =   22.79  max =   23.28  avg =   23.10\n           googlenet  min =  323.74  max =  324.90  avg =  324.45\n      googlenet_int8  min =  250.86  max =  252.82  avg =  251.63\n            resnet18  min =  351.37  max =  355.67  avg =  353.45\n       resnet18_int8  min =  194.83  max =  196.68  avg =  195.51\n             alexnet  min =  271.18  max =  273.53  avg =  272.18\n            resnet50  min =  777.44  max =  797.47  avg =  782.63\n       resnet50_int8  min =  496.78  max =  498.86  avg =  497.57\n      squeezenet_ssd  min =  376.10  max =  382.41  avg =  379.13\n squeezenet_ssd_int8  min =  255.99  max =  257.57  avg =  256.78\n       mobilenet_ssd  min =  338.64  max =  339.93  avg =  339.50\n  mobilenet_ssd_int8  min =  190.24  max =  190.68  avg =  190.48\n      mobilenet_yolo  min =  746.83  max =  748.14  avg =  747.53\n  mobilenetv2_yolov3  min =  487.99  max =  491.18  avg =  489.37\n         yolov4-tiny  min =  644.73  max =  652.24  avg =  646.64\n           nanodet_m  min =  165.27  max =  167.12  avg =  166.27\n    yolo-fastest-1.1  min =   98.74  max =  100.02  avg =   99.17\n      yolo-fastestv2  min =   80.52  max =   81.86  avg =   81.29\n\nloop_count = 8\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  240.53  max =  241.07  avg =  240.77\n     squeezenet_int8  min =  212.63  max =  213.23  avg =  212.94\n           mobilenet  min =  393.79  max =  394.04  avg =  393.94\n      mobilenet_int8  min =  286.58  max =  286.95  avg =  286.75\n        
mobilenet_v2  min =  273.97  max =  274.51  avg =  274.23\n        mobilenet_v3  min =  233.77  max =  234.59  avg =  234.20\n          shufflenet  min =  133.05  max =  133.36  avg =  133.23\n       shufflenet_v2  min =  128.86  max =  129.47  avg =  129.18\n             mnasnet  min =  265.70  max =  266.17  avg =  265.93\n     proxylessnasnet  min =  329.78  max =  330.54  avg =  330.13\n     efficientnet_b0  min =  518.42  max =  519.38  avg =  519.00\n   efficientnetv2_b0  min =  594.37  max =  595.17  avg =  594.74\n        regnety_400m  min =  329.53  max =  330.44  avg =  329.87\n           blazeface  min =   42.24  max =   45.56  avg =   43.96\n           googlenet  min =  780.05  max =  780.63  avg =  780.39\n      googlenet_int8  min =  663.83  max =  664.43  avg =  664.15\n            resnet18  min =  653.62  max =  657.59  avg =  654.69\n       resnet18_int8  min =  479.03  max =  479.72  avg =  479.40\n             alexnet  min =  687.99  max =  690.34  avg =  689.15\n            resnet50  min = 1800.97  max = 1806.11  avg = 1802.79\n       resnet50_int8  min = 1311.68  max = 1314.56  avg = 1313.15\n      squeezenet_ssd  min =  563.63  max =  565.57  avg =  564.44\n squeezenet_ssd_int8  min =  481.24  max =  483.97  avg =  482.20\n       mobilenet_ssd  min =  799.21  max =  829.10  avg =  803.56\n  mobilenet_ssd_int8  min =  568.11  max =  568.88  avg =  568.42\n      mobilenet_yolo  min = 1815.60  max = 1816.44  avg = 1815.93\n  mobilenetv2_yolov3  min =  951.34  max =  952.15  avg =  951.72\n         yolov4-tiny  min = 1258.21  max = 1259.49  avg = 1258.66\n           nanodet_m  min =  301.04  max =  304.09  avg =  301.70\n    yolo-fastest-1.1  min =  155.04  max =  155.98  avg =  155.53\n      yolo-fastestv2  min =  126.77  max =  127.40  avg =  127.05\n```\n\n### Banana Pi M2 Zero 2 AllWinner H2+, Cortex-A7 (ARMv7-A) (1.2GHz x 4)\n\n```\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  
230.97  max =  232.18  avg =  231.49\n     squeezenet_int8  min =  171.12  max =  172.87  avg =  171.68\n           mobilenet  min =  327.65  max =  340.92  avg =  329.88\n      mobilenet_int8  min =  166.58  max =  169.55  avg =  167.47\n        mobilenet_v2  min =  276.81  max =  278.67  avg =  277.55\n        mobilenet_v3  min =  220.74  max =  225.14  avg =  222.08\n          shufflenet  min =  147.97  max =  157.68  avg =  149.40\n       shufflenet_v2  min =  146.56  max =  154.90  avg =  148.25\n             mnasnet  min =  243.06  max =  244.47  avg =  243.80\n     proxylessnasnet  min =  260.38  max =  261.47  avg =  260.66\n     efficientnet_b0  min =  368.98  max =  371.03  avg =  369.96\n   efficientnetv2_b0  min =  433.96  max =  459.25  avg =  437.52\n        regnety_400m  min =  307.53  max =  312.29  avg =  308.68\n           blazeface  min =   46.54  max =   47.35  avg =   46.98\n           googlenet  min =  647.86  max =  669.20  avg =  651.19\n      googlenet_int8  min =  439.90  max =  442.35  avg =  441.38\n            resnet18  min =  642.53  max =  856.58  avg =  698.28\n       resnet18_int8  min =  352.10  max =  354.51  avg =  353.44\n             alexnet  min =  593.16  max =  624.20  avg =  598.66\n            resnet50  min = 1556.12  max = 1782.22  avg = 1606.86\n       resnet50_int8  min =  911.63  max =  999.42  avg =  924.37\n      squeezenet_ssd  min =  653.85  max =  658.07  avg =  655.19\n squeezenet_ssd_int8  min =  456.26  max =  467.76  avg =  459.87\n       mobilenet_ssd  min =  671.93  max =  682.64  avg =  674.88\n  mobilenet_ssd_int8  min =  347.18  max =  349.07  avg =  347.81\n      mobilenet_yolo  min = 1471.16  max = 1492.65  avg = 1479.30\n  mobilenetv2_yolov3  min =  895.90  max =  906.60  avg =  899.74\n         yolov4-tiny  min = 1178.53  max = 1205.79  avg = 1183.98\n           nanodet_m  min =  358.89  max =  366.07  avg =  362.20\n    yolo-fastest-1.1  min =  189.93  max =  192.18  avg =  190.91\n      
yolo-fastestv2  min =  158.60  max =  161.33  avg =  159.43\n\nloop_count = 8\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  602.97  max =  604.97  avg =  603.46\n     squeezenet_int8  min =  431.18  max =  432.42  avg =  431.77\n           mobilenet  min =  971.52  max =  986.64  avg =  974.04\n      mobilenet_int8  min =  556.74  max =  556.98  avg =  556.84\n        mobilenet_v2  min =  682.85  max =  684.17  avg =  683.34\n        mobilenet_v3  min =  585.10  max =  585.76  avg =  585.57\n          shufflenet  min =  340.64  max =  342.63  avg =  341.26\n       shufflenet_v2  min =  322.41  max =  324.13  avg =  323.35\n             mnasnet  min =  644.30  max =  645.93  avg =  644.71\n     proxylessnasnet  min =  732.50  max =  733.30  avg =  732.96\n     efficientnet_b0  min = 1084.70  max = 1094.98  avg = 1086.52\n   efficientnetv2_b0  min = 1282.27  max = 1283.67  avg = 1282.60\n        regnety_400m  min =  764.60  max =  768.54  avg =  765.30\n           blazeface  min =  100.48  max =  106.28  avg =  103.33\n           googlenet  min = 1878.69  max = 1883.96  avg = 1880.76\n      googlenet_int8  min = 1274.31  max = 1296.02  avg = 1279.59\n            resnet18  min = 1837.91  max = 1843.95  avg = 1839.17\n       resnet18_int8  min = 1011.98  max = 1014.43  avg = 1013.01\n             alexnet  min = 1997.59  max = 2001.81  avg = 1999.42\n            resnet50  min = 4844.31  max = 4857.05  avg = 4847.80\n       resnet50_int8  min = 2792.59  max = 2810.08  avg = 2797.30\n      squeezenet_ssd  min = 1438.96  max = 1443.31  avg = 1441.09\n squeezenet_ssd_int8  min = 1046.76  max = 1053.00  avg = 1049.22\n       mobilenet_ssd  min = 2018.66  max = 2023.70  avg = 2019.67\n  mobilenet_ssd_int8  min = 1129.16  max = 1130.62  avg = 1129.82\n      mobilenet_yolo  min = 4724.90  max = 4728.57  avg = 4726.41\n  mobilenetv2_yolov3  min = 2410.67  max = 2427.95  avg = 2413.89\n         yolov4-tiny  min = 3177.27  max = 
3185.52  avg = 3179.71\n           nanodet_m  min =  761.38  max =  768.79  avg =  766.53\n    yolo-fastest-1.1  min =  391.82  max =  393.32  avg =  392.39\n      yolo-fastestv2  min =  316.93  max =  319.86  avg =  318.33\n```\n\n### Radxa Orion O6 (Big Cortex-A720 2.6GHz x 4 + Medium Cortex-A720 x 4 + Little Cortex-A520 x 4 + Arm Immortalis-G720 MC10 GPU 1.1GHz)\n\n```\nradxa@orion-o6:~/ncnn/build/benchmark$ ./benchncnn 4 1 2 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    8.52  max =    8.53  avg =    8.53\n     squeezenet_int8  min =    6.49  max =    6.50  avg =    6.50\n           mobilenet  min =   15.56  max =   15.61  avg =   15.58\n      mobilenet_int8  min =    8.68  max =    8.70  avg =    8.69\n        mobilenet_v2  min =    9.67  max =    9.68  avg =    9.67\n        mobilenet_v3  min =    8.05  max =    8.07  avg =    8.06\n          shufflenet  min =    5.30  max =    5.32  avg =    5.31\n       shufflenet_v2  min =    5.55  max =    5.57  avg =    5.56\n             mnasnet  min =    9.23  max =    9.26  avg =    9.25\n     proxylessnasnet  min =   11.58  max =   11.58  avg =   11.58\n     efficientnet_b0  min =   18.67  max =   18.68  avg =   18.67\n   efficientnetv2_b0  min =   21.55  max =   21.59  avg =   21.57\n        regnety_400m  min =   13.02  max =   13.07  avg =   13.05\n           blazeface  min =    2.04  max =    2.06  avg =    2.05\n           googlenet  min =   35.36  max =   35.49  avg =   35.40\n      googlenet_int8  min =   27.86  max =   27.97  avg =   27.91\n            resnet18  min =   21.68  max =   21.74  avg =   21.70\n       resnet18_int8  min =   19.07  max =   19.12  avg =   19.09\n             alexnet  min =   23.94  max =   24.06  avg =   24.02\n               vgg16  min =  123.48  max =  124.36  avg =  123.87\n          vgg16_int8  min =  139.53  max =  139.72  avg =  139.64\n            resnet50  min =   68.07  max =   68.09  avg =   68.08\n      
 resnet50_int8  min =   39.99  max =   40.07  avg =   40.03\n      squeezenet_ssd  min =   20.35  max =   20.43  avg =   20.38\n squeezenet_ssd_int8  min =   18.62  max =   18.69  avg =   18.67\n       mobilenet_ssd  min =   31.40  max =   31.56  avg =   31.48\n  mobilenet_ssd_int8  min =   17.44  max =   17.54  avg =   17.49\n      mobilenet_yolo  min =   70.84  max =   70.94  avg =   70.88\n  mobilenetv2_yolov3  min =   35.24  max =   35.30  avg =   35.28\n         yolov4-tiny  min =   42.96  max =   43.02  avg =   42.99\n           nanodet_m  min =   13.05  max =   13.11  avg =   13.08\n    yolo-fastest-1.1  min =    5.21  max =    5.22  avg =    5.22\n      yolo-fastestv2  min =    4.48  max =    4.50  avg =    4.49\n  vision_transformer  min = 1001.70  max = 1002.06  avg = 1001.90\n          FastestDet  min =    4.65  max =    4.67  avg =    4.66\nradxa@orion-o6:~/ncnn/build/benchmark$ ./benchncnn 4 12 2 -1 1\nloop_count = 4\nnum_threads = 12\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   38.01  max =   40.45  avg =   39.00\n     squeezenet_int8  min =   45.53  max =   45.73  avg =   45.60\n           mobilenet  min =   33.35  max =   37.73  avg =   35.96\n      mobilenet_int8  min =   33.87  max =   34.05  avg =   33.93\n        mobilenet_v2  min =   57.97  max =   61.42  avg =   59.74\n        mobilenet_v3  min =   65.47  max =   65.76  avg =   65.65\n          shufflenet  min =  110.95  max =  111.29  avg =  111.12\n       shufflenet_v2  min =   63.97  max =   64.20  avg =   64.08\n             mnasnet  min =   56.06  max =   56.44  avg =   56.23\n     proxylessnasnet  min =   63.84  max =   64.36  avg =   64.10\n     efficientnet_b0  min =   94.52  max =   94.79  avg =   94.65\n   efficientnetv2_b0  min =  154.39  max =  158.08  avg =  156.57\n        regnety_400m  min =  454.18  max =  457.25  avg =  455.08\n           blazeface  min =   44.79  max =   45.03  avg =   44.92\n           googlenet  min =   91.22  max =   
93.72  avg =   92.01\n      googlenet_int8  min =  115.45  max =  118.36  avg =  116.69\n            resnet18  min =   42.81  max =   50.61  avg =   45.62\n       resnet18_int8  min =   45.26  max =   47.70  avg =   46.52\n             alexnet  min =   25.74  max =   28.83  avg =   26.66\n               vgg16  min =   61.15  max =   64.72  avg =   63.09\n          vgg16_int8  min =   67.75  max =   73.18  avg =   69.38\n            resnet50  min =   90.29  max =  100.58  avg =   96.62\n       resnet50_int8  min =   92.35  max =   97.42  avg =   94.64\n      squeezenet_ssd  min =  105.26  max =  111.83  avg =  107.89\n squeezenet_ssd_int8  min =  117.49  max =  121.57  avg =  118.91\n       mobilenet_ssd  min =   89.79  max =   95.18  avg =   92.15\n  mobilenet_ssd_int8  min =   97.02  max =  103.84  avg =   99.86\n      mobilenet_yolo  min =  603.04  max =  606.87  avg =  605.03\n  mobilenetv2_yolov3  min =   75.32  max =   80.43  avg =   76.83\n         yolov4-tiny  min =   51.46  max =   60.43  avg =   56.32\n           nanodet_m  min =  104.05  max =  109.94  avg =  107.06\n    yolo-fastest-1.1  min =   90.31  max =   90.50  avg =   90.41\n      yolo-fastestv2  min =   94.72  max =   96.62  avg =   95.52\n  vision_transformer  min =  323.38  max =  333.42  avg =  329.50\n          FastestDet  min =   80.86  max =   83.37  avg =   81.84\nradxa@orion-o6:~/ncnn/build/benchmark$ ./benchncnn 4 1 2 0 1\n[0 Mali-G720-Immortalis]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G720-Immortalis]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Mali-G720-Immortalis]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[0 Mali-G720-Immortalis]  subgroup=16  basic/vote/ballot/shuffle=1/1/1/1\n[0 Mali-G720-Immortalis]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   16.33  max =   16.59  avg =   16.45\n     squeezenet_int8  min =    6.36  max =   10.08  avg =    7.32\n          
 mobilenet  min =    3.45  max =   27.79  avg =   14.90\n      mobilenet_int8  min =    8.71  max =    8.76  avg =    8.74\n        mobilenet_v2  min =    4.31  max =    4.43  avg =    4.40\n        mobilenet_v3  min =   19.81  max =   19.86  avg =   19.83\n          shufflenet  min =   14.76  max =   14.83  avg =   14.79\n       shufflenet_v2  min =   15.24  max =   15.33  avg =   15.28\n             mnasnet  min =    3.71  max =   10.64  avg =    5.55\n     proxylessnasnet  min =    4.82  max =    4.95  avg =    4.90\n     efficientnet_b0  min =    6.58  max =    6.62  avg =    6.60\n   efficientnetv2_b0  min =   56.26  max =   57.46  avg =   56.82\n        regnety_400m  min =    5.30  max =   30.08  avg =   17.72\n           blazeface  min =    4.36  max =    4.52  avg =    4.46\n           googlenet  min =    9.03  max =    9.07  avg =    9.05\n      googlenet_int8  min =   27.90  max =   27.94  avg =   27.92\n            resnet18  min =    6.47  max =   28.26  avg =   11.93\n       resnet18_int8  min =   19.79  max =   19.83  avg =   19.81\n             alexnet  min =    7.76  max =    7.81  avg =    7.77\n               vgg16  min =   27.58  max =   27.90  avg =   27.77\n          vgg16_int8  min =  143.28  max =  144.19  avg =  143.68\n            resnet50  min =   14.06  max =   14.22  avg =   14.15\n       resnet50_int8  min =   41.37  max =   41.48  avg =   41.43\n      squeezenet_ssd  min =   11.11  max =   60.31  avg =   47.93\n squeezenet_ssd_int8  min =   19.29  max =   19.39  avg =   19.35\n       mobilenet_ssd  min =    8.78  max =    8.88  avg =    8.82\n  mobilenet_ssd_int8  min =   17.60  max =   17.66  avg =   17.62\n      mobilenet_yolo  min =   13.64  max =   13.91  avg =   13.76\n  mobilenetv2_yolov3  min =   11.97  max =   15.79  avg =   14.01\n         yolov4-tiny  min =   26.72  max =   32.41  avg =   28.27\n           nanodet_m  min =    9.84  max =   13.42  avg =   10.76\n    yolo-fastest-1.1  min =   15.38  max =   15.62  avg =   
15.56\n      yolo-fastestv2  min =   13.56  max =   13.67  avg =   13.61\n  vision_transformer  min =  831.86  max =  835.66  avg =  833.83\n          FastestDet  min =   13.85  max =   13.92  avg =   13.88\n```\n\n### Radxa Zero 3W, Cortex-A55 (ARMv8.2) (1.416 GHz x 4)\n```\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   34.51  max =  106.19  avg =   79.43\n     squeezenet_int8  min =   31.48  max =   49.87  avg =   34.65\n           mobilenet  min =   42.23  max =   45.36  avg =   42.89\n      mobilenet_int8  min =   35.97  max =   53.84  avg =   38.77\n        mobilenet_v2  min =   39.61  max =   40.35  avg =   40.00\n        mobilenet_v3  min =   31.19  max =   31.85  avg =   31.50\n          shufflenet  min =   24.75  max =   27.74  avg =   25.55\n       shufflenet_v2  min =   22.00  max =   22.70  avg =   22.31\n             mnasnet  min =   34.95  max =   53.55  avg =   37.39\n     proxylessnasnet  min =   39.96  max =   44.32  avg =   40.81\n     efficientnet_b0  min =   49.76  max =   67.77  avg =   52.61\n   efficientnetv2_b0  min =   64.00  max =   85.78  avg =   67.06\n        regnety_400m  min =   55.23  max =   73.22  avg =   57.87\n           blazeface  min =    7.80  max =   10.39  avg =    8.27\n           googlenet  min =   98.24  max =  118.27  avg =  101.78\n      googlenet_int8  min =   98.81  max =  115.66  avg =  101.52\n            resnet18  min =   75.33  max =   88.59  avg =   78.19\n       resnet18_int8  min =   76.31  max =   95.17  avg =   79.03\n             alexnet  min =   65.07  max =   73.80  avg =   67.18\n               vgg16  min =  423.20  max =  455.15  avg =  436.32\n          vgg16_int8  min =  591.82  max =  620.22  avg =  607.55\n            resnet50  min =  185.53  max =  207.10  avg =  193.03\n       resnet50_int8  min =  176.84  max =  194.73  avg =  181.81\n      squeezenet_ssd  min =   96.64  max =  118.46  avg =  100.86\n squeezenet_ssd_int8  min =   
96.61  max =  123.48  avg =  104.64\n       mobilenet_ssd  min =   95.38  max =  110.52  avg =   98.61\n  mobilenet_ssd_int8  min =   76.21  max =   95.41  avg =   79.10\n      mobilenet_yolo  min =  210.73  max =  235.47  avg =  221.72\n  mobilenetv2_yolov3  min =  134.59  max =  154.33  avg =  139.54\n         yolov4-tiny  min =  167.79  max =  191.60  avg =  171.25\n           nanodet_m  min =   63.22  max =   80.73  avg =   66.25\n    yolo-fastest-1.1  min =   32.87  max =   88.05  avg =   47.36\n      yolo-fastestv2  min =   26.03  max =   27.01  avg =   26.54\n  vision_transformer  min = 3682.51  max = 3882.79  avg = 3809.42\n          FastestDet  min =   30.69  max =   50.65  avg =   33.65\n```\n\n### Avaota Aim T527, Allwinner T527 (Cortex-A55 2.2GHz x 4 + Cortex-A55 1.8GHz x 4)\n\n```\n./benchncnn 4 4 2 -1 1\nloop_count = 4\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   14.15  max =   14.21  avg =   14.17\n     squeezenet_int8  min =   21.05  max =   21.12  avg =   21.09\n           mobilenet  min =   19.22  max =   19.30  avg =   19.25\n      mobilenet_int8  min =   18.65  max =   19.52  avg =   19.07\n        mobilenet_v2  min =   20.23  max =   21.01  avg =   20.63\n        mobilenet_v3  min =   15.34  max =   15.48  avg =   15.41\n          shufflenet  min =   10.30  max =   10.37  avg =   10.33\n       shufflenet_v2  min =    9.18  max =    9.34  avg =    9.23\n             mnasnet  min =   15.58  max =   15.62  avg =   15.60\n     proxylessnasnet  min =   19.64  max =   19.73  avg =   19.67\n     efficientnet_b0  min =   25.62  max =   25.81  avg =   25.69\n   efficientnetv2_b0  min =   36.95  max =   37.46  avg =   37.17\n        regnety_400m  min =   23.75  max =   24.13  avg =   23.90\n           blazeface  min =    3.37  max =    3.42  avg =    3.40\n           googlenet  min =   57.36  max =   58.32  avg =   57.88\n      googlenet_int8  min =   60.80  max =   62.30  avg =   61.50\n            
resnet18  min =   39.99  max =   40.34  avg =   40.17\n       resnet18_int8  min =   54.18  max =   56.08  avg =   55.16\n             alexnet  min =   41.87  max =   42.21  avg =   42.08\n               vgg16  min =  260.14  max =  260.94  avg =  260.51\n          vgg16_int8  min =  347.42  max =  348.90  avg =  348.30\n            resnet50  min =   90.91  max =   91.26  avg =   91.07\n       resnet50_int8  min =  121.94  max =  122.56  avg =  122.28\n      squeezenet_ssd  min =   57.11  max =   57.57  avg =   57.37\n squeezenet_ssd_int8  min =   74.70  max =   75.18  avg =   74.91\n       mobilenet_ssd  min =   49.60  max =   49.96  avg =   49.71\n  mobilenet_ssd_int8  min =   49.45  max =   49.93  avg =   49.63\n      mobilenet_yolo  min =  114.98  max =  115.37  avg =  115.18\n  mobilenetv2_yolov3  min =   75.74  max =   75.97  avg =   75.87\n         yolov4-tiny  min =   99.09  max =   99.43  avg =   99.25\n           nanodet_m  min =   29.40  max =   29.77  avg =   29.60\n    yolo-fastest-1.1  min =   13.78  max =   13.85  avg =   13.82\n      yolo-fastestv2  min =   12.91  max =   13.10  avg =   12.98\n  vision_transformer  min = 1641.78  max = 1648.71  avg = 1646.65\n          FastestDet  min =   12.24  max =   12.61  avg =   12.42\n\n```\n\n\n### Khadas VIM3, Amlogic A311D (Cortex-A73 2.2GHz x 4 + Cortex-A53 1.8GHz x 2)\n\n```\nvim3:/data/local/tmp # ./benchncnn 8 4 2 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   30.98  max =   31.26  avg =   31.09\n     squeezenet_int8  min =   24.70  max =   24.84  avg =   24.78\n           mobilenet  min =   42.57  max =   43.37  avg =   42.96\n      mobilenet_int8  min =   22.33  max =   22.52  avg =   22.44\n        mobilenet_v2  min =   39.36  max =   39.77  avg =   39.56\n        mobilenet_v3  min =   30.13  max =   30.45  avg =   30.28\n          shufflenet  min =   21.62  max =   21.94  avg =   21.80\n       shufflenet_v2  min =   18.83  max 
=   19.24  avg =   19.05\n             mnasnet  min =   33.54  max =   34.08  avg =   33.80\n     proxylessnasnet  min =   35.81  max =   36.05  avg =   35.95\n     efficientnet_b0  min =   53.82  max =   54.44  avg =   54.21\n   efficientnetv2_b0  min =   62.20  max =   62.60  avg =   62.43\n        regnety_400m  min =   48.82  max =   49.27  avg =   49.05\n           blazeface  min =    6.34  max =    6.51  avg =    6.43\n           googlenet  min =   81.96  max =   82.53  avg =   82.23\n      googlenet_int8  min =   64.42  max =   65.00  avg =   64.77\n            resnet18  min =   77.00  max =   77.83  avg =   77.46\n       resnet18_int8  min =   48.91  max =   49.14  avg =   49.05\n             alexnet  min =   60.43  max =   60.93  avg =   60.69\n               vgg16  min =  414.89  max =  423.00  avg =  418.75\n          vgg16_int8  min =  245.58  max =  246.37  avg =  245.94\n            resnet50  min =  185.53  max =  187.35  avg =  186.18\n       resnet50_int8  min =  123.36  max =  124.75  avg =  124.17\n      squeezenet_ssd  min =   85.87  max =   86.42  avg =   86.23\n squeezenet_ssd_int8  min =   64.90  max =   65.24  avg =   65.08\n       mobilenet_ssd  min =   88.32  max =   90.02  avg =   89.10\n  mobilenet_ssd_int8  min =   46.85  max =   47.18  avg =   46.98\n      mobilenet_yolo  min =  192.33  max =  195.38  avg =  194.10\n  mobilenetv2_yolov3  min =  127.33  max =  128.58  avg =  127.96\n         yolov4-tiny  min =  150.44  max =  152.02  avg =  151.20\n           nanodet_m  min =   54.22  max =   54.61  avg =   54.37\n    yolo-fastest-1.1  min =   28.13  max =   28.76  avg =   28.40\n      yolo-fastestv2  min =   22.10  max =   22.26  avg =   22.19\n\nvim3:/data/local/tmp # ./benchncnn 4 1 2 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   68.25  max =   68.85  avg =   68.67\n     squeezenet_int8  min =   51.92  max =   52.08  avg =   52.01\n           mobilenet  min =  
112.69  max =  113.72  avg =  113.33\n      mobilenet_int8  min =   66.43  max =   66.89  avg =   66.68\n        mobilenet_v2  min =   81.36  max =   81.77  avg =   81.62\n        mobilenet_v3  min =   62.33  max =   63.39  avg =   62.94\n          shufflenet  min =   37.84  max =   38.03  avg =   37.93\n       shufflenet_v2  min =   37.33  max =   38.08  avg =   37.68\n             mnasnet  min =   73.83  max =   74.32  avg =   74.03\n     proxylessnasnet  min =   85.19  max =   86.43  avg =   85.84\n     efficientnet_b0  min =  138.68  max =  139.69  avg =  139.19\n   efficientnetv2_b0  min =  167.53  max =  167.99  avg =  167.75\n        regnety_400m  min =   94.78  max =   95.81  avg =   95.21\n           blazeface  min =   11.22  max =   11.43  avg =   11.28\n           googlenet  min =  229.35  max =  230.91  avg =  229.89\n      googlenet_int8  min =  173.04  max =  173.48  avg =  173.24\n            resnet18  min =  191.54  max =  193.78  avg =  192.49\n       resnet18_int8  min =  132.97  max =  133.51  avg =  133.25\n             alexnet  min =  140.31  max =  141.95  avg =  141.18\n               vgg16  min = 1093.71  max = 1100.95  avg = 1097.64\n          vgg16_int8  min =  734.44  max =  736.16  avg =  735.05\n            resnet50  min =  530.38  max =  533.93  avg =  531.87\n       resnet50_int8  min =  332.88  max =  334.22  avg =  333.71\n      squeezenet_ssd  min =  159.08  max =  160.98  avg =  160.16\n squeezenet_ssd_int8  min =  126.97  max =  127.96  avg =  127.43\n       mobilenet_ssd  min =  238.92  max =  241.14  avg =  239.70\n  mobilenet_ssd_int8  min =  135.57  max =  136.02  avg =  135.78\n      mobilenet_yolo  min =  539.59  max =  543.88  avg =  541.90\n  mobilenetv2_yolov3  min =  281.32  max =  285.05  avg =  283.24\n         yolov4-tiny  min =  381.99  max =  384.93  avg =  383.53\n           nanodet_m  min =   98.32  max =   98.85  avg =   98.60\n    yolo-fastest-1.1  min =   44.59  max =   44.95  avg =   44.80\n      
yolo-fastestv2  min =   36.88  max =   37.11  avg =   36.98\n\nvim3:/data/local/tmp $ ./benchncnn 8 6 2 0 1                               \n[0 Mali-G52]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G52]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=1\n[0 Mali-G52]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/0/0/0\n[0 Mali-G52]  subgroup=8  basic/vote/ballot/shuffle=1/0/0/0\n[0 Mali-G52]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 8\nnum_threads = 6\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   21.29  max =   21.81  avg =   21.56\n     squeezenet_int8  min =   37.59  max =   37.85  avg =   37.70\n           mobilenet  min =   32.08  max =   32.61  avg =   32.42\n      mobilenet_int8  min =   40.12  max =   40.46  avg =   40.28\n        mobilenet_v2  min =   24.55  max =   24.67  avg =   24.62\n        mobilenet_v3  min =   25.35  max =   25.60  avg =   25.47\n          shufflenet  min =   18.78  max =   89.48  avg =   35.41\n       shufflenet_v2  min =   21.15  max =   21.33  avg =   21.22\n             mnasnet  min =   25.08  max =   25.31  avg =   25.21\n     proxylessnasnet  min =   26.97  max =   27.18  avg =   27.05\n     efficientnet_b0  min =   40.70  max =   40.91  avg =   40.81\n   efficientnetv2_b0  min =  189.26  max =  192.84  avg =  191.33\n        regnety_400m  min =   30.88  max =   31.17  avg =   31.03\n           blazeface  min =   24.34  max =   24.52  avg =   24.45\n           googlenet  min =   67.14  max =   67.43  avg =   67.30\n      googlenet_int8  min =   98.06  max =   98.57  avg =   98.35\n            resnet18  min =   61.13  max =   61.63  avg =   61.44\n       resnet18_int8  min =   72.63  max =   73.48  avg =   73.01\n             alexnet  min =   68.88  max =   70.34  avg =   69.71\n               vgg16  min =  347.48  max =  348.48  avg =  347.94\n          vgg16_int8  min =  342.50  max =  357.78  avg =  353.13\n            resnet50  min =  158.90  max =  160.10  avg =  159.76\n       
resnet50_int8  min =  211.35  max =  212.68  avg =  212.11\n      squeezenet_ssd  min =   81.61  max =   82.17  avg =   81.91\n squeezenet_ssd_int8  min =   85.52  max =   85.98  avg =   85.79\n       mobilenet_ssd  min =   73.38  max =   74.41  avg =   74.02\n  mobilenet_ssd_int8  min =   85.13  max =   91.47  avg =   86.13\n      mobilenet_yolo  min =  154.47  max =  155.23  avg =  154.74\n  mobilenetv2_yolov3  min =  100.75  max =  101.96  avg =  101.27\n         yolov4-tiny  min =  140.52  max =  161.68  avg =  153.85\n           nanodet_m  min =   85.27  max =  110.53  avg =   94.81\n    yolo-fastest-1.1  min =   23.56  max =   42.04  avg =   33.10\n      yolo-fastestv2  min =   19.54  max =   21.66  avg =   21.01\n  vision_transformer  min = 6395.34  max = 6418.70  avg = 6410.43\n          FastestDet  min =   21.53  max =   23.21  avg =   22.98\n```\n\n### Rockchip RK3588 (Cortex-A76 2.4GHz x 4 + Cortex-A55 1.8GHz x 4)\n\n```\nrk3588_s:/data/local/tmp # ./benchncnn 8 4 2 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    7.57  max =    7.68  avg =    7.60\n     squeezenet_int8  min =    8.43  max =    8.52  avg =    8.46\n           mobilenet  min =   11.01  max =   11.08  avg =   11.05\n      mobilenet_int8  min =    8.89  max =    8.96  avg =    8.91\n        mobilenet_v2  min =    8.73  max =    8.78  avg =    8.76\n        mobilenet_v3  min =    7.90  max =    7.95  avg =    7.92\n          shufflenet  min =    7.95  max =    8.02  avg =    7.99\n       shufflenet_v2  min =    6.09  max =    6.13  avg =    6.11\n             mnasnet  min =    8.30  max =    8.35  avg =    8.33\n     proxylessnasnet  min =    9.67  max =    9.72  avg =    9.69\n     efficientnet_b0  min =   17.51  max =   17.60  avg =   17.56\n   efficientnetv2_b0  min =   28.10  max =   28.17  avg =   28.14\n        regnety_400m  min =   16.33  max =   16.39  avg =   16.35\n           blazeface  min =    2.81  max =    
2.89  avg =    2.83\n           googlenet  min =   33.33  max =   33.41  avg =   33.37\n      googlenet_int8  min =   33.62  max =   33.87  avg =   33.77\n            resnet18  min =   18.83  max =   18.90  avg =   18.86\n       resnet18_int8  min =   33.92  max =   34.10  avg =   34.00\n             alexnet  min =   29.07  max =   29.11  avg =   29.09\n               vgg16  min =  106.86  max =  107.40  avg =  107.06\n          vgg16_int8  min =  283.66  max =  284.16  avg =  283.94\n            resnet50  min =   53.70  max =   54.21  avg =   53.83\n       resnet50_int8  min =   66.11  max =   66.24  avg =   66.15\n      squeezenet_ssd  min =   34.88  max =   35.04  avg =   34.99\n squeezenet_ssd_int8  min =   43.25  max =   43.62  avg =   43.37\n       mobilenet_ssd  min =   31.32  max =   31.42  avg =   31.37\n  mobilenet_ssd_int8  min =   26.11  max =   26.18  avg =   26.13\n      mobilenet_yolo  min =   58.89  max =   59.02  avg =   58.95\n  mobilenetv2_yolov3  min =   37.53  max =   37.64  avg =   37.58\n         yolov4-tiny  min =   52.95  max =   53.31  avg =   53.03\n           nanodet_m  min =   16.06  max =   16.14  avg =   16.10\n    yolo-fastest-1.1  min =    8.42  max =    8.47  avg =    8.45\n      yolo-fastestv2  min =    7.81  max =    7.88  avg =    7.84\n\nrk3588_s:/data/local/tmp # ./benchncnn 8 1 2 -1 1\nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   25.04  max =   25.14  avg =   25.07\n     squeezenet_int8  min =   26.29  max =   26.38  avg =   26.33\n           mobilenet  min =   41.17  max =   41.23  avg =   41.19\n      mobilenet_int8  min =   32.51  max =   32.57  avg =   32.54\n        mobilenet_v2  min =   27.27  max =   27.31  avg =   27.29\n        mobilenet_v3  min =   22.49  max =   22.54  avg =   22.51\n          shufflenet  min =   18.15  max =   18.22  avg =   18.18\n       shufflenet_v2  min =   15.82  max =   15.86  avg =   15.85\n             mnasnet  min =   
26.45  max =   26.50  avg =   26.47\n     proxylessnasnet  min =   31.60  max =   31.66  avg =   31.62\n     efficientnet_b0  min =   55.53  max =   55.68  avg =   55.62\n   efficientnetv2_b0  min =   96.84  max =   96.92  avg =   96.89\n        regnety_400m  min =   33.66  max =   33.70  avg =   33.68\n           blazeface  min =    8.80  max =    8.84  avg =    8.83\n           googlenet  min =  116.89  max =  117.06  avg =  116.97\n      googlenet_int8  min =  107.92  max =  108.03  avg =  107.98\n            resnet18  min =   60.97  max =   61.18  avg =   61.05\n       resnet18_int8  min =  118.95  max =  119.04  avg =  119.00\n             alexnet  min =   93.49  max =   93.59  avg =   93.55\n               vgg16  min =  333.81  max =  334.52  avg =  334.07\n          vgg16_int8  min =  947.19  max =  947.55  avg =  947.35\n            resnet50  min =  186.95  max =  187.42  avg =  187.15\n       resnet50_int8  min =  225.72  max =  225.86  avg =  225.75\n      squeezenet_ssd  min =   93.29  max =   93.66  avg =   93.47\n squeezenet_ssd_int8  min =  120.22  max =  120.95  avg =  120.49\n       mobilenet_ssd  min =  105.84  max =  105.90  avg =  105.87\n  mobilenet_ssd_int8  min =   85.95  max =   86.04  avg =   86.01\n      mobilenet_yolo  min =  194.22  max =  194.64  avg =  194.41\n  mobilenetv2_yolov3  min =  103.63  max =  103.72  avg =  103.69\n         yolov4-tiny  min =  136.59  max =  137.14  avg =  136.91\n           nanodet_m  min =   41.40  max =   41.49  avg =   41.43\n    yolo-fastest-1.1  min =   18.73  max =   18.80  avg =   18.77\n      yolo-fastestv2  min =   18.25  max =   18.31  avg =   18.28\n\nrk3588_s:/data/local/tmp # ./benchncnn 8 4 1 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   25.54  max =   25.99  avg =   25.71\n     squeezenet_int8  min =   30.88  max =   31.16  avg =   31.01\n           mobilenet  min =   36.24  max =   62.95  avg =   39.89\n      
mobilenet_int8  min =   31.90  max =   32.37  avg =   32.06\n        mobilenet_v2  min =   27.49  max =   27.82  avg =   27.64\n        mobilenet_v3  min =   26.30  max =   26.69  avg =   26.45\n          shufflenet  min =   25.49  max =   25.72  avg =   25.60\n       shufflenet_v2  min =   21.59  max =   22.67  avg =   21.78\n             mnasnet  min =   27.92  max =   28.10  avg =   28.00\n     proxylessnasnet  min =   34.18  max =   34.42  avg =   34.28\n     efficientnet_b0  min =   57.37  max =   57.60  avg =   57.45\n   efficientnetv2_b0  min =   83.50  max =   84.03  avg =   83.66\n        regnety_400m  min =   50.83  max =   51.27  avg =   50.98\n           blazeface  min =   14.07  max =   14.29  avg =   14.17\n           googlenet  min =  100.60  max =  101.00  avg =  100.87\n      googlenet_int8  min =  106.58  max =  107.14  avg =  106.71\n            resnet18  min =   58.60  max =   59.62  avg =   59.00\n       resnet18_int8  min =   84.90  max =   85.15  avg =   84.99\n             alexnet  min =   86.06  max =   86.58  avg =   86.22\n               vgg16  min =  308.42  max =  309.18  avg =  308.81\n          vgg16_int8  min =  543.61  max =  545.09  avg =  544.40\n            resnet50  min =  163.45  max =  164.44  avg =  163.92\n       resnet50_int8  min =  179.51  max =  180.16  avg =  179.83\n      squeezenet_ssd  min =   96.32  max =   97.24  avg =   96.71\n squeezenet_ssd_int8  min =  116.48  max =  117.65  avg =  116.85\n       mobilenet_ssd  min =   92.12  max =   93.09  avg =   92.55\n  mobilenet_ssd_int8  min =   81.78  max =   82.42  avg =   81.95\n      mobilenet_yolo  min =  174.95  max =  175.40  avg =  175.15\n  mobilenetv2_yolov3  min =  110.63  max =  111.05  avg =  110.81\n         yolov4-tiny  min =  163.37  max =  164.24  avg =  163.63\n           nanodet_m  min =   52.96  max =   53.59  avg =   53.12\n    yolo-fastest-1.1  min =   28.98  max =   29.33  avg =   29.20\n      yolo-fastestv2  min =   23.52  max =   24.16  avg =   
23.76\n\nrk3588_s:/data/local/tmp # ./benchncnn 8 1 1 -1 1\nloop_count = 8\nnum_threads = 1\npowersave = 1\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   83.46  max =   83.63  avg =   83.53\n     squeezenet_int8  min =  101.39  max =  102.29  avg =  101.77\n           mobilenet  min =  131.78  max =  132.25  avg =  131.87\n      mobilenet_int8  min =  111.66  max =  112.60  avg =  111.94\n        mobilenet_v2  min =   92.92  max =  227.19  avg =  132.44\n        mobilenet_v3  min =   78.38  max =   78.64  avg =   78.49\n          shufflenet  min =   62.98  max =   63.17  avg =   63.09\n       shufflenet_v2  min =   56.85  max =   57.23  avg =   57.00\n             mnasnet  min =   87.53  max =   87.71  avg =   87.60\n     proxylessnasnet  min =  113.25  max =  114.10  avg =  113.58\n     efficientnet_b0  min =  180.95  max =  181.16  avg =  181.07\n   efficientnetv2_b0  min =  285.34  max =  285.62  avg =  285.51\n        regnety_400m  min =  109.24  max =  109.36  avg =  109.31\n           blazeface  min =   41.12  max =   41.53  avg =   41.23\n           googlenet  min =  358.94  max =  359.55  avg =  359.24\n      googlenet_int8  min =  371.32  max =  371.84  avg =  371.51\n            resnet18  min =  209.97  max =  210.42  avg =  210.22\n       resnet18_int8  min =  302.93  max =  303.51  avg =  303.26\n             alexnet  min =  318.95  max =  321.70  avg =  319.40\n               vgg16  min = 1126.11  max = 1127.83  avg = 1126.98\n          vgg16_int8  min = 2026.90  max = 2034.04  avg = 2029.35\n            resnet50  min =  602.90  max =  603.70  avg =  603.30\n       resnet50_int8  min =  647.33  max =  649.41  avg =  648.65\n      squeezenet_ssd  min =  280.60  max =  281.50  avg =  281.02\n squeezenet_ssd_int8  min =  359.41  max =  362.07  avg =  360.66\n       mobilenet_ssd  min =  319.11  max =  319.29  avg =  319.21\n  mobilenet_ssd_int8  min =  272.16  max =  273.36  avg =  272.83\n      mobilenet_yolo  min =  607.07  max =  
607.38  avg =  607.21\n  mobilenetv2_yolov3  min =  326.66  max =  326.95  avg =  326.80\n         yolov4-tiny  min =  449.56  max =  450.45  avg =  450.04\n           nanodet_m  min =  142.09  max =  142.54  avg =  142.32\n    yolo-fastest-1.1  min =   63.74  max =   63.80  avg =   63.78\n      yolo-fastestv2  min =   57.56  max =   58.17  avg =   57.97\n\nrk3588_s:/data/local/tmp # ./benchncnn 8 1 2 0 0\n[0 Mali-G610]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G610]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Mali-G610]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Mali-G610]  subgroup=16  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    7.09  max =    7.20  avg =    7.13\n           mobilenet  min =    9.16  max =    9.32  avg =    9.22\n        mobilenet_v2  min =   10.18  max =   10.32  avg =   10.25\n        mobilenet_v3  min =    8.01  max =    8.09  avg =    8.04\n          shufflenet  min =    5.88  max =    5.93  avg =    5.89\n       shufflenet_v2  min =    6.30  max =    6.33  avg =    6.31\n             mnasnet  min =    7.91  max =    8.00  avg =    7.95\n     proxylessnasnet  min =   11.20  max =   11.42  avg =   11.30\n        regnety_400m  min =   11.65  max =   11.84  avg =   11.74\n           blazeface  min =    2.50  max =    2.59  avg =    2.53\n           googlenet  min =   17.69  max =   17.78  avg =   17.74\n            resnet18  min =   16.04  max =   16.39  avg =   16.25\n             alexnet  min =   15.47  max =   15.66  avg =   15.56\n               vgg16  min =   64.74  max =   65.42  avg =   65.04\n            resnet50  min =   37.83  max =   38.31  avg =   38.12\n      squeezenet_ssd  min =   23.14  max =   23.44  avg =   23.26\n       mobilenet_ssd  min =   22.48  max =   23.01  avg =   22.74\n      mobilenet_yolo  min =   40.08  max =   40.72  avg =   40.32\n  mobilenetv2_yolov3  min =   31.88  max =   32.57  avg =   32.12\n     
    yolov4-tiny  min =   49.64  max =   50.73  avg =   50.13\n           nanodet_m  min =   10.60  max =   10.70  avg =   10.64\n    yolo-fastest-1.1  min =    7.63  max =    7.66  avg =    7.64\n      yolo-fastestv2  min =    6.99  max =    7.02  avg =    7.00\n```\n\n### Station-M3/ROC-RK3588S-PC, Rockchip RK3588S (Quad Core A76 2.4GHz + Quad Core A55 1.8GHz + Mali-G610) StationOS (Android)\n\n```\nroc_rk3588s_pc:/data/local/tmp # ./benchncnn 10 1 0 0 0\n./benchncnn 10 1 0 0 0\n[0 Mali-G610]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G610]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Mali-G610]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Mali-G610]  subgroup=16  basic/vote/ballot/shuffle=1/1/1/1\n[0 Mali-G610]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    7.83  max =   14.17  avg =    9.76\n     squeezenet_int8  min =   13.41  max =   13.52  avg =   13.45\n           mobilenet  min =    8.73  max =    9.68  avg =    9.07\n      mobilenet_int8  min =   17.70  max =   17.89  avg =   17.80\n        mobilenet_v2  min =   10.73  max =   21.20  avg =   18.93\n        mobilenet_v3  min =    9.00  max =   13.36  avg =   10.64\n          shufflenet  min =    7.79  max =    7.93  avg =    7.85\n       shufflenet_v2  min =    8.01  max =    8.06  avg =    8.03\n             mnasnet  min =    7.43  max =    8.71  avg =    8.28\n     proxylessnasnet  min =   10.56  max =   12.07  avg =   11.70\n     efficientnet_b0  min =    2.15  max =    2.19  avg =    2.17\n   efficientnetv2_b0  min =    0.56  max =    0.62  avg =    0.57\n        regnety_400m  min =    1.65  max =    1.69  avg =    1.67\n           blazeface  min =    0.76  max =    0.79  avg =    0.78\n           googlenet  min =    1.53  max =    1.60  avg =    1.56\n      googlenet_int8  min =   60.85  max =   61.01  avg =   60.93\n            resnet18  min =    0.63  max =    0.82  avg =    0.65\n       
resnet18_int8  min =   64.60  max =   65.13  avg =   64.78\n             alexnet  min =    0.35  max =    0.40  avg =    0.37\n               vgg16  min =    0.54  max =    0.60  avg =    0.56\n          vgg16_int8  min =  445.21  max =  562.09  avg =  537.10\n            resnet50  min =    0.95  max =    0.97  avg =    0.96\n       resnet50_int8  min =  113.02  max =  113.38  avg =  113.17\n      squeezenet_ssd  min =    1.94  max =    2.00  avg =    1.96\n squeezenet_ssd_int8  min =   52.09  max =   56.93  avg =   56.35\n       mobilenet_ssd  min =    1.19  max =    1.26  avg =    1.21\n  mobilenet_ssd_int8  min =   44.33  max =   44.87  avg =   44.66\n      mobilenet_yolo  min =    1.05  max =    1.24  avg =    1.13\n  mobilenetv2_yolov3  min =    1.18  max =    1.25  avg =    1.21\n         yolov4-tiny  min =    0.78  max =    0.80  avg =    0.78\n           nanodet_m  min =    3.43  max =    3.80  avg =    3.57\n    yolo-fastest-1.1  min =    1.43  max =    1.50  avg =    1.47\n      yolo-fastestv2  min =    2.03  max =    2.10  avg =    2.05\n  vision_transformer  min =    0.32  max =    0.36  avg =    0.35\n          FastestDet  min =    1.90  max =    1.95  avg =    1.93\n\nroc_rk3588s_pc:/data/local/tmp # ./benchncnn 10 1 0 -1 0\n./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   13.36  max =   13.50  avg =   13.40\n     squeezenet_int8  min =   16.22  max =   16.34  avg =   16.30\n           mobilenet  min =   22.41  max =   22.49  avg =   22.44\n      mobilenet_int8  min =   17.76  max =   17.94  avg =   17.84\n        mobilenet_v2  min =   17.60  max =   17.80  avg =   17.70\n        mobilenet_v3  min =   13.55  max =   13.70  avg =   13.61\n          shufflenet  min =    7.91  max =    7.95  avg =    7.93\n       shufflenet_v2  min =    8.36  max =    8.40  avg =    8.38\n             mnasnet  min =   14.50  max =   14.60  avg =   14.56\n     proxylessnasnet  min =  
 16.99  max =   17.12  avg =   17.06\n     efficientnet_b0  min =   26.55  max =   26.78  avg =   26.62\n   efficientnetv2_b0  min =   46.96  max =   47.44  avg =   47.30\n        regnety_400m  min =   18.53  max =   18.63  avg =   18.58\n           blazeface  min =    2.98  max =    3.02  avg =    3.00\n           googlenet  min =   62.69  max =   63.14  avg =   62.90\n      googlenet_int8  min =   60.86  max =   61.54  avg =   61.05\n            resnet18  min =   30.34  max =   31.39  avg =   31.22\n       resnet18_int8  min =   57.42  max =   57.67  avg =   57.56\n             alexnet  min =   40.81  max =   40.87  avg =   40.84\n               vgg16  min =  192.71  max =  195.20  avg =  194.26\n          vgg16_int8  min =  450.95  max =  534.38  avg =  482.27\n            resnet50  min =  105.11  max =  105.64  avg =  105.30\n       resnet50_int8  min =  105.94  max =  132.01  avg =  116.48\n      squeezenet_ssd  min =   51.36  max =   51.59  avg =   51.51\n squeezenet_ssd_int8  min =   69.01  max =   69.83  avg =   69.37\n       mobilenet_ssd  min =   53.19  max =   55.24  avg =   53.50\n  mobilenet_ssd_int8  min =   44.49  max =   44.98  avg =   44.74\n      mobilenet_yolo  min =  112.65  max =  113.28  avg =  112.94\n  mobilenetv2_yolov3  min =   63.38  max =   63.83  avg =   63.55\n         yolov4-tiny  min =   77.57  max =   78.20  avg =   77.90\n           nanodet_m  min =   25.21  max =   25.81  avg =   25.58\n    yolo-fastest-1.1  min =    8.76  max =    8.84  avg =    8.80\n      yolo-fastestv2  min =    8.46  max =    8.53  avg =    8.50\n  vision_transformer  min = 1499.53  max = 1501.32  avg = 1500.50\n          FastestDet  min =    7.04  max =    7.08  avg =    7.06\n```\n\n### Station P2, Rockchip RK3568 (Cortex-A55 2.0GHz x 4)\n\n```\n./benchncnn 4 4 0 -1 1\nloop_count = 4\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   26.02  max =   27.15  avg =   26.74\n     squeezenet_int8  min =   44.69  max 
=   45.70  avg =   45.24\n           mobilenet  min =   32.63  max =   33.49  avg =   33.10\n      mobilenet_int8  min =   44.23  max =   45.86  avg =   44.99\n        mobilenet_v2  min =   31.59  max =   32.02  avg =   31.86\n        mobilenet_v3  min =   25.71  max =   26.44  avg =   26.10\n          shufflenet  min =   22.12  max =   23.17  avg =   22.52\n       shufflenet_v2  min =   17.84  max =   18.21  avg =   17.96\n             mnasnet  min =   28.26  max =   28.70  avg =   28.45\n     proxylessnasnet  min =   31.96  max =   32.25  avg =   32.13\n     efficientnet_b0  min =   53.17  max =   54.48  avg =   53.60\n   efficientnetv2_b0  min =   70.08  max =   70.69  avg =   70.30\n        regnety_400m  min =   40.80  max =   41.79  avg =   41.10\n           blazeface  min =   10.79  max =   11.57  avg =   11.11\n           googlenet  min =   83.66  max =   92.22  avg =   86.23\n      googlenet_int8  min =  116.44  max =  118.34  avg =  117.08\n            resnet18  min =   61.38  max =   62.52  avg =   61.94\n       resnet18_int8  min =   95.58  max =   96.93  avg =   96.28\n             alexnet  min =   69.90  max =   70.59  avg =   70.19\n               vgg16  min =  334.24  max =  343.89  avg =  337.24\n          vgg16_int8  min =  464.88  max =  474.71  avg =  468.29\n            resnet50  min =  141.65  max =  146.23  avg =  143.78\n       resnet50_int8  min =  230.36  max =  254.75  avg =  241.24\n      squeezenet_ssd  min =   98.38  max =  104.60  avg =  100.50\n squeezenet_ssd_int8  min =  134.73  max =  137.88  avg =  136.12\n       mobilenet_ssd  min =   77.48  max =   79.92  avg =   78.64\n  mobilenet_ssd_int8  min =  101.44  max =  102.61  avg =  102.06\n      mobilenet_yolo  min =  149.12  max =  150.14  avg =  149.76\n  mobilenetv2_yolov3  min =  103.71  max =  107.81  avg =  105.69\n         yolov4-tiny  min =  145.75  max =  149.35  avg =  147.09\n           nanodet_m  min =   52.91  max =   54.06  avg =   53.53\n\n./benchncnn 4 2 0 -1 
1\nloop_count = 4\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   33.78  max =   34.38  avg =   34.16\n     squeezenet_int8  min =   61.66  max =   62.11  avg =   61.85\n           mobilenet  min =   46.53  max =   46.74  avg =   46.62\n      mobilenet_int8  min =   71.06  max =   71.76  avg =   71.38\n        mobilenet_v2  min =   39.05  max =   39.38  avg =   39.19\n        mobilenet_v3  min =   32.20  max =   32.47  avg =   32.29\n          shufflenet  min =   27.13  max =   27.40  avg =   27.27\n       shufflenet_v2  min =   23.38  max =   23.92  avg =   23.62\n             mnasnet  min =   35.51  max =   35.73  avg =   35.62\n     proxylessnasnet  min =   42.98  max =   43.16  avg =   43.06\n     efficientnet_b0  min =   75.34  max =   75.79  avg =   75.61\n   efficientnetv2_b0  min =  107.34  max =  107.83  avg =  107.60\n        regnety_400m  min =   47.91  max =   48.20  avg =   48.02\n           blazeface  min =   16.38  max =   16.63  avg =   16.49\n           googlenet  min =  124.27  max =  125.24  avg =  124.65\n      googlenet_int8  min =  177.78  max =  178.39  avg =  178.06\n            resnet18  min =   82.02  max =   82.70  avg =   82.38\n       resnet18_int8  min =  148.06  max =  149.03  avg =  148.39\n             alexnet  min =  105.20  max =  105.91  avg =  105.54\n               vgg16  min =  459.65  max =  464.94  avg =  462.02\n          vgg16_int8  min =  737.54  max =  750.64  avg =  742.90\n            resnet50  min =  204.44  max =  205.20  avg =  204.84\n       resnet50_int8  min =  364.47  max =  366.04  avg =  365.53\n      squeezenet_ssd  min =  124.42  max =  128.01  avg =  125.80\n squeezenet_ssd_int8  min =  179.29  max =  183.83  avg =  181.43\n       mobilenet_ssd  min =  113.85  max =  115.50  avg =  114.41\n  mobilenet_ssd_int8  min =  161.35  max =  162.38  avg =  161.71\n      mobilenet_yolo  min =  214.95  max =  216.62  avg =  215.72\n  mobilenetv2_yolov3  min =  134.23  
max =  136.26  avg =  135.07\n         yolov4-tiny  min =  194.72  max =  195.49  avg =  195.18\n           nanodet_m  min =   67.67  max =   68.09  avg =   67.90\n\n./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   54.31  max =   55.65  avg =   55.00\n     squeezenet_int8  min =  103.96  max =  106.28  avg =  104.92\n           mobilenet  min =   79.02  max =   79.46  avg =   79.25\n      mobilenet_int8  min =  130.06  max =  130.61  avg =  130.36\n        mobilenet_v2  min =   60.15  max =   60.66  avg =   60.31\n        mobilenet_v3  min =   49.40  max =   49.57  avg =   49.49\n          shufflenet  min =   39.39  max =   39.78  avg =   39.60\n       shufflenet_v2  min =   35.48  max =   35.70  avg =   35.62\n             mnasnet  min =   55.38  max =   56.10  avg =   55.71\n     proxylessnasnet  min =   70.29  max =   70.48  avg =   70.35\n     efficientnet_b0  min =  128.56  max =  129.96  avg =  129.26\n   efficientnetv2_b0  min =  181.00  max =  181.56  avg =  181.24\n        regnety_400m  min =   67.15  max =   69.62  avg =   67.95\n           blazeface  min =   26.07  max =   26.58  avg =   26.33\n           googlenet  min =  219.19  max =  221.32  avg =  220.01\n      googlenet_int8  min =  317.62  max =  319.40  avg =  318.37\n            resnet18  min =  135.33  max =  136.94  avg =  135.88\n       resnet18_int8  min =  264.69  max =  265.51  avg =  265.16\n             alexnet  min =  190.54  max =  193.50  avg =  191.88\n               vgg16  min =  790.99  max =  809.24  avg =  795.85\n          vgg16_int8  min = 1354.48  max = 1358.89  avg = 1357.40\n            resnet50  min =  358.08  max =  362.96  avg =  360.29\n       resnet50_int8  min =  667.92  max =  670.40  avg =  668.78\n      squeezenet_ssd  min =  193.15  max =  194.02  avg =  193.49\n squeezenet_ssd_int8  min =  291.42  max =  294.70  avg =  293.16\n       mobilenet_ssd  min =  189.54  max =  190.28  
avg =  189.97\n  mobilenet_ssd_int8  min =  289.94  max =  290.40  avg =  290.28\n      mobilenet_yolo  min =  370.37  max =  384.69  avg =  375.11\n  mobilenetv2_yolov3  min =  210.93  max =  211.70  avg =  211.40\n         yolov4-tiny  min =  309.11  max =  310.74  avg =  309.89\n           nanodet_m  min =  100.42  max =  112.25  avg =  103.66\n```\n\n### Rock3A, Rockchip RK3568 (Cortex-A55 2.0GHz x 4) ubuntu 20.04\n\n```\nrock@rock3a:~/ncnn/build/benchmark$ ./benchncnn 8 4 0 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   29.52  max =   30.30  avg =   29.76\n     squeezenet_int8  min =   35.40  max =   36.19  avg =   35.88\n           mobilenet  min =   34.47  max =   35.44  avg =   34.84\n      mobilenet_int8  min =   34.19  max =   34.53  avg =   34.40\n        mobilenet_v2  min =   35.75  max =   36.09  avg =   35.88\n        mobilenet_v3  min =   28.12  max =   28.82  avg =   28.49\n          shufflenet  min =   23.62  max =   24.08  avg =   23.84\n       shufflenet_v2  min =   19.37  max =   19.64  avg =   19.52\n             mnasnet  min =   30.84  max =   31.45  avg =   31.02\n     proxylessnasnet  min =   35.73  max =   36.07  avg =   35.90\n     efficientnet_b0  min =   48.16  max =   49.29  avg =   48.64\n   efficientnetv2_b0  min =   66.62  max =   67.11  avg =   66.85\n        regnety_400m  min =   41.11  max =   41.64  avg =   41.34\n           blazeface  min =   12.38  max =   12.64  avg =   12.56\n           googlenet  min =   86.73  max =   87.79  avg =   87.11\n      googlenet_int8  min =  101.42  max =  103.87  avg =  102.55\n            resnet18  min =   64.85  max =   65.84  avg =   65.23\n       resnet18_int8  min =   93.55  max =   94.54  avg =   94.03\n             alexnet  min =   70.89  max =   73.58  avg =   71.57\n               vgg16  min =  356.13  max =  358.52  avg =  357.15\n          vgg16_int8  min =  521.92  max =  524.13  avg =  523.11\n            
resnet50  min =  147.65  max =  150.33  avg =  148.52\n       resnet50_int8  min =  191.94  max =  192.73  avg =  192.30\n      squeezenet_ssd  min =  104.32  max =  105.75  avg =  105.00\n squeezenet_ssd_int8  min =  125.97  max =  127.53  avg =  126.70\n       mobilenet_ssd  min =   82.29  max =   82.65  avg =   82.47\n  mobilenet_ssd_int8  min =   79.26  max =   80.93  avg =   79.72\n      mobilenet_yolo  min =  165.51  max =  165.86  avg =  165.72\n  mobilenetv2_yolov3  min =  116.11  max =  116.83  avg =  116.43\n         yolov4-tiny  min =  152.09  max =  153.39  avg =  152.60\n           nanodet_m  min =   53.63  max =   54.14  avg =   53.92\n\nrock@rock3a:~/ncnn/build/benchmark$ ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   62.47  max =   63.04  avg =   62.84\n     squeezenet_int8  min =   67.23  max =   68.48  avg =   67.93\n           mobilenet  min =   85.27  max =   85.69  avg =   85.49\n      mobilenet_int8  min =   75.00  max =   75.48  avg =   75.26\n        mobilenet_v2  min =   68.41  max =   69.09  avg =   68.76\n        mobilenet_v3  min =   54.19  max =   54.52  avg =   54.34\n          shufflenet  min =   45.90  max =   46.30  avg =   46.09\n       shufflenet_v2  min =   39.64  max =   40.07  avg =   39.91\n             mnasnet  min =   62.16  max =   62.41  avg =   62.30\n     proxylessnasnet  min =   80.79  max =   81.41  avg =   81.12\n     efficientnet_b0  min =  113.47  max =  113.68  avg =  113.57\n   efficientnetv2_b0  min =  167.30  max =  167.58  avg =  167.44\n        regnety_400m  min =   72.12  max =   72.24  avg =   72.17\n           blazeface  min =   31.89  max =   32.04  avg =   31.95\n           googlenet  min =  224.27  max =  224.86  avg =  224.55\n      googlenet_int8  min =  240.02  max =  240.93  avg =  240.45\n            resnet18  min =  150.25  max =  150.69  avg =  150.47\n       resnet18_int8  min =  226.70  max =  228.19  avg = 
 227.56\n             alexnet  min =  197.44  max =  199.16  avg =  198.17\n               vgg16  min =  859.80  max =  860.79  avg =  860.35\n          vgg16_int8  min = 1409.66  max = 1411.92  avg = 1411.07\n            resnet50  min =  381.04  max =  382.73  avg =  381.86\n       resnet50_int8  min =  441.78  max =  445.00  avg =  443.29\n      squeezenet_ssd  min =  208.14  max =  208.67  avg =  208.41\n squeezenet_ssd_int8  min =  248.82  max =  250.80  avg =  249.89\n       mobilenet_ssd  min =  200.95  max =  201.21  avg =  201.06\n  mobilenet_ssd_int8  min =  173.81  max =  174.54  avg =  174.28\n      mobilenet_yolo  min =  394.65  max =  395.00  avg =  394.78\n  mobilenetv2_yolov3  min =  231.80  max =  232.27  avg =  232.08\n         yolov4-tiny  min =  321.31  max =  322.43  avg =  321.79\n           nanodet_m  min =  103.81  max =  104.61  avg =  104.25\n```\n\n### Station-M2/ROC-RK3566-PC, Rockchip RK3566 (Cortex-A55 1.8GHz x 4 + Mali-G52) StationOS (Android)\n\n```\nrk3566_roc_pc:/data/local/tmp # ./benchncnn 10 1 0 0 0\n./benchncnn 10 1 0 0 0\n[0 Mali-G52]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G52]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=1\n[0 Mali-G52]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Mali-G52]  subgroup=8  basic/vote/ballot/shuffle=1/1/1/1\n[0 Mali-G52]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   43.67  max =   44.15  avg =   43.82\n     squeezenet_int8  min =   62.72  max =   63.99  avg =   63.49\n           mobilenet  min =   74.32  max =   74.82  avg =   74.58\n      mobilenet_int8  min =   64.42  max =   65.43  avg =   64.89\n        mobilenet_v2  min =   52.96  max =   53.23  avg =   53.09\n        mobilenet_v3  min =   51.55  max =   53.12  avg =   51.96\n          shufflenet  min =   40.73  max =   41.28  avg =   40.98\n       shufflenet_v2  min =   41.56  max =   43.62  avg =   42.22\n             mnasnet  min 
=   54.37  max =   54.63  avg =   54.52\n     proxylessnasnet  min =   57.91  max =   59.38  avg =   58.36\n     efficientnet_b0  min =   38.40  max =   40.29  avg =   39.06\n   efficientnetv2_b0  min =   36.91  max =   38.45  avg =   37.72\n        regnety_400m  min =   69.07  max =   69.98  avg =   69.40\n           blazeface  min =   12.26  max =   13.08  avg =   12.57\n           googlenet  min =  147.08  max =  147.80  avg =  147.48\n      googlenet_int8  min =  221.94  max =  225.99  avg =  223.12\n            resnet18  min =  137.90  max =  138.50  avg =  138.19\n       resnet18_int8  min =  187.84  max =  190.88  avg =  188.81\n             alexnet  min =  167.56  max =  168.92  avg =  168.17\n               vgg16  min =  713.42  max =  715.20  avg =  714.51\n          vgg16_int8  min = 1279.97  max = 1302.95  avg = 1294.59\n            resnet50  min =  369.74  max =  375.95  avg =  372.60\n       resnet50_int8  min =  391.86  max =  397.49  avg =  395.17\n      squeezenet_ssd  min =  155.18  max =  156.09  avg =  155.62\n squeezenet_ssd_int8  min =  218.83  max =  222.64  avg =  221.11\n       mobilenet_ssd  min =  161.62  max =  163.22  avg =  162.27\n  mobilenet_ssd_int8  min =  147.33  max =  149.16  avg =  148.23\n      mobilenet_yolo  min =  344.09  max =  349.15  avg =  346.73\n  mobilenetv2_yolov3  min =  168.72  max =  169.64  avg =  169.22\n         yolov4-tiny  min =  239.44  max =  241.11  avg =  240.00\n           nanodet_m  min =   88.06  max =   89.89  avg =   88.87\n    yolo-fastest-1.1  min =   36.05  max =   37.86  avg =   36.47\n      yolo-fastestv2  min =   34.80  max =   36.58  avg =   35.37\n  vision_transformer  min =  356.42  max =  359.37  avg =  358.03\n          FastestDet  min =   38.03  max =   38.52  avg =   38.24\n\nrk3566_roc_pc:/data/local/tmp # ./benchncnn 10 1 0 -1 0\n./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   47.01  max =   
48.12  avg =   47.62\n     squeezenet_int8  min =   63.30  max =   64.10  avg =   63.74\n           mobilenet  min =   70.24  max =   71.52  avg =   70.63\n      mobilenet_int8  min =   63.90  max =   65.25  avg =   64.41\n        mobilenet_v2  min =   55.75  max =   56.26  avg =   56.02\n        mobilenet_v3  min =   45.56  max =   46.47  avg =   46.17\n          shufflenet  min =   34.16  max =   35.16  avg =   34.64\n       shufflenet_v2  min =   32.58  max =   33.86  avg =   33.25\n             mnasnet  min =   52.43  max =   53.15  avg =   52.80\n     proxylessnasnet  min =   65.55  max =   67.04  avg =   66.36\n     efficientnet_b0  min =   82.52  max =   82.97  avg =   82.64\n   efficientnetv2_b0  min =  148.90  max =  150.47  avg =  149.64\n        regnety_400m  min =   63.33  max =   64.29  avg =   63.70\n           blazeface  min =   11.55  max =   12.35  avg =   11.77\n           googlenet  min =  205.85  max =  208.74  avg =  207.17\n      googlenet_int8  min =  222.72  max =  225.84  avg =  223.98\n            resnet18  min =  134.19  max =  136.81  avg =  135.39\n       resnet18_int8  min =  187.26  max =  189.45  avg =  188.36\n             alexnet  min =  143.01  max =  144.97  avg =  143.42\n               vgg16  min =  829.44  max =  839.46  avg =  835.37\n          vgg16_int8  min = 1299.25  max = 1306.89  avg = 1301.71\n            resnet50  min =  326.54  max =  330.21  avg =  328.27\n       resnet50_int8  min =  391.67  max =  395.59  avg =  393.27\n      squeezenet_ssd  min =  166.12  max =  168.33  avg =  167.08\n squeezenet_ssd_int8  min =  221.82  max =  223.85  avg =  222.69\n       mobilenet_ssd  min =  163.17  max =  166.55  avg =  164.11\n  mobilenet_ssd_int8  min =  146.16  max =  148.20  avg =  147.41\n      mobilenet_yolo  min =  335.15  max =  338.32  avg =  336.66\n  mobilenetv2_yolov3  min =  193.18  max =  195.51  avg =  194.33\n         yolov4-tiny  min =  288.82  max =  292.16  avg =  290.36\n           nanodet_m  min =   
98.31  max =  100.30  avg =   99.20\n    yolo-fastest-1.1  min =   37.73  max =   38.97  avg =   38.40\n      yolo-fastestv2  min =   36.21  max =   37.90  avg =   37.13\n  vision_transformer  min = 7385.59  max = 7410.59  avg = 7402.20\n          FastestDet  min =   34.55  max =   35.42  avg =   35.06\n```\n\n### Rockchip RK3399 (Cortex-A72 1.8GHz x 2 + Cortex-A53 1.5GHz x 4)\n\n```\nnanopc-t4:/data/local/tmp # ./benchncnn 8 2 2 -1 1\nloop_count = 8\nnum_threads = 2\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   43.73  max =   44.30  avg =   43.97\n     squeezenet_int8  min =   37.92  max =   38.39  avg =   38.09\n           mobilenet  min =   64.28  max =   66.66  avg =   65.14\n      mobilenet_int8  min =   43.17  max =   43.73  avg =   43.38\n        mobilenet_v2  min =   51.30  max =   52.18  avg =   51.75\n        mobilenet_v3  min =   41.51  max =   43.25  avg =   42.10\n          shufflenet  min =   27.43  max =   28.27  avg =   27.75\n       shufflenet_v2  min =   24.96  max =   25.79  avg =   25.55\n             mnasnet  min =   45.44  max =   46.95  avg =   46.16\n     proxylessnasnet  min =   51.98  max =   53.52  avg =   52.48\n     efficientnet_b0  min =   83.79  max =   84.68  avg =   84.27\n   efficientnetv2_b0  min =   97.89  max =   99.27  avg =   98.55\n        regnety_400m  min =   65.15  max =   65.89  avg =   65.41\n           blazeface  min =    8.74  max =    8.89  avg =    8.80\n           googlenet  min =  131.46  max =  140.16  avg =  133.24\n      googlenet_int8  min =  115.72  max =  118.34  avg =  116.60\n            resnet18  min =  111.77  max =  113.18  avg =  112.37\n       resnet18_int8  min =   84.27  max =   84.90  avg =   84.49\n             alexnet  min =  105.74  max =  109.87  avg =  107.15\n               vgg16  min =  619.88  max =  634.59  avg =  629.15\n          vgg16_int8  min =  447.14  max =  451.09  avg =  448.53\n            resnet50  min =  291.51  max =  296.55  avg =  293.08\n   
    resnet50_int8  min =  224.09  max =  227.03  avg =  225.02\n      squeezenet_ssd  min =  109.72  max =  112.09  avg =  110.78\n squeezenet_ssd_int8  min =   93.41  max =   94.83  avg =   93.97\n       mobilenet_ssd  min =  131.30  max =  132.82  avg =  131.94\n  mobilenet_ssd_int8  min =   87.52  max =   88.89  avg =   88.35\n      mobilenet_yolo  min =  288.02  max =  289.84  avg =  288.61\n  mobilenetv2_yolov3  min =  168.45  max =  170.94  avg =  169.79\n         yolov4-tiny  min =  217.45  max =  226.39  avg =  219.76\n           nanodet_m  min =   65.74  max =   66.84  avg =   66.49\n    yolo-fastest-1.1  min =   32.91  max =   33.74  avg =   33.37\n      yolo-fastestv2  min =   28.90  max =   37.31  avg =   30.27\n\nnanopc-t4:/data/local/tmp # ./benchncnn 8 1 2 -1 1\nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   71.35  max =   73.02  avg =   71.83\n     squeezenet_int8  min =   60.39  max =   60.96  avg =   60.69\n           mobilenet  min =  111.12  max =  113.02  avg =  111.99\n      mobilenet_int8  min =   80.14  max =   81.59  avg =   81.00\n        mobilenet_v2  min =   78.18  max =   80.89  avg =   79.18\n        mobilenet_v3  min =   63.49  max =   64.26  avg =   63.90\n          shufflenet  min =   38.90  max =   40.28  avg =   39.26\n       shufflenet_v2  min =   37.72  max =   38.45  avg =   38.02\n             mnasnet  min =   72.34  max =   73.59  avg =   72.87\n     proxylessnasnet  min =   87.33  max =   89.70  avg =   88.45\n     efficientnet_b0  min =  145.14  max =  146.77  avg =  145.93\n   efficientnetv2_b0  min =  169.33  max =  171.16  avg =  170.16\n        regnety_400m  min =   99.08  max =   99.80  avg =   99.47\n           blazeface  min =   12.28  max =   12.69  avg =   12.48\n           googlenet  min =  228.18  max =  229.36  avg =  228.64\n      googlenet_int8  min =  201.62  max =  203.71  avg =  202.25\n            resnet18  min =  175.71  max =  180.53  avg 
=  176.85\n       resnet18_int8  min =  151.42  max =  152.45  avg =  151.83\n             alexnet  min =  160.81  max =  186.24  avg =  165.30\n               vgg16  min = 1044.34  max = 1080.88  avg = 1062.34\n          vgg16_int8  min =  844.53  max =  851.71  avg =  848.65\n            resnet50  min =  503.25  max =  505.20  avg =  504.18\n       resnet50_int8  min =  397.71  max =  400.19  avg =  398.63\n      squeezenet_ssd  min =  162.98  max =  165.97  avg =  164.34\n squeezenet_ssd_int8  min =  145.93  max =  148.59  avg =  146.94\n       mobilenet_ssd  min =  226.54  max =  229.80  avg =  227.80\n  mobilenet_ssd_int8  min =  159.97  max =  163.18  avg =  161.06\n      mobilenet_yolo  min =  512.90  max =  517.47  avg =  515.06\n  mobilenetv2_yolov3  min =  274.88  max =  280.24  avg =  276.36\n         yolov4-tiny  min =  351.97  max =  358.70  avg =  355.60\n           nanodet_m  min =   95.32  max =   97.83  avg =   96.28\n    yolo-fastest-1.1  min =   43.47  max =   46.52  avg =   44.55\n      yolo-fastestv2  min =   37.22  max =   37.63  avg =   37.45\n\nnanopc-t4:/data/local/tmp # ./benchncnn 8 4 1 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   48.11  max =   48.51  avg =   48.24\n     squeezenet_int8  min =   43.19  max =   44.17  avg =   43.40\n           mobilenet  min =   65.47  max =   66.40  avg =   65.68\n      mobilenet_int8  min =   49.15  max =   51.65  avg =   49.76\n        mobilenet_v2  min =   53.60  max =   54.19  avg =   53.87\n        mobilenet_v3  min =   52.83  max =   92.92  avg =   66.25\n          shufflenet  min =   35.71  max =   36.03  avg =   35.83\n       shufflenet_v2  min =   31.88  max =   32.38  avg =   32.16\n             mnasnet  min =   51.59  max =   54.01  avg =   52.30\n     proxylessnasnet  min =   60.11  max =   60.40  avg =   60.24\n     efficientnet_b0  min =   98.22  max =   99.40  avg =   98.56\n   efficientnetv2_b0  min =  114.19  max = 
 123.90  avg =  115.89\n        regnety_400m  min =   85.89  max =   86.20  avg =   86.03\n           blazeface  min =   11.23  max =   11.37  avg =   11.31\n           googlenet  min =  142.25  max =  160.88  avg =  145.26\n      googlenet_int8  min =  125.45  max =  128.50  avg =  125.96\n            resnet18  min =  116.68  max =  118.26  avg =  117.00\n       resnet18_int8  min =   88.43  max =   90.95  avg =   89.08\n             alexnet  min =  150.91  max =  160.01  avg =  152.51\n               vgg16  min =  674.91  max =  684.83  avg =  679.08\n          vgg16_int8  min =  417.60  max =  422.52  avg =  419.60\n            resnet50  min =  297.23  max =  299.37  avg =  298.03\n       resnet50_int8  min =  243.99  max =  251.39  avg =  245.99\n      squeezenet_ssd  min =  127.92  max =  128.53  avg =  128.17\n squeezenet_ssd_int8  min =  112.54  max =  114.63  avg =  113.19\n       mobilenet_ssd  min =  136.43  max =  140.14  avg =  137.33\n  mobilenet_ssd_int8  min =  102.14  max =  105.00  avg =  102.77\n      mobilenet_yolo  min =  291.45  max =  294.04  avg =  292.63\n  mobilenetv2_yolov3  min =  183.13  max =  187.00  avg =  184.05\n         yolov4-tiny  min =  257.46  max =  268.76  avg =  260.49\n           nanodet_m  min =   83.16  max =   91.03  avg =   84.77\n    yolo-fastest-1.1  min =   43.53  max =   43.87  avg =   43.74\n      yolo-fastestv2  min =   35.04  max =   35.54  avg =   35.17\n\nnanopc-t4:/data/local/tmp # ./benchncnn 8 1 1 -1 1\nloop_count = 8\nnum_threads = 1\npowersave = 1\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  129.63  max =  130.58  avg =  129.85\n     squeezenet_int8  min =  124.10  max =  126.34  avg =  124.81\n           mobilenet  min =  207.92  max =  208.72  avg =  208.41\n      mobilenet_int8  min =  175.55  max =  176.11  avg =  175.84\n        mobilenet_v2  min =  143.02  max =  143.56  avg =  143.25\n        mobilenet_v3  min =  133.11  max =  134.05  avg =  133.33\n          shufflenet  min =   
77.97  max =   78.54  avg =   78.19\n       shufflenet_v2  min =   75.59  max =   76.05  avg =   75.82\n             mnasnet  min =  139.86  max =  141.77  avg =  140.19\n     proxylessnasnet  min =  178.57  max =  179.57  avg =  179.03\n     efficientnet_b0  min =  316.10  max =  317.82  avg =  316.86\n   efficientnetv2_b0  min =  359.26  max =  362.03  avg =  360.31\n        regnety_400m  min =  182.64  max =  183.03  avg =  182.82\n           blazeface  min =   25.81  max =   26.53  avg =   26.20\n           googlenet  min =  448.45  max =  450.80  avg =  449.35\n      googlenet_int8  min =  406.07  max =  410.65  avg =  408.04\n            resnet18  min =  351.64  max =  362.12  avg =  354.19\n       resnet18_int8  min =  298.10  max =  300.45  avg =  299.26\n             alexnet  min =  586.92  max =  588.73  avg =  587.80\n               vgg16  min = 2170.12  max = 2202.80  avg = 2183.32\n          vgg16_int8  min = 1533.65  max = 1542.01  avg = 1537.33\n            resnet50  min =  975.40  max =  977.79  avg =  976.61\n       resnet50_int8  min =  851.59  max =  855.22  avg =  853.75\n      squeezenet_ssd  min =  306.35  max =  307.54  avg =  306.96\n squeezenet_ssd_int8  min =  291.32  max =  292.87  avg =  292.18\n       mobilenet_ssd  min =  423.70  max =  424.63  avg =  424.11\n  mobilenet_ssd_int8  min =  358.62  max =  359.42  avg =  359.04\n      mobilenet_yolo  min =  928.06  max =  929.25  avg =  928.55\n  mobilenetv2_yolov3  min =  496.96  max =  499.29  avg =  497.73\n         yolov4-tiny  min =  712.80  max =  714.15  avg =  713.55\n           nanodet_m  min =  179.42  max =  180.60  avg =  179.75\n    yolo-fastest-1.1  min =   88.06  max =   88.85  avg =   88.35\n      yolo-fastestv2  min =   68.68  max =   69.83  avg =   69.08\n\nnanopc-t4:/data/local/tmp # ./benchncnn 4 1 2 0 0\n[0 Mali-T860]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-T860]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=1\n[0 Mali-T860]  fp16-p/s/a=1/0/1  int8-p/s/a=1/0/0\n[0 
Mali-T860]  subgroup=0  basic=0  vote=0  ballot=0  shuffle=0\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   24.57  max =   24.71  avg =   24.64\n           mobilenet  min =   35.86  max =   36.14  avg =   36.04\n        mobilenet_v2  min =   30.18  max =   30.19  avg =   30.19\n        mobilenet_v3  min =   30.88  max =   31.12  avg =   31.01\n          shufflenet  min =   33.90  max =   33.98  avg =   33.93\n       shufflenet_v2  min =   29.10  max =   29.14  avg =   29.12\n             mnasnet  min =   30.49  max =   30.59  avg =   30.53\n     proxylessnasnet  min =   33.56  max =   33.61  avg =   33.59\n     efficientnet_b0  min =   51.15  max =   51.54  avg =   51.38\n   efficientnetv2_b0  min =   86.26  max =   87.36  avg =   86.91\n        regnety_400m  min =   38.44  max =   38.54  avg =   38.49\n           blazeface  min =    9.66  max =    9.74  avg =    9.70\n           googlenet  min =   80.62  max =   80.96  avg =   80.81\n            resnet18  min =   74.07  max =   74.36  avg =   74.23\n             alexnet  min =   76.84  max =   77.26  avg =   77.08\n               vgg16  min =  300.71  max =  300.89  avg =  300.80\n            resnet50  min =  175.96  max =  176.72  avg =  176.23\n      squeezenet_ssd  min =   71.20  max =   71.38  avg =   71.32\n       mobilenet_ssd  min =   76.99  max =   77.47  avg =   77.19\n      mobilenet_yolo  min =  160.41  max =  160.84  avg =  160.62\n  mobilenetv2_yolov3  min =   91.31  max =   91.37  avg =   91.35\n         yolov4-tiny  min =  130.78  max =  131.54  avg =  131.16\n           nanodet_m  min =   55.90  max =   56.03  avg =   55.96\n    yolo-fastest-1.1  min =   25.50  max =   25.66  avg =   25.59\n      yolo-fastestv2  min =   24.94  max =   25.07  avg =   25.01\n```\n\n### MYIR RemiPi, Renesas RZG2L (Cortex-A55 1.5GHz x 2)\n\n```\nroot@myir-remi-1g:~/ncnn# time ./benchncnn 10 4 0 -1 1\nloop_count = 10\nnum_threads = 4\npowersave = 
0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   85.38  max =   87.72  avg =   86.78\n     squeezenet_int8  min =   84.23  max =   86.46  avg =   85.59\n           mobilenet  min =  121.01  max =  122.55  avg =  121.76\n      mobilenet_int8  min =   95.64  max =   97.27  avg =   96.25\n        mobilenet_v2  min =  101.35  max =  102.24  avg =  101.72\n        mobilenet_v3  min =   84.09  max =   86.66  avg =   84.86\n          shufflenet  min =   63.32  max =   65.16  avg =   64.53\n       shufflenet_v2  min =   60.33  max =   62.35  avg =   61.04\n             mnasnet  min =   95.51  max =   96.70  avg =   95.95\n     proxylessnasnet  min =  124.46  max =  125.82  avg =  125.14\n     efficientnet_b0  min =  144.94  max =  146.46  avg =  145.56\n   efficientnetv2_b0  min =  182.87  max =  185.63  avg =  184.56\n        regnety_400m  min =  105.31  max =  106.42  avg =  105.72\n           blazeface  min =   21.34  max =   21.90  avg =   21.50\n           googlenet  min =  313.01  max =  318.42  avg =  314.25\n      googlenet_int8  min =  301.87  max =  304.93  avg =  303.66\n            resnet18  min =  248.02  max =  253.93  avg =  250.12\n       resnet18_int8  min =  244.65  max =  246.62  avg =  245.66\n             alexnet  min =  204.00  max =  206.39  avg =  205.21\n            resnet50  min =  583.13  max =  584.82  avg =  584.11\n       resnet50_int8  min =  517.42  max =  520.97  avg =  519.07\n      squeezenet_ssd  min =  266.63  max =  273.34  avg =  268.60\n squeezenet_ssd_int8  min =  255.42  max =  260.98  avg =  257.15\n       mobilenet_ssd  min =  267.16  max =  270.41  avg =  268.20\n  mobilenet_ssd_int8  min =  205.03  max =  206.43  avg =  205.53\n      mobilenet_yolo  min =  571.08  max =  576.15  avg =  574.18\n  mobilenetv2_yolov3  min =  342.52  max =  344.84  avg =  343.38\n         yolov4-tiny  min =  499.74  max =  503.13  avg =  501.45\n           nanodet_m  min =  161.87  max =  163.90  avg =  162.93\n    
yolo-fastest-1.1  min =   72.84  max =   74.81  avg =   73.35\n      yolo-fastestv2  min =   68.24  max =   70.49  avg =   68.74\n  vision_transformer  min = 12464.09  max = 12491.57  avg = 12475.63\n          FastestDet  min =   67.92  max =   69.90  avg =   68.94\n```\n\n### OrangePi Zero 2, Allwinner H616 (Cortex-A53 1.5GHz x 4)\n\n```\norangepi@zero2:~/ncnn/benchmark$ ./benchncnn 10 4 0 -1 1\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   76.25  max =   90.20  avg =   78.99\n     squeezenet_int8  min =   59.92  max =   60.44  avg =   60.10\n           mobilenet  min =  106.91  max =  132.22  avg =  109.99\n      mobilenet_int8  min =   57.96  max =   59.06  avg =   58.19\n        mobilenet_v2  min =   97.93  max =  124.48  avg =  100.91\n        mobilenet_v3  min =   82.27  max =   83.93  avg =   83.00\n          shufflenet  min =   55.27  max =   82.06  avg =   58.40\n       shufflenet_v2  min =   44.94  max =   71.99  avg =   48.10\n             mnasnet  min =   90.66  max =   91.41  avg =   90.92\n     proxylessnasnet  min =   91.55  max =  118.74  avg =   94.71\n     efficientnet_b0  min =  127.95  max =  155.13  avg =  131.25\n   efficientnetv2_b0  min =  145.96  max =  173.67  avg =  149.36\n        regnety_400m  min =  102.83  max =  103.52  avg =  103.08\n           blazeface  min =   14.46  max =   14.95  avg =   14.77\n           googlenet  min =  217.71  max =  244.16  avg =  221.38\n      googlenet_int8  min =  163.04  max =  187.69  avg =  166.20\n            resnet18  min =  251.45  max =  277.52  avg =  255.00\n       resnet18_int8  min =  136.54  max =  161.95  avg =  141.60\n             alexnet  min =  212.07  max =  233.27  avg =  215.34\n               vgg16  min = 1206.92  max = 1981.79  avg = 1673.28\n          vgg16_int8  min =  622.93  max =  702.12  avg =  661.83\n            resnet50  min =  555.84  max =  643.69  avg =  576.17\n       resnet50_int8  min =  348.11  max 
=  374.25  avg =  354.17\n      squeezenet_ssd  min =  224.68  max =  251.32  avg =  230.59\n squeezenet_ssd_int8  min =  154.87  max =  182.66  avg =  159.08\n       mobilenet_ssd  min =  238.49  max =  426.65  avg =  263.18\n  mobilenet_ssd_int8  min =  118.36  max =  138.39  avg =  120.78\n      mobilenet_yolo  min =  500.28  max =  615.83  avg =  553.59\n  mobilenetv2_yolov3  min =  340.27  max =  369.13  avg =  347.17\n         yolov4-tiny  min =  365.04  max =  408.48  avg =  383.93\n           nanodet_m  min =  112.88  max =  141.85  avg =  116.13\n    yolo-fastest-1.1  min =   72.05  max =   73.46  avg =   72.68\n      yolo-fastestv2  min =   54.94  max =   55.35  avg =   55.15\n  vision_transformer  min = 6842.19  max = 9125.07  avg = 7343.64\n          FastestDet  min =   59.09  max =   59.87  avg =   59.35\n```\n\n### OrangePi4 LTS, Rockchip RK3399 (Cortex-A72 1.8GHz x 2 + Cortex-A53 1.5GHz x 4)\n\nTested on Ubuntu 22.04 (GNOME desktop).\n\n```\norangepi@orangepi4-lts:~/ncnn/benchmark$ ./benchncnn 10 6 0 -1 0\nloop_count = 10\nnum_threads = 6\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   40.89  max =   50.29  avg =   45.15\n     squeezenet_int8  min =   40.36  max =   48.57  avg =   43.56\n           mobilenet  min =   55.81  max =   67.35  avg =   59.81\n      mobilenet_int8  min =   39.96  max =   45.10  avg =   42.09\n        mobilenet_v2  min =   53.29  max =   64.12  avg =   57.40\n        mobilenet_v3  min =   38.94  max =   51.11  avg =   43.06\n          shufflenet  min =   27.32  max =   38.53  avg =   31.85\n       shufflenet_v2  min =   24.38  max =   31.17  avg =   28.32\n             mnasnet  min =   47.02  max =   50.68  avg =   48.86\n     proxylessnasnet  min =   52.31  max =   61.31  avg =   56.66\n     efficientnet_b0  min =   68.14  max =   76.07  avg =   72.62\n   efficientnetv2_b0  min =   77.23  max =   96.07  avg =   84.83\n        regnety_400m  min =   60.81  max =   81.72  avg =   72.37\n           
blazeface  min =    7.24  max =    8.19  avg =    7.68\n           googlenet  min =  122.99  max =  132.67  avg =  128.90\n      googlenet_int8  min =  108.45  max =  121.17  avg =  115.37\n            resnet18  min =  100.67  max =  115.30  avg =  107.65\n       resnet18_int8  min =   80.17  max =   87.56  avg =   84.01\n             alexnet  min =   71.00  max =   83.09  avg =   76.21\n               vgg16  min =  557.67  max =  606.30  avg =  581.12\n          vgg16_int8  min =  369.93  max =  393.20  avg =  384.86\n            resnet50  min =  254.25  max =  272.90  avg =  265.18\n       resnet50_int8  min =  220.70  max =  231.50  avg =  225.03\n      squeezenet_ssd  min =  118.91  max =  131.52  avg =  123.91\n squeezenet_ssd_int8  min =   98.25  max =  116.42  avg =  110.13\n       mobilenet_ssd  min =  126.62  max =  134.13  avg =  129.56\n  mobilenet_ssd_int8  min =   83.83  max =   91.61  avg =   86.75\n      mobilenet_yolo  min =  281.19  max =  299.79  avg =  290.05\n  mobilenetv2_yolov3  min =  180.37  max =  194.10  avg =  185.61\n         yolov4-tiny  min =  215.28  max =  227.29  avg =  221.61\n           nanodet_m  min =   64.63  max =   75.86  avg =   70.46\n    yolo-fastest-1.1  min =   39.54  max =   48.30  avg =   44.76\n      yolo-fastestv2  min =   29.91  max =   53.15  avg =   37.32\n  vision_transformer  min = 2520.25  max = 2595.28  avg = 2557.05\n          FastestDet  min =   32.45  max =   47.38  avg =   40.55\n\norangepi@orangepi4-lts:~/ncnn/benchmark$ ./benchncnn 10 4 1 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   48.90  max =   56.65  avg =   53.09\n     squeezenet_int8  min =   48.09  max =   54.69  avg =   51.26\n           mobilenet  min =   66.06  max =   79.73  avg =   73.96\n      mobilenet_int8  min =   51.33  max =   58.30  avg =   54.71\n        mobilenet_v2  min =   61.06  max =   88.93  avg =   71.48\n        mobilenet_v3  min =   50.41  max =   
65.40  avg =   56.51\n          shufflenet  min =   38.11  max =   63.95  avg =   44.03\n       shufflenet_v2  min =   33.27  max =   36.43  avg =   34.89\n             mnasnet  min =   60.02  max =   72.71  avg =   64.57\n     proxylessnasnet  min =   66.61  max =   73.25  avg =   70.65\n     efficientnet_b0  min =   87.27  max =   94.97  avg =   91.00\n   efficientnetv2_b0  min =   99.89  max =  112.09  avg =  106.13\n        regnety_400m  min =   84.65  max =   92.78  avg =   89.51\n           blazeface  min =    9.73  max =   11.45  avg =   10.85\n           googlenet  min =  154.74  max =  164.25  avg =  159.33\n      googlenet_int8  min =  140.29  max =  148.08  avg =  144.18\n            resnet18  min =  131.51  max =  244.02  avg =  150.56\n       resnet18_int8  min =  102.11  max =  114.40  avg =  108.32\n             alexnet  min =   81.13  max =   92.35  avg =   86.86\n               vgg16  min =  649.91  max =  668.62  avg =  660.25\n          vgg16_int8  min =  513.75  max =  523.77  avg =  518.17\n            resnet50  min =  330.89  max =  378.23  avg =  344.07\n       resnet50_int8  min =  280.38  max =  286.93  avg =  284.43\n      squeezenet_ssd  min =  134.35  max =  146.97  avg =  141.17\n squeezenet_ssd_int8  min =  126.31  max =  137.29  avg =  130.73\n       mobilenet_ssd  min =  146.83  max =  161.70  avg =  155.08\n  mobilenet_ssd_int8  min =  105.74  max =  117.05  avg =  111.62\n      mobilenet_yolo  min =  339.30  max =  352.16  avg =  345.22\n  mobilenetv2_yolov3  min =  223.12  max =  234.18  avg =  229.81\n         yolov4-tiny  min =  267.30  max =  272.95  avg =  270.47\n           nanodet_m  min =   78.72  max =   86.18  avg =   81.81\n    yolo-fastest-1.1  min =   47.96  max =   55.08  avg =   51.81\n      yolo-fastestv2  min =   38.01  max =   44.32  avg =   42.29\n  vision_transformer  min = 3499.34  max = 3526.15  avg = 3514.43\n          FastestDet  min =   40.14  max =   44.37  avg =   
42.30\n\norangepi@orangepi4-lts:~/ncnn/benchmark$ ./benchncnn 10 2 2 -1 0\nloop_count = 10\nnum_threads = 2\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   45.65  max =   46.72  avg =   46.15\n     squeezenet_int8  min =   42.60  max =   43.01  avg =   42.76\n           mobilenet  min =   69.35  max =   70.59  avg =   69.92\n      mobilenet_int8  min =   46.08  max =   46.35  avg =   46.20\n        mobilenet_v2  min =   57.47  max =   58.90  avg =   58.08\n        mobilenet_v3  min =   44.72  max =   45.47  avg =   45.05\n          shufflenet  min =   31.74  max =   32.16  avg =   31.97\n       shufflenet_v2  min =   26.74  max =   26.98  avg =   26.86\n             mnasnet  min =   50.47  max =   51.20  avg =   50.82\n     proxylessnasnet  min =   57.31  max =   58.24  avg =   57.68\n     efficientnet_b0  min =   79.61  max =   80.79  avg =   80.02\n   efficientnetv2_b0  min =   92.67  max =   93.37  avg =   93.08\n        regnety_400m  min =   67.08  max =   68.07  avg =   67.59\n           blazeface  min =    8.56  max =    8.81  avg =    8.70\n           googlenet  min =  136.82  max =  138.26  avg =  137.44\n      googlenet_int8  min =  121.96  max =  122.64  avg =  122.36\n            resnet18  min =  118.04  max =  119.24  avg =  118.49\n       resnet18_int8  min =   89.55  max =   92.11  avg =   90.38\n             alexnet  min =   80.75  max =   82.34  avg =   81.24\n               vgg16  min =  602.11  max =  628.12  avg =  612.26\n          vgg16_int8  min =  481.31  max =  484.49  avg =  482.84\n            resnet50  min =  307.31  max =  310.10  avg =  308.88\n       resnet50_int8  min =  240.45  max =  243.43  avg =  241.76\n      squeezenet_ssd  min =  119.65  max =  122.93  avg =  121.34\n squeezenet_ssd_int8  min =  102.71  max =  103.45  avg =  103.20\n       mobilenet_ssd  min =  142.16  max =  143.58  avg =  142.54\n  mobilenet_ssd_int8  min =   93.20  max =   93.81  avg =   93.41\n      mobilenet_yolo  min =  
315.42  max =  318.06  avg =  317.00\n  mobilenetv2_yolov3  min =  190.59  max =  191.74  avg =  190.96\n         yolov4-tiny  min =  228.77  max =  230.49  avg =  229.78\n           nanodet_m  min =   66.82  max =   67.23  avg =   67.02\n    yolo-fastest-1.1  min =   38.20  max =   40.89  avg =   38.85\n      yolo-fastestv2  min =   32.53  max =   33.48  avg =   33.03\n  vision_transformer  min = 3372.17  max = 3516.54  avg = 3461.89\n          FastestDet  min =   32.92  max =   35.55  avg =   33.62\n\n```\n\n### OrangePicm4, Rockchip Rk3566 (Cortex-A55 1.8GHz x 4)\n```\norangepi@orangepicm4:~/code/ncnn-test$ ./benchncnn 10 4 0 -1 1\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   23.91  max =   91.49  avg =   31.03\n     squeezenet_int8  min =   24.44  max =   25.39  avg =   24.75\n           mobilenet  min =   30.67  max =   31.75  avg =   30.98\n      mobilenet_int8  min =   27.87  max =   28.48  avg =   28.05\n        mobilenet_v2  min =   31.82  max =   32.56  avg =   32.07\n        mobilenet_v3  min =   24.63  max =   24.91  avg =   24.81\n          shufflenet  min =   19.77  max =   20.19  avg =   20.01\n       shufflenet_v2  min =   16.67  max =   40.81  avg =   28.79\n             mnasnet  min =   27.48  max =   28.36  avg =   27.75\n     proxylessnasnet  min =   33.04  max =   37.30  avg =   33.70\n     efficientnet_b0  min =   39.21  max =  175.34  avg =   53.26\n   efficientnetv2_b0  min =   48.94  max =   78.68  avg =   52.44\n        regnety_400m  min =   39.81  max =   40.15  avg =   39.96\n           blazeface  min =    6.22  max =    6.36  avg =    6.30\n           googlenet  min =   75.48  max =  120.58  avg =   82.05\n      googlenet_int8  min =   74.42  max =   78.70  avg =   75.29\n            resnet18  min =   58.21  max =   99.04  avg =   66.07\n       resnet18_int8  min =   54.18  max =   79.91  avg =   57.31\n             alexnet  min =   49.18  max =  161.71  avg =   
63.03\n               vgg16  min =  323.82  max =  452.63  avg =  360.92\n          vgg16_int8  min =  379.18  max =  527.82  avg =  432.99\n            resnet50  min =  135.84  max =  200.71  avg =  142.54\n       resnet50_int8  min =  126.06  max =  169.65  avg =  136.29\n      squeezenet_ssd  min =   77.62  max =  137.89  avg =   86.87\n squeezenet_ssd_int8  min =   74.17  max =   76.22  avg =   74.91\n       mobilenet_ssd  min =   68.60  max =  132.81  avg =   75.30\n  mobilenet_ssd_int8  min =   58.01  max =   59.24  avg =   58.81\n      mobilenet_yolo  min =  151.61  max =  247.03  avg =  168.31\n  mobilenetv2_yolov3  min =  106.00  max =  163.45  avg =  111.92\n         yolov4-tiny  min =  132.99  max =  193.53  avg =  139.88\n           nanodet_m  min =   51.43  max =   87.10  avg =   58.17\n    yolo-fastest-1.1  min =   26.10  max =   66.68  avg =   30.33\n      yolo-fastestv2  min =   21.87  max =   69.79  avg =   35.55\n  vision_transformer  min = 2301.36  max = 2513.89  avg = 2426.14\n          FastestDet  min =   21.33  max =   21.59  avg =   21.47\norangepi@orangepicm4:~/code/ncnn-test$ ./benchncnn 10 1 0 -1 1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   47.26  max =   48.21  avg =   47.68\n     squeezenet_int8  min =   50.80  max =   54.79  avg =   51.64\n           mobilenet  min =   68.18  max =   71.72  avg =   68.78\n      mobilenet_int8  min =   58.34  max =   58.73  avg =   58.56\n        mobilenet_v2  min =   56.56  max =   57.38  avg =   57.04\n        mobilenet_v3  min =   45.52  max =   53.46  avg =   47.98\n          shufflenet  min =   34.88  max =   75.06  avg =   46.15\n       shufflenet_v2  min =   33.43  max =   49.65  avg =   36.86\n             mnasnet  min =   53.87  max =   54.08  avg =   53.98\n     proxylessnasnet  min =   70.99  max =   71.40  avg =   71.14\n     efficientnet_b0  min =   83.79  max =   89.78  avg =   84.96\n   efficientnetv2_b0  min =  103.89 
 max =  117.47  avg =  105.81\n        regnety_400m  min =   63.68  max =   81.25  avg =   66.66\n           blazeface  min =   12.18  max =   39.24  avg =   21.79\n           googlenet  min =  179.41  max =  202.18  avg =  185.39\n      googlenet_int8  min =  187.88  max =  198.49  avg =  191.01\n            resnet18  min =  132.67  max =  148.94  avg =  136.09\n       resnet18_int8  min =  150.37  max =  158.14  avg =  153.17\n             alexnet  min =  115.00  max =  120.17  avg =  116.26\n               vgg16  min =  809.99  max =  851.07  avg =  827.73\n          vgg16_int8  min = 1149.74  max = 1161.37  avg = 1154.22\n            resnet50  min =  327.19  max =  350.42  avg =  332.12\n       resnet50_int8  min =  325.08  max =  332.46  avg =  327.17\n      squeezenet_ssd  min =  150.33  max =  163.00  avg =  153.12\n squeezenet_ssd_int8  min =  152.21  max =  157.94  avg =  155.36\n       mobilenet_ssd  min =  149.30  max =  150.23  avg =  149.72\n  mobilenet_ssd_int8  min =  121.93  max =  127.07  avg =  123.03\n      mobilenet_yolo  min =  330.91  max =  345.64  avg =  336.21\n  mobilenetv2_yolov3  min =  193.25  max =  214.92  avg =  198.82\n         yolov4-tiny  min =  284.38  max =  332.54  avg =  293.43\n           nanodet_m  min =   90.69  max =  100.74  avg =   92.56\n    yolo-fastest-1.1  min =   38.93  max =   51.96  avg =   42.11\n      yolo-fastestv2  min =   35.74  max =   48.11  avg =   38.63\n  vision_transformer  min = 7280.18  max = 7301.27  avg = 7292.38\n          FastestDet  min =   36.54  max =   42.31  avg =   38.41\n```\n\n### OrangePi5, Rockchip RK3588s (Quad Core A76 2.4GHz + Quad Core A55 1.8GHz)\n```\norangepi@orangepi5:~/ncnn-master/benchmark$ ./benchncnn 10 8 0 -1 0\nloop_count = 10\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.22  max =    6.69  avg =    6.37\n     squeezenet_int8  min =    7.93  max =    8.32  avg =    8.07\n           mobilenet  min =    9.08  max =   
14.02  avg =    9.81\n      mobilenet_int8  min =    7.89  max =    9.02  avg =    8.47\n        mobilenet_v2  min =    7.77  max =    8.09  avg =    7.92\n        mobilenet_v3  min =    6.87  max =    8.19  avg =    7.46\n          shufflenet  min =    5.98  max =   10.21  avg =    7.23\n       shufflenet_v2  min =    4.82  max =    5.04  avg =    4.93\n             mnasnet  min =    6.15  max =    6.36  avg =    6.24\n     proxylessnasnet  min =    9.50  max =   10.50  avg =    9.93\n     efficientnet_b0  min =   11.46  max =   11.79  avg =   11.60\n   efficientnetv2_b0  min =   18.61  max =   19.48  avg =   18.88\n        regnety_400m  min =   10.54  max =   12.44  avg =   10.86\n           blazeface  min =    1.96  max =    5.35  avg =    2.58\n           googlenet  min =   26.62  max =   32.59  avg =   29.96\n      googlenet_int8  min =   28.27  max =   32.80  avg =   30.01\n            resnet18  min =   15.52  max =   18.29  avg =   16.37\n       resnet18_int8  min =   23.33  max =   26.89  avg =   24.99\n             alexnet  min =   19.92  max =   22.75  avg =   21.06\n               vgg16  min =  101.18  max =  122.44  avg =  107.45\n          vgg16_int8  min =  164.69  max =  227.98  avg =  189.73\n            resnet50  min =   42.96  max =   59.26  avg =   50.83\n       resnet50_int8  min =   54.46  max =   66.72  avg =   61.37\n      squeezenet_ssd  min =   24.39  max =   31.19  avg =   27.69\n squeezenet_ssd_int8  min =   27.15  max =   41.55  avg =   33.68\n       mobilenet_ssd  min =   22.26  max =   26.89  avg =   23.95\n  mobilenet_ssd_int8  min =   21.18  max =   24.21  avg =   23.05\n      mobilenet_yolo  min =   52.65  max =   65.53  avg =   58.47\n  mobilenetv2_yolov3  min =   31.34  max =   45.15  avg =   34.63\n         yolov4-tiny  min =   40.55  max =   49.32  avg =   43.85\n           nanodet_m  min =   16.08  max =   19.51  avg =   17.58\n    yolo-fastest-1.1  min =    6.48  max =    7.33  avg =    6.98\n      yolo-fastestv2  min =    
4.96  max =   11.66  avg =    7.30\n  vision_transformer  min =  678.22  max =  815.73  avg =  729.16\n          FastestDet  min =    4.95  max =   10.65  avg =    6.88\n\n\norangepi@orangepi5:~/ncnn-master/benchmark$ ./benchncnn 10 4 1 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   10.91  max =   11.14  avg =   11.03\n     squeezenet_int8  min =   14.26  max =   14.55  avg =   14.30\n           mobilenet  min =   15.92  max =   16.26  avg =   16.11\n      mobilenet_int8  min =   14.71  max =   15.22  avg =   14.91\n        mobilenet_v2  min =   12.28  max =   12.49  avg =   12.37\n        mobilenet_v3  min =   11.31  max =   11.72  avg =   11.46\n          shufflenet  min =   10.10  max =   10.33  avg =   10.24\n       shufflenet_v2  min =    9.38  max =    9.70  avg =    9.55\n             mnasnet  min =   12.28  max =   12.80  avg =   12.44\n     proxylessnasnet  min =   16.54  max =   16.66  avg =   16.60\n     efficientnet_b0  min =   19.56  max =   20.66  avg =   19.86\n   efficientnetv2_b0  min =   34.06  max =   34.65  avg =   34.41\n        regnety_400m  min =   23.97  max =   24.69  avg =   24.20\n           blazeface  min =    3.39  max =    3.56  avg =    3.48\n           googlenet  min =   46.96  max =   47.90  avg =   47.56\n      googlenet_int8  min =   49.56  max =   50.23  avg =   49.79\n            resnet18  min =   28.44  max =   29.54  avg =   28.77\n       resnet18_int8  min =   41.32  max =   42.44  avg =   41.67\n             alexnet  min =   31.83  max =   32.77  avg =   32.32\n               vgg16  min =  170.32  max =  178.30  avg =  173.22\n          vgg16_int8  min =  282.55  max =  299.32  avg =  287.78\n            resnet50  min =   78.00  max =   81.57  avg =   78.79\n       resnet50_int8  min =   89.12  max =   92.31  avg =   90.92\n      squeezenet_ssd  min =   38.07  max =   39.07  avg =   38.59\n squeezenet_ssd_int8  min =   50.98  max =   52.56  avg =   
51.68\n       mobilenet_ssd  min =   38.79  max =   39.67  avg =   39.34\n  mobilenet_ssd_int8  min =   33.53  max =   35.26  avg =   34.66\n      mobilenet_yolo  min =   90.50  max =   92.32  avg =   90.99\n  mobilenetv2_yolov3  min =   51.38  max =   51.93  avg =   51.56\n         yolov4-tiny  min =   75.65  max =   76.80  avg =   76.17\n           nanodet_m  min =   21.33  max =   21.68  avg =   21.50\n    yolo-fastest-1.1  min =   11.18  max =   12.06  avg =   11.36\n      yolo-fastestv2  min =    9.87  max =   10.33  avg =   10.15\n  vision_transformer  min = 1475.77  max = 1477.97  avg = 1476.77\n          FastestDet  min =    9.39  max =    9.73  avg =    9.53\n\n\norangepi@orangepi5:~/ncnn-master/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    3.59  max =    3.70  avg =    3.66\n     squeezenet_int8  min =    4.32  max =    4.42  avg =    4.36\n           mobilenet  min =    5.50  max =    5.55  avg =    5.53\n      mobilenet_int8  min =    4.52  max =    4.60  avg =    4.56\n        mobilenet_v2  min =    4.50  max =    4.60  avg =    4.54\n        mobilenet_v3  min =    4.09  max =    4.28  avg =    4.15\n          shufflenet  min =    3.49  max =    3.58  avg =    3.51\n       shufflenet_v2  min =    2.91  max =    3.07  avg =    2.97\n             mnasnet  min =    4.18  max =    4.25  avg =    4.21\n     proxylessnasnet  min =    4.94  max =    5.00  avg =    4.97\n     efficientnet_b0  min =    7.50  max =    7.54  avg =    7.52\n   efficientnetv2_b0  min =   11.32  max =   11.41  avg =   11.37\n        regnety_400m  min =    7.92  max =    8.01  avg =    7.95\n           blazeface  min =    1.21  max =    1.31  avg =    1.24\n           googlenet  min =   15.03  max =   15.17  avg =   15.10\n      googlenet_int8  min =   15.48  max =   15.61  avg =   15.55\n            resnet18  min =    9.91  max =    9.97  avg =    9.93\n       resnet18_int8  min 
=   15.80  max =   16.00  avg =   15.89\n             alexnet  min =   12.35  max =   12.64  avg =   12.48\n               vgg16  min =   61.92  max =   65.62  avg =   62.93\n          vgg16_int8  min =  129.94  max =  131.65  avg =  130.65\n            resnet50  min =   27.41  max =   27.62  avg =   27.52\n       resnet50_int8  min =   33.01  max =   33.23  avg =   33.08\n      squeezenet_ssd  min =   13.92  max =   14.27  avg =   14.02\n squeezenet_ssd_int8  min =   18.04  max =   18.40  avg =   18.15\n       mobilenet_ssd  min =   13.69  max =   13.80  avg =   13.74\n  mobilenet_ssd_int8  min =   10.95  max =   11.10  avg =   11.02\n      mobilenet_yolo  min =   32.06  max =   32.30  avg =   32.17\n  mobilenetv2_yolov3  min =   19.27  max =   20.68  avg =   19.97\n         yolov4-tiny  min =   25.41  max =   29.51  avg =   27.76\n           nanodet_m  min =    6.68  max =    6.73  avg =    6.70\n    yolo-fastest-1.1  min =    3.77  max =    4.02  avg =    3.83\n      yolo-fastestv2  min =    3.41  max =    3.65  avg =    3.48\n  vision_transformer  min =  548.32  max =  654.71  avg =  579.48\n          FastestDet  min =    3.38  max =    3.46  avg =    3.42\n\n```\n\n### OrangePi5 Plus, Rockchip RK3588 (Quad Core A76 2.4GHz + Quad Core A55 1.8GHz)\n```\norangepi@orangepi5plus:~/ncnn$ ./benchncnn 8 4 2 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    5.55  max =    5.67  avg =    5.61\n     squeezenet_int8  min =    5.39  max =    5.76  avg =    5.60\n           mobilenet  min =    7.43  max =    7.50  avg =    7.47\n      mobilenet_int8  min =    6.91  max =    7.00  avg =    6.96\n        mobilenet_v2  min =    8.24  max =    8.47  avg =    8.33\n        mobilenet_v3  min =    6.63  max =    7.32  avg =    6.84\n          shufflenet  min =    4.10  max =    4.23  avg =    4.14\n       shufflenet_v2  min =    3.51  max =    3.61  avg =    3.56\n             mnasnet  min =    5.76  max =    
7.79  avg =    6.53\n     proxylessnasnet  min =    6.66  max =    7.19  avg =    6.79\n     efficientnet_b0  min =   10.32  max =   10.73  avg =   10.40\n   efficientnetv2_b0  min =   11.48  max =   11.78  avg =   11.61\n        regnety_400m  min =    9.73  max =    9.85  avg =    9.79\n           blazeface  min =    1.39  max =    1.62  avg =    1.46\n           googlenet  min =   21.48  max =   23.08  avg =   22.79\n      googlenet_int8  min =   20.82  max =   21.78  avg =   21.01\n            resnet18  min =    9.37  max =   10.05  avg =    9.50\n       resnet18_int8  min =   14.88  max =   19.64  avg =   15.90\n             alexnet  min =   24.74  max =   24.93  avg =   24.81\n               vgg16  min =   58.75  max =   62.44  avg =   59.52\n          vgg16_int8  min =   73.68  max =   75.89  avg =   74.14\n            resnet50  min =   44.88  max =   45.10  avg =   44.98\n       resnet50_int8  min =   35.54  max =   36.02  avg =   35.71\n      squeezenet_ssd  min =   12.07  max =   26.66  avg =   19.03\n squeezenet_ssd_int8  min =   21.95  max =   25.51  avg =   23.21\n       mobilenet_ssd  min =   12.62  max =   12.73  avg =   12.67\n  mobilenet_ssd_int8  min =   17.21  max =   17.68  avg =   17.44\n      mobilenet_yolo  min =   32.82  max =   32.98  avg =   32.91\n  mobilenetv2_yolov3  min =   18.67  max =   20.52  avg =   19.57\n         yolov4-tiny  min =   38.82  max =   40.84  avg =   39.82\n           nanodet_m  min =    9.05  max =    9.22  avg =    9.13\n    yolo-fastest-1.1  min =    4.67  max =    5.04  avg =    4.74\n      yolo-fastestv2  min =    4.27  max =    4.32  avg =    4.29\n  vision_transformer  min =  429.32  max =  431.02  avg =  430.20\n          FastestDet  min =    4.28  max =    4.72  avg =    4.36\n\n```\n\n### RDK X3 Module (Cortex-A53 1.5GHz x 4) aarch64\n```\nroot@ubuntu:/home/sunrise/ncnn-master/benchmark# ../build-aarch64-linux-gnu/benchmark/benchncnn 10 4 0 -1 1\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = 
-1\ncooling_down = 1\n          squeezenet  min =   49.83  max =   50.57  avg =   50.08\n     squeezenet_int8  min =   48.43  max =   49.18  avg =   48.67\n           mobilenet  min =   68.37  max =   69.09  avg =   68.63\n      mobilenet_int8  min =   58.19  max =   58.72  avg =   58.37\n        mobilenet_v2  min =   58.76  max =   60.62  avg =   59.20\n        mobilenet_v3  min =   49.75  max =   50.60  avg =   50.06\n          shufflenet  min =   37.17  max =   37.96  avg =   37.50\n       shufflenet_v2  min =   32.08  max =   32.42  avg =   32.22\n             mnasnet  min =   55.51  max =   57.02  avg =   55.90\n     proxylessnasnet  min =   68.15  max =   69.53  avg =   68.78\n     efficientnet_b0  min =   88.64  max =   90.16  avg =   89.43\n   efficientnetv2_b0  min =  102.45  max =  103.42  avg =  102.92\n        regnety_400m  min =   88.22  max =   89.09  avg =   88.62\n           blazeface  min =    9.78  max =   10.15  avg =    9.93\n           googlenet  min =  152.20  max =  153.92  avg =  153.28\n      googlenet_int8  min =  141.80  max =  143.30  avg =  142.48\n            resnet18  min =  116.70  max =  117.59  avg =  117.03\n       resnet18_int8  min =  104.42  max =  105.85  avg =  104.94\n             alexnet  min =   82.55  max =   83.23  avg =   82.82\n               vgg16  min =  590.22  max =  598.18  avg =  594.35\n          vgg16_int8  min =  504.56  max =  507.21  avg =  505.73\n            resnet50  min =  307.36  max =  308.68  avg =  308.03\n       resnet50_int8  min =  281.35  max =  283.87  avg =  282.30\n      squeezenet_ssd  min =  124.93  max =  126.51  avg =  125.51\n squeezenet_ssd_int8  min =  118.07  max =  118.89  avg =  118.29\n       mobilenet_ssd  min =  142.27  max =  142.57  avg =  142.44\n  mobilenet_ssd_int8  min =  116.51  max =  117.60  avg =  117.04\n      mobilenet_yolo  min =  314.64  max =  317.09  avg =  315.93\n  mobilenetv2_yolov3  min =  204.55  max =  205.30  avg =  204.93\n         yolov4-tiny  min =  
246.69  max =  249.64  avg =  247.95\n           nanodet_m  min =   77.73  max =   78.30  avg =   77.99\n    yolo-fastest-1.1  min =   46.29  max =   47.52  avg =   46.93\n      yolo-fastestv2  min =   36.55  max =   36.95  avg =   36.73\n  vision_transformer  min = 3372.85  max = 3409.14  avg = 3377.75\n          FastestDet  min =   38.23  max =   38.77  avg =   38.49\n\n```\n\n### NanoPi R2S, Rockchip RK3328 (Cortex-A53 1.3GHz x 4) Armbian focal (21.05.1) aarch64\n```\nroot@nanopi-r2s:~/ncnn/build/benchmark# ./benchncnn 8 4 0\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   62.20  max =   62.81  avg =   62.49\n     squeezenet_int8  min =   57.92  max =   71.46  avg =   59.76\n           mobilenet  min =   82.88  max =   89.36  avg =   84.52\n      mobilenet_int8  min =   57.16  max =   96.22  avg =   62.29\n        mobilenet_v2  min =   73.68  max =   75.92  avg =   74.17\n        mobilenet_v3  min =   59.57  max =   60.14  avg =   59.84\n          shufflenet  min =   52.34  max =   52.70  avg =   52.53\n       shufflenet_v2  min =   45.51  max =   45.92  avg =   45.73\n             mnasnet  min =   67.75  max =   83.15  avg =   69.82\n     proxylessnasnet  min =   81.70  max =   83.66  avg =   82.31\n     efficientnet_b0  min =  121.10  max =  123.22  avg =  121.55\n   efficientnetv2_b0  min =  138.93  max =  192.15  avg =  154.94\n        regnety_400m  min =   99.62  max =  116.29  avg =  101.97\n           blazeface  min =   18.80  max =   19.15  avg =   19.01\n           googlenet  min =  176.36  max =  202.84  avg =  181.86\n      googlenet_int8  min =  155.50  max =  190.50  avg =  161.20\n            resnet18  min =  165.79  max =  201.57  avg =  172.56\n       resnet18_int8  min =  122.24  max =  160.53  avg =  134.24\n             alexnet  min =  227.07  max =  238.09  avg =  232.19\n          vgg16_int8  min =  522.14  max =  551.75  avg =  531.68\n            resnet50  min =  378.30  max 
=  440.21  avg =  388.56\n       resnet50_int8  min =  315.76  max =  373.97  avg =  329.88\n      squeezenet_ssd  min =  175.37  max =  200.86  avg =  179.01\n squeezenet_ssd_int8  min =  134.71  max =  147.57  avg =  136.57\n       mobilenet_ssd  min =  174.43  max =  212.11  avg =  180.61\n  mobilenet_ssd_int8  min =  119.41  max =  153.75  avg =  124.21\n      mobilenet_yolo  min =  366.27  max =  422.67  avg =  383.65\n  mobilenetv2_yolov3  min =  238.56  max =  281.97  avg =  247.56\n         yolov4-tiny  min =  311.45  max =  333.32  avg =  316.79\n           nanodet_m  min =  114.15  max =  122.39  avg =  115.44\n\nroot@nanopi-r2s:~/ncnn/build/benchmark# ./benchncnn 8 2 0\nloop_count = 8\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   89.02  max =   90.52  avg =   89.35\n     squeezenet_int8  min =   81.19  max =   81.90  avg =   81.42\n           mobilenet  min =  131.47  max =  134.39  avg =  132.34\n      mobilenet_int8  min =  102.20  max =  103.03  avg =  102.66\n        mobilenet_v2  min =  102.40  max =  108.12  avg =  103.91\n        mobilenet_v3  min =   89.17  max =   90.10  avg =   89.53\n          shufflenet  min =   65.74  max =   68.86  avg =   66.50\n       shufflenet_v2  min =   62.83  max =   64.41  avg =   63.25\n             mnasnet  min =   98.01  max =   98.24  avg =   98.14\n     proxylessnasnet  min =  121.10  max =  123.55  avg =  121.80\n     efficientnet_b0  min =  187.79  max =  188.41  avg =  188.08\n   efficientnetv2_b0  min =  211.96  max =  213.99  avg =  212.74\n        regnety_400m  min =  124.98  max =  125.49  avg =  125.28\n           blazeface  min =   24.91  max =   25.14  avg =   25.00\n           googlenet  min =  278.47  max =  283.24  avg =  280.79\n      googlenet_int8  min =  243.81  max =  247.82  avg =  245.30\n            resnet18  min =  257.46  max =  259.29  avg =  258.29\n       resnet18_int8  min =  187.18  max =  188.74  avg =  187.70\n             alexnet 
 min =  384.52  max =  387.07  avg =  385.84\n          vgg16_int8  min =  897.26  max =  901.68  avg =  899.19\n            resnet50  min =  618.85  max =  623.92  avg =  620.85\n       resnet50_int8  min =  512.33  max =  514.93  avg =  513.64\n      squeezenet_ssd  min =  211.21  max =  218.71  avg =  213.02\n squeezenet_ssd_int8  min =  193.32  max =  193.97  avg =  193.70\n       mobilenet_ssd  min =  271.11  max =  275.58  avg =  272.06\n  mobilenet_ssd_int8  min =  208.80  max =  209.59  avg =  209.05\n      mobilenet_yolo  min =  570.55  max =  575.98  avg =  572.73\n  mobilenetv2_yolov3  min =  329.04  max =  353.84  avg =  340.42\n         yolov4-tiny  min =  435.16  max =  463.68  avg =  457.69\n           nanodet_m  min =  155.70  max =  159.13  avg =  156.50\n```\n\n### EAIDK 310, Rockchip RK3228H (Cortex-A53 1.3GHz x 4) fedora-28 aarch64\n```\n[openailab@MiWiFi-R1D-srv benchmark]$ ./benchncnn 8 4 0 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   68.97  max =   71.42  avg =   69.65\n     squeezenet_int8  min =   58.47  max =   59.58  avg =   58.77\n           mobilenet  min =   90.87  max =  100.18  avg =   92.48\n      mobilenet_int8  min =   59.46  max =   63.02  avg =   60.01\n        mobilenet_v2  min =   82.92  max =  112.01  avg =   88.10\n        mobilenet_v3  min =   66.65  max =   69.57  avg =   67.27\n          shufflenet  min =   48.22  max =   48.49  avg =   48.34\n       shufflenet_v2  min =   48.52  max =   52.88  avg =   49.17\n             mnasnet  min =   75.63  max =   79.83  avg =   76.43\n     proxylessnasnet  min =   84.73  max =   86.69  avg =   85.16\n     efficientnet_b0  min =  125.69  max =  129.00  avg =  126.38\n   efficientnetv2_b0  min =  144.44  max =  149.01  avg =  145.33\n        regnety_400m  min =   99.69  max =  101.23  avg =  100.38\n           blazeface  min =   15.84  max =   16.24  avg =   16.03\n           googlenet  min =  194.64  max =  
199.29  avg =  196.07\n      googlenet_int8  min =  158.54  max =  165.64  avg =  160.25\n            resnet18  min =  200.65  max =  221.60  avg =  204.30\n       resnet18_int8  min =  122.69  max =  126.57  avg =  123.54\n             alexnet  min =  175.54  max =  200.91  avg =  181.38\n            resnet50  min =  428.75  max =  466.51  avg =  439.67\n       resnet50_int8  min =  324.95  max =  347.47  avg =  329.74\n      squeezenet_ssd  min =  199.86  max =  207.51  avg =  201.99\n squeezenet_ssd_int8  min =  150.35  max =  176.92  avg =  154.60\n       mobilenet_ssd  min =  186.50  max =  189.92  avg =  188.09\n  mobilenet_ssd_int8  min =  123.55  max =  127.17  avg =  124.63\n      mobilenet_yolo  min =  393.83  max =  414.09  avg =  398.57\n  mobilenetv2_yolov3  min =  263.49  max =  273.11  avg =  266.11\n         yolov4-tiny  min =  342.33  max =  363.69  avg =  346.34\n           nanodet_m  min =  119.66  max =  127.29  avg =  121.26\n    yolo-fastest-1.1  min =   61.87  max =   90.26  avg =   65.77\n      yolo-fastestv2  min =   48.48  max =   50.82  avg =   48.93\n\n[openailab@MiWiFi-R1D-srv benchmark]$ ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  152.15  max =  152.67  avg =  152.43\n     squeezenet_int8  min =  143.22  max =  144.24  avg =  143.61\n           mobilenet  min =  237.77  max =  239.69  avg =  238.47\n      mobilenet_int8  min =  199.91  max =  201.35  avg =  200.50\n        mobilenet_v2  min =  169.67  max =  170.18  avg =  169.93\n        mobilenet_v3  min =  150.06  max =  151.17  avg =  150.78\n          shufflenet  min =   91.78  max =   92.38  avg =   92.06\n       shufflenet_v2  min =  100.86  max =  101.75  avg =  101.50\n             mnasnet  min =  165.10  max =  166.74  avg =  166.24\n     proxylessnasnet  min =  218.42  max =  220.55  avg =  219.12\n     efficientnet_b0  min =  348.00  max =  349.03  avg =  348.49\n   efficientnetv2_b0 
 min =  404.06  max =  406.16  avg =  405.00\n        regnety_400m  min =  209.48  max =  211.36  avg =  210.44\n           blazeface  min =   31.31  max =   32.61  avg =   32.00\n           googlenet  min =  510.38  max =  512.43  avg =  511.25\n      googlenet_int8  min =  454.38  max =  456.19  avg =  455.02\n            resnet18  min =  407.78  max =  409.45  avg =  408.34\n       resnet18_int8  min =  357.01  max =  360.72  avg =  358.74\n             alexnet  min =  504.12  max =  506.74  avg =  505.08\n            resnet50  min = 1115.42  max = 1121.91  avg = 1118.67\n       resnet50_int8  min =  973.38  max =  976.26  avg =  975.21\n      squeezenet_ssd  min =  361.52  max =  363.69  avg =  362.38\n squeezenet_ssd_int8  min =  333.81  max =  337.16  avg =  335.24\n       mobilenet_ssd  min =  477.43  max =  478.36  avg =  477.82\n  mobilenet_ssd_int8  min =  409.33  max =  409.67  avg =  409.52\n      mobilenet_yolo  min = 1048.79  max = 1057.72  avg = 1053.80\n  mobilenetv2_yolov3  min =  567.04  max =  571.44  avg =  569.04\n         yolov4-tiny  min =  788.40  max =  790.74  avg =  789.12\n           nanodet_m  min =  253.68  max =  254.59  avg =  254.16\n    yolo-fastest-1.1  min =  102.44  max =  103.11  avg =  102.67\n      yolo-fastestv2  min =   82.19  max =   82.43  avg =   82.35\n```\n\n### NVIDIA Jetson Orin Nano\n```\norin@nano:~/ncnn/benchmark$ ./benchncnn 8 6 0 0 1\n[0 NVIDIA Tegra Orin (nvgpu)]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 NVIDIA Tegra Orin (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra Orin (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra Orin (nvgpu)]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 NVIDIA Tegra Orin (nvgpu)]  fp16-matrix-16_8_8/16_8_16/16_16_16=1/1/1\nloop_count = 8\nnum_threads = 6\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =    5.31  max =    5.95  avg =    5.44\n     squeezenet_int8  min =    5.13  max =    6.24  avg =    5.57\n       
    mobilenet  min =    2.98  max =    5.52  avg =    3.66\n      mobilenet_int8  min =    5.97  max =    7.76  avg =    6.98\n        mobilenet_v2  min =    6.73  max =    6.98  avg =    6.91\n        mobilenet_v3  min =    8.58  max =    8.77  avg =    8.71\n          shufflenet  min =    7.33  max =    7.43  avg =    7.39\n       shufflenet_v2  min =    7.59  max =    8.46  avg =    8.27\n             mnasnet  min =    4.78  max =    6.81  avg =    5.41\n     proxylessnasnet  min =    7.39  max =    7.65  avg =    7.52\n     efficientnet_b0  min =   10.81  max =   15.28  avg =   12.27\n   efficientnetv2_b0  min =   46.58  max =   48.56  avg =   47.70\n        regnety_400m  min =    9.86  max =   10.46  avg =   10.04\n           blazeface  min =    3.98  max =    4.66  avg =    4.31\n           googlenet  min =   10.01  max =   14.44  avg =   11.48\n      googlenet_int8  min =   18.07  max =   19.55  avg =   18.65\n            resnet18  min =    6.52  max =    9.73  avg =    8.26\n       resnet18_int8  min =   13.28  max =   20.58  avg =   14.96\n             alexnet  min =    8.71  max =    9.05  avg =    8.84\n               vgg16  min =   19.28  max =   19.49  avg =   19.35\n          vgg16_int8  min =   98.14  max =  100.92  avg =   99.76\n            resnet50  min =    9.25  max =    9.37  avg =    9.31\n       resnet50_int8  min =   31.16  max =   34.44  avg =   32.59\n      squeezenet_ssd  min =   13.60  max =   18.96  avg =   16.68\n squeezenet_ssd_int8  min =   17.81  max =   19.83  avg =   18.75\n       mobilenet_ssd  min =   11.88  max =   13.86  avg =   13.27\n  mobilenet_ssd_int8  min =   14.05  max =   21.16  avg =   15.64\n      mobilenet_yolo  min =   14.18  max =   14.41  avg =   14.26\n  mobilenetv2_yolov3  min =   16.65  max =   18.78  avg =   18.06\n         yolov4-tiny  min =   25.60  max =   26.56  avg =   25.92\n           nanodet_m  min =   15.71  max =   19.89  avg =   19.03\n    yolo-fastest-1.1  min =    8.72  max =    9.18  avg =    
8.96\n      yolo-fastestv2  min =    7.97  max =    8.10  avg =    8.04\n  vision_transformer  min =  821.34  max =  825.91  avg =  823.26\n          FastestDet  min =    7.72  max =    8.15  avg =    7.81\norin@nano:~/ncnn/benchmark$ ./benchncnn 8 1 0 0 1\n[0 NVIDIA Tegra Orin (nvgpu)]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 NVIDIA Tegra Orin (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra Orin (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra Orin (nvgpu)]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 NVIDIA Tegra Orin (nvgpu)]  fp16-matrix-16_8_8/16_8_16/16_16_16=1/1/1\nloop_count = 8\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =    5.05  max =    5.23  avg =    5.09\n     squeezenet_int8  min =   15.93  max =   16.09  avg =   16.00\n           mobilenet  min =    2.97  max =    5.49  avg =    3.84\n      mobilenet_int8  min =   23.27  max =   23.38  avg =   23.33\n        mobilenet_v2  min =    3.61  max =    4.01  avg =    3.83\n        mobilenet_v3  min =    6.12  max =    8.36  avg =    6.67\n          shufflenet  min =    4.07  max =    7.25  avg =    6.22\n       shufflenet_v2  min =    8.49  max =    8.82  avg =    8.67\n             mnasnet  min =    3.70  max =    8.23  avg =    5.37\n     proxylessnasnet  min =    6.36  max =    9.16  avg =    7.52\n     efficientnet_b0  min =   10.55  max =   10.81  avg =   10.65\n   efficientnetv2_b0  min =   28.22  max =   28.62  avg =   28.54\n        regnety_400m  min =    7.22  max =   10.04  avg =    8.50\n           blazeface  min =    3.70  max =    3.86  avg =    3.76\n           googlenet  min =    7.18  max =    9.76  avg =    8.21\n      googlenet_int8  min =   63.19  max =   63.54  avg =   63.32\n            resnet18  min =    4.67  max =    4.73  avg =    4.70\n       resnet18_int8  min =   50.51  max =   50.81  avg =   50.65\n             alexnet  min =    8.56  max =   10.64  avg =    9.02\n               vgg16 
 min =   19.24  max =   19.50  avg =   19.31\n          vgg16_int8  min =  411.02  max =  412.40  avg =  411.60\n            resnet50  min =    9.14  max =    9.52  avg =    9.41\n       resnet50_int8  min =  112.04  max =  112.43  avg =  112.25\n      squeezenet_ssd  min =   13.23  max =   13.79  avg =   13.52\n squeezenet_ssd_int8  min =   46.52  max =   46.98  avg =   46.77\n       mobilenet_ssd  min =    8.89  max =   12.51  avg =    9.95\n  mobilenet_ssd_int8  min =   47.66  max =   48.73  avg =   48.13\n      mobilenet_yolo  min =    9.68  max =    9.75  avg =    9.70\n  mobilenetv2_yolov3  min =   15.84  max =   17.54  avg =   16.83\n         yolov4-tiny  min =   23.32  max =   25.49  avg =   24.56\n           nanodet_m  min =   13.59  max =   19.53  avg =   15.85\n    yolo-fastest-1.1  min =    7.68  max =   11.32  avg =    8.20\n      yolo-fastestv2  min =    7.75  max =    7.84  avg =    7.78\n  vision_transformer  min =  822.27  max =  829.73  avg =  825.74\n          FastestDet  min =    7.51  max =    8.05  avg =    7.68\n          \norin@nano:~/ncnn/benchmark$ ./benchncnn 8 6 0 -1 1\nloop_count = 8\nnum_threads = 6\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    5.07  max =    6.99  avg =    5.69\n     squeezenet_int8  min =    5.08  max =    5.79  avg =    5.42\n           mobilenet  min =    6.96  max =    8.20  avg =    7.45\n      mobilenet_int8  min =    5.91  max =    7.33  avg =    6.37\n        mobilenet_v2  min =    5.86  max =    7.55  avg =    6.51\n        mobilenet_v3  min =    5.60  max =    7.22  avg =    6.14\n          shufflenet  min =    5.20  max =    5.79  avg =    5.44\n       shufflenet_v2  min =    4.56  max =    5.90  avg =    4.86\n             mnasnet  min =    5.43  max =    6.44  avg =    5.83\n     proxylessnasnet  min =    5.92  max =    8.70  avg =    6.83\n     efficientnet_b0  min =   10.09  max =   11.57  avg =   10.65\n   efficientnetv2_b0  min =   12.79  max =   15.96  avg =   
14.12\n        regnety_400m  min =   14.04  max =   21.23  avg =   15.88\n           blazeface  min =    1.76  max =    1.90  avg =    1.81\n           googlenet  min =   19.45  max =   25.43  avg =   21.21\n      googlenet_int8  min =   17.67  max =   18.59  avg =   18.20\n            resnet18  min =   12.26  max =   19.47  avg =   15.13\n       resnet18_int8  min =   13.02  max =   14.78  avg =   13.86\n             alexnet  min =   12.27  max =   19.18  avg =   15.02\n               vgg16  min =   59.43  max =   89.43  avg =   65.11\n          vgg16_int8  min =   97.71  max =  141.28  avg =  108.00\n            resnet50  min =   38.69  max =   40.67  avg =   39.26\n       resnet50_int8  min =   28.67  max =   31.63  avg =   29.93\n      squeezenet_ssd  min =   14.52  max =   26.92  avg =   17.89\n squeezenet_ssd_int8  min =   16.61  max =   19.27  avg =   17.82\n       mobilenet_ssd  min =   16.61  max =   22.65  avg =   17.89\n  mobilenet_ssd_int8  min =   13.22  max =   14.83  avg =   14.04\n      mobilenet_yolo  min =   40.10  max =   44.28  avg =   41.48\n  mobilenetv2_yolov3  min =   21.48  max =   22.83  avg =   22.01\n         yolov4-tiny  min =   33.30  max =   37.31  avg =   34.59\n           nanodet_m  min =   10.80  max =   12.62  avg =   11.54\n    yolo-fastest-1.1  min =    5.51  max =    6.03  avg =    5.75\n      yolo-fastestv2  min =    4.98  max =    6.35  avg =    5.44\n  vision_transformer  min =  610.40  max =  681.89  avg =  628.84\n          FastestDet  min =    4.82  max =    6.19  avg =    5.32\norin@nano:~/ncnn/benchmark$ ./benchncnn 8 1 0 -1 1\nloop_count = 8\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   15.94  max =   16.23  avg =   16.04\n     squeezenet_int8  min =   15.91  max =   16.09  avg =   15.98\n           mobilenet  min =   28.77  max =   28.91  avg =   28.83\n      mobilenet_int8  min =   23.29  max =   23.63  avg =   23.46\n        mobilenet_v2  min =   19.32  max =   
19.43  avg =   19.37\n        mobilenet_v3  min =   16.57  max =   16.65  avg =   16.61\n          shufflenet  min =   10.39  max =   10.48  avg =   10.44\n       shufflenet_v2  min =   10.61  max =   10.69  avg =   10.65\n             mnasnet  min =   18.61  max =   18.69  avg =   18.65\n     proxylessnasnet  min =   21.97  max =   22.17  avg =   22.05\n     efficientnet_b0  min =   36.73  max =   36.89  avg =   36.83\n   efficientnetv2_b0  min =   41.72  max =   41.97  avg =   41.83\n        regnety_400m  min =   25.71  max =   26.03  avg =   25.85\n           blazeface  min =    3.59  max =    3.63  avg =    3.60\n           googlenet  min =   66.85  max =   67.38  avg =   67.12\n      googlenet_int8  min =   63.65  max =   63.85  avg =   63.74\n            resnet18  min =   48.49  max =   49.21  avg =   48.83\n       resnet18_int8  min =   50.82  max =   51.16  avg =   50.92\n             alexnet  min =   57.67  max =   58.24  avg =   58.03\n               vgg16  min =  280.03  max =  281.34  avg =  280.77\n          vgg16_int8  min =  413.51  max =  414.67  avg =  414.08\n            resnet50  min =  138.19  max =  138.94  avg =  138.48\n       resnet50_int8  min =  112.53  max =  112.86  avg =  112.68\n      squeezenet_ssd  min =   46.26  max =   46.46  avg =   46.37\n squeezenet_ssd_int8  min =   47.56  max =   48.33  avg =   47.85\n       mobilenet_ssd  min =   60.51  max =   60.81  avg =   60.68\n  mobilenet_ssd_int8  min =   47.47  max =   47.76  avg =   47.58\n      mobilenet_yolo  min =  136.20  max =  136.54  avg =  136.37\n  mobilenetv2_yolov3  min =   69.80  max =   70.04  avg =   69.93\n         yolov4-tiny  min =   87.71  max =   88.63  avg =   88.12\n           nanodet_m  min =   25.73  max =   26.06  avg =   25.85\n    yolo-fastest-1.1  min =   10.25  max =   10.35  avg =   10.29\n      yolo-fastestv2  min =    9.25  max =    9.38  avg =    9.33\n  vision_transformer  min = 2282.07  max = 2690.34  avg = 2481.94\n          FastestDet  min =    
9.80  max =    9.88  avg =    9.84\n```\n\n### NVIDIA Jetson Nano\n```\n[0 NVIDIA Tegra X1 (nvgpu)]  queueC=0[16]  queueG=0[16]  queueT=0[16]\n[0 NVIDIA Tegra X1 (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra X1 (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra X1 (nvgpu)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   12.15  max =   26.48  avg =   18.11\n     squeezenet_int8  min =   27.60  max =   42.50  avg =   29.89\n           mobilenet  min =   16.07  max =   16.10  avg =   16.09\n      mobilenet_int8  min =   30.65  max =   32.15  avg =   31.07\n        mobilenet_v2  min =   12.87  max =   13.15  avg =   12.99\n        mobilenet_v3  min =   13.32  max =   16.65  avg =   14.57\n          shufflenet  min =   14.21  max =   14.34  avg =   14.29\n       shufflenet_v2  min =   13.03  max =   21.97  avg =   19.02\n             mnasnet  min =   13.33  max =   13.64  avg =   13.49\n     proxylessnasnet  min =   14.65  max =   14.91  avg =   14.76\n     efficientnet_b0  min =   21.26  max =   21.41  avg =   21.35\n   efficientnetv2_b0  min =   54.66  max =   60.81  avg =   57.16\n        regnety_400m  min =   17.91  max =   18.08  avg =   18.01\n           blazeface  min =    6.87  max =    7.03  avg =    6.94\n           googlenet  min =   43.30  max =   43.54  avg =   43.43\n      googlenet_int8  min =   80.07  max =   84.28  avg =   81.10\n            resnet18  min =   43.89  max =   44.06  avg =   43.98\n       resnet18_int8  min =   60.70  max =   63.43  avg =   61.60\n             alexnet  min =   74.21  max =   75.20  avg =   74.45\n               vgg16  min =  310.39  max =  310.65  avg =  310.52\n          vgg16_int8  min =  293.15  max =  297.28  avg =  294.93\n            resnet50  min =   93.03  max =   93.22  avg =   93.12\n       resnet50_int8  min =  158.54  max =  161.25  avg =  159.56\n      
squeezenet_ssd  min =   55.88  max =   57.43  avg =   56.46\n squeezenet_ssd_int8  min =   72.42  max =   73.25  avg =   72.73\n       mobilenet_ssd  min =   35.38  max =   37.57  avg =   36.63\n  mobilenet_ssd_int8  min =   62.92  max =   64.97  avg =   63.63\n      mobilenet_yolo  min =   76.56  max =   80.44  avg =   78.05\n  mobilenetv2_yolov3  min =   46.35  max =   48.14  avg =   47.26\n         yolov4-tiny  min =   95.38  max =   97.55  avg =   96.45\n           nanodet_m  min =   22.82  max =   26.01  avg =   24.48\n    yolo-fastest-1.1  min =   20.23  max =   25.51  avg =   21.52\n      yolo-fastestv2  min =   20.67  max =   20.82  avg =   20.75\n\n[0 NVIDIA Tegra X1 (nvgpu)]  queueC=0[16]  queueG=0[16]  queueT=0[16]\n[0 NVIDIA Tegra X1 (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra X1 (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra X1 (nvgpu)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 8\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =   12.00  max =   15.41  avg =   13.55\n     squeezenet_int8  min =   78.76  max =   79.14  avg =   78.91\n           mobilenet  min =   16.03  max =   16.25  avg =   16.15\n      mobilenet_int8  min =  107.58  max =  107.68  avg =  107.61\n        mobilenet_v2  min =   12.84  max =   13.13  avg =   12.99\n        mobilenet_v3  min =   13.29  max =   16.64  avg =   14.38\n          shufflenet  min =   14.23  max =   14.54  avg =   14.34\n       shufflenet_v2  min =   12.94  max =   13.21  avg =   13.02\n             mnasnet  min =   13.42  max =   13.66  avg =   13.53\n     proxylessnasnet  min =   14.64  max =   14.94  avg =   14.76\n     efficientnet_b0  min =   21.28  max =   21.51  avg =   21.36\n   efficientnetv2_b0  min =   74.32  max =   78.50  avg =   77.79\n        regnety_400m  min =   17.94  max =   18.26  avg =   18.07\n           blazeface  min =    6.83  max =    6.94  avg =    6.89\n           googlenet  
min =   43.45  max =   43.63  avg =   43.52\n      googlenet_int8  min =  255.68  max =  256.33  avg =  255.92\n            resnet18  min =   43.96  max =   44.06  avg =   44.01\n       resnet18_int8  min =  192.01  max =  192.64  avg =  192.33\n             alexnet  min =   74.04  max =   74.23  avg =   74.14\n               vgg16  min =  310.32  max =  310.64  avg =  310.44\n          vgg16_int8  min = 1003.05  max = 1004.27  avg = 1003.66\n            resnet50  min =   93.05  max =   93.34  avg =   93.21\n       resnet50_int8  min =  516.27  max =  517.12  avg =  516.69\n      squeezenet_ssd  min =   56.67  max =   56.86  avg =   56.73\n squeezenet_ssd_int8  min =  182.96  max =  184.26  avg =  183.71\n       mobilenet_ssd  min =   35.61  max =   35.70  avg =   35.65\n  mobilenet_ssd_int8  min =  217.02  max =  217.50  avg =  217.23\n      mobilenet_yolo  min =   78.10  max =   78.36  avg =   78.20\n  mobilenetv2_yolov3  min =   49.86  max =   57.83  avg =   53.18\n         yolov4-tiny  min =   96.76  max =   96.86  avg =   96.82\n           nanodet_m  min =   25.26  max =   25.36  avg =   25.31\n    yolo-fastest-1.1  min =   21.55  max =   24.22  avg =   23.78\n      yolo-fastestv2  min =   20.80  max =   21.01  avg =   20.90\n\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   30.03  max =   31.41  avg =   30.59\n     squeezenet_int8  min =   27.32  max =   27.76  avg =   27.50\n           mobilenet  min =   41.74  max =   42.57  avg =   42.05\n      mobilenet_int8  min =   30.48  max =   31.57  avg =   30.85\n        mobilenet_v2  min =   33.49  max =   34.18  avg =   33.83\n        mobilenet_v3  min =   30.59  max =   30.96  avg =   30.79\n          shufflenet  min =   21.07  max =   31.68  avg =   22.53\n       shufflenet_v2  min =   19.55  max =   20.01  avg =   19.71\n             mnasnet  min =   31.70  max =   32.26  avg =   31.93\n     proxylessnasnet  min =   36.90  max =   38.55  avg =   
37.27\n     efficientnet_b0  min =   68.42  max =   77.60  avg =   70.60\n   efficientnetv2_b0  min =   73.72  max =   81.05  avg =   75.31\n        regnety_400m  min =   56.67  max =   66.82  avg =   58.24\n           blazeface  min =    6.55  max =    6.96  avg =    6.74\n           googlenet  min =   92.74  max =   94.22  avg =   93.12\n      googlenet_int8  min =   80.86  max =   87.28  avg =   82.41\n            resnet18  min =   83.10  max =   84.30  avg =   83.44\n       resnet18_int8  min =   59.40  max =   65.86  avg =   60.70\n             alexnet  min =   89.21  max =   92.45  avg =   89.98\n               vgg16  min =  445.72  max =  451.09  avg =  447.39\n          vgg16_int8  min =  292.81  max =  295.55  avg =  294.34\n            resnet50  min =  203.42  max =  204.45  avg =  204.08\n       resnet50_int8  min =  157.87  max =  160.30  avg =  158.67\n      squeezenet_ssd  min =   85.60  max =   87.24  avg =   86.18\n squeezenet_ssd_int8  min =   73.10  max =   85.64  avg =   74.94\n       mobilenet_ssd  min =   86.75  max =   96.51  avg =   88.49\n  mobilenet_ssd_int8  min =   63.40  max =   71.57  avg =   64.97\n      mobilenet_yolo  min =  193.84  max =  195.24  avg =  194.62\n  mobilenetv2_yolov3  min =  115.80  max =  117.27  avg =  116.27\n         yolov4-tiny  min =  156.30  max =  158.26  avg =  156.81\n           nanodet_m  min =   46.64  max =   47.97  avg =   47.12\n    yolo-fastest-1.1  min =   25.78  max =   27.86  avg =   26.29\n      yolo-fastestv2  min =   20.54  max =   30.73  avg =   22.18\n\nloop_count = 8\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   85.91  max =   86.86  avg =   86.14\n     squeezenet_int8  min =   77.57  max =   78.10  avg =   77.69\n           mobilenet  min =  137.43  max =  138.03  avg =  137.63\n      mobilenet_int8  min =  108.06  max =  108.21  avg =  108.13\n        mobilenet_v2  min =   93.81  max =   94.70  avg =   93.99\n        mobilenet_v3  min =   
81.77  max =   82.49  avg =   81.99\n          shufflenet  min =   47.84  max =   48.46  avg =   48.17\n       shufflenet_v2  min =   47.93  max =   48.23  avg =   48.09\n             mnasnet  min =   91.73  max =   92.55  avg =   91.98\n     proxylessnasnet  min =  115.41  max =  115.75  avg =  115.56\n     efficientnet_b0  min =  225.64  max =  226.21  avg =  225.94\n   efficientnetv2_b0  min =  239.71  max =  240.20  avg =  239.89\n        regnety_400m  min =  118.46  max =  118.84  avg =  118.61\n           blazeface  min =   15.58  max =   17.14  avg =   16.21\n           googlenet  min =  286.85  max =  287.51  avg =  287.11\n      googlenet_int8  min =  256.44  max =  256.74  avg =  256.53\n            resnet18  min =  221.27  max =  221.93  avg =  221.60\n       resnet18_int8  min =  189.95  max =  191.34  avg =  190.74\n             alexnet  min =  284.30  max =  285.40  avg =  284.87\n               vgg16  min = 1241.51  max = 1244.53  avg = 1242.90\n          vgg16_int8  min = 1003.92  max = 1004.47  avg = 1004.29\n            resnet50  min =  624.43  max =  625.34  avg =  624.84\n       resnet50_int8  min =  516.64  max =  517.26  avg =  516.99\n      squeezenet_ssd  min =  190.21  max =  191.35  avg =  190.71\n squeezenet_ssd_int8  min =  182.97  max =  184.19  avg =  183.38\n       mobilenet_ssd  min =  275.60  max =  276.17  avg =  275.90\n  mobilenet_ssd_int8  min =  216.67  max =  217.58  avg =  216.94\n      mobilenet_yolo  min =  616.16  max =  617.45  avg =  616.71\n  mobilenetv2_yolov3  min =  324.88  max =  325.73  avg =  325.19\n         yolov4-tiny  min =  421.01  max =  423.52  avg =  422.14\n           nanodet_m  min =  117.39  max =  117.75  avg =  117.54\n    yolo-fastest-1.1  min =   54.55  max =   55.61  avg =   54.87\n      yolo-fastestv2  min =   44.40  max =   44.78  avg =   44.57\n```\n\n### NVIDIA Jetson TX2 NX (NV-Denver2 2.0GHz x 2 + Cortex-A57 2.0GHz x 4 + 256-core NVIDIA Pascal iGPU)\n```\nfan@ubuntu:~/ncnn/benchmark$ 
../build/benchmark/benchncnn 10 $(nproc) 0 0\n[0 NVIDIA Tegra X2 (nvgpu)]  queueC=0[16]  queueG=0[16]  queueT=0[16]\n[0 NVIDIA Tegra X2 (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra X2 (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra X2 (nvgpu)]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 NVIDIA Tegra X2 (nvgpu)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 10\nnum_threads = 6\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =    4.84  max =    6.12  avg =    5.33\n     squeezenet_int8  min =   23.14  max =  148.62  avg =   52.65\n           mobilenet  min =    7.23  max =    7.57  avg =    7.40\n      mobilenet_int8  min =   19.69  max =  101.50  avg =   44.15\n        mobilenet_v2  min =    6.65  max =    6.86  avg =    6.76\n        mobilenet_v3  min =    7.22  max =    8.34  avg =    8.01\n          shufflenet  min =    6.14  max =    6.73  avg =    6.51\n       shufflenet_v2  min =    5.33  max =    5.43  avg =    5.39\n             mnasnet  min =    6.98  max =    7.47  avg =    7.16\n     proxylessnasnet  min =    6.90  max =    7.52  avg =    7.09\n     efficientnet_b0  min =   11.42  max =   11.89  avg =   11.67\n   efficientnetv2_b0  min =   26.48  max =   51.57  avg =   36.25\n        regnety_400m  min =    8.94  max =    9.45  avg =    9.13\n           blazeface  min =    2.08  max =    3.21  avg =    2.42\n           googlenet  min =   15.33  max =   15.78  avg =   15.53\n      googlenet_int8  min =   64.02  max =  158.22  avg =   79.32\n            resnet18  min =   12.25  max =   13.28  avg =   12.78\n       resnet18_int8  min =   41.89  max =  156.59  avg =   57.07\n             alexnet  min =   20.15  max =   20.51  avg =   20.32\n               vgg16  min =   62.45  max =   64.63  avg =   63.06\n          vgg16_int8  min =  198.24  max =  271.71  avg =  217.63\n            resnet50  min =   30.05  max =   31.11  avg =   30.39\n       resnet50_int8  min =  129.03  
max =  205.33  avg =  154.72\n      squeezenet_ssd  min =   18.48  max =   22.90  avg =   20.26\n squeezenet_ssd_int8  min =   48.18  max =   71.20  avg =   60.89\n       mobilenet_ssd  min =   15.56  max =   15.76  avg =   15.67\n  mobilenet_ssd_int8  min =   55.10  max =  114.34  avg =   67.41\n      mobilenet_yolo  min =   28.75  max =   32.54  avg =   30.30\n  mobilenetv2_yolov3  min =   26.15  max =   32.36  avg =   29.57\n         yolov4-tiny  min =   23.08  max =   37.19  avg =   25.43\n           nanodet_m  min =   15.81  max =   19.99  avg =   18.10\n    yolo-fastest-1.1  min =    7.35  max =   11.26  avg =    8.69\n      yolo-fastestv2  min =    6.16  max =    6.61  avg =    6.31\n  vision_transformer  min = 1301.45  max = 1356.58  avg = 1321.51\n          FastestDet  min =    5.64  max =    6.60  avg =    5.90\nfan@ubuntu:~/ncnn/benchmark$ ../build/benchmark/benchncnn 10 1 0 0\n[0 NVIDIA Tegra X2 (nvgpu)]  queueC=0[16]  queueG=0[16]  queueT=0[16]\n[0 NVIDIA Tegra X2 (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra X2 (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra X2 (nvgpu)]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 NVIDIA Tegra X2 (nvgpu)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =    5.10  max =    6.33  avg =    5.51\n     squeezenet_int8  min =   56.36  max =   59.23  avg =   57.79\n           mobilenet  min =    6.61  max =    9.93  avg =    7.27\n      mobilenet_int8  min =   95.73  max =  107.69  avg =   99.35\n        mobilenet_v2  min =    6.66  max =    9.87  avg =    7.22\n        mobilenet_v3  min =    7.20  max =    8.77  avg =    7.61\n          shufflenet  min =    5.87  max =    6.13  avg =    5.97\n       shufflenet_v2  min =    5.63  max =    8.24  avg =    6.10\n             mnasnet  min =    6.55  max =    9.05  avg =    7.10\n     proxylessnasnet  min =    7.29  max =    7.86  
avg =    7.50\n     efficientnet_b0  min =   11.22  max =   12.13  avg =   11.49\n   efficientnetv2_b0  min =   20.21  max =   24.55  avg =   21.42\n        regnety_400m  min =    8.94  max =   10.77  avg =    9.37\n           blazeface  min =    2.30  max =    2.45  avg =    2.35\n           googlenet  min =   15.48  max =   17.88  avg =   16.32\n      googlenet_int8  min =  197.08  max =  205.18  avg =  200.93\n            resnet18  min =   12.69  max =   13.38  avg =   13.01\n       resnet18_int8  min =  147.42  max =  154.63  avg =  149.94\n             alexnet  min =   20.49  max =   20.83  avg =   20.62\n               vgg16  min =   62.43  max =   63.41  avg =   62.81\n          vgg16_int8  min =  802.28  max =  810.33  avg =  805.66\n            resnet50  min =   29.96  max =   30.56  avg =   30.26\n       resnet50_int8  min =  488.38  max =  494.67  avg =  491.09\n      squeezenet_ssd  min =   18.35  max =   18.84  avg =   18.59\n squeezenet_ssd_int8  min =  121.27  max =  124.52  avg =  122.21\n       mobilenet_ssd  min =   15.13  max =   15.60  avg =   15.30\n  mobilenet_ssd_int8  min =  206.22  max =  225.98  avg =  222.55\n      mobilenet_yolo  min =   30.12  max =   31.28  avg =   30.41\n  mobilenetv2_yolov3  min =   26.65  max =   27.08  avg =   26.87\n         yolov4-tiny  min =   22.91  max =   23.32  avg =   23.04\n           nanodet_m  min =   11.57  max =   11.99  avg =   11.75\n    yolo-fastest-1.1  min =    7.06  max =    7.49  avg =    7.25\n      yolo-fastestv2  min =    6.17  max =    6.65  avg =    6.34\n  vision_transformer  min = 1185.13  max = 1193.94  avg = 1189.50\n          FastestDet  min =    5.78  max =    6.87  avg =    6.11\nfan@ubuntu:~/ncnn/benchmark$ ../build/benchmark/benchncnn 10 $(nproc) 0 -1\nloop_count = 10\nnum_threads = 6\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   19.92  max =   22.96  avg =   21.43\n     squeezenet_int8  min =   20.33  max =   25.17  avg =   22.63\n           
mobilenet  min =   27.25  max =   80.19  avg =   36.64\n      mobilenet_int8  min =   21.22  max =   31.14  avg =   27.05\n        mobilenet_v2  min =   21.95  max =   25.77  avg =   24.10\n        mobilenet_v3  min =   20.10  max =   34.13  avg =   25.30\n          shufflenet  min =   14.96  max =  108.36  avg =   28.88\n       shufflenet_v2  min =   13.25  max =   29.33  avg =   16.43\n             mnasnet  min =   19.41  max =  111.63  avg =   30.57\n     proxylessnasnet  min =   22.58  max =   27.29  avg =   24.43\n     efficientnet_b0  min =   32.95  max =   35.53  avg =   34.46\n   efficientnetv2_b0  min =   36.91  max =   52.12  avg =   41.72\n        regnety_400m  min =   43.87  max =  152.33  avg =   56.15\n           blazeface  min =    4.51  max =   16.71  avg =    6.79\n           googlenet  min =   59.37  max =   93.96  avg =   70.88\n      googlenet_int8  min =   57.95  max =  124.06  avg =   71.47\n            resnet18  min =   51.99  max =  134.81  avg =   68.50\n       resnet18_int8  min =   40.54  max =  130.18  avg =   54.10\n             alexnet  min =   41.42  max =   67.03  avg =   52.66\n               vgg16  min =  253.75  max =  295.39  avg =  265.01\n          vgg16_int8  min =  183.96  max =  334.83  avg =  206.81\n            resnet50  min =  305.79  max =  330.68  avg =  316.55\n       resnet50_int8  min =  120.10  max =  133.19  avg =  125.92\n      squeezenet_ssd  min =   51.06  max =  125.69  avg =   67.34\n squeezenet_ssd_int8  min =   44.56  max =  156.68  avg =   61.47\n       mobilenet_ssd  min =   52.27  max =  123.50  avg =   64.86\n  mobilenet_ssd_int8  min =   48.18  max =  183.44  avg =   63.25\n      mobilenet_yolo  min =  120.27  max =  160.73  avg =  130.75\n  mobilenetv2_yolov3  min =   74.39  max =  167.08  avg =   86.50\n         yolov4-tiny  min =  108.39  max =  123.62  avg =  112.81\n           nanodet_m  min =   32.38  max =   91.62  avg =   42.01\n    yolo-fastest-1.1  min =   17.97  max =  157.78  avg =   34.93\n 
     yolo-fastestv2  min =   16.12  max =   19.55  avg =   18.03\n  vision_transformer  min = 2317.30  max = 2437.95  avg = 2375.98\n          FastestDet  min =   15.52  max =  127.95  avg =   27.40\nfan@ubuntu:~/ncnn/benchmark$ ../build/benchmark/benchncnn 10 1 0 -1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   48.72  max =   50.66  avg =   49.98\n     squeezenet_int8  min =   56.50  max =   61.58  avg =   58.64\n           mobilenet  min =   88.10  max =   89.76  avg =   88.92\n      mobilenet_int8  min =   95.08  max =   96.92  avg =   95.82\n        mobilenet_v2  min =   58.72  max =   61.48  avg =   59.54\n        mobilenet_v3  min =   48.58  max =   49.95  avg =   49.24\n          shufflenet  min =   30.42  max =   32.03  avg =   31.17\n       shufflenet_v2  min =   28.27  max =   29.37  avg =   28.65\n             mnasnet  min =   56.85  max =   58.22  avg =   57.37\n     proxylessnasnet  min =   68.67  max =   71.23  avg =   69.64\n     efficientnet_b0  min =   89.27  max =   92.67  avg =   90.33\n   efficientnetv2_b0  min =  107.72  max =  109.86  avg =  108.53\n        regnety_400m  min =   85.19  max =   91.74  avg =   86.95\n           blazeface  min =    8.60  max =    8.80  avg =    8.71\n           googlenet  min =  161.58  max =  166.70  avg =  163.60\n      googlenet_int8  min =  183.79  max =  189.43  avg =  186.17\n            resnet18  min =  123.43  max =  126.29  avg =  124.86\n       resnet18_int8  min =  140.80  max =  144.92  avg =  142.60\n             alexnet  min =   93.16  max =  100.47  avg =   96.44\n               vgg16  min =  664.14  max =  671.67  avg =  667.90\n          vgg16_int8  min =  799.67  max =  813.66  avg =  803.50\n            resnet50  min =  384.10  max =  388.46  avg =  386.49\n       resnet50_int8  min =  448.11  max =  473.27  avg =  465.12\n      squeezenet_ssd  min =  106.58  max =  109.62  avg =  107.39\n squeezenet_ssd_int8  min =  118.39  
max =  122.62  avg =  120.43\n       mobilenet_ssd  min =  178.89  max =  183.37  avg =  180.47\n  mobilenet_ssd_int8  min =  201.46  max =  207.18  avg =  203.00\n      mobilenet_yolo  min =  407.54  max =  411.12  avg =  409.33\n  mobilenetv2_yolov3  min =  211.83  max =  214.46  avg =  213.20\n         yolov4-tiny  min =  249.11  max =  254.22  avg =  251.38\n           nanodet_m  min =   69.41  max =   71.26  avg =   70.28\n    yolo-fastest-1.1  min =   30.99  max =   33.29  avg =   32.03\n      yolo-fastestv2  min =   27.70  max =   28.90  avg =   27.93\n  vision_transformer  min = 3203.45  max = 3402.10  avg = 3286.58\n          FastestDet  min =   29.05  max =   32.57  avg =   30.53\n```\n\n### Rockchip RK3288-CG.W (Cortex-A17 1.8GHz x 4)\n```\nWW_Tinker_Board:/data/local/tmp # ./benchncnn 8 4 0 -1 1\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   56.61  max =   56.80  avg =   56.69\n     squeezenet_int8  min =   40.63  max =   41.05  avg =   40.89\n           mobilenet  min =   83.91  max =   84.59  avg =   84.23\n      mobilenet_int8  min =   36.15  max =   36.44  avg =   36.25\n        mobilenet_v2  min =   71.12  max =   71.73  avg =   71.54\n        mobilenet_v3  min =   56.08  max =   56.56  avg =   56.28\n          shufflenet  min =   37.39  max =   37.75  avg =   37.55\n       shufflenet_v2  min =   35.19  max =   35.52  avg =   35.34\n             mnasnet  min =   62.08  max =   62.36  avg =   62.24\n     proxylessnasnet  min =   66.98  max =   67.38  avg =   67.16\n     efficientnet_b0  min =  109.95  max =  110.71  avg =  110.15\n   efficientnetv2_b0  min =  122.56  max =  123.31  avg =  122.94\n        regnety_400m  min =   88.84  max =   89.19  avg =   88.99\n           blazeface  min =   11.79  max =   11.92  avg =   11.85\n           googlenet  min =  162.56  max =  165.39  avg =  163.19\n      googlenet_int8  min =  110.35  max =  110.91  avg =  110.60\n            resnet18  
min =  172.39  max =  173.99  avg =  173.24\n       resnet18_int8  min =   84.00  max =   84.40  avg =   84.19\n             alexnet  min =  156.71  max =  158.23  avg =  157.59\n               vgg16  min =  956.95  max =  964.32  avg =  960.60\n          vgg16_int8  min =  388.10  max =  389.52  avg =  388.68\n            resnet50  min =  403.05  max =  404.80  avg =  404.01\n       resnet50_int8  min =  205.12  max =  207.42  avg =  206.19\n      squeezenet_ssd  min =  163.61  max =  165.79  avg =  164.93\n squeezenet_ssd_int8  min =  125.88  max =  126.35  avg =  126.12\n       mobilenet_ssd  min =  175.97  max =  176.86  avg =  176.39\n  mobilenet_ssd_int8  min =   76.90  max =   77.74  avg =   77.35\n      mobilenet_yolo  min =  385.59  max =  387.19  avg =  386.60\n  mobilenetv2_yolov3  min =  234.88  max =  236.22  avg =  235.66\n         yolov4-tiny  min =  307.44  max =  310.64  avg =  308.54\n           nanodet_m  min =   92.54  max =   93.15  avg =   92.82\n    yolo-fastest-1.1  min =   46.69  max =   47.02  avg =   46.83\n      yolo-fastestv2  min =   38.37  max =   38.68  avg =   38.54\n\nWW_Tinker_Board:/data/local/tmp # ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  138.27  max =  138.57  avg =  138.41\n     squeezenet_int8  min =   85.97  max =   86.23  avg =   86.05\n           mobilenet  min =  234.90  max =  235.08  avg =  235.00\n      mobilenet_int8  min =   99.92  max =  100.45  avg =  100.12\n        mobilenet_v2  min =  157.76  max =  157.99  avg =  157.86\n        mobilenet_v3  min =  130.05  max =  130.23  avg =  130.17\n          shufflenet  min =   74.48  max =   74.62  avg =   74.55\n       shufflenet_v2  min =   74.05  max =   74.25  avg =   74.13\n             mnasnet  min =  150.74  max =  151.03  avg =  150.87\n     proxylessnasnet  min =  171.09  max =  171.23  avg =  171.16\n     efficientnet_b0  min =  306.85  max =  307.02  avg =  306.97\n   
efficientnetv2_b0  min =  347.40  max =  347.87  avg =  347.64\n        regnety_400m  min =  190.26  max =  190.33  avg =  190.29\n           blazeface  min =   25.25  max =   25.68  avg =   25.47\n           googlenet  min =  432.09  max =  432.48  avg =  432.32\n      googlenet_int8  min =  275.55  max =  276.07  avg =  275.88\n            resnet18  min =  355.11  max =  358.56  avg =  356.90\n       resnet18_int8  min =  205.80  max =  206.68  avg =  206.26\n             alexnet  min =  330.09  max =  330.29  avg =  330.15\n               vgg16  min = 2122.95  max = 2124.45  avg = 2123.68\n          vgg16_int8  min = 1048.53  max = 1049.29  avg = 1048.86\n            resnet50  min = 1047.27  max = 1048.33  avg = 1047.63\n       resnet50_int8  min =  517.75  max =  519.28  avg =  518.81\n      squeezenet_ssd  min =  304.69  max =  305.75  avg =  305.16\n squeezenet_ssd_int8  min =  219.16  max =  219.94  avg =  219.45\n       mobilenet_ssd  min =  483.73  max =  484.12  avg =  484.01\n  mobilenet_ssd_int8  min =  208.89  max =  209.19  avg =  209.09\n      mobilenet_yolo  min = 1092.75  max = 1093.70  avg = 1093.13\n  mobilenetv2_yolov3  min =  560.66  max =  560.92  avg =  560.77\n         yolov4-tiny  min =  704.69  max =  705.38  avg =  705.12\n           nanodet_m  min =  187.13  max =  187.57  avg =  187.39\n    yolo-fastest-1.1  min =   83.05  max =   83.11  avg =   83.08\n      yolo-fastestv2  min =   72.19  max =   72.23  avg =   72.21\n\nWW_Tinker_Board:/data/local/tmp # ./benchncnn 4 1 0 0 0\n[0 Mali-T760]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-T760]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=1\n[0 Mali-T760]  fp16-p/s/a=1/0/1  int8-p/s/a=1/0/0\n[0 Mali-T760]  subgroup=0  basic=0  vote=0  ballot=0  shuffle=0\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   41.78  max =   41.82  avg =   41.79\n           mobilenet  min =   62.67  max =   62.80  avg =   62.74\n        mobilenet_v2  
min =   51.08  max =   51.26  avg =   51.17\n        mobilenet_v3  min =   51.43  max =   51.70  avg =   51.51\n          shufflenet  min =   56.83  max =   56.94  avg =   56.87\n       shufflenet_v2  min =   48.46  max =   48.63  avg =   48.53\n             mnasnet  min =   52.31  max =   52.63  avg =   52.42\n     proxylessnasnet  min =   57.33  max =   57.46  avg =   57.41\n     efficientnet_b0  min =   87.52  max =   87.80  avg =   87.62\n   efficientnetv2_b0  min =  123.83  max =  124.67  avg =  124.34\n        regnety_400m  min =   65.52  max =   65.81  avg =   65.64\n           blazeface  min =   14.56  max =   14.73  avg =   14.62\n           googlenet  min =  138.52  max =  139.39  avg =  138.89\n            resnet18  min =  124.45  max =  124.81  avg =  124.58\n             alexnet  min =  130.46  max =  130.68  avg =  130.54\n```\n### HiSilicon Hi3519V101 (Cortex-A17 1.2GHz x 1)\n```\nroot@Hi3519:/ncnn-benchmark # taskset 2 ./benchncnn 8 1 0\nloop_count = 8\nnum_threads = 1\npowersave = 0\n      squeezenet  min =  272.97  max =  275.84  avg =  274.85\n squeezenet-int8  min =  200.87  max =  202.47  avg =  201.74\n       mobilenet  min =  480.90  max =  482.16  avg =  481.64\n    mobilenet_v2  min =  350.01  max =  352.39  avg =  350.81\n      shufflenet  min =  152.40  max =  153.17  avg =  152.80\n       googlenet  min = 1096.65  max = 1101.35  avg = 1099.21\n        resnet18  min =  983.92  max =  987.00  avg =  985.25\n         alexnet  min = 1140.30  max = 1141.55  avg = 1140.92\n  squeezenet-ssd  min =  574.62  max =  580.12  avg =  577.23\n   mobilenet-ssd  min =  960.26  max =  969.13  avg =  965.93\n  mobilenet-yolo  min = 1867.78  max = 1880.08  avg = 1873.89\n```\n\n### iPhone 5S (Apple A7 1.3GHz x 2)\n```\niPhone:~ root# ./benchncnn 8 2 0 -1\n[0 Apple A7 GPU]  queueC=0[8]  queueT=0[8]  memU=1  memDL=1  memHV=1\n[0 Apple A7 GPU]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 8\nnum_threads = 2\npowersave = 0\ngpu_device = -1\n       
   squeezenet  min =   49.21  max =   50.40  avg =   49.74\n     squeezenet_int8  min =   54.73  max =   57.39  avg =   56.70\n           mobilenet  min =   79.03  max =   80.00  avg =   79.44\n      mobilenet_int8  min =  109.95  max =  112.69  avg =  111.38\n        mobilenet_v2  min =   57.34  max =   57.88  avg =   57.47\n        mobilenet_v3  min =   52.66  max =   53.73  avg =   53.12\n          shufflenet  min =   32.78  max =   36.12  avg =   35.12\n       shufflenet_v2  min =   31.25  max =   32.10  avg =   31.61\n             mnasnet  min =   54.58  max =   56.12  avg =   55.44\n     proxylessnasnet  min =   69.52  max =   72.42  avg =   70.40\n           googlenet  min =  192.82  max =  194.20  avg =  193.35\n      googlenet_int8  min =  235.43  max =  244.71  avg =  239.64\n            resnet18  min =  164.33  max =  167.27  avg =  165.51\n       resnet18_int8  min =  176.16  max =  179.73  avg =  178.60\n             alexnet  min =  224.50  max =  228.21  avg =  226.51\n               vgg16  min = 4262.28  max = 4400.29  avg = 4300.34\n          vgg16_int8  min = 2835.84  max = 2955.22  avg = 2890.26\n            resnet50  min =  542.66  max = 1344.49  avg =  737.05\n       resnet50_int8  min =  426.08  max =  435.34  avg =  431.87\n      squeezenet_ssd  min =  129.03  max =  131.44  avg =  129.99\n squeezenet_ssd_int8  min =  155.52  max =  161.42  avg =  158.51\n       mobilenet_ssd  min =  168.18  max =  170.17  avg =  169.42\n  mobilenet_ssd_int8  min =  205.78  max =  212.07  avg =  209.66\n      mobilenet_yolo  min =  347.32  max =  363.15  avg =  355.72\n  mobilenetv2_yolov3  min =  193.11  max =  196.64  avg =  194.31\n\niPhone:~ root# ./benchncnn 4 1 0 -1\n[0 Apple A7 GPU]  queueC=0[8]  queueT=0[8]  memU=1  memDL=1  memHV=1\n[0 Apple A7 GPU]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\n          squeezenet  min =   86.36  max =   86.81  avg =   86.57\n     squeezenet_int8  min = 
  99.62  max =  100.07  avg =   99.83\n           mobilenet  min =  143.11  max =  146.50  avg =  145.38\n      mobilenet_int8  min =  202.25  max =  203.32  avg =  203.02\n        mobilenet_v2  min =   97.56  max =   98.55  avg =   98.09\n        mobilenet_v3  min =   87.45  max =   87.68  avg =   87.52\n          shufflenet  min =   54.01  max =   54.13  avg =   54.08\n       shufflenet_v2  min =   48.11  max =   48.65  avg =   48.36\n             mnasnet  min =   95.02  max =   95.77  avg =   95.25\n     proxylessnasnet  min =  123.91  max =  124.61  avg =  124.18\n           googlenet  min =  344.23  max =  348.95  avg =  345.97\n      googlenet_int8  min =  420.30  max =  420.99  avg =  420.65\n            resnet18  min =  300.44  max =  301.36  avg =  300.99\n       resnet18_int8  min =  308.60  max =  310.52  avg =  309.70\n             alexnet  min =  423.92  max =  429.84  avg =  427.24\n               vgg16  min = 4787.59  max = 5015.23  avg = 4900.43\n          vgg16_int8  min = 3560.59  max = 3722.75  avg = 3639.88\n            resnet50  min =  797.88  max = 1294.57  avg =  985.63\n       resnet50_int8  min =  751.15  max =  760.25  avg =  757.89\n      squeezenet_ssd  min =  193.75  max =  196.13  avg =  195.29\n squeezenet_ssd_int8  min =  243.78  max =  245.19  avg =  244.74\n       mobilenet_ssd  min =  299.69  max =  307.22  avg =  305.12\n  mobilenet_ssd_int8  min =  385.91  max =  389.82  avg =  388.48\n      mobilenet_yolo  min =  657.00  max =  659.31  avg =  658.08\n  mobilenetv2_yolov3  min =  335.59  max =  342.22  avg =  339.37\n\niPhone:~ root# ./benchncnn 4 1 0 0\n[0 Apple A7 GPU]  queueC=0[8]  queueT=0[8]  memU=1  memDL=1  memHV=1\n[0 Apple A7 GPU]  fp16p=1  fp16s=0  fp16a=0  int8s=0  int8a=0\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = 0\n          squeezenet  min =  260.18  max =  262.55  avg =  261.09\n           mobilenet  min =  288.73  max =  291.83  avg =  289.67\n        mobilenet_v2  min =  265.72  max =  267.05 
 avg =  266.14\n        mobilenet_v3  min =  255.86  max =  257.35  avg =  256.43\n          shufflenet  min =  236.66  max =  239.49  avg =  237.98\n       shufflenet_v2  min =  244.92  max =  247.75  avg =  246.22\n             mnasnet  min =  254.75  max =  256.48  avg =  255.85\n     proxylessnasnet  min =  281.42  max =  282.62  avg =  282.11\n           googlenet  min =  745.36  max =  764.91  avg =  754.16\n            resnet18  min =  721.26  max =  741.98  avg =  734.78\n             alexnet  min =  521.43  max =  530.95  avg =  527.01\n            resnet50  min = 1494.86  max = 1505.79  avg = 1501.49\n      squeezenet_ssd  min = 1096.45  max = 1102.84  avg = 1098.55\n       mobilenet_ssd  min =  639.50  max =  641.81  avg =  640.83\n      mobilenet_yolo  min = 1445.16  max = 1450.94  avg = 1447.42\n  mobilenetv2_yolov3  min = 1047.24  max = 1060.97  avg = 1052.86\n```\n\n### Freescale i.MX7 Dual (Cortex A7 1.0GHz x 2)\n```\nimx7d_pico:/data/local/tmp $ ./benchncnn 8 2 0 -1 1\nloop_count = 8\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  220.10  max =  226.46  avg =  222.89\n     squeezenet_int8  min =  159.26  max =  165.25  avg =  161.71\n           mobilenet  min =  366.92  max =  373.78  avg =  371.55\n      mobilenet_int8  min =  223.14  max =  229.66  avg =  225.66\n        mobilenet_v2  min =  252.32  max =  259.41  avg =  255.54\n        mobilenet_v3  min =  214.05  max =  222.24  avg =  217.53\n          shufflenet  min =  137.02  max =  144.79  avg =  138.85\n       shufflenet_v2  min =  134.89  max =  140.75  avg =  137.18\n             mnasnet  min =  250.64  max =  256.75  avg =  253.33\n     proxylessnasnet  min =  285.35  max =  291.43  avg =  288.37\n     efficientnet_b0  min =  430.47  max =  436.63  avg =  434.75\n        regnety_400m  min =  317.69  max =  325.77  avg =  321.24\n           blazeface  min =   42.93  max =   43.30  avg =   43.14\n           googlenet  min =  721.84  max =  
728.40  avg =  724.23\n      googlenet_int8  min =  504.07  max =  511.06  avg =  507.39\n            resnet18  min =  645.61  max =  653.08  avg =  648.51\n       resnet18_int8  min =  370.84  max =  514.38  avg =  392.80\n             alexnet  min =  783.64  max =  794.83  avg =  786.95\n      squeezenet_ssd  min =  508.71  max =  513.70  avg =  511.29\n squeezenet_ssd_int8  min =  402.85  max =  409.32  avg =  406.45\n       mobilenet_ssd  min =  763.70  max =  771.52  avg =  767.61\n  mobilenet_ssd_int8  min =  457.99  max =  460.85  avg =  459.76\n      mobilenet_yolo  min = 1730.90  max = 1746.52  avg = 1741.26\n  mobilenetv2_yolov3  min =  884.00  max =  892.97  avg =  889.38\n         yolov4-tiny  min = 1181.20  max = 1218.20  avg = 1202.28\n           nanodet_m  min =  331.53  max =  339.89  avg =  334.62\n\nimx7d_pico:/data/local/tmp $ ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  408.39  max =  410.27  avg =  408.95\n     squeezenet_int8  min =  290.25  max =  290.95  avg =  290.61\n           mobilenet  min =  707.10  max =  711.64  avg =  708.47\n      mobilenet_int8  min =  434.95  max =  436.16  avg =  435.66\n        mobilenet_v2  min =  466.52  max =  467.41  avg =  466.96\n        mobilenet_v3  min =  407.03  max =  408.29  avg =  407.56\n          shufflenet  min =  240.65  max =  241.07  avg =  240.85\n       shufflenet_v2  min =  229.27  max =  235.66  avg =  231.51\n             mnasnet  min =  471.21  max =  471.48  avg =  471.35\n     proxylessnasnet  min =  544.74  max =  547.62  avg =  546.20\n     efficientnet_b0  min =  824.09  max =  824.44  avg =  824.20\n        regnety_400m  min =  570.20  max =  571.73  avg =  570.82\n           blazeface  min =   76.46  max =   77.05  avg =   76.81\n           googlenet  min = 1368.82  max = 1369.99  avg = 1369.33\n      googlenet_int8  min =  945.51  max =  946.61  avg =  945.91\n            resnet18  min = 
1237.79  max = 1257.12  avg = 1246.80\n       resnet18_int8  min =  705.09  max =  706.72  avg =  705.63\n             alexnet  min = 1516.35  max = 1522.82  avg = 1519.52\n      squeezenet_ssd  min =  906.97  max =  908.48  avg =  907.68\n squeezenet_ssd_int8  min =  727.15  max =  728.16  avg =  727.77\n       mobilenet_ssd  min = 1475.19  max = 1478.52  avg = 1476.81\n  mobilenet_ssd_int8  min =  883.88  max =  890.68  avg =  885.90\n      mobilenet_yolo  min = 3408.43  max = 3418.63  avg = 3412.52\n  mobilenetv2_yolov3  min = 1685.18  max = 1695.89  avg = 1689.23\n         yolov4-tiny  min = 2168.24  max = 2183.24  avg = 2175.93\n           nanodet_m  min =  561.56  max =  562.05  avg =  561.72\n```\n\n### Z7-Lite 7020 XC7Z020CLG400-2 (Cortex-A9 766MHz x 2)\n```\nroot@petalinux_hdmi:~# LD_LIBRARY_PATH=. ./benchncnn 8 2 0 -1 1\nloop_count = 8\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  389.18  max =  390.13  avg =  389.60\n     squeezenet_int8  min =  254.33  max =  255.24  avg =  254.85\n           mobilenet  min =  623.71  max =  625.01  avg =  624.46\n      mobilenet_int8  min =  240.40  max =  241.03  avg =  240.87\n        mobilenet_v2  min =  450.00  max =  450.89  avg =  450.40\n        mobilenet_v3  min =  362.99  max =  363.66  avg =  363.28\n          shufflenet  min =  212.20  max =  213.28  avg =  212.84\n       shufflenet_v2  min =  210.26  max =  212.64  avg =  211.53\n             mnasnet  min =  408.67  max =  409.64  avg =  409.17\n     proxylessnasnet  min =  449.86  max =  450.94  avg =  450.45\n     efficientnet_b0  min =  737.40  max =  739.58  avg =  738.32\n   efficientnetv2_b0  min =  848.58  max =  849.74  avg =  849.24\n        regnety_400m  min =  501.32  max =  503.02  avg =  501.87\n           blazeface  min =   70.89  max =   72.22  avg =   71.61\n      squeezenet_ssd  min =  978.55  max =  979.86  avg =  979.22\n squeezenet_ssd_int8  min =  691.90  max =  694.18  avg =  692.73\n  
     mobilenet_ssd  min = 1353.12  max = 1354.13  avg = 1353.53\n  mobilenet_ssd_int8  min =  496.26  max =  497.29  avg =  496.61\n           nanodet_m  min =  542.04  max =  546.29  avg =  544.73\n    yolo-fastest-1.1  min =  282.75  max =  286.11  avg =  284.24\n      yolo-fastestv2  min =  230.91  max =  232.74  avg =  231.56\n\nroot@petalinux_hdmi:~# LD_LIBRARY_PATH=. ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  637.19  max =  639.33  avg =  637.82\n     squeezenet_int8  min =  390.31  max =  391.63  avg =  390.94\n           mobilenet  min = 1085.54  max = 1085.96  avg = 1085.71\n      mobilenet_int8  min =  437.28  max =  437.65  avg =  437.44\n        mobilenet_v2  min =  716.03  max =  716.75  avg =  716.35\n        mobilenet_v3  min =  587.83  max =  588.55  avg =  588.21\n          shufflenet  min =  331.28  max =  331.97  avg =  331.63\n       shufflenet_v2  min =  331.03  max =  333.19  avg =  331.76\n             mnasnet  min =  682.68  max =  683.11  avg =  682.82\n     proxylessnasnet  min =  763.89  max =  764.80  avg =  764.35\n     efficientnet_b0  min = 1288.61  max = 1289.10  avg = 1288.81\n   efficientnetv2_b0  min = 1499.12  max = 1500.11  avg = 1499.65\n        regnety_400m  min =  852.03  max =  853.16  avg =  852.68\n           blazeface  min =  109.40  max =  111.51  avg =  110.41\n      squeezenet_ssd  min = 1493.25  max = 1497.00  avg = 1494.87\n squeezenet_ssd_int8  min = 1016.77  max = 1019.31  avg = 1017.99\n       mobilenet_ssd  min = 2379.20  max = 2379.83  avg = 2379.64\n  mobilenet_ssd_int8  min =  881.70  max =  881.89  avg =  881.83\n           nanodet_m  min =  831.13  max =  832.58  avg =  831.87\n    yolo-fastest-1.1  min =  466.80  max =  469.90  avg =  468.79\n      yolo-fastestv2  min =  352.07  max =  355.20  avg =  353.36\n```\n\n### Loongson 2K1000 (GS264 1.0GHz x 2)\n```\nroot@ls2k:~/ncnn/build/benchmark# ./benchncnn 10 2 0 -1 
1\nloop_count = 10\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  184.33  max =  184.94  avg =  184.65\n     squeezenet_int8  min =  201.42  max =  201.99  avg =  201.72\n           mobilenet  min =  277.17  max =  278.04  avg =  277.66\n      mobilenet_int8  min =  234.61  max =  235.17  avg =  234.81\n        mobilenet_v2  min =  223.10  max =  274.92  avg =  228.71\n        mobilenet_v3  min =  185.79  max =  201.76  avg =  187.60\n          shufflenet  min =  129.78  max =  131.09  avg =  130.28\n       shufflenet_v2  min =  115.86  max =  116.77  avg =  116.42\n             mnasnet  min =  213.92  max =  214.72  avg =  214.26\n     proxylessnasnet  min =  240.05  max =  242.02  avg =  240.86\n     efficientnet_b0  min =  347.52  max =  348.53  avg =  348.13\n   efficientnetv2_b0  min =  382.78  max =  479.58  avg =  398.18\n        regnety_400m  min =  270.00  max =  312.84  avg =  274.66\n           blazeface  min =   37.60  max =   38.02  avg =   37.79\n           googlenet  min =  659.55  max =  693.17  avg =  666.17\n      googlenet_int8  min =  678.26  max =  718.39  avg =  682.79\n            resnet18  min =  499.75  max =  766.88  avg =  532.49\n       resnet18_int8  min =  500.38  max =  533.97  avg =  504.56\n             alexnet  min =  508.49  max =  542.94  avg =  516.13\n               vgg16  min = 2654.06  max = 3082.44  avg = 2762.51\n          vgg16_int8  min = 2628.96  max = 2665.35  avg = 2647.12\n            resnet50  min = 1256.97  max = 1417.45  avg = 1283.04\n       resnet50_int8  min = 1232.55  max = 1276.94  avg = 1244.59\n      squeezenet_ssd  min =  538.83  max =  588.03  avg =  553.44\n squeezenet_ssd_int8  min =  501.67  max =  532.61  avg =  505.72\n       mobilenet_ssd  min =  571.14  max =  600.93  avg =  578.22\n  mobilenet_ssd_int8  min =  478.67  max =  515.39  avg =  483.06\n      mobilenet_yolo  min = 1644.48  max = 1729.17  avg = 1669.18\n  mobilenetv2_yolov3  min =  752.22  
max =  792.40  avg =  760.10\n         yolov4-tiny  min =  994.48  max = 1096.10  avg = 1016.49\n           nanodet_m  min =  299.12  max =  343.99  avg =  303.98\n    yolo-fastest-1.1  min =  141.56  max =  142.93  avg =  142.04\n      yolo-fastestv2  min =  125.66  max =  168.88  avg =  130.28\n\nroot@ls2k:~/ncnn/build/benchmark# ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  295.48  max =  296.42  avg =  295.98\n     squeezenet_int8  min =  334.05  max =  336.31  avg =  335.35\n           mobilenet  min =  476.33  max =  479.00  avg =  477.41\n      mobilenet_int8  min =  446.03  max =  448.21  avg =  446.73\n        mobilenet_v2  min =  343.26  max =  343.97  avg =  343.69\n        mobilenet_v3  min =  296.84  max =  297.31  avg =  297.11\n          shufflenet  min =  202.31  max =  203.96  avg =  202.79\n       shufflenet_v2  min =  181.69  max =  182.42  avg =  182.08\n             mnasnet  min =  353.73  max =  354.12  avg =  353.99\n     proxylessnasnet  min =  404.49  max =  405.00  avg =  404.75\n     efficientnet_b0  min =  592.54  max =  593.81  avg =  593.14\n   efficientnetv2_b0  min =  649.91  max =  651.49  avg =  650.54\n        regnety_400m  min =  425.96  max =  426.33  avg =  426.12\n           blazeface  min =   59.74  max =   60.19  avg =   59.90\n           googlenet  min = 1120.13  max = 1217.54  avg = 1146.27\n      googlenet_int8  min = 1205.17  max = 1213.43  avg = 1208.13\n            resnet18  min =  803.07  max =  997.37  avg =  856.09\n       resnet18_int8  min =  911.74  max =  916.16  avg =  913.31\n             alexnet  min =  883.47  max =  903.08  avg =  889.06\n               vgg16  min = 4425.52  max = 4587.36  avg = 4467.61\n          vgg16_int8  min = 4896.90  max = 4993.15  avg = 4924.44\n            resnet50  min = 2163.22  max = 2169.90  avg = 2167.49\n       resnet50_int8  min = 2202.87  max = 2218.00  avg = 2210.51\n      
squeezenet_ssd  min =  831.06  max =  926.94  avg =  856.24\n squeezenet_ssd_int8  min =  800.52  max =  803.28  avg =  801.72\n       mobilenet_ssd  min =  979.74  max =  980.82  avg =  980.22\n  mobilenet_ssd_int8  min =  893.79  max =  895.41  avg =  894.51\n      mobilenet_yolo  min = 2578.17  max = 2586.30  avg = 2582.55\n  mobilenetv2_yolov3  min = 1190.77  max = 1207.67  avg = 1196.06\n         yolov4-tiny  min = 1558.29  max = 1570.18  avg = 1561.52\n           nanodet_m  min =  442.90  max =  444.27  avg =  443.72\n    yolo-fastest-1.1  min =  203.60  max =  208.43  avg =  205.20\n      yolo-fastestv2  min =  184.61  max =  185.05  avg =  184.75\n```\n\n### Loongson 2K1000LA (LA264 1.0GHz * 2)\n```\nroot@ls2kla:~/ncnn/build/benchmark# ./benchncnn 10 2 0 -1 1\nloop_count = 10\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  151.11  max =  162.36  avg =  153.30\n     squeezenet_int8  min =  195.32  max =  198.63  avg =  196.12\n           mobilenet  min =  279.27  max =  283.42  avg =  280.40\n      mobilenet_int8  min =  264.78  max =  268.41  avg =  265.76\n        mobilenet_v2  min =  204.39  max =  207.69  avg =  205.77\n        mobilenet_v3  min =  171.32  max =  187.07  avg =  173.15\n          shufflenet  min =  147.43  max =  150.72  avg =  147.89\n       shufflenet_v2  min =  169.42  max =  172.58  avg =  170.35\n             mnasnet  min =  204.87  max =  208.01  avg =  205.63\n     proxylessnasnet  min =  226.79  max =  237.74  avg =  229.02\n     efficientnet_b0  min =  302.30  max =  310.91  avg =  303.87\n   efficientnetv2_b0  min =  327.65  max =  361.15  avg =  334.45\n        regnety_400m  min =  264.08  max =  278.49  avg =  266.35\n           blazeface  min =   31.80  max =   39.18  avg =   32.88\n           googlenet  min =  562.95  max =  578.42  avg =  566.28\n      googlenet_int8  min =  598.16  max =  613.56  avg =  601.68\n            resnet18  min =  466.73  max =  472.08  avg =  
469.58\n       resnet18_int8  min =  489.69  max =  493.74  avg =  491.63\n             alexnet  min =  381.35  max =  388.12  avg =  384.78\n               vgg16  min = 2321.29  max = 2345.89  avg = 2330.29\n          vgg16_int8  min = 2562.86  max = 2568.06  avg = 2565.68\n            resnet50  min = 1219.09  max = 1225.67  avg = 1221.36\n       resnet50_int8  min = 1263.44  max = 1266.74  avg = 1265.09\n      squeezenet_ssd  min =  433.23  max =  441.06  avg =  437.07\n squeezenet_ssd_int8  min =  438.69  max =  443.17  avg =  440.81\n       mobilenet_ssd  min =  587.37  max =  598.57  avg =  589.99\n  mobilenet_ssd_int8  min =  539.62  max =  552.57  avg =  542.87\n      mobilenet_yolo  min = 1485.30  max = 1491.17  avg = 1487.81\n  mobilenetv2_yolov3  min =  711.57  max =  722.91  avg =  715.07\n         yolov4-tiny  min =  954.76  max =  961.66  avg =  957.28\n           nanodet_m  min =  364.22  max =  369.32  avg =  365.94\n    yolo-fastest-1.1  min =  154.81  max =  160.45  avg =  156.23\n      yolo-fastestv2  min =  157.39  max =  168.82  avg =  159.51\n  vision_transformer  min = 18926.46  max = 18980.43  avg = 18951.29\n          FastestDet  min =  168.81  max =  176.77  avg =  170.26\n\nroot@ls2kla:~/ncnn/build/benchmark# ./benchncnn 4 1 0 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  272.76  max =  280.89  avg =  275.29\n     squeezenet_int8  min =  352.02  max =  353.25  avg =  352.40\n           mobilenet  min =  519.09  max =  519.68  avg =  519.34\n      mobilenet_int8  min =  509.85  max =  510.23  avg =  510.04\n        mobilenet_v2  min =  352.06  max =  352.74  avg =  352.37\n        mobilenet_v3  min =  295.13  max =  295.70  avg =  295.39\n          shufflenet  min =  241.58  max =  241.94  avg =  241.73\n       shufflenet_v2  min =  282.88  max =  283.39  avg =  283.18\n             mnasnet  min =  357.74  max =  358.21  avg =  357.98\n     proxylessnasnet  min =  
403.26  max =  411.69  avg =  406.02\n     efficientnet_b0  min =  546.11  max =  546.88  avg =  546.53\n   efficientnetv2_b0  min =  596.83  max =  597.05  avg =  596.93\n        regnety_400m  min =  441.94  max =  442.02  avg =  441.98\n           blazeface  min =   54.08  max =   54.59  avg =   54.38\n           googlenet  min = 1042.19  max = 1048.03  avg = 1044.40\n      googlenet_int8  min = 1118.22  max = 1121.18  avg = 1119.79\n            resnet18  min =  838.79  max =  839.81  avg =  839.43\n       resnet18_int8  min =  939.62  max =  940.72  avg =  940.23\n             alexnet  min =  729.36  max =  740.65  avg =  734.19\n               vgg16  min = 4326.68  max = 4335.10  avg = 4330.97\n          vgg16_int8  min = 4896.71  max = 4909.63  avg = 4905.14\n            resnet50  min = 2277.36  max = 2280.34  avg = 2279.14\n       resnet50_int8  min = 2399.07  max = 2402.21  avg = 2400.78\n      squeezenet_ssd  min =  751.49  max =  753.79  avg =  752.20\n squeezenet_ssd_int8  min =  771.01  max =  774.08  avg =  771.91\n       mobilenet_ssd  min = 1063.41  max = 1065.65  avg = 1064.16\n  mobilenet_ssd_int8  min = 1031.59  max = 1033.03  avg = 1032.09\n      mobilenet_yolo  min = 2585.33  max = 2586.65  avg = 2586.11\n  mobilenetv2_yolov3  min = 1246.35  max = 1248.43  avg = 1247.32\n         yolov4-tiny  min = 1639.13  max = 1642.47  avg = 1640.87\n           nanodet_m  min =  606.40  max =  607.14  avg =  606.86\n    yolo-fastest-1.1  min =  242.15  max =  244.64  avg =  243.43\n      yolo-fastestv2  min =  246.92  max =  247.84  avg =  247.27\n  vision_transformer  min = 36607.51  max = 36870.44  avg = 36724.88\n          FastestDet  min =  266.96  max =  268.86  avg =  267.94\n```\n\n### Loongson 2K2000 (LA364 1.5GHz * 2 with lsx)\n```\nloongson@loongson-pc:~/ncnn/build/benchmark$ ./benchncnn 4 2 0 -1 1\nloop_count = 4\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   58.54  max =   61.57  avg =   60.37\n    
 squeezenet_int8  min =   66.79  max =   72.05  avg =   70.49\n           mobilenet  min =  110.46  max =  112.72  avg =  111.84\n      mobilenet_int8  min =  117.83  max =  126.51  avg =  123.42\n        mobilenet_v2  min =   65.19  max =   70.78  avg =   67.73\n        mobilenet_v3  min =   51.30  max =   56.61  avg =   54.52\n          shufflenet  min =   32.78  max =   35.11  avg =   33.99\n       shufflenet_v2  min =   31.58  max =   32.59  avg =   32.15\n             mnasnet  min =   64.18  max =   78.53  avg =   68.72\n     proxylessnasnet  min =   73.49  max =   85.30  avg =   77.35\n     efficientnet_b0  min =  101.83  max =  106.26  avg =  104.91\n   efficientnetv2_b0  min =  126.55  max =  131.95  avg =  127.91\n        regnety_400m  min =   88.19  max =   92.58  avg =   89.60\n           blazeface  min =    8.57  max =    8.68  avg =    8.63\n           googlenet  min =  207.97  max =  214.47  avg =  211.07\n      googlenet_int8  min =  237.92  max =  241.06  avg =  239.76\n            resnet18  min =  153.42  max =  161.54  avg =  158.21\n       resnet18_int8  min =  177.77  max =  183.83  avg =  181.90\n             alexnet  min =  145.71  max =  149.41  avg =  147.97\n               vgg16  min =  937.03  max =  961.65  avg =  945.20\n          vgg16_int8  min =  850.20  max =  869.47  avg =  859.99\n            resnet50  min =  497.95  max =  524.29  avg =  511.85\n       resnet50_int8  min =  541.22  max =  549.09  avg =  544.30\n      squeezenet_ssd  min =  155.11  max =  163.01  avg =  159.72\n squeezenet_ssd_int8  min =  136.11  max =  138.38  avg =  137.36\n       mobilenet_ssd  min =  226.97  max =  231.33  avg =  229.20\n  mobilenet_ssd_int8  min =  248.61  max =  253.10  avg =  250.83\n      mobilenet_yolo  min =  613.25  max =  626.75  avg =  619.83\n  mobilenetv2_yolov3  min =  249.50  max =  258.17  avg =  255.75\n         yolov4-tiny  min =  312.41  max =  349.24  avg =  328.38\n           nanodet_m  min =   81.50  max =   84.20  avg =   
83.14\n    yolo-fastest-1.1  min =   30.46  max =   30.91  avg =   30.71\n      yolo-fastestv2  min =   26.78  max =   28.80  avg =   28.10\n  vision_transformer  min = 4483.37  max = 4519.06  avg = 4507.04\n          FastestDet  min =   31.15  max =   32.37  avg =   32.06\n```\n\n### Loongson 3A3000 (GS464E 1.45GHz * 4)\n```\nroot@3A3K:~/Desktop/ncnn-20221128/build/benchmark$ ./benchncnn 5 4 2 -1 0\nloop_count = 5\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   88.82  max =  116.74  avg =   94.92\n     squeezenet_int8  min =  140.62  max =  162.48  avg =  146.32\n           mobilenet  min =  144.80  max =  244.58  avg =  172.14\n      mobilenet_int8  min =  265.21  max =  293.89  avg =  281.80\n        mobilenet_v2  min =  109.80  max =  156.74  avg =  120.48\n        mobilenet_v3  min =   90.18  max =   93.25  avg =   91.50\n          shufflenet  min =   56.64  max =  216.12  avg =  100.68\n       shufflenet_v2  min =   45.70  max =  142.00  avg =   65.20\n             mnasnet  min =  106.99  max =  229.11  avg =  134.22\n     proxylessnasnet  min =  123.68  max =  261.01  avg =  155.97\n     efficientnet_b0  min =  160.98  max =  191.14  avg =  171.55\n   efficientnetv2_b0  min =  162.75  max =  187.67  avg =  176.19\n        regnety_400m  min =  135.06  max =  174.12  avg =  151.30\n           blazeface  min =   15.26  max =   43.81  avg =   23.91\n           googlenet  min =  327.16  max =  386.02  avg =  350.25\n      googlenet_int8  min =  500.45  max =  637.39  avg =  540.62\n            resnet18  min =  254.45  max =  421.56  avg =  304.48\n       resnet18_int8  min =  385.14  max =  559.01  avg =  439.74\n             alexnet  min =  179.19  max =  220.91  avg =  190.63\n               vgg16  min = 1563.99  max = 1645.01  avg = 1619.63\n          vgg16_int8  min = 1436.00  max = 1530.45  avg = 1473.00\n            resnet50  min =  702.35  max =  833.23  avg =  764.14\n       resnet50_int8  min = 1099.40  
max = 1208.84  avg = 1154.51\n      squeezenet_ssd  min =  191.40  max =  270.10  avg =  218.75\n squeezenet_ssd_int8  min =  304.51  max =  387.51  avg =  344.98\n       mobilenet_ssd  min =  315.77  max =  417.37  avg =  344.40\n  mobilenet_ssd_int8  min =  554.28  max =  656.07  avg =  580.72\n      mobilenet_yolo  min =  806.48  max =  851.22  avg =  825.50\n  mobilenetv2_yolov3  min =  382.38  max =  503.38  avg =  421.03\n         yolov4-tiny  min =  502.87  max =  620.30  avg =  550.08\n           nanodet_m  min =  126.00  max =  314.03  avg =  184.93\n    yolo-fastest-1.1  min =   64.68  max =  189.47  avg =  110.89\n      yolo-fastestv2  min =   69.03  max =  116.31  avg =   82.36\n  vision_transformer  min = 14737.56  max = 15012.35  avg = 14890.56\n          FastestDet  min =   84.30  max =  139.87  avg =  102.23\n```\n\n### Loongson 3A4000 (GS464V 1.8GHz * 4 with MSA128)\n```\nroot@3A4K:~/Desktop/ncnn-20221128/build/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   17.04  max =   39.86  avg =   20.39\n     squeezenet_int8  min =   21.77  max =   25.93  avg =   23.02\n           mobilenet  min =   26.34  max =   97.11  avg =   38.24\n      mobilenet_int8  min =   32.93  max =   33.31  avg =   33.07\n        mobilenet_v2  min =   19.40  max =   19.91  avg =   19.63\n        mobilenet_v3  min =   16.48  max =   45.31  avg =   19.68\n          shufflenet  min =   12.23  max =  116.79  avg =   22.86\n       shufflenet_v2  min =   11.14  max =   11.59  avg =   11.37\n             mnasnet  min =   18.33  max =   51.66  avg =   24.52\n     proxylessnasnet  min =   22.03  max =   22.46  avg =   22.19\n     efficientnet_b0  min =   34.94  max =  129.52  avg =   45.76\n   efficientnetv2_b0  min =   38.58  max =   67.86  avg =   41.84\n        regnety_400m  min =   35.53  max =   38.59  avg =   36.14\n           blazeface  min =    4.08  max =    4.34  avg =    
4.17\n           googlenet  min =   72.60  max =  100.31  avg =   76.25\n      googlenet_int8  min =   82.09  max =  107.09  avg =   86.78\n            resnet18  min =   53.99  max =  100.21  avg =   63.52\n       resnet18_int8  min =   57.20  max =   77.00  avg =   60.47\n             alexnet  min =   61.95  max =   80.86  avg =   65.01\n               vgg16  min =  329.58  max =  438.99  avg =  360.40\n          vgg16_int8  min =  293.27  max =  366.16  avg =  311.23\n            resnet50  min =  138.06  max =  260.50  avg =  169.27\n       resnet50_int8  min =  154.06  max =  244.31  avg =  173.37\n      squeezenet_ssd  min =   60.44  max =   97.92  avg =   65.41\n squeezenet_ssd_int8  min =   55.34  max =  136.72  avg =   68.15\n       mobilenet_ssd  min =   57.97  max =  139.16  avg =   69.27\n  mobilenet_ssd_int8  min =   66.66  max =   89.91  avg =   71.00\n      mobilenet_yolo  min =  169.38  max =  711.10  avg =  242.62\n  mobilenetv2_yolov3  min =   75.61  max =   97.83  avg =   80.23\n         yolov4-tiny  min =  110.52  max =  143.67  avg =  118.53\n           nanodet_m  min =   24.04  max =   92.81  avg =   32.45\n    yolo-fastest-1.1  min =   10.97  max =   32.77  avg =   15.05\n      yolo-fastestv2  min =   11.54  max =   12.09  avg =   11.84\n  vision_transformer  min = 4193.41  max = 4274.03  avg = 4213.64\n          FastestDet  min =   12.54  max =   13.01  avg =   12.78\n```\n\n\n\n### Loongson 3A4000 (GS464V 1.8GHz * 4 with MSA128)\n\nTested on UOS V20 E1050\n\n```\nuos@uos-PC:~/ncnn/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   25.28  max =   38.19  avg =   27.81\n     squeezenet_int8  min =   21.61  max =   22.13  avg =   21.85\n           mobilenet  min =   44.77  max =   69.54  avg =   55.37\n      mobilenet_int8  min =   32.96  max =   44.00  avg =   36.08\n        mobilenet_v2  min =   29.21  max =   52.70  avg =   35.47\n        
mobilenet_v3  min =   24.62  max =   27.32  avg =   25.18\n          shufflenet  min =   18.90  max =   49.70  avg =   22.95\n       shufflenet_v2  min =   15.87  max =   22.38  avg =   17.67\n             mnasnet  min =   29.08  max =   69.37  avg =   35.53\n     proxylessnasnet  min =   33.30  max =   94.15  avg =   42.81\n     efficientnet_b0  min =   49.34  max =   61.22  avg =   52.01\n   efficientnetv2_b0  min =   57.89  max =   72.55  avg =   60.72\n        regnety_400m  min =   50.65  max =   74.16  avg =   57.56\n           blazeface  min =    4.97  max =    5.33  avg =    5.11\n           googlenet  min =  101.45  max =  119.73  avg =  106.85\n      googlenet_int8  min =   83.94  max =   99.75  avg =   87.36\n            resnet18  min =   81.65  max =   99.76  avg =   85.96\n       resnet18_int8  min =   58.60  max =   75.88  avg =   60.62\n             alexnet  min =   77.05  max =  208.05  avg =  120.39\n               vgg16  min =  427.51  max =  676.57  avg =  531.53\n          vgg16_int8  min =  326.59  max =  487.96  avg =  417.74\n            resnet50  min =  221.51  max =  580.11  avg =  305.64\n       resnet50_int8  min =  158.00  max =  190.71  avg =  167.50\n      squeezenet_ssd  min =   98.87  max =  135.55  avg =  115.54\n squeezenet_ssd_int8  min =   66.33  max =  361.40  avg =  148.19\n       mobilenet_ssd  min =   94.12  max =  340.16  avg =  184.85\n  mobilenet_ssd_int8  min =   88.26  max =  150.47  avg =  112.35\n      mobilenet_yolo  min =  252.07  max =  510.61  avg =  327.21\n  mobilenetv2_yolov3  min =  115.31  max =  183.63  avg =  147.28\n         yolov4-tiny  min =  153.92  max =  259.18  avg =  196.70\n           nanodet_m  min =   34.95  max =   66.15  avg =   46.41\n    yolo-fastest-1.1  min =   15.34  max =   15.94  avg =   15.62\n      yolo-fastestv2  min =   15.53  max =   16.06  avg =   15.80\n  vision_transformer  min = 4200.48  max = 5853.43  avg = 4555.42\n          FastestDet  min =   16.73  max =   18.72  avg =   
17.08\n\n\nuos@uos-PC:~/ncnn/benchmark$ ./benchncnn 10 4 1 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   25.93  max =   47.61  avg =   28.45\n     squeezenet_int8  min =   21.84  max =   27.09  avg =   22.84\n           mobilenet  min =   44.61  max =   83.44  avg =   52.52\n      mobilenet_int8  min =   32.91  max =   45.99  avg =   34.52\n        mobilenet_v2  min =   29.44  max =   37.14  avg =   30.43\n        mobilenet_v3  min =   24.54  max =   42.68  avg =   27.25\n          shufflenet  min =   17.16  max =   42.10  avg =   20.08\n       shufflenet_v2  min =   15.99  max =   16.43  avg =   16.29\n             mnasnet  min =   29.14  max =   43.37  avg =   30.79\n     proxylessnasnet  min =   33.15  max =   34.12  avg =   33.52\n     efficientnet_b0  min =   49.35  max =   87.75  avg =   54.03\n   efficientnetv2_b0  min =   57.69  max =   84.67  avg =   64.12\n        regnety_400m  min =   50.55  max =   75.35  avg =   55.31\n           blazeface  min =    5.01  max =    5.16  avg =    5.05\n           googlenet  min =  101.51  max =  116.33  avg =  105.38\n      googlenet_int8  min =   84.34  max =  102.58  avg =   89.89\n            resnet18  min =   80.58  max =   94.47  avg =   86.27\n       resnet18_int8  min =   59.00  max =   76.66  avg =   62.15\n             alexnet  min =   91.72  max =  117.98  avg =  102.20\n               vgg16  min =  435.57  max =  453.90  avg =  441.39\n          vgg16_int8  min =  308.39  max =  332.69  avg =  321.09\n            resnet50  min =  219.93  max =  249.30  avg =  231.93\n       resnet50_int8  min =  156.78  max =  179.34  avg =  163.43\n      squeezenet_ssd  min =  109.48  max =  153.84  avg =  123.75\n squeezenet_ssd_int8  min =   74.33  max =  117.03  avg =   93.81\n       mobilenet_ssd  min =   94.91  max =  161.38  avg =  127.78\n  mobilenet_ssd_int8  min =   82.35  max =  112.79  avg =   91.86\n      mobilenet_yolo  min =  252.05  max 
=  285.16  avg =  266.33\n  mobilenetv2_yolov3  min =  113.98  max =  173.83  avg =  139.60\n         yolov4-tiny  min =  150.06  max =  210.96  avg =  164.94\n           nanodet_m  min =   34.62  max =   67.81  avg =   48.43\n    yolo-fastest-1.1  min =   15.78  max =   16.09  avg =   15.93\n      yolo-fastestv2  min =   15.54  max =   32.82  avg =   17.62\n  vision_transformer  min = 4202.89  max = 5573.15  avg = 4426.38\n          FastestDet  min =   16.39  max =   17.06  avg =   16.75\n\n\n\n\nuos@uos-PC:~/ncnn/benchmark$ ./benchncnn 10 4 0 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   25.98  max =   36.75  avg =   28.86\n     squeezenet_int8  min =   22.04  max =   30.86  avg =   23.28\n           mobilenet  min =   44.82  max =   60.73  avg =   46.72\n      mobilenet_int8  min =   33.00  max =   48.45  avg =   34.70\n        mobilenet_v2  min =   29.53  max =   56.78  avg =   33.98\n        mobilenet_v3  min =   24.69  max =   45.60  avg =   28.13\n          shufflenet  min =   17.25  max =   24.72  avg =   18.18\n       shufflenet_v2  min =   16.00  max =   31.27  avg =   17.62\n             mnasnet  min =   28.95  max =   44.73  avg =   32.58\n     proxylessnasnet  min =   32.99  max =   45.42  avg =   34.66\n     efficientnet_b0  min =   49.71  max =   53.47  avg =   50.25\n   efficientnetv2_b0  min =   57.51  max =   78.56  avg =   61.47\n        regnety_400m  min =   50.18  max =   71.85  avg =   54.77\n           blazeface  min =    4.98  max =    9.36  avg =    5.48\n           googlenet  min =  101.25  max =  121.71  avg =  105.71\n      googlenet_int8  min =   82.97  max =  111.81  avg =   89.49\n            resnet18  min =   75.66  max =   87.19  avg =   78.72\n       resnet18_int8  min =   58.92  max =  108.67  avg =   76.70\n             alexnet  min =   79.12  max =  144.22  avg =  101.91\n               vgg16  min =  430.14  max =  460.46  avg =  444.56\n          
vgg16_int8  min =  308.08  max =  350.15  avg =  324.86\n            resnet50  min =  219.60  max =  258.59  avg =  237.46\n       resnet50_int8  min =  156.54  max =  180.28  avg =  163.11\n      squeezenet_ssd  min =   77.71  max =  137.36  avg =  119.68\n squeezenet_ssd_int8  min =   78.88  max =  113.64  avg =   95.83\n       mobilenet_ssd  min =   94.82  max =  156.99  avg =  119.67\n  mobilenet_ssd_int8  min =   77.17  max =   98.29  avg =   86.90\n      mobilenet_yolo  min =  252.29  max =  295.62  avg =  265.58\n  mobilenetv2_yolov3  min =  114.28  max =  159.82  avg =  140.03\n         yolov4-tiny  min =  150.99  max =  203.07  avg =  165.18\n           nanodet_m  min =   34.48  max =   71.56  avg =   49.84\n    yolo-fastest-1.1  min =   15.36  max =   30.00  avg =   17.11\n      yolo-fastestv2  min =   15.42  max =   26.96  avg =   16.78\n  vision_transformer  min = 4187.60  max = 4319.84  avg = 4220.05\n          FastestDet  min =   16.30  max =   24.88  avg =   17.49\n\n```\n\n\n\n\n### Loongson 3A5000 (LA464 2.5GHz * 4)\n```\nroot@3A5K:~/Desktop/ncnn-20230223/build/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   11.97  max =   19.38  avg =   13.61\n     squeezenet_int8  min =   14.96  max =   15.36  avg =   15.12\n           mobilenet  min =   20.14  max =   27.50  avg =   21.12\n      mobilenet_int8  min =   25.28  max =   35.06  avg =   27.37\n        mobilenet_v2  min =   12.82  max =   13.20  avg =   12.98\n        mobilenet_v3  min =   11.39  max =   25.03  avg =   12.86\n          shufflenet  min =    7.35  max =    7.50  avg =    7.40\n       shufflenet_v2  min =    7.12  max =    7.23  avg =    7.18\n             mnasnet  min =   12.85  max =   21.69  avg =   13.83\n     proxylessnasnet  min =   15.35  max =   15.79  avg =   15.43\n     efficientnet_b0  min =   24.20  max =   24.46  avg =   24.30\n   efficientnetv2_b0  min =   26.80  max =   
42.43  avg =   29.25\n        regnety_400m  min =   22.85  max =   38.30  avg =   24.51\n           blazeface  min =    2.57  max =    2.67  avg =    2.60\n           googlenet  min =   49.09  max =   85.91  avg =   67.57\n      googlenet_int8  min =   64.89  max =   95.28  avg =   76.41\n            resnet18  min =   42.43  max =   62.39  avg =   52.38\n       resnet18_int8  min =   47.96  max =   68.69  avg =   56.75\n             alexnet  min =   46.01  max =   59.26  avg =   49.20\n               vgg16  min =  246.82  max =  261.80  avg =  252.81\n          vgg16_int8  min =  247.13  max =  256.81  avg =  252.37\n            resnet50  min =  102.17  max =  138.16  avg =  117.65\n       resnet50_int8  min =  115.09  max =  151.30  avg =  129.13\n      squeezenet_ssd  min =   43.62  max =   70.64  avg =   53.89\n squeezenet_ssd_int8  min =   38.66  max =   60.12  avg =   47.66\n       mobilenet_ssd  min =   42.67  max =   68.78  avg =   53.95\n  mobilenet_ssd_int8  min =   56.29  max =   68.31  avg =   59.86\n      mobilenet_yolo  min =  129.04  max =  188.26  avg =  149.64\n  mobilenetv2_yolov3  min =   61.80  max =   71.41  avg =   66.43\n         yolov4-tiny  min =   88.64  max =  108.17  avg =   95.48\n           nanodet_m  min =   16.24  max =   16.57  avg =   16.34\n    yolo-fastest-1.1  min =    6.98  max =    7.16  avg =    7.05\n      yolo-fastestv2  min =    6.95  max =    7.29  avg =    7.08\n  vision_transformer  min = 2910.63  max = 3109.29  avg = 2949.04\n          FastestDet  min =    7.66  max =    7.90  avg =    7.80\n```\n\n### Loongson 3A6000 (LA664 2.5GHz * 4+4)\n\n```\n~/ncnn/build/benchmark$ ./benchncnn 10 8 2 -1 0\nloop_count = 10\nnum_threads = 8\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    7.12  max =    7.20  avg =    7.16\n     squeezenet_int8  min =    8.93  max =    9.20  avg =    8.98\n           mobilenet  min =   11.81  max =   11.88  avg =   11.84\n      mobilenet_int8  min =   14.25  max =   
14.33  avg =   14.28\n        mobilenet_v2  min =    8.06  max =    8.16  avg =    8.08\n        mobilenet_v3  min =    6.84  max =    6.90  avg =    6.87\n          shufflenet  min =    5.38  max =    5.44  avg =    5.39\n       shufflenet_v2  min =    5.20  max =    5.22  avg =    5.20\n             mnasnet  min =    8.06  max =    8.10  avg =    8.07\n     proxylessnasnet  min =    8.94  max =    9.09  avg =    8.99\n     efficientnet_b0  min =   13.43  max =   13.65  avg =   13.48\n   efficientnetv2_b0  min =   16.06  max =   16.18  avg =   16.11\n        regnety_400m  min =   18.11  max =   18.18  avg =   18.14\n           blazeface  min =    1.59  max =    1.61  avg =    1.60\n           googlenet  min =   26.08  max =   26.24  avg =   26.17\n      googlenet_int8  min =   31.25  max =   31.42  avg =   31.34\n            resnet18  min =   19.65  max =   19.73  avg =   19.69\n       resnet18_int8  min =   25.55  max =   25.66  avg =   25.60\n             alexnet  min =   19.56  max =   19.81  avg =   19.67\n               vgg16  min =  115.32  max =  116.38  avg =  115.99\n          vgg16_int8  min =  135.94  max =  136.73  avg =  136.34\n            resnet50  min =   56.46  max =   56.96  avg =   56.81\n       resnet50_int8  min =   66.13  max =   66.40  avg =   66.27\n      squeezenet_ssd  min =   22.84  max =   22.99  avg =   22.89\n squeezenet_ssd_int8  min =   22.34  max =   22.76  avg =   22.54\n       mobilenet_ssd  min =   24.67  max =   24.75  avg =   24.71\n  mobilenet_ssd_int8  min =   29.32  max =   29.37  avg =   29.34\n      mobilenet_yolo  min =   82.82  max =   84.02  avg =   83.40\n  mobilenetv2_yolov3  min =   30.31  max =   30.45  avg =   30.38\n         yolov4-tiny  min =   42.49  max =   42.74  avg =   42.62\n           nanodet_m  min =   11.00  max =   11.08  avg =   11.02\n    yolo-fastest-1.1  min =    5.28  max =    5.40  avg =    5.31\n      yolo-fastestv2  min =    5.09  max =    5.10  avg =    5.10\n  vision_transformer  min =  
869.40  max =  898.18  avg =  874.07\n          FastestDet  min =    5.28  max =    5.37  avg =    5.31\n```\n\n### Phytium FT-2000/4 (FTC663 armv8 2.2GHz x 4)\nTest on Kylin OS V10\n```\nmobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   40.92  max =   43.43  avg =   41.34\n     squeezenet_int8  min =   35.48  max =   36.07  avg =   35.75\n           mobilenet  min =   72.23  max =   72.53  avg =   72.39\n      mobilenet_int8  min =   48.10  max =   48.59  avg =   48.31\n        mobilenet_v2  min =   47.94  max =   48.45  avg =   48.13\n        mobilenet_v3  min =   37.95  max =   39.59  avg =   38.41\n          shufflenet  min =   21.51  max =   21.84  avg =   21.64\n       shufflenet_v2  min =   21.10  max =   21.45  avg =   21.26\n             mnasnet  min =   44.53  max =   45.15  avg =   44.74\n     proxylessnasnet  min =   53.02  max =   53.62  avg =   53.21\n     efficientnet_b0  min =   79.81  max =   80.51  avg =   80.15\n   efficientnetv2_b0  min =   92.55  max =  103.10  avg =   97.53\n        regnety_400m  min =   58.52  max =   70.04  avg =   64.20\n           blazeface  min =    6.06  max =    9.85  avg =    6.88\n           googlenet  min =  146.49  max =  162.69  avg =  152.98\n      googlenet_int8  min =  127.38  max =  132.11  avg =  128.51\n            resnet18  min =  107.79  max =  108.83  avg =  108.37\n       resnet18_int8  min =   97.28  max =   99.03  avg =   97.73\n             alexnet  min =   89.95  max =   91.63  avg =   90.28\n               vgg16  min =  642.27  max =  647.16  avg =  644.09\n          vgg16_int8  min =  567.03  max =  574.11  avg =  568.74\n            resnet50  min =  329.12  max =  331.79  avg =  330.10\n       resnet50_int8  min =  252.48  max =  253.65  avg =  252.93\n      squeezenet_ssd  min =   96.46  max =   96.95  avg =   96.69\n squeezenet_ssd_int8  min =   92.35  max =   93.24  
avg =   92.72\n       mobilenet_ssd  min =  149.14  max =  150.56  avg =  149.40\n  mobilenet_ssd_int8  min =   97.56  max =   98.03  avg =   97.82\n      mobilenet_yolo  min =  339.71  max =  340.60  avg =  339.89\n  mobilenetv2_yolov3  min =  174.53  max =  175.80  avg =  175.01\n         yolov4-tiny  min =  213.72  max =  214.94  avg =  214.08\n           nanodet_m  min =   49.95  max =   50.47  avg =   50.19\n    yolo-fastest-1.1  min =   23.80  max =   24.42  avg =   23.91\n      yolo-fastestv2  min =   19.78  max =   19.95  avg =   19.84\n  vision_transformer  min = 3927.51  max = 4025.76  avg = 3947.06\n          FastestDet  min =   21.78  max =   22.17  avg =   21.88\n\nmobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 4 1 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   70.80  max =   76.55  avg =   72.49\n     squeezenet_int8  min =  110.36  max =  133.06  avg =  114.23\n           mobilenet  min =   77.97  max =   85.73  avg =   79.98\n      mobilenet_int8  min =   80.05  max =   84.09  avg =   81.76\n        mobilenet_v2  min =  101.07  max =  192.92  avg =  139.32\n        mobilenet_v3  min =  108.60  max =  129.37  avg =  113.80\n          shufflenet  min =  160.96  max =  188.96  avg =  168.62\n       shufflenet_v2  min =   96.20  max =  190.31  avg =  119.77\n             mnasnet  min =   97.34  max =  104.00  avg =   99.85\n     proxylessnasnet  min =  112.58  max =  276.49  avg =  145.74\n     efficientnet_b0  min =  171.01  max =  238.15  avg =  195.53\n   efficientnetv2_b0  min =  235.31  max =  299.00  avg =  254.12\n        regnety_400m  min = 1059.87  max = 1173.49  avg = 1084.13\n           blazeface  min =   58.69  max =   64.83  avg =   60.83\n           googlenet  min =  190.47  max =  257.76  avg =  207.71\n      googlenet_int8  min =  285.67  max =  327.20  avg =  300.87\n            resnet18  min =  111.87  max =  118.36  avg =  114.48\n       resnet18_int8  
min =  143.08  max =  147.98  avg =  144.93\n             alexnet  min =   72.83  max =   76.52  avg =   74.01\n               vgg16  min =  390.35  max =  406.58  avg =  397.19\n          vgg16_int8  min =  358.54  max =  369.89  avg =  364.31\n            resnet50  min =  275.57  max =  300.14  avg =  283.21\n       resnet50_int8  min =  315.18  max =  371.22  avg =  328.43\n      squeezenet_ssd  min =  170.14  max =  200.18  avg =  175.23\n squeezenet_ssd_int8  min =  259.01  max =  271.23  avg =  263.35\n       mobilenet_ssd  min =  166.85  max =  170.64  avg =  168.74\n  mobilenet_ssd_int8  min =  191.71  max =  195.91  avg =  193.44\n      mobilenet_yolo  min =  960.70  max = 1080.81  avg =  983.68\n  mobilenetv2_yolov3  min =  187.72  max =  207.92  avg =  192.60\n         yolov4-tiny  min =  172.72  max =  177.62  avg =  174.63\n           nanodet_m  min =  128.79  max =  137.31  avg =  131.04\n    yolo-fastest-1.1  min =  132.39  max =  148.06  avg =  137.90\n      yolo-fastestv2  min =  130.97  max =  137.73  avg =  133.53\n  vision_transformer  min = 2229.10  max = 2392.59  avg = 2304.21\n          FastestDet  min =  119.98  max =  126.26  avg =  122.40\n\nmobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   70.93  max =   75.55  avg =   72.93\n     squeezenet_int8  min =  109.65  max =  153.48  avg =  124.20\n           mobilenet  min =   78.02  max =   85.80  avg =   81.97\n      mobilenet_int8  min =   80.34  max =   89.31  avg =   83.20\n        mobilenet_v2  min =   99.51  max =  110.36  avg =  102.54\n        mobilenet_v3  min =  109.04  max =  116.28  avg =  111.75\n          shufflenet  min =  160.04  max =  166.21  avg =  163.59\n       shufflenet_v2  min =   88.90  max =   91.82  avg =   90.24\n             mnasnet  min =   97.02  max =  103.09  avg =   98.70\n     proxylessnasnet  min =  111.21  max =  117.47  avg =  
113.97\n     efficientnet_b0  min =  167.99  max =  175.35  avg =  171.26\n   efficientnetv2_b0  min =  228.59  max =  245.97  avg =  232.79\n        regnety_400m  min = 1049.34  max = 1085.18  avg = 1064.68\n           blazeface  min =   59.35  max =   64.91  avg =   60.35\n           googlenet  min =  187.87  max =  195.29  avg =  190.56\n      googlenet_int8  min =  283.22  max =  301.69  avg =  287.66\n            resnet18  min =  111.48  max =  116.76  avg =  112.88\n       resnet18_int8  min =  142.41  max =  148.79  avg =  145.14\n             alexnet  min =   72.59  max =   75.37  avg =   73.62\n               vgg16  min =  389.61  max =  452.95  avg =  424.36\n          vgg16_int8  min =  365.57  max =  465.13  avg =  422.84\n            resnet50  min =  283.07  max =  411.14  avg =  332.88\n       resnet50_int8  min =  323.21  max =  381.13  avg =  340.59\n      squeezenet_ssd  min =  178.21  max =  252.82  avg =  211.62\n squeezenet_ssd_int8  min =  263.82  max =  372.38  avg =  284.38\n       mobilenet_ssd  min =  166.29  max =  281.36  avg =  195.16\n  mobilenet_ssd_int8  min =  194.00  max =  220.95  avg =  204.07\n      mobilenet_yolo  min =  964.99  max = 1027.13  avg =  989.45\n  mobilenetv2_yolov3  min =  218.58  max =  512.86  avg =  265.12\n         yolov4-tiny  min =  172.20  max =  177.27  avg =  174.14\n           nanodet_m  min =  128.78  max =  222.66  avg =  150.88\n    yolo-fastest-1.1  min =  132.52  max =  196.41  avg =  149.03\n      yolo-fastestv2  min =  131.39  max =  138.72  avg =  134.96\n  vision_transformer  min = 2243.31  max = 2659.56  avg = 2395.76\n          FastestDet  min =  119.44  max =  126.07  avg =  122.27\n\n```\n\n\n\n### Phytium FT-2000+/64 (FTC662 armv8 2.4GHz x 8)\n```\n[root@bogon benchmark]# ./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   57.60  max =   59.78  avg =   58.51\n     squeezenet_int8  min =   47.05  max =   
47.89  avg =   47.40\n           mobilenet  min =   91.08  max =   95.16  avg =   91.89\n      mobilenet_int8  min =   60.27  max =   61.17  avg =   60.74\n        mobilenet_v2  min =   63.38  max =   68.12  avg =   66.96\n        mobilenet_v3  min =   53.34  max =   54.71  avg =   54.01\n          shufflenet  min =   37.87  max =   41.78  avg =   39.37\n       shufflenet_v2  min =   35.89  max =   37.30  avg =   36.40\n             mnasnet  min =   59.57  max =   63.23  avg =   60.25\n     proxylessnasnet  min =   71.24  max =   71.93  avg =   71.51\n     efficientnet_b0  min =  134.34  max =  141.14  avg =  137.74\n   efficientnetv2_b0  min =  143.82  max =  145.63  avg =  144.36\n        regnety_400m  min =   76.96  max =   77.66  avg =   77.27\n           blazeface  min =   11.57  max =   11.90  avg =   11.70\n           googlenet  min =  188.10  max =  191.27  avg =  189.02\n      googlenet_int8  min =  167.54  max =  169.63  avg =  168.38\n            resnet18  min =  144.76  max =  163.39  avg =  154.95\n       resnet18_int8  min =  124.14  max =  129.84  avg =  127.83\n             alexnet  min =  198.22  max =  208.86  avg =  205.35\n               vgg16  min =  848.10  max =  891.00  avg =  859.94\n          vgg16_int8  min =  686.54  max =  742.77  avg =  704.74\n            resnet50  min =  413.45  max =  428.84  avg =  417.81\n       resnet50_int8  min =  306.32  max =  324.27  avg =  316.47\n      squeezenet_ssd  min =  147.62  max =  149.58  avg =  148.48\n squeezenet_ssd_int8  min =  116.18  max =  134.86  avg =  126.93\n       mobilenet_ssd  min =  188.49  max =  191.97  avg =  189.48\n  mobilenet_ssd_int8  min =  120.28  max =  121.36  avg =  120.83\n      mobilenet_yolo  min =  421.79  max =  425.68  avg =  423.51\n  mobilenetv2_yolov3  min =  222.86  max =  225.58  avg =  224.01\n         yolov4-tiny  min =  303.77  max =  310.70  avg =  307.45\n           nanodet_m  min =   80.87  max =   82.11  avg =   81.35\n\n[root@bogon benchmark]# 
./benchncnn 10 8 0 -1 0\nloop_count = 10\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   14.53  max =   14.92  avg =   14.68\n     squeezenet_int8  min =   11.67  max =   11.89  avg =   11.82\n           mobilenet  min =   17.60  max =   20.05  avg =   18.34\n      mobilenet_int8  min =    9.94  max =   10.22  avg =   10.08\n        mobilenet_v2  min =   18.46  max =   19.18  avg =   18.81\n        mobilenet_v3  min =   16.30  max =   16.71  avg =   16.45\n          shufflenet  min =   14.65  max =   14.93  avg =   14.78\n       shufflenet_v2  min =   11.23  max =   11.56  avg =   11.35\n             mnasnet  min =   15.65  max =   16.08  avg =   15.92\n     proxylessnasnet  min =   18.78  max =   21.72  avg =   19.68\n     efficientnet_b0  min =   29.16  max =   29.62  avg =   29.37\n   efficientnetv2_b0  min =   33.28  max =   35.48  avg =   34.23\n        regnety_400m  min =   44.90  max =   47.36  avg =   46.32\n           blazeface  min =    4.23  max =    4.43  avg =    4.30\n           googlenet  min =   42.11  max =   42.98  avg =   42.38\n      googlenet_int8  min =   33.24  max =   38.21  avg =   34.10\n            resnet18  min =   33.27  max =   34.00  avg =   33.57\n       resnet18_int8  min =   23.66  max =   24.78  avg =   24.24\n             alexnet  min =   35.78  max =   37.68  avg =   36.46\n               vgg16  min =  219.60  max =  235.79  avg =  222.11\n          vgg16_int8  min =  128.64  max =  135.19  avg =  130.73\n            resnet50  min =   84.15  max =   85.48  avg =   84.66\n       resnet50_int8  min =   58.87  max =   61.98  avg =   59.85\n      squeezenet_ssd  min =   47.60  max =   50.24  avg =   48.54\n squeezenet_ssd_int8  min =   36.42  max =   37.89  avg =   36.99\n       mobilenet_ssd  min =   39.37  max =   42.63  avg =   41.06\n  mobilenet_ssd_int8  min =   21.59  max =   22.05  avg =   21.83\n      mobilenet_yolo  min =   83.16  max =   88.75  avg =   85.29\n  
mobilenetv2_yolov3  min =   58.13  max =   59.50  avg =   58.62\n         yolov4-tiny  min =   74.18  max =   76.56  avg =   75.13\n           nanodet_m  min =   25.16  max =   31.45  avg =   26.71\n\nroot@FT2K:~/Desktop/ncnn-20221128/build/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   14.19  max =   21.46  avg =   15.16\n     squeezenet_int8  min =   11.63  max =   12.08  avg =   11.91\n           mobilenet  min =   20.52  max =   37.00  avg =   23.66\n      mobilenet_int8  min =   13.38  max =   25.95  avg =   15.01\n        mobilenet_v2  min =   15.80  max =   16.59  avg =   16.12\n        mobilenet_v3  min =   13.38  max =   17.62  avg =   14.21\n          shufflenet  min =   10.62  max =   11.10  avg =   10.85\n       shufflenet_v2  min =    9.09  max =   12.30  avg =    9.66\n             mnasnet  min =   14.85  max =   15.67  avg =   15.14\n     proxylessnasnet  min =   16.83  max =   17.10  avg =   16.98\n     efficientnet_b0  min =   24.59  max =   26.40  avg =   25.06\n   efficientnetv2_b0  min =   30.25  max =   34.46  avg =   31.42\n        regnety_400m  min =   32.37  max =   41.10  avg =   35.17\n           blazeface  min =    3.00  max =    3.56  avg =    3.18\n           googlenet  min =   49.52  max =   64.98  avg =   56.29\n      googlenet_int8  min =   38.65  max =   52.51  avg =   43.90\n            resnet18  min =   42.81  max =   53.94  avg =   45.38\n       resnet18_int8  min =   32.53  max =   53.62  avg =   37.26\n             alexnet  min =   33.92  max =   47.88  avg =   37.12\n               vgg16  min =  214.19  max =  228.96  avg =  220.16\n          vgg16_int8  min =  164.22  max =  224.51  avg =  180.15\n            resnet50  min =  106.90  max =  189.61  avg =  133.34\n       resnet50_int8  min =   79.62  max =   94.41  avg =   83.56\n      squeezenet_ssd  min =   48.00  max =   49.11  avg =   48.43\n squeezenet_ssd_int8  min =   
33.59  max =   47.60  avg =   37.57\n       mobilenet_ssd  min =   43.97  max =   58.84  avg =   49.64\n  mobilenet_ssd_int8  min =   27.94  max =   32.89  avg =   29.56\n      mobilenet_yolo  min =  107.29  max =  118.80  avg =  114.24\n  mobilenetv2_yolov3  min =   63.44  max =  106.75  avg =   70.69\n         yolov4-tiny  min =   89.93  max =  155.39  avg =  101.90\n           nanodet_m  min =   20.34  max =   28.67  avg =   21.44\n    yolo-fastest-1.1  min =   11.74  max =   12.24  avg =   11.96\n      yolo-fastestv2  min =    9.81  max =    9.98  avg =    9.91\n  vision_transformer  min = 1617.60  max = 1634.13  avg = 1625.87\n          FastestDet  min =   10.19  max =   10.55  avg =   10.36\n```\n### HUAWEI KunPeng 920 2251K (x8 cores)\ntest on UOS 1050\n```\nmobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   12.11  max =   12.40  avg =   12.25\n     squeezenet_int8  min =   14.24  max =   14.50  avg =   14.36\n           mobilenet  min =   20.52  max =   21.11  avg =   20.63\n      mobilenet_int8  min =   18.29  max =   18.63  avg =   18.45\n        mobilenet_v2  min =   13.73  max =   13.90  avg =   13.79\n        mobilenet_v3  min =   11.37  max =   11.49  avg =   11.41\n          shufflenet  min =    7.90  max =    7.96  avg =    7.92\n       shufflenet_v2  min =    8.09  max =    8.13  avg =    8.11\n             mnasnet  min =   13.26  max =   13.44  avg =   13.30\n     proxylessnasnet  min =   16.19  max =   16.39  avg =   16.26\n     efficientnet_b0  min =   34.92  max =   35.22  avg =   35.04\n   efficientnetv2_b0  min =   43.82  max =   44.39  avg =   43.94\n        regnety_400m  min =   17.55  max =   18.02  avg =   17.65\n           blazeface  min =    3.05  max =    3.08  avg =    3.07\n           googlenet  min =   58.65  max =   59.26  avg =   58.89\n      googlenet_int8  min =   60.55  max =   63.00  avg =   
61.96\n            resnet18  min =   34.27  max =   35.43  avg =   34.84\n       resnet18_int8  min =   60.79  max =   62.15  avg =   61.47\n             alexnet  min =   42.01  max =   44.43  avg =   43.36\n               vgg16  min =  174.46  max =  177.33  avg =  175.57\n          vgg16_int8  min =  453.93  max =  457.03  avg =  454.79\n            resnet50  min =   95.36  max =   96.27  avg =   95.55\n       resnet50_int8  min =  119.77  max =  121.26  avg =  120.46\n      squeezenet_ssd  min =   39.05  max =   39.69  avg =   39.20\n squeezenet_ssd_int8  min =   55.06  max =   56.23  avg =   55.72\n       mobilenet_ssd  min =   45.20  max =   45.96  avg =   45.49\n  mobilenet_ssd_int8  min =   39.40  max =   40.13  avg =   39.76\n      mobilenet_yolo  min =   98.86  max =   99.85  avg =   99.34\n  mobilenetv2_yolov3  min =   51.17  max =   52.89  avg =   51.89\n         yolov4-tiny  min =   66.43  max =   67.23  avg =   66.70\n           nanodet_m  min =   20.59  max =   20.79  avg =   20.71\n    yolo-fastest-1.1  min =    7.90  max =    7.99  avg =    7.93\n      yolo-fastestv2  min =    7.45  max =    7.49  avg =    7.47\n  vision_transformer  min = 1586.33  max = 1595.34  avg = 1589.76\n          FastestDet  min =    7.45  max =    7.52  avg =    7.47\n\nmobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 8 0 -1 0\nloop_count = 10\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    2.93  max =    3.10  avg =    3.00\n     squeezenet_int8  min =    3.47  max =    3.56  avg =    3.52\n           mobilenet  min =    3.89  max =    4.04  avg =    3.94\n      mobilenet_int8  min =    3.29  max =    3.39  avg =    3.33\n        mobilenet_v2  min =    3.95  max =    4.08  avg =    3.98\n        mobilenet_v3  min =    3.45  max =    3.59  avg =    3.49\n          shufflenet  min =    3.42  max =    4.66  avg =    3.62\n       shufflenet_v2  min =    2.60  max =    2.94  avg =    2.68\n             mnasnet  min =   
 3.46  max =    3.57  avg =    3.52\n     proxylessnasnet  min =    3.94  max =   12.34  avg =    4.88\n     efficientnet_b0  min =    7.31  max =    7.60  avg =    7.38\n   efficientnetv2_b0  min =    9.01  max =    9.22  avg =    9.08\n        regnety_400m  min =    8.56  max =    9.36  avg =    8.70\n           blazeface  min =    1.36  max =    3.52  avg =    1.60\n           googlenet  min =   11.80  max =   12.02  avg =   11.93\n      googlenet_int8  min =   11.87  max =   23.09  avg =   13.16\n            resnet18  min =    7.27  max =    7.64  avg =    7.38\n       resnet18_int8  min =   11.02  max =   11.73  avg =   11.20\n             alexnet  min =    9.05  max =    9.35  avg =    9.17\n               vgg16  min =   44.13  max =   50.84  avg =   46.89\n          vgg16_int8  min =   75.15  max =   80.73  avg =   77.52\n            resnet50  min =   18.72  max =   27.49  avg =   19.96\n       resnet50_int8  min =   22.72  max =   36.80  avg =   26.78\n      squeezenet_ssd  min =   13.96  max =   27.42  avg =   15.62\n squeezenet_ssd_int8  min =   15.01  max =   29.53  avg =   19.51\n       mobilenet_ssd  min =    9.37  max =   13.34  avg =   10.44\n  mobilenet_ssd_int8  min =    8.07  max =   24.28  avg =    9.83\n      mobilenet_yolo  min =   22.06  max =   24.89  avg =   22.91\n  mobilenetv2_yolov3  min =   14.41  max =   15.97  avg =   14.78\n         yolov4-tiny  min =   20.71  max =   23.96  avg =   21.42\n           nanodet_m  min =    6.37  max =    6.59  avg =    6.45\n    yolo-fastest-1.1  min =    4.27  max =    4.52  avg =    4.34\n      yolo-fastestv2  min =    3.53  max =    3.63  avg =    3.58\n  vision_transformer  min =  435.60  max =  523.43  avg =  479.70\n          FastestDet  min =    3.54  max =    7.95  avg =    5.24\n\nmobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    4.04  max =    4.22  avg =    
4.09\n     squeezenet_int8  min =    4.64  max =    4.76  avg =    4.69\n           mobilenet  min =    6.04  max =    6.06  avg =    6.05\n      mobilenet_int8  min =    5.23  max =    5.32  avg =    5.25\n        mobilenet_v2  min =    5.00  max =    5.03  avg =    5.01\n        mobilenet_v3  min =    4.49  max =    4.69  avg =    4.52\n          shufflenet  min =    3.90  max =    3.94  avg =    3.91\n       shufflenet_v2  min =    3.27  max =    3.48  avg =    3.33\n             mnasnet  min =    4.80  max =    4.83  avg =    4.82\n     proxylessnasnet  min =    5.20  max =    5.28  avg =    5.23\n     efficientnet_b0  min =   10.53  max =   11.06  avg =   10.68\n   efficientnetv2_b0  min =   13.18  max =   13.37  avg =   13.25\n        regnety_400m  min =    9.20  max =    9.25  avg =    9.22\n           blazeface  min =    1.43  max =    1.45  avg =    1.44\n           googlenet  min =   17.63  max =   17.78  avg =   17.71\n      googlenet_int8  min =   17.63  max =   18.03  avg =   17.85\n            resnet18  min =   10.34  max =   10.59  avg =   10.40\n       resnet18_int8  min =   17.93  max =   18.84  avg =   18.25\n             alexnet  min =   13.28  max =   13.37  avg =   13.31\n               vgg16  min =   55.41  max =   56.60  avg =   55.70\n          vgg16_int8  min =  123.71  max =  125.34  avg =  124.48\n            resnet50  min =   27.82  max =   28.22  avg =   27.95\n       resnet50_int8  min =   34.50  max =   34.89  avg =   34.70\n      squeezenet_ssd  min =   14.67  max =   15.19  avg =   14.85\n squeezenet_ssd_int8  min =   19.76  max =   20.32  avg =   19.87\n       mobilenet_ssd  min =   13.15  max =   13.38  avg =   13.21\n  mobilenet_ssd_int8  min =   11.52  max =   11.70  avg =   11.60\n      mobilenet_yolo  min =   30.95  max =   31.28  avg =   31.05\n  mobilenetv2_yolov3  min =   20.04  max =   20.36  avg =   20.16\n         yolov4-tiny  min =   25.61  max =   26.73  avg =   25.80\n           nanodet_m  min =    7.93  max =    7.97 
 avg =    7.95\n    yolo-fastest-1.1  min =    4.52  max =    4.59  avg =    4.53\n      yolo-fastestv2  min =    3.74  max =    3.88  avg =    3.77\n  vision_transformer  min =  546.94  max =  726.81  avg =  698.27\n          FastestDet  min =    3.59  max =    3.61  avg =    3.60\n```\n\n### HUAWEI KunPeng 920 3211K (x24 cores)\ntest on ubuntu 22.04\n```\n(base) mobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   12.11  max =   12.20  avg =   12.14\n     squeezenet_int8  min =   14.34  max =   14.46  avg =   14.41\n           mobilenet  min =   20.27  max =   20.36  avg =   20.31\n      mobilenet_int8  min =   17.45  max =   17.74  avg =   17.58\n        mobilenet_v2  min =   13.72  max =   13.87  avg =   13.78\n        mobilenet_v3  min =   11.51  max =   11.69  avg =   11.61\n          shufflenet  min =    8.07  max =    8.36  avg =    8.20\n       shufflenet_v2  min =    8.13  max =    8.17  avg =    8.14\n             mnasnet  min =   13.34  max =   13.45  avg =   13.41\n     proxylessnasnet  min =   16.22  max =   16.35  avg =   16.29\n     efficientnet_b0  min =   34.69  max =   35.14  avg =   34.82\n   efficientnetv2_b0  min =   44.54  max =   44.68  avg =   44.61\n        regnety_400m  min =   18.06  max =   18.15  avg =   18.10\n           blazeface  min =    3.06  max =    3.22  avg =    3.12\n           googlenet  min =   56.80  max =   57.60  avg =   57.08\n      googlenet_int8  min =   58.64  max =   59.98  avg =   59.42\n            resnet18  min =   35.02  max =   35.35  avg =   35.10\n       resnet18_int8  min =   61.13  max =   61.68  avg =   61.33\n             alexnet  min =   42.56  max =   43.05  avg =   42.69\n               vgg16  min =  186.32  max =  188.73  avg =  187.20\n          vgg16_int8  min =  459.01  max =  461.48  avg =  460.29\n            resnet50  min =   97.59  max =   98.32  avg =   97.83\n       
resnet50_int8  min =  118.67  max =  120.45  avg =  119.78\n      squeezenet_ssd  min =   39.62  max =   39.95  avg =   39.81\n squeezenet_ssd_int8  min =   56.72  max =   57.63  avg =   57.00\n       mobilenet_ssd  min =   45.44  max =   45.82  avg =   45.63\n  mobilenet_ssd_int8  min =   38.99  max =   40.08  avg =   39.39\n      mobilenet_yolo  min =   98.71  max =   99.27  avg =   98.94\n  mobilenetv2_yolov3  min =   51.50  max =   52.41  avg =   51.87\n         yolov4-tiny  min =   68.02  max =   68.43  avg =   68.24\n           nanodet_m  min =   20.49  max =   20.64  avg =   20.59\n    yolo-fastest-1.1  min =    8.17  max =    8.45  avg =    8.23\n      yolo-fastestv2  min =    7.73  max =    8.06  avg =    7.87\n  vision_transformer  min = 1620.65  max = 1630.45  avg = 1625.64\n          FastestDet  min =    7.65  max =    7.77  avg =    7.69\n(base) mobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 2 0 -1 0\nloop_count = 10\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.77  max =    6.85  avg =    6.81\n     squeezenet_int8  min =    7.98  max =    8.07  avg =    8.03\n           mobilenet  min =   10.70  max =   10.78  avg =   10.73\n      mobilenet_int8  min =    9.21  max =    9.36  avg =    9.28\n        mobilenet_v2  min =    7.91  max =    7.99  avg =    7.94\n        mobilenet_v3  min =    6.72  max =    6.92  avg =    6.78\n          shufflenet  min =    5.34  max =    5.55  avg =    5.38\n       shufflenet_v2  min =    5.12  max =    5.15  avg =    5.14\n             mnasnet  min =    7.74  max =    7.86  avg =    7.80\n     proxylessnasnet  min =    9.00  max =    9.03  avg =    9.02\n     efficientnet_b0  min =   18.51  max =   18.58  avg =   18.54\n   efficientnetv2_b0  min =   23.68  max =   23.83  avg =   23.74\n        regnety_400m  min =   12.65  max =   12.68  avg =   12.66\n           blazeface  min =    1.99  max =    2.14  avg =    2.03\n           googlenet  min =   30.83  
max =   31.29  avg =   30.91\n      googlenet_int8  min =   31.97  max =   33.12  avg =   32.45\n            resnet18  min =   18.81  max =   18.87  avg =   18.84\n       resnet18_int8  min =   32.80  max =   32.99  avg =   32.90\n             alexnet  min =   22.88  max =   23.16  avg =   22.94\n               vgg16  min =  100.58  max =  101.12  avg =  100.90\n          vgg16_int8  min =  235.81  max =  237.97  avg =  236.20\n            resnet50  min =   51.12  max =   51.43  avg =   51.28\n       resnet50_int8  min =   62.46  max =   63.02  avg =   62.72\n      squeezenet_ssd  min =   23.26  max =   23.73  avg =   23.38\n squeezenet_ssd_int8  min =   31.91  max =   32.30  avg =   32.13\n       mobilenet_ssd  min =   24.73  max =   24.95  avg =   24.84\n  mobilenet_ssd_int8  min =   20.99  max =   21.52  avg =   21.21\n      mobilenet_yolo  min =   54.91  max =   55.70  avg =   55.15\n  mobilenetv2_yolov3  min =   30.18  max =   30.52  avg =   30.31\n         yolov4-tiny  min =   40.46  max =   40.61  avg =   40.55\n           nanodet_m  min =   12.56  max =   12.72  avg =   12.62\n    yolo-fastest-1.1  min =    6.00  max =    6.15  avg =    6.04\n      yolo-fastestv2  min =    5.32  max =    5.59  avg =    5.43\n  vision_transformer  min =  894.51  max =  896.28  avg =  895.57\n          FastestDet  min =    5.33  max =    5.42  avg =    5.36\n(base) mobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 4 0 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    4.18  max =    4.35  avg =    4.22\n     squeezenet_int8  min =    4.85  max =    4.98  avg =    4.89\n           mobilenet  min =    5.80  max =    5.95  avg =    5.89\n      mobilenet_int8  min =    4.86  max =    4.94  avg =    4.89\n        mobilenet_v2  min =    4.66  max =    4.73  avg =    4.69\n        mobilenet_v3  min =    4.46  max =    4.50  avg =    4.48\n          shufflenet  min =    4.01  max =    4.17  avg =    4.04\n  
     shufflenet_v2  min =    3.39  max =    3.41  avg =    3.39\n             mnasnet  min =    4.81  max =    4.93  avg =    4.85\n     proxylessnasnet  min =    5.47  max =    5.54  avg =    5.49\n     efficientnet_b0  min =   10.49  max =   10.55  avg =   10.52\n   efficientnetv2_b0  min =   13.67  max =   13.77  avg =   13.72\n        regnety_400m  min =   10.20  max =   10.24  avg =   10.21\n           blazeface  min =    1.52  max =    1.58  avg =    1.54\n           googlenet  min =   17.65  max =   17.69  avg =   17.68\n      googlenet_int8  min =   18.14  max =   18.27  avg =   18.19\n            resnet18  min =   10.52  max =   10.63  avg =   10.57\n       resnet18_int8  min =   17.42  max =   17.53  avg =   17.49\n             alexnet  min =   13.12  max =   13.20  avg =   13.16\n               vgg16  min =   55.24  max =   55.45  avg =   55.35\n          vgg16_int8  min =  123.46  max =  124.23  avg =  123.75\n            resnet50  min =   28.31  max =   28.57  avg =   28.39\n       resnet50_int8  min =   34.10  max =   34.39  avg =   34.23\n      squeezenet_ssd  min =   14.85  max =   14.96  avg =   14.91\n squeezenet_ssd_int8  min =   19.71  max =   19.88  avg =   19.82\n       mobilenet_ssd  min =   13.49  max =   13.58  avg =   13.52\n  mobilenet_ssd_int8  min =   11.60  max =   11.70  avg =   11.66\n      mobilenet_yolo  min =   31.74  max =   31.96  avg =   31.81\n  mobilenetv2_yolov3  min =   17.87  max =   18.03  avg =   17.93\n         yolov4-tiny  min =   25.63  max =   25.78  avg =   25.72\n           nanodet_m  min =    8.16  max =    8.22  avg =    8.20\n    yolo-fastest-1.1  min =    4.72  max =    4.86  avg =    4.75\n      yolo-fastestv2  min =    3.98  max =    4.15  avg =    4.00\n  vision_transformer  min =  501.18  max =  503.51  avg =  502.12\n          FastestDet  min =    3.74  max =    3.76  avg =    3.75\n(base) mobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 8 0 -1 0\nloop_count = 10\nnum_threads = 8\npowersave = 
0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    2.91  max =    3.10  avg =    2.97\n     squeezenet_int8  min =    3.42  max =    3.74  avg =    3.51\n           mobilenet  min =    3.57  max =    3.70  avg =    3.61\n      mobilenet_int8  min =    3.06  max =    3.14  avg =    3.10\n        mobilenet_v2  min =    3.73  max =    3.75  avg =    3.75\n        mobilenet_v3  min =    3.50  max =    3.66  avg =    3.56\n          shufflenet  min =    3.63  max =    3.65  avg =    3.64\n       shufflenet_v2  min =    2.85  max =    3.02  avg =    2.95\n             mnasnet  min =    3.60  max =    3.67  avg =    3.62\n     proxylessnasnet  min =    4.00  max =    4.08  avg =    4.03\n     efficientnet_b0  min =    7.31  max =    7.34  avg =    7.33\n   efficientnetv2_b0  min =    9.44  max =    9.51  avg =    9.47\n        regnety_400m  min =    9.76  max =   10.07  avg =    9.90\n           blazeface  min =    1.56  max =    1.75  avg =    1.61\n           googlenet  min =   11.22  max =   11.28  avg =   11.25\n      googlenet_int8  min =   11.40  max =   12.82  avg =   11.76\n            resnet18  min =    6.83  max =    6.96  avg =    6.90\n       resnet18_int8  min =   10.28  max =   10.38  avg =   10.33\n             alexnet  min =    8.75  max =    8.88  avg =    8.80\n               vgg16  min =   36.00  max =   36.72  avg =   36.29\n          vgg16_int8  min =   67.38  max =   67.72  avg =   67.54\n            resnet50  min =   17.63  max =   17.82  avg =   17.68\n       resnet50_int8  min =   20.05  max =   20.21  avg =   20.15\n      squeezenet_ssd  min =   11.18  max =   11.45  avg =   11.26\n squeezenet_ssd_int8  min =   14.09  max =   14.23  avg =   14.18\n       mobilenet_ssd  min =    8.60  max =    8.69  avg =    8.64\n  mobilenet_ssd_int8  min =    7.75  max =    7.87  avg =    7.81\n      mobilenet_yolo  min =   21.97  max =   22.25  avg =   22.09\n  mobilenetv2_yolov3  min =   14.04  max =   14.18  avg =   14.12\n         
yolov4-tiny  min =   19.66  max =   19.93  avg =   19.81\n           nanodet_m  min =    6.52  max =    6.67  avg =    6.57\n    yolo-fastest-1.1  min =    4.61  max =    4.76  avg =    4.66\n      yolo-fastestv2  min =    3.78  max =    3.91  avg =    3.82\n  vision_transformer  min =  323.01  max =  327.38  avg =  323.75\n          FastestDet  min =    3.50  max =    3.54  avg =    3.51\n(base) mobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 16 0 -1 0\nloop_count = 10\nnum_threads = 16\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    3.00  max =    3.25  avg =    3.08\n     squeezenet_int8  min =    4.13  max =    4.47  avg =    4.21\n           mobilenet  min =    3.27  max =    3.42  avg =    3.34\n      mobilenet_int8  min =    3.49  max =    3.58  avg =    3.56\n        mobilenet_v2  min =    3.86  max =    4.10  avg =    3.97\n        mobilenet_v3  min =    3.72  max =    3.80  avg =    3.76\n          shufflenet  min =    4.67  max =    4.78  avg =    4.72\n       shufflenet_v2  min =    3.16  max =    3.24  avg =    3.20\n             mnasnet  min =    3.51  max =    3.65  avg =    3.57\n     proxylessnasnet  min =    4.08  max =    4.35  avg =    4.15\n     efficientnet_b0  min =    7.51  max =    7.80  avg =    7.63\n   efficientnetv2_b0  min =    8.92  max =    9.39  avg =    9.05\n        regnety_400m  min =   14.80  max =   15.05  avg =   14.89\n           blazeface  min =    2.14  max =    2.28  avg =    2.20\n           googlenet  min =    9.91  max =   10.00  avg =    9.96\n      googlenet_int8  min =   11.51  max =   11.65  avg =   11.60\n            resnet18  min =    6.39  max =    6.56  avg =    6.46\n       resnet18_int8  min =    9.76  max =    9.91  avg =    9.84\n             alexnet  min =    6.99  max =    7.10  avg =    7.04\n               vgg16  min =   27.52  max =   28.64  avg =   27.88\n          vgg16_int8  min =   45.64  max =   45.93  avg =   45.78\n            resnet50  min =   13.96  
max =   14.17  avg =   14.07\n       resnet50_int8  min =   16.82  max =   16.93  avg =   16.89\n      squeezenet_ssd  min =   11.11  max =   11.54  avg =   11.23\n squeezenet_ssd_int8  min =   13.77  max =   14.00  avg =   13.88\n       mobilenet_ssd  min =    8.21  max =    8.46  avg =    8.35\n  mobilenet_ssd_int8  min =    8.87  max =    9.03  avg =    8.94\n      mobilenet_yolo  min =   30.77  max =   31.35  avg =   31.08\n  mobilenetv2_yolov3  min =   12.11  max =   13.10  avg =   12.43\n         yolov4-tiny  min =   18.25  max =   18.68  avg =   18.41\n           nanodet_m  min =    6.55  max =    6.68  avg =    6.59\n    yolo-fastest-1.1  min =    6.00  max =    6.22  avg =    6.09\n      yolo-fastestv2  min =    4.86  max =    5.01  avg =    4.94\n  vision_transformer  min =  218.18  max =  220.49  avg =  218.79\n          FastestDet  min =    5.01  max =    5.14  avg =    5.07\n(base) mobtgzhang@mobtgzhang-PC:~/ncnn/benchmark$ ./benchncnn 10 24 0 -1 0\nloop_count = 10\nnum_threads = 24\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    3.52  max =    3.96  avg =    3.70\n     squeezenet_int8  min =    5.49  max =    5.83  avg =    5.65\n           mobilenet  min =    3.42  max =    3.83  avg =    3.55\n      mobilenet_int8  min =    3.69  max =   45.17  avg =   11.59\n        mobilenet_v2  min =    4.63  max =    5.44  avg =    4.84\n        mobilenet_v3  min =    4.51  max =    4.89  avg =    4.68\n          shufflenet  min =    6.21  max =    6.52  avg =    6.36\n       shufflenet_v2  min =    3.98  max =   17.54  avg =    5.45\n             mnasnet  min =    4.28  max =    4.56  avg =    4.39\n     proxylessnasnet  min =    4.76  max =    5.13  avg =    4.92\n     efficientnet_b0  min =    7.45  max =  111.76  avg =   22.59\n   efficientnetv2_b0  min =   10.87  max =   33.13  avg =   13.51\n        regnety_400m  min =   20.97  max =   21.73  avg =   21.46\n           blazeface  min =    2.56  max =    2.82  avg =    
2.67\n           googlenet  min =   10.54  max =  105.87  avg =   21.85\n      googlenet_int8  min =   14.21  max =   77.02  avg =   22.23\n            resnet18  min =    7.08  max =    7.51  avg =    7.31\n       resnet18_int8  min =   11.25  max =   50.66  avg =   19.14\n             alexnet  min =    7.13  max =    8.67  avg =    7.44\n               vgg16  min =   27.59  max =   35.35  avg =   29.12\n          vgg16_int8  min =   44.43  max =   51.76  avg =   46.90\n            resnet50  min =   15.16  max =  105.98  avg =   24.91\n       resnet50_int8  min =   19.82  max =   20.50  avg =   20.16\n      squeezenet_ssd  min =   13.03  max =   13.69  avg =   13.40\n squeezenet_ssd_int8  min =   17.62  max =  187.55  avg =   39.92\n       mobilenet_ssd  min =    8.83  max =   71.97  avg =   15.37\n  mobilenet_ssd_int8  min =   10.22  max =   49.61  avg =   15.26\n      mobilenet_yolo  min =   35.19  max =   46.43  avg =   36.93\n  mobilenetv2_yolov3  min =   12.96  max =   15.57  avg =   13.41\n         yolov4-tiny  min =   19.22  max =   21.43  avg =   19.89\n           nanodet_m  min =    7.71  max =    8.74  avg =    8.09\n    yolo-fastest-1.1  min =    6.71  max =   78.72  avg =   14.16\n      yolo-fastestv2  min =    5.72  max =    6.08  avg =    5.88\n  vision_transformer  min =  192.16  max =  221.86  avg =  202.73\n          FastestDet  min =    5.13  max =    5.47  avg =    5.30\n```\n\n### HUAWEI Kunpeng 920 7260 (x64 cores)\ntest on Ubuntu 20.04 (gcc 9.4.0)\n```\nroot@8d46e508165f:/home/lkl/ARM_CHAR/ncnn/benchmark# ../build/benchmark/benchncnn 300 1 0 -1 0\nloop_count = 300\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   11.64  max =   12.11  avg =   11.71\n     squeezenet_int8  min =   12.22  max =   13.22  avg =   12.37\n           mobilenet  min =   20.00  max =   20.79  avg =   20.08\n      mobilenet_int8  min =   17.44  max =   19.09  avg =   17.64\n        mobilenet_v2  min =   13.29  max =   14.25 
 avg =   13.39\n        mobilenet_v3  min =   11.06  max =   11.84  avg =   11.11\n          shufflenet  min =    7.56  max =    7.74  avg =    7.59\n       shufflenet_v2  min =    7.84  max =    8.37  avg =    7.88\n             mnasnet  min =   13.07  max =   13.78  avg =   13.14\n     proxylessnasnet  min =   15.71  max =   16.31  avg =   15.77\n     efficientnet_b0  min =   34.79  max =   35.98  avg =   34.92\n   efficientnetv2_b0  min =   35.28  max =   36.36  avg =   35.41\n        regnety_400m  min =   17.06  max =   17.74  avg =   17.16\n           blazeface  min =    2.99  max =    3.04  avg =    3.01\n           googlenet  min =   50.76  max =   51.74  avg =   51.00\n      googlenet_int8  min =   50.31  max =   52.27  avg =   50.65\n            resnet18  min =   34.97  max =   37.17  avg =   35.82\n       resnet18_int8  min =   40.47  max =   42.03  avg =   40.78\n             alexnet  min =   39.19  max =   39.80  avg =   39.32\n               vgg16  min =  176.62  max =  181.29  avg =  177.07\n          vgg16_int8  min =  352.35  max =  358.38  avg =  355.15\n            resnet50  min =   96.76  max =   98.63  avg =   97.09\n       resnet50_int8  min =   90.00  max =   92.74  avg =   90.81\n      squeezenet_ssd  min =   33.23  max =   33.99  avg =   33.39\n squeezenet_ssd_int8  min =   38.50  max =   41.53  avg =   39.28\n       mobilenet_ssd  min =   42.49  max =   44.78  avg =   42.72\n  mobilenet_ssd_int8  min =   37.06  max =   39.97  avg =   37.57\n      mobilenet_yolo  min =   96.34  max =   98.91  avg =   96.73\n  mobilenetv2_yolov3  min =   50.88  max =   52.97  avg =   51.15\n         yolov4-tiny  min =   65.56  max =   67.13  avg =   65.80\n           nanodet_m  min =   19.94  max =   20.82  avg =   20.04\n    yolo-fastest-1.1  min =    7.66  max =    7.81  avg =    7.71\n      yolo-fastestv2  min =    6.82  max =    7.23  avg =    6.87\n  vision_transformer  min = 1535.03  max = 1552.84  avg = 1543.73\n          FastestDet  min =    7.17  max 
=    7.50  avg =    7.21\nroot@8d46e508165f:/home/lkl/ARM_CHAR/ncnn/benchmark# ../build/benchmark/benchncnn 300 2 0 -1 0\nloop_count = 300\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.35  max =    9.15  avg =    7.33\n     squeezenet_int8  min =    8.06  max =    8.60  avg =    8.14\n           mobilenet  min =   10.30  max =   11.86  avg =   11.48\n      mobilenet_int8  min =    8.93  max =   11.87  avg =   10.47\n        mobilenet_v2  min =    9.05  max =   11.50  avg =    9.19\n        mobilenet_v3  min =    6.32  max =    6.42  avg =    6.36\n          shufflenet  min =    6.73  max =    8.55  avg =    6.81\n       shufflenet_v2  min =    4.94  max =    6.65  avg =    6.32\n             mnasnet  min =    7.38  max =   10.77  avg =    8.82\n     proxylessnasnet  min =    8.57  max =    9.72  avg =    8.63\n     efficientnet_b0  min =   18.61  max =   22.53  avg =   20.42\n   efficientnetv2_b0  min =   18.75  max =   21.93  avg =   20.79\n        regnety_400m  min =   11.86  max =   15.09  avg =   14.60\n           blazeface  min =    1.95  max =    3.37  avg =    2.06\n           googlenet  min =   28.66  max =   32.24  avg =   28.94\n      googlenet_int8  min =   27.64  max =   32.15  avg =   30.84\n            resnet18  min =   20.33  max =   20.77  avg =   20.47\n       resnet18_int8  min =   22.63  max =   23.72  avg =   22.88\n             alexnet  min =   20.41  max =   29.37  avg =   27.22\n               vgg16  min =  101.72  max =  140.33  avg =  103.29\n          vgg16_int8  min =  187.56  max =  211.44  avg =  189.92\n            resnet50  min =   51.07  max =   59.25  avg =   58.35\n       resnet50_int8  min =   46.50  max =   52.55  avg =   48.93\n      squeezenet_ssd  min =   22.48  max =   28.59  avg =   22.98\n squeezenet_ssd_int8  min =   25.56  max =   26.82  avg =   25.99\n       mobilenet_ssd  min =   22.81  max =   26.21  avg =   24.88\n  mobilenet_ssd_int8  min =   19.31  max =   25.53 
 avg =   21.74\n      mobilenet_yolo  min =   59.58  max =   62.04  avg =   59.99\n  mobilenetv2_yolov3  min =   33.26  max =   35.74  avg =   33.51\n         yolov4-tiny  min =   41.14  max =   45.34  avg =   42.46\n           nanodet_m  min =   12.10  max =   16.69  avg =   15.02\n    yolo-fastest-1.1  min =    5.44  max =    7.78  avg =    7.24\n      yolo-fastestv2  min =    5.03  max =    8.08  avg =    6.75\n  vision_transformer  min =  994.46  max = 1090.68  avg = 1045.50\n          FastestDet  min =    6.76  max =    6.91  avg =    6.83\nroot@8d46e508165f:/home/lkl/ARM_CHAR/ncnn/benchmark# ../build/benchmark/benchncnn 300 4 0 -1 0\nloop_count = 300\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    3.79  max =    6.99  avg =    4.55\n     squeezenet_int8  min =    5.13  max =    5.68  avg =    5.20\n           mobilenet  min =    6.25  max =    6.55  avg =    6.30\n      mobilenet_int8  min =    5.96  max =    6.10  avg =    6.03\n        mobilenet_v2  min =    5.34  max =    7.15  avg =    5.62\n        mobilenet_v3  min =    4.05  max =    5.74  avg =    5.01\n          shufflenet  min =    3.69  max =    5.81  avg =    5.15\n       shufflenet_v2  min =    4.31  max =    6.02  avg =    4.56\n             mnasnet  min =    4.48  max =    6.05  avg =    5.54\n     proxylessnasnet  min =    5.05  max =    8.08  avg =    6.03\n     efficientnet_b0  min =   10.17  max =   12.21  avg =   11.58\n   efficientnetv2_b0  min =   10.86  max =   15.78  avg =   12.70\n        regnety_400m  min =    9.24  max =   14.13  avg =   11.98\n           blazeface  min =    1.89  max =    1.97  avg =    1.93\n           googlenet  min =   15.19  max =   20.31  avg =   16.90\n      googlenet_int8  min =   17.97  max =   19.40  avg =   18.11\n            resnet18  min =   11.18  max =   11.48  avg =   11.29\n       resnet18_int8  min =   12.26  max =   12.78  avg =   12.44\n             alexnet  min =   14.43  max =   16.94  avg =   
14.68\n               vgg16  min =   62.40  max =   78.42  avg =   64.96\n          vgg16_int8  min =  101.52  max =  109.42  avg =  104.46\n            resnet50  min =   29.19  max =   39.69  avg =   32.99\n       resnet50_int8  min =   26.94  max =   28.82  avg =   27.16\n      squeezenet_ssd  min =   12.90  max =   16.52  avg =   15.20\n squeezenet_ssd_int8  min =   15.58  max =   18.40  avg =   16.28\n       mobilenet_ssd  min =   13.68  max =   14.45  avg =   13.87\n  mobilenet_ssd_int8  min =   12.20  max =   14.58  avg =   12.84\n      mobilenet_yolo  min =   34.85  max =   36.54  avg =   35.05\n  mobilenetv2_yolov3  min =   18.61  max =   20.93  avg =   19.92\n         yolov4-tiny  min =   26.09  max =   32.32  avg =   28.03\n           nanodet_m  min =    7.85  max =   12.48  avg =   11.00\n    yolo-fastest-1.1  min =    6.19  max =    6.49  avg =    6.31\n      yolo-fastestv2  min =    3.66  max =    6.83  avg =    5.11\n  vision_transformer  min =  605.95  max =  624.99  avg =  609.79\n          FastestDet  min =    4.32  max =    5.41  avg =    5.17\nroot@8d46e508165f:/home/lkl/ARM_CHAR/ncnn/benchmark# ../build/benchmark/benchncnn 300 8 0 -1 0\nloop_count = 300\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    2.72  max =    3.74  avg =    3.05\n     squeezenet_int8  min =    3.80  max =    4.71  avg =    4.03\n           mobilenet  min =    3.94  max =    5.15  avg =    4.00\n      mobilenet_int8  min =    3.73  max =    3.87  avg =    3.80\n        mobilenet_v2  min =    4.51  max =    6.57  avg =    4.68\n        mobilenet_v3  min =    4.12  max =    4.38  avg =    4.28\n          shufflenet  min =    4.60  max =    6.27  avg =    4.88\n       shufflenet_v2  min =    4.07  max =    4.20  avg =    4.11\n             mnasnet  min =    4.26  max =    4.51  avg =    4.36\n     proxylessnasnet  min =    4.71  max =    7.40  avg =    4.80\n     efficientnet_b0  min =    8.49  max =    8.74  avg =    8.56\n   
efficientnetv2_b0  min =    9.34  max =    9.68  avg =    9.41\n        regnety_400m  min =    8.00  max =   12.85  avg =   10.64\n           blazeface  min =    1.76  max =    1.84  avg =    1.80\n           googlenet  min =   10.89  max =   11.33  avg =   10.98\n      googlenet_int8  min =   11.66  max =   14.07  avg =   11.83\n            resnet18  min =    6.48  max =    6.61  avg =    6.54\n       resnet18_int8  min =    7.30  max =    7.79  avg =    7.51\n             alexnet  min =    8.33  max =    8.95  avg =    8.62\n               vgg16  min =   29.94  max =   47.54  avg =   31.95\n          vgg16_int8  min =   54.67  max =   60.76  avg =   56.03\n            resnet50  min =   16.13  max =   20.79  avg =   20.03\n       resnet50_int8  min =   15.64  max =   20.13  avg =   16.11\n      squeezenet_ssd  min =   11.58  max =   12.02  avg =   11.77\n squeezenet_ssd_int8  min =   11.14  max =   13.72  avg =   12.10\n       mobilenet_ssd  min =    8.27  max =   10.77  avg =    8.76\n  mobilenet_ssd_int8  min =    8.13  max =    9.09  avg =    8.29\n      mobilenet_yolo  min =   23.90  max =   24.69  avg =   24.17\n  mobilenetv2_yolov3  min =   14.83  max =   15.72  avg =   15.19\n         yolov4-tiny  min =   19.78  max =   23.66  avg =   20.05\n           nanodet_m  min =    8.92  max =   10.76  avg =    9.09\n    yolo-fastest-1.1  min =    5.49  max =    5.77  avg =    5.63\n      yolo-fastestv2  min =    5.04  max =    5.21  avg =    5.10\n  vision_transformer  min =  318.42  max =  379.40  avg =  363.66\n          FastestDet  min =    4.18  max =    4.54  avg =    4.38\nroot@8d46e508165f:/home/lkl/ARM_CHAR/ncnn/benchmark# ../build/benchmark/benchncnn 300 16 0 -1 0\nloop_count = 300\nnum_threads = 16\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    2.70  max =    3.14  avg =    2.81\n     squeezenet_int8  min =    3.21  max =    4.22  avg =    3.39\n           mobilenet  min =    3.13  max =    3.26  avg =    3.20\n      
mobilenet_int8  min =    3.17  max =    5.05  avg =    3.30\n        mobilenet_v2  min =    4.31  max =    6.24  avg =    4.62\n        mobilenet_v3  min =    3.57  max =    3.77  avg =    3.68\n          shufflenet  min =    4.70  max =    6.45  avg =    4.80\n       shufflenet_v2  min =    3.73  max =    4.27  avg =    3.87\n             mnasnet  min =    3.67  max =    3.87  avg =    3.75\n     proxylessnasnet  min =    4.28  max =    4.81  avg =    4.35\n     efficientnet_b0  min =    7.31  max =    7.77  avg =    7.53\n   efficientnetv2_b0  min =    9.87  max =   12.33  avg =   10.07\n        regnety_400m  min =   17.95  max =   18.53  avg =   18.26\n           blazeface  min =    2.26  max =    2.40  avg =    2.33\n           googlenet  min =    9.51  max =    9.99  avg =    9.68\n      googlenet_int8  min =   10.98  max =   11.36  avg =   11.18\n            resnet18  min =    5.59  max =    6.08  avg =    5.71\n       resnet18_int8  min =    6.55  max =    7.28  avg =    6.77\n             alexnet  min =    6.26  max =    6.50  avg =    6.36\n               vgg16  min =   23.98  max =   27.37  avg =   24.89\n          vgg16_int8  min =   38.07  max =   39.66  avg =   39.02\n            resnet50  min =   12.81  max =   14.19  avg =   13.76\n       resnet50_int8  min =   12.42  max =   12.84  avg =   12.55\n      squeezenet_ssd  min =   10.80  max =   11.49  avg =   11.12\n squeezenet_ssd_int8  min =   11.57  max =   12.21  avg =   11.74\n       mobilenet_ssd  min =    7.46  max =    8.08  avg =    7.84\n  mobilenet_ssd_int8  min =    7.47  max =    8.07  avg =    7.63\n      mobilenet_yolo  min =   21.70  max =   23.43  avg =   21.92\n  mobilenetv2_yolov3  min =   12.55  max =   14.56  avg =   12.90\n         yolov4-tiny  min =   17.68  max =   19.85  avg =   18.18\n           nanodet_m  min =    8.35  max =    8.70  avg =    8.45\n    yolo-fastest-1.1  min =    5.70  max =    7.11  avg =    6.05\n      yolo-fastestv2  min =    4.85  max =    5.70  avg =    
5.37\n  vision_transformer  min =  214.36  max =  259.56  avg =  245.47\n          FastestDet  min =    5.01  max =    5.42  avg =    5.17\nroot@8d46e508165f:/home/lkl/ARM_CHAR/ncnn/benchmark# ../build/benchmark/benchncnn 300 32 0 -1 0\nloop_count = 300\nnum_threads = 32\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    2.30  max =    2.94  avg =    2.46\n     squeezenet_int8  min =    3.08  max =    4.88  avg =    4.03\n           mobilenet  min =    2.49  max =    2.76  avg =    2.53\n      mobilenet_int8  min =    2.86  max =    3.73  avg =    2.95\n        mobilenet_v2  min =    4.51  max =    5.20  avg =    4.74\n        mobilenet_v3  min =    5.11  max =    6.91  avg =    6.10\n          shufflenet  min =    5.57  max =    6.51  avg =    5.78\n       shufflenet_v2  min =    4.37  max =    4.66  avg =    4.48\n             mnasnet  min =    3.72  max =    4.08  avg =    3.90\n     proxylessnasnet  min =    4.19  max =    6.18  avg =    4.79\n     efficientnet_b0  min =    6.80  max =    7.22  avg =    6.89\n   efficientnetv2_b0  min =   13.98  max =   17.55  avg =   15.06\n        regnety_400m  min =   16.10  max =   16.72  avg =   16.26\n           blazeface  min =    2.12  max =    2.53  avg =    2.17\n           googlenet  min =    8.63  max =    9.89  avg =    8.77\n      googlenet_int8  min =    9.90  max =   11.09  avg =   10.08\n            resnet18  min =    6.54  max =    6.99  avg =    6.73\n       resnet18_int8  min =    8.34  max =    9.00  avg =    8.67\n             alexnet  min =    6.64  max =    7.15  avg =    6.93\n               vgg16  min =   22.79  max =   23.91  avg =   23.50\n          vgg16_int8  min =   32.37  max =   37.51  avg =   33.13\n            resnet50  min =   11.19  max =   16.40  avg =   11.47\n       resnet50_int8  min =   11.92  max =   12.55  avg =   12.13\n      squeezenet_ssd  min =   10.75  max =   12.28  avg =   11.12\n squeezenet_ssd_int8  min =   11.31  max =   12.29  avg =   11.57\n  
     mobilenet_ssd  min =   10.25  max =   11.26  avg =   10.79\n  mobilenet_ssd_int8  min =   11.39  max =   16.99  avg =   11.98\n      mobilenet_yolo  min =   52.11  max =   60.46  avg =   53.84\n  mobilenetv2_yolov3  min =   12.07  max =   12.47  avg =   12.20\n         yolov4-tiny  min =   17.48  max =   17.79  avg =   17.58\n           nanodet_m  min =   13.06  max =   14.71  avg =   13.64\n    yolo-fastest-1.1  min =    5.70  max =    5.89  avg =    5.79\n      yolo-fastestv2  min =    8.89  max =    9.99  avg =    9.21\n  vision_transformer  min =  158.92  max =  187.40  avg =  168.21\n          FastestDet  min =    8.70  max =    9.43  avg =    9.00\nroot@8d46e508165f:/home/lkl/ARM_CHAR/ncnn/benchmark# ../build/benchmark/benchncnn 300 64 0 -1 0\nloop_count = 300\nnum_threads = 64\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.85  max =   78.56  avg =    7.81\n     squeezenet_int8  min =    8.06  max =   88.91  avg =    9.23\n           mobilenet  min =    3.02  max =   86.86  avg =    5.89\n      mobilenet_int8  min =    3.58  max =    4.55  avg =    3.68\n        mobilenet_v2  min =    5.05  max =  150.06  avg =   13.04\n        mobilenet_v3  min =    4.85  max =  125.22  avg =    8.34\n          shufflenet  min =   17.80  max =  220.55  avg =   21.01\n       shufflenet_v2  min =   11.23  max =  381.95  avg =   13.71\n             mnasnet  min =    9.83  max =  128.42  avg =   11.10\n     proxylessnasnet  min =   10.53  max =   68.52  avg =   12.03\n     efficientnet_b0  min =   16.78  max =  968.87  avg =   23.94\n   efficientnetv2_b0  min =   26.23  max =  551.18  avg =   31.34\n        regnety_400m  min =   70.14  max =  407.92  avg =   78.30\n           blazeface  min =    7.27  max =  191.44  avg =    9.37\n           googlenet  min =   16.69  max =  820.58  avg =   25.06\n      googlenet_int8  min =   20.58  max =  849.09  avg =   29.87\n            resnet18  min =    8.67  max =  349.00  avg =   11.33\n       
resnet18_int8  min =   10.40  max =  128.98  avg =   11.45\n             alexnet  min =    6.15  max =  196.01  avg =   10.24\n               vgg16  min =   21.11  max =  288.66  avg =   29.37\n          vgg16_int8  min =   30.72  max =  251.95  avg =   37.68\n            resnet50  min =   19.10  max =  114.08  avg =   22.00\n       resnet50_int8  min =   18.99  max =  436.89  avg =   24.36\n      squeezenet_ssd  min =   22.22  max =  510.52  avg =   28.76\n squeezenet_ssd_int8  min =   23.42  max =  614.70  avg =   30.82\n       mobilenet_ssd  min =    7.62  max =  202.66  avg =   14.59\n  mobilenet_ssd_int8  min =    7.89  max =  109.82  avg =    8.80\n      mobilenet_yolo  min =   31.43  max =  742.10  avg =   45.52\n  mobilenetv2_yolov3  min =   18.31  max =  273.05  avg =   20.78\n         yolov4-tiny  min =   21.03  max =  400.05  avg =   33.64\n           nanodet_m  min =   19.94  max =  114.18  avg =   21.89\n    yolo-fastest-1.1  min =    7.20  max =  174.60  avg =    9.13\n      yolo-fastestv2  min =    7.50  max =  170.55  avg =    9.01\n  vision_transformer  min =  126.90  max =  335.71  avg =  157.38\n          FastestDet  min =    6.59  max =   19.77  avg =    6.77\n```\n\n### Intel Atom x5-Z8350\n```\nnihui@nihui-ROCK-Pi-X:~/ncnn/build/benchmark$ ./benchncnn 20 4 0 -1 1\nloop_count = 20\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   50.22  max =   50.53  avg =   50.32\n     squeezenet_int8  min =   77.92  max =   78.37  avg =   78.07\n           mobilenet  min =   80.12  max =   81.53  avg =   80.35\n      mobilenet_int8  min =  120.54  max =  124.10  avg =  120.84\n        mobilenet_v2  min =   56.62  max =   60.12  avg =   58.37\n        mobilenet_v3  min =   50.19  max =   50.41  avg =   50.27\n          shufflenet  min =   37.96  max =   38.28  avg =   38.10\n       shufflenet_v2  min =   35.28  max =   35.59  avg =   35.45\n             mnasnet  min =   54.91  max =   55.10  avg =   55.01\n     
proxylessnasnet  min =   62.25  max =   62.59  avg =   62.40\n     efficientnet_b0  min =  101.92  max =  105.73  avg =  102.27\n   efficientnetv2_b0  min =  115.48  max =  117.25  avg =  115.89\n        regnety_400m  min =   79.66  max =   81.70  avg =   79.95\n           blazeface  min =   10.43  max =   10.60  avg =   10.49\n           googlenet  min =  170.41  max =  173.44  avg =  170.68\n      googlenet_int8  min =  253.06  max =  257.34  avg =  253.57\n            resnet18  min =  127.19  max =  130.69  avg =  127.65\n       resnet18_int8  min =  200.54  max =  204.25  avg =  200.88\n             alexnet  min =  104.89  max =  110.89  avg =  105.56\n               vgg16  min =  653.78  max =  661.34  avg =  655.44\n          vgg16_int8  min =  974.72  max = 1006.48  avg =  978.76\n            resnet50  min =  367.63  max =  371.74  avg =  368.27\n       resnet50_int8  min =  574.94  max =  584.08  avg =  576.18\n      squeezenet_ssd  min =  115.35  max =  116.47  avg =  115.62\n squeezenet_ssd_int8  min =  169.95  max =  170.75  avg =  170.26\n       mobilenet_ssd  min =  167.00  max =  172.02  avg =  168.95\n  mobilenet_ssd_int8  min =  244.91  max =  248.30  avg =  245.27\n      mobilenet_yolo  min =  382.80  max =  393.23  avg =  385.79\n  mobilenetv2_yolov3  min =  208.23  max =  211.54  avg =  209.64\n         yolov4-tiny  min =  251.10  max =  263.77  avg =  256.37\n           nanodet_m  min =   84.48  max =   84.95  avg =   84.70\n    yolo-fastest-1.1  min =   44.11  max =   45.15  avg =   44.26\n      yolo-fastestv2  min =   37.95  max =   38.52  avg =   38.34\n\nnihui@nihui-ROCK-Pi-X:~/ncnn/build/benchmark$ ./benchncnn 10 1 0 -1 1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  130.52  max =  131.08  avg =  130.64\n     squeezenet_int8  min =  231.03  max =  231.38  avg =  231.19\n           mobilenet  min =  231.40  max =  231.74  avg =  231.61\n      mobilenet_int8  min =  409.74  
max =  410.02  avg =  409.85\n        mobilenet_v2  min =  150.23  max =  150.72  avg =  150.47\n        mobilenet_v3  min =  119.08  max =  119.34  avg =  119.20\n          shufflenet  min =   72.62  max =   72.81  avg =   72.73\n       shufflenet_v2  min =   73.63  max =   73.71  avg =   73.68\n             mnasnet  min =  140.87  max =  141.09  avg =  140.98\n     proxylessnasnet  min =  166.39  max =  166.75  avg =  166.54\n     efficientnet_b0  min =  280.55  max =  281.30  avg =  280.77\n   efficientnetv2_b0  min =  321.05  max =  321.24  avg =  321.16\n        regnety_400m  min =  183.78  max =  184.64  avg =  183.91\n           blazeface  min =   18.94  max =   19.08  avg =   19.01\n           googlenet  min =  453.56  max =  454.71  avg =  454.15\n      googlenet_int8  min =  791.40  max =  791.93  avg =  791.61\n            resnet18  min =  365.87  max =  366.40  avg =  366.15\n       resnet18_int8  min =  652.86  max =  653.39  avg =  653.09\n             alexnet  min =  289.15  max =  290.25  avg =  289.65\n               vgg16  min = 1887.16  max = 1887.73  avg = 1887.41\n          vgg16_int8  min = 3211.44  max = 3213.39  avg = 3212.55\n            resnet50  min = 1060.37  max = 1061.40  avg = 1060.80\n       resnet50_int8  min = 1869.41  max = 1870.59  avg = 1870.17\n      squeezenet_ssd  min =  277.23  max =  277.83  avg =  277.50\n squeezenet_ssd_int8  min =  455.54  max =  458.06  avg =  456.28\n       mobilenet_ssd  min =  478.03  max =  478.83  avg =  478.32\n  mobilenet_ssd_int8  min =  822.61  max =  822.96  avg =  822.79\n      mobilenet_yolo  min = 1136.89  max = 1138.51  avg = 1137.74\n  mobilenetv2_yolov3  min =  551.81  max =  552.53  avg =  552.14\n         yolov4-tiny  min =  685.49  max =  686.15  avg =  685.79\n           nanodet_m  min =  181.21  max =  181.52  avg =  181.32\n    yolo-fastest-1.1  min =   82.21  max =   82.68  avg =   82.30\n      yolo-fastestv2  min =   67.62  max =   68.36  avg =   
68.10\n\nroot@nihui-ROCK-Pi-X:/home/nihui/osd/ncnn/build/benchmark# ./benchncnn 10 1 0 0 0\n[0 Intel(R) HD Graphics (CHV)]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 Intel(R) HD Graphics (CHV)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Intel(R) HD Graphics (CHV)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Intel(R) HD Graphics (CHV)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   29.14  max =   29.76  avg =   29.45\n           mobilenet  min =   36.19  max =   37.03  avg =   36.52\n        mobilenet_v2  min =   30.39  max =   31.62  avg =   30.76\n        mobilenet_v3  min =   31.60  max =   32.25  avg =   31.92\n          shufflenet  min =   22.47  max =   23.19  avg =   22.70\n       shufflenet_v2  min =   22.30  max =   24.16  avg =   23.12\n             mnasnet  min =   29.40  max =   30.23  avg =   29.84\n     proxylessnasnet  min =   31.00  max =   31.91  avg =   31.41\n     efficientnet_b0  min =   58.03  max =   58.74  avg =   58.42\n   efficientnetv2_b0  min =  131.17  max =  191.61  avg =  161.37\n        regnety_400m  min =   40.30  max =   42.27  avg =   41.04\n           blazeface  min =   15.06  max =   15.96  avg =   15.48\n           googlenet  min =   85.37  max =   86.49  avg =   85.84\n            resnet18  min =   93.87  max =   95.00  avg =   94.53\n             alexnet  min =  110.96  max =  120.83  avg =  115.14\n               vgg16  min =  798.75  max =  812.60  avg =  804.93\n            resnet50  min =  213.12  max =  214.81  avg =  213.79\n      squeezenet_ssd  min =  124.48  max =  125.18  avg =  124.87\n       mobilenet_ssd  min =   84.04  max =   84.70  avg =   84.49\n      mobilenet_yolo  min =  186.52  max =  189.61  avg =  188.53\n  mobilenetv2_yolov3  min =  102.07  max =  102.97  avg =  102.39\n         yolov4-tiny  min =  212.49  max =  214.75  avg =  213.77\n           nanodet_m  min =   42.97  max =   45.58  
avg =   44.05\n    yolo-fastest-1.1  min =   27.14  max =   32.53  avg =   28.76\n      yolo-fastestv2  min =   20.73  max =   25.90  avg =   22.97\n```\n\n### Intel Celeron N5105\n```\nloop_count = 8\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   18.06  max =   18.21  avg =   18.12\n     squeezenet_int8  min =   24.55  max =   25.16  avg =   24.69\n           mobilenet  min =   32.22  max =   32.70  avg =   32.40\n      mobilenet_int8  min =   40.52  max =   40.59  avg =   40.54\n        mobilenet_v2  min =   22.54  max =   22.71  avg =   22.65\n        mobilenet_v3  min =   17.86  max =   19.02  avg =   18.09\n          shufflenet  min =   11.23  max =   11.30  avg =   11.28\n       shufflenet_v2  min =   11.04  max =   11.19  avg =   11.13\n             mnasnet  min =   19.93  max =   20.09  avg =   20.01\n     proxylessnasnet  min =   21.91  max =   22.00  avg =   21.95\n     efficientnet_b0  min =   33.29  max =   33.66  avg =   33.50\n   efficientnetv2_b0  min =   40.16  max =   40.63  avg =   40.34\n        regnety_400m  min =   27.38  max =   27.59  avg =   27.50\n           blazeface  min =    3.01  max =    3.11  avg =    3.04\n           googlenet  min =   64.78  max =   65.16  avg =   65.01\n      googlenet_int8  min =   80.11  max =   80.79  avg =   80.46\n            resnet18  min =   53.91  max =   54.28  avg =   54.07\n       resnet18_int8  min =   63.95  max =   64.20  avg =   64.06\n             alexnet  min =   51.84  max =   52.17  avg =   52.00\n               vgg16  min =  322.01  max =  324.34  avg =  322.72\n          vgg16_int8  min =  323.83  max =  324.17  avg =  324.02\n            resnet50  min =  152.66  max =  153.33  avg =  153.03\n       resnet50_int8  min =  193.40  max =  194.55  avg =  194.03\n      squeezenet_ssd  min =   44.07  max =   44.51  avg =   44.37\n squeezenet_ssd_int8  min =   51.08  max =   52.26  avg =   51.60\n       mobilenet_ssd  min =   67.73  max =   68.21  avg 
=   67.98\n  mobilenet_ssd_int8  min =   82.41  max =   82.70  avg =   82.55\n      mobilenet_yolo  min =  157.38  max =  159.44  avg =  158.23\n  mobilenetv2_yolov3  min =   83.35  max =   83.68  avg =   83.55\n         yolov4-tiny  min =  107.25  max =  107.72  avg =  107.50\n           nanodet_m  min =   26.93  max =   27.24  avg =   27.09\n    yolo-fastest-1.1  min =   12.47  max =   12.71  avg =   12.61\n      yolo-fastestv2  min =   10.65  max =   10.95  avg =   10.81\n\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   54.43  max =   54.48  avg =   54.46\n     squeezenet_int8  min =   79.32  max =   79.64  avg =   79.43\n           mobilenet  min =  105.92  max =  106.12  avg =  106.03\n      mobilenet_int8  min =  152.24  max =  152.28  avg =  152.26\n        mobilenet_v2  min =   62.44  max =   62.83  avg =   62.57\n        mobilenet_v3  min =   49.47  max =   49.55  avg =   49.50\n          shufflenet  min =   27.32  max =   27.37  avg =   27.34\n       shufflenet_v2  min =   29.85  max =   30.00  avg =   29.93\n             mnasnet  min =   59.83  max =   60.09  avg =   59.98\n     proxylessnasnet  min =   66.66  max =   66.84  avg =   66.76\n     efficientnet_b0  min =  104.00  max =  104.19  avg =  104.08\n   efficientnetv2_b0  min =  128.05  max =  128.39  avg =  128.21\n        regnety_400m  min =   77.95  max =   78.03  avg =   78.00\n           blazeface  min =    6.66  max =    6.77  avg =    6.70\n           googlenet  min =  195.32  max =  195.75  avg =  195.52\n      googlenet_int8  min =  275.81  max =  276.25  avg =  275.98\n            resnet18  min =  160.94  max =  161.17  avg =  161.03\n       resnet18_int8  min =  223.88  max =  224.12  avg =  224.03\n             alexnet  min =  120.96  max =  121.16  avg =  121.05\n               vgg16  min =  852.50  max =  853.66  avg =  853.04\n          vgg16_int8  min = 1081.07  max = 1083.31  avg = 1082.18\n            resnet50  min = 
 497.54  max =  497.85  avg =  497.67\n       resnet50_int8  min =  681.79  max =  682.60  avg =  682.29\n      squeezenet_ssd  min =  101.81  max =  102.49  avg =  102.13\n squeezenet_ssd_int8  min =  147.77  max =  148.52  avg =  148.04\n       mobilenet_ssd  min =  215.63  max =  216.07  avg =  215.91\n  mobilenet_ssd_int8  min =  305.65  max =  305.97  avg =  305.78\n      mobilenet_yolo  min =  494.99  max =  495.41  avg =  495.16\n  mobilenetv2_yolov3  min =  233.51  max =  234.26  avg =  233.84\n         yolov4-tiny  min =  287.26  max =  287.89  avg =  287.50\n           nanodet_m  min =   70.48  max =   70.73  avg =   70.61\n    yolo-fastest-1.1  min =   27.32  max =   27.36  avg =   27.34\n      yolo-fastestv2  min =   23.51  max =   23.85  avg =   23.76\n\n[0 Intel(R) UHD Graphics (JSL)]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 Intel(R) UHD Graphics (JSL)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Intel(R) UHD Graphics (JSL)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Intel(R) UHD Graphics (JSL)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   14.71  max =   15.37  avg =   14.90\n           mobilenet  min =   15.38  max =   16.34  avg =   16.07\n        mobilenet_v2  min =   13.58  max =   14.52  avg =   14.23\n        mobilenet_v3  min =   14.95  max =   15.81  avg =   15.20\n          shufflenet  min =   11.93  max =   12.73  avg =   12.31\n       shufflenet_v2  min =   14.47  max =   14.74  avg =   14.60\n             mnasnet  min =   15.32  max =   17.13  avg =   15.95\n     proxylessnasnet  min =   15.34  max =   16.25  avg =   15.66\n     efficientnet_b0  min =   26.02  max =   26.19  avg =   26.11\n   efficientnetv2_b0  min =   75.92  max =   76.18  avg =   76.07\n        regnety_400m  min =   17.79  max =   18.00  avg =   17.91\n           blazeface  min =    5.03  max =    5.96  avg =    5.65\n           googlenet  min =   
35.20  max =   35.40  avg =   35.32\n            resnet18  min =   35.49  max =   35.61  avg =   35.56\n             alexnet  min =   40.93  max =   41.25  avg =   41.11\n               vgg16  min =  220.66  max =  222.18  avg =  221.42\n            resnet50  min =   78.10  max =   78.48  avg =   78.28\n      squeezenet_ssd  min =   46.90  max =   47.46  avg =   47.26\n       mobilenet_ssd  min =   33.33  max =   33.54  avg =   33.44\n      mobilenet_yolo  min =   67.54  max =   67.77  avg =   67.64\n  mobilenetv2_yolov3  min =   38.98  max =   39.69  avg =   39.37\n         yolov4-tiny  min =   68.01  max =   69.74  avg =   68.86\n           nanodet_m  min =   17.41  max =   18.13  avg =   17.78\n    yolo-fastest-1.1  min =   13.91  max =   14.18  avg =   14.03\n      yolo-fastestv2  min =   15.94  max =   16.02  avg =   15.97\n```\n\n### nVIDIA RTX2060 of Notebook\n```\nC:\\Users\\ai\\AppData\\Local\\Temp\\benchmark>benchncnn.exe 64 1 0 0 0\n[0 GeForce RTX 2060]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 GeForce RTX 2060]  buglssc=0  bugihfa=0\n[0 GeForce RTX 2060]  fp16p=1  fp16s=1  fp16a=1  int8s=1  int8a=1\nloop_count = 64\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    2.14  max =    2.93  avg =    2.26\n           mobilenet  min =    2.08  max =    2.53  avg =    2.22\n        mobilenet_v2  min =    2.81  max =    4.03  avg =    3.05\n        mobilenet_v3  min =    2.90  max =    3.53  avg =    3.08\n          shufflenet  min =    1.94  max =    4.27  avg =    2.55\n       shufflenet_v2  min =    2.34  max =    2.97  avg =    2.49\n             mnasnet  min =    2.11  max =    2.86  avg =    2.37\n     proxylessnasnet  min =    2.27  max =    3.25  avg =    2.49\n           googlenet  min =    4.34  max =    6.79  avg =    5.25\n            resnet18  min =    2.60  max =    4.36  avg =    2.90\n             alexnet  min =    2.79  max =    4.70  avg =    3.04\n               vgg16  min =   11.40  max =   
14.32  avg =   12.42\n            resnet50  min =    5.26  max =    5.86  avg =    5.51\n      squeezenet_ssd  min =    5.58  max =    7.94  avg =    6.56\n       mobilenet_ssd  min =    3.47  max =    5.29  avg =    3.77\n      mobilenet_yolo  min =    5.49  max =    6.19  avg =    5.70\n  mobilenetv2_yolov3  min =    3.69  max =    5.14  avg =    3.91\n```\n\n### nVIDIA RTX A3000 of Notebook (6GB)\n```\ncx@HP-ZBook-Fury-15-6-inch-G8-Mobile-Workstation-PC:~/ncnn/build/benchmark$ ./benchncnn 10 1 0 1\n[0 Intel(R) UHD Graphics (TGL GT1)]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 Intel(R) UHD Graphics (TGL GT1)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Intel(R) UHD Graphics (TGL GT1)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Intel(R) UHD Graphics (TGL GT1)]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 Intel(R) UHD Graphics (TGL GT1)]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\n[1 NVIDIA RTX A3000 Laptop GPU]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[1 NVIDIA RTX A3000 Laptop GPU]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 NVIDIA RTX A3000 Laptop GPU]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[1 NVIDIA RTX A3000 Laptop GPU]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[1 NVIDIA RTX A3000 Laptop GPU]  fp16-matrix-16_8_8/16_8_16/16_16_16=1/1/1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 1\ncooling_down = 1\n          squeezenet  min =    1.49  max =    1.94  avg =    1.74\n     squeezenet_int8  min =    6.13  max =    6.20  avg =    6.16\n           mobilenet  min =    4.05  max =    4.82  avg =    4.65\n      mobilenet_int8  min =   10.24  max =   10.29  avg =   10.26\n        mobilenet_v2  min =    0.98  max =    1.14  avg =    1.03\n        mobilenet_v3  min =    1.74  max =    1.82  avg =    1.77\n          shufflenet  min =    1.43  max =   30.51  avg =    9.51\n       shufflenet_v2  min =    3.43  max =    3.89  avg =    3.77\n             mnasnet  min =    6.50  max =    6.75  avg =    6.62\n     proxylessnasnet  min =    6.46 
 max =    7.28  avg =    7.00\n     efficientnet_b0  min =    3.14  max =   15.11  avg =    7.29\n   efficientnetv2_b0  min =   18.50  max =   20.13  avg =   19.17\n        regnety_400m  min =    2.16  max =    3.57  avg =    2.70\n           blazeface  min =    2.52  max =    2.76  avg =    2.65\n           googlenet  min =    2.67  max =   14.67  avg =    9.85\n      googlenet_int8  min =   19.08  max =   19.40  avg =   19.19\n            resnet18  min =    5.19  max =    9.44  avg =    8.48\n       resnet18_int8  min =   16.57  max =   17.69  avg =   16.96\n             alexnet  min =    1.98  max =    3.24  avg =    2.23\n               vgg16  min =    3.59  max =   12.34  avg =   10.99\n          vgg16_int8  min =  110.63  max =  124.31  avg =  118.16\n            resnet50  min =    3.01  max =    4.93  avg =    3.77\n       resnet50_int8  min =   41.58  max =   44.80  avg =   43.24\n      squeezenet_ssd  min =    4.08  max =    4.70  avg =    4.32\n squeezenet_ssd_int8  min =   17.32  max =   17.92  avg =   17.46\n       mobilenet_ssd  min =    2.26  max =    8.23  avg =    5.57\n  mobilenet_ssd_int8  min =   20.35  max =   21.89  avg =   20.76\n      mobilenet_yolo  min =    2.14  max =   16.94  avg =    6.44\n  mobilenetv2_yolov3  min =    3.64  max =    5.09  avg =    4.02\n         yolov4-tiny  min =   10.94  max =   17.46  avg =   13.58\n           nanodet_m  min =    6.57  max =   13.91  avg =    9.82\n    yolo-fastest-1.1  min =    5.40  max =   14.22  avg =   10.78\n      yolo-fastestv2  min =    7.49  max =    9.43  avg =    7.99\n  vision_transformer  min =   76.04  max =   76.96  avg =   76.43\n          FastestDet  min =    6.31  max =    6.60  avg =    6.43\n```\n\n### nVIDIA RTX2080 of Desktop\n```\nE:\\projects\\framework\\ncnn\\benchmark>benchncnn.exe 4096 1 0 0 0\n[0 GeForce RTX 2080]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 GeForce RTX 2080]  buglssc=0  bugihfa=0\n[0 GeForce RTX 2080]  fp16p=1  fp16s=1  fp16a=1  int8s=1  
int8a=1\nloop_count = 4096\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    1.39  max =   16.70  avg =    1.49\n           mobilenet  min =    1.32  max =    2.55  avg =    1.42\n        mobilenet_v2  min =    1.88  max =    5.02  avg =    2.00\n        mobilenet_v3  min =    2.31  max =    3.58  avg =    2.45\n          shufflenet  min =    1.45  max =    2.65  avg =    1.55\n       shufflenet_v2  min =    1.90  max =    3.21  avg =    2.03\n             mnasnet  min =    1.95  max =    3.17  avg =    2.09\n     proxylessnasnet  min =    2.02  max =    2.95  avg =    2.16\n           googlenet  min =    3.81  max =    5.91  avg =    4.05\n            resnet18  min =    2.10  max =    3.28  avg =    2.24\n             alexnet  min =    2.15  max =    3.35  avg =    2.30\n               vgg16  min =    7.33  max =   11.12  avg =    7.80\n            resnet50  min =    4.21  max =    6.70  avg =    4.49\n      squeezenet_ssd  min =    4.58  max =    6.86  avg =    4.88\n       mobilenet_ssd  min =    2.90  max =    4.52  avg =    3.09\n      mobilenet_yolo  min =    4.15  max =    6.09  avg =    4.40\n  mobilenetv2_yolov3  min =    3.04  max =    9.13  avg =    3.28\n```\n\n### NVIDIA Jetson AGX Xavier (Carmel 2.2 GHz x 8 + Volta Tensor Cores 64)\n```\ni@ubuntu:~/projects/ncnn/benchmark$ ./benchncnn 32 1 0 -1 0\nloop_count = 32\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   22.31  max =   23.29  avg =   22.68\n     squeezenet_int8  min =   47.64  max =   52.88  avg =   49.72\n           mobilenet  min =   37.50  max =   38.45  avg =   37.85\n      mobilenet_int8  min =   89.14  max =   92.38  avg =   90.95\n        mobilenet_v2  min =   24.31  max =   25.53  avg =   24.68\n        mobilenet_v3  min =   20.20  max =   21.21  avg =   20.56\n          shufflenet  min =   14.85  max =   15.64  avg =   15.15\n       shufflenet_v2  min =   14.34  max =   16.11  avg =   
14.86\n             mnasnet  min =   23.42  max =   23.86  avg =   23.56\n     proxylessnasnet  min =   27.44  max =   28.83  avg =   27.83\n     efficientnet_b0  min =   34.57  max =   37.84  avg =   35.13\n   efficientnetv2_b0  min =   65.16  max =   68.67  avg =   66.76\n        regnety_400m  min =   33.86  max =   34.49  avg =   34.17\n           blazeface  min =   11.86  max =   14.15  avg =   12.52\n           googlenet  min =   83.19  max =   89.84  avg =   85.14\n      googlenet_int8  min =  146.74  max =  155.25  avg =  151.14\n            resnet18  min =   50.46  max =   57.80  avg =   53.40\n       resnet18_int8  min =  108.43  max =  116.14  avg =  110.78\n             alexnet  min =   56.59  max =   64.93  avg =   59.51\n               vgg16  min =  266.78  max =  272.16  avg =  269.14\n          vgg16_int8  min =  538.71  max =  551.55  avg =  544.78\n            resnet50  min =  169.11  max =  172.26  avg =  170.51\n       resnet50_int8  min =  370.55  max =  384.36  avg =  377.75\n      squeezenet_ssd  min =   58.51  max =   67.88  avg =   62.78\n squeezenet_ssd_int8  min =   95.34  max =  106.49  avg =   97.99\n       mobilenet_ssd  min =   83.52  max =   86.84  avg =   84.86\n  mobilenet_ssd_int8  min =  172.70  max =  181.84  avg =  176.25\n      mobilenet_yolo  min =  165.26  max =  167.74  avg =  166.51\n  mobilenetv2_yolov3  min =   88.11  max =   90.29  avg =   89.19\n         yolov4-tiny  min =  105.44  max =  109.24  avg =  107.07\n           nanodet_m  min =   33.60  max =   37.02  avg =   34.39\n    yolo-fastest-1.1  min =   13.56  max =   14.22  avg =   13.75\n      yolo-fastestv2  min =   13.76  max =   14.59  avg =   14.02\ni@ubuntu:~/projects/ncnn/benchmark$ ./benchncnn 32 2 0 -1 0\nloop_count = 32\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   13.05  max =   13.76  avg =   13.36\n     squeezenet_int8  min =   26.08  max =   28.09  avg =   26.69\n           mobilenet  min =   20.61  
max =   21.21  avg =   20.81\n      mobilenet_int8  min =   44.72  max =   47.33  avg =   45.76\n        mobilenet_v2  min =   14.67  max =   15.23  avg =   14.86\n        mobilenet_v3  min =   12.59  max =   15.50  avg =   13.36\n          shufflenet  min =   12.74  max =   14.14  avg =   13.31\n       shufflenet_v2  min =   10.05  max =   10.89  avg =   10.40\n             mnasnet  min =   14.02  max =   14.75  avg =   14.19\n     proxylessnasnet  min =   16.05  max =   16.94  avg =   16.31\n     efficientnet_b0  min =   20.47  max =   23.05  avg =   20.81\n   efficientnetv2_b0  min =   37.51  max =   41.53  avg =   39.19\n        regnety_400m  min =   25.21  max =   25.73  avg =   25.39\n           blazeface  min =    7.30  max =    8.44  avg =    7.43\n           googlenet  min =   42.52  max =   47.38  avg =   44.39\n      googlenet_int8  min =   76.38  max =   81.63  avg =   77.93\n            resnet18  min =   26.76  max =   28.72  avg =   27.22\n       resnet18_int8  min =   55.97  max =   61.57  avg =   57.26\n             alexnet  min =   29.29  max =   33.20  avg =   31.03\n               vgg16  min =  134.32  max =  138.65  avg =  136.05\n          vgg16_int8  min =  267.70  max =  281.71  avg =  272.79\n            resnet50  min =   87.22  max =   88.75  avg =   87.65\n       resnet50_int8  min =  183.80  max =  192.17  avg =  187.25\n      squeezenet_ssd  min =   35.80  max =   39.00  avg =   37.32\n squeezenet_ssd_int8  min =   53.56  max =   60.43  avg =   55.58\n       mobilenet_ssd  min =   44.17  max =   48.30  avg =   44.70\n  mobilenet_ssd_int8  min =   90.32  max =   94.09  avg =   92.27\n      mobilenet_yolo  min =   87.50  max =   89.63  avg =   88.33\n  mobilenetv2_yolov3  min =   49.76  max =   51.58  avg =   50.44\n         yolov4-tiny  min =   61.17  max =   64.41  avg =   62.15\n           nanodet_m  min =   21.43  max =   22.47  avg =   21.82\n    yolo-fastest-1.1  min =   10.90  max =   12.63  avg =   11.12\n      yolo-fastestv2  min 
=   10.61  max =   11.11  avg =   10.82\ni@ubuntu:~/projects/ncnn/benchmark$ ./benchncnn 32 4 0 -1 0\nloop_count = 32\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    8.06  max =    8.79  avg =    8.39\n     squeezenet_int8  min =   14.96  max =   16.64  avg =   15.37\n           mobilenet  min =   11.24  max =   11.91  avg =   11.48\n      mobilenet_int8  min =   23.63  max =   24.75  avg =   23.81\n        mobilenet_v2  min =    9.27  max =    9.97  avg =    9.44\n        mobilenet_v3  min =    8.81  max =   10.06  avg =    9.07\n          shufflenet  min =   11.22  max =   11.53  avg =   11.37\n       shufflenet_v2  min =    7.81  max =    8.17  avg =    7.97\n             mnasnet  min =    9.40  max =   10.49  avg =   10.06\n     proxylessnasnet  min =   10.53  max =   10.73  avg =   10.62\n     efficientnet_b0  min =   13.55  max =   15.14  avg =   13.80\n   efficientnetv2_b0  min =   19.83  max =   21.95  avg =   21.09\n        regnety_400m  min =   21.80  max =   22.91  avg =   22.13\n           blazeface  min =    5.17  max =    6.27  avg =    5.31\n           googlenet  min =   22.67  max =   25.35  avg =   23.10\n      googlenet_int8  min =   43.19  max =   45.68  avg =   43.72\n            resnet18  min =   15.19  max =   16.14  avg =   15.42\n       resnet18_int8  min =   31.22  max =   34.76  avg =   31.81\n             alexnet  min =   15.20  max =   17.65  avg =   15.56\n               vgg16  min =   70.76  max =   73.21  avg =   71.70\n          vgg16_int8  min =  137.94  max =  143.50  avg =  139.54\n            resnet50  min =   47.15  max =   47.91  avg =   47.40\n       resnet50_int8  min =   99.80  max =  102.94  avg =  100.29\n      squeezenet_ssd  min =   22.10  max =   24.11  avg =   22.46\n squeezenet_ssd_int8  min =   33.21  max =   35.98  avg =   33.98\n       mobilenet_ssd  min =   25.09  max =   26.81  avg =   25.50\n  mobilenet_ssd_int8  min =   48.15  max =   50.96  avg =   49.49\n     
 mobilenet_yolo  min =   48.63  max =   49.02  avg =   48.84\n  mobilenetv2_yolov3  min =   30.93  max =   31.41  avg =   31.13\n         yolov4-tiny  min =   38.43  max =   41.20  avg =   39.28\n           nanodet_m  min =   14.95  max =   15.74  avg =   15.35\n    yolo-fastest-1.1  min =    8.89  max =    9.18  avg =    9.01\n      yolo-fastestv2  min =    8.36  max =    9.28  avg =    8.50\ni@ubuntu:~/projects/ncnn/benchmark$ ./benchncnn 32 8 0 -1 0\nloop_count = 32\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.52  max =   74.10  avg =   12.94\n     squeezenet_int8  min =   10.44  max =   18.81  avg =   12.15\n           mobilenet  min =    7.49  max =   14.63  avg =    8.67\n      mobilenet_int8  min =   13.80  max =   15.89  avg =   14.53\n        mobilenet_v2  min =    8.15  max =   11.42  avg =    8.78\n        mobilenet_v3  min =    7.60  max =   10.92  avg =    8.38\n          shufflenet  min =   11.51  max =   19.48  avg =   12.97\n       shufflenet_v2  min =    7.06  max =   15.58  avg =    9.48\n             mnasnet  min =    7.77  max =   15.12  avg =    8.68\n     proxylessnasnet  min =    8.54  max =   42.73  avg =   10.00\n     efficientnet_b0  min =   11.11  max =   12.86  avg =   11.89\n   efficientnetv2_b0  min =   17.17  max =   29.03  avg =   20.48\n        regnety_400m  min =   22.41  max =   36.72  avg =   25.49\n           blazeface  min =    4.93  max =   11.62  avg =    6.13\n           googlenet  min =   17.02  max =   31.61  avg =   19.92\n      googlenet_int8  min =   27.70  max =   35.49  avg =   29.18\n            resnet18  min =    9.74  max =   18.78  avg =   11.40\n       resnet18_int8  min =   18.52  max =   24.70  avg =   19.32\n             alexnet  min =   10.70  max =   15.41  avg =   11.39\n               vgg16  min =   40.80  max =   54.47  avg =   42.72\n          vgg16_int8  min =   74.71  max =   79.66  avg =   76.37\n            resnet50  min =   28.21  max =   36.62 
 avg =   29.41\n       resnet50_int8  min =   54.53  max =   76.02  avg =   56.81\n      squeezenet_ssd  min =   19.01  max =   30.68  avg =   24.89\n squeezenet_ssd_int8  min =   27.61  max =   35.87  avg =   29.22\n       mobilenet_ssd  min =   17.35  max =   22.87  avg =   18.55\n  mobilenet_ssd_int8  min =   29.92  max =   36.35  avg =   31.15\n      mobilenet_yolo  min =   31.63  max =   55.61  avg =   34.31\n  mobilenetv2_yolov3  min =   23.75  max =   35.45  avg =   25.68\n         yolov4-tiny  min =   29.23  max =   70.12  avg =   31.94\n           nanodet_m  min =   13.00  max =   21.72  avg =   15.39\n    yolo-fastest-1.1  min =    9.72  max =   17.94  avg =   11.45\n      yolo-fastestv2  min =    9.16  max =   16.35  avg =   11.08\ni@ubuntu:~/projects/ncnn/benchmark$ ./benchncnn 128 1 0 0 0\n[0 NVIDIA Tegra Xavier (nvgpu)]  queueC=2[8]  queueG=0[16]  queueT=1[1]\n[0 NVIDIA Tegra Xavier (nvgpu)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA Tegra Xavier (nvgpu)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA Tegra Xavier (nvgpu)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 128\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    4.85  max =   19.65  avg =    6.83\n     squeezenet_int8  min =   46.38  max =   49.70  avg =   47.22\n           mobilenet  min =    5.62  max =    6.61  avg =    6.33\n      mobilenet_int8  min =   87.42  max =   92.95  avg =   90.52\n        mobilenet_v2  min =    5.96  max =    7.53  avg =    6.50\n        mobilenet_v3  min =    6.77  max =    7.83  avg =    7.01\n          shufflenet  min =   10.58  max =   18.46  avg =   13.68\n       shufflenet_v2  min =   20.06  max =   21.09  avg =   20.37\n             mnasnet  min =    6.49  max =   26.49  avg =    8.26\n     proxylessnasnet  min =    6.75  max =   27.37  avg =    7.88\n     efficientnet_b0  min =   12.11  max =   48.35  avg =   14.63\n   efficientnetv2_b0  min =   24.61  max =   69.68  avg =   
34.33\n        regnety_400m  min =    9.02  max =   34.40  avg =   10.84\n           blazeface  min =    7.55  max =    8.10  avg =    7.78\n           googlenet  min =   12.57  max =   65.14  avg =   18.91\n      googlenet_int8  min =  145.74  max =  155.87  avg =  151.06\n            resnet18  min =    8.88  max =   30.48  avg =    9.34\n       resnet18_int8  min =  109.19  max =  116.78  avg =  111.52\n             alexnet  min =    9.06  max =   54.53  avg =   19.04\n               vgg16  min =   18.12  max =   37.31  avg =   19.65\n          vgg16_int8  min =  530.60  max =  551.58  avg =  542.33\n            resnet50  min =   11.62  max =   20.64  avg =   12.17\n       resnet50_int8  min =  374.83  max =  384.79  avg =  379.50\n      squeezenet_ssd  min =   14.01  max =   55.88  avg =   23.64\n squeezenet_ssd_int8  min =   89.86  max =   95.80  avg =   92.18\n       mobilenet_ssd  min =   13.20  max =   13.61  avg =   13.37\n  mobilenet_ssd_int8  min =  170.17  max =  181.48  avg =  174.93\n      mobilenet_yolo  min =   11.78  max =   20.42  avg =   13.34\n  mobilenetv2_yolov3  min =   18.08  max =   62.94  avg =   26.70\n         yolov4-tiny  min =   26.44  max =   34.83  avg =   31.83\n           nanodet_m  min =    7.93  max =    9.91  avg =    9.01\n    yolo-fastest-1.1  min =    6.03  max =   20.85  avg =    8.42\n      yolo-fastestv2  min =    9.01  max =   20.60  avg =   12.51\n```\n\n### MacBook Pro (13-inch, M1, 2020)\n```\nMacBook-Pro benchmark % ./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    4.80  max =    5.05  avg =    4.86\n     squeezenet_int8  min =    4.02  max =    4.13  avg =    4.04\n           mobilenet  min =    9.09  max =    9.41  avg =    9.22\n      mobilenet_int8  min =    4.65  max =    4.76  avg =    4.70\n        mobilenet_v2  min =    5.64  max =    5.83  avg =    5.73\n        mobilenet_v3  min =    4.64  max =    4.85  avg =    4.76\n  
        shufflenet  min =    3.48  max =    3.63  avg =    3.56\n       shufflenet_v2  min =    3.69  max =    3.81  avg =    3.73\n             mnasnet  min =    5.67  max =    5.94  avg =    5.77\n     proxylessnasnet  min =    7.03  max =    7.28  avg =    7.20\n     efficientnet_b0  min =    9.13  max =    9.53  avg =    9.28\n   efficientnetv2_b0  min =   17.37  max =   18.47  avg =   17.63\n        regnety_400m  min =    7.64  max =    8.08  avg =    7.72\n           blazeface  min =    1.80  max =    1.89  avg =    1.83\n           googlenet  min =   25.71  max =   25.90  avg =   25.81\n      googlenet_int8  min =   16.89  max =   17.10  avg =   16.97\n            resnet18  min =   17.16  max =   17.28  avg =   17.20\n       resnet18_int8  min =   15.55  max =   15.75  avg =   15.64\n             alexnet  min =   30.60  max =   31.11  avg =   30.69\n               vgg16  min =   73.41  max =   75.37  avg =   73.91\n          vgg16_int8  min =  103.81  max =  105.15  avg =  104.19\n            resnet50  min =   43.47  max =   44.24  avg =   43.68\n       resnet50_int8  min =   30.37  max =   35.25  avg =   31.61\n      squeezenet_ssd  min =   20.97  max =   21.21  avg =   21.12\n squeezenet_ssd_int8  min =   19.34  max =   19.54  avg =   19.42\n       mobilenet_ssd  min =   22.18  max =   22.58  avg =   22.28\n  mobilenet_ssd_int8  min =   13.27  max =   15.31  avg =   14.05\n      mobilenet_yolo  min =   40.78  max =   41.04  avg =   40.89\n  mobilenetv2_yolov3  min =   20.87  max =   21.92  avg =   21.02\n         yolov4-tiny  min =   30.73  max =   32.37  avg =   31.29\n           nanodet_m  min =    8.54  max =    8.86  avg =    8.65\n\n\nMacBook-Pro benchmark % ./benchncnn 10 8 0 0 0\n[0 Apple M1]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 Apple M1]  bugsbn1=0  bugbilz=151  bugcopc=0  bugihfa=0\n[0 Apple M1]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Apple M1]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 10\nnum_threads = 8\npowersave = 
0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    1.86  max =    2.22  avg =    2.01\n     squeezenet_int8  min =    2.38  max =    8.40  avg =    5.13\n           mobilenet  min =    2.50  max =    2.91  avg =    2.64\n      mobilenet_int8  min =    2.29  max =    5.26  avg =    3.54\n        mobilenet_v2  min =    2.93  max =    3.12  avg =    2.98\n        mobilenet_v3  min =    3.36  max =    3.61  avg =    3.48\n          shufflenet  min =    1.99  max =    2.54  avg =    2.18\n       shufflenet_v2  min =    2.35  max =    2.84  avg =    2.52\n             mnasnet  min =    2.81  max =    3.33  avg =    2.92\n     proxylessnasnet  min =    3.21  max =    3.62  avg =    3.36\n     efficientnet_b0  min =    4.74  max =    5.73  avg =    5.07\n   efficientnetv2_b0  min =   12.04  max =   13.04  avg =   12.61\n        regnety_400m  min =    3.86  max =    4.04  avg =    3.98\n           blazeface  min =    0.98  max =    1.11  avg =    1.03\n           googlenet  min =    4.86  max =    5.38  avg =    5.02\n      googlenet_int8  min =    9.43  max =   15.72  avg =   10.44\n            resnet18  min =    3.92  max =    4.59  avg =    4.24\n       resnet18_int8  min =    6.83  max =    7.57  avg =    7.35\n             alexnet  min =    7.49  max =    7.87  avg =    7.65\n               vgg16  min =   34.10  max =   35.29  avg =   34.60\n          vgg16_int8  min =   40.09  max =   44.66  avg =   41.95\n            resnet50  min =    7.22  max =    7.83  avg =    7.42\n       resnet50_int8  min =   14.52  max =   20.56  avg =   15.78\n      squeezenet_ssd  min =    8.52  max =   13.79  avg =    9.98\n squeezenet_ssd_int8  min =   12.38  max =   15.44  avg =   13.37\n       mobilenet_ssd  min =    4.83  max =    6.00  avg =    5.31\n  mobilenet_ssd_int8  min =    7.26  max =   13.12  avg =    9.01\n      mobilenet_yolo  min =    7.22  max =    8.66  avg =    7.99\n  mobilenetv2_yolov3  min =    7.46  max =    8.06  avg =    7.80\n         
yolov4-tiny  min =   12.17  max =   13.95  avg =   12.82\n           nanodet_m  min =    3.54  max =    4.78  avg =    3.86\n```\n\n### MacBook Air (13-inch, M3, 2024)\n```\nMacBook-Air benchmark % ./benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    3.59  max =    4.20  avg =    3.80\n     squeezenet_int8  min =    2.61  max =    2.82  avg =    2.74\n           mobilenet  min =    6.67  max =    6.92  avg =    6.85\n      mobilenet_int8  min =    3.61  max =    3.66  avg =    3.62\n        mobilenet_v2  min =    4.08  max =    4.15  avg =    4.10\n        mobilenet_v3  min =    3.32  max =    3.44  avg =    3.34\n          shufflenet  min =    2.08  max =    2.13  avg =    2.10\n       shufflenet_v2  min =    2.35  max =    2.44  avg =    2.37\n             mnasnet  min =    4.14  max =    4.23  avg =    4.18\n     proxylessnasnet  min =    5.09  max =    5.15  avg =    5.11\n     efficientnet_b0  min =    6.67  max =    6.75  avg =    6.70\n   efficientnetv2_b0  min =    8.79  max =    8.83  avg =    8.81\n        regnety_400m  min =    5.68  max =    5.73  avg =    5.69\n           blazeface  min =    0.75  max =    0.77  avg =    0.76\n           googlenet  min =   15.94  max =   15.97  avg =   15.96\n      googlenet_int8  min =   10.88  max =   10.92  avg =   10.89\n            resnet18  min =   12.60  max =   12.63  avg =   12.61\n       resnet18_int8  min =    9.88  max =    9.95  avg =    9.90\n             alexnet  min =   12.72  max =   12.82  avg =   12.77\n               vgg16  min =   57.85  max =   61.44  avg =   58.40\n          vgg16_int8  min =   78.53  max =   79.85  avg =   78.83\n            resnet50  min =   34.79  max =   34.85  avg =   34.81\n       resnet50_int8  min =   20.56  max =   20.62  avg =   20.58\n      squeezenet_ssd  min =    9.64  max =    9.82  avg =    9.69\n squeezenet_ssd_int8  min =    8.21  max =    8.34  avg =    8.25\n       
mobilenet_ssd  min =   14.21  max =   14.34  avg =   14.25\n  mobilenet_ssd_int8  min =    7.35  max =    7.41  avg =    7.37\n      mobilenet_yolo  min =   31.61  max =   31.74  avg =   31.64\n  mobilenetv2_yolov3  min =   15.79  max =   15.87  avg =   15.83\n         yolov4-tiny  min =   22.93  max =   22.99  avg =   22.96\n           nanodet_m  min =    5.58  max =    5.62  avg =    5.59\n    yolo-fastest-1.1  min =    2.00  max =    2.05  avg =    2.01\n      yolo-fastestv2  min =    1.75  max =    1.77  avg =    1.76\n  vision_transformer  min = 1020.57  max = 1046.02  avg = 1028.75\n          FastestDet  min =    1.88  max =    1.93  avg =    1.89\n\nMacBook-Air benchmark % ./benchncnn 10 8 0 0 0\n[0 Apple M3]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 Apple M3]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Apple M3]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[0 Apple M3]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 Apple M3]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 10\nnum_threads = 8\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    1.79  max =    2.48  avg =    2.16\n     squeezenet_int8  min =    2.78  max =    2.93  avg =    2.80\n           mobilenet  min =    1.40  max =    1.85  avg =    1.68\n      mobilenet_int8  min =    3.60  max =    3.67  avg =    3.61\n        mobilenet_v2  min =    1.68  max =    2.28  avg =    1.97\n        mobilenet_v3  min =    1.71  max =    2.29  avg =    2.00\n          shufflenet  min =    1.18  max =    2.49  avg =    1.78\n       shufflenet_v2  min =    1.45  max =    2.09  avg =    1.70\n             mnasnet  min =    1.74  max =    2.25  avg =    2.05\n     proxylessnasnet  min =    1.75  max =    2.18  avg =    2.02\n     efficientnet_b0  min =    2.71  max =    3.19  avg =    2.99\n   efficientnetv2_b0  min =    6.77  max =    7.04  avg =    6.88\n        regnety_400m  min =    1.94  max =    2.40  avg =    2.10\n           blazeface  min =    1.05  max = 
   1.43  avg =    1.24\n           googlenet  min =    3.99  max =    4.42  avg =    4.27\n      googlenet_int8  min =   10.83  max =   10.86  avg =   10.85\n            resnet18  min =    2.50  max =    2.77  avg =    2.70\n       resnet18_int8  min =    9.86  max =    9.91  avg =    9.88\n             alexnet  min =    2.99  max =    3.28  avg =    3.11\n               vgg16  min =   12.41  max =   13.13  avg =   12.54\n          vgg16_int8  min =   78.52  max =   78.67  avg =   78.61\n            resnet50  min =    5.46  max =    5.52  avg =    5.49\n       resnet50_int8  min =   20.57  max =   20.59  avg =   20.58\n      squeezenet_ssd  min =    3.86  max =    4.53  avg =    4.17\n squeezenet_ssd_int8  min =    8.20  max =    8.35  avg =    8.25\n       mobilenet_ssd  min =    3.19  max =    3.75  avg =    3.52\n  mobilenet_ssd_int8  min =    7.35  max =    7.41  avg =    7.37\n      mobilenet_yolo  min =    4.77  max =    4.88  avg =    4.81\n  mobilenetv2_yolov3  min =    4.28  max =    4.88  avg =    4.62\n         yolov4-tiny  min =    6.76  max =    7.38  avg =    7.21\n           nanodet_m  min =    2.92  max =    4.71  avg =    3.46\n    yolo-fastest-1.1  min =    1.48  max =    2.04  avg =    1.87\n      yolo-fastestv2  min =    1.41  max =    1.97  avg =    1.74\n  vision_transformer  min =   80.34  max =   80.66  avg =   80.44\n          FastestDet  min =    1.43  max =    2.04  avg =    1.73\n```\n\n### Ingenic T40XP Xburst2 Core X2 1.4Ghz (without MSA)\n```\nloop_count = 8\nnum_threads = 2\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =  921.23  max =  944.03  avg =  930.71\n     squeezenet_int8  min = 3280.89  max = 3404.83  avg = 3359.68\n           mobilenet  min = 1277.61  max = 1298.51  avg = 1284.38\n      mobilenet_int8  min = 4342.67  max = 4350.21  avg = 4345.85\n        mobilenet_v2  min =  780.92  max =  783.93  avg =  782.79\n        mobilenet_v3  min =  650.59  max =  655.08  avg =  652.06\n          
shufflenet  min =  352.75  max =  353.69  avg =  353.24\n       shufflenet_v2  min =  362.82  max =  364.08  avg =  363.38\n             mnasnet  min =  790.45  max =  791.89  avg =  790.99\n     proxylessnasnet  min =  868.71  max =  870.47  avg =  869.17\n     efficientnet_b0  min = 1491.44  max = 1492.36  avg = 1491.95\n   efficientnetv2_b0  min = 2135.04  max = 2148.02  avg = 2139.99\n        regnety_400m  min = 1000.53  max = 1005.29  avg = 1001.81\n           blazeface  min =  102.72  max =  104.18  avg =  103.51\n           googlenet  min = 3652.89  max = 3705.40  avg = 3675.43\n      googlenet_int8  min = 8067.30  max = 8070.22  avg = 8069.21\n```\n### MacBook Pro (15-inch, 2019) - 2.6GHz six cores Intel Core i7 && Radeon Pro 555X 4GB && Intel UHD Graphics 630 1536MB\n```\n\n➜  benchmark git:(master) ✗ ./benchncnn 10 1 0 -1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   14.68  max =   17.06  avg =   15.55\n     squeezenet_int8  min =   51.64  max =   57.85  avg =   54.01\n           mobilenet  min =   20.74  max =   25.38  avg =   22.77\n      mobilenet_int8  min =   66.84  max =   91.01  avg =   75.69\n        mobilenet_v2  min =   14.04  max =   20.06  avg =   16.36\n        mobilenet_v3  min =   11.89  max =   16.22  avg =   13.58\n          shufflenet  min =   13.74  max =   17.10  avg =   15.02\n       shufflenet_v2  min =   12.73  max =   14.36  avg =   13.53\n             mnasnet  min =   11.05  max =   17.79  avg =   13.82\n     proxylessnasnet  min =   12.60  max =   27.38  avg =   17.55\n     efficientnet_b0  min =   23.73  max =   26.82  avg =   25.45\n   efficientnetv2_b0  min =   27.03  max =   33.89  avg =   30.78\n        regnety_400m  min =   13.81  max =   21.50  avg =   15.40\n           blazeface  min =    3.72  max =    4.98  avg =    4.43\n           googlenet  min =   65.88  max =   76.62  avg =   69.40\n      googlenet_int8  min =  192.07  max =  227.85  avg =  
203.81\n            resnet18  min =   79.45  max =   90.41  avg =   85.32\n       resnet18_int8  min =  201.71  max =  222.31  avg =  207.39\n             alexnet  min =   70.67  max =   80.13  avg =   74.43\n               vgg16  min =  233.74  max =  261.62  avg =  250.99\n          vgg16_int8  min = 1722.78  max = 1997.14  avg = 1772.71\n            resnet50  min =  130.39  max =  135.31  avg =  133.27\n       resnet50_int8  min =  439.69  max =  483.78  avg =  461.33\n      squeezenet_ssd  min =  108.54  max =  122.15  avg =  115.02\n squeezenet_ssd_int8  min =  175.58  max =  185.09  avg =  181.33\n       mobilenet_ssd  min =   51.89  max =   59.32  avg =   54.30\n  mobilenet_ssd_int8  min =  140.15  max =  192.10  avg =  164.47\n      mobilenet_yolo  min =  117.37  max =  131.89  avg =  126.34\n  mobilenetv2_yolov3  min =   57.57  max =   72.29  avg =   64.92\n         yolov4-tiny  min =  114.45  max =  123.15  avg =  116.91\n           nanodet_m  min =   25.65  max =   33.27  avg =   28.75\n\n➜  benchmark git:(master) ✗ ./benchncnn 10 1 0 0\n[0 AMD Radeon Pro 555X]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 AMD Radeon Pro 555X]  bugsbn1=0  bugbilz=196  bugcopc=0  bugihfa=0\n[0 AMD Radeon Pro 555X]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 AMD Radeon Pro 555X]  subgroup=64  basic=0  vote=0  ballot=0  shuffle=0\n[1 Intel(R) UHD Graphics 630]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[1 Intel(R) UHD Graphics 630]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 Intel(R) UHD Graphics 630]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[1 Intel(R) UHD Graphics 630]  subgroup=32  basic=0  vote=0  ballot=0  shuffle=0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =    6.66  max =    7.30  avg =    6.91\n     squeezenet_int8  min =   49.97  max =   60.92  avg =   53.86\n           mobilenet  min =    6.99  max =    7.48  avg =    7.17\n      mobilenet_int8  min =   70.46  max =   83.20  avg =   79.33\n        
mobilenet_v2  min =    9.56  max =   10.87  avg =   10.34\n        mobilenet_v3  min =   11.48  max =   12.20  avg =   11.94\n          shufflenet  min =    4.52  max =    5.25  avg =    4.96\n       shufflenet_v2  min =    7.29  max =    9.65  avg =    7.99\n             mnasnet  min =    9.82  max =   11.88  avg =   10.62\n     proxylessnasnet  min =    7.85  max =    8.41  avg =    8.07\n     efficientnet_b0  min =   17.34  max =   17.85  avg =   17.56\n   efficientnetv2_b0  min =   21.95  max =   24.10  avg =   23.15\n        regnety_400m  min =   13.54  max =   14.83  avg =   14.11\n           blazeface  min =    3.26  max =    6.59  avg =    5.50\n           googlenet  min =   17.62  max =   19.47  avg =   18.27\n      googlenet_int8  min =  198.88  max =  247.97  avg =  223.31\n            resnet18  min =   11.10  max =   12.01  avg =   11.59\n       resnet18_int8  min =  225.56  max =  259.39  avg =  238.97\n             alexnet  min =   17.66  max =   19.19  avg =   18.24\n               vgg16  min =   53.20  max =   54.88  avg =   53.73\n          vgg16_int8  min = 1747.52  max = 2130.08  avg = 1880.42\n            resnet50  min =   27.38  max =   28.84  avg =   28.34\n       resnet50_int8  min =  461.86  max =  579.83  avg =  528.15\n      squeezenet_ssd  min =   19.99  max =   20.98  avg =   20.50\n squeezenet_ssd_int8  min =  185.20  max =  209.66  avg =  196.81\n       mobilenet_ssd  min =   12.81  max =   14.21  avg =   13.48\n  mobilenet_ssd_int8  min =  139.29  max =  168.38  avg =  148.20\n      mobilenet_yolo  min =   19.50  max =   20.51  avg =   19.97\n  mobilenetv2_yolov3  min =   15.95  max =   19.28  avg =   16.85\n         yolov4-tiny  min =   21.43  max =   23.42  avg =   22.28\n           nanodet_m  min =    7.95  max =    9.23  avg =    8.48\n\n➜  benchmark git:(master) ✗ ./benchncnn 10 1 0 1\n[0 AMD Radeon Pro 555X]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[0 AMD Radeon Pro 555X]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 AMD Radeon 
Pro 555X]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 AMD Radeon Pro 555X]  subgroup=64  basic=0  vote=0  ballot=0  shuffle=0\n[1 Intel(R) UHD Graphics 630]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[1 Intel(R) UHD Graphics 630]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 Intel(R) UHD Graphics 630]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[1 Intel(R) UHD Graphics 630]  subgroup=32  basic=0  vote=0  ballot=0  shuffle=0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 1\ncooling_down = 1\n          squeezenet  min =   11.06  max =   13.22  avg =   12.09\n     squeezenet_int8  min =   54.87  max =   64.55  avg =   59.84\n           mobilenet  min =   13.65  max =   16.70  avg =   14.81\n      mobilenet_int8  min =   72.36  max =   93.58  avg =   86.40\n        mobilenet_v2  min =   11.88  max =   15.90  avg =   13.47\n        mobilenet_v3  min =   12.68  max =   16.16  avg =   14.56\n          shufflenet  min =   13.87  max =   16.68  avg =   14.93\n       shufflenet_v2  min =   11.73  max =   13.65  avg =   12.87\n             mnasnet  min =   12.71  max =   15.56  avg =   14.22\n     proxylessnasnet  min =   14.03  max =   17.28  avg =   15.37\n     efficientnet_b0  min =   17.50  max =   21.46  avg =   19.30\n   efficientnetv2_b0  min =   35.47  max =   38.58  avg =   36.89\n        regnety_400m  min =   16.00  max =   19.45  avg =   17.48\n           blazeface  min =    6.08  max =    7.18  avg =    6.39\n           googlenet  min =   23.35  max =   29.68  avg =   25.77\n      googlenet_int8  min =  198.49  max =  254.38  avg =  222.77\n            resnet18  min =   21.85  max =   28.10  avg =   24.70\n       resnet18_int8  min =  211.21  max =  279.55  avg =  222.64\n             alexnet  min =   24.45  max =   30.47  avg =   26.87\n               vgg16  min =  115.20  max =  117.76  avg =  116.48\n          vgg16_int8  min = 1715.92  max = 1960.02  avg = 1800.21\n            resnet50  min =   45.65  max =   46.25  avg =   46.05\n       resnet50_int8  min 
=  448.13  max =  555.53  avg =  485.47\n      squeezenet_ssd  min =   28.43  max =   33.26  avg =   29.85\n squeezenet_ssd_int8  min =  180.91  max =  202.51  avg =  190.84\n       mobilenet_ssd  min =   21.03  max =   26.93  avg =   23.48\n  mobilenet_ssd_int8  min =  154.41  max =  184.64  avg =  165.04\n      mobilenet_yolo  min =   37.04  max =   38.64  avg =   37.52\n  mobilenetv2_yolov3  min =   24.98  max =   30.03  avg =   27.70\n         yolov4-tiny  min =   39.29  max =   50.25  avg =   44.18\n           nanodet_m  min =   15.97  max =   20.27  avg =   17.93\n```\n\n### Sunway SW421 (sw_64 1.7GHz * 4)\n```\nroot@SW421:~/Desktop/ncnn-20220420/ncnn-20220420/build/benchmark$ ./benchncnn\nloop_count = 4\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  943.61  max =  966.98  avg =  955.24\n     squeezenet_int8  min =  654.75  max =  731.28  avg =  674.87\n           mobilenet  min = 1584.87  max = 1612.88  avg = 1597.47\n      mobilenet_int8  min = 1198.21  max = 1204.82  avg = 1201.61\n        mobilenet_v2  min =  733.94  max =  754.79  avg =  744.48\n        mobilenet_v3  min =  665.26  max =  683.81  avg =  675.18\n          shufflenet  min =  401.53  max =  435.21  avg =  420.32\n       shufflenet_v2  min =  294.65  max =  316.50  avg =  309.08\n             mnasnet  min =  671.22  max =  808.46  avg =  713.01\n     proxylessnasnet  min =  686.12  max =  698.13  avg =  692.29\n     efficientnet_b0  min = 1151.75  max = 1184.86  avg = 1161.33\n   efficientnetv2_b0  min = 1372.05  max = 1395.22  avg = 1379.47\n        regnety_400m  min =  933.93  max =  949.42  avg =  942.43\n           blazeface  min =  104.72  max =  136.77  avg =  112.86\n           googlenet  min = 2574.02  max = 4330.81  avg = 3015.56\n      googlenet_int8  min = 2136.42  max = 2183.61  avg = 2166.45\n            resnet18  min = 2511.12  max = 2537.42  avg = 2526.08\n       resnet18_int8  min = 2003.84  max = 2027.50  avg = 2012.48\n      
       alexnet  min =  668.28  max =  686.35  avg =  673.95\n               vgg16  min = 24863.92  max = 24967.94  avg = 24907.39\n          vgg16_int8  min = 18735.54  max = 18926.83  avg = 18859.32\n            resnet50  min = 9896.47  max = 9981.13  avg = 9929.77\n       resnet50_int8  min = 6971.01  max = 7085.29  avg = 7017.88\n      squeezenet_ssd  min = 1798.23  max = 1814.25  avg = 1806.57\n squeezenet_ssd_int8  min = 1586.11  max = 1606.83  avg = 1596.75\n       mobilenet_ssd  min = 3995.54  max = 4018.27  avg = 4002.78\n  mobilenet_ssd_int8  min = 2753.65  max = 2766.06  avg = 2760.04\n      mobilenet_yolo  min = 10892.22  max = 10978.84  avg = 10921.00\n  mobilenetv2_yolov3  min = 3600.80  max = 3607.72  avg = 3603.18\n         yolov4-tiny  min = 5565.82  max = 5582.22  avg = 5571.78\n           nanodet_m  min = 1182.97  max = 1220.47  avg = 1199.30\n    yolo-fastest-1.1  min =  340.63  max =  360.95  avg =  349.15\n      yolo-fastestv2  min =  255.47  max =  281.79  avg =  268.82\n```\n\n### Sunway SW831 (sw_64 2.5GHz * 8)\n```\nroot@SW831:~/Desktop/ncnn_20221128/build/benchmark$ ./benchncnn 5 8 2 -1 0\nloop_count = 5\nnum_threads = 8\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =  343.27  max =  420.86  avg =  364.97\n     squeezenet_int8  min =  237.91  max =  251.71  avg =  243.84\n           mobilenet  min =  607.80  max =  696.04  avg =  646.61\n      mobilenet_int8  min =  428.37  max =  499.32  avg =  460.21\n        mobilenet_v2  min =  291.29  max =  381.93  avg =  311.76\n        mobilenet_v3  min =  262.01  max =  287.93  avg =  277.29\n          shufflenet  min =  144.89  max =  169.10  avg =  150.84\n       shufflenet_v2  min =  121.44  max =  139.62  avg =  126.96\n             mnasnet  min =  265.59  max =  353.84  avg =  288.79\n     proxylessnasnet  min =  272.08  max =  293.19  avg =  284.61\n     efficientnet_b0  min =  445.40  max =  508.36  avg =  467.84\n   efficientnetv2_b0  min =  550.57  max =  
619.16  avg =  581.85\n        regnety_400m  min =  374.02  max =  460.64  avg =  394.49\n           blazeface  min =   39.93  max =   59.19  avg =   44.14\n           googlenet  min =  941.35  max = 1014.23  avg =  976.37\n      googlenet_int8  min =  770.66  max =  827.44  avg =  797.93\n            resnet18  min =  815.02  max =  895.13  avg =  843.57\n       resnet18_int8  min =  701.10  max =  776.40  avg =  729.49\n             alexnet  min =  216.74  max =  273.39  avg =  228.99\n               vgg16  min = 8645.55  max = 8699.60  avg = 8681.61\n          vgg16_int8  min = 6786.91  max = 6930.90  avg = 6854.29\n            resnet50  min = 3624.02  max = 3698.91  avg = 3652.31\n       resnet50_int8  min = 2537.92  max = 2618.10  avg = 2567.88\n      squeezenet_ssd  min =  635.25  max =  693.23  avg =  663.56\n squeezenet_ssd_int8  min =  577.37  max =  641.12  avg =  603.34\n       mobilenet_ssd  min = 1529.35  max = 1711.54  avg = 1582.10\n  mobilenet_ssd_int8  min =  982.65  max = 1042.82  avg = 1016.62\n      mobilenet_yolo  min = 4053.62  max = 4124.84  avg = 4094.38\n  mobilenetv2_yolov3  min = 1367.81  max = 1527.79  avg = 1433.04\n         yolov4-tiny  min = 1943.20  max = 2028.02  avg = 1978.31\n           nanodet_m  min =  433.66  max =  498.83  avg =  457.77\n    yolo-fastest-1.1  min =  140.07  max =  284.35  avg =  192.46\n      yolo-fastestv2  min =  123.91  max =  225.70  avg =  152.54\n  vision_transformer  min = 2470.70  max = 2509.73  avg = 2486.40\n          FastestDet  min =  145.30  max =  163.43  avg =  154.35\n```\n\n### AXERA AX620A (Cortex-A7 1.0GHz * 4)\n```\n/root/axera # ./benchncnn 4 1 0 -1 0\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =  530.57  max =  533.11  avg =  532.22\n     squeezenet_int8  min =  359.74  max =  360.02  avg =  359.86\n           mobilenet  min =  920.12  max =  921.04  avg =  920.52\n      mobilenet_int8  min =  532.60  max =  533.08  avg =  
532.81\n        mobilenet_v2  min =  608.81  max =  609.49  avg =  609.18\n        mobilenet_v3  min =  531.43  max =  532.34  avg =  531.90\n          shufflenet  min =  297.91  max =  300.08  avg =  299.06\n       shufflenet_v2  min =  288.44  max =  289.30  avg =  288.79\n             mnasnet  min =  590.29  max =  590.99  avg =  590.63\n     proxylessnasnet  min =  678.22  max =  679.22  avg =  678.63\n     efficientnet_b0  min = 1041.41  max = 1043.79  avg = 1042.61\n   efficientnetv2_b0  min = 1222.41  max = 1223.63  avg = 1222.91\n        regnety_400m  min =  723.83  max =  725.37  avg =  724.64\n           blazeface  min =   86.77  max =   87.21  avg =   86.92\n           googlenet  min = 1740.32  max = 1741.44  avg = 1740.81\n      googlenet_int8  min = 1167.95  max = 1169.18  avg = 1168.54\n            resnet18  min = 1584.41  max = 1585.36  avg = 1584.97\n       resnet18_int8  min =  915.78  max =  918.77  avg =  917.16\n             alexnet  min = 1811.30  max = 1812.86  avg = 1812.07\n            resnet50  min = 4516.48  max = 4523.48  avg = 4519.03\n       resnet50_int8  min = 2573.18  max = 2574.29  avg = 2573.69\n      squeezenet_ssd  min = 1191.79  max = 1193.71  avg = 1193.02\n squeezenet_ssd_int8  min =  862.36  max =  863.69  avg =  862.83\n       mobilenet_ssd  min = 1950.48  max = 1950.98  avg = 1950.65\n  mobilenet_ssd_int8  min = 1081.70  max = 1082.64  avg = 1082.20\n      mobilenet_yolo  min = 4629.22  max = 4630.23  avg = 4629.69\n  mobilenetv2_yolov3  min = 2233.05  max = 2234.14  avg = 2233.42\n         yolov4-tiny  min = 2942.58  max = 2946.55  avg = 2944.81\n           nanodet_m  min =  692.19  max =  693.36  avg =  692.79\n    yolo-fastest-1.1  min =  333.62  max =  334.43  avg =  334.00\n      yolo-fastestv2  min =  256.41  max =  257.32  avg =  256.83\n\n\n/root/axera # ./benchncnn 4 4 0 -1 0\nloop_count = 4\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =  150.38  max =  179.83  avg = 
 157.90\n     squeezenet_int8  min =  106.97  max =  107.43  avg =  107.22\n           mobilenet  min =  248.92  max =  273.98  avg =  255.72\n      mobilenet_int8  min =  139.49  max =  139.65  avg =  139.60\n        mobilenet_v2  min =  174.67  max =  204.35  avg =  182.30\n        mobilenet_v3  min =  152.17  max =  152.54  avg =  152.30\n          shufflenet  min =   98.74  max =  125.99  avg =  105.74\n       shufflenet_v2  min =  103.44  max =  103.88  avg =  103.65\n             mnasnet  min =  167.63  max =  197.54  avg =  175.28\n     proxylessnasnet  min =  186.02  max =  186.32  avg =  186.15\n     efficientnet_b0  min =  284.35  max =  318.17  avg =  292.90\n   efficientnetv2_b0  min =  329.56  max =  359.71  avg =  337.22\n        regnety_400m  min =  246.91  max =  277.08  avg =  254.71\n           blazeface  min =   30.95  max =   31.31  avg =   31.16\n           googlenet  min =  474.87  max =  504.38  avg =  489.43\n      googlenet_int8  min =  322.06  max =  331.97  avg =  324.57\n            resnet18  min =  440.03  max =  475.28  avg =  456.70\n       resnet18_int8  min =  252.01  max =  280.64  avg =  259.22\n             alexnet  min =  453.16  max =  478.80  avg =  465.88\n            resnet50  min = 1214.70  max = 1252.42  avg = 1229.22\n       resnet50_int8  min =  684.53  max =  715.65  avg =  706.14\n      squeezenet_ssd  min =  358.84  max =  393.45  avg =  367.77\n squeezenet_ssd_int8  min =  281.56  max =  312.86  avg =  289.85\n       mobilenet_ssd  min =  519.11  max =  559.14  avg =  538.41\n  mobilenet_ssd_int8  min =  284.58  max =  310.02  avg =  291.02\n      mobilenet_yolo  min = 1238.87  max = 1284.74  avg = 1260.51\n  mobilenetv2_yolov3  min =  624.42  max =  665.81  avg =  642.15\n         yolov4-tiny  min =  826.46  max =  852.97  avg =  844.88\n           nanodet_m  min =  246.76  max =  279.09  avg =  255.04\n    yolo-fastest-1.1  min =  116.12  max =  116.95  avg =  116.50\n      yolo-fastestv2  min =   91.08  max =  
102.93  avg =   94.41\n```\n\n### AMD Ryzen 5700g (Zen3 3.8 GHz ~ 4.6 GHz x 8)\nTested in WSL2 with Ubuntu 20.04\n```\n$ ./benchncnn  10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.53  max =    7.05  avg =    6.77\n     squeezenet_int8  min =   17.72  max =   17.86  avg =   17.79\n           mobilenet  min =   11.43  max =   11.98  avg =   11.64\n      mobilenet_int8  min =   22.91  max =   24.48  avg =   23.26\n        mobilenet_v2  min =    8.28  max =    9.29  avg =    8.66\n        mobilenet_v3  min =    6.86  max =    6.98  avg =    6.94\n          shufflenet  min =    3.75  max =    4.64  avg =    3.91\n       shufflenet_v2  min =    5.08  max =    5.80  avg =    5.22\n             mnasnet  min =    7.54  max =    8.60  avg =    7.81\n     proxylessnasnet  min =    9.18  max =   10.33  avg =    9.41\n     efficientnet_b0  min =   22.57  max =   23.67  avg =   22.93\n   efficientnetv2_b0  min =   21.23  max =   22.08  avg =   21.45\n        regnety_400m  min =   10.56  max =   10.80  avg =   10.63\n           blazeface  min =    1.08  max =    1.17  avg =    1.11\n           googlenet  min =   27.91  max =   29.51  avg =   28.28\n      googlenet_int8  min =   71.00  max =   86.86  avg =   72.74\n            resnet18  min =   20.11  max =   20.56  avg =   20.26\n       resnet18_int8  min =   63.80  max =   65.13  avg =   64.19\n             alexnet  min =   20.64  max =   24.25  avg =   21.65\n               vgg16  min =  119.99  max =  125.45  avg =  121.59\n          vgg16_int8  min =  268.11  max =  270.41  avg =  269.15\n            resnet50  min =   55.42  max =   56.29  avg =   55.70\n       resnet50_int8  min =  126.73  max =  132.37  avg =  128.72\n      squeezenet_ssd  min =   28.41  max =   30.30  avg =   29.20\n squeezenet_ssd_int8  min =   41.12  max =   42.53  avg =   41.52\n       mobilenet_ssd  min =   24.15  max =   24.91  avg =   24.33\n  mobilenet_ssd_int8  min =  
 46.06  max =   59.19  avg =   49.87\n      mobilenet_yolo  min =   67.58  max =   73.19  avg =   68.99\n  mobilenetv2_yolov3  min =   29.44  max =   30.46  avg =   29.78\n         yolov4-tiny  min =   41.89  max =   43.47  avg =   42.37\n           nanodet_m  min =   11.23  max =   11.47  avg =   11.36\n    yolo-fastest-1.1  min =    3.86  max =    4.64  avg =    4.04\n      yolo-fastestv2  min =    3.43  max =    3.99  avg =    3.56\n  vision_transformer  min = 1590.86  max = 1593.97  avg = 1591.91\n\n\n$ ./benchncnn  10 16 0 -1 0\nloop_count = 10\nnum_threads = 16\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    2.94  max =    4.66  avg =    3.31\n     squeezenet_int8  min =    3.53  max =    5.26  avg =    3.92\n           mobilenet  min =    3.96  max =    5.30  avg =    4.21\n      mobilenet_int8  min =    4.27  max =    4.56  avg =    4.35\n        mobilenet_v2  min =    3.63  max =    4.20  avg =    3.82\n        mobilenet_v3  min =    3.25  max =    4.79  avg =    3.58\n          shufflenet  min =    2.98  max =    3.59  avg =    3.12\n       shufflenet_v2  min =    2.62  max =    5.93  avg =    3.04\n             mnasnet  min =    3.09  max =    3.49  avg =    3.28\n     proxylessnasnet  min =    3.57  max =    4.18  avg =    3.76\n     efficientnet_b0  min =    5.98  max =    6.48  avg =    6.18\n   efficientnetv2_b0  min =    6.96  max =    7.48  avg =    7.13\n        regnety_400m  min =    8.71  max =   11.89  avg =    9.61\n           blazeface  min =    0.86  max =    0.96  avg =    0.89\n           googlenet  min =   10.75  max =   11.33  avg =   11.00\n      googlenet_int8  min =   12.75  max =   15.47  avg =   13.50\n            resnet18  min =    8.92  max =   16.08  avg =   10.08\n       resnet18_int8  min =   10.55  max =   10.99  avg =   10.69\n             alexnet  min =    9.95  max =   10.45  avg =   10.17\n               vgg16  min =   52.28  max =   53.69  avg =   52.89\n          vgg16_int8  min =   
44.90  max =   47.90  avg =   45.61\n            resnet50  min =   17.80  max =   21.43  avg =   18.66\n       resnet50_int8  min =   21.80  max =   25.42  avg =   22.75\n      squeezenet_ssd  min =   14.49  max =   16.36  avg =   14.90\n squeezenet_ssd_int8  min =   10.02  max =   10.49  avg =   10.28\n       mobilenet_ssd  min =    7.20  max =    7.86  avg =    7.51\n  mobilenet_ssd_int8  min =    8.51  max =   10.90  avg =    9.09\n      mobilenet_yolo  min =   35.67  max =   44.84  avg =   37.33\n  mobilenetv2_yolov3  min =   12.72  max =   17.16  avg =   13.67\n         yolov4-tiny  min =   20.81  max =   22.11  avg =   21.33\n           nanodet_m  min =    5.13  max =   42.12  avg =    9.07\n    yolo-fastest-1.1  min =    3.05  max =    4.72  avg =    3.39\n      yolo-fastestv2  min =    3.33  max =    3.73  avg =    3.44\n  vision_transformer  min =  214.91  max =  229.91  avg =  220.82\n```\n\n### Intel Celeron M 420 (Yonah 1.60 GHz x 1)\n\nTested on `Debian GNU/Linux 11 (bullseye) i686` with `cmake -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX=OFF -DNCNN_AVX2=OFF -DNCNN_AVX512=OFF -DNCNN_BUILD_TESTS=ON ..`.\n\n```\nmouri@Mouri-Laptop-2:~/ncnn/benchmark$ ./../build/benchmark/benchncnn\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  289.23  max =  301.83  avg =  292.90\n     squeezenet_int8  min =  442.82  max =  457.21  avg =  446.89\n           mobilenet  min =  549.62  max =  561.20  avg =  554.78\n      mobilenet_int8  min =  823.92  max =  837.70  avg =  830.52\n        mobilenet_v2  min =  341.72  max =  353.77  avg =  345.34\n        mobilenet_v3  min =  267.68  max =  282.08  avg =  273.10\n          shufflenet  min =  151.66  max =  153.02  avg =  152.24\n       shufflenet_v2  min =  161.54  max =  163.38  avg =  162.13\n             mnasnet  min =  322.66  max =  336.91  avg =  326.86\n     proxylessnasnet  min =  356.63  max =  368.79  avg =  360.66\n     efficientnet_b0  min =  489.92  max =  
505.11  avg =  497.32\n   efficientnetv2_b0  min =  618.16  max =  632.02  avg =  622.82\n        regnety_400m  min =  414.83  max =  428.42  avg =  419.28\n           blazeface  min =   38.56  max =   40.05  avg =   39.05\n           googlenet  min = 1022.54  max = 1037.53  avg = 1029.48\n      googlenet_int8  min = 1493.35  max = 1495.46  avg = 1494.31\n            resnet18  min =  803.32  max =  818.27  avg =  812.49\n       resnet18_int8  min = 1188.26  max = 1200.88  avg = 1192.56\n             alexnet  min =  613.78  max =  623.88  avg =  619.99\n               vgg16  min = 4465.44  max = 4478.12  avg = 4474.16\n          vgg16_int8  min = 6042.40  max = 6114.37  avg = 6077.07\n            resnet50  min = 2517.75  max = 2528.42  avg = 2522.83\n       resnet50_int8  min = 3746.28  max = 3771.09  avg = 3756.88\n      squeezenet_ssd  min =  585.56  max =  636.01  avg =  602.62\n squeezenet_ssd_int8  min =  822.43  max =  968.77  avg =  862.33\n       mobilenet_ssd  min = 1116.98  max = 1139.17  avg = 1127.65\n  mobilenet_ssd_int8  min = 1665.03  max = 1670.55  avg = 1668.37\n      mobilenet_yolo  min = 2638.61  max = 2666.54  avg = 2652.26\n  mobilenetv2_yolov3  min = 1248.56  max = 1255.98  avg = 1251.22\n         yolov4-tiny  min = 1507.31  max = 1525.56  avg = 1514.66\n           nanodet_m  min =  386.41  max =  400.63  avg =  391.21\n    yolo-fastest-1.1  min =  159.97  max =  164.53  avg =  161.41\n      yolo-fastestv2  min =  134.29  max =  135.47  avg =  134.70\n  vision_transformer  min = 22201.32  max = 22510.75  avg = 22315.09\n          FastestDet  min =  146.94  max =  148.50  avg =  147.44\n```\n### VisionFive2, JH7110 (SiFive-U74(RV64GC) 1.5GHz x 4) riscv64 with PowerVR B-Series BXE-4-32\nTested on Debian 11 with g++ 12.2.0 and Vulkan 1.3.231\n```\nuser@starfive:~/Downloads/ncnn-master/benchmark$ ./benchncnn 10 4 0 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =  149.06  max =  
149.33  avg =  149.17\n     squeezenet_int8  min = 1318.66  max = 1349.04  avg = 1328.87\n           mobilenet  min =  255.13  max =  255.71  avg =  255.39\n      mobilenet_int8  min = 2025.40  max = 2036.00  avg = 2031.67\n        mobilenet_v2  min =  173.92  max =  174.60  avg =  174.31\n        mobilenet_v3  min =  166.58  max =  167.30  avg =  167.02\n          shufflenet  min =   91.36  max =   91.72  avg =   91.57\n       shufflenet_v2  min =   83.50  max =   83.95  avg =   83.76\n             mnasnet  min =  190.42  max =  191.15  avg =  190.66\n     proxylessnasnet  min =  226.35  max =  226.81  avg =  226.52\n     efficientnet_b0  min =  342.74  max =  343.62  avg =  343.15\n   efficientnetv2_b0  min =  343.31  max =  344.23  avg =  343.80\n        regnety_400m  min =  227.04  max =  227.75  avg =  227.43\n           blazeface  min =   26.18  max =   26.43  avg =   26.28\n           googlenet  min =  506.76  max =  508.58  avg =  507.84\n      googlenet_int8  min = 3827.36  max = 3856.05  avg = 3835.67\n            resnet18  min =  401.12  max =  402.27  avg =  401.61\n       resnet18_int8  min = 4053.06  max = 4069.98  avg = 4061.63\n             alexnet  min =  297.81  max =  320.09  avg =  301.39\n               vgg16  min = 2338.76  max = 2351.23  avg = 2346.19\n          vgg16_int8  min = 36846.41  max = 36929.56  avg = 36886.26\n            resnet50  min = 1189.88  max = 1211.10  avg = 1193.34\n       resnet50_int8  min = 11819.59  max = 11884.94  avg = 11845.22\n      squeezenet_ssd  min =  351.71  max =  352.73  avg =  352.30\n squeezenet_ssd_int8  min = 2872.00  max = 2903.35  avg = 2891.01\n       mobilenet_ssd  min =  530.92  max =  531.73  avg =  531.28\n  mobilenet_ssd_int8  min = 4511.56  max = 4553.41  avg = 4523.51\n      mobilenet_yolo  min = 1357.14  max = 1359.82  avg = 1358.83\n  mobilenetv2_yolov3  min =  621.15  max =  622.29  avg =  621.66\n         yolov4-tiny  min =  803.06  max =  809.19  avg =  805.79\n           nanodet_m  min = 
 220.82  max =  221.18  avg =  221.06\n    yolo-fastest-1.1  min =  102.59  max =  103.98  avg =  102.93\n      yolo-fastestv2  min =   89.61  max =   90.03  avg =   89.76\n  vision_transformer  min = 15862.96  max = 15897.17  avg = 15878.22\n          FastestDet  min =  108.69  max =  109.00  avg =  108.84\n\nuser@starfive:~/Downloads/ncnn-master/benchmark$ ./benchncnn 10 4 1 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =  148.62  max =  148.95  avg =  148.82\n     squeezenet_int8  min = 1324.10  max = 1339.58  avg = 1332.57\n           mobilenet  min =  255.67  max =  256.20  avg =  255.93\n      mobilenet_int8  min = 2024.72  max = 2028.23  avg = 2026.29\n        mobilenet_v2  min =  173.76  max =  174.73  avg =  174.31\n        mobilenet_v3  min =  166.66  max =  167.28  avg =  166.99\n          shufflenet  min =   91.18  max =   91.68  avg =   91.46\n       shufflenet_v2  min =   83.88  max =   84.84  avg =   84.26\n             mnasnet  min =  190.23  max =  190.84  avg =  190.45\n     proxylessnasnet  min =  226.02  max =  226.82  avg =  226.38\n     efficientnet_b0  min =  342.95  max =  343.52  avg =  343.25\n   efficientnetv2_b0  min =  343.07  max =  343.80  avg =  343.39\n        regnety_400m  min =  226.96  max =  227.62  avg =  227.24\n           blazeface  min =   26.08  max =   26.32  avg =   26.18\n           googlenet  min =  508.30  max =  510.34  avg =  509.27\n      googlenet_int8  min = 3825.65  max = 3858.90  avg = 3833.79\n            resnet18  min =  400.69  max =  403.18  avg =  401.74\n       resnet18_int8  min = 4055.41  max = 4123.79  avg = 4067.55\n             alexnet  min =  296.35  max =  300.46  avg =  299.11\n               vgg16  min = 2337.68  max = 2349.78  avg = 2344.77\n          vgg16_int8  min = 36760.47  max = 36985.40  avg = 36918.31\n            resnet50  min = 1190.13  max = 1221.98  avg = 1196.77\n       resnet50_int8  min = 11816.03  max = 11869.41 
 avg = 11843.72\n      squeezenet_ssd  min =  351.24  max =  352.20  avg =  351.89\n squeezenet_ssd_int8  min = 2873.40  max = 2902.55  avg = 2891.58\n       mobilenet_ssd  min =  530.45  max =  531.85  avg =  530.91\n  mobilenet_ssd_int8  min = 4504.87  max = 4564.64  avg = 4528.56\n      mobilenet_yolo  min = 1357.83  max = 1360.48  avg = 1358.75\n  mobilenetv2_yolov3  min =  621.00  max =  621.76  avg =  621.35\n         yolov4-tiny  min =  803.54  max =  808.00  avg =  806.16\n           nanodet_m  min =  221.08  max =  222.57  avg =  221.72\n    yolo-fastest-1.1  min =  102.79  max =  103.15  avg =  102.95\n      yolo-fastestv2  min =   89.56  max =   89.79  avg =   89.70\n  vision_transformer  min = 15874.12  max = 15907.97  avg = 15883.26\n          FastestDet  min =  108.22  max =  108.64  avg =  108.36\n\nuser@starfive:~/Downloads/ncnn-master/benchmark$ ./benchncnn 10 1 1 0 0\n[0 PowerVR B-Series BXE-4-32]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 PowerVR B-Series BXE-4-32]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 PowerVR B-Series BXE-4-32]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 PowerVR B-Series BXE-4-32]  subgroup=1  basic/vote/ballot/shuffle=1/1/1/1\n[0 PowerVR B-Series BXE-4-32]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 10\nnum_threads = 1\npowersave = 1\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =  355.26  max =  356.42  avg =  355.75\n     squeezenet_int8  min = 5171.49  max = 5187.42  avg = 5178.45\n           mobilenet  min =  757.04  max =  762.74  avg =  759.77\n      mobilenet_int8  min = 7695.03  max = 7715.39  avg = 7705.16\n        mobilenet_v2  min =  476.20  max =  477.19  avg =  476.94\n        mobilenet_v3  min =  403.12  max =  405.44  avg =  405.09\n          shufflenet  min =  181.02  max =  182.32  avg =  181.96\n       shufflenet_v2  min =  257.29  max =  259.06  avg =  258.57\n             mnasnet  min =  495.78  max =  497.44  avg =  496.89\n     proxylessnasnet  min =  562.60  max =  563.02 
 avg =  562.83\n     efficientnet_b0  min =  660.29  max =  664.73  avg =  662.97\n   efficientnetv2_b0  min =  856.88  max =  864.96  avg =  861.30\n        regnety_400m  min =  492.79  max =  495.44  avg =  494.51\n           blazeface  min =   65.95  max =   68.72  avg =   68.19\n           googlenet  min = 1132.70  max = 1134.65  avg = 1133.50\n      googlenet_int8  min = 14978.60  max = 15000.89  avg = 14988.56\n            resnet18  min = 1155.15  max = 1172.06  avg = 1160.64\n       resnet18_int8  min = 15776.36  max = 15790.48  avg = 15782.76\n             alexnet  min =  601.09  max =  606.63  avg =  603.81\n               vgg16  min = 5558.47  max = 5613.23  avg = 5586.98\n          vgg16_int8  min = 143936.04  max = 144068.45  avg = 143991.58\n            resnet50  min = 3425.81  max = 3440.51  avg = 3434.73\n       resnet50_int8  min = 44780.92  max = 45144.97  avg = 45038.46\n      squeezenet_ssd  min =  967.46  max =  978.39  avg =  972.76\n squeezenet_ssd_int8  min = 10842.39  max = 10999.00  avg = 10940.15\n       mobilenet_ssd  min = 1565.15  max = 1570.11  avg = 1568.87\n  mobilenet_ssd_int8  min = 17317.40  max = 17386.46  avg = 17361.80\n      mobilenet_yolo  min = 3559.36  max = 3570.38  avg = 3568.84\n  mobilenetv2_yolov3  min = 1731.98  max = 1739.52  avg = 1735.33\n         yolov4-tiny  min = 1984.22  max = 2001.65  avg = 1993.20\n           nanodet_m  min =  603.06  max =  609.65  avg =  607.79\n    yolo-fastest-1.1  min =  306.30  max =  312.33  avg =  310.63\n      yolo-fastestv2  min =  201.45  max =  207.44  avg =  205.93\n  vision_transformer  min = 27310.74  max = 27358.54  avg = 27327.23\n          FastestDet  min =  245.07  max =  248.81  avg =  248.14\n```\n### T-Head TH1520 (C910V, 1.848 GHz x 4 + BXM-4-64 PowerVR)\n\nTested on `Linux anolis-riscv 5.10.112-00579-g8e3db308d5a5 #23 SMP PREEMPT Fri Aug 12 10:17:32 CST 2022 riscv64 riscv64 riscv64 GNU/Linux`\n\n```\n[root@anolis-riscv benchmark]# ./benchncnn\nsyscall error 
-1\nloop_count = 4\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  187.88  max =  188.82  avg =  188.13\n     squeezenet_int8  min = 2388.26  max = 2446.92  avg = 2411.46\n           mobilenet  min =  321.46  max =  323.34  avg =  322.19\n      mobilenet_int8  min = 2318.93  max = 2458.55  avg = 2400.99\n        mobilenet_v2  min =  214.01  max =  216.00  avg =  215.35\n        mobilenet_v3  min =  247.71  max =  248.18  avg =  247.96\n          shufflenet  min =  155.58  max =  155.85  avg =  155.67\n       shufflenet_v2  min =   99.50  max =   99.75  avg =   99.63\n             mnasnet  min =  261.46  max =  263.83  avg =  262.53\n     proxylessnasnet  min =  315.40  max =  316.89  avg =  316.28\n     efficientnet_b0  min =  484.97  max =  486.16  avg =  485.55\n   efficientnetv2_b0  min =  453.03  max =  453.40  avg =  453.21\n        regnety_400m  min =  314.09  max =  315.33  avg =  314.77\n           blazeface  min =   46.14  max =   46.69  avg =   46.39\n           googlenet  min =  650.99  max =  653.60  avg =  651.69\n      googlenet_int8  min = 5435.11  max = 6391.98  avg = 6012.81\n            resnet18  min =  505.48  max =  506.70  avg =  506.06\n       resnet18_int8  min = 5053.33  max = 6599.94  avg = 6001.86\n             alexnet  min =  403.68  max =  404.60  avg =  404.23\n               vgg16  min = 2731.55  max = 2746.48  avg = 2738.82\n```\n\nTested on `Beaglev-ahead(Linux ahead 5.10.113-ahead #2023.08.02.13.12+2c2096a98 SMP PREEMPT Wed Aug 2 13:13:02 UTC 2 riscv64 GNU/Linux)`\n\n```\ndebian@ahead:~/ncnn/build/benchmark$ sudo ./benchncnn 10 1 0 0 0\n[0 PowerVR B-Series BXM-4-64]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 PowerVR B-Series BXM-4-64]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 PowerVR B-Series BXM-4-64]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 PowerVR B-Series BXM-4-64]  subgroup=1  basic/vote/ballot/shuffle=1/1/1/1\n[0 PowerVR B-Series BXM-4-64]  
fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =  287.88  max =  296.84  avg =  295.68\n     squeezenet_int8  min = 2289.46  max = 2320.97  avg = 2306.60\n           mobilenet  min =  584.32  max =  588.48  avg =  587.41\n      mobilenet_int8  min = 2487.91  max = 2492.12  avg = 2489.64\n        mobilenet_v2  min =  380.02  max =  386.67  avg =  385.75\n        mobilenet_v3  min =  314.73  max =  328.84  avg =  325.76\n          shufflenet  min =  146.96  max =  158.29  avg =  156.38\n       shufflenet_v2  min =  203.94  max =  211.77  avg =  210.82\n             mnasnet  min =  395.80  max =  404.95  avg =  403.80\n     proxylessnasnet  min =  447.74  max =  456.89  avg =  454.87\n     efficientnet_b0  min =  532.23  max =  543.05  avg =  538.53\n   efficientnetv2_b0  min =  659.43  max =  681.64  avg =  669.13\n        regnety_400m  min =  393.16  max =  407.27  avg =  403.81\n           blazeface  min =   50.41  max =   61.83  avg =   56.92\n           googlenet  min =  890.79  max =  898.09  avg =  896.25\n      googlenet_int8  min = 4713.76  max = 5296.61  avg = 5044.39\n            resnet18  min =  814.16  max =  824.53  avg =  820.35\n       resnet18_int8  min = 4800.73  max = 6015.34  avg = 5765.47\n             alexnet  min =  453.80  max =  465.51  avg =  462.11\n               vgg16  min = 4016.26  max = 4027.30  avg = 4021.94\n          vgg16_int8  min = 55069.69  max = 64814.86  avg = 59096.20\n            resnet50  min = 2494.42  max = 2502.38  avg = 2500.28\n       resnet50_int8  min = 15366.90  max = 17179.36  avg = 16701.20\n      squeezenet_ssd  min =  724.36  max =  738.28  avg =  730.44\n squeezenet_ssd_int8  min = 4550.62  max = 5235.87  avg = 4684.19\n       mobilenet_ssd  min = 1207.04  max = 1218.80  avg = 1212.86\n  mobilenet_ssd_int8  min = 6019.61  max = 6349.35  avg = 6184.49\n      mobilenet_yolo  min = 2736.28  max = 2747.06  avg = 
2743.21\n  mobilenetv2_yolov3  min = 1339.16  max = 1349.46  avg = 1344.81\n         yolov4-tiny  min = 1457.05  max = 1459.04  avg = 1457.81\n           nanodet_m  min =  443.40  max =  444.58  avg =  444.00\n    yolo-fastest-1.1  min =  240.39  max =  248.05  avg =  247.04\n      yolo-fastestv2  min =  162.71  max =  173.30  avg =  169.39\n  vision_transformer  min = 17148.14  max = 17250.66  avg = 17202.60\n          FastestDet  min =  199.71  max =  200.38  avg =  199.90\n```\n\n### CVITEK SG2000 (C906, 1 GHz x 1 + 700MHz x 1)\n```\n[root@milkv-duo]~/ncnn# ./benchncnn 4 1 2 -1 0\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =  221.53  max =  229.14  avg =  225.53\n     squeezenet_int8  min = 8153.49  max = 8163.26  avg = 8160.17\n           mobilenet  min =  329.60  max =  338.58  avg =  335.00\n      mobilenet_int8  min = 12725.12  max = 12733.70  avg = 12728.52\n        mobilenet_v2  min =  253.83  max =  260.60  avg =  257.20\n        mobilenet_v3  min =  205.51  max =  212.72  avg =  209.26\n          shufflenet  min =  358.73  max =  367.05  avg =  364.52\n       shufflenet_v2  min =  238.44  max =  246.05  avg =  242.09\n             mnasnet  min =  254.39  max =  258.26  avg =  255.63\n     proxylessnasnet  min =  294.99  max =  302.80  avg =  300.65\n        regnety_400m  min =  407.72  max =  409.69  avg =  409.03\n           blazeface  min =  117.08  max =  124.26  avg =  119.00\n           googlenet  min =  817.28  max =  824.70  avg =  820.70\n      googlenet_int8  min = 18246.97  max = 18276.23  avg = 18261.11\n            resnet18  min =  610.81  max =  618.87  avg =  613.91\n       resnet18_int8  min = 18772.96  max = 18808.53  avg = 18786.88\n             alexnet  min =  568.11  max =  577.02  avg =  570.66\n      squeezenet_ssd  min =  890.76  max =  896.30  avg =  893.57\n squeezenet_ssd_int8  min = 31680.48  max = 31938.09  avg = 31810.68\n       mobilenet_ssd  min =  746.38  max 
=  762.07  avg =  752.19\n  mobilenet_ssd_int8  min = 41140.62  max = 41540.85  avg = 41356.70\n      mobilenet_yolo  min = 1744.59  max = 1755.90  avg = 1750.05\n  mobilenetv2_yolov3  min =  890.20  max =  897.86  avg =  895.14\n         yolov4-tiny  min = 1056.03  max = 1059.44  avg = 1058.21\n           nanodet_m  min =  547.85  max =  554.80  avg =  549.81\n    yolo-fastest-1.1  min =  290.89  max =  298.31  avg =  296.24\n      yolo-fastestv2  min =  188.59  max =  196.79  avg =  190.96\n          FastestDet  min =  196.19  max =  205.96  avg =  200.99\n```\n\n### Rockchip RK3588 (Quad Core A76 2.4GHz + Quad Core A55 1.8GHz)\nTested on ROCK5 MODEL B\n\n```\nrock@rock-5b:~/ncnn/build/benchmark$ ./benchncnn  10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   15.22  max =   16.03  avg =   15.70\n     squeezenet_int8  min =   16.77  max =   16.96  avg =   16.86\n           mobilenet  min =   23.07  max =   23.58  avg =   23.36\n      mobilenet_int8  min =   18.58  max =   18.90  avg =   18.72\n        mobilenet_v2  min =   18.74  max =   19.10  avg =   18.96\n        mobilenet_v3  min =   14.40  max =   14.65  avg =   14.50\n          shufflenet  min =    9.74  max =    9.88  avg =    9.84\n       shufflenet_v2  min =    9.44  max =    9.55  avg =    9.50\n             mnasnet  min =   14.73  max =   15.03  avg =   14.87\n     proxylessnasnet  min =   18.37  max =   18.59  avg =   18.46\n     efficientnet_b0  min =   29.11  max =   30.18  avg =   29.63\n   efficientnetv2_b0  min =   46.40  max =   46.95  avg =   46.76\n        regnety_400m  min =   19.18  max =   19.39  avg =   19.28\n           blazeface  min =    5.16  max =    5.23  avg =    5.20\n           googlenet  min =   64.64  max =   65.33  avg =   65.00\n      googlenet_int8  min =   61.86  max =   63.41  avg =   62.42\n            resnet18  min =   42.00  max =   43.34  avg =   42.48\n       resnet18_int8  min =   67.22  max =   
67.80  avg =   67.45\n             alexnet  min =   57.65  max =   58.21  avg =   58.01\n               vgg16  min =  192.35  max =  193.36  avg =  192.84\n          vgg16_int8  min =  570.86  max =  578.81  avg =  574.50\n            resnet50  min =  107.86  max =  109.52  avg =  108.70\n       resnet50_int8  min =  134.41  max =  135.86  avg =  135.18\n      squeezenet_ssd  min =   40.85  max =   41.24  avg =   41.02\n squeezenet_ssd_int8  min =   52.23  max =   53.70  avg =   52.54\n       mobilenet_ssd  min =   45.11  max =   45.50  avg =   45.32\n  mobilenet_ssd_int8  min =   36.53  max =   36.63  avg =   36.59\n      mobilenet_yolo  min =   95.18  max =   96.79  avg =   95.90\n  mobilenetv2_yolov3  min =   65.50  max =   65.88  avg =   65.72\n         yolov4-tiny  min =   86.13  max =   88.84  avg =   87.29\n           nanodet_m  min =   22.57  max =   22.87  avg =   22.74\n    yolo-fastest-1.1  min =    9.23  max =    9.35  avg =    9.29\n      yolo-fastestv2  min =    8.62  max =    8.83  avg =    8.73\n  vision_transformer  min = 3077.54  max = 3396.13  avg = 3339.58\n          FastestDet  min =    9.11  max =    9.30  avg =    9.20\n\nrock@rock-5b:~/ncnn/build/benchmark$ ./benchncnn  10 8 0 -1 0\nloop_count = 10\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   10.02  max =   11.01  avg =   10.43\n     squeezenet_int8  min =   11.78  max =   13.77  avg =   12.55\n           mobilenet  min =   12.75  max =   13.58  avg =   13.12\n      mobilenet_int8  min =   12.23  max =   14.29  avg =   13.54\n        mobilenet_v2  min =   12.76  max =   14.27  avg =   13.40\n        mobilenet_v3  min =    9.51  max =    9.81  avg =    9.71\n          shufflenet  min =    7.06  max =    7.23  avg =    7.13\n       shufflenet_v2  min =    6.21  max =    7.32  avg =    6.38\n             mnasnet  min =    9.32  max =   12.49  avg =   10.75\n     proxylessnasnet  min =   13.79  max =   15.51  avg =   14.70\n     
efficientnet_b0  min =   16.59  max =   17.99  avg =   17.08\n   efficientnetv2_b0  min =   28.26  max =   32.26  avg =   30.52\n        regnety_400m  min =   13.43  max =   15.00  avg =   13.72\n           blazeface  min =    3.87  max =    7.38  avg =    5.65\n           googlenet  min =   29.18  max =   44.00  avg =   36.31\n      googlenet_int8  min =   31.14  max =   37.48  avg =   34.58\n            resnet18  min =   21.47  max =   24.40  avg =   22.35\n       resnet18_int8  min =   26.68  max =   29.89  avg =   28.45\n             alexnet  min =   29.35  max =   38.09  avg =   31.65\n               vgg16  min =  112.37  max =  122.94  avg =  117.05\n          vgg16_int8  min =  161.08  max =  215.29  avg =  176.89\n            resnet50  min =   54.54  max =   57.50  avg =   55.71\n       resnet50_int8  min =   54.76  max =   65.05  avg =   60.59\n      squeezenet_ssd  min =   26.21  max =   35.05  avg =   30.76\n squeezenet_ssd_int8  min =   33.34  max =   40.88  avg =   36.19\n       mobilenet_ssd  min =   26.71  max =   28.85  avg =   27.88\n  mobilenet_ssd_int8  min =   22.03  max =   25.31  avg =   24.21\n      mobilenet_yolo  min =   60.51  max =   74.65  avg =   65.45\n  mobilenetv2_yolov3  min =   37.27  max =   44.13  avg =   41.20\n         yolov4-tiny  min =   49.84  max =   58.12  avg =   53.93\n           nanodet_m  min =   16.54  max =   22.41  avg =   20.60\n    yolo-fastest-1.1  min =    8.49  max =   13.50  avg =    9.91\n      yolo-fastestv2  min =    6.28  max =   11.22  avg =    8.00\n  vision_transformer  min =  968.62  max = 1063.47  avg = 1019.12\n          FastestDet  min =    6.14  max =   11.92  avg =    7.85\n\nrock@rock-5b:~/ncnn/build/benchmark$ ./benchncnn 10 4 2 -1 0\nloop_count = 10\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.78  max =    7.27  avg =    7.07\n     squeezenet_int8  min =    4.58  max =    4.73  avg =    4.63\n           mobilenet  min =    5.67  max =    
5.78  avg =    5.72\n      mobilenet_int8  min =    5.01  max =    5.20  avg =    5.15\n        mobilenet_v2  min =    5.44  max =    5.76  avg =    5.50\n        mobilenet_v3  min =    4.67  max =    5.03  avg =    4.74\n          shufflenet  min =    4.22  max =    4.30  avg =    4.27\n       shufflenet_v2  min =    3.48  max =    3.60  avg =    3.53\n             mnasnet  min =    4.52  max =    4.83  avg =    4.61\n     proxylessnasnet  min =    5.44  max =    6.01  avg =    5.56\n     efficientnet_b0  min =    8.33  max =    8.52  avg =    8.41\n   efficientnetv2_b0  min =   12.95  max =   13.08  avg =   13.02\n        regnety_400m  min =    8.60  max =    8.73  avg =    8.66\n           blazeface  min =    1.86  max =    1.95  avg =    1.90\n           googlenet  min =   16.58  max =   16.85  avg =   16.65\n      googlenet_int8  min =   16.99  max =   17.13  avg =   17.06\n            resnet18  min =   14.98  max =   15.30  avg =   15.08\n       resnet18_int8  min =   20.10  max =   20.22  avg =   20.15\n             alexnet  min =   19.78  max =   20.21  avg =   19.87\n               vgg16  min =   66.35  max =   94.16  avg =   75.24\n          vgg16_int8  min =  131.02  max =  131.98  avg =  131.51\n            resnet50  min =   28.07  max =   28.78  avg =   28.28\n       resnet50_int8  min =   33.56  max =   35.53  avg =   33.84\n      squeezenet_ssd  min =   16.40  max =   16.80  avg =   16.49\n squeezenet_ssd_int8  min =   18.64  max =   19.00  avg =   18.76\n       mobilenet_ssd  min =   13.66  max =   13.78  avg =   13.72\n  mobilenet_ssd_int8  min =   11.23  max =   11.42  avg =   11.33\n      mobilenet_yolo  min =   30.76  max =   31.03  avg =   30.86\n  mobilenetv2_yolov3  min =   19.28  max =   21.07  avg =   20.30\n         yolov4-tiny  min =   33.44  max =   37.68  avg =   34.70\n           nanodet_m  min =    8.28  max =    8.55  avg =    8.38\n    yolo-fastest-1.1  min =    4.30  max =    4.40  avg =    4.34\n      yolo-fastestv2  min =    4.07 
 max =    4.18  avg =    4.13\n  vision_transformer  min =  815.67  max =  819.27  avg =  817.49\n          FastestDet  min =    4.34  max =    7.47  avg =    5.18\n```\n\n### AWS c5.4xlarge Instance\n\n- OS: Ubuntu 20.04.6 LTS x86_64\n- CPU: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz\n- Compiler: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)\n- ncnn tag: 20240102\n\n```\nloop_count = 4\nnum_threads = 8\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    3.31  max =    3.33  avg =    3.32\n     squeezenet_int8  min =    3.87  max =    4.34  avg =    4.07\n           mobilenet  min =    3.12  max =    3.20  avg =    3.17\n      mobilenet_int8  min =    3.32  max =    3.45  avg =    3.38\n        mobilenet_v2  min =    4.23  max =    4.43  avg =    4.33\n        mobilenet_v3  min =    3.82  max =    3.92  avg =    3.87\n          shufflenet  min =    3.67  max =    3.72  avg =    3.69\n       shufflenet_v2  min =    4.08  max =    4.22  avg =    4.15\n             mnasnet  min =    3.62  max =    3.69  avg =    3.64\n     proxylessnasnet  min =    4.29  max =    4.59  avg =    4.37\n     efficientnet_b0  min =    5.32  max =    5.64  avg =    5.50\n   efficientnetv2_b0  min =    6.81  max =    6.88  avg =    6.85\n        regnety_400m  min =    9.71  max =    9.77  avg =    9.74\n           blazeface  min =    1.71  max =    2.57  avg =    2.10\n           googlenet  min =   10.00  max =   10.09  avg =   10.05\n      googlenet_int8  min =    8.76  max =    8.79  avg =    8.77\n            resnet18  min =    6.55  max =    6.91  avg =    6.70\n       resnet18_int8  min =    5.63  max =    5.95  avg =    5.81\n             alexnet  min =    4.88  max =    4.91  avg =    4.89\n               vgg16  min =   36.99  max =   37.04  avg =   37.01\n          vgg16_int8  min =   28.13  max =   28.57  avg =   28.31\n            resnet50  min =   13.99  max =   14.13  avg =   14.06\n       resnet50_int8  min =   12.49  max =   12.56  avg =   
12.53\n      squeezenet_ssd  min =    9.93  max =   10.04  avg =    9.98\n squeezenet_ssd_int8  min =    9.51  max =    9.70  avg =    9.59\n       mobilenet_ssd  min =    6.60  max =    6.63  avg =    6.61\n  mobilenet_ssd_int8  min =    6.95  max =    7.10  avg =    7.02\n      mobilenet_yolo  min =   18.28  max =   18.44  avg =   18.35\n  mobilenetv2_yolov3  min =   13.26  max =   13.39  avg =   13.32\n         yolov4-tiny  min =   25.14  max =   25.58  avg =   25.37\n           nanodet_m  min =    7.71  max =    7.77  avg =    7.75\n    yolo-fastest-1.1  min =    4.69  max =    4.96  avg =    4.81\n      yolo-fastestv2  min =    4.84  max =    5.17  avg =    5.01\n  vision_transformer  min =  139.34  max =  140.38  avg =  139.96\n          FastestDet  min =    4.95  max =    5.12  avg =    5.06\n```\n\n### Hyper-V Linux Guest with GPU-PV enabled (Intel Core i7-11800H, NVIDIA GeForce RTX 3070 Laptop GPU)\n\n- Host OS: Microsoft Windows 11 Enterprise (10.0.22621.1635)\n- Guest OS: openSUSE Tumbleweed x86_64 20230507\n- Mesa 3D source tree: https://gitlab.freedesktop.org/mesa/mesa/-/tree/ce6430067613e3e64cabf79918a3d96122b0c4c4\n- Mesa 3D configuration command\n  > meson --prefix=\"${PWD}/build/install\" -D gallium-drivers=swrast,d3d12 -D vulkan-drivers=swrast,microsoft-experimental build/\n- ncnn configuration command\n  > cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_TESTS=ON ..\n\n```\nmouri@MouriVM-openSUSE:~/Workspace/ncnn/benchmark> VK_ICD_FILENAMES=/home/mouri/Workspace/mesa/build/install/share/vulkan/icd.d/dzn_icd.x86_64.json ./../build/benchmark/benchncnn 10 1 0 0 0\nWARNING: dzn is not a conformant Vulkan implementation, testing use only.\nWARNING: dzn is not a conformant Vulkan implementation, testing use only.\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 3070 Laptop GPU)]  queueC=1[8]  queueG=0[4]  queueT=2[1]\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 3070 Laptop GPU)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 
3070 Laptop GPU)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/0/0\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 3070 Laptop GPU)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  queueC=1[8]  queueG=0[4]  queueT=2[1]\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/0/0\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  subgroup=16  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   52.30  max =   65.51  avg =   56.65\n     squeezenet_int8  min =   14.53  max =   15.55  avg =   14.88\n           mobilenet  min =   37.42  max =   52.07  avg =   42.48\n      mobilenet_int8  min =   19.01  max =   19.82  avg =   19.46\n        mobilenet_v2  min =   55.34  max =   73.39  avg =   63.94\n        mobilenet_v3  min =   97.02  max =  123.14  avg =  109.90\n          shufflenet  min =   72.75  max =  100.26  avg =   88.26\n       shufflenet_v2  min =   93.34  max =  119.64  avg =  105.76\n             mnasnet  min =   63.49  max =   74.11  avg =   69.05\n     proxylessnasnet  min =   65.87  max =   83.87  avg =   76.33\n     efficientnet_b0  min =  162.86  max =  210.51  avg =  184.03\n   efficientnetv2_b0  min =  200.88  max =  220.40  avg =  210.85\n        regnety_400m  min =  106.92  max =  134.68  avg =  123.04\n           blazeface  min =   58.64  max =   66.50  avg =   60.54\n           googlenet  min =  117.34  max =  145.28  avg =  134.84\n      googlenet_int8  min =   62.50  max =   65.07  avg =   63.44\n            resnet18  min =   67.30  max =   92.40  avg =   80.23\n       resnet18_int8  min =   56.09  max =   58.40  avg =   56.97\n             alexnet  min =   29.94  max =   47.51  avg =   38.83\n               vgg16  min =   59.72  max =   73.08  avg =   65.46\n          vgg16_int8  min =  136.35  
max =  148.39  avg =  143.96\n            resnet50  min =  115.92  max =  152.34  avg =  129.64\n       resnet50_int8  min =   93.86  max =  101.51  avg =   97.96\n      squeezenet_ssd  min =  139.82  max =  149.15  avg =  144.78\n squeezenet_ssd_int8  min =   32.09  max =   35.96  avg =   33.41\n       mobilenet_ssd  min =   88.14  max =  102.62  avg =   97.79\n  mobilenet_ssd_int8  min =   33.93  max =   36.42  avg =   34.41\n      mobilenet_yolo  min =   52.22  max =   65.25  avg =   58.81\n  mobilenetv2_yolov3  min =   75.09  max =   94.12  avg =   85.23\n         yolov4-tiny  min =   73.27  max =   88.69  avg =   81.44\n           nanodet_m  min =  110.98  max =  150.70  avg =  127.60\n    yolo-fastest-1.1  min =  104.72  max =  135.40  avg =  116.92\n      yolo-fastestv2  min =  113.84  max =  142.19  avg =  128.24\n  vision_transformer  min =  412.19  max =  474.25  avg =  444.15\n          FastestDet  min =   96.31  max =  131.51  avg =  117.27\nmouri@MouriVM-openSUSE:~/Workspace/ncnn/benchmark> VK_ICD_FILENAMES=/home/mouri/Workspace/mesa/build/install/share/vulkan/icd.d/dzn_icd.x86_64.json ./../build/benchmark/benchncnn 10 1 0 1 0\nWARNING: dzn is not a conformant Vulkan implementation, testing use only.\nWARNING: dzn is not a conformant Vulkan implementation, testing use only.\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 3070 Laptop GPU)]  queueC=1[8]  queueG=0[4]  queueT=2[1]\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 3070 Laptop GPU)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 3070 Laptop GPU)]  fp16-p/s/a=1/1/1  int8-p/s/a=1/0/0\n[0 Microsoft Direct3D12 (NVIDIA GeForce RTX 3070 Laptop GPU)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  queueC=1[8]  queueG=0[4]  queueT=2[1]\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  fp16-p/s/a=1/1/1  
int8-p/s/a=1/0/0\n[1 Microsoft Direct3D12 (Intel(R) UHD Graphics)]  subgroup=16  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 1\ncooling_down = 0\n          squeezenet  min =   36.86  max =   62.04  avg =   44.48\n     squeezenet_int8  min =   15.31  max =   16.14  avg =   15.63\n           mobilenet  min =   30.79  max =   34.67  avg =   32.95\n      mobilenet_int8  min =   19.23  max =   19.72  avg =   19.42\n        mobilenet_v2  min =   36.56  max =   40.53  avg =   38.20\n        mobilenet_v3  min =   52.11  max =   61.72  avg =   56.58\n          shufflenet  min =   41.50  max =   74.61  avg =   49.24\n       shufflenet_v2  min =   44.49  max =   52.30  avg =   49.04\n             mnasnet  min =   35.66  max =   43.45  avg =   37.98\n     proxylessnasnet  min =   41.27  max =   47.63  avg =   43.63\n     efficientnet_b0  min =   67.66  max =   80.88  avg =   73.64\n   efficientnetv2_b0  min =  111.10  max =  156.52  avg =  126.70\n        regnety_400m  min =   62.66  max =   89.16  avg =   68.99\n           blazeface  min =   24.86  max =   33.52  avg =   26.91\n           googlenet  min =   70.55  max =   84.22  avg =   75.19\n      googlenet_int8  min =   58.78  max =   64.81  avg =   62.99\n            resnet18  min =   44.17  max =   49.37  avg =   46.73\n       resnet18_int8  min =   59.99  max =   66.91  avg =   62.35\n             alexnet  min =   41.54  max =   57.16  avg =   44.30\n               vgg16  min =  138.74  max =  165.03  avg =  146.90\n          vgg16_int8  min =  135.36  max =  165.89  avg =  142.61\n            resnet50  min =   97.46  max =  107.18  avg =  100.89\n       resnet50_int8  min =   92.90  max =  100.45  avg =   95.91\n      squeezenet_ssd  min =   72.27  max =   90.71  avg =   76.09\n squeezenet_ssd_int8  min =   34.66  max =   40.46  avg =   36.58\n       mobilenet_ssd  min =   59.90  max =   68.74  avg =   62.40\n  mobilenet_ssd_int8  min =   37.02  max =   38.59  avg 
=   37.82\n      mobilenet_yolo  min =   73.19  max =   80.40  avg =   76.42\n  mobilenetv2_yolov3  min =   58.56  max =   66.71  avg =   62.02\n         yolov4-tiny  min =   63.75  max =   84.29  avg =   69.54\n           nanodet_m  min =   54.66  max =   67.89  avg =   60.82\n    yolo-fastest-1.1  min =   40.89  max =   51.03  avg =   43.15\n      yolo-fastestv2  min =   50.43  max =   77.46  avg =   60.66\n  vision_transformer  min = 1330.82  max = 1388.73  avg = 1354.10\n          FastestDet  min =   85.75  max =  112.67  avg =   98.62\nmouri@MouriVM-openSUSE:~/Workspace/ncnn/benchmark> VK_ICD_FILENAMES=/home/mouri/Workspace/mesa/build/install/share/vulkan/icd.d/dzn_icd.x86_64.json ./../build/benchmark/benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    6.30  max =   10.16  avg =    8.21\n     squeezenet_int8  min =   14.53  max =   14.94  avg =   14.67\n           mobilenet  min =   10.71  max =   11.26  avg =   10.91\n      mobilenet_int8  min =   17.66  max =   18.46  avg =   17.91\n        mobilenet_v2  min =    7.74  max =    8.05  avg =    7.89\n        mobilenet_v3  min =    6.25  max =    6.70  avg =    6.38\n          shufflenet  min =    3.78  max =    7.87  avg =    5.37\n       shufflenet_v2  min =    4.19  max =    7.83  avg =    5.25\n             mnasnet  min =    7.29  max =    7.61  avg =    7.44\n     proxylessnasnet  min =    8.10  max =    8.43  avg =    8.24\n     efficientnet_b0  min =   11.77  max =   12.66  avg =   12.06\n   efficientnetv2_b0  min =   13.80  max =   15.02  avg =   14.11\n        regnety_400m  min =   10.09  max =   10.26  avg =   10.17\n           blazeface  min =    1.24  max =    4.02  avg =    2.45\n           googlenet  min =   24.05  max =   25.78  avg =   24.64\n      googlenet_int8  min =   58.75  max =   62.45  avg =   59.54\n            resnet18  min =   20.31  max =   21.48  avg =   20.74\n       resnet18_int8  min =   53.82  
max =   55.27  avg =   54.43\n             alexnet  min =   17.37  max =   18.69  avg =   17.66\n               vgg16  min =  114.49  max =  117.62  avg =  115.96\n          vgg16_int8  min =  133.82  max =  144.40  avg =  137.07\n            resnet50  min =   54.40  max =   58.74  avg =   55.54\n       resnet50_int8  min =   92.95  max =  104.71  avg =   99.18\n      squeezenet_ssd  min =   17.30  max =   18.65  avg =   17.71\n squeezenet_ssd_int8  min =   32.27  max =   33.88  avg =   32.82\n       mobilenet_ssd  min =   24.01  max =   25.94  avg =   25.02\n  mobilenet_ssd_int8  min =   34.68  max =   36.09  avg =   35.43\n      mobilenet_yolo  min =   53.32  max =   63.48  avg =   56.58\n  mobilenetv2_yolov3  min =   30.06  max =   34.24  avg =   31.46\n         yolov4-tiny  min =   41.49  max =   43.55  avg =   42.50\n           nanodet_m  min =   10.24  max =   11.08  avg =   10.43\n    yolo-fastest-1.1  min =    3.85  max =    8.34  avg =    5.40\n      yolo-fastestv2  min =    4.33  max =    7.61  avg =    6.01\n  vision_transformer  min =  556.38  max =  599.49  avg =  567.98\n          FastestDet  min =    4.20  max =   11.37  avg =    6.51\nmouri@MouriVM-openSUSE:~/Workspace/ncnn/benchmark>\n```\n\n### Hyper-V Linux Guest with GPU-PV enabled (Intel Core i7-7700K, NVIDIA GeForce GTX 1050 Ti)\n\n- Host OS: Microsoft Windows 10 Enterprise LTSC 2021 (10.0.19044.2846)\n- Guest OS: openSUSE Tumbleweed x86_64 20230507\n- Mesa 3D source tree: https://gitlab.freedesktop.org/mesa/mesa/-/tree/ce6430067613e3e64cabf79918a3d96122b0c4c4\n- Mesa 3D configuration command\n  > meson --prefix=\"${PWD}/build/install\" -D gallium-drivers=swrast,d3d12 -D vulkan-drivers=swrast,microsoft-experimental build/\n- ncnn configuration command\n  > cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_TESTS=ON ..\n\n```\nmouri@MouriVM-openSUSE:~/Workspace/ncnn/benchmark> VK_ICD_FILENAMES=/home/mouri/Workspace/mesa/build/install/share/vulkan/icd.d/dzn_icd.x86_64.json ./../build/benchmark/benchncnn 10 1 0 
0 0\nWARNING: dzn is not a conformant Vulkan implementation, testing use only.\n[0 Microsoft Direct3D12 (NVIDIA GeForce GTX 1050 Ti)]  queueC=1[8]  queueG=0[4]  queueT=2[1]\n[0 Microsoft Direct3D12 (NVIDIA GeForce GTX 1050 Ti)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Microsoft Direct3D12 (NVIDIA GeForce GTX 1050 Ti)]  fp16-p/s/a=1/0/0  int8-p/s/a=1/0/0\n[0 Microsoft Direct3D12 (NVIDIA GeForce GTX 1050 Ti)]  subgroup=32  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   53.80  max =   64.22  avg =   59.91\n     squeezenet_int8  min =   23.21  max =   25.98  avg =   24.44\n           mobilenet  min =   47.63  max =   55.22  avg =   49.79\n      mobilenet_int8  min =   23.27  max =   25.05  avg =   23.77\n        mobilenet_v2  min =   58.17  max =   83.14  avg =   68.48\n        mobilenet_v3  min =   92.14  max =  114.74  avg =  101.66\n          shufflenet  min =   75.96  max =  106.54  avg =   89.64\n       shufflenet_v2  min =   90.66  max =  114.69  avg =  103.25\n             mnasnet  min =   58.40  max =   85.74  avg =   67.75\n     proxylessnasnet  min =   66.73  max =   84.82  avg =   77.73\n     efficientnet_b0  min =  134.28  max =  164.39  avg =  155.40\n   efficientnetv2_b0  min =  171.97  max =  220.43  avg =  198.26\n        regnety_400m  min =  124.15  max =  145.61  avg =  135.99\n           blazeface  min =   53.18  max =   72.10  avg =   60.21\n           googlenet  min =  119.34  max =  159.93  avg =  134.71\n      googlenet_int8  min =   96.71  max =  102.44  avg =   98.57\n            resnet18  min =   68.14  max =   89.99  avg =   80.76\n       resnet18_int8  min =   88.07  max =  108.62  avg =   91.09\n             alexnet  min =   44.12  max =   51.57  avg =   48.09\n               vgg16  min =   88.49  max =   99.87  avg =   93.42\n          vgg16_int8  min =  196.17  max =  211.99  avg =  201.27\n            resnet50  min =  115.36  
max =  138.65  avg =  125.57\n       resnet50_int8  min =  138.15  max =  148.55  avg =  141.08\n      squeezenet_ssd  min =  138.42  max =  168.49  avg =  155.66\n squeezenet_ssd_int8  min =   46.01  max =   47.83  avg =   46.85\n       mobilenet_ssd  min =   82.39  max =  134.74  avg =  101.22\n  mobilenet_ssd_int8  min =   45.53  max =   46.67  avg =   45.96\n      mobilenet_yolo  min =   70.39  max =   87.83  avg =   80.01\n  mobilenetv2_yolov3  min =   75.71  max =   90.59  avg =   84.04\n         yolov4-tiny  min =   72.16  max =   87.76  avg =   76.81\n           nanodet_m  min =   98.27  max =  129.60  avg =  112.34\n    yolo-fastest-1.1  min =  101.01  max =  118.45  avg =  106.47\n      yolo-fastestv2  min =  109.89  max =  137.23  avg =  123.97\n  vision_transformer  min =  688.60  max =  750.54  avg =  723.30\n          FastestDet  min =  104.16  max =  139.23  avg =  123.75\nmouri@MouriVM-openSUSE:~/Workspace/ncnn/benchmark> VK_ICD_FILENAMES=/home/mouri/Workspace/mesa/build/install/share/vulkan/icd.d/dzn_icd.x86_64.json ./../build/benchmark/benchncnn 10 1 0 -1 0\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    8.90  max =    9.48  avg =    9.15\n     squeezenet_int8  min =   22.54  max =   24.13  avg =   22.85\n           mobilenet  min =   14.85  max =   16.15  avg =   15.18\n      mobilenet_int8  min =   23.56  max =   23.98  avg =   23.74\n        mobilenet_v2  min =   11.03  max =   11.73  avg =   11.22\n        mobilenet_v3  min =    8.61  max =    9.29  avg =    8.79\n          shufflenet  min =    5.26  max =    5.96  avg =    5.42\n       shufflenet_v2  min =    5.56  max =    7.06  avg =    5.82\n             mnasnet  min =   10.46  max =   11.04  avg =   10.68\n     proxylessnasnet  min =   12.18  max =   12.55  avg =   12.33\n     efficientnet_b0  min =   22.46  max =   23.15  avg =   22.86\n   efficientnetv2_b0  min =   23.33  max =   23.80  avg =   23.55\n        
regnety_400m  min =   13.03  max =   14.25  avg =   13.28\n           blazeface  min =    1.49  max =    1.95  avg =    1.61\n           googlenet  min =   35.26  max =   46.31  avg =   39.63\n      googlenet_int8  min =   96.25  max =   98.15  avg =   96.93\n            resnet18  min =   29.34  max =   31.00  avg =   29.92\n       resnet18_int8  min =   87.84  max =   89.85  avg =   88.73\n             alexnet  min =   22.91  max =   23.87  avg =   23.18\n               vgg16  min =  151.26  max =  174.79  avg =  155.94\n          vgg16_int8  min =  193.66  max =  210.63  avg =  199.14\n            resnet50  min =   74.89  max =   77.27  avg =   75.91\n       resnet50_int8  min =  136.59  max =  162.13  avg =  141.22\n      squeezenet_ssd  min =   24.48  max =   34.00  avg =   26.19\n squeezenet_ssd_int8  min =   46.31  max =   48.87  avg =   47.09\n       mobilenet_ssd  min =   31.56  max =   34.45  avg =   32.50\n  mobilenet_ssd_int8  min =   45.15  max =   46.53  avg =   45.93\n      mobilenet_yolo  min =   72.09  max =   78.05  avg =   74.31\n  mobilenetv2_yolov3  min =   40.44  max =   41.54  avg =   40.86\n         yolov4-tiny  min =   56.73  max =   60.59  avg =   57.93\n           nanodet_m  min =   13.22  max =   19.28  avg =   14.65\n    yolo-fastest-1.1  min =    5.47  max =    5.70  avg =    5.58\n      yolo-fastestv2  min =    5.68  max =    7.20  avg =    5.88\n  vision_transformer  min =  600.83  max =  666.35  avg =  617.33\n          FastestDet  min =    6.05  max =    6.72  avg =    6.23\n```\n\n### AMD Ryzen 9 5950X 16-Core of Desktop[2023-10-12]\n```\nE:\\github\\ncnn\\build-ncnn-vs2019\\benchmark\\Release>benchncnn.exe 100 16 0 -1 0\nloop_count = 100\nnum_threads = 16\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    2.68  max =    3.10  avg =    2.77\n     squeezenet_int8  min =    3.57  max =    4.72  avg =    4.04\n           mobilenet  min =    3.09  max =    5.44  avg =    3.38\n      mobilenet_int8  min =  
  2.36  max =    3.40  avg =    2.74\n        mobilenet_v2  min =    4.24  max =    4.81  avg =    4.40\n        mobilenet_v3  min =    3.46  max =    3.93  avg =    3.58\n          shufflenet  min =    3.21  max =    4.54  avg =    4.01\n       shufflenet_v2  min =    2.99  max =    4.49  avg =    3.34\n             mnasnet  min =    3.62  max =    4.31  avg =    3.83\n     proxylessnasnet  min =    4.06  max =    5.70  avg =    4.23\n     efficientnet_b0  min =    5.60  max =    6.55  avg =    5.81\n   efficientnetv2_b0  min =    6.83  max =    8.82  avg =    7.12\n        regnety_400m  min =    8.02  max =    9.75  avg =    8.34\n           blazeface  min =    1.34  max =    1.77  avg =    1.46\n           googlenet  min =   11.62  max =   15.95  avg =   12.70\n      googlenet_int8  min =    7.43  max =   10.06  avg =    7.92\n            resnet18  min =    8.39  max =   10.39  avg =    9.04\n       resnet18_int8  min =    6.23  max =    8.64  avg =    6.75\n             alexnet  min =    7.78  max =   12.51  avg =    8.51\n               vgg16  min =   53.85  max =   63.39  avg =   56.36\n          vgg16_int8  min =   35.61  max =   46.94  avg =   38.08\n            resnet50  min =   18.55  max =   24.46  avg =   19.81\n       resnet50_int8  min =   11.95  max =   23.21  avg =   13.51\n      squeezenet_ssd  min =   10.01  max =   13.16  avg =   10.69\n squeezenet_ssd_int8  min =    9.29  max =   14.02  avg =   10.47\n       mobilenet_ssd  min =    6.38  max =   10.26  avg =    7.15\n  mobilenet_ssd_int8  min =    4.69  max =    6.98  avg =    5.42\n      mobilenet_yolo  min =   17.63  max =   22.59  avg =   19.45\n  mobilenetv2_yolov3  min =   11.79  max =   15.67  avg =   12.76\n         yolov4-tiny  min =   21.53  max =   25.79  avg =   22.46\n           nanodet_m  min =    7.16  max =    9.99  avg =    8.01\n    yolo-fastest-1.1  min =    3.66  max =    5.00  avg =    4.38\n      yolo-fastestv2  min =    3.52  max =    5.20  avg =    4.60\n  
vision_transformer  min =   67.01  max =   93.71  avg =   78.48\n          FastestDet  min =    4.44  max =    8.62  avg =    4.69\n```\n\n### AMD Radeon RX 6900 XT of Desktop[2023-10-12]\n```\nE:\\github\\ncnn\\build-ncnn-vs2019\\benchmark\\Release>benchncnn.exe 100 16 0 0 0\n[0 AMD Radeon RX 6900 XT]  queueC=1[2]  queueG=0[1]  queueT=2[2]\n[0 AMD Radeon RX 6900 XT]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 AMD Radeon RX 6900 XT]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 AMD Radeon RX 6900 XT]  subgroup=64  basic/vote/ballot/shuffle=1/1/1/1\n[0 AMD Radeon RX 6900 XT]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 100\nnum_threads = 16\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    2.19  max =    2.70  avg =    2.47\n     squeezenet_int8  min =    3.94  max =    4.51  avg =    4.18\n           mobilenet  min =    2.03  max =    2.63  avg =    2.28\n      mobilenet_int8  min =    2.56  max =    3.34  avg =    2.69\n        mobilenet_v2  min =    2.29  max =    2.98  avg =    2.62\n        mobilenet_v3  min =    2.31  max =    3.10  avg =    2.75\n          shufflenet  min =    1.89  max =    2.61  avg =    2.30\n       shufflenet_v2  min =    2.17  max =    3.04  avg =    2.59\n             mnasnet  min =    2.19  max =    2.98  avg =    2.69\n     proxylessnasnet  min =    2.12  max =    4.08  avg =    2.62\n     efficientnet_b0  min =    3.62  max =    5.27  avg =    4.21\n   efficientnetv2_b0  min =    6.09  max =    7.15  avg =    6.49\n        regnety_400m  min =    2.55  max =    3.82  avg =    3.00\n           blazeface  min =    1.93  max =    2.56  avg =    2.28\n           googlenet  min =    3.35  max =    4.46  avg =    3.75\n      googlenet_int8  min =    8.02  max =   12.84  avg =    9.15\n            resnet18  min =    2.46  max =    3.14  avg =    2.84\n       resnet18_int8  min =    6.37  max =    9.15  avg =    7.30\n             alexnet  min =    2.31  max =    2.91  avg =    2.69\n               
vgg16  min =    4.76  max =    5.79  avg =    5.24\n          vgg16_int8  min =   35.94  max =   46.27  avg =   39.05\n            resnet50  min =    3.25  max =    4.09  avg =    3.75\n       resnet50_int8  min =   12.04  max =   20.53  avg =   14.61\n      squeezenet_ssd  min =    3.03  max =    5.31  avg =    3.66\n squeezenet_ssd_int8  min =    9.74  max =   13.46  avg =   10.42\n       mobilenet_ssd  min =    2.82  max =    4.75  avg =    3.39\n  mobilenet_ssd_int8  min =    4.67  max =    6.76  avg =    5.30\n      mobilenet_yolo  min =    3.01  max =    3.67  avg =    3.34\n  mobilenetv2_yolov3  min =    4.04  max =    6.46  avg =    4.55\n         yolov4-tiny  min =    5.75  max =    8.05  avg =    6.52\n           nanodet_m  min =   10.16  max =   14.97  avg =   13.11\n    yolo-fastest-1.1  min =    2.36  max =    3.80  avg =    2.88\n      yolo-fastestv2  min =    2.24  max =    3.19  avg =    2.80\n  vision_transformer  min =   20.43  max =   25.06  avg =   21.07\n          FastestDet  min =    2.49  max =    3.18  avg =    2.93\n```\n\n### NVIDIA GeForce RTX 3060 Ti of Desktop[2023-10-12]\n```\nE:\\github\\ncnn\\build-ncnn-vs2019\\benchmark\\Release>benchncnn.exe 100 16 0 0 0\n[0 NVIDIA GeForce RTX 3060 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 NVIDIA GeForce RTX 3060 Ti]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA GeForce RTX 3060 Ti]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA GeForce RTX 3060 Ti]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 NVIDIA GeForce RTX 3060 Ti]  fp16-matrix-16_8_8/16_8_16/16_16_16=1/1/1\n[1 Intel(R) UHD Graphics 770]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[1 Intel(R) UHD Graphics 770]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 Intel(R) UHD Graphics 770]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[1 Intel(R) UHD Graphics 770]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[1 Intel(R) UHD Graphics 770]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 100\nnum_threads = 16\npowersave = 
0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    0.80  max =    2.51  avg =    0.89\n     squeezenet_int8  min =    2.81  max =    3.51  avg =    2.96\n           mobilenet  min =    0.70  max =    0.79  avg =    0.71\n      mobilenet_int8  min =    2.95  max =    3.44  avg =    3.03\n        mobilenet_v2  min =    1.09  max =    1.25  avg =    1.12\n        mobilenet_v3  min =    1.33  max =    2.04  avg =    1.56\n          shufflenet  min =    1.20  max =    1.39  avg =    1.27\n       shufflenet_v2  min =    1.50  max =    1.66  avg =    1.57\n             mnasnet  min =    1.11  max =    1.22  avg =    1.15\n     proxylessnasnet  min =    1.20  max =    1.63  avg =    1.24\n     efficientnet_b0  min =    2.38  max =    3.21  avg =    2.61\n   efficientnetv2_b0  min =    9.16  max =   11.35  avg =    9.63\n        regnety_400m  min =    1.86  max =    2.03  avg =    1.94\n           blazeface  min =    0.70  max =    1.10  avg =    0.76\n           googlenet  min =    2.11  max =    2.40  avg =    2.30\n      googlenet_int8  min =    6.91  max =    7.88  avg =    7.17\n            resnet18  min =    1.14  max =    1.47  avg =    1.19\n       resnet18_int8  min =    4.96  max =    6.82  avg =    5.40\n             alexnet  min =    1.10  max =    1.85  avg =    1.19\n               vgg16  min =    2.27  max =    3.97  avg =    2.46\n          vgg16_int8  min =   19.02  max =   22.20  avg =   20.28\n            resnet50  min =    2.00  max =    2.99  avg =    2.10\n       resnet50_int8  min =   10.66  max =   13.30  avg =   11.29\n      squeezenet_ssd  min =    2.74  max =    3.44  avg =    2.90\n squeezenet_ssd_int8  min =    6.93  max =    7.95  avg =    7.19\n       mobilenet_ssd  min =    1.86  max =    2.07  avg =    1.96\n  mobilenet_ssd_int8  min =    5.92  max =    6.48  avg =    6.09\n      mobilenet_yolo  min =    1.65  max =    2.58  avg =    1.78\n  mobilenetv2_yolov3  min =    3.85  max =    4.11  avg =    3.96\n         
yolov4-tiny  min =    6.54  max =    7.05  avg =    6.69\n           nanodet_m  min =    2.38  max =    3.28  avg =    2.72\n    yolo-fastest-1.1  min =    1.73  max =    2.07  avg =    1.83\n      yolo-fastestv2  min =    1.72  max =    1.92  avg =    1.80\n  vision_transformer  min =   53.91  max =   56.59  avg =   55.27\n          FastestDet  min =    1.48  max =    1.83  avg =    1.69\n```\n\n### Intel(R) UHD Graphics 770 of Desktop[2023-10-12]\n```\nE:\\github\\ncnn\\build-ncnn-vs2019\\benchmark\\Release>benchncnn.exe 100 16 0 1 0\n[0 NVIDIA GeForce RTX 3060 Ti]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 NVIDIA GeForce RTX 3060 Ti]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 NVIDIA GeForce RTX 3060 Ti]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 NVIDIA GeForce RTX 3060 Ti]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 NVIDIA GeForce RTX 3060 Ti]  fp16-matrix-16_8_8/16_8_16/16_16_16=1/1/1\n[1 Intel(R) UHD Graphics 770]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[1 Intel(R) UHD Graphics 770]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 Intel(R) UHD Graphics 770]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[1 Intel(R) UHD Graphics 770]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[1 Intel(R) UHD Graphics 770]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 100\nnum_threads = 16\npowersave = 0\ngpu_device = 1\ncooling_down = 0\n          squeezenet  min =    3.11  max =    4.47  avg =    3.45\n     squeezenet_int8  min =    1.89  max =    2.84  avg =    2.23\n           mobilenet  min =    4.98  max =    5.67  avg =    5.18\n      mobilenet_int8  min =    2.54  max =    3.17  avg =    2.98\n        mobilenet_v2  min =    4.03  max =    4.89  avg =    4.37\n        mobilenet_v3  min =    4.45  max =    5.68  avg =    4.86\n          shufflenet  min =    3.42  max =    4.42  avg =    3.79\n       shufflenet_v2  min =    3.00  max =    4.01  avg =    3.30\n             mnasnet  min =    4.21  max =    5.12  avg =    4.51\n     proxylessnasnet  min =    
4.62  max =    5.64  avg =    4.90\n     efficientnet_b0  min =    7.82  max =    8.63  avg =    8.10\n   efficientnetv2_b0  min =   34.52  max =   36.34  avg =   35.29\n        regnety_400m  min =    6.07  max =    7.31  avg =    6.44\n           blazeface  min =    1.54  max =    1.67  avg =    1.59\n           googlenet  min =   11.53  max =   12.64  avg =   11.89\n      googlenet_int8  min =   13.71  max =   15.52  avg =   14.38\n            resnet18  min =   10.75  max =   12.94  avg =   11.07\n       resnet18_int8  min =    9.04  max =   11.05  avg =    9.53\n             alexnet  min =   13.64  max =   14.37  avg =   13.98\n               vgg16  min =   38.53  max =   40.16  avg =   39.22\n          vgg16_int8  min =   16.04  max =   21.16  avg =   19.35\n            resnet50  min =   25.61  max =   28.22  avg =   26.62\n       resnet50_int8  min =    7.72  max =   12.83  avg =   10.29\n      squeezenet_ssd  min =   10.34  max =   15.88  avg =   14.75\n squeezenet_ssd_int8  min =    4.63  max =    7.13  avg =    5.66\n       mobilenet_ssd  min =   11.35  max =   13.06  avg =   12.44\n  mobilenet_ssd_int8  min =    4.21  max =    6.31  avg =    5.32\n      mobilenet_yolo  min =   20.14  max =   22.92  avg =   21.94\n  mobilenetv2_yolov3  min =   12.58  max =   14.88  avg =   14.21\n         yolov4-tiny  min =   20.62  max =   25.58  avg =   24.39\n           nanodet_m  min =    7.75  max =   12.49  avg =   11.42\n    yolo-fastest-1.1  min =    3.68  max =    6.49  avg =    5.54\n      yolo-fastestv2  min =    4.32  max =    5.39  avg =    4.51\n  vision_transformer  min =  796.51  max =  805.29  avg =  802.39\n          FastestDet  min =    2.89  max =    4.83  avg =    3.95\n```\n\n### Intel® Core™ i7-13700K of Desktop[2023-10-12]\n```\nE:\\github\\ncnn\\build-ncnn-vs2019\\benchmark\\Release>benchncnn.exe 100 16 0 -1 0\nloop_count = 100\nnum_threads = 16\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    1.69  max =    2.63  
avg =    2.12\n     squeezenet_int8  min =    1.83  max =    3.03  avg =    2.26\n           mobilenet  min =    1.69  max =    2.64  avg =    2.24\n      mobilenet_int8  min =    2.47  max =    3.06  avg =    2.84\n        mobilenet_v2  min =    1.94  max =    3.47  avg =    2.47\n        mobilenet_v3  min =    1.49  max =    2.74  avg =    1.87\n          shufflenet  min =    1.57  max =    3.00  avg =    1.82\n       shufflenet_v2  min =    1.41  max =    1.72  avg =    1.51\n             mnasnet  min =    1.73  max =    2.94  avg =    2.13\n     proxylessnasnet  min =    2.08  max =    3.31  avg =    2.69\n     efficientnet_b0  min =    3.20  max =    4.99  avg =    3.78\n   efficientnetv2_b0  min =    3.51  max =    5.16  avg =    4.08\n        regnety_400m  min =    4.51  max =   10.29  avg =    6.18\n           blazeface  min =    0.52  max =    0.92  avg =    0.59\n           googlenet  min =    5.49  max =    7.48  avg =    6.26\n      googlenet_int8  min =    4.83  max =    7.54  avg =    5.90\n            resnet18  min =    4.05  max =    6.61  avg =    4.83\n       resnet18_int8  min =    3.77  max =    5.70  avg =    4.57\n             alexnet  min =    3.60  max =    5.09  avg =    4.26\n               vgg16  min =   25.19  max =   28.79  avg =   26.81\n          vgg16_int8  min =   17.52  max =   21.79  avg =   19.80\n            resnet50  min =    9.23  max =   13.15  avg =   11.34\n       resnet50_int8  min =    7.77  max =   12.00  avg =   10.18\n      squeezenet_ssd  min =    4.33  max =    6.73  avg =    4.96\n squeezenet_ssd_int8  min =    4.77  max =    7.62  avg =    5.71\n       mobilenet_ssd  min =    3.70  max =    6.43  avg =    4.53\n  mobilenet_ssd_int8  min =    4.16  max =    6.53  avg =    5.38\n      mobilenet_yolo  min =   11.27  max =   14.93  avg =   12.90\n  mobilenetv2_yolov3  min =    7.41  max =   11.52  avg =    9.11\n         yolov4-tiny  min =   12.05  max =   18.96  avg =   14.15\n           nanodet_m  min =    3.39  max 
=    5.77  avg =    4.07\n    yolo-fastest-1.1  min =    1.95  max =    3.85  avg =    2.30\n      yolo-fastestv2  min =    1.91  max =    3.52  avg =    2.27\n  vision_transformer  min =   79.50  max =   99.93  avg =   88.91\n          FastestDet  min =    1.92  max =    2.72  avg =    2.19\n```\n\n### Amlogic S805 (Cortex-A5, 4 × 1.536GHz)\n\n- Platform: Xunlei OneCloud (玩客云)\n- OS: Armbian buster (20.12) armv7l\n- Compiler: gcc version 8.3.0 (Debian 8.3.0-6)\n- ncnn tag: 20240102\n\n```\nmizu-bai@aml-s812:~/ncnn-20240102/benchmark$ ../build/benchmark/benchncnn\nloop_count = 4\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  376.45  max =  445.48  avg =  408.08\n     squeezenet_int8  min =  247.06  max =  340.34  avg =  281.40\n           mobilenet  min =  696.71  max =  745.63  avg =  718.49\n      mobilenet_int8  min =  355.78  max =  472.06  avg =  401.17\n        mobilenet_v2  min =  428.86  max =  491.25  avg =  458.45\n        mobilenet_v3  min =  361.78  max =  425.90  avg =  396.94\n          shufflenet  min =  245.90  max =  333.41  avg =  293.46\n       shufflenet_v2  min =  210.69  max =  329.51  avg =  260.73\n             mnasnet  min =  418.49  max =  493.40  avg =  448.95\n     proxylessnasnet  min =  542.20  max =  566.65  avg =  554.75\n     efficientnet_b0  min =  727.72  max =  785.47  avg =  750.72\n   efficientnetv2_b0  min =  805.70  max =  874.57  avg =  843.87\n        regnety_400m  min =  627.74  max =  686.57  avg =  660.60\n           blazeface  min =   62.14  max =  121.32  avg =   82.10\n           googlenet  min = 1295.31  max = 1411.88  avg = 1342.26\n      googlenet_int8  min =  796.39  max =  860.28  avg =  823.76\n            resnet18  min = 1076.93  max = 1125.12  avg = 1099.37\n       resnet18_int8  min =  587.12  max =  634.97  avg =  605.29\n             alexnet  min =  701.70  max =  729.68  avg =  718.99\n               vgg16  min = 5584.13  max = 5748.84  avg = 5660.70\n      
    vgg16_int8  min = 3107.89  max = 3138.78  avg = 3121.28\n            resnet50  min = 3378.84  max = 3461.61  avg = 3425.38\n       resnet50_int8  min = 2044.93  max = 2067.70  avg = 2061.38\n      squeezenet_ssd  min =  908.77  max =  972.68  avg =  939.98\n squeezenet_ssd_int8  min =  609.58  max =  703.88  avg =  662.43\n       mobilenet_ssd  min = 1524.69  max = 1589.79  avg = 1552.12\n  mobilenet_ssd_int8  min =  817.70  max =  885.45  avg =  840.30\n      mobilenet_yolo  min = 3497.13  max = 3605.83  avg = 3543.72\n  mobilenetv2_yolov3  min = 1734.10  max = 1824.98  avg = 1795.42\n         yolov4-tiny  min = 2093.70  max = 2163.44  avg = 2128.30\n           nanodet_m  min =  593.75  max =  647.03  avg =  608.03\n    yolo-fastest-1.1  min =  228.68  max =  318.40  avg =  265.74\n      yolo-fastestv2  min =  194.29  max =  258.78  avg =  219.82\n  vision_transformer  min = 14836.43  max = 15238.27  avg = 15125.26\n          FastestDet  min =  215.60  max =  264.69  avg =  239.85\n```\n\n### Qualcomm SM8550-AB Snapdragon 8 Gen 2 (Kryo 3.20 GHz + 2.8 GHz x 2 + 2.80 GHz x 2 + 2.00 GHz x 3 + Adreno 740)\n```\n./benchncnn 4 1 2 -1 1\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    8.44  max =    8.51  avg =    8.47\n     squeezenet_int8  min =    6.91  max =    7.13  avg =    7.00\n           mobilenet  min =   15.45  max =   15.53  avg =   15.49\n      mobilenet_int8  min =    8.76  max =    9.03  avg =    8.88\n        mobilenet_v2  min =    9.52  max =   10.71  avg =   10.02\n        mobilenet_v3  min =    7.89  max =    8.02  avg =    7.93\n          shufflenet  min =    5.07  max =    5.61  avg =    5.25\n       shufflenet_v2  min =    5.28  max =    5.41  avg =    5.37\n             mnasnet  min =    9.52  max =    9.58  avg =    9.54\n     proxylessnasnet  min =   11.26  max =   11.41  avg =   11.36\n     efficientnet_b0  min =   18.84  max =   18.91  avg =   18.88\n   efficientnetv2_b0  min 
=   28.60  max =   28.73  avg =   28.66\n        regnety_400m  min =   12.35  max =   12.39  avg =   12.37\n           blazeface  min =    1.83  max =    2.23  avg =    1.94\n           googlenet  min =   32.07  max =   37.37  avg =   35.59\n      googlenet_int8  min =   28.50  max =   28.57  avg =   28.53\n            resnet18  min =   21.88  max =   22.05  avg =   21.94\n       resnet18_int8  min =   24.43  max =   40.52  avg =   32.04\n             alexnet  min =   23.69  max =   24.22  avg =   23.98\n               vgg16  min =   91.85  max =  100.71  avg =   94.80\n          vgg16_int8  min =  206.66  max =  325.74  avg =  258.40\n            resnet50  min =   53.59  max =   54.20  avg =   53.96\n       resnet50_int8  min =   44.39  max =   45.11  avg =   44.74\n      squeezenet_ssd  min =   23.80  max =   24.12  avg =   23.94\n squeezenet_ssd_int8  min =   30.17  max =   30.42  avg =   30.31\n       mobilenet_ssd  min =   33.49  max =   33.69  avg =   33.59\n  mobilenet_ssd_int8  min =   19.37  max =   19.76  avg =   19.56\n      mobilenet_yolo  min =   72.63  max =   73.00  avg =   72.77\n  mobilenetv2_yolov3  min =   36.86  max =   37.40  avg =   37.08\n         yolov4-tiny  min =   44.94  max =   45.46  avg =   45.22\n           nanodet_m  min =   13.65  max =   13.99  avg =   13.82\n    yolo-fastest-1.1  min =    3.84  max =    3.93  avg =    3.89\n      yolo-fastestv2  min =    4.78  max =    4.93  avg =    4.84\n  vision_transformer  min = 1042.50  max = 1043.06  avg = 1042.80\n          FastestDet  min =    4.67  max =    4.75  avg =    4.70\n./benchncnn 4 4 2 -1 1\nloop_count = 4\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    2.60  max =    2.66  avg =    2.64\n     squeezenet_int8  min =    2.38  max =    2.43  avg =    2.40\n           mobilenet  min =    4.17  max =    4.25  avg =    4.21\n      mobilenet_int8  min =    2.59  max =    2.60  avg =    2.60\n        mobilenet_v2  min =    3.13  max = 
   3.44  avg =    3.23\n        mobilenet_v3  min =    2.90  max =    5.07  avg =    3.46\n          shufflenet  min =    2.34  max =    2.44  avg =    2.38\n       shufflenet_v2  min =    2.06  max =    2.15  avg =    2.11\n             mnasnet  min =    3.19  max =    3.20  avg =    3.20\n     proxylessnasnet  min =    3.53  max =    3.61  avg =    3.57\n     efficientnet_b0  min =    5.72  max =    5.75  avg =    5.74\n   efficientnetv2_b0  min =    8.61  max =    8.67  avg =    8.64\n        regnety_400m  min =    6.22  max =    6.27  avg =    6.25\n           blazeface  min =    0.82  max =    0.92  avg =    0.86\n           googlenet  min =   10.62  max =   14.39  avg =   11.59\n      googlenet_int8  min =    8.84  max =    8.99  avg =    8.92\n            resnet18  min =    6.61  max =    6.66  avg =    6.63\n       resnet18_int8  min =   21.41  max =   23.48  avg =   22.57\n             alexnet  min =    8.18  max =    8.24  avg =    8.21\n               vgg16  min =   36.99  max =   39.65  avg =   37.75\n          vgg16_int8  min =   86.21  max =   89.00  avg =   86.95\n            resnet50  min =   18.90  max =   18.98  avg =   18.94\n       resnet50_int8  min =   19.18  max =   19.28  avg =   19.22\n      squeezenet_ssd  min =    8.26  max =    8.42  avg =    8.32\n squeezenet_ssd_int8  min =   21.02  max =   21.15  avg =   21.09\n       mobilenet_ssd  min =    9.29  max =    9.42  avg =    9.34\n  mobilenet_ssd_int8  min =    5.85  max =    5.91  avg =    5.87\n      mobilenet_yolo  min =   21.64  max =   21.71  avg =   21.69\n  mobilenetv2_yolov3  min =   11.50  max =   11.62  avg =   11.57\n         yolov4-tiny  min =   14.91  max =   14.99  avg =   14.95\n           nanodet_m  min =    4.93  max =    5.02  avg =    4.98\n    yolo-fastest-1.1  min =    2.19  max =    2.26  avg =    2.21\n      yolo-fastestv2  min =    2.29  max =    2.44  avg =    2.39\n  vision_transformer  min =  242.50  max =  301.91  avg =  271.32\n          FastestDet  min =    
2.01  max =    2.12  avg =    2.05\n./benchncnn 4 8 0 -1 1\nloop_count = 4\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    4.53  max =    6.34  avg =    5.48\n     squeezenet_int8  min =    5.48  max =    7.02  avg =    6.14\n           mobilenet  min =    6.89  max =    8.44  avg =    7.61\n      mobilenet_int8  min =    4.89  max =    6.39  avg =    5.43\n        mobilenet_v2  min =    6.01  max =    7.28  avg =    6.53\n        mobilenet_v3  min =    4.85  max =   12.13  avg =    7.16\n          shufflenet  min =    4.41  max =    6.20  avg =    5.25\n       shufflenet_v2  min =    3.50  max =    4.34  avg =    3.74\n             mnasnet  min =    5.52  max =    7.03  avg =    6.18\n     proxylessnasnet  min =    6.21  max =    7.76  avg =    6.94\n     efficientnet_b0  min =    9.49  max =   10.57  avg =    9.94\n   efficientnetv2_b0  min =   15.26  max =   19.50  avg =   17.42\n        regnety_400m  min =    9.89  max =   14.30  avg =   12.02\n           blazeface  min =    2.25  max =    3.44  avg =    2.66\n           googlenet  min =   18.98  max =   23.38  avg =   21.07\n      googlenet_int8  min =   17.99  max =   20.47  avg =   19.45\n            resnet18  min =   34.98  max =   84.52  avg =   69.50\n       resnet18_int8  min =   14.58  max =   15.43  avg =   15.04\n             alexnet  min =   13.56  max =   15.05  avg =   14.29\n               vgg16  min =   63.32  max =   73.69  avg =   67.01\n          vgg16_int8  min =   91.17  max =   99.80  avg =   94.81\n            resnet50  min =   32.01  max =   42.22  avg =   36.06\n       resnet50_int8  min =   30.16  max =   32.25  avg =   30.72\n      squeezenet_ssd  min =   14.72  max =   21.45  avg =   17.51\n squeezenet_ssd_int8  min =   18.21  max =   23.93  avg =   21.45\n       mobilenet_ssd  min =   16.38  max =   17.92  avg =   16.97\n  mobilenet_ssd_int8  min =   10.15  max =   15.88  avg =   12.92\n      mobilenet_yolo  min =   35.88  max =   
37.10  avg =   36.26\n  mobilenetv2_yolov3  min =   21.92  max =   27.60  avg =   24.12\n         yolov4-tiny  min =   32.03  max =   34.45  avg =   33.51\n           nanodet_m  min =    9.49  max =   14.35  avg =   11.20\n    yolo-fastest-1.1  min =    3.97  max =    5.16  avg =    4.40\n      yolo-fastestv2  min =    5.13  max =    7.84  avg =    6.18\n  vision_transformer  min =  364.37  max =  391.13  avg =  374.55\n          FastestDet  min =    3.01  max =    7.36  avg =    4.76\n./benchncnn 4 1 2 0 0\n[0 Adreno (TM) 740]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 740]  bugsbn1=1  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Adreno (TM) 740]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Adreno (TM) 740]  subgroup=64  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 4\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    9.73  max =   11.72  avg =   10.55\n     squeezenet_int8  min =    7.21  max =    7.34  avg =    7.27\n           mobilenet  min =   10.87  max =   13.09  avg =   12.01\n      mobilenet_int8  min =    8.82  max =    9.23  avg =    9.11\n        mobilenet_v2  min =   15.77  max =   16.21  avg =   15.96\n        mobilenet_v3  min =   18.04  max =   18.68  avg =   18.40\n          shufflenet  min =    9.82  max =   11.92  avg =   10.79\n       shufflenet_v2  min =   14.41  max =   15.41  avg =   14.96\n             mnasnet  min =   16.01  max =   16.43  avg =   16.27\n     proxylessnasnet  min =   14.18  max =   16.28  avg =   15.51\n     efficientnet_b0  min =   36.38  max =   37.06  avg =   36.83\n   efficientnetv2_b0  min =   55.98  max =   66.59  avg =   59.54\n        regnety_400m  min =   21.94  max =   22.46  avg =   22.30\n           blazeface  min =    3.92  max =    4.47  avg =    4.08\n           googlenet  min =   31.79  max =   35.63  avg =   33.04\n      googlenet_int8  min =   23.21  max =   29.38  avg =   26.60\n            resnet18  min =   22.61  max =   24.05  avg =   23.09\n       
resnet18_int8  min =   24.56  max =   24.78  avg =   24.62\n             alexnet  min =   25.98  max =   27.05  avg =   26.49\n               vgg16  min =   39.00  max =   39.82  avg =   39.29\n          vgg16_int8  min =  207.47  max =  208.56  avg =  207.90\n            resnet50  min =   44.07  max =   44.43  avg =   44.29\n       resnet50_int8  min =   44.77  max =   47.04  avg =   45.44\n      squeezenet_ssd  min =   33.71  max =   34.27  avg =   34.09\n squeezenet_ssd_int8  min =   22.53  max =   30.33  avg =   25.07\n       mobilenet_ssd  min =   26.91  max =   28.35  avg =   27.42\n  mobilenet_ssd_int8  min =   19.43  max =   19.82  avg =   19.69\n      mobilenet_yolo  min =   28.03  max =   29.19  avg =   28.65\n  mobilenetv2_yolov3  min =   33.54  max =   34.65  avg =   34.31\n         yolov4-tiny  min =   49.77  max =   51.21  avg =   50.55\n           nanodet_m  min =   17.35  max =   18.83  avg =   18.06\n    yolo-fastest-1.1  min =    9.45  max =    9.59  avg =    9.51\n      yolo-fastestv2  min =   13.13  max =   13.63  avg =   13.36\n  vision_transformer  min =  671.13  max =  679.90  avg =  675.27\n          FastestDet  min =    8.62  max =    9.01  avg =    8.86\n./benchncnn 64 1 2 0 0\n[0 Adreno (TM) 740]  queueC=0[3]  queueG=0[3]  queueT=0[3]\n[0 Adreno (TM) 740]  bugsbn1=1  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Adreno (TM) 740]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Adreno (TM) 740]  subgroup=64  basic=1  vote=1  ballot=1  shuffle=1\nloop_count = 64\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    9.56  max =   12.14  avg =   11.48\n     squeezenet_int8  min =    6.78  max =    8.47  avg =    7.04\n           mobilenet  min =   11.59  max =   12.90  avg =   12.44\n      mobilenet_int8  min =    8.69  max =    9.42  avg =    8.90\n        mobilenet_v2  min =   14.00  max =   16.08  avg =   15.12\n        mobilenet_v3  min =   16.66  max =   19.62  avg =   18.51\n          shufflenet  min =    8.72  
max =   13.02  avg =   11.86\n       shufflenet_v2  min =   12.82  max =   14.66  avg =   14.03\n             mnasnet  min =   15.06  max =   17.55  avg =   16.12\n     proxylessnasnet  min =   15.42  max =   17.28  avg =   16.59\n     efficientnet_b0  min =   35.96  max =   41.24  avg =   37.89\n   efficientnetv2_b0  min =   46.11  max =   65.75  avg =   58.52\n        regnety_400m  min =   22.07  max =   26.40  avg =   24.43\n           blazeface  min =    3.61  max =    6.26  avg =    4.53\n           googlenet  min =   32.60  max =   37.05  avg =   34.55\n      googlenet_int8  min =   21.79  max =   30.65  avg =   24.84\n            resnet18  min =   19.46  max =   24.26  avg =   22.76\n       resnet18_int8  min =   38.09  max =   40.42  avg =   38.44\n             alexnet  min =   20.80  max =   28.44  avg =   26.86\n               vgg16  min =   36.00  max =   44.01  avg =   39.18\n          vgg16_int8  min =  201.54  max =  209.87  avg =  207.06\n            resnet50  min =   42.50  max =   46.82  avg =   44.26\n       resnet50_int8  min =   44.63  max =   47.47  avg =   45.15\n      squeezenet_ssd  min =   33.19  max =   36.74  avg =   34.62\n squeezenet_ssd_int8  min =   22.40  max =   31.99  avg =   25.65\n       mobilenet_ssd  min =   26.35  max =   29.79  avg =   28.09\n  mobilenet_ssd_int8  min =   19.15  max =   20.86  avg =   19.48\n      mobilenet_yolo  min =   28.42  max =   31.16  avg =   29.06\n  mobilenetv2_yolov3  min =   33.86  max =   36.54  avg =   35.36\n         yolov4-tiny  min =   46.51  max =   49.29  avg =   48.01\n           nanodet_m  min =   17.14  max =   19.79  avg =   18.49\n    yolo-fastest-1.1  min =    9.49  max =   15.00  avg =   13.59\n      yolo-fastestv2  min =   11.65  max =   15.61  avg =   14.36\n  vision_transformer  min =  650.85  max =  696.67  avg =  671.13\n          FastestDet  min =    8.63  max =   13.12  avg =   11.39\n```\n\n### MediaTek Dimensity 9300 (MT6989) (Cortex-X4 3.25 GHz + 2.85 GHz x 3 + Cortex-A720 
2.0 GHz x 4 + Mali-G720-Immortalis MC12)\n```\nk6989v1_64:/data/local/tmp/benchmark # ../build-android/benchmark/benchncnn 8 8 0 -1 1                                           \nloop_count = 8\nnum_threads = 8\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    1.87  max =    2.18  avg =    2.01\n     squeezenet_int8  min =    1.52  max =    1.98  avg =    1.77\n           mobilenet  min =    3.02  max =    3.34  avg =    3.15\n      mobilenet_int8  min =    1.90  max =    2.27  avg =    2.04\n        mobilenet_v2  min =    2.72  max =    3.13  avg =    2.89\n        mobilenet_v3  min =    2.20  max =    3.82  avg =    2.78\n          shufflenet  min =    1.97  max =    2.56  avg =    2.20\n       shufflenet_v2  min =    1.77  max =    2.29  avg =    1.96\n             mnasnet  min =    2.61  max =    3.48  avg =    2.90\n     proxylessnasnet  min =    2.72  max =    3.06  avg =    2.89\n     efficientnet_b0  min =    4.57  max =    5.17  avg =    4.89\n   efficientnetv2_b0  min =    5.24  max =    6.72  avg =    5.81\n        regnety_400m  min =    4.94  max =    6.78  avg =    5.70\n           blazeface  min =    0.80  max =    1.02  avg =    0.91\n           googlenet  min =    7.76  max =    8.53  avg =    8.12\n      googlenet_int8  min =    5.68  max =    6.62  avg =    6.19\n            resnet18  min =    5.35  max =    6.06  avg =    5.61\n       resnet18_int8  min =    4.20  max =    4.40  avg =    4.29\n             alexnet  min =    5.96  max =    7.30  avg =    6.77\n               vgg16  min =   29.27  max =   30.58  avg =   29.93\n          vgg16_int8  min =   26.72  max =   28.12  avg =   27.27\n            resnet50  min =   15.21  max =   19.16  avg =   16.09\n       resnet50_int8  min =    8.57  max =    9.16  avg =    8.91\n      squeezenet_ssd  min =    6.29  max =    7.56  avg =    6.82\n squeezenet_ssd_int8  min =    5.57  max =    6.96  avg =    6.12\n       mobilenet_ssd  min =    6.90  max =    8.90  avg =    
7.55\n  mobilenet_ssd_int8  min =    4.53  max =    5.22  avg =    4.86\n      mobilenet_yolo  min =   16.88  max =   19.71  avg =   17.88\n  mobilenetv2_yolov3  min =   10.51  max =   14.19  avg =   11.95\n         yolov4-tiny  min =   12.81  max =   16.23  avg =   14.22\n           nanodet_m  min =    4.38  max =    5.96  avg =    5.19\n    yolo-fastest-1.1  min =    2.22  max =    3.08  avg =    2.73\n      yolo-fastestv2  min =    2.09  max =    2.73  avg =    2.41\n  vision_transformer  min =  193.39  max =  203.13  avg =  198.32\n          FastestDet  min =    1.98  max =    2.35  avg =    2.16\nk6989v1_64:/data/local/tmp/benchmark # ../build-android/benchmark/benchncnn 8 4 2 -1 1                                           \nloop_count = 8\nnum_threads = 4\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    2.23  max =    2.31  avg =    2.27\n     squeezenet_int8  min =    1.68  max =    1.73  avg =    1.70\n           mobilenet  min =    3.76  max =    3.86  avg =    3.81\n      mobilenet_int8  min =    2.07  max =    2.16  avg =    2.11\n        mobilenet_v2  min =    2.72  max =    2.95  avg =    2.80\n        mobilenet_v3  min =    2.43  max =    2.51  avg =    2.47\n          shufflenet  min =    1.78  max =    1.87  avg =    1.81\n       shufflenet_v2  min =    1.61  max =    1.66  avg =    1.63\n             mnasnet  min =    2.69  max =    2.82  avg =    2.76\n     proxylessnasnet  min =    2.95  max =    3.13  avg =    3.05\n     efficientnet_b0  min =    4.99  max =    5.29  avg =    5.08\n   efficientnetv2_b0  min =    5.73  max =    5.86  avg =    5.79\n        regnety_400m  min =    4.97  max =    5.04  avg =    5.00\n           blazeface  min =    1.07  max =    1.17  avg =    1.10\n           googlenet  min =    8.51  max =    9.43  avg =    8.75\n      googlenet_int8  min =    6.01  max =    6.13  avg =    6.07\n            resnet18  min =    6.72  max =    7.04  avg =    6.95\n       resnet18_int8  min =    4.31  
max =    4.40  avg =    4.34\n             alexnet  min =    7.41  max =    7.71  avg =    7.57\n               vgg16  min =   33.77  max =   34.68  avg =   34.08\n          vgg16_int8  min =   32.61  max =   33.83  avg =   33.12\n            resnet50  min =   18.76  max =   19.53  avg =   19.05\n       resnet50_int8  min =    9.56  max =    9.70  avg =    9.61\n      squeezenet_ssd  min =    6.86  max =    7.26  avg =    7.01\n squeezenet_ssd_int8  min =    5.42  max =    6.17  avg =    5.64\n       mobilenet_ssd  min =    8.38  max =    9.14  avg =    8.62\n  mobilenet_ssd_int8  min =    4.60  max =    4.90  avg =    4.69\n      mobilenet_yolo  min =   19.59  max =   20.06  avg =   19.78\n  mobilenetv2_yolov3  min =   10.46  max =   11.01  avg =   10.70\n         yolov4-tiny  min =   13.46  max =   14.18  avg =   13.86\n           nanodet_m  min =    4.52  max =    4.59  avg =    4.55\n    yolo-fastest-1.1  min =    1.88  max =    1.94  avg =    1.91\n      yolo-fastestv2  min =    1.73  max =    1.79  avg =    1.76\n  vision_transformer  min =  220.32  max =  229.49  avg =  223.92\n          FastestDet  min =    1.67  max =    1.73  avg =    1.70\nk6989v1_64:/data/local/tmp/benchmark # ../build-android/benchmark/benchncnn 8 4 1 -1 1                                           \nloop_count = 8\nnum_threads = 4\npowersave = 1\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    3.42  max =    4.25  avg =    3.62\n     squeezenet_int8  min =    2.63  max =    2.78  avg =    2.73\n           mobilenet  min =    5.66  max =    6.25  avg =    5.82\n      mobilenet_int8  min =    3.13  max =    5.66  avg =    3.58\n        mobilenet_v2  min =    4.40  max =    4.46  avg =    4.42\n        mobilenet_v3  min =    3.74  max =    4.07  avg =    3.94\n          shufflenet  min =    2.77  max =    2.86  avg =    2.82\n       shufflenet_v2  min =    2.52  max =    2.62  avg =    2.57\n             mnasnet  min =    4.24  max =    4.37  avg =    4.28\n     
proxylessnasnet  min =    4.65  max =    4.91  avg =    4.74\n     efficientnet_b0  min =    7.71  max =   10.00  avg =    8.08\n   efficientnetv2_b0  min =    9.24  max =   10.34  avg =    9.87\n        regnety_400m  min =    7.87  max =    8.35  avg =    8.02\n           blazeface  min =    2.38  max =    2.46  avg =    2.40\n           googlenet  min =   13.21  max =   13.78  avg =   13.40\n      googlenet_int8  min =   10.23  max =   10.65  avg =   10.36\n            resnet18  min =    9.25  max =    9.68  avg =    9.49\n       resnet18_int8  min =    6.86  max =    6.97  avg =    6.91\n             alexnet  min =    9.73  max =   10.53  avg =    9.97\n               vgg16  min =   47.43  max =   48.12  avg =   47.78\n          vgg16_int8  min =   47.08  max =   48.18  avg =   47.46\n            resnet50  min =   26.82  max =   27.14  avg =   26.99\n       resnet50_int8  min =   15.01  max =   15.57  avg =   15.20\n      squeezenet_ssd  min =    9.96  max =   12.66  avg =   10.83\n squeezenet_ssd_int8  min =    8.47  max =    9.26  avg =    8.88\n       mobilenet_ssd  min =   12.54  max =   13.25  avg =   12.82\n  mobilenet_ssd_int8  min =    7.03  max =   10.91  avg =    7.94\n      mobilenet_yolo  min =   29.73  max =   30.45  avg =   30.23\n  mobilenetv2_yolov3  min =   16.64  max =   17.71  avg =   17.13\n         yolov4-tiny  min =   22.25  max =   22.65  avg =   22.45\n           nanodet_m  min =    7.56  max =    7.86  avg =    7.69\n    yolo-fastest-1.1  min =    3.32  max =    3.45  avg =    3.39\n      yolo-fastestv2  min =    2.76  max =    2.96  avg =    2.84\n  vision_transformer  min =  328.11  max =  337.26  avg =  332.12\n          FastestDet  min =    2.66  max =    2.77  avg =    2.71\nk6989v1_64:/data/local/tmp/benchmark # ../build-android/benchmark/benchncnn 8 1 2 -1 1                                           \nloop_count = 8\nnum_threads = 1\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =    5.27  max =    
5.35  avg =    5.32\n     squeezenet_int8  min =    3.06  max =    3.22  avg =    3.16\n           mobilenet  min =    9.59  max =    9.85  avg =    9.74\n      mobilenet_int8  min =    4.29  max =    4.45  avg =    4.37\n        mobilenet_v2  min =    5.14  max =    5.33  avg =    5.20\n        mobilenet_v3  min =    4.28  max =    4.54  avg =    4.42\n          shufflenet  min =    3.18  max =    3.34  avg =    3.27\n       shufflenet_v2  min =    2.78  max =    3.23  avg =    3.05\n             mnasnet  min =    5.01  max =    5.38  avg =    5.19\n     proxylessnasnet  min =    6.11  max =    6.30  avg =    6.21\n     efficientnet_b0  min =   11.53  max =   11.78  avg =   11.66\n   efficientnetv2_b0  min =   13.88  max =   14.28  avg =   14.13\n        regnety_400m  min =    8.11  max =    8.18  avg =    8.16\n           blazeface  min =    0.99  max =    1.08  avg =    1.01\n           googlenet  min =   19.68  max =   20.71  avg =   20.25\n      googlenet_int8  min =   13.42  max =   13.86  avg =   13.60\n            resnet18  min =   18.10  max =   18.84  avg =   18.53\n       resnet18_int8  min =    9.67  max =   10.17  avg =    9.99\n             alexnet  min =   15.76  max =   16.35  avg =   16.03\n               vgg16  min =   70.22  max =   72.85  avg =   71.58\n          vgg16_int8  min =   76.83  max =   79.70  avg =   78.45\n            resnet50  min =   39.73  max =   41.24  avg =   40.30\n       resnet50_int8  min =   20.76  max =   21.54  avg =   21.27\n      squeezenet_ssd  min =   12.63  max =   18.67  avg =   15.20\n squeezenet_ssd_int8  min =   10.29  max =   16.13  avg =   14.13\n       mobilenet_ssd  min =   17.21  max =   18.43  avg =   17.68\n  mobilenet_ssd_int8  min =    8.92  max =    9.49  avg =    9.07\n      mobilenet_yolo  min =   37.45  max =   38.29  avg =   37.88\n  mobilenetv2_yolov3  min =   19.18  max =   19.83  avg =   19.58\n         yolov4-tiny  min =   27.06  max =   27.86  avg =   27.45\n           nanodet_m  min =    9.33 
 max =    9.50  avg =    9.42\n    yolo-fastest-1.1  min =    3.48  max =    3.59  avg =    3.54\n      yolo-fastestv2  min =    2.29  max =    2.37  avg =    2.33\n  vision_transformer  min =  730.38  max =  739.99  avg =  735.77\n          FastestDet  min =    2.40  max =    2.48  avg =    2.43\nk6989v1_64:/data/local/tmp/benchmark # ../build-android/benchmark/benchncnn 64 1 2 0 0                                           \n[0 Mali-G720-Immortalis MC12]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 Mali-G720-Immortalis MC12]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 Mali-G720-Immortalis MC12]  fp16-p/s/a=1/1/1  int8-p/s/a=1/1/1\n[0 Mali-G720-Immortalis MC12]  subgroup=16  basic/vote/ballot/shuffle=1/1/1/1\n[0 Mali-G720-Immortalis MC12]  fp16-matrix-16_8_8/16_8_16/16_16_16=0/0/0\nloop_count = 64\nnum_threads = 1\npowersave = 2\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =   11.26  max =   13.58  avg =   12.32\n     squeezenet_int8  min =    3.08  max =    3.29  avg =    3.17\n           mobilenet  min =   11.96  max =   14.52  avg =   13.48\n      mobilenet_int8  min =    4.20  max =    4.58  avg =    4.34\n        mobilenet_v2  min =   13.62  max =   16.46  avg =   14.62\n        mobilenet_v3  min =   13.98  max =   17.16  avg =   15.25\n          shufflenet  min =   10.22  max =   11.82  avg =   11.07\n       shufflenet_v2  min =   12.42  max =   15.39  avg =   14.35\n             mnasnet  min =   12.94  max =   16.30  avg =   14.91\n     proxylessnasnet  min =   13.18  max =   16.55  avg =   15.05\n     efficientnet_b0  min =   16.70  max =   20.35  avg =   18.27\n   efficientnetv2_b0  min =   54.09  max =   70.05  avg =   58.68\n        regnety_400m  min =   16.20  max =   18.42  avg =   17.27\n           blazeface  min =    6.50  max =    7.86  avg =    6.93\n           googlenet  min =   15.29  max =   17.54  avg =   16.19\n      googlenet_int8  min =   20.38  max =   22.08  avg =   20.98\n            resnet18  min =   12.22  max =   
15.63  avg =   14.27\n       resnet18_int8  min =    9.50  max =   10.46  avg =    9.75\n             alexnet  min =   12.00  max =   16.09  avg =   13.65\n               vgg16  min =   31.06  max =   32.77  avg =   31.85\n          vgg16_int8  min =  115.72  max =  123.71  avg =  118.23\n            resnet50  min =   15.74  max =   16.53  avg =   16.10\n       resnet50_int8  min =   32.43  max =   33.78  avg =   33.07\n      squeezenet_ssd  min =   17.24  max =   21.80  avg =   20.68\n squeezenet_ssd_int8  min =    9.69  max =   10.52  avg =    9.97\n       mobilenet_ssd  min =   15.32  max =   17.63  avg =   16.62\n  mobilenet_ssd_int8  min =    8.84  max =    9.54  avg =    9.05\n      mobilenet_yolo  min =   16.67  max =   18.21  avg =   17.25\n  mobilenetv2_yolov3  min =   20.08  max =   25.40  avg =   23.12\n         yolov4-tiny  min =   21.98  max =   29.67  avg =   24.75\n           nanodet_m  min =   23.19  max =   29.95  avg =   25.69\n    yolo-fastest-1.1  min =   15.07  max =   17.78  avg =   16.49\n      yolo-fastestv2  min =   14.67  max =   16.07  avg =   15.44\n  vision_transformer  min =  768.04  max =  801.48  avg =  786.79\n          FastestDet  min =    8.33  max =   16.07  avg =   14.38\n```\n\n### Xeon Phi 3120A (1.10 GHz 57-core 228-thread)\n\n- Host: CentOS 7.9\n- Compiler: icc & icpc (ICC) 17.0.2 20170213\n- ncnn tag: 20240102\n\nBuild command\n\n```bash\n$ CC=icc CXX=icpc CFLAGS=\"-mmic\" CXXFLAGS=\"-mmic\" cmake .. -DCMAKE_BUILD_TYPE=Release -DNCNN_SSE2=OFF -DNCNN_AVX=OFF -DNCNN_AVX2=OFF\n```\n\nCopy the whole `ncnn` directory and libraries in `/opt/intel/compilers_and_libraries_2017/linux/lib/mic/lib` to `mic0`, then set the `LD_LIBRARY_PATH` environment variable. Some tools cannot be built, but `benchncnn` should work. 
The built `benchncnn` is for Intel Xeon Phi coprocessor (k1om).\n\n```bash\n[mizu-bai@DESKTOP-1D9QDE1-mic0 benchmark]$ file benchncnn \nbenchncnn: ELF 64-bit LSB executable, Intel Xeon Phi coprocessor (k1om), version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not stripped\n```\n\nThe benchmark runs in native mode: ssh into the Xeon Phi with `ssh user@mic0`, then run `benchncnn` as on any general Linux system.\n\n```\n[mizu-bai@DESKTOP-1D9QDE1-mic0 benchmark]$ KMP_AFFINITY=scatter ../build/benchmark/benchncnn 4 56 0 -1 1\nloop_count = 4\nnum_threads = 56\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   43.42  max =   44.20  avg =   43.64\n     squeezenet_int8  min =  161.92  max =  162.41  avg =  162.15\n           mobilenet  min =   44.49  max =   46.90  avg =   45.68\n      mobilenet_int8  min =  230.47  max =  232.40  avg =  231.77\n        mobilenet_v2  min =   57.22  max =   62.03  avg =   59.42\n        mobilenet_v3  min =  301.16  max =  306.62  avg =  303.90\n          shufflenet  min =   65.80  max =   70.18  avg =   67.70\n       shufflenet_v2  min =   49.54  max =   53.17  avg =   51.22\n             mnasnet  min =  521.87  max =  527.76  avg =  524.63\n     proxylessnasnet  min =  745.79  max =  748.55  avg =  746.92\n     efficientnet_b0  min =  582.21  max =  584.64  avg =  583.34\n   efficientnetv2_b0  min =   84.13  max =   86.13  avg =   85.19\n        regnety_400m  min =  209.67  max =  214.84  avg =  212.39\n           blazeface  min =   26.33  max =   27.39  avg =   26.74\n           googlenet  min =  124.14  max =  125.72  avg =  124.83\n      googlenet_int8  min =  498.36  max =  502.37  avg =  500.29\n            resnet18  min =   87.86  max =   88.83  avg =   88.35\n       resnet18_int8  min =  359.50  max =  360.71  avg =  360.11\n             alexnet  min =   49.87  max =   51.25  avg =   50.76\n               vgg16  min =  341.87  max =  343.92  avg =  342.42\n          
vgg16_int8  min = 1649.34  max = 1655.37  avg = 1652.98\n            resnet50  min =  198.91  max =  202.32  avg =  200.58\n       resnet50_int8  min =  983.48  max =  988.73  avg =  986.22\n      squeezenet_ssd  min =  108.33  max =  111.45  avg =  110.18\n squeezenet_ssd_int8  min =  368.96  max =  370.30  avg =  369.54\n       mobilenet_ssd  min =   98.29  max =  101.49  avg =   99.99\n  mobilenet_ssd_int8  min =  462.18  max =  466.20  avg =  464.85\n      mobilenet_yolo  min =  262.42  max =  266.84  avg =  263.91\n  mobilenetv2_yolov3  min =  159.20  max =  161.58  avg =  160.66\n         yolov4-tiny  min =  229.22  max =  230.48  avg =  229.87\n           nanodet_m  min =  115.10  max =  116.78  avg =  115.86\n    yolo-fastest-1.1  min =  154.48  max =  155.33  avg =  154.79\n      yolo-fastestv2  min =  161.10  max =  163.98  avg =  161.88\n  vision_transformer  min =  848.51  max =  863.03  avg =  854.92\n          FastestDet  min =  251.64  max =  253.22  avg =  252.38\n[mizu-bai@DESKTOP-1D9QDE1-mic0 benchmark]$ KMP_AFFINITY=scatter ../build/benchmark/benchncnn 4 112 0 -1 1\nloop_count = 4\nnum_threads = 112\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   41.07  max =   41.19  avg =   41.12\n     squeezenet_int8  min =  161.73  max =  163.90  avg =  162.74\n           mobilenet  min =   36.82  max =   37.53  avg =   37.11\n      mobilenet_int8  min =  231.50  max =  233.81  avg =  232.65\n        mobilenet_v2  min =   53.12  max =   55.87  avg =   54.44\n        mobilenet_v3  min =  277.82  max =  280.61  avg =  279.66\n          shufflenet  min =   64.11  max =   64.92  avg =   64.63\n       shufflenet_v2  min =   48.23  max =   50.00  avg =   49.19\n             mnasnet  min =  532.09  max =  534.73  avg =  533.34\n     proxylessnasnet  min =  760.43  max =  763.94  avg =  762.34\n     efficientnet_b0  min =  534.29  max =  547.51  avg =  541.29\n   efficientnetv2_b0  min =   75.94  max =   76.88  avg =   76.39\n        
regnety_400m  min =  226.37  max =  227.81  avg =  227.23\n           blazeface  min =   26.03  max =   26.93  avg =   26.51\n           googlenet  min =  106.53  max =  107.54  avg =  107.06\n      googlenet_int8  min =  503.01  max =  505.16  avg =  504.13\n            resnet18  min =   73.63  max =   76.61  avg =   75.11\n       resnet18_int8  min =  358.18  max =  359.50  avg =  358.99\n             alexnet  min =   37.40  max =   38.17  avg =   37.83\n               vgg16  min =  244.95  max =  250.05  avg =  247.24\n          vgg16_int8  min = 1511.89  max = 1512.66  avg = 1512.35\n            resnet50  min =  151.99  max =  154.66  avg =  153.37\n       resnet50_int8  min =  954.16  max =  957.63  avg =  956.55\n      squeezenet_ssd  min =   91.46  max =   97.18  avg =   94.00\n squeezenet_ssd_int8  min =  368.03  max =  375.96  avg =  370.99\n       mobilenet_ssd  min =   79.61  max =   81.38  avg =   80.33\n  mobilenet_ssd_int8  min =  458.93  max =  463.41  avg =  461.63\n      mobilenet_yolo  min =  234.59  max =  236.91  avg =  235.43\n  mobilenetv2_yolov3  min =  145.82  max =  146.92  avg =  146.23\n         yolov4-tiny  min =  219.22  max =  220.51  avg =  219.83\n           nanodet_m  min =  109.43  max =  113.94  avg =  112.20\n    yolo-fastest-1.1  min =  158.13  max =  160.59  avg =  159.20\n      yolo-fastestv2  min =  162.05  max =  162.80  avg =  162.47\n  vision_transformer  min =  615.14  max =  625.35  avg =  618.47\n          FastestDet  min =  279.98  max =  282.49  avg =  281.14\n[mizu-bai@DESKTOP-1D9QDE1-mic0 benchmark]$ KMP_AFFINITY=scatter ../build/benchmark/benchncnn 4 224 0 -1 1\nloop_count = 4\nnum_threads = 224\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   45.54  max =   46.81  avg =   46.13\n     squeezenet_int8  min =  186.81  max =  187.14  avg =  186.97\n           mobilenet  min =   38.33  max =   39.11  avg =   38.64\n      mobilenet_int8  min =  251.06  max =  251.91  avg =  251.40\n       
 mobilenet_v2  min =   56.57  max =   57.15  avg =   56.88\n        mobilenet_v3  min =  365.04  max =  366.87  avg =  365.94\n          shufflenet  min =   71.16  max =   72.02  avg =   71.68\n       shufflenet_v2  min =   52.14  max =   53.60  avg =   52.92\n             mnasnet  min =  596.37  max =  603.62  avg =  600.50\n     proxylessnasnet  min =  911.84  max =  912.23  avg =  912.04\n     efficientnet_b0  min =  611.77  max =  614.32  avg =  612.69\n   efficientnetv2_b0  min =   82.16  max =   83.05  avg =   82.62\n        regnety_400m  min =  253.43  max =  255.79  avg =  254.66\n           blazeface  min =   30.54  max =   30.91  avg =   30.70\n           googlenet  min =  111.68  max =  112.65  avg =  112.11\n      googlenet_int8  min =  594.07  max =  597.09  avg =  596.03\n            resnet18  min =   78.14  max =   79.12  avg =   78.75\n       resnet18_int8  min =  412.69  max =  413.92  avg =  413.46\n             alexnet  min =   40.93  max =   41.43  avg =   41.17\n               vgg16  min =  242.45  max =  244.46  avg =  243.47\n          vgg16_int8  min = 1545.61  max = 1548.72  avg = 1547.47\n            resnet50  min =  147.73  max =  148.56  avg =  148.07\n       resnet50_int8  min = 1034.47  max = 1042.31  avg = 1038.41\n      squeezenet_ssd  min =  107.82  max =  110.53  avg =  108.98\n squeezenet_ssd_int8  min =  423.30  max =  426.91  avg =  425.67\n       mobilenet_ssd  min =   74.54  max =   77.13  avg =   75.97\n  mobilenet_ssd_int8  min =  510.95  max =  513.33  avg =  512.40\n      mobilenet_yolo  min =  238.83  max =  239.64  avg =  239.27\n  mobilenetv2_yolov3  min =  159.80  max =  160.31  avg =  160.04\n         yolov4-tiny  min =  233.89  max =  237.41  avg =  236.22\n           nanodet_m  min =  122.39  max =  123.42  avg =  122.89\n    yolo-fastest-1.1  min =  194.49  max =  195.25  avg =  194.94\n      yolo-fastestv2  min =  193.06  max =  195.03  avg =  194.05\n  vision_transformer  min =  547.36  max =  554.17  avg =  
549.99\n          FastestDet  min =  317.76  max =  321.38  avg =  320.18\n```\n\n### PhytiumPi, Phytium E2000 (FTC664@1.8GHz x2 + FTC310@1.5GHz x2)\n```\nloop_count = 4\nnum_threads = 2\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   43.84  max =   43.95  avg =   43.88\n     squeezenet_int8  min =   35.48  max =   35.77  avg =   35.66\n           mobilenet  min =   69.31  max =   70.03  avg =   69.66\n      mobilenet_int8  min =   42.30  max =   42.40  avg =   42.35\n        mobilenet_v2  min =   59.07  max =   59.35  avg =   59.19\n        mobilenet_v3  min =   46.02  max =   46.37  avg =   46.19\n          shufflenet  min =   31.52  max =   31.61  avg =   31.56\n       shufflenet_v2  min =   23.99  max =   24.07  avg =   24.04\n             mnasnet  min =   49.40  max =   50.45  avg =   49.92\n     proxylessnasnet  min =   53.24  max =   53.85  avg =   53.53\n     efficientnet_b0  min =   77.49  max =   77.84  avg =   77.62\n   efficientnetv2_b0  min =   88.51  max =   88.92  avg =   88.69\n        regnety_400m  min =   66.99  max =   67.05  avg =   67.03\n           blazeface  min =    7.74  max =    8.14  avg =    7.98\n           googlenet  min =  126.62  max =  127.23  avg =  126.91\n      googlenet_int8  min =  102.87  max =  103.16  avg =  103.01\n            resnet18  min =  102.28  max =  102.63  avg =  102.48\n       resnet18_int8  min =   72.01  max =   72.45  avg =   72.29\n             alexnet  min =   76.00  max =  124.61  avg =   88.24\n               vgg16  min =  597.75  max =  601.99  avg =  599.44\n          vgg16_int8  min =  421.40  max =  423.83  avg =  423.01\n            resnet50  min =  278.16  max =  280.64  avg =  279.37\n       resnet50_int8  min =  207.26  max =  207.47  avg =  207.36\n      squeezenet_ssd  min =  108.69  max =  109.26  avg =  108.99\n squeezenet_ssd_int8  min =   84.05  max =   84.60  avg =   84.28\n       mobilenet_ssd  min =  141.65  max =  142.46  avg =  142.14\n  
mobilenet_ssd_int8  min =   84.43  max =   84.99  avg =   84.73\n      mobilenet_yolo  min =  322.53  max =  325.15  avg =  323.51\n  mobilenetv2_yolov3  min =  194.84  max =  196.98  avg =  196.07\n         yolov4-tiny  min =  208.29  max =  213.26  avg =  210.77\n           nanodet_m  min =   64.78  max =   65.38  avg =   65.08\n    yolo-fastest-1.1  min =   37.89  max =   38.23  avg =   38.07\n      yolo-fastestv2  min =   29.75  max =   30.33  avg =   30.09\n  vision_transformer  min = 4257.71  max = 4263.73  avg = 4260.60\n          FastestDet  min =   30.86  max =   44.67  avg =   34.41\n```\n\n### AMD EPYC 7742 (2.25GHz) ubuntu 22.04 AOCC_4.2.0-Build#89\n\nsingle core\n\n```\n# nice -20 ../build-host-aocc-linux/benchmark/benchncnn 100 1 0 -1 0\nloop_count = 100\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    9.26  max =   10.05  avg =    9.45\n     squeezenet_int8  min =    9.54  max =   13.35  avg =    9.67\n           mobilenet  min =   16.20  max =   16.83  avg =   16.35\n      mobilenet_int8  min =   16.79  max =   17.28  avg =   16.89\n        mobilenet_v2  min =   10.69  max =   11.13  avg =   10.78\n        mobilenet_v3  min =    8.87  max =   14.09  avg =    9.03\n          shufflenet  min =    4.99  max =    5.29  avg =    5.06\n       shufflenet_v2  min =    5.61  max =    7.14  avg =    5.66\n             mnasnet  min =   11.94  max =   12.39  avg =   12.05\n     proxylessnasnet  min =   13.48  max =   16.57  avg =   13.62\n     efficientnet_b0  min =   19.58  max =   20.34  avg =   19.73\n   efficientnetv2_b0  min =   22.66  max =   23.63  avg =   22.89\n        regnety_400m  min =   14.89  max =   18.76  avg =   15.11\n           blazeface  min =    1.45  max =    1.59  avg =    1.51\n           googlenet  min =   35.38  max =   36.94  avg =   35.79\n      googlenet_int8  min =   30.55  max =   42.18  avg =   30.88\n            resnet18  min =   34.73  max =   48.15  avg =   35.43\n       
resnet18_int8  min =   27.39  max =   28.22  avg =   27.61\n             alexnet  min =   31.42  max =   32.26  avg =   31.64\n               vgg16  min =  160.38  max =  172.02  avg =  162.52\n          vgg16_int8  min =  134.03  max =  153.69  avg =  135.12\n            resnet50  min =   85.47  max =   87.90  avg =   86.21\n       resnet50_int8  min =   71.18  max =   80.37  avg =   71.70\n      squeezenet_ssd  min =   24.66  max =   25.71  avg =   24.84\n squeezenet_ssd_int8  min =   23.61  max =   24.28  avg =   23.78\n       mobilenet_ssd  min =   34.48  max =   35.69  avg =   34.64\n  mobilenet_ssd_int8  min =   33.26  max =   34.32  avg =   33.45\n      mobilenet_yolo  min =   77.25  max =   86.54  avg =   77.73\n  mobilenetv2_yolov3  min =   41.72  max =   42.92  avg =   42.02\n         yolov4-tiny  min =   57.61  max =   59.49  avg =   58.46\n           nanodet_m  min =   12.92  max =   13.39  avg =   13.03\n    yolo-fastest-1.1  min =    5.02  max =    5.26  avg =    5.11\n      yolo-fastestv2  min =    5.06  max =    5.20  avg =    5.09\n  vision_transformer  min =  637.63  max =  670.46  avg =  640.60\n          FastestDet  min =    5.59  max =    5.82  avg =    5.66\n```\n\n64 cores\n\n```\n# nice -20 ../build-host-aocc-linux/benchmark/benchncnn 300 64 0 -1 0\nloop_count = 300\nnum_threads = 64\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =    4.19  max =   13.94  avg =    5.06\n     squeezenet_int8  min =    4.93  max =   13.59  avg =    5.14\n           mobilenet  min =    3.29  max =    5.28  avg =    3.39\n      mobilenet_int8  min =    2.32  max =    3.32  avg =    2.40\n        mobilenet_v2  min =    4.58  max =    8.64  avg =    4.76\n        mobilenet_v3  min =    4.11  max =    6.89  avg =    4.88\n          shufflenet  min =    5.67  max =    8.60  avg =    5.92\n       shufflenet_v2  min =    4.83  max =    6.29  avg =    5.02\n             mnasnet  min =    4.08  max =   12.75  avg =    4.29\n     
proxylessnasnet  min =    4.46  max =    7.28  avg =    4.68\n     efficientnet_b0  min =    5.51  max =   11.67  avg =    6.33\n   efficientnetv2_b0  min =    7.50  max =   11.30  avg =    9.34\n        regnety_400m  min =   12.50  max =   20.88  avg =   12.76\n           blazeface  min =    1.67  max =    3.37  avg =    1.76\n           googlenet  min =   10.64  max =   11.59  avg =   10.87\n      googlenet_int8  min =    8.49  max =   17.88  avg =    9.90\n            resnet18  min =    6.36  max =    6.88  avg =    6.48\n       resnet18_int8  min =    4.65  max =   13.03  avg =    4.77\n             alexnet  min =    3.88  max =    4.62  avg =    3.97\n               vgg16  min =   26.00  max =   36.86  avg =   27.25\n          vgg16_int8  min =   17.75  max =   19.63  avg =   18.42\n            resnet50  min =   13.94  max =   23.10  avg =   14.17\n       resnet50_int8  min =    8.73  max =   18.32  avg =    8.92\n      squeezenet_ssd  min =   10.39  max =   12.10  avg =   10.77\n squeezenet_ssd_int8  min =   11.53  max =   20.24  avg =   12.01\n       mobilenet_ssd  min =    6.80  max =    8.16  avg =    6.96\n  mobilenet_ssd_int8  min =    4.98  max =    5.21  avg =    5.07\n      mobilenet_yolo  min =   17.75  max =   30.34  avg =   18.29\n  mobilenetv2_yolov3  min =   13.74  max =   15.69  avg =   14.18\n         yolov4-tiny  min =   21.27  max =   29.53  avg =   22.81\n           nanodet_m  min =   10.22  max =   12.25  avg =   10.89\n    yolo-fastest-1.1  min =    5.56  max =    6.03  avg =    5.66\n      yolo-fastestv2  min =    5.61  max =    5.78  avg =    5.67\n  vision_transformer  min =   69.07  max =  508.15  avg =   71.73\n          FastestDet  min =    5.74  max =    6.83  avg =    5.81\n```\n\n### NVIDIA Tesla V100-PCIE-32GB  (GV100 SM x 80 + Tensor Core x 640)\n\n```\n# ../build-host-gcc-vk-linux/benchmark/benchncnn 300 1 0 0 0\n[0 Tesla V100-PCIE-32GB]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[0 Tesla V100-PCIE-32GB]  bugsbn1=0  bugbilz=0  
bugcopc=0  bugihfa=0\n[0 Tesla V100-PCIE-32GB]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[0 Tesla V100-PCIE-32GB]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[0 Tesla V100-PCIE-32GB]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\n[1 llvmpipe (LLVM 15.0.7, 256 bits)]  queueC=0[1]  queueG=0[1]  queueT=0[1]\n[1 llvmpipe (LLVM 15.0.7, 256 bits)]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[1 llvmpipe (LLVM 15.0.7, 256 bits)]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[1 llvmpipe (LLVM 15.0.7, 256 bits)]  subgroup=8  basic/vote/ballot/shuffle=1/1/1/1\n[1 llvmpipe (LLVM 15.0.7, 256 bits)]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\n[2 Tesla V100-PCIE-32GB]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[2 Tesla V100-PCIE-32GB]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[2 Tesla V100-PCIE-32GB]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[2 Tesla V100-PCIE-32GB]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[2 Tesla V100-PCIE-32GB]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\n[3 Tesla V100-PCIE-32GB]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[3 Tesla V100-PCIE-32GB]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[3 Tesla V100-PCIE-32GB]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[3 Tesla V100-PCIE-32GB]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[3 Tesla V100-PCIE-32GB]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\n[4 Tesla V100-PCIE-32GB]  queueC=2[8]  queueG=0[16]  queueT=1[2]\n[4 Tesla V100-PCIE-32GB]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[4 Tesla V100-PCIE-32GB]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[4 Tesla V100-PCIE-32GB]  subgroup=32  basic/vote/ballot/shuffle=1/1/1/1\n[4 Tesla V100-PCIE-32GB]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 300\nnum_threads = 1\npowersave = 0\ngpu_device = 0\ncooling_down = 0\n          squeezenet  min =    1.16  max =   16.79  avg =    1.64\n     squeezenet_int8  min =    9.03  max =   10.06  avg =    9.15\n           mobilenet  min =    1.05  max =    2.60  avg =    1.25\n      
mobilenet_int8  min =   16.78  max =   19.89  avg =   16.93\n        mobilenet_v2  min =    1.60  max =    3.29  avg =    1.76\n        mobilenet_v3  min =    1.84  max =    8.43  avg =    2.04\n          shufflenet  min =    1.35  max =    3.73  avg =    1.54\n       shufflenet_v2  min =    1.66  max =    8.02  avg =    1.93\n             mnasnet  min =    1.69  max =    3.31  avg =    1.82\n     proxylessnasnet  min =    1.74  max =    3.70  avg =    1.89\n     efficientnet_b0  min =    2.86  max =    5.21  avg =    3.02\n   efficientnetv2_b0  min =   60.41  max =   80.28  avg =   69.51\n        regnety_400m  min =    2.38  max =    6.84  avg =    2.57\n           blazeface  min =    0.85  max =    3.50  avg =    0.96\n           googlenet  min =    3.69  max =   16.66  avg =    4.10\n      googlenet_int8  min =   33.66  max =   47.27  avg =   34.32\n            resnet18  min =    1.76  max =    7.58  avg =    1.95\n       resnet18_int8  min =   27.12  max =   36.43  avg =   27.62\n             alexnet  min =    1.33  max =    2.97  avg =    1.49\n               vgg16  min =    2.98  max =    4.60  avg =    3.17\n          vgg16_int8  min =  133.97  max =  154.41  avg =  136.22\n            resnet50  min =    3.42  max =   17.05  avg =    3.72\n       resnet50_int8  min =   70.53  max =   93.57  avg =   71.96\n      squeezenet_ssd  min =   16.88  max =   22.55  avg =   18.49\n squeezenet_ssd_int8  min =   23.12  max =   30.45  avg =   23.50\n       mobilenet_ssd  min =    5.44  max =    7.09  avg =    5.93\n  mobilenet_ssd_int8  min =   33.28  max =   38.92  avg =   33.62\n      mobilenet_yolo  min =    5.67  max =    7.66  avg =    6.26\n  mobilenetv2_yolov3  min =    6.33  max =    7.89  avg =    6.67\n         yolov4-tiny  min =   14.66  max =   17.29  avg =   15.57\n           nanodet_m  min =    5.36  max =   16.11  avg =    5.95\n    yolo-fastest-1.1  min =    5.60  max =    7.45  avg =    6.13\n      yolo-fastestv2  min =    3.48  max =    5.29  avg =    
3.96\n  vision_transformer  min =  153.75  max =  198.81  avg =  165.58\n          FastestDet  min =    3.01  max =    5.01  avg =    3.29\n```\n\n### AXERA AX630C (Cortex-A53 1.2GHz * 2)\n\n```\n# ~/ncnn/build-aarch64-linux-gnu/benchmark # ./benchncnn 4 1 0 -1 0\nloop_count = 4\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =  129.78  max =  130.30  avg =  130.09\n     squeezenet_int8  min =  123.08  max =  123.48  avg =  123.22\n           mobilenet  min =  211.46  max =  221.68  avg =  214.14\n      mobilenet_int8  min =  196.00  max =  212.73  avg =  200.23\n        mobilenet_v2  min =  149.15  max =  149.21  avg =  149.17\n        mobilenet_v3  min =  124.70  max =  125.54  avg =  125.08\n          shufflenet  min =   80.75  max =   80.88  avg =   80.81\n       shufflenet_v2  min =   74.30  max =   74.50  avg =   74.37\n             mnasnet  min =  148.87  max =  165.85  avg =  153.26\n     proxylessnasnet  min =  203.05  max =  213.50  avg =  205.82\n     efficientnet_b0  min =  270.39  max =  280.59  avg =  273.13\n   efficientnetv2_b0  min =  302.93  max =  318.07  avg =  307.30\n        regnety_400m  min =  187.47  max =  187.90  avg =  187.60\n           blazeface  min =   22.64  max =   22.78  avg =   22.72\n           googlenet  min =  487.36  max =  503.50  avg =  493.93\n      googlenet_int8  min =  418.16  max =  434.44  avg =  426.09\n       resnet18_int8  min =  290.39  max =  301.90  avg =  293.70\n       resnet50_int8  min =  888.81  max =  898.34  avg =  895.92\n      squeezenet_ssd  min =  320.78  max =  330.33  avg =  323.54\n squeezenet_ssd_int8  min =  281.52  max =  299.11  avg =  286.89\n       mobilenet_ssd  min =  435.79  max =  452.66  avg =  444.19\n  mobilenet_ssd_int8  min =  394.38  max =  411.09  avg =  398.65\n      mobilenet_yolo  min =  955.48  max =  972.38  avg =  967.52\n  mobilenetv2_yolov3  min =  519.47  max =  536.58  avg =  524.25\n      yolo-fastestv2  min =   73.94  max =  
 74.15  avg =   74.05\n          FastestDet  min =   81.89  max =   82.07  avg =   81.98\n          \n# ~/ncnn/build-aarch64-linux-gnu/benchmark # ./benchncnn 4 2 0 -1 0\nloop_count = 4\nnum_threads = 2\npowersave = 0\ngpu_device = -1\ncooling_down = 0\n          squeezenet  min =   75.14  max =   88.89  avg =   79.06\n     squeezenet_int8  min =   70.11  max =   85.48  avg =   74.32\n           mobilenet  min =  112.72  max =  124.85  avg =  115.87\n      mobilenet_int8  min =  100.35  max =  100.58  avg =  100.49\n        mobilenet_v2  min =   85.92  max =   86.20  avg =   86.03\n        mobilenet_v3  min =   73.94  max =   74.34  avg =   74.20\n          shufflenet  min =   53.99  max =   66.11  avg =   57.63\n       shufflenet_v2  min =   47.47  max =   47.72  avg =   47.59\n             mnasnet  min =   85.96  max =   86.27  avg =   86.13\n     proxylessnasnet  min =  111.15  max =  121.84  avg =  113.92\n     efficientnet_b0  min =  149.72  max =  150.00  avg =  149.85\n   efficientnetv2_b0  min =  168.84  max =  170.57  avg =  169.35\n        regnety_400m  min =  120.42  max =  135.50  avg =  124.26\n           blazeface  min =   14.27  max =   14.48  avg =   14.39\n           googlenet  min =  263.82  max =  274.74  avg =  266.84\n      googlenet_int8  min =  226.91  max =  227.36  avg =  227.23\n       resnet18_int8  min =  157.66  max =  168.11  avg =  160.57\n       resnet50_int8  min =  469.84  max =  484.00  avg =  476.59\n      squeezenet_ssd  min =  190.23  max =  204.41  avg =  193.99\n squeezenet_ssd_int8  min =  162.73  max =  174.30  avg =  165.79\n       mobilenet_ssd  min =  236.26  max =  251.16  avg =  240.34\n  mobilenet_ssd_int8  min =  203.22  max =  212.01  avg =  206.00\n      mobilenet_yolo  min =  522.45  max =  537.99  avg =  529.95\n  mobilenetv2_yolov3  min =  300.33  max =  316.59  avg =  304.89\n      yolo-fastestv2  min =   50.27  max =   50.62  avg =   50.43\n          FastestDet  min =   53.34  max =   53.64  avg =   
53.51\n```\n\n### Spacemit MUSE Pi Pro Spacemit M1 (Spacemit X60 *8 + PowerVR B-Series BXE-2-32 MC1)\n```\nroot@spacemit-k1-x-MUSE-Pi-Pro-board:/home/yingxi/ncnn/build/benchmark# ./benchncnn 4 8 2 -1 1\nloop_count = 4\nnum_threads = 8\npowersave = 2\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =  192.55  max =  203.73  avg =  195.61\n     squeezenet_int8  min =  863.38  max =  875.44  avg =  867.96\n           mobilenet  min =  260.32  max =  274.70  avg =  266.42\n      mobilenet_int8  min = 1287.80  max = 1606.98  avg = 1461.52\n        mobilenet_v2  min =  168.08  max =  173.99  avg =  169.97\n        mobilenet_v3  min =  141.06  max =  166.83  avg =  147.74\n          shufflenet  min =   82.91  max =   92.83  avg =   85.57\n       shufflenet_v2  min =   83.11  max =   83.35  avg =   83.26\n             mnasnet  min =  168.99  max =  180.35  avg =  171.95\n     proxylessnasnet  min =  186.14  max =  194.56  avg =  188.91\n     efficientnet_b0  min =  257.93  max =  263.18  avg =  259.94\n   efficientnetv2_b0  min =  385.35  max =  394.09  avg =  388.57\n        regnety_400m  min =  228.02  max =  229.55  avg =  228.88\n           blazeface  min =   26.78  max =   27.43  avg =   26.97\n           googlenet  min =  781.12  max =  796.37  avg =  788.60\n      googlenet_int8  min = 2422.82  max = 2441.75  avg = 2432.78\n            resnet18  min =  864.67  max =  874.15  avg =  869.32\n       resnet18_int8  min = 2409.34  max = 2728.57  avg = 2530.44\n             alexnet  min =  389.93  max =  393.67  avg =  391.77\n               vgg16  min = 8213.96  max = 8957.49  avg = 8405.27\n          vgg16_int8  min = 34268.94  max = 36044.89  avg = 35244.72\n            resnet50  min = 1798.75  max = 1859.80  avg = 1825.00\n       resnet50_int8  min = 7364.21  max = 7500.24  avg = 7428.21\n      squeezenet_ssd  min =  693.59  max =  701.68  avg =  697.60\n squeezenet_ssd_int8  min = 1447.64  max = 1461.21  avg = 1455.02\n       mobilenet_ssd  min =  530.90 
 max =  542.81  avg =  534.42\n  mobilenet_ssd_int8  min = 4347.45  max = 4391.44  avg = 4377.68\n      mobilenet_yolo  min = 1285.07  max = 1369.59  avg = 1312.64\n  mobilenetv2_yolov3  min =  605.19  max =  628.05  avg =  616.37\n         yolov4-tiny  min = 1743.00  max = 1751.39  avg = 1748.09\n           nanodet_m  min =  201.46  max =  202.80  avg =  202.03\n    yolo-fastest-1.1  min =   97.02  max =   98.29  avg =   97.71\n      yolo-fastestv2  min =   75.53  max =   76.62  avg =   76.20\n  vision_transformer  min = 11328.10  max = 11334.80  avg = 11332.34\n          FastestDet  min =   85.01  max =   86.04  avg =   85.45\n\nroot@spacemit-k1-x-MUSE-Pi-Pro-board:/home/yingxi/ncnn/build/benchmark# ./benchncnn 4 8 2 0 1\n[0 PowerVR B-Series BXE-2-32 MC1]  queueC=0[2]  queueG=0[2]  queueT=0[2]\n[0 PowerVR B-Series BXE-2-32 MC1]  bugsbn1=0  bugbilz=0  bugcopc=0  bugihfa=0\n[0 PowerVR B-Series BXE-2-32 MC1]  fp16-p/s/u/a=1/1/1/1  int8-p/s/u/a=1/1/1/1\n[0 PowerVR B-Series BXE-2-32 MC1]  subgroup=1(1~1)  ops=1/1/1/1/1/1/0/0/1/1\n[0 PowerVR B-Series BXE-2-32 MC1]  fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0\nloop_count = 4\nnum_threads = 8\npowersave = 2\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =  381.51  max =  382.05  avg =  381.73\n     squeezenet_int8  min =  862.26  max =  890.38  avg =  879.94\n           mobilenet  min =  795.29  max =  796.41  avg =  795.80\n      mobilenet_int8  min = 1284.16  max = 1298.86  avg = 1290.31\n        mobilenet_v2  min =  512.00  max =  512.59  avg =  512.19\n        mobilenet_v3  min =  428.55  max =  428.95  avg =  428.76\n          shufflenet  min =  198.17  max =  198.83  avg =  198.39\n       shufflenet_v2  min =  272.36  max =  272.73  avg =  272.55\n             mnasnet  min =  526.92  max =  527.44  avg =  527.12\n     proxylessnasnet  min =  601.43  max =  602.65  avg =  602.05\n     efficientnet_b0  min =  704.94  max =  705.23  avg =  705.13\n   efficientnetv2_b0  min =  854.83  max =  866.51  avg = 
 859.85\n        regnety_400m  min =  526.46  max =  527.04  avg =  526.65\n           blazeface  min =   69.74  max =   69.84  avg =   69.80\n           googlenet  min = 1230.07  max = 1231.04  avg = 1230.53\n      googlenet_int8  min = 2409.25  max = 2423.38  avg = 2416.76\n            resnet18  min = 1134.72  max = 1136.35  avg = 1135.44\n       resnet18_int8  min = 2431.48  max = 2552.62  avg = 2473.90\n             alexnet  min =  692.35  max =  697.08  avg =  695.61\n               vgg16  min = 5790.33  max = 5805.37  avg = 5796.20\n          vgg16_int8  min = 34057.43  max = 35714.99  avg = 35080.62\n            resnet50  min = 3426.54  max = 3429.97  avg = 3427.94\n       resnet50_int8  min = 7370.03  max = 7409.63  avg = 7390.83\n      squeezenet_ssd  min = 1057.50  max = 1061.42  avg = 1059.26\n squeezenet_ssd_int8  min = 1454.99  max = 1469.47  avg = 1462.61\n       mobilenet_ssd  min = 1670.02  max = 1673.22  avg = 1671.34\n  mobilenet_ssd_int8  min = 4372.23  max = 4424.18  avg = 4400.11\n      mobilenet_yolo  min = 3794.02  max = 3796.52  avg = 3795.21\n  mobilenetv2_yolov3  min = 1841.86  max = 1844.70  avg = 1843.49\n         yolov4-tiny  min = 2099.86  max = 2104.18  avg = 2102.34\n           nanodet_m  min =  646.19  max =  647.41  avg =  646.69\n    yolo-fastest-1.1  min =  322.08  max =  323.71  avg =  323.22\n      yolo-fastestv2  min =  209.42  max =  209.72  avg =  209.56\n  vision_transformer  min = 26499.86  max = 26548.73  avg = 26528.54\n          FastestDet  min =  251.68  max =  252.52  avg =  252.14\n```\n\n### Arduino UNO Q - QRB2210 (ARM Cortex-A53 @ 2.0GHz x 4)\n```\narduino@noivis-uno-q:~/ncnn/benchmark$ ../build/benchmark/benchncnn 10 4 0 -1 -1\nloop_count = 10\nnum_threads = 4\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   35.57  max =  111.57  avg =   43.99\n     squeezenet_int8  min =   31.61  max =   32.34  avg =   31.91\n           mobilenet  min =   47.82  max =  133.12  avg =   56.77\n      
mobilenet_int8  min =   33.96  max =  102.49  avg =   44.91\n        mobilenet_v2  min =   42.62  max =  119.38  avg =   51.88\n        mobilenet_v3  min =   34.53  max =   35.91  avg =   35.27\n          shufflenet  min =   26.18  max =   26.47  avg =   26.32\n       shufflenet_v2  min =   22.02  max =   88.82  avg =   30.98\n             mnasnet  min =   38.96  max =   92.30  avg =   50.98\n     proxylessnasnet  min =   47.04  max =  137.34  avg =   56.91\n     efficientnet_b0  min =   58.75  max =  141.67  avg =   76.36\n   efficientnetv2_b0  min =   79.72  max =  175.06  avg =   99.54\n        regnety_400m  min =   65.97  max =  184.19  avg =   96.94\n           blazeface  min =    6.43  max =    7.84  avg =    6.76\n           googlenet  min =  105.37  max =  197.46  avg =  130.49\n      googlenet_int8  min =   89.68  max =  179.01  avg =  107.28\n            resnet18  min =   86.52  max =  166.67  avg =  102.49\n       resnet18_int8  min =   57.96  max =  107.52  avg =   66.63\n             alexnet  min =   56.77  max =  127.20  avg =   67.50\n               vgg16  min =  463.45  max =  557.00  avg =  511.24\n          vgg16_int8  min =  323.15  max =  415.10  avg =  367.00\n            resnet50  min =  219.89  max =  298.83  avg =  250.55\n       resnet50_int8  min =  177.14  max =  261.74  avg =  208.69\n      squeezenet_ssd  min =   96.95  max =  195.33  avg =  123.10\n squeezenet_ssd_int8  min =   79.66  max =  179.98  avg =   97.71\n       mobilenet_ssd  min =  100.40  max =  191.42  avg =  119.07\n  mobilenet_ssd_int8  min =   71.88  max =  173.69  avg =   92.27\n      mobilenet_yolo  min =  216.49  max =  301.24  avg =  248.78\n  mobilenetv2_yolov3  min =  154.69  max =  245.76  avg =  179.31\n         yolov4-tiny  min =  191.17  max =  261.76  avg =  218.64\n           nanodet_m  min =   57.66  max =  113.14  avg =   67.66\n    yolo-fastest-1.1  min =   34.72  max =  131.85  avg =   49.81\n      yolo-fastestv2  min =   26.91  max =   28.23  avg =   
27.46\n  vision_transformer  min = 2529.77  max = 2703.20  avg = 2601.17\n          FastestDet  min =   28.09  max =   29.11  avg =   28.48\n\narduino@noivis-uno-q:~/ncnn/benchmark$ ../build/benchmark/benchncnn 10 1 0 -1 -1\nloop_count = 10\nnum_threads = 1\npowersave = 0\ngpu_device = -1\ncooling_down = 1\n          squeezenet  min =   94.15  max =  111.95  avg =   99.15\n     squeezenet_int8  min =   78.23  max =   86.76  avg =   80.23\n           mobilenet  min =  146.45  max =  165.20  avg =  153.61\n      mobilenet_int8  min =  123.70  max =  133.75  avg =  126.28\n        mobilenet_v2  min =   99.85  max =  108.01  avg =  103.90\n        mobilenet_v3  min =   93.31  max =  102.90  avg =   96.41\n          shufflenet  min =   61.80  max =   79.39  avg =   65.28\n       shufflenet_v2  min =   47.57  max =   56.28  avg =   49.89\n             mnasnet  min =  106.41  max =  119.18  avg =  109.83\n     proxylessnasnet  min =  143.93  max =  164.33  avg =  151.37\n     efficientnet_b0  min =  164.14  max =  173.38  avg =  167.91\n   efficientnetv2_b0  min =  206.05  max =  225.26  avg =  211.93\n        regnety_400m  min =  133.84  max =  144.94  avg =  137.26\n           blazeface  min =   13.90  max =   14.97  avg =   14.25\n           googlenet  min =  337.11  max =  364.05  avg =  347.30\n      googlenet_int8  min =  281.64  max =  293.46  avg =  288.34\n            resnet18  min =  276.23  max =  304.36  avg =  289.94\n       resnet18_int8  min =  190.11  max =  217.07  avg =  199.87\n             alexnet  min =  196.14  max =  203.26  avg =  198.63\n               vgg16  min = 1391.13  max = 1626.54  avg = 1502.86\n          vgg16_int8  min = 1128.65  max = 1290.60  avg = 1200.60\n            resnet50  min =  739.44  max =  774.68  avg =  750.76\n       resnet50_int8  min =  591.32  max =  612.44  avg =  603.38\n      squeezenet_ssd  min =  245.57  max =  280.32  avg =  262.18\n squeezenet_ssd_int8  min =  182.86  max =  228.61  avg =  199.68\n       
mobilenet_ssd  min =  308.26  max =  320.81  avg =  314.58\n  mobilenet_ssd_int8  min =  246.33  max =  265.22  avg =  253.05\n      mobilenet_yolo  min =  682.76  max =  703.99  avg =  696.30\n  mobilenetv2_yolov3  min =  346.53  max =  365.76  avg =  355.41\n         yolov4-tiny  min =  527.86  max =  558.38  avg =  542.25\n           nanodet_m  min =  135.87  max =  153.99  avg =  145.11\n    yolo-fastest-1.1  min =   58.92  max =   76.24  avg =   65.08\n      yolo-fastestv2  min =   48.54  max =   59.97  avg =   53.21\n  vision_transformer  min = 9218.64  max = 10723.27  avg = 10253.49\n          FastestDet  min =   51.52  max =   62.65  avg =   55.04\n```\n\n"
  },
  {
    "path": "benchmark/RankCards/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.10)\n\nproject(RankCards CXX)\n\nset(CMAKE_CXX_STANDARD 11)\nset(CMAKE_CXX_STANDARD_REQUIRED ON)\nset(EXECUTABLE_OUTPUT_PATH \"../\")\n\nadd_executable(RankCards main.cpp)\n"
  },
  {
    "path": "benchmark/RankCards/README.md",
    "content": "### Rank the boards.\nThe table below is generated by RankCards, using the timings found in the /ncnn/benchmark/README.md file.<br>\nFirst, the best set of timings is selected from each board.<br>\nThe set is then compared to a reference set by calculating the ratio of each model one by one and averaging all results.<br>\nFinally, the boards are ranked from fast to slow.<br>\n|      | Board | Ratio | \n| :--: | :---- | :---  | \n| 1 | NVIDIA Quadro RTX 8000 (TU102 SM x 72 + Tensor Core x 576) | 0.147 | \n| 2 | nVIDIA RTX2080 of Desktop | 0.15 | \n| 3 | NVIDIA GeForce RTX 3060 Ti of Desktop[2023-10-12] | 0.18 | \n| 4 | nVIDIA RTX2060 of Notebook | 0.198 | \n| 5 | Intel® Core™ i7-13700K of Desktop[2023-10-12] | 0.255 | \n| 6 | AMD Radeon RX 6900 XT of Desktop[2023-10-12] | 0.275 | \n| 7 | NVIDIA RTX3090 (GA102 SM x 82 + Tensor Core 328) | 0.277 | \n| 8 | MediaTek Dimensity 9300 (MT6989) (Cortex-X4 3.25 GHz + 2.85 GHz x 3 + Cortex-A720 2.0 GHz x 4 + Mali-G720-Immortalis MC12) | 0.309 | \n| 9 | MacBook Pro (13-inch, M1, 2020) | 0.346 | \n| 10 | AWS c5.4xlarge Instance | 0.418 | \n| 11 | AMD Ryzen 9 5950X 16-Core of Desktop[2023-10-12] | 0.427 | \n| 12 | Qualcomm SM8550-AB Snapdragon 8 Gen 2 (Kyro 3.20 GHz + 2.8 GHz x 2 + 2.80 GHz x 2 + 2.00 GHz * 3 + Adreno 740) | 0.45 | \n| 13 | AMD Ryzen 5700g (Zen3 3.8 GHz ~ 4.6 GHz x 8) | 0.478 | \n| 14 | HUAWEI KunPeng 920 3211K (x24 cores) | 0.482 | \n| 15 | NVIDIA Jetson AGX Orin (Cortex-A78AE 2.2 GHz x 12 + Ampere@1.3 GHz Tensor Cores 64) | 0.485 | \n| 16 | HUAWEI KunPeng 920 2251K (x8 cores) | 0.54 | \n| 17 | nVIDIA RTX A3000 of Notebook (6GB) | 0.577 | \n| 18 | Intel(R) UHD Graphics 770 of Desktop[2023-10-12] | 0.593 | \n| 19 | OrangePi5, Rockchip RK3588s (Quad Core A76 2.4GHz + Quad Core A55 1.8GHz) | 0.642 | \n| 20 | Qualcomm SM8150-AC Snapdragon 855+ (Kyro485 2.96 GHz + 2.42 GHz x 3 + 1.80 GHz x 4 + Adreno 640) | 0.665 | \n| 21 | Rockchip RK3588 (Quad Core A76 2.4GHz + Quad Core A55 1.8GHz) | 0.753 | \n| 22 
| NVIDIA Jetson Orin Nano | 0.819 | \n| 23 | Raspberry Pi 5 Broadcom BCM2712, Cortex-A76 (ARMv8) (2.4GHz x 4) | 1 | \n| 24 | Station-M3/ROC-RK3588S-PC, Rockchip RK3588S (Quad Core A76 2.4GHz + Quad Core A55 1.8GHz + Mali-G610) StationOS (Android) | 1 | \n| 25 | NVIDIA Jetson AGX Xavier (Carmel 2.2 GHz x 8 + Volta Tensor Cores 64) | 1.05 | \n| 26 | Loongson 3A6000 (LA664 2.5GHz * 4+4) | 1.11 | \n| 27 | Hyper-V Linux Guest with GPU-PV enabled (Intel Core i7-11800H, NVIDIA GeForce RTX 3070 Laptop GPU) | 1.19 | \n| 28 | Rockchip RK3588 (Cortex-A76 2.4GHz x 4 + Cortex-A55 1.8GHz x 4) | 1.35 | \n| 29 | NVIDIA Jetson TX2 NX(NV-Denver2 2.0Ghz x 2 +  Cortex-A57 2.0Ghz x 4 + 256-core NVIDIA Pascal iGPU) | 1.59 | \n| 30 | Hyper-V Linux Guest with GPU-PV enabled (Intel Core i7-7700K, NVIDIA GeForce GTX 1050 Ti) | 1.66 | \n| 31 | Phytium FT-2000+/64 (FTC662 armv8 2.4GHz x 8) | 1.75 | \n| 32 | AMD Ryzen Threadripper 3970X (Zen2 3.7 GHz ~ 4.5 GHz x 32) | 2.19 | \n| 33 | AMD Ryzen Embedded V1605B (Zen 2.0 GHz ~ 3.6 GHz x 4 + Radeon Vega 8 1.1GHz 8CU) | 2.23 | \n| 34 | Avaota Aim T527, Allwinner T527 (Cortex-A55 2.2GHz x 4 + Cortex-A55 1.8GHz x 4) | 2.28 | \n| 35 | Loongson 3A5000 (LA464 2.5GHz * 4) | 2.31 | \n| 36 | Qualcomm MSM8996 Pro Snapdragon 821 (Kyro 2.35GHz x 2 + Kyro 2.19GHz x 2) | 2.37 | \n| 37 | NVIDIA Jetson Nano | 2.44 | \n| 38 | Intel Celeron N5105 | 2.8 | \n| 39 | Loongson 3A4000 (GS464V 1.8GHz * 4 with MSA128) | 3.24 | \n| 40 | Khadas VIM3, Amlogic A311D (Cortex-A73 2.2GHz x 4 + Cortex-A53 1.8GHz x 2) | 3.48 | \n| 41 | Kirin 970 (Cortex-A73 2.4GHz x 4 + Cortex-A53 1.8GHz x 4) | 3.58 | \n| 42 | Qualcomm MSM8998 Snapdragon 835 (Kyro 2.45GHz x 4 + Kyro 1.9GHz x 4 + Adreno 540) | 3.63 | \n| 43 | MacBook Pro (15-inch, 2019) - 2.6GHz six cores Intel Core i7 && Radeon Pro 555X 4GB && Intel UHD Graphics 630 1536MB | 3.75 | \n| 44 | Qualcomm MSM6150 Snapdragon 675 (Kyro460 2.0GHz x 2 + Kyro460 1.7GHz x 6 + Adreno 612) | 3.75 | \n| 45 | Qualcomm MSM8994 Snapdragon 810 
(Cortex-A57 2.0GHz x 4 + Cortex-A53 1.55GHz x 4) | 3.82 | \n| 46 | Station P2, Rockchip RK3568 (Cortex-A55 2.0GHz x 4) | 3.85 | \n| 47 | Rock3A, Rockchip RK3568 (Cortex-A55 2.0GHz x 4) ubuntu 20.04 | 3.86 | \n| 48 | Loongson 3A4000 (GS464V 1.8GHz * 4 with MSA128) | 4.08 | \n| 49 | Radxa Zero 3W, Cortex-A55 (ARMv82) (1.416 GHz x 4) | 4.5 | \n| 50 | Raspberry Pi 4 Model B Broadcom BCM2711B0, Cortex-A72 (ARMv8) (1.8GHz x 4) | 4.95 | \n| 51 | OrangePi4 LTS, Rockchip RK3399 (Cortex-A72 1.8GHz x 2 + Cortex-A53 1.5GHz x 4) | 5.11 | \n| 52 | Rockchip RK3399 (Cortex-A72 1.8GHz x 2 + Cortex-A53 1.5GHz x 4) | 5.16 | \n| 53 | PhytiumPi, Phytium E2000 (FTC664@1.8GHz x2 + FTC310@1.5GHz x2) | 5.16 | \n| 54 | Qualcomm SDM660 Snapdragon 660 (Kyro260 2.2GHz x 4 + Kyro260 1.84GHz x 4 + Adreno 512) | 5.26 | \n| 55 | Phytium FT-2000/4 (FTC663 armv8 2.2GHz x 4) | 5.27 | \n| 56 | RDK X3 Module (Cortex-A53 1.5GHz x 4) aarch64 | 5.88 | \n| 57 | Station-M2/ROC-RK3566-PC, Rockchip RK3566 (Cortex-A55 1.8GHz x 4 + Mali-G52) StationOS (Android) | 6.51 | \n| 58 | Rockchip RK3288-CG.W (Cortex-A17 1.8GHz x 4) | 6.66 | \n| 59 | Qualcomm MSM8916 Snapdragon 410 (Cortex-A53 1.2GHz x 4) | 7.63 | \n| 60 | NanoPi R2S, Rockchip RK3328 (Cortex-A53 1.3GHz x 4) Armbian focal (21.05.1) aarch64 | 7.66 | \n| 61 | Intel Atom x5-Z8350 | 7.74 | \n| 62 | Loongson 2K2000 (LA364 1.5GHz * 2 with lsx) | 8.23 | \n| 63 | EAIDK 310, Rockchip RK3228H (Cortex-A53 1.3GHz x 4) fedora-28 aarch64 | 8.34 | \n| 64 | OrangePi Zero 2, Allwinner H616 (Cortex-A53 1.5GHz x 4) | 9.51 | \n| 65 | Raspberry Pi 3 Model B+ Broadcom BCM2837B0, Cortex-A53 (ARMv8) (1.4GHz x 4) | 9.87 | \n| 66 | iPhone 5S (Apple A7 1.3GHz x 2) | 11 | \n| 67 | MYIR RemiPi,Renesas RZG2L(Cortex-A55 1.5GHz x 2) | 11.9 | \n| 68 | Raspberry Pi 5 Broadcom BCM2712, VideoCore VII Graphics (Vulkan 1.2) | 12.5 | \n| 69 | Raspberry Pi Zero 2 W Broadcom BCM2710A1, Cortex-A53 (ARMv8) (1.0GHz x 4) | 13.7 | \n| 70 | Xeon Phi 3120A (1.10 GHz 57-core 228-thread) | 15.1 | \n| 71 
| Loongson 3A3000 (GS464E 1.45GHz * 4) | 16.3 | \n| 72 | AXERA AX620A (Cortex-A7 1.0GHz * 4) | 18.8 | \n| 73 | Loongson 2K1000LA (LA264 1.0GHz * 2) | 24.4 | \n| 74 | Loongson 2K1000 (GS264 1.0GHz x 2) | 24.8 | \n| 75 | Freescale i.MX7 Dual (Cortex A7 1.0GHz x 2) | 26.7 | \n| 76 | Banana Pi M2 Zero 2 AllWinner H2+, Cortex-A7 (ARMv7-A) (1.2GHz x 4) | 26.8 | \n| 77 | HiSilicon Hi3519V101 (Cortex-A17 1.2GHz x 1) | 36.2 | \n| 78 | Sunway SW831 (sw_64 2.5GHz * 8) | 40.7 | \n| 79 | Z7-Lite 7020 XC7Z020CLG400-2 (Cortex-A9 766MHz x 2) | 43.2 | \n| 80 | Intel Celeron M 420 (Yonah 1.60 GHz x 1) | 43.9 | \n| 81 | Amlogic S805 (Cortex-A5, 4 × 1.536GHz) | 45.9 | \n| 82 | VisionFive2 , JH7110 (SiFive-U74(RV64GC) 1.5GHz x 4) riscv64 with PowerVR B-Series BXE-4-32 | 72.4 | \n| 83 | T-Head TH1520 (C910V, 1.848 GHz x 4 + BXM-4-64 PowerVR) | 83.3 | \n| 84 | Sunway SW421 (sw_64 1.7GHz * 4) | 116 | \n| 85 | Ingenic T40XP Xburst2 Core X2 1.4Ghz (without MSA) | 165 | \n"
  },
  {
    "path": "benchmark/RankCards/Rcards.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n#ifndef RCARDS_H\n#define RCARDS_H\n\n#include <cstdint>\n#include <cmath>\n#include <deque>\n#include <list>\n#include <array>\n#include <memory>\n#include <iostream>\n#include <iomanip>\n#include <stdio.h>\n#include <string.h>\n#include <istream>\n#include <fstream>\n#include <sstream>\n#include <algorithm>\n#include <sys/types.h>\n#include <sys/stat.h>\n#include <unistd.h>\n#include <chrono>\n#include <thread>\n\n//---------------------------------------------------------------------------\n// Global hardcoded parameters\n//---------------------------------------------------------------------------\n// LERP(a,b,c) = linear interpolation macro, is 'a' when c == 0.0 and 'b' when c == 1.0 */\n#define MIN(a, b)                 ((a) > (b) ? (b) : (a))\n#define MAX(a, b)                 ((a) < (b) ? (b) : (a))\n#define LIM(a, b, c)              (((a) > (c)) ? (c) : ((a) < (b)) ? (b) : (a))\n#define LERP(a, b, c)             (((b) - (a)) * (c) + (a))\n#define ROUND(a)                  (static_cast<int>((a) + 0.5))\n#define EUCLIDEAN(x1, y1, x2, y2) sqrt(((x1) - (x2)) * ((x1) - (x2)) + ((y1) - (y2)) * ((y1) - (y2)))\n//---------------------------------------------------------------------------\nstruct TModel\n{\n    std::string Name;\n    float AvrTime{0.0};\n};\n//---------------------------------------------------------------------------\nstruct TModelSet\n{\n    std::vector<TModel> Mset;\n\n    //use push_back to prevent <brace-enclosed initializer list> issues with CMake\n    inline TModelSet(void)\n    {\n        TModel model;\n        model.Name = \"squeezenet\";\n        Mset.push_back(model);\n        model.Name = \"squeezenet_int8\";\n        Mset.push_back(model);\n        model.Name = \"mobilenet\";\n        Mset.push_back(model);\n        model.Name = \"mobilenet_int8\";\n        Mset.push_back(model);\n        model.Name = \"mobilenet_v2\";\n        Mset.push_back(model);\n 
       model.Name = \"mobilenet_v3\";\n        Mset.push_back(model);\n        model.Name = \"shufflenet\";\n        Mset.push_back(model);\n        model.Name = \"shufflenet_v2\";\n        Mset.push_back(model);\n        model.Name = \"mnasnet\";\n        Mset.push_back(model);\n        model.Name = \"proxylessnasnet\";\n        Mset.push_back(model);\n        model.Name = \"efficientnet_b0\";\n        Mset.push_back(model);\n        model.Name = \"efficientnetv2_b0\";\n        Mset.push_back(model);\n        model.Name = \"regnety_400m\";\n        Mset.push_back(model);\n        model.Name = \"blazeface\";\n        Mset.push_back(model);\n        model.Name = \"googlenet\";\n        Mset.push_back(model);\n        model.Name = \"googlenet_int8\";\n        Mset.push_back(model);\n        model.Name = \"resnet18\";\n        Mset.push_back(model);\n        model.Name = \"resnet18_int8\";\n        Mset.push_back(model);\n        model.Name = \"alexnet\";\n        Mset.push_back(model);\n        model.Name = \"vgg16\";\n        Mset.push_back(model);\n        model.Name = \"vgg16_int8\";\n        Mset.push_back(model);\n        model.Name = \"resnet50\";\n        Mset.push_back(model);\n        model.Name = \"resnet50_int8\";\n        Mset.push_back(model);\n        model.Name = \"squeezenet_ssd\";\n        Mset.push_back(model);\n        model.Name = \"squeezenet_ssd_int8\";\n        Mset.push_back(model);\n        model.Name = \"mobilenet_ssd\";\n        Mset.push_back(model);\n        model.Name = \"mobilenet_ssd_int8\";\n        Mset.push_back(model);\n        model.Name = \"mobilenet_yolo\";\n        Mset.push_back(model);\n        model.Name = \"mobilenetv2_yolov3\";\n        Mset.push_back(model);\n        model.Name = \"yolov4-tiny\";\n        Mset.push_back(model);\n        model.Name = \"nanodet_m\";\n        Mset.push_back(model);\n        model.Name = \"yolo-fastest-1.1\";\n        Mset.push_back(model);\n        model.Name = \"yolo-fastestv2\";\n        
Mset.push_back(model);\n        model.Name = \"vision_transformer\";\n        Mset.push_back(model);\n        model.Name = \"FastestDet\";\n        Mset.push_back(model);\n    }\n\n    void Store(const TModel& model)\n    {\n        for (size_t i = 0; i < Mset.size(); i++)\n        {\n            if (Mset[i].Name == model.Name)\n            {\n                Mset[i].AvrTime = model.AvrTime;\n                break;\n            }\n        }\n    }\n\n    float Sum(void)\n    {\n        float t = 0;\n\n        for (size_t i = 0; i < Mset.size(); i++) t += Mset[i].AvrTime;\n\n        return t;\n    }\n\n    float Ratio(const TModelSet& Rset)\n    {\n        float w;\n        float s = 0;\n        float t = 0;\n\n        for (size_t r = 0; r < Rset.Mset.size(); r++)\n        {\n            if (Rset.Mset[r].AvrTime > 0.0)\n            {\n                for (size_t i = 0; i < Mset.size(); i++)\n                {\n                    if (Mset[i].AvrTime > 0.0)\n                    {\n                        if (Mset[i].Name == Rset.Mset[r].Name)\n                        {\n                            w = log(Rset.Mset[r].AvrTime);\n                            s += w * (Mset[i].AvrTime / Rset.Mset[r].AvrTime);\n                            t += w;\n                        }\n                    }\n                }\n            }\n        }\n        if (t > 0) s /= t;\n        return s;\n    }\n};\n//---------------------------------------------------------------------------\nstruct TBoard\n{\n    std::string Name;\n    size_t StartLine;\n    size_t EndLine;\n    std::vector<TModelSet> BenchSet;\n    int BestSet;\n    float Ratio;\n};\n//---------------------------------------------------------------------------\ninline bool FileExists(const std::string& name)\n{\n    struct stat buffer;\n    return (stat(name.c_str(), &buffer) == 0);\n}\n//---------------------------------------------------------------------------\ninline void FileCopy(const std::string& Src, const 
std::string& Dst)\n{\n    std::ifstream src(Src, std::ios::binary);\n    std::ofstream dst(Dst, std::ios::binary);\n\n    dst << src.rdbuf();\n}\n//---------------------------------------------------------------------------\n// to lower case\nstatic inline void lcase(std::string& s)\n{\n    std::transform(s.begin(), s.end(), s.begin(),\n    [](unsigned char c) {\n        return std::tolower(c);\n    });\n}\n//---------------------------------------------------------------------------\n// to lower case (copying)\nstatic inline std::string lcase_copy(std::string s)\n{\n    lcase(s);\n    return s;\n}\n//---------------------------------------------------------------------------\n// trim from start (in place)\n// note: std::isspace needs a value representable as unsigned char, hence the lambda parameter type\nstatic inline void ltrim(std::string& s)\n{\n    s.erase(s.begin(), std::find_if(s.begin(), s.end(), [](unsigned char ch) {\n        return !std::isspace(ch);\n    }));\n}\n//---------------------------------------------------------------------------\n// trim from end (in place)\nstatic inline void rtrim(std::string& s)\n{\n    s.erase(std::find_if(s.rbegin(), s.rend(), [](unsigned char ch) {\n        return !std::isspace(ch);\n    }).base(),\n    s.end());\n}\n//---------------------------------------------------------------------------\n// trim from both ends (in place)\nstatic inline void trim(std::string& s)\n{\n    ltrim(s);\n    rtrim(s);\n}\n//---------------------------------------------------------------------------\n// trim from start (copying)\nstatic inline std::string ltrim_copy(std::string s)\n{\n    ltrim(s);\n    return s;\n}\n//---------------------------------------------------------------------------\n// trim from end (copying)\nstatic inline std::string rtrim_copy(std::string s)\n{\n    rtrim(s);\n    return s;\n}\n//---------------------------------------------------------------------------\n// trim from both ends (copying)\nstatic inline std::string trim_copy(std::string s)\n{\n    trim(s);\n    return 
s;\n}\n//---------------------------------------------------------------------------\n// Parse one benchmark line into a model name and its average time.\n// Lines without timings give an empty name and a time of 0.0.\nstatic inline void GetNameAver(const std::string& line, TModel& model)\n{\n    // line example: squeezenet  min =   46.28  max =   46.91  avg =   46.65\n\n    size_t p = line.find(\"min =\");\n\n    if (p != std::string::npos)\n    {\n        model.Name = trim_copy(line.substr(0, p));\n        p = line.find(\"avg =\");\n        if (p != std::string::npos)\n        {\n            try\n            {\n                model.AvrTime = std::stof(trim_copy(line.substr(p + 5, line.length() - p - 5)));\n            }\n            catch (...)\n            {\n                // unparsable number: do not keep the time of the previous line\n                model.AvrTime = 0.0;\n            }\n        }\n        else\n            model.AvrTime = 0.0;\n    }\n    else\n    {\n        model.Name = \"\";\n        model.AvrTime = 0.0;\n    }\n}\n//---------------------------------------------------------------------------\n#endif // RCARDS_H\n"
  },
  {
    "path": "benchmark/RankCards/main.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include <iostream>\n#include <iostream>\n#include <fstream>\n#include <vector>\n#include <string>\n#include <cfloat>\n#include \"Rcards.h\"\n//---------------------------------------------------------------------------\nusing namespace std;\n//---------------------------------------------------------------------------\n#define REF_BOARD \"Raspberry Pi 5 Broadcom BCM2712, Cortex-A76 (ARMv8)\"\n//---------------------------------------------------------------------------\n// Define a custom comparator function for sorting based on Ratio\nbool compareByRatio(const TBoard& a, const TBoard& b)\n{\n    return a.Ratio < b.Ratio;\n}\n//---------------------------------------------------------------------------\nint main(int argc, char** argv)\n{\n    size_t i, t, n, r;\n    int RefBoard;\n    float f, x;\n    string Line;\n    TModel Model;\n    vector<string> Lines;  // Vector to store strings\n    vector<TBoard> Boards; // Vector to store boards\n    ifstream inputFile;\n\n    // Check existence of the ../README.md file\n    inputFile.open(\"../README.md\");\n    if (!inputFile.is_open())\n    {\n        if (argc != 2)\n        {\n            fprintf(stderr, \"Usage: ./RankCards <your README.md> \\n\");\n            return -1;\n        }\n        const char* imagepath = argv[1];\n        // Open the file given as argument\n        inputFile.open(imagepath);\n        // Check if the file is open\n        if (!inputFile.is_open())\n        {\n            cerr << \"Error opening file\" << endl;\n            return 1; // Return an error code\n        }\n    }\n\n    // Read each Line from the file and add it to the vector\n    while (std::getline(inputFile, Line))\n    {\n        Lines.push_back(Line);\n    }\n    // Close the file\n    inputFile.close();\n\n    // Get the boards.\n    for (i = 0; i < Lines.size(); i++)\n    {\n        TBoard Brd;\n        if (Lines[i].find(\"###\") != 
string::npos && Lines[i].length() > 4)\n        {\n            Brd.Name = Lines[i].substr(4, Lines[i].length() - 4);\n            Brd.StartLine = i + 1;\n            Boards.push_back(Brd);\n        }\n    }\n    // Without any board, Boards.size() - 1 below would wrap around\n    if (Boards.empty())\n    {\n        cerr << \"Error: no boards found in the README\" << endl;\n        return 1; // Return an error code\n    }\n    // Get the boards end Line.\n    for (t = 0; t < Boards.size() - 1; t++)\n    {\n        Boards[t].EndLine = Boards[t + 1].StartLine;\n    }\n    Boards[t].EndLine = Lines.size();\n\n    // Get the bench sets (must always start with squeezenet)\n    for (t = 0; t < Boards.size(); t++)\n    {\n        TModelSet MdSet;\n        bool FirstSet = true;\n        for (n = Boards[t].StartLine; n < Boards[t].EndLine; n++)\n        {\n            GetNameAver(Lines[n], Model);\n\n            if (Model.Name == \"squeezenet\")\n            {\n                //start of new set, check if it is the first set\n                if (FirstSet)\n                    FirstSet = false;\n                else\n                    Boards[t].BenchSet.push_back(MdSet); //save the finished set before the new squeezenet time overwrites it\n            }\n            MdSet.Store(Model);\n        }\n        Boards[t].BenchSet.push_back(MdSet);\n    }\n\n    // Get the total AvrTime of the bench sets and set the lowest as best set\n    for (t = 0; t < Boards.size(); t++)\n    {\n        x = FLT_MAX;\n        for (n = 0; n < Boards[t].BenchSet.size(); n++)\n        {\n            f = Boards[t].BenchSet[n].Sum();\n            if (f < x)\n            {\n                x = f;\n                Boards[t].BestSet = n;\n            }\n        }\n    }\n\n    // Get the reference set\n    RefBoard = -1;\n    for (t = 0; t < Boards.size(); t++)\n    {\n        if (Boards[t].Name.find(REF_BOARD) != string::npos)\n        {\n            RefBoard = static_cast<int>(t);\n        }\n    }\n    if (RefBoard == -1)\n    {\n        cerr << \"Error finding reference board :\" << endl;\n        cerr << REF_BOARD << endl;\n        return 1; // Return an error code\n    }\n\n    // Get the ratios between the best bench sets and reference\n    r = Boards[RefBoard].BestSet;\n    for (t = 0; t < 
Boards.size(); t++)\n    {\n        n = Boards[t].BestSet;\n        Boards[t].Ratio = Boards[t].BenchSet[n].Ratio(Boards[RefBoard].BenchSet[r]);\n    }\n\n    // Sort the vector using the custom comparator\n    std::sort(Boards.begin(), Boards.end(), compareByRatio);\n\n    // Open an output README.md file\n    std::ofstream outputFile(\"README.md\");\n\n    // Check if the file is successfully opened\n    if (outputFile.is_open())\n    {\n        outputFile << \"### Rank the boards.\" << endl;\n        outputFile << \"The table below is generated by RankCards, using the timings found in the /ncnn/benchmark/README.md file.<br>\" << endl;\n        outputFile << \"First, the best set of timings is selected from each board.<br>\" << endl;\n        outputFile << \"The set is then compared to a reference set by calculating the ratio of each model one by one and averaging all results.<br>\" << endl;\n        outputFile << \"Finally, the boards are ranked from fast to slow.<br>\" << endl;\n        outputFile << \"|      | Board | Ratio | \" << endl;\n        outputFile << \"| :--: | :---- | :---  | \" << endl;\n        // Write the sorted vector to the file\n        for (t = 0; t < Boards.size(); t++)\n        {\n            outputFile << \"| \" << t + 1 << \" | \" << Boards[t].Name << \" | \" << setprecision(3) << Boards[t].Ratio << \" | \" << endl;\n        }\n        // Close the file stream\n        outputFile.close();\n        cout << \"Sorted data has been written to README.md\" << endl;\n    }\n    else\n    {\n        cerr << \"Error opening the file.\" << endl;\n        return 1; // Return an error code\n    }\n\n    return 0; // Return success\n}\n//---------------------------------------------------------------------------\n"
  },
  {
    "path": "benchmark/alexnet.param",
    "content": "7767517\n15 15\nInput                    data                     0 1 data -23330=4,3,227,227,3 0=227 1=227 2=3\nConvolution              conv1                    1 1 data conv1_relu1 -23330=4,3,55,55,96 0=96 1=11 3=4 5=1 6=34848 9=1\nLRN                      norm1                    1 1 conv1_relu1 norm1 -23330=4,3,55,55,96 2=1.000000e-04\nPooling                  pool1                    1 1 norm1 pool1 -23330=4,3,27,27,96 1=3 2=2\nConvolutionDepthWise     conv2                    1 1 pool1 conv2_relu2 -23330=4,3,27,27,256 0=256 1=5 4=2 5=1 6=307200 7=2 9=1\nLRN                      norm2                    1 1 conv2_relu2 norm2 -23330=4,3,27,27,256 2=1.000000e-04\nPooling                  pool2                    1 1 norm2 pool2 -23330=4,3,13,13,256 1=3 2=2\nConvolution              conv3                    1 1 pool2 conv3_relu3 -23330=4,3,13,13,384 0=384 1=3 4=1 5=1 6=884736 9=1\nConvolutionDepthWise     conv4                    1 1 conv3_relu3 conv4_relu4 -23330=4,3,13,13,384 0=384 1=3 4=1 5=1 6=663552 7=2 9=1\nConvolutionDepthWise     conv5                    1 1 conv4_relu4 conv5_relu5 -23330=4,3,13,13,256 0=256 1=3 4=1 5=1 6=442368 7=2 9=1\nPooling                  pool5                    1 1 conv5_relu5 pool5 -23330=4,3,6,6,256 1=3 2=2\nInnerProduct             fc6                      1 1 pool5 fc6_drop6 -23330=4,1,4096,1,1 0=4096 1=1 2=37748736 9=1\nInnerProduct             fc7                      1 1 fc6_drop6 fc7_drop7 -23330=4,1,4096,1,1 0=4096 1=1 2=16777216 9=1\nInnerProduct             fc8                      1 1 fc7_drop7 fc8 -23330=4,1,1000,1,1 0=1000 1=1 2=4096000\nSoftmax                  prob                     1 1 fc8 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/benchncnn.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include <float.h>\n#include <stdio.h>\n#include <string.h>\n\n#ifdef __EMSCRIPTEN__\n#include <emscripten.h>\n#endif\n\n#include \"benchmark.h\"\n#include \"cpu.h\"\n#include \"datareader.h\"\n#include \"net.h\"\n#include \"gpu.h\"\n\n#include \"benchncnn_param_data.h\"\n\n#ifndef NCNN_SIMPLESTL\n#include <vector>\n#endif\n\nclass DataReaderFromEmpty : public ncnn::DataReader\n{\npublic:\n    virtual int scan(const char* format, void* p) const\n    {\n        return 0;\n    }\n    virtual size_t read(void* buf, size_t size) const\n    {\n        memset(buf, 0, size);\n        return size;\n    }\n};\n\nstatic int g_warmup_loop_count = 8;\nstatic int g_loop_count = 4;\nstatic bool g_enable_cooling_down = true;\n\nstatic ncnn::UnlockedPoolAllocator g_blob_pool_allocator;\nstatic ncnn::PoolAllocator g_workspace_pool_allocator;\n\n#if NCNN_VULKAN\nstatic ncnn::VulkanDevice* g_vkdev = 0;\nstatic ncnn::VkAllocator* g_blob_vkallocator = 0;\nstatic ncnn::VkAllocator* g_staging_vkallocator = 0;\n#endif // NCNN_VULKAN\n\nvoid benchmark(const char* comment, const std::vector<ncnn::Mat>& _in, const ncnn::Option& opt, const char* model_param_data = NULL)\n{\n    // Skip if int8 model name and using GPU\n    if (opt.use_vulkan_compute && strstr(comment, \"int8\") != NULL)\n    {\n        if (!model_param_data)\n            fprintf(stderr, \"%20s  skipped (int8+GPU not supported)\\n\", comment);\n        return;\n    }\n\n    g_blob_pool_allocator.clear();\n    g_workspace_pool_allocator.clear();\n\n#if NCNN_VULKAN\n    if (opt.use_vulkan_compute)\n    {\n        g_blob_vkallocator->clear();\n        g_staging_vkallocator->clear();\n    }\n#endif // NCNN_VULKAN\n\n    ncnn::Net net;\n\n    net.opt = opt;\n\n#if NCNN_VULKAN\n    if (net.opt.use_vulkan_compute)\n    {\n        net.set_vulkan_device(g_vkdev);\n    }\n#endif // NCNN_VULKAN\n\n    if (model_param_data)\n    {\n        
net.load_param_mem(model_param_data);\n    }\n    else\n    {\n        net.load_param(comment);\n    }\n\n    DataReaderFromEmpty dr;\n    net.load_model(dr);\n\n    const std::vector<const char*>& input_names = net.input_names();\n    const std::vector<const char*>& output_names = net.output_names();\n\n    if (g_enable_cooling_down)\n    {\n        // sleep 10 seconds for cooling down SOC  :(\n        ncnn::sleep(10 * 1000);\n    }\n\n    if (input_names.size() > _in.size())\n    {\n        fprintf(stderr, \"input %zu tensors while model has %zu inputs\\n\", _in.size(), input_names.size());\n        return;\n    }\n\n    // initialize input\n    for (size_t j = 0; j < input_names.size(); ++j)\n    {\n        ncnn::Mat in = _in[j];\n        in.fill(0.01f);\n    }\n\n    // warm up\n    for (int i = 0; i < g_warmup_loop_count; i++)\n    {\n        ncnn::Extractor ex = net.create_extractor();\n        for (size_t j = 0; j < input_names.size(); ++j)\n        {\n            ncnn::Mat in = _in[j];\n            ex.input(input_names[j], in);\n        }\n\n        for (size_t j = 0; j < output_names.size(); ++j)\n        {\n            ncnn::Mat out;\n            ex.extract(output_names[j], out);\n        }\n    }\n\n    double time_min = DBL_MAX;\n    double time_max = -DBL_MAX;\n    double time_avg = 0;\n\n    for (int i = 0; i < g_loop_count; i++)\n    {\n        double start = ncnn::get_current_time();\n        {\n            ncnn::Extractor ex = net.create_extractor();\n            for (size_t j = 0; j < input_names.size(); ++j)\n            {\n                ncnn::Mat in = _in[j];\n                ex.input(input_names[j], in);\n            }\n\n            for (size_t j = 0; j < output_names.size(); ++j)\n            {\n                ncnn::Mat out;\n                ex.extract(output_names[j], out);\n            }\n        }\n\n        double end = ncnn::get_current_time();\n\n        double time = end - start;\n\n        time_min = std::min(time_min, time);\n     
   time_max = std::max(time_max, time);\n        time_avg += time;\n    }\n\n    time_avg /= g_loop_count;\n\n    fprintf(stderr, \"%20s  min = %7.2f  max = %7.2f  avg = %7.2f\\n\", comment, time_min, time_max, time_avg);\n}\n\nvoid benchmark(const char* comment, const ncnn::Mat& _in, const ncnn::Option& opt, const char* model_param_data = NULL)\n{\n    std::vector<ncnn::Mat> inputs;\n    inputs.push_back(_in);\n    return benchmark(comment, inputs, opt, model_param_data);\n}\n\nvoid show_usage()\n{\n    fprintf(stderr, \"Usage: benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down] [(key=value)...]\\n\");\n    fprintf(stderr, \"  param=model.param\\n\");\n    fprintf(stderr, \"  shape=[227,227,3],...\\n\");\n}\n\nstatic std::vector<ncnn::Mat> parse_shape_list(char* s)\n{\n    std::vector<std::vector<int> > shapes;\n    std::vector<ncnn::Mat> mats;\n\n    char* pch = strtok(s, \"[]\");\n    while (pch != NULL)\n    {\n        // parse a,b,c\n        int v;\n        int nconsumed = 0;\n        int nscan = sscanf(pch, \"%d%n\", &v, &nconsumed);\n        if (nscan == 1)\n        {\n            // ok we get shape\n            pch += nconsumed;\n\n            std::vector<int> s;\n            s.push_back(v);\n\n            nscan = sscanf(pch, \",%d%n\", &v, &nconsumed);\n            while (nscan == 1)\n            {\n                pch += nconsumed;\n\n                s.push_back(v);\n\n                nscan = sscanf(pch, \",%d%n\", &v, &nconsumed);\n            }\n\n            // shape end\n            shapes.push_back(s);\n        }\n\n        pch = strtok(NULL, \"[]\");\n    }\n\n    for (size_t i = 0; i < shapes.size(); ++i)\n    {\n        const std::vector<int>& shape = shapes[i];\n        switch (shape.size())\n        {\n        case 4:\n            mats.push_back(ncnn::Mat(shape[0], shape[1], shape[2], shape[3]));\n            break;\n        case 3:\n            mats.push_back(ncnn::Mat(shape[0], shape[1], shape[2]));\n            
break;\n        case 2:\n            mats.push_back(ncnn::Mat(shape[0], shape[1]));\n            break;\n        case 1:\n            mats.push_back(ncnn::Mat(shape[0]));\n            break;\n        default:\n            fprintf(stderr, \"unsupported input shape size %zu\\n\", shape.size());\n            break;\n        }\n    }\n    return mats;\n}\n\nint main(int argc, char** argv)\n{\n    int loop_count = 4;\n    int num_threads = ncnn::get_physical_big_cpu_count();\n    int powersave = 2;\n    int gpu_device = -1;\n    int cooling_down = 1;\n    char* model = 0;\n    std::vector<ncnn::Mat> inputs;\n\n    for (int i = 1; i < argc; i++)\n    {\n        if (argv[i][0] == '-' && argv[i][1] == 'h')\n        {\n            show_usage();\n            return -1;\n        }\n\n        if (strcmp(argv[i], \"--help\") == 0)\n        {\n            show_usage();\n            return -1;\n        }\n    }\n\n    if (argc >= 2)\n    {\n        loop_count = atoi(argv[1]);\n    }\n    if (argc >= 3)\n    {\n        num_threads = atoi(argv[2]);\n    }\n    if (argc >= 4)\n    {\n        powersave = atoi(argv[3]);\n    }\n    if (argc >= 5)\n    {\n        gpu_device = atoi(argv[4]);\n    }\n    if (argc >= 6)\n    {\n        cooling_down = atoi(argv[5]);\n    }\n\n    for (int i = 6; i < argc; i++)\n    {\n        // key=value\n        char* kv = argv[i];\n\n        char* eqs = strchr(kv, '=');\n        if (eqs == NULL)\n        {\n            fprintf(stderr, \"unrecognized arg %s\\n\", kv);\n            continue;\n        }\n\n        // split k v\n        eqs[0] = '\\0';\n        const char* key = kv;\n        char* value = eqs + 1;\n\n        if (strcmp(key, \"param\") == 0)\n            model = value;\n        if (strcmp(key, \"shape\") == 0)\n            inputs = parse_shape_list(value);\n    }\n\n    if (model && inputs.empty())\n    {\n        fprintf(stderr, \"input tensor shape empty!\\n\");\n        return -1;\n    }\n\n#ifdef __EMSCRIPTEN__\n    EM_ASM(\n        
FS.mkdir('/working');\n        FS.mount(NODEFS, {root: '.'}, '/working'););\n#endif // __EMSCRIPTEN__\n\n    bool use_vulkan_compute = gpu_device != -1;\n\n    g_enable_cooling_down = cooling_down != 0;\n\n    g_loop_count = loop_count;\n\n    g_blob_pool_allocator.set_size_compare_ratio(0.f);\n    g_workspace_pool_allocator.set_size_compare_ratio(0.f);\n\n#if NCNN_VULKAN\n    if (use_vulkan_compute)\n    {\n        g_warmup_loop_count = 10;\n\n        g_vkdev = ncnn::get_gpu_device(gpu_device);\n\n        g_blob_vkallocator = new ncnn::VkBlobAllocator(g_vkdev);\n        g_staging_vkallocator = new ncnn::VkStagingAllocator(g_vkdev);\n    }\n#endif // NCNN_VULKAN\n\n    ncnn::set_cpu_powersave(powersave);\n\n    ncnn::set_omp_dynamic(0);\n    ncnn::set_omp_num_threads(num_threads);\n\n    // default option\n    ncnn::Option opt;\n    opt.lightmode = true;\n    opt.num_threads = num_threads;\n    opt.blob_allocator = &g_blob_pool_allocator;\n    opt.workspace_allocator = &g_workspace_pool_allocator;\n#if NCNN_VULKAN\n    opt.blob_vkallocator = g_blob_vkallocator;\n    opt.workspace_vkallocator = g_blob_vkallocator;\n    opt.staging_vkallocator = g_staging_vkallocator;\n#endif // NCNN_VULKAN\n    opt.use_winograd_convolution = true;\n    opt.use_sgemm_convolution = true;\n    opt.use_int8_inference = true;\n    opt.use_vulkan_compute = use_vulkan_compute;\n    opt.use_fp16_packed = true;\n    opt.use_fp16_storage = true;\n    opt.use_fp16_arithmetic = true;\n    opt.use_int8_storage = true;\n    opt.use_int8_arithmetic = true;\n    opt.use_packing_layout = true;\n\n    fprintf(stderr, \"loop_count = %d\\n\", g_loop_count);\n    fprintf(stderr, \"num_threads = %d\\n\", num_threads);\n    fprintf(stderr, \"powersave = %d\\n\", ncnn::get_cpu_powersave());\n    fprintf(stderr, \"gpu_device = %d\\n\", gpu_device);\n    fprintf(stderr, \"cooling_down = %d\\n\", (int)g_enable_cooling_down);\n\n    if (model != 0)\n    {\n        // run user defined benchmark\n        
benchmark(model, inputs, opt);\n    }\n    else\n    {\n        // run default cases\n        benchmark(\"squeezenet\", ncnn::Mat(227, 227, 3), opt, squeezenet_param_data);\n\n        benchmark(\"squeezenet_int8\", ncnn::Mat(227, 227, 3), opt, squeezenet_int8_param_data);\n\n        benchmark(\"mobilenet\", ncnn::Mat(224, 224, 3), opt, mobilenet_param_data);\n\n        benchmark(\"mobilenet_int8\", ncnn::Mat(224, 224, 3), opt, mobilenet_int8_param_data);\n\n        benchmark(\"mobilenet_v2\", ncnn::Mat(224, 224, 3), opt, mobilenet_v2_param_data);\n\n        // benchmark(\"mobilenet_v2_int8\", ncnn::Mat(224, 224, 3), opt, mobilenet_v2_int8_param_data);\n\n        benchmark(\"mobilenet_v3\", ncnn::Mat(224, 224, 3), opt, mobilenet_v3_param_data);\n\n        benchmark(\"shufflenet\", ncnn::Mat(224, 224, 3), opt, shufflenet_param_data);\n\n        benchmark(\"shufflenet_v2\", ncnn::Mat(224, 224, 3), opt, shufflenet_v2_param_data);\n\n        benchmark(\"mnasnet\", ncnn::Mat(224, 224, 3), opt, mnasnet_param_data);\n\n        benchmark(\"proxylessnasnet\", ncnn::Mat(224, 224, 3), opt, proxylessnasnet_param_data);\n\n        benchmark(\"efficientnet_b0\", ncnn::Mat(224, 224, 3), opt, efficientnet_b0_param_data);\n\n        benchmark(\"efficientnetv2_b0\", ncnn::Mat(224, 224, 3), opt, efficientnetv2_b0_param_data);\n\n        benchmark(\"regnety_400m\", ncnn::Mat(224, 224, 3), opt, regnety_400m_param_data);\n\n        benchmark(\"blazeface\", ncnn::Mat(128, 128, 3), opt, blazeface_param_data);\n\n        benchmark(\"googlenet\", ncnn::Mat(224, 224, 3), opt, googlenet_param_data);\n\n        benchmark(\"googlenet_int8\", ncnn::Mat(224, 224, 3), opt, googlenet_int8_param_data);\n\n        benchmark(\"resnet18\", ncnn::Mat(224, 224, 3), opt, resnet18_param_data);\n\n        benchmark(\"resnet18_int8\", ncnn::Mat(224, 224, 3), opt, resnet18_int8_param_data);\n\n        benchmark(\"alexnet\", ncnn::Mat(227, 227, 3), opt, alexnet_param_data);\n\n        benchmark(\"vgg16\", 
ncnn::Mat(224, 224, 3), opt, vgg16_param_data);\n\n        benchmark(\"vgg16_int8\", ncnn::Mat(224, 224, 3), opt, vgg16_int8_param_data);\n\n        benchmark(\"resnet50\", ncnn::Mat(224, 224, 3), opt, resnet50_param_data);\n\n        benchmark(\"resnet50_int8\", ncnn::Mat(224, 224, 3), opt, resnet50_int8_param_data);\n\n        benchmark(\"squeezenet_ssd\", ncnn::Mat(300, 300, 3), opt, squeezenet_ssd_param_data);\n\n        benchmark(\"squeezenet_ssd_int8\", ncnn::Mat(300, 300, 3), opt, squeezenet_ssd_int8_param_data);\n\n        benchmark(\"mobilenet_ssd\", ncnn::Mat(300, 300, 3), opt, mobilenet_ssd_param_data);\n\n        benchmark(\"mobilenet_ssd_int8\", ncnn::Mat(300, 300, 3), opt, mobilenet_ssd_int8_param_data);\n\n        benchmark(\"mobilenet_yolo\", ncnn::Mat(416, 416, 3), opt, mobilenet_yolo_param_data);\n\n        benchmark(\"mobilenetv2_yolov3\", ncnn::Mat(352, 352, 3), opt, mobilenetv2_yolov3_param_data);\n\n        benchmark(\"yolov4-tiny\", ncnn::Mat(416, 416, 3), opt, yolov4_tiny_param_data);\n\n        benchmark(\"nanodet_m\", ncnn::Mat(320, 320, 3), opt, nanodet_m_param_data);\n\n        benchmark(\"yolo-fastest-1.1\", ncnn::Mat(320, 320, 3), opt, yolo_fastest_1_1_param_data);\n\n        benchmark(\"yolo-fastestv2\", ncnn::Mat(352, 352, 3), opt, yolo_fastestv2_param_data);\n\n        benchmark(\"vision_transformer\", ncnn::Mat(384, 384, 3), opt, vision_transformer_param_data);\n\n        benchmark(\"FastestDet\", ncnn::Mat(352, 352, 3), opt, FastestDet_param_data);\n    }\n#if NCNN_VULKAN\n    delete g_blob_vkallocator;\n    delete g_staging_vkallocator;\n#endif // NCNN_VULKAN\n\n    return 0;\n}\n"
  },
  {
    "path": "benchmark/benchncnn_param_data.h.in",
    "content": "// Benchncnn Param Data header\n//\n// This file is auto-generated by cmake, don't edit it.\n\n@param_header_data@\n"
  },
  {
    "path": "benchmark/blazeface.param",
    "content": "7767517\n101 117\nInput            data                    0 1 data 0=128 1=128 2=3\nPadding          75                       1 1 data 75 0=1 1=2 2=1 3=2 4=0 5=0.000000e+00 7=0 8=0\nConvolution      76                       1 1 75 76 0=24 1=5 11=5 2=1 12=1 3=2 13=2 4=0 14=0 15=0 16=0 5=1 6=1800\nReLU             77                       1 1 76 77\nSplit            splitncnn_0              1 2 77 77_splitncnn_0 77_splitncnn_1\nConvolutionDepthWise 78                       1 1 77_splitncnn_1 78 0=24 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=216 7=24\nConvolution      79                       1 1 78 79 0=24 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=576\nBinaryOp         80                       2 1 79 77_splitncnn_0 80 0=0\nReLU             81                       1 1 80 81\nSplit            splitncnn_1              1 2 81 81_splitncnn_0 81_splitncnn_1\nPadding          82                       1 1 81_splitncnn_1 82 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=4\nConvolutionDepthWise 83                       1 1 81_splitncnn_0 83 0=24 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=216 7=24\nConvolution      84                       1 1 83 84 0=28 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=672\nBinaryOp         85                       2 1 84 82 85 0=0\nReLU             86                       1 1 85 86\nSplit            splitncnn_2              1 2 86 86_splitncnn_0 86_splitncnn_1\nPadding          87                       1 1 86_splitncnn_1 87 0=0 1=2 2=0 3=2 4=0 5=0.000000e+00 7=0 8=0\nPooling          88                       1 1 86_splitncnn_0 88 0=0 1=2 11=2 2=2 12=2 3=0 13=0 14=0 15=0 5=1\nPadding          89                       1 1 88 89 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=4\nConvolutionDepthWise 90                       1 1 87 90 0=28 1=3 11=3 2=1 12=1 3=2 13=2 4=0 14=0 15=0 16=0 5=1 6=252 7=28\nConvolution      91                       1 1 90 91 0=32 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 
6=896\nBinaryOp         92                       2 1 91 89 92 0=0\nReLU             93                       1 1 92 93\nSplit            splitncnn_3              1 2 93 93_splitncnn_0 93_splitncnn_1\nPadding          94                       1 1 93_splitncnn_1 94 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=4\nConvolutionDepthWise 95                       1 1 93_splitncnn_0 95 0=32 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=288 7=32\nConvolution      96                       1 1 95 96 0=36 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=1152\nBinaryOp         97                       2 1 96 94 97 0=0\nReLU             98                       1 1 97 98\nSplit            splitncnn_4              1 2 98 98_splitncnn_0 98_splitncnn_1\nPadding          99                       1 1 98_splitncnn_1 99 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=6\nConvolutionDepthWise 100                      1 1 98_splitncnn_0 100 0=36 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=324 7=36\nConvolution      101                      1 1 100 101 0=42 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=1512\nBinaryOp         102                      2 1 101 99 102 0=0\nReLU             103                      1 1 102 103\nSplit            splitncnn_5              1 2 103 103_splitncnn_0 103_splitncnn_1\nPadding          104                      1 1 103_splitncnn_1 104 0=0 1=2 2=0 3=2 4=0 5=0.000000e+00 7=0 8=0\nPooling          105                      1 1 103_splitncnn_0 105 0=0 1=2 11=2 2=2 12=2 3=0 13=0 14=0 15=0 5=1\nPadding          106                      1 1 105 106 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=6\nConvolutionDepthWise 107                      1 1 104 107 0=42 1=3 11=3 2=1 12=1 3=2 13=2 4=0 14=0 15=0 16=0 5=1 6=378 7=42\nConvolution      108                      1 1 107 108 0=48 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=2016\nBinaryOp         109                      2 1 108 106 109 0=0\nReLU             110                      1 1 109 110\nSplit   
         splitncnn_6              1 2 110 110_splitncnn_0 110_splitncnn_1\nPadding          111                      1 1 110_splitncnn_1 111 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=8\nConvolutionDepthWise 112                      1 1 110_splitncnn_0 112 0=48 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=432 7=48\nConvolution      113                      1 1 112 113 0=56 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=2688\nBinaryOp         114                      2 1 113 111 114 0=0\nReLU             115                      1 1 114 115\nSplit            splitncnn_7              1 2 115 115_splitncnn_0 115_splitncnn_1\nPadding          116                      1 1 115_splitncnn_1 116 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=8\nConvolutionDepthWise 117                      1 1 115_splitncnn_0 117 0=56 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=504 7=56\nConvolution      118                      1 1 117 118 0=64 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=3584\nBinaryOp         119                      2 1 118 116 119 0=0\nReLU             120                      1 1 119 120\nSplit            splitncnn_8              1 2 120 120_splitncnn_0 120_splitncnn_1\nPadding          121                      1 1 120_splitncnn_1 121 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=8\nConvolutionDepthWise 122                      1 1 120_splitncnn_0 122 0=64 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=576 7=64\nConvolution      123                      1 1 122 123 0=72 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=4608\nBinaryOp         124                      2 1 123 121 124 0=0\nReLU             125                      1 1 124 125\nSplit            splitncnn_9              1 2 125 125_splitncnn_0 125_splitncnn_1\nPadding          126                      1 1 125_splitncnn_1 126 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=8\nConvolutionDepthWise 127                      1 1 125_splitncnn_0 127 0=72 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 
16=1 5=1 6=648 7=72\nConvolution      128                      1 1 127 128 0=80 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=5760\nBinaryOp         129                      2 1 128 126 129 0=0\nReLU             130                      1 1 129 130\nSplit            splitncnn_10             1 2 130 130_splitncnn_0 130_splitncnn_1\nPadding          131                      1 1 130_splitncnn_1 131 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=8\nConvolutionDepthWise 132                      1 1 130_splitncnn_0 132 0=80 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=720 7=80\nConvolution      133                      1 1 132 133 0=88 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=7040\nBinaryOp         134                      2 1 133 131 134 0=0\nReLU             135                      1 1 134 135\nSplit            splitncnn_11             1 2 135 135_splitncnn_0 135_splitncnn_1\nPadding          136                      1 1 135_splitncnn_1 136 0=0 1=2 2=0 3=2 4=0 5=0.000000e+00 7=0 8=0\nPooling          137                      1 1 135_splitncnn_0 137 0=0 1=2 11=2 2=2 12=2 3=0 13=0 14=0 15=0 5=1\nPadding          138                      1 1 137 138 0=0 1=0 2=0 3=0 4=0 5=0.000000e+00 7=0 8=8\nConvolutionDepthWise 139                      1 1 136 139 0=88 1=3 11=3 2=1 12=1 3=2 13=2 4=0 14=0 15=0 16=0 5=1 6=792 7=88\nConvolution      140                      1 1 139 140 0=96 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=8448\nBinaryOp         141                      2 1 140 138 141 0=0\nReLU             142                      1 1 141 142\nSplit            splitncnn_12             1 2 142 142_splitncnn_0 142_splitncnn_1\nConvolutionDepthWise 143                      1 1 142_splitncnn_1 143 0=96 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=864 7=96\nConvolution      144                      1 1 143 144 0=96 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=9216\nBinaryOp         145                      2 1 144 142_splitncnn_0 145 
0=0\nReLU             146                      1 1 145 146\nSplit            splitncnn_13             1 2 146 146_splitncnn_0 146_splitncnn_1\nConvolutionDepthWise 147                      1 1 146_splitncnn_1 147 0=96 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=864 7=96\nConvolution      148                      1 1 147 148 0=96 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=9216\nBinaryOp         149                      2 1 148 146_splitncnn_0 149 0=0\nReLU             150                      1 1 149 150\nSplit            splitncnn_14             1 2 150 150_splitncnn_0 150_splitncnn_1\nConvolutionDepthWise 151                      1 1 150_splitncnn_1 151 0=96 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=864 7=96\nConvolution      152                      1 1 151 152 0=96 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=9216\nBinaryOp         153                      2 1 152 150_splitncnn_0 153 0=0\nReLU             154                      1 1 153 154\nSplit            splitncnn_15             1 2 154 154_splitncnn_0 154_splitncnn_1\nConvolutionDepthWise 155                      1 1 154_splitncnn_1 155 0=96 1=3 11=3 2=1 12=1 3=1 13=1 4=1 14=1 15=1 16=1 5=1 6=864 7=96\nConvolution      156                      1 1 155 156 0=96 1=1 11=1 2=1 12=1 3=1 13=1 4=0 14=0 15=0 16=0 5=1 6=9216\nBinaryOp         157                      2 1 156 154_splitncnn_0 157 0=0\nReLU             output                   1 1 157 output\n"
  },
  {
    "path": "benchmark/efficientnet_b0.param",
    "content": "7767517\n200 225\nInput                    input.1                  0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              Conv_0                   1 1 data 362 -23330=4,3,112,112,32 0=32 1=3 3=2 4=1 5=1 6=864\nSwish                    Mul_3                    1 1 362 364 -23330=4,3,112,112,32\nConvolutionDepthWise     Conv_4                   1 1 364 366 -23330=4,3,112,112,32 0=32 1=3 4=1 5=1 6=288 7=32\nSwish                    Mul_7                    1 1 366 368 -23330=4,3,112,112,32\nSplit                    splitncnn_0              1 2 368 368_splitncnn_0 368_splitncnn_1 -23330=8,3,112,112,32,3,112,112,32\nPooling                  GlobalAveragePool_8      1 1 368_splitncnn_1 369 -23330=4,1,32,1,1 0=1 4=1\nInnerProduct             Conv_9                   1 1 369 370 -23330=4,1,8,1,1 0=8 1=1 2=256\nSwish                    Mul_11                   1 1 370 372 -23330=4,1,8,1,1\nConvolution              Conv_12                  1 1 372 374 -23330=4,1,32,1,1 0=32 1=1 5=1 6=256 9=4\nBinaryOp                 Mul_14                   2 1 368_splitncnn_0 374 375 -23330=4,3,112,112,32 0=2\nConvolution              Conv_15                  1 1 375 377 -23330=4,3,112,112,16 0=16 1=1 5=1 6=512\nConvolution              Conv_17                  1 1 377 379 -23330=4,3,112,112,96 0=96 1=1 5=1 6=1536\nSwish                    Mul_20                   1 1 379 381 -23330=4,3,112,112,96\nConvolutionDepthWise     Conv_21                  1 1 381 383 -23330=4,3,56,56,96 0=96 1=3 3=2 4=1 5=1 6=864 7=96\nSwish                    Mul_24                   1 1 383 385 -23330=4,3,56,56,96\nSplit                    splitncnn_1              1 2 385 385_splitncnn_0 385_splitncnn_1 -23330=8,3,56,56,96,3,56,56,96\nPooling                  GlobalAveragePool_25     1 1 385_splitncnn_1 386 -23330=4,1,96,1,1 0=1 4=1\nInnerProduct             Conv_26                  1 1 386 387 -23330=4,1,4,1,1 0=4 1=1 2=384\nSwish                    Mul_28                   1 1 
387 389 -23330=4,1,4,1,1\nConvolution              Conv_29                  1 1 389 391 -23330=4,1,96,1,1 0=96 1=1 5=1 6=384 9=4\nBinaryOp                 Mul_31                   2 1 385_splitncnn_0 391 392 -23330=4,3,56,56,96 0=2\nConvolution              Conv_32                  1 1 392 394 -23330=4,3,56,56,24 0=24 1=1 5=1 6=2304\nSplit                    splitncnn_2              1 2 394 394_splitncnn_0 394_splitncnn_1 -23330=8,3,56,56,24,3,56,56,24\nConvolution              Conv_34                  1 1 394_splitncnn_1 396 -23330=4,3,56,56,144 0=144 1=1 5=1 6=3456\nSwish                    Mul_37                   1 1 396 398 -23330=4,3,56,56,144\nConvolutionDepthWise     Conv_38                  1 1 398 400 -23330=4,3,56,56,144 0=144 1=3 4=1 5=1 6=1296 7=144\nSwish                    Mul_41                   1 1 400 402 -23330=4,3,56,56,144\nSplit                    splitncnn_3              1 2 402 402_splitncnn_0 402_splitncnn_1 -23330=8,3,56,56,144,3,56,56,144\nPooling                  GlobalAveragePool_42     1 1 402_splitncnn_1 403 -23330=4,1,144,1,1 0=1 4=1\nInnerProduct             Conv_43                  1 1 403 404 -23330=4,1,6,1,1 0=6 1=1 2=864\nSwish                    Mul_45                   1 1 404 406 -23330=4,1,6,1,1\nConvolution              Conv_46                  1 1 406 408 -23330=4,1,144,1,1 0=144 1=1 5=1 6=864 9=4\nBinaryOp                 Mul_48                   2 1 402_splitncnn_0 408 409 -23330=4,3,56,56,144 0=2\nConvolution              Conv_49                  1 1 409 411 -23330=4,3,56,56,24 0=24 1=1 5=1 6=3456\nBinaryOp                 Add_51                   2 1 394_splitncnn_0 411 412 -23330=4,3,56,56,24\nConvolution              Conv_52                  1 1 412 414 -23330=4,3,56,56,144 0=144 1=1 5=1 6=3456\nSwish                    Mul_55                   1 1 414 416 -23330=4,3,56,56,144\nConvolutionDepthWise     Conv_56                  1 1 416 418 -23330=4,3,28,28,144 0=144 1=5 3=2 4=2 5=1 6=3600 7=144\nSwish                 
   Mul_59                   1 1 418 420 -23330=4,3,28,28,144\nSplit                    splitncnn_4              1 2 420 420_splitncnn_0 420_splitncnn_1 -23330=8,3,28,28,144,3,28,28,144\nPooling                  GlobalAveragePool_60     1 1 420_splitncnn_1 421 -23330=4,1,144,1,1 0=1 4=1\nInnerProduct             Conv_61                  1 1 421 422 -23330=4,1,6,1,1 0=6 1=1 2=864\nSwish                    Mul_63                   1 1 422 424 -23330=4,1,6,1,1\nConvolution              Conv_64                  1 1 424 426 -23330=4,1,144,1,1 0=144 1=1 5=1 6=864 9=4\nBinaryOp                 Mul_66                   2 1 420_splitncnn_0 426 427 -23330=4,3,28,28,144 0=2\nConvolution              Conv_67                  1 1 427 429 -23330=4,3,28,28,40 0=40 1=1 5=1 6=5760\nSplit                    splitncnn_5              1 2 429 429_splitncnn_0 429_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              Conv_69                  1 1 429_splitncnn_1 431 -23330=4,3,28,28,240 0=240 1=1 5=1 6=9600\nSwish                    Mul_72                   1 1 431 433 -23330=4,3,28,28,240\nConvolutionDepthWise     Conv_73                  1 1 433 435 -23330=4,3,28,28,240 0=240 1=5 4=2 5=1 6=6000 7=240\nSwish                    Mul_76                   1 1 435 437 -23330=4,3,28,28,240\nSplit                    splitncnn_6              1 2 437 437_splitncnn_0 437_splitncnn_1 -23330=8,3,28,28,240,3,28,28,240\nPooling                  GlobalAveragePool_77     1 1 437_splitncnn_1 438 -23330=4,1,240,1,1 0=1 4=1\nInnerProduct             Conv_78                  1 1 438 439 -23330=4,1,10,1,1 0=10 1=1 2=2400\nSwish                    Mul_80                   1 1 439 441 -23330=4,1,10,1,1\nConvolution              Conv_81                  1 1 441 443 -23330=4,1,240,1,1 0=240 1=1 5=1 6=2400 9=4\nBinaryOp                 Mul_83                   2 1 437_splitncnn_0 443 444 -23330=4,3,28,28,240 0=2\nConvolution              Conv_84                  1 1 444 446 -23330=4,3,28,28,40 0=40 
1=1 5=1 6=9600\nBinaryOp                 Add_86                   2 1 429_splitncnn_0 446 447 -23330=4,3,28,28,40\nConvolution              Conv_87                  1 1 447 449 -23330=4,3,28,28,240 0=240 1=1 5=1 6=9600\nSwish                    Mul_90                   1 1 449 451 -23330=4,3,28,28,240\nConvolutionDepthWise     Conv_91                  1 1 451 453 -23330=4,3,14,14,240 0=240 1=3 3=2 4=1 5=1 6=2160 7=240\nSwish                    Mul_94                   1 1 453 455 -23330=4,3,14,14,240\nSplit                    splitncnn_7              1 2 455 455_splitncnn_0 455_splitncnn_1 -23330=8,3,14,14,240,3,14,14,240\nPooling                  GlobalAveragePool_95     1 1 455_splitncnn_1 456 -23330=4,1,240,1,1 0=1 4=1\nInnerProduct             Conv_96                  1 1 456 457 -23330=4,1,10,1,1 0=10 1=1 2=2400\nSwish                    Mul_98                   1 1 457 459 -23330=4,1,10,1,1\nConvolution              Conv_99                  1 1 459 461 -23330=4,1,240,1,1 0=240 1=1 5=1 6=2400 9=4\nBinaryOp                 Mul_101                  2 1 455_splitncnn_0 461 462 -23330=4,3,14,14,240 0=2\nConvolution              Conv_102                 1 1 462 464 -23330=4,3,14,14,80 0=80 1=1 5=1 6=19200\nSplit                    splitncnn_8              1 2 464 464_splitncnn_0 464_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              Conv_104                 1 1 464_splitncnn_1 466 -23330=4,3,14,14,480 0=480 1=1 5=1 6=38400\nSwish                    Mul_107                  1 1 466 468 -23330=4,3,14,14,480\nConvolutionDepthWise     Conv_108                 1 1 468 470 -23330=4,3,14,14,480 0=480 1=3 4=1 5=1 6=4320 7=480\nSwish                    Mul_111                  1 1 470 472 -23330=4,3,14,14,480\nSplit                    splitncnn_9              1 2 472 472_splitncnn_0 472_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nPooling                  GlobalAveragePool_112    1 1 472_splitncnn_1 473 -23330=4,1,480,1,1 0=1 4=1\nInnerProduct             
Conv_113                 1 1 473 474 -23330=4,1,20,1,1 0=20 1=1 2=9600\nSwish                    Mul_115                  1 1 474 476 -23330=4,1,20,1,1\nConvolution              Conv_116                 1 1 476 478 -23330=4,1,480,1,1 0=480 1=1 5=1 6=9600 9=4\nBinaryOp                 Mul_118                  2 1 472_splitncnn_0 478 479 -23330=4,3,14,14,480 0=2\nConvolution              Conv_119                 1 1 479 481 -23330=4,3,14,14,80 0=80 1=1 5=1 6=38400\nBinaryOp                 Add_121                  2 1 464_splitncnn_0 481 482 -23330=4,3,14,14,80\nSplit                    splitncnn_10             1 2 482 482_splitncnn_0 482_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              Conv_122                 1 1 482_splitncnn_1 484 -23330=4,3,14,14,480 0=480 1=1 5=1 6=38400\nSwish                    Mul_125                  1 1 484 486 -23330=4,3,14,14,480\nConvolutionDepthWise     Conv_126                 1 1 486 488 -23330=4,3,14,14,480 0=480 1=3 4=1 5=1 6=4320 7=480\nSwish                    Mul_129                  1 1 488 490 -23330=4,3,14,14,480\nSplit                    splitncnn_11             1 2 490 490_splitncnn_0 490_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nPooling                  GlobalAveragePool_130    1 1 490_splitncnn_1 491 -23330=4,1,480,1,1 0=1 4=1\nInnerProduct             Conv_131                 1 1 491 492 -23330=4,1,20,1,1 0=20 1=1 2=9600\nSwish                    Mul_133                  1 1 492 494 -23330=4,1,20,1,1\nConvolution              Conv_134                 1 1 494 496 -23330=4,1,480,1,1 0=480 1=1 5=1 6=9600 9=4\nBinaryOp                 Mul_136                  2 1 490_splitncnn_0 496 497 -23330=4,3,14,14,480 0=2\nConvolution              Conv_137                 1 1 497 499 -23330=4,3,14,14,80 0=80 1=1 5=1 6=38400\nBinaryOp                 Add_139                  2 1 482_splitncnn_0 499 500 -23330=4,3,14,14,80\nConvolution              Conv_140                 1 1 500 502 -23330=4,3,14,14,480 0=480 
1=1 5=1 6=38400\nSwish                    Mul_143                  1 1 502 504 -23330=4,3,14,14,480\nConvolutionDepthWise     Conv_144                 1 1 504 506 -23330=4,3,14,14,480 0=480 1=5 4=2 5=1 6=12000 7=480\nSwish                    Mul_147                  1 1 506 508 -23330=4,3,14,14,480\nSplit                    splitncnn_12             1 2 508 508_splitncnn_0 508_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nPooling                  GlobalAveragePool_148    1 1 508_splitncnn_1 509 -23330=4,1,480,1,1 0=1 4=1\nInnerProduct             Conv_149                 1 1 509 510 -23330=4,1,20,1,1 0=20 1=1 2=9600\nSwish                    Mul_151                  1 1 510 512 -23330=4,1,20,1,1\nConvolution              Conv_152                 1 1 512 514 -23330=4,1,480,1,1 0=480 1=1 5=1 6=9600 9=4\nBinaryOp                 Mul_154                  2 1 508_splitncnn_0 514 515 -23330=4,3,14,14,480 0=2\nConvolution              Conv_155                 1 1 515 517 -23330=4,3,14,14,112 0=112 1=1 5=1 6=53760\nSplit                    splitncnn_13             1 2 517 517_splitncnn_0 517_splitncnn_1 -23330=8,3,14,14,112,3,14,14,112\nConvolution              Conv_157                 1 1 517_splitncnn_1 519 -23330=4,3,14,14,672 0=672 1=1 5=1 6=75264\nSwish                    Mul_160                  1 1 519 521 -23330=4,3,14,14,672\nConvolutionDepthWise     Conv_161                 1 1 521 523 -23330=4,3,14,14,672 0=672 1=5 4=2 5=1 6=16800 7=672\nSwish                    Mul_164                  1 1 523 525 -23330=4,3,14,14,672\nSplit                    splitncnn_14             1 2 525 525_splitncnn_0 525_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nPooling                  GlobalAveragePool_165    1 1 525_splitncnn_1 526 -23330=4,1,672,1,1 0=1 4=1\nInnerProduct             Conv_166                 1 1 526 527 -23330=4,1,28,1,1 0=28 1=1 2=18816\nSwish                    Mul_168                  1 1 527 529 -23330=4,1,28,1,1\nConvolution              Conv_169              
   1 1 529 531 -23330=4,1,672,1,1 0=672 1=1 5=1 6=18816 9=4\nBinaryOp                 Mul_171                  2 1 525_splitncnn_0 531 532 -23330=4,3,14,14,672 0=2\nConvolution              Conv_172                 1 1 532 534 -23330=4,3,14,14,112 0=112 1=1 5=1 6=75264\nBinaryOp                 Add_174                  2 1 517_splitncnn_0 534 535 -23330=4,3,14,14,112\nSplit                    splitncnn_15             1 2 535 535_splitncnn_0 535_splitncnn_1 -23330=8,3,14,14,112,3,14,14,112\nConvolution              Conv_175                 1 1 535_splitncnn_1 537 -23330=4,3,14,14,672 0=672 1=1 5=1 6=75264\nSwish                    Mul_178                  1 1 537 539 -23330=4,3,14,14,672\nConvolutionDepthWise     Conv_179                 1 1 539 541 -23330=4,3,14,14,672 0=672 1=5 4=2 5=1 6=16800 7=672\nSwish                    Mul_182                  1 1 541 543 -23330=4,3,14,14,672\nSplit                    splitncnn_16             1 2 543 543_splitncnn_0 543_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nPooling                  GlobalAveragePool_183    1 1 543_splitncnn_1 544 -23330=4,1,672,1,1 0=1 4=1\nInnerProduct             Conv_184                 1 1 544 545 -23330=4,1,28,1,1 0=28 1=1 2=18816\nSwish                    Mul_186                  1 1 545 547 -23330=4,1,28,1,1\nConvolution              Conv_187                 1 1 547 549 -23330=4,1,672,1,1 0=672 1=1 5=1 6=18816 9=4\nBinaryOp                 Mul_189                  2 1 543_splitncnn_0 549 550 -23330=4,3,14,14,672 0=2\nConvolution              Conv_190                 1 1 550 552 -23330=4,3,14,14,112 0=112 1=1 5=1 6=75264\nBinaryOp                 Add_192                  2 1 535_splitncnn_0 552 553 -23330=4,3,14,14,112\nConvolution              Conv_193                 1 1 553 555 -23330=4,3,14,14,672 0=672 1=1 5=1 6=75264\nSwish                    Mul_196                  1 1 555 557 -23330=4,3,14,14,672\nConvolutionDepthWise     Conv_197                 1 1 557 559 -23330=4,3,7,7,672 0=672 
1=5 3=2 4=2 5=1 6=16800 7=672\nSwish                    Mul_200                  1 1 559 561 -23330=4,3,7,7,672\nSplit                    splitncnn_17             1 2 561 561_splitncnn_0 561_splitncnn_1 -23330=8,3,7,7,672,3,7,7,672\nPooling                  GlobalAveragePool_201    1 1 561_splitncnn_1 562 -23330=4,1,672,1,1 0=1 4=1\nInnerProduct             Conv_202                 1 1 562 563 -23330=4,1,28,1,1 0=28 1=1 2=18816\nSwish                    Mul_204                  1 1 563 565 -23330=4,1,28,1,1\nConvolution              Conv_205                 1 1 565 567 -23330=4,1,672,1,1 0=672 1=1 5=1 6=18816 9=4\nBinaryOp                 Mul_207                  2 1 561_splitncnn_0 567 568 -23330=4,3,7,7,672 0=2\nConvolution              Conv_208                 1 1 568 570 -23330=4,3,7,7,192 0=192 1=1 5=1 6=129024\nSplit                    splitncnn_18             1 2 570 570_splitncnn_0 570_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              Conv_210                 1 1 570_splitncnn_1 572 -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184\nSwish                    Mul_213                  1 1 572 574 -23330=4,3,7,7,1152\nConvolutionDepthWise     Conv_214                 1 1 574 576 -23330=4,3,7,7,1152 0=1152 1=5 4=2 5=1 6=28800 7=1152\nSwish                    Mul_217                  1 1 576 578 -23330=4,3,7,7,1152\nSplit                    splitncnn_19             1 2 578 578_splitncnn_0 578_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nPooling                  GlobalAveragePool_218    1 1 578_splitncnn_1 579 -23330=4,1,1152,1,1 0=1 4=1\nInnerProduct             Conv_219                 1 1 579 580 -23330=4,1,48,1,1 0=48 1=1 2=55296\nSwish                    Mul_221                  1 1 580 582 -23330=4,1,48,1,1\nConvolution              Conv_222                 1 1 582 584 -23330=4,1,1152,1,1 0=1152 1=1 5=1 6=55296 9=4\nBinaryOp                 Mul_224                  2 1 578_splitncnn_0 584 585 -23330=4,3,7,7,1152 0=2\nConvolution              
Conv_225                 1 1 585 587 -23330=4,3,7,7,192 0=192 1=1 5=1 6=221184\nBinaryOp                 Add_227                  2 1 570_splitncnn_0 587 588 -23330=4,3,7,7,192\nSplit                    splitncnn_20             1 2 588 588_splitncnn_0 588_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              Conv_228                 1 1 588_splitncnn_1 590 -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184\nSwish                    Mul_231                  1 1 590 592 -23330=4,3,7,7,1152\nConvolutionDepthWise     Conv_232                 1 1 592 594 -23330=4,3,7,7,1152 0=1152 1=5 4=2 5=1 6=28800 7=1152\nSwish                    Mul_235                  1 1 594 596 -23330=4,3,7,7,1152\nSplit                    splitncnn_21             1 2 596 596_splitncnn_0 596_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nPooling                  GlobalAveragePool_236    1 1 596_splitncnn_1 597 -23330=4,1,1152,1,1 0=1 4=1\nInnerProduct             Conv_237                 1 1 597 598 -23330=4,1,48,1,1 0=48 1=1 2=55296\nSwish                    Mul_239                  1 1 598 600 -23330=4,1,48,1,1\nConvolution              Conv_240                 1 1 600 602 -23330=4,1,1152,1,1 0=1152 1=1 5=1 6=55296 9=4\nBinaryOp                 Mul_242                  2 1 596_splitncnn_0 602 603 -23330=4,3,7,7,1152 0=2\nConvolution              Conv_243                 1 1 603 605 -23330=4,3,7,7,192 0=192 1=1 5=1 6=221184\nBinaryOp                 Add_245                  2 1 588_splitncnn_0 605 606 -23330=4,3,7,7,192\nSplit                    splitncnn_22             1 2 606 606_splitncnn_0 606_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              Conv_246                 1 1 606_splitncnn_1 608 -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184\nSwish                    Mul_249                  1 1 608 610 -23330=4,3,7,7,1152\nConvolutionDepthWise     Conv_250                 1 1 610 612 -23330=4,3,7,7,1152 0=1152 1=5 4=2 5=1 6=28800 7=1152\nSwish                    Mul_253         
         1 1 612 614 -23330=4,3,7,7,1152\nSplit                    splitncnn_23             1 2 614 614_splitncnn_0 614_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nPooling                  GlobalAveragePool_254    1 1 614_splitncnn_1 615 -23330=4,1,1152,1,1 0=1 4=1\nInnerProduct             Conv_255                 1 1 615 616 -23330=4,1,48,1,1 0=48 1=1 2=55296\nSwish                    Mul_257                  1 1 616 618 -23330=4,1,48,1,1\nConvolution              Conv_258                 1 1 618 620 -23330=4,1,1152,1,1 0=1152 1=1 5=1 6=55296 9=4\nBinaryOp                 Mul_260                  2 1 614_splitncnn_0 620 621 -23330=4,3,7,7,1152 0=2\nConvolution              Conv_261                 1 1 621 623 -23330=4,3,7,7,192 0=192 1=1 5=1 6=221184\nBinaryOp                 Add_263                  2 1 606_splitncnn_0 623 624 -23330=4,3,7,7,192\nConvolution              Conv_264                 1 1 624 626 -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184\nSwish                    Mul_267                  1 1 626 628 -23330=4,3,7,7,1152\nConvolutionDepthWise     Conv_268                 1 1 628 630 -23330=4,3,7,7,1152 0=1152 1=3 4=1 5=1 6=10368 7=1152\nSwish                    Mul_271                  1 1 630 632 -23330=4,3,7,7,1152\nSplit                    splitncnn_24             1 2 632 632_splitncnn_0 632_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nPooling                  GlobalAveragePool_272    1 1 632_splitncnn_1 633 -23330=4,1,1152,1,1 0=1 4=1\nInnerProduct             Conv_273                 1 1 633 634 -23330=4,1,48,1,1 0=48 1=1 2=55296\nSwish                    Mul_275                  1 1 634 636 -23330=4,1,48,1,1\nConvolution              Conv_276                 1 1 636 638 -23330=4,1,1152,1,1 0=1152 1=1 5=1 6=55296 9=4\nBinaryOp                 Mul_278                  2 1 632_splitncnn_0 638 639 -23330=4,3,7,7,1152 0=2\nConvolution              Conv_279                 1 1 639 641 -23330=4,3,7,7,320 0=320 1=1 5=1 6=368640\nConvolution              
Conv_281                 1 1 641 643 -23330=4,3,7,7,1280 0=1280 1=1 5=1 6=409600\nSwish                    Mul_284                  1 1 643 645 -23330=4,3,7,7,1280\nPooling                  GlobalAveragePool_285    1 1 645 654 -23330=4,1,1280,1,1 0=1 4=1\nInnerProduct             Gemm_292                 1 1 654 655 -23330=4,1,1000,1,1 0=1000 1=1 2=1280000\nSoftmax                  prob                     1 1 655 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/efficientnetv2_b0.param",
    "content": "7767517\n257 288\nMemoryData               110:12                   0 1 110:12 -23330=4,1,112,1,1 0=112\nMemoryData               133:12                   0 1 133:12 -23330=4,1,192,1,1 0=192\nMemoryData               144:12                   0 1 144:12 -23330=4,1,192,1,1 0=192\nMemoryData               14:11                    0 1 14:11 -23330=4,1,32,1,1 0=32\nMemoryData               155:12                   0 1 155:12 -23330=4,1,192,1,1 0=192\nMemoryData               166:12                   0 1 166:12 -23330=4,1,192,1,1 0=192\nMemoryData               177:12                   0 1 177:12 -23330=4,1,192,1,1 0=192\nMemoryData               188:12                   0 1 188:12 -23330=4,1,192,1,1 0=192\nMemoryData               199:12                   0 1 199:12 -23330=4,1,192,1,1 0=192\nMemoryData               22:11                    0 1 22:11 -23330=4,1,48,1,1 0=48\nMemoryData               33:11                    0 1 33:11 -23330=4,1,112,1,1 0=112\nMemoryData               44:11                    0 1 44:11 -23330=4,1,112,1,1 0=112\nMemoryData               55:11                    0 1 55:11 -23330=4,1,112,1,1 0=112\nMemoryData               77:11                    0 1 77:11 -23330=4,1,96,1,1 0=96\nMemoryData               88:11                    0 1 88:11 -23330=4,1,96,1,1 0=96\nInput                    op_201                   0 1 204:12 -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              op_202                   1 1 204:12 206:12 -23330=4,3,112,112,32 0=32 1=3 3=2 4=-233 5=1 6=864\nSwish                    op_203                   1 1 206:12 208:12 -23330=4,3,112,112,32\nConvolution              op_204                   1 1 208:12 210:12 -23330=4,3,112,112,16 0=16 1=3 4=-233 5=1 6=4608\nSwish                    op_205                   1 1 210:12 212:12_splitncnn_0 -23330=4,3,112,112,16\nConvolution              op_207                   1 1 212:12_splitncnn_0 215:12 -23330=4,3,56,56,64 0=64 1=3 3=2 4=-233 5=1 6=9216\nSwish       
             op_208                   1 1 215:12 217:12 -23330=4,3,56,56,64\nConvolution              op_209                   1 1 217:12 219:12 -23330=4,3,56,56,32 0=32 1=1 4=-233 5=1 6=2048\nSplit                    splitncnn_1              1 2 219:12 219:12_splitncnn_0 219:12_splitncnn_1 -23330=8,3,56,56,32,3,56,56,32\nConvolution              op_210                   1 1 219:12_splitncnn_1 221:12 -23330=4,3,56,56,128 0=128 1=3 4=-233 5=1 6=36864\nSwish                    op_211                   1 1 221:12 223:12 -23330=4,3,56,56,128\nConvolution              op_212                   1 1 223:12 224:12 -23330=4,3,56,56,32 0=32 1=1 4=-233 6=4096\nEltwise                  op_213                   2 1 219:12_splitncnn_0 224:12 225:12 -23330=4,3,56,56,32 0=1\nBinaryOp                 op_214                   2 1 225:12 14:11 226:12_splitncnn_0 -23330=4,3,56,56,32\nConvolution              op_216                   1 1 226:12_splitncnn_0 229:12 -23330=4,3,28,28,128 0=128 1=3 3=2 4=-233 5=1 6=36864\nSwish                    op_217                   1 1 229:12 231:12 -23330=4,3,28,28,128\nConvolution              op_218                   1 1 231:12 233:12 -23330=4,3,28,28,48 0=48 1=1 4=-233 5=1 6=6144\nSplit                    splitncnn_3              1 2 233:12 233:12_splitncnn_0 233:12_splitncnn_1 -23330=8,3,28,28,48,3,28,28,48\nConvolution              op_219                   1 1 233:12_splitncnn_1 235:12 -23330=4,3,28,28,192 0=192 1=3 4=-233 5=1 6=82944\nSwish                    op_220                   1 1 235:12 237:12 -23330=4,3,28,28,192\nConvolution              op_221                   1 1 237:12 238:12 -23330=4,3,28,28,48 0=48 1=1 4=-233 6=9216\nEltwise                  op_222                   2 1 233:12_splitncnn_0 238:12 239:12 -23330=4,3,28,28,48 0=1\nBinaryOp                 op_223                   2 1 239:12 22:11 240:12_splitncnn_0 -23330=4,3,28,28,48\nConvolution              op_225                   1 1 240:12_splitncnn_0 243:12 
-23330=4,3,28,28,192 0=192 1=1 4=-233 5=1 6=9216\nSwish                    op_226                   1 1 243:12 245:12 -23330=4,3,28,28,192\nConvolutionDepthWise     op_227                   1 1 245:12 248:12 -23330=4,3,14,14,192 0=192 1=3 3=2 4=-233 5=1 6=1728 7=192\nSwish                    op_229                   1 1 248:12 250:12 -23330=4,3,14,14,192\nSplit                    splitncnn_5              1 2 250:12 250:12_splitncnn_0 250:12_splitncnn_1 -23330=8,3,14,14,192,3,14,14,192\nReduction                op_230                   1 1 250:12_splitncnn_1 251:12 -23330=4,3,1,1,192 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_231                   1 1 251:12 253:12 -23330=4,3,1,1,12 0=12 1=1 4=-233 5=1 6=2304\nSwish                    op_232                   1 1 253:12 255:12 -23330=4,3,1,1,12\nConvolution              op_233                   1 1 255:12 258:12 -23330=4,3,1,1,192 0=192 1=1 4=-233 5=1 6=2304 9=4\nBinaryOp                 op_235                   2 1 250:12_splitncnn_0 258:12 259:12 -23330=4,3,14,14,192 0=2\nConvolution              op_236                   1 1 259:12 261:12 -23330=4,3,14,14,96 0=96 1=1 4=-233 5=1 6=18432\nSplit                    splitncnn_6              1 2 261:12 261:12_splitncnn_0 261:12_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              op_237                   1 1 261:12_splitncnn_1 263:12 -23330=4,3,14,14,384 0=384 1=1 4=-233 5=1 6=36864\nSwish                    op_238                   1 1 263:12 265:12 -23330=4,3,14,14,384\nConvolutionDepthWise     op_239                   1 1 265:12 268:12 -23330=4,3,14,14,384 0=384 1=3 4=-233 5=1 6=3456 7=384\nSwish                    op_241                   1 1 268:12 270:12 -23330=4,3,14,14,384\nSplit                    splitncnn_7              1 2 270:12 270:12_splitncnn_0 270:12_splitncnn_1 -23330=8,3,14,14,384,3,14,14,384\nReduction                op_242                   1 1 270:12_splitncnn_1 271:12 -23330=4,3,1,1,384 0=3 1=0 -23303=2,1,2 4=1 
5=1\nConvolution              op_243                   1 1 271:12 273:12 -23330=4,3,1,1,24 0=24 1=1 4=-233 5=1 6=9216\nSwish                    op_244                   1 1 273:12 275:12 -23330=4,3,1,1,24\nConvolution              op_245                   1 1 275:12 278:12 -23330=4,3,1,1,384 0=384 1=1 4=-233 5=1 6=9216 9=4\nBinaryOp                 op_247                   2 1 270:12_splitncnn_0 278:12 279:12 -23330=4,3,14,14,384 0=2\nConvolution              op_248                   1 1 279:12 280:12 -23330=4,3,14,14,96 0=96 1=1 4=-233 6=36864\nEltwise                  op_249                   2 1 261:12_splitncnn_0 280:12 281:12 -23330=4,3,14,14,96 0=1\nBinaryOp                 op_250                   2 1 281:12 77:11 282:12 -23330=4,3,14,14,96\nSplit                    splitncnn_8              1 2 282:12 282:12_splitncnn_0 282:12_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              op_251                   1 1 282:12_splitncnn_1 284:12 -23330=4,3,14,14,384 0=384 1=1 4=-233 5=1 6=36864\nSwish                    op_252                   1 1 284:12 286:12 -23330=4,3,14,14,384\nConvolutionDepthWise     op_253                   1 1 286:12 289:12 -23330=4,3,14,14,384 0=384 1=3 4=-233 5=1 6=3456 7=384\nSwish                    op_255                   1 1 289:12 291:12 -23330=4,3,14,14,384\nSplit                    splitncnn_9              1 2 291:12 291:12_splitncnn_0 291:12_splitncnn_1 -23330=8,3,14,14,384,3,14,14,384\nReduction                op_256                   1 1 291:12_splitncnn_1 292:12 -23330=4,3,1,1,384 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_257                   1 1 292:12 294:12 -23330=4,3,1,1,24 0=24 1=1 4=-233 5=1 6=9216\nSwish                    op_258                   1 1 294:12 296:12 -23330=4,3,1,1,24\nConvolution              op_259                   1 1 296:12 299:12 -23330=4,3,1,1,384 0=384 1=1 4=-233 5=1 6=9216 9=4\nBinaryOp                 op_261                   2 1 291:12_splitncnn_0 299:12 300:12 
-23330=4,3,14,14,384 0=2\nConvolution              op_262                   1 1 300:12 301:12 -23330=4,3,14,14,96 0=96 1=1 4=-233 6=36864\nEltwise                  op_263                   2 1 282:12_splitncnn_0 301:12 302:12 -23330=4,3,14,14,96 0=1\nBinaryOp                 op_264                   2 1 302:12 88:11 303:12 -23330=4,3,14,14,96\nConvolution              op_265                   1 1 303:12 305:12 -23330=4,3,14,14,576 0=576 1=1 4=-233 5=1 6=55296\nSwish                    op_266                   1 1 305:12 307:12 -23330=4,3,14,14,576\nConvolutionDepthWise     op_267                   1 1 307:12 310:12 -23330=4,3,14,14,576 0=576 1=3 4=-233 5=1 6=5184 7=576\nSwish                    op_269                   1 1 310:12 312:12 -23330=4,3,14,14,576\nSplit                    splitncnn_10             1 2 312:12 312:12_splitncnn_0 312:12_splitncnn_1 -23330=8,3,14,14,576,3,14,14,576\nReduction                op_270                   1 1 312:12_splitncnn_1 313:12 -23330=4,3,1,1,576 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_271                   1 1 313:12 315:12 -23330=4,3,1,1,24 0=24 1=1 4=-233 5=1 6=13824\nSwish                    op_272                   1 1 315:12 317:12 -23330=4,3,1,1,24\nConvolution              op_273                   1 1 317:12 320:12 -23330=4,3,1,1,576 0=576 1=1 4=-233 5=1 6=13824 9=4\nBinaryOp                 op_275                   2 1 312:12_splitncnn_0 320:12 321:12 -23330=4,3,14,14,576 0=2\nConvolution              op_276                   1 1 321:12 323:12 -23330=4,3,14,14,112 0=112 1=1 4=-233 5=1 6=64512\nSplit                    splitncnn_11             1 2 323:12 323:12_splitncnn_0 323:12_splitncnn_1 -23330=8,3,14,14,112,3,14,14,112\nConvolution              op_277                   1 1 323:12_splitncnn_1 325:12 -23330=4,3,14,14,672 0=672 1=1 4=-233 5=1 6=75264\nSwish                    op_278                   1 1 325:12 327:12 -23330=4,3,14,14,672\nConvolutionDepthWise     op_279                   1 1 
327:12 330:12 -23330=4,3,14,14,672 0=672 1=3 4=-233 5=1 6=6048 7=672\nSwish                    op_281                   1 1 330:12 332:12 -23330=4,3,14,14,672\nSplit                    splitncnn_12             1 2 332:12 332:12_splitncnn_0 332:12_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nReduction                op_282                   1 1 332:12_splitncnn_1 333:12 -23330=4,3,1,1,672 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_283                   1 1 333:12 335:12 -23330=4,3,1,1,28 0=28 1=1 4=-233 5=1 6=18816\nSwish                    op_284                   1 1 335:12 337:12 -23330=4,3,1,1,28\nConvolution              op_285                   1 1 337:12 340:12 -23330=4,3,1,1,672 0=672 1=1 4=-233 5=1 6=18816 9=4\nBinaryOp                 op_287                   2 1 332:12_splitncnn_0 340:12 341:12 -23330=4,3,14,14,672 0=2\nConvolution              op_288                   1 1 341:12 342:12 -23330=4,3,14,14,112 0=112 1=1 4=-233 6=75264\nEltwise                  op_289                   2 1 323:12_splitncnn_0 342:12 343:12 -23330=4,3,14,14,112 0=1\nBinaryOp                 op_290                   2 1 343:12 110:12 344:12 -23330=4,3,14,14,112\nSplit                    splitncnn_13             1 2 344:12 344:12_splitncnn_0 344:12_splitncnn_1 -23330=8,3,14,14,112,3,14,14,112\nConvolution              op_291                   1 1 344:12_splitncnn_1 346:12 -23330=4,3,14,14,672 0=672 1=1 4=-233 5=1 6=75264\nSwish                    op_292                   1 1 346:12 348:12 -23330=4,3,14,14,672\nConvolutionDepthWise     op_293                   1 1 348:12 351:12 -23330=4,3,14,14,672 0=672 1=3 4=-233 5=1 6=6048 7=672\nSwish                    op_295                   1 1 351:12 353:12 -23330=4,3,14,14,672\nSplit                    splitncnn_14             1 2 353:12 353:12_splitncnn_0 353:12_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nReduction                op_296                   1 1 353:12_splitncnn_1 354:12 -23330=4,3,1,1,672 0=3 1=0 
-23303=2,1,2 4=1 5=1\nConvolution              op_297                   1 1 354:12 356:12 -23330=4,3,1,1,28 0=28 1=1 4=-233 5=1 6=18816\nSwish                    op_298                   1 1 356:12 358:12 -23330=4,3,1,1,28\nConvolution              op_299                   1 1 358:12 361:12 -23330=4,3,1,1,672 0=672 1=1 4=-233 5=1 6=18816 9=4\nBinaryOp                 op_301                   2 1 353:12_splitncnn_0 361:12 362:12 -23330=4,3,14,14,672 0=2\nConvolution              op_302                   1 1 362:12 363:12 -23330=4,3,14,14,112 0=112 1=1 4=-233 6=75264\nEltwise                  op_303                   2 1 363:12 344:12_splitncnn_0 364:12 -23330=4,3,14,14,112 0=1\nBinaryOp                 op_304                   2 1 364:12 33:11 365:12 -23330=4,3,14,14,112\nSplit                    splitncnn_15             1 2 365:12 365:12_splitncnn_0 365:12_splitncnn_1 -23330=8,3,14,14,112,3,14,14,112\nConvolution              op_305                   1 1 365:12_splitncnn_1 367:12 -23330=4,3,14,14,672 0=672 1=1 4=-233 5=1 6=75264\nSwish                    op_306                   1 1 367:12 369:12 -23330=4,3,14,14,672\nConvolutionDepthWise     op_307                   1 1 369:12 372:12 -23330=4,3,14,14,672 0=672 1=3 4=-233 5=1 6=6048 7=672\nSwish                    op_309                   1 1 372:12 374:12 -23330=4,3,14,14,672\nSplit                    splitncnn_16             1 2 374:12 374:12_splitncnn_0 374:12_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nReduction                op_310                   1 1 374:12_splitncnn_1 375:12 -23330=4,3,1,1,672 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_311                   1 1 375:12 377:12 -23330=4,3,1,1,28 0=28 1=1 4=-233 5=1 6=18816\nSwish                    op_312                   1 1 377:12 379:12 -23330=4,3,1,1,28\nConvolution              op_313                   1 1 379:12 382:12 -23330=4,3,1,1,672 0=672 1=1 4=-233 5=1 6=18816 9=4\nBinaryOp                 op_315                   2 1 
374:12_splitncnn_0 382:12 383:12 -23330=4,3,14,14,672 0=2\nConvolution              op_316                   1 1 383:12 384:12 -23330=4,3,14,14,112 0=112 1=1 4=-233 6=75264\nEltwise                  op_317                   2 1 365:12_splitncnn_0 384:12 385:12 -23330=4,3,14,14,112 0=1\nBinaryOp                 op_318                   2 1 385:12 44:11 386:12 -23330=4,3,14,14,112\nSplit                    splitncnn_17             1 2 386:12 386:12_splitncnn_0 386:12_splitncnn_1 -23330=8,3,14,14,112,3,14,14,112\nConvolution              op_319                   1 1 386:12_splitncnn_1 388:12 -23330=4,3,14,14,672 0=672 1=1 4=-233 5=1 6=75264\nSwish                    op_320                   1 1 388:12 390:12 -23330=4,3,14,14,672\nConvolutionDepthWise     op_321                   1 1 390:12 393:12 -23330=4,3,14,14,672 0=672 1=3 4=-233 5=1 6=6048 7=672\nSwish                    op_323                   1 1 393:12 395:12 -23330=4,3,14,14,672\nSplit                    splitncnn_18             1 2 395:12 395:12_splitncnn_0 395:12_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nReduction                op_324                   1 1 395:12_splitncnn_1 396:12 -23330=4,3,1,1,672 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_325                   1 1 396:12 398:12 -23330=4,3,1,1,28 0=28 1=1 4=-233 5=1 6=18816\nSwish                    op_326                   1 1 398:12 400:12 -23330=4,3,1,1,28\nConvolution              op_327                   1 1 400:12 403:12 -23330=4,3,1,1,672 0=672 1=1 4=-233 5=1 6=18816 9=4\nBinaryOp                 op_329                   2 1 395:12_splitncnn_0 403:12 404:12 -23330=4,3,14,14,672 0=2\nConvolution              op_330                   1 1 404:12 405:12 -23330=4,3,14,14,112 0=112 1=1 4=-233 6=75264\nEltwise                  op_331                   2 1 386:12_splitncnn_0 405:12 406:12 -23330=4,3,14,14,112 0=1\nBinaryOp                 op_332                   2 1 406:12 55:11 407:12_splitncnn_0 -23330=4,3,14,14,112\nConvolution    
          op_334                   1 1 407:12_splitncnn_0 410:12 -23330=4,3,14,14,672 0=672 1=1 4=-233 5=1 6=75264\nSwish                    op_335                   1 1 410:12 412:12 -23330=4,3,14,14,672\nConvolutionDepthWise     op_336                   1 1 412:12 415:12 -23330=4,3,7,7,672 0=672 1=3 3=2 4=-233 5=1 6=6048 7=672\nSwish                    op_338                   1 1 415:12 417:12 -23330=4,3,7,7,672\nSplit                    splitncnn_20             1 2 417:12 417:12_splitncnn_0 417:12_splitncnn_1 -23330=8,3,7,7,672,3,7,7,672\nReduction                op_339                   1 1 417:12_splitncnn_1 418:12 -23330=4,3,1,1,672 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_340                   1 1 418:12 420:12 -23330=4,3,1,1,28 0=28 1=1 4=-233 5=1 6=18816\nSwish                    op_341                   1 1 420:12 422:12 -23330=4,3,1,1,28\nConvolution              op_342                   1 1 422:12 425:12 -23330=4,3,1,1,672 0=672 1=1 4=-233 5=1 6=18816 9=4\nBinaryOp                 op_344                   2 1 417:12_splitncnn_0 425:12 426:12 -23330=4,3,7,7,672 0=2\nConvolution              op_345                   1 1 426:12 428:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 5=1 6=129024\nSplit                    splitncnn_21             1 2 428:12 428:12_splitncnn_0 428:12_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              op_346                   1 1 428:12_splitncnn_1 430:12 -23330=4,3,7,7,1152 0=1152 1=1 4=-233 5=1 6=221184\nSwish                    op_347                   1 1 430:12 432:12 -23330=4,3,7,7,1152\nConvolutionDepthWise     op_348                   1 1 432:12 435:12 -23330=4,3,7,7,1152 0=1152 1=3 4=-233 5=1 6=10368 7=1152\nSwish                    op_350                   1 1 435:12 437:12 -23330=4,3,7,7,1152\nSplit                    splitncnn_22             1 2 437:12 437:12_splitncnn_0 437:12_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nReduction                op_351                   1 1 437:12_splitncnn_1 
438:12 -23330=4,3,1,1,1152 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_352                   1 1 438:12 440:12 -23330=4,3,1,1,48 0=48 1=1 4=-233 5=1 6=55296\nSwish                    op_353                   1 1 440:12 442:12 -23330=4,3,1,1,48\nConvolution              op_354                   1 1 442:12 445:12 -23330=4,3,1,1,1152 0=1152 1=1 4=-233 5=1 6=55296 9=4\nBinaryOp                 op_356                   2 1 437:12_splitncnn_0 445:12 446:12 -23330=4,3,7,7,1152 0=2\nConvolution              op_357                   1 1 446:12 447:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 6=221184\nEltwise                  op_358                   2 1 428:12_splitncnn_0 447:12 448:12 -23330=4,3,7,7,192 0=1\nBinaryOp                 op_359                   2 1 448:12 133:12 449:12 -23330=4,3,7,7,192\nSplit                    splitncnn_23             1 2 449:12 449:12_splitncnn_0 449:12_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              op_360                   1 1 449:12_splitncnn_1 451:12 -23330=4,3,7,7,1152 0=1152 1=1 4=-233 5=1 6=221184\nSwish                    op_361                   1 1 451:12 453:12 -23330=4,3,7,7,1152\nConvolutionDepthWise     op_362                   1 1 453:12 456:12 -23330=4,3,7,7,1152 0=1152 1=3 4=-233 5=1 6=10368 7=1152\nSwish                    op_364                   1 1 456:12 458:12 -23330=4,3,7,7,1152\nSplit                    splitncnn_24             1 2 458:12 458:12_splitncnn_0 458:12_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nReduction                op_365                   1 1 458:12_splitncnn_1 459:12 -23330=4,3,1,1,1152 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_366                   1 1 459:12 461:12 -23330=4,3,1,1,48 0=48 1=1 4=-233 5=1 6=55296\nSwish                    op_367                   1 1 461:12 463:12 -23330=4,3,1,1,48\nConvolution              op_368                   1 1 463:12 466:12 -23330=4,3,1,1,1152 0=1152 1=1 4=-233 5=1 6=55296 9=4\nBinaryOp                 op_370    
               2 1 458:12_splitncnn_0 466:12 467:12 -23330=4,3,7,7,1152 0=2\nConvolution              op_371                   1 1 467:12 468:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 6=221184\nEltwise                  op_372                   2 1 449:12_splitncnn_0 468:12 469:12 -23330=4,3,7,7,192 0=1\nBinaryOp                 op_373                   2 1 469:12 144:12 470:12 -23330=4,3,7,7,192\nSplit                    splitncnn_25             1 2 470:12 470:12_splitncnn_0 470:12_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              op_374                   1 1 470:12_splitncnn_1 472:12 -23330=4,3,7,7,1152 0=1152 1=1 4=-233 5=1 6=221184\nSwish                    op_375                   1 1 472:12 474:12 -23330=4,3,7,7,1152\nConvolutionDepthWise     op_376                   1 1 474:12 477:12 -23330=4,3,7,7,1152 0=1152 1=3 4=-233 5=1 6=10368 7=1152\nSwish                    op_378                   1 1 477:12 479:12 -23330=4,3,7,7,1152\nSplit                    splitncnn_26             1 2 479:12 479:12_splitncnn_0 479:12_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nReduction                op_379                   1 1 479:12_splitncnn_1 480:12 -23330=4,3,1,1,1152 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_380                   1 1 480:12 482:12 -23330=4,3,1,1,48 0=48 1=1 4=-233 5=1 6=55296\nSwish                    op_381                   1 1 482:12 484:12 -23330=4,3,1,1,48\nConvolution              op_382                   1 1 484:12 487:12 -23330=4,3,1,1,1152 0=1152 1=1 4=-233 5=1 6=55296 9=4\nBinaryOp                 op_384                   2 1 479:12_splitncnn_0 487:12 488:12 -23330=4,3,7,7,1152 0=2\nConvolution              op_385                   1 1 488:12 489:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 6=221184\nEltwise                  op_386                   2 1 470:12_splitncnn_0 489:12 490:12 -23330=4,3,7,7,192 0=1\nBinaryOp                 op_387                   2 1 490:12 155:12 491:12 -23330=4,3,7,7,192\nSplit               
     splitncnn_27             1 2 491:12 491:12_splitncnn_0 491:12_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              op_388                   1 1 491:12_splitncnn_1 493:12 -23330=4,3,7,7,1152 0=1152 1=1 4=-233 5=1 6=221184\nSwish                    op_389                   1 1 493:12 495:12 -23330=4,3,7,7,1152\nConvolutionDepthWise     op_390                   1 1 495:12 498:12 -23330=4,3,7,7,1152 0=1152 1=3 4=-233 5=1 6=10368 7=1152\nSwish                    op_392                   1 1 498:12 500:12 -23330=4,3,7,7,1152\nSplit                    splitncnn_28             1 2 500:12 500:12_splitncnn_0 500:12_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nReduction                op_393                   1 1 500:12_splitncnn_1 501:12 -23330=4,3,1,1,1152 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_394                   1 1 501:12 503:12 -23330=4,3,1,1,48 0=48 1=1 4=-233 5=1 6=55296\nSwish                    op_395                   1 1 503:12 505:12 -23330=4,3,1,1,48\nConvolution              op_396                   1 1 505:12 508:12 -23330=4,3,1,1,1152 0=1152 1=1 4=-233 5=1 6=55296 9=4\nBinaryOp                 op_398                   2 1 500:12_splitncnn_0 508:12 509:12 -23330=4,3,7,7,1152 0=2\nConvolution              op_399                   1 1 509:12 510:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 6=221184\nEltwise                  op_400                   2 1 491:12_splitncnn_0 510:12 511:12 -23330=4,3,7,7,192 0=1\nBinaryOp                 op_401                   2 1 511:12 166:12 512:12 -23330=4,3,7,7,192\nSplit                    splitncnn_29             1 2 512:12 512:12_splitncnn_0 512:12_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              op_402                   1 1 512:12_splitncnn_1 514:12 -23330=4,3,7,7,1152 0=1152 1=1 4=-233 5=1 6=221184\nSwish                    op_403                   1 1 514:12 516:12 -23330=4,3,7,7,1152\nConvolutionDepthWise     op_404                   1 1 516:12 519:12 
-23330=4,3,7,7,1152 0=1152 1=3 4=-233 5=1 6=10368 7=1152\nSwish                    op_406                   1 1 519:12 521:12 -23330=4,3,7,7,1152\nSplit                    splitncnn_30             1 2 521:12 521:12_splitncnn_0 521:12_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nReduction                op_407                   1 1 521:12_splitncnn_1 522:12 -23330=4,3,1,1,1152 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_408                   1 1 522:12 524:12 -23330=4,3,1,1,48 0=48 1=1 4=-233 5=1 6=55296\nSwish                    op_409                   1 1 524:12 526:12 -23330=4,3,1,1,48\nConvolution              op_410                   1 1 526:12 529:12 -23330=4,3,1,1,1152 0=1152 1=1 4=-233 5=1 6=55296 9=4\nBinaryOp                 op_412                   2 1 521:12_splitncnn_0 529:12 530:12 -23330=4,3,7,7,1152 0=2\nConvolution              op_413                   1 1 530:12 531:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 6=221184\nEltwise                  op_414                   2 1 512:12_splitncnn_0 531:12 532:12 -23330=4,3,7,7,192 0=1\nBinaryOp                 op_415                   2 1 532:12 177:12 533:12 -23330=4,3,7,7,192\nSplit                    splitncnn_31             1 2 533:12 533:12_splitncnn_0 533:12_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              op_416                   1 1 533:12_splitncnn_1 535:12 -23330=4,3,7,7,1152 0=1152 1=1 4=-233 5=1 6=221184\nSwish                    op_417                   1 1 535:12 537:12 -23330=4,3,7,7,1152\nConvolutionDepthWise     op_418                   1 1 537:12 540:12 -23330=4,3,7,7,1152 0=1152 1=3 4=-233 5=1 6=10368 7=1152\nSwish                    op_420                   1 1 540:12 542:12 -23330=4,3,7,7,1152\nSplit                    splitncnn_32             1 2 542:12 542:12_splitncnn_0 542:12_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nReduction                op_421                   1 1 542:12_splitncnn_1 543:12 -23330=4,3,1,1,1152 0=3 1=0 -23303=2,1,2 4=1 
5=1\nConvolution              op_422                   1 1 543:12 545:12 -23330=4,3,1,1,48 0=48 1=1 4=-233 5=1 6=55296\nSwish                    op_423                   1 1 545:12 547:12 -23330=4,3,1,1,48\nConvolution              op_424                   1 1 547:12 550:12 -23330=4,3,1,1,1152 0=1152 1=1 4=-233 5=1 6=55296 9=4\nBinaryOp                 op_426                   2 1 542:12_splitncnn_0 550:12 551:12 -23330=4,3,7,7,1152 0=2\nConvolution              op_427                   1 1 551:12 552:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 6=221184\nEltwise                  op_428                   2 1 533:12_splitncnn_0 552:12 553:12 -23330=4,3,7,7,192 0=1\nBinaryOp                 op_429                   2 1 553:12 188:12 554:12 -23330=4,3,7,7,192\nSplit                    splitncnn_33             1 2 554:12 554:12_splitncnn_0 554:12_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              op_430                   1 1 554:12_splitncnn_1 556:12 -23330=4,3,7,7,1152 0=1152 1=1 4=-233 5=1 6=221184\nSwish                    op_431                   1 1 556:12 558:12 -23330=4,3,7,7,1152\nConvolutionDepthWise     op_432                   1 1 558:12 561:12 -23330=4,3,7,7,1152 0=1152 1=3 4=-233 5=1 6=10368 7=1152\nSwish                    op_434                   1 1 561:12 563:12 -23330=4,3,7,7,1152\nSplit                    splitncnn_34             1 2 563:12 563:12_splitncnn_0 563:12_splitncnn_1 -23330=8,3,7,7,1152,3,7,7,1152\nReduction                op_435                   1 1 563:12_splitncnn_1 564:12 -23330=4,3,1,1,1152 0=3 1=0 -23303=2,1,2 4=1 5=1\nConvolution              op_436                   1 1 564:12 566:12 -23330=4,3,1,1,48 0=48 1=1 4=-233 5=1 6=55296\nSwish                    op_437                   1 1 566:12 568:12 -23330=4,3,1,1,48\nConvolution              op_438                   1 1 568:12 571:12 -23330=4,3,1,1,1152 0=1152 1=1 4=-233 5=1 6=55296 9=4\nBinaryOp                 op_440                   2 1 563:12_splitncnn_0 571:12 572:12 
-23330=4,3,7,7,1152 0=2\nConvolution              op_441                   1 1 572:12 573:12 -23330=4,3,7,7,192 0=192 1=1 4=-233 6=221184\nEltwise                  op_442                   2 1 554:12_splitncnn_0 573:12 574:12 -23330=4,3,7,7,192 0=1\nBinaryOp                 op_443                   2 1 574:12 199:12 575:12_splitncnn_0 -23330=4,3,7,7,192\nConvolution              op_445                   1 1 575:12_splitncnn_0 578:12 -23330=4,3,7,7,1280 0=1280 1=1 4=-233 5=1 6=245760\nSwish                    op_446                   1 1 578:12 580:12 -23330=4,3,7,7,1280\nPooling                  op_447                   1 1 580:12 581:12 -23330=4,1,1280,1,1 0=1 4=1\nInnerProduct             op_448                   1 1 581:12 584:12 -23330=4,1,1000,1,1 0=1000 1=1 2=1280000\n"
  },
  {
    "path": "benchmark/googlenet.param",
    "content": "7767517\n94 121\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1/7x7_s2             1 1 data conv1/7x7_s2_conv1/relu_7x7 -23330=4,3,112,112,64 0=64 1=7 3=2 4=3 5=1 6=9408 9=1\nPooling                  pool1/3x3_s2             1 1 conv1/7x7_s2_conv1/relu_7x7 pool1/3x3_s2 -23330=4,3,56,56,64 1=3 2=2\nLRN                      pool1/norm1              1 1 pool1/3x3_s2 pool1/norm1 -23330=4,3,56,56,64 2=1.000000e-04\nConvolution              conv2/3x3_reduce         1 1 pool1/norm1 conv2/3x3_reduce_conv2/relu_3x3_reduce -23330=4,3,56,56,64 0=64 1=1 5=1 6=4096 9=1\nConvolution              conv2/3x3                1 1 conv2/3x3_reduce_conv2/relu_3x3_reduce conv2/3x3_conv2/relu_3x3 -23330=4,3,56,56,192 0=192 1=3 4=1 5=1 6=110592 9=1\nLRN                      conv2/norm2              1 1 conv2/3x3_conv2/relu_3x3 conv2/norm2 -23330=4,3,56,56,192 2=1.000000e-04\nPooling                  pool2/3x3_s2             1 1 conv2/norm2 pool2/3x3_s2 -23330=4,3,28,28,192 1=3 2=2\nSplit                    splitncnn_0              1 4 pool2/3x3_s2 pool2/3x3_s2_splitncnn_0 pool2/3x3_s2_splitncnn_1 pool2/3x3_s2_splitncnn_2 pool2/3x3_s2_splitncnn_3 -23330=16,3,28,28,192,3,28,28,192,3,28,28,192,3,28,28,192\nConvolution              inception_3a/1x1         1 1 pool2/3x3_s2_splitncnn_3 inception_3a/1x1_inception_3a/relu_1x1 -23330=4,3,28,28,64 0=64 1=1 5=1 6=12288 9=1\nConvolution              inception_3a/3x3_reduce  1 1 pool2/3x3_s2_splitncnn_2 inception_3a/3x3_reduce_inception_3a/relu_3x3_reduce -23330=4,3,28,28,96 0=96 1=1 5=1 6=18432 9=1\nConvolution              inception_3a/3x3         1 1 inception_3a/3x3_reduce_inception_3a/relu_3x3_reduce inception_3a/3x3_inception_3a/relu_3x3 -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=110592 9=1\nConvolution              inception_3a/5x5_reduce  1 1 pool2/3x3_s2_splitncnn_1 inception_3a/5x5_reduce_inception_3a/relu_5x5_reduce -23330=4,3,28,28,16 0=16 1=1 
5=1 6=3072 9=1\nConvolution              inception_3a/5x5         1 1 inception_3a/5x5_reduce_inception_3a/relu_5x5_reduce inception_3a/5x5_inception_3a/relu_5x5 -23330=4,3,28,28,32 0=32 1=5 4=2 5=1 6=12800 9=1\nPooling                  inception_3a/pool        1 1 pool2/3x3_s2_splitncnn_0 inception_3a/pool -23330=4,3,28,28,192 1=3 3=1\nConvolution              inception_3a/pool_proj   1 1 inception_3a/pool inception_3a/pool_proj_inception_3a/relu_pool_proj -23330=4,3,28,28,32 0=32 1=1 5=1 6=6144 9=1\nConcat                   inception_3a/output      4 1 inception_3a/1x1_inception_3a/relu_1x1 inception_3a/3x3_inception_3a/relu_3x3 inception_3a/5x5_inception_3a/relu_5x5 inception_3a/pool_proj_inception_3a/relu_pool_proj inception_3a/output -23330=4,3,28,28,256\nSplit                    splitncnn_1              1 4 inception_3a/output inception_3a/output_splitncnn_0 inception_3a/output_splitncnn_1 inception_3a/output_splitncnn_2 inception_3a/output_splitncnn_3 -23330=16,3,28,28,256,3,28,28,256,3,28,28,256,3,28,28,256\nConvolution              inception_3b/1x1         1 1 inception_3a/output_splitncnn_3 inception_3b/1x1_inception_3b/relu_1x1 -23330=4,3,28,28,128 0=128 1=1 5=1 6=32768 9=1\nConvolution              inception_3b/3x3_reduce  1 1 inception_3a/output_splitncnn_2 inception_3b/3x3_reduce_inception_3b/relu_3x3_reduce -23330=4,3,28,28,128 0=128 1=1 5=1 6=32768 9=1\nConvolution              inception_3b/3x3         1 1 inception_3b/3x3_reduce_inception_3b/relu_3x3_reduce inception_3b/3x3_inception_3b/relu_3x3 -23330=4,3,28,28,192 0=192 1=3 4=1 5=1 6=221184 9=1\nConvolution              inception_3b/5x5_reduce  1 1 inception_3a/output_splitncnn_1 inception_3b/5x5_reduce_inception_3b/relu_5x5_reduce -23330=4,3,28,28,32 0=32 1=1 5=1 6=8192 9=1\nConvolution              inception_3b/5x5         1 1 inception_3b/5x5_reduce_inception_3b/relu_5x5_reduce inception_3b/5x5_inception_3b/relu_5x5 -23330=4,3,28,28,96 0=96 1=5 4=2 5=1 6=76800 9=1\nPooling                  
inception_3b/pool        1 1 inception_3a/output_splitncnn_0 inception_3b/pool -23330=4,3,28,28,256 1=3 3=1\nConvolution              inception_3b/pool_proj   1 1 inception_3b/pool inception_3b/pool_proj_inception_3b/relu_pool_proj -23330=4,3,28,28,64 0=64 1=1 5=1 6=16384 9=1\nConcat                   inception_3b/output      4 1 inception_3b/1x1_inception_3b/relu_1x1 inception_3b/3x3_inception_3b/relu_3x3 inception_3b/5x5_inception_3b/relu_5x5 inception_3b/pool_proj_inception_3b/relu_pool_proj inception_3b/output -23330=4,3,28,28,480\nPooling                  pool3/3x3_s2             1 1 inception_3b/output pool3/3x3_s2 -23330=4,3,14,14,480 1=3 2=2\nSplit                    splitncnn_2              1 4 pool3/3x3_s2 pool3/3x3_s2_splitncnn_0 pool3/3x3_s2_splitncnn_1 pool3/3x3_s2_splitncnn_2 pool3/3x3_s2_splitncnn_3 -23330=16,3,14,14,480,3,14,14,480,3,14,14,480,3,14,14,480\nConvolution              inception_4a/1x1         1 1 pool3/3x3_s2_splitncnn_3 inception_4a/1x1_inception_4a/relu_1x1 -23330=4,3,14,14,192 0=192 1=1 5=1 6=92160 9=1\nConvolution              inception_4a/3x3_reduce  1 1 pool3/3x3_s2_splitncnn_2 inception_4a/3x3_reduce_inception_4a/relu_3x3_reduce -23330=4,3,14,14,96 0=96 1=1 5=1 6=46080 9=1\nConvolution              inception_4a/3x3         1 1 inception_4a/3x3_reduce_inception_4a/relu_3x3_reduce inception_4a/3x3_inception_4a/relu_3x3 -23330=4,3,14,14,208 0=208 1=3 4=1 5=1 6=179712 9=1\nConvolution              inception_4a/5x5_reduce  1 1 pool3/3x3_s2_splitncnn_1 inception_4a/5x5_reduce_inception_4a/relu_5x5_reduce -23330=4,3,14,14,16 0=16 1=1 5=1 6=7680 9=1\nConvolution              inception_4a/5x5         1 1 inception_4a/5x5_reduce_inception_4a/relu_5x5_reduce inception_4a/5x5_inception_4a/relu_5x5 -23330=4,3,14,14,48 0=48 1=5 4=2 5=1 6=19200 9=1\nPooling                  inception_4a/pool        1 1 pool3/3x3_s2_splitncnn_0 inception_4a/pool -23330=4,3,14,14,480 1=3 3=1\nConvolution              inception_4a/pool_proj   1 1 inception_4a/pool 
inception_4a/pool_proj_inception_4a/relu_pool_proj -23330=4,3,14,14,64 0=64 1=1 5=1 6=30720 9=1\nConcat                   inception_4a/output      4 1 inception_4a/1x1_inception_4a/relu_1x1 inception_4a/3x3_inception_4a/relu_3x3 inception_4a/5x5_inception_4a/relu_5x5 inception_4a/pool_proj_inception_4a/relu_pool_proj inception_4a/output -23330=4,3,14,14,512\nSplit                    splitncnn_3              1 4 inception_4a/output inception_4a/output_splitncnn_0 inception_4a/output_splitncnn_1 inception_4a/output_splitncnn_2 inception_4a/output_splitncnn_3 -23330=16,3,14,14,512,3,14,14,512,3,14,14,512,3,14,14,512\nConvolution              inception_4b/1x1         1 1 inception_4a/output_splitncnn_3 inception_4b/1x1_inception_4b/relu_1x1 -23330=4,3,14,14,160 0=160 1=1 5=1 6=81920 9=1\nConvolution              inception_4b/3x3_reduce  1 1 inception_4a/output_splitncnn_2 inception_4b/3x3_reduce_inception_4b/relu_3x3_reduce -23330=4,3,14,14,112 0=112 1=1 5=1 6=57344 9=1\nConvolution              inception_4b/3x3         1 1 inception_4b/3x3_reduce_inception_4b/relu_3x3_reduce inception_4b/3x3_inception_4b/relu_3x3 -23330=4,3,14,14,224 0=224 1=3 4=1 5=1 6=225792 9=1\nConvolution              inception_4b/5x5_reduce  1 1 inception_4a/output_splitncnn_1 inception_4b/5x5_reduce_inception_4b/relu_5x5_reduce -23330=4,3,14,14,24 0=24 1=1 5=1 6=12288 9=1\nConvolution              inception_4b/5x5         1 1 inception_4b/5x5_reduce_inception_4b/relu_5x5_reduce inception_4b/5x5_inception_4b/relu_5x5 -23330=4,3,14,14,64 0=64 1=5 4=2 5=1 6=38400 9=1\nPooling                  inception_4b/pool        1 1 inception_4a/output_splitncnn_0 inception_4b/pool -23330=4,3,14,14,512 1=3 3=1\nConvolution              inception_4b/pool_proj   1 1 inception_4b/pool inception_4b/pool_proj_inception_4b/relu_pool_proj -23330=4,3,14,14,64 0=64 1=1 5=1 6=32768 9=1\nConcat                   inception_4b/output      4 1 inception_4b/1x1_inception_4b/relu_1x1 inception_4b/3x3_inception_4b/relu_3x3 
inception_4b/5x5_inception_4b/relu_5x5 inception_4b/pool_proj_inception_4b/relu_pool_proj inception_4b/output -23330=4,3,14,14,512\nSplit                    splitncnn_4              1 4 inception_4b/output inception_4b/output_splitncnn_0 inception_4b/output_splitncnn_1 inception_4b/output_splitncnn_2 inception_4b/output_splitncnn_3 -23330=16,3,14,14,512,3,14,14,512,3,14,14,512,3,14,14,512\nConvolution              inception_4c/1x1         1 1 inception_4b/output_splitncnn_3 inception_4c/1x1_inception_4c/relu_1x1 -23330=4,3,14,14,128 0=128 1=1 5=1 6=65536 9=1\nConvolution              inception_4c/3x3_reduce  1 1 inception_4b/output_splitncnn_2 inception_4c/3x3_reduce_inception_4c/relu_3x3_reduce -23330=4,3,14,14,128 0=128 1=1 5=1 6=65536 9=1\nConvolution              inception_4c/3x3         1 1 inception_4c/3x3_reduce_inception_4c/relu_3x3_reduce inception_4c/3x3_inception_4c/relu_3x3 -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=294912 9=1\nConvolution              inception_4c/5x5_reduce  1 1 inception_4b/output_splitncnn_1 inception_4c/5x5_reduce_inception_4c/relu_5x5_reduce -23330=4,3,14,14,24 0=24 1=1 5=1 6=12288 9=1\nConvolution              inception_4c/5x5         1 1 inception_4c/5x5_reduce_inception_4c/relu_5x5_reduce inception_4c/5x5_inception_4c/relu_5x5 -23330=4,3,14,14,64 0=64 1=5 4=2 5=1 6=38400 9=1\nPooling                  inception_4c/pool        1 1 inception_4b/output_splitncnn_0 inception_4c/pool -23330=4,3,14,14,512 1=3 3=1\nConvolution              inception_4c/pool_proj   1 1 inception_4c/pool inception_4c/pool_proj_inception_4c/relu_pool_proj -23330=4,3,14,14,64 0=64 1=1 5=1 6=32768 9=1\nConcat                   inception_4c/output      4 1 inception_4c/1x1_inception_4c/relu_1x1 inception_4c/3x3_inception_4c/relu_3x3 inception_4c/5x5_inception_4c/relu_5x5 inception_4c/pool_proj_inception_4c/relu_pool_proj inception_4c/output -23330=4,3,14,14,512\nSplit                    splitncnn_5              1 4 inception_4c/output 
inception_4c/output_splitncnn_0 inception_4c/output_splitncnn_1 inception_4c/output_splitncnn_2 inception_4c/output_splitncnn_3 -23330=16,3,14,14,512,3,14,14,512,3,14,14,512,3,14,14,512\nConvolution              inception_4d/1x1         1 1 inception_4c/output_splitncnn_3 inception_4d/1x1_inception_4d/relu_1x1 -23330=4,3,14,14,112 0=112 1=1 5=1 6=57344 9=1\nConvolution              inception_4d/3x3_reduce  1 1 inception_4c/output_splitncnn_2 inception_4d/3x3_reduce_inception_4d/relu_3x3_reduce -23330=4,3,14,14,144 0=144 1=1 5=1 6=73728 9=1\nConvolution              inception_4d/3x3         1 1 inception_4d/3x3_reduce_inception_4d/relu_3x3_reduce inception_4d/3x3_inception_4d/relu_3x3 -23330=4,3,14,14,288 0=288 1=3 4=1 5=1 6=373248 9=1\nConvolution              inception_4d/5x5_reduce  1 1 inception_4c/output_splitncnn_1 inception_4d/5x5_reduce_inception_4d/relu_5x5_reduce -23330=4,3,14,14,32 0=32 1=1 5=1 6=16384 9=1\nConvolution              inception_4d/5x5         1 1 inception_4d/5x5_reduce_inception_4d/relu_5x5_reduce inception_4d/5x5_inception_4d/relu_5x5 -23330=4,3,14,14,64 0=64 1=5 4=2 5=1 6=51200 9=1\nPooling                  inception_4d/pool        1 1 inception_4c/output_splitncnn_0 inception_4d/pool -23330=4,3,14,14,512 1=3 3=1\nConvolution              inception_4d/pool_proj   1 1 inception_4d/pool inception_4d/pool_proj_inception_4d/relu_pool_proj -23330=4,3,14,14,64 0=64 1=1 5=1 6=32768 9=1\nConcat                   inception_4d/output      4 1 inception_4d/1x1_inception_4d/relu_1x1 inception_4d/3x3_inception_4d/relu_3x3 inception_4d/5x5_inception_4d/relu_5x5 inception_4d/pool_proj_inception_4d/relu_pool_proj inception_4d/output -23330=4,3,14,14,528\nSplit                    splitncnn_6              1 4 inception_4d/output inception_4d/output_splitncnn_0 inception_4d/output_splitncnn_1 inception_4d/output_splitncnn_2 inception_4d/output_splitncnn_3 -23330=16,3,14,14,528,3,14,14,528,3,14,14,528,3,14,14,528\nConvolution              inception_4e/1x1    
     1 1 inception_4d/output_splitncnn_3 inception_4e/1x1_inception_4e/relu_1x1 -23330=4,3,14,14,256 0=256 1=1 5=1 6=135168 9=1\nConvolution              inception_4e/3x3_reduce  1 1 inception_4d/output_splitncnn_2 inception_4e/3x3_reduce_inception_4e/relu_3x3_reduce -23330=4,3,14,14,160 0=160 1=1 5=1 6=84480 9=1\nConvolution              inception_4e/3x3         1 1 inception_4e/3x3_reduce_inception_4e/relu_3x3_reduce inception_4e/3x3_inception_4e/relu_3x3 -23330=4,3,14,14,320 0=320 1=3 4=1 5=1 6=460800 9=1\nConvolution              inception_4e/5x5_reduce  1 1 inception_4d/output_splitncnn_1 inception_4e/5x5_reduce_inception_4e/relu_5x5_reduce -23330=4,3,14,14,32 0=32 1=1 5=1 6=16896 9=1\nConvolution              inception_4e/5x5         1 1 inception_4e/5x5_reduce_inception_4e/relu_5x5_reduce inception_4e/5x5_inception_4e/relu_5x5 -23330=4,3,14,14,128 0=128 1=5 4=2 5=1 6=102400 9=1\nPooling                  inception_4e/pool        1 1 inception_4d/output_splitncnn_0 inception_4e/pool -23330=4,3,14,14,528 1=3 3=1\nConvolution              inception_4e/pool_proj   1 1 inception_4e/pool inception_4e/pool_proj_inception_4e/relu_pool_proj -23330=4,3,14,14,128 0=128 1=1 5=1 6=67584 9=1\nConcat                   inception_4e/output      4 1 inception_4e/1x1_inception_4e/relu_1x1 inception_4e/3x3_inception_4e/relu_3x3 inception_4e/5x5_inception_4e/relu_5x5 inception_4e/pool_proj_inception_4e/relu_pool_proj inception_4e/output -23330=4,3,14,14,832\nPooling                  pool4/3x3_s2             1 1 inception_4e/output pool4/3x3_s2 -23330=4,3,7,7,832 1=3 2=2\nSplit                    splitncnn_7              1 4 pool4/3x3_s2 pool4/3x3_s2_splitncnn_0 pool4/3x3_s2_splitncnn_1 pool4/3x3_s2_splitncnn_2 pool4/3x3_s2_splitncnn_3 -23330=16,3,7,7,832,3,7,7,832,3,7,7,832,3,7,7,832\nConvolution              inception_5a/1x1         1 1 pool4/3x3_s2_splitncnn_3 inception_5a/1x1_inception_5a/relu_1x1 -23330=4,3,7,7,256 0=256 1=1 5=1 6=212992 9=1\nConvolution              
inception_5a/3x3_reduce  1 1 pool4/3x3_s2_splitncnn_2 inception_5a/3x3_reduce_inception_5a/relu_3x3_reduce -23330=4,3,7,7,160 0=160 1=1 5=1 6=133120 9=1\nConvolution              inception_5a/3x3         1 1 inception_5a/3x3_reduce_inception_5a/relu_3x3_reduce inception_5a/3x3_inception_5a/relu_3x3 -23330=4,3,7,7,320 0=320 1=3 4=1 5=1 6=460800 9=1\nConvolution              inception_5a/5x5_reduce  1 1 pool4/3x3_s2_splitncnn_1 inception_5a/5x5_reduce_inception_5a/relu_5x5_reduce -23330=4,3,7,7,32 0=32 1=1 5=1 6=26624 9=1\nConvolution              inception_5a/5x5         1 1 inception_5a/5x5_reduce_inception_5a/relu_5x5_reduce inception_5a/5x5_inception_5a/relu_5x5 -23330=4,3,7,7,128 0=128 1=5 4=2 5=1 6=102400 9=1\nPooling                  inception_5a/pool        1 1 pool4/3x3_s2_splitncnn_0 inception_5a/pool -23330=4,3,7,7,832 1=3 3=1\nConvolution              inception_5a/pool_proj   1 1 inception_5a/pool inception_5a/pool_proj_inception_5a/relu_pool_proj -23330=4,3,7,7,128 0=128 1=1 5=1 6=106496 9=1\nConcat                   inception_5a/output      4 1 inception_5a/1x1_inception_5a/relu_1x1 inception_5a/3x3_inception_5a/relu_3x3 inception_5a/5x5_inception_5a/relu_5x5 inception_5a/pool_proj_inception_5a/relu_pool_proj inception_5a/output -23330=4,3,7,7,832\nSplit                    splitncnn_8              1 4 inception_5a/output inception_5a/output_splitncnn_0 inception_5a/output_splitncnn_1 inception_5a/output_splitncnn_2 inception_5a/output_splitncnn_3 -23330=16,3,7,7,832,3,7,7,832,3,7,7,832,3,7,7,832\nConvolution              inception_5b/1x1         1 1 inception_5a/output_splitncnn_3 inception_5b/1x1_inception_5b/relu_1x1 -23330=4,3,7,7,384 0=384 1=1 5=1 6=319488 9=1\nConvolution              inception_5b/3x3_reduce  1 1 inception_5a/output_splitncnn_2 inception_5b/3x3_reduce_inception_5b/relu_3x3_reduce -23330=4,3,7,7,192 0=192 1=1 5=1 6=159744 9=1\nConvolution              inception_5b/3x3         1 1 inception_5b/3x3_reduce_inception_5b/relu_3x3_reduce 
inception_5b/3x3_inception_5b/relu_3x3 -23330=4,3,7,7,384 0=384 1=3 4=1 5=1 6=663552 9=1\nConvolution              inception_5b/5x5_reduce  1 1 inception_5a/output_splitncnn_1 inception_5b/5x5_reduce_inception_5b/relu_5x5_reduce -23330=4,3,7,7,48 0=48 1=1 5=1 6=39936 9=1\nConvolution              inception_5b/5x5         1 1 inception_5b/5x5_reduce_inception_5b/relu_5x5_reduce inception_5b/5x5_inception_5b/relu_5x5 -23330=4,3,7,7,128 0=128 1=5 4=2 5=1 6=153600 9=1\nPooling                  inception_5b/pool        1 1 inception_5a/output_splitncnn_0 inception_5b/pool -23330=4,3,7,7,832 1=3 3=1\nConvolution              inception_5b/pool_proj   1 1 inception_5b/pool inception_5b/pool_proj_inception_5b/relu_pool_proj -23330=4,3,7,7,128 0=128 1=1 5=1 6=106496 9=1\nConcat                   inception_5b/output      4 1 inception_5b/1x1_inception_5b/relu_1x1 inception_5b/3x3_inception_5b/relu_3x3 inception_5b/5x5_inception_5b/relu_5x5 inception_5b/pool_proj_inception_5b/relu_pool_proj inception_5b/output -23330=4,3,7,7,1024\nPooling                  pool5/7x7_s1             1 1 inception_5b/output pool5/7x7_s1_pool5/drop_7x7_s1 -23330=4,3,1,1,1024 0=1 1=7\nInnerProduct             loss3/classifier         1 1 pool5/7x7_s1_pool5/drop_7x7_s1 loss3/classifier -23330=4,1,1000,1,1 0=1000 1=1 2=1024000\nSoftmax                  prob                     1 1 loss3/classifier output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/googlenet_int8.param",
    "content": "7767517\n94 121\nInput                    data                     0 1 data 0=224 1=224 2=3\nConvolution              conv1/7x7_s2             1 1 data conv1/7x7_s2_conv1/relu_7x7 0=64 1=7 3=2 4=3 5=1 6=9408 8=2 9=1\nPooling                  pool1/3x3_s2             1 1 conv1/7x7_s2_conv1/relu_7x7 pool1/3x3_s2 1=3 2=2\nLRN                      pool1/norm1              1 1 pool1/3x3_s2 pool1/norm1 2=0.000100\nConvolution              conv2/3x3_reduce         1 1 pool1/norm1 conv2/3x3_reduce_conv2/relu_3x3_reduce 0=64 1=1 5=1 6=4096 8=102 9=1\nConvolution              conv2/3x3                1 1 conv2/3x3_reduce_conv2/relu_3x3_reduce conv2/3x3_conv2/relu_3x3 0=192 1=3 4=1 5=1 6=110592 8=2 9=1\nLRN                      conv2/norm2              1 1 conv2/3x3_conv2/relu_3x3 conv2/norm2 2=0.000100\nPooling                  pool2/3x3_s2             1 1 conv2/norm2 pool2/3x3_s2 1=3 2=2\nSplit                    splitncnn_0              1 4 pool2/3x3_s2 pool2/3x3_s2_splitncnn_0 pool2/3x3_s2_splitncnn_1 pool2/3x3_s2_splitncnn_2 pool2/3x3_s2_splitncnn_3\nConvolution              inception_3a/1x1         1 1 pool2/3x3_s2_splitncnn_3 inception_3a/1x1_inception_3a/relu_1x1 0=64 1=1 5=1 6=12288 8=2 9=1\nConvolution              inception_3a/3x3_reduce  1 1 pool2/3x3_s2_splitncnn_2 inception_3a/3x3_reduce_inception_3a/relu_3x3_reduce 0=96 1=1 5=1 6=18432 8=102 9=1\nConvolution              inception_3a/3x3         1 1 inception_3a/3x3_reduce_inception_3a/relu_3x3_reduce inception_3a/3x3_inception_3a/relu_3x3 0=128 1=3 4=1 5=1 6=110592 8=2 9=1\nConvolution              inception_3a/5x5_reduce  1 1 pool2/3x3_s2_splitncnn_1 inception_3a/5x5_reduce_inception_3a/relu_5x5_reduce 0=16 1=1 5=1 6=3072 8=102 9=1\nConvolution              inception_3a/5x5         1 1 inception_3a/5x5_reduce_inception_3a/relu_5x5_reduce inception_3a/5x5_inception_3a/relu_5x5 0=32 1=5 4=2 5=1 6=12800 8=2 9=1\nPooling                  inception_3a/pool        1 1 pool2/3x3_s2_splitncnn_0 
inception_3a/pool 1=3 3=1\nConvolution              inception_3a/pool_proj   1 1 inception_3a/pool inception_3a/pool_proj_inception_3a/relu_pool_proj 0=32 1=1 5=1 6=6144 8=2 9=1\nConcat                   inception_3a/output      4 1 inception_3a/1x1_inception_3a/relu_1x1 inception_3a/3x3_inception_3a/relu_3x3 inception_3a/5x5_inception_3a/relu_5x5 inception_3a/pool_proj_inception_3a/relu_pool_proj inception_3a/output\nSplit                    splitncnn_1              1 4 inception_3a/output inception_3a/output_splitncnn_0 inception_3a/output_splitncnn_1 inception_3a/output_splitncnn_2 inception_3a/output_splitncnn_3\nConvolution              inception_3b/1x1         1 1 inception_3a/output_splitncnn_3 inception_3b/1x1_inception_3b/relu_1x1 0=128 1=1 5=1 6=32768 8=2 9=1\nConvolution              inception_3b/3x3_reduce  1 1 inception_3a/output_splitncnn_2 inception_3b/3x3_reduce_inception_3b/relu_3x3_reduce 0=128 1=1 5=1 6=32768 8=102 9=1\nConvolution              inception_3b/3x3         1 1 inception_3b/3x3_reduce_inception_3b/relu_3x3_reduce inception_3b/3x3_inception_3b/relu_3x3 0=192 1=3 4=1 5=1 6=221184 8=2 9=1\nConvolution              inception_3b/5x5_reduce  1 1 inception_3a/output_splitncnn_1 inception_3b/5x5_reduce_inception_3b/relu_5x5_reduce 0=32 1=1 5=1 6=8192 8=102 9=1\nConvolution              inception_3b/5x5         1 1 inception_3b/5x5_reduce_inception_3b/relu_5x5_reduce inception_3b/5x5_inception_3b/relu_5x5 0=96 1=5 4=2 5=1 6=76800 8=2 9=1\nPooling                  inception_3b/pool        1 1 inception_3a/output_splitncnn_0 inception_3b/pool 1=3 3=1\nConvolution              inception_3b/pool_proj   1 1 inception_3b/pool inception_3b/pool_proj_inception_3b/relu_pool_proj 0=64 1=1 5=1 6=16384 8=2 9=1\nConcat                   inception_3b/output      4 1 inception_3b/1x1_inception_3b/relu_1x1 inception_3b/3x3_inception_3b/relu_3x3 inception_3b/5x5_inception_3b/relu_5x5 inception_3b/pool_proj_inception_3b/relu_pool_proj 
inception_3b/output\nPooling                  pool3/3x3_s2             1 1 inception_3b/output pool3/3x3_s2 1=3 2=2\nSplit                    splitncnn_2              1 4 pool3/3x3_s2 pool3/3x3_s2_splitncnn_0 pool3/3x3_s2_splitncnn_1 pool3/3x3_s2_splitncnn_2 pool3/3x3_s2_splitncnn_3\nConvolution              inception_4a/1x1         1 1 pool3/3x3_s2_splitncnn_3 inception_4a/1x1_inception_4a/relu_1x1 0=192 1=1 5=1 6=92160 8=2 9=1\nConvolution              inception_4a/3x3_reduce  1 1 pool3/3x3_s2_splitncnn_2 inception_4a/3x3_reduce_inception_4a/relu_3x3_reduce 0=96 1=1 5=1 6=46080 8=102 9=1\nConvolution              inception_4a/3x3         1 1 inception_4a/3x3_reduce_inception_4a/relu_3x3_reduce inception_4a/3x3_inception_4a/relu_3x3 0=208 1=3 4=1 5=1 6=179712 8=2 9=1\nConvolution              inception_4a/5x5_reduce  1 1 pool3/3x3_s2_splitncnn_1 inception_4a/5x5_reduce_inception_4a/relu_5x5_reduce 0=16 1=1 5=1 6=7680 8=102 9=1\nConvolution              inception_4a/5x5         1 1 inception_4a/5x5_reduce_inception_4a/relu_5x5_reduce inception_4a/5x5_inception_4a/relu_5x5 0=48 1=5 4=2 5=1 6=19200 8=2 9=1\nPooling                  inception_4a/pool        1 1 pool3/3x3_s2_splitncnn_0 inception_4a/pool 1=3 3=1\nConvolution              inception_4a/pool_proj   1 1 inception_4a/pool inception_4a/pool_proj_inception_4a/relu_pool_proj 0=64 1=1 5=1 6=30720 8=2 9=1\nConcat                   inception_4a/output      4 1 inception_4a/1x1_inception_4a/relu_1x1 inception_4a/3x3_inception_4a/relu_3x3 inception_4a/5x5_inception_4a/relu_5x5 inception_4a/pool_proj_inception_4a/relu_pool_proj inception_4a/output\nSplit                    splitncnn_3              1 4 inception_4a/output inception_4a/output_splitncnn_0 inception_4a/output_splitncnn_1 inception_4a/output_splitncnn_2 inception_4a/output_splitncnn_3\nConvolution              inception_4b/1x1         1 1 inception_4a/output_splitncnn_3 inception_4b/1x1_inception_4b/relu_1x1 0=160 1=1 5=1 6=81920 8=2 9=1\nConvolution     
         inception_4b/3x3_reduce  1 1 inception_4a/output_splitncnn_2 inception_4b/3x3_reduce_inception_4b/relu_3x3_reduce 0=112 1=1 5=1 6=57344 8=102 9=1\nConvolution              inception_4b/3x3         1 1 inception_4b/3x3_reduce_inception_4b/relu_3x3_reduce inception_4b/3x3_inception_4b/relu_3x3 0=224 1=3 4=1 5=1 6=225792 8=2 9=1\nConvolution              inception_4b/5x5_reduce  1 1 inception_4a/output_splitncnn_1 inception_4b/5x5_reduce_inception_4b/relu_5x5_reduce 0=24 1=1 5=1 6=12288 8=102 9=1\nConvolution              inception_4b/5x5         1 1 inception_4b/5x5_reduce_inception_4b/relu_5x5_reduce inception_4b/5x5_inception_4b/relu_5x5 0=64 1=5 4=2 5=1 6=38400 8=2 9=1\nPooling                  inception_4b/pool        1 1 inception_4a/output_splitncnn_0 inception_4b/pool 1=3 3=1\nConvolution              inception_4b/pool_proj   1 1 inception_4b/pool inception_4b/pool_proj_inception_4b/relu_pool_proj 0=64 1=1 5=1 6=32768 8=2 9=1\nConcat                   inception_4b/output      4 1 inception_4b/1x1_inception_4b/relu_1x1 inception_4b/3x3_inception_4b/relu_3x3 inception_4b/5x5_inception_4b/relu_5x5 inception_4b/pool_proj_inception_4b/relu_pool_proj inception_4b/output\nSplit                    splitncnn_4              1 4 inception_4b/output inception_4b/output_splitncnn_0 inception_4b/output_splitncnn_1 inception_4b/output_splitncnn_2 inception_4b/output_splitncnn_3\nConvolution              inception_4c/1x1         1 1 inception_4b/output_splitncnn_3 inception_4c/1x1_inception_4c/relu_1x1 0=128 1=1 5=1 6=65536 8=2 9=1\nConvolution              inception_4c/3x3_reduce  1 1 inception_4b/output_splitncnn_2 inception_4c/3x3_reduce_inception_4c/relu_3x3_reduce 0=128 1=1 5=1 6=65536 8=102 9=1\nConvolution              inception_4c/3x3         1 1 inception_4c/3x3_reduce_inception_4c/relu_3x3_reduce inception_4c/3x3_inception_4c/relu_3x3 0=256 1=3 4=1 5=1 6=294912 8=2 9=1\nConvolution              inception_4c/5x5_reduce  1 1 inception_4b/output_splitncnn_1 
inception_4c/5x5_reduce_inception_4c/relu_5x5_reduce 0=24 1=1 5=1 6=12288 8=102 9=1\nConvolution              inception_4c/5x5         1 1 inception_4c/5x5_reduce_inception_4c/relu_5x5_reduce inception_4c/5x5_inception_4c/relu_5x5 0=64 1=5 4=2 5=1 6=38400 8=2 9=1\nPooling                  inception_4c/pool        1 1 inception_4b/output_splitncnn_0 inception_4c/pool 1=3 3=1\nConvolution              inception_4c/pool_proj   1 1 inception_4c/pool inception_4c/pool_proj_inception_4c/relu_pool_proj 0=64 1=1 5=1 6=32768 8=2 9=1\nConcat                   inception_4c/output      4 1 inception_4c/1x1_inception_4c/relu_1x1 inception_4c/3x3_inception_4c/relu_3x3 inception_4c/5x5_inception_4c/relu_5x5 inception_4c/pool_proj_inception_4c/relu_pool_proj inception_4c/output\nSplit                    splitncnn_5              1 4 inception_4c/output inception_4c/output_splitncnn_0 inception_4c/output_splitncnn_1 inception_4c/output_splitncnn_2 inception_4c/output_splitncnn_3\nConvolution              inception_4d/1x1         1 1 inception_4c/output_splitncnn_3 inception_4d/1x1_inception_4d/relu_1x1 0=112 1=1 5=1 6=57344 8=2 9=1\nConvolution              inception_4d/3x3_reduce  1 1 inception_4c/output_splitncnn_2 inception_4d/3x3_reduce_inception_4d/relu_3x3_reduce 0=144 1=1 5=1 6=73728 8=102 9=1\nConvolution              inception_4d/3x3         1 1 inception_4d/3x3_reduce_inception_4d/relu_3x3_reduce inception_4d/3x3_inception_4d/relu_3x3 0=288 1=3 4=1 5=1 6=373248 8=2 9=1\nConvolution              inception_4d/5x5_reduce  1 1 inception_4c/output_splitncnn_1 inception_4d/5x5_reduce_inception_4d/relu_5x5_reduce 0=32 1=1 5=1 6=16384 8=102 9=1\nConvolution              inception_4d/5x5         1 1 inception_4d/5x5_reduce_inception_4d/relu_5x5_reduce inception_4d/5x5_inception_4d/relu_5x5 0=64 1=5 4=2 5=1 6=51200 8=2 9=1\nPooling                  inception_4d/pool        1 1 inception_4c/output_splitncnn_0 inception_4d/pool 1=3 3=1\nConvolution              inception_4d/pool_proj  
 1 1 inception_4d/pool inception_4d/pool_proj_inception_4d/relu_pool_proj 0=64 1=1 5=1 6=32768 8=2 9=1\nConcat                   inception_4d/output      4 1 inception_4d/1x1_inception_4d/relu_1x1 inception_4d/3x3_inception_4d/relu_3x3 inception_4d/5x5_inception_4d/relu_5x5 inception_4d/pool_proj_inception_4d/relu_pool_proj inception_4d/output\nSplit                    splitncnn_6              1 4 inception_4d/output inception_4d/output_splitncnn_0 inception_4d/output_splitncnn_1 inception_4d/output_splitncnn_2 inception_4d/output_splitncnn_3\nConvolution              inception_4e/1x1         1 1 inception_4d/output_splitncnn_3 inception_4e/1x1_inception_4e/relu_1x1 0=256 1=1 5=1 6=135168 8=2 9=1\nConvolution              inception_4e/3x3_reduce  1 1 inception_4d/output_splitncnn_2 inception_4e/3x3_reduce_inception_4e/relu_3x3_reduce 0=160 1=1 5=1 6=84480 8=102 9=1\nConvolution              inception_4e/3x3         1 1 inception_4e/3x3_reduce_inception_4e/relu_3x3_reduce inception_4e/3x3_inception_4e/relu_3x3 0=320 1=3 4=1 5=1 6=460800 8=2 9=1\nConvolution              inception_4e/5x5_reduce  1 1 inception_4d/output_splitncnn_1 inception_4e/5x5_reduce_inception_4e/relu_5x5_reduce 0=32 1=1 5=1 6=16896 8=102 9=1\nConvolution              inception_4e/5x5         1 1 inception_4e/5x5_reduce_inception_4e/relu_5x5_reduce inception_4e/5x5_inception_4e/relu_5x5 0=128 1=5 4=2 5=1 6=102400 8=2 9=1\nPooling                  inception_4e/pool        1 1 inception_4d/output_splitncnn_0 inception_4e/pool 1=3 3=1\nConvolution              inception_4e/pool_proj   1 1 inception_4e/pool inception_4e/pool_proj_inception_4e/relu_pool_proj 0=128 1=1 5=1 6=67584 8=2 9=1\nConcat                   inception_4e/output      4 1 inception_4e/1x1_inception_4e/relu_1x1 inception_4e/3x3_inception_4e/relu_3x3 inception_4e/5x5_inception_4e/relu_5x5 inception_4e/pool_proj_inception_4e/relu_pool_proj inception_4e/output\nPooling                  pool4/3x3_s2             1 1 inception_4e/output 
pool4/3x3_s2 1=3 2=2\nSplit                    splitncnn_7              1 4 pool4/3x3_s2 pool4/3x3_s2_splitncnn_0 pool4/3x3_s2_splitncnn_1 pool4/3x3_s2_splitncnn_2 pool4/3x3_s2_splitncnn_3\nConvolution              inception_5a/1x1         1 1 pool4/3x3_s2_splitncnn_3 inception_5a/1x1_inception_5a/relu_1x1 0=256 1=1 5=1 6=212992 8=2 9=1\nConvolution              inception_5a/3x3_reduce  1 1 pool4/3x3_s2_splitncnn_2 inception_5a/3x3_reduce_inception_5a/relu_3x3_reduce 0=160 1=1 5=1 6=133120 8=102 9=1\nConvolution              inception_5a/3x3         1 1 inception_5a/3x3_reduce_inception_5a/relu_3x3_reduce inception_5a/3x3_inception_5a/relu_3x3 0=320 1=3 4=1 5=1 6=460800 8=2 9=1\nConvolution              inception_5a/5x5_reduce  1 1 pool4/3x3_s2_splitncnn_1 inception_5a/5x5_reduce_inception_5a/relu_5x5_reduce 0=32 1=1 5=1 6=26624 8=102 9=1\nConvolution              inception_5a/5x5         1 1 inception_5a/5x5_reduce_inception_5a/relu_5x5_reduce inception_5a/5x5_inception_5a/relu_5x5 0=128 1=5 4=2 5=1 6=102400 8=2 9=1\nPooling                  inception_5a/pool        1 1 pool4/3x3_s2_splitncnn_0 inception_5a/pool 1=3 3=1\nConvolution              inception_5a/pool_proj   1 1 inception_5a/pool inception_5a/pool_proj_inception_5a/relu_pool_proj 0=128 1=1 5=1 6=106496 8=2 9=1\nConcat                   inception_5a/output      4 1 inception_5a/1x1_inception_5a/relu_1x1 inception_5a/3x3_inception_5a/relu_3x3 inception_5a/5x5_inception_5a/relu_5x5 inception_5a/pool_proj_inception_5a/relu_pool_proj inception_5a/output\nSplit                    splitncnn_8              1 4 inception_5a/output inception_5a/output_splitncnn_0 inception_5a/output_splitncnn_1 inception_5a/output_splitncnn_2 inception_5a/output_splitncnn_3\nConvolution              inception_5b/1x1         1 1 inception_5a/output_splitncnn_3 inception_5b/1x1_inception_5b/relu_1x1 0=384 1=1 5=1 6=319488 8=2 9=1\nConvolution              inception_5b/3x3_reduce  1 1 inception_5a/output_splitncnn_2 
inception_5b/3x3_reduce_inception_5b/relu_3x3_reduce 0=192 1=1 5=1 6=159744 8=102 9=1\nConvolution              inception_5b/3x3         1 1 inception_5b/3x3_reduce_inception_5b/relu_3x3_reduce inception_5b/3x3_inception_5b/relu_3x3 0=384 1=3 4=1 5=1 6=663552 8=2 9=1\nConvolution              inception_5b/5x5_reduce  1 1 inception_5a/output_splitncnn_1 inception_5b/5x5_reduce_inception_5b/relu_5x5_reduce 0=48 1=1 5=1 6=39936 8=102 9=1\nConvolution              inception_5b/5x5         1 1 inception_5b/5x5_reduce_inception_5b/relu_5x5_reduce inception_5b/5x5_inception_5b/relu_5x5 0=128 1=5 4=2 5=1 6=153600 8=2 9=1\nPooling                  inception_5b/pool        1 1 inception_5a/output_splitncnn_0 inception_5b/pool 1=3 3=1\nConvolution              inception_5b/pool_proj   1 1 inception_5b/pool inception_5b/pool_proj_inception_5b/relu_pool_proj 0=128 1=1 5=1 6=106496 8=2 9=1\nConcat                   inception_5b/output      4 1 inception_5b/1x1_inception_5b/relu_1x1 inception_5b/3x3_inception_5b/relu_3x3 inception_5b/5x5_inception_5b/relu_5x5 inception_5b/pool_proj_inception_5b/relu_pool_proj inception_5b/output\nPooling                  pool5/7x7_s1             1 1 inception_5b/output pool5/7x7_s1_pool5/drop_7x7_s1 0=1 1=7\nInnerProduct             loss3/classifier         1 1 pool5/7x7_s1_pool5/drop_7x7_s1 loss3/classifier 0=1000 1=1 2=1024000\nSoftmax                  prob                     1 1 loss3/classifier output\n"
  },
  {
    "path": "benchmark/mnasnet.param",
    "content": "7767517\n76 86\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              first-3x3-conv           1 1 data first-3x3-conv_relu -23330=4,3,112,112,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nConvolutionDepthWise     A0_dw                    1 1 first-3x3-conv_relu A0_dw_relu -23330=4,3,112,112,32 0=32 1=3 4=1 5=1 6=288 7=32 9=1\nConvolution              A0_linear                1 1 A0_dw_relu A0_linear_bn -23330=4,3,112,112,16 0=16 1=1 5=1 6=512\nConvolution              B0_expand                1 1 A0_linear_bn B0_expand_relu -23330=4,3,112,112,48 0=48 1=1 5=1 6=768 9=1\nConvolutionDepthWise     B0_dw                    1 1 B0_expand_relu B0_dw_relu -23330=4,3,56,56,48 0=48 1=3 3=2 4=1 5=1 6=432 7=48 9=1\nConvolution              B0_linear                1 1 B0_dw_relu B0_linear_bn -23330=4,3,56,56,24 0=24 1=1 5=1 6=1152\nSplit                    splitncnn_0              1 2 B0_linear_bn B0_linear_bn_splitncnn_0 B0_linear_bn_splitncnn_1 -23330=8,3,56,56,24,3,56,56,24\nConvolution              B1_expand                1 1 B0_linear_bn_splitncnn_1 B1_expand_relu -23330=4,3,56,56,72 0=72 1=1 5=1 6=1728 9=1\nConvolutionDepthWise     B1_dw                    1 1 B1_expand_relu B1_dw_relu -23330=4,3,56,56,72 0=72 1=3 4=1 5=1 6=648 7=72 9=1\nConvolution              B1_linear                1 1 B1_dw_relu B1_linear_bn -23330=4,3,56,56,24 0=24 1=1 5=1 6=1728\nBinaryOp                 unknownncnn_0            2 1 B0_linear_bn_splitncnn_0 B1_linear_bn unknownncnn_0 -23330=4,3,56,56,24\nSplit                    splitncnn_1              1 2 unknownncnn_0 unknownncnn_0_splitncnn_0 unknownncnn_0_splitncnn_1 -23330=8,3,56,56,24,3,56,56,24\nConvolution              B2_expand                1 1 unknownncnn_0_splitncnn_1 B2_expand_relu -23330=4,3,56,56,72 0=72 1=1 5=1 6=1728 9=1\nConvolutionDepthWise     B2_dw                    1 1 B2_expand_relu B2_dw_relu -23330=4,3,56,56,72 0=72 1=3 4=1 5=1 6=648 7=72 
9=1\nConvolution              B2_linear                1 1 B2_dw_relu B2_linear_bn -23330=4,3,56,56,24 0=24 1=1 5=1 6=1728\nBinaryOp                 unknownncnn_1            2 1 unknownncnn_0_splitncnn_0 B2_linear_bn unknownncnn_1 -23330=4,3,56,56,24\nConvolution              C0_expand                1 1 unknownncnn_1 C0_expand_relu -23330=4,3,56,56,72 0=72 1=1 5=1 6=1728 9=1\nConvolutionDepthWise     C0_dw                    1 1 C0_expand_relu C0_dw_relu -23330=4,3,28,28,72 0=72 1=5 3=2 4=2 5=1 6=1800 7=72 9=1\nConvolution              C0_linear                1 1 C0_dw_relu C0_linear_bn -23330=4,3,28,28,40 0=40 1=1 5=1 6=2880\nSplit                    splitncnn_2              1 2 C0_linear_bn C0_linear_bn_splitncnn_0 C0_linear_bn_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              C1_expand                1 1 C0_linear_bn_splitncnn_1 C1_expand_relu -23330=4,3,28,28,120 0=120 1=1 5=1 6=4800 9=1\nConvolutionDepthWise     C1_dw                    1 1 C1_expand_relu C1_dw_relu -23330=4,3,28,28,120 0=120 1=5 4=2 5=1 6=3000 7=120 9=1\nConvolution              C1_linear                1 1 C1_dw_relu C1_linear_bn -23330=4,3,28,28,40 0=40 1=1 5=1 6=4800\nBinaryOp                 unknownncnn_2            2 1 C0_linear_bn_splitncnn_0 C1_linear_bn unknownncnn_2 -23330=4,3,28,28,40\nSplit                    splitncnn_3              1 2 unknownncnn_2 unknownncnn_2_splitncnn_0 unknownncnn_2_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              C2_expand                1 1 unknownncnn_2_splitncnn_1 C2_expand_relu -23330=4,3,28,28,120 0=120 1=1 5=1 6=4800 9=1\nConvolutionDepthWise     C2_dw                    1 1 C2_expand_relu C2_dw_relu -23330=4,3,28,28,120 0=120 1=5 4=2 5=1 6=3000 7=120 9=1\nConvolution              C2_linear                1 1 C2_dw_relu C2_linear_bn -23330=4,3,28,28,40 0=40 1=1 5=1 6=4800\nBinaryOp                 unknownncnn_3            2 1 unknownncnn_2_splitncnn_0 C2_linear_bn unknownncnn_3 -23330=4,3,28,28,40\nConvolution 
             D0_expand                1 1 unknownncnn_3 D0_expand_relu -23330=4,3,28,28,240 0=240 1=1 5=1 6=9600 9=1\nConvolutionDepthWise     D0_dw                    1 1 D0_expand_relu D0_dw_relu -23330=4,3,14,14,240 0=240 1=5 3=2 4=2 5=1 6=6000 7=240 9=1\nConvolution              D0_linear                1 1 D0_dw_relu D0_linear_bn -23330=4,3,14,14,80 0=80 1=1 5=1 6=19200\nSplit                    splitncnn_4              1 2 D0_linear_bn D0_linear_bn_splitncnn_0 D0_linear_bn_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              D1_expand                1 1 D0_linear_bn_splitncnn_1 D1_expand_relu -23330=4,3,14,14,480 0=480 1=1 5=1 6=38400 9=1\nConvolutionDepthWise     D1_dw                    1 1 D1_expand_relu D1_dw_relu -23330=4,3,14,14,480 0=480 1=5 4=2 5=1 6=12000 7=480 9=1\nConvolution              D1_linear                1 1 D1_dw_relu D1_linear_bn -23330=4,3,14,14,80 0=80 1=1 5=1 6=38400\nBinaryOp                 unknownncnn_4            2 1 D0_linear_bn_splitncnn_0 D1_linear_bn unknownncnn_4 -23330=4,3,14,14,80\nSplit                    splitncnn_5              1 2 unknownncnn_4 unknownncnn_4_splitncnn_0 unknownncnn_4_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              D2_expand                1 1 unknownncnn_4_splitncnn_1 D2_expand_relu -23330=4,3,14,14,480 0=480 1=1 5=1 6=38400 9=1\nConvolutionDepthWise     D2_dw                    1 1 D2_expand_relu D2_dw_relu -23330=4,3,14,14,480 0=480 1=5 4=2 5=1 6=12000 7=480 9=1\nConvolution              D2_linear                1 1 D2_dw_relu D2_linear_bn -23330=4,3,14,14,80 0=80 1=1 5=1 6=38400\nBinaryOp                 unknownncnn_5            2 1 unknownncnn_4_splitncnn_0 D2_linear_bn unknownncnn_5 -23330=4,3,14,14,80\nConvolution              E0_expand                1 1 unknownncnn_5 E0_expand_relu -23330=4,3,14,14,480 0=480 1=1 5=1 6=38400 9=1\nConvolutionDepthWise     E0_dw                    1 1 E0_expand_relu E0_dw_relu -23330=4,3,14,14,480 0=480 1=3 4=1 5=1 6=4320 7=480 
9=1\nConvolution              E0_linear                1 1 E0_dw_relu E0_linear_bn -23330=4,3,14,14,96 0=96 1=1 5=1 6=46080\nSplit                    splitncnn_6              1 2 E0_linear_bn E0_linear_bn_splitncnn_0 E0_linear_bn_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              E1_expand                1 1 E0_linear_bn_splitncnn_1 E1_expand_relu -23330=4,3,14,14,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     E1_dw                    1 1 E1_expand_relu E1_dw_relu -23330=4,3,14,14,576 0=576 1=3 4=1 5=1 6=5184 7=576 9=1\nConvolution              E1_linear                1 1 E1_dw_relu E1_linear_bn -23330=4,3,14,14,96 0=96 1=1 5=1 6=55296\nBinaryOp                 unknownncnn_6            2 1 E0_linear_bn_splitncnn_0 E1_linear_bn unknownncnn_6 -23330=4,3,14,14,96\nConvolution              F0_expand                1 1 unknownncnn_6 F0_expand_relu -23330=4,3,14,14,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     F0_dw                    1 1 F0_expand_relu F0_dw_relu -23330=4,3,7,7,576 0=576 1=5 3=2 4=2 5=1 6=14400 7=576 9=1\nConvolution              F0_linear                1 1 F0_dw_relu F0_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=110592\nSplit                    splitncnn_7              1 2 F0_linear_bn F0_linear_bn_splitncnn_0 F0_linear_bn_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              F1_expand                1 1 F0_linear_bn_splitncnn_1 F1_expand_relu -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184 9=1\nConvolutionDepthWise     F1_dw                    1 1 F1_expand_relu F1_dw_relu -23330=4,3,7,7,1152 0=1152 1=5 4=2 5=1 6=28800 7=1152 9=1\nConvolution              F1_linear                1 1 F1_dw_relu F1_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=221184\nBinaryOp                 unknownncnn_7            2 1 F0_linear_bn_splitncnn_0 F1_linear_bn unknownncnn_7 -23330=4,3,7,7,192\nSplit                    splitncnn_8              1 2 unknownncnn_7 unknownncnn_7_splitncnn_0 unknownncnn_7_splitncnn_1 
-23330=8,3,7,7,192,3,7,7,192\nConvolution              F2_expand                1 1 unknownncnn_7_splitncnn_1 F2_expand_relu -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184 9=1\nConvolutionDepthWise     F2_dw                    1 1 F2_expand_relu F2_dw_relu -23330=4,3,7,7,1152 0=1152 1=5 4=2 5=1 6=28800 7=1152 9=1\nConvolution              F2_linear                1 1 F2_dw_relu F2_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=221184\nBinaryOp                 unknownncnn_8            2 1 unknownncnn_7_splitncnn_0 F2_linear_bn unknownncnn_8 -23330=4,3,7,7,192\nSplit                    splitncnn_9              1 2 unknownncnn_8 unknownncnn_8_splitncnn_0 unknownncnn_8_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              F3_expand                1 1 unknownncnn_8_splitncnn_1 F3_expand_relu -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184 9=1\nConvolutionDepthWise     F3_dw                    1 1 F3_expand_relu F3_dw_relu -23330=4,3,7,7,1152 0=1152 1=5 4=2 5=1 6=28800 7=1152 9=1\nConvolution              F3_linear                1 1 F3_dw_relu F3_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=221184\nBinaryOp                 unknownncnn_9            2 1 unknownncnn_8_splitncnn_0 F3_linear_bn unknownncnn_9 -23330=4,3,7,7,192\nConvolution              G0_expand                1 1 unknownncnn_9 G0_expand_relu -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184 9=1\nConvolutionDepthWise     G0_dw                    1 1 G0_expand_relu G0_dw_relu -23330=4,3,7,7,1152 0=1152 1=3 4=1 5=1 6=10368 7=1152 9=1\nConvolution              G0_linear                1 1 G0_dw_relu G0_linear_bn -23330=4,3,7,7,320 0=320 1=1 5=1 6=368640\nConvolution              last-1x1-conv            1 1 G0_linear_bn last-1x1-conv_relu -23330=4,3,7,7,1280 0=1280 1=1 5=1 6=409600 9=1\nPooling                  avgpool                  1 1 last-1x1-conv_relu flatten -23330=4,1,1280,1,1 0=1 1=7 4=1 5=1\nInnerProduct             fc                       1 1 flatten fc -23330=4,1,1000,1,1 0=1000 1=1 
2=1280000\nSoftmax                  prob                     1 1 fc output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/mobilenet.param",
    "content": "7767517\n31 31\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_relu1 -23330=4,3,112,112,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nConvolutionDepthWise     conv2_1/dw               1 1 conv1_relu1 conv2_1/dw_relu2_1/dw -23330=4,3,112,112,32 0=32 1=3 4=1 5=1 6=288 7=32 9=1\nConvolution              conv2_1/sep              1 1 conv2_1/dw_relu2_1/dw conv2_1/sep_relu2_1/sep -23330=4,3,112,112,64 0=64 1=1 5=1 6=2048 9=1\nConvolutionDepthWise     conv2_2/dw               1 1 conv2_1/sep_relu2_1/sep conv2_2/dw_relu2_2/dw -23330=4,3,56,56,64 0=64 1=3 3=2 4=1 5=1 6=576 7=64 9=1\nConvolution              conv2_2/sep              1 1 conv2_2/dw_relu2_2/dw conv2_2/sep_relu2_2/sep -23330=4,3,56,56,128 0=128 1=1 5=1 6=8192 9=1\nConvolutionDepthWise     conv3_1/dw               1 1 conv2_2/sep_relu2_2/sep conv3_1/dw_relu3_1/dw -23330=4,3,56,56,128 0=128 1=3 4=1 5=1 6=1152 7=128 9=1\nConvolution              conv3_1/sep              1 1 conv3_1/dw_relu3_1/dw conv3_1/sep_relu3_1/sep -23330=4,3,56,56,128 0=128 1=1 5=1 6=16384 9=1\nConvolutionDepthWise     conv3_2/dw               1 1 conv3_1/sep_relu3_1/sep conv3_2/dw_relu3_2/dw -23330=4,3,28,28,128 0=128 1=3 3=2 4=1 5=1 6=1152 7=128 9=1\nConvolution              conv3_2/sep              1 1 conv3_2/dw_relu3_2/dw conv3_2/sep_relu3_2/sep -23330=4,3,28,28,256 0=256 1=1 5=1 6=32768 9=1\nConvolutionDepthWise     conv4_1/dw               1 1 conv3_2/sep_relu3_2/sep conv4_1/dw_relu4_1/dw -23330=4,3,28,28,256 0=256 1=3 4=1 5=1 6=2304 7=256 9=1\nConvolution              conv4_1/sep              1 1 conv4_1/dw_relu4_1/dw conv4_1/sep_relu4_1/sep -23330=4,3,28,28,256 0=256 1=1 5=1 6=65536 9=1\nConvolutionDepthWise     conv4_2/dw               1 1 conv4_1/sep_relu4_1/sep conv4_2/dw_relu4_2/dw -23330=4,3,14,14,256 0=256 1=3 3=2 4=1 5=1 6=2304 7=256 9=1\nConvolution              conv4_2/sep              1 1 
conv4_2/dw_relu4_2/dw conv4_2/sep_relu4_2/sep -23330=4,3,14,14,512 0=512 1=1 5=1 6=131072 9=1\nConvolutionDepthWise     conv5_1/dw               1 1 conv4_2/sep_relu4_2/sep conv5_1/dw_relu5_1/dw -23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv5_1/sep              1 1 conv5_1/dw_relu5_1/dw conv5_1/sep_relu5_1/sep -23330=4,3,14,14,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv5_2/dw               1 1 conv5_1/sep_relu5_1/sep conv5_2/dw_relu5_2/dw -23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv5_2/sep              1 1 conv5_2/dw_relu5_2/dw conv5_2/sep_relu5_2/sep -23330=4,3,14,14,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv5_3/dw               1 1 conv5_2/sep_relu5_2/sep conv5_3/dw_relu5_3/dw -23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv5_3/sep              1 1 conv5_3/dw_relu5_3/dw conv5_3/sep_relu5_3/sep -23330=4,3,14,14,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv5_4/dw               1 1 conv5_3/sep_relu5_3/sep conv5_4/dw_relu5_4/dw -23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv5_4/sep              1 1 conv5_4/dw_relu5_4/dw conv5_4/sep_relu5_4/sep -23330=4,3,14,14,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv5_5/dw               1 1 conv5_4/sep_relu5_4/sep conv5_5/dw_relu5_5/dw -23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv5_5/sep              1 1 conv5_5/dw_relu5_5/dw conv5_5/sep_relu5_5/sep -23330=4,3,14,14,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv5_6/dw               1 1 conv5_5/sep_relu5_5/sep conv5_6/dw_relu5_6/dw -23330=4,3,7,7,512 0=512 1=3 3=2 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv5_6/sep              1 1 conv5_6/dw_relu5_6/dw conv5_6/sep_relu5_6/sep -23330=4,3,7,7,1024 0=1024 1=1 5=1 6=524288 9=1\nConvolutionDepthWise     conv6/dw                 1 1 
conv5_6/sep_relu5_6/sep conv6/dw_relu6/dw -23330=4,3,7,7,1024 0=1024 1=3 4=1 5=1 6=9216 7=1024 9=1\nConvolution              conv6/sep                1 1 conv6/dw_relu6/dw conv6/sep_relu6/sep -23330=4,3,7,7,1024 0=1024 1=1 5=1 6=1048576 9=1\nPooling                  pool6                    1 1 conv6/sep_relu6/sep pool6 -23330=4,1,1024,1,1 0=1 4=1\nInnerProduct             fc7                      1 1 pool6 fc7 -23330=4,1,1000,1,1 0=1000 1=1 2=1024000\nSoftmax                  prob                     1 1 fc7 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/mobilenet_int8.param",
    "content": "7767517\n31 31\nInput                    data                     0 1 data 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_relu1 0=32 1=3 3=2 4=1 5=1 6=864 8=102 9=1\nConvolutionDepthWise     conv2_1/dw               1 1 conv1_relu1 conv2_1/dw_relu2_1/dw 0=32 1=3 4=1 5=1 6=288 7=32 8=101 9=1\nConvolution              conv2_1/sep              1 1 conv2_1/dw_relu2_1/dw conv2_1/sep_relu2_1/sep 0=64 1=1 5=1 6=2048 8=102 9=1\nConvolutionDepthWise     conv2_2/dw               1 1 conv2_1/sep_relu2_1/sep conv2_2/dw_relu2_2/dw 0=64 1=3 3=2 4=1 5=1 6=576 7=64 8=101 9=1\nConvolution              conv2_2/sep              1 1 conv2_2/dw_relu2_2/dw conv2_2/sep_relu2_2/sep 0=128 1=1 5=1 6=8192 8=102 9=1\nConvolutionDepthWise     conv3_1/dw               1 1 conv2_2/sep_relu2_2/sep conv3_1/dw_relu3_1/dw 0=128 1=3 4=1 5=1 6=1152 7=128 8=101 9=1\nConvolution              conv3_1/sep              1 1 conv3_1/dw_relu3_1/dw conv3_1/sep_relu3_1/sep 0=128 1=1 5=1 6=16384 8=102 9=1\nConvolutionDepthWise     conv3_2/dw               1 1 conv3_1/sep_relu3_1/sep conv3_2/dw_relu3_2/dw 0=128 1=3 3=2 4=1 5=1 6=1152 7=128 8=101 9=1\nConvolution              conv3_2/sep              1 1 conv3_2/dw_relu3_2/dw conv3_2/sep_relu3_2/sep 0=256 1=1 5=1 6=32768 8=102 9=1\nConvolutionDepthWise     conv4_1/dw               1 1 conv3_2/sep_relu3_2/sep conv4_1/dw_relu4_1/dw 0=256 1=3 4=1 5=1 6=2304 7=256 8=101 9=1\nConvolution              conv4_1/sep              1 1 conv4_1/dw_relu4_1/dw conv4_1/sep_relu4_1/sep 0=256 1=1 5=1 6=65536 8=102 9=1\nConvolutionDepthWise     conv4_2/dw               1 1 conv4_1/sep_relu4_1/sep conv4_2/dw_relu4_2/dw 0=256 1=3 3=2 4=1 5=1 6=2304 7=256 8=101 9=1\nConvolution              conv4_2/sep              1 1 conv4_2/dw_relu4_2/dw conv4_2/sep_relu4_2/sep 0=512 1=1 5=1 6=131072 8=102 9=1\nConvolutionDepthWise     conv5_1/dw               1 1 conv4_2/sep_relu4_2/sep conv5_1/dw_relu5_1/dw 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 
9=1\nConvolution              conv5_1/sep              1 1 conv5_1/dw_relu5_1/dw conv5_1/sep_relu5_1/sep 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv5_2/dw               1 1 conv5_1/sep_relu5_1/sep conv5_2/dw_relu5_2/dw 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv5_2/sep              1 1 conv5_2/dw_relu5_2/dw conv5_2/sep_relu5_2/sep 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv5_3/dw               1 1 conv5_2/sep_relu5_2/sep conv5_3/dw_relu5_3/dw 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv5_3/sep              1 1 conv5_3/dw_relu5_3/dw conv5_3/sep_relu5_3/sep 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv5_4/dw               1 1 conv5_3/sep_relu5_3/sep conv5_4/dw_relu5_4/dw 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv5_4/sep              1 1 conv5_4/dw_relu5_4/dw conv5_4/sep_relu5_4/sep 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv5_5/dw               1 1 conv5_4/sep_relu5_4/sep conv5_5/dw_relu5_5/dw 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv5_5/sep              1 1 conv5_5/dw_relu5_5/dw conv5_5/sep_relu5_5/sep 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv5_6/dw               1 1 conv5_5/sep_relu5_5/sep conv5_6/dw_relu5_6/dw 0=512 1=3 3=2 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv5_6/sep              1 1 conv5_6/dw_relu5_6/dw conv5_6/sep_relu5_6/sep 0=1024 1=1 5=1 6=524288 8=102 9=1\nConvolutionDepthWise     conv6/dw                 1 1 conv5_6/sep_relu5_6/sep conv6/dw_relu6/dw 0=1024 1=3 4=1 5=1 6=9216 7=1024 8=101 9=1\nConvolution              conv6/sep                1 1 conv6/dw_relu6/dw conv6/sep_relu6/sep 0=1024 1=1 5=1 6=1048576 8=2 9=1\nPooling                  pool6                    1 1 conv6/sep_relu6/sep pool6 0=1 4=1\nInnerProduct             fc7                      1 1 pool6 fc7 0=1000 1=1 2=1024000 8=2\nSoftmax                  prob  
                   1 1 fc7 output\n"
  },
  {
    "path": "benchmark/mobilenet_ssd.param",
    "content": "7767517\n92 115\nInput                    input                    0 1 data -23330=4,3,300,300,3 0=300 1=300 2=3\nSplit                    splitncnn_0              1 7 data data_splitncnn_0 data_splitncnn_1 data_splitncnn_2 data_splitncnn_3 data_splitncnn_4 data_splitncnn_5 data_splitncnn_6 -23330=28,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3\nConvolution              conv0                    1 1 data_splitncnn_6 conv0_conv0/relu -23330=4,3,150,150,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nConvolutionDepthWise     conv1/dw                 1 1 conv0_conv0/relu conv1/dw_conv1/dw/relu -23330=4,3,150,150,32 0=32 1=3 4=1 5=1 6=288 7=32 9=1\nConvolution              conv1                    1 1 conv1/dw_conv1/dw/relu conv1_conv1/relu -23330=4,3,150,150,64 0=64 1=1 5=1 6=2048 9=1\nConvolutionDepthWise     conv2/dw                 1 1 conv1_conv1/relu conv2/dw_conv2/dw/relu -23330=4,3,75,75,64 0=64 1=3 3=2 4=1 5=1 6=576 7=64 9=1\nConvolution              conv2                    1 1 conv2/dw_conv2/dw/relu conv2_conv2/relu -23330=4,3,75,75,128 0=128 1=1 5=1 6=8192 9=1\nConvolutionDepthWise     conv3/dw                 1 1 conv2_conv2/relu conv3/dw_conv3/dw/relu -23330=4,3,75,75,128 0=128 1=3 4=1 5=1 6=1152 7=128 9=1\nConvolution              conv3                    1 1 conv3/dw_conv3/dw/relu conv3_conv3/relu -23330=4,3,75,75,128 0=128 1=1 5=1 6=16384 9=1\nConvolutionDepthWise     conv4/dw                 1 1 conv3_conv3/relu conv4/dw_conv4/dw/relu -23330=4,3,38,38,128 0=128 1=3 3=2 4=1 5=1 6=1152 7=128 9=1\nConvolution              conv4                    1 1 conv4/dw_conv4/dw/relu conv4_conv4/relu -23330=4,3,38,38,256 0=256 1=1 5=1 6=32768 9=1\nConvolutionDepthWise     conv5/dw                 1 1 conv4_conv4/relu conv5/dw_conv5/dw/relu -23330=4,3,38,38,256 0=256 1=3 4=1 5=1 6=2304 7=256 9=1\nConvolution              conv5                    1 1 conv5/dw_conv5/dw/relu conv5_conv5/relu -23330=4,3,38,38,256 0=256 1=1 5=1 
6=65536 9=1\nConvolutionDepthWise     conv6/dw                 1 1 conv5_conv5/relu conv6/dw_conv6/dw/relu -23330=4,3,19,19,256 0=256 1=3 3=2 4=1 5=1 6=2304 7=256 9=1\nConvolution              conv6                    1 1 conv6/dw_conv6/dw/relu conv6_conv6/relu -23330=4,3,19,19,512 0=512 1=1 5=1 6=131072 9=1\nConvolutionDepthWise     conv7/dw                 1 1 conv6_conv6/relu conv7/dw_conv7/dw/relu -23330=4,3,19,19,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv7                    1 1 conv7/dw_conv7/dw/relu conv7_conv7/relu -23330=4,3,19,19,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv8/dw                 1 1 conv7_conv7/relu conv8/dw_conv8/dw/relu -23330=4,3,19,19,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv8                    1 1 conv8/dw_conv8/dw/relu conv8_conv8/relu -23330=4,3,19,19,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv9/dw                 1 1 conv8_conv8/relu conv9/dw_conv9/dw/relu -23330=4,3,19,19,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv9                    1 1 conv9/dw_conv9/dw/relu conv9_conv9/relu -23330=4,3,19,19,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv10/dw                1 1 conv9_conv9/relu conv10/dw_conv10/dw/relu -23330=4,3,19,19,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv10                   1 1 conv10/dw_conv10/dw/relu conv10_conv10/relu -23330=4,3,19,19,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv11/dw                1 1 conv10_conv10/relu conv11/dw_conv11/dw/relu -23330=4,3,19,19,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv11                   1 1 conv11/dw_conv11/dw/relu conv11_conv11/relu -23330=4,3,19,19,512 0=512 1=1 5=1 6=262144 9=1\nSplit                    splitncnn_1              1 4 conv11_conv11/relu conv11_conv11/relu_splitncnn_0 conv11_conv11/relu_splitncnn_1 conv11_conv11/relu_splitncnn_2 conv11_conv11/relu_splitncnn_3 
-23330=16,3,19,19,512,3,19,19,512,3,19,19,512,3,19,19,512\nConvolutionDepthWise     conv12/dw                1 1 conv11_conv11/relu_splitncnn_3 conv12/dw_conv12/dw/relu -23330=4,3,10,10,512 0=512 1=3 3=2 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv12                   1 1 conv12/dw_conv12/dw/relu conv12_conv12/relu -23330=4,3,10,10,1024 0=1024 1=1 5=1 6=524288 9=1\nConvolutionDepthWise     conv13/dw                1 1 conv12_conv12/relu conv13/dw_conv13/dw/relu -23330=4,3,10,10,1024 0=1024 1=3 4=1 5=1 6=9216 7=1024 9=1\nConvolution              conv13                   1 1 conv13/dw_conv13/dw/relu conv13_conv13/relu -23330=4,3,10,10,1024 0=1024 1=1 5=1 6=1048576 9=1\nSplit                    splitncnn_2              1 4 conv13_conv13/relu conv13_conv13/relu_splitncnn_0 conv13_conv13/relu_splitncnn_1 conv13_conv13/relu_splitncnn_2 conv13_conv13/relu_splitncnn_3 -23330=16,3,10,10,1024,3,10,10,1024,3,10,10,1024,3,10,10,1024\nConvolution              conv14_1                 1 1 conv13_conv13/relu_splitncnn_3 conv14_1_conv14_1/relu -23330=4,3,10,10,256 0=256 1=1 5=1 6=262144 9=1\nConvolution              conv14_2                 1 1 conv14_1_conv14_1/relu conv14_2_conv14_2/relu -23330=4,3,5,5,512 0=512 1=3 3=2 4=1 5=1 6=1179648 9=1\nSplit                    splitncnn_3              1 4 conv14_2_conv14_2/relu conv14_2_conv14_2/relu_splitncnn_0 conv14_2_conv14_2/relu_splitncnn_1 conv14_2_conv14_2/relu_splitncnn_2 conv14_2_conv14_2/relu_splitncnn_3 -23330=16,3,5,5,512,3,5,5,512,3,5,5,512,3,5,5,512\nConvolution              conv15_1                 1 1 conv14_2_conv14_2/relu_splitncnn_3 conv15_1_conv15_1/relu -23330=4,3,5,5,128 0=128 1=1 5=1 6=65536 9=1\nConvolution              conv15_2                 1 1 conv15_1_conv15_1/relu conv15_2_conv15_2/relu -23330=4,3,3,3,256 0=256 1=3 3=2 4=1 5=1 6=294912 9=1\nSplit                    splitncnn_4              1 4 conv15_2_conv15_2/relu conv15_2_conv15_2/relu_splitncnn_0 conv15_2_conv15_2/relu_splitncnn_1 
conv15_2_conv15_2/relu_splitncnn_2 conv15_2_conv15_2/relu_splitncnn_3 -23330=16,3,3,3,256,3,3,3,256,3,3,3,256,3,3,3,256\nConvolution              conv16_1                 1 1 conv15_2_conv15_2/relu_splitncnn_3 conv16_1_conv16_1/relu -23330=4,3,3,3,128 0=128 1=1 5=1 6=32768 9=1\nConvolution              conv16_2                 1 1 conv16_1_conv16_1/relu conv16_2_conv16_2/relu -23330=4,3,2,2,256 0=256 1=3 3=2 4=1 5=1 6=294912 9=1\nSplit                    splitncnn_5              1 4 conv16_2_conv16_2/relu conv16_2_conv16_2/relu_splitncnn_0 conv16_2_conv16_2/relu_splitncnn_1 conv16_2_conv16_2/relu_splitncnn_2 conv16_2_conv16_2/relu_splitncnn_3 -23330=16,3,2,2,256,3,2,2,256,3,2,2,256,3,2,2,256\nConvolution              conv17_1                 1 1 conv16_2_conv16_2/relu_splitncnn_3 conv17_1_conv17_1/relu -23330=4,3,2,2,64 0=64 1=1 5=1 6=16384 9=1\nConvolution              conv17_2                 1 1 conv17_1_conv17_1/relu conv17_2_conv17_2/relu -23330=4,3,1,1,128 0=128 1=3 3=2 4=1 5=1 6=73728 9=1\nSplit                    splitncnn_6              1 3 conv17_2_conv17_2/relu conv17_2_conv17_2/relu_splitncnn_0 conv17_2_conv17_2/relu_splitncnn_1 conv17_2_conv17_2/relu_splitncnn_2 -23330=12,3,1,1,128,3,1,1,128,3,1,1,128\nConvolution              conv11_mbox_loc          1 1 conv11_conv11/relu_splitncnn_2 conv11_mbox_loc -23330=4,3,19,19,12 0=12 1=1 5=1 6=6144\nPermute                  conv11_mbox_loc_perm     1 1 conv11_mbox_loc conv11_mbox_loc_perm -23330=4,3,12,19,19 0=3\nFlatten                  conv11_mbox_loc_flat     1 1 conv11_mbox_loc_perm conv11_mbox_loc_flat -23330=4,1,4332,1,1\nConvolution              conv11_mbox_conf         1 1 conv11_conv11/relu_splitncnn_1 conv11_mbox_conf -23330=4,3,19,19,63 0=63 1=1 5=1 6=32256\nPermute                  conv11_mbox_conf_perm    1 1 conv11_mbox_conf conv11_mbox_conf_perm -23330=4,3,63,19,19 0=3\nFlatten                  conv11_mbox_conf_flat    1 1 conv11_mbox_conf_perm conv11_mbox_conf_flat 
-23330=4,1,22743,1,1\nPriorBox                 conv11_mbox_priorbox     2 1 conv11_conv11/relu_splitncnn_0 data_splitncnn_5 conv11_mbox_priorbox -23330=4,2,4332,2,1 -23300=1,6.000000e+01 -23302=1,2.000000e+00 9=-233 10=-233 13=5.000000e-01\nConvolution              conv13_mbox_loc          1 1 conv13_conv13/relu_splitncnn_2 conv13_mbox_loc -23330=4,3,10,10,24 0=24 1=1 5=1 6=24576\nPermute                  conv13_mbox_loc_perm     1 1 conv13_mbox_loc conv13_mbox_loc_perm -23330=4,3,24,10,10 0=3\nFlatten                  conv13_mbox_loc_flat     1 1 conv13_mbox_loc_perm conv13_mbox_loc_flat -23330=4,1,2400,1,1\nConvolution              conv13_mbox_conf         1 1 conv13_conv13/relu_splitncnn_1 conv13_mbox_conf -23330=4,3,10,10,126 0=126 1=1 5=1 6=129024\nPermute                  conv13_mbox_conf_perm    1 1 conv13_mbox_conf conv13_mbox_conf_perm -23330=4,3,126,10,10 0=3\nFlatten                  conv13_mbox_conf_flat    1 1 conv13_mbox_conf_perm conv13_mbox_conf_flat -23330=4,1,12600,1,1\nPriorBox                 conv13_mbox_priorbox     2 1 conv13_conv13/relu_splitncnn_0 data_splitncnn_4 conv13_mbox_priorbox -23330=4,2,2400,2,1 -23300=1,1.050000e+02 -23301=1,1.500000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 13=5.000000e-01\nConvolution              conv14_2_mbox_loc        1 1 conv14_2_conv14_2/relu_splitncnn_2 conv14_2_mbox_loc -23330=4,3,5,5,24 0=24 1=1 5=1 6=12288\nPermute                  conv14_2_mbox_loc_perm   1 1 conv14_2_mbox_loc conv14_2_mbox_loc_perm -23330=4,3,24,5,5 0=3\nFlatten                  conv14_2_mbox_loc_flat   1 1 conv14_2_mbox_loc_perm conv14_2_mbox_loc_flat -23330=4,1,600,1,1\nConvolution              conv14_2_mbox_conf       1 1 conv14_2_conv14_2/relu_splitncnn_1 conv14_2_mbox_conf -23330=4,3,5,5,126 0=126 1=1 5=1 6=64512\nPermute                  conv14_2_mbox_conf_perm  1 1 conv14_2_mbox_conf conv14_2_mbox_conf_perm -23330=4,3,126,5,5 0=3\nFlatten                  conv14_2_mbox_conf_flat  1 1 conv14_2_mbox_conf_perm 
conv14_2_mbox_conf_flat -23330=4,1,3150,1,1\nPriorBox                 conv14_2_mbox_priorbox   2 1 conv14_2_conv14_2/relu_splitncnn_0 data_splitncnn_3 conv14_2_mbox_priorbox -23330=4,2,600,2,1 -23300=1,1.500000e+02 -23301=1,1.950000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 13=5.000000e-01\nConvolution              conv15_2_mbox_loc        1 1 conv15_2_conv15_2/relu_splitncnn_2 conv15_2_mbox_loc -23330=4,3,3,3,24 0=24 1=1 5=1 6=6144\nPermute                  conv15_2_mbox_loc_perm   1 1 conv15_2_mbox_loc conv15_2_mbox_loc_perm -23330=4,3,24,3,3 0=3\nFlatten                  conv15_2_mbox_loc_flat   1 1 conv15_2_mbox_loc_perm conv15_2_mbox_loc_flat -23330=4,1,216,1,1\nConvolution              conv15_2_mbox_conf       1 1 conv15_2_conv15_2/relu_splitncnn_1 conv15_2_mbox_conf -23330=4,3,3,3,126 0=126 1=1 5=1 6=32256\nPermute                  conv15_2_mbox_conf_perm  1 1 conv15_2_mbox_conf conv15_2_mbox_conf_perm -23330=4,3,126,3,3 0=3\nFlatten                  conv15_2_mbox_conf_flat  1 1 conv15_2_mbox_conf_perm conv15_2_mbox_conf_flat -23330=4,1,1134,1,1\nPriorBox                 conv15_2_mbox_priorbox   2 1 conv15_2_conv15_2/relu_splitncnn_0 data_splitncnn_2 conv15_2_mbox_priorbox -23330=4,2,216,2,1 -23300=1,1.950000e+02 -23301=1,2.400000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 13=5.000000e-01\nConvolution              conv16_2_mbox_loc        1 1 conv16_2_conv16_2/relu_splitncnn_2 conv16_2_mbox_loc -23330=4,3,2,2,24 0=24 1=1 5=1 6=6144\nPermute                  conv16_2_mbox_loc_perm   1 1 conv16_2_mbox_loc conv16_2_mbox_loc_perm -23330=4,3,24,2,2 0=3\nFlatten                  conv16_2_mbox_loc_flat   1 1 conv16_2_mbox_loc_perm conv16_2_mbox_loc_flat -23330=4,1,96,1,1\nConvolution              conv16_2_mbox_conf       1 1 conv16_2_conv16_2/relu_splitncnn_1 conv16_2_mbox_conf -23330=4,3,2,2,126 0=126 1=1 5=1 6=32256\nPermute                  conv16_2_mbox_conf_perm  1 1 conv16_2_mbox_conf conv16_2_mbox_conf_perm -23330=4,3,126,2,2 
0=3\nFlatten                  conv16_2_mbox_conf_flat  1 1 conv16_2_mbox_conf_perm conv16_2_mbox_conf_flat -23330=4,1,504,1,1\nPriorBox                 conv16_2_mbox_priorbox   2 1 conv16_2_conv16_2/relu_splitncnn_0 data_splitncnn_1 conv16_2_mbox_priorbox -23330=4,2,96,2,1 -23300=1,2.400000e+02 -23301=1,2.850000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 13=5.000000e-01\nConvolution              conv17_2_mbox_loc        1 1 conv17_2_conv17_2/relu_splitncnn_2 conv17_2_mbox_loc -23330=4,3,1,1,24 0=24 1=1 5=1 6=3072\nPermute                  conv17_2_mbox_loc_perm   1 1 conv17_2_mbox_loc conv17_2_mbox_loc_perm -23330=4,3,24,1,1 0=3\nFlatten                  conv17_2_mbox_loc_flat   1 1 conv17_2_mbox_loc_perm conv17_2_mbox_loc_flat -23330=4,1,24,1,1\nConvolution              conv17_2_mbox_conf       1 1 conv17_2_conv17_2/relu_splitncnn_1 conv17_2_mbox_conf -23330=4,3,1,1,126 0=126 1=1 5=1 6=16128\nPermute                  conv17_2_mbox_conf_perm  1 1 conv17_2_mbox_conf conv17_2_mbox_conf_perm -23330=4,3,126,1,1 0=3\nFlatten                  conv17_2_mbox_conf_flat  1 1 conv17_2_mbox_conf_perm conv17_2_mbox_conf_flat -23330=4,1,126,1,1\nPriorBox                 conv17_2_mbox_priorbox   2 1 conv17_2_conv17_2/relu_splitncnn_0 data_splitncnn_0 conv17_2_mbox_priorbox -23330=4,2,24,2,1 -23300=1,2.850000e+02 -23301=1,3.000000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 13=5.000000e-01\nConcat                   mbox_loc                 6 1 conv11_mbox_loc_flat conv13_mbox_loc_flat conv14_2_mbox_loc_flat conv15_2_mbox_loc_flat conv16_2_mbox_loc_flat conv17_2_mbox_loc_flat mbox_loc -23330=4,1,7668,1,1\nConcat                   mbox_conf                6 1 conv11_mbox_conf_flat conv13_mbox_conf_flat conv14_2_mbox_conf_flat conv15_2_mbox_conf_flat conv16_2_mbox_conf_flat conv17_2_mbox_conf_flat mbox_conf -23330=4,1,40257,1,1\nConcat                   mbox_priorbox            6 1 conv11_mbox_priorbox conv13_mbox_priorbox conv14_2_mbox_priorbox 
conv15_2_mbox_priorbox conv16_2_mbox_priorbox conv17_2_mbox_priorbox mbox_priorbox -23330=4,2,7668,2,1 0=1\nReshape                  mbox_conf_reshape        1 1 mbox_conf mbox_conf_reshape -23330=4,2,21,1917,1 0=21 1=-1\nSoftmax                  mbox_conf_softmax        1 1 mbox_conf_reshape mbox_conf_softmax -23330=4,2,21,1917,1 0=1 1=1\nFlatten                  mbox_conf_flatten        1 1 mbox_conf_softmax mbox_conf_flatten -23330=4,1,40257,1,1\nDetectionOutput          detection_out            3 1 mbox_loc mbox_conf_flatten mbox_priorbox output 0=21 1=4.500000e-01 2=100 4=2.500000e-01\n"
  },
  {
    "path": "benchmark/mobilenet_ssd_int8.param",
    "content": "7767517\n92 115\nInput                    input                    0 1 data 0=300 1=300 2=3\nSplit                    splitncnn_0              1 7 data data_splitncnn_0 data_splitncnn_1 data_splitncnn_2 data_splitncnn_3 data_splitncnn_4 data_splitncnn_5 data_splitncnn_6\nConvolution              conv0                    1 1 data_splitncnn_6 conv0_conv0/relu 0=32 1=3 3=2 4=1 5=1 6=864 8=102 9=1\nConvolutionDepthWise     conv1/dw                 1 1 conv0_conv0/relu conv1/dw_conv1/dw/relu 0=32 1=3 4=1 5=1 6=288 7=32 8=101 9=1\nConvolution              conv1                    1 1 conv1/dw_conv1/dw/relu conv1_conv1/relu 0=64 1=1 5=1 6=2048 8=102 9=1\nConvolutionDepthWise     conv2/dw                 1 1 conv1_conv1/relu conv2/dw_conv2/dw/relu 0=64 1=3 3=2 4=1 5=1 6=576 7=64 8=101 9=1\nConvolution              conv2                    1 1 conv2/dw_conv2/dw/relu conv2_conv2/relu 0=128 1=1 5=1 6=8192 8=102 9=1\nConvolutionDepthWise     conv3/dw                 1 1 conv2_conv2/relu conv3/dw_conv3/dw/relu 0=128 1=3 4=1 5=1 6=1152 7=128 8=101 9=1\nConvolution              conv3                    1 1 conv3/dw_conv3/dw/relu conv3_conv3/relu 0=128 1=1 5=1 6=16384 8=102 9=1\nConvolutionDepthWise     conv4/dw                 1 1 conv3_conv3/relu conv4/dw_conv4/dw/relu 0=128 1=3 3=2 4=1 5=1 6=1152 7=128 8=101 9=1\nConvolution              conv4                    1 1 conv4/dw_conv4/dw/relu conv4_conv4/relu 0=256 1=1 5=1 6=32768 8=102 9=1\nConvolutionDepthWise     conv5/dw                 1 1 conv4_conv4/relu conv5/dw_conv5/dw/relu 0=256 1=3 4=1 5=1 6=2304 7=256 8=101 9=1\nConvolution              conv5                    1 1 conv5/dw_conv5/dw/relu conv5_conv5/relu 0=256 1=1 5=1 6=65536 8=102 9=1\nConvolutionDepthWise     conv6/dw                 1 1 conv5_conv5/relu conv6/dw_conv6/dw/relu 0=256 1=3 3=2 4=1 5=1 6=2304 7=256 8=101 9=1\nConvolution              conv6                    1 1 conv6/dw_conv6/dw/relu conv6_conv6/relu 0=512 1=1 5=1 6=131072 8=102 
9=1\nConvolutionDepthWise     conv7/dw                 1 1 conv6_conv6/relu conv7/dw_conv7/dw/relu 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv7                    1 1 conv7/dw_conv7/dw/relu conv7_conv7/relu 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv8/dw                 1 1 conv7_conv7/relu conv8/dw_conv8/dw/relu 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv8                    1 1 conv8/dw_conv8/dw/relu conv8_conv8/relu 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv9/dw                 1 1 conv8_conv8/relu conv9/dw_conv9/dw/relu 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv9                    1 1 conv9/dw_conv9/dw/relu conv9_conv9/relu 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv10/dw                1 1 conv9_conv9/relu conv10/dw_conv10/dw/relu 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv10                   1 1 conv10/dw_conv10/dw/relu conv10_conv10/relu 0=512 1=1 5=1 6=262144 8=102 9=1\nConvolutionDepthWise     conv11/dw                1 1 conv10_conv10/relu conv11/dw_conv11/dw/relu 0=512 1=3 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv11                   1 1 conv11/dw_conv11/dw/relu conv11_conv11/relu 0=512 1=1 5=1 6=262144 8=2 9=1\nSplit                    splitncnn_1              1 4 conv11_conv11/relu conv11_conv11/relu_splitncnn_0 conv11_conv11/relu_splitncnn_1 conv11_conv11/relu_splitncnn_2 conv11_conv11/relu_splitncnn_3\nConvolutionDepthWise     conv12/dw                1 1 conv11_conv11/relu_splitncnn_3 conv12/dw_conv12/dw/relu 0=512 1=3 3=2 4=1 5=1 6=4608 7=512 8=101 9=1\nConvolution              conv12                   1 1 conv12/dw_conv12/dw/relu conv12_conv12/relu 0=1024 1=1 5=1 6=524288 8=102 9=1\nConvolutionDepthWise     conv13/dw                1 1 conv12_conv12/relu conv13/dw_conv13/dw/relu 0=1024 1=3 4=1 5=1 6=9216 7=1024 8=101 9=1\nConvolution              conv13   
                1 1 conv13/dw_conv13/dw/relu conv13_conv13/relu 0=1024 1=1 5=1 6=1048576 8=2 9=1\nSplit                    splitncnn_2              1 4 conv13_conv13/relu conv13_conv13/relu_splitncnn_0 conv13_conv13/relu_splitncnn_1 conv13_conv13/relu_splitncnn_2 conv13_conv13/relu_splitncnn_3\nConvolution              conv14_1                 1 1 conv13_conv13/relu_splitncnn_3 conv14_1_conv14_1/relu 0=256 1=1 5=1 6=262144 8=102 9=1\nConvolution              conv14_2                 1 1 conv14_1_conv14_1/relu conv14_2_conv14_2/relu 0=512 1=3 3=2 4=1 5=1 6=1179648 8=2 9=1\nSplit                    splitncnn_3              1 4 conv14_2_conv14_2/relu conv14_2_conv14_2/relu_splitncnn_0 conv14_2_conv14_2/relu_splitncnn_1 conv14_2_conv14_2/relu_splitncnn_2 conv14_2_conv14_2/relu_splitncnn_3\nConvolution              conv15_1                 1 1 conv14_2_conv14_2/relu_splitncnn_3 conv15_1_conv15_1/relu 0=128 1=1 5=1 6=65536 8=102 9=1\nConvolution              conv15_2                 1 1 conv15_1_conv15_1/relu conv15_2_conv15_2/relu 0=256 1=3 3=2 4=1 5=1 6=294912 8=2 9=1\nSplit                    splitncnn_4              1 4 conv15_2_conv15_2/relu conv15_2_conv15_2/relu_splitncnn_0 conv15_2_conv15_2/relu_splitncnn_1 conv15_2_conv15_2/relu_splitncnn_2 conv15_2_conv15_2/relu_splitncnn_3\nConvolution              conv16_1                 1 1 conv15_2_conv15_2/relu_splitncnn_3 conv16_1_conv16_1/relu 0=128 1=1 5=1 6=32768 8=102 9=1\nConvolution              conv16_2                 1 1 conv16_1_conv16_1/relu conv16_2_conv16_2/relu 0=256 1=3 3=2 4=1 5=1 6=294912 8=2 9=1\nSplit                    splitncnn_5              1 4 conv16_2_conv16_2/relu conv16_2_conv16_2/relu_splitncnn_0 conv16_2_conv16_2/relu_splitncnn_1 conv16_2_conv16_2/relu_splitncnn_2 conv16_2_conv16_2/relu_splitncnn_3\nConvolution              conv17_1                 1 1 conv16_2_conv16_2/relu_splitncnn_3 conv17_1_conv17_1/relu 0=64 1=1 5=1 6=16384 8=102 9=1\nConvolution              conv17_2                 1 
1 conv17_1_conv17_1/relu conv17_2_conv17_2/relu 0=128 1=3 3=2 4=1 5=1 6=73728 8=2 9=1\nSplit                    splitncnn_6              1 3 conv17_2_conv17_2/relu conv17_2_conv17_2/relu_splitncnn_0 conv17_2_conv17_2/relu_splitncnn_1 conv17_2_conv17_2/relu_splitncnn_2\nConvolution              conv11_mbox_loc          1 1 conv11_conv11/relu_splitncnn_2 conv11_mbox_loc 0=12 1=1 5=1 6=6144 8=2\nPermute                  conv11_mbox_loc_perm     1 1 conv11_mbox_loc conv11_mbox_loc_perm 0=3\nFlatten                  conv11_mbox_loc_flat     1 1 conv11_mbox_loc_perm conv11_mbox_loc_flat\nConvolution              conv11_mbox_conf         1 1 conv11_conv11/relu_splitncnn_1 conv11_mbox_conf 0=63 1=1 5=1 6=32256 8=2\nPermute                  conv11_mbox_conf_perm    1 1 conv11_mbox_conf conv11_mbox_conf_perm 0=3\nFlatten                  conv11_mbox_conf_flat    1 1 conv11_mbox_conf_perm conv11_mbox_conf_flat\nPriorBox                 conv11_mbox_priorbox     2 1 conv11_conv11/relu_splitncnn_0 data_splitncnn_5 conv11_mbox_priorbox -23300=1,60.000000 -23302=1,2.000000 9=-233 10=-233 13=0.500000\nConvolution              conv13_mbox_loc          1 1 conv13_conv13/relu_splitncnn_2 conv13_mbox_loc 0=24 1=1 5=1 6=24576 8=2\nPermute                  conv13_mbox_loc_perm     1 1 conv13_mbox_loc conv13_mbox_loc_perm 0=3\nFlatten                  conv13_mbox_loc_flat     1 1 conv13_mbox_loc_perm conv13_mbox_loc_flat\nConvolution              conv13_mbox_conf         1 1 conv13_conv13/relu_splitncnn_1 conv13_mbox_conf 0=126 1=1 5=1 6=129024 8=2\nPermute                  conv13_mbox_conf_perm    1 1 conv13_mbox_conf conv13_mbox_conf_perm 0=3\nFlatten                  conv13_mbox_conf_flat    1 1 conv13_mbox_conf_perm conv13_mbox_conf_flat\nPriorBox                 conv13_mbox_priorbox     2 1 conv13_conv13/relu_splitncnn_0 data_splitncnn_4 conv13_mbox_priorbox -23300=1,105.000000 -23301=1,150.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 13=0.500000\nConvolution              
conv14_2_mbox_loc        1 1 conv14_2_conv14_2/relu_splitncnn_2 conv14_2_mbox_loc 0=24 1=1 5=1 6=12288 8=2\nPermute                  conv14_2_mbox_loc_perm   1 1 conv14_2_mbox_loc conv14_2_mbox_loc_perm 0=3\nFlatten                  conv14_2_mbox_loc_flat   1 1 conv14_2_mbox_loc_perm conv14_2_mbox_loc_flat\nConvolution              conv14_2_mbox_conf       1 1 conv14_2_conv14_2/relu_splitncnn_1 conv14_2_mbox_conf 0=126 1=1 5=1 6=64512 8=2\nPermute                  conv14_2_mbox_conf_perm  1 1 conv14_2_mbox_conf conv14_2_mbox_conf_perm 0=3\nFlatten                  conv14_2_mbox_conf_flat  1 1 conv14_2_mbox_conf_perm conv14_2_mbox_conf_flat\nPriorBox                 conv14_2_mbox_priorbox   2 1 conv14_2_conv14_2/relu_splitncnn_0 data_splitncnn_3 conv14_2_mbox_priorbox -23300=1,150.000000 -23301=1,195.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 13=0.500000\nConvolution              conv15_2_mbox_loc        1 1 conv15_2_conv15_2/relu_splitncnn_2 conv15_2_mbox_loc 0=24 1=1 5=1 6=6144 8=2\nPermute                  conv15_2_mbox_loc_perm   1 1 conv15_2_mbox_loc conv15_2_mbox_loc_perm 0=3\nFlatten                  conv15_2_mbox_loc_flat   1 1 conv15_2_mbox_loc_perm conv15_2_mbox_loc_flat\nConvolution              conv15_2_mbox_conf       1 1 conv15_2_conv15_2/relu_splitncnn_1 conv15_2_mbox_conf 0=126 1=1 5=1 6=32256 8=2\nPermute                  conv15_2_mbox_conf_perm  1 1 conv15_2_mbox_conf conv15_2_mbox_conf_perm 0=3\nFlatten                  conv15_2_mbox_conf_flat  1 1 conv15_2_mbox_conf_perm conv15_2_mbox_conf_flat\nPriorBox                 conv15_2_mbox_priorbox   2 1 conv15_2_conv15_2/relu_splitncnn_0 data_splitncnn_2 conv15_2_mbox_priorbox -23300=1,195.000000 -23301=1,240.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 13=0.500000\nConvolution              conv16_2_mbox_loc        1 1 conv16_2_conv16_2/relu_splitncnn_2 conv16_2_mbox_loc 0=24 1=1 5=1 6=6144 8=2\nPermute                  conv16_2_mbox_loc_perm   1 1 conv16_2_mbox_loc conv16_2_mbox_loc_perm 
0=3\nFlatten                  conv16_2_mbox_loc_flat   1 1 conv16_2_mbox_loc_perm conv16_2_mbox_loc_flat\nConvolution              conv16_2_mbox_conf       1 1 conv16_2_conv16_2/relu_splitncnn_1 conv16_2_mbox_conf 0=126 1=1 5=1 6=32256 8=2\nPermute                  conv16_2_mbox_conf_perm  1 1 conv16_2_mbox_conf conv16_2_mbox_conf_perm 0=3\nFlatten                  conv16_2_mbox_conf_flat  1 1 conv16_2_mbox_conf_perm conv16_2_mbox_conf_flat\nPriorBox                 conv16_2_mbox_priorbox   2 1 conv16_2_conv16_2/relu_splitncnn_0 data_splitncnn_1 conv16_2_mbox_priorbox -23300=1,240.000000 -23301=1,285.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 13=0.500000\nConvolution              conv17_2_mbox_loc        1 1 conv17_2_conv17_2/relu_splitncnn_2 conv17_2_mbox_loc 0=24 1=1 5=1 6=3072 8=2\nPermute                  conv17_2_mbox_loc_perm   1 1 conv17_2_mbox_loc conv17_2_mbox_loc_perm 0=3\nFlatten                  conv17_2_mbox_loc_flat   1 1 conv17_2_mbox_loc_perm conv17_2_mbox_loc_flat\nConvolution              conv17_2_mbox_conf       1 1 conv17_2_conv17_2/relu_splitncnn_1 conv17_2_mbox_conf 0=126 1=1 5=1 6=16128 8=2\nPermute                  conv17_2_mbox_conf_perm  1 1 conv17_2_mbox_conf conv17_2_mbox_conf_perm 0=3\nFlatten                  conv17_2_mbox_conf_flat  1 1 conv17_2_mbox_conf_perm conv17_2_mbox_conf_flat\nPriorBox                 conv17_2_mbox_priorbox   2 1 conv17_2_conv17_2/relu_splitncnn_0 data_splitncnn_0 conv17_2_mbox_priorbox -23300=1,285.000000 -23301=1,300.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 13=0.500000\nConcat                   mbox_loc                 6 1 conv11_mbox_loc_flat conv13_mbox_loc_flat conv14_2_mbox_loc_flat conv15_2_mbox_loc_flat conv16_2_mbox_loc_flat conv17_2_mbox_loc_flat mbox_loc\nConcat                   mbox_conf                6 1 conv11_mbox_conf_flat conv13_mbox_conf_flat conv14_2_mbox_conf_flat conv15_2_mbox_conf_flat conv16_2_mbox_conf_flat conv17_2_mbox_conf_flat mbox_conf\nConcat                   
mbox_priorbox            6 1 conv11_mbox_priorbox conv13_mbox_priorbox conv14_2_mbox_priorbox conv15_2_mbox_priorbox conv16_2_mbox_priorbox conv17_2_mbox_priorbox mbox_priorbox 0=1\nReshape                  mbox_conf_reshape        1 1 mbox_conf mbox_conf_reshape 0=21 1=-1\nSoftmax                  mbox_conf_softmax        1 1 mbox_conf_reshape mbox_conf_softmax 0=1 1=1\nFlatten                  mbox_conf_flatten        1 1 mbox_conf_softmax mbox_conf_flatten\nDetectionOutput          detection_out            3 1 mbox_loc mbox_conf_flatten mbox_priorbox output 0=21 1=0.450000 2=100 4=0.250000\n"
  },
  {
    "path": "benchmark/mobilenet_v2.param",
    "content": "7767517\n77 87\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1/bn_relu1 -23330=4,3,112,112,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nConvolution              conv2_1/expand           1 1 conv1/bn_relu1 conv2_1/expand/bn_relu2_1/expand -23330=4,3,112,112,32 0=32 1=1 5=1 6=1024 9=1\nConvolutionDepthWise     conv2_1/dwise            1 1 conv2_1/expand/bn_relu2_1/expand conv2_1/dwise/bn_relu2_1/dwise -23330=4,3,112,112,32 0=32 1=3 4=1 5=1 6=288 7=32 9=1\nConvolution              conv2_1/linear           1 1 conv2_1/dwise/bn_relu2_1/dwise conv2_1/linear/bn_conv2_1/linear/scale -23330=4,3,112,112,16 0=16 1=1 5=1 6=512\nConvolution              conv2_2/expand           1 1 conv2_1/linear/bn_conv2_1/linear/scale conv2_2/expand/bn_relu2_2/expand -23330=4,3,112,112,96 0=96 1=1 5=1 6=1536 9=1\nConvolutionDepthWise     conv2_2/dwise            1 1 conv2_2/expand/bn_relu2_2/expand conv2_2/dwise/bn_relu2_2/dwise -23330=4,3,56,56,96 0=96 1=3 3=2 4=1 5=1 6=864 7=96 9=1\nConvolution              conv2_2/linear           1 1 conv2_2/dwise/bn_relu2_2/dwise conv2_2/linear/bn_conv2_2/linear/scale -23330=4,3,56,56,24 0=24 1=1 5=1 6=2304\nSplit                    splitncnn_0              1 2 conv2_2/linear/bn_conv2_2/linear/scale conv2_2/linear/bn_conv2_2/linear/scale_splitncnn_0 conv2_2/linear/bn_conv2_2/linear/scale_splitncnn_1 -23330=8,3,56,56,24,3,56,56,24\nConvolution              conv3_1/expand           1 1 conv2_2/linear/bn_conv2_2/linear/scale_splitncnn_1 conv3_1/expand/bn_relu3_1/expand -23330=4,3,56,56,144 0=144 1=1 5=1 6=3456 9=1\nConvolutionDepthWise     conv3_1/dwise            1 1 conv3_1/expand/bn_relu3_1/expand conv3_1/dwise/bn_relu3_1/dwise -23330=4,3,56,56,144 0=144 1=3 4=1 5=1 6=1296 7=144 9=1\nConvolution              conv3_1/linear           1 1 conv3_1/dwise/bn_relu3_1/dwise conv3_1/linear/bn_conv3_1/linear/scale -23330=4,3,56,56,24 0=24 1=1 
5=1 6=3456\nEltwise                  block_3_1                2 1 conv2_2/linear/bn_conv2_2/linear/scale_splitncnn_0 conv3_1/linear/bn_conv3_1/linear/scale block_3_1 -23330=4,3,56,56,24 0=1\nConvolution              conv3_2/expand           1 1 block_3_1 conv3_2/expand/bn_relu3_2/expand -23330=4,3,56,56,144 0=144 1=1 5=1 6=3456 9=1\nConvolutionDepthWise     conv3_2/dwise            1 1 conv3_2/expand/bn_relu3_2/expand conv3_2/dwise/bn_relu3_2/dwise -23330=4,3,28,28,144 0=144 1=3 3=2 4=1 5=1 6=1296 7=144 9=1\nConvolution              conv3_2/linear           1 1 conv3_2/dwise/bn_relu3_2/dwise conv3_2/linear/bn_conv3_2/linear/scale -23330=4,3,28,28,32 0=32 1=1 5=1 6=4608\nSplit                    splitncnn_1              1 2 conv3_2/linear/bn_conv3_2/linear/scale conv3_2/linear/bn_conv3_2/linear/scale_splitncnn_0 conv3_2/linear/bn_conv3_2/linear/scale_splitncnn_1 -23330=8,3,28,28,32,3,28,28,32\nConvolution              conv4_1/expand           1 1 conv3_2/linear/bn_conv3_2/linear/scale_splitncnn_1 conv4_1/expand/bn_relu4_1/expand -23330=4,3,28,28,192 0=192 1=1 5=1 6=6144 9=1\nConvolutionDepthWise     conv4_1/dwise            1 1 conv4_1/expand/bn_relu4_1/expand conv4_1/dwise/bn_relu4_1/dwise -23330=4,3,28,28,192 0=192 1=3 4=1 5=1 6=1728 7=192 9=1\nConvolution              conv4_1/linear           1 1 conv4_1/dwise/bn_relu4_1/dwise conv4_1/linear/bn_conv4_1/linear/scale -23330=4,3,28,28,32 0=32 1=1 5=1 6=6144\nEltwise                  block_4_1                2 1 conv3_2/linear/bn_conv3_2/linear/scale_splitncnn_0 conv4_1/linear/bn_conv4_1/linear/scale block_4_1 -23330=4,3,28,28,32 0=1\nSplit                    splitncnn_2              1 2 block_4_1 block_4_1_splitncnn_0 block_4_1_splitncnn_1 -23330=8,3,28,28,32,3,28,28,32\nConvolution              conv4_2/expand           1 1 block_4_1_splitncnn_1 conv4_2/expand/bn_relu4_2/expand -23330=4,3,28,28,192 0=192 1=1 5=1 6=6144 9=1\nConvolutionDepthWise     conv4_2/dwise            1 1 conv4_2/expand/bn_relu4_2/expand 
conv4_2/dwise/bn_relu4_2/dwise -23330=4,3,28,28,192 0=192 1=3 4=1 5=1 6=1728 7=192 9=1\nConvolution              conv4_2/linear           1 1 conv4_2/dwise/bn_relu4_2/dwise conv4_2/linear/bn_conv4_2/linear/scale -23330=4,3,28,28,32 0=32 1=1 5=1 6=6144\nEltwise                  block_4_2                2 1 block_4_1_splitncnn_0 conv4_2/linear/bn_conv4_2/linear/scale block_4_2 -23330=4,3,28,28,32 0=1\nConvolution              conv4_3/expand           1 1 block_4_2 conv4_3/expand/bn_relu4_3/expand -23330=4,3,28,28,192 0=192 1=1 5=1 6=6144 9=1\nConvolutionDepthWise     conv4_3/dwise            1 1 conv4_3/expand/bn_relu4_3/expand conv4_3/dwise/bn_relu4_3/dwise -23330=4,3,14,14,192 0=192 1=3 3=2 4=1 5=1 6=1728 7=192 9=1\nConvolution              conv4_3/linear           1 1 conv4_3/dwise/bn_relu4_3/dwise conv4_3/linear/bn_conv4_3/linear/scale -23330=4,3,14,14,64 0=64 1=1 5=1 6=12288\nSplit                    splitncnn_3              1 2 conv4_3/linear/bn_conv4_3/linear/scale conv4_3/linear/bn_conv4_3/linear/scale_splitncnn_0 conv4_3/linear/bn_conv4_3/linear/scale_splitncnn_1 -23330=8,3,14,14,64,3,14,14,64\nConvolution              conv4_4/expand           1 1 conv4_3/linear/bn_conv4_3/linear/scale_splitncnn_1 conv4_4/expand/bn_relu4_4/expand -23330=4,3,14,14,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv4_4/dwise            1 1 conv4_4/expand/bn_relu4_4/expand conv4_4/dwise/bn_relu4_4/dwise -23330=4,3,14,14,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv4_4/linear           1 1 conv4_4/dwise/bn_relu4_4/dwise conv4_4/linear/bn_conv4_4/linear/scale -23330=4,3,14,14,64 0=64 1=1 5=1 6=24576\nEltwise                  block_4_4                2 1 conv4_3/linear/bn_conv4_3/linear/scale_splitncnn_0 conv4_4/linear/bn_conv4_4/linear/scale block_4_4 -23330=4,3,14,14,64 0=1\nSplit                    splitncnn_4              1 2 block_4_4 block_4_4_splitncnn_0 block_4_4_splitncnn_1 -23330=8,3,14,14,64,3,14,14,64\nConvolution              
conv4_5/expand           1 1 block_4_4_splitncnn_1 conv4_5/expand/bn_relu4_5/expand -23330=4,3,14,14,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv4_5/dwise            1 1 conv4_5/expand/bn_relu4_5/expand conv4_5/dwise/bn_relu4_5/dwise -23330=4,3,14,14,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv4_5/linear           1 1 conv4_5/dwise/bn_relu4_5/dwise conv4_5/linear/bn_conv4_5/linear/scale -23330=4,3,14,14,64 0=64 1=1 5=1 6=24576\nEltwise                  block_4_5                2 1 block_4_4_splitncnn_0 conv4_5/linear/bn_conv4_5/linear/scale block_4_5 -23330=4,3,14,14,64 0=1\nSplit                    splitncnn_5              1 2 block_4_5 block_4_5_splitncnn_0 block_4_5_splitncnn_1 -23330=8,3,14,14,64,3,14,14,64\nConvolution              conv4_6/expand           1 1 block_4_5_splitncnn_1 conv4_6/expand/bn_relu4_6/expand -23330=4,3,14,14,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv4_6/dwise            1 1 conv4_6/expand/bn_relu4_6/expand conv4_6/dwise/bn_relu4_6/dwise -23330=4,3,14,14,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv4_6/linear           1 1 conv4_6/dwise/bn_relu4_6/dwise conv4_6/linear/bn_conv4_6/linear/scale -23330=4,3,14,14,64 0=64 1=1 5=1 6=24576\nEltwise                  block_4_6                2 1 block_4_5_splitncnn_0 conv4_6/linear/bn_conv4_6/linear/scale block_4_6 -23330=4,3,14,14,64 0=1\nConvolution              conv4_7/expand           1 1 block_4_6 conv4_7/expand/bn_relu4_7/expand -23330=4,3,14,14,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv4_7/dwise            1 1 conv4_7/expand/bn_relu4_7/expand conv4_7/dwise/bn_relu4_7/dwise -23330=4,3,14,14,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv4_7/linear           1 1 conv4_7/dwise/bn_relu4_7/dwise conv4_7/linear/bn_conv4_7/linear/scale -23330=4,3,14,14,96 0=96 1=1 5=1 6=36864\nSplit                    splitncnn_6              1 2 conv4_7/linear/bn_conv4_7/linear/scale 
conv4_7/linear/bn_conv4_7/linear/scale_splitncnn_0 conv4_7/linear/bn_conv4_7/linear/scale_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              conv5_1/expand           1 1 conv4_7/linear/bn_conv4_7/linear/scale_splitncnn_1 conv5_1/expand/bn_relu5_1/expand -23330=4,3,14,14,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     conv5_1/dwise            1 1 conv5_1/expand/bn_relu5_1/expand conv5_1/dwise/bn_relu5_1/dwise -23330=4,3,14,14,576 0=576 1=3 4=1 5=1 6=5184 7=576 9=1\nConvolution              conv5_1/linear           1 1 conv5_1/dwise/bn_relu5_1/dwise conv5_1/linear/bn_conv5_1/linear/scale -23330=4,3,14,14,96 0=96 1=1 5=1 6=55296\nEltwise                  block_5_1                2 1 conv4_7/linear/bn_conv4_7/linear/scale_splitncnn_0 conv5_1/linear/bn_conv5_1/linear/scale block_5_1 -23330=4,3,14,14,96 0=1\nSplit                    splitncnn_7              1 2 block_5_1 block_5_1_splitncnn_0 block_5_1_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              conv5_2/expand           1 1 block_5_1_splitncnn_1 conv5_2/expand/bn_relu5_2/expand -23330=4,3,14,14,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     conv5_2/dwise            1 1 conv5_2/expand/bn_relu5_2/expand conv5_2/dwise/bn_relu5_2/dwise -23330=4,3,14,14,576 0=576 1=3 4=1 5=1 6=5184 7=576 9=1\nConvolution              conv5_2/linear           1 1 conv5_2/dwise/bn_relu5_2/dwise conv5_2/linear/bn_conv5_2/linear/scale -23330=4,3,14,14,96 0=96 1=1 5=1 6=55296\nEltwise                  block_5_2                2 1 block_5_1_splitncnn_0 conv5_2/linear/bn_conv5_2/linear/scale block_5_2 -23330=4,3,14,14,96 0=1\nConvolution              conv5_3/expand           1 1 block_5_2 conv5_3/expand/bn_relu5_3/expand -23330=4,3,14,14,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     conv5_3/dwise            1 1 conv5_3/expand/bn_relu5_3/expand conv5_3/dwise/bn_relu5_3/dwise -23330=4,3,7,7,576 0=576 1=3 3=2 4=1 5=1 6=5184 7=576 9=1\nConvolution              conv5_3/linear           
1 1 conv5_3/dwise/bn_relu5_3/dwise conv5_3/linear/bn_conv5_3/linear/scale -23330=4,3,7,7,160 0=160 1=1 5=1 6=92160\nSplit                    splitncnn_8              1 2 conv5_3/linear/bn_conv5_3/linear/scale conv5_3/linear/bn_conv5_3/linear/scale_splitncnn_0 conv5_3/linear/bn_conv5_3/linear/scale_splitncnn_1 -23330=8,3,7,7,160,3,7,7,160\nConvolution              conv6_1/expand           1 1 conv5_3/linear/bn_conv5_3/linear/scale_splitncnn_1 conv6_1/expand/bn_relu6_1/expand -23330=4,3,7,7,960 0=960 1=1 5=1 6=153600 9=1\nConvolutionDepthWise     conv6_1/dwise            1 1 conv6_1/expand/bn_relu6_1/expand conv6_1/dwise/bn_relu6_1/dwise -23330=4,3,7,7,960 0=960 1=3 4=1 5=1 6=8640 7=960 9=1\nConvolution              conv6_1/linear           1 1 conv6_1/dwise/bn_relu6_1/dwise conv6_1/linear/bn_conv6_1/linear/scale -23330=4,3,7,7,160 0=160 1=1 5=1 6=153600\nEltwise                  block_6_1                2 1 conv5_3/linear/bn_conv5_3/linear/scale_splitncnn_0 conv6_1/linear/bn_conv6_1/linear/scale block_6_1 -23330=4,3,7,7,160 0=1\nSplit                    splitncnn_9              1 2 block_6_1 block_6_1_splitncnn_0 block_6_1_splitncnn_1 -23330=8,3,7,7,160,3,7,7,160\nConvolution              conv6_2/expand           1 1 block_6_1_splitncnn_1 conv6_2/expand/bn_relu6_2/expand -23330=4,3,7,7,960 0=960 1=1 5=1 6=153600 9=1\nConvolutionDepthWise     conv6_2/dwise            1 1 conv6_2/expand/bn_relu6_2/expand conv6_2/dwise/bn_relu6_2/dwise -23330=4,3,7,7,960 0=960 1=3 4=1 5=1 6=8640 7=960 9=1\nConvolution              conv6_2/linear           1 1 conv6_2/dwise/bn_relu6_2/dwise conv6_2/linear/bn_conv6_2/linear/scale -23330=4,3,7,7,160 0=160 1=1 5=1 6=153600\nEltwise                  block_6_2                2 1 block_6_1_splitncnn_0 conv6_2/linear/bn_conv6_2/linear/scale block_6_2 -23330=4,3,7,7,160 0=1\nConvolution              conv6_3/expand           1 1 block_6_2 conv6_3/expand/bn_relu6_3/expand -23330=4,3,7,7,960 0=960 1=1 5=1 6=153600 9=1\nConvolutionDepthWise     
conv6_3/dwise            1 1 conv6_3/expand/bn_relu6_3/expand conv6_3/dwise/bn_relu6_3/dwise -23330=4,3,7,7,960 0=960 1=3 4=1 5=1 6=8640 7=960 9=1\nConvolution              conv6_3/linear           1 1 conv6_3/dwise/bn_relu6_3/dwise conv6_3/linear/bn_conv6_3/linear/scale -23330=4,3,7,7,320 0=320 1=1 5=1 6=307200\nConvolution              conv6_4                  1 1 conv6_3/linear/bn_conv6_3/linear/scale conv6_4/bn_relu6_4 -23330=4,3,7,7,1280 0=1280 1=1 5=1 6=409600 9=1\nPooling                  pool6                    1 1 conv6_4/bn_relu6_4 pool6 -23330=4,1,1280,1,1 0=1 4=1\nInnerProduct             fc7                      1 1 pool6 fc7 -23330=4,1,1000,1,1 0=1000 1=1 2=1280000\nSoftmax                  prob                     1 1 fc7 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/mobilenet_v3.param",
    "content": "7767517\n145 163\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              313                      1 1 data 313 -23330=4,3,112,112,16 0=16 1=3 3=2 4=1 5=1 6=432\nSplit                    splitncnn_0              1 2 313 313_splitncnn_0 313_splitncnn_1 -23330=8,3,112,112,16,3,112,112,16\nHardSigmoid              319                      1 1 313_splitncnn_1 319 -23330=4,3,112,112,16\nBinaryOp                 320                      2 1 313_splitncnn_0 319 320 -23330=4,3,112,112,16 0=2\nSplit                    splitncnn_1              1 2 320 320_splitncnn_0 320_splitncnn_1 -23330=8,3,112,112,16,3,112,112,16\nConvolutionDepthWise     321                      1 1 320_splitncnn_1 323 -23330=4,3,112,112,16 0=16 1=3 4=1 5=1 6=144 7=16 9=1\nConvolution              324                      1 1 323 324 -23330=4,3,112,112,16 0=16 1=1 5=1 6=256\nBinaryOp                 326                      2 1 320_splitncnn_0 324 326 -23330=4,3,112,112,16\nConvolution              327                      1 1 326 329 -23330=4,3,112,112,64 0=64 1=1 5=1 6=1024 9=1\nConvolutionDepthWise     330                      1 1 329 332 -23330=4,3,56,56,64 0=64 1=3 3=2 4=1 5=1 6=576 7=64 9=1\nConvolution              333                      1 1 332 333 -23330=4,3,56,56,24 0=24 1=1 5=1 6=1536\nSplit                    splitncnn_2              1 2 333 333_splitncnn_0 333_splitncnn_1 -23330=8,3,56,56,24,3,56,56,24\nConvolution              335                      1 1 333_splitncnn_1 337 -23330=4,3,56,56,72 0=72 1=1 5=1 6=1728 9=1\nConvolutionDepthWise     338                      1 1 337 340 -23330=4,3,56,56,72 0=72 1=3 4=1 5=1 6=648 7=72 9=1\nConvolution              341                      1 1 340 341 -23330=4,3,56,56,24 0=24 1=1 5=1 6=1728\nBinaryOp                 343                      2 1 333_splitncnn_0 341 343 -23330=4,3,56,56,24\nConvolution              344                      1 1 343 346 
-23330=4,3,56,56,72 0=72 1=1 5=1 6=1728 9=1\nConvolutionDepthWise     347                      1 1 346 347 -23330=4,3,28,28,72 0=72 1=5 3=2 4=2 5=1 6=1800 7=72\nSplit                    splitncnn_3              1 2 347 347_splitncnn_0 347_splitncnn_1 -23330=8,3,28,28,72,3,28,28,72\nPooling                  355                      1 1 347_splitncnn_1 359 -23330=4,1,72,1,1 0=1 4=1\nInnerProduct             360                      1 1 359 361 -23330=4,1,18,1,1 0=18 1=1 2=1296 9=1\nInnerProduct             362                      1 1 361 362 -23330=4,1,72,1,1 0=72 1=1 2=1296\nHardSigmoid              367                      1 1 362 367 -23330=4,1,72,1,1\nBinaryOp                 376                      2 1 347_splitncnn_0 367 376 -23330=4,3,28,28,72 0=2\nReLU                     377                      1 1 376 377 -23330=4,3,28,28,72\nConvolution              378                      1 1 377 378 -23330=4,3,28,28,40 0=40 1=1 5=1 6=2880\nSplit                    splitncnn_4              1 2 378 378_splitncnn_0 378_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              380                      1 1 378_splitncnn_1 382 -23330=4,3,28,28,120 0=120 1=1 5=1 6=4800 9=1\nConvolutionDepthWise     383                      1 1 382 383 -23330=4,3,28,28,120 0=120 1=5 4=2 5=1 6=3000 7=120\nSplit                    splitncnn_5              1 2 383 383_splitncnn_0 383_splitncnn_1 -23330=8,3,28,28,120,3,28,28,120\nPooling                  391                      1 1 383_splitncnn_1 395 -23330=4,1,120,1,1 0=1 4=1\nInnerProduct             396                      1 1 395 397 -23330=4,1,30,1,1 0=30 1=1 2=3600 9=1\nInnerProduct             398                      1 1 397 398 -23330=4,1,120,1,1 0=120 1=1 2=3600\nHardSigmoid              403                      1 1 398 403 -23330=4,1,120,1,1\nBinaryOp                 412                      2 1 383_splitncnn_0 403 412 -23330=4,3,28,28,120 0=2\nReLU                     413                      1 1 412 413 
-23330=4,3,28,28,120\nConvolution              414                      1 1 413 414 -23330=4,3,28,28,40 0=40 1=1 5=1 6=4800\nBinaryOp                 416                      2 1 378_splitncnn_0 414 416 -23330=4,3,28,28,40\nSplit                    splitncnn_6              1 2 416 416_splitncnn_0 416_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              417                      1 1 416_splitncnn_1 419 -23330=4,3,28,28,120 0=120 1=1 5=1 6=4800 9=1\nConvolutionDepthWise     420                      1 1 419 420 -23330=4,3,28,28,120 0=120 1=5 4=2 5=1 6=3000 7=120\nSplit                    splitncnn_7              1 2 420 420_splitncnn_0 420_splitncnn_1 -23330=8,3,28,28,120,3,28,28,120\nPooling                  428                      1 1 420_splitncnn_1 432 -23330=4,1,120,1,1 0=1 4=1\nInnerProduct             433                      1 1 432 434 -23330=4,1,30,1,1 0=30 1=1 2=3600 9=1\nInnerProduct             435                      1 1 434 435 -23330=4,1,120,1,1 0=120 1=1 2=3600\nHardSigmoid              440                      1 1 435 440 -23330=4,1,120,1,1\nBinaryOp                 449                      2 1 420_splitncnn_0 440 449 -23330=4,3,28,28,120 0=2\nReLU                     450                      1 1 449 450 -23330=4,3,28,28,120\nConvolution              451                      1 1 450 451 -23330=4,3,28,28,40 0=40 1=1 5=1 6=4800\nBinaryOp                 453                      2 1 416_splitncnn_0 451 453 -23330=4,3,28,28,40\nConvolution              454                      1 1 453 454 -23330=4,3,28,28,240 0=240 1=1 5=1 6=9600\nHardSwish                461                      1 1 454 461 -23330=4,3,28,28,240\nConvolutionDepthWise     462                      1 1 461 462 -23330=4,3,14,14,240 0=240 1=3 3=2 4=1 5=1 6=2160 7=240\nHardSwish                469                      1 1 462 469 -23330=4,3,14,14,240\nConvolution              470                      1 1 469 470 -23330=4,3,14,14,80 0=80 1=1 5=1 6=19200\nSplit                   
 splitncnn_8              1 2 470 470_splitncnn_0 470_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              472                      1 1 470_splitncnn_1 472 -23330=4,3,14,14,200 0=200 1=1 5=1 6=16000\nHardSwish                479                      1 1 472 479 -23330=4,3,14,14,200\nConvolutionDepthWise     480                      1 1 479 480 -23330=4,3,14,14,200 0=200 1=3 4=1 5=1 6=1800 7=200\nHardSwish                487                      1 1 480 487 -23330=4,3,14,14,200\nConvolution              488                      1 1 487 488 -23330=4,3,14,14,80 0=80 1=1 5=1 6=16000\nBinaryOp                 490                      2 1 470_splitncnn_0 488 490 -23330=4,3,14,14,80\nSplit                    splitncnn_9              1 2 490 490_splitncnn_0 490_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              491                      1 1 490_splitncnn_1 491 -23330=4,3,14,14,184 0=184 1=1 5=1 6=14720\nHardSwish                498                      1 1 491 498 -23330=4,3,14,14,184\nConvolutionDepthWise     499                      1 1 498 499 -23330=4,3,14,14,184 0=184 1=3 4=1 5=1 6=1656 7=184\nHardSwish                506                      1 1 499 506 -23330=4,3,14,14,184\nConvolution              507                      1 1 506 507 -23330=4,3,14,14,80 0=80 1=1 5=1 6=14720\nBinaryOp                 509                      2 1 490_splitncnn_0 507 509 -23330=4,3,14,14,80\nSplit                    splitncnn_10             1 2 509 509_splitncnn_0 509_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              510                      1 1 509_splitncnn_1 510 -23330=4,3,14,14,184 0=184 1=1 5=1 6=14720\nHardSwish                517                      1 1 510 517 -23330=4,3,14,14,184\nConvolutionDepthWise     518                      1 1 517 518 -23330=4,3,14,14,184 0=184 1=3 4=1 5=1 6=1656 7=184\nHardSwish                525                      1 1 518 525 -23330=4,3,14,14,184\nConvolution              526                     
 1 1 525 526 -23330=4,3,14,14,80 0=80 1=1 5=1 6=14720\nBinaryOp                 528                      2 1 509_splitncnn_0 526 528 -23330=4,3,14,14,80\nConvolution              529                      1 1 528 529 -23330=4,3,14,14,480 0=480 1=1 5=1 6=38400\nHardSwish                536                      1 1 529 536 -23330=4,3,14,14,480\nConvolutionDepthWise     537                      1 1 536 537 -23330=4,3,14,14,480 0=480 1=3 4=1 5=1 6=4320 7=480\nSplit                    splitncnn_11             1 2 537 537_splitncnn_0 537_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nPooling                  545                      1 1 537_splitncnn_1 549 -23330=4,1,480,1,1 0=1 4=1\nInnerProduct             550                      1 1 549 551 -23330=4,1,120,1,1 0=120 1=1 2=57600 9=1\nInnerProduct             552                      1 1 551 552 -23330=4,1,480,1,1 0=480 1=1 2=57600\nHardSigmoid              557                      1 1 552 557 -23330=4,1,480,1,1\nBinaryOp                 566                      2 1 537_splitncnn_0 557 566 -23330=4,3,14,14,480 0=2\nHardSwish                572                      1 1 566 572 -23330=4,3,14,14,480\nConvolution              573                      1 1 572 573 -23330=4,3,14,14,112 0=112 1=1 5=1 6=53760\nSplit                    splitncnn_12             1 2 573 573_splitncnn_0 573_splitncnn_1 -23330=8,3,14,14,112,3,14,14,112\nConvolution              575                      1 1 573_splitncnn_1 575 -23330=4,3,14,14,672 0=672 1=1 5=1 6=75264\nHardSwish                582                      1 1 575 582 -23330=4,3,14,14,672\nConvolutionDepthWise     583                      1 1 582 583 -23330=4,3,14,14,672 0=672 1=3 4=1 5=1 6=6048 7=672\nSplit                    splitncnn_13             1 2 583 583_splitncnn_0 583_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nPooling                  591                      1 1 583_splitncnn_1 595 -23330=4,1,672,1,1 0=1 4=1\nInnerProduct             596                      1 1 595 597 
-23330=4,1,168,1,1 0=168 1=1 2=112896 9=1\nInnerProduct             598                      1 1 597 598 -23330=4,1,672,1,1 0=672 1=1 2=112896\nHardSigmoid              603                      1 1 598 603 -23330=4,1,672,1,1\nBinaryOp                 612                      2 1 583_splitncnn_0 603 612 -23330=4,3,14,14,672 0=2\nHardSwish                618                      1 1 612 618 -23330=4,3,14,14,672\nConvolution              619                      1 1 618 619 -23330=4,3,14,14,112 0=112 1=1 5=1 6=75264\nBinaryOp                 621                      2 1 573_splitncnn_0 619 621 -23330=4,3,14,14,112\nConvolution              622                      1 1 621 622 -23330=4,3,14,14,672 0=672 1=1 5=1 6=75264\nHardSwish                629                      1 1 622 629 -23330=4,3,14,14,672\nConvolutionDepthWise     630                      1 1 629 630 -23330=4,3,14,14,672 0=672 1=5 4=2 5=1 6=16800 7=672\nSplit                    splitncnn_14             1 2 630 630_splitncnn_0 630_splitncnn_1 -23330=8,3,14,14,672,3,14,14,672\nPooling                  638                      1 1 630_splitncnn_1 642 -23330=4,1,672,1,1 0=1 4=1\nInnerProduct             643                      1 1 642 644 -23330=4,1,168,1,1 0=168 1=1 2=112896 9=1\nInnerProduct             645                      1 1 644 645 -23330=4,1,672,1,1 0=672 1=1 2=112896\nHardSigmoid              650                      1 1 645 650 -23330=4,1,672,1,1\nBinaryOp                 659                      2 1 630_splitncnn_0 650 659 -23330=4,3,14,14,672 0=2\nHardSwish                665                      1 1 659 665 -23330=4,3,14,14,672\nConvolution              666                      1 1 665 666 -23330=4,3,14,14,160 0=160 1=1 5=1 6=107520\nConvolution              668                      1 1 666 668 -23330=4,3,14,14,672 0=672 1=1 5=1 6=107520\nHardSwish                675                      1 1 668 675 -23330=4,3,14,14,672\nConvolutionDepthWise     676                      1 1 675 676 
-23330=4,3,7,7,672 0=672 1=5 3=2 4=2 5=1 6=16800 7=672\nSplit                    splitncnn_15             1 2 676 676_splitncnn_0 676_splitncnn_1 -23330=8,3,7,7,672,3,7,7,672\nPooling                  684                      1 1 676_splitncnn_1 688 -23330=4,1,672,1,1 0=1 4=1\nInnerProduct             689                      1 1 688 690 -23330=4,1,168,1,1 0=168 1=1 2=112896 9=1\nInnerProduct             691                      1 1 690 691 -23330=4,1,672,1,1 0=672 1=1 2=112896\nHardSigmoid              696                      1 1 691 696 -23330=4,1,672,1,1\nBinaryOp                 705                      2 1 676_splitncnn_0 696 705 -23330=4,3,7,7,672 0=2\nHardSwish                711                      1 1 705 711 -23330=4,3,7,7,672\nConvolution              712                      1 1 711 712 -23330=4,3,7,7,160 0=160 1=1 5=1 6=107520\nSplit                    splitncnn_16             1 2 712 712_splitncnn_0 712_splitncnn_1 -23330=8,3,7,7,160,3,7,7,160\nConvolution              714                      1 1 712_splitncnn_1 714 -23330=4,3,7,7,960 0=960 1=1 5=1 6=153600\nHardSwish                721                      1 1 714 721 -23330=4,3,7,7,960\nConvolutionDepthWise     722                      1 1 721 722 -23330=4,3,7,7,960 0=960 1=5 4=2 5=1 6=24000 7=960\nSplit                    splitncnn_17             1 2 722 722_splitncnn_0 722_splitncnn_1 -23330=8,3,7,7,960,3,7,7,960\nPooling                  730                      1 1 722_splitncnn_1 734 -23330=4,1,960,1,1 0=1 4=1\nInnerProduct             735                      1 1 734 736 -23330=4,1,240,1,1 0=240 1=1 2=230400 9=1\nInnerProduct             737                      1 1 736 737 -23330=4,1,960,1,1 0=960 1=1 2=230400\nHardSigmoid              742                      1 1 737 742 -23330=4,1,960,1,1\nBinaryOp                 751                      2 1 722_splitncnn_0 742 751 -23330=4,3,7,7,960 0=2\nHardSwish                757                      1 1 751 757 -23330=4,3,7,7,960\nConvolution       
       758                      1 1 757 758 -23330=4,3,7,7,160 0=160 1=1 5=1 6=153600\nBinaryOp                 760                      2 1 712_splitncnn_0 758 760 -23330=4,3,7,7,160\nConvolution              761                      1 1 760 761 -23330=4,3,7,7,960 0=960 1=1 5=1 6=153600\nHardSwish                768                      1 1 761 768 -23330=4,3,7,7,960\nPooling                  769                      1 1 768 769 -23330=4,1,960,1,1 0=1 4=1\nHardSwish                775                      1 1 769 775 -23330=4,1,960,1,1\nReshape                  783                      1 1 775 783 -23330=4,1,960,1,1 0=-1\nInnerProduct             784                      1 1 783 784 -23330=4,1,1280,1,1 0=1280 1=1 2=1228800\nHardSwish                790                      1 1 784 790 -23330=4,1,1280,1,1\nInnerProduct             791                      1 1 790 791 -23330=4,1,1000,1,1 0=1000 1=1 2=1280000\nSoftmax                  prob                     1 1 791 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/mobilenet_yolo.param",
    "content": "7767517\n39 41\nInput                    data                     0 1 data -23330=4,3,416,416,3 0=416 1=416 2=3\nConvolution              conv0                    1 1 data conv0_conv0/relu -23330=4,3,208,208,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nConvolutionDepthWise     conv1/dw                 1 1 conv0_conv0/relu conv1/dw_conv1/dw/relu -23330=4,3,208,208,32 0=32 1=3 4=1 5=1 6=288 7=32 9=1\nConvolution              conv1                    1 1 conv1/dw_conv1/dw/relu conv1_conv1/relu -23330=4,3,208,208,64 0=64 1=1 5=1 6=2048 9=1\nConvolutionDepthWise     conv2/dw                 1 1 conv1_conv1/relu conv2/dw_conv2/dw/relu -23330=4,3,104,104,64 0=64 1=3 3=2 4=1 5=1 6=576 7=64 9=1\nConvolution              conv2                    1 1 conv2/dw_conv2/dw/relu conv2_conv2/relu -23330=4,3,104,104,128 0=128 1=1 5=1 6=8192 9=1\nConvolutionDepthWise     conv3/dw                 1 1 conv2_conv2/relu conv3/dw_conv3/dw/relu -23330=4,3,104,104,128 0=128 1=3 4=1 5=1 6=1152 7=128 9=1\nConvolution              conv3                    1 1 conv3/dw_conv3/dw/relu conv3_conv3/relu -23330=4,3,104,104,128 0=128 1=1 5=1 6=16384 9=1\nConvolutionDepthWise     conv4/dw                 1 1 conv3_conv3/relu conv4/dw_conv4/dw/relu -23330=4,3,52,52,128 0=128 1=3 3=2 4=1 5=1 6=1152 7=128 9=1\nConvolution              conv4                    1 1 conv4/dw_conv4/dw/relu conv4_conv4/relu -23330=4,3,52,52,256 0=256 1=1 5=1 6=32768 9=1\nConvolutionDepthWise     conv5/dw                 1 1 conv4_conv4/relu conv5/dw_conv5/dw/relu -23330=4,3,52,52,256 0=256 1=3 4=1 5=1 6=2304 7=256 9=1\nConvolution              conv5                    1 1 conv5/dw_conv5/dw/relu conv5_conv5/relu -23330=4,3,52,52,256 0=256 1=1 5=1 6=65536 9=1\nConvolutionDepthWise     conv6/dw                 1 1 conv5_conv5/relu conv6/dw_conv6/dw/relu -23330=4,3,26,26,256 0=256 1=3 3=2 4=1 5=1 6=2304 7=256 9=1\nConvolution              conv6                    1 1 conv6/dw_conv6/dw/relu conv6_conv6/relu 
-23330=4,3,26,26,512 0=512 1=1 5=1 6=131072 9=1\nConvolutionDepthWise     conv7/dw                 1 1 conv6_conv6/relu conv7/dw_conv7/dw/relu -23330=4,3,26,26,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv7                    1 1 conv7/dw_conv7/dw/relu conv7_conv7/relu -23330=4,3,26,26,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv8/dw                 1 1 conv7_conv7/relu conv8/dw_conv8/dw/relu -23330=4,3,26,26,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv8                    1 1 conv8/dw_conv8/dw/relu conv8_conv8/relu -23330=4,3,26,26,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv9/dw                 1 1 conv8_conv8/relu conv9/dw_conv9/dw/relu -23330=4,3,26,26,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv9                    1 1 conv9/dw_conv9/dw/relu conv9_conv9/relu -23330=4,3,26,26,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv10/dw                1 1 conv9_conv9/relu conv10/dw_conv10/dw/relu -23330=4,3,26,26,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv10                   1 1 conv10/dw_conv10/dw/relu conv10_conv10/relu -23330=4,3,26,26,512 0=512 1=1 5=1 6=262144 9=1\nConvolutionDepthWise     conv11/dw                1 1 conv10_conv10/relu conv11/dw_conv11/dw/relu -23330=4,3,26,26,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv11                   1 1 conv11/dw_conv11/dw/relu conv11_conv11/relu -23330=4,3,26,26,512 0=512 1=1 5=1 6=262144 9=1\nSplit                    splitncnn_0              1 2 conv11_conv11/relu conv11_conv11/relu_splitncnn_0 conv11_conv11/relu_splitncnn_1 -23330=8,3,26,26,512,3,26,26,512\nConvolutionDepthWise     conv12/dw                1 1 conv11_conv11/relu_splitncnn_1 conv12/dw_conv12/dw/relu -23330=4,3,13,13,512 0=512 1=3 3=2 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv12                   1 1 conv12/dw_conv12/dw/relu conv12_conv12/relu -23330=4,3,13,13,1024 0=1024 
1=1 5=1 6=524288 9=1\nConvolutionDepthWise     conv13/dw                1 1 conv12_conv12/relu conv13/dw_conv13/dw/relu -23330=4,3,13,13,1024 0=1024 1=3 4=1 5=1 6=9216 7=1024 9=1\nConvolution              conv13                   1 1 conv13/dw_conv13/dw/relu conv13_conv13/relu -23330=4,3,13,13,1024 0=1024 1=1 5=1 6=1048576 9=1\nConvolutionDepthWise     conv16/dw                1 1 conv13_conv13/relu conv16/dw_conv16/dw/relu -23330=4,3,13,13,1024 0=1024 1=3 4=1 5=1 6=9216 7=1024 9=1\nConvolution              conv17                   1 1 conv16/dw_conv16/dw/relu conv17_conv17/relu -23330=4,3,13,13,1024 0=1024 1=1 5=1 6=1048576 9=1\nSplit                    splitncnn_1              1 2 conv17_conv17/relu conv17_conv17/relu_splitncnn_0 conv17_conv17/relu_splitncnn_1 -23330=8,3,13,13,1024,3,13,13,1024\nDeconvolutionDepthWise   upsample                 1 1 conv17_conv17/relu_splitncnn_1 upsample -23330=4,3,26,26,512 0=512 1=4 3=2 4=1 6=16384 7=512\nEltwise                  conv_18/sum              2 1 conv11_conv11/relu_splitncnn_0 upsample conv_18/sum -23330=4,3,26,26,512 0=1\nConvolutionDepthWise     conv19/dw                1 1 conv_18/sum conv19/dw_conv19/dw/relu -23330=4,3,26,26,512 0=512 1=3 4=1 5=1 6=4608 7=512 9=1\nConvolution              conv20                   1 1 conv19/dw_conv19/dw/relu conv20_conv20/relu -23330=4,3,26,26,1024 0=1024 1=1 5=1 6=524288 9=1\nConvolution              conv22_indoor            1 1 conv17_conv17/relu_splitncnn_0 conv22 -23330=4,3,13,13,125 0=125 1=1 5=1 6=128000\nConvolution              conv23_indoor            1 1 conv20_conv20/relu conv23 -23330=4,3,26,26,125 0=125 1=1 5=1 6=128000\nYoloDetectionOutput      detection_out            2 1 conv22 conv23 output -23330=4,3,13,13,125 2=4.000000e-01 -23304=10,1.080000e+00,1.190000e+00,3.420000e+00,4.410000e+00,6.630000e+00,1.138000e+01,9.420000e+00,5.110000e+00,1.662000e+01,1.052000e+01\n"
  },
  {
    "path": "benchmark/mobilenetv2_yolov3.param",
    "content": "7767517\n87 99\nInput                    data                     0 1 data -23330=4,3,352,352,3 0=352 1=352 2=3\nConvolution              conv1                    1 1 data conv1_relu1 -23330=4,3,176,176,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nConvolutionDepthWise     conv2                    1 1 conv1_relu1 conv2_relu2 -23330=4,3,176,176,32 0=32 1=3 4=1 5=1 6=288 7=32 9=1\nConvolution              conv3                    1 1 conv2_relu2 conv3 -23330=4,3,176,176,16 0=16 1=1 5=1 6=512\nConvolution              conv4                    1 1 conv3 conv4_relu3 -23330=4,3,176,176,96 0=96 1=1 5=1 6=1536 9=1\nConvolutionDepthWise     conv5                    1 1 conv4_relu3 conv5_relu4 -23330=4,3,88,88,96 0=96 1=3 3=2 4=1 5=1 6=864 7=96 9=1\nConvolution              conv6                    1 1 conv5_relu4 conv6 -23330=4,3,88,88,24 0=24 1=1 5=1 6=2304\nSplit                    splitncnn_0              1 2 conv6 conv6_splitncnn_0 conv6_splitncnn_1 -23330=8,3,88,88,24,3,88,88,24\nConvolution              conv7                    1 1 conv6_splitncnn_1 conv7_relu5 -23330=4,3,88,88,144 0=144 1=1 5=1 6=3456 9=1\nConvolutionDepthWise     conv8                    1 1 conv7_relu5 conv8_relu6 -23330=4,3,88,88,144 0=144 1=3 4=1 5=1 6=1296 7=144 9=1\nConvolution              conv9                    1 1 conv8_relu6 conv9 -23330=4,3,88,88,24 0=24 1=1 5=1 6=3456\nEltwise                  add1                     2 1 conv6_splitncnn_0 conv9 add1 -23330=4,3,88,88,24 0=1\nConvolution              conv10                   1 1 add1 conv10_relu7 -23330=4,3,88,88,144 0=144 1=1 5=1 6=3456 9=1\nConvolutionDepthWise     conv11                   1 1 conv10_relu7 conv11_relu8 -23330=4,3,44,44,144 0=144 1=3 3=2 4=1 5=1 6=1296 7=144 9=1\nConvolution              conv12                   1 1 conv11_relu8 conv12 -23330=4,3,44,44,32 0=32 1=1 5=1 6=4608\nSplit                    splitncnn_1              1 2 conv12 conv12_splitncnn_0 conv12_splitncnn_1 
-23330=8,3,44,44,32,3,44,44,32\nConvolution              conv13                   1 1 conv12_splitncnn_1 conv13_relu9 -23330=4,3,44,44,192 0=192 1=1 5=1 6=6144 9=1\nConvolutionDepthWise     conv14                   1 1 conv13_relu9 conv14_relu10 -23330=4,3,44,44,192 0=192 1=3 4=1 5=1 6=1728 7=192 9=1\nConvolution              conv15                   1 1 conv14_relu10 conv15 -23330=4,3,44,44,32 0=32 1=1 5=1 6=6144\nEltwise                  add2                     2 1 conv12_splitncnn_0 conv15 add2 -23330=4,3,44,44,32 0=1\nSplit                    splitncnn_2              1 2 add2 add2_splitncnn_0 add2_splitncnn_1 -23330=8,3,44,44,32,3,44,44,32\nConvolution              conv16                   1 1 add2_splitncnn_1 conv16_relu11 -23330=4,3,44,44,192 0=192 1=1 5=1 6=6144 9=1\nConvolutionDepthWise     conv17                   1 1 conv16_relu11 conv17_relu12 -23330=4,3,44,44,192 0=192 1=3 4=1 5=1 6=1728 7=192 9=1\nConvolution              conv18                   1 1 conv17_relu12 conv18 -23330=4,3,44,44,32 0=32 1=1 5=1 6=6144\nEltwise                  add3                     2 1 add2_splitncnn_0 conv18 add3 -23330=4,3,44,44,32 0=1\nConvolution              conv19                   1 1 add3 conv19_relu13 -23330=4,3,44,44,192 0=192 1=1 5=1 6=6144 9=1\nConvolutionDepthWise     conv20                   1 1 conv19_relu13 conv20_relu14 -23330=4,3,22,22,192 0=192 1=3 3=2 4=1 5=1 6=1728 7=192 9=1\nConvolution              conv21                   1 1 conv20_relu14 conv21 -23330=4,3,22,22,64 0=64 1=1 5=1 6=12288\nSplit                    splitncnn_3              1 2 conv21 conv21_splitncnn_0 conv21_splitncnn_1 -23330=8,3,22,22,64,3,22,22,64\nConvolution              conv22                   1 1 conv21_splitncnn_1 conv22_relu15 -23330=4,3,22,22,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv23                   1 1 conv22_relu15 conv23_relu16 -23330=4,3,22,22,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv24                   1 1 
conv23_relu16 conv24 -23330=4,3,22,22,64 0=64 1=1 5=1 6=24576\nEltwise                  add4                     2 1 conv21_splitncnn_0 conv24 add4 -23330=4,3,22,22,64 0=1\nSplit                    splitncnn_4              1 2 add4 add4_splitncnn_0 add4_splitncnn_1 -23330=8,3,22,22,64,3,22,22,64\nConvolution              conv25                   1 1 add4_splitncnn_1 conv25_relu17 -23330=4,3,22,22,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv26                   1 1 conv25_relu17 conv26_relu18 -23330=4,3,22,22,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv27                   1 1 conv26_relu18 conv27 -23330=4,3,22,22,64 0=64 1=1 5=1 6=24576\nEltwise                  add5                     2 1 add4_splitncnn_0 conv27 add5 -23330=4,3,22,22,64 0=1\nSplit                    splitncnn_5              1 2 add5 add5_splitncnn_0 add5_splitncnn_1 -23330=8,3,22,22,64,3,22,22,64\nConvolution              conv28                   1 1 add5_splitncnn_1 conv28_relu19 -23330=4,3,22,22,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv29                   1 1 conv28_relu19 conv29_relu20 -23330=4,3,22,22,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv30                   1 1 conv29_relu20 conv30 -23330=4,3,22,22,64 0=64 1=1 5=1 6=24576\nEltwise                  add6                     2 1 add5_splitncnn_0 conv30 add6 -23330=4,3,22,22,64 0=1\nConvolution              conv31                   1 1 add6 conv31_relu21 -23330=4,3,22,22,384 0=384 1=1 5=1 6=24576 9=1\nConvolutionDepthWise     conv32                   1 1 conv31_relu21 conv32_relu22 -23330=4,3,22,22,384 0=384 1=3 4=1 5=1 6=3456 7=384 9=1\nConvolution              conv33                   1 1 conv32_relu22 conv33 -23330=4,3,22,22,96 0=96 1=1 5=1 6=36864\nSplit                    splitncnn_6              1 2 conv33 conv33_splitncnn_0 conv33_splitncnn_1 -23330=8,3,22,22,96,3,22,22,96\nConvolution              conv34                   1 1 conv33_splitncnn_1 
conv34_relu23 -23330=4,3,22,22,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     conv35                   1 1 conv34_relu23 conv35_relu24 -23330=4,3,22,22,576 0=576 1=3 4=1 5=1 6=5184 7=576 9=1\nConvolution              conv36                   1 1 conv35_relu24 conv36 -23330=4,3,22,22,96 0=96 1=1 5=1 6=55296\nEltwise                  add7                     2 1 conv33_splitncnn_0 conv36 add7 -23330=4,3,22,22,96 0=1\nSplit                    splitncnn_7              1 2 add7 add7_splitncnn_0 add7_splitncnn_1 -23330=8,3,22,22,96,3,22,22,96\nConvolution              conv37                   1 1 add7_splitncnn_1 conv37_relu25 -23330=4,3,22,22,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     conv38                   1 1 conv37_relu25 conv38_relu26 -23330=4,3,22,22,576 0=576 1=3 4=1 5=1 6=5184 7=576 9=1\nConvolution              conv39                   1 1 conv38_relu26 conv39 -23330=4,3,22,22,96 0=96 1=1 5=1 6=55296\nEltwise                  add8                     2 1 add7_splitncnn_0 conv39 add8 -23330=4,3,22,22,96 0=1\nConvolution              conv40                   1 1 add8 conv40_relu27 -23330=4,3,22,22,576 0=576 1=1 5=1 6=55296 9=1\nSplit                    splitncnn_8              1 2 conv40_relu27 conv40_relu27_splitncnn_0 conv40_relu27_splitncnn_1 -23330=8,3,22,22,576,3,22,22,576\nConvolutionDepthWise     conv41                   1 1 conv40_relu27_splitncnn_1 conv41_relu28 -23330=4,3,11,11,576 0=576 1=3 3=2 4=1 5=1 6=5184 7=576 9=1\nConvolution              conv42                   1 1 conv41_relu28 conv42 -23330=4,3,11,11,160 0=160 1=1 5=1 6=92160\nSplit                    splitncnn_9              1 2 conv42 conv42_splitncnn_0 conv42_splitncnn_1 -23330=8,3,11,11,160,3,11,11,160\nConvolution              conv43                   1 1 conv42_splitncnn_1 conv43_relu29 -23330=4,3,11,11,960 0=960 1=1 5=1 6=153600 9=1\nConvolutionDepthWise     conv44                   1 1 conv43_relu29 conv44_relu30 -23330=4,3,11,11,960 0=960 1=3 4=1 5=1 6=8640 
7=960 9=1\nConvolution              conv45                   1 1 conv44_relu30 conv45 -23330=4,3,11,11,160 0=160 1=1 5=1 6=153600\nEltwise                  add9                     2 1 conv42_splitncnn_0 conv45 add9 -23330=4,3,11,11,160 0=1\nSplit                    splitncnn_10             1 2 add9 add9_splitncnn_0 add9_splitncnn_1 -23330=8,3,11,11,160,3,11,11,160\nConvolution              conv46                   1 1 add9_splitncnn_1 conv46_relu31 -23330=4,3,11,11,960 0=960 1=1 5=1 6=153600 9=1\nConvolutionDepthWise     conv47                   1 1 conv46_relu31 conv47_relu32 -23330=4,3,11,11,960 0=960 1=3 4=1 5=1 6=8640 7=960 9=1\nConvolution              conv48                   1 1 conv47_relu32 conv48 -23330=4,3,11,11,160 0=160 1=1 5=1 6=153600\nEltwise                  add10                    2 1 add9_splitncnn_0 conv48 add10 -23330=4,3,11,11,160 0=1\nConvolution              conv49                   1 1 add10 conv49_relu33 -23330=4,3,11,11,960 0=960 1=1 5=1 6=153600 9=1\nConvolutionDepthWise     conv50                   1 1 conv49_relu33 conv50_relu34 -23330=4,3,11,11,960 0=960 1=3 4=1 5=1 6=8640 7=960 9=1\nConvolution              conv51                   1 1 conv50_relu34 conv51 -23330=4,3,11,11,320 0=320 1=1 5=1 6=307200\nConvolution              conv52                   1 1 conv51 conv52_relu35 -23330=4,3,11,11,1280 0=1280 1=1 5=1 6=409600 9=1\nConvolutionDepthWise     yolo/conv1/dw            1 1 conv52_relu35 yolo/conv1/dw_yolo/conv1/dw/relu -23330=4,3,11,11,1280 0=1280 1=3 4=1 5=1 6=11520 7=1280 9=1\nConvolution              yolo/conv1               1 1 yolo/conv1/dw_yolo/conv1/dw/relu yolo/conv1_yolo/conv1/relu -23330=4,3,11,11,576 0=576 1=1 5=1 6=737280 9=1\nSplit                    splitncnn_11             1 2 yolo/conv1_yolo/conv1/relu yolo/conv1_yolo/conv1/relu_splitncnn_0 yolo/conv1_yolo/conv1/relu_splitncnn_1 -23330=8,3,11,11,576,3,11,11,576\nDeconvolutionDepthWise   upsample                 1 1 yolo/conv1_yolo/conv1/relu_splitncnn_1 upsample 
-23330=4,3,21,21,576 0=576 1=1 3=2 6=576 7=576\nPooling                  maxpool                  1 1 upsample maxpool -23330=4,3,22,22,576 1=2 3=1\nConvolutionDepthWise     yolo/conv2/dw            1 1 conv40_relu27_splitncnn_0 yolo/conv2/dw_yolo/conv2/dw/relu -23330=4,3,22,22,576 0=576 1=3 4=1 5=1 6=5184 7=576 9=1\nConvolution              yolo/conv2               1 1 yolo/conv2/dw_yolo/conv2/dw/relu yolo/conv2_yolo/conv2/relu -23330=4,3,22,22,576 0=576 1=1 5=1 6=331776 9=1\nEltwise                  yolo/conv2/sum           2 1 maxpool yolo/conv2_yolo/conv2/relu yolo/conv2/sum -23330=4,3,22,22,576 0=1\nConvolutionDepthWise     yolo/conv3/dw            1 1 yolo/conv2/sum yolo/conv3/dw_yolo/conv3/dw/relu -23330=4,3,22,22,576 0=576 1=3 4=1 5=1 6=5184 7=576 9=1\nConvolution              yolo/conv3               1 1 yolo/conv3/dw_yolo/conv3/dw/relu yolo/conv3_yolo/conv3/relu -23330=4,3,22,22,576 0=576 1=1 5=1 6=331776 9=1\nConvolution              yolo/conv4               1 1 yolo/conv1_yolo/conv1/relu_splitncnn_0 yolo/conv4 -23330=4,3,11,11,75 0=75 1=1 5=1 6=43200\nConvolution              yolo/conv5               1 1 yolo/conv3_yolo/conv3/relu yolo/conv5 -23330=4,3,22,22,75 0=75 1=1 5=1 6=43200\nYolov3DetectionOutput    detection_out            2 1 yolo/conv4 yolo/conv5 output 1=3 2=3.000000e-01 -23304=12,2.000000e+01,3.700000e+01,4.900000e+01,9.400000e+01,7.300000e+01,2.010000e+02,1.430000e+02,2.650000e+02,1.530000e+02,1.210000e+02,2.800000e+02,2.790000e+02 -23305=6,1077936128,1082130432,1084227584,0,1065353216,1073741824 -23306=2,3.200000e+01,1.600000e+01\n"
  },
  {
    "path": "benchmark/nanodet_m.param",
    "content": "7767517\n179 204\nInput                    input.1                  0 1 input.1 -23330=4,3,320,320,3 0=320 1=320 2=3\nConvolution              Conv_0                   1 1 input.1 424 -23330=4,3,160,160,24 0=24 1=3 3=2 4=1 5=1 6=648 9=2 -23310=1,1.000000e-01\nPooling                  MaxPool_2                1 1 424 425 -23330=4,3,80,80,24 1=3 2=2 3=1 5=1\nSplit                    splitncnn_0              1 2 425 425_splitncnn_0 425_splitncnn_1 -23330=8,3,80,80,24,3,80,80,24\nConvolutionDepthWise     Conv_3                   1 1 425_splitncnn_1 943 -23330=4,3,40,40,24 0=24 1=3 3=2 4=1 5=1 6=216 7=24\nConvolution              Conv_4                   1 1 943 430 -23330=4,3,40,40,58 0=58 1=1 5=1 6=1392 9=2 -23310=1,1.000000e-01\nConvolution              Conv_6                   1 1 425_splitncnn_0 433 -23330=4,3,80,80,58 0=58 1=1 5=1 6=1392 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_8                   1 1 433 952 -23330=4,3,40,40,58 0=58 1=3 3=2 4=1 5=1 6=522 7=58\nConvolution              Conv_9                   1 1 952 438 -23330=4,3,40,40,58 0=58 1=1 5=1 6=3364 9=2 -23310=1,1.000000e-01\nConcat                   Concat_11                2 1 430 438 439 -23330=4,3,40,40,116\nShuffleChannel           Reshape_16               1 1 439 444 -23330=4,3,40,40,116 0=2\nSplit                    splitncnn_1              1 2 444 444_splitncnn_0 444_splitncnn_1 -23330=8,3,40,40,116,3,40,40,116\nCrop                     Slice_27                 1 1 444_splitncnn_1 455 -23330=4,3,40,40,58 -23309=1,0 -23310=1,58 -23311=1,0\nCrop                     Slice_30                 1 1 444_splitncnn_0 458 -23330=4,3,40,40,58 -23309=1,58 -23310=1,116 -23311=1,0\nConvolution              Conv_31                  1 1 458 461 -23330=4,3,40,40,58 0=58 1=1 5=1 6=3364 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_33                  1 1 461 961 -23330=4,3,40,40,58 0=58 1=3 4=1 5=1 6=522 7=58\nConvolution              Conv_34                  1 1 961 466 
-23330=4,3,40,40,58 0=58 1=1 5=1 6=3364 9=2 -23310=1,1.000000e-01\nConcat                   Concat_36                2 1 455 466 467 -23330=4,3,40,40,116\nShuffleChannel           Reshape_41               1 1 467 472 -23330=4,3,40,40,116 0=2\nSplit                    splitncnn_2              1 2 472 472_splitncnn_0 472_splitncnn_1 -23330=8,3,40,40,116,3,40,40,116\nCrop                     Slice_52                 1 1 472_splitncnn_1 483 -23330=4,3,40,40,58 -23309=1,0 -23310=1,58 -23311=1,0\nCrop                     Slice_55                 1 1 472_splitncnn_0 486 -23330=4,3,40,40,58 -23309=1,58 -23310=1,116 -23311=1,0\nConvolution              Conv_56                  1 1 486 489 -23330=4,3,40,40,58 0=58 1=1 5=1 6=3364 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_58                  1 1 489 970 -23330=4,3,40,40,58 0=58 1=3 4=1 5=1 6=522 7=58\nConvolution              Conv_59                  1 1 970 494 -23330=4,3,40,40,58 0=58 1=1 5=1 6=3364 9=2 -23310=1,1.000000e-01\nConcat                   Concat_61                2 1 483 494 495 -23330=4,3,40,40,116\nShuffleChannel           Reshape_66               1 1 495 500 -23330=4,3,40,40,116 0=2\nSplit                    splitncnn_3              1 2 500 500_splitncnn_0 500_splitncnn_1 -23330=8,3,40,40,116,3,40,40,116\nCrop                     Slice_77                 1 1 500_splitncnn_1 511 -23330=4,3,40,40,58 -23309=1,0 -23310=1,58 -23311=1,0\nCrop                     Slice_80                 1 1 500_splitncnn_0 514 -23330=4,3,40,40,58 -23309=1,58 -23310=1,116 -23311=1,0\nConvolution              Conv_81                  1 1 514 517 -23330=4,3,40,40,58 0=58 1=1 5=1 6=3364 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_83                  1 1 517 979 -23330=4,3,40,40,58 0=58 1=3 4=1 5=1 6=522 7=58\nConvolution              Conv_84                  1 1 979 522 -23330=4,3,40,40,58 0=58 1=1 5=1 6=3364 9=2 -23310=1,1.000000e-01\nConcat                   Concat_86                2 1 511 522 523 
-23330=4,3,40,40,116\nShuffleChannel           Reshape_91               1 1 523 528 -23330=4,3,40,40,116 0=2\nSplit                    splitncnn_4              1 3 528 528_splitncnn_0 528_splitncnn_1 528_splitncnn_2 -23330=12,3,40,40,116,3,40,40,116,3,40,40,116\nConvolutionDepthWise     Conv_92                  1 1 528_splitncnn_2 985 -23330=4,3,20,20,116 0=116 1=3 3=2 4=1 5=1 6=1044 7=116\nConvolution              Conv_93                  1 1 985 533 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolution              Conv_95                  1 1 528_splitncnn_1 536 -23330=4,3,40,40,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_97                  1 1 536 994 -23330=4,3,20,20,116 0=116 1=3 3=2 4=1 5=1 6=1044 7=116\nConvolution              Conv_98                  1 1 994 541 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_100               2 1 533 541 542 -23330=4,3,20,20,232\nShuffleChannel           Reshape_105              1 1 542 547 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_5              1 2 547 547_splitncnn_0 547_splitncnn_1 -23330=8,3,20,20,232,3,20,20,232\nCrop                     Slice_116                1 1 547_splitncnn_1 558 -23330=4,3,20,20,116 -23309=1,0 -23310=1,116 -23311=1,0\nCrop                     Slice_119                1 1 547_splitncnn_0 561 -23330=4,3,20,20,116 -23309=1,116 -23310=1,232 -23311=1,0\nConvolution              Conv_120                 1 1 561 564 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_122                 1 1 564 1003 -23330=4,3,20,20,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              Conv_123                 1 1 1003 569 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_125               2 1 558 569 570 -23330=4,3,20,20,232\nShuffleChannel           Reshape_130          
    1 1 570 575 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_6              1 2 575 575_splitncnn_0 575_splitncnn_1 -23330=8,3,20,20,232,3,20,20,232\nCrop                     Slice_141                1 1 575_splitncnn_1 586 -23330=4,3,20,20,116 -23309=1,0 -23310=1,116 -23311=1,0\nCrop                     Slice_144                1 1 575_splitncnn_0 589 -23330=4,3,20,20,116 -23309=1,116 -23310=1,232 -23311=1,0\nConvolution              Conv_145                 1 1 589 592 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_147                 1 1 592 1012 -23330=4,3,20,20,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              Conv_148                 1 1 1012 597 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_150               2 1 586 597 598 -23330=4,3,20,20,232\nShuffleChannel           Reshape_155              1 1 598 603 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_7              1 2 603 603_splitncnn_0 603_splitncnn_1 -23330=8,3,20,20,232,3,20,20,232\nCrop                     Slice_166                1 1 603_splitncnn_1 614 -23330=4,3,20,20,116 -23309=1,0 -23310=1,116 -23311=1,0\nCrop                     Slice_169                1 1 603_splitncnn_0 617 -23330=4,3,20,20,116 -23309=1,116 -23310=1,232 -23311=1,0\nConvolution              Conv_170                 1 1 617 620 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_172                 1 1 620 1021 -23330=4,3,20,20,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              Conv_173                 1 1 1021 625 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_175               2 1 614 625 626 -23330=4,3,20,20,232\nShuffleChannel           Reshape_180              1 1 626 631 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_8              1 2 631 
631_splitncnn_0 631_splitncnn_1 -23330=8,3,20,20,232,3,20,20,232\nCrop                     Slice_191                1 1 631_splitncnn_1 642 -23330=4,3,20,20,116 -23309=1,0 -23310=1,116 -23311=1,0\nCrop                     Slice_194                1 1 631_splitncnn_0 645 -23330=4,3,20,20,116 -23309=1,116 -23310=1,232 -23311=1,0\nConvolution              Conv_195                 1 1 645 648 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_197                 1 1 648 1030 -23330=4,3,20,20,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              Conv_198                 1 1 1030 653 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_200               2 1 642 653 654 -23330=4,3,20,20,232\nShuffleChannel           Reshape_205              1 1 654 659 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_9              1 2 659 659_splitncnn_0 659_splitncnn_1 -23330=8,3,20,20,232,3,20,20,232\nCrop                     Slice_216                1 1 659_splitncnn_1 670 -23330=4,3,20,20,116 -23309=1,0 -23310=1,116 -23311=1,0\nCrop                     Slice_219                1 1 659_splitncnn_0 673 -23330=4,3,20,20,116 -23309=1,116 -23310=1,232 -23311=1,0\nConvolution              Conv_220                 1 1 673 676 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_222                 1 1 676 1039 -23330=4,3,20,20,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              Conv_223                 1 1 1039 681 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_225               2 1 670 681 682 -23330=4,3,20,20,232\nShuffleChannel           Reshape_230              1 1 682 687 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_10             1 2 687 687_splitncnn_0 687_splitncnn_1 -23330=8,3,20,20,232,3,20,20,232\nCrop                     Slice_241          
      1 1 687_splitncnn_1 698 -23330=4,3,20,20,116 -23309=1,0 -23310=1,116 -23311=1,0\nCrop                     Slice_244                1 1 687_splitncnn_0 701 -23330=4,3,20,20,116 -23309=1,116 -23310=1,232 -23311=1,0\nConvolution              Conv_245                 1 1 701 704 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_247                 1 1 704 1048 -23330=4,3,20,20,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              Conv_248                 1 1 1048 709 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_250               2 1 698 709 710 -23330=4,3,20,20,232\nShuffleChannel           Reshape_255              1 1 710 715 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_11             1 2 715 715_splitncnn_0 715_splitncnn_1 -23330=8,3,20,20,232,3,20,20,232\nCrop                     Slice_266                1 1 715_splitncnn_1 726 -23330=4,3,20,20,116 -23309=1,0 -23310=1,116 -23311=1,0\nCrop                     Slice_269                1 1 715_splitncnn_0 729 -23330=4,3,20,20,116 -23309=1,116 -23310=1,232 -23311=1,0\nConvolution              Conv_270                 1 1 729 732 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_272                 1 1 732 1057 -23330=4,3,20,20,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              Conv_273                 1 1 1057 737 -23330=4,3,20,20,116 0=116 1=1 5=1 6=13456 9=2 -23310=1,1.000000e-01\nConcat                   Concat_275               2 1 726 737 738 -23330=4,3,20,20,232\nShuffleChannel           Reshape_280              1 1 738 743 -23330=4,3,20,20,232 0=2\nSplit                    splitncnn_12             1 3 743 743_splitncnn_0 743_splitncnn_1 743_splitncnn_2 -23330=12,3,20,20,232,3,20,20,232,3,20,20,232\nConvolutionDepthWise     Conv_281                 1 1 743_splitncnn_2 1063 -23330=4,3,10,10,232 0=232 1=3 3=2 4=1 5=1 6=2088 
7=232\nConvolution              Conv_282                 1 1 1063 748 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConvolution              Conv_284                 1 1 743_splitncnn_1 751 -23330=4,3,20,20,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_286                 1 1 751 1072 -23330=4,3,10,10,232 0=232 1=3 3=2 4=1 5=1 6=2088 7=232\nConvolution              Conv_287                 1 1 1072 756 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConcat                   Concat_289               2 1 748 756 757 -23330=4,3,10,10,464\nShuffleChannel           Reshape_294              1 1 757 762 -23330=4,3,10,10,464 0=2\nSplit                    splitncnn_13             1 2 762 762_splitncnn_0 762_splitncnn_1 -23330=8,3,10,10,464,3,10,10,464\nCrop                     Slice_305                1 1 762_splitncnn_1 773 -23330=4,3,10,10,232 -23309=1,0 -23310=1,232 -23311=1,0\nCrop                     Slice_308                1 1 762_splitncnn_0 776 -23330=4,3,10,10,232 -23309=1,232 -23310=1,464 -23311=1,0\nConvolution              Conv_309                 1 1 776 779 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_311                 1 1 779 1081 -23330=4,3,10,10,232 0=232 1=3 4=1 5=1 6=2088 7=232\nConvolution              Conv_312                 1 1 1081 784 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConcat                   Concat_314               2 1 773 784 785 -23330=4,3,10,10,464\nShuffleChannel           Reshape_319              1 1 785 790 -23330=4,3,10,10,464 0=2\nSplit                    splitncnn_14             1 2 790 790_splitncnn_0 790_splitncnn_1 -23330=8,3,10,10,464,3,10,10,464\nCrop                     Slice_330                1 1 790_splitncnn_1 801 -23330=4,3,10,10,232 -23309=1,0 -23310=1,232 -23311=1,0\nCrop                     Slice_333                1 1 790_splitncnn_0 804 
-23330=4,3,10,10,232 -23309=1,232 -23310=1,464 -23311=1,0\nConvolution              Conv_334                 1 1 804 807 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_336                 1 1 807 1090 -23330=4,3,10,10,232 0=232 1=3 4=1 5=1 6=2088 7=232\nConvolution              Conv_337                 1 1 1090 812 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConcat                   Concat_339               2 1 801 812 813 -23330=4,3,10,10,464\nShuffleChannel           Reshape_344              1 1 813 818 -23330=4,3,10,10,464 0=2\nSplit                    splitncnn_15             1 2 818 818_splitncnn_0 818_splitncnn_1 -23330=8,3,10,10,464,3,10,10,464\nCrop                     Slice_355                1 1 818_splitncnn_1 829 -23330=4,3,10,10,232 -23309=1,0 -23310=1,232 -23311=1,0\nCrop                     Slice_358                1 1 818_splitncnn_0 832 -23330=4,3,10,10,232 -23309=1,232 -23310=1,464 -23311=1,0\nConvolution              Conv_359                 1 1 832 835 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_361                 1 1 835 1099 -23330=4,3,10,10,232 0=232 1=3 4=1 5=1 6=2088 7=232\nConvolution              Conv_362                 1 1 1099 840 -23330=4,3,10,10,232 0=232 1=1 5=1 6=53824 9=2 -23310=1,1.000000e-01\nConcat                   Concat_364               2 1 829 840 841 -23330=4,3,10,10,464\nShuffleChannel           Reshape_369              1 1 841 846 -23330=4,3,10,10,464 0=2\nConvolution              Conv_370                 1 1 528_splitncnn_0 847 -23330=4,3,40,40,96 0=96 1=1 5=1 6=11136\nConvolution              Conv_371                 1 1 743_splitncnn_0 848 -23330=4,3,20,20,96 0=96 1=1 5=1 6=22272\nConvolution              Conv_372                 1 1 846 849 -23330=4,3,10,10,96 0=96 1=1 5=1 6=44544\nSplit                    splitncnn_16             1 2 849 849_splitncnn_0 849_splitncnn_1 
-23330=8,3,10,10,96,3,10,10,96\nInterp                   Resize_374               1 1 849_splitncnn_1 854 -23330=4,3,20,20,96 0=2 1=2.000000e+00 2=2.000000e+00\nBinaryOp                 Add_375                  2 1 848 854 855 -23330=4,3,20,20,96\nSplit                    splitncnn_17             1 2 855 855_splitncnn_0 855_splitncnn_1 -23330=8,3,20,20,96,3,20,20,96\nInterp                   Resize_377               1 1 855_splitncnn_1 860 -23330=4,3,40,40,96 0=2 1=2.000000e+00 2=2.000000e+00\nBinaryOp                 Add_378                  2 1 847 860 861 -23330=4,3,40,40,96\nSplit                    splitncnn_18             1 2 861 861_splitncnn_0 861_splitncnn_1 -23330=8,3,40,40,96,3,40,40,96\nInterp                   Resize_380               1 1 861_splitncnn_1 866 -23330=4,3,20,20,96 0=2 1=5.000000e-01 2=5.000000e-01\nBinaryOp                 Add_381                  2 1 855_splitncnn_0 866 867 -23330=4,3,20,20,96\nSplit                    splitncnn_19             1 2 867 867_splitncnn_0 867_splitncnn_1 -23330=8,3,20,20,96,3,20,20,96\nInterp                   Resize_383               1 1 867_splitncnn_1 872 -23330=4,3,10,10,96 0=2 1=5.000000e-01 2=5.000000e-01\nBinaryOp                 Add_384                  2 1 849_splitncnn_0 872 873 -23330=4,3,10,10,96\nConvolutionDepthWise     Conv_385                 1 1 861_splitncnn_0 876 -23330=4,3,40,40,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              Conv_387                 1 1 876 879 -23330=4,3,40,40,96 0=96 1=1 5=1 6=9216 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_389                 1 1 879 882 -23330=4,3,40,40,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              Conv_391                 1 1 882 885 -23330=4,3,40,40,96 0=96 1=1 5=1 6=9216 9=2 -23310=1,1.000000e-01\nConvolution              Conv_393                 1 1 885 886 -23330=4,3,40,40,112 0=112 1=1 5=1 6=10752\nSlice                    Split_394                1 2 886 887 
888 -23330=8,3,40,40,80,3,40,40,32 -23300=2,80,-233\nSigmoid                  Sigmoid_395              1 1 887 889 -23330=4,3,40,40,80\nReshape                  Reshape_397              1 1 889 891 -23330=4,2,1600,80,1 0=-1 1=80\nPermute                  Transpose_398            1 1 891 cls_pred_stride_8 -23330=4,2,80,1600,1 0=1\nReshape                  Reshape_400              1 1 888 894 -23330=4,2,1600,32,1 0=-1 1=32\nPermute                  Transpose_401            1 1 894 dis_pred_stride_8 -23330=4,2,32,1600,1 0=1\nConvolutionDepthWise     Conv_402                 1 1 867_splitncnn_0 898 -23330=4,3,20,20,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              Conv_404                 1 1 898 901 -23330=4,3,20,20,96 0=96 1=1 5=1 6=9216 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_406                 1 1 901 904 -23330=4,3,20,20,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              Conv_408                 1 1 904 907 -23330=4,3,20,20,96 0=96 1=1 5=1 6=9216 9=2 -23310=1,1.000000e-01\nConvolution              Conv_410                 1 1 907 908 -23330=4,3,20,20,112 0=112 1=1 5=1 6=10752\nSlice                    Split_411                1 2 908 909 910 -23330=8,3,20,20,80,3,20,20,32 -23300=2,80,-233\nSigmoid                  Sigmoid_412              1 1 909 911 -23330=4,3,20,20,80\nReshape                  Reshape_414              1 1 911 913 -23330=4,2,400,80,1 0=-1 1=80\nPermute                  Transpose_415            1 1 913 cls_pred_stride_16 -23330=4,2,80,400,1 0=1\nReshape                  Reshape_417              1 1 910 916 -23330=4,2,400,32,1 0=-1 1=32\nPermute                  Transpose_418            1 1 916 dis_pred_stride_16 -23330=4,2,32,400,1 0=1\nConvolutionDepthWise     Conv_419                 1 1 873 920 -23330=4,3,10,10,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              Conv_421                 1 1 920 923 -23330=4,3,10,10,96 0=96 1=1 5=1 
6=9216 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     Conv_423                 1 1 923 926 -23330=4,3,10,10,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              Conv_425                 1 1 926 929 -23330=4,3,10,10,96 0=96 1=1 5=1 6=9216 9=2 -23310=1,1.000000e-01\nConvolution              Conv_427                 1 1 929 930 -23330=4,3,10,10,112 0=112 1=1 5=1 6=10752\nSlice                    Split_428                1 2 930 931 932 -23330=8,3,10,10,80,3,10,10,32 -23300=2,80,-233\nSigmoid                  Sigmoid_429              1 1 931 933 -23330=4,3,10,10,80\nReshape                  Reshape_431              1 1 933 935 -23330=4,2,100,80,1 0=-1 1=80\nPermute                  Transpose_432            1 1 935 cls_pred_stride_32 -23330=4,2,80,100,1 0=1\nReshape                  Reshape_434              1 1 932 938 -23330=4,2,100,32,1 0=-1 1=32\nPermute                  Transpose_435            1 1 938 dis_pred_stride_32 -23330=4,2,32,100,1 0=1\nNoop                     Output                   6 1 cls_pred_stride_8 cls_pred_stride_16 cls_pred_stride_32 dis_pred_stride_8 dis_pred_stride_16 dis_pred_stride_32 output\n"
  },
  {
    "path": "benchmark/proxylessnasnet.param",
    "content": "7767517\n91 104\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              first-3x3-conv           1 1 data first-3x3-conv_relu -23330=4,3,112,112,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nConvolutionDepthWise     A0_dw                    1 1 first-3x3-conv_relu A0_dw_relu -23330=4,3,112,112,32 0=32 1=3 4=1 5=1 6=288 7=32 9=1\nConvolution              A0_linear                1 1 A0_dw_relu A0_linear_bn -23330=4,3,112,112,32 0=32 1=1 5=1 6=1024\nConvolution              B0_expand                1 1 A0_linear_bn B0_expand_relu -23330=4,3,112,112,48 0=48 1=1 5=1 6=1536 9=1\nConvolutionDepthWise     B0_dw                    1 1 B0_expand_relu B0_dw_relu -23330=4,3,56,56,48 0=48 1=5 3=2 4=2 5=1 6=1200 7=48 9=1\nConvolution              B0_linear                1 1 B0_dw_relu B0_linear_bn -23330=4,3,56,56,32 0=32 1=1 5=1 6=1536\nSplit                    splitncnn_0              1 2 B0_linear_bn B0_linear_bn_splitncnn_0 B0_linear_bn_splitncnn_1 -23330=8,3,56,56,32,3,56,56,32\nConvolution              B1_expand                1 1 B0_linear_bn_splitncnn_1 B1_expand_relu -23330=4,3,56,56,96 0=96 1=1 5=1 6=3072 9=1\nConvolutionDepthWise     B1_dw                    1 1 B1_expand_relu B1_dw_relu -23330=4,3,56,56,96 0=96 1=3 4=1 5=1 6=864 7=96 9=1\nConvolution              B1_linear                1 1 B1_dw_relu B1_linear_bn -23330=4,3,56,56,32 0=32 1=1 5=1 6=3072\nBinaryOp                 unknownncnn_0            2 1 B0_linear_bn_splitncnn_0 B1_linear_bn unknownncnn_0 -23330=4,3,56,56,32\nConvolution              C0_expand                1 1 unknownncnn_0 C0_expand_relu -23330=4,3,56,56,96 0=96 1=1 5=1 6=3072 9=1\nConvolutionDepthWise     C0_dw                    1 1 C0_expand_relu C0_dw_relu -23330=4,3,28,28,96 0=96 1=7 3=2 4=3 5=1 6=4704 7=96 9=1\nConvolution              C0_linear                1 1 C0_dw_relu C0_linear_bn -23330=4,3,28,28,40 0=40 1=1 5=1 6=3840\nSplit                    splitncnn_1  
            1 2 C0_linear_bn C0_linear_bn_splitncnn_0 C0_linear_bn_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              C1_expand                1 1 C0_linear_bn_splitncnn_1 C1_expand_relu -23330=4,3,28,28,120 0=120 1=1 5=1 6=4800 9=1\nConvolutionDepthWise     C1_dw                    1 1 C1_expand_relu C1_dw_relu -23330=4,3,28,28,120 0=120 1=3 4=1 5=1 6=1080 7=120 9=1\nConvolution              C1_linear                1 1 C1_dw_relu C1_linear_bn -23330=4,3,28,28,40 0=40 1=1 5=1 6=4800\nBinaryOp                 unknownncnn_1            2 1 C0_linear_bn_splitncnn_0 C1_linear_bn unknownncnn_1 -23330=4,3,28,28,40\nSplit                    splitncnn_2              1 2 unknownncnn_1 unknownncnn_1_splitncnn_0 unknownncnn_1_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              C2_expand                1 1 unknownncnn_1_splitncnn_1 C2_expand_relu -23330=4,3,28,28,120 0=120 1=1 5=1 6=4800 9=1\nConvolutionDepthWise     C2_dw                    1 1 C2_expand_relu C2_dw_relu -23330=4,3,28,28,120 0=120 1=5 4=2 5=1 6=3000 7=120 9=1\nConvolution              C2_linear                1 1 C2_dw_relu C2_linear_bn -23330=4,3,28,28,40 0=40 1=1 5=1 6=4800\nBinaryOp                 unknownncnn_2            2 1 unknownncnn_1_splitncnn_0 C2_linear_bn unknownncnn_2 -23330=4,3,28,28,40\nSplit                    splitncnn_3              1 2 unknownncnn_2 unknownncnn_2_splitncnn_0 unknownncnn_2_splitncnn_1 -23330=8,3,28,28,40,3,28,28,40\nConvolution              C3_expand                1 1 unknownncnn_2_splitncnn_1 C3_expand_relu -23330=4,3,28,28,120 0=120 1=1 5=1 6=4800 9=1\nConvolutionDepthWise     C3_dw                    1 1 C3_expand_relu C3_dw_relu -23330=4,3,28,28,120 0=120 1=5 4=2 5=1 6=3000 7=120 9=1\nConvolution              C3_linear                1 1 C3_dw_relu C3_linear_bn -23330=4,3,28,28,40 0=40 1=1 5=1 6=4800\nBinaryOp                 unknownncnn_3            2 1 unknownncnn_2_splitncnn_0 C3_linear_bn unknownncnn_3 
-23330=4,3,28,28,40\nConvolution              D0_expand                1 1 unknownncnn_3 D0_expand_relu -23330=4,3,28,28,240 0=240 1=1 5=1 6=9600 9=1\nConvolutionDepthWise     D0_dw                    1 1 D0_expand_relu D0_dw_relu -23330=4,3,14,14,240 0=240 1=7 3=2 4=3 5=1 6=11760 7=240 9=1\nConvolution              D0_linear                1 1 D0_dw_relu D0_linear_bn -23330=4,3,14,14,80 0=80 1=1 5=1 6=19200\nSplit                    splitncnn_4              1 2 D0_linear_bn D0_linear_bn_splitncnn_0 D0_linear_bn_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              D1_expand                1 1 D0_linear_bn_splitncnn_1 D1_expand_relu -23330=4,3,14,14,240 0=240 1=1 5=1 6=19200 9=1\nConvolutionDepthWise     D1_dw                    1 1 D1_expand_relu D1_dw_relu -23330=4,3,14,14,240 0=240 1=5 4=2 5=1 6=6000 7=240 9=1\nConvolution              D1_linear                1 1 D1_dw_relu D1_linear_bn -23330=4,3,14,14,80 0=80 1=1 5=1 6=19200\nBinaryOp                 unknownncnn_4            2 1 D0_linear_bn_splitncnn_0 D1_linear_bn unknownncnn_4 -23330=4,3,14,14,80\nSplit                    splitncnn_5              1 2 unknownncnn_4 unknownncnn_4_splitncnn_0 unknownncnn_4_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              D2_expand                1 1 unknownncnn_4_splitncnn_1 D2_expand_relu -23330=4,3,14,14,240 0=240 1=1 5=1 6=19200 9=1\nConvolutionDepthWise     D2_dw                    1 1 D2_expand_relu D2_dw_relu -23330=4,3,14,14,240 0=240 1=5 4=2 5=1 6=6000 7=240 9=1\nConvolution              D2_linear                1 1 D2_dw_relu D2_linear_bn -23330=4,3,14,14,80 0=80 1=1 5=1 6=19200\nBinaryOp                 unknownncnn_5            2 1 unknownncnn_4_splitncnn_0 D2_linear_bn unknownncnn_5 -23330=4,3,14,14,80\nSplit                    splitncnn_6              1 2 unknownncnn_5 unknownncnn_5_splitncnn_0 unknownncnn_5_splitncnn_1 -23330=8,3,14,14,80,3,14,14,80\nConvolution              D3_expand                1 1 unknownncnn_5_splitncnn_1 
D3_expand_relu -23330=4,3,14,14,240 0=240 1=1 5=1 6=19200 9=1\nConvolutionDepthWise     D3_dw                    1 1 D3_expand_relu D3_dw_relu -23330=4,3,14,14,240 0=240 1=5 4=2 5=1 6=6000 7=240 9=1\nConvolution              D3_linear                1 1 D3_dw_relu D3_linear_bn -23330=4,3,14,14,80 0=80 1=1 5=1 6=19200\nBinaryOp                 unknownncnn_6            2 1 unknownncnn_5_splitncnn_0 D3_linear_bn unknownncnn_6 -23330=4,3,14,14,80\nConvolution              E0_expand                1 1 unknownncnn_6 E0_expand_relu -23330=4,3,14,14,480 0=480 1=1 5=1 6=38400 9=1\nConvolutionDepthWise     E0_dw                    1 1 E0_expand_relu E0_dw_relu -23330=4,3,14,14,480 0=480 1=5 4=2 5=1 6=12000 7=480 9=1\nConvolution              E0_linear                1 1 E0_dw_relu E0_linear_bn -23330=4,3,14,14,96 0=96 1=1 5=1 6=46080\nSplit                    splitncnn_7              1 2 E0_linear_bn E0_linear_bn_splitncnn_0 E0_linear_bn_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              E1_expand                1 1 E0_linear_bn_splitncnn_1 E1_expand_relu -23330=4,3,14,14,288 0=288 1=1 5=1 6=27648 9=1\nConvolutionDepthWise     E1_dw                    1 1 E1_expand_relu E1_dw_relu -23330=4,3,14,14,288 0=288 1=5 4=2 5=1 6=7200 7=288 9=1\nConvolution              E1_linear                1 1 E1_dw_relu E1_linear_bn -23330=4,3,14,14,96 0=96 1=1 5=1 6=27648\nBinaryOp                 unknownncnn_7            2 1 E0_linear_bn_splitncnn_0 E1_linear_bn unknownncnn_7 -23330=4,3,14,14,96\nSplit                    splitncnn_8              1 2 unknownncnn_7 unknownncnn_7_splitncnn_0 unknownncnn_7_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              E2_expand                1 1 unknownncnn_7_splitncnn_1 E2_expand_relu -23330=4,3,14,14,288 0=288 1=1 5=1 6=27648 9=1\nConvolutionDepthWise     E2_dw                    1 1 E2_expand_relu E2_dw_relu -23330=4,3,14,14,288 0=288 1=5 4=2 5=1 6=7200 7=288 9=1\nConvolution              E2_linear                1 1 
E2_dw_relu E2_linear_bn -23330=4,3,14,14,96 0=96 1=1 5=1 6=27648\nBinaryOp                 unknownncnn_8            2 1 unknownncnn_7_splitncnn_0 E2_linear_bn unknownncnn_8 -23330=4,3,14,14,96\nSplit                    splitncnn_9              1 2 unknownncnn_8 unknownncnn_8_splitncnn_0 unknownncnn_8_splitncnn_1 -23330=8,3,14,14,96,3,14,14,96\nConvolution              E3_expand                1 1 unknownncnn_8_splitncnn_1 E3_expand_relu -23330=4,3,14,14,288 0=288 1=1 5=1 6=27648 9=1\nConvolutionDepthWise     E3_dw                    1 1 E3_expand_relu E3_dw_relu -23330=4,3,14,14,288 0=288 1=5 4=2 5=1 6=7200 7=288 9=1\nConvolution              E3_linear                1 1 E3_dw_relu E3_linear_bn -23330=4,3,14,14,96 0=96 1=1 5=1 6=27648\nBinaryOp                 unknownncnn_9            2 1 unknownncnn_8_splitncnn_0 E3_linear_bn unknownncnn_9 -23330=4,3,14,14,96\nConvolution              F0_expand                1 1 unknownncnn_9 F0_expand_relu -23330=4,3,14,14,576 0=576 1=1 5=1 6=55296 9=1\nConvolutionDepthWise     F0_dw                    1 1 F0_expand_relu F0_dw_relu -23330=4,3,7,7,576 0=576 1=7 3=2 4=3 5=1 6=28224 7=576 9=1\nConvolution              F0_linear                1 1 F0_dw_relu F0_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=110592\nSplit                    splitncnn_10             1 2 F0_linear_bn F0_linear_bn_splitncnn_0 F0_linear_bn_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              F1_expand                1 1 F0_linear_bn_splitncnn_1 F1_expand_relu -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184 9=1\nConvolutionDepthWise     F1_dw                    1 1 F1_expand_relu F1_dw_relu -23330=4,3,7,7,1152 0=1152 1=7 4=3 5=1 6=56448 7=1152 9=1\nConvolution              F1_linear                1 1 F1_dw_relu F1_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=221184\nBinaryOp                 unknownncnn_10           2 1 F0_linear_bn_splitncnn_0 F1_linear_bn unknownncnn_10 -23330=4,3,7,7,192\nSplit                    splitncnn_11             1 2 
unknownncnn_10 unknownncnn_10_splitncnn_0 unknownncnn_10_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              F2_expand                1 1 unknownncnn_10_splitncnn_1 F2_expand_relu -23330=4,3,7,7,576 0=576 1=1 5=1 6=110592 9=1\nConvolutionDepthWise     F2_dw                    1 1 F2_expand_relu F2_dw_relu -23330=4,3,7,7,576 0=576 1=7 4=3 5=1 6=28224 7=576 9=1\nConvolution              F2_linear                1 1 F2_dw_relu F2_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=110592\nBinaryOp                 unknownncnn_11           2 1 unknownncnn_10_splitncnn_0 F2_linear_bn unknownncnn_11 -23330=4,3,7,7,192\nSplit                    splitncnn_12             1 2 unknownncnn_11 unknownncnn_11_splitncnn_0 unknownncnn_11_splitncnn_1 -23330=8,3,7,7,192,3,7,7,192\nConvolution              F3_expand                1 1 unknownncnn_11_splitncnn_1 F3_expand_relu -23330=4,3,7,7,576 0=576 1=1 5=1 6=110592 9=1\nConvolutionDepthWise     F3_dw                    1 1 F3_expand_relu F3_dw_relu -23330=4,3,7,7,576 0=576 1=7 4=3 5=1 6=28224 7=576 9=1\nConvolution              F3_linear                1 1 F3_dw_relu F3_linear_bn -23330=4,3,7,7,192 0=192 1=1 5=1 6=110592\nBinaryOp                 unknownncnn_12           2 1 unknownncnn_11_splitncnn_0 F3_linear_bn unknownncnn_12 -23330=4,3,7,7,192\nConvolution              G0_expand                1 1 unknownncnn_12 G0_expand_relu -23330=4,3,7,7,1152 0=1152 1=1 5=1 6=221184 9=1\nConvolutionDepthWise     G0_dw                    1 1 G0_expand_relu G0_dw_relu -23330=4,3,7,7,1152 0=1152 1=7 4=3 5=1 6=56448 7=1152 9=1\nConvolution              G0_linear                1 1 G0_dw_relu G0_linear_bn -23330=4,3,7,7,320 0=320 1=1 5=1 6=368640\nConvolution              last-1x1-conv            1 1 G0_linear_bn last-1x1-conv_relu -23330=4,3,7,7,1280 0=1280 1=1 5=1 6=409600 9=1\nPooling                  avgpool                  1 1 last-1x1-conv_relu flatten -23330=4,1,1280,1,1 0=1 1=7 4=1 5=1\nInnerProduct             fc                 
      1 1 flatten fc -23330=4,1,1000,1,1 0=1000 1=1 2=1280000\nSoftmax                  prob                     1 1 fc output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/regnety_400m.param",
    "content": "7767517\n185 217\nInput                    input.1                  0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              Conv_0                   1 1 data 387 -23330=4,3,112,112,32 0=32 1=3 3=2 4=1 5=1 6=864 9=1\nSplit                    splitncnn_0              1 2 387 387_splitncnn_0 387_splitncnn_1 -23330=8,3,112,112,32,3,112,112,32\nConvolution              Conv_3                   1 1 387_splitncnn_1 389 -23330=4,3,56,56,48 0=48 1=1 3=2 5=1 6=1536\nConvolution              Conv_5                   1 1 387_splitncnn_0 392 -23330=4,3,112,112,48 0=48 1=1 5=1 6=1536 9=1\nConvolutionDepthWise     Conv_8                   1 1 392 395 -23330=4,3,56,56,48 0=48 1=3 3=2 4=1 5=1 6=3456 7=6 9=1\nSplit                    splitncnn_1              1 2 395 395_splitncnn_0 395_splitncnn_1 -23330=8,3,56,56,48,3,56,56,48\nPooling                  GlobalAveragePool_11     1 1 395_splitncnn_1 396 -23330=4,1,48,1,1 0=1 4=1\nInnerProduct             Conv_12                  1 1 396 398 -23330=4,1,8,1,1 0=8 1=1 2=384 9=1\nInnerProduct             Conv_14                  1 1 398 400 -23330=4,1,48,1,1 0=48 1=1 2=384 9=4\nBinaryOp                 Mul_16                   2 1 395_splitncnn_0 400 401 -23330=4,3,56,56,48 0=2\nConvolution              Conv_17                  1 1 401 403 -23330=4,3,56,56,48 0=48 1=1 5=1 6=2304\nBinaryOp                 Add_19                   2 1 389 403 404 -23330=4,3,56,56,48\nReLU                     Relu_20                  1 1 404 405 -23330=4,3,56,56,48\nSplit                    splitncnn_2              1 2 405 405_splitncnn_0 405_splitncnn_1 -23330=8,3,56,56,48,3,56,56,48\nConvolution              Conv_21                  1 1 405_splitncnn_1 407 -23330=4,3,28,28,104 0=104 1=1 3=2 5=1 6=4992\nConvolution              Conv_23                  1 1 405_splitncnn_0 410 -23330=4,3,56,56,104 0=104 1=1 5=1 6=4992 9=1\nConvolutionDepthWise     Conv_26                  1 1 410 413 -23330=4,3,28,28,104 0=104 1=3 3=2 4=1 5=1 
6=7488 7=13 9=1\nSplit                    splitncnn_3              1 2 413 413_splitncnn_0 413_splitncnn_1 -23330=8,3,28,28,104,3,28,28,104\nPooling                  GlobalAveragePool_29     1 1 413_splitncnn_1 414 -23330=4,1,104,1,1 0=1 4=1\nInnerProduct             Conv_30                  1 1 414 416 -23330=4,1,12,1,1 0=12 1=1 2=1248 9=1\nInnerProduct             Conv_32                  1 1 416 418 -23330=4,1,104,1,1 0=104 1=1 2=1248 9=4\nBinaryOp                 Mul_34                   2 1 413_splitncnn_0 418 419 -23330=4,3,28,28,104 0=2\nConvolution              Conv_35                  1 1 419 421 -23330=4,3,28,28,104 0=104 1=1 5=1 6=10816\nBinaryOp                 Add_37                   2 1 407 421 422 -23330=4,3,28,28,104\nReLU                     Relu_38                  1 1 422 423 -23330=4,3,28,28,104\nSplit                    splitncnn_4              1 2 423 423_splitncnn_0 423_splitncnn_1 -23330=8,3,28,28,104,3,28,28,104\nConvolution              Conv_39                  1 1 423_splitncnn_1 426 -23330=4,3,28,28,104 0=104 1=1 5=1 6=10816 9=1\nConvolutionDepthWise     Conv_42                  1 1 426 429 -23330=4,3,28,28,104 0=104 1=3 4=1 5=1 6=7488 7=13 9=1\nSplit                    splitncnn_5              1 2 429 429_splitncnn_0 429_splitncnn_1 -23330=8,3,28,28,104,3,28,28,104\nPooling                  GlobalAveragePool_45     1 1 429_splitncnn_1 430 -23330=4,1,104,1,1 0=1 4=1\nInnerProduct             Conv_46                  1 1 430 432 -23330=4,1,26,1,1 0=26 1=1 2=2704 9=1\nInnerProduct             Conv_48                  1 1 432 434 -23330=4,1,104,1,1 0=104 1=1 2=2704 9=4\nBinaryOp                 Mul_50                   2 1 429_splitncnn_0 434 435 -23330=4,3,28,28,104 0=2\nConvolution              Conv_51                  1 1 435 437 -23330=4,3,28,28,104 0=104 1=1 5=1 6=10816\nBinaryOp                 Add_53                   2 1 423_splitncnn_0 437 438 -23330=4,3,28,28,104\nReLU                     Relu_54                  1 1 438 439 
-23330=4,3,28,28,104\nSplit                    splitncnn_6              1 2 439 439_splitncnn_0 439_splitncnn_1 -23330=8,3,28,28,104,3,28,28,104\nConvolution              Conv_55                  1 1 439_splitncnn_1 442 -23330=4,3,28,28,104 0=104 1=1 5=1 6=10816 9=1\nConvolutionDepthWise     Conv_58                  1 1 442 445 -23330=4,3,28,28,104 0=104 1=3 4=1 5=1 6=7488 7=13 9=1\nSplit                    splitncnn_7              1 2 445 445_splitncnn_0 445_splitncnn_1 -23330=8,3,28,28,104,3,28,28,104\nPooling                  GlobalAveragePool_61     1 1 445_splitncnn_1 446 -23330=4,1,104,1,1 0=1 4=1\nInnerProduct             Conv_62                  1 1 446 448 -23330=4,1,26,1,1 0=26 1=1 2=2704 9=1\nInnerProduct             Conv_64                  1 1 448 450 -23330=4,1,104,1,1 0=104 1=1 2=2704 9=4\nBinaryOp                 Mul_66                   2 1 445_splitncnn_0 450 451 -23330=4,3,28,28,104 0=2\nConvolution              Conv_67                  1 1 451 453 -23330=4,3,28,28,104 0=104 1=1 5=1 6=10816\nBinaryOp                 Add_69                   2 1 439_splitncnn_0 453 454 -23330=4,3,28,28,104\nReLU                     Relu_70                  1 1 454 455 -23330=4,3,28,28,104\nSplit                    splitncnn_8              1 2 455 455_splitncnn_0 455_splitncnn_1 -23330=8,3,28,28,104,3,28,28,104\nConvolution              Conv_71                  1 1 455_splitncnn_1 457 -23330=4,3,14,14,208 0=208 1=1 3=2 5=1 6=21632\nConvolution              Conv_73                  1 1 455_splitncnn_0 460 -23330=4,3,28,28,208 0=208 1=1 5=1 6=21632 9=1\nConvolutionDepthWise     Conv_76                  1 1 460 463 -23330=4,3,14,14,208 0=208 1=3 3=2 4=1 5=1 6=14976 7=26 9=1\nSplit                    splitncnn_9              1 2 463 463_splitncnn_0 463_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nPooling                  GlobalAveragePool_79     1 1 463_splitncnn_1 464 -23330=4,1,208,1,1 0=1 4=1\nInnerProduct             Conv_80                  1 1 464 466 
-23330=4,1,26,1,1 0=26 1=1 2=5408 9=1\nInnerProduct             Conv_82                  1 1 466 468 -23330=4,1,208,1,1 0=208 1=1 2=5408 9=4\nBinaryOp                 Mul_84                   2 1 463_splitncnn_0 468 469 -23330=4,3,14,14,208 0=2\nConvolution              Conv_85                  1 1 469 471 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264\nBinaryOp                 Add_87                   2 1 457 471 472 -23330=4,3,14,14,208\nReLU                     Relu_88                  1 1 472 473 -23330=4,3,14,14,208\nSplit                    splitncnn_10             1 2 473 473_splitncnn_0 473_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nConvolution              Conv_89                  1 1 473_splitncnn_1 476 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264 9=1\nConvolutionDepthWise     Conv_92                  1 1 476 479 -23330=4,3,14,14,208 0=208 1=3 4=1 5=1 6=14976 7=26 9=1\nSplit                    splitncnn_11             1 2 479 479_splitncnn_0 479_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nPooling                  GlobalAveragePool_95     1 1 479_splitncnn_1 480 -23330=4,1,208,1,1 0=1 4=1\nInnerProduct             Conv_96                  1 1 480 482 -23330=4,1,52,1,1 0=52 1=1 2=10816 9=1\nInnerProduct             Conv_98                  1 1 482 484 -23330=4,1,208,1,1 0=208 1=1 2=10816 9=4\nBinaryOp                 Mul_100                  2 1 479_splitncnn_0 484 485 -23330=4,3,14,14,208 0=2\nConvolution              Conv_101                 1 1 485 487 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264\nBinaryOp                 Add_103                  2 1 473_splitncnn_0 487 488 -23330=4,3,14,14,208\nReLU                     Relu_104                 1 1 488 489 -23330=4,3,14,14,208\nSplit                    splitncnn_12             1 2 489 489_splitncnn_0 489_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nConvolution              Conv_105                 1 1 489_splitncnn_1 492 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264 9=1\nConvolutionDepthWise     Conv_108      
           1 1 492 495 -23330=4,3,14,14,208 0=208 1=3 4=1 5=1 6=14976 7=26 9=1\nSplit                    splitncnn_13             1 2 495 495_splitncnn_0 495_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nPooling                  GlobalAveragePool_111    1 1 495_splitncnn_1 496 -23330=4,1,208,1,1 0=1 4=1\nInnerProduct             Conv_112                 1 1 496 498 -23330=4,1,52,1,1 0=52 1=1 2=10816 9=1\nInnerProduct             Conv_114                 1 1 498 500 -23330=4,1,208,1,1 0=208 1=1 2=10816 9=4\nBinaryOp                 Mul_116                  2 1 495_splitncnn_0 500 501 -23330=4,3,14,14,208 0=2\nConvolution              Conv_117                 1 1 501 503 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264\nBinaryOp                 Add_119                  2 1 489_splitncnn_0 503 504 -23330=4,3,14,14,208\nReLU                     Relu_120                 1 1 504 505 -23330=4,3,14,14,208\nSplit                    splitncnn_14             1 2 505 505_splitncnn_0 505_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nConvolution              Conv_121                 1 1 505_splitncnn_1 508 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264 9=1\nConvolutionDepthWise     Conv_124                 1 1 508 511 -23330=4,3,14,14,208 0=208 1=3 4=1 5=1 6=14976 7=26 9=1\nSplit                    splitncnn_15             1 2 511 511_splitncnn_0 511_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nPooling                  GlobalAveragePool_127    1 1 511_splitncnn_1 512 -23330=4,1,208,1,1 0=1 4=1\nInnerProduct             Conv_128                 1 1 512 514 -23330=4,1,52,1,1 0=52 1=1 2=10816 9=1\nInnerProduct             Conv_130                 1 1 514 516 -23330=4,1,208,1,1 0=208 1=1 2=10816 9=4\nBinaryOp                 Mul_132                  2 1 511_splitncnn_0 516 517 -23330=4,3,14,14,208 0=2\nConvolution              Conv_133                 1 1 517 519 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264\nBinaryOp                 Add_135                  2 1 505_splitncnn_0 519 520 
-23330=4,3,14,14,208\nReLU                     Relu_136                 1 1 520 521 -23330=4,3,14,14,208\nSplit                    splitncnn_16             1 2 521 521_splitncnn_0 521_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nConvolution              Conv_137                 1 1 521_splitncnn_1 524 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264 9=1\nConvolutionDepthWise     Conv_140                 1 1 524 527 -23330=4,3,14,14,208 0=208 1=3 4=1 5=1 6=14976 7=26 9=1\nSplit                    splitncnn_17             1 2 527 527_splitncnn_0 527_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nPooling                  GlobalAveragePool_143    1 1 527_splitncnn_1 528 -23330=4,1,208,1,1 0=1 4=1\nInnerProduct             Conv_144                 1 1 528 530 -23330=4,1,52,1,1 0=52 1=1 2=10816 9=1\nInnerProduct             Conv_146                 1 1 530 532 -23330=4,1,208,1,1 0=208 1=1 2=10816 9=4\nBinaryOp                 Mul_148                  2 1 527_splitncnn_0 532 533 -23330=4,3,14,14,208 0=2\nConvolution              Conv_149                 1 1 533 535 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264\nBinaryOp                 Add_151                  2 1 521_splitncnn_0 535 536 -23330=4,3,14,14,208\nReLU                     Relu_152                 1 1 536 537 -23330=4,3,14,14,208\nSplit                    splitncnn_18             1 2 537 537_splitncnn_0 537_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nConvolution              Conv_153                 1 1 537_splitncnn_1 540 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264 9=1\nConvolutionDepthWise     Conv_156                 1 1 540 543 -23330=4,3,14,14,208 0=208 1=3 4=1 5=1 6=14976 7=26 9=1\nSplit                    splitncnn_19             1 2 543 543_splitncnn_0 543_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nPooling                  GlobalAveragePool_159    1 1 543_splitncnn_1 544 -23330=4,1,208,1,1 0=1 4=1\nInnerProduct             Conv_160                 1 1 544 546 -23330=4,1,52,1,1 0=52 1=1 2=10816 
9=1\nInnerProduct             Conv_162                 1 1 546 548 -23330=4,1,208,1,1 0=208 1=1 2=10816 9=4\nBinaryOp                 Mul_164                  2 1 543_splitncnn_0 548 549 -23330=4,3,14,14,208 0=2\nConvolution              Conv_165                 1 1 549 551 -23330=4,3,14,14,208 0=208 1=1 5=1 6=43264\nBinaryOp                 Add_167                  2 1 537_splitncnn_0 551 552 -23330=4,3,14,14,208\nReLU                     Relu_168                 1 1 552 553 -23330=4,3,14,14,208\nSplit                    splitncnn_20             1 2 553 553_splitncnn_0 553_splitncnn_1 -23330=8,3,14,14,208,3,14,14,208\nConvolution              Conv_169                 1 1 553_splitncnn_1 555 -23330=4,3,7,7,440 0=440 1=1 3=2 5=1 6=91520\nConvolution              Conv_171                 1 1 553_splitncnn_0 558 -23330=4,3,14,14,440 0=440 1=1 5=1 6=91520 9=1\nConvolutionDepthWise     Conv_174                 1 1 558 561 -23330=4,3,7,7,440 0=440 1=3 3=2 4=1 5=1 6=31680 7=55 9=1\nSplit                    splitncnn_21             1 2 561 561_splitncnn_0 561_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nPooling                  GlobalAveragePool_177    1 1 561_splitncnn_1 562 -23330=4,1,440,1,1 0=1 4=1\nInnerProduct             Conv_178                 1 1 562 564 -23330=4,1,52,1,1 0=52 1=1 2=22880 9=1\nInnerProduct             Conv_180                 1 1 564 566 -23330=4,1,440,1,1 0=440 1=1 2=22880 9=4\nBinaryOp                 Mul_182                  2 1 561_splitncnn_0 566 567 -23330=4,3,7,7,440 0=2\nConvolution              Conv_183                 1 1 567 569 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600\nBinaryOp                 Add_185                  2 1 555 569 570 -23330=4,3,7,7,440\nReLU                     Relu_186                 1 1 570 571 -23330=4,3,7,7,440\nSplit                    splitncnn_22             1 2 571 571_splitncnn_0 571_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nConvolution              Conv_187                 1 1 571_splitncnn_1 574 
-23330=4,3,7,7,440 0=440 1=1 5=1 6=193600 9=1\nConvolutionDepthWise     Conv_190                 1 1 574 577 -23330=4,3,7,7,440 0=440 1=3 4=1 5=1 6=31680 7=55 9=1\nSplit                    splitncnn_23             1 2 577 577_splitncnn_0 577_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nPooling                  GlobalAveragePool_193    1 1 577_splitncnn_1 578 -23330=4,1,440,1,1 0=1 4=1\nInnerProduct             Conv_194                 1 1 578 580 -23330=4,1,110,1,1 0=110 1=1 2=48400 9=1\nInnerProduct             Conv_196                 1 1 580 582 -23330=4,1,440,1,1 0=440 1=1 2=48400 9=4\nBinaryOp                 Mul_198                  2 1 577_splitncnn_0 582 583 -23330=4,3,7,7,440 0=2\nConvolution              Conv_199                 1 1 583 585 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600\nBinaryOp                 Add_201                  2 1 571_splitncnn_0 585 586 -23330=4,3,7,7,440\nReLU                     Relu_202                 1 1 586 587 -23330=4,3,7,7,440\nSplit                    splitncnn_24             1 2 587 587_splitncnn_0 587_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nConvolution              Conv_203                 1 1 587_splitncnn_1 590 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600 9=1\nConvolutionDepthWise     Conv_206                 1 1 590 593 -23330=4,3,7,7,440 0=440 1=3 4=1 5=1 6=31680 7=55 9=1\nSplit                    splitncnn_25             1 2 593 593_splitncnn_0 593_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nPooling                  GlobalAveragePool_209    1 1 593_splitncnn_1 594 -23330=4,1,440,1,1 0=1 4=1\nInnerProduct             Conv_210                 1 1 594 596 -23330=4,1,110,1,1 0=110 1=1 2=48400 9=1\nInnerProduct             Conv_212                 1 1 596 598 -23330=4,1,440,1,1 0=440 1=1 2=48400 9=4\nBinaryOp                 Mul_214                  2 1 593_splitncnn_0 598 599 -23330=4,3,7,7,440 0=2\nConvolution              Conv_215                 1 1 599 601 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600\nBinaryOp                
 Add_217                  2 1 587_splitncnn_0 601 602 -23330=4,3,7,7,440\nReLU                     Relu_218                 1 1 602 603 -23330=4,3,7,7,440\nSplit                    splitncnn_26             1 2 603 603_splitncnn_0 603_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nConvolution              Conv_219                 1 1 603_splitncnn_1 606 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600 9=1\nConvolutionDepthWise     Conv_222                 1 1 606 609 -23330=4,3,7,7,440 0=440 1=3 4=1 5=1 6=31680 7=55 9=1\nSplit                    splitncnn_27             1 2 609 609_splitncnn_0 609_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nPooling                  GlobalAveragePool_225    1 1 609_splitncnn_1 610 -23330=4,1,440,1,1 0=1 4=1\nInnerProduct             Conv_226                 1 1 610 612 -23330=4,1,110,1,1 0=110 1=1 2=48400 9=1\nInnerProduct             Conv_228                 1 1 612 614 -23330=4,1,440,1,1 0=440 1=1 2=48400 9=4\nBinaryOp                 Mul_230                  2 1 609_splitncnn_0 614 615 -23330=4,3,7,7,440 0=2\nConvolution              Conv_231                 1 1 615 617 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600\nBinaryOp                 Add_233                  2 1 603_splitncnn_0 617 618 -23330=4,3,7,7,440\nReLU                     Relu_234                 1 1 618 619 -23330=4,3,7,7,440\nSplit                    splitncnn_28             1 2 619 619_splitncnn_0 619_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nConvolution              Conv_235                 1 1 619_splitncnn_1 622 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600 9=1\nConvolutionDepthWise     Conv_238                 1 1 622 625 -23330=4,3,7,7,440 0=440 1=3 4=1 5=1 6=31680 7=55 9=1\nSplit                    splitncnn_29             1 2 625 625_splitncnn_0 625_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nPooling                  GlobalAveragePool_241    1 1 625_splitncnn_1 626 -23330=4,1,440,1,1 0=1 4=1\nInnerProduct             Conv_242                 1 1 626 628 -23330=4,1,110,1,1 0=110 
1=1 2=48400 9=1\nInnerProduct             Conv_244                 1 1 628 630 -23330=4,1,440,1,1 0=440 1=1 2=48400 9=4\nBinaryOp                 Mul_246                  2 1 625_splitncnn_0 630 631 -23330=4,3,7,7,440 0=2\nConvolution              Conv_247                 1 1 631 633 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600\nBinaryOp                 Add_249                  2 1 619_splitncnn_0 633 634 -23330=4,3,7,7,440\nReLU                     Relu_250                 1 1 634 635 -23330=4,3,7,7,440\nSplit                    splitncnn_30             1 2 635 635_splitncnn_0 635_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nConvolution              Conv_251                 1 1 635_splitncnn_1 638 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600 9=1\nConvolutionDepthWise     Conv_254                 1 1 638 641 -23330=4,3,7,7,440 0=440 1=3 4=1 5=1 6=31680 7=55 9=1\nSplit                    splitncnn_31             1 2 641 641_splitncnn_0 641_splitncnn_1 -23330=8,3,7,7,440,3,7,7,440\nPooling                  GlobalAveragePool_257    1 1 641_splitncnn_1 642 -23330=4,1,440,1,1 0=1 4=1\nInnerProduct             Conv_258                 1 1 642 644 -23330=4,1,110,1,1 0=110 1=1 2=48400 9=1\nInnerProduct             Conv_260                 1 1 644 646 -23330=4,1,440,1,1 0=440 1=1 2=48400 9=4\nBinaryOp                 Mul_262                  2 1 641_splitncnn_0 646 647 -23330=4,3,7,7,440 0=2\nConvolution              Conv_263                 1 1 647 649 -23330=4,3,7,7,440 0=440 1=1 5=1 6=193600\nBinaryOp                 Add_265                  2 1 635_splitncnn_0 649 650 -23330=4,3,7,7,440\nReLU                     Relu_266                 1 1 650 651 -23330=4,3,7,7,440\nPooling                  GlobalAveragePool_267    1 1 651 660 -23330=4,1,440,1,1 0=1 4=1\nInnerProduct             Gemm_274                 1 1 660 661 -23330=4,1,1000,1,1 0=1000 1=1 2=440000\nSoftmax                  prob                     1 1 661 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/resnet18.param",
    "content": "7767517\n50 58\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_conv1_relu -23330=4,3,112,112,64 0=64 1=7 3=2 4=3 5=1 6=9408 9=1\nPooling                  pool1                    1 1 conv1_conv1_relu pool1 -23330=4,3,56,56,64 1=3 2=2\nSplit                    splitncnn_0              1 2 pool1 pool1_splitncnn_0 pool1_splitncnn_1 -23330=8,3,56,56,64,3,56,56,64\nConvolution              res2a_branch1            1 1 pool1_splitncnn_1 res2a_branch1_scale2a_branch1 -23330=4,3,56,56,64 0=64 1=1 5=1 6=4096\nConvolution              res2a_branch2a           1 1 pool1_splitncnn_0 res2a_branch2a_res2a_branch2a_relu -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=36864 9=1\nConvolution              res2a_branch2b           1 1 res2a_branch2a_res2a_branch2a_relu res2a_branch2b_scale2a_branch2b -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=36864\nEltwise                  res2a                    2 1 res2a_branch1_scale2a_branch1 res2a_branch2b_scale2a_branch2b res2a -23330=4,3,56,56,64 0=1\nReLU                     res2a_relu               1 1 res2a res2a_res2a_relu -23330=4,3,56,56,64\nSplit                    splitncnn_1              1 2 res2a_res2a_relu res2a_res2a_relu_splitncnn_0 res2a_res2a_relu_splitncnn_1 -23330=8,3,56,56,64,3,56,56,64\nConvolution              res2b_branch2a           1 1 res2a_res2a_relu_splitncnn_1 res2b_branch2a_res2b_branch2a_relu -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=36864 9=1\nConvolution              res2b_branch2b           1 1 res2b_branch2a_res2b_branch2a_relu res2b_branch2b_scale2b_branch2b -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=36864\nEltwise                  res2b                    2 1 res2a_res2a_relu_splitncnn_0 res2b_branch2b_scale2b_branch2b res2b -23330=4,3,56,56,64 0=1\nReLU                     res2b_relu               1 1 res2b res2b_res2b_relu -23330=4,3,56,56,64\nSplit                    splitncnn_2              1 2 
res2b_res2b_relu res2b_res2b_relu_splitncnn_0 res2b_res2b_relu_splitncnn_1 -23330=8,3,56,56,64,3,56,56,64\nConvolution              res3a_branch1            1 1 res2b_res2b_relu_splitncnn_1 res3a_branch1_scale3a_branch1 -23330=4,3,28,28,128 0=128 1=1 3=2 5=1 6=8192\nConvolution              res3a_branch2a           1 1 res2b_res2b_relu_splitncnn_0 res3a_branch2a_res3a_branch2a_relu -23330=4,3,28,28,128 0=128 1=3 3=2 4=1 5=1 6=73728 9=1\nConvolution              res3a_branch2b           1 1 res3a_branch2a_res3a_branch2a_relu res3a_branch2b_scale3a_branch2b -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=147456\nEltwise                  res3a                    2 1 res3a_branch1_scale3a_branch1 res3a_branch2b_scale3a_branch2b res3a -23330=4,3,28,28,128 0=1\nReLU                     res3a_relu               1 1 res3a res3a_res3a_relu -23330=4,3,28,28,128\nSplit                    splitncnn_3              1 2 res3a_res3a_relu res3a_res3a_relu_splitncnn_0 res3a_res3a_relu_splitncnn_1 -23330=8,3,28,28,128,3,28,28,128\nConvolution              res3b_branch2a           1 1 res3a_res3a_relu_splitncnn_1 res3b_branch2a_res3b_branch2a_relu -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=147456 9=1\nConvolution              res3b_branch2b           1 1 res3b_branch2a_res3b_branch2a_relu res3b_branch2b_scale3b_branch2b -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=147456\nEltwise                  res3b                    2 1 res3a_res3a_relu_splitncnn_0 res3b_branch2b_scale3b_branch2b res3b -23330=4,3,28,28,128 0=1\nReLU                     res3b_relu               1 1 res3b res3b_res3b_relu -23330=4,3,28,28,128\nSplit                    splitncnn_4              1 2 res3b_res3b_relu res3b_res3b_relu_splitncnn_0 res3b_res3b_relu_splitncnn_1 -23330=8,3,28,28,128,3,28,28,128\nConvolution              res4a_branch1            1 1 res3b_res3b_relu_splitncnn_1 res4a_branch1_scale4a_branch1 -23330=4,3,14,14,256 0=256 1=1 3=2 5=1 6=32768\nConvolution              res4a_branch2a           1 1 
res3b_res3b_relu_splitncnn_0 res4a_branch2a_res4a_branch2a_relu -23330=4,3,14,14,256 0=256 1=3 3=2 4=1 5=1 6=294912 9=1\nConvolution              res4a_branch2b           1 1 res4a_branch2a_res4a_branch2a_relu res4a_branch2b_scale4a_branch2b -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824\nEltwise                  res4a                    2 1 res4a_branch1_scale4a_branch1 res4a_branch2b_scale4a_branch2b res4a -23330=4,3,14,14,256 0=1\nReLU                     res4a_relu               1 1 res4a res4a_res4a_relu -23330=4,3,14,14,256\nSplit                    splitncnn_5              1 2 res4a_res4a_relu res4a_res4a_relu_splitncnn_0 res4a_res4a_relu_splitncnn_1 -23330=8,3,14,14,256,3,14,14,256\nConvolution              res4b_branch2a           1 1 res4a_res4a_relu_splitncnn_1 res4b_branch2a_res4b_branch2a_relu -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              res4b_branch2b           1 1 res4b_branch2a_res4b_branch2a_relu res4b_branch2b_scale4b_branch2b -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824\nEltwise                  res4b                    2 1 res4a_res4a_relu_splitncnn_0 res4b_branch2b_scale4b_branch2b res4b -23330=4,3,14,14,256 0=1\nReLU                     res4b_relu               1 1 res4b res4b_res4b_relu -23330=4,3,14,14,256\nSplit                    splitncnn_6              1 2 res4b_res4b_relu res4b_res4b_relu_splitncnn_0 res4b_res4b_relu_splitncnn_1 -23330=8,3,14,14,256,3,14,14,256\nConvolution              res5a_branch1            1 1 res4b_res4b_relu_splitncnn_1 res5a_branch1_scale5a_branch1 -23330=4,3,7,7,512 0=512 1=1 3=2 5=1 6=131072\nConvolution              res5a_branch2a           1 1 res4b_res4b_relu_splitncnn_0 res5a_branch2a_res5a_branch2a_relu -23330=4,3,7,7,512 0=512 1=3 3=2 4=1 5=1 6=1179648 9=1\nConvolution              res5a_branch2b           1 1 res5a_branch2a_res5a_branch2a_relu res5a_branch2b_scale5a_branch2b -23330=4,3,7,7,512 0=512 1=3 4=1 5=1 6=2359296\nEltwise                  res5a                
    2 1 res5a_branch1_scale5a_branch1 res5a_branch2b_scale5a_branch2b res5a -23330=4,3,7,7,512 0=1\nReLU                     res5a_relu               1 1 res5a res5a_res5a_relu -23330=4,3,7,7,512\nSplit                    splitncnn_7              1 2 res5a_res5a_relu res5a_res5a_relu_splitncnn_0 res5a_res5a_relu_splitncnn_1 -23330=8,3,7,7,512,3,7,7,512\nConvolution              res5b_branch2a           1 1 res5a_res5a_relu_splitncnn_1 res5b_branch2a_res5b_branch2a_relu -23330=4,3,7,7,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nConvolution              res5b_branch2b           1 1 res5b_branch2a_res5b_branch2a_relu res5b_branch2b_scale5b_branch2b -23330=4,3,7,7,512 0=512 1=3 4=1 5=1 6=2359296\nEltwise                  res5b                    2 1 res5a_res5a_relu_splitncnn_0 res5b_branch2b_scale5b_branch2b res5b -23330=4,3,7,7,512 0=1\nReLU                     res5b_relu               1 1 res5b res5b_res5b_relu -23330=4,3,7,7,512\nPooling                  pool5                    1 1 res5b_res5b_relu pool5 -23330=4,3,1,1,512 0=1 1=7\nInnerProduct             fc1000                   1 1 pool5 fc1000 -23330=4,1,1000,1,1 0=1000 1=1 2=512000\nSoftmax                  prob                     1 1 fc1000 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/resnet18_int8.param",
    "content": "7767517\n50 58\nInput                    data                     0 1 data 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_conv1_relu 0=64 1=7 3=2 4=3 5=1 6=9408 8=2 9=1\nPooling                  pool1                    1 1 conv1_conv1_relu pool1 1=3 2=2\nSplit                    splitncnn_0              1 2 pool1 pool1_splitncnn_0 pool1_splitncnn_1\nConvolution              res2a_branch1            1 1 pool1_splitncnn_1 res2a_branch1_scale2a_branch1 0=64 1=1 5=1 6=4096 8=2\nConvolution              res2a_branch2a           1 1 pool1_splitncnn_0 res2a_branch2a_res2a_branch2a_relu 0=64 1=3 4=1 5=1 6=36864 8=102 9=1\nConvolution              res2a_branch2b           1 1 res2a_branch2a_res2a_branch2a_relu res2a_branch2b_scale2a_branch2b 0=64 1=3 4=1 5=1 6=36864 8=2\nEltwise                  res2a                    2 1 res2a_branch1_scale2a_branch1 res2a_branch2b_scale2a_branch2b res2a 0=1\nReLU                     res2a_relu               1 1 res2a res2a_res2a_relu\nSplit                    splitncnn_1              1 2 res2a_res2a_relu res2a_res2a_relu_splitncnn_0 res2a_res2a_relu_splitncnn_1\nConvolution              res2b_branch2a           1 1 res2a_res2a_relu_splitncnn_1 res2b_branch2a_res2b_branch2a_relu 0=64 1=3 4=1 5=1 6=36864 8=102 9=1\nConvolution              res2b_branch2b           1 1 res2b_branch2a_res2b_branch2a_relu res2b_branch2b_scale2b_branch2b 0=64 1=3 4=1 5=1 6=36864 8=2\nEltwise                  res2b                    2 1 res2a_res2a_relu_splitncnn_0 res2b_branch2b_scale2b_branch2b res2b 0=1\nReLU                     res2b_relu               1 1 res2b res2b_res2b_relu\nSplit                    splitncnn_2              1 2 res2b_res2b_relu res2b_res2b_relu_splitncnn_0 res2b_res2b_relu_splitncnn_1\nConvolution              res3a_branch1            1 1 res2b_res2b_relu_splitncnn_1 res3a_branch1_scale3a_branch1 0=128 1=1 3=2 5=1 6=8192 8=2\nConvolution              res3a_branch2a           1 1 
res2b_res2b_relu_splitncnn_0 res3a_branch2a_res3a_branch2a_relu 0=128 1=3 3=2 4=1 5=1 6=73728 8=102 9=1\nConvolution              res3a_branch2b           1 1 res3a_branch2a_res3a_branch2a_relu res3a_branch2b_scale3a_branch2b 0=128 1=3 4=1 5=1 6=147456 8=2\nEltwise                  res3a                    2 1 res3a_branch1_scale3a_branch1 res3a_branch2b_scale3a_branch2b res3a 0=1\nReLU                     res3a_relu               1 1 res3a res3a_res3a_relu\nSplit                    splitncnn_3              1 2 res3a_res3a_relu res3a_res3a_relu_splitncnn_0 res3a_res3a_relu_splitncnn_1\nConvolution              res3b_branch2a           1 1 res3a_res3a_relu_splitncnn_1 res3b_branch2a_res3b_branch2a_relu 0=128 1=3 4=1 5=1 6=147456 8=102 9=1\nConvolution              res3b_branch2b           1 1 res3b_branch2a_res3b_branch2a_relu res3b_branch2b_scale3b_branch2b 0=128 1=3 4=1 5=1 6=147456 8=2\nEltwise                  res3b                    2 1 res3a_res3a_relu_splitncnn_0 res3b_branch2b_scale3b_branch2b res3b 0=1\nReLU                     res3b_relu               1 1 res3b res3b_res3b_relu\nSplit                    splitncnn_4              1 2 res3b_res3b_relu res3b_res3b_relu_splitncnn_0 res3b_res3b_relu_splitncnn_1\nConvolution              res4a_branch1            1 1 res3b_res3b_relu_splitncnn_1 res4a_branch1_scale4a_branch1 0=256 1=1 3=2 5=1 6=32768 8=2\nConvolution              res4a_branch2a           1 1 res3b_res3b_relu_splitncnn_0 res4a_branch2a_res4a_branch2a_relu 0=256 1=3 3=2 4=1 5=1 6=294912 8=102 9=1\nConvolution              res4a_branch2b           1 1 res4a_branch2a_res4a_branch2a_relu res4a_branch2b_scale4a_branch2b 0=256 1=3 4=1 5=1 6=589824 8=2\nEltwise                  res4a                    2 1 res4a_branch1_scale4a_branch1 res4a_branch2b_scale4a_branch2b res4a 0=1\nReLU                     res4a_relu               1 1 res4a res4a_res4a_relu\nSplit                    splitncnn_5              1 2 res4a_res4a_relu res4a_res4a_relu_splitncnn_0 
res4a_res4a_relu_splitncnn_1\nConvolution              res4b_branch2a           1 1 res4a_res4a_relu_splitncnn_1 res4b_branch2a_res4b_branch2a_relu 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              res4b_branch2b           1 1 res4b_branch2a_res4b_branch2a_relu res4b_branch2b_scale4b_branch2b 0=256 1=3 4=1 5=1 6=589824 8=2\nEltwise                  res4b                    2 1 res4a_res4a_relu_splitncnn_0 res4b_branch2b_scale4b_branch2b res4b 0=1\nReLU                     res4b_relu               1 1 res4b res4b_res4b_relu\nSplit                    splitncnn_6              1 2 res4b_res4b_relu res4b_res4b_relu_splitncnn_0 res4b_res4b_relu_splitncnn_1\nConvolution              res5a_branch1            1 1 res4b_res4b_relu_splitncnn_1 res5a_branch1_scale5a_branch1 0=512 1=1 3=2 5=1 6=131072 8=2\nConvolution              res5a_branch2a           1 1 res4b_res4b_relu_splitncnn_0 res5a_branch2a_res5a_branch2a_relu 0=512 1=3 3=2 4=1 5=1 6=1179648 8=102 9=1\nConvolution              res5a_branch2b           1 1 res5a_branch2a_res5a_branch2a_relu res5a_branch2b_scale5a_branch2b 0=512 1=3 4=1 5=1 6=2359296 8=2\nEltwise                  res5a                    2 1 res5a_branch1_scale5a_branch1 res5a_branch2b_scale5a_branch2b res5a 0=1\nReLU                     res5a_relu               1 1 res5a res5a_res5a_relu\nSplit                    splitncnn_7              1 2 res5a_res5a_relu res5a_res5a_relu_splitncnn_0 res5a_res5a_relu_splitncnn_1\nConvolution              res5b_branch2a           1 1 res5a_res5a_relu_splitncnn_1 res5b_branch2a_res5b_branch2a_relu 0=512 1=3 4=1 5=1 6=2359296 8=102 9=1\nConvolution              res5b_branch2b           1 1 res5b_branch2a_res5b_branch2a_relu res5b_branch2b_scale5b_branch2b 0=512 1=3 4=1 5=1 6=2359296 8=2\nEltwise                  res5b                    2 1 res5a_res5a_relu_splitncnn_0 res5b_branch2b_scale5b_branch2b res5b 0=1\nReLU                     res5b_relu               1 1 res5b res5b_res5b_relu\nPooling          
        pool5                    1 1 res5b_res5b_relu pool5 0=1 1=7\nInnerProduct             fc1000                   1 1 pool5 fc1000 0=1000 1=1 2=512000\nSoftmax                  prob                     1 1 fc1000 output\n"
  },
  {
    "path": "benchmark/resnet50.param",
    "content": "7767517\n106 122\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_conv1_relu -23330=4,3,112,112,64 0=64 1=7 3=2 4=3 5=1 6=9408 9=1\nPooling                  pool1                    1 1 conv1_conv1_relu pool1 -23330=4,3,56,56,64 1=3 2=2\nSplit                    splitncnn_0              1 2 pool1 pool1_splitncnn_0 pool1_splitncnn_1 -23330=8,3,56,56,64,3,56,56,64\nConvolution              res2a_branch1            1 1 pool1_splitncnn_1 res2a_branch1_scale2a_branch1 -23330=4,3,56,56,256 0=256 1=1 5=1 6=16384\nConvolution              res2a_branch2a           1 1 pool1_splitncnn_0 res2a_branch2a_res2a_branch2a_relu -23330=4,3,56,56,64 0=64 1=1 5=1 6=4096 9=1\nConvolution              res2a_branch2b           1 1 res2a_branch2a_res2a_branch2a_relu res2a_branch2b_res2a_branch2b_relu -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=36864 9=1\nConvolution              res2a_branch2c           1 1 res2a_branch2b_res2a_branch2b_relu res2a_branch2c_scale2a_branch2c -23330=4,3,56,56,256 0=256 1=1 5=1 6=16384\nEltwise                  res2a                    2 1 res2a_branch1_scale2a_branch1 res2a_branch2c_scale2a_branch2c res2a -23330=4,3,56,56,256 0=1\nReLU                     res2a_relu               1 1 res2a res2a_res2a_relu -23330=4,3,56,56,256\nSplit                    splitncnn_1              1 2 res2a_res2a_relu res2a_res2a_relu_splitncnn_0 res2a_res2a_relu_splitncnn_1 -23330=8,3,56,56,256,3,56,56,256\nConvolution              res2b_branch2a           1 1 res2a_res2a_relu_splitncnn_1 res2b_branch2a_res2b_branch2a_relu -23330=4,3,56,56,64 0=64 1=1 5=1 6=16384 9=1\nConvolution              res2b_branch2b           1 1 res2b_branch2a_res2b_branch2a_relu res2b_branch2b_res2b_branch2b_relu -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=36864 9=1\nConvolution              res2b_branch2c           1 1 res2b_branch2b_res2b_branch2b_relu 
res2b_branch2c_scale2b_branch2c -23330=4,3,56,56,256 0=256 1=1 5=1 6=16384\nEltwise                  res2b                    2 1 res2a_res2a_relu_splitncnn_0 res2b_branch2c_scale2b_branch2c res2b -23330=4,3,56,56,256 0=1\nReLU                     res2b_relu               1 1 res2b res2b_res2b_relu -23330=4,3,56,56,256\nSplit                    splitncnn_2              1 2 res2b_res2b_relu res2b_res2b_relu_splitncnn_0 res2b_res2b_relu_splitncnn_1 -23330=8,3,56,56,256,3,56,56,256\nConvolution              res2c_branch2a           1 1 res2b_res2b_relu_splitncnn_1 res2c_branch2a_res2c_branch2a_relu -23330=4,3,56,56,64 0=64 1=1 5=1 6=16384 9=1\nConvolution              res2c_branch2b           1 1 res2c_branch2a_res2c_branch2a_relu res2c_branch2b_res2c_branch2b_relu -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=36864 9=1\nConvolution              res2c_branch2c           1 1 res2c_branch2b_res2c_branch2b_relu res2c_branch2c_scale2c_branch2c -23330=4,3,56,56,256 0=256 1=1 5=1 6=16384\nEltwise                  res2c                    2 1 res2b_res2b_relu_splitncnn_0 res2c_branch2c_scale2c_branch2c res2c -23330=4,3,56,56,256 0=1\nReLU                     res2c_relu               1 1 res2c res2c_res2c_relu -23330=4,3,56,56,256\nSplit                    splitncnn_3              1 2 res2c_res2c_relu res2c_res2c_relu_splitncnn_0 res2c_res2c_relu_splitncnn_1 -23330=8,3,56,56,256,3,56,56,256\nConvolution              res3a_branch1            1 1 res2c_res2c_relu_splitncnn_1 res3a_branch1_scale3a_branch1 -23330=4,3,28,28,512 0=512 1=1 3=2 5=1 6=131072\nConvolution              res3a_branch2a           1 1 res2c_res2c_relu_splitncnn_0 res3a_branch2a_res3a_branch2a_relu -23330=4,3,28,28,128 0=128 1=1 3=2 5=1 6=32768 9=1\nConvolution              res3a_branch2b           1 1 res3a_branch2a_res3a_branch2a_relu res3a_branch2b_res3a_branch2b_relu -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=147456 9=1\nConvolution              res3a_branch2c           1 1 res3a_branch2b_res3a_branch2b_relu 
res3a_branch2c_scale3a_branch2c -23330=4,3,28,28,512 0=512 1=1 5=1 6=65536\nEltwise                  res3a                    2 1 res3a_branch1_scale3a_branch1 res3a_branch2c_scale3a_branch2c res3a -23330=4,3,28,28,512 0=1\nReLU                     res3a_relu               1 1 res3a res3a_res3a_relu -23330=4,3,28,28,512\nSplit                    splitncnn_4              1 2 res3a_res3a_relu res3a_res3a_relu_splitncnn_0 res3a_res3a_relu_splitncnn_1 -23330=8,3,28,28,512,3,28,28,512\nConvolution              res3b_branch2a           1 1 res3a_res3a_relu_splitncnn_1 res3b_branch2a_res3b_branch2a_relu -23330=4,3,28,28,128 0=128 1=1 5=1 6=65536 9=1\nConvolution              res3b_branch2b           1 1 res3b_branch2a_res3b_branch2a_relu res3b_branch2b_res3b_branch2b_relu -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=147456 9=1\nConvolution              res3b_branch2c           1 1 res3b_branch2b_res3b_branch2b_relu res3b_branch2c_scale3b_branch2c -23330=4,3,28,28,512 0=512 1=1 5=1 6=65536\nEltwise                  res3b                    2 1 res3a_res3a_relu_splitncnn_0 res3b_branch2c_scale3b_branch2c res3b -23330=4,3,28,28,512 0=1\nReLU                     res3b_relu               1 1 res3b res3b_res3b_relu -23330=4,3,28,28,512\nSplit                    splitncnn_5              1 2 res3b_res3b_relu res3b_res3b_relu_splitncnn_0 res3b_res3b_relu_splitncnn_1 -23330=8,3,28,28,512,3,28,28,512\nConvolution              res3c_branch2a           1 1 res3b_res3b_relu_splitncnn_1 res3c_branch2a_res3c_branch2a_relu -23330=4,3,28,28,128 0=128 1=1 5=1 6=65536 9=1\nConvolution              res3c_branch2b           1 1 res3c_branch2a_res3c_branch2a_relu res3c_branch2b_res3c_branch2b_relu -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=147456 9=1\nConvolution              res3c_branch2c           1 1 res3c_branch2b_res3c_branch2b_relu res3c_branch2c_scale3c_branch2c -23330=4,3,28,28,512 0=512 1=1 5=1 6=65536\nEltwise                  res3c                    2 1 res3b_res3b_relu_splitncnn_0 
res3c_branch2c_scale3c_branch2c res3c -23330=4,3,28,28,512 0=1\nReLU                     res3c_relu               1 1 res3c res3c_res3c_relu -23330=4,3,28,28,512\nSplit                    splitncnn_6              1 2 res3c_res3c_relu res3c_res3c_relu_splitncnn_0 res3c_res3c_relu_splitncnn_1 -23330=8,3,28,28,512,3,28,28,512\nConvolution              res3d_branch2a           1 1 res3c_res3c_relu_splitncnn_1 res3d_branch2a_res3d_branch2a_relu -23330=4,3,28,28,128 0=128 1=1 5=1 6=65536 9=1\nConvolution              res3d_branch2b           1 1 res3d_branch2a_res3d_branch2a_relu res3d_branch2b_res3d_branch2b_relu -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=147456 9=1\nConvolution              res3d_branch2c           1 1 res3d_branch2b_res3d_branch2b_relu res3d_branch2c_scale3d_branch2c -23330=4,3,28,28,512 0=512 1=1 5=1 6=65536\nEltwise                  res3d                    2 1 res3c_res3c_relu_splitncnn_0 res3d_branch2c_scale3d_branch2c res3d -23330=4,3,28,28,512 0=1\nReLU                     res3d_relu               1 1 res3d res3d_res3d_relu -23330=4,3,28,28,512\nSplit                    splitncnn_7              1 2 res3d_res3d_relu res3d_res3d_relu_splitncnn_0 res3d_res3d_relu_splitncnn_1 -23330=8,3,28,28,512,3,28,28,512\nConvolution              res4a_branch1            1 1 res3d_res3d_relu_splitncnn_1 res4a_branch1_scale4a_branch1 -23330=4,3,14,14,1024 0=1024 1=1 3=2 5=1 6=524288\nConvolution              res4a_branch2a           1 1 res3d_res3d_relu_splitncnn_0 res4a_branch2a_res4a_branch2a_relu -23330=4,3,14,14,256 0=256 1=1 3=2 5=1 6=131072 9=1\nConvolution              res4a_branch2b           1 1 res4a_branch2a_res4a_branch2a_relu res4a_branch2b_res4a_branch2b_relu -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              res4a_branch2c           1 1 res4a_branch2b_res4a_branch2b_relu res4a_branch2c_scale4a_branch2c -23330=4,3,14,14,1024 0=1024 1=1 5=1 6=262144\nEltwise                  res4a                    2 1 
res4a_branch1_scale4a_branch1 res4a_branch2c_scale4a_branch2c res4a -23330=4,3,14,14,1024 0=1\nReLU                     res4a_relu               1 1 res4a res4a_res4a_relu -23330=4,3,14,14,1024\nSplit                    splitncnn_8              1 2 res4a_res4a_relu res4a_res4a_relu_splitncnn_0 res4a_res4a_relu_splitncnn_1 -23330=8,3,14,14,1024,3,14,14,1024\nConvolution              res4b_branch2a           1 1 res4a_res4a_relu_splitncnn_1 res4b_branch2a_res4b_branch2a_relu -23330=4,3,14,14,256 0=256 1=1 5=1 6=262144 9=1\nConvolution              res4b_branch2b           1 1 res4b_branch2a_res4b_branch2a_relu res4b_branch2b_res4b_branch2b_relu -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              res4b_branch2c           1 1 res4b_branch2b_res4b_branch2b_relu res4b_branch2c_scale4b_branch2c -23330=4,3,14,14,1024 0=1024 1=1 5=1 6=262144\nEltwise                  res4b                    2 1 res4a_res4a_relu_splitncnn_0 res4b_branch2c_scale4b_branch2c res4b -23330=4,3,14,14,1024 0=1\nReLU                     res4b_relu               1 1 res4b res4b_res4b_relu -23330=4,3,14,14,1024\nSplit                    splitncnn_9              1 2 res4b_res4b_relu res4b_res4b_relu_splitncnn_0 res4b_res4b_relu_splitncnn_1 -23330=8,3,14,14,1024,3,14,14,1024\nConvolution              res4c_branch2a           1 1 res4b_res4b_relu_splitncnn_1 res4c_branch2a_res4c_branch2a_relu -23330=4,3,14,14,256 0=256 1=1 5=1 6=262144 9=1\nConvolution              res4c_branch2b           1 1 res4c_branch2a_res4c_branch2a_relu res4c_branch2b_res4c_branch2b_relu -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              res4c_branch2c           1 1 res4c_branch2b_res4c_branch2b_relu res4c_branch2c_scale4c_branch2c -23330=4,3,14,14,1024 0=1024 1=1 5=1 6=262144\nEltwise                  res4c                    2 1 res4b_res4b_relu_splitncnn_0 res4c_branch2c_scale4c_branch2c res4c -23330=4,3,14,14,1024 0=1\nReLU                     res4c_relu               1 1 
res4c res4c_res4c_relu -23330=4,3,14,14,1024\nSplit                    splitncnn_10             1 2 res4c_res4c_relu res4c_res4c_relu_splitncnn_0 res4c_res4c_relu_splitncnn_1 -23330=8,3,14,14,1024,3,14,14,1024\nConvolution              res4d_branch2a           1 1 res4c_res4c_relu_splitncnn_1 res4d_branch2a_res4d_branch2a_relu -23330=4,3,14,14,256 0=256 1=1 5=1 6=262144 9=1\nConvolution              res4d_branch2b           1 1 res4d_branch2a_res4d_branch2a_relu res4d_branch2b_res4d_branch2b_relu -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              res4d_branch2c           1 1 res4d_branch2b_res4d_branch2b_relu res4d_branch2c_scale4d_branch2c -23330=4,3,14,14,1024 0=1024 1=1 5=1 6=262144\nEltwise                  res4d                    2 1 res4c_res4c_relu_splitncnn_0 res4d_branch2c_scale4d_branch2c res4d -23330=4,3,14,14,1024 0=1\nReLU                     res4d_relu               1 1 res4d res4d_res4d_relu -23330=4,3,14,14,1024\nSplit                    splitncnn_11             1 2 res4d_res4d_relu res4d_res4d_relu_splitncnn_0 res4d_res4d_relu_splitncnn_1 -23330=8,3,14,14,1024,3,14,14,1024\nConvolution              res4e_branch2a           1 1 res4d_res4d_relu_splitncnn_1 res4e_branch2a_res4e_branch2a_relu -23330=4,3,14,14,256 0=256 1=1 5=1 6=262144 9=1\nConvolution              res4e_branch2b           1 1 res4e_branch2a_res4e_branch2a_relu res4e_branch2b_res4e_branch2b_relu -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              res4e_branch2c           1 1 res4e_branch2b_res4e_branch2b_relu res4e_branch2c_scale4e_branch2c -23330=4,3,14,14,1024 0=1024 1=1 5=1 6=262144\nEltwise                  res4e                    2 1 res4d_res4d_relu_splitncnn_0 res4e_branch2c_scale4e_branch2c res4e -23330=4,3,14,14,1024 0=1\nReLU                     res4e_relu               1 1 res4e res4e_res4e_relu -23330=4,3,14,14,1024\nSplit                    splitncnn_12             1 2 res4e_res4e_relu res4e_res4e_relu_splitncnn_0 
res4e_res4e_relu_splitncnn_1 -23330=8,3,14,14,1024,3,14,14,1024\nConvolution              res4f_branch2a           1 1 res4e_res4e_relu_splitncnn_1 res4f_branch2a_res4f_branch2a_relu -23330=4,3,14,14,256 0=256 1=1 5=1 6=262144 9=1\nConvolution              res4f_branch2b           1 1 res4f_branch2a_res4f_branch2a_relu res4f_branch2b_res4f_branch2b_relu -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              res4f_branch2c           1 1 res4f_branch2b_res4f_branch2b_relu res4f_branch2c_scale4f_branch2c -23330=4,3,14,14,1024 0=1024 1=1 5=1 6=262144\nEltwise                  res4f                    2 1 res4e_res4e_relu_splitncnn_0 res4f_branch2c_scale4f_branch2c res4f -23330=4,3,14,14,1024 0=1\nReLU                     res4f_relu               1 1 res4f res4f_res4f_relu -23330=4,3,14,14,1024\nSplit                    splitncnn_13             1 2 res4f_res4f_relu res4f_res4f_relu_splitncnn_0 res4f_res4f_relu_splitncnn_1 -23330=8,3,14,14,1024,3,14,14,1024\nConvolution              res5a_branch1            1 1 res4f_res4f_relu_splitncnn_1 res5a_branch1_scale5a_branch1 -23330=4,3,7,7,2048 0=2048 1=1 3=2 5=1 6=2097152\nConvolution              res5a_branch2a           1 1 res4f_res4f_relu_splitncnn_0 res5a_branch2a_res5a_branch2a_relu -23330=4,3,7,7,512 0=512 1=1 3=2 5=1 6=524288 9=1\nConvolution              res5a_branch2b           1 1 res5a_branch2a_res5a_branch2a_relu res5a_branch2b_res5a_branch2b_relu -23330=4,3,7,7,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nConvolution              res5a_branch2c           1 1 res5a_branch2b_res5a_branch2b_relu res5a_branch2c_scale5a_branch2c -23330=4,3,7,7,2048 0=2048 1=1 5=1 6=1048576\nEltwise                  res5a                    2 1 res5a_branch1_scale5a_branch1 res5a_branch2c_scale5a_branch2c res5a -23330=4,3,7,7,2048 0=1\nReLU                     res5a_relu               1 1 res5a res5a_res5a_relu -23330=4,3,7,7,2048\nSplit                    splitncnn_14             1 2 res5a_res5a_relu 
res5a_res5a_relu_splitncnn_0 res5a_res5a_relu_splitncnn_1 -23330=8,3,7,7,2048,3,7,7,2048\nConvolution              res5b_branch2a           1 1 res5a_res5a_relu_splitncnn_1 res5b_branch2a_res5b_branch2a_relu -23330=4,3,7,7,512 0=512 1=1 5=1 6=1048576 9=1\nConvolution              res5b_branch2b           1 1 res5b_branch2a_res5b_branch2a_relu res5b_branch2b_res5b_branch2b_relu -23330=4,3,7,7,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nConvolution              res5b_branch2c           1 1 res5b_branch2b_res5b_branch2b_relu res5b_branch2c_scale5b_branch2c -23330=4,3,7,7,2048 0=2048 1=1 5=1 6=1048576\nEltwise                  res5b                    2 1 res5a_res5a_relu_splitncnn_0 res5b_branch2c_scale5b_branch2c res5b -23330=4,3,7,7,2048 0=1\nReLU                     res5b_relu               1 1 res5b res5b_res5b_relu -23330=4,3,7,7,2048\nSplit                    splitncnn_15             1 2 res5b_res5b_relu res5b_res5b_relu_splitncnn_0 res5b_res5b_relu_splitncnn_1 -23330=8,3,7,7,2048,3,7,7,2048\nConvolution              res5c_branch2a           1 1 res5b_res5b_relu_splitncnn_1 res5c_branch2a_res5c_branch2a_relu -23330=4,3,7,7,512 0=512 1=1 5=1 6=1048576 9=1\nConvolution              res5c_branch2b           1 1 res5c_branch2a_res5c_branch2a_relu res5c_branch2b_res5c_branch2b_relu -23330=4,3,7,7,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nConvolution              res5c_branch2c           1 1 res5c_branch2b_res5c_branch2b_relu res5c_branch2c_scale5c_branch2c -23330=4,3,7,7,2048 0=2048 1=1 5=1 6=1048576\nEltwise                  res5c                    2 1 res5b_res5b_relu_splitncnn_0 res5c_branch2c_scale5c_branch2c res5c -23330=4,3,7,7,2048 0=1\nReLU                     res5c_relu               1 1 res5c res5c_res5c_relu -23330=4,3,7,7,2048\nPooling                  pool5                    1 1 res5c_res5c_relu pool5 -23330=4,3,1,1,2048 0=1 1=7\nInnerProduct             fc1000                   1 1 pool5 fc1000 -23330=4,1,1000,1,1 0=1000 1=1 2=2048000\nSoftmax                  
prob                     1 1 fc1000 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/resnet50_int8.param",
    "content": "7767517\n106 122\nInput                    data                     0 1 data 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_conv1_relu 0=64 1=7 3=2 4=3 5=1 6=9408 8=2 9=1\nPooling                  pool1                    1 1 conv1_conv1_relu pool1 1=3 2=2\nSplit                    splitncnn_0              1 2 pool1 pool1_splitncnn_0 pool1_splitncnn_1\nConvolution              res2a_branch1            1 1 pool1_splitncnn_1 res2a_branch1_scale2a_branch1 0=256 1=1 5=1 6=16384 8=2\nConvolution              res2a_branch2a           1 1 pool1_splitncnn_0 res2a_branch2a_res2a_branch2a_relu 0=64 1=1 5=1 6=4096 8=102 9=1\nConvolution              res2a_branch2b           1 1 res2a_branch2a_res2a_branch2a_relu res2a_branch2b_res2a_branch2b_relu 0=64 1=3 4=1 5=1 6=36864 8=102 9=1\nConvolution              res2a_branch2c           1 1 res2a_branch2b_res2a_branch2b_relu res2a_branch2c_scale2a_branch2c 0=256 1=1 5=1 6=16384 8=2\nEltwise                  res2a                    2 1 res2a_branch1_scale2a_branch1 res2a_branch2c_scale2a_branch2c res2a 0=1\nReLU                     res2a_relu               1 1 res2a res2a_res2a_relu\nSplit                    splitncnn_1              1 2 res2a_res2a_relu res2a_res2a_relu_splitncnn_0 res2a_res2a_relu_splitncnn_1\nConvolution              res2b_branch2a           1 1 res2a_res2a_relu_splitncnn_1 res2b_branch2a_res2b_branch2a_relu 0=64 1=1 5=1 6=16384 8=102 9=1\nConvolution              res2b_branch2b           1 1 res2b_branch2a_res2b_branch2a_relu res2b_branch2b_res2b_branch2b_relu 0=64 1=3 4=1 5=1 6=36864 8=102 9=1\nConvolution              res2b_branch2c           1 1 res2b_branch2b_res2b_branch2b_relu res2b_branch2c_scale2b_branch2c 0=256 1=1 5=1 6=16384 8=2\nEltwise                  res2b                    2 1 res2a_res2a_relu_splitncnn_0 res2b_branch2c_scale2b_branch2c res2b 0=1\nReLU                     res2b_relu               1 1 res2b res2b_res2b_relu\nSplit                 
   splitncnn_2              1 2 res2b_res2b_relu res2b_res2b_relu_splitncnn_0 res2b_res2b_relu_splitncnn_1\nConvolution              res2c_branch2a           1 1 res2b_res2b_relu_splitncnn_1 res2c_branch2a_res2c_branch2a_relu 0=64 1=1 5=1 6=16384 8=102 9=1\nConvolution              res2c_branch2b           1 1 res2c_branch2a_res2c_branch2a_relu res2c_branch2b_res2c_branch2b_relu 0=64 1=3 4=1 5=1 6=36864 8=102 9=1\nConvolution              res2c_branch2c           1 1 res2c_branch2b_res2c_branch2b_relu res2c_branch2c_scale2c_branch2c 0=256 1=1 5=1 6=16384 8=2\nEltwise                  res2c                    2 1 res2b_res2b_relu_splitncnn_0 res2c_branch2c_scale2c_branch2c res2c 0=1\nReLU                     res2c_relu               1 1 res2c res2c_res2c_relu\nSplit                    splitncnn_3              1 2 res2c_res2c_relu res2c_res2c_relu_splitncnn_0 res2c_res2c_relu_splitncnn_1\nConvolution              res3a_branch1            1 1 res2c_res2c_relu_splitncnn_1 res3a_branch1_scale3a_branch1 0=512 1=1 3=2 5=1 6=131072 8=2\nConvolution              res3a_branch2a           1 1 res2c_res2c_relu_splitncnn_0 res3a_branch2a_res3a_branch2a_relu 0=128 1=1 3=2 5=1 6=32768 8=102 9=1\nConvolution              res3a_branch2b           1 1 res3a_branch2a_res3a_branch2a_relu res3a_branch2b_res3a_branch2b_relu 0=128 1=3 4=1 5=1 6=147456 8=102 9=1\nConvolution              res3a_branch2c           1 1 res3a_branch2b_res3a_branch2b_relu res3a_branch2c_scale3a_branch2c 0=512 1=1 5=1 6=65536 8=2\nEltwise                  res3a                    2 1 res3a_branch1_scale3a_branch1 res3a_branch2c_scale3a_branch2c res3a 0=1\nReLU                     res3a_relu               1 1 res3a res3a_res3a_relu\nSplit                    splitncnn_4              1 2 res3a_res3a_relu res3a_res3a_relu_splitncnn_0 res3a_res3a_relu_splitncnn_1\nConvolution              res3b_branch2a           1 1 res3a_res3a_relu_splitncnn_1 res3b_branch2a_res3b_branch2a_relu 0=128 1=1 5=1 6=65536 8=102 
9=1\nConvolution              res3b_branch2b           1 1 res3b_branch2a_res3b_branch2a_relu res3b_branch2b_res3b_branch2b_relu 0=128 1=3 4=1 5=1 6=147456 8=102 9=1\nConvolution              res3b_branch2c           1 1 res3b_branch2b_res3b_branch2b_relu res3b_branch2c_scale3b_branch2c 0=512 1=1 5=1 6=65536 8=2\nEltwise                  res3b                    2 1 res3a_res3a_relu_splitncnn_0 res3b_branch2c_scale3b_branch2c res3b 0=1\nReLU                     res3b_relu               1 1 res3b res3b_res3b_relu\nSplit                    splitncnn_5              1 2 res3b_res3b_relu res3b_res3b_relu_splitncnn_0 res3b_res3b_relu_splitncnn_1\nConvolution              res3c_branch2a           1 1 res3b_res3b_relu_splitncnn_1 res3c_branch2a_res3c_branch2a_relu 0=128 1=1 5=1 6=65536 8=102 9=1\nConvolution              res3c_branch2b           1 1 res3c_branch2a_res3c_branch2a_relu res3c_branch2b_res3c_branch2b_relu 0=128 1=3 4=1 5=1 6=147456 8=102 9=1\nConvolution              res3c_branch2c           1 1 res3c_branch2b_res3c_branch2b_relu res3c_branch2c_scale3c_branch2c 0=512 1=1 5=1 6=65536 8=2\nEltwise                  res3c                    2 1 res3b_res3b_relu_splitncnn_0 res3c_branch2c_scale3c_branch2c res3c 0=1\nReLU                     res3c_relu               1 1 res3c res3c_res3c_relu\nSplit                    splitncnn_6              1 2 res3c_res3c_relu res3c_res3c_relu_splitncnn_0 res3c_res3c_relu_splitncnn_1\nConvolution              res3d_branch2a           1 1 res3c_res3c_relu_splitncnn_1 res3d_branch2a_res3d_branch2a_relu 0=128 1=1 5=1 6=65536 8=102 9=1\nConvolution              res3d_branch2b           1 1 res3d_branch2a_res3d_branch2a_relu res3d_branch2b_res3d_branch2b_relu 0=128 1=3 4=1 5=1 6=147456 8=102 9=1\nConvolution              res3d_branch2c           1 1 res3d_branch2b_res3d_branch2b_relu res3d_branch2c_scale3d_branch2c 0=512 1=1 5=1 6=65536 8=2\nEltwise                  res3d                    2 1 res3c_res3c_relu_splitncnn_0 
res3d_branch2c_scale3d_branch2c res3d 0=1\nReLU                     res3d_relu               1 1 res3d res3d_res3d_relu\nSplit                    splitncnn_7              1 2 res3d_res3d_relu res3d_res3d_relu_splitncnn_0 res3d_res3d_relu_splitncnn_1\nConvolution              res4a_branch1            1 1 res3d_res3d_relu_splitncnn_1 res4a_branch1_scale4a_branch1 0=1024 1=1 3=2 5=1 6=524288 8=2\nConvolution              res4a_branch2a           1 1 res3d_res3d_relu_splitncnn_0 res4a_branch2a_res4a_branch2a_relu 0=256 1=1 3=2 5=1 6=131072 8=102 9=1\nConvolution              res4a_branch2b           1 1 res4a_branch2a_res4a_branch2a_relu res4a_branch2b_res4a_branch2b_relu 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              res4a_branch2c           1 1 res4a_branch2b_res4a_branch2b_relu res4a_branch2c_scale4a_branch2c 0=1024 1=1 5=1 6=262144 8=2\nEltwise                  res4a                    2 1 res4a_branch1_scale4a_branch1 res4a_branch2c_scale4a_branch2c res4a 0=1\nReLU                     res4a_relu               1 1 res4a res4a_res4a_relu\nSplit                    splitncnn_8              1 2 res4a_res4a_relu res4a_res4a_relu_splitncnn_0 res4a_res4a_relu_splitncnn_1\nConvolution              res4b_branch2a           1 1 res4a_res4a_relu_splitncnn_1 res4b_branch2a_res4b_branch2a_relu 0=256 1=1 5=1 6=262144 8=102 9=1\nConvolution              res4b_branch2b           1 1 res4b_branch2a_res4b_branch2a_relu res4b_branch2b_res4b_branch2b_relu 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              res4b_branch2c           1 1 res4b_branch2b_res4b_branch2b_relu res4b_branch2c_scale4b_branch2c 0=1024 1=1 5=1 6=262144 8=2\nEltwise                  res4b                    2 1 res4a_res4a_relu_splitncnn_0 res4b_branch2c_scale4b_branch2c res4b 0=1\nReLU                     res4b_relu               1 1 res4b res4b_res4b_relu\nSplit                    splitncnn_9              1 2 res4b_res4b_relu res4b_res4b_relu_splitncnn_0 
res4b_res4b_relu_splitncnn_1\nConvolution              res4c_branch2a           1 1 res4b_res4b_relu_splitncnn_1 res4c_branch2a_res4c_branch2a_relu 0=256 1=1 5=1 6=262144 8=102 9=1\nConvolution              res4c_branch2b           1 1 res4c_branch2a_res4c_branch2a_relu res4c_branch2b_res4c_branch2b_relu 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              res4c_branch2c           1 1 res4c_branch2b_res4c_branch2b_relu res4c_branch2c_scale4c_branch2c 0=1024 1=1 5=1 6=262144 8=2\nEltwise                  res4c                    2 1 res4b_res4b_relu_splitncnn_0 res4c_branch2c_scale4c_branch2c res4c 0=1\nReLU                     res4c_relu               1 1 res4c res4c_res4c_relu\nSplit                    splitncnn_10             1 2 res4c_res4c_relu res4c_res4c_relu_splitncnn_0 res4c_res4c_relu_splitncnn_1\nConvolution              res4d_branch2a           1 1 res4c_res4c_relu_splitncnn_1 res4d_branch2a_res4d_branch2a_relu 0=256 1=1 5=1 6=262144 8=102 9=1\nConvolution              res4d_branch2b           1 1 res4d_branch2a_res4d_branch2a_relu res4d_branch2b_res4d_branch2b_relu 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              res4d_branch2c           1 1 res4d_branch2b_res4d_branch2b_relu res4d_branch2c_scale4d_branch2c 0=1024 1=1 5=1 6=262144 8=2\nEltwise                  res4d                    2 1 res4c_res4c_relu_splitncnn_0 res4d_branch2c_scale4d_branch2c res4d 0=1\nReLU                     res4d_relu               1 1 res4d res4d_res4d_relu\nSplit                    splitncnn_11             1 2 res4d_res4d_relu res4d_res4d_relu_splitncnn_0 res4d_res4d_relu_splitncnn_1\nConvolution              res4e_branch2a           1 1 res4d_res4d_relu_splitncnn_1 res4e_branch2a_res4e_branch2a_relu 0=256 1=1 5=1 6=262144 8=102 9=1\nConvolution              res4e_branch2b           1 1 res4e_branch2a_res4e_branch2a_relu res4e_branch2b_res4e_branch2b_relu 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              res4e_branch2c           1 1 
res4e_branch2b_res4e_branch2b_relu res4e_branch2c_scale4e_branch2c 0=1024 1=1 5=1 6=262144 8=2\nEltwise                  res4e                    2 1 res4d_res4d_relu_splitncnn_0 res4e_branch2c_scale4e_branch2c res4e 0=1\nReLU                     res4e_relu               1 1 res4e res4e_res4e_relu\nSplit                    splitncnn_12             1 2 res4e_res4e_relu res4e_res4e_relu_splitncnn_0 res4e_res4e_relu_splitncnn_1\nConvolution              res4f_branch2a           1 1 res4e_res4e_relu_splitncnn_1 res4f_branch2a_res4f_branch2a_relu 0=256 1=1 5=1 6=262144 8=102 9=1\nConvolution              res4f_branch2b           1 1 res4f_branch2a_res4f_branch2a_relu res4f_branch2b_res4f_branch2b_relu 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              res4f_branch2c           1 1 res4f_branch2b_res4f_branch2b_relu res4f_branch2c_scale4f_branch2c 0=1024 1=1 5=1 6=262144 8=2\nEltwise                  res4f                    2 1 res4e_res4e_relu_splitncnn_0 res4f_branch2c_scale4f_branch2c res4f 0=1\nReLU                     res4f_relu               1 1 res4f res4f_res4f_relu\nSplit                    splitncnn_13             1 2 res4f_res4f_relu res4f_res4f_relu_splitncnn_0 res4f_res4f_relu_splitncnn_1\nConvolution              res5a_branch1            1 1 res4f_res4f_relu_splitncnn_1 res5a_branch1_scale5a_branch1 0=2048 1=1 3=2 5=1 6=2097152 8=2\nConvolution              res5a_branch2a           1 1 res4f_res4f_relu_splitncnn_0 res5a_branch2a_res5a_branch2a_relu 0=512 1=1 3=2 5=1 6=524288 8=102 9=1\nConvolution              res5a_branch2b           1 1 res5a_branch2a_res5a_branch2a_relu res5a_branch2b_res5a_branch2b_relu 0=512 1=3 4=1 5=1 6=2359296 8=102 9=1\nConvolution              res5a_branch2c           1 1 res5a_branch2b_res5a_branch2b_relu res5a_branch2c_scale5a_branch2c 0=2048 1=1 5=1 6=1048576 8=2\nEltwise                  res5a                    2 1 res5a_branch1_scale5a_branch1 res5a_branch2c_scale5a_branch2c res5a 0=1\nReLU                     
res5a_relu               1 1 res5a res5a_res5a_relu\nSplit                    splitncnn_14             1 2 res5a_res5a_relu res5a_res5a_relu_splitncnn_0 res5a_res5a_relu_splitncnn_1\nConvolution              res5b_branch2a           1 1 res5a_res5a_relu_splitncnn_1 res5b_branch2a_res5b_branch2a_relu 0=512 1=1 5=1 6=1048576 8=102 9=1\nConvolution              res5b_branch2b           1 1 res5b_branch2a_res5b_branch2a_relu res5b_branch2b_res5b_branch2b_relu 0=512 1=3 4=1 5=1 6=2359296 8=102 9=1\nConvolution              res5b_branch2c           1 1 res5b_branch2b_res5b_branch2b_relu res5b_branch2c_scale5b_branch2c 0=2048 1=1 5=1 6=1048576 8=2\nEltwise                  res5b                    2 1 res5a_res5a_relu_splitncnn_0 res5b_branch2c_scale5b_branch2c res5b 0=1\nReLU                     res5b_relu               1 1 res5b res5b_res5b_relu\nSplit                    splitncnn_15             1 2 res5b_res5b_relu res5b_res5b_relu_splitncnn_0 res5b_res5b_relu_splitncnn_1\nConvolution              res5c_branch2a           1 1 res5b_res5b_relu_splitncnn_1 res5c_branch2a_res5c_branch2a_relu 0=512 1=1 5=1 6=1048576 8=102 9=1\nConvolution              res5c_branch2b           1 1 res5c_branch2a_res5c_branch2a_relu res5c_branch2b_res5c_branch2b_relu 0=512 1=3 4=1 5=1 6=2359296 8=102 9=1\nConvolution              res5c_branch2c           1 1 res5c_branch2b_res5c_branch2b_relu res5c_branch2c_scale5c_branch2c 0=2048 1=1 5=1 6=1048576 8=2\nEltwise                  res5c                    2 1 res5b_res5b_relu_splitncnn_0 res5c_branch2c_scale5c_branch2c res5c 0=1\nReLU                     res5c_relu               1 1 res5c res5c_res5c_relu\nPooling                  pool5                    1 1 res5c_res5c_relu pool5 0=1 1=7\nInnerProduct             fc1000                   1 1 pool5 fc1000 0=1000 1=1 2=2048000\nSoftmax                  prob                     1 1 fc1000 output\n"
  },
  {
    "path": "benchmark/shufflenet.param",
    "content": "7767517\n120 136\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_conv1_relu -23330=4,3,112,112,24 0=24 1=3 3=2 4=1 5=1 6=648 9=1\nPooling                  pool1                    1 1 conv1_conv1_relu pool1 -23330=4,3,56,56,24 1=3 2=2\nSplit                    splitncnn_0              1 2 pool1 pool1_splitncnn_0 pool1_splitncnn_1 -23330=8,3,56,56,24,3,56,56,24\nPooling                  resx1_match_conv         1 1 pool1_splitncnn_1 resx1_match_conv -23330=4,3,28,28,24 0=1 1=3 2=2\nConvolution              resx1_conv1              1 1 pool1_splitncnn_0 resx1_conv1_resx1_conv1_relu -23330=4,3,56,56,54 0=54 1=1 5=1 6=1296 9=1\nConvolutionDepthWise     resx1_conv2              1 1 resx1_conv1_resx1_conv1_relu resx1_conv2_resx1_conv2_scale -23330=4,3,28,28,54 0=54 1=3 3=2 4=1 5=1 6=486 7=54\nConvolutionDepthWise     resx1_conv3              1 1 resx1_conv2_resx1_conv2_scale resx1_conv3_resx1_conv3_scale -23330=4,3,28,28,216 0=216 1=1 5=1 6=3888 7=3\nConcat                   resx1_concat             2 1 resx1_match_conv resx1_conv3_resx1_conv3_scale resx1_concat -23330=4,3,28,28,240\nReLU                     resx1_concat_relu        1 1 resx1_concat resx1_concat_resx1_concat_relu -23330=4,3,28,28,240\nSplit                    splitncnn_1              1 2 resx1_concat_resx1_concat_relu resx1_concat_resx1_concat_relu_splitncnn_0 resx1_concat_resx1_concat_relu_splitncnn_1 -23330=8,3,28,28,240,3,28,28,240\nConvolutionDepthWise     resx2_conv1              1 1 resx1_concat_resx1_concat_relu_splitncnn_1 resx2_conv1_resx2_conv1_relu -23330=4,3,28,28,60 0=60 1=1 5=1 6=4800 7=3 9=1\nShuffleChannel           shuffle2                 1 1 resx2_conv1_resx2_conv1_relu shuffle2 -23330=4,3,28,28,60 0=3\nConvolutionDepthWise     resx2_conv2              1 1 shuffle2 resx2_conv2_resx2_conv2_scale -23330=4,3,28,28,60 0=60 1=3 4=1 5=1 6=540 
7=60\nConvolutionDepthWise     resx2_conv3              1 1 resx2_conv2_resx2_conv2_scale resx2_conv3_resx2_conv3_scale -23330=4,3,28,28,240 0=240 1=1 5=1 6=4800 7=3\nEltwise                  resx2_elewise            2 1 resx1_concat_resx1_concat_relu_splitncnn_0 resx2_conv3_resx2_conv3_scale resx2_elewise -23330=4,3,28,28,240 0=1\nReLU                     resx2_elewise_relu       1 1 resx2_elewise resx2_elewise_resx2_elewise_relu -23330=4,3,28,28,240\nSplit                    splitncnn_2              1 2 resx2_elewise_resx2_elewise_relu resx2_elewise_resx2_elewise_relu_splitncnn_0 resx2_elewise_resx2_elewise_relu_splitncnn_1 -23330=8,3,28,28,240,3,28,28,240\nConvolutionDepthWise     resx3_conv1              1 1 resx2_elewise_resx2_elewise_relu_splitncnn_1 resx3_conv1_resx3_conv1_relu -23330=4,3,28,28,60 0=60 1=1 5=1 6=4800 7=3 9=1\nShuffleChannel           shuffle3                 1 1 resx3_conv1_resx3_conv1_relu shuffle3 -23330=4,3,28,28,60 0=3\nConvolutionDepthWise     resx3_conv2              1 1 shuffle3 resx3_conv2_resx3_conv2_scale -23330=4,3,28,28,60 0=60 1=3 4=1 5=1 6=540 7=60\nConvolutionDepthWise     resx3_conv3              1 1 resx3_conv2_resx3_conv2_scale resx3_conv3_resx3_conv3_scale -23330=4,3,28,28,240 0=240 1=1 5=1 6=4800 7=3\nEltwise                  resx3_elewise            2 1 resx2_elewise_resx2_elewise_relu_splitncnn_0 resx3_conv3_resx3_conv3_scale resx3_elewise -23330=4,3,28,28,240 0=1\nReLU                     resx3_elewise_relu       1 1 resx3_elewise resx3_elewise_resx3_elewise_relu -23330=4,3,28,28,240\nSplit                    splitncnn_3              1 2 resx3_elewise_resx3_elewise_relu resx3_elewise_resx3_elewise_relu_splitncnn_0 resx3_elewise_resx3_elewise_relu_splitncnn_1 -23330=8,3,28,28,240,3,28,28,240\nConvolutionDepthWise     resx4_conv1              1 1 resx3_elewise_resx3_elewise_relu_splitncnn_1 resx4_conv1_resx4_conv1_relu -23330=4,3,28,28,60 0=60 1=1 5=1 6=4800 7=3 9=1\nShuffleChannel           shuffle4                 1 1 
resx4_conv1_resx4_conv1_relu shuffle4 -23330=4,3,28,28,60 0=3\nConvolutionDepthWise     resx4_conv2              1 1 shuffle4 resx4_conv2_resx4_conv2_scale -23330=4,3,28,28,60 0=60 1=3 4=1 5=1 6=540 7=60\nConvolutionDepthWise     resx4_conv3              1 1 resx4_conv2_resx4_conv2_scale resx4_conv3_resx4_conv3_scale -23330=4,3,28,28,240 0=240 1=1 5=1 6=4800 7=3\nEltwise                  resx4_elewise            2 1 resx3_elewise_resx3_elewise_relu_splitncnn_0 resx4_conv3_resx4_conv3_scale resx4_elewise -23330=4,3,28,28,240 0=1\nReLU                     resx4_elewise_relu       1 1 resx4_elewise resx4_elewise_resx4_elewise_relu -23330=4,3,28,28,240\nSplit                    splitncnn_4              1 2 resx4_elewise_resx4_elewise_relu resx4_elewise_resx4_elewise_relu_splitncnn_0 resx4_elewise_resx4_elewise_relu_splitncnn_1 -23330=8,3,28,28,240,3,28,28,240\nPooling                  resx5_match_conv         1 1 resx4_elewise_resx4_elewise_relu_splitncnn_1 resx5_match_conv -23330=4,3,14,14,240 0=1 1=3 2=2\nConvolutionDepthWise     resx5_conv1              1 1 resx4_elewise_resx4_elewise_relu_splitncnn_0 resx5_conv1_resx5_conv1_relu -23330=4,3,28,28,60 0=60 1=1 5=1 6=4800 7=3 9=1\nShuffleChannel           shuffle5                 1 1 resx5_conv1_resx5_conv1_relu shuffle5 -23330=4,3,28,28,60 0=3\nConvolutionDepthWise     resx5_conv2              1 1 shuffle5 resx5_conv2_resx5_conv2_scale -23330=4,3,14,14,60 0=60 1=3 3=2 4=1 5=1 6=540 7=60\nConvolutionDepthWise     resx5_conv3              1 1 resx5_conv2_resx5_conv2_scale resx5_conv3_resx5_conv3_scale -23330=4,3,14,14,240 0=240 1=1 5=1 6=4800 7=3\nConcat                   resx5_concat             2 1 resx5_match_conv resx5_conv3_resx5_conv3_scale resx5_concat -23330=4,3,14,14,480\nReLU                     resx5_concat_relu        1 1 resx5_concat resx5_concat_resx5_concat_relu -23330=4,3,14,14,480\nSplit                    splitncnn_5              1 2 resx5_concat_resx5_concat_relu 
resx5_concat_resx5_concat_relu_splitncnn_0 resx5_concat_resx5_concat_relu_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nConvolutionDepthWise     resx6_conv1              1 1 resx5_concat_resx5_concat_relu_splitncnn_1 resx6_conv1_resx6_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle6                 1 1 resx6_conv1_resx6_conv1_relu shuffle6 -23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx6_conv2              1 1 shuffle6 resx6_conv2_resx6_conv2_scale -23330=4,3,14,14,120 0=120 1=3 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx6_conv3              1 1 resx6_conv2_resx6_conv2_scale resx6_conv3_resx6_conv3_scale -23330=4,3,14,14,480 0=480 1=1 5=1 6=19200 7=3\nEltwise                  resx6_elewise            2 1 resx5_concat_resx5_concat_relu_splitncnn_0 resx6_conv3_resx6_conv3_scale resx6_elewise -23330=4,3,14,14,480 0=1\nReLU                     resx6_elewise_relu       1 1 resx6_elewise resx6_elewise_resx6_elewise_relu -23330=4,3,14,14,480\nSplit                    splitncnn_6              1 2 resx6_elewise_resx6_elewise_relu resx6_elewise_resx6_elewise_relu_splitncnn_0 resx6_elewise_resx6_elewise_relu_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nConvolutionDepthWise     resx7_conv1              1 1 resx6_elewise_resx6_elewise_relu_splitncnn_1 resx7_conv1_resx7_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle7                 1 1 resx7_conv1_resx7_conv1_relu shuffle7 -23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx7_conv2              1 1 shuffle7 resx7_conv2_resx7_conv2_scale -23330=4,3,14,14,120 0=120 1=3 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx7_conv3              1 1 resx7_conv2_resx7_conv2_scale resx7_conv3_resx7_conv3_scale -23330=4,3,14,14,480 0=480 1=1 5=1 6=19200 7=3\nEltwise                  resx7_elewise            2 1 resx6_elewise_resx6_elewise_relu_splitncnn_0 resx7_conv3_resx7_conv3_scale resx7_elewise -23330=4,3,14,14,480 
0=1\nReLU                     resx7_elewise_relu       1 1 resx7_elewise resx7_elewise_resx7_elewise_relu -23330=4,3,14,14,480\nSplit                    splitncnn_7              1 2 resx7_elewise_resx7_elewise_relu resx7_elewise_resx7_elewise_relu_splitncnn_0 resx7_elewise_resx7_elewise_relu_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nConvolutionDepthWise     resx8_conv1              1 1 resx7_elewise_resx7_elewise_relu_splitncnn_1 resx8_conv1_resx8_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle8                 1 1 resx8_conv1_resx8_conv1_relu shuffle8 -23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx8_conv2              1 1 shuffle8 resx8_conv2_resx8_conv2_scale -23330=4,3,14,14,120 0=120 1=3 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx8_conv3              1 1 resx8_conv2_resx8_conv2_scale resx8_conv3_resx8_conv3_scale -23330=4,3,14,14,480 0=480 1=1 5=1 6=19200 7=3\nEltwise                  resx8_elewise            2 1 resx7_elewise_resx7_elewise_relu_splitncnn_0 resx8_conv3_resx8_conv3_scale resx8_elewise -23330=4,3,14,14,480 0=1\nReLU                     resx8_elewise_relu       1 1 resx8_elewise resx8_elewise_resx8_elewise_relu -23330=4,3,14,14,480\nSplit                    splitncnn_8              1 2 resx8_elewise_resx8_elewise_relu resx8_elewise_resx8_elewise_relu_splitncnn_0 resx8_elewise_resx8_elewise_relu_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nConvolutionDepthWise     resx9_conv1              1 1 resx8_elewise_resx8_elewise_relu_splitncnn_1 resx9_conv1_resx9_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle9                 1 1 resx9_conv1_resx9_conv1_relu shuffle9 -23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx9_conv2              1 1 shuffle9 resx9_conv2_resx9_conv2_scale -23330=4,3,14,14,120 0=120 1=3 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx9_conv3              1 1 resx9_conv2_resx9_conv2_scale 
resx9_conv3_resx9_conv3_scale -23330=4,3,14,14,480 0=480 1=1 5=1 6=19200 7=3\nEltwise                  resx9_elewise            2 1 resx8_elewise_resx8_elewise_relu_splitncnn_0 resx9_conv3_resx9_conv3_scale resx9_elewise -23330=4,3,14,14,480 0=1\nReLU                     resx9_elewise_relu       1 1 resx9_elewise resx9_elewise_resx9_elewise_relu -23330=4,3,14,14,480\nSplit                    splitncnn_9              1 2 resx9_elewise_resx9_elewise_relu resx9_elewise_resx9_elewise_relu_splitncnn_0 resx9_elewise_resx9_elewise_relu_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nConvolutionDepthWise     resx10_conv1             1 1 resx9_elewise_resx9_elewise_relu_splitncnn_1 resx10_conv1_resx10_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle10                1 1 resx10_conv1_resx10_conv1_relu shuffle10 -23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx10_conv2             1 1 shuffle10 resx10_conv2_resx10_conv2_scale -23330=4,3,14,14,120 0=120 1=3 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx10_conv3             1 1 resx10_conv2_resx10_conv2_scale resx10_conv3_resx10_conv3_scale -23330=4,3,14,14,480 0=480 1=1 5=1 6=19200 7=3\nEltwise                  resx10_elewise           2 1 resx9_elewise_resx9_elewise_relu_splitncnn_0 resx10_conv3_resx10_conv3_scale resx10_elewise -23330=4,3,14,14,480 0=1\nReLU                     resx10_elewise_relu      1 1 resx10_elewise resx10_elewise_resx10_elewise_relu -23330=4,3,14,14,480\nSplit                    splitncnn_10             1 2 resx10_elewise_resx10_elewise_relu resx10_elewise_resx10_elewise_relu_splitncnn_0 resx10_elewise_resx10_elewise_relu_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nConvolutionDepthWise     resx11_conv1             1 1 resx10_elewise_resx10_elewise_relu_splitncnn_1 resx11_conv1_resx11_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle11                1 1 resx11_conv1_resx11_conv1_relu shuffle11 
-23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx11_conv2             1 1 shuffle11 resx11_conv2_resx11_conv2_scale -23330=4,3,14,14,120 0=120 1=3 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx11_conv3             1 1 resx11_conv2_resx11_conv2_scale resx11_conv3_resx11_conv3_scale -23330=4,3,14,14,480 0=480 1=1 5=1 6=19200 7=3\nEltwise                  resx11_elewise           2 1 resx10_elewise_resx10_elewise_relu_splitncnn_0 resx11_conv3_resx11_conv3_scale resx11_elewise -23330=4,3,14,14,480 0=1\nReLU                     resx11_elewise_relu      1 1 resx11_elewise resx11_elewise_resx11_elewise_relu -23330=4,3,14,14,480\nSplit                    splitncnn_11             1 2 resx11_elewise_resx11_elewise_relu resx11_elewise_resx11_elewise_relu_splitncnn_0 resx11_elewise_resx11_elewise_relu_splitncnn_1 -23330=8,3,14,14,480,3,14,14,480\nConvolutionDepthWise     resx12_conv1             1 1 resx11_elewise_resx11_elewise_relu_splitncnn_1 resx12_conv1_resx12_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle12                1 1 resx12_conv1_resx12_conv1_relu shuffle12 -23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx12_conv2             1 1 shuffle12 resx12_conv2_resx12_conv2_scale -23330=4,3,14,14,120 0=120 1=3 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx12_conv3             1 1 resx12_conv2_resx12_conv2_scale resx12_conv3_resx12_conv3_scale -23330=4,3,14,14,480 0=480 1=1 5=1 6=19200 7=3\nEltwise                  resx12_elewise           2 1 resx11_elewise_resx11_elewise_relu_splitncnn_0 resx12_conv3_resx12_conv3_scale resx12_elewise -23330=4,3,14,14,480 0=1\nReLU                     resx12_elewise_relu      1 1 resx12_elewise resx12_elewise_resx12_elewise_relu -23330=4,3,14,14,480\nSplit                    splitncnn_12             1 2 resx12_elewise_resx12_elewise_relu resx12_elewise_resx12_elewise_relu_splitncnn_0 resx12_elewise_resx12_elewise_relu_splitncnn_1 
-23330=8,3,14,14,480,3,14,14,480\nPooling                  resx13_match_conv        1 1 resx12_elewise_resx12_elewise_relu_splitncnn_1 resx13_match_conv -23330=4,3,7,7,480 0=1 1=3 2=2\nConvolutionDepthWise     resx13_conv1             1 1 resx12_elewise_resx12_elewise_relu_splitncnn_0 resx13_conv1_resx13_conv1_relu -23330=4,3,14,14,120 0=120 1=1 5=1 6=19200 7=3 9=1\nShuffleChannel           shuffle13                1 1 resx13_conv1_resx13_conv1_relu shuffle13 -23330=4,3,14,14,120 0=3\nConvolutionDepthWise     resx13_conv2             1 1 shuffle13 resx13_conv2_resx13_conv2_scale -23330=4,3,7,7,120 0=120 1=3 3=2 4=1 5=1 6=1080 7=120\nConvolutionDepthWise     resx13_conv3             1 1 resx13_conv2_resx13_conv2_scale resx13_conv3_resx13_conv3_scale -23330=4,3,7,7,480 0=480 1=1 5=1 6=19200 7=3\nConcat                   resx13_concat            2 1 resx13_match_conv resx13_conv3_resx13_conv3_scale resx13_concat -23330=4,3,7,7,960\nReLU                     resx13_concat_relu       1 1 resx13_concat resx13_concat_resx13_concat_relu -23330=4,3,7,7,960\nSplit                    splitncnn_13             1 2 resx13_concat_resx13_concat_relu resx13_concat_resx13_concat_relu_splitncnn_0 resx13_concat_resx13_concat_relu_splitncnn_1 -23330=8,3,7,7,960,3,7,7,960\nConvolutionDepthWise     resx14_conv1             1 1 resx13_concat_resx13_concat_relu_splitncnn_1 resx14_conv1_resx14_conv1_relu -23330=4,3,7,7,240 0=240 1=1 5=1 6=76800 7=3 9=1\nShuffleChannel           shuffle14                1 1 resx14_conv1_resx14_conv1_relu shuffle14 -23330=4,3,7,7,240 0=3\nConvolutionDepthWise     resx14_conv2             1 1 shuffle14 resx14_conv2_resx14_conv2_scale -23330=4,3,7,7,240 0=240 1=3 4=1 5=1 6=2160 7=240\nConvolutionDepthWise     resx14_conv3             1 1 resx14_conv2_resx14_conv2_scale resx14_conv3_resx14_conv3_scale -23330=4,3,7,7,960 0=960 1=1 5=1 6=76800 7=3\nEltwise                  resx14_elewise           2 1 resx13_concat_resx13_concat_relu_splitncnn_0 
resx14_conv3_resx14_conv3_scale resx14_elewise -23330=4,3,7,7,960 0=1\nReLU                     resx14_elewise_relu      1 1 resx14_elewise resx14_elewise_resx14_elewise_relu -23330=4,3,7,7,960\nSplit                    splitncnn_14             1 2 resx14_elewise_resx14_elewise_relu resx14_elewise_resx14_elewise_relu_splitncnn_0 resx14_elewise_resx14_elewise_relu_splitncnn_1 -23330=8,3,7,7,960,3,7,7,960\nConvolutionDepthWise     resx15_conv1             1 1 resx14_elewise_resx14_elewise_relu_splitncnn_1 resx15_conv1_resx15_conv1_relu -23330=4,3,7,7,240 0=240 1=1 5=1 6=76800 7=3 9=1\nShuffleChannel           shuffle15                1 1 resx15_conv1_resx15_conv1_relu shuffle15 -23330=4,3,7,7,240 0=3\nConvolutionDepthWise     resx15_conv2             1 1 shuffle15 resx15_conv2_resx15_conv2_scale -23330=4,3,7,7,240 0=240 1=3 4=1 5=1 6=2160 7=240\nConvolutionDepthWise     resx15_conv3             1 1 resx15_conv2_resx15_conv2_scale resx15_conv3_resx15_conv3_scale -23330=4,3,7,7,960 0=960 1=1 5=1 6=76800 7=3\nEltwise                  resx15_elewise           2 1 resx14_elewise_resx14_elewise_relu_splitncnn_0 resx15_conv3_resx15_conv3_scale resx15_elewise -23330=4,3,7,7,960 0=1\nReLU                     resx15_elewise_relu      1 1 resx15_elewise resx15_elewise_resx15_elewise_relu -23330=4,3,7,7,960\nSplit                    splitncnn_15             1 2 resx15_elewise_resx15_elewise_relu resx15_elewise_resx15_elewise_relu_splitncnn_0 resx15_elewise_resx15_elewise_relu_splitncnn_1 -23330=8,3,7,7,960,3,7,7,960\nConvolutionDepthWise     resx16_conv1             1 1 resx15_elewise_resx15_elewise_relu_splitncnn_1 resx16_conv1_resx16_conv1_relu -23330=4,3,7,7,240 0=240 1=1 5=1 6=76800 7=3 9=1\nShuffleChannel           shuffle16                1 1 resx16_conv1_resx16_conv1_relu shuffle16 -23330=4,3,7,7,240 0=3\nConvolutionDepthWise     resx16_conv2             1 1 shuffle16 resx16_conv2_resx16_conv2_scale -23330=4,3,7,7,240 0=240 1=3 4=1 5=1 6=2160 7=240\nConvolutionDepthWise   
  resx16_conv3             1 1 resx16_conv2_resx16_conv2_scale resx16_conv3_resx16_conv3_scale -23330=4,3,7,7,960 0=960 1=1 5=1 6=76800 7=3\nEltwise                  resx16_elewise           2 1 resx15_elewise_resx15_elewise_relu_splitncnn_0 resx16_conv3_resx16_conv3_scale resx16_elewise -23330=4,3,7,7,960 0=1\nReLU                     resx16_elewise_relu      1 1 resx16_elewise resx16_elewise_resx16_elewise_relu -23330=4,3,7,7,960\nPooling                  pool_ave                 1 1 resx16_elewise_resx16_elewise_relu pool_ave -23330=4,1,960,1,1 0=1 4=1\nInnerProduct             fc1000                   1 1 pool_ave fc1000 -23330=4,1,1000,1,1 0=1000 1=1 2=960000\nSoftmax                  prob                     1 1 fc1000 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/shufflenet_v2.param",
    "content": "7767517\n109 125\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1                    1 1 data conv1_conv1_relu -23330=4,3,112,112,24 0=24 1=3 3=2 4=1 5=1 6=648 9=1\nPooling                  pool1                    1 1 conv1_conv1_relu pool1 -23330=4,3,56,56,24 1=3 2=2\nSplit                    splitncnn_0              1 2 pool1 pool1_splitncnn_0 pool1_splitncnn_1 -23330=8,3,56,56,24,3,56,56,24\nConvolutionDepthWise     branch1_1_conv1          1 1 pool1_splitncnn_1 branch1_1_conv1_branch1_1_conv1_scale -23330=4,3,28,28,24 0=24 1=3 3=2 4=1 5=1 6=216 7=24\nConvolution              branch1_1_conv2          1 1 branch1_1_conv1_branch1_1_conv1_scale branch1_1_conv2_branch1_1_conv2_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=1392 9=1\nConvolution              branch1_2_conv1          1 1 pool1_splitncnn_0 branch1_2_conv1_branch1_2_conv1_relu -23330=4,3,56,56,58 0=58 1=1 5=1 6=1392 9=1\nConvolutionDepthWise     branch1_2_conv2          1 1 branch1_2_conv1_branch1_2_conv1_relu branch1_2_conv2_branch1_2_conv2_scale -23330=4,3,28,28,58 0=58 1=3 3=2 4=1 5=1 6=522 7=58\nConvolution              branch1_2_conv3          1 1 branch1_2_conv2_branch1_2_conv2_scale branch1_2_conv3_branch1_2_conv3_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=3364 9=1\nConcat                   concat1                  2 1 branch1_1_conv2_branch1_1_conv2_relu branch1_2_conv3_branch1_2_conv3_relu concat1 -23330=4,3,28,28,116\nShuffleChannel           shuffle1                 1 1 concat1 shuffle1 -23330=4,3,28,28,116 0=2\nSlice                    slice2                   1 2 shuffle1 branch2_1 branch2_2 -23330=8,3,28,28,58,3,28,28,58 -23300=2,58,-233\nConvolution              branch2_2_conv1          1 1 branch2_2 branch2_2_conv1_branch2_2_conv1_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=3364 9=1\nConvolutionDepthWise     branch2_2_conv2          1 1 branch2_2_conv1_branch2_2_conv1_relu 
branch2_2_conv2_branch2_2_conv2_scale -23330=4,3,28,28,58 0=58 1=3 4=1 5=1 6=522 7=58\nConvolution              branch2_2_conv3          1 1 branch2_2_conv2_branch2_2_conv2_scale branch2_2_conv3_branch2_2_conv3_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=3364 9=1\nConcat                   concat2                  2 1 branch2_1 branch2_2_conv3_branch2_2_conv3_relu concat2 -23330=4,3,28,28,116\nShuffleChannel           shuffle2                 1 1 concat2 shuffle2 -23330=4,3,28,28,116 0=2\nSlice                    slice3                   1 2 shuffle2 branch3_1 branch3_2 -23330=8,3,28,28,58,3,28,28,58 -23300=2,58,-233\nConvolution              branch3_2_conv1          1 1 branch3_2 branch3_2_conv1_branch3_2_conv1_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=3364 9=1\nConvolutionDepthWise     branch3_2_conv2          1 1 branch3_2_conv1_branch3_2_conv1_relu branch3_2_conv2_branch3_2_conv2_scale -23330=4,3,28,28,58 0=58 1=3 4=1 5=1 6=522 7=58\nConvolution              branch3_2_conv3          1 1 branch3_2_conv2_branch3_2_conv2_scale branch3_2_conv3_branch3_2_conv3_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=3364 9=1\nConcat                   concat3                  2 1 branch3_1 branch3_2_conv3_branch3_2_conv3_relu concat3 -23330=4,3,28,28,116\nShuffleChannel           shuffle3                 1 1 concat3 shuffle3 -23330=4,3,28,28,116 0=2\nSlice                    slice4                   1 2 shuffle3 branch4_1 branch4_2 -23330=8,3,28,28,58,3,28,28,58 -23300=2,58,-233\nConvolution              branch4_2_conv1          1 1 branch4_2 branch4_2_conv1_branch4_2_conv1_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=3364 9=1\nConvolutionDepthWise     branch4_2_conv2          1 1 branch4_2_conv1_branch4_2_conv1_relu branch4_2_conv2_branch4_2_conv2_scale -23330=4,3,28,28,58 0=58 1=3 4=1 5=1 6=522 7=58\nConvolution              branch4_2_conv3          1 1 branch4_2_conv2_branch4_2_conv2_scale branch4_2_conv3_branch4_2_conv3_relu -23330=4,3,28,28,58 0=58 1=1 5=1 6=3364 9=1\nConcat                 
  concat4                  2 1 branch4_1 branch4_2_conv3_branch4_2_conv3_relu concat4 -23330=4,3,28,28,116\nShuffleChannel           shuffle4                 1 1 concat4 shuffle4 -23330=4,3,28,28,116 0=2\nSplit                    splitncnn_1              1 2 shuffle4 shuffle4_splitncnn_0 shuffle4_splitncnn_1 -23330=8,3,28,28,116,3,28,28,116\nConvolutionDepthWise     branch5_1_conv1          1 1 shuffle4_splitncnn_1 branch5_1_conv1_branch5_1_conv1_scale -23330=4,3,14,14,116 0=116 1=3 3=2 4=1 5=1 6=1044 7=116\nConvolution              branch5_1_conv2          1 1 branch5_1_conv1_branch5_1_conv1_scale branch5_1_conv2_branch5_1_conv2_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolution              branch5_2_conv1          1 1 shuffle4_splitncnn_0 branch5_2_conv1_branch5_2_conv1_relu -23330=4,3,28,28,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch5_2_conv2          1 1 branch5_2_conv1_branch5_2_conv1_relu branch5_2_conv2_branch5_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 3=2 4=1 5=1 6=1044 7=116\nConvolution              branch5_2_conv3          1 1 branch5_2_conv2_branch5_2_conv2_scale branch5_2_conv3_branch5_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat5                  2 1 branch5_1_conv2_branch5_1_conv2_relu branch5_2_conv3_branch5_2_conv3_relu concat5 -23330=4,3,14,14,232\nShuffleChannel           shuffle5                 1 1 concat5 shuffle5 -23330=4,3,14,14,232 0=2\nSlice                    slice6                   1 2 shuffle5 branch6_1 branch6_2 -23330=8,3,14,14,116,3,14,14,116 -23300=2,116,-233\nConvolution              branch6_2_conv1          1 1 branch6_2 branch6_2_conv1_branch6_2_conv1_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch6_2_conv2          1 1 branch6_2_conv1_branch6_2_conv1_relu branch6_2_conv2_branch6_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              branch6_2_conv3          1 1 
branch6_2_conv2_branch6_2_conv2_scale branch6_2_conv3_branch6_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat6                  2 1 branch6_1 branch6_2_conv3_branch6_2_conv3_relu concat6 -23330=4,3,14,14,232\nShuffleChannel           shuffle6                 1 1 concat6 shuffle6 -23330=4,3,14,14,232 0=2\nSlice                    slice7                   1 2 shuffle6 branch7_1 branch7_2 -23330=8,3,14,14,116,3,14,14,116 -23300=2,116,-233\nConvolution              branch7_2_conv1          1 1 branch7_2 branch7_2_conv1_branch7_2_conv1_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch7_2_conv2          1 1 branch7_2_conv1_branch7_2_conv1_relu branch7_2_conv2_branch7_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              branch7_2_conv3          1 1 branch7_2_conv2_branch7_2_conv2_scale branch7_2_conv3_branch7_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat7                  2 1 branch7_1 branch7_2_conv3_branch7_2_conv3_relu concat7 -23330=4,3,14,14,232\nShuffleChannel           shuffle7                 1 1 concat7 shuffle7 -23330=4,3,14,14,232 0=2\nSlice                    slice8                   1 2 shuffle7 branch8_1 branch8_2 -23330=8,3,14,14,116,3,14,14,116 -23300=2,116,-233\nConvolution              branch8_2_conv1          1 1 branch8_2 branch8_2_conv1_branch8_2_conv1_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch8_2_conv2          1 1 branch8_2_conv1_branch8_2_conv1_relu branch8_2_conv2_branch8_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              branch8_2_conv3          1 1 branch8_2_conv2_branch8_2_conv2_scale branch8_2_conv3_branch8_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat8                  2 1 branch8_1 branch8_2_conv3_branch8_2_conv3_relu concat8 
-23330=4,3,14,14,232\nShuffleChannel           shuffle8                 1 1 concat8 shuffle8 -23330=4,3,14,14,232 0=2\nSlice                    slice9                   1 2 shuffle8 branch9_1 branch9_2 -23330=8,3,14,14,116,3,14,14,116 -23300=2,116,-233\nConvolution              branch9_2_conv1          1 1 branch9_2 branch9_2_conv1_branch9_2_conv1_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch9_2_conv2          1 1 branch9_2_conv1_branch9_2_conv1_relu branch9_2_conv2_branch9_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              branch9_2_conv3          1 1 branch9_2_conv2_branch9_2_conv2_scale branch9_2_conv3_branch9_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat9                  2 1 branch9_1 branch9_2_conv3_branch9_2_conv3_relu concat9 -23330=4,3,14,14,232\nShuffleChannel           shuffle9                 1 1 concat9 shuffle9 -23330=4,3,14,14,232 0=2\nSlice                    slice10                  1 2 shuffle9 branch10_1 branch10_2 -23330=8,3,14,14,116,3,14,14,116 -23300=2,116,-233\nConvolution              branch10_2_conv1         1 1 branch10_2 branch10_2_conv1_branch10_2_conv1_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch10_2_conv2         1 1 branch10_2_conv1_branch10_2_conv1_relu branch10_2_conv2_branch10_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              branch10_2_conv3         1 1 branch10_2_conv2_branch10_2_conv2_scale branch10_2_conv3_branch10_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat10                 2 1 branch10_1 branch10_2_conv3_branch10_2_conv3_relu concat10 -23330=4,3,14,14,232\nShuffleChannel           shuffle10                1 1 concat10 shuffle10 -23330=4,3,14,14,232 0=2\nSlice                    slice11                  1 2 shuffle10 branch11_1 branch11_2 -23330=8,3,14,14,116,3,14,14,116 
-23300=2,116,-233\nConvolution              branch11_2_conv1         1 1 branch11_2 branch11_2_conv1_branch11_2_conv1_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch11_2_conv2         1 1 branch11_2_conv1_branch11_2_conv1_relu branch11_2_conv2_branch11_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              branch11_2_conv3         1 1 branch11_2_conv2_branch11_2_conv2_scale branch11_2_conv3_branch11_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat11                 2 1 branch11_1 branch11_2_conv3_branch11_2_conv3_relu concat11 -23330=4,3,14,14,232\nShuffleChannel           shuffle11                1 1 concat11 shuffle11 -23330=4,3,14,14,232 0=2\nSlice                    slice12                  1 2 shuffle11 branch12_1 branch12_2 -23330=8,3,14,14,116,3,14,14,116 -23300=2,116,-233\nConvolution              branch12_2_conv1         1 1 branch12_2 branch12_2_conv1_branch12_2_conv1_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConvolutionDepthWise     branch12_2_conv2         1 1 branch12_2_conv1_branch12_2_conv1_relu branch12_2_conv2_branch12_2_conv2_scale -23330=4,3,14,14,116 0=116 1=3 4=1 5=1 6=1044 7=116\nConvolution              branch12_2_conv3         1 1 branch12_2_conv2_branch12_2_conv2_scale branch12_2_conv3_branch12_2_conv3_relu -23330=4,3,14,14,116 0=116 1=1 5=1 6=13456 9=1\nConcat                   concat12                 2 1 branch12_1 branch12_2_conv3_branch12_2_conv3_relu concat12 -23330=4,3,14,14,232\nShuffleChannel           shuffle12                1 1 concat12 shuffle12 -23330=4,3,14,14,232 0=2\nSplit                    splitncnn_2              1 2 shuffle12 shuffle12_splitncnn_0 shuffle12_splitncnn_1 -23330=8,3,14,14,232,3,14,14,232\nConvolutionDepthWise     branch13_1_conv1         1 1 shuffle12_splitncnn_1 branch13_1_conv1_branch13_1_conv1_scale -23330=4,3,7,7,232 0=232 1=3 3=2 4=1 5=1 6=2088 7=232\nConvolution             
 branch13_1_conv2         1 1 branch13_1_conv1_branch13_1_conv1_scale branch13_1_conv2_branch13_1_conv2_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConvolution              branch13_2_conv1         1 1 shuffle12_splitncnn_0 branch13_2_conv1_branch13_2_conv1_relu -23330=4,3,14,14,232 0=232 1=1 5=1 6=53824 9=1\nConvolutionDepthWise     branch13_2_conv2         1 1 branch13_2_conv1_branch13_2_conv1_relu branch13_2_conv2_branch13_2_conv2_scale -23330=4,3,7,7,232 0=232 1=3 3=2 4=1 5=1 6=2088 7=232\nConvolution              branch13_2_conv3         1 1 branch13_2_conv2_branch13_2_conv2_scale branch13_2_conv3_branch13_2_conv3_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConcat                   concat13                 2 1 branch13_1_conv2_branch13_1_conv2_relu branch13_2_conv3_branch13_2_conv3_relu concat13 -23330=4,3,7,7,464\nShuffleChannel           shuffle13                1 1 concat13 shuffle13 -23330=4,3,7,7,464 0=2\nSlice                    slice14                  1 2 shuffle13 branch14_1 branch14_2 -23330=8,3,7,7,232,3,7,7,232 -23300=2,232,-233\nConvolution              branch14_2_conv1         1 1 branch14_2 branch14_2_conv1_branch14_2_conv1_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConvolutionDepthWise     branch14_2_conv2         1 1 branch14_2_conv1_branch14_2_conv1_relu branch14_2_conv2_branch14_2_conv2_scale -23330=4,3,7,7,232 0=232 1=3 4=1 5=1 6=2088 7=232\nConvolution              branch14_2_conv3         1 1 branch14_2_conv2_branch14_2_conv2_scale branch14_2_conv3_branch14_2_conv3_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConcat                   concat14                 2 1 branch14_1 branch14_2_conv3_branch14_2_conv3_relu concat14 -23330=4,3,7,7,464\nShuffleChannel           shuffle14                1 1 concat14 shuffle14 -23330=4,3,7,7,464 0=2\nSlice                    slice15                  1 2 shuffle14 branch15_1 branch15_2 -23330=8,3,7,7,232,3,7,7,232 -23300=2,232,-233\nConvolution              branch15_2_conv1     
    1 1 branch15_2 branch15_2_conv1_branch15_2_conv1_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConvolutionDepthWise     branch15_2_conv2         1 1 branch15_2_conv1_branch15_2_conv1_relu branch15_2_conv2_branch15_2_conv2_scale -23330=4,3,7,7,232 0=232 1=3 4=1 5=1 6=2088 7=232\nConvolution              branch15_2_conv3         1 1 branch15_2_conv2_branch15_2_conv2_scale branch15_2_conv3_branch15_2_conv3_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConcat                   concat15                 2 1 branch15_1 branch15_2_conv3_branch15_2_conv3_relu concat15 -23330=4,3,7,7,464\nShuffleChannel           shuffle15                1 1 concat15 shuffle15 -23330=4,3,7,7,464 0=2\nSlice                    slice16                  1 2 shuffle15 branch16_1 branch16_2 -23330=8,3,7,7,232,3,7,7,232 -23300=2,232,-233\nConvolution              branch16_2_conv1         1 1 branch16_2 branch16_2_conv1_branch16_2_conv1_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConvolutionDepthWise     branch16_2_conv2         1 1 branch16_2_conv1_branch16_2_conv1_relu branch16_2_conv2_branch16_2_conv2_scale -23330=4,3,7,7,232 0=232 1=3 4=1 5=1 6=2088 7=232\nConvolution              branch16_2_conv3         1 1 branch16_2_conv2_branch16_2_conv2_scale branch16_2_conv3_branch16_2_conv3_relu -23330=4,3,7,7,232 0=232 1=1 5=1 6=53824 9=1\nConcat                   concat16                 2 1 branch16_1 branch16_2_conv3_branch16_2_conv3_relu concat16 -23330=4,3,7,7,464\nShuffleChannel           shuffle16                1 1 concat16 shuffle16 -23330=4,3,7,7,464 0=2\nConvolution              conv5                    1 1 shuffle16 conv5_conv5_relu -23330=4,3,7,7,1024 0=1024 1=1 5=1 6=475136 9=1\nPooling                  pool_ave                 1 1 conv5_conv5_relu pool_ave -23330=4,1,1024,1,1 0=1 4=1\nInnerProduct             fc1000                   1 1 pool_ave fc1000 -23330=4,1,1000,1,1 0=1000 1=1 2=1024000\nSoftmax                  prob                     1 1 fc1000 output 
-23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/squeezenet.param",
    "content": "7767517\n48 56\nInput                    data                     0 1 data -23330=4,3,227,227,3 0=227 1=227 2=3\nConvolution              conv1                    1 1 data conv1_relu_conv1 -23330=4,3,113,113,64 0=64 1=3 3=2 5=1 6=1728 9=1\nPooling                  pool1                    1 1 conv1_relu_conv1 pool1 -23330=4,3,56,56,64 1=3 2=2\nConvolution              fire2/squeeze1x1         1 1 pool1 fire2/squeeze1x1_fire2/relu_squeeze1x1 -23330=4,3,56,56,16 0=16 1=1 5=1 6=1024 9=1\nSplit                    splitncnn_0              1 2 fire2/squeeze1x1_fire2/relu_squeeze1x1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1 -23330=8,3,56,56,16,3,56,56,16\nConvolution              fire2/expand1x1          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1 fire2/expand1x1_fire2/relu_expand1x1 -23330=4,3,56,56,64 0=64 1=1 5=1 6=1024 9=1\nConvolution              fire2/expand3x3          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/expand3x3_fire2/relu_expand3x3 -23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=9216 9=1\nConcat                   fire2/concat             2 1 fire2/expand1x1_fire2/relu_expand1x1 fire2/expand3x3_fire2/relu_expand3x3 fire2/concat -23330=4,3,56,56,128\nConvolution              fire3/squeeze1x1         1 1 fire2/concat fire3/squeeze1x1_fire3/relu_squeeze1x1 -23330=4,3,56,56,16 0=16 1=1 5=1 6=2048 9=1\nSplit                    splitncnn_1              1 2 fire3/squeeze1x1_fire3/relu_squeeze1x1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1 -23330=8,3,56,56,16,3,56,56,16\nConvolution              fire3/expand1x1          1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1 fire3/expand1x1_fire3/relu_expand1x1 -23330=4,3,56,56,64 0=64 1=1 5=1 6=1024 9=1\nConvolution              fire3/expand3x3          1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/expand3x3_fire3/relu_expand3x3 
-23330=4,3,56,56,64 0=64 1=3 4=1 5=1 6=9216 9=1\nConcat                   fire3/concat             2 1 fire3/expand1x1_fire3/relu_expand1x1 fire3/expand3x3_fire3/relu_expand3x3 fire3/concat -23330=4,3,56,56,128\nPooling                  pool3                    1 1 fire3/concat pool3 -23330=4,3,28,28,128 1=3 2=2\nConvolution              fire4/squeeze1x1         1 1 pool3 fire4/squeeze1x1_fire4/relu_squeeze1x1 -23330=4,3,28,28,32 0=32 1=1 5=1 6=4096 9=1\nSplit                    splitncnn_2              1 2 fire4/squeeze1x1_fire4/relu_squeeze1x1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1 -23330=8,3,28,28,32,3,28,28,32\nConvolution              fire4/expand1x1          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1 fire4/expand1x1_fire4/relu_expand1x1 -23330=4,3,28,28,128 0=128 1=1 5=1 6=4096 9=1\nConvolution              fire4/expand3x3          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/expand3x3_fire4/relu_expand3x3 -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=36864 9=1\nConcat                   fire4/concat             2 1 fire4/expand1x1_fire4/relu_expand1x1 fire4/expand3x3_fire4/relu_expand3x3 fire4/concat -23330=4,3,28,28,256\nConvolution              fire5/squeeze1x1         1 1 fire4/concat fire5/squeeze1x1_fire5/relu_squeeze1x1 -23330=4,3,28,28,32 0=32 1=1 5=1 6=8192 9=1\nSplit                    splitncnn_3              1 2 fire5/squeeze1x1_fire5/relu_squeeze1x1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1 -23330=8,3,28,28,32,3,28,28,32\nConvolution              fire5/expand1x1          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1 fire5/expand1x1_fire5/relu_expand1x1 -23330=4,3,28,28,128 0=128 1=1 5=1 6=4096 9=1\nConvolution              fire5/expand3x3          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/expand3x3_fire5/relu_expand3x3 -23330=4,3,28,28,128 0=128 1=3 4=1 5=1 6=36864 9=1\nConcat 
                  fire5/concat             2 1 fire5/expand1x1_fire5/relu_expand1x1 fire5/expand3x3_fire5/relu_expand3x3 fire5/concat -23330=4,3,28,28,256\nPooling                  pool5                    1 1 fire5/concat pool5 -23330=4,3,14,14,256 1=3 2=2\nConvolution              fire6/squeeze1x1         1 1 pool5 fire6/squeeze1x1_fire6/relu_squeeze1x1 -23330=4,3,14,14,48 0=48 1=1 5=1 6=12288 9=1\nSplit                    splitncnn_4              1 2 fire6/squeeze1x1_fire6/relu_squeeze1x1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1 -23330=8,3,14,14,48,3,14,14,48\nConvolution              fire6/expand1x1          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1 fire6/expand1x1_fire6/relu_expand1x1 -23330=4,3,14,14,192 0=192 1=1 5=1 6=9216 9=1\nConvolution              fire6/expand3x3          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/expand3x3_fire6/relu_expand3x3 -23330=4,3,14,14,192 0=192 1=3 4=1 5=1 6=82944 9=1\nConcat                   fire6/concat             2 1 fire6/expand1x1_fire6/relu_expand1x1 fire6/expand3x3_fire6/relu_expand3x3 fire6/concat -23330=4,3,14,14,384\nConvolution              fire7/squeeze1x1         1 1 fire6/concat fire7/squeeze1x1_fire7/relu_squeeze1x1 -23330=4,3,14,14,48 0=48 1=1 5=1 6=18432 9=1\nSplit                    splitncnn_5              1 2 fire7/squeeze1x1_fire7/relu_squeeze1x1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1 -23330=8,3,14,14,48,3,14,14,48\nConvolution              fire7/expand1x1          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1 fire7/expand1x1_fire7/relu_expand1x1 -23330=4,3,14,14,192 0=192 1=1 5=1 6=9216 9=1\nConvolution              fire7/expand3x3          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/expand3x3_fire7/relu_expand3x3 -23330=4,3,14,14,192 0=192 1=3 4=1 5=1 6=82944 9=1\nConcat                   fire7/concat             2 1 
fire7/expand1x1_fire7/relu_expand1x1 fire7/expand3x3_fire7/relu_expand3x3 fire7/concat -23330=4,3,14,14,384\nConvolution              fire8/squeeze1x1         1 1 fire7/concat fire8/squeeze1x1_fire8/relu_squeeze1x1 -23330=4,3,14,14,64 0=64 1=1 5=1 6=24576 9=1\nSplit                    splitncnn_6              1 2 fire8/squeeze1x1_fire8/relu_squeeze1x1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1 -23330=8,3,14,14,64,3,14,14,64\nConvolution              fire8/expand1x1          1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1 fire8/expand1x1_fire8/relu_expand1x1 -23330=4,3,14,14,256 0=256 1=1 5=1 6=16384 9=1\nConvolution              fire8/expand3x3          1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/expand3x3_fire8/relu_expand3x3 -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=147456 9=1\nConcat                   fire8/concat             2 1 fire8/expand1x1_fire8/relu_expand1x1 fire8/expand3x3_fire8/relu_expand3x3 fire8/concat -23330=4,3,14,14,512\nConvolution              fire9/squeeze1x1         1 1 fire8/concat fire9/squeeze1x1_fire9/relu_squeeze1x1 -23330=4,3,14,14,64 0=64 1=1 5=1 6=32768 9=1\nSplit                    splitncnn_7              1 2 fire9/squeeze1x1_fire9/relu_squeeze1x1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1 -23330=8,3,14,14,64,3,14,14,64\nConvolution              fire9/expand1x1          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1 fire9/expand1x1_fire9/relu_expand1x1 -23330=4,3,14,14,256 0=256 1=1 5=1 6=16384 9=1\nConvolution              fire9/expand3x3          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/expand3x3_fire9/relu_expand3x3 -23330=4,3,14,14,256 0=256 1=3 4=1 5=1 6=147456 9=1\nConcat                   fire9/concat             2 1 fire9/expand1x1_fire9/relu_expand1x1 fire9/expand3x3_fire9/relu_expand3x3 fire9/concat_drop9 -23330=4,3,14,14,512\nConvolution              
conv10                   1 1 fire9/concat_drop9 conv10_relu_conv10 -23330=4,3,16,16,1000 0=1000 1=1 4=1 5=1 6=512000 9=1\nPooling                  pool10                   1 1 conv10_relu_conv10 pool10 -23330=4,1,1000,1,1 0=1 4=1\nSoftmax                  prob                     1 1 pool10 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/squeezenet_int8.param",
    "content": "7767517\n48 56\nInput                    data                     0 1 data 0=227 1=227 2=3\nConvolution              conv1                    1 1 data conv1_relu_conv1 0=64 1=3 3=2 5=1 6=1728 8=2 9=1\nPooling                  pool1                    1 1 conv1_relu_conv1 pool1 1=3 2=2\nConvolution              fire2/squeeze1x1         1 1 pool1 fire2/squeeze1x1_fire2/relu_squeeze1x1 0=16 1=1 5=1 6=1024 8=102 9=1\nSplit                    splitncnn_0              1 2 fire2/squeeze1x1_fire2/relu_squeeze1x1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1\nConvolution              fire2/expand1x1          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1 fire2/expand1x1_fire2/relu_expand1x1 0=64 1=1 5=1 6=1024 8=2 9=1\nConvolution              fire2/expand3x3          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/expand3x3_fire2/relu_expand3x3 0=64 1=3 4=1 5=1 6=9216 8=2 9=1\nConcat                   fire2/concat             2 1 fire2/expand1x1_fire2/relu_expand1x1 fire2/expand3x3_fire2/relu_expand3x3 fire2/concat\nConvolution              fire3/squeeze1x1         1 1 fire2/concat fire3/squeeze1x1_fire3/relu_squeeze1x1 0=16 1=1 5=1 6=2048 8=102 9=1\nSplit                    splitncnn_1              1 2 fire3/squeeze1x1_fire3/relu_squeeze1x1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1\nConvolution              fire3/expand1x1          1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1 fire3/expand1x1_fire3/relu_expand1x1 0=64 1=1 5=1 6=1024 8=2 9=1\nConvolution              fire3/expand3x3          1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/expand3x3_fire3/relu_expand3x3 0=64 1=3 4=1 5=1 6=9216 8=2 9=1\nConcat                   fire3/concat             2 1 fire3/expand1x1_fire3/relu_expand1x1 fire3/expand3x3_fire3/relu_expand3x3 fire3/concat\nPooling                  pool3                    1 1 
fire3/concat pool3 1=3 2=2\nConvolution              fire4/squeeze1x1         1 1 pool3 fire4/squeeze1x1_fire4/relu_squeeze1x1 0=32 1=1 5=1 6=4096 8=102 9=1\nSplit                    splitncnn_2              1 2 fire4/squeeze1x1_fire4/relu_squeeze1x1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1\nConvolution              fire4/expand1x1          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1 fire4/expand1x1_fire4/relu_expand1x1 0=128 1=1 5=1 6=4096 8=2 9=1\nConvolution              fire4/expand3x3          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/expand3x3_fire4/relu_expand3x3 0=128 1=3 4=1 5=1 6=36864 8=2 9=1\nConcat                   fire4/concat             2 1 fire4/expand1x1_fire4/relu_expand1x1 fire4/expand3x3_fire4/relu_expand3x3 fire4/concat\nConvolution              fire5/squeeze1x1         1 1 fire4/concat fire5/squeeze1x1_fire5/relu_squeeze1x1 0=32 1=1 5=1 6=8192 8=102 9=1\nSplit                    splitncnn_3              1 2 fire5/squeeze1x1_fire5/relu_squeeze1x1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1\nConvolution              fire5/expand1x1          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1 fire5/expand1x1_fire5/relu_expand1x1 0=128 1=1 5=1 6=4096 8=2 9=1\nConvolution              fire5/expand3x3          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/expand3x3_fire5/relu_expand3x3 0=128 1=3 4=1 5=1 6=36864 8=2 9=1\nConcat                   fire5/concat             2 1 fire5/expand1x1_fire5/relu_expand1x1 fire5/expand3x3_fire5/relu_expand3x3 fire5/concat\nPooling                  pool5                    1 1 fire5/concat pool5 1=3 2=2\nConvolution              fire6/squeeze1x1         1 1 pool5 fire6/squeeze1x1_fire6/relu_squeeze1x1 0=48 1=1 5=1 6=12288 8=102 9=1\nSplit                    splitncnn_4              1 2 fire6/squeeze1x1_fire6/relu_squeeze1x1 
fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1\nConvolution              fire6/expand1x1          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1 fire6/expand1x1_fire6/relu_expand1x1 0=192 1=1 5=1 6=9216 8=2 9=1\nConvolution              fire6/expand3x3          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/expand3x3_fire6/relu_expand3x3 0=192 1=3 4=1 5=1 6=82944 8=2 9=1\nConcat                   fire6/concat             2 1 fire6/expand1x1_fire6/relu_expand1x1 fire6/expand3x3_fire6/relu_expand3x3 fire6/concat\nConvolution              fire7/squeeze1x1         1 1 fire6/concat fire7/squeeze1x1_fire7/relu_squeeze1x1 0=48 1=1 5=1 6=18432 8=102 9=1\nSplit                    splitncnn_5              1 2 fire7/squeeze1x1_fire7/relu_squeeze1x1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1\nConvolution              fire7/expand1x1          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1 fire7/expand1x1_fire7/relu_expand1x1 0=192 1=1 5=1 6=9216 8=2 9=1\nConvolution              fire7/expand3x3          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/expand3x3_fire7/relu_expand3x3 0=192 1=3 4=1 5=1 6=82944 8=2 9=1\nConcat                   fire7/concat             2 1 fire7/expand1x1_fire7/relu_expand1x1 fire7/expand3x3_fire7/relu_expand3x3 fire7/concat\nConvolution              fire8/squeeze1x1         1 1 fire7/concat fire8/squeeze1x1_fire8/relu_squeeze1x1 0=64 1=1 5=1 6=24576 8=102 9=1\nSplit                    splitncnn_6              1 2 fire8/squeeze1x1_fire8/relu_squeeze1x1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1\nConvolution              fire8/expand1x1          1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1 fire8/expand1x1_fire8/relu_expand1x1 0=256 1=1 5=1 6=16384 8=2 9=1\nConvolution              fire8/expand3x3          1 1 
fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/expand3x3_fire8/relu_expand3x3 0=256 1=3 4=1 5=1 6=147456 8=2 9=1\nConcat                   fire8/concat             2 1 fire8/expand1x1_fire8/relu_expand1x1 fire8/expand3x3_fire8/relu_expand3x3 fire8/concat\nConvolution              fire9/squeeze1x1         1 1 fire8/concat fire9/squeeze1x1_fire9/relu_squeeze1x1 0=64 1=1 5=1 6=32768 8=102 9=1\nSplit                    splitncnn_7              1 2 fire9/squeeze1x1_fire9/relu_squeeze1x1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1\nConvolution              fire9/expand1x1          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1 fire9/expand1x1_fire9/relu_expand1x1 0=256 1=1 5=1 6=16384 8=2 9=1\nConvolution              fire9/expand3x3          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/expand3x3_fire9/relu_expand3x3 0=256 1=3 4=1 5=1 6=147456 8=2 9=1\nConcat                   fire9/concat             2 1 fire9/expand1x1_fire9/relu_expand1x1 fire9/expand3x3_fire9/relu_expand3x3 fire9/concat_drop9\nConvolution              conv10                   1 1 fire9/concat_drop9 conv10_relu_conv10 0=1000 1=1 4=1 5=1 6=512000 8=2 9=1\nPooling                  pool10                   1 1 conv10_relu_conv10 pool10 0=1 4=1\nSoftmax                  prob                     1 1 pool10 output\n"
  },
  {
    "path": "benchmark/squeezenet_ssd.param",
    "content": "7767517\n119 152\nInput                    data                     0 1 data -23330=4,3,300,300,3 0=300 1=300 2=3\nSplit                    splitncnn_0              1 7 data data_splitncnn_0 data_splitncnn_1 data_splitncnn_2 data_splitncnn_3 data_splitncnn_4 data_splitncnn_5 data_splitncnn_6 -23330=28,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3,3,300,300,3\nConvolution              conv1                    1 1 data_splitncnn_6 conv1_relu_conv1 -23330=4,3,149,149,64 0=64 1=3 3=2 5=1 6=1728 9=1\nPooling                  pool1                    1 1 conv1_relu_conv1 pool1 -23330=4,3,74,74,64 1=3 2=2\nConvolution              fire2/squeeze1x1         1 1 pool1 fire2/squeeze1x1_fire2/relu_squeeze1x1 -23330=4,3,74,74,16 0=16 1=1 5=1 6=1024 9=1\nSplit                    splitncnn_1              1 2 fire2/squeeze1x1_fire2/relu_squeeze1x1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1 -23330=8,3,74,74,16,3,74,74,16\nConvolution              fire2/expand1x1          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1 fire2/expand1x1_fire2/relu_expand1x1 -23330=4,3,74,74,64 0=64 1=1 5=1 6=1024 9=1\nConvolution              fire2/expand3x3          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/expand3x3_fire2/relu_expand3x3 -23330=4,3,74,74,64 0=64 1=3 4=1 5=1 6=9216 9=1\nConcat                   fire2/concat             2 1 fire2/expand1x1_fire2/relu_expand1x1 fire2/expand3x3_fire2/relu_expand3x3 fire2/concat -23330=4,3,74,74,128\nConvolution              fire3/squeeze1x1         1 1 fire2/concat fire3/squeeze1x1_fire3/relu_squeeze1x1 -23330=4,3,74,74,16 0=16 1=1 5=1 6=2048 9=1\nSplit                    splitncnn_2              1 2 fire3/squeeze1x1_fire3/relu_squeeze1x1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1 -23330=8,3,74,74,16,3,74,74,16\nConvolution              fire3/expand1x1          1 1 
fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1 fire3/expand1x1_fire3/relu_expand1x1 -23330=4,3,74,74,64 0=64 1=1 5=1 6=1024 9=1\nConvolution              fire3/expand3x3          1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/expand3x3_fire3/relu_expand3x3 -23330=4,3,74,74,64 0=64 1=3 4=1 5=1 6=9216 9=1\nConcat                   fire3/concat             2 1 fire3/expand1x1_fire3/relu_expand1x1 fire3/expand3x3_fire3/relu_expand3x3 fire3/concat -23330=4,3,74,74,128\nPooling                  pool3                    1 1 fire3/concat pool3 -23330=4,3,37,37,128 1=3 2=2\nConvolution              fire4/squeeze1x1         1 1 pool3 fire4/squeeze1x1_fire4/relu_squeeze1x1 -23330=4,3,37,37,32 0=32 1=1 5=1 6=4096 9=1\nSplit                    splitncnn_3              1 2 fire4/squeeze1x1_fire4/relu_squeeze1x1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1 -23330=8,3,37,37,32,3,37,37,32\nConvolution              fire4/expand1x1          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1 fire4/expand1x1_fire4/relu_expand1x1 -23330=4,3,37,37,128 0=128 1=1 5=1 6=4096 9=1\nConvolution              fire4/expand3x3          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/expand3x3_fire4/relu_expand3x3 -23330=4,3,37,37,128 0=128 1=3 4=1 5=1 6=36864 9=1\nConcat                   fire4/concat             2 1 fire4/expand1x1_fire4/relu_expand1x1 fire4/expand3x3_fire4/relu_expand3x3 fire4/concat -23330=4,3,37,37,256\nConvolution              fire5/squeeze1x1         1 1 fire4/concat fire5/squeeze1x1_fire5/relu_squeeze1x1 -23330=4,3,37,37,32 0=32 1=1 5=1 6=8192 9=1\nSplit                    splitncnn_4              1 2 fire5/squeeze1x1_fire5/relu_squeeze1x1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1 -23330=8,3,37,37,32,3,37,37,32\nConvolution              fire5/expand1x1          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1 
fire5/expand1x1_fire5/relu_expand1x1 -23330=4,3,37,37,128 0=128 1=1 5=1 6=4096 9=1\nConvolution              fire5/expand3x3          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/expand3x3_fire5/relu_expand3x3 -23330=4,3,37,37,128 0=128 1=3 4=1 5=1 6=36864 9=1\nConcat                   fire5/concat             2 1 fire5/expand1x1_fire5/relu_expand1x1 fire5/expand3x3_fire5/relu_expand3x3 fire5/concat -23330=4,3,37,37,256\nSplit                    splitncnn_5              1 2 fire5/concat fire5/concat_splitncnn_0 fire5/concat_splitncnn_1 -23330=8,3,37,37,256,3,37,37,256\nPooling                  pool5                    1 1 fire5/concat_splitncnn_1 pool5 -23330=4,3,18,18,256 1=3 2=2\nConvolution              fire6/squeeze1x1         1 1 pool5 fire6/squeeze1x1_fire6/relu_squeeze1x1 -23330=4,3,18,18,48 0=48 1=1 5=1 6=12288 9=1\nSplit                    splitncnn_6              1 2 fire6/squeeze1x1_fire6/relu_squeeze1x1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1 -23330=8,3,18,18,48,3,18,18,48\nConvolution              fire6/expand1x1          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1 fire6/expand1x1_fire6/relu_expand1x1 -23330=4,3,18,18,192 0=192 1=1 5=1 6=9216 9=1\nConvolution              fire6/expand3x3          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/expand3x3_fire6/relu_expand3x3 -23330=4,3,18,18,192 0=192 1=3 4=1 5=1 6=82944 9=1\nConcat                   fire6/concat             2 1 fire6/expand1x1_fire6/relu_expand1x1 fire6/expand3x3_fire6/relu_expand3x3 fire6/concat -23330=4,3,18,18,384\nConvolution              fire7/squeeze1x1         1 1 fire6/concat fire7/squeeze1x1_fire7/relu_squeeze1x1 -23330=4,3,18,18,48 0=48 1=1 5=1 6=18432 9=1\nSplit                    splitncnn_7              1 2 fire7/squeeze1x1_fire7/relu_squeeze1x1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1 
-23330=8,3,18,18,48,3,18,18,48\nConvolution              fire7/expand1x1          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1 fire7/expand1x1_fire7/relu_expand1x1 -23330=4,3,18,18,192 0=192 1=1 5=1 6=9216 9=1\nConvolution              fire7/expand3x3          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/expand3x3_fire7/relu_expand3x3 -23330=4,3,18,18,192 0=192 1=3 4=1 5=1 6=82944 9=1\nConcat                   fire7/concat             2 1 fire7/expand1x1_fire7/relu_expand1x1 fire7/expand3x3_fire7/relu_expand3x3 fire7/concat -23330=4,3,18,18,384\nConvolution              fire8/squeeze1x1         1 1 fire7/concat fire8/squeeze1x1_fire8/relu_squeeze1x1 -23330=4,3,18,18,64 0=64 1=1 5=1 6=24576 9=1\nSplit                    splitncnn_8              1 2 fire8/squeeze1x1_fire8/relu_squeeze1x1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1 -23330=8,3,18,18,64,3,18,18,64\nConvolution              fire8/expand1x1          1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1 fire8/expand1x1_fire8/relu_expand1x1 -23330=4,3,18,18,256 0=256 1=1 5=1 6=16384 9=1\nConvolution              fire8/expand3x3          1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/expand3x3_fire8/relu_expand3x3 -23330=4,3,18,18,256 0=256 1=3 4=1 5=1 6=147456 9=1\nConcat                   fire8/concat             2 1 fire8/expand1x1_fire8/relu_expand1x1 fire8/expand3x3_fire8/relu_expand3x3 fire8/concat -23330=4,3,18,18,512\nConvolution              fire9/squeeze1x1         1 1 fire8/concat fire9/squeeze1x1_fire9/relu_squeeze1x1 -23330=4,3,18,18,64 0=64 1=1 5=1 6=32768 9=1\nSplit                    splitncnn_9              1 2 fire9/squeeze1x1_fire9/relu_squeeze1x1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1 -23330=8,3,18,18,64,3,18,18,64\nConvolution              fire9/expand1x1          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1 
fire9/expand1x1_fire9/relu_expand1x1 -23330=4,3,18,18,256 0=256 1=1 5=1 6=16384 9=1\nConvolution              fire9/expand3x3          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/expand3x3_fire9/relu_expand3x3 -23330=4,3,18,18,256 0=256 1=3 4=1 5=1 6=147456 9=1\nConcat                   fire9/concat             2 1 fire9/expand1x1_fire9/relu_expand1x1 fire9/expand3x3_fire9/relu_expand3x3 fire9/concat -23330=4,3,18,18,512\nSplit                    splitncnn_10             1 4 fire9/concat fire9/concat_splitncnn_0 fire9/concat_splitncnn_1 fire9/concat_splitncnn_2 fire9/concat_splitncnn_3 -23330=16,3,18,18,512,3,18,18,512,3,18,18,512,3,18,18,512\nPooling                  pool9                    1 1 fire9/concat_splitncnn_3 pool9 -23330=4,3,9,9,512 1=3 2=2\nConvolution              fire10/squeeze1x1        1 1 pool9 fire10/squeeze1x1_fire10/relu_squeeze1x1 -23330=4,3,9,9,96 0=96 1=1 5=1 6=49152 9=1\nSplit                    splitncnn_11             1 2 fire10/squeeze1x1_fire10/relu_squeeze1x1 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_0 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_1 -23330=8,3,9,9,96,3,9,9,96\nConvolution              fire10/expand1x1         1 1 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_1 fire10/expand1x1_fire10/relu_expand1x1 -23330=4,3,9,9,384 0=384 1=1 5=1 6=36864 9=1\nConvolution              fire10/expand3x3         1 1 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_0 fire10/expand3x3_fire10/relu_expand3x3 -23330=4,3,9,9,384 0=384 1=3 4=1 5=1 6=331776 9=1\nConcat                   fire10/concat            2 1 fire10/expand1x1_fire10/relu_expand1x1 fire10/expand3x3_fire10/relu_expand3x3 fire10/concat -23330=4,3,9,9,768\nSplit                    splitncnn_12             1 4 fire10/concat fire10/concat_splitncnn_0 fire10/concat_splitncnn_1 fire10/concat_splitncnn_2 fire10/concat_splitncnn_3 -23330=16,3,9,9,768,3,9,9,768,3,9,9,768,3,9,9,768\nPooling                  pool10                   1 1 
fire10/concat_splitncnn_3 pool10 -23330=4,3,4,4,768 1=3 2=2\nConvolution              fire11/squeeze1x1        1 1 pool10 fire11/squeeze1x1_fire11/relu_squeeze1x1 -23330=4,3,4,4,96 0=96 1=1 5=1 6=73728 9=1\nSplit                    splitncnn_13             1 2 fire11/squeeze1x1_fire11/relu_squeeze1x1 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_0 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_1 -23330=8,3,4,4,96,3,4,4,96\nConvolution              fire11/expand1x1         1 1 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_1 fire11/expand1x1_fire11/relu_expand1x1 -23330=4,3,4,4,384 0=384 1=1 5=1 6=36864 9=1\nConvolution              fire11/expand3x3         1 1 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_0 fire11/expand3x3_fire11/relu_expand3x3 -23330=4,3,4,4,384 0=384 1=3 4=1 5=1 6=331776 9=1\nConcat                   fire11/concat            2 1 fire11/expand1x1_fire11/relu_expand1x1 fire11/expand3x3_fire11/relu_expand3x3 fire11/concat -23330=4,3,4,4,768\nSplit                    splitncnn_14             1 4 fire11/concat fire11/concat_splitncnn_0 fire11/concat_splitncnn_1 fire11/concat_splitncnn_2 fire11/concat_splitncnn_3 -23330=16,3,4,4,768,3,4,4,768,3,4,4,768,3,4,4,768\nConvolution              conv12_1                 1 1 fire11/concat_splitncnn_3 conv12_1_conv12_1/relu -23330=4,3,4,4,128 0=128 1=1 5=1 6=98304 9=1\nConvolution              conv12_2                 1 1 conv12_1_conv12_1/relu conv12_2_conv12_2/relu -23330=4,3,2,2,256 0=256 1=3 3=2 4=1 5=1 6=294912 9=1\nSplit                    splitncnn_15             1 4 conv12_2_conv12_2/relu conv12_2_conv12_2/relu_splitncnn_0 conv12_2_conv12_2/relu_splitncnn_1 conv12_2_conv12_2/relu_splitncnn_2 conv12_2_conv12_2/relu_splitncnn_3 -23330=16,3,2,2,256,3,2,2,256,3,2,2,256,3,2,2,256\nConvolution              conv13_1                 1 1 conv12_2_conv12_2/relu_splitncnn_3 conv13_1_conv13_1/relu -23330=4,3,2,2,64 0=64 1=1 5=1 6=16384 9=1\nConvolution              conv13_2                 1 1 
conv13_1_conv13_1/relu conv13_2_conv13_2/relu -23330=4,3,1,1,128 0=128 1=3 3=2 4=1 5=1 6=73728 9=1\nSplit                    splitncnn_16             1 3 conv13_2_conv13_2/relu conv13_2_conv13_2/relu_splitncnn_0 conv13_2_conv13_2/relu_splitncnn_1 conv13_2_conv13_2/relu_splitncnn_2 -23330=12,3,1,1,128,3,1,1,128,3,1,1,128\nBatchNorm                fire5/bn                 1 1 fire5/concat_splitncnn_0 fire5/normal_fire5/scale -23330=4,3,37,37,256 0=256\nSplit                    splitncnn_17             1 3 fire5/normal_fire5/scale fire5/normal_fire5/scale_splitncnn_0 fire5/normal_fire5/scale_splitncnn_1 fire5/normal_fire5/scale_splitncnn_2 -23330=12,3,37,37,256,3,37,37,256,3,37,37,256\nConvolution              fire5_mbox_loc           1 1 fire5/normal_fire5/scale_splitncnn_2 fire5_mbox_loc -23330=4,3,37,37,16 0=16 1=3 4=1 5=1 6=36864\nPermute                  fire5_mbox_loc_perm      1 1 fire5_mbox_loc fire5_mbox_loc_perm -23330=4,3,16,37,37 0=3\nFlatten                  fire5_mbox_loc_flat      1 1 fire5_mbox_loc_perm fire5_mbox_loc_flat -23330=4,1,21904,1,1\nConvolution              fire5_mbox_conf          1 1 fire5/normal_fire5/scale_splitncnn_1 fire5_mbox_conf -23330=4,3,37,37,84 0=84 1=3 4=1 5=1 6=193536\nPermute                  fire5_mbox_conf_perm     1 1 fire5_mbox_conf fire5_mbox_conf_perm -23330=4,3,84,37,37 0=3\nFlatten                  fire5_mbox_conf_flat     1 1 fire5_mbox_conf_perm fire5_mbox_conf_flat -23330=4,1,114996,1,1\nPriorBox                 fire5_mbox_priorbox      2 1 fire5/normal_fire5/scale_splitncnn_0 data_splitncnn_5 fire5_mbox_priorbox -23330=4,2,21904,2,1 -23300=1,2.100000e+01 -23301=1,4.500000e+01 -23302=1,2.000000e+00 9=-233 10=-233 11=8.000000e+00 12=8.000000e+00 13=5.000000e-01\nConvolution              fire9_mbox_loc           1 1 fire9/concat_splitncnn_2 fire9_mbox_loc -23330=4,3,18,18,24 0=24 1=3 4=1 5=1 6=110592\nPermute                  fire9_mbox_loc_perm      1 1 fire9_mbox_loc fire9_mbox_loc_perm -23330=4,3,24,18,18 
0=3\nFlatten                  fire9_mbox_loc_flat      1 1 fire9_mbox_loc_perm fire9_mbox_loc_flat -23330=4,1,7776,1,1\nConvolution              fire9_mbox_conf          1 1 fire9/concat_splitncnn_1 fire9_mbox_conf -23330=4,3,18,18,126 0=126 1=3 4=1 5=1 6=580608\nPermute                  fire9_mbox_conf_perm     1 1 fire9_mbox_conf fire9_mbox_conf_perm -23330=4,3,126,18,18 0=3\nFlatten                  fire9_mbox_conf_flat     1 1 fire9_mbox_conf_perm fire9_mbox_conf_flat -23330=4,1,40824,1,1\nPriorBox                 fire9_mbox_priorbox      2 1 fire9/concat_splitncnn_0 data_splitncnn_4 fire9_mbox_priorbox -23330=4,2,7776,2,1 -23300=1,4.500000e+01 -23301=1,9.900000e+01 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 11=1.600000e+01 12=1.600000e+01 13=5.000000e-01\nConvolution              fire10_mbox_loc          1 1 fire10/concat_splitncnn_2 fire10_mbox_loc -23330=4,3,9,9,24 0=24 1=3 4=1 5=1 6=165888\nPermute                  fire10_mbox_loc_perm     1 1 fire10_mbox_loc fire10_mbox_loc_perm -23330=4,3,24,9,9 0=3\nFlatten                  fire10_mbox_loc_flat     1 1 fire10_mbox_loc_perm fire10_mbox_loc_flat -23330=4,1,1944,1,1\nConvolution              fire10_mbox_conf         1 1 fire10/concat_splitncnn_1 fire10_mbox_conf -23330=4,3,9,9,126 0=126 1=3 4=1 5=1 6=870912\nPermute                  fire10_mbox_conf_perm    1 1 fire10_mbox_conf fire10_mbox_conf_perm -23330=4,3,126,9,9 0=3\nFlatten                  fire10_mbox_conf_flat    1 1 fire10_mbox_conf_perm fire10_mbox_conf_flat -23330=4,1,10206,1,1\nPriorBox                 fire10_mbox_priorbox     2 1 fire10/concat_splitncnn_0 data_splitncnn_3 fire10_mbox_priorbox -23330=4,2,1944,2,1 -23300=1,9.900000e+01 -23301=1,1.530000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 11=3.200000e+01 12=3.200000e+01 13=5.000000e-01\nConvolution              fire11_mbox_loc          1 1 fire11/concat_splitncnn_2 fire11_mbox_loc -23330=4,3,4,4,24 0=24 1=3 4=1 5=1 6=165888\nPermute                  fire11_mbox_loc_perm  
   1 1 fire11_mbox_loc fire11_mbox_loc_perm -23330=4,3,24,4,4 0=3\nFlatten                  fire11_mbox_loc_flat     1 1 fire11_mbox_loc_perm fire11_mbox_loc_flat -23330=4,1,384,1,1\nConvolution              fire11_mbox_conf         1 1 fire11/concat_splitncnn_1 fire11_mbox_conf -23330=4,3,4,4,126 0=126 1=3 4=1 5=1 6=870912\nPermute                  fire11_mbox_conf_perm    1 1 fire11_mbox_conf fire11_mbox_conf_perm -23330=4,3,126,4,4 0=3\nFlatten                  fire11_mbox_conf_flat    1 1 fire11_mbox_conf_perm fire11_mbox_conf_flat -23330=4,1,2016,1,1\nPriorBox                 fire11_mbox_priorbox     2 1 fire11/concat_splitncnn_0 data_splitncnn_2 fire11_mbox_priorbox -23330=4,2,384,2,1 -23300=1,1.530000e+02 -23301=1,2.070000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 11=6.400000e+01 12=6.400000e+01 13=5.000000e-01\nConvolution              conv12_2_mbox_loc        1 1 conv12_2_conv12_2/relu_splitncnn_2 conv12_2_mbox_loc -23330=4,3,2,2,24 0=24 1=3 4=1 5=1 6=55296\nPermute                  conv12_2_mbox_loc_perm   1 1 conv12_2_mbox_loc conv12_2_mbox_loc_perm -23330=4,3,24,2,2 0=3\nFlatten                  conv12_2_mbox_loc_flat   1 1 conv12_2_mbox_loc_perm conv12_2_mbox_loc_flat -23330=4,1,96,1,1\nConvolution              conv12_2_mbox_conf       1 1 conv12_2_conv12_2/relu_splitncnn_1 conv12_2_mbox_conf -23330=4,3,2,2,126 0=126 1=3 4=1 5=1 6=290304\nPermute                  conv12_2_mbox_conf_perm  1 1 conv12_2_mbox_conf conv12_2_mbox_conf_perm -23330=4,3,126,2,2 0=3\nFlatten                  conv12_2_mbox_conf_flat  1 1 conv12_2_mbox_conf_perm conv12_2_mbox_conf_flat -23330=4,1,504,1,1\nPriorBox                 conv12_2_mbox_priorbox   2 1 conv12_2_conv12_2/relu_splitncnn_0 data_splitncnn_1 conv12_2_mbox_priorbox -23330=4,2,96,2,1 -23300=1,2.070000e+02 -23301=1,2.610000e+02 -23302=2,2.000000e+00,3.000000e+00 9=-233 10=-233 11=1.000000e+02 12=1.000000e+02 13=5.000000e-01\nConvolution              conv13_2_mbox_loc        1 1 
conv13_2_conv13_2/relu_splitncnn_2 conv13_2_mbox_loc -23330=4,3,1,1,16 0=16 1=3 4=1 5=1 6=18432\nPermute                  conv13_2_mbox_loc_perm   1 1 conv13_2_mbox_loc conv13_2_mbox_loc_perm -23330=4,3,16,1,1 0=3\nFlatten                  conv13_2_mbox_loc_flat   1 1 conv13_2_mbox_loc_perm conv13_2_mbox_loc_flat -23330=4,1,16,1,1\nConvolution              conv13_2_mbox_conf       1 1 conv13_2_conv13_2/relu_splitncnn_1 conv13_2_mbox_conf -23330=4,3,1,1,84 0=84 1=3 4=1 5=1 6=96768\nPermute                  conv13_2_mbox_conf_perm  1 1 conv13_2_mbox_conf conv13_2_mbox_conf_perm -23330=4,3,84,1,1 0=3\nFlatten                  conv13_2_mbox_conf_flat  1 1 conv13_2_mbox_conf_perm conv13_2_mbox_conf_flat -23330=4,1,84,1,1\nPriorBox                 conv13_2_mbox_priorbox   2 1 conv13_2_conv13_2/relu_splitncnn_0 data_splitncnn_0 conv13_2_mbox_priorbox -23330=4,2,16,2,1 -23300=1,2.610000e+02 -23301=1,3.150000e+02 -23302=1,2.000000e+00 9=-233 10=-233 11=3.000000e+02 12=3.000000e+02 13=5.000000e-01\nConcat                   mbox_loc                 6 1 fire5_mbox_loc_flat fire9_mbox_loc_flat fire10_mbox_loc_flat fire11_mbox_loc_flat conv12_2_mbox_loc_flat conv13_2_mbox_loc_flat mbox_loc -23330=4,1,32120,1,1\nConcat                   mbox_conf                6 1 fire5_mbox_conf_flat fire9_mbox_conf_flat fire10_mbox_conf_flat fire11_mbox_conf_flat conv12_2_mbox_conf_flat conv13_2_mbox_conf_flat mbox_conf -23330=4,1,168630,1,1\nConcat                   mbox_priorbox            6 1 fire5_mbox_priorbox fire9_mbox_priorbox fire10_mbox_priorbox fire11_mbox_priorbox conv12_2_mbox_priorbox conv13_2_mbox_priorbox mbox_priorbox -23330=4,2,32120,2,1 0=1\nReshape                  mbox_conf_reshape        1 1 mbox_conf mbox_conf_reshape -23330=4,2,21,8030,1 0=21 1=-1\nSoftmax                  mbox_conf_softmax        1 1 mbox_conf_reshape mbox_conf_softmax -23330=4,2,21,8030,1 0=1 1=1\nFlatten                  mbox_conf_flatten        1 1 mbox_conf_softmax mbox_conf_flatten 
-23330=4,1,168630,1,1\nDetectionOutput          detection_out            3 1 mbox_loc mbox_conf_flatten mbox_priorbox output 0=21 1=4.500000e-01 2=100 4=2.500000e-01\n"
  },
  {
    "path": "benchmark/squeezenet_ssd_int8.param",
    "content": "7767517\n119 152\nInput                    data                     0 1 data 0=300 1=300 2=3\nSplit                    splitncnn_0              1 7 data data_splitncnn_0 data_splitncnn_1 data_splitncnn_2 data_splitncnn_3 data_splitncnn_4 data_splitncnn_5 data_splitncnn_6\nConvolution              conv1                    1 1 data_splitncnn_6 conv1_relu_conv1 0=64 1=3 3=2 5=1 6=1728 8=2 9=1\nPooling                  pool1                    1 1 conv1_relu_conv1 pool1 1=3 2=2\nConvolution              fire2/squeeze1x1         1 1 pool1 fire2/squeeze1x1_fire2/relu_squeeze1x1 0=16 1=1 5=1 6=1024 8=102 9=1\nSplit                    splitncnn_1              1 2 fire2/squeeze1x1_fire2/relu_squeeze1x1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1\nConvolution              fire2/expand1x1          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1 fire2/expand1x1_fire2/relu_expand1x1 0=64 1=1 5=1 6=1024 8=2 9=1\nConvolution              fire2/expand3x3          1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/expand3x3_fire2/relu_expand3x3 0=64 1=3 4=1 5=1 6=9216 8=2 9=1\nConcat                   fire2/concat             2 1 fire2/expand1x1_fire2/relu_expand1x1 fire2/expand3x3_fire2/relu_expand3x3 fire2/concat\nConvolution              fire3/squeeze1x1         1 1 fire2/concat fire3/squeeze1x1_fire3/relu_squeeze1x1 0=16 1=1 5=1 6=2048 8=102 9=1\nSplit                    splitncnn_2              1 2 fire3/squeeze1x1_fire3/relu_squeeze1x1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1\nConvolution              fire3/expand1x1          1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1 fire3/expand1x1_fire3/relu_expand1x1 0=64 1=1 5=1 6=1024 8=2 9=1\nConvolution              fire3/expand3x3          1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/expand3x3_fire3/relu_expand3x3 0=64 1=3 4=1 5=1 6=9216 8=2 9=1\nConcat     
              fire3/concat             2 1 fire3/expand1x1_fire3/relu_expand1x1 fire3/expand3x3_fire3/relu_expand3x3 fire3/concat\nPooling                  pool3                    1 1 fire3/concat pool3 1=3 2=2\nConvolution              fire4/squeeze1x1         1 1 pool3 fire4/squeeze1x1_fire4/relu_squeeze1x1 0=32 1=1 5=1 6=4096 8=102 9=1\nSplit                    splitncnn_3              1 2 fire4/squeeze1x1_fire4/relu_squeeze1x1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1\nConvolution              fire4/expand1x1          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1 fire4/expand1x1_fire4/relu_expand1x1 0=128 1=1 5=1 6=4096 8=2 9=1\nConvolution              fire4/expand3x3          1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/expand3x3_fire4/relu_expand3x3 0=128 1=3 4=1 5=1 6=36864 8=2 9=1\nConcat                   fire4/concat             2 1 fire4/expand1x1_fire4/relu_expand1x1 fire4/expand3x3_fire4/relu_expand3x3 fire4/concat\nConvolution              fire5/squeeze1x1         1 1 fire4/concat fire5/squeeze1x1_fire5/relu_squeeze1x1 0=32 1=1 5=1 6=8192 8=102 9=1\nSplit                    splitncnn_4              1 2 fire5/squeeze1x1_fire5/relu_squeeze1x1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1\nConvolution              fire5/expand1x1          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1 fire5/expand1x1_fire5/relu_expand1x1 0=128 1=1 5=1 6=4096 8=2 9=1\nConvolution              fire5/expand3x3          1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/expand3x3_fire5/relu_expand3x3 0=128 1=3 4=1 5=1 6=36864 8=2 9=1\nConcat                   fire5/concat             2 1 fire5/expand1x1_fire5/relu_expand1x1 fire5/expand3x3_fire5/relu_expand3x3 fire5/concat\nSplit                    splitncnn_5              1 2 fire5/concat fire5/concat_splitncnn_0 fire5/concat_splitncnn_1\nPooling                  
pool5                    1 1 fire5/concat_splitncnn_1 pool5 1=3 2=2\nConvolution              fire6/squeeze1x1         1 1 pool5 fire6/squeeze1x1_fire6/relu_squeeze1x1 0=48 1=1 5=1 6=12288 8=102 9=1\nSplit                    splitncnn_6              1 2 fire6/squeeze1x1_fire6/relu_squeeze1x1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1\nConvolution              fire6/expand1x1          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1 fire6/expand1x1_fire6/relu_expand1x1 0=192 1=1 5=1 6=9216 8=2 9=1\nConvolution              fire6/expand3x3          1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/expand3x3_fire6/relu_expand3x3 0=192 1=3 4=1 5=1 6=82944 8=2 9=1\nConcat                   fire6/concat             2 1 fire6/expand1x1_fire6/relu_expand1x1 fire6/expand3x3_fire6/relu_expand3x3 fire6/concat\nConvolution              fire7/squeeze1x1         1 1 fire6/concat fire7/squeeze1x1_fire7/relu_squeeze1x1 0=48 1=1 5=1 6=18432 8=102 9=1\nSplit                    splitncnn_7              1 2 fire7/squeeze1x1_fire7/relu_squeeze1x1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1\nConvolution              fire7/expand1x1          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1 fire7/expand1x1_fire7/relu_expand1x1 0=192 1=1 5=1 6=9216 8=2 9=1\nConvolution              fire7/expand3x3          1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/expand3x3_fire7/relu_expand3x3 0=192 1=3 4=1 5=1 6=82944 8=2 9=1\nConcat                   fire7/concat             2 1 fire7/expand1x1_fire7/relu_expand1x1 fire7/expand3x3_fire7/relu_expand3x3 fire7/concat\nConvolution              fire8/squeeze1x1         1 1 fire7/concat fire8/squeeze1x1_fire8/relu_squeeze1x1 0=64 1=1 5=1 6=24576 8=102 9=1\nSplit                    splitncnn_8              1 2 fire8/squeeze1x1_fire8/relu_squeeze1x1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 
fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1\nConvolution              fire8/expand1x1          1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1 fire8/expand1x1_fire8/relu_expand1x1 0=256 1=1 5=1 6=16384 8=2 9=1\nConvolution              fire8/expand3x3          1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/expand3x3_fire8/relu_expand3x3 0=256 1=3 4=1 5=1 6=147456 8=2 9=1\nConcat                   fire8/concat             2 1 fire8/expand1x1_fire8/relu_expand1x1 fire8/expand3x3_fire8/relu_expand3x3 fire8/concat\nConvolution              fire9/squeeze1x1         1 1 fire8/concat fire9/squeeze1x1_fire9/relu_squeeze1x1 0=64 1=1 5=1 6=32768 8=102 9=1\nSplit                    splitncnn_9              1 2 fire9/squeeze1x1_fire9/relu_squeeze1x1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1\nConvolution              fire9/expand1x1          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1 fire9/expand1x1_fire9/relu_expand1x1 0=256 1=1 5=1 6=16384 8=2 9=1\nConvolution              fire9/expand3x3          1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/expand3x3_fire9/relu_expand3x3 0=256 1=3 4=1 5=1 6=147456 8=2 9=1\nConcat                   fire9/concat             2 1 fire9/expand1x1_fire9/relu_expand1x1 fire9/expand3x3_fire9/relu_expand3x3 fire9/concat\nSplit                    splitncnn_10             1 4 fire9/concat fire9/concat_splitncnn_0 fire9/concat_splitncnn_1 fire9/concat_splitncnn_2 fire9/concat_splitncnn_3\nPooling                  pool9                    1 1 fire9/concat_splitncnn_3 pool9 1=3 2=2\nConvolution              fire10/squeeze1x1        1 1 pool9 fire10/squeeze1x1_fire10/relu_squeeze1x1 0=96 1=1 5=1 6=49152 8=102 9=1\nSplit                    splitncnn_11             1 2 fire10/squeeze1x1_fire10/relu_squeeze1x1 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_0 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_1\nConvolution              
fire10/expand1x1         1 1 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_1 fire10/expand1x1_fire10/relu_expand1x1 0=384 1=1 5=1 6=36864 8=2 9=1\nConvolution              fire10/expand3x3         1 1 fire10/squeeze1x1_fire10/relu_squeeze1x1_splitncnn_0 fire10/expand3x3_fire10/relu_expand3x3 0=384 1=3 4=1 5=1 6=331776 8=2 9=1\nConcat                   fire10/concat            2 1 fire10/expand1x1_fire10/relu_expand1x1 fire10/expand3x3_fire10/relu_expand3x3 fire10/concat\nSplit                    splitncnn_12             1 4 fire10/concat fire10/concat_splitncnn_0 fire10/concat_splitncnn_1 fire10/concat_splitncnn_2 fire10/concat_splitncnn_3\nPooling                  pool10                   1 1 fire10/concat_splitncnn_3 pool10 1=3 2=2\nConvolution              fire11/squeeze1x1        1 1 pool10 fire11/squeeze1x1_fire11/relu_squeeze1x1 0=96 1=1 5=1 6=73728 8=102 9=1\nSplit                    splitncnn_13             1 2 fire11/squeeze1x1_fire11/relu_squeeze1x1 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_0 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_1\nConvolution              fire11/expand1x1         1 1 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_1 fire11/expand1x1_fire11/relu_expand1x1 0=384 1=1 5=1 6=36864 8=2 9=1\nConvolution              fire11/expand3x3         1 1 fire11/squeeze1x1_fire11/relu_squeeze1x1_splitncnn_0 fire11/expand3x3_fire11/relu_expand3x3 0=384 1=3 4=1 5=1 6=331776 8=2 9=1\nConcat                   fire11/concat            2 1 fire11/expand1x1_fire11/relu_expand1x1 fire11/expand3x3_fire11/relu_expand3x3 fire11/concat\nSplit                    splitncnn_14             1 4 fire11/concat fire11/concat_splitncnn_0 fire11/concat_splitncnn_1 fire11/concat_splitncnn_2 fire11/concat_splitncnn_3\nConvolution              conv12_1                 1 1 fire11/concat_splitncnn_3 conv12_1_conv12_1/relu 0=128 1=1 5=1 6=98304 8=102 9=1\nConvolution              conv12_2                 1 1 conv12_1_conv12_1/relu 
conv12_2_conv12_2/relu 0=256 1=3 3=2 4=1 5=1 6=294912 8=2 9=1\nSplit                    splitncnn_15             1 4 conv12_2_conv12_2/relu conv12_2_conv12_2/relu_splitncnn_0 conv12_2_conv12_2/relu_splitncnn_1 conv12_2_conv12_2/relu_splitncnn_2 conv12_2_conv12_2/relu_splitncnn_3\nConvolution              conv13_1                 1 1 conv12_2_conv12_2/relu_splitncnn_3 conv13_1_conv13_1/relu 0=64 1=1 5=1 6=16384 8=102 9=1\nConvolution              conv13_2                 1 1 conv13_1_conv13_1/relu conv13_2_conv13_2/relu 0=128 1=3 3=2 4=1 5=1 6=73728 8=2 9=1\nSplit                    splitncnn_16             1 3 conv13_2_conv13_2/relu conv13_2_conv13_2/relu_splitncnn_0 conv13_2_conv13_2/relu_splitncnn_1 conv13_2_conv13_2/relu_splitncnn_2\nBatchNorm                fire5/bn                 1 1 fire5/concat_splitncnn_0 fire5/normal_fire5/scale 0=256\nSplit                    splitncnn_17             1 3 fire5/normal_fire5/scale fire5/normal_fire5/scale_splitncnn_0 fire5/normal_fire5/scale_splitncnn_1 fire5/normal_fire5/scale_splitncnn_2\nConvolution              fire5_mbox_loc           1 1 fire5/normal_fire5/scale_splitncnn_2 fire5_mbox_loc 0=16 1=3 4=1 5=1 6=36864 8=2\nPermute                  fire5_mbox_loc_perm      1 1 fire5_mbox_loc fire5_mbox_loc_perm 0=3\nFlatten                  fire5_mbox_loc_flat      1 1 fire5_mbox_loc_perm fire5_mbox_loc_flat\nConvolution              fire5_mbox_conf          1 1 fire5/normal_fire5/scale_splitncnn_1 fire5_mbox_conf 0=84 1=3 4=1 5=1 6=193536 8=2\nPermute                  fire5_mbox_conf_perm     1 1 fire5_mbox_conf fire5_mbox_conf_perm 0=3\nFlatten                  fire5_mbox_conf_flat     1 1 fire5_mbox_conf_perm fire5_mbox_conf_flat\nPriorBox                 fire5_mbox_priorbox      2 1 fire5/normal_fire5/scale_splitncnn_0 data_splitncnn_5 fire5_mbox_priorbox -23300=1,21.000000 -23301=1,45.000000 -23302=1,2.000000 9=-233 10=-233 11=8.000000 12=8.000000 13=0.500000\nConvolution              fire9_mbox_loc           1 1 
fire9/concat_splitncnn_2 fire9_mbox_loc 0=24 1=3 4=1 5=1 6=110592 8=2\nPermute                  fire9_mbox_loc_perm      1 1 fire9_mbox_loc fire9_mbox_loc_perm 0=3\nFlatten                  fire9_mbox_loc_flat      1 1 fire9_mbox_loc_perm fire9_mbox_loc_flat\nConvolution              fire9_mbox_conf          1 1 fire9/concat_splitncnn_1 fire9_mbox_conf 0=126 1=3 4=1 5=1 6=580608 8=2\nPermute                  fire9_mbox_conf_perm     1 1 fire9_mbox_conf fire9_mbox_conf_perm 0=3\nFlatten                  fire9_mbox_conf_flat     1 1 fire9_mbox_conf_perm fire9_mbox_conf_flat\nPriorBox                 fire9_mbox_priorbox      2 1 fire9/concat_splitncnn_0 data_splitncnn_4 fire9_mbox_priorbox -23300=1,45.000000 -23301=1,99.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 11=16.000000 12=16.000000 13=0.500000\nConvolution              fire10_mbox_loc          1 1 fire10/concat_splitncnn_2 fire10_mbox_loc 0=24 1=3 4=1 5=1 6=165888 8=2\nPermute                  fire10_mbox_loc_perm     1 1 fire10_mbox_loc fire10_mbox_loc_perm 0=3\nFlatten                  fire10_mbox_loc_flat     1 1 fire10_mbox_loc_perm fire10_mbox_loc_flat\nConvolution              fire10_mbox_conf         1 1 fire10/concat_splitncnn_1 fire10_mbox_conf 0=126 1=3 4=1 5=1 6=870912 8=2\nPermute                  fire10_mbox_conf_perm    1 1 fire10_mbox_conf fire10_mbox_conf_perm 0=3\nFlatten                  fire10_mbox_conf_flat    1 1 fire10_mbox_conf_perm fire10_mbox_conf_flat\nPriorBox                 fire10_mbox_priorbox     2 1 fire10/concat_splitncnn_0 data_splitncnn_3 fire10_mbox_priorbox -23300=1,99.000000 -23301=1,153.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 11=32.000000 12=32.000000 13=0.500000\nConvolution              fire11_mbox_loc          1 1 fire11/concat_splitncnn_2 fire11_mbox_loc 0=24 1=3 4=1 5=1 6=165888 8=2\nPermute                  fire11_mbox_loc_perm     1 1 fire11_mbox_loc fire11_mbox_loc_perm 0=3\nFlatten                  fire11_mbox_loc_flat     1 1 fire11_mbox_loc_perm 
fire11_mbox_loc_flat\nConvolution              fire11_mbox_conf         1 1 fire11/concat_splitncnn_1 fire11_mbox_conf 0=126 1=3 4=1 5=1 6=870912 8=2\nPermute                  fire11_mbox_conf_perm    1 1 fire11_mbox_conf fire11_mbox_conf_perm 0=3\nFlatten                  fire11_mbox_conf_flat    1 1 fire11_mbox_conf_perm fire11_mbox_conf_flat\nPriorBox                 fire11_mbox_priorbox     2 1 fire11/concat_splitncnn_0 data_splitncnn_2 fire11_mbox_priorbox -23300=1,153.000000 -23301=1,207.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 11=64.000000 12=64.000000 13=0.500000\nConvolution              conv12_2_mbox_loc        1 1 conv12_2_conv12_2/relu_splitncnn_2 conv12_2_mbox_loc 0=24 1=3 4=1 5=1 6=55296 8=2\nPermute                  conv12_2_mbox_loc_perm   1 1 conv12_2_mbox_loc conv12_2_mbox_loc_perm 0=3\nFlatten                  conv12_2_mbox_loc_flat   1 1 conv12_2_mbox_loc_perm conv12_2_mbox_loc_flat\nConvolution              conv12_2_mbox_conf       1 1 conv12_2_conv12_2/relu_splitncnn_1 conv12_2_mbox_conf 0=126 1=3 4=1 5=1 6=290304 8=2\nPermute                  conv12_2_mbox_conf_perm  1 1 conv12_2_mbox_conf conv12_2_mbox_conf_perm 0=3\nFlatten                  conv12_2_mbox_conf_flat  1 1 conv12_2_mbox_conf_perm conv12_2_mbox_conf_flat\nPriorBox                 conv12_2_mbox_priorbox   2 1 conv12_2_conv12_2/relu_splitncnn_0 data_splitncnn_1 conv12_2_mbox_priorbox -23300=1,207.000000 -23301=1,261.000000 -23302=2,2.000000,3.000000 9=-233 10=-233 11=100.000000 12=100.000000 13=0.500000\nConvolution              conv13_2_mbox_loc        1 1 conv13_2_conv13_2/relu_splitncnn_2 conv13_2_mbox_loc 0=16 1=3 4=1 5=1 6=18432 8=2\nPermute                  conv13_2_mbox_loc_perm   1 1 conv13_2_mbox_loc conv13_2_mbox_loc_perm 0=3\nFlatten                  conv13_2_mbox_loc_flat   1 1 conv13_2_mbox_loc_perm conv13_2_mbox_loc_flat\nConvolution              conv13_2_mbox_conf       1 1 conv13_2_conv13_2/relu_splitncnn_1 conv13_2_mbox_conf 0=84 1=3 4=1 5=1 6=96768 
8=2\nPermute                  conv13_2_mbox_conf_perm  1 1 conv13_2_mbox_conf conv13_2_mbox_conf_perm 0=3\nFlatten                  conv13_2_mbox_conf_flat  1 1 conv13_2_mbox_conf_perm conv13_2_mbox_conf_flat\nPriorBox                 conv13_2_mbox_priorbox   2 1 conv13_2_conv13_2/relu_splitncnn_0 data_splitncnn_0 conv13_2_mbox_priorbox -23300=1,261.000000 -23301=1,315.000000 -23302=1,2.000000 9=-233 10=-233 11=300.000000 12=300.000000 13=0.500000\nConcat                   mbox_loc                 6 1 fire5_mbox_loc_flat fire9_mbox_loc_flat fire10_mbox_loc_flat fire11_mbox_loc_flat conv12_2_mbox_loc_flat conv13_2_mbox_loc_flat mbox_loc\nConcat                   mbox_conf                6 1 fire5_mbox_conf_flat fire9_mbox_conf_flat fire10_mbox_conf_flat fire11_mbox_conf_flat conv12_2_mbox_conf_flat conv13_2_mbox_conf_flat mbox_conf\nConcat                   mbox_priorbox            6 1 fire5_mbox_priorbox fire9_mbox_priorbox fire10_mbox_priorbox fire11_mbox_priorbox conv12_2_mbox_priorbox conv13_2_mbox_priorbox mbox_priorbox 0=1\nReshape                  mbox_conf_reshape        1 1 mbox_conf mbox_conf_reshape 0=21 1=-1\nSoftmax                  mbox_conf_softmax        1 1 mbox_conf_reshape mbox_conf_softmax 0=1 1=1\nFlatten                  mbox_conf_flatten        1 1 mbox_conf_softmax mbox_conf_flatten\nDetectionOutput          detection_out            3 1 mbox_loc mbox_conf_flatten mbox_priorbox output 0=21 1=0.450000 2=100 4=0.250000\n"
  },
  {
    "path": "benchmark/vgg16.param",
    "content": "7767517\n23 23\nInput                    data                     0 1 data -23330=4,3,224,224,3 0=224 1=224 2=3\nConvolution              conv1_1                  1 1 data conv1_1_relu1_1 -23330=4,3,224,224,64 0=64 1=3 4=1 5=1 6=1728 9=1\nConvolution              conv1_2                  1 1 conv1_1_relu1_1 conv1_2_relu1_2 -23330=4,3,224,224,64 0=64 1=3 4=1 5=1 6=36864 9=1\nPooling                  pool1                    1 1 conv1_2_relu1_2 pool1 -23330=4,3,112,112,64 1=2 2=2\nConvolution              conv2_1                  1 1 pool1 conv2_1_relu2_1 -23330=4,3,112,112,128 0=128 1=3 4=1 5=1 6=73728 9=1\nConvolution              conv2_2                  1 1 conv2_1_relu2_1 conv2_2_relu2_2 -23330=4,3,112,112,128 0=128 1=3 4=1 5=1 6=147456 9=1\nPooling                  pool2                    1 1 conv2_2_relu2_2 pool2 -23330=4,3,56,56,128 1=2 2=2\nConvolution              conv3_1                  1 1 pool2 conv3_1_relu3_1 -23330=4,3,56,56,256 0=256 1=3 4=1 5=1 6=294912 9=1\nConvolution              conv3_2                  1 1 conv3_1_relu3_1 conv3_2_relu3_2 -23330=4,3,56,56,256 0=256 1=3 4=1 5=1 6=589824 9=1\nConvolution              conv3_3                  1 1 conv3_2_relu3_2 conv3_3_relu3_3 -23330=4,3,56,56,256 0=256 1=3 4=1 5=1 6=589824 9=1\nPooling                  pool3                    1 1 conv3_3_relu3_3 pool3 -23330=4,3,28,28,256 1=2 2=2\nConvolution              conv4_1                  1 1 pool3 conv4_1_relu4_1 -23330=4,3,28,28,512 0=512 1=3 4=1 5=1 6=1179648 9=1\nConvolution              conv4_2                  1 1 conv4_1_relu4_1 conv4_2_relu4_2 -23330=4,3,28,28,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nConvolution              conv4_3                  1 1 conv4_2_relu4_2 conv4_3_relu4_3 -23330=4,3,28,28,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nPooling                  pool4                    1 1 conv4_3_relu4_3 pool4 -23330=4,3,14,14,512 1=2 2=2\nConvolution              conv5_1                  1 1 pool4 conv5_1_relu5_1 
-23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nConvolution              conv5_2                  1 1 conv5_1_relu5_1 conv5_2_relu5_2 -23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nConvolution              conv5_3                  1 1 conv5_2_relu5_2 conv5_3_relu5_3 -23330=4,3,14,14,512 0=512 1=3 4=1 5=1 6=2359296 9=1\nPooling                  pool5                    1 1 conv5_3_relu5_3 pool5 -23330=4,3,7,7,512 1=2 2=2\nInnerProduct             fc6                      1 1 pool5 fc6_drop6 -23330=4,1,4096,1,1 0=4096 1=1 2=102760448 9=1\nInnerProduct             fc7                      1 1 fc6_drop6 fc7_drop7 -23330=4,1,4096,1,1 0=4096 1=1 2=16777216 9=1\nInnerProduct             fc8                      1 1 fc7_drop7 fc8 -23330=4,1,1000,1,1 0=1000 1=1 2=4096000\nSoftmax                  prob                     1 1 fc8 output -23330=4,1,1000,1,1\n"
  },
  {
    "path": "benchmark/vgg16_int8.param",
    "content": "7767517\n23 23\nInput                    data                     0 1 data 0=224 1=224 2=3\nConvolution              conv1_1                  1 1 data conv1_1_relu1_1 0=64 1=3 4=1 5=1 6=1728 8=102 9=1\nConvolution              conv1_2                  1 1 conv1_1_relu1_1 conv1_2_relu1_2 0=64 1=3 4=1 5=1 6=36864 8=2 9=1\nPooling                  pool1                    1 1 conv1_2_relu1_2 pool1 1=2 2=2\nConvolution              conv2_1                  1 1 pool1 conv2_1_relu2_1 0=128 1=3 4=1 5=1 6=73728 8=102 9=1\nConvolution              conv2_2                  1 1 conv2_1_relu2_1 conv2_2_relu2_2 0=128 1=3 4=1 5=1 6=147456 8=2 9=1\nPooling                  pool2                    1 1 conv2_2_relu2_2 pool2 1=2 2=2\nConvolution              conv3_1                  1 1 pool2 conv3_1_relu3_1 0=256 1=3 4=1 5=1 6=294912 8=102 9=1\nConvolution              conv3_2                  1 1 conv3_1_relu3_1 conv3_2_relu3_2 0=256 1=3 4=1 5=1 6=589824 8=102 9=1\nConvolution              conv3_3                  1 1 conv3_2_relu3_2 conv3_3_relu3_3 0=256 1=3 4=1 5=1 6=589824 8=2 9=1\nPooling                  pool3                    1 1 conv3_3_relu3_3 pool3 1=2 2=2\nConvolution              conv4_1                  1 1 pool3 conv4_1_relu4_1 0=512 1=3 4=1 5=1 6=1179648 8=102 9=1\nConvolution              conv4_2                  1 1 conv4_1_relu4_1 conv4_2_relu4_2 0=512 1=3 4=1 5=1 6=2359296 8=102 9=1\nConvolution              conv4_3                  1 1 conv4_2_relu4_2 conv4_3_relu4_3 0=512 1=3 4=1 5=1 6=2359296 8=2 9=1\nPooling                  pool4                    1 1 conv4_3_relu4_3 pool4 1=2 2=2\nConvolution              conv5_1                  1 1 pool4 conv5_1_relu5_1 0=512 1=3 4=1 5=1 6=2359296 8=102 9=1\nConvolution              conv5_2                  1 1 conv5_1_relu5_1 conv5_2_relu5_2 0=512 1=3 4=1 5=1 6=2359296 8=102 9=1\nConvolution              conv5_3                  1 1 conv5_2_relu5_2 conv5_3_relu5_3 0=512 1=3 4=1 5=1 6=2359296 8=2 
9=1\nPooling                  pool5                    1 1 conv5_3_relu5_3 pool5 1=2 2=2\nInnerProduct             fc6                      1 1 pool5 fc6_drop6 0=4096 1=1 2=102760448 8=2 9=1\nInnerProduct             fc7                      1 1 fc6_drop6 fc7_drop7 0=4096 1=1 2=16777216 8=2 9=1\nInnerProduct             fc8                      1 1 fc7_drop7 fc8 0=1000 1=1 2=4096000 8=2\nSoftmax                  prob                     1 1 fc8 output\n"
  },
  {
    "path": "benchmark/vision_transformer.param",
    "content": "7767517\n144 192\nInput            input                    0 1 input\nMemoryData       backbone.cls_token       0 1 backbone.cls_token 0=768 1=1\nMemoryData       backbone.pos_embed       0 1 backbone.pos_embed 0=768 1=145\nConvolution      Conv_0                   1 1 input onnx::Shape_153 0=768 1=32 11=32 2=1 12=1 3=32 13=32 4=0 14=0 15=0 16=0 5=1 6=2359296\nReshape          Reshape_8                1 1 onnx::Shape_153 onnx::Transpose_161 0=-1 1=768\nPermute          Transpose_9              1 1 onnx::Transpose_161 onnx::Concat_162 0=1\nConcat           Concat_10                2 1 backbone.cls_token onnx::Concat_162 onnx::Add_163 0=0\nBinaryOp         Add_11                   2 1 onnx::Add_163 backbone.pos_embed input.1 0=0\nSplit            splitncnn_0              1 2 input.1 input.1_splitncnn_0 input.1_splitncnn_1\nLayerNorm        LayerNorm_12             1 1 input.1_splitncnn_1 qkv_input 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_1              1 3 qkv_input qkv_input_splitncnn_0 qkv_input_splitncnn_1 qkv_input_splitncnn_2\nMultiHeadAttention MultiHeadAttention_21    3 1 qkv_input_splitncnn_2 qkv_input_splitncnn_1 qkv_input_splitncnn_0 onnx::Add_174 0=768 1=12 2=589824\nBinaryOp         Add_22                   2 1 input.1_splitncnn_0 onnx::Add_174 input.4 0=0\nSplit            splitncnn_2              1 2 input.4 input.4_splitncnn_0 input.4_splitncnn_1\nLayerNorm        LayerNorm_23             1 1 input.4_splitncnn_1 mmdeploy::Gemm_176 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_24                  1 1 mmdeploy::Gemm_176 mmdeploy::Gelu_177 0=3072 1=1 2=2359296\nGELU             Gelu_25                  1 1 mmdeploy::Gelu_177 input.8 0=1\nInnerProduct     Gemm_26                  1 1 input.8 input.12 0=768 1=1 2=2359296\nBinaryOp         Add_27                   2 1 input.4_splitncnn_0 input.12 input.16 0=0\nSplit            splitncnn_3              1 2 input.16 input.16_splitncnn_0 input.16_splitncnn_1\nLayerNorm        
LayerNorm_28             1 1 input.16_splitncnn_1 qkv_input.3 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_4              1 3 qkv_input.3 qkv_input.3_splitncnn_0 qkv_input.3_splitncnn_1 qkv_input.3_splitncnn_2\nMultiHeadAttention MultiHeadAttention_37    3 1 qkv_input.3_splitncnn_2 qkv_input.3_splitncnn_1 qkv_input.3_splitncnn_0 onnx::Add_190 0=768 1=12 2=589824\nBinaryOp         Add_38                   2 1 input.16_splitncnn_0 onnx::Add_190 input.20 0=0\nSplit            splitncnn_5              1 2 input.20 input.20_splitncnn_0 input.20_splitncnn_1\nLayerNorm        LayerNorm_39             1 1 input.20_splitncnn_1 mmdeploy::Gemm_192 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_40                  1 1 mmdeploy::Gemm_192 mmdeploy::Gelu_193 0=3072 1=1 2=2359296\nGELU             Gelu_41                  1 1 mmdeploy::Gelu_193 input.24 0=1\nInnerProduct     Gemm_42                  1 1 input.24 input.28 0=768 1=1 2=2359296\nBinaryOp         Add_43                   2 1 input.20_splitncnn_0 input.28 input.32 0=0\nSplit            splitncnn_6              1 2 input.32 input.32_splitncnn_0 input.32_splitncnn_1\nLayerNorm        LayerNorm_44             1 1 input.32_splitncnn_1 qkv_input.7 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_7              1 3 qkv_input.7 qkv_input.7_splitncnn_0 qkv_input.7_splitncnn_1 qkv_input.7_splitncnn_2\nMultiHeadAttention MultiHeadAttention_53    3 1 qkv_input.7_splitncnn_2 qkv_input.7_splitncnn_1 qkv_input.7_splitncnn_0 onnx::Add_206 0=768 1=12 2=589824\nBinaryOp         Add_54                   2 1 input.32_splitncnn_0 onnx::Add_206 input.36 0=0\nSplit            splitncnn_8              1 2 input.36 input.36_splitncnn_0 input.36_splitncnn_1\nLayerNorm        LayerNorm_55             1 1 input.36_splitncnn_1 mmdeploy::Gemm_208 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_56                  1 1 mmdeploy::Gemm_208 mmdeploy::Gelu_209 0=3072 1=1 2=2359296\nGELU             Gelu_57                  1 1 
mmdeploy::Gelu_209 input.40 0=1\nInnerProduct     Gemm_58                  1 1 input.40 input.44 0=768 1=1 2=2359296\nBinaryOp         Add_59                   2 1 input.36_splitncnn_0 input.44 input.48 0=0\nSplit            splitncnn_9              1 2 input.48 input.48_splitncnn_0 input.48_splitncnn_1\nLayerNorm        LayerNorm_60             1 1 input.48_splitncnn_1 qkv_input.11 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_10             1 3 qkv_input.11 qkv_input.11_splitncnn_0 qkv_input.11_splitncnn_1 qkv_input.11_splitncnn_2\nMultiHeadAttention MultiHeadAttention_69    3 1 qkv_input.11_splitncnn_2 qkv_input.11_splitncnn_1 qkv_input.11_splitncnn_0 onnx::Add_222 0=768 1=12 2=589824\nBinaryOp         Add_70                   2 1 input.48_splitncnn_0 onnx::Add_222 input.52 0=0\nSplit            splitncnn_11             1 2 input.52 input.52_splitncnn_0 input.52_splitncnn_1\nLayerNorm        LayerNorm_71             1 1 input.52_splitncnn_1 mmdeploy::Gemm_224 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_72                  1 1 mmdeploy::Gemm_224 mmdeploy::Gelu_225 0=3072 1=1 2=2359296\nGELU             Gelu_73                  1 1 mmdeploy::Gelu_225 input.56 0=1\nInnerProduct     Gemm_74                  1 1 input.56 input.60 0=768 1=1 2=2359296\nBinaryOp         Add_75                   2 1 input.52_splitncnn_0 input.60 input.64 0=0\nSplit            splitncnn_12             1 2 input.64 input.64_splitncnn_0 input.64_splitncnn_1\nLayerNorm        LayerNorm_76             1 1 input.64_splitncnn_1 qkv_input.15 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_13             1 3 qkv_input.15 qkv_input.15_splitncnn_0 qkv_input.15_splitncnn_1 qkv_input.15_splitncnn_2\nMultiHeadAttention MultiHeadAttention_85    3 1 qkv_input.15_splitncnn_2 qkv_input.15_splitncnn_1 qkv_input.15_splitncnn_0 onnx::Add_238 0=768 1=12 2=589824\nBinaryOp         Add_86                   2 1 input.64_splitncnn_0 onnx::Add_238 input.68 0=0\nSplit            splitncnn_14         
    1 2 input.68 input.68_splitncnn_0 input.68_splitncnn_1\nLayerNorm        LayerNorm_87             1 1 input.68_splitncnn_1 mmdeploy::Gemm_240 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_88                  1 1 mmdeploy::Gemm_240 mmdeploy::Gelu_241 0=3072 1=1 2=2359296\nGELU             Gelu_89                  1 1 mmdeploy::Gelu_241 input.72 0=1\nInnerProduct     Gemm_90                  1 1 input.72 input.76 0=768 1=1 2=2359296\nBinaryOp         Add_91                   2 1 input.68_splitncnn_0 input.76 input.80 0=0\nSplit            splitncnn_15             1 2 input.80 input.80_splitncnn_0 input.80_splitncnn_1\nLayerNorm        LayerNorm_92             1 1 input.80_splitncnn_1 qkv_input.19 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_16             1 3 qkv_input.19 qkv_input.19_splitncnn_0 qkv_input.19_splitncnn_1 qkv_input.19_splitncnn_2\nMultiHeadAttention MultiHeadAttention_101   3 1 qkv_input.19_splitncnn_2 qkv_input.19_splitncnn_1 qkv_input.19_splitncnn_0 onnx::Add_254 0=768 1=12 2=589824\nBinaryOp         Add_102                  2 1 input.80_splitncnn_0 onnx::Add_254 input.84 0=0\nSplit            splitncnn_17             1 2 input.84 input.84_splitncnn_0 input.84_splitncnn_1\nLayerNorm        LayerNorm_103            1 1 input.84_splitncnn_1 mmdeploy::Gemm_256 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_104                 1 1 mmdeploy::Gemm_256 mmdeploy::Gelu_257 0=3072 1=1 2=2359296\nGELU             Gelu_105                 1 1 mmdeploy::Gelu_257 input.88 0=1\nInnerProduct     Gemm_106                 1 1 input.88 input.92 0=768 1=1 2=2359296\nBinaryOp         Add_107                  2 1 input.84_splitncnn_0 input.92 input.96 0=0\nSplit            splitncnn_18             1 2 input.96 input.96_splitncnn_0 input.96_splitncnn_1\nLayerNorm        LayerNorm_108            1 1 input.96_splitncnn_1 qkv_input.23 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_19             1 3 qkv_input.23 qkv_input.23_splitncnn_0 
qkv_input.23_splitncnn_1 qkv_input.23_splitncnn_2\nMultiHeadAttention MultiHeadAttention_117   3 1 qkv_input.23_splitncnn_2 qkv_input.23_splitncnn_1 qkv_input.23_splitncnn_0 onnx::Add_270 0=768 1=12 2=589824\nBinaryOp         Add_118                  2 1 input.96_splitncnn_0 onnx::Add_270 input.100 0=0\nSplit            splitncnn_20             1 2 input.100 input.100_splitncnn_0 input.100_splitncnn_1\nLayerNorm        LayerNorm_119            1 1 input.100_splitncnn_1 mmdeploy::Gemm_272 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_120                 1 1 mmdeploy::Gemm_272 mmdeploy::Gelu_273 0=3072 1=1 2=2359296\nGELU             Gelu_121                 1 1 mmdeploy::Gelu_273 input.104 0=1\nInnerProduct     Gemm_122                 1 1 input.104 input.108 0=768 1=1 2=2359296\nBinaryOp         Add_123                  2 1 input.100_splitncnn_0 input.108 input.112 0=0\nSplit            splitncnn_21             1 2 input.112 input.112_splitncnn_0 input.112_splitncnn_1\nLayerNorm        LayerNorm_124            1 1 input.112_splitncnn_1 qkv_input.27 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_22             1 3 qkv_input.27 qkv_input.27_splitncnn_0 qkv_input.27_splitncnn_1 qkv_input.27_splitncnn_2\nMultiHeadAttention MultiHeadAttention_133   3 1 qkv_input.27_splitncnn_2 qkv_input.27_splitncnn_1 qkv_input.27_splitncnn_0 onnx::Add_286 0=768 1=12 2=589824\nBinaryOp         Add_134                  2 1 input.112_splitncnn_0 onnx::Add_286 input.116 0=0\nSplit            splitncnn_23             1 2 input.116 input.116_splitncnn_0 input.116_splitncnn_1\nLayerNorm        LayerNorm_135            1 1 input.116_splitncnn_1 mmdeploy::Gemm_288 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_136                 1 1 mmdeploy::Gemm_288 mmdeploy::Gelu_289 0=3072 1=1 2=2359296\nGELU             Gelu_137                 1 1 mmdeploy::Gelu_289 input.120 0=1\nInnerProduct     Gemm_138                 1 1 input.120 input.124 0=768 1=1 2=2359296\nBinaryOp         Add_139      
            2 1 input.116_splitncnn_0 input.124 input.128 0=0\nSplit            splitncnn_24             1 2 input.128 input.128_splitncnn_0 input.128_splitncnn_1\nLayerNorm        LayerNorm_140            1 1 input.128_splitncnn_1 qkv_input.31 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_25             1 3 qkv_input.31 qkv_input.31_splitncnn_0 qkv_input.31_splitncnn_1 qkv_input.31_splitncnn_2\nMultiHeadAttention MultiHeadAttention_149   3 1 qkv_input.31_splitncnn_2 qkv_input.31_splitncnn_1 qkv_input.31_splitncnn_0 onnx::Add_302 0=768 1=12 2=589824\nBinaryOp         Add_150                  2 1 input.128_splitncnn_0 onnx::Add_302 input.132 0=0\nSplit            splitncnn_26             1 2 input.132 input.132_splitncnn_0 input.132_splitncnn_1\nLayerNorm        LayerNorm_151            1 1 input.132_splitncnn_1 mmdeploy::Gemm_304 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_152                 1 1 mmdeploy::Gemm_304 mmdeploy::Gelu_305 0=3072 1=1 2=2359296\nGELU             Gelu_153                 1 1 mmdeploy::Gelu_305 input.136 0=1\nInnerProduct     Gemm_154                 1 1 input.136 input.140 0=768 1=1 2=2359296\nBinaryOp         Add_155                  2 1 input.132_splitncnn_0 input.140 input.144 0=0\nSplit            splitncnn_27             1 2 input.144 input.144_splitncnn_0 input.144_splitncnn_1\nLayerNorm        LayerNorm_156            1 1 input.144_splitncnn_1 qkv_input.35 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_28             1 3 qkv_input.35 qkv_input.35_splitncnn_0 qkv_input.35_splitncnn_1 qkv_input.35_splitncnn_2\nMultiHeadAttention MultiHeadAttention_165   3 1 qkv_input.35_splitncnn_2 qkv_input.35_splitncnn_1 qkv_input.35_splitncnn_0 onnx::Add_318 0=768 1=12 2=589824\nBinaryOp         Add_166                  2 1 input.144_splitncnn_0 onnx::Add_318 input.148 0=0\nSplit            splitncnn_29             1 2 input.148 input.148_splitncnn_0 input.148_splitncnn_1\nLayerNorm        LayerNorm_167            1 1 
input.148_splitncnn_1 mmdeploy::Gemm_320 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_168                 1 1 mmdeploy::Gemm_320 mmdeploy::Gelu_321 0=3072 1=1 2=2359296\nGELU             Gelu_169                 1 1 mmdeploy::Gelu_321 input.152 0=1\nInnerProduct     Gemm_170                 1 1 input.152 input.156 0=768 1=1 2=2359296\nBinaryOp         Add_171                  2 1 input.148_splitncnn_0 input.156 input.160 0=0\nSplit            splitncnn_30             1 2 input.160 input.160_splitncnn_0 input.160_splitncnn_1\nLayerNorm        LayerNorm_172            1 1 input.160_splitncnn_1 qkv_input.39 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_31             1 3 qkv_input.39 qkv_input.39_splitncnn_0 qkv_input.39_splitncnn_1 qkv_input.39_splitncnn_2\nMultiHeadAttention MultiHeadAttention_181   3 1 qkv_input.39_splitncnn_2 qkv_input.39_splitncnn_1 qkv_input.39_splitncnn_0 onnx::Add_334 0=768 1=12 2=589824\nBinaryOp         Add_182                  2 1 input.160_splitncnn_0 onnx::Add_334 input.164 0=0\nSplit            splitncnn_32             1 2 input.164 input.164_splitncnn_0 input.164_splitncnn_1\nLayerNorm        LayerNorm_183            1 1 input.164_splitncnn_1 mmdeploy::Gemm_336 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_184                 1 1 mmdeploy::Gemm_336 mmdeploy::Gelu_337 0=3072 1=1 2=2359296\nGELU             Gelu_185                 1 1 mmdeploy::Gelu_337 input.168 0=1\nInnerProduct     Gemm_186                 1 1 input.168 input.172 0=768 1=1 2=2359296\nBinaryOp         Add_187                  2 1 input.164_splitncnn_0 input.172 input.176 0=0\nSplit            splitncnn_33             1 2 input.176 input.176_splitncnn_0 input.176_splitncnn_1\nLayerNorm        LayerNorm_188            1 1 input.176_splitncnn_1 qkv_input.43 0=768 1=1.000000e-06 2=1\nSplit            splitncnn_34             1 3 qkv_input.43 qkv_input.43_splitncnn_0 qkv_input.43_splitncnn_1 qkv_input.43_splitncnn_2\nMultiHeadAttention MultiHeadAttention_197   3 
1 qkv_input.43_splitncnn_2 qkv_input.43_splitncnn_1 qkv_input.43_splitncnn_0 onnx::Add_350 0=768 1=12 2=589824\nBinaryOp         Add_198                  2 1 input.176_splitncnn_0 onnx::Add_350 input.180 0=0\nSplit            splitncnn_35             1 2 input.180 input.180_splitncnn_0 input.180_splitncnn_1\nLayerNorm        LayerNorm_199            1 1 input.180_splitncnn_1 mmdeploy::Gemm_352 0=768 1=1.000000e-06 2=1\nInnerProduct     Gemm_200                 1 1 mmdeploy::Gemm_352 mmdeploy::Gelu_353 0=3072 1=1 2=2359296\nGELU             Gelu_201                 1 1 mmdeploy::Gelu_353 input.184 0=1\nInnerProduct     Gemm_202                 1 1 input.184 input.188 0=768 1=1 2=2359296\nBinaryOp         Add_203                  2 1 input.180_splitncnn_0 input.188 input.192 0=0\nLayerNorm        LayerNorm_204            1 1 input.192 onnx::Gather_357 0=768 1=1.000000e-06 2=1\nCrop             Gather_206               1 1 onnx::Gather_357 mmdeploy::Gemm_359 -23309=1,0 -23310=1,1 -23311=1,0\nInnerProduct     Gemm_207                 1 1 mmdeploy::Gemm_359 cls_score 0=1000 1=1 2=768000\nSoftmax          Softmax_208              1 1 cls_score output 0=0 1=1\n"
  },
  {
    "path": "benchmark/yolo-fastest-1.1.param",
    "content": "7767517\n131 154\nInput                    data                     0 1 data -23330=4,3,320,320,3 0=320 1=320 2=3\nConvolution              0_22                     1 1 data 0_22_bn_leaky -23330=4,3,160,160,8 0=8 1=3 3=2 4=1 5=1 6=216 9=2 -23310=1,1.000000e-01\nConvolution              1_31                     1 1 0_22_bn_leaky 1_31_bn_leaky -23330=4,3,160,160,8 0=8 1=1 5=1 6=64 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     2_39                     1 1 1_31_bn_leaky 2_39_bn_leaky -23330=4,3,160,160,8 0=8 1=3 4=1 5=1 6=72 7=8 9=2 -23310=1,1.000000e-01\nConvolution              3_48                     1 1 2_39_bn_leaky 3_48_bn -23330=4,3,160,160,4 0=4 1=1 5=1 6=32\nSplit                    3_48_bn_split            1 2 3_48_bn 3_48_bn_split_0 3_48_bn_split_1 -23330=8,3,160,160,4,3,160,160,4\nConvolution              4_57                     1 1 3_48_bn_split_0 4_57_bn_leaky -23330=4,3,160,160,8 0=8 1=1 5=1 6=32 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     5_65                     1 1 4_57_bn_leaky 5_65_bn_leaky -23330=4,3,160,160,8 0=8 1=3 4=1 5=1 6=72 7=8 9=2 -23310=1,1.000000e-01\nConvolution              6_74                     1 1 5_65_bn_leaky 6_74_bn -23330=4,3,160,160,4 0=4 1=1 5=1 6=32\nEltwise                  8_86                     2 1 6_74_bn 3_48_bn_split_1 8_86 -23330=4,3,160,160,4 0=1\nConvolution              9_90                     1 1 8_86 9_90_bn_leaky -23330=4,3,160,160,24 0=24 1=1 5=1 6=96 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     10_98                    1 1 9_90_bn_leaky 10_98_bn_leaky -23330=4,3,80,80,24 0=24 1=3 3=2 4=1 5=1 6=216 7=24 9=2 -23310=1,1.000000e-01\nConvolution              11_107                   1 1 10_98_bn_leaky 11_107_bn -23330=4,3,80,80,8 0=8 1=1 5=1 6=192\nSplit                    11_107_bn_split          1 2 11_107_bn 11_107_bn_split_0 11_107_bn_split_1 -23330=8,3,80,80,8,3,80,80,8\nConvolution              12_116                   1 1 11_107_bn_split_0 12_116_bn_leaky 
-23330=4,3,80,80,32 0=32 1=1 5=1 6=256 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     13_124                   1 1 12_116_bn_leaky 13_124_bn_leaky -23330=4,3,80,80,32 0=32 1=3 4=1 5=1 6=288 7=32 9=2 -23310=1,1.000000e-01\nConvolution              14_133                   1 1 13_124_bn_leaky 14_133_bn -23330=4,3,80,80,8 0=8 1=1 5=1 6=256\nEltwise                  16_145                   2 1 14_133_bn 11_107_bn_split_1 16_145 -23330=4,3,80,80,8 0=1\nSplit                    16_145_split             1 2 16_145 16_145_split_0 16_145_split_1 -23330=8,3,80,80,8,3,80,80,8\nConvolution              17_149                   1 1 16_145_split_0 17_149_bn_leaky -23330=4,3,80,80,32 0=32 1=1 5=1 6=256 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     18_157                   1 1 17_149_bn_leaky 18_157_bn_leaky -23330=4,3,80,80,32 0=32 1=3 4=1 5=1 6=288 7=32 9=2 -23310=1,1.000000e-01\nConvolution              19_166                   1 1 18_157_bn_leaky 19_166_bn -23330=4,3,80,80,8 0=8 1=1 5=1 6=256\nEltwise                  21_179                   2 1 19_166_bn 16_145_split_1 21_179 -23330=4,3,80,80,8 0=1\nConvolution              22_183                   1 1 21_179 22_183_bn_leaky -23330=4,3,80,80,32 0=32 1=1 5=1 6=256 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     23_191                   1 1 22_183_bn_leaky 23_191_bn_leaky -23330=4,3,40,40,32 0=32 1=3 3=2 4=1 5=1 6=288 7=32 9=2 -23310=1,1.000000e-01\nConvolution              24_200                   1 1 23_191_bn_leaky 24_200_bn -23330=4,3,40,40,8 0=8 1=1 5=1 6=256\nSplit                    24_200_bn_split          1 2 24_200_bn 24_200_bn_split_0 24_200_bn_split_1 -23330=8,3,40,40,8,3,40,40,8\nConvolution              25_209                   1 1 24_200_bn_split_0 25_209_bn_leaky -23330=4,3,40,40,48 0=48 1=1 5=1 6=384 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     26_217                   1 1 25_209_bn_leaky 26_217_bn_leaky -23330=4,3,40,40,48 0=48 1=3 4=1 5=1 6=432 7=48 9=2 
-23310=1,1.000000e-01\nConvolution              27_226                   1 1 26_217_bn_leaky 27_226_bn -23330=4,3,40,40,8 0=8 1=1 5=1 6=384\nEltwise                  29_238                   2 1 27_226_bn 24_200_bn_split_1 29_238 -23330=4,3,40,40,8 0=1\nSplit                    29_238_split             1 2 29_238 29_238_split_0 29_238_split_1 -23330=8,3,40,40,8,3,40,40,8\nConvolution              30_242                   1 1 29_238_split_0 30_242_bn_leaky -23330=4,3,40,40,48 0=48 1=1 5=1 6=384 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     31_250                   1 1 30_242_bn_leaky 31_250_bn_leaky -23330=4,3,40,40,48 0=48 1=3 4=1 5=1 6=432 7=48 9=2 -23310=1,1.000000e-01\nConvolution              32_259                   1 1 31_250_bn_leaky 32_259_bn -23330=4,3,40,40,8 0=8 1=1 5=1 6=384\nEltwise                  34_273                   2 1 32_259_bn 29_238_split_1 34_273 -23330=4,3,40,40,8 0=1\nConvolution              35_277                   1 1 34_273 35_277_bn_leaky -23330=4,3,40,40,48 0=48 1=1 5=1 6=384 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     36_285                   1 1 35_277_bn_leaky 36_285_bn_leaky -23330=4,3,40,40,48 0=48 1=3 4=1 5=1 6=432 7=48 9=2 -23310=1,1.000000e-01\nConvolution              37_294                   1 1 36_285_bn_leaky 37_294_bn -23330=4,3,40,40,16 0=16 1=1 5=1 6=768\nSplit                    37_294_bn_split          1 2 37_294_bn 37_294_bn_split_0 37_294_bn_split_1 -23330=8,3,40,40,16,3,40,40,16\nConvolution              38_303                   1 1 37_294_bn_split_0 38_303_bn_leaky -23330=4,3,40,40,96 0=96 1=1 5=1 6=1536 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     39_311                   1 1 38_303_bn_leaky 39_311_bn_leaky -23330=4,3,40,40,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              40_320                   1 1 39_311_bn_leaky 40_320_bn -23330=4,3,40,40,16 0=16 1=1 5=1 6=1536\nEltwise                  42_332                   2 1 40_320_bn 37_294_bn_split_1 42_332 
-23330=4,3,40,40,16 0=1\nSplit                    42_332_split             1 2 42_332 42_332_split_0 42_332_split_1 -23330=8,3,40,40,16,3,40,40,16\nConvolution              43_336                   1 1 42_332_split_0 43_336_bn_leaky -23330=4,3,40,40,96 0=96 1=1 5=1 6=1536 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     44_344                   1 1 43_336_bn_leaky 44_344_bn_leaky -23330=4,3,40,40,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              45_353                   1 1 44_344_bn_leaky 45_353_bn -23330=4,3,40,40,16 0=16 1=1 5=1 6=1536\nEltwise                  47_365                   2 1 45_353_bn 42_332_split_1 47_365 -23330=4,3,40,40,16 0=1\nSplit                    47_365_split             1 2 47_365 47_365_split_0 47_365_split_1 -23330=8,3,40,40,16,3,40,40,16\nConvolution              48_369                   1 1 47_365_split_0 48_369_bn_leaky -23330=4,3,40,40,96 0=96 1=1 5=1 6=1536 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     49_377                   1 1 48_369_bn_leaky 49_377_bn_leaky -23330=4,3,40,40,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              50_386                   1 1 49_377_bn_leaky 50_386_bn -23330=4,3,40,40,16 0=16 1=1 5=1 6=1536\nEltwise                  52_399                   2 1 50_386_bn 47_365_split_1 52_399 -23330=4,3,40,40,16 0=1\nSplit                    52_399_split             1 2 52_399 52_399_split_0 52_399_split_1 -23330=8,3,40,40,16,3,40,40,16\nConvolution              53_403                   1 1 52_399_split_0 53_403_bn_leaky -23330=4,3,40,40,96 0=96 1=1 5=1 6=1536 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     54_411                   1 1 53_403_bn_leaky 54_411_bn_leaky -23330=4,3,40,40,96 0=96 1=3 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              55_420                   1 1 54_411_bn_leaky 55_420_bn -23330=4,3,40,40,16 0=16 1=1 5=1 6=1536\nEltwise                  57_433                   2 1 55_420_bn 52_399_split_1 
57_433 -23330=4,3,40,40,16 0=1\nConvolution              58_437                   1 1 57_433 58_437_bn_leaky -23330=4,3,40,40,96 0=96 1=1 5=1 6=1536 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     59_445                   1 1 58_437_bn_leaky 59_445_bn_leaky -23330=4,3,20,20,96 0=96 1=3 3=2 4=1 5=1 6=864 7=96 9=2 -23310=1,1.000000e-01\nConvolution              60_454                   1 1 59_445_bn_leaky 60_454_bn -23330=4,3,20,20,24 0=24 1=1 5=1 6=2304\nSplit                    60_454_bn_split          1 2 60_454_bn 60_454_bn_split_0 60_454_bn_split_1 -23330=8,3,20,20,24,3,20,20,24\nConvolution              61_463                   1 1 60_454_bn_split_0 61_463_bn_leaky -23330=4,3,20,20,136 0=136 1=1 5=1 6=3264 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     62_471                   1 1 61_463_bn_leaky 62_471_bn_leaky -23330=4,3,20,20,136 0=136 1=3 4=1 5=1 6=1224 7=136 9=2 -23310=1,1.000000e-01\nConvolution              63_480                   1 1 62_471_bn_leaky 63_480_bn -23330=4,3,20,20,24 0=24 1=1 5=1 6=3264\nEltwise                  65_492                   2 1 63_480_bn 60_454_bn_split_1 65_492 -23330=4,3,20,20,24 0=1\nSplit                    65_492_split             1 2 65_492 65_492_split_0 65_492_split_1 -23330=8,3,20,20,24,3,20,20,24\nConvolution              66_496                   1 1 65_492_split_0 66_496_bn_leaky -23330=4,3,20,20,136 0=136 1=1 5=1 6=3264 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     67_504                   1 1 66_496_bn_leaky 67_504_bn_leaky -23330=4,3,20,20,136 0=136 1=3 4=1 5=1 6=1224 7=136 9=2 -23310=1,1.000000e-01\nConvolution              68_513                   1 1 67_504_bn_leaky 68_513_bn -23330=4,3,20,20,24 0=24 1=1 5=1 6=3264\nEltwise                  70_526                   2 1 68_513_bn 65_492_split_1 70_526 -23330=4,3,20,20,24 0=1\nSplit                    70_526_split             1 2 70_526 70_526_split_0 70_526_split_1 -23330=8,3,20,20,24,3,20,20,24\nConvolution              71_530                   1 
1 70_526_split_0 71_530_bn_leaky -23330=4,3,20,20,136 0=136 1=1 5=1 6=3264 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     72_538                   1 1 71_530_bn_leaky 72_538_bn_leaky -23330=4,3,20,20,136 0=136 1=3 4=1 5=1 6=1224 7=136 9=2 -23310=1,1.000000e-01\nConvolution              73_547                   1 1 72_538_bn_leaky 73_547_bn -23330=4,3,20,20,24 0=24 1=1 5=1 6=3264\nEltwise                  75_559                   2 1 73_547_bn 70_526_split_1 75_559 -23330=4,3,20,20,24 0=1\nSplit                    75_559_split             1 2 75_559 75_559_split_0 75_559_split_1 -23330=8,3,20,20,24,3,20,20,24\nConvolution              76_563                   1 1 75_559_split_0 76_563_bn_leaky -23330=4,3,20,20,136 0=136 1=1 5=1 6=3264 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     77_571                   1 1 76_563_bn_leaky 77_571_bn_leaky -23330=4,3,20,20,136 0=136 1=3 4=1 5=1 6=1224 7=136 9=2 -23310=1,1.000000e-01\nConvolution              78_580                   1 1 77_571_bn_leaky 78_580_bn -23330=4,3,20,20,24 0=24 1=1 5=1 6=3264\nEltwise                  80_593                   2 1 78_580_bn 75_559_split_1 80_593 -23330=4,3,20,20,24 0=1\nSplit                    80_593_split             1 2 80_593 80_593_split_0 80_593_split_1 -23330=8,3,20,20,24,3,20,20,24\nConvolution              81_597                   1 1 80_593_split_0 81_597_bn_leaky -23330=4,3,20,20,136 0=136 1=1 5=1 6=3264 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     82_605                   1 1 81_597_bn_leaky 82_605_bn_leaky -23330=4,3,10,10,136 0=136 1=3 3=2 4=1 5=1 6=1224 7=136 9=2 -23310=1,1.000000e-01\nConvolution              83_615                   1 1 82_605_bn_leaky 83_615_bn -23330=4,3,10,10,48 0=48 1=1 5=1 6=6528\nSplit                    83_615_bn_split          1 2 83_615_bn 83_615_bn_split_0 83_615_bn_split_1 -23330=8,3,10,10,48,3,10,10,48\nConvolution              84_624                   1 1 83_615_bn_split_0 84_624_bn_leaky -23330=4,3,10,10,224 0=224 1=1 5=1 
6=10752 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     85_632                   1 1 84_624_bn_leaky 85_632_bn_leaky -23330=4,3,10,10,224 0=224 1=3 4=1 5=1 6=2016 7=224 9=2 -23310=1,1.000000e-01\nConvolution              86_641                   1 1 85_632_bn_leaky 86_641_bn -23330=4,3,10,10,48 0=48 1=1 5=1 6=10752\nEltwise                  88_653                   2 1 86_641_bn 83_615_bn_split_1 88_653 -23330=4,3,10,10,48 0=1\nSplit                    88_653_split             1 2 88_653 88_653_split_0 88_653_split_1 -23330=8,3,10,10,48,3,10,10,48\nConvolution              89_657                   1 1 88_653_split_0 89_657_bn_leaky -23330=4,3,10,10,224 0=224 1=1 5=1 6=10752 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     90_665                   1 1 89_657_bn_leaky 90_665_bn_leaky -23330=4,3,10,10,224 0=224 1=3 4=1 5=1 6=2016 7=224 9=2 -23310=1,1.000000e-01\nConvolution              91_674                   1 1 90_665_bn_leaky 91_674_bn -23330=4,3,10,10,48 0=48 1=1 5=1 6=10752\nEltwise                  93_686                   2 1 91_674_bn 88_653_split_1 93_686 -23330=4,3,10,10,48 0=1\nSplit                    93_686_split             1 2 93_686 93_686_split_0 93_686_split_1 -23330=8,3,10,10,48,3,10,10,48\nConvolution              94_690                   1 1 93_686_split_0 94_690_bn_leaky -23330=4,3,10,10,224 0=224 1=1 5=1 6=10752 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     95_698                   1 1 94_690_bn_leaky 95_698_bn_leaky -23330=4,3,10,10,224 0=224 1=3 4=1 5=1 6=2016 7=224 9=2 -23310=1,1.000000e-01\nConvolution              96_707                   1 1 95_698_bn_leaky 96_707_bn -23330=4,3,10,10,48 0=48 1=1 5=1 6=10752\nEltwise                  98_719                   2 1 96_707_bn 93_686_split_1 98_719 -23330=4,3,10,10,48 0=1\nSplit                    98_719_split             1 2 98_719 98_719_split_0 98_719_split_1 -23330=8,3,10,10,48,3,10,10,48\nConvolution              99_723                   1 1 98_719_split_0 99_723_bn_leaky 
-23330=4,3,10,10,224 0=224 1=1 5=1 6=10752 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     100_731                  1 1 99_723_bn_leaky 100_731_bn_leaky -23330=4,3,10,10,224 0=224 1=3 4=1 5=1 6=2016 7=224 9=2 -23310=1,1.000000e-01\nConvolution              101_740                  1 1 100_731_bn_leaky 101_740_bn -23330=4,3,10,10,48 0=48 1=1 5=1 6=10752\nEltwise                  103_752                  2 1 101_740_bn 98_719_split_1 103_752 -23330=4,3,10,10,48 0=1\nSplit                    103_752_split            1 2 103_752 103_752_split_0 103_752_split_1 -23330=8,3,10,10,48,3,10,10,48\nConvolution              104_756                  1 1 103_752_split_0 104_756_bn_leaky -23330=4,3,10,10,224 0=224 1=1 5=1 6=10752 9=2 -23310=1,1.000000e-01\nConvolutionDepthWise     105_764                  1 1 104_756_bn_leaky 105_764_bn_leaky -23330=4,3,10,10,224 0=224 1=3 4=1 5=1 6=2016 7=224 9=2 -23310=1,1.000000e-01\nConvolution              106_773                  1 1 105_764_bn_leaky 106_773_bn -23330=4,3,10,10,48 0=48 1=1 5=1 6=10752\nEltwise                  108_784                  2 1 106_773_bn 103_752_split_1 108_784 -23330=4,3,10,10,48 0=1\nSplit                    108_784_split            1 4 108_784 108_784_split_0 108_784_split_1 108_784_split_2 108_784_split_3 -23330=16,3,10,10,48,3,10,10,48,3,10,10,48,3,10,10,48\nPooling                  109_788                  1 1 108_784_split_0 109_788 -23330=4,3,10,10,48 1=3 3=1 5=1\nPooling                  111_795                  1 1 108_784_split_1 111_795 -23330=4,3,10,10,48 1=5 3=2 5=1\nPooling                  113_802                  1 1 108_784_split_2 113_802 -23330=4,3,10,10,48 1=9 3=4 5=1\nConcat                   114_806                  4 1 113_802 111_795 109_788 108_784_split_3 114_806 -23330=4,3,10,10,192\nConvolution              115_811                  1 1 114_806 115_811_bn_leaky -23330=4,3,10,10,96 0=96 1=1 5=1 6=18432 9=2 -23310=1,1.000000e-01\nSplit                    115_811_bn_leaky_split   1 2 
115_811_bn_leaky 115_811_bn_leaky_split_0 115_811_bn_leaky_split_1 -23330=8,3,10,10,96,3,10,10,96\nConvolutionDepthWise     116_819                  1 1 115_811_bn_leaky_split_0 116_819_bn_leaky -23330=4,3,10,10,96 0=96 1=5 4=2 5=1 6=2400 7=96 9=2 -23310=1,1.000000e-01\nConvolution              117_828                  1 1 116_819_bn_leaky 117_828_bn -23330=4,3,10,10,96 0=96 1=1 5=1 6=9216\nConvolutionDepthWise     118_836                  1 1 117_828_bn 118_836_bn_leaky -23330=4,3,10,10,96 0=96 1=5 4=2 5=1 6=2400 7=96 9=2 -23310=1,1.000000e-01\nConvolution              119_845                  1 1 118_836_bn_leaky 119_845_bn -23330=4,3,10,10,96 0=96 1=1 5=1 6=9216\nConvolution              120_854                  1 1 119_845_bn 120_854 -23330=4,3,10,10,255 0=255 1=1 5=1 6=24480\nInterp                   123_882                  1 1 115_811_bn_leaky_split_1 123_882 -23330=4,3,20,20,96 0=1 1=2.000000e+00 2=2.000000e+00\nConcat                   124_885                  2 1 123_882 80_593_split_1 124_885 -23330=4,3,20,20,120\nConvolutionDepthWise     125_888                  1 1 124_885 125_888_bn_leaky -23330=4,3,20,20,120 0=120 1=5 4=2 5=1 6=3000 7=120 9=2 -23310=1,1.000000e-01\nConvolution              126_897                  1 1 125_888_bn_leaky 126_897_bn -23330=4,3,20,20,120 0=120 1=1 5=1 6=14400\nConvolutionDepthWise     127_905                  1 1 126_897_bn 127_905_bn_leaky -23330=4,3,20,20,120 0=120 1=5 4=2 5=1 6=3000 7=120 9=2 -23310=1,1.000000e-01\nConvolution              128_914                  1 1 127_905_bn_leaky 128_914_bn -23330=4,3,20,20,120 0=120 1=1 5=1 6=14400\nConvolution              129_922                  1 1 128_914_bn 129_922 -23330=4,3,20,20,255 0=255 1=1 5=1 6=30600\nYolov3DetectionOutput    detection_out            2 1 120_854 129_922 output -23330=4,2,6,1431,1 0=80 1=3 2=5.500000e-01 
-23304=12,1.200000e+01,1.800000e+01,3.700000e+01,4.900000e+01,5.200000e+01,1.320000e+02,1.150000e+02,7.300000e+01,1.190000e+02,1.990000e+02,2.420000e+02,2.380000e+02 -23305=6,1077936128,1082130432,1084227584,0,1065353216,1073741824 -23306=2,3.200000e+01,1.600000e+01\n"
  },
  {
    "path": "benchmark/yolo-fastestv2.param",
    "content": "7767517\n144 166\nInput                    input.1                  0 1 input.1 -23330=4,3,352,352,3 0=352 1=352 2=3\nConvolution              Conv_0                   1 1 input.1 447 -23330=4,3,176,176,24 0=24 1=3 3=2 4=1 5=1 6=648 9=1\nPooling                  MaxPool_2                1 1 447 448 -23330=4,3,88,88,24 1=3 2=2 3=1 5=1\nSplit                    splitncnn_0              1 2 448 448_splitncnn_0 448_splitncnn_1 -23330=8,3,88,88,24,3,88,88,24\nConvolutionDepthWise     Conv_3                   1 1 448_splitncnn_1 800 -23330=4,3,44,44,24 0=24 1=3 3=2 4=1 5=1 6=216 7=24\nConvolution              Conv_4                   1 1 800 453 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConvolution              Conv_6                   1 1 448_splitncnn_0 456 -23330=4,3,88,88,24 0=24 1=1 5=1 6=576 9=1\nConvolutionDepthWise     Conv_8                   1 1 456 809 -23330=4,3,44,44,24 0=24 1=3 3=2 4=1 5=1 6=216 7=24\nConvolution              Conv_9                   1 1 809 461 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConcat                   Concat_11                2 1 453 461 462 -23330=4,3,44,44,48\nShuffleChannel           Reshape_16               1 1 462 467 -23330=4,3,44,44,48 0=2 1=1\nSlice                    Gather_20                1 2 467 469 471 -23330=8,3,44,44,24,3,44,44,24 -23300=2,-233,-233\nConvolution              Conv_21                  1 1 471 474 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConvolutionDepthWise     Conv_23                  1 1 474 818 -23330=4,3,44,44,24 0=24 1=3 4=1 5=1 6=216 7=24\nConvolution              Conv_24                  1 1 818 479 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConcat                   Concat_26                2 1 469 479 480 -23330=4,3,44,44,48\nShuffleChannel           Reshape_31               1 1 480 485 -23330=4,3,44,44,48 0=2 1=1\nSlice                    Gather_35                1 2 485 487 489 -23330=8,3,44,44,24,3,44,44,24 -23300=2,-233,-233\nConvolution              Conv_36        
          1 1 489 492 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConvolutionDepthWise     Conv_38                  1 1 492 827 -23330=4,3,44,44,24 0=24 1=3 4=1 5=1 6=216 7=24\nConvolution              Conv_39                  1 1 827 497 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConcat                   Concat_41                2 1 487 497 498 -23330=4,3,44,44,48\nShuffleChannel           Reshape_46               1 1 498 503 -23330=4,3,44,44,48 0=2 1=1\nSlice                    Gather_50                1 2 503 505 507 -23330=8,3,44,44,24,3,44,44,24 -23300=2,-233,-233\nConvolution              Conv_51                  1 1 507 510 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConvolutionDepthWise     Conv_53                  1 1 510 836 -23330=4,3,44,44,24 0=24 1=3 4=1 5=1 6=216 7=24\nConvolution              Conv_54                  1 1 836 515 -23330=4,3,44,44,24 0=24 1=1 5=1 6=576 9=1\nConcat                   Concat_56                2 1 505 515 516 -23330=4,3,44,44,48\nSplit                    splitncnn_1              1 2 516 516_splitncnn_0 516_splitncnn_1 -23330=8,3,44,44,48,3,44,44,48\nConvolutionDepthWise     Conv_57                  1 1 516_splitncnn_1 842 -23330=4,3,22,22,48 0=48 1=3 3=2 4=1 5=1 6=432 7=48\nConvolution              Conv_58                  1 1 842 521 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolution              Conv_60                  1 1 516_splitncnn_0 524 -23330=4,3,44,44,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_62                  1 1 524 851 -23330=4,3,22,22,48 0=48 1=3 3=2 4=1 5=1 6=432 7=48\nConvolution              Conv_63                  1 1 851 529 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_65                2 1 521 529 530 -23330=4,3,22,22,96\nShuffleChannel           Reshape_70               1 1 530 535 -23330=4,3,22,22,96 0=2 1=1\nSlice                    Gather_74                1 2 535 537 539 -23330=8,3,22,22,48,3,22,22,48 -23300=2,-233,-233\nConvolution         
     Conv_75                  1 1 539 542 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_77                  1 1 542 860 -23330=4,3,22,22,48 0=48 1=3 4=1 5=1 6=432 7=48\nConvolution              Conv_78                  1 1 860 547 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_80                2 1 537 547 548 -23330=4,3,22,22,96\nShuffleChannel           Reshape_85               1 1 548 553 -23330=4,3,22,22,96 0=2 1=1\nSlice                    Gather_89                1 2 553 555 557 -23330=8,3,22,22,48,3,22,22,48 -23300=2,-233,-233\nConvolution              Conv_90                  1 1 557 560 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_92                  1 1 560 869 -23330=4,3,22,22,48 0=48 1=3 4=1 5=1 6=432 7=48\nConvolution              Conv_93                  1 1 869 565 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_95                2 1 555 565 566 -23330=4,3,22,22,96\nShuffleChannel           Reshape_100              1 1 566 571 -23330=4,3,22,22,96 0=2 1=1\nSlice                    Gather_104               1 2 571 573 575 -23330=8,3,22,22,48,3,22,22,48 -23300=2,-233,-233\nConvolution              Conv_105                 1 1 575 578 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_107                 1 1 578 878 -23330=4,3,22,22,48 0=48 1=3 4=1 5=1 6=432 7=48\nConvolution              Conv_108                 1 1 878 583 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_110               2 1 573 583 584 -23330=4,3,22,22,96\nShuffleChannel           Reshape_115              1 1 584 589 -23330=4,3,22,22,96 0=2 1=1\nSlice                    Gather_119               1 2 589 591 593 -23330=8,3,22,22,48,3,22,22,48 -23300=2,-233,-233\nConvolution              Conv_120                 1 1 593 596 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_122                 1 
1 596 887 -23330=4,3,22,22,48 0=48 1=3 4=1 5=1 6=432 7=48\nConvolution              Conv_123                 1 1 887 601 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_125               2 1 591 601 602 -23330=4,3,22,22,96\nShuffleChannel           Reshape_130              1 1 602 607 -23330=4,3,22,22,96 0=2 1=1\nSlice                    Gather_134               1 2 607 609 611 -23330=8,3,22,22,48,3,22,22,48 -23300=2,-233,-233\nConvolution              Conv_135                 1 1 611 614 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_137                 1 1 614 896 -23330=4,3,22,22,48 0=48 1=3 4=1 5=1 6=432 7=48\nConvolution              Conv_138                 1 1 896 619 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_140               2 1 609 619 620 -23330=4,3,22,22,96\nShuffleChannel           Reshape_145              1 1 620 625 -23330=4,3,22,22,96 0=2 1=1\nSlice                    Gather_149               1 2 625 627 629 -23330=8,3,22,22,48,3,22,22,48 -23300=2,-233,-233\nConvolution              Conv_150                 1 1 629 632 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_152                 1 1 632 905 -23330=4,3,22,22,48 0=48 1=3 4=1 5=1 6=432 7=48\nConvolution              Conv_153                 1 1 905 637 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_155               2 1 627 637 638 -23330=4,3,22,22,96\nShuffleChannel           Reshape_160              1 1 638 643 -23330=4,3,22,22,96 0=2 1=1\nSlice                    Gather_164               1 2 643 645 647 -23330=8,3,22,22,48,3,22,22,48 -23300=2,-233,-233\nConvolution              Conv_165                 1 1 647 650 -23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConvolutionDepthWise     Conv_167                 1 1 650 914 -23330=4,3,22,22,48 0=48 1=3 4=1 5=1 6=432 7=48\nConvolution              Conv_168                 1 1 914 655 
-23330=4,3,22,22,48 0=48 1=1 5=1 6=2304 9=1\nConcat                   Concat_170               2 1 645 655 656 -23330=4,3,22,22,96\nSplit                    splitncnn_2              1 3 656 656_splitncnn_0 656_splitncnn_1 656_splitncnn_2 -23330=12,3,22,22,96,3,22,22,96,3,22,22,96\nConvolutionDepthWise     Conv_171                 1 1 656_splitncnn_2 920 -23330=4,3,11,11,96 0=96 1=3 3=2 4=1 5=1 6=864 7=96\nConvolution              Conv_172                 1 1 920 661 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConvolution              Conv_174                 1 1 656_splitncnn_1 664 -23330=4,3,22,22,96 0=96 1=1 5=1 6=9216 9=1\nConvolutionDepthWise     Conv_176                 1 1 664 929 -23330=4,3,11,11,96 0=96 1=3 3=2 4=1 5=1 6=864 7=96\nConvolution              Conv_177                 1 1 929 669 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConcat                   Concat_179               2 1 661 669 670 -23330=4,3,11,11,192\nShuffleChannel           Reshape_184              1 1 670 675 -23330=4,3,11,11,192 0=2 1=1\nSlice                    Gather_188               1 2 675 677 679 -23330=8,3,11,11,96,3,11,11,96 -23300=2,-233,-233\nConvolution              Conv_189                 1 1 679 682 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConvolutionDepthWise     Conv_191                 1 1 682 938 -23330=4,3,11,11,96 0=96 1=3 4=1 5=1 6=864 7=96\nConvolution              Conv_192                 1 1 938 687 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConcat                   Concat_194               2 1 677 687 688 -23330=4,3,11,11,192\nShuffleChannel           Reshape_199              1 1 688 693 -23330=4,3,11,11,192 0=2 1=1\nSlice                    Gather_203               1 2 693 695 697 -23330=8,3,11,11,96,3,11,11,96 -23300=2,-233,-233\nConvolution              Conv_204                 1 1 697 700 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConvolutionDepthWise     Conv_206                 1 1 700 947 -23330=4,3,11,11,96 0=96 1=3 4=1 5=1 6=864 
7=96\nConvolution              Conv_207                 1 1 947 705 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConcat                   Concat_209               2 1 695 705 706 -23330=4,3,11,11,192\nShuffleChannel           Reshape_214              1 1 706 711 -23330=4,3,11,11,192 0=2 1=1\nSlice                    Gather_218               1 2 711 713 715 -23330=8,3,11,11,96,3,11,11,96 -23300=2,-233,-233\nConvolution              Conv_219                 1 1 715 718 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConvolutionDepthWise     Conv_221                 1 1 718 956 -23330=4,3,11,11,96 0=96 1=3 4=1 5=1 6=864 7=96\nConvolution              Conv_222                 1 1 956 723 -23330=4,3,11,11,96 0=96 1=1 5=1 6=9216 9=1\nConcat                   Concat_224               2 1 713 723 724 -23330=4,3,11,11,192\nSplit                    splitncnn_3              1 2 724 724_splitncnn_0 724_splitncnn_1 -23330=8,3,11,11,192,3,11,11,192\nConvolution              Conv_225                 1 1 724_splitncnn_1 727 -23330=4,3,11,11,72 0=72 1=1 5=1 6=13824 9=1\nSplit                    splitncnn_4              1 2 727 727_splitncnn_0 727_splitncnn_1 -23330=8,3,11,11,72,3,11,11,72\nConvolutionDepthWise     Conv_227                 1 1 727_splitncnn_1 730 -23330=4,3,11,11,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_229                 1 1 730 968 -23330=4,3,11,11,72 0=72 1=1 5=1 6=5184\nConvolutionDepthWise     Conv_230                 1 1 968 735 -23330=4,3,11,11,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_232                 1 1 735 974 -23330=4,3,11,11,72 0=72 1=1 5=1 6=5184\nSplit                    splitncnn_5              1 2 974 974_splitncnn_0 974_splitncnn_1 -23330=8,3,11,11,72,3,11,11,72\nConvolutionDepthWise     Conv_233                 1 1 727_splitncnn_0 740 -23330=4,3,11,11,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_235                 1 1 740 980 -23330=4,3,11,11,72 0=72 1=1 5=1 
6=5184\nConvolutionDepthWise     Conv_236                 1 1 980 745 -23330=4,3,11,11,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_238                 1 1 745 986 -23330=4,3,11,11,72 0=72 1=1 5=1 6=5184\nInterp                   Resize_240               1 1 724_splitncnn_0 752 -23330=4,3,22,22,192 0=1 1=2.000000e+00 2=2.000000e+00\nConcat                   Concat_241               2 1 752 656_splitncnn_0 753 -23330=4,3,22,22,288\nConvolution              Conv_242                 1 1 753 756 -23330=4,3,22,22,72 0=72 1=1 5=1 6=20736 9=1\nSplit                    splitncnn_6              1 2 756 756_splitncnn_0 756_splitncnn_1 -23330=8,3,22,22,72,3,22,22,72\nConvolutionDepthWise     Conv_244                 1 1 756_splitncnn_1 759 -23330=4,3,22,22,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_246                 1 1 759 995 -23330=4,3,22,22,72 0=72 1=1 5=1 6=5184\nConvolutionDepthWise     Conv_247                 1 1 995 764 -23330=4,3,22,22,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_249                 1 1 764 1001 -23330=4,3,22,22,72 0=72 1=1 5=1 6=5184\nSplit                    splitncnn_7              1 2 1001 1001_splitncnn_0 1001_splitncnn_1 -23330=8,3,22,22,72,3,22,22,72\nConvolutionDepthWise     Conv_250                 1 1 756_splitncnn_0 769 -23330=4,3,22,22,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_252                 1 1 769 1007 -23330=4,3,22,22,72 0=72 1=1 5=1 6=5184\nConvolutionDepthWise     Conv_253                 1 1 1007 774 -23330=4,3,22,22,72 0=72 1=5 4=2 5=1 6=1800 7=72 9=1\nConvolution              Conv_255                 1 1 774 1013 -23330=4,3,22,22,72 0=72 1=1 5=1 6=5184\nConvolution              Conv_256                 1 1 1013 783 -23330=4,3,22,22,12 0=12 1=1 5=1 6=864 9=4\nConvolution              Conv_257                 1 1 1001_splitncnn_1 784 -23330=4,3,22,22,3 0=3 1=1 5=1 6=216 9=4\nConvolution              Conv_258                 1 1 
1001_splitncnn_0 779 -23330=4,3,22,22,80 0=80 1=1 5=1 6=5760\nConvolution              Conv_259                 1 1 986 788 -23330=4,3,11,11,12 0=12 1=1 5=1 6=864 9=4\nConvolution              Conv_260                 1 1 974_splitncnn_1 789 -23330=4,3,11,11,3 0=3 1=1 5=1 6=216 9=4\nConvolution              Conv_261                 1 1 974_splitncnn_0 782 -23330=4,3,11,11,80 0=80 1=1 5=1 6=5760\nPermute                  Transpose_264            1 1 779 785 -23330=4,3,80,22,22 0=5\nSoftmax                  Softmax_265              1 1 785 786 -23330=4,3,80,22,22 0=2 1=1\nPermute                  Transpose_266            1 1 786 787 -23330=4,3,22,22,80 0=5\nPermute                  Transpose_269            1 1 782 790 -23330=4,3,80,11,11 0=5\nSoftmax                  Softmax_270              1 1 790 791 -23330=4,3,80,11,11 0=2 1=1\nPermute                  Transpose_271            1 1 791 792 -23330=4,3,11,11,80 0=5\nConcat                   Concat_272               3 1 783 784 787 793 -23330=4,3,22,22,95\nPermute                  Transpose_273            1 1 793 794 -23330=4,3,95,22,22 0=3\nConcat                   Concat_274               3 1 788 789 792 795 -23330=4,3,11,11,95\nPermute                  Transpose_275            1 1 795 796 -23330=4,3,95,11,11 0=3\nNoop                     output                   2 1 794 796 output\n"
  },
  {
    "path": "benchmark/yolov4-tiny.param",
    "content": "7767517\n45 53\nInput                    data                     0 1 data -23330=4,3,416,416,3 0=416 1=416 2=3\nConvolution              0_25                     1 1 data 0_25_bn_leaky -23330=4,3,208,208,32 0=32 1=3 3=2 4=1 5=1 6=864 9=2 -23310=1,1.000000e-01\nConvolution              1_33                     1 1 0_25_bn_leaky 1_33_bn_leaky -23330=4,3,104,104,64 0=64 1=3 3=2 4=1 5=1 6=18432 9=2 -23310=1,1.000000e-01\nConvolution              2_41                     1 1 1_33_bn_leaky 2_41_bn_leaky -23330=4,3,104,104,64 0=64 1=3 4=1 5=1 6=36864 9=2 -23310=1,1.000000e-01\nSplit                    2_41_bn_leaky_split      1 2 2_41_bn_leaky 2_41_bn_leaky_split_0 2_41_bn_leaky_split_1 -23330=8,3,104,104,64,3,104,104,64\nCrop                     3_49                     1 1 2_41_bn_leaky_split_0 3_49 -23330=4,3,104,104,32 2=32 3=104 4=104 5=32\nConvolution              4_54                     1 1 3_49 4_54_bn_leaky -23330=4,3,104,104,32 0=32 1=3 4=1 5=1 6=9216 9=2 -23310=1,1.000000e-01\nSplit                    4_54_bn_leaky_split      1 2 4_54_bn_leaky 4_54_bn_leaky_split_0 4_54_bn_leaky_split_1 -23330=8,3,104,104,32,3,104,104,32\nConvolution              5_62                     1 1 4_54_bn_leaky_split_0 5_62_bn_leaky -23330=4,3,104,104,32 0=32 1=3 4=1 5=1 6=9216 9=2 -23310=1,1.000000e-01\nConcat                   6_70                     2 1 5_62_bn_leaky 4_54_bn_leaky_split_1 6_70 -23330=4,3,104,104,64\nConvolution              7_73                     1 1 6_70 7_73_bn_leaky -23330=4,3,104,104,64 0=64 1=1 5=1 6=4096 9=2 -23310=1,1.000000e-01\nConcat                   8_81                     2 1 2_41_bn_leaky_split_1 7_73_bn_leaky 8_81 -23330=4,3,104,104,128\nPooling                  9_84                     1 1 8_81 9_84 -23330=4,3,52,52,128 1=2 2=2 14=1 15=1 5=1\nConvolution              10_88                    1 1 9_84 10_88_bn_leaky -23330=4,3,52,52,128 0=128 1=3 4=1 5=1 6=147456 9=2 -23310=1,1.000000e-01\nSplit                    
10_88_bn_leaky_split     1 2 10_88_bn_leaky 10_88_bn_leaky_split_0 10_88_bn_leaky_split_1 -23330=8,3,52,52,128,3,52,52,128\nCrop                     11_96                    1 1 10_88_bn_leaky_split_0 11_96 -23330=4,3,52,52,64 2=64 3=52 4=52 5=64\nConvolution              12_101                   1 1 11_96 12_101_bn_leaky -23330=4,3,52,52,64 0=64 1=3 4=1 5=1 6=36864 9=2 -23310=1,1.000000e-01\nSplit                    12_101_bn_leaky_split    1 2 12_101_bn_leaky 12_101_bn_leaky_split_0 12_101_bn_leaky_split_1 -23330=8,3,52,52,64,3,52,52,64\nConvolution              13_109                   1 1 12_101_bn_leaky_split_0 13_109_bn_leaky -23330=4,3,52,52,64 0=64 1=3 4=1 5=1 6=36864 9=2 -23310=1,1.000000e-01\nConcat                   14_117                   2 1 13_109_bn_leaky 12_101_bn_leaky_split_1 14_117 -23330=4,3,52,52,128\nConvolution              15_120                   1 1 14_117 15_120_bn_leaky -23330=4,3,52,52,128 0=128 1=1 5=1 6=16384 9=2 -23310=1,1.000000e-01\nConcat                   16_128                   2 1 10_88_bn_leaky_split_1 15_120_bn_leaky 16_128 -23330=4,3,52,52,256\nPooling                  17_131                   1 1 16_128 17_131 -23330=4,3,26,26,256 1=2 2=2 14=1 15=1 5=1\nConvolution              18_135                   1 1 17_131 18_135_bn_leaky -23330=4,3,26,26,256 0=256 1=3 4=1 5=1 6=589824 9=2 -23310=1,1.000000e-01\nSplit                    18_135_bn_leaky_split    1 2 18_135_bn_leaky 18_135_bn_leaky_split_0 18_135_bn_leaky_split_1 -23330=8,3,26,26,256,3,26,26,256\nCrop                     19_143                   1 1 18_135_bn_leaky_split_0 19_143 -23330=4,3,26,26,128 2=128 3=26 4=26 5=128\nConvolution              20_148                   1 1 19_143 20_148_bn_leaky -23330=4,3,26,26,128 0=128 1=3 4=1 5=1 6=147456 9=2 -23310=1,1.000000e-01\nSplit                    20_148_bn_leaky_split    1 2 20_148_bn_leaky 20_148_bn_leaky_split_0 20_148_bn_leaky_split_1 -23330=8,3,26,26,128,3,26,26,128\nConvolution              21_156                
   1 1 20_148_bn_leaky_split_0 21_156_bn_leaky -23330=4,3,26,26,128 0=128 1=3 4=1 5=1 6=147456 9=2 -23310=1,1.000000e-01\nConcat                   22_164                   2 1 21_156_bn_leaky 20_148_bn_leaky_split_1 22_164 -23330=4,3,26,26,256\nConvolution              23_167                   1 1 22_164 23_167_bn_leaky -23330=4,3,26,26,256 0=256 1=1 5=1 6=65536 9=2 -23310=1,1.000000e-01\nSplit                    23_167_bn_leaky_split    1 2 23_167_bn_leaky 23_167_bn_leaky_split_0 23_167_bn_leaky_split_1 -23330=8,3,26,26,256,3,26,26,256\nConcat                   24_175                   2 1 18_135_bn_leaky_split_1 23_167_bn_leaky_split_0 24_175 -23330=4,3,26,26,512\nPooling                  25_178                   1 1 24_175 25_178 -23330=4,3,13,13,512 1=2 2=2 14=1 15=1 5=1\nConvolution              26_182                   1 1 25_178 26_182_bn_leaky -23330=4,3,13,13,512 0=512 1=3 4=1 5=1 6=2359296 9=2 -23310=1,1.000000e-01\nConvolution              27_192                   1 1 26_182_bn_leaky 27_192_bn_leaky -23330=4,3,13,13,256 0=256 1=1 5=1 6=131072 9=2 -23310=1,1.000000e-01\nSplit                    27_192_bn_leaky_split    1 2 27_192_bn_leaky 27_192_bn_leaky_split_0 27_192_bn_leaky_split_1 -23330=8,3,13,13,256,3,13,13,256\nConvolution              28_200                   1 1 27_192_bn_leaky_split_0 28_200_bn_leaky -23330=4,3,13,13,512 0=512 1=3 4=1 5=1 6=1179648 9=2 -23310=1,1.000000e-01\nConvolution              29_208                   1 1 28_200_bn_leaky 29_208 -23330=4,3,13,13,255 0=255 1=1 5=1 6=130560\nConvolution              32_237                   1 1 27_192_bn_leaky_split_1 32_237_bn_leaky -23330=4,3,13,13,128 0=128 1=1 5=1 6=32768 9=2 -23310=1,1.000000e-01\nInterp                   33_245                   1 1 32_237_bn_leaky 33_245 -23330=4,3,26,26,128 0=1 1=2.000000e+00 2=2.000000e+00\nConcat                   34_248                   2 1 33_245 23_167_bn_leaky_split_1 34_248 -23330=4,3,26,26,384\nConvolution              35_251                 
  1 1 34_248 35_251_bn_leaky -23330=4,3,26,26,256 0=256 1=3 4=1 5=1 6=884736 9=2 -23310=1,1.000000e-01\nConvolution              36_259                   1 1 35_251_bn_leaky 36_259 -23330=4,3,26,26,255 0=255 1=1 5=1 6=65280\nYolov3DetectionOutput    detection_out            2 1 29_208 36_259 output -23330=4,2,6,1637,1 0=80 1=3 2=3.000001e-01 -23304=12,1.000000e+01,1.400000e+01,2.300000e+01,2.700000e+01,3.700000e+01,5.800000e+01,8.100000e+01,8.200000e+01,1.350000e+02,1.690000e+02,3.440000e+02,3.190000e+02 -23305=6,1077936128,1082130432,1084227584,1065353216,1073741824,1077936128 -23306=2,3.360000e+01,1.680000e+01\n"
  },
  {
    "path": "build-android.cmd",
    "content": ":: Set android ndk root\n@ECHO OFF\n@SETLOCAL\n@SET ANDROID_NDK=<your-ndk-root_path, such as\"E:\\android-ndk-r27\">\n\n:: Set ninja.exe\n:: @SET NINJA_EXE=<your-ninja-exe_path, such as\"D:\\android\\sdk\\cmake\\3.10.2.4988404\\bin\\ninja.exe\">\n\n:: android armv7\nmkdir build-android-armv7-vulkan\npushd build-android-armv7-vulkan\ncmake -G \"Unix Makefiles\" -DCMAKE_TOOLCHAIN_FILE=%ANDROID_NDK%/build/cmake/android.toolchain.cmake -DCMAKE_MAKE_PROGRAM=\"%ANDROID_NDK%/prebuilt/windows-x86_64/bin/make.exe\" -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON -DANDROID_PLATFORM=android-19 -DNCNN_VULKAN=ON ..\ncmake --build . --parallel %NUMBER_OF_PROCESSORS%\ncmake --build . --target install\npopd\n\n:: android aarch64\nmkdir build-android-aarch64-vulkan\npushd build-android-aarch64-vulkan\ncmake -G \"Unix Makefiles\" -DCMAKE_TOOLCHAIN_FILE=%ANDROID_NDK%/build/cmake/android.toolchain.cmake -DCMAKE_MAKE_PROGRAM=\"%ANDROID_NDK%/prebuilt/windows-x86_64/bin/make.exe\" -DANDROID_ABI=\"arm64-v8a\" -DANDROID_PLATFORM=android-21 -DNCNN_VULKAN=ON ..\ncmake --build . --parallel %NUMBER_OF_PROCESSORS%\ncmake --build . --target install\npopd\n\n:: android x86\nmkdir build-android-x86\npushd build-android-x86\ncmake -G \"Unix Makefiles\" -DCMAKE_TOOLCHAIN_FILE=%ANDROID_NDK%/build/cmake/android.toolchain.cmake -DCMAKE_MAKE_PROGRAM=\"%ANDROID_NDK%/prebuilt/windows-x86_64/bin/make.exe\" -DANDROID_ABI=\"x86\" -DANDROID_PLATFORM=android-19 -DNCNN_VULKAN=ON ..\ncmake --build . --parallel %NUMBER_OF_PROCESSORS%\ncmake --build . --target install\npopd\n\n:: android x86_64\nmkdir build-android-x86_64\npushd build-android-x86_64\ncmake -G \"Unix Makefiles\" -DCMAKE_TOOLCHAIN_FILE=%ANDROID_NDK%/build/cmake/android.toolchain.cmake -DCMAKE_MAKE_PROGRAM=\"%ANDROID_NDK%/prebuilt/windows-x86_64/bin/make.exe\" -DANDROID_ABI=\"x86_64\" -DANDROID_PLATFORM=android-21 -DNCNN_VULKAN=ON ..\ncmake --build . --parallel %NUMBER_OF_PROCESSORS%\ncmake --build . 
--target install\npopd\n\n:: android riscv64\nmkdir build-android-riscv64\npushd build-android-riscv64\ncmake -G \"Unix Makefiles\" -DCMAKE_TOOLCHAIN_FILE=%ANDROID_NDK%/build/cmake/android.toolchain.cmake -DCMAKE_MAKE_PROGRAM=\"%ANDROID_NDK%/prebuilt/windows-x86_64/bin/make.exe\" -DANDROID_ABI=\"riscv64\" -DANDROID_PLATFORM=android-35 -DNCNN_VULKAN=ON ..\ncmake --build . --parallel %NUMBER_OF_PROCESSORS%\ncmake --build . --target install\npopd\n\n@ENDLOCAL\n"
  },
  {
    "path": "build.sh",
    "content": "#!/usr/bin/env bash\n\n##### android armv7 without neon\nmkdir -p build-android-armv7-without-neon\npushd build-android-armv7-without-neon\ncmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=OFF -DANDROID_PLATFORM=android-19 -DNCNN_VULKAN=ON ..\nmake -j4\nmake install\npopd\n\n##### android armv7\nmkdir -p build-android-armv7\npushd build-android-armv7\ncmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON -DANDROID_PLATFORM=android-19 -DNCNN_VULKAN=ON ..\nmake -j4\nmake install\npopd\n\n##### android aarch64\nmkdir -p build-android-aarch64\npushd build-android-aarch64\ncmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=\"arm64-v8a\" -DANDROID_PLATFORM=android-21 -DNCNN_VULKAN=ON ..\nmake -j4\nmake install\npopd\n\n##### android x86\nmkdir -p build-android-x86\npushd build-android-x86\ncmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=\"x86\" -DANDROID_PLATFORM=android-19 -DNCNN_VULKAN=ON ..\nmake -j4\nmake install\npopd\n\n##### android x86_64\nmkdir -p build-android-x86_64\npushd build-android-x86_64\ncmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=\"x86_64\" -DANDROID_PLATFORM=android-21 -DNCNN_VULKAN=ON ..\nmake -j4\nmake install\npopd\n\n##### android riscv64\nmkdir -p build-android-riscv64\npushd build-android-riscv64\ncmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=\"riscv64\" -DANDROID_PLATFORM=android-35 -DNCNN_VULKAN=ON ..\nmake -j4\nmake install\npopd\n\n##### linux of hisiv300 (forgot the chip name) toolchain with neon and openmp\nmkdir -p build-hisiv300-linux\npushd build-hisiv300-linux\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/hisiv300.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux of hisiv500 
(Hi3516CV200 and Hi3519V101) toolchain with neon and openmp\nmkdir -p build-hisiv500-linux\npushd build-hisiv500-linux\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/hisiv500.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux of hisiv600 (Hi3559V100) toolchain with neon and no openmp (only one cpu, so openmp is disabled)\nmkdir -p build-hisiv600-linux\npushd build-hisiv600-linux\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/hisiv600.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux of himix100 (Hi3559a) toolchain with neon and openmp\nmkdir -p build-himix100-linux\npushd build-himix100-linux\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/himix100.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux of arm-linux-gnueabi toolchain\nmkdir -p build-arm-linux-gnueabi\npushd build-arm-linux-gnueabi\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabi.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux of arm-linux-gnueabihf toolchain\nmkdir -p build-arm-linux-gnueabihf\npushd build-arm-linux-gnueabihf\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux of v831 toolchain with neon and openmp\nmkdir -p build-v831-linux\npushd build-v831-linux\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/v831.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux for aarch64-linux-gnu toolchain\nmkdir -p build-aarch64-linux-gnu\npushd build-aarch64-linux-gnu\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### linux host system with gcc/g++\nmkdir -p build-host-gcc-linux\npushd build-host-gcc-linux\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/host.gcc.toolchain.cmake ..\nmake -j4\nmake install\npopd\n\n##### macOS\nmkdir -p build-mac\npushd build-mac\ncmake   -DNCNN_OPENMP=OFF \\\n        -DNCNN_BENCHMARK=ON \\\n        ..\nmake -j8\nmake install\npopd\n"
  },
  {
    "path": "cmake/ncnnConfig.cmake.in",
    "content": "set(NCNN_VERSION @NCNN_VERSION@)\nset(NCNN_OPENMP @NCNN_OPENMP@)\nset(NCNN_THREADS @NCNN_THREADS@)\nset(NCNN_VULKAN @NCNN_VULKAN@)\nset(NCNN_SHARED_LIB @NCNN_SHARED_LIB@)\nset(NCNN_SYSTEM_GLSLANG @NCNN_SYSTEM_GLSLANG@)\nset(NCNN_SIMPLEVK @NCNN_SIMPLEVK@)\n\nif(NCNN_OPENMP)\n    find_package(OpenMP)\nendif()\n\nif(NCNN_THREADS)\n    set(CMAKE_THREAD_PREFER_PTHREAD TRUE)\n    set(THREADS_PREFER_PTHREAD_FLAG TRUE)\n    find_package(Threads REQUIRED)\nendif()\n\nif(NCNN_VULKAN)\n    if(NOT NCNN_SIMPLEVK)\n        find_package(Vulkan REQUIRED)\n    endif()\n\n    if(NOT NCNN_SHARED_LIB)\n        if(NCNN_SYSTEM_GLSLANG)\n            find_package(SPIRV-Tools QUIET)\n            find_package(SPIRV-Tools-opt QUIET)\n            find_package(glslang QUIET)\n            if(NOT glslang_FOUND)\n                set(GLSLANG_TARGET_DIR \"@GLSLANG_TARGET_DIR@\")\n                include(${GLSLANG_TARGET_DIR}/OSDependentTargets.cmake)\n                include(${GLSLANG_TARGET_DIR}/OGLCompilerTargets.cmake)\n                if(EXISTS \"${GLSLANG_TARGET_DIR}/HLSLTargets.cmake\")\n                    # hlsl support can be optional\n                    include(\"${GLSLANG_TARGET_DIR}/HLSLTargets.cmake\")\n                endif()\n                include(${GLSLANG_TARGET_DIR}/glslangTargets.cmake)\n                include(${GLSLANG_TARGET_DIR}/SPIRVTargets.cmake)\n            endif()\n        else()\n            set(glslang_DIR \"${CMAKE_CURRENT_LIST_DIR}/../../../@CMAKE_INSTALL_LIBDIR@/cmake/glslang\")\n            find_package(glslang QUIET)\n        endif()\n    endif()\nendif()\n\ninclude(${CMAKE_CURRENT_LIST_DIR}/ncnn.cmake)\n\nif(TARGET ncnn)\n    set(ncnn_FOUND TRUE)\n    if(NOT ncnn_FIND_QUIETLY)\n        message(STATUS \"Found ncnn: ${NCNN_VERSION}\")\n    endif()\nendif()\n"
  },
  {
    "path": "cmake/ncnn_add_layer.cmake",
    "content": "\nmacro(ncnn_add_arch_opt_layer class NCNN_TARGET_ARCH_OPT NCNN_TARGET_ARCH_OPT_CFLAGS)\n    set(NCNN_${NCNN_TARGET_ARCH}_HEADER ${CMAKE_CURRENT_SOURCE_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}.h)\n    set(NCNN_${NCNN_TARGET_ARCH}_SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}.cpp)\n\n    if(WITH_LAYER_${name} AND EXISTS ${NCNN_${NCNN_TARGET_ARCH}_HEADER} AND EXISTS ${NCNN_${NCNN_TARGET_ARCH}_SOURCE})\n\n        set(NCNN_${NCNN_TARGET_ARCH_OPT}_HEADER ${CMAKE_CURRENT_BINARY_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}.h)\n        set(NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE ${CMAKE_CURRENT_BINARY_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}.cpp)\n\n        add_custom_command(\n            OUTPUT ${NCNN_${NCNN_TARGET_ARCH_OPT}_HEADER}\n            COMMAND ${CMAKE_COMMAND} -DSRC=${NCNN_${NCNN_TARGET_ARCH}_HEADER} -DDST=${NCNN_${NCNN_TARGET_ARCH_OPT}_HEADER} -DCLASS=${class} -P \"${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_generate_${NCNN_TARGET_ARCH_OPT}_source.cmake\"\n            DEPENDS ${NCNN_${NCNN_TARGET_ARCH}_HEADER}\n            COMMENT \"Generating source ${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}.h\"\n            VERBATIM\n        )\n        set_source_files_properties(${NCNN_${NCNN_TARGET_ARCH_OPT}_HEADER} PROPERTIES GENERATED TRUE)\n\n        add_custom_command(\n            OUTPUT ${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE}\n            COMMAND ${CMAKE_COMMAND} -DSRC=${NCNN_${NCNN_TARGET_ARCH}_SOURCE} -DDST=${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE} -DCLASS=${class} -P \"${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_generate_${NCNN_TARGET_ARCH_OPT}_source.cmake\"\n            DEPENDS ${NCNN_${NCNN_TARGET_ARCH}_SOURCE}\n            COMMENT \"Generating source ${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}.cpp\"\n            VERBATIM\n        )\n        
set_source_files_properties(${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE} PROPERTIES GENERATED TRUE)\n\n        set_source_files_properties(${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE} PROPERTIES COMPILE_FLAGS ${NCNN_TARGET_ARCH_OPT_CFLAGS})\n\n        list(APPEND ncnn_SRCS ${NCNN_${NCNN_TARGET_ARCH_OPT}_HEADER} ${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE})\n\n        # generate layer_declaration and layer_registry file\n        set(layer_declaration \"${layer_declaration}#include \\\"layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}.h\\\"\\n\")\n        set(layer_declaration \"${layer_declaration}namespace ncnn { DEFINE_LAYER_CREATOR(${class}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}) }\\n\")\n\n        set(layer_registry_${NCNN_TARGET_ARCH_OPT} \"${layer_registry_${NCNN_TARGET_ARCH_OPT}}#if NCNN_STRING\\n{\\\"${class}\\\", ${class}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}_layer_creator},\\n#else\\n{${class}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}_layer_creator},\\n#endif\\n\")\n    else()\n        # no isa optimized version\n        if(WITH_LAYER_${name})\n            set(layer_registry_${NCNN_TARGET_ARCH_OPT} \"${layer_registry_${NCNN_TARGET_ARCH_OPT}}#if NCNN_STRING\\n{\\\"${class}\\\", ${class}_layer_creator},\\n#else\\n{${class}_layer_creator},\\n#endif\\n\")\n        else()\n            set(layer_registry_${NCNN_TARGET_ARCH_OPT} \"${layer_registry_${NCNN_TARGET_ARCH_OPT}}#if NCNN_STRING\\n{\\\"${class}\\\", 0},\\n#else\\n{0},\\n#endif\\n\")\n        endif()\n    endif()\nendmacro()\n\nmacro(ncnn_add_arch_opt_source class NCNN_TARGET_ARCH_OPT NCNN_TARGET_ARCH_OPT_CFLAGS)\n    set(NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT}.cpp)\n\n    if(WITH_LAYER_${name} AND EXISTS ${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE})\n        if(NCNN_RUNTIME_CPU)\n            set_source_files_properties(${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE} PROPERTIES COMPILE_FLAGS 
${NCNN_TARGET_ARCH_OPT_CFLAGS})\n        endif()\n        list(APPEND ncnn_SRCS ${NCNN_${NCNN_TARGET_ARCH_OPT}_SOURCE})\n    endif()\nendmacro()\n\nmacro(ncnn_add_arch_opt_layer_source class NCNN_TARGET_ARCH_OPT_BASE NCNN_TARGET_ARCH_OPT NCNN_TARGET_ARCH_OPT_CFLAGS)\n    set(NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_SOURCE ${CMAKE_CURRENT_SOURCE_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT_BASE}.cpp)\n\n    if(WITH_LAYER_${name} AND EXISTS ${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_SOURCE})\n\n        set(NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}_SOURCE ${CMAKE_CURRENT_BINARY_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}.cpp)\n\n        add_custom_command(\n            OUTPUT ${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}_SOURCE}\n            COMMAND ${CMAKE_COMMAND} -DSRC=${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_SOURCE} -DDST=${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}_SOURCE} -DCLASS=${class} -P \"${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_generate_${NCNN_TARGET_ARCH_OPT}_source.cmake\"\n            DEPENDS ${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_SOURCE}\n            COMMENT \"Generating source ${name}_${NCNN_TARGET_ARCH}_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}.cpp\"\n            VERBATIM\n        )\n        set_source_files_properties(${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}_SOURCE} PROPERTIES GENERATED TRUE)\n\n        if(NCNN_RUNTIME_CPU)\n            set_source_files_properties(${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}_SOURCE} PROPERTIES COMPILE_FLAGS ${NCNN_TARGET_ARCH_OPT_CFLAGS})\n        endif()\n        list(APPEND ncnn_SRCS ${NCNN_${NCNN_TARGET_ARCH_OPT_BASE}_${NCNN_TARGET_ARCH_OPT}_SOURCE})\n    endif()\nendmacro()\n\nmacro(ncnn_add_layer class)\n    string(TOLOWER ${class} name)\n\n    # WITH_LAYER_xxx option\n    if(${ARGC} EQUAL 2)\n        option(WITH_LAYER_${name} \"build with 
layer ${name}\" ${ARGV1})\n    else()\n        option(WITH_LAYER_${name} \"build with layer ${name}\" ON)\n    endif()\n\n    if(NCNN_CMAKE_VERBOSE)\n        message(STATUS \"WITH_LAYER_${name} = ${WITH_LAYER_${name}}\")\n    endif()\n\n    if(WITH_LAYER_${name})\n        list(APPEND ncnn_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/layer/${name}.cpp)\n\n        # look for arch specific implementation and append source\n        # optimized implementation for armv7, aarch64 or x86\n        set(LAYER_ARCH_SRC ${CMAKE_CURRENT_SOURCE_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}.cpp)\n        if(EXISTS ${LAYER_ARCH_SRC})\n            set(WITH_LAYER_${name}_${NCNN_TARGET_ARCH} 1)\n            list(APPEND ncnn_SRCS ${LAYER_ARCH_SRC})\n        endif()\n\n        set(LAYER_VULKAN_SRC ${CMAKE_CURRENT_SOURCE_DIR}/layer/vulkan/${name}_vulkan.cpp)\n        if(NCNN_VULKAN AND EXISTS ${LAYER_VULKAN_SRC})\n            set(WITH_LAYER_${name}_vulkan 1)\n            list(APPEND ncnn_SRCS ${LAYER_VULKAN_SRC})\n        endif()\n    endif()\n\n    # generate layer_declaration and layer_registry file\n    if(WITH_LAYER_${name})\n        set(layer_declaration \"${layer_declaration}#include \\\"layer/${name}.h\\\"\\n\")\n        set(layer_declaration \"${layer_declaration}namespace ncnn { DEFINE_LAYER_CREATOR(${class}) }\\n\")\n\n        source_group (\"sources\\\\\\\\layers\" FILES \"${CMAKE_CURRENT_SOURCE_DIR}/layer/${name}.cpp\")\n    endif()\n\n    if(WITH_LAYER_${name}_${NCNN_TARGET_ARCH})\n        set(layer_declaration \"${layer_declaration}#include \\\"layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}.h\\\"\\n\")\n        set(layer_declaration \"${layer_declaration}namespace ncnn { DEFINE_LAYER_CREATOR(${class}_${NCNN_TARGET_ARCH}) }\\n\")\n\n        source_group (\"sources\\\\\\\\layers\\\\\\\\${NCNN_TARGET_ARCH}\" FILES \"${CMAKE_CURRENT_SOURCE_DIR}/layer/${NCNN_TARGET_ARCH}/${name}_${NCNN_TARGET_ARCH}.cpp\")\n    endif()\n\n    if(WITH_LAYER_${name}_vulkan)\n        
set(layer_declaration \"${layer_declaration}#include \\\"layer/vulkan/${name}_vulkan.h\\\"\\n\")\n        set(layer_declaration \"${layer_declaration}namespace ncnn { DEFINE_LAYER_CREATOR(${class}_vulkan) }\\n\")\n\n        file(GLOB NCNN_SHADER_SRCS \"layer/vulkan/shader/${name}.comp\")\n        file(GLOB NCNN_SHADER_SUBSRCS \"layer/vulkan/shader/${name}_*.comp\")\n        list(APPEND NCNN_SHADER_SRCS ${NCNN_SHADER_SUBSRCS})\n        foreach(NCNN_SHADER_SRC ${NCNN_SHADER_SRCS})\n            ncnn_add_shader(${NCNN_SHADER_SRC})\n        endforeach()\n\n        source_group (\"sources\\\\\\\\layers\\\\\\\\vulkan\" FILES \"${CMAKE_CURRENT_SOURCE_DIR}/layer/vulkan/${name}_vulkan.cpp\")\n    endif()\n\n    if(WITH_LAYER_${name})\n        set(layer_registry \"${layer_registry}#if NCNN_STRING\\n{\\\"${class}\\\", ${class}_layer_creator},\\n#else\\n{${class}_layer_creator},\\n#endif\\n\")\n    else()\n        set(layer_registry \"${layer_registry}#if NCNN_STRING\\n{\\\"${class}\\\", 0},\\n#else\\n{0},\\n#endif\\n\")\n    endif()\n\n    if(WITH_LAYER_${name}_${NCNN_TARGET_ARCH})\n        set(layer_registry_arch \"${layer_registry_arch}#if NCNN_STRING\\n{\\\"${class}\\\", ${class}_${NCNN_TARGET_ARCH}_layer_creator},\\n#else\\n{${class}_${NCNN_TARGET_ARCH}_layer_creator},\\n#endif\\n\")\n    else()\n        set(layer_registry_arch \"${layer_registry_arch}#if NCNN_STRING\\n{\\\"${class}\\\", 0},\\n#else\\n{0},\\n#endif\\n\")\n    endif()\n\n    if(WITH_LAYER_${name}_vulkan)\n        set(layer_registry_vulkan \"${layer_registry_vulkan}#if NCNN_STRING\\n{\\\"${class}\\\", ${class}_vulkan_layer_creator},\\n#else\\n{${class}_vulkan_layer_creator},\\n#endif\\n\")\n    else()\n        set(layer_registry_vulkan \"${layer_registry_vulkan}#if NCNN_STRING\\n{\\\"${class}\\\", 0},\\n#else\\n{0},\\n#endif\\n\")\n    endif()\n\n    if(NCNN_TARGET_ARCH STREQUAL \"x86\")\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512)\n                
ncnn_add_arch_opt_layer(${class} avx512 \"/arch:AVX512 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_FMA)\n                ncnn_add_arch_opt_layer(${class} fma \"/arch:AVX /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX)\n                ncnn_add_arch_opt_layer(${class} avx \"/arch:AVX /D__SSSE3__ /D__SSE4_1__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512VNNI)\n                ncnn_add_arch_opt_source(${class} avx512vnni \"/arch:AVX512 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVX512VNNI__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512BF16)\n                ncnn_add_arch_opt_source(${class} avx512bf16 \"/arch:AVX512 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVX512BF16__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512FP16)\n                ncnn_add_arch_opt_source(${class} avx512fp16 \"/arch:AVX512 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVX512FP16__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNI)\n                ncnn_add_arch_opt_source(${class} avxvnni \"/arch:AVX2 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVXVNNI__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNIINT8)\n                ncnn_add_arch_opt_source(${class} avxvnniint8 \"/arch:AVX2 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVXVNNI__ /D__AVXVNNIINT8__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNIINT16)\n                ncnn_add_arch_opt_source(${class} avxvnniint16 \"/arch:AVX2 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVXVNNI__ /D__AVXVNNIINT16__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXNECONVERT)\n                ncnn_add_arch_opt_source(${class} avxneconvert \"/arch:AVX2 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ 
/D__AVXNECONVERT__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX2)\n                ncnn_add_arch_opt_source(${class} avx2 \"/arch:AVX2 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_XOP)\n                ncnn_add_arch_opt_source(${class} xop \"/arch:AVX /D__SSSE3__ /D__SSE4_1__ /D__XOP__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_F16C)\n                ncnn_add_arch_opt_source(${class} f16c \"/arch:AVX /D__SSSE3__ /D__SSE4_1__ /D__F16C__\")\n            endif()\n        elseif(CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512)\n                ncnn_add_arch_opt_layer(${class} avx512 \"/arch:AVX512 -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_FMA)\n                ncnn_add_arch_opt_layer(${class} fma \"/arch:AVX -mfma -mf16c /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX)\n                ncnn_add_arch_opt_layer(${class} avx \"/arch:AVX /D__SSSE3__ /D__SSE4_1__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512VNNI)\n                ncnn_add_arch_opt_source(${class} avx512vnni \"/arch:AVX512 -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c -mavx512vnni /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVX512VNNI__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512BF16)\n                ncnn_add_arch_opt_source(${class} avx512bf16 \"/arch:AVX512 -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c -mavx512bf16 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVX512BF16__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512FP16)\n                
ncnn_add_arch_opt_source(${class} avx512fp16 \"/arch:AVX512 -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c -mavx512fp16 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVX512FP16__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNI)\n                ncnn_add_arch_opt_source(${class} avxvnni \"/arch:AVX2 -mfma -mf16c -mavxvnni /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVXVNNI__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNIINT8)\n                ncnn_add_arch_opt_source(${class} avxvnniint8 \"/arch:AVX2 -mfma -mf16c -mavxvnni -mavxvnniint8 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVXVNNI__ /D__AVXVNNIINT8__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNIINT16)\n                ncnn_add_arch_opt_source(${class} avxvnniint16 \"/arch:AVX2 -mfma -mf16c -mavxvnni -mavxvnniint16 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVXVNNI__ /D__AVXVNNIINT16__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXNECONVERT)\n                ncnn_add_arch_opt_source(${class} avxneconvert \"/arch:AVX2 -mfma -mf16c -mavxneconvert /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__ /D__AVXNECONVERT__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX2)\n                ncnn_add_arch_opt_source(${class} avx2 \"/arch:AVX2 -mfma -mf16c /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_XOP)\n                ncnn_add_arch_opt_source(${class} xop \"/arch:AVX -mxop /D__SSSE3__ /D__SSE4_1__ /D__XOP__\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_F16C)\n                ncnn_add_arch_opt_source(${class} f16c \"/arch:AVX -mf16c /D__SSSE3__ /D__SSE4_1__ /D__F16C__\")\n            endif()\n        else()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512)\n                ncnn_add_arch_opt_layer(${class} avx512 \"-mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c\")\n 
           endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_FMA)\n                ncnn_add_arch_opt_layer(${class} fma \"-mavx -mfma -mf16c\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX)\n                ncnn_add_arch_opt_layer(${class} avx \"-mavx\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512VNNI)\n                ncnn_add_arch_opt_source(${class} avx512vnni \"-mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c -mavx512vnni\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512BF16)\n                ncnn_add_arch_opt_source(${class} avx512bf16 \"-mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c -mavx512bf16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX512FP16)\n                ncnn_add_arch_opt_source(${class} avx512fp16 \"-mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c -mavx512fp16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNI)\n                ncnn_add_arch_opt_source(${class} avxvnni \"-mavx2 -mfma -mf16c -mavxvnni\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNIINT8)\n                ncnn_add_arch_opt_source(${class} avxvnniint8 \"-mavx2 -mfma -mf16c -mavxvnni -mavxvnniint8\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXVNNIINT16)\n                ncnn_add_arch_opt_source(${class} avxvnniint16 \"-mavx2 -mfma -mf16c -mavxvnni -mavxvnniint16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVXNECONVERT)\n                ncnn_add_arch_opt_source(${class} avxneconvert \"-mavx2 -mfma -mf16c -mavxneconvert\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_AVX2)\n                ncnn_add_arch_opt_source(${class} avx2 \"-mavx2 -mfma -mf16c\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_XOP)\n                ncnn_add_arch_opt_source(${class} xop \"-mavx -mxop\")\n            endif()\n            
if(NCNN_RUNTIME_CPU AND NCNN_F16C)\n                ncnn_add_arch_opt_source(${class} f16c \"-mavx -mf16c\")\n            endif()\n        endif()\n    endif()\n\n    if(NCNN_TARGET_ARCH STREQUAL \"arm\" AND (CMAKE_SIZEOF_VOID_P EQUAL 4 AND NOT NCNN_TARGET_ILP32))\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n            if(NCNN_VFPV4)\n                ncnn_add_arch_opt_source(${class} vfpv4 \"/arch:VFPv4 /D__ARM_FP=0x0E\")\n            endif()\n        else()\n            if(NCNN_VFPV4)\n                if(NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n                    ncnn_add_arch_opt_source(${class} vfpv4 \"-mfpu=neon-vfpv4\")\n                elseif(NCNN_COMPILER_SUPPORT_ARM_VFPV4_FP16)\n                    ncnn_add_arch_opt_source(${class} vfpv4 \"-mfpu=neon-vfpv4 -mfp16-format=ieee\")\n                endif()\n            endif()\n        endif()\n    endif()\n\n    if(NCNN_TARGET_ARCH STREQUAL \"arm\" AND (CMAKE_SIZEOF_VOID_P EQUAL 8 OR NCNN_TARGET_ILP32))\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            if(NCNN_VFPV4)\n                ncnn_add_arch_opt_source(${class} vfpv4 \" \")\n            endif()\n            if(NCNN_ARM82)\n                ncnn_add_arch_opt_source(${class} asimdhp \"/arch:armv8.2 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM82DOT)\n                ncnn_add_arch_opt_source(${class} asimddp \"/arch:armv8.2 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM82FP16FML)\n                ncnn_add_arch_opt_source(${class} asimdfhm \"/arch:armv8.2 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_FP16_FML\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM84BF16)\n                ncnn_add_arch_opt_source(${class} bf16 
\"/arch:armv8.4 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD /D__ARM_FEATURE_FP16_FML /D__ARM_FEATURE_BF16_VECTOR_ARITHMETIC\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM84I8MM)\n                ncnn_add_arch_opt_source(${class} i8mm \"/arch:armv8.4 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD /D__ARM_FEATURE_FP16_FML /D__ARM_FEATURE_MATMUL_INT8\")\n            endif()\n            # TODO add support for sve family\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVE)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVE2)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEBF16)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEI8MM)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEF32MM)\n            endif()\n        elseif(CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")\n            if(NCNN_VFPV4)\n                ncnn_add_arch_opt_source(${class} vfpv4 \" \")\n            endif()\n            if(NCNN_ARM82)\n                ncnn_add_arch_opt_source(${class} asimdhp \"/arch:armv8.2 -march=armv8.2-a+fp16 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM82DOT)\n                ncnn_add_arch_opt_source(${class} asimddp \"/arch:armv8.2 -march=armv8.2-a+fp16+dotprod /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM82FP16FML)\n                ncnn_add_arch_opt_source(${class} asimdfhm \"/arch:armv8.2 -march=armv8.2-a+fp16+fp16fml /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_FP16_FML\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM84BF16)\n                ncnn_add_arch_opt_source(${class} bf16 \"/arch:armv8.4 -march=armv8.4-a+fp16+dotprod+bf16 
/D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD /D__ARM_FEATURE_FP16_FML /D__ARM_FEATURE_BF16_VECTOR_ARITHMETIC\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM84I8MM)\n                ncnn_add_arch_opt_source(${class} i8mm \"/arch:armv8.4 -march=armv8.4-a+fp16+dotprod+i8mm /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD /D__ARM_FEATURE_FP16_FML /D__ARM_FEATURE_MATMUL_INT8\")\n            endif()\n            # TODO add support for sve family\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVE)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVE2)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEBF16)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEI8MM)\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEF32MM)\n            endif()\n        else()\n            if(NCNN_VFPV4)\n                ncnn_add_arch_opt_source(${class} vfpv4 \" \")\n            endif()\n            if(NCNN_ARM82)\n                ncnn_add_arch_opt_source(${class} asimdhp \"-march=armv8.2-a+fp16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM82DOT)\n                ncnn_add_arch_opt_source(${class} asimddp \"-march=armv8.2-a+fp16+dotprod\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM82FP16FML)\n                # clang 9.0.9 shipped with android ndk-r21 is missing __ARM_FEATURE_FP16_FML macro for asimdfhm target\n                ncnn_add_arch_opt_source(${class} asimdfhm \"-march=armv8.2-a+fp16+fp16fml -D__ARM_FEATURE_FP16_FML\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM84BF16)\n                ncnn_add_arch_opt_source(${class} bf16 \"-march=armv8.4-a+fp16+dotprod+bf16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM84I8MM)\n                ncnn_add_arch_opt_source(${class} i8mm \"-march=armv8.4-a+fp16+dotprod+i8mm\")\n            endif()\n            if(NCNN_RUNTIME_CPU 
AND NCNN_ARM86SVE)\n                ncnn_add_arch_opt_source(${class} sve \"-march=armv8.6-a+fp16+dotprod+sve\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVE2)\n                ncnn_add_arch_opt_source(${class} sve2 \"-march=armv8.6-a+fp16+dotprod+sve2\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEBF16)\n                ncnn_add_arch_opt_source(${class} svebf16 \"-march=armv8.6-a+fp16+dotprod+sve+bf16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEI8MM)\n                ncnn_add_arch_opt_source(${class} svei8mm \"-march=armv8.6-a+fp16+dotprod+sve+i8mm\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ARM86SVEF32MM)\n                ncnn_add_arch_opt_source(${class} svef32mm \"-march=armv8.6-a+fp16+dotprod+sve+f32mm\")\n            endif()\n        endif()\n    endif()\n\n    if(NCNN_TARGET_ARCH STREQUAL \"mips\")\n        if(NCNN_RUNTIME_CPU AND NCNN_MSA)\n            ncnn_add_arch_opt_layer(${class} msa \"-mmsa\")\n        endif()\n        if(NCNN_MMI)\n            ncnn_add_arch_opt_source(${class} mmi \"-mloongson-mmi\")\n        endif()\n    endif()\n\n    if(NCNN_TARGET_ARCH STREQUAL \"loongarch\")\n        if(NCNN_RUNTIME_CPU AND NCNN_LASX)\n            ncnn_add_arch_opt_layer(${class} lasx \"-mlasx -mlsx\")\n        endif()\n        if(NCNN_RUNTIME_CPU AND NCNN_LSX)\n            ncnn_add_arch_opt_layer(${class} lsx \"-mlsx\")\n        endif()\n    endif()\n\n    if(NCNN_TARGET_ARCH STREQUAL \"riscv\")\n        if(CMAKE_SIZEOF_VOID_P EQUAL 8)\n            if(NCNN_RUNTIME_CPU AND NCNN_RVV)\n                ncnn_add_arch_opt_layer(${class} rvv \"-march=rv64gcv\")\n            endif()\n            if(NCNN_ZFH)\n                if(NOT NCNN_RUNTIME_CPU AND NCNN_ZVFH)\n                    ncnn_add_arch_opt_source(${class} zfh \"-march=rv64gcv_zfh_zvfh -D__fp16=_Float16\")\n                elseif(NOT NCNN_RUNTIME_CPU AND NCNN_XTHEADVECTOR)\n                    
ncnn_add_arch_opt_source(${class} zfh \"-march=rv64gc_zfh_xtheadvector -D__riscv_zvfh=1 -D__fp16=_Float16\")\n                else()\n                    ncnn_add_arch_opt_source(${class} zfh \"-march=rv64gc_zfh -D__fp16=_Float16\")\n                endif()\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_XTHEADVECTOR)\n                # linker complains the conflict of v and xtheadvector, so disable generating any riscv attributes\n                ncnn_add_arch_opt_layer(${class} xtheadvector \"-march=rv64gc_xtheadvector -mno-riscv-attribute -Wa,-mno-arch-attr\")\n                ncnn_add_arch_opt_layer_source(${class} zfh xtheadvector \"-march=rv64gc_zfh_xtheadvector -mno-riscv-attribute -Wa,-mno-arch-attr -D__fp16=_Float16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ZVFH)\n                ncnn_add_arch_opt_layer_source(${class} zfh rvv \"-march=rv64gcv_zfh_zvfh -D__fp16=_Float16\")\n            endif()\n        elseif(CMAKE_SIZEOF_VOID_P EQUAL 4)\n            if(NCNN_RUNTIME_CPU AND NCNN_RVV)\n                ncnn_add_arch_opt_layer(${class} rvv \"-march=rv32gcv\")\n            endif()\n            if(NCNN_ZFH)\n                if(NOT NCNN_RUNTIME_CPU AND NCNN_ZVFH)\n                    ncnn_add_arch_opt_source(${class} zfh \"-march=rv32gcv_zfh_zvfh -D__fp16=_Float16\")\n                elseif(NOT NCNN_RUNTIME_CPU AND NCNN_XTHEADVECTOR)\n                    ncnn_add_arch_opt_source(${class} zfh \"-march=rv32gc_zfh_xtheadvector -D__riscv_zvfh=1 -D__fp16=_Float16\")\n                else()\n                    ncnn_add_arch_opt_source(${class} zfh \"-march=rv32gc_zfh -D__fp16=_Float16\")\n                endif()\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_XTHEADVECTOR)\n                # linker complains the conflict of v and xtheadvector, so disable generating any riscv attributes\n                ncnn_add_arch_opt_layer(${class} xtheadvector \"-march=rv32gc_xtheadvector -mno-riscv-attribute 
-Wa,-mno-arch-attr\")\n                ncnn_add_arch_opt_layer_source(${class} zfh xtheadvector \"-march=rv32gc_zfh_xtheadvector -mno-riscv-attribute -Wa,-mno-arch-attr -D__fp16=_Float16\")\n            endif()\n            if(NCNN_RUNTIME_CPU AND NCNN_ZVFH)\n                ncnn_add_arch_opt_layer_source(${class} zfh rvv \"-march=rv32gcv_zfh_zvfh -D__fp16=_Float16\")\n            endif()\n        endif()\n    endif()\n\n    # generate layer_type_enum file\n    set(layer_type_enum \"${layer_type_enum}${class} = ${__LAYER_TYPE_ENUM_INDEX},\\n\")\n    math(EXPR __LAYER_TYPE_ENUM_INDEX \"${__LAYER_TYPE_ENUM_INDEX}+1\")\nendmacro()\n"
  },
  {
    "path": "cmake/ncnn_add_param.cmake",
    "content": "\nmacro(ncnn_add_param NCNN_PARAM_SRC)\n    # Get the file name with extension\n    get_filename_component(NCNN_PARAM_SRC_NAME_WE ${NCNN_PARAM_SRC} NAME)\n    # Manually remove \".param\" since NAME_WE treats \".1.param\" as a multi-extension\n    string(REPLACE \".param\" \"\" NCNN_PARAM_SRC_NAME_WE \"${NCNN_PARAM_SRC_NAME_WE}\")\n    # Replace characters invalid in C identifiers ('.' and '-') with underscores\n    string(REPLACE \".param\" \"\" NCNN_PARAM_SRC_NAME_WE \"${NCNN_PARAM_SRC_NAME_WE}\")\n    # Replace characters invalid in C identifiers ('.' and '-') with underscores\n    string(REPLACE \".\" \"_\" NCNN_PARAM_SRC_NAME_WE \"${NCNN_PARAM_SRC_NAME_WE}\")\n    string(REPLACE \"-\" \"_\" NCNN_PARAM_SRC_NAME_WE \"${NCNN_PARAM_SRC_NAME_WE}\")\n    # Check if the result is empty\n    if (NOT NCNN_PARAM_SRC_NAME_WE)\n        message(FATAL_ERROR \"Failed to extract valid filename from '${NCNN_PARAM_SRC}'\")\n    endif()\n    # Check if the extracted filename is a valid C identifier\n    string(REGEX MATCH \"^[A-Za-z_][A-Za-z0-9_]*$\" is_valid \"${NCNN_PARAM_SRC_NAME_WE}\")\n    if (NOT is_valid)\n        message(FATAL_ERROR \"Extracted filename '${NCNN_PARAM_SRC_NAME_WE}' is not a valid C identifier\")\n    endif()\n\n    set(NCNN_PARAM_HEADER ${CMAKE_CURRENT_BINARY_DIR}/param/${NCNN_PARAM_SRC_NAME_WE}.hex.h)\n\n    add_custom_command(\n        OUTPUT ${NCNN_PARAM_HEADER}\n        COMMAND ${CMAKE_COMMAND} -DPARAM_SRC=${NCNN_PARAM_SRC} -DPARAM_SRC_NAME_WE=${NCNN_PARAM_SRC_NAME_WE} -DPARAM_HEADER=${NCNN_PARAM_HEADER} -P \"${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_generate_param_header.cmake\"\n        DEPENDS ${NCNN_PARAM_SRC}\n        COMMENT \"Preprocessing param source ${NCNN_PARAM_SRC_NAME_WE}.param\"\n        VERBATIM\n    )\n    set_source_files_properties(${NCNN_PARAM_HEADER} PROPERTIES GENERATED TRUE)\n\n    get_filename_component(NCNN_PARAM_HEADER_NAME ${NCNN_PARAM_HEADER} NAME)\n    string(APPEND param_header_data \"#include 
\\\"param/${NCNN_PARAM_HEADER_NAME}\\\"\\n\")\n\n    list(APPEND NCNN_PARAM_HEX_FILES ${NCNN_PARAM_HEADER})\nendmacro()\n"
  },
  {
    "path": "cmake/ncnn_add_shader.cmake",
    "content": "\nmacro(ncnn_add_shader NCNN_SHADER_SRC)\n    get_filename_component(NCNN_SHADER_SRC_NAME_WE ${NCNN_SHADER_SRC} NAME_WE)\n    set(NCNN_SHADER_COMP_HEADER ${CMAKE_CURRENT_BINARY_DIR}/layer/vulkan/shader/${NCNN_SHADER_SRC_NAME_WE}.comp.hex.h)\n\n    add_custom_command(\n        OUTPUT ${NCNN_SHADER_COMP_HEADER}\n        COMMAND ${CMAKE_COMMAND} -DSHADER_SRC=${NCNN_SHADER_SRC} -DSHADER_COMP_HEADER=${NCNN_SHADER_COMP_HEADER} -P \"${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_generate_shader_comp_header.cmake\"\n        DEPENDS ${NCNN_SHADER_SRC}\n        COMMENT \"Preprocessing shader source ${NCNN_SHADER_SRC_NAME_WE}.comp\"\n        VERBATIM\n    )\n    set_source_files_properties(${NCNN_SHADER_COMP_HEADER} PROPERTIES GENERATED TRUE)\n\n    get_filename_component(NCNN_SHADER_COMP_HEADER_NAME ${NCNN_SHADER_COMP_HEADER} NAME)\n    string(APPEND layer_shader_spv_data \"#include \\\"layer/vulkan/shader/${NCNN_SHADER_COMP_HEADER_NAME}\\\"\\n\")\n\n    get_filename_component(NCNN_SHADER_SRC_NAME_WE ${NCNN_SHADER_SRC} NAME_WE)\n    string(APPEND layer_shader_registry \"{${NCNN_SHADER_SRC_NAME_WE}_comp_data,sizeof(${NCNN_SHADER_SRC_NAME_WE}_comp_data)},\\n\")\n\n    list(APPEND NCNN_SHADER_SPV_HEX_FILES ${NCNN_SHADER_COMP_HEADER})\n\n    # generate layer_shader_type_enum file\n    set(layer_shader_type_enum \"${layer_shader_type_enum}${NCNN_SHADER_SRC_NAME_WE} = ${__LAYER_SHADER_TYPE_ENUM_INDEX},\\n\")\n    math(EXPR __LAYER_SHADER_TYPE_ENUM_INDEX \"${__LAYER_SHADER_TYPE_ENUM_INDEX}+1\")\nendmacro()\n\n"
  },
  {
    "path": "cmake/ncnn_generate_avx512_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_X86_H\" \"LAYER_${CLASS_UPPER}_X86_AVX512_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_x86\" \"${CLASS}_x86_avx512\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_x86.h\\\"\" \"#include \\\"${CLASS_LOWER}_x86_avx512.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/ncnn_generate_avx_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_X86_H\" \"LAYER_${CLASS_UPPER}_X86_AVX_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_x86\" \"${CLASS}_x86_avx\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_x86.h\\\"\" \"#include \\\"${CLASS_LOWER}_x86_avx.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/ncnn_generate_fma_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_X86_H\" \"LAYER_${CLASS_UPPER}_X86_FMA_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_x86\" \"${CLASS}_x86_fma\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_x86.h\\\"\" \"#include \\\"${CLASS_LOWER}_x86_fma.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/ncnn_generate_lasx_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_LOONGARCH_H\" \"LAYER_${CLASS_UPPER}_LOONGARCH_LASX_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_loongarch\" \"${CLASS}_loongarch_lasx\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_loongarch.h\\\"\" \"#include \\\"${CLASS_LOWER}_loongarch_lasx.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/ncnn_generate_lsx_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_LOONGARCH_H\" \"LAYER_${CLASS_UPPER}_LOONGARCH_LSX_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_loongarch\" \"${CLASS}_loongarch_lsx\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_loongarch.h\\\"\" \"#include \\\"${CLASS_LOWER}_loongarch_lsx.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/ncnn_generate_msa_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_MIPS_H\" \"LAYER_${CLASS_UPPER}_MIPS_MSA_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_mips\" \"${CLASS}_mips_msa\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_mips.h\\\"\" \"#include \\\"${CLASS_LOWER}_mips_msa.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/ncnn_generate_param_header.cmake",
    "content": "\n# must define PARAM_HEADER PARAM_SRC PARAM_SRC_NAME_WE\n\nfile(READ ${PARAM_SRC} param_data)\n\n# remove whitespace\nstring(REGEX REPLACE \"\\n +\" \"\\n\" param_data ${param_data})\n\n# replace more spaces to one space\nstring(REGEX REPLACE \"[ \\t]+\" \" \" param_data \"${param_data}\")\n\n# remove empty line\nstring(REGEX REPLACE \"\\n[\\n]+\" \"\\n\" param_data \"${param_data}\")\n\n# text to hex\nfile(WRITE ${CMAKE_CURRENT_BINARY_DIR}/param/${PARAM_SRC_NAME_WE}.text2hex.txt \"${param_data}\")\nfile(READ ${CMAKE_CURRENT_BINARY_DIR}/param/${PARAM_SRC_NAME_WE}.text2hex.txt param_data_hex HEX)\nstring(REGEX REPLACE \"([0-9a-f][0-9a-f])\" \"0x\\\\1,\" param_data_hex ${param_data_hex})\nstring(FIND \"${param_data_hex}\" \",\" tail_comma REVERSE)\nstring(SUBSTRING \"${param_data_hex}\" 0 ${tail_comma} param_data_hex)\n\n# generate model param header file\nfile(WRITE ${PARAM_HEADER} \"static const char ${PARAM_SRC_NAME_WE}_param_data[] = {${param_data_hex},0x00};\\n\")\n"
  },
  {
    "path": "cmake/ncnn_generate_rvv_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_RISCV_H\" \"LAYER_${CLASS_UPPER}_RISCV_RVV_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_riscv\" \"${CLASS}_riscv_rvv\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_riscv.h\\\"\" \"#include \\\"${CLASS_LOWER}_riscv_rvv.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/ncnn_generate_shader_comp_header.cmake",
    "content": "\n# must define SHADER_COMP_HEADER SHADER_SRC\n\nfile(READ ${SHADER_SRC} comp_data)\n\n# skip leading comment\nstring(FIND \"${comp_data}\" \"#version\" version_start)\nif(NOT ${version_start} EQUAL -1)\n    string(SUBSTRING \"${comp_data}\" ${version_start} -1 comp_data)\nendif()\n\n# remove whitespace\nstring(REGEX REPLACE \"\\n +\" \"\\n\" comp_data \"${comp_data}\")\n\n# remove comments\nstring(REGEX REPLACE \"//[^\\n]*\" \"\" comp_data \"${comp_data}\")\n\n# replace more spaces to one space\nstring(REGEX REPLACE \"[ \\t]+\" \" \" comp_data \"${comp_data}\")\n\n# remove empty line\nstring(REGEX REPLACE \"\\n[\\n]+\" \"\\n\" comp_data \"${comp_data}\")\n\nget_filename_component(SHADER_SRC_NAME_WE ${SHADER_SRC} NAME_WE)\n\n# text to hex\nfile(WRITE ${CMAKE_CURRENT_BINARY_DIR}/layer/vulkan/shader/${SHADER_SRC_NAME_WE}.text2hex.txt \"${comp_data}\")\nfile(READ ${CMAKE_CURRENT_BINARY_DIR}/layer/vulkan/shader/${SHADER_SRC_NAME_WE}.text2hex.txt comp_data_hex HEX)\nstring(REGEX REPLACE \"([0-9a-f][0-9a-f])\" \"0x\\\\1,\" comp_data_hex ${comp_data_hex})\nstring(FIND \"${comp_data_hex}\" \",\" tail_comma REVERSE)\nstring(SUBSTRING \"${comp_data_hex}\" 0 ${tail_comma} comp_data_hex)\n\nfile(WRITE ${SHADER_COMP_HEADER} \"static const char ${SHADER_SRC_NAME_WE}_comp_data[] = {${comp_data_hex}};\\n\")\n"
  },
  {
    "path": "cmake/ncnn_generate_xtheadvector_source.cmake",
    "content": "\n# must define SRC DST CLASS\n\nfile(READ ${SRC} source_data)\n\n# replace\nstring(TOUPPER ${CLASS} CLASS_UPPER)\nstring(TOLOWER ${CLASS} CLASS_LOWER)\n\nstring(REGEX REPLACE \"LAYER_${CLASS_UPPER}_RISCV_H\" \"LAYER_${CLASS_UPPER}_RISCV_XTHEADVECTOR_H\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"${CLASS}_riscv\" \"${CLASS}_riscv_xtheadvector\" source_data \"${source_data}\")\nstring(REGEX REPLACE \"#include \\\"${CLASS_LOWER}_riscv.h\\\"\" \"#include \\\"${CLASS_LOWER}_riscv_xtheadvector.h\\\"\" source_data \"${source_data}\")\n\nfile(WRITE ${DST} \"${source_data}\")\n"
  },
  {
    "path": "cmake/run_test.cmake",
    "content": "\r\nexecute_process(COMMAND $ENV{TESTS_EXECUTABLE_LOADER} $ENV{TESTS_EXECUTABLE_LOADER_ARGUMENTS} ${TEST_EXECUTABLE} $ENV{TESTS_ARGUMENTS} RESULT_VARIABLE result)\r\nif(NOT \"${result}\" STREQUAL \"0\")\r\n    message(FATAL_ERROR \"Test failed with return value '${result}'\")\r\nendif()\r\n"
  },
  {
    "path": "codeformat.sh",
    "content": "#!/usr/bin/env bash\n\n# we run clang-format and astyle twice to get stable format output\n\nformat_code() {\n    find src/ tools/ tests/ examples/ benchmark/ python/ -type f -name '*.c' -o -name '*.cpp' -o -name '*.cc' -o -name '*.h' | grep -v python/pybind11 | grep -v stb_image | grep -v ruapu | xargs -i clang-format -i {}\n    astyle -n -r \"benchmark/*.h,*.cpp,*.cc\" \"tests/*.h,*.cpp,*.cc\" \"tools/*.h,*.cpp,*.cc\" \"examples/*.h,*.cpp,*.cc\"\n    astyle -n -r \"src/*.h,*.cpp,*.cc\" --exclude=src/stb_image.h --exclude=src/stb_image_write.h --exclude=src/ruapu.h\n    astyle -n -r \"python/*.h,*.cpp,*.cc\" --exclude=python/pybind11\n}\n\nformat_code || { echo 'Formatting failed' ; exit 1; } #first time execute\nformat_code || { echo 'Formatting failed' ; exit 1; } #second time execute\n"
  },
  {
    "path": "docs/Home.md",
    "content": "### input data and extract output\n```cpp\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include \"net.h\"\n\nint main()\n{\n    cv::Mat img = cv::imread(\"image.ppm\", CV_LOAD_IMAGE_GRAYSCALE);\n    int w = img.cols;\n    int h = img.rows;\n\n    // subtract 128, norm to -1 ~ 1\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(img.data, ncnn::Mat::PIXEL_GRAY, w, h, 60, 60);\n    float mean[1] = { 128.f };\n    float norm[1] = { 1/128.f };\n    in.substract_mean_normalize(mean, norm);\n\n    ncnn::Net net;\n    net.load_param(\"model.param\");\n    net.load_model(\"model.bin\");\n\n    ncnn::Extractor ex = net.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat feat;\n    ex.extract(\"output\", feat);\n\n    return 0;\n}\n\n```\n\n### print Mat content\n```cpp\nvoid pretty_print(const ncnn::Mat& m)\n{\n    for (int q=0; q<m.c; q++)\n    {\n        const float* ptr = m.channel(q);\n        for (int z=0; z<m.d; z++)\n        {\n            for (int y=0; y<m.h; y++)\n            {\n                for (int x=0; x<m.w; x++)\n                {\n                    printf(\"%f \", ptr[x]);\n                }\n                ptr += m.w;\n                printf(\"\\n\");\n            }\n            printf(\"\\n\");\n        }\n        printf(\"------------------------\\n\");\n    }\n}\n```\n\n### print VkMat content\n```cpp\nvoid pretty_print(const ncnn::VkMat& m, ncnn::VkCompute& cmd, const ncnn::Option& opt)\n{\n    ncnn::Option opt_unpack = opt;\n    opt_unpack.use_packing_layout = false;\n\n    ncnn::Mat m_cpu;\n    cmd.record_download(m, m_cpu, opt_unpack);\n    cmd.submit_and_wait();\n    cmd.reset();\n\n    // print Mat content\n    pretty_print(m_cpu);\n}\n```\n\n### visualize Mat content\n```cpp\nvoid visualize(const char* title, const ncnn::Mat& m)\n{\n    std::vector<cv::Mat> normed_feats(m.c);\n\n    for (int i=0; i<m.c; i++)\n    {\n        cv::Mat tmp(m.h, m.w, CV_32FC1, (void*)(const 
float*)m.channel(i));\n\n        cv::normalize(tmp, normed_feats[i], 0, 255, cv::NORM_MINMAX, CV_8U);\n\n        cv::cvtColor(normed_feats[i], normed_feats[i], cv::COLOR_GRAY2BGR);\n\n        // mark NaN values in red (NaN is the only value for which v != v)\n        for (int y=0; y<m.h; y++)\n        {\n            const float* tp = tmp.ptr<float>(y);\n            uchar* sp = normed_feats[i].ptr<uchar>(y);\n            for (int x=0; x<m.w; x++)\n            {\n                float v = tp[x];\n                if (v != v)\n                {\n                    sp[0] = 0;\n                    sp[1] = 0;\n                    sp[2] = 255;\n                }\n\n                sp += 3;\n            }\n        }\n    }\n\n    int tw = m.w < 10 ? 32 : m.w < 20 ? 16 : m.w < 40 ? 8 : m.w < 80 ? 4 : m.w < 160 ? 2 : 1;\n    int th = (m.c - 1) / tw + 1;\n\n    cv::Mat show_map(m.h * th, m.w * tw, CV_8UC3);\n    show_map = cv::Scalar(127);\n\n    // tile\n    for (int i=0; i<m.c; i++)\n    {\n        int ty = i / tw;\n        int tx = i % tw;\n\n        normed_feats[i].copyTo(show_map(cv::Rect(tx * m.w, ty * m.h, m.w, m.h)));\n    }\n\n    cv::resize(show_map, show_map, cv::Size(0,0), 2, 2, cv::INTER_NEAREST);\n    cv::imshow(title, show_map);\n}\n```\n\n### FAQ\nQ The origin of ncnn\n\nA Deep learning algorithms needed to be deployed on mobile phones, but caffe has too many dependencies and phones have no cuda, so a fast and small forward-inference implementation was needed\n\n\nQ Where the name ncnn comes from\n\nA cnn is short for convolutional neural network, and the leading n is an n-way pun: new/next (a brand-new implementation), naive (ncnn is a naive implementation), neon (ncnn was originally optimized for mobile phones), and the author's name (←_←)\n\n\nQ Which platforms are supported\n\nA Cross-platform: android / ios / linux / windows / macos are supported, and it also runs on bare metal\n\n\nQ How accurate is the computation\n\nA armv7 neon float does not follow the ieee754 standard, and some functions use fast implementations (such as exp and sin), which are fast while still keeping the accuracy high enough\n\n\nQ logo\n\nA The author is a Minecraft player, hence the hand-drawn pixel cat, in which you can also find ncnn...\n"
  },
  {
    "path": "docs/application-with-ncnn-inside.md",
    "content": "![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.azarlive.android.png) Azar-视频交友与聊天 June 20, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.cyberlink.youcammakeup.png) 玩美彩妆 - 自拍美颜 & 智能美妆相机 June 21, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.fotoable.makeup.png) You Makeup Photo Camera 2.1.5\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.fotoable.cartoon.cam.png) 滤镜相机 Cartoon Camera- Paintlab January 24, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.pipcamera.activity.png) 画中画相机 January 30, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.hefe.pro.editor.png) Photo Editor Pro 1.1.4.1029\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.apus.camera.id.png) Air Camera 1.7.3.1002\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.fotoable.fotobeauty.png) 美丽拍－懂你的自拍美颜相机 February 1, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.perfectcorp.ycf.png) 玩美Fun-特效动图自拍滤镜&分享相片！ May 15, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.ufotosoft.justshot.png) Sweet Snap - 生活贴纸&图像编辑器,实时滤镜,录制视频和有趣表情包,美容效果 June 22, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.wantu.activity.png) 玩图 - 美图相机 March 29, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.meitu.meiyancamera.png) 美颜相机 7.6.95\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.lyrebirdstudio.colorizer.lite.png) 自拍相机 - 照片编辑器和过滤器和贴纸 April 27, 2018\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.apusapps.fulakora.png) APUS Camera 1.7.2.1001\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/video.like.png) LIKE短视频 — 魔法视频自拍神器 2.2.4\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.qiyi.video.png) 爱奇艺 
9.6.0\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.eg.android.AlipayGphone.png) 支付宝 10.1.25.752\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.perfectcorp.beautycircle.png) YouCam Shop - World's First AR Makeup Shopping App 3.4.0\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.lyrebirdstudio.beauty.png) 美容化妆自拍相机和自拍照片编辑器 1.4.8\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.jingdong.app.mall.png) 京东-挑好物，上京东 7.0.8\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.versa.png) Versa 2.9.2\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.tencent.weishi.png) 微视 4.3.1.88\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.smile.gifmaker.png) 快手短视频—国民短视频平台 5.4.2.5360\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.sdu.didi.psnger.png) 滴滴出行 5.3.0\n\n"
  },
  {
    "path": "docs/benchmark/the-benchmark-of-caffe-android-lib,-mini-caffe,-and-ncnn.md",
    "content": "caffe-android-lib https://github.com/sh1r0/caffe-android-lib\n\nmini-caffe https://github.com/luoyetx/mini-caffe\n\nopenblas-0.2.20 https://github.com/xianyi/OpenBLAS\n\nncnn https://github.com/Tencent/ncnn\n\n***\n\nsqueezenet_v1.1 https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1\n\nmobilenet_v1 https://github.com/shicai/MobileNet-Caffe\n\nvgg16 https://gist.github.com/ksimonyan/211839e770f7b538e2d8\n\n***\n\nHost platform and compiler configuration: \n\nfedora 27, android-ndk-r15c, target arch = arm64-v8a\n\nwe manually update openblas package to version 0.2.20 in caffe-android-lib for better performance\n\n\n***\n\nDevice: Nexus 6p\n\nOS: LineageOS 15.1(Android 8.1.0), ROM newly flashed without any third-party APP installed\n\nCPU: Snapdragon 810 (Cortex-A57 2.0GHz x 4 + Cortex-A53 1.55GHz x 4)\n\nRAM: 3G\n\n\n***\n\nBenchmark method: \n\nRun squeezenet, mobilenet inference 23 times in a loop, discard the first three warmup records, and then calculate the average inference time\n\nRun vgg169 times in a loop, discard the first warmup record, and then calculate the average inference time\n\nSince the system may force SOC lowering its frequency when temperature goes high, sleep over 1 minute before each benchmark to prevent this issue.\n\nfps performance: fps = 1000 / avgtime(ms)\n\ncpu usage: take the CPU value in top utility output\n\nmemory usage: take the RES value in top utility output\n\nthe overall power consumption and performance per watt: \n\nDisable usb charging: adb shell echo 0 > /sys/class/power_supply/battery/charging_enabled\n\ncurrent(μA) = adb shell cat /sys/class/power_supply/battery/current_now (multiply -1 for 810 chip)\n\nvoltage(μV) = adb shell cat /sys/class/power_supply/battery/voltage_now\n\npower consumption(mW) = current / 1000 * voltage / 1000 / 1000\n\nperformance per watt(1000fps/W) = fps / power consumption * 1000\n\n\n***\n\nThe binary size after debug 
stripping\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/1.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/2.jpg)\n\n***\n\nsqueezenet\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/3.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/4.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/5.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/6.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/7.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/8.jpg)\n***\n\nmobilenet\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/9.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/10.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/11.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/12.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/13.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/14.jpg)\n***\n\nvgg16\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/15.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/16.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/17.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/18.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/19.jpg)\n\n![](https://github.com/nihui/ncnn-assets/raw/master/20180413/20.jpg)\n"
  },
  {
    "path": "docs/benchmark/vulkan-conformance-test.md",
    "content": "\n|device|gpu|api version|driver version|squeezenet|mobilenetssd|yolov3|\n|---|---|---|---|---|---|---|\n|intel-i7-7700|Intel(R) HD Graphics 630 (Kaby Lake GT2)|1.1.90|18.3.4|y|y|y|\n|GTX-1060|GeForce GTX 1060 3GB|1.1.95|418.172.0|y|y|y|\n|AMD-Radeon R9 M290X|AMD RADV PITCAIRN (LLVM 7.0.1)|1.1.70|18.3.4|y|y|y|\n|iphone-5s|Apple A7 GPU|1.0.82|0.2.1825|y|y|y|\n|huawei-nexus6p|Adreno (TM) 430|1.0.49|35.601.2388|y|y|y\n|vivo-y1731ca|Adreno (TM) 505|1.0.61|37.845.1429|y|n|n|\n|vivo-y85a|Adreno (TM) 506|1.0.61|2.944.3349|y|n|n|\n|vivo-x9s|Adreno (TM) 510|1.0.61|42.917.1172|y|y|y|\n|meizu-15|Adreno (TM) 512|1.0.38|29.189.223|n|n|n|\n|chuizi-jianguo-pro2|Adreno (TM) 512|1.0.38|21.219.2615|n|n|n|\n|xiaomi-note3|Adreno (TM) 512|1.0.38|39.369.2305|n|n|n|\n|oppo-r11|Adreno (TM) 512|1.0.38|42.977.756|n|n|n|\n|xiaomi-6x|Adreno (TM) 512|1.0.61|14.322.3739|y|y|y|\n|oppo-r11s+|Adreno (TM) 512|1.0.61|35.1004.3936|y|y|y|\n|vivo-x20a|Adreno (TM) 512|1.0.61|43.10.3141|y|y|y|\n|vivo-v1816a|Adreno (TM) 512|1.0.61|43.10.3141|y|y|y|\n|vivo-z1|Adreno (TM) 512|1.0.61|43.10.3141|y|y|y|\n|xiaomi-redmi-note5|Adreno (TM) 512|1.0.61|63.219.2354|y|y|y|\n|google-pixel|Adreno (TM) 530|1.1.87|512.354.0|y|y|y|\n|nubia-z17|Adreno (TM) 540|1.0.38|1.28.32|n|n|n|\n|samsung-galaxys8+|Adreno (TM) 540|1.0.61|29.896.3583|y|y|y|\n|oneplus-5t|Adreno (TM) 540|1.0.61|18.1023.2233|y|y|y|\n|google-pixel2|Adreno (TM) 540|1.1.66|512.313.0|y|y|y|\n|essential-ph-1|Adreno (TM) 540|1.1.66|512.319.0|y|y|y|\n|vivo-x23|Adreno (TM) 615|1.0.66|33.870.3328|y|y|y|\n|vivo-v1813ba|Adreno (TM) 615|1.0.66|33.870.3328|y|y|y|\n|xiaomi-8se|Adreno (TM) 616|1.0.66|30.913.18|y|y|y|\n|vivo-nex-a|Adreno (TM) 616|1.0.66|33.870.3328|y|y|y|\n|xiaomi-mix2s|Adreno (TM) 630|1.0.61|4.91.2976|y|y|y|\n|heisha-SKR-A0|Adreno (TM) 630|1.0.61|36.173.3586|y|y|y|\n|heisha-SKR-A0|Adreno (TM) 630|1.0.66|47.448.1532|y|y|y|\n|oneplus-6|Adreno (TM) 630|1.1.66|512.324.0|y|y|y|\n|vivo-iQOO|Adreno (TM) 
640|1.1.87|512.361.0|y|y|y|\n|meitu-m8s|Mali-T880|1.0.14|500.910.1017|n|n|n|\n|huawei-p10|Mali-G71|1.0.53|151.949.2145|n|n|n|\n|huawei-mate9|Mali-G71|1.0.53|151.949.2145|n|n|n|\n|oppo-a73|Mali-G71|1.0.47|575.795.1934|n|n|n|\n|vivo-y97|Mali-G72|1.0.58|240.537.3580|n|n|n|\n|huawei-mate10|Mali-G72|1.0.66|14.0.0|y|y|y|\n|huawei-v10|Mali-G72|1.0.66|14.0.0|y|y|y|\n|huawei-vce-al00|Mali-G72|1.0.66|14.0.0|y|y|y|\n|huawei-mate20|Mali-G76|1.0.66|14.0.0|y|y|y|\n|huawei-pct-al10|Mali-G76|1.0.66|14.0.0|y|y|y|"
  },
  {
    "path": "docs/developer-guide/aarch64-mix-assembly-and-intrinsic.md",
    "content": "```c\n// v寄存器全部使用 %.4s\n// 128-bit vreg matches %.4s\n// a += b * c\nfloat32x4_t _a = vld1q_f32(a);\nfloat32x4_t _b = vld1q_f32(b);\nfloat32x4_t _c = vld1q_f32(c);\nasm volatile(\n    \"fmla  %0.4s, %2.4s, %3.4s\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// v寄存器使用低64位  %.2s\n// low 64-bit vreg matches %.2s\n// a += b * c\nfloat32x2_t _a = vld1_f32(a);\nfloat32x2_t _b = vld1_f32(b);\nfloat32x2_t _c = vld1_f32(c);\nasm volatile(\n    \"fmla  %0.2s, %2.2s, %3.2s\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// v寄存器单路使用 %.s[0] %.s[1] %.s[2] %.s[3]\n// 32-bit register matches %.s[0]\n// a += b * c[0]\n// a += b * c[1]\n// a += b * c[2]\n// a += b * c[3]\nfloat32x4_t _a = vld1_f32(a);\nfloat32x4_t _b = vld1_f32(b);\nfloat32x4_t _c = vld1_f32(c);\nasm volatile(\n    \"fmla  %0.4s, %2.4s, %3.s[0]\"\n    \"fmla  %0.4s, %2.4s, %3.s[1]\"\n    \"fmla  %0.4s, %2.4s, %3.s[2]\"\n    \"fmla  %0.4s, %2.4s, %3.s[3]\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n\n\nqwq\n"
  },
  {
    "path": "docs/developer-guide/add-custom-layer.zh.md",
    "content": "# NCNN增加自定义层\n\n## 举例\n\n这里举个例子添加自定义层次 如Relu6，即 std::min(6.f, std::max(0.f, val))\n\n```\nInput            input   0 1 input\nConvolution      conv2d  1 1 input conv2d 0=32 1=1 2=1 3=1 4=0 5=0 6=768\nRelu6            relu6   1 1 conv2d relu6\nPooling          maxpool 1 1 relu6 maxpool 0=0 1=3 2=2 3=-233 4=0\n```\n\n\n\n## 定义源码h文件：src/layer/relu6.h\n\n```CPP\n#ifndef LAYER_RELU6_H\n#define LAYER_RELU6_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Relu6 : public Layer\n{\npublic:\n    Relu6();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_RELU6_H\n```\n\n\n\n## 定义源码CPP文件：src/layer/relu6.cpp\n\n```CPP\n#include \"relu6.h\"\n\n#include <math.h>\n\nnamespace ncnn {\n\nRelu6::Relu6()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint Relu6::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int channels = bottom_top_blob.c;\n        int size = w * h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q=0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i=0; i<size; i++)\n            {\n                ptr[i] = std::min(6.f, std::max(0.f, ptr[i]));\n            }\n        }\n\n        return 0;\n}\n\n} // namespace ncnn\n\n```\n\n\n\n## 修改 src/CMakeLists.txt 注册Relu6\n\n```CPP\nncnn_add_layer(GroupNorm)\nncnn_add_layer(LayerNorm)\nncnn_add_layer(Relu6)\n```\n\n\n\n## 定义测试用例CPP文件 tests/test_relu6.cpp \n\n```CPP\n#include \"layer/relu6.h\"\n#include \"testutil.h\"\n\nstatic int test_relu6(const ncnn::Mat& a)\n{\n    ncnn::ParamDict pd;\n\n    std::vector<ncnn::Mat> weights(0);\n\n    int ret = test_layer<ncnn::Relu6>(\"Relu6\", pd, weights, a);\n    if (ret != 0)\n    {\n        fprintf(stderr, \"test_relu6 failed a.dims=%d a=(%d %d %d)\\n\", a.dims, a.w, a.h, 
a.c);\n    }\n\n    return ret;\n}\n\nstatic int test_relu6_0()\n{\n    return 0\n           || test_relu6(RandomMat(5, 7, 24))\n           || test_relu6(RandomMat(7, 9, 12))\n           || test_relu6(RandomMat(3, 5, 13));\n}\n\nstatic int test_relu6_1()\n{\n    return 0\n           || test_relu6(RandomMat(15, 24))\n           || test_relu6(RandomMat(17, 12))\n           || test_relu6(RandomMat(19, 15));\n}\n\nstatic int test_relu6_2()\n{\n    return 0\n           || test_relu6(RandomMat(128))\n           || test_relu6(RandomMat(124))\n           || test_relu6(RandomMat(127));\n}\n\nint main()\n{\n    SRAND(7767517);\n\n    return 0\n           || test_relu6_0()\n           || test_relu6_1()\n           || test_relu6_2();\n}\n\n```\n\n\n\n## 修改tests/CMakeLists.txt 注册Relu6测试用例\n\n```CPP\nncnn_add_layer_test(LSTM)\nncnn_add_layer_test(Yolov3DetectionOutput)\nncnn_add_layer_test(Relu6)\n```\n\n\n\n## 编译\n\n```\n按原NCNN步骤编译\n```\n\n\n\n## 单元测试\n\n```\n./test_relu6\n```\n\n"
  },
  {
    "path": "docs/developer-guide/arm-a53-a55-dual-issue.md",
    "content": "## natural assembly\n* no register dependency, no penalty\n```\nld1     {v0.4s}, [r0], #16\nfmla    v10.4s, v16.4s, v24.s[0]\nfmla    v11.4s, v16.4s, v24.s[1]\nfmla    v12.4s, v16.4s, v24.s[2]\nfmla    v13.4s, v16.4s, v24.s[3]\n```\n\n## A53\n* 128bit vector load cannot be dual issued with fmla, wait 2 cycles\n* 64bit vector load cannot be dual issued with fmla, wait 1 cycle\n* 64bit integer load can be dual issued with fmla, no penalty\n* pointer update can be dual issued with fmla, no penalty\n* 64bit vector load and 64bit vector insert can be dual issued, no penalty\n* any vector load cannot be issued on the 4th cycle of each fmla (enters the accumulator pipeline)\n\n### practical guide\n* use 64bit vector load only\n* issue vector load every three fmla\n* 1 cycle to load 64bit, dual issue with the previous interleaved 64bit insert\n* load the remaining 64bit into integer register, dual issue with fmla\n* update pointer, dual issue with fmla\n* insert 64bit into vector from integer register, dual issue with the next interleaved 64bit load\n* add nop every three fmla if no load, seems to be faster\n```\nldr     d0, [r0] // 1 cycle, v0 first 64bit\nfmla\nldr     x23, [r0, #8] // 0 cycle, v0 second 64bit to temp register\nfmla\nadd     r0, r0, #16 // 0 cycle, update pointer\nfmla\nldr     d1, [r0] // 1 cycle, v1 first 64bit\nins     v0.d[1], x23 // 0 cycle, v0 second 64bit complete\nfmla\nldr     x23, [r0, #8] // 0 cycle, v1 second 64bit to temp register\nfmla\nadd     r0, r0, #16 // 0 cycle, update pointer\nfmla\nins     v1.d[1], x23 // 1 cycle, v1 second 64bit complete\nnop\nfmla\nfmla\nfmla\nnop\nnop\nfmla\nfmla\nfmla\n```\n\n## A55\n* Limited by the number of neon register read and write ports, most neon instructions cannot be dual-issued.\n* neon instructions have different latencies\n* 128bit vector load cannot be issued with fmla, WAR wait 2 cycles\n* 64bit integer load can be dual issued with fmla, no penalty\n* pointer update can be dual 
issued with fmla, no penalty\n* 64bit vector insert can be dual issued with fmla, no penalty\n\n### practical guide\n* A55 supports a 128bit load and a 256bit write in one clock. It can dual issue two 64bit vector loads or single issue one 128bit vector load\n* `ldr`, dual issue with fmla\n* load the remaining 64bit into integer register, dual issue with fmla\n* update pointer, dual issue with fmla\n* insert 64bit into vector from integer register, dual issue with fmla\n* interleaved loads loosen register dependencies\n* nop trick is not needed\n* loop unrolling of fma reduces pipeline bubbles\n* some data type conversion neon instructions can be dual issued, such as `fsvts`\n```\nldr     d0, [r0] // 0 cycle, v0 first 64bit\nfmla\nldr     x23, [r0, #8] // 0 cycle, v0 second 64bit to temp register\nfmla\nadd     r0, r0, #16 // 0 cycle, update pointer\nfmla\nldr     d1, [r0] // 0 cycle, v1 first 64bit\nfmla\nins     v0.d[1], x23 // 0 cycle, v0 second 64bit complete\nfmla\nldr     x23, [r0, #8] // 0 cycle, v1 second 64bit to temp register\nfmla\nadd     r0, r0, #16 // 0 cycle, update pointer\nfmla\nins     v1.d[1], x23 // 0 cycle, v1 second 64bit complete\nfmla\n```\n"
  },
  {
    "path": "docs/developer-guide/armv7-mix-assembly-and-intrinsic.md",
    "content": "```c\n// d寄存器全部使用 %P\n// d reg matches %P\n// a += b * c\nfloat32x2_t _a = vld1_f32(a);\nfloat32x2_t _b = vld1_f32(b);\nfloat32x2_t _c = vld1_f32(c);\nasm volatile(\n    \"vmla.f32  %P0, %P2, %P3\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// q寄存器全部使用 %q\n// q reg matches %q\n// a += b * c\nfloat32x4_t _a = vld1q_f32(a);\nfloat32x4_t _b = vld1q_f32(b);\nfloat32x4_t _c = vld1q_f32(c);\nasm volatile(\n    \"vmla.f32  %q0, %q2, %q3\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// d寄存器单路使用 %P[0] %P[1]\n// 32bit d reg matches %P[0]\n// a += b * c[0]\n// a += b * c[1]\nfloat32x2_t _a = vld1_f32(a);\nfloat32x2_t _b = vld1_f32(b);\nfloat32x2_t _c = vld1_f32(c);\nasm volatile(\n    \"vmla.f32  %P0, %P2, %P3[0]\"\n    \"vmla.f32  %P0, %P2, %P3[1]\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// q寄存器单路使用 %e[0] %e[1] %f[0] %f[1]\n// 32-bit q reg matches %e[0]\n// a += b * c[0]\n// a += b * c[1]\n// a += b * c[2]\n// a += b * c[3]\nfloat32x4_t _a = vld1q_f32(a);\nfloat32x4_t _b = vld1q_f32(b);\nfloat32x4_t _c = vld1q_f32(c);\nasm volatile(\n    \"vmla.f32  %q0, %q2, %e3[0]\"\n    \"vmla.f32  %q0, %q2, %e3[1]\"\n    \"vmla.f32  %q0, %q2, %f3[0]\"\n    \"vmla.f32  %q0, %q2, %f3[1]\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// q寄存器拆分d寄存器使用 %e %f\n// use %e %f to split q reg into two d regs\n// a += b * c[0]c[1]\n// a += b * c[2]c[3]\nfloat32x2_t _a = vldq_f32(a);\nfloat32x2_t _b = vldq_f32(b);\nfloat32x4_t _c = vld1q_f32(c);\nasm volatile(\n    \"vmla.f32  %P0, %P2, %e3\"\n    \"vmla.f32  %P0, %P2, %f3\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// d寄存器声明绑定\n// specify concrete d reg which want to save\n// vmla.f32  d0, 
d2, d4\nregister float32x2_t _a asm(\"d0\") = vld1_f32(a);\nregister float32x2_t _b asm(\"d2\") = vld1_f32(b);\nregister float32x2_t _c asm(\"d4\") = vld1_f32(c);\n\nasm volatile(\n    \"vmla.f32  %P0, %P2, %P3\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n```c\n// q寄存器声明绑定\n// bind q reg with data\n// vmla.f32  q0, q1, q2\nregister float32x4_t _a asm(\"q0\") = vld1q_f32(a);\nregister float32x4_t _b asm(\"q1\") = vld1q_f32(b);\nregister float32x4_t _c asm(\"q2\") = vld1q_f32(c);\n\nasm volatile(\n    \"vmla.f32  %q0, %q2, %q3\"\n    : \"=w\"(_a) // %0\n    : \"0\"(_a),\n      \"w\"(_b), // %2\n      \"w\"(_c)  // %3\n    :\n);\n```\n\n如果不是因为编译器的bug，寄存器绑定是用不着的，然而。。。\n\nhttps://gcc.gnu.org/bugzilla/show_bug.cgi?id=41538\n\nqwq\n"
  },
  {
    "path": "docs/developer-guide/binaryop-broadcasting.md",
    "content": "### broadcasting rule\n\nncnn BinaryOp accepts blobs with different shape\n\nC = BinaryOp(A, B)\n\nshape notation convention is [w], [w,h], [w,h,c], [w,h,d,c]\n\n* binaryop with scalar and scalar-like\n\n|A|B|C|\n|---|---|---|\n|[2]|scalar / [1]|[2]|\n|[2,3]|scalar / [1] / [1,1]|[2,3]|\n|[2,3,4]|scalar / [1] / [1,1] / [1,1,1]|[2,3,4]|\n|[2,3,4,5]|scalar / [1] / [1,1] / [1,1,1] / [1,1,1,1]|[2,3,4,5]|\n\n* no broadcast\n\n|A|B|C|\n|---|---|---|\n|[2]|[2]|[2]|\n|[2,3]|[2,3]|[2,3]|\n|[2,3,4]|[2,3,4]|[2,3,4]|\n|[2,3,4,5]|[2,3,4,5]|[2,3,4,5]|\n\n* explicit broadcast B\n\n|A|B|C|\n|---|---|---|\n|[2,3]|[1,3]|[2,3]|\n|[2,3]|[2,1]|[2,3]|\n|[2,3,4]|[1,3,4]|[2,3,4]|\n|[2,3,4]|[2,1,4]|[2,3,4]|\n|[2,3,4]|[2,3,1]|[2,3,4]|\n|[2,3,4]|[1,1,4]|[2,3,4]|\n|[2,3,4]|[1,3,1]|[2,3,4]|\n|[2,3,4]|[2,1,1]|[2,3,4]|\n|[2,3,4,5]|[1,3,4,5]|[2,3,4,5]|\n|[2,3,4,5]|[2,1,4,5]|[2,3,4,5]|\n|[2,3,4,5]|[2,3,1,5]|[2,3,4,5]|\n|[2,3,4,5]|[2,3,4,1]|[2,3,4,5]|\n|[2,3,4,5]|[1,1,4,5]|[2,3,4,5]|\n|[2,3,4,5]|[1,3,1,5]|[2,3,4,5]|\n|[2,3,4,5]|[1,3,4,1]|[2,3,4,5]|\n|[2,3,4,5]|[2,1,1,5]|[2,3,4,5]|\n|[2,3,4,5]|[2,1,4,1]|[2,3,4,5]|\n|[2,3,4,5]|[2,3,1,1]|[2,3,4,5]|\n|[2,3,4,5]|[1,1,1,5]|[2,3,4,5]|\n|[2,3,4,5]|[1,1,4,1]|[2,3,4,5]|\n|[2,3,4,5]|[1,3,1,1]|[2,3,4,5]|\n|[2,3,4,5]|[2,1,1,1]|[2,3,4,5]|\n\n* implicit broadcast B for inner axis\n\nIt broadcasts in the opposite direction of the numpy's implicit broadcasting behavior.\n\npnnx will insert reshape operator at the appropriate position to convert it to explicit broadcast automatically.\n\n|A|B|C|\n|---|---|---|\n|[2,3]|[3]|[2,3]|\n|[2,3,4]|[4]|[2,3,4]|\n|[2,3,4]|[3,4]|[2,3,4]|\n|[2,3,4,5]|[5]|[2,3,4,5]|\n|[2,3,4,5]|[4,5]|[2,3,4,5]|\n|[2,3,4,5]|[3,4,5]|[2,3,4,5]|\n\n* implicit broadcast B with 1 dimension rank for outer axis\n\nThis exists only for compatibility.\n\nWhen the size is the same, eg. [2,2] and [2], broadcast B for inner axis will be prioritized.\n\n|A|B|C|\n|---|---|---|\n|[2,3]|[2]|[2,3]|\n|[2,3,4]|[2]|[2,3,4]|\n|[2,3,4,5]|[2]|[2,3,4,5]|\n"
  },
  {
    "path": "docs/developer-guide/build-ncnn-on-windows-xp.zh.md",
    "content": "# Build ncnn on Windows XP\n\n> **Contributors:** [@Sugar-Baby](https://github.com/Sugar-Baby) and [@AtomAlpaca](https://github.com/AtomAlpaca)\n\n## 0. 环境准备\n\n#### 0.1 虚拟机设置\n\n我使用的是[我的MSDN](https://www.imsdn.cn/)提供的[Windows XP SP3 x64版本](https://www.imsdn.cn/operating-systems/windows-xp/)。虚拟机使用Oracle VM VirtualBox，内存4GB，存储空间64GB（C盘16GB，D盘48GB）。\n\n**在虚拟机关机的情况下**，点击虚拟机管理器界面的\"设置\"-\"网络\"-\"高级\"，将控制芯片改为PCnet-FAST III，混杂模式设置为拒绝，勾选接入网线，点击\"OK\"保存。重启虚拟机就可以连接上网络了。\n\n点击虚拟机界面的\"设备\"-\"安装增强功能...\"，在虚拟机中进入\"我的电脑\"，刷新后出现\"VirtualBox Guest Additions (D: )\"，右键选择\"自动播放\"，完成安装后重启。\n\n点击虚拟机界面的\"设备\"-\"共享粘贴板\"，设置为\"双向\"。点击\"设备\"-\"共享文件夹\"-\"共享文件夹..\"，点击右侧加号，在\"共享文件夹路径\"中选择\"其他...\"，然后选择需要共享的主机文件夹。勾选\"自动挂载\"和\"固定分配\"，点击\"OK\"保存。在虚拟机中进入\"我的电脑\"，刷新后出现'VBoxSvr' 上的 <主机文件夹名称>，双击进入就可以双向传输文件了。\n\n#### 0.2 开发环境配置\n\n浏览器推荐[Mypal 68](https://www.mypal-browser.org/download.html)，注意要选择32位版本。Windows XP自带ZIP文件解压。安装后就可以访问互联网了。\n\n从Github下载[w64devkit](https://github.com/skeeto/w64devkit)，选择x86版本。这里下载的是一个自解压的7z文件，在虚拟机中解压即可。\n\n在\"开始\"-\"控制面板\"-\"切换到经典视图\"-\"系统\"-\"高级\"-\"环境变量\"-\"系统变量\"中，选择Path，点击\"编辑\"，在字符串末尾加入一个分号(;)，然后粘贴w64devkit下bin文件夹的目录。点击\"确定\"保存之后可以打开命令提示符输入例如c++的命令验证是否成功加入环境变量。\n\n由于年代过于久远，Git的官方release已经没有兼容Windows XP的版本了。最后一个兼容的版本(1.9.5)可以在[这里](https://www.xiazaiba.com/html/29352.html)下载。\n\n为了使用Git，需要安装[Win32 OpenSSL](https://slproweb.com/products/Win32OpenSSL.html)。选择Win32 OpenSSL Light版本。这个过程中会附带安装VC++ 2022运行时库。\n\n如果因为协议、代理等问题不能在虚拟机中使用Git，也可以下载ZIP版本后在虚拟机中解压。\n\n需要手动下载[CMake最后支持Windows XP的版本](https://github.com/Kitware/CMake/releases/download/v3.10.3/cmake-3.10.3-win32-x86.zip)。建议解压在C:\\Program Files下，并且需要设置系统变量，到CMake目录下的bin文件夹。具体可以参考上面w64devkit的方法。\n\n## 1. 
编译\n\n### 1.1 使用 MinGW-w64\n\n运行\n\n```bash\ncd <ncnn-root-dir>\nmkdir build\ncd build\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/windows-xp-mingw.toolchain.cmake -DNCNN_VULKAN=OFF -DNCNN_SIMPLEOCV=ON -DNCNN_AVX=OFF -DCMAKE_BUILD_TYPE=Release -G \"MinGW Makefiles\" ..\nmake -j2\nmake install\n```\n\n由于平台性能的限制，Vulkan SDK 最低要求 Windows 7 SP1，XP 无法安装官方驱动和工具链，因此需要关闭Vulkan选项。同时需要使用简化版 OpenCV 替代库NCNN_SIMPLEOCV。\n\n### 1.2 使用 Clang\n\n需要先配置 MinGW-w64 环境，然后安装 Clang 6.0 或更高版本。\n\n```bash\ncd <ncnn-root-dir>\nmkdir build\ncd build\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/windows-xp-clang.toolchain.cmake -DNCNN_SIMPLEOCV=ON -DNCNN_SIMPLEOMP=ON -DNCNN_AVX=OFF -DCMAKE_BUILD_TYPE=Release -G \"MinGW Makefiles\" ..\nmake -j2\nmake install\n```\n\n### 1.3 使用 Visual Studio (MSVC)\n\n需要安装支持 Windows XP 的 v141_xp 工具集：\n\n1. 打开 Visual Studio 安装程序（工具 → 获取工具和功能）\n2. 选择\"使用 C++ 的桌面开发\"\n3. 在摘要部分选择\"对 C++ 的 Windows XP 支持\"\n4. 点击修改\n\n```bash\ncd <ncnn-root-dir>\nmkdir build\ncd build\ncmake -A WIN32 -G \"Visual Studio 17 2022\" -T v141_xp -DNCNN_SIMPLEOCV=ON -DNCNN_OPENMP=OFF -DNCNN_AVX=OFF -DNCNN_BUILD_WITH_STATIC_CRT=ON -DCMAKE_TOOLCHAIN_FILE=../toolchains/windows-xp-msvc.toolchain.cmake ..\ncmake --build . --config Release -j 2\ncmake --build . --config Release --target install\n```\n\n## 2. 
测试\n\n### 2.1 benchncnn\n\n将benchmark目录下的所有文件复制到build/benchmark目录下。在命令提示符中cd到build/benchmark， 然后运行\n\n```bash\nbenchncnn [测试的循环次数] [线程数] [节能模式]\n```\n\n其中，节能模式取值为0时关闭，为1时打开。\n\n### 2.2 examples\n\n从[这里](https://github.com/nihui/ncnn-assets/tree/master/models)可以下载到所有需要的param和bin文件。需要注意的是，ZF_faster_rcnn_final.bin开头的三个文件（.zip，.z01，.z02）最好先放在主机上解压出bin文件再传进虚拟机。\n\n把这些文件放在build/examples目录下。\n\n我写了一个bat脚本来批量测试这些模型：\n\n```batch\n@echo off\nsetlocal enabledelayedexpansion\n\nset EXAMPLES_DIR=<ncnn-root-dir>\\BUILD\\EXAMPLES\nset IMAGE_PATH=<ncnn-root-dir>\\IMAGES\\256-ncnn.png\nset LOG_FILE=test_results.log\n\necho NCNN Examples Test Results > %LOG_FILE%\necho ========================= >> %LOG_FILE%\necho Test started: %date% %time% >> %LOG_FILE%\necho. >> %LOG_FILE%\n\nfor %%f in (\"%EXAMPLES_DIR%\\*.exe\") do (\n    set EXE_NAME=%%~nf\n    set EXE_PATH=%%f\n    echo Testing: !EXE_NAME! >> %LOG_FILE%\n    echo -------------------------------- >> %LOG_FILE%\n\n    !EXE_PATH! \"%IMAGE_PATH%\" >> %LOG_FILE% 2>&1\n\n    if errorlevel 1 (\n        echo [ERROR] !EXE_NAME! failed to run. >> %LOG_FILE%\n    ) else (\n        echo [SUCCESS] !EXE_NAME! completed. >> %LOG_FILE%\n    )\n    echo. >> %LOG_FILE%\n)\n\necho Test finished: %date% %time% >> %LOG_FILE%\necho Results saved to %LOG_FILE%\nendlocal\n```\n\n把这个bat脚本放在build/examples目录下，替换掉所有的`<ncnn-root-dir>`，双击运行。通过生成的test_results.log即可查看所有模型的结果。\n\n通过修改`set IMAGE_PATH=<ncnn-root-dir>\\IMAGES\\256-ncnn.png`中的路径来更换需要测试的文件。"
  },
  {
    "path": "docs/developer-guide/custom-allocator.md",
    "content": "Mat structure is now allocator-aware via an extra allocator parameter with default zero value.\n\nThe good-old ncnn::fastMalloc()/ncnn::fastFree() will be used for a null allocator.\n\nYou could pass a custom allocator to delegate all memory allocation and deallocation.\n\n```cpp\nclass Allocator\n{\npublic:\n    virtual void* fastMalloc(size_t size) = 0;\n    virtual void fastFree(void* ptr) = 0;\n};\n```\n\nncnn has already implemented two simple pooled Allocator class, with mutex lock or without it.\n\n```cpp\nncnn::PoolAllocator locked_mempool;\nncnn::UnlockedPoolAllocator unlocked_mempool;\n```\n\nthe two allocator types in ncnn\n\n* blob allocator\n\n    used to allocate memory for all named blobs, which you could retrieve by Extractor::extract()\n* workspace allocator\n\n    used to allocate memory for internal temporary use in layer implementation, such as the temp blob after padding in convolution\n\nby default, all Extractor instance use the two allocator in the default option\nYou can alter them by ncnn::set_default_option()\nor you can set them per Extractor by Extractor::set_blob_allocator()/Extractor::set_workspace_allocator()\n\nblob allocator is guaranteed to be called in-order in layer implementation during each Extractor lifecycle\nwhile workspace allocator may be called synchronously\n\nthe practical usage\n\n* one network, one-by-one inference\n\n    shared unlocked blob allocator for all Extractor\n\n    shared locked workspace allocator for all Extractor\n\n* one network, concurrent inference\n\n    shared unlocked blob allocator for all Extractor in each thread\n\n    shared locked workspace allocator for all Extractor among all threads\n\n* concurrent multiple networks, one-by-one inference for each network\n\n    shared unlocked blob allocator for all Extractor of each network\n\n    shared locked workspace allocator for all Extractor among all networks (for saving memory)\n\n* concurrent multiple networks, concurrent 
inference for each network\n\n    shared unlocked blob allocator for all Extractor of each network in each thread\n\n    shared locked workspace allocator for all Extractor among all networks (for saving memory)\n"
  },
  {
    "path": "docs/developer-guide/element-packing.md",
    "content": "### what is packing and why\n\npacking is the form of storing multiple short-sized values as one long-sized value.\n\nelement packing is well mapped with the underlying simd register, which usually use one very wide register to store different types of values.\n\n|C|elemsize|elempack|\n|---|---|---|\n|double|8|1|\n|float|4|1|\n|int|4|1|\n|short|2|1|\n|signed char|1|1|\n\n|arm neon|elemsize|elempack|\n|---|---|---|\n|float64x2_t|16|2|\n|float32x4_t|16|4|\n|int32x4_t|16|4|\n|float16x4_t|8|4|\n|int8x8_t|8|8|\n\nThough the real count of values doubles when elempack is two, the wide-sized value is still treated as one value in the view of Mat structure. For example, we want to store 40 float values in Mat object, if elempack 1 is used, Mat width is then 40, while 10 if elempack 4 is used.\n\n|dims|w|h|c|cstep|elemsize|elempack|\n|---|---|---|---|---|---|---|\n|1|40|1|1|40|4|1|\n|1|10|1|1|10|16|4|\n\n### packing style convention\n\nIn practice, elempack 1, 4, 8 are the most common cases. 
It is possible to use any other packing style in theory.\n\nThe following table show the packing axis used in ncnn for different dimension.\n\n|dims|packing axis|shape before packing|shape after packing|\n|---|---|---|---|\n|1|w|w|w/elempack|\n|2|h|w, h|w, h/elempack|\n|3|c|w, h, c|w, h, c/elempack|\n\nIf the packing axis dim is not evenly divisible by elempack, zero padding may be used.\n\n```\noutw = (w + elempack - 1) / elempack;\n```\n\nThe following snippet shows the memory layout after elempack=4 on 3-dim Mat\n\n```\n// w=2 h=3 c=4 elempack=1\n0 1\n2 3\n4 5\n\n6 7\n8 9\n10 11\n\n12 13\n14 15\n16 17\n\n18 19\n20 21\n22 23\n\n// w=2 h=3 c=1 elempack=4\n(0,6,12,18) (1,7,13,19)\n(2,8,14,20) (3,9,15,21)\n(4,10,16,22) (5,11,17,23)\n```\n\n### how to convert elempack\n\nThere is a convenient wrapper function provided\n```\n// convert to elempack 4 if packing axis dim is evenly divisible by elempack\n// return the identity Mat otherwise\nncnn::Mat a;\nncnn::Mat a_packed;\nncnn::convert_packing(a, a_packed, 4);\nif (a_packed.elempack == 4)\n{\n    // check if packing is successful\n}\n\n// convert to packing 1, aka unpacking, shall be always successful\nncnn::Mat b;\nncnn::Mat b_unpacked;\nncnn::convert_packing(b, b_unpacked, 1);\n```\n\n### handle general interleaved data\n\nHere is an example of using convert packing to convert RGB interleaved data to planar\n\n**NOTE:** The following code is just presented to explain what packing is and the conversion process. Do not use it in production due to its poor performance. Do use ncnn::Mat::from_pixels()\n\n```cpp\n// rgb_interleaved_u8 is RGB RGB RGB ...\n// rgb_interleaved_u8.w = w;\n// rgb_interleaved_u8.h = h;\n// rgb_interleaved_u8.c = 1;\n// rgb_interleaved_u8.elemsize = 3;\n// rgb_interleaved_u8.elempack = 3;\n\nncnn::Mat rgb_interleaved_u8(w, h, 1, 3, 3);\nncnn::Mat rgb_planar_u8;\n\nncnn::convert_packing(rgb_interleaved_u8, rgb_planar_u8, 1);\n\n// rgb_planar_u8 is now RRR ... GGG ... 
BBB ...\n// rgb_planar_u8.w = w;\n// rgb_planar_u8.h = h;\n// rgb_planar_u8.c = 3;\n// rgb_planar_u8.elemsize = 1;\n// rgb_planar_u8.elempack = 1;\n```\n"
  },
  {
    "path": "docs/developer-guide/expression.md",
    "content": "### expression\n\nexpression is used in the reshape slice parameter to express the dynamic shape or subscript value based on the expression formula and input shape\n\nCompared with directly converting the expression calculation process into multiple operators, the motivation for using expression\n* No additional shape concat and other operators will be generated due to dynamic calculation, which greatly reduces the number of layers of the ncnn model and makes it easier to view the model structure and modify expression\n* Shape or subscript evaluations are usually single-digit operations, which are more suitable for direct completion on the CPU without layout conversion and kernel call overhead\n\nIn the param file, `Reshape` layer can contain 6=expression\n\nThe pnnx tool can automatically convert `pnnx.Expression` to the expr parameter of ncnn `Reshape`\n\n* Convert to 0w, 0h, 0d or 0c according to the input shape rank and `size(@0,1)`\n* Automatically remove the batch dimension according to the input batch index\n* Convert `pnnx.Expression` and `Tensor.reshape`/`Tensor.view` two operators are fused into ncnn `Reshape`\n* Automatically summarize the number of references, exclude duplicate references and sort the indexes of references\n* Convert the customary shape representation order, such as CHW to WHC\n\nExample pnnx.param where A and B are 3D tensors\n```\npnnx.Expression  expr     2 1 A B shape expr=[add(size(@1,0),2),mul(size(@0,1),2),-1]\nTensor.reshape   reshape  2 1 A shape out\n```\n\npnnx.py\n```python\nshape = [(B.size(0) + 2), (A.size(1) * 2), -1]\nout = A.reshape(*shape)\n```\n\nConverted to ncnn.param\n```\nReshape          reshape  2 1 A B out 6=\"-1,*(0h,2),+(1c,2)\"\n```\n\n### syntax\n\nUse infix expression, format is `op(arg0,arg1,...)`, multiple operations can be nested, multiple sizes are separated by commas, and numbers can be integers or decimals\n\nAmong them, the commonly used `add` `sub` `mul` `div` `floor_div` are 
abbreviated as `+` `-` `*` `/` `//`, and other arithmetic operations use names, such as `sin` `ceil` `max`, etc.\n\n* `max(2,3)`\n* `floor(sin(3.14))`\n* `+(*(-2,1),10)` means (-2 * 1) + 10\n* `1,2,+(3,2)` list can represent output shape with 3-rank\n\nThe input shape can be referenced at runtime, format is `id(w|h|d|c)`, the maximum id is 9, which means that up to 10 inputs can be referenced\n\nAssuming that the Reshape layer has two input blobs, A and B, then\n\n* `0w,1h` means A.w, B.h\n* `*(+(0c,1c),2)` means (A.c + B.c) * 2\n\n### helper api\n\n```cpp\n#include \"expression.h\"\n\nint count_expression_blobs(const std::string& expr);\n\nint eval_list_expression(const std::string& expr, const std::vector<Mat>& blobs, std::vector<int>& outlist);\n```\n\n* `count_expression_blobs`\n\nPass expression to get the number of inputs it references, such as `0w,1h` returns 2\n\n* `eval_list_expression`\n\nEvaluate the result list according to expression and input blob calculate. If the calculation result is a floating point number, it will be automatically truncated to an integer.\n\n### supported operator\n\n|type|operators|\n|---|---|\n|float to int|`trunc` `ceil` `floor` `round`|\n|binary arithmetic|`+` `-` `*` `/` `//` `max` `min` `pow` `fmod` `remainder` `atan2` `logaddexp`|\n|unary arithmetic|`abs` `neg` `sign` `square` `sqrt` `rsqrt` `reciprocal` `exp` `log` `log10` `sin` `asin` `cos` `acos` `tan` `atan` `sinh` `asinh` `cosh` `acosh` `tanh` `atanh`|\n|integer bitwise|`and` `or` `xor` `lshift` `rshift`|\n"
  },
  {
    "path": "docs/developer-guide/glsl-extension.md",
    "content": "# ncnn GLSL extension\n\n## rationale\nDifferent GPUs support different features, some support fp16 as buffer storage type, some support fp16 as operand variable, some old GPUs only support fp32\n\nWhen the GPU supports the `VK_KHR_16bit_storage` extension, in order to minimize the memory bandwidth consumption of the GPU, we will give priority to using fp16 as the storage type. Otherwise, we use `packHalf2x16` and `unpackHalf2x16` in GLSL 4.2 to compress 2 fp32 to uint, reducing read and write bandwidth.\n\nSimilarly, when the gpu supports the `VK_KHR_shader_float16_int8` extension, in order to speed up the calculation efficiency, we will give priority to using fp16 as the operation operand, which usually doubles the speed. Otherwise, we use fp32.\n\nTo ensure the widest compatibility, the following code for declaring descriptor binding and loading data will be written\n\n```c\n#if NCNN_fp16_storage // gpu supports 16bit storage\nlayout (binding = 0) buffer blob { f16vec4 blob_data[]; };\n#elif NCNN_fp16_packed // gpu supports GLSL 4.2\nlayout (binding = 0) buffer blob { uvec2 blob_data[]; };\n#else // gpu only supports fp32\nlayout (binding = 0) buffer blob { vec4 blob_data[]; };\n#endif\n\nvoid main()\n{\n    const int i = int(gl_GlobalInvocationID.x);\n\n#if NCNN_fp16_storage && NCNN_fp16_arithmetic // gpu supports 16bit storage and shader float16\n    f16vec4 x = blob_data[i];\n#elif NCNN_fp16_storage // gpu supports 16bit storage but no shader float16\n    vec4 x = vec4(blob_data[i]);\n#elif NCNN_fp16_packed && NCNN_fp16_arithmetic // gpu supports GLSL 4.2 and shader float16\n    f16vec4 x = f16vec4(unpackFloat2x16(blob_data[i].x), unpackFloat2x16(blob_data[i].y));\n#elif NCNN_fp16_packed // gpu supports GLSL 4.2\n    vec4 x = vec4(unpackHalf2x16(blob_data[i].x), unpackHalf2x16(blob_data[i].y));\n#else // gpu only supports fp32\n    vec4 x = blob_data[i];\n#endif\n}\n```\n\nAs you can see, just declaring the buffer type and reading a value 
consumes a lot of lines of code, which is a maintenance nightmare. Therefore, ncnn adds more flexible data types and auxiliary functions that reduce code size and improve readability, and that automatically expand to the most efficient implementation for the feature level supported by the GPU.\n\nWith the ncnn glsl extension, the above code can be simplified to\n\n```c\nlayout (binding = 0) buffer blob { sfpvec4 blob_data[]; };\n\nvoid main()\n{\n    const int i = int(gl_GlobalInvocationID.x);\n\n    afpvec4 x = buffer_ld4(blob_data, i);\n}\n```\n\nThe ncnn glsl extension provides the necessary data types for storage, computation, and shared memory, plus load, store, and conversion functions for buffers and images. We also provide some buffer and image copy functions to prevent loss of precision when fp16 is the intermediate data type, and to avoid unnecessary `unpackHalf2x16` and `packHalf2x16` pairs.\n\n# entrypoint for compiling GLSL\n\nThe gpu.h header in the ncnn library exposes 3 functions for compiling glsl code into a spir-v binary. All of them support the ncnn glsl extension and accept an `Option` argument that controls how the extension is expanded. The first two accept raw glsl code strings, and the last one compiles one of ncnn's built-in shaders.\n\n```cpp\nnamespace ncnn {\n\n// online spirv compilation\nNCNN_EXPORT int compile_spirv_module(const char* comp_string, const Option& opt, std::vector<uint32_t>& spirv);\nNCNN_EXPORT int compile_spirv_module(const char* comp_data, int comp_data_size, const Option& opt, std::vector<uint32_t>& spirv);\nNCNN_EXPORT int compile_spirv_module(int shader_type_index, const Option& opt, std::vector<uint32_t>& spirv);\n\n} // namespace ncnn\n```\n\n## compile ncnn extended GLSL code directly\n\nYou can write shader code with the ncnn glsl extension and compile it to spir-v using ncnn functions. 
The compiled product is a standard-compliant spir-v binary, which can be directly used to create a pipeline object in the vulkan api\n\n```cpp\nstatic const char my_glsl_data[] = R\"(\n#version 450\n\nlayout (binding = 0) readonly buffer a_blob { sfpvec4 a_blob_data[]; };\nlayout (binding = 1) writeonly buffer b_blob { sfpvec4 b_blob_data[]; };\n\nvoid main()\n{\n    const int i = int(gl_GlobalInvocationID.x);\n\n    afpvec4 v = buffer_ld4(a_blob_data, i);\n\n    v = v + 123;\n\n    buffer_st4(b_blob_data, i, v);\n}\n)\";\n\nncnn::Option opt;\n// you can control the extension behavior,\n// e.g. turn off fp16 storage even if the gpu supports 16bit storage\nopt.use_fp16_storage = false;\n\nstd::vector<uint32_t> spirv;\nncnn::compile_spirv_module(my_glsl_data, sizeof(my_glsl_data) - 1, opt, spirv);\n\n// To create pipeline object later\n// ncnn::Pipeline pipeline(vkdev);\n// pipeline.set_local_size_xyz(64, 1, 1);\n// pipeline.create(spirv.data(), spirv.size() * 4, specializations);\n```\n\n## ncnn built-in shader\n\nThe shader indexes inside ncnn are exposed in the `layer_shader_type.h` header and can be used if needed\n\n```cpp\n#include \"layer_shader_type.h\"\n\nint shader_type_index = ncnn::LayerShaderType::convert_ycbcr;\n\nncnn::Option opt;\n\nstd::vector<uint32_t> spirv;\nint retc = ncnn::compile_spirv_module(shader_type_index, opt, spirv);\n```\n\n# data types\n\n## storage type\n\ndeclare buffer data layout in descriptor binding\n\n```c\nlayout (binding = 0) buffer top_blob { sfpvec4 top_blob_data[]; };\n```\n\n|storage type|fp32|fp16p|fp16s|bf16p|bf16s|\n|---|---|---|---|---|---|\n|sfp|float|uint|float16_t|uint|bfloat16_t|\n|sfpvec2|vec2|uint|f16vec2|uint|bf16vec2|\n|sfpvec4|vec4|uvec2|f16vec4|uvec2|bf16vec4|\n\n## arithmetic type\n\ndeclare local variable in glsl code\n\n```c\nvoid main()\n{\n    afpvec4 v = a * b;\n}\n```\n\n|arithmetic type|fp32|fp16a|\n|---|---|---|\n|afp|float|float16_t|\n|afpvec2|vec2|f16vec2|\n|afpvec4|vec4|f16vec4|\n\n## local type\n\ndeclare variable in shared local 
memory\n\n```c\nshared lfp tmp_a[8][4][2];\n```\n\n|local type|fp32|fp16p / fp16s only|fp16s+fp16a|fp16s+fp16u|bf16p|bf16s|\n|---|---|---|---|---|---|---|\n|lfp|float|float|float|float16_t|float|bfloat16_t|\n|lfpvec4|vec4|uvec2|uint64_t|f16vec4|uvec2|bf16vec4|\n\n# buffer functions\n\n- load typed value from src[offset]\n\n```c\nafp buffer_ld1(sfp src, int offset);\nafpvec2 buffer_ld2(sfpvec2 src, int offset);\nafpvec4 buffer_ld4(sfpvec4 src, int offset);\n```\n\n- store typed value to dst[offset]\n\n```c\nvoid buffer_st1(sfp dst, int offset, afp v);\nvoid buffer_st2(sfpvec2 dst, int offset, afpvec2 v);\nvoid buffer_st4(sfpvec4 dst, int offset, afpvec4 v);\n```\n\n- copy typed value from src[src_offset] to dst[dst_offset]\n\n```c\nvoid buffer_cp1(sfp dst, int dst_offset, sfp src, int src_offset);\nvoid buffer_cp2(sfpvec2 dst, int dst_offset, sfpvec2 src, int src_offset);\nvoid buffer_cp4(sfpvec4 dst, int dst_offset, sfpvec4 src, int src_offset);\n```\n\n- copy and pack value from src[src_offsets[0],src_offsets[1],...] to dst[dst_offset]\n\n```c\nvoid buffer_cp1to4(sfpvec4 dst, int dst_offset, sfp src, ivec4 src_offsets);\n```\n\n- copy and unpack value from src[src_offset] to dst[dst_offsets[0],dst_offsets[1],...]\n\n```c\nvoid buffer_cp4to1(sfp dst, ivec4 dst_offsets, sfpvec4 src, int src_offset);\n```\n# local data conversion functions\n\n- storage buffer to local memory\n\n```c\nlfp buffer_sm1(sfp src, int offset);\nlfpvec4 buffer_sm4(sfpvec4 src, int offset);\n```\n\n- local memory to local variable\n\n```c\nafp lfp2afp(lfp v);\nafpvec4 lfp2afpvec4(lfpvec4 v);\n```\n\n- local variable to local memory\n\n```c\nlfp afp2lfp(afp v);\nlfpvec4 afp2lfpvec4(afpvec4 v);\n```\n\nNote: The common usage of local memory is to read from global memory first, store it in local memory, and then read local variables from local memory for subsequent use. 
Therefore, only storage type to local type and local type to arithmetic type conversion functions are provided here.\n\n# misc functions\n\n- prefer specialization constant over push constant\n\n```c\nT psc(T x)\n```\n\nDeclare the same variable in both the specialization constant section and the push constant section. `psc(x)` then becomes a compile-time constant when the specialization constant is given a non-zero value, and falls back to the dynamic push constant value otherwise. This is often used for tensor shape specialization: we can usually resolve all shape information ahead of time and turn it into compile-time constants for more aggressive shader optimization.\n\n```c\nlayout (constant_id = 0) const int size = 0;\n\nlayout (push_constant) uniform parameter\n{\n    int size;\n} p;\n\nvoid main()\n{\n    const int s = psc(size);\n}\n```\n\n# platform macros\n\nCheck whether the current platform is moltenvk, to enable platform-specific workarounds\n\n```c\n#if NCNN_moltenvk\n// enable workaround for moltenvk\n#endif\n```\n\nncnn adds additional macro definitions in newer versions, which may conflict with or cause confusion in existing glsl code. 
In order to obtain cross-version compatibility of ncnn, you can switch between the old and new codes according to the `ncnn_glsl_version` macro version.\n\n```c\n#if ncnn_glsl_version >= 1\n// use device macros introduced since version 1\n#endif\n```\n\nncnn additionally defines most of the vulkan device-related features as macros, which we can use to distinguish different platforms, device extensions, features, and properties\n\n### extension macros\n\nWhen the device supports an extension, `ncnn_<extension_name>` is defined as the extension version\n\n```c\nvoid main()\n{\n#if ncnn_VK_KHR_16bit_storage\n    // here is the code for any device that supports VK_KHR_16bit_storage\n#endif\n\n#if ncnn_VK_KHR_sampler_ycbcr_conversion >= 10\n    // here is the code for any device that supports VK_KHR_sampler_ycbcr_conversion and version >= 10\n#endif\n}\n```\n\n### device feature and property macros\n\nncnn will query device features and properties and then define them as macros.\n\nThe macro name is `ncnn_<feature_name>` or `ncnn_<property_name>`\n\nThe `GL_EXT_shader_explicit_arithmetic_types_int64` extension will be automatically enabled without explicit code indication when the device supports `shaderInt64`\n\nThe `GL_EXT_shader_explicit_arithmetic_types_int16` extension will be automatically enabled without explicit code indication when the device supports `shaderInt16`\n\n```c\nvoid main()\n{\n#if ncnn_robustBufferAccess\n    // here is the code for any device that supports robustBufferAccess feature\n#endif\n\n#if ncnn_vendorID == 4318\n    // here is the vendor specific code, 4318 is nvidia graphics\n#endif\n\n#if ncnn_subgroupSize == 32\n    // here is the code path optimized for subgroup_size == 32\n#endif\n\n    // use macro definitions\n    uint size; // dynamic value from some previous routines\n    if (size < ncnn_subgroupSize)\n    {\n#if ncnn_supportedOperations & 4\n        // subgroup support arithmetic\n#endif\n\n#if ncnn_subgroup_arithmetic\n        
// shorthand style for checking subgroup arithmetic :P\n#endif\n    }\n}\n```\n\n### validation layer macros\n\nncnn defines some additional convenience macros when the vulkan validation layer is enabled\n\n* `ncnn_enable_validation_layer`\n* `NCNN_LOGE`\n\nCurrently, you have to change the `ENABLE_VALIDATION_LAYER` definition at the beginning of `src/gpu.cpp` to `1` to enable these macros.\n\nThe `GL_EXT_debug_printf` extension is enabled automatically; you do not need to enable it explicitly in your code.\n\n```c\nvoid main()\n{\n    int gx = int(gl_GlobalInvocationID.x);\n\n#if ncnn_enable_validation_layer\n    NCNN_LOGE(\"gx = %d\\n\", gx);\n#endif\n}\n```\n\nAt runtime, `NCNN_LOGE` will print out the value of `gx`\n\n### option macros\n\nSome glsl extensions are enabled only if the user turns on the corresponding options\n\nThe `GL_EXT_shader_16bit_storage` extension is enabled automatically, without an explicit directive in the code, when the device supports 16-bit storage and the user turns on `opt.use_fp16_storage` or `opt.use_bf16_storage`\n\nThe `GL_EXT_shader_explicit_arithmetic_types_float16` extension is enabled automatically, without an explicit directive in the code, when the device supports 16-bit arithmetic and the user turns on `opt.use_fp16_arithmetic`\n\nThe `GL_EXT_shader_8bit_storage` extension is enabled automatically, without an explicit directive in the code, when the device supports 8-bit storage and the user turns on `opt.use_int8_storage`\n\nThe `GL_EXT_shader_explicit_arithmetic_types_int8` extension is enabled automatically, without an explicit directive in the code, when the device supports 8-bit arithmetic and the user turns on `opt.use_int8_arithmetic`\n\nThe `GL_EXT_bfloat16` extension is enabled automatically, without an explicit directive in the code, when the device supports bfloat16 storage and the user turns on `opt.use_bf16_storage`\n\n```c\nvoid main()\n{\n#if NCNN_fp16_storage\n    // the user enabled the fp16 storage option and the device supports fp16 storage\n#endif\n\n#if 
NCNN_fp16_arithmetic\n    // the user enabled the fp16 arithmetic option and the device supports fp16 arithmetic\n#endif\n}\n```\n\n|macro|defined by option|\n|---|---|\n|NCNN_fp16_packed|opt.use_fp16_packed|\n|NCNN_fp16_storage|opt.use_fp16_storage|\n|NCNN_fp16_arithmetic|opt.use_fp16_arithmetic|\n|NCNN_int8_packed|opt.use_int8_packed|\n|NCNN_int8_storage|opt.use_int8_storage|\n|NCNN_int8_arithmetic|opt.use_int8_arithmetic|\n|NCNN_bf16_packed|opt.use_bf16_packed|\n|NCNN_bf16_storage|opt.use_bf16_storage|\n|NCNN_shader_local_memory|opt.use_shader_local_memory|\n"
  },
  {
    "path": "docs/developer-guide/glsl-extension.zh.md",
    "content": "# ncnn GLSL 扩展\n\n## 理由\n不同的 GPU 支持不同的功能，有的支持 fp16 作为缓冲存储类型，有的支持 fp16 作为操作数变量，有的老 GPU 只支持 fp32。\n\n当 GPU 支持 `VK_KHR_16bit_storage` 扩展时，为了尽量减少 GPU 的内存带宽消耗，我们会优先使用 fp16 作为存储类型。否则，我们使用 `packHalf2x16` 和 `unpackHalf2x16` 在 GLSL 4.2 中将 2 个 fp32 压缩为 uint，从而减少读写带宽。\n\n同样，当 GPU 支持 `VK_KHR_shader_float16_int8` 扩展时，为了加快计算效率，我们会优先使用 fp16 作为运算操作数，这通常会使速度翻倍。否则，我们使用 fp32。\n\n为了确保最广泛的兼容性，将编写以下用于声明描述符绑定和加载数据的代码\n\n```c\n#if NCNN_fp16_storage // GPU支持 16bit storage\nlayout (binding = 0) buffer blob { f16vec4 blob_data[]; };\n#elif NCNN_fp16_packed // GPU支持 GLSL 4.2\nlayout (binding = 0) buffer blob { uvec2 blob_data[]; };\n#else // GPU仅支持 fp32\nlayout (binding = 0) buffer blob { vec4 blob_data[]; };\n#endif\n\nvoid main()\n{\n    const int i = int(gl_GlobalInvocationID.x);\n\n#if NCNN_fp16_storage && NCNN_fp16_arithmetic // GPU支持 16bit storage 和 shader float16\n    f16vec4 x = blob_data[i];\n#elif NCNN_fp16_storage // GPU支持 16bit storage 但不包含 shader float16\n    vec4 x = vec4(blob_data[i]);\n#elif NCNN_fp16_packed && NCNN_fp16_arithmetic // GPU支持 GLSL 4.2 和 shader float16\n    f16vec4 x = f16vec4(unpackFloat2x16(blob_data[i].x), unpackFloat2x16(blob_data[i].y));\n#elif NCNN_fp16_packed // GPU支持 GLSL 4.2\n    vec4 x = vec4(unpackHalf2x16(blob_data[i].x), unpackHalf2x16(blob_data[i].y));\n#else // GPU仅支持 fp32\n    vec4 x = blob_data[i];\n#endif\n}\n```\n\n如您所见，仅声明缓冲区类型并读取值会消耗大量代码行，这是项目维护的噩梦。因此，ncnn 增加了更灵活的数据类型和辅助函数，以减小代码的大小并提高可读性，并且会根据 GPU 支持的功能级别自动扩展到最高效的实现。\n\n上面的代码，通过使用 ncnn GLSL 扩展，可以简化为\n\n```c\nlayout (binding = 0) buffer blob { sfpvec4 blob_data[]; };\n\nvoid main()\n{\n    const int i = int(gl_GlobalInvocationID.x);\n\n    afpvec4 x = buffer_ld4(blob_data, i);\n}\n```\n\nncnn GLSL 扩展为存储、计算、共享内存以及缓冲区和图像的加载、存储、转换函数提供了必要的数据类型。我们还提供了一些缓冲区和图像复制函数，以防止在使用 fp16 作为中间数据类型时丢失精度，并避免不必要的 `unpackHalf2x16` 和 `packHalf2x16` 配对。\n\n# 编译GLSL的入口点\n\nncnn库中的 gpu.h 头文件公开了3个用于将 GLSL 代码编译为 Spir-V 二进制的API函数，它们支持 ncnn GLSL 扩展，这3个函数接受 opt switch 来控制 ncnn GLSL 扩展形式。前两个函数接受原始 GLSL 
代码字符串作为参数，最后一个函数用于创建 ncnn 的已存在的内置着色器。\n\n```cpp\nnamespace ncnn {\n\n// 在线 Spir-V 编译器\nNCNN_EXPORT int compile_spirv_module(const char* comp_string, const Option& opt, std::vector<uint32_t>& spirv);\nNCNN_EXPORT int compile_spirv_module(const char* comp_data, int comp_data_size, const Option& opt, std::vector<uint32_t>& spirv);\nNCNN_EXPORT int compile_spirv_module(int shader_type_index, const Option& opt, std::vector<uint32_t>& spirv);\n\n} // namespace ncnn\n```\n\n## 直接编译ncnn扩展GLSL代码\n\n您可以使用 ncnn GLSL 扩展编写着色器代码，使用 ncnn 函数编译为 Spir-V。编译后的产品是符合标准的 Spir-V 二进制文件，可以直接用于在 Vulkan API 中创建流水线对象\n\n```cpp\nstatic const char my_glsl_data[] = R\"(\n#version 450\n\nlayout (binding = 0) readonly buffer a_blob { sfpvec4 a_blob_data[]; };\nlayout (binding = 1) writeonly buffer b_blob { sfpvec4 b_blob_data[]; };\n\nvoid main()\n{\n    const int i = int(gl_GlobalInvocationID.x);\n\n    afpvec4 v = buffer_ld4(a_blob_data, i);\n\n    v = v + 123;\n\n    buffer_st4(b_blob_data, i, v);\n}\n)\";\n\nOption opt;\n // 您可以控制Vulkan扩展行为\n // 当GPU支持16位存储的话\nopt.use_fp16_storage = false;\n\nstd::vector<uint32_t> spirv;\nncnn::compile_spirv_module(my_glsl_data, sizeof(my_glsl_data) - 1, opt, spirv);\n\n// 稍后再创建管道对象\n// ncnn::Pipeline pipeline(vkdev);\n// pipeline.set_local_size_xyz(64, 1, 1);\n// pipeline.create(spirv.data(), spirv.size() * 4, specializations);\n```\n\n## ncnn内置着色器\n\nncnn内部的着色器索引在标头中公开，如果需要可以使用 `layer_shader_type.h`\n\n```cpp\n#include \"layer_shader_type.h\"\n\nint shader_type_index = LayerShaderType::convert_ycbcr;\n\nOption opt;\n\nstd::vector<uint32_t> spirv;\nint retc = compile_spirv_module(shader_type_index, opt, spirv);\n```\n\n# 数据类型\n\n## 存储类型(storage type)\n\n在描述符绑定中声明缓冲区数据布局\n\n```c\nlayout (binding = 0) buffer top_blob { sfpvec4 top_blob_data[]; };\n```\n\n|存储类型|fp32|fp16p|fp16s|bf16p|bf16s|\n|---|---|---|---|---|---|\n|sfp|float|uint|float16_t|uint|bfloat16_t|\n|sfpvec2|vec2|uint|f16vec2|uint|bf16vec2|\n|sfpvec4|vec4|uvec2|f16vec4|uvec2|bf16vec4|\n\n## 
算术类型(arithmetic type)\n\n在 GLSL 代码中声明局部变量\n\n```c\nvoid main()\n{\n    afpvec4 v = a * b;\n}\n```\n\n|算术类型|fp32|fp16a|\n|---|---|---|\n|afp|float|float16_t|\n|afpvec2|vec2|f16vec2|\n|afpvec4|vec4|f16vec4|\n\n## 本地类型(local type)\n\n在共享本地内存中声明变量\n\n```c\nshared lfp tmp_a[8][4][2];\n```\n\n|本地类型|fp32|fp16p / fp16s only|fp16s+fp16a|fp16s+fp16u|bf16p|bf16s|\n|---|---|---|---|---|---|---|\n|lfp|float|float|float|float16_t|float|bfloat16_t|\n|lfpvec4|vec4|uvec2|uint64_t|f16vec4|uvec2|bf16vec4|\n\n# 缓冲区函数(buffer functions)\n\n- 从 src[offset] 加载已经确定类型的值\n\n```c\nafp buffer_ld1(sfp src, int offset);\nafpvec2 buffer_ld2(sfpvec2 src, int offset);\nafpvec4 buffer_ld4(sfpvec4 src, int offset);\n```\n\n- 将已确定类型的值存储到 dst[偏移量]\n\n```c\nvoid buffer_st1(sfp dst, int offset, afp v);\nvoid buffer_st2(sfpvec2 dst, int offset, afpvec2 v);\nvoid buffer_st4(sfpvec4 dst, int offset, afpvec4 v);\n```\n\n- 从已确定类型 src[src_offset] 的值拷贝到 dst[dst_offset]\n\n```c\nvoid buffer_cp1(sfp dst, int dst_offset, sfp src, int src_offset);\nvoid buffer_cp2(sfpvec2 dst, int dst_offset, sfpvec2 src, int src_offset);\nvoid buffer_cp4(sfpvec4 dst, int dst_offset, sfpvec4 src, int src_offset);\n```\n\n- 从 src[src_offsets[0],src_offsets[1],...] 
的值拷贝并打包到 dst[dst_offset]\n\n```c\nvoid buffer_cp1to4(sfpvec4 dst, int dst_offset, sfp src, ivec4 src_offsets);\n```\n\n- 从 src[src_offset] 的值拷贝并解包到 dst[dst_offsets[0],dst_offsets[1],...]\n\n```c\nvoid buffer_cp4to1(sfp dst, ivec4 dst_offsets, sfpvec4 src, int src_offset);\n```\n\n# 本地数据转换函数\n\n- 存储缓冲区转换到本地内存\n\n```c\nlfp buffer_sm1(sfp src, int offset);\nlfpvec4 buffer_sm4(sfpvec4 src, int offset);\n```\n\n- 本地内存转换到局部变量\n\n```c\nafp lfp2afp(lfp v);\nafpvec4 lfp2afpvec4(lfpvec4 v);\n```\n\n- 局部变量转换到本地内存\n\n```c\nlfp afp2lfp(afp v);\nlfpvec4 afp2lfpvec4(afpvec4 v);\n```\n\n注意：本地内存的常见用法是先从全局内存中读取，存储在本地内存中，然后再从本地内存中读取局部变量以供后续使用。因此，此处仅提供存储类型到本地类型和本地类型到算术类型的转换函数。\n\n# 杂项函数\n\n- 更推荐使用专业化常量(specialization constants)，而不是推动常量(push constants)\n\n```c\nT psc(T x)\n```\n\n在 `专用常量` 和 `推送常量` 部分中声明相同的变量，然后在专用常量给定非零时 `psc(x)` 将成为编译时常量，否则将通过推送常量动态。这通常用于张量形状特化。我们通常可以解析所有形状信息，并使它们成为编译时常量，以实现让着色器得到更积极的优化。\n\n```c\nlayout (constant_id = 0) const int size = 0;\n\nlayout (push_constant) uniform parameter\n{\n    int size;\n} p;\n\nvoid main()\n{\n    const int s = psc(size);\n}\n```\n\n# 平台宏定义\n\n判断当前平台是否为 moltenvk，以启用对于某些特定于平台的解决方法\n\n```c\n#if NCNN_moltenvk\n// 启用moltenvk的解决方法\n#endif\n```\n\nncnn 在新版本中添加了额外的宏定义，可能与现在的 glsl 代码冲突或引起混淆。为了实现  ncnn 的跨版本兼容性，可以根据  `ncnn_glsl_version` 宏的版本号在新旧代码之间进行切换 。\n\n```c\n#if ncnn_glsl_version >= 1\n// 使用自版本 1 起引入的设备宏\n#endif\n```\n\nncnn 额外定义了大多数 vulcan 设备相关功能作为宏，我们可以用来区分不同的平台、设备扩展、功能和属性。\n\n### 扩展宏定义\n\n当设备支持某个扩展时，`ncnn_<extension_name>` 被定义为扩展版本\n\n```c\nvoid main()\n{\n#if ncnn_VK_KHR_16bit_storage\n    // 支持 VK_KHR_16bit_storage 设备的代码\n#endif\n\n#if ncnn_VK_KHR_sampler_ycbcr_conversion >= 10\n    // 支持 VK_KHR_sampler_ycbcr_conversion 且版本 >=10 的代码\n#endif\n}\n```\n\n### 设备特性和属性宏\n\nncnn 会查询设备特性和属性，然后将它们定义为宏。\n\n宏名称为 `ncnn_<feature_name>` 或 `ncnn_<property_name>`\n\n当设备支持 `shaderInt64` 时，`GL_EXT_shader_explicit_arithmetic_types_int64` 扩展会自动启用，无需显式代码指示。\n\n当设备支持 `shaderInt16` 时，`GL_EXT_shader_explicit_arithmetic_types_int16` 
扩展会自动启用，无需显式代码指示。\n\n```c\nvoid main()\n{\n#if ncnn_robustBufferAccess\n    // 支持 robustBufferAccess 特性的设备代码\n#endif\n\n#if ncnn_vendorID == 4318\n    // 供应商特定代码，4318 是 nvidia 显卡\n#endif\n\n#if ncnn_subgroupSize == 32\n    // 为 subgroup_size == 32 优化的代码路径\n#endif\n\n    // 使用宏定义\n    uint size; // 来自先前例程的动态值\n    if (size < ncnn_subgroupSize)\n    {\n#if ncnn_supportedOperations & 4\n        // subgroup 支持算术运算\n#endif\n\n#if ncnn_subgroup_arithmetic\n        // 检查 subgroup 算术运算的简写形式\n#endif\n    }\n}\n```\n\n### 验证层宏定义\n\n当启用 vulkan 验证层时，ncnn 会定义一些额外的便捷宏\n\n* `ncnn_enable_validation_layer`\n* `NCNN_LOGE`\n\n目前，你必须将 `src/gpu.cpp` 开头的 `ENABLE_VALIDATION_LAYER` 定义修改为 `1` 才能启用这些宏。\n\n`GL_EXT_debug_printf` 扩展会自动启用，无需在代码中显式指定。\n\n```c\nvoid main()\n{\n    int gx = int(gl_GlobalInvocationID.x);\n\n#if ncnn_enable_validation_layer\n    NCNN_LOGE(\"gx = %d\\n\", gx);\n#endif\n}\n```\n\n在运行时，`NCNN_LOGE` 将打印出 `gx` 的值\n\n### 选项宏\n\n仅当用户启用某些选项时才启用 GLSL 扩展\n\n`GL_EXT_shader_16bit_storage` 扩展会在设备支持 16 位存储且用户开启了 `opt.use_fp16_storage` 或 `opt.use_bf16_storage` 选项时，自动启用，无需显式代码指示。\n\n`GL_EXT_shader_explicit_arithmetic_types_float16` 扩展会在设备支持 16 位算术运算且用户开启了 `opt.use_fp16_arithmetic` 选项时，自动启用，无需显式代码指示。\n\n`GL_EXT_shader_8bit_storage` 扩展会在设备支持 8 位存储且用户开启了 `opt.use_int8_storage` 选项时，自动启用，无需显式代码指示。\n\n`GL_EXT_shader_explicit_arithmetic_types_int8` 扩展会在设备支持 8 位算术运算且用户开启了 `opt.use_int8_arithmetic` 选项时，自动启用，无需显式代码指示。\n\n`GL_EXT_bfloat16` 扩展会在设备支持 bfloat16 存储且用户开启了 `opt.use_bf16_storage` 选项时，自动启用，无需显式代码指示。\n\n```c\nvoid main()\n{\n#if NCNN_fp16_storage\n    // 用户启用 fp16 存储选项，且设备支持 fp16 存储\n#endif\n\n#if NCNN_fp16_arithmetic\n    // 用户启用 fp16 算术选项，且设备支持 fp16 
算术运算\n#endif\n}\n```\n\n|宏定义|option中所定义的变量|\n|---|---|\n|NCNN_fp16_packed|opt.use_fp16_packed|\n|NCNN_fp16_storage|opt.use_fp16_storage|\n|NCNN_fp16_arithmetic|opt.use_fp16_arithmetic|\n|NCNN_int8_packed|opt.use_int8_packed|\n|NCNN_int8_storage|opt.use_int8_storage|\n|NCNN_int8_arithmetic|opt.use_int8_arithmetic|\n|NCNN_bf16_packed|opt.use_bf16_packed|\n|NCNN_bf16_storage|opt.use_bf16_storage|\n|NCNN_shader_local_memory|opt.use_shader_local_memory|\n"
  },
  {
    "path": "docs/developer-guide/how-to-be-a-contributor.zh.md",
    "content": "### 如何提交代码\n\n#### 一、fork 分支\n在浏览器中打开 [ncnn](https://github.com/tencent/ncnn), `fork` 到自己的 repositories，例如\n```\nhttps://github.com/user/ncnn\n```\n\nclone 项目到本地，添加官方 remote 并 fetch:\n```\n$ git clone https://github.com/user/ncnn && cd ncnn\n$ git remote add tencent https://github.com/tencent/ncnn\n$ git fetch tencent\n```\n对于 `git clone` 下来的项目，它现在有两个 remote，分别是 origin 和 tencent：\n\n```\n$ git remote -v\norigin   https://github.com/user/ncnn (fetch)\norigin   https://github.com/user/ncnn (push)\ntencent  https://github.com/Tencent/ncnn (fetch)\ntencent  https://github.com/Tencent/ncnn (push)\n```\norigin 指向你 fork 的仓库地址；remote 即官方 repo。可以基于不同的 remote 创建和提交分支。\n\n例如切换到官方 master 分支，并基于此创建自己的分支（命名尽量言简意赅。一个分支只做一件事，方便 review 和 revert）\n```\n$ git checkout tencent/master\n$ git checkout -b add-conv-int8\n```\n\n或创建分支时指定基于官方 master 分支：\n```\n$ git checkout -b fix-typo-in-document tencent/master\n```\n\n> `git fetch` 是从远程获取最新代码到本地。如果是第二次 pr ncnn，直接从  `git fetch tencent` 开始即可，不需要 `git remote add tencent`，也不需要修改 `github.com/user/ncnn`。\n\n#### 二、代码习惯\n为了增加沟通效率，reviewer 一般要求 contributor 遵从以下规则\n\n* `if-else`和花括号`{`中间需要换行\n* 不能随意增删空行\n* tab 替换为 4 个空格\n* 为了保证平台兼容性，目前不使用`c++11`，`src`目录下尽量避免使用`template`\n* 若是新增功能或平台，`test`目录需有对应测试用例\n* 文档放到`doc`对应目录下，中文用`.zh.md`做后缀；英文直接用`.md`后缀\n\n开发完成后提交到自己的 repository\n```\n$ git commit -a\n$ git push origin add-conv-int8\n```\n推荐使用 [`commitizen`](https://pypi.org/project/commitizen/) 或 [`gitlint`](https://jorisroovers.com/gitlint/) 等工具格式化 commit message，方便事后检索海量提交记录\n\n#### 三、代码提交\n浏览器中打开 [ncnn pulls](https://github.com/Tencent/ncnn/pulls) ，此时应有此分支 pr 提示，点击 `Compare & pull request`\n\n* 标题**必须**是英文。未完成的分支应以 `WIP:` 开头，例如 `WIP: add conv int8`\n* 正文宜包含以下内容，中英不限\n    * 内容概述和实现方式\n    * 功能或性能测试\n    * 测试结果\n\nCI 已集成了自动格式化，restyled-io 会在 pr 的同时生成 `Restyled add conv int8`，需要 merge 自动 restyled 的分支，例如\n```\n$ git fetch tencent\n$ git checkout add-conv-int8\n$ git merge tencent/restyled/pull-2078\n$ git push origin 
add-conv-int8\n```\n回到浏览器签署  CLA，所有 CI 测试通过后通知 reviewer merge 此分支。\n\n#### 四、彩蛋\n留下个人 qq 号会触发隐藏事件。"
  },
  {
    "path": "docs/developer-guide/how-to-implement-custom-layer-step-by-step.md",
    "content": "# step1 create a new empty class\n```cpp\n// mylayer.h\n#include \"layer.h\"\nusing namespace ncnn;\n\n// a new layer type called MyLayer\nclass MyLayer : public Layer\n{\n};\n\n// mylayer.cpp\n#include \"mylayer.h\"\nDEFINE_LAYER_CREATOR(MyLayer)\n```\n\n# step2 declare layer parameters and weights\n```cpp\n// mylayer.h\n#include \"layer.h\"\nusing namespace ncnn;\n\nclass MyLayer : public Layer\n{\nprivate:\n    int channels;// new code\n    float gamma;// new code\n    Mat weight;// new code\n};\n\n// mylayer.cpp\n#include \"mylayer.h\"\nDEFINE_LAYER_CREATOR(MyLayer)\n```\n\n# step3 implement load functions for parameters and weights\n```cpp\n// mylayer.h\n#include \"layer.h\"\nusing namespace ncnn;\n\nclass MyLayer : public Layer\n{\npublic:\n    virtual int load_param(const ParamDict& pd);// new code\n    virtual int load_model(const ModelBin& mb);// new code\n\nprivate:\n    int channels;\n    float eps;\n    Mat gamma_data;\n};\n\n// mylayer.cpp\n#include \"mylayer.h\"\nDEFINE_LAYER_CREATOR(MyLayer)\n\n// new routine for loading parameters\nint MyLayer::load_param(const ParamDict& pd)\n{\n    // details about the relations with param file\n    // https://github.com/Tencent/ncnn/wiki/param-and-model-file-structure\n    //\n    channels = pd.get(0, 0);// parse 0=<int value> entry, default value 0\n    eps = pd.get(1, 0.001f);// parse 1=<float value> entry, default value 0.001f\n\n    return 0;// return zero if success\n}\n\n// new routine for loading weights\nint MyLayer::load_model(const ModelBin& mb)\n{\n    // details about the relations with model file\n    // https://github.com/Tencent/ncnn/wiki/param-and-model-file-structure\n    //\n    // read weights with length of channels * sizeof(float)\n    // the second argument explains as follows\n    // 0 judge the value type automatically, you may get float or float16 or uint8 etc\n    //   depends on the model storage and the supporting target hardware\n    // 1 read float values anyway\n    
// 2 read float16 values anyway\n    // 3 read uint8 values anyway\n    gamma_data = mb.load(channels, 1);\n    if (gamma_data.empty())\n        return -100;// return non-zero on error, -100 indicates out-of-memory\n\n    return 0;// return zero if success\n}\n```\n\n# step4 determine forward behavior\n```cpp\n// mylayer.h\n#include \"layer.h\"\nusing namespace ncnn;\n\nclass MyLayer : public Layer\n{\npublic:\n    MyLayer();// new code\n    virtual int load_param(const ParamDict& pd);\n    virtual int load_model(const ModelBin& mb);\n\nprivate:\n    int channels;\n    float eps;\n    Mat gamma_data;\n};\n\n// mylayer.cpp\n#include \"mylayer.h\"\nDEFINE_LAYER_CREATOR(MyLayer)\n\n// new routine for setting forward behavior\nMyLayer::MyLayer()\n{\n    // one input and one output\n    // typical one_blob_only type: Convolution, Pooling, ReLU, Softmax ...\n    // typical non-one_blob_only type: Eltwise, Split, Concat, Slice ...\n    one_blob_only = true;\n\n    // do not change the blob size, modify data in-place\n    // typical support_inplace type: ReLU, Sigmoid ...\n    // typical non-support_inplace type: Convolution, Pooling ...\n    support_inplace = true;\n}\n\nint MyLayer::load_param(const ParamDict& pd)\n{\n    channels = pd.get(0, 0);\n    eps = pd.get(1, 0.001f);\n\n    // you could alter the behavior based on loaded parameter\n    // if (eps == 0.001f)\n    // {\n    //     one_blob_only = false;\n    //     support_inplace = false;\n    // }\n\n    return 0;\n}\n\nint MyLayer::load_model(const ModelBin& mb)\n{\n    gamma_data = mb.load(channels, 1);\n    if (gamma_data.empty())\n        return -100;\n\n    // you could alter the behavior based on loaded weight\n    // if (gamma_data[0] == 0.f)\n    // {\n    //     one_blob_only = false;\n    //     support_inplace = false;\n    // }\n\n    return 0;\n}\n```\n\n# step5 choose proper interface based on forward behavior\n```cpp\n// The base class Layer defines four interfaces for each forward behavior 
combination\n\n// 1\nvirtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\n// 2\nvirtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n// 3\nvirtual int forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) const;\n\n// 4\nvirtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n```\n**must** = layer must implement this function\n\n**optional** = layer may implement this function for optimal performance\n\nSometimes the graph inference path cannot call forward_inplace directly because the input blob is shared. In this situation the non-inplace forward routine is used instead; if the optional routine is not implemented, the default one deep-copies the input blob and calls the inplace forward on the copy. Implementing the optional routine lets you avoid this deep copy by computing the output directly from the input.\n\nColumns 1 to 4 in the table below refer to the four numbered interfaces above.\n\n|one_blob_only|support_inplace|1|2|3|4|\n|---|---|---|---|---|---|\n|false|false|must| | | |\n|false|true|optional| |must| |\n|true|false| |must| | |\n|true|true| |optional| |must|\n\n# step6 implement forward function\n```cpp\n// mylayer.h\n#include \"layer.h\"\nusing namespace ncnn;\n\nclass MyLayer : public Layer\n{\npublic:\n    MyLayer();\n    virtual int load_param(const ParamDict& pd);\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;// new code, optional\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;// new code\n\nprivate:\n    int channels;\n    float eps;\n    Mat gamma_data;\n};\n\n// mylayer.cpp\n#include \"mylayer.h\"\nDEFINE_LAYER_CREATOR(MyLayer)\n\nMyLayer::MyLayer()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint MyLayer::load_param(const ParamDict& pd)\n{\n    channels = pd.get(0, 0);\n    eps = pd.get(1, 0.001f);\n\n    return 0;\n}\n\nint MyLayer::load_model(const ModelBin& mb)\n{\n    gamma_data = mb.load(channels, 1);\n    if 
(gamma_data.empty())\n        return -100;\n\n    return 0;\n}\n\n// optional new routine for layer forward function, non-inplace version\nint MyLayer::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // check input dims, return non-zero on error\n    if (bottom_blob.c != channels)\n        return -1;\n\n    // x = (x + eps) * gamma_per_channel\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n    int size = w * h;\n\n    top_blob.create(w, h, channels, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;// return non-zero on error, -100 indicates out-of-memory\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q=0; q<channels; q++)\n    {\n        const float* ptr = bottom_blob.channel(q);\n        float* outptr = top_blob.channel(q);\n        const float gamma = gamma_data[q];\n\n        for (int i=0; i<size; i++)\n        {\n            outptr[i] = (ptr[i] + eps) * gamma;\n        }\n    }\n\n    return 0;\n}\n\n// new routine for layer forward function\nint MyLayer::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    // check input dims, return non-zero on error\n    if (bottom_top_blob.c != channels)\n        return -1;\n\n    // x = (x + eps) * gamma_per_channel\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int size = w * h;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q=0; q<channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n        const float gamma = gamma_data[q];\n\n        for (int i=0; i<size; i++)\n        {\n            ptr[i] = (ptr[i] + eps) * gamma;\n        }\n    }\n\n    return 0;\n}\n```\n\n# step7 integrate with ncnn library\nYou will probably need to modify a converter such as caffe2ncnn or mxnet2ncnn so that it writes your layer-specific parameters and weights into the ncnn param and model files.\n\nThe param and model file structure is described in [param-and-model-file-structure](param-and-model-file-structure).\n\n```\n// example param file content\nInput            input   0 1 input\nConvolution      conv2d  1 1 input conv2d 0=32 1=1 2=1 3=1 4=0 5=0 6=768\nMyLayer          mylayer 1 1 conv2d mylayer0 0=32 1=0.001\nPooling          maxpool 1 1 mylayer0 maxpool 0=0 1=3 2=2 3=-233 4=0\n```\n\n```cpp\nncnn::Net net;\n\n// register the custom layer before loading param and model\n// the layer creator function is always named XYZ_layer_creator, which is defined by the DEFINE_LAYER_CREATOR macro\nnet.register_custom_layer(\"MyLayer\", MyLayer_layer_creator);\n\nnet.load_param(\"model.param\");\nnet.load_model(\"model.bin\");\n```\n"
  },
  {
    "path": "docs/developer-guide/how-to-write-a-neon-optimized-op-kernel.md",
    "content": "# benchmark\nop\n\n# naive C with openmp\nfor for for\n\n# unroll, first try\nh\n\n# register allocation\nkernels\n\n# unroll, second try\nsimd\n\n# neon intrinsics\noptional\n\n# naive neon assembly with pld\nasm\n\n# pipeline optimize, first try\nmore register load mla\n\n# pipeline optimize, second try\ninterleave load mla\n\n# pipeline optimize, third try\nloop tail\n\n# usual practice, load/save\n233\n\n# usual practice, unroll\n233\n\n# usual practice, save register\n233\n"
  },
  {
    "path": "docs/developer-guide/how-to-write-a-sse-optimized-op-kernel.zh.md",
    "content": "# 如何使用SSE来优化算子核心\n\n## 一：准备\n\n### 1.背景资料\n\n​\tSSE 全称Intel® Streaming SIMD Extensions (Intel® SSE),本质是Intel公司封装汇编语句提供的底层操作指令函数集。同样属于底层操作指令集的还有著名的Intel® AVX(Advanced Vector Extensions),  及 Intel® AVX2(Intel® Advanced Vector Extensions 2)。基于同样原理封装的还有Arm 对应Arm Intrinsic，MIPS中对应MIPS Intrinsic。\n\n​\tSSE的版本包含：SSE/SSE2/SSE3/SSE4.1/SSE4.2。下文中在描述CPU特性上统称为SSE系列指令集。在描述具体使用指令函数中的CPUID Flags，才会具体区分SSE不同版本。\n\n​\t自从MSVC不再支持x64的汇编指令后（虽然可以强制使用，但不推荐不安全）。SSE，AVX等成为MSVC 支持的最佳底层优化方法。\n\n​\t本文将从SSE的使用出发，以ncnn实现为例，展示如何使用SSE优化深度学习中算子。\n\n​\t优化算子工作需要三方面的准备事项：\n\n- 测试正确的原生代码\n- 快速测试验证环境\n- 基准统计程序\n\n### 2.确认硬件是否支持SSE\n\n​\t在开始SSE优化之前，首先请确保您硬件支持SSE指令集，对于大多数Intel CPU都支持SSE指令集。但在各种系统环境下，查看方式不同。我们有：\n\n#### 1.windows环境\n\n​\twindows环境下推荐简单使用GPU-Z来检测当前处理器是否支持SSE扩展。在GPU-Z官网下载后，运行，在“处理器”-“CPU支持的特性”项目下，若包含SSE系列指令集，即当前CPU支持SSE。\n\n#### 2.Linux环境和类Unix环境\n\n​\tLinux环境和类Unix环境下，使用查看cpuinfo文件来确认CPU特性；\n\n```shell\ncat  /proc/cpuinfo\n...\nflags: *** sse sse2 ***\t#在cpu flags中即可检查是否支持sse扩展\n```\n\n#### 2.macOS环境：\n\n​\tmacOS本质是像Unix环境，所以同样使用sysctl 来查看CPU特性.(注意Mac的 M1 M2系列芯片是arm架构，不支持SSE)\n\n```shell\nsysctl machdep.cpu\t\t# 结果同Linux环境\n```\n\n## 二：编写原生代码\n\n​\t使用SSE来优化算法的过程本质就是代码重构的一种情况。代码重构的首要条件是完成完备的代码行为测试集合。所以，这部分将从测试代码的编写开始。\n\n​\t其次优化过程的目标是调优某些性能指标的过程。所以第二部分将讨论性能指标的选定和优先级；\n\n### 1.编写测试代码\n\n​\t在大多数情况下，看到这篇文章的人肯定是比笔者更会写算法，所以我在这里只谈一些编写测试的注意事项（这里的测试指验证你算法满足你的要求所编写的代码行为，跟其他人无关）。\n\n​\t编写测试代码主要注意事项：\n\n- 思考如何构造基础数据结构才能满足算法行为的输入要求。举例来说，如果你准备为ncnn贡献算法，请阅读ncnn中关于Mat结构的函数。最好编写相关测试来验证该数据结构满足你的需要。（笔者的建议是可以先从简单结构来验证，比如需要做一个支持f32任意大小的矩阵加法算子，可以先从支持固定矩阵int8类型的加法开始编写测试代码）。\n- 保持结果的正确性。首先考虑，你所编写的原生代码行为上，是否满足你所需要的结果（不论这个结果是手算的，numpy算的或者pytorch算的）。其次要考虑，结果在内存上结构如何排布。以ncnn为例，思考你的结果该如何放入到一个Mat中，Mat的size该如何设定。（在后续SSE优化中，我们将多次以原生代码结果作为target结果，验证每次优化后的正确性，原生代码能够稳定输出正确结果非常重要）\n- 不用过早考虑算法的完备性，应该随着每次测试结果的正确来迭代重构算法和测试代码。二者同样重要。如果能够自动化测试，请尽量让一个简单的脚本执行来完成所有你慢吞吞的命令行。\n\n### 2.考虑性能指标\n\n​\t性能指标的主要作用是随着每次优化的迭代，告诉我们所采取的措施在什么方面取得效果，是正面优化还是负面优化。\n\n​\t性能指标很多，包括吞吐量，还有类似计算稳定度，时间延迟，视频方面还有fps 
等等。无法确认有效的性能指标也是大多数优化算法的困难点之一。\t\n\n​\t随着简单粗暴地叠晶体管数量来解决电脑运行问题，性能指标似乎变得越来越不重要。这是一种错误观念，如果在单核上编写非常烂的代码，增加N个核心只是把烂代码重复N次而已。另一方面，性能指标有着客观性，在开发板上和集群设备上运行同样的算法，性能指标的优先级也不一样。但是，我认为应当满足最基础性能指标有这两个：\n\n- **吞吐量**：算法在单位时间内执行的次数，用Gflops表示，该值越大越好（也可以认为执行同样算法所平均占用的时间，时间越短越好）；\n- **性能衰退**：即随着数据规模的增加，Gflops在不同数据规模下的波动情况。更低的离散程度意味着吞吐量保持在一定范围内不发生变化。\n\n​\t其余有效的性能指标应当由业务环境和任务需求决定。负责技术基础设施建设的算法工程师，一方面应该理解业务所需求中的最高优先级，另一方面也应该追求做到更好。\n\n​\t以SSE优化算法，本质上是重构的迭代过程。不用在初期就考虑如何达到最大性能指标，而是应该考虑每次迭代中带来一定量的性能优化。\n\n## 三：理解SSE\n\n​\tSSE主要由SSE基础数据类型 及 针对性的SSE操作函数构成。前文提到，SSE是针对汇编语句的封装，所以本身不具备错误检查和错误处理（错误检查和错误处理一般由编译器完成）。使用不当的话，诸如segmentation fault之内指针指向不存在的内存错误非常常见。我在此处建议：<u>使用SSE优化之前，确保理解代码指针位置和移动原理，原生代码已经完成测试，输出结果正确。</u>\n\n### 1.SSE数据类型\n\n​\tSSE数据类型形如：\n\n```c++\n__m<bit><type>\t\t\t //__m适用代表申请mm寄存器\n    \t\t\t\t\t// bit 代表数据类型的字节长度，在SSE中为128 或 64\n    \t\t\t\t\t// 默认type为单精度浮点（f32），其余为int 或double\n// 另外要注意所有SSE的类型除__m128和__m64外，随着版本更新有不同的类型，建议根据需要且确定硬件性能后选择合适的类型\n// 举例如下：\n__m128\t\t\t\t\t//4xf32 含有4个单精度浮点数；SSE\n__m64    \t\t\t\t//4xf32 含有2个单精度浮点数；SSE\n__m128i   \t\t\t\t//8个int类型（8x16)\t\t ；SSE3\n__m128d\t\t\t\t\t//2个double类型(2x64)\n```\n\n### 2.SSE内联函数结构\n\n​\tSSE内联函数在线查询：[Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2) 在此 单个指令的结构如下：\n\n- Synopsis ：摘要。描述指令的接口定义，需要引入的头文件，对应的指令，CPU必须支持的标志；\n- Description：描述该指令的行为；\n- Operation：逻辑层面描述指令行为；\n- Performance：在不同架构中所需要的延迟和执行所需要的时钟周期数（CPI）。\n\n​\t值得指出的的是此处默认使用小端存储，即左边为高位，右边为低位。\n\n​\t相似的内联函数有很多，在使用时候一定要注意Operation中的逻辑满足您的要求。\n\n​\t另外，在ncnn中，ncnn已经将部分SSE内联函数以NCNN内联的方式封装。在为NCNN添加SSE优化的算法的过程中，请首先考虑搜索“NCNNINLINE”宏封装的SSE函数。\n\n## 四：样例\n\n### 1.一个简单的样例：4x4矩阵乘法\n\n​\t矩阵乘法方面，已经有很多出色的成果。值得一读的比如[how-to-optimize-gemm](https://github.com/flame/how-to-optimize-gemm)，及 [以Arm Intrinsic优化矩阵乘法](https://github.com/tpoisonooo/how-to-optimize-gemm)。我建议感兴趣同学参考和学习这两份项目，来探究如何从0到1优化一份算法；\n\n​\t矩阵乘法原理很简单：\n\n​\t假设有A，B两个矩阵，如下：\n$$\nA_{[4][4]} =  \n\\begin{bmatrix}\n\ta_0 & a_4 & a_8 & 
a_{12} \\\\\n\ta_1 & a_5 & a_9 & a_{13} \\\\\n\ta_2 & a_6 & a_{10} & a_{14} \\\\\n\ta_3 & a_7 & a_{11} & a_{15} \n\\end{bmatrix}\n~~\nB_{[4][4]} =  \n\\begin{bmatrix}\n\tb_0 & b_4 & b_8 & b_{12} \\\\\n\tb_1 & b_5 & b_9 & b_{13} \\\\\n\tb_2 & b_6 & b_{10} & b_{14} \\\\\n\tb_3 & b_7 & b_{11} & b_{15} \n\\end{bmatrix}\n~~\nC_{[4][4]} =  \n\\begin{bmatrix}\n\tc_0 & c_4 & c_8 & c_{12} \\\\\n\tc_1 & c_5 & c_9 & c_{13} \\\\\n\tc_2 & c_6 & c_{10} & c_{14} \\\\\n\tc_3 & c_7 & c_{11} & c_{15} \n\\end{bmatrix}\n$$\n​\t对于C 矩阵的第一列，我们有：\n$$\nc_0 = a_0b_0 + a_4b_1 + a_8b_2 + a_{12}b_3 \\\\\n\tc_1 = a_1b_0 + a_5b_1 + a_9b_2 + a_{13}b_3 \\\\\n\tc_2 = a_2b_0 + a_6b_1 + a_{10}b_2 + a_{14}b_3 \\\\\n\tc_3 = a_3b_0 + a_7b_1 + a_{11}b_2 + a_{15}b_3\n$$\n\n\n#### 1.编写测试代码和基准测量程序\n\n​\t在该样例中，测试代码很容易编写出来，我们只需要初始化4x4的二维数组，并返回指针即可。此时，可以不考虑泛用性，初始化为固定值即可。\n\n```c\n// <代码片段>\n...\nfloat A[16] = {0.0f};\t\t\t// 此处已经将输入和输出的矩阵默认展开成im2col 后的单行（inch = 1） 宽度为h*w = 16的矩阵\nfloat B[16] = {0.0f};\nfloat C[16] = {0.0f};\nmatrix_init_rand(A, 4, 4);\t\t// 随机初始化A数组\nmatrix_init_rand(B, 4, 4);\t\t// 随机初始化B数组\n```\n\n​\t编写验证正确性的测试代码。\n\n```c\n// <代码片段>\n...\nfloat T[16] = {...};\t\t\t// Target即为预测的C的结果数组，可用numpy或者纸笔计算\n...\nfloat error = 0.0001;\nbool CheckAuc(T, C, error);\t\t\n// 注意：float在计算机中不能完全表示，只能使用绝对误差的判别方法。gtest等测试框架的EXCEPT宏无法处理1.234e5这样结构的float数的对比。\n```\n\n​\t同样，编写计算耗时的基准测量代码，此处使用1000次操作所占的平均时间来作为基准。\n\n```c\n// <代码片段>\n...\nconst int loop = 1000;\nclock_gettime_(CLOCK_REALTIME, &time_start);\nfor(init i = 0; i < loop; i++)\n{\n\tmatirx_mult_native(C, A, B);\n}\nclock_gettime_(CLOCK_REALTIME, &time_end);\nclocks_c = (time_end.tv_sec - time_start.tv_sec) * 1000000 +  (time_end.tv_sec - time_start.tv_sec) /1000;\n```\n\n#### 2.编写原生代码\n\n​\t编写原生代码，使得正确性测试能够通过。\n\n```c\n// <代码片段>\nstatic void matirx_mult_native(float *C, float *A, float *B)\n{\n    for(int i_idx = 0; i_idx < 4; i_idx++)\n    {\n        for(int j_idx = 0; j_idx < 4; j_idx++)\n        {\n           for(int k_idx = 0; k_idx < 4; k_idx++)\n  
         {\n               C[4*j_idx + i_idx] += A[4*k_idx + i_idx] * B[4*j_idx + k_idx];\n           }\n        }\n    }\n}\n```\n\n#### 3.优化原生代码\n\n​\t注意到上述代码中，先取c0 - c3 的计算作为样例考虑：\n$$\n\tc_0 = a_0b_0 + a_4b_1 + a_8b_2 + a_{12}b_3   \\\\\n\tc_1 = a_1b_0 + a_5b_1 + a_9b_2 + a_{13}b_3    \\\\\n\tc_2 = a_2b_0 + a_6b_1 + a_{10}b_2 + a_{14}b_3 \\\\\n\tc_3 = a_3b_0 + a_7b_1 + a_{11}b_2 + a_{15}b_3\n$$\n\n##### 1.装载寄存器\n\n- 考虑竖排a0-a1-a2-a3 为4个f32 数据，又因为SSE可以申请mm寄存器，单次保存128bit，那么不妨把a0-a4保存在寄存器中，\n\n- 对于b0-b3 则是，单次读取一个值，能够重复用4次，不妨考虑b0 重复4次，排满单个128bit的mm寄存器；\n\n- 同理把c0-c3也放入寄存器，从列方向上考虑，取名为_c0 \n\n  ```c++\n  _m128 _a0 = _mm_load_ps(a_ptr);\t\t\t//a0 -a1 -a2 -a3\n  _m128 _a1 = _mm_load_ps(a_ptr + 4);\t\t//a4 -a5 -a6 -a7\n  _m128 _a2 = _mm_load_ps(a_ptr + 8);\t\t//a8 -a9 -a10-a11\n  _m128 _a3 = _mm_load_ps(a_ptr + 12);\t//a12-a13-a14-a15\n  \n  _m128 _b0 = _mm_load_ps1(b_ptr);\t\t// b0 - b0 - b0 - b0\n  _m128 _b1 = _mm_load_ps1(b_ptr + 4);\t// b1 - b1 - b1 - b1\n  _m128 _b2 = _mm_load_ps1(b_ptr + 8);\t// b2 - b2 - b2 - b2\n  _m128 _b3 = _mm_load_ps1(b_ptr + 12);\t// b3 - b3 - b3 - b3\n  ```\n\n##### 2.编写第一列的计算结果\n\n​\t对于_a0 -\\_a3 数据与\\_b0 数据相乘 ，有：\n\n```c++\n// 保存结果新建一个_c0 作为临时变量\n_m128 _c0 = _mm_set_ps1(0.0f);\n_c0 = _mm_mul_ps(_a0, _b0);\n_c0 = _mm_add_ps(_mm_mul_ps(_a1, _b1),_c0);\n_c0 = _mm_add_ps(_mm_mul_ps(_a2, _b2),_c0);\n_c0 = _mm_add_ps(_mm_mul_ps(_a3, _b3),_c0);\n// 把 _sum0存会以c指针开头的内存中，完美！\n_mm_store_ps(c_ptr, _c0);\n```\n\n##### 3.将单列输出扩展到所有列：\n\n​\t我们针对剩下的c中的c1 列也做相同的操作： 对于C1 列 有：\n$$\n\tc_4 = a_0b_4 + a_4b_5 + a_8b_6 + a_{12}b_7 \\\\\n\tc_5 = a_1b_4 + a_5b_5 + a_9b_6 + a_{13}b_7 \\\\\n\tc_6 = a_2b_4 + a_6b_5 + a_{10}b_6 + a_{14}b_7 \\\\\n\tc_7 = a_3b_4 + a_7b_5 + a_{11}b_6 + a_{15}b_7\n$$\n\n\n```c++\n// a 系列不变 b系列指针+1\n_m128 _b4 = _mm_load_ps1(b_ptr + 1);\t\t// b4 - b4 - b4 - b4\n_m128 _b5 = _mm_load_ps1(b_ptr + 4 + 1);\t// b5 - b5 - b5 - b5\n_m128 _b6 = _mm_load_ps1(b_ptr + 8 + 1);\t// b6 - b6 - b6 - b6\n_m128 _b7 = _mm_load_ps1(b_ptr + 12+ 1);\t// b7 - b7 - 
b7 - b7\n\n// 保存结果新建一个_c0 作为临时变量\n_m128 _c1 = _mm_set_ps1(0.0f);\n_c1 = _mm_mul_ps(_a0, _b4);\n_c1 = _mm_add_ps(_mm_mul_ps(_a1, _b5),_c1);\n_c1 = _mm_add_ps(_mm_mul_ps(_a2, _b6),_c1);\n_c1 = _mm_add_ps(_mm_mul_ps(_a3, _b7),_c1);\n// 把 _sum0存会以c指针开头的内存中，完美！\n_mm_store_ps(c_ptr, _c1);\n```\n\n​\t此时我们发现，对于C1列的操作与C0列及其相似，只不过是b_ptr的指针发生移动，不妨将其放到同一个循环中，有：\n\n```C++\n// a 系列不变\n_m128 _a0 = _mm_load_ps(a_ptr);\t\t\t//a0 -a1 -a2 -a3\n_m128 _a1 = _mm_load_ps(a_ptr + 4);\t\t//a4 -a5 -a6 -a7\n_m128 _a2 = _mm_load_ps(a_ptr + 8);\t\t//a8 -a9 -a10-a11\n_m128 _a3 = _mm_load_ps(a_ptr + 12);\t//a12-a13-a14-a15\n\nfor(int i = 0; i < 4; i++)\n{\n    _m128 _b0 = _mm_load_ps1(b_ptr);\t\t// b0 - b0 - b0 - b0\n    _m128 _b1 = _mm_load_ps1(b_ptr + 4);\t// b1 - b1 - b1 - b1\n    _m128 _b2 = _mm_load_ps1(b_ptr + 8);\t// b2 - b2 - b2 - b2\n    _m128 _b3 = _mm_load_ps1(b_ptr + 12);\t// b3 - b3 - b3 - b3\n    \n    _m128 _ci = _mm_set_ps1(0.0f);\n    _ci = _mm_mul_ps(_a0, _b0);\n    _ci = _mm_add_ps(_mm_mul_ps(_a1, _b1),_ci);\n    _ci = _mm_add_ps(_mm_mul_ps(_a2, _b2),_ci);\n    _ci = _mm_add_ps(_mm_mul_ps(_a3, _b3),_ci);\n    \n    _mm_store_ps(c_ptr, _ci);\n    \n    b_ptr += 1;\t\t\t\t// 移动b_ptr\n    c_ptr += 4;\t\t\t\t// 移动保存内存的c_ptr\n}\n```\n\n### 2.NCNN中以SSE优化算子的注意事项\n\n#### 1.线程与openmp\n\n​\t以上计算Benchmark 和 SSE优化的方法大多集中在单个核心中，但是在实际使用ncnn中，ncnn使用Option opt 中提供的num_threads 给openmp赋值，以实现多线程并行化，同时运行在多个核心上。\n\n```c++\n#pragma omp parallel for num_threads(opt.num_threads)\n```\n\n​\t在优化成SSE代码的初期，可以考虑锁定为单线程，或者直接不用考虑线程的影响，仅对单核以SSE优化，保证单核的结果正确后，再加上opt的多线程进行结果测试。\n\n#### 2.展开循环\n\n​\t在实际ncnn实现的原生代码的算法中，循环是非常常见的。针对以SSE优化这类循环，遵循非常简单的原则：循环中，迭代器等于零时刻，整个输出的结果也是正确的。\n\n​\t那么，在我们使用SSE优化过程中，不妨以迭代器等于零的时刻，函数计算结果作为此时目标结果。在此基础上再利用SSE优化代码。与目标结果核对正确以后，再进一步去考虑迭代器等于1的情况（重复这个过程直到迭代器达到最大值）。在迭代器的每个元素下，SSE优化出的代码都与结果相等，那么我们可以说，该次优化是正确性，且完全覆盖了需执行代码。（一般来说不用考虑到最大值，根据数学归纳法，n有效，n+1有效，那么n的序列都是有效的）\n\n## 五：总结\n\n​\t本文描述SSE的使用及以4x4矩阵乘法的样例来优化SSE代码。\n\n​\t值得注意的是，SSE只是128bit数据宽度的指令集，但是也可以用来模拟256bit 和 
512bit数据宽度，来实现以pack4拼接成pack8，甚至pack16的做法，只不过在输出结果管理上更加繁琐而已。感兴趣的同学可以尝试一下。\n\n## 六：引用\n\n1. [SSE指令扩展快查](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2)；\n2. 浮点性能基准计算-[浮点峰值那些事儿](https://zhuanlan.zhihu.com/p/28226956)\n3. 硬件性能基准测试计算样例：[M1芯片搞数据科学好使吗？5种基准测试给你答案](https://mp.weixin.qq.com/s/2N5cl_Z1MRF8dfbRo-sb4A)\n4. 讨论矩阵乘法如何优化的系列论文：[how-to-optimized-gemm](https://github.com/flame/how-to-optimize-gemm/wiki)\n5. 讨论以Arm Intrinsic 优化gemm的系列文章：[OpenBLAS gemm从零入门](https://zhuanlan.zhihu.com/p/65436463)\n"
  },
  {
    "path": "docs/developer-guide/kvcache.md",
    "content": "# high-performance transformer inference with mha kv cache in ncnn\n\nThis document details the implementation and usage of the key-value (kv) cache for the `MultiHeadAttention` and `SDPA` layer in ncnn. This feature significantly accelerates autoregressive inference for Transformer-based models, such as large language models and other encoder-decoder architectures.\n\n## 1. what is kv cache?\n\n### the challenge of autoregressive inference\n\nTransformer models generate output token by token in a process called autoregressive decoding. In each step, the model takes the previously generated tokens as input to predict the next one. A core component of this is the self-attention mechanism, which computes query (q), key (k), and value (v) matrices based on the sequence generated so far.\n\nWithout optimization, the model must recompute the k and v matrices for all preceding tokens at every single step. For a sequence of length `N`, the computational cost for the self-attention mechanism is roughly proportional to `N^2`. As the sequence grows, this becomes a significant performance bottleneck.\n\n### the solution: kv cache\n\n**kv cache** is an optimization technique that stores the key and value tensors from previous decoding steps. When generating a new token, we only need to compute the k and v for the *current* token and append them to the cached values. The model then uses the full set of cached k and v tensors for the attention calculation.\n\n### key benefits\n\n- **dramatic speed-up:** It reduces the computational complexity of the self-attention mechanism from O(N^2) per step to approximately O(N). 
This drastically cuts down inference latency, especially for long sequences.\n- **reduced computation:** It eliminates redundant calculations, saving significant computational resources and energy.\n- **enables real-time applications:** The performance gain makes it feasible to deploy large Transformer models for interactive and real-time tasks.\n\n## 2. ncnn kv cache implementation\n\nncnn introduces kv cache support directly into its `MultiHeadAttention` and `SDPA` layer. The implementation is designed to be efficient and flexible, handling both the dynamic cache of self-attention and the static k/v of cross-attention found in encoder-decoder architectures.\n\n### self-attention vs. cross-attention cache logic\n\nThe caching strategy is fundamentally different for self-attention and cross-attention layers within a decoder.\n\n#### self-attention (dynamic cache)\n- **purpose:** Allows the decoder to attend to previously generated tokens in its own sequence (e.g., the text being generated).\n- **cache Logic:** The cache is **dynamic** and grows with each generated token. In step `t`, the k and v for token `t` are computed and appended to the cache from step `t-1`.\n- **ncnn implementation:** The `MultiHeadAttention` and `SDPA` layers for self-attention are modified to accept two additional inputs (`cache_k_in`, `cache_v_in`) and produce two corresponding outputs (`cache_k_out`, `cache_v_out`). The `7=1` parameter enables this dynamic caching behavior inside the layer.\n\n#### cross-attention (static k/v)\n- **purpose:** Allows the decoder to attend to the output of the encoder (e.g., attending to audio features in speech recognition, or an input sentence in translation).\n- **cache Logic:** The k and v matrices are derived from the encoder's output, which is computed only **once** per input sequence. Therefore, the k and v for cross-attention are **static** and do not change during the decoding process. 
They are \"cached\" in the sense that they are pre-computed and reused in every decoding step.\n- **ncnn implementation:** The `MultiHeadAttention` and `SDPA` layers for cross-attention are also configured with `7=1` and cache I/O blobs. However, the implementation correctly identifies cross-attention (where the query blob is different from the key/value blobs) and reuses the `cache_k_in` and `cache_v_in` directly, without performing concatenation. This allows the static encoder k/v to be passed efficiently through the network.\n\n## 3. ncnn kv cache memory layout\n\nThe memory layout of the kv cache is a critical design choice for performance. ncnn uses different layouts for `MultiHeadAttention` and `SDPA` to optimize for their respective calculation patterns.\n\n### `MultiHeadAttention` cache layout (Transposed)\n\nThe `MultiHeadAttention` layer uses a **transposed layout** for its cache blobs. The primary reason for this is to **ensure that data for each attention head is contiguous in memory, which significantly boosts gemm performance.**\n\n*   **input blobs (q, k, v):** These typically have a shape where height represents the sequence length.\n    *   `ncnn::Mat` dimensions: `(w = embed_dim, h = seq_len)`\n\n*   **cache blobs (`k_cache`, `v_cache`):** These are stored in a **transposed** format.\n    *   `ncnn::Mat` dimensions: `(w = seq_len, h = embed_dim)`\n\n**the rationale:**\n\n1.  **slicing by Head:** During the attention calculation, the code slices the `k_cache` and `v_cache` matrices along their height to isolate the data for each head (e.g., using `row_range(head_index * embed_dim_per_head, embed_dim_per_head)`).\n2.  **memory contiguity:** Because `ncnn::Mat` uses a row-major memory layout, this slicing operation on the transposed cache blob results in a sub-matrix where all the data for a single head is perfectly contiguous.\n3.  
**gemm efficiency:** Subsequent matrix multiplication operations (`q * k^T` and `Attention * v`) can then operate on these contiguous memory blocks. This maximizes CPU cache locality and the effectiveness of simd instructions, leading to a substantial increase in computational speed.\n\nIf a non-transposed layout were used, the data for each head would be strided in memory, causing frequent cache misses and dramatically slowing down the performance-critical gemm calculations. Therefore, this transposed layout is a deliberate and crucial optimization for computation.\n\n### `SDPA` cache layout (Standard)\n\nThe `SDPA` layer uses the **standard ncnn Mat layout**, where the sequence length is represented by the height.\n\n*   **input blobs (q, k, v):** `(w = embed_dim, h = seq_len, c = num_heads)`\n*   **cache blobs (`k_cache`, `v_cache`):** `(w = embed_dim, h = seq_len, c = num_heads)`\n\n**the rationale:**\n\nThe `SDPA` layer's internal implementation directly concatenates the cache blobs (`past_k`, `past_v`) with the current ones (`cur_k`, `cur_v`) along the height dimension (`seq_len`). This simpler approach avoids the need for a transposed layout while still being highly efficient, as the concatenation logic is handled inside the optimized C++ implementation.\n\n## 4. converting models to support kv cache\n\nTo enable kv cache, you must modify the model's `.param` file to add the necessary cache inputs and outputs to all `MultiHeadAttention` and `SDPA` layers in the decoder.\n\n### step 1: export a sequence-length-1 model\n\nFirst, export your model from its original framework (e.g., PyTorch) using a sequence length of 1 for the decoder. This creates a graph optimized for single-token generation, which is the core of the autoregressive decoding loop.\n\n### step 2: modify the .ncnn.param file\n\nAfter exporting, a script is needed to edit the generated `.ncnn.param` file to make it cache-aware.\n\n#### A. 
Adding kv cache to All MultiHeadAttention and SDPA Layers\n\nYou must add cache inputs/outputs to **every** `MultiHeadAttention` / `SDPA` layer in the decoder.\n\n- **change `input_count` and `output_count`:** Increase both by 2.\n- **add blob names:** Append new, unique blob names for `cache_k_in`, `cache_v_in`, `cache_k_out`, and `cache_v_out`.\n- **enable cache behavior:** Add the parameter `7=1`.\n\nHere is a robust Python function that automates this process:\n```python\ndef add_kv_cache_to_ncnn_param(filename):\n    \"\"\"\n    Modifies an ncnn.param file to add a kv cache mechanism to all\n    MultiHeadAttention and SDPA layers and overwrites the original file.\n    This handles both self-attention and cross-attention layers.\n    \"\"\"\n    import os\n\n    if not os.path.exists(filename):\n        print(f\"Error: The file '{filename}' was not found.\")\n        return\n\n    with open(filename, 'r', encoding='utf-8') as f:\n        lines = f.readlines()\n\n    header_line_index = 1  # line 2, after magic number\n    header_parts = lines[header_line_index].strip().split()\n    original_layer_count = int(header_parts[0])\n    original_blob_count = int(header_parts[1])\n\n    attention_indices = [i for i, line in enumerate(lines) if line.strip().startswith(\"MultiHeadAttention\") or line.strip().startswith(\"SDPA\")]\n    attention_count = len(attention_indices)\n\n    if attention_count == 0:\n        print(\"No 'MultiHeadAttention' or 'SDPA' layers found. 
The file will not be modified.\")\n        return\n\n    # --- modify MultiHeadAttention and SDPA layers ---\n    for i, line_index in enumerate(attention_indices):\n        parts = lines[line_index].strip().split()\n        layer_type, layer_name, input_count_str, output_count_str = parts[:4]\n        input_count, output_count = int(input_count_str), int(output_count_str)\n\n        blob_and_params = parts[4:]\n        inputs = blob_and_params[:input_count]\n        outputs = blob_and_params[input_count : input_count + output_count]\n        params = blob_and_params[input_count + output_count:]\n\n        # add cache I/O blobs and enable cache parameter\n        inputs.extend([f\"cache_k_in_{i}\", f\"cache_v_in_{i}\"])\n        outputs.extend([f\"cache_k_out_{i}\", f\"cache_v_out_{i}\"])\n        params.append(\"7=1\")\n\n        new_line_parts = [\n            f\"{layer_type:<24}\", f\"{layer_name:<24}\",\n            str(input_count + 2), str(output_count + 2),\n            *inputs, *outputs, *params\n        ]\n        lines[line_index] = \" \".join(new_line_parts) + \"\\n\"\n\n    # --- add a single input layer to provide all cache blobs ---\n    new_layer_count = original_layer_count + 1\n    # each mha needs 2 new *input* blobs and produces 2 new *output* blobs.\n    # the total number of unique blobs increases by 4 for each mha.\n    new_blob_count = original_blob_count + (attention_count * 4)\n    lines[header_line_index] = f\"{new_layer_count} {new_blob_count}\\n\"\n\n    # find where to insert the new input layer (after existing ones)\n    insert_pos = header_line_index + 1\n    while insert_pos < len(lines) and lines[insert_pos].strip().startswith(\"Input\"):\n        insert_pos += 1\n\n    cache_blob_names = [name for i in range(attention_count) for name in (f\"cache_k_in_{i}\", f\"cache_v_in_{i}\")]\n    input_layer_line = (\n        f\"{'Input':<24} {'kv_cache_in':<24} 0 {len(cache_blob_names)} \"\n        f\"{' '.join(cache_blob_names)}\\n\"\n    
)\n    lines.insert(insert_pos, input_layer_line)\n\n    with open(filename, 'w', encoding='utf-8') as f:\n        f.writelines(lines)\n\n    print(f\"Successfully added kv cache to {attention_count} MultiHeadAttention / SDPA layers.\")\n\n# usage:\n# add_kv_cache_to_ncnn_param(\"your_model_decoder.ncnn.param\")\n```\n\n#### B. Supporting Dynamic Sequence Length in Gemm\nFeed-forward networks (`Gemm` layers) that process the output of attention blocks must support dynamic sequence lengths, as the cache grows. To achieve this, change the parameter `7=1` (constant input shape) to `7=0` (dynamic input shape) for the relevant `Gemm` layers.\n\n```python\ndef update_gemm_params(param_file_path):\n    \"\"\"\n    Finds all 'Gemm' layers and changes parameter '7=1' to '7=0'\n    to support dynamic input shapes.\n    \"\"\"\n    import re\n    with open(param_file_path, 'r') as f:\n        lines = f.readlines()\n\n    new_lines = []\n    for line in lines:\n        if line.strip().startswith('Gemm'):\n            line = re.sub(r'(\\b7=)1\\b', r'\\g<1>0', line)\n        new_lines.append(line)\n\n    with open(param_file_path, 'w') as f:\n        f.writelines(new_lines)\n    print(f\"Updated Gemm layers in '{param_file_path}' to support dynamic inputs.\")\n\n# usage:\n# update_gemm_params(\"your_model_decoder.ncnn.param\")\n```\n\n## 5. implementing kv cache inference logic\n\nYour C++ inference code must manage the cache blobs across decoding steps.\n\n### step 1: identify cache blob indices\nAfter loading the network, identify the input and output blob indices for the cache. 
You can iterate through the MultiHeadAttention / SDPA layers and pick up the cache blobs that the conversion script appended.\n\n```cpp\n#include \"net.h\"\n#include <vector>\n#include <string>\n\nstruct kvcache_info\n{\n    std::vector<int> input_indices;\n    std::vector<int> output_indices;\n};\n\nvoid find_mha_kvcache_blobs(const ncnn::Net& net, kvcache_info& info)\n{\n    for (const ncnn::Layer* layer : net.layers())\n    {\n        // cache-enabled mha layer has 3 outputs (out, cache_k_out, cache_v_out) instead of 1\n        if ((layer->typeindex == ncnn::LayerType::MultiHeadAttention || layer->typeindex == ncnn::LayerType::SDPA) && layer->tops.size() == 3)\n        {\n            // the script adds cache_k and cache_v as the last two inputs/outputs\n            int input_count = layer->bottoms.size();\n            int output_count = layer->tops.size();\n\n            info.input_indices.push_back(layer->bottoms[input_count - 2]); // cache_k_in\n            info.input_indices.push_back(layer->bottoms[input_count - 1]); // cache_v_in\n\n            info.output_indices.push_back(layer->tops[output_count - 2]);  // cache_k_out, i.e., tops[1]\n            info.output_indices.push_back(layer->tops[output_count - 1]);  // cache_v_out, i.e., tops[2]\n        }\n    }\n}\n```\n\n### step 2: prefill and decode loop\nThe inference process is split into two phases: \"prefill\" for the initial prompt and \"decode\" for subsequent single-token generation.\n\n- **prefill (`run_decoder_pre`):**\n  - input: the entire initial sequence of token IDs\n  - the kv cache is empty\n  - run the decoder once\n  - extract the output logits for the *last* token to predict the next token\n  - extract the `cache_k_out` and `cache_v_out` blobs from all mha / sdpa layers and store them\n\n- **decode (`run_decoder_step`):**\n  - input: the single, most recently generated token ID\n  - the kv cache blobs from the previous step are fed as input\n  - run the decoder\n  - extract the output logits to predict the next token\n  - 
extract and store the updated kv cache blobs for the next step\n\nHere is a conceptual C++ implementation:\n\n```cpp\n// assume 'decoder_net' is loaded and 'kvcache_info' is populated.\n\n// --- prefill step (processes a sequence of tokens) ---\nvoid run_decoder_pre(const std::vector<int>& tokens, const ncnn::Mat& encoder_states, std::vector<ncnn::Mat>& out_kv_cache)\n{\n    ncnn::Extractor ex = decoder_net.create_extractor();\n\n    ncnn::Mat input_embeds = prepare_input_embeds(tokens); // your embedding logic\n    ex.input(\"in0\", input_embeds); // use your input blob name\n    ex.input(\"encoder_out\", encoder_states); // use your encoder output blob name\n\n    out_kv_cache.resize(kvcache_info.output_indices.size());\n    for (size_t i = 0; i < kvcache_info.output_indices.size(); i++)\n    {\n        ex.extract(kvcache_info.output_indices[i], out_kv_cache[i]);\n    }\n\n    ncnn::Mat all_logits;\n    ex.extract(\"out0\", all_logits); // Use your output blob name\n    // ... process logits for the last token ...\n}\n\n// --- decode step (processes a single token) ---\nvoid run_decoder_step(int token, const ncnn::Mat& encoder_states, const std::vector<ncnn::Mat>& kv_cache, std::vector<ncnn::Mat>& out_kv_cache)\n{\n    ncnn::Extractor ex = decoder_net.create_extractor();\n\n    ncnn::Mat input_embeds = prepare_input_embeds({token});\n    ex.input(\"in0\", input_embeds);\n    ex.input(\"encoder_out\", encoder_states);\n\n    // feed the existing cache\n    for (size_t i = 0; i < kvcache_info.input_indices.size(); i++)\n    {\n        ex.input(kvcache_info.input_indices[i], kv_cache[i]);\n    }\n\n    // extract the updated cache\n    out_kv_cache.resize(kvcache_info.output_indices.size());\n    for (size_t i = 0; i < kvcache_info.output_indices.size(); i++)\n    {\n        ex.extract(kvcache_info.output_indices[i], out_kv_cache[i]);\n    }\n\n    ncnn::Mat logits;\n    ex.extract(\"out0\", logits);\n    // ... 
process logits to get the next token ...\n}\n\n// --- main inference loop ---\nvoid generate_sequence()\n{\n    std::vector<int> initial_tokens = { /* SOT and prompt tokens */ };\n    ncnn::Mat encoder_states = run_encoder(); // compute encoder output once\n\n    // 1. prefill stage\n    std::vector<ncnn::Mat> kv_cache;\n    run_decoder_pre(initial_tokens, encoder_states, kv_cache);\n    int next_token = get_next_token_from_prefill_logits();\n\n    // 2. autoregressive decoding loop\n    while (next_token != EOT_TOKEN && sequence_length < MAX_LENGTH)\n    {\n        std::vector<ncnn::Mat> next_kv_cache;\n        run_decoder_step(next_token, encoder_states, kv_cache, next_kv_cache);\n        kv_cache = next_kv_cache; // update cache for the next iteration\n\n        next_token = get_next_token_from_step_logits();\n        // append next_token to your generated sequence\n    }\n}\n```\nThis structured approach allows ncnn to perform highly efficient Transformer inference, correctly handling both dynamic self-attention and static cross-attention caches with an optimized memory layout.\n"
  },
  {
    "path": "docs/developer-guide/layer-feat-mask.md",
    "content": "# layer feature mask\n\nEach ncnn layer allows a special parameter pair `31=X` to control specific bahavior.\n\nX is an unsigned integer with each bit contributing a feature mask.\n\nWe usually use it to configuring fine-graded behaviors for certain layers to maintain accuracy, reduce memory usage or optimize performance.\n\n|bit|value|mask|rationale|\n|---|---|---|---|\n|1<<0|1|no fp16 arithmetic|precision concern|\n|1<<1|2|no fp16 storage|precision concern|\n|1<<2|4|no bf16 storage|precision concern|\n|1<<3|8|no int8|debug dynamic quantized model|\n|1<<4|16|no vulkan|reduce overhead for cpu op - gpu split - cpu op|\n|1<<5|32|no sgemm|reduce some memory|\n|1<<6|64|no winograd|reduce some memory|\n|1<<7|128|no threading|force single thread|\n\nThese bits can be OR-combined into one value to control multiple behaviors simultaneously.\n\nFor example, `31=17` means disabling both vulkan and fp16 arithmetic.\n\n## disable fp16 for certain layer to fix overflow\n\n```ruby\n7767517\n3 3\nInput           input   0 1 input0 0=22 1=22 2=32\nConvolution     conv0   1 1 input0 conv0 0=32 1=1 6=1024 9=1\nConvolution     conv1   1 1 conv0 conv1 0=128 1=3 6=36864 9=1\n```\n\nTypically, we use fp16 computation to improve inference speed.\nHowever, since the weight value of `conv1` is very large, fp16 accumulation may cause numerical overflow, so fp16 needs to be disabled individually for `conv1`, while other layers continue to use fp16 mode\n\nAdd `31=3` to disable fp16 storage and arithmetic.\n\n```ruby\n7767517\n3 3\nInput           input   0 1 input0 0=22 1=22 2=32\nConvolution     conv0   1 1 input0 conv0 0=32 1=1 6=1024 9=1\nConvolution     conv1   1 1 conv0 conv1 0=128 1=3 6=36864 9=1 31=3\n```\n\n## disable vulkan for certain layer to improve performance\n\n```ruby\n7767517\n5 5\nInput           input   0 1 input0 0=22 1=22 2=32\nConvolution     conv0   1 1 input0 conv0 0=32 1=1 6=1024 9=1\nSomeCPULayer    c0      1 1 conv0 c0 0=32\nReLU            relu0   
1 1 c0 relu0\nSomeCPULayer    c1      1 1 relu0 c1 0=32\n```\n\nBetween the two CPU-only layers sits a simple ReLU layer that supports vulkan. We can set `31=16` to force it to run on the CPU, which avoids the overhead of data upload, download and storage layout conversion between CPU and GPU. After all, the CPU is fast enough for simple operations.\n\n```ruby\n7767517\n5 5\nInput           input   0 1 input0 0=22 1=22 2=32\nConvolution     conv0   1 1 input0 conv0 0=32 1=1 6=1024 9=1\nSomeCPULayer    c0      1 1 conv0 c0 0=32\nReLU            relu0   1 1 c0 relu0 31=16\nSomeCPULayer    c1      1 1 relu0 c1 0=32\n```\n\n## disable winograd for certain layer to reduce memory usage\n\n```ruby\n7767517\n3 3\nInput           input   0 1 input0 0=22 1=22 2=32\nConvolution     conv0   1 1 input0 conv0 0=32 1=1 6=1024 9=1\nConvolution     conv1   1 1 conv0 conv1 0=128 1=3 6=36864 9=1\n```\n\nWinograd convolution spends extra memory to speed up convolution, but the speedup is not guaranteed. When memory is constrained or memory IO is the bottleneck, we can disable winograd on some layers in exchange for a smaller memory footprint. Add `31=64` to the Convolution layer to force it to use the implicit-gemm or tiled im2col-gemm implementation, which reduces memory usage and sometimes improves vulkan performance.\n\n```ruby\n7767517\n3 3\nInput           input   0 1 input0 0=22 1=22 2=32\nConvolution     conv0   1 1 input0 conv0 0=32 1=1 6=1024 9=1\nConvolution     conv1   1 1 conv0 conv1 0=128 1=3 6=36864 9=1 31=64\n```\n\n## disable threading for certain layer to improve performance\n\n```ruby\n7767517\n4 4\nInput           input   0 1 input0 0=22 1=22 2=3\nConvolution     conv0   1 1 input0 conv0 0=16 1=3 6=432\nHardSigmoid     hs      1 1 conv0 hs0\nConvolution     conv1   1 1 hs0 conv1 0=16 1=3 6=2304\n```\n\nThe overhead of multi-thread dispatch and merging is too large for small tensors. 
Add `31=128` to HardSigmoid layer, which forces it to execute in a single thread, reducing power consumption and improving performance.\n\n```ruby\n7767517\n4 4\nInput           input   0 1 input0 0=22 1=22 2=3\nConvolution     conv0   1 1 input0 conv0 0=16 1=3 6=432\nHardSigmoid     hs      1 1 conv0 hs0 31=128\nConvolution     conv1   1 1 hs0 conv1 0=16 1=3 6=2304\n```\n"
  },
  {
    "path": "docs/developer-guide/layer-support-behavior.md",
    "content": "# Understanding `support_XYZ` Properties in ncnn's `Layer` Class\n\nThis document is for developers implementing new layers in `ncnn`. It explains the `support_XYZ` boolean properties in the `ncnn::Layer` base class. Correctly setting these properties declares the capabilities of your layer to the `ncnn` inference engine. This allows the engine to apply specific optimizations, such as enabling SIMD, half-precision floating-point computation, or Vulkan GPU acceleration, to achieve optimal performance and memory efficiency.\n\n## When to Set `support` Properties\n\nA layer can set its `support` properties in two ways:\n\n1.  **Statically in the constructor**: If the layer's capabilities are fixed, the simplest way is to set them in its constructor.\n2.  **Dynamically in `create_pipeline`**: If the layer's capabilities depend on parameters loaded from `load_param` or `load_model` (e.g., the data type of weights), you can set these properties dynamically within the `create_pipeline` method.\n\n---\n\n## Property Details\n\nHere is a detailed breakdown of each `support` property and what it means for your layer's implementation.\n\n### `one_blob_only`\n\n*   **Purpose**: Declares that the layer accepts only one input `blob` and produces only one output `blob`.\n*   **Requirements if `true`**: You must implement the single-input, single-output version of the `forward` method:\n    ```cpp\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    ```\n*   **Behavior**: When `true`, `ncnn` calls this overload. If `false` (default), the `std::vector<Mat>` version of `forward` is called.\n\n### `support_inplace`\n\n*   **Purpose**: Declares that the layer supports in-place computation, meaning the input and output can share the same memory. This significantly reduces memory overhead.\n*   **Requirements if `true`**: You must implement the `forward_inplace` method. 
Depending on whether `one_blob_only` is also enabled, implement the corresponding version:\n    ```cpp\n    // If one_blob_only is true\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\n    // If one_blob_only is false\n    virtual int forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) const;\n    ```\n\n### `support_vulkan`\n\n*   **Purpose**: Declares that the layer has a Vulkan implementation for GPU-accelerated inference.\n*   **Requirements if `true`**:\n    *   Implement `forward` / `forward_inplace` methods that accept `VkMat` for input and output.\n    *   Implement `upload_model` to transfer weight data to the GPU.\n    *   Implement `create_pipeline` and `destroy_pipeline` to manage Vulkan `Pipeline` objects and other GPU resources.\n\n### `support_packing` (for CPU)\n\n*   **Purpose**: Declares that the layer's **CPU implementation** can handle `Mat` data with a \"packing\" memory layout (i.e., `elempack > 1`). This is crucial for SIMD optimizations (e.g., processing 4 or 8 floats at once with NEON or AVX).\n*   **Behavior if `true`**:\n    *   When the input `Mat` channel count is a multiple of the SIMD width, the `ncnn` engine ensures that the input `Mat` passed to `forward` / `forward_inplace` is packed (e.g., `elempack=4` or `elempack=8`).\n    *   Your implementation must correctly handle `Mat` data where `cstep` and `elempack` are not their default values.\n*   **Behavior if `false`**:\n    *   The `ncnn` engine guarantees that the input `Mat` passed to your layer will always have `elempack=1`. The engine will automatically insert conversions if the preceding layer produced a packed output.\n*   **Output**: Regardless of the property's value, your layer can output a `Mat` with any `elempack`. 
However, it is highly recommended to output a `Mat` with an adaptive `elempack` to avoid unnecessary conversions in subsequent layers.\n\n### `support_any_packing` (for CPU)\n\n*   **Purpose**: An extension of `support_packing`. It declares that the layer's **CPU implementation** is flexible enough to handle a `Mat` with **any** `elempack` value (`1`, `4`, `8`, etc.).\n*   **Behavior if `true`**:\n    *   The `ncnn` engine can pass an input `Mat` with any packing layout to your `forward` method, without forcing a conversion to the hardware's \"optimal\" `elempack`. For example, on an AVX512 system where `elempack=16` is optimal, your layer can still accept `elempack=1`, `4`, or `8`.\n    *   This gives the engine more flexibility to avoid unnecessary packing/unpacking conversions between layers.\n*   **Behavior if `false`**: If `false` (but `support_packing` is `true`), the engine will try to provide an input `Mat` with an optimal `elempack` for the target architecture.\n*   **Output**: This property does not enforce any constraint on the output `Mat`, which can have any `elempack`.\n\n### `support_vulkan_packing` (for Vulkan)\n\n*   **Purpose**: This is the Vulkan equivalent of `support_packing`. It declares that the layer's **Vulkan implementation** can handle `VkMat` with `elempack=4`.\n*   **Behavior if `true`**: When the input `VkMat` has a channel count that is a multiple of 4, the `ncnn` engine will provide a packed `VkMat` (with `elempack=4`) to your Vulkan `forward` methods.\n*   **Behavior if `false`**: The engine will ensure the input `VkMat` has `elempack=1`.\n*   **Note**: `support_packing` and `support_vulkan_packing` are independent. A layer can support packing on CPU but not on Vulkan, or vice-versa.\n\n### `support_vulkan_any_packing` (for Vulkan)\n\n*   **Purpose**: An extension of `support_vulkan_packing`. 
It declares that the layer's **Vulkan implementation** can handle a `VkMat` with **any** supported `elempack` value (e.g., `1`, `4`).\n*   **Behavior if `true`**:\n    *   The `ncnn` engine can pass an input `VkMat` with any supported packing layout to your Vulkan `forward` method. This allows the engine to avoid unnecessary repacking operations on the GPU.\n    *   This is particularly useful for optimizing shader dispatch and memory access patterns.\n*   **Behavior if `false`**: If `false` (but `support_vulkan_packing` is `true`), the engine will try to provide a `VkMat` with `elempack=4` if the channel count is a multiple of 4.\n*   **Note**: This property is independent of its CPU counterpart, `support_any_packing`.\n\n### `support_bf16_storage`\n\n*   **Purpose**: Declares that the layer can process `bfloat16` data.\n*   **Behavior if `true`**:\n    *   The `forward` method may receive an input `Mat` of type `bfloat16` (`elembits() == 16`) or `fp32`.\n    *   Inside your `forward` implementation, you must check `opt.use_bf16_storage` and `bottom_blob.elembits()` to determine whether to use a `bfloat16`-optimized code path.\n*   **Behavior if `false`**: The `ncnn` engine ensures your layer will **not** receive a `bfloat16` `Mat`.\n*   **Output**: Your layer can output either a `bfloat16` or `fp32` `Mat`. 
When `opt.use_bf16_storage` is active, outputting `bfloat16` is recommended to maintain precision and performance across the network.\n\n### `support_fp16_storage`\n\n*   **Purpose**: Declares that the layer can process `float16` data for half-precision inference.\n*   **Behavior if `true`**:\n    *   Similar to `support_bf16_storage`, the `forward` method may receive an `fp16` or `fp32` `Mat`.\n    *   Your implementation should check `opt.use_fp16_storage` and `bottom_blob.elembits()` to select the correct code path.\n*   **Behavior if `false`**: The `ncnn` engine ensures your layer will **not** receive an `fp16` `Mat`.\n*   **Output**: Your layer can output either a `fp16` or `fp32` `Mat`. When `opt.use_fp16_storage` is active, outputting an `fp16` `Mat` is recommended.\n\n### `support_int8_storage`\n\n*   **Purpose**: Declares that the layer supports `int8` quantized inference.\n*   **Behavior if `true`**:\n    *   When `opt.use_int8_inference` is `true`, the `forward` method may receive an `int8` or `fp32` `Mat`.\n    *   **Important**: If the input is `fp32`, your `forward` implementation is responsible for dynamically quantizing it to `int8` before performing computations.\n*   **Behavior if `false`**: The `ncnn` engine ensures your layer will **not** receive an `int8` `Mat`.\n*   **Output**: The output can be `int8` or `fp32`, depending on your layer's design.\n\n---\n\n## Practical Implementation and Priorities\n\n### Handling Multiple Precision Types\n\nA layer can set `support_fp16_storage` and `support_bf16_storage` to `true` simultaneously. The `ncnn` engine prioritizes these formats based on the `Option` flags. As seen in the `convert_layout` function in `src/net.cpp`, if `opt.use_bf16_storage` is true, the engine will prefer converting inputs to `bfloat16`. Otherwise, it falls back to `fp16` if `opt.use_fp16_storage` is true.\n\nThe chosen `elempack` also depends on the precision. 
For instance, with SIMD, the priority might be:\n*   FP16: `elempack=8` (if supported), then `elempack=4`, then `1`.\n*   BF16: `elempack=4`, then `1`.\n\nYour `forward` implementation should reflect this by checking `elembits()` and `elempack` to dispatch to the correct kernel.\n\n### Code Example: `Clip_arm`\n\nThe `Clip_arm` layer provides a great example of these concepts in practice.\n\n1.  **Declaring Support in the Constructor**:\n    It declares support for packing and, conditionally, for fp16 and bf16 storage.\n    ```cpp\n    // From: src/layer/arm/clip_arm.cpp\n    Clip_arm::Clip_arm()\n    {\n    #if __ARM_NEON\n        support_packing = true;\n    #if NCNN_ARM82\n        support_fp16_storage = cpu_support_arm_asimdhp();\n    #endif\n    #endif // __ARM_NEON\n\n    #if NCNN_BF16\n        support_bf16_storage = true;\n    #endif\n    }\n    ```\n\n2.  **Dispatching in `forward_inplace`**:\n    The `forward_inplace` method acts as a dispatcher. It first checks the element size (`elembits`) and the corresponding `opt` flag to decide whether to call a specialized low-precision implementation (`fp16s` or `bf16s`). If neither is applicable, it defaults to the standard `fp32` implementation.\n\n    ```cpp\n    // From: src/layer/arm/clip_arm.cpp\n    int Clip_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n    {\n        int elembits = bottom_top_blob.elembits();\n\n    #if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    #endif\n\n    #if NCNN_BF16\n        if (opt.use_bf16_storage && elembits == 16)\n            return forward_inplace_bf16s(bottom_top_blob, opt);\n    #endif\n\n        // Default fp32 implementation follows...\n        int w = bottom_top_blob.w;\n        // ...\n    }\n    ```\n\n### An Incremental Development Workflow\n\nAdopting a gradual approach can simplify the development of a new layer:\n\n1.  
**Implement the Core Algorithm**: Start with all `support_XYZ` properties set to `false`. Focus on getting the mathematical logic correct using standard `fp32` data and `elempack=1`.\n2.  **Add Packing Support**: Once the core logic is validated, set `support_packing = true`. Modify your code to handle `elempack > 1` and implement SIMD optimizations (e.g., using NEON intrinsics).\n3.  **Add Low-Precision Support**: Next, add support for `fp16`, `bf16`, or `int8`. Set the corresponding `support_*_storage` flags to `true` and add branches in your `forward` method to handle these data types based on the `opt` flags.\n4.  **Add Vulkan Support**: Finally, if GPU acceleration is desired, set `support_vulkan = true` and implement the Vulkan-specific methods.\n\nThis incremental process allows you to tackle one challenge at a time, making it easier to develop a highly optimized and feature-rich layer.\n"
  },
  {
    "path": "docs/developer-guide/low-level-operation-api.md",
    "content": "# implement elementwise addition with/without broadcast using BinaryOp operation\n\n* input must be fp32 storage without packing\n* output is expected to be fp32 storage without packing\n\n```cpp\nvoid binary_add(const ncnn::Mat& a, const ncnn::Mat& b, ncnn::Mat& c)\n{\n    ncnn::Option opt;\n    opt.num_threads = 2;\n    opt.use_fp16_storage = false;\n    opt.use_packing_layout = false;\n\n    ncnn::Layer* op = ncnn::create_layer(\"BinaryOp\");\n\n    // set param\n    ncnn::ParamDict pd;\n    pd.set(0, 0);// op_type\n\n    op->load_param(pd);\n\n    op->create_pipeline(opt);\n\n    // forward\n    std::vector<ncnn::Mat> bottoms(2);\n    bottoms[0] = a;\n    bottoms[1] = b;\n\n    std::vector<ncnn::Mat> tops(1);\n    op->forward(bottoms, tops, opt);\n\n    c = tops[0];\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n}\n```\n\n# implement 3x3 box blur on three channel image using ConvolutionDepthWise operation\n\n* input must be fp32 storage without packing\n* output is expected to be fp32 storage without packing\n\n```cpp\nvoid convolution_3x3_boxblur_RGB(const ncnn::Mat& rgb, ncnn::Mat& out)\n{\n    ncnn::Option opt;\n    opt.num_threads = 2;\n    opt.use_fp16_storage = false;\n    opt.use_packing_layout = false;\n\n    ncnn::Layer* op = ncnn::create_layer(\"ConvolutionDepthWise\");\n\n    // set param\n    ncnn::ParamDict pd;\n    pd.set(0, 3);// num_output\n    pd.set(1, 3);// kernel_w\n    pd.set(5, 0);// bias_term\n    pd.set(6, 3*3*3);// weight_data_size\n    pd.set(7, 3);// group\n\n    op->load_param(pd);\n\n    // set weights\n    ncnn::Mat weights[1];\n    weights[0].create(3*3*3);// weight_data\n\n    for (int i=0; i<3*3*3; i++)\n    {\n        weights[0][i] = 1.f / 9;\n    }\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    // forward\n    op->forward(rgb, out, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n}\n```\n# transpose Mat, chw to cwh\n\n* input must be fp32 
storage with/without packing\n* output is expected to be fp32 storage packed\n\n```cpp\nvoid transpose(const ncnn::Mat& in, ncnn::Mat& out)\n{\n    ncnn::Option opt;\n    opt.num_threads = 2;\n    opt.use_fp16_storage = false;\n    opt.use_packing_layout = true;\n\n    ncnn::Layer* op = ncnn::create_layer(\"Permute\");\n\n    // set param\n    ncnn::ParamDict pd;\n    pd.set(0, 1);// order_type\n\n    op->load_param(pd);\n\n    op->create_pipeline(opt);\n\n    ncnn::Mat in_packed = in;\n    {\n        // resolve dst_elempack\n        int dims = in.dims;\n        int elemcount = 0;\n        if (dims == 1) elemcount = in.elempack * in.w;\n        if (dims == 2) elemcount = in.elempack * in.h;\n        if (dims == 3) elemcount = in.elempack * in.c;\n\n        int dst_elempack = 1;\n        if (op->support_packing)\n        {\n            if (elemcount % 8 == 0 && (ncnn::cpu_support_x86_avx2() || ncnn::cpu_support_x86_avx()))\n                dst_elempack = 8;\n            else if (elemcount % 4 == 0)\n                dst_elempack = 4;\n        }\n\n        if (in.elempack != dst_elempack)\n        {\n            convert_packing(in, in_packed, dst_elempack, opt);\n        }\n    }\n\n    // forward\n    op->forward(in_packed, out, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n}\n```\n# apply instance normalization\n// x = (x - mean) / sqrt(var)\n\n* input can be fp32/fp16 storage with/without packing\n* output is expected to be fp16 storage packed when supported, or fp32 storage packed otherwise\n\n```cpp\nvoid normalize(const ncnn::Mat& in, ncnn::Mat& out)\n{\n    ncnn::Option opt;\n    opt.num_threads = 2;\n    opt.use_fp16_storage = true;\n    opt.use_packing_layout = true;\n\n    ncnn::Layer* op = ncnn::create_layer(\"InstanceNorm\");\n\n    // set param\n    ncnn::ParamDict pd;\n    pd.set(0, in.c);// channels\n    pd.set(1, 0.f);// eps\n\n    op->load_param(pd);\n\n    // set weights\n    ncnn::Mat weights[2];\n    weights[0].create(in.c);// 
gamma_data\n    weights[1].create(in.c);// beta_data\n\n    weights[0].fill(1.f);\n    weights[1].fill(0.f);\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    ncnn::Mat in_fp16 = in;\n    if (in.elembits() == 32 && op->support_fp16_storage)\n    {\n        cast_float32_to_float16(in, in_fp16, opt);\n    }\n    if (in.elembits() == 16 && !op->support_fp16_storage)\n    {\n        cast_float16_to_float32(in, in_fp16, opt);\n    }\n\n    ncnn::Mat in_fp16_packed = in_fp16;\n    {\n        // resolve dst_elempack\n        int dims = in_fp16.dims;\n        int elemcount = 0;\n        if (dims == 1) elemcount = in_fp16.elempack * in_fp16.w;\n        if (dims == 2) elemcount = in_fp16.elempack * in_fp16.h;\n        if (dims == 3) elemcount = in_fp16.elempack * in_fp16.c;\n\n        int dst_elempack = 1;\n        if (op->support_packing)\n        {\n            if (elemcount % 8 == 0 && (ncnn::cpu_support_x86_avx2() || ncnn::cpu_support_x86_avx()))\n                dst_elempack = 8;\n            else if (elemcount % 4 == 0)\n                dst_elempack = 4;\n        }\n\n        if (in_fp16.elempack != dst_elempack)\n        {\n            convert_packing(in_fp16, in_fp16_packed, dst_elempack, opt);\n        }\n    }\n\n    // forward\n    op->forward(in_fp16_packed, out, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n}\n```\n\n# cpu -> gpu -> forward -> gpu -> cpu\n\n```cpp\nncnn::VulkanDevice* vkdev = ncnn::get_gpu_device();\n\nncnn::VkAllocator* blob_vkallocator = vkdev->acquire_blob_allocator();\nncnn::VkAllocator* staging_vkallocator = vkdev->acquire_staging_allocator();\n\nncnn::VkWeightAllocator* weight_vkallocator = new ncnn::VkWeightAllocator(vkdev);\nncnn::VkWeightStagingAllocator* weight_staging_vkallocator = new ncnn::VkWeightStagingAllocator(vkdev);\n\n// create layer\nncnn::Layer* convolution = ncnn::create_layer(\"Convolution\");\nconvolution->vkdev = vkdev;\n\n// set option\nncnn::Option 
opt;\nopt.num_threads = 4;\nopt.use_vulkan_compute = true;\nopt.blob_vkallocator = blob_vkallocator;\nopt.workspace_vkallocator = blob_vkallocator;\nopt.staging_vkallocator = staging_vkallocator;\n\n// load param\n{\n    ncnn::ParamDict pd;\n    pd.set(0, outch);\n    pd.set(1, ksize);\n    pd.set(6, outch*inch*ksize*ksize);\n    pd.use_vulkan_compute = 1;\n\n    convolution->load_param(pd);\n}\n\n// load model\n{\n    ncnn::Mat weights[2];\n    weights[0] = random_mat(outch*inch*ksize*ksize);\n    weights[1] = random_mat(outch);\n\n    ncnn::ModelBinFromMatArray mb(weights);\n    convolution->load_model(mb);\n}\n\n// create pipeline\nconvolution->create_pipeline(opt);\n\n// upload model\n{\n    ncnn::VkTransfer cmd(vkdev);\n\n    ncnn::Option opt_upload = opt;\n    opt_upload.blob_vkallocator = weight_vkallocator;\n    opt_upload.workspace_vkallocator = weight_vkallocator;\n    opt_upload.staging_vkallocator = weight_staging_vkallocator;\n\n    convolution->upload_model(cmd, opt_upload);\n\n    cmd.submit_and_wait();\n}\n\nncnn::Mat bottom = random_mat(w, h, inch);\n\nncnn::Mat top;\n\n// forward\n{\n    ncnn::VkCompute cmd(vkdev);\n\n    ncnn::VkMat bottom_gpu;\n    cmd.record_upload(bottom, bottom_gpu, opt);\n\n    ncnn::VkMat top_gpu;\n    convolution->forward(bottom_gpu, top_gpu, cmd, opt);\n\n    cmd.record_download(top_gpu, top, opt);\n\n    cmd.submit_and_wait();\n}\n\nconvolution->destroy_pipeline(opt);\n\ndelete convolution;\n\nvkdev->reclaim_blob_allocator(blob_vkallocator);\nvkdev->reclaim_staging_allocator(staging_vkallocator);\n\nweight_vkallocator->clear();\nweight_staging_vkallocator->clear();\ndelete weight_vkallocator;\ndelete weight_staging_vkallocator;\n```\n\n"
  },
  {
    "path": "docs/developer-guide/ncnn-tips-and-tricks.zh.md",
    "content": "### blob内存是隐含共享的\n\nncnn的blob最初直接使用opencv的cv::Mat，后发现blob最多只支持三维，因此实现了类似的Mat\nMat的data每个通道内存16字节对齐，并且有原子的引用计数，a=b不复制数据，超级快\nMat支持直接引用外部的内存块，不复制数据，加快模型加载和输入输出\n\n举个例子：split layer 将一个blob复制成n个，ncnn中实现为单纯的增加引用计数，没有任何数据复制\n\n### 只运算一部分并保留中间结果\n\nncnn的net在解决分支依赖时是自上而下深度优先的，因此当网络有多个分支时，运算只会在需要结果的那个分支中进行，节约时间\n当多个分支有重合部分时，运算其中一个分支后会自动保留其余分支所需的中间结果，隐含共享，以便运算其余分支时利用\n\n举个例子：某网络结构为 A -> B -> C1 + C2，向ncnn索要C1结果时，运算过程是 A -> B -> C1，同时B结果引用计数加1自动保留，后面还需要C2结果时，只运算C2就足够了\n\n### 开启轻模式省内存\n\n每个layer都会产生blob，除了最后的结果和多分支中间结果，大部分blob都不值得保留，开启轻模式可以在运算后自动回收，省下内存\n\n举个例子：某网络结构为 A -> B -> C，在轻模式下，向ncnn索要C结果时，A结果会在运算B时自动回收，而B结果会在运算C时自动回收，最后只保留C结果，后面再需要C结果会直接获得，满足绝大部分深度网络的使用方式\n\n### 网络和运算是分开的\n\nncnn的net是网络模型，实际使用的是extractor，也就是同个net可以有很多个运算实例，而且运算实例互不影响，中间结果保留在extractor内部，在多线程使用时共用网络的结构和参数数据，初始化网络模型和参数只需要一遍\n\n举个例子：全局静态的net实例，初始化一次后，就能不停地生成extractor使用\n\n### openmp虽快但未必合适\n\nncnn中几乎所有运算都能用上openmp多线程加速，而且性能很赞\n不过系统有时候会突然慢一下，比如手机太热自动降频，界面操作等等，ncnn耗时也会偶尔抖动变长，在计算耗时稳定性比较重要的时候建议关闭openmp，或者设置下extractor线程数\n\n举个例子：手机自拍时，用ncnn进行人脸实时定位，如果耗时突然涨一下就会感觉到掉帧，而稳定的帧率体验更好\n\n### NCNN_STDIO/NCNN_STRING禁用模型文件\n\nncnn支持加载自有的模型文件和模型内存，NCNN_STDIO控制是否需要支持加载模型文件，设成0能禁用这部分代码，从而减小库的体积，NCNN_STRING设成0能清除大部分可见的字符串和解析过程\n模型内存加载时的参数数据是直接引用的，速度更快，通常在手机上使用这种方式\n\n### 削减 ncnn 内置的层实现\n\ncmake的时候，加参数 -DWITH_LAYER_xxx=OFF 就可以完全不编译对应的内置层，这样可以进一步减小库的体积\n\n### 关于 ARM big.LITTLE 调度\n\n调用set_cpu_powersave可以把ncnn运算线程控制在特定的cpu核心上，大核心速度快耗电多，小核心速度慢点但省电，大小一起用手机热得快\n"
  },
  {
    "path": "docs/developer-guide/new-model-load-api.md",
    "content": "## current model load api\n### Cons\n#### long and awful code\n#### two functions\n#### deal float32 float16 quantized-u8\n#### deal alignment size\n```cpp\n#if NCNN_STDIO\nint Convolution::load_model(FILE* binfp)\n{\n    int nread;\n\n    union\n    {\n        struct\n        {\n            unsigned char f0;\n            unsigned char f1;\n            unsigned char f2;\n            unsigned char f3;\n        };\n        unsigned int tag;\n    } flag_struct;\n\n    nread = fread(&flag_struct, sizeof(flag_struct), 1, binfp);\n    if (nread != 1)\n    {\n        fprintf(stderr, \"Convolution read flag_struct failed %d\\n\", nread);\n        return -1;\n    }\n\n    unsigned int flag = flag_struct.f0 + flag_struct.f1 + flag_struct.f2 + flag_struct.f3;\n\n    weight_data.create(weight_data_size);\n    if (weight_data.empty())\n        return -100;\n\n    if (flag_struct.tag == 0x01306B47)\n    {\n        // half-precision weight data\n        int align_weight_data_size = alignSize(weight_data_size * sizeof(unsigned short), 4);\n        std::vector<unsigned short> float16_weights;\n        float16_weights.resize(align_weight_data_size);\n        nread = fread(float16_weights.data(), align_weight_data_size, 1, binfp);\n        if (nread != 1)\n        {\n            fprintf(stderr, \"Convolution read float16_weights failed %d\\n\", nread);\n            return -1;\n        }\n\n        weight_data = Mat::from_float16(float16_weights.data(), weight_data_size);\n        if (weight_data.empty())\n            return -100;\n    }\n    else if (flag != 0)\n    {\n        // quantized weight data\n        float quantization_value[256];\n        nread = fread(quantization_value, 256 * sizeof(float), 1, binfp);\n        if (nread != 1)\n        {\n            fprintf(stderr, \"Convolution read quantization_value failed %d\\n\", nread);\n            return -1;\n        }\n\n        int align_weight_data_size = alignSize(weight_data_size * sizeof(unsigned char), 
4);\n        std::vector<unsigned char> index_array;\n        index_array.resize(align_weight_data_size);\n        nread = fread(index_array.data(), align_weight_data_size, 1, binfp);\n        if (nread != 1)\n        {\n            fprintf(stderr, \"Convolution read index_array failed %d\\n\", nread);\n            return -1;\n        }\n\n        float* weight_data_ptr = weight_data;\n        for (int i = 0; i < weight_data_size; i++)\n        {\n            weight_data_ptr[i] = quantization_value[ index_array[i] ];\n        }\n    }\n    else if (flag_struct.f0 == 0)\n    {\n        // raw weight data\n        nread = fread(weight_data, weight_data_size * sizeof(float), 1, binfp);\n        if (nread != 1)\n        {\n            fprintf(stderr, \"Convolution read weight_data failed %d\\n\", nread);\n            return -1;\n        }\n    }\n\n    if (bias_term)\n    {\n        bias_data.create(num_output);\n        if (bias_data.empty())\n            return -100;\n        nread = fread(bias_data, num_output * sizeof(float), 1, binfp);\n        if (nread != 1)\n        {\n            fprintf(stderr, \"Convolution read bias_data failed %d\\n\", nread);\n            return -1;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_STDIO\n\nint Convolution::load_model(const unsigned char*& mem)\n{\n    union\n    {\n        struct\n        {\n            unsigned char f0;\n            unsigned char f1;\n            unsigned char f2;\n            unsigned char f3;\n        };\n        unsigned int tag;\n    } flag_struct;\n\n    memcpy(&flag_struct, mem, sizeof(flag_struct));\n    mem += sizeof(flag_struct);\n\n    unsigned int flag = flag_struct.f0 + flag_struct.f1 + flag_struct.f2 + flag_struct.f3;\n\n    if (flag_struct.tag == 0x01306B47)\n    {\n        // half-precision weight data\n        weight_data = Mat::from_float16((unsigned short*)mem, weight_data_size);\n        mem += alignSize(weight_data_size * sizeof(unsigned short), 4);\n        if 
(weight_data.empty())\n            return -100;\n    }\n    else if (flag != 0)\n    {\n        // quantized weight data\n        const float* quantization_value = (const float*)mem;\n        mem += 256 * sizeof(float);\n\n        const unsigned char* index_array = (const unsigned char*)mem;\n        mem += alignSize(weight_data_size * sizeof(unsigned char), 4);\n\n        weight_data.create(weight_data_size);\n        if (weight_data.empty())\n            return -100;\n        float* weight_data_ptr = weight_data;\n        for (int i = 0; i < weight_data_size; i++)\n        {\n            weight_data_ptr[i] = quantization_value[ index_array[i] ];\n        }\n    }\n    else if (flag_struct.f0 == 0)\n    {\n        // raw weight data\n        weight_data = Mat(weight_data_size, (float*)mem);\n        mem += weight_data_size * sizeof(float);\n    }\n\n    if (bias_term)\n    {\n        bias_data = Mat(num_output, (float*)mem);\n        mem += num_output * sizeof(float);\n    }\n\n    return 0;\n}\n```\n\n## new model load api proposed\n### Pros\n#### clean and simple api\n#### element type detection\n```cpp\nint Convolution::load_model(const ModelBin& mb)\n{\n    // auto detect element type\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        // certain type specified\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n```\n"
  },
  {
    "path": "docs/developer-guide/new-param-load-api.md",
    "content": "## current param load api\n### Cons\n#### long and awful code\n#### three functions\n#### not extensible\n#### no default value\n#### no variable length array\n```\nMyLayer  mylayer 1 1 in out 100 1.250000\n```\n```\nbinary 100\nbinary 1.250000\n```\n```cpp\n#if NCNN_STDIO\n#if NCNN_STRING\nint MyLayer::load_param(FILE* paramfp)\n{\n    int nscan = fscanf(paramfp, \"%d %f\", &a, &b);\n    if (nscan != 2)\n    {\n        fprintf(stderr, \"MyLayer load_param failed %d\\n\", nscan);\n        return -1;\n    }\n\n    return 0;\n}\n#endif // NCNN_STRING\nint MyLayer::load_param_bin(FILE* paramfp)\n{\n    fread(&a, sizeof(int), 1, paramfp);\n\n    fread(&b, sizeof(float), 1, paramfp);\n\n    return 0;\n}\n#endif // NCNN_STDIO\n\nint MyLayer::load_param(const unsigned char*& mem)\n{\n    a = *(int*)(mem);\n    mem += 4;\n\n    b = *(float*)(mem);\n    mem += 4;\n\n    return 0;\n}\n```\n\n## new param load api proposed\n### Pros\n#### clean and simple api\n#### default value\n#### extensible\n#### variable length array\n```\n7767517\nMyLayer  mylayer 1 1 in out 0=100 1=1.250000 -23303=5,0.1,0.2,0.4,0.8,1.0\n```\n```\nbinary 0xDD857600(magic)\n\nbinary 0\nbinary 100\nbinary 1\nbinary 1.250000\nbinary -23303\nbinary 5\nbinary 0.1\nbinary 0.2\nbinary 0.4\nbinary 0.8\nbinary 1.0\nbinary -233(EOP)\n```\n```cpp\nint MyLayer::load_param(const ParamDict& pd)\n{\n    // pd.get( param id (seq), default value );\n    a = pd.get(0, 100);\n    b = pd.get(1, 1.25f);\n\n    // c falls back to its default value if id 2 is not specified in the param file\n    // (float literal, so the call resolves to the float overload of pd.get)\n    c = pd.get(2, 0.001f);\n\n    // get array\n    d = pd.get(3, Mat(len, array));\n    return 0;\n}\n```\n"
  },
  {
    "path": "docs/developer-guide/operation-param-weight-table.md",
    "content": "\n|operation|param id|param phase|default value|weight order|\n|:---:|:---:|:---:|:---:|:---:|\n|AbsVal|||\n|ArgMax|0|out_max_val|0|\n||1|topk|1|\n|BatchNorm|0|channels|0|slope mean variance bias|\n||1|eps|0.f|\n|Bias|0|bias_data_size|0|\n|BinaryOp|0|op_type|0|\n||1|with_scalar|0|\n||2|b|0.f|\n|BNLL|||\n|Cast|0|type_from|0|\n||1|type_to|0|\n|Clip|0|min|-FLT_MAX|\n||1|max|FLT_MAX|\n|Concat|0|axis|0|\n|Convolution|0|num_output|0|weight bias|\n||1|kernel_w|0|\n||2|dilation_w|1|\n||3|stride_w|1|\n||4|pad_left|0|\n||5|bias_term|0|\n||6|weight_data_size|0|\n||8|int8_scale_term|0|\n||9|activation_type|0|\n||10|activation_params|[ ]|\n||11|kernel_h|kernel_w|\n||12|dilation_h|dilation_w|\n||13|stride_h|stride_w|\n||15|pad_right|pad_left|\n||14|pad_top|pad_left|\n||16|pad_bottom|pad_top|\n||17|impl_type|0|\n||18|pad_value|0.f|\n|ConvolutionDepthWise|0|num_output|0|weight bias|\n||1|kernel_w|0|\n||2|dilation_w|1|\n||3|stride_w|1|\n||4|pad_left|0|\n||5|bias_term|0|\n||6|weight_data_size|0|\n||7|group|1|\n||8|int8_scale_term|0|\n||9|activation_type|0|\n||10|activation_params|[ ]|\n||11|kernel_h|kernel_w|\n||12|dilation_h|dilation_w|\n||13|stride_h|stride_w|\n||15|pad_right|pad_left|\n||14|pad_top|pad_left|\n||16|pad_bottom|pad_top|\n||18|pad_value|0.f|\n|Crop|0|woffset|0|\n||1|hoffset|0|\n||2|coffset|0|\n||3|outw|0|\n||4|outh|0|\n||5|outc|0|\n||6|woffset2|0|\n||7|hoffset2|0|\n||8|coffset2|0|\n||9|starts|[ ]|\n||10|ends|[ ]|\n||11|axes|[ ]|\n|Deconvolution|0|num_output|0|weight bias|\n||1|kernel_w|0|\n||2|dilation_w|1|\n||3|stride_w|1|\n||4|pad_left|0|\n||5|bias_term|0|\n||6|weight_data_size|0|\n||9|activation_type|0|\n||10|activation_params|[ ]|\n||11|kernel_h|kernel_w|\n||12|dilation_h|dilation_w|\n||13|stride_h|stride_w|\n||15|pad_right|pad_left|\n||14|pad_top|pad_left|\n||16|pad_bottom|pad_top|\n||18|output_pad_right|0|\n||19|output_pad_bottom|output_pad_right|\n||20|output_w|0|\n||21|output_h|output_w|\n|DeconvolutionDepthWise|0|num_output|0|weight 
bias|\n||1|kernel_w|0|\n||2|dilation_w|1|\n||3|stride_w|1|\n||4|pad_left|0|\n||5|bias_term|0|\n||6|weight_data_size|0|\n||7|group|1|\n||9|activation_type|0|\n||10|activation_params|[ ]|\n||11|kernel_h|kernel_w|\n||12|dilation_h|dilation_w|\n||13|stride_h|stride_w|\n||15|pad_right|pad_left|\n||14|pad_top|pad_left|\n||16|pad_bottom|pad_top|\n||18|output_pad_right|0|\n||19|output_pad_bottom|output_pad_right|\n||20|output_w|0|\n||21|output_h|output_w|\n|Dequantize|0|scale|1.f|bias|\n||1|bias_term|0|\n||2|bias_data_size|0|\n|DetectionOutput|0|num_class|0|\n||1|nms_threshold|0.05f|\n||2|nms_top_k|300|\n||3|keep_top_k|100|\n||4|confidence_threshold|0.5f|\n||5|variances[0]|0.1f|\n||6|variances[1]|0.1f|\n||7|variances[2]|0.2f|\n||8|variances[3]|0.2f|\n|Dropout|0|scale|1.f|\n|Eltwise|0|op_type|0|\n||1|coeffs|[ ]|\n|ELU|0|alpha|0.1f|\n|Embed|0|num_output|0|weight bias|\n||1|input_dim|0|\n||2|bias_term|0|\n||3|weight_data_size|0|\n|Exp|0|base|-1.f|\n||1|scale|1.f|\n||2|shift|0.f|\n|ExpandDims|0|expand_w|0|\n||1|expand_h|0|\n||2|expand_c|0|\n||3|axes|[ ]|\n|Flatten|||\n|HardSigmoid|0|alpha|0.2f||\n||1|beta|0.5f|\n|HardSwish|0|alpha|0.2f||\n||1|beta|0.5f|\n|InnerProduct|0|num_output|0|weight bias|\n||1|bias_term|0|\n||2|weight_data_size|0|\n||8|int8_scale_term|0|\n||9|activation_type|0|\n||10|activation_params|[ ]|\n|Input|0|w|0|\n||1|h|0|\n||2|c|0|\n|InstanceNorm|0|channels|0|gamma 
bias|\n||1|eps|0.001f|\n|Interp|0|resize_type|0|\n||1|height_scale|1.f|\n||2|width_scale|1.f|\n||3|output_height|0|\n||4|output_width|0|\n|Log|0|base|-1.f|\n||1|scale|1.f|\n||2|shift|0.f|\n|LRN|0|region_type|0|\n||1|local_size|5|\n||2|alpha|1.f|\n||3|beta|0.75f|\n||4|bias|1.f|\n|LSTM|0|num_output|0|\n||1|weight_data_size|1|\n||2|direction|0|\n|MemoryData|0|w|0|\n||1|h|0|\n||2|c|0|\n|Mish|||\n|MVN|0|normalize_variance|0|\n||1|across_channels|0|\n||2|eps|0.0001f|\n|Noop|||\n|Normalize|0|across_spatial|0|scale|\n||4|across_channel|0|\n||1|channel_shared|0|\n||2|eps|0.0001f|\n||9|eps_mode|0|\n||3|scale_data_size|0|\n|Packing|0|out_packing|1|\n||1|use_padding|0|\n||2|cast_type_from|0|\n||3|cast_type_to|0|\n||4|storage_type_from|0|\n||5|storage_type_to|0|\n|Padding|0|top|0|per_channel_pad_data|\n||1|bottom|0|\n||2|left|0|\n||3|right|0|\n||4|type|0|\n||5|value|0.f|\n||6|per_channel_pad_data_size|0|\n||7|front|0|\n||8|behind|0|\n|Permute|0|order_type|0|\n|PixelShuffle|0|upscale_factor|1|\n|Pooling|0|pooling_type(0: max 1: avg)|0|\n||1|kernel_w|0|\n||11|kernel_h|kernel_w|\n||2|stride_w|1|\n||12|stride_h|stride_w|\n||3|pad_left|0|\n||14|pad_right|pad_left|\n||13|pad_top|pad_left|\n||15|pad_bottom|pad_top|\n||4|global_pooling|0|\n||5|pad_mode|0|\n|Power|0|power|1.f|\n||1|scale|1.f|\n||2|shift|0.f|\n|PReLU|0|num_slope|0|slope|\n|PriorBox|0|min_sizes|[ ]|\n||1|max_sizes|[ ]|\n||2|aspect_ratios|[ 
]|\n||3|variances[0]|0.f|\n||4|variances[1]|0.f|\n||5|variances[2]|0.f|\n||6|variances[3]|0.f|\n||7|flip|1|\n||8|clip|0|\n||9|image_width|0|\n||10|image_height|0|\n||11|step_width|-233.f|\n||12|step_height|-233.f|\n||13|offset|0.f|\n||14|step_mmdetection|0|\n||15|center_mmdetection|0|\n|Proposal|0|feat_stride|16|\n||1|base_size|16|\n||2|pre_nms_topN|6000|\n||3|after_nms_topN|300|\n||4|nms_thresh|0.7f|\n||5|min_size|16|\n|PSROIPooling|0|pooled_width|7|\n||1|pooled_height|7|\n||2|spatial_scale|0.0625f|\n||3|output_dim|0|\n|Quantize|0|scale|1.f|\n|Reduction|0|operation|0|\n||1|dim|0|\n||2|coeff|1.f|\n||3|axes|[ ]|\n||4|keepdims|0|\n|ReLU|0|slope|0.f|\n|Reorg|0|stride|0|\n|Requantize|0|scale_in|1.f|bias|\n||1|scale_out|1.f|\n||2|bias_term|0|\n||3|bias_data_size|0|\n||4|fusion_relu|0|\n|Reshape|0|w|-233|\n||1|h|-233|\n||2|c|-233|\n||3|permute|0|\n|ROIAlign|0|pooled_width|0|\n||1|pooled_height|0|\n||2|spatial_scale|1.f|\n||3|sampling_ratio|0|\n||4|aligned|0|\n||5|version|0|\n|ROIPooling|0|pooled_width|0|\n||1|pooled_height|0|\n||2|spatial_scale|1.f|\n|Scale|0|scale_data_size|0|scale bias|\n||1|bias_term|0|\n|SELU|0|alpha|1.67326324f||\n||1|lambda|1.050700987f|\n|ShuffleChannel|0|group|1|\n|Sigmoid|||\n|Slice|0|slices|[ ]|\n||1|axis|0|\n|Softmax|0|axis|0|\n|Split|||\n|SPP|0|pooling_type|0|\n||1|pyramid_height|1|\n|Squeeze|0|squeeze_w|0|\n||1|squeeze_h|0|\n||2|squeeze_c|0|\n||3|axes|[ ]|\n|StatisticsPooling|0|include_stddev|0|\n|Swish|||\n|TanH|||\n|Threshold|0|threshold|0.f|\n|Tile|0|dim|0|\n||1|tiles|1|\n|UnaryOp|0|op_type|0|\n|YoloDetectionOutput|0|num_class|20|\n||1|num_box|5|\n||2|confidence_threshold|0.01f|\n||3|nms_threshold|0.45f|\n||4|biases|[]|\n|Yolov3DetectionOutput|0|num_class|20|\n||1|num_box|5|\n||2|confidence_threshold|0.01f|\n||3|nms_threshold|0.45f|\n||4|biases|[]|\n||5|mask|[]|\n||6|anchors_scale|[]|\n|RNN|0|num_output|0|\n||1|weight_data_size|0|\n||2|direction|0|\n|MultiHeadAttention|0|embed_dim|0|\n||1|num_head|1|\n||2|weight_data_size|0|\n"
  },
  {
    "path": "docs/developer-guide/operators.md",
    "content": "\n* [AbsVal](#absval)\n* [ArgMax](#argmax)\n* [BatchNorm](#batchnorm)\n* [Bias](#bias)\n* [BinaryOp](#binaryop)\n* [BNLL](#bnll)\n* [Cast](#cast)\n* [CELU](#celu)\n* [Clip](#clip)\n* [Concat](#concat)\n* [Convolution](#convolution)\n* [Convolution1D](#convolution1d)\n* [Convolution3D](#convolution3d)\n* [ConvolutionDepthWise](#convolutiondepthwise)\n* [ConvolutionDepthWise1D](#convolutiondepthwise1d)\n* [ConvolutionDepthWise3D](#convolutiondepthwise3d)\n* [CopyTo](#copyto)\n* [Crop](#crop)\n* [CumulativeSum](#cumulativesum)\n* [Deconvolution](#deconvolution)\n* [Deconvolution1D](#deconvolution1d)\n* [Deconvolution3D](#deconvolution3d)\n* [DeconvolutionDepthWise](#deconvolutiondepthwise)\n* [DeconvolutionDepthWise1D](#deconvolutiondepthwise1d)\n* [DeconvolutionDepthWise3D](#deconvolutiondepthwise3d)\n* [DeformableConv2D](#deformableconv2d)\n* [Dequantize](#dequantize)\n* [Diag](#diag)\n* [Dropout](#dropout)\n* [Eltwise](#eltwise)\n* [ELU](#elu)\n* [Embed](#embed)\n* [Exp](#exp)\n* [ExpandDims](#expanddims)\n* [Flatten](#flatten)\n* [Flip](#flip)\n* [Fold](#fold)\n* [GELU](#gelu)\n* [GLU](#glu)\n* [Gemm](#gemm)\n* [GridSample](#gridsample)\n* [GroupNorm](#groupnorm)\n* [GRU](#gru)\n* [HardSigmoid](#hardsigmoid)\n* [HardSwish](#hardswish)\n* [InnerProduct](#innerproduct)\n* [Input](#input)\n* [InstanceNorm](#instancenorm)\n* [Interp](#interp)\n* [InverseSpectrogram](#inversespectrogram)\n* [LayerNorm](#layernorm)\n* [Log](#log)\n* [LRN](#lrn)\n* [LSTM](#lstm)\n* [MemoryData](#memorydata)\n* [Mish](#mish)\n* [MultiHeadAttention](#multiheadattention)\n* [MVN](#mvn)\n* [Noop](#noop)\n* [Normalize](#normalize)\n* [Packing](#packing)\n* [Padding](#padding)\n* [Permute](#permute)\n* [PixelShuffle](#pixelshuffle)\n* [Pooling](#pooling)\n* [Pooling1D](#pooling1d)\n* [Pooling3D](#pooling3d)\n* [Power](#power)\n* [PReLU](#prelu)\n* [Quantize](#quantize)\n* [Reduction](#reduction)\n* [ReLU](#relu)\n* [Reorg](#reorg)\n* [Requantize](#requantize)\n* 
[Reshape](#reshape)\n* [RMSNorm](#rmsnorm)\n* [RNN](#rnn)\n* [RotaryEmbed](#rotaryembed)\n* [Scale](#scale)\n* [SDPA](#sdpa)\n* [SELU](#selu)\n* [Shrink](#shrink)\n* [ShuffleChannel](#shufflechannel)\n* [Sigmoid](#sigmoid)\n* [Slice](#slice)\n* [Softmax](#softmax)\n* [Softplus](#softplus)\n* [Spectrogram](#spectrogram)\n* [Split](#split)\n* [Squeeze](#squeeze)\n* [Swish](#swish)\n* [TanH](#tanh)\n* [Threshold](#threshold)\n* [Tile](#tile)\n* [UnaryOp](#unaryop)\n* [Unfold](#unfold)\n\n# AbsVal\n```\ny = abs(x)\n```\n\n* one_blob_only\n* support_inplace\n\n# ArgMax\n```\ny = argmax(x, out_max_val, topk)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | out_max_val   | int   | 0         |                   |\n| 1         | topk          | int   | 1         |                   |\n\n# BatchNorm\n```\ny = (x - mean) / sqrt(var + eps) * slope + bias\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | channels      | int   | 0         |                   |\n| 1         | eps           | float | 0.f       |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| slope_data    | float | [channels]            |\n| mean_data     | float | [channels]            |\n| var_data      | float | [channels]            |\n| bias_data     | float | [channels]            |\n\n# Bias\n```\ny = x + bias\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | bias_data_size| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | 
--------------------- |\n| bias_data     | float | [channels]            |\n\n# BinaryOp\nThis operation performs an element-wise binary computation on A and B. Operands of different shapes are combined according to the [broadcasting rule](https://github.com/Tencent/ncnn/wiki/binaryop-broadcasting).\n```\nC = binaryop(A, B)\n```\nif with_scalar = 1:\n- one_blob_only\n- support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | op_type       | int   | 0         | Operation type as follows |\n| 1         | with_scalar   | int   | 0         | with_scalar=0 B is a matrix, with_scalar=1 B is a scalar |\n| 2         | b             | float | 0.f       | When B is a scalar, B = b |\n\nOperation type:\n- 0 = ADD\n- 1 = SUB\n- 2 = MUL\n- 3 = DIV\n- 4 = MAX\n- 5 = MIN\n- 6 = POW\n- 7 = RSUB\n- 8 = RDIV\n- 9 = RPOW\n- 10 = ATAN2\n- 11 = RATAN2\n\n# BNLL\n```\ny = x + log(1 + e^(-x)), x > 0\ny = log(1 + e^x),        x <= 0\n```\n\n* one_blob_only\n* support_inplace\n\n# Cast\n```\ny = cast(x)\n```\n\n* one_blob_only\n* support_packing\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | type_from     | int   | 0         |                   |\n| 1         | type_to       | int   | 0         |                   |\n\nElement type:\n- 0 = auto\n- 1 = float32\n- 2 = float16\n- 3 = int8\n- 4 = bfloat16\n\n# CELU\n```\nif x < 0    y = (exp(x / alpha) - 1.f) * alpha\nelse        y = x\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | alpha         | float | 1.f       |                   |\n\n# Clip\n```\ny = clamp(x, min, max)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | 
------------- | ----- | --------- | ----------------- |\n| 0         | min           | float | -FLT_MAX  |                   |\n| 1         | max           | float | FLT_MAX   |                   |\n\n# Concat\n```\ny = concat(x0, x1, x2, ...) by axis\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | axis          | int   | 0         |                   |\n\n# Convolution\n```\nx2 = pad(x, pads, pad_value)\nx3 = conv(x2, weight, kernel, stride, dilation) + bias\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 8         | int8_scale_term| int  | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 18        | pad_value     | float | 0.f       |                   |\n| 
19        | dynamic_weight| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [kernel_w, kernel_h, num_input, num_output] |\n| bias_data     | float | [num_output]          |\n| weight_data_int8_scales| float | [num_output] |\n| bottom_blob_int8_scales| float | [1]          |\n| top_blob_int8_scales| float | [1]             |\n\n# Convolution1D\n```\nx2 = pad(x, pads, pad_value)\nx3 = conv1d(x2, weight, kernel, stride, dilation) + bias\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 18        | pad_value     | float | 0.f       |                   |\n| 19        | dynamic_weight| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [kernel_w, num_input, num_output] |\n| bias_data     | float | [num_output]          |\n\n# Convolution3D\n```\nx2 = pad(x, pads, pad_value)\nx3 = conv3d(x2, weight, kernel, stride, dilation) + bias\ny = activation(x3, 
act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 17        | pad_behind    | int   | pad_front |                   |\n| 18        | pad_value     | float | 0.f       |                   |\n| 21        | kernel_d      | int   | kernel_w  |                   |\n| 22        | dilation_d    | int   | dilation_w |                  |\n| 23        | stride_d      | int   | stride_w  |                   |\n| 24        | pad_front     | int   | pad_left  |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [kernel_w, kernel_h, kernel_d, num_input, num_output] |\n| bias_data     | float | [num_output]          |\n\n# ConvolutionDepthWise\n```\nx2 = pad(x, pads, pad_value)\nx3 = 
conv(x2, weight, kernel, stride, dilation, group) + bias\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 7         | group         | int   | 1         |                   |\n| 8         | int8_scale_term| int  | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 18        | pad_value     | float | 0.f       |                   |\n| 19        | dynamic_weight| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [kernel_w, kernel_h, num_input / group, num_output / group, group] |\n| bias_data     | float | [num_output]          |\n| weight_data_int8_scales| float | [group]      |\n| bottom_blob_int8_scales| float | [1]          |\n| 
top_blob_int8_scales| float | [1]             |\n\n# ConvolutionDepthWise1D\n```\nx2 = pad(x, pads, pad_value)\nx3 = conv1d(x2, weight, kernel, stride, dilation, group) + bias\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 7         | group         | int   | 1         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 18        | pad_value     | float | 0.f       |                   |\n| 19        | dynamic_weight| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [kernel_w, num_input / group, num_output / group, group] |\n| bias_data     | float | [num_output]          |\n\n# ConvolutionDepthWise3D\n```\nx2 = pad(x, pads, pad_value)\nx3 = conv3d(x2, weight, kernel, stride, dilation, group) + bias\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | 
int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 7         | group         | int   | 1         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 17        | pad_behind    | int   | pad_front |                   |\n| 18        | pad_value     | float | 0.f       |                   |\n| 21        | kernel_d      | int   | kernel_w  |                   |\n| 22        | dilation_d    | int   | dilation_w |                  |\n| 23        | stride_d      | int   | stride_w  |                   |\n| 24        | pad_front     | int   | pad_left  |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [kernel_w, kernel_h, kernel_d, num_input / group, num_output / group, group] |\n| bias_data     | float | [num_output]          |\n\n# CopyTo\n```\nself[offset] = src\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | woffset       | int   | 0         |              
     |\n| 1         | hoffset       | int   | 0         |                   |\n| 13        | doffset       | int   | 0         |                   |\n| 2         | coffset       | int   | 0         |                   |\n| 9         | starts        | array | [ ]       |                   |\n| 11        | axes          | array | [ ]       |                   |\n\n# Crop\n```\ny = crop(x)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | woffset       | int   | 0         |                   |\n| 1         | hoffset       | int   | 0         |                   |\n| 13        | doffset       | int   | 0         |                   |\n| 2         | coffset       | int   | 0         |                   |\n| 3         | outw          | int   | 0         |                   |\n| 4         | outh          | int   | 0         |                   |\n| 14        | outd          | int   | 0         |                   |\n| 5         | outc          | int   | 0         |                   |\n| 6         | woffset2      | int   | 0         |                   |\n| 7         | hoffset2      | int   | 0         |                   |\n| 15        | doffset2      | int   | 0         |                   |\n| 8         | coffset2      | int   | 0         |                   |\n| 9         | starts        | array | [ ]       |                   |\n| 10        | ends          | array | [ ]       |                   |\n| 11        | axes          | array | [ ]       |                   |\n| 19        | starts_expr   | str   | \"\"        |                   |\n| 20        | ends_expr     | str   | \"\"        |                   |\n| 21        | axes_expr     | str   | \"\"        |                   |\n\n# CumulativeSum\n\nIf axis < 0, we use axis = x.dims + axis\n\nIt implements https://pytorch.org/docs/stable/generated/torch.cumsum.html\n\n* 
one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | axis          | int   | 0         |                   |\n\n\n# Deconvolution\n```\nx2 = deconv(x, weight, kernel, stride, dilation) + bias\nx3 = depad(x2, pads, pad_value)\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 18        | output_pad_right| int | 0         |                   |\n| 19        | output_pad_bottom| int | output_pad_right |           |\n| 20        | output_w      | int   | 0         |                   |\n| 21        | output_h      | int   | output_w  |                   |\n| 28        | dynamic_weight| int   | 0         |                   |\n\n| 
weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16 | [kernel_w, kernel_h, num_input, num_output] |\n| bias_data     | float | [num_output]          |\n\n# Deconvolution1D\n```\nx2 = deconv1d(x, weight, kernel, stride, dilation) + bias\nx3 = depad(x2, pads, pad_value)\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 18        | output_pad_right| int | 0         |                   |\n| 20        | output_w      | int   | 0         |                   |\n| 28        | dynamic_weight| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16 | [kernel_w, num_input, num_output] |\n| bias_data     | float | [num_output]          |\n\n# Deconvolution3D\n```\nx2 = deconv3d(x, weight, kernel, stride, dilation) + bias\nx3 = depad(x2, pads, pad_value)\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | 
--------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 17        | pad_behind    | int   | pad_front |                   |\n| 18        | output_pad_right| int | 0         |                   |\n| 19        | output_pad_bottom| int | output_pad_right |           |\n| 20        | output_pad_behind| int | output_pad_right |           |\n| 21        | kernel_d      | int   | kernel_w  |                   |\n| 22        | dilation_d    | int   | dilation_w |                  |\n| 23        | stride_d      | int   | stride_w  |                   |\n| 24        | pad_front     | int   | pad_left  |                   |\n| 25        | output_w      | int   | 0         |                   |\n| 26        | output_h      | int   | output_w  |                   |\n| 27        | output_d      | int   | output_w  |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   
| float/fp16 | [kernel_w, kernel_h, kernel_d, num_input, num_output] |\n| bias_data     | float | [num_output]          |\n\n# DeconvolutionDepthWise\n```\nx2 = deconv(x, weight, kernel, stride, dilation, group) + bias\nx3 = depad(x2, pads, pad_value)\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 7         | group         | int   | 1         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 18        | output_pad_right| int | 0         |                   |\n| 19        | output_pad_bottom| int | output_pad_right |           |\n| 20        | output_w      | int   | 0         |                   |\n| 21        | output_h      | int   | output_w  |                   |\n| 28        | dynamic_weight| int   | 0         |                   |\n\n| weight        | type  | shape             
    |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16 | [kernel_w, kernel_h, num_input / group, num_output / group, group] |\n| bias_data     | float | [num_output]          |\n\n# DeconvolutionDepthWise1D\n```\nx2 = deconv1d(x, weight, kernel, stride, dilation, group) + bias\nx3 = depad(x2, pads, pad_value)\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 7         | group         | int   | 1         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 18        | output_pad_right| int | 0         |                   |\n| 20        | output_w      | int   | 0         |                   |\n| 28        | dynamic_weight| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16 | [kernel_w, num_input / group, num_output / group, group] |\n| bias_data     | float | [num_output]          |\n\n# DeconvolutionDepthWise3D\n```\nx2 = deconv3d(x, weight, kernel, stride, dilation, group) + bias\nx3 = depad(x2, pads, pad_value)\ny = activation(x3, act_type, act_params)\n```\n\n* one_blob_only\n\n| 
param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 7         | group         | int   | 1         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 17        | pad_behind    | int   | pad_front |                   |\n| 18        | output_pad_right| int | 0         |                   |\n| 19        | output_pad_bottom| int | output_pad_right |           |\n| 20        | output_pad_behind| int | output_pad_right |           |\n| 21        | kernel_d      | int   | kernel_w  |                   |\n| 22        | dilation_d    | int   | dilation_w |                  |\n| 23        | stride_d      | int   | stride_w  |                   |\n| 24        | pad_front     | int   | pad_left  |                   |\n| 25        | output_w      | int   | 0         |                   |\n| 26        | output_h      | int   | output_w  |                   |\n| 27        | 
output_d      | int   | output_w  |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16 | [kernel_w, kernel_h, kernel_d, num_input / group, num_output / group, group] |\n| bias_data     | float | [num_output]          |\n\n# DeformableConv2D\n```\nx2 = deformableconv2d(x, offset, mask, weight, kernel, stride, dilation) + bias\ny = activation(x2, act_type, act_params)\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 5         | bias_term     | int   | 0         |                   |\n| 6         | weight_data_size| int | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [kernel_w, kernel_h, num_input, num_output] |\n| bias_data     | float | [num_output]          |\n\n# Dequantize\n```\ny = x * scale + bias\n```\n\n* one_blob_only\n* 
support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | scale_data_size| int  | 1         |                   |\n| 1         | bias_data_size| int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| scale_data    | float | [scale_data_size]     |\n| bias_data     | float | [bias_data_size]      |\n\n# Diag\n```\ny = diag(x, diagonal)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | diagonal      | int   | 0         |                   |\n\n# Dropout\n```\ny = x * scale\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | scale         | float | 1.f       |                   |\n\n# Eltwise\n```\ny = elementwise_op(x0, x1, ...)\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | op_type       | int   | 0         |                   |\n| 1         | coeffs        | array | [ ]       |                   |\n\nOperation type:\n- 0 = PROD\n- 1 = SUM\n- 2 = MAX\n\n# ELU\n```\nif x < 0    y = (exp(x) - 1) * alpha\nelse        y = x\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | alpha         | float | 0.1f      |                   |\n\n# Embed\n```\ny = embedding(x)\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int  
 | 0         |                   |\n| 1         | input_dim     | int   | 0         |                   |\n| 2         | bias_term     | int   | 0         |                   |\n| 3         | weight_data_size | int | 0        |                   |\n| 18        | int8_scale_term| int  | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float | [weight_data_size]    |\n| bias_data     | float | [num_output]          |\n| weight_data_int8_scales| float | [1]          |\n\n# Exp\n```\nif base == -1   y = exp(shift + x * scale)\nelse            y = pow(base, (shift + x * scale))\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | base          | float | -1.f      |                   |\n| 1         | scale         | float | 1.f       |                   |\n| 2         | shift         | float | 0.f       |                   |\n\n# ExpandDims\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 3         | axes          | array | [ ]       |                   |\n\n# Flatten\nReshape blob to 1 dimension\n\n* one_blob_only\n\n# Flip\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | axes          | array | [ ]       |                   |\n\n# Fold\n```\ny = fold(x)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | 
dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n| 20        | output_w      | int   | 0         |                   |\n| 21        | output_h      | int   | output_w  |                   |\n\n# GELU\n```\nif fast_gelu == 1   y = 0.5 * x * (1 + tanh(0.79788452 * (x + 0.044715 * x * x * x)))\nelse                y = 0.5 * x * erfc(-0.70710678 * x)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | fast_gelu     | int   | 0         | use approximation |\n\n# GLU\n\nIf axis < 0, we use axis = x.dims + axis\n\nGLU(a,b)=a⊗σ(b)\n\nwhere a is the first half of the input matrix and b is the second half.\n\naxis specifies the dimension to split the input\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | axis          | int   | 0         |                   |\n\n# Gemm\n```\na = transA ? transpose(x0) : x0\nb = transB ? 
transpose(x1) : x1\nc = x2\ny = (gemm(a, b) + c * beta) * alpha\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | alpha         | float | 1.f       |                   |\n| 1         | beta          | float | 1.f       |                   |\n| 2         | transA        | int   | 0         |                   |\n| 3         | transB        | int   | 0         |                   |\n| 4         | constantA     | int   | 0         |                   |\n| 5         | constantB     | int   | 0         |                   |\n| 6         | constantC     | int   | 0         |                   |\n| 7         | constantM     | int   | 0         |                   |\n| 8         | constantN     | int   | 0         |                   |\n| 9         | constantK     | int   | 0         |                   |\n| 10        | constant_broadcast_type_C | int | 0 |                 |\n| 11        | output_N1M    | int   | 0         |                   |\n| 12        | output_elempack | int | 0         |                   |\n| 13        | output_elemtype | int | 0         |                   |\n| 14        | output_transpose | int| 0         |                   |\n| 18        | int8_scale_term | int | 0         |                   |\n| 20        | constant_TILE_M | int | 0         |                   |\n| 21        | constant_TILE_N | int | 0         |                   |\n| 22        | constant_TILE_K | int | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| A_data        | float/fp16/int8 | [M, K] or [K, M] |\n| B_data        | float/fp16/int8 | [N, K] or [K, N] |\n| C_data        | float | [1], [M] or [N] or [1, M] or [N, 1] or [N, M] |\n| A_data_int8_scales| float | [M]               |\n| B_data_int8_scales| float | [1]               |\n\n# GridSample\n```\nGiven an input and a 
flow-field grid, computes the output using input values and pixel locations from grid.\n\nFor each output location output[:, h2, w2], the size-2 vector grid[h2, w2, 2] specifies input pixel[:, h1, w1] locations x and y,\nwhich are used to interpolate the output value output[:, h2, w2].\n\nThis function is often used in conjunction with affine_grid() to build Spatial Transformer Networks.\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | sample_type   | int   | 1         |                   |\n| 1         | padding_mode  | int   | 1         |                   |\n| 2         | align_corner  | int   | 0         |                   |\n| 3         | permute_fusion| int   | 0         | fuse with permute |\n\n\nSample type:\n- 1 = Nearest\n- 2 = Bilinear\n- 3 = Bicubic\n\nPadding mode:\n- 1 = zeros\n- 2 = border\n- 3 = reflection\n\n\n# GroupNorm\n```\nsplit x along channel axis into group x0, x1 ...\nl2 normalize for each group x0, x1 ...\ny = x * gamma + beta\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | group         | int   | 1         |                   |\n| 1         | channels      | int   | 0         |                   |\n| 2         | eps           | float | 0.001f    | x = x / sqrt(var + eps) |\n| 3         | affine        | int   | 1         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| gamma_data    | float | [channels]            |\n| beta_data     | float | [channels]            |\n\n# GRU\nApply a single-layer GRU to a feature sequence of `T` timesteps. 
The input blob shape is `[w=input_size, h=T]` and the output blob shape is `[w=num_output, h=T]`.\n\n```\ny = gru(x)\ny0, hidden y1 = gru(x0, hidden x1)\n```\n\n* one_blob_only if bidirectional\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         | hidden size of output |\n| 1         | weight_data_size| int | 0         | total size of weight matrix |\n| 2         | direction     | int   | 0         | 0=forward, 1=reverse, 2=bidirectional |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_xc_data| float/fp16/int8 | [input_size, num_output * 3, num_directions] |\n| bias_c_data   | float/fp16/int8 | [num_output, 4, num_directions] |\n| weight_hc_data| float/fp16/int8 | [num_output, num_output * 3, num_directions] |\n\nDirection flag:\n- 0 = forward only\n- 1 = reverse only\n- 2 = bidirectional\n\n# HardSigmoid\n```\ny = clamp(x * alpha + beta, 0, 1)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | alpha         | float | 0.2f      |                   |\n| 1         | beta          | float | 0.5f      |                   |\n\n# HardSwish\n```\ny = x * clamp(x * alpha + beta, 0, 1)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | alpha         | float | 0.2f      |                   |\n| 1         | beta          | float | 0.5f      |                   |\n\n# InnerProduct\n```\nx2 = innerproduct(x, weight) + bias\ny = activation(x2, act_type, act_params)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | 
------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | bias_term     | int   | 0         |                   |\n| 2         | weight_data_size| int | 0         |                   |\n| 8         | int8_scale_term| int  | 0         |                   |\n| 9         | activation_type| int  | 0         |                   |\n| 10        | activation_params| array | [ ]    |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_data   | float/fp16/int8 | [num_input, num_output] |\n| bias_data     | float | [num_output]          |\n| weight_data_int8_scales| float | [num_output] |\n| bottom_blob_int8_scales| float | [1]          |\n\n# Input\n```\ny = input\n```\n\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | w             | int   | 0         |                   |\n| 1         | h             | int   | 0         |                   |\n| 11        | d             | int   | 0         |                   |\n| 2         | c             | int   | 0         |                   |\n\n# InstanceNorm\n```\nsplit x along channel axis into instance x0, x1 ...\nl2 normalize for each channel instance x0, x1 ...\ny = x * gamma + beta\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | channels      | int   | 0         |                   |\n| 1         | eps           | float | 0.001f    | x = x / sqrt(var + eps) |\n| 2         | affine        | int   | 1         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| gamma_data    | float | [channels]            |\n| 
beta_data     | float | [channels]            |\n\n# Interp\n```\nif dynamic_target_size == 0     y = resize(x) by fixed size or scale\nelse                            y = resize(x0, size(x1))\n```\n\n* one_blob_only if dynamic_target_size == 0\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | resize_type   | int   | 0         |                   |\n| 1         | height_scale  | float | 1.f       |                   |\n| 2         | width_scale   | float | 1.f       |                   |\n| 3         | output_height | int   | 0         |                   |\n| 4         | output_width  | int   | 0         |                   |\n| 5         | dynamic_target_size| int | 0      |                   |\n| 6         | align_corner  | int   | 0         |                   |\n| 9         | size_expr     | str   | \"\"        |                   |\n\nResize type:\n- 1 = Nearest\n- 2 = Bilinear\n- 3 = Bicubic\n\n# InverseSpectrogram\n```\nx1 = x as complex\nx1 = x1 * sqrt(norm) if normalized\ny = istft(x1)\ny1 = unpad(y) if center\n\nif returns == 0 return y1 as complex\nif returns == 1 return y1 real\nif returns == 2 return y1 imag\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | n_fft         | int   | 0         |                   |\n| 1         | returns       | int   | 1         |                   |\n| 2         | hoplen        | int   | n_fft / 4 |                   |\n| 3         | winlen        | int   | n_fft     |                   |\n| 4         | window_type   | int   | 0         | 0=ones 1=hann 2=hamming |\n| 5         | center        | int   | 1         |                   |\n| 7         | normalized    | int   | 0         | 0=no 1=n_fft 2=window-l2-energy |\n\n# LayerNorm\n```\nsplit x along outermost axis into part x0, x1 
...\nl2 normalize for each part x0, x1 ...\ny = x * gamma + beta by elementwise\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | affine_size   | int   | 0         |                   |\n| 1         | eps           | float | 0.001f    | x = x / sqrt(var + eps) |\n| 2         | affine        | int   | 1         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| gamma_data    | float | [affine_size]         |\n| beta_data     | float | [affine_size]         |\n\n# Log\n```\nif base == -1   y = log(shift + x * scale)\nelse            y = log(shift + x * scale) / log(base)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | base          | float | -1.f      |                   |\n| 1         | scale         | float | 1.f       |                   |\n| 2         | shift         | float | 0.f       |                   |\n\n# LRN\n```\nif region_type == ACROSS_CHANNELS   square_sum = sum of channel window of local_size\nif region_type == WITHIN_CHANNEL    square_sum = sum of spatial window of local_size\ny = x * pow(bias + alpha * square_sum / (local_size * local_size), -beta)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | region_type   | int   | 0         |                   |\n| 1         | local_size    | int   | 5         |                   |\n| 2         | alpha         | float | 1.f       |                   |\n| 3         | beta          | float | 0.75f     |                   |\n| 4         | bias          | float | 1.f       |            
       |\n\nRegion type:\n- 0 = ACROSS_CHANNELS\n- 1 = WITHIN_CHANNEL\n\n# LSTM\nApply a single-layer LSTM to a feature sequence of `T` timesteps. The input blob shape is `[w=input_size, h=T]` and the output blob shape is `[w=num_output, h=T]`.\n\n```\ny = lstm(x)\ny0, hidden y1, cell y2 = lstm(x0, hidden x1, cell x2)\n```\n\n* one_blob_only if bidirectional\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         | output size of output |\n| 1         | weight_data_size| int | 0         | total size of IFOG weight matrix |\n| 2         | direction     | int   | 0         | 0=forward, 1=reverse, 2=bidirectional |\n| 3         | hidden_size   | int   | num_output| hidden size       |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_xc_data| float/fp16/int8 | [input_size, hidden_size * 4, num_directions] |\n| bias_c_data   | float/fp16/int8 | [hidden_size, 4, num_directions] |\n| weight_hc_data| float/fp16/int8 | [num_output, hidden_size * 4, num_directions] |\n| weight_hr_data| float/fp16/int8 | [hidden_size, num_output, num_directions] |\n\nDirection flag:\n- 0 = forward only\n- 1 = reverse only\n- 2 = bidirectional\n\n# MemoryData\n```\ny = data\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | w             | int   | 0         |                   |\n| 1         | h             | int   | 0         |                   |\n| 11        | d             | int   | 0         |                   |\n| 2         | c             | int   | 0         |                   |\n| 21        | load_type     | int   | 1         | 1=fp32            |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| data          | float 
| [w, h, d, c]          |\n\n# Mish\n```\ny = x * tanh(log(exp(x) + 1))\n```\n\n* one_blob_only\n* support_inplace\n\n# MultiHeadAttention\n```\nq_affine = affine(q) / (embed_dim / num_head)\nk_affine = affine(k) or reuse kv_cache part\nv_affine = affine(v) or reuse kv_cache part\nsplit q k v into num_head part q0, k0, v0, q1, k1, v1 ...\nfor each num_head part\n    qk = q * k\n    qk = qk + attn_mask if attn_mask exists\n    softmax(qk)\n    qkv = qk * v\n    merge qkv to out\ny = affine(out)\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | embed_dim     | int   | 0         |                   |\n| 1         | num_heads     | int   | 1         |                   |\n| 2         | weight_data_size| int | 0         | qdim = weight_data_size / embed_dim |\n| 3         | kdim          | int   | embed_dim |                   |\n| 4         | vdim          | int   | embed_dim |                   |\n| 5         | attn_mask     | int   | 0         |                   |\n| 6         | scale         | float | 1.f / sqrt(embed_dim / num_heads) | |\n| 7         | kv_cache      | int   | 0         |                   |\n| 18        | int8_scale_term | int | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| q_weight_data | float/fp16/int8 | [embed_dim * qdim] |\n| q_bias_data   | float | [embed_dim]           |\n| k_weight_data | float/fp16/int8 | [embed_dim * kdim] |\n| k_bias_data   | float | [embed_dim]           |\n| v_weight_data | float/fp16/int8 | [embed_dim * vdim] |\n| v_bias_data   | float | [embed_dim]           |\n| out_weight_data| float/fp16/int8 | [qdim * embed_dim] |\n| out_bias_data | float | [qdim]                |\n| q_weight_data_int8_scales| float | [embed_dim] |\n| k_weight_data_int8_scales| float | [embed_dim] |\n| v_weight_data_int8_scales| float | 
[embed_dim] |\n| out_weight_data_int8_scales| float | [1]      |\n\n# MVN\n```\nif normalize_variance == 1 && across_channels == 1      y = (x - mean) / (sqrt(var) + eps) of whole blob\nif normalize_variance == 1 && across_channels == 0      y = (x - mean) / (sqrt(var) + eps) of each channel\nif normalize_variance == 0 && across_channels == 1      y = x - mean of whole blob\nif normalize_variance == 0 && across_channels == 0      y = x - mean of each channel\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | normalize_variance| int | 0       |                   |\n| 1         | across_channels| int  | 0         |                   |\n| 2         | eps           | float | 0.0001f   | x = x / (sqrt(var) + eps) |\n\n# Noop\n```\ny = x\n```\n\n# Normalize\n```\nif across_spatial == 1 && across_channel == 1      x2 = normalize(x) of whole blob\nif across_spatial == 1 && across_channel == 0      x2 = normalize(x) of each channel\nif across_spatial == 0 && across_channel == 1      x2 = normalize(x) of each position\ny = x2 * scale\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | across_spatial| int   | 0         |                   |\n| 1         | channel_shared| int   | 0         |                   |\n| 2         | eps           | float | 0.0001f   | see eps mode      |\n| 3         | scale_data_size| int  | 0         |                   |\n| 4         | across_channel| int   | 0         |                   |\n| 9         | eps_mode      | int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| scale_data    | float | [scale_data_size]     |\n\nEps Mode:\n- 0 = caffe/mxnet   x = x / sqrt(var + 
eps)\n- 1 = pytorch       x = x / max(sqrt(var), eps)\n- 2 = tensorflow    x = x / sqrt(max(var, eps))\n\n# Packing\n```\ny = wrap_packing(x)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | out_elempack  | int   | 1         |                   |\n| 1         | use_padding   | int   | 0         |                   |\n| 2         | cast_type_from| int   | 0         |                   |\n| 3         | cast_type_to  | int   | 0         |                   |\n| 4         | storage_type_from| int | 0        |                   |\n| 5         | storage_type_to| int  | 0         |                   |\n\n# Padding\n```\ny = pad(x, pads)\n```\n\n| param id  | name          | type | default   | description       |\n| --------- | ------------- | ---- | --------- | ----------------- |\n| 0         | top           | int  | 0         |                   |\n| 1         | bottom        | int  | 0         |                   |\n| 2         | left          | int  | 0         |                   |\n| 3         | right         | int  | 0         |                   |\n| 4         | type          | int  | 0         |                   |\n| 5         | value         | float | 0         |                   |\n| 6         | per_channel_pad_data_size| int | 0 |                 |\n| 7         | front         | int  | 0         |                   |\n| 8         | behind        | int  | front     |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| per_channel_pad_data| float | [per_channel_pad_data_size] |\n\nPadding type:\n- 0 = CONSTANT\n- 1 = REPLICATE\n- 2 = REFLECT\n\n# Permute\n```\ny = reorder(x)\n```\n\n| param id  | name          | type | default   | description       |\n| --------- | ------------- | ---- | --------- | ----------------- |\n| 0         | order_type    | int  
| 0         |                   |\n\nOrder Type:\n- 0 = WH WHC WHDC\n- 1 = HW HWC HWDC\n- 2 = WCH WDHC\n- 3 = CWH DWHC\n- 4 = HCW HDWC\n- 5 = CHW DHWC\n- 6 = WHCD\n- 7 = HWCD\n- 8 = WCHD\n- 9 = CWHD\n- 10 = HCWD\n- 11 = CHWD\n- 12 = WDCH\n- 13 = DWCH\n- 14 = WCDH\n- 15 = CWDH\n- 16 = DCWH\n- 17 = CDWH\n- 18 = HDCW\n- 19 = DHCW\n- 20 = HCDW\n- 21 = CHDW\n- 22 = DCHW\n- 23 = CDHW\n\n# PixelShuffle\n```\nif mode == 0    y = depth_to_space(x) where x channel order is sw-sh-outc\nif mode == 1    y = depth_to_space(x) where x channel order is outc-sw-sh\n```\n\n* one_blob_only\n\n| param id  | name          | type | default   | description       |\n| --------- | ------------- | ---- | --------- | ----------------- |\n| 0         | upscale_factor| int  | 1         |                   |\n| 1         | mode          | int  | 0         |                   |\n\n# Pooling\n```\nx2 = pad(x, pads)\nx3 = pooling(x2, kernel, stride)\n```\n\n| param id  | name          | type | default   | description       |\n| --------- | --------------| ---- | --------- | ----------------- |\n| 0         | pooling_type  | int  | 0         |                   |\n| 1         | kernel_w      | int  | 0         |                   |\n| 2         | stride_w      | int  | 1         |                   |\n| 3         | pad_left      | int  | 0         |                   |\n| 4         | global_pooling| int  | 0         |                   |\n| 5         | pad_mode      | int  | 0         |                   |\n| 6         | avgpool_count_include_pad| int | 0 |                 |\n| 7         | adaptive_pooling| int | 0        |                   |\n| 8         | out_w         | int  | 0         |                   |\n| 11        | kernel_h      | int  | kernel_w  |                   |\n| 12        | stride_h      | int  | stride_w  |                   |\n| 13        | pad_top       | int  | pad_left  |                   |\n| 14        | pad_right     | int  | pad_left  |                   |\n| 15       
 | pad_bottom    | int  | pad_top   |                   |\n| 18        | out_h         | int  | out_w     |                   |\n\nPooling type:\n- 0 = MAX\n- 1 = AVG\n\nPad mode:\n- 0 = full padding\n- 1 = valid padding\n- 2 = tensorflow padding=SAME or onnx padding=SAME_UPPER\n- 3 = onnx padding=SAME_LOWER\n\n# Pooling1D\n```\nx2 = pad(x, pads)\nx3 = pooling1d(x2, kernel, stride)\n```\n\n| param id  | name          | type | default   | description       |\n| --------- | --------------| ---- | --------- | ----------------- |\n| 0         | pooling_type  | int  | 0         |                   |\n| 1         | kernel_w      | int  | 0         |                   |\n| 2         | stride_w      | int  | 1         |                   |\n| 3         | pad_left      | int  | 0         |                   |\n| 4         | global_pooling| int  | 0         |                   |\n| 5         | pad_mode      | int  | 0         |                   |\n| 6         | avgpool_count_include_pad| int | 0 |                 |\n| 7         | adaptive_pooling| int | 0        |                   |\n| 8         | out_w         | int  | 0         |                   |\n| 14        | pad_right     | int  | pad_left  |                   |\n\nPooling type:\n- 0 = MAX\n- 1 = AVG\n\nPad mode:\n- 0 = full padding\n- 1 = valid padding\n- 2 = tensorflow padding=SAME or onnx padding=SAME_UPPER\n- 3 = onnx padding=SAME_LOWER\n\n# Pooling3D\n```\nx2 = pad(x, pads)\nx3 = pooling3d(x2, kernel, stride)\n```\n\n| param id  | name          | type | default   | description       |\n| --------- | --------------| ---- | --------- | ----------------- |\n| 0         | pooling_type  | int  | 0         |                   |\n| 1         | kernel_w      | int  | 0         |                   |\n| 2         | stride_w      | int  | 1         |                   |\n| 3         | pad_left      | int  | 0         |                   |\n| 4         | global_pooling| int  | 0         |                   |\n| 5         
| pad_mode      | int  | 0         |                   |\n| 6         | avgpool_count_include_pad| int | 0 |                 |\n| 7         | adaptive_pooling| int | 0        |                   |\n| 8         | out_w         | int  | 0         |                   |\n| 11        | kernel_h      | int  | kernel_w  |                   |\n| 12        | stride_h      | int  | stride_w  |                   |\n| 13        | pad_top       | int  | pad_left  |                   |\n| 14        | pad_right     | int  | pad_left  |                   |\n| 15        | pad_bottom    | int  | pad_top   |                   |\n| 16        | pad_behind    | int  | pad_front |                   |\n| 18        | out_h         | int  | out_w     |                   |\n| 21        | kernel_d      | int  | kernel_w  |                   |\n| 22        | stride_d      | int  | stride_w  |                   |\n| 23        | pad_front     | int  | pad_left  |                   |\n| 28        | out_d         | int  | out_w     |                   |\n\nPooling type:\n- 0 = MAX\n- 1 = AVG\n\nPad mode:\n- 0 = full padding\n- 1 = valid padding\n- 2 = tensorflow padding=SAME or onnx padding=SAME_UPPER\n- 3 = onnx padding=SAME_LOWER\n\n# Power\n```\ny = pow((shift + x * scale), power)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | power         | float | 1.f       |                   |\n| 1         | scale         | float | 1.f       |                   |\n| 2         | shift         | float | 0.f       |                   |\n\n# PReLU\n```\nif x < 0    y = x * slope\nelse        y = x\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_slope     | int   | 0         |                   |\n\n| 
weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| slope_data    | float | [num_slope]           |\n\n# Quantize\n```\ny = float2int8(x * scale)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | scale_data_size| int  | 1         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| scale_data    | float | [scale_data_size]     |\n\n# Reduction\n```\ny = reduce_op(x * coeff)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | operation     | int   | 0         |                   |\n| 1         | reduce_all    | int   | 1         |                   |\n| 2         | coeff         | float | 1.f       |                   |\n| 3         | axes          | array | [ ]       |                   |\n| 4         | keepdims      | int   | 0         |                   |\n| 5         | fixbug0       | int   | 0         | hack for bug fix, should be 1 |\n\nOperation type:\n- 0 = SUM\n- 1 = ASUM\n- 2 = SUMSQ\n- 3 = MEAN\n- 4 = MAX\n- 5 = MIN\n- 6 = PROD\n- 7 = L1\n- 8 = L2\n- 9 = LogSum\n- 10 = LogSumExp\n\n# ReLU\n```\nif x < 0    y = x * slope\nelse        y = x\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | slope         | float | 0.f       |                   |\n\n# Reorg\n```\nif mode == 0    y = space_to_depth(x) where x channel order is sw-sh-outc\nif mode == 1    y = space_to_depth(x) where x channel order is outc-sw-sh\n```\n\n* one_blob_only\n\n| param id  | name          | type | default   | description       |\n| --------- | 
------------- | ---- | --------- | ----------------- |\n| 0         | stride        | int  | 1         |                   |\n| 1         | mode          | int  | 0         |                   |\n\n# Requantize\n```\nx2 = x * scale_in + bias\nx3 = activation(x2)\ny = float2int8(x3 * scale_out)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | scale_in_data_size| int | 1       |                   |\n| 1         | scale_out_data_size| int | 1      |                   |\n| 2         | bias_data_size| int   | 0         |                   |\n| 3         | activation_type| int  | 0         |                   |\n| 4         | activation_params| int | [ ]      |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| scale_in_data | float | [scale_in_data_size]  |\n| scale_out_data| float | [scale_out_data_size] |\n| bias_data     | float | [bias_data_size]      |\n\n# Reshape\n```\ny = reshape(x)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | w             | int   | -233      |                   |\n| 1         | h             | int   | -233      |                   |\n| 11        | d             | int   | -233      |                   |\n| 2         | c             | int   | -233      |                   |\n| 6         | shape_expr    | str   | \"\"        |                   |\n\nReshape flag:\n- 0 = copy from bottom\n- -1 = remaining\n- -233 = drop this dim(default)\n\n# RMSNorm\n```\nsplit x along outmost axis into part x0, x1 ...\nroot mean square normalize for each part x0, x1 ...\ny = x * gamma by elementwise\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| 
--------- | ------------- | ----- | --------- | ----------------- |\n| 0         | affine_size   | int   | 0         |                   |\n| 1         | eps           | float | 0.001f    | x = x / sqrt(var + eps) |\n| 2         | affine        | int   | 1         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| gamma_data    | float | [affine_size]         |\n\n# RNN\nApply a single-layer RNN to a feature sequence of `T` timesteps. The input blob shape is `[w=input_size, h=T]` and the output blob shape is `[w=num_output, h=T]`.\n\n```\ny = rnn(x)\ny0, hidden y1 = rnn(x0, hidden x1)\n```\n\n* one_blob_only if bidirectional\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         | hidden size of output |\n| 1         | weight_data_size| int | 0         | total size of weight matrix |\n| 2         | direction     | int   | 0         | 0=forward, 1=reverse, 2=bidirectional |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| weight_xc_data| float/fp16/int8 | [input_size, num_output, num_directions] |\n| bias_c_data   | float/fp16/int8 | [num_output, 1, num_directions] |\n| weight_hc_data| float/fp16/int8 | [num_output, num_output, num_directions] |\n\nDirection flag:\n- 0 = forward only\n- 1 = reverse only\n- 2 = bidirectional\n\n# RotaryEmbed\nApply rotary positional embeddings with cos and sin cache\n\n```\ny1 = x1 * cos - x2 * sin\ny2 = x1 * sin + x2 * cos\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | interleaved   | int   | 0         |                   |\n\n# Scale\n```\nif scale_data_size == -233  y = x0 * x1\nelse                        y = x * scale + bias\n```\n\n* 
one_blob_only if scale_data_size != -233\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | scale_data_size| int  | 0         |                   |\n| 1         | bias_term     | int   | 0         |                   |\n\n| weight        | type  | shape                 |\n| ------------- | ----- | --------------------- |\n| scale_data    | float | [scale_data_size]     |\n| bias_data     | float | [scale_data_size]     |\n\n# SDPA\n```\nscaled dot product attention\nfor each num_head part\n    qk = q * k\n    qk = qk + attn_mask if attn_mask exists\n    softmax(qk)\n    qkv = qk * v\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 5         | attn_mask     | int   | 0         |                   |\n| 6         | scale         | float | 0.f       | auto = 1.f / sqrt(embed_dim) |\n| 7         | kv_cache      | int   | 0         |                   |\n| 18        | int8_scale_term | int | 0         |                   |\n\n# SELU\n```\nif x < 0    y = (exp(x) - 1.f) * alpha * lambda\nelse        y = x * lambda\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | alpha         | float | 1.67326324f|                  |\n| 1         | lambda        | float | 1.050700987f|                 |\n\n# Shrink\n```\nif x < -lambd y = x + bias\nif x >  lambd y = x - bias\nelse          y = x\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | bias          | float | 0.0f      |                   |\n| 1         | lambd         | float | 0.5f      |                
   |\n\n# ShuffleChannel\n```\nif reverse == 0     y = shufflechannel(x) by group\nif reverse == 1     y = shufflechannel(x) by channel / group\n```\n\n* one_blob_only\n\n| param id  | name          | type | default   | description       |\n| --------- | ------------- | ---- | --------- | ----------------- |\n| 0         | group         | int  | 1         |                   |\n| 1         | reverse       | int  | 0         |                   |\n\n# Sigmoid\n```\ny = 1 / (1 + exp(-x))\n```\n\n* one_blob_only\n* support_inplace\n\n# Slice\n```\nsplit x along axis into slices, each part slice size is based on slices array\n```\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | slices        | array | [ ]       |                   |\n| 1         | axis          | int   | 0         |                   |\n| 2         | indices       | array | [ ]       |                   |\n\n# Softmax\n```\nsoftmax(x, axis)\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | axis          | int   | 0         |                   |\n| 1         | fixbug0       | int   | 0         | hack for bug fix, should be 1 |\n\n# Softplus\n```\ny = log(exp(x) + 1)\n```\n\n* one_blob_only\n* support_inplace\n\n# Spectrogram\n```\nx1 = pad(x) if center\ny = stft(x1)\ny = y / sqrt(norm) if normalized\n\nif power == 0 return y as real\nif power == 1 return magnitude\nif power == 2 return square of magnitude\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | n_fft         | int   | 0         |                   |\n| 1         | power         | int   | 0         |                   |\n| 2         | hoplen        | int   
| n_fft / 4 |                   |\n| 3         | winlen        | int   | n_fft     |                   |\n| 4         | window_type   | int   | 0         | 0=ones 1=hann 2=hamming |\n| 5         | center        | int   | 1         |                   |\n| 6         | pad_type      | int   | 2         | 0=CONSTANT 1=REPLICATE 2=REFLECT |\n| 7         | normalized    | int   | 0         | 0=no 1=n_fft 2=window-l2-energy |\n| 8         | onesided      | int   | 1         |                   |\n\n# Split\n```\ny0, y1 ... = x\n```\n\n# Squeeze\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | squeeze_w     | int   | 0         |                   |\n| 1         | squeeze_h     | int   | 0         |                   |\n| 11        | squeeze_d     | int   | 0         |                   |\n| 2         | squeeze_c     | int   | 0         |                   |\n| 3         | axes          | array | [ ]       |                   |\n\n# Swish\n```\ny = x / (1 + exp(-x))\n```\n\n* one_blob_only\n* support_inplace\n\n# TanH\n```\ny = tanh(x)\n```\n\n* one_blob_only\n* support_inplace\n\n# Threshold\n```\nif x > threshold    y = 1\nelse                y = 0\n```\n\n* one_blob_only\n* support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | threshold     | float | 0.f       |                   |\n\n# Tile\n```\ny = repeat tiles along axis for x\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | axis          | int   | 0         |                   |\n| 1         | tiles         | int   | 1         |                   |\n| 2         | repeats       | array | [ ]       |                   |\n\n# UnaryOp\n```\ny 
= unaryop(x)\n```\n\n- one_blob_only\n- support_inplace\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | op_type       | int   | 0         | Operation type as follows |\n\nOperation type:\n- 0 = ABS\n- 1 = NEG\n- 2 = FLOOR\n- 3 = CEIL\n- 4 = SQUARE\n- 5 = SQRT\n- 6 = RSQ\n- 7 = EXP\n- 8 = LOG\n- 9 = SIN\n- 10 = COS\n- 11 = TAN\n- 12 = ASIN\n- 13 = ACOS\n- 14 = ATAN\n- 15 = RECIPROCAL\n- 16 = TANH\n- 17 = LOG10\n- 18 = ROUND\n- 19 = TRUNC\n\n# Unfold\n```\ny = unfold(x)\n```\n\n* one_blob_only\n\n| param id  | name          | type  | default   | description       |\n| --------- | ------------- | ----- | --------- | ----------------- |\n| 0         | num_output    | int   | 0         |                   |\n| 1         | kernel_w      | int   | 0         |                   |\n| 2         | dilation_w    | int   | 1         |                   |\n| 3         | stride_w      | int   | 1         |                   |\n| 4         | pad_left      | int   | 0         |                   |\n| 11        | kernel_h      | int   | kernel_w  |                   |\n| 12        | dilation_h    | int   | dilation_w |                  |\n| 13        | stride_h      | int   | stride_w  |                   |\n| 14        | pad_top       | int   | pad_left  |                   |\n| 15        | pad_right     | int   | pad_left  |                   |\n| 16        | pad_bottom    | int   | pad_top   |                   |\n"
  },
  {
    "path": "docs/developer-guide/param-and-model-file-structure.md",
    "content": "## net.param\n### example\n```\n7767517\n3 3\nInput         input    0 1 data 0=4 1=4 2=1\nInnerProduct  ip       1 1 data fc 0=10 1=1 2=80\nSoftmax       softmax  1 1 fc prob 0=0\n```\n### overview\n```\n[magic]\n```\n* magic number : 7767517\n```\n[layer count] [blob count]\n```\n* layer count : count of the layer line follows, should be exactly the count of all layer names\n* blob count : count of all blobs, usually greater than or equals to the layer count\n### layer line\n```\n[layer type] [layer name] [input count] [output count] [input blobs] [output blobs] [layer specific params]\n```\n* layer type : type name, such as Convolution Softmax etc\n* layer name : name of this layer, must be unique among all layer names\n* input count : count of the blobs this layer needs as input\n* output count : count of the blobs this layer produces as output\n* input blobs : name list of all the input blob names, separated by space, must be unique among input blob names of all layers\n* output blobs : name list of all the output blob names, separated by space, must be unique among output blob names of all layers\n* layer specific params : key=value pair list, separated by space\n### layer param\n```\n0=1 1=2.5 -23303=2,2.0,3.0\n```\nkey index should be unique in each layer line, pair can be omitted if the default value used\n\nthe meaning of existing param key index can be looked up at [operation-param-weight-table](operation-param-weight-table)\n\n* integer or float key : index 0 ~ 19\n* integer value : int\n* float value : float\n* integer array or float array key : -23300 minus index 0 ~ 19\n* integer array value : [array size],int,int,...,int\n* float array value : [array size],float,float,...,float\n\nIn modern ncnn param file\n\n* array could be represented as `3=2.0,3.0` that is much more human friendly\n* string typed value: `4=hello` and the string is no longer than 255\n\n## net.bin\n```\n  
+---------+---------+---------+---------+---------+---------+\n  | weight1 | weight2 | weight3 | weight4 | ....... | weightN |\n  +---------+---------+---------+---------+---------+---------+\n  ^         ^         ^         ^\n  0x0      0x80      0x140     0x1C0\n```\nthe model binary is the concatenation of all weight data, and each weight buffer is aligned to 32 bits\n\n### weight buffer\n```\n[flag] (optional)\n[raw data]\n[padding] (optional)\n```\n* flag : unsigned int, little-endian, indicating the weight storage type, 0 => float32, 0x01306B47 => float16, otherwise => quantized int8, may be omitted if the layer implementation forces the storage type explicitly\n* raw data : raw weight data, little-endian, float32 data or float16 data or quantized table and indexes depending on the storage type flag\n* padding : padding space for 32-bit alignment, may be omitted if already aligned\n"
  },
  {
    "path": "docs/developer-guide/preload-practice.zh.md",
    "content": "## 只是实践经验，没有理论，不一定正确\n\n```\nprfm pldl1keep, [x0, #256]\n```\n* 放在 ld1 [x0] 前面 0~8 条指令\n* #256 表示把 x0+256 的内容放进 L1 cache\n* ldp 也适用\n* (经验)不写 offset 不如写个 #128\n* (经验)pldl1strm 似乎没啥意思，也没 pldl1keep 快\n* (经验)x0 ~ x0+256 的内容也会进来\n* (经验)load 128bit 用 #128，256bit或更多用 #256\n* (经验)避免 pld a，pld b，load a，load b 顺序，可能相互干扰\n* (经验)提前太多会失效\n* (经验)适合连续读\n\n```\nprfm pldl2strm, [x0, #256]\n```\n* 放在 ld1 [x0] 前面 N 条指令，N 尽量大些\n* #256 表示把 x0+256 的内容放进 L2 cache\n* ldp 也适用\n* (经验)不写 offset 不如写个 #128\n* (经验)pldl2strm 效果稍好于 pldl2keep\n* (经验)x0 ~ x0+256 的内容也会进来\n* (经验)load 128bit 用 #128，256bit 用 #256\n* (经验)读很多数据，用不同 offset 连续两次 pldl2strm\n* (经验)后面不要对同位置再 pldl1keep，会变慢\n* (经验)适合提前准备要跳到很远的地方读，比如换 channel\n"
  },
  {
    "path": "docs/developer-guide/tensorflow-op-combination.md",
    "content": "## batchnorm\n```\nInput       A            0 1 A 0 0 0\nMemoryData  sub/y        0 1 sub/y 16 0 0\nBinaryOp    sub          2 1 A sub/y sub 1\nMemoryData  div/y        0 1 div/y 16 0 0\nBinaryOp    div          2 1 sub div/y div 3\nMemoryData  mul/y        0 1 mul/y 16 0 0\nBinaryOp    mul          2 1 div mul/y mul 2\nMemoryData  BiasAdd/bias 0 1 BiasAdd/bias 16 0 0\nBinaryOp    BiasAdd      2 1 mul BiasAdd/bias BiasAdd 0\n```\n## convolution\n```\nInput       A            0 1 A 0 0 0\nConvolution Conv2D       1 1 A Conv2D 10 3 1 1 0 0 270\nMemoryData  biases/read  0 1 biases/read 10 0 0\nBinaryOp    BiasAdd      2 1 Conv2D biases/read BiasAdd 0\n```\n## innerproduct\n```\nInput        A           0 1 A 0 0 0\nMemoryData   biases/read 0 1 biases/read 10 0 0\nInnerProduct MatMul      1 1 A MatMul 10 0 2560\nBinaryOp     conv6       2 1 MatMul biases/read conv6 0\n```\n## leakyrelu\n```\nInput       A            0 1 A 0 0 0\nSplit       splitncnn_0  1 2 A A_splitncnn_0 A_splitncnn_1\nMemoryData  mul_1/x      0 1 mul_1/x 0 0 0\nBinaryOp    mul_1        2 1 mul_1/x A_splitncnn_1 mul_1 2\nBinaryOp    leaky        2 1 mul_1 A_splitncnn_0 leaky 4\n```\n## prelu\n```\nInput       A            0 1 A 0 0 0\nSplit       splitncnn_0  1 2 A A_splitncnn_0 A_splitncnn_1\nMemoryData  prelu/alpha  0 1 prelu/alpha 10 0 0\nReLU        prelu/Relu   1 1 A_splitncnn_1 prelu/Relu 0.000000\nUnaryOp     prelu/Neg    1 1 A_splitncnn_0 prelu/Neg 1\nReLU        prelu/Relu_1 1 1 prelu/Neg prelu/Relu_1 0.000000\nUnaryOp     prelu/Neg_1  1 1 prelu/Relu_1 prelu/Neg_1 1\nBinaryOp    prelu/Mul    2 1 prelu/alpha prelu/Neg_1 prelu/Mul 2\nBinaryOp    prelu/add    2 1 prelu/Relu prelu/Mul prelu/add 0\n```\n## softmax\n```\nInput       A            0 1 A 0 0 0\nSplit       splitncnn_4  1 2 A A_splitncnn_0 A_splitncnn_1\nReduction   Max          1 1 A_splitncnn_1 Max 4 -2 1.000000\nBinaryOp    sub          2 1 A_splitncnn_0 Max sub 1\nUnaryOp     Exp          1 1 sub Exp 7\nSplit       
splitncnn_5  1 2 Exp Exp_splitncnn_0 Exp_splitncnn_1\nReduction   Sum          1 1 Exp_splitncnn_1 Sum 0 -2 1.000000\nBinaryOp    prob         2 1 Exp_splitncnn_0 Sum prob 3\n```"
  },
  {
    "path": "docs/developer-guide/vulkan-driver-loader.md",
    "content": "# ncnn vulkan driver loader\n\nncnn turns on the ```NCNN_SIMPLEVK``` cmake option by default, when ```NCNN_VULKAN``` is enabled\n\nsimplevk is ncnn's built-in vulkan loader. It provides vulkan function declarations and function entries that meet ncnn's needs. It allows the use and compilation of vulkan-related codes without relying on vulkan-sdk. It can dynamically load the vulkan runtime library at runtime or directly load the graphics card driver. vulkan driver. When distributing ncnn applications, it is not required that the target system has a vulkan driver.\n\nUsually you don't need to care about how simplevk loads the vulkan driver, because ncnn will automatically load and initialize when using vulkan related functions. It is sufficient to set the `Option` switch before loading the model.\n\nTypical code\n\n```cpp\nncnn::Net net;\nnet.opt.use_vulkan_compute = true;\nnet.load_param(\"model.param\");\nnet.load_param(\"model.bin\");\n```\n\nUsing the in-house vulkan loader instead of the standard libvulkan has the following benefits\n\n- Can compile ncnn vulkan code without installing vulkan-sdk\n- Can deploy and distribute applications without libvulkan linkage\n- Can load external vulkan driver instead of system driver\n- Can directly load android hal module\n- Can directly load graphics card driver files via NCNN_VULKAN_DRIVER env\n- Able to actively search for graphics card driver files in the system and load them\n- Can compile android libraries supporting vulkan under the platform of android-api<24\n\n## Create and manage gpu context\n\n```cpp\nint create_gpu_instance(const char* driver_path = 0);\n\nvoid destroy_gpu_instance();\n\nVkInstance get_gpu_instance();\n```\n\n## Loading order\n\n```\nIf driver_path == 0\n  1a from env ```VK_ICD_FILENAMES```\n  1b from env ```NCNN_VULKAN_DRIVER```\n\nIf driver_path != 0\n  1 from specified driver_path\n\n2 from vulkan-1.dll / libvulkan.so / libvulkan.dylib in system\n\n3 search driver by name 
nvoglv64.dll / amdvlk64.dll / libGLX_nvidia.so.0 .... and load it\n```\n\n## Load from system vulkan library or graphics driver\n\nThis is the default behavior and it should work on most systems\n\nsample usage\n```cpp\nint ret = create_gpu_instance();\n```\n\nLoad from system-installed libvulkan\n\n#### Windows\nvulkan-1.dll\n\n#### Linux Android\nlibvulkan.so\n\n#### macOS iOS and other APPLE platforms\nlibvulkan.dylib\n\nIf static moltenvk driver linked, should always succeed\n\nIf failed, it will try to find graphics driver object and load it\n\n#### Windows\nfor 64bit applications. search in ```%SystemRoot%\\System32\\DriverStore\\FileRepository```\n- nvoglv64.dll\n- amdvlk64.dll\n- igvk64.dll\n- qcvkarm64xum.dll\n\nfor 32bit applications. search in ```%SystemRoot%\\System32\\DriverStore\\FileRepository```\n- nvoglv32.dll\n- amdvlk32.dll\n- igvk32.dll\n\n#### Linux\n`dlopen()` search for\n- libGLX_nvidia.so.0\n- libvulkan_radeon.so\n- libvulkan_intel.so\n- libMaliVulkan.so.1\n- libVK_IMG.so\n\n#### Android\nfor 64bit applications\n- /vendor/lib64/hw/vulkan.adreno.so\n- /vendor/lib64/egl/libGLES_mali.so\n\nfor 32bit applications\n- /vendor/lib/hw/vulkan.adreno.so\n- /vendor/lib/egl/libGLES_mali.so\n\n#### macOS iOS and other APPLE platforms\n`dlopen()` search for\n- libMoltenVK.dylib\n- libvulkan_kosmickrisp.dylib\n\n## Load from driver_path\n\nfor advanced developer\n\nsample usage\n```cpp\nint ret = create_gpu_instance(\"libvulkan.so\");\nint ret = create_gpu_instance(\"/usr/lib64/libvulkan_radeon.so\");\nint ret = create_gpu_instance(\"/vendor/lib64/hw/vulkan.adreno.so\");\nint ret = create_gpu_instance(\"/data/local/tmp/vulkan.ad07XX.so\");\n```\n\n## Load from env VK_ICD_FILENAMES\n\nfor debug purpose\n\nsample usage\n```sh\nexport VK_ICD_FILENAMES=./vk_swiftshader_icd.json\nexport VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json\nexport VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json\n```\n\n## Load from env NCNN_VULKAN_DRIVER\n\nfor debug 
purpose\n\nsample usage\n```sh\nexport NCNN_VULKAN_DRIVER=/data/local/tmp/vulkan.ad07XX.so\n```\n"
  },
  {
    "path": "docs/faq.en.md",
    "content": "\n\n# How to join the technical Community Groups with QQ  ？\n\n- Open QQ -> click the group chat search-> search group number 637093648, enter the answer to the question: conv conv conv conv conv → join the group chat → ready to accept the Turing test(a joke)\n- Open QQ -> search Pocky group: 677104663 (lots experts), the answer to the question\n\n# How to watch the author's on live in Bilibili？\n\n- nihui：[水竹院落](https://live.bilibili.com/1264617)\n\n# Compilation\n\n- ## How to download the full source code？\n\n   git clone --recursive https://github.com/Tencent/ncnn/\n\n   or\n\n   download [ncnn-xxxxx-full-source.zip](https://github.com/Tencent/ncnn/releases)\n\n- ## How to cross-compile？How to set the cmake toolchain？\n\n   See https://github.com/Tencent/ncnn/wiki/how-to-build\n\n- ## The submodules were not downloaded! Please update submodules with \"git submodule update --init\" and try again\n\n   As above, download the full source code. Or follow the prompts to execute: git submodule update --init\n\n- ## Could NOT find Protobuf (missing: Protobuf_INCLUDE_DIR)\n\n   sudo apt-get install libprotobuf-dev protobuf-compiler\n\n- ## Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY)\n\n   https://github.com/Tencent/ncnn/issues/1873\n\n- ## Could not find a package configuration file provided by \"OpenCV\" with any of the following names: OpenCVConfig.cmake opencv-config.cmake\n\n   sudo apt-get install libopencv-dev\n\n   or customized compile and install ，with set(OpenCV_DIR {the dir OpenCVConfig.cmake exist})\n\n- ## Could not find a package configuration file provided by \"ncnn\" with any of the following names: ncnnConfig.cmake ncnn-config.cmake\n\n   set(ncnn_DIR { the dir ncnnConfig.cmake exist})\n\n- ## xxx.lib not found（be specified by system/compiler）\n\n   undefined reference to __kmpc_for_static_init_4 __kmpc_for_static_fini __kmpc_fork_call ...\n\n   Need to link openmp\n\n   undefined reference 
to glslang::InitializeProcess() glslang::TShader::TShader(EShLanguage) ...\n\n   need glslang.lib glslang-default-resource-limits.lib\n\n   undefined reference to AAssetManager_fromJava AAssetManager_open AAsset_seek ...\n\n   Add android to find_library and target_like_libraries\n\n   find_package(ncnn)\n\n- ## undefined reference to typeinfo for ncnn::Layer\n\n   opencv rtti -> opencv-mobile\n\n- ## undefined reference to __cpu_model\n\n   upgrade compiler / libgcc_s libgcc\n\n- ## unrecognized command line option \"-mavx2\"\n\n   upgrade gcc\n\n- ## Why is the compiled ncnn-android library so large？\n\n   See https://github.com/Tencent/ncnn/wiki/build-for-android.zh and see How to trim smaller ncnn\n\n- ## ncnnoptimize and custom layer\n\n   ncnnoptimize first before adding a custom layer to avoid ncnnoptimize not being able to handle custom layer saves.\n\n\n- ## rtti/exceptions Conflict\n\n   The reason for the conflict is that the libraries used in the project are configured differently, so analyze whether you need to turn them on or off according to your actual situation. ncnn is ON by default, add the following two parameters when recompiling ncnn.\n   - ON: -DNCNN_DISABLE_RTTI=OFF -DNCNN_DISABLE_EXCEPTION=OFF\n   - OFF: -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON\n\n\n- ## error: undefined symbol: ncnn::Extractor::extract(char const*, ncnn::Mat&)\n\n   Possible scenarios.\n   - Try upgrading the NDK version of Android Studio\n\n\n# How do I add the ncnn library to my project and how does the cmake method work?\n\nCompile ncnn,and make install. 
linux/windows should set/export ncnn_DIR points to the directory containing ncnnConfig.cmake under the install directory\n\n- ## android\n\n- ## ios\n\n- ## linux\n\n- ## windows\n\n- ## macos\n\n- ## arm linux\n\n\n# Convert model issues\n\n- ## caffe\n\n   `./caffe2ncnn caffe.prototxt caffe.caffemodel ncnn.param ncnn.bin`\n\n- ## mxnet\n\n   ` ./mxnet2ncnn mxnet-symbol.json mxnet.params ncnn.param ncnn.bin`\n\n- ## darknet\n\n   [https://github.com/xiangweizeng/darknet2ncnn](https://github.com/xiangweizeng/darknet2ncnn)\n\n- ## pytorch - onnx\n\n   [use ncnn with pytorch or onnx](https://github.com/Tencent/ncnn/wiki/use-ncnn-with-pytorch-or-onnx)\n\n- ## tensorflow 1.x/2.x - keras\n\n   [https://github.com/MarsTechHAN/keras2ncnn](https://github.com/MarsTechHAN/keras2ncnn) **[@MarsTechHAN](https://github.com/MarsTechHAN)**\n\n- ## tensorflow 2.x - mlir\n\n   [Converting tensorflow2 models to ncnn via MLIR](https://zhuanlan.zhihu.com/p/152535430) **@[nihui](https://www.zhihu.com/people/nihui-2)**\n\n- ## netron\n\n   [https://github.com/lutzroeder/netron](https://github.com/lutzroeder/netron)\n\n- ## How to generate a model with fixed shape？\n\n   Input      0=w 1=h 2=c\n\n- ## why gpu can speedup\n\n- ## How to convert ncnnoptimize to fp16 model\n\n   `ncnnoptimize model.param model.bin yolov5s-opt.param yolov5s-opt.bin 65536`\n\n- ## How to use ncnnoptimize  checking the FLOPS / memory usage of your model\n\n- ## How to modify the model to support dynamics shape？\n\n   Interp Reshape\n\n- ## How to convert a model into code embedded in a program？\n\n   use ncnn2mem\n\n- ## How to encrypt the model？\n\n   See https://zhuanlan.zhihu.com/p/268327784\n\n- ## The ncnn model transferred under Linux, Windows/MacOS/Android/... 
Can I use it directly?\n\n   Yes, for all platforms\n\n- ## How to remove post-processing and export onnx？\n\n   Ref：\n\n   Referring to an article by UP <https://zhuanlan.zhihu.com/p/128974102>, step 3 is to remove the post-processing and then export the onnx, where removing the post-processing can be the result of removing the subsequent steps when testing within the project.\n\n- ## pytorch layers can't export to onnx？\n\n Mode 1:\n\n   ONNX_ATEN_FALLBACK\nFully customizable op, first change to one that can export (e.g. concat slice), go to ncnn and then modify param\n\n Way 2.\n\n You can try this with PNNX, see the following article for a general description:\n\n   1. [Windows/Linux/macOS steps for compiling PNNX](https://zhuanlan.zhihu.com/p/431833958)\n\n   2. [Learn in 5 minutes! Converting TorchScript models to ncnn models with PNNX](https://zhuanlan.zhihu.com/p/427512763)\n\n# Using\n\n- ## vkEnumeratePhysicalDevices failed -3\n\n- ## vkCreateInstance failed -9\n\n   Please upgrade your GPU driver if you meet this crash or error.\n   Here are the download sites for some brands of GPU drivers. We have provided some driver download pages here.\n   [Intel](https://downloadcenter.intel.com/product/80939/Graphics-Drivers), [AMD](https://www.amd.com/en/support), [Nvidia](https://) www.nvidia.com/Download/index.aspx)\n\n- ## ModuleNotFoundError: No module named 'ncnn.ncnn'\n\n   python setup.py develop\n\n- ## fopen nanodet-m.param failed\n\n   path should be working dir\n\n   File not found or not readable. Make sure that XYZ.param/XYZ.bin is accessible.\n\n- ## find_blob_index_by_name data / output / ... 
failed\n\n   layer name vs blob name\n\n   param.bin use xxx.id.h enum\n\n- ## parse magic failed\n\n- ## param is too old, please regenerate\n\n   The model maybe has problems\n\n   Your model file is being the old format converted by an old caffe2ncnn tool.\n\n   Checkout the latest ncnn code, build it and regenerate param and model binary files, and that should work.\n\n   Make sure that your param file starts with the magic number 7767517.\n\n   you may find more info on use-ncnn-with-alexnet\n\n   When adding the softmax layer yourself, you need to add 1=1\n\n- ## set_vulkan_compute failed, network use_vulkan_compute disabled\n\n   Set net.opt.use_vulkan_compute = true before load_param / load_model;\n\n- ## How to execute multiple blob inputs, multiple blob outputs？\n   Multiple execute `ex.input()` and `ex.extract()` like following\n    ```\n    ex.input(\"data1\", in_1);\n    ex.input(\"data2\", in_2);\n    ex.extract(\"output1\", out_1);\n    ex.extract(\"output2\", out_2);\n    ```\n- ## Multiple executions of Extractor extract double the calculation？\n\n   No\n\n- ## How to see the elapsed time for every layer？\n\n   cmake -DNCNN_BENCHMARK=ON ..\n\n- ## How to convert a cv::Mat CV_8UC3 BGR image\n\n   from_pixels to_pixels\n\n- ## How to convert float data to ncnn::Mat\n\n   First of all, you need to manage the memory you request yourself, at this point ncnn::Mat will not automatically free up the float data you pass over to it\n   ``` c++\n   std::vector<float> testData(60, 1.0); // use std::vector<float> to manage memory requests and releases yourself\n   ncnn::Mat in1 = ncnn::Mat(60, (void*)testData.data()).reshape(4, 5, 3); // just pass the pointer to the float data as a void*, and even specify the dimension (up says it's best to use reshape to solve the channel gap)\n   float* a = new float[60]; // New a piece of memory yourself, you need to release it later\n   ncnn::Mat in2 = ncnn::Mat(60, (void*)a).reshape(4, 5, 3).clone(); // use the same method 
as above, clone() to transfer data owner\n   ```\n"
  },
  {
    "path": "docs/faq.md",
    "content": "\n\n# 如何加入技术交流QQ群？\n\n- 打开QQ→点击群聊搜索→搜索群号637093648→输入问题答案：卷卷卷卷卷→进入群聊→准备接受图灵测试（bushi）\n- 前往QQ搜索Pocky群：677104663(超多大佬)，问题答案：multi level intermediate representation\n\n# 如何看作者b站直播？\n\n- nihui的bilibili直播间：[水竹院落](https://live.bilibili.com/1264617)\n\n# 编译\n\n- ## 怎样下载完整源码？\n\n   git clone --recursive https://github.com/Tencent/ncnn/\n   \n   或者\n   \n   下载 [ncnn-xxxxx-full-source.zip](https://github.com/Tencent/ncnn/releases)\n\n- ## 怎么交叉编译？cmake 工具链怎么设置啊？\n  \n   参见 https://github.com/Tencent/ncnn/wiki/how-to-build\n\n- ## The submodules were not downloaded! Please update submodules with \"git submodule update --init\" and try again\n\n   如上，下载完整源码。或者按提示执行: git submodule update --init\n\n- ## Could NOT find Protobuf (missing: Protobuf_INCLUDE_DIR)\n  \n   sudo apt-get install libprotobuf-dev protobuf-compiler\n\n- ## Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY)\n\n   https://github.com/Tencent/ncnn/issues/1873\n\n- ## Could not find a package configuration file provided by \"OpenCV\" with any of the following names: OpenCVConfig.cmake opencv-config.cmake\n\n   sudo apt-get install libopencv-dev\n\n   或者自行编译安装，set(OpenCV_DIR {OpenCVConfig.cmake所在目录})\n\n- ## Could not find a package configuration file provided by \"ncnn\" with any of the following names: ncnnConfig.cmake ncnn-config.cmake\n\n   set(ncnn_DIR {ncnnConfig.cmake所在目录})\n\n- ## 找不到库（需要根据系统/编译器指定）\n\n   undefined reference to __kmpc_for_static_init_4 __kmpc_for_static_fini __kmpc_fork_call ...\n\n   需要链接openmp库 \n\n   undefined reference to glslang::InitializeProcess() glslang::TShader::TShader(EShLanguage) ...\n\n   需要 glslang.lib glslang-default-resource-limits.lib\n\n   undefined reference to AAssetManager_fromJava AAssetManager_open AAsset_seek ...\n\n   find_library和target_like_libraries中增加 android \n\n   find_package(ncnn)\n\n- ## undefined reference to typeinfo for ncnn::Layer\n\n   opencv rtti -> opencv-mobile\n\n- ## undefined reference to 
__cpu_model\n\n   升级编译器 / libgcc_s libgcc\n\n- ## unrecognized command line option \"-mavx2\"\n\n   升级 gcc\n\n- ## 为啥自己编译的ncnn android库特别大？\n\n   https://github.com/Tencent/ncnn/wiki/build-for-android.zh 以及见 如何裁剪更小的 ncnn 库\n\n- ## ncnnoptimize和自定义层\n\n   先ncnnoptimize再增加自定义层，避免ncnnoptimize不能处理自定义层保存。\n\n\n- ## rtti/exceptions冲突\n\n   产生原因是项目工程中使用的库配置不一样导致冲突，根据自己的实际情况分析是需要开启还是关闭。ncnn默认是ON，在重新编译ncnn时增加以下2个参数即可：\n   - 开启：-DNCNN_DISABLE_RTTI=OFF -DNCNN_DISABLE_EXCEPTION=OFF\n   - 关闭：-DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON\n\n\n- ## error: undefined symbol: ncnn::Extractor::extract(char const*, ncnn::Mat&)\n\n   可能的情况：\n   - 尝试升级 Android Studio 的 NDK 版本\n\n- ## CMake 3.14.0 or higher is required.  You are running version 2.8.12.2\n```shell\nwget https://github.com/Kitware/CMake/releases/download/v3.18.2/cmake-3.18.2-Linux-x86_64.tar.gz\ntar zxvf cmake-3.18.2-Linux-x86_64.tar.gz\nmv cmake-3.18.2-Linux-x86_64 /opt/cmake-3.18.2\nln -sf /opt/cmake-3.18.2/bin/* /usr/bin/\n```\n\n# 怎样添加ncnn库到项目中？cmake方式怎么用？\n\n编译ncnn，make install。linux/windows set/export ncnn_DIR 指向 install目录下包含ncnnConfig.cmake 的目录\n\n- ## android\n\n- ## ios\n\n- ## linux\n\n- ## windows\n\n- ## macos\n\n- ## arm linux\n\n\n# 转模型问题\n\n- ## caffe\n\n   `./caffe2ncnn caffe.prototxt caffe.caffemodel ncnn.param ncnn.bin`\n\n- ## mxnet\n\n   ` ./mxnet2ncnn mxnet-symbol.json mxnet.params ncnn.param ncnn.bin`\n\n- ## darknet\n\n   [https://github.com/xiangweizeng/darknet2ncnn](https://github.com/xiangweizeng/darknet2ncnn)\n\n- ## pytorch - onnx\n\n   [use ncnn with pytorch or onnx](https://github.com/Tencent/ncnn/wiki/use-ncnn-with-pytorch-or-onnx)\n\n- ## tensorflow 1.x/2.x - keras\n\n   [https://github.com/MarsTechHAN/keras2ncnn](https://github.com/MarsTechHAN/keras2ncnn) **[@MarsTechHAN](https://github.com/MarsTechHAN)**\n\n- ## tensorflow 2.x - mlir\n\n   [通过MLIR将tensorflow2模型转换到ncnn](https://zhuanlan.zhihu.com/p/152535430) **@[nihui](https://www.zhihu.com/people/nihui-2)**\n\n- ## netron\n\n   
[https://github.com/lutzroeder/netron](https://github.com/lutzroeder/netron)\n\n- ## 怎么生成有固定 shape 信息的模型？\n\n   Input      0=w 1=h 2=c\n\n- ## why gpu能更快\n\n- ## ncnnoptimize 怎么转成 fp16 模型\n\n   `ncnnoptimize model.param model.bin yolov5s-opt.param yolov5s-opt.bin 65536`\n\n- ## ncnnoptimize 怎样查看模型的 FLOPS / 内存占用情况\n\n- ## 怎么修改模型支持动态 shape？\n\n   Interp Reshape\n\n- ## 如何将模型转换为代码内嵌到程序里？\n\n   ncnn2mem\n\n- ## 如何加密模型？\n\n   https://zhuanlan.zhihu.com/p/268327784\n\n- ## Linux下转的ncnn模型，Windows/MacOS/Android/.. 也能直接用吗？\n\n   Yes，全平台通用\n\n- ## 如何去掉后处理，再导出 onnx？\n\n   检测：\n\n   参考up的一篇文章<https://zhuanlan.zhihu.com/p/128974102>，步骤三就是去掉后处理,再导出onnx,其中去掉后处理可以是项目内测试时去掉后续步骤的结果。\n\n- ## pytorch 有的层导不出 onnx 怎么办？\n\n 方式一:\n\n   ONNX_ATEN_FALLBACK\n完全自定义的op，先改成能导出的（如 concat slice），转到 ncnn 后再修改 param\n\n 方式二：\n\n 可以使用PNNX来试试，参考以下文章大概说明:\n\n   1. [Windows/Linux/macOS 编译 PNNX 步骤](https://zhuanlan.zhihu.com/p/431833958)\n\n   2. [5分钟学会！用 PNNX 转换 TorchScript 模型到 ncnn 模型](https://zhuanlan.zhihu.com/p/427512763)\n\n# 使用\n\n- ## vkEnumeratePhysicalDevices failed -3\n\n- ## vkCreateInstance failed -9\n\n   出现此类问题请先更新GPU驱动。Please upgrade your GPU driver if you encounter this crash or error.\n   这里提供了一些品牌的GPU驱动下载网址.We have provided some drivers' download pages here.\n   [Intel](https://downloadcenter.intel.com/product/80939/Graphics-Drivers)，[AMD](https://www.amd.com/en/support)，[Nvidia](https://www.nvidia.com/Download/index.aspx)\n\n- ## docker 环境里面 nvidia-smi 能看到显卡也能跑 cuda 却不能跑 vulkan\n\n   因为这个docker环境的nvidia驱动没有安装opengl/vulkan支持\n\n  首先运行 nvidia-smi 查看当前驱动版本\n\n```\nNVIDIA-SMI 535.161.07\nDriver Version: 535.161.07\nCUDA Version: 12.2\n```\n\n然后去下载对应版本的NVIDIA驱动，安装用户态驱动文件，跳过内核部分\n\n```\nwget https://us.download.nvidia.com/tesla/535.161.07/NVIDIA-Linux-x86_64-535.161.07.run\nchmod +x NVIDIA-Linux-x86_64-535.161.07.run\n./NVIDIA-Linux-x86_64-535.161.07.run --silent --no-kernel-module\n```\n\n安装时会报一些文件权限错误，不用管，安装完成后 vulkan 支持就可用了。最后安装 vulkaninfo 查看gpu信息\n\n```\ndnf install 
vulkan-tools\nvulkaninfo\n```\n\n- ## ModuleNotFoundError: No module named 'ncnn.ncnn'\n\n   python setup.py develop\n\n- ## fopen nanodet-m.param failed\n\n   文件路径 working dir\n\n   File not found or not readable. Make sure that XYZ.param/XYZ.bin is accessible.\n\n- ## find_blob_index_by_name data / output / ... failed\n\n   layer name vs blob name\n   \n   param.bin 应该用 xxx.id.h 的枚举\n\n- ## parse magic failed\n\n- ## param is too old, please regenerate\n\n   模型本身有问题\n\n   Your model file is being the old format converted by an old caffe2ncnn tool.\n\n   Checkout the latest ncnn code, build it and regenerate param and model binary files, and that should work.\n\n   Make sure that your param file starts with the magic number 7767517.\n\n   you may find more info on use-ncnn-with-alexnet\n   \n   When adding the softmax layer yourself, you need to add 1=1\n\n- ## set_vulkan_compute failed, network use_vulkan_compute disabled\n\n   你应该在 load_param / load_model 之前设置 net.opt.use_vulkan_compute = true;\n\n- ## 多个blob输入，多个blob输出，怎么做？\n   多次执行`ex.input()` 和 `ex.extract()`\n```\nex.input(\"data1\", in_1);\nex.input(\"data2\", in_2);\nex.extract(\"output1\", out_1);\nex.extract(\"output2\", out_2);\n```\n- ## Extractor extract 多次会重复计算吗？\n\n   不会\n\n- ## 如何看每一层的耗时？\n\n   cmake -DNCNN_BENCHMARK=ON ..\n\n- ## 如何转换 cv::Mat CV_8UC3 BGR 图片\n\n   from_pixels to_pixels\n\n- ## 如何转换 float 数据为 ncnn::Mat\n\n   首先，自己申请的内存需要自己管理，此时ncnn::Mat不会自动给你释放你传过来的float数据\n   ``` c++\n   std::vector<float> testData(60, 1.0);                                      // 利用std::vector<float>自己管理内存的申请和释放\n   ncnn::Mat in1 = ncnn::Mat(60, (void*)testData.data()).reshape(4, 5, 3);    // 把float数据的指针转成void*传过去即可，甚至还可以指定维度(up说最好使用reshape用来解决channel gap)\n   float* a = new float[60];                                                  // 自己new一块内存，后续需要自己释放\n   ncnn::Mat in2 = ncnn::Mat(60, (void*)a).reshape(4, 5, 3).clone();          // 使用方法和上面相同，clone() to transfer data owner\n   ```\n\n- ## 如何初始化 ncnn::Mat 为全 
0\n\n   `mat.fill(0.f);`\n\n- ## 如何查看／获取版本号\n\n   cmake时会打印\n\n   c_api.h ncnn_version()\n\n   自己拼 1.0+yyyymmdd\n\n- ## 如何转换 yuv 数据\n\n   yuv420sp2rgb yuv420sp2rgb_nv12\n\n   **[@metarutaiga](https://github.com/metarutaiga/xxYUV)**\n\n- ## 如何 resize crop rotate 图片\n\n   [efficient roi resize rotate](https://github.com/Tencent/ncnn/wiki/efficient-roi-resize-rotate)\n\n- ## 如何人脸5点对齐\n\n   get_affine_transform\n\n   warpaffine_bilinear_c3\n\n```c\n// 计算变换矩阵 并且求逆变换\nint type = 0;       // 0->区域外填充为v[0],v[1],v[2], -233->区域外不处理\nunsigned int v = 0;\nfloat tm[6];\nfloat tm_inv[6];\n// 人脸区域在原图上的坐标和宽高\nfloat src_x = target->det.rect.x / target->det.w * pIveImageU8C3->u32Width;\nfloat src_y = target->det.rect.y / target->det.h * pIveImageU8C3->u32Height;\nfloat src_w = target->det.rect.w / target->det.w * pIveImageU8C3->u32Width;\nfloat src_h = target->det.rect.h / target->det.h * pIveImageU8C3->u32Height;\nfloat point_src[10] = {\nsrc_x + src_w * target->attr.land[0][0], src_x + src_w * target->attr.land[0][1],\nsrc_x + src_w * target->attr.land[1][0], src_x + src_w * target->attr.land[1][1],\nsrc_x + src_w * target->attr.land[2][0], src_x + src_w * target->attr.land[2][1],\nsrc_x + src_w * target->attr.land[3][0], src_x + src_w * target->attr.land[3][1],\nsrc_x + src_w * target->attr.land[4][0], src_x + src_w * target->attr.land[4][1],\n};\nfloat point_dst[10] = { // +8 是因为我们处理112*112的图\n30.2946f + 8.0f, 51.6963f,\n65.5318f + 8.0f, 51.5014f,\n48.0252f + 8.0f, 71.7366f,\n33.5493f + 8.0f, 92.3655f,\n62.7299f + 8.0f, 92.2041f,\n};\n// 第一种方式：先计算变换在求逆\nAffineTrans::get_affine_transform(point_src, point_dst, 5, tm);\nAffineTrans::invert_affine_transform(tm, tm_inv);\n// 第二种方式：直接拿到求逆的结果\n// AffineTrans::get_affine_transform(point_dst, point_src, 5, tm_inv);\n// rgb 分离的，所以要单独处理\nfor(int c = 0; c < 3; c++)\n{\n    unsigned char* pSrc = malloc(xxx);\n    unsigned char* pDst = malloc(xxx);\n    ncnn::warpaffine_bilinear_c1(pSrc, SrcWidth, SrcHeight, SrcStride[c], pDst, DstWidth, 
DstHeight, DstStride[c], tm_inv, type, v);\n}\n// rgb packed则可以一次处理\nncnn::warpaffine_bilinear_c3(pSrc, SrcWidth, SrcHeight, SrcStride, pDst, DstWidth, DstHeight, DstStride, tm_inv, type, v);\n```\n\n- ## 如何获得中间层的blob输出\n  \n   ncnn::Mat output;\n   \n   ex.extract(\"your_blob_name\", output);\n\n- ## 为什么我使用GPU，但是GPU占用为0\n\n   windows 10 任务管理器 - 性能选项卡 - GPU - 选择其中一个视图左上角的下拉箭头切换到 Compute_0 / Compute_1 / Cuda\n\n   你还可以安装软件：GPU-Z \n\n- ## layer XYZ not exists or registered\n\n   Your network contains some operations that are not implemented in ncnn.\n\n   You may implement them as custom layer followed in how-to-implement-custom-layer-step-by-step.\n\n   Or you could simply register them as no-op if you are sure those operations make no sense.\n\n```\nclass Noop : public ncnn::Layer {};\nDEFINE_LAYER_CREATOR(Noop)\n\nnet.register_custom_layer(\"LinearRegressionOutput\", Noop_layer_creator);\nnet.register_custom_layer(\"MAERegressionOutput\", Noop_layer_creator);\n```\n\n- ## network graph not ready\n\n   You shall call Net::load_param() first, then Net::load_model().\n\n   This error may also happens when Net::load_param() failed, but not properly handled.\n\n   For more information about the ncnn model load api, see ncnn-load-model\n\n- ## memory not 32-bit aligned at XYZ\n\n   The pointer passed to Net::load_param() or Net::load_model() is not 32bit aligned.\n\n   In practice, the head pointer of std::vector is not guaranteed to be 32bit aligned.\n\n   you can store your binary buffer in ncnn::Mat structure, its internal memory is aligned.\n\n- ## crash on android with '__kmp_abort_process'\n\n   This usually happens if you bundle multiple shared library with openmp linked\n\n   It is actually an issue of the android ndk https://github.com/android/ndk/issues/1028\n\n   On old android ndk, modify the link flags as\n\n   -Wl,-Bstatic -lomp -Wl,-Bdynamic\n\n   For recent ndk >= 21\n\n   -fstatic-openmp\n\n- ## dlopen failed: library \"libomp.so\" not found\n   Newer 
android ndk defaults to dynamic openmp runtime\n\n   modify the link flags as\n\n   -fstatic-openmp -fopenmp\n\n- ## crash when freeing a ncnn dynamic library(.dll/.so) built with openMP\n\n   for optimal performance, the openmp threadpool spin waits for about a second prior to shutting down in case more work becomes available.\n\n   If you unload a dynamic library that's in the process of spin-waiting, it will crash in the manner you see (most of the time).\n\n   Just set OMP_WAIT_POLICY=passive in your environment, before calling loadlibrary. or Just wait a few seconds before calling freelibrary.\n\n   You can also use the following method to set environment variables in your code:\n\n   for msvc++:\n\n      SetEnvironmentVariable(_T(\"OMP_WAIT_POLICY\"), _T(\"passive\"));\n\n   for g++:\n\n      setenv(\"OMP_WAIT_POLICY\", \"passive\", 1)\n   \n      reference: https://stackoverflow.com/questions/34439956/vc-crash-when-freeing-a-dll-built-with-openmp\n\n# 跑出来的结果对不上\n\n[ncnn-produce-wrong-result](https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result)\n\n- ## 如何打印 ncnn::Mat 的值？\n\n```C++\nvoid pretty_print(const ncnn::Mat& m)\n{\n    for (int q=0; q<m.c; q++)\n    {\n        const float* ptr = m.channel(q);\n        for (int y=0; y<m.h; y++)\n        {\n            for (int x=0; x<m.w; x++)\n            {\n                printf(\"%f \", ptr[x]);\n            }\n            ptr += m.w;\n            printf(\"\\n\");\n        }\n        printf(\"------------------------\\n\");\n    }\n}\n```\nIn Android Studio, `printf` will not work, you can use `__android_log_print` instead. 
Example :\n```C++\n#include <android/log.h>  // Don't forget this\n\nvoid pretty_print(const ncnn::Mat& m)\n{\n    for (int q=0; q<m.c; q++)\n    {\n        for (int y=0; y<m.h; y++)\n        {\n            for (int x=0; x<m.w; x++)\n            {\n                __android_log_print(ANDROID_LOG_DEBUG,\"LOG_TAG\",\"ncnn Mat is : %f\", m.channel(q).row(y)[x]);\n            }\n        }\n    }\n}\n```\n\n- ## 如何可视化 ncnn::Mat 的值？\n\n```\nvoid visualize(const char* title, const ncnn::Mat& m)\n{\n    std::vector<cv::Mat> normed_feats(m.c);\n\n    for (int i=0; i<m.c; i++)\n    {\n        cv::Mat tmp(m.h, m.w, CV_32FC1, (void*)(const float*)m.channel(i));\n\n        cv::normalize(tmp, normed_feats[i], 0, 255, cv::NORM_MINMAX, CV_8U);\n\n        cv::cvtColor(normed_feats[i], normed_feats[i], cv::COLOR_GRAY2BGR);\n\n        // check NaN\n        for (int y=0; y<m.h; y++)\n        {\n            const float* tp = tmp.ptr<float>(y);\n            uchar* sp = normed_feats[i].ptr<uchar>(y);\n            for (int x=0; x<m.w; x++)\n            {\n                float v = tp[x];\n                if (v != v)\n                {\n                    sp[0] = 0;\n                    sp[1] = 0;\n                    sp[2] = 255;\n                }\n\n                sp += 3;\n            }\n        }\n    }\n\n    int tw = m.w < 10 ? 32 : m.w < 20 ? 16 : m.w < 40 ? 8 : m.w < 80 ? 4 : m.w < 160 ? 
2 : 1;\n    int th = (m.c - 1) / tw + 1;\n\n    cv::Mat show_map(m.h * th, m.w * tw, CV_8UC3);\n    show_map = cv::Scalar(127);\n\n    // tile\n    for (int i=0; i<m.c; i++)\n    {\n        int ty = i / tw;\n        int tx = i % tw;\n\n        normed_feats[i].copyTo(show_map(cv::Rect(tx * m.w, ty * m.h, m.w, m.h)));\n    }\n\n    cv::resize(show_map, show_map, cv::Size(0,0), 2, 2, cv::INTER_NEAREST);\n    cv::imshow(title, show_map);\n}\n```\n\n- ## 总是输出第一张图的结果\n\n   复用 Extractor？！\n\n- ## 启用fp16时的精度有差异\n\n   net.opt.use_fp16_packed = false;\n\n   net.opt.use_fp16_storage = false;\n\n   net.opt.use_fp16_arithmetic = false;\n\n   [ncnn-produce-wrong-result](https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result)\n\n\n# 如何跑得更快？内存占用更少？库体积更小？\n\n- ## fp32 fp16\n\n- ## 大小核绑定\n   ncnn::set_cpu_powersave(int)绑定大核或小核\n   注意windows系统不支持绑核。\n   ncnn支持不同的模型运行在不同的核心。假设硬件平台有2个大核，4个小核，你想把netA运行在大核，netB运行在小核。\n   可以通过std::thread or pthread创建两个线程，运行如下代码：\n   0:全部\n   1:小核\n   2:大核\n```\n   void thread_1()\n   {\n      ncnn::set_cpu_powersave(2); // bind to big cores\n      netA.opt.num_threads = 2;\n   }\n\n   void thread_2()\n   {\n      ncnn::set_cpu_powersave(1); // bind to little cores\n      netB.opt.num_threads = 4;\n   }\n```\n\n   [openmp-best-practice.zh.md](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/openmp-best-practice.zh.md)\n\n- ## 查看 CPU 或 GPU 数量\n   get_cpu_count\n   \n   get_gpu_count\n\n- ## ncnnoptimize\n\n   使用方式一：\n    - ./ncnnoptimize ncnn.param ncnn.bin new.param new.bin flag\n    <br/>注意这里的flag指的是fp32和fp16，其中0指的是fp32，1指的是fp16\n\n   使用方式二：\n    - ./ncnnoptimize ncnn.param ncnn.bin new.param new.bin flag cutstartname cutendname\n    <br/>cutstartname：模型截取的起点\n     <br/>cutendname：模型截取的终点\n\n\n- ## 如何使用量化工具？\n\n   [Post Training Quantization Tools](https://github.com/Tencent/ncnn/tree/master/tools/quantize)\n\n- ## 如何设置线程数？\n\n   opt.num_threads\n\n- ## 如何降低CPU占用率？\n\n   net.opt.openmp_blocktime = 0;\n   \n   
OMP_WAIT_POLICY=passive\n\n- ## 如何 batch inference？\n\n```\n   int max_batch_size = vkdev->info.compute_queue_count;\n   \n   ncnn::Mat inputs[1000];\n   ncnn::Mat outputs[1000];\n   \n   #pragma omp parallel for num_threads(max_batch_size)\n   for (int i=0; i<1000; i++)\n   {\n       ncnn::Extractor ex = net1.create_extractor();\n       ex.input(\"data\", inputs[i]);\n       ex.extract(\"prob\", outputs[i]);\n   }\n```\n\n   \n\n- ## partial graph inference\n\n   先 extract 分类，判断后，再 extract bbox\n\n- ## 如何启用 bf16s 加速？\n\n```\nnet.opt.use_packing_layout = true;\nnet.opt.use_bf16_storage = true;\n```\n\n   [用bf16加速ncnn](https://zhuanlan.zhihu.com/p/112564372) **@[nihui](https://www.zhihu.com/people/nihui-2)**\n\n   A53\n\n- ## 如何裁剪更小的 ncnn 库？\n\n   [build-minimal-library](https://github.com/Tencent/ncnn/wiki/build-minimal-library)\n\n- ## net.opt sgemm winograd fp16_storage 各是有什么作用？\n\n   对内存消耗的影响\n\n- ## 如何解决显卡进入节能模式造成的一系列问题？\n\n   nVidia显卡（Intel和AMD估计也有）会在它认为的所谓空闲模式下，自动进入 `节能模式`，显存和核心频率就都会降低。\n   \n   简单来说就是如果你的计算任务是 `非连续的`，那么可能会让耗时看起来非常 `不均匀`，当期间有运算空闲间隔发生，显卡进入节能模式，则会在下一次冷启动时发生计算耗时远超正常耗时几倍的情况，如下日志所示：\n\n   ```cpp\n   //开始播放\n   Total: 162ms, Diff: 0ms, GLTex2Mat: 7ms, calc: 152ms, Mat2GLTex: 3ms\n   Total: 43ms, Diff: 0ms, GLTex2Mat: 3ms, calc: 35ms, Mat2GLTex: 2ms\n   Total: 45ms, Diff: 0ms, GLTex2Mat: 3ms, calc: 37ms, Mat2GLTex: 3ms\n   Total: 40ms, Diff: 0ms, GLTex2Mat: 3ms, calc: 32ms, Mat2GLTex: 4ms\n   //暂停3秒\n   //继续播放\n   Total: 190ms, Diff: 0ms, GLTex2Mat: 9ms, calc: 177ms, Mat2GLTex: 3ms\n   Total: 134ms, Diff: 0ms, GLTex2Mat: 5ms, calc: 110ms, Mat2GLTex: 18ms\n   Total: 40ms, Diff: 0ms, GLTex2Mat: 3ms, calc: 34ms, Mat2GLTex: 2ms\n   Total: 42ms, Diff: 0ms, GLTex2Mat: 3ms, calc: 36ms, Mat2GLTex: 2ms\n   Total: 47ms, Diff: 0ms, GLTex2Mat: 5ms, calc: 38ms, Mat2GLTex: 3ms\n   ...\n   ```\n\n   在对时间不敏感的项目上，这个问题没什么大不了的，完全可以忽略，但是有些业务场景上必须精准推估下一帧及其未来几帧的从上传、计算到渲染的耗时情况，则这种现象将会给开发者打开些许困扰。\n\n   ### 3种解决方法\n   * 联系显卡厂商，让其更新驱动将你的应用加入到免节能模式的白名单。\n     * 
优点：你什么都不用改。缺点：沟通困难，很可能显卡厂商根本不理你。\n   * [显卡控制面板] - [管理3D设置] - [电源管理模式]，改成：[最高性能优先]。\n     * 优点：不用改代码。缺点：如果是部署端是小白用户，需要编写手册手把手教他。\n   * 可以空闲（暂停）时定期灌一些心跳计算包的任务进去（放1x1小图）让GPU维持在高性能状态。\n     * 优点：需要改代码。缺点：不低碳不环保。\n\n# 白嫖项目\n\n- ## nanodet\n\n# 其他\n\n- ## up主用的什么系统/编辑器/开发环境？\n\n   | 软件类型     |   软件名称  |\n   | ------------| ----------- |\n   | 系统        | Fedora       |\n   | 桌面环境     | KDE         |\n   | 编辑器       | Kate        |\n   | 画草图       | kolourpaint |\n   | 画函数图像   | kmplot      |\n   | bilibili直播 |  OBS         |\n"
  },
  {
    "path": "docs/how-to-build/build-mlir2ncnn.md",
    "content": "# mlir2ncnn\n\n## Compile\n\n**Clone LLVM**\n```bash\nhttps://github.com/llvm/llvm-project.git\ngit checkout -b mlir <a_working_commit_id>\n```\nCurrent working commit id is 74e6030bcbcc8e628f9a99a424342a0c656456f9:\n```bash\n$ git log\n\ncommit 74e6030bcbcc8e628f9a99a424342a0c656456f9 (HEAD -> main, origin/main, origin/HEAD)\nAuthor: Craig Topper <craig.topper@sifive.com>\nDate:   Thu Mar 4 22:30:38 2021 -0800\n\n    [TargetLowering] Use HandleSDNodes to prevent nodes from being deleted by recursive calls in getNegatedExpression.\n```\n\nIt is determined by query lastest git commit date of `tools/mlir` directory.\n\n\n**Compile mlir**\n```bash\ncd llvm-project\nmkdir build\ncd build\ncmake -G Ninja -DCMAKE_INSTALL_PREFIX=install -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -DLLVM_ENABLE_PROJECTS=\"mlir\" -DLLVM_TARGETS_TO_BUILD=\"\" -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_INCLUDE_TESTS=OFF ../llvm/\nninja -j8\nninja install\n```\n\n**Compile mlir2ncnn**\n```bash\ncd tools/mlir\nmkdir build\ncd build\ncmake .. -D LLVM_DIR=<path/to/your/llvm_install/lib/cmake/llvm>\nmake\n```\n\n## Usage\n\n**Export `.mlir`**\n\nSee https://zhuanlan.zhihu.com/p/152535430\n\n\n**Usage mlir2ncnn**\n\n```bash\n./mlir2ncnn pix2pix.mlir pix2pix.param pix2pix.bin\n```\n"
  },
  {
    "path": "docs/how-to-build/how-to-build.md",
    "content": "### Git clone ncnn repo with submodule\n\n```\ngit clone https://github.com/Tencent/ncnn.git\ncd ncnn\ngit submodule update --init\n```\n\n- [Git clone ncnn repo with submodule](#git-clone-ncnn-repo-with-submodule)\n- [Build for Linux](#build-for-linux)\n  - [Nvidia Jetson](#nvidia-jetson)\n  - [Raspberry Pi](#raspberry-pi)\n  - [POWER](#power)\n  - [Intel oneAPI](#intel-oneapi)\n  - [Cross compile: Riscv-gnu-toolchain](#cross-compile-riscv-gnu-toolchain)\n  - [Verification](#verification)\n- [Build for Windows x64 using Visual Studio Community 2017](#build-for-windows-x64-using-visual-studio-community-2017)\n- [Build for Windows x64 using MinGW-w64](#build-for-windows-x64-using-mingw-w64)\n- [Build for Windows XP (x86)](#build-for-windows-xp-x86)\n  - [Using MinGW-w64](#using-mingw-w64)\n  - [Using Clang](#using-clang)\n  - [Using Visual Studio (MSVC)](#using-visual-studio-msvc)\n- [Build for macOS](#build-for-macos)\n- [Build for ARM Cortex-A family with cross-compiling](#build-for-arm-cortex-a-family-with-cross-compiling)\n- [Build for Hisilicon platform with cross-compiling](#build-for-hisilicon-platform-with-cross-compiling)\n- [Build for AnyCloud platform with cross-compiling](#build-for-AnyCloud-platform-with-cross-compiling)\n- [Build for Android](#build-for-android)\n- [Build for iOS on macOS with xcode](#build-for-ios-on-macos-with-xcode)\n- [Build for WebAssembly](#build-for-webassembly)\n- [Build for AllWinner D1](#build-for-allwinner-d1)\n- [Build for Loongson 2K1000](#build-for-loongson-2k1000)\n- [Build for Termux on Android](#build-for-termux-on-android)\n- [Build for QNX](#build-for-qnx)\n- [Build for Nintendo 3DS Homebrew Launcher](#build-for-nintendo-3ds-homebrew-launcher)\n- [Build for HarmonyOS with cross-compiling](#build-for-harmonyos-with-cross-compiling)\n- [Build for ESP32 with cross-compiling](#build-for-esp32-with-cross-compiling)\n\n***\n\n### Build for Linux\n\nInstall required build dependencies:\n\n* git\n* g++\n* 
cmake\n* protocol buffer (protobuf) header files and protobuf compiler\n* (optional) LLVM OpenMP header files # If building with Clang, and multithreaded CPU inference is desired\n* (optional) opencv  # For building examples\n\nGenerally, if you have an Intel, AMD, or Nvidia GPU from the last 10 years, Vulkan can be used easily.\n\nOn some systems there are no Vulkan drivers easily available at the moment (October 2020), so you might need to disable use of Vulkan on them. This applies to Raspberry Pi 3 (an experimental open source Vulkan driver is in the works, but it is not ready yet). Nvidia Tegra series devices (like Nvidia Jetson) should support Vulkan. Ensure you have the most recent software installed for the best experience.\n\nOn Debian, Ubuntu, or Raspberry Pi OS, you can install all required dependencies using:\n```shell\nsudo apt install build-essential git cmake libprotobuf-dev protobuf-compiler libomp-dev libopencv-dev\n```\nOn Red Hat or CentOS, you can install all required dependencies using:\n```shell\nsudo yum install gcc gcc-c++ make git cmake protobuf-devel protobuf-compiler opencv-devel\n```\n\nTo use Vulkan after building ncnn later, you will also need a Vulkan driver for your GPU. For AMD and Intel GPUs these can be found in the Mesa graphics driver, which is usually installed by default on all distros (e.g. `sudo apt install mesa-vulkan-drivers` on Debian/Ubuntu). For Nvidia GPUs the proprietary Nvidia driver must be downloaded and installed (some distros offer an easier way to install it). After installing the Vulkan driver, confirm that the Vulkan libraries and driver are working by running `vulkaninfo` or `vulkaninfo | grep deviceType`; it should list the GPU device type. 
If there are more than one GPU installed (including the case of integrated GPU and discrete GPU, commonly found in laptops), you might need to note the order of devices to use later on.\n\n#### Nvidia Jetson\n\nThe Vulkan driver is a default component of the Linux For Tegra BSP release, check [the device list](https://developer.nvidia.com/embedded/vulkan).\n\n```shell\ncd ncnn\nmkdir -p build\ncd build\ncmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../toolchains/jetson.toolchain.cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON ..\nmake -j$(nproc)\n```\n\n#### Raspberry Pi\n\nVulkan drivers do exists, but are not mature. You are free to experiment at your own discretion, and report results and performance.\n\n```shell\ncd ncnn\nmkdir -p build\ncd build\ncmake -DCMAKE_BUILD_TYPE=Release -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON ..\nmake -j$(nproc)\n```\n\nYou can add `-GNinja` to `cmake` above to use Ninja build system (invoke build using `ninja` or `cmake --build .`).\n\nFor Raspberry Pi 3 on 32bit OS, add `-DCMAKE_TOOLCHAIN_FILE=../toolchains/pi3.toolchain.cmake` to cmake. You can also consider disabling Vulkan support as the Vulkan drivers for Raspberry Pi are still not mature, but it doesn't hurt to build the support in, but not use it.\n\n#### POWER\n\nFor POWER9 with Clang:\n\n```shell\ncd ncnn\nmkdir -p build\ncd build\ncmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../toolchains/power9le-linux-gnu-vsx.clang.toolchain.cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON ..\nmake -j$(nproc)\n```\n\nTo use GCC instead, use the `power9le-linux-gnu-vsx.toolchain.cmake` toolchain file instead. Note that according to benchmarks, Clang appears to produce noticeably faster CPU inference than GCC for POWER9 targets. 
For fastest inference, use Clang 18 or higher; earlier versions of Clang may have impaired inference speed due to [Bug 49864](https://github.com/llvm/llvm-project/issues/49864) and [Bug 64664](https://github.com/llvm/llvm-project/issues/64664).\n\nFor POWER8 instead of POWER9, use the `power8le-linux-gnu-vsx.clang.toolchain.cmake` or `power8le-linux-gnu-vsx.toolchain.cmake` toolchain file. POWER8 will be slower than POWER9.\n\nNote that the POWER toolchain files only support little-endian mode.\n\n#### Intel oneAPI\n\nBesides the prerequisites in this section, Intel oneAPI BaseKit and HPCKit should be installed. They are freely available from https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html and https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html.\n\nIntel oneAPI offers two kinds of compilers, the classic `icc/icpc` and the LLVM-based `icx/icpx`. To build with these compilers, add `CC=icc CXX=icpc` or `CC=icx CXX=icpx` before the `cmake` command. When compiling with `icc/icpc`, cmake will warn that the `xop`, `avx512`, and `bf16` extensions are not supported by the compiler, while `icx/icpx` works well.\n\nBoth of these compilers have been tested and passed the ncnn benchmark successfully. The results have been included in the ncnn benchmark readme. 
Generally, `icx/icpx` are likely to show better performance than `icc/icpc`, and quantized models can benefit from the extensions that `icx/icpx` supports.\n\n#### Cross compile: Riscv-gnu-toolchain\nBefore compiling the whole project, the toolchain must be installed.\n[Reference: Riscv-gnu-toolchain build guide](https://github.com/riscv-collab/riscv-gnu-toolchain/blob/master/README.md)\n```shell\n\n# configure with vector extension.\n./configure --prefix=/opt/riscv --enable-multilib --with-arch=rv64gcv\n\n# configure without vector extension.\n./configure --prefix=/opt/riscv --enable-multilib --with-arch=rv64gc\n\n# it takes quite a long time:(\nsudo make linux\n\n```\nNow you can build the project:\n```shell\nmkdir build-riscv\ncd build-riscv\ncmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv64-unknown-linux-gnu.toolchain.cmake -DNCNN_BUILD_EXAMPLES=ON ..\nmake -j$(nproc) # or `make -j2` if your cpu isn't powerful enough.\n```\n\n#### Verification\n\nVerify the build by running some examples:\n\n```shell\ncd ../examples\n../build/examples/squeezenet ../images/256-ncnn.png\n[0 AMD RADV FIJI (LLVM 10.0.1)]  queueC=1[4]  queueG=0[1]  queueT=0[1]\n[0 AMD RADV FIJI (LLVM 10.0.1)]  bugsbn1=0  buglbia=0  bugcopc=0  bugihfa=0\n[0 AMD RADV FIJI (LLVM 10.0.1)]  fp16p=1  fp16s=1  fp16a=0  int8s=1  int8a=1\n532 = 0.163452\n920 = 0.093140\n716 = 0.061584\n```\n\nYou can also run benchmarks (the 4th argument is the GPU device index to use; refer to `vulkaninfo` if you have more than one GPU):\n\n```shell\ncd ../benchmark\n../build/benchmark/benchncnn 10 $(nproc) 0 0\n[0 AMD RADV FIJI (LLVM 10.0.1)]  queueC=1[4]  queueG=0[1]  queueT=0[1]\n[0 AMD RADV FIJI (LLVM 10.0.1)]  bugsbn1=0  buglbia=0  bugcopc=0  bugihfa=0\n[0 AMD RADV FIJI (LLVM 10.0.1)]  fp16p=1  fp16s=1  fp16a=0  int8s=1  int8a=1\nnum_threads = 4\npowersave = 0\ngpu_device = 0\ncooling_down = 1\n          squeezenet  min =    4.68  max =    4.99  avg =    4.85\n     squeezenet_int8  min =   38.52  max 
=   66.90  avg =   48.52\n...\n```\n\nTo run benchmarks on a CPU, set the 4th argument (the GPU device index) to `-1`.\n\n\n***\n\n### Build for Windows x64 using Visual Studio Community 2017\n\nDownload and install Visual Studio Community 2017 from https://visualstudio.microsoft.com/vs/community/\n\nStart the command prompt: `Start → Programs → Visual Studio 2017 → Visual Studio Tools → x64 Native Tools Command Prompt for VS 2017`\n\n> You can also search `x64 Native Tools Command Prompt for VS 2017` directly.\n\nDownload protobuf-3.11.2 from https://github.com/google/protobuf/archive/v3.11.2.zip\n\nBuild protobuf library:\n\n```shell\ncd <protobuf-root-dir>\nmkdir protobuf_build\ncd protobuf_build\ncmake -A x64 -DCMAKE_INSTALL_PREFIX=%cd%/install -Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_MSVC_STATIC_RUNTIME=OFF ../cmake\ncmake --build . --config Release -j 2\ncmake --build . --config Release --target install\n```\n\nBuild ncnn library (replace `<protobuf-root-dir>` with a proper path):\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p protobuf_build\ncd protobuf_build\ncmake -A x64 -DCMAKE_INSTALL_PREFIX=%cd%/install -Dprotobuf_DIR=<protobuf-root-dir>/protobuf_build/install/cmake -DNCNN_VULKAN=ON ..\ncmake --build . --config Release -j 2\ncmake --build . --config Release --target install\n```\n\nNote: To speed up compilation on multi-core machines, configuring `cmake` to use `jom` or `ninja` via the `-G` flag is recommended.\n\nNote: For protobuf >=22.0 (taking v25.3 as an example):\n\nBuild zlib:\n```shell\ngit clone -b v1.3.1 https://github.com/madler/zlib.git\ncd zlib\nmkdir build\ncd build\ncmake -A x64 -DCMAKE_INSTALL_PREFIX=%cd%/install ..\ncmake --build . --config Release -j 2\ncmake --build . 
--config Release --target install\n```\n\nBuild protobuf library (replace `<zlib-root-dir>` with a proper path):\n```shell\ngit clone -b v25.3 https://github.com/protocolbuffers/protobuf.git\ncd protobuf\ngit submodule update --init --recursive\n\nmkdir protobuf_build\ncd protobuf_build\ncmake -A x64 -DCMAKE_INSTALL_PREFIX=%cd%/install -DCMAKE_CXX_STANDARD=14 -Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_MSVC_STATIC_RUNTIME=OFF -DZLIB_INCLUDE_DIR=<zlib-root-dir>\\build\\install\\include -DZLIB_LIBRARY=<zlib-root-dir>\\build\\install\\lib\\zlib.lib -DABSL_PROPAGATE_CXX_STD=ON ../cmake\ncmake --build . --config Release -j 2\ncmake --build . --config Release --target install\n```\n\nBuild ncnn library (replace `<zlib-root-dir>` and `<protobuf-root-dir>` with a proper path):\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build\ncd build\ncmake -A x64 -DCMAKE_INSTALL_PREFIX=%cd%/install -DCMAKE_PREFIX_PATH=<protobuf-root-dir>/protobuf_build\\install\\cmake -DZLIB_INCLUDE_DIR=<zlib-root-dir>\\build\\install\\include -DZLIB_LIBRARY=<zlib-root-dir>\\build\\install\\lib\\zlib.lib -Dabsl_DIR=<protobuf-root-dir>/protobuf_build\\install\\lib\\cmake\\absl -Dutf8_range_DIR=<protobuf-root-dir>/protobuf_build\\install\\lib\\cmake\\utf8_range -DNCNN_VULKAN=ON ..\ncmake --build . --config Release -j 2\ncmake --build . --config Release --target install\n```\n\n***\n\n### Build for Windows x64 using MinGW-w64\n\nDownload MinGW-w64 toolchain from [winlibs](https://winlibs.com/) or [w64devkit](https://github.com/skeeto/w64devkit), add `bin` folder to environment variables.\n\nBuild ncnn library:\n\n```shell\ncd <ncnn-root-dir>\nmkdir build\ncd build\ncmake -DNCNN_VULKAN=ON -G \"MinGW Makefiles\" ..\ncmake --build . --config Release -j 4\ncmake --build . 
--config Release --target install\n```\n\n***\n\n### Build for Windows XP (x86)\n\n> **Note:** Windows XP support is provided through collaborative contributions from [@Sugar-Baby](https://github.com/Sugar-Baby) and [@AtomAlpaca](https://github.com/AtomAlpaca).\n\n#### Using MinGW-w64\n\nDownload mingw toolchain targeting 32 bit from [sourceforge](https://jaist.dl.sourceforge.net/project/mingw-w64/Toolchains%20targetting%20Win32/Personal%20Builds/mingw-builds/8.1.0/threads-posix/dwarf/i686-8.1.0-release-posix-dwarf-rt_v6-rev0.7z), extract and add environment variable named `MINGW32_ROOT_PATH` valued by `<your-path-to-mingw-root-path>`, and add `<your-path-to-mingw-root-path>/bin` to `PATH`.\n\n```shell\nmkdir build\ncd build\ncmake -DCMAKE_TOOLCHAIN_FILE=\"../toolchains/windows-xp-mingw.toolchain.cmake\" -DNCNN_WINXP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_AVX=OFF .. -G \"MinGW Makefiles\"\ncmake --build . --config Release -j 4\ncmake --build . --config Release --target install\n```\n\n#### Using Clang\n\nClang requires libraries from mingw. Configure mingw toolchain targeting 32-bit as described in the [MinGW-w64 section](#using-mingw-w64).\n\nInstall Clang 6.0 or later.\n\n```shell\nmkdir build\ncd build\ncmake -DCMAKE_TOOLCHAIN_FILE=\"../toolchains/windows-xp-clang.toolchain.cmake\" -DNCNN_WINXP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_AVX=OFF .. -G \"MinGW Makefiles\"\ncmake --build . --config Release -j 4\ncmake --build . --config Release --target install\n```\n\n#### Using Visual Studio (MSVC)\n\nInstall v141_xp toolset for Windows XP:\n\n1. Bring up the Visual Studio installer (Tools → Get Tools and Features)\n2. Select Desktop development with C++\n3. Select Windows XP support for C++ from the Summary section\n4. 
Click Modify\n\n```shell\nmkdir build\ncd build\ncmake -A WIN32 -G \"Visual Studio 17 2022\" -T v141_xp -DNCNN_WINXP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_OPENMP=OFF -DNCNN_AVX=OFF -DNCNN_BUILD_WITH_STATIC_CRT=ON -DCMAKE_TOOLCHAIN_FILE=\"../toolchains/windows-xp-msvc.toolchain.cmake\" ..\ncmake --build . --config Release -j 4\ncmake --build . --config Release --target install\n```\n\n**Note:** The MSVC toolchain uses the `v141_xp` platform toolset for Windows XP compatibility. Vulkan is disabled for XP compatibility, and advanced CPU features (AVX, AVX2, AVX512) are disabled to ensure compatibility with older processors.\n\n***\n\n### Build for macOS\n\nWe've published ncnn to [brew](https://formulae.brew.sh/formula/ncnn#default) now, you can just use following method to install ncnn if you have the Xcode Command Line Tools installed.\n\n```shell\nbrew update\nbrew install ncnn\n```\n\nOr if you want to compile and build ncnn locally, first install Xcode or Xcode Command Line Tools according to your needs.\n\nThen install `protobuf` and `libomp` via homebrew\n\n```shell\nbrew install protobuf libomp\n```\n\nDownload and install Vulkan SDK from <https://vulkan.lunarg.com/sdk/home>\n\n\n```shell\nwget https://sdk.lunarg.com/sdk/download/1.3.280.1/mac/vulkansdk-macos-1.3.280.1.dmg -O vulkansdk-macos-1.3.280.1.dmg\nhdiutil attach vulkansdk-macos-1.3.280.1.dmg\nsudo /Volumes/vulkansdk-macos-1.3.280.1/InstallVulkan.app/Contents/MacOS/InstallVulkan --root `pwd`/vulkansdk-macos-1.3.280.1 --accept-licenses --default-answer --confirm-command install\nhdiutil detach /Volumes/vulkansdk-macos-1.3.280.1\n\n# setup env\nexport VULKAN_SDK=`pwd`/vulkansdk-macos-1.3.280.1/macOS\n```\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build\ncd build\n\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake -DPLATFORM=MAC -DARCHS=\"x86_64;arm64\" \\\n    -DVulkan_LIBRARY=`pwd`/../vulkansdk-macos-1.3.280.1/macOS/lib/libMoltenVK.dylib \\\n    -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON ..\n\ncmake 
--build . -j 4\ncmake --build . --target install\n```\n\n*Note: If you encounter `libomp` related errors during installation, you can also check our GitHub Actions at [here](https://github.com/Tencent/ncnn/blob/d91cccf/.github/workflows/macos-x64-gpu.yml#L50-L68) to install and use `openmp`.*\n***\n\n### Build for ARM Cortex-A family with cross-compiling\nDownload ARM toolchain from https://developer.arm.com/open-source/gnu-toolchain/gnu-a/downloads\n\n```shell\nexport PATH=\"<your-toolchain-compiler-path>:${PATH}\"\n```\n\nAlternatively install a cross-compiler provided by the distribution (i.e. on Debian / Ubuntu, you can do `sudo apt install g++-arm-linux-gnueabi g++-arm-linux-gnueabihf g++-aarch64-linux-gnu`).\n\nDepending on your needs build one or more of the below targets.\n\nAArch32 target with soft float (arm-linux-gnueabi)\n```shell\ncd <ncnn-root-dir>\nmkdir -p build-arm-linux-gnueabi\ncd build-arm-linux-gnueabi\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabi.toolchain.cmake ..\nmake -j$(nproc)\nmake install\n```\n\nAArch32 target with hard float (arm-linux-gnueabihf)\n```shell\ncd <ncnn-root-dir>\nmkdir -p build-arm-linux-gnueabihf\ncd build-arm-linux-gnueabihf\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf.toolchain.cmake ..\nmake -j$(nproc)\nmake install\n```\n\nAArch64 GNU/Linux target (aarch64-linux-gnu)\n```shell\ncd <ncnn-root-dir>\nmkdir -p build-aarch64-linux-gnu\ncd build-aarch64-linux-gnu\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake ..\nmake -j$(nproc)\nmake install\n```\n\n***\n\n### Build for Hisilicon platform with cross-compiling\nDownload and install Hisilicon SDK. 
The toolchain should be in `/opt/hisi-linux/x86-arm` \nnew version of Hisilicon toolchain should be in `/opt/linux/x86-arm/` \n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build\ncd build\n\n# Choose one cmake toolchain file depends on your target platform\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/hisiv300.toolchain.cmake ..\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/hisiv500.toolchain.cmake ..\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/himix100.toolchain.cmake ..\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/himix200.toolchain.cmake ..\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/himix210.toolchain.cmake ..\n\nmake -j$(nproc)\nmake install\n```\n\n***\n\n### Build for AnyCloud platform with cross-compiling\nDownload and install AnyCloud SDK. And load env to set toolchain can access in shell\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build\ncd build\n\n# Choose one cmake toolchain file depends on your target platform\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/anykav500.toolchain.cmake ..\n\nmake -j$(nproc)\nmake install\n```\n\n***\n\n### Build for Android\nYou can use the pre-build ncnn-android-lib.zip from https://github.com/Tencent/ncnn/releases\n\nDownload Android NDK from http://developer.android.com/ndk/downloads/index.html and install it, for example:\n\n```shell\nunzip android-ndk-r21d-linux-x86_64.zip\nexport ANDROID_NDK=<your-ndk-root-path>\n```\n\n(optional) remove the hardcoded debug flag in Android NDK [android-ndk issue](https://github.com/android-ndk/ndk/issues/243)\n```\n# open $ANDROID_NDK/build/cmake/android.toolchain.cmake for ndk < r23\n# or $ANDROID_NDK/build/cmake/android-legacy.toolchain.cmake for ndk >= r23\n# delete \"-g\" line\nlist(APPEND ANDROID_COMPILER_FLAGS\n  -g\n  -DANDROID\n```\n\nBuild armv7 library\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build-android-armv7\ncd build-android-armv7\n\ncmake -DCMAKE_TOOLCHAIN_FILE=\"$ANDROID_NDK/build/cmake/android.toolchain.cmake\" \\\n    -DANDROID_ABI=\"armeabi-v7a\" -DANDROID_ARM_NEON=ON \\\n    
-DANDROID_PLATFORM=android-14 -DNCNN_VULKAN=ON ..\n\n# If you use cmake >= 3.21 and ndk-r23\n# you need to add -DANDROID_USE_LEGACY_TOOLCHAIN_FILE=False option for working optimization flags\n\nmake -j$(nproc)\nmake install\n```\n\nPick `build-android-armv7/install` folder for further JNI usage.\n\n\nBuild aarch64 library:\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build-android-aarch64\ncd build-android-aarch64\n\ncmake -DCMAKE_TOOLCHAIN_FILE=\"$ANDROID_NDK/build/cmake/android.toolchain.cmake\"\\\n    -DANDROID_ABI=\"arm64-v8a\" \\\n    -DANDROID_PLATFORM=android-21 -DNCNN_VULKAN=ON ..\n\n# If you use cmake >= 3.21 and ndk-r23\n# you need to add -DANDROID_USE_LEGACY_TOOLCHAIN_FILE=False option for working optimization flags\n\nmake -j$(nproc)\nmake install\n```\n\nPick `build-android-aarch64/install` folder for further JNI usage.\n\n***\n\n### Build for iOS on macOS with xcode\nYou can use the pre-build ncnn.framework glslang.framework and openmp.framework from https://github.com/Tencent/ncnn/releases\n\nInstall xcode\n\nYou can replace ```-DENABLE_BITCODE=0``` to ```-DENABLE_BITCODE=1``` in the following cmake arguments if you want to build bitcode enabled libraries.\n\nDownload and install openmp for multithreading inference feature on iPhoneOS\n```shell\nwget https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/openmp-11.0.0.src.tar.xz\ntar -xf openmp-11.0.0.src.tar.xz\ncd openmp-11.0.0.src\n\n# apply some compilation fix\nsed -i'' -e '/.size __kmp_unnamed_critical_addr/d' runtime/src/z_Linux_asm.S\nsed -i'' -e 's/__kmp_unnamed_critical_addr/___kmp_unnamed_critical_addr/g' runtime/src/z_Linux_asm.S\n\nmkdir -p build-ios\ncd build-ios\n\ncmake -DCMAKE_TOOLCHAIN_FILE=<ncnn-root-dir>/toolchains/ios.toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install \\\n    -DPLATFORM=OS64 -DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 -DARCHS=\"arm64;arm64e\" \\\n    -DPERL_EXECUTABLE=/usr/local/bin/perl \\\n    
-DLIBOMP_ENABLE_SHARED=OFF -DLIBOMP_OMPT_SUPPORT=OFF -DLIBOMP_USE_HWLOC=OFF ..\n\ncmake --build . -j 4\ncmake --build . --target install\n\n# copy openmp library and header files to xcode toolchain sysroot\n# <xcode-dir> is usually /Applications/Xcode.app or /Applications/Xcode-beta.app depends on your Xcode version\nsudo cp install/include/* <xcode-dir>/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/include\nsudo cp install/lib/libomp.a <xcode-dir>/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/lib\n```\n\nDownload and install openmp for multithreading inference feature on iPhoneSimulator\n```shell\nwget https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/openmp-11.0.0.src.tar.xz\ntar -xf openmp-11.0.0.src.tar.xz\ncd openmp-11.0.0.src\n\n# apply some compilation fix\nsed -i'' -e '/.size __kmp_unnamed_critical_addr/d' runtime/src/z_Linux_asm.S\nsed -i'' -e 's/__kmp_unnamed_critical_addr/___kmp_unnamed_critical_addr/g' runtime/src/z_Linux_asm.S\n\nmkdir -p build-ios-sim\ncd build-ios-sim\n\ncmake -DCMAKE_TOOLCHAIN_FILE=<ncnn-root-dir>/toolchains/ios.toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install \\\n    -DPLATFORM=SIMULATORARM64 -DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 -DARCHS=\"x86_64;arm64\" \\\n    -DPERL_EXECUTABLE=/usr/local/bin/perl \\\n    -DLIBOMP_ENABLE_SHARED=OFF -DLIBOMP_OMPT_SUPPORT=OFF -DLIBOMP_USE_HWLOC=OFF ..\n\ncmake --build . -j 4\ncmake --build . 
--target install\n\n# copy openmp library and header files to xcode toolchain sysroot\n# <xcode-dir> is usually /Applications/Xcode.app or /Applications/Xcode-beta.app depends on your Xcode version\nsudo cp install/include/* <xcode-dir>/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/include\nsudo cp install/lib/libomp.a <xcode-dir>/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/lib\n```\n\nPackage openmp framework:\n```shell\ncd <openmp-root-dir>\n\nmkdir -p openmp.framework/Versions/A/Headers\nmkdir -p openmp.framework/Versions/A/Resources\nln -s A openmp.framework/Versions/Current\nln -s Versions/Current/Headers openmp.framework/Headers\nln -s Versions/Current/Resources openmp.framework/Resources\nln -s Versions/Current/openmp openmp.framework/openmp\nlipo -create build-ios/install/lib/libomp.a build-ios-sim/install/lib/libomp.a -o openmp.framework/Versions/A/openmp\ncp -r build-ios/install/include/* openmp.framework/Versions/A/Headers/\nsed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/11.0/g' <ncnn-root-dir>/Info.plist > openmp.framework/Versions/A/Resources/Info.plist\n```\n\nDownload and install Vulkan SDK from https://vulkan.lunarg.com/sdk/home\n```shell\nwget https://sdk.lunarg.com/sdk/download/1.2.189.0/mac/vulkansdk-macos-1.2.189.0.dmg?Human=true -O vulkansdk-macos-1.2.189.0.dmg\nhdiutil attach vulkansdk-macos-1.2.189.0.dmg\nsudo /Volumes/vulkansdk-macos-1.2.189.0/InstallVulkan.app/Contents/MacOS/InstallVulkan --root `pwd`/vulkansdk-macos-1.2.189.0 --accept-licenses --default-answer --confirm-command install\nhdiutil detach /Volumes/vulkansdk-macos-1.2.189.0\n\n# setup env\nexport VULKAN_SDK=`pwd`/vulkansdk-macos-1.2.189.0/macOS\n```\n\nBuild library for iPhoneOS:\n\n```shell\ncd <ncnn-root-dir>\ngit submodule update --init\nmkdir -p build-ios\ncd build-ios\n\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake 
-DPLATFORM=OS64 -DARCHS=\"arm64;arm64e\" \\\n    -DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 \\\n    -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n    -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n    -DOpenMP_libomp_LIBRARY=\"/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/lib/libomp.a\" \\\n    -DNCNN_VULKAN=ON -DNCNN_BUILD_BENCHMARK=OFF ..\n\ncmake --build . -j 4\ncmake --build . --target install\n```\n\nBuild library for iPhoneSimulator:\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build-ios-sim\ncd build-ios-sim\n\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake -DPLATFORM=SIMULATORARM64 -DARCHS=\"x86_64;arm64\" \\\n    -DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 \\\n    -DOpenMP_C_FLAGS=\"-Xclang -fopenmp\" -DOpenMP_CXX_FLAGS=\"-Xclang -fopenmp\" \\\n    -DOpenMP_C_LIB_NAMES=\"libomp\" -DOpenMP_CXX_LIB_NAMES=\"libomp\" \\\n    -DOpenMP_libomp_LIBRARY=\"/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/lib/libomp.a\" \\\n    -DNCNN_BUILD_BENCHMARK=OFF ..\n\ncmake --build . -j 4\ncmake --build . 
--target install\n```\n\nPackage glslang framework for iPhoneOS:\n```shell\ncd <ncnn-root-dir>\n\nmkdir -p glslang.framework/Versions/A/Headers\nmkdir -p glslang.framework/Versions/A/Resources\nln -s A glslang.framework/Versions/Current\nln -s Versions/Current/Headers glslang.framework/Headers\nln -s Versions/Current/Resources glslang.framework/Resources\nln -s Versions/Current/glslang glslang.framework/glslang\nlibtool -static build-ios/install/lib/libglslang.a build-ios/install/lib/libSPIRV.a -o build-ios/install/lib/libglslang_combined.a\nlipo -create build-ios/install/lib/libglslang_combined.a -o glslang.framework/Versions/A/glslang\ncp -r build/install/include/glslang glslang.framework/Versions/A/Headers/\nsed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist\n```\n\nPackage ncnn framework for iPhoneOS:\n```shell\ncd <ncnn-root-dir>\n\nmkdir -p ncnn.framework/Versions/A/Headers\nmkdir -p ncnn.framework/Versions/A/Resources\nln -s A ncnn.framework/Versions/Current\nln -s Versions/Current/Headers ncnn.framework/Headers\nln -s Versions/Current/Resources ncnn.framework/Resources\nln -s Versions/Current/ncnn ncnn.framework/ncnn\nlipo -create build-ios/install/lib/libncnn.a -o ncnn.framework/Versions/A/ncnn\ncp -r build-ios/install/include/* ncnn.framework/Versions/A/Headers/\nsed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist\n```\n\nPick `ncnn.framework` `glslang.framework` and `openmp.framework` folder for app development.\n\n***\n\n### Build for WebAssembly\n\nInstall Emscripten\n\n```shell\ngit clone https://github.com/emscripten-core/emsdk.git\ncd emsdk\n./emsdk install 3.1.28\n./emsdk activate 3.1.28\n\nsource emsdk_env.sh\n```\n\nBuild without any extension for general compatibility:\n```shell\nmkdir -p build\ncd build\ncmake 
-DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \\\n    -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n    -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\ncmake --build . -j 4\ncmake --build . --target install\n```\n\nBuild with WASM SIMD extension:\n```shell\nmkdir -p build-simd\ncd build-simd\ncmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \\\n    -DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n    -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\ncmake --build . -j 4\ncmake --build . --target install\n```\n\nBuild with WASM Thread extension:\n```shell\nmkdir -p build-threads\ncd build-threads\ncmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \\\n    -DNCNN_THREADS=ON -DNCNN_OPENMP=ON -DNCNN_SIMPLEOMP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n    -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\ncmake --build . -j 4\ncmake --build . --target install\n```\n\nBuild with WASM SIMD and Thread extension:\n```shell\nmkdir -p build-simd-threads\ncd build-simd-threads\ncmake -DCMAKE_TOOLCHAIN_FILE=$EMSDK/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \\\n    -DNCNN_THREADS=ON -DNCNN_OPENMP=ON -DNCNN_SIMPLEOMP=ON -DNCNN_SIMPLEOCV=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \\\n    -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..\ncmake --build . -j 4\ncmake --build . 
--target install\n```\n\nPick `build-XYZ/install` folder for further usage.\n\n***\n\n### Build for AllWinner D1\n\nDownload c906 toolchain package from https://www.xrvm.cn/community/download?id=4453617141140230144\n\n```shell\ntar -xf Xuantie-900-gcc-linux-6.6.0-glibc-x86_64-V3.1.0-20250522.tar.gz\nexport RISCV_ROOT_PATH=/home/nihui/osd/Xuantie-900-gcc-linux-6.6.0-glibc-x86_64-V3.1.0\n```\n\nBuild ncnn with riscv-v vector and simpleocv enabled:\n```shell\nmkdir -p build-c906\ncd build-c906\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/c906-v310.toolchain.cmake \\\n    -DCMAKE_BUILD_TYPE=release -DNCNN_OPENMP=OFF -DNCNN_THREADS=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_RVV=OFF -DNCNN_XTHEADVECTOR=ON -DNCNN_ZFH=ON -DNCNN_ZVFH=OFF \\\n    -DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON ..\ncmake --build . -j 4\ncmake --build . --target install\n```\n\nPick `build-c906/install` folder for further usage.\n\nYou can upload binary inside `build-c906/examples` folder and run on D1 board for testing.\n\n***\n\n### Build for Loongson 2K1000\n\nFor gcc version < 8.5, you need to fix msa.h header for workaround msa fmadd/fmsub/maddv/msubv bug.\n\nOpen ```/usr/lib/gcc/mips64el-linux-gnuabi64/8/include/msa.h```, find ```__msa_fmadd``` and ```__msa_fmsub``` and apply changes as the following\n```c\n// #define __msa_fmadd_w __builtin_msa_fmadd_w\n// #define __msa_fmadd_d __builtin_msa_fmadd_d\n// #define __msa_fmsub_w __builtin_msa_fmsub_w\n// #define __msa_fmsub_d __builtin_msa_fmsub_d\n#define __msa_fmadd_w(a, b, c) __builtin_msa_fmadd_w(c, b, a)\n#define __msa_fmadd_d(a, b, c) __builtin_msa_fmadd_d(c, b, a)\n#define __msa_fmsub_w(a, b, c) __builtin_msa_fmsub_w(c, b, a)\n#define __msa_fmsub_d(a, b, c) __builtin_msa_fmsub_d(c, b, a)\n```\n\nfind ```__msa_maddv``` and ```__msa_msubv``` and apply changes as the following\n```c\n// #define __msa_maddv_b __builtin_msa_maddv_b\n// #define __msa_maddv_h __builtin_msa_maddv_h\n// #define __msa_maddv_w __builtin_msa_maddv_w\n// #define __msa_maddv_d 
__builtin_msa_maddv_d\n// #define __msa_msubv_b __builtin_msa_msubv_b\n// #define __msa_msubv_h __builtin_msa_msubv_h\n// #define __msa_msubv_w __builtin_msa_msubv_w\n// #define __msa_msubv_d __builtin_msa_msubv_d\n#define __msa_maddv_b(a, b, c) __builtin_msa_maddv_b(c, b, a)\n#define __msa_maddv_h(a, b, c) __builtin_msa_maddv_h(c, b, a)\n#define __msa_maddv_w(a, b, c) __builtin_msa_maddv_w(c, b, a)\n#define __msa_maddv_d(a, b, c) __builtin_msa_maddv_d(c, b, a)\n#define __msa_msubv_b(a, b, c) __builtin_msa_msubv_b(c, b, a)\n#define __msa_msubv_h(a, b, c) __builtin_msa_msubv_h(c, b, a)\n#define __msa_msubv_w(a, b, c) __builtin_msa_msubv_w(c, b, a)\n#define __msa_msubv_d(a, b, c) __builtin_msa_msubv_d(c, b, a)\n```\n\nBuild ncnn with mips msa and simpleocv enabled:\n```shell\nmkdir -p build\ncd build\ncmake -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_MSA=ON -DNCNN_MMI=ON -DNCNN_SIMPLEOCV=ON ..\ncmake --build . -j 2\ncmake --build . --target install\n```\n\nPick `build/install` folder for further usage.\n\nYou can run binary inside `build/examples` folder for testing.\n\n***\n\n### Build for Termux on Android\n\nInstall app Termux on your phone,and install Ubuntu in Termux.\n\n If you want use ssh, just install openssh in Termux\n\n```shell\npkg install proot-distro\nproot-distro install ubuntu\n```\n\nor you can see what system can be installed using `proot-distro list`\n\nwhile you install ubuntu successfully, using `proot-distro login ubuntu` to login Ubuntu.\n\nThen make ncnn,no need to install any other dependencies.\n\n```shell\ngit clone https://github.com/Tencent/ncnn.git\ncd ncnn\ngit submodule update --init\nmkdir -p build\ncd build\ncmake -DCMAKE_BUILD_TYPE=Release -DNCNN_BUILD_EXAMPLES=ON -DNCNN_PLATFORM_API=OFF -DNCNN_SIMPLEOCV=ON ..\nmake -j$(nproc)\n```\n\nThen you can run a test\n\n> on my Pixel 3 XL using Qualcomm 845,cant load `256-ncnn.png`\n\n```shell\ncd ../examples\n../build/examples/squeezenet 
../images/128-ncnn.png\n```\n\n### Build for QNX\n\nRequest a license and download the SDP from QNX Software Center: https://www.qnx.com/products/everywhere/ .\n\nSet up the QNX environment by invoking the SDP's bundled script:\n\non Windows, open cmd and run\n```batch\ncall C:\\Users\\zz\\qnx800\\qnxsdp-env.bat\n```\n\non Linux, use /bin/bash and run\n```shell\nsource /home/zz/qnx800/qnxsdp-env.sh\n```\n\nIf it gives the error `cannot find ld` on Linux, solve it by creating a symbolic link:\n```shell\ncd ${QNX_HOST}/usr/bin/\nln -s aarch64-unknown-nto-qnx7.1.0-ld ld\n```\n\nBuild ncnn with cmake in the same shell:\n\n```shell\ngit clone https://github.com/Tencent/ncnn.git\ncd ncnn\ngit submodule update --init\nmkdir -p build-qnx\ncd build-qnx\ncmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-qnx.toolchain.cmake ..\nmake -j$(nproc)\nmake install\n```\n\nPick `build-qnx/install` folder for further usage.\n\n### Build for Nintendo 3DS Homebrew Launcher\nInstall DevkitPro toolchains\n- If you are working on Windows, download the DevkitPro installer from [DevkitPro](https://devkitpro.org/wiki/Getting_Started).\n- If you are using Ubuntu, the official guidelines from DevkitPro might not work for you. 
Try using the commands below to install\n```shell\nsudo apt-get update\nsudo apt-get upgrade\nwget https://apt.devkitpro.org/install-devkitpro-pacman\nchmod +x ./install-devkitpro-pacman\nsudo ./install-devkitpro-pacman\n```\n\n```shell\nexport DEVKITPRO=/opt/devkitpro\nexport DEVKITARM=/opt/devkitpro/devkitARM\nexport DEVKITPPC=/opt/devkitpro/devkitPPC\nexport PATH=/opt/devkitpro/tools/bin:$PATH\nsource ~/.profile\n```\n```shell\nsudo dkp-pacman -Sy\nsudo dkp-pacman -Syu\nsudo dkp-pacman -S 3ds-dev\n```\nCopy the toolchain files from [3DS-cmake](https://github.com/Xtansia/3ds-cmake) (DevkitArm3DS.cmake and the cmake folder) to NCNN's toolchains folder.\n```\n├── toolchains\n│   ├── cmake\n│   │   ├── bin2s_header.h.in\n│   │   ├── FindCITRO3D.cmake\n│   │   ├── FindCTRULIB.cmake\n│   │   ├── FindFreetype.cmake\n│   │   ├── FindJPEG.cmake\n│   │   ├── FindPNG.cmake\n│   │   ├── FindSF2D.cmake\n│   │   ├── FindSFIL.cmake\n│   │   ├── FindSFTD.cmake\n│   │   ├── FindZLIB.cmake\n│   │   ├── LibFindMacros.cmake\n│   │   ├── Tools3DS.cmake\n│   │   ├── ToolsGBA.cmake\n│   │   └── try_add_imported_target.cmake\n│   ├── DevkitArm3DS.cmake\n...\n\n```\nBuild with:\n```shell\ncd ncnn\nmkdir build && cd build\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/DevkitArm3DS.cmake -DNCNN_SIMPLEOCV=ON -DNCNN_OPENMP=OFF -DNCNN_VFPV4=OFF ..\nmake -j4\nmake install\n```\nModify the Makefile in the Homebrew example to link and use NCNN in your 3DS Homebrew app.\n\n***\n\n### Build for HarmonyOS with cross-compiling\nDownload and install the HarmonyOS SDK. 
The commands below assume the SDK installation directory is `/opt/ohos-sdk/linux`\n\n```shell\ncd <ncnn-root-dir>\nmkdir -p build\ncd build\n\nexport HM_SDK=/opt/ohos-sdk/linux\n\n# Choose HarmonyOS sdk cmake toolchain file.\n# If you want to enable vulkan, set -DNCNN_VULKAN=ON\n# The HarmonyOS sdk does not support openmp, use ncnn simpleomp instead.\n# Use the cmake binary provided by the HarmonyOS SDK; any other cmake won't recognize parameters like OHOS_PLATFORM, leading to compilation errors.\n${HM_SDK}/native/build-tools/cmake/bin/cmake -DOHOS_STL=c++_static -DOHOS_ARCH=arm64-v8a -DOHOS_PLATFORM=OHOS -DCMAKE_TOOLCHAIN_FILE=${HM_SDK}/native/build/cmake/ohos.toolchain.cmake -DNCNN_VULKAN=ON -DNCNN_SIMPLEOMP=ON ..\n\nmake -j$(nproc)\nmake install\n```\n\n***\n\n### Build for ESP32 with cross-compiling\nDownload esp-idf sdk\n```shell\ngit clone https://github.com/espressif/esp-idf\ncd esp-idf\ngit submodule update --init --recursive\n```\nInstall esp-idf sdk and configure the environment\n```shell\n./install.sh\nsource export.sh\n```\nAnd for Windows, you should use:\n```batch\nREM or install.ps1 when using PowerShell\ninstall.bat\nexport.bat\n```\nNote: python>=3.8, cmake>=3.24.0\n\nBuild ncnn library:\n```shell\nmkdir build-esp32\ncd build-esp32\ncmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/esp32.toolchain.cmake -DCMAKE_BUILD_TYPE=Release ..\nmake -j 4\nmake install\n```\nNote: Make sure to compile in the esp-idf environment.\n\nThe compiled ncnn library and headers can then be placed in your esp32 project for testing.\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/FAQ-ncnn-produce-wrong-result.md",
    "content": "### caffemodel should be row-major\n\n`caffe2ncnn` tool assumes the caffemodel is row-major (produced by c++ caffe train command).\n\nThe kernel 3x3 weights should be stored as\n```\na b c\nd e f\ng h i\n```\n\nHowever, matlab caffe produced col-major caffemodel.\n\nYou have to transpose all the kernel weights by yourself or re-training using c++ caffe train command.\n\nBesides, you may interest in https://github.com/conanhujinming/matcaffe2caffe\n\n### check input is RGB or BGR\n\nIf your caffemodel is trained using c++ caffe and opencv, then the input image should be BGR order.\n\nIf your model is trained using matlab caffe or pytorch or mxnet or tensorflow, the input image would probably be RGB order.\n\nThe channel order can be changed on-the-fly through proper pixel type enum\n```\n// construct RGB blob from rgb image\nncnn::Mat in_rgb = ncnn::Mat::from_pixels(rgb_data, ncnn::Mat::PIXEL_RGB, w, h);\n\n// construct BGR blob from bgr image\nncnn::Mat in_bgr = ncnn::Mat::from_pixels(bgr_data, ncnn::Mat::PIXEL_BGR, w, h);\n\n// construct BGR blob from rgb image\nncnn::Mat in_bgr = ncnn::Mat::from_pixels(rgb_data, ncnn::Mat::PIXEL_RGB2BGR, w, h);\n\n// construct RGB blob from bgr image\nncnn::Mat in_rgb = ncnn::Mat::from_pixels(bgr_data, ncnn::Mat::PIXEL_BGR2RGB, w, h);\n```\n\n\n### image decoding\n\nJPEG(`.jpg`,`.jpeg`) is loss compression, people may get different pixel value for same image on same position. \n\n`.bmp` images are recommended instead.\n\n### interpolation / resizing\n\nThere are several image resizing methods, which may generate different result for same input image.\n\nEven we specify same interpolation method, different frameworks/libraries and their various versions may also introduce difference.\n\nA good practice is feed same size image as the input layer expected, e.g. 
read a 224x224 bmp image when the input layer needs 224x224 size.\n\n\n### Mat::from_pixels/from_pixels_resize assume that the pixel data is continuous\n\nYou shall pass a continuous (contiguous in memory) pixel buffer to the from_pixels family.\n\nIf your image is an opencv submat taken from an image roi, call clone() to get a continuous one.\n```\ncv::Mat image;// the image\ncv::Rect facerect;// the face rectangle\n\ncv::Mat faceimage = image(facerect).clone();// get a continuous sub image\n\nncnn::Mat in = ncnn::Mat::from_pixels(faceimage.data, ncnn::Mat::PIXEL_BGR, faceimage.cols, faceimage.rows);\n```\n\n### pre process\nApply pre process according to your training configuration\n\nDifferent models have different pre process configs; you may find the following transform config in the Data layer section\n```\ntransform_param {\n    mean_value: 103.94\n    mean_value: 116.78\n    mean_value: 123.68\n    scale: 0.017\n}\n```\nThen the corresponding code for ncnn pre process is\n```cpp\nconst float mean_vals[3] = { 103.94f, 116.78f, 123.68f };\nconst float norm_vals[3] = { 0.017f, 0.017f, 0.017f };\nin.substract_mean_normalize(mean_vals, norm_vals);\n```\n\nMean file is not supported currently\n\nSo you have to pre process the input data by yourself (use opencv or something)\n```\ntransform_param {\n    mean_file: \"imagenet_mean.binaryproto\"\n}\n```\n\nFor pytorch or mxnet-gluon\n```python\ntransforms.ToTensor(),\ntransforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),\n```\nThen the corresponding code for ncnn pre process is\n```cpp\n// the values below are listed in R, G, B order, so `in` must hold RGB data\n// R' = (R / 255 - 0.485) / 0.229 = (R - 0.485 * 255) / 0.229 / 255\n// G' = (G / 255 - 0.456) / 0.224 = (G - 0.456 * 255) / 0.224 / 255\n// B' = (B / 255 - 0.406) / 0.225 = (B - 0.406 * 255) / 0.225 / 255\nconst float mean_vals[3] = {0.485f*255.f, 0.456f*255.f, 0.406f*255.f};\nconst float norm_vals[3] = {1/0.229f/255.f, 1/0.224f/255.f, 1/0.225f/255.f};\nin.substract_mean_normalize(mean_vals, norm_vals);\n```\n\n### use the desired blob\nThe blob names for input and 
extract are differ among models.\n\nFor example, squeezenet v1.1 use \"data\" as input blob and \"prob\" as output blob while mobilenet-ssd use \"data\" as input blob and \"detection_out\" as output blob.\n\nSome models may need multiple input or produce multiple output.\n\n```cpp\nncnn::Extractor ex = net.create_extractor();\n\nex.input(\"data\", in);// change \"data\" to yours\nex.input(\"mask\", mask);// change \"mask\" to yours\n\nex.extract(\"output1\", out1);// change \"output1\" to yours\nex.extract(\"output2\", out2);// change \"output2\" to yours\n```\n\n### blob may have channel gap\nEach channel pointer is aligned by 128bit in ncnn Mat structure.\n\nblob may have gaps between channels if (width x height) can not divided exactly by 4\n\nPrefer using ncnn::Mat::from_pixels or ncnn::Mat::from_pixels_resize for constructing input blob from image data\n\nIf you do need a continuous blob buffer, reshape the output.\n```cpp\n// out is the output blob extracted\nncnn::Mat flattened_out = out.reshape(out.w * out.h * out.c);\n\n// plain array, C-H-W\nconst float* outptr = flattened_out;\n```\n\n### create new Extractor for each image\nThe `ncnn::Extractor` object is stateful, if you reuse for different input, you will always get exact the same result cached inside.\n\nAlways create new Extractor to process images in loop unless you do know how the stateful Extractor works.\n```cpp\nfor (int i=0; i<count; i++)\n{\n    // always create Extractor\n    // it's cheap and almost instantly !\n    ncnn::Extractor ex = net.create_extractor();\n\n    // use\n    ex.input(your_data[i]);\n}\n```\n\n### use proper loading api\n\nIf you want to load plain param file buffer, you shall use Net::load_param_mem instead of Net::load_param.\n\nFor more information about the ncnn model load api, see [ncnn-load-model](ncnn-load-model)\n\n```cpp\nncnn::Net net;\n\n// param_buffer is the content buffe of XYZ.param file\nnet.load_param_mem(param_buffer);\n```\n\n\n### disable fp16\n\nSome 
models may overflow fp16, resulting in NaN results.\n\nTo investigate, turn off the fp16 lower-precision optimizations so that inference runs in fp32; if the NaN disappears, fp16 overflow was the cause.\n\nYou can set it as follows\n```cpp\nncnn::Net net;\n\n// set these options before loading the param and model files\nnet.opt.use_fp16_packed = false;\nnet.opt.use_fp16_storage = false;\nnet.opt.use_fp16_arithmetic = false;\n```\n\n### make data contiguous\nIf you find that the outputs of pnnx.py and ncnn.py (generated by pnnx) differ, this may be because the input data is not contiguous. You can change ncnn.py as follows and modify the other code accordingly:\n``` python\ndef test_inference():\n    torch.manual_seed(0)\n    in0 = torch.rand(1, 3, 224, 224, dtype=torch.float)\n    # contiguous() returns the contiguous tensor rather than modifying in0 in place\n    in0 = in0.contiguous()\n```\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/FAQ-ncnn-protobuf-problem.zh.md",
    "content": "# Protobuf 类问题解决方法\n\n## 问题分析\n\nprotobuf 有关的报错，一般都是两个原因：\n\n1. 需要的 pb 没安装/`FindProtobuf.cmake`不存在，最终 `find_package` 失败\n2. 系统不止一套 pb，导致 bin/lib/include 三者不匹配\n\n如果你遇到了这些报错，都可以通过本文档解决：\n\n1. Linux 编译 `caffe2ncnn` 时报 `Protobuf not found`\n2. 编译 `caffe2ncnn` 时报 protoc 和 protobuf.so 版本不匹配\n\n## （推荐）通用处理办法\n\n这个办法包治百病，**不管什么情况一定生效**\n\n1. 编译下载 protobuf，以 3.20.0 版本为例\n\n```bash\n$ wget https://github.com/protocolbuffers/protobuf/releases/download/v3.20.0/protobuf-cpp-3.20.0.tar.gz\n$ tar xvf protobuf-cpp-3.20.0.tar.gz\n$ cd protobuf-3.20.0/\n$ ./configure --prefix=/path/to/install\n$ make && make install\n```\n注意需要 `--prefix`，不要装到系统里。能遇到这些错，说明本来系统环境就有问题，再给系统环境装 lib 就更乱了。\n\n2. 修改 cmake\n\n找到报错的 CMakeLists.txt，在 `find_package` 前插入 protobuf 路径。\n\n```bash\n# 加入下面 1 行\nlist(APPEND CMAKE_PREFIX_PATH \"/path/to/install\")\n\nfind_package(Protobuf REQUIRED)\n...\n```\n\n3. 调整 cmake 选项\n\n`cmake ..` 时，额外加入选项 `-DProtobuf_PROTOC_EXECUTABLE=/path/to/install/bin/protoc`\n\n```bash\n$ cd /path/to/ncnn/build\n$ rm -rf CMakeCache\n# 加入新选项\n$ cmake .. -DProtobuf_PROTOC_EXECUTABLE=/path/to/install/bin/protoc \n$ ...\n```\n\n## （不推荐）自己改环境变量\n\n### 一、遇到 `Protobuf not found`\n\n是因为 protobuf 未安装或环境变量未设置\n\n1. 安装 protobuf\n\nUbuntu 系统尝试以下命令\n```bash\n$ sudo apt-get install libprotobuf-dev protobuf-compiler\n```\n\nCentOS 尝试\n```bash\n$ sudo yum install protobuf-devel.x86_64 protobuf-compiler.x86_64\n```\n\n2. 然后设置 C++ 环境\n\n在 LD_LIBRARY_PATH 增加参数\n\n```bash\n$ export LD_LIBRARY_PATH=${YOUR_PROTOBUF_LIB_PATH}:$LD_LIBRARY_PATH\n```\n\n### 二、遇到 protoc 和 protobuf.so 版本不匹配\n\n1. 先看 protoc 需要的 so 版本号\n```bash\n$ ldd `whereis protoc| awk '{print $2}'` | grep libprotobuf.so\n```\n\n例如是 libprotobuf.so.10\n\n2. 然后搜这个文件所在的路径\n```bash\n$ cd / && find . -type f | grep libprotobuf.so.10\n```\n\n假设在`/home/user/mydir`\n\n3. 设置 protobuf.so 的搜索目录\n```bash\n$ export LD_LIBRARY_PATH=/home/user/mydir:$LD_LIBRARY_PATH\n```\n\n### 三、行走江湖必备\n关于环境变量设置、工具和技巧，强烈建议学习下 https://missing.csail.mit.edu/ \n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/FAQ-ncnn-throw-error.md",
    "content": "### param is too old, please regenerate\n\nYour model file is in the old format, converted by an old caffe2ncnn tool.\n\nCheck out the latest ncnn code, build it and regenerate the param and model binary files, and that should work.\n\nMake sure that your param file starts with the magic number 7767517.\n\nyou may find more info on [use-ncnn-with-alexnet](use-ncnn-with-alexnet)\n\nIf the original model is missing, you can try to manually fix the layer specific parameters in the param file\n\n1. **Softmax** append `1=1`\n\nbefore\n```\nSoftmax xxx 1 1 in out ...\n```\nafter\n```\nSoftmax xxx 1 1 in out ... 1=1\n```\n\n2. **Reduction** subtract 1 from every axis value (except the leading array count) and append `5=1`\n\nbefore\n```\nReduction xxx 1 1 in out ... -23303=2,2,3 ...\n```\nafter\n```\nReduction xxx 1 1 in out ... -23303=2,1,2 ... 5=1\n```\n\n### find_blob_index_by_name XYZ failed\n\nThat means ncnn couldn't find the XYZ blob in the network. \n\nYou shall call Extractor::input()/extract() by blob name instead of layer name.\n\nFor models loaded from binary param file or external memory, you shall call Extractor::input()/extract() by the enum defined in xxx.id.h because all the visible string literals have been stripped in binary form.\n\nThis error usually happens when the input layer is not properly converted.\n\nYou shall upgrade caffe prototxt/caffemodel before converting it to ncnn. An input layer of the following form should be fine. 
\n\n```\nlayer {\n  name: \"data\"\n  type: \"Input\"\n  top: \"data\"\n  input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }\n}\n```\n\nyou may find more info on [use-ncnn-with-alexnet](use-ncnn-with-alexnet).\n\n### layer XYZ not exists or registered\n\nYour network contains some operations that are not implemented in ncnn.\n\nYou may implement them as custom layer followed in [how-to-implement-custom-layer-step-by-step](how-to-implement-custom-layer-step-by-step).\n\nOr you could simply register them as no-op if you are sure those operations make no sense.\n\n```cpp\nclass Noop : public ncnn::Layer {};\nDEFINE_LAYER_CREATOR(Noop)\n\nnet.register_custom_layer(\"LinearRegressionOutput\", Noop_layer_creator);\nnet.register_custom_layer(\"MAERegressionOutput\", Noop_layer_creator);\n```\n\n### fopen XYZ.param/XYZ.bin failed\n\nFile not found or not readable. Make sure that XYZ.param/XYZ.bin is accessible.\n\n### network graph not ready\n\nYou shall call Net::load_param() first, then Net::load_model().\n\nThis error may also happens when Net::load_param() failed, but not properly handled.\n\nFor more information about the ncnn model load api, see [ncnn-load-model](ncnn-load-model)\n\n### memory not 32-bit aligned at XYZ\n\nThe pointer passed to Net::load_param() or Net::load_model() is not 32bit aligned.\n\nIn practice, the head pointer of std::vector<unsigned char> is not guaranteed to be 32bit aligned.\n\nyou can store your binary buffer in ncnn::Mat structure, its internal memory is aligned.\n\n### undefined reference to '__kmpc_XYZ_XYZ'\n\nuse clang for building android shared library\n\ncomment the following line in your Application.mk\n```\nNDK_TOOLCHAIN_VERSION := 4.9\n```\n\n### crash on android with '__kmp_abort_process'\n\nThis usually happens if you bundle multiple shared library with openmp linked\n\nIt is actually an issue of the android ndk https://github.com/android/ndk/issues/1028\n\nOn old android ndk, modify the link flags 
as\n\n```\n-Wl,-Bstatic -lomp -Wl,-Bdynamic\n```\n\nFor recent ndk >= 21\n\n```\n-fstatic-openmp\n```\n\n### dlopen failed: library \"libomp.so\" not found\n\nNewer android ndk defaults to the dynamic openmp runtime\n\nmodify the link flags as\n\n```\n-fstatic-openmp -fopenmp\n```\n\n### crash when freeing a ncnn dynamic library(*.dll/*.so) built with openMP\n\nFor optimal performance, the openmp thread pool spin-waits for about a second before shutting down, in case more work becomes available.\n\nIf you unload a dynamic library while it is still spin-waiting, it will crash in the manner you see (most of the time).\n\nEither set OMP_WAIT_POLICY=passive in your environment before calling LoadLibrary, or wait a few seconds before calling FreeLibrary.\n\nYou can also set the environment variable in your code:\n\nfor msvc++:\n\n```\nSetEnvironmentVariable(_T(\"OMP_WAIT_POLICY\"), _T(\"passive\"));\n```\n\nfor g++:\n\n```\nsetenv(\"OMP_WAIT_POLICY\", \"passive\", 1);\n```\n\nreference: https://stackoverflow.com/questions/34439956/vc-crash-when-freeing-a-dll-built-with-openmp\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/FAQ-ncnn-vulkan.md",
    "content": "### how to enable ncnn vulkan capability\n\nfollow [the build and install instruction](https://github.com/Tencent/ncnn/blob/master/docs/how-to-build/how-to-build.md)\n\nmake sure you have installed vulkan sdk from [lunarg vulkan sdk website](https://vulkan.lunarg.com/sdk/home)\n\nUsually, you can enable the vulkan compute inference feature by adding only one line of code to your application.\n\n```cpp\n// enable vulkan compute feature before loading\nncnn::Net net;\nnet.opt.use_vulkan_compute = 1;\n```\n\n### does my graphics device support vulkan\n\nSome platforms have been tested and known working. In theory, if your platform support vulkan api, either 1.0 or 1.1, it shall work.\n\n* Y = known work\n* ? = shall work, not confirmed\n* / = not applied\n\n|    |windows|linux|android|mac|ios|\n|---|---|---|---|---|---|\n|intel|Y|Y|?|?|/|\n|amd|Y|Y|/|?|/|\n|nvidia|Y|Y|?|/|/|\n|qcom|/|/|Y|/|/|\n|apple|/|/|/|Y|Y|\n|arm|/|?|Y|/|/|\n\nYou can search [the vulkan database](https://vulkan.gpuinfo.org) to see if your device supports vulkan.\n\nSome old buggy drivers may produce wrong result, that are blacklisted in ncnn and treated as non-vulkan capable device.\nYou could check if your device and driver have this issue with  [my conformance test here](vulkan-conformance-test).\nMost of these systems are android with version lower than 8.1.\n\n### why using vulkan over cuda/opencl/metal\n\nIn the beginning, I had no GPGPU programming experience, and I had to learn one.\n\nvulkan is considered more portable and well supported by vendors and the cross-platform low-overhead graphics api. 
As a contrast, cuda is only available on nvidia device, metal is only available on macos and ios, while loading opencl library is banned in android 7.0+ and does not work on ios.\n\n### I got errors like \"vkCreateComputePipelines failed -1000012000\" or random stalls or crashes\n\nUpgrade your vulkan driver.\n\n[intel https://downloadcenter.intel.com/product/80939/Graphics-Drivers](https://downloadcenter.intel.com/product/80939/Graphics-Drivers)\n\n[amd https://www.amd.com/en/support](https://www.amd.com/en/support)\n\n[nvidia https://www.nvidia.com/Download/index.aspx](https://www.nvidia.com/Download/index.aspx)\n\n### how to use ncnn vulkan on android\n\nminimum android ndk version: android-ndk-r18b\n\nminimum sdk platform api version: android-24\n\nlink your jni project with libvulkan.so\n\n[The squeezencnn example](https://github.com/Tencent/ncnn/tree/master/examples/squeezencnn) have equipped gpu inference, you could take it as reference.\n\n### how to use ncnn vulkan on ios\n\nsetup vulkan sdk (https://vulkan.lunarg.com/sdk/home#mac)\n\nmetal only works on real device with arm64 cpu (iPhone 5s and later)\n\nlink your project with MoltenVK framework and Metal\n\n### what about the layers without vulkan support\n\nThese layers have vulkan support currently\n\nAbsVal, BatchNorm, BinaryOp, Cast, Clip, Concat, Convolution, ConvolutionDepthWise, Crop, Deconvolution, DeconvolutionDepthWise, Dropout, Eltwise, Flatten, HardSigmoid, InnerProduct, Interp, LRN, Packing, Padding, Permute, Pooling(pad SAME not supported), PReLU, PriorBox, ReLU, Reorg, Reshape, Scale, ShuffleChannel, Sigmoid, Softmax, TanH, UnaryOp\n\nFor these layers without vulkan support, ncnn inference engine will automatically fallback to cpu path.\n\nThus, it is usually not a serious issue if your network only has some special head layers like SSD or YOLO. 
All examples in ncnn are known working properly with vulkan enabled.\n\n### my model runs slower on gpu than cpu\n\nThe current vulkan inference implementation is far from the preferred state. Many handful optimization techniques are planned, such as winograd convolution, operator fusion, fp16 storage and arithmetic etc.\n\nIt is common that your model runs slower on gpu than cpu on arm devices like mobile phones, since we have quite good arm optimization in ncnn ;)\n\n### vulkan device not found / extra high cpu utility while vulkan is enabled on nvidia gpu\n\nThere are several reasons could lead to this outcome. First please check your driver status with `nvidia-smi`. If you have correctly installed your driver, you should see something like this:\n\n```bash\n$ nvidia-smi\nSat Mar 06 19:53:16 2021\n+-----------------------------------------------------------------------------+\n| NVIDIA-SMI 451.48       Driver Version: 451.48       CUDA Version: 11.0     |\n|-------------------------------+----------------------+----------------------+\n| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. 
|\n|===============================+======================+======================|\n|   0  GeForce GTX 1060   WDDM  | 00000000:02:00.0 Off |                  N/A |\n| N/A   31C    P8     5W /  N/A |     90MiB /  6144MiB |      0%      Default |\n+-------------------------------+----------------------+----------------------+\n\n+-----------------------------------------------------------------------------+\n| Processes:                                                                  |\n|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |\n|        ID   ID                                                   Usage      |\n|=============================================================================|\n|  No running processes found                                                 |\n+-----------------------------------------------------------------------------+\n```\n\nIf `nvidia-smi` crashes or cannot be found, please reinstall your graphics driver.\n\nIf ncnn *is* utilizing the GPU, you can see your program in the `Processes` block at the bottom. In that case, it's likely that some operators are not yet supported in Vulkan and have fallen back to the CPU, leading to low utilization of the GPU.\n\nIf you *couldn't* find your process running, please check the active driver model, which can be found to the right of your device name. For GeForce and Titan GPUs, the default driver model is WDDM (Windows Display Driver Model), which supports both graphics rendering and computing. But for Tesla GPUs, without configuration, the driver model defaults to TCC ([Tesla Compute Cluster](https://docs.nvidia.com/gameworks/content/developertools/desktop/tesla_compute_cluster.htm)). 
NVIDIA's TCC driver does not support Vulkan, so to use Vulkan you need to switch the driver model back to WDDM with the following command:\n\n```bash\n$ nvidia-smi -g 0 -dm 0\n```\n\nThe number following `-g` is the GPU ID (shown to the left of your device name in the `nvidia-smi` output), and `-dm` stands for driver model: 0 selects WDDM and 1 selects TCC.\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/build-minimal-library.md",
    "content": "For some reason, if you're not happy with the binary size of the ncnn library, then here is the cheatsheet that helps you to build a minimal ncnn :P\n\n### disable c++ rtti and exceptions\n\n```\ncmake -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON ..\n```\n* Cannot use RTTI and Exceptions when ncnn functions are called.\n\n### disable vulkan support\n\n```\ncmake -DNCNN_VULKAN=OFF ..\n```\n\n* Cannot use GPU acceleration.\n\n### disable NCNN_STDIO\n\n```\ncmake -DNCNN_STDIO=OFF ..\n```\n\n* Cannot load model from files, but can load model from memory or by Android Assets.\n\n    Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#load-model).\n\n### disable NCNN_STRING\n\n```\ncmake -DNCNN_STRING=OFF ..\n```\n\n* Cannot load human-readable param files with visible strings, but can load binary param.bin files.\n\n    Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#strip-visible-string)\n\n* Cannot identify blobs by string name when calling `Extractor::input / extract`, but can identify them by enum value in `id.h`.\n\n    Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#input-and-output).\n\n### disable NCNN_BF16\n\n```\ncmake -DNCNN_BF16=OFF ..\n```\n\n* Cannot use bf16 storage type in inference.\n\n\n### disable NCNN_INT8\n\n```\ncmake -DNCNN_INT8=OFF ..\n```\n\n* Cannot use quantized int8 inference.\n\n\n### drop pixel drawing functions\n\n```\ncmake -DNCNN_PIXEL_DRAWING=OFF ..\n```\n\n* Cannot use functions doing drawing basic shape and text like `ncnn::draw_rectangle_xx / ncnn::draw_circle_xx / ncnn::draw_text_xx`, but functions like `Mat::from_pixels / from_pixels_resize` are still available.\n\n\n### drop pixel rotate and affine functions\n\n```\ncmake -DNCNN_PIXEL_ROTATE=OFF -DNCNN_PIXEL_AFFINE=OFF ..\n```\n\n* Cannot use functions doing rotatation and affine 
transformation like `ncnn::kanna_rotate_xx / ncnn::warpaffine_bilinear_xx`, but functions like `Mat::from_pixels / from_pixels_resize` are still available. \n\n### drop pixel functions\n\n```\ncmake -DNCNN_PIXEL=OFF ..\n```\n\n* Cannot use functions transferring from image to pixels like `Mat::from_pixels / from_pixels_resize / to_pixels / to_pixels_resize`, and need create a Mat and fill in data by hand.\n\n### disable openmp\n\n```\ncmake -DNCNN_OPENMP=OFF ..\n```\n\n* Cannot use openmp multi-threading acceleration. If you want to run a model in single thread on your target machine, it is recommended to close the option.\n\n### disable avx2 and arm82 optimized kernel\n\n```\ncmake -DNCNN_AVX2=OFF -DNCNN_ARM82=OFF ..\n```\n\n* Do not compile optimized kernels using avx2 / arm82 instruction set extensions. If your target machine does not support some of them, it is recommended to close the related options.\n\n### disable runtime cpu instruction dispatch\n\n```\ncmake -DNCNN_RUNTIME_CPU=OFF ..\n```\n\n* Cannot check supported cpu instruction set extensions and use related optimized kernels in runtime.\n* If you know which instruction set extensions are supported on your target machine like avx2 / arm82, you can open related options like `-DNCNN_AVX2=ON / -DNCNN_ARM82=ON` by hand and then sse2 / arm8 version kernels will not be compiled.\n\n### drop layers not used\n\n```\ncmake -DWITH_LAYER_absval=OFF -DWITH_LAYER_bnll=OFF ..\n```\n\n* If your model does not include some layers, taking absval / bnll as a example above, you can drop them.\n* Some key or dependency layers should not be dropped, like convolution / innerproduct, their dependency like padding / flatten, and activation like relu / clip.\n\n### disable c++ stl\n\n```\ncmake -DNCNN_SIMPLESTL=ON ..\n```\n\n* STL provided by compiler is no longer depended on, and use `simplestl` provided by ncnn as a replacement. 
Code that calls ncnn functions can then only use `simplestl` as well.\n* Usually used with compiler parameters `-nodefaultlibs -fno-builtin -nostdinc++ -lc`\n* Needs cmake parameters `cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_STL=system` to avoid STL conflicts when compiling for Android.\n\n### drop optimized kernel not used\n\n* Modify the source code under `ncnn/src/layer/arm/` to delete unnecessary optimized kernels or replace them with empty functions.\n* You can also drop layers and related optimized kernels by `-DWITH_LAYER_absval=OFF` as mentioned above.\n\n### drop operators from BinaryOp UnaryOp\n\n* Modify `ncnn/src/layer/binaryop.cpp unaryop.cpp` and `ncnn/src/layer/arm/binaryop_arm.cpp unaryop_arm.cpp` by hand to delete unnecessary operators.\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/efficient-roi-resize-rotate.md",
    "content": "\n### image roi crop + convert to ncnn::Mat\n\n```\n+--------------+\n|   y          |           /-------/\n| x +-------+  |          +-------+|\n|   |     roih |im_h  =>  |      roih\n|   +-roiw--+  |          +-roiw--+/\n|              |\n+-----im_w-----+\n```\n```cpp\nncnn::Mat in = ncnn::Mat::from_pixels_roi(im.data, ncnn::Mat::PIXEL_RGB, im_w, im_h, x, y, roiw, roih);\n```\nFor Android Application, it is :\n```cpp\nncnn::Mat in = ncnn::Mat::from_android_bitmap_roi(env, image, ncnn::Mat::PIXEL_RGBA2RGB, x, y, roiw, roih);\n```\n\n### image roi crop + resize + convert to ncnn::Mat\n\n```\n+--------------+\n|   y          |           /----/\n| x +-------+  |          +----+|\n|   |     roih |im_h  =>  |  target_h\n|   +-roiw--+  |          |    ||\n|              |          +----+/\n+-----im_w-----+         target_w\n```\n```cpp\nncnn::Mat in = ncnn::Mat::from_pixels_roi_resize(im.data, ncnn::Mat::PIXEL_RGB, im_w, im_h, x, y, roiw, roih, target_w, target_h);\n```\nFor Android Application, it is :\n```cpp\nncnn::Mat in = ncnn::Mat::from_android_bitmap_roi_resize(env, image, ncnn::Mat::PIXEL_RGBA2RGB, x, y, roiw, roih, target_w, target_h);\n```\n\n### ncnn::Mat export image + offset paste\n\n```\n                +--------------+\n /-------/      |   y          |\n+-------+|      | x +-------+  |\n|       h|  =>  |   |       h  |im_h\n+---w---+/      |   +---w---+  |\n                |              |\n                +-----im_w-----+\n```\n```cpp\nconst unsigned char* data = im.data + (y * im_w + x) * 3;\nout.to_pixels(data, ncnn::Mat::PIXEL_RGB, im_w * 3);\n```\n\n### ncnn::Mat export image + resize + roi paste\n\n```\n            +--------------+\n /----/     |   y          |\n+----+|     | x +-------+  |\n|    h| =>  |   |      roih|im_h\n|    ||     |   +-roiw--+  |\n+-w--+/     |              |\n            +-----im_w-----+\n```\n```cpp\nconst unsigned char* data = im.data + (y * im_w + x) * 3;\nout.to_pixels_resize(data, ncnn::Mat::PIXEL_RGB, 
roiw, roih, im_w * 3);\n```\n\n### image roi crop + resize\n```\n+--------------+\n|   y          |\n| x +-------+  |          +----+\n|   |      roih|im_h  =>  |  target_h\n|   +-roiw--+  |          |    |\n|              |          +----+\n+-----im_w-----+         target_w\n```\n```cpp\nconst unsigned char* data = im.data + (y * im_w + x) * 3;\nncnn::resize_bilinear_c3(data, roiw, roih, im_w * 3, outdata, target_w, target_h, target_w * 3);\n```\n\n### image resize + offset paste\n```\n            +--------------+\n            |   y          |\n+----+      | x +-------+  |\n|    h  =>  |   |     roih |im_h\n|    |      |   +-roiw--+  |\n+-w--+      |              |\n            +-----im_w-----+\n```\n```cpp\nunsigned char* outdata = im.data + (y * im_w + x) * 3;\nncnn::resize_bilinear_c3(data, w, h, w * 3, outdata, roiw, roih, im_w * 3);\n```\n\n### image roi crop + resize + roi paste\n```\n+--------------+         +-----------------+\n|   y          |         |  roiy           |\n| x +-------+  |         |roix----------+  |\n|   |       h  |im_h  => |   |     target_h|outim_h\n|   +---w---+  |         |   |          |  |\n|              |         |   +-target_w-+  |\n+-----im_w-----+         +-----outim_w-----+\n```\n```cpp\nconst unsigned char* data = im.data + (y * im_w + x) * 3;\nunsigned char* outdata = outim.data + (roiy * outim_w + roix) * 3;\nncnn::resize_bilinear_c3(data, w, h, im_w * 3, outdata, target_w, target_h, outim_w * 3);\n```\n\n### image roi crop + rotate\n```\n+--------------+\n|   y          |\n| x +-------+  |          +---+\n|   |  < <  h  |im_h  =>  | ^ |w\n|   +---w---+  |          | ^ |\n|              |          +---+\n+-----im_w-----+            h\n```\n```cpp\nconst unsigned char* data = im.data + (y * im_w + x) * 3;\nncnn::kanna_rotate_c3(data, w, h, im_w * 3, outdata, h, w, h * 3, 6);\n```\n\n### image rotate + offset paste\n```\n             +--------------+\n             |   y          |\n +---+       | x +-------+  |\n | ^ |h  =>  
|   |  < <  w  |im_h\n | ^ |       |   +---h---+  |\n +---+       |              |\n   w         +-----im_w-----+\n```\n```cpp\nunsigned char* outdata = im.data + (y * im_w + x) * 3;\nncnn::kanna_rotate_c3(data, w, h, w * 3, outdata, h, w, im_w * 3, 7);\n```\n\n### image roi crop + rotate + roi paste\n```\n+--------------+         +-----------------+\n|   y          |         |        roiy     |\n| x +-------+  |         |   roix  +---+   |\n|   |  < <  h  |im_h  => |         | ^ w   |outim_h\n|   +---w---+  |         |         | ^ |   |\n|              |         |         +-h-+   |\n+-----im_w-----+         +-----outim_w-----+\n```\n```cpp\nconst unsigned char* data = im.data + (y * im_w + x) * 3;\nunsigned char* outdata = outim.data + (roiy * outim_w + roix) * 3;\nncnn::kanna_rotate_c3(data, w, h, im_w * 3, outdata, h, w, outim_w * 3, 6);\n```\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/ncnn-load-model.md",
    "content": "### the comprehensive model loading api table\n\n|load from|alexnet.param|alexnet.param.bin|alexnet.bin|\n|---|---|---|---|\n|file path|load_param(const char*)|load_param_bin(const char*)|load_model(const char*)|\n|file path<br/>(wchar_t for windows)|load_param(const wchar_t*)|load_param_bin(const wchar_t*)|load_model(const wchar_t*)|\n|file descriptor|load_param(FILE*)|load_param_bin(FILE*)|load_model(FILE*)|\n|file memory|load_param_mem(const char*)|load_param(const unsigned char*)|load_model(const unsigned char*)|\n|android asset|load_param(AAsset*)|load_param_bin(AAsset*)|load_model(AAsset*)|\n|android asset path|load_param(AAssetManager*, const char*)|load_param_bin(AAssetManager*, const char*)|load_model(AAssetManager*, const char*)|\n|custom IO reader|load_param(const DataReader&)|load_param_bin(const DataReader&)|load_model(const DataReader&)|\n\n### points to note\n\n1. Either of the following combination shall be enough for loading model\n    * alexnet.param + alexnet.bin\n    * alexnet.param.bin + alexnet.bin\n\n2. Never modify Net opt member after loading\n\n3. Most loading functions return 0 if success, except loading alexnet.param.bin and alexnet.bin from file memory, which returns the bytes consumed after loading\n    * size_t Net::load_param(const unsigned char*)\n    * size_t Net::load_model(const unsigned char*)\n\n4. It is recommended to load model from Android asset directly to avoid copying them to sdcard on Android platform\n\n5. The custom IO reader interface can be used to implement on-the-fly model decryption and loading\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/openmp-best-practice.md",
    "content": "ncnn openmp best practice\r\n\r\n### CPU loadaverage is too high with ncnn.\r\n\r\n   When inference the neural network with ncnn, the cpu occupancy is very high even all CPU cores occupancy close to 100%.\r\n\r\n   If there are other threads or processes that require more cpu resources, the running speed of the program will drop severely.\r\n\r\n### The root cause of high CPU usage\r\n\r\n1. ncnn uses openmp API to speed up the inference compute. the thread count equals to the cpu core   count. If the computing work need to run frequently, it must consume many cpu resources.\r\n\r\n2. There is a thread pool managed by openmp, the pool size is equal to the cpu core size. (the max  vulue is 15 if there are much more cpu cores?)\r\n   Openmp need to sync the thread when acquiring and returning threads to the pool. In order to improve efficiency, almost all omp implementations use spinlock synchronization (except for simpleomp). \r\n   The default spin time of the spinlock is 200ms. So after a thread is scheduled, the thread need to busy-wait up to 200ms.\r\n\r\n### Why the CPU usage is still high even using vulkan GPU acceleration.\r\n\r\n1. Openmp is also used when loading the param bin file, and this part runs on cpu.\r\n\r\n2. The fp32 to fp16 conversion before and after the GPU memory upload is executed on the cpu, and this part of the logic also uses openmp.\r\n\r\n### Solution\r\n```\r\n1. Bind to the specific cpu core.\r\n```\r\n   If you use a device with large and small core CPUs, it is recommended to bind large or small cores through ncnn::set_cpu_powersave(int). Note that Windows does not support binding cores. By the way,  it's possible to have multiple threadpool using openmp. 
A new thread pool is created for each new thread scope.\r\nSuppose your platform has 2 big cores + 4 little cores, and you want to execute model A on the 2 big cores and model B on the 4 little cores concurrently.\r\n\r\nCreate two threads via std::thread or pthread:\r\n   ```\r\n   void thread_1()\r\n   {\r\n      ncnn::set_cpu_powersave(2); // bind to big cores\r\n      netA.opt.num_threads = 2;\r\n   }\r\n\r\n   void thread_2()\r\n   {\r\n      ncnn::set_cpu_powersave(1); // bind to little cores\r\n      netB.opt.num_threads = 4;\r\n   }\r\n   ```\r\n   \r\n```\r\n2. Use fewer threads.\r\n```\r\n   Set the number of threads to half of the cpu core count or less through ncnn::set_omp_num_threads(int) or the net.opt.num_threads field. If you use clang's libomp, it is recommended that the number of threads does not exceed 8. If you use other omp libraries, it is recommended that the number of threads does not exceed 4.\r\n```\r\n3. Reduce openmp spinlock blocktime.\r\n```\r\n   You can modify the openmp blocktime by calling ncnn::set_kmp_blocktime(int) or by setting the net.opt.openmp_blocktime field.\r\n   This argument is the spin time set by the ncnn API, and the default is 20ms. You can set a smaller value according to\r\n   the situation, or directly change it to 0.\r\n\r\n   Limitations: At present, this is only implemented for clang's libomp library. Neither vcomp nor libgomp has a corresponding interface.\r\n   If ncnn is not compiled with clang, this value is still 200ms by default.\r\n   If you use vcomp or libgomp, you can use the environment variable OMP_WAIT_POLICY=PASSIVE to disable spin time. If you use simpleomp,\r\n   there is no need to set this parameter.\r\n```\r\n4. Limit the number of threads available in the openmp thread pool.\r\n```\r\n   Even if the number of openmp threads is reduced, the CPU occupancy rate may still be high. This is more common on servers with\r\n   particularly many CPU cores. 
\r\n   This is because the waiting threads in the thread pool use a spinlock to busy-wait, and this effect can be reduced by limiting the number of\r\n   threads available in the thread pool.\r\n\r\n   Generally, you can set the OMP_THREAD_LIMIT environment variable. simpleomp currently does not support this feature, so there is no need to set it.\r\n   Note that this environment variable only takes effect if it is set before the program starts.\r\n```\r\n5. Disable openmp completely\r\n```\r\n   If there is only one cpu core, or if you use vulkan gpu acceleration, it is recommended to disable openmp. Just specify -DNCNN_OPENMP=OFF\r\n   when compiling with cmake."
  },
  {
    "path": "docs/how-to-use-and-FAQ/openmp-best-practice.zh.md",
    "content": "ncnn openmp 最佳实践\r\n\r\n### ncnn占用过多cpu资源\r\n\r\n   使用ncnn推理运算，cpu占用非常高甚至所有核心占用都接近100%。\r\n\r\n   如果还有其它线程或进程需要较多的cpu资源，运行速度下降严重。\r\n\r\n### cpu占用高的根本原因\r\n\r\n1. ncnn使用openmp API控制多线程加速推理计算。默认情况下，线程数等于cpu内核数。如果推理需要高频率运行，必然占用大部分\r\n   cpu资源。\r\n\r\n2. openmp内部维护一个线程池，线程池最大可用线程数等于cpu内核数。(核心过多时最大限制是15？）获取和归还线程时需要同步。\r\n\r\n   为了提高效率，几乎所有omp实现都使用了自旋锁同步(simpleomp除外)。自旋锁默认的spin time是200ms。因此一个线程被调度后，\r\n   需要忙等待最多200ms。\r\n\r\n### 为什么使用vulkan加速后cpu占用依然很高。\r\n\r\n1. 加载参数文件时也使用了openmp，这部分是在cpu上运行的。\r\n\r\n2. 显存上传前和下载后的 fp32 fp16转换是在cpu上执行的，这部分逻辑也使用了openmp。\r\n\r\n### 解决方法\r\n\r\n```\r\n1. 绑核\r\n```\r\n   如果使用有大小核cpu的设备，建议通过ncnn::set_cpu_powersave(int)绑定大核或小核，注意windows系统不支持绑核。顺便说一下，ncnn支持不同的模型运行在不同的核心。假设硬件平台有2个大核，4个小核，你想把netA运行在大核，netB运行在小核。\r\n   可以通过std::thread or pthread创建两个线程，运行如下代码：\r\n   \r\n   ```\r\n   void thread_1()\r\n   {\r\n      ncnn::set_cpu_powersave(2); // bind to big cores\r\n      netA.opt.num_threads = 2;\r\n   }\r\n\r\n   void thread_2()\r\n   {\r\n      ncnn::set_cpu_powersave(1); // bind to little cores\r\n      netB.opt.num_threads = 4;\r\n   }\r\n   ```\r\n\r\n```\r\n2. 使用更少的线程数。\r\n```\r\n   通过ncnn::set_omp_num_threads(int)或者net.opt.num_threads字段设置线程数为cpu内核数的一半或更小。如果使用clang的libomp，\r\n   建议线程数不超过8，如果使用其它omp库，建议线程数不超过4。\r\n```\r\n3. 减小openmp blocktime。\r\n```\r\n   可以修改ncnn::set_kmp_blocktime(int)或者修改net.opt.openmp_blocktime，这个参数是ncnn API设置的spin time，默认是20ms。\r\n   可以根据情况设置更小的值，或者直接改为0。\r\n\r\n   局限：目前只有clang的libomp库有实现，vcomp和libgomp都没有相应接口，如果不是使用clang编译的，这个值默认还是200ms。\r\n   如果使用vcomp或libgomp, 可以使用环境变量OMP_WAIT_POLICY=PASSIVE禁用spin time，如果使用simpleomp,不需要设置这个参数。\r\n```\r\n4. 限制openmp线程池可用线程数量。\r\n```\r\n   即使减小了openmp线程数量，cpu占用率仍然可能会很高。这在cpu核心特别多的服务器上比较常见。这是因为线程池中的等待线程使用\r\n   自旋锁忙等待，可以通过限制线程池可用线程数量减轻这种影响。\r\n\r\n   一般可以通过设置OMP_THREAD_LIMIT环境变量。simpleomp目前不支持这一特性，不需要设置。注意这个环境变量仅在程序启动前设置才有效。\r\n```\r\n5. 完全禁用openmp\r\n```\r\n   如果只有一个cpu核心，或者使用vulkan加速，建议关闭openmp, cmake编译时指定-DNCNN_OPENMP=OFF即可。"
  },
  {
    "path": "docs/how-to-use-and-FAQ/quantized-int8-inference.md",
    "content": "# Post Training Quantization Tools\n\nTo support int8 model deployment on mobile devices,we provide the universal post training quantization tools which can convert the float32 model to int8 model.\n\n## User Guide\n\nExample with mobilenet, just need three steps.\n\n### 1. Optimize model\n\nNOTE: **If your model is converted via pnnx, skip this step.**\n\n```shell\n./ncnnoptimize mobilenet.param mobilenet.bin mobilenet-opt.param mobilenet-opt.bin 0\n```\n\n### 2. Create the calibration table file\n\n#### 2.1 From image\n\nWe suggest that using the verification dataset for calibration, which is more than 5000 images.\n\nSome imagenet sample images here https://github.com/nihui/imagenet-sample-images\n\n```shell\nfind images/ -type f > imagelist.txt\n./ncnn2table mobilenet-opt.param mobilenet-opt.bin imagelist.txt mobilenet.table mean=[104,117,123] norm=[0.017,0.017,0.017] shape=[224,224,3] pixel=BGR thread=8 method=kl\n```\n\n* mean and norm are the values you passed to ```Mat::substract_mean_normalize()```\n* shape is the blob shape of your model, [w,h] or [w,h,c]\n\n>\n    * if w and h both are given, image will be resized to exactly size.\n    * if w and h both are zero or negative, image will not be resized.\n    * if only h is zero or negative, image's width will scaled resize to w, keeping aspect ratio.\n    * if only w is zero or negative, image's height will scaled resize to h\n\n* pixel is the pixel format of your model, image pixels will be converted to this type before ```Extractor::input()```\n* thread is the CPU thread count that could be used for parallel inference\n* method is the post training quantization algorithm, kl and aciq are currently supported\n\nIf your model has multiple input nodes, you can use multiple list files and other parameters\n\n```shell\n./ncnn2table mobilenet-opt.param mobilenet-opt.bin imagelist-bgr.txt,imagelist-depth.txt mobilenet.table mean=[104,117,123],[128] norm=[0.017,0.017,0.017],[0.0078125] 
shape=[224,224,3],[224,224,1] pixel=BGR,GRAY thread=8 method=kl\n```\n\n#### 2.2 From npy\n\nWe suggest using the validation (development) set for calibration.\n\nUse the same preprocessing as for the training set to get the input vectors. With batchsize=1, store each input vector as an npy file, so that n inputs correspond to n npy files. The stored vectors must have the batch dimension removed.\n\n\nFor the test net below, shapes are in NCHW format, but without the `N`.\n```txt\nin0, shape=[512]\nin1, shape=[2, 1, 64]\nin2, shape=[2, 1, 64]\n```\n\nfilelist_in0.txt\n```txt\n0_in0.npy\n1_in0.npy\n2_in0.npy\n...\n```\n\nfilelist_in1.txt\n```txt\n0_in1.npy\n1_in1.npy\n2_in1.npy\n...\n```\n\nfilelist_in2.txt\n```txt\n0_in2.npy\n1_in2.npy\n2_in2.npy\n...\n```\n\n```shell\n./ncnn2table test.param test.bin filelist_in0.txt,filelist_in1.txt,filelist_in2.txt test.table shape=[512],[64,1,2],[64,1,2] thread=8 method=kl type=1\n```\n**Here shape is given in WHC order, because that is the order of the arguments to `ncnn::Mat`.**\n\n### 3. Quantize model\n\n```shell\n./ncnn2int8 mobilenet-opt.param mobilenet-opt.bin mobilenet-int8.param mobilenet-int8.bin mobilenet.table\n```\n\nIf you don’t need static quantization, ncnn supports RNN/LSTM/GRU dynamic quantization. In this case, you can omit the table file.\n\n```shell\n./ncnn2int8 rnn-model.param rnn-model.bin rnn-model-int8.param rnn-model-int8.bin\n```\n\n## use ncnn int8 inference\n\nThe ncnn library uses int8 inference automatically, so nothing needs to change in your code\n\n```cpp\nncnn::Net mobilenet;\nmobilenet.load_param(\"mobilenet-int8.param\");\nmobilenet.load_model(\"mobilenet-int8.bin\");\n```\n\n## mixed precision inference\n\nBefore quantizing your model, comment out the weight scale line of a layer in the table file, and that layer will run float32 inference\n\n```\nconv1_param_0 156.639840536\n```\n\n```\n#conv1_param_0 156.639840536\n```\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md",
    "content": "We use alexnet as an example\n\n### prepare caffe prototxt and model\n\nThese files will usually generated when trained with caffe\n```\ntrain.prototxt\ndeploy.prototxt\nsnapshot_10000.caffemodel\n```\ndeploy.prototxt and caffemodel file are enough for TEST phase\n\nalexnet deploy.prototxt can be downloaded here\n\nhttps://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet\n\nalexnet caffemodel can be downloaded here\n\nhttp://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel\n\n### convert to ncnn model\n\nConvert old caffe prototxt and caffemodel to new ones using tools in caffe\n\nbecause the ncnn convert tool needs the new format\n```\nupgrade_net_proto_text [old prototxt] [new prototxt]\nupgrade_net_proto_binary [old caffemodel] [new caffemodel]\n```\n\nUse Input layer as input, set N dim as 1 since only one image can be processed each time\n```\nlayer {\n  name: \"data\"\n  type: \"Input\"\n  top: \"data\"\n  input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }\n}\n```\nUse caffe2ncnn tool to convert caffe model to ncnn model\n```\ncaffe2ncnn deploy.prototxt bvlc_alexnet.caffemodel alexnet.param alexnet.bin\n```\n\n### strip visible string\n\nIt is already enough for deploying with param and bin file only, but there are visible strings in param file, it may not be suitable to distribute plain neural network information in your APP.\n\nYou can use ncnn2mem tool to convert plain model file to binary representation. 
It will generate alexnet.param.bin and two static array code files.\n```\nncnn2mem alexnet.param alexnet.bin alexnet.id.h alexnet.mem.h\n```\n\n### load model\n\nLoad the param and bin files, the easy way\n```cpp\nncnn::Net net;\nnet.load_param(\"alexnet.param\");\nnet.load_model(\"alexnet.bin\");\n```\nLoad the binary param.bin and bin files. No visible strings are included, so this is suitable for bundling as APP resources\n```cpp\nncnn::Net net;\nnet.load_param_bin(\"alexnet.param.bin\");\nnet.load_model(\"alexnet.bin\");\n```\nLoad the network and model from external memory. No visible strings are included and no external resource files are bundled, because the whole model is hardcoded in your program\n\nYou may also use this way to load from android asset resources\n```cpp\n#include \"alexnet.mem.h\"\nncnn::Net net;\nnet.load_param(alexnet_param_bin);\nnet.load_model(alexnet_bin);\n```\nYou can choose any of these ways to load the model. Loading from external memory is zero-copy, which means you must keep your memory buffer alive during processing\n\n### unload model\n```cpp\nnet.clear();\n```\n\n### input and output\n\nncnn Mat is the data structure for input and output data\n\nThe input image should be converted to Mat, with mean values subtracted and normalization applied when needed\n\n```cpp\n#include \"mat.h\"\nunsigned char* rgbdata;// data pointer to RGB image pixels\nint w;// image width\nint h;// image height\nncnn::Mat in = ncnn::Mat::from_pixels(rgbdata, ncnn::Mat::PIXEL_RGB, w, h);\n\nconst float mean_vals[3] = {104.f, 117.f, 123.f};\nin.substract_mean_normalize(mean_vals, 0);\n```\nExecute the network inference and retrieve the result\n```cpp\n#include \"net.h\"\nncnn::Mat in;// input blob as above\nncnn::Mat out;\nncnn::Extractor ex = net.create_extractor();\nex.input(\"data\", in);\nex.extract(\"prob\", out);\n```\nIf you load the model with the binary param.bin file, you should use the enum values in the alexnet.id.h file instead of the blob names\n```cpp\n#include \"net.h\"\n#include \"alexnet.id.h\"\nncnn::Mat in;// input blob as 
above\nncnn::Mat out;\nncnn::Extractor ex = net.create_extractor();\nex.input(alexnet_param_id::BLOB_data, in);\nex.extract(alexnet_param_id::BLOB_prob, out);\n```\nRead the data in the output Mat. Iterate over the data to get all classification scores.\n```cpp\nncnn::Mat out_flattened = out.reshape(out.w * out.h * out.c);\nstd::vector<float> scores;\nscores.resize(out_flattened.w);\nfor (int j=0; j<out_flattened.w; j++)\n{\n    scores[j] = out_flattened[j];\n}\n```\n\n### some tricks\n\nYou can convert the image colorspace and resize the image with the Mat convenience functions, which are well optimized\n\nRGB2GRAY, GRAY2RGB, RGB2BGR etc. are supported, as well as scaling up and scaling down\n```cpp\n#include \"mat.h\"\nunsigned char* rgbdata;// data pointer to RGB image pixels\nint w;// image width\nint h;// image height\nint target_width = 227;// target resized width\nint target_height = 227;// target resized height\nncnn::Mat in = ncnn::Mat::from_pixels_resize(rgbdata, ncnn::Mat::PIXEL_RGB2GRAY, w, h, target_width, target_height);\n```\nYou can concatenate multiple model files into one, and load this single file through the FILE* interface.\n\nThis should ease the distribution of param and model files.\n\n> $ cat alexnet.param.bin alexnet.bin > alexnet-all.bin\n\n```cpp\n#include \"net.h\"\nFILE* fp = fopen(\"alexnet-all.bin\", \"rb\");\nnet.load_param_bin(fp);\nnet.load_model(fp);\nfclose(fp);\n```\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.zh.md",
    "content": "首先，非常感谢大家对 ncnn 组件的关注\n为了方便大家使用 ncnn 组件，up主特意写了这篇使用指北，以烂大街的 alexnet 作为例子\n\n\n### 准备caffe网络和模型\n\ncaffe 的网络和模型通常是搞深度学习的研究者训练出来的，一般来说训练完会有\n```\ntrain.prototxt\ndeploy.prototxt\nsnapshot_10000.caffemodel\n```\n部署的时候只需要 TEST 过程，所以有 deploy.prototxt 和 caffemodel 就足够了\n\nalexnet 的 deploy.prototxt 可以在这里下载\nhttps://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet\n\nalexnet 的 caffemodel 可以在这里下载\nhttp://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel\n\n### 转换ncnn网络和模型\n\ncaffe 自带了工具可以把老版本的 caffe 网络和模型转换为新版（ncnn的工具只认识新版\n```\nupgrade_net_proto_text [老prototxt] [新prototxt]\nupgrade_net_proto_binary [老caffemodel] [新caffemodel]\n```\n输入层改用 Input，因为每次只需要做一个图片，所以第一个 dim 设为 1\n```\nlayer {\n  name: \"data\"\n  type: \"Input\"\n  top: \"data\"\n  input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }\n}\n```\n使用 caffe2ncnn 工具转换为 ncnn 的网络描述和模型\n```\ncaffe2ncnn deploy.prototxt bvlc_alexnet.caffemodel alexnet.param alexnet.bin\n```\n### 去除可见字符串\n\n有 param 和 bin 文件其实已经可以用了，但是 param 描述文件是明文的，如果放在 APP 分发出去容易被窥探到网络结构（说得好像不明文就看不到一样\n使用 ncnn2mem 工具转换为二进制描述文件和内存模型，生成 alexnet.param.bin 和两个静态数组的代码文件\n```\nncnn2mem alexnet.param alexnet.bin alexnet.id.h alexnet.mem.h\n```\n### 加载模型\n\n直接加载 param 和 bin，适合快速验证效果使用\n```cpp\nncnn::Net net;\nnet.load_param(\"alexnet.param\");\nnet.load_model(\"alexnet.bin\");\n```\n加载二进制的 param.bin 和 bin，没有可见字符串，适合 APP 分发模型资源\n```cpp\nncnn::Net net;\nnet.load_param_bin(\"alexnet.param.bin\");\nnet.load_model(\"alexnet.bin\");\n```\n从内存引用加载网络和模型，没有可见字符串，模型数据全在代码里头，没有任何外部文件\n另外，android apk 打包的资源文件读出来也是内存块\n```cpp\n#include \"alexnet.mem.h\"\nncnn::Net net;\nnet.load_param(alexnet_param_bin);\nnet.load_model(alexnet_bin);\n```\n以上三种都可以加载模型，其中内存引用方式加载是 zero-copy 的，所以使用 net 模型的来源内存块必须存在\n\n### 卸载模型\n```cpp\nnet.clear();\n```\n\n### 输入和输出\n\nncnn 用自己的数据结构 Mat 来存放输入和输出数据\n输入图像的数据要转换为 Mat，依需要减去均值和乘系数\n```cpp\n#include \"mat.h\"\nunsigned char* rgbdata;// data pointer to RGB image pixels\nint w;// image width\nint h;// image height\nncnn::Mat 
in = ncnn::Mat::from_pixels(rgbdata, ncnn::Mat::PIXEL_RGB, w, h);\n\nconst float mean_vals[3] = {104.f, 117.f, 123.f};\nin.substract_mean_normalize(mean_vals, 0);\n```\nRun the forward pass of the network and get the result\n```cpp\n#include \"net.h\"\nncnn::Mat in;// input blob as above\nncnn::Mat out;\nncnn::Extractor ex = net.create_extractor();\nex.input(\"data\", in);\nex.extract(\"prob\", out);\n```\nWith the binary param.bin, there are no visible strings, so use the enums in alexnet.id.h instead of the blob names\n```cpp\n#include \"net.h\"\n#include \"alexnet.id.h\"\nncnn::Mat in;// input blob as above\nncnn::Mat out;\nncnn::Extractor ex = net.create_extractor();\nex.input(alexnet_param_id::BLOB_data, in);\nex.extract(alexnet_param_id::BLOB_prob, out);\n```\nRead the output data in the Mat. The data inside a Mat is usually three-dimensional, c / h / w. Iterate over all of it to get the scores of all classes\n```cpp\nncnn::Mat out_flattened = out.reshape(out.w * out.h * out.c);\nstd::vector<float> scores;\nscores.resize(out_flattened.w);\nfor (int j=0; j<out_flattened.w; j++)\n{\n    scores[j] = out_flattened[j];\n}\n```\n### some tricks\n\nWhen converting an image to Mat, you can convert the colorspace and resize at the same time, and these extra operations are optimized as well\nCommon conversions such as RGB2GRAY GRAY2RGB RGB2BGR are supported, as are scaling down and scaling up\n```cpp\n#include \"mat.h\"\nunsigned char* rgbdata;// data pointer to RGB image pixels\nint w;// image width\nint h;// image height\nint target_width = 227;// target resized width\nint target_height = 227;// target resized height\nncnn::Mat in = ncnn::Mat::from_pixels_resize(rgbdata, ncnn::Mat::PIXEL_RGB2GRAY, w, h, target_width, target_height);\n```\nNet has an interface for loading from a FILE* file pointer. You can use it to merge multiple network and model files into one, which makes distribution easier. This does not matter when loading by memory reference\n\n> $ cat alexnet.param.bin alexnet.bin > alexnet-all.bin\n\n```cpp\n#include \"net.h\"\nFILE* fp = fopen(\"alexnet-all.bin\", \"rb\");\nnet.load_param_bin(fp);\nnet.load_model(fp);\nfclose(fp);\n```\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/use-ncnn-with-opencv.md",
    "content": "### opencv to ncnn\n\n* cv::Mat CV_8UC3 -> ncnn::Mat 3 channel + swap RGB/BGR\n\n```cpp\n// cv::Mat a(h, w, CV_8UC3);\nncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_BGR2RGB, a.cols, a.rows);\n```\n\n* cv::Mat CV_8UC3 -> ncnn::Mat 3 channel + keep RGB/BGR order\n\n```cpp\n// cv::Mat a(h, w, CV_8UC3);\nncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_RGB, a.cols, a.rows);\n```\n\n* cv::Mat CV_8UC3 -> ncnn::Mat 1 channel + do RGB2GRAY/BGR2GRAY\n\n```cpp\n// cv::Mat rgb(h, w, CV_8UC3);\nncnn::Mat inrgb = ncnn::Mat::from_pixels(rgb.data, ncnn::Mat::PIXEL_RGB2GRAY, rgb.cols, rgb.rows);\n\n// cv::Mat bgr(h, w, CV_8UC3);\nncnn::Mat inbgr = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2GRAY, bgr.cols, bgr.rows);\n```\n\n* cv::Mat CV_8UC1 -> ncnn::Mat 1 channel\n\n```cpp\n// cv::Mat a(h, w, CV_8UC1);\nncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_GRAY, a.cols, a.rows);\n```\n\n* cv::Mat CV_32FC1 -> ncnn::Mat 1 channel\n\n  * **You could construct ncnn::Mat and fill data into it directly to avoid data copy**\n\n```cpp\n// cv::Mat a(h, w, CV_32FC1);\nncnn::Mat in(a.cols, a.rows, 1, (void*)a.data);\nin = in.clone();\n```\n\n* cv::Mat CV_32FC3 -> ncnn::Mat 3 channel\n\n  * **You could construct ncnn::Mat and fill data into it directly to avoid data copy**\n\n```cpp\n// cv::Mat a(h, w, CV_32FC3);\nncnn::Mat in_pack3(a.cols, a.rows, 1, (void*)a.data, (size_t)4u * 3, 3);\nncnn::Mat in;\nncnn::convert_packing(in_pack3, in, 1);\n```\n\n* std::vector < cv::Mat > + CV_32FC1 -> ncnn::Mat multiple channels\n\n  * **You could construct ncnn::Mat and fill data into it directly to avoid data copy**\n\n```cpp\n// std::vector<cv::Mat> a(channels, cv::Mat(h, w, CV_32FC1));\nint channels = a.size();\nncnn::Mat in(a[0].cols, a[0].rows, channels);\nfor (int p=0; p<in.c; p++)\n{\n    memcpy(in.channel(p), (const uchar*)a[p].data, in.w * in.h * sizeof(float));\n}\n```\n\n### ncnn to opencv\n\n* ncnn::Mat 3 channel -> cv::Mat 
CV_8UC3 + swap RGB/BGR\n\n  * **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**\n\n```cpp\n// ncnn::Mat in(w, h, 3);\ncv::Mat a(in.h, in.w, CV_8UC3);\nin.to_pixels(a.data, ncnn::Mat::PIXEL_BGR2RGB);\n```\n\n* ncnn::Mat 3 channel -> cv::Mat CV_8UC3 + keep RGB/BGR order\n\n  * **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**\n\n```cpp\n// ncnn::Mat in(w, h, 3);\ncv::Mat a(in.h, in.w, CV_8UC3);\nin.to_pixels(a.data, ncnn::Mat::PIXEL_RGB);\n```\n\n* ncnn::Mat 1 channel -> cv::Mat CV_8UC1\n\n  * **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**\n\n```cpp\n// ncnn::Mat in(w, h, 1);\ncv::Mat a(in.h, in.w, CV_8UC1);\nin.to_pixels(a.data, ncnn::Mat::PIXEL_GRAY);\n```\n\n* ncnn::Mat 1 channel -> cv::Mat CV_32FC1\n\n  * **You could consume or manipulate ncnn::Mat data directly to avoid data copy**\n\n```cpp\n// ncnn::Mat in;\ncv::Mat a(in.h, in.w, CV_32FC1);\nmemcpy((uchar*)a.data, in.data, in.w * in.h * sizeof(float));\n```\n\n* ncnn::Mat 3 channel -> cv::Mat CV_32FC3\n\n  * **You could consume or manipulate ncnn::Mat data directly to avoid data copy**\n\n```cpp\n// ncnn::Mat in(w, h, 3);\nncnn::Mat in_pack3;\nncnn::convert_packing(in, in_pack3, 3);\ncv::Mat a(in.h, in.w, CV_32FC3);\nmemcpy((uchar*)a.data, in_pack3.data, in.w * in.h * 3 * sizeof(float));\n```\n\n* ncnn::Mat multiple channels -> std::vector < cv::Mat > + CV_32FC1\n\n  * **You could consume or manipulate ncnn::Mat data directly to avoid data copy**\n\n```cpp\n// ncnn::Mat in(w, h, channels);\nstd::vector<cv::Mat> a(in.c);\nfor (int p=0; p<in.c; p++)\n{\n    a[p] = cv::Mat(in.h, in.w, CV_32FC1);\n    memcpy((uchar*)a[p].data, in.channel(p), in.w * in.h * sizeof(float));\n}\n```\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/use-ncnn-with-own-project.md",
    "content": "### use ncnn with own project\n\nAfter building ncnn, there is one or more library files generated. Consider integrating ncnn into your own project, you may use ncnn's installating provided cmake config file, or by manually specify library path(s).\n\n**with cmake**\n\nEnsure your project is built by cmake. Then in your project's CMakeLists.txt, add these lines:\n\n```cmake\nset(ncnn_DIR \"<ncnn_install_dir>/lib/cmake/ncnn\" CACHE PATH \"Directory that contains ncnnConfig.cmake\")\nfind_package(ncnn REQUIRED)\ntarget_link_libraries(my_target ncnn)\n```\nAfter this, both the header file search path (\"including directories\") and library paths are configured automatically, including vulkan related dependencies.\n\nNote: you have to change `<ncnn_install_dir>` to your machine's directory, it is the directory that contains `ncnnConfig.cmake`.\n\nFor the prebuilt ncnn release packages, ncnnConfig is located in:\n- for `ncnn-YYYYMMDD-windows-vs2019`, it is `lib/cmake/ncnn`\n- for `ncnn-YYYYMMDD-android-vulkan`, it is `${ANDROID_ABI}/lib/cmake/ncnn` (`${ANDROID_ABI}` is defined in NDK's cmake toolchain file)\n- other prebuilt release packages are with similar condition\n\n**manually specify**\n\nYou may also manually specify ncnn library path and including directory. Note that if you use ncnn with vulkan, it is also required to specify vulkan related dependencies.\n\nFor example, on Visual Studio debug mode with vulkan required, the lib paths are:\n```\nE:\\github\\ncnn\\build\\vs2019-x64\\install\\lib\\ncnnd.lib\nE:\\github\\ncnn\\build\\vs2019-x64\\install\\lib\\glslangd.lib\n```\nAnd for its release mode, lib paths are:\n```\nE:\\github\\ncnn\\build\\vs2019-x64\\install\\lib\\ncnn.lib\nE:\\github\\ncnn\\build\\vs2019-x64\\install\\lib\\glslang.lib\n```\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/use-ncnn-with-pytorch-or-onnx.md",
    "content": "# A Guide to Converting pytorch / onnx Models to ncnn\n\nThis guide is designed to help pytorch and onnx users use the new-generation model conversion tool, **pnnx**, to efficiently and reliably convert models to the ncnn format for high-performance inference on the edge.\n\nThis document is written and revised based on the **official pnnx documentation**.\n\n* pnnx project: https://github.com/pnnx/pnnx\n* ncnn project: https://github.com/Tencent/ncnn\n* supported pytorch operators: https://github.com/Tencent/ncnn/tree/master/tools/pnnx#supported-pytorch-operator-status\n* supported onnx operators: https://github.com/Tencent/ncnn/tree/master/tools/pnnx#supported-onnx-operator-status\n\n---\n\n## Why is pnnx Highly Recommended?\n\nRegardless of which framework you come from, pnnx offers significant advantages over traditional tools (like `onnx2ncnn`):\n\n*   **Forget the Hassles of onnx**: The traditional `pytorch -> onnx -> ncnn` pipeline often fails due to onnx operator compatibility issues and dynamic shape problems. 
pnnx can convert directly from pytorch, completely bypassing the unstable intermediate step of onnx.\n*   **Core Framework Support**: pnnx focuses on supporting **pytorch** and **onnx**, providing you with a unified and consistent conversion experience.\n*   **More Stable and Powerful**: pnnx can handle a wider range of modern operators and complex model architectures, generating cleaner and more accurate ncnn graphs.\n*   **Active and Continuous Development**: pnnx is under active development, constantly adding support for the latest operators and features from both source frameworks and the ncnn engine.\n*   **Richer Graph Information**: pnnx preserves the original model's structural information during the conversion process, which is highly beneficial for model analysis and subsequent optimization.\n\n---\n\n## Workflow 1: Guide for pytorch Users (Recommended)\n\nFor pytorch users, converting directly from a pytorch model is the most stable and efficient path.\n\n### Method A: Direct Conversion in Python with `pnnx.export` (Most Recommended)\n\nThis is the simplest and most recommended workflow, allowing you to complete the model conversion with a single command without leaving your Python environment.\n\n#### 1. Install pnnx\n\nFirst, install the pnnx Python package. This command installs both the `pnnx` Python library and the `pnnx` command-line tool.\n\n```bash\npip3 install pnnx\n```\n\n#### 2. Call `pnnx.export` in Your Python Script\n\nCalling the `pnnx.export` function will generate both a TorchScript (`.pt`) file and the `.param` and `.bin` files required by ncnn.\n\n**Complete Code Example:**\n\n```python\nimport torch\nimport torch.nn as nn\nimport pnnx\n\n# 1. 
Define or load your pytorch model\nclass MyModel(nn.Module):\n    def __init__(self):\n        super(MyModel, self).__init__()\n        self.conv1 = nn.Conv2d(3, 16, 3, 1, 1)\n        self.relu = nn.ReLU()\n        self.fc = nn.Linear(16 * 224 * 224, 10)\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = self.relu(x)\n        x = x.view(x.size(0), -1)\n        x = self.fc(x)\n        return x\n\n# 2. Instantiate the model and set it to evaluation mode\nmodel = MyModel()\nmodel.eval()\n\n# 3. Create a dummy input tensor with the correct input shape\ninput_tensor = torch.rand(1, 3, 224, 224)\n\n# 4. Call pnnx.export to export the model\npnnx.export(model, \"my_model.pt\", (input_tensor,))\n\nprint(\"Conversion complete!\")\nprint(\"Please check for the generated my_model.pt, my_model.ncnn.param, and my_model.ncnn.bin files.\")\n```\n\n### Method B: Using the Command-Line Tool (Alternative)\n\n#### 1. Get the pnnx Command-Line Tool\n\nIf you have already run `pip install pnnx`, the `pnnx` command is available, and you can proceed to the next step.\n\nFor non-Python environments or users who prefer a standalone executable, you can manually download the latest binary from the [pnnx Releases page](https://github.com/pnnx/pnnx/releases).\n\n#### 2. Export to TorchScript (Skip if you already have a .pt file)\n\n```python\nimport torch\n# ... (model definition from above)\nmodel = MyModel()\nmodel.eval()\ninput_tensor = torch.rand(1, 3, 224, 224)\ntraced_script_module = torch.jit.trace(model, input_tensor)\ntraced_script_module.save(\"my_model.pt\")\n```\n\n#### 3. Run the pnnx Command for Conversion\n\nRun the following command in your terminal.\n\n```bash\n# Syntax: pnnx <torchscript_model_path>\npnnx my_model.pt\n```\n\n---\n\n## Workflow 2: Guide for onnx Users\n\nFor users who already have an `.onnx` file, please use pnnx for conversion.\n\n### 1. 
Get the pnnx Command-Line Tool\n\n*   **Method 1 (Recommended):** If you have Python in your environment, install it directly via pip.\n    ```bash\n    pip3 install pnnx\n    ```\n    The `pnnx` command will be automatically added to your system's path.\n\n*   **Method 2 (Alternative):** For non-Python environments or to use a standalone program, you can download the latest executable from the [pnnx Releases page](https://github.com/pnnx/pnnx/releases).\n\n### 2. Run the Command-Line Conversion\n\nOpen a terminal, navigate to the directory containing your model file, and run the following command.\n\n**Basic Command Example:**\n\n```bash\n# Syntax: pnnx <onnx_model_path>\npnnx my_model.onnx\n```\nAfter the command executes successfully, you will get the `my_model.ncnn.param` and `my_model.ncnn.bin` files, which can be directly loaded and used in your ncnn project.\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/use-ncnnoptimize-to-optimize-model.md",
    "content": "\nthe typical usage\n```\nncnnoptimize mobilenet.param mobilenet.bin mobilenet-opt.param mobilenet-opt.bin 65536 \n```\n\noperator fusion\n* batchnorm - scale\n* convolution - batchnorm\n* convolutiondepthwise - batchnorm\n* deconvolution - batchnorm\n* deconvolutiondepthwise - batchnorm\n* innerproduct - batchnorm\n* convolution - relu\n* convolutiondepthwise - relu\n* deconvolution - relu\n* deconvolutiondepthwise - relu\n* innerproduct - relu\n\neliminate noop operator\n* innerproduct - dropout\n* flatten after global pooling\n\nprefer better operator\n* replace convolution with innerproduct after global pooling\n"
  },
  {
    "path": "docs/how-to-use-and-FAQ/vulkan-notes.md",
    "content": "## supported platform\n\n* Y = known work\n* ? = shall work, not confirmed\n* / = not applied\n\n|    |windows|linux|android|mac|ios|\n|---|---|---|---|---|---|\n|intel|Y|Y|Y|Y|/|\n|amd|Y|Y|/|Y|/|\n|nvidia|Y|Y|?|/|/|\n|qcom|/|/|Y|/|/|\n|apple|/|/|/|Y|Y|\n|arm|/|Y|Y|/|/|\n\n## enable vulkan compute support\n```\n$ cmake -DNCNN_VULKAN=ON ..\n```\n\n## enable vulkan compute inference\n```cpp\nncnn::Net net;\nnet.opt.use_vulkan_compute = 1;\n```\n\n## proper allocator usage\n```cpp\nncnn::VkAllocator* blob_vkallocator = vkdev.acquire_blob_allocator();\nncnn::VkAllocator* staging_vkallocator = vkdev.acquire_blob_allocator();\n\nnet.opt.blob_vkallocator = blob_vkallocator;\nnet.opt.workspace_vkallocator = blob_vkallocator;\nnet.opt.staging_vkallocator = staging_vkallocator;\n\n// ....\n\n// after inference\nvkdev.reclaim_blob_allocator(blob_vkallocator);\nvkdev.reclaim_staging_allocator(staging_vkallocator);\n```\n\n## select gpu device\n```cpp\n// get gpu count\nint gpu_count = ncnn::get_gpu_count();\n\n// set specified vulkan device before loading param and model\nnet.set_vulkan_device(0); // use device-0\nnet.set_vulkan_device(1); // use device-1\n\n// or set opt.vulkan_device_index field before loading param and model\nnet.opt.vulkan_device_index = 0; // use device-0\nnet.opt.vulkan_device_index = 1; // use device-1\n```\n\n## zero-copy on unified memory device\n```cpp\nncnn::VkMat blob_gpu;\nncnn::Mat mapped = blob_gpu.mapped();\n\n// use mapped.data directly\n```\n\n## hybrid cpu/gpu inference\n```cpp\nncnn::Net net_cpu;\nncnn::Net net_gpu;\nnet_cpu.opt.use_vulkan_compute = false;\nnet_gpu.opt.use_vulkan_compute = true;\nnet_cpu.load_param();\nnet_cpu.load_model();\nnet_gpu.load_param();\nnet_gpu.load_model();\n\nncnn::Extractor ex_cpu = net_cpu.create_extractor();\nncnn::Extractor ex_gpu = net_gpu.create_extractor();\n\n#pragma omp parallel sections\n{\n    #pragma omp section\n    {\n        ex_cpu.input();\n        ex_cpu.extract();\n    }\n    
#pragma omp section\n    {\n        ex_gpu.input();\n        ex_gpu.extract();\n    }\n}\n```\n\n## zero-copy gpu inference chaining\n```cpp\nncnn::Extractor ex1 = net1.create_extractor();\nncnn::Extractor ex2 = net2.create_extractor();\n\nncnn::VkCompute cmd(&vkdev);\n\nncnn::VkMat conv1;\nncnn::VkMat conv2;\nncnn::VkMat conv3;\n\nex1.input(\"conv1\", conv1);\nex1.extract(\"conv2\", conv2, cmd);\n\nex2.input(\"conv2\", conv2);\nex2.extract(\"conv3\", conv3, cmd);\n\ncmd.submit_and_wait();\n```\n\n## batch inference\n```cpp\nint max_batch_size = vkdev->info.compute_queue_count();\n\nncnn::Mat inputs[1000];\nncnn::Mat outputs[1000];\n\n#pragma omp parallel for num_threads(max_batch_size)\nfor (int i=0; i<1000; i++)\n{\n    ncnn::Extractor ex = net1.create_extractor();\n    ex.input(\"data\", inputs[i]);\n    ex.extract(\"prob\", outputs[i]);\n}\n```\n\n## control storage and arithmetic precision\n\ndisable all lower-precision optimizations, get full fp32 precision\n\n```cpp\nncnn::Net net;\nnet.opt.use_fp16_packed = false;\nnet.opt.use_fp16_storage = false;\nnet.opt.use_fp16_arithmetic = false;\nnet.opt.use_int8_storage = false;\nnet.opt.use_int8_arithmetic = false;\n```\n\n## debugging tips\n```cpp\n#define ENABLE_VALIDATION_LAYER 1 // modify to 1 in gpu.cpp\n```\n\n## add vulkan compute support to layer\n1. add vulkan shader in src/layer/shader/\n\n2. upload model weight data in Layer::upload_model()\n\n3. setup pipeline in Layer::create_pipeline()\n\n4. destroy pipeline in Layer::destroy_pipeline()\n\n5. record command in Layer::forward()\n\n## add optimized shader path\n1. add vulkan shader in src/layer/shader/ named XXX_abc.comp\n\n2. create pipeline with \"XXX_abc\"\n\n3. record command using XXX_abc pipeline\n\n## low-level op api\n1. create layer\n\n2. load param and load model\n\n3. upload model\n\n4. create pipeline\n\n5. new command\n\n6. record\n\n7. submit and wait\n\n"
  },
  {
    "path": "examples/CMakeLists.txt",
    "content": "macro(ncnn_add_example name)\n    add_executable(${name} ${name}.cpp)\n    if(OpenCV_FOUND)\n        target_include_directories(${name} PRIVATE ${OpenCV_INCLUDE_DIRS})\n        target_link_libraries(${name} PRIVATE ncnn ${OpenCV_LIBS})\n    elseif(NCNN_SIMPLEOCV)\n        target_compile_definitions(${name} PUBLIC USE_NCNN_SIMPLEOCV)\n        target_link_libraries(${name} PRIVATE ncnn)\n    endif()\n\n    # add test to a virtual project group\n    set_property(TARGET ${name} PROPERTY FOLDER \"examples\")\nendmacro()\n\nif(NCNN_PIXEL)\n    if(NOT NCNN_SIMPLEOCV)\n        find_package(OpenCV QUIET COMPONENTS opencv_world)\n        # for opencv 2.4 on ubuntu 16.04, there is no opencv_world but OpenCV_FOUND will be TRUE\n        if(\"${OpenCV_LIBS}\" STREQUAL \"\")\n            set(OpenCV_FOUND FALSE)\n        endif()\n        if(NOT OpenCV_FOUND)\n            find_package(OpenCV QUIET COMPONENTS core highgui imgproc imgcodecs videoio)\n        endif()\n        if(NOT OpenCV_FOUND)\n            find_package(OpenCV QUIET COMPONENTS core highgui imgproc)\n        endif()\n    endif()\n\n    if(OpenCV_FOUND OR NCNN_SIMPLEOCV)\n        if(OpenCV_FOUND)\n            message(STATUS \"OpenCV library: ${OpenCV_INSTALL_PATH}\")\n            message(STATUS \"    version: ${OpenCV_VERSION}\")\n            message(STATUS \"    libraries: ${OpenCV_LIBS}\")\n            message(STATUS \"    include path: ${OpenCV_INCLUDE_DIRS}\")\n\n            if(${OpenCV_VERSION_MAJOR} GREATER 3)\n                set(CMAKE_CXX_STANDARD 11)\n            endif()\n        endif()\n\n        include_directories(${CMAKE_CURRENT_SOURCE_DIR}/../src)\n        include_directories(${CMAKE_CURRENT_BINARY_DIR}/../src)\n        ncnn_add_example(arcface)\n        ncnn_add_example(squeezenet)\n        ncnn_add_example(squeezenet_c_api)\n        ncnn_add_example(fasterrcnn)\n        ncnn_add_example(rfcn)\n        ncnn_add_example(yolov2)\n        ncnn_add_example(yolov3)\n        
ncnn_add_example(yolov5)\n        ncnn_add_example(yolov5_pnnx)\n        ncnn_add_example(yolov7_pnnx)\n        ncnn_add_example(yolov7)\n        ncnn_add_example(yolov8)\n        ncnn_add_example(yolov8_seg)\n        ncnn_add_example(yolov8_pose)\n        ncnn_add_example(yolov8_cls)\n        ncnn_add_example(yolox)\n        ncnn_add_example(yolo11)\n        ncnn_add_example(yolo11_seg)\n        ncnn_add_example(yolo11_pose)\n        ncnn_add_example(yolo11_cls)\n        ncnn_add_example(yoloworld)\n        ncnn_add_example(mobilenetv2ssdlite)\n        ncnn_add_example(mobilenetssd)\n        ncnn_add_example(squeezenetssd)\n        ncnn_add_example(shufflenetv2)\n        ncnn_add_example(peleenetssd_seg)\n        ncnn_add_example(simplepose)\n        ncnn_add_example(retinaface)\n        ncnn_add_example(yolact)\n        ncnn_add_example(nanodet)\n        ncnn_add_example(nanodetplus_pnnx)\n        ncnn_add_example(scrfd)\n        ncnn_add_example(scrfd_crowdhuman)\n        ncnn_add_example(piper)\n        ncnn_add_example(whisper)\n        if(OpenCV_FOUND)\n            ncnn_add_example(yolov4)\n            ncnn_add_example(yolov8_obb)\n            ncnn_add_example(yolo11_obb)\n            ncnn_add_example(rvm)\n            ncnn_add_example(p2pnet)\n            ncnn_add_example(ppocrv5)\n        endif()\n    else()\n        message(WARNING \"OpenCV not found and NCNN_SIMPLEOCV disabled, examples won't be built\")\n    endif()\nelse()\n    message(WARNING \"NCNN_PIXEL not enabled, examples won't be built\")\nendif()\n"
  },
  {
    "path": "examples/arcface.cpp",
    "content": "// Copyright 2025 heabeounMKTO\n// SPDX-License-Identifier: BSD-3-Clause\n/* ncnn example using yolo-face and arcface to extract embeddings from a face\n *\n *\n *  the arcface model is converted from\n * https://github.com/onnx/models/tree/main/validated/vision/body_analysis/arcface\n * 1. first simplify the arcface.onnx using onnxsim\n * 2. then convert it using ncnn's onnx exporter onnx2ncnn\n *  using pnnx to convert would cause -nan output!\n *\n *  the yolov8-face model is converted from\n *  https://github.com/derronqi/yolov8-face\n *\n *\n * you can find the models preconverted at\n * https://drive.google.com/drive/folders/1P0RDzj9V7FHEL8w_-yqls5RHeVpO-2PS?usp=sharing\n *\n * */\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n#include <float.h>\n#include \"layer.h\"\n#include \"net.h\"\n#include \"mat.h\"\n\n#ifndef ARCFACE_EXAMPLE_YOLO_INFER_SIZE\n#define ARCFACE_EXAMPLE_YOLO_INFER_SIZE 320\n#endif\n\nstruct Bbox\n{\n    float x1, y1, x2, y2, confidence;\n    int label;\n    Bbox()\n        : x1(0.0f), y1(0.0f), x2(0.0f), y2(0.0f), confidence(0.0f), label(0)\n    {\n    }\n    Bbox(float x1,\n         float y1,\n         float x2,\n         float y2,\n         float confidence,\n         int label = 0,\n         std::string label_name = \"\")\n        : x1(x1), y1(y1), x2(x2), y2(y2), confidence(confidence), label(label)\n    {\n    }\n    Bbox apply_image_scale(const cv::Mat& original_image,\n                           const float scale_factor,\n                           const int pad_w,\n                           const int pad_h)\n    {\n        int img_w = original_image.cols;\n        int img_h = original_image.rows;\n\n        x1 = (x1 - pad_w) / scale_factor;\n        y1 = (y1 - pad_h) / scale_factor;\n        x2 = (x2 - pad_w) / scale_factor;\n        
y2 = (y2 - pad_h) / scale_factor;\n\n        // clamp\n        x1 = std::max(0.0f, std::min(x1, (float)img_w));\n        y1 = std::max(0.0f, std::min(y1, (float)img_h));\n        x2 = std::max(0.0f, std::min(x2, (float)img_w));\n        y2 = std::max(0.0f, std::min(y2, (float)img_h));\n        return Bbox(x1, y1, x2, y2, confidence, label);\n    }\n    std::string get_label_name(const std::vector<std::string>& classes)\n    {\n        return classes[this->label];\n    }\n\n    /// area of the box (width * height), in the same units as the coordinates\n    float area() const\n    {\n        float width = x2 - x1;\n        float height = y2 - y1;\n        return width * height;\n    }\n    cv::Mat crop_bbox(const cv::Mat& originalImage) const\n    {\n        // Calculate width and height\n        int bbox_width = static_cast<int>(x2 - x1);\n        int bbox_height = static_cast<int>(y2 - y1);\n\n        // Ensure valid dimensions\n        if (bbox_width <= 0 || bbox_height <= 0)\n        {\n            fprintf(stderr, \"Invalid bounding box dimensions\\n\");\n            return cv::Mat();\n        }\n\n        // Truncate the float coordinates to integers\n        int x1_int = static_cast<int>(x1);\n        int y1_int = static_cast<int>(y1);\n        int x2_int = static_cast<int>(x2);\n        int y2_int = static_cast<int>(y2);\n\n        // Clamp to image bounds\n        x1_int = std::max(0, x1_int);\n        y1_int = std::max(0, y1_int);\n        x2_int = std::min(originalImage.cols, x2_int);\n        y2_int = std::min(originalImage.rows, y2_int);\n\n        // Create ROI and return cropped image\n        cv::Rect roi(x1_int, y1_int, x2_int - x1_int, y2_int - y1_int);\n        return originalImage(roi).clone();\n    }\n    cv::Rect_<float> get_rect() const\n    {\n        int x1_int = static_cast<int>(x1);\n        int y1_int = static_cast<int>(y1);\n        int width = static_cast<int>(x2 - x1);\n        int height = static_cast<int>(y2 - y1);\n\n        // Ensure valid dimensions\n        if (width 
<= 0 || height <= 0)\n        {\n            return cv::Rect(0, 0, 0, 0); // Return invalid rect\n        }\n\n        return cv::Rect(x1_int, y1_int, width, height);\n    }\n};\n\nstatic void print_bbox(Bbox& bbox)\n{\n    printf(\"Bbox(x1=%.2f, y1=%.2f, x2=%.2f, y2=%.2f, conf=%.4f, label=%d)\\n\",\n           bbox.x1, bbox.y1, bbox.x2, bbox.y2, bbox.confidence, bbox.label);\n}\n\nstatic void qsort_descent_inplace(std::vector<Bbox>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].confidence;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].confidence > p)\n            i++;\n\n        while (faceobjects[j].confidence < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    //     #pragma omp parallel sections\n    {\n        //         #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        //         #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Bbox>& faceobjects)\n{\n    if (faceobjects.empty()) return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nfloat calculate_iou(const Bbox& box1, const Bbox& box2)\n{\n    float x1 = std::max(box1.x1, box2.x1);\n    float y1 = std::max(box1.y1, box2.y1);\n    float x2 = std::min(box1.x2, box2.x2);\n    float y2 = std::min(box1.y2, box2.y2);\n\n    if (x2 <= x1 || y2 <= y1)\n    {\n        return 0.0f; // no intersect\n    }\n\n    float intersection_area = (x2 - x1) * (y2 - y1);\n    float box1_area = (box1.x2 - box1.x1) * (box1.y2 - box1.y1);\n    float box2_area = (box2.x2 - box2.x1) * (box2.y2 - box2.y1);\n    float union_area = box1_area + box2_area - intersection_area;\n    
return intersection_area / union_area;\n}\n\nstatic std::vector<int>\nnon_maximum_supression(const std::vector<Bbox>& bbox, float iou_thresh, bool class_agnostic = false)\n{\n    std::vector<int> picked;\n    const int n = bbox.size();\n    if (n == 0) return picked;\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = bbox[i].area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Bbox& a = bbox[i];\n        bool keep = true;\n\n        for (int j : picked)\n        {\n            const Bbox& b = bbox[j];\n\n            // Enhanced class comparison logic using labels\n            if (!class_agnostic)\n            {\n                if (a.label != b.label)\n                {\n                    continue; // Different classes, don't suppress\n                }\n            }\n\n            float iou = calculate_iou(a, b);\n            if (iou > iou_thresh)\n            {\n                keep = false;\n                break;\n            }\n        }\n\n        if (keep)\n        {\n            picked.push_back(i);\n        }\n    }\n\n    return picked;\n}\n\nstatic std::vector<float> scale_wh(float w0, float h0, float w1, float h1)\n{\n    float r = std::min(w1 / w0, h1 / h0);\n    std::vector<float> _scale_factor(3);\n    _scale_factor[0] = r;\n    _scale_factor[1] = (float)std::round(w0 * r);\n    _scale_factor[2] = (float)std::round(h0 * r);\n    return _scale_factor;\n}\n\nstruct ImagePreProcessResults\n{\n    ncnn::Mat result;\n    float img_scale, pad_w, pad_h;\n\n    ImagePreProcessResults(ncnn::Mat result, float img_scale, float pad_w, float pad_h)\n        : result(result), img_scale(img_scale), pad_w(pad_w), pad_h(pad_h)\n    {\n    }\n};\n\nstruct DetectionResult\n{\n    std::vector<Bbox> bboxes;\n    std::vector<std::vector<float> > keypoints;\n};\n\nstatic ImagePreProcessResults preprocess_yolo_kpts(cv::Mat& input_image, int infer_size) noexcept\n{\n    float mean_vals[] = {0.f, 0.f, 
0.f};\n\n    float norm_vals[] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    int img_w = input_image.cols;\n    int img_h = input_image.rows;\n    float scale_factor, new_w, new_h;\n    std::vector<float> _scale_factor = scale_wh(img_w, img_h, (float)infer_size, (float)infer_size);\n    scale_factor = _scale_factor[0];\n    new_w = _scale_factor[1];\n    new_h = _scale_factor[2];\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(input_image.data,\n                   ncnn::Mat::PIXEL_BGR2RGB, img_w,\n                   img_h, new_w, new_h);\n\n    // padding calculation\n    int pad_w = (infer_size - new_w) / 2;\n    int pad_h = (infer_size - new_h) / 2;\n\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, pad_h, infer_size - new_h - pad_h, pad_w,\n                           infer_size - new_w - pad_w, ncnn::BORDER_CONSTANT, 114.f);\n    in_pad.substract_mean_normalize(mean_vals, norm_vals);\n    return ImagePreProcessResults(in_pad, scale_factor, pad_w, pad_h);\n}\n\n/// parses the extra keypoint data produced by the face model\n/// each row, as read by the code below, is laid out as:\n/// [x, y, w, h, class_scores..., kp1_x, kp1_y, kp1_conf, kp2_x, kp2_y, kp2_conf, ...]\nstatic DetectionResult parse_yolo_keypoints_results(ncnn::Mat& result,\n        cv::Mat& original_image,\n        ImagePreProcessResults& preproc_img,\n        float confidence_threshold,\n        float iou_threshold,\n        std::vector<std::string> class_names)\n{\n    cv::Mat output((int)result.w, (int)result.h, CV_32FC1);\n    for (int i = 0; i < output.cols; i++)\n    {\n        for (int j = 0; j < output.rows; j++)\n        {\n            output.ptr<float>(j)[i] = result.row(i)[j];\n        }\n    }\n    std::vector<Bbox> detections;\n    std::vector<std::vector<float> > all_keypoints;\n\n    int num_classes = class_names.size();\n    int kp_stride = 3;\n    int num_keypoints = 5;\n\n    for (int i = 0; i < output.rows; i++)\n    {\n        const float* row_ptr = output.ptr<float>(i);\n        const float* bboxes_ptr = 
row_ptr;\n        const float* classes_ptr = row_ptr + 4;\n        const float* max_s_ptr = std::max_element(classes_ptr, classes_ptr + num_classes);\n\n        float score = *max_s_ptr;\n        int class_id = max_s_ptr - classes_ptr;\n\n        if (score >= confidence_threshold)\n        {\n            float x = bboxes_ptr[0];\n            float y = bboxes_ptr[1];\n            float w = bboxes_ptr[2];\n            float h = bboxes_ptr[3];\n            float x1 = x - w / 2.0f;\n            float y1 = y - h / 2.0f;\n            float x2 = x + w / 2.0f;\n            float y2 = y + h / 2.0f;\n\n            if (x2 > x1 && y2 > y1)\n            {\n                Bbox bbox = Bbox(x1, y1, x2, y2, score, class_id)\n                            .apply_image_scale(original_image, preproc_img.img_scale,\n                                               preproc_img.pad_w, preproc_img.pad_h);\n                // Parse exactly 5 keypoints for this face model\n                std::vector<float> face_keypoints;\n                face_keypoints.reserve(15);\n                const float* kp_ptr = row_ptr + 4 + num_classes;\n                float scale = 1.0f / preproc_img.img_scale;\n\n                for (int k = 0; k < num_keypoints; k++)\n                {\n                    float kp_x = kp_ptr[k * kp_stride];\n                    float kp_y = kp_ptr[k * kp_stride + 1];\n                    float kp_conf_raw = kp_ptr[k * kp_stride + 2];\n\n                    // Apply sigmoid to convert logit to probability\n                    float kp_conf = 1.0f / (1.0f + expf(-kp_conf_raw));\n\n                    // Scale keypoints to original\n                    kp_x = (kp_x - preproc_img.pad_w) * scale;\n                    kp_y = (kp_y - preproc_img.pad_h) * scale;\n\n                    face_keypoints.push_back(kp_x);\n                    face_keypoints.push_back(kp_y);\n                    face_keypoints.push_back(kp_conf);\n                }\n\n                
detections.push_back(bbox);\n                all_keypoints.push_back(face_keypoints);\n            }\n        }\n    }\n\n    // nms\n    qsort_descent_inplace(detections);\n    std::vector<int> picked = non_maximum_supression(detections, iou_threshold, false);\n    DetectionResult res;\n    for (size_t i = 0; i < picked.size(); i++)\n    {\n        int idx = picked[i];\n        res.bboxes.push_back(detections[idx]);\n        res.keypoints.push_back(all_keypoints[idx]);\n    }\n\n    return res;\n}\n\nstatic inline float get_similarity(std::vector<float> f1, std::vector<float> f2)\n{\n    float sim = 0.0;\n    for (size_t i = 0; i < f1.size(); i++)\n    {\n        sim += f1[i] * f2[i];\n    }\n    return sim;\n}\n\n// these are converted from here\n// https://github.com/deepinsight/insightface/blob/master/python-package/insightface/utils/face_align.py\nstatic int estimate_norm(float* transform_matrix, const float* lmk, int image_size = 112)\n{\n    float ARCFACE_DST[] {\n        38.2946f, 51.6963f, // left eye\n        73.5318f, 51.5014f, // right eye\n        56.0252f, 71.7366f, // nose\n        41.5493f, 92.3655f, // left mouth\n        70.7299f, 92.2041f  // right mouth\n    };\n    if (image_size % 112 != 0 && image_size % 128 != 0)\n    {\n        return -1;\n    }\n\n    float ratio, diff_x;\n    if (image_size % 112 == 0)\n    {\n        ratio = static_cast<float>(image_size) / 112.0f;\n        diff_x = 0.0f;\n    }\n    else\n    {\n        ratio = static_cast<float>(image_size) / 128.0f;\n        diff_x = 8.0f * ratio;\n    }\n\n    float src_points[10];\n    for (int i = 0; i < 5; i++)\n    {\n        src_points[i * 2] = lmk[i * 3];\n        src_points[i * 2 + 1] = lmk[i * 3 + 1];\n    }\n\n    float dst_points[10];\n    for (int i = 0; i < 5; i++)\n    {\n        dst_points[i * 2] = ARCFACE_DST[i * 2] * ratio + diff_x;\n        dst_points[i * 2 + 1] = ARCFACE_DST[i * 2 + 1] * ratio;\n    }\n\n    ncnn::get_affine_transform(dst_points, src_points, 5, 
transform_matrix);\n\n    return 0;\n}\n\nstatic int norm_crop(cv::Mat& output, const cv::Mat& input, const float* lmk, int image_size = 112)\n{\n    float transform_matrix[6];\n    int status = estimate_norm(transform_matrix, lmk, image_size);\n\n    if (status != 0)\n    {\n        return status;\n    }\n    output = cv::Mat(image_size, image_size, CV_8UC3);\n    ncnn::warpaffine_bilinear_c3(input.data, input.cols, input.rows,\n                                 output.data, image_size, image_size,\n                                 transform_matrix);\n    return 0;\n}\n\nvoid normalize_arcface(std::vector<float>& feature)\n{\n    if (feature.empty())\n        return;\n    float sum = 0;\n    for (auto it = feature.begin(); it != feature.end(); it++)\n        sum += (float)*it * (float)*it;\n    sum = sqrt(sum);\n    if (sum == 0.0f)\n        return;\n    for (auto it = feature.begin(); it != feature.end(); it++)\n        *it /= sum;\n}\n\nstatic int get_face(const cv::Mat& rgb, DetectionResult& result)\n{\n    int status = 0;\n    ncnn::Net yoloface;\n    yoloface.opt.use_vulkan_compute = true;\n    status = yoloface.load_param(\"yolov8-face.param\");\n\n    if (status != 0)\n    {\n        fprintf(stderr, \"couldn't load params\");\n        return status;\n    }\n\n    status = yoloface.load_model(\"yolov8-face.bin\");\n\n    if (status != 0)\n    {\n        fprintf(stderr, \"couldn't load model\");\n        return status;\n    }\n\n    cv::Mat input_image = rgb.clone();\n    ImagePreProcessResults preproc_img = preprocess_yolo_kpts(input_image, ARCFACE_EXAMPLE_YOLO_INFER_SIZE);\n    ncnn::Extractor ex = yoloface.create_extractor();\n    ex.input(\"in0\", preproc_img.result);\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n    std::vector<std::string> class_names = {\"face\"};\n    result = parse_yolo_keypoints_results(out, input_image, preproc_img, 0.5, 0.4, class_names);\n    if (result.bboxes.size() < 1)\n    {\n        fprintf(stderr, \"no faces are 
found!\");\n        return -1;\n    }\n    return 0;\n}\n\nstatic int get_embedding(const cv::Mat& rgb, std::vector<float>& result)\n{\n    ncnn::Net arcface;\n    arcface.opt.use_vulkan_compute = true;\n    int status = arcface.load_param(\"arcfaceresnet.param\");\n    if (status != 0)\n    {\n        fprintf(stderr, \"couldn't load arcface params\");\n        return status;\n    }\n    status = arcface.load_model(\"arcfaceresnet.bin\");\n    if (status != 0)\n    {\n        fprintf(stderr, \"couldn't load arcface model\");\n        return status;\n    }\n\n    if (rgb.empty() || rgb.type() != CV_8UC3)\n    {\n        fprintf(stderr, \"invalid input image!\");\n        return -1;\n    }\n    /*\n    * the arcface model provided in the link has builtin normalization layers,\n    * no need to run substract_mean_normalize\n    *\n    *  reference from .param\n    BinaryOp         _minusscalar0            2 1 data scalar_op2 _minusscalar0 0=1\n    BinaryOp         _mulscalar0              2 1 _minusscalar0 scalar_op3 _mulscalar0 0=2\n    * */\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(\n                       rgb.data,\n                       ncnn::Mat::PIXEL_BGR2RGB,\n                       rgb.cols,\n                       rgb.rows,\n                       112,\n                       112);\n    ncnn::Extractor ex = arcface.create_extractor();\n    ex.input(\"data\", in);\n    ncnn::Mat out;\n    ex.extract(\"fc1\", out);\n    const float* ptr = (const float*)out.data;\n    for (int i = 0; i < 512; i++)\n    {\n        result[i] = ptr[i];\n    }\n    normalize_arcface(result);\n    return 0;\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 3)\n    {\n        fprintf(stderr, \"Usage: %s <face1_path> <face2_path>\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* face1_path = argv[1];\n    const char* face2_path = argv[2];\n\n    int status = 0;\n    cv::Mat face_img1 = cv::imread(face1_path);\n    cv::Mat face_img2 = 
cv::imread(face2_path);\n\n    if (face_img1.empty())\n    {\n        fprintf(stderr, \"Failed to load image: %s\\n\", face1_path);\n        return -1;\n    }\n    if (face_img2.empty())\n    {\n        fprintf(stderr, \"Failed to load image: %s\\n\", face2_path);\n        return -1;\n    }\n\n    cv::Mat input_embed1, input_embed2;\n    DetectionResult res1, res2;\n    std::vector<float> embedding1(512), embedding2(512);\n\n    status = get_face(face_img1, res1);\n    if (status != 0)\n    {\n        fprintf(stderr, \"get face failed for %s!\\n\", face1_path);\n        return -1;\n    }\n    fprintf(stdout, \"found faces in face1: %d\\n\", (int)res1.bboxes.size());\n    for (size_t i = 0; i < res1.bboxes.size(); i++)\n    {\n        print_bbox(res1.bboxes[i]);\n    }\n\n    status = get_face(face_img2, res2);\n    if (status != 0)\n    {\n        fprintf(stderr, \"get face failed for %s!\\n\", face2_path);\n        return -1;\n    }\n    fprintf(stdout, \"found faces in face2: %d\\n\", (int)res2.bboxes.size());\n    for (size_t i = 0; i < res2.bboxes.size(); i++)\n    {\n        print_bbox(res2.bboxes[i]);\n    }\n\n    // align the first detected face of each image, then extract its embedding\n    status = norm_crop(input_embed1, face_img1, res1.keypoints[0].data());\n    if (status != 0)\n    {\n        fprintf(stderr, \"norm_crop failed for %s!\\n\", face1_path);\n        return -1;\n    }\n    status = get_embedding(input_embed1, embedding1);\n    if (status != 0)\n    {\n        fprintf(stderr, \"get embedding failed for %s!\\n\", face1_path);\n        return -1;\n    }\n\n    status = norm_crop(input_embed2, face_img2, res2.keypoints[0].data());\n    if (status != 0)\n    {\n        fprintf(stderr, \"norm_crop failed for %s!\\n\", face2_path);\n        return -1;\n    }\n    status = get_embedding(input_embed2, embedding2);\n    if (status != 0)\n    {\n        fprintf(stderr, \"get embedding failed for %s!\\n\", face2_path);\n        return -1;\n    }\n\n    float similarity = get_similarity(embedding1, embedding2);\n    fprintf(stdout, 
\"Similarity: %f\\n\", similarity);\n}\n"
  },
  {
    "path": "examples/fasterrcnn.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#include <math.h>\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; 
j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic int detect_fasterrcnn(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net fasterrcnn;\n\n    fasterrcnn.opt.use_vulkan_compute = true;\n\n    // original pretrained model from https://github.com/rbgirshick/py-faster-rcnn\n    // py-faster-rcnn/models/pascal_voc/ZF/faster_rcnn_alt_opt/faster_rcnn_test.pt\n    // https://dl.dropboxusercontent.com/s/o6ii098bu51d139/faster_rcnn_models.tgz?dl=0\n    // ZF_faster_rcnn_final.caffemodel\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (fasterrcnn.load_param(\"ZF_faster_rcnn_final.param\"))\n        exit(-1);\n    if (fasterrcnn.load_model(\"ZF_faster_rcnn_final.bin\"))\n        exit(-1);\n\n    // hyper parameters taken from\n    // py-faster-rcnn/lib/fast_rcnn/config.py\n    // py-faster-rcnn/lib/fast_rcnn/test.py\n    const int target_size = 600; // __C.TEST.SCALES\n\n    const int max_per_image = 100;\n    const float confidence_thresh = 0.05f;\n\n    const float nms_threshold = 0.3f; // __C.TEST.NMS\n\n    // scale to target detect size\n    int w = bgr.cols;\n    int h = bgr.rows;\n    float scale = 1.f;\n    if (w < h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, 
ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, w, h);\n\n    const float mean_vals[3] = {102.9801f, 115.9465f, 122.7717f};\n    in.substract_mean_normalize(mean_vals, 0);\n\n    ncnn::Mat im_info(3);\n    im_info[0] = h;\n    im_info[1] = w;\n    im_info[2] = scale;\n\n    // step1, extract feature and all rois\n    ncnn::Extractor ex1 = fasterrcnn.create_extractor();\n\n    ex1.input(\"data\", in);\n    ex1.input(\"im_info\", im_info);\n\n    ncnn::Mat conv5_relu5; // feature\n    ncnn::Mat rois;        // all rois\n    ex1.extract(\"conv5_relu5\", conv5_relu5);\n    ex1.extract(\"rois\", rois);\n\n    // step2, extract bbox and score for each roi\n    std::vector<std::vector<Object> > class_candidates;\n    for (int i = 0; i < rois.c; i++)\n    {\n        ncnn::Extractor ex2 = fasterrcnn.create_extractor();\n\n        ncnn::Mat roi = rois.channel(i); // get single roi\n        ex2.input(\"conv5_relu5\", conv5_relu5);\n        ex2.input(\"rois\", roi);\n\n        ncnn::Mat bbox_pred;\n        ncnn::Mat cls_prob;\n        ex2.extract(\"bbox_pred\", bbox_pred);\n        ex2.extract(\"cls_prob\", cls_prob);\n\n        int num_class = cls_prob.w;\n        class_candidates.resize(num_class);\n\n        // find class id with highest score\n        int label = 0;\n        float score = 0.f;\n        for (int i = 0; i < num_class; i++)\n        {\n            float class_score = cls_prob[i];\n            if (class_score > score)\n            {\n                label = i;\n                score = class_score;\n            }\n        }\n\n        // ignore background or low score\n        if (label == 0 || score <= confidence_thresh)\n            continue;\n\n        //         fprintf(stderr, \"%d = %f\\n\", label, score);\n\n        // unscale to image size\n        float x1 = roi[0] / scale;\n        float y1 = roi[1] / scale;\n        float x2 = roi[2] / scale;\n        float y2 = roi[3] / scale;\n\n        float pb_w = x2 - x1 + 1;\n        float pb_h = y2 - y1 + 1;\n\n     
   // apply bbox regression\n        float dx = bbox_pred[label * 4];\n        float dy = bbox_pred[label * 4 + 1];\n        float dw = bbox_pred[label * 4 + 2];\n        float dh = bbox_pred[label * 4 + 3];\n\n        float cx = x1 + pb_w * 0.5f;\n        float cy = y1 + pb_h * 0.5f;\n\n        float obj_cx = cx + pb_w * dx;\n        float obj_cy = cy + pb_h * dy;\n\n        float obj_w = pb_w * exp(dw);\n        float obj_h = pb_h * exp(dh);\n\n        float obj_x1 = obj_cx - obj_w * 0.5f;\n        float obj_y1 = obj_cy - obj_h * 0.5f;\n        float obj_x2 = obj_cx + obj_w * 0.5f;\n        float obj_y2 = obj_cy + obj_h * 0.5f;\n\n        // clip\n        obj_x1 = std::max(std::min(obj_x1, (float)(bgr.cols - 1)), 0.f);\n        obj_y1 = std::max(std::min(obj_y1, (float)(bgr.rows - 1)), 0.f);\n        obj_x2 = std::max(std::min(obj_x2, (float)(bgr.cols - 1)), 0.f);\n        obj_y2 = std::max(std::min(obj_y2, (float)(bgr.rows - 1)), 0.f);\n\n        // append object\n        Object obj;\n        obj.rect = cv::Rect_<float>(obj_x1, obj_y1, obj_x2 - obj_x1 + 1, obj_y2 - obj_y1 + 1);\n        obj.label = label;\n        obj.prob = score;\n\n        class_candidates[label].push_back(obj);\n    }\n\n    // post process\n    objects.clear();\n    for (int i = 0; i < (int)class_candidates.size(); i++)\n    {\n        std::vector<Object>& candidates = class_candidates[i];\n\n        qsort_descent_inplace(candidates);\n\n        std::vector<int> picked;\n        nms_sorted_bboxes(candidates, picked, nms_threshold);\n\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            int z = picked[j];\n            objects.push_back(candidates[z]);\n        }\n    }\n\n    qsort_descent_inplace(objects);\n\n    if (max_per_image > 0 && max_per_image < (int)objects.size())\n    {\n        objects.resize(max_per_image);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = 
{\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        
fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_fasterrcnn(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/mobilenetssd.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int detect_mobilenet(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net mobilenet;\n\n    mobilenet.opt.use_vulkan_compute = true;\n\n    // model is converted from https://github.com/chuanqi305/MobileNet-SSD\n    // and can be downloaded from https://drive.google.com/open?id=0ByaKLD9QaPtucWk0Y0dha1VVY0U\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (mobilenet.load_param(\"mobilenet_ssd_voc_ncnn.param\"))\n        exit(-1);\n    if (mobilenet.load_model(\"mobilenet_ssd_voc_ncnn.bin\"))\n        exit(-1);\n\n    const int target_size = 300;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, target_size, target_size);\n\n    const float mean_vals[3] = {127.5f, 127.5f, 127.5f};\n    const float norm_vals[3] = {1.0 / 127.5, 1.0 / 127.5, 1.0 / 127.5};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = mobilenet.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"detection_out\", out);\n\n    //     printf(\"%d %d %d\\n\", out.w, out.h, out.c);\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        object.prob = values[1];\n        object.rect.x = values[2] * img_w;\n        object.rect.y = values[3] * img_h;\n        object.rect.width = values[4] * img_w - object.rect.x;\n        
object.rect.height = values[5] * img_h - object.rect.y;\n\n        objects.push_back(object);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** 
argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_mobilenet(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/mobilenetv2ssdlite.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nclass Noop : public ncnn::Layer\n{\n};\nDEFINE_LAYER_CREATOR(Noop)\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int detect_mobilenetv2(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net mobilenetv2;\n\n    mobilenetv2.opt.use_vulkan_compute = true;\n\n    mobilenetv2.register_custom_layer(\"Silence\", Noop_layer_creator);\n\n    // original pretrained model from https://github.com/chuanqi305/MobileNetv2-SSDLite\n    // https://github.com/chuanqi305/MobileNetv2-SSDLite/blob/master/ssdlite/voc/deploy.prototxt\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (mobilenetv2.load_param(\"mobilenetv2_ssdlite_voc.param\"))\n        exit(-1);\n    if (mobilenetv2.load_model(\"mobilenetv2_ssdlite_voc.bin\"))\n        exit(-1);\n\n    const int target_size = 300;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, target_size, target_size);\n\n    const float mean_vals[3] = {127.5f, 127.5f, 127.5f};\n    const float norm_vals[3] = {1.0 / 127.5, 1.0 / 127.5, 1.0 / 127.5};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = mobilenetv2.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"detection_out\", out);\n\n    //     printf(\"%d %d %d\\n\", out.w, out.h, out.c);\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        object.prob = 
values[1];\n        object.rect.x = values[2] * img_w;\n        object.rect.y = values[3] * img_h;\n        object.rect.width = values[4] * img_w - object.rect.x;\n        object.rect.height = values[5] * img_h - object.rect.y;\n\n        objects.push_back(object);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n       
             cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_mobilenetv2(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/mobilenetv3ssdlite.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n#include \"platform.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <assert.h>\n#include <stdio.h>\n#include <vector>\n#if NCNN_VULKAN\n#include \"gpu.h\"\n#endif // NCNN_VULKAN\n\ntemplate<class T>\nconst T& clamp(const T& v, const T& lo, const T& hi)\n{\n    assert(!(hi < lo));\n    return v < lo ? lo : hi < v ? hi : v;\n}\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int detect_mobilenetv3(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net mobilenetv3;\n\n#if NCNN_VULKAN\n    mobilenetv3.opt.use_vulkan_compute = true;\n#endif // NCNN_VULKAN\n\n    // converted ncnn model from https://github.com/ujsyehao/mobilenetv3-ssd\n    if (mobilenetv3.load_param(\"./mobilenetv3_ssdlite_voc.param\"))\n        exit(-1);\n    if (mobilenetv3.load_model(\"./mobilenetv3_ssdlite_voc.bin\"))\n        exit(-1);\n\n    const int target_size = 300;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, bgr.cols, bgr.rows, target_size, target_size);\n\n    const float mean_vals[3] = {123.675f, 116.28f, 103.53f};\n    const float norm_vals[3] = {1.0f, 1.0f, 1.0f};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = mobilenetv3.create_extractor();\n\n    ex.input(\"input\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"detection_out\", out);\n\n    //     printf(\"%d %d %d\\n\", out.w, out.h, out.c);\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        object.prob = values[1];\n\n        // filter out cross-boundary\n        float x1 = 
clamp(values[2] * target_size, 0.f, float(target_size - 1)) / target_size * img_w;\n        float y1 = clamp(values[3] * target_size, 0.f, float(target_size - 1)) / target_size * img_h;\n        float x2 = clamp(values[4] * target_size, 0.f, float(target_size - 1)) / target_size * img_w;\n        float y2 = clamp(values[5] * target_size, 0.f, float(target_size - 1)) / target_size * img_h;\n\n        object.rect.x = x1;\n        object.rect.y = y1;\n        object.rect.width = x2 - x1;\n        object.rect.height = y2 - y1;\n\n        objects.push_back(object);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        if (objects[i].prob > 0.6)\n        {\n            const Object& obj = objects[i];\n\n            fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                    obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n            cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n            char text[256];\n            sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n            int baseLine = 0;\n            cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n            int x = obj.rect.x;\n            int y = obj.rect.y - label_size.height - 
baseLine;\n            if (y < 0)\n                y = 0;\n            if (x + label_size.width > image.cols)\n                x = image.cols - label_size.width;\n\n            cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                          cv::Scalar(255, 255, 255), -1);\n\n            cv::putText(image, text, cv::Point(x, y + label_size.height),\n                        cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n        }\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_mobilenetv3(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/nanodet.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdlib.h>\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n    
    const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& cls_pred, const ncnn::Mat& dis_pred, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_grid = cls_pred.h;\n\n    int num_grid_x;\n    int num_grid_y;\n    if (in_pad.w > in_pad.h)\n    {\n        num_grid_x = in_pad.w / stride;\n        num_grid_y = num_grid / num_grid_x;\n    }\n    else\n    {\n        num_grid_y = in_pad.h / stride;\n        num_grid_x = num_grid / num_grid_y;\n    }\n\n    const int num_class = cls_pred.w;\n    const int reg_max_1 = dis_pred.w / 4;\n\n    for (int i = 0; i < num_grid_y; i++)\n    {\n        for (int j = 0; j < num_grid_x; j++)\n        {\n            const int idx = i * num_grid_x + j;\n\n            const float* scores = cls_pred.row(idx);\n\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            for (int k = 0; k < num_class; k++)\n            {\n                if (scores[k] > score)\n                {\n                    label = k;\n                    score = scores[k];\n                }\n            }\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat bbox_pred(reg_max_1, 4, (void*)dis_pred.row(idx));\n                {\n                    ncnn::Layer* softmax = 
ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(bbox_pred, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = bbox_pred.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n                float pb_cx = (j + 0.5f) * stride;\n                float pb_cy = (i + 0.5f) * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                float y1 = pb_cy + pred_ltrb[3];\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic int detect_nanodet(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net nanodet;\n\n    nanodet.opt.use_vulkan_compute = true;\n    // nanodet.opt.use_bf16_storage = true;\n\n    // original pretrained model from https://github.com/RangiLyu/nanodet\n    // the ncnn model 
https://github.com/nihui/ncnn-assets/tree/master/models\n    if (nanodet.load_param(\"nanodet_m.param\"))\n        exit(-1);\n    if (nanodet.load_model(\"nanodet_m.bin\"))\n        exit(-1);\n\n    int width = bgr.cols;\n    int height = bgr.rows;\n\n    const int target_size = 320;\n    const float prob_threshold = 0.4f;\n    const float nms_threshold = 0.5f;\n\n    // resize so the longer side equals target_size, keeping the aspect ratio\n    int w = width;\n    int h = height;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, width, height, w, h);\n\n    // pad width and height up to a multiple of 32\n    int wpad = (w + 31) / 32 * 32 - w;\n    int hpad = (h + 31) / 32 * 32 - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 0.f);\n\n    const float mean_vals[3] = {103.53f, 116.28f, 123.675f};\n    const float norm_vals[3] = {0.017429f, 0.017507f, 0.017125f};\n    in_pad.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = nanodet.create_extractor();\n\n    ex.input(\"input.1\", in_pad);\n\n    std::vector<Object> proposals;\n\n    // stride 8\n    {\n        ncnn::Mat cls_pred;\n        ncnn::Mat dis_pred;\n        ex.extract(\"792\", cls_pred);\n        ex.extract(\"795\", dis_pred);\n\n        std::vector<Object> objects8;\n        generate_proposals(cls_pred, dis_pred, 8, in_pad, prob_threshold, objects8);\n\n        proposals.insert(proposals.end(), objects8.begin(), objects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat cls_pred;\n        ncnn::Mat dis_pred;\n        ex.extract(\"814\", cls_pred);\n        ex.extract(\"817\", dis_pred);\n\n        std::vector<Object> objects16;\n        
generate_proposals(cls_pred, dis_pred, 16, in_pad, prob_threshold, objects16);\n\n        proposals.insert(proposals.end(), objects16.begin(), objects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat cls_pred;\n        ncnn::Mat dis_pred;\n        ex.extract(\"836\", cls_pred);\n        ex.extract(\"839\", dis_pred);\n\n        std::vector<Object> objects32;\n        generate_proposals(cls_pred, dis_pred, 32, in_pad, prob_threshold, objects32);\n\n        proposals.insert(proposals.end(), objects32.begin(), objects32.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(width - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(height - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(width - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(height - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n       
 \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, 
text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_nanodet(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/nanodetplus_pnnx.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdlib.h>\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n    
    const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + exp(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_grid = pred.h;\n\n    int num_grid_x = pred.w;\n    int num_grid_y = pred.h;\n\n    const int num_class = 80; // number of classes. 
80 for COCO\n    const int reg_max_1 = (pred.c - num_class) / 4;\n\n    for (int i = 0; i < num_grid_y; i++)\n    {\n        for (int j = 0; j < num_grid_x; j++)\n        {\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            for (int k = 0; k < num_class; k++)\n            {\n                float s = pred.channel(k).row(i)[j];\n                if (s > score)\n                {\n                    label = k;\n                    score = s;\n                }\n            }\n\n            score = sigmoid(score);\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat bbox_pred(reg_max_1, 4);\n                for (int k = 0; k < reg_max_1 * 4; k++)\n                {\n                    bbox_pred[k] = pred.channel(num_class + k).row(i)[j];\n                }\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(bbox_pred, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = bbox_pred.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n                float pb_cx = j * stride;\n                
float pb_cy = i * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                float y1 = pb_cy + pred_ltrb[3];\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic int detect_nanodet(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net nanodet;\n\n    nanodet.opt.use_vulkan_compute = true;\n    // nanodet.opt.use_bf16_storage = true;\n\n    // original pretrained model from https://github.com/RangiLyu/nanodet\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    //     nanodet.load_param(\"nanodet-plus-m_320.torchscript.ncnn.param\");\n    //     nanodet.load_model(\"nanodet-plus-m_320.torchscript.ncnn.bin\");\n    if (nanodet.load_param(\"nanodet-plus-m_416.torchscript.ncnn.param\"))\n        exit(-1);\n    if (nanodet.load_model(\"nanodet-plus-m_416.torchscript.ncnn.bin\"))\n        exit(-1);\n\n    int width = bgr.cols;\n    int height = bgr.rows;\n\n    //     const int target_size = 320;\n    const int target_size = 416;\n    const float prob_threshold = 0.4f;\n    const float nms_threshold = 0.5f;\n\n    // letterbox: scale the longer side to target_size, keeping the aspect ratio\n    int w = width;\n    int h = height;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, width, height, w, h);\n\n    // pad width and height up to the next multiple of 32\n    int wpad = (w + 31) / 32 * 32 - w;\n    
int hpad = (h + 31) / 32 * 32 - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 0.f);\n\n    const float mean_vals[3] = {103.53f, 116.28f, 123.675f};\n    const float norm_vals[3] = {0.017429f, 0.017507f, 0.017125f};\n    in_pad.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = nanodet.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    std::vector<Object> proposals;\n\n    // stride 8\n    {\n        ncnn::Mat pred;\n        ex.extract(\"231\", pred);\n\n        std::vector<Object> objects8;\n        generate_proposals(pred, 8, in_pad, prob_threshold, objects8);\n\n        proposals.insert(proposals.end(), objects8.begin(), objects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat pred;\n        ex.extract(\"228\", pred);\n\n        std::vector<Object> objects16;\n        generate_proposals(pred, 16, in_pad, prob_threshold, objects16);\n\n        proposals.insert(proposals.end(), objects16.begin(), objects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat pred;\n        ex.extract(\"225\", pred);\n\n        std::vector<Object> objects32;\n        generate_proposals(pred, 32, in_pad, prob_threshold, objects32);\n\n        proposals.insert(proposals.end(), objects32.begin(), objects32.end());\n    }\n\n    // stride 64\n    {\n        ncnn::Mat pred;\n        ex.extract(\"222\", pred);\n\n        std::vector<Object> objects64;\n        generate_proposals(pred, 64, in_pad, prob_threshold, objects64);\n\n        proposals.insert(proposals.end(), objects64.begin(), objects64.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    
{\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(width - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(height - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(width - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(height - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", 
\"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_nanodet(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/p2pnet.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdlib.h>\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct CrowdPoint\n{\n    cv::Point pt;\n    float prob;\n};\n\nstatic void shift(int w, int h, int stride, std::vector<float> anchor_points, std::vector<float>& shifted_anchor_points)\n{\n    std::vector<float> x_, y_;\n    for (int i = 0; i < w; i++)\n    {\n        float x = (i + 0.5) * stride;\n        x_.push_back(x);\n    }\n    for (int i = 0; i < h; i++)\n    {\n        float y = (i + 0.5) * stride;\n        y_.push_back(y);\n    }\n\n    std::vector<float> shift_x((size_t)w * h, 0), shift_y((size_t)w * h, 0);\n    for (int i = 0; i < h; i++)\n    {\n        for (int j = 0; j < w; j++)\n        {\n            shift_x[i * w + j] = x_[j];\n        }\n    }\n    for (int i = 0; i < h; i++)\n    {\n        for (int j = 0; j < w; j++)\n        {\n            shift_y[i * w + j] = y_[i];\n        }\n    }\n\n    std::vector<float> shifts((size_t)w * h * 2, 0);\n    for (int i = 0; i < w * h; i++)\n    {\n        shifts[i * 2] = shift_x[i];\n        shifts[i * 2 + 1] = shift_y[i];\n    }\n\n    shifted_anchor_points.resize((size_t)2 * w * h * anchor_points.size() / 2, 0);\n    for (int i = 0; i < w * h; i++)\n    {\n        for (int j = 0; j < anchor_points.size() / 2; j++)\n        {\n            float x = anchor_points[j * 2] + shifts[i * 2];\n            float y = anchor_points[j * 2 + 1] + shifts[i * 2 + 1];\n            shifted_anchor_points[i * anchor_points.size() / 2 * 2 + j * 2] = x;\n            shifted_anchor_points[i * anchor_points.size() / 2 * 2 + j * 2 + 1] = y;\n        }\n    }\n}\nstatic void generate_anchor_points(int stride, int row, int line, std::vector<float>& 
anchor_points)\n{\n    float row_step = (float)stride / row;\n    float line_step = (float)stride / line;\n\n    std::vector<float> x_, y_;\n    for (int i = 1; i < line + 1; i++)\n    {\n        float x = (i - 0.5) * line_step - stride / 2;\n        x_.push_back(x);\n    }\n    for (int i = 1; i < row + 1; i++)\n    {\n        float y = (i - 0.5) * row_step - stride / 2;\n        y_.push_back(y);\n    }\n    std::vector<float> shift_x((size_t)row * line, 0), shift_y((size_t)row * line, 0);\n    for (int i = 0; i < row; i++)\n    {\n        for (int j = 0; j < line; j++)\n        {\n            shift_x[i * line + j] = x_[j];\n        }\n    }\n    for (int i = 0; i < row; i++)\n    {\n        for (int j = 0; j < line; j++)\n        {\n            shift_y[i * line + j] = y_[i];\n        }\n    }\n    anchor_points.resize((size_t)row * line * 2, 0);\n    for (int i = 0; i < row * line; i++)\n    {\n        float x = shift_x[i];\n        float y = shift_y[i];\n        anchor_points[i * 2] = x;\n        anchor_points[i * 2 + 1] = y;\n    }\n}\nstatic void generate_anchor_points(int img_w, int img_h, std::vector<int> pyramid_levels, int row, int line, std::vector<float>& all_anchor_points)\n{\n    std::vector<std::pair<int, int> > image_shapes;\n    std::vector<int> strides;\n    for (int i = 0; i < pyramid_levels.size(); i++)\n    {\n        int new_h = std::floor((img_h + std::pow(2, pyramid_levels[i]) - 1) / std::pow(2, pyramid_levels[i]));\n        int new_w = std::floor((img_w + std::pow(2, pyramid_levels[i]) - 1) / std::pow(2, pyramid_levels[i]));\n        image_shapes.push_back(std::make_pair(new_w, new_h));\n        strides.push_back(std::pow(2, pyramid_levels[i]));\n    }\n\n    all_anchor_points.clear();\n    for (int i = 0; i < pyramid_levels.size(); i++)\n    {\n        std::vector<float> anchor_points;\n        generate_anchor_points(std::pow(2, pyramid_levels[i]), row, line, anchor_points);\n        std::vector<float> shifted_anchor_points;\n        
shift(image_shapes[i].first, image_shapes[i].second, strides[i], anchor_points, shifted_anchor_points);\n        all_anchor_points.insert(all_anchor_points.end(), shifted_anchor_points.begin(), shifted_anchor_points.end());\n    }\n}\n\nstatic int detect_crowd(const cv::Mat& bgr, std::vector<CrowdPoint>& crowd_points)\n{\n    ncnn::Option opt;\n    opt.num_threads = 4;\n    opt.use_vulkan_compute = false;\n    opt.use_bf16_storage = false;\n\n    ncnn::Net net;\n    net.opt = opt;\n\n    // model is converted from\n    // https://github.com/TencentYoutuResearch/CrowdCounting-P2PNet\n    // the ncnn model  https://pan.baidu.com/s/1O1CBgvY6yJkrK8Npxx3VMg pwd: ezhx\n    if (net.load_param(\"p2pnet.param\"))\n        exit(-1);\n    if (net.load_model(\"p2pnet.bin\"))\n        exit(-1);\n\n    int width = bgr.cols;\n    int height = bgr.rows;\n\n    int new_width = width / 128 * 128;\n    int new_height = height / 128 * 128;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, width, height, new_width, new_height);\n\n    std::vector<int> pyramid_levels(1, 3);\n    std::vector<float> all_anchor_points;\n    generate_anchor_points(in.w, in.h, pyramid_levels, 2, 2, all_anchor_points);\n\n    ncnn::Mat anchor_points = ncnn::Mat(2, all_anchor_points.size() / 2, all_anchor_points.data());\n\n    ncnn::Extractor ex = net.create_extractor();\n    const float mean_vals1[3] = {123.675f, 116.28f, 103.53f};\n    const float norm_vals1[3] = {0.01712475f, 0.0175f, 0.01742919f};\n\n    in.substract_mean_normalize(mean_vals1, norm_vals1);\n\n    ex.input(\"input\", in);\n    ex.input(\"anchor\", anchor_points);\n\n    ncnn::Mat score, points;\n    ex.extract(\"pred_scores\", score);\n    ex.extract(\"pred_points\", points);\n\n    for (int i = 0; i < points.h; i++)\n    {\n        float* score_data = score.row(i);\n        float* points_data = points.row(i);\n        CrowdPoint cp;\n        int x = points_data[0] / new_width * width;\n        int y = 
points_data[1] / new_height * height;\n        cp.pt = cv::Point(x, y);\n        cp.prob = score_data[1];\n        crowd_points.push_back(cp);\n    }\n\n    return 0;\n}\n\nstatic void draw_result(const cv::Mat& bgr, const std::vector<CrowdPoint>& crowd_points)\n{\n    cv::Mat image = bgr.clone();\n    const float threshold = 0.5f;\n    for (size_t i = 0; i < crowd_points.size(); i++)\n    {\n        if (crowd_points[i].prob > threshold)\n        {\n            cv::circle(image, crowd_points[i].pt, 4, cv::Scalar(0, 0, 255), -1, 8, 0);\n        }\n    }\n    cv::imshow(\"image\", image);\n    cv::waitKey();\n}\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat bgr = cv::imread(imagepath, 1);\n    if (bgr.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<CrowdPoint> crowd_points;\n    detect_crowd(bgr, crowd_points);\n    draw_result(bgr, crowd_points);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/peleenetssd_seg.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int detect_peleenet(const cv::Mat& bgr, std::vector<Object>& objects, ncnn::Mat& resized)\n{\n    ncnn::Net peleenet;\n\n    peleenet.opt.use_vulkan_compute = true;\n\n    // model is converted from https://github.com/eric612/MobileNet-YOLO\n    // and can be downloaded from https://drive.google.com/open?id=1Wt6jKv13sBRMHgrGAJYlOlRF-o80pC0g\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (peleenet.load_param(\"pelee.param\"))\n        exit(-1);\n    if (peleenet.load_model(\"pelee.bin\"))\n        exit(-1);\n\n    const int target_size = 304;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, target_size, target_size);\n\n    const float mean_vals[3] = {103.9f, 116.7f, 123.6f};\n    const float norm_vals[3] = {0.017f, 0.017f, 0.017f};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = peleenet.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"detection_out\", out);\n\n    //     printf(\"%d %d %d\\n\", out.w, out.h, out.c);\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        object.prob = values[1];\n        object.rect.x = values[2] * img_w;\n        object.rect.y = values[3] * img_h;\n        object.rect.width = values[4] * img_w - object.rect.x;\n        object.rect.height = values[5] * img_h - 
object.rect.y;\n\n        objects.push_back(object);\n    }\n    ncnn::Mat seg_out;\n    ex.extract(\"sigmoid\", seg_out);\n    resize_bilinear(seg_out, resized, img_w, img_h);\n    //resize_bicubic(seg_out,resized,img_w,img_h); // sharpness\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects, ncnn::Mat map)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"person\", \"rider\", \"car\", \"bus\",\n                                        \"truck\", \"bike\", \"motor\",\n                                        \"traffic light\", \"traffic sign\", \"train\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n    const int color[] = {128, 255, 128, 244, 35, 232};\n    const int color_count = sizeof(color) / sizeof(int);\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n    int width = map.w;\n  
  int height = map.h;\n    int size = map.c;\n    int img_index2 = 0;\n    float threshold = 0.45;\n    for (int i = 0; i < height; i++)\n    {\n        unsigned char* ptr1 = image.ptr<unsigned char>(i);\n        int img_index1 = 0;\n        for (int j = 0; j < width; j++)\n        {\n            float maxima = threshold;\n            int index = -1;\n            for (int c = 0; c < size; c++)\n            {\n                // use channel() so the aligned per-channel stride (cstep) is respected\n                const float* ptr3 = map.channel(c);\n                if (ptr3[img_index2] > maxima)\n                {\n                    maxima = ptr3[img_index2];\n                    index = c;\n                }\n            }\n            if (index > -1)\n            {\n                int color_index = (index)*3;\n                if (color_index < color_count)\n                {\n                    int b = color[color_index];\n                    int g = color[color_index + 1];\n                    int r = color[color_index + 2];\n                    ptr1[img_index1] = b / 2 + ptr1[img_index1] / 2;\n                    ptr1[img_index1 + 1] = g / 2 + ptr1[img_index1 + 1] / 2;\n                    ptr1[img_index1 + 2] = r / 2 + ptr1[img_index1 + 2] / 2;\n                }\n            }\n            img_index1 += 3;\n            img_index2++;\n        }\n    }\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    ncnn::Mat seg_out;\n    detect_peleenet(m, objects, seg_out);\n\n    draw_objects(m, objects, seg_out);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/piper.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// convert piper checkpoints to ncnn models\n//  1. checkout https://github.com/OHF-Voice/piper1-gpl (113931937cf235fc881afd1ca4be209bc6919bc7)\n//  2. apply patch piper1-gpl.patch from https://github.com/nihui/ncnn-android-piper\n//  3. setup piper with\n//      python3 -m venv .venv\n//      source .venv/bin/activate\n//      python3 -m pip install -e .[train]\n//  4. download piper checkpoint file (*.ckpt) from https://huggingface.co/datasets/rhasspy/piper-checkpoints\n//  5. install pnnx via pip install -U pnnx\n//  6. obtain export_ncnn.py script from https://github.com/nihui/ncnn-android-piper\n//      python export_ncnn.py en.ckpt\n\n// convert word list to simple phonemizer dict\n//  1. prepare word list from https://github.com/Alexir/CMUdict\n//  2. for each word, get phonemes via command \"./espeak-ng -q -v en-us --ipa word\"\n//  3. obtain config.json file from https://huggingface.co/datasets/rhasspy/piper-checkpoints\n//  4. replace phonemes with ids according to phoneme_id_map in config.json\n//  5. 
write dict binary\n//      word1 \\0x00 ids1 \\0xff word2 \\0x00 ids2 \\0xff .....\n\n#include \"layer.h\"\n#include \"mat.h\"\n#include \"net.h\"\n\n#include <ctype.h>\n#include <stdio.h>\n#include <map>\n#include <vector>\n\nclass relative_embeddings_k_module : public ncnn::Layer\n{\npublic:\n    relative_embeddings_k_module()\n    {\n        one_blob_only = true;\n    }\n\n    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt) const\n    {\n        const int window_size = 4;\n\n        const int wsize = bottom_blob.w;\n        const int len = bottom_blob.h;\n        const int num_heads = bottom_blob.c;\n\n        top_blob.create(len, len, num_heads);\n\n        top_blob.fill(0.f);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < num_heads; q++)\n        {\n            const ncnn::Mat x0 = bottom_blob.channel(q);\n            ncnn::Mat out0 = top_blob.channel(q);\n\n            for (int i = 0; i < len; i++)\n            {\n                const float* xptr = x0.row(i) + std::max(0, window_size - i);\n                float* outptr = out0.row(i) + std::max(i - window_size, 0);\n                const int wsize2 = std::min(len, i - window_size + wsize) - std::max(i - window_size, 0);\n                for (int j = 0; j < wsize2; j++)\n                {\n                    *outptr++ = *xptr++;\n                }\n            }\n        }\n\n        return 0;\n    }\n};\n\nDEFINE_LAYER_CREATOR(relative_embeddings_k_module)\n\nclass relative_embeddings_v_module : public ncnn::Layer\n{\npublic:\n    relative_embeddings_v_module()\n    {\n        one_blob_only = true;\n    }\n\n    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt) const\n    {\n        const int window_size = 4;\n\n        const int wsize = window_size * 2 + 1;\n        const int len = bottom_blob.h;\n        const int num_heads = bottom_blob.c;\n\n        
top_blob.create(wsize, len, num_heads);\n\n        top_blob.fill(0.f);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < num_heads; q++)\n        {\n            const ncnn::Mat x0 = bottom_blob.channel(q);\n            ncnn::Mat out0 = top_blob.channel(q);\n\n            for (int i = 0; i < len; i++)\n            {\n                const float* xptr = x0.row(i) + std::max(i - window_size, 0);\n                float* outptr = out0.row(i) + std::max(0, window_size - i);\n                const int wsize2 = std::min(len, i - window_size + wsize) - std::max(i - window_size, 0);\n                for (int j = 0; j < wsize2; j++)\n                {\n                    *outptr++ = *xptr++;\n                }\n            }\n        }\n\n        return 0;\n    }\n};\n\nDEFINE_LAYER_CREATOR(relative_embeddings_v_module)\n\nclass piecewise_rational_quadratic_transform_module : public ncnn::Layer\n{\npublic:\n    piecewise_rational_quadratic_transform_module()\n    {\n        one_blob_only = false;\n    }\n\n    virtual int forward(const std::vector<ncnn::Mat>& bottom_blobs, std::vector<ncnn::Mat>& top_blobs, const ncnn::Option& opt) const\n    {\n        const ncnn::Mat& h = bottom_blobs[0];\n        const ncnn::Mat& x1 = bottom_blobs[1];\n        ncnn::Mat& outputs = top_blobs[0];\n\n        const int num_bins = 10;\n        const int filter_channels = 192;\n        const bool reverse = true;\n        const float tail_bound = 5.0f;\n        const float DEFAULT_MIN_BIN_WIDTH = 1e-3f;\n        const float DEFAULT_MIN_BIN_HEIGHT = 1e-3f;\n        const float DEFAULT_MIN_DERIVATIVE = 1e-3f;\n\n        const int batch_size = x1.w;\n        const int h_params_per_item = 2 * num_bins + (num_bins - 1); // 29\n\n        outputs = x1.clone();\n\n        float* out_ptr = outputs;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < batch_size; ++i)\n        {\n            const float current_x = 
((const float*)x1)[i];\n\n            const float* h_data = h.row(i);\n\n            if (current_x < -tail_bound || current_x > tail_bound)\n            {\n                continue;\n            }\n\n            std::vector<float> unnormalized_widths(num_bins);\n            std::vector<float> unnormalized_heights(num_bins);\n            std::vector<float> unnormalized_derivatives(num_bins + 1);\n\n            const float inv_sqrt_filter_channels = 1.0f / sqrtf(filter_channels);\n            for (int j = 0; j < num_bins; ++j)\n            {\n                unnormalized_widths[j] = h_data[j] * inv_sqrt_filter_channels;\n            }\n            for (int j = 0; j < num_bins; ++j)\n            {\n                unnormalized_heights[j] = h_data[num_bins + j] * inv_sqrt_filter_channels;\n            }\n            for (int j = 0; j < num_bins - 1; ++j)\n            {\n                unnormalized_derivatives[j + 1] = h_data[2 * num_bins + j];\n            }\n\n            const float constant = logf(expf(1.f - DEFAULT_MIN_DERIVATIVE) - 1.f);\n            unnormalized_derivatives[0] = constant;\n            unnormalized_derivatives[num_bins] = constant;\n\n            const float left = -tail_bound, right = tail_bound;\n            const float bottom = -tail_bound, top = tail_bound;\n\n            // Softmax + Affine\n            std::vector<float> widths(num_bins);\n            float w_max = -INFINITY;\n            for (float val : unnormalized_widths) w_max = std::max(w_max, val);\n            float w_sum = 0.f;\n            for (int j = 0; j < num_bins; ++j)\n            {\n                widths[j] = expf(unnormalized_widths[j] - w_max);\n                w_sum += widths[j];\n            }\n            for (int j = 0; j < num_bins; ++j)\n            {\n                widths[j] = DEFAULT_MIN_BIN_WIDTH + (1.f - DEFAULT_MIN_BIN_WIDTH * num_bins) * (widths[j] / w_sum);\n            }\n\n            // cumwidths\n            std::vector<float> cumwidths(num_bins + 
1);\n            cumwidths[0] = left;\n            float current_w_sum = 0.f;\n            for (int j = 0; j < num_bins - 1; ++j)\n            {\n                current_w_sum += widths[j];\n                cumwidths[j + 1] = left + (right - left) * current_w_sum;\n            }\n            cumwidths[num_bins] = right;\n\n            // heights\n            std::vector<float> heights(num_bins);\n            float h_max = -INFINITY;\n            for (float val : unnormalized_heights) h_max = std::max(h_max, val);\n            float h_sum = 0.f;\n            for (int j = 0; j < num_bins; ++j)\n            {\n                heights[j] = expf(unnormalized_heights[j] - h_max);\n                h_sum += heights[j];\n            }\n            for (int j = 0; j < num_bins; ++j)\n            {\n                heights[j] = DEFAULT_MIN_BIN_HEIGHT + (1.f - DEFAULT_MIN_BIN_HEIGHT * num_bins) * (heights[j] / h_sum);\n            }\n\n            // cumheights\n            std::vector<float> cumheights(num_bins + 1);\n            cumheights[0] = bottom;\n            float current_h_sum = 0.f;\n            for (int j = 0; j < num_bins - 1; ++j)\n            {\n                current_h_sum += heights[j];\n                cumheights[j + 1] = bottom + (top - bottom) * current_h_sum;\n            }\n            cumheights[num_bins] = top;\n\n            // Softplus\n            std::vector<float> derivatives(num_bins + 1);\n            for (int j = 0; j < num_bins + 1; ++j)\n            {\n                float x = unnormalized_derivatives[j];\n                derivatives[j] = DEFAULT_MIN_DERIVATIVE + (x > 0 ? 
x + logf(1.f + expf(-x)) : logf(1.f + expf(x)));\n            }\n\n            // bin_idx\n            int bin_idx = 0;\n            if (reverse)\n            {\n                auto it = std::upper_bound(cumheights.begin(), cumheights.end(), current_x);\n                bin_idx = std::distance(cumheights.begin(), it) - 1;\n            }\n            else\n            {\n                auto it = std::upper_bound(cumwidths.begin(), cumwidths.end(), current_x);\n                bin_idx = std::distance(cumwidths.begin(), it) - 1;\n            }\n            bin_idx = std::max(0, std::min(bin_idx, num_bins - 1));\n\n            // collect coeffs\n            const float input_cumwidths = cumwidths[bin_idx];\n            const float input_bin_widths = cumwidths[bin_idx + 1] - cumwidths[bin_idx];\n            const float input_cumheights = cumheights[bin_idx];\n            const float input_heights = cumheights[bin_idx + 1] - cumheights[bin_idx];\n            const float input_derivatives = derivatives[bin_idx];\n            const float input_derivatives_plus_one = derivatives[bin_idx + 1];\n            const float delta = input_heights / input_bin_widths;\n\n            // apply transform\n            if (reverse)\n            {\n                float a = (current_x - input_cumheights) * (input_derivatives + input_derivatives_plus_one - 2 * delta) + input_heights * (delta - input_derivatives);\n                float b = input_heights * input_derivatives - (current_x - input_cumheights) * (input_derivatives + input_derivatives_plus_one - 2 * delta);\n                float c = -delta * (current_x - input_cumheights);\n                float discriminant = b * b - 4 * a * c;\n                discriminant = std::max(0.f, discriminant);\n                float root = (2 * c) / (-b - sqrtf(discriminant));\n                out_ptr[i] = root * input_bin_widths + input_cumwidths;\n            }\n            else\n            {\n                float theta = (current_x - 
input_cumwidths) / input_bin_widths;\n                float theta_one_minus_theta = theta * (1 - theta);\n                float numerator = input_heights * (delta * theta * theta + input_derivatives * theta_one_minus_theta);\n                float denominator = delta + ((input_derivatives + input_derivatives_plus_one - 2 * delta) * theta_one_minus_theta);\n                out_ptr[i] = input_cumheights + numerator / denominator;\n            }\n        }\n\n        return 0;\n    }\n};\n\nDEFINE_LAYER_CREATOR(piecewise_rational_quadratic_transform_module)\n\nstatic bool is_word_eos(const char* word)\n{\n    const char c = word[0];\n    return c == ',' || c == '.' || c == ';' || c == '?' || c == '!';\n}\n\nstatic void find_word_id(const std::map<unsigned int, std::vector<const char*> >& dict, const char* word, const unsigned char*& ids)\n{\n    ids = 0;\n\n    unsigned char first_char = toupper(word[0]);\n    if (dict.find(first_char) == dict.end())\n        return;\n\n    const std::vector<const char*>& wordlist = dict.at(first_char);\n    for (size_t i = 0; i < wordlist.size(); i++)\n    {\n        if (strcasecmp(wordlist[i], word) == 0)\n        {\n            // hit\n            ids = (const unsigned char*)(wordlist[i] + strlen(wordlist[i]) + 1);\n            return;\n        }\n    }\n}\n\nstatic void simple_phonemize(const char* text, std::vector<int>& sequence_ids)\n{\n    // this is a very simple g2p function, it works for english only\n\n    // load dict buffer\n    std::vector<unsigned char> dictbinbuf;\n    {\n        FILE* fp = fopen(\"en-word_id.bin\", \"rb\");\n        if (!fp)\n            return;\n\n        fseek(fp, 0, SEEK_END);\n        size_t len = ftell(fp);\n        rewind(fp);\n\n        dictbinbuf.resize(len);\n        fread(dictbinbuf.data(), 1, len, fp);\n\n        fclose(fp);\n    }\n\n    // build dict\n    std::map<unsigned int, std::vector<const char*> > dict;\n    {\n        const unsigned char* p = dictbinbuf.data();\n        const 
char* word = (const char*)p;\n        for (size_t i = 0; i < dictbinbuf.size(); i++)\n        {\n            if (dictbinbuf[i] == 0xff)\n            {\n                unsigned int first_char = toupper(word[0]);\n                dict[first_char].push_back(word);\n                word = (const char*)(p + i + 1);\n            }\n        }\n    }\n\n    // phonemize mainpart\n    {\n        const int ID_PAD = 0;   // interleaved\n        const int ID_BOS = 1;   // beginning of sentence\n        const int ID_EOS = 2;   // end of sentence\n        const int ID_SPACE = 3; // space\n\n        bool last_char_is_control = false;\n        bool sentence_begin = true;\n        bool sentence_end = true;\n\n        char word[256];\n\n        const char* p = text;\n        while (*p)\n        {\n            if (sentence_end && !last_char_is_control)\n            {\n                sequence_ids.push_back(ID_BOS);\n                sequence_ids.push_back(ID_PAD);\n                sentence_end = false;\n            }\n\n            if (sentence_begin || last_char_is_control)\n            {\n                // the very first word\n            }\n            else\n            {\n                // space id\n                sequence_ids.push_back(ID_SPACE);\n                sequence_ids.push_back(ID_PAD);\n            }\n\n            if (isalnum((unsigned char)*p))\n            {\n                char* pword = word;\n\n                // alpha or number\n                *pword++ = *p++;\n\n                // consume word\n                int wordlen = 1;\n                while (isalnum((unsigned char)*p) && wordlen < 233)\n                {\n                    *pword++ = *p++;\n                    wordlen++;\n                }\n\n                *pword = '\\0';\n\n                if (is_word_eos(word))\n                {\n                    if (!sentence_end)\n                        sequence_ids.push_back(ID_EOS);\n                    sentence_end = true;\n                    
last_char_is_control = false;\n                    sentence_begin = false;\n                    continue;\n                }\n\n                const unsigned char* ids = 0;\n                find_word_id(dict, word, ids);\n                if (ids)\n                {\n                    const unsigned char* pids = ids;\n                    while (*pids != 0xff)\n                    {\n                        sequence_ids.push_back(*pids);\n                        sequence_ids.push_back(ID_PAD);\n                        pids++;\n                    }\n                }\n                else\n                {\n                    // no such word, spell alphabet one by one\n                    char tmp[2] = {'\\0', '\\0'};\n                    for (size_t i = 0; i < strlen(word); i++)\n                    {\n                        tmp[0] = word[i];\n                        find_word_id(dict, tmp, ids);\n                        if (ids)\n                        {\n                            const unsigned char* pids = ids;\n                            while (*pids != 0xff)\n                            {\n                                sequence_ids.push_back(*pids);\n                                sequence_ids.push_back(ID_PAD);\n                                pids++;\n                            }\n                            if (i + 1 != strlen(word))\n                            {\n                                sequence_ids.push_back(ID_SPACE);\n                                sequence_ids.push_back(ID_PAD);\n                            }\n                        }\n                        else\n                        {\n                            fprintf(stderr, \"word char %c not recognized\\n\", word[i]);\n                        }\n                    }\n                }\n\n                last_char_is_control = false;\n                sentence_begin = false;\n                continue;\n            }\n            else\n            {\n                // 
skip control character\n                p++;\n                last_char_is_control = true;\n            }\n        }\n\n        if (!sentence_end)\n            sequence_ids.push_back(ID_EOS);\n    }\n}\n\nstatic void path_attention(const ncnn::Mat& logw, const ncnn::Mat& m_p, const ncnn::Mat& logs_p, float noise_scale, float length_scale, ncnn::Mat& z_p)\n{\n    const int x_lengths = logw.w;\n\n    // assert m_p.h == logs_p.h\n    const int depth = m_p.h;\n\n    std::vector<int> w_ceil(x_lengths);\n    int y_lengths = 0;\n    for (int i = 0; i < x_lengths; i++)\n    {\n        w_ceil[i] = (int)ceilf(expf(logw[i]) * length_scale);\n        y_lengths += w_ceil[i];\n    }\n\n    z_p.create(y_lengths, depth);\n\n    for (int i = 0; i < depth; i++)\n    {\n        const float* m_p_ptr = m_p.row(i);\n        const float* logs_p_ptr = logs_p.row(i);\n        float* ptr = z_p.row(i);\n\n        for (int j = 0; j < x_lengths; j++)\n        {\n            const float m = m_p_ptr[j];\n            const float nl = expf(logs_p_ptr[j]) * noise_scale;\n            const int duration = w_ceil[j];\n\n            for (int k = 0; k < duration; k++)\n            {\n                ptr[k] = m + (rand() / (float)RAND_MAX) * nl;\n            }\n            ptr += duration;\n        }\n    }\n}\n\nstatic int tts_piper(const char* text, int speaker_id, std::vector<short>& pcm)\n{\n    // zh models could be found at\n    // https://github.com/nihui/ncnn-android-piper/tree/master/app/src/main/assets\n\n    // hyper parameters from https://huggingface.co/datasets/rhasspy/piper-checkpoints/blob/main/en/en_US/libritts_r/medium/config.json\n    const float noise_scale = 0.333f;\n    const float length_scale = 1.f;\n    const float noise_scale_w = 0.333f;\n\n    // phonemize\n    ncnn::Mat sequence;\n    {\n        std::vector<int> sequence_ids;\n        simple_phonemize(text, sequence_ids);\n\n        const int sequence_length = (int)sequence_ids.size();\n\n        
sequence.create(sequence_length);\n        memcpy(sequence, sequence_ids.data(), sequence_length * sizeof(int));\n    }\n\n    // enc_p\n    ncnn::Mat x;\n    ncnn::Mat m_p;\n    ncnn::Mat logs_p;\n    {\n        ncnn::Net enc_p;\n        enc_p.opt.use_vulkan_compute = true;\n        enc_p.register_custom_layer(\"piper.train.vits.attentions.relative_embeddings_k_module\", relative_embeddings_k_module_layer_creator);\n        enc_p.register_custom_layer(\"piper.train.vits.attentions.relative_embeddings_v_module\", relative_embeddings_v_module_layer_creator);\n        enc_p.load_param(\"en_enc_p.ncnn.param\");\n        enc_p.load_model(\"en_enc_p.ncnn.bin\");\n\n        ncnn::Extractor ex = enc_p.create_extractor();\n\n        ex.input(\"in0\", sequence);\n\n        ex.extract(\"out0\", x);\n        ex.extract(\"out1\", m_p);\n        ex.extract(\"out2\", logs_p);\n    }\n\n    // emb_g\n    ncnn::Mat g;\n    {\n        ncnn::Net emb_g;\n        emb_g.opt.use_vulkan_compute = true;\n        emb_g.load_param(\"en_emb_g.ncnn.param\");\n        emb_g.load_model(\"en_emb_g.ncnn.bin\");\n\n        ncnn::Mat speaker_id_mat(1);\n        {\n            int* p = speaker_id_mat;\n            p[0] = speaker_id;\n        }\n\n        ncnn::Extractor ex = emb_g.create_extractor();\n\n        ex.input(\"in0\", speaker_id_mat);\n\n        ex.extract(\"out0\", g);\n\n        g = g.reshape(1, g.w);\n    }\n\n    // dp\n    ncnn::Mat logw;\n    {\n        ncnn::Net dp;\n        dp.opt.use_vulkan_compute = true;\n        dp.register_custom_layer(\"piper.train.vits.modules.piecewise_rational_quadratic_transform_module\", piecewise_rational_quadratic_transform_module_layer_creator);\n        dp.load_param(\"en_dp.ncnn.param\");\n        dp.load_model(\"en_dp.ncnn.bin\");\n\n        ncnn::Mat noise(x.w, 2);\n        for (int i = 0; i < noise.w * noise.h; i++)\n        {\n            noise[i] = rand() / (float)RAND_MAX * noise_scale_w;\n        }\n\n        ncnn::Extractor ex = 
dp.create_extractor();\n\n        ex.input(\"in0\", x);\n        ex.input(\"in1\", noise);\n        ex.input(\"in2\", g);\n\n        ex.extract(\"out0\", logw);\n    }\n\n    // path attention\n    ncnn::Mat z_p;\n    {\n        path_attention(logw, m_p, logs_p, noise_scale, length_scale, z_p);\n    }\n\n    // flow\n    ncnn::Mat z;\n    {\n        ncnn::Net flow;\n        flow.opt.use_vulkan_compute = true;\n        flow.load_param(\"en_flow.ncnn.param\");\n        flow.load_model(\"en_flow.ncnn.bin\");\n\n        ncnn::Extractor ex = flow.create_extractor();\n\n        ex.input(\"in0\", z_p);\n        ex.input(\"in1\", g);\n\n        ex.extract(\"out0\", z);\n    }\n\n    // dec\n    ncnn::Mat o;\n    {\n        ncnn::Net dec;\n        dec.opt.use_vulkan_compute = true;\n        dec.load_param(\"en_dec.ncnn.param\");\n        dec.load_model(\"en_dec.ncnn.bin\");\n\n        ncnn::Extractor ex = dec.create_extractor();\n\n        ex.input(\"in0\", z);\n        ex.input(\"in1\", g);\n\n        ex.extract(\"out0\", o);\n    }\n\n    // normalize and clip\n    {\n        float volume = 1.f;\n        float absmax = 0.f;\n        for (int i = 0; i < o.w; i++)\n        {\n            absmax = std::max(absmax, fabs(o[i]));\n        }\n        if (absmax > 1e-8)\n        {\n            for (int i = 0; i < o.w; i++)\n            {\n                float v = o[i] / absmax * volume;\n                v = std::min(std::max(v, -1.f), 1.f);\n                o[i] = v;\n            }\n        }\n    }\n\n    // 16bit pcm\n    {\n        pcm.resize(o.w);\n        for (int i = 0; i < o.w; i++)\n        {\n            pcm[i] = (short)(o[i] * 32767);\n        }\n    }\n\n    return 0;\n}\n\nstatic void save_pcm_to_wav(const char* path, const short* pcm, int num_samples, int sample_rate)\n{\n    FILE* f = fopen(path, \"wb\");\n    if (!f)\n        return;\n\n    // write wav header\n    {\n        int16_t num_channels = 1;\n        int16_t bits_per_sample = 16;\n        int32_t 
byte_rate = sample_rate * num_channels * bits_per_sample / 8;\n        int16_t block_align = num_channels * bits_per_sample / 8;\n        int32_t data_chunk_size = num_samples * num_channels * bits_per_sample / 8;\n        int32_t chunk_size = 36 + data_chunk_size;\n\n        // RIFF header\n        fwrite(\"RIFF\", 1, 4, f);\n        fwrite(&chunk_size, 4, 1, f);\n        fwrite(\"WAVE\", 1, 4, f);\n\n        // fmt subchunk\n        fwrite(\"fmt \", 1, 4, f);\n        int32_t subchunk1_size = 16;\n        int16_t audio_format = 1; // PCM\n        fwrite(&subchunk1_size, 4, 1, f);\n        fwrite(&audio_format, 2, 1, f);\n        fwrite(&num_channels, 2, 1, f);\n        fwrite(&sample_rate, 4, 1, f);\n        fwrite(&byte_rate, 4, 1, f);\n        fwrite(&block_align, 2, 1, f);\n        fwrite(&bits_per_sample, 2, 1, f);\n\n        // data subchunk\n        fwrite(\"data\", 1, 4, f);\n        fwrite(&data_chunk_size, 4, 1, f);\n    }\n\n    fwrite(pcm, sizeof(short), num_samples, f);\n    fclose(f);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 4)\n    {\n        fprintf(stderr, \"Usage: %s [sentences] [speaker id 0~903] [out path]\\n\", argv[0]);\n        fprintf(stderr, \"       %s \\\"Hello World\\\" 0 out.wav\\n\", argv[0]);\n        fprintf(stderr, \"       %s \\\"Happy New Year\\\" 123 out.wav\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* text = argv[1];\n    const int speaker_id = atoi(argv[2]);\n    const char* outpath = argv[3];\n\n    std::vector<short> pcm;\n    tts_piper(text, speaker_id, pcm);\n\n    // \"sample_rate\": 22050\n    save_pcm_to_wav(outpath, pcm.data(), pcm.size(), 22050);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/ppocrv5.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// pip install paddlepaddle==3.0.0\n// pip install paddleocr==3.0.0\n// paddlex --install paddle2onnx\n// paddleocr ocr -i test.png\n// paddlex --paddle2onnx --paddle_model_dir ~/.paddlex/official_models/PP-OCRv5_mobile_det --onnx_model_dir PP-OCRv5_mobile_det\n// paddlex --paddle2onnx --paddle_model_dir ~/.paddlex/official_models/PP-OCRv5_mobile_rec --onnx_model_dir PP-OCRv5_mobile_rec\n// pnnx PP-OCRv5_mobile_det.onnx inputshape=[1,3,320,320] inputshape2=[1,3,256,256]\n// pnnx PP-OCRv5_mobile_rec.onnx inputshape=[1,3,48,160] inputshape2=[1,3,48,256]\n// pnnx PP-OCRv5_server_det.onnx inputshape=[1,3,320,320] inputshape2=[1,3,256,256] fp16=0\n// pnnx PP-OCRv5_server_rec.onnx inputshape=[1,3,48,160] inputshape2=[1,3,48,256] fp16=0\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\n#include \"ppocrv5_dict.h\"\n\nstruct Character\n{\n    int id;\n    float prob;\n};\n\nstruct Object\n{\n    cv::RotatedRect rrect;\n    int orientation;\n    float prob;\n    std::vector<Character> text;\n};\n\nstatic double contour_score(const cv::Mat& binary, const std::vector<cv::Point>& contour)\n{\n    cv::Rect rect = cv::boundingRect(contour);\n    if (rect.x < 0)\n        rect.x = 0;\n    if (rect.y < 0)\n        rect.y = 0;\n    if (rect.x + rect.width > binary.cols)\n        rect.width = binary.cols - rect.x;\n    if (rect.y + rect.height > binary.rows)\n        rect.height = binary.rows - rect.y;\n\n    cv::Mat binROI = binary(rect);\n\n    cv::Mat mask = cv::Mat::zeros(rect.height, rect.width, CV_8U);\n    std::vector<cv::Point> roiContour;\n    for (size_t i = 0; i < contour.size(); i++)\n    {\n        cv::Point pt = cv::Point(contour[i].x - rect.x, contour[i].y - rect.y);\n        roiContour.push_back(pt);\n    }\n\n   
 std::vector<std::vector<cv::Point> > roiContours = {roiContour};\n    cv::fillPoly(mask, roiContours, cv::Scalar(255));\n\n    double score = cv::mean(binROI, mask).val[0];\n    return score / 255.f;\n}\n\nstatic cv::Mat get_rotate_crop_image(const cv::Mat& bgr, const Object& object)\n{\n    const int orientation = object.orientation;\n    const float rw = object.rrect.size.width;\n    const float rh = object.rrect.size.height;\n\n    const int target_height = 48;\n    const float target_width = rh * target_height / rw;\n\n    // a perspective warp would be the general way to rectify the crop\n    // but the regions are all rectangles, so warpAffine is almost enough  :P\n\n    cv::Mat dst;\n\n    cv::Point2f corners[4];\n    object.rrect.points(corners);\n\n    if (orientation == 0)\n    {\n        // horizontal text\n        // corner points order\n        //  0--------1\n        //  |        |rw  -> as angle=90\n        //  3--------2\n        //      rh\n\n        std::vector<cv::Point2f> src_pts(3);\n        src_pts[0] = corners[0];\n        src_pts[1] = corners[1];\n        src_pts[2] = corners[3];\n\n        std::vector<cv::Point2f> dst_pts(3);\n        dst_pts[0] = cv::Point2f(0, 0);\n        dst_pts[1] = cv::Point2f(target_width, 0);\n        dst_pts[2] = cv::Point2f(0, target_height);\n\n        cv::Mat tm = cv::getAffineTransform(src_pts, dst_pts);\n\n        cv::warpAffine(bgr, dst, tm, cv::Size(target_width, target_height), cv::INTER_LINEAR, cv::BORDER_REPLICATE);\n    }\n    else\n    {\n        // vertical text\n        // corner points order\n        //  1----2\n        //  |    |\n        //  |    |\n        //  |    |rh  -> as angle=0\n        //  |    |\n        //  |    |\n        //  0----3\n        //    rw\n\n        std::vector<cv::Point2f> src_pts(3);\n        src_pts[0] = corners[2];\n        src_pts[1] = corners[3];\n        src_pts[2] = corners[1];\n\n        std::vector<cv::Point2f> dst_pts(3);\n        dst_pts[0] = cv::Point2f(0, 0);\n        
dst_pts[1] = cv::Point2f(target_width, 0);\n        dst_pts[2] = cv::Point2f(0, target_height);\n\n        cv::Mat tm = cv::getAffineTransform(src_pts, dst_pts);\n\n        cv::warpAffine(bgr, dst, tm, cv::Size(target_width, target_height), cv::INTER_LINEAR, cv::BORDER_REPLICATE);\n    }\n\n    return dst;\n}\n\nclass PPOCRv5\n{\npublic:\n    void init();\n\n    void detect(const cv::Mat& bgr, std::vector<Object>& objects);\n\n    void recognize(const cv::Mat& bgr, Object& object);\n\nprotected:\n    ncnn::Net ppocrv5_det;\n    ncnn::Net ppocrv5_rec;\n};\n\nvoid PPOCRv5::init()\n{\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    // https://github.com/nihui/ncnn-android-ppocrv5/tree/master/app/src/main/assets\n\n    ppocrv5_det.opt.use_vulkan_compute = true;\n    // ppocrv5_det.opt.use_bf16_storage = true;\n\n    // fp16 must be disabled for server model\n    // ppocrv5_det.opt.use_fp16_packed = false;\n    // ppocrv5_det.opt.use_fp16_storage = false;\n\n    ppocrv5_det.load_param(\"PP_OCRv5_mobile_det.ncnn.param\");\n    ppocrv5_det.load_model(\"PP_OCRv5_mobile_det.ncnn.bin\");\n    // ppocrv5_det.load_param(\"PP_OCRv5_server_det.ncnn.param\");\n    // ppocrv5_det.load_model(\"PP_OCRv5_server_det.ncnn.bin\");\n\n    ppocrv5_rec.opt.use_vulkan_compute = true;\n    // ppocrv5_rec.opt.use_bf16_storage = true;\n\n    // fp16 must be disabled for server model\n    // ppocrv5_rec.opt.use_fp16_packed = false;\n    // ppocrv5_rec.opt.use_fp16_storage = false;\n\n    ppocrv5_rec.load_param(\"PP_OCRv5_mobile_rec.ncnn.param\");\n    ppocrv5_rec.load_model(\"PP_OCRv5_mobile_rec.ncnn.bin\");\n    // ppocrv5_rec.load_param(\"PP_OCRv5_server_rec.ncnn.param\");\n    // ppocrv5_rec.load_model(\"PP_OCRv5_server_rec.ncnn.bin\");\n}\n\nvoid PPOCRv5::detect(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    const int target_size = 960;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    const int target_stride = 32;\n\n    // 
letterbox pad to multiple of target_stride\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (std::max(w, h) > target_size)\n    {\n        if (w > h)\n        {\n            scale = (float)target_size / w;\n            w = target_size;\n            h = h * scale;\n        }\n        else\n        {\n            scale = (float)target_size / h;\n            h = target_size;\n            w = w * scale;\n        }\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, img_w, img_h, w, h);\n\n    int wpad = (w + target_stride - 1) / target_stride * target_stride - w;\n    int hpad = (h + target_stride - 1) / target_stride * target_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float mean_vals[3] = {0.485f * 255.f, 0.456f * 255.f, 0.406f * 255.f};\n    const float norm_vals[3] = {1 / 0.229f / 255.f, 1 / 0.224f / 255.f, 1 / 0.225f / 255.f};\n    in_pad.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = ppocrv5_det.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    const float denorm_vals[1] = {255.f};\n    out.substract_mean_normalize(0, denorm_vals);\n\n    cv::Mat pred(out.h, out.w, CV_8UC1);\n    out.to_pixels(pred.data, ncnn::Mat::PIXEL_GRAY);\n\n    // threshold binary\n    cv::Mat bitmap;\n    const float threshold = 0.3f;\n    cv::threshold(pred, bitmap, threshold * 255, 255, cv::THRESH_BINARY);\n\n    // boxes from bitmap\n    {\n        // should use dbnet post process, but I think unclip process is difficult to write\n        // so simply implement expansion. 
This may lose detection accuracy\n        // original implementation can be referenced\n        // https://github.com/MhLiao/DB/blob/master/structure/representers/seg_detector_representer.py\n\n        const float box_thresh = 0.6f;\n        const float enlarge_ratio = 1.95f;\n\n        const float min_size = 3 * scale;\n        const int max_candidates = 1000;\n\n        std::vector<std::vector<cv::Point> > contours;\n        std::vector<cv::Vec4i> hierarchy;\n\n        cv::findContours(bitmap, contours, hierarchy, cv::RETR_LIST, cv::CHAIN_APPROX_SIMPLE);\n\n        contours.resize(std::min(contours.size(), (size_t)max_candidates));\n\n        for (size_t i = 0; i < contours.size(); i++)\n        {\n            const std::vector<cv::Point>& contour = contours[i];\n            if (contour.size() <= 2)\n                continue;\n\n            double score = contour_score(pred, contour);\n            if (score < box_thresh)\n                continue;\n\n            cv::RotatedRect rrect = cv::minAreaRect(contour);\n\n            float rrect_maxwh = std::max(rrect.size.width, rrect.size.height);\n            if (rrect_maxwh < min_size)\n                continue;\n\n            int orientation = 0;\n            if (rrect.angle >= -30 && rrect.angle <= 30 && rrect.size.height > rrect.size.width * 2.7)\n            {\n                // vertical text\n                orientation = 1;\n            }\n            if ((rrect.angle <= -60 || rrect.angle >= 60) && rrect.size.width > rrect.size.height * 2.7)\n            {\n                // vertical text\n                orientation = 1;\n            }\n\n            if (rrect.angle < -30)\n            {\n                // make orientation from -90 ~ -30 to 90 ~ 150\n                rrect.angle += 180;\n            }\n            if (orientation == 0 && rrect.angle < 30)\n            {\n                // make it horizontal\n                rrect.angle += 90;\n                std::swap(rrect.size.width, 
rrect.size.height);\n            }\n            if (orientation == 1 && rrect.angle >= 60)\n            {\n                // make it vertical\n                rrect.angle -= 90;\n                std::swap(rrect.size.width, rrect.size.height);\n            }\n\n            // enlarge\n            rrect.size.height += rrect.size.width * (enlarge_ratio - 1);\n            rrect.size.width *= enlarge_ratio;\n\n            // adjust offset to original unpadded\n            rrect.center.x = (rrect.center.x - (wpad / 2)) / scale;\n            rrect.center.y = (rrect.center.y - (hpad / 2)) / scale;\n            rrect.size.width = (rrect.size.width) / scale;\n            rrect.size.height = (rrect.size.height) / scale;\n\n            Object obj;\n            obj.rrect = rrect;\n            obj.orientation = orientation;\n            obj.prob = score;\n            objects.push_back(obj);\n        }\n    }\n}\n\nvoid PPOCRv5::recognize(const cv::Mat& bgr, Object& object)\n{\n    cv::Mat roi = get_rotate_crop_image(bgr, object);\n\n    ncnn::Mat in = ncnn::Mat::from_pixels(roi.data, ncnn::Mat::PIXEL_BGR, roi.cols, roi.rows);\n\n    // ~/.paddlex/official_models/PP-OCRv5_mobile_rec/inference.yml\n    const float mean_vals[3] = {127.5, 127.5, 127.5};\n    const float norm_vals[3] = {1.0 / 127.5, 1.0 / 127.5, 1.0 / 127.5};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = ppocrv5_rec.create_extractor();\n\n    ex.input(\"in0\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    // 18385 x len\n    int last_token = 0;\n\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* p = out.row(i);\n\n        int index = 0;\n        float max_score = -9999.f;\n        for (int j = 0; j < out.w; j++)\n        {\n            float score = *p++;\n            if (score > max_score)\n            {\n                max_score = score;\n                index = j;\n            }\n        }\n\n        if (last_token == index) // CTC rule, 
if index is same as last one, they will be merged into one token\n            continue;\n\n        last_token = index;\n\n        if (index <= 0)\n            continue;\n\n        Character ch;\n        ch.id = index - 1;\n        ch.prob = max_score;\n\n        object.text.push_back(ch);\n    }\n}\n\nstatic int detect_ppocrv5(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    PPOCRv5 ppocrv5;\n\n    ppocrv5.init();\n\n    ppocrv5.detect(bgr, objects);\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        ppocrv5.recognize(bgr, objects[i]);\n    }\n\n    return 0;\n}\n\nstatic int draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const cv::Scalar colors[] = {\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 17];\n\n        fprintf(stderr, \"%s %.5f at %.2f %.2f %.2f x %.2f  @ %.2f  =  \", obj.orientation == 0 ? 
\"H\" : \"V\", obj.prob,\n                obj.rrect.center.x, obj.rrect.center.y, obj.rrect.size.width, obj.rrect.size.height, obj.rrect.angle);\n\n        cv::Point2f corners[4];\n        obj.rrect.points(corners);\n        cv::line(image, corners[0], corners[1], color);\n        cv::line(image, corners[1], corners[2], color);\n        cv::line(image, corners[2], corners[3], color);\n        cv::line(image, corners[3], corners[0], color);\n\n        std::string text;\n        for (size_t j = 0; j < objects[i].text.size(); j++)\n        {\n            const Character& ch = objects[i].text[j];\n            if (ch.id >= character_dict_size)\n                continue;\n\n            text += character_dict[ch.id];\n        }\n        fprintf(stderr, \"%s\\n\", text.c_str());\n    }\n\n    fprintf(stderr, \"opencv putText can not draw non-latin characters, you may see question marks instead\\n\");\n    fprintf(stderr, \"see opencv-mobile for drawing non-latin characters\\n\");\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 17];\n\n        std::string text;\n        for (size_t j = 0; j < objects[i].text.size(); j++)\n        {\n            const Character& ch = objects[i].text[j];\n            if (ch.id >= character_dict_size)\n            {\n                if (!text.empty() && text.back() != ' ')\n                {\n                    text += \" \";\n                }\n                continue;\n            }\n\n            if (obj.orientation == 0)\n            {\n                text += character_dict[ch.id];\n            }\n            else\n            {\n                text += character_dict[ch.id];\n                if (j + 1 < objects[i].text.size())\n                    text += \"\\n\";\n            }\n        }\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = 
obj.rrect.center.x - label_size.width / 2;\n        int y = obj.rrect.center.y - label_size.height / 2 - baseLine;\n        if (y < 0)\n            y = 0;\n        if (y + label_size.height > image.rows)\n            y = image.rows - label_size.height;\n        if (x < 0)\n            x = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        if (obj.orientation == 0)\n        {\n            cv::putText(image, text, cv::Point(x, y + label_size.height), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n        }\n        else\n        {\n            cv::putText(image, text, cv::Point(x, y + label_size.width), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n        }\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n\n    return 0;\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_ppocrv5(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/ppocrv5_dict.h",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic const char* character_dict[] = {\n    \"　\",\n    \"一\",\n    \"乙\",\n    \"二\",\n    \"十\",\n    \"丁\",\n    \"厂\",\n    \"七\",\n    \"卜\",\n    \"八\",\n    \"人\",\n    \"入\",\n    \"儿\",\n    \"匕\",\n    \"几\",\n    \"九\",\n    \"刁\",\n    \"了\",\n    \"刀\",\n    \"力\",\n    \"乃\",\n    \"又\",\n    \"三\",\n    \"干\",\n    \"于\",\n    \"亏\",\n    \"工\",\n    \"土\",\n    \"士\",\n    \"才\",\n    \"下\",\n    \"寸\",\n    \"大\",\n    \"丈\",\n    \"与\",\n    \"万\",\n    \"上\",\n    \"小\",\n    \"口\",\n    \"山\",\n    \"巾\",\n    \"千\",\n    \"乞\",\n    \"川\",\n    \"亿\",\n    \"个\",\n    \"夕\",\n    \"久\",\n    \"么\",\n    \"勺\",\n    \"凡\",\n    \"丸\",\n    \"及\",\n    \"广\",\n    \"亡\",\n    \"门\",\n    \"丫\",\n    \"义\",\n    \"之\",\n    \"尸\",\n    \"己\",\n    \"已\",\n    \"巳\",\n    \"弓\",\n    \"子\",\n    \"卫\",\n    \"也\",\n    \"女\",\n    \"刃\",\n    \"飞\",\n    \"习\",\n    \"叉\",\n    \"马\",\n    \"乡\",\n    \"丰\",\n    \"王\",\n    \"开\",\n    \"井\",\n    \"天\",\n    \"夫\",\n    \"元\",\n    \"无\",\n    \"云\",\n    \"专\",\n    \"丐\",\n    \"扎\",\n    \"艺\",\n    \"木\",\n    \"五\",\n    \"支\",\n    \"厅\",\n    \"不\",\n    \"犬\",\n    \"太\",\n    \"区\",\n    \"历\",\n    \"歹\",\n    \"友\",\n    \"尤\",\n    \"匹\",\n    \"车\",\n    \"巨\",\n    \"牙\",\n    \"屯\",\n    \"戈\",\n    \"比\",\n    \"互\",\n    \"切\",\n    \"瓦\",\n    \"止\",\n    \"少\",\n    \"曰\",\n    \"日\",\n    \"中\",\n    \"贝\",\n    \"冈\",\n    \"内\",\n    \"水\",\n    \"见\",\n    \"午\",\n    \"牛\",\n    \"手\",\n    \"气\",\n    \"毛\",\n    \"壬\",\n    \"升\",\n    \"夭\",\n    \"长\",\n    \"仁\",\n    \"什\",\n    \"片\",\n    \"仆\",\n    \"化\",\n    \"仇\",\n    \"币\",\n    \"仍\",\n    \"仅\",\n    \"斤\",\n    \"爪\",\n    \"反\",\n    \"介\",\n    \"父\",\n    \"从\",\n    \"仑\",\n    \"今\",\n    \"凶\",\n    \"分\",\n    \"乏\",\n    \"公\",\n    \"仓\",\n    \"月\",\n    \"氏\",\n    \"勿\",\n    \"欠\",\n    \"风\",\n    
\"丹\",\n    \"匀\",\n    \"乌\",\n    \"勾\",\n    \"凤\",\n    \"六\",\n    \"文\",\n    \"亢\",\n    \"方\",\n    \"火\",\n    \"为\",\n    \"斗\",\n    \"忆\",\n    \"计\",\n    \"订\",\n    \"户\",\n    \"认\",\n    \"冗\",\n    \"讥\",\n    \"心\",\n    \"尺\",\n    \"引\",\n    \"丑\",\n    \"巴\",\n    \"孔\",\n    \"队\",\n    \"办\",\n    \"以\",\n    \"允\",\n    \"予\",\n    \"邓\",\n    \"劝\",\n    \"双\",\n    \"书\",\n    \"幻\",\n    \"玉\",\n    \"刊\",\n    \"未\",\n    \"末\",\n    \"示\",\n    \"击\",\n    \"打\",\n    \"巧\",\n    \"正\",\n    \"扑\",\n    \"卉\",\n    \"扒\",\n    \"功\",\n    \"扔\",\n    \"去\",\n    \"甘\",\n    \"世\",\n    \"艾\",\n    \"古\",\n    \"节\",\n    \"本\",\n    \"术\",\n    \"可\",\n    \"丙\",\n    \"左\",\n    \"厉\",\n    \"石\",\n    \"右\",\n    \"布\",\n    \"夯\",\n    \"戊\",\n    \"龙\",\n    \"平\",\n    \"灭\",\n    \"轧\",\n    \"东\",\n    \"卡\",\n    \"北\",\n    \"占\",\n    \"凸\",\n    \"卢\",\n    \"业\",\n    \"旧\",\n    \"帅\",\n    \"归\",\n    \"旦\",\n    \"目\",\n    \"且\",\n    \"叶\",\n    \"甲\",\n    \"申\",\n    \"叮\",\n    \"电\",\n    \"号\",\n    \"田\",\n    \"由\",\n    \"只\",\n    \"叭\",\n    \"史\",\n    \"央\",\n    \"兄\",\n    \"叽\",\n    \"叼\",\n    \"叫\",\n    \"叩\",\n    \"叨\",\n    \"另\",\n    \"叹\",\n    \"冉\",\n    \"皿\",\n    \"凹\",\n    \"囚\",\n    \"四\",\n    \"生\",\n    \"矢\",\n    \"失\",\n    \"乍\",\n    \"禾\",\n    \"丘\",\n    \"付\",\n    \"仗\",\n    \"代\",\n    \"仙\",\n    \"们\",\n    \"仪\",\n    \"白\",\n    \"仔\",\n    \"他\",\n    \"斥\",\n    \"瓜\",\n    \"乎\",\n    \"丛\",\n    \"令\",\n    \"用\",\n    \"甩\",\n    \"印\",\n    \"尔\",\n    \"乐\",\n    \"句\",\n    \"匆\",\n    \"册\",\n    \"卯\",\n    \"犯\",\n    \"外\",\n    \"处\",\n    \"冬\",\n    \"鸟\",\n    \"务\",\n    \"包\",\n    \"饥\",\n    \"主\",\n    \"市\",\n    \"立\",\n    \"冯\",\n    \"玄\",\n    \"闪\",\n    \"兰\",\n    \"半\",\n    \"汁\",\n    \"汇\",\n    \"头\",\n    \"汉\",\n    \"宁\",\n    \"穴\",\n    \"它\",\n    \"讨\",\n    \"写\",\n    \"让\",\n    \"礼\",\n    \"训\",\n    \"议\",\n    
\"必\",\n    \"讯\",\n    \"记\",\n    \"永\",\n    \"司\",\n    \"尼\",\n    \"民\",\n    \"弗\",\n    \"弘\",\n    \"出\",\n    \"辽\",\n    \"奶\",\n    \"奴\",\n    \"召\",\n    \"加\",\n    \"皮\",\n    \"边\",\n    \"孕\",\n    \"发\",\n    \"圣\",\n    \"对\",\n    \"台\",\n    \"矛\",\n    \"纠\",\n    \"母\",\n    \"幼\",\n    \"丝\",\n    \"邦\",\n    \"式\",\n    \"迂\",\n    \"刑\",\n    \"戎\",\n    \"动\",\n    \"扛\",\n    \"寺\",\n    \"吉\",\n    \"扣\",\n    \"考\",\n    \"托\",\n    \"老\",\n    \"巩\",\n    \"圾\",\n    \"执\",\n    \"扩\",\n    \"扫\",\n    \"地\",\n    \"场\",\n    \"扬\",\n    \"耳\",\n    \"芋\",\n    \"共\",\n    \"芒\",\n    \"亚\",\n    \"芝\",\n    \"朽\",\n    \"朴\",\n    \"机\",\n    \"权\",\n    \"过\",\n    \"臣\",\n    \"吏\",\n    \"再\",\n    \"协\",\n    \"西\",\n    \"压\",\n    \"厌\",\n    \"戌\",\n    \"在\",\n    \"百\",\n    \"有\",\n    \"存\",\n    \"而\",\n    \"页\",\n    \"匠\",\n    \"夸\",\n    \"夺\",\n    \"灰\",\n    \"达\",\n    \"列\",\n    \"死\",\n    \"成\",\n    \"夹\",\n    \"夷\",\n    \"轨\",\n    \"邪\",\n    \"尧\",\n    \"划\",\n    \"迈\",\n    \"毕\",\n    \"至\",\n    \"此\",\n    \"贞\",\n    \"师\",\n    \"尘\",\n    \"尖\",\n    \"劣\",\n    \"光\",\n    \"当\",\n    \"早\",\n    \"吁\",\n    \"吐\",\n    \"吓\",\n    \"虫\",\n    \"曲\",\n    \"团\",\n    \"吕\",\n    \"同\",\n    \"吊\",\n    \"吃\",\n    \"因\",\n    \"吸\",\n    \"吗\",\n    \"吆\",\n    \"屿\",\n    \"屹\",\n    \"岁\",\n    \"帆\",\n    \"回\",\n    \"岂\",\n    \"则\",\n    \"刚\",\n    \"网\",\n    \"肉\",\n    \"年\",\n    \"朱\",\n    \"先\",\n    \"丢\",\n    \"廷\",\n    \"舌\",\n    \"竹\",\n    \"迁\",\n    \"乔\",\n    \"迄\",\n    \"伟\",\n    \"传\",\n    \"乒\",\n    \"乓\",\n    \"休\",\n    \"伍\",\n    \"伏\",\n    \"优\",\n    \"臼\",\n    \"伐\",\n    \"延\",\n    \"仲\",\n    \"件\",\n    \"任\",\n    \"伤\",\n    \"价\",\n    \"伦\",\n    \"份\",\n    \"华\",\n    \"仰\",\n    \"仿\",\n    \"伙\",\n    \"伪\",\n    \"自\",\n    \"伊\",\n    \"血\",\n    \"向\",\n    \"似\",\n    \"后\",\n    \"行\",\n    \"舟\",\n    \"全\",\n    \"会\",\n    
\"杀\",\n    \"合\",\n    \"兆\",\n    \"企\",\n    \"众\",\n    \"爷\",\n    \"伞\",\n    \"创\",\n    \"肌\",\n    \"肋\",\n    \"朵\",\n    \"杂\",\n    \"危\",\n    \"旬\",\n    \"旨\",\n    \"旭\",\n    \"负\",\n    \"匈\",\n    \"名\",\n    \"各\",\n    \"多\",\n    \"争\",\n    \"色\",\n    \"壮\",\n    \"冲\",\n    \"妆\",\n    \"冰\",\n    \"庄\",\n    \"庆\",\n    \"亦\",\n    \"刘\",\n    \"齐\",\n    \"交\",\n    \"衣\",\n    \"次\",\n    \"产\",\n    \"决\",\n    \"亥\",\n    \"充\",\n    \"妄\",\n    \"闭\",\n    \"问\",\n    \"闯\",\n    \"羊\",\n    \"并\",\n    \"关\",\n    \"米\",\n    \"灯\",\n    \"州\",\n    \"汗\",\n    \"污\",\n    \"江\",\n    \"汛\",\n    \"池\",\n    \"汝\",\n    \"汤\",\n    \"忙\",\n    \"兴\",\n    \"宇\",\n    \"守\",\n    \"宅\",\n    \"字\",\n    \"安\",\n    \"讲\",\n    \"讳\",\n    \"军\",\n    \"讶\",\n    \"许\",\n    \"讹\",\n    \"论\",\n    \"讼\",\n    \"农\",\n    \"讽\",\n    \"设\",\n    \"访\",\n    \"诀\",\n    \"寻\",\n    \"那\",\n    \"迅\",\n    \"尽\",\n    \"导\",\n    \"异\",\n    \"弛\",\n    \"孙\",\n    \"阵\",\n    \"阳\",\n    \"收\",\n    \"阶\",\n    \"阴\",\n    \"防\",\n    \"奸\",\n    \"如\",\n    \"妇\",\n    \"妃\",\n    \"好\",\n    \"她\",\n    \"妈\",\n    \"戏\",\n    \"羽\",\n    \"观\",\n    \"欢\",\n    \"买\",\n    \"红\",\n    \"驮\",\n    \"纤\",\n    \"驯\",\n    \"约\",\n    \"级\",\n    \"纪\",\n    \"驰\",\n    \"纫\",\n    \"巡\",\n    \"寿\",\n    \"弄\",\n    \"麦\",\n    \"玖\",\n    \"玛\",\n    \"形\",\n    \"进\",\n    \"戒\",\n    \"吞\",\n    \"远\",\n    \"违\",\n    \"韧\",\n    \"运\",\n    \"扶\",\n    \"抚\",\n    \"坛\",\n    \"技\",\n    \"坏\",\n    \"抠\",\n    \"扰\",\n    \"扼\",\n    \"拒\",\n    \"找\",\n    \"批\",\n    \"址\",\n    \"扯\",\n    \"走\",\n    \"抄\",\n    \"贡\",\n    \"汞\",\n    \"坝\",\n    \"攻\",\n    \"赤\",\n    \"折\",\n    \"抓\",\n    \"扳\",\n    \"抡\",\n    \"扮\",\n    \"抢\",\n    \"孝\",\n    \"坎\",\n    \"均\",\n    \"抑\",\n    \"抛\",\n    \"投\",\n    \"坟\",\n    \"坑\",\n    \"抗\",\n    \"坊\",\n    \"抖\",\n    \"护\",\n    \"壳\",\n    \"志\",\n    \"块\",\n    
\"扭\",\n    \"声\",\n    \"把\",\n    \"报\",\n    \"拟\",\n    \"却\",\n    \"抒\",\n    \"劫\",\n    \"芙\",\n    \"芜\",\n    \"苇\",\n    \"芽\",\n    \"花\",\n    \"芹\",\n    \"芥\",\n    \"芬\",\n    \"苍\",\n    \"芳\",\n    \"严\",\n    \"芦\",\n    \"芯\",\n    \"劳\",\n    \"克\",\n    \"芭\",\n    \"苏\",\n    \"杆\",\n    \"杠\",\n    \"杜\",\n    \"材\",\n    \"村\",\n    \"杖\",\n    \"杏\",\n    \"杉\",\n    \"巫\",\n    \"极\",\n    \"李\",\n    \"杨\",\n    \"求\",\n    \"甫\",\n    \"匣\",\n    \"更\",\n    \"束\",\n    \"吾\",\n    \"豆\",\n    \"两\",\n    \"酉\",\n    \"丽\",\n    \"医\",\n    \"辰\",\n    \"励\",\n    \"否\",\n    \"还\",\n    \"尬\",\n    \"歼\",\n    \"来\",\n    \"连\",\n    \"轩\",\n    \"步\",\n    \"卤\",\n    \"坚\",\n    \"肖\",\n    \"旱\",\n    \"盯\",\n    \"呈\",\n    \"时\",\n    \"吴\",\n    \"助\",\n    \"县\",\n    \"里\",\n    \"呆\",\n    \"吱\",\n    \"吠\",\n    \"呕\",\n    \"园\",\n    \"旷\",\n    \"围\",\n    \"呀\",\n    \"吨\",\n    \"足\",\n    \"邮\",\n    \"男\",\n    \"困\",\n    \"吵\",\n    \"串\",\n    \"员\",\n    \"呐\",\n    \"听\",\n    \"吟\",\n    \"吩\",\n    \"呛\",\n    \"吻\",\n    \"吹\",\n    \"呜\",\n    \"吭\",\n    \"吧\",\n    \"邑\",\n    \"吼\",\n    \"囤\",\n    \"别\",\n    \"吮\",\n    \"岖\",\n    \"岗\",\n    \"帐\",\n    \"财\",\n    \"针\",\n    \"钉\",\n    \"牡\",\n    \"告\",\n    \"我\",\n    \"乱\",\n    \"利\",\n    \"秃\",\n    \"秀\",\n    \"私\",\n    \"每\",\n    \"兵\",\n    \"估\",\n    \"体\",\n    \"何\",\n    \"佐\",\n    \"佑\",\n    \"但\",\n    \"伸\",\n    \"佃\",\n    \"作\",\n    \"伯\",\n    \"伶\",\n    \"佣\",\n    \"低\",\n    \"你\",\n    \"住\",\n    \"位\",\n    \"伴\",\n    \"身\",\n    \"皂\",\n    \"伺\",\n    \"佛\",\n    \"囱\",\n    \"近\",\n    \"彻\",\n    \"役\",\n    \"返\",\n    \"余\",\n    \"希\",\n    \"坐\",\n    \"谷\",\n    \"妥\",\n    \"含\",\n    \"邻\",\n    \"岔\",\n    \"肝\",\n    \"肛\",\n    \"肚\",\n    \"肘\",\n    \"肠\",\n    \"龟\",\n    \"甸\",\n    \"免\",\n    \"狂\",\n    \"犹\",\n    \"狈\",\n    \"角\",\n    \"删\",\n    \"条\",\n    \"彤\",\n    \"卵\",\n    
\"灸\",\n    \"岛\",\n    \"刨\",\n    \"迎\",\n    \"饭\",\n    \"饮\",\n    \"系\",\n    \"言\",\n    \"冻\",\n    \"状\",\n    \"亩\",\n    \"况\",\n    \"床\",\n    \"库\",\n    \"庇\",\n    \"疗\",\n    \"吝\",\n    \"应\",\n    \"这\",\n    \"冷\",\n    \"庐\",\n    \"序\",\n    \"辛\",\n    \"弃\",\n    \"冶\",\n    \"忘\",\n    \"闰\",\n    \"闲\",\n    \"间\",\n    \"闷\",\n    \"判\",\n    \"兑\",\n    \"灶\",\n    \"灿\",\n    \"灼\",\n    \"弟\",\n    \"汪\",\n    \"沐\",\n    \"沛\",\n    \"汰\",\n    \"沥\",\n    \"沙\",\n    \"汽\",\n    \"沃\",\n    \"沦\",\n    \"汹\",\n    \"泛\",\n    \"沧\",\n    \"没\",\n    \"沟\",\n    \"沪\",\n    \"沈\",\n    \"沉\",\n    \"沁\",\n    \"怀\",\n    \"忧\",\n    \"忱\",\n    \"快\",\n    \"完\",\n    \"宋\",\n    \"宏\",\n    \"牢\",\n    \"究\",\n    \"穷\",\n    \"灾\",\n    \"良\",\n    \"证\",\n    \"启\",\n    \"评\",\n    \"补\",\n    \"初\",\n    \"社\",\n    \"祀\",\n    \"识\",\n    \"诈\",\n    \"诉\",\n    \"罕\",\n    \"诊\",\n    \"词\",\n    \"译\",\n    \"君\",\n    \"灵\",\n    \"即\",\n    \"层\",\n    \"屁\",\n    \"尿\",\n    \"尾\",\n    \"迟\",\n    \"局\",\n    \"改\",\n    \"张\",\n    \"忌\",\n    \"际\",\n    \"陆\",\n    \"阿\",\n    \"陈\",\n    \"阻\",\n    \"附\",\n    \"坠\",\n    \"妓\",\n    \"妙\",\n    \"妖\",\n    \"姊\",\n    \"妨\",\n    \"妒\",\n    \"努\",\n    \"忍\",\n    \"劲\",\n    \"矣\",\n    \"鸡\",\n    \"纬\",\n    \"驱\",\n    \"纯\",\n    \"纱\",\n    \"纲\",\n    \"纳\",\n    \"驳\",\n    \"纵\",\n    \"纷\",\n    \"纸\",\n    \"纹\",\n    \"纺\",\n    \"驴\",\n    \"纽\",\n    \"奉\",\n    \"玩\",\n    \"环\",\n    \"武\",\n    \"青\",\n    \"责\",\n    \"现\",\n    \"玫\",\n    \"表\",\n    \"规\",\n    \"抹\",\n    \"卦\",\n    \"坷\",\n    \"坯\",\n    \"拓\",\n    \"拢\",\n    \"拔\",\n    \"坪\",\n    \"拣\",\n    \"坦\",\n    \"担\",\n    \"坤\",\n    \"押\",\n    \"抽\",\n    \"拐\",\n    \"拖\",\n    \"者\",\n    \"拍\",\n    \"顶\",\n    \"拆\",\n    \"拎\",\n    \"拥\",\n    \"抵\",\n    \"拘\",\n    \"势\",\n    \"抱\",\n    \"拄\",\n    \"垃\",\n    \"拉\",\n    \"拦\",\n    \"幸\",\n    \"拌\",\n    
\"拧\",\n    \"拂\",\n    \"拙\",\n    \"招\",\n    \"坡\",\n    \"披\",\n    \"拨\",\n    \"择\",\n    \"抬\",\n    \"拇\",\n    \"拗\",\n    \"其\",\n    \"取\",\n    \"茉\",\n    \"苦\",\n    \"昔\",\n    \"苛\",\n    \"若\",\n    \"茂\",\n    \"苹\",\n    \"苗\",\n    \"英\",\n    \"苟\",\n    \"苑\",\n    \"苞\",\n    \"范\",\n    \"直\",\n    \"茁\",\n    \"茄\",\n    \"茎\",\n    \"苔\",\n    \"茅\",\n    \"枉\",\n    \"林\",\n    \"枝\",\n    \"杯\",\n    \"枢\",\n    \"柜\",\n    \"枚\",\n    \"析\",\n    \"板\",\n    \"松\",\n    \"枪\",\n    \"枫\",\n    \"构\",\n    \"杭\",\n    \"杰\",\n    \"述\",\n    \"枕\",\n    \"丧\",\n    \"或\",\n    \"画\",\n    \"卧\",\n    \"事\",\n    \"刺\",\n    \"枣\",\n    \"雨\",\n    \"卖\",\n    \"郁\",\n    \"矾\",\n    \"矿\",\n    \"码\",\n    \"厕\",\n    \"奈\",\n    \"奔\",\n    \"奇\",\n    \"奋\",\n    \"态\",\n    \"欧\",\n    \"殴\",\n    \"垄\",\n    \"妻\",\n    \"轰\",\n    \"顷\",\n    \"转\",\n    \"斩\",\n    \"轮\",\n    \"软\",\n    \"到\",\n    \"非\",\n    \"叔\",\n    \"歧\",\n    \"肯\",\n    \"齿\",\n    \"些\",\n    \"卓\",\n    \"虎\",\n    \"虏\",\n    \"肾\",\n    \"贤\",\n    \"尚\",\n    \"旺\",\n    \"具\",\n    \"味\",\n    \"果\",\n    \"昆\",\n    \"国\",\n    \"哎\",\n    \"咕\",\n    \"昌\",\n    \"呵\",\n    \"畅\",\n    \"明\",\n    \"易\",\n    \"咙\",\n    \"昂\",\n    \"迪\",\n    \"典\",\n    \"固\",\n    \"忠\",\n    \"呻\",\n    \"咒\",\n    \"咋\",\n    \"咐\",\n    \"呼\",\n    \"鸣\",\n    \"咏\",\n    \"呢\",\n    \"咄\",\n    \"咖\",\n    \"岸\",\n    \"岩\",\n    \"帖\",\n    \"罗\",\n    \"帜\",\n    \"帕\",\n    \"岭\",\n    \"凯\",\n    \"败\",\n    \"账\",\n    \"贩\",\n    \"贬\",\n    \"购\",\n    \"贮\",\n    \"图\",\n    \"钓\",\n    \"制\",\n    \"知\",\n    \"迭\",\n    \"氛\",\n    \"垂\",\n    \"牧\",\n    \"物\",\n    \"乖\",\n    \"刮\",\n    \"秆\",\n    \"和\",\n    \"季\",\n    \"委\",\n    \"秉\",\n    \"佳\",\n    \"侍\",\n    \"岳\",\n    \"供\",\n    \"使\",\n    \"例\",\n    \"侠\",\n    \"侥\",\n    \"版\",\n    \"侄\",\n    \"侦\",\n    \"侣\",\n    \"侧\",\n    \"凭\",\n    \"侨\",\n    \"佩\",\n    
\"货\",\n    \"侈\",\n    \"依\",\n    \"卑\",\n    \"的\",\n    \"迫\",\n    \"质\",\n    \"欣\",\n    \"征\",\n    \"往\",\n    \"爬\",\n    \"彼\",\n    \"径\",\n    \"所\",\n    \"舍\",\n    \"金\",\n    \"刹\",\n    \"命\",\n    \"肴\",\n    \"斧\",\n    \"爸\",\n    \"采\",\n    \"觅\",\n    \"受\",\n    \"乳\",\n    \"贪\",\n    \"念\",\n    \"贫\",\n    \"忿\",\n    \"肤\",\n    \"肺\",\n    \"肢\",\n    \"肿\",\n    \"胀\",\n    \"朋\",\n    \"股\",\n    \"肮\",\n    \"肪\",\n    \"肥\",\n    \"服\",\n    \"胁\",\n    \"周\",\n    \"昏\",\n    \"鱼\",\n    \"兔\",\n    \"狐\",\n    \"忽\",\n    \"狗\",\n    \"狞\",\n    \"备\",\n    \"饰\",\n    \"饱\",\n    \"饲\",\n    \"变\",\n    \"京\",\n    \"享\",\n    \"庞\",\n    \"店\",\n    \"夜\",\n    \"庙\",\n    \"府\",\n    \"底\",\n    \"疟\",\n    \"疙\",\n    \"疚\",\n    \"剂\",\n    \"卒\",\n    \"郊\",\n    \"庚\",\n    \"废\",\n    \"净\",\n    \"盲\",\n    \"放\",\n    \"刻\",\n    \"育\",\n    \"氓\",\n    \"闸\",\n    \"闹\",\n    \"郑\",\n    \"券\",\n    \"卷\",\n    \"单\",\n    \"炬\",\n    \"炒\",\n    \"炊\",\n    \"炕\",\n    \"炎\",\n    \"炉\",\n    \"沫\",\n    \"浅\",\n    \"法\",\n    \"泄\",\n    \"沽\",\n    \"河\",\n    \"沾\",\n    \"泪\",\n    \"沮\",\n    \"油\",\n    \"泊\",\n    \"沿\",\n    \"泡\",\n    \"注\",\n    \"泣\",\n    \"泞\",\n    \"泻\",\n    \"泌\",\n    \"泳\",\n    \"泥\",\n    \"沸\",\n    \"沼\",\n    \"波\",\n    \"泼\",\n    \"泽\",\n    \"治\",\n    \"怔\",\n    \"怯\",\n    \"怖\",\n    \"性\",\n    \"怕\",\n    \"怜\",\n    \"怪\",\n    \"怡\",\n    \"学\",\n    \"宝\",\n    \"宗\",\n    \"定\",\n    \"宠\",\n    \"宜\",\n    \"审\",\n    \"宙\",\n    \"官\",\n    \"空\",\n    \"帘\",\n    \"宛\",\n    \"实\",\n    \"试\",\n    \"郎\",\n    \"诗\",\n    \"肩\",\n    \"房\",\n    \"诚\",\n    \"衬\",\n    \"衫\",\n    \"视\",\n    \"祈\",\n    \"话\",\n    \"诞\",\n    \"诡\",\n    \"询\",\n    \"该\",\n    \"详\",\n    \"建\",\n    \"肃\",\n    \"录\",\n    \"隶\",\n    \"帚\",\n    \"屉\",\n    \"居\",\n    \"届\",\n    \"刷\",\n    \"屈\",\n    \"弧\",\n    \"弥\",\n    \"弦\",\n    \"承\",\n    \"孟\",\n    
\"陋\",\n    \"陌\",\n    \"孤\",\n    \"陕\",\n    \"降\",\n    \"函\",\n    \"限\",\n    \"妹\",\n    \"姑\",\n    \"姐\",\n    \"姓\",\n    \"妮\",\n    \"始\",\n    \"姆\",\n    \"迢\",\n    \"驾\",\n    \"叁\",\n    \"参\",\n    \"艰\",\n    \"线\",\n    \"练\",\n    \"组\",\n    \"绅\",\n    \"细\",\n    \"驶\",\n    \"织\",\n    \"驹\",\n    \"终\",\n    \"驻\",\n    \"绊\",\n    \"驼\",\n    \"绍\",\n    \"绎\",\n    \"经\",\n    \"贯\",\n    \"契\",\n    \"贰\",\n    \"奏\",\n    \"春\",\n    \"帮\",\n    \"玷\",\n    \"珍\",\n    \"玲\",\n    \"玻\",\n    \"毒\",\n    \"型\",\n    \"拭\",\n    \"挂\",\n    \"封\",\n    \"持\",\n    \"拷\",\n    \"拱\",\n    \"项\",\n    \"垮\",\n    \"挎\",\n    \"城\",\n    \"挟\",\n    \"挠\",\n    \"政\",\n    \"赴\",\n    \"赵\",\n    \"挡\",\n    \"拽\",\n    \"哉\",\n    \"挺\",\n    \"括\",\n    \"垢\",\n    \"拴\",\n    \"拾\",\n    \"挑\",\n    \"垛\",\n    \"指\",\n    \"垫\",\n    \"挣\",\n    \"挤\",\n    \"拼\",\n    \"挖\",\n    \"按\",\n    \"挥\",\n    \"挪\",\n    \"拯\",\n    \"某\",\n    \"甚\",\n    \"荆\",\n    \"茸\",\n    \"革\",\n    \"茬\",\n    \"荐\",\n    \"巷\",\n    \"带\",\n    \"草\",\n    \"茧\",\n    \"茵\",\n    \"茶\",\n    \"荒\",\n    \"茫\",\n    \"荡\",\n    \"荣\",\n    \"荤\",\n    \"荧\",\n    \"故\",\n    \"胡\",\n    \"荫\",\n    \"荔\",\n    \"南\",\n    \"药\",\n    \"标\",\n    \"栈\",\n    \"柑\",\n    \"枯\",\n    \"柄\",\n    \"栋\",\n    \"相\",\n    \"查\",\n    \"柏\",\n    \"栅\",\n    \"柳\",\n    \"柱\",\n    \"柿\",\n    \"栏\",\n    \"柠\",\n    \"树\",\n    \"勃\",\n    \"要\",\n    \"柬\",\n    \"咸\",\n    \"威\",\n    \"歪\",\n    \"研\",\n    \"砖\",\n    \"厘\",\n    \"厚\",\n    \"砌\",\n    \"砂\",\n    \"泵\",\n    \"砚\",\n    \"砍\",\n    \"面\",\n    \"耐\",\n    \"耍\",\n    \"牵\",\n    \"鸥\",\n    \"残\",\n    \"殃\",\n    \"轴\",\n    \"轻\",\n    \"鸦\",\n    \"皆\",\n    \"韭\",\n    \"背\",\n    \"战\",\n    \"点\",\n    \"虐\",\n    \"临\",\n    \"览\",\n    \"竖\",\n    \"省\",\n    \"削\",\n    \"尝\",\n    \"昧\",\n    \"盹\",\n    \"是\",\n    \"盼\",\n    \"眨\",\n    \"哇\",\n    \"哄\",\n    
\"哑\",\n    \"显\",\n    \"冒\",\n    \"映\",\n    \"星\",\n    \"昨\",\n    \"咧\",\n    \"昭\",\n    \"畏\",\n    \"趴\",\n    \"胃\",\n    \"贵\",\n    \"界\",\n    \"虹\",\n    \"虾\",\n    \"蚁\",\n    \"思\",\n    \"蚂\",\n    \"虽\",\n    \"品\",\n    \"咽\",\n    \"骂\",\n    \"勋\",\n    \"哗\",\n    \"咱\",\n    \"响\",\n    \"哈\",\n    \"哆\",\n    \"咬\",\n    \"咳\",\n    \"咪\",\n    \"哪\",\n    \"哟\",\n    \"炭\",\n    \"峡\",\n    \"罚\",\n    \"贱\",\n    \"贴\",\n    \"贻\",\n    \"骨\",\n    \"幽\",\n    \"钙\",\n    \"钝\",\n    \"钞\",\n    \"钟\",\n    \"钢\",\n    \"钠\",\n    \"钥\",\n    \"钦\",\n    \"钧\",\n    \"钩\",\n    \"钮\",\n    \"卸\",\n    \"缸\",\n    \"拜\",\n    \"看\",\n    \"矩\",\n    \"毡\",\n    \"氢\",\n    \"怎\",\n    \"牲\",\n    \"选\",\n    \"适\",\n    \"秒\",\n    \"香\",\n    \"种\",\n    \"秋\",\n    \"科\",\n    \"重\",\n    \"复\",\n    \"竿\",\n    \"段\",\n    \"便\",\n    \"俩\",\n    \"贷\",\n    \"顺\",\n    \"修\",\n    \"俏\",\n    \"保\",\n    \"促\",\n    \"俄\",\n    \"俐\",\n    \"侮\",\n    \"俭\",\n    \"俗\",\n    \"俘\",\n    \"信\",\n    \"皇\",\n    \"泉\",\n    \"鬼\",\n    \"侵\",\n    \"禹\",\n    \"侯\",\n    \"追\",\n    \"俊\",\n    \"盾\",\n    \"待\",\n    \"徊\",\n    \"衍\",\n    \"律\",\n    \"很\",\n    \"须\",\n    \"叙\",\n    \"剑\",\n    \"逃\",\n    \"食\",\n    \"盆\",\n    \"胚\",\n    \"胧\",\n    \"胆\",\n    \"胜\",\n    \"胞\",\n    \"胖\",\n    \"脉\",\n    \"胎\",\n    \"勉\",\n    \"狭\",\n    \"狮\",\n    \"独\",\n    \"狰\",\n    \"狡\",\n    \"狱\",\n    \"狠\",\n    \"贸\",\n    \"怨\",\n    \"急\",\n    \"饵\",\n    \"饶\",\n    \"蚀\",\n    \"饺\",\n    \"饼\",\n    \"峦\",\n    \"弯\",\n    \"将\",\n    \"奖\",\n    \"哀\",\n    \"亭\",\n    \"亮\",\n    \"度\",\n    \"迹\",\n    \"庭\",\n    \"疮\",\n    \"疯\",\n    \"疫\",\n    \"疤\",\n    \"咨\",\n    \"姿\",\n    \"亲\",\n    \"音\",\n    \"帝\",\n    \"施\",\n    \"闺\",\n    \"闻\",\n    \"闽\",\n    \"阀\",\n    \"阁\",\n    \"差\",\n    \"养\",\n    \"美\",\n    \"姜\",\n    \"叛\",\n    \"送\",\n    \"类\",\n    \"迷\",\n    \"籽\",\n    \"娄\",\n    
\"前\",\n    \"首\",\n    \"逆\",\n    \"兹\",\n    \"总\",\n    \"炼\",\n    \"炸\",\n    \"烁\",\n    \"炮\",\n    \"炫\",\n    \"烂\",\n    \"剃\",\n    \"洼\",\n    \"洁\",\n    \"洪\",\n    \"洒\",\n    \"柒\",\n    \"浇\",\n    \"浊\",\n    \"洞\",\n    \"测\",\n    \"洗\",\n    \"活\",\n    \"派\",\n    \"洽\",\n    \"染\",\n    \"洛\",\n    \"浏\",\n    \"济\",\n    \"洋\",\n    \"洲\",\n    \"浑\",\n    \"浓\",\n    \"津\",\n    \"恃\",\n    \"恒\",\n    \"恢\",\n    \"恍\",\n    \"恬\",\n    \"恤\",\n    \"恰\",\n    \"恼\",\n    \"恨\",\n    \"举\",\n    \"觉\",\n    \"宣\",\n    \"宦\",\n    \"室\",\n    \"宫\",\n    \"宪\",\n    \"突\",\n    \"穿\",\n    \"窃\",\n    \"客\",\n    \"诫\",\n    \"冠\",\n    \"诬\",\n    \"语\",\n    \"扁\",\n    \"袄\",\n    \"祖\",\n    \"神\",\n    \"祝\",\n    \"祠\",\n    \"误\",\n    \"诱\",\n    \"诲\",\n    \"说\",\n    \"诵\",\n    \"垦\",\n    \"退\",\n    \"既\",\n    \"屋\",\n    \"昼\",\n    \"屏\",\n    \"屎\",\n    \"费\",\n    \"陡\",\n    \"逊\",\n    \"眉\",\n    \"孩\",\n    \"陨\",\n    \"除\",\n    \"险\",\n    \"院\",\n    \"娃\",\n    \"姥\",\n    \"姨\",\n    \"姻\",\n    \"娇\",\n    \"姚\",\n    \"娜\",\n    \"怒\",\n    \"架\",\n    \"贺\",\n    \"盈\",\n    \"勇\",\n    \"怠\",\n    \"癸\",\n    \"蚤\",\n    \"柔\",\n    \"垒\",\n    \"绑\",\n    \"绒\",\n    \"结\",\n    \"绕\",\n    \"骄\",\n    \"绘\",\n    \"给\",\n    \"绚\",\n    \"骆\",\n    \"络\",\n    \"绝\",\n    \"绞\",\n    \"骇\",\n    \"统\",\n    \"耕\",\n    \"耘\",\n    \"耗\",\n    \"耙\",\n    \"艳\",\n    \"泰\",\n    \"秦\",\n    \"珠\",\n    \"班\",\n    \"素\",\n    \"匿\",\n    \"蚕\",\n    \"顽\",\n    \"盏\",\n    \"匪\",\n    \"捞\",\n    \"栽\",\n    \"捕\",\n    \"埂\",\n    \"捂\",\n    \"振\",\n    \"载\",\n    \"赶\",\n    \"起\",\n    \"盐\",\n    \"捎\",\n    \"捍\",\n    \"捏\",\n    \"埋\",\n    \"捉\",\n    \"捆\",\n    \"捐\",\n    \"损\",\n    \"袁\",\n    \"捌\",\n    \"都\",\n    \"哲\",\n    \"逝\",\n    \"捡\",\n    \"挫\",\n    \"换\",\n    \"挽\",\n    \"挚\",\n    \"热\",\n    \"恐\",\n    \"捣\",\n    \"壶\",\n    \"捅\",\n    \"埃\",\n    \"挨\",\n    
\"耻\",\n    \"耿\",\n    \"耽\",\n    \"聂\",\n    \"恭\",\n    \"莽\",\n    \"莱\",\n    \"莲\",\n    \"莫\",\n    \"莉\",\n    \"荷\",\n    \"获\",\n    \"晋\",\n    \"恶\",\n    \"莹\",\n    \"莺\",\n    \"真\",\n    \"框\",\n    \"梆\",\n    \"桂\",\n    \"桔\",\n    \"栖\",\n    \"档\",\n    \"桐\",\n    \"株\",\n    \"桥\",\n    \"桦\",\n    \"栓\",\n    \"桃\",\n    \"格\",\n    \"桩\",\n    \"校\",\n    \"核\",\n    \"样\",\n    \"根\",\n    \"索\",\n    \"哥\",\n    \"速\",\n    \"逗\",\n    \"栗\",\n    \"贾\",\n    \"酌\",\n    \"配\",\n    \"翅\",\n    \"辱\",\n    \"唇\",\n    \"夏\",\n    \"砸\",\n    \"砰\",\n    \"砾\",\n    \"础\",\n    \"破\",\n    \"原\",\n    \"套\",\n    \"逐\",\n    \"烈\",\n    \"殊\",\n    \"殉\",\n    \"顾\",\n    \"轿\",\n    \"较\",\n    \"顿\",\n    \"毙\",\n    \"致\",\n    \"柴\",\n    \"桌\",\n    \"虑\",\n    \"监\",\n    \"紧\",\n    \"党\",\n    \"逞\",\n    \"晒\",\n    \"眠\",\n    \"晓\",\n    \"哮\",\n    \"唠\",\n    \"鸭\",\n    \"晃\",\n    \"哺\",\n    \"晌\",\n    \"剔\",\n    \"晕\",\n    \"蚌\",\n    \"畔\",\n    \"蚣\",\n    \"蚊\",\n    \"蚪\",\n    \"蚓\",\n    \"哨\",\n    \"哩\",\n    \"圃\",\n    \"哭\",\n    \"哦\",\n    \"恩\",\n    \"鸯\",\n    \"唤\",\n    \"唁\",\n    \"哼\",\n    \"唧\",\n    \"啊\",\n    \"唉\",\n    \"唆\",\n    \"罢\",\n    \"峭\",\n    \"峨\",\n    \"峰\",\n    \"圆\",\n    \"峻\",\n    \"贼\",\n    \"贿\",\n    \"赂\",\n    \"赃\",\n    \"钱\",\n    \"钳\",\n    \"钻\",\n    \"钾\",\n    \"铁\",\n    \"铃\",\n    \"铅\",\n    \"缺\",\n    \"氧\",\n    \"氨\",\n    \"特\",\n    \"牺\",\n    \"造\",\n    \"乘\",\n    \"敌\",\n    \"秤\",\n    \"租\",\n    \"积\",\n    \"秧\",\n    \"秩\",\n    \"称\",\n    \"秘\",\n    \"透\",\n    \"笔\",\n    \"笑\",\n    \"笋\",\n    \"债\",\n    \"借\",\n    \"值\",\n    \"倚\",\n    \"俺\",\n    \"倾\",\n    \"倒\",\n    \"倘\",\n    \"俱\",\n    \"倡\",\n    \"候\",\n    \"赁\",\n    \"俯\",\n    \"倍\",\n    \"倦\",\n    \"健\",\n    \"臭\",\n    \"射\",\n    \"躬\",\n    \"息\",\n    \"倔\",\n    \"徒\",\n    \"徐\",\n    \"殷\",\n    \"舰\",\n    \"舱\",\n    \"般\",\n    \"航\",\n    
\"途\",\n    \"拿\",\n    \"耸\",\n    \"爹\",\n    \"舀\",\n    \"爱\",\n    \"豺\",\n    \"豹\",\n    \"颁\",\n    \"颂\",\n    \"翁\",\n    \"胰\",\n    \"脆\",\n    \"脂\",\n    \"胸\",\n    \"胳\",\n    \"脏\",\n    \"脐\",\n    \"胶\",\n    \"脑\",\n    \"脓\",\n    \"逛\",\n    \"狸\",\n    \"狼\",\n    \"卿\",\n    \"逢\",\n    \"鸵\",\n    \"留\",\n    \"鸳\",\n    \"皱\",\n    \"饿\",\n    \"馁\",\n    \"凌\",\n    \"凄\",\n    \"恋\",\n    \"桨\",\n    \"浆\",\n    \"衰\",\n    \"衷\",\n    \"高\",\n    \"郭\",\n    \"席\",\n    \"准\",\n    \"座\",\n    \"症\",\n    \"病\",\n    \"疾\",\n    \"斋\",\n    \"疹\",\n    \"疼\",\n    \"疲\",\n    \"脊\",\n    \"效\",\n    \"离\",\n    \"紊\",\n    \"唐\",\n    \"瓷\",\n    \"资\",\n    \"凉\",\n    \"站\",\n    \"剖\",\n    \"竞\",\n    \"部\",\n    \"旁\",\n    \"旅\",\n    \"畜\",\n    \"阅\",\n    \"羞\",\n    \"羔\",\n    \"瓶\",\n    \"拳\",\n    \"粉\",\n    \"料\",\n    \"益\",\n    \"兼\",\n    \"烤\",\n    \"烘\",\n    \"烦\",\n    \"烧\",\n    \"烛\",\n    \"烟\",\n    \"烙\",\n    \"递\",\n    \"涛\",\n    \"浙\",\n    \"涝\",\n    \"浦\",\n    \"酒\",\n    \"涉\",\n    \"消\",\n    \"涡\",\n    \"浩\",\n    \"海\",\n    \"涂\",\n    \"浴\",\n    \"浮\",\n    \"涣\",\n    \"涤\",\n    \"流\",\n    \"润\",\n    \"涧\",\n    \"涕\",\n    \"浪\",\n    \"浸\",\n    \"涨\",\n    \"烫\",\n    \"涩\",\n    \"涌\",\n    \"悖\",\n    \"悟\",\n    \"悄\",\n    \"悍\",\n    \"悔\",\n    \"悯\",\n    \"悦\",\n    \"害\",\n    \"宽\",\n    \"家\",\n    \"宵\",\n    \"宴\",\n    \"宾\",\n    \"窍\",\n    \"窄\",\n    \"容\",\n    \"宰\",\n    \"案\",\n    \"请\",\n    \"朗\",\n    \"诸\",\n    \"诺\",\n    \"读\",\n    \"扇\",\n    \"诽\",\n    \"袜\",\n    \"袖\",\n    \"袍\",\n    \"被\",\n    \"祥\",\n    \"课\",\n    \"冥\",\n    \"谁\",\n    \"调\",\n    \"冤\",\n    \"谅\",\n    \"谆\",\n    \"谈\",\n    \"谊\",\n    \"剥\",\n    \"恳\",\n    \"展\",\n    \"剧\",\n    \"屑\",\n    \"弱\",\n    \"陵\",\n    \"祟\",\n    \"陶\",\n    \"陷\",\n    \"陪\",\n    \"娱\",\n    \"娟\",\n    \"恕\",\n    \"娥\",\n    \"娘\",\n    \"通\",\n    \"能\",\n    \"难\",\n    
\"预\",\n    \"桑\",\n    \"绢\",\n    \"绣\",\n    \"验\",\n    \"继\",\n    \"骏\",\n    \"球\",\n    \"琐\",\n    \"理\",\n    \"琉\",\n    \"琅\",\n    \"捧\",\n    \"堵\",\n    \"措\",\n    \"描\",\n    \"域\",\n    \"捺\",\n    \"掩\",\n    \"捷\",\n    \"排\",\n    \"焉\",\n    \"掉\",\n    \"捶\",\n    \"赦\",\n    \"堆\",\n    \"推\",\n    \"埠\",\n    \"掀\",\n    \"授\",\n    \"捻\",\n    \"教\",\n    \"掏\",\n    \"掐\",\n    \"掠\",\n    \"掂\",\n    \"培\",\n    \"接\",\n    \"掷\",\n    \"控\",\n    \"探\",\n    \"据\",\n    \"掘\",\n    \"掺\",\n    \"职\",\n    \"基\",\n    \"聆\",\n    \"勘\",\n    \"聊\",\n    \"娶\",\n    \"著\",\n    \"菱\",\n    \"勒\",\n    \"黄\",\n    \"菲\",\n    \"萌\",\n    \"萝\",\n    \"菌\",\n    \"萎\",\n    \"菜\",\n    \"萄\",\n    \"菊\",\n    \"菩\",\n    \"萍\",\n    \"菠\",\n    \"萤\",\n    \"营\",\n    \"乾\",\n    \"萧\",\n    \"萨\",\n    \"菇\",\n    \"械\",\n    \"彬\",\n    \"梦\",\n    \"婪\",\n    \"梗\",\n    \"梧\",\n    \"梢\",\n    \"梅\",\n    \"检\",\n    \"梳\",\n    \"梯\",\n    \"桶\",\n    \"梭\",\n    \"救\",\n    \"曹\",\n    \"副\",\n    \"票\",\n    \"酝\",\n    \"酗\",\n    \"厢\",\n    \"戚\",\n    \"硅\",\n    \"硕\",\n    \"奢\",\n    \"盔\",\n    \"爽\",\n    \"聋\",\n    \"袭\",\n    \"盛\",\n    \"匾\",\n    \"雪\",\n    \"辅\",\n    \"辆\",\n    \"颅\",\n    \"虚\",\n    \"彪\",\n    \"雀\",\n    \"堂\",\n    \"常\",\n    \"眶\",\n    \"匙\",\n    \"晨\",\n    \"睁\",\n    \"眯\",\n    \"眼\",\n    \"悬\",\n    \"野\",\n    \"啪\",\n    \"啦\",\n    \"曼\",\n    \"晦\",\n    \"晚\",\n    \"啄\",\n    \"啡\",\n    \"距\",\n    \"趾\",\n    \"啃\",\n    \"跃\",\n    \"略\",\n    \"蚯\",\n    \"蛀\",\n    \"蛇\",\n    \"唬\",\n    \"累\",\n    \"鄂\",\n    \"唱\",\n    \"患\",\n    \"啰\",\n    \"唾\",\n    \"唯\",\n    \"啤\",\n    \"啥\",\n    \"啸\",\n    \"崖\",\n    \"崎\",\n    \"崭\",\n    \"逻\",\n    \"崔\",\n    \"帷\",\n    \"崩\",\n    \"崇\",\n    \"崛\",\n    \"婴\",\n    \"圈\",\n    \"铐\",\n    \"铛\",\n    \"铝\",\n    \"铜\",\n    \"铭\",\n    \"铲\",\n    \"银\",\n    \"矫\",\n    \"甜\",\n    \"秸\",\n    \"梨\",\n    
\"犁\",\n    \"秽\",\n    \"移\",\n    \"笨\",\n    \"笼\",\n    \"笛\",\n    \"笙\",\n    \"符\",\n    \"第\",\n    \"敏\",\n    \"做\",\n    \"袋\",\n    \"悠\",\n    \"偿\",\n    \"偶\",\n    \"偎\",\n    \"偷\",\n    \"您\",\n    \"售\",\n    \"停\",\n    \"偏\",\n    \"躯\",\n    \"兜\",\n    \"假\",\n    \"衅\",\n    \"徘\",\n    \"徙\",\n    \"得\",\n    \"衔\",\n    \"盘\",\n    \"舶\",\n    \"船\",\n    \"舵\",\n    \"斜\",\n    \"盒\",\n    \"鸽\",\n    \"敛\",\n    \"悉\",\n    \"欲\",\n    \"彩\",\n    \"领\",\n    \"脚\",\n    \"脖\",\n    \"脯\",\n    \"豚\",\n    \"脸\",\n    \"脱\",\n    \"象\",\n    \"够\",\n    \"逸\",\n    \"猜\",\n    \"猪\",\n    \"猎\",\n    \"猫\",\n    \"凰\",\n    \"猖\",\n    \"猛\",\n    \"祭\",\n    \"馅\",\n    \"馆\",\n    \"凑\",\n    \"减\",\n    \"毫\",\n    \"烹\",\n    \"庶\",\n    \"麻\",\n    \"庵\",\n    \"痊\",\n    \"痒\",\n    \"痕\",\n    \"廊\",\n    \"康\",\n    \"庸\",\n    \"鹿\",\n    \"盗\",\n    \"章\",\n    \"竟\",\n    \"商\",\n    \"族\",\n    \"旋\",\n    \"望\",\n    \"率\",\n    \"阎\",\n    \"阐\",\n    \"着\",\n    \"羚\",\n    \"盖\",\n    \"眷\",\n    \"粘\",\n    \"粗\",\n    \"粒\",\n    \"断\",\n    \"剪\",\n    \"兽\",\n    \"焊\",\n    \"焕\",\n    \"清\",\n    \"添\",\n    \"鸿\",\n    \"淋\",\n    \"涯\",\n    \"淹\",\n    \"渠\",\n    \"渐\",\n    \"淑\",\n    \"淌\",\n    \"混\",\n    \"淮\",\n    \"淆\",\n    \"渊\",\n    \"淫\",\n    \"渔\",\n    \"淘\",\n    \"淳\",\n    \"液\",\n    \"淤\",\n    \"淡\",\n    \"淀\",\n    \"深\",\n    \"涮\",\n    \"涵\",\n    \"婆\",\n    \"梁\",\n    \"渗\",\n    \"情\",\n    \"惜\",\n    \"惭\",\n    \"悼\",\n    \"惧\",\n    \"惕\",\n    \"惟\",\n    \"惊\",\n    \"惦\",\n    \"悴\",\n    \"惋\",\n    \"惨\",\n    \"惯\",\n    \"寇\",\n    \"寅\",\n    \"寄\",\n    \"寂\",\n    \"宿\",\n    \"窒\",\n    \"窑\",\n    \"密\",\n    \"谋\",\n    \"谍\",\n    \"谎\",\n    \"谐\",\n    \"袱\",\n    \"祷\",\n    \"祸\",\n    \"谓\",\n    \"谚\",\n    \"谜\",\n    \"逮\",\n    \"敢\",\n    \"尉\",\n    \"屠\",\n    \"弹\",\n    \"隋\",\n    \"堕\",\n    \"随\",\n    \"蛋\",\n    \"隅\",\n    \"隆\",\n    
\"隐\",\n    \"婚\",\n    \"婶\",\n    \"婉\",\n    \"颇\",\n    \"颈\",\n    \"绩\",\n    \"绪\",\n    \"续\",\n    \"骑\",\n    \"绰\",\n    \"绳\",\n    \"维\",\n    \"绵\",\n    \"绷\",\n    \"绸\",\n    \"综\",\n    \"绽\",\n    \"绿\",\n    \"缀\",\n    \"巢\",\n    \"琴\",\n    \"琳\",\n    \"琢\",\n    \"琼\",\n    \"斑\",\n    \"替\",\n    \"揍\",\n    \"款\",\n    \"堪\",\n    \"塔\",\n    \"搭\",\n    \"堰\",\n    \"揩\",\n    \"越\",\n    \"趁\",\n    \"趋\",\n    \"超\",\n    \"揽\",\n    \"堤\",\n    \"提\",\n    \"博\",\n    \"揭\",\n    \"喜\",\n    \"彭\",\n    \"揣\",\n    \"插\",\n    \"揪\",\n    \"搜\",\n    \"煮\",\n    \"援\",\n    \"搀\",\n    \"裁\",\n    \"搁\",\n    \"搓\",\n    \"搂\",\n    \"搅\",\n    \"壹\",\n    \"握\",\n    \"搔\",\n    \"揉\",\n    \"斯\",\n    \"期\",\n    \"欺\",\n    \"联\",\n    \"葫\",\n    \"散\",\n    \"惹\",\n    \"葬\",\n    \"募\",\n    \"葛\",\n    \"董\",\n    \"葡\",\n    \"敬\",\n    \"葱\",\n    \"蒋\",\n    \"蒂\",\n    \"落\",\n    \"韩\",\n    \"朝\",\n    \"辜\",\n    \"葵\",\n    \"棒\",\n    \"棱\",\n    \"棋\",\n    \"椰\",\n    \"植\",\n    \"森\",\n    \"焚\",\n    \"椅\",\n    \"椒\",\n    \"棵\",\n    \"棍\",\n    \"椎\",\n    \"棉\",\n    \"棚\",\n    \"棕\",\n    \"棺\",\n    \"榔\",\n    \"椭\",\n    \"惠\",\n    \"惑\",\n    \"逼\",\n    \"粟\",\n    \"棘\",\n    \"酣\",\n    \"酥\",\n    \"厨\",\n    \"厦\",\n    \"硬\",\n    \"硝\",\n    \"确\",\n    \"硫\",\n    \"雁\",\n    \"殖\",\n    \"裂\",\n    \"雄\",\n    \"颊\",\n    \"雳\",\n    \"暂\",\n    \"雅\",\n    \"翘\",\n    \"辈\",\n    \"悲\",\n    \"紫\",\n    \"凿\",\n    \"辉\",\n    \"敞\",\n    \"棠\",\n    \"赏\",\n    \"掌\",\n    \"晴\",\n    \"睐\",\n    \"暑\",\n    \"最\",\n    \"晰\",\n    \"量\",\n    \"鼎\",\n    \"喷\",\n    \"喳\",\n    \"晶\",\n    \"喇\",\n    \"遇\",\n    \"喊\",\n    \"遏\",\n    \"晾\",\n    \"景\",\n    \"畴\",\n    \"践\",\n    \"跋\",\n    \"跌\",\n    \"跑\",\n    \"跛\",\n    \"遗\",\n    \"蛙\",\n    \"蛛\",\n    \"蜓\",\n    \"蜒\",\n    \"蛤\",\n    \"喝\",\n    \"鹃\",\n    \"喂\",\n    \"喘\",\n    \"喉\",\n    \"喻\",\n    \"啼\",\n    
\"喧\",\n    \"嵌\",\n    \"幅\",\n    \"帽\",\n    \"赋\",\n    \"赌\",\n    \"赎\",\n    \"赐\",\n    \"赔\",\n    \"黑\",\n    \"铸\",\n    \"铺\",\n    \"链\",\n    \"销\",\n    \"锁\",\n    \"锄\",\n    \"锅\",\n    \"锈\",\n    \"锋\",\n    \"锌\",\n    \"锐\",\n    \"甥\",\n    \"掰\",\n    \"短\",\n    \"智\",\n    \"氮\",\n    \"毯\",\n    \"氯\",\n    \"鹅\",\n    \"剩\",\n    \"稍\",\n    \"程\",\n    \"稀\",\n    \"税\",\n    \"筐\",\n    \"等\",\n    \"筑\",\n    \"策\",\n    \"筛\",\n    \"筒\",\n    \"筏\",\n    \"答\",\n    \"筋\",\n    \"筝\",\n    \"傲\",\n    \"傅\",\n    \"牌\",\n    \"堡\",\n    \"集\",\n    \"焦\",\n    \"傍\",\n    \"储\",\n    \"皓\",\n    \"皖\",\n    \"粤\",\n    \"奥\",\n    \"街\",\n    \"惩\",\n    \"御\",\n    \"循\",\n    \"艇\",\n    \"舒\",\n    \"逾\",\n    \"番\",\n    \"释\",\n    \"禽\",\n    \"腊\",\n    \"脾\",\n    \"腋\",\n    \"腔\",\n    \"腕\",\n    \"鲁\",\n    \"猩\",\n    \"猬\",\n    \"猾\",\n    \"猴\",\n    \"惫\",\n    \"然\",\n    \"馈\",\n    \"馋\",\n    \"装\",\n    \"蛮\",\n    \"就\",\n    \"敦\",\n    \"斌\",\n    \"痘\",\n    \"痢\",\n    \"痪\",\n    \"痛\",\n    \"童\",\n    \"竣\",\n    \"阔\",\n    \"善\",\n    \"翔\",\n    \"羡\",\n    \"普\",\n    \"粪\",\n    \"尊\",\n    \"奠\",\n    \"道\",\n    \"遂\",\n    \"曾\",\n    \"焰\",\n    \"港\",\n    \"滞\",\n    \"湖\",\n    \"湘\",\n    \"渣\",\n    \"渤\",\n    \"渺\",\n    \"湿\",\n    \"温\",\n    \"渴\",\n    \"溃\",\n    \"溅\",\n    \"滑\",\n    \"湃\",\n    \"渝\",\n    \"湾\",\n    \"渡\",\n    \"游\",\n    \"滋\",\n    \"渲\",\n    \"溉\",\n    \"愤\",\n    \"慌\",\n    \"惰\",\n    \"愕\",\n    \"愣\",\n    \"惶\",\n    \"愧\",\n    \"愉\",\n    \"慨\",\n    \"割\",\n    \"寒\",\n    \"富\",\n    \"寓\",\n    \"窜\",\n    \"窝\",\n    \"窖\",\n    \"窗\",\n    \"窘\",\n    \"遍\",\n    \"雇\",\n    \"裕\",\n    \"裤\",\n    \"裙\",\n    \"禅\",\n    \"禄\",\n    \"谢\",\n    \"谣\",\n    \"谤\",\n    \"谦\",\n    \"犀\",\n    \"属\",\n    \"屡\",\n    \"强\",\n    \"粥\",\n    \"疏\",\n    \"隔\",\n    \"隙\",\n    \"隘\",\n    \"媒\",\n    \"絮\",\n    \"嫂\",\n    \"媚\",\n    
\"婿\",\n    \"登\",\n    \"缅\",\n    \"缆\",\n    \"缉\",\n    \"缎\",\n    \"缓\",\n    \"缔\",\n    \"缕\",\n    \"骗\",\n    \"编\",\n    \"骚\",\n    \"缘\",\n    \"瑟\",\n    \"鹉\",\n    \"瑞\",\n    \"瑰\",\n    \"瑙\",\n    \"魂\",\n    \"肆\",\n    \"摄\",\n    \"摸\",\n    \"填\",\n    \"搏\",\n    \"塌\",\n    \"鼓\",\n    \"摆\",\n    \"携\",\n    \"搬\",\n    \"摇\",\n    \"搞\",\n    \"塘\",\n    \"摊\",\n    \"聘\",\n    \"斟\",\n    \"蒜\",\n    \"勤\",\n    \"靴\",\n    \"靶\",\n    \"鹊\",\n    \"蓝\",\n    \"墓\",\n    \"幕\",\n    \"蓬\",\n    \"蓄\",\n    \"蒲\",\n    \"蓉\",\n    \"蒙\",\n    \"蒸\",\n    \"献\",\n    \"椿\",\n    \"禁\",\n    \"楚\",\n    \"楷\",\n    \"榄\",\n    \"想\",\n    \"槐\",\n    \"榆\",\n    \"楼\",\n    \"概\",\n    \"赖\",\n    \"酪\",\n    \"酬\",\n    \"感\",\n    \"碍\",\n    \"碘\",\n    \"碑\",\n    \"碎\",\n    \"碰\",\n    \"碗\",\n    \"碌\",\n    \"尴\",\n    \"雷\",\n    \"零\",\n    \"雾\",\n    \"雹\",\n    \"辐\",\n    \"辑\",\n    \"输\",\n    \"督\",\n    \"频\",\n    \"龄\",\n    \"鉴\",\n    \"睛\",\n    \"睹\",\n    \"睦\",\n    \"瞄\",\n    \"睫\",\n    \"睡\",\n    \"睬\",\n    \"嗜\",\n    \"鄙\",\n    \"嗦\",\n    \"愚\",\n    \"暖\",\n    \"盟\",\n    \"歇\",\n    \"暗\",\n    \"暇\",\n    \"照\",\n    \"畸\",\n    \"跨\",\n    \"跷\",\n    \"跳\",\n    \"跺\",\n    \"跪\",\n    \"路\",\n    \"跤\",\n    \"跟\",\n    \"遣\",\n    \"蜈\",\n    \"蜗\",\n    \"蛾\",\n    \"蜂\",\n    \"蜕\",\n    \"嗅\",\n    \"嗡\",\n    \"嗓\",\n    \"署\",\n    \"置\",\n    \"罪\",\n    \"罩\",\n    \"蜀\",\n    \"幌\",\n    \"错\",\n    \"锚\",\n    \"锡\",\n    \"锣\",\n    \"锤\",\n    \"锥\",\n    \"锦\",\n    \"键\",\n    \"锯\",\n    \"锰\",\n    \"矮\",\n    \"辞\",\n    \"稚\",\n    \"稠\",\n    \"颓\",\n    \"愁\",\n    \"筹\",\n    \"签\",\n    \"简\",\n    \"筷\",\n    \"毁\",\n    \"舅\",\n    \"鼠\",\n    \"催\",\n    \"傻\",\n    \"像\",\n    \"躲\",\n    \"魁\",\n    \"衙\",\n    \"微\",\n    \"愈\",\n    \"遥\",\n    \"腻\",\n    \"腰\",\n    \"腥\",\n    \"腮\",\n    \"腹\",\n    \"腺\",\n    \"鹏\",\n    \"腾\",\n    \"腿\",\n    \"鲍\",\n    
\"猿\",\n    \"颖\",\n    \"触\",\n    \"解\",\n    \"煞\",\n    \"雏\",\n    \"馍\",\n    \"馏\",\n    \"酱\",\n    \"禀\",\n    \"痹\",\n    \"廓\",\n    \"痴\",\n    \"痰\",\n    \"廉\",\n    \"靖\",\n    \"新\",\n    \"韵\",\n    \"意\",\n    \"誊\",\n    \"粮\",\n    \"数\",\n    \"煎\",\n    \"塑\",\n    \"慈\",\n    \"煤\",\n    \"煌\",\n    \"满\",\n    \"漠\",\n    \"滇\",\n    \"源\",\n    \"滤\",\n    \"滥\",\n    \"滔\",\n    \"溪\",\n    \"溜\",\n    \"漓\",\n    \"滚\",\n    \"溢\",\n    \"溯\",\n    \"滨\",\n    \"溶\",\n    \"溺\",\n    \"粱\",\n    \"滩\",\n    \"慎\",\n    \"誉\",\n    \"塞\",\n    \"寞\",\n    \"窥\",\n    \"窟\",\n    \"寝\",\n    \"谨\",\n    \"褂\",\n    \"裸\",\n    \"福\",\n    \"谬\",\n    \"群\",\n    \"殿\",\n    \"辟\",\n    \"障\",\n    \"媳\",\n    \"嫉\",\n    \"嫌\",\n    \"嫁\",\n    \"叠\",\n    \"缚\",\n    \"缝\",\n    \"缠\",\n    \"缤\",\n    \"剿\",\n    \"静\",\n    \"碧\",\n    \"璃\",\n    \"赘\",\n    \"熬\",\n    \"墙\",\n    \"墟\",\n    \"嘉\",\n    \"摧\",\n    \"赫\",\n    \"截\",\n    \"誓\",\n    \"境\",\n    \"摘\",\n    \"摔\",\n    \"撇\",\n    \"聚\",\n    \"慕\",\n    \"暮\",\n    \"摹\",\n    \"蔓\",\n    \"蔑\",\n    \"蔡\",\n    \"蔗\",\n    \"蔽\",\n    \"蔼\",\n    \"熙\",\n    \"蔚\",\n    \"兢\",\n    \"模\",\n    \"槛\",\n    \"榴\",\n    \"榜\",\n    \"榨\",\n    \"榕\",\n    \"歌\",\n    \"遭\",\n    \"酵\",\n    \"酷\",\n    \"酿\",\n    \"酸\",\n    \"碟\",\n    \"碱\",\n    \"碳\",\n    \"磁\",\n    \"愿\",\n    \"需\",\n    \"辖\",\n    \"辗\",\n    \"雌\",\n    \"裳\",\n    \"颗\",\n    \"瞅\",\n    \"墅\",\n    \"嗽\",\n    \"踊\",\n    \"蜻\",\n    \"蜡\",\n    \"蝇\",\n    \"蜘\",\n    \"蝉\",\n    \"嘛\",\n    \"嘀\",\n    \"赚\",\n    \"锹\",\n    \"锻\",\n    \"镀\",\n    \"舞\",\n    \"舔\",\n    \"稳\",\n    \"熏\",\n    \"箕\",\n    \"算\",\n    \"箩\",\n    \"管\",\n    \"箫\",\n    \"舆\",\n    \"僚\",\n    \"僧\",\n    \"鼻\",\n    \"魄\",\n    \"魅\",\n    \"貌\",\n    \"膜\",\n    \"膊\",\n    \"膀\",\n    \"鲜\",\n    \"疑\",\n    \"孵\",\n    \"馒\",\n    \"裹\",\n    \"敲\",\n    \"豪\",\n    \"膏\",\n    \"遮\",\n    
\"腐\",\n    \"瘩\",\n    \"瘟\",\n    \"瘦\",\n    \"辣\",\n    \"彰\",\n    \"竭\",\n    \"端\",\n    \"旗\",\n    \"精\",\n    \"粹\",\n    \"歉\",\n    \"弊\",\n    \"熄\",\n    \"熔\",\n    \"煽\",\n    \"潇\",\n    \"漆\",\n    \"漱\",\n    \"漂\",\n    \"漫\",\n    \"滴\",\n    \"漾\",\n    \"演\",\n    \"漏\",\n    \"慢\",\n    \"慷\",\n    \"寨\",\n    \"赛\",\n    \"寡\",\n    \"察\",\n    \"蜜\",\n    \"寥\",\n    \"谭\",\n    \"肇\",\n    \"褐\",\n    \"褪\",\n    \"谱\",\n    \"隧\",\n    \"嫩\",\n    \"翠\",\n    \"熊\",\n    \"凳\",\n    \"骡\",\n    \"缩\",\n    \"慧\",\n    \"撵\",\n    \"撕\",\n    \"撒\",\n    \"撩\",\n    \"趣\",\n    \"趟\",\n    \"撑\",\n    \"撮\",\n    \"撬\",\n    \"播\",\n    \"擒\",\n    \"墩\",\n    \"撞\",\n    \"撤\",\n    \"增\",\n    \"撰\",\n    \"聪\",\n    \"鞋\",\n    \"鞍\",\n    \"蕉\",\n    \"蕊\",\n    \"蔬\",\n    \"蕴\",\n    \"横\",\n    \"槽\",\n    \"樱\",\n    \"橡\",\n    \"樟\",\n    \"橄\",\n    \"敷\",\n    \"豌\",\n    \"飘\",\n    \"醋\",\n    \"醇\",\n    \"醉\",\n    \"磕\",\n    \"磊\",\n    \"磅\",\n    \"碾\",\n    \"震\",\n    \"霄\",\n    \"霉\",\n    \"瞒\",\n    \"题\",\n    \"暴\",\n    \"瞎\",\n    \"嘻\",\n    \"嘶\",\n    \"嘲\",\n    \"嘹\",\n    \"影\",\n    \"踢\",\n    \"踏\",\n    \"踩\",\n    \"踪\",\n    \"蝶\",\n    \"蝴\",\n    \"蝠\",\n    \"蝎\",\n    \"蝌\",\n    \"蝗\",\n    \"蝙\",\n    \"嘿\",\n    \"嘱\",\n    \"幢\",\n    \"墨\",\n    \"镇\",\n    \"镐\",\n    \"镑\",\n    \"靠\",\n    \"稽\",\n    \"稻\",\n    \"黎\",\n    \"稿\",\n    \"稼\",\n    \"箱\",\n    \"篓\",\n    \"箭\",\n    \"篇\",\n    \"僵\",\n    \"躺\",\n    \"僻\",\n    \"德\",\n    \"艘\",\n    \"膝\",\n    \"膛\",\n    \"鲤\",\n    \"鲫\",\n    \"熟\",\n    \"摩\",\n    \"褒\",\n    \"瘪\",\n    \"瘤\",\n    \"瘫\",\n    \"凛\",\n    \"颜\",\n    \"毅\",\n    \"糊\",\n    \"遵\",\n    \"憋\",\n    \"潜\",\n    \"澎\",\n    \"潮\",\n    \"潭\",\n    \"鲨\",\n    \"澳\",\n    \"潘\",\n    \"澈\",\n    \"澜\",\n    \"澄\",\n    \"懂\",\n    \"憔\",\n    \"懊\",\n    \"憎\",\n    \"额\",\n    \"翩\",\n    \"褥\",\n    \"谴\",\n    \"鹤\",\n    \"憨\",\n    
\"慰\",\n    \"劈\",\n    \"履\",\n    \"豫\",\n    \"缭\",\n    \"撼\",\n    \"擂\",\n    \"操\",\n    \"擅\",\n    \"燕\",\n    \"蕾\",\n    \"薯\",\n    \"薛\",\n    \"薇\",\n    \"擎\",\n    \"薪\",\n    \"薄\",\n    \"颠\",\n    \"翰\",\n    \"噩\",\n    \"橱\",\n    \"橙\",\n    \"橘\",\n    \"整\",\n    \"融\",\n    \"瓢\",\n    \"醒\",\n    \"霍\",\n    \"霎\",\n    \"辙\",\n    \"冀\",\n    \"餐\",\n    \"嘴\",\n    \"踱\",\n    \"蹄\",\n    \"蹂\",\n    \"蟆\",\n    \"螃\",\n    \"器\",\n    \"噪\",\n    \"鹦\",\n    \"赠\",\n    \"默\",\n    \"黔\",\n    \"镜\",\n    \"赞\",\n    \"穆\",\n    \"篮\",\n    \"篡\",\n    \"篷\",\n    \"篱\",\n    \"儒\",\n    \"邀\",\n    \"衡\",\n    \"膨\",\n    \"雕\",\n    \"鲸\",\n    \"磨\",\n    \"瘾\",\n    \"瘸\",\n    \"凝\",\n    \"辨\",\n    \"辩\",\n    \"糙\",\n    \"糖\",\n    \"糕\",\n    \"燃\",\n    \"濒\",\n    \"澡\",\n    \"激\",\n    \"懒\",\n    \"憾\",\n    \"懈\",\n    \"窿\",\n    \"壁\",\n    \"避\",\n    \"缰\",\n    \"缴\",\n    \"戴\",\n    \"擦\",\n    \"藉\",\n    \"鞠\",\n    \"藏\",\n    \"藐\",\n    \"檬\",\n    \"檐\",\n    \"檀\",\n    \"礁\",\n    \"磷\",\n    \"霜\",\n    \"霞\",\n    \"瞭\",\n    \"瞧\",\n    \"瞬\",\n    \"瞳\",\n    \"瞩\",\n    \"瞪\",\n    \"曙\",\n    \"蹋\",\n    \"蹈\",\n    \"螺\",\n    \"蟋\",\n    \"蟀\",\n    \"嚎\",\n    \"赡\",\n    \"穗\",\n    \"魏\",\n    \"簧\",\n    \"簇\",\n    \"繁\",\n    \"徽\",\n    \"爵\",\n    \"朦\",\n    \"臊\",\n    \"鳄\",\n    \"癌\",\n    \"辫\",\n    \"赢\",\n    \"糟\",\n    \"糠\",\n    \"燥\",\n    \"懦\",\n    \"豁\",\n    \"臀\",\n    \"臂\",\n    \"翼\",\n    \"骤\",\n    \"藕\",\n    \"鞭\",\n    \"藤\",\n    \"覆\",\n    \"瞻\",\n    \"蹦\",\n    \"嚣\",\n    \"镰\",\n    \"翻\",\n    \"鳍\",\n    \"鹰\",\n    \"瀑\",\n    \"襟\",\n    \"璧\",\n    \"戳\",\n    \"孽\",\n    \"警\",\n    \"蘑\",\n    \"藻\",\n    \"攀\",\n    \"曝\",\n    \"蹲\",\n    \"蹭\",\n    \"蹬\",\n    \"巅\",\n    \"簸\",\n    \"簿\",\n    \"蟹\",\n    \"颤\",\n    \"靡\",\n    \"癣\",\n    \"瓣\",\n    \"羹\",\n    \"鳖\",\n    \"爆\",\n    \"疆\",\n    \"鬓\",\n    \"壤\",\n    \"馨\",\n    
\"耀\",\n    \"躁\",\n    \"蠕\",\n    \"嚼\",\n    \"嚷\",\n    \"巍\",\n    \"籍\",\n    \"鳞\",\n    \"魔\",\n    \"糯\",\n    \"灌\",\n    \"譬\",\n    \"蠢\",\n    \"霸\",\n    \"露\",\n    \"霹\",\n    \"躏\",\n    \"黯\",\n    \"髓\",\n    \"赣\",\n    \"囊\",\n    \"镶\",\n    \"瓤\",\n    \"罐\",\n    \"矗\",\n    \"乂\",\n    \"乜\",\n    \"兀\",\n    \"弋\",\n    \"孑\",\n    \"孓\",\n    \"幺\",\n    \"亓\",\n    \"韦\",\n    \"廿\",\n    \"丏\",\n    \"卅\",\n    \"仄\",\n    \"厄\",\n    \"仃\",\n    \"仉\",\n    \"仂\",\n    \"兮\",\n    \"刈\",\n    \"爻\",\n    \"卞\",\n    \"闩\",\n    \"讣\",\n    \"尹\",\n    \"夬\",\n    \"爿\",\n    \"毋\",\n    \"邗\",\n    \"邛\",\n    \"艽\",\n    \"艿\",\n    \"札\",\n    \"叵\",\n    \"匝\",\n    \"丕\",\n    \"匜\",\n    \"劢\",\n    \"卟\",\n    \"叱\",\n    \"叻\",\n    \"仨\",\n    \"仕\",\n    \"仟\",\n    \"仡\",\n    \"仫\",\n    \"仞\",\n    \"卮\",\n    \"氐\",\n    \"犰\",\n    \"刍\",\n    \"邝\",\n    \"邙\",\n    \"汀\",\n    \"讦\",\n    \"讧\",\n    \"讪\",\n    \"讫\",\n    \"尻\",\n    \"阡\",\n    \"尕\",\n    \"弁\",\n    \"驭\",\n    \"匡\",\n    \"耒\",\n    \"玎\",\n    \"玑\",\n    \"邢\",\n    \"圩\",\n    \"圬\",\n    \"圭\",\n    \"扦\",\n    \"圪\",\n    \"圳\",\n    \"圹\",\n    \"扪\",\n    \"圮\",\n    \"圯\",\n    \"芊\",\n    \"芍\",\n    \"芄\",\n    \"芨\",\n    \"芑\",\n    \"芎\",\n    \"芗\",\n    \"亘\",\n    \"厍\",\n    \"夼\",\n    \"戍\",\n    \"尥\",\n    \"乩\",\n    \"旯\",\n    \"曳\",\n    \"岌\",\n    \"屺\",\n    \"凼\",\n    \"囡\",\n    \"钇\",\n    \"缶\",\n    \"氘\",\n    \"氖\",\n    \"牝\",\n    \"伎\",\n    \"伛\",\n    \"伢\",\n    \"佤\",\n    \"仵\",\n    \"伥\",\n    \"伧\",\n    \"伉\",\n    \"伫\",\n    \"囟\",\n    \"汆\",\n    \"刖\",\n    \"夙\",\n    \"旮\",\n    \"刎\",\n    \"犷\",\n    \"犸\",\n    \"舛\",\n    \"凫\",\n    \"邬\",\n    \"饧\",\n    \"汕\",\n    \"汔\",\n    \"汐\",\n    \"汲\",\n    \"汜\",\n    \"汊\",\n    \"忖\",\n    \"忏\",\n    \"讴\",\n    \"讵\",\n    \"祁\",\n    \"讷\",\n    \"聿\",\n    \"艮\",\n    \"厾\",\n    \"阱\",\n    \"阮\",\n    \"阪\",\n    \"丞\",\n    
\"妁\",\n    \"牟\",\n    \"纡\",\n    \"纣\",\n    \"纥\",\n    \"纨\",\n    \"玕\",\n    \"玙\",\n    \"抟\",\n    \"抔\",\n    \"圻\",\n    \"坂\",\n    \"坍\",\n    \"坞\",\n    \"抃\",\n    \"抉\",\n    \"㧐\",\n    \"芫\",\n    \"邯\",\n    \"芸\",\n    \"芾\",\n    \"苈\",\n    \"苣\",\n    \"芷\",\n    \"芮\",\n    \"苋\",\n    \"芼\",\n    \"苌\",\n    \"苁\",\n    \"芩\",\n    \"芪\",\n    \"芡\",\n    \"芟\",\n    \"苄\",\n    \"苎\",\n    \"苡\",\n    \"杌\",\n    \"杓\",\n    \"杞\",\n    \"杈\",\n    \"忑\",\n    \"孛\",\n    \"邴\",\n    \"邳\",\n    \"矶\",\n    \"奁\",\n    \"豕\",\n    \"忒\",\n    \"欤\",\n    \"轫\",\n    \"迓\",\n    \"邶\",\n    \"忐\",\n    \"卣\",\n    \"邺\",\n    \"旰\",\n    \"呋\",\n    \"呒\",\n    \"呓\",\n    \"呔\",\n    \"呖\",\n    \"呃\",\n    \"旸\",\n    \"吡\",\n    \"町\",\n    \"虬\",\n    \"呗\",\n    \"吽\",\n    \"吣\",\n    \"吲\",\n    \"帏\",\n    \"岐\",\n    \"岈\",\n    \"岘\",\n    \"岑\",\n    \"岚\",\n    \"兕\",\n    \"囵\",\n    \"囫\",\n    \"钊\",\n    \"钋\",\n    \"钌\",\n    \"迕\",\n    \"氙\",\n    \"氚\",\n    \"牤\",\n    \"佞\",\n    \"邱\",\n    \"攸\",\n    \"佚\",\n    \"佝\",\n    \"佟\",\n    \"佗\",\n    \"伽\",\n    \"彷\",\n    \"佘\",\n    \"佥\",\n    \"孚\",\n    \"豸\",\n    \"坌\",\n    \"肟\",\n    \"邸\",\n    \"奂\",\n    \"劬\",\n    \"狄\",\n    \"狁\",\n    \"鸠\",\n    \"邹\",\n    \"饨\",\n    \"饩\",\n    \"饪\",\n    \"饫\",\n    \"饬\",\n    \"亨\",\n    \"庑\",\n    \"庋\",\n    \"疔\",\n    \"疖\",\n    \"肓\",\n    \"闱\",\n    \"闳\",\n    \"闵\",\n    \"羌\",\n    \"炀\",\n    \"沣\",\n    \"沅\",\n    \"沔\",\n    \"沤\",\n    \"沌\",\n    \"沏\",\n    \"沚\",\n    \"汩\",\n    \"汨\",\n    \"沂\",\n    \"汾\",\n    \"沨\",\n    \"汴\",\n    \"汶\",\n    \"沆\",\n    \"沩\",\n    \"泐\",\n    \"怃\",\n    \"怄\",\n    \"忡\",\n    \"忤\",\n    \"忾\",\n    \"怅\",\n    \"忻\",\n    \"忪\",\n    \"怆\",\n    \"忭\",\n    \"忸\",\n    \"诂\",\n    \"诃\",\n    \"诅\",\n    \"诋\",\n    \"诌\",\n    \"诏\",\n    \"诒\",\n    \"孜\",\n    \"陇\",\n    \"陀\",\n    \"陂\",\n    \"陉\",\n    \"妍\",\n    \"妩\",\n    
\"妪\",\n    \"妣\",\n    \"妊\",\n    \"妗\",\n    \"妫\",\n    \"妞\",\n    \"姒\",\n    \"妤\",\n    \"邵\",\n    \"劭\",\n    \"刭\",\n    \"甬\",\n    \"邰\",\n    \"纭\",\n    \"纰\",\n    \"纴\",\n    \"纶\",\n    \"纾\",\n    \"玮\",\n    \"玡\",\n    \"玭\",\n    \"玠\",\n    \"玢\",\n    \"玥\",\n    \"玦\",\n    \"盂\",\n    \"忝\",\n    \"匦\",\n    \"坩\",\n    \"抨\",\n    \"拤\",\n    \"坫\",\n    \"拈\",\n    \"垆\",\n    \"抻\",\n    \"劼\",\n    \"拃\",\n    \"拊\",\n    \"坼\",\n    \"坻\",\n    \"㧟\",\n    \"坨\",\n    \"坭\",\n    \"抿\",\n    \"坳\",\n    \"耶\",\n    \"苷\",\n    \"苯\",\n    \"苤\",\n    \"茏\",\n    \"苫\",\n    \"苜\",\n    \"苴\",\n    \"苒\",\n    \"苘\",\n    \"茌\",\n    \"苻\",\n    \"苓\",\n    \"茚\",\n    \"茆\",\n    \"茑\",\n    \"茓\",\n    \"茔\",\n    \"茕\",\n    \"茀\",\n    \"苕\",\n    \"枥\",\n    \"枇\",\n    \"杪\",\n    \"杳\",\n    \"枧\",\n    \"杵\",\n    \"枨\",\n    \"枞\",\n    \"枋\",\n    \"杻\",\n    \"杷\",\n    \"杼\",\n    \"矸\",\n    \"砀\",\n    \"刳\",\n    \"奄\",\n    \"瓯\",\n    \"殁\",\n    \"郏\",\n    \"轭\",\n    \"郅\",\n    \"鸢\",\n    \"盱\",\n    \"昊\",\n    \"昙\",\n    \"杲\",\n    \"昃\",\n    \"咂\",\n    \"呸\",\n    \"昕\",\n    \"昀\",\n    \"旻\",\n    \"昉\",\n    \"炅\",\n    \"咔\",\n    \"畀\",\n    \"虮\",\n    \"咀\",\n    \"呷\",\n    \"黾\",\n    \"呱\",\n    \"呤\",\n    \"咚\",\n    \"咆\",\n    \"咛\",\n    \"呶\",\n    \"呣\",\n    \"呦\",\n    \"咝\",\n    \"岢\",\n    \"岿\",\n    \"岬\",\n    \"岫\",\n    \"帙\",\n    \"岣\",\n    \"峁\",\n    \"刿\",\n    \"迥\",\n    \"岷\",\n    \"剀\",\n    \"帔\",\n    \"峄\",\n    \"沓\",\n    \"囹\",\n    \"罔\",\n    \"钍\",\n    \"钎\",\n    \"钏\",\n    \"钒\",\n    \"钕\",\n    \"钗\",\n    \"邾\",\n    \"迮\",\n    \"牦\",\n    \"竺\",\n    \"迤\",\n    \"佶\",\n    \"佬\",\n    \"佰\",\n    \"侑\",\n    \"侉\",\n    \"臾\",\n    \"岱\",\n    \"侗\",\n    \"侃\",\n    \"侏\",\n    \"侩\",\n    \"佻\",\n    \"佾\",\n    \"侪\",\n    \"佼\",\n    \"佯\",\n    \"侬\",\n    \"帛\",\n    \"阜\",\n    \"侔\",\n    \"徂\",\n    \"刽\",\n    \"郄\",\n    \"怂\",\n    
\"籴\",\n    \"瓮\",\n    \"戗\",\n    \"肼\",\n    \"䏝\",\n    \"肽\",\n    \"肱\",\n    \"肫\",\n    \"剁\",\n    \"迩\",\n    \"郇\",\n    \"狙\",\n    \"狎\",\n    \"狍\",\n    \"狒\",\n    \"咎\",\n    \"炙\",\n    \"枭\",\n    \"饯\",\n    \"饴\",\n    \"冽\",\n    \"冼\",\n    \"庖\",\n    \"疠\",\n    \"疝\",\n    \"疡\",\n    \"兖\",\n    \"妾\",\n    \"劾\",\n    \"炜\",\n    \"𬉼\",\n    \"炖\",\n    \"炘\",\n    \"炝\",\n    \"炔\",\n    \"泔\",\n    \"沭\",\n    \"泷\",\n    \"泸\",\n    \"泱\",\n    \"泅\",\n    \"泗\",\n    \"泠\",\n    \"泺\",\n    \"泖\",\n    \"泫\",\n    \"泮\",\n    \"沱\",\n    \"泯\",\n    \"泓\",\n    \"泾\",\n    \"怙\",\n    \"怵\",\n    \"怦\",\n    \"怛\",\n    \"怏\",\n    \"怍\",\n    \"㤘\",\n    \"怩\",\n    \"怫\",\n    \"怿\",\n    \"宕\",\n    \"穹\",\n    \"宓\",\n    \"诓\",\n    \"诔\",\n    \"诖\",\n    \"诘\",\n    \"戾\",\n    \"诙\",\n    \"戽\",\n    \"郓\",\n    \"衩\",\n    \"祆\",\n    \"祎\",\n    \"祉\",\n    \"祇\",\n    \"诛\",\n    \"诜\",\n    \"诟\",\n    \"诠\",\n    \"诣\",\n    \"诤\",\n    \"诧\",\n    \"诨\",\n    \"诩\",\n    \"戕\",\n    \"孢\",\n    \"亟\",\n    \"陔\",\n    \"妲\",\n    \"妯\",\n    \"姗\",\n    \"帑\",\n    \"弩\",\n    \"孥\",\n    \"驽\",\n    \"虱\",\n    \"迦\",\n    \"迨\",\n    \"绀\",\n    \"绁\",\n    \"绂\",\n    \"驷\",\n    \"驸\",\n    \"绉\",\n    \"绌\",\n    \"驿\",\n    \"骀\",\n    \"甾\",\n    \"珏\",\n    \"珐\",\n    \"珂\",\n    \"珑\",\n    \"玳\",\n    \"珀\",\n    \"顸\",\n    \"珉\",\n    \"珈\",\n    \"拮\",\n    \"垭\",\n    \"挝\",\n    \"垣\",\n    \"挞\",\n    \"垤\",\n    \"赳\",\n    \"贲\",\n    \"垱\",\n    \"垌\",\n    \"郝\",\n    \"垧\",\n    \"垓\",\n    \"挦\",\n    \"垠\",\n    \"茜\",\n    \"荚\",\n    \"荑\",\n    \"贳\",\n    \"荜\",\n    \"莒\",\n    \"茼\",\n    \"茴\",\n    \"茱\",\n    \"莛\",\n    \"荞\",\n    \"茯\",\n    \"荏\",\n    \"荇\",\n    \"荃\",\n    \"荟\",\n    \"荀\",\n    \"茗\",\n    \"荠\",\n    \"茭\",\n    \"茨\",\n    \"垩\",\n    \"荥\",\n    \"荦\",\n    \"荨\",\n    \"荩\",\n    \"剋\",\n    \"荪\",\n    \"茹\",\n    \"荬\",\n    \"荮\",\n    \"柰\",\n    
\"栉\",\n    \"柯\",\n    \"柘\",\n    \"栊\",\n    \"柩\",\n    \"枰\",\n    \"栌\",\n    \"柙\",\n    \"枵\",\n    \"柚\",\n    \"枳\",\n    \"柞\",\n    \"柝\",\n    \"栀\",\n    \"柢\",\n    \"栎\",\n    \"枸\",\n    \"柈\",\n    \"柁\",\n    \"枷\",\n    \"柽\",\n    \"剌\",\n    \"酊\",\n    \"郦\",\n    \"甭\",\n    \"砗\",\n    \"砘\",\n    \"砒\",\n    \"斫\",\n    \"砭\",\n    \"砜\",\n    \"奎\",\n    \"耷\",\n    \"虺\",\n    \"殂\",\n    \"殇\",\n    \"殄\",\n    \"殆\",\n    \"轱\",\n    \"轲\",\n    \"轳\",\n    \"轶\",\n    \"轸\",\n    \"虿\",\n    \"毖\",\n    \"觇\",\n    \"尜\",\n    \"哐\",\n    \"眄\",\n    \"眍\",\n    \"𠳐\",\n    \"郢\",\n    \"眇\",\n    \"眊\",\n    \"眈\",\n    \"禺\",\n    \"哂\",\n    \"咴\",\n    \"曷\",\n    \"昴\",\n    \"昱\",\n    \"昵\",\n    \"咦\",\n    \"哓\",\n    \"哔\",\n    \"畎\",\n    \"毗\",\n    \"呲\",\n    \"胄\",\n    \"畋\",\n    \"畈\",\n    \"虼\",\n    \"虻\",\n    \"盅\",\n    \"咣\",\n    \"哕\",\n    \"剐\",\n    \"郧\",\n    \"咻\",\n    \"囿\",\n    \"咿\",\n    \"哌\",\n    \"哙\",\n    \"哚\",\n    \"咯\",\n    \"咩\",\n    \"咤\",\n    \"哝\",\n    \"哏\",\n    \"哞\",\n    \"峙\",\n    \"峣\",\n    \"罘\",\n    \"帧\",\n    \"峒\",\n    \"峤\",\n    \"峋\",\n    \"峥\",\n    \"贶\",\n    \"钚\",\n    \"钛\",\n    \"钡\",\n    \"钣\",\n    \"钤\",\n    \"钨\",\n    \"钫\",\n    \"钯\",\n    \"氡\",\n    \"氟\",\n    \"牯\",\n    \"郜\",\n    \"秕\",\n    \"秭\",\n    \"竽\",\n    \"笈\",\n    \"笃\",\n    \"俦\",\n    \"俨\",\n    \"俅\",\n    \"俪\",\n    \"叟\",\n    \"垡\",\n    \"牮\",\n    \"俣\",\n    \"俚\",\n    \"皈\",\n    \"俑\",\n    \"俟\",\n    \"逅\",\n    \"徇\",\n    \"徉\",\n    \"舢\",\n    \"俞\",\n    \"郗\",\n    \"俎\",\n    \"郤\",\n    \"爰\",\n    \"郛\",\n    \"瓴\",\n    \"胨\",\n    \"胪\",\n    \"胛\",\n    \"胂\",\n    \"胙\",\n    \"胍\",\n    \"胗\",\n    \"胝\",\n    \"朐\",\n    \"胫\",\n    \"鸨\",\n    \"匍\",\n    \"狨\",\n    \"狯\",\n    \"飑\",\n    \"狩\",\n    \"狲\",\n    \"訇\",\n    \"逄\",\n    \"昝\",\n    \"饷\",\n    \"饸\",\n    \"饹\",\n    \"胤\",\n    \"孪\",\n    \"娈\",\n    \"弈\",\n    
\"奕\",\n    \"庥\",\n    \"疬\",\n    \"疣\",\n    \"疥\",\n    \"疭\",\n    \"庠\",\n    \"竑\",\n    \"彦\",\n    \"飒\",\n    \"闼\",\n    \"闾\",\n    \"闿\",\n    \"阂\",\n    \"羑\",\n    \"迸\",\n    \"籼\",\n    \"酋\",\n    \"炳\",\n    \"炻\",\n    \"炽\",\n    \"炯\",\n    \"烀\",\n    \"炷\",\n    \"烃\",\n    \"洱\",\n    \"洹\",\n    \"洧\",\n    \"洌\",\n    \"浃\",\n    \"洇\",\n    \"洄\",\n    \"洙\",\n    \"涎\",\n    \"洎\",\n    \"洫\",\n    \"浍\",\n    \"洮\",\n    \"洵\",\n    \"浒\",\n    \"浔\",\n    \"浕\",\n    \"洳\",\n    \"恸\",\n    \"恓\",\n    \"恹\",\n    \"恫\",\n    \"恺\",\n    \"恻\",\n    \"恂\",\n    \"恪\",\n    \"恽\",\n    \"宥\",\n    \"扃\",\n    \"衲\",\n    \"衽\",\n    \"衿\",\n    \"袂\",\n    \"祛\",\n    \"祜\",\n    \"祓\",\n    \"祚\",\n    \"诮\",\n    \"祗\",\n    \"祢\",\n    \"诰\",\n    \"诳\",\n    \"鸩\",\n    \"昶\",\n    \"郡\",\n    \"咫\",\n    \"弭\",\n    \"牁\",\n    \"胥\",\n    \"陛\",\n    \"陟\",\n    \"娅\",\n    \"姮\",\n    \"娆\",\n    \"姝\",\n    \"姣\",\n    \"姘\",\n    \"姹\",\n    \"怼\",\n    \"羿\",\n    \"炱\",\n    \"矜\",\n    \"绔\",\n    \"骁\",\n    \"骅\",\n    \"绗\",\n    \"绛\",\n    \"骈\",\n    \"耖\",\n    \"挈\",\n    \"珥\",\n    \"珙\",\n    \"顼\",\n    \"珰\",\n    \"珩\",\n    \"珧\",\n    \"珣\",\n    \"珞\",\n    \"琤\",\n    \"珲\",\n    \"敖\",\n    \"恚\",\n    \"埔\",\n    \"埕\",\n    \"埘\",\n    \"埙\",\n    \"埚\",\n    \"挹\",\n    \"耆\",\n    \"耄\",\n    \"埒\",\n    \"捋\",\n    \"贽\",\n    \"垸\",\n    \"捃\",\n    \"盍\",\n    \"荸\",\n    \"莆\",\n    \"莳\",\n    \"莴\",\n    \"莪\",\n    \"莠\",\n    \"莓\",\n    \"莜\",\n    \"莅\",\n    \"荼\",\n    \"莩\",\n    \"荽\",\n    \"莸\",\n    \"荻\",\n    \"莘\",\n    \"莎\",\n    \"莞\",\n    \"莨\",\n    \"渇\",\n    \"鸪\",\n    \"莼\",\n    \"栲\",\n    \"栳\",\n    \"郴\",\n    \"桓\",\n    \"桡\",\n    \"桎\",\n    \"桢\",\n    \"桤\",\n    \"梃\",\n    \"栝\",\n    \"桕\",\n    \"桁\",\n    \"桧\",\n    \"桅\",\n    \"栟\",\n    \"桉\",\n    \"栩\",\n    \"逑\",\n    \"逋\",\n    \"彧\",\n    \"鬲\",\n    \"豇\",\n    \"酐\",\n    \"逦\",\n    
\"厝\",\n    \"孬\",\n    \"砝\",\n    \"砹\",\n    \"砺\",\n    \"砧\",\n    \"砷\",\n    \"砟\",\n    \"砼\",\n    \"砥\",\n    \"砣\",\n    \"剞\",\n    \"砻\",\n    \"轼\",\n    \"轾\",\n    \"辂\",\n    \"鸫\",\n    \"趸\",\n    \"龀\",\n    \"鸬\",\n    \"虔\",\n    \"逍\",\n    \"眬\",\n    \"唛\",\n    \"晟\",\n    \"眩\",\n    \"眙\",\n    \"哧\",\n    \"哽\",\n    \"唔\",\n    \"晁\",\n    \"晏\",\n    \"鸮\",\n    \"趵\",\n    \"趿\",\n    \"畛\",\n    \"蚨\",\n    \"蚜\",\n    \"蚍\",\n    \"蚋\",\n    \"蚬\",\n    \"蚝\",\n    \"蚧\",\n    \"唢\",\n    \"圄\",\n    \"唣\",\n    \"唏\",\n    \"盎\",\n    \"唑\",\n    \"崂\",\n    \"崃\",\n    \"罡\",\n    \"罟\",\n    \"峪\",\n    \"觊\",\n    \"赅\",\n    \"钰\",\n    \"钲\",\n    \"钴\",\n    \"钵\",\n    \"钹\",\n    \"钺\",\n    \"钽\",\n    \"钼\",\n    \"钿\",\n    \"铀\",\n    \"铂\",\n    \"铄\",\n    \"铆\",\n    \"铈\",\n    \"铉\",\n    \"铊\",\n    \"铋\",\n    \"铌\",\n    \"铍\",\n    \"䥽\",\n    \"铎\",\n    \"氩\",\n    \"氤\",\n    \"氦\",\n    \"毪\",\n    \"舐\",\n    \"秣\",\n    \"秫\",\n    \"盉\",\n    \"笄\",\n    \"笕\",\n    \"笊\",\n    \"笏\",\n    \"笆\",\n    \"俸\",\n    \"倩\",\n    \"俵\",\n    \"偌\",\n    \"俳\",\n    \"俶\",\n    \"倬\",\n    \"倏\",\n    \"恁\",\n    \"倭\",\n    \"倪\",\n    \"俾\",\n    \"倜\",\n    \"隼\",\n    \"隽\",\n    \"倌\",\n    \"倥\",\n    \"臬\",\n    \"皋\",\n    \"郫\",\n    \"倨\",\n    \"衄\",\n    \"颀\",\n    \"徕\",\n    \"舫\",\n    \"釜\",\n    \"奚\",\n    \"衾\",\n    \"胯\",\n    \"胱\",\n    \"胴\",\n    \"胭\",\n    \"脍\",\n    \"胼\",\n    \"朕\",\n    \"脒\",\n    \"胺\",\n    \"鸱\",\n    \"玺\",\n    \"鸲\",\n    \"狷\",\n    \"猁\",\n    \"狳\",\n    \"猃\",\n    \"狺\",\n    \"逖\",\n    \"桀\",\n    \"袅\",\n    \"饽\",\n    \"凇\",\n    \"栾\",\n    \"挛\",\n    \"亳\",\n    \"疳\",\n    \"疴\",\n    \"疸\",\n    \"疽\",\n    \"痈\",\n    \"疱\",\n    \"痂\",\n    \"痉\",\n    \"衮\",\n    \"凋\",\n    \"颃\",\n    \"恣\",\n    \"旆\",\n    \"旄\",\n    \"旃\",\n    \"阃\",\n    \"阄\",\n    \"訚\",\n    \"阆\",\n    \"恙\",\n    \"粑\",\n    \"朔\",\n    \"郸\",\n    
\"烜\",\n    \"烨\",\n    \"烩\",\n    \"烊\",\n    \"剡\",\n    \"郯\",\n    \"烬\",\n    \"涑\",\n    \"浯\",\n    \"涞\",\n    \"涟\",\n    \"娑\",\n    \"涅\",\n    \"涠\",\n    \"浞\",\n    \"涓\",\n    \"浥\",\n    \"涔\",\n    \"浜\",\n    \"浠\",\n    \"浣\",\n    \"浚\",\n    \"悚\",\n    \"悭\",\n    \"悝\",\n    \"悒\",\n    \"悌\",\n    \"悛\",\n    \"宸\",\n    \"窈\",\n    \"剜\",\n    \"诹\",\n    \"冢\",\n    \"诼\",\n    \"袒\",\n    \"袢\",\n    \"祯\",\n    \"诿\",\n    \"谀\",\n    \"谂\",\n    \"谄\",\n    \"谇\",\n    \"屐\",\n    \"屙\",\n    \"陬\",\n    \"勐\",\n    \"奘\",\n    \"牂\",\n    \"蚩\",\n    \"陲\",\n    \"姬\",\n    \"娠\",\n    \"娌\",\n    \"娉\",\n    \"娲\",\n    \"娩\",\n    \"娴\",\n    \"娣\",\n    \"娓\",\n    \"婀\",\n    \"畚\",\n    \"逡\",\n    \"绠\",\n    \"骊\",\n    \"绡\",\n    \"骋\",\n    \"绥\",\n    \"绦\",\n    \"绨\",\n    \"骎\",\n    \"邕\",\n    \"鸶\",\n    \"彗\",\n    \"耜\",\n    \"焘\",\n    \"舂\",\n    \"琏\",\n    \"琇\",\n    \"麸\",\n    \"揶\",\n    \"埴\",\n    \"埯\",\n    \"捯\",\n    \"掳\",\n    \"掴\",\n    \"埸\",\n    \"埵\",\n    \"赧\",\n    \"埤\",\n    \"捭\",\n    \"逵\",\n    \"埝\",\n    \"堋\",\n    \"堍\",\n    \"掬\",\n    \"鸷\",\n    \"掖\",\n    \"捽\",\n    \"掊\",\n    \"堉\",\n    \"掸\",\n    \"捩\",\n    \"掮\",\n    \"悫\",\n    \"埭\",\n    \"埽\",\n    \"掇\",\n    \"掼\",\n    \"聃\",\n    \"菁\",\n    \"萁\",\n    \"菘\",\n    \"堇\",\n    \"萘\",\n    \"萋\",\n    \"菽\",\n    \"菖\",\n    \"萜\",\n    \"萸\",\n    \"萑\",\n    \"棻\",\n    \"菔\",\n    \"菟\",\n    \"萏\",\n    \"萃\",\n    \"菏\",\n    \"菹\",\n    \"菪\",\n    \"菅\",\n    \"菀\",\n    \"萦\",\n    \"菰\",\n    \"菡\",\n    \"梵\",\n    \"梿\",\n    \"梏\",\n    \"觋\",\n    \"桴\",\n    \"桷\",\n    \"梓\",\n    \"棁\",\n    \"桫\",\n    \"棂\",\n    \"啬\",\n    \"郾\",\n    \"匮\",\n    \"敕\",\n    \"豉\",\n    \"鄄\",\n    \"酞\",\n    \"酚\",\n    \"戛\",\n    \"硎\",\n    \"硭\",\n    \"硒\",\n    \"硖\",\n    \"硗\",\n    \"硐\",\n    \"硇\",\n    \"硌\",\n    \"鸸\",\n    \"瓠\",\n    \"匏\",\n    \"厩\",\n    \"龚\",\n    \"殒\",\n    
\"殓\",\n    \"殍\",\n    \"赉\",\n    \"雩\",\n    \"辄\",\n    \"堑\",\n    \"眭\",\n    \"眦\",\n    \"啧\",\n    \"晡\",\n    \"晤\",\n    \"眺\",\n    \"眵\",\n    \"眸\",\n    \"圊\",\n    \"喏\",\n    \"喵\",\n    \"啉\",\n    \"勖\",\n    \"晞\",\n    \"唵\",\n    \"晗\",\n    \"冕\",\n    \"啭\",\n    \"畦\",\n    \"趺\",\n    \"啮\",\n    \"跄\",\n    \"蚶\",\n    \"蛄\",\n    \"蛎\",\n    \"蛆\",\n    \"蚰\",\n    \"蛊\",\n    \"圉\",\n    \"蚱\",\n    \"蛉\",\n    \"蛏\",\n    \"蚴\",\n    \"啁\",\n    \"啕\",\n    \"唿\",\n    \"啐\",\n    \"唼\",\n    \"唷\",\n    \"啖\",\n    \"啵\",\n    \"啶\",\n    \"啷\",\n    \"唳\",\n    \"唰\",\n    \"啜\",\n    \"帻\",\n    \"崚\",\n    \"崦\",\n    \"帼\",\n    \"崮\",\n    \"崤\",\n    \"崆\",\n    \"赇\",\n    \"赈\",\n    \"赊\",\n    \"铑\",\n    \"铒\",\n    \"铗\",\n    \"铙\",\n    \"铟\",\n    \"铠\",\n    \"铡\",\n    \"铢\",\n    \"铣\",\n    \"铤\",\n    \"铧\",\n    \"铨\",\n    \"铩\",\n    \"铪\",\n    \"铫\",\n    \"铬\",\n    \"铮\",\n    \"铯\",\n    \"铰\",\n    \"铱\",\n    \"铳\",\n    \"铵\",\n    \"铷\",\n    \"氪\",\n    \"牾\",\n    \"鸹\",\n    \"秾\",\n    \"逶\",\n    \"笺\",\n    \"筇\",\n    \"笸\",\n    \"笪\",\n    \"笮\",\n    \"笠\",\n    \"笥\",\n    \"笤\",\n    \"笳\",\n    \"笾\",\n    \"笞\",\n    \"偾\",\n    \"偃\",\n    \"偕\",\n    \"偈\",\n    \"傀\",\n    \"偬\",\n    \"偻\",\n    \"皑\",\n    \"皎\",\n    \"鸻\",\n    \"徜\",\n    \"舸\",\n    \"舻\",\n    \"舴\",\n    \"舷\",\n    \"龛\",\n    \"翎\",\n    \"脬\",\n    \"脘\",\n    \"脲\",\n    \"匐\",\n    \"猗\",\n    \"猡\",\n    \"猞\",\n    \"猝\",\n    \"斛\",\n    \"猕\",\n    \"馗\",\n    \"馃\",\n    \"馄\",\n    \"鸾\",\n    \"孰\",\n    \"庹\",\n    \"庾\",\n    \"痔\",\n    \"痍\",\n    \"疵\",\n    \"翊\",\n    \"旌\",\n    \"旎\",\n    \"袤\",\n    \"阇\",\n    \"阈\",\n    \"阉\",\n    \"阊\",\n    \"阋\",\n    \"阍\",\n    \"阏\",\n    \"羟\",\n    \"粝\",\n    \"粕\",\n    \"敝\",\n    \"焐\",\n    \"烯\",\n    \"焓\",\n    \"烽\",\n    \"焖\",\n    \"烷\",\n    \"焗\",\n    \"渍\",\n    \"渚\",\n    \"淇\",\n    \"淅\",\n    \"淞\",\n    \"渎\",\n    
\"涿\",\n    \"淖\",\n    \"挲\",\n    \"淠\",\n    \"涸\",\n    \"渑\",\n    \"淦\",\n    \"淝\",\n    \"淬\",\n    \"涪\",\n    \"淙\",\n    \"涫\",\n    \"渌\",\n    \"淄\",\n    \"惬\",\n    \"悻\",\n    \"悱\",\n    \"惝\",\n    \"惘\",\n    \"悸\",\n    \"惆\",\n    \"惚\",\n    \"惇\",\n    \"惮\",\n    \"窕\",\n    \"谌\",\n    \"谏\",\n    \"扈\",\n    \"皲\",\n    \"谑\",\n    \"裆\",\n    \"袷\",\n    \"裉\",\n    \"谒\",\n    \"谔\",\n    \"谕\",\n    \"谖\",\n    \"谗\",\n    \"谙\",\n    \"谛\",\n    \"谝\",\n    \"逯\",\n    \"郿\",\n    \"隈\",\n    \"粜\",\n    \"隍\",\n    \"隗\",\n    \"婧\",\n    \"婊\",\n    \"婕\",\n    \"娼\",\n    \"婢\",\n    \"婵\",\n    \"胬\",\n    \"袈\",\n    \"翌\",\n    \"恿\",\n    \"欸\",\n    \"绫\",\n    \"骐\",\n    \"绮\",\n    \"绯\",\n    \"绱\",\n    \"骒\",\n    \"绲\",\n    \"骓\",\n    \"绶\",\n    \"绺\",\n    \"绻\",\n    \"绾\",\n    \"骖\",\n    \"缁\",\n    \"耠\",\n    \"琫\",\n    \"琵\",\n    \"琶\",\n    \"琪\",\n    \"瑛\",\n    \"琦\",\n    \"琥\",\n    \"琨\",\n    \"靓\",\n    \"琰\",\n    \"琮\",\n    \"琯\",\n    \"琬\",\n    \"琛\",\n    \"琚\",\n    \"辇\",\n    \"鼋\",\n    \"揳\",\n    \"堞\",\n    \"搽\",\n    \"揸\",\n    \"揠\",\n    \"堙\",\n    \"趄\",\n    \"揖\",\n    \"颉\",\n    \"塄\",\n    \"揿\",\n    \"耋\",\n    \"揄\",\n    \"蛩\",\n    \"蛰\",\n    \"塆\",\n    \"摒\",\n    \"揆\",\n    \"掾\",\n    \"聒\",\n    \"葑\",\n    \"葚\",\n    \"靰\",\n    \"靸\",\n    \"葳\",\n    \"葺\",\n    \"葸\",\n    \"萼\",\n    \"葆\",\n    \"葩\",\n    \"葶\",\n    \"蒌\",\n    \"萱\",\n    \"戟\",\n    \"葭\",\n    \"楮\",\n    \"棼\",\n    \"椟\",\n    \"棹\",\n    \"椤\",\n    \"棰\",\n    \"赍\",\n    \"椋\",\n    \"椁\",\n    \"椪\",\n    \"棣\",\n    \"椐\",\n    \"鹁\",\n    \"覃\",\n    \"酤\",\n    \"酢\",\n    \"酡\",\n    \"鹂\",\n    \"厥\",\n    \"殚\",\n    \"殛\",\n    \"雯\",\n    \"雱\",\n    \"辊\",\n    \"辋\",\n    \"椠\",\n    \"辍\",\n    \"辎\",\n    \"斐\",\n    \"睄\",\n    \"睑\",\n    \"睇\",\n    \"睃\",\n    \"戢\",\n    \"喋\",\n    \"嗒\",\n    \"喃\",\n    \"喱\",\n    \"喹\",\n    \"晷\",\n    \"喈\",\n    
\"跖\",\n    \"跗\",\n    \"跞\",\n    \"跚\",\n    \"跎\",\n    \"跏\",\n    \"跆\",\n    \"蛱\",\n    \"蛲\",\n    \"蛭\",\n    \"蛳\",\n    \"蛐\",\n    \"蛔\",\n    \"蛞\",\n    \"蛴\",\n    \"蛟\",\n    \"蛘\",\n    \"喁\",\n    \"喟\",\n    \"啾\",\n    \"嗖\",\n    \"喑\",\n    \"嗟\",\n    \"喽\",\n    \"嗞\",\n    \"喀\",\n    \"喔\",\n    \"喙\",\n    \"嵘\",\n    \"嵖\",\n    \"崴\",\n    \"遄\",\n    \"詈\",\n    \"嵎\",\n    \"崽\",\n    \"嵬\",\n    \"嵛\",\n    \"嵯\",\n    \"嵝\",\n    \"嵫\",\n    \"幄\",\n    \"嵋\",\n    \"赕\",\n    \"铻\",\n    \"铼\",\n    \"铿\",\n    \"锃\",\n    \"锂\",\n    \"锆\",\n    \"锇\",\n    \"锉\",\n    \"锏\",\n    \"锑\",\n    \"锒\",\n    \"锔\",\n    \"锕\",\n    \"掣\",\n    \"矬\",\n    \"氰\",\n    \"毳\",\n    \"毽\",\n    \"犊\",\n    \"犄\",\n    \"犋\",\n    \"鹄\",\n    \"犍\",\n    \"嵇\",\n    \"黍\",\n    \"稃\",\n    \"稂\",\n    \"筚\",\n    \"筵\",\n    \"筌\",\n    \"傣\",\n    \"傈\",\n    \"舄\",\n    \"牍\",\n    \"傥\",\n    \"傧\",\n    \"遑\",\n    \"傩\",\n    \"遁\",\n    \"徨\",\n    \"媭\",\n    \"畲\",\n    \"弑\",\n    \"颌\",\n    \"翕\",\n    \"釉\",\n    \"鹆\",\n    \"舜\",\n    \"貂\",\n    \"腈\",\n    \"腌\",\n    \"腓\",\n    \"腆\",\n    \"腴\",\n    \"腑\",\n    \"腚\",\n    \"腱\",\n    \"鱿\",\n    \"鲀\",\n    \"鲂\",\n    \"颍\",\n    \"猢\",\n    \"猹\",\n    \"猥\",\n    \"飓\",\n    \"觞\",\n    \"觚\",\n    \"猱\",\n    \"颎\",\n    \"飧\",\n    \"馇\",\n    \"馊\",\n    \"亵\",\n    \"脔\",\n    \"裒\",\n    \"痣\",\n    \"痨\",\n    \"痦\",\n    \"痞\",\n    \"痤\",\n    \"痫\",\n    \"痧\",\n    \"赓\",\n    \"竦\",\n    \"瓿\",\n    \"啻\",\n    \"颏\",\n    \"鹇\",\n    \"阑\",\n    \"阒\",\n    \"阕\",\n    \"粞\",\n    \"遒\",\n    \"孳\",\n    \"焯\",\n    \"焜\",\n    \"焙\",\n    \"焱\",\n    \"鹈\",\n    \"湛\",\n    \"渫\",\n    \"湮\",\n    \"湎\",\n    \"湜\",\n    \"渭\",\n    \"湍\",\n    \"湫\",\n    \"溲\",\n    \"湟\",\n    \"溆\",\n    \"湲\",\n    \"湔\",\n    \"湉\",\n    \"渥\",\n    \"湄\",\n    \"滁\",\n    \"愠\",\n    \"惺\",\n    \"愦\",\n    \"惴\",\n    \"愀\",\n    \"愎\",\n    \"愔\",\n    
\"喾\",\n    \"寐\",\n    \"谟\",\n    \"扉\",\n    \"裢\",\n    \"裎\",\n    \"裥\",\n    \"祾\",\n    \"祺\",\n    \"谠\",\n    \"幂\",\n    \"谡\",\n    \"谥\",\n    \"谧\",\n    \"遐\",\n    \"孱\",\n    \"弼\",\n    \"巽\",\n    \"骘\",\n    \"媪\",\n    \"媛\",\n    \"婷\",\n    \"巯\",\n    \"翚\",\n    \"皴\",\n    \"婺\",\n    \"骛\",\n    \"缂\",\n    \"缃\",\n    \"缄\",\n    \"彘\",\n    \"缇\",\n    \"缈\",\n    \"缌\",\n    \"缑\",\n    \"缒\",\n    \"缗\",\n    \"飨\",\n    \"耢\",\n    \"瑚\",\n    \"瑁\",\n    \"瑜\",\n    \"瑗\",\n    \"瑄\",\n    \"瑕\",\n    \"遨\",\n    \"骜\",\n    \"韫\",\n    \"髡\",\n    \"塬\",\n    \"鄢\",\n    \"趔\",\n    \"趑\",\n    \"摅\",\n    \"摁\",\n    \"蜇\",\n    \"搋\",\n    \"搪\",\n    \"搐\",\n    \"搛\",\n    \"搠\",\n    \"摈\",\n    \"彀\",\n    \"毂\",\n    \"搦\",\n    \"搡\",\n    \"蓁\",\n    \"戡\",\n    \"蓍\",\n    \"鄞\",\n    \"靳\",\n    \"蓐\",\n    \"蓦\",\n    \"鹋\",\n    \"蒽\",\n    \"蓓\",\n    \"蓖\",\n    \"蓊\",\n    \"蒯\",\n    \"蓟\",\n    \"蓑\",\n    \"蒿\",\n    \"蒺\",\n    \"蓠\",\n    \"蒟\",\n    \"蒡\",\n    \"蒹\",\n    \"蒴\",\n    \"蒗\",\n    \"蓥\",\n    \"颐\",\n    \"楔\",\n    \"楠\",\n    \"楂\",\n    \"楝\",\n    \"楫\",\n    \"楸\",\n    \"椴\",\n    \"槌\",\n    \"楯\",\n    \"皙\",\n    \"榈\",\n    \"槎\",\n    \"榉\",\n    \"楦\",\n    \"楣\",\n    \"楹\",\n    \"椽\",\n    \"裘\",\n    \"剽\",\n    \"甄\",\n    \"酮\",\n    \"酰\",\n    \"酯\",\n    \"酩\",\n    \"蜃\",\n    \"碛\",\n    \"碓\",\n    \"硼\",\n    \"碉\",\n    \"碚\",\n    \"碇\",\n    \"碜\",\n    \"鹌\",\n    \"辏\",\n    \"龃\",\n    \"龅\",\n    \"訾\",\n    \"粲\",\n    \"虞\",\n    \"睚\",\n    \"嗪\",\n    \"韪\",\n    \"嗷\",\n    \"嗉\",\n    \"睨\",\n    \"睢\",\n    \"雎\",\n    \"睥\",\n    \"嘟\",\n    \"嗑\",\n    \"嗫\",\n    \"嗬\",\n    \"嗔\",\n    \"嗝\",\n    \"戥\",\n    \"嗄\",\n    \"煦\",\n    \"暄\",\n    \"遢\",\n    \"暌\",\n    \"跬\",\n    \"跶\",\n    \"跸\",\n    \"跐\",\n    \"跣\",\n    \"跹\",\n    \"跻\",\n    \"蛸\",\n    \"蜊\",\n    \"蜍\",\n    \"蜉\",\n    \"蜣\",\n    \"畹\",\n    \"蛹\",\n    \"嗣\",\n    
\"嗯\",\n    \"嗥\",\n    \"嗲\",\n    \"嗳\",\n    \"嗌\",\n    \"嗍\",\n    \"嗨\",\n    \"嗐\",\n    \"嗤\",\n    \"嗵\",\n    \"罨\",\n    \"嵊\",\n    \"嵩\",\n    \"嵴\",\n    \"骰\",\n    \"锗\",\n    \"锛\",\n    \"锜\",\n    \"锝\",\n    \"锞\",\n    \"锟\",\n    \"锢\",\n    \"锨\",\n    \"锩\",\n    \"锭\",\n    \"锱\",\n    \"雉\",\n    \"氲\",\n    \"犏\",\n    \"歃\",\n    \"稞\",\n    \"稗\",\n    \"稔\",\n    \"筠\",\n    \"筢\",\n    \"筮\",\n    \"筲\",\n    \"筱\",\n    \"牒\",\n    \"煲\",\n    \"敫\",\n    \"徭\",\n    \"愆\",\n    \"艄\",\n    \"觎\",\n    \"毹\",\n    \"貊\",\n    \"貅\",\n    \"貉\",\n    \"颔\",\n    \"腠\",\n    \"腩\",\n    \"腼\",\n    \"腭\",\n    \"腧\",\n    \"塍\",\n    \"媵\",\n    \"詹\",\n    \"鲅\",\n    \"鲆\",\n    \"鲇\",\n    \"鲈\",\n    \"稣\",\n    \"鲋\",\n    \"鲐\",\n    \"肄\",\n    \"鹐\",\n    \"飕\",\n    \"觥\",\n    \"遛\",\n    \"馐\",\n    \"鹑\",\n    \"亶\",\n    \"瘃\",\n    \"痱\",\n    \"痼\",\n    \"痿\",\n    \"瘐\",\n    \"瘁\",\n    \"瘆\",\n    \"麂\",\n    \"裔\",\n    \"歆\",\n    \"旒\",\n    \"雍\",\n    \"阖\",\n    \"阗\",\n    \"阙\",\n    \"羧\",\n    \"豢\",\n    \"粳\",\n    \"猷\",\n    \"煳\",\n    \"煜\",\n    \"煨\",\n    \"煅\",\n    \"煊\",\n    \"煸\",\n    \"煺\",\n    \"滟\",\n    \"溱\",\n    \"溘\",\n    \"漭\",\n    \"滢\",\n    \"溥\",\n    \"溧\",\n    \"溽\",\n    \"裟\",\n    \"溻\",\n    \"溷\",\n    \"滗\",\n    \"滫\",\n    \"溴\",\n    \"滏\",\n    \"滃\",\n    \"滦\",\n    \"溏\",\n    \"滂\",\n    \"滓\",\n    \"溟\",\n    \"滪\",\n    \"愫\",\n    \"慑\",\n    \"慊\",\n    \"鲎\",\n    \"骞\",\n    \"窦\",\n    \"窠\",\n    \"窣\",\n    \"裱\",\n    \"褚\",\n    \"裨\",\n    \"裾\",\n    \"裰\",\n    \"禊\",\n    \"谩\",\n    \"谪\",\n    \"媾\",\n    \"嫫\",\n    \"媲\",\n    \"嫒\",\n    \"嫔\",\n    \"媸\",\n    \"缙\",\n    \"缜\",\n    \"缛\",\n    \"辔\",\n    \"骝\",\n    \"缟\",\n    \"缡\",\n    \"缢\",\n    \"缣\",\n    \"骟\",\n    \"耥\",\n    \"璈\",\n    \"瑶\",\n    \"瑭\",\n    \"獒\",\n    \"觏\",\n    \"慝\",\n    \"嫠\",\n    \"韬\",\n    \"叆\",\n    \"髦\",\n    \"摽\",\n    \"墁\",\n    
\"撂\",\n    \"摞\",\n    \"撄\",\n    \"翥\",\n    \"踅\",\n    \"摭\",\n    \"墉\",\n    \"墒\",\n    \"榖\",\n    \"綦\",\n    \"蔫\",\n    \"蔷\",\n    \"靺\",\n    \"靼\",\n    \"鞅\",\n    \"靿\",\n    \"甍\",\n    \"蔸\",\n    \"蔟\",\n    \"蔺\",\n    \"戬\",\n    \"蕖\",\n    \"蔻\",\n    \"蓿\",\n    \"斡\",\n    \"鹕\",\n    \"蓼\",\n    \"榛\",\n    \"榧\",\n    \"榻\",\n    \"榫\",\n    \"榭\",\n    \"槔\",\n    \"榱\",\n    \"槁\",\n    \"槟\",\n    \"槠\",\n    \"榷\",\n    \"僰\",\n    \"酽\",\n    \"酶\",\n    \"酹\",\n    \"厮\",\n    \"碡\",\n    \"碴\",\n    \"碣\",\n    \"碲\",\n    \"磋\",\n    \"臧\",\n    \"豨\",\n    \"殡\",\n    \"霆\",\n    \"霁\",\n    \"辕\",\n    \"蜚\",\n    \"裴\",\n    \"翡\",\n    \"龇\",\n    \"龈\",\n    \"睿\",\n    \"䁖\",\n    \"睽\",\n    \"嘞\",\n    \"嘈\",\n    \"嘌\",\n    \"嘁\",\n    \"嘎\",\n    \"暧\",\n    \"暝\",\n    \"踌\",\n    \"踉\",\n    \"蜞\",\n    \"蜥\",\n    \"蜮\",\n    \"蝈\",\n    \"蜴\",\n    \"蜱\",\n    \"蜩\",\n    \"蜷\",\n    \"蜿\",\n    \"螂\",\n    \"蜢\",\n    \"嘘\",\n    \"嘡\",\n    \"鹗\",\n    \"嘣\",\n    \"嘤\",\n    \"嘚\",\n    \"嗾\",\n    \"嘧\",\n    \"罴\",\n    \"罱\",\n    \"幔\",\n    \"嶂\",\n    \"幛\",\n    \"赙\",\n    \"罂\",\n    \"骷\",\n    \"骶\",\n    \"鹘\",\n    \"锲\",\n    \"锴\",\n    \"锶\",\n    \"锷\",\n    \"锸\",\n    \"锵\",\n    \"镁\",\n    \"镂\",\n    \"犒\",\n    \"箐\",\n    \"箦\",\n    \"箧\",\n    \"箍\",\n    \"箸\",\n    \"箬\",\n    \"箅\",\n    \"箪\",\n    \"箔\",\n    \"箜\",\n    \"箢\",\n    \"箓\",\n    \"毓\",\n    \"僖\",\n    \"儆\",\n    \"僳\",\n    \"僭\",\n    \"劁\",\n    \"僮\",\n    \"魃\",\n    \"魆\",\n    \"睾\",\n    \"艋\",\n    \"鄱\",\n    \"膈\",\n    \"膑\",\n    \"鲑\",\n    \"鲔\",\n    \"鲚\",\n    \"鲛\",\n    \"鲟\",\n    \"獐\",\n    \"觫\",\n    \"雒\",\n    \"夤\",\n    \"馑\",\n    \"銮\",\n    \"塾\",\n    \"麽\",\n    \"瘌\",\n    \"瘊\",\n    \"瘘\",\n    \"瘙\",\n    \"廖\",\n    \"韶\",\n    \"旖\",\n    \"膂\",\n    \"阚\",\n    \"鄯\",\n    \"鲞\",\n    \"粿\",\n    \"粼\",\n    \"粽\",\n    \"糁\",\n    \"槊\",\n    \"鹚\",\n    \"熘\",\n    
\"熥\",\n    \"潢\",\n    \"漕\",\n    \"滹\",\n    \"漯\",\n    \"漶\",\n    \"潋\",\n    \"潴\",\n    \"漪\",\n    \"漉\",\n    \"漳\",\n    \"漩\",\n    \"澉\",\n    \"潍\",\n    \"慵\",\n    \"搴\",\n    \"窨\",\n    \"寤\",\n    \"綮\",\n    \"谮\",\n    \"褡\",\n    \"褙\",\n    \"褓\",\n    \"褛\",\n    \"褊\",\n    \"谯\",\n    \"谰\",\n    \"谲\",\n    \"暨\",\n    \"屣\",\n    \"鹛\",\n    \"嫣\",\n    \"嫱\",\n    \"嫖\",\n    \"嫦\",\n    \"嫚\",\n    \"嫘\",\n    \"嫡\",\n    \"鼐\",\n    \"翟\",\n    \"瞀\",\n    \"鹜\",\n    \"骠\",\n    \"缥\",\n    \"缦\",\n    \"缧\",\n    \"缨\",\n    \"骢\",\n    \"缪\",\n    \"缫\",\n    \"耦\",\n    \"耧\",\n    \"瑾\",\n    \"璜\",\n    \"璀\",\n    \"璎\",\n    \"璁\",\n    \"璋\",\n    \"璇\",\n    \"奭\",\n    \"髯\",\n    \"髫\",\n    \"撷\",\n    \"撅\",\n    \"赭\",\n    \"撸\",\n    \"鋆\",\n    \"撙\",\n    \"撺\",\n    \"墀\",\n    \"聩\",\n    \"觐\",\n    \"鞑\",\n    \"蕙\",\n    \"鞒\",\n    \"蕈\",\n    \"蕨\",\n    \"蕤\",\n    \"蕞\",\n    \"蕺\",\n    \"瞢\",\n    \"蕃\",\n    \"蕲\",\n    \"赜\",\n    \"槿\",\n    \"樯\",\n    \"槭\",\n    \"樗\",\n    \"樘\",\n    \"樊\",\n    \"槲\",\n    \"醌\",\n    \"醅\",\n    \"靥\",\n    \"魇\",\n    \"餍\",\n    \"磔\",\n    \"磙\",\n    \"霈\",\n    \"辘\",\n    \"龉\",\n    \"龊\",\n    \"觑\",\n    \"瞌\",\n    \"瞋\",\n    \"瞑\",\n    \"嘭\",\n    \"噎\",\n    \"噶\",\n    \"颙\",\n    \"暹\",\n    \"噘\",\n    \"踔\",\n    \"踝\",\n    \"踟\",\n    \"踒\",\n    \"踬\",\n    \"踮\",\n    \"踯\",\n    \"踺\",\n    \"踞\",\n    \"蝽\",\n    \"蝾\",\n    \"蝻\",\n    \"蝰\",\n    \"蝮\",\n    \"螋\",\n    \"蝓\",\n    \"蝣\",\n    \"蝼\",\n    \"噗\",\n    \"嘬\",\n    \"颚\",\n    \"噍\",\n    \"噢\",\n    \"噙\",\n    \"噜\",\n    \"噌\",\n    \"噔\",\n    \"颛\",\n    \"幞\",\n    \"幡\",\n    \"嶙\",\n    \"嶝\",\n    \"骺\",\n    \"骼\",\n    \"骸\",\n    \"镊\",\n    \"镉\",\n    \"镌\",\n    \"镍\",\n    \"镏\",\n    \"镒\",\n    \"镓\",\n    \"镔\",\n    \"稷\",\n    \"箴\",\n    \"篑\",\n    \"篁\",\n    \"篌\",\n    \"篆\",\n    \"牖\",\n    \"儋\",\n    \"徵\",\n    \"磐\",\n    \"虢\",\n    
\"鹞\",\n    \"膘\",\n    \"滕\",\n    \"鲠\",\n    \"鲡\",\n    \"鲢\",\n    \"鲣\",\n    \"鲥\",\n    \"鲧\",\n    \"鲩\",\n    \"獗\",\n    \"獠\",\n    \"觯\",\n    \"馓\",\n    \"馔\",\n    \"麾\",\n    \"廛\",\n    \"瘛\",\n    \"瘼\",\n    \"瘢\",\n    \"瘠\",\n    \"齑\",\n    \"羯\",\n    \"羰\",\n    \"𥻗\",\n    \"遴\",\n    \"糌\",\n    \"糍\",\n    \"糅\",\n    \"熜\",\n    \"熵\",\n    \"熠\",\n    \"澍\",\n    \"澌\",\n    \"潸\",\n    \"潦\",\n    \"潲\",\n    \"鋈\",\n    \"潟\",\n    \"潼\",\n    \"潺\",\n    \"憬\",\n    \"憧\",\n    \"寮\",\n    \"窳\",\n    \"谳\",\n    \"褴\",\n    \"褟\",\n    \"褫\",\n    \"谵\",\n    \"熨\",\n    \"屦\",\n    \"嬉\",\n    \"勰\",\n    \"戮\",\n    \"蝥\",\n    \"缬\",\n    \"缮\",\n    \"缯\",\n    \"骣\",\n    \"畿\",\n    \"耩\",\n    \"耨\",\n    \"耪\",\n    \"璞\",\n    \"璟\",\n    \"靛\",\n    \"璠\",\n    \"璘\",\n    \"聱\",\n    \"螯\",\n    \"髻\",\n    \"髭\",\n    \"髹\",\n    \"擀\",\n    \"熹\",\n    \"甏\",\n    \"擞\",\n    \"縠\",\n    \"磬\",\n    \"颞\",\n    \"蕻\",\n    \"鞘\",\n    \"颟\",\n    \"薤\",\n    \"薨\",\n    \"檠\",\n    \"薏\",\n    \"薮\",\n    \"薜\",\n    \"薅\",\n    \"樾\",\n    \"橛\",\n    \"橇\",\n    \"樵\",\n    \"檎\",\n    \"橹\",\n    \"樽\",\n    \"樨\",\n    \"橼\",\n    \"墼\",\n    \"橐\",\n    \"翮\",\n    \"醛\",\n    \"醐\",\n    \"醍\",\n    \"醚\",\n    \"磲\",\n    \"赝\",\n    \"飙\",\n    \"殪\",\n    \"霖\",\n    \"霏\",\n    \"霓\",\n    \"錾\",\n    \"辚\",\n    \"臻\",\n    \"遽\",\n    \"氅\",\n    \"瞟\",\n    \"瞠\",\n    \"瞰\",\n    \"嚄\",\n    \"嚆\",\n    \"噤\",\n    \"暾\",\n    \"蹀\",\n    \"踹\",\n    \"踵\",\n    \"踽\",\n    \"蹉\",\n    \"蹁\",\n    \"螨\",\n    \"蟒\",\n    \"螈\",\n    \"螅\",\n    \"螭\",\n    \"螠\",\n    \"螟\",\n    \"噱\",\n    \"噬\",\n    \"噫\",\n    \"噻\",\n    \"噼\",\n    \"罹\",\n    \"圜\",\n    \"䦃\",\n    \"镖\",\n    \"镗\",\n    \"镘\",\n    \"镚\",\n    \"镛\",\n    \"镝\",\n    \"镞\",\n    \"镠\",\n    \"氇\",\n    \"氆\",\n    \"憩\",\n    \"穑\",\n    \"篝\",\n    \"篥\",\n    \"篦\",\n    \"篪\",\n    \"篙\",\n    \"盥\",\n    \"劓\",\n    
\"翱\",\n    \"魉\",\n    \"魈\",\n    \"徼\",\n    \"歙\",\n    \"膳\",\n    \"膦\",\n    \"膙\",\n    \"鲮\",\n    \"鲱\",\n    \"鲲\",\n    \"鲳\",\n    \"鲴\",\n    \"鲵\",\n    \"鲷\",\n    \"鲻\",\n    \"獴\",\n    \"獭\",\n    \"獬\",\n    \"邂\",\n    \"鹧\",\n    \"廨\",\n    \"赟\",\n    \"瘰\",\n    \"廪\",\n    \"瘿\",\n    \"瘵\",\n    \"瘴\",\n    \"癃\",\n    \"瘳\",\n    \"斓\",\n    \"麇\",\n    \"麈\",\n    \"嬴\",\n    \"壅\",\n    \"羲\",\n    \"糗\",\n    \"瞥\",\n    \"甑\",\n    \"燎\",\n    \"燠\",\n    \"燔\",\n    \"燧\",\n    \"濑\",\n    \"濉\",\n    \"潞\",\n    \"澧\",\n    \"澹\",\n    \"澥\",\n    \"澶\",\n    \"濂\",\n    \"褰\",\n    \"寰\",\n    \"窸\",\n    \"褶\",\n    \"禧\",\n    \"嬖\",\n    \"犟\",\n    \"隰\",\n    \"嬗\",\n    \"颡\",\n    \"缱\",\n    \"缲\",\n    \"缳\",\n    \"璨\",\n    \"璩\",\n    \"璐\",\n    \"璪\",\n    \"螫\",\n    \"擤\",\n    \"壕\",\n    \"觳\",\n    \"罄\",\n    \"擢\",\n    \"薹\",\n    \"鞡\",\n    \"鞬\",\n    \"薷\",\n    \"薰\",\n    \"藓\",\n    \"藁\",\n    \"檄\",\n    \"檩\",\n    \"懋\",\n    \"醢\",\n    \"翳\",\n    \"礅\",\n    \"磴\",\n    \"鹩\",\n    \"龋\",\n    \"龌\",\n    \"豳\",\n    \"壑\",\n    \"黻\",\n    \"嚏\",\n    \"嚅\",\n    \"蹑\",\n    \"蹒\",\n    \"蹊\",\n    \"蟥\",\n    \"螬\",\n    \"螵\",\n    \"疃\",\n    \"螳\",\n    \"蟑\",\n    \"嚓\",\n    \"羁\",\n    \"罽\",\n    \"罾\",\n    \"嶷\",\n    \"黜\",\n    \"黝\",\n    \"髁\",\n    \"髀\",\n    \"镡\",\n    \"镢\",\n    \"镣\",\n    \"镦\",\n    \"镧\",\n    \"镩\",\n    \"镪\",\n    \"镫\",\n    \"罅\",\n    \"黏\",\n    \"簌\",\n    \"篾\",\n    \"篼\",\n    \"簖\",\n    \"簋\",\n    \"鼢\",\n    \"黛\",\n    \"儡\",\n    \"鹪\",\n    \"鼾\",\n    \"皤\",\n    \"魍\",\n    \"龠\",\n    \"繇\",\n    \"貘\",\n    \"邈\",\n    \"貔\",\n    \"臌\",\n    \"膻\",\n    \"臆\",\n    \"臃\",\n    \"鲼\",\n    \"鲽\",\n    \"鳀\",\n    \"鳃\",\n    \"鳅\",\n    \"鳇\",\n    \"鳊\",\n    \"螽\",\n    \"燮\",\n    \"鹫\",\n    \"襄\",\n    \"糜\",\n    \"縻\",\n    \"膺\",\n    \"癍\",\n    \"麋\",\n    \"懑\",\n    \"濡\",\n    \"濮\",\n    \"濞\",\n    \"濠\",\n    
\"濯\",\n    \"蹇\",\n    \"謇\",\n    \"邃\",\n    \"襁\",\n    \"檗\",\n    \"擘\",\n    \"孺\",\n    \"隳\",\n    \"嬷\",\n    \"蟊\",\n    \"鹬\",\n    \"鍪\",\n    \"鏊\",\n    \"鳌\",\n    \"鬈\",\n    \"鬃\",\n    \"瞽\",\n    \"鞯\",\n    \"鞨\",\n    \"鞫\",\n    \"鞧\",\n    \"鞣\",\n    \"藜\",\n    \"藠\",\n    \"藩\",\n    \"醪\",\n    \"蹙\",\n    \"礓\",\n    \"燹\",\n    \"餮\",\n    \"瞿\",\n    \"曛\",\n    \"颢\",\n    \"曜\",\n    \"躇\",\n    \"蹚\",\n    \"鹭\",\n    \"蟛\",\n    \"蟪\",\n    \"蟠\",\n    \"蟮\",\n    \"鹮\",\n    \"黠\",\n    \"黟\",\n    \"髅\",\n    \"髂\",\n    \"镬\",\n    \"镭\",\n    \"镯\",\n    \"馥\",\n    \"簟\",\n    \"簪\",\n    \"鼬\",\n    \"雠\",\n    \"艟\",\n    \"鳎\",\n    \"鳏\",\n    \"鳐\",\n    \"癞\",\n    \"癔\",\n    \"癜\",\n    \"癖\",\n    \"糨\",\n    \"蹩\",\n    \"鎏\",\n    \"懵\",\n    \"彝\",\n    \"邋\",\n    \"鬏\",\n    \"攉\",\n    \"攒\",\n    \"鞲\",\n    \"鞴\",\n    \"藿\",\n    \"蘧\",\n    \"蘅\",\n    \"麓\",\n    \"醮\",\n    \"醯\",\n    \"酃\",\n    \"霪\",\n    \"霭\",\n    \"霨\",\n    \"黼\",\n    \"嚯\",\n    \"蹰\",\n    \"蹶\",\n    \"蹽\",\n    \"蹼\",\n    \"蹴\",\n    \"蹾\",\n    \"蹿\",\n    \"蠖\",\n    \"蠓\",\n    \"蟾\",\n    \"蠊\",\n    \"黢\",\n    \"髋\",\n    \"髌\",\n    \"镲\",\n    \"籀\",\n    \"籁\",\n    \"齁\",\n    \"魑\",\n    \"艨\",\n    \"鳓\",\n    \"鳔\",\n    \"鳕\",\n    \"鳗\",\n    \"鳙\",\n    \"麒\",\n    \"鏖\",\n    \"羸\",\n    \"㸆\",\n    \"瀚\",\n    \"瀣\",\n    \"瀛\",\n    \"襦\",\n    \"谶\",\n    \"襞\",\n    \"骥\",\n    \"缵\",\n    \"瓒\",\n    \"攘\",\n    \"蘩\",\n    \"蘖\",\n    \"醴\",\n    \"霰\",\n    \"酆\",\n    \"矍\",\n    \"曦\",\n    \"躅\",\n    \"鼍\",\n    \"巉\",\n    \"黩\",\n    \"黥\",\n    \"黪\",\n    \"镳\",\n    \"镴\",\n    \"黧\",\n    \"纂\",\n    \"璺\",\n    \"鼯\",\n    \"臜\",\n    \"鳜\",\n    \"鳝\",\n    \"鳟\",\n    \"獾\",\n    \"孀\",\n    \"骧\",\n    \"瓘\",\n    \"鼙\",\n    \"醺\",\n    \"礴\",\n    \"颦\",\n    \"曩\",\n    \"鳢\",\n    \"癫\",\n    \"麝\",\n    \"夔\",\n    \"爝\",\n    \"灏\",\n    \"禳\",\n    \"鐾\",\n    \"羼\",\n    
\"蠡\",\n    \"耱\",\n    \"懿\",\n    \"蘸\",\n    \"鹳\",\n    \"霾\",\n    \"氍\",\n    \"饕\",\n    \"躐\",\n    \"髑\",\n    \"镵\",\n    \"穰\",\n    \"饔\",\n    \"鬻\",\n    \"鬟\",\n    \"趱\",\n    \"攫\",\n    \"攥\",\n    \"颧\",\n    \"躜\",\n    \"鼹\",\n    \"癯\",\n    \"麟\",\n    \"蠲\",\n    \"蠹\",\n    \"躞\",\n    \"衢\",\n    \"鑫\",\n    \"灞\",\n    \"襻\",\n    \"纛\",\n    \"鬣\",\n    \"攮\",\n    \"囔\",\n    \"馕\",\n    \"戆\",\n    \"爨\",\n    \"齉\",\n    \"亍\",\n    \"尢\",\n    \"彳\",\n    \"卬\",\n    \"殳\",\n    \"𠙶\",\n    \"毌\",\n    \"邘\",\n    \"戋\",\n    \"圢\",\n    \"氕\",\n    \"伋\",\n    \"仝\",\n    \"冮\",\n    \"氿\",\n    \"汈\",\n    \"氾\",\n    \"忉\",\n    \"宄\",\n    \"讱\",\n    \"扞\",\n    \"圲\",\n    \"圫\",\n    \"芏\",\n    \"芃\",\n    \"朳\",\n    \"朸\",\n    \"𨙸\",\n    \"邨\",\n    \"吒\",\n    \"吖\",\n    \"屼\",\n    \"屾\",\n    \"辿\",\n    \"钆\",\n    \"仳\",\n    \"伣\",\n    \"伈\",\n    \"癿\",\n    \"甪\",\n    \"邠\",\n    \"犴\",\n    \"冱\",\n    \"邡\",\n    \"闫\",\n    \"汋\",\n    \"䜣\",\n    \"讻\",\n    \"孖\",\n    \"纩\",\n    \"玒\",\n    \"玓\",\n    \"玘\",\n    \"玚\",\n    \"刬\",\n    \"坜\",\n    \"坉\",\n    \"扽\",\n    \"坋\",\n    \"扺\",\n    \"㧑\",\n    \"毐\",\n    \"芰\",\n    \"芣\",\n    \"苊\",\n    \"苉\",\n    \"芘\",\n    \"芴\",\n    \"芠\",\n    \"芤\",\n    \"杕\",\n    \"杙\",\n    \"杄\",\n    \"杧\",\n    \"杩\",\n    \"尪\",\n    \"尨\",\n    \"轪\",\n    \"坒\",\n    \"芈\",\n    \"旴\",\n    \"旵\",\n    \"呙\",\n    \"㕮\",\n    \"岍\",\n    \"岠\",\n    \"岜\",\n    \"呇\",\n    \"冏\",\n    \"觃\",\n    \"岙\",\n    \"伾\",\n    \"㑇\",\n    \"伭\",\n    \"佖\",\n    \"伲\",\n    \"佁\",\n    \"飏\",\n    \"狃\",\n    \"闶\",\n    \"汧\",\n    \"汫\",\n    \"𣲘\",\n    \"𣲗\",\n    \"沄\",\n    \"沘\",\n    \"汭\",\n    \"㳇\",\n    \"沇\",\n    \"忮\",\n    \"忳\",\n    \"忺\",\n    \"祃\",\n    \"诇\",\n    \"邲\",\n    \"诎\",\n    \"诐\",\n    \"屃\",\n    \"岊\",\n    \"阽\",\n    \"䢺\",\n    \"阼\",\n    \"妧\",\n    \"妘\",\n    \"𨚕\",\n    \"纮\",\n    \"驲\",\n    \"纻\",\n    
\"纼\",\n    \"玤\",\n    \"玞\",\n    \"玱\",\n    \"玟\",\n    \"邽\",\n    \"邿\",\n    \"坥\",\n    \"坰\",\n    \"坬\",\n    \"坽\",\n    \"弆\",\n    \"耵\",\n    \"䢼\",\n    \"𦭜\",\n    \"茋\",\n    \"苧\",\n    \"苾\",\n    \"苠\",\n    \"枅\",\n    \"㭎\",\n    \"枘\",\n    \"枍\",\n    \"矼\",\n    \"矻\",\n    \"匼\",\n    \"旿\",\n    \"昇\",\n    \"昄\",\n    \"昒\",\n    \"昈\",\n    \"咉\",\n    \"咇\",\n    \"咍\",\n    \"岵\",\n    \"岽\",\n    \"岨\",\n    \"岞\",\n    \"峂\",\n    \"㟃\",\n    \"囷\",\n    \"钐\",\n    \"钔\",\n    \"钖\",\n    \"牥\",\n    \"佴\",\n    \"垈\",\n    \"侁\",\n    \"侹\",\n    \"佸\",\n    \"佺\",\n    \"隹\",\n    \"㑊\",\n    \"侂\",\n    \"佽\",\n    \"侘\",\n    \"郈\",\n    \"舠\",\n    \"郐\",\n    \"郃\",\n    \"攽\",\n    \"肭\",\n    \"肸\",\n    \"肷\",\n    \"狉\",\n    \"狝\",\n    \"饳\",\n    \"忞\",\n    \"於\",\n    \"炌\",\n    \"炆\",\n    \"泙\",\n    \"沺\",\n    \"泂\",\n    \"泜\",\n    \"泃\",\n    \"泇\",\n    \"怊\",\n    \"峃\",\n    \"穸\",\n    \"祋\",\n    \"祊\",\n    \"鸤\",\n    \"弢\",\n    \"弨\",\n    \"陑\",\n    \"陎\",\n    \"卺\",\n    \"乸\",\n    \"妭\",\n    \"姈\",\n    \"迳\",\n    \"叕\",\n    \"驵\",\n    \"䌹\",\n    \"驺\",\n    \"绋\",\n    \"绐\",\n    \"砉\",\n    \"耔\",\n    \"㛃\",\n    \"玶\",\n    \"珇\",\n    \"珅\",\n    \"珋\",\n    \"玹\",\n    \"珌\",\n    \"玿\",\n    \"韨\",\n    \"垚\",\n    \"垯\",\n    \"垙\",\n    \"垲\",\n    \"埏\",\n    \"垍\",\n    \"耇\",\n    \"垎\",\n    \"垴\",\n    \"垟\",\n    \"垞\",\n    \"挓\",\n    \"垵\",\n    \"垏\",\n    \"拶\",\n    \"荖\",\n    \"荁\",\n    \"荙\",\n    \"荛\",\n    \"茈\",\n    \"茽\",\n    \"荄\",\n    \"茺\",\n    \"荓\",\n    \"茳\",\n    \"𦰡\",\n    \"茛\",\n    \"荭\",\n    \"㭕\",\n    \"柷\",\n    \"柃\",\n    \"柊\",\n    \"枹\",\n    \"栐\",\n    \"柖\",\n    \"郚\",\n    \"剅\",\n    \"䴓\",\n    \"迺\",\n    \"厖\",\n    \"砆\",\n    \"砑\",\n    \"砄\",\n    \"耏\",\n    \"奓\",\n    \"䶮\",\n    \"轵\",\n    \"轷\",\n    \"轹\",\n    \"轺\",\n    \"昺\",\n    \"昽\",\n    \"盷\",\n    \"咡\",\n    \"咺\",\n    \"昳\",\n    \"昣\",\n    
\"哒\",\n    \"昤\",\n    \"昫\",\n    \"昡\",\n    \"咥\",\n    \"昪\",\n    \"虷\",\n    \"虸\",\n    \"哃\",\n    \"峘\",\n    \"耑\",\n    \"峛\",\n    \"峗\",\n    \"峧\",\n    \"帡\",\n    \"钘\",\n    \"钜\",\n    \"钪\",\n    \"钬\",\n    \"钭\",\n    \"矧\",\n    \"秬\",\n    \"俫\",\n    \"舁\",\n    \"俜\",\n    \"俙\",\n    \"俍\",\n    \"垕\",\n    \"衎\",\n    \"舣\",\n    \"弇\",\n    \"侴\",\n    \"鸧\",\n    \"䏡\",\n    \"胠\",\n    \"𦙶\",\n    \"胈\",\n    \"胩\",\n    \"胣\",\n    \"朏\",\n    \"飐\",\n    \"訄\",\n    \"饻\",\n    \"庤\",\n    \"疢\",\n    \"炣\",\n    \"炟\",\n    \"㶲\",\n    \"洭\",\n    \"洘\",\n    \"洓\",\n    \"洿\",\n    \"㳚\",\n    \"泚\",\n    \"浈\",\n    \"浉\",\n    \"洸\",\n    \"洑\",\n    \"洢\",\n    \"洈\",\n    \"洚\",\n    \"洺\",\n    \"洨\",\n    \"浐\",\n    \"㳘\",\n    \"洴\",\n    \"洣\",\n    \"恔\",\n    \"宬\",\n    \"窀\",\n    \"扂\",\n    \"袆\",\n    \"祏\",\n    \"祐\",\n    \"祕\",\n    \"叚\",\n    \"陧\",\n    \"陞\",\n    \"娀\",\n    \"姞\",\n    \"姱\",\n    \"姤\",\n    \"姶\",\n    \"姽\",\n    \"枲\",\n    \"绖\",\n    \"骃\",\n    \"彖\",\n    \"骉\",\n    \"恝\",\n    \"珪\",\n    \"珛\",\n    \"珹\",\n    \"琊\",\n    \"玼\",\n    \"珖\",\n    \"珽\",\n    \"珦\",\n    \"珫\",\n    \"珒\",\n    \"珢\",\n    \"珕\",\n    \"珝\",\n    \"埗\",\n    \"垾\",\n    \"垺\",\n    \"埆\",\n    \"垿\",\n    \"埌\",\n    \"埇\",\n    \"莰\",\n    \"茝\",\n    \"鄀\",\n    \"莶\",\n    \"莝\",\n    \"䓖\",\n    \"莙\",\n    \"栻\",\n    \"桠\",\n    \"桄\",\n    \"梠\",\n    \"栴\",\n    \"梴\",\n    \"栒\",\n    \"酎\",\n    \"酏\",\n    \"砵\",\n    \"砠\",\n    \"砫\",\n    \"砬\",\n    \"硁\",\n    \"恧\",\n    \"翃\",\n    \"郪\",\n    \"𨐈\",\n    \"辀\",\n    \"辁\",\n    \"剕\",\n    \"赀\",\n    \"哢\",\n    \"晅\",\n    \"晊\",\n    \"唝\",\n    \"哳\",\n    \"哱\",\n    \"冔\",\n    \"晔\",\n    \"晐\",\n    \"晖\",\n    \"畖\",\n    \"蚄\",\n    \"蚆\",\n    \"帱\",\n    \"崁\",\n    \"峿\",\n    \"崄\",\n    \"帨\",\n    \"崀\",\n    \"赆\",\n    \"钷\",\n    \"眚\",\n    \"甡\",\n    \"笫\",\n    \"倻\",\n    \"倴\",\n    \"脩\",\n    
\"倮\",\n    \"倕\",\n    \"倞\",\n    \"倓\",\n    \"倧\",\n    \"衃\",\n    \"虒\",\n    \"舭\",\n    \"舯\",\n    \"舥\",\n    \"瓞\",\n    \"鬯\",\n    \"鸰\",\n    \"脎\",\n    \"朓\",\n    \"胲\",\n    \"虓\",\n    \"鱽\",\n    \"狴\",\n    \"峱\",\n    \"狻\",\n    \"眢\",\n    \"勍\",\n    \"痄\",\n    \"疰\",\n    \"痃\",\n    \"竘\",\n    \"羖\",\n    \"羓\",\n    \"桊\",\n    \"敉\",\n    \"烠\",\n    \"烔\",\n    \"烶\",\n    \"烻\",\n    \"涍\",\n    \"浡\",\n    \"浭\",\n    \"浬\",\n    \"涄\",\n    \"涢\",\n    \"涐\",\n    \"浰\",\n    \"浟\",\n    \"浛\",\n    \"浼\",\n    \"浲\",\n    \"涘\",\n    \"悈\",\n    \"悃\",\n    \"悢\",\n    \"宧\",\n    \"窅\",\n    \"窊\",\n    \"窎\",\n    \"扅\",\n    \"扆\",\n    \"袪\",\n    \"袗\",\n    \"袯\",\n    \"祧\",\n    \"隺\",\n    \"堲\",\n    \"疍\",\n    \"𨺙\",\n    \"陴\",\n    \"烝\",\n    \"砮\",\n    \"㛚\",\n    \"哿\",\n    \"翀\",\n    \"翂\",\n    \"剟\",\n    \"绤\",\n    \"骍\",\n    \"䂮\",\n    \"琎\",\n    \"珸\",\n    \"珵\",\n    \"琄\",\n    \"琈\",\n    \"琀\",\n    \"珺\",\n    \"掭\",\n    \"堎\",\n    \"堐\",\n    \"埼\",\n    \"掎\",\n    \"埫\",\n    \"堌\",\n    \"晢\",\n    \"掞\",\n    \"埪\",\n    \"壸\",\n    \"㙍\",\n    \"聍\",\n    \"菝\",\n    \"萚\",\n    \"菥\",\n    \"莿\",\n    \"䓫\",\n    \"勚\",\n    \"䓬\",\n    \"萆\",\n    \"菂\",\n    \"菍\",\n    \"菼\",\n    \"萣\",\n    \"䓨\",\n    \"菉\",\n    \"䓛\",\n    \"梼\",\n    \"梽\",\n    \"桲\",\n    \"梾\",\n    \"桯\",\n    \"梣\",\n    \"梌\",\n    \"桹\",\n    \"敔\",\n    \"厣\",\n    \"硔\",\n    \"硙\",\n    \"硚\",\n    \"硊\",\n    \"硍\",\n    \"勔\",\n    \"䴕\",\n    \"龁\",\n    \"逴\",\n    \"唪\",\n    \"啫\",\n    \"翈\",\n    \"㫰\",\n    \"晙\",\n    \"畤\",\n    \"趼\",\n    \"跂\",\n    \"蛃\",\n    \"蚲\",\n    \"蚺\",\n    \"啴\",\n    \"䎃\",\n    \"崧\",\n    \"崟\",\n    \"崞\",\n    \"崒\",\n    \"崌\",\n    \"崡\",\n    \"铏\",\n    \"铕\",\n    \"铖\",\n    \"铘\",\n    \"铚\",\n    \"铞\",\n    \"铥\",\n    \"铴\",\n    \"牻\",\n    \"牿\",\n    \"稆\",\n    \"笱\",\n    \"笯\",\n    \"偰\",\n    \"偡\",\n    \"鸺\",\n    \"偭\",\n    
\"偲\",\n    \"偁\",\n    \"㿠\",\n    \"鄅\",\n    \"偓\",\n    \"徛\",\n    \"衒\",\n    \"舳\",\n    \"舲\",\n    \"鸼\",\n    \"悆\",\n    \"鄃\",\n    \"瓻\",\n    \"䝙\",\n    \"脶\",\n    \"脞\",\n    \"脟\",\n    \"䏲\",\n    \"鱾\",\n    \"猇\",\n    \"猊\",\n    \"猄\",\n    \"觖\",\n    \"𠅤\",\n    \"庱\",\n    \"庼\",\n    \"庳\",\n    \"痓\",\n    \"䴔\",\n    \"竫\",\n    \"堃\",\n    \"阌\",\n    \"羝\",\n    \"羕\",\n    \"焆\",\n    \"烺\",\n    \"焌\",\n    \"淏\",\n    \"淟\",\n    \"淜\",\n    \"淴\",\n    \"淯\",\n    \"湴\",\n    \"涴\",\n    \"㥄\",\n    \"惛\",\n    \"惔\",\n    \"悰\",\n    \"惙\",\n    \"寁\",\n    \"逭\",\n    \"袼\",\n    \"裈\",\n    \"祲\",\n    \"谞\",\n    \"艴\",\n    \"弸\",\n    \"弶\",\n    \"隃\",\n    \"婞\",\n    \"娵\",\n    \"婼\",\n    \"媖\",\n    \"婳\",\n    \"婍\",\n    \"婌\",\n    \"婫\",\n    \"婤\",\n    \"婘\",\n    \"婠\",\n    \"绹\",\n    \"骕\",\n    \"絜\",\n    \"珷\",\n    \"琲\",\n    \"琡\",\n    \"琟\",\n    \"琔\",\n    \"琭\",\n    \"堾\",\n    \"堼\",\n    \"揕\",\n    \"㙘\",\n    \"堧\",\n    \"喆\",\n    \"堨\",\n    \"塅\",\n    \"堠\",\n    \"絷\",\n    \"𡎚\",\n    \"葜\",\n    \"惎\",\n    \"萳\",\n    \"葙\",\n    \"靬\",\n    \"葴\",\n    \"蒇\",\n    \"蒈\",\n    \"鄚\",\n    \"蒉\",\n    \"蓇\",\n    \"萩\",\n    \"蒐\",\n    \"葰\",\n    \"葎\",\n    \"鄑\",\n    \"蒎\",\n    \"葖\",\n    \"蒄\",\n    \"萹\",\n    \"棤\",\n    \"棽\",\n    \"棫\",\n    \"椓\",\n    \"椑\",\n    \"鹀\",\n    \"椆\",\n    \"棓\",\n    \"棬\",\n    \"棪\",\n    \"椀\",\n    \"楗\",\n    \"甦\",\n    \"酦\",\n    \"觌\",\n    \"奡\",\n    \"皕\",\n    \"硪\",\n    \"欹\",\n    \"詟\",\n    \"辌\",\n    \"棐\",\n    \"龂\",\n    \"黹\",\n    \"牚\",\n    \"睎\",\n    \"晫\",\n    \"晪\",\n    \"晱\",\n    \"𧿹\",\n    \"蛑\",\n    \"畯\",\n    \"斝\",\n    \"喤\",\n    \"崶\",\n    \"嵁\",\n    \"崾\",\n    \"嵅\",\n    \"崿\",\n    \"嵚\",\n    \"翙\",\n    \"圌\",\n    \"圐\",\n    \"赑\",\n    \"淼\",\n    \"赒\",\n    \"铹\",\n    \"铽\",\n    \"𨱇\",\n    \"锊\",\n    \"锍\",\n    \"锎\",\n    \"锓\",\n    \"犇\",\n    \"颋\",\n    \"稌\",\n    
\"筀\",\n    \"筘\",\n    \"筜\",\n    \"筥\",\n    \"筅\",\n    \"傃\",\n    \"傉\",\n    \"翛\",\n    \"傒\",\n    \"傕\",\n    \"舾\",\n    \"畬\",\n    \"脿\",\n    \"腘\",\n    \"䐃\",\n    \"腙\",\n    \"腒\",\n    \"鲃\",\n    \"猰\",\n    \"猯\",\n    \"㺄\",\n    \"馉\",\n    \"鄗\",\n    \"廋\",\n    \"廆\",\n    \"鄌\",\n    \"粢\",\n    \"遆\",\n    \"旐\",\n    \"焞\",\n    \"欻\",\n    \"𣸣\",\n    \"溚\",\n    \"溁\",\n    \"湝\",\n    \"渰\",\n    \"湓\",\n    \"㴔\",\n    \"渟\",\n    \"溠\",\n    \"渼\",\n    \"溇\",\n    \"湣\",\n    \"湑\",\n    \"溞\",\n    \"愐\",\n    \"愃\",\n    \"敩\",\n    \"甯\",\n    \"棨\",\n    \"扊\",\n    \"裣\",\n    \"祼\",\n    \"婻\",\n    \"媆\",\n    \"媞\",\n    \"㛹\",\n    \"媓\",\n    \"媂\",\n    \"媄\",\n    \"毵\",\n    \"矞\",\n    \"缊\",\n    \"缐\",\n    \"骙\",\n    \"瑃\",\n    \"瑓\",\n    \"瑅\",\n    \"瑆\",\n    \"䴖\",\n    \"瑖\",\n    \"瑝\",\n    \"瑔\",\n    \"瑀\",\n    \"𤧛\",\n    \"瑳\",\n    \"瑂\",\n    \"嶅\",\n    \"瑑\",\n    \"遘\",\n    \"髢\",\n    \"塥\",\n    \"堽\",\n    \"赪\",\n    \"摛\",\n    \"塝\",\n    \"搒\",\n    \"搌\",\n    \"蒱\",\n    \"蒨\",\n    \"蓏\",\n    \"蔀\",\n    \"蓢\",\n    \"蓂\",\n    \"蒻\",\n    \"蓣\",\n    \"椹\",\n    \"楪\",\n    \"榃\",\n    \"榅\",\n    \"楒\",\n    \"楞\",\n    \"楩\",\n    \"榇\",\n    \"椸\",\n    \"楙\",\n    \"歅\",\n    \"碃\",\n    \"碏\",\n    \"碈\",\n    \"䃅\",\n    \"硿\",\n    \"鄠\",\n    \"辒\",\n    \"龆\",\n    \"觜\",\n    \"䣘\",\n    \"暕\",\n    \"鹍\",\n    \"㬊\",\n    \"暅\",\n    \"跱\",\n    \"蜐\",\n    \"蜎\",\n    \"嵲\",\n    \"赗\",\n    \"骱\",\n    \"锖\",\n    \"锘\",\n    \"锳\",\n    \"锧\",\n    \"锪\",\n    \"锫\",\n    \"锬\",\n    \"稑\",\n    \"稙\",\n    \"䅟\",\n    \"筻\",\n    \"筼\",\n    \"筶\",\n    \"筦\",\n    \"筤\",\n    \"傺\",\n    \"鹎\",\n    \"僇\",\n    \"艅\",\n    \"艉\",\n    \"谼\",\n    \"貆\",\n    \"腽\",\n    \"腨\",\n    \"腯\",\n    \"鲉\",\n    \"鲊\",\n    \"鲌\",\n    \"䲟\",\n    \"鲏\",\n    \"雊\",\n    \"猺\",\n    \"飔\",\n    \"觟\",\n    \"𦝼\",\n    \"馌\",\n    \"裛\",\n    \"廒\",\n    \"瘀\",\n    
\"瘅\",\n    \"鄘\",\n    \"鹒\",\n    \"鄜\",\n    \"麀\",\n    \"鄣\",\n    \"阘\",\n    \"煁\",\n    \"煃\",\n    \"煴\",\n    \"煋\",\n    \"煟\",\n    \"煓\",\n    \"滠\",\n    \"溍\",\n    \"溹\",\n    \"滆\",\n    \"滉\",\n    \"溦\",\n    \"溵\",\n    \"漷\",\n    \"滧\",\n    \"滘\",\n    \"滍\",\n    \"愭\",\n    \"慥\",\n    \"慆\",\n    \"塱\",\n    \"裼\",\n    \"禋\",\n    \"禔\",\n    \"禘\",\n    \"禒\",\n    \"谫\",\n    \"鹔\",\n    \"愍\",\n    \"嫄\",\n    \"媱\",\n    \"戤\",\n    \"戣\",\n    \"缞\",\n    \"耤\",\n    \"瑧\",\n    \"瑨\",\n    \"瑱\",\n    \"瑷\",\n    \"瑢\",\n    \"斠\",\n    \"摏\",\n    \"墕\",\n    \"墈\",\n    \"墐\",\n    \"墘\",\n    \"摴\",\n    \"銎\",\n    \"𡐓\",\n    \"墚\",\n    \"撖\",\n    \"靽\",\n    \"鞁\",\n    \"蔌\",\n    \"蔈\",\n    \"蓰\",\n    \"蔹\",\n    \"蔊\",\n    \"嘏\",\n    \"榰\",\n    \"榑\",\n    \"槚\",\n    \"𣗋\",\n    \"槜\",\n    \"榍\",\n    \"疐\",\n    \"酺\",\n    \"酾\",\n    \"酲\",\n    \"酴\",\n    \"碶\",\n    \"䃎\",\n    \"碨\",\n    \"𥔲\",\n    \"碹\",\n    \"碥\",\n    \"劂\",\n    \"䴗\",\n    \"夥\",\n    \"瞍\",\n    \"鹖\",\n    \"㬎\",\n    \"跽\",\n    \"蜾\",\n    \"幖\",\n    \"嶍\",\n    \"圙\",\n    \"𨱏\",\n    \"锺\",\n    \"锼\",\n    \"锽\",\n    \"锾\",\n    \"锿\",\n    \"镃\",\n    \"镄\",\n    \"镅\",\n    \"馝\",\n    \"鹙\",\n    \"箨\",\n    \"箖\",\n    \"劄\",\n    \"僬\",\n    \"僦\",\n    \"僔\",\n    \"僎\",\n    \"槃\",\n    \"㙦\",\n    \"鲒\",\n    \"鲕\",\n    \"鲖\",\n    \"鲗\",\n    \"鲘\",\n    \"鲙\",\n    \"𩽾\",\n    \"夐\",\n    \"獍\",\n    \"飗\",\n    \"凘\",\n    \"廑\",\n    \"廙\",\n    \"瘗\",\n    \"瘥\",\n    \"瘕\",\n    \"鲝\",\n    \"鄫\",\n    \"熇\",\n    \"漹\",\n    \"漖\",\n    \"潆\",\n    \"漤\",\n    \"潩\",\n    \"漼\",\n    \"漴\",\n    \"㽏\",\n    \"漈\",\n    \"漋\",\n    \"漻\",\n    \"慬\",\n    \"窬\",\n    \"窭\",\n    \"㮾\",\n    \"褕\",\n    \"禛\",\n    \"禚\",\n    \"隩\",\n    \"嫕\",\n    \"嫭\",\n    \"嫜\",\n    \"嫪\",\n    \"㻬\",\n    \"麹\",\n    \"璆\",\n    \"漦\",\n    \"叇\",\n    \"墣\",\n    \"墦\",\n    \"墡\",\n    \"劐\",\n    \"薁\",\n    
\"蕰\",\n    \"蔃\",\n    \"鼒\",\n    \"槱\",\n    \"鹝\",\n    \"磏\",\n    \"磉\",\n    \"殣\",\n    \"慭\",\n    \"霅\",\n    \"暵\",\n    \"暲\",\n    \"暶\",\n    \"踦\",\n    \"踣\",\n    \"䗖\",\n    \"蝘\",\n    \"蝲\",\n    \"蝤\",\n    \"噇\",\n    \"噂\",\n    \"噀\",\n    \"罶\",\n    \"嶲\",\n    \"嶓\",\n    \"㠇\",\n    \"嶟\",\n    \"嶒\",\n    \"镆\",\n    \"镈\",\n    \"镋\",\n    \"镎\",\n    \"镕\",\n    \"稹\",\n    \"儇\",\n    \"皞\",\n    \"皛\",\n    \"䴘\",\n    \"艎\",\n    \"艏\",\n    \"鹟\",\n    \"𩾃\",\n    \"鲦\",\n    \"鲪\",\n    \"鲬\",\n    \"橥\",\n    \"觭\",\n    \"鹠\",\n    \"鹡\",\n    \"糇\",\n    \"糈\",\n    \"翦\",\n    \"鹢\",\n    \"鹣\",\n    \"熛\",\n    \"潖\",\n    \"潵\",\n    \"㵐\",\n    \"澂\",\n    \"澛\",\n    \"瑬\",\n    \"潽\",\n    \"潾\",\n    \"潏\",\n    \"憭\",\n    \"憕\",\n    \"戭\",\n    \"褯\",\n    \"禤\",\n    \"嫽\",\n    \"遹\",\n    \"璥\",\n    \"璲\",\n    \"璒\",\n    \"憙\",\n    \"擐\",\n    \"鄹\",\n    \"薳\",\n    \"鞔\",\n    \"黇\",\n    \"蕗\",\n    \"薢\",\n    \"蕹\",\n    \"橞\",\n    \"橑\",\n    \"橦\",\n    \"醑\",\n    \"觱\",\n    \"磡\",\n    \"𥕢\",\n    \"磜\",\n    \"豮\",\n    \"鹾\",\n    \"虤\",\n    \"暿\",\n    \"曌\",\n    \"曈\",\n    \"㬚\",\n    \"蹅\",\n    \"踶\",\n    \"䗛\",\n    \"螗\",\n    \"疁\",\n    \"㠓\",\n    \"幪\",\n    \"嶦\",\n    \"𨱑\",\n    \"馞\",\n    \"穄\",\n    \"篚\",\n    \"篯\",\n    \"簉\",\n    \"鼽\",\n    \"衠\",\n    \"盦\",\n    \"螣\",\n    \"縢\",\n    \"鲭\",\n    \"鲯\",\n    \"鲰\",\n    \"鲺\",\n    \"鲹\",\n    \"亸\",\n    \"癀\",\n    \"瘭\",\n    \"羱\",\n    \"糒\",\n    \"燋\",\n    \"熻\",\n    \"燊\",\n    \"燚\",\n    \"燏\",\n    \"濩\",\n    \"濋\",\n    \"澪\",\n    \"澽\",\n    \"澴\",\n    \"澭\",\n    \"澼\",\n    \"憷\",\n    \"憺\",\n    \"懔\",\n    \"黉\",\n    \"嬛\",\n    \"鹨\",\n    \"翯\",\n    \"璱\",\n    \"𤩽\",\n    \"璬\",\n    \"璮\",\n    \"髽\",\n    \"擿\",\n    \"薿\",\n    \"薸\",\n    \"檑\",\n    \"櫆\",\n    \"檞\",\n    \"醨\",\n    \"繄\",\n    \"磹\",\n    \"磻\",\n    \"瞫\",\n    \"瞵\",\n    \"蹐\",\n    \"蟏\",\n    \"㘎\",\n    
\"镤\",\n    \"镥\",\n    \"镨\",\n    \"𨱔\",\n    \"矰\",\n    \"穙\",\n    \"穜\",\n    \"穟\",\n    \"簕\",\n    \"簃\",\n    \"簏\",\n    \"儦\",\n    \"魋\",\n    \"斶\",\n    \"艚\",\n    \"谿\",\n    \"䲠\",\n    \"鲾\",\n    \"鲿\",\n    \"鳁\",\n    \"鳂\",\n    \"鳈\",\n    \"鳉\",\n    \"獯\",\n    \"䗪\",\n    \"馘\",\n    \"襕\",\n    \"襚\",\n    \"螱\",\n    \"甓\",\n    \"嬬\",\n    \"嬥\",\n    \"𦈡\",\n    \"瓀\",\n    \"釐\",\n    \"鬶\",\n    \"爇\",\n    \"鞳\",\n    \"鞮\",\n    \"藟\",\n    \"藦\",\n    \"藨\",\n    \"鹲\",\n    \"檫\",\n    \"黡\",\n    \"礞\",\n    \"礌\",\n    \"𥖨\",\n    \"蹢\",\n    \"蹜\",\n    \"蟫\",\n    \"䗴\",\n    \"嚚\",\n    \"髃\",\n    \"镮\",\n    \"镱\",\n    \"酂\",\n    \"馧\",\n    \"簠\",\n    \"簝\",\n    \"簰\",\n    \"鼫\",\n    \"鼩\",\n    \"皦\",\n    \"臑\",\n    \"䲢\",\n    \"鳑\",\n    \"鳒\",\n    \"鹱\",\n    \"鹯\",\n    \"癗\",\n    \"𦒍\",\n    \"旞\",\n    \"翷\",\n    \"冁\",\n    \"䎖\",\n    \"瀔\",\n    \"瀍\",\n    \"瀌\",\n    \"襜\",\n    \"䴙\",\n    \"嚭\",\n    \"㰀\",\n    \"鬷\",\n    \"醭\",\n    \"蹯\",\n    \"蠋\",\n    \"翾\",\n    \"鳘\",\n    \"儳\",\n    \"儴\",\n    \"鼗\",\n    \"𩾌\",\n    \"鳚\",\n    \"鳛\",\n    \"麑\",\n    \"麖\",\n    \"蠃\",\n    \"彟\",\n    \"嬿\",\n    \"鬒\",\n    \"蘘\",\n    \"欂\",\n    \"醵\",\n    \"颥\",\n    \"甗\",\n    \"𨟠\",\n    \"巇\",\n    \"酅\",\n    \"髎\",\n    \"犨\",\n    \"𨭉\",\n    \"㸌\",\n    \"爔\",\n    \"瀱\",\n    \"瀹\",\n    \"瀼\",\n    \"瀵\",\n    \"襫\",\n    \"孅\",\n    \"骦\",\n    \"耰\",\n    \"𤫉\",\n    \"瓖\",\n    \"鬘\",\n    \"趯\",\n    \"罍\",\n    \"鼱\",\n    \"鳠\",\n    \"鳡\",\n    \"鳣\",\n    \"爟\",\n    \"爚\",\n    \"灈\",\n    \"韂\",\n    \"糵\",\n    \"蘼\",\n    \"礵\",\n    \"鹴\",\n    \"躔\",\n    \"皭\",\n    \"龢\",\n    \"鳤\",\n    \"亹\",\n    \"籥\",\n    \"鼷\",\n    \"玃\",\n    \"醾\",\n    \"齇\",\n    \"觿\",\n    \"蠼\",\n    \"𬣙\",\n    \"𬇕\",\n    \"𬣞\",\n    \"𬘓\",\n    \"𫭟\",\n    \"𫭢\",\n    \"𫇭\",\n    \"𫐄\",\n    \"𫵷\",\n    \"𬇙\",\n    \"𬣡\",\n    \"𫸩\",\n    \"𫘜\",\n    \"𬘘\",\n    \"𫘝\",\n    
\"𬨂\",\n    \"𬀩\",\n    \"𬀪\",\n    \"𬬩\",\n    \"𫍣\",\n    \"𬣳\",\n    \"𬩽\",\n    \"𬮿\",\n    \"𬯀\",\n    \"𫰛\",\n    \"𬳵\",\n    \"𬳶\",\n    \"𫠊\",\n    \"𬍛\",\n    \"鿍\",\n    \"𬜬\",\n    \"𪾢\",\n    \"𪨰\",\n    \"𫓧\",\n    \"𬬮\",\n    \"𬬱\",\n    \"𬬭\",\n    \"𬘡\",\n    \"𬳽\",\n    \"𬘩\",\n    \"𫄧\",\n    \"𪟝\",\n    \"𬍤\",\n    \"𫭼\",\n    \"𬜯\",\n    \"𬂩\",\n    \"𫠆\",\n    \"𬌗\",\n    \"𫑡\",\n    \"𪨶\",\n    \"𬬸\",\n    \"𬬻\",\n    \"𬬹\",\n    \"𬬿\",\n    \"𬭁\",\n    \"𫢸\",\n    \"𫗧\",\n    \"𬊈\",\n    \"𬒈\",\n    \"𬳿\",\n    \"𫄨\",\n    \"𬘫\",\n    \"𫮃\",\n    \"鿎\",\n    \"𬱖\",\n    \"𬟽\",\n    \"𫓯\",\n    \"𫟹\",\n    \"𫟼\",\n    \"𬇹\",\n    \"𬍡\",\n    \"𬤇\",\n    \"𫍯\",\n    \"𬤊\",\n    \"𫍲\",\n    \"𬯎\",\n    \"𬘬\",\n    \"𬘭\",\n    \"𬴂\",\n    \"𫘦\",\n    \"𫟅\",\n    \"𬘯\",\n    \"𫘧\",\n    \"𪣻\",\n    \"𬃊\",\n    \"𬷕\",\n    \"𫐐\",\n    \"𬹼\",\n    \"𫶇\",\n    \"𫖮\",\n    \"鿏\",\n    \"𬭊\",\n    \"𫓶\",\n    \"𬭎\",\n    \"𫖯\",\n    \"𬱟\",\n    \"𫛭\",\n    \"𫷷\",\n    \"𬮱\",\n    \"𬊤\",\n    \"𬴃\",\n    \"𫘨\",\n    \"𬪩\",\n    \"𬒔\",\n    \"𬨎\",\n    \"𫐓\",\n    \"𫫇\",\n    \"𫓹\",\n    \"𬭚\",\n    \"𬭛\",\n    \"𬕂\",\n    \"𬶋\",\n    \"𬶍\",\n    \"𫔶\",\n    \"𫌀\",\n    \"𫖳\",\n    \"𫘪\",\n    \"𫘬\",\n    \"𫞩\",\n    \"𪤗\",\n    \"𬸘\",\n    \"𬒗\",\n    \"𫚖\",\n    \"𬭤\",\n    \"𫚕\",\n    \"𬶐\",\n    \"𬶏\",\n    \"𬸚\",\n    \"𬤝\",\n    \"𬙂\",\n    \"𬭩\",\n    \"𬸣\",\n    \"𫍽\",\n    \"𬴊\",\n    \"𬞟\",\n    \"𫟦\",\n    \"𬺈\",\n    \"𫠜\",\n    \"𪩘\",\n    \"𬭬\",\n    \"𬭯\",\n    \"𫗴\",\n    \"𬸦\",\n    \"𫄷\",\n    \"𬭳\",\n    \"𬭶\",\n    \"𫔍\",\n    \"𬭸\",\n    \"𬭼\",\n    \"𫔎\",\n    \"𬸪\",\n    \"𬶟\",\n    \"𬶠\",\n    \"𬶨\",\n    \"𫄸\",\n    \"𬟁\",\n    \"𬙊\",\n    \"𬶭\",\n    \"𬶮\",\n    \"𬙋\",\n    \"𬺓\",\n    \"𫚭\",\n    \"廠\",\n    \"蔔\",\n    \"兒\",\n    \"幾\",\n    \"幹\",\n    \"虧\",\n    \"纔\",\n    \"與\",\n    \"萬\",\n    \"韆\",\n    \"億\",\n    \"個\",\n    \"廣\",\n    \"門\",\n    \"義\",\n    \"衛\",\n    \"飛\",\n    \"習\",\n    \"馬\",\n    
\"鄉\",\n    \"豐\",\n    \"開\",\n    \"無\",\n    \"雲\",\n    \"專\",\n    \"藝\",\n    \"廳\",\n    \"區\",\n    \"歷\",\n    \"曆\",\n    \"車\",\n    \"貝\",\n    \"岡\",\n    \"見\",\n    \"氣\",\n    \"長\",\n    \"僕\",\n    \"幣\",\n    \"僅\",\n    \"從\",\n    \"侖\",\n    \"倉\",\n    \"風\",\n    \"烏\",\n    \"鳳\",\n    \"爲\",\n    \"鬥\",\n    \"憶\",\n    \"計\",\n    \"訂\",\n    \"認\",\n    \"譏\",\n    \"醜\",\n    \"隊\",\n    \"辦\",\n    \"鄧\",\n    \"勸\",\n    \"雙\",\n    \"書\",\n    \"擊\",\n    \"撲\",\n    \"節\",\n    \"術\",\n    \"厲\",\n    \"龍\",\n    \"滅\",\n    \"軋\",\n    \"東\",\n    \"盧\",\n    \"業\",\n    \"舊\",\n    \"帥\",\n    \"歸\",\n    \"葉\",\n    \"電\",\n    \"號\",\n    \"衹\",\n    \"隻\",\n    \"嘰\",\n    \"嘆\",\n    \"們\",\n    \"儀\",\n    \"叢\",\n    \"爾\",\n    \"樂\",\n    \"處\",\n    \"鼕\",\n    \"鳥\",\n    \"務\",\n    \"飢\",\n    \"饑\",\n    \"馮\",\n    \"閃\",\n    \"蘭\",\n    \"匯\",\n    \"彙\",\n    \"頭\",\n    \"漢\",\n    \"寧\",\n    \"討\",\n    \"寫\",\n    \"讓\",\n    \"禮\",\n    \"訓\",\n    \"議\",\n    \"訊\",\n    \"記\",\n    \"齣\",\n    \"遼\",\n    \"邊\",\n    \"發\",\n    \"髮\",\n    \"聖\",\n    \"對\",\n    \"臺\",\n    \"颱\",\n    \"檯\",\n    \"糾\",\n    \"絲\",\n    \"動\",\n    \"鞏\",\n    \"執\",\n    \"擴\",\n    \"掃\",\n    \"場\",\n    \"揚\",\n    \"亞\",\n    \"樸\",\n    \"機\",\n    \"權\",\n    \"過\",\n    \"協\",\n    \"壓\",\n    \"厭\",\n    \"頁\",\n    \"誇\",\n    \"奪\",\n    \"達\",\n    \"夾\",\n    \"軌\",\n    \"堯\",\n    \"劃\",\n    \"邁\",\n    \"畢\",\n    \"貞\",\n    \"師\",\n    \"塵\",\n    \"當\",\n    \"噹\",\n    \"籲\",\n    \"嚇\",\n    \"蟲\",\n    \"麯\",\n    \"團\",\n    \"糰\",\n    \"嗎\",\n    \"嶼\",\n    \"歲\",\n    \"迴\",\n    \"豈\",\n    \"則\",\n    \"剛\",\n    \"網\",\n    \"硃\",\n    \"遷\",\n    \"喬\",\n    \"偉\",\n    \"傳\",\n    \"優\",\n    \"傷\",\n    \"價\",\n    \"倫\",\n    \"華\",\n    \"僞\",\n    \"嚮\",\n    \"後\",\n    \"會\",\n    \"殺\",\n    \"閤\",\n    \"衆\",\n    \"爺\",\n    \"傘\",\n    \"創\",\n    \"雜\",\n    \"負\",\n    
\"壯\",\n    \"衝\",\n    \"妝\",\n    \"莊\",\n    \"慶\",\n    \"劉\",\n    \"齊\",\n    \"産\",\n    \"閉\",\n    \"問\",\n    \"闖\",\n    \"關\",\n    \"燈\",\n    \"湯\",\n    \"興\",\n    \"講\",\n    \"諱\",\n    \"軍\",\n    \"訝\",\n    \"許\",\n    \"訛\",\n    \"論\",\n    \"訟\",\n    \"農\",\n    \"諷\",\n    \"設\",\n    \"訪\",\n    \"訣\",\n    \"尋\",\n    \"盡\",\n    \"儘\",\n    \"導\",\n    \"孫\",\n    \"陣\",\n    \"陽\",\n    \"階\",\n    \"陰\",\n    \"婦\",\n    \"媽\",\n    \"戲\",\n    \"觀\",\n    \"歡\",\n    \"買\",\n    \"紅\",\n    \"馱\",\n    \"纖\",\n    \"縴\",\n    \"馴\",\n    \"約\",\n    \"級\",\n    \"紀\",\n    \"馳\",\n    \"紉\",\n    \"壽\",\n    \"麥\",\n    \"瑪\",\n    \"進\",\n    \"遠\",\n    \"違\",\n    \"韌\",\n    \"運\",\n    \"撫\",\n    \"壇\",\n    \"罎\",\n    \"壞\",\n    \"摳\",\n    \"擾\",\n    \"貢\",\n    \"垻\",\n    \"壩\",\n    \"摺\",\n    \"掄\",\n    \"搶\",\n    \"墳\",\n    \"護\",\n    \"殻\",\n    \"塊\",\n    \"聲\",\n    \"報\",\n    \"擬\",\n    \"蕪\",\n    \"葦\",\n    \"蒼\",\n    \"嚴\",\n    \"蘆\",\n    \"勞\",\n    \"蘇\",\n    \"囌\",\n    \"極\",\n    \"楊\",\n    \"兩\",\n    \"麗\",\n    \"醫\",\n    \"勵\",\n    \"還\",\n    \"殲\",\n    \"來\",\n    \"連\",\n    \"軒\",\n    \"鹵\",\n    \"滷\",\n    \"堅\",\n    \"時\",\n    \"縣\",\n    \"裏\",\n    \"嘔\",\n    \"園\",\n    \"曠\",\n    \"圍\",\n    \"噸\",\n    \"郵\",\n    \"睏\",\n    \"員\",\n    \"聽\",\n    \"嗆\",\n    \"嗚\",\n    \"彆\",\n    \"嶇\",\n    \"崗\",\n    \"帳\",\n    \"財\",\n    \"針\",\n    \"釘\",\n    \"亂\",\n    \"體\",\n    \"傭\",\n    \"徹\",\n    \"餘\",\n    \"穀\",\n    \"鄰\",\n    \"腸\",\n    \"龜\",\n    \"猶\",\n    \"狽\",\n    \"條\",\n    \"島\",\n    \"飯\",\n    \"飲\",\n    \"係\",\n    \"繫\",\n    \"凍\",\n    \"狀\",\n    \"畝\",\n    \"庫\",\n    \"療\",\n    \"應\",\n    \"這\",\n    \"廬\",\n    \"閏\",\n    \"閑\",\n    \"間\",\n    \"悶\",\n    \"竈\",\n    \"燦\",\n    \"瀝\",\n    \"淪\",\n    \"滄\",\n    \"溝\",\n    \"滬\",\n    \"瀋\",\n    \"懷\",\n    \"憂\",\n    \"窮\",\n    \"證\",\n    \"啓\",\n    \"評\",\n    
\"補\",\n    \"識\",\n    \"詐\",\n    \"訴\",\n    \"診\",\n    \"詞\",\n    \"譯\",\n    \"靈\",\n    \"層\",\n    \"遲\",\n    \"張\",\n    \"際\",\n    \"陸\",\n    \"陳\",\n    \"墜\",\n    \"勁\",\n    \"鷄\",\n    \"緯\",\n    \"驅\",\n    \"純\",\n    \"紗\",\n    \"綱\",\n    \"納\",\n    \"駁\",\n    \"縱\",\n    \"紛\",\n    \"紙\",\n    \"紋\",\n    \"紡\",\n    \"驢\",\n    \"紐\",\n    \"環\",\n    \"責\",\n    \"現\",\n    \"錶\",\n    \"規\",\n    \"攏\",\n    \"揀\",\n    \"擔\",\n    \"頂\",\n    \"擁\",\n    \"勢\",\n    \"攔\",\n    \"擰\",\n    \"撥\",\n    \"擇\",\n    \"蘋\",\n    \"範\",\n    \"莖\",\n    \"樞\",\n    \"櫃\",\n    \"闆\",\n    \"鬆\",\n    \"槍\",\n    \"楓\",\n    \"構\",\n    \"喪\",\n    \"畫\",\n    \"棗\",\n    \"賣\",\n    \"鬱\",\n    \"礬\",\n    \"礦\",\n    \"碼\",\n    \"厠\",\n    \"奮\",\n    \"態\",\n    \"歐\",\n    \"毆\",\n    \"壟\",\n    \"轟\",\n    \"頃\",\n    \"轉\",\n    \"斬\",\n    \"輪\",\n    \"軟\",\n    \"齒\",\n    \"虜\",\n    \"腎\",\n    \"賢\",\n    \"國\",\n    \"暢\",\n    \"嚨\",\n    \"鳴\",\n    \"羅\",\n    \"幟\",\n    \"嶺\",\n    \"凱\",\n    \"敗\",\n    \"賬\",\n    \"販\",\n    \"貶\",\n    \"購\",\n    \"貯\",\n    \"圖\",\n    \"釣\",\n    \"製\",\n    \"颳\",\n    \"俠\",\n    \"僥\",\n    \"偵\",\n    \"側\",\n    \"憑\",\n    \"僑\",\n    \"貨\",\n    \"質\",\n    \"徑\",\n    \"捨\",\n    \"覓\",\n    \"貪\",\n    \"貧\",\n    \"膚\",\n    \"腫\",\n    \"脹\",\n    \"骯\",\n    \"脅\",\n    \"魚\",\n    \"獰\",\n    \"備\",\n    \"飾\",\n    \"飽\",\n    \"飼\",\n    \"變\",\n    \"龐\",\n    \"廟\",\n    \"瘧\",\n    \"劑\",\n    \"廢\",\n    \"閘\",\n    \"鬧\",\n    \"鄭\",\n    \"捲\",\n    \"單\",\n    \"爐\",\n    \"淺\",\n    \"濘\",\n    \"瀉\",\n    \"潑\",\n    \"澤\",\n    \"憐\",\n    \"學\",\n    \"寶\",\n    \"寵\",\n    \"審\",\n    \"簾\",\n    \"實\",\n    \"試\",\n    \"詩\",\n    \"誠\",\n    \"襯\",\n    \"視\",\n    \"話\",\n    \"誕\",\n    \"詭\",\n    \"詢\",\n    \"該\",\n    \"詳\",\n    \"肅\",\n    \"録\",\n    \"隸\",\n    \"彌\",\n    \"瀰\",\n    \"陝\",\n    \"駕\",\n    \"參\",\n    \"艱\",\n    
\"綫\",\n    \"練\",\n    \"組\",\n    \"紳\",\n    \"細\",\n    \"駛\",\n    \"織\",\n    \"駒\",\n    \"終\",\n    \"駐\",\n    \"絆\",\n    \"駝\",\n    \"紹\",\n    \"繹\",\n    \"經\",\n    \"貫\",\n    \"貳\",\n    \"幫\",\n    \"項\",\n    \"挾\",\n    \"撓\",\n    \"趙\",\n    \"擋\",\n    \"墊\",\n    \"擠\",\n    \"揮\",\n    \"薦\",\n    \"帶\",\n    \"繭\",\n    \"蕩\",\n    \"榮\",\n    \"葷\",\n    \"熒\",\n    \"鬍\",\n    \"蔭\",\n    \"藥\",\n    \"標\",\n    \"棧\",\n    \"棟\",\n    \"欄\",\n    \"檸\",\n    \"樹\",\n    \"鹹\",\n    \"磚\",\n    \"硯\",\n    \"麵\",\n    \"牽\",\n    \"鷗\",\n    \"殘\",\n    \"軸\",\n    \"輕\",\n    \"鴉\",\n    \"戰\",\n    \"點\",\n    \"臨\",\n    \"覽\",\n    \"竪\",\n    \"嘗\",\n    \"啞\",\n    \"顯\",\n    \"貴\",\n    \"蝦\",\n    \"蟻\",\n    \"螞\",\n    \"雖\",\n    \"駡\",\n    \"勛\",\n    \"嘩\",\n    \"響\",\n    \"喲\",\n    \"峽\",\n    \"罰\",\n    \"賤\",\n    \"貼\",\n    \"貽\",\n    \"鈣\",\n    \"鈍\",\n    \"鈔\",\n    \"鍾\",\n    \"鐘\",\n    \"鋼\",\n    \"鈉\",\n    \"鑰\",\n    \"欽\",\n    \"鈞\",\n    \"鈎\",\n    \"鈕\",\n    \"氈\",\n    \"氫\",\n    \"選\",\n    \"適\",\n    \"種\",\n    \"鞦\",\n    \"復\",\n    \"複\",\n    \"倆\",\n    \"貸\",\n    \"順\",\n    \"儉\",\n    \"須\",\n    \"鬚\",\n    \"劍\",\n    \"朧\",\n    \"膽\",\n    \"勝\",\n    \"狹\",\n    \"獅\",\n    \"獨\",\n    \"獄\",\n    \"貿\",\n    \"餌\",\n    \"饒\",\n    \"蝕\",\n    \"餃\",\n    \"餅\",\n    \"巒\",\n    \"彎\",\n    \"將\",\n    \"奬\",\n    \"瘡\",\n    \"瘋\",\n    \"親\",\n    \"閨\",\n    \"聞\",\n    \"閩\",\n    \"閥\",\n    \"閣\",\n    \"養\",\n    \"薑\",\n    \"類\",\n    \"婁\",\n    \"總\",\n    \"煉\",\n    \"爍\",\n    \"爛\",\n    \"窪\",\n    \"潔\",\n    \"灑\",\n    \"澆\",\n    \"濁\",\n    \"測\",\n    \"瀏\",\n    \"濟\",\n    \"渾\",\n    \"濃\",\n    \"惱\",\n    \"舉\",\n    \"覺\",\n    \"憲\",\n    \"竊\",\n    \"誡\",\n    \"誣\",\n    \"語\",\n    \"襖\",\n    \"誤\",\n    \"誘\",\n    \"誨\",\n    \"説\",\n    \"誦\",\n    \"墾\",\n    \"晝\",\n    \"費\",\n    \"遜\",\n    \"隕\",\n    \"險\",\n    \"嬌\",\n    
\"賀\",\n    \"壘\",\n    \"綁\",\n    \"絨\",\n    \"結\",\n    \"繞\",\n    \"驕\",\n    \"繪\",\n    \"給\",\n    \"絢\",\n    \"駱\",\n    \"絡\",\n    \"絶\",\n    \"絞\",\n    \"駭\",\n    \"統\",\n    \"艷\",\n    \"蠶\",\n    \"頑\",\n    \"盞\",\n    \"撈\",\n    \"載\",\n    \"趕\",\n    \"鹽\",\n    \"損\",\n    \"撿\",\n    \"摯\",\n    \"剝\",\n    \"熱\",\n    \"搗\",\n    \"壺\",\n    \"聶\",\n    \"萊\",\n    \"蓮\",\n    \"獲\",\n    \"穫\",\n    \"惡\",\n    \"噁\",\n    \"瑩\",\n    \"鶯\",\n    \"檔\",\n    \"橋\",\n    \"樺\",\n    \"樁\",\n    \"樣\",\n    \"賈\",\n    \"礫\",\n    \"礎\",\n    \"顧\",\n    \"轎\",\n    \"較\",\n    \"頓\",\n    \"斃\",\n    \"緻\",\n    \"慮\",\n    \"監\",\n    \"緊\",\n    \"黨\",\n    \"曬\",\n    \"曉\",\n    \"嘮\",\n    \"鴨\",\n    \"暈\",\n    \"鴦\",\n    \"罷\",\n    \"圓\",\n    \"賊\",\n    \"賄\",\n    \"賂\",\n    \"贜\",\n    \"錢\",\n    \"鉗\",\n    \"鑽\",\n    \"鉀\",\n    \"鐵\",\n    \"鈴\",\n    \"鉛\",\n    \"犧\",\n    \"敵\",\n    \"積\",\n    \"稱\",\n    \"筆\",\n    \"債\",\n    \"傾\",\n    \"賃\",\n    \"艦\",\n    \"艙\",\n    \"聳\",\n    \"愛\",\n    \"頒\",\n    \"頌\",\n    \"臟\",\n    \"髒\",\n    \"臍\",\n    \"膠\",\n    \"腦\",\n    \"膿\",\n    \"鴕\",\n    \"鴛\",\n    \"皺\",\n    \"餓\",\n    \"餒\",\n    \"戀\",\n    \"槳\",\n    \"漿\",\n    \"準\",\n    \"癥\",\n    \"齋\",\n    \"離\",\n    \"資\",\n    \"競\",\n    \"閲\",\n    \"煩\",\n    \"燒\",\n    \"燭\",\n    \"遞\",\n    \"濤\",\n    \"澇\",\n    \"渦\",\n    \"塗\",\n    \"滌\",\n    \"潤\",\n    \"澗\",\n    \"漲\",\n    \"燙\",\n    \"澀\",\n    \"憫\",\n    \"寬\",\n    \"傢\",\n    \"賓\",\n    \"竅\",\n    \"請\",\n    \"諸\",\n    \"諾\",\n    \"讀\",\n    \"誹\",\n    \"襪\",\n    \"課\",\n    \"誰\",\n    \"調\",\n    \"諒\",\n    \"諄\",\n    \"談\",\n    \"誼\",\n    \"懇\",\n    \"劇\",\n    \"難\",\n    \"預\",\n    \"絹\",\n    \"綉\",\n    \"驗\",\n    \"繼\",\n    \"駿\",\n    \"瑣\",\n    \"擲\",\n    \"據\",\n    \"摻\",\n    \"職\",\n    \"蘿\",\n    \"螢\",\n    \"營\",\n    \"蕭\",\n    \"薩\",\n    \"夢\",\n    \"檢\",\n    \"醖\",\n    
\"碩\",\n    \"聾\",\n    \"襲\",\n    \"輔\",\n    \"輛\",\n    \"顱\",\n    \"懸\",\n    \"躍\",\n    \"纍\",\n    \"囉\",\n    \"嘯\",\n    \"嶄\",\n    \"邏\",\n    \"嬰\",\n    \"銬\",\n    \"鐺\",\n    \"鋁\",\n    \"銅\",\n    \"銘\",\n    \"鏟\",\n    \"銀\",\n    \"矯\",\n    \"穢\",\n    \"籠\",\n    \"償\",\n    \"軀\",\n    \"釁\",\n    \"銜\",\n    \"盤\",\n    \"鴿\",\n    \"斂\",\n    \"領\",\n    \"臉\",\n    \"獵\",\n    \"餡\",\n    \"館\",\n    \"癢\",\n    \"鏇\",\n    \"閻\",\n    \"闡\",\n    \"蓋\",\n    \"斷\",\n    \"獸\",\n    \"鴻\",\n    \"漸\",\n    \"淵\",\n    \"漁\",\n    \"澱\",\n    \"滲\",\n    \"慚\",\n    \"懼\",\n    \"驚\",\n    \"慘\",\n    \"慣\",\n    \"謀\",\n    \"諜\",\n    \"謊\",\n    \"諧\",\n    \"禱\",\n    \"禍\",\n    \"謂\",\n    \"諺\",\n    \"謎\",\n    \"彈\",\n    \"墮\",\n    \"隨\",\n    \"隱\",\n    \"嬸\",\n    \"頗\",\n    \"頸\",\n    \"績\",\n    \"緒\",\n    \"續\",\n    \"騎\",\n    \"綽\",\n    \"繩\",\n    \"維\",\n    \"綿\",\n    \"綳\",\n    \"綢\",\n    \"綜\",\n    \"綻\",\n    \"緑\",\n    \"綴\",\n    \"瓊\",\n    \"趨\",\n    \"攬\",\n    \"攙\",\n    \"擱\",\n    \"摟\",\n    \"攪\",\n    \"聯\",\n    \"蔣\",\n    \"韓\",\n    \"橢\",\n    \"確\",\n    \"頰\",\n    \"靂\",\n    \"暫\",\n    \"翹\",\n    \"輩\",\n    \"鑿\",\n    \"輝\",\n    \"賞\",\n    \"睞\",\n    \"噴\",\n    \"疇\",\n    \"踐\",\n    \"遺\",\n    \"鵑\",\n    \"賦\",\n    \"賭\",\n    \"贖\",\n    \"賜\",\n    \"賠\",\n    \"鑄\",\n    \"鋪\",\n    \"鏈\",\n    \"銷\",\n    \"鎖\",\n    \"鋤\",\n    \"鍋\",\n    \"銹\",\n    \"鋒\",\n    \"鋅\",\n    \"鋭\",\n    \"鵝\",\n    \"築\",\n    \"篩\",\n    \"儲\",\n    \"懲\",\n    \"禦\",\n    \"釋\",\n    \"臘\",\n    \"魯\",\n    \"憊\",\n    \"饋\",\n    \"饞\",\n    \"裝\",\n    \"蠻\",\n    \"闊\",\n    \"糞\",\n    \"滯\",\n    \"濕\",\n    \"潰\",\n    \"濺\",\n    \"灣\",\n    \"憤\",\n    \"竄\",\n    \"窩\",\n    \"褲\",\n    \"禪\",\n    \"謝\",\n    \"謡\",\n    \"謗\",\n    \"謙\",\n    \"屬\",\n    \"屢\",\n    \"緬\",\n    \"纜\",\n    \"緝\",\n    \"緞\",\n    \"緩\",\n    \"締\",\n    \"縷\",\n    \"騙\",\n    
\"編\",\n    \"騷\",\n    \"緣\",\n    \"鵡\",\n    \"攝\",\n    \"擺\",\n    \"襬\",\n    \"攤\",\n    \"鵲\",\n    \"藍\",\n    \"濛\",\n    \"懞\",\n    \"矇\",\n    \"獻\",\n    \"欖\",\n    \"樓\",\n    \"賴\",\n    \"礙\",\n    \"尷\",\n    \"霧\",\n    \"輻\",\n    \"輯\",\n    \"輸\",\n    \"頻\",\n    \"齡\",\n    \"鑒\",\n    \"蹺\",\n    \"蝸\",\n    \"錯\",\n    \"錨\",\n    \"錫\",\n    \"鑼\",\n    \"錘\",\n    \"錐\",\n    \"錦\",\n    \"鍵\",\n    \"鋸\",\n    \"錳\",\n    \"辭\",\n    \"頽\",\n    \"籌\",\n    \"簽\",\n    \"籤\",\n    \"簡\",\n    \"膩\",\n    \"鵬\",\n    \"騰\",\n    \"鮑\",\n    \"穎\",\n    \"觸\",\n    \"雛\",\n    \"饃\",\n    \"餾\",\n    \"醬\",\n    \"謄\",\n    \"糧\",\n    \"數\",\n    \"滿\",\n    \"濾\",\n    \"濫\",\n    \"灕\",\n    \"濱\",\n    \"灘\",\n    \"譽\",\n    \"窺\",\n    \"寢\",\n    \"謹\",\n    \"謬\",\n    \"闢\",\n    \"縛\",\n    \"縫\",\n    \"纏\",\n    \"繽\",\n    \"贅\",\n    \"墻\",\n    \"衊\",\n    \"藹\",\n    \"檻\",\n    \"釀\",\n    \"願\",\n    \"轄\",\n    \"輾\",\n    \"顆\",\n    \"踴\",\n    \"蠟\",\n    \"蠅\",\n    \"蟬\",\n    \"賺\",\n    \"鍬\",\n    \"鍛\",\n    \"鍍\",\n    \"穩\",\n    \"籮\",\n    \"簫\",\n    \"輿\",\n    \"鮮\",\n    \"饅\",\n    \"瀟\",\n    \"賽\",\n    \"譚\",\n    \"譜\",\n    \"騾\",\n    \"縮\",\n    \"攆\",\n    \"聰\",\n    \"藴\",\n    \"櫻\",\n    \"飄\",\n    \"黴\",\n    \"瞞\",\n    \"題\",\n    \"囑\",\n    \"鎮\",\n    \"鎬\",\n    \"鎊\",\n    \"簍\",\n    \"鯉\",\n    \"鯽\",\n    \"癟\",\n    \"癱\",\n    \"顔\",\n    \"鯊\",\n    \"瀾\",\n    \"額\",\n    \"譴\",\n    \"鶴\",\n    \"繚\",\n    \"顛\",\n    \"轍\",\n    \"鸚\",\n    \"贈\",\n    \"鏡\",\n    \"贊\",\n    \"籃\",\n    \"籬\",\n    \"鯨\",\n    \"癮\",\n    \"辯\",\n    \"瀕\",\n    \"懶\",\n    \"繮\",\n    \"繳\",\n    \"矚\",\n    \"贍\",\n    \"鰐\",\n    \"辮\",\n    \"贏\",\n    \"驟\",\n    \"囂\",\n    \"鐮\",\n    \"鰭\",\n    \"鷹\",\n    \"巔\",\n    \"顫\",\n    \"癬\",\n    \"鱉\",\n    \"鬢\",\n    \"鱗\",\n    \"躪\",\n    \"贛\",\n    \"鑲\",\n    \"韋\",\n    \"閂\",\n    \"訃\",\n    \"勱\",\n    \"芻\",\n    
\"鄺\",\n    \"訐\",\n    \"訌\",\n    \"訕\",\n    \"訖\",\n    \"馭\",\n    \"璣\",\n    \"壙\",\n    \"捫\",\n    \"薌\",\n    \"厙\",\n    \"釔\",\n    \"傴\",\n    \"倀\",\n    \"傖\",\n    \"獷\",\n    \"獁\",\n    \"鳬\",\n    \"鄔\",\n    \"餳\",\n    \"懺\",\n    \"謳\",\n    \"詎\",\n    \"訥\",\n    \"紆\",\n    \"紂\",\n    \"紇\",\n    \"紈\",\n    \"璵\",\n    \"摶\",\n    \"塢\",\n    \"㩳\",\n    \"蕓\",\n    \"藶\",\n    \"莧\",\n    \"萇\",\n    \"蓯\",\n    \"磯\",\n    \"奩\",\n    \"歟\",\n    \"軔\",\n    \"鄴\",\n    \"嘸\",\n    \"囈\",\n    \"嚦\",\n    \"暘\",\n    \"唄\",\n    \"幃\",\n    \"峴\",\n    \"嵐\",\n    \"圇\",\n    \"釗\",\n    \"釙\",\n    \"釕\",\n    \"僉\",\n    \"鳩\",\n    \"鄒\",\n    \"飩\",\n    \"餼\",\n    \"飪\",\n    \"飫\",\n    \"飭\",\n    \"廡\",\n    \"癤\",\n    \"闈\",\n    \"閎\",\n    \"閔\",\n    \"煬\",\n    \"灃\",\n    \"漚\",\n    \"渢\",\n    \"潙\",\n    \"憮\",\n    \"慪\",\n    \"愾\",\n    \"悵\",\n    \"愴\",\n    \"詁\",\n    \"訶\",\n    \"詛\",\n    \"詆\",\n    \"謅\",\n    \"詔\",\n    \"詒\",\n    \"隴\",\n    \"陘\",\n    \"嫵\",\n    \"嫗\",\n    \"嬀\",\n    \"剄\",\n    \"紜\",\n    \"紕\",\n    \"紝\",\n    \"綸\",\n    \"紓\",\n    \"瑋\",\n    \"匭\",\n    \"壚\",\n    \"擓\",\n    \"蘢\",\n    \"蔦\",\n    \"塋\",\n    \"煢\",\n    \"櫪\",\n    \"梘\",\n    \"棖\",\n    \"樅\",\n    \"碭\",\n    \"甌\",\n    \"郟\",\n    \"軛\",\n    \"鳶\",\n    \"曇\",\n    \"蟣\",\n    \"黽\",\n    \"嚀\",\n    \"噝\",\n    \"巋\",\n    \"劌\",\n    \"剴\",\n    \"嶧\",\n    \"釷\",\n    \"釺\",\n    \"釧\",\n    \"釩\",\n    \"釹\",\n    \"釵\",\n    \"儈\",\n    \"儕\",\n    \"儂\",\n    \"劊\",\n    \"慫\",\n    \"糴\",\n    \"戧\",\n    \"膞\",\n    \"邇\",\n    \"梟\",\n    \"餞\",\n    \"飴\",\n    \"癘\",\n    \"瘍\",\n    \"煒\",\n    \"熰\",\n    \"熗\",\n    \"瀧\",\n    \"瀘\",\n    \"濼\",\n    \"涇\",\n    \"㥮\",\n    \"懌\",\n    \"誆\",\n    \"誄\",\n    \"詿\",\n    \"詰\",\n    \"詼\",\n    \"鄆\",\n    \"禕\",\n    \"誅\",\n    \"詵\",\n    \"詬\",\n    \"詮\",\n    \"詣\",\n    \"諍\",\n    \"詫\",\n    \"諢\",\n    \"詡\",\n    
\"駑\",\n    \"紺\",\n    \"紲\",\n    \"紱\",\n    \"駟\",\n    \"駙\",\n    \"縐\",\n    \"絀\",\n    \"驛\",\n    \"駘\",\n    \"瓏\",\n    \"頇\",\n    \"埡\",\n    \"撾\",\n    \"撻\",\n    \"賁\",\n    \"壋\",\n    \"撏\",\n    \"莢\",\n    \"貰\",\n    \"蓽\",\n    \"蕎\",\n    \"薈\",\n    \"薺\",\n    \"堊\",\n    \"滎\",\n    \"犖\",\n    \"蕁\",\n    \"藎\",\n    \"蓀\",\n    \"蕒\",\n    \"葤\",\n    \"櫛\",\n    \"櫳\",\n    \"櫨\",\n    \"櫟\",\n    \"檉\",\n    \"酈\",\n    \"硨\",\n    \"碸\",\n    \"殤\",\n    \"軲\",\n    \"軻\",\n    \"轤\",\n    \"軼\",\n    \"軫\",\n    \"蠆\",\n    \"覘\",\n    \"瞘\",\n    \"嘵\",\n    \"嗶\",\n    \"噦\",\n    \"剮\",\n    \"鄖\",\n    \"噲\",\n    \"噥\",\n    \"嶢\",\n    \"幀\",\n    \"嶠\",\n    \"貺\",\n    \"鈈\",\n    \"鈦\",\n    \"鋇\",\n    \"鈑\",\n    \"鈐\",\n    \"鎢\",\n    \"鈁\",\n    \"鈀\",\n    \"篤\",\n    \"儔\",\n    \"儼\",\n    \"儷\",\n    \"腖\",\n    \"臚\",\n    \"脛\",\n    \"鴇\",\n    \"獪\",\n    \"颮\",\n    \"猻\",\n    \"餉\",\n    \"餄\",\n    \"餎\",\n    \"孿\",\n    \"孌\",\n    \"癧\",\n    \"瘲\",\n    \"颯\",\n    \"闥\",\n    \"閭\",\n    \"闓\",\n    \"閡\",\n    \"熾\",\n    \"烴\",\n    \"浹\",\n    \"澮\",\n    \"滸\",\n    \"潯\",\n    \"濜\",\n    \"慟\",\n    \"懨\",\n    \"愷\",\n    \"惻\",\n    \"惲\",\n    \"誚\",\n    \"禰\",\n    \"誥\",\n    \"誑\",\n    \"鴆\",\n    \"婭\",\n    \"嬈\",\n    \"懟\",\n    \"絝\",\n    \"驍\",\n    \"驊\",\n    \"絎\",\n    \"絳\",\n    \"駢\",\n    \"頊\",\n    \"璫\",\n    \"琿\",\n    \"塒\",\n    \"塤\",\n    \"堝\",\n    \"贄\",\n    \"蒔\",\n    \"萵\",\n    \"蕕\",\n    \"鴣\",\n    \"蒓\",\n    \"橈\",\n    \"楨\",\n    \"榿\",\n    \"檜\",\n    \"邐\",\n    \"礪\",\n    \"礱\",\n    \"軾\",\n    \"輊\",\n    \"輅\",\n    \"鶇\",\n    \"躉\",\n    \"齔\",\n    \"鸕\",\n    \"矓\",\n    \"嘜\",\n    \"鴞\",\n    \"蜆\",\n    \"嗩\",\n    \"嶗\",\n    \"崍\",\n    \"覬\",\n    \"賅\",\n    \"鈺\",\n    \"鉦\",\n    \"鈷\",\n    \"鉢\",\n    \"鈸\",\n    \"鉞\",\n    \"鉭\",\n    \"鉬\",\n    \"鈿\",\n    \"鈾\",\n    \"鉑\",\n    \"鑠\",\n    \"鉚\",\n    \"鈰\",\n    
\"鉉\",\n    \"鉈\",\n    \"鉍\",\n    \"鈮\",\n    \"鈹\",\n    \"鏺\",\n    \"鐸\",\n    \"氬\",\n    \"筧\",\n    \"頎\",\n    \"徠\",\n    \"膾\",\n    \"鴟\",\n    \"璽\",\n    \"鴝\",\n    \"獫\",\n    \"裊\",\n    \"餑\",\n    \"欒\",\n    \"攣\",\n    \"癰\",\n    \"痙\",\n    \"頏\",\n    \"閫\",\n    \"鬮\",\n    \"誾\",\n    \"閬\",\n    \"鄲\",\n    \"燁\",\n    \"燴\",\n    \"燼\",\n    \"淶\",\n    \"漣\",\n    \"潿\",\n    \"慳\",\n    \"諏\",\n    \"諑\",\n    \"禎\",\n    \"諉\",\n    \"諛\",\n    \"諗\",\n    \"諂\",\n    \"誶\",\n    \"媧\",\n    \"嫻\",\n    \"綆\",\n    \"驪\",\n    \"綃\",\n    \"騁\",\n    \"綏\",\n    \"縧\",\n    \"綈\",\n    \"駸\",\n    \"鷥\",\n    \"燾\",\n    \"璉\",\n    \"麩\",\n    \"擄\",\n    \"摑\",\n    \"鷙\",\n    \"撣\",\n    \"慤\",\n    \"摜\",\n    \"縈\",\n    \"槤\",\n    \"覡\",\n    \"欞\",\n    \"嗇\",\n    \"匱\",\n    \"硤\",\n    \"磽\",\n    \"鴯\",\n    \"龔\",\n    \"殞\",\n    \"殮\",\n    \"賚\",\n    \"輒\",\n    \"塹\",\n    \"嘖\",\n    \"囀\",\n    \"嚙\",\n    \"蹌\",\n    \"蠣\",\n    \"蠱\",\n    \"蟶\",\n    \"幘\",\n    \"幗\",\n    \"賕\",\n    \"賑\",\n    \"賒\",\n    \"銠\",\n    \"鉺\",\n    \"鋏\",\n    \"鐃\",\n    \"銦\",\n    \"鎧\",\n    \"鍘\",\n    \"銖\",\n    \"銑\",\n    \"鋌\",\n    \"鏵\",\n    \"銓\",\n    \"鎩\",\n    \"鉿\",\n    \"銚\",\n    \"鉻\",\n    \"錚\",\n    \"銫\",\n    \"鉸\",\n    \"銥\",\n    \"銃\",\n    \"銨\",\n    \"銣\",\n    \"鴰\",\n    \"穠\",\n    \"箋\",\n    \"籩\",\n    \"僨\",\n    \"僂\",\n    \"皚\",\n    \"鴴\",\n    \"艫\",\n    \"龕\",\n    \"玀\",\n    \"獼\",\n    \"餜\",\n    \"餛\",\n    \"鸞\",\n    \"闍\",\n    \"閾\",\n    \"閹\",\n    \"閶\",\n    \"鬩\",\n    \"閽\",\n    \"閼\",\n    \"羥\",\n    \"糲\",\n    \"燜\",\n    \"漬\",\n    \"瀆\",\n    \"澠\",\n    \"愜\",\n    \"憚\",\n    \"諶\",\n    \"諫\",\n    \"皸\",\n    \"謔\",\n    \"襠\",\n    \"謁\",\n    \"諤\",\n    \"諭\",\n    \"諼\",\n    \"讒\",\n    \"諳\",\n    \"諦\",\n    \"諞\",\n    \"糶\",\n    \"嬋\",\n    \"綾\",\n    \"騏\",\n    \"綺\",\n    \"緋\",\n    \"緔\",\n    \"騍\",\n    \"緄\",\n    \"騅\",\n    
\"綬\",\n    \"綹\",\n    \"綣\",\n    \"綰\",\n    \"驂\",\n    \"緇\",\n    \"靚\",\n    \"輦\",\n    \"黿\",\n    \"頡\",\n    \"撳\",\n    \"蟄\",\n    \"壪\",\n    \"蔞\",\n    \"櫝\",\n    \"欏\",\n    \"賫\",\n    \"鵓\",\n    \"鸝\",\n    \"殫\",\n    \"輥\",\n    \"輞\",\n    \"槧\",\n    \"輟\",\n    \"輜\",\n    \"瞼\",\n    \"躒\",\n    \"蛺\",\n    \"蟯\",\n    \"螄\",\n    \"蠐\",\n    \"嘍\",\n    \"嶸\",\n    \"嶁\",\n    \"賧\",\n    \"鋙\",\n    \"錸\",\n    \"鏗\",\n    \"鋥\",\n    \"鋰\",\n    \"鋯\",\n    \"鋨\",\n    \"銼\",\n    \"鐧\",\n    \"銻\",\n    \"鋃\",\n    \"鋦\",\n    \"錒\",\n    \"犢\",\n    \"鵠\",\n    \"篳\",\n    \"牘\",\n    \"儻\",\n    \"儐\",\n    \"儺\",\n    \"嬃\",\n    \"頜\",\n    \"鵒\",\n    \"魷\",\n    \"魨\",\n    \"魴\",\n    \"潁\",\n    \"颶\",\n    \"觴\",\n    \"熲\",\n    \"餷\",\n    \"餿\",\n    \"褻\",\n    \"臠\",\n    \"癆\",\n    \"癇\",\n    \"賡\",\n    \"頦\",\n    \"鷳\",\n    \"闌\",\n    \"闃\",\n    \"闋\",\n    \"鵜\",\n    \"憒\",\n    \"嚳\",\n    \"謨\",\n    \"褳\",\n    \"襇\",\n    \"讜\",\n    \"謖\",\n    \"謚\",\n    \"謐\",\n    \"騭\",\n    \"巰\",\n    \"翬\",\n    \"騖\",\n    \"緙\",\n    \"緗\",\n    \"緘\",\n    \"緹\",\n    \"緲\",\n    \"緦\",\n    \"緱\",\n    \"縋\",\n    \"緡\",\n    \"饗\",\n    \"耮\",\n    \"驁\",\n    \"韞\",\n    \"攄\",\n    \"擯\",\n    \"轂\",\n    \"驀\",\n    \"鶓\",\n    \"薊\",\n    \"蘺\",\n    \"鎣\",\n    \"頤\",\n    \"櫚\",\n    \"櫸\",\n    \"磧\",\n    \"磣\",\n    \"鵪\",\n    \"輳\",\n    \"齟\",\n    \"齙\",\n    \"韙\",\n    \"囁\",\n    \"躂\",\n    \"蹕\",\n    \"躚\",\n    \"躋\",\n    \"噯\",\n    \"鍺\",\n    \"錛\",\n    \"錡\",\n    \"鍀\",\n    \"錁\",\n    \"錕\",\n    \"錮\",\n    \"鍁\",\n    \"錈\",\n    \"錠\",\n    \"錙\",\n    \"覦\",\n    \"頷\",\n    \"鮁\",\n    \"鮃\",\n    \"鮎\",\n    \"鱸\",\n    \"穌\",\n    \"鮒\",\n    \"鮐\",\n    \"鵮\",\n    \"颼\",\n    \"饈\",\n    \"鶉\",\n    \"瘮\",\n    \"闔\",\n    \"闐\",\n    \"闕\",\n    \"灧\",\n    \"瀅\",\n    \"潷\",\n    \"灤\",\n    \"澦\",\n    \"懾\",\n    \"鱟\",\n    \"騫\",\n    \"竇\",\n    \"謾\",\n    
\"謫\",\n    \"嬡\",\n    \"嬪\",\n    \"縉\",\n    \"縝\",\n    \"縟\",\n    \"轡\",\n    \"騮\",\n    \"縞\",\n    \"縭\",\n    \"縊\",\n    \"縑\",\n    \"騸\",\n    \"覯\",\n    \"韜\",\n    \"靉\",\n    \"攖\",\n    \"薔\",\n    \"藺\",\n    \"鶘\",\n    \"檳\",\n    \"櫧\",\n    \"釅\",\n    \"殯\",\n    \"霽\",\n    \"轅\",\n    \"齜\",\n    \"齦\",\n    \"瞜\",\n    \"曖\",\n    \"躊\",\n    \"蟈\",\n    \"鶚\",\n    \"嚶\",\n    \"羆\",\n    \"賻\",\n    \"罌\",\n    \"鶻\",\n    \"鍥\",\n    \"鍇\",\n    \"鍶\",\n    \"鍔\",\n    \"鍤\",\n    \"鏘\",\n    \"鎂\",\n    \"鏤\",\n    \"簀\",\n    \"篋\",\n    \"簞\",\n    \"籙\",\n    \"臏\",\n    \"鮭\",\n    \"鮪\",\n    \"鱭\",\n    \"鮫\",\n    \"鱘\",\n    \"饉\",\n    \"鑾\",\n    \"瘻\",\n    \"闞\",\n    \"鮝\",\n    \"糝\",\n    \"鷀\",\n    \"瀲\",\n    \"濰\",\n    \"譖\",\n    \"褸\",\n    \"譙\",\n    \"讕\",\n    \"譎\",\n    \"鶥\",\n    \"嬙\",\n    \"鶩\",\n    \"驃\",\n    \"縹\",\n    \"縵\",\n    \"縲\",\n    \"纓\",\n    \"驄\",\n    \"繆\",\n    \"繅\",\n    \"耬\",\n    \"瓔\",\n    \"擷\",\n    \"擼\",\n    \"攛\",\n    \"聵\",\n    \"覲\",\n    \"韃\",\n    \"鞽\",\n    \"蘄\",\n    \"賾\",\n    \"檣\",\n    \"靨\",\n    \"魘\",\n    \"饜\",\n    \"轆\",\n    \"齬\",\n    \"齪\",\n    \"覷\",\n    \"顒\",\n    \"躓\",\n    \"躑\",\n    \"蠑\",\n    \"螻\",\n    \"顎\",\n    \"嚕\",\n    \"顓\",\n    \"鑷\",\n    \"鎘\",\n    \"鎸\",\n    \"鎳\",\n    \"鎦\",\n    \"鎰\",\n    \"鎵\",\n    \"鑌\",\n    \"簣\",\n    \"鷂\",\n    \"鯁\",\n    \"鱺\",\n    \"鰱\",\n    \"鰹\",\n    \"鰣\",\n    \"鯀\",\n    \"鯇\",\n    \"觶\",\n    \"饊\",\n    \"饌\",\n    \"齏\",\n    \"讞\",\n    \"襤\",\n    \"譫\",\n    \"屨\",\n    \"纈\",\n    \"繕\",\n    \"繒\",\n    \"驏\",\n    \"擻\",\n    \"顳\",\n    \"顢\",\n    \"藪\",\n    \"櫓\",\n    \"櫞\",\n    \"贋\",\n    \"飆\",\n    \"鏨\",\n    \"轔\",\n    \"蟎\",\n    \"鐯\",\n    \"鏢\",\n    \"鏜\",\n    \"鏝\",\n    \"鏰\",\n    \"鏞\",\n    \"鏑\",\n    \"鏃\",\n    \"鏐\",\n    \"氌\",\n    \"穡\",\n    \"魎\",\n    \"鯪\",\n    \"鯡\",\n    \"鯤\",\n    \"鯧\",\n    \"鯝\",\n    \"鯢\",\n    
\"鯛\",\n    \"鯔\",\n    \"獺\",\n    \"鷓\",\n    \"贇\",\n    \"癭\",\n    \"斕\",\n    \"瀨\",\n    \"顙\",\n    \"繾\",\n    \"繰\",\n    \"繯\",\n    \"蘚\",\n    \"鷯\",\n    \"齲\",\n    \"齷\",\n    \"躡\",\n    \"蹣\",\n    \"羈\",\n    \"鐔\",\n    \"鐝\",\n    \"鐐\",\n    \"鐓\",\n    \"鑭\",\n    \"鑹\",\n    \"鏹\",\n    \"鐙\",\n    \"籪\",\n    \"鷦\",\n    \"鱝\",\n    \"鰈\",\n    \"鯷\",\n    \"鰓\",\n    \"鰍\",\n    \"鰉\",\n    \"鯿\",\n    \"鷲\",\n    \"懣\",\n    \"鷸\",\n    \"鰲\",\n    \"韉\",\n    \"顥\",\n    \"鷺\",\n    \"䴉\",\n    \"髏\",\n    \"鑊\",\n    \"鐳\",\n    \"鐲\",\n    \"讎\",\n    \"鰨\",\n    \"鰥\",\n    \"鰩\",\n    \"癩\",\n    \"攢\",\n    \"靄\",\n    \"躥\",\n    \"髖\",\n    \"髕\",\n    \"鑔\",\n    \"籟\",\n    \"鰳\",\n    \"鰾\",\n    \"鱈\",\n    \"鰻\",\n    \"鱅\",\n    \"讖\",\n    \"驥\",\n    \"纘\",\n    \"瓚\",\n    \"鼉\",\n    \"黷\",\n    \"黲\",\n    \"鑣\",\n    \"鑞\",\n    \"臢\",\n    \"鱖\",\n    \"鱔\",\n    \"鱒\",\n    \"驤\",\n    \"顰\",\n    \"鱧\",\n    \"癲\",\n    \"灝\",\n    \"鸛\",\n    \"鑱\",\n    \"趲\",\n    \"顴\",\n    \"躦\",\n    \"饢\",\n    \"戇\",\n    \"戔\",\n    \"訏\",\n    \"訒\",\n    \"釓\",\n    \"俔\",\n    \"閆\",\n    \"澫\",\n    \"訢\",\n    \"訩\",\n    \"詝\",\n    \"紃\",\n    \"纊\",\n    \"瑒\",\n    \"剗\",\n    \"塸\",\n    \"壢\",\n    \"埨\",\n    \"撝\",\n    \"蔿\",\n    \"榪\",\n    \"軑\",\n    \"軏\",\n    \"咼\",\n    \"㠣\",\n    \"覎\",\n    \"㑳\",\n    \"颺\",\n    \"閌\",\n    \"潕\",\n    \"湋\",\n    \"澐\",\n    \"浿\",\n    \"諓\",\n    \"禡\",\n    \"詗\",\n    \"詘\",\n    \"詖\",\n    \"屓\",\n    \"彄\",\n    \"紘\",\n    \"馹\",\n    \"馼\",\n    \"紵\",\n    \"紞\",\n    \"駃\",\n    \"紖\",\n    \"瑲\",\n    \"薴\",\n    \"棡\",\n    \"軝\",\n    \"暐\",\n    \"晛\",\n    \"崬\",\n    \"釴\",\n    \"釤\",\n    \"鍆\",\n    \"鍚\",\n    \"鄶\",\n    \"獮\",\n    \"飿\",\n    \"嶨\",\n    \"詷\",\n    \"詪\",\n    \"鄩\",\n    \"鳲\",\n    \"隑\",\n    \"隮\",\n    \"娙\",\n    \"逕\",\n    \"駓\",\n    \"駔\",\n    \"駉\",\n    \"絅\",\n    \"騶\",\n    \"䮄\",\n    \"紼\",\n    
\"紿\",\n    \"瓅\",\n    \"韍\",\n    \"墶\",\n    \"塏\",\n    \"薘\",\n    \"蕘\",\n    \"蔄\",\n    \"葒\",\n    \"鳾\",\n    \"龑\",\n    \"軹\",\n    \"軤\",\n    \"轢\",\n    \"軺\",\n    \"睍\",\n    \"曨\",\n    \"噠\",\n    \"鈃\",\n    \"鈇\",\n    \"鉅\",\n    \"鋹\",\n    \"釿\",\n    \"錀\",\n    \"鈧\",\n    \"鈥\",\n    \"鈄\",\n    \"倈\",\n    \"艤\",\n    \"鶬\",\n    \"颭\",\n    \"餏\",\n    \"湞\",\n    \"溮\",\n    \"滻\",\n    \"褘\",\n    \"絰\",\n    \"駰\",\n    \"絪\",\n    \"駪\",\n    \"綎\",\n    \"綖\",\n    \"驫\",\n    \"勣\",\n    \"璕\",\n    \"𡑍\",\n    \"䓣\",\n    \"薟\",\n    \"藭\",\n    \"椏\",\n    \"梜\",\n    \"頍\",\n    \"硜\",\n    \"輄\",\n    \"輈\",\n    \"輇\",\n    \"貲\",\n    \"嗊\",\n    \"曄\",\n    \"暉\",\n    \"鄳\",\n    \"幬\",\n    \"輋\",\n    \"嶮\",\n    \"贐\",\n    \"鉥\",\n    \"鉕\",\n    \"鑪\",\n    \"鉮\",\n    \"鉊\",\n    \"鉧\",\n    \"僤\",\n    \"鴒\",\n    \"魛\",\n    \"餗\",\n    \"燖\",\n    \"溳\",\n    \"礐\",\n    \"窵\",\n    \"襏\",\n    \"駼\",\n    \"絺\",\n    \"綌\",\n    \"騂\",\n    \"綄\",\n    \"璡\",\n    \"墠\",\n    \"壼\",\n    \"聹\",\n    \"蘀\",\n    \"勩\",\n    \"罃\",\n    \"檮\",\n    \"棶\",\n    \"厴\",\n    \"䃮\",\n    \"磑\",\n    \"礄\",\n    \"鴷\",\n    \"齕\",\n    \"頔\",\n    \"廼\",\n    \"凢\",\n    \"亾\",\n    \"枒\",\n    \"屍\",\n    \"匃\",\n    \"匄\",\n    \"紥\",\n    \"紮\",\n    \"疋\",\n    \"殀\",\n    \"讐\",\n    \"觔\",\n    \"兇\",\n    \"宂\",\n    \"㕥\",\n    \"㠯\",\n    \"栞\",\n    \"佈\",\n    \"佔\",\n    \"呌\",\n    \"敂\",\n    \"冄\",\n    \"坵\",\n    \"僊\",\n    \"怱\",\n    \"悤\",\n    \"冊\",\n    \"夘\",\n    \"戼\",\n    \"牠\",\n    \"妳\",\n    \"嬭\",\n    \"摃\",\n    \"釦\",\n    \"攷\",\n    \"託\",\n    \"衺\",\n    \"衕\",\n    \"弔\",\n    \"喫\",\n    \"囙\",\n    \"㠶\",\n    \"颿\",\n    \"秊\",\n    \"倣\",\n    \"髣\",\n    \"佀\",\n    \"朶\",\n    \"氷\",\n    \"決\",\n    \"併\",\n    \"並\",\n    \"竝\",\n    \"汙\",\n    \"汚\",\n    \"異\",\n    \"姦\",\n    \"廵\",\n    \"挵\",\n    \"衖\",\n    \"搤\",\n    \"阯\",\n    \"撦\",\n    \"埳\",\n    
\"阬\",\n    \"誌\",\n    \"㕁\",\n    \"卻\",\n    \"刦\",\n    \"刧\",\n    \"刼\",\n    \"芲\",\n    \"蘤\",\n    \"桿\",\n    \"槓\",\n    \"荳\",\n    \"獃\",\n    \"唫\",\n    \"脗\",\n    \"皁\",\n    \"彿\",\n    \"髴\",\n    \"疘\",\n    \"刪\",\n    \"鉋\",\n    \"鑤\",\n    \"況\",\n    \"牀\",\n    \"恡\",\n    \"棄\",\n    \"洶\",\n    \"汎\",\n    \"災\",\n    \"烖\",\n    \"菑\",\n    \"禩\",\n    \"侷\",\n    \"跼\",\n    \"坿\",\n    \"玅\",\n    \"姉\",\n    \"妬\",\n    \"翫\",\n    \"搨\",\n    \"柺\",\n    \"拕\",\n    \"牴\",\n    \"觝\",\n    \"倖\",\n    \"抝\",\n    \"盃\",\n    \"桮\",\n    \"傑\",\n    \"逩\",\n    \"肎\",\n    \"菓\",\n    \"崐\",\n    \"崑\",\n    \"呪\",\n    \"虖\",\n    \"嘑\",\n    \"謼\",\n    \"詠\",\n    \"㟁\",\n    \"嵒\",\n    \"巗\",\n    \"巖\",\n    \"雰\",\n    \"稈\",\n    \"咊\",\n    \"嶽\",\n    \"妷\",\n    \"姪\",\n    \"廹\",\n    \"徃\",\n    \"餚\",\n    \"採\",\n    \"寀\",\n    \"唸\",\n    \"週\",\n    \"昬\",\n    \"兎\",\n    \"兔\",\n    \"亯\",\n    \"亱\",\n    \"䘚\",\n    \"淨\",\n    \"劵\",\n    \"匟\",\n    \"㳒\",\n    \"灋\",\n    \"洩\",\n    \"霑\",\n    \"淚\",\n    \"註\",\n    \"恠\",\n    \"箒\",\n    \"屆\",\n    \"絃\",\n    \"圅\",\n    \"旾\",\n    \"珎\",\n    \"掛\",\n    \"垜\",\n    \"艸\",\n    \"茘\",\n    \"査\",\n    \"栢\",\n    \"柵\",\n    \"栁\",\n    \"桺\",\n    \"柹\",\n    \"韮\",\n    \"揹\",\n    \"昰\",\n    \"閧\",\n    \"鬨\",\n    \"冐\",\n    \"暎\",\n    \"嚥\",\n    \"倃\",\n    \"𠴰\",\n    \"偺\",\n    \"喒\",\n    \"齩\",\n    \"欬\",\n    \"榘\",\n    \"㑺\",\n    \"儁\",\n    \"敍\",\n    \"敘\",\n    \"肧\",\n    \"脈\",\n    \"䘑\",\n    \"衇\",\n    \"跡\",\n    \"蹟\",\n    \"砲\",\n    \"礮\",\n    \"薙\",\n    \"鬀\",\n    \"恆\",\n    \"怳\",\n    \"卹\",\n    \"䘏\",\n    \"賉\",\n    \"婣\",\n    \"畊\",\n    \"揑\",\n    \"綑\",\n    \"輓\",\n    \"恥\",\n    \"躭\",\n    \"晉\",\n    \"棲\",\n    \"覈\",\n    \"慄\",\n    \"翄\",\n    \"脣\",\n    \"槕\",\n    \"㨪\",\n    \"螡\",\n    \"蟁\",\n    \"㤙\",\n    \"陗\",\n    \"峩\",\n    \"峯\",\n    \"乗\",\n    \"椉\",\n    \"咲\",\n    
\"筍\",\n    \"俛\",\n    \"頫\",\n    \"勌\",\n    \"䠶\",\n    \"躳\",\n    \"慇\",\n    \"拏\",\n    \"㧱\",\n    \"挐\",\n    \"脃\",\n    \"胷\",\n    \"肐\",\n    \"貍\",\n    \"㽞\",\n    \"畱\",\n    \"淒\",\n    \"悽\",\n    \"蓆\",\n    \"効\",\n    \"傚\",\n    \"涼\",\n    \"缾\",\n    \"菸\",\n    \"煙\",\n    \"淛\",\n    \"湧\",\n    \"誖\",\n    \"猂\",\n    \"醼\",\n    \"讌\",\n    \"㝠\",\n    \"寃\",\n    \"孃\",\n    \"桒\",\n    \"毬\",\n    \"瑠\",\n    \"璢\",\n    \"瑯\",\n    \"㨗\",\n    \"搥\",\n    \"搯\",\n    \"蔆\",\n    \"惏\",\n    \"楳\",\n    \"槑\",\n    \"捄\",\n    \"廂\",\n    \"慽\",\n    \"慼\",\n    \"瞇\",\n    \"埜\",\n    \"畧\",\n    \"虵\",\n    \"稭\",\n    \"棃\",\n    \"犂\",\n    \"迻\",\n    \"媮\",\n    \"兠\",\n    \"舩\",\n    \"慾\",\n    \"綵\",\n    \"腳\",\n    \"𩓐\",\n    \"夠\",\n    \"豬\",\n    \"貓\",\n    \"湊\",\n    \"減\",\n    \"庻\",\n    \"蔴\",\n    \"菴\",\n    \"朢\",\n    \"睠\",\n    \"觕\",\n    \"麤\",\n    \"釬\",\n    \"銲\",\n    \"痳\",\n    \"殽\",\n    \"婬\",\n    \"滛\",\n    \"湻\",\n    \"㴱\",\n    \"樑\",\n    \"顇\",\n    \"㝛\",\n    \"窰\",\n    \"窯\",\n    \"琹\",\n    \"欵\",\n    \"墖\",\n    \"趂\",\n    \"隄\",\n    \"愽\",\n    \"揷\",\n    \"揫\",\n    \"煑\",\n    \"朞\",\n    \"㪚\",\n    \"塟\",\n    \"蔥\",\n    \"蔕\",\n    \"稜\",\n    \"棊\",\n    \"碁\",\n    \"椶\",\n    \"偪\",\n    \"㕑\",\n    \"廚\",\n    \"廈\",\n    \"鴈\",\n    \"冣\",\n    \"㝡\",\n    \"晳\",\n    \"鼃\",\n    \"餧\",\n    \"餵\",\n    \"嗁\",\n    \"諠\",\n    \"㡌\",\n    \"賸\",\n    \"筴\",\n    \"筞\",\n    \"筩\",\n    \"栰\",\n    \"暠\",\n    \"皜\",\n    \"踰\",\n    \"蝟\",\n    \"㪟\",\n    \"燄\",\n    \"遊\",\n    \"媿\",\n    \"嘅\",\n    \"庽\",\n    \"窓\",\n    \"牎\",\n    \"牕\",\n    \"窻\",\n    \"徧\",\n    \"僱\",\n    \"帬\",\n    \"裠\",\n    \"強\",\n    \"彊\",\n    \"疎\",\n    \"壻\",\n    \"瓌\",\n    \"䰟\",\n    \"皷\",\n    \"擕\",\n    \"㩗\",\n    \"㩦\",\n    \"攜\",\n    \"懃\",\n    \"鞾\",\n    \"幙\",\n    \"㮣\",\n    \"酧\",\n    \"詶\",\n    \"醻\",\n    \"掽\",\n    \"踫\",\n    \"㼝\",\n    
\"盌\",\n    \"磟\",\n    \"覩\",\n    \"倸\",\n    \"㬉\",\n    \"煗\",\n    \"煖\",\n    \"晻\",\n    \"闇\",\n    \"炤\",\n    \"跥\",\n    \"䗬\",\n    \"蠭\",\n    \"寘\",\n    \"辠\",\n    \"稺\",\n    \"穉\",\n    \"燬\",\n    \"譭\",\n    \"瘉\",\n    \"癒\",\n    \"顋\",\n    \"骽\",\n    \"猨\",\n    \"蝯\",\n    \"稟\",\n    \"痺\",\n    \"癡\",\n    \"亷\",\n    \"㢘\",\n    \"韻\",\n    \"泝\",\n    \"遡\",\n    \"昚\",\n    \"躶\",\n    \"臝\",\n    \"羣\",\n    \"㬪\",\n    \"曡\",\n    \"疊\",\n    \"勦\",\n    \"琍\",\n    \"瓈\",\n    \"𤋮\",\n    \"熈\",\n    \"牓\",\n    \"搾\",\n    \"謌\",\n    \"堿\",\n    \"鹻\",\n    \"鹼\",\n    \"矁\",\n    \"燻\",\n    \"髈\",\n    \"𤺥\",\n    \"辢\",\n    \"旂\",\n    \"𡚁\",\n    \"潄\",\n    \"砦\",\n    \"詧\",\n    \"嫰\",\n    \"櫈\",\n    \"撐\",\n    \"墪\",\n    \"譔\",\n    \"鞵\",\n    \"鞌\",\n    \"蕋\",\n    \"橤\",\n    \"蘂\",\n    \"醕\",\n    \"譆\",\n    \"跴\",\n    \"蹤\",\n    \"蜨\",\n    \"蠍\",\n    \"稾\",\n    \"殭\",\n    \"惪\",\n    \"厀\",\n    \"襃\",\n    \"癅\",\n    \"䊀\",\n    \"餬\",\n    \"潛\",\n    \"癄\",\n    \"顦\",\n    \"鷰\",\n    \"藷\",\n    \"櫥\",\n    \"螎\",\n    \"蹏\",\n    \"蟇\",\n    \"譟\",\n    \"簒\",\n    \"彫\",\n    \"琱\",\n    \"鵰\",\n    \"餹\",\n    \"餻\",\n    \"簷\",\n    \"粦\",\n    \"燐\",\n    \"緐\",\n    \"幑\",\n    \"蹧\",\n    \"粇\",\n    \"穅\",\n    \"臋\",\n    \"籐\",\n    \"繙\",\n    \"飜\",\n    \"孼\",\n    \"蠏\",\n    \"燿\",\n    \"蝡\",\n    \"稬\",\n    \"穤\",\n    \"惷\",\n    \"覇\",\n    \"鑵\",\n    \"戹\",\n    \"阨\",\n    \"剳\",\n    \"帀\",\n    \"巵\",\n    \"亙\",\n    \"佇\",\n    \"竚\",\n    \"穽\",\n    \"岅\",\n    \"虯\",\n    \"𦍑\",\n    \"羗\",\n    \"啎\",\n    \"姙\",\n    \"㘭\",\n    \"袟\",\n    \"袠\",\n    \"逈\",\n    \"㒺\",\n    \"犛\",\n    \"氂\",\n    \"偘\",\n    \"甕\",\n    \"罋\",\n    \"冺\",\n    \"姍\",\n    \"蝨\",\n    \"琺\",\n    \"瑇\",\n    \"尅\",\n    \"梔\",\n    \"斮\",\n    \"斲\",\n    \"斵\",\n    \"暱\",\n    \"毘\",\n    \"蝱\",\n    \"吚\",\n    \"哶\",\n    \"峝\",\n    \"粃\",\n    \"竢\",\n    \"狥\",\n    
\"秈\",\n    \"烱\",\n    \"㳄\",\n    \"袵\",\n    \"盇\",\n    \"涖\",\n    \"蒞\",\n    \"碪\",\n    \"蠔\",\n    \"唕\",\n    \"倐\",\n    \"儵\",\n    \"雋\",\n    \"皐\",\n    \"臯\",\n    \"衂\",\n    \"䶊\",\n    \"臙\",\n    \"獧\",\n    \"痾\",\n    \"皰\",\n    \"湼\",\n    \"澣\",\n    \"濬\",\n    \"塚\",\n    \"襢\",\n    \"娿\",\n    \"勅\",\n    \"勑\",\n    \"戞\",\n    \"廐\",\n    \"廄\",\n    \"眥\",\n    \"覜\",\n    \"勗\",\n    \"啗\",\n    \"噉\",\n    \"傯\",\n    \"挱\",\n    \"㥫\",\n    \"惥\",\n    \"慂\",\n    \"陻\",\n    \"蕚\",\n    \"萲\",\n    \"蕿\",\n    \"蘐\",\n    \"藼\",\n    \"櫂\",\n    \"箠\",\n    \"槨\",\n    \"啑\",\n    \"蹠\",\n    \"蚘\",\n    \"痐\",\n    \"蛕\",\n    \"蜖\",\n    \"瘖\",\n    \"遯\",\n    \"醃\",\n    \"飱\",\n    \"冪\",\n    \"簑\",\n    \"枏\",\n    \"柟\",\n    \"檝\",\n    \"楥\",\n    \"矴\",\n    \"椗\",\n    \"嘷\",\n    \"獋\",\n    \"粺\",\n    \"䈰\",\n    \"諐\",\n    \"齶\",\n    \"堘\",\n    \"疿\",\n    \"雝\",\n    \"秔\",\n    \"稉\",\n    \"槀\",\n    \"搉\",\n    \"廝\",\n    \"叡\",\n    \"嘠\",\n    \"蜋\",\n    \"筯\",\n    \"篛\",\n    \"麞\",\n    \"糉\",\n    \"緥\",\n    \"璿\",\n    \"髥\",\n    \"臕\",\n    \"餈\",\n    \"剹\",\n    \"橜\",\n    \"罇\",\n    \"蜺\",\n    \"矙\",\n    \"憇\",\n    \"翺\",\n    \"饍\",\n    \"瞖\",\n    \"羴\",\n    \"羶\",\n    \"爕\",\n    \"繦\",\n    \"騌\",\n    \"鬉\",\n    \"騣\",\n    \"蔾\",\n    \"䠀\",\n    \"簮\",\n    \"躕\",\n    \"蹵\",\n    \"䝔\",\n    \"貛\",\n    \"鼴\",\n    \"麐\",\n    \"塡\",\n    \"あ\",\n    \"い\",\n    \"う\",\n    \"え\",\n    \"お\",\n    \"か\",\n    \"き\",\n    \"く\",\n    \"け\",\n    \"こ\",\n    \"さ\",\n    \"し\",\n    \"す\",\n    \"せ\",\n    \"そ\",\n    \"た\",\n    \"ち\",\n    \"つ\",\n    \"て\",\n    \"と\",\n    \"な\",\n    \"に\",\n    \"ぬ\",\n    \"ね\",\n    \"の\",\n    \"は\",\n    \"ひ\",\n    \"ふ\",\n    \"へ\",\n    \"ほ\",\n    \"ま\",\n    \"み\",\n    \"む\",\n    \"め\",\n    \"も\",\n    \"や\",\n    \"ゆ\",\n    \"よ\",\n    \"ら\",\n    \"り\",\n    \"る\",\n    \"れ\",\n    \"ろ\",\n    \"わ\",\n    \"を\",\n    
\"ん\",\n    \"が\",\n    \"ぎ\",\n    \"ぐ\",\n    \"げ\",\n    \"ご\",\n    \"ざ\",\n    \"じ\",\n    \"ず\",\n    \"ぜ\",\n    \"ぞ\",\n    \"だ\",\n    \"ぢ\",\n    \"づ\",\n    \"で\",\n    \"ど\",\n    \"ば\",\n    \"び\",\n    \"ぶ\",\n    \"べ\",\n    \"ぼ\",\n    \"ぱ\",\n    \"ぴ\",\n    \"ぷ\",\n    \"ぺ\",\n    \"ぽ\",\n    \"ぁ\",\n    \"ぃ\",\n    \"ぅ\",\n    \"ぇ\",\n    \"ぉ\",\n    \"っ\",\n    \"ゃ\",\n    \"ゅ\",\n    \"ょ\",\n    \"ゎ\",\n    \"ゕ\",\n    \"ゖ\",\n    \"ア\",\n    \"イ\",\n    \"ウ\",\n    \"エ\",\n    \"オ\",\n    \"カ\",\n    \"キ\",\n    \"ク\",\n    \"ケ\",\n    \"コ\",\n    \"サ\",\n    \"シ\",\n    \"ス\",\n    \"セ\",\n    \"ソ\",\n    \"タ\",\n    \"チ\",\n    \"ツ\",\n    \"テ\",\n    \"ト\",\n    \"ナ\",\n    \"ニ\",\n    \"ヌ\",\n    \"ネ\",\n    \"ノ\",\n    \"ハ\",\n    \"ヒ\",\n    \"フ\",\n    \"ヘ\",\n    \"ホ\",\n    \"マ\",\n    \"ミ\",\n    \"ム\",\n    \"メ\",\n    \"モ\",\n    \"ヤ\",\n    \"ユ\",\n    \"ヨ\",\n    \"ラ\",\n    \"リ\",\n    \"ル\",\n    \"レ\",\n    \"ロ\",\n    \"ワ\",\n    \"ヲ\",\n    \"ン\",\n    \"ガ\",\n    \"ギ\",\n    \"グ\",\n    \"ゲ\",\n    \"ゴ\",\n    \"ザ\",\n    \"ジ\",\n    \"ズ\",\n    \"ゼ\",\n    \"ゾ\",\n    \"ダ\",\n    \"ヂ\",\n    \"ヅ\",\n    \"デ\",\n    \"ド\",\n    \"バ\",\n    \"ビ\",\n    \"ブ\",\n    \"ベ\",\n    \"ボ\",\n    \"パ\",\n    \"ピ\",\n    \"プ\",\n    \"ペ\",\n    \"ポ\",\n    \"ァ\",\n    \"ィ\",\n    \"ゥ\",\n    \"ェ\",\n    \"ォ\",\n    \"ッ\",\n    \"ャ\",\n    \"ュ\",\n    \"ョ\",\n    \"ヮ\",\n    \"ヵ\",\n    \"ヶ\",\n    \"ヷ\",\n    \"ヸ\",\n    \"ヹ\",\n    \"ヺ\",\n    \"・\",\n    \"ー\",\n    \"ヽ\",\n    \"ヾ\",\n    \"ヿ\",\n    \"ｱ\",\n    \"ｲ\",\n    \"ｳ\",\n    \"ｴ\",\n    \"ｵ\",\n    \"ｶ\",\n    \"ｷ\",\n    \"ｸ\",\n    \"ｹ\",\n    \"ｺ\",\n    \"ｻ\",\n    \"ｼ\",\n    \"ｽ\",\n    \"ｾ\",\n    \"ｿ\",\n    \"ﾀ\",\n    \"ﾁ\",\n    \"ﾂ\",\n    \"ﾃ\",\n    \"ﾄ\",\n    \"ﾅ\",\n    \"ﾆ\",\n    \"ﾇ\",\n    \"ﾈ\",\n    \"ﾉ\",\n    \"ﾊ\",\n    \"ﾋ\",\n    \"ﾌ\",\n    \"ﾍ\",\n    \"ﾎ\",\n    \"ﾏ\",\n    \"ﾐ\",\n    \"ﾑ\",\n    \"ﾒ\",\n    \"ﾓ\",\n    \"ﾔ\",\n    
\"ﾕ\",\n    \"ﾖ\",\n    \"ﾗ\",\n    \"ﾘ\",\n    \"ﾙ\",\n    \"ﾚ\",\n    \"ﾛ\",\n    \"ﾜ\",\n    \"ｦ\",\n    \"ﾝ\",\n    \"ﾞ\",\n    \"ﾟ\",\n    \"ｧ\",\n    \"ｨ\",\n    \"ｩ\",\n    \"ｪ\",\n    \"ｫ\",\n    \"ｯ\",\n    \"ｬ\",\n    \"ｭ\",\n    \"ｮ\",\n    \"円\",\n    \"気\",\n    \"糸\",\n    \"絵\",\n    \"楽\",\n    \"帰\",\n    \"戸\",\n    \"広\",\n    \"黒\",\n    \"図\",\n    \"線\",\n    \"読\",\n    \"売\",\n    \"歩\",\n    \"毎\",\n    \"亜\",\n    \"悪\",\n    \"圧\",\n    \"扱\",\n    \"囲\",\n    \"為\",\n    \"壱\",\n    \"隠\",\n    \"栄\",\n    \"営\",\n    \"駅\",\n    \"塩\",\n    \"縁\",\n    \"艶\",\n    \"応\",\n    \"桜\",\n    \"穏\",\n    \"仮\",\n    \"価\",\n    \"箇\",\n    \"ゑ\",\n    \"ゝ\",\n    \"ゞ\",\n    \"ヰ\",\n    \"ヴ\",\n    \"㈱\",\n    \"両\",\n    \"丼\",\n    \"丿\",\n    \"亀\",\n    \"仏\",\n    \"伝\",\n    \"侶\",\n    \"俤\",\n    \"値\",\n    \"倶\",\n    \"倹\",\n    \"偐\",\n    \"偽\",\n    \"働\",\n    \"儛\",\n    \"兌\",\n    \"児\",\n    \"冑\",\n    \"冨\",\n    \"凞\",\n    \"処\",\n    \"凪\",\n    \"別\",\n    \"剣\",\n    \"剤\",\n    \"剰\",\n    \"劔\",\n    \"労\",\n    \"勧\",\n    \"勲\",\n    \"匁\",\n    \"匂\",\n    \"匲\",\n    \"卍\",\n    \"単\",\n    \"厳\",\n    \"収\",\n    \"呂\",\n    \"呉\",\n    \"呑\",\n    \"呰\",\n    \"唖\",\n    \"喚\",\n    \"喩\",\n    \"喰\",\n    \"噛\",\n    \"噺\",\n    \"嚢\",\n    \"囃\",\n    \"団\",\n    \"圀\",\n    \"圏\",\n    \"堀\",\n    \"堺\",\n    \"塀\",\n    \"塁\",\n    \"塙\",\n    \"増\",\n    \"墺\",\n    \"壊\",\n    \"壌\",\n    \"壷\",\n    \"変\",\n    \"奨\",\n    \"姫\",\n    \"娯\",\n    \"嫐\",\n    \"嬢\",\n    \"嬾\",\n    \"孁\",\n    \"宍\",\n    \"実\",\n    \"宮\",\n    \"寔\",\n    \"寛\",\n    \"対\",\n    \"専\",\n    \"尭\",\n    \"峠\",\n    \"崋\",\n    \"嶋\",\n    \"巀\",\n    \"巌\",\n    \"巣\",\n    \"巻\",\n    \"帯\",\n    \"幇\",\n    \"庁\",\n    \"廃\",\n    \"廻\",\n    \"弉\",\n    \"弌\",\n    \"弐\",\n    \"弖\",\n    \"弾\",\n    \"従\",\n    \"徳\",\n    \"徴\",\n    \"忯\",\n    \"恵\",\n    \"悩\",\n    \"惣\",\n    \"懐\",\n    \"懽\",\n    
\"戦\",\n    \"戯\",\n    \"戻\",\n    \"払\",\n    \"抜\",\n    \"択\",\n    \"拝\",\n    \"拠\",\n    \"拡\",\n    \"拵\",\n    \"挙\",\n    \"挿\",\n    \"捗\",\n    \"捜\",\n    \"掟\",\n    \"掲\",\n    \"掻\",\n    \"揃\",\n    \"換\",\n    \"揺\",\n    \"摂\",\n    \"撃\",\n    \"撹\",\n    \"斉\",\n    \"斎\",\n    \"旛\",\n    \"旡\",\n    \"晧\",\n    \"晩\",\n    \"暁\",\n    \"暦\",\n    \"曽\",\n    \"杁\",\n    \"杢\",\n    \"杣\",\n    \"杮\",\n    \"枓\",\n    \"枠\",\n    \"枡\",\n    \"柾\",\n    \"栂\",\n    \"栃\",\n    \"桝\",\n    \"桟\",\n    \"桾\",\n    \"梛\",\n    \"梱\",\n    \"梲\",\n    \"梶\",\n    \"椙\",\n    \"検\",\n    \"椥\",\n    \"楕\",\n    \"楡\",\n    \"楢\",\n    \"榊\",\n    \"榎\",\n    \"槇\",\n    \"様\",\n    \"槙\",\n    \"槻\",\n    \"樋\",\n    \"権\",\n    \"樫\",\n    \"橿\",\n    \"檥\",\n    \"欅\",\n    \"歎\",\n    \"歓\",\n    \"歯\",\n    \"歳\",\n    \"歴\",\n    \"毀\",\n    \"沖\",\n    \"沢\",\n    \"浄\",\n    \"涙\",\n    \"済\",\n    \"渉\",\n    \"渋\",\n    \"渓\",\n    \"渕\",\n    \"満\",\n    \"滝\",\n    \"漑\",\n    \"潅\",\n    \"澁\",\n    \"瀞\",\n    \"瀬\",\n    \"焔\",\n    \"焼\",\n    \"煇\",\n    \"煕\",\n    \"煥\",\n    \"燗\",\n    \"爼\",\n    \"犠\",\n    \"狛\",\n    \"猟\",\n    \"獏\",\n    \"獣\",\n    \"珊\",\n    \"瑤\",\n    \"甞\",\n    \"畑\",\n    \"畠\",\n    \"畳\",\n    \"畷\",\n    \"畺\",\n    \"痩\",\n    \"癪\",\n    \"発\",\n    \"県\",\n    \"眞\",\n    \"砕\",\n    \"碕\",\n    \"礒\",\n    \"禖\",\n    \"禿\",\n    \"稲\",\n    \"穂\",\n    \"穣\",\n    \"竃\",\n    \"竜\",\n    \"竴\",\n    \"笹\",\n    \"筈\",\n    \"筬\",\n    \"筰\",\n    \"箆\",\n    \"箏\",\n    \"箙\",\n    \"篠\",\n    \"篭\",\n    \"簺\",\n    \"籾\",\n    \"粂\",\n    \"粋\",\n    \"粛\",\n    \"粧\",\n    \"糺\",\n    \"紬\",\n    \"絁\",\n    \"経\",\n    \"絖\",\n    \"絣\",\n    \"絽\",\n    \"継\",\n    \"続\",\n    \"綟\",\n    \"総\",\n    \"縄\",\n    \"縅\",\n    \"縒\",\n    \"縦\",\n    \"繊\",\n    \"繋\",\n    \"繍\",\n    \"繝\",\n    \"繧\",\n    \"纐\",\n    \"纒\",\n    \"罠\",\n    \"罧\",\n    \"罵\",\n    \"羂\",\n    
\"羇\",\n    \"羨\",\n    \"聟\",\n    \"聡\",\n    \"聨\",\n    \"聴\",\n    \"脇\",\n    \"脳\",\n    \"膣\",\n    \"膵\",\n    \"臈\",\n    \"臓\",\n    \"臥\",\n    \"舎\",\n    \"舖\",\n    \"舗\",\n    \"舘\",\n    \"芿\",\n    \"苅\",\n    \"茲\",\n    \"荊\",\n    \"荘\",\n    \"莬\",\n    \"莵\",\n    \"菫\",\n    \"萠\",\n    \"蔵\",\n    \"薗\",\n    \"薫\",\n    \"薬\",\n    \"薭\",\n    \"蘊\",\n    \"蛍\",\n    \"蝋\",\n    \"蝿\",\n    \"蟷\",\n    \"衞\",\n    \"衵\",\n    \"袙\",\n    \"袞\",\n    \"袰\",\n    \"袴\",\n    \"袿\",\n    \"裃\",\n    \"裡\",\n    \"裲\",\n    \"褄\",\n    \"褌\",\n    \"襴\",\n    \"襷\",\n    \"覗\",\n    \"覚\",\n    \"覧\",\n    \"観\",\n    \"訳\",\n    \"証\",\n    \"諌\",\n    \"諚\",\n    \"諟\",\n    \"諡\",\n    \"諮\",\n    \"譛\",\n    \"譲\",\n    \"讃\",\n    \"豅\",\n    \"豊\",\n    \"豎\",\n    \"賎\",\n    \"賛\",\n    \"贔\",\n    \"躙\",\n    \"躰\",\n    \"転\",\n    \"軽\",\n    \"輌\",\n    \"辥\",\n    \"辺\",\n    \"辻\",\n    \"込\",\n    \"逓\",\n    \"遅\",\n    \"遙\",\n    \"邉\",\n    \"郷\",\n    \"酔\",\n    \"醗\",\n    \"醤\",\n    \"醸\",\n    \"釈\",\n    \"鉄\",\n    \"鉇\",\n    \"鉤\",\n    \"鉱\",\n    \"鉾\",\n    \"銈\",\n    \"銕\",\n    \"銭\",\n    \"鋲\",\n    \"鋳\",\n    \"鋺\",\n    \"錆\",\n    \"錍\",\n    \"錣\",\n    \"錬\",\n    \"錵\",\n    \"鍑\",\n    \"鍮\",\n    \"鍼\",\n    \"鎌\",\n    \"鎗\",\n    \"鎚\",\n    \"鎹\",\n    \"鐇\",\n    \"鐚\",\n    \"鐡\",\n    \"鑁\",\n    \"鑑\",\n    \"鑚\",\n    \"鑢\",\n    \"閇\",\n    \"関\",\n    \"閦\",\n    \"闘\",\n    \"陥\",\n    \"険\",\n    \"隣\",\n    \"隷\",\n    \"雑\",\n    \"雫\",\n    \"霊\",\n    \"靜\",\n    \"靫\",\n    \"靭\",\n    \"靱\",\n    \"鞄\",\n    \"鞆\",\n    \"頚\",\n    \"頬\",\n    \"頴\",\n    \"頼\",\n    \"顕\",\n    \"顗\",\n    \"餝\",\n    \"饂\",\n    \"駄\",\n    \"駆\",\n    \"駈\",\n    \"騒\",\n    \"験\",\n    \"騨\",\n    \"髄\",\n    \"髙\",\n    \"髪\",\n    \"髷\",\n    \"鯖\",\n    \"鯰\",\n    \"鯱\",\n    \"鰒\",\n    \"鰯\",\n    \"鰰\",\n    \"鳰\",\n    \"鴎\",\n    \"鴫\",\n    \"鵄\",\n    \"鵞\",\n    \"鵺\",\n    
\"鶏\",\n    \"鹸\",\n    \"麁\",\n    \"麺\",\n    \"麿\",\n    \"黌\",\n    \"黙\",\n    \"鼈\",\n    \"齢\",\n    \"龗\",\n    \"縯\",\n    \"蟅\",\n    \"坖\",\n    \"祂\",\n    \"鼂\",\n    \"鱚\",\n    \"蛻\",\n    \"屌\",\n    \"呾\",\n    \"煔\",\n    \"吶\",\n    \"扥\",\n    \"蚖\",\n    \"銂\",\n    \"尃\",\n    \"夋\",\n    \"鵼\",\n    \"徬\",\n    \"寳\",\n    \"彡\",\n    \"舨\",\n    \"湳\",\n    \"麼\",\n    \"鍈\",\n    \"崈\",\n    \"鱣\",\n    \"盺\",\n    \"拺\",\n    \"瑥\",\n    \"茷\",\n    \"焻\",\n    \"奀\",\n    \"驎\",\n    \"鱰\",\n    \"砢\",\n    \"痟\",\n    \"廱\",\n    \"僜\",\n    \"瘺\",\n    \"鱊\",\n    \"擥\",\n    \"嶰\",\n    \"淓\",\n    \"跅\",\n    \"浵\",\n    \"媗\",\n    \"璦\",\n    \"煠\",\n    \"檊\",\n    \"媃\",\n    \"峅\",\n    \"躄\",\n    \"鉟\",\n    \"塽\",\n    \"蟴\",\n    \"鯮\",\n    \"弍\",\n    \"烒\",\n    \"鵵\",\n    \"妑\",\n    \"孋\",\n    \"蚡\",\n    \"恊\",\n    \"輭\",\n    \"廞\",\n    \"產\",\n    \"曅\",\n    \"盜\",\n    \"騤\",\n    \"囪\",\n    \"鱀\",\n    \"茇\",\n    \"葊\",\n    \"逹\",\n    \"狓\",\n    \"崢\",\n    \"趖\",\n    \"凃\",\n    \"羙\",\n    \"鮸\",\n    \"昞\",\n    \"楿\",\n    \"渽\",\n    \"圗\",\n    \"麪\",\n    \"屇\",\n    \"鍉\",\n    \"葝\",\n    \"沯\",\n    \"爭\",\n    \"幵\",\n    \"筭\",\n    \"寊\",\n    \"銋\",\n    \"貮\",\n    \"鎭\",\n    \"熺\",\n    \"昜\",\n    \"鍱\",\n    \"墬\",\n    \"愒\",\n    \"磺\",\n    \"嚈\",\n    \"稘\",\n    \"珮\",\n    \"釆\",\n    \"殑\",\n    \"鍩\",\n    \"䲁\",\n    \"蕷\",\n    \"鐿\",\n    \"僡\",\n    \"佹\",\n    \"輶\",\n    \"冴\",\n    \"襶\",\n    \"賔\",\n    \"猙\",\n    \"辧\",\n    \"絛\",\n    \"磾\",\n    \"韁\",\n    \"螔\",\n    \"譳\",\n    \"礑\",\n    \"鋱\",\n    \"魩\",\n    \"嚗\",\n    \"棆\",\n    \"牆\",\n    \"敟\",\n    \"柶\",\n    \"瓛\",\n    \"魣\",\n    \"巎\",\n    \"轘\",\n    \"襌\",\n    \"枼\",\n    \"鸌\",\n    \"逺\",\n    \"錏\",\n    \"縡\",\n    \"帢\",\n    \"騄\",\n    \"媼\",\n    \"埅\",\n    \"鄤\",\n    \"萐\",\n    \"祙\",\n    \"旼\",\n    \"詥\",\n    \"鶲\",\n    \"燉\",\n    \"卲\",\n    \"銱\",\n    \"庲\",\n    
\"伱\",\n    \"氽\",\n    \"嵿\",\n    \"挻\",\n    \"煵\",\n    \"窋\",\n    \"鐤\",\n    \"鮊\",\n    \"鱬\",\n    \"鰧\",\n    \"嬤\",\n    \"譞\",\n    \"諲\",\n    \"脭\",\n    \"悳\",\n    \"崘\",\n    \"阭\",\n    \"內\",\n    \"袾\",\n    \"冚\",\n    \"壐\",\n    \"咗\",\n    \"礠\",\n    \"孮\",\n    \"痲\",\n    \"埈\",\n    \"肹\",\n    \"鰮\",\n    \"鮓\",\n    \"濊\",\n    \"塜\",\n    \"凜\",\n    \"蒢\",\n    \"噰\",\n    \"桼\",\n    \"峍\",\n    \"焴\",\n    \"鶒\",\n    \"鋮\",\n    \"綠\",\n    \"鶹\",\n    \"熿\",\n    \"毴\",\n    \"咟\",\n    \"嘥\",\n    \"睺\",\n    \"繡\",\n    \"郎\",\n    \"瘞\",\n    \"鉶\",\n    \"蔎\",\n    \"秠\",\n    \"緤\",\n    \"蝀\",\n    \"躝\",\n    \"蟜\",\n    \"繃\",\n    \"囮\",\n    \"墫\",\n    \"乭\",\n    \"胊\",\n    \"濙\",\n    \"瘓\",\n    \"榣\",\n    \"鑛\",\n    \"鐫\",\n    \"嶴\",\n    \"甹\",\n    \"坮\",\n    \"銾\",\n    \"蒭\",\n    \"睜\",\n    \"俋\",\n    \"餠\",\n    \"榢\",\n    \"蓳\",\n    \"盋\",\n    \"堷\",\n    \"鍏\",\n    \"苝\",\n    \"巛\",\n    \"蚵\",\n    \"暏\",\n    \"熤\",\n    \"嬨\",\n    \"墎\",\n    \"鏽\",\n    \"戶\",\n    \"菺\",\n    \"膮\",\n    \"熖\",\n    \"睪\",\n    \"栜\",\n    \"捱\",\n    \"榗\",\n    \"鍷\",\n    \"曧\",\n    \"犽\",\n    \"韑\",\n    \"袓\",\n    \"䖝\",\n    \"焄\",\n    \"喦\",\n    \"髲\",\n    \"疌\",\n    \"㴪\",\n    \"侊\",\n    \"貐\",\n    \"蕅\",\n    \"禠\",\n    \"蕑\",\n    \"囯\",\n    \"暊\",\n    \"儞\",\n    \"佋\",\n    \"柎\",\n    \"㐱\",\n    \"鰤\",\n    \"苳\",\n    \"鱥\",\n    \"謤\",\n    \"遶\",\n    \"眀\",\n    \"鑀\",\n    \"羋\",\n    \"顏\",\n    \"陜\",\n    \"銩\",\n    \"黶\",\n    \"苼\",\n    \"蒤\",\n    \"棛\",\n    \"儫\",\n    \"咁\",\n    \"抦\",\n    \"衚\",\n    \"棩\",\n    \"焿\",\n    \"脫\",\n    \"麅\",\n    \"玏\",\n    \"埧\",\n    \"淸\",\n    \"黁\",\n    \"淽\",\n    \"彠\",\n    \"鮨\",\n    \"沜\",\n    \"糀\",\n    \"厓\",\n    \"楧\",\n    \"嶌\",\n    \"簹\",\n    \"檵\",\n    \"鱇\",\n    \"嶬\",\n    \"廸\",\n    \"卽\",\n    \"樀\",\n    \"贌\",\n    \"酼\",\n    \"籛\",\n    \"沒\",\n    \"晸\",\n    \"諪\",\n    \"蕡\",\n    
\"妏\",\n    \"鄋\",\n    \"蒍\",\n    \"奧\",\n    \"抇\",\n    \"蓨\",\n    \"薆\",\n    \"鱷\",\n    \"巘\",\n    \"䝉\",\n    \"亰\",\n    \"寈\",\n    \"槩\",\n    \"誒\",\n    \"麴\",\n    \"蕟\",\n    \"溎\",\n    \"蘗\",\n    \"榦\",\n    \"斿\",\n    \"暟\",\n    \"炲\",\n    \"拚\",\n    \"娖\",\n    \"繖\",\n    \"橚\",\n    \"寜\",\n    \"爀\",\n    \"饟\",\n    \"悅\",\n    \"鯏\",\n    \"彜\",\n    \"眾\",\n    \"葯\",\n    \"嬝\",\n    \"埮\",\n    \"獇\",\n    \"馛\",\n    \"溙\",\n    \"瀦\",\n    \"熼\",\n    \"硓\",\n    \"鈢\",\n    \"樆\",\n    \"輬\",\n    \"鰜\",\n    \"蔘\",\n    \"渙\",\n    \"澔\",\n    \"嗮\",\n    \"旉\",\n    \"籜\",\n    \"媊\",\n    \"燘\",\n    \"儚\",\n    \"頹\",\n    \"缽\",\n    \"俽\",\n    \"逨\",\n    \"鱓\",\n    \"郞\",\n    \"歊\",\n    \"杴\",\n    \"珡\",\n    \"杋\",\n    \"醁\",\n    \"鰏\",\n    \"鵾\",\n    \"鐽\",\n    \"鮋\",\n    \"巶\",\n    \"荅\",\n    \"薾\",\n    \"囓\",\n    \"蹻\",\n    \"獎\",\n    \"禑\",\n    \"鎓\",\n    \"榲\",\n    \"僴\",\n    \"綞\",\n    \"尓\",\n    \"敭\",\n    \"曔\",\n    \"褔\",\n    \"鬅\",\n    \"亊\",\n    \"鏦\",\n    \"蓘\",\n    \"裬\",\n    \"鱲\",\n    \"薡\",\n    \"鰗\",\n    \"箑\",\n    \"鬪\",\n    \"縂\",\n    \"璸\",\n    \"甙\",\n    \"茮\",\n    \"辵\",\n    \"岻\",\n    \"覿\",\n    \"滈\",\n    \"鯶\",\n    \"鑂\",\n    \"囶\",\n    \"舺\",\n    \"溋\",\n    \"拋\",\n    \"菾\",\n    \"敾\",\n    \"虨\",\n    \"綝\",\n    \"蝍\",\n    \"醂\",\n    \"禨\",\n    \"賹\",\n    \"廧\",\n    \"絕\",\n    \"槗\",\n    \"徫\",\n    \"鎔\",\n    \"曮\",\n    \"蠂\",\n    \"捒\",\n    \"堈\",\n    \"莕\",\n    \"蓪\",\n    \"敎\",\n    \"禃\",\n    \"櫱\",\n    \"綧\",\n    \"瀶\",\n    \"逌\",\n    \"浤\",\n    \"碻\",\n    \"刄\",\n    \"逤\",\n    \"剏\",\n    \"氹\",\n    \"菈\",\n    \"娫\",\n    \"蜛\",\n    \"嵗\",\n    \"糎\",\n    \"螶\",\n    \"譓\",\n    \"鏳\",\n    \"嵙\",\n    \"瑊\",\n    \"隲\",\n    \"檨\",\n    \"緈\",\n    \"畵\",\n    \"砯\",\n    \"簗\",\n    \"彅\",\n    \"鰺\",\n    \"騋\",\n    \"窶\",\n    \"嚒\",\n    \"嵻\",\n    \"尙\",\n    \"頵\",\n    \"槰\",\n    \"虉\",\n    
\"醞\",\n    \"巂\",\n    \"彔\",\n    \"偊\",\n    \"畇\",\n    \"鱨\",\n    \"妸\",\n    \"塲\",\n    \"畐\",\n    \"鈫\",\n    \"錟\",\n    \"磪\",\n    \"摠\",\n    \"彥\",\n    \"璙\",\n    \"囝\",\n    \"寗\",\n    \"耎\",\n    \"鮡\",\n    \"蘓\",\n    \"弅\",\n    \"焃\",\n    \"飥\",\n    \"戙\",\n    \"塰\",\n    \"儱\",\n    \"槺\",\n    \"噏\",\n    \"魟\",\n    \"禵\",\n    \"佧\",\n    \"咘\",\n    \"盪\",\n    \"瑈\",\n    \"鉲\",\n    \"睭\",\n    \"鏌\",\n    \"鼇\",\n    \"郋\",\n    \"魮\",\n    \"朖\",\n    \"滽\",\n    \"渃\",\n    \"滙\",\n    \"熯\",\n    \"醿\",\n    \"鎅\",\n    \"褀\",\n    \"鬬\",\n    \"巄\",\n    \"螥\",\n    \"眜\",\n    \"釚\",\n    \"柉\",\n    \"壎\",\n    \"峇\",\n    \"姸\",\n    \"唭\",\n    \"鮜\",\n    \"鈖\",\n    \"嫈\",\n    \"壄\",\n    \"洤\",\n    \"黃\",\n    \"伕\",\n    \"堦\",\n    \"嶔\",\n    \"鮰\",\n    \"鞞\",\n    \"漎\",\n    \"鉓\",\n    \"鮗\",\n    \"壴\",\n    \"阝\",\n    \"妀\",\n    \"矽\",\n    \"獢\",\n    \"倗\",\n    \"銪\",\n    \"鴓\",\n    \"橒\",\n    \"凈\",\n    \"哖\",\n    \"屚\",\n    \"偍\",\n    \"瑺\",\n    \"媯\",\n    \"淍\",\n    \"驌\",\n    \"椇\",\n    \"赬\",\n    \"薐\",\n    \"糹\",\n    \"碽\",\n    \"濲\",\n    \"釭\",\n    \"晭\",\n    \"纕\",\n    \"寖\",\n    \"閞\",\n    \"歿\",\n    \"呎\",\n    \"鶆\",\n    \"屄\",\n    \"櫿\",\n    \"犎\",\n    \"旲\",\n    \"㙟\",\n    \"龎\",\n    \"翜\",\n    \"螾\",\n    \"說\",\n    \"衜\",\n    \"泆\",\n    \"軎\",\n    \"鵂\",\n    \"荎\",\n    \"嚧\",\n    \"硂\",\n    \"桖\",\n    \"褭\",\n    \"筊\",\n    \"鰷\",\n    \"秳\",\n    \"戩\",\n    \"轀\",\n    \"鬹\",\n    \"飬\",\n    \"卋\",\n    \"暸\",\n    \"狦\",\n    \"搢\",\n    \"娋\",\n    \"鏴\",\n    \"溫\",\n    \"毉\",\n    \"淰\",\n    \"謩\",\n    \"餺\",\n    \"鵙\",\n    \"鳽\",\n    \"鮀\",\n    \"狶\",\n    \"氻\",\n    \"轝\",\n    \"妺\",\n    \"袛\",\n    \"蓭\",\n    \"梂\",\n    \"娛\",\n    \"牼\",\n    \"稅\",\n    \"兿\",\n    \"玾\",\n    \"煚\",\n    \"僩\",\n    \"鶿\",\n    \"鬄\",\n    \"崠\",\n    \"鉆\",\n    \"鯓\",\n    \"蚢\",\n    \"庀\",\n    \"鵟\",\n    \"坣\",\n    \"殼\",\n    
\"悞\",\n    \"熅\",\n    \"敻\",\n    \"鍠\",\n    \"曶\",\n    \"愼\",\n    \"搳\",\n    \"姃\",\n    \"砳\",\n    \"槼\",\n    \"臞\",\n    \"韾\",\n    \"靑\",\n    \"鸊\",\n    \"薲\",\n    \"虛\",\n    \"蠄\",\n    \"啟\",\n    \"鶺\",\n    \"苺\",\n    \"滾\",\n    \"褞\",\n    \"仺\",\n    \"胇\",\n    \"憻\",\n    \"郳\",\n    \"烉\",\n    \"驩\",\n    \"冇\",\n    \"枖\",\n    \"夌\",\n    \"搵\",\n    \"匸\",\n    \"盨\",\n    \"櫾\",\n    \"霤\",\n    \"麊\",\n    \"貒\",\n    \"噓\",\n    \"嗢\",\n    \"笩\",\n    \"晈\",\n    \"冂\",\n    \"銳\",\n    \"毿\",\n    \"慜\",\n    \"囧\",\n    \"閜\",\n    \"娸\",\n    \"庢\",\n    \"壆\",\n    \"馯\",\n    \"桱\",\n    \"兗\",\n    \"葃\",\n    \"侅\",\n    \"煐\",\n    \"鐦\",\n    \"藸\",\n    \"鷎\",\n    \"嵰\",\n    \"逎\",\n    \"弒\",\n    \"匋\",\n    \"鐭\",\n    \"廔\",\n    \"砩\",\n    \"孆\",\n    \"灴\",\n    \"伷\",\n    \"兪\",\n    \"鴗\",\n    \"澯\",\n    \"幚\",\n    \"旙\",\n    \"勻\",\n    \"礽\",\n    \"婑\",\n    \"鱮\",\n    \"娍\",\n    \"銶\",\n    \"吳\",\n    \"鍟\",\n    \"仼\",\n    \"鳧\",\n    \"彞\",\n    \"娽\",\n    \"昛\",\n    \"鰼\",\n    \"剎\",\n    \"佉\",\n    \"鉏\",\n    \"偸\",\n    \"鰆\",\n    \"讙\",\n    \"橪\",\n    \"啱\",\n    \"岀\",\n    \"孻\",\n    \"釪\",\n    \"乹\",\n    \"鈳\",\n    \"漇\",\n    \"檦\",\n    \"埻\",\n    \"祿\",\n    \"爌\",\n    \"禇\",\n    \"鱵\",\n    \"㸃\",\n    \"梉\",\n    \"燝\",\n    \"霙\",\n    \"炁\",\n    \"飮\",\n    \"蠙\",\n    \"勷\",\n    \"鵎\",\n    \"儥\",\n    \"鐠\",\n    \"唻\",\n    \"廰\",\n    \"嚿\",\n    \"嵕\",\n    \"墱\",\n    \"紑\",\n    \"搖\",\n    \"瘜\",\n    \"皝\",\n    \"鸑\",\n    \"瀁\",\n    \"粵\",\n    \"撚\",\n    \"巑\",\n    \"梀\",\n    \"啯\",\n    \"眛\",\n    \"諴\",\n    \"夊\",\n    \"僙\",\n    \"鍝\",\n    \"裖\",\n    \"鮣\",\n    \"凬\",\n    \"飡\",\n    \"灊\",\n    \"橓\",\n    \"嫳\",\n    \"筳\",\n    \"咑\",\n    \"粍\",\n    \"瓑\",\n    \"璌\",\n    \"伃\",\n    \"閰\",\n    \"傜\",\n    \"黐\",\n    \"謢\",\n    \"驒\",\n    \"橫\",\n    \"蛯\",\n    \"寕\",\n    \"蠵\",\n    \"瞓\",\n    \"旳\",\n    \"翏\",\n    
\"硏\",\n    \"寯\",\n    \"韡\",\n    \"楤\",\n    \"鰃\",\n    \"朿\",\n    \"侞\",\n    \"鵯\",\n    \"愨\",\n    \"祹\",\n    \"厔\",\n    \"丌\",\n    \"盩\",\n    \"謏\",\n    \"魕\",\n    \"啣\",\n    \"閱\",\n    \"曺\",\n    \"枛\",\n    \"罉\",\n    \"卐\",\n    \"樻\",\n    \"鷉\",\n    \"鯒\",\n    \"鋡\",\n    \"磱\",\n    \"枱\",\n    \"攴\",\n    \"蠷\",\n    \"穈\",\n    \"嚟\",\n    \"檽\",\n    \"趐\",\n    \"奐\",\n    \"鋐\",\n    \"檇\",\n    \"薀\",\n    \"峼\",\n    \"咭\",\n    \"訔\",\n    \"韠\",\n    \"鑴\",\n    \"鸐\",\n    \"唃\",\n    \"捦\",\n    \"鸜\",\n    \"誴\",\n    \"罳\",\n    \"璄\",\n    \"暃\",\n    \"夀\",\n    \"賨\",\n    \"鞥\",\n    \"鈊\",\n    \"灡\",\n    \"鮍\",\n    \"懮\",\n    \"籣\",\n    \"昐\",\n    \"陁\",\n    \"襾\",\n    \"鮠\",\n    \"鈏\",\n    \"囍\",\n    \"婯\",\n    \"艔\",\n    \"貭\",\n    \"䰾\",\n    \"姁\",\n    \"禼\",\n    \"堖\",\n    \"鋶\",\n    \"仛\",\n    \"鏷\",\n    \"謜\",\n    \"鑅\",\n    \"忬\",\n    \"蘶\",\n    \"謠\",\n    \"觙\",\n    \"奫\",\n    \"狟\",\n    \"泩\",\n    \"桙\",\n    \"飈\",\n    \"垰\",\n    \"啍\",\n    \"嚞\",\n    \"鯕\",\n    \"蒧\",\n    \"榞\",\n    \"徸\",\n    \"璹\",\n    \"揔\",\n    \"欉\",\n    \"魞\",\n    \"菶\",\n    \"玧\",\n    \"鳯\",\n    \"廍\",\n    \"侚\",\n    \"岰\",\n    \"岧\",\n    \"鋕\",\n    \"凵\",\n    \"彣\",\n    \"崱\",\n    \"媜\",\n    \"倢\",\n    \"鵐\",\n    \"砋\",\n    \"鷚\",\n    \"鱠\",\n    \"鮻\",\n    \"繻\",\n    \"摵\",\n    \"贓\",\n    \"磵\",\n    \"錻\",\n    \"痠\",\n    \"粩\",\n    \"胅\",\n    \"奣\",\n    \"塨\",\n    \"瀠\",\n    \"鸘\",\n    \"啚\",\n    \"娳\",\n    \"霶\",\n    \"壔\",\n    \"峚\",\n    \"甂\",\n    \"廁\",\n    \"覌\",\n    \"鰂\",\n    \"猳\",\n    \"鱻\",\n    \"盫\",\n    \"裿\",\n    \"杬\",\n    \"歛\",\n    \"澋\",\n    \"蘞\",\n    \"嵜\",\n    \"尐\",\n    \"旽\",\n    \"鉌\",\n    \"鎛\",\n    \"豿\",\n    \"凖\",\n    \"榤\",\n    \"禓\",\n    \"龝\",\n    \"悧\",\n    \"鷟\",\n    \"鮟\",\n    \"吋\",\n    \"喢\",\n    \"岪\",\n    \"吥\",\n    \"漵\",\n    \"頠\",\n    \"豔\",\n    \"巿\",\n    \"鑨\",\n    \"醣\",\n    
\"熳\",\n    \"懍\",\n    \"湥\",\n    \"檡\",\n    \"韺\",\n    \"戱\",\n    \"緖\",\n    \"鐈\",\n    \"凉\",\n    \"緃\",\n    \"鮹\",\n    \"媐\",\n    \"爯\",\n    \"巆\",\n    \"褍\",\n    \"鐬\",\n    \"昍\",\n    \"扙\",\n    \"鍳\",\n    \"芛\",\n    \"蟳\",\n    \"嬅\",\n    \"糬\",\n    \"吔\",\n    \"塭\",\n    \"譿\",\n    \"冧\",\n    \"鏓\",\n    \"嶪\",\n    \"嗹\",\n    \"椵\",\n    \"姀\",\n    \"閿\",\n    \"褧\",\n    \"錞\",\n    \"玆\",\n    \"笘\",\n    \"篔\",\n    \"萡\",\n    \"鶡\",\n    \"螐\",\n    \"鮄\",\n    \"鰟\",\n    \"脷\",\n    \"啲\",\n    \"杤\",\n    \"蓚\",\n    \"尗\",\n    \"娎\",\n    \"殟\",\n    \"淥\",\n    \"蝚\",\n    \"蓧\",\n    \"彐\",\n    \"嚤\",\n    \"銍\",\n    \"囒\",\n    \"坶\",\n    \"淩\",\n    \"鶼\",\n    \"鱂\",\n    \"喼\",\n    \"燫\",\n    \"肏\",\n    \"姵\",\n    \"廌\",\n    \"禟\",\n    \"籝\",\n    \"迵\",\n    \"嵨\",\n    \"堮\",\n    \"蟌\",\n    \"憍\",\n    \"廕\",\n    \"蜑\",\n    \"緁\",\n    \"唘\",\n    \"竩\",\n    \"崙\",\n    \"璚\",\n    \"粄\",\n    \"栨\",\n    \"罈\",\n    \"梫\",\n    \"貤\",\n    \"藔\",\n    \"蜯\",\n    \"訁\",\n    \"斖\",\n    \"煶\",\n    \"馦\",\n    \"妠\",\n    \"閟\",\n    \"疕\",\n    \"夆\",\n    \"鎪\",\n    \"膥\",\n    \"澻\",\n    \"嘢\",\n    \"嚐\",\n    \"靁\",\n    \"鎻\",\n    \"鰛\",\n    \"穵\",\n    \"烋\",\n    \"縕\",\n    \"褎\",\n    \"疒\",\n    \"壠\",\n    \"溼\",\n    \"圂\",\n    \"咅\",\n    \"鯭\",\n    \"鯙\",\n    \"磘\",\n    \"玨\",\n    \"珤\",\n    \"朊\",\n    \"蚼\",\n    \"濶\",\n    \"薞\",\n    \"嚩\",\n    \"丟\",\n    \"嫺\",\n    \"鯻\",\n    \"椲\",\n    \"鰕\",\n    \"刂\",\n    \"蠘\",\n    \"踎\",\n    \"瀴\",\n    \"琁\",\n    \"鰶\",\n    \"瑴\",\n    \"肜\",\n    \"㐂\",\n    \"欥\",\n    \"媺\",\n    \"竻\",\n    \"讚\",\n    \"𣇉\",\n    \"裵\",\n    \"緜\",\n    \"廩\",\n    \"齧\",\n    \"叄\",\n    \"俌\",\n    \"厰\",\n    \"滀\",\n    \"錄\",\n    \"鷫\",\n    \"鯗\",\n    \"攞\",\n    \"姌\",\n    \"蔝\",\n    \"幷\",\n    \"縤\",\n    \"屻\",\n    \"鯃\",\n    \"雞\",\n    \"纁\",\n    \"嫲\",\n    \"嵮\",\n    \"屭\",\n    \"嶃\",\n    \"跩\",\n    
\"鋗\",\n    \"蕢\",\n    \"篊\",\n    \"俬\",\n    \"淎\",\n    \"暻\",\n    \"鏻\",\n    \"憓\",\n    \"玗\",\n    \"溈\",\n    \"笭\",\n    \"糢\",\n    \"勳\",\n    \"閒\",\n    \"沍\",\n    \"咾\",\n    \"鉷\",\n    \"蘵\",\n    \"俁\",\n    \"崵\",\n    \"毸\",\n    \"苪\",\n    \"掙\",\n    \"鴡\",\n    \"萭\",\n    \"俴\",\n    \"屜\",\n    \"蒾\",\n    \"艹\",\n    \"剷\",\n    \"慍\",\n    \"朮\",\n    \"枴\",\n    \"氳\",\n    \"猓\",\n    \"甽\",\n    \"箝\",\n    \"譁\",\n    \"贗\",\n    \"迆\",\n    \"鈽\",\n    \"鍊\",\n    \"鍰\",\n    \"鏍\",\n    \"靦\",\n    \"餽\",\n    \"丮\",\n    \"丱\",\n    \"仜\",\n    \"仩\",\n    \"伬\",\n    \"伔\",\n    \"仱\",\n    \"伀\",\n    \"伻\",\n    \"佢\",\n    \"佒\",\n    \"侀\",\n    \"侇\",\n    \"佷\",\n    \"佌\",\n    \"佪\",\n    \"侐\",\n    \"侜\",\n    \"俓\",\n    \"侲\",\n    \"俉\",\n    \"侻\",\n    \"侳\",\n    \"俇\",\n    \"倅\",\n    \"倇\",\n    \"倰\",\n    \"倛\",\n    \"倳\",\n    \"倷\",\n    \"俷\",\n    \"倠\",\n    \"偯\",\n    \"偞\",\n    \"偠\",\n    \"偋\",\n    \"偝\",\n    \"偛\",\n    \"偢\",\n    \"偅\",\n    \"偟\",\n    \"偩\",\n    \"偫\",\n    \"傛\",\n    \"傔\",\n    \"傞\",\n    \"傋\",\n    \"傌\",\n    \"傎\",\n    \"傝\",\n    \"偨\",\n    \"傂\",\n    \"傽\",\n    \"傿\",\n    \"僆\",\n    \"傮\",\n    \"僄\",\n    \"僈\",\n    \"傰\",\n    \"僁\",\n    \"傱\",\n    \"僋\",\n    \"僗\",\n    \"僛\",\n    \"僪\",\n    \"僝\",\n    \"僓\",\n    \"僿\",\n    \"儃\",\n    \"儰\",\n    \"僸\",\n    \"僶\",\n    \"僾\",\n    \"儌\",\n    \"僽\",\n    \"儜\",\n    \"儓\",\n    \"儗\",\n    \"儑\",\n    \"儢\",\n    \"儤\",\n    \"儠\",\n    \"儸\",\n    \"儹\",\n    \"儽\",\n    \"冓\",\n    \"冘\",\n    \"冞\",\n    \"凊\",\n    \"凅\",\n    \"凔\",\n    \"刌\",\n    \"刉\",\n    \"刓\",\n    \"刜\",\n    \"刞\",\n    \"刵\",\n    \"刲\",\n    \"剆\",\n    \"刱\",\n    \"剉\",\n    \"剚\",\n    \"剒\",\n    \"剫\",\n    \"剭\",\n    \"剬\",\n    \"剺\",\n    \"剸\",\n    \"剻\",\n    \"剼\",\n    \"劀\",\n    \"劋\",\n    \"劖\",\n    \"劘\",\n    \"劗\",\n    \"劙\",\n    \"劦\",\n    \"勴\",\n    \"匊\",\n    \"匢\",\n    
\"匰\",\n    \"匴\",\n    \"匷\",\n    \"匽\",\n    \"卌\",\n    \"卼\",\n    \"厎\",\n    \"厒\",\n    \"厗\",\n    \"厞\",\n    \"厜\",\n    \"厤\",\n    \"厬\",\n    \"厹\",\n    \"吰\",\n    \"吷\",\n    \"吪\",\n    \"呿\",\n    \"咈\",\n    \"呫\",\n    \"呺\",\n    \"呥\",\n    \"呬\",\n    \"呴\",\n    \"茍\",\n    \"咷\",\n    \"咮\",\n    \"咶\",\n    \"哅\",\n    \"咠\",\n    \"咢\",\n    \"唦\",\n    \"唗\",\n    \"唒\",\n    \"哤\",\n    \"唚\",\n    \"唈\",\n    \"哫\",\n    \"唅\",\n    \"唴\",\n    \"啢\",\n    \"唶\",\n    \"啒\",\n    \"啅\",\n    \"唌\",\n    \"唲\",\n    \"喨\",\n    \"喥\",\n    \"喭\",\n    \"噅\",\n    \"喓\",\n    \"喣\",\n    \"啽\",\n    \"喌\",\n    \"嗃\",\n    \"嗛\",\n    \"嗋\",\n    \"嗀\",\n    \"喿\",\n    \"喍\",\n    \"嗏\",\n    \"嗕\",\n    \"嗈\",\n    \"嘕\",\n    \"嘒\",\n    \"嗼\",\n    \"嘐\",\n    \"嘓\",\n    \"嘂\",\n    \"嗺\",\n    \"嘝\",\n    \"嘄\",\n    \"嗿\",\n    \"噈\",\n    \"噊\",\n    \"噆\",\n    \"噚\",\n    \"嘳\",\n    \"嘽\",\n    \"嘾\",\n    \"噮\",\n    \"噳\",\n    \"噣\",\n    \"噭\",\n    \"噞\",\n    \"嚌\",\n    \"嚍\",\n    \"嚃\",\n    \"嚘\",\n    \"嚜\",\n    \"嚫\",\n    \"嚪\",\n    \"嚬\",\n    \"嚲\",\n    \"嚵\",\n    \"嚽\",\n    \"嚾\",\n    \"囆\",\n    \"囅\",\n    \"囋\",\n    \"囗\",\n    \"圁\",\n    \"圞\",\n    \"圠\",\n    \"坁\",\n    \"坅\",\n    \"坲\",\n    \"坱\",\n    \"垀\",\n    \"坴\",\n    \"垗\",\n    \"垝\",\n    \"垔\",\n    \"垘\",\n    \"垽\",\n    \"垼\",\n    \"埢\",\n    \"埶\",\n    \"堩\",\n    \"堣\",\n    \"塈\",\n    \"堥\",\n    \"塓\",\n    \"塉\",\n    \"塯\",\n    \"塕\",\n    \"塼\",\n    \"墆\",\n    \"塿\",\n    \"塴\",\n    \"墋\",\n    \"塺\",\n    \"墝\",\n    \"墯\",\n    \"壈\",\n    \"墽\",\n    \"壖\",\n    \"壝\",\n    \"壛\",\n    \"壾\",\n    \"壿\",\n    \"夃\",\n    \"夎\",\n    \"夒\",\n    \"夗\",\n    \"奅\",\n    \"奊\",\n    \"奰\",\n    \"奲\",\n    \"奼\",\n    \"妦\",\n    \"妎\",\n    \"妢\",\n    \"妐\",\n    \"妵\",\n    \"姏\",\n    \"姎\",\n    \"㚷\",\n    \"姡\",\n    \"姺\",\n    \"姼\",\n    \"娭\",\n    \"婐\",\n    \"婟\",\n    \"婥\",\n    \"婓\",\n    
\"婗\",\n    \"媔\",\n    \"媟\",\n    \"媢\",\n    \"婸\",\n    \"媦\",\n    \"媥\",\n    \"媬\",\n    \"媕\",\n    \"娷\",\n    \"嫇\",\n    \"嫋\",\n    \"媰\",\n    \"媻\",\n    \"嫮\",\n    \"嫥\",\n    \"嫢\",\n    \"嫛\",\n    \"嫿\",\n    \"嫴\",\n    \"嫷\",\n    \"嫶\",\n    \"嬎\",\n    \"嬓\",\n    \"嬐\",\n    \"嬲\",\n    \"嬽\",\n    \"孈\",\n    \"屘\",\n    \"孲\",\n    \"孷\",\n    \"宎\",\n    \"宨\",\n    \"寪\",\n    \"寍\",\n    \"寋\",\n    \"寑\",\n    \"寙\",\n    \"寠\",\n    \"寱\",\n    \"尌\",\n    \"尒\",\n    \"尟\",\n    \"尰\",\n    \"尳\",\n    \"屖\",\n    \"屔\",\n    \"屝\",\n    \"屧\",\n    \"屩\",\n    \"屮\",\n    \"屴\",\n    \"岏\",\n    \"岋\",\n    \"岉\",\n    \"岒\",\n    \"岮\",\n    \"岤\",\n    \"岯\",\n    \"岟\",\n    \"岝\",\n    \"峐\",\n    \"峌\",\n    \"峞\",\n    \"峉\",\n    \"峊\",\n    \"峬\",\n    \"峮\",\n    \"峷\",\n    \"崝\",\n    \"崨\",\n    \"崥\",\n    \"崏\",\n    \"崰\",\n    \"崣\",\n    \"崷\",\n    \"嵃\",\n    \"嵑\",\n    \"崳\",\n    \"崺\",\n    \"嵂\",\n    \"嵱\",\n    \"嵣\",\n    \"嵥\",\n    \"嵞\",\n    \"嶀\",\n    \"嵽\",\n    \"嶆\",\n    \"嵺\",\n    \"嵷\",\n    \"嶊\",\n    \"嶉\",\n    \"嶈\",\n    \"嵾\",\n    \"嶕\",\n    \"嶜\",\n    \"嶡\",\n    \"嶚\",\n    \"嶞\",\n    \"嶱\",\n    \"嶩\",\n    \"嶵\",\n    \"嶭\",\n    \"巃\",\n    \"巏\",\n    \"巕\",\n    \"巟\",\n    \"巹\",\n    \"帊\",\n    \"帗\",\n    \"帟\",\n    \"帣\",\n    \"帠\",\n    \"帤\",\n    \"帩\",\n    \"帾\",\n    \"帴\",\n    \"幏\",\n    \"幎\",\n    \"幓\",\n    \"幩\",\n    \"幝\",\n    \"幠\",\n    \"幧\",\n    \"幨\",\n    \"幦\",\n    \"幭\",\n    \"幰\",\n    \"庂\",\n    \"庉\",\n    \"庌\",\n    \"庈\",\n    \"庰\",\n    \"庛\",\n    \"庣\",\n    \"庨\",\n    \"庮\",\n    \"庪\",\n    \"庬\",\n    \"庴\",\n    \"廅\",\n    \"廇\",\n    \"廘\",\n    \"廗\",\n    \"廎\",\n    \"廜\",\n    \"緳\",\n    \"廦\",\n    \"廥\",\n    \"廮\",\n    \"廯\",\n    \"蠯\",\n    \"廾\",\n    \"弚\",\n    \"弝\",\n    \"弣\",\n    \"弤\",\n    \"弮\",\n    \"弳\",\n    \"彃\",\n    \"彉\",\n    \"彋\",\n    \"彏\",\n    \"彯\",\n    \"彴\",\n    \"彸\",\n    
\"彾\",\n    \"徦\",\n    \"徥\",\n    \"徯\",\n    \"徲\",\n    \"徾\",\n    \"徿\",\n    \"忀\",\n    \"忁\",\n    \"忔\",\n    \"忕\",\n    \"忨\",\n    \"忣\",\n    \"忷\",\n    \"忥\",\n    \"怭\",\n    \"怲\",\n    \"怋\",\n    \"怴\",\n    \"怗\",\n    \"怚\",\n    \"怞\",\n    \"怬\",\n    \"怢\",\n    \"怐\",\n    \"怮\",\n    \"怓\",\n    \"怷\",\n    \"怹\",\n    \"恲\",\n    \"恞\",\n    \"恅\",\n    \"恇\",\n    \"恉\",\n    \"恛\",\n    \"恌\",\n    \"恀\",\n    \"恟\",\n    \"悀\",\n    \"悁\",\n    \"悕\",\n    \"悗\",\n    \"悇\",\n    \"悊\",\n    \"悐\",\n    \"悾\",\n    \"悺\",\n    \"惓\",\n    \"惤\",\n    \"惈\",\n    \"悷\",\n    \"惉\",\n    \"悹\",\n    \"惌\",\n    \"惢\",\n    \"惄\",\n    \"愊\",\n    \"愖\",\n    \"愅\",\n    \"惵\",\n    \"愓\",\n    \"惸\",\n    \"惼\",\n    \"惾\",\n    \"慉\",\n    \"慅\",\n    \"愶\",\n    \"愲\",\n    \"愮\",\n    \"愯\",\n    \"愬\",\n    \"慁\",\n    \"慞\",\n    \"慱\",\n    \"慒\",\n    \"慓\",\n    \"慲\",\n    \"憀\",\n    \"慴\",\n    \"慔\",\n    \"慺\",\n    \"慛\",\n    \"憃\",\n    \"慹\",\n    \"憱\",\n    \"憰\",\n    \"憢\",\n    \"憉\",\n    \"憛\",\n    \"憯\",\n    \"憟\",\n    \"憪\",\n    \"憡\",\n    \"憝\",\n    \"憖\",\n    \"懅\",\n    \"憴\",\n    \"懆\",\n    \"懁\",\n    \"憿\",\n    \"憸\",\n    \"憵\",\n    \"憼\",\n    \"懧\",\n    \"懠\",\n    \"懥\",\n    \"懤\",\n    \"懘\",\n    \"懭\",\n    \"懱\",\n    \"懪\",\n    \"懰\",\n    \"懫\",\n    \"懻\",\n    \"戁\",\n    \"戃\",\n    \"戄\",\n    \"戉\",\n    \"戠\",\n    \"酨\",\n    \"戺\",\n    \"扐\",\n    \"扜\",\n    \"扤\",\n    \"扡\",\n    \"扢\",\n    \"抆\",\n    \"抌\",\n    \"抎\",\n    \"抏\",\n    \"扻\",\n    \"抭\",\n    \"抴\",\n    \"拑\",\n    \"抾\",\n    \"抪\",\n    \"抶\",\n    \"抮\",\n    \"挍\",\n    \"挋\",\n    \"挃\",\n    \"拫\",\n    \"拹\",\n    \"挏\",\n    \"挌\",\n    \"拸\",\n    \"挀\",\n    \"拲\",\n    \"捖\",\n    \"挬\",\n    \"挶\",\n    \"揤\",\n    \"捊\",\n    \"挼\",\n    \"挩\",\n    \"捁\",\n    \"挴\",\n    \"捘\",\n    \"捔\",\n    \"捥\",\n    \"掝\",\n    \"掗\",\n    \"掫\",\n    \"掯\",\n    \"捵\",\n    \"掜\",\n    
\"捼\",\n    \"掤\",\n    \"掔\",\n    \"掱\",\n    \"揎\",\n    \"揥\",\n    \"揨\",\n    \"揯\",\n    \"揊\",\n    \"揲\",\n    \"揵\",\n    \"摡\",\n    \"揟\",\n    \"揝\",\n    \"揜\",\n    \"揘\",\n    \"揅\",\n    \"揱\",\n    \"搆\",\n    \"搟\",\n    \"搕\",\n    \"搘\",\n    \"搹\",\n    \"搷\",\n    \"搣\",\n    \"搰\",\n    \"搊\",\n    \"搚\",\n    \"摀\",\n    \"搧\",\n    \"搫\",\n    \"摍\",\n    \"摝\",\n    \"摲\",\n    \"摦\",\n    \"摎\",\n    \"摋\",\n    \"摓\",\n    \"摐\",\n    \"摿\",\n    \"摮\",\n    \"摰\",\n    \"撢\",\n    \"撠\",\n    \"撗\",\n    \"撜\",\n    \"撋\",\n    \"撊\",\n    \"撌\",\n    \"撟\",\n    \"擗\",\n    \"擖\",\n    \"擏\",\n    \"擉\",\n    \"撽\",\n    \"擩\",\n    \"擣\",\n    \"擫\",\n    \"擭\",\n    \"擨\",\n    \"擽\",\n    \"擸\",\n    \"攇\",\n    \"攐\",\n    \"攍\",\n    \"攌\",\n    \"攗\",\n    \"攕\",\n    \"攓\",\n    \"攡\",\n    \"攠\",\n    \"攦\",\n    \"攩\",\n    \"攭\",\n    \"攲\",\n    \"攳\",\n    \"敁\",\n    \"敊\",\n    \"敆\",\n    \"敓\",\n    \"敧\",\n    \"敪\",\n    \"敤\",\n    \"敜\",\n    \"敯\",\n    \"敳\",\n    \"敶\",\n    \"敺\",\n    \"敹\",\n    \"敿\",\n    \"斁\",\n    \"斀\",\n    \"斄\",\n    \"斒\",\n    \"斔\",\n    \"斞\",\n    \"斨\",\n    \"斪\",\n    \"斻\",\n    \"旍\",\n    \"旓\",\n    \"旚\",\n    \"旝\",\n    \"旟\",\n    \"昲\",\n    \"昦\",\n    \"昢\",\n    \"晇\",\n    \"晥\",\n    \"晜\",\n    \"晼\",\n    \"晬\",\n    \"暀\",\n    \"暆\",\n    \"暍\",\n    \"暋\",\n    \"暡\",\n    \"暰\",\n    \"暩\",\n    \"曀\",\n    \"曊\",\n    \"曋\",\n    \"曏\",\n    \"曒\",\n    \"曚\",\n    \"曣\",\n    \"曭\",\n    \"朁\",\n    \"朅\",\n    \"朄\",\n    \"朒\",\n    \"朘\",\n    \"朣\",\n    \"朾\",\n    \"朹\",\n    \"朻\",\n    \"朼\",\n    \"杅\",\n    \"杇\",\n    \"杝\",\n    \"杗\",\n    \"枎\",\n    \"杶\",\n    \"枆\",\n    \"枌\",\n    \"柲\",\n    \"枺\",\n    \"枻\",\n    \"柸\",\n    \"柀\",\n    \"柅\",\n    \"柫\",\n    \"柤\",\n    \"柍\",\n    \"柮\",\n    \"柣\",\n    \"柂\",\n    \"柧\",\n    \"栚\",\n    \"桋\",\n    \"桏\",\n    \"栱\",\n    \"栵\",\n    \"栫\",\n    \"栭\",\n    \"栯\",\n    
\"栘\",\n    \"栔\",\n    \"梡\",\n    \"梇\",\n    \"梐\",\n    \"桭\",\n    \"梮\",\n    \"楖\",\n    \"梬\",\n    \"梩\",\n    \"桵\",\n    \"梒\",\n    \"椌\",\n    \"椄\",\n    \"棜\",\n    \"棷\",\n    \"棳\",\n    \"棌\",\n    \"椈\",\n    \"楰\",\n    \"棯\",\n    \"椔\",\n    \"棸\",\n    \"楟\",\n    \"楎\",\n    \"楱\",\n    \"楅\",\n    \"楺\",\n    \"楈\",\n    \"楛\",\n    \"楉\",\n    \"楬\",\n    \"椳\",\n    \"楀\",\n    \"楄\",\n    \"楶\",\n    \"楘\",\n    \"榶\",\n    \"槉\",\n    \"榠\",\n    \"榬\",\n    \"榼\",\n    \"榙\",\n    \"榩\",\n    \"榾\",\n    \"榯\",\n    \"槄\",\n    \"榽\",\n    \"榹\",\n    \"槥\",\n    \"槸\",\n    \"樕\",\n    \"樠\",\n    \"槬\",\n    \"槢\",\n    \"樛\",\n    \"樝\",\n    \"槾\",\n    \"樧\",\n    \"槮\",\n    \"樔\",\n    \"槷\",\n    \"橀\",\n    \"樴\",\n    \"橉\",\n    \"橧\",\n    \"樲\",\n    \"橨\",\n    \"橝\",\n    \"橭\",\n    \"橶\",\n    \"樿\",\n    \"橁\",\n    \"檍\",\n    \"檖\",\n    \"檁\",\n    \"檟\",\n    \"橾\",\n    \"檛\",\n    \"檓\",\n    \"檕\",\n    \"檃\",\n    \"櫅\",\n    \"檹\",\n    \"櫡\",\n    \"櫠\",\n    \"櫌\",\n    \"櫑\",\n    \"櫙\",\n    \"櫋\",\n    \"櫜\",\n    \"櫐\",\n    \"櫫\",\n    \"櫬\",\n    \"櫰\",\n    \"櫹\",\n    \"櫺\",\n    \"櫼\",\n    \"欃\",\n    \"欋\",\n    \"欈\",\n    \"欐\",\n    \"欑\",\n    \"欘\",\n    \"欨\",\n    \"欴\",\n    \"欯\",\n    \"欭\",\n    \"欱\",\n    \"欶\",\n    \"欳\",\n    \"欷\",\n    \"欿\",\n    \"歂\",\n    \"歈\",\n    \"歍\",\n    \"歋\",\n    \"歕\",\n    \"歔\",\n    \"歜\",\n    \"歠\",\n    \"歭\",\n    \"歾\",\n    \"肂\",\n    \"殈\",\n    \"殏\",\n    \"殔\",\n    \"殗\",\n    \"殙\",\n    \"殠\",\n    \"殥\",\n    \"殢\",\n    \"殦\",\n    \"殧\",\n    \"殰\",\n    \"殶\",\n    \"毃\",\n    \"毄\",\n    \"毈\",\n    \"毇\",\n    \"毊\",\n    \"毚\",\n    \"毞\",\n    \"毦\",\n    \"毤\",\n    \"毨\",\n    \"毣\",\n    \"毰\",\n    \"毲\",\n    \"毻\",\n    \"毼\",\n    \"毾\",\n    \"氁\",\n    \"氀\",\n    \"氄\",\n    \"氠\",\n    \"氶\",\n    \"汃\",\n    \"汒\",\n    \"汏\",\n    \"汍\",\n    \"汸\",\n    \"沋\",\n    \"汱\",\n    \"汯\",\n    \"沕\",\n    
\"汦\",\n    \"汳\",\n    \"泬\",\n    \"沶\",\n    \"沬\",\n    \"泧\",\n    \"沷\",\n    \"泭\",\n    \"泲\",\n    \"泒\",\n    \"沴\",\n    \"洟\",\n    \"洊\",\n    \"洀\",\n    \"浺\",\n    \"浶\",\n    \"洍\",\n    \"涒\",\n    \"浘\",\n    \"浢\",\n    \"涊\",\n    \"涆\",\n    \"浧\",\n    \"涗\",\n    \"涳\",\n    \"涬\",\n    \"淢\",\n    \"涷\",\n    \"淔\",\n    \"渀\",\n    \"淈\",\n    \"涾\",\n    \"淊\",\n    \"涽\",\n    \"淭\",\n    \"湆\",\n    \"湇\",\n    \"湅\",\n    \"湢\",\n    \"渿\",\n    \"湁\",\n    \"渜\",\n    \"渳\",\n    \"湀\",\n    \"渻\",\n    \"渮\",\n    \"湨\",\n    \"湡\",\n    \"渱\",\n    \"渨\",\n    \"湠\",\n    \"湱\",\n    \"湩\",\n    \"渹\",\n    \"溛\",\n    \"滖\",\n    \"溓\",\n    \"溔\",\n    \"滒\",\n    \"溰\",\n    \"溾\",\n    \"滜\",\n    \"滵\",\n    \"滱\",\n    \"漃\",\n    \"漥\",\n    \"漮\",\n    \"潎\",\n    \"漙\",\n    \"漧\",\n    \"漘\",\n    \"漒\",\n    \"滭\",\n    \"漊\",\n    \"潳\",\n    \"滮\",\n    \"潀\",\n    \"漰\",\n    \"潃\",\n    \"漅\",\n    \"濆\",\n    \"澒\",\n    \"澅\",\n    \"潚\",\n    \"潠\",\n    \"澖\",\n    \"潶\",\n    \"潬\",\n    \"潒\",\n    \"潐\",\n    \"潗\",\n    \"澓\",\n    \"潝\",\n    \"濇\",\n    \"濎\",\n    \"濈\",\n    \"濄\",\n    \"澞\",\n    \"澨\",\n    \"瀄\",\n    \"濌\",\n    \"澩\",\n    \"濴\",\n    \"濔\",\n    \"濣\",\n    \"濭\",\n    \"濧\",\n    \"濦\",\n    \"瀇\",\n    \"瀎\",\n    \"濿\",\n    \"瀀\",\n    \"濻\",\n    \"瀙\",\n    \"瀖\",\n    \"瀫\",\n    \"瀡\",\n    \"瀢\",\n    \"瀩\",\n    \"瀯\",\n    \"瀷\",\n    \"灂\",\n    \"瀸\",\n    \"瀿\",\n    \"瀺\",\n    \"灄\",\n    \"灉\",\n    \"灖\",\n    \"灗\",\n    \"灛\",\n    \"灟\",\n    \"灨\",\n    \"灩\",\n    \"灪\",\n    \"炾\",\n    \"炰\",\n    \"烓\",\n    \"烑\",\n    \"缹\",\n    \"焍\",\n    \"烰\",\n    \"焠\",\n    \"焮\",\n    \"焣\",\n    \"煆\",\n    \"煣\",\n    \"煝\",\n    \"熐\",\n    \"熉\",\n    \"熀\",\n    \"熂\",\n    \"熚\",\n    \"燅\",\n    \"燂\",\n    \"熸\",\n    \"燀\",\n    \"燡\",\n    \"爁\",\n    \"爊\",\n    \"爂\",\n    \"爓\",\n    \"爞\",\n    \"爢\",\n    \"爣\",\n    \"牄\",\n    \"牉\",\n    
\"牋\",\n    \"牏\",\n    \"牣\",\n    \"牬\",\n    \"牰\",\n    \"牸\",\n    \"牷\",\n    \"犈\",\n    \"犉\",\n    \"犆\",\n    \"犅\",\n    \"犌\",\n    \"犑\",\n    \"犐\",\n    \"犗\",\n    \"犕\",\n    \"犓\",\n    \"犘\",\n    \"犚\",\n    \"犝\",\n    \"犞\",\n    \"犥\",\n    \"犦\",\n    \"犤\",\n    \"犣\",\n    \"犩\",\n    \"犪\",\n    \"犮\",\n    \"犵\",\n    \"犿\",\n    \"狆\",\n    \"狖\",\n    \"狋\",\n    \"狘\",\n    \"狜\",\n    \"狔\",\n    \"狚\",\n    \"狌\",\n    \"狑\",\n    \"狊\",\n    \"狤\",\n    \"狫\",\n    \"狪\",\n    \"狣\",\n    \"猀\",\n    \"狾\",\n    \"猑\",\n    \"猘\",\n    \"猈\",\n    \"狿\",\n    \"猏\",\n    \"猋\",\n    \"猒\",\n    \"猧\",\n    \"猲\",\n    \"猭\",\n    \"猦\",\n    \"猣\",\n    \"猵\",\n    \"猼\",\n    \"獂\",\n    \"獀\",\n    \"獊\",\n    \"獑\",\n    \"獌\",\n    \"獘\",\n    \"獞\",\n    \"獟\",\n    \"獝\",\n    \"獛\",\n    \"獡\",\n    \"獩\",\n    \"獦\",\n    \"獥\",\n    \"獳\",\n    \"獶\",\n    \"獽\",\n    \"獿\",\n    \"玂\",\n    \"玁\",\n    \"玈\",\n    \"玊\",\n    \"玔\",\n    \"珓\",\n    \"珶\",\n    \"琖\",\n    \"瑵\",\n    \"璊\",\n    \"瑽\",\n    \"璅\",\n    \"瑿\",\n    \"璗\",\n    \"瓁\",\n    \"瓋\",\n    \"瓝\",\n    \"瓟\",\n    \"瓡\",\n    \"瓥\",\n    \"瓨\",\n    \"瓬\",\n    \"瓵\",\n    \"瓾\",\n    \"瓽\",\n    \"甀\",\n    \"甃\",\n    \"甈\",\n    \"甋\",\n    \"甐\",\n    \"甒\",\n    \"甔\",\n    \"甖\",\n    \"甝\",\n    \"甮\",\n    \"甿\",\n    \"畟\",\n    \"畣\",\n    \"畽\",\n    \"疀\",\n    \"疧\",\n    \"痁\",\n    \"疻\",\n    \"痀\",\n    \"痎\",\n    \"痏\",\n    \"痋\",\n    \"痌\",\n    \"痑\",\n    \"痚\",\n    \"痡\",\n    \"痝\",\n    \"痗\",\n    \"痯\",\n    \"瘏\",\n    \"痷\",\n    \"痸\",\n    \"痻\",\n    \"瘈\",\n    \"瘑\",\n    \"瘝\",\n    \"瘣\",\n    \"瘯\",\n    \"瘱\",\n    \"瘽\",\n    \"癈\",\n    \"癉\",\n    \"癙\",\n    \"癐\",\n    \"癓\",\n    \"癠\",\n    \"癵\",\n    \"癹\",\n    \"皊\",\n    \"皏\",\n    \"皫\",\n    \"皯\",\n    \"皵\",\n    \"皻\",\n    \"皽\",\n    \"皾\",\n    \"盄\",\n    \"盓\",\n    \"盝\",\n    \"盬\",\n    \"盭\",\n    \"盳\",\n    \"眃\",\n    
\"眅\",\n    \"盻\",\n    \"眝\",\n    \"眐\",\n    \"眓\",\n    \"眒\",\n    \"眣\",\n    \"眑\",\n    \"眕\",\n    \"眹\",\n    \"眱\",\n    \"眲\",\n    \"眴\",\n    \"眳\",\n    \"眽\",\n    \"睆\",\n    \"睅\",\n    \"睊\",\n    \"睋\",\n    \"睌\",\n    \"睕\",\n    \"睟\",\n    \"睒\",\n    \"睖\",\n    \"睩\",\n    \"睧\",\n    \"睔\",\n    \"瞁\",\n    \"睼\",\n    \"瞂\",\n    \"睮\",\n    \"睯\",\n    \"瞏\",\n    \"瞉\",\n    \"瞚\",\n    \"瞝\",\n    \"瞡\",\n    \"瞛\",\n    \"瞲\",\n    \"瞷\",\n    \"瞶\",\n    \"瞴\",\n    \"矂\",\n    \"矉\",\n    \"矊\",\n    \"矌\",\n    \"矎\",\n    \"矏\",\n    \"矐\",\n    \"矔\",\n    \"矕\",\n    \"矘\",\n    \"矠\",\n    \"矱\",\n    \"矲\",\n    \"矹\",\n    \"矺\",\n    \"砅\",\n    \"砐\",\n    \"砏\",\n    \"砎\",\n    \"砨\",\n    \"硈\",\n    \"硉\",\n    \"硠\",\n    \"硥\",\n    \"硱\",\n    \"硰\",\n    \"硩\",\n    \"碔\",\n    \"碄\",\n    \"碅\",\n    \"碆\",\n    \"硾\",\n    \"碫\",\n    \"碞\",\n    \"磍\",\n    \"磌\",\n    \"磎\",\n    \"磈\",\n    \"磃\",\n    \"磝\",\n    \"磩\",\n    \"磥\",\n    \"磞\",\n    \"磛\",\n    \"磳\",\n    \"磼\",\n    \"磿\",\n    \"礔\",\n    \"礉\",\n    \"礝\",\n    \"礛\",\n    \"礜\",\n    \"礥\",\n    \"礣\",\n    \"礧\",\n    \"礨\",\n    \"礭\",\n    \"礿\",\n    \"祌\",\n    \"祅\",\n    \"祔\",\n    \"祒\",\n    \"祑\",\n    \"祤\",\n    \"祩\",\n    \"祪\",\n    \"祣\",\n    \"祫\",\n    \"祡\",\n    \"祴\",\n    \"祳\",\n    \"禂\",\n    \"禗\",\n    \"禜\",\n    \"禫\",\n    \"禭\",\n    \"禬\",\n    \"禴\",\n    \"禷\",\n    \"禸\",\n    \"歶\",\n    \"秅\",\n    \"秏\",\n    \"秖\",\n    \"秎\",\n    \"秮\",\n    \"秪\",\n    \"秺\",\n    \"秶\",\n    \"稊\",\n    \"稒\",\n    \"稫\",\n    \"穊\",\n    \"稰\",\n    \"稯\",\n    \"穋\",\n    \"穛\",\n    \"穖\",\n    \"穧\",\n    \"穨\",\n    \"穮\",\n    \"穬\",\n    \"穭\",\n    \"穱\",\n    \"穾\",\n    \"窆\",\n    \"窉\",\n    \"窌\",\n    \"窏\",\n    \"窔\",\n    \"窐\",\n    \"窙\",\n    \"窢\",\n    \"窞\",\n    \"窫\",\n    \"窲\",\n    \"窴\",\n    \"窱\",\n    \"窾\",\n    \"竀\",\n    \"竁\",\n    \"竷\",\n    \"笐\",\n    \"笓\",\n    
\"笅\",\n    \"笵\",\n    \"笻\",\n    \"笴\",\n    \"笰\",\n    \"笢\",\n    \"笝\",\n    \"笲\",\n    \"筄\",\n    \"筡\",\n    \"箈\",\n    \"箊\",\n    \"箌\",\n    \"箛\",\n    \"箎\",\n    \"箘\",\n    \"箄\",\n    \"箷\",\n    \"箾\",\n    \"篎\",\n    \"箯\",\n    \"箹\",\n    \"篞\",\n    \"篣\",\n    \"篧\",\n    \"篕\",\n    \"篨\",\n    \"篹\",\n    \"簅\",\n    \"篲\",\n    \"篿\",\n    \"篻\",\n    \"簎\",\n    \"篴\",\n    \"簂\",\n    \"簁\",\n    \"篸\",\n    \"篽\",\n    \"簜\",\n    \"簩\",\n    \"簙\",\n    \"簭\",\n    \"簦\",\n    \"簨\",\n    \"簢\",\n    \"簥\",\n    \"簳\",\n    \"簼\",\n    \"簬\",\n    \"簻\",\n    \"籉\",\n    \"籈\",\n    \"籊\",\n    \"籔\",\n    \"籗\",\n    \"籧\",\n    \"籦\",\n    \"籯\",\n    \"籺\",\n    \"籸\",\n    \"籹\",\n    \"粊\",\n    \"粔\",\n    \"粻\",\n    \"糔\",\n    \"糪\",\n    \"糱\",\n    \"糷\",\n    \"紎\",\n    \"紟\",\n    \"紒\",\n    \"紽\",\n    \"紸\",\n    \"紶\",\n    \"紩\",\n    \"絇\",\n    \"紾\",\n    \"絘\",\n    \"絯\",\n    \"絓\",\n    \"絧\",\n    \"絏\",\n    \"絭\",\n    \"絫\",\n    \"綀\",\n    \"綍\",\n    \"絿\",\n    \"綅\",\n    \"絻\",\n    \"絼\",\n    \"綔\",\n    \"綷\",\n    \"緂\",\n    \"綪\",\n    \"緀\",\n    \"緅\",\n    \"緎\",\n    \"緆\",\n    \"緌\",\n    \"綯\",\n    \"綼\",\n    \"緷\",\n    \"緛\",\n    \"緪\",\n    \"緧\",\n    \"縃\",\n    \"緺\",\n    \"緶\",\n    \"緰\",\n    \"縗\",\n    \"縌\",\n    \"縓\",\n    \"縎\",\n    \"縜\",\n    \"縚\",\n    \"縏\",\n    \"縼\",\n    \"繂\",\n    \"縳\",\n    \"顈\",\n    \"繈\",\n    \"縸\",\n    \"縪\",\n    \"繉\",\n    \"繀\",\n    \"縩\",\n    \"緵\",\n    \"縰\",\n    \"縿\",\n    \"縶\",\n    \"繜\",\n    \"繐\",\n    \"繣\",\n    \"繘\",\n    \"繢\",\n    \"繟\",\n    \"繑\",\n    \"繠\",\n    \"繶\",\n    \"繵\",\n    \"繸\",\n    \"繷\",\n    \"繺\",\n    \"繲\",\n    \"繴\",\n    \"纀\",\n    \"纇\",\n    \"纋\",\n    \"纆\",\n    \"纑\",\n    \"纗\",\n    \"纚\",\n    \"缿\",\n    \"罊\",\n    \"罏\",\n    \"罜\",\n    \"罞\",\n    \"罝\",\n    \"罛\",\n    \"罣\",\n    \"罥\",\n    \"罦\",\n    \"罭\",\n    \"罫\",\n    \"罬\",\n    \"罻\",\n    
\"罼\",\n    \"罺\",\n    \"罿\",\n    \"羃\",\n    \"羉\",\n    \"羍\",\n    \"羒\",\n    \"羜\",\n    \"羛\",\n    \"羢\",\n    \"羠\",\n    \"羦\",\n    \"羬\",\n    \"羭\",\n    \"羵\",\n    \"羳\",\n    \"羷\",\n    \"羺\",\n    \"羾\",\n    \"翋\",\n    \"翍\",\n    \"翐\",\n    \"翑\",\n    \"翇\",\n    \"翢\",\n    \"翣\",\n    \"翭\",\n    \"翪\",\n    \"翨\",\n    \"翴\",\n    \"翲\",\n    \"翽\",\n    \"翿\",\n    \"耟\",\n    \"耞\",\n    \"耡\",\n    \"耴\",\n    \"耾\",\n    \"耹\",\n    \"聇\",\n    \"聈\",\n    \"聑\",\n    \"聏\",\n    \"聝\",\n    \"肕\",\n    \"肙\",\n    \"肒\",\n    \"肣\",\n    \"肵\",\n    \"胘\",\n    \"胑\",\n    \"胐\",\n    \"胕\",\n    \"胉\",\n    \"胏\",\n    \"胹\",\n    \"胵\",\n    \"脁\",\n    \"胻\",\n    \"脀\",\n    \"胾\",\n    \"胔\",\n    \"脰\",\n    \"脥\",\n    \"脤\",\n    \"脙\",\n    \"脡\",\n    \"脕\",\n    \"脧\",\n    \"腃\",\n    \"腏\",\n    \"腄\",\n    \"腇\",\n    \"脽\",\n    \"腍\",\n    \"腤\",\n    \"腷\",\n    \"腜\",\n    \"腛\",\n    \"腢\",\n    \"腲\",\n    \"朡\",\n    \"腞\",\n    \"腶\",\n    \"膉\",\n    \"膆\",\n    \"膃\",\n    \"膇\",\n    \"膍\",\n    \"膌\",\n    \"膋\",\n    \"膟\",\n    \"膕\",\n    \"膢\",\n    \"膱\",\n    \"膹\",\n    \"膫\",\n    \"膰\",\n    \"膬\",\n    \"膴\",\n    \"膲\",\n    \"臇\",\n    \"膷\",\n    \"臄\",\n    \"臅\",\n    \"臒\",\n    \"臐\",\n    \"臗\",\n    \"臛\",\n    \"臡\",\n    \"臦\",\n    \"臩\",\n    \"臮\",\n    \"臲\",\n    \"臷\",\n    \"臸\",\n    \"臿\",\n    \"舋\",\n    \"舑\",\n    \"舕\",\n    \"舝\",\n    \"舡\",\n    \"舼\",\n    \"舽\",\n    \"艀\",\n    \"艂\",\n    \"艓\",\n    \"艒\",\n    \"艐\",\n    \"艑\",\n    \"艕\",\n    \"艛\",\n    \"艵\",\n    \"艼\",\n    \"芀\",\n    \"芐\",\n    \"芅\",\n    \"芓\",\n    \"芔\",\n    \"苀\",\n    \"芚\",\n    \"芵\",\n    \"芧\",\n    \"芞\",\n    \"芺\",\n    \"苙\",\n    \"苨\",\n    \"苖\",\n    \"苬\",\n    \"苲\",\n    \"苵\",\n    \"苶\",\n    \"茙\",\n    \"茥\",\n    \"茿\",\n    \"茦\",\n    \"茢\",\n    \"荂\",\n    \"茪\",\n    \"荍\",\n    \"茖\",\n    \"茤\",\n    \"茠\",\n    \"茩\",\n    \"茻\",\n    \"莐\",\n    
\"莣\",\n    \"莍\",\n    \"荺\",\n    \"莤\",\n    \"荴\",\n    \"莏\",\n    \"莁\",\n    \"荵\",\n    \"莔\",\n    \"莃\",\n    \"莌\",\n    \"莋\",\n    \"荾\",\n    \"莥\",\n    \"菨\",\n    \"萒\",\n    \"菧\",\n    \"菤\",\n    \"菆\",\n    \"菣\",\n    \"菿\",\n    \"菋\",\n    \"菎\",\n    \"菵\",\n    \"萉\",\n    \"菞\",\n    \"菳\",\n    \"菕\",\n    \"蓱\",\n    \"萿\",\n    \"葹\",\n    \"葥\",\n    \"葀\",\n    \"葧\",\n    \"萰\",\n    \"葍\",\n    \"葽\",\n    \"蔇\",\n    \"葞\",\n    \"萷\",\n    \"萺\",\n    \"萴\",\n    \"葅\",\n    \"菙\",\n    \"葋\",\n    \"萯\",\n    \"葂\",\n    \"葟\",\n    \"葌\",\n    \"蓎\",\n    \"蒬\",\n    \"蒮\",\n    \"蒫\",\n    \"蒪\",\n    \"蒚\",\n    \"蒝\",\n    \"蓌\",\n    \"蒛\",\n    \"蒩\",\n    \"蒘\",\n    \"蒶\",\n    \"蒠\",\n    \"蔤\",\n    \"蔏\",\n    \"蔩\",\n    \"蔉\",\n    \"蔍\",\n    \"蔧\",\n    \"蔜\",\n    \"蓻\",\n    \"蓺\",\n    \"蓴\",\n    \"蔪\",\n    \"蓲\",\n    \"蓷\",\n    \"蓫\",\n    \"蔒\",\n    \"蓩\",\n    \"蔖\",\n    \"蓾\",\n    \"蔨\",\n    \"蔮\",\n    \"蔂\",\n    \"蓶\",\n    \"蔱\",\n    \"蓹\",\n    \"蔠\",\n    \"蔰\",\n    \"蕫\",\n    \"蕍\",\n    \"蕀\",\n    \"蕆\",\n    \"蕄\",\n    \"蕇\",\n    \"蕣\",\n    \"蕛\",\n    \"蕱\",\n    \"蕵\",\n    \"蕮\",\n    \"蕧\",\n    \"蕠\",\n    \"蕦\",\n    \"蕝\",\n    \"薃\",\n    \"薧\",\n    \"薕\",\n    \"薠\",\n    \"薋\",\n    \"薣\",\n    \"薚\",\n    \"蕼\",\n    \"薉\",\n    \"蕸\",\n    \"薎\",\n    \"薖\",\n    \"薍\",\n    \"薝\",\n    \"薂\",\n    \"藆\",\n    \"藀\",\n    \"藃\",\n    \"藂\",\n    \"薵\",\n    \"薽\",\n    \"藇\",\n    \"藄\",\n    \"藋\",\n    \"藈\",\n    \"藅\",\n    \"薱\",\n    \"薶\",\n    \"藒\",\n    \"藫\",\n    \"藱\",\n    \"藙\",\n    \"藡\",\n    \"藚\",\n    \"藗\",\n    \"藲\",\n    \"藬\",\n    \"藘\",\n    \"藣\",\n    \"藑\",\n    \"藰\",\n    \"蘁\",\n    \"藾\",\n    \"蘛\",\n    \"蘉\",\n    \"蘌\",\n    \"蘪\",\n    \"蘦\",\n    \"蘟\",\n    \"蘣\",\n    \"蘜\",\n    \"蘙\",\n    \"蘮\",\n    \"蘡\",\n    \"蘠\",\n    \"蘥\",\n    \"蘴\",\n    \"蘳\",\n    \"蘬\",\n    \"虀\",\n    \"蘹\",\n    \"蘱\",\n    \"蘻\",\n    
\"蘾\",\n    \"虃\",\n    \"虆\",\n    \"虇\",\n    \"虈\",\n    \"虌\",\n    \"虋\",\n    \"虙\",\n    \"虡\",\n    \"虣\",\n    \"虩\",\n    \"虪\",\n    \"虰\",\n    \"虭\",\n    \"虴\",\n    \"蚑\",\n    \"蚞\",\n    \"蚇\",\n    \"蚗\",\n    \"蚚\",\n    \"蚅\",\n    \"蚥\",\n    \"蚙\",\n    \"蚿\",\n    \"蚷\",\n    \"蛂\",\n    \"蛁\",\n    \"蛅\",\n    \"蛈\",\n    \"蚹\",\n    \"蚳\",\n    \"蚸\",\n    \"蛌\",\n    \"蚻\",\n    \"蛢\",\n    \"蛦\",\n    \"蛓\",\n    \"蛣\",\n    \"蛚\",\n    \"蛪\",\n    \"蛝\",\n    \"蛫\",\n    \"蛜\",\n    \"蛬\",\n    \"蛗\",\n    \"蜄\",\n    \"蛷\",\n    \"蜌\",\n    \"蛖\",\n    \"蛵\",\n    \"蜁\",\n    \"蛶\",\n    \"蜳\",\n    \"蝫\",\n    \"蜙\",\n    \"蝃\",\n    \"蜬\",\n    \"蝁\",\n    \"蝆\",\n    \"蜠\",\n    \"蜲\",\n    \"蜪\",\n    \"蜭\",\n    \"蜼\",\n    \"蜵\",\n    \"蝂\",\n    \"蜦\",\n    \"蜧\",\n    \"蜸\",\n    \"蜤\",\n    \"蜰\",\n    \"蝖\",\n    \"蝷\",\n    \"蟡\",\n    \"蝳\",\n    \"蝔\",\n    \"蝛\",\n    \"蝒\",\n    \"蝑\",\n    \"蝞\",\n    \"蝭\",\n    \"蝪\",\n    \"蝐\",\n    \"蝝\",\n    \"蝬\",\n    \"蝺\",\n    \"蝜\",\n    \"螛\",\n    \"螏\",\n    \"螓\",\n    \"螒\",\n    \"螁\",\n    \"螖\",\n    \"螘\",\n    \"蝹\",\n    \"螇\",\n    \"螑\",\n    \"螝\",\n    \"螜\",\n    \"螚\",\n    \"螪\",\n    \"螰\",\n    \"螹\",\n    \"螼\",\n    \"螮\",\n    \"蟉\",\n    \"蟃\",\n    \"蟂\",\n    \"螷\",\n    \"螴\",\n    \"螿\",\n    \"螸\",\n    \"蟞\",\n    \"蟧\",\n    \"蟦\",\n    \"蟢\",\n    \"蟟\",\n    \"蟤\",\n    \"蟔\",\n    \"蟓\",\n    \"蟭\",\n    \"蟘\",\n    \"螤\",\n    \"蟗\",\n    \"蟙\",\n    \"蠁\",\n    \"蟨\",\n    \"蠀\",\n    \"蟺\",\n    \"蠉\",\n    \"蠌\",\n    \"蟼\",\n    \"蠈\",\n    \"蟿\",\n    \"蠗\",\n    \"蠩\",\n    \"蠝\",\n    \"蠛\",\n    \"蠠\",\n    \"蠤\",\n    \"蠜\",\n    \"蠫\",\n    \"蠬\",\n    \"蠨\",\n    \"蠦\",\n    \"蠪\",\n    \"蠥\",\n    \"蠰\",\n    \"蠮\",\n    \"蠳\",\n    \"蠸\",\n    \"蠾\",\n    \"蠽\",\n    \"蠿\",\n    \"衁\",\n    \"衈\",\n    \"衋\",\n    \"衧\",\n    \"衪\",\n    \"衭\",\n    \"衶\",\n    \"袀\",\n    \"衱\",\n    \"衯\",\n    \"袃\",\n    \"袉\",\n    
\"袕\",\n    \"袨\",\n    \"袚\",\n    \"袑\",\n    \"袡\",\n    \"袘\",\n    \"袧\",\n    \"袬\",\n    \"袌\",\n    \"袺\",\n    \"裗\",\n    \"袹\",\n    \"袸\",\n    \"裀\",\n    \"袶\",\n    \"袽\",\n    \"袲\",\n    \"裋\",\n    \"裍\",\n    \"裞\",\n    \"裚\",\n    \"裷\",\n    \"裧\",\n    \"裺\",\n    \"裮\",\n    \"裶\",\n    \"裯\",\n    \"裻\",\n    \"褁\",\n    \"褅\",\n    \"褋\",\n    \"褗\",\n    \"褆\",\n    \"褖\",\n    \"褑\",\n    \"褦\",\n    \"褮\",\n    \"褱\",\n    \"褢\",\n    \"褩\",\n    \"褵\",\n    \"褼\",\n    \"褾\",\n    \"襒\",\n    \"褷\",\n    \"襂\",\n    \"褽\",\n    \"襓\",\n    \"襋\",\n    \"襆\",\n    \"襐\",\n    \"襛\",\n    \"襗\",\n    \"襡\",\n    \"襘\",\n    \"襝\",\n    \"襣\",\n    \"襭\",\n    \"襩\",\n    \"襮\",\n    \"襳\",\n    \"襹\",\n    \"襺\",\n    \"覂\",\n    \"覅\",\n    \"覕\",\n    \"覛\",\n    \"覝\",\n    \"覢\",\n    \"覤\",\n    \"覣\",\n    \"覭\",\n    \"覮\",\n    \"覶\",\n    \"觓\",\n    \"觤\",\n    \"觡\",\n    \"觠\",\n    \"觢\",\n    \"觩\",\n    \"觰\",\n    \"觬\",\n    \"觲\",\n    \"觷\",\n    \"觺\",\n    \"觻\",\n    \"觼\",\n    \"觾\",\n    \"訑\",\n    \"訰\",\n    \"訧\",\n    \"訬\",\n    \"訞\",\n    \"詍\",\n    \"訹\",\n    \"詙\",\n    \"詀\",\n    \"詄\",\n    \"詅\",\n    \"訿\",\n    \"誂\",\n    \"詻\",\n    \"誃\",\n    \"誫\",\n    \"誙\",\n    \"誋\",\n    \"諆\",\n    \"誸\",\n    \"諔\",\n    \"諕\",\n    \"誻\",\n    \"諀\",\n    \"諅\",\n    \"諵\",\n    \"諝\",\n    \"諰\",\n    \"諈\",\n    \"謞\",\n    \"謘\",\n    \"謑\",\n    \"謋\",\n    \"謒\",\n    \"謕\",\n    \"謍\",\n    \"謈\",\n    \"謪\",\n    \"謧\",\n    \"謣\",\n    \"謰\",\n    \"謵\",\n    \"譇\",\n    \"謯\",\n    \"謱\",\n    \"謥\",\n    \"謷\",\n    \"謦\",\n    \"譐\",\n    \"譈\",\n    \"譊\",\n    \"譀\",\n    \"譋\",\n    \"譕\",\n    \"譑\",\n    \"譠\",\n    \"譪\",\n    \"譝\",\n    \"譨\",\n    \"譣\",\n    \"譥\",\n    \"譹\",\n    \"譸\",\n    \"譅\",\n    \"譺\",\n    \"譻\",\n    \"譾\",\n    \"讄\",\n    \"讂\",\n    \"讆\",\n    \"讋\",\n    \"讔\",\n    \"讘\",\n    \"讟\",\n    \"谹\",\n    \"谻\",\n    \"谽\",\n    \"谾\",\n    
\"豃\",\n    \"豋\",\n    \"豍\",\n    \"豏\",\n    \"豗\",\n    \"豜\",\n    \"豝\",\n    \"豟\",\n    \"豥\",\n    \"豤\",\n    \"豦\",\n    \"豭\",\n    \"豰\",\n    \"豲\",\n    \"豱\",\n    \"豯\",\n    \"豵\",\n    \"豷\",\n    \"豶\",\n    \"豻\",\n    \"豽\",\n    \"貁\",\n    \"貀\",\n    \"貄\",\n    \"貏\",\n    \"貑\",\n    \"貕\",\n    \"貙\",\n    \"貗\",\n    \"貜\",\n    \"貣\",\n    \"貾\",\n    \"賌\",\n    \"賥\",\n    \"賟\",\n    \"賙\",\n    \"賵\",\n    \"賮\",\n    \"贆\",\n    \"贕\",\n    \"贙\",\n    \"赨\",\n    \"赩\",\n    \"赮\",\n    \"赸\",\n    \"趀\",\n    \"趌\",\n    \"趎\",\n    \"趏\",\n    \"趍\",\n    \"趓\",\n    \"趠\",\n    \"趜\",\n    \"趡\",\n    \"趥\",\n    \"趧\",\n    \"趬\",\n    \"趪\",\n    \"趭\",\n    \"趫\",\n    \"趮\",\n    \"趷\",\n    \"趹\",\n    \"跘\",\n    \"跓\",\n    \"跍\",\n    \"跇\",\n    \"跜\",\n    \"跕\",\n    \"跙\",\n    \"跈\",\n    \"跰\",\n    \"跠\",\n    \"跮\",\n    \"跦\",\n    \"跢\",\n    \"跧\",\n    \"跲\",\n    \"跫\",\n    \"踂\",\n    \"跿\",\n    \"踍\",\n    \"踃\",\n    \"踇\",\n    \"踆\",\n    \"跾\",\n    \"踠\",\n    \"踥\",\n    \"踤\",\n    \"踡\",\n    \"踕\",\n    \"踛\",\n    \"踖\",\n    \"踑\",\n    \"踙\",\n    \"踧\",\n    \"踘\",\n    \"踓\",\n    \"踳\",\n    \"踾\",\n    \"踸\",\n    \"踼\",\n    \"蹎\",\n    \"蹍\",\n    \"蹓\",\n    \"蹗\",\n    \"蹖\",\n    \"蹞\",\n    \"蹥\",\n    \"蹛\",\n    \"蹡\",\n    \"蹝\",\n    \"蹔\",\n    \"蹸\",\n    \"蹳\",\n    \"蹪\",\n    \"躆\",\n    \"躈\",\n    \"躖\",\n    \"躗\",\n    \"躟\",\n    \"躠\",\n    \"躤\",\n    \"躣\",\n    \"躩\",\n    \"躨\",\n    \"躽\",\n    \"軓\",\n    \"軘\",\n    \"軞\",\n    \"軯\",\n    \"軷\",\n    \"軦\",\n    \"軮\",\n    \"軥\",\n    \"軵\",\n    \"軧\",\n    \"軨\",\n    \"軶\",\n    \"軱\",\n    \"軬\",\n    \"輆\",\n    \"軿\",\n    \"輁\",\n    \"輀\",\n    \"輂\",\n    \"輐\",\n    \"輑\",\n    \"輤\",\n    \"輘\",\n    \"輚\",\n    \"輠\",\n    \"輣\",\n    \"輖\",\n    \"輗\",\n    \"輮\",\n    \"輵\",\n    \"輲\",\n    \"輹\",\n    \"輷\",\n    \"輴\",\n    \"轃\",\n    \"轇\",\n    \"轈\",\n    \"轒\",\n    \"轑\",\n    
\"轏\",\n    \"轐\",\n    \"轓\",\n    \"轙\",\n    \"轖\",\n    \"轗\",\n    \"轕\",\n    \"轚\",\n    \"轞\",\n    \"轛\",\n    \"轠\",\n    \"辴\",\n    \"迉\",\n    \"迒\",\n    \"迋\",\n    \"迍\",\n    \"迖\",\n    \"迣\",\n    \"迡\",\n    \"迾\",\n    \"迿\",\n    \"逜\",\n    \"逿\",\n    \"遝\",\n    \"遳\",\n    \"遰\",\n    \"遻\",\n    \"邆\",\n    \"邅\",\n    \"遾\",\n    \"邍\",\n    \"邔\",\n    \"邟\",\n    \"邥\",\n    \"邞\",\n    \"邧\",\n    \"郱\",\n    \"郕\",\n    \"郖\",\n    \"郠\",\n    \"郙\",\n    \"郣\",\n    \"郥\",\n    \"郘\",\n    \"郰\",\n    \"郲\",\n    \"郔\",\n    \"鄬\",\n    \"郼\",\n    \"鄈\",\n    \"郹\",\n    \"郻\",\n    \"鄁\",\n    \"鄇\",\n    \"郺\",\n    \"鄐\",\n    \"鄍\",\n    \"鄏\",\n    \"鄎\",\n    \"鄟\",\n    \"鄝\",\n    \"鄡\",\n    \"鄛\",\n    \"鄨\",\n    \"鄪\",\n    \"鄦\",\n    \"鄮\",\n    \"鄵\",\n    \"鄸\",\n    \"鄻\",\n    \"鄾\",\n    \"酀\",\n    \"酁\",\n    \"酄\",\n    \"酇\",\n    \"酖\",\n    \"酘\",\n    \"酓\",\n    \"酟\",\n    \"酳\",\n    \"醆\",\n    \"醊\",\n    \"醓\",\n    \"醙\",\n    \"醟\",\n    \"醥\",\n    \"醧\",\n    \"醰\",\n    \"醱\",\n    \"醷\",\n    \"醲\",\n    \"醳\",\n    \"醹\",\n    \"醽\",\n    \"釂\",\n    \"釃\",\n    \"釢\",\n    \"釱\",\n    \"釳\",\n    \"釸\",\n    \"鈚\",\n    \"鈌\",\n    \"鈒\",\n    \"釽\",\n    \"鈆\",\n    \"鉒\",\n    \"鉠\",\n    \"鉯\",\n    \"鈶\",\n    \"鉼\",\n    \"銤\",\n    \"銛\",\n    \"銔\",\n    \"鉹\",\n    \"銗\",\n    \"鋄\",\n    \"鋀\",\n    \"鋟\",\n    \"鋘\",\n    \"鋩\",\n    \"鋝\",\n    \"鋂\",\n    \"鋊\",\n    \"錧\",\n    \"錼\",\n    \"錭\",\n    \"錎\",\n    \"鋋\",\n    \"鎡\",\n    \"鎃\",\n    \"鎯\",\n    \"鍖\",\n    \"鍜\",\n    \"鍐\",\n    \"鍭\",\n    \"鍌\",\n    \"鎒\",\n    \"鎷\",\n    \"鎝\",\n    \"鎉\",\n    \"鎎\",\n    \"鎞\",\n    \"鏏\",\n    \"鏂\",\n    \"鏚\",\n    \"鏬\",\n    \"鏙\",\n    \"鐋\",\n    \"鐏\",\n    \"鏾\",\n    \"鐕\",\n    \"鐨\",\n    \"鐍\",\n    \"鐀\",\n    \"鐎\",\n    \"鐖\",\n    \"鐻\",\n    \"鐶\",\n    \"鑐\",\n    \"鑋\",\n    \"鑕\",\n    \"鑮\",\n    \"鑯\",\n    \"钂\",\n    \"钀\",\n    \"钁\",\n    
\"钃\",\n    \"镺\",\n    \"镻\",\n    \"镼\",\n    \"镽\",\n    \"閈\",\n    \"閍\",\n    \"閺\",\n    \"閵\",\n    \"闀\",\n    \"闉\",\n    \"闅\",\n    \"閷\",\n    \"闒\",\n    \"闑\",\n    \"闚\",\n    \"闛\",\n    \"闠\",\n    \"闟\",\n    \"闤\",\n    \"阞\",\n    \"阢\",\n    \"阤\",\n    \"阠\",\n    \"阰\",\n    \"阹\",\n    \"阸\",\n    \"阺\",\n    \"陏\",\n    \"陓\",\n    \"陊\",\n    \"陼\",\n    \"陭\",\n    \"陫\",\n    \"隇\",\n    \"陾\",\n    \"隉\",\n    \"隒\",\n    \"隓\",\n    \"隞\",\n    \"隤\",\n    \"隿\",\n    \"雂\",\n    \"雈\",\n    \"雓\",\n    \"雔\",\n    \"雗\",\n    \"雚\",\n    \"雟\",\n    \"雘\",\n    \"雺\",\n    \"雽\",\n    \"雿\",\n    \"霂\",\n    \"霋\",\n    \"霒\",\n    \"霐\",\n    \"霠\",\n    \"霣\",\n    \"霢\",\n    \"霩\",\n    \"霫\",\n    \"霬\",\n    \"霮\",\n    \"霵\",\n    \"霿\",\n    \"靆\",\n    \"靃\",\n    \"靪\",\n    \"靮\",\n    \"靷\",\n    \"靲\",\n    \"靾\",\n    \"鞃\",\n    \"鞀\",\n    \"鞂\",\n    \"靻\",\n    \"鞊\",\n    \"鞎\",\n    \"鞈\",\n    \"鞙\",\n    \"鞗\",\n    \"鞚\",\n    \"鞜\",\n    \"鞤\",\n    \"鞪\",\n    \"鞷\",\n    \"鞶\",\n    \"鞹\",\n    \"鞻\",\n    \"鞿\",\n    \"韄\",\n    \"韅\",\n    \"韇\",\n    \"韎\",\n    \"韐\",\n    \"韏\",\n    \"韕\",\n    \"韔\",\n    \"韗\",\n    \"韝\",\n    \"韟\",\n    \"韣\",\n    \"韥\",\n    \"韰\",\n    \"韱\",\n    \"韹\",\n    \"韽\",\n    \"頄\",\n    \"頖\",\n    \"頞\",\n    \"頝\",\n    \"頩\",\n    \"頨\",\n    \"頯\",\n    \"頲\",\n    \"顁\",\n    \"顄\",\n    \"顊\",\n    \"顉\",\n    \"顅\",\n    \"顐\",\n    \"顑\",\n    \"顜\",\n    \"顝\",\n    \"顠\",\n    \"顣\",\n    \"顟\",\n    \"顤\",\n    \"顪\",\n    \"顩\",\n    \"顲\",\n    \"颬\",\n    \"颲\",\n    \"颸\",\n    \"颽\",\n    \"颻\",\n    \"颾\",\n    \"飁\",\n    \"飂\",\n    \"飉\",\n    \"飋\",\n    \"飌\",\n    \"飣\",\n    \"飶\",\n    \"餂\",\n    \"餀\",\n    \"飺\",\n    \"餔\",\n    \"餖\",\n    \"餕\",\n    \"餤\",\n    \"餟\",\n    \"餥\",\n    \"餫\",\n    \"餪\",\n    \"餲\",\n    \"餯\",\n    \"餭\",\n    \"餱\",\n    \"餰\",\n    \"饁\",\n    \"饇\",\n    \"饐\",\n    \"饎\",\n    \"饙\",\n    
\"饘\",\n    \"饛\",\n    \"饡\",\n    \"馣\",\n    \"馲\",\n    \"馰\",\n    \"馵\",\n    \"馻\",\n    \"馺\",\n    \"駂\",\n    \"馽\",\n    \"駜\",\n    \"駍\",\n    \"駏\",\n    \"駎\",\n    \"駖\",\n    \"駮\",\n    \"駬\",\n    \"駥\",\n    \"駤\",\n    \"駣\",\n    \"駩\",\n    \"駺\",\n    \"駴\",\n    \"駷\",\n    \"駹\",\n    \"駶\",\n    \"駻\",\n    \"駽\",\n    \"駾\",\n    \"騃\",\n    \"騉\",\n    \"騑\",\n    \"騊\",\n    \"騇\",\n    \"騚\",\n    \"騕\",\n    \"騥\",\n    \"騝\",\n    \"騛\",\n    \"騢\",\n    \"騠\",\n    \"騧\",\n    \"騞\",\n    \"騜\",\n    \"騵\",\n    \"騲\",\n    \"騴\",\n    \"騱\",\n    \"騬\",\n    \"騪\",\n    \"騩\",\n    \"騹\",\n    \"騽\",\n    \"驆\",\n    \"騺\",\n    \"驓\",\n    \"驔\",\n    \"驈\",\n    \"驉\",\n    \"驖\",\n    \"驞\",\n    \"驠\",\n    \"驦\",\n    \"驨\",\n    \"骭\",\n    \"骫\",\n    \"骹\",\n    \"骿\",\n    \"骴\",\n    \"骾\",\n    \"髇\",\n    \"髊\",\n    \"髆\",\n    \"髍\",\n    \"髐\",\n    \"髟\",\n    \"髧\",\n    \"髬\",\n    \"髳\",\n    \"髶\",\n    \"髺\",\n    \"髾\",\n    \"鬁\",\n    \"髼\",\n    \"鬋\",\n    \"鬊\",\n    \"鬎\",\n    \"鬌\",\n    \"鬐\",\n    \"鬕\",\n    \"鬗\",\n    \"鬖\",\n    \"鬙\",\n    \"鬞\",\n    \"鬠\",\n    \"鬤\",\n    \"鬫\",\n    \"鬳\",\n    \"鬵\",\n    \"鬺\",\n    \"鬾\",\n    \"鬿\",\n    \"魊\",\n    \"魌\",\n    \"魖\",\n    \"魠\",\n    \"魡\",\n    \"魧\",\n    \"魱\",\n    \"魦\",\n    \"魶\",\n    \"魵\",\n    \"鮅\",\n    \"鮇\",\n    \"魼\",\n    \"魾\",\n    \"魻\",\n    \"鮂\",\n    \"鮚\",\n    \"鮞\",\n    \"鮛\",\n    \"鮦\",\n    \"鮥\",\n    \"鮤\",\n    \"鮆\",\n    \"鯆\",\n    \"鮿\",\n    \"鮵\",\n    \"鯈\",\n    \"鯫\",\n    \"鯠\",\n    \"鯞\",\n    \"鯦\",\n    \"鯬\",\n    \"鰌\",\n    \"鰋\",\n    \"鰅\",\n    \"鯸\",\n    \"鰫\",\n    \"鰝\",\n    \"鰬\",\n    \"鱆\",\n    \"鰿\",\n    \"鱄\",\n    \"鱁\",\n    \"鰴\",\n    \"鱐\",\n    \"鱍\",\n    \"鱋\",\n    \"鱕\",\n    \"鱦\",\n    \"鱢\",\n    \"鱞\",\n    \"鱴\",\n    \"鱳\",\n    \"鱹\",\n    \"鳦\",\n    \"鳪\",\n    \"鳭\",\n    \"鳱\",\n    \"鳵\",\n    \"鳼\",\n    \"鳺\",\n    \"鳿\",\n    \"鳷\",\n    
\"鴀\",\n    \"鳹\",\n    \"鳻\",\n    \"鴅\",\n    \"鴃\",\n    \"鴥\",\n    \"鴠\",\n    \"鴔\",\n    \"鴩\",\n    \"鴘\",\n    \"鴢\",\n    \"鴐\",\n    \"鴳\",\n    \"鵁\",\n    \"鵧\",\n    \"鴶\",\n    \"鴮\",\n    \"鴱\",\n    \"鴸\",\n    \"鵅\",\n    \"鵃\",\n    \"鴾\",\n    \"鵀\",\n    \"鴽\",\n    \"鵏\",\n    \"鵊\",\n    \"鵛\",\n    \"鵋\",\n    \"鵖\",\n    \"鵌\",\n    \"鵗\",\n    \"鵔\",\n    \"鵷\",\n    \"鶁\",\n    \"鶊\",\n    \"鶄\",\n    \"鶈\",\n    \"鵱\",\n    \"鶀\",\n    \"鵸\",\n    \"鶋\",\n    \"鶌\",\n    \"鵽\",\n    \"鵫\",\n    \"鵴\",\n    \"鵩\",\n    \"鶅\",\n    \"鵳\",\n    \"鵻\",\n    \"鶂\",\n    \"鵹\",\n    \"鶟\",\n    \"鶙\",\n    \"鶤\",\n    \"鶝\",\n    \"鶐\",\n    \"鶛\",\n    \"鶠\",\n    \"鶔\",\n    \"鶜\",\n    \"鶪\",\n    \"鶗\",\n    \"鶢\",\n    \"鶨\",\n    \"鶞\",\n    \"鶣\",\n    \"鶖\",\n    \"鶷\",\n    \"鶶\",\n    \"鷁\",\n    \"鷇\",\n    \"鷊\",\n    \"鷏\",\n    \"鶾\",\n    \"鷅\",\n    \"鷃\",\n    \"鶵\",\n    \"鷈\",\n    \"鶱\",\n    \"鶭\",\n    \"鷛\",\n    \"鷒\",\n    \"鷞\",\n    \"鷋\",\n    \"鷐\",\n    \"鷜\",\n    \"鷑\",\n    \"鷩\",\n    \"鷘\",\n    \"鷖\",\n    \"鷵\",\n    \"鷕\",\n    \"鷻\",\n    \"鷷\",\n    \"鷣\",\n    \"鷤\",\n    \"鷶\",\n    \"鷡\",\n    \"鷮\",\n    \"鷢\",\n    \"鸂\",\n    \"鷾\",\n    \"鸇\",\n    \"鸃\",\n    \"鸆\",\n    \"鸅\",\n    \"鸀\",\n    \"鸁\",\n    \"鸉\",\n    \"鷿\",\n    \"鷽\",\n    \"鸄\",\n    \"鸋\",\n    \"鸍\",\n    \"鸏\",\n    \"鸒\",\n    \"鸔\",\n    \"鸓\",\n    \"鸗\",\n    \"鸙\",\n    \"鹺\",\n    \"麃\",\n    \"麆\",\n    \"麉\",\n    \"麎\",\n    \"麌\",\n    \"麔\",\n    \"麙\",\n    \"麛\",\n    \"麚\",\n    \"麜\",\n    \"麠\",\n    \"麡\",\n    \"麧\",\n    \"麮\",\n    \"麰\",\n    \"麶\",\n    \"麷\",\n    \"黀\",\n    \"黂\",\n    \"黈\",\n    \"黓\",\n    \"黕\",\n    \"黖\",\n    \"黚\",\n    \"黤\",\n    \"黫\",\n    \"黮\",\n    \"黭\",\n    \"黰\",\n    \"黳\",\n    \"黵\",\n    \"黺\",\n    \"鼁\",\n    \"鼀\",\n    \"鼆\",\n    \"鼊\",\n    \"鼏\",\n    \"鼖\",\n    \"鼛\",\n    \"鼘\",\n    \"鼜\",\n    \"鼤\",\n    \"鼣\",\n    \"鼥\",\n    \"鼪\",\n    
\"鼨\",\n    \"鼭\",\n    \"鼰\",\n    \"鼮\",\n    \"鼵\",\n    \"鼳\",\n    \"鼲\",\n    \"鼸\",\n    \"鼶\",\n    \"齀\",\n    \"齂\",\n    \"齃\",\n    \"齌\",\n    \"齍\",\n    \"齎\",\n    \"齖\",\n    \"齗\",\n    \"齘\",\n    \"齛\",\n    \"齠\",\n    \"齞\",\n    \"齝\",\n    \"齥\",\n    \"齤\",\n    \"齫\",\n    \"齱\",\n    \"齰\",\n    \"齮\",\n    \"齯\",\n    \"齴\",\n    \"齵\",\n    \"齸\",\n    \"齻\",\n    \"齺\",\n    \"齹\",\n    \"齾\",\n    \"龒\",\n    \"龤\",\n    \"堔\",\n    \"礂\",\n    \"蒏\",\n    \"蒆\",\n    \"兙\",\n    \"兛\",\n    \"兞\",\n    \"兝\",\n    \"兡\",\n    \"兣\",\n    \"嗧\",\n    \"瓩\",\n    \"忼\",\n    \"擡\",\n    \"氊\",\n    \"穇\",\n    \"擧\",\n    \"譌\",\n    \"!\",\n    \"\\\"\",\n    \"#\",\n    \"$\",\n    \"%\",\n    \"&\",\n    \"'\",\n    \"(\",\n    \")\",\n    \"*\",\n    \"+\",\n    \",\",\n    \"-\",\n    \".\",\n    \"/\",\n    \"0\",\n    \"1\",\n    \"2\",\n    \"3\",\n    \"4\",\n    \"5\",\n    \"6\",\n    \"7\",\n    \"8\",\n    \"9\",\n    \":\",\n    \";\",\n    \"<\",\n    \"=\",\n    \">\",\n    \"?\",\n    \"A\",\n    \"B\",\n    \"C\",\n    \"D\",\n    \"E\",\n    \"F\",\n    \"G\",\n    \"H\",\n    \"I\",\n    \"J\",\n    \"K\",\n    \"L\",\n    \"M\",\n    \"N\",\n    \"O\",\n    \"P\",\n    \"Q\",\n    \"R\",\n    \"S\",\n    \"T\",\n    \"U\",\n    \"V\",\n    \"W\",\n    \"X\",\n    \"Y\",\n    \"Z\",\n    \"[\",\n    \"]\",\n    \"_\",\n    \"`\",\n    \"a\",\n    \"b\",\n    \"c\",\n    \"d\",\n    \"e\",\n    \"f\",\n    \"g\",\n    \"h\",\n    \"i\",\n    \"j\",\n    \"k\",\n    \"l\",\n    \"m\",\n    \"n\",\n    \"o\",\n    \"p\",\n    \"q\",\n    \"r\",\n    \"s\",\n    \"t\",\n    \"u\",\n    \"v\",\n    \"w\",\n    \"x\",\n    \"y\",\n    \"z\",\n    \"©\",\n    \"°\",\n    \"²\",\n    \"´\",\n    \"½\",\n    \"Á\",\n    \"Ä\",\n    \"Å\",\n    \"Ç\",\n    \"È\",\n    \"É\",\n    \"Í\",\n    \"Ó\",\n    \"Ö\",\n    \"×\",\n    \"Ü\",\n    \"ß\",\n    \"à\",\n    \"á\",\n    \"â\",\n    \"ã\",\n    \"ä\",\n    \"å\",\n    
\"æ\",\n    \"ç\",\n    \"è\",\n    \"é\",\n    \"ê\",\n    \"ë\",\n    \"í\",\n    \"ð\",\n    \"ñ\",\n    \"ò\",\n    \"ó\",\n    \"ô\",\n    \"õ\",\n    \"ö\",\n    \"ø\",\n    \"ú\",\n    \"û\",\n    \"ü\",\n    \"ý\",\n    \"ā\",\n    \"ă\",\n    \"ą\",\n    \"ć\",\n    \"Č\",\n    \"č\",\n    \"đ\",\n    \"ē\",\n    \"ė\",\n    \"ę\",\n    \"ğ\",\n    \"ī\",\n    \"ı\",\n    \"Ł\",\n    \"ł\",\n    \"ń\",\n    \"ň\",\n    \"ō\",\n    \"ř\",\n    \"Ş\",\n    \"ş\",\n    \"Š\",\n    \"š\",\n    \"ţ\",\n    \"ū\",\n    \"ż\",\n    \"Ž\",\n    \"ž\",\n    \"Ș\",\n    \"ș\",\n    \"ț\",\n    \"Δ\",\n    \"α\",\n    \"λ\",\n    \"μ\",\n    \"φ\",\n    \"Г\",\n    \"О\",\n    \"а\",\n    \"в\",\n    \"л\",\n    \"о\",\n    \"р\",\n    \"с\",\n    \"т\",\n    \"я\",\n    \"ồ\",\n    \"—\",\n    \"―\",\n    \"’\",\n    \"“\",\n    \"”\",\n    \"…\",\n    \"℃\",\n    \"→\",\n    \"∇\",\n    \"−\",\n    \"■\",\n    \"☆\",\n    \"、\",\n    \"。\",\n    \"々\",\n    \"〆\",\n    \"〈\",\n    \"〉\",\n    \"「\",\n    \"」\",\n    \"『\",\n    \"』\",\n    \"〔\",\n    \"〕\",\n    \"〜\",\n    \"！\",\n    \"＃\",\n    \"％\",\n    \"＆\",\n    \"（\",\n    \"）\",\n    \"＋\",\n    \"，\",\n    \"－\",\n    \"．\",\n    \"／\",\n    \"０\",\n    \"１\",\n    \"２\",\n    \"３\",\n    \"４\",\n    \"５\",\n    \"６\",\n    \"７\",\n    \"８\",\n    \"９\",\n    \"：\",\n    \"；\",\n    \"＝\",\n    \"？\",\n    \"＠\",\n    \"Ａ\",\n    \"Ｂ\",\n    \"Ｃ\",\n    \"Ｄ\",\n    \"Ｅ\",\n    \"Ｆ\",\n    \"Ｇ\",\n    \"Ｈ\",\n    \"Ｉ\",\n    \"Ｊ\",\n    \"Ｋ\",\n    \"Ｌ\",\n    \"Ｍ\",\n    \"Ｎ\",\n    \"Ｏ\",\n    \"Ｐ\",\n    \"Ｒ\",\n    \"Ｓ\",\n    \"Ｔ\",\n    \"Ｕ\",\n    \"Ｖ\",\n    \"Ｗ\",\n    \"Ｘ\",\n    \"Ｚ\",\n    \"ａ\",\n    \"ｂ\",\n    \"ｃ\",\n    \"ｄ\",\n    \"ｅ\",\n    \"ｆ\",\n    \"ｇ\",\n    \"ｈ\",\n    \"ｉ\",\n    \"ｊ\",\n    \"ｋ\",\n    \"ｌ\",\n    \"ｍ\",\n    \"ｎ\",\n    \"ｏ\",\n    \"ｐ\",\n    \"ｑ\",\n    \"ｒ\",\n    \"ｓ\",\n    \"ｔ\",\n    \"ｕ\",\n    \"ｖ\",\n    \"ｗ\",\n    \"ｘ\",\n    \"ｙ\",\n    
\"ｚ\",\n    \"～\",\n    \"･\",\n    \"ǎ\",\n    \"ǒ\",\n    \"ě\",\n    \"ǐ\",\n    \"ì\",\n    \"ǔ\",\n    \"ù\",\n    \"ǖ\",\n    \"ǘ\",\n    \"ǚ\",\n    \"ǜ\",\n    \"【\",\n    \"】\",\n    \"《\",\n    \"》\",\n    \"‥\",\n    \"{\",\n    \"}\",\n    \"\\\\\",\n    \"|\",\n    \"@\",\n    \"^\",\n    \"~\",\n    \"÷\",\n    \"∕\",\n    \"∙\",\n    \"⋅\",\n    \"·\",\n    \"⊕\",\n    \"⊖\",\n    \"⊗\",\n    \"⊘\",\n    \"⊙\",\n    \"±\",\n    \"∓\",\n    \"∩\",\n    \"∪\",\n    \"□\",\n    \"⊎\",\n    \"⊓\",\n    \"⊔\",\n    \"≠\",\n    \"≈\",\n    \"≡\",\n    \"≤\",\n    \"≥\",\n    \"≪\",\n    \"≫\",\n    \"≲\",\n    \"≳\",\n    \"≶\",\n    \"≷\",\n    \"≺\",\n    \"≻\",\n    \"≼\",\n    \"≽\",\n    \"∈\",\n    \"∉\",\n    \"⊂\",\n    \"⊃\",\n    \"⊆\",\n    \"⊇\",\n    \"⊄\",\n    \"⊅\",\n    \"∅\",\n    \"∖\",\n    \"∁\",\n    \"∆\",\n    \"∧\",\n    \"∨\",\n    \"¬\",\n    \"⊻\",\n    \"⊼\",\n    \"⊽\",\n    \"←\",\n    \"↔\",\n    \"⇒\",\n    \"⇐\",\n    \"⇔\",\n    \"∀\",\n    \"∃\",\n    \"∄\",\n    \"∴\",\n    \"∵\",\n    \"∝\",\n    \"∞\",\n    \"⊥\",\n    \"∟\",\n    \"∠\",\n    \"∡\",\n    \"∢\",\n    \"′\",\n    \"″\",\n    \"∥\",\n    \"⊾\",\n    \"⊿\",\n    \"∂\",\n    \"∫\",\n    \"∬\",\n    \"∭\",\n    \"∮\",\n    \"∯\",\n    \"∰\",\n    \"∑\",\n    \"∏\",\n    \"√\",\n    \"∛\",\n    \"∜\",\n    \"∱\",\n    \"∲\",\n    \"∳\",\n    \"∶\",\n    \"∷\",\n    \"∼\",\n    \"®\",\n    \"≄\",\n    \"≅\",\n    \"≃\",\n    \"≦\",\n    \"≧\",\n    \"⊈\",\n    \"⊉\",\n    \"⊢\",\n    \"⊤\",\n    \"⊨\",\n    \"⊧\",\n    \"℉\",\n    \"Ω\",\n    \"℧\",\n    \"Å\",\n    \"⌀\",\n    \"ℏ\",\n    \"⅀\",\n    \"⍺\",\n    \"⍵\",\n    \"¢\",\n    \"€\",\n    \"£\",\n    \"¥\",\n    \"￥\",\n    \"₿\",\n    \"↑\",\n    \"↓\",\n    \"↕\",\n    \"↖\",\n    \"↗\",\n    \"↘\",\n    \"↙\",\n    \"↺\",\n    \"↻\",\n    \"↼\",\n    \"↽\",\n    \"↾\",\n    \"↿\",\n    \"⇀\",\n    \"⇁\",\n    \"⇂\",\n    \"⇃\",\n    \"⇋\",\n    \"⇌\",\n    \"ª\",\n    \"º\",\n    \"⁰\",\n    
\"¹\",\n    \"³\",\n    \"⁴\",\n    \"⁵\",\n    \"⁶\",\n    \"⁷\",\n    \"⁸\",\n    \"⁹\",\n    \"⁺\",\n    \"⁻\",\n    \"⁼\",\n    \"⁽\",\n    \"⁾\",\n    \"ⁿ\",\n    \"₀\",\n    \"₁\",\n    \"₂\",\n    \"₃\",\n    \"₄\",\n    \"₅\",\n    \"₆\",\n    \"₇\",\n    \"₈\",\n    \"₉\",\n    \"₊\",\n    \"₋\",\n    \"₌\",\n    \"₍\",\n    \"₎\",\n    \"Ⅰ\",\n    \"Ⅱ\",\n    \"Ⅲ\",\n    \"Ⅳ\",\n    \"Ⅴ\",\n    \"Ⅵ\",\n    \"Ⅶ\",\n    \"Ⅷ\",\n    \"Ⅸ\",\n    \"Ⅹ\",\n    \"Ⅺ\",\n    \"Ⅻ\",\n    \"ⅰ\",\n    \"ⅱ\",\n    \"ⅲ\",\n    \"ⅳ\",\n    \"ⅴ\",\n    \"ⅵ\",\n    \"ⅶ\",\n    \"ⅷ\",\n    \"ⅸ\",\n    \"ⅹ\",\n    \"ⅺ\",\n    \"ⅻ\",\n    \"☰\",\n    \"☱\",\n    \"☲\",\n    \"☳\",\n    \"☴\",\n    \"☵\",\n    \"☶\",\n    \"☷\",\n    \"♀\",\n    \"♂\",\n    \"♳\",\n    \"♴\",\n    \"♵\",\n    \"♶\",\n    \"♷\",\n    \"♸\",\n    \"♹\",\n    \"♺\",\n    \"♩\",\n    \"♪\",\n    \"♫\",\n    \"♬\",\n    \"⚪\",\n    \"⚫\",\n    \"⚬\",\n    \"✶\",\n    \"✷\",\n    \"✸\",\n    \"➀\",\n    \"➁\",\n    \"➂\",\n    \"➃\",\n    \"➄\",\n    \"➅\",\n    \"➆\",\n    \"➇\",\n    \"➈\",\n    \"➉\",\n    \"➊\",\n    \"➋\",\n    \"➌\",\n    \"➍\",\n    \"➎\",\n    \"➏\",\n    \"➐\",\n    \"➑\",\n    \"➒\",\n    \"➓\",\n    \"⏀\",\n    \"⏁\",\n    \"⏂\",\n    \"⏃\",\n    \"⏄\",\n    \"⏅\",\n    \"⏆\",\n    \"⏇\",\n    \"⏈\",\n    \"⏉\",\n    \"⏊\",\n    \"⏋\",\n    \"⏌\",\n    \"⏚\",\n    \"⏴\",\n    \"⏵\",\n    \"⏶\",\n    \"⏷\",\n    \"⏸\",\n    \"⏹\",\n    \"⏺\",\n    \"⏻\",\n    \"⏼\",\n    \"Α\",\n    \"Β\",\n    \"Γ\",\n    \"Ε\",\n    \"Ζ\",\n    \"Η\",\n    \"Θ\",\n    \"Ι\",\n    \"Κ\",\n    \"Λ\",\n    \"Μ\",\n    \"Ν\",\n    \"Ξ\",\n    \"Ο\",\n    \"Π\",\n    \"Ρ\",\n    \"Σ\",\n    \"Τ\",\n    \"Υ\",\n    \"Φ\",\n    \"Χ\",\n    \"Ψ\",\n    \"β\",\n    \"γ\",\n    \"δ\",\n    \"ε\",\n    \"ζ\",\n    \"η\",\n    \"θ\",\n    \"ι\",\n    \"κ\",\n    \"ν\",\n    \"ξ\",\n    \"ο\",\n    \"π\",\n    \"ρ\",\n    \"σ\",\n    \"τ\",\n    \"υ\",\n    \"χ\",\n    \"ψ\",\n    \"ω\",\n    
\"ϐ\",\n    \"ϑ\",\n    \"ϒ\",\n    \"ϕ\",\n    \"█\",\n    \"ϖ\",\n    \"ϰ\",\n    \"ϱ\",\n    \"ϴ\",\n    \"ϵ\",\n    \"ϝ\",\n    \"Ϟ\",\n    \"ϟ\",\n    \"Ϡ\",\n    \"ϡ\",\n    \"Ϣ\",\n    \"ϣ\",\n    \"Ϥ\",\n    \"ϥ\",\n    \"Ϧ\",\n    \"ϧ\",\n    \"Ϩ\",\n    \"ϩ\",\n    \"Ϫ\",\n    \"ϫ\",\n    \"Ϭ\",\n    \"ϭ\",\n    \"Ϯ\",\n    \"ϯ\",\n    \"∸\",\n    \"∹\",\n    \"∺\",\n    \"∻\",\n    \"∽\",\n    \"∾\",\n    \"∿\",\n    \"≀\",\n    \"≁\",\n    \"≂\",\n    \"≆\",\n    \"≇\",\n    \"≉\",\n    \"≊\",\n    \"≋\",\n    \"≌\",\n    \"≍\",\n    \"≎\",\n    \"≏\",\n    \"≐\",\n    \"≑\",\n    \"≒\",\n    \"≓\",\n    \"≔\",\n    \"≕\",\n    \"≖\",\n    \"≗\",\n    \"≘\",\n    \"≙\",\n    \"≚\",\n    \"≛\",\n    \"≜\",\n    \"≝\",\n    \"≞\",\n    \"≟\",\n    \"≢\",\n    \"≣\",\n    \"≨\",\n    \"≩\",\n    \"≬\",\n    \"≭\",\n    \"≮\",\n    \"≯\",\n    \"≰\",\n    \"≱\",\n    \"≴\",\n    \"≵\",\n    \"≸\",\n    \"≹\",\n    \"≾\",\n    \"≿\",\n    \"⊀\",\n    \"⊁\",\n    \"⊊\",\n    \"⊋\",\n    \"⊌\",\n    \"⊍\",\n    \"⊏\",\n    \"⊐\",\n    \"⊑\",\n    \"⊒\",\n    \"⊚\",\n    \"⊛\",\n    \"⊜\",\n    \"⊝\",\n    \"⊞\",\n    \"⊟\",\n    \"⊠\",\n    \"⊡\",\n    \"⊣\",\n    \"⊦\",\n    \"⊩\",\n    \"⊪\",\n    \"⊫\",\n    \"⊬\",\n    \"⊭\",\n    \"⊮\",\n    \"⊯\",\n    \"⊰\",\n    \"⊱\",\n    \"⊲\",\n    \"⊳\",\n    \"⊴\",\n    \"⊵\",\n    \"⊶\",\n    \"⊷\",\n    \"⊸\",\n    \"⊹\",\n    \"⊺\",\n    \"ℎ\",\n    \"℘\",\n    \"ℜ\",\n    \"ℑ\",\n    \"ℵ\",\n    \"ℶ\",\n    \"ℷ\",\n    \"ℸ\",\n    \"⌬\",\n    \"⌭\",\n    \"⌮\",\n    \"⌯\",\n    \"⎔\",\n    \"¤\",\n    \"₠\",\n    \"₡\",\n    \"₢\",\n    \"₣\",\n    \"₤\",\n    \"₥\",\n    \"₦\",\n    \"₧\",\n    \"₨\",\n    \"₩\",\n    \"₪\",\n    \"₫\",\n    \"₭\",\n    \"₮\",\n    \"₯\",\n    \"₰\",\n    \"₱\",\n    \"₲\",\n    \"₳\",\n    \"₴\",\n    \"₵\",\n    \"₶\",\n    \"₷\",\n    \"₸\",\n    \"₹\",\n    \"₺\",\n    \"₻\",\n    \"₼\",\n    \"₽\",\n    \"₾\",\n    \"↚\",\n    \"↛\",\n    \"↜\",\n    \"↝\",\n    
\"↞\",\n    \"↟\",\n    \"↠\",\n    \"↡\",\n    \"↢\",\n    \"↣\",\n    \"↤\",\n    \"↥\",\n    \"↦\",\n    \"↧\",\n    \"↨\",\n    \"↩\",\n    \"↪\",\n    \"↫\",\n    \"↬\",\n    \"↭\",\n    \"↮\",\n    \"↯\",\n    \"↰\",\n    \"↱\",\n    \"↲\",\n    \"↳\",\n    \"↴\",\n    \"↵\",\n    \"↶\",\n    \"↷\",\n    \"↸\",\n    \"↹\",\n    \"⇄\",\n    \"⇅\",\n    \"⇆\",\n    \"⇇\",\n    \"⇈\",\n    \"⇉\",\n    \"⇊\",\n    \"⇍\",\n    \"⇎\",\n    \"⇏\",\n    \"⇑\",\n    \"⇓\",\n    \"⇕\",\n    \"⇖\",\n    \"⇗\",\n    \"⇘\",\n    \"⇙\",\n    \"⇚\",\n    \"⇛\",\n    \"⇜\",\n    \"⇝\",\n    \"⇞\",\n    \"⇟\",\n    \"⇠\",\n    \"⇡\",\n    \"⇢\",\n    \"⇣\",\n    \"⇤\",\n    \"⇥\",\n    \"⇦\",\n    \"⇧\",\n    \"⇨\",\n    \"⇩\",\n    \"⇪\",\n    \"⇫\",\n    \"⇬\",\n    \"⇭\",\n    \"⇮\",\n    \"⇯\",\n    \"⇰\",\n    \"⇱\",\n    \"⇲\",\n    \"⇳\",\n    \"⇴\",\n    \"⇵\",\n    \"⇶\",\n    \"⇷\",\n    \"⇸\",\n    \"⇹\",\n    \"⇺\",\n    \"⇻\",\n    \"⇼\",\n    \"⇽\",\n    \"⇾\",\n    \"⇿\",\n    \"ↀ\",\n    \"ↁ\",\n    \"ↂ\",\n    \"☀\",\n    \"☁\",\n    \"☂\",\n    \"☃\",\n    \"☄\",\n    \"★\",\n    \"☇\",\n    \"☈\",\n    \"☉\",\n    \"☊\",\n    \"☋\",\n    \"☌\",\n    \"☍\",\n    \"☎\",\n    \"☏\",\n    \"☐\",\n    \"☑\",\n    \"☒\",\n    \"☓\",\n    \"☔\",\n    \"☕\",\n    \"☖\",\n    \"☗\",\n    \"☘\",\n    \"☙\",\n    \"☚\",\n    \"☛\",\n    \"☜\",\n    \"☝\",\n    \"☞\",\n    \"☟\",\n    \"☠\",\n    \"☡\",\n    \"☢\",\n    \"☣\",\n    \"☤\",\n    \"☥\",\n    \"☦\",\n    \"☧\",\n    \"☨\",\n    \"☩\",\n    \"☪\",\n    \"☫\",\n    \"☬\",\n    \"☭\",\n    \"☮\",\n    \"☯\",\n    \"☸\",\n    \"☹\",\n    \"☺\",\n    \"☻\",\n    \"☼\",\n    \"☽\",\n    \"☾\",\n    \"☿\",\n    \"♁\",\n    \"♃\",\n    \"♄\",\n    \"♅\",\n    \"♆\",\n    \"♇\",\n    \"♔\",\n    \"♕\",\n    \"♖\",\n    \"♗\",\n    \"♘\",\n    \"♙\",\n    \"♚\",\n    \"♛\",\n    \"♜\",\n    \"♝\",\n    \"♞\",\n    \"♟\",\n    \"♠\",\n    \"♡\",\n    \"♢\",\n    \"♣\",\n    \"♤\",\n    \"♥\",\n    \"♦\",\n    
\"♧\",\n    \"♨\",\n    \"♭\",\n    \"♮\",\n    \"♯\",\n    \"♰\",\n    \"♱\",\n    \"♲\",\n    \"♻\",\n    \"♼\",\n    \"♽\",\n    \"♾\",\n    \"⚀\",\n    \"⚁\",\n    \"⚂\",\n    \"⚃\",\n    \"⚄\",\n    \"⚅\",\n    \"⚆\",\n    \"⚇\",\n    \"⚈\",\n    \"⚉\",\n    \"⚊\",\n    \"⚋\",\n    \"⚌\",\n    \"⚍\",\n    \"⚎\",\n    \"⚏\",\n    \"⚐\",\n    \"⚑\",\n    \"⚒\",\n    \"⚓\",\n    \"⚔\",\n    \"⚕\",\n    \"⚖\",\n    \"⚗\",\n    \"⚘\",\n    \"⚙\",\n    \"⚚\",\n    \"⚛\",\n    \"⚜\",\n    \"⚝\",\n    \"⚞\",\n    \"⚟\",\n    \"⚠\",\n    \"⚡\",\n    \"⚢\",\n    \"⚣\",\n    \"⚤\",\n    \"⚥\",\n    \"⚦\",\n    \"⚧\",\n    \"⚨\",\n    \"⚩\",\n    \"⚭\",\n    \"⚮\",\n    \"⚯\",\n    \"⚰\",\n    \"⚱\",\n    \"⚲\",\n    \"⚳\",\n    \"⚴\",\n    \"⚵\",\n    \"⚶\",\n    \"⚷\",\n    \"⚸\",\n    \"⚹\",\n    \"⚺\",\n    \"⚻\",\n    \"⚼\",\n    \"⚿\",\n    \"⛀\",\n    \"⛁\",\n    \"⛂\",\n    \"⛃\",\n    \"⛆\",\n    \"⛇\",\n    \"⛈\",\n    \"⛉\",\n    \"⛊\",\n    \"⛋\",\n    \"⛌\",\n    \"⛍\",\n    \"⛏\",\n    \"⛐\",\n    \"⛑\",\n    \"⛒\",\n    \"⛓\",\n    \"⛕\",\n    \"⛖\",\n    \"⛗\",\n    \"⛘\",\n    \"⛙\",\n    \"⛚\",\n    \"⛛\",\n    \"⛜\",\n    \"⛝\",\n    \"⛞\",\n    \"⛠\",\n    \"⛡\",\n    \"⛢\",\n    \"⛣\",\n    \"⛤\",\n    \"⛥\",\n    \"⛦\",\n    \"⛧\",\n    \"⛨\",\n    \"⛩\",\n    \"⛪\",\n    \"⛫\",\n    \"⛬\",\n    \"⛭\",\n    \"⛮\",\n    \"⛯\",\n    \"⛶\",\n    \"⛾\",\n    \"⛿\",\n    \"✆\",\n    \"✇\",\n    \"✈\",\n    \"✉\",\n    \"✌\",\n    \"✍\",\n    \"✎\",\n    \"✏\",\n    \"✐\",\n    \"✑\",\n    \"✒\",\n    \"✓\",\n    \"✔\",\n    \"✕\",\n    \"✙\",\n    \"✚\",\n    \"✛\",\n    \"✜\",\n    \"✝\",\n    \"✞\",\n    \"✟\",\n    \"✠\",\n    \"✡\",\n    \"✢\",\n    \"✣\",\n    \"✤\",\n    \"✥\",\n    \"✦\",\n    \"✧\",\n    \"✩\",\n    \"✪\",\n    \"✫\",\n    \"✬\",\n    \"✭\",\n    \"✮\",\n    \"✯\",\n    \"✰\",\n    \"✱\",\n    \"✲\",\n    \"✳\",\n    \"✴\",\n    \"✵\",\n    \"✹\",\n    \"✺\",\n    \"✻\",\n    \"✼\",\n    \"✽\",\n    \"✾\",\n    \"✿\",\n    
\"❀\",\n    \"❁\",\n    \"❂\",\n    \"❃\",\n    \"❄\",\n    \"❅\",\n    \"❆\",\n    \"❇\",\n    \"❈\",\n    \"❉\",\n    \"❊\",\n    \"❋\",\n    \"❍\",\n    \"❏\",\n    \"❐\",\n    \"❑\",\n    \"❒\",\n    \"❖\",\n    \"❘\",\n    \"❙\",\n    \"❚\",\n    \"❛\",\n    \"❜\",\n    \"❝\",\n    \"❞\",\n    \"❡\",\n    \"❢\",\n    \"❣\",\n    \"❤\",\n    \"❥\",\n    \"❦\",\n    \"❧\",\n    \"❨\",\n    \"❩\",\n    \"❪\",\n    \"❫\",\n    \"❬\",\n    \"❭\",\n    \"❮\",\n    \"❯\",\n    \"❰\",\n    \"❱\",\n    \"❲\",\n    \"❳\",\n    \"❴\",\n    \"❵\",\n    \"❶\",\n    \"❷\",\n    \"❸\",\n    \"❹\",\n    \"❺\",\n    \"❻\",\n    \"❼\",\n    \"❽\",\n    \"❾\",\n    \"❿\",\n    \"①\",\n    \"②\",\n    \"③\",\n    \"④\",\n    \"⑤\",\n    \"⑥\",\n    \"⑦\",\n    \"⑧\",\n    \"⑨\",\n    \"⑩\",\n    \"➔\",\n    \"➕\",\n    \"➖\",\n    \"➗\",\n    \"➘\",\n    \"➙\",\n    \"➚\",\n    \"➛\",\n    \"➜\",\n    \"➝\",\n    \"➞\",\n    \"➟\",\n    \"➠\",\n    \"➡\",\n    \"➢\",\n    \"➣\",\n    \"➤\",\n    \"➥\",\n    \"➦\",\n    \"➧\",\n    \"➨\",\n    \"➩\",\n    \"➪\",\n    \"➫\",\n    \"➬\",\n    \"➭\",\n    \"➮\",\n    \"➯\",\n    \"➰\",\n    \"➱\",\n    \"➲\",\n    \"➳\",\n    \"➴\",\n    \"➵\",\n    \"➶\",\n    \"➷\",\n    \"➸\",\n    \"➹\",\n    \"➺\",\n    \"➻\",\n    \"➼\",\n    \"➽\",\n    \"➾\",\n    \"➿\",\n    \"⌘\",\n    \"⌥\",\n    \"⌃\",\n    \"⎋\",\n    \"⌫\",\n    \"⌦\",\n    \"⏏\",\n    \"⌤\",\n    \"⌧\",\n    \"⌨\",\n    \"⎆\",\n    \"⎇\",\n    \"⎈\",\n    \"⎉\",\n    \"⎊\",\n    \"⎌\",\n    \"⎍\",\n    \"⎎\",\n    \"⎏\",\n    \"⎐\",\n    \"⎑\",\n    \"⎒\",\n    \"⎓\",\n    \"⎕\",\n    \"⎖\",\n    \"⎗\",\n    \"⎘\",\n    \"⎙\",\n    \"⎚\",\n    \"⎛\",\n    \"⎜\",\n    \"⎝\",\n    \"⎞\",\n    \"⎟\",\n    \"⎠\",\n    \"⎡\",\n    \"⎢\",\n    \"⎣\",\n    \"⎤\",\n    \"⎥\",\n    \"⎦\",\n    \"⎧\",\n    \"⎨\",\n    \"⎩\",\n    \"⎪\",\n    \"⎫\",\n    \"⎬\",\n    \"⎭\",\n    \"⎮\",\n    \"⎯\",\n    \"⎰\",\n    \"⎱\",\n    \"⎲\",\n    \"⎳\",\n    \"⎴\",\n    \"⎵\",\n    
\"⎶\",\n    \"⎷\",\n    \"⎸\",\n    \"⎹\",\n    \"⎺\",\n    \"⎻\",\n    \"⎼\",\n    \"⎽\",\n    \"⎾\",\n    \"⎿\",\n    \"⏍\",\n    \"⏎\",\n    \"⏐\",\n    \"⏑\",\n    \"⏒\",\n    \"⏓\",\n    \"⏔\",\n    \"⏕\",\n    \"⏖\",\n    \"⏗\",\n    \"⏘\",\n    \"⏙\",\n    \"⏛\",\n    \"⏜\",\n    \"⏝\",\n    \"⏞\",\n    \"⏟\",\n    \"⏠\",\n    \"⏡\",\n    \"⏢\",\n    \"⏣\",\n    \"⏤\",\n    \"⏥\",\n    \"⏦\",\n    \"⏧\",\n    \"⏨\",\n    \"⏭\",\n    \"⏮\",\n    \"⏯\",\n    \"⏱\",\n    \"⏲\",\n    \"▲\",\n    \"▽\",\n    \"◐\",\n    \"⏽\",\n    \"⏾\",\n    \"⏿\",\n    \"ɐ\",\n    \"ɑ\",\n    \"ɒ\",\n    \"ɓ\",\n    \"ɔ\",\n    \"ɕ\",\n    \"ɖ\",\n    \"ɗ\",\n    \"ɘ\",\n    \"ə\",\n    \"ɚ\",\n    \"ɛ\",\n    \"ɜ\",\n    \"ɝ\",\n    \"ɞ\",\n    \"ɟ\",\n    \"ɠ\",\n    \"ɡ\",\n    \"ɢ\",\n    \"ɣ\",\n    \"ɤ\",\n    \"ɥ\",\n    \"ɦ\",\n    \"ɧ\",\n    \"ɨ\",\n    \"ɩ\",\n    \"ɪ\",\n    \"ɫ\",\n    \"ɬ\",\n    \"ɭ\",\n    \"ɮ\",\n    \"ɯ\",\n    \"ɰ\",\n    \"ɱ\",\n    \"ɲ\",\n    \"ɳ\",\n    \"ɴ\",\n    \"ɵ\",\n    \"ɶ\",\n    \"ɷ\",\n    \"ɸ\",\n    \"ɹ\",\n    \"ɺ\",\n    \"ɻ\",\n    \"ɼ\",\n    \"ɽ\",\n    \"ɾ\",\n    \"ɿ\",\n    \"ʀ\",\n    \"ʁ\",\n    \"ʂ\",\n    \"ʃ\",\n    \"ʄ\",\n    \"ʅ\",\n    \"ʆ\",\n    \"ʇ\",\n    \"ʈ\",\n    \"ʉ\",\n    \"ʊ\",\n    \"ʋ\",\n    \"ʌ\",\n    \"ʍ\",\n    \"ʎ\",\n    \"ʏ\",\n    \"ʐ\",\n    \"ʑ\",\n    \"ʒ\",\n    \"ʓ\",\n    \"ʔ\",\n    \"ʕ\",\n    \"ʖ\",\n    \"ʗ\",\n    \"ʘ\",\n    \"ʙ\",\n    \"ʚ\",\n    \"ʛ\",\n    \"ʜ\",\n    \"ʝ\",\n    \"ʞ\",\n    \"ʟ\",\n    \"ʠ\",\n    \"ʡ\",\n    \"ʢ\",\n    \"ʣ\",\n    \"ʤ\",\n    \"ʥ\",\n    \"ʦ\",\n    \"ʧ\",\n    \"ʨ\",\n    \"ʩ\",\n    \"ʪ\",\n    \"ʫ\",\n    \"ʬ\",\n    \"ʭ\",\n    \"ʮ\",\n    \"ʯ\",\n    \"━\",\n    \"Ǝ\",\n    \"Ã\",\n    \"●\",\n    \"▶\",\n    \"｜\",\n    \"𝑢\",\n    \"〖\",\n    \"〗\",\n    \"︽\",\n    \"–\",\n    \"﹥\",\n    \"𝜓\",\n    \"•\",\n    \"∋\",\n    \"ƒ\",\n    \"०\",\n    \"✘\",\n    \"Е\",\n    \"◉\",\n    \"〒\",\n    \"𝒱\",\n    \"𝜆\",\n    
\"⟹\",\n    \"﹪\",\n    \"◊\",\n    \"╆\",\n    \"오\",\n    \"˂\",\n    \"〉\",\n    \"𝝎\",\n    \"▪\",\n    \"△\",\n    \"▁\",\n    \"◼\",\n    \"〇\",\n    \"▷\",\n    \"▬\",\n    \"𝒮\",\n    \"†\",\n    \"ₒ\",\n    \"⼁\",\n    \"〵\",\n    \"⭐\",\n    \"╳\",\n    \"⟶\",\n    \"으\",\n    \"⬆\",\n    \"Ạ\",\n    \"◀\",\n    \"\",\n    \"▫\",\n    \"丄\",\n    \"︾\",\n    \"◥\",\n    \"‖\",\n    \"𝜌\",\n    \"ⅼ\",\n    \"▼\",\n    \"⁎\",\n    \"﹏\",\n    \"😁\",\n    \"😂\",\n    \"😃\",\n    \"😄\",\n    \"😅\",\n    \"😆\",\n    \"😉\",\n    \"😊\",\n    \"😋\",\n    \"😌\",\n    \"😍\",\n    \"😏\",\n    \"😒\",\n    \"😓\",\n    \"😔\",\n    \"😖\",\n    \"😘\",\n    \"😚\",\n    \"😜\",\n    \"😝\",\n    \"😞\",\n    \"😠\",\n    \"😡\",\n    \"😢\",\n    \"😣\",\n    \"😤\",\n    \"😥\",\n    \"😨\",\n    \"😩\",\n    \"😪\",\n    \"😫\",\n    \"😭\",\n    \"😰\",\n    \"😱\",\n    \"😲\",\n    \"😳\",\n    \"😵\",\n    \"😷\",\n    \"😸\",\n    \"😹\",\n    \"😺\",\n    \"😻\",\n    \"😼\",\n    \"😽\",\n    \"😾\",\n    \"😿\",\n    \"🙀\",\n    \"🙅\",\n    \"🙆\",\n    \"🙇\",\n    \"🙈\",\n    \"🙉\",\n    \"🙊\",\n    \"🙋\",\n    \"🙌\",\n    \"🙍\",\n    \"🙎\",\n    \"🙏\",\n    \"✂\",\n    \"✅\",\n    \"✊\",\n    \"✋\",\n    \"✖\",\n    \"✨\",\n    \"❌\",\n    \"❎\",\n    \"❓\",\n    \"❔\",\n    \"❕\",\n    \"❗\",\n    \"🚀\",\n    \"🚃\",\n    \"🚄\",\n    \"🚅\",\n    \"🚇\",\n    \"🚉\",\n    \"🚌\",\n    \"🚏\",\n    \"🚑\",\n    \"🚒\",\n    \"🚓\",\n    \"🚕\",\n    \"🚗\",\n    \"🚙\",\n    \"🚚\",\n    \"🚢\",\n    \"🚤\",\n    \"🚥\",\n    \"🚧\",\n    \"🚨\",\n    \"🚩\",\n    \"🚪\",\n    \"🚫\",\n    \"🚬\",\n    \"🚭\",\n    \"🚲\",\n    \"🚶\",\n    \"🚹\",\n    \"🚺\",\n    \"🚻\",\n    \"🚼\",\n    \"🚽\",\n    \"🚾\",\n    \"🛀\",\n    \"Ⓜ\",\n    \"🅰\",\n    \"🅱\",\n    \"🅾\",\n    \"🅿\",\n    \"🆎\",\n    \"🆑\",\n    \"🆒\",\n    \"🆓\",\n    \"🆔\",\n    \"🆕\",\n    \"🆖\",\n    \"🆗\",\n    \"🆘\",\n    \"🆙\",\n    \"🆚\",\n    \"🇩🇪\",\n    \"🇬🇧\",\n    \"🇨🇳\",\n    \"🇯🇵\",\n    \"🇫🇷\",\n    \"🇰🇷\",\n    \"🇪🇸\",\n    \"🇮🇹\",\n    
\"🇷🇺\",\n    \"🇺🇸\",\n    \"🈁\",\n    \"ℹ\",\n    \"⌚\",\n    \"⌛\",\n    \"⏩\",\n    \"⏪\",\n    \"⏫\",\n    \"⏬\",\n    \"⏰\",\n    \"⏳\",\n    \"◻\",\n    \"◽\",\n    \"◾\",\n    \"♈\",\n    \"♉\",\n    \"♊\",\n    \"♋\",\n    \"♌\",\n    \"♍\",\n    \"♎\",\n    \"♏\",\n    \"♐\",\n    \"♑\",\n    \"♒\",\n    \"♓\",\n    \"♿\",\n    \"⚽\",\n    \"⚾\",\n    \"⛄\",\n    \"⛅\",\n    \"⛎\",\n    \"⛔\",\n    \"⛲\",\n    \"⛳\",\n    \"⛵\",\n    \"⛺\",\n    \"⛽\",\n    \"⤴\",\n    \"⤵\",\n    \"⬅\",\n    \"⬇\",\n    \"⬛\",\n    \"⬜\",\n    \"⭕\",\n    \"〰\",\n    \"〽\",\n    \"㊗\",\n    \"㊙\",\n    \"🀄\",\n    \"🃏\",\n    \"🌀\",\n    \"🌁\",\n    \"🌂\",\n    \"🌃\",\n    \"🌄\",\n    \"🌅\",\n    \"🌆\",\n    \"🌇\",\n    \"🌈\",\n    \"🌉\",\n    \"🌊\",\n    \"🌋\",\n    \"🌌\",\n    \"🌏\",\n    \"🌑\",\n    \"🌓\",\n    \"🌔\",\n    \"🌕\",\n    \"🌙\",\n    \"🌛\",\n    \"🌟\",\n    \"🌠\",\n    \"🌰\",\n    \"🌱\",\n    \"🌴\",\n    \"🌵\",\n    \"🌷\",\n    \"🌸\",\n    \"🌹\",\n    \"🌺\",\n    \"🌻\",\n    \"🌼\",\n    \"🌽\",\n    \"🌾\",\n    \"🌿\",\n    \"🍀\",\n    \"🍁\",\n    \"🍂\",\n    \"🍃\",\n    \"🍄\",\n    \"🍅\",\n    \"🍆\",\n    \"🍇\",\n    \"🍈\",\n    \"🍉\",\n    \"🍊\",\n    \"🍌\",\n    \"🍍\",\n    \"🍎\",\n    \"🍏\",\n    \"🍑\",\n    \"🍒\",\n    \"🍓\",\n    \"🍔\",\n    \"🍕\",\n    \"🍖\",\n    \"🍗\",\n    \"🍘\",\n    \"🍙\",\n    \"🍚\",\n    \"🍛\",\n    \"🍜\",\n    \"🍝\",\n    \"🍞\",\n    \"🍟\",\n    \"🍠\",\n    \"🍡\",\n    \"🍢\",\n    \"🍣\",\n    \"🍤\",\n    \"🍥\",\n    \"🍦\",\n    \"🍧\",\n    \"🍨\",\n    \"🍩\",\n    \"🍪\",\n    \"🍫\",\n    \"🍬\",\n    \"🍭\",\n    \"🍮\",\n    \"🍯\",\n    \"🍰\",\n    \"🍱\",\n    \"🍲\",\n    \"🍳\",\n    \"🍴\",\n    \"🍵\",\n    \"🍶\",\n    \"🍷\",\n    \"🍸\",\n    \"🍹\",\n    \"🍺\",\n    \"🍻\",\n    \"🎀\",\n    \"🎁\",\n    \"🎂\",\n    \"🎃\",\n    \"🎄\",\n    \"🎅\",\n    \"🎆\",\n    \"🎇\",\n    \"🎈\",\n    \"🎉\",\n    \"🎊\",\n    \"🎋\",\n    \"🎌\",\n    \"🎍\",\n    \"🎎\",\n    \"🎏\",\n    \"🎐\",\n    \"🎑\",\n    \"🎒\",\n    \"🎓\",\n    \"🎠\",\n    
\"🎡\",\n    \"🎢\",\n    \"🎣\",\n    \"🎤\",\n    \"🎥\",\n    \"🎦\",\n    \"🎧\",\n    \"🎨\",\n    \"🎩\",\n    \"🎪\",\n    \"🎫\",\n    \"🎬\",\n    \"🎭\",\n    \"🎮\",\n    \"🎯\",\n    \"🎰\",\n    \"🎱\",\n    \"🎲\",\n    \"🎳\",\n    \"🎴\",\n    \"🎵\",\n    \"🎶\",\n    \"🎷\",\n    \"🎸\",\n    \"🎹\",\n    \"🎺\",\n    \"🎻\",\n    \"🎼\",\n    \"🎽\",\n    \"🎾\",\n    \"🎿\",\n    \"🏀\",\n    \"🏁\",\n    \"🏂\",\n    \"🏃\",\n    \"🏄\",\n    \"🏆\",\n    \"🏈\",\n    \"🏊\",\n    \"🏠\",\n    \"🏡\",\n    \"🏢\",\n    \"🏣\",\n    \"🏥\",\n    \"🏦\",\n    \"🏧\",\n    \"🏨\",\n    \"🏩\",\n    \"🏪\",\n    \"🏫\",\n    \"🏬\",\n    \"🏭\",\n    \"🏮\",\n    \"🏯\",\n    \"🏰\",\n    \"🐌\",\n    \"🐍\",\n    \"🐎\",\n    \"🐑\",\n    \"🐒\",\n    \"🐔\",\n    \"🐗\",\n    \"🐘\",\n    \"🐙\",\n    \"🐚\",\n    \"🐛\",\n    \"🐜\",\n    \"🐝\",\n    \"🐞\",\n    \"🐟\",\n    \"🐠\",\n    \"🐡\",\n    \"🐢\",\n    \"🐣\",\n    \"🐤\",\n    \"🐥\",\n    \"🐦\",\n    \"🐧\",\n    \"🐨\",\n    \"🐩\",\n    \"🐫\",\n    \"🐬\",\n    \"🐭\",\n    \"🐮\",\n    \"🐯\",\n    \"🐰\",\n    \"🐱\",\n    \"🐲\",\n    \"🐳\",\n    \"🐴\",\n    \"🐵\",\n    \"🐶\",\n    \"🐷\",\n    \"🐸\",\n    \"🐹\",\n    \"🐺\",\n    \"🐻\",\n    \"🐼\",\n    \"🐽\",\n    \"🐾\",\n    \"👀\",\n    \"👂\",\n    \"👃\",\n    \"👄\",\n    \"👅\",\n    \"👆\",\n    \"👇\",\n    \"👈\",\n    \"👉\",\n    \"👊\",\n    \"👋\",\n    \"👌\",\n    \"👍\",\n    \"👎\",\n    \"👏\",\n    \"👐\",\n    \"👑\",\n    \"👒\",\n    \"👓\",\n    \"👔\",\n    \"👕\",\n    \"👖\",\n    \"👗\",\n    \"👘\",\n    \"👙\",\n    \"👚\",\n    \"👛\",\n    \"👜\",\n    \"👝\",\n    \"👞\",\n    \"👟\",\n    \"👠\",\n    \"👡\",\n    \"👢\",\n    \"👣\",\n    \"👤\",\n    \"👦\",\n    \"👧\",\n    \"👨\",\n    \"👩\",\n    \"👪\",\n    \"👫\",\n    \"👮\",\n    \"👯\",\n    \"👰\",\n    \"👱\",\n    \"👲\",\n    \"👳\",\n    \"👴\",\n    \"👵\",\n    \"👶\",\n    \"👷\",\n    \"👸\",\n    \"👹\",\n    \"👺\",\n    \"👻\",\n    \"👼\",\n    \"👽\",\n    \"👾\",\n    \"👿\",\n    \"💀\",\n    \"💁\",\n    \"💂\",\n    \"💃\",\n    \"💄\",\n    \"💅\",\n    
\"💆\",\n    \"💇\",\n    \"💈\",\n    \"💉\",\n    \"💊\",\n    \"💋\",\n    \"💌\",\n    \"💍\",\n    \"💎\",\n    \"💏\",\n    \"💐\",\n    \"💑\",\n    \"💒\",\n    \"💓\",\n    \"💔\",\n    \"💕\",\n    \"💖\",\n    \"💗\",\n    \"💘\",\n    \"💙\",\n    \"💚\",\n    \"💛\",\n    \"💜\",\n    \"💝\",\n    \"💞\",\n    \"💟\",\n    \"💠\",\n    \"💡\",\n    \"💢\",\n    \"💣\",\n    \"💤\",\n    \"💥\",\n    \"💦\",\n    \"💧\",\n    \"💨\",\n    \"💩\",\n    \"💪\",\n    \"💫\",\n    \"💬\",\n    \"💮\",\n    \"💯\",\n    \"💰\",\n    \"💲\",\n    \"💳\",\n    \"💴\",\n    \"💵\",\n    \"💸\",\n    \"💹\",\n    \"💺\",\n    \"💻\",\n    \"💼\",\n    \"💽\",\n    \"💾\",\n    \"💿\",\n    \"📀\",\n    \"📁\",\n    \"📂\",\n    \"📃\",\n    \"📄\",\n    \"📅\",\n    \"📆\",\n    \"📇\",\n    \"📈\",\n    \"📉\",\n    \"📊\",\n    \"📋\",\n    \"📌\",\n    \"📍\",\n    \"📎\",\n    \"📏\",\n    \"📐\",\n    \"📑\",\n    \"📒\",\n    \"📓\",\n    \"📔\",\n    \"📕\",\n    \"📖\",\n    \"📗\",\n    \"📘\",\n    \"📙\",\n    \"📚\",\n    \"📛\",\n    \"📜\",\n    \"📝\",\n    \"📞\",\n    \"📟\",\n    \"📠\",\n    \"📡\",\n    \"📢\",\n    \"📣\",\n    \"📤\",\n    \"📥\",\n    \"📦\",\n    \"📧\",\n    \"📨\",\n    \"📩\",\n    \"📪\",\n    \"📫\",\n    \"📮\",\n    \"📰\",\n    \"📱\",\n    \"📲\",\n    \"📳\",\n    \"📴\",\n    \"📶\",\n    \"📷\",\n    \"📹\",\n    \"📺\",\n    \"📻\",\n    \"📼\",\n    \"🔃\",\n    \"🔊\",\n    \"🔋\",\n    \"🔌\",\n    \"🔍\",\n    \"🔎\",\n    \"🔏\",\n    \"🔐\",\n    \"🔑\",\n    \"🔒\",\n    \"🔓\",\n    \"🔔\",\n    \"🔖\",\n    \"🔗\",\n    \"🔘\",\n    \"🔙\",\n    \"🔚\",\n    \"🔛\",\n    \"🔜\",\n    \"🔝\",\n    \"🔞\",\n    \"🔟\",\n    \"🔠\",\n    \"🔡\",\n    \"🔢\",\n    \"🔣\",\n    \"🔤\",\n    \"🔥\",\n    \"🔦\",\n    \"🔧\",\n    \"🔨\",\n    \"🔩\",\n    \"🔪\",\n    \"🔫\",\n    \"🔮\",\n    \"🔯\",\n    \"🔰\",\n    \"🔱\",\n    \"🔲\",\n    \"🔳\",\n    \"🔴\",\n    \"🔵\",\n    \"🔶\",\n    \"🔷\",\n    \"🔸\",\n    \"🔹\",\n    \"🔺\",\n    \"🔻\",\n    \"🔼\",\n    \"🔽\",\n    \"🕐\",\n    \"🕑\",\n    \"🕒\",\n    \"🕓\",\n    \"🕔\",\n    \"🕕\",\n    
\"🕖\",\n    \"🕗\",\n    \"🕘\",\n    \"🕙\",\n    \"🕚\",\n    \"🕛\",\n    \"🗻\",\n    \"🗼\",\n    \"🗽\",\n    \"🗾\",\n    \"🗿\",\n    \"😀\",\n    \"😇\",\n    \"😈\",\n    \"😎\",\n    \"😐\",\n    \"😑\",\n    \"😕\",\n    \"😗\",\n    \"😙\",\n    \"😛\",\n    \"😟\",\n    \"😦\",\n    \"😧\",\n    \"😬\",\n    \"😮\",\n    \"😯\",\n    \"😴\",\n    \"😶\",\n    \"🚁\",\n    \"🚂\",\n    \"🚆\",\n    \"🚈\",\n    \"🚊\",\n    \"🚍\",\n    \"🚎\",\n    \"🚐\",\n    \"🚔\",\n    \"🚖\",\n    \"🚘\",\n    \"🚛\",\n    \"🚜\",\n    \"🚝\",\n    \"🚞\",\n    \"🚟\",\n    \"🚠\",\n    \"🚡\",\n    \"🚣\",\n    \"🚦\",\n    \"🚮\",\n    \"🚯\",\n    \"🚰\",\n    \"🚱\",\n    \"🚳\",\n    \"🚴\",\n    \"🚵\",\n    \"🚷\",\n    \"🚸\",\n    \"🚿\",\n    \"🛁\",\n    \"🛂\",\n    \"🛃\",\n    \"🛄\",\n    \"🛅\",\n    \"🌍\",\n    \"🌎\",\n    \"🌐\",\n    \"🌒\",\n    \"🌖\",\n    \"🌗\",\n    \"🌘\",\n    \"🌚\",\n    \"🌜\",\n    \"🌝\",\n    \"🌞\",\n    \"🌲\",\n    \"🌳\",\n    \"🍋\",\n    \"🍐\",\n    \"🍼\",\n    \"🏇\",\n    \"🏉\",\n    \"🏤\",\n    \"🐀\",\n    \"🐁\",\n    \"🐂\",\n    \"🐃\",\n    \"🐄\",\n    \"🐅\",\n    \"🐆\",\n    \"🐇\",\n    \"🐈\",\n    \"🐉\",\n    \"🐊\",\n    \"🐋\",\n    \"🐏\",\n    \"🐐\",\n    \"🐓\",\n    \"🐕\",\n    \"🐖\",\n    \"🐪\",\n    \"👥\",\n    \"👬\",\n    \"👭\",\n    \"💭\",\n    \"💶\",\n    \"💷\",\n    \"📬\",\n    \"📭\",\n    \"📯\",\n    \"📵\",\n    \"🔀\",\n    \"🔁\",\n    \"🔂\",\n    \"🔄\",\n    \"🔅\",\n    \"🔆\",\n    \"🔇\",\n    \"🔉\",\n    \"🔕\",\n    \"🔬\",\n    \"🔭\",\n    \"🕜\",\n    \"🕝\",\n    \"🕞\",\n    \"🕟\",\n    \"🕠\",\n    \"🕡\",\n    \"🕢\",\n    \"🕣\",\n    \"🕤\",\n    \"🕥\",\n    \"🕦\",\n    \"🕧\"\n};\n\nstatic const int character_dict_size = sizeof(character_dict) / sizeof(const char*);\n"
  },
  {
    "path": "examples/retinaface.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct FaceObject\n{\n    cv::Rect_<float> rect;\n    cv::Point2f landmark[5];\n    float prob;\n};\n\nstatic inline float intersection_area(const FaceObject& a, const FaceObject& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<FaceObject>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<FaceObject>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<FaceObject>& faceobjects, std::vector<int>& picked, float nms_threshold)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const FaceObject& a = 
faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const FaceObject& b = faceobjects[picked[j]];\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // IoU = inter_area / union_area; suppress a when it overlaps an already picked box too much\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\n// copied from src/layer/proposal.cpp\nstatic ncnn::Mat generate_anchors(int base_size, const ncnn::Mat& ratios, const ncnn::Mat& scales)\n{\n    int num_ratio = ratios.w;\n    int num_scale = scales.w;\n\n    ncnn::Mat anchors;\n    anchors.create(4, num_ratio * num_scale);\n\n    const float cx = base_size * 0.5f;\n    const float cy = base_size * 0.5f;\n\n    for (int i = 0; i < num_ratio; i++)\n    {\n        float ar = ratios[i];\n\n        int r_w = round(base_size / sqrt(ar));\n        int r_h = round(r_w * ar); // round(base_size * sqrt(ar));\n\n        for (int j = 0; j < num_scale; j++)\n        {\n            float scale = scales[j];\n\n            float rs_w = r_w * scale;\n            float rs_h = r_h * scale;\n\n            float* anchor = anchors.row(i * num_scale + j);\n\n            anchor[0] = cx - rs_w * 0.5f;\n            anchor[1] = cy - rs_h * 0.5f;\n            anchor[2] = cx + rs_w * 0.5f;\n            anchor[3] = cy + rs_h * 0.5f;\n        }\n    }\n\n    return anchors;\n}\n\nstatic void generate_proposals(const ncnn::Mat& anchors, int feat_stride, const ncnn::Mat& score_blob, const ncnn::Mat& bbox_blob, const ncnn::Mat& landmark_blob, float prob_threshold, std::vector<FaceObject>& faceobjects)\n{\n    int w = score_blob.w;\n    int h = score_blob.h;\n\n    // generate face proposals from bbox deltas and shifted anchors\n    const int num_anchors = anchors.h;\n\n    for (int q = 0; q < 
num_anchors; q++)\n    {\n        const float* anchor = anchors.row(q);\n\n        const ncnn::Mat score = score_blob.channel(q + num_anchors);\n        const ncnn::Mat bbox = bbox_blob.channel_range(q * 4, 4);\n        const ncnn::Mat landmark = landmark_blob.channel_range(q * 10, 10);\n\n        // shifted anchor\n        float anchor_y = anchor[1];\n\n        float anchor_w = anchor[2] - anchor[0];\n        float anchor_h = anchor[3] - anchor[1];\n\n        for (int i = 0; i < h; i++)\n        {\n            float anchor_x = anchor[0];\n\n            for (int j = 0; j < w; j++)\n            {\n                int index = i * w + j;\n\n                float prob = score[index];\n\n                if (prob >= prob_threshold)\n                {\n                    // apply center size\n                    float dx = bbox.channel(0)[index];\n                    float dy = bbox.channel(1)[index];\n                    float dw = bbox.channel(2)[index];\n                    float dh = bbox.channel(3)[index];\n\n                    float cx = anchor_x + anchor_w * 0.5f;\n                    float cy = anchor_y + anchor_h * 0.5f;\n\n                    float pb_cx = cx + anchor_w * dx;\n                    float pb_cy = cy + anchor_h * dy;\n\n                    float pb_w = anchor_w * exp(dw);\n                    float pb_h = anchor_h * exp(dh);\n\n                    float x0 = pb_cx - pb_w * 0.5f;\n                    float y0 = pb_cy - pb_h * 0.5f;\n                    float x1 = pb_cx + pb_w * 0.5f;\n                    float y1 = pb_cy + pb_h * 0.5f;\n\n                    FaceObject obj;\n                    obj.rect.x = x0;\n                    obj.rect.y = y0;\n                    obj.rect.width = x1 - x0 + 1;\n                    obj.rect.height = y1 - y0 + 1;\n                    obj.landmark[0].x = cx + (anchor_w + 1) * landmark.channel(0)[index];\n                    obj.landmark[0].y = cy + (anchor_h + 1) * landmark.channel(1)[index];\n                    
obj.landmark[1].x = cx + (anchor_w + 1) * landmark.channel(2)[index];\n                    obj.landmark[1].y = cy + (anchor_h + 1) * landmark.channel(3)[index];\n                    obj.landmark[2].x = cx + (anchor_w + 1) * landmark.channel(4)[index];\n                    obj.landmark[2].y = cy + (anchor_h + 1) * landmark.channel(5)[index];\n                    obj.landmark[3].x = cx + (anchor_w + 1) * landmark.channel(6)[index];\n                    obj.landmark[3].y = cy + (anchor_h + 1) * landmark.channel(7)[index];\n                    obj.landmark[4].x = cx + (anchor_w + 1) * landmark.channel(8)[index];\n                    obj.landmark[4].y = cy + (anchor_h + 1) * landmark.channel(9)[index];\n                    obj.prob = prob;\n\n                    faceobjects.push_back(obj);\n                }\n\n                anchor_x += feat_stride;\n            }\n\n            anchor_y += feat_stride;\n        }\n    }\n}\n\nstatic int detect_retinaface(const cv::Mat& bgr, std::vector<FaceObject>& faceobjects)\n{\n    ncnn::Net retinaface;\n\n    retinaface.opt.use_vulkan_compute = true;\n\n    // model is converted from\n    // https://github.com/deepinsight/insightface/tree/master/RetinaFace#retinaface-pretrained-models\n    // https://github.com/deepinsight/insightface/issues/669\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    //     retinaface.load_param(\"retinaface-R50.param\");\n    //     retinaface.load_model(\"retinaface-R50.bin\");\n    if (retinaface.load_param(\"mnet.25-opt.param\"))\n        exit(-1);\n    if (retinaface.load_model(\"mnet.25-opt.bin\"))\n        exit(-1);\n\n    const float prob_threshold = 0.8f;\n    const float nms_threshold = 0.4f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h);\n\n    ncnn::Extractor ex = retinaface.create_extractor();\n\n    ex.input(\"data\", in);\n\n    
std::vector<FaceObject> faceproposals;\n\n    // stride 32\n    {\n        ncnn::Mat score_blob, bbox_blob, landmark_blob;\n        ex.extract(\"face_rpn_cls_prob_reshape_stride32\", score_blob);\n        ex.extract(\"face_rpn_bbox_pred_stride32\", bbox_blob);\n        ex.extract(\"face_rpn_landmark_pred_stride32\", landmark_blob);\n\n        const int base_size = 16;\n        const int feat_stride = 32;\n        ncnn::Mat ratios(1);\n        ratios[0] = 1.f;\n        ncnn::Mat scales(2);\n        scales[0] = 32.f;\n        scales[1] = 16.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects32;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, landmark_blob, prob_threshold, faceobjects32);\n\n        faceproposals.insert(faceproposals.end(), faceobjects32.begin(), faceobjects32.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat score_blob, bbox_blob, landmark_blob;\n        ex.extract(\"face_rpn_cls_prob_reshape_stride16\", score_blob);\n        ex.extract(\"face_rpn_bbox_pred_stride16\", bbox_blob);\n        ex.extract(\"face_rpn_landmark_pred_stride16\", landmark_blob);\n\n        const int base_size = 16;\n        const int feat_stride = 16;\n        ncnn::Mat ratios(1);\n        ratios[0] = 1.f;\n        ncnn::Mat scales(2);\n        scales[0] = 8.f;\n        scales[1] = 4.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects16;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, landmark_blob, prob_threshold, faceobjects16);\n\n        faceproposals.insert(faceproposals.end(), faceobjects16.begin(), faceobjects16.end());\n    }\n\n    // stride 8\n    {\n        ncnn::Mat score_blob, bbox_blob, landmark_blob;\n        ex.extract(\"face_rpn_cls_prob_reshape_stride8\", score_blob);\n        ex.extract(\"face_rpn_bbox_pred_stride8\", bbox_blob);\n        
ex.extract(\"face_rpn_landmark_pred_stride8\", landmark_blob);\n\n        const int base_size = 16;\n        const int feat_stride = 8;\n        ncnn::Mat ratios(1);\n        ratios[0] = 1.f;\n        ncnn::Mat scales(2);\n        scales[0] = 2.f;\n        scales[1] = 1.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects8;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, landmark_blob, prob_threshold, faceobjects8);\n\n        faceproposals.insert(faceproposals.end(), faceobjects8.begin(), faceobjects8.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(faceproposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(faceproposals, picked, nms_threshold);\n\n    int face_count = picked.size();\n\n    faceobjects.resize(face_count);\n    for (int i = 0; i < face_count; i++)\n    {\n        faceobjects[i] = faceproposals[picked[i]];\n\n        // clip to image size\n        float x0 = faceobjects[i].rect.x;\n        float y0 = faceobjects[i].rect.y;\n        float x1 = x0 + faceobjects[i].rect.width;\n        float y1 = y0 + faceobjects[i].rect.height;\n\n        x0 = std::max(std::min(x0, (float)img_w - 1), 0.f);\n        y0 = std::max(std::min(y0, (float)img_h - 1), 0.f);\n        x1 = std::max(std::min(x1, (float)img_w - 1), 0.f);\n        y1 = std::max(std::min(y1, (float)img_h - 1), 0.f);\n\n        faceobjects[i].rect.x = x0;\n        faceobjects[i].rect.y = y0;\n        faceobjects[i].rect.width = x1 - x0;\n        faceobjects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_faceobjects(const cv::Mat& bgr, const std::vector<FaceObject>& faceobjects)\n{\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < faceobjects.size(); i++)\n    {\n        const FaceObject& obj = faceobjects[i];\n\n        fprintf(stderr, \"%.5f at %.2f %.2f %.2f x 
%.2f\\n\", obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(0, 255, 0));\n\n        cv::circle(image, obj.landmark[0], 2, cv::Scalar(0, 255, 255), -1);\n        cv::circle(image, obj.landmark[1], 2, cv::Scalar(0, 255, 255), -1);\n        cv::circle(image, obj.landmark[2], 2, cv::Scalar(0, 255, 255), -1);\n        cv::circle(image, obj.landmark[3], 2, cv::Scalar(0, 255, 255), -1);\n        cv::circle(image, obj.landmark[4], 2, cv::Scalar(0, 255, 255), -1);\n\n        char text[256];\n        sprintf(text, \"%.1f%%\", obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<FaceObject> faceobjects;\n    detect_retinaface(m, faceobjects);\n\n    draw_faceobjects(m, faceobjects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/rfcn.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#include <math.h>\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; 
j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic int detect_rfcn(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net rfcn;\n\n    rfcn.opt.use_vulkan_compute = true;\n\n    // original pretrained model from https://github.com/YuwenXiong/py-R-FCN\n    // https://github.com/YuwenXiong/py-R-FCN/blob/master/models/pascal_voc/ResNet-50/rfcn_end2end/test_agnostic.prototxt\n    // https://1drv.ms/u/s!AoN7vygOjLIQqUWHpY67oaC7mopf\n    // resnet50_rfcn_final.caffemodel\n    if (rfcn.load_param(\"rfcn_end2end.param\"))\n        exit(-1);\n    if (rfcn.load_model(\"rfcn_end2end.bin\"))\n        exit(-1);\n\n    const int target_size = 224;\n\n    const int max_per_image = 100;\n    const float confidence_thresh = 0.6f; // CONF_THRESH\n\n    const float nms_threshold = 0.3f; // NMS_THRESH\n\n    // scale to target detect size\n    int w = bgr.cols;\n    int h = bgr.rows;\n    float scale = 1.f;\n    if (w < h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, w, h);\n\n    const float mean_vals[3] = {102.9801f, 115.9465f, 122.7717f};\n    in.substract_mean_normalize(mean_vals, 0);\n\n    ncnn::Mat im_info(3);\n    im_info[0] = h;\n    im_info[1] = w;\n    im_info[2] = scale;\n\n    
// step1, extract feature and all rois\n    ncnn::Extractor ex1 = rfcn.create_extractor();\n\n    ex1.input(\"data\", in);\n    ex1.input(\"im_info\", im_info);\n\n    ncnn::Mat rfcn_cls;\n    ncnn::Mat rfcn_bbox;\n    ncnn::Mat rois; // all rois\n    ex1.extract(\"rfcn_cls\", rfcn_cls);\n    ex1.extract(\"rfcn_bbox\", rfcn_bbox);\n    ex1.extract(\"rois\", rois);\n\n    // step2, extract bbox and score for each roi\n    std::vector<std::vector<Object> > class_candidates;\n    for (int i = 0; i < rois.c; i++)\n    {\n        ncnn::Extractor ex2 = rfcn.create_extractor();\n\n        ncnn::Mat roi = rois.channel(i); // get single roi\n        ex2.input(\"rfcn_cls\", rfcn_cls);\n        ex2.input(\"rfcn_bbox\", rfcn_bbox);\n        ex2.input(\"rois\", roi);\n\n        ncnn::Mat bbox_pred;\n        ncnn::Mat cls_prob;\n        ex2.extract(\"bbox_pred\", bbox_pred);\n        ex2.extract(\"cls_prob\", cls_prob);\n\n        int num_class = cls_prob.w;\n        class_candidates.resize(num_class);\n\n        // find class id with highest score\n        // (k is used so the roi index i of the outer loop is not shadowed)\n        int label = 0;\n        float score = 0.f;\n        for (int k = 0; k < num_class; k++)\n        {\n            float class_score = cls_prob[k];\n            if (class_score > score)\n            {\n                label = k;\n                score = class_score;\n            }\n        }\n\n        // ignore background or low score\n        if (label == 0 || score <= confidence_thresh)\n            continue;\n\n        //         fprintf(stderr, \"%d = %f\\n\", label, score);\n\n        // unscale to image size\n        float x1 = roi[0] / scale;\n        float y1 = roi[1] / scale;\n        float x2 = roi[2] / scale;\n        float y2 = roi[3] / scale;\n\n        float pb_w = x2 - x1 + 1;\n        float pb_h = y2 - y1 + 1;\n\n        // apply bbox regression\n        float dx = bbox_pred[4];\n        float dy = bbox_pred[4 + 1];\n        float dw = bbox_pred[4 + 2];\n        float dh = bbox_pred[4 + 3];\n\n        float cx = 
x1 + pb_w * 0.5f;\n        float cy = y1 + pb_h * 0.5f;\n\n        float obj_cx = cx + pb_w * dx;\n        float obj_cy = cy + pb_h * dy;\n\n        float obj_w = pb_w * exp(dw);\n        float obj_h = pb_h * exp(dh);\n\n        float obj_x1 = obj_cx - obj_w * 0.5f;\n        float obj_y1 = obj_cy - obj_h * 0.5f;\n        float obj_x2 = obj_cx + obj_w * 0.5f;\n        float obj_y2 = obj_cy + obj_h * 0.5f;\n\n        // clip\n        obj_x1 = std::max(std::min(obj_x1, (float)(bgr.cols - 1)), 0.f);\n        obj_y1 = std::max(std::min(obj_y1, (float)(bgr.rows - 1)), 0.f);\n        obj_x2 = std::max(std::min(obj_x2, (float)(bgr.cols - 1)), 0.f);\n        obj_y2 = std::max(std::min(obj_y2, (float)(bgr.rows - 1)), 0.f);\n\n        // append object\n        Object obj;\n        obj.rect = cv::Rect_<float>(obj_x1, obj_y1, obj_x2 - obj_x1 + 1, obj_y2 - obj_y1 + 1);\n        obj.label = label;\n        obj.prob = score;\n\n        class_candidates[label].push_back(obj);\n    }\n\n    // post process\n    objects.clear();\n    for (int i = 0; i < (int)class_candidates.size(); i++)\n    {\n        std::vector<Object>& candidates = class_candidates[i];\n\n        qsort_descent_inplace(candidates);\n\n        std::vector<int> picked;\n        nms_sorted_bboxes(candidates, picked, nms_threshold);\n\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            int z = picked[j];\n            objects.push_back(candidates[z]);\n        }\n    }\n\n    qsort_descent_inplace(objects);\n\n    // keep only the max_per_image highest-scoring detections\n    if (max_per_image > 0 && max_per_image < (int)objects.size())\n    {\n        objects.resize(max_per_image);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                
        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_rfcn(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/rvm.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// ncnn model exported from https://github.com/PeterL1n/RobustVideoMatting\n//\n// import torch\n// from torch import nn\n// from model import MattingNetwork\n// from model.fast_guided_filter import FastGuidedFilterRefiner\n// from model.deep_guided_filter import DeepGuidedFilterRefiner\n//\n// class Model(nn.Module):\n//     def __init__(self):\n//         super().__init__()\n//\n//         self.rvm = MattingNetwork('mobilenetv3').eval()\n//         self.rvm.load_state_dict(torch.load('rvm_mobilenetv3.pth'))\n//\n//         self.refiner_deep = DeepGuidedFilterRefiner()\n//         self.refiner_fast = FastGuidedFilterRefiner()\n//\n//     def forward_first_frame(self, src):\n//         return self.rvm(src)\n//\n//     def forward(self, src, src_sm, r1, r2, r3, r4):\n//\n//         f1, f2, f3, f4 = self.rvm.backbone(src_sm)\n//         f4 = self.rvm.aspp(f4)\n//         hid, *rec = self.rvm.decoder(src_sm, f1, f2, f3, f4, r1, r2, r3, r4)\n//\n//         # downsample\n//         fgr_residual, pha = self.rvm.project_mat(hid).split([3, 1], dim=-3)\n//         fgr = fgr_residual + src_sm\n//\n//         # downsample + refiner_deep\n//         fgr_residual_deep, pha_deep = self.refiner_deep(src, src_sm, fgr_residual, pha, hid)\n//         fgr_deep = fgr_residual_deep + src\n//\n//         # downsample + refiner_fast\n//         fgr_residual_fast, pha_fast = self.refiner_fast(src, src_sm, fgr_residual, pha, hid)\n//         fgr_fast = fgr_residual_fast + src\n//\n//         # downsample + segmentation\n//         seg = self.rvm.project_seg(hid)\n//\n//         return fgr, pha, fgr_deep, pha_deep, fgr_fast, pha_fast, seg, *rec\n//\n// import pnnx\n//\n// model = Model().eval()\n//\n// x = torch.rand(1, 3, 512, 512)\n// x2 = torch.rand(1, 3, 256, 256)\n// x2_hr = torch.rand(1, 3, 1024, 1024)\n//\n// # generate feats via forward_first_frame, with different shapes\n// fgr, pha, r1, r2, r3, 
r4 = model.forward_first_frame(x)\n// fgr2, pha2, r12, r22, r32, r42 = model.forward_first_frame(x2)\n//\n// # export with dynamic shape\n// pnnx.export(model, \"rvm_mobilenetv3.pt\", (x, x, r1, r2, r3, r4), (x2_hr, x2, r12, r22, r32, r42))\n//\n// and then fix refiner_fast fp16 overflow issue in ncnn.param via appending 31=1 layer feat mask\n//\n// BinaryOp   div_58    2 1 401 399 402 0=3 31=1\n//\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n\nstatic int detect_rvm(const cv::Mat& bgr, cv::Mat& fgr, cv::Mat& pha, cv::Mat& seg)\n{\n    ncnn::Net rvm;\n\n    rvm.opt.use_vulkan_compute = true;\n\n    // https://github.com/nihui/ncnn-android-rvm/tree/master/app/src/main/assets\n    // you shall also change r1,r2,r3,r4 shape below when model changed\n    if (rvm.load_param(\"rvm_mobilenetv3.ncnn.param\"))\n        exit(-1);\n    if (rvm.load_model(\"rvm_mobilenetv3.ncnn.bin\"))\n        exit(-1);\n    // if (rvm.load_param(\"rvm_resnet50.ncnn.param\"))\n    //     exit(-1);\n    // if (rvm.load_model(\"rvm_resnet50.ncnn.bin\"))\n    //     exit(-1);\n\n    const int w = bgr.cols;\n    const int h = bgr.rows;\n\n    const int target_size = 512;\n    const int max_stride = 16;\n\n    bool refine_deep = true;\n    // bool refine_fast = true;\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n\n    ncnn::Mat in_pad;\n    ncnn::Mat in_small_pad;\n\n    int wpad = 0;\n    int hpad = 0;\n\n    bool downsample = std::max(w, h) > target_size;\n    if (downsample)\n    {\n        // letterbox pad to multiple of max_stride\n        int w2 = w;\n        int h2 = h;\n        float scale = 1.f;\n        if (w > h)\n        {\n            scale = (float)target_size / w;\n            w2 = target_size;\n            h2 = h2 * scale;\n        }\n        else\n        {\n            scale = 
(float)target_size / h;\n            h2 = target_size;\n            w2 = w2 * scale;\n        }\n\n        ncnn::Mat in_small = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, w, h, w2, h2);\n\n        // letterbox pad to a multiple of max_stride\n        int w2pad = (w2 + max_stride - 1) / max_stride * max_stride - w2;\n        int h2pad = (h2 + max_stride - 1) / max_stride * max_stride - h2;\n        ncnn::copy_make_border(in_small, in_small_pad, h2pad / 2, h2pad - h2pad / 2, w2pad / 2, w2pad - w2pad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n        in_small_pad.substract_mean_normalize(0, norm_vals);\n\n        int w3 = w;\n        int h3 = h;\n        if (w > h)\n        {\n            w3 = w;\n            h3 = in_small_pad.h / scale;\n            wpad = 0;\n            hpad = h3 - h;\n        }\n        else\n        {\n            h3 = h;\n            w3 = in_small_pad.w / scale;\n            wpad = w3 - w;\n            hpad = 0;\n        }\n\n        ncnn::Mat in = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, w, h);\n\n        ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n        in_pad.substract_mean_normalize(0, norm_vals);\n    }\n    else\n    {\n        ncnn::Mat in = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, w, h);\n\n        // letterbox pad to a multiple of max_stride\n        wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n        hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n        ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n        in_pad.substract_mean_normalize(0, norm_vals);\n\n        in_small_pad = in_pad;\n    }\n\n    // rvm_mobilenetv3\n    ncnn::Mat r1(in_small_pad.w / 2, in_small_pad.h / 2, 16);\n    ncnn::Mat r2(in_small_pad.w / 4, in_small_pad.h / 4, 20);\n    ncnn::Mat r3(in_small_pad.w / 8, in_small_pad.h / 8, 
40);\n    ncnn::Mat r4(in_small_pad.w / 16, in_small_pad.h / 16, 64);\n\n    // rvm_resnet50\n    // ncnn::Mat r1(in_small_pad.w / 2, in_small_pad.h / 2, 16);\n    // ncnn::Mat r2(in_small_pad.w / 4, in_small_pad.h / 4, 32);\n    // ncnn::Mat r3(in_small_pad.w / 8, in_small_pad.h / 8, 64);\n    // ncnn::Mat r4(in_small_pad.w / 16, in_small_pad.h / 16, 128);\n\n    r1.fill(0.f);\n    r2.fill(0.f);\n    r3.fill(0.f);\n    r4.fill(0.f);\n\n    ncnn::Extractor ex = rvm.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n    ex.input(\"in1\", in_small_pad);\n\n    ex.input(\"in2\", r1);\n    ex.input(\"in3\", r2);\n    ex.input(\"in4\", r3);\n    ex.input(\"in5\", r4);\n\n    ncnn::Mat out_fgr;\n    ncnn::Mat out_pha;\n\n    if (downsample)\n    {\n        if (refine_deep)\n        {\n            // downsample + refine deep\n            ex.extract(\"out2\", out_fgr);\n            ex.extract(\"out3\", out_pha);\n        }\n        else // if (refine_fast)\n        {\n            // downsample + refine fast\n            ex.extract(\"out4\", out_fgr);\n            ex.extract(\"out5\", out_pha);\n        }\n    }\n    else\n    {\n        // no downsample\n        ex.extract(\"out0\", out_fgr);\n        ex.extract(\"out1\", out_pha);\n    }\n\n    ncnn::Mat out_seg;\n\n    // segmentation\n    ex.extract(\"out6\", out_seg);\n\n    // feats\n    ex.extract(\"out7\", r1);\n    ex.extract(\"out8\", r2);\n    ex.extract(\"out9\", r3);\n    ex.extract(\"out10\", r4);\n\n    const float denorm_vals[3] = {255.f, 255.f, 255.f};\n\n    out_fgr.substract_mean_normalize(0, denorm_vals);\n    fgr.create(out_fgr.h, out_fgr.w, CV_8UC3);\n    out_fgr.to_pixels(fgr.data, ncnn::Mat::PIXEL_RGB2BGR);\n\n    out_pha.substract_mean_normalize(0, denorm_vals);\n    pha.create(out_pha.h, out_pha.w, CV_8UC1);\n    out_pha.to_pixels(pha.data, ncnn::Mat::PIXEL_GRAY);\n\n    out_seg.substract_mean_normalize(0, denorm_vals);\n    seg.create(in_pad.h, in_pad.w, CV_8UC1);\n    
out_seg.to_pixels_resize(seg.data, ncnn::Mat::PIXEL_GRAY, in_pad.w, in_pad.h);\n\n    // cut letterbox pad\n    fgr = fgr(cv::Rect(wpad / 2, hpad / 2, w, h));\n    pha = pha(cv::Rect(wpad / 2, hpad / 2, w, h));\n    seg = seg(cv::Rect(wpad / 2, hpad / 2, w, h));\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const cv::Mat& fgr, const cv::Mat& pha, const cv::Mat& seg)\n{\n    const int w = bgr.cols;\n    const int h = bgr.rows;\n\n    // composite\n    cv::Mat comp(h, w, CV_8UC3);\n    for (int y = 0; y < h; y++)\n    {\n        const uchar* pf = fgr.ptr<const uchar>(y);\n        const uchar* pa = pha.ptr<const uchar>(y);\n        uchar* p = comp.ptr<uchar>(y);\n        for (int x = 0; x < w; x++)\n        {\n            const float alpha = pa[0] / 255.f;\n            p[0] = cv::saturate_cast<uchar>(pf[0] * alpha + (1 - alpha) * 155);\n            p[1] = cv::saturate_cast<uchar>(pf[1] * alpha + (1 - alpha) * 255);\n            p[2] = cv::saturate_cast<uchar>(pf[2] * alpha + (1 - alpha) * 120);\n            pf += 3;\n            pa += 1;\n            p += 3;\n        }\n    }\n\n    // composite seg\n    cv::Mat comp_seg(h, w, CV_8UC3);\n    for (int y = 0; y < h; y++)\n    {\n        const uchar* pb = bgr.ptr<const uchar>(y);\n        const uchar* ps = seg.ptr<const uchar>(y);\n        uchar* p = comp_seg.ptr<uchar>(y);\n        for (int x = 0; x < w; x++)\n        {\n            const float alpha = ps[0] / 255.f;\n            p[0] = cv::saturate_cast<uchar>(pb[0] * alpha + (1 - alpha) * 155);\n            p[1] = cv::saturate_cast<uchar>(pb[1] * alpha + (1 - alpha) * 255);\n            p[2] = cv::saturate_cast<uchar>(pb[2] * alpha + (1 - alpha) * 120);\n            pb += 3;\n            ps += 1;\n            p += 3;\n        }\n    }\n\n    cv::imshow(\"comp\", comp);\n    cv::imshow(\"comp_seg\", comp_seg);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s 
[imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    cv::Mat fgr;\n    cv::Mat pha;\n    cv::Mat seg;\n    detect_rvm(m, fgr, pha, seg);\n\n    draw_objects(m, fgr, pha, seg);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/scrfd.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct FaceObject\n{\n    cv::Rect_<float> rect;\n    float prob;\n};\n\nstatic inline float intersection_area(const FaceObject& a, const FaceObject& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<FaceObject>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<FaceObject>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<FaceObject>& faceobjects, std::vector<int>& picked, float nms_threshold)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const FaceObject& a = faceobjects[i];\n\n        
int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const FaceObject& b = faceobjects[picked[j]];\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            //             float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\n// insightface/detection/scrfd/mmdet/core/anchor/anchor_generator.py gen_single_level_base_anchors()\nstatic ncnn::Mat generate_anchors(int base_size, const ncnn::Mat& ratios, const ncnn::Mat& scales)\n{\n    int num_ratio = ratios.w;\n    int num_scale = scales.w;\n\n    ncnn::Mat anchors;\n    anchors.create(4, num_ratio * num_scale);\n\n    const float cx = 0;\n    const float cy = 0;\n\n    for (int i = 0; i < num_ratio; i++)\n    {\n        float ar = ratios[i];\n\n        int r_w = round(base_size / sqrt(ar));\n        int r_h = round(r_w * ar); //round(base_size * sqrt(ar));\n\n        for (int j = 0; j < num_scale; j++)\n        {\n            float scale = scales[j];\n\n            float rs_w = r_w * scale;\n            float rs_h = r_h * scale;\n\n            float* anchor = anchors.row(i * num_scale + j);\n\n            anchor[0] = cx - rs_w * 0.5f;\n            anchor[1] = cy - rs_h * 0.5f;\n            anchor[2] = cx + rs_w * 0.5f;\n            anchor[3] = cy + rs_h * 0.5f;\n        }\n    }\n\n    return anchors;\n}\n\nstatic void generate_proposals(const ncnn::Mat& anchors, int feat_stride, const ncnn::Mat& score_blob, const ncnn::Mat& bbox_blob, float prob_threshold, std::vector<FaceObject>& faceobjects)\n{\n    int w = score_blob.w;\n    int h = score_blob.h;\n\n    // generate face proposal from bbox deltas and shifted anchors\n    const int num_anchors = anchors.h;\n\n    for (int q = 0; q < num_anchors; q++)\n    {\n 
       const float* anchor = anchors.row(q);\n\n        const ncnn::Mat score = score_blob.channel(q);\n        const ncnn::Mat bbox = bbox_blob.channel_range(q * 4, 4);\n\n        // shifted anchor\n        float anchor_y = anchor[1];\n\n        float anchor_w = anchor[2] - anchor[0];\n        float anchor_h = anchor[3] - anchor[1];\n\n        for (int i = 0; i < h; i++)\n        {\n            float anchor_x = anchor[0];\n\n            for (int j = 0; j < w; j++)\n            {\n                int index = i * w + j;\n\n                float prob = score[index];\n\n                if (prob >= prob_threshold)\n                {\n                    // insightface/detection/scrfd/mmdet/models/dense_heads/scrfd_head.py _get_bboxes_single()\n                    float dx = bbox.channel(0)[index] * feat_stride;\n                    float dy = bbox.channel(1)[index] * feat_stride;\n                    float dw = bbox.channel(2)[index] * feat_stride;\n                    float dh = bbox.channel(3)[index] * feat_stride;\n\n                    // insightface/detection/scrfd/mmdet/core/bbox/transforms.py distance2bbox()\n                    float cx = anchor_x + anchor_w * 0.5f;\n                    float cy = anchor_y + anchor_h * 0.5f;\n\n                    float x0 = cx - dx;\n                    float y0 = cy - dy;\n                    float x1 = cx + dw;\n                    float y1 = cy + dh;\n\n                    FaceObject obj;\n                    obj.rect.x = x0;\n                    obj.rect.y = y0;\n                    obj.rect.width = x1 - x0 + 1;\n                    obj.rect.height = y1 - y0 + 1;\n                    obj.prob = prob;\n\n                    faceobjects.push_back(obj);\n                }\n\n                anchor_x += feat_stride;\n            }\n\n            anchor_y += feat_stride;\n        }\n    }\n}\n\nstatic int detect_scrfd(const cv::Mat& bgr, std::vector<FaceObject>& faceobjects)\n{\n    ncnn::Net scrfd;\n\n    
scrfd.opt.use_vulkan_compute = true;\n\n    // model is converted from\n    // https://github.com/deepinsight/insightface/tree/master/detection/scrfd\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (scrfd.load_param(\"scrfd_500m-opt2.param\"))\n        exit(-1);\n    if (scrfd.load_model(\"scrfd_500m-opt2.bin\"))\n        exit(-1);\n\n    int width = bgr.cols;\n    int height = bgr.rows;\n\n    // insightface/detection/scrfd/configs/scrfd/scrfd_500m.py\n    const int target_size = 640;\n    const float prob_threshold = 0.3f;\n    const float nms_threshold = 0.45f;\n\n    // resize so that the longer side equals target_size\n    int w = width;\n    int h = height;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, width, height, w, h);\n\n    // pad width and height up to a multiple of 32\n    int wpad = (w + 31) / 32 * 32 - w;\n    int hpad = (h + 31) / 32 * 32 - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 0.f);\n\n    const float mean_vals[3] = {127.5f, 127.5f, 127.5f};\n    const float norm_vals[3] = {1 / 128.f, 1 / 128.f, 1 / 128.f};\n    in_pad.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = scrfd.create_extractor();\n\n    ex.input(\"input.1\", in_pad);\n\n    std::vector<FaceObject> faceproposals;\n\n    // stride 8\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"412\", score_blob);\n        ex.extract(\"415\", bbox_blob);\n\n        const int base_size = 16;\n        const int feat_stride = 8;\n        ncnn::Mat ratios(1);\n        ratios[0] = 1.f;\n        ncnn::Mat scales(2);\n        scales[0] = 1.f;\n        scales[1] 
= 2.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects8;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects8);\n\n        faceproposals.insert(faceproposals.end(), faceobjects8.begin(), faceobjects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"474\", score_blob);\n        ex.extract(\"477\", bbox_blob);\n\n        const int base_size = 64;\n        const int feat_stride = 16;\n        ncnn::Mat ratios(1);\n        ratios[0] = 1.f;\n        ncnn::Mat scales(2);\n        scales[0] = 1.f;\n        scales[1] = 2.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects16;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects16);\n\n        faceproposals.insert(faceproposals.end(), faceobjects16.begin(), faceobjects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"536\", score_blob);\n        ex.extract(\"539\", bbox_blob);\n\n        const int base_size = 256;\n        const int feat_stride = 32;\n        ncnn::Mat ratios(1);\n        ratios[0] = 1.f;\n        ncnn::Mat scales(2);\n        scales[0] = 1.f;\n        scales[1] = 2.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects32;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects32);\n\n        faceproposals.insert(faceproposals.end(), faceobjects32.begin(), faceobjects32.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(faceproposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(faceproposals, picked, nms_threshold);\n\n    int face_count = 
picked.size();\n\n    faceobjects.resize(face_count);\n    for (int i = 0; i < face_count; i++)\n    {\n        faceobjects[i] = faceproposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (faceobjects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (faceobjects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (faceobjects[i].rect.x + faceobjects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (faceobjects[i].rect.y + faceobjects[i].rect.height - (hpad / 2)) / scale;\n\n        x0 = std::max(std::min(x0, (float)width - 1), 0.f);\n        y0 = std::max(std::min(y0, (float)height - 1), 0.f);\n        x1 = std::max(std::min(x1, (float)width - 1), 0.f);\n        y1 = std::max(std::min(y1, (float)height - 1), 0.f);\n\n        faceobjects[i].rect.x = x0;\n        faceobjects[i].rect.y = y0;\n        faceobjects[i].rect.width = x1 - x0;\n        faceobjects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_faceobjects(const cv::Mat& bgr, const std::vector<FaceObject>& faceobjects)\n{\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < faceobjects.size(); i++)\n    {\n        const FaceObject& obj = faceobjects[i];\n\n        fprintf(stderr, \"%.5f at %.2f %.2f %.2f x %.2f\\n\", obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(0, 255, 0));\n\n        char text[256];\n        sprintf(text, \"%.1f%%\", obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                   
   cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<FaceObject> faceobjects;\n    detect_scrfd(m, faceobjects);\n\n    draw_faceobjects(m, faceobjects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/scrfd_crowdhuman.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct FaceObject\n{\n    cv::Rect_<float> rect;\n    float prob;\n};\n\nstatic inline float intersection_area(const FaceObject& a, const FaceObject& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<FaceObject>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<FaceObject>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<FaceObject>& faceobjects, std::vector<int>& picked, float nms_threshold)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const FaceObject& a = faceobjects[i];\n\n        
int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const FaceObject& b = faceobjects[picked[j]];\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            //             float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\n// insightface/detection/scrfd/mmdet/core/anchor/anchor_generator.py gen_single_level_base_anchors()\nstatic ncnn::Mat generate_anchors(int base_size, const ncnn::Mat& ratios, const ncnn::Mat& scales)\n{\n    int num_ratio = ratios.w;\n    int num_scale = scales.w;\n\n    ncnn::Mat anchors;\n    anchors.create(4, num_ratio * num_scale);\n\n    const float cx = 0;\n    const float cy = 0;\n\n    for (int i = 0; i < num_ratio; i++)\n    {\n        float ar = ratios[i];\n\n        int r_w = round(base_size / sqrt(ar));\n        int r_h = round(r_w * ar); //round(base_size * sqrt(ar));\n\n        for (int j = 0; j < num_scale; j++)\n        {\n            float scale = scales[j];\n\n            float rs_w = r_w * scale;\n            float rs_h = r_h * scale;\n\n            float* anchor = anchors.row(i * num_scale + j);\n\n            anchor[0] = cx - rs_w * 0.5f;\n            anchor[1] = cy - rs_h * 0.5f;\n            anchor[2] = cx + rs_w * 0.5f;\n            anchor[3] = cy + rs_h * 0.5f;\n        }\n    }\n\n    return anchors;\n}\n\nstatic void generate_proposals(const ncnn::Mat& anchors, int feat_stride, const ncnn::Mat& score_blob, const ncnn::Mat& bbox_blob, float prob_threshold, std::vector<FaceObject>& faceobjects)\n{\n    int w = score_blob.w;\n    int h = score_blob.h;\n\n    // generate face proposal from bbox deltas and shifted anchors\n    const int num_anchors = anchors.h;\n\n    for (int q = 0; q < num_anchors; q++)\n    {\n 
       const float* anchor = anchors.row(q);\n\n        const ncnn::Mat score = score_blob.channel(q);\n        const ncnn::Mat bbox = bbox_blob.channel_range(q * 4, 4);\n\n        // shifted anchor\n        float anchor_y = anchor[1];\n\n        float anchor_w = anchor[2] - anchor[0];\n        float anchor_h = anchor[3] - anchor[1];\n\n        for (int i = 0; i < h; i++)\n        {\n            float anchor_x = anchor[0];\n\n            for (int j = 0; j < w; j++)\n            {\n                int index = i * w + j;\n\n                float prob = score[index];\n\n                if (prob >= prob_threshold)\n                {\n                    // insightface/detection/scrfd/mmdet/models/dense_heads/scrfd_head.py _get_bboxes_single()\n                    float dx = bbox.channel(0)[index] * feat_stride;\n                    float dy = bbox.channel(1)[index] * feat_stride;\n                    float dw = bbox.channel(2)[index] * feat_stride;\n                    float dh = bbox.channel(3)[index] * feat_stride;\n\n                    // insightface/detection/scrfd/mmdet/core/bbox/transforms.py distance2bbox()\n                    float cx = anchor_x + anchor_w * 0.5f;\n                    float cy = anchor_y + anchor_h * 0.5f;\n\n                    float x0 = cx - dx;\n                    float y0 = cy - dy;\n                    float x1 = cx + dw;\n                    float y1 = cy + dh;\n\n                    FaceObject obj;\n                    obj.rect.x = x0;\n                    obj.rect.y = y0;\n                    obj.rect.width = x1 - x0 + 1;\n                    obj.rect.height = y1 - y0 + 1;\n                    obj.prob = prob;\n\n                    faceobjects.push_back(obj);\n                }\n\n                anchor_x += feat_stride;\n            }\n\n            anchor_y += feat_stride;\n        }\n    }\n}\n\nstatic int detect_scrfd(const cv::Mat& bgr, std::vector<FaceObject>& faceobjects)\n{\n    ncnn::Net scrfd;\n\n    
scrfd.opt.use_vulkan_compute = true;\n\n    // InsightFace does not provide a trained scrfd_crowdhuman model,\n    // but I have one for detecting cat faces; you can try it here:\n    // https://drive.google.com/file/d/1JogkKa0f_09HkENbCnXy9hRYxm35wKTn\n\n    if (scrfd.load_param(\"scrfd_crowdhuman.param\"))\n        exit(-1);\n    if (scrfd.load_model(\"scrfd_crowdhuman.bin\"))\n        exit(-1);\n\n    int width = bgr.cols;\n    int height = bgr.rows;\n\n    const int target_size = 640;\n    const float prob_threshold = 0.3f;\n    const float nms_threshold = 0.45f;\n\n    // resize so the longer side equals target_size, keeping the aspect ratio\n    int w = width;\n    int h = height;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, width, height, w, h);\n\n    // pad width and height up to the next multiple of 32\n    int wpad = (w + 31) / 32 * 32 - w;\n    int hpad = (h + 31) / 32 * 32 - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 0.f);\n\n    const float mean_vals[3] = {127.5f, 127.5f, 127.5f};\n    const float norm_vals[3] = {1 / 128.f, 1 / 128.f, 1 / 128.f};\n    in_pad.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = scrfd.create_extractor();\n\n    ex.input(\"input.1\", in_pad);\n\n    std::vector<FaceObject> faceproposals;\n\n    // stride 8\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"490\", score_blob);\n        ex.extract(\"493\", bbox_blob);\n\n        const int base_size = 8;\n        const int feat_stride = 8;\n        ncnn::Mat ratios(1);\n        ratios[0] = 2.f;\n        ncnn::Mat scales(1);\n        scales[0] = 3.f;\n        ncnn::Mat anchors = 
generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects8;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects8);\n\n        faceproposals.insert(faceproposals.end(), faceobjects8.begin(), faceobjects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"510\", score_blob);\n        ex.extract(\"513\", bbox_blob);\n\n        const int base_size = 16;\n        const int feat_stride = 16;\n        ncnn::Mat ratios(1);\n        ratios[0] = 2.f;\n        ncnn::Mat scales(1);\n        scales[0] = 3.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects16;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects16);\n\n        faceproposals.insert(faceproposals.end(), faceobjects16.begin(), faceobjects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"530\", score_blob);\n        ex.extract(\"533\", bbox_blob);\n\n        const int base_size = 32;\n        const int feat_stride = 32;\n        ncnn::Mat ratios(1);\n        ratios[0] = 2.f;\n        ncnn::Mat scales(1);\n        scales[0] = 3.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects32;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects32);\n\n        faceproposals.insert(faceproposals.end(), faceobjects32.begin(), faceobjects32.end());\n    }\n\n    // stride 64\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"550\", score_blob);\n        ex.extract(\"553\", bbox_blob);\n\n        const int base_size = 64;\n        const int feat_stride = 64;\n        ncnn::Mat ratios(1);\n        ratios[0] = 2.f;\n        ncnn::Mat scales(1);\n        scales[0] = 3.f;\n      
  ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects64;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects64);\n\n        faceproposals.insert(faceproposals.end(), faceobjects64.begin(), faceobjects64.end());\n    }\n\n    // stride 128\n    {\n        ncnn::Mat score_blob, bbox_blob;\n        ex.extract(\"570\", score_blob);\n        ex.extract(\"573\", bbox_blob);\n\n        const int base_size = 128;\n        const int feat_stride = 128;\n        ncnn::Mat ratios(1);\n        ratios[0] = 2.f;\n        ncnn::Mat scales(1);\n        scales[0] = 3.f;\n        ncnn::Mat anchors = generate_anchors(base_size, ratios, scales);\n\n        std::vector<FaceObject> faceobjects128;\n        generate_proposals(anchors, feat_stride, score_blob, bbox_blob, prob_threshold, faceobjects128);\n\n        faceproposals.insert(faceproposals.end(), faceobjects128.begin(), faceobjects128.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(faceproposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(faceproposals, picked, nms_threshold);\n\n    int face_count = picked.size();\n\n    faceobjects.resize(face_count);\n    for (int i = 0; i < face_count; i++)\n    {\n        faceobjects[i] = faceproposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (faceobjects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (faceobjects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (faceobjects[i].rect.x + faceobjects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (faceobjects[i].rect.y + faceobjects[i].rect.height - (hpad / 2)) / scale;\n\n        x0 = std::max(std::min(x0, (float)width - 1), 0.f);\n        y0 = std::max(std::min(y0, (float)height - 1), 0.f);\n        x1 = std::max(std::min(x1, (float)width - 1), 0.f);\n        y1 = 
std::max(std::min(y1, (float)height - 1), 0.f);\n\n        faceobjects[i].rect.x = x0;\n        faceobjects[i].rect.y = y0;\n        faceobjects[i].rect.width = x1 - x0;\n        faceobjects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_faceobjects(const cv::Mat& bgr, const std::vector<FaceObject>& faceobjects)\n{\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < faceobjects.size(); i++)\n    {\n        const FaceObject& obj = faceobjects[i];\n\n        fprintf(stderr, \"%.5f at %.2f %.2f %.2f x %.2f\\n\", obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(0, 255, 0));\n\n        char text[256];\n        sprintf(text, \"%.1f%%\", obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<FaceObject> faceobjects;\n    detect_scrfd(m, faceobjects);\n\n    draw_faceobjects(m, 
faceobjects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/shufflenetv2.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#include <algorithm>\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstatic int detect_shufflenetv2(const cv::Mat& bgr, std::vector<float>& cls_scores)\n{\n    ncnn::Net shufflenetv2;\n\n    shufflenetv2.opt.use_vulkan_compute = true;\n\n    // https://github.com/miaow1988/ShuffleNet_V2_pytorch_caffe\n    // models can be downloaded from https://github.com/miaow1988/ShuffleNet_V2_pytorch_caffe/releases\n    if (shufflenetv2.load_param(\"shufflenet_v2_x0.5.param\"))\n        exit(-1);\n    if (shufflenetv2.load_model(\"shufflenet_v2_x0.5.bin\"))\n        exit(-1);\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, 224, 224);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = shufflenetv2.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"fc\", out);\n\n    // manually call softmax on the fc output\n    // convert result into probability\n    // skip if your model already has softmax operation\n    {\n        ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n        ncnn::ParamDict pd;\n        softmax->load_param(pd);\n\n        softmax->forward_inplace(out, shufflenetv2.opt);\n\n        delete softmax;\n    }\n\n    out = out.reshape(out.w * out.h * out.c);\n\n    cls_scores.resize(out.w);\n    for (int j = 0; j < out.w; j++)\n    {\n        cls_scores[j] = out[j];\n    }\n\n    return 0;\n}\n\nstatic int print_topk(const std::vector<float>& cls_scores, int topk)\n{\n    // partial sort topk with index\n    int size = cls_scores.size();\n    std::vector<std::pair<float, int> > vec;\n    vec.resize(size);\n    for (int i = 0; i 
< size; i++)\n    {\n        vec[i] = std::make_pair(cls_scores[i], i);\n    }\n\n    std::partial_sort(vec.begin(), vec.begin() + topk, vec.end(),\n                      std::greater<std::pair<float, int> >());\n\n    // print topk and score\n    for (int i = 0; i < topk; i++)\n    {\n        float score = vec[i].first;\n        int index = vec[i].second;\n        fprintf(stderr, \"%d = %f\\n\", index, score);\n    }\n\n    return 0;\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<float> cls_scores;\n    detect_shufflenetv2(m, cls_scores);\n\n    print_topk(cls_scores, 3);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/simplepose.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#include <algorithm>\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct KeyPoint\n{\n    cv::Point2f p;\n    float prob;\n};\n\nstatic int detect_posenet(const cv::Mat& bgr, std::vector<KeyPoint>& keypoints)\n{\n    ncnn::Net posenet;\n\n    posenet.opt.use_vulkan_compute = true;\n\n    // the simple baseline human pose estimation from gluon-cv\n    // https://gluon-cv.mxnet.io/build/examples_pose/demo_simple_pose.html\n    // mxnet model exported via\n    //      pose_net.hybridize()\n    //      pose_net.export('pose')\n    // then mxnet2ncnn\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (posenet.load_param(\"pose.param\"))\n        exit(-1);\n    if (posenet.load_model(\"pose.bin\"))\n        exit(-1);\n\n    int w = bgr.cols;\n    int h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, w, h, 192, 256);\n\n    // transforms.ToTensor(),\n    // transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),\n    // R' = (R / 255 - 0.485) / 0.229 = (R - 0.485 * 255) / 0.229 / 255\n    // G' = (G / 255 - 0.456) / 0.224 = (G - 0.456 * 255) / 0.224 / 255\n    // B' = (B / 255 - 0.406) / 0.225 = (B - 0.406 * 255) / 0.225 / 255\n    const float mean_vals[3] = {0.485f * 255.f, 0.456f * 255.f, 0.406f * 255.f};\n    const float norm_vals[3] = {1 / 0.229f / 255.f, 1 / 0.224f / 255.f, 1 / 0.225f / 255.f};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = posenet.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"conv3_fwd\", out);\n\n    // resolve point from heatmap\n    keypoints.clear();\n    for (int p = 0; p < 
out.c; p++)\n    {\n        const ncnn::Mat m = out.channel(p);\n\n        float max_prob = 0.f;\n        int max_x = 0;\n        int max_y = 0;\n        for (int y = 0; y < out.h; y++)\n        {\n            const float* ptr = m.row(y);\n            for (int x = 0; x < out.w; x++)\n            {\n                float prob = ptr[x];\n                if (prob > max_prob)\n                {\n                    max_prob = prob;\n                    max_x = x;\n                    max_y = y;\n                }\n            }\n        }\n\n        KeyPoint keypoint;\n        keypoint.p = cv::Point2f(max_x * w / (float)out.w, max_y * h / (float)out.h);\n        keypoint.prob = max_prob;\n\n        keypoints.push_back(keypoint);\n    }\n\n    return 0;\n}\n\nstatic void draw_pose(const cv::Mat& bgr, const std::vector<KeyPoint>& keypoints)\n{\n    cv::Mat image = bgr.clone();\n\n    // draw bone\n    static const int joint_pairs[16][2] = {\n        {0, 1}, {1, 3}, {0, 2}, {2, 4}, {5, 6}, {5, 7}, {7, 9}, {6, 8}, {8, 10}, {5, 11}, {6, 12}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}\n    };\n\n    for (int i = 0; i < 16; i++)\n    {\n        const KeyPoint& p1 = keypoints[joint_pairs[i][0]];\n        const KeyPoint& p2 = keypoints[joint_pairs[i][1]];\n\n        if (p1.prob < 0.2f || p2.prob < 0.2f)\n            continue;\n\n        cv::line(image, p1.p, p2.p, cv::Scalar(255, 0, 0), 2);\n    }\n\n    // draw joint\n    for (size_t i = 0; i < keypoints.size(); i++)\n    {\n        const KeyPoint& keypoint = keypoints[i];\n\n        fprintf(stderr, \"%.2f %.2f = %.5f\\n\", keypoint.p.x, keypoint.p.y, keypoint.prob);\n\n        if (keypoint.prob < 0.2f)\n            continue;\n\n        cv::circle(image, keypoint.p, 3, cv::Scalar(0, 255, 0), -1);\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n 
   const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<KeyPoint> keypoints;\n    detect_posenet(m, keypoints);\n\n    draw_pose(m, keypoints);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/squeezencnn/README.md",
    "content": "The squeezenet android example project has been moved to https://github.com/nihui/ncnn-android-squeezenet\n"
  },
  {
    "path": "examples/squeezenet.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#include <algorithm>\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstatic int detect_squeezenet(const cv::Mat& bgr, std::vector<float>& cls_scores)\n{\n    ncnn::Net squeezenet;\n\n    squeezenet.opt.use_vulkan_compute = true;\n\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (squeezenet.load_param(\"squeezenet_v1.1.param\"))\n        exit(-1);\n    if (squeezenet.load_model(\"squeezenet_v1.1.bin\"))\n        exit(-1);\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, 227, 227);\n\n    const float mean_vals[3] = {104.f, 117.f, 123.f};\n    in.substract_mean_normalize(mean_vals, 0);\n\n    ncnn::Extractor ex = squeezenet.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"prob\", out);\n\n    cls_scores.resize(out.w);\n    for (int j = 0; j < out.w; j++)\n    {\n        cls_scores[j] = out[j];\n    }\n\n    return 0;\n}\n\nstatic int print_topk(const std::vector<float>& cls_scores, int topk)\n{\n    // partial sort topk with index\n    int size = cls_scores.size();\n    std::vector<std::pair<float, int> > vec;\n    vec.resize(size);\n    for (int i = 0; i < size; i++)\n    {\n        vec[i] = std::make_pair(cls_scores[i], i);\n    }\n\n    std::partial_sort(vec.begin(), vec.begin() + topk, vec.end(),\n                      std::greater<std::pair<float, int> >());\n\n    // print topk and score\n    for (int i = 0; i < topk; i++)\n    {\n        float score = vec[i].first;\n        int index = vec[i].second;\n        fprintf(stderr, \"%d = %f\\n\", index, score);\n    }\n\n    return 0;\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s 
[imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<float> cls_scores;\n    detect_squeezenet(m, cls_scores);\n\n    print_topk(cls_scores, 3);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/squeezenet_c_api.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"c_api.h\"\n\n#include <algorithm>\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstatic int detect_squeezenet(const cv::Mat& bgr, std::vector<float>& cls_scores)\n{\n    ncnn_net_t squeezenet = ncnn_net_create();\n\n    ncnn_option_t opt = ncnn_option_create();\n    ncnn_option_set_use_vulkan_compute(opt, 1);\n\n    ncnn_net_set_option(squeezenet, opt);\n\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (ncnn_net_load_param(squeezenet, \"squeezenet_v1.1.param\"))\n        exit(-1);\n    if (ncnn_net_load_model(squeezenet, \"squeezenet_v1.1.bin\"))\n        exit(-1);\n\n    ncnn_mat_t in = ncnn_mat_from_pixels_resize(bgr.data, NCNN_MAT_PIXEL_BGR, bgr.cols, bgr.rows, bgr.cols * 3, 227, 227, NULL);\n\n    const float mean_vals[3] = {104.f, 117.f, 123.f};\n    ncnn_mat_substract_mean_normalize(in, mean_vals, 0);\n\n    ncnn_extractor_t ex = ncnn_extractor_create(squeezenet);\n\n    ncnn_extractor_input(ex, \"data\", in);\n\n    ncnn_mat_t out;\n    ncnn_extractor_extract(ex, \"prob\", &out);\n\n    const int out_w = ncnn_mat_get_w(out);\n    const float* out_data = (const float*)ncnn_mat_get_data(out);\n\n    cls_scores.resize(out_w);\n    for (int j = 0; j < out_w; j++)\n    {\n        cls_scores[j] = out_data[j];\n    }\n\n    ncnn_mat_destroy(in);\n    ncnn_mat_destroy(out);\n\n    ncnn_extractor_destroy(ex);\n\n    ncnn_option_destroy(opt);\n\n    ncnn_net_destroy(squeezenet);\n\n    return 0;\n}\n\nstatic int print_topk(const std::vector<float>& cls_scores, int topk)\n{\n    // partial sort topk with index\n    int size = cls_scores.size();\n    std::vector<std::pair<float, int> > vec;\n    vec.resize(size);\n    for (int i = 0; i < size; i++)\n    {\n        vec[i] = 
std::make_pair(cls_scores[i], i);\n    }\n\n    std::partial_sort(vec.begin(), vec.begin() + topk, vec.end(),\n                      std::greater<std::pair<float, int> >());\n\n    // print topk and score\n    for (int i = 0; i < topk; i++)\n    {\n        float score = vec[i].first;\n        int index = vec[i].second;\n        fprintf(stderr, \"%d = %f\\n\", index, score);\n    }\n\n    return 0;\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<float> cls_scores;\n    detect_squeezenet(m, cls_scores);\n\n    print_topk(cls_scores, 3);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/squeezenet_v1.1.param",
    "content": "7767517\n75 83\nInput            data             0 1 data 0=227 1=227 2=3\nConvolution      conv1            1 1 data conv1 0=64 1=3 2=1 3=2 4=0 5=1 6=1728\nReLU             relu_conv1       1 1 conv1 conv1_relu_conv1 0=0.000000\nPooling          pool1            1 1 conv1_relu_conv1 pool1 0=0 1=3 2=2 3=0 4=0\nConvolution      fire2/squeeze1x1 1 1 pool1 fire2/squeeze1x1 0=16 1=1 2=1 3=1 4=0 5=1 6=1024\nReLU             fire2/relu_squeeze1x1 1 1 fire2/squeeze1x1 fire2/squeeze1x1_fire2/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_0      1 2 fire2/squeeze1x1_fire2/relu_squeeze1x1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1\nConvolution      fire2/expand1x1  1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_1 fire2/expand1x1 0=64 1=1 2=1 3=1 4=0 5=1 6=1024\nReLU             fire2/relu_expand1x1 1 1 fire2/expand1x1 fire2/expand1x1_fire2/relu_expand1x1 0=0.000000\nConvolution      fire2/expand3x3  1 1 fire2/squeeze1x1_fire2/relu_squeeze1x1_splitncnn_0 fire2/expand3x3 0=64 1=3 2=1 3=1 4=1 5=1 6=9216\nReLU             fire2/relu_expand3x3 1 1 fire2/expand3x3 fire2/expand3x3_fire2/relu_expand3x3 0=0.000000\nConcat           fire2/concat     2 1 fire2/expand1x1_fire2/relu_expand1x1 fire2/expand3x3_fire2/relu_expand3x3 fire2/concat 0=0\nConvolution      fire3/squeeze1x1 1 1 fire2/concat fire3/squeeze1x1 0=16 1=1 2=1 3=1 4=0 5=1 6=2048\nReLU             fire3/relu_squeeze1x1 1 1 fire3/squeeze1x1 fire3/squeeze1x1_fire3/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_1      1 2 fire3/squeeze1x1_fire3/relu_squeeze1x1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1\nConvolution      fire3/expand1x1  1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_1 fire3/expand1x1 0=64 1=1 2=1 3=1 4=0 5=1 6=1024\nReLU             fire3/relu_expand1x1 1 1 fire3/expand1x1 fire3/expand1x1_fire3/relu_expand1x1 0=0.000000\nConvolution      
fire3/expand3x3  1 1 fire3/squeeze1x1_fire3/relu_squeeze1x1_splitncnn_0 fire3/expand3x3 0=64 1=3 2=1 3=1 4=1 5=1 6=9216\nReLU             fire3/relu_expand3x3 1 1 fire3/expand3x3 fire3/expand3x3_fire3/relu_expand3x3 0=0.000000\nConcat           fire3/concat     2 1 fire3/expand1x1_fire3/relu_expand1x1 fire3/expand3x3_fire3/relu_expand3x3 fire3/concat 0=0\nPooling          pool3            1 1 fire3/concat pool3 0=0 1=3 2=2 3=0 4=0\nConvolution      fire4/squeeze1x1 1 1 pool3 fire4/squeeze1x1 0=32 1=1 2=1 3=1 4=0 5=1 6=4096\nReLU             fire4/relu_squeeze1x1 1 1 fire4/squeeze1x1 fire4/squeeze1x1_fire4/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_2      1 2 fire4/squeeze1x1_fire4/relu_squeeze1x1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1\nConvolution      fire4/expand1x1  1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_1 fire4/expand1x1 0=128 1=1 2=1 3=1 4=0 5=1 6=4096\nReLU             fire4/relu_expand1x1 1 1 fire4/expand1x1 fire4/expand1x1_fire4/relu_expand1x1 0=0.000000\nConvolution      fire4/expand3x3  1 1 fire4/squeeze1x1_fire4/relu_squeeze1x1_splitncnn_0 fire4/expand3x3 0=128 1=3 2=1 3=1 4=1 5=1 6=36864\nReLU             fire4/relu_expand3x3 1 1 fire4/expand3x3 fire4/expand3x3_fire4/relu_expand3x3 0=0.000000\nConcat           fire4/concat     2 1 fire4/expand1x1_fire4/relu_expand1x1 fire4/expand3x3_fire4/relu_expand3x3 fire4/concat 0=0\nConvolution      fire5/squeeze1x1 1 1 fire4/concat fire5/squeeze1x1 0=32 1=1 2=1 3=1 4=0 5=1 6=8192\nReLU             fire5/relu_squeeze1x1 1 1 fire5/squeeze1x1 fire5/squeeze1x1_fire5/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_3      1 2 fire5/squeeze1x1_fire5/relu_squeeze1x1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1\nConvolution      fire5/expand1x1  1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_1 fire5/expand1x1 0=128 1=1 2=1 3=1 4=0 5=1 6=4096\nReLU             
fire5/relu_expand1x1 1 1 fire5/expand1x1 fire5/expand1x1_fire5/relu_expand1x1 0=0.000000\nConvolution      fire5/expand3x3  1 1 fire5/squeeze1x1_fire5/relu_squeeze1x1_splitncnn_0 fire5/expand3x3 0=128 1=3 2=1 3=1 4=1 5=1 6=36864\nReLU             fire5/relu_expand3x3 1 1 fire5/expand3x3 fire5/expand3x3_fire5/relu_expand3x3 0=0.000000\nConcat           fire5/concat     2 1 fire5/expand1x1_fire5/relu_expand1x1 fire5/expand3x3_fire5/relu_expand3x3 fire5/concat 0=0\nPooling          pool5            1 1 fire5/concat pool5 0=0 1=3 2=2 3=0 4=0\nConvolution      fire6/squeeze1x1 1 1 pool5 fire6/squeeze1x1 0=48 1=1 2=1 3=1 4=0 5=1 6=12288\nReLU             fire6/relu_squeeze1x1 1 1 fire6/squeeze1x1 fire6/squeeze1x1_fire6/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_4      1 2 fire6/squeeze1x1_fire6/relu_squeeze1x1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1\nConvolution      fire6/expand1x1  1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_1 fire6/expand1x1 0=192 1=1 2=1 3=1 4=0 5=1 6=9216\nReLU             fire6/relu_expand1x1 1 1 fire6/expand1x1 fire6/expand1x1_fire6/relu_expand1x1 0=0.000000\nConvolution      fire6/expand3x3  1 1 fire6/squeeze1x1_fire6/relu_squeeze1x1_splitncnn_0 fire6/expand3x3 0=192 1=3 2=1 3=1 4=1 5=1 6=82944\nReLU             fire6/relu_expand3x3 1 1 fire6/expand3x3 fire6/expand3x3_fire6/relu_expand3x3 0=0.000000\nConcat           fire6/concat     2 1 fire6/expand1x1_fire6/relu_expand1x1 fire6/expand3x3_fire6/relu_expand3x3 fire6/concat 0=0\nConvolution      fire7/squeeze1x1 1 1 fire6/concat fire7/squeeze1x1 0=48 1=1 2=1 3=1 4=0 5=1 6=18432\nReLU             fire7/relu_squeeze1x1 1 1 fire7/squeeze1x1 fire7/squeeze1x1_fire7/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_5      1 2 fire7/squeeze1x1_fire7/relu_squeeze1x1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1\nConvolution      fire7/expand1x1  1 1 
fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_1 fire7/expand1x1 0=192 1=1 2=1 3=1 4=0 5=1 6=9216\nReLU             fire7/relu_expand1x1 1 1 fire7/expand1x1 fire7/expand1x1_fire7/relu_expand1x1 0=0.000000\nConvolution      fire7/expand3x3  1 1 fire7/squeeze1x1_fire7/relu_squeeze1x1_splitncnn_0 fire7/expand3x3 0=192 1=3 2=1 3=1 4=1 5=1 6=82944\nReLU             fire7/relu_expand3x3 1 1 fire7/expand3x3 fire7/expand3x3_fire7/relu_expand3x3 0=0.000000\nConcat           fire7/concat     2 1 fire7/expand1x1_fire7/relu_expand1x1 fire7/expand3x3_fire7/relu_expand3x3 fire7/concat 0=0\nConvolution      fire8/squeeze1x1 1 1 fire7/concat fire8/squeeze1x1 0=64 1=1 2=1 3=1 4=0 5=1 6=24576\nReLU             fire8/relu_squeeze1x1 1 1 fire8/squeeze1x1 fire8/squeeze1x1_fire8/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_6      1 2 fire8/squeeze1x1_fire8/relu_squeeze1x1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1\nConvolution      fire8/expand1x1  1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_1 fire8/expand1x1 0=256 1=1 2=1 3=1 4=0 5=1 6=16384\nReLU             fire8/relu_expand1x1 1 1 fire8/expand1x1 fire8/expand1x1_fire8/relu_expand1x1 0=0.000000\nConvolution      fire8/expand3x3  1 1 fire8/squeeze1x1_fire8/relu_squeeze1x1_splitncnn_0 fire8/expand3x3 0=256 1=3 2=1 3=1 4=1 5=1 6=147456\nReLU             fire8/relu_expand3x3 1 1 fire8/expand3x3 fire8/expand3x3_fire8/relu_expand3x3 0=0.000000\nConcat           fire8/concat     2 1 fire8/expand1x1_fire8/relu_expand1x1 fire8/expand3x3_fire8/relu_expand3x3 fire8/concat 0=0\nConvolution      fire9/squeeze1x1 1 1 fire8/concat fire9/squeeze1x1 0=64 1=1 2=1 3=1 4=0 5=1 6=32768\nReLU             fire9/relu_squeeze1x1 1 1 fire9/squeeze1x1 fire9/squeeze1x1_fire9/relu_squeeze1x1 0=0.000000\nSplit            splitncnn_7      1 2 fire9/squeeze1x1_fire9/relu_squeeze1x1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 
fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1\nConvolution      fire9/expand1x1  1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_1 fire9/expand1x1 0=256 1=1 2=1 3=1 4=0 5=1 6=16384\nReLU             fire9/relu_expand1x1 1 1 fire9/expand1x1 fire9/expand1x1_fire9/relu_expand1x1 0=0.000000\nConvolution      fire9/expand3x3  1 1 fire9/squeeze1x1_fire9/relu_squeeze1x1_splitncnn_0 fire9/expand3x3 0=256 1=3 2=1 3=1 4=1 5=1 6=147456\nReLU             fire9/relu_expand3x3 1 1 fire9/expand3x3 fire9/expand3x3_fire9/relu_expand3x3 0=0.000000\nConcat           fire9/concat     2 1 fire9/expand1x1_fire9/relu_expand1x1 fire9/expand3x3_fire9/relu_expand3x3 fire9/concat 0=0\nDropout          drop9            1 1 fire9/concat fire9/concat_drop9\nConvolution      conv10           1 1 fire9/concat_drop9 conv10 0=1000 1=1 2=1 3=1 4=1 5=1 6=512000\nReLU             relu_conv10      1 1 conv10 conv10_relu_conv10 0=0.000000\nPooling          pool10           1 1 conv10_relu_conv10 pool10 0=1 1=0 2=1 3=0 4=1\nSoftmax          prob             1 1 pool10 prob 0=0\n"
  },
  {
    "path": "examples/squeezenet_v1.1.prototxt",
    "content": "name: \"squeezenet_v1.1_deploy\"\n\nlayer {\n  name: \"data\"\n  type: \"Input\"\n  top: \"data\"\n  input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }\n}\nlayer {\n  name: \"conv1\"\n  type: \"Convolution\"\n  bottom: \"data\"\n  top: \"conv1\"\n  convolution_param {\n    num_output: 64\n    kernel_size: 3\n    stride: 2\n  }\n}\nlayer {\n  name: \"relu_conv1\"\n  type: \"ReLU\"\n  bottom: \"conv1\"\n  top: \"conv1\"\n}\nlayer {\n  name: \"pool1\"\n  type: \"Pooling\"\n  bottom: \"conv1\"\n  top: \"pool1\"\n  pooling_param {\n    pool: MAX\n    kernel_size: 3\n    stride: 2\n  }\n}\nlayer {\n  name: \"fire2/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"pool1\"\n  top: \"fire2/squeeze1x1\"\n  convolution_param {\n    num_output: 16\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire2/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: \"fire2/squeeze1x1\"\n  top: \"fire2/squeeze1x1\"\n}\nlayer {\n  name: \"fire2/expand1x1\"\n  type: \"Convolution\"\n  bottom: \"fire2/squeeze1x1\"\n  top: \"fire2/expand1x1\"\n  convolution_param {\n    num_output: 64\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire2/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: \"fire2/expand1x1\"\n  top: \"fire2/expand1x1\"\n}\nlayer {\n  name: \"fire2/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire2/squeeze1x1\"\n  top: \"fire2/expand3x3\"\n  convolution_param {\n    num_output: 64\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire2/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire2/expand3x3\"\n  top: \"fire2/expand3x3\"\n}\nlayer {\n  name: \"fire2/concat\"\n  type: \"Concat\"\n  bottom: \"fire2/expand1x1\"\n  bottom: \"fire2/expand3x3\"\n  top: \"fire2/concat\"\n}\nlayer {\n  name: \"fire3/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"fire2/concat\"\n  top: \"fire3/squeeze1x1\"\n  convolution_param {\n    num_output: 16\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire3/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: 
\"fire3/squeeze1x1\"\n  top: \"fire3/squeeze1x1\"\n}\nlayer {\n  name: \"fire3/expand1x1\"\n  type: \"Convolution\"\n  bottom: \"fire3/squeeze1x1\"\n  top: \"fire3/expand1x1\"\n  convolution_param {\n    num_output: 64\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire3/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: \"fire3/expand1x1\"\n  top: \"fire3/expand1x1\"\n}\nlayer {\n  name: \"fire3/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire3/squeeze1x1\"\n  top: \"fire3/expand3x3\"\n  convolution_param {\n    num_output: 64\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire3/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire3/expand3x3\"\n  top: \"fire3/expand3x3\"\n}\nlayer {\n  name: \"fire3/concat\"\n  type: \"Concat\"\n  bottom: \"fire3/expand1x1\"\n  bottom: \"fire3/expand3x3\"\n  top: \"fire3/concat\"\n}\nlayer {\n  name: \"pool3\"\n  type: \"Pooling\"\n  bottom: \"fire3/concat\"\n  top: \"pool3\"\n  pooling_param {\n    pool: MAX\n    kernel_size: 3\n    stride: 2\n  }\n}\nlayer {\n  name: \"fire4/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"pool3\"\n  top: \"fire4/squeeze1x1\"\n  convolution_param {\n    num_output: 32\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire4/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: \"fire4/squeeze1x1\"\n  top: \"fire4/squeeze1x1\"\n}\nlayer {\n  name: \"fire4/expand1x1\"\n  type: \"Convolution\"\n  bottom: \"fire4/squeeze1x1\"\n  top: \"fire4/expand1x1\"\n  convolution_param {\n    num_output: 128\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire4/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: \"fire4/expand1x1\"\n  top: \"fire4/expand1x1\"\n}\nlayer {\n  name: \"fire4/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire4/squeeze1x1\"\n  top: \"fire4/expand3x3\"\n  convolution_param {\n    num_output: 128\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire4/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire4/expand3x3\"\n  top: \"fire4/expand3x3\"\n}\nlayer {\n  name: 
\"fire4/concat\"\n  type: \"Concat\"\n  bottom: \"fire4/expand1x1\"\n  bottom: \"fire4/expand3x3\"\n  top: \"fire4/concat\"\n}\nlayer {\n  name: \"fire5/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"fire4/concat\"\n  top: \"fire5/squeeze1x1\"\n  convolution_param {\n    num_output: 32\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire5/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: \"fire5/squeeze1x1\"\n  top: \"fire5/squeeze1x1\"\n}\nlayer {\n  name: \"fire5/expand1x1\"\n  type: \"Convolution\"\n  bottom: \"fire5/squeeze1x1\"\n  top: \"fire5/expand1x1\"\n  convolution_param {\n    num_output: 128\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire5/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: \"fire5/expand1x1\"\n  top: \"fire5/expand1x1\"\n}\nlayer {\n  name: \"fire5/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire5/squeeze1x1\"\n  top: \"fire5/expand3x3\"\n  convolution_param {\n    num_output: 128\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire5/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire5/expand3x3\"\n  top: \"fire5/expand3x3\"\n}\nlayer {\n  name: \"fire5/concat\"\n  type: \"Concat\"\n  bottom: \"fire5/expand1x1\"\n  bottom: \"fire5/expand3x3\"\n  top: \"fire5/concat\"\n}\nlayer {\n  name: \"pool5\"\n  type: \"Pooling\"\n  bottom: \"fire5/concat\"\n  top: \"pool5\"\n  pooling_param {\n    pool: MAX\n    kernel_size: 3\n    stride: 2\n  }\n}\nlayer {\n  name: \"fire6/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"pool5\"\n  top: \"fire6/squeeze1x1\"\n  convolution_param {\n    num_output: 48\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire6/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: \"fire6/squeeze1x1\"\n  top: \"fire6/squeeze1x1\"\n}\nlayer {\n  name: \"fire6/expand1x1\"\n  type: \"Convolution\"\n  bottom: \"fire6/squeeze1x1\"\n  top: \"fire6/expand1x1\"\n  convolution_param {\n    num_output: 192\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire6/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: 
\"fire6/expand1x1\"\n  top: \"fire6/expand1x1\"\n}\nlayer {\n  name: \"fire6/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire6/squeeze1x1\"\n  top: \"fire6/expand3x3\"\n  convolution_param {\n    num_output: 192\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire6/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire6/expand3x3\"\n  top: \"fire6/expand3x3\"\n}\nlayer {\n  name: \"fire6/concat\"\n  type: \"Concat\"\n  bottom: \"fire6/expand1x1\"\n  bottom: \"fire6/expand3x3\"\n  top: \"fire6/concat\"\n}\nlayer {\n  name: \"fire7/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"fire6/concat\"\n  top: \"fire7/squeeze1x1\"\n  convolution_param {\n    num_output: 48\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire7/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: \"fire7/squeeze1x1\"\n  top: \"fire7/squeeze1x1\"\n}\nlayer {\n  name: \"fire7/expand1x1\"\n  type: \"Convolution\"\n  bottom: \"fire7/squeeze1x1\"\n  top: \"fire7/expand1x1\"\n  convolution_param {\n    num_output: 192\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire7/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: \"fire7/expand1x1\"\n  top: \"fire7/expand1x1\"\n}\nlayer {\n  name: \"fire7/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire7/squeeze1x1\"\n  top: \"fire7/expand3x3\"\n  convolution_param {\n    num_output: 192\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire7/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire7/expand3x3\"\n  top: \"fire7/expand3x3\"\n}\nlayer {\n  name: \"fire7/concat\"\n  type: \"Concat\"\n  bottom: \"fire7/expand1x1\"\n  bottom: \"fire7/expand3x3\"\n  top: \"fire7/concat\"\n}\nlayer {\n  name: \"fire8/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"fire7/concat\"\n  top: \"fire8/squeeze1x1\"\n  convolution_param {\n    num_output: 64\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire8/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: \"fire8/squeeze1x1\"\n  top: \"fire8/squeeze1x1\"\n}\nlayer {\n  name: \"fire8/expand1x1\"\n  
type: \"Convolution\"\n  bottom: \"fire8/squeeze1x1\"\n  top: \"fire8/expand1x1\"\n  convolution_param {\n    num_output: 256\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire8/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: \"fire8/expand1x1\"\n  top: \"fire8/expand1x1\"\n}\nlayer {\n  name: \"fire8/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire8/squeeze1x1\"\n  top: \"fire8/expand3x3\"\n  convolution_param {\n    num_output: 256\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire8/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire8/expand3x3\"\n  top: \"fire8/expand3x3\"\n}\nlayer {\n  name: \"fire8/concat\"\n  type: \"Concat\"\n  bottom: \"fire8/expand1x1\"\n  bottom: \"fire8/expand3x3\"\n  top: \"fire8/concat\"\n}\nlayer {\n  name: \"fire9/squeeze1x1\"\n  type: \"Convolution\"\n  bottom: \"fire8/concat\"\n  top: \"fire9/squeeze1x1\"\n  convolution_param {\n    num_output: 64\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire9/relu_squeeze1x1\"\n  type: \"ReLU\"\n  bottom: \"fire9/squeeze1x1\"\n  top: \"fire9/squeeze1x1\"\n}\nlayer {\n  name: \"fire9/expand1x1\"\n  type: \"Convolution\"\n  bottom: \"fire9/squeeze1x1\"\n  top: \"fire9/expand1x1\"\n  convolution_param {\n    num_output: 256\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"fire9/relu_expand1x1\"\n  type: \"ReLU\"\n  bottom: \"fire9/expand1x1\"\n  top: \"fire9/expand1x1\"\n}\nlayer {\n  name: \"fire9/expand3x3\"\n  type: \"Convolution\"\n  bottom: \"fire9/squeeze1x1\"\n  top: \"fire9/expand3x3\"\n  convolution_param {\n    num_output: 256\n    pad: 1\n    kernel_size: 3\n  }\n}\nlayer {\n  name: \"fire9/relu_expand3x3\"\n  type: \"ReLU\"\n  bottom: \"fire9/expand3x3\"\n  top: \"fire9/expand3x3\"\n}\nlayer {\n  name: \"fire9/concat\"\n  type: \"Concat\"\n  bottom: \"fire9/expand1x1\"\n  bottom: \"fire9/expand3x3\"\n  top: \"fire9/concat\"\n}\nlayer {\n  name: \"drop9\"\n  type: \"Dropout\"\n  bottom: \"fire9/concat\"\n  top: \"fire9/concat\"\n  dropout_param {\n    
dropout_ratio: 0.5\n  }\n}\nlayer {\n  name: \"conv10\"\n  type: \"Convolution\"\n  bottom: \"fire9/concat\"\n  top: \"conv10\"\n  convolution_param {\n    num_output: 1000\n    pad: 1\n    kernel_size: 1\n  }\n}\nlayer {\n  name: \"relu_conv10\"\n  type: \"ReLU\"\n  bottom: \"conv10\"\n  top: \"conv10\"\n}\nlayer {\n  name: \"pool10\"\n  type: \"Pooling\"\n  bottom: \"conv10\"\n  top: \"pool10\"\n  pooling_param {\n    pool: AVE\n    global_pooling: true\n  }\n}\nlayer {\n  name: \"prob\"\n  type: \"Softmax\"\n  bottom: \"pool10\"\n  top: \"prob\"\n}\n"
  },
  {
    "path": "examples/squeezenetssd.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <stdlib.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int detect_squeezenet(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net squeezenet;\n\n    squeezenet.opt.use_vulkan_compute = true;\n\n    // original pretrained model from https://github.com/chuanqi305/SqueezeNet-SSD\n    // squeezenet_ssd_voc_deploy.prototxt\n    // https://drive.google.com/open?id=0B3gersZ2cHIxdGpyZlZnbEQ5Snc\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (squeezenet.load_param(\"squeezenet_ssd_voc.param\"))\n        exit(-1);\n    if (squeezenet.load_model(\"squeezenet_ssd_voc.bin\"))\n        exit(-1);\n\n    const int target_size = 300;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, target_size, target_size);\n\n    const float mean_vals[3] = {104.f, 117.f, 123.f};\n    in.substract_mean_normalize(mean_vals, 0);\n\n    ncnn::Extractor ex = squeezenet.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"detection_out\", out);\n\n    //     printf(\"%d %d %d\\n\", out.w, out.h, out.c);\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        object.prob = values[1];\n        object.rect.x = values[2] * img_w;\n        object.rect.y = values[3] * img_h;\n        object.rect.width = values[4] * img_w - object.rect.x;\n        object.rect.height = values[5] * img_h - object.rect.y;\n\n        
objects.push_back(object);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s 
[imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_squeezenet(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/synset_words.txt",
    "content": "n01440764 tench, Tinca tinca\nn01443537 goldfish, Carassius auratus\nn01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias\nn01491361 tiger shark, Galeocerdo cuvieri\nn01494475 hammerhead, hammerhead shark\nn01496331 electric ray, crampfish, numbfish, torpedo\nn01498041 stingray\nn01514668 cock\nn01514859 hen\nn01518878 ostrich, Struthio camelus\nn01530575 brambling, Fringilla montifringilla\nn01531178 goldfinch, Carduelis carduelis\nn01532829 house finch, linnet, Carpodacus mexicans\nn01534433 junco, snowbird\nn01537544 indigo bunting, indigo finch, indigo bird, Passerina cyanea\nn01558993 robin, American robin, Turdus migratorius\nn01560419 bulbul\nn01580077 jay\nn01582220 magpie\nn01592084 chickadee\nn01601694 water ouzel, dipper\nn01608432 kite\nn01614925 bald eagle, American eagle, Haliaeetus leucocephalus\nn01616318 vulture\nn01622779 great grey owl, great gray owl, Strix nebulosa\nn01629819 European fire salamander, Salamandra salamandra\nn01630670 common newt, Triturus vulgaris\nn01631663 eft\nn01632458 spotted salamander, Ambystoma maculatum\nn01632777 axolotl, mud puppy, Ambystoma mexicanum\nn01641577 bullfrog, Rana catesbeiana\nn01644373 tree frog, tree-frog\nn01644900 tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui\nn01664065 loggerhead, loggerhead turtle, Caretta caretta\nn01665541 leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea\nn01667114 mud turtle\nn01667778 terrapin\nn01669191 box turtle, box tortoise\nn01675722 banded gecko\nn01677366 common iguana, iguana, Iguana iguana\nn01682714 American chameleon, anole, Anolis carolinensis\nn01685808 whiptail, whiptail lizard\nn01687978 agama\nn01688243 frilled lizard, Chlamydosaurus kingi\nn01689811 alligator lizard\nn01692333 Gila monster, Heloderma suspectum\nn01693334 green lizard, Lacerta viridis\nn01694178 African chameleon, Chamaeleo chamaeleon\nn01695060 Komodo dragon, Komodo lizard, dragon lizard, giant 
lizard, Varanus komodoensis\nn01697457 African crocodile, Nile crocodile, Crocodylus niloticus\nn01698640 American alligator, Alligator mississipiensis\nn01704323 triceratops\nn01728572 thunder snake, worm snake, Carphophis amoenus\nn01728920 ringneck snake, ring-necked snake, ring snake\nn01729322 hognose snake, puff adder, sand viper\nn01729977 green snake, grass snake\nn01734418 king snake, kingsnake\nn01735189 garter snake, grass snake\nn01737021 water snake\nn01739381 vine snake\nn01740131 night snake, Hypsiglena torquata\nn01742172 boa constrictor, Constrictor constrictor\nn01744401 rock python, rock snake, Python sebae\nn01748264 Indian cobra, Naja naja\nn01749939 green mamba\nn01751748 sea snake\nn01753488 horned viper, cerastes, sand viper, horned asp, Cerastes cornutus\nn01755581 diamondback, diamondback rattlesnake, Crotalus adamanteus\nn01756291 sidewinder, horned rattlesnake, Crotalus cerastes\nn01768244 trilobite\nn01770081 harvestman, daddy longlegs, Phalangium opilio\nn01770393 scorpion\nn01773157 black and gold garden spider, Argiope aurantia\nn01773549 barn spider, Araneus cavaticus\nn01773797 garden spider, Aranea diademata\nn01774384 black widow, Latrodectus mactans\nn01774750 tarantula\nn01775062 wolf spider, hunting spider\nn01776313 tick\nn01784675 centipede\nn01795545 black grouse\nn01796340 ptarmigan\nn01797886 ruffed grouse, partridge, Bonasa umbellus\nn01798484 prairie chicken, prairie grouse, prairie fowl\nn01806143 peacock\nn01806567 quail\nn01807496 partridge\nn01817953 African grey, African gray, Psittacus erithacus\nn01818515 macaw\nn01819313 sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita\nn01820546 lorikeet\nn01824575 coucal\nn01828970 bee eater\nn01829413 hornbill\nn01833805 hummingbird\nn01843065 jacamar\nn01843383 toucan\nn01847000 drake\nn01855032 red-breasted merganser, Mergus serrator\nn01855672 goose\nn01860187 black swan, Cygnus atratus\nn01871265 tusker\nn01872401 echidna, spiny anteater, anteater\nn01873310 
platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus\nn01877812 wallaby, brush kangaroo\nn01882714 koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus\nn01883070 wombat\nn01910747 jellyfish\nn01914609 sea anemone, anemone\nn01917289 brain coral\nn01924916 flatworm, platyhelminth\nn01930112 nematode, nematode worm, roundworm\nn01943899 conch\nn01944390 snail\nn01945685 slug\nn01950731 sea slug, nudibranch\nn01955084 chiton, coat-of-mail shell, sea cradle, polyplacophore\nn01968897 chambered nautilus, pearly nautilus, nautilus\nn01978287 Dungeness crab, Cancer magister\nn01978455 rock crab, Cancer irroratus\nn01980166 fiddler crab\nn01981276 king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica\nn01983481 American lobster, Northern lobster, Maine lobster, Homarus americans\nn01984695 spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish\nn01985128 crayfish, crawfish, crawdad, crawdaddy\nn01986214 hermit crab\nn01990800 isopod\nn02002556 white stork, Ciconia ciconia\nn02002724 black stork, Ciconia nigra\nn02006656 spoonbill\nn02007558 flamingo\nn02009229 little blue heron, Egretta caerulea\nn02009912 American egret, great white heron, Egretta albus\nn02011460 bittern\nn02012849 crane\nn02013706 limpkin, Aramus pictus\nn02017213 European gallinule, Porphyrio porphyrio\nn02018207 American coot, marsh hen, mud hen, water hen, Fulica americana\nn02018795 bustard\nn02025239 ruddy turnstone, Arenaria interpres\nn02027492 red-backed sandpiper, dunlin, Erolia alpina\nn02028035 redshank, Tringa totanus\nn02033041 dowitcher\nn02037110 oystercatcher, oyster catcher\nn02051845 pelican\nn02056570 king penguin, Aptenodytes patagonica\nn02058221 albatross, mollymawk\nn02066245 grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus\nn02071294 killer whale, killer, orca, grampus, sea wolf, Orcinus orca\nn02074367 dugong, Dugong dugon\nn02077923 sea 
lion\nn02085620 Chihuahua\nn02085782 Japanese spaniel\nn02085936 Maltese dog, Maltese terrier, Maltese\nn02086079 Pekinese, Pekingese, Peke\nn02086240 Shih-Tzu\nn02086646 Blenheim spaniel\nn02086910 papillon\nn02087046 toy terrier\nn02087394 Rhodesian ridgeback\nn02088094 Afghan hound, Afghan\nn02088238 basset, basset hound\nn02088364 beagle\nn02088466 bloodhound, sleuthhound\nn02088632 bluetick\nn02089078 black-and-tan coonhound\nn02089867 Walker hound, Walker foxhound\nn02089973 English foxhound\nn02090379 redbone\nn02090622 borzoi, Russian wolfhound\nn02090721 Irish wolfhound\nn02091032 Italian greyhound\nn02091134 whippet\nn02091244 Ibizan hound, Ibizan Podenco\nn02091467 Norwegian elkhound, elkhound\nn02091635 otterhound, otter hound\nn02091831 Saluki, gazelle hound\nn02092002 Scottish deerhound, deerhound\nn02092339 Weimaraner\nn02093256 Staffordshire bullterrier, Staffordshire bull terrier\nn02093428 American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier\nn02093647 Bedlington terrier\nn02093754 Border terrier\nn02093859 Kerry blue terrier\nn02093991 Irish terrier\nn02094114 Norfolk terrier\nn02094258 Norwich terrier\nn02094433 Yorkshire terrier\nn02095314 wire-haired fox terrier\nn02095570 Lakeland terrier\nn02095889 Sealyham terrier, Sealyham\nn02096051 Airedale, Airedale terrier\nn02096177 cairn, cairn terrier\nn02096294 Australian terrier\nn02096437 Dandie Dinmont, Dandie Dinmont terrier\nn02096585 Boston bull, Boston terrier\nn02097047 miniature schnauzer\nn02097130 giant schnauzer\nn02097209 standard schnauzer\nn02097298 Scotch terrier, Scottish terrier, Scottie\nn02097474 Tibetan terrier, chrysanthemum dog\nn02097658 silky terrier, Sydney silky\nn02098105 soft-coated wheaten terrier\nn02098286 West Highland white terrier\nn02098413 Lhasa, Lhasa apso\nn02099267 flat-coated retriever\nn02099429 curly-coated retriever\nn02099601 golden retriever\nn02099712 Labrador retriever\nn02099849 Chesapeake Bay 
retriever\nn02100236 German short-haired pointer\nn02100583 vizsla, Hungarian pointer\nn02100735 English setter\nn02100877 Irish setter, red setter\nn02101006 Gordon setter\nn02101388 Brittany spaniel\nn02101556 clumber, clumber spaniel\nn02102040 English springer, English springer spaniel\nn02102177 Welsh springer spaniel\nn02102318 cocker spaniel, English cocker spaniel, cocker\nn02102480 Sussex spaniel\nn02102973 Irish water spaniel\nn02104029 kuvasz\nn02104365 schipperke\nn02105056 groenendael\nn02105162 malinois\nn02105251 briard\nn02105412 kelpie\nn02105505 komondor\nn02105641 Old English sheepdog, bobtail\nn02105855 Shetland sheepdog, Shetland sheep dog, Shetland\nn02106030 collie\nn02106166 Border collie\nn02106382 Bouvier des Flandres, Bouviers des Flandres\nn02106550 Rottweiler\nn02106662 German shepherd, German shepherd dog, German police dog, alsatian\nn02107142 Doberman, Doberman pinscher\nn02107312 miniature pinscher\nn02107574 Greater Swiss Mountain dog\nn02107683 Bernese mountain dog\nn02107908 Appenzeller\nn02108000 EntleBucher\nn02108089 boxer\nn02108422 bull mastiff\nn02108551 Tibetan mastiff\nn02108915 French bulldog\nn02109047 Great Dane\nn02109525 Saint Bernard, St Bernard\nn02109961 Eskimo dog, husky\nn02110063 malamute, malemute, Alaskan malamute\nn02110185 Siberian husky\nn02110341 dalmatian, coach dog, carriage dog\nn02110627 affenpinscher, monkey pinscher, monkey dog\nn02110806 basenji\nn02110958 pug, pug-dog\nn02111129 Leonberg\nn02111277 Newfoundland, Newfoundland dog\nn02111500 Great Pyrenees\nn02111889 Samoyed, Samoyede\nn02112018 Pomeranian\nn02112137 chow, chow chow\nn02112350 keeshond\nn02112706 Brabancon griffon\nn02113023 Pembroke, Pembroke Welsh corgi\nn02113186 Cardigan, Cardigan Welsh corgi\nn02113624 toy poodle\nn02113712 miniature poodle\nn02113799 standard poodle\nn02113978 Mexican hairless\nn02114367 timber wolf, grey wolf, gray wolf, Canis lupus\nn02114548 white wolf, Arctic wolf, Canis lupus tundrarum\nn02114712 red 
wolf, maned wolf, Canis rufus, Canis niger\nn02114855 coyote, prairie wolf, brush wolf, Canis latrans\nn02115641 dingo, warrigal, warragal, Canis dingo\nn02115913 dhole, Cuon alpinus\nn02116738 African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus\nn02117135 hyena, hyaena\nn02119022 red fox, Vulpes vulpes\nn02119789 kit fox, Vulpes macrotis\nn02120079 Arctic fox, white fox, Alopex lagopus\nn02120505 grey fox, gray fox, Urocyon cinereoargenteus\nn02123045 tabby, tabby cat\nn02123159 tiger cat\nn02123394 Persian cat\nn02123597 Siamese cat, Siamese\nn02124075 Egyptian cat\nn02125311 cougar, puma, catamount, mountain lion, painter, panther, Felis concolor\nn02127052 lynx, catamount\nn02128385 leopard, Panthera pardus\nn02128757 snow leopard, ounce, Panthera uncia\nn02128925 jaguar, panther, Panthera onca, Felis onca\nn02129165 lion, king of beasts, Panthera leo\nn02129604 tiger, Panthera tigris\nn02130308 cheetah, chetah, Acinonyx jubatus\nn02132136 brown bear, bruin, Ursus arctos\nn02133161 American black bear, black bear, Ursus americans, Euarctos americans\nn02134084 ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus\nn02134418 sloth bear, Melursus ursinus, Ursus ursinus\nn02137549 mongoose\nn02138441 meerkat, mierkat\nn02165105 tiger beetle\nn02165456 ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle\nn02167151 ground beetle, carabid beetle\nn02168699 long-horned beetle, longicorn, longicorn beetle\nn02169497 leaf beetle, chrysomelid\nn02172182 dung beetle\nn02174001 rhinoceros beetle\nn02177972 weevil\nn02190166 fly\nn02206856 bee\nn02219486 ant, emmet, pismire\nn02226429 grasshopper, hopper\nn02229544 cricket\nn02231487 walking stick, walkingstick, stick insect\nn02233338 cockroach, roach\nn02236044 mantis, mantid\nn02256656 cicada, cicala\nn02259212 leafhopper\nn02264363 lacewing, lacewing fly\nn02268443 dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter 
hawk\nn02268853 damselfly\nn02276258 admiral\nn02277742 ringlet, ringlet butterfly\nn02279972 monarch, monarch butterfly, milkweed butterfly, Danaus plexippus\nn02280649 cabbage butterfly\nn02281406 sulphur butterfly, sulfur butterfly\nn02281787 lycaenid, lycaenid butterfly\nn02317335 starfish, sea star\nn02319095 sea urchin\nn02321529 sea cucumber, holothurian\nn02325366 wood rabbit, cottontail, cottontail rabbit\nn02326432 hare\nn02328150 Angora, Angora rabbit\nn02342885 hamster\nn02346627 porcupine, hedgehog\nn02356798 fox squirrel, eastern fox squirrel, Sciurus niger\nn02361337 marmot\nn02363005 beaver\nn02364673 guinea pig, Cavia cobaya\nn02389026 sorrel\nn02391049 zebra\nn02395406 hog, pig, grunter, squealer, Sus scrofa\nn02396427 wild boar, boar, Sus scrofa\nn02397096 warthog\nn02398521 hippopotamus, hippo, river horse, Hippopotamus amphibius\nn02403003 ox\nn02408429 water buffalo, water ox, Asiatic buffalo, Bubalus bubalis\nn02410509 bison\nn02412080 ram, tup\nn02415577 bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis\nn02417914 ibex, Capra ibex\nn02422106 hartebeest\nn02422699 impala, Aepyceros melampus\nn02423022 gazelle\nn02437312 Arabian camel, dromedary, Camelus dromedarius\nn02437616 llama\nn02441942 weasel\nn02442845 mink\nn02443114 polecat, fitch, foulmart, foumart, Mustela putorius\nn02443484 black-footed ferret, ferret, Mustela nigripes\nn02444819 otter\nn02445715 skunk, polecat, wood pussy\nn02447366 badger\nn02454379 armadillo\nn02457408 three-toed sloth, ai, Bradypus tridactylus\nn02480495 orangutan, orang, orangutang, Pongo pygmaeus\nn02480855 gorilla, Gorilla gorilla\nn02481823 chimpanzee, chimp, Pan troglodytes\nn02483362 gibbon, Hylobates lar\nn02483708 siamang, Hylobates syndactylus, Symphalangus syndactylus\nn02484975 guenon, guenon monkey\nn02486261 patas, hussar monkey, Erythrocebus patas\nn02486410 baboon\nn02487347 macaque\nn02488291 langur\nn02488702 colobus, colobus monkey\nn02489166 
proboscis monkey, Nasalis larvatus\nn02490219 marmoset\nn02492035 capuchin, ringtail, Cebus capucinus\nn02492660 howler monkey, howler\nn02493509 titi, titi monkey\nn02493793 spider monkey, Ateles geoffroyi\nn02494079 squirrel monkey, Saimiri sciureus\nn02497673 Madagascar cat, ring-tailed lemur, Lemur catta\nn02500267 indri, indris, Indri indri, Indri brevicaudatus\nn02504013 Indian elephant, Elephas maximus\nn02504458 African elephant, Loxodonta africana\nn02509815 lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens\nn02510455 giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca\nn02514041 barracouta, snoek\nn02526121 eel\nn02536864 coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch\nn02606052 rock beauty, Holocanthus tricolor\nn02607072 anemone fish\nn02640242 sturgeon\nn02641379 gar, garfish, garpike, billfish, Lepisosteus osseus\nn02643566 lionfish\nn02655020 puffer, pufferfish, blowfish, globefish\nn02666196 abacus\nn02667093 abaya\nn02669723 academic gown, academic robe, judge's robe\nn02672831 accordion, piano accordion, squeeze box\nn02676566 acoustic guitar\nn02687172 aircraft carrier, carrier, flattop, attack aircraft carrier\nn02690373 airliner\nn02692877 airship, dirigible\nn02699494 altar\nn02701002 ambulance\nn02704792 amphibian, amphibious vehicle\nn02708093 analog clock\nn02727426 apiary, bee house\nn02730930 apron\nn02747177 ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin\nn02749479 assault rifle, assault gun\nn02769748 backpack, back pack, knapsack, packsack, rucksack, haversack\nn02776631 bakery, bakeshop, bakehouse\nn02777292 balance beam, beam\nn02782093 balloon\nn02783161 ballpoint, ballpoint pen, ballpen, Biro\nn02786058 Band Aid\nn02787622 banjo\nn02788148 bannister, banister, balustrade, balusters, handrail\nn02790996 barbell\nn02791124 barber chair\nn02791270 barbershop\nn02793495 barn\nn02794156 barometer\nn02795169 barrel, 
cask\nn02797295 barrow, garden cart, lawn cart, wheelbarrow\nn02799071 baseball\nn02802426 basketball\nn02804414 bassinet\nn02804610 bassoon\nn02807133 bathing cap, swimming cap\nn02808304 bath towel\nn02808440 bathtub, bathing tub, bath, tub\nn02814533 beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon\nn02814860 beacon, lighthouse, beacon light, pharos\nn02815834 beaker\nn02817516 bearskin, busby, shako\nn02823428 beer bottle\nn02823750 beer glass\nn02825657 bell cote, bell cot\nn02834397 bib\nn02835271 bicycle-built-for-two, tandem bicycle, tandem\nn02837789 bikini, two-piece\nn02840245 binder, ring-binder\nn02841315 binoculars, field glasses, opera glasses\nn02843684 birdhouse\nn02859443 boathouse\nn02860847 bobsled, bobsleigh, bob\nn02865351 bolo tie, bolo, bola tie, bola\nn02869837 bonnet, poke bonnet\nn02870880 bookcase\nn02871525 bookshop, bookstore, bookstall\nn02877765 bottlecap\nn02879718 bow\nn02883205 bow tie, bow-tie, bowtie\nn02892201 brass, memorial tablet, plaque\nn02892767 brassiere, bra, bandeau\nn02894605 breakwater, groin, groyne, mole, bulwark, seawall, jetty\nn02895154 breastplate, aegis, egis\nn02906734 broom\nn02909870 bucket, pail\nn02910353 buckle\nn02916936 bulletproof vest\nn02917067 bullet train, bullet\nn02927161 butcher shop, meat market\nn02930766 cab, hack, taxi, taxicab\nn02939185 caldron, cauldron\nn02948072 candle, taper, wax light\nn02950826 cannon\nn02951358 canoe\nn02951585 can opener, tin opener\nn02963159 cardigan\nn02965783 car mirror\nn02966193 carousel, carrousel, merry-go-round, roundabout, whirligig\nn02966687 carpenter's kit, tool kit\nn02971356 carton\nn02974003 car wheel\nn02977058 cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM\nn02978881 cassette\nn02979186 cassette player\nn02980441 castle\nn02981792 catamaran\nn02988304 CD player\nn02992211 cello, violoncello\nn02992529 cellular telephone, cellular phone, 
cellphone, cell, mobile phone\nn02999410 chain\nn03000134 chainlink fence\nn03000247 chain mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour\nn03000684 chain saw, chainsaw\nn03014705 chest\nn03016953 chiffonier, commode\nn03017168 chime, bell, gong\nn03018349 china cabinet, china closet\nn03026506 Christmas stocking\nn03028079 church, church building\nn03032252 cinema, movie theater, movie theatre, movie house, picture palace\nn03041632 cleaver, meat cleaver, chopper\nn03042490 cliff dwelling\nn03045698 cloak\nn03047690 clog, geta, patten, sabot\nn03062245 cocktail shaker\nn03063599 coffee mug\nn03063689 coffeepot\nn03065424 coil, spiral, volute, whorl, helix\nn03075370 combination lock\nn03085013 computer keyboard, keypad\nn03089624 confectionery, confectionary, candy store\nn03095699 container ship, containership, container vessel\nn03100240 convertible\nn03109150 corkscrew, bottle screw\nn03110669 cornet, horn, trumpet, trump\nn03124043 cowboy boot\nn03124170 cowboy hat, ten-gallon hat\nn03125729 cradle\nn03126707 crane\nn03127747 crash helmet\nn03127925 crate\nn03131574 crib, cot\nn03133878 Crock Pot\nn03134739 croquet ball\nn03141823 crutch\nn03146219 cuirass\nn03160309 dam, dike, dyke\nn03179701 desk\nn03180011 desktop computer\nn03187595 dial telephone, dial phone\nn03188531 diaper, nappy, napkin\nn03196217 digital clock\nn03197337 digital watch\nn03201208 dining table, board\nn03207743 dishrag, dishcloth\nn03207941 dishwasher, dish washer, dishwashing machine\nn03208938 disk brake, disc brake\nn03216828 dock, dockage, docking facility\nn03218198 dogsled, dog sled, dog sleigh\nn03220513 dome\nn03223299 doormat, welcome mat\nn03240683 drilling platform, offshore rig\nn03249569 drum, membranophone, tympan\nn03250847 drumstick\nn03255030 dumbbell\nn03259280 Dutch oven\nn03271574 electric fan, blower\nn03272010 electric guitar\nn03272562 electric locomotive\nn03290653 entertainment center\nn03291819 envelope\nn03297495 espresso 
maker\nn03314780 face powder\nn03325584 feather boa, boa\nn03337140 file, file cabinet, filing cabinet\nn03344393 fireboat\nn03345487 fire engine, fire truck\nn03347037 fire screen, fireguard\nn03355925 flagpole, flagstaff\nn03372029 flute, transverse flute\nn03376595 folding chair\nn03379051 football helmet\nn03384352 forklift\nn03388043 fountain\nn03388183 fountain pen\nn03388549 four-poster\nn03393912 freight car\nn03394916 French horn, horn\nn03400231 frying pan, frypan, skillet\nn03404251 fur coat\nn03417042 garbage truck, dustcart\nn03424325 gasmask, respirator, gas helmet\nn03425413 gas pump, gasoline pump, petrol pump, island dispenser\nn03443371 goblet\nn03444034 go-kart\nn03445777 golf ball\nn03445924 golfcart, golf cart\nn03447447 gondola\nn03447721 gong, tam-tam\nn03450230 gown\nn03452741 grand piano, grand\nn03457902 greenhouse, nursery, glasshouse\nn03459775 grille, radiator grille\nn03461385 grocery store, grocery, food market, market\nn03467068 guillotine\nn03476684 hair slide\nn03476991 hair spray\nn03478589 half track\nn03481172 hammer\nn03482405 hamper\nn03483316 hand blower, blow dryer, blow drier, hair dryer, hair drier\nn03485407 hand-held computer, hand-held microcomputer\nn03485794 handkerchief, hankie, hanky, hankey\nn03492542 hard disc, hard disk, fixed disk\nn03494278 harmonica, mouth organ, harp, mouth harp\nn03495258 harp\nn03496892 harvester, reaper\nn03498962 hatchet\nn03527444 holster\nn03529860 home theater, home theatre\nn03530642 honeycomb\nn03532672 hook, claw\nn03534580 hoopskirt, crinoline\nn03535780 horizontal bar, high bar\nn03538406 horse cart, horse-cart\nn03544143 hourglass\nn03584254 iPod\nn03584829 iron, smoothing iron\nn03590841 jack-o'-lantern\nn03594734 jean, blue jean, denim\nn03594945 jeep, landrover\nn03595614 jersey, T-shirt, tee shirt\nn03598930 jigsaw puzzle\nn03599486 jinrikisha, ricksha, rickshaw\nn03602883 joystick\nn03617480 kimono\nn03623198 knee pad\nn03627232 knot\nn03630383 lab coat, laboratory 
coat\nn03633091 ladle\nn03637318 lampshade, lamp shade\nn03642806 laptop, laptop computer\nn03649909 lawn mower, mower\nn03657121 lens cap, lens cover\nn03658185 letter opener, paper knife, paperknife\nn03661043 library\nn03662601 lifeboat\nn03666591 lighter, light, igniter, ignitor\nn03670208 limousine, limo\nn03673027 liner, ocean liner\nn03676483 lipstick, lip rouge\nn03680355 Loafer\nn03690938 lotion\nn03691459 loudspeaker, speaker, speaker unit, loudspeaker system, speaker system\nn03692522 loupe, jeweler's loupe\nn03697007 lumbermill, sawmill\nn03706229 magnetic compass\nn03709823 mailbag, postbag\nn03710193 mailbox, letter box\nn03710637 maillot\nn03710721 maillot, tank suit\nn03717622 manhole cover\nn03720891 maraca\nn03721384 marimba, xylophone\nn03724870 mask\nn03729826 matchstick\nn03733131 maypole\nn03733281 maze, labyrinth\nn03733805 measuring cup\nn03742115 medicine chest, medicine cabinet\nn03743016 megalith, megalithic structure\nn03759954 microphone, mike\nn03761084 microwave, microwave oven\nn03763968 military uniform\nn03764736 milk can\nn03769881 minibus\nn03770439 miniskirt, mini\nn03770679 minivan\nn03773504 missile\nn03775071 mitten\nn03775546 mixing bowl\nn03776460 mobile home, manufactured home\nn03777568 Model T\nn03777754 modem\nn03781244 monastery\nn03782006 monitor\nn03785016 moped\nn03786901 mortar\nn03787032 mortarboard\nn03788195 mosque\nn03788365 mosquito net\nn03791053 motor scooter, scooter\nn03792782 mountain bike, all-terrain bike, off-roader\nn03792972 mountain tent\nn03793489 mouse, computer mouse\nn03794056 mousetrap\nn03796401 moving van\nn03803284 muzzle\nn03804744 nail\nn03814639 neck brace\nn03814906 necklace\nn03825788 nipple\nn03832673 notebook, notebook computer\nn03837869 obelisk\nn03838899 oboe, hautboy, hautbois\nn03840681 ocarina, sweet potato\nn03841143 odometer, hodometer, mileometer, milometer\nn03843555 oil filter\nn03854065 organ, pipe organ\nn03857828 oscilloscope, scope, cathode-ray oscilloscope, 
CRO\nn03866082 overskirt\nn03868242 oxcart\nn03868863 oxygen mask\nn03871628 packet\nn03873416 paddle, boat paddle\nn03874293 paddlewheel, paddle wheel\nn03874599 padlock\nn03876231 paintbrush\nn03877472 pajama, pyjama, pj's, jammies\nn03877845 palace\nn03884397 panpipe, pandean pipe, syrinx\nn03887697 paper towel\nn03888257 parachute, chute\nn03888605 parallel bars, bars\nn03891251 park bench\nn03891332 parking meter\nn03895866 passenger car, coach, carriage\nn03899768 patio, terrace\nn03902125 pay-phone, pay-station\nn03903868 pedestal, plinth, footstall\nn03908618 pencil box, pencil case\nn03908714 pencil sharpener\nn03916031 perfume, essence\nn03920288 Petri dish\nn03924679 photocopier\nn03929660 pick, plectrum, plectron\nn03929855 pickelhaube\nn03930313 picket fence, paling\nn03930630 pickup, pickup truck\nn03933933 pier\nn03935335 piggy bank, penny bank\nn03937543 pill bottle\nn03938244 pillow\nn03942813 ping-pong ball\nn03944341 pinwheel\nn03947888 pirate, pirate ship\nn03950228 pitcher, ewer\nn03954731 plane, carpenter's plane, woodworking plane\nn03956157 planetarium\nn03958227 plastic bag\nn03961711 plate rack\nn03967562 plow, plough\nn03970156 plunger, plumber's helper\nn03976467 Polaroid camera, Polaroid Land camera\nn03976657 pole\nn03977966 police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria\nn03980874 poncho\nn03982430 pool table, billiard table, snooker table\nn03983396 pop bottle, soda bottle\nn03991062 pot, flowerpot\nn03992509 potter's wheel\nn03995372 power drill\nn03998194 prayer rug, prayer mat\nn04004767 printer\nn04005630 prison, prison house\nn04008634 projectile, missile\nn04009552 projector\nn04019541 puck, hockey puck\nn04023962 punching bag, punch bag, punching ball, punchball\nn04026417 purse\nn04033901 quill, quill pen\nn04033995 quilt, comforter, comfort, puff\nn04037443 racer, race car, racing car\nn04039381 racket, racquet\nn04040759 radiator\nn04041544 radio, wireless\nn04044716 radio telescope, radio 
reflector\nn04049303 rain barrel\nn04065272 recreational vehicle, RV, R.V.\nn04067472 reel\nn04069434 reflex camera\nn04070727 refrigerator, icebox\nn04074963 remote control, remote\nn04081281 restaurant, eating house, eating place, eatery\nn04086273 revolver, six-gun, six-shooter\nn04090263 rifle\nn04099969 rocking chair, rocker\nn04111531 rotisserie\nn04116512 rubber eraser, rubber, pencil eraser\nn04118538 rugby ball\nn04118776 rule, ruler\nn04120489 running shoe\nn04125021 safe\nn04127249 safety pin\nn04131690 saltshaker, salt shaker\nn04133789 sandal\nn04136333 sarong\nn04141076 sax, saxophone\nn04141327 scabbard\nn04141975 scale, weighing machine\nn04146614 school bus\nn04147183 schooner\nn04149813 scoreboard\nn04152593 screen, CRT screen\nn04153751 screw\nn04154565 screwdriver\nn04162706 seat belt, seatbelt\nn04179913 sewing machine\nn04192698 shield, buckler\nn04200800 shoe shop, shoe-shop, shoe store\nn04201297 shoji\nn04204238 shopping basket\nn04204347 shopping cart\nn04208210 shovel\nn04209133 shower cap\nn04209239 shower curtain\nn04228054 ski\nn04229816 ski mask\nn04235860 sleeping bag\nn04238763 slide rule, slipstick\nn04239074 sliding door\nn04243546 slot, one-armed bandit\nn04251144 snorkel\nn04252077 snowmobile\nn04252225 snowplow, snowplough\nn04254120 soap dispenser\nn04254680 soccer ball\nn04254777 sock\nn04258138 solar dish, solar collector, solar furnace\nn04259630 sombrero\nn04263257 soup bowl\nn04264628 space bar\nn04265275 space heater\nn04266014 space shuttle\nn04270147 spatula\nn04273569 speedboat\nn04275548 spider web, spider's web\nn04277352 spindle\nn04285008 sports car, sport car\nn04286575 spotlight, spot\nn04296562 stage\nn04310018 steam locomotive\nn04311004 steel arch bridge\nn04311174 steel drum\nn04317175 stethoscope\nn04325704 stole\nn04326547 stone wall\nn04328186 stopwatch, stop watch\nn04330267 stove\nn04332243 strainer\nn04335435 streetcar, tram, tramcar, trolley, trolley car\nn04336792 stretcher\nn04344873 studio couch, 
day bed\nn04346328 stupa, tope\nn04347754 submarine, pigboat, sub, U-boat\nn04350905 suit, suit of clothes\nn04355338 sundial\nn04355933 sunglass\nn04356056 sunglasses, dark glasses, shades\nn04357314 sunscreen, sunblock, sun blocker\nn04366367 suspension bridge\nn04367480 swab, swob, mop\nn04370456 sweatshirt\nn04371430 swimming trunks, bathing trunks\nn04371774 swing\nn04372370 switch, electric switch, electrical switch\nn04376876 syringe\nn04380533 table lamp\nn04389033 tank, army tank, armored combat vehicle, armoured combat vehicle\nn04392985 tape player\nn04398044 teapot\nn04399382 teddy, teddy bear\nn04404412 television, television system\nn04409515 tennis ball\nn04417672 thatch, thatched roof\nn04418357 theater curtain, theatre curtain\nn04423845 thimble\nn04428191 thresher, thrasher, threshing machine\nn04429376 throne\nn04435653 tile roof\nn04442312 toaster\nn04443257 tobacco shop, tobacconist shop, tobacconist\nn04447861 toilet seat\nn04456115 torch\nn04458633 totem pole\nn04461696 tow truck, tow car, wrecker\nn04462240 toyshop\nn04465501 tractor\nn04467665 trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi\nn04476259 tray\nn04479046 trench coat\nn04482393 tricycle, trike, velocipede\nn04483307 trimaran\nn04485082 tripod\nn04486054 triumphal arch\nn04487081 trolleybus, trolley coach, trackless trolley\nn04487394 trombone\nn04493381 tub, vat\nn04501370 turnstile\nn04505470 typewriter keyboard\nn04507155 umbrella\nn04509417 unicycle, monocycle\nn04515003 upright, upright piano\nn04517823 vacuum, vacuum cleaner\nn04522168 vase\nn04523525 vault\nn04525038 velvet\nn04525305 vending machine\nn04532106 vestment\nn04532670 viaduct\nn04536866 violin, fiddle\nn04540053 volleyball\nn04542943 waffle iron\nn04548280 wall clock\nn04548362 wallet, billfold, notecase, pocketbook\nn04550184 wardrobe, closet, press\nn04552348 warplane, military plane\nn04553703 washbasin, handbasin, washbowl, lavabo, wash-hand basin\nn04554684 washer, automatic 
washer, washing machine\nn04557648 water bottle\nn04560804 water jug\nn04562935 water tower\nn04579145 whiskey jug\nn04579432 whistle\nn04584207 wig\nn04589890 window screen\nn04590129 window shade\nn04591157 Windsor tie\nn04591713 wine bottle\nn04592741 wing\nn04596742 wok\nn04597913 wooden spoon\nn04599235 wool, woolen, woollen\nn04604644 worm fence, snake fence, snake-rail fence, Virginia fence\nn04606251 wreck\nn04612504 yawl\nn04613696 yurt\nn06359193 web site, website, internet site, site\nn06596364 comic book\nn06785654 crossword puzzle, crossword\nn06794110 street sign\nn06874185 traffic light, traffic signal, stoplight\nn07248320 book jacket, dust cover, dust jacket, dust wrapper\nn07565083 menu\nn07579787 plate\nn07583066 guacamole\nn07584110 consomme\nn07590611 hot pot, hotpot\nn07613480 trifle\nn07614500 ice cream, icecream\nn07615774 ice lolly, lolly, lollipop, popsicle\nn07684084 French loaf\nn07693725 bagel, beigel\nn07695742 pretzel\nn07697313 cheeseburger\nn07697537 hotdog, hot dog, red hot\nn07711569 mashed potato\nn07714571 head cabbage\nn07714990 broccoli\nn07715103 cauliflower\nn07716358 zucchini, courgette\nn07716906 spaghetti squash\nn07717410 acorn squash\nn07717556 butternut squash\nn07718472 cucumber, cuke\nn07718747 artichoke, globe artichoke\nn07720875 bell pepper\nn07730033 cardoon\nn07734744 mushroom\nn07742313 Granny Smith\nn07745940 strawberry\nn07747607 orange\nn07749582 lemon\nn07753113 fig\nn07753275 pineapple, ananas\nn07753592 banana\nn07754684 jackfruit, jak, jack\nn07760859 custard apple\nn07768694 pomegranate\nn07802026 hay\nn07831146 carbonara\nn07836838 chocolate sauce, chocolate syrup\nn07860988 dough\nn07871810 meat loaf, meatloaf\nn07873807 pizza, pizza pie\nn07875152 potpie\nn07880968 burrito\nn07892512 red wine\nn07920052 espresso\nn07930864 cup\nn07932039 eggnog\nn09193705 alp\nn09229709 bubble\nn09246464 cliff, drop, drop-off\nn09256479 coral reef\nn09288635 geyser\nn09332890 lakeside, lakeshore\nn09399592 
promontory, headland, head, foreland\nn09421951 sandbar, sand bar\nn09428293 seashore, coast, seacoast, sea-coast\nn09468604 valley, vale\nn09472597 volcano\nn09835506 ballplayer, baseball player\nn10148035 groom, bridegroom\nn10565667 scuba diver\nn11879895 rapeseed\nn11939491 daisy\nn12057211 yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum\nn12144580 corn\nn12267677 acorn\nn12620546 hip, rose hip, rosehip\nn12768682 buckeye, horse chestnut, conker\nn12985857 coral fungus\nn12998815 agaric\nn13037406 gyromitra\nn13040303 stinkhorn, carrion fungus\nn13044778 earthstar\nn13052670 hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa\nn13054560 bolete\nn13133613 ear, spike, capitulum\nn15075141 toilet tissue, toilet paper, bathroom tissue\n"
  },
  {
    "path": "examples/whisper.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// whisper speech recognition implemented with ncnn library\n\n// convert openai-whisper checkpoints to ncnn models\n//  1. install pnnx via pip install -U pnnx\n//  2. obtain export_ncnn.py script from https://github.com/nihui/ncnn-android-whisper\n//  3. edit export_ncnn.py for changing the models among tiny/base/small/medium/large-v3-turbo\n//  4. make sure you have good internet connection\n//      python export_ncnn.py\n\n// convert vocab.json to simple whisper_vocab.txt\n//  1. obtain vocab.json file from https://huggingface.co/openai/whisper-tiny/blob/main/vocab.json\n//  2. convert json dict into plain list, save to whisper_vocab.txt\n\n// NOTE large-v3-turbo has special token ids from others, one more language(yue) and does not support translation\n\n#include \"net.h\"\n#include \"layer.h\"\n#include \"layer_type.h\"\n\n#include <float.h>\n#include <math.h>\n#include <stdint.h>\n#include <stdio.h>\n#include <algorithm>\n#include <string>\n#include <vector>\n\n// https://huggingface.co/openai/whisper-tiny/blob/main/tokenizer_config.json\nstatic const int token_endoftext = 50257;\nstatic const int token_startoftranscript = 50258;\nstatic const int token_lang_first = 50259;\nstatic const int token_lang_last = 50357;\nstatic const int token_lang_count = token_lang_last - token_lang_first + 1;\n// clang-format off\n// *INDENT-OFF*\nstatic const char* token_langs[] = {\n    \"en\", \"zh\", \"de\", \"es\", \"ru\", \"ko\", \"fr\", \"ja\", \"pt\", \"tr\", \"pl\", \"ca\", \"nl\", \"ar\", \"sv\",\n    \"it\", \"id\", \"hi\", \"fi\", \"vi\", \"he\", \"uk\", \"el\", \"ms\", \"cs\", \"ro\", \"da\", \"hu\", \"ta\", \"no\",\n    \"th\", \"ur\", \"hr\", \"bg\", \"lt\", \"la\", \"mi\", \"ml\", \"cy\", \"sk\", \"te\", \"fa\", \"lv\", \"bn\", \"sr\",\n    \"az\", \"sl\", \"kn\", \"et\", \"mk\", \"br\", \"eu\", \"is\", \"hy\", \"ne\", \"mn\", \"bs\", \"kk\", \"sq\", \"sw\",\n    \"gl\", 
\"mr\", \"pa\", \"si\", \"km\", \"sn\", \"yo\", \"so\", \"af\", \"oc\", \"ka\", \"be\", \"tg\", \"sd\", \"gu\",\n    \"am\", \"yi\", \"lo\", \"uz\", \"fo\", \"ht\", \"ps\", \"tk\", \"nn\", \"mt\", \"sa\", \"lb\", \"my\", \"bo\", \"tl\",\n    \"mg\", \"as\", \"tt\", \"haw\", \"ln\", \"ha\", \"ba\", \"jw\", \"su\"\n};\n// *INDENT-ON*\n// clang-format on\nstatic const int token_translate = 50358;\nstatic const int token_transcribe = 50359;\nstatic const int token_startoflm = 50360;\nstatic const int token_startofprev = 50361;\nstatic const int token_nocaptions = 50362;\nstatic const int token_notimestamps = 50363;\nstatic const int token_timestamp_first = 50364;\nstatic const int token_timestamp_last = 51864;\n\n// https://huggingface.co/openai/whisper-large-v3-turbo/blob/main/tokenizer_config.json\n// static const int token_endoftext = 50257;\n// static const int token_startoftranscript = 50258;\n// static const int token_lang_first = 50259;\n// static const int token_lang_last = 50357;\n// static const int token_lang_count = token_lang_last - token_lang_first + 1;\n// // clang-format off\n// // *INDENT-OFF*\n// static const char* token_langs[] = {\n//     \"en\", \"zh\", \"de\", \"es\", \"ru\", \"ko\", \"fr\", \"ja\", \"pt\", \"tr\", \"pl\", \"ca\", \"nl\", \"ar\", \"sv\",\n//     \"it\", \"id\", \"hi\", \"fi\", \"vi\", \"he\", \"uk\", \"el\", \"ms\", \"cs\", \"ro\", \"da\", \"hu\", \"ta\", \"no\",\n//     \"th\", \"ur\", \"hr\", \"bg\", \"lt\", \"la\", \"mi\", \"ml\", \"cy\", \"sk\", \"te\", \"fa\", \"lv\", \"bn\", \"sr\",\n//     \"az\", \"sl\", \"kn\", \"et\", \"mk\", \"br\", \"eu\", \"is\", \"hy\", \"ne\", \"mn\", \"bs\", \"kk\", \"sq\", \"sw\",\n//     \"gl\", \"mr\", \"pa\", \"si\", \"km\", \"sn\", \"yo\", \"so\", \"af\", \"oc\", \"ka\", \"be\", \"tg\", \"sd\", \"gu\",\n//     \"am\", \"yi\", \"lo\", \"uz\", \"fo\", \"ht\", \"ps\", \"tk\", \"nn\", \"mt\", \"sa\", \"lb\", \"my\", \"bo\", \"tl\",\n//     \"mg\", \"as\", \"tt\", \"haw\", \"ln\", \"ha\", \"ba\", 
\"jw\", \"su\", \"yue\"\n// };\n// // *INDENT-ON*\n// // clang-format on\n// static const int token_translate = 50359;\n// static const int token_transcribe = 50360;\n// static const int token_startoflm = 50361;\n// static const int token_startofprev = 50362;\n// static const int token_nospeech = 50363;\n// static const int token_notimestamps = 50364;\n// static const int token_timestamp_first = 50365;\n// static const int token_timestamp_last = 51865;\n\n// tokenizer for handling text tokens\nclass Tokenizer\n{\npublic:\n    std::vector<std::string> reverse_vocab;\n\n    uint8_t byte_decoder[512]; // unicode code point to byte value\n\n    // generate byte decoder for tokenization\n    void generate_byte_decoder()\n    {\n        // initialize array to 0\n        memset(byte_decoder, 0, 512 * sizeof(uint8_t));\n\n        // define function to check if char is in \"printable\" range\n        auto is_printable = [](int b) {\n            return (b >= '!' && b <= '~')     // '!' to '~'\n                   || (b >= 161 && b <= 172)  // '¡' to '¬'\n                   || (b >= 174 && b <= 255); // '®' to 'ÿ'\n        };\n\n        // handle \"printable\" characters\n        // for these chars, key and value are the same\n        for (int b = 0; b < 256; ++b)\n        {\n            if (is_printable(b))\n            {\n                byte_decoder[b] = static_cast<uint8_t>(b);\n            }\n        }\n\n        // handle remaining characters\n        // for these chars, key starts from 256 and increments\n        int n = 0;\n        for (int b = 0; b < 256; ++b)\n        {\n            if (!is_printable(b))\n            {\n                byte_decoder[256 + n] = static_cast<uint8_t>(b);\n                n++;\n            }\n        }\n    }\n\n    // convert utf-8 string to code points\n    std::vector<uint32_t> utf8_to_codepoints(const std::string& s) const\n    {\n        std::vector<uint32_t> codepoints;\n        for (size_t i = 0; i < s.length();)\n        {\n       
     uint32_t cp = 0;\n            int len = 0;\n            unsigned char c = s[i];\n\n            if (c < 0x80) // 1-byte\n            {\n                cp = c;\n                len = 1;\n            }\n            else if ((c & 0xE0) == 0xC0) // 2-byte\n            {\n                cp = ((s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F);\n                len = 2;\n            }\n            else if ((c & 0xF0) == 0xE0) // 3-byte\n            {\n                cp = ((s[i] & 0x0F) << 12) | ((s[i + 1] & 0x3F) << 6) | (s[i + 2] & 0x3F);\n                len = 3;\n            }\n            else if ((c & 0xF8) == 0xF0) // 4-byte\n            {\n                cp = ((s[i] & 0x07) << 18) | ((s[i + 1] & 0x3F) << 12) | ((s[i + 2] & 0x3F) << 6) | (s[i + 3] & 0x3F);\n                len = 4;\n            }\n            else\n            {\n                // invalid utf-8 start byte, skip\n                i++;\n                continue;\n            }\n            codepoints.push_back(cp);\n            i += len;\n        }\n        return codepoints;\n    }\n\n    bool load(const char* vocab_path)\n    {\n        // generate decoder when loading\n        generate_byte_decoder();\n\n        {\n            FILE* fp = fopen(vocab_path, \"rb\");\n            if (!fp)\n            {\n                fprintf(stderr, \"fopen %s failed\\n\", vocab_path);\n                return false;\n            }\n\n            char line[256];\n            while (!feof(fp))\n            {\n                char* s = fgets(line, 255, fp);\n                if (!s)\n                    break;\n\n                int vocab_len = strlen(line);\n                if (vocab_len > 1)\n                {\n                    // drop the tail newline\n                    vocab_len -= 1;\n                }\n\n                reverse_vocab.push_back(std::string(line, vocab_len));\n            }\n\n            fclose(fp);\n        }\n\n        return true;\n    }\n\n    // decode token ids to text\n    std::string 
decode(const std::vector<int>& tokens) const\n    {\n        std::string outstring;\n        bool in_timestamp = false;\n\n        // step 1: concatenate token ids to a string with special unicode characters\n        std::string text_buffer;\n        for (int token_id : tokens)\n        {\n            if (token_id < token_endoftext)\n            {\n                text_buffer += reverse_vocab[token_id];\n                continue;\n            }\n\n            // handle timestamp tokens\n            if (token_id >= token_timestamp_first && token_id <= token_timestamp_last)\n            {\n                int timestamp = (token_id - token_timestamp_first) * 2;\n\n                char tmp[256];\n                sprintf(tmp, \" [%d.%02d] \", timestamp / 100, timestamp % 100);\n\n                if (in_timestamp)\n                {\n                    // step 2: translate the special string back to original byte stream\n                    std::vector<uint32_t> codepoints = utf8_to_codepoints(text_buffer);\n\n                    std::vector<uint8_t> byte_sequence;\n                    for (uint32_t cp : codepoints)\n                    {\n                        byte_sequence.push_back(byte_decoder[cp]);\n                    }\n\n                    std::string s(byte_sequence.begin(), byte_sequence.end());\n\n                    text_buffer.clear();\n\n                    outstring += s;\n                    outstring += tmp;\n                    outstring += \"\\n\";\n\n                    in_timestamp = false;\n                }\n                else\n                {\n                    outstring += tmp;\n                    in_timestamp = true;\n                }\n            }\n\n            // ignore functional/special tokens\n        }\n\n        if (!text_buffer.empty())\n        {\n            // step 2: translate the special string back to original byte stream\n            std::vector<uint32_t> codepoints = utf8_to_codepoints(text_buffer);\n\n            
std::vector<uint8_t> byte_sequence;\n            for (uint32_t cp : codepoints)\n            {\n                byte_sequence.push_back(byte_decoder[cp]);\n            }\n\n            std::string s(byte_sequence.begin(), byte_sequence.end());\n\n            outstring += s;\n        }\n\n        return outstring;\n    }\n};\n\n// result class for beam search\nclass Result\n{\npublic:\n    std::vector<int> ids;\n    float score;\n\n    std::vector<ncnn::Mat> kvcache;\n};\n\n// main whisper implementation class\nclass Whisper\n{\npublic:\n    int load();\n\n    int detect_lang(const std::vector<short>& samples, std::string& lang) const;\n    int transcribe(const std::vector<short>& samples, const char* lang, std::string& text) const;\n\nprotected:\n    int extract_fbank_feature(const std::vector<short>& samples, ncnn::Mat& input_features) const;\n    int run_encoder(const ncnn::Mat& input_features, ncnn::Mat& encoder_states) const;\n    int run_decoder_prefill(const std::vector<int>& tokens, const ncnn::Mat& encoder_states, ncnn::Mat& last_logits, std::vector<ncnn::Mat>& out_kvcache) const;\n    int run_decoder_step(const std::vector<int>& tokens, const ncnn::Mat& encoder_states, ncnn::Mat& last_logits, const std::vector<ncnn::Mat>& kvcache, std::vector<ncnn::Mat>& out_kvcache) const;\n\nprotected:\n    ncnn::Net fbank;\n\n    ncnn::Net encoder;\n\n    ncnn::Net embed_token;\n    ncnn::Net embed_position;\n    ncnn::Net decoder;\n\n    ncnn::Net proj_out;\n\n    Tokenizer tokenizer;\n\nprotected:\n    std::vector<int> kv_cache_indexes;\n    std::vector<int> out_kv_cache_indexes;\n};\n\nint Whisper::load()\n{\n    // whisper models could be found at\n    // https://github.com/nihui/ncnn-android-whisper/releases\n    // https://github.com/nihui/ncnn-android-whisper/tree/master/app/src/main/assets\n\n    fbank.opt.use_vulkan_compute = true;\n    fbank.opt.use_fp16_packed = false;\n    fbank.opt.use_fp16_storage = false;\n    fbank.opt.use_fp16_arithmetic = false;\n\n    
encoder.opt.use_vulkan_compute = true;\n    encoder.opt.use_fp16_packed = false;\n    encoder.opt.use_fp16_storage = false;\n    encoder.opt.use_fp16_arithmetic = false;\n\n    decoder.opt.use_vulkan_compute = true;\n    decoder.opt.use_fp16_packed = false;\n    decoder.opt.use_fp16_storage = false;\n    decoder.opt.use_fp16_arithmetic = false;\n\n    proj_out.opt.use_vulkan_compute = true;\n    proj_out.opt.use_fp16_packed = false;\n    proj_out.opt.use_fp16_storage = false;\n    proj_out.opt.use_fp16_arithmetic = false;\n\n    fbank.load_param(\"whisper_tiny_fbank.ncnn.param\");\n    fbank.load_model(\"whisper_tiny_fbank.ncnn.bin\");\n\n    encoder.load_param(\"whisper_tiny_encoder.ncnn.param\");\n    encoder.load_model(\"whisper_tiny_encoder.ncnn.bin\");\n\n    embed_token.load_param(\"whisper_tiny_embed_token.ncnn.param\");\n    embed_token.load_model(\"whisper_tiny_embed_token.ncnn.bin\");\n\n    embed_position.load_param(\"whisper_tiny_embed_position.ncnn.param\");\n    embed_position.load_model(\"whisper_tiny_embed_position.ncnn.bin\");\n\n    decoder.load_param(\"whisper_tiny_decoder.ncnn.param\");\n    decoder.load_model(\"whisper_tiny_decoder.ncnn.bin\");\n\n    proj_out.load_param(\"whisper_tiny_proj_out.ncnn.param\");\n    proj_out.load_model(\"whisper_tiny_proj_out.ncnn.bin\");\n\n    // fbank.load_param(\"whisper_large_v3_turbo_fbank.ncnn.param\");\n    // fbank.load_model(\"whisper_large_v3_turbo_fbank.ncnn.bin\");\n    //\n    // encoder.load_param(\"whisper_large_v3_turbo_encoder.ncnn.param\");\n    // encoder.load_model(\"whisper_large_v3_turbo_encoder.ncnn.bin\");\n    //\n    // embed_token.load_param(\"whisper_large_v3_turbo_embed_token.ncnn.param\");\n    // embed_token.load_model(\"whisper_large_v3_turbo_embed_token.ncnn.bin\");\n    //\n    // embed_position.load_param(\"whisper_large_v3_turbo_embed_position.ncnn.param\");\n    // embed_position.load_model(\"whisper_large_v3_turbo_embed_position.ncnn.bin\");\n    //\n    // 
decoder.load_param(\"whisper_large_v3_turbo_decoder.ncnn.param\");\n    // decoder.load_model(\"whisper_large_v3_turbo_decoder.ncnn.bin\");\n    //\n    // proj_out.load_param(\"whisper_large_v3_turbo_proj_out.ncnn.param\");\n    // proj_out.load_model(\"whisper_large_v3_turbo_proj_out.ncnn.bin\");\n\n    tokenizer.load(\"whisper_vocab.txt\");\n\n    // resolve kv cache blob indexes\n    for (size_t i = 0; i < decoder.layers().size(); i++)\n    {\n        const ncnn::Layer* mha = decoder.layers()[i];\n        if (mha->typeindex != ncnn::LayerType::MultiHeadAttention)\n            continue;\n\n        const size_t input_count = mha->bottoms.size();\n        const size_t output_count = mha->tops.size();\n\n        if (output_count == 3)\n        {\n            kv_cache_indexes.push_back(mha->bottoms[input_count - 2]);\n            kv_cache_indexes.push_back(mha->bottoms[input_count - 1]);\n            out_kv_cache_indexes.push_back(mha->tops[output_count - 2]);\n            out_kv_cache_indexes.push_back(mha->tops[output_count - 1]);\n        }\n    }\n\n    return 0;\n}\n\n// apply log_softmax in-place\nstatic void log_softmax_inplace(ncnn::Mat& m)\n{\n    ncnn::Option opt;\n    opt.use_packing_layout = false;\n    opt.use_fp16_storage = false;\n\n    {\n        ncnn::Layer* softmax = ncnn::create_layer_cpu(\"Softmax\");\n        ncnn::ParamDict pd;\n        pd.set(0, 0); // axis\n        softmax->load_param(pd);\n        softmax->forward_inplace(m, opt);\n        delete softmax;\n    }\n\n    {\n        ncnn::Layer* log = ncnn::create_layer_cpu(\"UnaryOp\");\n        ncnn::ParamDict pd;\n        pd.set(0, 8); // log\n        log->load_param(pd);\n        log->forward_inplace(m, opt);\n        delete log;\n    }\n}\n\nint Whisper::detect_lang(const std::vector<short>& samples, std::string& lang) const\n{\n    std::vector<int> ids(1);\n    ids[0] = token_startoftranscript;\n\n    ncnn::Mat input_features;\n    extract_fbank_feature(samples, input_features);\n\n    
ncnn::Mat encoder_states;\n    run_encoder(input_features, encoder_states);\n\n    ncnn::Mat logits;\n    std::vector<ncnn::Mat> out_kvcache;\n    run_decoder_prefill(ids, encoder_states, logits, out_kvcache);\n\n    // find the lang token with highest prob\n    // we are only interested in lang part and no_speech\n    int lang_id = token_lang_first;\n    float max_prob = logits[token_lang_first];\n    for (int i = token_lang_first; i <= token_lang_last; i++)\n    {\n        float prob = logits[i];\n        if (prob > max_prob)\n        {\n            max_prob = prob;\n            lang_id = i;\n        }\n    }\n\n    lang = token_langs[lang_id - token_lang_first];\n\n    return 0;\n}\n\nint Whisper::transcribe(const std::vector<short>& samples, const char* lang, std::string& text) const\n{\n    // find lang token id by lang string\n    int token_lang = -1;\n    for (int i = 0; i < token_lang_count; i++)\n    {\n        if (strcmp(token_langs[i], lang) == 0)\n        {\n            token_lang = token_lang_first + i;\n            break;\n        }\n    }\n\n    if (token_lang == -1)\n    {\n        fprintf(stderr, \"language %s not supported\\n\", lang);\n        return -1;\n    }\n\n    // initialize with prompt tokens\n    std::vector<int> ids(4);\n    ids[0] = token_startoftranscript;\n    ids[1] = token_lang;\n    ids[2] = token_transcribe;\n    ids[3] = token_notimestamps;\n\n    ncnn::Mat input_features;\n    extract_fbank_feature(samples, input_features);\n\n    ncnn::Mat encoder_states;\n    run_encoder(input_features, encoder_states);\n\n    const int beam_size = 5;\n    const int max_candidates = 5;\n\n    std::vector<Result> finished_beams;\n\n    std::vector<Result> beams(1);\n    beams[0].ids = ids;\n    beams[0].score = 0.f;\n\n    int step = 0;\n\n    // beam search loop\n    for (;;)\n    {\n        std::vector<Result> candidates;\n\n        for (size_t i = 0; i < beams.size(); i++)\n        {\n            const Result& beam = beams[i];\n\n           
 ncnn::Mat logits;\n            std::vector<ncnn::Mat> out_kvcache;\n            if (step == 0)\n            {\n                run_decoder_prefill(beam.ids, encoder_states, logits, out_kvcache);\n            }\n            else\n            {\n                run_decoder_step(beam.ids, encoder_states, logits, beam.kvcache, out_kvcache);\n            }\n\n            log_softmax_inplace(logits);\n\n            // get topk candidates\n            const int topk = 5;\n            std::vector<std::pair<float, int> > vec(logits.w);\n            for (int j = 0; j < logits.w; j++)\n            {\n                vec[j] = std::make_pair(logits[j], j);\n            }\n            std::partial_sort(vec.begin(), vec.begin() + topk, vec.end(), std::greater<std::pair<float, int> >());\n\n            for (int j = 0; j < topk; j++)\n            {\n                int next_id = vec[j].second;\n                float next_id_score = vec[j].first;\n\n                Result candidate;\n                candidate.ids = beam.ids;\n                candidate.ids.push_back(next_id);\n                candidate.score = beam.score + next_id_score;\n                candidate.kvcache = out_kvcache;\n\n                candidates.push_back(candidate);\n            }\n        }\n\n        // sort candidates by score\n        std::sort(candidates.begin(), candidates.end(), [](const Result& a, const Result& b) {\n            return a.score > b.score;\n        });\n\n        beams.clear();\n        for (size_t i = 0; i < candidates.size(); i++)\n        {\n            const Result& candidate = candidates[i];\n\n            if (candidate.ids.back() == token_endoftext)\n            {\n                finished_beams.push_back(candidate);\n            }\n            else\n            {\n                beams.push_back(candidate);\n            }\n        }\n\n        if (beams.size() > beam_size)\n        {\n            beams.resize(beam_size);\n        }\n\n        step++;\n\n        if (beams.empty())\n 
       {\n            break;\n        }\n\n        if (finished_beams.size() >= max_candidates)\n        {\n            break;\n        }\n    }\n\n    if (finished_beams.empty())\n    {\n        // no results\n        return 0;\n    }\n\n    // find the best result based on average score\n    int max_avg_score_index = 0;\n    float max_avg_score = -FLT_MAX;\n    for (size_t i = 0; i < finished_beams.size(); i++)\n    {\n        const Result& result = finished_beams[i];\n        float avg_score = result.score / result.ids.size();\n        if (avg_score > max_avg_score)\n        {\n            max_avg_score_index = (int)i;\n            max_avg_score = avg_score;\n        }\n    }\n\n    const Result& best_result = finished_beams[max_avg_score_index];\n\n    text = tokenizer.decode(best_result.ids);\n\n    return 0;\n}\n\nint Whisper::extract_fbank_feature(const std::vector<short>& samples, ncnn::Mat& input_features) const\n{\n    const int samples_size = (int)samples.size();\n\n    // pad to 480000, normalize samples to -1~1\n    ncnn::Mat waveform(480000);\n    waveform.fill(0.f);\n    {\n        for (int i = 0; i < samples_size; i++)\n        {\n            waveform[i] = samples[i] / 32768.0f;\n        }\n    }\n\n    ncnn::Extractor ex = fbank.create_extractor();\n\n    ex.input(\"in0\", waveform);\n\n    ex.extract(\"out0\", input_features);\n\n    // drop the last frame\n    {\n        ncnn::Mat input_features_3k(input_features.w - 1, input_features.h);\n        for (int i = 0; i < input_features.h; i++)\n        {\n            memcpy(input_features_3k.row(i), input_features.row(i), (input_features.w - 1) * sizeof(float));\n        }\n        input_features = input_features_3k;\n    }\n\n    return 0;\n}\n\nint Whisper::run_encoder(const ncnn::Mat& input_features, ncnn::Mat& encoder_states) const\n{\n    ncnn::Extractor ex = encoder.create_extractor();\n\n    ex.input(\"in0\", input_features);\n\n    ex.extract(\"out0\", encoder_states);\n\n    return 
0;\n}\n\nint Whisper::run_decoder_prefill(const std::vector<int>& tokens, const ncnn::Mat& encoder_states, ncnn::Mat& last_logits, std::vector<ncnn::Mat>& out_kvcache) const\n{\n    const int dst_seqlen = tokens.size();\n\n    // token embedding\n    ncnn::Mat token_embeds;\n    {\n        ncnn::Mat input_tokens(dst_seqlen);\n        int* p = input_tokens;\n        memcpy(p, tokens.data(), tokens.size() * sizeof(int));\n\n        ncnn::Extractor ex = embed_token.create_extractor();\n        ex.input(\"in0\", input_tokens);\n        ex.extract(\"out0\", token_embeds);\n    }\n\n    // position embedding\n    ncnn::Mat position_embeds;\n    {\n        ncnn::Mat input_positions(dst_seqlen);\n        int* p = input_positions;\n        for (int i = 0; i < dst_seqlen; i++)\n        {\n            p[i] = i;\n        }\n\n        ncnn::Extractor ex = embed_position.create_extractor();\n        ex.input(\"in0\", input_positions);\n        ex.extract(\"out0\", position_embeds);\n    }\n\n    // input embedding = token + position\n    ncnn::Mat input_embeds;\n    {\n        input_embeds.create_like(token_embeds);\n        for (int i = 0; i < input_embeds.total(); i++)\n        {\n            input_embeds[i] = token_embeds[i] + position_embeds[i];\n        }\n    }\n\n    // create attention mask (causal mask)\n    ncnn::Mat attention_mask(dst_seqlen, dst_seqlen);\n    attention_mask.fill(0.f);\n    for (int i = 0; i < dst_seqlen; i++)\n    {\n        for (int j = i + 1; j < dst_seqlen; j++)\n        {\n            attention_mask.row(i)[j] = -INFINITY;\n        }\n    }\n\n    ncnn::Mat output_states;\n    {\n        ncnn::Extractor ex = decoder.create_extractor();\n        ex.input(\"in0\", input_embeds);\n        ex.input(\"in1\", encoder_states);\n        ex.input(\"in2\", attention_mask);\n\n        out_kvcache.resize(out_kv_cache_indexes.size());\n        for (size_t i = 0; i < out_kv_cache_indexes.size(); i++)\n        {\n            ex.extract(out_kv_cache_indexes[i], 
out_kvcache[i], 1);\n        }\n\n        ex.extract(\"out0\", output_states);\n    }\n\n    // get last token's state for next token prediction\n    ncnn::Mat last_state = output_states.row_range(dst_seqlen - 1, 1).clone();\n    {\n        ncnn::Extractor ex = proj_out.create_extractor();\n        ex.input(\"in0\", last_state);\n        ex.extract(\"out0\", last_logits);\n    }\n\n    last_logits = last_logits.reshape(last_logits.w);\n\n    return 0;\n}\n\nint Whisper::run_decoder_step(const std::vector<int>& tokens, const ncnn::Mat& encoder_states, ncnn::Mat& last_logits, const std::vector<ncnn::Mat>& kvcache, std::vector<ncnn::Mat>& out_kvcache) const\n{\n    const int token_id = tokens.back();\n    const int dst_seqlen = 1;\n\n    // token embedding\n    ncnn::Mat token_embeds;\n    {\n        ncnn::Mat input_tokens(dst_seqlen);\n        ((int*)input_tokens)[0] = token_id;\n\n        ncnn::Extractor ex = embed_token.create_extractor();\n        ex.input(\"in0\", input_tokens);\n        ex.extract(\"out0\", token_embeds);\n    }\n\n    // position embedding\n    ncnn::Mat position_embeds;\n    {\n        ncnn::Mat input_positions(dst_seqlen);\n        ((int*)input_positions)[0] = tokens.size() - 1;\n\n        ncnn::Extractor ex = embed_position.create_extractor();\n        ex.input(\"in0\", input_positions);\n        ex.extract(\"out0\", position_embeds);\n    }\n\n    // input embedding = token + position\n    ncnn::Mat input_embeds;\n    {\n        input_embeds.create_like(token_embeds);\n        for (int i = 0; i < input_embeds.total(); i++)\n        {\n            input_embeds[i] = token_embeds[i] + position_embeds[i];\n        }\n    }\n\n    // single token doesn't need attention mask\n    ncnn::Mat attention_mask(dst_seqlen, dst_seqlen);\n    attention_mask.fill(0.f);\n\n    ncnn::Mat output_states;\n    {\n        ncnn::Extractor ex = decoder.create_extractor();\n        ex.input(\"in0\", input_embeds);\n        ex.input(\"in1\", encoder_states);\n       
 ex.input(\"in2\", attention_mask);\n\n        // pass in kv cache from previous steps\n        for (size_t i = 0; i < kv_cache_indexes.size(); i++)\n        {\n            ex.input(kv_cache_indexes[i], kvcache[i]);\n        }\n\n        // extract updated kv cache\n        out_kvcache.resize(out_kv_cache_indexes.size());\n        for (size_t i = 0; i < out_kv_cache_indexes.size(); i++)\n        {\n            ex.extract(out_kv_cache_indexes[i], out_kvcache[i], 1);\n        }\n\n        ex.extract(\"out0\", output_states);\n    }\n\n    // get last token's state for prediction\n    ncnn::Mat last_state = output_states.row_range(dst_seqlen - 1, 1).clone();\n    {\n        ncnn::Extractor ex = proj_out.create_extractor();\n        ex.input(\"in0\", last_state);\n        ex.extract(\"out0\", last_logits);\n    }\n\n    last_logits = last_logits.reshape(last_logits.w);\n\n    return 0;\n}\n\nstatic int load_wav_samples(const char* wavpath, std::vector<short>& samples)\n{\n    FILE* fp = fopen(wavpath, \"rb\");\n    if (!fp)\n    {\n        fprintf(stderr, \"open %s failed\\n\", wavpath);\n        return -1;\n    }\n\n// https://stackoverflow.com/questions/1537964/visual-c-equivalent-of-gccs-attribute-packed\n#ifdef _MSC_VER\n#define PACK(__Declaration__) __pragma(pack(push, 1)) __Declaration__ __pragma(pack(pop))\n#else\n#define PACK(__Declaration__) __Declaration__ __attribute__((__packed__))\n#endif\n\n    PACK(struct wav_header {\n        char riff[4];\n        uint32_t chunk_size;\n        char wave[4];\n        char fmt[4];\n        uint32_t subchunk1_size;\n        uint16_t audio_format;\n        uint16_t num_channels;\n        uint32_t sample_rate;\n        uint32_t byte_rate;\n        uint16_t block_align;\n        uint16_t bits_per_sample;\n        char data[4];\n        uint32_t data_size;\n    });\n\n    wav_header header;\n    if (fread(&header, sizeof(wav_header), 1, fp) != 1)\n    {\n        fprintf(stderr, \"failed to read wav header from %s\\n\", 
wavpath);\n        fclose(fp);\n        return -1;\n    }\n\n    if (memcmp(header.riff, \"RIFF\", 4) != 0 || memcmp(header.wave, \"WAVE\", 4) != 0\n            || memcmp(header.fmt, \"fmt \", 4) != 0 || memcmp(header.data, \"data\", 4) != 0)\n    {\n        fprintf(stderr, \"%s is not a valid wav file\\n\", wavpath);\n        fclose(fp);\n        return -1;\n    }\n\n    if (header.subchunk1_size != 16 || header.audio_format != 1 || header.num_channels != 1\n            || header.sample_rate != 16000 || header.bits_per_sample != 16)\n    {\n        fprintf(stderr, \"%s is not pcm s16le 16k wav\\n\", wavpath);\n        fprintf(stderr, \"ffmpeg -i input.xxx -vn -c:a pcm_s16le -ac 1 -ar 16000 -fflags bitexact output.wav\\n\");\n        fclose(fp);\n        return -1;\n    }\n\n    fseek(fp, 0, SEEK_END);\n    long len = ftell(fp);\n\n    samples.resize((len - sizeof(wav_header)) / sizeof(short));\n\n    rewind(fp);\n\n    fseek(fp, sizeof(wav_header), SEEK_SET);\n\n    fread(samples.data(), 1, len - sizeof(wav_header), fp);\n\n    fclose(fp);\n\n    return 0;\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [wavpath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* wavpath = argv[1];\n\n    std::vector<short> samples;\n    int ret = load_wav_samples(wavpath, samples);\n    if (ret != 0)\n    {\n        fprintf(stderr, \"load wav failed\\n\");\n        return -1;\n    }\n\n    if (samples.size() > 480000)\n    {\n        fprintf(stderr, \"audio duration too long, truncate to 30s\\n\");\n        samples.resize(480000);\n    }\n\n    Whisper whisper;\n    whisper.load();\n\n    // detect language first\n    std::string lang;\n    whisper.detect_lang(samples, lang);\n    fprintf(stderr, \"lang = %s\\n\", lang.c_str());\n\n    // transcribe audio to text\n    std::string text;\n    whisper.transcribe(samples, lang.c_str(), text);\n    fprintf(stderr, \"text = %s\\n\", text.c_str());\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolact.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n    std::vector<float> maskdata;\n    cv::Mat mask;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = 
faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic int detect_yolact(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolact;\n\n    yolact.opt.use_vulkan_compute = true;\n\n    // original model converted from https://github.com/dbolya/yolact\n    // yolact_resnet50_54_800000.pth\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (yolact.load_param(\"yolact.param\"))\n        exit(-1);\n    if (yolact.load_model(\"yolact.bin\"))\n        exit(-1);\n\n    const int target_size = 550;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, target_size, target_size);\n\n    const float mean_vals[3] = {123.68f, 116.78f, 103.94f};\n    const float norm_vals[3] = {1.0 / 58.40f, 1.0 / 57.12f, 1.0 / 57.38f};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = yolact.create_extractor();\n\n    ex.input(\"input.1\", in);\n\n    ncnn::Mat maskmaps;\n    ncnn::Mat location;\n    ncnn::Mat mask;\n    ncnn::Mat confidence;\n\n    ex.extract(\"619\", maskmaps); // 138x138 x 32\n\n    ex.extract(\"816\", location);   // 4 x 19248\n    ex.extract(\"818\", mask);       // maskdim 32 x 19248\n    ex.extract(\"820\", confidence); // 81 x 19248\n\n    int num_class = confidence.w;\n    int num_priors = confidence.h;\n\n    // 
make priorbox\n    ncnn::Mat priorbox(4, num_priors);\n    {\n        const int conv_ws[5] = {69, 35, 18, 9, 5};\n        const int conv_hs[5] = {69, 35, 18, 9, 5};\n\n        const float aspect_ratios[3] = {1.f, 0.5f, 2.f};\n        const float scales[5] = {24.f, 48.f, 96.f, 192.f, 384.f};\n\n        float* pb = priorbox;\n\n        for (int p = 0; p < 5; p++)\n        {\n            int conv_w = conv_ws[p];\n            int conv_h = conv_hs[p];\n\n            float scale = scales[p];\n\n            for (int i = 0; i < conv_h; i++)\n            {\n                for (int j = 0; j < conv_w; j++)\n                {\n                    // +0.5 because priors are in center-size notation\n                    float cx = (j + 0.5f) / conv_w;\n                    float cy = (i + 0.5f) / conv_h;\n\n                    for (int k = 0; k < 3; k++)\n                    {\n                        float ar = aspect_ratios[k];\n\n                        ar = sqrt(ar);\n\n                        float w = scale * ar / 550;\n                        float h = scale / ar / 550;\n\n                        // This is for backward compatibility with a bug where I made everything square by accident\n                        // cfg.backbone.use_square_anchors:\n                        h = w;\n\n                        pb[0] = cx;\n                        pb[1] = cy;\n                        pb[2] = w;\n                        pb[3] = h;\n\n                        pb += 4;\n                    }\n                }\n            }\n        }\n    }\n\n    const float confidence_thresh = 0.05f;\n    const float nms_threshold = 0.5f;\n    const int keep_top_k = 200;\n\n    std::vector<std::vector<Object> > class_candidates;\n    class_candidates.resize(num_class);\n\n    for (int i = 0; i < num_priors; i++)\n    {\n        const float* conf = confidence.row(i);\n        const float* loc = location.row(i);\n        const float* pb = priorbox.row(i);\n        const float* maskdata = 
mask.row(i);\n\n        // find class id with highest score\n        // start from 1 to skip background\n        int label = 0;\n        float score = 0.f;\n        for (int j = 1; j < num_class; j++)\n        {\n            float class_score = conf[j];\n            if (class_score > score)\n            {\n                label = j;\n                score = class_score;\n            }\n        }\n\n        // ignore background or low score\n        if (label == 0 || score <= confidence_thresh)\n            continue;\n\n        // CENTER_SIZE\n        float var[4] = {0.1f, 0.1f, 0.2f, 0.2f};\n\n        float pb_cx = pb[0];\n        float pb_cy = pb[1];\n        float pb_w = pb[2];\n        float pb_h = pb[3];\n\n        float bbox_cx = var[0] * loc[0] * pb_w + pb_cx;\n        float bbox_cy = var[1] * loc[1] * pb_h + pb_cy;\n        float bbox_w = (float)(exp(var[2] * loc[2]) * pb_w);\n        float bbox_h = (float)(exp(var[3] * loc[3]) * pb_h);\n\n        float obj_x1 = bbox_cx - bbox_w * 0.5f;\n        float obj_y1 = bbox_cy - bbox_h * 0.5f;\n        float obj_x2 = bbox_cx + bbox_w * 0.5f;\n        float obj_y2 = bbox_cy + bbox_h * 0.5f;\n\n        // clip\n        obj_x1 = std::max(std::min(obj_x1 * bgr.cols, (float)(bgr.cols - 1)), 0.f);\n        obj_y1 = std::max(std::min(obj_y1 * bgr.rows, (float)(bgr.rows - 1)), 0.f);\n        obj_x2 = std::max(std::min(obj_x2 * bgr.cols, (float)(bgr.cols - 1)), 0.f);\n        obj_y2 = std::max(std::min(obj_y2 * bgr.rows, (float)(bgr.rows - 1)), 0.f);\n\n        // append object\n        Object obj;\n        obj.rect = cv::Rect_<float>(obj_x1, obj_y1, obj_x2 - obj_x1 + 1, obj_y2 - obj_y1 + 1);\n        obj.label = label;\n        obj.prob = score;\n        obj.maskdata = std::vector<float>(maskdata, maskdata + mask.w);\n\n        class_candidates[label].push_back(obj);\n    }\n\n    objects.clear();\n    for (int i = 0; i < (int)class_candidates.size(); i++)\n    {\n        std::vector<Object>& candidates = 
class_candidates[i];\n\n        qsort_descent_inplace(candidates);\n\n        std::vector<int> picked;\n        nms_sorted_bboxes(candidates, picked, nms_threshold);\n\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            int z = picked[j];\n            objects.push_back(candidates[z]);\n        }\n    }\n\n    qsort_descent_inplace(objects);\n\n    // keep_top_k\n    if (keep_top_k < (int)objects.size())\n    {\n        objects.resize(keep_top_k);\n    }\n\n    // generate mask\n    for (int i = 0; i < (int)objects.size(); i++)\n    {\n        Object& obj = objects[i];\n\n        cv::Mat mask(maskmaps.h, maskmaps.w, CV_32FC1);\n        {\n            mask = cv::Scalar(0.f);\n\n            for (int p = 0; p < maskmaps.c; p++)\n            {\n                const float* maskmap = maskmaps.channel(p);\n                float coeff = obj.maskdata[p];\n                float* mp = (float*)mask.data;\n\n                // mask += m * coeff\n                for (int j = 0; j < maskmaps.w * maskmaps.h; j++)\n                {\n                    mp[j] += maskmap[j] * coeff;\n                }\n            }\n        }\n\n        cv::Mat mask2;\n        cv::resize(mask, mask2, cv::Size(img_w, img_h));\n\n        // crop obj box and binarize\n        obj.mask = cv::Mat(img_h, img_w, CV_8UC1);\n        {\n            obj.mask = cv::Scalar(0);\n\n            for (int y = 0; y < img_h; y++)\n            {\n                if (y < obj.rect.y || y > obj.rect.y + obj.rect.height)\n                    continue;\n\n                const float* mp2 = mask2.ptr<const float>(y);\n                uchar* bmp = obj.mask.ptr<uchar>(y);\n\n                for (int x = 0; x < img_w; x++)\n                {\n                    if (x < obj.rect.x || x > obj.rect.x + obj.rect.width)\n                        continue;\n\n                    bmp[x] = mp2[x] > 0.5f ? 
255 : 0;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\",\n                                        \"train\", \"truck\", \"boat\", \"traffic light\", \"fire hydrant\",\n                                        \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\",\n                                        \"horse\", \"sheep\", \"cow\", \"elephant\", \"bear\", \"zebra\", \"giraffe\",\n                                        \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n                                        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\",\n                                        \"baseball glove\", \"skateboard\", \"surfboard\", \"tennis racket\",\n                                        \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\",\n                                        \"banana\", \"apple\", \"sandwich\", \"orange\", \"broccoli\", \"carrot\",\n                                        \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n                                        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\",\n                                        \"mouse\", \"remote\", \"keyboard\", \"cell phone\", \"microwave\", \"oven\",\n                                        \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\",\n                                        \"scissors\", \"teddy bear\", \"hair drier\", \"toothbrush\"\n                                       };\n\n    static const unsigned char colors[81][3] = {\n        {56, 0, 255},\n        {226, 255, 0},\n        {0, 94, 255},\n        {0, 
37, 255},\n        {0, 255, 94},\n        {255, 226, 0},\n        {0, 18, 255},\n        {255, 151, 0},\n        {170, 0, 255},\n        {0, 255, 56},\n        {255, 0, 75},\n        {0, 75, 255},\n        {0, 255, 169},\n        {255, 0, 207},\n        {75, 255, 0},\n        {207, 0, 255},\n        {37, 0, 255},\n        {0, 207, 255},\n        {94, 0, 255},\n        {0, 255, 113},\n        {255, 18, 0},\n        {255, 0, 56},\n        {18, 0, 255},\n        {0, 255, 226},\n        {170, 255, 0},\n        {255, 0, 245},\n        {151, 255, 0},\n        {132, 255, 0},\n        {75, 0, 255},\n        {151, 0, 255},\n        {0, 151, 255},\n        {132, 0, 255},\n        {0, 255, 245},\n        {255, 132, 0},\n        {226, 0, 255},\n        {255, 37, 0},\n        {207, 255, 0},\n        {0, 255, 207},\n        {94, 255, 0},\n        {0, 226, 255},\n        {56, 255, 0},\n        {255, 94, 0},\n        {255, 113, 0},\n        {0, 132, 255},\n        {255, 0, 132},\n        {255, 170, 0},\n        {255, 0, 188},\n        {113, 255, 0},\n        {245, 0, 255},\n        {113, 0, 255},\n        {255, 188, 0},\n        {0, 113, 255},\n        {255, 0, 0},\n        {0, 56, 255},\n        {255, 0, 113},\n        {0, 255, 188},\n        {255, 0, 94},\n        {255, 0, 18},\n        {18, 255, 0},\n        {0, 255, 132},\n        {0, 188, 255},\n        {0, 245, 255},\n        {0, 169, 255},\n        {37, 255, 0},\n        {255, 0, 151},\n        {188, 0, 255},\n        {0, 255, 37},\n        {0, 255, 0},\n        {255, 0, 170},\n        {255, 0, 37},\n        {255, 75, 0},\n        {0, 0, 255},\n        {255, 207, 0},\n        {255, 0, 226},\n        {255, 245, 0},\n        {188, 255, 0},\n        {0, 255, 18},\n        {0, 255, 75},\n        {0, 255, 151},\n        {255, 56, 0},\n        {245, 255, 0}\n    };\n\n    cv::Mat image = bgr.clone();\n\n    int color_index = 0;\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n  
      if (obj.prob < 0.15)\n            continue;\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        const unsigned char* color = colors[color_index % 81];\n        color_index++;\n\n        cv::rectangle(image, obj.rect, cv::Scalar(color[0], color[1], color[2]));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n\n        // draw mask\n        for (int y = 0; y < image.rows; y++)\n        {\n            const uchar* mp = obj.mask.ptr(y);\n            uchar* p = image.ptr(y);\n            for (int x = 0; x < image.cols; x++)\n            {\n                if (mp[x] == 255)\n                {\n                    p[0] = cv::saturate_cast<uchar>(p[0] * 0.5 + color[0] * 0.5);\n                    p[1] = cv::saturate_cast<uchar>(p[1] * 0.5 + color[1] * 0.5);\n                    p[2] = cv::saturate_cast<uchar>(p[2] * 0.5 + color[2] * 0.5);\n                }\n                p += 3;\n            }\n        }\n    }\n\n    cv::imwrite(\"result.png\", image);\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s 
[imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolact(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolo11.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolo11 torchscript\n//      yolo export model=yolo11n.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolo11n.torchscript\n// 4. modify yolo11n_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_235 = v_204.view(1, 144, 6400)\n//          v_236 = v_219.view(1, 144, 1600)\n//          v_237 = v_234.view(1, 144, 400)\n//          v_238 = torch.cat((v_235, v_236, v_237), dim=2)\n//          ...\n//      after:\n//          v_235 = v_204.view(1, 144, -1).transpose(1, 2)\n//          v_236 = v_219.view(1, 144, -1).transpose(1, 2)\n//          v_237 = v_234.view(1, 144, -1).transpose(1, 2)\n//          v_238 = torch.cat((v_235, v_236, v_237), dim=1)\n//          return v_238\n//      D. 
modify area attention for dynamic shape inference\n//      before:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, 400)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, 20, 20)\n//          v_107 = v_99.reshape(1, 128, 20, 20)\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n//      after:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, -1)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, v_95.size(2), v_95.size(3))\n//          v_107 = v_99.reshape(1, 128, v_95.size(2), v_95.size(3))\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n// 5. re-export yolo11 torchscript\n//      python3 -c 'import yolo11n_pnnx; yolo11n_pnnx.export_torchscript()'\n// 6. convert new torchscript with dynamic shape\n//      pnnx yolo11n_pnnx.py.pt inputshape=[1,3,640,640] inputshape2=[1,3,320,320]\n// 7. 
now you get ncnn model files\n//      mv yolo11n_pnnx.py.ncnn.param yolo11n.ncnn.param\n//      mv yolo11n_pnnx.py.ncnn.bin yolo11n.ncnn.bin\n\n// the out blob would be a 2-dim tensor with w=144 h=8400\n//\n//        | bbox-reg 16 x 4       | per-class scores(80) |\n//        +-----+-----+-----+-----+----------------------+\n//        | dx0 | dy0 | dx1 | dy1 |0.1 0.0 0.0 0.5 ......|\n//   all /|     |     |     |     |           .          |\n//  boxes |  .. |  .. |  .. |  .. |0.0 0.9 0.0 0.0 ......|\n//  (8400)|     |     |     |     |           .          |\n//       \\|     |     |     |     |           .          |\n//        +-----+-----+-----+-----+----------------------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    
}\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / stride;\n\n    const int reg_max_1 = 16;\n    const int num_class = pred.w - reg_max_1 * 4; // number of classes. 
80 for COCO\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            {\n                const ncnn::Mat pred_score = pred_grid.range(reg_max_1 * 4, num_class);\n\n                for (int k = 0; k < num_class; k++)\n                {\n                    float s = pred_score[k];\n                    if (s > score)\n                    {\n                        label = k;\n                        score = s;\n                    }\n                }\n\n                score = sigmoid(score);\n            }\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4);\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n         
       float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                float y1 = pb_cy + pred_ltrb[3];\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects);\n        pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolo11(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolo11;\n\n    yolo11.opt.use_vulkan_compute = true;\n    // yolo11.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolo11/tree/master/app/src/main/assets\n    yolo11.load_param(\"yolo11n.ncnn.param\");\n    yolo11.load_model(\"yolo11n.ncnn.bin\");\n    // yolo11.load_param(\"yolo11s.ncnn.param\");\n    // yolo11.load_model(\"yolo11s.ncnn.bin\");\n    // yolo11.load_param(\"yolo11m.ncnn.param\");\n    // yolo11.load_model(\"yolo11m.ncnn.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int 
img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/11/yolo11.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // resize so the longer side equals target_size, keeping the aspect ratio\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad each side up to a multiple of max_stride\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolo11.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // undo the letterbox padding and scaling to map back to original image coordinates\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + 
objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    static cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n  
      cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = 
cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolo11(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolo11_cls.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolo11-cls torchscript\n//      yolo export model=yolo11n-cls.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolo11n-cls.torchscript\n// 4. now you get ncnn model files\n//      yolo11n_cls.ncnn.param\n//      yolo11n_cls.ncnn.bin\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <algorithm>  // std::partial_sort\n#include <functional> // std::greater\n#include <utility>    // std::pair, std::make_pair\n#include <vector>\n\nstruct Object\n{\n    int label;\n    float prob;\n};\n\nstatic void get_topk(const ncnn::Mat& cls_scores, int topk, std::vector<Object>& objects)\n{\n    // partial sort topk with index\n    int size = cls_scores.w;\n    std::vector<std::pair<float, int> > vec;\n    vec.resize(size);\n    for (int i = 0; i < size; i++)\n    {\n        vec[i] = std::make_pair(cls_scores[i], i);\n    }\n\n    std::partial_sort(vec.begin(), vec.begin() + topk, vec.end(),\n                      std::greater<std::pair<float, int> >());\n\n    objects.resize(topk);\n    for (int i = 0; i < topk; i++)\n    {\n        objects[i].label = vec[i].second;\n        objects[i].prob = vec[i].first;\n    }\n}\n\nstatic int detect_yolo11_cls(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolo11;\n\n    yolo11.opt.use_vulkan_compute = true;\n    // yolo11.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolo11/tree/master/app/src/main/assets\n    yolo11.load_param(\"yolo11n_cls.ncnn.param\");\n    yolo11.load_model(\"yolo11n_cls.ncnn.bin\");\n    // yolo11.load_param(\"yolo11s_cls.ncnn.param\");\n    // yolo11.load_model(\"yolo11s_cls.ncnn.bin\");\n    // yolo11.load_param(\"yolo11m_cls.ncnn.param\");\n    // yolo11.load_model(\"yolo11m_cls.ncnn.bin\");\n\n    
const int target_size = 224;\n    const int topk = 5;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // letterbox pad\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad to target_size rectangle\n    int wpad = target_size - w;\n    int hpad = target_size - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolo11.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    // return top-5\n    get_topk(out, topk, objects);\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"tench\", \"goldfish\", \"great white shark\", \"tiger shark\", \"hammerhead\", \"electric ray\", \"stingray\", \"cock\",\n        \"hen\", \"ostrich\", \"brambling\", \"goldfinch\", \"house finch\", \"junco\", \"indigo bunting\", \"robin\", \"bulbul\",\n        \"jay\", \"magpie\", \"chickadee\", \"water ouzel\", \"kite\", \"bald eagle\", \"vulture\", \"great grey owl\",\n        \"European fire salamander\", \"common newt\", \"eft\", \"spotted salamander\", \"axolotl\", \"bullfrog\", \"tree frog\",\n        \"tailed frog\", \"loggerhead\", \"leatherback turtle\", \"mud turtle\", \"terrapin\", \"box turtle\", \"banded gecko\",\n        \"common iguana\", \"American chameleon\", \"whiptail\", \"agama\", 
\"frilled lizard\", \"alligator lizard\",\n        \"Gila monster\", \"green lizard\", \"African chameleon\", \"Komodo dragon\", \"African crocodile\",\n        \"American alligator\", \"triceratops\", \"thunder snake\", \"ringneck snake\", \"hognose snake\", \"green snake\",\n        \"king snake\", \"garter snake\", \"water snake\", \"vine snake\", \"night snake\", \"boa constrictor\", \"rock python\",\n        \"Indian cobra\", \"green mamba\", \"sea snake\", \"horned viper\", \"diamondback\", \"sidewinder\", \"trilobite\",\n        \"harvestman\", \"scorpion\", \"black and gold garden spider\", \"barn spider\", \"garden spider\", \"black widow\",\n        \"tarantula\", \"wolf spider\", \"tick\", \"centipede\", \"black grouse\", \"ptarmigan\", \"ruffed grouse\",\n        \"prairie chicken\", \"peacock\", \"quail\", \"partridge\", \"African grey\", \"macaw\", \"sulphur-crested cockatoo\",\n        \"lorikeet\", \"coucal\", \"bee eater\", \"hornbill\", \"hummingbird\", \"jacamar\", \"toucan\", \"drake\",\n        \"red-breasted merganser\", \"goose\", \"black swan\", \"tusker\", \"echidna\", \"platypus\", \"wallaby\", \"koala\",\n        \"wombat\", \"jellyfish\", \"sea anemone\", \"brain coral\", \"flatworm\", \"nematode\", \"conch\", \"snail\", \"slug\",\n        \"sea slug\", \"chiton\", \"chambered nautilus\", \"Dungeness crab\", \"rock crab\", \"fiddler crab\", \"king crab\",\n        \"American lobster\", \"spiny lobster\", \"crayfish\", \"hermit crab\", \"isopod\", \"white stork\", \"black stork\",\n        \"spoonbill\", \"flamingo\", \"little blue heron\", \"American egret\", \"bittern\", \"crane (bird)\", \"limpkin\",\n        \"European gallinule\", \"American coot\", \"bustard\", \"ruddy turnstone\", \"red-backed sandpiper\", \"redshank\",\n        \"dowitcher\", \"oystercatcher\", \"pelican\", \"king penguin\", \"albatross\", \"grey whale\", \"killer whale\",\n        \"dugong\", \"sea lion\", \"Chihuahua\", \"Japanese spaniel\", \"Maltese dog\", 
\"Pekinese\", \"Shih-Tzu\",\n        \"Blenheim spaniel\", \"papillon\", \"toy terrier\", \"Rhodesian ridgeback\", \"Afghan hound\", \"basset\", \"beagle\",\n        \"bloodhound\", \"bluetick\", \"black-and-tan coonhound\", \"Walker hound\", \"English foxhound\", \"redbone\",\n        \"borzoi\", \"Irish wolfhound\", \"Italian greyhound\", \"whippet\", \"Ibizan hound\", \"Norwegian elkhound\",\n        \"otterhound\", \"Saluki\", \"Scottish deerhound\", \"Weimaraner\", \"Staffordshire bullterrier\",\n        \"American Staffordshire terrier\", \"Bedlington terrier\", \"Border terrier\", \"Kerry blue terrier\",\n        \"Irish terrier\", \"Norfolk terrier\", \"Norwich terrier\", \"Yorkshire terrier\", \"wire-haired fox terrier\",\n        \"Lakeland terrier\", \"Sealyham terrier\", \"Airedale\", \"cairn\", \"Australian terrier\", \"Dandie Dinmont\",\n        \"Boston bull\", \"miniature schnauzer\", \"giant schnauzer\", \"standard schnauzer\", \"Scotch terrier\",\n        \"Tibetan terrier\", \"silky terrier\", \"soft-coated wheaten terrier\", \"West Highland white terrier\",\n        \"Lhasa\", \"flat-coated retriever\", \"curly-coated retriever\", \"golden retriever\", \"Labrador retriever\",\n        \"Chesapeake Bay retriever\", \"German short-haired pointer\", \"vizsla\", \"English setter\", \"Irish setter\",\n        \"Gordon setter\", \"Brittany spaniel\", \"clumber\", \"English springer\", \"Welsh springer spaniel\",\n        \"cocker spaniel\", \"Sussex spaniel\", \"Irish water spaniel\", \"kuvasz\", \"schipperke\", \"groenendael\",\n        \"malinois\", \"briard\", \"kelpie\", \"komondor\", \"Old English sheepdog\", \"Shetland sheepdog\", \"collie\",\n        \"Border collie\", \"Bouvier des Flandres\", \"Rottweiler\", \"German shepherd\", \"Doberman\",\n        \"miniature pinscher\", \"Greater Swiss Mountain dog\", \"Bernese mountain dog\", \"Appenzeller\", \"EntleBucher\",\n        \"boxer\", \"bull mastiff\", \"Tibetan mastiff\", \"French bulldog\", 
\"Great Dane\", \"Saint Bernard\",\n        \"Eskimo dog\", \"malamute\", \"Siberian husky\", \"dalmatian\", \"affenpinscher\", \"basenji\", \"pug\", \"Leonberg\",\n        \"Newfoundland\", \"Great Pyrenees\", \"Samoyed\", \"Pomeranian\", \"chow\", \"keeshond\", \"Brabancon griffon\",\n        \"Pembroke\", \"Cardigan\", \"toy poodle\", \"miniature poodle\", \"standard poodle\", \"Mexican hairless\",\n        \"timber wolf\", \"white wolf\", \"red wolf\", \"coyote\", \"dingo\", \"dhole\", \"African hunting dog\", \"hyena\",\n        \"red fox\", \"kit fox\", \"Arctic fox\", \"grey fox\", \"tabby\", \"tiger cat\", \"Persian cat\", \"Siamese cat\",\n        \"Egyptian cat\", \"cougar\", \"lynx\", \"leopard\", \"snow leopard\", \"jaguar\", \"lion\", \"tiger\", \"cheetah\",\n        \"brown bear\", \"American black bear\", \"ice bear\", \"sloth bear\", \"mongoose\", \"meerkat\", \"tiger beetle\",\n        \"ladybug\", \"ground beetle\", \"long-horned beetle\", \"leaf beetle\", \"dung beetle\", \"rhinoceros beetle\",\n        \"weevil\", \"fly\", \"bee\", \"ant\", \"grasshopper\", \"cricket\", \"walking stick\", \"cockroach\", \"mantis\",\n        \"cicada\", \"leafhopper\", \"lacewing\", \"dragonfly\", \"damselfly\", \"admiral\", \"ringlet\", \"monarch\",\n        \"cabbage butterfly\", \"sulphur butterfly\", \"lycaenid\", \"starfish\", \"sea urchin\", \"sea cucumber\",\n        \"wood rabbit\", \"hare\", \"Angora\", \"hamster\", \"porcupine\", \"fox squirrel\", \"marmot\", \"beaver\",\n        \"guinea pig\", \"sorrel\", \"zebra\", \"hog\", \"wild boar\", \"warthog\", \"hippopotamus\", \"ox\", \"water buffalo\",\n        \"bison\", \"ram\", \"bighorn\", \"ibex\", \"hartebeest\", \"impala\", \"gazelle\", \"Arabian camel\", \"llama\",\n        \"weasel\", \"mink\", \"polecat\", \"black-footed ferret\", \"otter\", \"skunk\", \"badger\", \"armadillo\",\n        \"three-toed sloth\", \"orangutan\", \"gorilla\", \"chimpanzee\", \"gibbon\", \"siamang\", \"guenon\", 
\"patas\",\n        \"baboon\", \"macaque\", \"langur\", \"colobus\", \"proboscis monkey\", \"marmoset\", \"capuchin\", \"howler monkey\",\n        \"titi\", \"spider monkey\", \"squirrel monkey\", \"Madagascar cat\", \"indri\", \"Indian elephant\",\n        \"African elephant\", \"lesser panda\", \"giant panda\", \"barracouta\", \"eel\", \"coho\", \"rock beauty\",\n        \"anemone fish\", \"sturgeon\", \"gar\", \"lionfish\", \"puffer\", \"abacus\", \"abaya\", \"academic gown\",\n        \"accordion\", \"acoustic guitar\", \"aircraft carrier\", \"airliner\", \"airship\", \"altar\", \"ambulance\",\n        \"amphibian\", \"analog clock\", \"apiary\", \"apron\", \"ashcan\", \"assault rifle\", \"backpack\", \"bakery\",\n        \"balance beam\", \"balloon\", \"ballpoint\", \"Band Aid\", \"banjo\", \"bannister\", \"barbell\", \"barber chair\",\n        \"barbershop\", \"barn\", \"barometer\", \"barrel\", \"barrow\", \"baseball\", \"basketball\", \"bassinet\", \"bassoon\",\n        \"bathing cap\", \"bath towel\", \"bathtub\", \"beach wagon\", \"beacon\", \"beaker\", \"bearskin\", \"beer bottle\",\n        \"beer glass\", \"bell cote\", \"bib\", \"bicycle-built-for-two\", \"bikini\", \"binder\", \"binoculars\",\n        \"birdhouse\", \"boathouse\", \"bobsled\", \"bolo tie\", \"bonnet\", \"bookcase\", \"bookshop\", \"bottlecap\", \"bow\",\n        \"bow tie\", \"brass\", \"brassiere\", \"breakwater\", \"breastplate\", \"broom\", \"bucket\", \"buckle\",\n        \"bulletproof vest\", \"bullet train\", \"butcher shop\", \"cab\", \"caldron\", \"candle\", \"cannon\", \"canoe\",\n        \"can opener\", \"cardigan\", \"car mirror\", \"carousel\", \"carpenter's kit\", \"carton\", \"car wheel\",\n        \"cash machine\", \"cassette\", \"cassette player\", \"castle\", \"catamaran\", \"CD player\", \"cello\",\n        \"cellular telephone\", \"chain\", \"chainlink fence\", \"chain mail\", \"chain saw\", \"chest\", \"chiffonier\",\n        \"chime\", \"china cabinet\", 
\"Christmas stocking\", \"church\", \"cinema\", \"cleaver\", \"cliff dwelling\",\n        \"cloak\", \"clog\", \"cocktail shaker\", \"coffee mug\", \"coffeepot\", \"coil\", \"combination lock\",\n        \"computer keyboard\", \"confectionery\", \"container ship\", \"convertible\", \"corkscrew\", \"cornet\",\n        \"cowboy boot\", \"cowboy hat\", \"cradle\", \"crane (machine)\", \"crash helmet\", \"crate\", \"crib\",\n        \"Crock Pot\", \"croquet ball\", \"crutch\", \"cuirass\", \"dam\", \"desk\", \"desktop computer\", \"dial telephone\",\n        \"diaper\", \"digital clock\", \"digital watch\", \"dining table\", \"dishrag\", \"dishwasher\", \"disk brake\",\n        \"dock\", \"dogsled\", \"dome\", \"doormat\", \"drilling platform\", \"drum\", \"drumstick\", \"dumbbell\",\n        \"Dutch oven\", \"electric fan\", \"electric guitar\", \"electric locomotive\", \"entertainment center\",\n        \"envelope\", \"espresso maker\", \"face powder\", \"feather boa\", \"file\", \"fireboat\", \"fire engine\",\n        \"fire screen\", \"flagpole\", \"flute\", \"folding chair\", \"football helmet\", \"forklift\", \"fountain\",\n        \"fountain pen\", \"four-poster\", \"freight car\", \"French horn\", \"frying pan\", \"fur coat\", \"garbage truck\",\n        \"gasmask\", \"gas pump\", \"goblet\", \"go-kart\", \"golf ball\", \"golfcart\", \"gondola\", \"gong\", \"gown\",\n        \"grand piano\", \"greenhouse\", \"grille\", \"grocery store\", \"guillotine\", \"hair slide\", \"hair spray\",\n        \"half track\", \"hammer\", \"hamper\", \"hand blower\", \"hand-held computer\", \"handkerchief\", \"hard disc\",\n        \"harmonica\", \"harp\", \"harvester\", \"hatchet\", \"holster\", \"home theater\", \"honeycomb\", \"hook\",\n        \"hoopskirt\", \"horizontal bar\", \"horse cart\", \"hourglass\", \"iPod\", \"iron\", \"jack-o'-lantern\", \"jean\",\n        \"jeep\", \"jersey\", \"jigsaw puzzle\", \"jinrikisha\", \"joystick\", \"kimono\", \"knee pad\", \"knot\", 
\"lab coat\",\n        \"ladle\", \"lampshade\", \"laptop\", \"lawn mower\", \"lens cap\", \"letter opener\", \"library\", \"lifeboat\",\n        \"lighter\", \"limousine\", \"liner\", \"lipstick\", \"Loafer\", \"lotion\", \"loudspeaker\", \"loupe\", \"lumbermill\",\n        \"magnetic compass\", \"mailbag\", \"mailbox\", \"maillot (tights)\", \"maillot (tank suit)\", \"manhole cover\",\n        \"maraca\", \"marimba\", \"mask\", \"matchstick\", \"maypole\", \"maze\", \"measuring cup\", \"medicine chest\",\n        \"megalith\", \"microphone\", \"microwave\", \"military uniform\", \"milk can\", \"minibus\", \"miniskirt\",\n        \"minivan\", \"missile\", \"mitten\", \"mixing bowl\", \"mobile home\", \"Model T\", \"modem\", \"monastery\",\n        \"monitor\", \"moped\", \"mortar\", \"mortarboard\", \"mosque\", \"mosquito net\", \"motor scooter\", \"mountain bike\",\n        \"mountain tent\", \"mouse\", \"mousetrap\", \"moving van\", \"muzzle\", \"nail\", \"neck brace\", \"necklace\",\n        \"nipple\", \"notebook\", \"obelisk\", \"oboe\", \"ocarina\", \"odometer\", \"oil filter\", \"organ\", \"oscilloscope\",\n        \"overskirt\", \"oxcart\", \"oxygen mask\", \"packet\", \"paddle\", \"paddlewheel\", \"padlock\", \"paintbrush\",\n        \"pajama\", \"palace\", \"panpipe\", \"paper towel\", \"parachute\", \"parallel bars\", \"park bench\",\n        \"parking meter\", \"passenger car\", \"patio\", \"pay-phone\", \"pedestal\", \"pencil box\", \"pencil sharpener\",\n        \"perfume\", \"Petri dish\", \"photocopier\", \"pick\", \"pickelhaube\", \"picket fence\", \"pickup\", \"pier\",\n        \"piggy bank\", \"pill bottle\", \"pillow\", \"ping-pong ball\", \"pinwheel\", \"pirate\", \"pitcher\", \"plane\",\n        \"planetarium\", \"plastic bag\", \"plate rack\", \"plow\", \"plunger\", \"Polaroid camera\", \"pole\",\n        \"police van\", \"poncho\", \"pool table\", \"pop bottle\", \"pot\", \"potter's wheel\", \"power drill\",\n        \"prayer rug\", 
\"printer\", \"prison\", \"projectile\", \"projector\", \"puck\", \"punching bag\", \"purse\",\n        \"quill\", \"quilt\", \"racer\", \"racket\", \"radiator\", \"radio\", \"radio telescope\", \"rain barrel\",\n        \"recreational vehicle\", \"reel\", \"reflex camera\", \"refrigerator\", \"remote control\", \"restaurant\",\n        \"revolver\", \"rifle\", \"rocking chair\", \"rotisserie\", \"rubber eraser\", \"rugby ball\", \"rule\",\n        \"running shoe\", \"safe\", \"safety pin\", \"saltshaker\", \"sandal\", \"sarong\", \"sax\", \"scabbard\", \"scale\",\n        \"school bus\", \"schooner\", \"scoreboard\", \"screen\", \"screw\", \"screwdriver\", \"seat belt\", \"sewing machine\",\n        \"shield\", \"shoe shop\", \"shoji\", \"shopping basket\", \"shopping cart\", \"shovel\", \"shower cap\",\n        \"shower curtain\", \"ski\", \"ski mask\", \"sleeping bag\", \"slide rule\", \"sliding door\", \"slot\", \"snorkel\",\n        \"snowmobile\", \"snowplow\", \"soap dispenser\", \"soccer ball\", \"sock\", \"solar dish\", \"sombrero\",\n        \"soup bowl\", \"space bar\", \"space heater\", \"space shuttle\", \"spatula\", \"speedboat\", \"spider web\",\n        \"spindle\", \"sports car\", \"spotlight\", \"stage\", \"steam locomotive\", \"steel arch bridge\", \"steel drum\",\n        \"stethoscope\", \"stole\", \"stone wall\", \"stopwatch\", \"stove\", \"strainer\", \"streetcar\", \"stretcher\",\n        \"studio couch\", \"stupa\", \"submarine\", \"suit\", \"sundial\", \"sunglass\", \"sunglasses\", \"sunscreen\",\n        \"suspension bridge\", \"swab\", \"sweatshirt\", \"swimming trunks\", \"swing\", \"switch\", \"syringe\",\n        \"table lamp\", \"tank\", \"tape player\", \"teapot\", \"teddy\", \"television\", \"tennis ball\", \"thatch\",\n        \"theater curtain\", \"thimble\", \"thresher\", \"throne\", \"tile roof\", \"toaster\", \"tobacco shop\",\n        \"toilet seat\", \"torch\", \"totem pole\", \"tow truck\", \"toyshop\", \"tractor\", 
\"trailer truck\", \"tray\",\n        \"trench coat\", \"tricycle\", \"trimaran\", \"tripod\", \"triumphal arch\", \"trolleybus\", \"trombone\", \"tub\",\n        \"turnstile\", \"typewriter keyboard\", \"umbrella\", \"unicycle\", \"upright\", \"vacuum\", \"vase\", \"vault\",\n        \"velvet\", \"vending machine\", \"vestment\", \"viaduct\", \"violin\", \"volleyball\", \"waffle iron\", \"wall clock\",\n        \"wallet\", \"wardrobe\", \"warplane\", \"washbasin\", \"washer\", \"water bottle\", \"water jug\", \"water tower\",\n        \"whiskey jug\", \"whistle\", \"wig\", \"window screen\", \"window shade\", \"Windsor tie\", \"wine bottle\", \"wing\",\n        \"wok\", \"wooden spoon\", \"wool\", \"worm fence\", \"wreck\", \"yawl\", \"yurt\", \"web site\", \"comic book\",\n        \"crossword puzzle\", \"street sign\", \"traffic light\", \"book jacket\", \"menu\", \"plate\", \"guacamole\",\n        \"consomme\", \"hot pot\", \"trifle\", \"ice cream\", \"ice lolly\", \"French loaf\", \"bagel\", \"pretzel\",\n        \"cheeseburger\", \"hotdog\", \"mashed potato\", \"head cabbage\", \"broccoli\", \"cauliflower\", \"zucchini\",\n        \"spaghetti squash\", \"acorn squash\", \"butternut squash\", \"cucumber\", \"artichoke\", \"bell pepper\",\n        \"cardoon\", \"mushroom\", \"Granny Smith\", \"strawberry\", \"orange\", \"lemon\", \"fig\", \"pineapple\", \"banana\",\n        \"jackfruit\", \"custard apple\", \"pomegranate\", \"hay\", \"carbonara\", \"chocolate sauce\", \"dough\",\n        \"meat loaf\", \"pizza\", \"potpie\", \"burrito\", \"red wine\", \"espresso\", \"cup\", \"eggnog\", \"alp\", \"bubble\",\n        \"cliff\", \"coral reef\", \"geyser\", \"lakeside\", \"promontory\", \"sandbar\", \"seashore\", \"valley\", \"volcano\",\n        \"ballplayer\", \"groom\", \"scuba diver\", \"rapeseed\", \"daisy\", \"yellow lady's slipper\", \"corn\", \"acorn\",\n        \"hip\", \"buckeye\", \"coral fungus\", \"agaric\", \"gyromitra\", \"stinkhorn\", \"earthstar\", 
\"hen-of-the-woods\",\n        \"bolete\", \"ear\", \"toilet tissue\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    int y_offset = 0;\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f\\n\", obj.label, obj.prob);\n\n        char text[256];\n        sprintf(text, \"%4.1f%% %s\", obj.prob * 100, class_names[obj.label]);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = 0;\n        int y = y_offset;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n\n        y_offset += label_size.height;\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolo11_cls(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolo11_obb.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolo11-obb torchscript\n//      yolo export model=yolo11n-obb.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolo11n-obb.torchscript\n// 4. modify yolo11n_obb_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_195 = v_194.view(1, 1, 16384)\n//          v_201 = v_200.view(1, 1, 4096)\n//          v_207 = v_206.view(1, 1, 1024)\n//          v_208 = torch.cat((v_195, v_201, v_207), dim=2)\n//          ...\n//          v_256 = v_225.view(1, 79, 16384)\n//          v_257 = v_240.view(1, 79, 4096)\n//          v_258 = v_255.view(1, 79, 1024)\n//          v_259 = torch.cat((v_256, v_257, v_258), dim=2)\n//          ...\n//      after:\n//          v_195 = v_194.view(1, 1, -1).transpose(1, 2)\n//          v_201 = v_200.view(1, 1, -1).transpose(1, 2)\n//          v_207 = v_206.view(1, 1, -1).transpose(1, 2)\n//          v_208 = torch.cat((v_195, v_201, v_207), dim=1)\n//          ...\n//          v_256 = v_225.view(1, 79, -1).transpose(1, 2)\n//          v_257 = v_240.view(1, 79, -1).transpose(1, 2)\n//          v_258 = v_255.view(1, 79, -1).transpose(1, 2)\n//          v_259 = torch.cat((v_256, v_257, v_258), dim=1)\n//          return v_259, v_208\n//      D. 
modify area attention for dynamic shape inference\n//      before:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, 1024)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, 32, 32)\n//          v_107 = v_99.reshape(1, 128, 32, 32)\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n//      after:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, -1)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, v_95.size(2), v_95.size(3))\n//          v_107 = v_99.reshape(1, 128, v_95.size(2), v_95.size(3))\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n// 5. re-export yolo11-obb torchscript\n//      python3 -c 'import yolo11n_obb_pnnx; yolo11n_obb_pnnx.export_torchscript()'\n// 6. 
convert new torchscript with dynamic shape\n//      pnnx yolo11n_obb_pnnx.py.pt inputshape=[1,3,1024,1024] inputshape2=[1,3,512,512]\n// 7. now you get ncnn model files\n//      mv yolo11n_obb_pnnx.py.ncnn.param yolo11n_obb.ncnn.param\n//      mv yolo11n_obb_pnnx.py.ncnn.bin yolo11n_obb.ncnn.bin\n\n// the out blob would be a 2-dim tensor with w=79 h=21504\n//\n//        | bbox-reg 16 x 4       |score(15)|\n//        +-----+-----+-----+-----+---------+\n//        | dx0 | dy0 | dx1 | dy1 | 0.1 ... |\n//   all /|     |     |     |     |     ... |\n//  boxes |  .. |  .. |  .. |  .. | 0.0 ... |\n// (21504)|     |     |     |     |  .  ... |\n//       \\|     |     |     |     |  .  ... |\n//        +-----+-----+-----+-----+---------+\n//\n\n// the out blob would be a 2-dim tensor with w=1 h=21504\n//\n//        | degree(1)|\n//        +----------+\n//        |    0.1   |\n//   all /|          |\n//  boxes |    0.0   |\n// (21504)|     .    |\n//       \\|     .    |\n//        +----------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n\n#include <float.h>\n#include <math.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::RotatedRect rrect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    std::vector<cv::Point2f> intersection;\n    cv::rotatedRectangleIntersection(a.rrect, b.rrect, intersection);\n    if (intersection.empty())\n        return 0.f;\n\n    return cv::contourArea(intersection);\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n  
          std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rrect.size.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area;\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_angle, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / 
stride;\n\n    const int reg_max_1 = 16;\n    const int num_class = pred.w - reg_max_1 * 4; // number of classes. 15 for DOTAv1\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            {\n                const ncnn::Mat pred_score = pred_grid.range(reg_max_1 * 4, num_class);\n\n                for (int k = 0; k < num_class; k++)\n                {\n                    float s = pred_score[k];\n                    if (s > score)\n                    {\n                        label = k;\n                        score = s;\n                    }\n                }\n\n                score = sigmoid(score);\n            }\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4).clone();\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l 
* dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n                float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                const float angle = sigmoid(pred_angle.row(y * num_grid_x + x)[0]) - 0.25f;\n\n                const float angle_rad = angle * 3.14159265358979323846f;\n                const float angle_degree = angle * 180.f;\n\n                float cos = cosf(angle_rad);\n                float sin = sinf(angle_rad);\n\n                float xx = (pred_ltrb[2] - pred_ltrb[0]) * 0.5f;\n                float yy = (pred_ltrb[3] - pred_ltrb[1]) * 0.5f;\n                float xr = xx * cos - yy * sin;\n                float yr = xx * sin + yy * cos;\n                const float cx = pb_cx + xr;\n                const float cy = pb_cy + yr;\n                const float ww = pred_ltrb[2] + pred_ltrb[0];\n                const float hh = pred_ltrb[3] + pred_ltrb[1];\n\n                Object obj;\n                obj.rrect = cv::RotatedRect(cv::Point2f(cx, cy), cv::Size_<float>(ww, hh), angle_degree);\n                obj.label = label;\n                obj.prob = score;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_angle, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), pred_angle.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects);\n\n        
pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolo11_obb(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolo11;\n\n    yolo11.opt.use_vulkan_compute = true;\n    // yolo11.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolo11/tree/master/app/src/main/assets\n    yolo11.load_param(\"yolo11n_obb.ncnn.param\");\n    yolo11.load_model(\"yolo11n_obb.ncnn.bin\");\n    // yolo11.load_param(\"yolo11s_obb.ncnn.param\");\n    // yolo11.load_model(\"yolo11s_obb.ncnn.bin\");\n    // yolo11.load_param(\"yolo11m_obb.ncnn.param\");\n    // yolo11.load_model(\"yolo11m_obb.ncnn.bin\");\n\n    const int target_size = 1024;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/v8/yolo11.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // resize so that the longer side equals target_size, keeping the aspect ratio\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad each side up to a multiple of max_stride\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolo11.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n 
   ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    ncnn::Mat out_angle;\n    ex.extract(\"out1\", out_angle);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, out_angle, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n    if (count == 0)\n        return 0;\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        Object obj = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        obj.rrect.center.x = (obj.rrect.center.x - (wpad / 2)) / scale;\n        obj.rrect.center.y = (obj.rrect.center.y - (hpad / 2)) / scale;\n        obj.rrect.size.width = (obj.rrect.size.width) / scale;\n        obj.rrect.size.height = (obj.rrect.size.height) / scale;\n\n        objects[i] = obj;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"plane\", \"ship\", \"storage tank\", \"baseball diamond\", \"tennis court\",\n        \"basketball court\", \"ground track field\", \"harbor\", \"bridge\", \"large vehicle\",\n        \"small vehicle\", \"helicopter\", \"roundabout\", \"soccer ball field\", \"swimming pool\"\n    };\n\n    static const cv::Scalar colors[] = {\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        
cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[obj.label];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f  @ %.2f\\n\", obj.label, obj.prob,\n                obj.rrect.center.x, obj.rrect.center.y, obj.rrect.size.width, obj.rrect.size.height, obj.rrect.angle);\n\n        cv::Point2f corners[4];\n        obj.rrect.points(corners);\n        cv::line(image, corners[0], corners[1], color);\n        cv::line(image, corners[1], corners[2], color);\n        cv::line(image, corners[2], corners[3], color);\n        cv::line(image, corners[3], corners[0], color);\n    }\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[obj.label];\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rrect.center.x - label_size.width / 2;\n        int y = obj.rrect.center.y - label_size.height / 2 - baseLine;\n        if (y < 0)\n            y = 0;\n        if (y + label_size.height > image.rows)\n            y = image.rows - label_size.height;\n        if (x < 0)\n            x = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    
cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolo11_obb(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolo11_pose.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolo11-pose torchscript\n//      yolo export model=yolo11n-pose.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolo11n-pose.torchscript\n// 4. modify yolo11n_pose_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_195 = v_194.view(1, 51, 6400)\n//          v_201 = v_200.view(1, 51, 1600)\n//          v_207 = v_206.view(1, 51, 400)\n//          v_208 = torch.cat((v_195, v_201, v_207), dim=-1)\n//          ...\n//          v_254 = v_223.view(1, 65, 6400)\n//          v_255 = v_238.view(1, 65, 1600)\n//          v_256 = v_253.view(1, 65, 400)\n//          v_257 = torch.cat((v_254, v_255, v_256), dim=2)\n//          ...\n//      after:\n//          v_195 = v_194.view(1, 51, -1).transpose(1, 2)\n//          v_201 = v_200.view(1, 51, -1).transpose(1, 2)\n//          v_207 = v_206.view(1, 51, -1).transpose(1, 2)\n//          v_208 = torch.cat((v_195, v_201, v_207), dim=1)\n//          ...\n//          v_254 = v_223.view(1, 65, -1).transpose(1, 2)\n//          v_255 = v_238.view(1, 65, -1).transpose(1, 2)\n//          v_256 = v_253.view(1, 65, -1).transpose(1, 2)\n//          v_257 = torch.cat((v_254, v_255, v_256), dim=1)\n//          return v_257, v_208\n//      D. 
modify area attention for dynamic shape inference\n//      before:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, 400)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, 20, 20)\n//          v_107 = v_99.reshape(1, 128, 20, 20)\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n//      after:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, -1)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, v_95.size(2), v_95.size(3))\n//          v_107 = v_99.reshape(1, 128, v_95.size(2), v_95.size(3))\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n// 5. re-export yolo11-pose torchscript\n//      python3 -c 'import yolo11n_pose_pnnx; yolo11n_pose_pnnx.export_torchscript()'\n// 6. 
convert new torchscript with dynamic shape\n//      pnnx yolo11n_pose_pnnx.py.pt inputshape=[1,3,640,640] inputshape2=[1,3,320,320]\n// 7. now you get ncnn model files\n//      mv yolo11n_pose_pnnx.py.ncnn.param yolo11n_pose.ncnn.param\n//      mv yolo11n_pose_pnnx.py.ncnn.bin yolo11n_pose.ncnn.bin\n\n// the out blob would be a 2-dim tensor with w=65 h=8400\n//\n//        | bbox-reg 16 x 4       |score(1)|\n//        +-----+-----+-----+-----+--------+\n//        | dx0 | dy0 | dx1 | dy1 |   0.1  |\n//   all /|     |     |     |     |        |\n//  boxes |  .. |  .. |  .. |  .. |   0.0  |\n//  (8400)|     |     |     |     |   .    |\n//       \\|     |     |     |     |   .    |\n//        +-----+-----+-----+-----+--------+\n//\n\n//\n//        | pose (51) |\n//        +-----------+\n//        |0.1........|\n//   all /|           |\n//  boxes |0.0........|\n//  (8400)|     .     |\n//       \\|     .     |\n//        +-----------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct KeyPoint\n{\n    cv::Point2f p;\n    float prob;\n};\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n    std::vector<KeyPoint> keypoints;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], 
objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_points, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / stride;\n\n    const int reg_max_1 = 
16;\n    const int num_points = pred_points.w / 3;\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n            const ncnn::Mat pred_points_grid = pred_points.row_range(y * num_grid_x + x, 1).reshape(3, num_points);\n\n            // find label with max score\n            int label = 0;\n            float score = sigmoid(pred_grid[reg_max_1 * 4]);\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4).clone();\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n                float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                
float y1 = pb_cy + pred_ltrb[3];\n\n                std::vector<KeyPoint> keypoints;\n                for (int k = 0; k < num_points; k++)\n                {\n                    KeyPoint keypoint;\n                    keypoint.p.x = (x + pred_points_grid.row(k)[0] * 2) * stride;\n                    keypoint.p.y = (y + pred_points_grid.row(k)[1] * 2) * stride;\n                    keypoint.prob = sigmoid(pred_points_grid.row(k)[2]);\n                    keypoints.push_back(keypoint);\n                }\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n                obj.keypoints = keypoints;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_points, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), pred_points.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects);\n\n        pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolo11_pose(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolo11;\n\n    yolo11.opt.use_vulkan_compute = true;\n    // yolo11.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolo11/tree/master/app/src/main/assets\n    yolo11.load_param(\"yolo11n_pose.ncnn.param\");\n    
yolo11.load_model(\"yolo11n_pose.ncnn.bin\");\n    // yolo11.load_param(\"yolo11s_pose.ncnn.param\");\n    // yolo11.load_model(\"yolo11s_pose.ncnn.bin\");\n    // yolo11.load_param(\"yolo11m_pose.ncnn.param\");\n    // yolo11.load_model(\"yolo11m_pose.ncnn.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/11/yolo11.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // resize so the longer side equals target_size, keeping the aspect ratio\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad each side up to a multiple of max_stride\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolo11.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    ncnn::Mat out_points;\n    ex.extract(\"out1\", out_points);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, out_points, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    
qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n    if (count == 0)\n        return 0;\n\n    const int num_points = out_points.w / 3;\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        for (int j = 0; j < num_points; j++)\n        {\n            objects[i].keypoints[j].p.x = (objects[i].keypoints[j].p.x - (wpad / 2)) / scale;\n            objects[i].keypoints[j].p.y = (objects[i].keypoints[j].p.y - (hpad / 2)) / scale;\n        }\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"person\"};\n\n    static const cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        
cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        // draw bone\n        static const int joint_pairs[16][2] = {\n            {0, 1}, {1, 3}, {0, 2}, {2, 4}, {5, 6}, {5, 7}, {7, 9}, {6, 8}, {8, 10}, {5, 11}, {6, 12}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}\n        };\n        static const cv::Scalar bone_colors[] = {\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 51, 255),\n            cv::Scalar(255, 51, 255),\n            cv::Scalar(255, 51, 255),\n            cv::Scalar(51, 153, 255),\n            cv::Scalar(51, 153, 255),\n            cv::Scalar(51, 153, 255),\n            cv::Scalar(51, 153, 255),\n        };\n\n        for (int j = 0; j < 16; j++)\n        {\n            const KeyPoint& p1 = obj.keypoints[joint_pairs[j][0]];\n            const KeyPoint& p2 = obj.keypoints[joint_pairs[j][1]];\n\n            if (p1.prob < 0.2f || p2.prob < 0.2f)\n                continue;\n\n            cv::line(image, p1.p, p2.p, bone_colors[j], 2);\n        }\n\n        // draw joint\n        for (size_t j = 0; j < obj.keypoints.size(); j++)\n  
      {\n            const KeyPoint& keypoint = obj.keypoints[j];\n\n            fprintf(stderr, \"%.2f %.2f = %.5f\\n\", keypoint.p.x, keypoint.p.y, keypoint.prob);\n\n            if (keypoint.prob < 0.2f)\n                continue;\n\n            cv::circle(image, keypoint.p, 3, color, -1);\n        }\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolo11_pose(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolo11_seg.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolo11-seg torchscript\n//      yolo export model=yolo11n-seg.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolo11n-seg.torchscript\n// 4. modify yolo11n_seg_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_202 = v_201.view(1, 32, 6400)\n//          v_208 = v_207.view(1, 32, 1600)\n//          v_214 = v_213.view(1, 32, 400)\n//          v_215 = torch.cat((v_202, v_208, v_214), dim=2)\n//          ...\n//          v_261 = v_230.view(1, 144, 6400)\n//          v_262 = v_245.view(1, 144, 1600)\n//          v_263 = v_260.view(1, 144, 400)\n//          v_264 = torch.cat((v_261, v_262, v_263), dim=2)\n//          ...\n//          v_285 = (v_284, v_196, )\n//          return v_285\n//      after:\n//          v_202 = v_201.view(1, 32, -1).transpose(1, 2)\n//          v_208 = v_207.view(1, 32, -1).transpose(1, 2)\n//          v_214 = v_213.view(1, 32, -1).transpose(1, 2)\n//          v_215 = torch.cat((v_202, v_208, v_214), dim=1)\n//          ...\n//          v_261 = v_230.view(1, 144, -1).transpose(1, 2)\n//          v_262 = v_245.view(1, 144, -1).transpose(1, 2)\n//          v_263 = v_260.view(1, 144, -1).transpose(1, 2)\n//          v_264 = torch.cat((v_261, v_262, v_263), dim=1)\n//          return v_264, v_215, v_196\n//      D. 
modify area attention for dynamic shape inference\n//      before:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, 400)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, 20, 20)\n//          v_107 = v_99.reshape(1, 128, 20, 20)\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n//      after:\n//          v_95 = self.model_10_m_0_attn_qkv_conv(v_94)\n//          v_96 = v_95.view(1, 2, 128, -1)\n//          v_97, v_98, v_99 = torch.split(tensor=v_96, dim=2, split_size_or_sections=(32,32,64))\n//          v_100 = torch.transpose(input=v_97, dim0=-2, dim1=-1)\n//          v_101 = torch.matmul(input=v_100, other=v_98)\n//          v_102 = (v_101 * 0.176777)\n//          v_103 = F.softmax(input=v_102, dim=-1)\n//          v_104 = torch.transpose(input=v_103, dim0=-2, dim1=-1)\n//          v_105 = torch.matmul(input=v_99, other=v_104)\n//          v_106 = v_105.view(1, 128, v_95.size(2), v_95.size(3))\n//          v_107 = v_99.reshape(1, 128, v_95.size(2), v_95.size(3))\n//          v_108 = self.model_10_m_0_attn_pe_conv(v_107)\n//          v_109 = (v_106 + v_108)\n//          v_110 = self.model_10_m_0_attn_proj_conv(v_109)\n// 5. re-export yolo11-seg torchscript\n//      python3 -c 'import yolo11n_seg_pnnx; yolo11n_seg_pnnx.export_torchscript()'\n// 6. convert new torchscript with dynamic shape\n//      pnnx yolo11n_seg_pnnx.py.pt inputshape=[1,3,640,640] inputshape2=[1,3,320,320]\n// 7. 
now you get ncnn model files\n//      mv yolo11n_seg_pnnx.py.ncnn.param yolo11n_seg.ncnn.param\n//      mv yolo11n_seg_pnnx.py.ncnn.bin yolo11n_seg.ncnn.bin\n\n// the out0 blob would be a 2-dim tensor with w=144 h=8400\n//\n//        | bbox-reg 16 x 4       | per-class scores(80) |\n//        +-----+-----+-----+-----+----------------------+\n//        | dx0 | dy0 | dx1 | dy1 |0.1 0.0 0.0 0.5 ......|\n//   all /|     |     |     |     |           .          |\n//  boxes |  .. |  .. |  .. |  .. |0.0 0.9 0.0 0.0 ......|\n//  (8400)|     |     |     |     |           .          |\n//       \\|     |     |     |     |           .          |\n//        +-----+-----+-----+-----+----------------------+\n//\n\n// the out1 blob would be a 2-dim tensor with w=32 h=8400\n//\n//        | mask (32) |\n//        +-----------+\n//        |0.1........|\n//   all /|           |\n//  boxes |0.0........|\n//  (8400)|     .     |\n//       \\|     .     |\n//        +-----------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <math.h>   // expf\n#include <stdio.h>\n#include <string.h> // memcpy\n#include <algorithm> // std::max, std::min, std::swap\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n    int gindex;\n    cv::Mat mask;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel 
sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / stride;\n\n    const int reg_max_1 = 16;\n    const int num_class = pred.w - reg_max_1 * 4; // number of classes. 
80 for COCO\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            {\n                const ncnn::Mat pred_score = pred_grid.range(reg_max_1 * 4, num_class);\n\n                for (int k = 0; k < num_class; k++)\n                {\n                    float s = pred_score[k];\n                    if (s > score)\n                    {\n                        label = k;\n                        score = s;\n                    }\n                }\n\n                score = sigmoid(score);\n            }\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4).clone();\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n 
               float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                float y1 = pb_cy + pred_ltrb[3];\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n                obj.gindex = y * num_grid_x + x;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        std::vector<Object> objects_stride;\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects_stride);\n\n        for (size_t j = 0; j < objects_stride.size(); j++)\n        {\n            Object obj = objects_stride[j];\n            obj.gindex += pred_row_offset;\n            objects.push_back(obj);\n        }\n\n        pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolo11_seg(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolo11;\n\n    yolo11.opt.use_vulkan_compute = true;\n    // yolo11.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolo11/tree/master/app/src/main/assets\n    yolo11.load_param(\"yolo11n_seg.ncnn.param\");\n    
yolo11.load_model(\"yolo11n_seg.ncnn.bin\");\n    // yolo11.load_param(\"yolo11s_seg.ncnn.param\");\n    // yolo11.load_model(\"yolo11s_seg.ncnn.bin\");\n    // yolo11.load_param(\"yolo11m_seg.ncnn.param\");\n    // yolo11.load_model(\"yolo11m_seg.ncnn.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n    const float mask_threshold = 0.5f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/11/yolo11.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // resize so the longer side equals target_size, keeping the aspect ratio\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad each side up to a multiple of max_stride\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolo11.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    
nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n    if (count == 0)\n        return 0;\n\n    ncnn::Mat mask_feat;\n    ex.extract(\"out1\", mask_feat);\n\n    ncnn::Mat mask_protos;\n    ex.extract(\"out2\", mask_protos);\n\n    ncnn::Mat objects_mask_feat(mask_feat.w, 1, count);\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n\n        // pick mask feat\n        memcpy(objects_mask_feat.channel(i), mask_feat.row(objects[i].gindex), mask_feat.w * sizeof(float));\n    }\n\n    // process mask\n    ncnn::Mat objects_mask;\n    {\n        ncnn::Layer* gemm = ncnn::create_layer(\"Gemm\");\n\n        ncnn::ParamDict pd;\n        pd.set(6, 1);                             // constantC\n        pd.set(7, count);                         // constantM\n        pd.set(8, mask_protos.w * mask_protos.h); // constantN\n        pd.set(9, mask_feat.w);                   // constantK\n        pd.set(10, -1);                           // constant_broadcast_type_C\n        pd.set(11, 1);                            // output_N1M\n        gemm->load_param(pd);\n\n        ncnn::Option opt;\n        opt.num_threads = 1;\n   
     opt.use_packing_layout = false;\n\n        gemm->create_pipeline(opt);\n\n        std::vector<ncnn::Mat> gemm_inputs(2);\n        gemm_inputs[0] = objects_mask_feat;\n        gemm_inputs[1] = mask_protos.reshape(mask_protos.w * mask_protos.h, 1, mask_protos.c);\n        std::vector<ncnn::Mat> gemm_outputs(1);\n        gemm->forward(gemm_inputs, gemm_outputs, opt);\n        objects_mask = gemm_outputs[0].reshape(mask_protos.w, mask_protos.h, count);\n\n        gemm->destroy_pipeline(opt);\n\n        delete gemm;\n    }\n    {\n        ncnn::Layer* sigmoid = ncnn::create_layer(\"Sigmoid\");\n\n        ncnn::Option opt;\n        opt.num_threads = 1;\n        opt.use_packing_layout = false;\n\n        sigmoid->create_pipeline(opt);\n\n        sigmoid->forward_inplace(objects_mask, opt);\n\n        sigmoid->destroy_pipeline(opt);\n\n        delete sigmoid;\n    }\n\n    // resize mask map\n    {\n        ncnn::Mat objects_mask_resized;\n        ncnn::resize_bilinear(objects_mask, objects_mask_resized, in_pad.w / scale, in_pad.h / scale);\n        objects_mask = objects_mask_resized;\n    }\n\n    // create per-object mask\n    for (int i = 0; i < count; i++)\n    {\n        Object& obj = objects[i];\n\n        const ncnn::Mat mm = objects_mask.channel(i);\n\n        obj.mask = cv::Mat((int)obj.rect.height, (int)obj.rect.width, CV_8UC1);\n\n        // adjust offset to original unpadded and clip inside object box\n        for (int y = 0; y < (int)obj.rect.height; y++)\n        {\n            const float* pmm = mm.row((int)(hpad / 2 / scale + obj.rect.y + y)) + (int)(wpad / 2 / scale + obj.rect.x);\n            uchar* pmask = obj.mask.ptr<uchar>(y);\n            for (int x = 0; x < (int)obj.rect.width; x++)\n            {\n                pmask[x] = pmm[x] > mask_threshold ? 
1 : 0;\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    static cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    
};\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        for (int y = 0; y < (int)obj.rect.height; y++)\n        {\n            const uchar* maskptr = obj.mask.ptr<const uchar>(y);\n            uchar* bgrptr = image.ptr<uchar>((int)obj.rect.y + y) + (int)obj.rect.x * 3;\n            for (int x = 0; x < (int)obj.rect.width; x++)\n            {\n                if (maskptr[x])\n                {\n                    bgrptr[0] = bgrptr[0] * 0.5 + color[0] * 0.5;\n                    bgrptr[1] = bgrptr[1] * 0.5 + color[1] * 0.5;\n                    bgrptr[2] = bgrptr[2] * 0.5 + color[2] * 0.5;\n                }\n                bgrptr += 3;\n            }\n        }\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s 
[imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolo11_seg(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov2.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int detect_yolov2(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov2;\n\n    yolov2.opt.use_vulkan_compute = true;\n\n    // original pretrained model from https://github.com/eric612/MobileNet-YOLO\n    // https://github.com/eric612/MobileNet-YOLO/blob/master/models/yolov2/mobilenet_yolo_deploy.prototxt\n    // https://github.com/eric612/MobileNet-YOLO/blob/master/models/yolov2/mobilenet_yolo_deploy_iter_80000.caffemodel\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (yolov2.load_param(\"mobilenet_yolo.param\"))\n        exit(-1);\n    if (yolov2.load_model(\"mobilenet_yolo.bin\"))\n        exit(-1);\n\n    const int target_size = 416;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, target_size, target_size);\n\n    // the Caffe-YOLOv2-Windows style\n    // X' = X * scale - mean\n    const float mean_vals[3] = {1.0f, 1.0f, 1.0f};\n    const float norm_vals[3] = {0.007843f, 0.007843f, 0.007843f};\n    in.substract_mean_normalize(0, norm_vals);\n    in.substract_mean_normalize(mean_vals, 0);\n\n    ncnn::Extractor ex = yolov2.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"detection_out\", out);\n\n    //     printf(\"%d %d %d\\n\", out.w, out.h, out.c);\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        
object.prob = values[1];\n        object.rect.x = values[2] * img_w;\n        object.rect.y = values[3] * img_h;\n        object.rect.width = values[4] * img_w - object.rect.x;\n        object.rect.height = values[5] * img_h - object.rect.y;\n\n        objects.push_back(object);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + 
label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov2(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov3.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int detect_yolov3(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov3;\n\n    yolov3.opt.use_vulkan_compute = true;\n\n    // original pretrained model from https://github.com/eric612/MobileNet-YOLO\n    // param : https://drive.google.com/open?id=1V9oKHP6G6XvXZqhZbzNKL6FI_clRWdC-\n    // bin : https://drive.google.com/open?id=1DBcuFCr-856z3FRQznWL_S5h-Aj3RawA\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (yolov3.load_param(\"mobilenetv2_yolov3.param\"))\n        exit(-1);\n    if (yolov3.load_model(\"mobilenetv2_yolov3.bin\"))\n        exit(-1);\n\n    const int target_size = 352;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows, target_size, target_size);\n\n    const float mean_vals[3] = {127.5f, 127.5f, 127.5f};\n    const float norm_vals[3] = {0.007843f, 0.007843f, 0.007843f};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = yolov3.create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"detection_out\", out);\n\n    //     printf(\"%d %d %d\\n\", out.w, out.h, out.c);\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        object.prob = values[1];\n        object.rect.x = values[2] * img_w;\n        object.rect.y = values[3] * img_h;\n        object.rect.width = values[4] * 
img_w - object.rect.x;\n        object.rect.height = values[5] * img_h - object.rect.y;\n\n        objects.push_back(object);\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"background\",\n                                        \"aeroplane\", \"bicycle\", \"bird\", \"boat\",\n                                        \"bottle\", \"bus\", \"car\", \"cat\", \"chair\",\n                                        \"cow\", \"diningtable\", \"dog\", \"horse\",\n                                        \"motorbike\", \"person\", \"pottedplant\",\n                                        \"sheep\", \"sofa\", \"train\", \"tvmonitor\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    
cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov3(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov4.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"net.h\"\n\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n\n#if CV_MAJOR_VERSION >= 3\n#include <opencv2/videoio/videoio.hpp>\n#endif\n\n#include <vector>\n\n#include <stdio.h>\n\n#define NCNN_PROFILING\n#define YOLOV4_TINY //Using yolov4_tiny, if undef, using original yolov4\n\n#ifdef NCNN_PROFILING\n#include \"benchmark.h\"\n#endif\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic int init_yolov4(ncnn::Net* yolov4, int* target_size)\n{\n    /* --> Set the params you need for the ncnn inference <-- */\n\n    yolov4->opt.num_threads = 4; //You need to compile with libgomp for multi thread support\n\n    yolov4->opt.use_vulkan_compute = true; //You need to compile with libvulkan for gpu support\n\n    /* --> End of setting params <-- */\n    int ret = 0;\n\n    // original pretrained model from https://github.com/AlexeyAB/darknet\n    // the ncnn model https://drive.google.com/drive/folders/1YzILvh0SKQPS_lrb33dmGNq7aVTKPWS0?usp=sharing\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n#ifdef YOLOV4_TINY\n    const char* yolov4_param = \"yolov4-tiny-opt.param\";\n    const char* yolov4_model = \"yolov4-tiny-opt.bin\";\n    *target_size = 416;\n#else\n    const char* yolov4_param = \"yolov4-opt.param\";\n    const char* yolov4_model = \"yolov4-opt.bin\";\n    *target_size = 608;\n#endif\n\n    if (yolov4->load_param(yolov4_param))\n        exit(-1);\n    if (yolov4->load_model(yolov4_model))\n        exit(-1);\n\n    return 0;\n}\n\nstatic int detect_yolov4(const cv::Mat& bgr, std::vector<Object>& objects, int target_size, ncnn::Net* yolov4)\n{\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, bgr.cols, bgr.rows, target_size, 
target_size);\n\n    const float mean_vals[3] = {0, 0, 0};\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in.substract_mean_normalize(mean_vals, norm_vals);\n\n    ncnn::Extractor ex = yolov4->create_extractor();\n\n    ex.input(\"data\", in);\n\n    ncnn::Mat out;\n    ex.extract(\"output\", out);\n\n    objects.clear();\n    for (int i = 0; i < out.h; i++)\n    {\n        const float* values = out.row(i);\n\n        Object object;\n        object.label = values[0];\n        object.prob = values[1];\n        object.rect.x = values[2] * img_w;\n        object.rect.y = values[3] * img_h;\n        object.rect.width = values[4] * img_w - object.rect.x;\n        object.rect.height = values[5] * img_h - object.rect.y;\n\n        objects.push_back(object);\n    }\n\n    return 0;\n}\n\nstatic int draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects, int is_streaming)\n{\n    static const char* class_names[] = {\"background\", \"person\", \"bicycle\",\n                                        \"car\", \"motorbike\", \"aeroplane\", \"bus\", \"train\", \"truck\",\n                                        \"boat\", \"traffic light\", \"fire hydrant\", \"stop sign\",\n                                        \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\",\n                                        \"sheep\", \"cow\", \"elephant\", \"bear\", \"zebra\", \"giraffe\",\n                                        \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\",\n                                        \"frisbee\", \"skis\", \"snowboard\", \"sports ball\", \"kite\",\n                                        \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n                                        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\",\n                                        \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\", \"sandwich\",\n                             
           \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\",\n                                        \"cake\", \"chair\", \"sofa\", \"pottedplant\", \"bed\", \"diningtable\",\n                                        \"toilet\", \"tvmonitor\", \"laptop\", \"mouse\", \"remote\", \"keyboard\",\n                                        \"cell phone\", \"microwave\", \"oven\", \"toaster\", \"sink\",\n                                        \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\",\n                                        \"teddy bear\", \"hair drier\", \"toothbrush\"\n                                       };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n\n    if (is_streaming)\n    {\n        cv::waitKey(1);\n    }\n    else\n    {\n        cv::waitKey(0);\n    }\n\n    return 0;\n}\n\nint main(int 
argc, char** argv)\n{\n    cv::Mat frame;\n    std::vector<Object> objects;\n\n    cv::VideoCapture cap;\n\n    ncnn::Net yolov4;\n\n    const char* devicepath;\n\n    int target_size = 0;\n    int is_streaming = 0;\n\n    if (argc < 2)\n    {\n        fprintf(stderr, \"Usage: %s [v4l input device or image]\\n\", argv[0]);\n        return -1;\n    }\n\n    devicepath = argv[1];\n\n#ifdef NCNN_PROFILING\n    double t_load_start = ncnn::get_current_time();\n#endif\n\n    int ret = init_yolov4(&yolov4, &target_size); //We load model and param first!\n    if (ret != 0)\n    {\n        fprintf(stderr, \"Failed to load model or param, error %d\", ret);\n        return -1;\n    }\n\n#ifdef NCNN_PROFILING\n    double t_load_end = ncnn::get_current_time();\n    fprintf(stdout, \"NCNN Init time %.02lfms\\n\", t_load_end - t_load_start);\n#endif\n\n    if (strstr(devicepath, \"/dev/video\") == NULL)\n    {\n        frame = cv::imread(argv[1], 1);\n        if (frame.empty())\n        {\n            fprintf(stderr, \"Failed to read image %s.\\n\", argv[1]);\n            return -1;\n        }\n    }\n    else\n    {\n        cap.open(devicepath);\n\n        if (!cap.isOpened())\n        {\n            fprintf(stderr, \"Failed to open %s\", devicepath);\n            return -1;\n        }\n\n        cap >> frame;\n\n        if (frame.empty())\n        {\n            fprintf(stderr, \"Failed to read from device %s.\\n\", devicepath);\n            return -1;\n        }\n\n        is_streaming = 1;\n    }\n\n    while (1)\n    {\n        if (is_streaming)\n        {\n#ifdef NCNN_PROFILING\n            double t_capture_start = ncnn::get_current_time();\n#endif\n\n            cap >> frame;\n\n#ifdef NCNN_PROFILING\n            double t_capture_end = ncnn::get_current_time();\n            fprintf(stdout, \"NCNN OpenCV capture time %.02lfms\\n\", t_capture_end - t_capture_start);\n#endif\n            if (frame.empty())\n            {\n                fprintf(stderr, \"OpenCV Failed to 
Capture from device %s\\n\", devicepath);\n                return -1;\n            }\n        }\n\n#ifdef NCNN_PROFILING\n        double t_detect_start = ncnn::get_current_time();\n#endif\n\n        detect_yolov4(frame, objects, target_size, &yolov4); //Create an extractor and run detection\n\n#ifdef NCNN_PROFILING\n        double t_detect_end = ncnn::get_current_time();\n        fprintf(stdout, \"NCNN detection time %.02lfms\\n\", t_detect_end - t_detect_start);\n#endif\n\n#ifdef NCNN_PROFILING\n        double t_draw_start = ncnn::get_current_time();\n#endif\n\n        draw_objects(frame, objects, is_streaming); //Draw detection results on opencv image\n\n#ifdef NCNN_PROFILING\n        double t_draw_end = ncnn::get_current_time();\n        fprintf(stdout, \"NCNN OpenCV draw result time %.02lfms\\n\", t_draw_end - t_draw_start);\n#endif\n\n        if (!is_streaming)\n        {   //If it is a still image, exit!\n            return 0;\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov5.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\n//#define YOLOV5_V60 1 //YOLOv5 v6.0\n#define YOLOV5_V62 1 //YOLOv5 v6.2 export  onnx model method https://github.com/shaoshengsong/yolov5_62_export_ncnn\n\n#if YOLOV5_V60 || YOLOV5_V62\n#define MAX_STRIDE 64\n#else\n#define MAX_STRIDE 32\nclass YoloV5Focus : public ncnn::Layer\n{\npublic:\n    YoloV5Focus()\n    {\n        one_blob_only = true;\n    }\n\n    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt) const\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int channels = bottom_blob.c;\n\n        int outw = w / 2;\n        int outh = h / 2;\n        int outc = channels * 4;\n\n        top_blob.create(outw, outh, outc, 4u, 1, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc; p++)\n        {\n            const float* ptr = bottom_blob.channel(p % channels).row((p / channels) % 2) + ((p / channels) / 2);\n            float* outptr = top_blob.channel(p);\n\n            for (int i = 0; i < outh; i++)\n            {\n                for (int j = 0; j < outw; j++)\n                {\n                    *outptr = *ptr;\n\n                    outptr += 1;\n                    ptr += 2;\n                }\n\n                ptr += w;\n            }\n        }\n\n        return 0;\n    }\n};\n\nDEFINE_LAYER_CREATOR(YoloV5Focus)\n#endif //YOLOV5_V60    YOLOV5_V62\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const 
Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if 
(inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return static_cast<float>(1.f / (1.f + exp(-x)));\n}\n\nstatic void generate_proposals(const ncnn::Mat& anchors, int stride, const ncnn::Mat& in_pad, const ncnn::Mat& feat_blob, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_grid = feat_blob.h;\n\n    int num_grid_x;\n    int num_grid_y;\n    if (in_pad.w > in_pad.h)\n    {\n        num_grid_x = in_pad.w / stride;\n        num_grid_y = num_grid / num_grid_x;\n    }\n    else\n    {\n        num_grid_y = in_pad.h / stride;\n        num_grid_x = num_grid / num_grid_y;\n    }\n\n    const int num_class = feat_blob.w - 5;\n\n    const int num_anchors = anchors.w / 2;\n\n    for (int q = 0; q < num_anchors; q++)\n    {\n        const float anchor_w = anchors[q * 2];\n        const float anchor_h = anchors[q * 2 + 1];\n\n        const ncnn::Mat feat = feat_blob.channel(q);\n\n        for (int i = 0; i < num_grid_y; i++)\n        {\n            for (int j = 0; j < num_grid_x; j++)\n            {\n                const float* featptr = feat.row(i * num_grid_x + j);\n                float box_confidence = sigmoid(featptr[4]);\n                if (box_confidence >= prob_threshold)\n                {\n                    // find class index with max class score\n                    int class_index = 0;\n                    float class_score = -FLT_MAX;\n                    for (int k = 0; k < num_class; k++)\n                    {\n                        float score = featptr[5 + k];\n                        if (score > class_score)\n                        {\n                            class_index = k;\n                            class_score = score;\n                        }\n                    }\n                    float confidence = box_confidence * sigmoid(class_score);\n                    if 
(confidence >= prob_threshold)\n                    {\n                        // yolov5/models/yolo.py Detect forward\n                        // y = x[i].sigmoid()\n                        // y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy\n                        // y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n\n                        float dx = sigmoid(featptr[0]);\n                        float dy = sigmoid(featptr[1]);\n                        float dw = sigmoid(featptr[2]);\n                        float dh = sigmoid(featptr[3]);\n\n                        float pb_cx = (dx * 2.f - 0.5f + j) * stride;\n                        float pb_cy = (dy * 2.f - 0.5f + i) * stride;\n\n                        float pb_w = pow(dw * 2.f, 2) * anchor_w;\n                        float pb_h = pow(dh * 2.f, 2) * anchor_h;\n\n                        float x0 = pb_cx - pb_w * 0.5f;\n                        float y0 = pb_cy - pb_h * 0.5f;\n                        float x1 = pb_cx + pb_w * 0.5f;\n                        float y1 = pb_cy + pb_h * 0.5f;\n\n                        Object obj;\n                        obj.rect.x = x0;\n                        obj.rect.y = y0;\n                        obj.rect.width = x1 - x0;\n                        obj.rect.height = y1 - y0;\n                        obj.label = class_index;\n                        obj.prob = confidence;\n\n                        objects.push_back(obj);\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic int detect_yolov5(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov5;\n\n    yolov5.opt.use_vulkan_compute = true;\n    // yolov5.opt.use_bf16_storage = true;\n\n    // original pretrained model from https://github.com/ultralytics/yolov5\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n#if YOLOV5_V62\n    if (yolov5.load_param(\"yolov5s_6.2.param\"))\n        
exit(-1);\n    if (yolov5.load_model(\"yolov5s_6.2.bin\"))\n        exit(-1);\n#elif YOLOV5_V60\n    if (yolov5.load_param(\"yolov5s_6.0.param\"))\n        exit(-1);\n    if (yolov5.load_model(\"yolov5s_6.0.bin\"))\n        exit(-1);\n#else\n    yolov5.register_custom_layer(\"YoloV5Focus\", YoloV5Focus_layer_creator);\n\n    if (yolov5.load_param(\"yolov5s.param\"))\n        exit(-1);\n    if (yolov5.load_model(\"yolov5s.bin\"))\n        exit(-1);\n#endif\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // letterbox pad to multiple of MAX_STRIDE\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // pad to target_size rectangle\n    // yolov5/utils/datasets.py letterbox\n    int wpad = (w + MAX_STRIDE - 1) / MAX_STRIDE * MAX_STRIDE - w;\n    int hpad = (h + MAX_STRIDE - 1) / MAX_STRIDE * MAX_STRIDE - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov5.create_extractor();\n\n    ex.input(\"images\", in_pad);\n\n    std::vector<Object> proposals;\n\n    // anchor setting from yolov5/models/yolov5s.yaml\n\n    // stride 8\n    {\n        ncnn::Mat out;\n        ex.extract(\"output\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 10.f;\n        anchors[1] = 13.f;\n        anchors[2] = 16.f;\n        anchors[3] = 
30.f;\n        anchors[4] = 33.f;\n        anchors[5] = 23.f;\n\n        std::vector<Object> objects8;\n        generate_proposals(anchors, 8, in_pad, out, prob_threshold, objects8);\n\n        proposals.insert(proposals.end(), objects8.begin(), objects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat out;\n\n#if YOLOV5_V62\n        ex.extract(\"353\", out);\n#elif YOLOV5_V60\n        ex.extract(\"376\", out);\n#else\n        ex.extract(\"781\", out);\n#endif\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 30.f;\n        anchors[1] = 61.f;\n        anchors[2] = 62.f;\n        anchors[3] = 45.f;\n        anchors[4] = 59.f;\n        anchors[5] = 119.f;\n\n        std::vector<Object> objects16;\n        generate_proposals(anchors, 16, in_pad, out, prob_threshold, objects16);\n\n        proposals.insert(proposals.end(), objects16.begin(), objects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat out;\n#if YOLOV5_V62\n        ex.extract(\"367\", out);\n#elif YOLOV5_V60\n        ex.extract(\"401\", out);\n#else\n        ex.extract(\"801\", out);\n#endif\n        ncnn::Mat anchors(6);\n        anchors[0] = 116.f;\n        anchors[1] = 90.f;\n        anchors[2] = 156.f;\n        anchors[3] = 198.f;\n        anchors[4] = 373.f;\n        anchors[5] = 326.f;\n\n        std::vector<Object> objects32;\n        generate_proposals(anchors, 32, in_pad, out, prob_threshold, objects32);\n\n        proposals.insert(proposals.end(), objects32.begin(), objects32.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / 
scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < 
objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov5(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov5_pnnx.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n   
     const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return static_cast<float>(1.f / (1.f + exp(-x)));\n}\n\nstatic void generate_proposals(const ncnn::Mat& anchors, int stride, const ncnn::Mat& in_pad, const ncnn::Mat& feat_blob, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_grid_x = feat_blob.w;\n    const int num_grid_y = feat_blob.h;\n\n    const int num_anchors = anchors.w / 2;\n\n    const int num_class = feat_blob.c / num_anchors - 5;\n\n    const int feat_offset = num_class + 5;\n\n    for (int q = 0; q < num_anchors; q++)\n    {\n        const float anchor_w = anchors[q * 2];\n        const float anchor_h = anchors[q * 2 + 1];\n\n        for (int i = 0; i < num_grid_y; i++)\n        {\n            for (int j = 0; j < num_grid_x; j++)\n            {\n                // find class index with max class score\n                int class_index = 0;\n                float class_score = -FLT_MAX;\n                for (int k = 0; k < num_class; k++)\n                {\n                    float score = feat_blob.channel(q * feat_offset + 5 + k).row(i)[j];\n                    if (score > class_score)\n                    {\n                        class_index = k;\n                        class_score = score;\n                    }\n                }\n\n                float box_score = feat_blob.channel(q 
* feat_offset + 4).row(i)[j];\n\n                float confidence = sigmoid(box_score) * sigmoid(class_score);\n\n                if (confidence >= prob_threshold)\n                {\n                    // yolov5/models/yolo.py Detect forward\n                    // y = x[i].sigmoid()\n                    // y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy\n                    // y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n\n                    float dx = sigmoid(feat_blob.channel(q * feat_offset + 0).row(i)[j]);\n                    float dy = sigmoid(feat_blob.channel(q * feat_offset + 1).row(i)[j]);\n                    float dw = sigmoid(feat_blob.channel(q * feat_offset + 2).row(i)[j]);\n                    float dh = sigmoid(feat_blob.channel(q * feat_offset + 3).row(i)[j]);\n\n                    float pb_cx = (dx * 2.f - 0.5f + j) * stride;\n                    float pb_cy = (dy * 2.f - 0.5f + i) * stride;\n\n                    float pb_w = pow(dw * 2.f, 2) * anchor_w;\n                    float pb_h = pow(dh * 2.f, 2) * anchor_h;\n\n                    float x0 = pb_cx - pb_w * 0.5f;\n                    float y0 = pb_cy - pb_h * 0.5f;\n                    float x1 = pb_cx + pb_w * 0.5f;\n                    float y1 = pb_cy + pb_h * 0.5f;\n\n                    Object obj;\n                    obj.rect.x = x0;\n                    obj.rect.y = y0;\n                    obj.rect.width = x1 - x0;\n                    obj.rect.height = y1 - y0;\n                    obj.label = class_index;\n                    obj.prob = confidence;\n\n                    objects.push_back(obj);\n                }\n            }\n        }\n    }\n}\n\nstatic int detect_yolov5(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov5;\n\n    yolov5.opt.use_vulkan_compute = true;\n    // yolov5.opt.use_bf16_storage = true;\n\n    // original pretrained model from 
https://github.com/ultralytics/yolov5\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    if (yolov5.load_param(\"yolov5s.ncnn.param\"))\n        exit(-1);\n    if (yolov5.load_model(\"yolov5s.ncnn.bin\"))\n        exit(-1);\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // yolov5/models/common.py DetectMultiBackend\n    const int max_stride = 64;\n\n    // letterbox pad to multiple of max_stride\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // pad to target_size rectangle\n    // yolov5/utils/datasets.py letterbox\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov5.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    std::vector<Object> proposals;\n\n    // anchor setting from yolov5/models/yolov5s.yaml\n\n    // stride 8\n    {\n        ncnn::Mat out;\n        ex.extract(\"out0\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 10.f;\n        anchors[1] = 13.f;\n        anchors[2] = 16.f;\n        anchors[3] = 30.f;\n        anchors[4] = 33.f;\n        anchors[5] = 23.f;\n\n        std::vector<Object> objects8;\n        
generate_proposals(anchors, 8, in_pad, out, prob_threshold, objects8);\n\n        proposals.insert(proposals.end(), objects8.begin(), objects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat out;\n        ex.extract(\"out1\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 30.f;\n        anchors[1] = 61.f;\n        anchors[2] = 62.f;\n        anchors[3] = 45.f;\n        anchors[4] = 59.f;\n        anchors[5] = 119.f;\n\n        std::vector<Object> objects16;\n        generate_proposals(anchors, 16, in_pad, out, prob_threshold, objects16);\n\n        proposals.insert(proposals.end(), objects16.begin(), objects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat out;\n        ex.extract(\"out2\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 116.f;\n        anchors[1] = 90.f;\n        anchors[2] = 156.f;\n        anchors[3] = 198.f;\n        anchors[4] = 373.f;\n        anchors[5] = 326.f;\n\n        std::vector<Object> objects32;\n        generate_proposals(anchors, 32, in_pad, out, prob_threshold, objects32);\n\n        proposals.insert(proposals.end(), objects32.begin(), objects32.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = 
std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        
sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov5(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov7.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\n#define MAX_STRIDE 32\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = 
faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return static_cast<float>(1.f / (1.f + exp(-x)));\n}\n\nstatic void generate_proposals(const ncnn::Mat& anchors, int stride, const ncnn::Mat& in_pad, const ncnn::Mat& feat_blob, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_grid = feat_blob.h;\n\n    int num_grid_x;\n    int num_grid_y;\n    if (in_pad.w > in_pad.h)\n    {\n        num_grid_x = in_pad.w / stride;\n        num_grid_y = num_grid / num_grid_x;\n    }\n    else\n    {\n        num_grid_y = in_pad.h / stride;\n        num_grid_x = num_grid / num_grid_y;\n    }\n\n    const int num_class = feat_blob.w - 5;\n\n    const int num_anchors = anchors.w / 2;\n\n    for (int q = 0; q < num_anchors; q++)\n    {\n        const float anchor_w = anchors[q * 2];\n        const float anchor_h = anchors[q * 2 + 1];\n\n        const ncnn::Mat feat = feat_blob.channel(q);\n\n        for (int i = 0; i < num_grid_y; i++)\n        {\n            for (int j = 0; j < num_grid_x; j++)\n            {\n                const float* featptr = feat.row(i * num_grid_x + j);\n                float box_confidence = sigmoid(featptr[4]);\n                if (box_confidence >= prob_threshold)\n                {\n                    // find class index with max class score\n                    int class_index = 0;\n                    float 
class_score = -FLT_MAX;\n                    for (int k = 0; k < num_class; k++)\n                    {\n                        float score = featptr[5 + k];\n                        if (score > class_score)\n                        {\n                            class_index = k;\n                            class_score = score;\n                        }\n                    }\n                    float confidence = box_confidence * sigmoid(class_score);\n                    if (confidence >= prob_threshold)\n                    {\n                        float dx = sigmoid(featptr[0]);\n                        float dy = sigmoid(featptr[1]);\n                        float dw = sigmoid(featptr[2]);\n                        float dh = sigmoid(featptr[3]);\n\n                        float pb_cx = (dx * 2.f - 0.5f + j) * stride;\n                        float pb_cy = (dy * 2.f - 0.5f + i) * stride;\n\n                        float pb_w = pow(dw * 2.f, 2) * anchor_w;\n                        float pb_h = pow(dh * 2.f, 2) * anchor_h;\n\n                        float x0 = pb_cx - pb_w * 0.5f;\n                        float y0 = pb_cy - pb_h * 0.5f;\n                        float x1 = pb_cx + pb_w * 0.5f;\n                        float y1 = pb_cy + pb_h * 0.5f;\n\n                        Object obj;\n                        obj.rect.x = x0;\n                        obj.rect.y = y0;\n                        obj.rect.width = x1 - x0;\n                        obj.rect.height = y1 - y0;\n                        obj.label = class_index;\n                        obj.prob = confidence;\n\n                        objects.push_back(obj);\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic int detect_yolov7(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov7;\n\n    yolov7.opt.use_vulkan_compute = true;\n    // yolov7.opt.use_bf16_storage = true;\n\n    // original pretrained model from 
https://github.com/WongKinYiu/yolov7\n    // the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n    yolov7.load_param(\"yolov7-tiny.param\");\n    yolov7.load_model(\"yolov7-tiny.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // letterbox pad to multiple of MAX_STRIDE\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    int wpad = (w + MAX_STRIDE - 1) / MAX_STRIDE * MAX_STRIDE - w;\n    int hpad = (h + MAX_STRIDE - 1) / MAX_STRIDE * MAX_STRIDE - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov7.create_extractor();\n\n    ex.input(\"images\", in_pad);\n\n    std::vector<Object> proposals;\n\n    // stride 8\n    {\n        ncnn::Mat out;\n        ex.extract(\"output\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 12.f;\n        anchors[1] = 16.f;\n        anchors[2] = 19.f;\n        anchors[3] = 36.f;\n        anchors[4] = 40.f;\n        anchors[5] = 28.f;\n\n        std::vector<Object> objects8;\n        generate_proposals(anchors, 8, in_pad, out, prob_threshold, objects8);\n\n        proposals.insert(proposals.end(), objects8.begin(), objects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat out;\n\n        ex.extract(\"288\", out);\n\n        ncnn::Mat anchors(6);\n    
    anchors[0] = 36.f;\n        anchors[1] = 75.f;\n        anchors[2] = 76.f;\n        anchors[3] = 55.f;\n        anchors[4] = 72.f;\n        anchors[5] = 146.f;\n\n        std::vector<Object> objects16;\n        generate_proposals(anchors, 16, in_pad, out, prob_threshold, objects16);\n\n        proposals.insert(proposals.end(), objects16.begin(), objects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat out;\n\n        ex.extract(\"302\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 142.f;\n        anchors[1] = 110.f;\n        anchors[2] = 192.f;\n        anchors[3] = 243.f;\n        anchors[4] = 459.f;\n        anchors[5] = 401.f;\n\n        std::vector<Object> objects32;\n        generate_proposals(anchors, 32, in_pad, out, prob_threshold, objects32);\n\n        proposals.insert(proposals.end(), objects32.begin(), objects32.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        
objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    static const unsigned char colors[19][3] = {\n        {54, 67, 244},\n        {99, 30, 233},\n        {176, 39, 156},\n        {183, 58, 103},\n        {181, 81, 63},\n        {243, 150, 33},\n        {244, 169, 3},\n        {212, 188, 0},\n        {136, 150, 0},\n        {80, 175, 76},\n        {74, 195, 139},\n        {57, 220, 205},\n        {59, 235, 255},\n        {7, 193, 255},\n        {0, 152, 255},\n        {34, 87, 255},\n        {72, 85, 121},\n        {158, 158, 158},\n        {139, 125, 96}\n    };\n\n    int color_index = 0;\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        
const unsigned char* color = colors[color_index % 19];\n        color_index++;\n\n        cv::Scalar cc(color[0], color[1], color[2]);\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cc, 2);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cc, -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(255, 255, 255));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov7(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov7_pnnx.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects)\n{\n    if (faceobjects.empty())\n        return;\n\n    qsort_descent_inplace(faceobjects, 0, faceobjects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n   
     const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return static_cast<float>(1.f / (1.f + exp(-x)));\n}\n\nstatic void generate_proposals(const ncnn::Mat& anchors, int stride, const ncnn::Mat& in_pad, const ncnn::Mat& feat_blob, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_grid_x = feat_blob.w;\n    const int num_grid_y = feat_blob.h;\n\n    const int num_anchors = anchors.w / 2;\n\n    const int num_class = 80;\n\n    for (int q = 0; q < num_anchors; q++)\n    {\n        const float anchor_w = anchors[q * 2];\n        const float anchor_h = anchors[q * 2 + 1];\n\n        for (int i = 0; i < num_grid_y; i++)\n        {\n            for (int j = 0; j < num_grid_x; j++)\n            {\n                // find class index with max class score\n                int class_index = 0;\n                float class_score = -FLT_MAX;\n                for (int k = 0; k < num_class; k++)\n                {\n                    float score = feat_blob.channel(q * 85 + 5 + k).row(i)[j];\n                    if (score > class_score)\n                    {\n                        class_index = k;\n                        class_score = score;\n                    }\n                }\n\n                float box_score = feat_blob.channel(q * 85 + 4).row(i)[j];\n\n                float confidence = sigmoid(box_score) * 
sigmoid(class_score);\n\n                if (confidence >= prob_threshold)\n                {\n                    // yolov5/models/yolo.py Detect forward\n                    // y = x[i].sigmoid()\n                    // y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i].to(x[i].device)) * self.stride[i]  # xy\n                    // y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n\n                    float dx = sigmoid(feat_blob.channel(q * 85 + 0).row(i)[j]);\n                    float dy = sigmoid(feat_blob.channel(q * 85 + 1).row(i)[j]);\n                    float dw = sigmoid(feat_blob.channel(q * 85 + 2).row(i)[j]);\n                    float dh = sigmoid(feat_blob.channel(q * 85 + 3).row(i)[j]);\n\n                    float pb_cx = (dx * 2.f - 0.5f + j) * stride;\n                    float pb_cy = (dy * 2.f - 0.5f + i) * stride;\n\n                    float pb_w = pow(dw * 2.f, 2) * anchor_w;\n                    float pb_h = pow(dh * 2.f, 2) * anchor_h;\n\n                    float x0 = pb_cx - pb_w * 0.5f;\n                    float y0 = pb_cy - pb_h * 0.5f;\n                    float x1 = pb_cx + pb_w * 0.5f;\n                    float y1 = pb_cy + pb_h * 0.5f;\n\n                    Object obj;\n                    obj.rect.x = x0;\n                    obj.rect.y = y0;\n                    obj.rect.width = x1 - x0;\n                    obj.rect.height = y1 - y0;\n                    obj.label = class_index;\n                    obj.prob = confidence;\n\n                    objects.push_back(obj);\n                }\n            }\n        }\n    }\n}\n\nstatic int detect_yolov7(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov7;\n\n    yolov7.opt.use_vulkan_compute = true;\n    // yolov7.opt.use_bf16_storage = true;\n\n    // git clone https://github.com/WongKinYiu/yolov7\n    // cd yolov7\n    // wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt\n    // python models/export.py 
--weights yolov7.pt\n    // pnnx yolov7.torchscript.pt inputshape=[1,3,640,640] inputshape=[1,3,320,320]\n    yolov7.load_param(\"yolov7.param\");\n    yolov7.load_model(\"yolov7.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // yolov5/models/common.py DetectMultiBackend\n    const int max_stride = 64;\n\n    // letterbox pad to multiple of max_stride\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // pad to target_size rectangle\n    // yolov5/utils/datasets.py letterbox\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov7.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    std::vector<Object> proposals;\n\n    // anchor setting from yolov7/cfg/deploy/yolov7.yaml\n\n    // stride 8\n    {\n        ncnn::Mat out;\n        ex.extract(\"out0\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 12.f;\n        anchors[1] = 16.f;\n        anchors[2] = 19.f;\n        anchors[3] = 36.f;\n        anchors[4] = 40.f;\n        anchors[5] = 28.f;\n\n        std::vector<Object> objects8;\n        generate_proposals(anchors, 8, in_pad, out, prob_threshold, objects8);\n\n        
proposals.insert(proposals.end(), objects8.begin(), objects8.end());\n    }\n\n    // stride 16\n    {\n        ncnn::Mat out;\n        ex.extract(\"out1\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 36.f;\n        anchors[1] = 75.f;\n        anchors[2] = 76.f;\n        anchors[3] = 55.f;\n        anchors[4] = 72.f;\n        anchors[5] = 146.f;\n\n        std::vector<Object> objects16;\n        generate_proposals(anchors, 16, in_pad, out, prob_threshold, objects16);\n\n        proposals.insert(proposals.end(), objects16.begin(), objects16.end());\n    }\n\n    // stride 32\n    {\n        ncnn::Mat out;\n        ex.extract(\"out2\", out);\n\n        ncnn::Mat anchors(6);\n        anchors[0] = 142.f;\n        anchors[1] = 110.f;\n        anchors[2] = 192.f;\n        anchors[3] = 243.f;\n        anchors[4] = 459.f;\n        anchors[5] = 401.f;\n\n        std::vector<Object> objects32;\n        generate_proposals(anchors, 32, in_pad, out, prob_threshold, objects32);\n\n        proposals.insert(proposals.end(), objects32.begin(), objects32.end());\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, 
(float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int 
baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov7(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov8.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolov8 torchscript\n//      yolo export model=yolov8n.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolov8n.torchscript\n// 4. modify yolov8n_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_165 = v_142.view(1, 144, 6400)\n//          v_166 = v_153.view(1, 144, 1600)\n//          v_167 = v_164.view(1, 144, 400)\n//          v_168 = torch.cat((v_165, v_166, v_167), dim=2)\n//          ...\n//      after:\n//          v_165 = v_142.view(1, 144, -1).transpose(1, 2)\n//          v_166 = v_153.view(1, 144, -1).transpose(1, 2)\n//          v_167 = v_164.view(1, 144, -1).transpose(1, 2)\n//          v_168 = torch.cat((v_165, v_166, v_167), dim=1)\n//          return v_168\n// 5. re-export yolov8 torchscript\n//      python3 -c 'import yolov8n_pnnx; yolov8n_pnnx.export_torchscript()'\n// 6. convert new torchscript with dynamic shape\n//      pnnx yolov8n_pnnx.py.pt inputshape=[1,3,640,640] inputshape2=[1,3,320,320]\n// 7. now you get ncnn model files\n//      mv yolov8n_pnnx.py.ncnn.param yolov8n.ncnn.param\n//      mv yolov8n_pnnx.py.ncnn.bin yolov8n.ncnn.bin\n\n// the out blob would be a 2-dim tensor with w=144 h=8400\n//\n//        | bbox-reg 16 x 4       | per-class scores(80) |\n//        +-----+-----+-----+-----+----------------------+\n//        | dx0 | dy0 | dx1 | dy1 |0.1 0.0 0.0 0.5 ......|\n//   all /|     |     |     |     |           .          |\n//  boxes |  .. |  .. |  .. |  .. |0.0 0.9 0.0 0.0 ......|\n//  (8400)|     |     |     |     |           .          |\n//       \\|     |     |     |     |           .          
|\n//        +-----+-----+-----+-----+----------------------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <algorithm>\n#include <float.h>\n#include <math.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        
for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / stride;\n\n    const int reg_max_1 = 16;\n    const int num_class = pred.w - reg_max_1 * 4; // number of classes. 
80 for COCO\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            {\n                const ncnn::Mat pred_score = pred_grid.range(reg_max_1 * 4, num_class);\n\n                for (int k = 0; k < num_class; k++)\n                {\n                    float s = pred_score[k];\n                    if (s > score)\n                    {\n                        label = k;\n                        score = s;\n                    }\n                }\n\n                score = sigmoid(score);\n            }\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4);\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n         
       float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                float y1 = pb_cy + pred_ltrb[3];\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects);\n        pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolov8(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov8;\n\n    yolov8.opt.use_vulkan_compute = true;\n    // yolov8.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolov8/tree/master/app/src/main/assets\n    yolov8.load_param(\"yolov8n.ncnn.param\");\n    yolov8.load_model(\"yolov8n.ncnn.bin\");\n    // yolov8.load_param(\"yolov8s.ncnn.param\");\n    // yolov8.load_model(\"yolov8s.ncnn.bin\");\n    // yolov8.load_param(\"yolov8m.ncnn.param\");\n    // yolov8.load_model(\"yolov8m.ncnn.bin\");\n\n    // if you use oiv7 models, you shall call draw_objects_oiv() instead\n    // 
yolov8.load_param(\"yolov8n_oiv7.ncnn.param\");\n    // yolov8.load_model(\"yolov8n_oiv7.ncnn.bin\");\n    // yolov8.load_param(\"yolov8s_oiv7.ncnn.param\");\n    // yolov8.load_model(\"yolov8s_oiv7.ncnn.bin\");\n    // yolov8.load_param(\"yolov8m_oiv7.ncnn.param\");\n    // yolov8.load_model(\"yolov8m_oiv7.ncnn.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/v8/yolov8.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // letterbox pad to multiple of max_stride\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad to target_size rectangle\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov8.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    
std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects_coco(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining 
table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    static cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 
255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nstatic void draw_objects_oiv(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"Accordion\", \"Adhesive tape\", \"Aircraft\", \"Airplane\", \"Alarm clock\", \"Alpaca\", \"Ambulance\", \"Animal\",\n        \"Ant\", \"Antelope\", \"Apple\", \"Armadillo\", \"Artichoke\", \"Auto part\", \"Axe\", \"Backpack\", \"Bagel\",\n        \"Baked goods\", \"Balance beam\", \"Ball\", \"Balloon\", \"Banana\", \"Band-aid\", \"Banjo\", \"Barge\", \"Barrel\",\n        \"Baseball bat\", \"Baseball glove\", \"Bat (Animal)\", \"Bathroom accessory\", \"Bathroom cabinet\", \"Bathtub\",\n        \"Beaker\", \"Bear\", \"Bed\", \"Bee\", \"Beehive\", \"Beer\", \"Beetle\", \"Bell pepper\", \"Belt\", \"Bench\", \"Bicycle\",\n        \"Bicycle helmet\", \"Bicycle wheel\", \"Bidet\", \"Billboard\", \"Billiard table\", \"Binoculars\", \"Bird\",\n        \"Blender\", \"Blue jay\", \"Boat\", \"Bomb\", \"Book\", \"Bookcase\", \"Boot\", \"Bottle\", \"Bottle opener\",\n        \"Bow and arrow\", \"Bowl\", \"Bowling equipment\", \"Box\", \"Boy\", \"Brassiere\", \"Bread\", \"Briefcase\",\n        \"Broccoli\", \"Bronze sculpture\", \"Brown bear\", \"Building\", \"Bull\", \"Burrito\", \"Bus\", \"Bust\", \"Butterfly\",\n        \"Cabbage\", \"Cabinetry\", \"Cake\", \"Cake stand\", \"Calculator\", \"Camel\", \"Camera\", \"Can opener\", \"Canary\",\n        \"Candle\", \"Candy\", \"Cannon\", \"Canoe\", \"Cantaloupe\", \"Car\", \"Carnivore\", \"Carrot\", \"Cart\", \"Cassette deck\",\n        \"Castle\", \"Cat\", \"Cat furniture\", \"Caterpillar\", \"Cattle\", \"Ceiling fan\", \"Cello\", \"Centipede\",\n        \"Chainsaw\", \"Chair\", \"Cheese\", \"Cheetah\", \"Chest of drawers\", \"Chicken\", \"Chime\", \"Chisel\", 
\"Chopsticks\",\n        \"Christmas tree\", \"Clock\", \"Closet\", \"Clothing\", \"Coat\", \"Cocktail\", \"Cocktail shaker\", \"Coconut\",\n        \"Coffee\", \"Coffee cup\", \"Coffee table\", \"Coffeemaker\", \"Coin\", \"Common fig\", \"Common sunflower\",\n        \"Computer keyboard\", \"Computer monitor\", \"Computer mouse\", \"Container\", \"Convenience store\", \"Cookie\",\n        \"Cooking spray\", \"Corded phone\", \"Cosmetics\", \"Couch\", \"Countertop\", \"Cowboy hat\", \"Crab\", \"Cream\",\n        \"Cricket ball\", \"Crocodile\", \"Croissant\", \"Crown\", \"Crutch\", \"Cucumber\", \"Cupboard\", \"Curtain\",\n        \"Cutting board\", \"Dagger\", \"Dairy Product\", \"Deer\", \"Desk\", \"Dessert\", \"Diaper\", \"Dice\", \"Digital clock\",\n        \"Dinosaur\", \"Dishwasher\", \"Dog\", \"Dog bed\", \"Doll\", \"Dolphin\", \"Door\", \"Door handle\", \"Doughnut\",\n        \"Dragonfly\", \"Drawer\", \"Dress\", \"Drill (Tool)\", \"Drink\", \"Drinking straw\", \"Drum\", \"Duck\", \"Dumbbell\",\n        \"Eagle\", \"Earrings\", \"Egg (Food)\", \"Elephant\", \"Envelope\", \"Eraser\", \"Face powder\", \"Facial tissue holder\",\n        \"Falcon\", \"Fashion accessory\", \"Fast food\", \"Fax\", \"Fedora\", \"Filing cabinet\", \"Fire hydrant\",\n        \"Fireplace\", \"Fish\", \"Flag\", \"Flashlight\", \"Flower\", \"Flowerpot\", \"Flute\", \"Flying disc\", \"Food\",\n        \"Food processor\", \"Football\", \"Football helmet\", \"Footwear\", \"Fork\", \"Fountain\", \"Fox\", \"French fries\",\n        \"French horn\", \"Frog\", \"Fruit\", \"Frying pan\", \"Furniture\", \"Garden Asparagus\", \"Gas stove\", \"Giraffe\",\n        \"Girl\", \"Glasses\", \"Glove\", \"Goat\", \"Goggles\", \"Goldfish\", \"Golf ball\", \"Golf cart\", \"Gondola\",\n        \"Goose\", \"Grape\", \"Grapefruit\", \"Grinder\", \"Guacamole\", \"Guitar\", \"Hair dryer\", \"Hair spray\", \"Hamburger\",\n        \"Hammer\", \"Hamster\", \"Hand dryer\", \"Handbag\", \"Handgun\", \"Harbor 
seal\", \"Harmonica\", \"Harp\",\n        \"Harpsichord\", \"Hat\", \"Headphones\", \"Heater\", \"Hedgehog\", \"Helicopter\", \"Helmet\", \"High heels\",\n        \"Hiking equipment\", \"Hippopotamus\", \"Home appliance\", \"Honeycomb\", \"Horizontal bar\", \"Horse\", \"Hot dog\",\n        \"House\", \"Houseplant\", \"Human arm\", \"Human beard\", \"Human body\", \"Human ear\", \"Human eye\", \"Human face\",\n        \"Human foot\", \"Human hair\", \"Human hand\", \"Human head\", \"Human leg\", \"Human mouth\", \"Human nose\",\n        \"Humidifier\", \"Ice cream\", \"Indoor rower\", \"Infant bed\", \"Insect\", \"Invertebrate\", \"Ipod\", \"Isopod\",\n        \"Jacket\", \"Jacuzzi\", \"Jaguar (Animal)\", \"Jeans\", \"Jellyfish\", \"Jet ski\", \"Jug\", \"Juice\", \"Kangaroo\",\n        \"Kettle\", \"Kitchen & dining room table\", \"Kitchen appliance\", \"Kitchen knife\", \"Kitchen utensil\",\n        \"Kitchenware\", \"Kite\", \"Knife\", \"Koala\", \"Ladder\", \"Ladle\", \"Ladybug\", \"Lamp\", \"Land vehicle\",\n        \"Lantern\", \"Laptop\", \"Lavender (Plant)\", \"Lemon\", \"Leopard\", \"Light bulb\", \"Light switch\", \"Lighthouse\",\n        \"Lily\", \"Limousine\", \"Lion\", \"Lipstick\", \"Lizard\", \"Lobster\", \"Loveseat\", \"Luggage and bags\", \"Lynx\",\n        \"Magpie\", \"Mammal\", \"Man\", \"Mango\", \"Maple\", \"Maracas\", \"Marine invertebrates\", \"Marine mammal\",\n        \"Measuring cup\", \"Mechanical fan\", \"Medical equipment\", \"Microphone\", \"Microwave oven\", \"Milk\",\n        \"Miniskirt\", \"Mirror\", \"Missile\", \"Mixer\", \"Mixing bowl\", \"Mobile phone\", \"Monkey\", \"Moths and butterflies\",\n        \"Motorcycle\", \"Mouse\", \"Muffin\", \"Mug\", \"Mule\", \"Mushroom\", \"Musical instrument\", \"Musical keyboard\",\n        \"Nail (Construction)\", \"Necklace\", \"Nightstand\", \"Oboe\", \"Office building\", \"Office supplies\", \"Orange\",\n        \"Organ (Musical Instrument)\", \"Ostrich\", \"Otter\", \"Oven\", \"Owl\", 
\"Oyster\", \"Paddle\", \"Palm tree\",\n        \"Pancake\", \"Panda\", \"Paper cutter\", \"Paper towel\", \"Parachute\", \"Parking meter\", \"Parrot\", \"Pasta\",\n        \"Pastry\", \"Peach\", \"Pear\", \"Pen\", \"Pencil case\", \"Pencil sharpener\", \"Penguin\", \"Perfume\", \"Person\",\n        \"Personal care\", \"Personal flotation device\", \"Piano\", \"Picnic basket\", \"Picture frame\", \"Pig\",\n        \"Pillow\", \"Pineapple\", \"Pitcher (Container)\", \"Pizza\", \"Pizza cutter\", \"Plant\", \"Plastic bag\", \"Plate\",\n        \"Platter\", \"Plumbing fixture\", \"Polar bear\", \"Pomegranate\", \"Popcorn\", \"Porch\", \"Porcupine\", \"Poster\",\n        \"Potato\", \"Power plugs and sockets\", \"Pressure cooker\", \"Pretzel\", \"Printer\", \"Pumpkin\", \"Punching bag\",\n        \"Rabbit\", \"Raccoon\", \"Racket\", \"Radish\", \"Ratchet (Device)\", \"Raven\", \"Rays and skates\", \"Red panda\",\n        \"Refrigerator\", \"Remote control\", \"Reptile\", \"Rhinoceros\", \"Rifle\", \"Ring binder\", \"Rocket\",\n        \"Roller skates\", \"Rose\", \"Rugby ball\", \"Ruler\", \"Salad\", \"Salt and pepper shakers\", \"Sandal\",\n        \"Sandwich\", \"Saucer\", \"Saxophone\", \"Scale\", \"Scarf\", \"Scissors\", \"Scoreboard\", \"Scorpion\",\n        \"Screwdriver\", \"Sculpture\", \"Sea lion\", \"Sea turtle\", \"Seafood\", \"Seahorse\", \"Seat belt\", \"Segway\",\n        \"Serving tray\", \"Sewing machine\", \"Shark\", \"Sheep\", \"Shelf\", \"Shellfish\", \"Shirt\", \"Shorts\",\n        \"Shotgun\", \"Shower\", \"Shrimp\", \"Sink\", \"Skateboard\", \"Ski\", \"Skirt\", \"Skull\", \"Skunk\", \"Skyscraper\",\n        \"Slow cooker\", \"Snack\", \"Snail\", \"Snake\", \"Snowboard\", \"Snowman\", \"Snowmobile\", \"Snowplow\",\n        \"Soap dispenser\", \"Sock\", \"Sofa bed\", \"Sombrero\", \"Sparrow\", \"Spatula\", \"Spice rack\", \"Spider\",\n        \"Spoon\", \"Sports equipment\", \"Sports uniform\", \"Squash (Plant)\", \"Squid\", \"Squirrel\", 
\"Stairs\",\n        \"Stapler\", \"Starfish\", \"Stationary bicycle\", \"Stethoscope\", \"Stool\", \"Stop sign\", \"Strawberry\",\n        \"Street light\", \"Stretcher\", \"Studio couch\", \"Submarine\", \"Submarine sandwich\", \"Suit\", \"Suitcase\",\n        \"Sun hat\", \"Sunglasses\", \"Surfboard\", \"Sushi\", \"Swan\", \"Swim cap\", \"Swimming pool\", \"Swimwear\",\n        \"Sword\", \"Syringe\", \"Table\", \"Table tennis racket\", \"Tablet computer\", \"Tableware\", \"Taco\", \"Tank\",\n        \"Tap\", \"Tart\", \"Taxi\", \"Tea\", \"Teapot\", \"Teddy bear\", \"Telephone\", \"Television\", \"Tennis ball\",\n        \"Tennis racket\", \"Tent\", \"Tiara\", \"Tick\", \"Tie\", \"Tiger\", \"Tin can\", \"Tire\", \"Toaster\", \"Toilet\",\n        \"Toilet paper\", \"Tomato\", \"Tool\", \"Toothbrush\", \"Torch\", \"Tortoise\", \"Towel\", \"Tower\", \"Toy\",\n        \"Traffic light\", \"Traffic sign\", \"Train\", \"Training bench\", \"Treadmill\", \"Tree\", \"Tree house\",\n        \"Tripod\", \"Trombone\", \"Trousers\", \"Truck\", \"Trumpet\", \"Turkey\", \"Turtle\", \"Umbrella\", \"Unicycle\",\n        \"Van\", \"Vase\", \"Vegetable\", \"Vehicle\", \"Vehicle registration plate\", \"Violin\", \"Volleyball (Ball)\",\n        \"Waffle\", \"Waffle iron\", \"Wall clock\", \"Wardrobe\", \"Washing machine\", \"Waste container\", \"Watch\",\n        \"Watercraft\", \"Watermelon\", \"Weapon\", \"Whale\", \"Wheel\", \"Wheelchair\", \"Whisk\", \"Whiteboard\", \"Willow\",\n        \"Window\", \"Window blind\", \"Wine\", \"Wine glass\", \"Wine rack\", \"Winter melon\", \"Wok\", \"Woman\",\n        \"Wood-burning stove\", \"Woodpecker\", \"Worm\", \"Wrench\", \"Zebra\", \"Zucchini\"\n    };\n\n    static cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        
cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", 
imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov8(m, objects);\n\n    draw_objects_coco(m, objects);\n    // draw_objects_oiv(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov8_cls.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolov8-cls torchscript\n//      yolo export model=yolov8n-cls.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolov8n-cls.torchscript\n// 4. now you get ncnn model files\n//      yolov8n_cls.ncnn.param\n//      yolov8n_cls.ncnn.bin\n\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    int label;\n    float prob;\n};\n\nstatic void get_topk(const ncnn::Mat& cls_scores, int topk, std::vector<Object>& objects)\n{\n    // partial sort topk with index\n    int size = cls_scores.w;\n    std::vector<std::pair<float, int> > vec;\n    vec.resize(size);\n    for (int i = 0; i < size; i++)\n    {\n        vec[i] = std::make_pair(cls_scores[i], i);\n    }\n\n    std::partial_sort(vec.begin(), vec.begin() + topk, vec.end(),\n                      std::greater<std::pair<float, int> >());\n\n    objects.resize(topk);\n    for (int i = 0; i < topk; i++)\n    {\n        objects[i].label = vec[i].second;\n        objects[i].prob = vec[i].first;\n    }\n}\n\nstatic int detect_yolov8_cls(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov8;\n\n    yolov8.opt.use_vulkan_compute = true;\n    // yolov8.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolov8/tree/master/app/src/main/assets\n    yolov8.load_param(\"yolov8n_cls.ncnn.param\");\n    yolov8.load_model(\"yolov8n_cls.ncnn.bin\");\n    // yolov8.load_param(\"yolov8s_cls.ncnn.param\");\n    // yolov8.load_model(\"yolov8s_cls.ncnn.bin\");\n    // yolov8.load_param(\"yolov8m_cls.ncnn.param\");\n    // yolov8.load_model(\"yolov8m_cls.ncnn.bin\");\n\n    
const int target_size = 224;\n    const int topk = 5;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // letterbox pad\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad to target_size rectangle\n    int wpad = target_size - w;\n    int hpad = target_size - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov8.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    // return top-5\n    get_topk(out, topk, objects);\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"tench\", \"goldfish\", \"great white shark\", \"tiger shark\", \"hammerhead\", \"electric ray\", \"stingray\", \"cock\",\n        \"hen\", \"ostrich\", \"brambling\", \"goldfinch\", \"house finch\", \"junco\", \"indigo bunting\", \"robin\", \"bulbul\",\n        \"jay\", \"magpie\", \"chickadee\", \"water ouzel\", \"kite\", \"bald eagle\", \"vulture\", \"great grey owl\",\n        \"European fire salamander\", \"common newt\", \"eft\", \"spotted salamander\", \"axolotl\", \"bullfrog\", \"tree frog\",\n        \"tailed frog\", \"loggerhead\", \"leatherback turtle\", \"mud turtle\", \"terrapin\", \"box turtle\", \"banded gecko\",\n        \"common iguana\", \"American chameleon\", \"whiptail\", \"agama\", 
\"frilled lizard\", \"alligator lizard\",\n        \"Gila monster\", \"green lizard\", \"African chameleon\", \"Komodo dragon\", \"African crocodile\",\n        \"American alligator\", \"triceratops\", \"thunder snake\", \"ringneck snake\", \"hognose snake\", \"green snake\",\n        \"king snake\", \"garter snake\", \"water snake\", \"vine snake\", \"night snake\", \"boa constrictor\", \"rock python\",\n        \"Indian cobra\", \"green mamba\", \"sea snake\", \"horned viper\", \"diamondback\", \"sidewinder\", \"trilobite\",\n        \"harvestman\", \"scorpion\", \"black and gold garden spider\", \"barn spider\", \"garden spider\", \"black widow\",\n        \"tarantula\", \"wolf spider\", \"tick\", \"centipede\", \"black grouse\", \"ptarmigan\", \"ruffed grouse\",\n        \"prairie chicken\", \"peacock\", \"quail\", \"partridge\", \"African grey\", \"macaw\", \"sulphur-crested cockatoo\",\n        \"lorikeet\", \"coucal\", \"bee eater\", \"hornbill\", \"hummingbird\", \"jacamar\", \"toucan\", \"drake\",\n        \"red-breasted merganser\", \"goose\", \"black swan\", \"tusker\", \"echidna\", \"platypus\", \"wallaby\", \"koala\",\n        \"wombat\", \"jellyfish\", \"sea anemone\", \"brain coral\", \"flatworm\", \"nematode\", \"conch\", \"snail\", \"slug\",\n        \"sea slug\", \"chiton\", \"chambered nautilus\", \"Dungeness crab\", \"rock crab\", \"fiddler crab\", \"king crab\",\n        \"American lobster\", \"spiny lobster\", \"crayfish\", \"hermit crab\", \"isopod\", \"white stork\", \"black stork\",\n        \"spoonbill\", \"flamingo\", \"little blue heron\", \"American egret\", \"bittern\", \"crane (bird)\", \"limpkin\",\n        \"European gallinule\", \"American coot\", \"bustard\", \"ruddy turnstone\", \"red-backed sandpiper\", \"redshank\",\n        \"dowitcher\", \"oystercatcher\", \"pelican\", \"king penguin\", \"albatross\", \"grey whale\", \"killer whale\",\n        \"dugong\", \"sea lion\", \"Chihuahua\", \"Japanese spaniel\", \"Maltese dog\", 
\"Pekinese\", \"Shih-Tzu\",\n        \"Blenheim spaniel\", \"papillon\", \"toy terrier\", \"Rhodesian ridgeback\", \"Afghan hound\", \"basset\", \"beagle\",\n        \"bloodhound\", \"bluetick\", \"black-and-tan coonhound\", \"Walker hound\", \"English foxhound\", \"redbone\",\n        \"borzoi\", \"Irish wolfhound\", \"Italian greyhound\", \"whippet\", \"Ibizan hound\", \"Norwegian elkhound\",\n        \"otterhound\", \"Saluki\", \"Scottish deerhound\", \"Weimaraner\", \"Staffordshire bullterrier\",\n        \"American Staffordshire terrier\", \"Bedlington terrier\", \"Border terrier\", \"Kerry blue terrier\",\n        \"Irish terrier\", \"Norfolk terrier\", \"Norwich terrier\", \"Yorkshire terrier\", \"wire-haired fox terrier\",\n        \"Lakeland terrier\", \"Sealyham terrier\", \"Airedale\", \"cairn\", \"Australian terrier\", \"Dandie Dinmont\",\n        \"Boston bull\", \"miniature schnauzer\", \"giant schnauzer\", \"standard schnauzer\", \"Scotch terrier\",\n        \"Tibetan terrier\", \"silky terrier\", \"soft-coated wheaten terrier\", \"West Highland white terrier\",\n        \"Lhasa\", \"flat-coated retriever\", \"curly-coated retriever\", \"golden retriever\", \"Labrador retriever\",\n        \"Chesapeake Bay retriever\", \"German short-haired pointer\", \"vizsla\", \"English setter\", \"Irish setter\",\n        \"Gordon setter\", \"Brittany spaniel\", \"clumber\", \"English springer\", \"Welsh springer spaniel\",\n        \"cocker spaniel\", \"Sussex spaniel\", \"Irish water spaniel\", \"kuvasz\", \"schipperke\", \"groenendael\",\n        \"malinois\", \"briard\", \"kelpie\", \"komondor\", \"Old English sheepdog\", \"Shetland sheepdog\", \"collie\",\n        \"Border collie\", \"Bouvier des Flandres\", \"Rottweiler\", \"German shepherd\", \"Doberman\",\n        \"miniature pinscher\", \"Greater Swiss Mountain dog\", \"Bernese mountain dog\", \"Appenzeller\", \"EntleBucher\",\n        \"boxer\", \"bull mastiff\", \"Tibetan mastiff\", \"French bulldog\", 
\"Great Dane\", \"Saint Bernard\",\n        \"Eskimo dog\", \"malamute\", \"Siberian husky\", \"dalmatian\", \"affenpinscher\", \"basenji\", \"pug\", \"Leonberg\",\n        \"Newfoundland\", \"Great Pyrenees\", \"Samoyed\", \"Pomeranian\", \"chow\", \"keeshond\", \"Brabancon griffon\",\n        \"Pembroke\", \"Cardigan\", \"toy poodle\", \"miniature poodle\", \"standard poodle\", \"Mexican hairless\",\n        \"timber wolf\", \"white wolf\", \"red wolf\", \"coyote\", \"dingo\", \"dhole\", \"African hunting dog\", \"hyena\",\n        \"red fox\", \"kit fox\", \"Arctic fox\", \"grey fox\", \"tabby\", \"tiger cat\", \"Persian cat\", \"Siamese cat\",\n        \"Egyptian cat\", \"cougar\", \"lynx\", \"leopard\", \"snow leopard\", \"jaguar\", \"lion\", \"tiger\", \"cheetah\",\n        \"brown bear\", \"American black bear\", \"ice bear\", \"sloth bear\", \"mongoose\", \"meerkat\", \"tiger beetle\",\n        \"ladybug\", \"ground beetle\", \"long-horned beetle\", \"leaf beetle\", \"dung beetle\", \"rhinoceros beetle\",\n        \"weevil\", \"fly\", \"bee\", \"ant\", \"grasshopper\", \"cricket\", \"walking stick\", \"cockroach\", \"mantis\",\n        \"cicada\", \"leafhopper\", \"lacewing\", \"dragonfly\", \"damselfly\", \"admiral\", \"ringlet\", \"monarch\",\n        \"cabbage butterfly\", \"sulphur butterfly\", \"lycaenid\", \"starfish\", \"sea urchin\", \"sea cucumber\",\n        \"wood rabbit\", \"hare\", \"Angora\", \"hamster\", \"porcupine\", \"fox squirrel\", \"marmot\", \"beaver\",\n        \"guinea pig\", \"sorrel\", \"zebra\", \"hog\", \"wild boar\", \"warthog\", \"hippopotamus\", \"ox\", \"water buffalo\",\n        \"bison\", \"ram\", \"bighorn\", \"ibex\", \"hartebeest\", \"impala\", \"gazelle\", \"Arabian camel\", \"llama\",\n        \"weasel\", \"mink\", \"polecat\", \"black-footed ferret\", \"otter\", \"skunk\", \"badger\", \"armadillo\",\n        \"three-toed sloth\", \"orangutan\", \"gorilla\", \"chimpanzee\", \"gibbon\", \"siamang\", \"guenon\", 
\"patas\",\n        \"baboon\", \"macaque\", \"langur\", \"colobus\", \"proboscis monkey\", \"marmoset\", \"capuchin\", \"howler monkey\",\n        \"titi\", \"spider monkey\", \"squirrel monkey\", \"Madagascar cat\", \"indri\", \"Indian elephant\",\n        \"African elephant\", \"lesser panda\", \"giant panda\", \"barracouta\", \"eel\", \"coho\", \"rock beauty\",\n        \"anemone fish\", \"sturgeon\", \"gar\", \"lionfish\", \"puffer\", \"abacus\", \"abaya\", \"academic gown\",\n        \"accordion\", \"acoustic guitar\", \"aircraft carrier\", \"airliner\", \"airship\", \"altar\", \"ambulance\",\n        \"amphibian\", \"analog clock\", \"apiary\", \"apron\", \"ashcan\", \"assault rifle\", \"backpack\", \"bakery\",\n        \"balance beam\", \"balloon\", \"ballpoint\", \"Band Aid\", \"banjo\", \"bannister\", \"barbell\", \"barber chair\",\n        \"barbershop\", \"barn\", \"barometer\", \"barrel\", \"barrow\", \"baseball\", \"basketball\", \"bassinet\", \"bassoon\",\n        \"bathing cap\", \"bath towel\", \"bathtub\", \"beach wagon\", \"beacon\", \"beaker\", \"bearskin\", \"beer bottle\",\n        \"beer glass\", \"bell cote\", \"bib\", \"bicycle-built-for-two\", \"bikini\", \"binder\", \"binoculars\",\n        \"birdhouse\", \"boathouse\", \"bobsled\", \"bolo tie\", \"bonnet\", \"bookcase\", \"bookshop\", \"bottlecap\", \"bow\",\n        \"bow tie\", \"brass\", \"brassiere\", \"breakwater\", \"breastplate\", \"broom\", \"bucket\", \"buckle\",\n        \"bulletproof vest\", \"bullet train\", \"butcher shop\", \"cab\", \"caldron\", \"candle\", \"cannon\", \"canoe\",\n        \"can opener\", \"cardigan\", \"car mirror\", \"carousel\", \"carpenter's kit\", \"carton\", \"car wheel\",\n        \"cash machine\", \"cassette\", \"cassette player\", \"castle\", \"catamaran\", \"CD player\", \"cello\",\n        \"cellular telephone\", \"chain\", \"chainlink fence\", \"chain mail\", \"chain saw\", \"chest\", \"chiffonier\",\n        \"chime\", \"china cabinet\", 
\"Christmas stocking\", \"church\", \"cinema\", \"cleaver\", \"cliff dwelling\",\n        \"cloak\", \"clog\", \"cocktail shaker\", \"coffee mug\", \"coffeepot\", \"coil\", \"combination lock\",\n        \"computer keyboard\", \"confectionery\", \"container ship\", \"convertible\", \"corkscrew\", \"cornet\",\n        \"cowboy boot\", \"cowboy hat\", \"cradle\", \"crane (machine)\", \"crash helmet\", \"crate\", \"crib\",\n        \"Crock Pot\", \"croquet ball\", \"crutch\", \"cuirass\", \"dam\", \"desk\", \"desktop computer\", \"dial telephone\",\n        \"diaper\", \"digital clock\", \"digital watch\", \"dining table\", \"dishrag\", \"dishwasher\", \"disk brake\",\n        \"dock\", \"dogsled\", \"dome\", \"doormat\", \"drilling platform\", \"drum\", \"drumstick\", \"dumbbell\",\n        \"Dutch oven\", \"electric fan\", \"electric guitar\", \"electric locomotive\", \"entertainment center\",\n        \"envelope\", \"espresso maker\", \"face powder\", \"feather boa\", \"file\", \"fireboat\", \"fire engine\",\n        \"fire screen\", \"flagpole\", \"flute\", \"folding chair\", \"football helmet\", \"forklift\", \"fountain\",\n        \"fountain pen\", \"four-poster\", \"freight car\", \"French horn\", \"frying pan\", \"fur coat\", \"garbage truck\",\n        \"gasmask\", \"gas pump\", \"goblet\", \"go-kart\", \"golf ball\", \"golfcart\", \"gondola\", \"gong\", \"gown\",\n        \"grand piano\", \"greenhouse\", \"grille\", \"grocery store\", \"guillotine\", \"hair slide\", \"hair spray\",\n        \"half track\", \"hammer\", \"hamper\", \"hand blower\", \"hand-held computer\", \"handkerchief\", \"hard disc\",\n        \"harmonica\", \"harp\", \"harvester\", \"hatchet\", \"holster\", \"home theater\", \"honeycomb\", \"hook\",\n        \"hoopskirt\", \"horizontal bar\", \"horse cart\", \"hourglass\", \"iPod\", \"iron\", \"jack-o'-lantern\", \"jean\",\n        \"jeep\", \"jersey\", \"jigsaw puzzle\", \"jinrikisha\", \"joystick\", \"kimono\", \"knee pad\", \"knot\", 
\"lab coat\",\n        \"ladle\", \"lampshade\", \"laptop\", \"lawn mower\", \"lens cap\", \"letter opener\", \"library\", \"lifeboat\",\n        \"lighter\", \"limousine\", \"liner\", \"lipstick\", \"Loafer\", \"lotion\", \"loudspeaker\", \"loupe\", \"lumbermill\",\n        \"magnetic compass\", \"mailbag\", \"mailbox\", \"maillot (tights)\", \"maillot (tank suit)\", \"manhole cover\",\n        \"maraca\", \"marimba\", \"mask\", \"matchstick\", \"maypole\", \"maze\", \"measuring cup\", \"medicine chest\",\n        \"megalith\", \"microphone\", \"microwave\", \"military uniform\", \"milk can\", \"minibus\", \"miniskirt\",\n        \"minivan\", \"missile\", \"mitten\", \"mixing bowl\", \"mobile home\", \"Model T\", \"modem\", \"monastery\",\n        \"monitor\", \"moped\", \"mortar\", \"mortarboard\", \"mosque\", \"mosquito net\", \"motor scooter\", \"mountain bike\",\n        \"mountain tent\", \"mouse\", \"mousetrap\", \"moving van\", \"muzzle\", \"nail\", \"neck brace\", \"necklace\",\n        \"nipple\", \"notebook\", \"obelisk\", \"oboe\", \"ocarina\", \"odometer\", \"oil filter\", \"organ\", \"oscilloscope\",\n        \"overskirt\", \"oxcart\", \"oxygen mask\", \"packet\", \"paddle\", \"paddlewheel\", \"padlock\", \"paintbrush\",\n        \"pajama\", \"palace\", \"panpipe\", \"paper towel\", \"parachute\", \"parallel bars\", \"park bench\",\n        \"parking meter\", \"passenger car\", \"patio\", \"pay-phone\", \"pedestal\", \"pencil box\", \"pencil sharpener\",\n        \"perfume\", \"Petri dish\", \"photocopier\", \"pick\", \"pickelhaube\", \"picket fence\", \"pickup\", \"pier\",\n        \"piggy bank\", \"pill bottle\", \"pillow\", \"ping-pong ball\", \"pinwheel\", \"pirate\", \"pitcher\", \"plane\",\n        \"planetarium\", \"plastic bag\", \"plate rack\", \"plow\", \"plunger\", \"Polaroid camera\", \"pole\",\n        \"police van\", \"poncho\", \"pool table\", \"pop bottle\", \"pot\", \"potter's wheel\", \"power drill\",\n        \"prayer rug\", 
\"printer\", \"prison\", \"projectile\", \"projector\", \"puck\", \"punching bag\", \"purse\",\n        \"quill\", \"quilt\", \"racer\", \"racket\", \"radiator\", \"radio\", \"radio telescope\", \"rain barrel\",\n        \"recreational vehicle\", \"reel\", \"reflex camera\", \"refrigerator\", \"remote control\", \"restaurant\",\n        \"revolver\", \"rifle\", \"rocking chair\", \"rotisserie\", \"rubber eraser\", \"rugby ball\", \"rule\",\n        \"running shoe\", \"safe\", \"safety pin\", \"saltshaker\", \"sandal\", \"sarong\", \"sax\", \"scabbard\", \"scale\",\n        \"school bus\", \"schooner\", \"scoreboard\", \"screen\", \"screw\", \"screwdriver\", \"seat belt\", \"sewing machine\",\n        \"shield\", \"shoe shop\", \"shoji\", \"shopping basket\", \"shopping cart\", \"shovel\", \"shower cap\",\n        \"shower curtain\", \"ski\", \"ski mask\", \"sleeping bag\", \"slide rule\", \"sliding door\", \"slot\", \"snorkel\",\n        \"snowmobile\", \"snowplow\", \"soap dispenser\", \"soccer ball\", \"sock\", \"solar dish\", \"sombrero\",\n        \"soup bowl\", \"space bar\", \"space heater\", \"space shuttle\", \"spatula\", \"speedboat\", \"spider web\",\n        \"spindle\", \"sports car\", \"spotlight\", \"stage\", \"steam locomotive\", \"steel arch bridge\", \"steel drum\",\n        \"stethoscope\", \"stole\", \"stone wall\", \"stopwatch\", \"stove\", \"strainer\", \"streetcar\", \"stretcher\",\n        \"studio couch\", \"stupa\", \"submarine\", \"suit\", \"sundial\", \"sunglass\", \"sunglasses\", \"sunscreen\",\n        \"suspension bridge\", \"swab\", \"sweatshirt\", \"swimming trunks\", \"swing\", \"switch\", \"syringe\",\n        \"table lamp\", \"tank\", \"tape player\", \"teapot\", \"teddy\", \"television\", \"tennis ball\", \"thatch\",\n        \"theater curtain\", \"thimble\", \"thresher\", \"throne\", \"tile roof\", \"toaster\", \"tobacco shop\",\n        \"toilet seat\", \"torch\", \"totem pole\", \"tow truck\", \"toyshop\", \"tractor\", 
\"trailer truck\", \"tray\",\n        \"trench coat\", \"tricycle\", \"trimaran\", \"tripod\", \"triumphal arch\", \"trolleybus\", \"trombone\", \"tub\",\n        \"turnstile\", \"typewriter keyboard\", \"umbrella\", \"unicycle\", \"upright\", \"vacuum\", \"vase\", \"vault\",\n        \"velvet\", \"vending machine\", \"vestment\", \"viaduct\", \"violin\", \"volleyball\", \"waffle iron\", \"wall clock\",\n        \"wallet\", \"wardrobe\", \"warplane\", \"washbasin\", \"washer\", \"water bottle\", \"water jug\", \"water tower\",\n        \"whiskey jug\", \"whistle\", \"wig\", \"window screen\", \"window shade\", \"Windsor tie\", \"wine bottle\", \"wing\",\n        \"wok\", \"wooden spoon\", \"wool\", \"worm fence\", \"wreck\", \"yawl\", \"yurt\", \"web site\", \"comic book\",\n        \"crossword puzzle\", \"street sign\", \"traffic light\", \"book jacket\", \"menu\", \"plate\", \"guacamole\",\n        \"consomme\", \"hot pot\", \"trifle\", \"ice cream\", \"ice lolly\", \"French loaf\", \"bagel\", \"pretzel\",\n        \"cheeseburger\", \"hotdog\", \"mashed potato\", \"head cabbage\", \"broccoli\", \"cauliflower\", \"zucchini\",\n        \"spaghetti squash\", \"acorn squash\", \"butternut squash\", \"cucumber\", \"artichoke\", \"bell pepper\",\n        \"cardoon\", \"mushroom\", \"Granny Smith\", \"strawberry\", \"orange\", \"lemon\", \"fig\", \"pineapple\", \"banana\",\n        \"jackfruit\", \"custard apple\", \"pomegranate\", \"hay\", \"carbonara\", \"chocolate sauce\", \"dough\",\n        \"meat loaf\", \"pizza\", \"potpie\", \"burrito\", \"red wine\", \"espresso\", \"cup\", \"eggnog\", \"alp\", \"bubble\",\n        \"cliff\", \"coral reef\", \"geyser\", \"lakeside\", \"promontory\", \"sandbar\", \"seashore\", \"valley\", \"volcano\",\n        \"ballplayer\", \"groom\", \"scuba diver\", \"rapeseed\", \"daisy\", \"yellow lady's slipper\", \"corn\", \"acorn\",\n        \"hip\", \"buckeye\", \"coral fungus\", \"agaric\", \"gyromitra\", \"stinkhorn\", \"earthstar\", 
\"hen-of-the-woods\",\n        \"bolete\", \"ear\", \"toilet tissue\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    int y_offset = 0;\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f\\n\", obj.label, obj.prob);\n\n        char text[256];\n        sprintf(text, \"%4.1f%% %s\", obj.prob * 100, class_names[obj.label]);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = 0;\n        int y = y_offset;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n\n        y_offset += label_size.height;\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov8_cls(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov8_obb.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolov8-obb torchscript\n//      yolo export model=yolov8n-obb.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolov8n-obb.torchscript\n// 4. modify yolov8n_obb_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_137 = v_136.view(1, 1, 16384)\n//          v_143 = v_142.view(1, 1, 4096)\n//          v_149 = v_148.view(1, 1, 1024)\n//          v_150 = torch.cat((v_137, v_143, v_149), dim=2)\n//          ...\n//          v_186 = v_163.view(1, 79, 16384)\n//          v_187 = v_174.view(1, 79, 4096)\n//          v_188 = v_185.view(1, 79, 1024)\n//          v_189 = torch.cat((v_186, v_187, v_188), dim=2)\n//          ...\n//      after:\n//          v_137 = v_136.view(1, 1, -1).transpose(1, 2)\n//          v_143 = v_142.view(1, 1, -1).transpose(1, 2)\n//          v_149 = v_148.view(1, 1, -1).transpose(1, 2)\n//          v_150 = torch.cat((v_137, v_143, v_149), dim=1)\n//          ...\n//          v_186 = v_163.view(1, 79, -1).transpose(1, 2)\n//          v_187 = v_174.view(1, 79, -1).transpose(1, 2)\n//          v_188 = v_185.view(1, 79, -1).transpose(1, 2)\n//          v_189 = torch.cat((v_186, v_187, v_188), dim=1)\n//          return v_189, v_150\n// 5. re-export yolov8-obb torchscript\n//      python3 -c 'import yolov8n_obb_pnnx; yolov8n_obb_pnnx.export_torchscript()'\n// 6. convert new torchscript with dynamic shape\n//      pnnx yolov8n_obb_pnnx.py.pt inputshape=[1,3,1024,1024] inputshape2=[1,3,512,512]\n// 7. 
now you get ncnn model files\n//      mv yolov8n_obb_pnnx.py.ncnn.param yolov8n_obb.ncnn.param\n//      mv yolov8n_obb_pnnx.py.ncnn.bin yolov8n_obb.ncnn.bin\n\n// the out blob would be a 2-dim tensor with w=79 h=21504\n//\n//        | bbox-reg 16 x 4       |score(15)|\n//        +-----+-----+-----+-----+---------+\n//        | dx0 | dy0 | dx1 | dy1 | 0.1 ... |\n//   all /|     |     |     |     |     ... |\n//  boxes |  .. |  .. |  .. |  .. | 0.0 ... |\n// (21504)|     |     |     |     |  .  ... |\n//       \\|     |     |     |     |  .  ... |\n//        +-----+-----+-----+-----+---------+\n//\n\n// the out blob would be a 2-dim tensor with w=1 h=21504\n//\n//        | degree(1)|\n//        +----------+\n//        |    0.1   |\n//   all /|          |\n//  boxes |    0.0   |\n// (21504)|     .    |\n//       \\|     .    |\n//        +----------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n\n#include <float.h>\n#include <math.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::RotatedRect rrect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    std::vector<cv::Point2f> intersection;\n    cv::rotatedRectangleIntersection(a.rrect, b.rrect, intersection);\n    if (intersection.empty())\n        return 0.f;\n\n    return cv::contourArea(intersection);\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel 
sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rrect.size.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area;\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_angle, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / stride;\n\n    const int reg_max_1 = 16;\n    const int num_class = pred.w - reg_max_1 * 4; // number of classes. 
15 for DOTAv1\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            {\n                const ncnn::Mat pred_score = pred_grid.range(reg_max_1 * 4, num_class);\n\n                for (int k = 0; k < num_class; k++)\n                {\n                    float s = pred_score[k];\n                    if (s > score)\n                    {\n                        label = k;\n                        score = s;\n                    }\n                }\n\n                score = sigmoid(score);\n            }\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4).clone();\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                
}\n\n                float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                const float angle = sigmoid(pred_angle.row(y * num_grid_x + x)[0]) - 0.25f;\n\n                const float angle_rad = angle * 3.14159265358979323846f;\n                const float angle_degree = angle * 180.f;\n\n                float cos = cosf(angle_rad);\n                float sin = sinf(angle_rad);\n\n                float xx = (pred_ltrb[2] - pred_ltrb[0]) * 0.5f;\n                float yy = (pred_ltrb[3] - pred_ltrb[1]) * 0.5f;\n                float xr = xx * cos - yy * sin;\n                float yr = xx * sin + yy * cos;\n                const float cx = pb_cx + xr;\n                const float cy = pb_cy + yr;\n                const float ww = pred_ltrb[2] + pred_ltrb[0];\n                const float hh = pred_ltrb[3] + pred_ltrb[1];\n\n                Object obj;\n                obj.rrect = cv::RotatedRect(cv::Point2f(cx, cy), cv::Size_<float>(ww, hh), angle_degree);\n                obj.label = label;\n                obj.prob = score;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_angle, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), pred_angle.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects);\n\n        pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolov8_obb(const cv::Mat& bgr, std::vector<Object>& 
objects)\n{\n    ncnn::Net yolov8;\n\n    yolov8.opt.use_vulkan_compute = true;\n    // yolov8.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolov8/tree/master/app/src/main/assets\n    yolov8.load_param(\"yolov8n_obb.ncnn.param\");\n    yolov8.load_model(\"yolov8n_obb.ncnn.bin\");\n    // yolov8.load_param(\"yolov8s_obb.ncnn.param\");\n    // yolov8.load_model(\"yolov8s_obb.ncnn.bin\");\n    // yolov8.load_param(\"yolov8m_obb.ncnn.param\");\n    // yolov8.load_model(\"yolov8m_obb.ncnn.bin\");\n\n    const int target_size = 1024;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/v8/yolov8.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // resize so the longer side equals target_size, keeping the aspect ratio\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad width and height up to the next multiple of max_stride\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov8.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    ncnn::Mat out_angle;\n    ex.extract(\"out1\", 
out_angle);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, out_angle, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n    if (count == 0)\n        return 0;\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        Object obj = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        obj.rrect.center.x = (obj.rrect.center.x - (wpad / 2)) / scale;\n        obj.rrect.center.y = (obj.rrect.center.y - (hpad / 2)) / scale;\n        obj.rrect.size.width = (obj.rrect.size.width) / scale;\n        obj.rrect.size.height = (obj.rrect.size.height) / scale;\n\n        objects[i] = obj;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"plane\", \"ship\", \"storage tank\", \"baseball diamond\", \"tennis court\",\n        \"basketball court\", \"ground track field\", \"harbor\", \"bridge\", \"large vehicle\",\n        \"small vehicle\", \"helicopter\", \"roundabout\", \"soccer ball field\", \"swimming pool\"\n    };\n\n    static const cv::Scalar colors[] = {\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    
};\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[obj.label];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f  @ %.2f\\n\", obj.label, obj.prob,\n                obj.rrect.center.x, obj.rrect.center.y, obj.rrect.size.width, obj.rrect.size.height, obj.rrect.angle);\n\n        cv::Point2f corners[4];\n        obj.rrect.points(corners);\n        cv::line(image, corners[0], corners[1], color);\n        cv::line(image, corners[1], corners[2], color);\n        cv::line(image, corners[2], corners[3], color);\n        cv::line(image, corners[3], corners[0], color);\n    }\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[obj.label];\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rrect.center.x - label_size.width / 2;\n        int y = obj.rrect.center.y - label_size.height / 2 - baseLine;\n        if (y < 0)\n            y = 0;\n        if (y + label_size.height > image.rows)\n            y = image.rows - label_size.height;\n        if (x < 0)\n            x = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, 
\"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov8_obb(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov8_pose.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolov8-pose torchscript\n//      yolo export model=yolov8n-pose.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolov8n-pose.torchscript\n// 4. modify yolov8n_pose_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_137 = v_136.view(1, 51, 6400)\n//          v_143 = v_142.view(1, 51, 1600)\n//          v_149 = v_148.view(1, 51, 400)\n//          v_150 = torch.cat((v_137, v_143, v_149), dim=-1)\n//          ...\n//          v_184 = v_161.view(1, 65, 6400)\n//          v_185 = v_172.view(1, 65, 1600)\n//          v_186 = v_183.view(1, 65, 400)\n//          v_187 = torch.cat((v_184, v_185, v_186), dim=2)\n//          ...\n//      after:\n//          v_137 = v_136.view(1, 51, -1).transpose(1, 2)\n//          v_143 = v_142.view(1, 51, -1).transpose(1, 2)\n//          v_149 = v_148.view(1, 51, -1).transpose(1, 2)\n//          v_150 = torch.cat((v_137, v_143, v_149), dim=1)\n//          ...\n//          v_184 = v_161.view(1, 65, -1).transpose(1, 2)\n//          v_185 = v_172.view(1, 65, -1).transpose(1, 2)\n//          v_186 = v_183.view(1, 65, -1).transpose(1, 2)\n//          v_187 = torch.cat((v_184, v_185, v_186), dim=1)\n//          return v_187, v_150\n// 5. re-export yolov8-pose torchscript\n//      python3 -c 'import yolov8n_pose_pnnx; yolov8n_pose_pnnx.export_torchscript()'\n// 6. convert new torchscript with dynamic shape\n//      pnnx yolov8n_pose_pnnx.py.pt inputshape=[1,3,640,640] inputshape2=[1,3,320,320]\n// 7. 
now you get ncnn model files\n//      mv yolov8n_pose_pnnx.py.ncnn.param yolov8n_pose.ncnn.param\n//      mv yolov8n_pose_pnnx.py.ncnn.bin yolov8n_pose.ncnn.bin\n\n// the out blob would be a 2-dim tensor with w=65 h=8400\n//\n//        | bbox-reg 16 x 4       |score(1)|\n//        +-----+-----+-----+-----+--------+\n//        | dx0 | dy0 | dx1 | dy1 |   0.1  |\n//   all /|     |     |     |     |        |\n//  boxes |  .. |  .. |  .. |  .. |   0.0  |\n//  (8400)|     |     |     |     |   .    |\n//       \\|     |     |     |     |   .    |\n//        +-----+-----+-----+-----+--------+\n//\n\n//\n//        | pose (51) |\n//        +-----------+\n//        |0.1........|\n//   all /|           |\n//  boxes |0.0........|\n//  (8400)|     .     |\n//       \\|     .     |\n//        +-----------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct KeyPoint\n{\n    cv::Point2f p;\n    float prob;\n};\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n    std::vector<KeyPoint> keypoints;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel sections\n    {\n        // #pragma omp 
section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_points, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / stride;\n\n    const int reg_max_1 = 16;\n    const int num_points = pred_points.w / 3;\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; 
x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n            const ncnn::Mat pred_points_grid = pred_points.row_range(y * num_grid_x + x, 1).reshape(3, num_points);\n\n            // find label with max score\n            int label = 0;\n            float score = sigmoid(pred_grid[reg_max_1 * 4]);\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4).clone();\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n                float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                float y1 = pb_cy + pred_ltrb[3];\n\n                std::vector<KeyPoint> keypoints;\n                for (int k = 0; k < num_points; k++)\n       
         {\n                    KeyPoint keypoint;\n                    keypoint.p.x = (x + pred_points_grid.row(k)[0] * 2) * stride;\n                    keypoint.p.y = (y + pred_points_grid.row(k)[1] * 2) * stride;\n                    keypoint.prob = sigmoid(pred_points_grid.row(k)[2]);\n                    keypoints.push_back(keypoint);\n                }\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n                obj.keypoints = keypoints;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const ncnn::Mat& pred_points, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), pred_points.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects);\n\n        pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolov8_pose(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov8;\n\n    yolov8.opt.use_vulkan_compute = true;\n    // yolov8.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolov8/tree/master/app/src/main/assets\n    yolov8.load_param(\"yolov8n_pose.ncnn.param\");\n    yolov8.load_model(\"yolov8n_pose.ncnn.bin\");\n    // yolov8.load_param(\"yolov8s_pose.ncnn.param\");\n    // yolov8.load_model(\"yolov8s_pose.ncnn.bin\");\n    // 
yolov8.load_param(\"yolov8m_pose.ncnn.param\");\n    // yolov8.load_model(\"yolov8m_pose.ncnn.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/v8/yolov8.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // resize so that the longer side equals target_size, keeping the aspect ratio\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad each side up to a multiple of max_stride\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov8.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    ncnn::Mat out_points;\n    ex.extract(\"out1\", out_points);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, out_points, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = 
picked.size();\n    if (count == 0)\n        return 0;\n\n    const int num_points = out_points.w / 3;\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        for (int j = 0; j < num_points; j++)\n        {\n            objects[i].keypoints[j].p.x = (objects[i].keypoints[j].p.x - (wpad / 2)) / scale;\n            objects[i].keypoints[j].p.y = (objects[i].keypoints[j].p.y - (hpad / 2)) / scale;\n        }\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\"person\"};\n\n    static const cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        
cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        // draw bone\n        static const int joint_pairs[16][2] = {\n            {0, 1}, {1, 3}, {0, 2}, {2, 4}, {5, 6}, {5, 7}, {7, 9}, {6, 8}, {8, 10}, {5, 11}, {6, 12}, {11, 12}, {11, 13}, {12, 14}, {13, 15}, {14, 16}\n        };\n        static const cv::Scalar bone_colors[] = {\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(0, 255, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 128, 0),\n            cv::Scalar(255, 51, 255),\n            cv::Scalar(255, 51, 255),\n            cv::Scalar(255, 51, 255),\n            cv::Scalar(51, 153, 255),\n            cv::Scalar(51, 153, 255),\n            cv::Scalar(51, 153, 255),\n            cv::Scalar(51, 153, 255),\n        };\n\n        for (int j = 0; j < 16; j++)\n        {\n            const KeyPoint& p1 = obj.keypoints[joint_pairs[j][0]];\n            const KeyPoint& p2 = obj.keypoints[joint_pairs[j][1]];\n\n            if (p1.prob < 0.2f || p2.prob < 0.2f)\n                continue;\n\n            cv::line(image, p1.p, p2.p, bone_colors[j], 2);\n        }\n\n        // draw joint\n        for (size_t j = 0; j < obj.keypoints.size(); j++)\n        {\n            const KeyPoint& keypoint = obj.keypoints[j];\n\n            fprintf(stderr, \"%.2f %.2f = %.5f\\n\", keypoint.p.x, keypoint.p.y, keypoint.prob);\n\n    
        if (keypoint.prob < 0.2f)\n                continue;\n\n            cv::circle(image, keypoint.p, 3, color, -1);\n        }\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov8_pose(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolov8_seg.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yolov8-seg torchscript\n//      yolo export model=yolov8n-seg.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolov8n-seg.torchscript\n// 4. modify yolov8n_seg_pnnx.py for dynamic shape inference\n//      A. modify reshape to support dynamic image sizes\n//      B. permute tensor before concat and adjust concat axis\n//      C. drop post-process part\n//      before:\n//          v_144 = v_143.view(1, 32, 6400)\n//          v_150 = v_149.view(1, 32, 1600)\n//          v_156 = v_155.view(1, 32, 400)\n//          v_157 = torch.cat((v_144, v_150, v_156), dim=2)\n//          ...\n//          v_191 = v_168.view(1, 144, 6400)\n//          v_192 = v_179.view(1, 144, 1600)\n//          v_193 = v_190.view(1, 144, 400)\n//          v_194 = torch.cat((v_191, v_192, v_193), dim=2)\n//          ...\n//          v_215 = (v_214, v_138, )\n//          return v_215\n//      after:\n//          v_144 = v_143.view(1, 32, -1).transpose(1, 2)\n//          v_150 = v_149.view(1, 32, -1).transpose(1, 2)\n//          v_156 = v_155.view(1, 32, -1).transpose(1, 2)\n//          v_157 = torch.cat((v_144, v_150, v_156), dim=1)\n//          ...\n//          v_191 = v_168.view(1, 144, -1).transpose(1, 2)\n//          v_192 = v_179.view(1, 144, -1).transpose(1, 2)\n//          v_193 = v_190.view(1, 144, -1).transpose(1, 2)\n//          v_194 = torch.cat((v_191, v_192, v_193), dim=1)\n//          return v_194, v_157, v_138\n// 5. re-export yolov8-seg torchscript\n//      python3 -c 'import yolov8n_seg_pnnx; yolov8n_seg_pnnx.export_torchscript()'\n// 6. convert new torchscript with dynamic shape\n//      pnnx yolov8n_seg_pnnx.py.pt inputshape=[1,3,640,640] inputshape2=[1,3,320,320]\n// 7. 
now you get ncnn model files\n//      mv yolov8n_seg_pnnx.py.ncnn.param yolov8n_seg.ncnn.param\n//      mv yolov8n_seg_pnnx.py.ncnn.bin yolov8n_seg.ncnn.bin\n\n// the out blob would be a 2-dim tensor with w=144 h=8400\n//\n//        | bbox-reg 16 x 4       | per-class scores(80) |\n//        +-----+-----+-----+-----+----------------------+\n//        | dx0 | dy0 | dx1 | dy1 |0.1 0.0 0.0 0.5 ......|\n//   all /|     |     |     |     |           .          |\n//  boxes |  .. |  .. |  .. |  .. |0.0 0.9 0.0 0.0 ......|\n//  (8400)|     |     |     |     |           .          |\n//       \\|     |     |     |     |           .          |\n//        +-----+-----+-----+-----+----------------------+\n//\n\n//\n//        | mask (32) |\n//        +-----------+\n//        |0.1........|\n//   all /|           |\n//  boxes |0.0........|\n//  (8400)|     .     |\n//       \\|     .     |\n//        +-----------+\n//\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n    int gindex;\n    cv::Mat mask;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel 
sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic inline float sigmoid(float x)\n{\n    return 1.0f / (1.0f + expf(-x));\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, int stride, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    const int num_grid_x = w / stride;\n    const int num_grid_y = h / stride;\n\n    const int reg_max_1 = 16;\n    const int num_class = pred.w - reg_max_1 * 4; // number of classes. 
80 for COCO\n\n    for (int y = 0; y < num_grid_y; y++)\n    {\n        for (int x = 0; x < num_grid_x; x++)\n        {\n            const ncnn::Mat pred_grid = pred.row_range(y * num_grid_x + x, 1);\n\n            // find label with max score\n            int label = -1;\n            float score = -FLT_MAX;\n            {\n                const ncnn::Mat pred_score = pred_grid.range(reg_max_1 * 4, num_class);\n\n                for (int k = 0; k < num_class; k++)\n                {\n                    float s = pred_score[k];\n                    if (s > score)\n                    {\n                        label = k;\n                        score = s;\n                    }\n                }\n\n                score = sigmoid(score);\n            }\n\n            if (score >= prob_threshold)\n            {\n                ncnn::Mat pred_bbox = pred_grid.range(0, reg_max_1 * 4).reshape(reg_max_1, 4).clone();\n\n                {\n                    ncnn::Layer* softmax = ncnn::create_layer(\"Softmax\");\n\n                    ncnn::ParamDict pd;\n                    pd.set(0, 1); // axis\n                    pd.set(1, 1);\n                    softmax->load_param(pd);\n\n                    ncnn::Option opt;\n                    opt.num_threads = 1;\n                    opt.use_packing_layout = false;\n\n                    softmax->create_pipeline(opt);\n\n                    softmax->forward_inplace(pred_bbox, opt);\n\n                    softmax->destroy_pipeline(opt);\n\n                    delete softmax;\n                }\n\n                float pred_ltrb[4];\n                for (int k = 0; k < 4; k++)\n                {\n                    float dis = 0.f;\n                    const float* dis_after_sm = pred_bbox.row(k);\n                    for (int l = 0; l < reg_max_1; l++)\n                    {\n                        dis += l * dis_after_sm[l];\n                    }\n\n                    pred_ltrb[k] = dis * stride;\n                }\n\n 
               float pb_cx = (x + 0.5f) * stride;\n                float pb_cy = (y + 0.5f) * stride;\n\n                float x0 = pb_cx - pred_ltrb[0];\n                float y0 = pb_cy - pred_ltrb[1];\n                float x1 = pb_cx + pred_ltrb[2];\n                float y1 = pb_cy + pred_ltrb[3];\n\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = x1 - x0;\n                obj.rect.height = y1 - y0;\n                obj.label = label;\n                obj.prob = score;\n                obj.gindex = y * num_grid_x + x;\n\n                objects.push_back(obj);\n            }\n        }\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, const std::vector<int>& strides, const ncnn::Mat& in_pad, float prob_threshold, std::vector<Object>& objects)\n{\n    const int w = in_pad.w;\n    const int h = in_pad.h;\n\n    int pred_row_offset = 0;\n    for (size_t i = 0; i < strides.size(); i++)\n    {\n        const int stride = strides[i];\n\n        const int num_grid_x = w / stride;\n        const int num_grid_y = h / stride;\n        const int num_grid = num_grid_x * num_grid_y;\n\n        std::vector<Object> objects_stride;\n        generate_proposals(pred.row_range(pred_row_offset, num_grid), stride, in_pad, prob_threshold, objects_stride);\n\n        for (size_t j = 0; j < objects_stride.size(); j++)\n        {\n            Object obj = objects_stride[j];\n            obj.gindex += pred_row_offset;\n            objects.push_back(obj);\n        }\n\n        pred_row_offset += num_grid;\n    }\n}\n\nstatic int detect_yolov8_seg(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolov8;\n\n    yolov8.opt.use_vulkan_compute = true;\n    // yolov8.opt.use_bf16_storage = true;\n\n    // https://github.com/nihui/ncnn-android-yolov8/tree/master/app/src/main/assets\n    yolov8.load_param(\"yolov8n_seg.ncnn.param\");\n    
yolov8.load_model(\"yolov8n_seg.ncnn.bin\");\n    // yolov8.load_param(\"yolov8s_seg.ncnn.param\");\n    // yolov8.load_model(\"yolov8s_seg.ncnn.bin\");\n    // yolov8.load_param(\"yolov8m_seg.ncnn.param\");\n    // yolov8.load_model(\"yolov8m_seg.ncnn.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n    const float mask_threshold = 0.5f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // ultralytics/cfg/models/v8/yolov8.yaml\n    std::vector<int> strides(3);\n    strides[0] = 8;\n    strides[1] = 16;\n    strides[2] = 32;\n    const int max_stride = 32;\n\n    // resize so that the longer side equals target_size, keeping the aspect ratio\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad each side up to a multiple of max_stride\n    int wpad = (w + max_stride - 1) / max_stride * max_stride - w;\n    int hpad = (h + max_stride - 1) / max_stride * max_stride - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, norm_vals);\n\n    ncnn::Extractor ex = yolov8.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, strides, in_pad, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    
nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n    if (count == 0)\n        return 0;\n\n    ncnn::Mat mask_feat;\n    ex.extract(\"out1\", mask_feat);\n\n    ncnn::Mat mask_protos;\n    ex.extract(\"out2\", mask_protos);\n\n    ncnn::Mat objects_mask_feat(mask_feat.w, 1, count);\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n\n        // pick mask feat\n        memcpy(objects_mask_feat.channel(i), mask_feat.row(objects[i].gindex), mask_feat.w * sizeof(float));\n    }\n\n    // process mask\n    ncnn::Mat objects_mask;\n    {\n        ncnn::Layer* gemm = ncnn::create_layer(\"Gemm\");\n\n        ncnn::ParamDict pd;\n        pd.set(6, 1);                             // constantC\n        pd.set(7, count);                         // constantM\n        pd.set(8, mask_protos.w * mask_protos.h); // constantN\n        pd.set(9, mask_feat.w);                   // constantK\n        pd.set(10, -1);                           // constant_broadcast_type_C\n        pd.set(11, 1);                            // output_N1M\n        gemm->load_param(pd);\n\n        ncnn::Option opt;\n        opt.num_threads = 1;\n   
     opt.use_packing_layout = false;\n\n        gemm->create_pipeline(opt);\n\n        std::vector<ncnn::Mat> gemm_inputs(2);\n        gemm_inputs[0] = objects_mask_feat;\n        gemm_inputs[1] = mask_protos.reshape(mask_protos.w * mask_protos.h, 1, mask_protos.c);\n        std::vector<ncnn::Mat> gemm_outputs(1);\n        gemm->forward(gemm_inputs, gemm_outputs, opt);\n        objects_mask = gemm_outputs[0].reshape(mask_protos.w, mask_protos.h, count);\n\n        gemm->destroy_pipeline(opt);\n\n        delete gemm;\n    }\n    {\n        ncnn::Layer* sigmoid = ncnn::create_layer(\"Sigmoid\");\n\n        ncnn::Option opt;\n        opt.num_threads = 1;\n        opt.use_packing_layout = false;\n\n        sigmoid->create_pipeline(opt);\n\n        sigmoid->forward_inplace(objects_mask, opt);\n\n        sigmoid->destroy_pipeline(opt);\n\n        delete sigmoid;\n    }\n\n    // resize mask map\n    {\n        ncnn::Mat objects_mask_resized;\n        ncnn::resize_bilinear(objects_mask, objects_mask_resized, in_pad.w / scale, in_pad.h / scale);\n        objects_mask = objects_mask_resized;\n    }\n\n    // create per-object mask\n    for (int i = 0; i < count; i++)\n    {\n        Object& obj = objects[i];\n\n        const ncnn::Mat mm = objects_mask.channel(i);\n\n        obj.mask = cv::Mat((int)obj.rect.height, (int)obj.rect.width, CV_8UC1);\n\n        // adjust offset to original unpadded and clip inside object box\n        for (int y = 0; y < (int)obj.rect.height; y++)\n        {\n            const float* pmm = mm.row((int)(hpad / 2 / scale + obj.rect.y + y)) + (int)(wpad / 2 / scale + obj.rect.x);\n            uchar* pmask = obj.mask.ptr<uchar>(y);\n            for (int x = 0; x < (int)obj.rect.width; x++)\n            {\n                pmask[x] = pmm[x] > mask_threshold ? 
1 : 0;\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    static cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    
};\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        for (int y = 0; y < (int)obj.rect.height; y++)\n        {\n            const uchar* maskptr = obj.mask.ptr<const uchar>(y);\n            uchar* bgrptr = image.ptr<uchar>((int)obj.rect.y + y) + (int)obj.rect.x * 3;\n            for (int x = 0; x < (int)obj.rect.width; x++)\n            {\n                if (maskptr[x])\n                {\n                    bgrptr[0] = bgrptr[0] * 0.5 + color[0] * 0.5;\n                    bgrptr[1] = bgrptr[1] * 0.5 + color[1] * 0.5;\n                    bgrptr[2] = bgrptr[2] * 0.5 + color[2] * 0.5;\n                }\n                bgrptr += 3;\n            }\n        }\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s 
[imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolov8_seg(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yoloworld.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n// 1. install\n//      pip3 install -U ultralytics pnnx ncnn\n// 2. export yoloworld torchscript\n//      yolo export model=yolov8s-world.pt format=torchscript\n//      yolo export model=yolov8m-world.pt format=torchscript\n//      yolo export model=yolov8l-world.pt format=torchscript\n//      yolo export model=yolov8x-world.pt format=torchscript\n//      yolo export model=yolov8s-worldv2.pt format=torchscript\n//      yolo export model=yolov8m-worldv2.pt format=torchscript\n//      yolo export model=yolov8l-worldv2.pt format=torchscript\n//      yolo export model=yolov8x-worldv2.pt format=torchscript\n// 3. convert torchscript with static shape\n//      pnnx yolov8s-world.torchscript\n//      pnnx yolov8m-world.torchscript\n//      pnnx yolov8l-world.torchscript\n//      pnnx yolov8x-world.torchscript\n//      pnnx yolov8s-worldv2.torchscript\n//      pnnx yolov8m-worldv2.torchscript\n//      pnnx yolov8l-worldv2.torchscript\n//      pnnx yolov8x-worldv2.torchscript\n\n// the out blob would be a 2-dim tensor with w=8400 h=84\n//\n//        |    all boxes (8400)     |\n//        +-------------------------+\n//        | center-x   .            |\n//  bbox  | center-y   .            |\n//        |   w        .            |\n//        |   h        .            |\n//        +-------------------------+\n//        | 0.1        .            |\n//   per  | 0.0        .            |\n//  class | 0.5        .            |\n// scores |  .         .            |\n//  (80)  |  .         .            
|\n//        +-------------------------+\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = objects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (objects[i].prob > p)\n            i++;\n\n        while (objects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(objects[i], objects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    // #pragma omp parallel sections\n    {\n        // #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(objects, left, j);\n        }\n        // #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(objects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& objects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = objects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = objects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = objects[i];\n\n        int keep = 1;\n        for (int j = 0; j < 
(int)picked.size(); j++)\n        {\n            const Object& b = objects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic void generate_proposals(const ncnn::Mat& pred, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_boxes = pred.w;\n    const int num_class = pred.h - 4;\n\n    const ncnn::Mat pred_bbox = pred.row_range(0, 4);\n    const ncnn::Mat pred_score = pred.row_range(4, num_class);\n\n    for (int i = 0; i < num_boxes; i++)\n    {\n        int label = 0;\n        float score = -9999.f;\n        for (int j = 0; j < num_class; j++)\n        {\n            const float prob = pred_score.row(j)[i];\n            if (prob > score)\n            {\n                score = prob;\n                label = j;\n            }\n        }\n\n        if (score >= prob_threshold)\n        {\n            const float cx = pred_bbox.row(0)[i];\n            const float cy = pred_bbox.row(1)[i];\n            const float w = pred_bbox.row(2)[i];\n            const float h = pred_bbox.row(3)[i];\n\n            Object obj;\n            obj.rect.x = cx - w / 2;\n            obj.rect.y = cy - h / 2;\n            obj.rect.width = w;\n            obj.rect.height = h;\n            obj.label = label;\n            obj.prob = score;\n\n            objects.push_back(obj);\n        }\n    }\n}\n\nstatic int detect_yoloworld(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yoloworld;\n\n    yoloworld.opt.use_vulkan_compute = true;\n    // yoloworld.opt.use_bf16_storage = true;\n\n    // 
https://github.com/nihui/ncnn-assets/tree/master/models\n    // yoloworld.load_param(\"yolov8s_world.ncnn.param\");\n    // yoloworld.load_model(\"yolov8s_world.ncnn.bin\");\n    // yoloworld.load_param(\"yolov8m_world.ncnn.param\");\n    // yoloworld.load_model(\"yolov8m_world.ncnn.bin\");\n    // yoloworld.load_param(\"yolov8l_world.ncnn.param\");\n    // yoloworld.load_model(\"yolov8l_world.ncnn.bin\");\n    // yoloworld.load_param(\"yolov8x_world.ncnn.param\");\n    // yoloworld.load_model(\"yolov8x_world.ncnn.bin\");\n    yoloworld.load_param(\"yolov8s_worldv2.ncnn.param\");\n    yoloworld.load_model(\"yolov8s_worldv2.ncnn.bin\");\n    // yoloworld.load_param(\"yolov8m_worldv2.ncnn.param\");\n    // yoloworld.load_model(\"yolov8m_worldv2.ncnn.bin\");\n    // yoloworld.load_param(\"yolov8l_worldv2.ncnn.param\");\n    // yoloworld.load_model(\"yolov8l_worldv2.ncnn.bin\");\n    // yoloworld.load_param(\"yolov8x_worldv2.ncnn.param\");\n    // yoloworld.load_model(\"yolov8x_worldv2.ncnn.bin\");\n\n    const int target_size = 640;\n    const float prob_threshold = 0.25f;\n    const float nms_threshold = 0.45f;\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    // letterbox pad\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)target_size / w;\n        w = target_size;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)target_size / h;\n        h = target_size;\n        w = w * scale;\n    }\n\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, img_w, img_h, w, h);\n\n    // letterbox pad to target_size rectangle\n    int wpad = target_size - w;\n    int hpad = target_size - h;\n    ncnn::Mat in_pad;\n    ncnn::copy_make_border(in, in_pad, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, ncnn::BORDER_CONSTANT, 114.f);\n\n    const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};\n    in_pad.substract_mean_normalize(0, 
norm_vals);\n\n    ncnn::Extractor ex = yoloworld.create_extractor();\n\n    ex.input(\"in0\", in_pad);\n\n    ncnn::Mat out;\n    ex.extract(\"out0\", out);\n\n    std::vector<Object> proposals;\n    generate_proposals(out, prob_threshold, proposals);\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, nms_threshold);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x - (wpad / 2)) / scale;\n        float y0 = (objects[i].rect.y - (hpad / 2)) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width - (wpad / 2)) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height - (hpad / 2)) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n        objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports 
ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    static cv::Scalar colors[] = {\n        cv::Scalar(244, 67, 54),\n        cv::Scalar(233, 30, 99),\n        cv::Scalar(156, 39, 176),\n        cv::Scalar(103, 58, 183),\n        cv::Scalar(63, 81, 181),\n        cv::Scalar(33, 150, 243),\n        cv::Scalar(3, 169, 244),\n        cv::Scalar(0, 188, 212),\n        cv::Scalar(0, 150, 136),\n        cv::Scalar(76, 175, 80),\n        cv::Scalar(139, 195, 74),\n        cv::Scalar(205, 220, 57),\n        cv::Scalar(255, 235, 59),\n        cv::Scalar(255, 193, 7),\n        cv::Scalar(255, 152, 0),\n        cv::Scalar(255, 87, 34),\n        cv::Scalar(121, 85, 72),\n        cv::Scalar(158, 158, 158),\n        cv::Scalar(96, 125, 139)\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        const cv::Scalar& color = colors[i % 19];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, color);\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, 
&baseLine);\n\n        int x = obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yoloworld(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "examples/yolox.cpp",
    "content": "// Copyright 2020 Tencent\n// Copyright 2020-2021 Megvii Inc.\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layer.h\"\n#include \"net.h\"\n\n#if defined(USE_NCNN_SIMPLEOCV)\n#include \"simpleocv.h\"\n#else\n#include <opencv2/core/core.hpp>\n#include <opencv2/highgui/highgui.hpp>\n#include <opencv2/imgproc/imgproc.hpp>\n#endif\n#include <float.h>\n#include <stdio.h>\n#include <vector>\n\n#define YOLOX_NMS_THRESH  0.45 // nms threshold\n#define YOLOX_CONF_THRESH 0.25 // threshold of bounding box prob\n#define YOLOX_TARGET_SIZE 640  // target image size after resize, might use 416 for small model\n\n// YOLOX use the same focus in yolov5\nclass YoloV5Focus : public ncnn::Layer\n{\npublic:\n    YoloV5Focus()\n    {\n        one_blob_only = true;\n    }\n\n    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt) const\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int channels = bottom_blob.c;\n\n        int outw = w / 2;\n        int outh = h / 2;\n        int outc = channels * 4;\n\n        top_blob.create(outw, outh, outc, 4u, 1, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc; p++)\n        {\n            const float* ptr = bottom_blob.channel(p % channels).row((p / channels) % 2) + ((p / channels) / 2);\n            float* outptr = top_blob.channel(p);\n\n            for (int i = 0; i < outh; i++)\n            {\n                for (int j = 0; j < outw; j++)\n                {\n                    *outptr = *ptr;\n\n                    outptr += 1;\n                    ptr += 2;\n                }\n\n                ptr += w;\n            }\n        }\n\n        return 0;\n    }\n};\n\nDEFINE_LAYER_CREATOR(YoloV5Focus)\n\nstruct Object\n{\n    cv::Rect_<float> rect;\n    int label;\n    float prob;\n};\n\nstruct GridAndStride\n{\n    int 
grid0;\n    int grid1;\n    int stride;\n};\n\nstatic inline float intersection_area(const Object& a, const Object& b)\n{\n    cv::Rect_<float> inter = a.rect & b.rect;\n    return inter.area();\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& faceobjects, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = faceobjects[(left + right) / 2].prob;\n\n    while (i <= j)\n    {\n        while (faceobjects[i].prob > p)\n            i++;\n\n        while (faceobjects[j].prob < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(faceobjects[i], faceobjects[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    #pragma omp parallel sections\n    {\n        #pragma omp section\n        {\n            if (left < j) qsort_descent_inplace(faceobjects, left, j);\n        }\n        #pragma omp section\n        {\n            if (i < right) qsort_descent_inplace(faceobjects, i, right);\n        }\n    }\n}\n\nstatic void qsort_descent_inplace(std::vector<Object>& objects)\n{\n    if (objects.empty())\n        return;\n\n    qsort_descent_inplace(objects, 0, objects.size() - 1);\n}\n\nstatic void nms_sorted_bboxes(const std::vector<Object>& faceobjects, std::vector<int>& picked, float nms_threshold, bool agnostic = false)\n{\n    picked.clear();\n\n    const int n = faceobjects.size();\n\n    std::vector<float> areas(n);\n    for (int i = 0; i < n; i++)\n    {\n        areas[i] = faceobjects[i].rect.area();\n    }\n\n    for (int i = 0; i < n; i++)\n    {\n        const Object& a = faceobjects[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const Object& b = faceobjects[picked[j]];\n\n            if (!agnostic && a.label != b.label)\n                continue;\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - 
inter_area;\n            // float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nstatic void generate_grids_and_stride(const int target_w, const int target_h, std::vector<int>& strides, std::vector<GridAndStride>& grid_strides)\n{\n    for (int i = 0; i < (int)strides.size(); i++)\n    {\n        int stride = strides[i];\n        int num_grid_w = target_w / stride;\n        int num_grid_h = target_h / stride;\n        for (int g1 = 0; g1 < num_grid_h; g1++)\n        {\n            for (int g0 = 0; g0 < num_grid_w; g0++)\n            {\n                GridAndStride gs;\n                gs.grid0 = g0;\n                gs.grid1 = g1;\n                gs.stride = stride;\n                grid_strides.push_back(gs);\n            }\n        }\n    }\n}\n\nstatic void generate_yolox_proposals(std::vector<GridAndStride> grid_strides, const ncnn::Mat& feat_blob, float prob_threshold, std::vector<Object>& objects)\n{\n    const int num_grid = feat_blob.h;\n    const int num_class = feat_blob.w - 5;\n    const int num_anchors = grid_strides.size();\n\n    const float* feat_ptr = feat_blob.channel(0);\n    for (int anchor_idx = 0; anchor_idx < num_anchors; anchor_idx++)\n    {\n        const int grid0 = grid_strides[anchor_idx].grid0;\n        const int grid1 = grid_strides[anchor_idx].grid1;\n        const int stride = grid_strides[anchor_idx].stride;\n\n        // yolox/models/yolo_head.py decode logic\n        //  outputs[..., :2] = (outputs[..., :2] + grids) * strides\n        //  outputs[..., 2:4] = torch.exp(outputs[..., 2:4]) * strides\n        float x_center = (feat_ptr[0] + grid0) * stride;\n        float y_center = (feat_ptr[1] + grid1) * stride;\n        float w = exp(feat_ptr[2]) * stride;\n        float h = exp(feat_ptr[3]) * stride;\n        float x0 = x_center - w * 0.5f;\n        float y0 = y_center - h * 0.5f;\n\n  
      float box_objectness = feat_ptr[4];\n        for (int class_idx = 0; class_idx < num_class; class_idx++)\n        {\n            float box_cls_score = feat_ptr[5 + class_idx];\n            float box_prob = box_objectness * box_cls_score;\n            if (box_prob > prob_threshold)\n            {\n                Object obj;\n                obj.rect.x = x0;\n                obj.rect.y = y0;\n                obj.rect.width = w;\n                obj.rect.height = h;\n                obj.label = class_idx;\n                obj.prob = box_prob;\n\n                objects.push_back(obj);\n            }\n\n        } // class loop\n        feat_ptr += feat_blob.w;\n\n    } // point anchor loop\n}\n\nstatic int detect_yolox(const cv::Mat& bgr, std::vector<Object>& objects)\n{\n    ncnn::Net yolox;\n\n    yolox.opt.use_vulkan_compute = true;\n    // yolox.opt.use_bf16_storage = true;\n\n    // Focus in yolov5\n    yolox.register_custom_layer(\"YoloV5Focus\", YoloV5Focus_layer_creator);\n\n    // original pretrained model from https://github.com/Megvii-BaseDetection/YOLOX\n    // ncnn model param: https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_s_ncnn.tar.gz\n    // NOTE: the newest version of YOLOX removes the normalization of the model (minus mean, then divide by std),\n    // which might turn your model outputs into a total mess; please check carefully.\n    if (yolox.load_param(\"yolox.param\"))\n        exit(-1);\n    if (yolox.load_model(\"yolox.bin\"))\n        exit(-1);\n\n    int img_w = bgr.cols;\n    int img_h = bgr.rows;\n\n    int w = img_w;\n    int h = img_h;\n    float scale = 1.f;\n    if (w > h)\n    {\n        scale = (float)YOLOX_TARGET_SIZE / w;\n        w = YOLOX_TARGET_SIZE;\n        h = h * scale;\n    }\n    else\n    {\n        scale = (float)YOLOX_TARGET_SIZE / h;\n        h = YOLOX_TARGET_SIZE;\n        w = w * scale;\n    }\n    ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR, img_w, img_h, 
w, h);\n\n    // pad width and height up to a multiple of 32\n    int wpad = (w + 31) / 32 * 32 - w;\n    int hpad = (h + 31) / 32 * 32 - h;\n    ncnn::Mat in_pad;\n    // unlike yolov5, yolox only pads on the bottom and right sides,\n    // which means users don't need extra padding info to decode box coordinates.\n    ncnn::copy_make_border(in, in_pad, 0, hpad, 0, wpad, ncnn::BORDER_CONSTANT, 114.f);\n\n    ncnn::Extractor ex = yolox.create_extractor();\n\n    ex.input(\"images\", in_pad);\n\n    std::vector<Object> proposals;\n\n    {\n        ncnn::Mat out;\n        ex.extract(\"output\", out);\n\n        static const int stride_arr[] = {8, 16, 32}; // might have stride=64 in YOLOX\n        std::vector<int> strides(stride_arr, stride_arr + sizeof(stride_arr) / sizeof(stride_arr[0]));\n        std::vector<GridAndStride> grid_strides;\n        generate_grids_and_stride(in_pad.w, in_pad.h, strides, grid_strides);\n        generate_yolox_proposals(grid_strides, out, YOLOX_CONF_THRESH, proposals);\n    }\n\n    // sort all proposals by score from highest to lowest\n    qsort_descent_inplace(proposals);\n\n    // apply nms with nms_threshold\n    std::vector<int> picked;\n    nms_sorted_bboxes(proposals, picked, YOLOX_NMS_THRESH);\n\n    int count = picked.size();\n\n    objects.resize(count);\n    for (int i = 0; i < count; i++)\n    {\n        objects[i] = proposals[picked[i]];\n\n        // adjust offset to original unpadded\n        float x0 = (objects[i].rect.x) / scale;\n        float y0 = (objects[i].rect.y) / scale;\n        float x1 = (objects[i].rect.x + objects[i].rect.width) / scale;\n        float y1 = (objects[i].rect.y + objects[i].rect.height) / scale;\n\n        // clip\n        x0 = std::max(std::min(x0, (float)(img_w - 1)), 0.f);\n        y0 = std::max(std::min(y0, (float)(img_h - 1)), 0.f);\n        x1 = std::max(std::min(x1, (float)(img_w - 1)), 0.f);\n        y1 = std::max(std::min(y1, (float)(img_h - 1)), 0.f);\n\n        objects[i].rect.x = x0;\n 
       objects[i].rect.y = y0;\n        objects[i].rect.width = x1 - x0;\n        objects[i].rect.height = y1 - y0;\n    }\n\n    return 0;\n}\n\nstatic void draw_objects(const cv::Mat& bgr, const std::vector<Object>& objects)\n{\n    static const char* class_names[] = {\n        \"person\", \"bicycle\", \"car\", \"motorcycle\", \"airplane\", \"bus\", \"train\", \"truck\", \"boat\", \"traffic light\",\n        \"fire hydrant\", \"stop sign\", \"parking meter\", \"bench\", \"bird\", \"cat\", \"dog\", \"horse\", \"sheep\", \"cow\",\n        \"elephant\", \"bear\", \"zebra\", \"giraffe\", \"backpack\", \"umbrella\", \"handbag\", \"tie\", \"suitcase\", \"frisbee\",\n        \"skis\", \"snowboard\", \"sports ball\", \"kite\", \"baseball bat\", \"baseball glove\", \"skateboard\", \"surfboard\",\n        \"tennis racket\", \"bottle\", \"wine glass\", \"cup\", \"fork\", \"knife\", \"spoon\", \"bowl\", \"banana\", \"apple\",\n        \"sandwich\", \"orange\", \"broccoli\", \"carrot\", \"hot dog\", \"pizza\", \"donut\", \"cake\", \"chair\", \"couch\",\n        \"potted plant\", \"bed\", \"dining table\", \"toilet\", \"tv\", \"laptop\", \"mouse\", \"remote\", \"keyboard\", \"cell phone\",\n        \"microwave\", \"oven\", \"toaster\", \"sink\", \"refrigerator\", \"book\", \"clock\", \"vase\", \"scissors\", \"teddy bear\",\n        \"hair drier\", \"toothbrush\"\n    };\n\n    cv::Mat image = bgr.clone();\n\n    for (size_t i = 0; i < objects.size(); i++)\n    {\n        const Object& obj = objects[i];\n\n        fprintf(stderr, \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\", obj.label, obj.prob,\n                obj.rect.x, obj.rect.y, obj.rect.width, obj.rect.height);\n\n        cv::rectangle(image, obj.rect, cv::Scalar(255, 0, 0));\n\n        char text[256];\n        sprintf(text, \"%s %.1f%%\", class_names[obj.label], obj.prob * 100);\n\n        int baseLine = 0;\n        cv::Size label_size = cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);\n\n        int x 
= obj.rect.x;\n        int y = obj.rect.y - label_size.height - baseLine;\n        if (y < 0)\n            y = 0;\n        if (x + label_size.width > image.cols)\n            x = image.cols - label_size.width;\n\n        cv::rectangle(image, cv::Rect(cv::Point(x, y), cv::Size(label_size.width, label_size.height + baseLine)),\n                      cv::Scalar(255, 255, 255), -1);\n\n        cv::putText(image, text, cv::Point(x, y + label_size.height),\n                    cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 0));\n    }\n\n    cv::imshow(\"image\", image);\n    cv::waitKey(0);\n}\n\nint main(int argc, char** argv)\n{\n    if (argc != 2)\n    {\n        fprintf(stderr, \"Usage: %s [imagepath]\\n\", argv[0]);\n        return -1;\n    }\n\n    const char* imagepath = argv[1];\n\n    cv::Mat m = cv::imread(imagepath, 1);\n    if (m.empty())\n    {\n        fprintf(stderr, \"cv::imread %s failed\\n\", imagepath);\n        return -1;\n    }\n\n    std::vector<Object> objects;\n    detect_yolox(m, objects);\n\n    draw_objects(m, objects);\n\n    return 0;\n}\n"
  },
  {
    "path": "package.sh",
    "content": "#!/usr/bin/bash\n\nNAME=ncnn\n\n##### package android lib\nANDROIDPKGNAME=${NAME}-android-lib\nrm -rf $ANDROIDPKGNAME\nmkdir -p $ANDROIDPKGNAME\nmkdir -p $ANDROIDPKGNAME/armeabi-v7a\nmkdir -p $ANDROIDPKGNAME/arm64-v8a\nmkdir -p $ANDROIDPKGNAME/x86\nmkdir -p $ANDROIDPKGNAME/x86_64\nmkdir -p $ANDROIDPKGNAME/include\ncp build-android-armv7/install/lib/lib*.a $ANDROIDPKGNAME/armeabi-v7a/\ncp build-android-aarch64/install/lib/lib*.a $ANDROIDPKGNAME/arm64-v8a/\ncp build-android-x86/install/lib/lib*.a $ANDROIDPKGNAME/x86/\ncp build-android-x86_64/install/lib/lib*.a $ANDROIDPKGNAME/x86_64/\ncp -r build-android-aarch64/install/include/* $ANDROIDPKGNAME/include/\nrm -f $ANDROIDPKGNAME.zip\nzip -9 -r $ANDROIDPKGNAME.zip $ANDROIDPKGNAME\n\n##### package ios framework\nIOSPKGNAME=${NAME}.framework\nrm -rf $IOSPKGNAME\nmkdir -p $IOSPKGNAME/Versions/A/Headers\nmkdir -p $IOSPKGNAME/Versions/A/Resources\nln -s A $IOSPKGNAME/Versions/Current\nln -s Versions/Current/Headers $IOSPKGNAME/Headers\nln -s Versions/Current/Resources $IOSPKGNAME/Resources\nln -s Versions/Current/${NAME} $IOSPKGNAME/${NAME}\nlipo -create \\\n    build-ios/install/lib/lib${NAME}.a \\\n    build-ios-sim/install/lib/lib${NAME}.a \\\n    -o $IOSPKGNAME/Versions/A/${NAME}\ncp -r build-ios/install/include/* $IOSPKGNAME/Versions/A/Headers/\ncp Info.plist ${IOSPKGNAME}/Versions/A/Resources/\nrm -f $IOSPKGNAME.zip\nzip -9 -y -r $IOSPKGNAME.zip $IOSPKGNAME\n\n##### package ios framework bitcode\nIOSPKGNAME=${NAME}.framework\nrm -rf $IOSPKGNAME\nmkdir -p $IOSPKGNAME/Versions/A/Headers\nmkdir -p $IOSPKGNAME/Versions/A/Resources\nln -s A $IOSPKGNAME/Versions/Current\nln -s Versions/Current/Headers $IOSPKGNAME/Headers\nln -s Versions/Current/Resources $IOSPKGNAME/Resources\nln -s Versions/Current/${NAME} $IOSPKGNAME/${NAME}\nlipo -create \\\n    build-ios-bitcode/install/lib/lib${NAME}.a \\\n    build-ios-sim-bitcode/install/lib/lib${NAME}.a \\\n    -o $IOSPKGNAME/Versions/A/${NAME}\ncp -r 
build-ios-bitcode/install/include/ncnn $IOSPKGNAME/Versions/A/Headers/\ncp Info.plist ${IOSPKGNAME}/Versions/A/Resources/\nrm -f $IOSPKGNAME-bitcode.zip\nzip -9 -y -r $IOSPKGNAME-bitcode.zip $IOSPKGNAME\n\n\n##### package android lib vulkan\nANDROIDPKGNAME=${NAME}-android-vulkan-lib\nrm -rf $ANDROIDPKGNAME\nmkdir -p $ANDROIDPKGNAME\nmkdir -p $ANDROIDPKGNAME/armeabi-v7a\nmkdir -p $ANDROIDPKGNAME/arm64-v8a\nmkdir -p $ANDROIDPKGNAME/x86\nmkdir -p $ANDROIDPKGNAME/x86_64\nmkdir -p $ANDROIDPKGNAME/include\ncp build-android-armv7-vulkan/install/lib/lib*.a $ANDROIDPKGNAME/armeabi-v7a/\ncp build-android-aarch64-vulkan/install/lib/lib*.a $ANDROIDPKGNAME/arm64-v8a/\ncp build-android-x86-vulkan/install/lib/lib*.a $ANDROIDPKGNAME/x86/\ncp build-android-x86_64-vulkan/install/lib/lib*.a $ANDROIDPKGNAME/x86_64/\ncp -r build-android-aarch64-vulkan/install/include/* $ANDROIDPKGNAME/include/\nrm -f $ANDROIDPKGNAME.zip\nzip -9 -r $ANDROIDPKGNAME.zip $ANDROIDPKGNAME\n\n##### package ios framework vulkan\nIOSPKGNAME=${NAME}.framework\nrm -rf $IOSPKGNAME\nmkdir -p $IOSPKGNAME/Versions/A/Headers\nmkdir -p $IOSPKGNAME/Versions/A/Resources\nln -s A $IOSPKGNAME/Versions/Current\nln -s Versions/Current/Headers $IOSPKGNAME/Headers\nln -s Versions/Current/Resources $IOSPKGNAME/Resources\nln -s Versions/Current/${NAME} $IOSPKGNAME/${NAME}\nlipo -create \\\n    build-ios-vulkan/install/lib/lib${NAME}.a \\\n    build-ios-sim-vulkan/install/lib/lib${NAME}.a \\\n    -o $IOSPKGNAME/Versions/A/${NAME}\ncp -r build-ios-vulkan/install/include/ncnn $IOSPKGNAME/Versions/A/Headers/\ncp Info.plist ${IOSPKGNAME}/Versions/A/Resources/\nrm -f $IOSPKGNAME-vulkan.zip\nzip -9 -y -r $IOSPKGNAME-vulkan.zip $IOSPKGNAME\n\n##### package ios framework vulkan bitcode\nIOSPKGNAME=${NAME}.framework\nrm -rf $IOSPKGNAME\nmkdir -p $IOSPKGNAME/Versions/A/Headers\nmkdir -p $IOSPKGNAME/Versions/A/Resources\nln -s A $IOSPKGNAME/Versions/Current\nln -s Versions/Current/Headers $IOSPKGNAME/Headers\nln -s 
Versions/Current/Resources $IOSPKGNAME/Resources\nln -s Versions/Current/${NAME} $IOSPKGNAME/${NAME}\nlipo -create \\\n    build-ios-vulkan-bitcode/install/lib/lib${NAME}.a \\\n    build-ios-sim-vulkan-bitcode/install/lib/lib${NAME}.a \\\n    -o $IOSPKGNAME/Versions/A/${NAME}\ncp -r build-ios-vulkan-bitcode/install/include/ncnn $IOSPKGNAME/Versions/A/Headers/\ncp Info.plist ${IOSPKGNAME}/Versions/A/Resources/\nrm -f $IOSPKGNAME-vulkan-bitcode.zip\nzip -9 -y -r $IOSPKGNAME-vulkan-bitcode.zip $IOSPKGNAME\n\n\n##### package ios framework glslang\nIOSPKGNAME=glslang.framework\nrm -rf $IOSPKGNAME\nmkdir -p $IOSPKGNAME/Versions/A/Headers\nmkdir -p $IOSPKGNAME/Versions/A/Resources\nln -s A $IOSPKGNAME/Versions/Current\nln -s Versions/Current/Headers $IOSPKGNAME/Headers\nln -s Versions/Current/Resources $IOSPKGNAME/Resources\nln -s Versions/Current/glslang $IOSPKGNAME/glslang\nlibtool -static \\\n    build-ios-vulkan/install/lib/libglslang.a \\\n    build-ios-vulkan/install/lib/libSPIRV.a \\\n    build-ios-vulkan/install/lib/libOGLCompiler.a \\\n    build-ios-vulkan/install/lib/libOSDependent.a \\\n    -o build-ios-vulkan/install/lib/libglslang_combined.a\nlibtool -static \\\n    build-ios-sim-vulkan/install/lib/libglslang.a \\\n    build-ios-sim-vulkan/install/lib/libSPIRV.a \\\n    build-ios-sim-vulkan/install/lib/libOGLCompiler.a \\\n    build-ios-sim-vulkan/install/lib/libOSDependent.a \\\n    -o build-ios-sim-vulkan/install/lib/libglslang_combined.a\nlipo -create \\\n    build-ios-vulkan/install/lib/libglslang_combined.a \\\n    build-ios-sim-vulkan/install/lib/libglslang_combined.a \\\n    -o $IOSPKGNAME/Versions/A/glslang\ncp -r build-ios-vulkan/install/include/glslang $IOSPKGNAME/Versions/A/Headers/\ncp Info.plist ${IOSPKGNAME}/Versions/A/Resources/\nrm -f $IOSPKGNAME.zip\nzip -9 -y -r $IOSPKGNAME.zip $IOSPKGNAME\n\n##### package ios framework glslang bitcode\nIOSPKGNAME=glslang.framework\nrm -rf $IOSPKGNAME\nmkdir -p $IOSPKGNAME/Versions/A/Headers\nmkdir -p 
$IOSPKGNAME/Versions/A/Resources\nln -s A $IOSPKGNAME/Versions/Current\nln -s Versions/Current/Headers $IOSPKGNAME/Headers\nln -s Versions/Current/Resources $IOSPKGNAME/Resources\nln -s Versions/Current/glslang $IOSPKGNAME/glslang\nlibtool -static \\\n    build-ios-vulkan-bitcode/install/lib/libglslang.a \\\n    build-ios-vulkan-bitcode/install/lib/libSPIRV.a \\\n    build-ios-vulkan-bitcode/install/lib/libOGLCompiler.a \\\n    build-ios-vulkan-bitcode/install/lib/libOSDependent.a \\\n    -o build-ios-vulkan-bitcode/install/lib/libglslang_combined.a\nlibtool -static \\\n    build-ios-sim-vulkan-bitcode/install/lib/libglslang.a \\\n    build-ios-sim-vulkan-bitcode/install/lib/libSPIRV.a \\\n    build-ios-sim-vulkan-bitcode/install/lib/libOGLCompiler.a \\\n    build-ios-sim-vulkan-bitcode/install/lib/libOSDependent.a \\\n    -o build-ios-sim-vulkan-bitcode/install/lib/libglslang_combined.a\nlipo -create \\\n    build-ios-vulkan-bitcode/install/lib/libglslang_combined.a \\\n    build-ios-sim-vulkan-bitcode/install/lib/libglslang_combined.a \\\n    -o $IOSPKGNAME/Versions/A/glslang\ncp -r build-ios-vulkan-bitcode/install/include/glslang $IOSPKGNAME/Versions/A/Headers/\ncp Info.plist ${IOSPKGNAME}/Versions/A/Resources/\nrm -f $IOSPKGNAME-bitcode.zip\nzip -9 -y -r $IOSPKGNAME-bitcode.zip $IOSPKGNAME\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[build-system]\nrequires = [\n    \"setuptools>=42\",\n    \"wheel\",\n    \"importlib-metadata\",\n]\nbuild-backend = \"setuptools.build_meta\"\n"
  },
  {
    "path": "python/CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.4...3.10)\n\nproject(pyncnn)\n\nset(PACKAGE_VERSION ${NCNN_VERSION_STRING})\nadd_definitions(-DVERSION_INFO=\"${PACKAGE_VERSION}\")\n\nset( CMAKE_CXX_STANDARD 11 )\nset( CMAKE_CXX_STANDARD_REQUIRED ON )\n\noption(NCNN_SYSTEM_PYBIND11 \"use system pybind11\" OFF)\n\nif(CMAKE_CXX_COMPILER_ARCHITECTURE_ID MATCHES \"ARM64\")\n    option(PYBIND11_PYTHONLIBS_OVERWRITE \"\" OFF)\n\n    set(PYTHON_PREFIX \"$ENV{LOCALAPPDATA}/pypa/cibuildwheel/Cache/nuget-cpython/pythonarm64.$ENV{PYTHON_VERSION}/tools\")\n    if(NOT DEFINED $ENV{CIBUILDWHEEL})\n        message(WARNING\n            \" This is hack for cibuildwheel on github action\\n\"\n            \" Use the right way to cross-compile python module for windows arm64 like follows\\n\"\n            \" set(PYTHON_PREFIX \\\"<your-pythonarm64-root-path>\\\")\\n\"\n        )\n    endif()\nendif()\n\nif(NCNN_SYSTEM_PYBIND11)\n    find_package(pybind11)\n    if(NOT pybind11_FOUND)\n        message(WARNING \"pybind11 package not found! NCNN_SYSTEM_PYBIND11 will be turned off.\")\n        set(NCNN_SYSTEM_PYBIND11 OFF)\n    endif()\nendif()\n\nif(NOT NCNN_SYSTEM_PYBIND11)\n    if(NOT EXISTS \"${CMAKE_CURRENT_SOURCE_DIR}/pybind11/CMakeLists.txt\")\n        message(FATAL_ERROR \"The submodules were not downloaded! 
Please update submodules with \\\"git submodule update --init\\\" and try again.\")\n    else()\n        add_subdirectory(pybind11)\n    endif()\nendif()\n\nif(\"${CMAKE_LIBRARY_OUTPUT_DIRECTORY}\" STREQUAL \"\")\n    if(MSVC OR CMAKE_GENERATOR STREQUAL \"Xcode\")\n        set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_DEBUG ${CMAKE_CURRENT_BINARY_DIR}/ncnn/)\n        set(CMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE ${CMAKE_CURRENT_BINARY_DIR}/ncnn/)\n    endif(MSVC OR CMAKE_GENERATOR STREQUAL \"Xcode\")\nendif(\"${CMAKE_LIBRARY_OUTPUT_DIRECTORY}\" STREQUAL \"\")\n\n# enable global link time optimization\ncmake_policy(SET CMP0069 NEW)\nset(CMAKE_POLICY_DEFAULT_CMP0069 NEW)\ninclude(CheckIPOSupported)\ncheck_ipo_supported(RESULT ipo_supported OUTPUT ipo_supported_output)\nif(ipo_supported)\n    set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)\nendif()\n\ninclude_directories(${pybind11_INCLUDE_DIR} ${PYTHON_INCLUDE_DIRS})\npybind11_add_module(pyncnn src/main.cpp)\nset_target_properties(pyncnn PROPERTIES OUTPUT_NAME \"ncnn\")\ntarget_link_libraries(pyncnn PUBLIC ncnn)\nset_target_properties(pyncnn PROPERTIES PREFIX \"\" LIBRARY_OUTPUT_DIRECTORY \"${CMAKE_CURRENT_BINARY_DIR}/ncnn\")\nset_property(TARGET pyncnn PROPERTY FOLDER \"python\")\nif(\"${CMAKE_LIBRARY_OUTPUT_DIRECTORY}\" STREQUAL \"\")\n    add_custom_command(TARGET pyncnn POST_BUILD \n        COMMAND ${CMAKE_COMMAND} -E copy ${CMAKE_CURRENT_BINARY_DIR}/ncnn/ncnn${PYTHON_MODULE_PREFIX}${PYTHON_MODULE_EXTENSION} \n        ${PROJECT_SOURCE_DIR}/ncnn/ncnn${PYTHON_MODULE_PREFIX}${PYTHON_MODULE_EXTENSION})\nendif(\"${CMAKE_LIBRARY_OUTPUT_DIRECTORY}\" STREQUAL \"\")\n\nconfigure_file(setup.py.i ${PROJECT_SOURCE_DIR}/setup.py)\n"
  },
  {
    "path": "python/README.md",
    "content": "# ncnn\npython wrapper of ncnn with [pybind11](https://github.com/pybind/pybind11), only support python3.x now.\n\n\nInstall from pip\n==================\n\nncnn is available as wheel packages for macOS, Windows and Linux distributions, you can install with pip:\n\n```\npython -m pip install -U pip\npython -m pip install -U ncnn\n```\n\n# Build from source\n\nIf you want to build ncnn with some options not as default, or just like to build everything yourself, it is not difficult to build ncnn from source.\n\n## Prerequisites\n\n**On Unix (Linux, OS X)**\n\n* A compiler with C++11 support\n* CMake >= 3.4\n\n**On Mac**\n\n* A compiler with C++11 support\n* CMake >= 3.4\n\n**On Windows**\n\n* Visual Studio 2015 or higher\n* CMake >= 3.4\n\n##  Build & Install\n\n1. clone ncnn and init submodule.\n\n```bash\ncd /pathto/ncnn\ngit submodule init && git submodule update\n```\n\n2. build and install.\n\n```\npython setup.py install\n```\n\nIf you want to use a custom toolchain, you can install with the `CMAKE_TOOLCHAIN_FILE` environment variable, like this:\n\n```\nCMAKE_TOOLCHAIN_FILE=\"../../toolchains/power9le-linux-gnu-vsx.clang.toolchain.cmake\" python setup.py install\n```\n\nif you want to enable the usage of vulkan, you can install as following:\n\n```\npython setup.py install --vulkan=on\n```\n\n> **Attention:**\n>\n> To enable Vulkan support, you must first install the Vulkan SDK.\n>\n> **For Windows or Linux Users:**\n>\n> Ensure that the `VULKAN_SDK` environment variable is set to the path of the Vulkan SDK.\n>\n> **For MacOS Users:**\n>\n> On MacOS, you will need to specify additional environment variables. For guidance on setting these variables, please refer to lines 279-286 in the following file: [ncnn/.github/workflows/release-python.yml at master · Tencent/ncnn](https://github.com/Tencent/ncnn/blob/master/.github/workflows/release-python.yml).\n\n## Custom-build & Install\n\n1. 
clone ncnn and init submodule.\n```bash\ncd /pathto/ncnn\ngit submodule init && git submodule update\n```\n2. build.\n```bash\nmkdir build\ncd build\ncmake -DNCNN_PYTHON=ON ..\nmake\n```\n\nTo use the pybind11 package provided by your system, set the CMake variable `NCNN_SYSTEM_PYBIND11` to `ON` during the build process, like this:\n\n```bash\nmkdir build\ncd build\ncmake -DNCNN_PYTHON=ON -DNCNN_SYSTEM_PYBIND11=ON ..\nmake\n```\n\n3. install\n\n```bash\ncd /pathto/ncnn\npip install .\n```\n\nIf you use conda or miniconda, you can also install as follows:\n```bash\ncd /pathto/ncnn\npython3 setup.py install\n```\n\n## Tests\n\n**test**\n```bash\ncd /pathto/ncnn/python\npython3 tests/test.py\n```\n\n**benchmark**\n\n```bash\ncd /pathto/ncnn/python\npython3 tests/benchmark.py\n```\n\n## Numpy\n**ncnn.Mat->numpy.array, with no memory copy**\n\n```python\nmat = ncnn.Mat(...)\nmat_np = np.array(mat)\n```\n\n**numpy.array->ncnn.Mat, with no memory copy**\n```python\nmat_np = np.array(...)\nmat = ncnn.Mat(mat_np)\n```\n\n# Model Zoo\nInstall the requirements:\n```bash\npip install -r requirements.txt\n```\nThen you can import `ncnn.model_zoo` and get the model list as follows:\n```python\nimport ncnn\nimport ncnn.model_zoo as model_zoo\n\nprint(model_zoo.get_model_list())\n```\nThe models currently in the model zoo are listed below:\n```\nmobilenet_yolov2\nmobilenetv2_yolov3\nyolov4_tiny\nyolov4\nyolov5s\nyolact\nmobilenet_ssd\nsqueezenet_ssd\nmobilenetv2_ssdlite\nmobilenetv3_ssdlite\nsqueezenet\nfaster_rcnn\npeleenet_ssd\nretinaface\nrfcn\nshufflenetv2\nsimplepose\nnanodet\n```\nEvery model in the model zoo has an example in the ncnn/python/examples folder.\n\n# Custom Layer\n\nA custom layer demo is in ncnn/python/ncnn/model_zoo/yolov5.py:23\n"
  },
  {
    "path": "python/examples/fasterrcnn.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"faster_rcnn\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/mobilenetssd.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"mobilenet_ssd\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/mobilenetv2ssdlite.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"mobilenetv2_ssdlite\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/mobilenetv3ssdlite.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"mobilenetv3_ssdlite\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects, 0.6)\n"
  },
  {
    "path": "python/examples/model_zoo.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nfrom ncnn.model_zoo import get_model_list\n\nif __name__ == \"__main__\":\n    print(get_model_list())\n"
  },
  {
    "path": "python/examples/nanodet.py",
    "content": "# Copyright 2021 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\n        \"nanodet\",\n        target_size=320,\n        prob_threshold=0.4,\n        nms_threshold=0.5,\n        num_threads=4,\n        use_gpu=True,\n    )\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/peleenetssd.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nimport numpy as np\nfrom ncnn.model_zoo import get_model\n\n\ndef draw_detection_objects_seg(image, class_names, objects, mat_map):\n    color = [128, 255, 128, 244, 35, 232]\n    color_count = len(color)\n\n    for obj in objects:\n        print(\n            \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\"\n            % (obj.label, obj.prob, obj.rect.x, obj.rect.y, obj.rect.w, obj.rect.h)\n        )\n\n        cv2.rectangle(\n            image,\n            (int(obj.rect.x), int(obj.rect.y)),\n            (int(obj.rect.x + obj.rect.w), int(obj.rect.y + obj.rect.h)),\n            (255, 0, 0),\n        )\n\n        text = \"%s %.1f%%\" % (class_names[int(obj.label)], obj.prob * 100)\n\n        label_size, baseLine = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)\n\n        x = obj.rect.x\n        y = obj.rect.y - label_size[1] - baseLine\n        if y < 0:\n            y = 0\n        if x + label_size[0] > image.shape[1]:\n            x = image.shape[1] - label_size[0]\n\n        cv2.rectangle(\n            image,\n            (int(x), int(y)),\n            (int(x + label_size[0]), int(y + label_size[1] + baseLine)),\n            (255, 255, 255),\n            -1,\n        )\n\n        cv2.putText(\n            image,\n            text,\n            (int(x), int(y + label_size[1])),\n            cv2.FONT_HERSHEY_SIMPLEX,\n            0.5,\n            (0, 0, 0),\n        )\n\n    width = mat_map.w\n    height = mat_map.h\n    size = mat_map.c\n    img_index2 = 0\n    threshold = 0.45\n    ptr2 = np.array(mat_map)\n    for i in range(height):\n        ptr1 = image[i].flatten()\n        img_index1 = 0\n        for j in range(width):\n            maxima = threshold\n            index = -1\n            for c in range(size):\n                # const float* ptr3 = ptr2 + c*width*height\n                ptr3 = ptr2[c].flatten()\n                if 
ptr3[img_index2] > maxima:\n                    maxima = ptr3[img_index2]\n                    index = c\n\n            if index > -1:\n                color_index = (index) * 3\n                if color_index < color_count:\n                    b = color[color_index]\n                    g = color[color_index + 1]\n                    r = color[color_index + 2]\n                    ptr1[img_index1] = b / 2 + ptr1[img_index1] / 2\n                    ptr1[img_index1 + 1] = g / 2 + ptr1[img_index1 + 1] / 2\n                    ptr1[img_index1 + 2] = r / 2 + ptr1[img_index1 + 2] / 2\n\n            img_index1 += 3\n            img_index2 += 1\n\n        image[i] = ptr1.reshape(image[i].shape)\n\n    cv2.imshow(\"image\", image)\n    cv2.waitKey(0)\n\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"peleenet_ssd\", num_threads=4, use_gpu=True)\n\n    objects, seg_out = net(m)\n\n    draw_detection_objects_seg(m, net.class_names, objects, seg_out)\n"
  },
  {
    "path": "python/examples/retinaface.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_faceobjects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"retinaface\", num_threads=4, use_gpu=True)\n\n    faceobjects = net(m)\n\n    draw_faceobjects(m, faceobjects)\n"
  },
  {
    "path": "python/examples/rfcn.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"rfcn\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/shufflenetv2.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import print_topk\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"shufflenetv2\", num_threads=4, use_gpu=True)\n\n    cls_scores = net(m)\n\n    print_topk(cls_scores, 3)\n"
  },
  {
    "path": "python/examples/simplepose.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_pose\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"simplepose\", num_threads=4, use_gpu=True)\n\n    keypoints = net(m)\n\n    draw_pose(m, keypoints)\n"
  },
  {
    "path": "python/examples/squeezenet.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import print_topk\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"squeezenet\", num_threads=4, use_gpu=True)\n\n    cls_scores = net(m)\n\n    print_topk(cls_scores, 5)\n"
  },
  {
    "path": "python/examples/squeezenetssd.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"squeezenet_ssd\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/yolact.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nimport numpy as np\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\n\ndef draw_result(image, class_names, boxes, masks, classes, scores):\n    colors = [\n        [56, 0, 255],\n        [226, 255, 0],\n        [0, 94, 255],\n        [0, 37, 255],\n        [0, 255, 94],\n        [255, 226, 0],\n        [0, 18, 255],\n        [255, 151, 0],\n        [170, 0, 255],\n        [0, 255, 56],\n        [255, 0, 75],\n        [0, 75, 255],\n        [0, 255, 169],\n        [255, 0, 207],\n        [75, 255, 0],\n        [207, 0, 255],\n        [37, 0, 255],\n        [0, 207, 255],\n        [94, 0, 255],\n        [0, 255, 113],\n        [255, 18, 0],\n        [255, 0, 56],\n        [18, 0, 255],\n        [0, 255, 226],\n        [170, 255, 0],\n        [255, 0, 245],\n        [151, 255, 0],\n        [132, 255, 0],\n        [75, 0, 255],\n        [151, 0, 255],\n        [0, 151, 255],\n        [132, 0, 255],\n        [0, 255, 245],\n        [255, 132, 0],\n        [226, 0, 255],\n        [255, 37, 0],\n        [207, 255, 0],\n        [0, 255, 207],\n        [94, 255, 0],\n        [0, 226, 255],\n        [56, 255, 0],\n        [255, 94, 0],\n        [255, 113, 0],\n        [0, 132, 255],\n        [255, 0, 132],\n        [255, 170, 0],\n        [255, 0, 188],\n        [113, 255, 0],\n        [245, 0, 255],\n        [113, 0, 255],\n        [255, 188, 0],\n        [0, 113, 255],\n        [255, 0, 0],\n        [0, 56, 255],\n        [255, 0, 113],\n        [0, 255, 188],\n        [255, 0, 94],\n        [255, 0, 18],\n        [18, 255, 0],\n        [0, 255, 132],\n        [0, 188, 255],\n        [0, 245, 255],\n        [0, 169, 255],\n        [37, 255, 0],\n        [255, 0, 151],\n        [188, 0, 255],\n        [0, 255, 37],\n        [0, 255, 0],\n        [255, 0, 170],\n        [255, 0, 37],\n        [255, 75, 0],\n        [0, 0, 255],\n  
      [255, 207, 0],\n        [255, 0, 226],\n        [255, 245, 0],\n        [188, 255, 0],\n        [0, 255, 18],\n        [0, 255, 75],\n        [0, 255, 151],\n        [255, 56, 0],\n        [245, 255, 0],\n    ]\n\n    color_index = 0\n\n    for box, mask, label, score in zip(boxes, masks, classes, scores):\n        if score < 0.15:\n            continue\n\n        print(\n            \"%s = %.5f at %.2f %.2f %.2f x %.2f\\n\"\n            % (label, score, box[0], box[1], box[2], box[3])\n        )\n\n        cv2.rectangle(\n            image,\n            (int(box[0]), int(box[1])),\n            (int(box[0] + box[2]), int(box[1] + box[3])),\n            (255, 0, 0),\n        )\n\n        text = \"%s %.1f%%\" % (class_names[int(label)], score * 100)\n\n        label_size, baseLine = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)\n\n        x = box[0]\n        y = box[1] - label_size[1] - baseLine\n        if y < 0:\n            y = 0\n        if x + label_size[0] > image.shape[1]:\n            x = image.shape[1] - label_size[0]\n\n        cv2.rectangle(\n            image,\n            (int(x), int(y)),\n            (int(x + label_size[0]), int(y + label_size[1] + baseLine)),\n            (255, 255, 255),\n            -1,\n        )\n\n        cv2.putText(\n            image,\n            text,\n            (int(x), int(y + label_size[1])),\n            cv2.FONT_HERSHEY_SIMPLEX,\n            0.5,\n            (0, 0, 0),\n        )\n\n        # blend the mask color into the image; wrap around so the palette never runs out\n        color = np.array(colors[color_index % len(colors)])\n        image[mask] = image[mask] * 0.5 + color * 0.5\n        color_index += 1\n\n    cv2.imshow(\"image\", image)\n    cv2.waitKey(0)\n\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\n        \"yolact\",\n        
target_size=550,\n        confidence_threshold=0.05,\n        nms_threshold=0.5,\n        keep_top_k=200,\n        num_threads=4,\n        use_gpu=True,\n    )\n\n    boxes, masks, classes, scores = net(m)\n\n    draw_result(m, net.class_names, boxes, masks, classes, scores)\n"
  },
  {
    "path": "python/examples/yolov2.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"mobilenet_yolov2\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/yolov3.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\"mobilenetv2_yolov3\", num_threads=4, use_gpu=True)\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/yolov4.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [v4l input device or image]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    devicepath = sys.argv[1]\n\n    net = get_model(\"yolov4_tiny\", num_threads=4, use_gpu=True)\n    # net = get_model(\"yolov4\", num_threads=4, use_gpu=True)\n\n    if devicepath.find(\"/dev/video\") == -1:\n        m = cv2.imread(devicepath)\n        if m is None:\n            print(\"cv2.imread %s failed\\n\" % (devicepath))\n            sys.exit(0)\n\n        objects = net(m)\n\n        draw_detection_objects(m, net.class_names, objects)\n    else:\n        cap = cv2.VideoCapture(devicepath)\n\n        if cap.isOpened() == False:\n            print(\"Failed to open %s\" % (devicepath))\n            sys.exit(0)\n\n        while True:\n            ret, frame = cap.read()\n\n            objects = net(frame)\n\n            draw_detection_objects(frame, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/yolov5.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\n        \"yolov5s\",\n        target_size=640,\n        prob_threshold=0.25,\n        nms_threshold=0.45,\n        num_threads=4,\n        use_gpu=True,\n    )\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/examples/yolov8.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport cv2\nfrom ncnn.model_zoo import get_model\nfrom ncnn.utils import draw_detection_objects\n\nif __name__ == \"__main__\":\n    if len(sys.argv) != 2:\n        print(\"Usage: %s [imagepath]\\n\" % (sys.argv[0]))\n        sys.exit(0)\n\n    imagepath = sys.argv[1]\n\n    m = cv2.imread(imagepath)\n    if m is None:\n        print(\"cv2.imread %s failed\\n\" % (imagepath))\n        sys.exit(0)\n\n    net = get_model(\n        \"yolov8s\",\n        target_size=640,\n        prob_threshold=0.25,\n        nms_threshold=0.45,\n        num_threads=4,\n        use_gpu=True,\n    )\n\n    objects = net(m)\n\n    draw_detection_objects(m, net.class_names, objects)\n"
  },
  {
    "path": "python/ncnn/__init__.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nfrom .ncnn import *\n\n__version__ = ncnn.__version__\n"
  },
  {
    "path": "python/ncnn/model_zoo/__init__.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\n# coding: utf-8\n\"\"\"Predefined and pretrained models.\"\"\"\n\nfrom . import model_store\n\nfrom .model_zoo import get_model, get_model_list\n"
  },
  {
    "path": "python/ncnn/model_zoo/fasterrcnn.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass Faster_RCNN:\n    def __init__(\n        self,\n        img_width=600,\n        img_height=600,\n        num_threads=1,\n        use_gpu=False,\n        max_per_image=100,\n        confidence_thresh=0.05,\n        nms_threshold=0.3,\n    ):\n        self.img_width = img_width\n        self.img_height = img_height\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [102.9801, 115.9465, 122.7717]\n        self.norm_vals = []\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # original pretrained model from https://github.com/rbgirshick/py-faster-rcnn\n        # py-faster-rcnn/models/pascal_voc/ZF/faster_rcnn_alt_opt/faster_rcnn_test.pt\n        # https://dl.dropboxusercontent.com/s/o6ii098bu51d139/faster_rcnn_models.tgz?dl=0\n        # ZF_faster_rcnn_final.caffemodel\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"ZF_faster_rcnn_final.param\"))\n        self.net.load_model(get_model_file(\"ZF_faster_rcnn_final.bin\"))\n\n        self.max_per_image = max_per_image\n        self.confidence_thresh = confidence_thresh\n        self.nms_threshold = nms_threshold\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n   
         \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        # scale to target detect size\n        h = img.shape[0]\n        w = img.shape[1]\n        scale = 1.0\n        if w < h:\n            scale = float(self.img_width) / w\n            w = self.img_width\n            h = int(h * scale)\n        else:\n            scale = float(self.img_height) / h\n            h = self.img_height\n            w = int(w * scale)\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img, ncnn.Mat.PixelType.PIXEL_BGR, img.shape[1], img.shape[0], w, h\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        # method 1 use numpy to Mat interface\n        # im_info = ncnn.Mat(np.array([h, w, scale], dtype=np.float32))\n\n        # method 2 use ncnn.Mat interface\n        im_info = ncnn.Mat(3)\n        im_info[0] = h\n        im_info[1] = w\n        im_info[2] = scale\n\n        ex1 = self.net.create_extractor()\n\n        ex1.input(\"data\", mat_in)\n        ex1.input(\"im_info\", im_info)\n\n        ret1, conv5_relu5 = ex1.extract(\"conv5_relu5\")\n        ret2, rois = ex1.extract(\"rois\")\n\n        class_candidates = []\n        for i in range(rois.c):\n            ex2 = self.net.create_extractor()\n\n            roi = rois.channel(i)  # get single roi\n            ex2.input(\"conv5_relu5\", conv5_relu5)\n            ex2.input(\"rois\", roi)\n\n            ret1, bbox_pred = ex2.extract(\"bbox_pred\")\n            ret2, cls_prob = ex2.extract(\"cls_prob\")\n\n            num_class = cls_prob.w\n            while len(class_candidates) < num_class:\n                class_candidates.append([])\n\n            # find class id with highest score\n            label = 0\n            score = 0.0\n            for j in range(num_class):\n                class_score = cls_prob[j]\n                if class_score > score:\n             
       label = j\n                    score = class_score\n\n            # ignore background or low score\n            if label == 0 or score <= self.confidence_thresh:\n                continue\n\n            # fprintf(stderr, \"%d = %f\\n\", label, score);\n\n            # unscale to image size\n            x1 = roi[0] / scale\n            y1 = roi[1] / scale\n            x2 = roi[2] / scale\n            y2 = roi[3] / scale\n\n            pb_w = x2 - x1 + 1\n            pb_h = y2 - y1 + 1\n\n            # apply bbox regression\n            dx = bbox_pred[label * 4]\n            dy = bbox_pred[label * 4 + 1]\n            dw = bbox_pred[label * 4 + 2]\n            dh = bbox_pred[label * 4 + 3]\n\n            cx = x1 + pb_w * 0.5\n            cy = y1 + pb_h * 0.5\n\n            obj_cx = cx + pb_w * dx\n            obj_cy = cy + pb_h * dy\n\n            obj_w = pb_w * np.exp(dw)\n            obj_h = pb_h * np.exp(dh)\n\n            obj_x1 = obj_cx - obj_w * 0.5\n            obj_y1 = obj_cy - obj_h * 0.5\n            obj_x2 = obj_cx + obj_w * 0.5\n            obj_y2 = obj_cy + obj_h * 0.5\n\n            # clip\n            obj_x1 = np.maximum(np.minimum(obj_x1, float(img.shape[1] - 1)), 0.0)\n            obj_y1 = np.maximum(np.minimum(obj_y1, float(img.shape[0] - 1)), 0.0)\n            obj_x2 = np.maximum(np.minimum(obj_x2, float(img.shape[1] - 1)), 0.0)\n            obj_y2 = np.maximum(np.minimum(obj_y2, float(img.shape[0] - 1)), 0.0)\n\n            # append object\n            obj = Detect_Object()\n            obj.rect.x = obj_x1\n            obj.rect.y = obj_y1\n            obj.rect.w = obj_x2 - obj_x1 + 1\n            obj.rect.h = obj_y2 - obj_y1 + 1\n            obj.label = label\n            obj.prob = score\n\n            class_candidates[label].append(obj)\n\n        # post process\n        objects = []\n        for candidates in class_candidates:\n            if len(candidates) == 0:\n                continue\n\n            candidates.sort(key=lambda obj: 
obj.prob, reverse=True)\n\n            picked = self.nms_sorted_bboxes(candidates, self.nms_threshold)\n\n            for j in range(len(picked)):\n                z = picked[j]\n                objects.append(candidates[z])\n\n        objects.sort(key=lambda obj: obj.prob, reverse=True)\n\n        objects = objects[: self.max_per_image]\n\n        return objects\n\n    def nms_sorted_bboxes(self, objects, nms_threshold):\n        picked = []\n\n        n = len(objects)\n\n        areas = np.zeros((n,), dtype=np.float32)\n        for i in range(n):\n            areas[i] = objects[i].rect.area()\n\n        for i in range(n):\n            a = objects[i]\n\n            keep = True\n            for j in range(len(picked)):\n                b = objects[picked[j]]\n\n                # intersection over union\n                inter_area = a.rect.intersection_area(b.rect)\n                union_area = areas[i] + areas[picked[j]] - inter_area\n                # float IoU = inter_area / union_area\n                if inter_area / union_area > nms_threshold:\n                    keep = False\n\n            if keep:\n                picked.append(i)\n\n        return picked\n"
  },
  {
    "path": "python/ncnn/model_zoo/mobilenetssd.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass MobileNet_SSD:\n    def __init__(self, target_size=300, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [127.5, 127.5, 127.5]\n        self.norm_vals = [0.007843, 0.007843, 0.007843]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # model is converted from https://github.com/chuanqi305/MobileNet-SSD\n        # and can be downloaded from https://drive.google.com/open?id=0ByaKLD9QaPtucWk0Y0dha1VVY0U\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"mobilenet_ssd_voc_ncnn.param\"))\n        self.net.load_model(get_model_file(\"mobilenet_ssd_voc_ncnn.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n            \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n    
    )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"detection_out\")\n\n        objects = []\n\n        # printf(\"%d %d %d\\n\", mat_out.w, mat_out.h, mat_out.c)\n\n        # method 1, use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n\n            objects.append(obj)\n\n        \"\"\"\n        #method 2, use ncnn.Mat->numpy.array to get the result, no memory copy too\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n            objects.append(obj)\n        \"\"\"\n\n        return objects\n"
  },
  {
    "path": "python/ncnn/model_zoo/mobilenetv2ssdlite.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass Noop(ncnn.Layer):\n    pass\n\n\ndef Noop_layer_creator():\n    return Noop()\n\n\nclass MobileNetV2_SSDLite:\n    def __init__(self, target_size=300, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [127.5, 127.5, 127.5]\n        self.norm_vals = [0.007843, 0.007843, 0.007843]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n        # self.net.register_custom_layer(\"Silence\", Noop_layer_creator)\n\n        # original pretrained model from https://github.com/chuanqi305/MobileNetv2-SSDLite\n        # https://github.com/chuanqi305/MobileNetv2-SSDLite/blob/master/ssdlite/voc/deploy.prototxt\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"mobilenetv2_ssdlite_voc.param\"))\n        self.net.load_model(get_model_file(\"mobilenetv2_ssdlite_voc.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n            \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = 
ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img_w,\n            img_h,\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"detection_out\")\n\n        objects = []\n\n        # printf(\"%d %d %d\\n\", mat_out.w, mat_out.h, mat_out.c)\n\n        # method 1, use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n\n            objects.append(obj)\n\n        \"\"\"\n        #method 2, use ncnn.Mat->numpy.array to get the result, no memory copy too\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n            objects.append(obj)\n        \"\"\"\n\n        return objects\n"
  },
  {
    "path": "python/ncnn/model_zoo/mobilenetv3ssdlite.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\ndef clamp(v, lo, hi):\n    if v < lo:\n        return lo\n    elif hi < v:\n        return hi\n    else:\n        return v\n\n\nclass MobileNetV3_SSDLite:\n    def __init__(self, target_size=300, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [123.675, 116.28, 103.53]\n        self.norm_vals = [1.0, 1.0, 1.0]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # converted ncnn model from https://github.com/ujsyehao/mobilenetv3-ssd\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"mobilenetv3_ssdlite_voc.param\"))\n        self.net.load_model(get_model_file(\"mobilenetv3_ssdlite_voc.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n            \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR2RGB,\n            img.shape[1],\n            img.shape[0],\n            
self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize([], self.norm_vals)\n        mat_in.substract_mean_normalize(self.mean_vals, [])\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"input\", mat_in)\n\n        ret, mat_out = ex.extract(\"detection_out\")\n\n        objects = []\n\n        # printf(\"%d %d %d\\n\", mat_out.w, mat_out.h, mat_out.c)\n\n        # method 1, use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n\n            x1 = (\n                clamp(values[2] * self.target_size, 0.0, float(self.target_size - 1))\n                / self.target_size\n                * img_w\n            )\n            y1 = (\n                clamp(values[3] * self.target_size, 0.0, float(self.target_size - 1))\n                / self.target_size\n                * img_h\n            )\n            x2 = (\n                clamp(values[4] * self.target_size, 0.0, float(self.target_size - 1))\n                / self.target_size\n                * img_w\n            )\n            y2 = (\n                clamp(values[5] * self.target_size, 0.0, float(self.target_size - 1))\n                / self.target_size\n                * img_h\n            )\n\n            if np.isnan(x1) or np.isnan(y1) or np.isnan(x2) or np.isnan(y2):\n                continue\n\n            obj.rect.x = x1\n            obj.rect.y = y1\n            obj.rect.w = x2 - x1\n            obj.rect.h = y2 - y1\n\n            objects.append(obj)\n\n        \"\"\"\n        #method 2, use ncnn.Mat->numpy.array to get the result, no memory copy too\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n\n            x1 = 
clamp(values[2] * self.target_size, 0.0, float(self.target_size - 1)) / self.target_size * img_w\n            y1 = clamp(values[3] * self.target_size, 0.0, float(self.target_size - 1)) / self.target_size * img_h\n            x2 = clamp(values[4] * self.target_size, 0.0, float(self.target_size - 1)) / self.target_size * img_w\n            y2 = clamp(values[5] * self.target_size, 0.0, float(self.target_size - 1)) / self.target_size * img_h\n\n            obj.rect.x = x1\n            obj.rect.y = y1\n            obj.rect.w = x2 - x1\n            obj.rect.h = y2 - y1\n\n            objects.append(obj)\n        \"\"\"\n\n        return objects\n"
  },
  {
    "path": "python/ncnn/model_zoo/model_store.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\n\"\"\"Model store which provides pretrained models.\"\"\"\nfrom __future__ import print_function\n\n__all__ = [\"get_model_file\", \"purge\"]\n\nimport os\nimport zipfile\nimport logging\nimport portalocker\n\nfrom ..utils import download, check_sha1\n\n_model_sha1 = {\n    name: checksum\n    for checksum, name in [\n        (\"4ff279e78cdb0b8bbc9363181df6f094ad46dc36\", \"mobilenet_yolo.param\"),\n        (\"1528cf08b9823fc01aaebfc932ec8c8d4a3b1613\", \"mobilenet_yolo.bin\"),\n        (\"3f5b78b0c982f8bdf3a2c3a27e6136d4d2680e96\", \"mobilenetv2_yolov3.param\"),\n        (\"0705b0f8fe5a77718561b9b7d6ed4f33fcd3d455\", \"mobilenetv2_yolov3.bin\"),\n        (\"de59186323ebad5650631e12a6cc66b526ec7df4\", \"yolov4-tiny-opt.param\"),\n        (\"1765c3b251c041dd6ac59d2ec3ddf7b983fe9ee9\", \"yolov4-tiny-opt.bin\"),\n        (\"e92d3a3a8ac5e6a6c08c433aa2252b0680124328\", \"yolov4-opt.param\"),\n        (\"69d128b42b70fb790e9d3ccabcf1b6e8cc2859fe\", \"yolov4-opt.bin\"),\n        (\"6fa8ccc8cabc0f5633ab3c6ffa268e6042b8888f\", \"yolov5s.param\"),\n        (\"0cbab3664deb090480ea748c1305f6fe850b9ac4\", \"yolov5s.bin\"),\n        (\"35ab0c1ce2864e0759d5794aa818df2de3013ab3\", \"yolov7-tiny.param\"),\n        (\"c0454f072b41997aa230c3fe1c1d504566574b6c\", \"yolov7-tiny.bin\"),\n        (\"e9de3c929d1c93f7dc94ed0f125795ac16ecc120\", \"yolov8s.param\"),\n        (\"90f4eb9e90086e2ec3af4c7837f00757e710b9c6\", \"yolov8s.bin\"),\n        (\"e65bae7052d9e9b9d45e1214a8d1b5fe6f64e8af\", \"yolact.param\"),\n        (\"9bda99f50b1c14c98c5c6bbc08d4f782eed66548\", \"yolact.bin\"),\n        (\"3723ce3e312db6a102cff1a5c39dae80e1de658e\", \"mobilenet_ssd_voc_ncnn.param\"),\n        (\"8e2d2139550dcbee1ce5e200b7697b25aab29656\", \"mobilenet_ssd_voc_ncnn.bin\"),\n        (\"52c669821dc32ef5b7ab30749fa71a3bc27786b8\", \"squeezenet_ssd_voc.param\"),\n        (\"347e31d1cbe469259fa8305860a7c24a95039202\", 
\"squeezenet_ssd_voc.bin\"),\n        (\"52dab628ecac8137e61ce3aea1a912f9c5a0a638\", \"mobilenetv2_ssdlite_voc.param\"),\n        (\"9fea06f74f7c60d753cf703ea992f92e50a986d4\", \"mobilenetv2_ssdlite_voc.bin\"),\n        (\"f36661eff1eda1e36185e7f2f28fc722ad8b66bb\", \"mobilenetv3_ssdlite_voc.param\"),\n        (\"908f63ca9bff0061a499512664b9c533a0b7f485\", \"mobilenetv3_ssdlite_voc.bin\"),\n        (\"a63d779a1f789af976bc4e2eae86fdd9b0bb6c2c\", \"squeezenet_v1.1.param\"),\n        (\"262f0e33e37aeac69021b5a3556664be65fc0aeb\", \"squeezenet_v1.1.bin\"),\n        (\"3ba57cccd1d4a583f6eb76eae25a2dbda7ce7f74\", \"ZF_faster_rcnn_final.param\"),\n        (\"1095fbb5f846a1f311b40941add5fef691acaf8d\", \"ZF_faster_rcnn_final.bin\"),\n        (\"3586ec3d663b1cc8ec8c662768caa9c7fbcf4fdc\", \"pelee.param\"),\n        (\"2442ad483dc546940271591b86db0d9c8b1c7118\", \"pelee.bin\"),\n        (\"6cfeda08d5494a1274199089fda77c421be1ecac\", \"mnet.25-opt.param\"),\n        (\"3ff9a51dc81cdf506a87543dbf752071ffc50b8d\", \"mnet.25-opt.bin\"),\n        (\"50acebff393c91468a73a7b7c604ef231429d068\", \"rfcn_end2end.param\"),\n        (\"9a68cd937959b4dda9c5bf9c99181cb0e40f266b\", \"rfcn_end2end.bin\"),\n        (\"d6b289cda068e9a9d8a171fb909352a05a39a494\", \"shufflenet_v2_x0.5.param\"),\n        (\"2ccd631d04a1b7e05483cd8a8def76bca7d330a8\", \"shufflenet_v2_x0.5.bin\"),\n        (\"7c8f8d72c60aab6802985423686b36c61be2f68c\", \"pose.param\"),\n        (\"7f691540972715298c611a3e595b20c59c2147ce\", \"pose.bin\"),\n        (\"979d09942881cf1207a93cbfa9853005a434469b\", \"nanodet_m.param\"),\n        (\"51d868905361e4ba9c45bd12e8a5608e7aadd1bd\", \"nanodet_m.bin\"),\n    ]\n}\n\n\n_split_model_bins = {\n    \"ZF_faster_rcnn_final.bin\": 3,\n    \"rfcn_end2end.bin\": 2,\n    \"yolov4-opt.bin\": 7,\n}\n\n\ngithub_repo_url = \"https://github.com/nihui/ncnn-assets/raw/master/models/\"\n_url_format = \"{repo_url}{file_name}\"\n\n\ndef merge_file(root, files_in, file_out, remove=True):\n    with 
open(file_out, \"wb\") as fd_out:\n        for file_in in files_in:\n            file = os.path.join(root, file_in)\n            with open(file, \"rb\") as fd_in:\n                fd_out.write(fd_in.read())\n            if remove:\n                os.remove(file)\n\n\ndef short_hash(name):\n    if name not in _model_sha1:\n        raise ValueError(\n            \"Pretrained model for {name} is not available.\".format(name=name)\n        )\n    return _model_sha1[name][:8]\n\n\ndef get_model_file(name, tag=None, root=os.path.join(\"~\", \".ncnn\", \"models\")):\n    r\"\"\"Return the location of the pretrained model on the local file system.\n\n    This function downloads the file from the online model zoo when the model\n    cannot be found locally or its hash does not match.\n    The root directory will be created if it doesn't exist.\n\n    Parameters\n    ----------\n    name : str\n        Name of the model.\n    tag : str, optional\n        Hash used both as a suffix of the local file name and as the expected SHA-1.\n        When omitted, the built-in checksum for ``name`` is used.\n    root : str, default '~/.ncnn/models'\n        Location for keeping the model parameters.\n\n    Returns\n    -------\n    file_path\n        Path to the requested pretrained model file.\n    \"\"\"\n    if \"NCNN_HOME\" in os.environ:\n        root = os.path.join(os.environ[\"NCNN_HOME\"], \"models\")\n\n    use_tag = isinstance(tag, str)\n    if use_tag:\n        file_name = \"{name}-{short_hash}\".format(name=name, short_hash=tag)\n    else:\n        file_name = \"{name}\".format(name=name)\n\n    root = os.path.expanduser(root)\n    params_path = os.path.join(root, file_name)\n    lockfile = os.path.join(root, file_name + \".lock\")\n    if use_tag:\n        sha1_hash = tag\n    else:\n        sha1_hash = _model_sha1[name]\n\n    if not os.path.exists(root):\n        os.makedirs(root)\n\n    with portalocker.Lock(\n        lockfile, timeout=int(os.environ.get(\"NCNN_MODEL_LOCK_TIMEOUT\", 300))\n    ):\n        if os.path.exists(params_path):\n            if check_sha1(params_path, sha1_hash):\n                return params_path\n            else:\n                logging.warning(\n          
          \"Hash mismatch in the content of model file '%s' detected. \"\n                    \"Downloading again.\",\n                    params_path,\n                )\n        else:\n            logging.info(\"Model file not found. Downloading.\")\n\n        zip_file_path = os.path.join(root, file_name)\n        if file_name in _split_model_bins:\n            file_name_parts = [\n                \"%s.part%02d\" % (file_name, i + 1)\n                for i in range(_split_model_bins[file_name])\n            ]\n            for file_name_part in file_name_parts:\n                file_path = os.path.join(root, file_name_part)\n                repo_url = os.environ.get(\"NCNN_REPO\", github_repo_url)\n                if repo_url[-1] != \"/\":\n                    repo_url = repo_url + \"/\"\n                download(\n                    _url_format.format(repo_url=repo_url, file_name=file_name_part),\n                    path=file_path,\n                    overwrite=True,\n                )\n\n            merge_file(root, file_name_parts, zip_file_path)\n        else:\n            repo_url = os.environ.get(\"NCNN_REPO\", github_repo_url)\n            if repo_url[-1] != \"/\":\n                repo_url = repo_url + \"/\"\n            download(\n                _url_format.format(repo_url=repo_url, file_name=file_name),\n                path=zip_file_path,\n                overwrite=True,\n            )\n        if zip_file_path.endswith(\".zip\"):\n            with zipfile.ZipFile(zip_file_path) as zf:\n                zf.extractall(root)\n            os.remove(zip_file_path)\n        # Make sure we write the model file on networked filesystems\n        try:\n            os.sync()\n        except AttributeError:\n            pass\n        if check_sha1(params_path, sha1_hash):\n            return params_path\n        else:\n            raise ValueError(\"Downloaded file has different hash. 
Please try again.\")\n\n\ndef purge(root=os.path.join(\"~\", \".ncnn\", \"models\")):\n    r\"\"\"Purge all pretrained model files in local file store.\n\n    Parameters\n    ----------\n    root : str, default '~/.ncnn/models'\n        Location for keeping the model parameters.\n    \"\"\"\n    root = os.path.expanduser(root)\n    files = os.listdir(root)\n    for f in files:\n        # ncnn models are stored as .param/.bin pairs\n        if f.endswith((\".param\", \".bin\")):\n            os.remove(os.path.join(root, f))\n"
  },
  {
    "path": "python/ncnn/model_zoo/model_zoo.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nfrom .yolov2 import MobileNet_YoloV2\nfrom .yolov3 import MobileNetV2_YoloV3\nfrom .yolov4 import YoloV4_Tiny, YoloV4\nfrom .yolov5 import YoloV5s\nfrom .yolov7 import YoloV7_Tiny\nfrom .yolov8 import YoloV8s\nfrom .yolact import Yolact\nfrom .mobilenetssd import MobileNet_SSD\nfrom .squeezenetssd import SqueezeNet_SSD\nfrom .mobilenetv2ssdlite import MobileNetV2_SSDLite\nfrom .mobilenetv3ssdlite import MobileNetV3_SSDLite\nfrom .squeezenet import SqueezeNet\nfrom .fasterrcnn import Faster_RCNN\nfrom .peleenetssd import PeleeNet_SSD\nfrom .retinaface import RetinaFace\nfrom .rfcn import RFCN\nfrom .shufflenetv2 import ShuffleNetV2\nfrom .simplepose import SimplePose\nfrom .nanodet import NanoDet\n\n__all__ = [\"get_model\", \"get_model_list\"]\n\n_models = {\n    \"mobilenet_yolov2\": MobileNet_YoloV2,\n    \"mobilenetv2_yolov3\": MobileNetV2_YoloV3,\n    \"yolov4_tiny\": YoloV4_Tiny,\n    \"yolov4\": YoloV4,\n    \"yolov5s\": YoloV5s,\n    \"yolov7_tiny\": YoloV7_Tiny,\n    \"yolov8s\": YoloV8s,\n    \"yolact\": Yolact,\n    \"mobilenet_ssd\": MobileNet_SSD,\n    \"squeezenet_ssd\": SqueezeNet_SSD,\n    \"mobilenetv2_ssdlite\": MobileNetV2_SSDLite,\n    \"mobilenetv3_ssdlite\": MobileNetV3_SSDLite,\n    \"squeezenet\": SqueezeNet,\n    \"faster_rcnn\": Faster_RCNN,\n    \"peleenet_ssd\": PeleeNet_SSD,\n    \"retinaface\": RetinaFace,\n    \"rfcn\": RFCN,\n    \"shufflenetv2\": ShuffleNetV2,\n    \"simplepose\": SimplePose,\n    \"nanodet\": NanoDet,\n}\n\n\ndef get_model(name, **kwargs):\n    name = name.lower()\n    if name not in _models:\n        err_str = '\"%s\" is not among the following model list:\\n\\t' % (name)\n        err_str += \"%s\" % (\"\\n\\t\".join(sorted(_models.keys())))\n        raise ValueError(err_str)\n    net = _models[name](**kwargs)\n    return net\n\n\ndef get_model_list():\n    return list(_models.keys())\n"
  },
  {
    "path": "python/ncnn/model_zoo/nanodet.py",
    "content": "# Copyright 2021 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\nfrom ..utils.functional import *\n\n\nclass NanoDet:\n    def __init__(\n        self,\n        target_size=320,\n        prob_threshold=0.4,\n        nms_threshold=0.3,\n        num_threads=1,\n        use_gpu=False,\n    ):\n        self.target_size = target_size\n        self.prob_threshold = prob_threshold\n        self.nms_threshold = nms_threshold\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [103.53, 116.28, 123.675]\n        self.norm_vals = [0.017429, 0.017507, 0.017125]\n\n        self.net = ncnn.Net()\n        self.net.opt.use_vulkan_compute = self.use_gpu\n        self.net.opt.num_threads = self.num_threads\n\n        # original pretrained model from https://github.com/RangiLyu/nanodet\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"nanodet_m.param\"))\n        self.net.load_model(get_model_file(\"nanodet_m.bin\"))\n\n        self.reg_max = 7\n        self.strides = [8, 16, 32]\n        self.num_candidate = 1000\n        self.top_k = -1\n\n        self.class_names = [\n            \"person\",\n            \"bicycle\",\n            \"car\",\n            \"motorcycle\",\n            \"airplane\",\n            \"bus\",\n            \"train\",\n            \"truck\",\n            \"boat\",\n            \"traffic light\",\n            \"fire hydrant\",\n            \"stop sign\",\n            \"parking meter\",\n            \"bench\",\n            \"bird\",\n            \"cat\",\n            \"dog\",\n            \"horse\",\n            \"sheep\",\n            \"cow\",\n            \"elephant\",\n            \"bear\",\n            \"zebra\",\n            \"giraffe\",\n            \"backpack\",\n            \"umbrella\",\n        
    \"handbag\",\n            \"tie\",\n            \"suitcase\",\n            \"frisbee\",\n            \"skis\",\n            \"snowboard\",\n            \"sports ball\",\n            \"kite\",\n            \"baseball bat\",\n            \"baseball glove\",\n            \"skateboard\",\n            \"surfboard\",\n            \"tennis racket\",\n            \"bottle\",\n            \"wine glass\",\n            \"cup\",\n            \"fork\",\n            \"knife\",\n            \"spoon\",\n            \"bowl\",\n            \"banana\",\n            \"apple\",\n            \"sandwich\",\n            \"orange\",\n            \"broccoli\",\n            \"carrot\",\n            \"hot dog\",\n            \"pizza\",\n            \"donut\",\n            \"cake\",\n            \"chair\",\n            \"couch\",\n            \"potted plant\",\n            \"bed\",\n            \"dining table\",\n            \"toilet\",\n            \"tv\",\n            \"laptop\",\n            \"mouse\",\n            \"remote\",\n            \"keyboard\",\n            \"cell phone\",\n            \"microwave\",\n            \"oven\",\n            \"toaster\",\n            \"sink\",\n            \"refrigerator\",\n            \"book\",\n            \"clock\",\n            \"vase\",\n            \"scissors\",\n            \"teddy bear\",\n            \"hair drier\",\n            \"toothbrush\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_w = img.shape[1]\n        img_h = img.shape[0]\n\n        w = img_w\n        h = img_h\n        scale = 1.0\n        if w > h:\n            scale = float(self.target_size) / w\n            w = self.target_size\n            h = int(h * scale)\n        else:\n            scale = float(self.target_size) / h\n            h = self.target_size\n            w = int(w * scale)\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img, ncnn.Mat.PixelType.PIXEL_BGR, img_w, img_h, w, h\n        
)\n\n        # pad width and height up to the next multiple of 32\n        wpad = (w + 31) // 32 * 32 - w\n        hpad = (h + 31) // 32 * 32 - h\n        mat_in_pad = ncnn.copy_make_border(\n            mat_in,\n            hpad // 2,\n            hpad - hpad // 2,\n            wpad // 2,\n            wpad - wpad // 2,\n            ncnn.BorderType.BORDER_CONSTANT,\n            0,\n        )\n\n        mat_in_pad.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n        ex.input(\"input.1\", mat_in_pad)\n\n        score_out_name = [\"792\", \"814\", \"836\"]\n        scores = [ex.extract(x)[1] for x in score_out_name]\n        scores = [np.reshape(x, (-1, 80)) for x in scores]\n\n        boxes_out_name = [\"795\", \"817\", \"839\"]\n        raw_boxes = [ex.extract(x)[1] for x in boxes_out_name]\n        raw_boxes = [np.reshape(x, (-1, 32)) for x in raw_boxes]\n\n        # generate centers\n        decode_boxes = []\n        select_scores = []\n        for stride, box_distribute, score in zip(self.strides, raw_boxes, scores):\n            # centers\n            # score has shape (fm_h * fm_w, 80), so the grid size comes from axis 0\n            if mat_in_pad.w > mat_in_pad.h:\n                fm_w = mat_in_pad.w // stride\n                fm_h = score.shape[0] // fm_w\n            else:\n                fm_h = mat_in_pad.h // stride\n                fm_w = score.shape[0] // fm_h\n            h_range = np.arange(fm_h)\n            w_range = np.arange(fm_w)\n            ww, hh = np.meshgrid(w_range, h_range)\n            ct_row = (hh.flatten() + 0.5) * stride\n            ct_col = (ww.flatten() + 0.5) * stride\n            center = np.stack((ct_col, ct_row, ct_col, ct_row), axis=1)\n\n            # box distribution to distance\n            reg_range = np.arange(self.reg_max + 1)\n            box_distance = box_distribute.reshape((-1, self.reg_max + 1))\n            box_distance = softmax(box_distance)\n            box_distance = box_distance * np.expand_dims(reg_range, axis=0)\n            box_distance = 
np.sum(box_distance, axis=1).reshape((-1, 4))\n            box_distance = box_distance * stride\n\n            # top K candidate\n            topk_idx = np.argsort(score.max(axis=1))[::-1]\n            topk_idx = topk_idx[: self.num_candidate]\n            center = center[topk_idx]\n            score = score[topk_idx]\n            box_distance = box_distance[topk_idx]\n\n            # decode box\n            decode_box = center + [-1, -1, 1, 1] * box_distance\n\n            select_scores.append(score)\n            decode_boxes.append(decode_box)\n\n        # nms\n        bboxes = np.concatenate(decode_boxes, axis=0)\n        confidences = np.concatenate(select_scores, axis=0)\n        picked_box = []\n        picked_probs = []\n        picked_labels = []\n        for class_index in range(0, confidences.shape[1]):\n            probs = confidences[:, class_index]\n            mask = probs > self.prob_threshold\n            probs = probs[mask]\n            if probs.shape[0] == 0:\n                continue\n            subset_boxes = bboxes[mask, :]\n            picked = nms(\n                subset_boxes,\n                probs,\n                iou_threshold=self.nms_threshold,\n                top_k=self.top_k,\n            )\n            picked_box.append(subset_boxes[picked])\n            picked_probs.append(probs[picked])\n            picked_labels.extend([class_index] * len(picked))\n\n        if not picked_box:\n            return []\n\n        picked_box = np.concatenate(picked_box)\n        picked_probs = np.concatenate(picked_probs)\n\n        # result with clip\n        objects = [\n            Detect_Object(\n                label,\n                score,\n                (bbox[0] - wpad / 2) / scale if bbox[0] > 0 else 0,\n                (bbox[1] - hpad / 2) / scale if bbox[1] > 0 else 0,\n                (bbox[2] - bbox[0]) / scale\n                if bbox[2] < mat_in_pad.w\n                else (mat_in_pad.w - bbox[0]) / scale,\n                
(bbox[3] - bbox[1]) / scale\n                if bbox[3] < mat_in_pad.h\n                else (mat_in_pad.h - bbox[1]) / scale,\n            )\n            for label, score, bbox in zip(picked_labels, picked_probs, picked_box)\n        ]\n\n        return objects\n"
  },
  {
    "path": "python/ncnn/model_zoo/peleenetssd.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass PeleeNet_SSD:\n    def __init__(self, target_size=304, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [103.9, 116.7, 123.6]\n        self.norm_vals = [0.017, 0.017, 0.017]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # model is converted from https://github.com/eric612/MobileNet-YOLO\n        # and can be downloaded from https://drive.google.com/open?id=1Wt6jKv13sBRMHgrGAJYlOlRF-o80pC0g\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"pelee.param\"))\n        self.net.load_model(get_model_file(\"pelee.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"person\",\n            \"rider\",\n            \"car\",\n            \"bus\",\n            \"truck\",\n            \"bike\",\n            \"motor\",\n            \"traffic light\",\n            \"traffic sign\",\n            \"train\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"detection_out\")\n\n        objects = []\n\n        # printf(\"%d %d %d\\n\", 
mat_out.w, mat_out.h, mat_out.c)\n\n        # method 1, use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n\n            objects.append(obj)\n\n        \"\"\"\n        # method 2, convert ncnn.Mat to numpy.array to get the result, also with no memory copy\n        # (needs numpy imported as np at the top of this file)\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n            objects.append(obj)\n        \"\"\"\n\n        ret, seg_out = ex.extract(\"sigmoid\")\n\n        resized = ncnn.Mat()\n        ncnn.resize_bilinear(seg_out, resized, img_w, img_h)\n\n        return objects, resized\n"
  },
  {
    "path": "python/ncnn/model_zoo/retinaface.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Point, Face_Object\n\n\nclass RetinaFace:\n    def __init__(\n        self, prob_threshold=0.8, nms_threshold=0.4, num_threads=1, use_gpu=False\n    ):\n        self.prob_threshold = prob_threshold\n        self.nms_threshold = nms_threshold\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # model is converted from\n        # https://github.com/deepinsight/insightface/tree/master/RetinaFace#retinaface-pretrained-models\n        # https://github.com/deepinsight/insightface/issues/669\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"mnet.25-opt.param\"))\n        self.net.load_model(get_model_file(\"mnet.25-opt.bin\"))\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels(\n            img, ncnn.Mat.PixelType.PIXEL_BGR2RGB, img_w, img_h\n        )\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        faceobjects32 = self.detect_stride32(ex)\n        faceobjects16 = self.detect_stride16(ex)\n        faceobjects8 = self.detect_stride8(ex)\n\n        faceproposals = [*faceobjects32, *faceobjects16, *faceobjects8]\n\n        # sort all proposals by score from highest to lowest\n        faceproposals.sort(key=lambda obj: obj.prob, reverse=True)\n\n        # apply nms with nms_threshold\n        picked = self.nms_sorted_bboxes(faceproposals, self.nms_threshold)\n\n        face_count = len(picked)\n\n        faceobjects = []\n        for i in range(face_count):\n            
faceobjects.append(faceproposals[picked[i]])\n\n            # clip to image size\n            x0 = faceobjects[i].rect.x\n            y0 = faceobjects[i].rect.y\n            x1 = x0 + faceobjects[i].rect.w\n            y1 = y0 + faceobjects[i].rect.h\n\n            x0 = np.maximum(np.minimum(x0, float(img_w) - 1), 0.0)\n            y0 = np.maximum(np.minimum(y0, float(img_h) - 1), 0.0)\n            x1 = np.maximum(np.minimum(x1, float(img_w) - 1), 0.0)\n            y1 = np.maximum(np.minimum(y1, float(img_h) - 1), 0.0)\n\n            faceobjects[i].rect.x = x0\n            faceobjects[i].rect.y = y0\n            faceobjects[i].rect.w = x1 - x0\n            faceobjects[i].rect.h = y1 - y0\n\n        return faceobjects\n\n    def detect_stride32(self, ex):\n        ret1, score_blob = ex.extract(\"face_rpn_cls_prob_reshape_stride32\")\n        ret2, bbox_blob = ex.extract(\"face_rpn_bbox_pred_stride32\")\n        ret3, landmark_blob = ex.extract(\"face_rpn_landmark_pred_stride32\")\n\n        base_size = 16\n        feat_stride = 32\n        ratios = ncnn.Mat(1)\n        ratios[0] = 1.0\n        scales = ncnn.Mat(2)\n        scales[0] = 32.0\n        scales[1] = 16.0\n        anchors = self.generate_anchors(base_size, ratios, scales)\n\n        faceobjects32 = self.generate_proposals(\n            anchors,\n            feat_stride,\n            score_blob,\n            bbox_blob,\n            landmark_blob,\n            self.prob_threshold,\n        )\n\n        return faceobjects32\n\n    def detect_stride16(self, ex):\n        ret1, score_blob = ex.extract(\"face_rpn_cls_prob_reshape_stride16\")\n        ret2, bbox_blob = ex.extract(\"face_rpn_bbox_pred_stride16\")\n        ret3, landmark_blob = ex.extract(\"face_rpn_landmark_pred_stride16\")\n\n        base_size = 16\n        feat_stride = 16\n        ratios = ncnn.Mat(1)\n        ratios[0] = 1.0\n        scales = ncnn.Mat(2)\n        scales[0] = 8.0\n        scales[1] = 4.0\n        anchors = 
self.generate_anchors(base_size, ratios, scales)\n\n        faceobjects16 = self.generate_proposals(\n            anchors,\n            feat_stride,\n            score_blob,\n            bbox_blob,\n            landmark_blob,\n            self.prob_threshold,\n        )\n\n        return faceobjects16\n\n    def detect_stride8(self, ex):\n        ret1, score_blob = ex.extract(\"face_rpn_cls_prob_reshape_stride8\")\n        ret2, bbox_blob = ex.extract(\"face_rpn_bbox_pred_stride8\")\n        ret3, landmark_blob = ex.extract(\"face_rpn_landmark_pred_stride8\")\n\n        base_size = 16\n        feat_stride = 8\n        ratios = ncnn.Mat(1)\n        ratios[0] = 1.0\n        scales = ncnn.Mat(2)\n        scales[0] = 2.0\n        scales[1] = 1.0\n        anchors = self.generate_anchors(base_size, ratios, scales)\n\n        faceobjects8 = self.generate_proposals(\n            anchors,\n            feat_stride,\n            score_blob,\n            bbox_blob,\n            landmark_blob,\n            self.prob_threshold,\n        )\n\n        return faceobjects8\n\n    def generate_anchors(self, base_size, ratios, scales):\n        num_ratio = ratios.w\n        num_scale = scales.w\n\n        # anchors = ncnn.Mat()\n        # anchors.create(w=4, h=num_ratio * num_scale)\n\n        # one row (x0, y0, x1, y1) per (ratio, scale) pair\n        anchors_np = np.zeros((num_ratio * num_scale, 4), dtype=np.float32)\n\n        cx = base_size * 0.5\n        cy = base_size * 0.5\n\n        for i in range(num_ratio):\n            ar = ratios[i]\n\n            r_w = np.round(base_size / np.sqrt(ar))\n            r_h = np.round(r_w * ar)  # round(base_size * np.sqrt(ar))\n\n            for j in range(num_scale):\n                scale = scales[j]\n\n                rs_w = r_w * scale\n                rs_h = r_h * scale\n\n                anchor = anchors_np[i * num_scale + j]\n\n                anchor[0] = cx - rs_w * 0.5\n                anchor[1] = cy - rs_h * 0.5\n                anchor[2] = cx + rs_w * 0.5\n                anchor[3] = cy + rs_h * 0.5\n\n        
anchors = ncnn.Mat(anchors_np)\n        return anchors\n\n    def generate_proposals(\n        self, anchors, feat_stride, score_blob, bbox_blob, landmark_blob, prob_threshold\n    ):\n        faceobjects = []\n\n        w = score_blob.w\n        h = score_blob.h\n\n        # generate face proposal from bbox deltas and shifted anchors\n        num_anchors = anchors.h\n\n        for q in range(num_anchors):\n            anchor = anchors.row(q)\n\n            score = score_blob.channel(q + num_anchors)\n            bbox = bbox_blob.channel_range(q * 4, 4)\n            landmark = landmark_blob.channel_range(q * 10, 10)\n\n            # shifted anchor\n            anchor_y = anchor[1]\n\n            anchor_w = anchor[2] - anchor[0]\n            anchor_h = anchor[3] - anchor[1]\n\n            for i in range(h):\n                anchor_x = anchor[0]\n\n                for j in range(w):\n                    index = i * w + j\n\n                    prob = score[index]\n\n                    if prob >= prob_threshold:\n                        # apply center size\n                        dx = bbox.channel(0)[index]\n                        dy = bbox.channel(1)[index]\n                        dw = bbox.channel(2)[index]\n                        dh = bbox.channel(3)[index]\n\n                        cx = anchor_x + anchor_w * 0.5\n                        cy = anchor_y + anchor_h * 0.5\n\n                        pb_cx = cx + anchor_w * dx\n                        pb_cy = cy + anchor_h * dy\n\n                        pb_w = anchor_w * np.exp(dw)\n                        pb_h = anchor_h * np.exp(dh)\n\n                        x0 = pb_cx - pb_w * 0.5\n                        y0 = pb_cy - pb_h * 0.5\n                        x1 = pb_cx + pb_w * 0.5\n                        y1 = pb_cy + pb_h * 0.5\n\n                        obj = Face_Object()\n                        obj.rect.x = x0\n                        obj.rect.y = y0\n                        obj.rect.w = x1 - x0 + 1\n         
               obj.rect.h = y1 - y0 + 1\n                        obj.landmark = [Point(), Point(), Point(), Point(), Point()]\n                        obj.landmark[0].x = (\n                            cx + (anchor_w + 1) * landmark.channel(0)[index]\n                        )\n                        obj.landmark[0].y = (\n                            cy + (anchor_h + 1) * landmark.channel(1)[index]\n                        )\n                        obj.landmark[1].x = (\n                            cx + (anchor_w + 1) * landmark.channel(2)[index]\n                        )\n                        obj.landmark[1].y = (\n                            cy + (anchor_h + 1) * landmark.channel(3)[index]\n                        )\n                        obj.landmark[2].x = (\n                            cx + (anchor_w + 1) * landmark.channel(4)[index]\n                        )\n                        obj.landmark[2].y = (\n                            cy + (anchor_h + 1) * landmark.channel(5)[index]\n                        )\n                        obj.landmark[3].x = (\n                            cx + (anchor_w + 1) * landmark.channel(6)[index]\n                        )\n                        obj.landmark[3].y = (\n                            cy + (anchor_h + 1) * landmark.channel(7)[index]\n                        )\n                        obj.landmark[4].x = (\n                            cx + (anchor_w + 1) * landmark.channel(8)[index]\n                        )\n                        obj.landmark[4].y = (\n                            cy + (anchor_h + 1) * landmark.channel(9)[index]\n                        )\n                        obj.prob = prob\n\n                        faceobjects.append(obj)\n\n                    anchor_x += feat_stride\n\n                anchor_y += feat_stride\n\n        return faceobjects\n\n    def nms_sorted_bboxes(self, faceobjects, nms_threshold):\n        picked = []\n\n        n = len(faceobjects)\n\n        areas = []\n  
      for i in range(n):\n            areas.append(faceobjects[i].rect.area())\n\n        for i in range(n):\n            a = faceobjects[i]\n\n            keep = True\n            for j in range(len(picked)):\n                b = faceobjects[picked[j]]\n\n                # intersection over union\n                inter_area = a.rect.intersection_area(b.rect)\n                union_area = areas[i] + areas[picked[j]] - inter_area\n                # float IoU = inter_area / union_area\n                if inter_area / union_area > nms_threshold:\n                    keep = False\n\n            if keep:\n                picked.append(i)\n\n        return picked\n"
  },
  {
    "path": "python/ncnn/model_zoo/rfcn.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass RFCN:\n    def __init__(\n        self,\n        target_size=224,\n        max_per_image=100,\n        confidence_thresh=0.6,\n        nms_threshold=0.3,\n        num_threads=1,\n        use_gpu=False,\n    ):\n        self.target_size = target_size\n        self.max_per_image = max_per_image\n        self.confidence_thresh = confidence_thresh\n        self.nms_threshold = nms_threshold\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [102.9801, 115.9465, 122.7717]\n        self.norm_vals = []\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # original pretrained model from https://github.com/YuwenXiong/py-R-FCN\n        # https://github.com/YuwenXiong/py-R-FCN/blob/master/models/pascal_voc/ResNet-50/rfcn_end2end/test_agnostic.prototxt\n        # https://1drv.ms/u/s!AoN7vygOjLIQqUWHpY67oaC7mopf\n        # resnet50_rfcn_final.caffemodel\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"rfcn_end2end.param\"))\n        self.net.load_model(get_model_file(\"rfcn_end2end.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n            \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        
]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        h = img.shape[0]\n        w = img.shape[1]\n\n        scale = 1.0\n        if w < h:\n            scale = float(self.target_size) / w\n            w = self.target_size\n            h = h * scale\n        else:\n            scale = float(self.target_size) / h\n            h = self.target_size\n            w = w * scale\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img.shape[1],\n            img.shape[0],\n            int(w),\n            int(h),\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        im_info = ncnn.Mat(3)\n        im_info[0] = h\n        im_info[1] = w\n        im_info[2] = scale\n\n        # step1, extract feature and all rois\n        ex1 = self.net.create_extractor()\n\n        ex1.input(\"data\", mat_in)\n        ex1.input(\"im_info\", im_info)\n\n        ret1, rfcn_cls = ex1.extract(\"rfcn_cls\")\n        ret2, rfcn_bbox = ex1.extract(\"rfcn_bbox\")\n        ret3, rois = ex1.extract(\"rois\")  # all rois\n\n        # step2, extract bbox and score for each roi\n        class_candidates = []\n        for i in range(rois.c):\n            ex2 = self.net.create_extractor()\n\n            roi = rois.channel(i)  # get single roi\n            ex2.input(\"rfcn_cls\", rfcn_cls)\n            ex2.input(\"rfcn_bbox\", rfcn_bbox)\n            ex2.input(\"rois\", roi)\n\n            ret1, bbox_pred = ex2.extract(\"bbox_pred\")\n            ret2, cls_prob = ex2.extract(\"cls_prob\")\n\n            num_class = cls_prob.w\n            while len(class_candidates) < num_class:\n                class_candidates.append([])\n\n            # find class id with highest score\n            label = 0\n            score = 0.0\n            for j in range(num_class):\n                class_score = cls_prob[j]\n                if class_score > score:\n                   
 label = j\n                    score = class_score\n\n            # ignore background or low score\n            if label == 0 or score <= self.confidence_thresh:\n                continue\n\n            # fprintf(stderr, \"%d = %f\\n\", label, score)\n\n            # unscale to image size\n            x1 = roi[0] / scale\n            y1 = roi[1] / scale\n            x2 = roi[2] / scale\n            y2 = roi[3] / scale\n\n            pb_w = x2 - x1 + 1\n            pb_h = y2 - y1 + 1\n\n            # apply bbox regression\n            dx = bbox_pred[4]\n            dy = bbox_pred[4 + 1]\n            dw = bbox_pred[4 + 2]\n            dh = bbox_pred[4 + 3]\n\n            cx = x1 + pb_w * 0.5\n            cy = y1 + pb_h * 0.5\n\n            obj_cx = cx + pb_w * dx\n            obj_cy = cy + pb_h * dy\n\n            obj_w = pb_w * np.exp(dw)\n            obj_h = pb_h * np.exp(dh)\n\n            obj_x1 = obj_cx - obj_w * 0.5\n            obj_y1 = obj_cy - obj_h * 0.5\n            obj_x2 = obj_cx + obj_w * 0.5\n            obj_y2 = obj_cy + obj_h * 0.5\n\n            # clip\n            obj_x1 = np.maximum(np.minimum(obj_x1, float(img.shape[1] - 1)), 0.0)\n            obj_y1 = np.maximum(np.minimum(obj_y1, float(img.shape[0] - 1)), 0.0)\n            obj_x2 = np.maximum(np.minimum(obj_x2, float(img.shape[1] - 1)), 0.0)\n            obj_y2 = np.maximum(np.minimum(obj_y2, float(img.shape[0] - 1)), 0.0)\n\n            # append object\n            obj = Detect_Object()\n            obj.rect.x = obj_x1\n            obj.rect.y = obj_y1\n            obj.rect.w = obj_x2 - obj_x1 + 1\n            obj.rect.h = obj_y2 - obj_y1 + 1\n            obj.label = label\n            obj.prob = score\n\n            class_candidates[label].append(obj)\n\n        # post process\n        objects = []\n        for candidates in class_candidates:\n            if len(candidates) == 0:\n                continue\n\n            candidates.sort(key=lambda obj: obj.prob, reverse=True)\n\n            
picked = self.nms_sorted_bboxes(candidates, self.nms_threshold)\n\n            for j in range(len(picked)):\n                z = picked[j]\n                objects.append(candidates[z])\n\n        objects.sort(key=lambda obj: obj.prob, reverse=True)\n\n        objects = objects[: self.max_per_image]\n\n        return objects\n\n    def nms_sorted_bboxes(self, objects, nms_threshold):\n        picked = []\n\n        n = len(objects)\n\n        areas = np.zeros((n,), dtype=np.float32)\n        for i in range(n):\n            areas[i] = objects[i].rect.area()\n\n        for i in range(n):\n            a = objects[i]\n\n            keep = True\n            for j in range(len(picked)):\n                b = objects[picked[j]]\n\n                # intersection over union\n                inter_area = a.rect.intersection_area(b.rect)\n                union_area = areas[i] + areas[picked[j]] - inter_area\n                # float IoU = inter_area / union_area\n                if inter_area / union_area > nms_threshold:\n                    keep = False\n\n            if keep:\n                picked.append(i)\n\n        return picked\n"
  },
  {
    "path": "python/ncnn/model_zoo/shufflenetv2.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\n\n\nclass ShuffleNetV2:\n    def __init__(self, target_size=224, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = []\n        self.norm_vals = [1 / 255.0, 1 / 255.0, 1 / 255.0]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # https://github.com/miaow1988/ShuffleNet_V2_pytorch_caffe\n        # models can be downloaded from https://github.com/miaow1988/ShuffleNet_V2_pytorch_caffe/releases\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"shufflenet_v2_x0.5.param\"))\n        self.net.load_model(get_model_file(\"shufflenet_v2_x0.5.bin\"))\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"fc\")\n\n        # manually call softmax on the fc output\n        # convert result into probability\n        # skip if your model already has softmax operation\n        softmax = ncnn.create_layer(\"Softmax\")\n\n        pd = ncnn.ParamDict()\n        softmax.load_param(pd)\n\n        softmax.forward_inplace(mat_out, self.net.opt)\n\n        mat_out = mat_out.reshape(mat_out.w * mat_out.h * mat_out.c)\n\n        
cls_scores = np.array(mat_out)\n        return cls_scores\n"
  },
  {
    "path": "python/ncnn/model_zoo/simplepose.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import KeyPoint\n\n\nclass SimplePose:\n    def __init__(\n        self, target_width=192, target_height=256, num_threads=1, use_gpu=False\n    ):\n        self.target_width = target_width\n        self.target_height = target_height\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [0.485 * 255.0, 0.456 * 255.0, 0.406 * 255.0]\n        self.norm_vals = [1 / 0.229 / 255.0, 1 / 0.224 / 255.0, 1 / 0.225 / 255.0]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # the simple baseline human pose estimation from gluon-cv\n        # https://gluon-cv.mxnet.io/build/examples_pose/demo_simple_pose.html\n        # mxnet model exported via\n        #      pose_net.hybridize()\n        #      pose_net.export('pose')\n        # then mxnet2ncnn\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"pose.param\"))\n        self.net.load_model(get_model_file(\"pose.bin\"))\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        h = img.shape[0]\n        w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR2RGB,\n            img.shape[1],\n            img.shape[0],\n            self.target_width,\n            self.target_height,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"conv3_fwd\")\n\n        keypoints = []\n\n        for p in range(mat_out.c):\n            m = mat_out.channel(p)\n\n            max_prob = 0.0\n            max_x = 0\n     
       max_y = 0\n            for y in range(mat_out.h):\n                ptr = m.row(y)\n                for x in range(mat_out.w):\n                    prob = ptr[x]\n                    if prob > max_prob:\n                        max_prob = prob\n                        max_x = x\n                        max_y = y\n\n            keypoint = KeyPoint()\n            keypoint.p.x = max_x * w / float(mat_out.w)\n            keypoint.p.y = max_y * h / float(mat_out.h)\n            keypoint.prob = max_prob\n\n            keypoints.append(keypoint)\n\n        return keypoints\n"
  },
  {
    "path": "python/ncnn/model_zoo/squeezenet.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\n\n\nclass SqueezeNet:\n    def __init__(self, target_size=227, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [104.0, 117.0, 123.0]\n        self.norm_vals = []\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"squeezenet_v1.1.param\"))\n        self.net.load_model(get_model_file(\"squeezenet_v1.1.bin\"))\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"prob\")\n\n        # printf(\"%d %d %d\\n\", mat_out.w, mat_out.h, mat_out.c)\n\n        out = np.array(mat_out)\n        return out\n"
  },
  {
    "path": "python/ncnn/model_zoo/squeezenetssd.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass SqueezeNet_SSD:\n    def __init__(self, target_size=300, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [104.0, 117.0, 123.0]\n        self.norm_vals = []\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # original pretrained model from https://github.com/chuanqi305/SqueezeNet-SSD\n        # squeezenet_ssd_voc_deploy.prototxt\n        # https://drive.google.com/open?id=0B3gersZ2cHIxdGpyZlZnbEQ5Snc\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"squeezenet_ssd_voc.param\"))\n        self.net.load_model(get_model_file(\"squeezenet_ssd_voc.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n            \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n 
       mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"detection_out\")\n\n        objects = []\n\n        # printf(\"%d %d %d\\n\", mat_out.w, mat_out.h, mat_out.c)\n\n        # method 1, use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n\n            objects.append(obj)\n\n        \"\"\"\n        #method 2, use ncnn.Mat->numpy.array to get the result, no memory copy too\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n            objects.append(obj)\n        \"\"\"\n\n        return objects\n"
  },
  {
    "path": "python/ncnn/model_zoo/yolact.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nfrom math import sqrt\nimport numpy as np\nimport cv2\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.functional import sigmoid, nms\n\n\nclass Yolact:\n    def __init__(\n        self,\n        target_size=550,\n        confidence_threshold=0.05,\n        nms_threshold=0.5,\n        keep_top_k=200,\n        num_threads=1,\n        use_gpu=False,\n    ):\n        self.target_size = target_size\n        self.confidence_threshold = confidence_threshold\n        self.nms_threshold = nms_threshold\n        self.keep_top_k = keep_top_k\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [123.68, 116.78, 103.94]\n        self.norm_vals = [1.0 / 58.40, 1.0 / 57.12, 1.0 / 57.38]\n\n        self.net = ncnn.Net()\n        self.net.opt.use_vulkan_compute = self.use_gpu\n        self.net.opt.num_threads = self.num_threads\n\n        # original model converted from https://github.com/dbolya/yolact\n        # yolact_resnet50_54_800000.pth\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"yolact.param\"))\n        self.net.load_model(get_model_file(\"yolact.bin\"))\n\n        self.conv_ws = [69, 35, 18, 9, 5]\n        self.conv_hs = [69, 35, 18, 9, 5]\n        self.aspect_ratios = [1, 0.5, 2]\n        self.scales = [24, 48, 96, 192, 384]\n\n        self.priors = None\n        self.last_img_size = None\n\n        self.make_priors()\n\n        self.class_names = [\n            \"background\",\n            \"person\",\n            \"bicycle\",\n            \"car\",\n            \"motorcycle\",\n            \"airplane\",\n            \"bus\",\n            \"train\",\n            \"truck\",\n            \"boat\",\n            \"traffic light\",\n            \"fire hydrant\",\n            \"stop sign\",\n            \"parking meter\",\n            
\"bench\",\n            \"bird\",\n            \"cat\",\n            \"dog\",\n            \"horse\",\n            \"sheep\",\n            \"cow\",\n            \"elephant\",\n            \"bear\",\n            \"zebra\",\n            \"giraffe\",\n            \"backpack\",\n            \"umbrella\",\n            \"handbag\",\n            \"tie\",\n            \"suitcase\",\n            \"frisbee\",\n            \"skis\",\n            \"snowboard\",\n            \"sports ball\",\n            \"kite\",\n            \"baseball bat\",\n            \"baseball glove\",\n            \"skateboard\",\n            \"surfboard\",\n            \"tennis racket\",\n            \"bottle\",\n            \"wine glass\",\n            \"cup\",\n            \"fork\",\n            \"knife\",\n            \"spoon\",\n            \"bowl\",\n            \"banana\",\n            \"apple\",\n            \"sandwich\",\n            \"orange\",\n            \"broccoli\",\n            \"carrot\",\n            \"hot dog\",\n            \"pizza\",\n            \"donut\",\n            \"cake\",\n            \"chair\",\n            \"couch\",\n            \"potted plant\",\n            \"bed\",\n            \"dining table\",\n            \"toilet\",\n            \"tv\",\n            \"laptop\",\n            \"mouse\",\n            \"remote\",\n            \"keyboard\",\n            \"cell phone\",\n            \"microwave\",\n            \"oven\",\n            \"toaster\",\n            \"sink\",\n            \"refrigerator\",\n            \"book\",\n            \"clock\",\n            \"vase\",\n            \"scissors\",\n            \"teddy bear\",\n            \"hair drier\",\n            \"toothbrush\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR2RGB,\n            img_w,\n        
    img_h,\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n        ex.input(\"input.1\", mat_in)\n\n        ret1, proto_data = ex.extract(\"619\")  # 138x138 x 32\n        ret2, loc_data = ex.extract(\"816\")  # 4 x 19248\n        ret3, mask_data = ex.extract(\"818\")  # maskdim 32 x 19248\n        ret4, conf_data = ex.extract(\"820\")  # 81 x 19248\n\n        proto_data = np.array(proto_data)\n        loc_data = np.array(loc_data)\n        mask_data = np.array(mask_data)\n        conf_data = np.array(conf_data)\n        prior_data = self.make_priors()\n\n        # decoded_boxes = self.decode(loc_data, prior_data)\n        boxes, masks, classes, scores = self.detect(\n            conf_data, loc_data, prior_data, mask_data, img_w, img_h\n        )\n\n        # generate mask\n        masks = proto_data.transpose(1, 2, 0) @ masks.T\n        masks = sigmoid(masks)\n\n        # Scale masks up to the full image\n        masks = cv2.resize(masks, (img_w, img_h), interpolation=cv2.INTER_LINEAR)\n\n        # transpose into the correct output shape [num_dets, proto_h, proto_w]\n        masks = masks.transpose(2, 0, 1)\n\n        masks = masks > 0.5\n\n        return boxes, masks, classes, scores\n\n    def make_priors(self):\n        \"\"\" Note that priors are [x,y,width,height] where (x,y) is the center of the box. 
\"\"\"\n        if self.last_img_size != (self.target_size, self.target_size):\n            prior_data = []\n\n            for conv_w, conv_h, scale in zip(self.conv_ws, self.conv_hs, self.scales):\n                for i in range(conv_h):\n                    for j in range(conv_w):\n                        # +0.5 because priors are in center-size notation\n                        cx = (j + 0.5) / conv_w\n                        cy = (i + 0.5) / conv_h\n\n                        for ar in self.aspect_ratios:\n                            ar = sqrt(ar)\n\n                            w = scale * ar / self.target_size\n                            h = scale / ar / self.target_size\n\n                            # This is for backward compatibility with a bug where I made everything square by accident\n                            h = w\n\n                            prior_data += [cx, cy, w, h]\n\n            self.priors = np.array(prior_data).reshape(-1, 4)\n            self.last_img_size = (self.target_size, self.target_size)\n\n        return self.priors\n\n    def decode(self, loc, priors, img_w, img_h):\n        \"\"\"\n        Decode predicted bbox coordinates using the same scheme\n        employed by Yolov2: https://arxiv.org/pdf/1612.08242.pdf\n\n            b_x = (sigmoid(pred_x) - .5) / conv_w + prior_x\n            b_y = (sigmoid(pred_y) - .5) / conv_h + prior_y\n            b_w = prior_w * exp(loc_w)\n            b_h = prior_h * exp(loc_h)\n\n        Note that loc is inputed as [(s(x)-.5)/conv_w, (s(y)-.5)/conv_h, w, h]\n        while priors are inputed as [x, y, w, h] where each coordinate\n        is relative to size of the image (even sigmoid(x)). 
We do this\n        in the network by dividing by the 'cell size', which is just\n        the size of the convouts.\n\n        Also note that prior_x and prior_y are center coordinates which\n        is why we have to subtract .5 from sigmoid(pred_x and pred_y).\n\n        Args:\n            - loc:    The predicted bounding boxes of size [num_priors, 4]\n            - priors: The priorbox coords with size [num_priors, 4]\n\n        Returns: A tensor of decoded relative coordinates in point form\n                with size [num_priors, 4(x, y, w, h)]\n        \"\"\"\n\n        variances = [0.1, 0.2]\n\n        boxes = np.concatenate(\n            (\n                priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:],\n                priors[:, 2:] * np.exp(loc[:, 2:] * variances[1]),\n            ),\n            1,\n        )\n        boxes[:, :2] -= boxes[:, 2:] / 2\n        # boxes[:, 2:] += boxes[:, :2]\n\n        # crop (np.where returns a new array, so the result must be assigned back)\n        boxes[:, 0] = np.where(boxes[:, 0] < 0, 0, boxes[:, 0])\n        boxes[:, 1] = np.where(boxes[:, 1] < 0, 0, boxes[:, 1])\n        boxes[:, 2] = np.where(boxes[:, 2] > 1, 1, boxes[:, 2])\n        boxes[:, 3] = np.where(boxes[:, 3] > 1, 1, boxes[:, 3])\n\n        # decode to img size\n        boxes[:, 0] *= img_w\n        boxes[:, 1] *= img_h\n        boxes[:, 2] = boxes[:, 2] * img_w + 1\n        boxes[:, 3] = boxes[:, 3] * img_h + 1\n\n        return boxes\n\n    def detect(self, conf_preds, loc_data, prior_data, mask_data, img_w, img_h):\n        \"\"\" Perform nms for only the max scoring class that isn't background (class 0) \"\"\"\n        cur_scores = conf_preds[:, 1:]\n        num_class = cur_scores.shape[1]\n\n        classes = np.argmax(cur_scores, axis=1)\n        conf_scores = cur_scores[range(cur_scores.shape[0]), classes]\n\n        # filter by confidence_threshold\n        keep = conf_scores > self.confidence_threshold\n        conf_scores = conf_scores[keep]\n        classes = classes[keep]\n        loc_data = loc_data[keep, :]\n        prior_data = prior_data[keep, :]\n        masks = mask_data[keep, :]\n\n        # decode x, y, w, h\n        boxes = self.decode(loc_data, prior_data, img_w, img_h)\n\n        # nms for every class\n        boxes_result = []\n        masks_result = []\n        classes_result = []\n        conf_scores_result = []\n        for i in range(num_class):\n            # np.where returns a tuple; take the index array so the emptiness check works\n            where = np.where(classes == i)[0]\n            if len(where) == 0:\n                continue\n\n            boxes_tmp = boxes[where]\n            masks_tmp = masks[where]\n            classes_tmp = classes[where]\n            conf_scores_tmp = conf_scores[where]\n\n            score_mask = conf_scores_tmp > self.confidence_threshold\n            boxes_tmp = boxes_tmp[score_mask]\n            masks_tmp = masks_tmp[score_mask]\n            classes_tmp = classes_tmp[score_mask]\n            conf_scores_tmp = conf_scores_tmp[score_mask]\n\n            indexes = nms(\n                boxes_tmp,\n                conf_scores_tmp,\n                iou_threshold=self.nms_threshold,\n                top_k=self.keep_top_k,\n            )\n\n            for index in indexes:\n                boxes_result.append(boxes_tmp[index])\n                masks_result.append(masks_tmp[index])\n                classes_result.append(classes_tmp[index] + 1)\n                conf_scores_result.append(conf_scores_tmp[index])\n\n        # keep top k\n        if len(conf_scores_result) > self.keep_top_k:\n            # sort by descending score; the results are Python lists, so index them one by one\n            indexes = np.argsort(conf_scores_result)[::-1][: self.keep_top_k]\n\n            boxes_result = [boxes_result[i] for i in indexes]\n            masks_result = [masks_result[i] for i in indexes]\n            classes_result = [classes_result[i] for i in indexes]\n            conf_scores_result = [conf_scores_result[i] for i in indexes]\n\n        return (\n            np.array(boxes_result),\n            np.array(masks_result),\n            np.array(classes_result),\n            np.array(conf_scores_result),\n        )\n"
  },
  {
    "path": "python/ncnn/model_zoo/yolov2.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass MobileNet_YoloV2:\n    def __init__(self, target_size=416, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [1.0, 1.0, 1.0]\n        self.norm_vals = [0.007843, 0.007843, 0.007843]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # original pretrained model from https://github.com/eric612/MobileNet-YOLO\n        # https://github.com/eric612/MobileNet-YOLO/blob/master/models/yolov2/mobilenet_yolo_deploy.prototxt\n        # https://github.com/eric612/MobileNet-YOLO/blob/master/models/yolov2/mobilenet_yolo_deploy_iter_80000.caffemodel\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"mobilenet_yolo.param\"))\n        self.net.load_model(get_model_file(\"mobilenet_yolo.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n            \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n 
           img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize([], self.norm_vals)\n        mat_in.substract_mean_normalize(self.mean_vals, [])\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"detection_out\")\n\n        objects = []\n\n        # printf(\"%d %d %d\\n\", mat_out.w, mat_out.h, mat_out.c)\n\n        # method 1, use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n\n            objects.append(obj)\n\n        \"\"\"\n        #method 2, use ncnn.Mat->numpy.array to get the result, no memory copy too\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n            objects.append(obj)\n        \"\"\"\n\n        return objects\n"
  },
  {
    "path": "python/ncnn/model_zoo/yolov3.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass MobileNetV2_YoloV3:\n    def __init__(self, target_size=352, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = [127.5, 127.5, 127.5]\n        self.norm_vals = [0.007843, 0.007843, 0.007843]\n\n        self.net = ncnn.Net()\n        self.net.opt.num_threads = self.num_threads\n        self.net.opt.use_vulkan_compute = self.use_gpu\n\n        # original pretrained model from https://github.com/eric612/MobileNet-YOLO\n        # param : https://drive.google.com/open?id=1V9oKHP6G6XvXZqhZbzNKL6FI_clRWdC-\n        # bin : https://drive.google.com/open?id=1DBcuFCr-856z3FRQznWL_S5h-Aj3RawA\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"mobilenetv2_yolov3.param\"))\n        self.net.load_model(get_model_file(\"mobilenetv2_yolov3.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"aeroplane\",\n            \"bicycle\",\n            \"bird\",\n            \"boat\",\n            \"bottle\",\n            \"bus\",\n            \"car\",\n            \"cat\",\n            \"chair\",\n            \"cow\",\n            \"diningtable\",\n            \"dog\",\n            \"horse\",\n            \"motorbike\",\n            \"person\",\n            \"pottedplant\",\n            \"sheep\",\n            \"sofa\",\n            \"train\",\n            \"tvmonitor\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR,\n            img.shape[1],\n            
img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"detection_out\")\n\n        objects = []\n\n        # printf(\"%d %d %d\\n\", mat_out.w, mat_out.h, mat_out.c)\n\n        # method 1, use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n\n            objects.append(obj)\n\n        \"\"\"\n        #method 2, use ncnn.Mat->numpy.array to get the result, no memory copy too\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.x = values[2] * img_w\n            obj.y = values[3] * img_h\n            obj.w = values[4] * img_w - obj.x\n            obj.h = values[5] * img_h - obj.y\n            objects.append(obj)\n        \"\"\"\n\n        return objects\n"
  },
  {
    "path": "python/ncnn/model_zoo/yolov4.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\n\n\nclass YoloV4_Base:\n    def __init__(self, tiny, target_size, num_threads=1, use_gpu=False):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = []\n        self.norm_vals = [1 / 255.0, 1 / 255.0, 1 / 255.0]\n\n        self.net = ncnn.Net()\n        self.net.opt.use_vulkan_compute = self.use_gpu\n        self.net.opt.num_threads = self.num_threads\n\n        # original pretrained model from https://github.com/AlexeyAB/darknet\n        # the ncnn model https://drive.google.com/drive/folders/1YzILvh0SKQPS_lrb33dmGNq7aVTKPWS0?usp=sharing\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        if tiny == True:\n            self.net.load_param(get_model_file(\"yolov4-tiny-opt.param\"))\n            self.net.load_model(get_model_file(\"yolov4-tiny-opt.bin\"))\n        else:\n            self.net.load_param(get_model_file(\"yolov4-opt.param\"))\n            self.net.load_model(get_model_file(\"yolov4-opt.bin\"))\n\n        self.class_names = [\n            \"background\",\n            \"person\",\n            \"bicycle\",\n            \"car\",\n            \"motorbike\",\n            \"aeroplane\",\n            \"bus\",\n            \"train\",\n            \"truck\",\n            \"boat\",\n            \"traffic light\",\n            \"fire hydrant\",\n            \"stop sign\",\n            \"parking meter\",\n            \"bench\",\n            \"bird\",\n            \"cat\",\n            \"dog\",\n            \"horse\",\n            \"sheep\",\n            \"cow\",\n            \"elephant\",\n            \"bear\",\n            \"zebra\",\n            \"giraffe\",\n            \"backpack\",\n            \"umbrella\",\n            \"handbag\",\n            \"tie\",\n  
          \"suitcase\",\n            \"frisbee\",\n            \"skis\",\n            \"snowboard\",\n            \"sports ball\",\n            \"kite\",\n            \"baseball bat\",\n            \"baseball glove\",\n            \"skateboard\",\n            \"surfboard\",\n            \"tennis racket\",\n            \"bottle\",\n            \"wine glass\",\n            \"cup\",\n            \"fork\",\n            \"knife\",\n            \"spoon\",\n            \"bowl\",\n            \"banana\",\n            \"apple\",\n            \"sandwich\",\n            \"orange\",\n            \"broccoli\",\n            \"carrot\",\n            \"hot dog\",\n            \"pizza\",\n            \"donut\",\n            \"cake\",\n            \"chair\",\n            \"sofa\",\n            \"pottedplant\",\n            \"bed\",\n            \"diningtable\",\n            \"toilet\",\n            \"tvmonitor\",\n            \"laptop\",\n            \"mouse\",\n            \"remote\",\n            \"keyboard\",\n            \"cell phone\",\n            \"microwave\",\n            \"oven\",\n            \"toaster\",\n            \"sink\",\n            \"refrigerator\",\n            \"book\",\n            \"clock\",\n            \"vase\",\n            \"scissors\",\n            \"teddy bear\",\n            \"hair drier\",\n            \"toothbrush\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR2RGB,\n            img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n        ex.input(\"data\", mat_in)\n\n        ret, mat_out = ex.extract(\"output\")\n\n        objects = []\n\n        # method 1, 
use ncnn.Mat.row to get the result, no memory copy\n        for i in range(mat_out.h):\n            values = mat_out.row(i)\n\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.rect.x = values[2] * img_w\n            obj.rect.y = values[3] * img_h\n            obj.rect.w = values[4] * img_w - obj.rect.x\n            obj.rect.h = values[5] * img_h - obj.rect.y\n\n            objects.append(obj)\n\n        \"\"\"\n        #method 2, use ncnn.Mat->numpy.array to get the result, no memory copy too\n        out = np.array(mat_out)\n        for i in range(len(out)):\n            values = out[i]\n            obj = Detect_Object()\n            obj.label = values[0]\n            obj.prob = values[1]\n            obj.x = values[2] * img_w\n            obj.y = values[3] * img_h\n            obj.w = values[4] * img_w - obj.x\n            obj.h = values[5] * img_h - obj.y\n            objects.append(obj)\n        \"\"\"\n\n        return objects\n\n\nclass YoloV4_Tiny(YoloV4_Base):\n    def __init__(self, **kwargs):\n        super(YoloV4_Tiny, self).__init__(True, 416, **kwargs)\n\n\nclass YoloV4(YoloV4_Base):\n    def __init__(self, **kwargs):\n        super(YoloV4, self).__init__(False, 608, **kwargs)\n"
  },
  {
    "path": "python/ncnn/model_zoo/yolov5.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport time\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\nfrom ..utils.functional import *\n\n\nclass YoloV5Focus(ncnn.Layer):\n    yolov5FocusLayers = []\n\n    def __init__(self):\n        ncnn.Layer.__init__(self)\n        self.one_blob_only = True\n\n        self.yolov5FocusLayers.append(self)\n\n    def forward(self, bottom_blob, top_blob, opt):\n        x = np.array(bottom_blob)\n        x = np.concatenate(\n            [\n                x[..., ::2, ::2],\n                x[..., 1::2, ::2],\n                x[..., ::2, 1::2],\n                x[..., 1::2, 1::2],\n            ]\n        )\n\n        top_blob.clone_from(ncnn.Mat(x), opt.blob_allocator)\n        if top_blob.empty():\n            return -100\n\n        return 0\n\n\ndef YoloV5Focus_layer_creator():\n    return YoloV5Focus()\n\n\ndef YoloV5Focus_layer_destroyer(layer):\n    for i in range(len(YoloV5Focus.yolov5FocusLayers)):\n        if YoloV5Focus.yolov5FocusLayers[i] == layer:\n            del YoloV5Focus.yolov5FocusLayers[i]\n            break\n\n\nclass YoloV5s:\n    def __init__(\n        self,\n        target_size=640,\n        prob_threshold=0.25,\n        nms_threshold=0.45,\n        num_threads=1,\n        use_gpu=False,\n    ):\n        self.target_size = target_size\n        self.prob_threshold = prob_threshold\n        self.nms_threshold = nms_threshold\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.mean_vals = []\n        self.norm_vals = [1 / 255.0, 1 / 255.0, 1 / 255.0]\n\n        self.net = ncnn.Net()\n        self.net.opt.use_vulkan_compute = self.use_gpu\n        self.net.opt.num_threads = self.num_threads\n\n        self.net.register_custom_layer(\n            \"YoloV5Focus\", YoloV5Focus_layer_creator, YoloV5Focus_layer_destroyer\n        )\n\n        # original pretrained model 
from https://github.com/ultralytics/yolov5\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"yolov5s.param\"))\n        self.net.load_model(get_model_file(\"yolov5s.bin\"))\n\n        self.grid = [make_grid(10, 6), make_grid(20, 12), make_grid(40, 24)]\n        self.stride = np.array([32, 16, 8])\n        self.anchor_grid = np.array(\n            [\n                [116, 90, 156, 198, 373, 326],\n                [30, 61, 62, 45, 59, 119],\n                [10, 13, 16, 30, 33, 23],\n            ]\n        ).reshape((3, 1, 3, 1, 1, 2))\n\n        self.class_names = [\n            \"person\",\n            \"bicycle\",\n            \"car\",\n            \"motorcycle\",\n            \"airplane\",\n            \"bus\",\n            \"train\",\n            \"truck\",\n            \"boat\",\n            \"traffic light\",\n            \"fire hydrant\",\n            \"stop sign\",\n            \"parking meter\",\n            \"bench\",\n            \"bird\",\n            \"cat\",\n            \"dog\",\n            \"horse\",\n            \"sheep\",\n            \"cow\",\n            \"elephant\",\n            \"bear\",\n            \"zebra\",\n            \"giraffe\",\n            \"backpack\",\n            \"umbrella\",\n            \"handbag\",\n            \"tie\",\n            \"suitcase\",\n            \"frisbee\",\n            \"skis\",\n            \"snowboard\",\n            \"sports ball\",\n            \"kite\",\n            \"baseball bat\",\n            \"baseball glove\",\n            \"skateboard\",\n            \"surfboard\",\n            \"tennis racket\",\n            \"bottle\",\n            \"wine glass\",\n            \"cup\",\n            \"fork\",\n            \"knife\",\n            \"spoon\",\n            \"bowl\",\n            \"banana\",\n            \"apple\",\n            \"sandwich\",\n            \"orange\",\n            \"broccoli\",\n            \"carrot\",\n        
    \"hot dog\",\n            \"pizza\",\n            \"donut\",\n            \"cake\",\n            \"chair\",\n            \"couch\",\n            \"potted plant\",\n            \"bed\",\n            \"dining table\",\n            \"toilet\",\n            \"tv\",\n            \"laptop\",\n            \"mouse\",\n            \"remote\",\n            \"keyboard\",\n            \"cell phone\",\n            \"microwave\",\n            \"oven\",\n            \"toaster\",\n            \"sink\",\n            \"refrigerator\",\n            \"book\",\n            \"clock\",\n            \"vase\",\n            \"scissors\",\n            \"teddy bear\",\n            \"hair drier\",\n            \"toothbrush\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_w = img.shape[1]\n        img_h = img.shape[0]\n\n        w = img_w\n        h = img_h\n        scale = 1.0\n        if w > h:\n            scale = float(self.target_size) / w\n            w = self.target_size\n            h = int(h * scale)\n        else:\n            scale = float(self.target_size) / h\n            h = self.target_size\n            w = int(w * scale)\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img, ncnn.Mat.PixelType.PIXEL_BGR2RGB, img_w, img_h, w, h\n        )\n        # pad to target_size rectangle\n        # yolov5/utils/datasets.py letterbox\n        wpad = (w + 31) // 32 * 32 - w\n        hpad = (h + 31) // 32 * 32 - h\n        mat_in_pad = ncnn.copy_make_border(\n            mat_in,\n            hpad // 2,\n            hpad - hpad // 2,\n            wpad // 2,\n            wpad - wpad // 2,\n            ncnn.BorderType.BORDER_CONSTANT,\n            114.0,\n        )\n\n        mat_in_pad.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n        ex.input(\"images\", mat_in_pad)\n\n        # anchor setting from yolov5/models/yolov5s.yaml\n        ret1, mat_out1 = 
ex.extract(\"output\")  # stride 8\n        ret2, mat_out2 = ex.extract(\"781\")  # stride 16\n        ret3, mat_out3 = ex.extract(\"801\")  # stride 32\n\n        pred = [np.array(mat_out3), np.array(mat_out2), np.array(mat_out1)]\n        z = []\n        for i in range(len(pred)):\n            num_grid = pred[i].shape[1]\n            if mat_in_pad.w > mat_in_pad.h:\n                num_grid_x = mat_in_pad.w // self.stride[i]\n                num_grid_y = num_grid // num_grid_x\n            else:\n                num_grid_y = mat_in_pad.h // self.stride[i]\n                num_grid_x = num_grid // num_grid_y\n            if (\n                self.grid[i].shape[0] != num_grid_x\n                or self.grid[i].shape[1] != num_grid_y\n            ):\n                self.grid[i] = make_grid(num_grid_x, num_grid_y)\n\n            y = sigmoid(pred[i])\n            y = y.reshape(pred[i].shape[0], num_grid_y, num_grid_x, pred[i].shape[2])\n            y[..., 0:2] = (y[..., 0:2] * 2.0 - 0.5 + self.grid[i]) * self.stride[\n                i\n            ]  # xy\n            y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n            z.append(y.reshape(1, -1, y.shape[-1]))\n        pred = np.concatenate(z, 1)\n\n        result = self.non_max_suppression(\n            pred, self.prob_threshold, self.nms_threshold\n        )[0]\n\n        objects = [\n            Detect_Object(\n                obj[5],\n                obj[4],\n                (obj[0] - (wpad / 2)) / scale,\n                (obj[1] - (hpad / 2)) / scale,\n                (obj[2] - obj[0]) / scale,\n                (obj[3] - obj[1]) / scale,\n            )\n            for obj in result\n        ]\n\n        return objects\n\n    def non_max_suppression(\n        self,\n        prediction,\n        conf_thres=0.1,\n        iou_thres=0.6,\n        merge=False,\n        classes=None,\n        agnostic=False,\n    ):\n        \"\"\"Performs Non-Maximum Suppression (NMS) on inference 
results\n\n        Returns:\n            detections with shape: nx6 (x1, y1, x2, y2, conf, cls)\n        \"\"\"\n        nc = prediction[0].shape[1] - 5  # number of classes\n        xc = prediction[..., 4] > conf_thres  # candidates\n\n        # Settings\n        min_wh, max_wh = 2, 4096  # (pixels) minimum and maximum box width and height\n        max_det = 300  # maximum number of detections per image\n        time_limit = 10.0  # seconds to quit after\n        redundant = True  # require redundant detections\n        multi_label = nc > 1  # multiple labels per box (adds 0.5ms/img)\n\n        t = time.time()\n        output = [None] * prediction.shape[0]\n        for xi, x in enumerate(prediction):  # image index, image inference\n            # Apply constraints\n            # x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height\n            x = x[xc[xi]]  # confidence\n\n            # If none remain process next image\n            if not x.shape[0]:\n                continue\n\n            # Compute conf\n            x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf\n\n            # Box (center x, center y, width, height) to (x1, y1, x2, y2)\n            box = xywh2xyxy(x[:, :4])\n\n            # Detections matrix nx6 (xyxy, conf, cls)\n            if multi_label:\n                i, j = (x[:, 5:] > conf_thres).nonzero()\n                x = np.concatenate(\n                    (box[i], x[i, j + 5, None], j[:, None].astype(np.float32)), axis=1\n                )\n            else:  # best class only (NumPy: max and argmax are separate calls)\n                conf = x[:, 5:].max(1, keepdims=True)\n                j = x[:, 5:].argmax(1)[:, None]\n                x = np.concatenate((box, conf, j.astype(np.float32)), axis=1)[\n                    conf.reshape(-1) > conf_thres\n                ]\n\n            # Filter by class\n            if classes:\n                x = x[(x[:, 5:6] == np.array(classes)).any(1)]\n\n            # Apply finite constraint\n            # if not torch.isfinite(x).all():\n            #     x = 
x[torch.isfinite(x).all(1)]\n\n            # If none remain process next image\n            n = x.shape[0]  # number of boxes\n            if not n:\n                continue\n\n            # Sort by confidence\n            # x = x[x[:, 4].argsort(descending=True)]\n\n            # Batched NMS\n            c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes\n            boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores\n            i = nms(boxes, scores, iou_threshold=iou_thres)\n            if len(i) > max_det:  # limit detections\n                i = i[:max_det]\n            if merge and (1 < n < 3e3):  # Merge NMS (boxes merged using weighted mean)\n                try:  # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)\n                    iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix\n                    weights = iou * scores[None]  # box weights\n                    x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(\n                        1, keepdim=True\n                    )  # merged boxes\n                    if redundant:\n                        i = i[iou.sum(1) > 1]  # require redundancy\n                except:  # possible CUDA error https://github.com/ultralytics/yolov3/issues/1139\n                    print(x, i, x.shape, i.shape)\n                    pass\n\n            output[xi] = x[i]\n            if (time.time() - t) > time_limit:\n                break  # time limit exceeded\n\n        return output\n"
  },
  {
    "path": "python/ncnn/model_zoo/yolov7.py",
    "content": "# Copyright 2020 Tencent\n# Copyright 2023 Kenny Bradley\n# SPDX-License-Identifier: BSD-3-Clause\n\n# Ported yolov7-tiny to python based on:\n#   - https://github.com/Qengineering/YoloV7-ncnn-Raspberry-Pi-4/blob/main/yolo.cpp\n# Format based on the ncnn yolov4 implementation\n\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\nimport numpy as np\n\n\n#def sigmoid_binned(val)\n#   this could use a much faster binned lookup table instead of np.exp and floating division\n\ndef sigmoid(val):\n    return 1.0 / (1.0 + np.exp(-val))\n\n#IOU functions:\n#find the overlap width given ([x1,x2], [x3,x4]) or ([y1,y2], [y3,y4])\ndef calcOverlap(r1, r2):\n    #r1 contains r2\n    if r1[0] <= r2[0] and r1[1] >= r2[1]:\n        return r2[1] - r2[0]\n    #r2 contains r1\n    elif r1[0] >= r2[0] and r1[1] <= r2[1]:\n        return r1[1] - r1[0]\n    #r1.1 is between r2.0 and r2.1\n    elif r1[0] <= r2[0] and r1[1] >= r2[0]: # r1[1] <= r2[1] is true since the first if failed\n        return r1[1] - r2[0]\n    #r1.0 is between r2.0 and r2.1\n    elif r1[0] >= r2[0] and r1[0] <= r2[1]: # r1[1] >= r2[1] is true since the second if failed\n        return r2[1] - r1[0]\n    else:\n        return 0\n\n#find X and Y overlaps and return intersection area\ndef calcIntersection(r1 : Detect_Object, r2 : Detect_Object):\n    xOverlap = calcOverlap([r1.rect.x, r1.rect.x+r1.rect.w], [r2.rect.x, r2.rect.x+r2.rect.w])\n    yOverlap = calcOverlap([r1.rect.y, r1.rect.y+r1.rect.h], [r2.rect.y, r2.rect.y+r2.rect.h])\n    return xOverlap*yOverlap\n\n\n#with r = [X1,X2,Y1,Y2] as the format return the IOU\ndef IOU(r1 : Detect_Object, r2 : Detect_Object):\n    intersection = calcIntersection(r1,r2)\n    #union =        r1 area       +        r2 area        - duplicate area\n    union = (r1.rect.w*r1.rect.h) + (r2.rect.w*r2.rect.h) - intersection\n    if union == 0:\n        return 0\n    else:\n        return 
intersection/union\n\n#NMS\n#detections are pre-sorted in ascending confidence order\n#detections are a list of Detect_Objects with : label, prob, rect\ndef NMS(detections, iou_thresh=0.45):\n    cleanDetections = []\n    detByClasses = {}\n    #group by class\n    for det in detections:\n        #det.label is the class\n        if det.label not in detByClasses.keys():\n            detByClasses[det.label] = []\n        detByClasses[det.label].append(det)\n\n    #for each class find the values to keep\n    for key, dets in detByClasses.items():\n        for i in range(len(dets)):\n            keep = 1\n            #keep unless a higher priority det has IOU > thresh\n            for j in range(i+1,len(dets)):\n                iou = IOU(dets[i], dets[j])\n                if iou > iou_thresh:\n                    keep = 0\n                    break\n            if keep:\n                cleanDetections.append(dets[i])\n\n    #return cleaner list of Detect_Object values\n    return cleanDetections\n    \nclass YoloV7_Base:\n    def __init__(self, target_size, num_threads=1, use_gpu=False, use_strides=[8,16,32]):\n        self.target_size = target_size\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n        self.use_strides = use_strides\n\n        self.mean_vals = []\n        self.norm_vals = [1 / 255.0, 1 / 255.0, 1 / 255.0]\n\n        self.net = ncnn.Net()\n        self.net.opt.use_vulkan_compute = self.use_gpu\n        self.net.opt.num_threads = self.num_threads\n\n        # original pretrained model from https://github.com/AlexeyAB/darknet\n        # the ncnn model https://drive.google.com/drive/folders/1YzILvh0SKQPS_lrb33dmGNq7aVTKPWS0?usp=sharing\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"yolov7-tiny.param\"))\n        self.net.load_model(get_model_file(\"yolov7-tiny.bin\"))\n\n        self.class_names = [\n            \"person\",\n            
\"bicycle\",\n            \"car\",\n            \"motorbike\",\n            \"aeroplane\",\n            \"bus\",\n            \"train\",\n            \"truck\",\n            \"boat\",\n            \"traffic light\",\n            \"fire hydrant\",\n            \"stop sign\",\n            \"parking meter\",\n            \"bench\",\n            \"bird\",\n            \"cat\",\n            \"dog\",\n            \"horse\",\n            \"sheep\",\n            \"cow\",\n            \"elephant\",\n            \"bear\",\n            \"zebra\",\n            \"giraffe\",\n            \"backpack\",\n            \"umbrella\",\n            \"handbag\",\n            \"tie\",\n            \"suitcase\",\n            \"frisbee\",\n            \"skis\",\n            \"snowboard\",\n            \"sports ball\",\n            \"kite\",\n            \"baseball bat\",\n            \"baseball glove\",\n            \"skateboard\",\n            \"surfboard\",\n            \"tennis racket\",\n            \"bottle\",\n            \"wine glass\",\n            \"cup\",\n            \"fork\",\n            \"knife\",\n            \"spoon\",\n            \"bowl\",\n            \"banana\",\n            \"apple\",\n            \"sandwich\",\n            \"orange\",\n            \"broccoli\",\n            \"carrot\",\n            \"hot dog\",\n            \"pizza\",\n            \"donut\",\n            \"cake\",\n            \"chair\",\n            \"sofa\",\n            \"pottedplant\",\n            \"bed\",\n            \"diningtable\",\n            \"toilet\",\n            \"tvmonitor\",\n            \"laptop\",\n            \"mouse\",\n            \"remote\",\n            \"keyboard\",\n            \"cell phone\",\n            \"microwave\",\n            \"oven\",\n            \"toaster\",\n            \"sink\",\n            \"refrigerator\",\n            \"book\",\n            \"clock\",\n            \"vase\",\n            \"scissors\",\n            \"teddy bear\",\n            \"hair drier\",\n 
           \"toothbrush\"\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n\n        img_h = img.shape[0]\n        img_w = img.shape[1]\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img,\n            ncnn.Mat.PixelType.PIXEL_BGR2RGB,\n            img.shape[1],\n            img.shape[0],\n            self.target_size,\n            self.target_size,\n        )\n        mat_in.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n        ex.input(\"images\", mat_in)\n\n        outValues = []\n        if 8 in self.use_strides:\n            ret8, out8 = ex.extract(\"output\")\n            outValues.append(out8)\n        else:\n            outValues.append(None)\n\n        if 16 in self.use_strides:\n            ret16, out16 = ex.extract(\"288\")\n            outValues.append(out16)\n        else:\n            outValues.append(None)\n\n        if 32 in self.use_strides:\n            ret32, out32 = ex.extract(\"302\")\n            outValues.append(out32)\n        else:\n            outValues.append(None)\n\n        #           P3/8,                  P4/16,                  P5/32\n        anchors = [[12,16, 19,36, 40,28], [36,75, 76,55, 72,146], [142,110, 192,243, 459,401]]\n        strides = [8,16,32]\n\n        objects = []\n        #pre-sigmoid value at which sigmoid gives 0.25, the confidence threshold\n        threshNonSigmoid = -1.098612\n        for strideCount, mat_out in enumerate(outValues):\n            if mat_out is None:\n                continue\n\n            stride = strides[strideCount]\n            for c in range(3):\n              mat = mat_out.channel(c)\n\n              #yolo should always be square, it is expected to be 52x52\n              #    but sqrt() guarantees the correct size for side\n              side = int(np.sqrt(mat.h))\n\n              anchorW = anchors[strideCount][c*2]\n              anchorH = 
anchors[strideCount][c*2+1]\n              index = 0\n              for i in range(side):\n                  for j in range(side):\n\n                      #values 5-84 are class data\n                      classData=mat.row(index)[5:]\n                      maxLabel = max(classData)\n                      \n                      #optimization\n                      #if either the objectness or max class score resolve to < 0.25 we can skip this\n                      #  but the values are pre-sigmoid so compare to threshNonSigmoid.\n                      #  1 / (1+e^(-1.098612)) = 0.25 so just compare to the -1.098612 threshold\n                      if mat.row(index)[4] < threshNonSigmoid or maxLabel < threshNonSigmoid:\n                          index += 1\n                          continue\n\n                      #values 0-3 are coordinate data\n                      locData = mat.row(index)[0:4]\n                      #value 4 is the box confidence score\n                      box_score = sigmoid(mat.row(index)[4])\n                      #get the highest scoring class for this detection to multiply by the box_score\n                      label = np.argmax(classData)\n                      class_score = sigmoid(mat.row(index)[label+5])\n\n                      conf = box_score * class_score\n                      if conf > 0.25:\n                          obj = Detect_Object()\n                          obj.label = self.class_names[label]\n                          obj.prob = conf\n                          #convert from raw yolo output to W,H and X,Y\n                          obj.rect.w = ((sigmoid(locData[2]) *2) ** 2) * anchorW\n                          obj.rect.h = ((sigmoid(locData[3]) *2) ** 2) * anchorH\n                          obj.rect.x = ((sigmoid(locData[0]) * 2) - 0.5 + j) * stride - (obj.rect.w/2)\n                          obj.rect.y = ((sigmoid(locData[1]) * 2) - 0.5 + i) * stride - (obj.rect.h/2)\n                          
objects.append(obj)\n\n                      index +=1\n\n        #sort based on probability in ascending order\n        objects.sort(key = lambda x: x.prob)\n        filtered_objects = NMS(objects)\n\n        #rescale to input image size\n        XscaleAdj = img_w / self.target_size\n        YscaleAdj = img_h / self.target_size\n        for count in range(len(filtered_objects)):\n            filtered_objects[count].rect.x *= XscaleAdj\n            filtered_objects[count].rect.w *= XscaleAdj\n            filtered_objects[count].rect.y *= YscaleAdj\n            filtered_objects[count].rect.h *= YscaleAdj\n\n        return filtered_objects\n\n\n\nclass YoloV7_Tiny(YoloV7_Base):\n    def __init__(self, **kwargs):\n        super(YoloV7_Tiny, self).__init__(416, **kwargs)\n"
  },
  {
    "path": "python/ncnn/model_zoo/yolov8.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport time\nimport numpy as np\nimport ncnn\nfrom .model_store import get_model_file\nfrom ..utils.objects import Detect_Object\nfrom ..utils.functional import *\nfrom typing import Iterable\n\nclass YoloV8s:\n    def __init__(\n        self,\n        target_size=640,\n        prob_threshold=0.25,\n        nms_threshold=0.45,\n        num_threads=1,\n        use_gpu=False,\n    ):\n        self.target_size = target_size\n        self.prob_threshold = prob_threshold\n        self.nms_threshold = nms_threshold\n        self.num_threads = num_threads\n        self.use_gpu = use_gpu\n\n        self.reg_max = 16\n        self.mean_vals = []\n        self.norm_vals = [1 / 255.0, 1 / 255.0, 1 / 255.0]\n\n        self.net = ncnn.Net()\n        self.net.opt.use_vulkan_compute = self.use_gpu\n        self.net.opt.num_threads = self.num_threads\n\n        # original pretrained model from https://github.com/ultralytics/ultralytics\n        # the ncnn model https://github.com/nihui/ncnn-assets/tree/master/models\n        self.net.load_param(get_model_file(\"yolov8s.param\"))\n        self.net.load_model(get_model_file(\"yolov8s.bin\"))\n\n        self.grid = [make_grid(20, 20), make_grid(40, 40), make_grid(80, 80)]\n        self.stride = np.array([32, 16, 8])\n\n        self.class_names = [\n            \"person\",\n            \"bicycle\",\n            \"car\",\n            \"motorcycle\",\n            \"airplane\",\n            \"bus\",\n            \"train\",\n            \"truck\",\n            \"boat\",\n            \"traffic light\",\n            \"fire hydrant\",\n            \"stop sign\",\n            \"parking meter\",\n            \"bench\",\n            \"bird\",\n            \"cat\",\n            \"dog\",\n            \"horse\",\n            \"sheep\",\n            \"cow\",\n            \"elephant\",\n            \"bear\",\n            \"zebra\",\n            \"giraffe\",\n        
    \"backpack\",\n            \"umbrella\",\n            \"handbag\",\n            \"tie\",\n            \"suitcase\",\n            \"frisbee\",\n            \"skis\",\n            \"snowboard\",\n            \"sports ball\",\n            \"kite\",\n            \"baseball bat\",\n            \"baseball glove\",\n            \"skateboard\",\n            \"surfboard\",\n            \"tennis racket\",\n            \"bottle\",\n            \"wine glass\",\n            \"cup\",\n            \"fork\",\n            \"knife\",\n            \"spoon\",\n            \"bowl\",\n            \"banana\",\n            \"apple\",\n            \"sandwich\",\n            \"orange\",\n            \"broccoli\",\n            \"carrot\",\n            \"hot dog\",\n            \"pizza\",\n            \"donut\",\n            \"cake\",\n            \"chair\",\n            \"couch\",\n            \"potted plant\",\n            \"bed\",\n            \"dining table\",\n            \"toilet\",\n            \"tv\",\n            \"laptop\",\n            \"mouse\",\n            \"remote\",\n            \"keyboard\",\n            \"cell phone\",\n            \"microwave\",\n            \"oven\",\n            \"toaster\",\n            \"sink\",\n            \"refrigerator\",\n            \"book\",\n            \"clock\",\n            \"vase\",\n            \"scissors\",\n            \"teddy bear\",\n            \"hair drier\",\n            \"toothbrush\",\n        ]\n\n    def __del__(self):\n        self.net = None\n\n    def __call__(self, img):\n        img_w = img.shape[1]\n        img_h = img.shape[0]\n\n        w = img_w\n        h = img_h\n        scale = 1.0\n        if w > h:\n            scale = float(self.target_size) / w\n            w = self.target_size\n            h = int(h * scale)\n        else:\n            scale = float(self.target_size) / h\n            h = self.target_size\n            w = int(w * scale)\n\n        mat_in = ncnn.Mat.from_pixels_resize(\n            img, 
ncnn.Mat.PixelType.PIXEL_BGR2RGB, img_w, img_h, w, h\n        )\n        # pad to target_size rectangle\n        # yolov5/utils/datasets.py letterbox\n        wpad = (w + 31) // 32 * 32 - w\n        hpad = (h + 31) // 32 * 32 - h\n        mat_in_pad = ncnn.copy_make_border(\n            mat_in,\n            hpad // 2,\n            hpad - hpad // 2,\n            wpad // 2,\n            wpad - wpad // 2,\n            ncnn.BorderType.BORDER_CONSTANT,\n            114.0,\n        )\n\n        mat_in_pad.substract_mean_normalize(self.mean_vals, self.norm_vals)\n\n        ex = self.net.create_extractor()\n        ex.input(\"in0\", mat_in_pad)\n\n        ret1, mat_out1 = ex.extract(\"out0\")  # stride 8\n        ret2, mat_out2 = ex.extract(\"out1\")  # stride 16\n        ret3, mat_out3 = ex.extract(\"out2\")  # stride 32\n\n        pred = [np.array(mat_out3), np.array(mat_out2), np.array(mat_out1)]\n        z = []\n        for i in range(len(pred)):\n            num_grid_x = mat_in_pad.w // self.stride[i]\n            num_grid_y = mat_in_pad.h // self.stride[i]\n            if (\n                    self.grid[i].shape[1] != num_grid_y\n                    or self.grid[i].shape[2] != num_grid_x\n            ):\n                self.grid[i] = make_grid(num_grid_x, num_grid_y)\n            cls, box = np.split(pred[i].transpose((1, 2, 0)), [len(self.class_names), ], -1)\n            box = softmax(box.reshape(-1, self.reg_max))\n            box = box.reshape(num_grid_y, num_grid_x, 4, self.reg_max)\n            box = box @ np.arange(0, self.reg_max, dtype=np.float32)\n            cls = sigmoid(cls)\n            conf = cls.max(-1, keepdims=True)\n            x1y1 = (self.grid[i][0] + 0.5 - box[..., :2]) * self.stride[i]\n            x2y2 = (self.grid[i][0] + 0.5 + box[..., 2:]) * self.stride[i]\n            res = np.concatenate([x1y1, x2y2, conf, cls], -1)\n            z.append(res.reshape((1, -1, len(self.class_names) + 5)))\n        pred = np.concatenate(z, 1)\n\n        
result = self.non_max_suppression(\n            pred, self.prob_threshold, self.nms_threshold\n        )[0]\n\n        if isinstance(result, Iterable):\n            objects = [\n                Detect_Object(\n                    obj[5],\n                    obj[4],\n                    (obj[0] - (wpad / 2)) / scale,\n                    (obj[1] - (hpad / 2)) / scale,\n                    (obj[2] - obj[0]) / scale,\n                    (obj[3] - obj[1]) / scale,\n                )\n                for obj in result\n            ]\n        else:\n            objects = []\n\n        return objects\n\n    def non_max_suppression(\n        self,\n        prediction,\n        conf_thres=0.1,\n        iou_thres=0.6,\n        merge=False,\n        classes=None,\n        agnostic=False,\n    ):\n        \"\"\"Performs Non-Maximum Suppression (NMS) on inference results\n\n        Returns:\n            detections with shape: nx6 (x1, y1, x2, y2, conf, cls)\n        \"\"\"\n        nc = prediction[0].shape[1] - 5  # number of classes\n        xc = prediction[..., 4] > conf_thres  # candidates\n\n        # Settings\n        min_wh, max_wh = 2, 4096  # (pixels) minimum and maximum box width and height\n        max_det = 300  # maximum number of detections per image\n        time_limit = 10.0  # seconds to quit after\n        redundant = True  # require redundant detections\n        multi_label = nc > 1  # multiple labels per box (adds 0.5ms/img)\n\n        t = time.time()\n        output = [None] * prediction.shape[0]\n        for xi, x in enumerate(prediction):  # image index, image inference\n            # Apply constraints\n            # x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height\n            x = x[xc[xi]]  # confidence\n\n            # If none remain process next image\n            if not x.shape[0]:\n                continue\n\n            box = x[:, :4]\n\n            # Detections matrix nx6 (xyxy, conf, cls)\n            if 
multi_label:\n                i, j = (x[:, 5:] > conf_thres).nonzero()\n                x = np.concatenate(\n                    (box[i], x[i, j + 5, None], j[:, None].astype(np.float32)), axis=1\n                )\n            else:  # best class only (NumPy: max and argmax are separate calls)\n                conf = x[:, 5:].max(1, keepdims=True)\n                j = x[:, 5:].argmax(1)[:, None]\n                x = np.concatenate((box, conf, j.astype(np.float32)), axis=1)[\n                    conf.reshape(-1) > conf_thres\n                ]\n\n            # Filter by class\n            if classes:\n                x = x[(x[:, 5:6] == np.array(classes)).any(1)]\n\n            # Apply finite constraint\n            # if not torch.isfinite(x).all():\n            #     x = x[torch.isfinite(x).all(1)]\n\n            # If none remain process next image\n            n = x.shape[0]  # number of boxes\n            if not n:\n                continue\n\n            # Sort by confidence\n            # x = x[x[:, 4].argsort(descending=True)]\n\n            # Batched NMS\n            c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes\n            boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores\n            i = nms(boxes, scores, iou_threshold=iou_thres)\n            if len(i) > max_det:  # limit detections\n                i = i[:max_det]\n            if merge and (1 < n < 3e3):  # Merge NMS (boxes merged using weighted mean)\n                try:  # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)\n                    iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix\n                    weights = iou * scores[None]  # box weights\n                    x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(\n                        1, keepdim=True\n                    )  # merged boxes\n                    if redundant:\n                        i = i[iou.sum(1) > 1]  # require redundancy\n                except:  # possible CUDA error https://github.com/ultralytics/yolov3/issues/1139\n                    print(x, i, x.shape, 
i.shape)\n                    pass\n\n            output[xi] = x[i]\n            if (time.time() - t) > time_limit:\n                break  # time limit exceeded\n\n        return output\n"
  },
  {
    "path": "python/ncnn/utils/__init__.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nfrom .download import download, check_sha1\nfrom .visual import *\nfrom .objects import *\n"
  },
  {
    "path": "python/ncnn/utils/download.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\n\"\"\"Download files with progress bar.\"\"\"\n\nimport os\nimport hashlib\nimport requests\nfrom tqdm import tqdm\n\n\ndef check_sha1(filename, sha1_hash):\n    \"\"\"Check whether the sha1 hash of the file content matches the expected hash.\n    Parameters\n    ----------\n    filename : str\n        Path to the file.\n    sha1_hash : str\n        Expected sha1 hash in hexadecimal digits.\n    Returns\n    -------\n    bool\n        Whether the file content matches the expected hash.\n    \"\"\"\n    sha1 = hashlib.sha1()\n    with open(filename, \"rb\") as f:\n        while True:\n            data = f.read(1048576)\n            if not data:\n                break\n            sha1.update(data)\n\n    sha1_file = sha1.hexdigest()\n    l = min(len(sha1_file), len(sha1_hash))\n    return sha1.hexdigest()[0:l] == sha1_hash[0:l]\n\n\ndef download(url, path=None, overwrite=False, sha1_hash=None):\n    \"\"\"Download a given URL\n    Parameters\n    ----------\n    url : str\n        URL to download\n    path : str, optional\n        Destination path to store downloaded file. By default stores to the\n        current directory with same name as in url.\n    overwrite : bool, optional\n        Whether to overwrite destination file if already exists.\n    sha1_hash : str, optional\n        Expected sha1 hash in hexadecimal digits. 
Will ignore existing file when hash is specified\n        but doesn't match.\n    Returns\n    -------\n    str\n        The file path of the downloaded file.\n    \"\"\"\n    if path is None:\n        fname = url.split(\"/\")[-1]\n    else:\n        path = os.path.expanduser(path)\n        if os.path.isdir(path):\n            fname = os.path.join(path, url.split(\"/\")[-1])\n        else:\n            fname = path\n\n    if (\n        overwrite\n        or not os.path.exists(fname)\n        or (sha1_hash and not check_sha1(fname, sha1_hash))\n    ):\n        dirname = os.path.dirname(os.path.abspath(os.path.expanduser(fname)))\n        if not os.path.exists(dirname):\n            os.makedirs(dirname)\n\n        print(\"Downloading %s from %s...\" % (fname, url))\n        r = requests.get(url, stream=True)\n        if r.status_code != 200:\n            raise RuntimeError(\"Failed downloading url %s\" % url)\n        total_length = r.headers.get(\"content-length\")\n        with open(fname, \"wb\") as f:\n            if total_length is None:  # no content length header\n                for chunk in r.iter_content(chunk_size=1024):\n                    if chunk:  # filter out keep-alive new chunks\n                        f.write(chunk)\n            else:\n                total_length = int(total_length)\n                for chunk in tqdm(\n                    r.iter_content(chunk_size=1024),\n                    total=int(total_length / 1024.0 + 0.5),\n                    unit=\"KB\",\n                    unit_scale=False,\n                    dynamic_ncols=True,\n                ):\n                    f.write(chunk)\n\n        if sha1_hash and not check_sha1(fname, sha1_hash):\n            raise UserWarning(\n                \"File {} is downloaded but the content hash does not match. \"\n                \"The repo may be outdated or download may be incomplete. 
\"\n                'If the \"repo_url\" is overridden, consider switching to '\n                \"the default repo.\".format(fname)\n            )\n\n    return fname\n"
  },
  {
    "path": "python/ncnn/utils/functional.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\n\n\ndef xywh2xyxy(x):\n    # Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right\n    y = np.zeros_like(x)\n    y[:, 0] = x[:, 0] - x[:, 2] / 2  # top left x\n    y[:, 1] = x[:, 1] - x[:, 3] / 2  # top left y\n    y[:, 2] = x[:, 0] + x[:, 2] / 2  # bottom right x\n    y[:, 3] = x[:, 1] + x[:, 3] / 2  # bottom right y\n    return y\n\n\ndef xyxy2xywh(x):\n    # Convert nx4 boxes from [x1, y1, x2, y2] to [x, y, w, h] where xy1=top-left, xy2=bottom-right\n    y = np.zeros_like(x)\n    y[:, 0] = (x[:, 0] + x[:, 2]) / 2  # x center\n    y[:, 1] = (x[:, 1] + x[:, 3]) / 2  # y center\n    y[:, 2] = x[:, 2] - x[:, 0]  # width\n    y[:, 3] = x[:, 3] - x[:, 1]  # height\n    return y\n\n\ndef make_grid(nx=20, ny=20):\n    xv1, yv1 = np.meshgrid(np.arange(nx), np.arange(ny))\n    z1 = np.stack((xv1, yv1), 2).reshape((1, ny, nx, 2)).astype(np.float32)\n    return z1\n\n\ndef sigmoid(x):\n    return 1 / (1 + np.exp(-x))\n\n\ndef softmax(x):\n    max_value = np.max(x, axis=-1)\n    x -= max_value.reshape((x.shape[0], 1))\n    x = np.exp(x)\n    sum_value = np.sum(x, axis=-1)\n    x /= sum_value.reshape((x.shape[0], 1))\n    return x\n\n\ndef iou_of(boxes0, boxes1, eps=1e-5):\n    \"\"\"Return intersection-over-union (Jaccard index) of boxes.\n\n    Args:\n        boxes0 (N, 4): ground truth boxes.\n        boxes1 (N or 1, 4): predicted boxes.\n        eps: a small number to avoid 0 as denominator.\n    Returns:\n        iou (N): IoU values.\n    \"\"\"\n    overlap_left_top = np.maximum(boxes0[..., :2], boxes1[..., :2])\n    overlap_right_bottom = np.minimum(boxes0[..., 2:], boxes1[..., 2:])\n\n    overlap_area = area_of(overlap_left_top, overlap_right_bottom)\n    area0 = area_of(boxes0[..., :2], boxes0[..., 2:])\n    area1 = area_of(boxes1[..., :2], boxes1[..., 2:])\n    return overlap_area / (area0 + area1 - overlap_area + 
eps)\n\n\ndef area_of(left_top, right_bottom):\n    \"\"\"Compute the areas of rectangles given two corners.\n\n    Args:\n        left_top (N, 2): left top corner.\n        right_bottom (N, 2): right bottom corner.\n\n    Returns:\n        area (N): return the area.\n    \"\"\"\n    hw = np.clip(right_bottom - left_top, 0.0, None)\n    return hw[..., 0] * hw[..., 1]\n\n\ndef nms(boxes, scores, iou_threshold, top_k=-1, candidate_size=200):\n    \"\"\"Perform non-maximum suppression on the given boxes.\n\n    Args:\n        boxes (N, 4): boxes in corner-form (x1, y1, x2, y2).\n        scores (N): probabilities of the boxes.\n        iou_threshold: intersection over union threshold.\n        top_k: keep top_k results. If top_k <= 0, keep all the results.\n        candidate_size: only consider the candidates with the highest scores.\n    Returns:\n        picked: a list of indexes of the kept boxes.\n    \"\"\"\n\n    picked = []\n    indexes = np.argsort(scores)\n    indexes = indexes[-candidate_size:]\n    while len(indexes) > 0:\n        current = indexes[-1]\n        picked.append(current)\n        if 0 < top_k == len(picked) or len(indexes) == 1:\n            break\n\n        current_box = boxes[current, :]\n        indexes = indexes[:-1]\n        rest_boxes = boxes[indexes, :]\n        iou = iou_of(\n            rest_boxes,\n            np.expand_dims(current_box, axis=0),\n        )\n        indexes = indexes[iou <= iou_threshold]\n\n    return picked\n"
  },
  {
    "path": "python/ncnn/utils/objects.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\n\n\nclass Point(object):\n    def __init__(self):\n        self.x = 0.0\n        self.y = 0.0\n\n\nclass Rect(object):\n    def __init__(self, x=0, y=0, w=0, h=0):\n        self.x = x\n        self.y = y\n        self.w = w\n        self.h = h\n\n    def area(self):\n        return self.w * self.h\n\n    def intersection_area(self, b):\n        x1 = np.maximum(self.x, b.x)\n        y1 = np.maximum(self.y, b.y)\n        x2 = np.minimum(self.x + self.w, b.x + b.w)\n        y2 = np.minimum(self.y + self.h, b.y + b.h)\n        # clamp at zero so that non-overlapping rects yield an empty intersection\n        return np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)\n\n\nclass Detect_Object(object):\n    def __init__(self, label=0, prob=0, x=0, y=0, w=0, h=0):\n        self.label = label\n        self.prob = prob\n        self.rect = Rect(x, y, w, h)\n\n\nclass Face_Object(object):\n    def __init__(self):\n        self.prob = 0.0\n        self.rect = Rect()\n        self.landmark = []\n\n\nclass KeyPoint(object):\n    def __init__(self):\n        self.p = Point()\n        self.prob = 0.0\n"
  },
  {
    "path": "python/ncnn/utils/visual.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport cv2\n\n\ndef draw_detection_objects(image, class_names, objects, min_prob=0.0):\n    for obj in objects:\n        if obj.prob < min_prob:\n            continue\n\n        print(\n            \"%d = %.5f at %.2f %.2f %.2f x %.2f\\n\"\n            % (obj.label, obj.prob, obj.rect.x, obj.rect.y, obj.rect.w, obj.rect.h)\n        )\n\n        cv2.rectangle(\n            image,\n            (int(obj.rect.x), int(obj.rect.y)),\n            (int(obj.rect.x + obj.rect.w), int(obj.rect.y + obj.rect.h)),\n            (255, 0, 0),\n        )\n\n        text = \"%s %.1f%%\" % (class_names[int(obj.label)], obj.prob * 100)\n\n        label_size, baseLine = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)\n\n        x = obj.rect.x\n        y = obj.rect.y - label_size[1] - baseLine\n        if y < 0:\n            y = 0\n        if x + label_size[0] > image.shape[1]:\n            x = image.shape[1] - label_size[0]\n\n        cv2.rectangle(\n            image,\n            (int(x), int(y)),\n            (int(x + label_size[0]), int(y + label_size[1] + baseLine)),\n            (255, 255, 255),\n            -1,\n        )\n\n        cv2.putText(\n            image,\n            text,\n            (int(x), int(y + label_size[1])),\n            cv2.FONT_HERSHEY_SIMPLEX,\n            0.5,\n            (0, 0, 0),\n        )\n\n    cv2.imshow(\"image\", image)\n    cv2.waitKey(0)\n\n\ndef print_topk(cls_scores, topk):\n    indexes = np.argsort(cls_scores)[::-1][0:topk]\n    scores = cls_scores[indexes]\n\n    for index, score in zip(indexes, scores):\n        print(\"%d=%f\" % (index, score))\n\n\ndef draw_faceobjects(image, faceobjects):\n    for obj in faceobjects:\n        print(\n            \"%.5f at %.2f %.2f %.2f x %.2f\"\n            % (obj.prob, obj.rect.x, obj.rect.y, obj.rect.w, obj.rect.h)\n        )\n\n        cv2.rectangle(\n            image,\n            
(int(obj.rect.x), int(obj.rect.y)),\n            (int(obj.rect.x + obj.rect.w), int(obj.rect.y + obj.rect.h)),\n            (255, 0, 0),\n        )\n\n        cv2.circle(\n            image,\n            (int(obj.landmark[0].x), int(obj.landmark[0].y)),\n            2,\n            (0, 255, 255),\n            -1,\n        )\n        cv2.circle(\n            image,\n            (int(obj.landmark[1].x), int(obj.landmark[1].y)),\n            2,\n            (0, 255, 255),\n            -1,\n        )\n        cv2.circle(\n            image,\n            (int(obj.landmark[2].x), int(obj.landmark[2].y)),\n            2,\n            (0, 255, 255),\n            -1,\n        )\n        cv2.circle(\n            image,\n            (int(obj.landmark[3].x), int(obj.landmark[3].y)),\n            2,\n            (0, 255, 255),\n            -1,\n        )\n        cv2.circle(\n            image,\n            (int(obj.landmark[4].x), int(obj.landmark[4].y)),\n            2,\n            (0, 255, 255),\n            -1,\n        )\n\n        text = \"%.1f%%\" % (obj.prob * 100)\n\n        label_size, baseLine = cv2.getTextSize(text, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)\n\n        x = obj.rect.x\n        y = obj.rect.y - label_size[1] - baseLine\n        if y < 0:\n            y = 0\n        if x + label_size[0] > image.shape[1]:\n            x = image.shape[1] - label_size[0]\n\n        cv2.rectangle(\n            image,\n            (int(x), int(y)),\n            (int(x + label_size[0]), int(y + label_size[1] + baseLine)),\n            (255, 255, 255),\n            -1,\n        )\n\n        cv2.putText(\n            image,\n            text,\n            (int(x), int(y + label_size[1])),\n            cv2.FONT_HERSHEY_SIMPLEX,\n            0.5,\n            (0, 0, 0),\n        )\n\n    cv2.imshow(\"image\", image)\n    cv2.waitKey(0)\n\n\ndef draw_pose(image, keypoints):\n    # draw bone\n    joint_pairs = [\n        (0, 1),\n        (1, 3),\n        (0, 2),\n        (2, 4),\n        
(5, 6),\n        (5, 7),\n        (7, 9),\n        (6, 8),\n        (8, 10),\n        (5, 11),\n        (6, 12),\n        (11, 12),\n        (11, 13),\n        (12, 14),\n        (13, 15),\n        (14, 16),\n    ]\n\n    for i in range(16):\n        p1 = keypoints[joint_pairs[i][0]]\n        p2 = keypoints[joint_pairs[i][1]]\n\n        if p1.prob < 0.2 or p2.prob < 0.2:\n            continue\n\n        cv2.line(\n            image,\n            (int(p1.p.x), int(p1.p.y)),\n            (int(p2.p.x), int(p2.p.y)),\n            (255, 0, 0),\n            2,\n        )\n\n    # draw joint\n    for keypoint in keypoints:\n        print(\"%.2f %.2f = %.5f\" % (keypoint.p.x, keypoint.p.y, keypoint.prob))\n\n        if keypoint.prob < 0.2:\n            continue\n\n        cv2.circle(image, (int(keypoint.p.x), int(keypoint.p.y)), 3, (0, 255, 0), -1)\n\n    cv2.imshow(\"image\", image)\n    cv2.waitKey(0)\n"
  },
  {
    "path": "python/requirements.txt",
    "content": "numpy\ntqdm\nrequests\nportalocker\nopencv-python"
  },
  {
    "path": "python/setup.py.i",
    "content": "import sys\nfrom setuptools import setup, find_packages\n\ntry:\n    from wheel.bdist_wheel import bdist_wheel as _bdist_wheel\n\n    class bdist_wheel(_bdist_wheel):\n        def finalize_options(self):\n            _bdist_wheel.finalize_options(self)\n            self.root_is_pure = False\n\n\nexcept ImportError:\n    bdist_wheel = None\n\nif sys.version_info < (3, 0):\n    sys.exit(\"Sorry, Python < 3.0 is not supported\")\n\nrequirements = [\"numpy\", \"tqdm\", \"requests\", \"portalocker\", \"opencv-python\"]\n\nsetup(\n    name=\"ncnn\",\n    version=\"${PACKAGE_VERSION}\",\n    author=\"nihui\",\n    author_email=\"nihuini@tencent.com\",\n    maintainer=\"caishanli\",\n    maintainer_email=\"caishanli25@gmail.com\",\n    description=\"ncnn is a high-performance neural network inference framework optimized for the mobile platform\",\n    url=\"https://github.com/Tencent/ncnn\",\n    classifiers=[\n        \"Programming Language :: C++\",\n        \"Programming Language :: Python :: 3\",\n        \"Programming Language :: Python :: 3.6\",\n        \"Programming Language :: Python :: 3.7\",\n        \"Programming Language :: Python :: 3.8\",\n        \"Programming Language :: Python :: 3.9\",\n        \"Programming Language :: Python :: 3.10\",\n        \"Programming Language :: Python :: 3.11\",\n        \"Programming Language :: Python :: 3.12\",\n        \"Programming Language :: Python :: 3.13\",\n        \"Programming Language :: Python :: 3.14\",\n        \"License :: OSI Approved :: BSD License\",\n        \"Operating System :: OS Independent\",\n        \"Topic :: Scientific/Engineering :: Artificial Intelligence\",\n    ],\n    license=\"BSD-3\",\n    python_requires=\">=3.5\",\n    packages=find_packages(),\n    package_dir={\"\": \".\"},\n    package_data={\"ncnn\": [\"ncnn${PYTHON_MODULE_PREFIX}${PYTHON_MODULE_EXTENSION}\"]},\n    install_requires=requirements,\n    cmdclass={\"bdist_wheel\": bdist_wheel},\n)\n"
  },
  {
    "path": "python/src/main.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include <pybind11/pybind11.h>\n#include <pybind11/stl.h>\n#include <pybind11/numpy.h>\n#include <pybind11/functional.h>\n\n#include <cpu.h>\n#include <gpu.h>\n#include <net.h>\n#include <option.h>\n#include <blob.h>\n#include <paramdict.h>\n\n#include \"pybind11_mat.h\"\n#include \"pybind11_datareader.h\"\n#include \"pybind11_allocator.h\"\n#include \"pybind11_modelbin.h\"\n#include \"pybind11_layer.h\"\nusing namespace ncnn;\n\nnamespace py = pybind11;\n\nclass DataReaderFromMemoryCopy : public DataReaderFromMemory\n{\npublic:\n    explicit DataReaderFromMemoryCopy(const unsigned char*& mem)\n        : DataReaderFromMemory(mem)\n    {\n    }\n\n    virtual size_t reference(size_t size, const void** buf) const\n    {\n        return 0;\n    }\n};\n\nstruct LayerFactory\n{\n    std::string name;\n    int index;\n    std::function<Layer*()> creator;\n    std::function<void(Layer*)> destroyer;\n    layer_creator_func creator_func;\n    layer_destroyer_func destroyer_func;\n};\n\n#define LayerFactoryDeclear(n)                  \\\n    static ncnn::Layer* LayerCreator##n(void*); \\\n    static void LayerDestroyer##n(ncnn::Layer*, void*);\n\nLayerFactoryDeclear(0);\nLayerFactoryDeclear(1);\nLayerFactoryDeclear(2);\nLayerFactoryDeclear(3);\nLayerFactoryDeclear(4);\nLayerFactoryDeclear(5);\nLayerFactoryDeclear(6);\nLayerFactoryDeclear(7);\nLayerFactoryDeclear(8);\nLayerFactoryDeclear(9);\n\nstd::vector<LayerFactory> g_layer_factroys = {\n    {\"\", -1, nullptr, nullptr, LayerCreator0, LayerDestroyer0},\n    {\"\", -1, nullptr, nullptr, LayerCreator1, LayerDestroyer1},\n    {\"\", -1, nullptr, nullptr, LayerCreator2, LayerDestroyer2},\n    {\"\", -1, nullptr, nullptr, LayerCreator3, LayerDestroyer3},\n    {\"\", -1, nullptr, nullptr, LayerCreator4, LayerDestroyer4},\n    {\"\", -1, nullptr, nullptr, LayerCreator5, LayerDestroyer5},\n    {\"\", -1, nullptr, nullptr, LayerCreator6, 
LayerDestroyer6},\n    {\"\", -1, nullptr, nullptr, LayerCreator7, LayerDestroyer7},\n    {\"\", -1, nullptr, nullptr, LayerCreator8, LayerDestroyer8},\n    {\"\", -1, nullptr, nullptr, LayerCreator9, LayerDestroyer9},\n};\nint g_layer_factroy_index = 0;\n\n#define LayerFactoryDefine(n)                                  \\\n    static ncnn::Layer* LayerCreator##n(void* p)               \\\n    {                                                          \\\n        if (g_layer_factroys[n].creator != nullptr)            \\\n        {                                                      \\\n            return g_layer_factroys[n].creator();              \\\n        }                                                      \\\n        return nullptr;                                        \\\n    }                                                          \\\n    static void LayerDestroyer##n(ncnn::Layer* layer, void* p) \\\n    {                                                          \\\n        if (g_layer_factroys[n].destroyer)                     \\\n        {                                                      \\\n            g_layer_factroys[n].destroyer(layer);              \\\n        }                                                      \\\n    }\n\nLayerFactoryDefine(0);\nLayerFactoryDefine(1);\nLayerFactoryDefine(2);\nLayerFactoryDefine(3);\nLayerFactoryDefine(4);\nLayerFactoryDefine(5);\nLayerFactoryDefine(6);\nLayerFactoryDefine(7);\nLayerFactoryDefine(8);\nLayerFactoryDefine(9);\n\nPYBIND11_MODULE(ncnn, m)\n{\n    auto atexit = py::module_::import(\"atexit\");\n    atexit.attr(\"register\")(py::cpp_function([]() {\n        for (int i = 0; i < g_layer_factroys.size(); i++)\n        {\n            g_layer_factroys[i].creator = nullptr;\n            g_layer_factroys[i].destroyer = nullptr;\n        }\n    }));\n\n    py::class_<Allocator, PyAllocator<> >(m, \"Allocator\");\n    py::class_<PoolAllocator, Allocator, PyAllocatorOther<PoolAllocator> >(m, 
\"PoolAllocator\")\n    .def(py::init<>())\n    .def(\"set_size_compare_ratio\", &PoolAllocator::set_size_compare_ratio, py::arg(\"src\"))\n    .def(\"clear\", &PoolAllocator::clear)\n    .def(\"fastMalloc\", &PoolAllocator::fastMalloc, py::arg(\"size\"))\n    .def(\"fastFree\", &PoolAllocator::fastFree, py::arg(\"ptr\"));\n    py::class_<UnlockedPoolAllocator, Allocator, PyAllocatorOther<UnlockedPoolAllocator> >(m, \"UnlockedPoolAllocator\")\n    .def(py::init<>())\n    .def(\"set_size_compare_ratio\", &UnlockedPoolAllocator::set_size_compare_ratio, py::arg(\"src\"))\n    .def(\"clear\", &UnlockedPoolAllocator::clear)\n    .def(\"fastMalloc\", &UnlockedPoolAllocator::fastMalloc, py::arg(\"size\"))\n    .def(\"fastFree\", &UnlockedPoolAllocator::fastFree, py::arg(\"ptr\"));\n\n    py::class_<DataReader, PyDataReader<> >(m, \"DataReader\")\n    .def(py::init<>())\n#if NCNN_STRING\n    .def(\"scan\", &DataReader::scan, py::arg(\"format\"), py::arg(\"p\"))\n#endif // NCNN_STRING\n    .def(\"read\", &DataReader::read, py::arg(\"buf\"), py::arg(\"size\"));\n    py::class_<DataReaderFromEmpty, DataReader, PyDataReaderOther<DataReaderFromEmpty> >(m, \"DataReaderFromEmpty\")\n    .def(py::init<>())\n#if NCNN_STRING\n    .def(\"scan\", &DataReaderFromEmpty::scan, py::arg(\"format\"), py::arg(\"p\"))\n#endif // NCNN_STRING\n    .def(\"read\", &DataReaderFromEmpty::read, py::arg(\"buf\"), py::arg(\"size\"));\n\n    py::class_<Blob>(m, \"Blob\")\n    .def(py::init<>())\n#if NCNN_STRING\n    .def_readwrite(\"name\", &Blob::name)\n#endif // NCNN_STRING\n    .def_readwrite(\"producer\", &Blob::producer)\n    .def_readwrite(\"consumer\", &Blob::consumer)\n    .def_readwrite(\"shape\", &Blob::shape);\n\n    py::class_<ModelBin, PyModelBin<> >(m, \"ModelBin\")\n    .def(py::init<>())\n    .def(\"load\", (Mat(ModelBin::*)(int, int) const) & ModelBin::load, py::arg(\"w\"), py::arg(\"type\"))\n    .def(\"load\", (Mat(ModelBin::*)(int, int, int) const) & ModelBin::load, py::arg(\"w\"), 
py::arg(\"h\"), py::arg(\"type\"))\n    .def(\"load\", (Mat(ModelBin::*)(int, int, int, int) const) & ModelBin::load, py::arg(\"w\"), py::arg(\"h\"), py::arg(\"c\"), py::arg(\"type\"))\n    .def(\"load\", (Mat(ModelBin::*)(int, int, int, int, int) const) & ModelBin::load, py::arg(\"w\"), py::arg(\"h\"), py::arg(\"d\"), py::arg(\"c\"), py::arg(\"type\"));\n    py::class_<ModelBinFromDataReader, ModelBin, PyModelBinOther<ModelBinFromDataReader> >(m, \"ModelBinFromDataReader\")\n    .def(py::init<const DataReader&>(), py::arg(\"dr\"))\n    .def(\"load\", &ModelBinFromDataReader::load, py::arg(\"w\"), py::arg(\"type\"));\n    py::class_<ModelBinFromMatArray, ModelBin, PyModelBinOther<ModelBinFromMatArray> >(m, \"ModelBinFromMatArray\")\n    .def(py::init<const Mat*>(), py::arg(\"weights\"))\n    .def(\"load\", &ModelBinFromMatArray::load, py::arg(\"w\"), py::arg(\"type\"));\n\n    py::class_<ParamDict>(m, \"ParamDict\")\n    .def(py::init<>())\n    .def(\"type\", &ParamDict::type, py::arg(\"id\"))\n    .def(\"get\", (int (ParamDict::*)(int, int) const) & ParamDict::get, py::arg(\"id\"), py::arg(\"def\"))\n    .def(\"get\", (float (ParamDict::*)(int, float) const) & ParamDict::get, py::arg(\"id\"), py::arg(\"def\"))\n    .def(\"get\", (Mat(ParamDict::*)(int, const Mat&) const) & ParamDict::get, py::arg(\"id\"), py::arg(\"def\"))\n    .def(\"set\", (void (ParamDict::*)(int, int)) & ParamDict::set, py::arg(\"id\"), py::arg(\"i\"))\n    .def(\"set\", (void (ParamDict::*)(int, float)) & ParamDict::set, py::arg(\"id\"), py::arg(\"f\"))\n    .def(\"set\", (void (ParamDict::*)(int, const Mat&)) & ParamDict::set, py::arg(\"id\"), py::arg(\"v\"));\n\n    py::class_<Option>(m, \"Option\")\n    .def(py::init<>())\n    .def_readwrite(\"lightmode\", &Option::lightmode)\n    .def_readwrite(\"num_threads\", &Option::num_threads)\n    .def_readwrite(\"blob_allocator\", &Option::blob_allocator)\n    .def_readwrite(\"workspace_allocator\", &Option::workspace_allocator)\n#if NCNN_VULKAN\n 
   .def_readwrite(\"blob_vkallocator\", &Option::blob_vkallocator)\n    .def_readwrite(\"workspace_vkallocator\", &Option::workspace_vkallocator)\n    .def_readwrite(\"staging_vkallocator\", &Option::staging_vkallocator)\n    //.def_readwrite(\"pipeline_cache\", &Option::pipeline_cache)\n#endif // NCNN_VULKAN\n    .def_readwrite(\"openmp_blocktime\", &Option::openmp_blocktime)\n    .def_readwrite(\"use_winograd_convolution\", &Option::use_winograd_convolution)\n    .def_readwrite(\"use_winograd23_convolution\", &Option::use_winograd23_convolution)\n    .def_readwrite(\"use_winograd43_convolution\", &Option::use_winograd43_convolution)\n    .def_readwrite(\"use_winograd63_convolution\", &Option::use_winograd63_convolution)\n    .def_readwrite(\"use_sgemm_convolution\", &Option::use_sgemm_convolution)\n    .def_readwrite(\"use_int8_inference\", &Option::use_int8_inference)\n    .def_readwrite(\"use_vulkan_compute\", &Option::use_vulkan_compute)\n    .def_readwrite(\"use_bf16_packed\", &Option::use_bf16_packed)\n    .def_readwrite(\"use_bf16_storage\", &Option::use_bf16_storage)\n    .def_readwrite(\"use_fp16_packed\", &Option::use_fp16_packed)\n    .def_readwrite(\"use_fp16_storage\", &Option::use_fp16_storage)\n    .def_readwrite(\"use_fp16_arithmetic\", &Option::use_fp16_arithmetic)\n    .def_readwrite(\"use_int8_packed\", &Option::use_int8_packed)\n    .def_readwrite(\"use_int8_storage\", &Option::use_int8_storage)\n    .def_readwrite(\"use_int8_arithmetic\", &Option::use_int8_arithmetic)\n    .def_readwrite(\"use_packing_layout\", &Option::use_packing_layout)\n    .def_readwrite(\"use_subgroup_ops\", &Option::use_subgroup_ops)\n    .def_readwrite(\"use_tensor_storage\", &Option::use_tensor_storage);\n\n    py::class_<Mat> mat(m, \"Mat\", py::buffer_protocol());\n    mat.def(py::init<>())\n    .def(py::init(\n    [](py::tuple shape, size_t elemsize, int elempack, Allocator* allocator) {\n        Mat* mat = nullptr;\n        switch (shape.size())\n        {\n       
 case 1:\n            mat = new Mat(shape[0].cast<int>(), elemsize, elempack, allocator);\n            break;\n        case 2:\n            mat = new Mat(shape[0].cast<int>(), shape[1].cast<int>(), elemsize, elempack, allocator);\n            break;\n        case 3:\n            mat = new Mat(shape[0].cast<int>(), shape[1].cast<int>(), shape[2].cast<int>(), elemsize, elempack, allocator);\n            break;\n        case 4:\n            mat = new Mat(shape[0].cast<int>(), shape[1].cast<int>(), shape[2].cast<int>(), shape[3].cast<int>(), elemsize, elempack, allocator);\n            break;\n        default:\n            std::stringstream ss;\n            ss << \"shape must be 1, 2, 3 or 4 dims, not \" << shape.size();\n            pybind11::pybind11_fail(ss.str());\n        }\n        return mat;\n    }),\n    py::arg(\"shape\"), py::kw_only(),\n    py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(py::init<int, size_t, int, Allocator*>(),\n         py::arg(\"w\"), py::kw_only(),\n         py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(py::init<int, int, size_t, int, Allocator*>(),\n         py::arg(\"w\"), py::arg(\"h\"), py::kw_only(),\n         py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(py::init<int, int, int, size_t, int, Allocator*>(),\n         py::arg(\"w\"), py::arg(\"h\"), py::arg(\"c\"), py::kw_only(),\n         py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(py::init<int, int, int, int, size_t, int, Allocator*>(),\n         py::arg(\"w\"), py::arg(\"h\"), py::arg(\"d\"), py::arg(\"c\"), py::kw_only(),\n         py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n\n    .def(py::init<const Mat&>(), py::arg(\"m\"))\n\n    .def(py::init([](py::buffer const b) {\n        py::buffer_info info = b.request();\n        if 
(info.ndim > 4)\n        {\n            std::stringstream ss;\n            ss << \"convert numpy.ndarray to ncnn.Mat only dims <=4 support now, but given \" << info.ndim;\n            pybind11::pybind11_fail(ss.str());\n        }\n\n        size_t elemsize = info.itemsize;\n\n        Mat* v = nullptr;\n        if (info.ndim == 1)\n        {\n            v = new Mat((int)info.shape[0], info.ptr, elemsize);\n        }\n        else if (info.ndim == 2)\n        {\n            v = new Mat((int)info.shape[1], (int)info.shape[0], info.ptr, elemsize);\n        }\n        else if (info.ndim == 3)\n        {\n            v = new Mat((int)info.shape[2], (int)info.shape[1], (int)info.shape[0], info.ptr, elemsize);\n\n            // in ncnn, buffer to construct ncnn::Mat need align to ncnn::alignSize\n            // with (w * h * elemsize, 16) / elemsize, but the buffer from numpy not\n            // so we set the cstep as numpy's cstep\n            v->cstep = (int)info.shape[2] * (int)info.shape[1];\n        }\n        else if (info.ndim == 4)\n        {\n            v = new Mat((int)info.shape[3], (int)info.shape[2], (int)info.shape[1], (int)info.shape[0], info.ptr, elemsize);\n\n            // in ncnn, buffer to construct ncnn::Mat need align to ncnn::alignSize\n            // with (w * h * d elemsize, 16) / elemsize, but the buffer from numpy not\n            // so we set the cstep as numpy's cstep\n            v->cstep = (int)info.shape[3] * (int)info.shape[2] * (int)info.shape[1];\n        }\n        return std::unique_ptr<Mat>(v);\n    }),\n    py::arg(\"array\"))\n    .def_buffer([](Mat& m) -> py::buffer_info {\n        return to_buffer_info(m);\n    })\n    .def(\n    \"numpy\", [](py::object obj, const std::string& format = \"\") -> py::array {\n        auto* m = obj.cast<Mat*>();\n        return py::array(to_buffer_info(*m, format), obj);\n    },\n    py::arg(\"format\") = \"\", \"i for int32, f for float32, d for double\")\n    //.def(\"fill\", (void 
(Mat::*)(int))(&Mat::fill), py::arg(\"v\"))\n    .def(\"fill\", (void (Mat::*)(float))(&Mat::fill), py::arg(\"v\"))\n    .def(\"clone\", &Mat::clone, py::arg(\"allocator\") = nullptr)\n    .def(\"clone_from\", &Mat::clone_from, py::arg(\"mat\"), py::arg(\"allocator\") = nullptr)\n    .def(\n    \"reshape\", [](Mat& mat, py::tuple shape, Allocator* allocator) {\n        switch (shape.size())\n        {\n        case 1:\n            return mat.reshape(shape[0].cast<int>(), allocator);\n        case 2:\n            return mat.reshape(shape[0].cast<int>(), shape[1].cast<int>(), allocator);\n        case 3:\n            return mat.reshape(shape[0].cast<int>(), shape[1].cast<int>(), shape[2].cast<int>(), allocator);\n        case 4:\n            return mat.reshape(shape[0].cast<int>(), shape[1].cast<int>(), shape[2].cast<int>(), shape[3].cast<int>(), allocator);\n        default:\n            std::stringstream ss;\n            ss << \"shape must be 1, 2, 3 or 4 dims, not \" << shape.size();\n            pybind11::pybind11_fail(ss.str());\n        }\n        return Mat();\n    },\n    py::arg(\"shape\"), py::kw_only(), py::arg(\"allocator\") = nullptr)\n    .def(\"reshape\", (Mat(Mat::*)(int, Allocator*) const) & Mat::reshape, py::arg(\"w\"), py::kw_only(), py::arg(\"allocator\") = nullptr)\n    .def(\"reshape\", (Mat(Mat::*)(int, int, Allocator*) const) & Mat::reshape, py::arg(\"w\"), py::arg(\"h\"), py::kw_only(), py::arg(\"allocator\") = nullptr)\n    .def(\"reshape\", (Mat(Mat::*)(int, int, int, Allocator*) const) & Mat::reshape, py::arg(\"w\"), py::arg(\"h\"), py::arg(\"c\"), py::kw_only(), py::arg(\"allocator\") = nullptr)\n    .def(\"reshape\", (Mat(Mat::*)(int, int, int, int, Allocator*) const) & Mat::reshape, py::arg(\"w\"), py::arg(\"h\"), py::arg(\"d\"), py::arg(\"c\"), py::kw_only(), py::arg(\"allocator\") = nullptr)\n\n    .def(\n    \"create\", [](Mat& mat, py::tuple shape, size_t elemsize, int elempack, Allocator* allocator) {\n        switch 
(shape.size())\n        {\n        case 1:\n            return mat.create(shape[0].cast<int>(), elemsize, elempack, allocator);\n        case 2:\n            return mat.create(shape[0].cast<int>(), shape[1].cast<int>(), elemsize, elempack, allocator);\n        case 3:\n            return mat.create(shape[0].cast<int>(), shape[1].cast<int>(), shape[2].cast<int>(), elemsize, elempack, allocator);\n        case 4:\n            return mat.create(shape[0].cast<int>(), shape[1].cast<int>(), shape[2].cast<int>(), shape[3].cast<int>(), elemsize, elempack, allocator);\n        default:\n            std::stringstream ss;\n            ss << \"shape must be 1, 2, 3 or 4 dims, not \" << shape.size();\n            pybind11::pybind11_fail(ss.str());\n        }\n        return;\n    },\n    py::arg(\"shape\"), py::kw_only(), py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(\"create\", (void (Mat::*)(int, size_t, int, Allocator*)) & Mat::create, py::arg(\"w\"), py::kw_only(), py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(\"create\", (void (Mat::*)(int, int, size_t, int, Allocator*)) & Mat::create, py::arg(\"w\"), py::arg(\"h\"), py::kw_only(), py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(\"create\", (void (Mat::*)(int, int, int, size_t, int, Allocator*)) & Mat::create, py::arg(\"w\"), py::arg(\"h\"), py::arg(\"c\"), py::kw_only(), py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(\"create\", (void (Mat::*)(int, int, int, int, size_t, int, Allocator*)) & Mat::create, py::arg(\"w\"), py::arg(\"h\"), py::arg(\"d\"), py::arg(\"c\"), py::kw_only(), py::arg(\"elemsize\") = 4, py::arg(\"elempack\") = 1, py::arg(\"allocator\") = nullptr)\n    .def(\"create_like\", (void (Mat::*)(const Mat&, Allocator*)) & Mat::create_like, py::arg(\"m\"), py::arg(\"allocator\") = nullptr)\n    
.def(\"addref\", &Mat::addref)\n    .def(\"release\", &Mat::release)\n    .def(\"empty\", &Mat::empty)\n    .def(\"total\", &Mat::total)\n    .def(\"elembits\", &Mat::elembits)\n    .def(\"shape\", &Mat::shape)\n    .def(\"channel\", (Mat(Mat::*)(int)) & Mat::channel, py::arg(\"c\"))\n    //.def(\"channel\", (const Mat (Mat::*)(int) const) & Mat::channel, py::arg(\"c\"))\n    .def(\"depth\", (Mat(Mat::*)(int)) & Mat::depth, py::arg(\"z\"))\n    //.def(\"depth\", (const Mat (Mat::*)(int) const) & Mat::depth, py::arg(\"z\"))\n    .def(\n    \"row\", [](Mat& m, int y) {\n        if (m.elempack != 1)\n        {\n            std::stringstream ss;\n            ss << \"ncnn.Mat row only supports elempack 1 now, but given \" << m.elempack;\n            pybind11::pybind11_fail(ss.str());\n        }\n\n        switch (m.elemsize)\n        {\n        case 1:\n            return py::memoryview::from_buffer(m.row<int8_t>(y), {m.w}, {sizeof(int8_t)});\n        //case 2:\n        //    return py::memoryview::from_buffer(m.row<short>(y), {m.w}, {sizeof(short)});\n        case 4:\n            return py::memoryview::from_buffer(m.row<float>(y), {m.w}, {sizeof(float)});\n        default:\n            std::stringstream ss;\n            ss << \"ncnn.Mat row elemsize \" << m.elemsize << \" is not supported now\";\n            pybind11::pybind11_fail(ss.str());\n        }\n        return py::memoryview::from_buffer(m.row<float>(y), {m.w}, {sizeof(float)});\n    },\n    py::arg(\"y\"))\n    .def(\"channel_range\", (Mat(Mat::*)(int, int)) & Mat::channel_range, py::arg(\"c\"), py::arg(\"channels\"))\n    //.def(\"channel_range\", (const Mat (Mat::*)(int, int) const) & Mat::channel_range, py::arg(\"c\"), py::arg(\"channels\"))\n    .def(\"depth_range\", (Mat(Mat::*)(int, int)) & Mat::depth_range, py::arg(\"z\"), py::arg(\"depths\"))\n    //.def(\"depth_range\", (const Mat (Mat::*)(int, int) const) & Mat::depth_range, py::arg(\"z\"), py::arg(\"depths\"))\n    .def(\"row_range\", 
(Mat(Mat::*)(int, int)) & Mat::row_range, py::arg(\"y\"), py::arg(\"rows\"))\n    //.def(\"row_range\", (const Mat (Mat::*)(int, int) const) & Mat::row_range, py::arg(\"y\"), py::arg(\"rows\"))\n    .def(\"range\", (Mat(Mat::*)(int, int)) & Mat::range, py::arg(\"x\"), py::arg(\"n\"))\n    //.def(\"range\", (const Mat (Mat::*)(int, int) const) & Mat::range, py::arg(\"x\"), py::arg(\"n\"))\n    .def(\n    \"__getitem__\", [](const Mat& m, size_t i) {\n        return m[i];\n    },\n    py::arg(\"i\"))\n    .def(\n    \"__setitem__\", [](Mat& m, size_t i, float v) {\n        m[i] = v;\n    },\n    py::arg(\"i\"), py::arg(\"v\"))\n    .def(\"__len__\", [](Mat& m) {\n        return m.w;\n    })\n\n    //convenient construct from pixel data\n    .def_static(\n    \"from_pixels\", [](py::buffer const b, int type, int w, int h, Allocator* allocator) {\n        return Mat::from_pixels((const unsigned char*)b.request().ptr, type, w, h, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), py::arg(\"w\"), py::arg(\"h\"), py::arg(\"allocator\") = nullptr)\n    .def_static(\n    \"from_pixels\", [](py::buffer const b, int type, int w, int h, int stride, Allocator* allocator) {\n        return Mat::from_pixels((const unsigned char*)b.request().ptr, type, w, h, stride, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), py::arg(\"w\"), py::arg(\"h\"), py::arg(\"stride\"), py::arg(\"allocator\") = nullptr)\n    .def_static(\n    \"from_pixels_resize\", [](py::buffer const b, int type, int w, int h, int target_width, int target_height, Allocator* allocator) {\n        return Mat::from_pixels_resize((const unsigned char*)b.request().ptr,\n                                       type, w, h, target_width, target_height, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), py::arg(\"w\"), py::arg(\"h\"), py::arg(\"target_width\"), py::arg(\"target_height\"), py::arg(\"allocator\") = nullptr)\n    .def_static(\n    \"from_pixels_resize\", [](py::buffer 
const b, int type, int w, int h, int stride, int target_width, int target_height, Allocator* allocator) {\n        return Mat::from_pixels_resize((const unsigned char*)b.request().ptr,\n                                       type, w, h, stride, target_width, target_height, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), py::arg(\"w\"), py::arg(\"h\"), py::arg(\"stride\"), py::arg(\"target_width\"), py::arg(\"target_height\"), py::arg(\"allocator\") = nullptr)\n    .def_static(\n    \"from_pixels_roi\", [](py::buffer const b, int type, int w, int h, int roix, int roiy, int roiw, int roih, Allocator* allocator) {\n        return Mat::from_pixels_roi((const unsigned char*)b.request().ptr,\n                                    type, w, h, roix, roiy, roiw, roih, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), py::arg(\"w\"), py::arg(\"h\"), py::arg(\"roix\"), py::arg(\"roiy\"), py::arg(\"roiw\"), py::arg(\"roih\"), py::arg(\"allocator\") = nullptr)\n    .def_static(\n    \"from_pixels_roi\", [](py::buffer const b, int type, int w, int h, int stride, int roix, int roiy, int roiw, int roih, Allocator* allocator) {\n        return Mat::from_pixels_roi((const unsigned char*)b.request().ptr,\n                                    type, w, h, stride, roix, roiy, roiw, roih, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), py::arg(\"w\"), py::arg(\"h\"), py::arg(\"stride\"), py::arg(\"roix\"), py::arg(\"roiy\"), py::arg(\"roiw\"), py::arg(\"roih\"), py::arg(\"allocator\") = nullptr)\n    .def_static(\n    \"from_pixels_roi_resize\", [](py::buffer const b, int type, int w, int h, int roix, int roiy, int roiw, int roih, int target_width, int target_height, Allocator* allocator) {\n        return Mat::from_pixels_roi_resize((const unsigned char*)b.request().ptr,\n                                           type, w, h, roix, roiy, roiw, roih, target_width, target_height, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), 
py::arg(\"w\"), py::arg(\"h\"), py::arg(\"roix\"), py::arg(\"roiy\"), py::arg(\"roiw\"), py::arg(\"roih\"), py::arg(\"target_width\"), py::arg(\"target_height\"), py::arg(\"allocator\") = nullptr)\n    .def_static(\n    \"from_pixels_roi_resize\", [](py::buffer const b, int type, int w, int h, int stride, int roix, int roiy, int roiw, int roih, int target_width, int target_height, Allocator* allocator) {\n        return Mat::from_pixels_roi_resize((const unsigned char*)b.request().ptr,\n                                           type, w, h, stride, roix, roiy, roiw, roih, target_width, target_height, allocator);\n    },\n    py::arg(\"array\"), py::arg(\"type\"), py::arg(\"w\"), py::arg(\"h\"), py::arg(\"stride\"), py::arg(\"roix\"), py::arg(\"roiy\"), py::arg(\"roiw\"), py::arg(\"roih\"), py::arg(\"target_width\"), py::arg(\"target_height\"), py::arg(\"allocator\") = nullptr)\n    .def(\n    \"substract_mean_normalize\", [](Mat& mat, std::vector<float>& mean, std::vector<float>& norm) {\n        return mat.substract_mean_normalize(mean.size() > 0 ? &mean[0] : 0, norm.size() > 0 ? &norm[0] : 0);\n    },\n    py::arg(\"mean\"), py::arg(\"norm\"))\n    .def_readwrite(\"refcount\", &Mat::refcount)\n    .def_readwrite(\"elemsize\", &Mat::elemsize)\n    .def_readwrite(\"elempack\", &Mat::elempack)\n    .def_readwrite(\"allocator\", &Mat::allocator)\n    .def_readwrite(\"dims\", &Mat::dims)\n    .def_readwrite(\"w\", &Mat::w)\n    .def_readwrite(\"h\", &Mat::h)\n    .def_readwrite(\"d\", &Mat::d)\n    .def_readwrite(\"c\", &Mat::c)\n    .def_readwrite(\"cstep\", &Mat::cstep)\n    .def(\"__repr__\", [](const Mat& m) {\n        std::stringstream ss;\n        ss << \"<ncnn.Mat w=\" << m.w << \" h=\" << m.h << \" d=\" << m.d << \" c=\" << m.c << \" dims=\" << m.dims\n           << \" cstep=\" << m.cstep << \" elemsize=\" << m.elemsize << \" elempack=\" << m.elempack << \"\\n\\t\"\n           << \"refcount=\" << (m.refcount ? 
*m.refcount : 0) << \" data=0x\" << static_cast<const void*>(m.data)\n           << \" allocator=0x\" << static_cast<const void*>(m.allocator) << \">\\n\";\n\n        const int max_count = m.dims == 1 ? 10 : 6;\n        if (m.dims == 1)\n        {\n            ss << \"[\";\n            bool dot_printed_w = false;\n\n            if (m.elemsize == 1)\n            {\n                const int8_t* row = m.row<int8_t>(0);\n                for (int i = 0; i < m.w; i++)\n                {\n                    if (i < max_count / 2 || i >= m.w - max_count / 2)\n                    {\n                        if (i > 0)\n                        {\n                            ss << \", \";\n                        }\n                        ss << static_cast<int>(row[i]);\n                    }\n                    else if (!dot_printed_w)\n                    {\n                        dot_printed_w = true;\n                        ss << \", ...\";\n                    }\n                }\n            }\n            if (m.elemsize == 4)\n            {\n                const float* row = m.row<float>(0);\n                for (int i = 0; i < m.w; i++)\n                {\n                    if (i < max_count / 2 || i >= m.w - max_count / 2)\n                    {\n                        if (i > 0)\n                        {\n                            ss << \", \";\n                        }\n                        ss << row[i];\n                    }\n                    else if (!dot_printed_w)\n                    {\n                        dot_printed_w = true;\n                        ss << \", ...\";\n                    }\n                }\n            }\n            ss << \"]\";\n        }\n        else if (m.dims == 2)\n        {\n            bool dot_printed_h = false;\n            ss << \"[\";\n            for (int j = 0; j < m.h; j++)\n            {\n                bool dot_printed_w = false;\n                if (j < max_count / 2 || j >= m.h - max_count / 
2)\n                {\n                    ss << \"[\";\n                    if (m.elemsize == 1)\n                    {\n                        const int8_t* row = m.row<int8_t>(j);\n                        for (int i = 0; i < m.w; i++)\n                        {\n                            if (i < max_count / 2 || i >= m.w - max_count / 2)\n                            {\n                                if (i > 0)\n                                {\n                                    ss << \", \";\n                                }\n                                ss << static_cast<int>(row[i]);\n                            }\n                            else if (!dot_printed_w)\n                            {\n                                dot_printed_w = true;\n                                ss << \", ...\";\n                            }\n                        }\n                    }\n                    if (m.elemsize == 4)\n                    {\n                        const float* row = m.row<float>(j);\n                        for (int i = 0; i < m.w; i++)\n                        {\n                            if (i < max_count / 2 || i >= m.w - max_count / 2)\n                            {\n                                if (i > 0)\n                                {\n                                    ss << \", \";\n                                }\n                                ss << row[i];\n                            }\n                            else if (!dot_printed_w)\n                            {\n                                dot_printed_w = true;\n                                ss << \", ...\";\n                            }\n                        }\n                    }\n                    ss << \"]\";\n                    if (j < m.h - 1)\n                    {\n                        ss << \"\\n\";\n                    }\n                }\n                else if (!dot_printed_h)\n                {\n                   
 dot_printed_h = true;\n                    ss << \"...\\n\";\n                }\n            }\n            ss << \"]\\n\";\n        }\n        else if (m.dims == 3)\n        {\n            bool dot_printed_c = false;\n            ss << \"[\";\n            for (int k = 0; k < m.c; k++)\n            {\n                bool dot_printed_h = false;\n                if (k < max_count / 2 || k >= m.c - max_count / 2)\n                {\n                    Mat channel = m.channel(k);\n                    if (k > 0)\n                    {\n                        ss << \" \";\n                    }\n                    ss << \"[\";\n                    for (int j = 0; j < channel.h; j++)\n                    {\n                        bool dot_printed_w = false;\n                        if (j < max_count / 2 || j >= channel.h - max_count / 2)\n                        {\n                            if (j > 0)\n                            {\n                                ss << \"  \";\n                            }\n                            ss << \"[\";\n                            if (m.elemsize == 1)\n                            {\n                                const int8_t* row = channel.row<int8_t>(j);\n                                for (int i = 0; i < channel.w; i++)\n                                {\n                                    if (i < max_count / 2 || i >= channel.w - max_count / 2)\n                                    {\n                                        if (i > 0)\n                                        {\n                                            ss << \", \";\n                                        }\n                                        ss << static_cast<int>(row[i]);\n                                    }\n                                    else if (!dot_printed_w)\n                                    {\n                                        dot_printed_w = true;\n                                        ss << \", ...\";\n      
                              }\n                                }\n                            }\n                            if (m.elemsize == 4)\n                            {\n                                const float* row = channel.row<float>(j);\n                                for (int i = 0; i < m.w; i++)\n                                {\n                                    if (i < max_count / 2 || i >= m.w - max_count / 2)\n                                    {\n                                        if (i > 0)\n                                        {\n                                            ss << \", \";\n                                        }\n                                        ss << row[i];\n                                    }\n                                    else if (!dot_printed_w)\n                                    {\n                                        dot_printed_w = true;\n                                        ss << \", ...\";\n                                    }\n                                }\n                            }\n                            ss << \"]\";\n                            if (j < channel.h - 1)\n                            {\n                                ss << \"\\n\";\n                            }\n                        }\n                        else if (!dot_printed_h)\n                        {\n                            dot_printed_h = true;\n                            ss << \"  ...\\n\";\n                        }\n                    } // for j\n                    ss << \"]\";\n                    if (k < m.c - 1)\n                    {\n                        ss << \"\\n\\n\";\n                    }\n                }\n                else if (!dot_printed_c)\n                {\n                    dot_printed_c = true;\n                    ss << \" ...\\n\";\n                }\n            } // for k\n            ss << \"]\\n\";\n        }\n        else if (m.dims == 
4)\n        {\n            bool dot_printed_c = false;\n            ss << \"[\";\n            for (int k = 0; k < m.c; k++)\n            {\n                bool dot_printed_d = false;\n                if (k < max_count / 2 || k >= m.c - max_count / 2)\n                {\n                    Mat channel = m.channel(k);\n                    if (k > 0)\n                    {\n                        ss << \" \";\n                    }\n                    ss << \"[\";\n                    for (int z = 0; z < channel.d; z++)\n                    {\n                        bool dot_printed_h = false;\n                        if (z < max_count / 2 || z >= channel.d - max_count / 2)\n                        {\n                            if (z > 0)\n                            {\n                                ss << \"  \";\n                            }\n                            ss << \"[\";\n                            for (int j = 0; j < channel.h; j++)\n                            {\n                                bool dot_printed_w = false;\n                                if (j < max_count / 2 || j >= channel.h - max_count / 2)\n                                {\n                                    if (j > 0)\n                                    {\n                                        ss << \"  \";\n                                    }\n                                    ss << \"[\";\n                                    if (m.elemsize == 1)\n                                    {\n                                        const int8_t* row = channel.depth(z).row<int8_t>(j);\n                                        for (int i = 0; i < channel.w; i++)\n                                        {\n                                            if (i < max_count / 2 || i >= channel.w - max_count / 2)\n                                            {\n                                                if (i > 0)\n                                                {\n            
                                        ss << \", \";\n                                                }\n                                                ss << static_cast<int>(row[i]);\n                                            }\n                                            else if (!dot_printed_w)\n                                            {\n                                                dot_printed_w = true;\n                                                ss << \", ...\";\n                                            }\n                                        }\n                                    }\n                                    if (m.elemsize == 4)\n                                    {\n                                        const float* row = channel.depth(z).row<float>(j);\n                                        for (int i = 0; i < m.w; i++)\n                                        {\n                                            if (i < max_count / 2 || i >= m.w - max_count / 2)\n                                            {\n                                                if (i > 0)\n                                                {\n                                                    ss << \", \";\n                                                }\n                                                ss << row[i];\n                                            }\n                                            else if (!dot_printed_w)\n                                            {\n                                                dot_printed_w = true;\n                                                ss << \", ...\";\n                                            }\n                                        }\n                                    }\n                                    ss << \"]\";\n                                    if (j < channel.h - 1)\n                                    {\n                                        ss << \"\\n\";\n          
                          }\n                                }\n                                else if (!dot_printed_h)\n                                {\n                                    dot_printed_h = true;\n                                    ss << \"  ...\\n\";\n                                }\n                            } // for j\n                            ss << \"]\";\n                            if (z < channel.d - 1)\n                            {\n                                ss << \"\\n\";\n                            }\n                        }\n                        else if (!dot_printed_d)\n                        {\n                            dot_printed_d = true;\n                            ss << \" ...\\n\";\n                        }\n                    } // for z\n                    ss << \"]\";\n                    if (k < m.c - 1)\n                    {\n                        ss << \"\\n\\n\";\n                    }\n                }\n                else if (!dot_printed_c)\n                {\n                    dot_printed_c = true;\n                    ss << \" ...\\n\";\n                }\n            } // for k\n            ss << \"]\\n\";\n        }\n        return ss.str();\n    });\n\n    py::enum_<ncnn::Mat::PixelType>(mat, \"PixelType\")\n    .value(\"PIXEL_CONVERT_SHIFT\", ncnn::Mat::PixelType::PIXEL_CONVERT_SHIFT)\n    .value(\"PIXEL_FORMAT_MASK\", ncnn::Mat::PixelType::PIXEL_FORMAT_MASK)\n    .value(\"PIXEL_CONVERT_MASK\", ncnn::Mat::PixelType::PIXEL_CONVERT_MASK)\n\n    .value(\"PIXEL_RGB\", ncnn::Mat::PixelType::PIXEL_RGB)\n    .value(\"PIXEL_BGR\", ncnn::Mat::PixelType::PIXEL_BGR)\n    .value(\"PIXEL_GRAY\", ncnn::Mat::PixelType::PIXEL_GRAY)\n    .value(\"PIXEL_RGBA\", ncnn::Mat::PixelType::PIXEL_RGBA)\n    .value(\"PIXEL_BGRA\", ncnn::Mat::PixelType::PIXEL_BGRA)\n\n    .value(\"PIXEL_RGB2BGR\", ncnn::Mat::PixelType::PIXEL_RGB2BGR)\n    .value(\"PIXEL_RGB2GRAY\", ncnn::Mat::PixelType::PIXEL_RGB2GRAY)\n   
 .value(\"PIXEL_RGB2RGBA\", ncnn::Mat::PixelType::PIXEL_RGB2RGBA)\n    .value(\"PIXEL_RGB2BGRA\", ncnn::Mat::PixelType::PIXEL_RGB2BGRA)\n\n    .value(\"PIXEL_BGR2RGB\", ncnn::Mat::PixelType::PIXEL_BGR2RGB)\n    .value(\"PIXEL_BGR2GRAY\", ncnn::Mat::PixelType::PIXEL_BGR2GRAY)\n    .value(\"PIXEL_BGR2RGBA\", ncnn::Mat::PixelType::PIXEL_BGR2RGBA)\n    .value(\"PIXEL_BGR2BGRA\", ncnn::Mat::PixelType::PIXEL_BGR2BGRA)\n\n    .value(\"PIXEL_GRAY2RGB\", ncnn::Mat::PixelType::PIXEL_GRAY2RGB)\n    .value(\"PIXEL_GRAY2BGR\", ncnn::Mat::PixelType::PIXEL_GRAY2BGR)\n    .value(\"PIXEL_GRAY2RGBA\", ncnn::Mat::PixelType::PIXEL_GRAY2RGBA)\n    .value(\"PIXEL_GRAY2BGRA\", ncnn::Mat::PixelType::PIXEL_GRAY2BGRA)\n\n    .value(\"PIXEL_RGBA2RGB\", ncnn::Mat::PixelType::PIXEL_RGBA2RGB)\n    .value(\"PIXEL_RGBA2BGR\", ncnn::Mat::PixelType::PIXEL_RGBA2BGR)\n    .value(\"PIXEL_RGBA2GRAY\", ncnn::Mat::PixelType::PIXEL_RGBA2GRAY)\n    .value(\"PIXEL_RGBA2BGRA\", ncnn::Mat::PixelType::PIXEL_RGBA2BGRA)\n\n    .value(\"PIXEL_BGRA2RGB\", ncnn::Mat::PixelType::PIXEL_BGRA2RGB)\n    .value(\"PIXEL_BGRA2BGR\", ncnn::Mat::PixelType::PIXEL_BGRA2BGR)\n    .value(\"PIXEL_BGRA2GRAY\", ncnn::Mat::PixelType::PIXEL_BGRA2GRAY)\n    .value(\"PIXEL_BGRA2RGBA\", ncnn::Mat::PixelType::PIXEL_BGRA2RGBA);\n\n    py::class_<Extractor>(m, \"Extractor\")\n    .def(\"__enter__\", [](Extractor& ex) -> Extractor& { return ex; })\n    .def(\"__exit__\", [](Extractor& ex, pybind11::args) {\n        ex.clear();\n    })\n    .def(\"clear\", &Extractor::clear)\n    .def(\"set_light_mode\", &Extractor::set_light_mode, py::arg(\"enable\"))\n    .def(\"set_blob_allocator\", &Extractor::set_blob_allocator, py::arg(\"allocator\"))\n    .def(\"set_workspace_allocator\", &Extractor::set_workspace_allocator, py::arg(\"allocator\"))\n#if NCNN_STRING\n    .def(\"input\", (int (Extractor::*)(const char*, const Mat&)) & Extractor::input, py::arg(\"blob_name\"), py::arg(\"in\"))\n    .def(\"extract\", (int (Extractor::*)(const char*, Mat&, 
int)) & Extractor::extract, py::arg(\"blob_name\"), py::arg(\"feat\"), py::arg(\"type\") = 0)\n    .def(\n    \"extract\", [](Extractor& ex, const char* blob_name, int type) {\n        ncnn::Mat feat;\n        int ret = ex.extract(blob_name, feat, type);\n        return py::make_tuple(ret, feat.clone());\n    },\n    py::arg(\"blob_name\"), py::arg(\"type\") = 0)\n#endif\n    .def(\"input\", (int (Extractor::*)(int, const Mat&)) & Extractor::input)\n    .def(\"extract\", (int (Extractor::*)(int, Mat&, int)) & Extractor::extract, py::arg(\"blob_index\"), py::arg(\"feat\"), py::arg(\"type\") = 0)\n    .def(\n    \"extract\", [](Extractor& ex, int blob_index, int type) {\n        ncnn::Mat feat;\n        int ret = ex.extract(blob_index, feat, type);\n        return py::make_tuple(ret, feat.clone());\n    },\n    py::arg(\"blob_index\"), py::arg(\"type\") = 0);\n\n    py::class_<Layer, PyLayer>(m, \"Layer\")\n    .def(py::init<>())\n    .def(\"load_param\", &Layer::load_param, py::arg(\"pd\"))\n    .def(\"load_model\", &Layer::load_model, py::arg(\"mb\"))\n    .def(\"create_pipeline\", &Layer::create_pipeline, py::arg(\"opt\"))\n    .def(\"destroy_pipeline\", &Layer::destroy_pipeline, py::arg(\"opt\"))\n    .def_readwrite(\"one_blob_only\", &Layer::one_blob_only)\n    .def_readwrite(\"support_inplace\", &Layer::support_inplace)\n    .def_readwrite(\"support_vulkan\", &Layer::support_vulkan)\n    .def_readwrite(\"support_packing\", &Layer::support_packing)\n    .def_readwrite(\"support_bf16_storage\", &Layer::support_bf16_storage)\n    .def_readwrite(\"support_fp16_storage\", &Layer::support_fp16_storage)\n    .def_readwrite(\"support_vulkan_packing\", &Layer::support_vulkan_packing)\n    .def_readwrite(\"support_any_packing\", &Layer::support_any_packing)\n    .def_readwrite(\"support_vulkan_any_packing\", &Layer::support_vulkan_any_packing)\n    .def(\"forward\", (int (Layer::*)(const std::vector<Mat>&, std::vector<Mat>&, const Option&) const) & Layer::forward,\n      
   py::arg(\"bottom_blobs\"), py::arg(\"top_blobs\"), py::arg(\"opt\"))\n    .def(\"forward\", (int (Layer::*)(const Mat&, Mat&, const Option&) const) & Layer::forward,\n         py::arg(\"bottom_blob\"), py::arg(\"top_blob\"), py::arg(\"opt\"))\n    .def(\"forward_inplace\", (int (Layer::*)(std::vector<Mat>&, const Option&) const) & Layer::forward_inplace,\n         py::arg(\"bottom_top_blobs\"), py::arg(\"opt\"))\n    .def(\"forward_inplace\", (int (Layer::*)(Mat&, const Option&) const) & Layer::forward_inplace,\n         py::arg(\"bottom_top_blob\"), py::arg(\"opt\"))\n    .def_readwrite(\"typeindex\", &Layer::typeindex)\n#if NCNN_STRING\n    .def_readwrite(\"type\", &Layer::type)\n    .def_readwrite(\"name\", &Layer::name)\n#endif // NCNN_STRING\n    .def_readwrite(\"bottoms\", &Layer::bottoms)\n    .def_readwrite(\"tops\", &Layer::tops)\n    .def_readwrite(\"bottom_shapes\", &Layer::bottom_shapes)\n    .def_readwrite(\"top_shapes\", &Layer::top_shapes);\n\n    py::class_<Net>(m, \"Net\")\n    .def(py::init<>())\n    .def_readwrite(\"opt\", &Net::opt)\n    .def(\"__enter__\", [](Net& net) -> Net& { return net; })\n    .def(\"__exit__\", [](Net& net, pybind11::args) {\n        net.clear();\n    })\n\n#if NCNN_VULKAN\n    .def(\"set_vulkan_device\", (void (Net::*)(int)) & Net::set_vulkan_device, py::arg(\"device_index\"))\n    .def(\"set_vulkan_device\", (void (Net::*)(const VulkanDevice*)) & Net::set_vulkan_device, py::arg(\"vkdev\"))\n    .def(\"vulkan_device\", &Net::vulkan_device, py::return_value_policy::reference_internal)\n#endif // NCNN_VULKAN\n\n#if NCNN_STRING\n    .def(\n    \"register_custom_layer\", [](Net& net, const char* type, const std::function<ncnn::Layer*()>& creator, const std::function<void(ncnn::Layer*)>& destroyer) {\n        if (g_layer_factroy_index == g_layer_factroys.size())\n        {\n            std::stringstream ss;\n            ss << \"python version only support \" << g_layer_factroys.size() << \" custom layers now\";\n           
 pybind11::pybind11_fail(ss.str());\n        }\n        LayerFactory& lf = g_layer_factroys[g_layer_factroy_index++];\n        lf.name = type;\n        lf.creator = creator;\n        lf.destroyer = destroyer;\n        return net.register_custom_layer(lf.name.c_str(), lf.creator_func, lf.destroyer_func);\n    },\n    py::arg(\"type\"), py::arg(\"creator\"), py::arg(\"destroyer\"))\n#endif //NCNN_STRING\n    .def(\n    \"register_custom_layer\", [](Net& net, int index, const std::function<ncnn::Layer*()>& creator, const std::function<void(ncnn::Layer*)>& destroyer) {\n        if (g_layer_factroy_index == g_layer_factroys.size())\n        {\n            std::stringstream ss;\n            ss << \"python version only support \" << g_layer_factroys.size() << \" custom layers now\";\n            pybind11::pybind11_fail(ss.str());\n        }\n        LayerFactory& lf = g_layer_factroys[g_layer_factroy_index++];\n        lf.index = index;\n        lf.creator = creator;\n        lf.destroyer = destroyer;\n        return net.register_custom_layer(index, lf.creator_func, lf.destroyer_func);\n    },\n    py::arg(\"index\"), py::arg(\"creator\"), py::arg(\"destroyer\"))\n#if NCNN_STRING\n    .def(\"load_param\", (int (Net::*)(const DataReader&)) & Net::load_param, py::arg(\"dr\"))\n#endif // NCNN_STRING\n    .def(\"load_param_bin\", (int (Net::*)(const DataReader&)) & Net::load_param_bin, py::arg(\"dr\"))\n    .def(\"load_model\", (int (Net::*)(const DataReader&)) & Net::load_model, py::arg(\"dr\"))\n\n#if NCNN_STDIO\n#if NCNN_STRING\n#if _WIN32\n    .def(\n    \"load_param\", [](Net& self, const std::wstring& path) {\n        return self.load_param(path.c_str());\n    },\n    py::arg(\"protopath\"))\n#else\n    .def(\"load_param\", (int (Net::*)(const char*)) & Net::load_param, py::arg(\"protopath\"))\n#endif\n    .def(\"load_param_mem\", (int (Net::*)(const char*)) & Net::load_param_mem, py::arg(\"mem\"))\n#endif // NCNN_STRING\n#if _WIN32\n    .def(\n    \"load_param_bin\", 
[](Net& self, const std::wstring& path) {\n        return self.load_param_bin(path.c_str());\n    },\n    py::arg(\"protopath\"))\n    .def(\n    \"load_model\", [](Net& self, const std::wstring& path) {\n        return self.load_model(path.c_str());\n    },\n    py::arg(\"modelpath\"))\n#else\n    .def(\"load_param_bin\", (int (Net::*)(const char*)) & Net::load_param_bin, py::arg(\"protopath\"))\n    .def(\"load_model\", (int (Net::*)(const char*)) & Net::load_model, py::arg(\"modelpath\"))\n#endif\n    .def(\n    \"load_model_mem\", [](Net& net, const char* mem) {\n        const unsigned char* _mem = (const unsigned char*)mem;\n        DataReaderFromMemoryCopy dr(_mem);\n        // propagate the status code, like the other load_* bindings\n        return net.load_model(dr);\n    },\n    py::arg(\"mem\"))\n#endif // NCNN_STDIO\n\n    .def(\"clear\", &Net::clear)\n    .def(\"create_extractor\", &Net::create_extractor, py::keep_alive<0, 1>()) //net should be kept alive until returned ex is freed by gc\n\n    .def(\"input_indexes\", &Net::input_indexes, py::return_value_policy::reference)\n    .def(\"output_indexes\", &Net::output_indexes, py::return_value_policy::reference)\n#if NCNN_STRING\n    .def(\"input_names\", &Net::input_names, py::return_value_policy::reference)\n    .def(\"output_names\", &Net::output_names, py::return_value_policy::reference)\n#endif // NCNN_STRING\n\n    .def(\"blobs\", &Net::blobs, py::return_value_policy::reference_internal)\n    .def(\"layers\", &Net::layers, py::return_value_policy::reference_internal);\n\n    py::enum_<ncnn::BorderType>(m, \"BorderType\")\n    .value(\"BORDER_CONSTANT\", ncnn::BorderType::BORDER_CONSTANT)\n    .value(\"BORDER_REPLICATE\", ncnn::BorderType::BORDER_REPLICATE);\n\n    m.def(\"cpu_support_arm_neon\", &cpu_support_arm_neon);\n    m.def(\"cpu_support_arm_vfpv4\", &cpu_support_arm_vfpv4);\n    m.def(\"cpu_support_arm_asimdhp\", &cpu_support_arm_asimdhp);\n    m.def(\"cpu_support_x86_avx2\", &cpu_support_x86_avx2);\n    m.def(\"cpu_support_x86_avx\", &cpu_support_x86_avx);\n    
m.def(\"get_cpu_count\", &get_cpu_count);\n    m.def(\"get_little_cpu_count\", &get_little_cpu_count);\n    m.def(\"get_big_cpu_count\", &get_big_cpu_count);\n    m.def(\"get_physical_cpu_count\", &get_physical_cpu_count);\n    m.def(\"get_physical_little_cpu_count\", &get_physical_little_cpu_count);\n    m.def(\"get_physical_big_cpu_count\", &get_physical_big_cpu_count);\n    m.def(\"get_cpu_powersave\", &get_cpu_powersave);\n    m.def(\"set_cpu_powersave\", &set_cpu_powersave, py::arg(\"powersave\"));\n    m.def(\"get_omp_num_threads\", &get_omp_num_threads);\n    m.def(\"set_omp_num_threads\", &set_omp_num_threads, py::arg(\"num_threads\"));\n    m.def(\"get_omp_dynamic\", &get_omp_dynamic);\n    m.def(\"set_omp_dynamic\", &set_omp_dynamic, py::arg(\"dynamic\"));\n    m.def(\"get_omp_thread_num\", &get_omp_thread_num);\n    m.def(\"get_kmp_blocktime\", &get_kmp_blocktime);\n    m.def(\"set_kmp_blocktime\", &set_kmp_blocktime, py::arg(\"time_ms\"));\n\n    m.def(\"copy_make_border\", &copy_make_border,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"top\"), py::arg(\"bottom\"), py::arg(\"left\"), py::arg(\"right\"),\n          py::arg(\"type\"), py::arg(\"v\"), py::arg(\"opt\") = Option());\n    m.def(\n        \"copy_make_border\",\n    [](const Mat& src, int top, int bottom, int left, int right, int type, float v, const Option& opt) {\n        Mat dst;\n        copy_make_border(src, dst, top, bottom, left, right, type, v, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"top\"), py::arg(\"bottom\"), py::arg(\"left\"), py::arg(\"right\"),\n    py::arg(\"type\"), py::arg(\"v\"), py::arg(\"opt\") = Option());\n\n    m.def(\"copy_make_border_3d\", &copy_make_border_3d,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"top\"), py::arg(\"bottom\"), py::arg(\"left\"), py::arg(\"right\"), py::arg(\"front\"), py::arg(\"behind\"),\n          py::arg(\"type\"), py::arg(\"v\"), py::arg(\"opt\") = Option());\n    
m.def(\n        \"copy_make_border_3d\",\n    [](const Mat& src, int top, int bottom, int left, int right, int front, int behind, int type, float v, const Option& opt) {\n        Mat dst;\n        copy_make_border_3d(src, dst, top, bottom, left, right, front, behind, type, v, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"top\"), py::arg(\"bottom\"), py::arg(\"left\"), py::arg(\"right\"), py::arg(\"front\"), py::arg(\"behind\"),\n    py::arg(\"type\"), py::arg(\"v\"), py::arg(\"opt\") = Option());\n\n    m.def(\"copy_cut_border\", &copy_cut_border,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"top\"), py::arg(\"bottom\"), py::arg(\"left\"), py::arg(\"right\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"copy_cut_border\",\n    [](const Mat& src, int top, int bottom, int left, int right, const Option& opt) {\n        Mat dst;\n        copy_cut_border(src, dst, top, bottom, left, right, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"top\"), py::arg(\"bottom\"), py::arg(\"left\"), py::arg(\"right\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"resize_nearest\", &resize_nearest,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"w\"), py::arg(\"h\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"resize_nearest\",\n    [](const Mat& src, int w, int h, const Option& opt) {\n        Mat dst;\n        resize_nearest(src, dst, w, h, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"w\"), py::arg(\"h\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"resize_bilinear\", &resize_bilinear,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"w\"), py::arg(\"h\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"resize_bilinear\",\n    [](const Mat& src, int w, int h, const Option& opt) {\n        Mat dst;\n        resize_bilinear(src, dst, w, h, opt);\n        return dst;\n    },\n    
py::arg(\"src\"),\n    py::arg(\"w\"), py::arg(\"h\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"resize_bicubic\", &resize_bicubic,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"w\"), py::arg(\"h\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"resize_bicubic\",\n    [](const Mat& src, int w, int h, const Option& opt) {\n        Mat dst;\n        resize_bicubic(src, dst, w, h, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"w\"), py::arg(\"h\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"convert_packing\", &convert_packing,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"elempack\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"convert_packing\",\n    [](const Mat& src, int elempack, const Option& opt) {\n        Mat dst;\n        convert_packing(src, dst, elempack, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"elempack\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"flatten\", &flatten,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"flatten\",\n    [](const Mat& src, const Option& opt) {\n        Mat dst;\n        flatten(src, dst, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"cast_float32_to_float16\", &cast_float32_to_float16,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"cast_float32_to_float16\",\n    [](const Mat& src, const Option& opt) {\n        Mat dst;\n        cast_float32_to_float16(src, dst, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"cast_float16_to_float32\", &cast_float16_to_float32,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"cast_float16_to_float32\",\n    [](const Mat& src, const 
Option& opt) {\n        Mat dst;\n        cast_float16_to_float32(src, dst, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"cast_int8_to_float32\", &cast_int8_to_float32,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"cast_int8_to_float32\",\n    [](const Mat& src, const Option& opt) {\n        Mat dst;\n        cast_int8_to_float32(src, dst, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"cast_float32_to_bfloat16\", &cast_float32_to_bfloat16,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"cast_float32_to_bfloat16\",\n    [](const Mat& src, const Option& opt) {\n        Mat dst;\n        cast_float32_to_bfloat16(src, dst, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"cast_bfloat16_to_float32\", &cast_bfloat16_to_float32,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"cast_bfloat16_to_float32\",\n    [](const Mat& src, const Option& opt) {\n        Mat dst;\n        cast_bfloat16_to_float32(src, dst, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"opt\") = Option());\n\n    m.def(\"quantize_to_int8\", &quantize_to_int8,\n          py::arg(\"src\"), py::arg(\"dst\"),\n          py::arg(\"scale_data\"),\n          py::arg(\"opt\") = Option());\n    m.def(\n        \"quantize_to_int8\",\n    [](const Mat& src, const Mat& scale_data, const Option& opt) {\n        Mat dst;\n        quantize_to_int8(src, dst, scale_data, opt);\n        return dst;\n    },\n    py::arg(\"src\"),\n    py::arg(\"scale_data\"),\n    py::arg(\"opt\") = Option());\n\n#if NCNN_STRING\n    m.def(\"layer_to_index\", &layer_to_index, py::arg(\"type\"));\n    m.def(\n        \"create_layer\",\n    [](const 
char* type) {\n        return static_cast<Layer*>(create_layer(type));\n    },\n    py::arg(\"type\"));\n    m.def(\n        \"create_layer\",\n    [](int index) {\n        return static_cast<Layer*>(create_layer(index));\n    },\n    py::arg(\"index\"));\n#endif //NCNN_STRING\n\n#if NCNN_VULKAN\n    m.def(\"create_gpu_instance\", &create_gpu_instance, py::arg(\"driver_path\") = ((const char*)0));\n    m.def(\"destroy_gpu_instance\", &destroy_gpu_instance);\n    m.def(\"get_gpu_count\", &get_gpu_count);\n    m.def(\"get_default_gpu_index\", &get_default_gpu_index);\n    m.def(\"get_gpu_info\", &get_gpu_info, py::arg(\"device_index\") = 0, py::return_value_policy::reference);\n    m.def(\"get_gpu_device\", &get_gpu_device, py::arg(\"device_index\") = 0, py::return_value_policy::reference);\n\n    py::class_<VkBufferMemory>(m, \"VkBufferMemory\")\n    .def_readwrite(\"offset\", &VkBufferMemory::offset)\n    .def_readwrite(\"capacity\", &VkBufferMemory::capacity)\n    .def_readwrite(\"refcount\", &VkBufferMemory::refcount);\n\n    py::class_<VkImageMemory>(m, \"VkImageMemory\")\n    .def_readwrite(\"width\", &VkImageMemory::width)\n    .def_readwrite(\"height\", &VkImageMemory::height)\n    .def_readwrite(\"depth\", &VkImageMemory::depth)\n    .def_readwrite(\"refcount\", &VkImageMemory::refcount);\n\n    py::class_<VkAllocator, PyVkAllocator<> >(m, \"VkAllocator\")\n    .def_readonly(\"vkdev\", &VkAllocator::vkdev)\n    .def_readwrite(\"buffer_memory_type_index\", &VkAllocator::buffer_memory_type_index)\n    .def_readwrite(\"image_memory_type_index\", &VkAllocator::image_memory_type_index)\n    .def_readwrite(\"mappable\", &VkAllocator::mappable)\n    .def_readwrite(\"coherent\", &VkAllocator::coherent);\n\n    py::class_<VkBlobAllocator, VkAllocator, PyVkAllocatorOther<VkBlobAllocator> >(m, \"VkBlobAllocator\")\n    .def(py::init<const VulkanDevice*>())\n    .def(\"clear\", &VkBlobAllocator::clear)\n    .def(\"fastMalloc\", (VkBufferMemory * 
(VkBlobAllocator::*)(size_t size)) & VkBlobAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    .def(\"fastFree\", (void (VkBlobAllocator::*)(VkBufferMemory * ptr)) & VkBlobAllocator::fastFree)\n    .def(\"fastMalloc\", (VkImageMemory * (VkBlobAllocator::*)(int, int, int, size_t, int)) & VkBlobAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    .def(\"fastFree\", (void (VkBlobAllocator::*)(VkImageMemory * ptr)) & VkBlobAllocator::fastFree);\n\n    py::class_<VkWeightAllocator, VkAllocator, PyVkAllocatorOther<VkWeightAllocator> >(m, \"VkWeightAllocator\")\n    .def(py::init<const VulkanDevice*>())\n    .def(\"clear\", &VkWeightAllocator::clear)\n    .def(\"fastMalloc\", (VkBufferMemory * (VkWeightAllocator::*)(size_t size)) & VkWeightAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    .def(\"fastFree\", (void (VkWeightAllocator::*)(VkBufferMemory * ptr)) & VkWeightAllocator::fastFree)\n    .def(\"fastMalloc\", (VkImageMemory * (VkWeightAllocator::*)(int, int, int, size_t, int)) & VkWeightAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    .def(\"fastFree\", (void (VkWeightAllocator::*)(VkImageMemory * ptr)) & VkWeightAllocator::fastFree);\n\n    py::class_<VkStagingAllocator, VkAllocator, PyVkAllocatorOther<VkStagingAllocator> >(m, \"VkStagingAllocator\")\n    .def(py::init<const VulkanDevice*>())\n    .def(\"set_size_compare_ratio\", &VkStagingAllocator::set_size_compare_ratio)\n    .def(\"clear\", &VkStagingAllocator::clear)\n    .def(\"fastMalloc\", (VkBufferMemory * (VkStagingAllocator::*)(size_t size)) & VkStagingAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    .def(\"fastFree\", (void (VkStagingAllocator::*)(VkBufferMemory * ptr)) & VkStagingAllocator::fastFree)\n    .def(\"fastMalloc\", (VkImageMemory * (VkStagingAllocator::*)(int, int, int, size_t, int)) & VkStagingAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    
.def(\"fastFree\", (void (VkStagingAllocator::*)(VkImageMemory * ptr)) & VkStagingAllocator::fastFree);\n\n    py::class_<VkWeightStagingAllocator, VkAllocator, PyVkAllocatorOther<VkWeightStagingAllocator> >(m, \"VkWeightStagingAllocator\")\n    .def(py::init<const VulkanDevice*>())\n    .def(\"fastMalloc\", (VkBufferMemory * (VkWeightStagingAllocator::*)(size_t size)) & VkWeightStagingAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    .def(\"fastFree\", (void (VkWeightStagingAllocator::*)(VkBufferMemory * ptr)) & VkWeightStagingAllocator::fastFree)\n    .def(\"fastMalloc\", (VkImageMemory * (VkWeightStagingAllocator::*)(int, int, int, size_t, int)) & VkWeightStagingAllocator::fastMalloc, py::return_value_policy::reference_internal)\n    .def(\"fastFree\", (void (VkWeightStagingAllocator::*)(VkImageMemory * ptr)) & VkWeightStagingAllocator::fastFree);\n\n    py::class_<GpuInfo>(m, \"GpuInfo\")\n    .def(py::init<>())\n    .def(\"api_version\", &GpuInfo::api_version)\n    .def(\"driver_version\", &GpuInfo::driver_version)\n    .def(\"vendor_id\", &GpuInfo::vendor_id)\n    .def(\"device_id\", &GpuInfo::device_id)\n    .def(\"pipeline_cache_uuid\", [](GpuInfo& gpuinfo) {\n        // 1-D byte buffer of VK_UUID_SIZE elements: the stride is the size of one element\n        return py::memoryview::from_buffer(gpuinfo.pipeline_cache_uuid(), {VK_UUID_SIZE}, {sizeof(uint8_t)});\n    })\n    .def(\"type\", &GpuInfo::type)\n    .def(\"device_name\", &GpuInfo::device_name);\n\n    py::class_<VulkanDevice>(m, \"VulkanDevice\")\n    .def(py::init<int>(), py::arg(\"device_index\") = 0)\n    .def(\n    \"info\", [](VulkanDevice& dev) {\n        return &dev.info;\n    },\n    py::return_value_policy::reference_internal)\n    .def(\"acquire_blob_allocator\", &VulkanDevice::acquire_blob_allocator)\n    .def(\"reclaim_blob_allocator\", &VulkanDevice::reclaim_blob_allocator, py::arg(\"vkallocator\"))\n    .def(\"acquire_staging_allocator\", &VulkanDevice::acquire_staging_allocator)\n    .def(\"reclaim_staging_allocator\", 
&VulkanDevice::reclaim_staging_allocator, py::arg(\"vkallocator\"))\n    .def(\"get_heap_budget\", &VulkanDevice::get_heap_budget);\n#endif // NCNN_VULKAN\n\n    m.doc() = R\"pbdoc(\n        ncnn python wrapper\n        -----------------------\n        .. currentmodule:: pyncnn\n        .. autosummary::\n           :toctree: _generate\n    )pbdoc\";\n\n#ifdef VERSION_INFO\n    m.attr(\"__version__\") = VERSION_INFO;\n#else\n    m.attr(\"__version__\") = \"dev\";\n#endif\n}\n"
  },
  {
    "path": "python/src/pybind11_allocator.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef PYBIND11_NCNN_ALLOCATOR_H\n#define PYBIND11_NCNN_ALLOCATOR_H\n\n#include <allocator.h>\n\ntemplate<class Base = ncnn::Allocator>\nclass PyAllocator : public Base\n{\npublic:\n    using Base::Base; // Inherit constructors\n    void* fastMalloc(size_t size) override\n    {\n        PYBIND11_OVERRIDE_PURE(void*, Base, fastMalloc, size);\n    }\n    void fastFree(void* ptr) override\n    {\n        PYBIND11_OVERRIDE_PURE(void, Base, fastFree, ptr);\n    }\n};\n\ntemplate<class Other>\nclass PyAllocatorOther : public PyAllocator<Other>\n{\npublic:\n    using PyAllocator<Other>::PyAllocator;\n    void* fastMalloc(size_t size) override\n    {\n        PYBIND11_OVERRIDE(void*, Other, fastMalloc, size);\n    }\n    void fastFree(void* ptr) override\n    {\n        PYBIND11_OVERRIDE(void, Other, fastFree, ptr);\n    }\n};\n\n#if NCNN_VULKAN\ntemplate<class Base = ncnn::VkAllocator>\nclass PyVkAllocator : public Base\n{\npublic:\n    using Base::Base; // Inherit constructors\n    void clear() override\n    {\n        PYBIND11_OVERRIDE(void, Base, clear, );\n    }\n    ncnn::VkBufferMemory* fastMalloc(size_t size) override\n    {\n        PYBIND11_OVERRIDE_PURE(ncnn::VkBufferMemory*, Base, fastMalloc, size);\n    }\n    void fastFree(ncnn::VkBufferMemory* ptr) override\n    {\n        PYBIND11_OVERRIDE_PURE(void, Base, fastFree, ptr);\n    }\n    int flush(ncnn::VkBufferMemory* ptr) override\n    {\n        PYBIND11_OVERRIDE(int, Base, flush, ptr);\n    }\n    int invalidate(ncnn::VkBufferMemory* ptr) override\n    {\n        PYBIND11_OVERRIDE(int, Base, invalidate, ptr);\n    }\n};\n\ntemplate<class Other>\nclass PyVkAllocatorOther : public PyVkAllocator<Other>\n{\npublic:\n    using PyVkAllocator<Other>::PyVkAllocator;\n    void clear() override\n    {\n        PYBIND11_OVERRIDE(void, Other, clear, );\n    }\n    ncnn::VkBufferMemory* fastMalloc(size_t size) override\n    {\n       
 PYBIND11_OVERRIDE(ncnn::VkBufferMemory*, Other, fastMalloc, size);\n    }\n    void fastFree(ncnn::VkBufferMemory* ptr) override\n    {\n        PYBIND11_OVERRIDE(void, Other, fastFree, ptr);\n    }\n};\n\ntemplate<class Base = ncnn::VkBlobAllocator>\nclass PyVkBlobAllocator : public Base\n{\npublic:\n    using Base::Base; // Inherit constructors\n    void clear() override\n    {\n        PYBIND11_OVERRIDE(void, Base, clear, );\n    }\n    ncnn::VkImageMemory* fastMalloc(int width, int height, VkFormat format) override\n    {\n        PYBIND11_OVERRIDE_PURE(ncnn::VkImageMemory*, Base, fastMalloc, width, height, format);\n    }\n    void fastFree(ncnn::VkImageMemory* ptr) override\n    {\n        PYBIND11_OVERRIDE_PURE(void, Base, fastFree, ptr);\n    }\n};\n\n//template<class Other>\n//class PyVkImageAllocatorOther : public PyVkImageAllocator<Other>\n//{\n//public:\n//    using PyVkImageAllocator<Other>::PyVkImageAllocator;\n//    ncnn::VkImageMemory* fastMalloc(int width, int height,\n//                                    VkFormat format) override\n//    {\n//        PYBIND11_OVERRIDE(ncnn::VkImageMemory*, Other, fastMalloc, width, height, format);\n//    }\n//    void fastFree(ncnn::VkImageMemory* ptr) override\n//    {\n//        PYBIND11_OVERRIDE(void, Other, fastFree, ptr);\n//    }\n//};\n#endif // NCNN_VULKAN\n\n#endif\n"
  },
  {
    "path": "python/src/pybind11_bind.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef PYBIND11_NCNN_BIND_H\n#define PYBIND11_NCNN_BIND_H\n\n#include <pybind11/functional.h>\n\n///////////////////////////////////////////////////////////////////////////////////////////////////////////////\n// virtual function pass by reference by https://github.com/pybind/pybind11/issues/2033\n#define PYBIND11_OVERRIDE_REFERENCE_IMPL(ret_type, cname, name, ...)                                 \\\n    do                                                                                               \\\n    {                                                                                                \\\n        pybind11::gil_scoped_acquire gil;                                                            \\\n        pybind11::function override = pybind11::get_override(static_cast<const cname*>(this), name); \\\n        if (override)                                                                                \\\n        {                                                                                            \\\n            auto o = override.operator()<pybind11::return_value_policy::reference>(__VA_ARGS__);     \\\n            if (pybind11::detail::cast_is_temporary_value_reference<ret_type>::value)                \\\n            {                                                                                        \\\n                static pybind11::detail::override_caster_t<ret_type> caster;                         \\\n                return pybind11::detail::cast_ref<ret_type>(std::move(o), caster);                   \\\n            }                                                                                        \\\n            else                                                                                     \\\n                return pybind11::detail::cast_safe<ret_type>(std::move(o));                          \\\n        }                             
                                                               \\\n    } while (false)\n\n#define PYBIND11_OVERRIDE_REFERENCE_NAME(ret_type, cname, name, fn, ...)                                    \\\n    do                                                                                                      \\\n    {                                                                                                       \\\n        PYBIND11_OVERRIDE_REFERENCE_IMPL(PYBIND11_TYPE(ret_type), PYBIND11_TYPE(cname), name, __VA_ARGS__); \\\n        return cname::fn(__VA_ARGS__);                                                                      \\\n    } while (false)\n\n#define PYBIND11_OVERRIDE_REFERENCE(ret_type, cname, fn, ...) \\\n    PYBIND11_OVERRIDE_REFERENCE_NAME(PYBIND11_TYPE(ret_type), PYBIND11_TYPE(cname), #fn, fn, __VA_ARGS__)\n///////////////////////////////////////////////////////////////////////////////////////////////////////////////\n\n#endif\n"
  },
  {
    "path": "python/src/pybind11_datareader.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef PYBIND11_NCNN_DATAREADER_H\n#define PYBIND11_NCNN_DATAREADER_H\n\n#include <datareader.h>\n\nclass DataReaderFromEmpty : public ncnn::DataReader\n{\npublic:\n#if NCNN_STRING\n    virtual int scan(const char* format, void* p) const\n    {\n        return 0;\n    }\n#endif // NCNN_STRING\n    virtual size_t read(void* buf, size_t size) const\n    {\n        memset(buf, 0, size);\n        return size;\n    }\n};\n\ntemplate<class Base = ncnn::DataReader>\nclass PyDataReader : public Base\n{\npublic:\n    using Base::Base; // Inherit constructors\n#if NCNN_STRING\n    int scan(const char* format, void* p) const override\n    {\n        PYBIND11_OVERRIDE(int, Base, scan, format, p);\n    }\n#endif // NCNN_STRING\n    size_t read(void* buf, size_t size) const override\n    {\n        PYBIND11_OVERRIDE(size_t, Base, read, buf, size);\n    }\n};\n\ntemplate<class Other>\nclass PyDataReaderOther : public PyDataReader<Other>\n{\npublic:\n    using PyDataReader<Other>::PyDataReader;\n#if NCNN_STRING\n    int scan(const char* format, void* p) const override\n    {\n        PYBIND11_OVERRIDE(int, Other, scan, format, p);\n    }\n#endif // NCNN_STRING\n    size_t read(void* buf, size_t size) const override\n    {\n        PYBIND11_OVERRIDE(size_t, Other, read, buf, size);\n    }\n};\n\n#endif\n"
  },
  {
    "path": "python/src/pybind11_layer.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef PYBIND11_NCNN_LAYER_H\n#define PYBIND11_NCNN_LAYER_H\n\n#include <layer.h>\n#include \"pybind11_bind.h\"\n\nclass PyLayer : public ncnn::Layer\n{\npublic:\n    virtual int load_param(const ncnn::ParamDict& pd)\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            load_param,\n            pd);\n    }\n\n    virtual int load_model(const ncnn::ModelBin& mb)\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            load_model,\n            mb);\n    }\n\n    virtual int create_pipeline(const ncnn::Option& opt)\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            create_pipeline,\n            opt);\n    }\n\n    virtual int destroy_pipeline(const ncnn::Option& opt)\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            destroy_pipeline,\n            opt);\n    }\n\npublic:\n    virtual int forward(const std::vector<ncnn::Mat>& bottom_blobs, std::vector<ncnn::Mat>& top_blobs, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward,\n            bottom_blobs,\n            top_blobs,\n            opt);\n    }\n    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward,\n            bottom_blob,\n            top_blob,\n            opt);\n    }\n\n    virtual int forward_inplace(std::vector<ncnn::Mat>& bottom_top_blobs, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward_inplace,\n            bottom_top_blobs,\n            opt);\n    }\n    virtual int 
forward_inplace(ncnn::Mat& bottom_top_blob, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward_inplace,\n            bottom_top_blob,\n            opt);\n    }\n\n#if NCNN_VULKAN\npublic:\n    virtual int upload_model(ncnn::VkTransfer& cmd, const ncnn::Option& opt)\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            upload_model,\n            cmd,\n            opt);\n    }\n\npublic:\n    virtual int forward(const std::vector<ncnn::VkMat>& bottom_blobs, std::vector<ncnn::VkMat>& top_blobs, ncnn::VkCompute& cmd, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward,\n            bottom_blobs,\n            top_blobs,\n            cmd,\n            opt);\n    }\n    virtual int forward(const ncnn::VkMat& bottom_blob, ncnn::VkMat& top_blob, ncnn::VkCompute& cmd, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward,\n            bottom_blob,\n            top_blob,\n            cmd,\n            opt);\n    }\n\n    virtual int forward_inplace(std::vector<ncnn::VkMat>& bottom_top_blobs, ncnn::VkCompute& cmd, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward_inplace,\n            bottom_top_blobs,\n            cmd,\n            opt);\n    }\n    virtual int forward_inplace(ncnn::VkMat& bottom_top_blob, ncnn::VkCompute& cmd, const ncnn::Option& opt) const\n    {\n        PYBIND11_OVERRIDE_REFERENCE(\n            int,\n            ncnn::Layer,\n            forward_inplace,\n            bottom_top_blob,\n            cmd,\n            opt);\n    }\n#endif // NCNN_VULKAN\n};\n\n#endif\n"
  },
  {
    "path": "python/src/pybind11_mat.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef PYBIND11_NCNN_MAT_H\n#define PYBIND11_NCNN_MAT_H\n\n#include <string>\n\n#include <pybind11/pybind11.h>\n\n#include <mat.h>\n\nnamespace py = pybind11;\n\nstd::string get_mat_format(const ncnn::Mat& m)\n{\n    std::string format;\n    if (m.elemsize == 4)\n    {\n        format = pybind11::format_descriptor<float>::format();\n    }\n    if (m.elemsize == 2)\n    {\n        // see https://docs.python.org/3/library/struct.html#format-characters\n        format = \"e\";\n    }\n    if (m.elemsize == 1)\n    {\n        format = pybind11::format_descriptor<int8_t>::format();\n    }\n    return format;\n}\n\n// possible values for format:\n// i (int32_t)\n// f (float)\n// d (double)\n// leave it to empty to use get_mat_format\npy::buffer_info to_buffer_info(ncnn::Mat& m, const std::string& format = \"\")\n{\n    if (m.elemsize != 1 && m.elemsize != 2 && m.elemsize != 4)\n    {\n        std::ostringstream ss;\n        ss << \"Convert ncnn.Mat to numpy.ndarray. Support only elemsize 1, 2, 4; but given \"\n           << m.elemsize;\n        py::pybind11_fail(ss.str());\n    }\n    if (m.elempack != 1)\n    {\n        std::ostringstream ss;\n        ss << \"Convert ncnn.Mat to numpy.ndarray. 
Support only elempack == 1, but \"\n           \"given \"\n           << m.elempack;\n        py::pybind11_fail(ss.str());\n    }\n    std::string _format(format);\n    if (_format.empty())\n    {\n        _format = get_mat_format(m);\n    }\n    std::vector<py::ssize_t> shape;\n    std::vector<py::ssize_t> strides;\n    if (m.dims == 1)\n    {\n        shape.push_back(m.w);\n        strides.push_back(m.elemsize);\n    }\n    else if (m.dims == 2)\n    {\n        shape.push_back(m.h);\n        shape.push_back(m.w);\n        strides.push_back(m.w * m.elemsize);\n        strides.push_back(m.elemsize);\n    }\n    else if (m.dims == 3)\n    {\n        shape.push_back(m.c);\n        shape.push_back(m.h);\n        shape.push_back(m.w);\n        strides.push_back(m.cstep * m.elemsize);\n        strides.push_back(m.w * m.elemsize);\n        strides.push_back(m.elemsize);\n    }\n    else if (m.dims == 4)\n    {\n        shape.push_back(m.c);\n        shape.push_back(m.d);\n        shape.push_back(m.h);\n        shape.push_back(m.w);\n        strides.push_back(m.cstep * m.elemsize);\n        strides.push_back(m.w * m.h * m.elemsize);\n        strides.push_back(m.w * m.elemsize);\n        strides.push_back(m.elemsize);\n    }\n    return py::buffer_info(m.data,     /* Pointer to buffer */\n                           m.elemsize, /* Size of one scalar */\n                           _format,    /* Python struct-style format descriptor */\n                           m.dims,     /* Number of dimensions */\n                           shape,      /* Buffer dimensions */\n                           strides     /* Strides (in bytes) for each index */\n                          );\n}\n\n#endif\n"
  },
  {
    "path": "python/src/pybind11_modelbin.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef PYBIND11_NCNN_MODELBIN_H\n#define PYBIND11_NCNN_MODELBIN_H\n\n#include <modelbin.h>\n\ntemplate<class Base = ncnn::ModelBin>\nclass PyModelBin : public Base\n{\npublic:\n    using Base::Base; // Inherit constructors\n    ncnn::Mat load(int w, int type) const override\n    {\n        PYBIND11_OVERRIDE(ncnn::Mat, Base, load, w, type);\n    }\n    //ncnn::Mat load(int w, int h, int type) const override {\n    //\tPYBIND11_OVERRIDE(ncnn::Mat, Base, load, w, h, type);\n    //}\n    //ncnn::Mat load(int w, int h, int c, int type) const override {\n    //\tPYBIND11_OVERRIDE(ncnn::Mat, Base, load, w, h, c, type);\n    //}\n};\n\ntemplate<class Other>\nclass PyModelBinOther : public PyModelBin<Other>\n{\npublic:\n    using PyModelBin<Other>::PyModelBin;\n    ncnn::Mat load(int w, int type) const override\n    {\n        PYBIND11_OVERRIDE(ncnn::Mat, Other, load, w, type);\n    }\n};\n\n#endif\n"
  },
  {
    "path": "python/tests/benchmark.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport time\nimport ncnn\n\nparam_root = \"../../benchmark\"\n\ng_warmup_loop_count = 8\ng_loop_count = 4\ng_enable_cooling_down = True\n\ng_vkdev = None\ng_blob_vkallocator = None\ng_staging_vkallocator = None\n\ng_blob_pool_allocator = ncnn.UnlockedPoolAllocator()\ng_workspace_pool_allocator = ncnn.PoolAllocator()\n\n\ndef benchmark(comment, _in, opt):\n    _in.fill(0.01)\n\n    g_blob_pool_allocator.clear()\n    g_workspace_pool_allocator.clear()\n\n    if opt.use_vulkan_compute:\n        g_blob_vkallocator.clear()\n        g_staging_vkallocator.clear()\n\n    net = ncnn.Net()\n    net.opt = opt\n\n    if net.opt.use_vulkan_compute:\n        net.set_vulkan_device(g_vkdev)\n\n    net.load_param(param_root + comment + \".param\")\n\n    dr = ncnn.DataReaderFromEmpty()\n    net.load_model(dr)\n\n    input_names = net.input_names()\n    output_names = net.output_names()\n\n    if g_enable_cooling_down:\n        time.sleep(10)\n\n    # warm up\n    for i in range(g_warmup_loop_count):\n        # test with statement\n        with net.create_extractor() as ex:\n            ex.input(input_names[0], _in)\n            ex.extract(output_names[0])\n\n    time_min = sys.float_info.max\n    time_max = -sys.float_info.max\n    time_avg = 0.0\n\n    for i in range(g_loop_count):\n        start = time.time()\n\n        # test net keep alive until ex freed\n        ex = net.create_extractor()\n        ex.input(input_names[0], _in)\n        ex.extract(output_names[0])\n\n        end = time.time()\n\n        timespan = end - start\n\n        time_min = timespan if timespan < time_min else time_min\n        time_max = timespan if timespan > time_max else time_max\n        time_avg += timespan\n\n    time_avg /= g_loop_count\n\n    print(\n        \"%20s  min = %7.2f  max = %7.2f  avg = %7.2f\"\n        % (comment, time_min * 1000, time_max * 1000, time_avg * 1000)\n    )\n\n\nif __name__ 
== \"__main__\":\n    loop_count = 4\n    num_threads = ncnn.get_cpu_count()\n    powersave = 0\n    gpu_device = -1\n    cooling_down = 1\n\n    argc = len(sys.argv)\n    if argc >= 2:\n        loop_count = int(sys.argv[1])\n    if argc >= 3:\n        num_threads = int(sys.argv[2])\n    if argc >= 4:\n        powersave = int(sys.argv[3])\n    if argc >= 5:\n        gpu_device = int(sys.argv[4])\n    if argc >= 6:\n        cooling_down = int(sys.argv[5])\n\n    use_vulkan_compute = gpu_device != -1\n\n    g_enable_cooling_down = cooling_down != 0\n\n    g_loop_count = loop_count\n\n    g_blob_pool_allocator.set_size_compare_ratio(0.0)\n    g_workspace_pool_allocator.set_size_compare_ratio(0.5)\n\n    if use_vulkan_compute:\n        g_warmup_loop_count = 10\n\n        g_vkdev = ncnn.get_gpu_device(gpu_device)\n\n        g_blob_vkallocator = ncnn.VkBlobAllocator(g_vkdev)\n        g_staging_vkallocator = ncnn.VkStagingAllocator(g_vkdev)\n\n    opt = ncnn.Option()\n    opt.lightmode = True\n    opt.num_threads = num_threads\n    opt.blob_allocator = g_blob_pool_allocator\n    opt.workspace_allocator = g_workspace_pool_allocator\n    if use_vulkan_compute:\n        opt.blob_vkallocator = g_blob_vkallocator\n        opt.workspace_vkallocator = g_blob_vkallocator\n        opt.staging_vkallocator = g_staging_vkallocator\n    opt.use_winograd_convolution = True\n    opt.use_sgemm_convolution = True\n    opt.use_int8_inference = True\n    opt.use_vulkan_compute = use_vulkan_compute\n    opt.use_fp16_packed = True\n    opt.use_fp16_storage = True\n    opt.use_fp16_arithmetic = True\n    opt.use_int8_storage = True\n    opt.use_int8_arithmetic = True\n    opt.use_packing_layout = True\n\n    ncnn.set_cpu_powersave(powersave)\n    ncnn.set_omp_dynamic(0)\n    ncnn.set_omp_num_threads(num_threads)\n\n    print(\"loop_count =\", loop_count)\n    print(\"num_threads =\", num_threads)\n    print(\"powersave =\", ncnn.get_cpu_powersave())\n    print(\"gpu_device =\", gpu_device)\n   
 print(\"cooling_down =\", g_enable_cooling_down)\n\n    benchmark(\"squeezenet\", ncnn.Mat((227, 227, 3)), opt)\n    benchmark(\"squeezenet_int8\", ncnn.Mat((227, 227, 3)), opt)\n    benchmark(\"mobilenet\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"mobilenet_int8\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"mobilenet_v2\", ncnn.Mat((224, 224, 3)), opt)\n    # benchmark(\"mobilenet_v2_int8\", ncnn.Mat(w=224, h=224, c=3), opt)\n    benchmark(\"mobilenet_v3\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"shufflenet\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"shufflenet_v2\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"mnasnet\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"proxylessnasnet\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"efficientnet_b0\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"regnety_400m\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"blazeface\", ncnn.Mat((128, 128, 3)), opt)\n    benchmark(\"googlenet\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"googlenet_int8\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"resnet18\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"resnet18_int8\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"alexnet\", ncnn.Mat((227, 227, 3)), opt)\n    benchmark(\"vgg16\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"vgg16_int8\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"resnet50\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"resnet50_int8\", ncnn.Mat((224, 224, 3)), opt)\n    benchmark(\"squeezenet_ssd\", ncnn.Mat((300, 300, 3)), opt)\n    benchmark(\"squeezenet_ssd_int8\", ncnn.Mat((300, 300, 3)), opt)\n    benchmark(\"mobilenet_ssd\", ncnn.Mat((300, 300, 3)), opt)\n    benchmark(\"mobilenet_ssd_int8\", ncnn.Mat((300, 300, 3)), opt)\n    benchmark(\"mobilenet_yolo\", ncnn.Mat((416, 416, 3)), opt)\n    benchmark(\"mobilenetv2_yolov3\", ncnn.Mat((352, 352, 3)), opt)\n    benchmark(\"yolov4-tiny\", ncnn.Mat((416, 416, 3)), opt)\n"
  },
  {
    "path": "python/tests/custom_layer.param",
    "content": "7767517\n2 2\nInput            data                             0 1 data\nCustomLayer      cl_fwd                           1 1 data output\n"
  },
  {
    "path": "python/tests/test.param",
    "content": "7767517\n3 3\nInput            data                             0 1 data\nConvolution      conv0_fwd                        1 1 data conv0_fwd 0=3 1=3 11=3 2=1 12=1 3=1 13=1 4=0 14=0 5=1 6=81\nInnerProduct     dense0_fwd                       1 1 conv0_fwd output 0=1 1=1 2=151875\n"
  },
  {
    "path": "python/tests/test_allocator.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport pytest\n\nimport ncnn\n\n\ndef test_pool_allocator():\n    pa = ncnn.PoolAllocator()\n    assert pa is not None\n    pa.set_size_compare_ratio(0.5)\n    buf = pa.fastMalloc(10 * 1024)\n    assert buf is not None\n    pa.fastFree(buf)\n    pa.clear()\n\n\ndef test_unlocked_pool_allocator():\n    upa = ncnn.UnlockedPoolAllocator()\n    assert upa is not None\n    upa.set_size_compare_ratio(0.5)\n    buf = upa.fastMalloc(10 * 1024)\n    assert buf is not None\n    upa.fastFree(buf)\n    upa.clear()\n"
  },
  {
    "path": "python/tests/test_blob.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport pytest\n\nimport ncnn\n\n\ndef test_blob():\n    blob = ncnn.Blob()\n\n    blob.name = \"myblob\"\n    assert blob.name == \"myblob\"\n\n    blob.producer = 0\n    assert blob.producer == 0\n\n    blob.consumer = 0\n    assert blob.consumer == 0\n\n    blob.shape = ncnn.Mat(1)\n    assert blob.shape.dims == 1 and blob.shape.w == 1\n"
  },
  {
    "path": "python/tests/test_extractor.py",
    "content": "# Copyright 2021 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport pytest\n\nimport ncnn\n\nallocator = ncnn.PoolAllocator()\n\n\ndef test_extractor():\n    with pytest.raises(TypeError, match=\"No constructor\"):\n        ex = ncnn.Extractor()\n\n    dr = ncnn.DataReaderFromEmpty()\n\n    net = ncnn.Net()\n    net.load_param(\"tests/test.param\")\n    net.load_model(dr)\n\n    in_mat = ncnn.Mat((227, 227, 3))\n    with net.create_extractor() as ex:\n        ex.set_light_mode(True)\n\n        ex.set_blob_allocator(allocator)\n        ex.set_workspace_allocator(allocator)\n\n        ex.input(\"data\", in_mat)\n        ret, out_mat = ex.extract(\"conv0_fwd\")\n        assert (\n            ret == 0\n            and out_mat.dims == 3\n            and out_mat.w == 225\n            and out_mat.h == 225\n            and out_mat.c == 3\n        )\n\n        ret, out_mat = ex.extract(\"output\")\n        assert ret == 0 and out_mat.dims == 1 and out_mat.w == 1\n\n\ndef test_extractor_index():\n    with pytest.raises(TypeError, match=\"No constructor\"):\n        ex = ncnn.Extractor()\n\n    dr = ncnn.DataReaderFromEmpty()\n\n    net = ncnn.Net()\n    net.load_param(\"tests/test.param\")\n    net.load_model(dr)\n\n    in_mat = ncnn.Mat((227, 227, 3))\n    ex = net.create_extractor()\n    ex.set_light_mode(True)\n\n    ex.set_blob_allocator(allocator)\n    ex.set_workspace_allocator(allocator)\n\n    ex.input(0, in_mat)\n    ret, out_mat = ex.extract(1)\n    assert (\n        ret == 0\n        and out_mat.dims == 3\n        and out_mat.w == 225\n        and out_mat.h == 225\n        and out_mat.c == 3\n    )\n\n    ret, out_mat = ex.extract(2)\n    assert ret == 0 and out_mat.dims == 1 and out_mat.w == 1\n\n    # no with statement here, so call clear() manually to ensure ex is destroyed before net\n    ex.clear()\n"
  },
  {
    "path": "python/tests/test_mat.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport sys\nimport numpy as np\nimport pytest\n\nimport ncnn\n\n\ndef test_mat_dims1():\n    mat = ncnn.Mat(1)\n    assert mat.dims == 1 and mat.w == 1\n    mat = ncnn.Mat(2, elemsize=4)\n    assert mat.dims == 1 and mat.w == 2 and mat.elemsize == 4\n    mat = ncnn.Mat(3, elemsize=4, elempack=1)\n    assert mat.dims == 1 and mat.w == 3 and mat.elemsize == 4 and mat.elempack == 1\n    mat = ncnn.Mat(4, elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 1\n        and mat.w == 4\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n    mat = ncnn.Mat((1,))\n    assert mat.dims == 1 and mat.w == 1\n    mat = ncnn.Mat((2,), elemsize=4)\n    assert mat.dims == 1 and mat.w == 2 and mat.elemsize == 4\n    mat = ncnn.Mat((3,), elemsize=4, elempack=1)\n    assert mat.dims == 1 and mat.w == 3 and mat.elemsize == 4 and mat.elempack == 1\n    mat = ncnn.Mat((4,), elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 1\n        and mat.w == 4\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n\ndef test_mat_dims2():\n    mat = ncnn.Mat(1, 2)\n    assert mat.dims == 2 and mat.w == 1 and mat.h == 2\n    mat = ncnn.Mat(3, 4, elemsize=4)\n    assert mat.dims == 2 and mat.w == 3 and mat.h == 4 and mat.elemsize == 4\n    mat = ncnn.Mat(5, 6, elemsize=4, elempack=1)\n    assert (\n        mat.dims == 2\n        and mat.w == 5\n        and mat.h == 6\n        and mat.elemsize == 4\n        and mat.elempack == 1\n    )\n    mat = ncnn.Mat(7, 8, elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 2\n        and mat.w == 7\n        and mat.h == 8\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n    mat = ncnn.Mat((1, 2))\n    assert mat.dims == 2 and mat.w == 1 and mat.h == 
2\n    mat = ncnn.Mat((3, 4), elemsize=4)\n    assert mat.dims == 2 and mat.w == 3 and mat.h == 4 and mat.elemsize == 4\n    mat = ncnn.Mat((5, 6), elemsize=4, elempack=1)\n    assert (\n        mat.dims == 2\n        and mat.w == 5\n        and mat.h == 6\n        and mat.elemsize == 4\n        and mat.elempack == 1\n    )\n    mat = ncnn.Mat((7, 8), elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 2\n        and mat.w == 7\n        and mat.h == 8\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n\ndef test_mat_dims3():\n    mat = ncnn.Mat(1, 2, 3)\n    assert mat.dims == 3 and mat.w == 1 and mat.h == 2 and mat.c == 3\n    mat = ncnn.Mat(4, 5, 6, elemsize=4)\n    assert (\n        mat.dims == 3 and mat.w == 4 and mat.h == 5 and mat.c == 6 and mat.elemsize == 4\n    )\n    mat = ncnn.Mat(7, 8, 9, elemsize=4, elempack=1)\n    assert (\n        mat.dims == 3\n        and mat.w == 7\n        and mat.h == 8\n        and mat.c == 9\n        and mat.elemsize == 4\n        and mat.elempack == 1\n    )\n    mat = ncnn.Mat(10, 11, 12, elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 3\n        and mat.w == 10\n        and mat.h == 11\n        and mat.c == 12\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n    mat = ncnn.Mat((1, 2, 3))\n    assert mat.dims == 3 and mat.w == 1 and mat.h == 2 and mat.c == 3\n    mat = ncnn.Mat((4, 5, 6), elemsize=4)\n    assert (\n        mat.dims == 3 and mat.w == 4 and mat.h == 5 and mat.c == 6 and mat.elemsize == 4\n    )\n    mat = ncnn.Mat((7, 8, 9), elemsize=4, elempack=1)\n    assert (\n        mat.dims == 3\n        and mat.w == 7\n        and mat.h == 8\n        and mat.c == 9\n        and mat.elemsize == 4\n        and mat.elempack == 1\n    )\n    mat = ncnn.Mat((10, 11, 12), elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 3\n        and 
mat.w == 10\n        and mat.h == 11\n        and mat.c == 12\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n\ndef test_mat_dims4():\n    mat = ncnn.Mat(1, 2, 3, 4)\n    assert mat.dims == 4 and mat.w == 1 and mat.h == 2 and mat.d == 3 and mat.c == 4\n    mat = ncnn.Mat(4, 5, 6, 7, elemsize=4)\n    assert (\n        mat.dims == 4 and mat.w == 4 and mat.h == 5 and mat.d == 6 and mat.c == 7 and mat.elemsize == 4\n    )\n    mat = ncnn.Mat(7, 8, 9, 10, elemsize=4, elempack=1)\n    assert (\n        mat.dims == 4\n        and mat.w == 7\n        and mat.h == 8\n        and mat.d == 9\n        and mat.c == 10\n        and mat.elemsize == 4\n        and mat.elempack == 1\n    )\n    mat = ncnn.Mat(10, 11, 12, 13, elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 4\n        and mat.w == 10\n        and mat.h == 11\n        and mat.d == 12\n        and mat.c == 13\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n    mat = ncnn.Mat((1, 2, 3, 4))\n    assert mat.dims == 4 and mat.w == 1 and mat.h == 2 and mat.d == 3 and mat.c == 4\n    mat = ncnn.Mat((4, 5, 6, 7), elemsize=4)\n    assert (\n        mat.dims == 4 and mat.w == 4 and mat.h == 5 and mat.d == 6 and mat.c == 7 and mat.elemsize == 4\n    )\n    mat = ncnn.Mat((7, 8, 9, 10), elemsize=4, elempack=1)\n    assert (\n        mat.dims == 4\n        and mat.w == 7\n        and mat.h == 8\n        and mat.d == 9\n        and mat.c == 10\n        and mat.elemsize == 4\n        and mat.elempack == 1\n    )\n    mat = ncnn.Mat((10, 11, 12, 13), elemsize=4, elempack=1, allocator=None)\n    assert (\n        mat.dims == 4\n        and mat.w == 10\n        and mat.h == 11\n        and mat.d == 12\n        and mat.c == 13\n        and mat.elemsize == 4\n        and mat.elempack == 1\n        and mat.allocator == None\n    )\n\n\ndef test_numpy():\n    mat = ncnn.Mat(1)\n    array = 
mat.numpy()\n    assert mat.dims == array.ndim and mat.w == array.shape[0]\n    mat = ncnn.Mat(2, 3)\n    array = mat.numpy()\n    assert array.dtype == np.float32\n    assert (\n        mat.dims == array.ndim and mat.w == array.shape[1] and mat.h == array.shape[0]\n    )\n    mat = ncnn.Mat(4, 5, 6)\n    array = np.array(mat)\n    assert (\n        mat.dims == array.ndim\n        and mat.w == array.shape[2]\n        and mat.h == array.shape[1]\n        and mat.c == array.shape[0]\n    )\n    mat = ncnn.Mat(7, 8, 9, 10)\n    array = np.array(mat)\n    assert (\n        mat.dims == array.ndim\n        and mat.w == array.shape[3]\n        and mat.h == array.shape[2]\n        and mat.d == array.shape[1]\n        and mat.c == array.shape[0]\n    )\n\n    mat = ncnn.Mat(1, elemsize=1)\n    array = mat.numpy()\n    assert array.dtype == np.int8\n    mat = ncnn.Mat(1, elemsize=2)\n    array = mat.numpy()\n    assert array.dtype == np.float16\n    # pybind11 def_buffer throw bug\n    # with pytest.raises(RuntimeError) as execinfo:\n    #     mat = ncnn.Mat(1, elemsize=3)\n    #     array = np.array(mat)\n    #     assert \"convert ncnn.Mat to numpy.ndarray only elemsize 1, 2, 4 support now, but given 3\" in str(\n    #         execinfo.value\n    #     )\n    assert array.dtype == np.float16\n    mat = ncnn.Mat(1, elemsize=4)\n    array = mat.numpy()\n    assert array.dtype == np.float32\n\n    mat = np.random.randint(0, 128, size=(12,)).astype(np.uint8)\n    array = np.array(mat)\n    assert (mat == array).all()\n    mat = np.random.rand(12).astype(np.float32)\n    array = np.array(mat)\n    assert (mat == array).all()\n    mat = np.random.randint(0, 128, size=(12, 11)).astype(np.uint8)\n    array = np.array(mat)\n    assert (mat == array).all()\n    mat = np.random.rand(12, 11).astype(np.float32)\n    array = np.array(mat)\n    assert (mat == array).all()\n    mat = np.random.randint(0, 256, size=(12, 11, 3)).astype(np.uint8)\n    array = np.array(mat)\n    assert (mat 
== array).all()\n    mat = np.random.rand(12, 11, 3).astype(np.float32)\n    array = np.array(mat)\n    assert (mat == array).all()\n    mat = np.random.randint(0, 256, size=(12, 11, 7, 3)).astype(np.uint8)\n    array = np.array(mat)\n    assert (mat == array).all()\n    mat = np.random.rand(12, 11, 7, 3).astype(np.float32)\n    array = np.array(mat)\n    assert (mat == array).all()\n\n    array = np.array([1, 2, 3], dtype=np.int32)\n    mat = ncnn.Mat(array)\n    array2 = mat.numpy(format='i')\n    assert array2.dtype == np.int32\n    array[0] = 10\n    assert array2[0] == 10\n\n    array = np.array([1, 2, 3], dtype=np.float32)\n    mat = ncnn.Mat(array)\n    array2 = mat.numpy(format='f')\n    assert array2.dtype == np.float32\n    array2[0] = 100\n    assert array[0] == 100\n\ndef test_fill():\n    mat = ncnn.Mat(1)\n    mat.fill(1.0)\n    array = np.array(mat)\n    assert np.abs(array[0] - 1.0) < np.finfo(np.float32).eps\n\n\ndef test_clone():\n    mat1 = ncnn.Mat(1)\n    mat2 = mat1.clone()\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w\n\n    mat1 = ncnn.Mat(2, 3)\n    mat2 = mat1.clone()\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w and mat1.h == mat2.h\n\n    mat1 = ncnn.Mat(4, 5, 6)\n    mat2 = mat1.clone()\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.c == mat2.c\n    )\n\n    mat1 = ncnn.Mat(7, 8, 9, 10)\n    mat2 = mat1.clone()\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.d == mat2.d\n        and mat1.c == mat2.c\n    )\n\n    mat1 = ncnn.Mat((1,))\n    mat2 = mat1.clone()\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w\n\n    mat1 = ncnn.Mat((2, 3))\n    mat2 = mat1.clone()\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w and mat1.h == mat2.h\n\n    mat1 = ncnn.Mat((4, 5, 6))\n    mat2 = mat1.clone()\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w 
== mat2.w\n        and mat1.h == mat2.h\n        and mat1.c == mat2.c\n    )\n\n    mat1 = ncnn.Mat((7, 8, 9, 10))\n    mat2 = mat1.clone()\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.d == mat2.d\n        and mat1.c == mat2.c\n    )\n\n\ndef test_clone_from():\n    mat2 = ncnn.Mat()\n\n    mat1 = ncnn.Mat(1)\n    mat2.clone_from(mat1)\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w\n\n    mat1 = ncnn.Mat(2, 3)\n    mat2.clone_from(mat1)\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w and mat1.h == mat2.h\n\n    mat1 = ncnn.Mat(4, 5, 6)\n    mat2.clone_from(mat1)\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.c == mat2.c\n    )\n\n    mat1 = ncnn.Mat(7, 8, 9, 10)\n    mat2.clone_from(mat1)\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.d == mat2.d\n        and mat1.c == mat2.c\n    )\n\n    mat1 = ncnn.Mat((1,))\n    mat2.clone_from(mat1)\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w\n\n    mat1 = ncnn.Mat((2, 3))\n    mat2.clone_from(mat1)\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w and mat1.h == mat2.h\n\n    mat1 = ncnn.Mat((4, 5, 6))\n    mat2.clone_from(mat1)\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.c == mat2.c\n    )\n\n    mat1 = ncnn.Mat((7, 8, 9, 10))\n    mat2.clone_from(mat1)\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.d == mat2.d\n        and mat1.c == mat2.c\n    )\n\n\ndef test_reshape():\n    mat1 = ncnn.Mat()\n    mat2 = mat1.reshape(1)\n    assert mat2.dims == 0\n    mat2 = mat1.reshape(1, 1)\n    assert mat2.dims == 0\n    mat2 = mat1.reshape(1, 1, 1)\n    assert mat2.dims == 0\n    mat2 = mat1.reshape(1, 1, 1, 1)\n    
assert mat2.dims == 0\n\n    mat1 = ncnn.Mat(1)\n    mat2 = mat1.reshape(1, 1)\n    assert mat2.dims == 2 and mat2.w == 1 and mat2.h == 1\n    mat2 = mat1.reshape(1, 1, 1)\n    assert mat2.dims == 3 and mat2.w == 1 and mat2.h == 1 and mat2.c == 1\n    mat2 = mat1.reshape(1, 1, 1, 1)\n    assert mat2.dims == 4 and mat2.w == 1 and mat2.h == 1 and mat2.d == 1 and mat2.c == 1\n\n    mat1 = ncnn.Mat(1, 2)\n    mat2 = mat1.reshape(2)\n    assert mat2.dims == 1 and mat2.w == 2\n    mat2 = mat1.reshape(2, 1)\n    assert mat2.dims == 2 and mat2.w == 2 and mat2.h == 1\n    mat2 = mat1.reshape(2, 1, 1)\n    assert mat2.dims == 3 and mat2.w == 2 and mat2.h == 1 and mat2.c == 1\n    mat2 = mat1.reshape(2, 1, 1, 1)\n    assert mat2.dims == 4 and mat2.w == 2 and mat2.h == 1 and mat2.d == 1 and mat2.c == 1\n\n    mat1 = ncnn.Mat(1, 2, 3)\n    mat2 = mat1.reshape(6)\n    assert mat2.dims == 1 and mat2.w == 6\n    mat2 = mat1.reshape(2, 3)\n    assert mat2.dims == 2 and mat2.w == 2 and mat2.h == 3\n    mat2 = mat1.reshape(2, 3, 1)\n    assert mat2.dims == 3 and mat2.w == 2 and mat2.h == 3 and mat2.c == 1\n    mat2 = mat1.reshape(2, 1, 3, 1)\n    assert mat2.dims == 4 and mat2.w == 2 and mat2.h == 1 and mat2.d == 3 and mat2.c == 1\n\n    mat1 = ncnn.Mat((1,))\n    mat2 = mat1.reshape((1, 1))\n    assert mat2.dims == 2 and mat2.w == 1 and mat2.h == 1\n    mat2 = mat1.reshape((1, 1, 1))\n    assert mat2.dims == 3 and mat2.w == 1 and mat2.h == 1 and mat2.c == 1\n    mat2 = mat1.reshape((1, 1, 1, 1))\n    assert mat2.dims == 4 and mat2.w == 1 and mat2.h == 1 and mat2.d == 1 and mat2.c == 1\n\n    mat1 = ncnn.Mat((1, 2))\n    mat2 = mat1.reshape((2,))\n    assert mat2.dims == 1 and mat2.w == 2\n    mat2 = mat1.reshape((2, 1))\n    assert mat2.dims == 2 and mat2.w == 2 and mat2.h == 1\n    mat2 = mat1.reshape((2, 1, 1))\n    assert mat2.dims == 3 and mat2.w == 2 and mat2.h == 1 and mat2.c == 1\n    mat2 = mat1.reshape((2, 1, 1, 1))\n    assert mat2.dims == 4 and mat2.w == 2 and mat2.h == 1 
and mat2.d == 1 and mat2.c == 1\n\n    mat1 = ncnn.Mat((1, 2, 3))\n    mat2 = mat1.reshape((6,))\n    assert mat2.dims == 1 and mat2.w == 6\n    mat2 = mat1.reshape((2, 3))\n    assert mat2.dims == 2 and mat2.w == 2 and mat2.h == 3 and mat2.c == 1\n    mat2 = mat1.reshape((2, 3, 1))\n    assert mat2.dims == 3 and mat2.w == 2 and mat2.h == 3 and mat2.c == 1\n    mat2 = mat1.reshape((2, 1, 3, 1))\n    assert mat2.dims == 4 and mat2.w == 2 and mat2.h == 1 and mat2.d == 3 and mat2.c == 1\n\n    with pytest.raises(RuntimeError) as execinfo:\n        mat1.reshape((1, 1, 1, 1, 1))\n    assert \"shape must be 1, 2, 3 or 4 dims, not 5\" in str(execinfo.value)\n\n\ndef test_create():\n    mat = ncnn.Mat()\n    mat.create(1)\n    assert mat.dims == 1 and mat.w == 1\n    mat.create(2, 3)\n    assert mat.dims == 2 and mat.w == 2 and mat.h == 3\n    mat.create(4, 5, 6)\n    assert mat.dims == 3 and mat.w == 4 and mat.h == 5 and mat.c == 6\n    mat.create(7, 8, 9, 10)\n    assert mat.dims == 4 and mat.w == 7 and mat.h == 8 and mat.d == 9 and mat.c == 10\n\n    mat.create((1,))\n    assert mat.dims == 1 and mat.w == 1\n    mat.create((2, 3))\n    assert mat.dims == 2 and mat.w == 2 and mat.h == 3\n    mat.create((4, 5, 6))\n    assert mat.dims == 3 and mat.w == 4 and mat.h == 5 and mat.c == 6\n    mat.create((7, 8, 9, 10))\n    assert mat.dims == 4 and mat.w == 7 and mat.h == 8 and mat.d == 9 and mat.c == 10\n\n\ndef test_create_like():\n    mat2 = ncnn.Mat()\n\n    mat1 = ncnn.Mat(1)\n    mat2.create_like(mat1)\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w\n    mat1 = ncnn.Mat(2, 3)\n    mat2.create_like(mat1)\n    assert mat1.dims == mat2.dims and mat1.w == mat2.w and mat1.h == mat2.h\n    mat1 = ncnn.Mat(4, 5, 6)\n    mat2.create_like(mat1)\n    assert (\n        mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.c == mat2.c\n    )\n    mat1 = ncnn.Mat(7, 8, 9, 10)\n    mat2.create_like(mat1)\n    assert (\n        
mat1.dims == mat2.dims\n        and mat1.w == mat2.w\n        and mat1.h == mat2.h\n        and mat1.d == mat2.d\n        and mat1.c == mat2.c\n    )\n\n\ndef test_addref_release():\n    mat = ncnn.Mat(1)\n    assert mat.refcount == 1\n\n    mat.addref()\n    assert mat.refcount == 2\n\n    mat.release()\n    assert mat.refcount == None\n\n\ndef test_empty():\n    mat = ncnn.Mat()\n    assert mat.empty() == True\n\n    mat = ncnn.Mat(1)\n    assert mat.empty() == False\n\n\ndef test_total():\n    mat = ncnn.Mat(1)\n    assert mat.total() == 4 # 1 aligned\n    mat = ncnn.Mat(2, 3)\n    assert mat.total() == 8 # 2 * 3 aligned\n    mat = ncnn.Mat(4, 5, 6)\n    assert mat.total() == 4 * 5 * 6\n    mat = ncnn.Mat(7, 8, 9, 10)\n    assert mat.total() == 7 * 8 * 9 * 10\n\n\ndef test_elembits():\n    mat = ncnn.Mat(1, elemsize=1, elempack=1)\n    assert mat.elembits() == 8\n    mat = ncnn.Mat(2, elemsize=2, elempack=1)\n    assert mat.elembits() == 16\n    mat = ncnn.Mat(3, elemsize=4, elempack=1)\n    assert mat.elembits() == 32\n\n\ndef test_shape():\n    mat = ncnn.Mat(1)\n    shape = mat.shape()\n    assert shape.dims == 1 and shape.w == 1\n    mat = ncnn.Mat(2, 3)\n    shape = mat.shape()\n    assert shape.dims == 2 and shape.w == 2 and shape.h == 3\n    mat = ncnn.Mat(4, 5, 6)\n    shape = mat.shape()\n    assert shape.dims == 3 and shape.w == 4 and shape.h == 5 and shape.c == 6\n    mat = ncnn.Mat(7, 8, 9, 10)\n    shape = mat.shape()\n    assert shape.dims == 4 and shape.w == 7 and shape.h == 8 and shape.d == 9 and shape.c == 10\n\n\ndef test_channel_depth_row():\n    mat = ncnn.Mat(2, 3, 4, 5)\n    mat.fill(6.0)\n    channel = mat.channel(1)\n    assert channel.dims == 3 and channel.w == 2 and channel.h == 3 and channel.c == 4\n\n    depth = channel.depth(1)\n    assert depth.dims == 2 and depth.w == 2 and depth.h == 3\n\n    row = depth.row(1)\n    assert len(row) == 2 and np.abs(row[0] - 6.0) < sys.float_info.min\n\n\ndef test_channel_row():\n    mat = 
ncnn.Mat(2, 3, 4)\n    mat.fill(4.0)\n    channel = mat.channel(1)\n    assert channel.dims == 2 and channel.w == 2 and channel.h == 3 and channel.c == 1\n\n    row = channel.row(1)\n    assert len(row) == 2 and np.abs(row[0] - 4.0) < sys.float_info.min\n\n\ndef test_channel_range():\n    mat = ncnn.Mat(1, 2, 3)\n    channel_range = mat.channel_range(0, 2)\n    assert (\n        channel_range.dims == 3\n        and channel_range.w == 1\n        and channel_range.h == 2\n        and channel_range.c == 2\n    )\n\n\ndef test_depth_range():\n    mat = ncnn.Mat(1, 2, 3, 4)\n    depth_range = mat.channel(1).depth_range(1, 2)\n    assert (\n        depth_range.dims == 3\n        and depth_range.w == 1\n        and depth_range.h == 2\n        and depth_range.c == 2\n    )\n\n\ndef test_row_range():\n    mat = ncnn.Mat(1, 2)\n    row_range = mat.row_range(0, 2)\n    assert row_range.dims == 2 and row_range.w == 1 and row_range.h == 2\n\n\ndef test_range():\n    mat = ncnn.Mat(2)\n    range = mat.range(0, 2)\n    assert range.dims == 1 and range.w == 2\n\n\ndef test_getitem_setitem():\n    mat = ncnn.Mat(2)\n    mat.fill(1)\n    assert (\n        np.abs(mat[0] - 1.0) < sys.float_info.min\n        and np.abs(mat[1] - 1.0) < sys.float_info.min\n    )\n\n    mat[0] = 2.0\n    assert (\n        np.abs(mat[0] - 2.0) < sys.float_info.min\n        and np.abs(mat[1] - 1.0) < sys.float_info.min\n    )\n\n\ndef test_from_pixels():\n    pixels = np.random.randint(0, 256, size=(300, 400, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels(pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300)  # chw\n    assert mat.dims == 3 and mat.w == 400 and mat.h == 300 and mat.c == 3\n    assert pixels[0, 0, 0] == mat.channel(0).row(0)[0]\n    assert pixels[200, 150, 1] == mat.channel(1).row(200)[150]\n    assert pixels[299, 399, 2] == mat.channel(2).row(299)[399]\n\n    pixels = np.random.randint(0, 256, size=(300, 500, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels(\n        
pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300, stride=500 * 3\n    )  # chw\n    assert mat.dims == 3 and mat.w == 400 and mat.h == 300 and mat.c == 3\n    assert pixels[0, 0, 0] == mat.channel(0).row(0)[0]\n    assert pixels[200, 150, 1] == mat.channel(1).row(200)[150]\n    assert pixels[299, 399, 2] == mat.channel(2).row(299)[399]\n\n\ndef test_from_pixels_resize():\n    pixels = np.random.randint(0, 256, size=(300, 400, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_resize(\n        pixels, ncnn.Mat.PixelType.PIXEL_BGR2RGB, 400, 300, 200, 150\n    )  # chw\n    assert mat.dims == 3 and mat.w == 200 and mat.h == 150 and mat.c == 3\n\n    pixels = np.random.randint(0, 256, size=(300, 400, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_resize(\n        pixels, ncnn.Mat.PixelType.PIXEL_BGR2RGB, 400, 300, 400, 300\n    )  # chw\n    assert mat.dims == 3 and mat.w == 400 and mat.h == 300 and mat.c == 3\n    assert pixels[0, 0, 0] == mat.channel(2).row(0)[0]\n    assert pixels[200, 150, 1] == mat.channel(1).row(200)[150]\n    assert pixels[299, 399, 2] == mat.channel(0).row(299)[399]\n\n    pixels = np.random.randint(0, 256, size=(300, 500, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_resize(\n        pixels, ncnn.Mat.PixelType.PIXEL_BGR2RGB, 400, 300, 500 * 3, 200, 150\n    )  # chw\n    assert mat.dims == 3 and mat.w == 200 and mat.h == 150 and mat.c == 3\n\n    pixels = np.random.randint(0, 256, size=(300, 500, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_resize(\n        pixels, ncnn.Mat.PixelType.PIXEL_BGR2RGB, 400, 300, 500 * 3, 400, 300\n    )  # chw\n    assert mat.dims == 3 and mat.w == 400 and mat.h == 300 and mat.c == 3\n    assert pixels[0, 0, 0] == mat.channel(2).row(0)[0]\n    assert pixels[200, 150, 1] == mat.channel(1).row(200)[150]\n    assert pixels[299, 399, 2] == mat.channel(0).row(299)[399]\n\n\ndef test_from_pixels_roi():\n    pixels = np.random.randint(0, 256, size=(300, 400, 
3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_roi(\n        pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300, 100, 75, 200, 150\n    )  # chw\n    assert mat.dims == 3 and mat.w == 200 and mat.h == 150 and mat.c == 3\n    assert pixels[75, 100, 0] == mat.channel(0).row(0)[0]\n    assert pixels[150, 200, 1] == mat.channel(1).row(75)[100]\n    assert pixels[224, 299, 2] == mat.channel(2).row(149)[199]\n\n    pixels = np.random.randint(0, 256, size=(300, 500, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_roi(\n        pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300, 500 * 3, 100, 75, 200, 150\n    )  # chw\n    assert mat.dims == 3 and mat.w == 200 and mat.h == 150 and mat.c == 3\n    assert pixels[75, 100, 0] == mat.channel(0).row(0)[0]\n    assert pixels[150, 200, 1] == mat.channel(1).row(75)[100]\n    assert pixels[224, 299, 2] == mat.channel(2).row(149)[199]\n\n\ndef test_from_pixels_roi_resize():\n    pixels = np.random.randint(0, 256, size=(300, 400, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_roi_resize(\n        pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300, 100, 75, 200, 150, 100, 75\n    )  # chw\n    assert mat.dims == 3 and mat.w == 100 and mat.h == 75 and mat.c == 3\n\n    pixels = np.random.randint(0, 256, size=(300, 500, 3)).astype(np.uint8)  # hwc\n    mat = ncnn.Mat.from_pixels_roi_resize(\n        pixels,\n        ncnn.Mat.PixelType.PIXEL_RGB,\n        400,\n        300,\n        500 * 3,\n        100,\n        75,\n        200,\n        150,\n        100,\n        75,\n    )  # chw\n    assert mat.dims == 3 and mat.w == 100 and mat.h == 75 and mat.c == 3\n\n\ndef test_substract_mean_normalize():\n    pixels = np.random.randint(0, 256, size=(300, 400, 3)).astype(np.uint8)  # hwc\n    mean_vals = [127.5, 127.5, 127.5]\n    norm_vals = [0.007843, 0.007843, 0.007843]\n\n    mat = ncnn.Mat.from_pixels(pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300)  # chw\n    mat.substract_mean_normalize([], norm_vals)\n    
assert np.abs(pixels[0, 0, 0] * 0.007843 - mat.channel(0).row(0)[0]) < 1e-5\n\n    mat = ncnn.Mat.from_pixels(pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300)  # chw\n    mat.substract_mean_normalize(mean_vals, [])\n    assert np.abs((pixels[0, 0, 0] - 127.5) - mat.channel(0).row(0)[0]) < 1e-5\n\n    mat = ncnn.Mat.from_pixels(pixels, ncnn.Mat.PixelType.PIXEL_RGB, 400, 300)  # chw\n    mat.substract_mean_normalize(mean_vals, norm_vals)\n    assert (\n        np.abs((pixels[0, 0, 0] - 127.5) * 0.007843 - mat.channel(0).row(0)[0]) < 1e-5\n    )\n"
  },
  {
    "path": "python/tests/test_net.py",
    "content": "# Copyright 2021 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport numpy as np\nimport pytest\n\nimport ncnn\n\n\ndef test_net():\n    dr = ncnn.DataReaderFromEmpty()\n\n    with ncnn.Net() as net:\n        ret = net.load_param(\"tests/test.param\")\n        net.load_model(dr)\n        assert ret == 0 and len(net.blobs()) == 3 and len(net.layers()) == 3\n\n        input_names = net.input_names()\n        output_names = net.output_names()\n        assert len(input_names) > 0 and len(output_names) > 0\n\n        in_mat = ncnn.Mat((227, 227, 3))\n\n        with net.create_extractor() as ex:\n            ex.input(\"data\", in_mat)\n            ret, out_mat = ex.extract(\"output\")\n\n        assert ret == 0 and out_mat.dims == 1 and out_mat.w == 1\n\n        net.clear()\n        assert len(net.blobs()) == 0 and len(net.layers()) == 0\n\n\ndef test_net_mem():\n    modelbin = bytearray(303940)\n    modelbin[0:4] = 71,107,48,1\n    modelbin[180:184] = 71,107,48,1\n\n    with ncnn.Net() as net:\n        ret = net.load_param(\"tests/test.param\")\n        net.load_model_mem(bytes(modelbin))\n        assert ret == 0 and len(net.blobs()) == 3 and len(net.layers()) == 3\n\n        input_names = net.input_names()\n        output_names = net.output_names()\n        assert len(input_names) > 0 and len(output_names) > 0\n\n        in_mat = ncnn.Mat((227, 227, 3))\n\n        with net.create_extractor() as ex:\n            ex.input(\"data\", in_mat)\n            ret, out_mat = ex.extract(\"output\")\n\n        assert ret == 0 and out_mat.dims == 1 and out_mat.w == 1\n\n        net.clear()\n        assert len(net.blobs()) == 0 and len(net.layers()) == 0\n\n\ndef test_net_vulkan():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    dr = ncnn.DataReaderFromEmpty()\n\n    net = ncnn.Net()\n    net.opt.use_vulkan_compute = True\n    ret = net.load_param(\"tests/test.param\")\n    net.load_model(dr)\n    assert ret == 0 and len(net.blobs()) == 3 
and len(net.layers()) == 3\n\n    in_mat = ncnn.Mat((227, 227, 3))\n\n    ex = net.create_extractor()\n    ex.input(\"data\", in_mat)\n    ret, out_mat = ex.extract(\"output\")\n\n    assert ret == 0 and out_mat.dims == 1 and out_mat.w == 1\n\n    ex.clear()\n\n    net.clear()\n    assert len(net.blobs()) == 0 and len(net.layers()) == 0\n\n\ndef test_custom_layer():\n    class CustomLayer(ncnn.Layer):\n        customLayers = []\n\n        def __init__(self):\n            ncnn.Layer.__init__(self)\n            self.one_blob_only = True\n\n            self.customLayers.append(self)\n\n        def forward(self, bottom_blob, top_blob, opt):\n            x = np.array(bottom_blob)\n            x += 1\n\n            top_blob.clone_from(ncnn.Mat(x), opt.blob_allocator)\n            if top_blob.empty():\n                return -100\n\n            return 0\n\n    def CustomLayer_layer_creator():\n        return CustomLayer()\n\n    def CustomLayer_layer_destroyer(layer):\n        for i in range(len(CustomLayer.customLayers)):\n            if CustomLayer.customLayers[i] == layer:\n                del CustomLayer.customLayers[i]\n                break\n\n    dr = ncnn.DataReaderFromEmpty()\n\n    net = ncnn.Net()\n    net.register_custom_layer(\n        \"CustomLayer\", CustomLayer_layer_creator, CustomLayer_layer_destroyer\n    )\n    ret = net.load_param(\"tests/custom_layer.param\")\n    net.load_model(dr)\n    assert ret == 0 and len(net.blobs()) == 2 and len(net.layers()) == 2\n\n    in_mat = ncnn.Mat(1)\n    in_mat.fill(1.0)\n\n    ex = net.create_extractor()\n    ex.input(\"data\", in_mat)\n    ret, out_mat = ex.extract(\"output\")\n    assert ret == 0 and out_mat.dims == 1 and out_mat.w == 1 and out_mat[0] == 2.0\n\n    ex.clear()\n\n    net.clear()\n    assert len(net.blobs()) == 0 and len(net.layers()) == 0\n\n\ndef test_vulkan_device_index():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    net = ncnn.Net()\n    assert net.vulkan_device() is 
None\n\n    net.set_vulkan_device(0)\n    assert net.vulkan_device() is not None\n\n\ndef test_vulkan_device_vkdev():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    net = ncnn.Net()\n    assert net.vulkan_device() is None\n\n    vkdev = ncnn.get_gpu_device(0)\n    net.set_vulkan_device(vkdev)\n    assert net.vulkan_device() is not None\n"
  },
  {
    "path": "python/tests/test_option.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport pytest\n\nimport ncnn\n\n\ndef test_option():\n    allocator = ncnn.PoolAllocator()\n\n    opt = ncnn.Option()\n\n    opt.lightmode = True\n    assert opt.lightmode == True\n    opt.lightmode = False\n    assert opt.lightmode == False\n\n    assert opt.num_threads == ncnn.get_physical_big_cpu_count()\n    opt.num_threads = 1\n    assert opt.num_threads == 1\n\n    assert opt.blob_allocator is None\n    opt.blob_allocator = allocator\n    assert opt.blob_allocator == allocator\n\n    assert opt.workspace_allocator is None\n    opt.workspace_allocator = allocator\n    assert opt.workspace_allocator == allocator\n\n    assert opt.openmp_blocktime == 20\n    opt.openmp_blocktime = 40\n    assert opt.openmp_blocktime == 40\n\n    opt.use_winograd_convolution = True\n    assert opt.use_winograd_convolution == True\n    opt.use_winograd_convolution = False\n    assert opt.use_winograd_convolution == False\n\n    opt.use_sgemm_convolution = True\n    assert opt.use_sgemm_convolution == True\n    opt.use_sgemm_convolution = False\n    assert opt.use_sgemm_convolution == False\n\n    opt.use_int8_inference = True\n    assert opt.use_int8_inference == True\n    opt.use_int8_inference = False\n    assert opt.use_int8_inference == False\n\n    opt.use_vulkan_compute = True\n    assert opt.use_vulkan_compute == True\n    opt.use_vulkan_compute = False\n    assert opt.use_vulkan_compute == False\n\n    opt.use_bf16_packed = True\n    assert opt.use_bf16_packed == True\n    opt.use_bf16_packed = False\n    assert opt.use_bf16_packed == False\n\n    opt.use_bf16_storage = True\n    assert opt.use_bf16_storage == True\n    opt.use_bf16_storage = False\n    assert opt.use_bf16_storage == False\n\n    opt.use_fp16_packed = True\n    assert opt.use_fp16_packed == True\n    opt.use_fp16_packed = False\n    assert opt.use_fp16_packed == False\n\n    opt.use_fp16_storage = True\n    assert 
opt.use_fp16_storage == True\n    opt.use_fp16_storage = False\n    assert opt.use_fp16_storage == False\n\n    opt.use_fp16_arithmetic = True\n    assert opt.use_fp16_arithmetic == True\n    opt.use_fp16_arithmetic = False\n    assert opt.use_fp16_arithmetic == False\n\n    opt.use_int8_packed = True\n    assert opt.use_int8_packed == True\n    opt.use_int8_packed = False\n    assert opt.use_int8_packed == False\n\n    opt.use_int8_storage = True\n    assert opt.use_int8_storage == True\n    opt.use_int8_storage = False\n    assert opt.use_int8_storage == False\n\n    opt.use_int8_arithmetic = True\n    assert opt.use_int8_arithmetic == True\n    opt.use_int8_arithmetic = False\n    assert opt.use_int8_arithmetic == False\n\n    opt.use_packing_layout = True\n    assert opt.use_packing_layout == True\n    opt.use_packing_layout = False\n    assert opt.use_packing_layout == False\n\n    opt.use_subgroup_ops = True\n    assert opt.use_subgroup_ops == True\n    opt.use_subgroup_ops = False\n    assert opt.use_subgroup_ops == False\n\n    opt.use_tensor_storage = True\n    assert opt.use_tensor_storage == True\n    opt.use_tensor_storage = False\n    assert opt.use_tensor_storage == False\n"
  },
  {
    "path": "python/tests/test_paramdict.py",
    "content": "# Copyright 2020 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport pytest\n\nimport ncnn\n\n\ndef test_paramdict():\n    pd = ncnn.ParamDict()\n    assert pd.type(0) == 0\n    assert pd.get(0, -1) == -1\n\n    pd.set(1, 1)\n    assert pd.type(1) == 2 and pd.get(1, -1) == 1\n\n    pd.set(2, 2.0)\n    assert pd.type(2) == 3 and pd.get(2, -2.0) == 2.0\n\n    mat = ncnn.Mat(1)\n    pd.set(3, mat)\n    assert pd.type(3) == 4 and pd.get(3, ncnn.Mat()).dims == mat.dims\n"
  },
  {
    "path": "python/tests/test_vulkan_allocator.py",
    "content": "# Copyright 2021 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport pytest\n\nimport ncnn\n\n\ndef test_vk_blob_allocator():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    vkdev = ncnn.get_gpu_device(0)\n    assert vkdev is not None\n    allocator = ncnn.VkBlobAllocator(vkdev)\n    assert allocator.buffer_memory_type_index >= 0\n    assert allocator.image_memory_type_index >= 0\n\n    mappable = allocator.mappable\n    allocator.mappable = not mappable\n    assert allocator.mappable == (not mappable)\n\n    coherent = allocator.coherent\n    allocator.coherent = not coherent\n    assert allocator.coherent == (not coherent)\n\n    bufmem = allocator.fastMalloc(10 * 1024)\n    assert bufmem is not None\n    allocator.fastFree(bufmem)\n\n    imgmem = allocator.fastMalloc(4, 4, 3, 4, 1)\n    assert imgmem is not None\n    allocator.fastFree(imgmem)\n\n\ndef test_vk_weight_allocator():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    vkdev = ncnn.get_gpu_device(0)\n    assert vkdev is not None\n    allocator = ncnn.VkWeightAllocator(vkdev)\n    assert allocator.buffer_memory_type_index >= 0\n    assert allocator.image_memory_type_index >= 0\n\n    mappable = allocator.mappable\n    allocator.mappable = not mappable\n    assert allocator.mappable == (not mappable)\n\n    coherent = allocator.coherent\n    allocator.coherent = not coherent\n    assert allocator.coherent == (not coherent)\n\n    bufmem = allocator.fastMalloc(10 * 1024)\n    assert bufmem is not None\n    allocator.fastFree(bufmem)\n\n    imgmem = allocator.fastMalloc(4, 4, 3, 4, 1)\n    assert imgmem is not None\n    allocator.fastFree(imgmem)\n\n\ndef test_vk_staging_allocator():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    vkdev = ncnn.get_gpu_device(0)\n    assert vkdev is not None\n    allocator = ncnn.VkStagingAllocator(vkdev)\n    assert allocator.buffer_memory_type_index >= 0\n    assert 
allocator.image_memory_type_index >= 0\n\n    mappable = allocator.mappable\n    allocator.mappable = not mappable\n    assert allocator.mappable == (not mappable)\n\n    coherent = allocator.coherent\n    allocator.coherent = not coherent\n    assert allocator.coherent == (not coherent)\n\n    bufmem = allocator.fastMalloc(10 * 1024)\n    assert bufmem is not None\n    allocator.fastFree(bufmem)\n\n    imgmem = allocator.fastMalloc(4, 4, 3, 4, 1)\n    assert imgmem is not None\n    allocator.fastFree(imgmem)\n\n\ndef test_vk_weight_staging_allocator():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    vkdev = ncnn.get_gpu_device(0)\n    assert vkdev is not None\n    allocator = ncnn.VkWeightStagingAllocator(vkdev)\n    assert allocator.buffer_memory_type_index >= 0\n    assert allocator.image_memory_type_index >= 0\n\n    mappable = allocator.mappable\n    allocator.mappable = not mappable\n    assert allocator.mappable == (not mappable)\n\n    coherent = allocator.coherent\n    allocator.coherent = not coherent\n    assert allocator.coherent == (not coherent)\n\n    bufmem = allocator.fastMalloc(10 * 1024)\n    assert bufmem is not None\n    allocator.fastFree(bufmem)\n\n    imgmem = allocator.fastMalloc(4, 4, 3, 4, 1)\n    assert imgmem is None\n"
  },
  {
    "path": "python/tests/test_vulkan_device.py",
    "content": "# Copyright 2021 Tencent\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport pytest\n\nimport ncnn\n\n\ndef check_gpuinfo(gpuinfo):\n    assert gpuinfo.api_version() > 0\n    assert gpuinfo.driver_version() > 0\n    assert gpuinfo.vendor_id() > 0\n    assert gpuinfo.device_id() > 0\n    assert gpuinfo.pipeline_cache_uuid() is not None\n    assert gpuinfo.type() >= 0\n\n\ndef test_gpu_api():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    assert ncnn.create_gpu_instance() == 0\n    assert ncnn.get_gpu_count() > 0\n    assert ncnn.get_default_gpu_index() >= 0\n\n    gpuinfo = ncnn.get_gpu_info(0)\n    check_gpuinfo(gpuinfo)\n\n    vkdev = ncnn.get_gpu_device(0)\n    assert vkdev is not None\n    gpuinfo = vkdev.info()\n    check_gpuinfo(gpuinfo)\n\n    ncnn.destroy_gpu_instance()\n\n\ndef test_vulkan_device():\n    if not hasattr(ncnn, \"get_gpu_count\"):\n        return\n\n    vkdev = ncnn.VulkanDevice(0)\n    assert vkdev is not None\n    gpuinfo = vkdev.info()\n    check_gpuinfo(gpuinfo)\n"
  },
  {
    "path": "setup.py",
    "content": "import io\nimport os\nimport sys\nimport time\nimport re\nimport shutil\nimport subprocess\n\nfrom setuptools import setup, find_packages, Extension\nfrom setuptools.command.build_ext import build_ext\nfrom setuptools.command.install import install\n\n\ndef find_version():\n    with io.open(\"CMakeLists.txt\", encoding=\"utf8\") as f:\n        version_file = f.read()\n\n    version_major = re.findall(r\"NCNN_VERSION_MAJOR (.+?)\", version_file)\n    version_minor = re.findall(r\"NCNN_VERSION_MINOR (.+?)\", version_file)\n\n    if version_major and version_minor:\n        ncnn_version = time.strftime(\"%Y%m%d\", time.localtime())\n\n        return version_major[0] + \".\" + version_minor[0] + \".\" + ncnn_version\n    raise RuntimeError(\"Unable to find version string.\")\n\n# Parse environment variables\nVulkan_LIBRARY = os.environ.get(\"Vulkan_LIBRARY\", \"\")\nCMAKE_TOOLCHAIN_FILE = os.environ.get(\"CMAKE_TOOLCHAIN_FILE\", \"\")\nPLATFORM = os.environ.get(\"PLATFORM\", \"\")\nARCHS = os.environ.get(\"ARCHS\", \"\")\nDEPLOYMENT_TARGET = os.environ.get(\"DEPLOYMENT_TARGET\", \"\")\nOpenMP_C_FLAGS = os.environ.get(\"OpenMP_C_FLAGS\", \"\")\nOpenMP_CXX_FLAGS = os.environ.get(\"OpenMP_CXX_FLAGS\", \"\")\nOpenMP_C_LIB_NAMES = os.environ.get(\"OpenMP_C_LIB_NAMES\", \"\")\nOpenMP_CXX_LIB_NAMES = os.environ.get(\"OpenMP_CXX_LIB_NAMES\", \"\")\nOpenMP_libomp_LIBRARY = os.environ.get(\"OpenMP_libomp_LIBRARY\", \"\")\nENABLE_BITCODE = os.environ.get(\"ENABLE_BITCODE\", \"\")\nENABLE_ARC = os.environ.get(\"ENABLE_ARC\", \"\")\nENABLE_VISIBILITY = os.environ.get(\"ENABLE_VISIBILITY\", \"\")\nEXTRA_CMAKE_ARGS = os.getenv(\"EXTRA_CMAKE_ARGS\", \"\").split()\n\n# Parse variables from command line with setup.py install\nclass InstallCommand(install):\n    user_options = install.user_options + [\n        ('vulkan=', None, 'Enable the usage of Vulkan.'),\n    ]\n    def initialize_options(self):\n        install.initialize_options(self)\n        self.vulkan = 
None\n\n    def finalize_options(self):\n        install.finalize_options(self)\n\n    def run(self):\n        install.run(self)\n\n# Convert distutils Windows platform specifiers to CMake -A arguments\nPLAT_TO_CMAKE = {\n    \"win32\": \"Win32\",\n    \"win-amd64\": \"x64\",\n    \"win-arm32\": \"ARM\",\n    \"win-arm64\": \"ARM64\",\n}\n\n# A CMakeExtension needs a sourcedir instead of a file list.\n# The name must be the _single_ output extension from the CMake build.\n# If you need multiple extensions, see scikit-build.\nclass CMakeExtension(Extension):\n    def __init__(self, name, sourcedir=\"\"):\n        Extension.__init__(self, name, sources=[])\n        self.sourcedir = os.path.abspath(sourcedir)\n\n\nclass CMakeBuild(build_ext):\n    def build_extension(self, ext):\n        extdir = os.path.abspath(os.path.dirname(self.get_ext_fullpath(ext.name)))\n        extdir = os.path.join(extdir, \"ncnn\")\n\n        # required for auto-detection of auxiliary \"native\" libs\n        if not extdir.endswith(os.path.sep):\n            extdir += os.path.sep\n\n        cfg = \"Debug\" if self.debug else \"Release\"\n\n        # CMake lets you override the generator - we need to check this.\n        # Can be set with Conda-Build, for example.\n        cmake_generator = os.environ.get(\"CMAKE_GENERATOR\", \"\")\n\n        # Set Python_EXECUTABLE instead if you use PYBIND11_FINDPYTHON.\n        # The options below build the Python binding with Vulkan enabled and\n        # keep RTTI and exceptions on.\n        cmake_args = [\n            \"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={}\".format(extdir),\n            \"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE={}\".format(extdir),\n            \"-DPYTHON_EXECUTABLE={}\".format(sys.executable),\n            \"-DCMAKE_BUILD_TYPE={}\".format(cfg),  # not used on MSVC, but no harm\n            \"-DNCNN_PYTHON=ON\",\n            \"-DNCNN_VULKAN=ON\",\n            \"-DNCNN_DISABLE_RTTI=OFF\",\n            \"-DNCNN_DISABLE_EXCEPTION=OFF\",\n            
\"-DNCNN_BUILD_BENCHMARK=OFF\",\n            \"-DNCNN_BUILD_EXAMPLES=OFF\",\n            \"-DNCNN_BUILD_TOOLS=OFF\",\n        ]\n        if Vulkan_LIBRARY != \"\":\n            cmake_args.append(\"-DVulkan_LIBRARY=\" + Vulkan_LIBRARY)\n        if CMAKE_TOOLCHAIN_FILE != \"\":\n            cmake_args.append(\"-DCMAKE_TOOLCHAIN_FILE=\" + CMAKE_TOOLCHAIN_FILE)\n        if PLATFORM != \"\":\n            cmake_args.append(\"-DPLATFORM=\" + PLATFORM)\n        if ARCHS != \"\":\n            cmake_args.append(\"-DARCHS=\" + ARCHS)\n        if DEPLOYMENT_TARGET != \"\":\n            cmake_args.append(\"-DDEPLOYMENT_TARGET=\" + DEPLOYMENT_TARGET)\n        if OpenMP_C_FLAGS != \"\":\n            cmake_args.append(\"-DOpenMP_C_FLAGS=\" + OpenMP_C_FLAGS)\n        if OpenMP_CXX_FLAGS != \"\":\n            cmake_args.append(\"-DOpenMP_CXX_FLAGS=\" + OpenMP_CXX_FLAGS)\n        if OpenMP_C_LIB_NAMES != \"\":\n            cmake_args.append(\"-DOpenMP_C_LIB_NAMES=\" + OpenMP_C_LIB_NAMES)\n        if OpenMP_CXX_LIB_NAMES != \"\":\n            cmake_args.append(\"-DOpenMP_CXX_LIB_NAMES=\" + OpenMP_CXX_LIB_NAMES)\n        if OpenMP_libomp_LIBRARY != \"\":\n            cmake_args.append(\"-DOpenMP_libomp_LIBRARY=\" + OpenMP_libomp_LIBRARY)\n        if ENABLE_BITCODE != \"\":\n            cmake_args.append(\"-DENABLE_BITCODE=\" + ENABLE_BITCODE)\n        if ENABLE_ARC != \"\":\n            cmake_args.append(\"-DENABLE_ARC=\" + ENABLE_ARC)\n        if ENABLE_VISIBILITY != \"\":\n            cmake_args.append(\"-DENABLE_VISIBILITY=\" + ENABLE_VISIBILITY)\n\n        cmake_args += EXTRA_CMAKE_ARGS\n\n        build_args = []\n\n        if self.compiler.compiler_type == \"msvc\":\n            # Single config generators are handled \"normally\"\n            single_config = any(x in cmake_generator for x in {\"NMake\", \"Ninja\"})\n\n            # CMake allows an arch-in-generator style for backward compatibility\n            contains_arch = any(x in cmake_generator for x in {\"ARM\", 
\"Win64\"})\n\n            # Specify the arch if using MSVC generator, but only if it doesn't\n            # contain a backward-compatibility arch spec already in the\n            # generator name.\n            if not single_config and not contains_arch:\n                cmake_args += [\"-A\", PLAT_TO_CMAKE[self.plat_name]]\n\n            # Multi-config generators have a different way to specify configs\n            if not single_config:\n                cmake_args += [\n                    \"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_{}={}\".format(cfg.upper(), extdir)\n                ]\n                build_args += [\"--config\", cfg]\n\n        # Set CMAKE_BUILD_PARALLEL_LEVEL to control the parallel build level\n        # across all generators.\n        if \"CMAKE_BUILD_PARALLEL_LEVEL\" not in os.environ:\n            # self.parallel is a Python 3 only way to set parallel jobs by hand\n            # using -j in the build_ext call, not supported by pip or PyPA-build.\n            if hasattr(self, \"parallel\") and self.parallel:\n                # CMake 3.12+ only.\n                build_args += [\"-j{}\".format(self.parallel)]\n            else:\n                # Automatically set parallel jobs based on CPU core count\n                cpu_count = os.cpu_count() or 1\n                build_args += [\"-j{}\".format(cpu_count)]\n\n        if not os.path.exists(self.build_temp):\n            os.makedirs(self.build_temp)\n\n        subprocess.check_call(\n            [\"cmake\", ext.sourcedir] + cmake_args, cwd=self.build_temp\n        )\n        subprocess.check_call(\n            [\"cmake\", \"--build\", \".\"] + build_args, cwd=self.build_temp\n        )\n\n\nif sys.version_info < (3, 0):\n    sys.exit(\"Sorry, Python < 3.0 is not supported\")\n\nrequirements = [\"numpy\", \"tqdm\", \"requests\", \"portalocker\", \"opencv-python\"]\nsetup_requires = []\nif shutil.which(\"cmake\") is None:\n    setup_requires += [\"cmake>=3.12\"]\nif shutil.which(\"ninja\") is None:\n    
setup_requires += [\"ninja; sys_platform != 'win32'\"]\n\nwith io.open(\"README.md\", encoding=\"utf-8\") as h:\n    long_description = h.read()\n\nsetup(\n    name=\"ncnn\",\n    version=find_version(),\n    author=\"nihui\",\n    author_email=\"nihuini@tencent.com\",\n    maintainer=\"caishanli\",\n    maintainer_email=\"caishanli25@gmail.com\",\n    description=\"ncnn is a high-performance neural network inference framework optimized for the mobile platform\",\n    long_description=long_description,\n    long_description_content_type=\"text/markdown\",\n    url=\"https://github.com/Tencent/ncnn\",\n    classifiers=[\n        \"Programming Language :: C++\",\n        \"Programming Language :: Python :: 3\",\n        \"Programming Language :: Python :: 3.6\",\n        \"Programming Language :: Python :: 3.7\",\n        \"Programming Language :: Python :: 3.8\",\n        \"Programming Language :: Python :: 3.9\",\n        \"Programming Language :: Python :: 3.10\",\n        \"Programming Language :: Python :: 3.11\",\n        \"Programming Language :: Python :: 3.12\",\n        \"Programming Language :: Python :: 3.13\",\n        \"Programming Language :: Python :: 3.14\",\n        \"License :: OSI Approved :: BSD License\",\n        \"Operating System :: OS Independent\",\n        \"Topic :: Scientific/Engineering :: Artificial Intelligence\",\n    ],\n    license=\"BSD-3\",\n    python_requires=\">=3.5\",\n    packages=find_packages(\"python\"),\n    package_dir={\"\": \"python\"},\n    setup_requires=setup_requires,\n    install_requires=requirements,\n    ext_modules=[CMakeExtension(\"ncnn\")],\n    cmdclass={'install': InstallCommand, \"build_ext\": CMakeBuild},\n)\n"
  },
  {
    "path": "src/CMakeLists.txt",
    "content": "\n##############################################\n\nconfigure_file(platform.h.in ${CMAKE_CURRENT_BINARY_DIR}/platform.h)\n\n# Add source file to list, and add to special visual folder\nfunction(ncnn_src_group ncnn_src_string folder)\n    string(REPLACE \" \" \";\" _ncnn_src_list ${ncnn_src_string})\n\n    string(REGEX REPLACE \"/\" \"\\\\\\\\\" _target_folder \"${folder}\")\n\n    foreach(_file IN LISTS ${_ncnn_src_list})\n        source_group (\"${_target_folder}\" FILES \"${_file}\")\n    endforeach ()\nendfunction()\n\nset(ncnn_SRCS\n    allocator.cpp\n    benchmark.cpp\n    blob.cpp\n    c_api.cpp\n    command.cpp\n    cpu.cpp\n    datareader.cpp\n    expression.cpp\n    gpu.cpp\n    layer.cpp\n    mat.cpp\n    mat_pixel.cpp\n    mat_pixel_affine.cpp\n    mat_pixel_drawing.cpp\n    mat_pixel_resize.cpp\n    mat_pixel_rotate.cpp\n    modelbin.cpp\n    net.cpp\n    option.cpp\n    paramdict.cpp\n    pipeline.cpp\n    pipelinecache.cpp\n    simpleocv.cpp\n    simpleomp.cpp\n    simplestl.cpp\n    simplemath.cpp\n    simplevk.cpp\n)\n\nif(ANDROID)\n    list(APPEND ncnn_SRCS mat_pixel_android.cpp)\nendif()\n\nncnn_src_group(ncnn_SRCS \"sources\")\n\ninclude_directories(\"${CMAKE_CURRENT_SOURCE_DIR}/layer/${NCNN_TARGET_ARCH}\")\n\n# ncnn macro\ninclude(${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_add_shader.cmake)\ninclude(${CMAKE_CURRENT_SOURCE_DIR}/../cmake/ncnn_add_layer.cmake)\n\n# look for vulkan compute shader and compile\nset(NCNN_SHADER_SPV_HEX_FILES)\n\nset(__LAYER_TYPE_ENUM_INDEX 0)\nset(__LAYER_SHADER_TYPE_ENUM_INDEX 0)\n\n# layer implementation\nncnn_add_layer(AbsVal)\nncnn_add_layer(ArgMax 
OFF)\nncnn_add_layer(BatchNorm)\nncnn_add_layer(Bias)\nncnn_add_layer(BNLL)\nncnn_add_layer(Concat)\nncnn_add_layer(Convolution)\nncnn_add_layer(Crop)\nncnn_add_layer(Deconvolution)\nncnn_add_layer(Dropout)\nncnn_add_layer(Eltwise)\nncnn_add_layer(ELU)\nncnn_add_layer(Embed)\nncnn_add_layer(Exp)\nncnn_add_layer(Flatten)\nncnn_add_layer(InnerProduct)\nncnn_add_layer(Input)\nncnn_add_layer(Log)\nncnn_add_layer(LRN)\nncnn_add_layer(MemoryData)\nncnn_add_layer(MVN)\nncnn_add_layer(Pooling)\nncnn_add_layer(Power)\nncnn_add_layer(PReLU)\nncnn_add_layer(Proposal)\nncnn_add_layer(Reduction)\nncnn_add_layer(ReLU)\nncnn_add_layer(Reshape)\nncnn_add_layer(ROIPooling)\nncnn_add_layer(Scale)\nncnn_add_layer(Sigmoid)\nncnn_add_layer(Slice)\nncnn_add_layer(Softmax)\nncnn_add_layer(Split)\nncnn_add_layer(SPP OFF)\nncnn_add_layer(TanH)\nncnn_add_layer(Threshold)\nncnn_add_layer(Tile)\nncnn_add_layer(RNN)\nncnn_add_layer(LSTM)\nncnn_add_layer(BinaryOp)\nncnn_add_layer(UnaryOp)\nncnn_add_layer(ConvolutionDepthWise)\nncnn_add_layer(Padding)\nncnn_add_layer(Squeeze)\nncnn_add_layer(ExpandDims)\nncnn_add_layer(Normalize)\nncnn_add_layer(Permute)\nncnn_add_layer(PriorBox)\nncnn_add_layer(DetectionOutput)\nncnn_add_layer(Interp)\nncnn_add_layer(DeconvolutionDepthWise)\nncnn_add_layer(ShuffleChannel)\nncnn_add_layer(InstanceNorm)\nncnn_add_layer(Clip)\nncnn_add_layer(Reorg)\nncnn_add_layer(YoloDetectionOutput)\nncnn_add_layer(Quantize)\nncnn_add_layer(Dequantize)\nncnn_add_layer(Yolov3DetectionOutput)\nncnn_add_layer(PSROIPooling)\nncnn_add_layer(ROIAlign)\nncnn_add_layer(Packing)\nncnn_add_layer(Requantize)\nncnn_add_layer(Cast)\nncnn_add_layer(HardSigmoid)\nncnn_add_layer(SELU)\nncnn_add_layer(HardSwish)\nncnn_add_layer(Noop)\nncnn_add_layer(PixelShuffle)\nncnn_add_layer(DeepCopy)\nncnn_add_layer(Mish)\nncnn_add_layer(StatisticsPooling)\nncnn_add_layer(Swish)\nncnn_add_layer(Gemm)\nncnn_add_layer(GroupNorm)\nncnn_add_layer(LayerNorm)\nncnn_add_layer(Softplus)\nncnn_add_layer(GRU)\nncnn_ad
d_layer(MultiHeadAttention)\nncnn_add_layer(GELU)\nncnn_add_layer(Convolution1D)\nncnn_add_layer(Pooling1D)\nncnn_add_layer(ConvolutionDepthWise1D)\nncnn_add_layer(Convolution3D)\nncnn_add_layer(ConvolutionDepthWise3D)\nncnn_add_layer(Pooling3D)\nncnn_add_layer(MatMul)\nncnn_add_layer(Deconvolution1D)\nncnn_add_layer(DeconvolutionDepthWise1D)\nncnn_add_layer(Deconvolution3D)\nncnn_add_layer(DeconvolutionDepthWise3D)\nncnn_add_layer(Einsum)\nncnn_add_layer(DeformableConv2D)\nncnn_add_layer(GLU)\nncnn_add_layer(Fold)\nncnn_add_layer(Unfold)\nncnn_add_layer(GridSample)\nncnn_add_layer(CumulativeSum)\nncnn_add_layer(CopyTo)\nncnn_add_layer(Erf)\nncnn_add_layer(Diag)\nncnn_add_layer(CELU)\nncnn_add_layer(Shrink)\nncnn_add_layer(RMSNorm)\nncnn_add_layer(Spectrogram)\nncnn_add_layer(InverseSpectrogram)\nncnn_add_layer(Flip)\nncnn_add_layer(SDPA)\nncnn_add_layer(RotaryEmbed)\n\nif(NCNN_VULKAN)\n    ncnn_add_shader(${CMAKE_CURRENT_SOURCE_DIR}/convert_ycbcr.comp)\n    ncnn_add_shader(${CMAKE_CURRENT_SOURCE_DIR}/layer/vulkan/shader/vulkan_activation.comp)\nendif()\n\nadd_custom_target(ncnn-generate-spirv DEPENDS ${NCNN_SHADER_SPV_HEX_FILES})\n\n# create new\nconfigure_file(layer_declaration.h.in ${CMAKE_CURRENT_BINARY_DIR}/layer_declaration.h)\nconfigure_file(layer_registry.h.in ${CMAKE_CURRENT_BINARY_DIR}/layer_registry.h)\nconfigure_file(layer_type_enum.h.in ${CMAKE_CURRENT_BINARY_DIR}/layer_type_enum.h)\nconfigure_file(layer_shader_registry.h.in ${CMAKE_CURRENT_BINARY_DIR}/layer_shader_registry.h)\nconfigure_file(layer_shader_spv_data.h.in ${CMAKE_CURRENT_BINARY_DIR}/layer_shader_spv_data.h)\nconfigure_file(layer_shader_type_enum.h.in ${CMAKE_CURRENT_BINARY_DIR}/layer_shader_type_enum.h)\n\nif(NCNN_SHARED_LIB)\n    add_library(ncnn SHARED ${ncnn_SRCS})\nelse()\n    add_library(ncnn STATIC ${ncnn_SRCS})\nendif()\nset_target_properties(ncnn PROPERTIES DEBUG_POSTFIX \"d\")\nif(APPLE OR IOS)\n    # macos / ios only accepts a.b.c.d.e where a=24bit b/c/d/e=10bit\n    # 20201228 
to 20.12.28\n    string(SUBSTRING ${NCNN_VERSION} 2 2 NCNN_VERSION_YEAR)\n    string(SUBSTRING ${NCNN_VERSION} 4 2 NCNN_VERSION_MONTH)\n    string(SUBSTRING ${NCNN_VERSION} 6 2 NCNN_VERSION_DAY)\n    set(NCNN_VERSION_APPLE_STRING ${NCNN_VERSION_MAJOR}.${NCNN_VERSION_MINOR}.${NCNN_VERSION_YEAR}.${NCNN_VERSION_MONTH}.${NCNN_VERSION_DAY})\n    set_target_properties(ncnn PROPERTIES VERSION ${NCNN_VERSION_APPLE_STRING} SOVERSION ${NCNN_VERSION_MAJOR})\nelse()\n    set_target_properties(ncnn PROPERTIES VERSION ${NCNN_VERSION_STRING} SOVERSION ${NCNN_VERSION_MAJOR})\nendif()\n\ninclude(GenerateExportHeader)\ngenerate_export_header(ncnn)\n\nif(NOT NCNN_SHARED_LIB)\n    set_target_properties(ncnn PROPERTIES COMPILE_FLAGS -DNCNN_STATIC_DEFINE)\nendif()\n\nif(NCNN_SIMPLESTL AND NOT NCNN_SIMPLEMATH)\n    # link math lib explicitly\n    target_link_libraries(ncnn PUBLIC m)\nendif()\n\ntarget_include_directories(ncnn\n    PUBLIC\n        $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>\n        $<INSTALL_INTERFACE:include/ncnn>\n        $<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}>\n    PRIVATE\n        $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/layer>)\n\nif(NCNN_OPENMP)\n    if(NOT NCNN_SIMPLEOMP)\n        find_package(OpenMP)\n        if(NOT TARGET OpenMP::OpenMP_CXX AND (OpenMP_CXX_FOUND OR OPENMP_FOUND))\n            target_compile_options(ncnn PRIVATE ${OpenMP_CXX_FLAGS})\n        endif()\n    endif()\n\n    if(NCNN_SIMPLEOMP OR OpenMP_CXX_FOUND OR OPENMP_FOUND)\n        if(NCNN_CMAKE_VERBOSE)\n            message(\"Building with OpenMP\")\n        endif()\n\n        if(NCNN_SIMPLEOMP)\n            if(IOS OR APPLE)\n                target_compile_options(ncnn PRIVATE -Xpreprocessor -fopenmp)\n            else()\n                target_compile_options(ncnn PRIVATE -fopenmp)\n            endif()\n        elseif(ANDROID_NDK_MAJOR AND (ANDROID_NDK_MAJOR GREATER 20))\n            target_compile_options(ncnn PRIVATE -fopenmp)\n            target_link_libraries(ncnn PUBLIC 
-fopenmp -static-openmp)\n        elseif(OpenMP_CXX_FOUND)\n            target_link_libraries(ncnn PUBLIC OpenMP::OpenMP_CXX)\n        else()\n            target_link_libraries(ncnn PRIVATE \"${OpenMP_CXX_FLAGS}\")\n        endif()\n    endif()\nendif()\n\nif(NCNN_THREADS)\n    set(CMAKE_THREAD_PREFER_PTHREAD TRUE)\n    set(THREADS_PREFER_PTHREAD_FLAG TRUE)\n    find_package(Threads REQUIRED)\n\n    if(TARGET Threads::Threads)\n        target_link_libraries(ncnn PUBLIC Threads::Threads)\n    endif()\n    if(NCNN_SIMPLEOMP OR NCNN_SIMPLESTL)\n        target_link_libraries(ncnn PUBLIC pthread)\n    endif()\nendif()\n\nif(NCNN_VULKAN)\n    if(NCNN_SIMPLEVK)\n        if(APPLE)\n            # simplevk use static vulkan linkage on apple platform as fallback\n            if(DEFINED Vulkan_LIBRARY)\n                message(STATUS \"simplevk static vulkan linkage as fallback enabled on APPLE platforms\")\n                target_link_libraries(ncnn PUBLIC ${Vulkan_LIBRARY})\n\n                # https://github.com/KhronosGroup/MoltenVK/blob/main/Docs/MoltenVK_Runtime_UserGuide.md#optionally-link-to-required-system-libraries\n                if(NOT NCNN_SHARED_LIB)\n                    find_library(Metal NAMES Metal)\n                    find_library(Foundation NAMES Foundation)\n                    find_library(QuartzCore NAMES QuartzCore)\n                    find_library(CoreGraphics NAMES CoreGraphics)\n                    find_library(IOSurface NAMES IOSurface)\n                    list(APPEND vulkan_dependent_LINK_LIBRARIES ${Metal} ${Foundation} ${QuartzCore} ${CoreGraphics} ${IOSurface})\n                    if(CMAKE_SYSTEM_NAME STREQUAL \"Darwin\")\n                        if(NOT IOS)\n                            find_library(AppKit NAMES AppKit)\n                            list(APPEND vulkan_dependent_LINK_LIBRARIES ${AppKit})\n                        endif()\n                        find_library(IOKit NAMES IOKit)\n                        list(APPEND 
vulkan_dependent_LINK_LIBRARIES ${IOKit})\n                    endif()\n                    if(IOS OR CMAKE_SYSTEM_NAME STREQUAL \"iOS\" OR CMAKE_SYSTEM_NAME STREQUAL \"tvOS\")\n                        find_library(UIKit NAMES UIKit)\n                        list(APPEND vulkan_dependent_LINK_LIBRARIES ${UIKit})\n                    endif()\n                    target_link_libraries(ncnn PRIVATE ${vulkan_dependent_LINK_LIBRARIES})\n                endif()\n\n            else()\n                message(WARNING \"Vulkan_LIBRARY shall be defined for simplevk static linkage as fallback on APPLE platforms\")\n\n                # link simplevk stub\n                set(SIMPLEVK_TBD \"${CMAKE_CURRENT_SOURCE_DIR}/simplevk.tbd\")\n                target_link_libraries(ncnn PRIVATE \"-Wl,-weak_library,${SIMPLEVK_TBD}\")\n            endif()\n        endif()\n        target_link_libraries(ncnn PRIVATE ${CMAKE_DL_LIBS})\n    else()\n        find_package(Vulkan QUIET)\n        if(NOT Vulkan_FOUND)\n            if(DEFINED ENV{VULKAN_SDK})\n                if(CMAKE_SYSTEM_NAME MATCHES \"Linux\")\n                    list(APPEND CMAKE_MODULE_PATH \"$ENV{VULKAN_SDK}/../source/VulkanTools/cmake\")\n                elseif(CMAKE_SYSTEM_NAME MATCHES \"Windows\")\n                    list(APPEND CMAKE_MODULE_PATH \"$ENV{VULKAN_SDK}/Samples/cmake\")\n                elseif(CMAKE_SYSTEM_NAME MATCHES \"Darwin\")\n                    message(WARNING \"Failed to find vulkan since cmake is too old\\n\"\n                        \"cmake >= 3.7 required. Consider `brew upgrade cmake`\")\n                endif()\n            else()\n                message(FATAL_ERROR \"Error: CMake didn't find Vulkan. 
Please set VULKAN_SDK env var, e.g.:\\n\"\n                    \"Linux: export VULKAN_SDK=~/soft/vulkansdk/1.2.148.0/x86_64\\n\"\n                    \"Windows: set VULKAN_SDK=E:/lib/VulkanSDK/1.2.148.0\\n\"\n                    \"MacOS: export VULKAN_SDK=~/soft/vulkansdk/1.2.148.0/macOS\\n\"\n                )\n            endif()\n            find_package(Vulkan REQUIRED)\n        endif()\n\n        target_link_libraries(ncnn PUBLIC Vulkan::Vulkan)\n    endif()\n\n    # link in-house glslang\n    target_include_directories(ncnn PRIVATE $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/../>)\n    target_link_libraries(ncnn PRIVATE glslang SPIRV)\nendif()\n\nif(NCNN_PLATFORM_API AND ANDROID)\n    target_link_libraries(ncnn PUBLIC android jnigraphics log)\nendif()\n\nif(WIN32)\n    target_compile_definitions(ncnn PUBLIC NOMINMAX)\nendif()\n\nif(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n    target_compile_definitions(ncnn PRIVATE _SCL_SECURE_NO_WARNINGS _CRT_SECURE_NO_DEPRECATE)\n\n    if(CMAKE_BUILD_TYPE MATCHES \"(Release|RELEASE|release)\")\n        target_compile_options(ncnn PRIVATE /fp:fast)\n    endif()\n\n    if(NCNN_TARGET_ARCH STREQUAL \"arm\")\n        # disable msvc svml optimization on arm target as it produces wrong result\n        target_compile_options(ncnn PRIVATE /d2Qvec-mathlib-)\n    endif()\n\n    if(NCNN_SHARED_LIB)\n        # msvc argues about stl string and vector uses in exported functions\n        target_compile_options(ncnn PRIVATE /wd4251)\n    endif()\nelse()\n    target_compile_options(ncnn PRIVATE -Wall -Wextra -Wno-unused-function)\n\n    if(NOT NCNN_DISABLE_PIC)\n        set_target_properties(ncnn PROPERTIES POSITION_INDEPENDENT_CODE ON INTERFACE_POSITION_INDEPENDENT_CODE ON)\n    endif()\n\n    if(CMAKE_BUILD_TYPE MATCHES \"(Release|RELEASE|release)\")\n        if(NOT CMAKE_SYSTEM_NAME STREQUAL 
\"Emscripten\" AND NOT (CMAKE_CXX_COMPILER_ID MATCHES \"GNU\" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 4.6) AND NOT CMAKE_CXX_COMPILER_ID MATCHES \"Clang\")\n            target_compile_options(ncnn PRIVATE -Ofast)\n        endif()\n\n        target_compile_options(ncnn PRIVATE -ffast-math)\n    endif()\n\n    # target_compile_options(ncnn PRIVATE -march=native)\n    target_compile_options(ncnn PRIVATE -fvisibility=hidden -fvisibility-inlines-hidden)\n    if(NCNN_SHARED_LIB AND NCNN_ENABLE_LTO)\n        set_target_properties(ncnn PROPERTIES INTERPROCEDURAL_OPTIMIZATION ON)\n    endif()\nendif()\n\nif(NCNN_DISABLE_RTTI)\n    if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n        target_compile_options(ncnn PUBLIC /GR-)\n    else()\n        target_compile_options(ncnn PUBLIC -fno-rtti)\n    endif()\nendif()\n\nif(NCNN_DISABLE_EXCEPTION)\n    if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n        target_compile_options(ncnn PUBLIC /EHsc /D_HAS_EXCEPTIONS=0)\n    else()\n        target_compile_options(ncnn PUBLIC -fno-exceptions)\n    endif()\nendif()\n\nif(NCNN_TARGET_ARCH STREQUAL \"x86\")\n    if(NCNN_SSE2)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n            if(CMAKE_SIZEOF_VOID_P EQUAL 4)\n                target_compile_options(ncnn PRIVATE /arch:SSE2)\n            endif()\n            target_compile_options(ncnn PRIVATE /D__SSE2__)\n        else()\n            if(NOT CMAKE_SYSTEM_NAME MATCHES \"WASI\")\n                target_compile_options(ncnn PRIVATE -msse2 -msse)\n            endif ()\n            if(CMAKE_SYSTEM_NAME 
MATCHES \"Emscripten|WASI\")\n                target_compile_options(ncnn PRIVATE -msimd128)\n            endif()\n        endif()\n\n        if(NCNN_COMPILER_SUPPORT_X86_RECIP_NONE)\n            # recip optimization causes precision loss\n            target_compile_options(ncnn PRIVATE -mrecip=none)\n        endif()\n    endif()\n\n    if(NOT NCNN_RUNTIME_CPU AND NCNN_AVX512)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            target_compile_options(ncnn PRIVATE /arch:AVX512 /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__)\n            if(NCNN_AVX512VNNI)\n                target_compile_options(ncnn PRIVATE /D__AVX512VNNI__)\n            endif()\n        elseif(CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")\n            target_compile_options(ncnn PRIVATE /arch:AVX512 -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c /D__SSSE3__ /D__SSE4_1__ /D__FMA__ /D__F16C__)\n            if(NCNN_AVX512VNNI)\n                target_compile_options(ncnn PRIVATE -mavx512vnni /D__AVX512VNNI__)\n            endif()\n            if(NCNN_AVX512BF16)\n                target_compile_options(ncnn PRIVATE -mavx512bf16 /D__AVX512BF16__)\n            endif()\n            if(NCNN_AVX512FP16)\n                target_compile_options(ncnn PRIVATE -mavx512fp16 /D__AVX512FP16__)\n            endif()\n        else()\n            target_compile_options(ncnn PRIVATE -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mfma -mf16c)\n            if(NCNN_AVX512VNNI)\n                target_compile_options(ncnn PRIVATE -mavx512vnni)\n            endif()\n            if(NCNN_AVX512BF16)\n                target_compile_options(ncnn PRIVATE -mavx512bf16)\n            endif()\n            if(NCNN_AVX512FP16)\n                target_compile_options(ncnn PRIVATE -mavx512fp16)\n            endif()\n        endif()\n    elseif(NOT NCNN_RUNTIME_CPU AND NCNN_FMA)\n        if(CMAKE_CXX_COMPILER_ID 
MATCHES \"MSVC\")\n            if(NCNN_AVX2)\n                target_compile_options(ncnn PRIVATE /arch:AVX2 /D__SSSE3__ /D__SSE4_1__ /D__FMA__)\n            else()\n                target_compile_options(ncnn PRIVATE /arch:AVX /D__SSSE3__ /D__SSE4_1__ /D__FMA__)\n            endif()\n            if(NCNN_AVXVNNIINT8)\n                target_compile_options(ncnn PRIVATE /D__AVXVNNIINT8__)\n            endif()\n            if(NCNN_AVXVNNIINT16)\n                target_compile_options(ncnn PRIVATE /D__AVXVNNIINT16__)\n            endif()\n            if(NCNN_AVXNECONVERT)\n                target_compile_options(ncnn PRIVATE /D__AVXNECONVERT__)\n            endif()\n            if(NCNN_AVXVNNI)\n                target_compile_options(ncnn PRIVATE /D__AVXVNNI__)\n            elseif(NCNN_XOP)\n                target_compile_options(ncnn PRIVATE /D__XOP__)\n            endif()\n            if(NCNN_F16C)\n                target_compile_options(ncnn PRIVATE /D__F16C__)\n            endif()\n        elseif(CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")\n            if(NCNN_AVX2)\n                target_compile_options(ncnn PRIVATE /arch:AVX2 -mfma /D__SSSE3__ /D__SSE4_1__ /D__FMA__)\n            else()\n                target_compile_options(ncnn PRIVATE /arch:AVX -mfma /D__SSSE3__ /D__SSE4_1__ /D__FMA__)\n            endif()\n            if(NCNN_AVXVNNIINT8)\n                target_compile_options(ncnn PRIVATE -mavxvnniint8 /D__AVXVNNIINT8__)\n            endif()\n            if(NCNN_AVXVNNIINT16)\n                target_compile_options(ncnn PRIVATE -mavxvnniint16 /D__AVXVNNIINT16__)\n            endif()\n            if(NCNN_AVXNECONVERT)\n                target_compile_options(ncnn PRIVATE -mavxneconvert /D__AVXNECONVERT__)\n            endif()\n            if(NCNN_AVXVNNI)\n                target_compile_options(ncnn PRIVATE -mavxvnni /D__AVXVNNI__)\n            
elseif(NCNN_XOP)\n                target_compile_options(ncnn PRIVATE -mxop /D__XOP__)\n            endif()\n            if(NCNN_F16C)\n                target_compile_options(ncnn PRIVATE -mf16c /D__F16C__)\n            endif()\n        else()\n            if(NCNN_AVX2)\n                target_compile_options(ncnn PRIVATE -mavx2 -mfma)\n            else()\n                target_compile_options(ncnn PRIVATE -mavx -mfma)\n            endif()\n            if(NCNN_AVXVNNIINT8)\n                target_compile_options(ncnn PRIVATE -mavxvnniint8)\n            endif()\n            if(NCNN_AVXVNNIINT16)\n                target_compile_options(ncnn PRIVATE -mavxvnniint16)\n            endif()\n            if(NCNN_AVXNECONVERT)\n                target_compile_options(ncnn PRIVATE -mavxneconvert)\n            endif()\n            if(NCNN_AVXVNNI)\n                target_compile_options(ncnn PRIVATE -mavxvnni)\n            elseif(NCNN_XOP)\n                target_compile_options(ncnn PRIVATE -mxop)\n            endif()\n            if(NCNN_F16C)\n                target_compile_options(ncnn PRIVATE -mf16c)\n            endif()\n        endif()\n    elseif(NOT NCNN_RUNTIME_CPU AND NCNN_AVX)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            target_compile_options(ncnn PRIVATE /arch:AVX /D__SSSE3__ /D__SSE4_1__)\n            if(NCNN_XOP)\n                target_compile_options(ncnn PRIVATE /D__XOP__)\n            endif()\n            if(NCNN_F16C)\n                target_compile_options(ncnn PRIVATE /D__F16C__)\n            endif()\n        elseif(CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\")\n            target_compile_options(ncnn PRIVATE /arch:AVX /D__SSSE3__ /D__SSE4_1__)\n            if(NCNN_XOP)\n                target_compile_options(ncnn PRIVATE -mxop /D__XOP__)\n            endif()\n            if(NCNN_F16C)\n                target_compile_options(ncnn PRIVATE 
-mf16c /D__F16C__)\n            endif()\n        else()\n            target_compile_options(ncnn PRIVATE -mavx)\n            if(NCNN_XOP)\n                target_compile_options(ncnn PRIVATE -mxop)\n            endif()\n            if(NCNN_F16C)\n                target_compile_options(ncnn PRIVATE -mf16c)\n            endif()\n        endif()\n    endif()\nendif()\n\nif(NCNN_TARGET_ARCH STREQUAL \"arm\" AND (CMAKE_SIZEOF_VOID_P EQUAL 4 AND NOT NCNN_TARGET_ILP32))\n    if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n        # always enable neon for msvc arm\n        target_compile_options(ncnn PRIVATE /D__arm__ /D__ARM_NEON)\n    endif()\n\n    if(NOT NCNN_RUNTIME_CPU AND NCNN_VFPV4)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n            target_compile_options(ncnn PRIVATE /arch:VFPv4 /D__ARM_FP=0x0E)\n        else()\n            if(NCNN_COMPILER_SUPPORT_ARM_VFPV4)\n                target_compile_options(ncnn PRIVATE -mfpu=neon-vfpv4)\n            elseif(NCNN_COMPILER_SUPPORT_ARM_VFPV4_FP16)\n                target_compile_options(ncnn PRIVATE -mfpu=neon-vfpv4 -mfp16-format=ieee)\n            endif()\n        endif()\n    elseif(NOT NCNN_RUNTIME_CPU)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n            target_compile_options(ncnn PRIVATE /D__ARM_FP=0x0C)\n        endif()\n    endif()\nendif()\n\nif(NCNN_TARGET_ARCH STREQUAL \"arm\" AND (CMAKE_SIZEOF_VOID_P EQUAL 8 OR NCNN_TARGET_ILP32))\n    if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND 
CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n        # always enable neon and vfpv4 for msvc arm64\n        target_compile_options(ncnn PRIVATE /D__arm__ /D__aarch64__ /D__ARM_NEON /D__ARM_FP=0x0E)\n    endif()\n\n    if(NOT NCNN_RUNTIME_CPU AND NCNN_ARM86SVE)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n            # TODO add support for sve family\n            target_compile_options(ncnn PRIVATE /arch:armv8.6 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD /D__ARM_FEATURE_FP16_FML /D__ARM_FEATURE_BF16_VECTOR_ARITHMETIC /D__ARM_FEATURE_MATMUL_INT8)\n            if(NCNN_ARM86SVE2)\n            endif()\n            if(NCNN_ARM86SVEBF16)\n            endif()\n            if(NCNN_ARM86SVEI8MM)\n            endif()\n            if(NCNN_ARM86SVEF32MM)\n            endif()\n        endif()\n        if(NOT CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            set(ARM_MARCH_FLAG \"-march=armv8.6-a+fp16+dotprod+sve\")\n            if(NCNN_ARM86SVE2)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+sve2\")\n            endif()\n            if(NCNN_ARM86SVEBF16)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+bf16\")\n            endif()\n            if(NCNN_ARM86SVEI8MM)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+i8mm\")\n            endif()\n            if(NCNN_ARM86SVEF32MM)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+f32mm\")\n            endif()\n        endif()\n    elseif(NOT NCNN_RUNTIME_CPU AND (NCNN_ARM84BF16 OR NCNN_ARM84I8MM))\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n            target_compile_options(ncnn PRIVATE /arch:armv8.4 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC /D__ARM_FEATURE_DOTPROD 
/D__ARM_FEATURE_FP16_FML)\n            if(NCNN_ARM84BF16)\n                target_compile_options(ncnn PRIVATE /D__ARM_FEATURE_BF16_VECTOR_ARITHMETIC)\n            endif()\n            if(NCNN_ARM84I8MM)\n                target_compile_options(ncnn PRIVATE /D__ARM_FEATURE_MATMUL_INT8)\n            endif()\n        endif()\n        if(NOT CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            set(ARM_MARCH_FLAG \"-march=armv8.4-a+fp16+dotprod\")\n            if(NCNN_ARM84BF16)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+bf16\")\n            endif()\n            if(NCNN_ARM84I8MM)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+i8mm\")\n            endif()\n        endif()\n    elseif(NOT NCNN_RUNTIME_CPU AND NCNN_ARM82)\n        if(CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\" OR (CMAKE_CXX_COMPILER_ID MATCHES \"Clang\" AND CMAKE_CXX_SIMULATE_ID MATCHES \"MSVC\" AND CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES \"MSVC\"))\n            target_compile_options(ncnn PRIVATE /arch:armv8.2 /D__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)\n            if(NCNN_ARM82DOT)\n                target_compile_options(ncnn PRIVATE /D__ARM_FEATURE_DOTPROD)\n            endif()\n            if(NCNN_ARM82FP16FML)\n                target_compile_options(ncnn PRIVATE /D__ARM_FEATURE_FP16_FML)\n            endif()\n        endif()\n        if(NOT CMAKE_CXX_COMPILER_ID MATCHES \"MSVC\")\n            set(ARM_MARCH_FLAG \"-march=armv8.2-a+fp16\")\n            if(NCNN_ARM82DOT)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+dotprod\")\n            endif()\n            if(NCNN_ARM82FP16FML)\n                set(ARM_MARCH_FLAG \"${ARM_MARCH_FLAG}+fp16fml\")\n                # clang 9.0.9 shipped with android ndk-r21 is missing __ARM_FEATURE_FP16_FML macro for asimdfhm target\n                target_compile_options(ncnn PRIVATE -D__ARM_FEATURE_FP16_FML)\n            endif()\n        endif()\n    endif()\n    target_compile_options(ncnn PRIVATE ${ARM_MARCH_FLAG})\n\n    
if(ANDROID_NDK_MAJOR AND (ANDROID_NDK_MAJOR GREATER_EQUAL 23))\n        # llvm 12 in ndk-23 enables out-of-line atomics by default\n        # disable this feature for fixing linking atomic builtins issue with old ndk\n        target_compile_options(ncnn PRIVATE -mno-outline-atomics)\n    endif()\nendif()\n\nif(NCNN_TARGET_ARCH STREQUAL \"mips\")\n    if(NOT NCNN_RUNTIME_CPU AND NCNN_MSA)\n        target_compile_options(ncnn PRIVATE -mmsa)\n    endif()\n    if(NOT NCNN_RUNTIME_CPU AND NCNN_MMI)\n        target_compile_options(ncnn PRIVATE -mloongson-mmi)\n    endif()\nendif()\n\nif(NCNN_TARGET_ARCH STREQUAL \"loongarch\")\n    if(NOT NCNN_RUNTIME_CPU AND NCNN_LSX)\n        target_compile_options(ncnn PRIVATE -mlsx)\n        if(NCNN_LASX)\n            target_compile_options(ncnn PRIVATE -mlasx)\n        endif()\n    endif()\nendif()\n\nif(NCNN_TARGET_ARCH STREQUAL \"riscv\" AND NOT C906)\n    if(CMAKE_SIZEOF_VOID_P EQUAL 8)\n        if(NOT NCNN_RUNTIME_CPU AND NCNN_RVV)\n            set(RISCV_MARCH_FLAG \"-march=rv64gcv\")\n            if(NCNN_ZFH)\n                set(RISCV_MARCH_FLAG \"${RISCV_MARCH_FLAG}_zfh\")\n                target_compile_options(ncnn PRIVATE -D__fp16=_Float16)\n            endif()\n            if(NCNN_ZVFH)\n                set(RISCV_MARCH_FLAG \"${RISCV_MARCH_FLAG}_zvfh\")\n            endif()\n        elseif(NOT NCNN_RUNTIME_CPU AND NCNN_XTHEADVECTOR)\n            set(RISCV_MARCH_FLAG \"-march=rv64gc_xtheadvector\")\n            if(NCNN_ZFH)\n                set(RISCV_MARCH_FLAG \"${RISCV_MARCH_FLAG}_zfh\")\n                target_compile_options(ncnn PRIVATE -D__riscv_zvfh=1 -D__fp16=_Float16)\n            endif()\n        endif()\n        target_compile_options(ncnn PRIVATE ${RISCV_MARCH_FLAG})\n    elseif(CMAKE_SIZEOF_VOID_P EQUAL 4)\n        if(NOT NCNN_RUNTIME_CPU AND NCNN_RVV)\n            set(RISCV_MARCH_FLAG \"-march=rv32gcv\")\n            if(NCNN_ZFH)\n                set(RISCV_MARCH_FLAG \"${RISCV_MARCH_FLAG}_zfh\")\n             
   target_compile_options(ncnn PRIVATE -D__fp16=_Float16)\n            endif()\n            if(NCNN_ZVFH)\n                set(RISCV_MARCH_FLAG \"${RISCV_MARCH_FLAG}_zvfh\")\n            endif()\n        elseif(NOT NCNN_RUNTIME_CPU AND NCNN_XTHEADVECTOR)\n            set(RISCV_MARCH_FLAG \"-march=rv32gc_xtheadvector\")\n            if(NCNN_ZFH)\n                set(RISCV_MARCH_FLAG \"${RISCV_MARCH_FLAG}_zfh\")\n                target_compile_options(ncnn PRIVATE -D__riscv_zvfh=1 -D__fp16=_Float16)\n            endif()\n        endif()\n        target_compile_options(ncnn PRIVATE ${RISCV_MARCH_FLAG})\n    endif()\nendif()\n\nif(NCNN_PPC64LE_VSX)\n    # Auto-translate SSE2 to VSX if compiler is new enough.\n    if(NCNN_VSX_SSE2)\n        target_compile_options(ncnn PRIVATE -DNO_WARN_X86_INTRINSICS -D__SSE2__)\n    endif()\n\n    # Auto-translate SSE4.1 to VSX if compiler is new enough.\n    if(NCNN_VSX_SSE41)\n        target_compile_options(ncnn PRIVATE -DNO_WARN_X86_INTRINSICS -D__SSE4_1__)\n    endif()\nendif()\n\nif(NCNN_COVERAGE)\n    target_compile_options(ncnn PUBLIC -coverage -fprofile-arcs -ftest-coverage)\n    target_link_libraries(ncnn PUBLIC -coverage -lgcov)\nendif()\n\nif(NCNN_ASAN)\n    target_compile_options(ncnn PUBLIC -fsanitize=address)\n    target_link_libraries(ncnn PUBLIC -fsanitize=address)\nendif()\n\nadd_dependencies(ncnn ncnn-generate-spirv)\n\nif(NCNN_INSTALL_SDK)\n    include(GNUInstallDirs)\n\n    include(CMakePackageConfigHelpers)\n    write_basic_package_version_file(\n        ${CMAKE_CURRENT_BINARY_DIR}/ncnnConfigVersion.cmake\n        VERSION ${NCNN_VERSION}\n        COMPATIBILITY AnyNewerVersion\n    )\n\n    install(TARGETS ncnn EXPORT ncnn\n        ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}\n        LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}\n        RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}\n    )\n    install(FILES\n        allocator.h\n        benchmark.h\n        blob.h\n        c_api.h\n        command.h\n        cpu.h\n    
    datareader.h\n        expression.h\n        gpu.h\n        layer.h\n        layer_shader_type.h\n        layer_type.h\n        mat.h\n        modelbin.h\n        net.h\n        option.h\n        paramdict.h\n        pipeline.h\n        pipelinecache.h\n        simpleocv.h\n        simpleomp.h\n        simplestl.h\n        simplemath.h\n        simplevk.h\n        vulkan_header_fix.h\n        ${CMAKE_CURRENT_BINARY_DIR}/ncnn_export.h\n        ${CMAKE_CURRENT_BINARY_DIR}/layer_shader_type_enum.h\n        ${CMAKE_CURRENT_BINARY_DIR}/layer_type_enum.h\n        ${CMAKE_CURRENT_BINARY_DIR}/platform.h\n        DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/ncnn\n    )\n    install(EXPORT ncnn DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/ncnn)\n    configure_file(${CMAKE_CURRENT_LIST_DIR}/../cmake/ncnnConfig.cmake.in ncnnConfig.cmake @ONLY)\n    install(FILES\n        ${CMAKE_CURRENT_BINARY_DIR}/ncnnConfig.cmake\n        ${CMAKE_CURRENT_BINARY_DIR}/ncnnConfigVersion.cmake\n        DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/ncnn)\n    # pkgconfig\n    configure_file(ncnn.pc.in ${CMAKE_CURRENT_BINARY_DIR}/ncnn.pc @ONLY)\n    install(FILES ${CMAKE_CURRENT_BINARY_DIR}/ncnn.pc DESTINATION ${CMAKE_INSTALL_LIBDIR}/pkgconfig)\nendif()\n\n# add ncnn and generate-spirv to a virtual project group\nset_property(GLOBAL PROPERTY USE_FOLDERS ON)\nset_property(TARGET ncnn PROPERTY FOLDER \"libncnn\")\nset_property(TARGET ncnn-generate-spirv PROPERTY FOLDER \"libncnn\")\n"
  },
  {
    "path": "src/allocator.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"allocator.h\"\n\n#include \"gpu.h\"\n#include \"pipeline.h\"\n\n#if __ANDROID_API__ >= 26\n#include <android/hardware_buffer.h>\n#endif // __ANDROID_API__ >= 26\n\nnamespace ncnn {\n\nAllocator::~Allocator()\n{\n}\n\nclass PoolAllocatorPrivate\n{\npublic:\n    Mutex budgets_lock;\n    Mutex payouts_lock;\n    unsigned int size_compare_ratio; // 0~256\n    size_t size_drop_threshold;\n    std::list<std::pair<size_t, void*> > budgets;\n    std::list<std::pair<size_t, void*> > payouts;\n};\n\nPoolAllocator::PoolAllocator()\n    : Allocator(), d(new PoolAllocatorPrivate)\n{\n    d->size_compare_ratio = 0;\n    d->size_drop_threshold = 10;\n}\n\nPoolAllocator::~PoolAllocator()\n{\n    clear();\n\n    if (!d->payouts.empty())\n    {\n        NCNN_LOGE(\"FATAL ERROR! pool allocator destroyed too early\");\n#if NCNN_STDIO\n        std::list<std::pair<size_t, void*> >::iterator it = d->payouts.begin();\n        for (; it != d->payouts.end(); ++it)\n        {\n            void* ptr = it->second;\n            NCNN_LOGE(\"%p still in use\", ptr);\n        }\n#endif\n    }\n\n    delete d;\n}\n\nPoolAllocator::PoolAllocator(const PoolAllocator&)\n    : d(0)\n{\n}\n\nPoolAllocator& PoolAllocator::operator=(const PoolAllocator&)\n{\n    return *this;\n}\n\nvoid PoolAllocator::clear()\n{\n    d->budgets_lock.lock();\n\n    std::list<std::pair<size_t, void*> >::iterator it = d->budgets.begin();\n    for (; it != d->budgets.end(); ++it)\n    {\n        void* ptr = it->second;\n        ncnn::fastFree(ptr);\n    }\n    d->budgets.clear();\n\n    d->budgets_lock.unlock();\n}\n\nvoid PoolAllocator::set_size_compare_ratio(float scr)\n{\n    if (scr < 0.f || scr > 1.f)\n    {\n        NCNN_LOGE(\"invalid size compare ratio %f\", scr);\n        return;\n    }\n\n    d->size_compare_ratio = (unsigned int)(scr * 256);\n}\n\nvoid PoolAllocator::set_size_drop_threshold(size_t threshold)\n{\n    
d->size_drop_threshold = threshold;\n}\n\nvoid* PoolAllocator::fastMalloc(size_t size)\n{\n    d->budgets_lock.lock();\n\n    // find free budget\n    std::list<std::pair<size_t, void*> >::iterator it = d->budgets.begin(), it_max = d->budgets.begin(), it_min = d->budgets.begin();\n    for (; it != d->budgets.end(); ++it)\n    {\n        size_t bs = it->first;\n\n        // size_compare_ratio ~ 100%\n        if (bs >= size && ((bs * d->size_compare_ratio) >> 8) <= size)\n        {\n            void* ptr = it->second;\n\n            d->budgets.erase(it);\n\n            d->budgets_lock.unlock();\n\n            d->payouts_lock.lock();\n\n            d->payouts.push_back(std::make_pair(bs, ptr));\n\n            d->payouts_lock.unlock();\n\n            return ptr;\n        }\n\n        if (bs < it_min->first)\n        {\n            it_min = it;\n        }\n        if (bs > it_max->first)\n        {\n            it_max = it;\n        }\n    }\n\n    if (d->budgets.size() >= d->size_drop_threshold)\n    {\n        // All chunks in pool are not chosen. 
Then try to drop some outdated\n        // chunks and return them to OS.\n        if (it_max->first < size)\n        {\n            // Current query is asking for a chunk larger than any cached chunks.\n            // Then remove the smallest one.\n            ncnn::fastFree(it_min->second);\n            d->budgets.erase(it_min);\n        }\n        else if (it_min->first > size)\n        {\n            // Current query is asking for a chunk smaller than any cached chunks.\n            // Then remove the largest one.\n            ncnn::fastFree(it_max->second);\n            d->budgets.erase(it_max);\n        }\n    }\n\n    d->budgets_lock.unlock();\n\n    // new\n    void* ptr = ncnn::fastMalloc(size);\n\n    d->payouts_lock.lock();\n\n    d->payouts.push_back(std::make_pair(size, ptr));\n\n    d->payouts_lock.unlock();\n\n    return ptr;\n}\n\nvoid PoolAllocator::fastFree(void* ptr)\n{\n    d->payouts_lock.lock();\n\n    // return to budgets\n    std::list<std::pair<size_t, void*> >::iterator it = d->payouts.begin();\n    for (; it != d->payouts.end(); ++it)\n    {\n        if (it->second == ptr)\n        {\n            size_t size = it->first;\n\n            d->payouts.erase(it);\n\n            d->payouts_lock.unlock();\n\n            d->budgets_lock.lock();\n\n            d->budgets.push_back(std::make_pair(size, ptr));\n\n            d->budgets_lock.unlock();\n\n            return;\n        }\n    }\n\n    d->payouts_lock.unlock();\n\n    NCNN_LOGE(\"FATAL ERROR! 
pool allocator get wild %p\", ptr);\n    ncnn::fastFree(ptr);\n}\n\nclass UnlockedPoolAllocatorPrivate\n{\npublic:\n    unsigned int size_compare_ratio; // 0~256\n    size_t size_drop_threshold;\n    std::list<std::pair<size_t, void*> > budgets;\n    std::list<std::pair<size_t, void*> > payouts;\n};\n\nUnlockedPoolAllocator::UnlockedPoolAllocator()\n    : Allocator(), d(new UnlockedPoolAllocatorPrivate)\n{\n    d->size_compare_ratio = 0;\n    d->size_drop_threshold = 10;\n}\n\nUnlockedPoolAllocator::~UnlockedPoolAllocator()\n{\n    clear();\n\n    if (!d->payouts.empty())\n    {\n        NCNN_LOGE(\"FATAL ERROR! unlocked pool allocator destroyed too early\");\n#if NCNN_STDIO\n        std::list<std::pair<size_t, void*> >::iterator it = d->payouts.begin();\n        for (; it != d->payouts.end(); ++it)\n        {\n            void* ptr = it->second;\n            NCNN_LOGE(\"%p still in use\", ptr);\n        }\n#endif\n    }\n\n    delete d;\n}\n\nUnlockedPoolAllocator::UnlockedPoolAllocator(const UnlockedPoolAllocator&)\n    : d(0)\n{\n}\n\nUnlockedPoolAllocator& UnlockedPoolAllocator::operator=(const UnlockedPoolAllocator&)\n{\n    return *this;\n}\n\nvoid UnlockedPoolAllocator::clear()\n{\n    std::list<std::pair<size_t, void*> >::iterator it = d->budgets.begin();\n    for (; it != d->budgets.end(); ++it)\n    {\n        void* ptr = it->second;\n        ncnn::fastFree(ptr);\n    }\n    d->budgets.clear();\n}\n\nvoid UnlockedPoolAllocator::set_size_compare_ratio(float scr)\n{\n    if (scr < 0.f || scr > 1.f)\n    {\n        NCNN_LOGE(\"invalid size compare ratio %f\", scr);\n        return;\n    }\n\n    d->size_compare_ratio = (unsigned int)(scr * 256);\n}\n\nvoid UnlockedPoolAllocator::set_size_drop_threshold(size_t threshold)\n{\n    d->size_drop_threshold = threshold;\n}\n\nvoid* UnlockedPoolAllocator::fastMalloc(size_t size)\n{\n    // find free budget\n    std::list<std::pair<size_t, void*> >::iterator it = d->budgets.begin(), it_max = d->budgets.begin(), 
it_min = d->budgets.begin();\n    for (; it != d->budgets.end(); ++it)\n    {\n        size_t bs = it->first;\n\n        // size_compare_ratio ~ 100%\n        if (bs >= size && ((bs * d->size_compare_ratio) >> 8) <= size)\n        {\n            void* ptr = it->second;\n\n            d->budgets.erase(it);\n\n            d->payouts.push_back(std::make_pair(bs, ptr));\n\n            return ptr;\n        }\n\n        if (bs > it_max->first)\n        {\n            it_max = it;\n        }\n        if (bs < it_min->first)\n        {\n            it_min = it;\n        }\n    }\n\n    if (d->budgets.size() >= d->size_drop_threshold)\n    {\n        if (it_max->first < size)\n        {\n            ncnn::fastFree(it_min->second);\n            d->budgets.erase(it_min);\n        }\n        else if (it_min->first > size)\n        {\n            ncnn::fastFree(it_max->second);\n            d->budgets.erase(it_max);\n        }\n    }\n\n    // new\n    void* ptr = ncnn::fastMalloc(size);\n\n    d->payouts.push_back(std::make_pair(size, ptr));\n\n    return ptr;\n}\n\nvoid UnlockedPoolAllocator::fastFree(void* ptr)\n{\n    // return to budgets\n    std::list<std::pair<size_t, void*> >::iterator it = d->payouts.begin();\n    for (; it != d->payouts.end(); ++it)\n    {\n        if (it->second == ptr)\n        {\n            size_t size = it->first;\n\n            d->payouts.erase(it);\n\n            d->budgets.push_back(std::make_pair(size, ptr));\n\n            return;\n        }\n    }\n\n    NCNN_LOGE(\"FATAL ERROR! 
unlocked pool allocator get wild %p\", ptr);\n    ncnn::fastFree(ptr);\n}\n\n#if NCNN_VULKAN\nVkAllocator::VkAllocator(const VulkanDevice* _vkdev)\n    : vkdev(_vkdev)\n{\n    buffer_memory_type_index = (uint32_t)-1;\n    image_memory_type_index = (uint32_t)-1;\n    reserved_type_index = (uint32_t)-1;\n    mappable = false;\n    coherent = false;\n}\n\nVkAllocator::~VkAllocator()\n{\n    clear();\n}\n\nvoid VkAllocator::clear()\n{\n}\n\nstatic inline size_t round_up(size_t n, size_t multiple)\n{\n    return (n + multiple - 1) / multiple * multiple;\n}\n\nstatic inline size_t round_down(size_t n, size_t multiple)\n{\n    return n / multiple * multiple;\n}\n\nint VkAllocator::flush(VkBufferMemory* ptr)\n{\n    if (coherent)\n        return 0;\n\n    VkMappedMemoryRange mappedMemoryRange;\n    mappedMemoryRange.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;\n    mappedMemoryRange.pNext = 0;\n    mappedMemoryRange.memory = ptr->memory;\n    mappedMemoryRange.offset = round_down(ptr->offset, vkdev->info.non_coherent_atom_size());\n    mappedMemoryRange.size = round_up(ptr->offset + ptr->capacity, vkdev->info.non_coherent_atom_size()) - mappedMemoryRange.offset;\n\n    VkResult ret = vkFlushMappedMemoryRanges(vkdev->vkdevice(), 1, &mappedMemoryRange);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkFlushMappedMemoryRanges failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nint VkAllocator::invalidate(VkBufferMemory* ptr)\n{\n    if (coherent)\n        return 0;\n\n    VkMappedMemoryRange mappedMemoryRange;\n    mappedMemoryRange.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;\n    mappedMemoryRange.pNext = 0;\n    mappedMemoryRange.memory = ptr->memory;\n    mappedMemoryRange.offset = round_down(ptr->offset, vkdev->info.non_coherent_atom_size());\n    mappedMemoryRange.size = round_up(ptr->offset + ptr->capacity, vkdev->info.non_coherent_atom_size()) - mappedMemoryRange.offset;\n\n    VkResult ret = 
vkInvalidateMappedMemoryRanges(vkdev->vkdevice(), 1, &mappedMemoryRange);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkInvalidateMappedMemoryRanges failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nVkBuffer VkAllocator::create_buffer(size_t size, VkBufferUsageFlags usage)\n{\n    VkBufferCreateInfo bufferCreateInfo;\n    bufferCreateInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;\n    bufferCreateInfo.pNext = 0;\n    bufferCreateInfo.flags = 0;\n    bufferCreateInfo.size = size;\n    bufferCreateInfo.usage = usage;\n    bufferCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;\n    bufferCreateInfo.queueFamilyIndexCount = 0;\n    bufferCreateInfo.pQueueFamilyIndices = 0;\n\n    VkBuffer buffer = 0;\n    VkResult ret = vkCreateBuffer(vkdev->vkdevice(), &bufferCreateInfo, 0, &buffer);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateBuffer failed %d\", ret);\n        return 0;\n    }\n\n    return buffer;\n}\n\nVkDeviceMemory VkAllocator::allocate_memory(size_t size, uint32_t memory_type_index)\n{\n    VkMemoryAllocateInfo memoryAllocateInfo;\n    memoryAllocateInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;\n    memoryAllocateInfo.pNext = 0;\n    memoryAllocateInfo.allocationSize = size;\n    memoryAllocateInfo.memoryTypeIndex = memory_type_index;\n\n    VkDeviceMemory memory = 0;\n    VkResult ret = vkAllocateMemory(vkdev->vkdevice(), &memoryAllocateInfo, 0, &memory);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkAllocateMemory failed %d\", ret);\n        return 0;\n    }\n\n    return memory;\n}\n\nVkDeviceMemory VkAllocator::allocate_dedicated_memory(size_t size, uint32_t memory_type_index, VkImage image, VkBuffer buffer)\n{\n    VkMemoryAllocateInfo memoryAllocateInfo;\n    memoryAllocateInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;\n    memoryAllocateInfo.pNext = 0;\n    memoryAllocateInfo.allocationSize = size;\n    memoryAllocateInfo.memoryTypeIndex = memory_type_index;\n\n    
VkMemoryDedicatedAllocateInfoKHR memoryDedicatedAllocateInfo;\n    memoryDedicatedAllocateInfo.sType = VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO_KHR;\n    memoryDedicatedAllocateInfo.pNext = 0;\n    memoryDedicatedAllocateInfo.image = image;\n    memoryDedicatedAllocateInfo.buffer = buffer;\n    memoryAllocateInfo.pNext = &memoryDedicatedAllocateInfo;\n\n    VkDeviceMemory memory = 0;\n    VkResult ret = vkAllocateMemory(vkdev->vkdevice(), &memoryAllocateInfo, 0, &memory);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkAllocateMemory failed %d\", ret);\n        return 0;\n    }\n\n    return memory;\n}\n\nVkDeviceMemory VkAllocator::allocate_import_host_memory(size_t size, uint32_t memory_type_index, void* host_ptr)\n{\n    VkMemoryAllocateInfo memoryAllocateInfo;\n    memoryAllocateInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;\n    memoryAllocateInfo.pNext = 0;\n    memoryAllocateInfo.allocationSize = size;\n    memoryAllocateInfo.memoryTypeIndex = memory_type_index;\n\n    VkImportMemoryHostPointerInfoEXT importMemoryHostPointerInfo;\n    importMemoryHostPointerInfo.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT;\n    importMemoryHostPointerInfo.pNext = 0;\n    importMemoryHostPointerInfo.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;\n    importMemoryHostPointerInfo.pHostPointer = host_ptr;\n    memoryAllocateInfo.pNext = &importMemoryHostPointerInfo;\n\n    VkDeviceMemory memory = 0;\n    VkResult ret = vkAllocateMemory(vkdev->vkdevice(), &memoryAllocateInfo, 0, &memory);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkAllocateMemory failed %d\", ret);\n        return 0;\n    }\n\n    return memory;\n}\n\nVkImage VkAllocator::create_image(int width, int height, int depth, VkFormat format, VkImageTiling tiling, VkImageUsageFlags usage)\n{\n    VkImageCreateInfo imageCreateInfo;\n    imageCreateInfo.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;\n    imageCreateInfo.pNext = 0;\n    
imageCreateInfo.flags = 0;\n    imageCreateInfo.imageType = VK_IMAGE_TYPE_3D;\n    imageCreateInfo.format = format;\n    imageCreateInfo.extent.width = width;\n    imageCreateInfo.extent.height = height;\n    imageCreateInfo.extent.depth = depth;\n    imageCreateInfo.mipLevels = 1;\n    imageCreateInfo.arrayLayers = 1;\n    imageCreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;\n    imageCreateInfo.tiling = tiling;\n    imageCreateInfo.usage = usage;\n    imageCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;\n    imageCreateInfo.queueFamilyIndexCount = 0;\n    imageCreateInfo.pQueueFamilyIndices = 0;\n    imageCreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;\n\n    VkImage image;\n    VkResult ret = vkCreateImage(vkdev->vkdevice(), &imageCreateInfo, 0, &image);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateImage failed %d %d %d %d %d %d %d\", ret, width, height, depth, format, tiling, usage);\n        return 0;\n    }\n\n    return image;\n}\n\nVkImageView VkAllocator::create_imageview(VkImage image, VkFormat format)\n{\n    VkImageViewCreateInfo imageViewCreateInfo;\n    imageViewCreateInfo.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;\n    imageViewCreateInfo.pNext = 0;\n    imageViewCreateInfo.flags = 0;\n    imageViewCreateInfo.image = image;\n    imageViewCreateInfo.viewType = VK_IMAGE_VIEW_TYPE_3D;\n    imageViewCreateInfo.format = format;\n    imageViewCreateInfo.components.r = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.components.g = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.components.b = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.components.a = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n    imageViewCreateInfo.subresourceRange.baseMipLevel = 0;\n    imageViewCreateInfo.subresourceRange.levelCount = 1;\n    imageViewCreateInfo.subresourceRange.baseArrayLayer = 0;\n    imageViewCreateInfo.subresourceRange.layerCount = 1;\n\n 
   VkImageView imageview;\n    VkResult ret = vkCreateImageView(vkdev->vkdevice(), &imageViewCreateInfo, 0, &imageview);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateImageView failed %d\", ret);\n        return 0;\n    }\n\n    return imageview;\n}\n\nstatic inline size_t least_common_multiple(size_t a, size_t b)\n{\n    if (a == b)\n        return a;\n\n    if (a > b)\n        return least_common_multiple(b, a);\n\n    size_t lcm = b;\n    while (lcm % a != 0)\n    {\n        lcm += b;\n    }\n\n    return lcm;\n}\n\nclass VkBlobAllocatorPrivate\n{\npublic:\n    size_t block_size;\n    size_t buffer_offset_alignment;\n    size_t bind_memory_offset_alignment;\n    std::vector<std::list<std::pair<size_t, size_t> > > buffer_budgets;\n    std::vector<VkBufferMemory*> buffer_blocks;\n    std::vector<std::list<std::pair<size_t, size_t> > > image_memory_budgets;\n    std::vector<VkDeviceMemory> image_memory_blocks;\n};\n\nVkBlobAllocator::VkBlobAllocator(const VulkanDevice* _vkdev, size_t preferred_block_size)\n    : VkAllocator(_vkdev), d(new VkBlobAllocatorPrivate)\n{\n    d->buffer_offset_alignment = vkdev->info.buffer_offset_alignment();\n    d->bind_memory_offset_alignment = vkdev->info.buffer_image_granularity();\n\n    if (vkdev->info.type() == 1)\n    {\n        // on integrated gpu, there may be device local only memory too, eg. 
AMD APU\n        // assuming larger alignment always keeps us safe :)\n\n        // least common multiple for memory_map_alignment and buffer_offset_alignment and non_coherent_atom_size\n        d->buffer_offset_alignment = least_common_multiple(d->buffer_offset_alignment, vkdev->info.memory_map_alignment());\n        d->buffer_offset_alignment = least_common_multiple(d->buffer_offset_alignment, vkdev->info.non_coherent_atom_size());\n    }\n\n    if (vkdev->info.support_VK_KHR_robustness2() || vkdev->info.support_VK_EXT_robustness2())\n    {\n        size_t robust_storage_buffer_access_size_alignment = vkdev->info.queryRobustness2Properties().robustStorageBufferAccessSizeAlignment;\n        d->buffer_offset_alignment = least_common_multiple(d->buffer_offset_alignment, robust_storage_buffer_access_size_alignment);\n    }\n\n    d->block_size = alignSize(preferred_block_size, d->buffer_offset_alignment);\n}\n\nVkBlobAllocator::~VkBlobAllocator()\n{\n    clear();\n\n    delete d;\n}\n\nVkBlobAllocator::VkBlobAllocator(const VkBlobAllocator&)\n    : VkAllocator(0), d(0)\n{\n}\n\nVkBlobAllocator& VkBlobAllocator::operator=(const VkBlobAllocator&)\n{\n    return *this;\n}\n\nvoid VkBlobAllocator::clear()\n{\n    //     NCNN_LOGE(\"VkBlobAllocator %lu\", buffer_blocks.size());\n\n    for (size_t i = 0; i < d->buffer_blocks.size(); i++)\n    {\n        VkBufferMemory* ptr = d->buffer_blocks[i];\n\n        //         std::list< std::pair<size_t, size_t> >::iterator it = buffer_budgets[i].begin();\n        //         while (it != buffer_budgets[i].end())\n        //         {\n        //             NCNN_LOGE(\"VkBlobAllocator budget %p %lu %lu\", ptr->buffer, it->first, it->second);\n        //             it++;\n        //         }\n\n        if (mappable)\n            vkUnmapMemory(vkdev->vkdevice(), ptr->memory);\n\n        vkDestroyBuffer(vkdev->vkdevice(), ptr->buffer, 0);\n        vkFreeMemory(vkdev->vkdevice(), ptr->memory, 0);\n\n        delete ptr;\n    }\n    
d->buffer_blocks.clear();\n\n    d->buffer_budgets.clear();\n\n    for (size_t i = 0; i < d->image_memory_blocks.size(); i++)\n    {\n        VkDeviceMemory memory = d->image_memory_blocks[i];\n\n        //         std::list< std::pair<size_t, size_t> >::iterator it = d->image_memory_budgets[i].begin();\n        //         while (it != d->image_memory_budgets[i].end())\n        //         {\n        //             NCNN_LOGE(\"VkBlobAllocator budget %p %lu %lu\", memory, it->first, it->second);\n        //             it++;\n        //         }\n\n        vkFreeMemory(vkdev->vkdevice(), memory, 0);\n    }\n    d->image_memory_blocks.clear();\n\n    d->image_memory_budgets.clear();\n}\n\nVkBufferMemory* VkBlobAllocator::fastMalloc(size_t size)\n{\n    size_t aligned_size = alignSize(size, d->buffer_offset_alignment);\n\n    const int buffer_block_count = d->buffer_blocks.size();\n\n    // find first spare space in buffer_blocks\n    for (int i = 0; i < buffer_block_count; i++)\n    {\n        std::list<std::pair<size_t, size_t> >::iterator it = d->buffer_budgets[i].begin();\n        while (it != d->buffer_budgets[i].end())\n        {\n            size_t budget_size = it->second;\n            if (budget_size < aligned_size)\n            {\n                it++;\n                continue;\n            }\n\n            // return sub buffer\n            VkBufferMemory* ptr = new VkBufferMemory;\n\n            ptr->buffer = d->buffer_blocks[i]->buffer;\n            ptr->offset = it->first;\n            ptr->memory = d->buffer_blocks[i]->memory;\n            ptr->capacity = aligned_size;\n            ptr->mapped_ptr = d->buffer_blocks[i]->mapped_ptr;\n            ptr->memory_type_index = d->buffer_blocks[i]->memory_type_index;\n            ptr->access_flags = 0;\n            ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n            // adjust buffer_budgets\n            if (budget_size == aligned_size)\n            {\n                
d->buffer_budgets[i].erase(it);\n            }\n            else\n            {\n                it->first += aligned_size;\n                it->second -= aligned_size;\n            }\n\n            //             NCNN_LOGE(\"VkBlobAllocator M %p +%lu %lu\", ptr->buffer, ptr->offset, ptr->capacity);\n\n            return ptr;\n        }\n    }\n\n    size_t new_block_size = std::max(d->block_size, aligned_size);\n\n    // create new block\n    VkBufferMemory* block = new VkBufferMemory;\n\n    block->buffer = create_buffer(new_block_size, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT);\n    block->offset = 0;\n\n    // TODO respect VK_KHR_dedicated_allocation ?\n\n    VkMemoryRequirements memoryRequirements;\n    vkGetBufferMemoryRequirements(vkdev->vkdevice(), block->buffer, &memoryRequirements);\n\n    // setup memory type and alignment\n    if (buffer_memory_type_index == (uint32_t)-1)\n    {\n        if (vkdev->info.type() == 1)\n        {\n            // integrated gpu, prefer unified memory\n            buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);\n\n            // on amd integrated gpu, there is a faster and larger device-only heap\n            uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n            const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physicalDeviceMemoryProperties();\n            uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;\n            uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;\n            if (device_local_heap_index < buffer_heap_index && 
memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)\n            {\n                buffer_memory_type_index = device_local_memory_type_index;\n            }\n        }\n        else\n        {\n            // discrete gpu, device local\n            buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n        }\n\n        mappable = vkdev->is_mappable(buffer_memory_type_index);\n        coherent = vkdev->is_coherent(buffer_memory_type_index);\n    }\n\n    block->memory = allocate_memory(memoryRequirements.size, buffer_memory_type_index);\n    if (!block->memory)\n    {\n        vkDestroyBuffer(vkdev->vkdevice(), block->buffer, 0);\n        delete block;\n        return 0;\n    }\n\n    // ignore memoryRequirements.alignment as we always bind at zero offset\n    vkBindBufferMemory(vkdev->vkdevice(), block->buffer, block->memory, 0);\n\n    block->mapped_ptr = 0;\n    if (mappable)\n    {\n        vkMapMemory(vkdev->vkdevice(), block->memory, 0, new_block_size, 0, &block->mapped_ptr);\n    }\n\n    block->memory_type_index = buffer_memory_type_index;\n\n    d->buffer_blocks.push_back(block);\n\n    // return sub buffer\n    VkBufferMemory* ptr = new VkBufferMemory;\n\n    ptr->buffer = block->buffer;\n    ptr->offset = 0;\n    ptr->memory = block->memory;\n    ptr->capacity = aligned_size;\n    ptr->mapped_ptr = block->mapped_ptr;\n    ptr->memory_type_index = block->memory_type_index;\n    ptr->access_flags = 0;\n    ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n    // adjust buffer_budgets\n    std::list<std::pair<size_t, size_t> > budget;\n    if (new_block_size > aligned_size)\n    {\n        budget.push_back(std::make_pair(aligned_size, new_block_size - aligned_size));\n    }\n    d->buffer_budgets.push_back(budget);\n\n    //     NCNN_LOGE(\"VkBlobAllocator M %p +%lu %lu\", 
ptr->buffer, ptr->offset, ptr->capacity);\n\n    return ptr;\n}\n\nvoid VkBlobAllocator::fastFree(VkBufferMemory* ptr)\n{\n    //     NCNN_LOGE(\"VkBlobAllocator F %p +%lu %lu\", ptr->buffer, ptr->offset, ptr->capacity);\n\n    const int buffer_block_count = d->buffer_blocks.size();\n\n    int block_index = -1;\n    for (int i = 0; i < buffer_block_count; i++)\n    {\n        if (d->buffer_blocks[i]->buffer == ptr->buffer && d->buffer_blocks[i]->memory == ptr->memory)\n        {\n            block_index = i;\n            break;\n        }\n    }\n\n    if (block_index == -1)\n    {\n        NCNN_LOGE(\"FATAL ERROR! unlocked VkBlobAllocator get wild %p\", ptr->buffer);\n\n        delete ptr;\n\n        return;\n    }\n\n    // merge\n    std::list<std::pair<size_t, size_t> >::iterator it_merge_left = d->buffer_budgets[block_index].end();\n    std::list<std::pair<size_t, size_t> >::iterator it_merge_right = d->buffer_budgets[block_index].end();\n    std::list<std::pair<size_t, size_t> >::iterator it = d->buffer_budgets[block_index].begin();\n    for (; it != d->buffer_budgets[block_index].end(); it++)\n    {\n        if (it->first + it->second == ptr->offset)\n        {\n            it_merge_left = it;\n        }\n        else if (ptr->offset + ptr->capacity == it->first)\n        {\n            it_merge_right = it;\n        }\n    }\n\n    if (it_merge_left != d->buffer_budgets[block_index].end() && it_merge_right != d->buffer_budgets[block_index].end())\n    {\n        it_merge_left->second = it_merge_right->first + it_merge_right->second - it_merge_left->first;\n        d->buffer_budgets[block_index].erase(it_merge_right);\n    }\n    else if (it_merge_left != d->buffer_budgets[block_index].end())\n    {\n        it_merge_left->second = ptr->offset + ptr->capacity - it_merge_left->first;\n    }\n    else if (it_merge_right != d->buffer_budgets[block_index].end())\n    {\n        it_merge_right->second = it_merge_right->first + it_merge_right->second - 
ptr->offset;\n        it_merge_right->first = ptr->offset;\n    }\n    else\n    {\n        if (ptr->offset == 0)\n        {\n            // chain leading block\n            d->buffer_budgets[block_index].push_front(std::make_pair(ptr->offset, ptr->capacity));\n        }\n        else\n        {\n            d->buffer_budgets[block_index].push_back(std::make_pair(ptr->offset, ptr->capacity));\n        }\n    }\n\n    delete ptr;\n}\n\nVkImageMemory* VkBlobAllocator::fastMalloc(int w, int h, int c, size_t elemsize, int elempack)\n{\n    if (elempack != 1 && elempack != 4 && elempack != 8)\n    {\n        NCNN_LOGE(\"elempack must be 1 4 8\");\n        return 0;\n    }\n\n    // resolve format\n    VkFormat format = VK_FORMAT_UNDEFINED;\n\n    if (elemsize / elempack == 4)\n    {\n        // fp32\n        if (elempack == 1) format = VK_FORMAT_R32_SFLOAT;\n        if (elempack == 4) format = VK_FORMAT_R32G32B32A32_SFLOAT;\n        if (elempack == 8) format = VK_FORMAT_R32G32B32A32_SFLOAT;\n    }\n    if (elemsize / elempack == 2)\n    {\n        // fp16\n        if (elempack == 1) format = VK_FORMAT_R16_SFLOAT;\n        if (elempack == 4) format = VK_FORMAT_R16G16B16A16_SFLOAT;\n        if (elempack == 8) format = VK_FORMAT_R16G16B16A16_SFLOAT;\n    }\n    if (elemsize / elempack == 1)\n    {\n        // int8\n        if (elempack == 1) format = VK_FORMAT_R8_SINT;\n        if (elempack == 4) format = VK_FORMAT_R8G8B8A8_SINT;\n        if (elempack == 8) format = VK_FORMAT_R8G8B8A8_SINT;\n    }\n\n    // resolve image width height depth\n    int width = w;\n    int height = h;\n    int depth = c;\n\n    // large elempack spills on image w\n    if (elempack == 8) width *= 2;\n\n    if (width > (int)vkdev->info.max_image_dimension_3d() || height > (int)vkdev->info.max_image_dimension_3d() || depth > (int)vkdev->info.max_image_dimension_3d())\n    {\n        NCNN_LOGE(\"image dimension too large %d %d %d > %d\", width, height, depth, 
(int)vkdev->info.max_image_dimension_3d());\n        return 0;\n    }\n\n    VkImageMemory* ptr = new VkImageMemory;\n\n    ptr->image = create_image(width, height, depth, format, VK_IMAGE_TILING_OPTIMAL, VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_STORAGE_BIT | VK_IMAGE_USAGE_TRANSFER_SRC_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT);\n\n    ptr->width = width;\n    ptr->height = height;\n    ptr->depth = depth;\n    ptr->format = format;\n\n    // TODO respect VK_KHR_dedicated_allocation ?\n    VkMemoryRequirements memoryRequirements;\n    vkGetImageMemoryRequirements(vkdev->vkdevice(), ptr->image, &memoryRequirements);\n\n    const size_t size = memoryRequirements.size;\n    const size_t alignment = std::max((size_t)memoryRequirements.alignment, d->bind_memory_offset_alignment);\n\n    size_t aligned_size = alignSize(size, alignment);\n\n    const int image_memory_block_count = d->image_memory_blocks.size();\n\n    // find first spare space in image_memory_blocks\n    for (int i = 0; i < image_memory_block_count; i++)\n    {\n#if __APPLE__\n        // HACK moltenvk v1.2.3 is unhappy for image binding with offset  :(\n        break;\n#endif\n\n        std::list<std::pair<size_t, size_t> >::iterator it = d->image_memory_budgets[i].begin();\n        while (it != d->image_memory_budgets[i].end())\n        {\n            // we cannot use it->first directly for base offset alignment\n            size_t bind_base_offset = it->first;\n            size_t bind_offset = alignSize(bind_base_offset, alignment);\n            size_t budget_size = it->second;\n            if (budget_size < aligned_size + (bind_offset - bind_base_offset))\n            {\n                it++;\n                continue;\n            }\n\n            // bind at memory offset\n            ptr->memory = d->image_memory_blocks[i];\n            ptr->bind_offset = bind_offset;\n            ptr->bind_capacity = aligned_size;\n\n            vkBindImageMemory(vkdev->vkdevice(), ptr->image, ptr->memory, 
ptr->bind_offset);\n\n            // do not allow host access to optimal tiling image\n            ptr->mapped_ptr = 0;\n            ptr->memory_type_index = image_memory_type_index;\n\n            ptr->imageview = create_imageview(ptr->image, format);\n\n            ptr->access_flags = 0;\n            ptr->image_layout = VK_IMAGE_LAYOUT_UNDEFINED;\n            ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n            ptr->command_refcount = 0;\n\n            if (bind_base_offset != bind_offset)\n            {\n                // NOTE there is small offset inside bind_base_offset and bind_offset\n                // adjust ptr->bind_offset and ptr->bind_capacity after vkBindImageMemory\n                // so that memory management could be easier\n                aligned_size += (bind_offset - bind_base_offset);\n\n                ptr->bind_offset = bind_base_offset;\n                ptr->bind_capacity = aligned_size;\n            }\n\n            // adjust image_memory_budgets\n            if (budget_size == aligned_size)\n            {\n                d->image_memory_budgets[i].erase(it);\n            }\n            else\n            {\n                it->first += aligned_size;\n                it->second -= aligned_size;\n            }\n\n            //             NCNN_LOGE(\"VkBlobAllocator M %p +%lu %lu\", ptr->memory, ptr->bind_offset, ptr->bind_capacity);\n\n            return ptr;\n        }\n    }\n\n    // setup memory type and alignment\n    if (image_memory_type_index == (uint32_t)-1)\n    {\n        if (vkdev->info.type() == 1)\n        {\n            // integrated gpu, prefer unified memory\n            image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);\n\n            // on amd integrated gpu, there is a faster and larger device-only heap\n            uint32_t device_local_memory_type_index = 
vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n            const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physicalDeviceMemoryProperties();\n            uint32_t buffer_heap_index = memory_properties.memoryTypes[image_memory_type_index].heapIndex;\n            uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;\n            if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)\n            {\n                image_memory_type_index = device_local_memory_type_index;\n            }\n        }\n        else\n        {\n            // discrete gpu, device local\n            image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n        }\n\n        mappable = vkdev->is_mappable(image_memory_type_index);\n        coherent = vkdev->is_coherent(image_memory_type_index);\n    }\n\n    // create new block\n    size_t new_block_size = std::max(d->block_size, aligned_size);\n\n#if __APPLE__\n    // HACK moltenvk v1.2.3 is unhappy for image binding with offset\n    // always ignore block size for smaller memory footprint :(\n    new_block_size = aligned_size;\n#endif\n\n    // bind at memory offset\n    ptr->memory = allocate_memory(new_block_size, image_memory_type_index);\n    if (!ptr->memory)\n    {\n        vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n        delete ptr;\n        return 0;\n    }\n    ptr->bind_offset = 0;\n    ptr->bind_capacity = aligned_size;\n\n    // ignore memoryRequirements.alignment as we always bind at zero offset\n    vkBindImageMemory(vkdev->vkdevice(), ptr->image, ptr->memory, ptr->bind_offset);\n\n    // do not allow host 
access to optimal tiling image\n    ptr->mapped_ptr = 0;\n    ptr->memory_type_index = image_memory_type_index;\n\n    ptr->imageview = create_imageview(ptr->image, format);\n\n    ptr->access_flags = 0;\n    ptr->image_layout = VK_IMAGE_LAYOUT_UNDEFINED;\n    ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n    ptr->command_refcount = 0;\n\n    // adjust image_memory_budgets\n    d->image_memory_blocks.push_back(ptr->memory);\n\n    std::list<std::pair<size_t, size_t> > budget;\n    if (new_block_size > aligned_size)\n    {\n        budget.push_back(std::make_pair(aligned_size, new_block_size - aligned_size));\n    }\n    d->image_memory_budgets.push_back(budget);\n\n    //     NCNN_LOGE(\"VkBlobAllocator M %p +%lu %lu\", ptr->memory, ptr->bind_offset, ptr->bind_capacity);\n\n    return ptr;\n}\n\nvoid VkBlobAllocator::fastFree(VkImageMemory* ptr)\n{\n    //     NCNN_LOGE(\"VkBlobAllocator F %p +%lu %lu\", ptr->memory, ptr->bind_offset, ptr->bind_capacity);\n\n    const int image_memory_block_count = d->image_memory_blocks.size();\n\n    int block_index = -1;\n    for (int i = 0; i < image_memory_block_count; i++)\n    {\n        if (d->image_memory_blocks[i] == ptr->memory)\n        {\n            block_index = i;\n            break;\n        }\n    }\n\n    if (block_index == -1)\n    {\n        NCNN_LOGE(\"FATAL ERROR! 
unlocked VkBlobAllocator get wild %p\", ptr->memory);\n\n        if (!ptr->command_refcount)\n        {\n            vkDestroyImageView(vkdev->vkdevice(), ptr->imageview, 0);\n            vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n\n            delete ptr;\n        }\n\n        return;\n    }\n\n    // merge\n    std::list<std::pair<size_t, size_t> >::iterator it_merge_left = d->image_memory_budgets[block_index].end();\n    std::list<std::pair<size_t, size_t> >::iterator it_merge_right = d->image_memory_budgets[block_index].end();\n    std::list<std::pair<size_t, size_t> >::iterator it = d->image_memory_budgets[block_index].begin();\n    for (; it != d->image_memory_budgets[block_index].end(); it++)\n    {\n        if (it->first + it->second == ptr->bind_offset)\n        {\n            it_merge_left = it;\n        }\n        else if (ptr->bind_offset + ptr->bind_capacity == it->first)\n        {\n            it_merge_right = it;\n        }\n    }\n\n    if (it_merge_left != d->image_memory_budgets[block_index].end() && it_merge_right != d->image_memory_budgets[block_index].end())\n    {\n        it_merge_left->second = it_merge_right->first + it_merge_right->second - it_merge_left->first;\n        d->image_memory_budgets[block_index].erase(it_merge_right);\n    }\n    else if (it_merge_left != d->image_memory_budgets[block_index].end())\n    {\n        it_merge_left->second = ptr->bind_offset + ptr->bind_capacity - it_merge_left->first;\n    }\n    else if (it_merge_right != d->image_memory_budgets[block_index].end())\n    {\n        it_merge_right->second = it_merge_right->first + it_merge_right->second - ptr->bind_offset;\n        it_merge_right->first = ptr->bind_offset;\n    }\n    else\n    {\n        if (ptr->bind_offset == 0)\n        {\n            // chain leading block\n            d->image_memory_budgets[block_index].push_front(std::make_pair(ptr->bind_offset, ptr->bind_capacity));\n        }\n        else\n        {\n            
d->image_memory_budgets[block_index].push_back(std::make_pair(ptr->bind_offset, ptr->bind_capacity));\n        }\n    }\n\n    if (!ptr->command_refcount)\n    {\n        vkDestroyImageView(vkdev->vkdevice(), ptr->imageview, 0);\n        vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n\n        delete ptr;\n    }\n}\n\nclass VkWeightAllocatorPrivate\n{\npublic:\n    size_t block_size;\n    size_t buffer_offset_alignment;\n    size_t bind_memory_offset_alignment;\n    std::vector<size_t> buffer_block_free_spaces;\n    std::vector<VkBufferMemory*> buffer_blocks;\n    std::vector<VkBufferMemory*> dedicated_buffer_blocks;\n    std::vector<size_t> image_memory_block_free_spaces;\n    std::vector<VkDeviceMemory> image_memory_blocks;\n    std::vector<VkDeviceMemory> dedicated_image_memory_blocks;\n\n    bool prefer_host_memory;\n#if !defined(_WIN32)\n    std::vector<void*> host_ptrs;\n#endif\n};\n\nVkWeightAllocator::VkWeightAllocator(const VulkanDevice* _vkdev, bool _prefer_host_memory, size_t preferred_block_size)\n    : VkAllocator(_vkdev), d(new VkWeightAllocatorPrivate)\n{\n    d->buffer_offset_alignment = vkdev->info.buffer_offset_alignment();\n    d->bind_memory_offset_alignment = vkdev->info.buffer_image_granularity();\n\n    if (vkdev->info.type() == 1)\n    {\n        // on integrated gpu, there may be device local only memory too, eg. 
AMD APU\n        // assuming larger alignment always keeps us safe :)\n\n        // least common multiple for memory_map_alignment and buffer_offset_alignment and non_coherent_atom_size\n        d->buffer_offset_alignment = least_common_multiple(d->buffer_offset_alignment, vkdev->info.memory_map_alignment());\n        d->buffer_offset_alignment = least_common_multiple(d->buffer_offset_alignment, vkdev->info.non_coherent_atom_size());\n    }\n\n    if (vkdev->info.support_VK_KHR_robustness2() || vkdev->info.support_VK_EXT_robustness2())\n    {\n        size_t robust_storage_buffer_access_size_alignment = vkdev->info.queryRobustness2Properties().robustStorageBufferAccessSizeAlignment;\n        d->buffer_offset_alignment = least_common_multiple(d->buffer_offset_alignment, robust_storage_buffer_access_size_alignment);\n    }\n\n    if (_prefer_host_memory && vkdev->info.support_VK_EXT_external_memory_host())\n    {\n        size_t min_imported_host_pointer_alignment = vkdev->info.queryExternalMemoryHostProperties().minImportedHostPointerAlignment;\n        d->buffer_offset_alignment = least_common_multiple(d->buffer_offset_alignment, min_imported_host_pointer_alignment);\n    }\n\n    d->block_size = alignSize(preferred_block_size, d->buffer_offset_alignment);\n\n    d->prefer_host_memory = _prefer_host_memory;\n}\n\nVkWeightAllocator::~VkWeightAllocator()\n{\n    clear();\n\n    delete d;\n}\n\nVkWeightAllocator::VkWeightAllocator(const VkWeightAllocator&)\n    : VkAllocator(0), d(0)\n{\n}\n\nVkWeightAllocator& VkWeightAllocator::operator=(const VkWeightAllocator&)\n{\n    return *this;\n}\n\nvoid VkWeightAllocator::clear()\n{\n    //     NCNN_LOGE(\"VkWeightAllocator %lu %lu\", d->buffer_blocks.size(), d->dedicated_buffer_blocks.size());\n\n    d->buffer_block_free_spaces.clear();\n\n    for (size_t i = 0; i < d->buffer_blocks.size(); i++)\n    {\n        VkBufferMemory* ptr = d->buffer_blocks[i];\n\n        if (mappable)\n            vkUnmapMemory(vkdev->vkdevice(), 
ptr->memory);\n\n        vkDestroyBuffer(vkdev->vkdevice(), ptr->buffer, 0);\n        vkFreeMemory(vkdev->vkdevice(), ptr->memory, 0);\n\n        delete ptr;\n    }\n    d->buffer_blocks.clear();\n\n    for (size_t i = 0; i < d->dedicated_buffer_blocks.size(); i++)\n    {\n        VkBufferMemory* ptr = d->dedicated_buffer_blocks[i];\n\n        if (mappable)\n            vkUnmapMemory(vkdev->vkdevice(), ptr->memory);\n\n        vkDestroyBuffer(vkdev->vkdevice(), ptr->buffer, 0);\n        vkFreeMemory(vkdev->vkdevice(), ptr->memory, 0);\n\n        delete ptr;\n    }\n    d->dedicated_buffer_blocks.clear();\n\n    d->image_memory_block_free_spaces.clear();\n\n    for (size_t i = 0; i < d->image_memory_blocks.size(); i++)\n    {\n        VkDeviceMemory memory = d->image_memory_blocks[i];\n\n        vkFreeMemory(vkdev->vkdevice(), memory, 0);\n    }\n    d->image_memory_blocks.clear();\n\n    for (size_t i = 0; i < d->dedicated_image_memory_blocks.size(); i++)\n    {\n        VkDeviceMemory memory = d->dedicated_image_memory_blocks[i];\n\n        vkFreeMemory(vkdev->vkdevice(), memory, 0);\n    }\n    d->dedicated_image_memory_blocks.clear();\n\n#if !defined(_WIN32)\n    for (size_t i = 0; i < d->host_ptrs.size(); i++)\n    {\n        void* host_ptr = d->host_ptrs[i];\n\n        ncnn::fastFree(host_ptr);\n    }\n    d->host_ptrs.clear();\n#endif\n}\n\n// fastMalloc() with alignment parameter and no malloc overread\nstatic void* fastMalloc_with_alignment(size_t size, size_t alignment)\n{\n#if _MSC_VER\n    return _aligned_malloc(size, alignment);\n#elif (defined(__unix__) || defined(__APPLE__)) && _POSIX_C_SOURCE >= 200112L || (__ANDROID__ && __ANDROID_API__ >= 17)\n    void* ptr = 0;\n    if (posix_memalign(&ptr, alignment, size))\n        ptr = 0;\n    return ptr;\n#elif __ANDROID__ && __ANDROID_API__ < 17\n    return memalign(alignment, size);\n#else\n    unsigned char* udata = (unsigned char*)malloc(size + sizeof(void*) + alignment);\n    if (!udata)\n        return 
0;\n    unsigned char** adata = alignPtr((unsigned char**)udata + 1, alignment);\n    adata[-1] = udata;\n    return adata;\n#endif\n}\n\nVkBufferMemory* VkWeightAllocator::fastMalloc(size_t size)\n{\n    //     NCNN_LOGE(\"VkWeightAllocator fastMalloc %lu\", size);\n\n    size_t aligned_size = alignSize(size, d->buffer_offset_alignment);\n\n    const int buffer_block_count = d->buffer_blocks.size();\n\n    // find first spare space in buffer_blocks\n    for (int i = 0; i < buffer_block_count; i++)\n    {\n        size_t free_size = d->buffer_block_free_spaces[i];\n        if (free_size >= aligned_size)\n        {\n            size_t block_offset = d->block_size - free_size;\n\n            // return sub buffer\n            VkBufferMemory* ptr = new VkBufferMemory;\n\n            ptr->buffer = d->buffer_blocks[i]->buffer;\n            ptr->offset = block_offset;\n            ptr->memory = d->buffer_blocks[i]->memory;\n            ptr->capacity = aligned_size;\n            ptr->mapped_ptr = d->buffer_blocks[i]->mapped_ptr;\n            ptr->memory_type_index = d->buffer_blocks[i]->memory_type_index;\n            ptr->access_flags = 0;\n            ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n            d->buffer_block_free_spaces[i] -= aligned_size;\n\n            return ptr;\n        }\n    }\n\n    size_t new_block_size = std::max(d->block_size, aligned_size);\n\n    // create new block\n    VkBufferMemory* block = new VkBufferMemory;\n\n    block->buffer = create_buffer(new_block_size, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT);\n    block->offset = 0;\n\n    if (vkdev->info.support_VK_KHR_get_memory_requirements2() && vkdev->info.support_VK_KHR_dedicated_allocation() && !d->prefer_host_memory)\n    {\n        VkBufferMemoryRequirementsInfo2KHR bufferMemoryRequirementsInfo2;\n        bufferMemoryRequirementsInfo2.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_REQUIREMENTS_INFO_2_KHR;\n        bufferMemoryRequirementsInfo2.pNext = 
0;\n        bufferMemoryRequirementsInfo2.buffer = block->buffer;\n\n        VkMemoryRequirements2KHR memoryRequirements2;\n        memoryRequirements2.sType = VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2_KHR;\n        memoryRequirements2.pNext = 0;\n\n        VkMemoryDedicatedRequirementsKHR memoryDedicatedRequirements;\n        memoryDedicatedRequirements.sType = VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS_KHR;\n        memoryDedicatedRequirements.pNext = 0;\n        memoryRequirements2.pNext = &memoryDedicatedRequirements;\n\n        vkdev->vkGetBufferMemoryRequirements2KHR(vkdev->vkdevice(), &bufferMemoryRequirementsInfo2, &memoryRequirements2);\n\n        bool dedicatedAllocation = memoryDedicatedRequirements.requiresDedicatedAllocation || memoryDedicatedRequirements.prefersDedicatedAllocation;\n\n        if (dedicatedAllocation)\n        {\n            // setup memory type and alignment\n            if (buffer_memory_type_index == (uint32_t)-1)\n            {\n                if (vkdev->info.type() == 1)\n                {\n                    // integrated gpu, prefer unified memory\n                    buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);\n\n                    // on amd integrated gpu, there is a faster and larger device-only heap\n                    uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                    const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physicalDeviceMemoryProperties();\n                    uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;\n                    uint32_t device_local_heap_index = 
memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;\n                    if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)\n                    {\n                        buffer_memory_type_index = device_local_memory_type_index;\n                    }\n                }\n                else\n                {\n                    // discrete gpu, device local\n                    if (vkdev->info.resizable_bar_enabled())\n                    {\n                        buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_HOST_CACHED_BIT);\n                    }\n                    else\n                    {\n                        buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                    }\n                }\n\n                mappable = vkdev->is_mappable(buffer_memory_type_index);\n                coherent = vkdev->is_coherent(buffer_memory_type_index);\n            }\n\n            block->memory = allocate_dedicated_memory(memoryRequirements2.memoryRequirements.size, buffer_memory_type_index, 0, block->buffer);\n            if (!block->memory)\n            {\n                vkDestroyBuffer(vkdev->vkdevice(), block->buffer, 0);\n                delete block;\n                return 0;\n            }\n\n            // ignore memoryRequirements2.memoryRequirements.alignment as we always bind at zero offset\n            vkBindBufferMemory(vkdev->vkdevice(), block->buffer, block->memory, 0);\n\n            block->mapped_ptr = 0;\n            if (mappable)\n            {\n                vkMapMemory(vkdev->vkdevice(), block->memory, 0, 
new_block_size, 0, &block->mapped_ptr);\n            }\n\n            block->memory_type_index = buffer_memory_type_index;\n\n            d->dedicated_buffer_blocks.push_back(block);\n\n            // return sub buffer\n            VkBufferMemory* ptr = new VkBufferMemory;\n\n            ptr->buffer = block->buffer;\n            ptr->offset = 0;\n            ptr->memory = block->memory;\n            ptr->capacity = new_block_size;\n            ptr->mapped_ptr = block->mapped_ptr;\n            ptr->memory_type_index = block->memory_type_index;\n            ptr->access_flags = 0;\n            ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n            return ptr;\n        }\n    }\n\n    VkMemoryRequirements memoryRequirements;\n    vkGetBufferMemoryRequirements(vkdev->vkdevice(), block->buffer, &memoryRequirements);\n\n    if (d->prefer_host_memory)\n    {\n#if !defined(_WIN32)\n        if (vkdev->info.support_VK_EXT_external_memory_host())\n        {\n            void* host_ptr = fastMalloc_with_alignment(memoryRequirements.size, d->buffer_offset_alignment);\n\n            if (host_ptr)\n            {\n                VkMemoryHostPointerPropertiesEXT pointerProperties;\n                pointerProperties.sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT;\n                pointerProperties.pNext = 0;\n                VkResult ret = vkdev->vkGetMemoryHostPointerPropertiesEXT(vkdev->vkdevice(), VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT, host_ptr, &pointerProperties);\n                if (ret != VK_SUCCESS)\n                {\n                    NCNN_LOGE(\"vkGetMemoryHostPointerPropertiesEXT failed %d\", ret);\n                    ncnn::fastFree(host_ptr);\n                    vkDestroyBuffer(vkdev->vkdevice(), block->buffer, 0);\n                    delete block;\n                    return 0;\n                }\n\n                // setup memory type and alignment\n                if (buffer_memory_type_index == (uint32_t)-1)\n       
         {\n                    buffer_memory_type_index = vkdev->find_memory_index(pointerProperties.memoryTypeBits, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);\n\n                    mappable = vkdev->is_mappable(buffer_memory_type_index);\n                    coherent = vkdev->is_coherent(buffer_memory_type_index);\n                }\n\n                block->memory = allocate_import_host_memory(memoryRequirements.size, buffer_memory_type_index, host_ptr);\n                if (!block->memory)\n                {\n                    // oom\n                    ncnn::fastFree(host_ptr);\n                    d->prefer_host_memory = false;\n                }\n                else\n                {\n                    d->host_ptrs.push_back(host_ptr);\n                }\n            }\n            else\n            {\n                // oom\n                d->prefer_host_memory = false;\n            }\n        }\n        else\n#endif // !defined(_WIN32)\n        {\n            // setup memory type and alignment\n            if (buffer_memory_type_index == (uint32_t)-1)\n            {\n                buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);\n\n                mappable = vkdev->is_mappable(buffer_memory_type_index);\n                coherent = vkdev->is_coherent(buffer_memory_type_index);\n            }\n\n            block->memory = allocate_memory(memoryRequirements.size, buffer_memory_type_index);\n            if (!block->memory)\n            {\n                // oom\n                d->prefer_host_memory = false;\n            }\n        }\n\n        if (!d->prefer_host_memory)\n        {\n            NCNN_LOGE(\"weight allocator fallback to device memory\");\n            buffer_memory_type_index = (uint32_t)-1;\n            image_memory_type_index = (uint32_t)-1;\n        }\n    }\n    if 
(!d->prefer_host_memory)\n    {\n        // setup memory type and alignment\n        if (buffer_memory_type_index == (uint32_t)-1)\n        {\n            if (vkdev->info.type() == 1)\n            {\n                // integrated gpu, prefer unified memory\n                buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);\n\n                // on amd integrated gpu, there is a faster and larger device-only heap\n                uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physicalDeviceMemoryProperties();\n                uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;\n                uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;\n                if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)\n                {\n                    buffer_memory_type_index = device_local_memory_type_index;\n                }\n            }\n            else\n            {\n                // discrete gpu, device local\n                if (vkdev->info.resizable_bar_enabled())\n                {\n                    buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_HOST_CACHED_BIT);\n                }\n                else\n                {\n                    buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, 
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                }\n            }\n\n            mappable = vkdev->is_mappable(buffer_memory_type_index);\n            coherent = vkdev->is_coherent(buffer_memory_type_index);\n        }\n\n        block->memory = allocate_memory(memoryRequirements.size, buffer_memory_type_index);\n    }\n    if (!block->memory)\n    {\n        vkDestroyBuffer(vkdev->vkdevice(), block->buffer, 0);\n        delete block;\n        return 0;\n    }\n\n    // ignore memoryRequirements.alignment as we always bind at zero offset\n    vkBindBufferMemory(vkdev->vkdevice(), block->buffer, block->memory, 0);\n\n    //     NCNN_LOGE(\"VkWeightAllocator M %p\", block->buffer);\n\n    block->mapped_ptr = 0;\n    if (mappable)\n    {\n        vkMapMemory(vkdev->vkdevice(), block->memory, 0, new_block_size, 0, &block->mapped_ptr);\n    }\n\n    block->memory_type_index = buffer_memory_type_index;\n\n    d->buffer_blocks.push_back(block);\n\n    d->buffer_block_free_spaces.push_back(new_block_size - aligned_size);\n\n    // return sub buffer\n    VkBufferMemory* ptr = new VkBufferMemory;\n\n    ptr->buffer = block->buffer;\n    ptr->offset = 0;\n    ptr->memory = block->memory;\n    ptr->capacity = aligned_size;\n    ptr->mapped_ptr = block->mapped_ptr;\n    ptr->memory_type_index = block->memory_type_index;\n    ptr->access_flags = 0;\n    ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n    return ptr;\n}\n\nvoid VkWeightAllocator::fastFree(VkBufferMemory* ptr)\n{\n    //     NCNN_LOGE(\"VkWeightAllocator F %p\", ptr->buffer);\n\n    delete ptr;\n}\n\nVkImageMemory* VkWeightAllocator::fastMalloc(int w, int h, int c, size_t elemsize, int elempack)\n{\n    if (elempack != 1 && elempack != 4 && elempack != 8 && elempack != 16 && elempack != 32 && elempack != 64)\n    {\n        NCNN_LOGE(\"elempack must be 1 4 8 16 32 64\");\n        return 0;\n    }\n\n    // resolve format\n    VkFormat format = VK_FORMAT_UNDEFINED;\n\n    if (elemsize / elempack == 
4)\n    {\n        // fp32\n        if (elempack == 1) format = VK_FORMAT_R32_SFLOAT;\n        if (elempack == 4) format = VK_FORMAT_R32G32B32A32_SFLOAT;\n        if (elempack == 8) format = VK_FORMAT_R32G32B32A32_SFLOAT;\n        if (elempack == 16) format = VK_FORMAT_R32G32B32A32_SFLOAT;\n        if (elempack == 32) format = VK_FORMAT_R32G32B32A32_SFLOAT;\n        if (elempack == 64) format = VK_FORMAT_R32G32B32A32_SFLOAT;\n    }\n    if (elemsize / elempack == 2)\n    {\n        // fp16\n        if (elempack == 1) format = VK_FORMAT_R16_SFLOAT;\n        if (elempack == 4) format = VK_FORMAT_R16G16B16A16_SFLOAT;\n        if (elempack == 8) format = VK_FORMAT_R16G16B16A16_SFLOAT;\n        if (elempack == 16) format = VK_FORMAT_R16G16B16A16_SFLOAT;\n        if (elempack == 32) format = VK_FORMAT_R16G16B16A16_SFLOAT;\n        if (elempack == 64) format = VK_FORMAT_R16G16B16A16_SFLOAT;\n    }\n    if (elemsize / elempack == 1)\n    {\n        // int8\n        if (elempack == 1) format = VK_FORMAT_R8_SINT;\n        if (elempack == 4) format = VK_FORMAT_R8G8B8A8_SINT;\n        if (elempack == 8) format = VK_FORMAT_R8G8B8A8_SINT;\n        if (elempack == 16) format = VK_FORMAT_R8G8B8A8_SINT;\n        if (elempack == 32) format = VK_FORMAT_R8G8B8A8_SINT;\n        if (elempack == 64) format = VK_FORMAT_R8G8B8A8_SINT;\n    }\n\n    // resolve image width height depth\n    int width = w;\n    int height = h;\n    int depth = c;\n\n    // large elempack spills on image w\n    if (elempack == 8) width *= 2;\n    if (elempack == 16) width *= 4;\n    if (elempack == 32) width *= 8;\n    if (elempack == 64) width *= 16;\n\n    if (width > (int)vkdev->info.max_image_dimension_3d() || height > (int)vkdev->info.max_image_dimension_3d() || depth > (int)vkdev->info.max_image_dimension_3d())\n    {\n        NCNN_LOGE(\"image dimension too large %d %d %d > %d\", width, height, depth, (int)vkdev->info.max_image_dimension_3d());\n        return 0;\n    }\n\n    VkImageMemory* ptr = new 
VkImageMemory;\n\n    ptr->image = create_image(width, height, depth, format, VK_IMAGE_TILING_OPTIMAL, VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT);\n\n    ptr->width = width;\n    ptr->height = height;\n    ptr->depth = depth;\n    ptr->format = format;\n\n    if (vkdev->info.support_VK_KHR_get_memory_requirements2() && vkdev->info.support_VK_KHR_dedicated_allocation() && !d->prefer_host_memory)\n    {\n        VkImageMemoryRequirementsInfo2KHR imageMemoryRequirementsInfo2;\n        imageMemoryRequirementsInfo2.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_REQUIREMENTS_INFO_2_KHR;\n        imageMemoryRequirementsInfo2.pNext = 0;\n        imageMemoryRequirementsInfo2.image = ptr->image;\n\n        VkMemoryRequirements2KHR memoryRequirements2;\n        memoryRequirements2.sType = VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2_KHR;\n        memoryRequirements2.pNext = 0;\n\n        VkMemoryDedicatedRequirementsKHR memoryDedicatedRequirements;\n        memoryDedicatedRequirements.sType = VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS_KHR;\n        memoryDedicatedRequirements.pNext = 0;\n        memoryRequirements2.pNext = &memoryDedicatedRequirements;\n\n        vkdev->vkGetImageMemoryRequirements2KHR(vkdev->vkdevice(), &imageMemoryRequirementsInfo2, &memoryRequirements2);\n\n        bool dedicatedAllocation = memoryDedicatedRequirements.requiresDedicatedAllocation || memoryDedicatedRequirements.prefersDedicatedAllocation;\n\n        if (dedicatedAllocation)\n        {\n            // setup memory type and alignment\n            if (image_memory_type_index == (uint32_t)-1)\n            {\n                if (vkdev->info.type() == 1)\n                {\n                    // integrated gpu, prefer unified memory\n                    image_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);\n\n                    // on amd integrated gpu, 
there is a faster and larger device-only heap\n                    uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                    const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physicalDeviceMemoryProperties();\n                    uint32_t buffer_heap_index = memory_properties.memoryTypes[image_memory_type_index].heapIndex;\n                    uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;\n                    if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)\n                    {\n                        image_memory_type_index = device_local_memory_type_index;\n                    }\n                }\n                else\n                {\n                    // discrete gpu, device local\n                    if (vkdev->info.resizable_bar_enabled())\n                    {\n                        image_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_HOST_CACHED_BIT);\n                    }\n                    else\n                    {\n                        image_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                    }\n                }\n\n                mappable = vkdev->is_mappable(image_memory_type_index);\n                coherent = vkdev->is_coherent(image_memory_type_index);\n            }\n\n            // bind memory\n            ptr->memory = 
allocate_dedicated_memory(memoryRequirements2.memoryRequirements.size, image_memory_type_index, ptr->image, 0);\n            if (!ptr->memory)\n            {\n                vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n                delete ptr;\n                return 0;\n            }\n            ptr->bind_offset = 0;\n            ptr->bind_capacity = memoryRequirements2.memoryRequirements.size;\n\n            // ignore memoryRequirements2.memoryRequirements.alignment as we always bind at zero offset\n            vkBindImageMemory(vkdev->vkdevice(), ptr->image, ptr->memory, ptr->bind_offset);\n\n            // do not allow host access to optimal tiling image\n            ptr->mapped_ptr = 0;\n            ptr->memory_type_index = image_memory_type_index;\n\n            ptr->imageview = create_imageview(ptr->image, format);\n\n            ptr->access_flags = 0;\n            ptr->image_layout = VK_IMAGE_LAYOUT_UNDEFINED;\n            ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n            ptr->command_refcount = 0;\n\n            d->dedicated_image_memory_blocks.push_back(ptr->memory);\n\n            return ptr;\n        }\n    }\n\n    VkMemoryRequirements memoryRequirements;\n    vkGetImageMemoryRequirements(vkdev->vkdevice(), ptr->image, &memoryRequirements);\n\n    const size_t size = memoryRequirements.size;\n    const size_t alignment = std::max((size_t)memoryRequirements.alignment, d->bind_memory_offset_alignment);\n\n    size_t aligned_size = alignSize(size, alignment);\n\n    const int image_memory_block_count = d->image_memory_blocks.size();\n\n    // find first spare space in image_memory_blocks\n    for (int i = 0; i < image_memory_block_count; i++)\n    {\n        // we cannot use image_memory_block_free_spaces[i] directly for base offset alignment\n        size_t bind_base_offset = d->block_size - d->image_memory_block_free_spaces[i];\n        size_t bind_offset = alignSize(bind_base_offset, alignment);\n        if 
(d->image_memory_block_free_spaces[i] >= aligned_size + (bind_offset - bind_base_offset))\n        {\n            // bind at memory offset\n            ptr->memory = d->image_memory_blocks[i];\n            ptr->bind_offset = bind_offset;\n            ptr->bind_capacity = aligned_size;\n\n            vkBindImageMemory(vkdev->vkdevice(), ptr->image, ptr->memory, ptr->bind_offset);\n\n            // do not allow host access to optimal tiling image\n            ptr->mapped_ptr = 0;\n            ptr->memory_type_index = image_memory_type_index;\n\n            ptr->imageview = create_imageview(ptr->image, format);\n\n            ptr->access_flags = 0;\n            ptr->image_layout = VK_IMAGE_LAYOUT_UNDEFINED;\n            ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n            ptr->command_refcount = 0;\n\n            if (bind_base_offset != bind_offset)\n            {\n                // NOTE there is small offset inside bind_base_offset and bind_offset\n                // adjust ptr->bind_offset and ptr->bind_capacity after vkBindImageMemory\n                // so that memory management could be easier\n                aligned_size += (bind_offset - bind_base_offset);\n\n                ptr->bind_offset = bind_base_offset;\n                ptr->bind_capacity = aligned_size;\n            }\n\n            d->image_memory_block_free_spaces[i] -= aligned_size;\n\n            return ptr;\n        }\n    }\n\n    // create new block\n    size_t new_block_size = std::max(d->block_size, aligned_size);\n\n    if (d->prefer_host_memory)\n    {\n#if !defined(_WIN32)\n        if (vkdev->info.support_VK_EXT_external_memory_host())\n        {\n            void* host_ptr = fastMalloc_with_alignment(new_block_size, d->buffer_offset_alignment);\n\n            if (host_ptr)\n            {\n                VkMemoryHostPointerPropertiesEXT pointerProperties;\n                pointerProperties.sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT;\n                
pointerProperties.pNext = 0;\n                VkResult ret = vkdev->vkGetMemoryHostPointerPropertiesEXT(vkdev->vkdevice(), VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT, host_ptr, &pointerProperties);\n                if (ret != VK_SUCCESS)\n                {\n                    NCNN_LOGE(\"vkGetMemoryHostPointerPropertiesEXT failed %d\", ret);\n                    ncnn::fastFree(host_ptr);\n                    vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n                    delete ptr;\n                    return 0;\n                }\n\n                // setup memory type and alignment\n                if (image_memory_type_index == (uint32_t)-1)\n                {\n                    image_memory_type_index = vkdev->find_memory_index(pointerProperties.memoryTypeBits, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);\n\n                    mappable = vkdev->is_mappable(image_memory_type_index);\n                    coherent = vkdev->is_coherent(image_memory_type_index);\n                }\n\n                ptr->memory = allocate_import_host_memory(new_block_size, image_memory_type_index, host_ptr);\n                if (!ptr->memory)\n                {\n                    // oom\n                    ncnn::fastFree(host_ptr);\n                    d->prefer_host_memory = false;\n                }\n                else\n                {\n                    d->host_ptrs.push_back(host_ptr);\n                }\n            }\n            else\n            {\n                // oom\n                d->prefer_host_memory = false;\n            }\n        }\n        else\n#endif // !defined(_WIN32)\n        {\n            // setup memory type and alignment\n            if (image_memory_type_index == (uint32_t)-1)\n            {\n                image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);\n\n                
mappable = vkdev->is_mappable(image_memory_type_index);\n                coherent = vkdev->is_coherent(image_memory_type_index);\n            }\n\n            // bind at memory offset\n            ptr->memory = allocate_memory(new_block_size, image_memory_type_index);\n            if (!ptr->memory)\n            {\n                // oom\n                d->prefer_host_memory = false;\n            }\n        }\n\n        if (!d->prefer_host_memory)\n        {\n            NCNN_LOGE(\"weight allocator fallback to device memory\");\n            buffer_memory_type_index = (uint32_t)-1;\n            image_memory_type_index = (uint32_t)-1;\n        }\n    }\n    if (!d->prefer_host_memory)\n    {\n        // setup memory type and alignment\n        if (image_memory_type_index == (uint32_t)-1)\n        {\n            if (vkdev->info.type() == 1)\n            {\n                // integrated gpu, prefer unified memory\n                image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);\n\n                // on amd integrated gpu, there is a faster and larger device-only heap\n                uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physicalDeviceMemoryProperties();\n                uint32_t buffer_heap_index = memory_properties.memoryTypes[image_memory_type_index].heapIndex;\n                uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;\n                if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)\n                {\n                    
image_memory_type_index = device_local_memory_type_index;\n                }\n            }\n            else\n            {\n                // discrete gpu, device local\n                if (vkdev->info.resizable_bar_enabled())\n                {\n                    image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, VK_MEMORY_PROPERTY_HOST_CACHED_BIT);\n                }\n                else\n                {\n                    image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n                }\n            }\n\n            mappable = vkdev->is_mappable(image_memory_type_index);\n            coherent = vkdev->is_coherent(image_memory_type_index);\n        }\n\n        // bind at memory offset\n        ptr->memory = allocate_memory(new_block_size, image_memory_type_index);\n    }\n    if (!ptr->memory)\n    {\n        vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n        delete ptr;\n        return 0;\n    }\n    ptr->bind_offset = 0;\n    ptr->bind_capacity = aligned_size;\n\n    // ignore memoryRequirements.alignment as we always bind at zero offset\n    vkBindImageMemory(vkdev->vkdevice(), ptr->image, ptr->memory, ptr->bind_offset);\n\n    // do not allow host access to optimal tiling image\n    ptr->mapped_ptr = 0;\n    ptr->memory_type_index = image_memory_type_index;\n\n    ptr->imageview = create_imageview(ptr->image, format);\n\n    ptr->access_flags = 0;\n    ptr->image_layout = VK_IMAGE_LAYOUT_UNDEFINED;\n    ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n    ptr->command_refcount = 0;\n\n    d->image_memory_blocks.push_back(ptr->memory);\n    d->image_memory_block_free_spaces.push_back(new_block_size - aligned_size);\n\n    return ptr;\n}\n\nvoid VkWeightAllocator::fastFree(VkImageMemory* 
ptr)\n{\n    //     NCNN_LOGE(\"VkWeightAllocator F %p\", ptr->memory);\n\n    if (!ptr->command_refcount)\n    {\n        vkDestroyImageView(vkdev->vkdevice(), ptr->imageview, 0);\n        vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n\n        delete ptr;\n    }\n}\n\nclass VkStagingAllocatorPrivate\n{\npublic:\n    unsigned int size_compare_ratio; // 0~256\n    std::list<VkBufferMemory*> buffer_budgets;\n};\n\nVkStagingAllocator::VkStagingAllocator(const VulkanDevice* _vkdev)\n    : VkAllocator(_vkdev), d(new VkStagingAllocatorPrivate)\n{\n    mappable = true;\n    coherent = true;\n\n    d->size_compare_ratio = 192; // 0.75f * 256\n}\n\nVkStagingAllocator::~VkStagingAllocator()\n{\n    clear();\n\n    delete d;\n}\n\nVkStagingAllocator::VkStagingAllocator(const VkStagingAllocator&)\n    : VkAllocator(0), d(0)\n{\n}\n\nVkStagingAllocator& VkStagingAllocator::operator=(const VkStagingAllocator&)\n{\n    return *this;\n}\n\nvoid VkStagingAllocator::set_size_compare_ratio(float scr)\n{\n    if (scr < 0.f || scr > 1.f)\n    {\n        NCNN_LOGE(\"invalid size compare ratio %f\", scr);\n        return;\n    }\n\n    d->size_compare_ratio = (unsigned int)(scr * 256);\n}\n\nvoid VkStagingAllocator::clear()\n{\n    //     NCNN_LOGE(\"VkStagingAllocator %lu\", buffer_budgets.size());\n\n    for (std::list<VkBufferMemory*>::iterator it = d->buffer_budgets.begin(); it != d->buffer_budgets.end(); it++)\n    {\n        VkBufferMemory* ptr = *it;\n\n        //         NCNN_LOGE(\"VkStagingAllocator F %p\", ptr->buffer);\n\n        vkUnmapMemory(vkdev->vkdevice(), ptr->memory);\n        vkDestroyBuffer(vkdev->vkdevice(), ptr->buffer, 0);\n        vkFreeMemory(vkdev->vkdevice(), ptr->memory, 0);\n\n        delete ptr;\n    }\n    d->buffer_budgets.clear();\n}\n\nVkBufferMemory* VkStagingAllocator::fastMalloc(size_t size)\n{\n    // find free budget\n    std::list<VkBufferMemory*>::iterator it = d->buffer_budgets.begin();\n    for (; it != d->buffer_budgets.end(); it++)\n   
 {\n        VkBufferMemory* ptr = *it;\n\n        size_t capacity = ptr->capacity;\n\n        // size_compare_ratio ~ 100%\n        if (capacity >= size && ((capacity * d->size_compare_ratio) >> 8) <= size)\n        {\n            d->buffer_budgets.erase(it);\n\n            //             NCNN_LOGE(\"VkStagingAllocator M %p %lu reused %lu\", ptr->buffer, size, capacity);\n\n            return ptr;\n        }\n    }\n\n    VkBufferMemory* ptr = new VkBufferMemory;\n\n    ptr->buffer = create_buffer(size, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT);\n    ptr->offset = 0;\n\n    VkMemoryRequirements memoryRequirements;\n    vkGetBufferMemoryRequirements(vkdev->vkdevice(), ptr->buffer, &memoryRequirements);\n\n    // setup memory type\n    if (buffer_memory_type_index == (uint32_t)-1)\n    {\n        buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, VK_MEMORY_PROPERTY_HOST_CACHED_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);\n    }\n\n    ptr->memory = allocate_memory(memoryRequirements.size, buffer_memory_type_index);\n    if (!ptr->memory)\n    {\n        vkDestroyBuffer(vkdev->vkdevice(), ptr->buffer, 0);\n        delete ptr;\n        return 0;\n    }\n\n    // ignore memoryRequirements.alignment as we always bind at zero offset\n    vkBindBufferMemory(vkdev->vkdevice(), ptr->buffer, ptr->memory, 0);\n\n    ptr->capacity = size;\n\n    vkMapMemory(vkdev->vkdevice(), ptr->memory, 0, size, 0, &ptr->mapped_ptr);\n\n    ptr->memory_type_index = buffer_memory_type_index;\n\n    ptr->access_flags = 0;\n    ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n    //     NCNN_LOGE(\"VkStagingAllocator M %p %lu\", ptr->buffer, size);\n\n    return ptr;\n}\n\nvoid VkStagingAllocator::fastFree(VkBufferMemory* ptr)\n{\n    //     NCNN_LOGE(\"VkStagingAllocator F %p\", ptr->buffer);\n\n    // return to 
buffer_budgets\n    d->buffer_budgets.push_back(ptr);\n}\n\nVkImageMemory* VkStagingAllocator::fastMalloc(int w, int h, int c, size_t elemsize, int /* elempack */)\n{\n    // staging image is mainly used for storing small piece of dynamic parameters\n    // we allocate host memory as a fake image, it's simple and good\n\n    const size_t size = w * h * c * elemsize;\n\n    VkImageMemory* ptr = new VkImageMemory;\n\n    ptr->image = 0;\n    ptr->width = w;\n    ptr->height = h;\n    ptr->depth = c;\n    ptr->format = VK_FORMAT_UNDEFINED;\n    ptr->memory = 0;\n    ptr->bind_offset = 0;\n    ptr->bind_capacity = size;\n\n    ptr->mapped_ptr = malloc(size);\n    ptr->memory_type_index = (uint32_t)-1;\n\n    ptr->imageview = 0;\n\n    ptr->access_flags = 0;\n    ptr->image_layout = VK_IMAGE_LAYOUT_UNDEFINED;\n    ptr->stage_flags = VK_PIPELINE_STAGE_HOST_BIT;\n    ptr->command_refcount = 0;\n\n    //     NCNN_LOGE(\"VkStagingAllocator M %p %d %d %d %d %d\", ptr->image, dims, width, height, depth, format);\n\n    return ptr;\n}\n\nvoid VkStagingAllocator::fastFree(VkImageMemory* ptr)\n{\n    //     NCNN_LOGE(\"VkStagingAllocator F %p\", ptr->image);\n\n    free(ptr->mapped_ptr);\n\n    delete ptr;\n}\n\nclass VkWeightStagingAllocatorPrivate\n{\npublic:\n};\n\nVkWeightStagingAllocator::VkWeightStagingAllocator(const VulkanDevice* _vkdev)\n    : VkAllocator(_vkdev), d(new VkWeightStagingAllocatorPrivate)\n{\n    mappable = true;\n    coherent = true;\n}\n\nVkWeightStagingAllocator::~VkWeightStagingAllocator()\n{\n    delete d;\n}\n\nVkWeightStagingAllocator::VkWeightStagingAllocator(const VkWeightStagingAllocator&)\n    : VkAllocator(0), d(0)\n{\n}\n\nVkWeightStagingAllocator& VkWeightStagingAllocator::operator=(const VkWeightStagingAllocator&)\n{\n    return *this;\n}\n\nVkBufferMemory* VkWeightStagingAllocator::fastMalloc(size_t size)\n{\n    VkBufferMemory* ptr = new VkBufferMemory;\n\n    ptr->buffer = create_buffer(size, VK_BUFFER_USAGE_TRANSFER_SRC_BIT | 
VK_BUFFER_USAGE_TRANSFER_DST_BIT);\n    ptr->offset = 0;\n\n    VkMemoryRequirements memoryRequirements;\n    vkGetBufferMemoryRequirements(vkdev->vkdevice(), ptr->buffer, &memoryRequirements);\n\n    // setup memory type\n    if (buffer_memory_type_index == (uint32_t)-1)\n    {\n        buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, VK_MEMORY_PROPERTY_HOST_CACHED_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);\n    }\n\n    ptr->memory = allocate_memory(memoryRequirements.size, buffer_memory_type_index);\n    if (!ptr->memory)\n    {\n        vkDestroyBuffer(vkdev->vkdevice(), ptr->buffer, 0);\n        delete ptr;\n        return 0;\n    }\n\n    // ignore memoryRequirements.alignment as we always bind at zero offset\n    vkBindBufferMemory(vkdev->vkdevice(), ptr->buffer, ptr->memory, 0);\n\n    ptr->capacity = size;\n\n    vkMapMemory(vkdev->vkdevice(), ptr->memory, 0, size, 0, &ptr->mapped_ptr);\n\n    ptr->memory_type_index = buffer_memory_type_index;\n\n    ptr->access_flags = 0;\n    ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n    //     NCNN_LOGE(\"VkWeightStagingAllocator M %p %lu\", ptr->buffer, size);\n\n    return ptr;\n}\n\nvoid VkWeightStagingAllocator::fastFree(VkBufferMemory* ptr)\n{\n    //     NCNN_LOGE(\"VkWeightStagingAllocator F %p\", ptr->buffer);\n\n    vkUnmapMemory(vkdev->vkdevice(), ptr->memory);\n    vkDestroyBuffer(vkdev->vkdevice(), ptr->buffer, 0);\n    vkFreeMemory(vkdev->vkdevice(), ptr->memory, 0);\n\n    delete ptr;\n}\n\nVkImageMemory* VkWeightStagingAllocator::fastMalloc(int /*w*/, int /*h*/, int /*c*/, size_t /*elemsize*/, int /*elempack*/)\n{\n    return 0;\n}\n\nvoid VkWeightStagingAllocator::fastFree(VkImageMemory* /*ptr*/)\n{\n}\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 26\nVkAndroidHardwareBufferImageAllocator::VkAndroidHardwareBufferImageAllocator(const VulkanDevice* _vkdev, AHardwareBuffer* 
_hb)\n    : VkAllocator(_vkdev), hb(_hb)\n{\n    samplerYcbcrConversion = 0;\n\n    init();\n}\n\nVkAndroidHardwareBufferImageAllocator::~VkAndroidHardwareBufferImageAllocator()\n{\n    if (samplerYcbcrConversion)\n    {\n        vkdev->vkDestroySamplerYcbcrConversionKHR(vkdev->vkdevice(), samplerYcbcrConversion, 0);\n        samplerYcbcrConversion = 0;\n    }\n}\n\nVkAndroidHardwareBufferImageAllocator::VkAndroidHardwareBufferImageAllocator(const VkAndroidHardwareBufferImageAllocator&)\n    : VkAllocator(0)\n{\n}\n\nVkAndroidHardwareBufferImageAllocator& VkAndroidHardwareBufferImageAllocator::operator=(const VkAndroidHardwareBufferImageAllocator&)\n{\n    return *this;\n}\n\nVkBufferMemory* VkAndroidHardwareBufferImageAllocator::fastMalloc(size_t /*size*/)\n{\n    return 0;\n}\n\nvoid VkAndroidHardwareBufferImageAllocator::fastFree(VkBufferMemory* /*ptr*/)\n{\n}\n\nVkImageMemory* VkAndroidHardwareBufferImageAllocator::fastMalloc(int /*w*/, int /*h*/, int /*c*/, size_t /*elemsize*/, int /*elempack*/)\n{\n    VkResult ret;\n\n    VkExternalFormatANDROID externalFormat;\n    externalFormat.sType = VK_STRUCTURE_TYPE_EXTERNAL_FORMAT_ANDROID;\n    externalFormat.pNext = 0;\n    externalFormat.externalFormat = bufferFormatProperties.externalFormat;\n\n    VkExternalMemoryImageCreateInfo externalMemoryImageCreateInfo;\n    externalMemoryImageCreateInfo.sType = VK_STRUCTURE_TYPE_EXTERNAL_MEMORY_IMAGE_CREATE_INFO;\n    externalMemoryImageCreateInfo.pNext = &externalFormat;\n    externalMemoryImageCreateInfo.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_ANDROID_HARDWARE_BUFFER_BIT_ANDROID;\n\n    VkImageCreateInfo imageCreateInfo;\n    imageCreateInfo.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;\n    imageCreateInfo.pNext = &externalMemoryImageCreateInfo;\n    imageCreateInfo.flags = 0;\n    imageCreateInfo.imageType = VK_IMAGE_TYPE_2D;\n    imageCreateInfo.format = VK_FORMAT_UNDEFINED;\n    imageCreateInfo.extent.width = bufferDesc.width;\n    imageCreateInfo.extent.height = 
bufferDesc.height;\n    imageCreateInfo.extent.depth = 1;\n    imageCreateInfo.mipLevels = 1;\n    imageCreateInfo.arrayLayers = 1;\n    imageCreateInfo.samples = VK_SAMPLE_COUNT_1_BIT;\n    imageCreateInfo.tiling = VK_IMAGE_TILING_OPTIMAL;\n    imageCreateInfo.usage = VK_IMAGE_USAGE_SAMPLED_BIT;\n    imageCreateInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;\n    imageCreateInfo.queueFamilyIndexCount = 0;\n    imageCreateInfo.pQueueFamilyIndices = 0;\n    imageCreateInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;\n\n    VkImage image = 0;\n    ret = vkCreateImage(vkdev->vkdevice(), &imageCreateInfo, 0, &image);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateImage failed %d\", ret);\n        return 0;\n    }\n\n    // setup memory type\n    if (image_memory_type_index == (uint32_t)-1)\n    {\n        image_memory_type_index = vkdev->find_memory_index(bufferProperties.memoryTypeBits, 0, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);\n    }\n\n    VkImportAndroidHardwareBufferInfoANDROID importAndroidHardwareBufferInfo;\n    importAndroidHardwareBufferInfo.sType = VK_STRUCTURE_TYPE_IMPORT_ANDROID_HARDWARE_BUFFER_INFO_ANDROID;\n    importAndroidHardwareBufferInfo.pNext = 0;\n    importAndroidHardwareBufferInfo.buffer = hb;\n\n    VkMemoryDedicatedAllocateInfo memoryDedicatedAllocateInfo;\n    memoryDedicatedAllocateInfo.sType = VK_STRUCTURE_TYPE_MEMORY_DEDICATED_ALLOCATE_INFO;\n    memoryDedicatedAllocateInfo.pNext = &importAndroidHardwareBufferInfo;\n    memoryDedicatedAllocateInfo.image = image;\n    memoryDedicatedAllocateInfo.buffer = VK_NULL_HANDLE;\n\n    VkMemoryAllocateInfo memoryAllocateInfo;\n    memoryAllocateInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;\n    memoryAllocateInfo.pNext = &memoryDedicatedAllocateInfo;\n    memoryAllocateInfo.allocationSize = bufferProperties.allocationSize;\n    memoryAllocateInfo.memoryTypeIndex = image_memory_type_index;\n\n    VkDeviceMemory memory = 0;\n    ret = 
vkAllocateMemory(vkdev->vkdevice(), &memoryAllocateInfo, 0, &memory);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkAllocateMemory failed %d\", ret);\n        // release the image created above so it does not leak\n        vkDestroyImage(vkdev->vkdevice(), image, 0);\n        return 0;\n    }\n\n    VkBindImageMemoryInfo bindImageMemoryInfo;\n    bindImageMemoryInfo.sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO;\n    bindImageMemoryInfo.pNext = 0;\n    bindImageMemoryInfo.image = image;\n    bindImageMemoryInfo.memory = memory;\n    bindImageMemoryInfo.memoryOffset = 0;\n    ret = vkdev->vkBindImageMemory2KHR(vkdev->vkdevice(), 1, &bindImageMemoryInfo);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkBindImageMemory2KHR failed %d\", ret);\n        // release both the image and the imported memory\n        vkDestroyImage(vkdev->vkdevice(), image, 0);\n        vkFreeMemory(vkdev->vkdevice(), memory, 0);\n        return 0;\n    }\n\n    VkSamplerYcbcrConversionInfoKHR samplerYcbcrConversionInfo;\n    samplerYcbcrConversionInfo.sType = VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_INFO_KHR;\n    samplerYcbcrConversionInfo.pNext = &externalFormat;\n    samplerYcbcrConversionInfo.conversion = samplerYcbcrConversion;\n\n    VkImageViewCreateInfo imageViewCreateInfo;\n    imageViewCreateInfo.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;\n    imageViewCreateInfo.pNext = &samplerYcbcrConversionInfo;\n    imageViewCreateInfo.flags = 0;\n    imageViewCreateInfo.image = image;\n    imageViewCreateInfo.viewType = VK_IMAGE_VIEW_TYPE_2D;\n    imageViewCreateInfo.format = VK_FORMAT_UNDEFINED;\n    imageViewCreateInfo.components.r = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.components.g = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.components.b = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.components.a = VK_COMPONENT_SWIZZLE_IDENTITY;\n    imageViewCreateInfo.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n    imageViewCreateInfo.subresourceRange.baseMipLevel = 0;\n    imageViewCreateInfo.subresourceRange.levelCount = 1;\n    imageViewCreateInfo.subresourceRange.baseArrayLayer = 0;\n    imageViewCreateInfo.subresourceRange.layerCount = 
1;\n\n    VkImageView imageview = 0;\n    ret = vkCreateImageView(vkdev->vkdevice(), &imageViewCreateInfo, 0, &imageview);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateImageView failed %d\", ret);\n        vkDestroyImage(vkdev->vkdevice(), image, 0);\n        vkFreeMemory(vkdev->vkdevice(), memory, 0);\n        return 0;\n    }\n\n    VkImageMemory* ptr = new VkImageMemory;\n    ptr->image = image;\n    ptr->memory = memory;\n    ptr->imageview = imageview;\n    ptr->mapped_ptr = 0;\n    ptr->memory_type_index = (uint32_t)-1;\n    ptr->access_flags = 0;\n    ptr->image_layout = VK_IMAGE_LAYOUT_UNDEFINED;\n    ptr->stage_flags = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n\n    return ptr;\n}\n\nvoid VkAndroidHardwareBufferImageAllocator::fastFree(VkImageMemory* ptr)\n{\n    vkDestroyImageView(vkdev->vkdevice(), ptr->imageview, 0);\n    vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n    vkFreeMemory(vkdev->vkdevice(), ptr->memory, 0);\n\n    delete ptr;\n}\n\nint VkAndroidHardwareBufferImageAllocator::init()\n{\n    AHardwareBuffer_describe(hb, &bufferDesc);\n\n    VkResult ret;\n\n    // resolve externalFormat\n    bufferFormatProperties.sType = VK_STRUCTURE_TYPE_ANDROID_HARDWARE_BUFFER_FORMAT_PROPERTIES_ANDROID;\n    bufferFormatProperties.pNext = 0;\n\n    bufferProperties.sType = VK_STRUCTURE_TYPE_ANDROID_HARDWARE_BUFFER_PROPERTIES_ANDROID;\n    bufferProperties.pNext = &bufferFormatProperties;\n\n    ret = vkdev->vkGetAndroidHardwareBufferPropertiesANDROID(vkdev->vkdevice(), hb, &bufferProperties);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkGetAndroidHardwareBufferPropertiesANDROID failed %d\", ret);\n        return -1;\n    }\n\n    // setup samplerYcbcrConversion\n    VkExternalFormatANDROID externalFormat;\n    externalFormat.sType = VK_STRUCTURE_TYPE_EXTERNAL_FORMAT_ANDROID;\n    externalFormat.pNext = 0;\n    externalFormat.externalFormat = bufferFormatProperties.externalFormat;\n\n    VkSamplerYcbcrConversionCreateInfoKHR 
samplerYcbcrConversionCreateInfo;\n    samplerYcbcrConversionCreateInfo.sType = VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_CREATE_INFO_KHR;\n    samplerYcbcrConversionCreateInfo.pNext = &externalFormat;\n    samplerYcbcrConversionCreateInfo.format = VK_FORMAT_UNDEFINED;\n    samplerYcbcrConversionCreateInfo.ycbcrModel = bufferFormatProperties.suggestedYcbcrModel;\n    samplerYcbcrConversionCreateInfo.ycbcrRange = bufferFormatProperties.suggestedYcbcrRange;\n    samplerYcbcrConversionCreateInfo.components = bufferFormatProperties.samplerYcbcrConversionComponents;\n    samplerYcbcrConversionCreateInfo.xChromaOffset = bufferFormatProperties.suggestedXChromaOffset;\n    samplerYcbcrConversionCreateInfo.yChromaOffset = bufferFormatProperties.suggestedYChromaOffset;\n    samplerYcbcrConversionCreateInfo.chromaFilter = VK_FILTER_NEAREST;\n    samplerYcbcrConversionCreateInfo.forceExplicitReconstruction = VK_FALSE;\n\n    ret = vkdev->vkCreateSamplerYcbcrConversionKHR(vkdev->vkdevice(), &samplerYcbcrConversionCreateInfo, 0, &samplerYcbcrConversion);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateSamplerYcbcrConversionKHR failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nint VkAndroidHardwareBufferImageAllocator::width() const\n{\n    return bufferDesc.width;\n}\n\nint VkAndroidHardwareBufferImageAllocator::height() const\n{\n    return bufferDesc.height;\n}\n\nuint64_t VkAndroidHardwareBufferImageAllocator::external_format() const\n{\n    return bufferFormatProperties.externalFormat;\n}\n#endif // __ANDROID_API__ >= 26\n#endif // NCNN_PLATFORM_API\n\n#endif // NCNN_VULKAN\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/allocator.h",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_ALLOCATOR_H\n#define NCNN_ALLOCATOR_H\n\n#ifdef _WIN32\n#define WIN32_LEAN_AND_MEAN\n#include <windows.h>\n#endif\n\n#include \"platform.h\"\n\n#include <stdlib.h>\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 26\n#include <android/hardware_buffer.h>\n#endif // __ANDROID_API__ >= 26\n#endif // NCNN_PLATFORM_API\n\nnamespace ncnn {\n\n// the alignment of all the allocated buffers\n#if NCNN_AVX512\n#define NCNN_MALLOC_ALIGN 64\n#elif NCNN_AVX\n#define NCNN_MALLOC_ALIGN 32\n#else\n#define NCNN_MALLOC_ALIGN 16\n#endif\n\n// we have some optimized kernels that may overread buffer a bit in loop\n// it is common to interleave next-loop data load with arithmetic instructions\n// allocating more bytes keeps us safe from SEGV_ACCERR failure\n#define NCNN_MALLOC_OVERREAD 64\n\n// Aligns a pointer to the specified number of bytes\n// ptr Aligned pointer\n// n Alignment size that must be a power of two\ntemplate<typename _Tp>\nstatic NCNN_FORCEINLINE _Tp* alignPtr(_Tp* ptr, int n = (int)sizeof(_Tp))\n{\n    return (_Tp*)(((size_t)ptr + n - 1) & -n);\n}\n\n// Aligns a buffer size to the specified number of bytes\n// The function returns the minimum number that is greater or equal to sz and is divisible by n\n// sz Buffer size to align\n// n Alignment size that must be a power of two\nstatic NCNN_FORCEINLINE size_t alignSize(size_t sz, int n)\n{\n    return (sz + n - 1) & -n;\n}\n\nstatic NCNN_FORCEINLINE void* fastMalloc(size_t size)\n{\n#if _MSC_VER\n    return _aligned_malloc(size + NCNN_MALLOC_OVERREAD, NCNN_MALLOC_ALIGN);\n#elif (defined(__unix__) || defined(__APPLE__)) && _POSIX_C_SOURCE >= 200112L || (__ANDROID__ && __ANDROID_API__ >= 17)\n    void* ptr = 0;\n    if (posix_memalign(&ptr, NCNN_MALLOC_ALIGN, size + NCNN_MALLOC_OVERREAD))\n        ptr = 0;\n    return ptr;\n#elif __ANDROID__ && __ANDROID_API__ < 17\n    return memalign(NCNN_MALLOC_ALIGN, size + 
NCNN_MALLOC_OVERREAD);\n#else\n    unsigned char* udata = (unsigned char*)malloc(size + sizeof(void*) + NCNN_MALLOC_ALIGN + NCNN_MALLOC_OVERREAD);\n    if (!udata)\n        return 0;\n    unsigned char** adata = alignPtr((unsigned char**)udata + 1, NCNN_MALLOC_ALIGN);\n    adata[-1] = udata;\n    return adata;\n#endif\n}\n\nstatic NCNN_FORCEINLINE void fastFree(void* ptr)\n{\n    if (ptr)\n    {\n#if _MSC_VER\n        _aligned_free(ptr);\n#elif (defined(__unix__) || defined(__APPLE__)) && _POSIX_C_SOURCE >= 200112L || (__ANDROID__ && __ANDROID_API__ >= 17)\n        free(ptr);\n#elif __ANDROID__ && __ANDROID_API__ < 17\n        free(ptr);\n#else\n        unsigned char* udata = ((unsigned char**)ptr)[-1];\n        free(udata);\n#endif\n    }\n}\n\n#if NCNN_THREADS\n// exchange-add operation for atomic operations on reference counters\n#if defined __riscv && !defined __riscv_atomic\n// riscv target without A extension\nstatic NCNN_FORCEINLINE int NCNN_XADD(int* addr, int delta)\n{\n    int tmp = *addr;\n    *addr += delta;\n    return tmp;\n}\n#elif defined __INTEL_COMPILER && !(defined WIN32 || defined _WIN32)\n// atomic increment on the linux version of the Intel(tm) compiler\n#define NCNN_XADD(addr, delta) (int)_InterlockedExchangeAdd(const_cast<void*>(reinterpret_cast<volatile void*>(addr)), delta)\n#elif defined __GNUC__\n#if defined __clang__ && __clang_major__ >= 3 && !defined __ANDROID__ && !defined __EMSCRIPTEN__ && !defined(__CUDACC__)\n#ifdef __ATOMIC_ACQ_REL\n#define NCNN_XADD(addr, delta) __c11_atomic_fetch_add((_Atomic(int)*)(addr), delta, __ATOMIC_ACQ_REL)\n#else\n#define NCNN_XADD(addr, delta) __atomic_fetch_add((_Atomic(int)*)(addr), delta, 4)\n#endif\n#else\n#if defined __ATOMIC_ACQ_REL && !defined __clang__\n// version for gcc >= 4.7\n#define NCNN_XADD(addr, delta) (int)__atomic_fetch_add((unsigned*)(addr), (unsigned)(delta), __ATOMIC_ACQ_REL)\n#else\n#define NCNN_XADD(addr, delta) (int)__sync_fetch_and_add((unsigned*)(addr), 
(unsigned)(delta))\n#endif\n#endif\n#elif defined _MSC_VER && !defined RC_INVOKED\n#define NCNN_XADD(addr, delta) (int)_InterlockedExchangeAdd((long volatile*)addr, delta)\n#else\n// thread-unsafe branch\nstatic NCNN_FORCEINLINE int NCNN_XADD(int* addr, int delta)\n{\n    int tmp = *addr;\n    *addr += delta;\n    return tmp;\n}\n#endif\n#else  // NCNN_THREADS\nstatic NCNN_FORCEINLINE int NCNN_XADD(int* addr, int delta)\n{\n    int tmp = *addr;\n    *addr += delta;\n    return tmp;\n}\n#endif // NCNN_THREADS\n\nclass NCNN_EXPORT Allocator\n{\npublic:\n    virtual ~Allocator();\n    virtual void* fastMalloc(size_t size) = 0;\n    virtual void fastFree(void* ptr) = 0;\n};\n\nclass PoolAllocatorPrivate;\nclass NCNN_EXPORT PoolAllocator : public Allocator\n{\npublic:\n    PoolAllocator();\n    ~PoolAllocator();\n\n    // ratio range 0 ~ 1\n    // default cr = 0\n    void set_size_compare_ratio(float scr);\n\n    // budget drop threshold\n    // default threshold = 10\n    void set_size_drop_threshold(size_t);\n\n    // release all budgets immediately\n    void clear();\n\n    virtual void* fastMalloc(size_t size);\n    virtual void fastFree(void* ptr);\n\nprivate:\n    PoolAllocator(const PoolAllocator&);\n    PoolAllocator& operator=(const PoolAllocator&);\n\nprivate:\n    PoolAllocatorPrivate* const d;\n};\n\nclass UnlockedPoolAllocatorPrivate;\nclass NCNN_EXPORT UnlockedPoolAllocator : public Allocator\n{\npublic:\n    UnlockedPoolAllocator();\n    ~UnlockedPoolAllocator();\n\n    // ratio range 0 ~ 1\n    // default cr = 0\n    void set_size_compare_ratio(float scr);\n\n    // budget drop threshold\n    // default threshold = 10\n    void set_size_drop_threshold(size_t);\n\n    // release all budgets immediately\n    void clear();\n\n    virtual void* fastMalloc(size_t size);\n    virtual void fastFree(void* ptr);\n\nprivate:\n    UnlockedPoolAllocator(const UnlockedPoolAllocator&);\n    UnlockedPoolAllocator& operator=(const UnlockedPoolAllocator&);\n\nprivate:\n  
  UnlockedPoolAllocatorPrivate* const d;\n};\n\n#if NCNN_VULKAN\n\nclass VulkanDevice;\n\nclass NCNN_EXPORT VkBufferMemory\n{\npublic:\n    VkBuffer buffer;\n\n    // the base offset assigned by allocator\n    size_t offset;\n    size_t capacity;\n\n    VkDeviceMemory memory;\n    void* mapped_ptr;\n\n    uint32_t memory_type_index;\n\n    // buffer state, modified by command functions internally\n    mutable VkAccessFlags access_flags;\n    mutable VkPipelineStageFlags stage_flags;\n\n    // initialize and modified by mat\n    int refcount;\n};\n\nclass NCNN_EXPORT VkImageMemory\n{\npublic:\n    VkImage image;\n    VkImageView imageview;\n\n    // underlying info assigned by allocator\n    int width;\n    int height;\n    int depth;\n    VkFormat format;\n\n    VkDeviceMemory memory;\n    void* mapped_ptr;\n\n    uint32_t memory_type_index;\n\n    // the base offset assigned by allocator\n    size_t bind_offset;\n    size_t bind_capacity;\n\n    // image state, modified by command functions internally\n    mutable VkAccessFlags access_flags;\n    mutable VkImageLayout image_layout;\n    mutable VkPipelineStageFlags stage_flags;\n\n    // in-execution state, modified by command functions internally\n    mutable int command_refcount;\n\n    // initialize and modified by mat\n    int refcount;\n};\n\nclass NCNN_EXPORT VkAllocator\n{\npublic:\n    explicit VkAllocator(const VulkanDevice* _vkdev);\n    virtual ~VkAllocator();\n\n    virtual void clear();\n\n    virtual VkBufferMemory* fastMalloc(size_t size) = 0;\n    virtual void fastFree(VkBufferMemory* ptr) = 0;\n    virtual int flush(VkBufferMemory* ptr);\n    virtual int invalidate(VkBufferMemory* ptr);\n\n    virtual VkImageMemory* fastMalloc(int w, int h, int c, size_t elemsize, int elempack) = 0;\n    virtual void fastFree(VkImageMemory* ptr) = 0;\n\npublic:\n    const VulkanDevice* vkdev;\n    uint32_t buffer_memory_type_index;\n    uint32_t image_memory_type_index;\n    uint32_t reserved_type_index;\n    bool 
mappable;\n    bool coherent;\n\nprotected:\n    VkBuffer create_buffer(size_t size, VkBufferUsageFlags usage);\n    VkDeviceMemory allocate_memory(size_t size, uint32_t memory_type_index);\n    VkDeviceMemory allocate_dedicated_memory(size_t size, uint32_t memory_type_index, VkImage image, VkBuffer buffer);\n    VkDeviceMemory allocate_import_host_memory(size_t size, uint32_t memory_type_index, void* host_ptr);\n\n    VkImage create_image(int width, int height, int depth, VkFormat format, VkImageTiling tiling, VkImageUsageFlags usage);\n    VkImageView create_imageview(VkImage image, VkFormat format);\n};\n\nclass VkBlobAllocatorPrivate;\nclass NCNN_EXPORT VkBlobAllocator : public VkAllocator\n{\npublic:\n    explicit VkBlobAllocator(const VulkanDevice* vkdev, size_t preferred_block_size = 16 * 1024 * 1024); // 16M\n    virtual ~VkBlobAllocator();\n\npublic:\n    // release all budgets immediately\n    virtual void clear();\n\n    virtual VkBufferMemory* fastMalloc(size_t size);\n    virtual void fastFree(VkBufferMemory* ptr);\n    virtual VkImageMemory* fastMalloc(int w, int h, int c, size_t elemsize, int elempack);\n    virtual void fastFree(VkImageMemory* ptr);\n\nprivate:\n    VkBlobAllocator(const VkBlobAllocator&);\n    VkBlobAllocator& operator=(const VkBlobAllocator&);\n\nprivate:\n    VkBlobAllocatorPrivate* const d;\n};\n\nclass VkWeightAllocatorPrivate;\nclass NCNN_EXPORT VkWeightAllocator : public VkAllocator\n{\npublic:\n    explicit VkWeightAllocator(const VulkanDevice* vkdev, bool prefer_host_memory = false, size_t preferred_block_size = 8 * 1024 * 1024); // 8M\n    explicit VkWeightAllocator(const VulkanDevice* vkdev, size_t preferred_block_size)\n        : VkWeightAllocator(vkdev, false, preferred_block_size)\n    {\n    }\n    virtual ~VkWeightAllocator();\n\npublic:\n    // release all blocks immediately\n    virtual void clear();\n\npublic:\n    virtual VkBufferMemory* fastMalloc(size_t size);\n    virtual void fastFree(VkBufferMemory* ptr);\n  
  virtual VkImageMemory* fastMalloc(int w, int h, int c, size_t elemsize, int elempack);\n    virtual void fastFree(VkImageMemory* ptr);\n\nprivate:\n    VkWeightAllocator(const VkWeightAllocator&);\n    VkWeightAllocator& operator=(const VkWeightAllocator&);\n\nprivate:\n    VkWeightAllocatorPrivate* const d;\n};\n\nclass VkStagingAllocatorPrivate;\nclass NCNN_EXPORT VkStagingAllocator : public VkAllocator\n{\npublic:\n    explicit VkStagingAllocator(const VulkanDevice* vkdev);\n    virtual ~VkStagingAllocator();\n\npublic:\n    // ratio range 0 ~ 1\n    // default cr = 0.75\n    void set_size_compare_ratio(float scr);\n\n    // release all budgets immediately\n    virtual void clear();\n\n    virtual VkBufferMemory* fastMalloc(size_t size);\n    virtual void fastFree(VkBufferMemory* ptr);\n    virtual VkImageMemory* fastMalloc(int w, int h, int c, size_t elemsize, int elempack);\n    virtual void fastFree(VkImageMemory* ptr);\n\nprivate:\n    VkStagingAllocator(const VkStagingAllocator&);\n    VkStagingAllocator& operator=(const VkStagingAllocator&);\n\nprivate:\n    VkStagingAllocatorPrivate* const d;\n};\n\nclass VkWeightStagingAllocatorPrivate;\nclass NCNN_EXPORT VkWeightStagingAllocator : public VkAllocator\n{\npublic:\n    explicit VkWeightStagingAllocator(const VulkanDevice* vkdev);\n    virtual ~VkWeightStagingAllocator();\n\npublic:\n    virtual VkBufferMemory* fastMalloc(size_t size);\n    virtual void fastFree(VkBufferMemory* ptr);\n    virtual VkImageMemory* fastMalloc(int w, int h, int c, size_t elemsize, int elempack);\n    virtual void fastFree(VkImageMemory* ptr);\n\nprivate:\n    VkWeightStagingAllocator(const VkWeightStagingAllocator&);\n    VkWeightStagingAllocator& operator=(const VkWeightStagingAllocator&);\n\nprivate:\n    VkWeightStagingAllocatorPrivate* const d;\n};\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 26\nclass NCNN_EXPORT VkAndroidHardwareBufferImageAllocator : public VkAllocator\n{\npublic:\n    
VkAndroidHardwareBufferImageAllocator(const VulkanDevice* _vkdev, AHardwareBuffer* _hb);\n    virtual ~VkAndroidHardwareBufferImageAllocator();\n\npublic:\n    virtual VkBufferMemory* fastMalloc(size_t size);\n    virtual void fastFree(VkBufferMemory* ptr);\n    virtual VkImageMemory* fastMalloc(int w, int h, int c, size_t elemsize, int elempack);\n    virtual void fastFree(VkImageMemory* ptr);\n\nprivate:\n    VkAndroidHardwareBufferImageAllocator(const VkAndroidHardwareBufferImageAllocator&);\n    VkAndroidHardwareBufferImageAllocator& operator=(const VkAndroidHardwareBufferImageAllocator&);\n\npublic:\n    int init();\n\n    int width() const;\n    int height() const;\n    uint64_t external_format() const;\n\npublic:\n    AHardwareBuffer* hb;\n    AHardwareBuffer_Desc bufferDesc;\n    VkAndroidHardwareBufferFormatPropertiesANDROID bufferFormatProperties;\n    VkAndroidHardwareBufferPropertiesANDROID bufferProperties;\n    VkSamplerYcbcrConversionKHR samplerYcbcrConversion;\n};\n#endif // __ANDROID_API__ >= 26\n#endif // NCNN_PLATFORM_API\n\n#endif // NCNN_VULKAN\n\n} // namespace ncnn\n\n#endif // NCNN_ALLOCATOR_H\n"
  },
  {
    "path": "src/benchmark.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"benchmark.h\"\n\n#if (__cplusplus >= 201103L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L)) && !defined(__riscv) && !NCNN_SIMPLESTL\n#define USE_CXX11_CLOCK 1\n#else\n#define USE_CXX11_CLOCK 0\n#endif\n\n#if USE_CXX11_CLOCK\n#include <chrono>\n#if NCNN_THREADS\n#include <thread>\n#endif\n#include <numeric>\n#include <algorithm>\n#endif // USE_CXX11_CLOCK\n\n#ifdef _WIN32\n#define WIN32_LEAN_AND_MEAN\n#include <windows.h>\n#else                 // _WIN32\n#include <sys/time.h> //gettimeofday()\n#include <unistd.h>   // sleep()\n#endif                // _WIN32\n\n#if NCNN_BENCHMARK\n#include \"layer/convolution.h\"\n#include \"layer/convolutiondepthwise.h\"\n#include \"layer/deconvolution.h\"\n#include \"layer/deconvolutiondepthwise.h\"\n#include \"layer/convolution3d.h\"\n#include \"layer/convolutiondepthwise3d.h\"\n#include \"layer/deconvolution3d.h\"\n#include \"layer/deconvolutiondepthwise3d.h\"\n\n#include <stdio.h>\n#endif // NCNN_BENCHMARK\n\nnamespace ncnn {\n\ndouble get_current_time()\n{\n#if USE_CXX11_CLOCK\n    auto now = std::chrono::high_resolution_clock::now();\n    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch());\n    return usec.count() / 1000.0;\n#else\n#ifdef _WIN32\n    LARGE_INTEGER freq;\n    LARGE_INTEGER pc;\n    QueryPerformanceFrequency(&freq);\n    QueryPerformanceCounter(&pc);\n\n    return pc.QuadPart * 1000.0 / freq.QuadPart;\n#else  // _WIN32\n    struct timeval tv;\n    gettimeofday(&tv, NULL);\n\n    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;\n#endif // _WIN32\n#endif\n}\n\nvoid sleep(unsigned long long int milliseconds)\n{\n#if USE_CXX11_CLOCK && NCNN_THREADS\n    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));\n#else\n#ifdef _WIN32\n    Sleep(milliseconds);\n#elif defined(__unix__) || defined(__APPLE__)\n    usleep(milliseconds * 1000);\n#elif _POSIX_TIMERS\n    
struct timespec ts;\n    // split into whole seconds and the sub-second remainder in nanoseconds\n    ts.tv_sec = milliseconds / 1000;\n    ts.tv_nsec = (milliseconds % 1000) * 1000000;\n    nanosleep(&ts, &ts);\n#else\n    // TODO How to handle it ?\n#endif\n#endif\n}\n\n#if NCNN_BENCHMARK\n\nvoid benchmark(const Layer* layer, double start, double end)\n{\n    fprintf(stderr, \"%-24s %-30s %8.2lfms\", layer->type.c_str(), layer->name.c_str(), end - start);\n    fprintf(stderr, \"    |\");\n    fprintf(stderr, \"\\n\");\n}\n\nvoid benchmark(const Layer* layer, const Mat& bottom_blob, Mat& top_blob, double start, double end)\n{\n    fprintf(stderr, \"%-24s %-30s %8.2lfms\", layer->type.c_str(), layer->name.c_str(), end - start);\n\n    char in_shape_str[64] = {'\\0'};\n    char out_shape_str[64] = {'\\0'};\n\n    if (bottom_blob.dims == 1)\n    {\n        sprintf(in_shape_str, \"[%3d *%d]\", bottom_blob.w, bottom_blob.elempack);\n    }\n    if (bottom_blob.dims == 2)\n    {\n        sprintf(in_shape_str, \"[%3d, %3d *%d]\", bottom_blob.w, bottom_blob.h, bottom_blob.elempack);\n    }\n    if (bottom_blob.dims == 3)\n    {\n        sprintf(in_shape_str, \"[%3d, %3d, %3d *%d]\", bottom_blob.w, bottom_blob.h, bottom_blob.c, bottom_blob.elempack);\n    }\n    if (bottom_blob.dims == 4)\n    {\n        sprintf(in_shape_str, \"[%3d, %3d, %3d, %3d *%d]\", bottom_blob.w, bottom_blob.h, bottom_blob.d, bottom_blob.c, bottom_blob.elempack);\n    }\n\n    if (top_blob.dims == 1)\n    {\n        sprintf(out_shape_str, \"[%3d *%d]\", top_blob.w, top_blob.elempack);\n    }\n    if (top_blob.dims == 2)\n    {\n        sprintf(out_shape_str, \"[%3d, %3d *%d]\", top_blob.w, top_blob.h, top_blob.elempack);\n    }\n    if (top_blob.dims == 3)\n    {\n        sprintf(out_shape_str, \"[%3d, %3d, %3d *%d]\", top_blob.w, top_blob.h, top_blob.c, top_blob.elempack);\n    }\n    if (top_blob.dims == 4)\n    {\n        sprintf(out_shape_str, \"[%3d, %3d, %3d, %3d *%d]\", top_blob.w, top_blob.h, top_blob.d, top_blob.c, top_blob.elempack);\n    }\n\n    fprintf(stderr, \"    | %22s -> %-22s\", 
in_shape_str, out_shape_str);\n\n    if (layer->type == \"Convolution\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d     stride: %1d x %1d\",\n                ((Convolution*)layer)->kernel_w,\n                ((Convolution*)layer)->kernel_h,\n                ((Convolution*)layer)->stride_w,\n                ((Convolution*)layer)->stride_h);\n    }\n    else if (layer->type == \"ConvolutionDepthWise\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d     stride: %1d x %1d\",\n                ((ConvolutionDepthWise*)layer)->kernel_w,\n                ((ConvolutionDepthWise*)layer)->kernel_h,\n                ((ConvolutionDepthWise*)layer)->stride_w,\n                ((ConvolutionDepthWise*)layer)->stride_h);\n    }\n    else if (layer->type == \"Deconvolution\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d     stride: %1d x %1d\",\n                ((Deconvolution*)layer)->kernel_w,\n                ((Deconvolution*)layer)->kernel_h,\n                ((Deconvolution*)layer)->stride_w,\n                ((Deconvolution*)layer)->stride_h);\n    }\n    else if (layer->type == \"DeconvolutionDepthWise\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d     stride: %1d x %1d\",\n                ((DeconvolutionDepthWise*)layer)->kernel_w,\n                ((DeconvolutionDepthWise*)layer)->kernel_h,\n                ((DeconvolutionDepthWise*)layer)->stride_w,\n                ((DeconvolutionDepthWise*)layer)->stride_h);\n    }\n    else if (layer->type == \"Convolution3D\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d x %1d    stride: %1d x %1d x %1d\",\n                ((Convolution3D*)layer)->kernel_w,\n                ((Convolution3D*)layer)->kernel_h,\n                ((Convolution3D*)layer)->kernel_d,\n                ((Convolution3D*)layer)->stride_w,\n                ((Convolution3D*)layer)->stride_h,\n                ((Convolution3D*)layer)->stride_d);\n    }\n    else if (layer->type == 
\"ConvolutionDepthWise3D\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d x %1d    stride: %1d x %1d x %1d\",\n                ((ConvolutionDepthWise3D*)layer)->kernel_w,\n                ((ConvolutionDepthWise3D*)layer)->kernel_h,\n                ((ConvolutionDepthWise3D*)layer)->kernel_d,\n                ((ConvolutionDepthWise3D*)layer)->stride_w,\n                ((ConvolutionDepthWise3D*)layer)->stride_h,\n                ((ConvolutionDepthWise3D*)layer)->stride_d);\n    }\n    else if (layer->type == \"Deconvolution3D\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d x %1d    stride: %1d x %1d x %1d\",\n                ((Deconvolution3D*)layer)->kernel_w,\n                ((Deconvolution3D*)layer)->kernel_h,\n                ((Deconvolution3D*)layer)->kernel_d,\n                ((Deconvolution3D*)layer)->stride_w,\n                ((Deconvolution3D*)layer)->stride_h,\n                ((Deconvolution3D*)layer)->stride_d);\n    }\n    else if (layer->type == \"DeconvolutionDepthWise3D\")\n    {\n        fprintf(stderr, \"     kernel: %1d x %1d x %1d    stride: %1d x %1d x %1d\",\n                ((DeconvolutionDepthWise3D*)layer)->kernel_w,\n                ((DeconvolutionDepthWise3D*)layer)->kernel_h,\n                ((DeconvolutionDepthWise3D*)layer)->kernel_d,\n                ((DeconvolutionDepthWise3D*)layer)->stride_w,\n                ((DeconvolutionDepthWise3D*)layer)->stride_h,\n                ((DeconvolutionDepthWise3D*)layer)->stride_d);\n    }\n    fprintf(stderr, \"\\n\");\n}\n\n#endif // NCNN_BENCHMARK\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/benchmark.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_BENCHMARK_H\n#define NCNN_BENCHMARK_H\n\n#include \"layer.h\"\n#include \"mat.h\"\n#include \"platform.h\"\n\nnamespace ncnn {\n\n// get now timestamp in ms\nNCNN_EXPORT double get_current_time();\n\n// sleep milliseconds\nNCNN_EXPORT void sleep(unsigned long long int milliseconds = 1000);\n\n#if NCNN_BENCHMARK\n\nNCNN_EXPORT void benchmark(const Layer* layer, double start, double end);\nNCNN_EXPORT void benchmark(const Layer* layer, const Mat& bottom_blob, Mat& top_blob, double start, double end);\n\n#endif // NCNN_BENCHMARK\n\n} // namespace ncnn\n\n#endif // NCNN_BENCHMARK_H\n"
  },
  {
    "path": "src/blob.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"blob.h\"\n\nnamespace ncnn {\n\nBlob::Blob()\n{\n    producer = -1;\n    consumer = -1;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/blob.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_BLOB_H\n#define NCNN_BLOB_H\n\n#include \"mat.h\"\n#include \"platform.h\"\n\nnamespace ncnn {\n\nclass NCNN_EXPORT Blob\n{\npublic:\n    // empty\n    Blob();\n\npublic:\n#if NCNN_STRING\n    // blob name\n    std::string name;\n#endif // NCNN_STRING\n    // layer index which produce this blob as output\n    int producer;\n    // layer index which need this blob as input\n    int consumer;\n    // shape hint\n    Mat shape;\n};\n\n} // namespace ncnn\n\n#endif // NCNN_BLOB_H\n"
  },
  {
    "path": "src/c_api.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"platform.h\"\n\n#if NCNN_C_API\n\n#include \"c_api.h\"\n\n#include <stdlib.h>\n\n#include \"allocator.h\"\n#include \"blob.h\"\n#include \"datareader.h\"\n#include \"layer.h\"\n#include \"mat.h\"\n#include \"modelbin.h\"\n#include \"net.h\"\n#include \"option.h\"\n#include \"paramdict.h\"\n\nusing ncnn::Allocator;\nusing ncnn::Blob;\nusing ncnn::DataReader;\nusing ncnn::Extractor;\nusing ncnn::Layer;\nusing ncnn::Mat;\nusing ncnn::ModelBin;\nusing ncnn::Net;\nusing ncnn::Option;\nusing ncnn::ParamDict;\n\n#ifdef __cplusplus\nextern \"C\" {\n#endif\n\nconst char* ncnn_version()\n{\n    return NCNN_VERSION_STRING;\n}\n\nint ncnn_version_number()\n{\n    return NCNN_VERSION_NUMBER;\n}\n\n/* allocator api */\nclass PoolAllocator_c_api : public ncnn::PoolAllocator\n{\npublic:\n    PoolAllocator_c_api(ncnn_allocator_t _allocator)\n        : ncnn::PoolAllocator()\n    {\n        allocator = _allocator;\n    }\n\n    virtual void* fastMalloc(size_t size)\n    {\n        return allocator->fast_malloc(allocator, size);\n    }\n\n    virtual void fastFree(void* ptr)\n    {\n        return allocator->fast_free(allocator, ptr);\n    }\n\npublic:\n    ncnn_allocator_t allocator;\n};\n\nstatic void* __ncnn_PoolAllocator_fast_malloc(ncnn_allocator_t allocator, size_t size)\n{\n    return ((ncnn::PoolAllocator*)allocator->pthis)->ncnn::PoolAllocator::fastMalloc(size);\n}\n\nstatic void __ncnn_PoolAllocator_fast_free(ncnn_allocator_t allocator, void* ptr)\n{\n    ((ncnn::PoolAllocator*)allocator->pthis)->ncnn::PoolAllocator::fastFree(ptr);\n}\n\nclass UnlockedPoolAllocator_c_api : public ncnn::UnlockedPoolAllocator\n{\npublic:\n    UnlockedPoolAllocator_c_api(ncnn_allocator_t _allocator)\n        : ncnn::UnlockedPoolAllocator()\n    {\n        allocator = _allocator;\n    }\n\n    virtual void* fastMalloc(size_t size)\n    {\n        return allocator->fast_malloc(allocator, size);\n    
}\n\n    virtual void fastFree(void* ptr)\n    {\n        return allocator->fast_free(allocator, ptr);\n    }\n\npublic:\n    ncnn_allocator_t allocator;\n};\n\nstatic void* __ncnn_UnlockedPoolAllocator_fast_malloc(ncnn_allocator_t allocator, size_t size)\n{\n    return ((ncnn::UnlockedPoolAllocator*)allocator->pthis)->ncnn::UnlockedPoolAllocator::fastMalloc(size);\n}\n\nstatic void __ncnn_UnlockedPoolAllocator_fast_free(ncnn_allocator_t allocator, void* ptr)\n{\n    ((ncnn::UnlockedPoolAllocator*)allocator->pthis)->ncnn::UnlockedPoolAllocator::fastFree(ptr);\n}\n\nncnn_allocator_t ncnn_allocator_create_pool_allocator()\n{\n    ncnn_allocator_t allocator = (ncnn_allocator_t)malloc(sizeof(struct __ncnn_allocator_t));\n    allocator->pthis = (void*)(new PoolAllocator_c_api(allocator));\n    allocator->fast_malloc = __ncnn_PoolAllocator_fast_malloc;\n    allocator->fast_free = __ncnn_PoolAllocator_fast_free;\n    return allocator;\n}\n\nncnn_allocator_t ncnn_allocator_create_unlocked_pool_allocator()\n{\n    ncnn_allocator_t allocator = (ncnn_allocator_t)malloc(sizeof(struct __ncnn_allocator_t));\n    allocator->pthis = (void*)(new UnlockedPoolAllocator_c_api(allocator));\n    allocator->fast_malloc = __ncnn_UnlockedPoolAllocator_fast_malloc;\n    allocator->fast_free = __ncnn_UnlockedPoolAllocator_fast_free;\n    return allocator;\n}\n\nvoid ncnn_allocator_destroy(ncnn_allocator_t allocator)\n{\n    if (allocator)\n    {\n        delete (Allocator*)allocator->pthis;\n        free(allocator);\n    }\n}\n\n/* option api */\nncnn_option_t ncnn_option_create()\n{\n    return (ncnn_option_t)(new Option());\n}\n\nvoid ncnn_option_destroy(ncnn_option_t opt)\n{\n    delete (Option*)opt;\n}\n\nint ncnn_option_get_num_threads(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->num_threads;\n}\n\nvoid ncnn_option_set_num_threads(ncnn_option_t opt, int num_threads)\n{\n    ((Option*)opt)->num_threads = num_threads;\n}\n\nvoid 
ncnn_option_set_blob_allocator(ncnn_option_t opt, ncnn_allocator_t allocator)\n{\n    ((Option*)opt)->blob_allocator = allocator ? (Allocator*)allocator->pthis : NULL;\n}\n\nvoid ncnn_option_set_workspace_allocator(ncnn_option_t opt, ncnn_allocator_t allocator)\n{\n    ((Option*)opt)->workspace_allocator = allocator ? (Allocator*)allocator->pthis : NULL;\n}\n\nint ncnn_option_get_use_vulkan_compute(const ncnn_option_t opt)\n{\n#if NCNN_VULKAN\n    return ((const Option*)opt)->use_vulkan_compute;\n#else\n    (void)opt;\n    return 0;\n#endif\n}\n\nint ncnn_option_get_use_local_pool_allocator(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_local_pool_allocator;\n}\n\nint ncnn_option_get_use_winograd_convolution(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_winograd_convolution;\n}\n\nint ncnn_option_get_use_sgemm_convolution(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_sgemm_convolution;\n}\n\nint ncnn_option_get_use_packing_layout(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_packing_layout;\n}\n\nint ncnn_option_get_use_fp16_packed(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_fp16_packed;\n}\n\nint ncnn_option_get_use_fp16_storage(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_fp16_storage;\n}\n\nint ncnn_option_get_use_fp16_arithmetic(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_fp16_arithmetic;\n}\n\nint ncnn_option_get_use_int8_packed(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_int8_packed;\n}\n\nint ncnn_option_get_use_int8_storage(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_int8_storage;\n}\n\nint ncnn_option_get_use_int8_arithmetic(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_int8_arithmetic;\n}\n\nint ncnn_option_get_use_bf16_packed(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_bf16_packed;\n}\n\nint 
ncnn_option_get_use_bf16_storage(const ncnn_option_t opt)\n{\n    return ((const Option*)opt)->use_bf16_storage;\n}\n\nint ncnn_option_get_use_shader_local_memory(const ncnn_option_t opt)\n{\n#if NCNN_VULKAN\n    return ((const Option*)opt)->use_shader_local_memory;\n#else\n    (void)opt;\n    return 0;\n#endif\n}\n\nint ncnn_option_get_use_cooperative_matrix(const ncnn_option_t opt)\n{\n#if NCNN_VULKAN\n    return ((const Option*)opt)->use_cooperative_matrix;\n#else\n    (void)opt;\n    return 0;\n#endif\n}\n\nvoid ncnn_option_set_use_vulkan_compute(ncnn_option_t opt, int enable)\n{\n#if NCNN_VULKAN\n    ((Option*)opt)->use_vulkan_compute = enable;\n#else\n    (void)opt;\n    (void)enable;\n#endif\n}\n\nvoid ncnn_option_set_use_local_pool_allocator(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_local_pool_allocator = enable;\n}\n\nvoid ncnn_option_set_use_winograd_convolution(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_winograd_convolution = enable;\n}\n\nvoid ncnn_option_set_use_sgemm_convolution(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_sgemm_convolution = enable;\n}\n\nvoid ncnn_option_set_use_packing_layout(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_packing_layout = enable;\n}\n\nvoid ncnn_option_set_use_fp16_packed(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_fp16_packed = enable;\n}\n\nvoid ncnn_option_set_use_fp16_storage(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_fp16_storage = enable;\n}\n\nvoid ncnn_option_set_use_fp16_arithmetic(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_fp16_arithmetic = enable;\n}\n\nvoid ncnn_option_set_use_int8_packed(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_int8_packed = enable;\n}\n\nvoid ncnn_option_set_use_int8_storage(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_int8_storage = enable;\n}\n\nvoid ncnn_option_set_use_int8_arithmetic(ncnn_option_t opt, int enable)\n{\n    
((Option*)opt)->use_int8_arithmetic = enable;\n}\n\nvoid ncnn_option_set_use_bf16_packed(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_bf16_packed = enable;\n}\n\nvoid ncnn_option_set_use_bf16_storage(ncnn_option_t opt, int enable)\n{\n    ((Option*)opt)->use_bf16_storage = enable;\n}\n\nvoid ncnn_option_set_use_shader_local_memory(ncnn_option_t opt, int enable)\n{\n#if NCNN_VULKAN\n    ((Option*)opt)->use_shader_local_memory = enable;\n#else\n    (void)opt;\n    (void)enable;\n#endif\n}\n\nvoid ncnn_option_set_use_cooperative_matrix(ncnn_option_t opt, int enable)\n{\n#if NCNN_VULKAN\n    ((Option*)opt)->use_cooperative_matrix = enable;\n#else\n    (void)opt;\n    (void)enable;\n#endif\n}\n\n/* mat api */\nncnn_mat_t ncnn_mat_create()\n{\n    return (ncnn_mat_t)(new Mat());\n}\n\nncnn_mat_t ncnn_mat_create_1d(int w, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, (size_t)4u, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_2d(int w, int h, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, (size_t)4u, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_3d(int w, int h, int c, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, c, (size_t)4u, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_4d(int w, int h, int d, int c, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, d, c, (size_t)4u, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_1d(int w, void* data, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, data, (size_t)4u, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_2d(int w, int h, void* data, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, data, (size_t)4u, allocator ? 
(Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_3d(int w, int h, int c, void* data, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, c, data, (size_t)4u, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_4d(int w, int h, int d, int c, void* data, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, d, c, data, (size_t)4u, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_1d_elem(int w, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, elemsize, elempack, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_2d_elem(int w, int h, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, elemsize, elempack, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_3d_elem(int w, int h, int c, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, c, elemsize, elempack, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_4d_elem(int w, int h, int d, int c, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, d, c, elemsize, elempack, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_1d_elem(int w, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, data, elemsize, elempack, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_2d_elem(int w, int h, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, data, elemsize, elempack, allocator ? 
(Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_3d_elem(int w, int h, int c, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, c, data, elemsize, elempack, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nncnn_mat_t ncnn_mat_create_external_4d_elem(int w, int h, int d, int c, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(w, h, d, c, data, elemsize, elempack, allocator ? (Allocator*)allocator->pthis : NULL));\n}\n\nvoid ncnn_mat_destroy(ncnn_mat_t mat)\n{\n    delete (Mat*)mat;\n}\n\nvoid ncnn_mat_fill_float(ncnn_mat_t mat, float v)\n{\n    ((Mat*)mat)->fill(v);\n}\n\nncnn_mat_t ncnn_mat_clone(const ncnn_mat_t mat, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(((const Mat*)mat)->clone(allocator ? (Allocator*)allocator->pthis : NULL)));\n}\n\nncnn_mat_t ncnn_mat_reshape_1d(const ncnn_mat_t mat, int w, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(((const Mat*)mat)->reshape(w, allocator ? (Allocator*)allocator->pthis : NULL)));\n}\n\nncnn_mat_t ncnn_mat_reshape_2d(const ncnn_mat_t mat, int w, int h, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(((const Mat*)mat)->reshape(w, h, allocator ? (Allocator*)allocator->pthis : NULL)));\n}\n\nncnn_mat_t ncnn_mat_reshape_3d(const ncnn_mat_t mat, int w, int h, int c, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(((const Mat*)mat)->reshape(w, h, c, allocator ? (Allocator*)allocator->pthis : NULL)));\n}\n\nncnn_mat_t ncnn_mat_reshape_4d(const ncnn_mat_t mat, int w, int h, int d, int c, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(((const Mat*)mat)->reshape(w, h, d, c, allocator ? 
(Allocator*)allocator->pthis : NULL)));\n}\n\nint ncnn_mat_get_dims(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->dims;\n}\n\nint ncnn_mat_get_w(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->w;\n}\n\nint ncnn_mat_get_h(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->h;\n}\n\nint ncnn_mat_get_d(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->d;\n}\n\nint ncnn_mat_get_c(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->c;\n}\n\nsize_t ncnn_mat_get_elemsize(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->elemsize;\n}\n\nint ncnn_mat_get_elempack(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->elempack;\n}\n\nsize_t ncnn_mat_get_cstep(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->cstep;\n}\n\nvoid* ncnn_mat_get_data(const ncnn_mat_t mat)\n{\n    return ((const Mat*)mat)->data;\n}\n\nvoid* ncnn_mat_get_channel_data(const ncnn_mat_t mat, int c)\n{\n    return ((const Mat*)mat)->channel(c).data;\n}\n\n#if NCNN_PIXEL\n\n/* mat pixel api */\nncnn_mat_t ncnn_mat_from_pixels(const unsigned char* pixels, int type, int w, int h, int stride, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(Mat::from_pixels(pixels, type, w, h, stride, allocator ? (Allocator*)allocator->pthis : NULL)));\n}\n\nncnn_mat_t ncnn_mat_from_pixels_resize(const unsigned char* pixels, int type, int w, int h, int stride, int target_width, int target_height, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(Mat::from_pixels_resize(pixels, type, w, h, stride, target_width, target_height, allocator ? (Allocator*)allocator->pthis : NULL)));\n}\n\nncnn_mat_t ncnn_mat_from_pixels_roi(const unsigned char* pixels, int type, int w, int h, int stride, int roix, int roiy, int roiw, int roih, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(Mat::from_pixels_roi(pixels, type, w, h, stride, roix, roiy, roiw, roih, allocator ? 
(Allocator*)allocator->pthis : NULL)));\n}\n\nncnn_mat_t ncnn_mat_from_pixels_roi_resize(const unsigned char* pixels, int type, int w, int h, int stride, int roix, int roiy, int roiw, int roih, int target_width, int target_height, ncnn_allocator_t allocator)\n{\n    return (ncnn_mat_t)(new Mat(Mat::from_pixels_roi_resize(pixels, type, w, h, stride, roix, roiy, roiw, roih, target_width, target_height, allocator ? (Allocator*)allocator->pthis : NULL)));\n}\n\nvoid ncnn_mat_to_pixels(const ncnn_mat_t mat, unsigned char* pixels, int type, int stride)\n{\n    ((const Mat*)mat)->to_pixels(pixels, type, stride);\n}\n\nvoid ncnn_mat_to_pixels_resize(const ncnn_mat_t mat, unsigned char* pixels, int type, int target_width, int target_height, int target_stride)\n{\n    ((const Mat*)mat)->to_pixels_resize(pixels, type, target_width, target_height, target_stride);\n}\n\n#endif /* NCNN_PIXEL */\n\nvoid ncnn_mat_substract_mean_normalize(ncnn_mat_t mat, const float* mean_vals, const float* norm_vals)\n{\n    ((Mat*)mat)->substract_mean_normalize(mean_vals, norm_vals);\n}\n\nvoid ncnn_convert_packing(const ncnn_mat_t src, ncnn_mat_t* dst, int elempack, const ncnn_option_t opt)\n{\n    Mat _dst;\n    ncnn::convert_packing(*(const Mat*)src, _dst, elempack, *(Option*)opt);\n    *dst = (ncnn_mat_t)(new Mat(_dst));\n}\n\nvoid ncnn_flatten(const ncnn_mat_t src, ncnn_mat_t* dst, const ncnn_option_t opt)\n{\n    Mat _dst;\n    ncnn::flatten(*(const Mat*)src, _dst, *(Option*)opt);\n    *dst = (ncnn_mat_t)(new Mat(_dst));\n}\n\n/* blob api */\n#if NCNN_STRING\nconst char* ncnn_blob_get_name(const ncnn_blob_t blob)\n{\n    return ((const Blob*)blob)->name.c_str();\n}\n#endif /* NCNN_STRING */\n\nint ncnn_blob_get_producer(const ncnn_blob_t blob)\n{\n    return ((const Blob*)blob)->producer;\n}\n\nint ncnn_blob_get_consumer(const ncnn_blob_t blob)\n{\n    return ((const Blob*)blob)->consumer;\n}\n\nvoid ncnn_blob_get_shape(const ncnn_blob_t blob, int* dims, int* w, int* h, int* c)\n{\n    
const Mat& shape = ((const Blob*)blob)->shape;\n    *dims = shape.dims;\n    *w = shape.w;\n    *h = shape.h;\n    *c = shape.c;\n}\n\n/* paramdict api */\nncnn_paramdict_t ncnn_paramdict_create()\n{\n    return (ncnn_paramdict_t)(new ParamDict());\n}\n\nvoid ncnn_paramdict_destroy(ncnn_paramdict_t pd)\n{\n    delete (ParamDict*)pd;\n}\n\nint ncnn_paramdict_get_type(const ncnn_paramdict_t pd, int id)\n{\n    return ((const ParamDict*)pd)->type(id);\n}\n\nint ncnn_paramdict_get_int(const ncnn_paramdict_t pd, int id, int def)\n{\n    return ((const ParamDict*)pd)->get(id, def);\n}\n\nfloat ncnn_paramdict_get_float(const ncnn_paramdict_t pd, int id, float def)\n{\n    return ((const ParamDict*)pd)->get(id, def);\n}\n\nncnn_mat_t ncnn_paramdict_get_array(ncnn_paramdict_t pd, int id, const ncnn_mat_t def)\n{\n    return (ncnn_mat_t)(new Mat(((const ParamDict*)pd)->get(id, *(const Mat*)def)));\n}\n\nvoid ncnn_paramdict_set_int(ncnn_paramdict_t pd, int id, int i)\n{\n    return ((ParamDict*)pd)->set(id, i);\n}\n\nvoid ncnn_paramdict_set_float(ncnn_paramdict_t pd, int id, float f)\n{\n    return ((ParamDict*)pd)->set(id, f);\n}\n\nvoid ncnn_paramdict_set_array(ncnn_paramdict_t pd, int id, ncnn_mat_t v)\n{\n    return ((ParamDict*)pd)->set(id, *(const Mat*)v);\n}\n\n/* datareader api */\nclass DataReader_c_api : public ncnn::DataReader\n{\npublic:\n    DataReader_c_api(ncnn_datareader_t _dr)\n        : ncnn::DataReader()\n    {\n        dr = _dr;\n    }\n\n#if NCNN_STRING\n    virtual int scan(const char* format, void* p) const\n    {\n        return dr->scan(dr, format, p);\n    }\n#endif /* NCNN_STRING */\n\n    virtual size_t read(void* buf, size_t size) const\n    {\n        return dr->read(dr, buf, size);\n    }\n\npublic:\n    ncnn_datareader_t dr;\n};\n\n#if NCNN_STRING\nstatic int __ncnn_DataReader_scan(ncnn_datareader_t dr, const char* format, void* p)\n{\n    return ((ncnn::DataReader*)dr->pthis)->ncnn::DataReader::scan(format, p);\n}\n#endif /* NCNN_STRING 
*/\n\nstatic size_t __ncnn_DataReader_read(ncnn_datareader_t dr, void* buf, size_t size)\n{\n    return ((ncnn::DataReader*)dr->pthis)->ncnn::DataReader::read(buf, size);\n}\n\n#if NCNN_STDIO\nclass DataReaderFromStdio_c_api : public ncnn::DataReaderFromStdio\n{\npublic:\n    DataReaderFromStdio_c_api(FILE* fp, ncnn_datareader_t _dr)\n        : ncnn::DataReaderFromStdio(fp)\n    {\n        dr = _dr;\n    }\n\n#if NCNN_STRING\n    virtual int scan(const char* format, void* p) const\n    {\n        return dr->scan(dr, format, p);\n    }\n#endif /* NCNN_STRING */\n\n    virtual size_t read(void* buf, size_t size) const\n    {\n        return dr->read(dr, buf, size);\n    }\n\npublic:\n    ncnn_datareader_t dr;\n};\n\n#if NCNN_STRING\nstatic int __ncnn_DataReaderFromStdio_scan(ncnn_datareader_t dr, const char* format, void* p)\n{\n    return ((ncnn::DataReaderFromStdio*)dr->pthis)->ncnn::DataReaderFromStdio::scan(format, p);\n}\n#endif /* NCNN_STRING */\n\nstatic size_t __ncnn_DataReaderFromStdio_read(ncnn_datareader_t dr, void* buf, size_t size)\n{\n    return ((ncnn::DataReaderFromStdio*)dr->pthis)->ncnn::DataReaderFromStdio::read(buf, size);\n}\n#endif /* NCNN_STDIO */\n\nclass DataReaderFromMemory_c_api : public ncnn::DataReaderFromMemory\n{\npublic:\n    DataReaderFromMemory_c_api(const unsigned char*& mem, ncnn_datareader_t _dr)\n        : ncnn::DataReaderFromMemory(mem)\n    {\n        dr = _dr;\n    }\n\n#if NCNN_STRING\n    virtual int scan(const char* format, void* p) const\n    {\n        return dr->scan(dr, format, p);\n    }\n#endif /* NCNN_STRING */\n\n    virtual size_t read(void* buf, size_t size) const\n    {\n        return dr->read(dr, buf, size);\n    }\n\npublic:\n    ncnn_datareader_t dr;\n};\n\n#if NCNN_STRING\nstatic int __ncnn_DataReaderFromMemory_scan(ncnn_datareader_t dr, const char* format, void* p)\n{\n    return ((ncnn::DataReaderFromMemory*)dr->pthis)->ncnn::DataReaderFromMemory::scan(format, p);\n}\n#endif /* NCNN_STRING */\n\nstatic 
size_t __ncnn_DataReaderFromMemory_read(ncnn_datareader_t dr, void* buf, size_t size)\n{\n    return ((ncnn::DataReaderFromMemory*)dr->pthis)->ncnn::DataReaderFromMemory::read(buf, size);\n}\n\nncnn_datareader_t ncnn_datareader_create()\n{\n    ncnn_datareader_t dr = (ncnn_datareader_t)malloc(sizeof(struct __ncnn_datareader_t));\n    dr->pthis = (void*)(new DataReader_c_api(dr));\n#if NCNN_STRING\n    dr->scan = __ncnn_DataReader_scan;\n#endif /* NCNN_STRING */\n    dr->read = __ncnn_DataReader_read;\n    return dr;\n}\n\n#if NCNN_STDIO\nncnn_datareader_t ncnn_datareader_create_from_stdio(FILE* fp)\n{\n    ncnn_datareader_t dr = (ncnn_datareader_t)malloc(sizeof(struct __ncnn_datareader_t));\n    dr->pthis = (void*)(new DataReaderFromStdio_c_api(fp, dr));\n#if NCNN_STRING\n    dr->scan = __ncnn_DataReaderFromStdio_scan;\n#endif /* NCNN_STRING */\n    dr->read = __ncnn_DataReaderFromStdio_read;\n    return dr;\n}\n#endif /* NCNN_STDIO */\n\nncnn_datareader_t ncnn_datareader_create_from_memory(const unsigned char** mem)\n{\n    ncnn_datareader_t dr = (ncnn_datareader_t)malloc(sizeof(struct __ncnn_datareader_t));\n    dr->pthis = (void*)(new DataReaderFromMemory_c_api(*mem, dr));\n#if NCNN_STRING\n    dr->scan = __ncnn_DataReaderFromMemory_scan;\n#endif /* NCNN_STRING */\n    dr->read = __ncnn_DataReaderFromMemory_read;\n    return dr;\n}\n\nvoid ncnn_datareader_destroy(ncnn_datareader_t dr)\n{\n    delete (DataReader*)dr->pthis;\n    free(dr);\n}\n\n/* modelbin api */\nclass ModelBinFromDataReader_c_api : public ncnn::ModelBinFromDataReader\n{\npublic:\n    ModelBinFromDataReader_c_api(ncnn_modelbin_t _mb, const DataReader& dr)\n        : ncnn::ModelBinFromDataReader(dr)\n    {\n        mb = _mb;\n    }\n\n    virtual Mat load(int w, int type) const\n    {\n        ncnn_mat_t m = mb->load_1d(mb, w, type);\n        Mat m2 = *(Mat*)m;\n        ncnn_mat_destroy(m);\n        return m2;\n    }\n\n    virtual Mat load(int w, int h, int type) const\n    {\n        ncnn_mat_t 
m = mb->load_2d(mb, w, h, type);\n        Mat m2 = *(Mat*)m;\n        ncnn_mat_destroy(m);\n        return m2;\n    }\n\n    virtual Mat load(int w, int h, int c, int type) const\n    {\n        ncnn_mat_t m = mb->load_3d(mb, w, h, c, type);\n        Mat m2 = *(Mat*)m;\n        ncnn_mat_destroy(m);\n        return m2;\n    }\n\npublic:\n    ncnn_modelbin_t mb;\n};\n\nstatic ncnn_mat_t __ncnn_ModelBinFromDataReader_load_1d(const ncnn_modelbin_t mb, int w, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBinFromDataReader*)mb->pthis)->ncnn::ModelBinFromDataReader::load(w, type)));\n}\n\nstatic ncnn_mat_t __ncnn_ModelBinFromDataReader_load_2d(const ncnn_modelbin_t mb, int w, int h, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBinFromDataReader*)mb->pthis)->ncnn::ModelBin::load(w, h, type)));\n}\n\nstatic ncnn_mat_t __ncnn_ModelBinFromDataReader_load_3d(const ncnn_modelbin_t mb, int w, int h, int c, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBinFromDataReader*)mb->pthis)->ncnn::ModelBin::load(w, h, c, type)));\n}\n\nclass ModelBinFromMatArray_c_api : public ncnn::ModelBinFromMatArray\n{\npublic:\n    ModelBinFromMatArray_c_api(ncnn_modelbin_t _mb, const Mat* weights)\n        : ncnn::ModelBinFromMatArray(weights)\n    {\n        mb = _mb;\n    }\n\n    virtual Mat load(int w, int type) const\n    {\n        ncnn_mat_t m = mb->load_1d(mb, w, type);\n        Mat m2 = *(Mat*)m;\n        ncnn_mat_destroy(m);\n        return m2;\n    }\n\n    virtual Mat load(int w, int h, int type) const\n    {\n        ncnn_mat_t m = mb->load_2d(mb, w, h, type);\n        Mat m2 = *(Mat*)m;\n        ncnn_mat_destroy(m);\n        return m2;\n    }\n\n    virtual Mat load(int w, int h, int c, int type) const\n    {\n        ncnn_mat_t m = mb->load_3d(mb, w, h, c, type);\n        Mat m2 = *(Mat*)m;\n        ncnn_mat_destroy(m);\n        return m2;\n    }\n\npublic:\n    ncnn_modelbin_t mb;\n};\n\nstatic ncnn_mat_t 
__ncnn_ModelBinFromMatArray_load_1d(const ncnn_modelbin_t mb, int w, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBinFromMatArray*)mb->pthis)->ncnn::ModelBinFromMatArray::load(w, type)));\n}\n\nstatic ncnn_mat_t __ncnn_ModelBinFromMatArray_load_2d(const ncnn_modelbin_t mb, int w, int h, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBinFromMatArray*)mb->pthis)->ncnn::ModelBin::load(w, h, type)));\n}\n\nstatic ncnn_mat_t __ncnn_ModelBinFromMatArray_load_3d(const ncnn_modelbin_t mb, int w, int h, int c, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBinFromMatArray*)mb->pthis)->ncnn::ModelBin::load(w, h, c, type)));\n}\n\nncnn_modelbin_t ncnn_modelbin_create_from_datareader(const ncnn_datareader_t dr)\n{\n    ncnn_modelbin_t mb = (ncnn_modelbin_t)malloc(sizeof(struct __ncnn_modelbin_t));\n    mb->pthis = (void*)(new ModelBinFromDataReader_c_api(mb, *(const DataReader*)dr->pthis));\n    mb->load_1d = __ncnn_ModelBinFromDataReader_load_1d;\n    mb->load_2d = __ncnn_ModelBinFromDataReader_load_2d;\n    mb->load_3d = __ncnn_ModelBinFromDataReader_load_3d;\n    return mb;\n}\n\nncnn_modelbin_t ncnn_modelbin_create_from_mat_array(const ncnn_mat_t* weights, int n)\n{\n    std::vector<Mat> matarray(n);\n    for (int i = 0; i < n; i++)\n    {\n        matarray[i] = *(const Mat*)weights[i];\n    }\n    ncnn_modelbin_t mb = (ncnn_modelbin_t)malloc(sizeof(struct __ncnn_modelbin_t));\n    mb->pthis = (void*)(new ModelBinFromMatArray_c_api(mb, n ? 
&matarray[0] : NULL));\n    mb->load_1d = __ncnn_ModelBinFromMatArray_load_1d;\n    mb->load_2d = __ncnn_ModelBinFromMatArray_load_2d;\n    mb->load_3d = __ncnn_ModelBinFromMatArray_load_3d;\n    return mb;\n}\n\nvoid ncnn_modelbin_destroy(ncnn_modelbin_t mb)\n{\n    delete (ModelBin*)mb->pthis;\n    free(mb);\n}\n\nstatic ncnn_mat_t __ncnn_modelbin_load_1d(const ncnn_modelbin_t mb, int w, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBin*)mb->pthis)->load(w, type)));\n}\n\nstatic ncnn_mat_t __ncnn_modelbin_load_2d(const ncnn_modelbin_t mb, int w, int h, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBin*)mb->pthis)->load(w, h, type)));\n}\n\nstatic ncnn_mat_t __ncnn_modelbin_load_3d(const ncnn_modelbin_t mb, int w, int h, int c, int type)\n{\n    return (ncnn_mat_t)(new Mat(((const ncnn::ModelBin*)mb->pthis)->load(w, h, c, type)));\n}\n\n/* layer api */\nclass Layer_c_api : public Layer\n{\npublic:\n    Layer_c_api(ncnn_layer_t _layer)\n        : Layer()\n    {\n        layer = _layer;\n    }\n\n    virtual int load_param(const ParamDict& pd)\n    {\n        return layer->load_param(layer, (ncnn_paramdict_t)&pd);\n    }\n\n    virtual int load_model(const ModelBin& mb)\n    {\n        struct __ncnn_modelbin_t mb0;\n        mb0.pthis = (void*)&mb;\n        mb0.load_1d = __ncnn_modelbin_load_1d;\n        mb0.load_2d = __ncnn_modelbin_load_2d;\n        mb0.load_3d = __ncnn_modelbin_load_3d;\n        return layer->load_model(layer, &mb0);\n    }\n\n    virtual int create_pipeline(const Option& opt)\n    {\n        return layer->create_pipeline(layer, (ncnn_option_t)&opt);\n    }\n\n    virtual int destroy_pipeline(const Option& opt)\n    {\n        return layer->destroy_pipeline(layer, (ncnn_option_t)&opt);\n    }\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n    {\n        const int n = bottom_blobs.size();\n        const int n2 = top_blobs.size();\n        
std::vector<ncnn_mat_t> bottom_blobs0(n);\n        for (int i = 0; i < n; i++)\n        {\n            bottom_blobs0[i] = (ncnn_mat_t)&bottom_blobs[i];\n        }\n        std::vector<ncnn_mat_t> top_blobs0(n2, (ncnn_mat_t)0);\n        int ret = layer->forward_n(layer, &bottom_blobs0[0], n, &top_blobs0[0], n2, (ncnn_option_t)&opt);\n        for (int i = 0; i < n2; i++)\n        {\n            top_blobs[i] = *(Mat*)top_blobs0[i];\n            ncnn_mat_destroy(top_blobs0[i]);\n        }\n        return ret;\n    }\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n    {\n        ncnn_mat_t top_blob0 = 0;\n        int ret = layer->forward_1(layer, (ncnn_mat_t)&bottom_blob, &top_blob0, (ncnn_option_t)&opt);\n        top_blob = *(Mat*)top_blob0;\n        ncnn_mat_destroy(top_blob0);\n        return ret;\n    }\n\n    virtual int forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) const\n    {\n        const int n = bottom_top_blobs.size();\n        std::vector<ncnn_mat_t> bottom_top_blobs0(n);\n        for (int i = 0; i < n; i++)\n        {\n            bottom_top_blobs0[i] = (ncnn_mat_t)&bottom_top_blobs[i];\n        }\n        return layer->forward_inplace_n(layer, &bottom_top_blobs0[0], n, (ncnn_option_t)&opt);\n    }\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n    {\n        return layer->forward_inplace_1(layer, (ncnn_mat_t)&bottom_top_blob, (ncnn_option_t)&opt);\n    }\n\npublic:\n    ncnn_layer_t layer;\n};\n\nstatic int __ncnn_Layer_load_param(ncnn_layer_t layer, const ncnn_paramdict_t pd)\n{\n    return ((Layer*)layer->pthis)->Layer::load_param(*(const ParamDict*)pd);\n}\n\nstatic int __ncnn_Layer_load_model(ncnn_layer_t layer, const ncnn_modelbin_t mb)\n{\n    return ((Layer*)layer->pthis)->Layer::load_model(*(const ModelBin*)mb);\n}\n\nstatic int __ncnn_Layer_create_pipeline(ncnn_layer_t layer, const ncnn_option_t opt)\n{\n    return 
((Layer*)layer->pthis)->Layer::create_pipeline(*(const Option*)opt);\n}\n\nstatic int __ncnn_Layer_destroy_pipeline(ncnn_layer_t layer, const ncnn_option_t opt)\n{\n    return ((Layer*)layer->pthis)->Layer::destroy_pipeline(*(const Option*)opt);\n}\n\nstatic int __ncnn_Layer_forward_1(const ncnn_layer_t layer, const ncnn_mat_t bottom_blob, ncnn_mat_t* top_blob, const ncnn_option_t opt)\n{\n    Mat _top_blob;\n    int ret = ((const Layer*)layer->pthis)->Layer::forward(*(const Mat*)bottom_blob, _top_blob, *(const Option*)opt);\n    *top_blob = (ncnn_mat_t)(new Mat(_top_blob));\n    return ret;\n}\n\nstatic int __ncnn_Layer_forward_n(const ncnn_layer_t layer, const ncnn_mat_t* bottom_blobs, int n, ncnn_mat_t* top_blobs, int n2, const ncnn_option_t opt)\n{\n    std::vector<Mat> _bottom_blobs(n);\n    std::vector<Mat> _top_blobs(n2);\n    for (int i = 0; i < n; i++)\n    {\n        _bottom_blobs[i] = *(Mat*)bottom_blobs[i];\n    }\n    int ret = ((const Layer*)layer->pthis)->Layer::forward(_bottom_blobs, _top_blobs, *(const Option*)opt);\n    for (int i = 0; i < n2; i++)\n    {\n        top_blobs[i] = (ncnn_mat_t)(new Mat(_top_blobs[i]));\n    }\n    return ret;\n}\n\nstatic int __ncnn_Layer_forward_inplace_1(const ncnn_layer_t layer, ncnn_mat_t bottom_top_blob, const ncnn_option_t opt)\n{\n    return ((const Layer*)layer->pthis)->Layer::forward_inplace(*(Mat*)bottom_top_blob, *(const Option*)opt);\n}\n\nstatic int __ncnn_Layer_forward_inplace_n(const ncnn_layer_t layer, ncnn_mat_t* bottom_top_blobs, int n, const ncnn_option_t opt)\n{\n    std::vector<Mat> _bottom_top_blobs(n);\n    for (int i = 0; i < n; i++)\n    {\n        _bottom_top_blobs[i] = *(Mat*)bottom_top_blobs[i];\n    }\n    return ((const Layer*)layer->pthis)->Layer::forward_inplace(_bottom_top_blobs, *(const Option*)opt);\n}\n\nstatic int __ncnn_layer_load_param(ncnn_layer_t layer, const ncnn_paramdict_t pd)\n{\n    return ((Layer*)layer->pthis)->load_param(*(const ParamDict*)pd);\n}\n\nstatic int 
__ncnn_layer_load_model(ncnn_layer_t layer, const ncnn_modelbin_t mb)\n{\n    return ((Layer*)layer->pthis)->load_model(*(const ModelBin*)mb->pthis);\n}\n\nstatic int __ncnn_layer_create_pipeline(ncnn_layer_t layer, const ncnn_option_t opt)\n{\n    return ((Layer*)layer->pthis)->create_pipeline(*(const Option*)opt);\n}\n\nstatic int __ncnn_layer_destroy_pipeline(ncnn_layer_t layer, const ncnn_option_t opt)\n{\n    return ((Layer*)layer->pthis)->destroy_pipeline(*(const Option*)opt);\n}\n\nstatic int __ncnn_layer_forward_1(const ncnn_layer_t layer, const ncnn_mat_t bottom_blob, ncnn_mat_t* top_blob, const ncnn_option_t opt)\n{\n    Mat _top_blob;\n    int ret = ((const Layer*)layer->pthis)->forward(*(const Mat*)bottom_blob, _top_blob, *(const Option*)opt);\n    *top_blob = (ncnn_mat_t)(new Mat(_top_blob));\n    return ret;\n}\n\nstatic int __ncnn_layer_forward_n(const ncnn_layer_t layer, const ncnn_mat_t* bottom_blobs, int n, ncnn_mat_t* top_blobs, int n2, const ncnn_option_t opt)\n{\n    std::vector<Mat> _bottom_blobs(n);\n    std::vector<Mat> _top_blobs(n2);\n    for (int i = 0; i < n; i++)\n    {\n        _bottom_blobs[i] = *(Mat*)bottom_blobs[i];\n    }\n    int ret = ((const Layer*)layer->pthis)->forward(_bottom_blobs, _top_blobs, *(const Option*)opt);\n    for (int i = 0; i < n2; i++)\n    {\n        top_blobs[i] = (ncnn_mat_t)(new Mat(_top_blobs[i]));\n    }\n    return ret;\n}\n\nstatic int __ncnn_layer_forward_inplace_1(const ncnn_layer_t layer, ncnn_mat_t bottom_top_blob, const ncnn_option_t opt)\n{\n    return ((const Layer*)layer->pthis)->forward_inplace(*(Mat*)bottom_top_blob, *(const Option*)opt);\n}\n\nstatic int __ncnn_layer_forward_inplace_n(const ncnn_layer_t layer, ncnn_mat_t* bottom_top_blobs, int n, const ncnn_option_t opt)\n{\n    std::vector<Mat> _bottom_top_blobs(n);\n    for (int i = 0; i < n; i++)\n    {\n        _bottom_top_blobs[i] = *(Mat*)bottom_top_blobs[i];\n    }\n    return ((const Layer*)layer->pthis)->forward_inplace(_bottom_top_blobs, 
*(const Option*)opt);\n}\n\nncnn_layer_t ncnn_layer_create()\n{\n    ncnn_layer_t layer = (ncnn_layer_t)malloc(sizeof(__ncnn_layer_t));\n    layer->pthis = (void*)(new Layer_c_api(layer));\n    layer->load_param = __ncnn_Layer_load_param;\n    layer->load_model = __ncnn_Layer_load_model;\n    layer->create_pipeline = __ncnn_Layer_create_pipeline;\n    layer->destroy_pipeline = __ncnn_Layer_destroy_pipeline;\n    layer->forward_1 = __ncnn_Layer_forward_1;\n    layer->forward_n = __ncnn_Layer_forward_n;\n    layer->forward_inplace_1 = __ncnn_Layer_forward_inplace_1;\n    layer->forward_inplace_n = __ncnn_Layer_forward_inplace_n;\n    return layer;\n}\n\nncnn_layer_t ncnn_layer_create_by_typeindex(int typeindex)\n{\n    void* pthis = (void*)(ncnn::create_layer(typeindex));\n    if (!pthis)\n    {\n        return 0;\n    }\n\n    ncnn_layer_t layer = (ncnn_layer_t)malloc(sizeof(__ncnn_layer_t));\n    layer->pthis = pthis;\n    layer->load_param = __ncnn_layer_load_param;\n    layer->load_model = __ncnn_layer_load_model;\n    layer->create_pipeline = __ncnn_layer_create_pipeline;\n    layer->destroy_pipeline = __ncnn_layer_destroy_pipeline;\n    layer->forward_1 = __ncnn_layer_forward_1;\n    layer->forward_n = __ncnn_layer_forward_n;\n    layer->forward_inplace_1 = __ncnn_layer_forward_inplace_1;\n    layer->forward_inplace_n = __ncnn_layer_forward_inplace_n;\n    return layer;\n}\n\n#if NCNN_STRING\nncnn_layer_t ncnn_layer_create_by_type(const char* type)\n{\n    void* pthis = (void*)(ncnn::create_layer(type));\n    if (!pthis)\n    {\n        return 0;\n    }\n\n    ncnn_layer_t layer = (ncnn_layer_t)malloc(sizeof(__ncnn_layer_t));\n    layer->pthis = pthis;\n    layer->load_param = __ncnn_layer_load_param;\n    layer->load_model = __ncnn_layer_load_model;\n    layer->create_pipeline = __ncnn_layer_create_pipeline;\n    layer->destroy_pipeline = __ncnn_layer_destroy_pipeline;\n    layer->forward_1 = __ncnn_layer_forward_1;\n    layer->forward_n = 
__ncnn_layer_forward_n;\n    layer->forward_inplace_1 = __ncnn_layer_forward_inplace_1;\n    layer->forward_inplace_n = __ncnn_layer_forward_inplace_n;\n    return layer;\n}\n\nint ncnn_layer_type_to_index(const char* type)\n{\n    return ncnn::layer_to_index(type);\n}\n#endif /* NCNN_STRING */\n\nvoid ncnn_layer_destroy(ncnn_layer_t layer)\n{\n    delete (Layer*)layer->pthis;\n    free(layer);\n}\n\n#if NCNN_STRING\nconst char* ncnn_layer_get_name(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->name.c_str();\n}\n#endif /* NCNN_STRING */\n\nint ncnn_layer_get_typeindex(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->typeindex;\n}\n\n#if NCNN_STRING\nconst char* ncnn_layer_get_type(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->type.c_str();\n}\n#endif /* NCNN_STRING */\n\nint ncnn_layer_get_one_blob_only(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->one_blob_only;\n}\n\nint ncnn_layer_get_support_inplace(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->support_inplace;\n}\n\nint ncnn_layer_get_support_vulkan(const ncnn_layer_t layer)\n{\n#if NCNN_VULKAN\n    return ((const Layer*)layer->pthis)->support_vulkan;\n#else\n    (void)layer;\n    return 0;\n#endif\n}\n\nint ncnn_layer_get_support_packing(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->support_packing;\n}\n\nint ncnn_layer_get_support_bf16_storage(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->support_bf16_storage;\n}\n\nint ncnn_layer_get_support_fp16_storage(const ncnn_layer_t layer)\n{\n    return ((const Layer*)layer->pthis)->support_fp16_storage;\n}\n\nint ncnn_layer_get_support_vulkan_packing(const ncnn_layer_t layer)\n{\n#if NCNN_VULKAN\n    return ((const Layer*)layer->pthis)->support_vulkan_packing;\n#else\n    (void)layer;\n    return 0;\n#endif\n}\n\nint ncnn_layer_get_support_any_packing(const ncnn_layer_t layer)\n{\n    return 
((const Layer*)layer->pthis)->support_any_packing;\n}\n\nint ncnn_layer_get_support_vulkan_any_packing(const ncnn_layer_t layer)\n{\n#if NCNN_VULKAN\n    return ((const Layer*)layer->pthis)->support_vulkan_any_packing;\n#else\n    (void)layer;\n    return 0;\n#endif\n}\n\nvoid ncnn_layer_set_one_blob_only(ncnn_layer_t layer, int enable)\n{\n    ((Layer*)layer->pthis)->one_blob_only = enable;\n}\n\nvoid ncnn_layer_set_support_inplace(ncnn_layer_t layer, int enable)\n{\n    ((Layer*)layer->pthis)->support_inplace = enable;\n}\n\nvoid ncnn_layer_set_support_vulkan(ncnn_layer_t layer, int enable)\n{\n#if NCNN_VULKAN\n    ((Layer*)layer->pthis)->support_vulkan = enable;\n#else\n    (void)layer;\n    (void)enable;\n#endif\n}\n\nvoid ncnn_layer_set_support_packing(ncnn_layer_t layer, int enable)\n{\n    ((Layer*)layer->pthis)->support_packing = enable;\n}\n\nvoid ncnn_layer_set_support_bf16_storage(ncnn_layer_t layer, int enable)\n{\n    ((Layer*)layer->pthis)->support_bf16_storage = enable;\n}\n\nvoid ncnn_layer_set_support_fp16_storage(ncnn_layer_t layer, int enable)\n{\n    ((Layer*)layer->pthis)->support_fp16_storage = enable;\n}\n\nvoid ncnn_layer_set_support_vulkan_packing(ncnn_layer_t layer, int enable)\n{\n#if NCNN_VULKAN\n    ((Layer*)layer->pthis)->support_vulkan_packing = enable;\n#else\n    (void)layer;\n    (void)enable;\n#endif\n}\n\nvoid ncnn_layer_set_support_any_packing(ncnn_layer_t layer, int enable)\n{\n    ((Layer*)layer->pthis)->support_any_packing = enable;\n}\n\nvoid ncnn_layer_set_support_vulkan_any_packing(ncnn_layer_t layer, int enable)\n{\n#if NCNN_VULKAN\n    ((Layer*)layer->pthis)->support_vulkan_any_packing = enable;\n#else\n    (void)layer;\n    (void)enable;\n#endif\n}\n\nint ncnn_layer_get_bottom_count(const ncnn_layer_t layer)\n{\n    return (int)((const Layer*)layer->pthis)->bottoms.size();\n}\n\nint ncnn_layer_get_bottom(const ncnn_layer_t layer, int i)\n{\n    return ((const Layer*)layer->pthis)->bottoms[i];\n}\n\nint 
ncnn_layer_get_top_count(const ncnn_layer_t layer)\n{\n    return (int)((const Layer*)layer->pthis)->tops.size();\n}\n\nint ncnn_layer_get_top(const ncnn_layer_t layer, int i)\n{\n    return ((const Layer*)layer->pthis)->tops[i];\n}\n\nvoid ncnn_blob_get_bottom_shape(const ncnn_layer_t layer, int i, int* dims, int* w, int* h, int* c)\n{\n    const Mat& shape = ((const Layer*)layer->pthis)->bottom_shapes[i];\n    *dims = shape.dims;\n    *w = shape.w;\n    *h = shape.h;\n    *c = shape.c;\n}\n\nvoid ncnn_blob_get_top_shape(const ncnn_layer_t layer, int i, int* dims, int* w, int* h, int* c)\n{\n    const Mat& shape = ((const Layer*)layer->pthis)->top_shapes[i];\n    *dims = shape.dims;\n    *w = shape.w;\n    *h = shape.h;\n    *c = shape.c;\n}\n\n/* net api */\nncnn_net_t ncnn_net_create()\n{\n    ncnn_net_t net = (ncnn_net_t)malloc(sizeof(struct __ncnn_net_t));\n    net->pthis = (void*)(new Net());\n    net->custom_layer_factory = 0;\n    return net;\n}\n\nvoid ncnn_net_destroy(ncnn_net_t net)\n{\n    delete (Net*)net->pthis;\n    ncnn_net_custom_layer_factory_t ud = net->custom_layer_factory;\n    while (ud)\n    {\n        ncnn_net_custom_layer_factory_t ud_next = ud->next;\n        free(ud);\n        ud = ud_next;\n    }\n    free(net);\n}\n\nncnn_option_t ncnn_net_get_option(ncnn_net_t net)\n{\n    return (ncnn_option_t)(&((Net*)(net->pthis))->opt);\n}\n\nvoid ncnn_net_set_option(ncnn_net_t net, ncnn_option_t opt)\n{\n    ((Net*)net->pthis)->opt = *((Option*)opt);\n}\n\n#if NCNN_VULKAN\nvoid ncnn_net_set_vulkan_device(ncnn_net_t net, int device_index)\n{\n    ((Net*)net->pthis)->set_vulkan_device(device_index);\n}\n#endif\n\nstatic ::ncnn::Layer* __Layer_c_api_layer_creator(void* userdata)\n{\n    ncnn_net_custom_layer_factory_t ud = (ncnn_net_custom_layer_factory_t)userdata;\n\n    ncnn_layer_t layer0 = ud->creator(ud->userdata);\n\n    ::ncnn::Layer* layer = (::ncnn::Layer*)layer0->pthis;\n\n    layer->userdata = (void*)layer0;\n\n    layer->one_blob_only = 
ncnn_layer_get_one_blob_only(layer0);\n    layer->support_inplace = ncnn_layer_get_support_inplace(layer0);\n    layer->support_vulkan = ncnn_layer_get_support_vulkan(layer0);\n    layer->support_packing = ncnn_layer_get_support_packing(layer0);\n\n    layer->support_bf16_storage = ncnn_layer_get_support_bf16_storage(layer0);\n    layer->support_fp16_storage = ncnn_layer_get_support_fp16_storage(layer0);\n\n    return layer;\n}\n\nstatic void __Layer_c_api_layer_destroyer(::ncnn::Layer* layer, void* userdata)\n{\n    ncnn_net_custom_layer_factory_t ud = (ncnn_net_custom_layer_factory_t)userdata;\n\n    ncnn_layer_t layer0 = (ncnn_layer_t)layer->userdata;\n\n    ud->destroyer(layer0, ud->userdata);\n}\n\n#if NCNN_STRING\nvoid ncnn_net_register_custom_layer_by_type(ncnn_net_t net, const char* type, ncnn_layer_creator_t creator, ncnn_layer_destroyer_t destroyer, void* userdata)\n{\n    ncnn_net_custom_layer_factory_t ud = (ncnn_net_custom_layer_factory_t)malloc(sizeof(struct __ncnn_net_custom_layer_factory_t));\n    ud->creator = creator;\n    ud->destroyer = destroyer;\n    ud->userdata = userdata;\n    ud->next = net->custom_layer_factory;\n    net->custom_layer_factory = ud;\n    ((Net*)net->pthis)->register_custom_layer(type, __Layer_c_api_layer_creator, __Layer_c_api_layer_destroyer, (void*)ud);\n}\n#endif /* NCNN_STRING */\n\nvoid ncnn_net_register_custom_layer_by_typeindex(ncnn_net_t net, int typeindex, ncnn_layer_creator_t creator, ncnn_layer_destroyer_t destroyer, void* userdata)\n{\n    ncnn_net_custom_layer_factory_t ud = (ncnn_net_custom_layer_factory_t)malloc(sizeof(struct __ncnn_net_custom_layer_factory_t));\n    ud->creator = creator;\n    ud->destroyer = destroyer;\n    ud->userdata = userdata;\n    ud->next = net->custom_layer_factory;\n    net->custom_layer_factory = ud;\n    ((Net*)net->pthis)->register_custom_layer(typeindex, __Layer_c_api_layer_creator, __Layer_c_api_layer_destroyer, (void*)ud);\n}\n\n#if NCNN_STDIO\n#if NCNN_STRING\nint 
ncnn_net_load_param(ncnn_net_t net, const char* path)\n{\n    return ((Net*)net->pthis)->load_param(path);\n}\n#endif /* NCNN_STRING */\n\nint ncnn_net_load_param_bin(ncnn_net_t net, const char* path)\n{\n    return ((Net*)net->pthis)->load_param_bin(path);\n}\n\nint ncnn_net_load_model(ncnn_net_t net, const char* path)\n{\n    return ((Net*)net->pthis)->load_model(path);\n}\n\n#if _WIN32\n#if NCNN_STRING\nint ncnn_net_load_param_w(ncnn_net_t net, const wchar_t* path)\n{\n    return ((Net*)net->pthis)->load_param(path);\n}\n#endif /* NCNN_STRING */\n\nint ncnn_net_load_param_bin_w(ncnn_net_t net, const wchar_t* path)\n{\n    return ((Net*)net->pthis)->load_param_bin(path);\n}\n\nint ncnn_net_load_model_w(ncnn_net_t net, const wchar_t* path)\n{\n    return ((Net*)net->pthis)->load_model(path);\n}\n#endif /* _WIN32 */\n#endif /* NCNN_STDIO */\n\n#if NCNN_STDIO\n#if NCNN_STRING\nint ncnn_net_load_param_memory(ncnn_net_t net, const char* mem)\n{\n    return ((Net*)net->pthis)->load_param_mem(mem);\n}\n#endif /* NCNN_STRING */\n#endif /* NCNN_STDIO */\n\nsize_t ncnn_net_load_param_bin_memory(ncnn_net_t net, const unsigned char* mem)\n{\n    return ((Net*)net->pthis)->load_param(mem);\n}\n\nsize_t ncnn_net_load_model_memory(ncnn_net_t net, const unsigned char* mem)\n{\n    return ((Net*)net->pthis)->load_model(mem);\n}\n\n#if NCNN_STRING\nint ncnn_net_load_param_datareader(ncnn_net_t net, const ncnn_datareader_t dr)\n{\n    return ((Net*)net->pthis)->load_param(*(const DataReader*)dr->pthis);\n}\n#endif /* NCNN_STRING */\n\nint ncnn_net_load_param_bin_datareader(ncnn_net_t net, const ncnn_datareader_t dr)\n{\n    return ((Net*)net->pthis)->load_param_bin(*(const DataReader*)dr->pthis);\n}\n\nint ncnn_net_load_model_datareader(ncnn_net_t net, const ncnn_datareader_t dr)\n{\n    return ((Net*)net->pthis)->load_model(*(const DataReader*)dr->pthis);\n}\n\nvoid ncnn_net_clear(ncnn_net_t net)\n{\n    return ((Net*)net->pthis)->clear();\n}\n\nint ncnn_net_get_input_count(const 
ncnn_net_t net)\n{\n    return (int)((Net*)net->pthis)->input_indexes().size();\n}\n\nint ncnn_net_get_output_count(const ncnn_net_t net)\n{\n    return (int)((Net*)net->pthis)->output_indexes().size();\n}\n\n#if NCNN_STRING\nconst char* ncnn_net_get_input_name(const ncnn_net_t net, int i)\n{\n    return ((Net*)net->pthis)->input_names()[i];\n}\n\nconst char* ncnn_net_get_output_name(const ncnn_net_t net, int i)\n{\n    return ((Net*)net->pthis)->output_names()[i];\n}\n#endif /* NCNN_STRING */\n\nint ncnn_net_get_input_index(const ncnn_net_t net, int i)\n{\n    return ((Net*)net->pthis)->input_indexes()[i];\n}\n\nint ncnn_net_get_output_index(const ncnn_net_t net, int i)\n{\n    return ((Net*)net->pthis)->output_indexes()[i];\n}\n\n/* extractor api */\nncnn_extractor_t ncnn_extractor_create(ncnn_net_t net)\n{\n    return (ncnn_extractor_t)(new Extractor(((Net*)net->pthis)->create_extractor()));\n}\n\nvoid ncnn_extractor_destroy(ncnn_extractor_t ex)\n{\n    delete (Extractor*)ex;\n}\n\nvoid ncnn_extractor_set_option(ncnn_extractor_t ex, const ncnn_option_t opt)\n{\n    (void)ex;\n    (void)opt;\n}\n\n#if NCNN_STRING\nint ncnn_extractor_input(ncnn_extractor_t ex, const char* name, const ncnn_mat_t mat)\n{\n    return ((Extractor*)ex)->input(name, *((const Mat*)mat));\n}\n\nint ncnn_extractor_extract(ncnn_extractor_t ex, const char* name, ncnn_mat_t* mat)\n{\n    Mat mat0;\n    int ret = ((Extractor*)ex)->extract(name, mat0);\n    *mat = (ncnn_mat_t)(new Mat(mat0));\n    return ret;\n}\n#endif /* NCNN_STRING */\n\nint ncnn_extractor_input_index(ncnn_extractor_t ex, int index, const ncnn_mat_t mat)\n{\n    return ((Extractor*)ex)->input(index, *((const Mat*)mat));\n}\n\nint ncnn_extractor_extract_index(ncnn_extractor_t ex, int index, ncnn_mat_t* mat)\n{\n    Mat mat0;\n    int ret = ((Extractor*)ex)->extract(index, mat0);\n    *mat = (ncnn_mat_t)(new Mat(mat0));\n    return ret;\n}\n\nvoid ncnn_copy_make_border(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, 
int left, int right, int type, float v, const ncnn_option_t opt)\n{\n    const Option _opt = opt ? *((const Option*)opt) : Option();\n    copy_make_border(*(const Mat*)src, *(Mat*)dst, top, bottom, left, right, type, v, _opt);\n}\n\nvoid ncnn_copy_make_border_3d(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, int left, int right, int front, int behind, int type, float v, const ncnn_option_t opt)\n{\n    const Option _opt = opt ? *((const Option*)opt) : Option();\n    copy_make_border_3d(*(const Mat*)src, *(Mat*)dst, top, bottom, left, right, front, behind, type, v, _opt);\n}\n\nvoid ncnn_copy_cut_border(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, int left, int right, const ncnn_option_t opt)\n{\n    const Option _opt = opt ? *((const Option*)opt) : Option();\n    copy_cut_border(*(const Mat*)src, *(Mat*)dst, top, bottom, left, right, _opt);\n}\n\nvoid ncnn_copy_cut_border_3d(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, int left, int right, int front, int behind, const ncnn_option_t opt)\n{\n    const Option _opt = opt ? 
*((const Option*)opt) : Option();\n    copy_cut_border_3d(*(const Mat*)src, *(Mat*)dst, top, bottom, left, right, front, behind, _opt);\n}\n\n#if NCNN_PIXEL_DRAWING\nvoid ncnn_draw_rectangle_c1(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness)\n{\n    ncnn::draw_rectangle_c1(pixels, w, h, w, rx, ry, rw, rh, color, thickness);\n}\n\nvoid ncnn_draw_rectangle_c2(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness)\n{\n    ncnn::draw_rectangle_c2(pixels, w, h, w * 2, rx, ry, rw, rh, color, thickness);\n}\n\nvoid ncnn_draw_rectangle_c3(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness)\n{\n    ncnn::draw_rectangle_c3(pixels, w, h, w * 3, rx, ry, rw, rh, color, thickness);\n}\n\nvoid ncnn_draw_rectangle_c4(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness)\n{\n    ncnn::draw_rectangle_c4(pixels, w, h, w * 4, rx, ry, rw, rh, color, thickness);\n}\n\nvoid ncnn_draw_text_c1(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color)\n{\n    ncnn::draw_text_c1(pixels, w, h, w, text, x, y, fontpixelsize, color);\n}\n\nvoid ncnn_draw_text_c2(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color)\n{\n    ncnn::draw_text_c2(pixels, w, h, w * 2, text, x, y, fontpixelsize, color);\n}\n\nvoid ncnn_draw_text_c3(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color)\n{\n    ncnn::draw_text_c3(pixels, w, h, w * 3, text, x, y, fontpixelsize, color);\n}\n\nvoid ncnn_draw_text_c4(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color)\n{\n    ncnn::draw_text_c4(pixels, w, h, w * 4, text, x, y, fontpixelsize, color);\n}\n\nvoid ncnn_draw_circle_c1(unsigned char* pixels, int w, 
int h, int cx, int cy, int radius, unsigned int color, int thickness)\n{\n    ncnn::draw_circle_c1(pixels, w, h, w, cx, cy, radius, color, thickness);\n}\n\nvoid ncnn_draw_circle_c2(unsigned char* pixels, int w, int h, int cx, int cy, int radius, unsigned int color, int thickness)\n{\n    ncnn::draw_circle_c2(pixels, w, h, w * 2, cx, cy, radius, color, thickness);\n}\n\nvoid ncnn_draw_circle_c3(unsigned char* pixels, int w, int h, int cx, int cy, int radius, unsigned int color, int thickness)\n{\n    ncnn::draw_circle_c3(pixels, w, h, w * 3, cx, cy, radius, color, thickness);\n}\n\nvoid ncnn_draw_circle_c4(unsigned char* pixels, int w, int h, int cx, int cy, int radius, unsigned int color, int thickness)\n{\n    ncnn::draw_circle_c4(pixels, w, h, w * 4, cx, cy, radius, color, thickness);\n}\n\nvoid ncnn_draw_line_c1(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness)\n{\n    ncnn::draw_line_c1(pixels, w, h, w, x0, y0, x1, y1, color, thickness);\n}\n\nvoid ncnn_draw_line_c2(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness)\n{\n    ncnn::draw_line_c2(pixels, w, h, w * 2, x0, y0, x1, y1, color, thickness);\n}\n\nvoid ncnn_draw_line_c3(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness)\n{\n    ncnn::draw_line_c3(pixels, w, h, w * 3, x0, y0, x1, y1, color, thickness);\n}\n\nvoid ncnn_draw_line_c4(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness)\n{\n    ncnn::draw_line_c4(pixels, w, h, w * 4, x0, y0, x1, y1, color, thickness);\n}\n#endif /* NCNN_PIXEL_DRAWING */\n\n#ifdef __cplusplus\n} /* extern \"C\" */\n#endif\n\n#endif /* NCNN_C_API */\n"
  },
  {
    "path": "src/c_api.h",
    "content": "/* Copyright 2020 Tencent\n * SPDX-License-Identifier: BSD-3-Clause\n */\n\n#ifndef NCNN_C_API_H\n#define NCNN_C_API_H\n\n#include \"platform.h\"\n\n#if NCNN_C_API\n\n#include <stddef.h>\n\n#ifdef __cplusplus\nextern \"C\" {\n#endif\n\nNCNN_EXPORT const char* ncnn_version(void);\nNCNN_EXPORT int ncnn_version_number(void);\n\n/* allocator api */\ntypedef struct __ncnn_allocator_t* ncnn_allocator_t;\nstruct NCNN_EXPORT __ncnn_allocator_t\n{\n    void* pthis;\n\n    void* (*fast_malloc)(ncnn_allocator_t allocator, size_t size);\n    void (*fast_free)(ncnn_allocator_t allocator, void* ptr);\n};\n\nNCNN_EXPORT ncnn_allocator_t ncnn_allocator_create_pool_allocator(void);\nNCNN_EXPORT ncnn_allocator_t ncnn_allocator_create_unlocked_pool_allocator(void);\nNCNN_EXPORT void ncnn_allocator_destroy(ncnn_allocator_t allocator);\n\n/* option api */\ntypedef struct __ncnn_option_t* ncnn_option_t;\n\nNCNN_EXPORT ncnn_option_t ncnn_option_create(void);\nNCNN_EXPORT void ncnn_option_destroy(ncnn_option_t opt);\n\nNCNN_EXPORT int ncnn_option_get_num_threads(const ncnn_option_t opt);\nNCNN_EXPORT void ncnn_option_set_num_threads(ncnn_option_t opt, int num_threads);\n\nNCNN_EXPORT void ncnn_option_set_blob_allocator(ncnn_option_t opt, ncnn_allocator_t allocator);\nNCNN_EXPORT void ncnn_option_set_workspace_allocator(ncnn_option_t opt, ncnn_allocator_t allocator);\n\nNCNN_EXPORT int ncnn_option_get_use_vulkan_compute(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_local_pool_allocator(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_winograd_convolution(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_sgemm_convolution(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_packing_layout(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_fp16_packed(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_fp16_storage(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_fp16_arithmetic(const 
ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_int8_packed(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_int8_storage(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_int8_arithmetic(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_bf16_packed(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_bf16_storage(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_shader_local_memory(const ncnn_option_t opt);\nNCNN_EXPORT int ncnn_option_get_use_cooperative_matrix(const ncnn_option_t opt);\n\nNCNN_EXPORT void ncnn_option_set_use_vulkan_compute(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_local_pool_allocator(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_winograd_convolution(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_sgemm_convolution(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_packing_layout(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_fp16_packed(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_fp16_storage(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_fp16_arithmetic(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_int8_packed(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_int8_storage(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_int8_arithmetic(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_bf16_packed(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_bf16_storage(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_shader_local_memory(ncnn_option_t opt, int enable);\nNCNN_EXPORT void ncnn_option_set_use_cooperative_matrix(ncnn_option_t opt, int enable);\n\n/* mat api */\ntypedef struct __ncnn_mat_t* ncnn_mat_t;\n\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create(void);\nNCNN_EXPORT ncnn_mat_t 
ncnn_mat_create_1d(int w, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_2d(int w, int h, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_3d(int w, int h, int c, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_4d(int w, int h, int d, int c, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_1d(int w, void* data, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_2d(int w, int h, void* data, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_3d(int w, int h, int c, void* data, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_4d(int w, int h, int d, int c, void* data, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_1d_elem(int w, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_2d_elem(int w, int h, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_3d_elem(int w, int h, int c, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_4d_elem(int w, int h, int d, int c, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_1d_elem(int w, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_2d_elem(int w, int h, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_3d_elem(int w, int h, int c, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_create_external_4d_elem(int w, int h, int d, int c, void* data, size_t elemsize, int elempack, ncnn_allocator_t allocator);\nNCNN_EXPORT void ncnn_mat_destroy(ncnn_mat_t mat);\n\nNCNN_EXPORT void ncnn_mat_fill_float(ncnn_mat_t mat, float v);\n\nNCNN_EXPORT 
ncnn_mat_t ncnn_mat_clone(const ncnn_mat_t mat, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_reshape_1d(const ncnn_mat_t mat, int w, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_reshape_2d(const ncnn_mat_t mat, int w, int h, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_reshape_3d(const ncnn_mat_t mat, int w, int h, int c, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_reshape_4d(const ncnn_mat_t mat, int w, int h, int d, int c, ncnn_allocator_t allocator);\n\nNCNN_EXPORT int ncnn_mat_get_dims(const ncnn_mat_t mat);\nNCNN_EXPORT int ncnn_mat_get_w(const ncnn_mat_t mat);\nNCNN_EXPORT int ncnn_mat_get_h(const ncnn_mat_t mat);\nNCNN_EXPORT int ncnn_mat_get_d(const ncnn_mat_t mat);\nNCNN_EXPORT int ncnn_mat_get_c(const ncnn_mat_t mat);\nNCNN_EXPORT size_t ncnn_mat_get_elemsize(const ncnn_mat_t mat);\nNCNN_EXPORT int ncnn_mat_get_elempack(const ncnn_mat_t mat);\nNCNN_EXPORT size_t ncnn_mat_get_cstep(const ncnn_mat_t mat);\nNCNN_EXPORT void* ncnn_mat_get_data(const ncnn_mat_t mat);\n\nNCNN_EXPORT void* ncnn_mat_get_channel_data(const ncnn_mat_t mat, int c);\n\n#if NCNN_PIXEL\n\n/* mat pixel api */\n#define NCNN_MAT_PIXEL_RGB       1\n#define NCNN_MAT_PIXEL_BGR       2\n#define NCNN_MAT_PIXEL_GRAY      3\n#define NCNN_MAT_PIXEL_RGBA      4\n#define NCNN_MAT_PIXEL_BGRA      5\n#define NCNN_MAT_PIXEL_X2Y(X, Y) (X | (Y << 16))\nNCNN_EXPORT ncnn_mat_t ncnn_mat_from_pixels(const unsigned char* pixels, int type, int w, int h, int stride, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_from_pixels_resize(const unsigned char* pixels, int type, int w, int h, int stride, int target_width, int target_height, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_from_pixels_roi(const unsigned char* pixels, int type, int w, int h, int stride, int roix, int roiy, int roiw, int roih, ncnn_allocator_t allocator);\nNCNN_EXPORT ncnn_mat_t ncnn_mat_from_pixels_roi_resize(const unsigned char* pixels, int 
type, int w, int h, int stride, int roix, int roiy, int roiw, int roih, int target_width, int target_height, ncnn_allocator_t allocator);\nNCNN_EXPORT void ncnn_mat_to_pixels(const ncnn_mat_t mat, unsigned char* pixels, int type, int stride);\nNCNN_EXPORT void ncnn_mat_to_pixels_resize(const ncnn_mat_t mat, unsigned char* pixels, int type, int target_width, int target_height, int target_stride);\n\n#endif /* NCNN_PIXEL */\n\nNCNN_EXPORT void ncnn_mat_substract_mean_normalize(ncnn_mat_t mat, const float* mean_vals, const float* norm_vals);\n\nNCNN_EXPORT void ncnn_convert_packing(const ncnn_mat_t src, ncnn_mat_t* dst, int elempack, const ncnn_option_t opt);\nNCNN_EXPORT void ncnn_flatten(const ncnn_mat_t src, ncnn_mat_t* dst, const ncnn_option_t opt);\n\n/* blob api */\ntypedef struct __ncnn_blob_t* ncnn_blob_t;\n\n#if NCNN_STRING\nNCNN_EXPORT const char* ncnn_blob_get_name(const ncnn_blob_t blob);\n#endif /* NCNN_STRING */\n\nNCNN_EXPORT int ncnn_blob_get_producer(const ncnn_blob_t blob);\nNCNN_EXPORT int ncnn_blob_get_consumer(const ncnn_blob_t blob);\n\nNCNN_EXPORT void ncnn_blob_get_shape(const ncnn_blob_t blob, int* dims, int* w, int* h, int* c);\n\n/* paramdict api */\ntypedef struct __ncnn_paramdict_t* ncnn_paramdict_t;\n\nNCNN_EXPORT ncnn_paramdict_t ncnn_paramdict_create(void);\nNCNN_EXPORT void ncnn_paramdict_destroy(ncnn_paramdict_t pd);\n\nNCNN_EXPORT int ncnn_paramdict_get_type(const ncnn_paramdict_t pd, int id);\n\nNCNN_EXPORT int ncnn_paramdict_get_int(const ncnn_paramdict_t pd, int id, int def);\nNCNN_EXPORT float ncnn_paramdict_get_float(const ncnn_paramdict_t pd, int id, float def);\nNCNN_EXPORT ncnn_mat_t ncnn_paramdict_get_array(const ncnn_paramdict_t pd, int id, const ncnn_mat_t def);\n\nNCNN_EXPORT void ncnn_paramdict_set_int(ncnn_paramdict_t pd, int id, int i);\nNCNN_EXPORT void ncnn_paramdict_set_float(ncnn_paramdict_t pd, int id, float f);\nNCNN_EXPORT void ncnn_paramdict_set_array(ncnn_paramdict_t pd, int id, const ncnn_mat_t v);\n\n/* 
datareader api */\ntypedef struct __ncnn_datareader_t* ncnn_datareader_t;\nstruct NCNN_EXPORT __ncnn_datareader_t\n{\n    void* pthis;\n\n#if NCNN_STRING\n    int (*scan)(ncnn_datareader_t dr, const char* format, void* p);\n#endif /* NCNN_STRING */\n    size_t (*read)(ncnn_datareader_t dr, void* buf, size_t size);\n};\n\nNCNN_EXPORT ncnn_datareader_t ncnn_datareader_create(void);\n#if NCNN_STDIO\nNCNN_EXPORT ncnn_datareader_t ncnn_datareader_create_from_stdio(FILE* fp);\n#endif /* NCNN_STDIO */\nNCNN_EXPORT ncnn_datareader_t ncnn_datareader_create_from_memory(const unsigned char** mem);\nNCNN_EXPORT void ncnn_datareader_destroy(ncnn_datareader_t dr);\n\n/* modelbin api */\ntypedef struct __ncnn_modelbin_t* ncnn_modelbin_t;\nstruct NCNN_EXPORT __ncnn_modelbin_t\n{\n    void* pthis;\n\n    ncnn_mat_t (*load_1d)(const ncnn_modelbin_t mb, int w, int type);\n    ncnn_mat_t (*load_2d)(const ncnn_modelbin_t mb, int w, int h, int type);\n    ncnn_mat_t (*load_3d)(const ncnn_modelbin_t mb, int w, int h, int c, int type);\n};\n\nNCNN_EXPORT ncnn_modelbin_t ncnn_modelbin_create_from_datareader(const ncnn_datareader_t dr);\nNCNN_EXPORT ncnn_modelbin_t ncnn_modelbin_create_from_mat_array(const ncnn_mat_t* weights, int n);\nNCNN_EXPORT void ncnn_modelbin_destroy(ncnn_modelbin_t mb);\n\n/* layer api */\ntypedef struct __ncnn_layer_t* ncnn_layer_t;\nstruct NCNN_EXPORT __ncnn_layer_t\n{\n    void* pthis;\n\n    int (*load_param)(ncnn_layer_t layer, const ncnn_paramdict_t pd);\n    int (*load_model)(ncnn_layer_t layer, const ncnn_modelbin_t mb);\n\n    int (*create_pipeline)(ncnn_layer_t layer, const ncnn_option_t opt);\n    int (*destroy_pipeline)(ncnn_layer_t layer, const ncnn_option_t opt);\n\n    int (*forward_1)(const ncnn_layer_t layer, const ncnn_mat_t bottom_blob, ncnn_mat_t* top_blob, const ncnn_option_t opt);\n    int (*forward_n)(const ncnn_layer_t layer, const ncnn_mat_t* bottom_blobs, int n, ncnn_mat_t* top_blobs, int n2, const ncnn_option_t opt);\n\n    int 
(*forward_inplace_1)(const ncnn_layer_t layer, ncnn_mat_t bottom_top_blob, const ncnn_option_t opt);\n    int (*forward_inplace_n)(const ncnn_layer_t layer, ncnn_mat_t* bottom_top_blobs, int n, const ncnn_option_t opt);\n};\n\nNCNN_EXPORT ncnn_layer_t ncnn_layer_create(void);\nNCNN_EXPORT ncnn_layer_t ncnn_layer_create_by_typeindex(int typeindex);\n#if NCNN_STRING\nNCNN_EXPORT ncnn_layer_t ncnn_layer_create_by_type(const char* type);\nNCNN_EXPORT int ncnn_layer_type_to_index(const char* type);\n#endif /* NCNN_STRING */\nNCNN_EXPORT void ncnn_layer_destroy(ncnn_layer_t layer);\n\n#if NCNN_STRING\nNCNN_EXPORT const char* ncnn_layer_get_name(const ncnn_layer_t layer);\n#endif /* NCNN_STRING */\n\nNCNN_EXPORT int ncnn_layer_get_typeindex(const ncnn_layer_t layer);\n#if NCNN_STRING\nNCNN_EXPORT const char* ncnn_layer_get_type(const ncnn_layer_t layer);\n#endif /* NCNN_STRING */\n\nNCNN_EXPORT int ncnn_layer_get_one_blob_only(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_inplace(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_vulkan(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_packing(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_bf16_storage(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_fp16_storage(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_vulkan_packing(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_any_packing(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_support_vulkan_any_packing(const ncnn_layer_t layer);\n\nNCNN_EXPORT void ncnn_layer_set_one_blob_only(ncnn_layer_t layer, int enable);\nNCNN_EXPORT void ncnn_layer_set_support_inplace(ncnn_layer_t layer, int enable);\nNCNN_EXPORT void ncnn_layer_set_support_vulkan(ncnn_layer_t layer, int enable);\nNCNN_EXPORT void ncnn_layer_set_support_packing(ncnn_layer_t layer, int enable);\nNCNN_EXPORT void ncnn_layer_set_support_bf16_storage(ncnn_layer_t layer, int 
enable);\nNCNN_EXPORT void ncnn_layer_set_support_fp16_storage(ncnn_layer_t layer, int enable);\nNCNN_EXPORT void ncnn_layer_set_support_vulkan_packing(ncnn_layer_t layer, int enable);\nNCNN_EXPORT void ncnn_layer_set_support_any_packing(ncnn_layer_t layer, int enable);\nNCNN_EXPORT void ncnn_layer_set_support_vulkan_any_packing(ncnn_layer_t layer, int enable);\n\nNCNN_EXPORT int ncnn_layer_get_bottom_count(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_bottom(const ncnn_layer_t layer, int i);\nNCNN_EXPORT int ncnn_layer_get_top_count(const ncnn_layer_t layer);\nNCNN_EXPORT int ncnn_layer_get_top(const ncnn_layer_t layer, int i);\n\nNCNN_EXPORT void ncnn_blob_get_bottom_shape(const ncnn_layer_t layer, int i, int* dims, int* w, int* h, int* c);\nNCNN_EXPORT void ncnn_blob_get_top_shape(const ncnn_layer_t layer, int i, int* dims, int* w, int* h, int* c);\n\n/* layer factory function */\ntypedef ncnn_layer_t (*ncnn_layer_creator_t)(void* userdata);\ntypedef void (*ncnn_layer_destroyer_t)(ncnn_layer_t layer, void* userdata);\n\ntypedef struct __ncnn_net_custom_layer_factory_t* ncnn_net_custom_layer_factory_t;\nstruct __ncnn_net_custom_layer_factory_t\n{\n    ncnn_layer_creator_t creator;\n    ncnn_layer_destroyer_t destroyer;\n    void* userdata;\n    ncnn_net_custom_layer_factory_t next;\n};\n\n/* net api */\ntypedef struct __ncnn_net_t* ncnn_net_t;\nstruct __ncnn_net_t\n{\n    void* pthis;\n\n    ncnn_net_custom_layer_factory_t custom_layer_factory;\n};\n\nNCNN_EXPORT ncnn_net_t ncnn_net_create(void);\nNCNN_EXPORT void ncnn_net_destroy(ncnn_net_t net);\n\nNCNN_EXPORT ncnn_option_t ncnn_net_get_option(ncnn_net_t net);\nNCNN_EXPORT void ncnn_net_set_option(ncnn_net_t net, ncnn_option_t opt);\n\n#if NCNN_VULKAN\nNCNN_EXPORT void ncnn_net_set_vulkan_device(ncnn_net_t net, int device_index);\n#endif\n\n#if NCNN_STRING\nNCNN_EXPORT void ncnn_net_register_custom_layer_by_type(ncnn_net_t net, const char* type, ncnn_layer_creator_t creator, ncnn_layer_destroyer_t 
destroyer, void* userdata);\n#endif /* NCNN_STRING */\nNCNN_EXPORT void ncnn_net_register_custom_layer_by_typeindex(ncnn_net_t net, int typeindex, ncnn_layer_creator_t creator, ncnn_layer_destroyer_t destroyer, void* userdata);\n\n#if NCNN_STDIO\n#if NCNN_STRING\nNCNN_EXPORT int ncnn_net_load_param(ncnn_net_t net, const char* path);\n#endif /* NCNN_STRING */\nNCNN_EXPORT int ncnn_net_load_param_bin(ncnn_net_t net, const char* path);\nNCNN_EXPORT int ncnn_net_load_model(ncnn_net_t net, const char* path);\n#if _WIN32\n#if NCNN_STRING\nNCNN_EXPORT int ncnn_net_load_param_w(ncnn_net_t net, const wchar_t* path);\n#endif /* NCNN_STRING */\nNCNN_EXPORT int ncnn_net_load_param_bin_w(ncnn_net_t net, const wchar_t* path);\nNCNN_EXPORT int ncnn_net_load_model_w(ncnn_net_t net, const wchar_t* path);\n#endif /* _WIN32 */\n#endif /* NCNN_STDIO */\n\n#if NCNN_STDIO\n#if NCNN_STRING\nNCNN_EXPORT int ncnn_net_load_param_memory(ncnn_net_t net, const char* mem);\n#endif /* NCNN_STRING */\n#endif /* NCNN_STDIO */\nNCNN_EXPORT size_t ncnn_net_load_param_bin_memory(ncnn_net_t net, const unsigned char* mem);\nNCNN_EXPORT size_t ncnn_net_load_model_memory(ncnn_net_t net, const unsigned char* mem);\n\n#if NCNN_STRING\nNCNN_EXPORT int ncnn_net_load_param_datareader(ncnn_net_t net, const ncnn_datareader_t dr);\n#endif /* NCNN_STRING */\nNCNN_EXPORT int ncnn_net_load_param_bin_datareader(ncnn_net_t net, const ncnn_datareader_t dr);\nNCNN_EXPORT int ncnn_net_load_model_datareader(ncnn_net_t net, const ncnn_datareader_t dr);\n\nNCNN_EXPORT void ncnn_net_clear(ncnn_net_t net);\n\nNCNN_EXPORT int ncnn_net_get_input_count(const ncnn_net_t net);\nNCNN_EXPORT int ncnn_net_get_output_count(const ncnn_net_t net);\n#if NCNN_STRING\nNCNN_EXPORT const char* ncnn_net_get_input_name(const ncnn_net_t net, int i);\nNCNN_EXPORT const char* ncnn_net_get_output_name(const ncnn_net_t net, int i);\n#endif /* NCNN_STRING */\nNCNN_EXPORT int ncnn_net_get_input_index(const ncnn_net_t net, int i);\nNCNN_EXPORT int 
ncnn_net_get_output_index(const ncnn_net_t net, int i);\n\n/* extractor api */\ntypedef struct __ncnn_extractor_t* ncnn_extractor_t;\n\nNCNN_EXPORT ncnn_extractor_t ncnn_extractor_create(ncnn_net_t net);\nNCNN_EXPORT void ncnn_extractor_destroy(ncnn_extractor_t ex);\n\nNCNN_EXPORT void ncnn_extractor_set_option(ncnn_extractor_t ex, const ncnn_option_t opt);\n\n#if NCNN_STRING\nNCNN_EXPORT int ncnn_extractor_input(ncnn_extractor_t ex, const char* name, const ncnn_mat_t mat);\nNCNN_EXPORT int ncnn_extractor_extract(ncnn_extractor_t ex, const char* name, ncnn_mat_t* mat);\n#endif /* NCNN_STRING */\nNCNN_EXPORT int ncnn_extractor_input_index(ncnn_extractor_t ex, int index, const ncnn_mat_t mat);\nNCNN_EXPORT int ncnn_extractor_extract_index(ncnn_extractor_t ex, int index, ncnn_mat_t* mat);\n\n/* mat process api */\n#define NCNN_BORDER_CONSTANT    0\n#define NCNN_BORDER_REPLICATE   1\n#define NCNN_BORDER_REFLECT     2\n#define NCNN_BORDER_TRANSPARENT -233\nNCNN_EXPORT void ncnn_copy_make_border(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, int left, int right, int type, float v, const ncnn_option_t opt);\nNCNN_EXPORT void ncnn_copy_make_border_3d(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, int left, int right, int front, int behind, int type, float v, const ncnn_option_t opt);\nNCNN_EXPORT void ncnn_copy_cut_border(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, int left, int right, const ncnn_option_t opt);\nNCNN_EXPORT void ncnn_copy_cut_border_3d(const ncnn_mat_t src, ncnn_mat_t dst, int top, int bottom, int left, int right, int front, int behind, const ncnn_option_t opt);\n\n#if NCNN_PIXEL_DRAWING\n/* mat pixel drawing api*/\nNCNN_EXPORT void ncnn_draw_rectangle_c1(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_rectangle_c2(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness);\nNCNN_EXPORT 
void ncnn_draw_rectangle_c3(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_rectangle_c4(unsigned char* pixels, int w, int h, int rx, int ry, int rw, int rh, unsigned int color, int thickness);\n\nNCNN_EXPORT void ncnn_draw_text_c1(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color);\nNCNN_EXPORT void ncnn_draw_text_c2(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color);\nNCNN_EXPORT void ncnn_draw_text_c3(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color);\nNCNN_EXPORT void ncnn_draw_text_c4(unsigned char* pixels, int w, int h, const char* text, int x, int y, int fontpixelsize, unsigned int color);\n\nNCNN_EXPORT void ncnn_draw_circle_c1(unsigned char* pixels, int w, int h, int cx, int cy, int radius, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_circle_c2(unsigned char* pixels, int w, int h, int cx, int cy, int radius, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_circle_c3(unsigned char* pixels, int w, int h, int cx, int cy, int radius, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_circle_c4(unsigned char* pixels, int w, int h, int cx, int cy, int radius, unsigned int color, int thickness);\n\nNCNN_EXPORT void ncnn_draw_line_c1(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_line_c2(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_line_c3(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness);\nNCNN_EXPORT void ncnn_draw_line_c4(unsigned char* pixels, int w, int h, int x0, int y0, int x1, int y1, unsigned int color, int thickness);\n#endif /* 
NCNN_PIXEL_DRAWING */\n\n#ifdef __cplusplus\n} /* extern \"C\" */\n#endif\n\n#endif /* NCNN_C_API */\n\n#endif /* NCNN_C_API_H */\n"
  },
  {
    "path": "src/command.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"command.h\"\n\n#if NCNN_VULKAN\n\n#include \"option.h\"\n#include \"pipeline.h\"\n\nnamespace ncnn {\n\nclass VkComputePrivate\n{\npublic:\n    VkComputePrivate(const VulkanDevice* _vkdev);\n    ~VkComputePrivate();\n\n    int init();\n    int begin_command_buffer();\n    int end_command_buffer();\n\n    const VulkanDevice* vkdev;\n\n    VkCommandPool compute_command_pool;\n\n    VkCommandBuffer compute_command_buffer;\n\n    VkFence compute_command_fence;\n\n    std::vector<VkMat> upload_staging_buffers;\n    std::vector<VkMat> download_post_buffers;\n    std::vector<Mat> download_post_mats_fp16;\n    std::vector<Mat> download_post_mats;\n\n    std::vector<VkImageMemory*> image_blocks_to_destroy;\n\n    // the good-old path for device without VK_KHR_push_descriptor\n    std::vector<VkDescriptorPool> descriptor_pools;\n    std::vector<VkDescriptorSet> descriptorsets;\n\n    struct record\n    {\n        enum\n        {\n            TYPE_copy_buffer,\n            TYPE_copy_image,\n            TYPE_copy_buffer_to_image,\n            TYPE_copy_image_to_buffer,\n            TYPE_bind_pipeline,\n            TYPE_bind_descriptorsets,\n            TYPE_push_constants,\n            TYPE_dispatch,\n            TYPE_memory_barrers,\n            TYPE_buffer_barrers,\n            TYPE_image_barrers,\n\n#if NCNN_BENCHMARK\n            TYPE_write_timestamp,\n#endif // NCNN_BENCHMARK\n\n            TYPE_post_download,\n            TYPE_post_cast_float16_to_float32,\n            TYPE_post_cast_bfloat16_to_float32,\n        };\n\n        int type;\n        VkCommandBuffer command_buffer;\n\n        union\n        {\n            struct\n            {\n                VkBuffer src;\n                VkBuffer dst;\n                uint32_t region_count;\n                const VkBufferCopy* regions;\n            } copy_buffer;\n            struct\n            {\n                VkImage 
src;\n                VkImageLayout src_layout;\n                VkImage dst;\n                VkImageLayout dst_layout;\n                uint32_t region_count;\n                const VkImageCopy* regions;\n            } copy_image;\n            struct\n            {\n                VkBuffer src;\n                VkImage dst;\n                VkImageLayout layout;\n                uint32_t region_count;\n                const VkBufferImageCopy* regions;\n            } copy_buffer_to_image;\n            struct\n            {\n                VkImage src;\n                VkImageLayout layout;\n                VkBuffer dst;\n                uint32_t region_count;\n                const VkBufferImageCopy* regions;\n            } copy_image_to_buffer;\n\n            struct\n            {\n                VkPipelineBindPoint bind_point;\n                VkPipeline pipeline;\n            } bind_pipeline;\n            struct\n            {\n                VkPipelineBindPoint bind_point;\n                VkPipelineLayout pipeline_layout;\n                uint32_t descriptorset_count;\n                uint32_t descriptorset_offset;\n            } bind_descriptorsets;\n            struct\n            {\n                VkPipelineLayout pipeline_layout;\n                VkShaderStageFlags stage_flags;\n                uint32_t size;\n                const void* values;\n            } push_constants;\n\n            struct\n            {\n                uint32_t group_count_x;\n                uint32_t group_count_y;\n                uint32_t group_count_z;\n            } dispatch;\n\n            struct\n            {\n                VkPipelineStageFlags src_stage;\n                VkPipelineStageFlags dst_stage;\n                uint32_t barrier_count;\n                const VkMemoryBarrier* barriers;\n            } memory_barrers;\n            struct\n            {\n                VkPipelineStageFlags src_stage;\n                VkPipelineStageFlags dst_stage;\n          
      uint32_t barrier_count;\n                const VkBufferMemoryBarrier* barriers;\n            } buffer_barrers;\n            struct\n            {\n                VkPipelineStageFlags src_stage;\n                VkPipelineStageFlags dst_stage;\n                uint32_t barrier_count;\n                const VkImageMemoryBarrier* barriers;\n            } image_barrers;\n\n#if NCNN_BENCHMARK\n            struct\n            {\n                uint32_t query;\n            } write_timestamp;\n#endif // NCNN_BENCHMARK\n\n            struct\n            {\n                uint32_t download_post_buffer_mat_offset;\n                uint32_t download_post_mat_fp16_offset;\n            } post_download;\n            struct\n            {\n                uint32_t download_post_mat_fp16_offset;\n                uint32_t download_post_mat_offset;\n                int num_threads;\n            } post_cast_float16_to_float32;\n            struct\n            {\n                uint32_t download_post_mat_bf16_offset;\n                uint32_t download_post_mat_offset;\n                int num_threads;\n            } post_cast_bfloat16_to_float32;\n        };\n    };\n\n    std::vector<record> delayed_records;\n\n    uint64_t pending_dispatch_total;\n\n#if NCNN_BENCHMARK\n    uint32_t query_count;\n    VkQueryPool query_pool;\n#endif // NCNN_BENCHMARK\n};\n\nVkComputePrivate::VkComputePrivate(const VulkanDevice* _vkdev)\n    : vkdev(_vkdev)\n{\n    compute_command_pool = 0;\n    compute_command_buffer = 0;\n    compute_command_fence = 0;\n\n    pending_dispatch_total = 0;\n\n#if NCNN_BENCHMARK\n    query_count = 0;\n    query_pool = 0;\n#endif // NCNN_BENCHMARK\n\n    init();\n}\n\nVkComputePrivate::~VkComputePrivate()\n{\n    for (size_t i = 0; i < image_blocks_to_destroy.size(); i++)\n    {\n        VkImageMemory* ptr = image_blocks_to_destroy[i];\n\n        int old_command_refcount = NCNN_XADD(&ptr->command_refcount, -1);\n        if (ptr->refcount == 0 && 
old_command_refcount == 1)\n        {\n            // no userspace reference and we are the last command reference\n            vkDestroyImageView(vkdev->vkdevice(), ptr->imageview, 0);\n            vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n\n            delete ptr;\n        }\n        else\n        {\n            // reference exists in user code or other command\n        }\n    }\n    image_blocks_to_destroy.clear();\n\n    if (!vkdev->info.support_VK_KHR_push_descriptor())\n    {\n        for (size_t i = 0; i < descriptorsets.size(); i++)\n        {\n            vkFreeDescriptorSets(vkdev->vkdevice(), descriptor_pools[i], 1, &descriptorsets[i]);\n            vkDestroyDescriptorPool(vkdev->vkdevice(), descriptor_pools[i], 0);\n        }\n    }\n\n#if NCNN_BENCHMARK\n    if (query_pool)\n    {\n        // all submitted commands that refer to queryPool must have completed execution\n        vkResetCommandBuffer(compute_command_buffer, 0);\n\n        vkDestroyQueryPool(vkdev->vkdevice(), query_pool, 0);\n    }\n#endif // NCNN_BENCHMARK\n\n    vkDestroyFence(vkdev->vkdevice(), compute_command_fence, 0);\n\n    vkFreeCommandBuffers(vkdev->vkdevice(), compute_command_pool, 1, &compute_command_buffer);\n    vkDestroyCommandPool(vkdev->vkdevice(), compute_command_pool, 0);\n}\n\nint VkComputePrivate::init()\n{\n    // compute_command_pool\n    {\n        VkCommandPoolCreateInfo commandPoolCreateInfo;\n        commandPoolCreateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;\n        commandPoolCreateInfo.pNext = 0;\n        commandPoolCreateInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;\n        commandPoolCreateInfo.queueFamilyIndex = vkdev->info.compute_queue_family_index();\n\n        VkResult ret = vkCreateCommandPool(vkdev->vkdevice(), &commandPoolCreateInfo, 0, &compute_command_pool);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkCreateCommandPool failed %d\", ret);\n            return -1;\n        }\n    }\n\n 
   // compute_command_buffer\n    {\n        VkCommandBufferAllocateInfo commandBufferAllocateInfo;\n        commandBufferAllocateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;\n        commandBufferAllocateInfo.pNext = 0;\n        commandBufferAllocateInfo.commandPool = compute_command_pool;\n        commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;\n        commandBufferAllocateInfo.commandBufferCount = 1;\n\n        VkResult ret = vkAllocateCommandBuffers(vkdev->vkdevice(), &commandBufferAllocateInfo, &compute_command_buffer);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkAllocateCommandBuffers failed %d\", ret);\n            return -1;\n        }\n    }\n\n    // compute_command_fence\n    {\n        VkFenceCreateInfo fenceCreateInfo;\n        fenceCreateInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;\n        fenceCreateInfo.pNext = 0;\n        fenceCreateInfo.flags = 0;\n\n        VkResult ret = vkCreateFence(vkdev->vkdevice(), &fenceCreateInfo, 0, &compute_command_fence);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkCreateFence failed %d\", ret);\n            return -1;\n        }\n    }\n\n    if (vkdev->info.support_VK_KHR_push_descriptor())\n    {\n        begin_command_buffer();\n\n#if NCNN_BENCHMARK\n        if (query_pool)\n            vkCmdResetQueryPool(compute_command_buffer, query_pool, 0, query_count);\n#endif // NCNN_BENCHMARK\n    }\n\n    return 0;\n}\n\nint VkComputePrivate::begin_command_buffer()\n{\n    VkCommandBufferBeginInfo commandBufferBeginInfo;\n    commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;\n    commandBufferBeginInfo.pNext = 0;\n    commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;\n    commandBufferBeginInfo.pInheritanceInfo = 0;\n\n    VkResult ret = vkBeginCommandBuffer(compute_command_buffer, &commandBufferBeginInfo);\n    if (ret != VK_SUCCESS)\n    {\n        
NCNN_LOGE(\"vkBeginCommandBuffer failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nint VkComputePrivate::end_command_buffer()\n{\n    VkResult ret = vkEndCommandBuffer(compute_command_buffer);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEndCommandBuffer failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nVkCompute::VkCompute(const VulkanDevice* _vkdev)\n    : vkdev(_vkdev), d(new VkComputePrivate(_vkdev))\n{\n}\n\nVkCompute::~VkCompute()\n{\n    delete d;\n}\n\nvoid VkCompute::record_upload(const Mat& src, VkMat& dst, const Option& opt)\n{\n    // NCNN_LOGE(\"record_upload buffer\");\n\n    Mat src_fp16;\n    if (src.elemsize == src.elempack * 4u)\n    {\n        // cpu cast to fp16 (discrete gpu)\n        if (vkdev->info.type() == 0 && (opt.use_bf16_storage || opt.use_bf16_packed))\n        {\n            ncnn::cast_float32_to_bfloat16(src, src_fp16, opt);\n        }\n        else if (vkdev->info.type() == 0 && (opt.use_fp16_storage || opt.use_fp16_packed))\n        {\n            ncnn::cast_float32_to_float16(src, src_fp16, opt);\n        }\n        else\n        {\n            src_fp16 = src;\n        }\n    }\n    else\n    {\n        src_fp16 = src;\n    }\n\n    // vkdev->convert_packing only handles elempack=1/4\n    if (src_fp16.elempack > 4)\n    {\n        Mat src_fp16_pack4;\n        ncnn::convert_packing(src_fp16, src_fp16_pack4, 4, opt);\n        src_fp16 = src_fp16_pack4;\n    }\n\n    // upload\n    VkMat dst_staging;\n    dst_staging.create_like(src_fp16, opt.staging_vkallocator);\n    if (dst_staging.empty())\n        return;\n\n    // stash staging\n    d->upload_staging_buffers.push_back(dst_staging);\n\n    //     NCNN_LOGE(\"upload_staging_buffer %p  ->   %p +%d ~%d\", src_fp16.data, dst_staging.buffer(), dst_staging.buffer_offset(), dst_staging.buffer_capacity());\n\n    // memcpy src to device\n    memcpy(dst_staging.mapped_ptr(), src_fp16.data, src_fp16.total() * src_fp16.elemsize);\n    
dst_staging.allocator->flush(dst_staging.data);\n\n    // mark device host-write @ null\n    dst_staging.data->access_flags = VK_ACCESS_HOST_WRITE_BIT;\n    dst_staging.data->stage_flags = VK_PIPELINE_STAGE_HOST_BIT;\n\n    // resolve dst_elempack\n    int dims = src_fp16.dims;\n    int elemcount = 0;\n    if (dims == 1) elemcount = src_fp16.elempack * src_fp16.w;\n    if (dims == 2) elemcount = src_fp16.elempack * src_fp16.h;\n    if (dims == 3 || dims == 4) elemcount = src_fp16.elempack * src_fp16.c;\n\n    int dst_elempack = elemcount % 4 == 0 ? 4 : 1;\n\n    // gpu cast to fp16 on the fly (integrated gpu)\n    int cast_type_to = 0;\n    if (vkdev->info.type() != 0)\n    {\n        if (opt.use_bf16_storage || opt.use_bf16_packed)\n            cast_type_to = 5;\n        else if (opt.use_fp16_storage || opt.use_fp16_packed)\n            cast_type_to = 2;\n        else\n            cast_type_to = 1;\n    }\n    vkdev->convert_packing(dst_staging, dst, dst_elempack, cast_type_to, *this, opt);\n}\n\nvoid VkCompute::record_download(const VkMat& src, Mat& dst, const Option& opt)\n{\n    // NCNN_LOGE(\"record_download buffer\");\n\n    // resolve dst_elempack\n    int dims = src.dims;\n    int elemcount = 0;\n    if (dims == 1) elemcount = src.elempack * src.w;\n    if (dims == 2) elemcount = src.elempack * src.h;\n    if (dims == 3 || dims == 4) elemcount = src.elempack * src.c;\n\n    int dst_elempack = 1;\n    if (opt.use_packing_layout)\n        dst_elempack = elemcount % 4 == 0 ? 
4 : 1;\n    else\n        dst_elempack = 1;\n\n    // gpu cast to fp32 on the fly (integrated gpu)\n    Option opt_staging = opt;\n    if (!opt_staging.blob_vkallocator->mappable)\n    {\n        opt_staging.blob_vkallocator = opt.staging_vkallocator;\n    }\n    int cast_type_to = 0;\n    if (vkdev->info.type() != 0)\n    {\n        cast_type_to = 1;\n    }\n\n    if (src.elemsize == src.elempack * 1u)\n    {\n        cast_type_to = 4;\n    }\n\n    VkMat dst_staging;\n    vkdev->convert_packing(src, dst_staging, dst_elempack, cast_type_to, *this, opt_staging);\n\n    // barrier device any @ compute to host-read @ compute\n    if (dst_staging.data->access_flags & VK_ACCESS_HOST_WRITE_BIT || dst_staging.data->stage_flags != VK_PIPELINE_STAGE_HOST_BIT)\n    {\n        VkBufferMemoryBarrier* barriers = new VkBufferMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = dst_staging.data->access_flags;\n        barriers[0].dstAccessMask = VK_ACCESS_HOST_READ_BIT;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].buffer = dst_staging.buffer();\n        barriers[0].offset = dst_staging.buffer_offset();\n        barriers[0].size = dst_staging.buffer_capacity();\n\n        VkPipelineStageFlags src_stage = dst_staging.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_HOST_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, barriers, 0, 0);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_buffer_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.buffer_barrers.src_stage = 
src_stage;\n            r.buffer_barrers.dst_stage = dst_stage;\n            r.buffer_barrers.barrier_count = 1;\n            r.buffer_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark device host-read @ any\n        dst_staging.data->access_flags = VK_ACCESS_HOST_READ_BIT;\n        dst_staging.data->stage_flags = VK_PIPELINE_STAGE_HOST_BIT;\n    }\n\n    // create dst\n    Mat dst_fp16;\n    dst_fp16.create_like(dst_staging, opt.blob_allocator);\n    if (dst_fp16.empty())\n        return;\n\n    // download\n    d->download_post_buffers.push_back(dst_staging);\n    d->download_post_mats_fp16.push_back(dst_fp16);\n\n    // post memcpy device to dst\n    {\n        VkComputePrivate::record r;\n        r.type = VkComputePrivate::record::TYPE_post_download;\n        r.command_buffer = 0;\n        r.post_download.download_post_buffer_mat_offset = d->download_post_buffers.size() - 1;\n        r.post_download.download_post_mat_fp16_offset = d->download_post_mats_fp16.size() - 1;\n        d->delayed_records.push_back(r);\n    }\n\n    // cast to fp32 (discrete gpu)\n    if (dst_fp16.elemsize == dst_fp16.elempack * 2u)\n    {\n        if (vkdev->info.type() == 0 && (opt.use_bf16_storage || opt.use_bf16_packed))\n        {\n            int dims = dst_fp16.dims;\n            if (dims == 1)\n                dst.create(dst_fp16.w, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n            if (dims == 2)\n                dst.create(dst_fp16.w, dst_fp16.h, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n            if (dims == 3)\n                dst.create(dst_fp16.w, dst_fp16.h, dst_fp16.c, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n            if (dims == 4)\n                dst.create(dst_fp16.w, dst_fp16.h, dst_fp16.d, dst_fp16.c, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n\n            
d->download_post_mats.push_back(dst);\n\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_post_cast_bfloat16_to_float32;\n            r.command_buffer = 0;\n            r.post_cast_bfloat16_to_float32.download_post_mat_bf16_offset = d->download_post_mats_fp16.size() - 1;\n            r.post_cast_bfloat16_to_float32.download_post_mat_offset = d->download_post_mats.size() - 1;\n            r.post_cast_bfloat16_to_float32.num_threads = opt.num_threads;\n            d->delayed_records.push_back(r);\n        }\n        else if (vkdev->info.type() == 0 && (opt.use_fp16_storage || opt.use_fp16_packed))\n        {\n            int dims = dst_fp16.dims;\n            if (dims == 1)\n                dst.create(dst_fp16.w, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n            if (dims == 2)\n                dst.create(dst_fp16.w, dst_fp16.h, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n            if (dims == 3)\n                dst.create(dst_fp16.w, dst_fp16.h, dst_fp16.c, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n            if (dims == 4)\n                dst.create(dst_fp16.w, dst_fp16.h, dst_fp16.d, dst_fp16.c, (size_t)(dst_fp16.elempack * 4u), dst_fp16.elempack, opt.blob_allocator);\n\n            d->download_post_mats.push_back(dst);\n\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_post_cast_float16_to_float32;\n            r.command_buffer = 0;\n            r.post_cast_float16_to_float32.download_post_mat_fp16_offset = d->download_post_mats_fp16.size() - 1;\n            r.post_cast_float16_to_float32.download_post_mat_offset = d->download_post_mats.size() - 1;\n            r.post_cast_float16_to_float32.num_threads = opt.num_threads;\n            d->delayed_records.push_back(r);\n        }\n        else\n        {\n            dst = dst_fp16;\n        }\n    }\n    else\n    {\n       
 dst = dst_fp16;\n    }\n}\n\nvoid VkCompute::record_clone(const Mat& src, VkMat& dst, const Option& opt)\n{\n    //     NCNN_LOGE(\"record_clone host to buffer\");\n\n    // host to staging\n    VkMat dst_staging;\n    dst_staging.create_like(src, opt.staging_vkallocator);\n    if (dst_staging.empty())\n        return;\n\n    // memcpy src to device\n    memcpy(dst_staging.mapped_ptr(), src.data, src.total() * src.elemsize);\n    dst_staging.allocator->flush(dst_staging.data);\n\n    // mark device host-write @ null\n    dst_staging.data->access_flags = VK_ACCESS_HOST_WRITE_BIT;\n    dst_staging.data->stage_flags = VK_PIPELINE_STAGE_HOST_BIT;\n\n    // staging to device\n    record_clone(dst_staging, dst, opt);\n\n    // stash staging\n    d->upload_staging_buffers.push_back(dst_staging);\n}\n\nvoid VkCompute::record_clone(const Mat& src, VkImageMat& dst, const Option& opt)\n{\n    //     NCNN_LOGE(\"record_clone host to image\");\n\n    // host to staging\n    VkMat dst_staging;\n    Option opt_staging = opt;\n    opt_staging.blob_vkallocator = opt.staging_vkallocator;\n    record_clone(src, dst_staging, opt_staging);\n\n    // staging to image\n    record_clone(dst_staging, dst, opt);\n\n    // stash staging\n    d->upload_staging_buffers.push_back(dst_staging);\n}\n\nvoid VkCompute::record_clone(const VkMat& src, Mat& dst, const Option& opt)\n{\n    //     NCNN_LOGE(\"record_clone buffer to host\");\n\n    if (!src.allocator->mappable)\n    {\n        // device to staging\n        VkMat src_staging;\n        Option opt_staging = opt;\n        opt_staging.blob_vkallocator = opt.staging_vkallocator;\n        record_clone(src, src_staging, opt_staging);\n\n        // staging to host\n        record_clone(src_staging, dst, opt);\n\n        return;\n    }\n\n    // create dst\n    dst.create_like(src, opt.blob_allocator);\n    if (dst.empty())\n        return;\n\n    // barrier device any @ compute to host-read @ compute\n    if (src.data->access_flags & 
VK_ACCESS_HOST_WRITE_BIT || src.data->stage_flags != VK_PIPELINE_STAGE_HOST_BIT)\n    {\n        VkBufferMemoryBarrier* barriers = new VkBufferMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = src.data->access_flags;\n        barriers[0].dstAccessMask = VK_ACCESS_HOST_READ_BIT;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].buffer = src.buffer();\n        barriers[0].offset = src.buffer_offset();\n        barriers[0].size = src.buffer_capacity();\n\n        VkPipelineStageFlags src_stage = src.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_HOST_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, barriers, 0, 0);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_buffer_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.buffer_barrers.src_stage = src_stage;\n            r.buffer_barrers.dst_stage = dst_stage;\n            r.buffer_barrers.barrier_count = 1;\n            r.buffer_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark device host-read @ any\n        src.data->access_flags = VK_ACCESS_HOST_READ_BIT;\n        src.data->stage_flags = VK_PIPELINE_STAGE_HOST_BIT;\n    }\n\n    // stash download post buffer and mat\n    d->download_post_buffers.push_back(src);\n    d->download_post_mats_fp16.push_back(dst);\n\n    // post memcpy device to dst\n    {\n        VkComputePrivate::record r;\n        r.type = VkComputePrivate::record::TYPE_post_download;\n        r.command_buffer = 0;\n        
r.post_download.download_post_buffer_mat_offset = d->download_post_buffers.size() - 1;\n        r.post_download.download_post_mat_fp16_offset = d->download_post_mats_fp16.size() - 1;\n        d->delayed_records.push_back(r);\n    }\n}\n\nvoid VkCompute::record_clone(const VkImageMat& src, Mat& dst, const Option& opt)\n{\n    //     NCNN_LOGE(\"record_clone image to host\");\n\n    // image to staging\n    VkMat src_staging;\n    Option opt_staging = opt;\n    opt_staging.blob_vkallocator = opt.staging_vkallocator;\n    record_clone(src, src_staging, opt_staging);\n\n    // staging to host\n    record_clone(src_staging, dst, opt);\n}\n\nvoid VkCompute::record_clone(const VkMat& src, VkMat& dst, const Option& opt)\n{\n    //     NCNN_LOGE(\"record_clone buffer to buffer\");\n\n    // create dst\n    dst.create_like(src, opt.blob_vkallocator);\n    if (dst.empty())\n        return;\n\n    if (src.data->access_flags & VK_ACCESS_TRANSFER_WRITE_BIT || src.data->stage_flags != VK_PIPELINE_STAGE_TRANSFER_BIT)\n    {\n        // barrier device any @ compute to transfer-read @ compute\n        VkBufferMemoryBarrier* barriers = new VkBufferMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = src.data->access_flags;\n        barriers[0].dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].buffer = src.buffer();\n        barriers[0].offset = src.buffer_offset();\n        barriers[0].size = src.buffer_capacity();\n\n        VkPipelineStageFlags src_stage = src.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, barriers, 0, 
0);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_buffer_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.buffer_barrers.src_stage = src_stage;\n            r.buffer_barrers.dst_stage = dst_stage;\n            r.buffer_barrers.barrier_count = 1;\n            r.buffer_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark device transfer-read @ transfer\n        src.data->access_flags = VK_ACCESS_TRANSFER_READ_BIT;\n        src.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    {\n        // barrier device any @ null to transfer-write @ compute\n\n        // mark device transfer-write @ transfer\n        dst.data->access_flags = VK_ACCESS_TRANSFER_WRITE_BIT;\n        dst.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    // record device to staging\n    {\n        VkBufferCopy* regions = new VkBufferCopy[1];\n        regions[0].srcOffset = src.buffer_offset();\n        regions[0].dstOffset = dst.buffer_offset();\n        regions[0].size = std::min(src.buffer_capacity(), dst.buffer_capacity());\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdCopyBuffer(d->compute_command_buffer, src.buffer(), dst.buffer(), 1, regions);\n            delete[] regions;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_copy_buffer;\n            r.command_buffer = d->compute_command_buffer;\n            r.copy_buffer.src = src.buffer();\n            r.copy_buffer.dst = dst.buffer();\n            r.copy_buffer.region_count = 1;\n            r.copy_buffer.regions = regions;\n            d->delayed_records.push_back(r);\n        }\n    }\n}\n\nvoid VkCompute::record_clone(const VkImageMat& src, VkImageMat& dst, const Option& opt)\n{\n    //     
NCNN_LOGE(\"record_clone image to image\");\n\n    // create dst\n    dst.create_like(src, opt.blob_vkallocator);\n    if (dst.empty())\n        return;\n\n    // image layout transform any @ any to transfer-src-optimal @ compute\n    if (src.data->access_flags & VK_ACCESS_TRANSFER_WRITE_BIT || src.data->image_layout != VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL || src.data->stage_flags != VK_PIPELINE_STAGE_TRANSFER_BIT)\n    {\n        VkImageMemoryBarrier* barriers = new VkImageMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = src.data->access_flags;\n        barriers[0].dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;\n        barriers[0].oldLayout = src.data->image_layout;\n        barriers[0].newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].image = src.image();\n        barriers[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        barriers[0].subresourceRange.baseMipLevel = 0;\n        barriers[0].subresourceRange.levelCount = 1;\n        barriers[0].subresourceRange.baseArrayLayer = 0;\n        barriers[0].subresourceRange.layerCount = 1;\n\n        VkPipelineStageFlags src_stage = src.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 0, 0, 1, barriers);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_image_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.image_barrers.src_stage = src_stage;\n            r.image_barrers.dst_stage = dst_stage;\n   
         r.image_barrers.barrier_count = 1;\n            r.image_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark image transfer-src-optimal @ compute\n        src.data->access_flags = VK_ACCESS_TRANSFER_READ_BIT;\n        src.data->image_layout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;\n        src.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    // image layout transform undefined @ null to transfer-dst-optimal @ compute\n    {\n        VkImageMemoryBarrier* barriers = new VkImageMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = 0;\n        barriers[0].dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;\n        barriers[0].oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;\n        barriers[0].newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].image = dst.image();\n        barriers[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        barriers[0].subresourceRange.baseMipLevel = 0;\n        barriers[0].subresourceRange.levelCount = 1;\n        barriers[0].subresourceRange.baseArrayLayer = 0;\n        barriers[0].subresourceRange.layerCount = 1;\n\n        VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 0, 0, 1, barriers);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_image_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            
r.image_barrers.src_stage = src_stage;\n            r.image_barrers.dst_stage = dst_stage;\n            r.image_barrers.barrier_count = 1;\n            r.image_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark image transfer-dst-optimal @ compute\n        dst.data->access_flags = VK_ACCESS_TRANSFER_WRITE_BIT;\n        dst.data->image_layout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;\n        dst.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    // record device to staging\n    {\n        VkImageCopy* regions = new VkImageCopy[1];\n        regions[0].srcSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        regions[0].srcSubresource.mipLevel = 0;\n        regions[0].srcSubresource.baseArrayLayer = 0;\n        regions[0].srcSubresource.layerCount = 1;\n        regions[0].srcOffset.x = 0;\n        regions[0].srcOffset.y = 0;\n        regions[0].srcOffset.z = 0;\n        regions[0].dstSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        regions[0].dstSubresource.mipLevel = 0;\n        regions[0].dstSubresource.baseArrayLayer = 0;\n        regions[0].dstSubresource.layerCount = 1;\n        regions[0].dstOffset.x = 0;\n        regions[0].dstOffset.y = 0;\n        regions[0].dstOffset.z = 0;\n        regions[0].extent.width = src.data->width;\n        regions[0].extent.height = src.data->height;\n        regions[0].extent.depth = src.data->depth;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdCopyImage(d->compute_command_buffer, src.image(), VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, dst.image(), VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, regions);\n            delete[] regions;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_copy_image;\n            r.command_buffer = d->compute_command_buffer;\n            r.copy_image.src = src.image();\n            r.copy_image.src_layout 
= VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;\n            r.copy_image.dst = dst.image();\n            r.copy_image.dst_layout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;\n            r.copy_image.region_count = 1;\n            r.copy_image.regions = regions;\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // image and imageview can not be destroyed until command execution ends\n    NCNN_XADD(&src.data->command_refcount, 1);\n    NCNN_XADD(&dst.data->command_refcount, 1);\n    d->image_blocks_to_destroy.push_back(src.data);\n    d->image_blocks_to_destroy.push_back(dst.data);\n}\n\nvoid VkCompute::record_clone(const VkMat& src, VkImageMat& dst, const Option& opt)\n{\n    //     NCNN_LOGE(\"record_clone buffer to image\");\n\n    // create dst\n    dst.create_like(src, opt.blob_vkallocator);\n    if (dst.empty())\n        return;\n\n    // barrier device any @ any to transfer-read @ compute\n    if (src.data->access_flags & VK_ACCESS_SHADER_WRITE_BIT || src.data->stage_flags != VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)\n    {\n        VkBufferMemoryBarrier* barriers = new VkBufferMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = src.data->access_flags;\n        barriers[0].dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].buffer = src.buffer();\n        barriers[0].offset = src.buffer_offset();\n        barriers[0].size = src.buffer_capacity();\n\n        VkPipelineStageFlags src_stage = src.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, barriers, 0, 0);\n            delete[] barriers;\n       
 }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_buffer_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.buffer_barrers.src_stage = src_stage;\n            r.buffer_barrers.dst_stage = dst_stage;\n            r.buffer_barrers.barrier_count = 1;\n            r.buffer_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark device transfer-read @ compute\n        src.data->access_flags = VK_ACCESS_TRANSFER_READ_BIT;\n        src.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    // image layout transform undefined @ null to transfer-dst-optimal @ compute\n    {\n        VkImageMemoryBarrier* barriers = new VkImageMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = 0;\n        barriers[0].dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;\n        barriers[0].oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;\n        barriers[0].newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].image = dst.image();\n        barriers[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        barriers[0].subresourceRange.baseMipLevel = 0;\n        barriers[0].subresourceRange.levelCount = 1;\n        barriers[0].subresourceRange.baseArrayLayer = 0;\n        barriers[0].subresourceRange.layerCount = 1;\n\n        VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 0, 0, 1, barriers);\n            delete[] barriers;\n      
  }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_image_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.image_barrers.src_stage = src_stage;\n            r.image_barrers.dst_stage = dst_stage;\n            r.image_barrers.barrier_count = 1;\n            r.image_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark image transfer-dst-optimal @ compute\n        dst.data->access_flags = VK_ACCESS_TRANSFER_WRITE_BIT;\n        dst.data->image_layout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;\n        dst.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    // record device to image\n    {\n        int region_count;\n        VkBufferImageCopy* regions;\n        if (dst.elemsize * dst.w * dst.h % 16 == 0)\n        {\n            region_count = 1;\n            regions = new VkBufferImageCopy[1];\n            regions[0].bufferOffset = src.buffer_offset();\n            regions[0].bufferRowLength = 0;\n            regions[0].bufferImageHeight = 0;\n            regions[0].imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n            regions[0].imageSubresource.mipLevel = 0;\n            regions[0].imageSubresource.baseArrayLayer = 0;\n            regions[0].imageSubresource.layerCount = 1;\n            regions[0].imageOffset.x = 0;\n            regions[0].imageOffset.y = 0;\n            regions[0].imageOffset.z = 0;\n            regions[0].imageExtent.width = dst.data->width;\n            regions[0].imageExtent.height = dst.data->height;\n            regions[0].imageExtent.depth = dst.data->depth;\n        }\n        else\n        {\n            region_count = dst.c;\n            regions = new VkBufferImageCopy[region_count];\n            for (int i = 0; i < region_count; i++)\n            {\n                regions[i].bufferOffset = src.buffer_offset() + src.cstep * src.elemsize * i;\n                
regions[i].bufferRowLength = 0;\n                regions[i].bufferImageHeight = 0;\n                regions[i].imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n                regions[i].imageSubresource.mipLevel = 0;\n                regions[i].imageSubresource.baseArrayLayer = 0;\n                regions[i].imageSubresource.layerCount = 1;\n                regions[i].imageOffset.x = 0;\n                regions[i].imageOffset.y = 0;\n                regions[i].imageOffset.z = i;\n                regions[i].imageExtent.width = dst.data->width;\n                regions[i].imageExtent.height = dst.data->height;\n                regions[i].imageExtent.depth = 1;\n            }\n        }\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdCopyBufferToImage(d->compute_command_buffer, src.buffer(), dst.image(), VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, region_count, regions);\n            delete[] regions;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_copy_buffer_to_image;\n            r.command_buffer = d->compute_command_buffer;\n            r.copy_buffer_to_image.src = src.buffer();\n            r.copy_buffer_to_image.dst = dst.image();\n            r.copy_buffer_to_image.layout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;\n            r.copy_buffer_to_image.region_count = region_count;\n            r.copy_buffer_to_image.regions = regions;\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // image and imageview can not be destroyed until command execution ends\n    NCNN_XADD(&dst.data->command_refcount, 1);\n    d->image_blocks_to_destroy.push_back(dst.data);\n}\n\nvoid VkCompute::record_clone(const VkImageMat& src, VkMat& dst, const Option& opt)\n{\n    //     NCNN_LOGE(\"record_clone image to buffer\");\n\n    // create dst\n    dst.create_like(src, opt.blob_vkallocator);\n    if (dst.empty())\n        return;\n\n    // 
image layout transform any @ any to transfer-src-optimal @ compute\n    if (src.data->access_flags & VK_ACCESS_TRANSFER_WRITE_BIT || src.data->image_layout != VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL || src.data->stage_flags != VK_PIPELINE_STAGE_TRANSFER_BIT)\n    {\n        VkImageMemoryBarrier* barriers = new VkImageMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = src.data->access_flags;\n        barriers[0].dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;\n        barriers[0].oldLayout = src.data->image_layout;\n        barriers[0].newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].image = src.image();\n        barriers[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        barriers[0].subresourceRange.baseMipLevel = 0;\n        barriers[0].subresourceRange.levelCount = 1;\n        barriers[0].subresourceRange.baseArrayLayer = 0;\n        barriers[0].subresourceRange.layerCount = 1;\n\n        VkPipelineStageFlags src_stage = src.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 0, 0, 1, barriers);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_image_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.image_barrers.src_stage = src_stage;\n            r.image_barrers.dst_stage = dst_stage;\n            r.image_barrers.barrier_count = 1;\n            r.image_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        
// mark image transfer-src-optimal @ compute\n        src.data->access_flags = VK_ACCESS_TRANSFER_READ_BIT;\n        src.data->image_layout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;\n        src.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    {\n        // barrier device any @ null to transfer-write @ compute\n\n        // mark device transfer-write @ transfer\n        dst.data->access_flags = VK_ACCESS_TRANSFER_WRITE_BIT;\n        dst.data->stage_flags = VK_PIPELINE_STAGE_TRANSFER_BIT;\n    }\n\n    // record image to device\n    {\n        int region_count;\n        VkBufferImageCopy* regions;\n        if (src.elemsize * src.w * src.h % 16 == 0)\n        {\n            region_count = 1;\n            regions = new VkBufferImageCopy[1];\n            regions[0].bufferOffset = dst.buffer_offset();\n            regions[0].bufferRowLength = 0;\n            regions[0].bufferImageHeight = 0;\n            regions[0].imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n            regions[0].imageSubresource.mipLevel = 0;\n            regions[0].imageSubresource.baseArrayLayer = 0;\n            regions[0].imageSubresource.layerCount = 1;\n            regions[0].imageOffset.x = 0;\n            regions[0].imageOffset.y = 0;\n            regions[0].imageOffset.z = 0;\n            regions[0].imageExtent.width = src.data->width;\n            regions[0].imageExtent.height = src.data->height;\n            regions[0].imageExtent.depth = src.data->depth;\n        }\n        else\n        {\n            region_count = src.c;\n            regions = new VkBufferImageCopy[region_count];\n            for (int i = 0; i < region_count; i++)\n            {\n                regions[i].bufferOffset = dst.buffer_offset() + dst.cstep * dst.elemsize * i;\n                regions[i].bufferRowLength = 0;\n                regions[i].bufferImageHeight = 0;\n                regions[i].imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n                
regions[i].imageSubresource.mipLevel = 0;\n                regions[i].imageSubresource.baseArrayLayer = 0;\n                regions[i].imageSubresource.layerCount = 1;\n                regions[i].imageOffset.x = 0;\n                regions[i].imageOffset.y = 0;\n                regions[i].imageOffset.z = i;\n                regions[i].imageExtent.width = src.data->width;\n                regions[i].imageExtent.height = src.data->height;\n                regions[i].imageExtent.depth = 1;\n            }\n        }\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdCopyImageToBuffer(d->compute_command_buffer, src.image(), VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, dst.buffer(), region_count, regions);\n            delete[] regions;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_copy_image_to_buffer;\n            r.command_buffer = d->compute_command_buffer;\n            r.copy_image_to_buffer.src = src.image();\n            r.copy_image_to_buffer.layout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;\n            r.copy_image_to_buffer.dst = dst.buffer();\n            r.copy_image_to_buffer.region_count = region_count;\n            r.copy_image_to_buffer.regions = regions;\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // image and imageview can not be destroyed until command execution ends\n    NCNN_XADD(&src.data->command_refcount, 1);\n    d->image_blocks_to_destroy.push_back(src.data);\n}\n\nvoid VkCompute::record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& bindings, const std::vector<vk_constant_type>& constants, const VkMat& dispatcher)\n{\n    record_pipeline(pipeline, bindings, std::vector<VkImageMat>(), constants, dispatcher);\n}\n\nvoid VkCompute::record_pipeline(const Pipeline* pipeline, const std::vector<VkImageMat>& bindings, const std::vector<vk_constant_type>& constants, const VkImageMat& dispatcher)\n{\n  
  record_pipeline(pipeline, std::vector<VkMat>(), bindings, constants, dispatcher);\n}\n\nvoid VkCompute::record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& buffer_bindings, const std::vector<VkImageMat>& image_bindings, const std::vector<vk_constant_type>& constants, const VkMat& dispatcher)\n{\n    Mat dispatcher_mat(dispatcher.w, dispatcher.h, dispatcher.d, dispatcher.c, (void*)0);\n\n    record_pipeline(pipeline, buffer_bindings, image_bindings, constants, dispatcher_mat);\n}\n\nvoid VkCompute::record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& buffer_bindings, const std::vector<VkImageMat>& image_bindings, const std::vector<vk_constant_type>& constants, const VkImageMat& dispatcher)\n{\n    Mat dispatcher_mat(dispatcher.w, dispatcher.h, dispatcher.d, dispatcher.c, (void*)0);\n\n    record_pipeline(pipeline, buffer_bindings, image_bindings, constants, dispatcher_mat);\n}\n\nvoid VkCompute::record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& buffer_bindings, const std::vector<VkImageMat>& image_bindings, const std::vector<vk_constant_type>& constants, const Mat& dispatcher)\n{\n    //     NCNN_LOGE(\"record_pipeline %p\", pipeline);\n\n    const int buffer_binding_count = (int)buffer_bindings.size();\n    const int image_binding_count = (int)image_bindings.size();\n    const int constant_count = (int)constants.size();\n\n    const int binding_count = buffer_binding_count + image_binding_count;\n    const ShaderInfo& shader_info = pipeline->shader_info();\n\n    if (binding_count != shader_info.binding_count)\n    {\n        NCNN_LOGE(\"binding_count not match, expect %d but got %d + %d\", shader_info.binding_count, buffer_binding_count, image_binding_count);\n    }\n\n    if (constant_count != shader_info.push_constant_count)\n    {\n        NCNN_LOGE(\"push_constant_count not match, expect %d but got %d\", shader_info.push_constant_count, constant_count);\n    }\n\n    int buffer_index = 0;\n    int image_index 
= 0;\n    for (int i = 0; i < binding_count; i++)\n    {\n        int binding_type = shader_info.binding_types[i];\n\n        if (binding_type == 1)\n        {\n            const VkMat& binding = buffer_bindings[buffer_index].empty() ? vkdev->get_dummy_buffer() : buffer_bindings[buffer_index];\n            buffer_index++;\n\n            //             NCNN_LOGE(\"binding #%d buffer = %d %d %d %d @ %lu %d = %p +%ld ~%ld\", i, binding.dims, binding.w, binding.h, binding.c, binding.elemsize, binding.elempack, binding.buffer(), binding.buffer_offset(), binding.buffer_capacity());\n\n            barrier_readwrite(binding);\n        }\n        else if (binding_type == 2)\n        {\n            const VkImageMat& binding = image_bindings[image_index].empty() ? vkdev->get_dummy_image() : image_bindings[image_index];\n            image_index++;\n\n            //             NCNN_LOGE(\"binding #%d image = %d %d %d %d @ %lu %d = %p +%ld ~%ld %p\", i, binding.dims, binding.w, binding.h, binding.c, binding.elemsize, binding.elempack, binding.image(), binding.data->bind_offset, binding.data->bind_capacity, binding.imageview());\n\n            barrier_readwrite(binding);\n\n            // image and imageview can not be destroyed until command execution ends\n            NCNN_XADD(&binding.data->command_refcount, 1);\n            d->image_blocks_to_destroy.push_back(binding.data);\n        }\n        else // if (binding_type == 3)\n        {\n            const VkImageMat& binding = image_bindings[image_index].empty() ? 
vkdev->get_dummy_image_readonly() : image_bindings[image_index];\n            image_index++;\n\n            //             NCNN_LOGE(\"binding #%d sampler = %d %d %d %d @ %lu %d = %p +%ld ~%ld %p\", i, binding.dims, binding.w, binding.h, binding.c, binding.elemsize, binding.elempack, binding.image(), binding.data->bind_offset, binding.data->bind_capacity, binding.imageview());\n\n            // if the same image used for both storage image and combined image sampler\n            // only apply image layout transition to general\n            bool image_read_write = false;\n            for (int j = 0; j < image_binding_count; j++)\n            {\n                if (shader_info.binding_types[j] == 2 && binding.data == image_bindings[j].data)\n                {\n                    // the same image is used as storage image, skip it\n                    image_read_write = true;\n                    break;\n                }\n            }\n            if (image_read_write)\n                continue;\n\n            barrier_readonly(binding);\n\n            // image and imageview can not be destroyed until command execution ends\n            NCNN_XADD(&binding.data->command_refcount, 1);\n            d->image_blocks_to_destroy.push_back(binding.data);\n        }\n    }\n\n    // record bind pipeline\n    {\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdBindPipeline(d->compute_command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline->pipeline());\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_bind_pipeline;\n            r.command_buffer = d->compute_command_buffer;\n            r.bind_pipeline.bind_point = VK_PIPELINE_BIND_POINT_COMPUTE;\n            r.bind_pipeline.pipeline = pipeline->pipeline();\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // record update bindings\n    if (binding_count > 0)\n    {\n        
std::vector<unsigned char> descriptorInfos;\n        {\n            descriptorInfos.resize(sizeof(VkDescriptorBufferInfo) * buffer_binding_count + sizeof(VkDescriptorImageInfo) * image_binding_count);\n\n            unsigned char* p_descriptorInfos = descriptorInfos.data();\n            int descriptorBufferInfo_index = 0;\n            int descriptorImageInfo_index = 0;\n            for (int i = 0; i < binding_count; i++)\n            {\n                int binding_type = shader_info.binding_types[i];\n\n                if (binding_type == 1)\n                {\n                    const VkMat& binding = buffer_bindings[descriptorBufferInfo_index].empty() ? vkdev->get_dummy_buffer() : buffer_bindings[descriptorBufferInfo_index];\n                    descriptorBufferInfo_index++;\n\n                    VkDescriptorBufferInfo descriptorBufferInfo;\n                    descriptorBufferInfo.buffer = binding.buffer();\n                    descriptorBufferInfo.offset = binding.buffer_offset();\n                    descriptorBufferInfo.range = binding.total() * binding.elemsize;\n\n                    memcpy(p_descriptorInfos, &descriptorBufferInfo, sizeof(VkDescriptorBufferInfo));\n                    p_descriptorInfos += sizeof(VkDescriptorBufferInfo);\n                }\n                else //if (binding_type == 2 || binding_type == 3)\n                {\n                    const VkImageMat& binding = image_bindings[descriptorImageInfo_index].empty() ? 
vkdev->get_dummy_image() : image_bindings[descriptorImageInfo_index];\n                    descriptorImageInfo_index++;\n\n                    // we always use immutable nearest sampler set in descriptor layout during pipeline creation\n                    VkDescriptorImageInfo descriptorImageInfo;\n                    descriptorImageInfo.sampler = 0;\n                    descriptorImageInfo.imageView = binding.imageview();\n                    descriptorImageInfo.imageLayout = binding.data->image_layout;\n\n                    memcpy(p_descriptorInfos, &descriptorImageInfo, sizeof(VkDescriptorImageInfo));\n                    p_descriptorInfos += sizeof(VkDescriptorImageInfo);\n                }\n            }\n        }\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkdev->vkCmdPushDescriptorSetWithTemplateKHR(d->compute_command_buffer, pipeline->descriptor_update_template(), pipeline->pipeline_layout(), 0, descriptorInfos.data());\n        }\n        else\n        {\n            // create new descriptor_pool and descriptorset\n            VkDescriptorPool descriptor_pool;\n            {\n                int image_binding_count = 0;\n                int sampler_binding_count = 0;\n                for (int i = 0; i < binding_count; i++)\n                {\n                    int binding_type = shader_info.binding_types[i];\n\n                    if (binding_type == 2)\n                        image_binding_count++;\n                    else // if (binding_type == 3)\n                        sampler_binding_count++;\n                }\n\n                VkDescriptorPoolSize poolSizes[3];\n                poolSizes[0].type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;\n                poolSizes[0].descriptorCount = buffer_binding_count;\n                poolSizes[1].type = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;\n                poolSizes[1].descriptorCount = image_binding_count;\n                poolSizes[2].type = 
VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;\n                poolSizes[2].descriptorCount = sampler_binding_count;\n\n                VkDescriptorPoolCreateInfo descriptorPoolCreateInfo;\n                descriptorPoolCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;\n                descriptorPoolCreateInfo.pNext = 0;\n                descriptorPoolCreateInfo.flags = VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT;\n                descriptorPoolCreateInfo.maxSets = 1;\n                descriptorPoolCreateInfo.poolSizeCount = 3;\n                descriptorPoolCreateInfo.pPoolSizes = poolSizes;\n\n                VkResult ret = vkCreateDescriptorPool(vkdev->vkdevice(), &descriptorPoolCreateInfo, 0, &descriptor_pool);\n                if (ret != VK_SUCCESS)\n                {\n                    NCNN_LOGE(\"vkCreateDescriptorPool failed %d\", ret);\n                    return;\n                }\n            }\n            d->descriptor_pools.push_back(descriptor_pool);\n\n            VkDescriptorSet descriptorset;\n            {\n                VkDescriptorSetLayout descriptorset_layout = pipeline->descriptorset_layout();\n\n                VkDescriptorSetAllocateInfo descriptorSetAllocateInfo;\n                descriptorSetAllocateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;\n                descriptorSetAllocateInfo.pNext = 0;\n                descriptorSetAllocateInfo.descriptorPool = descriptor_pool;\n                descriptorSetAllocateInfo.descriptorSetCount = 1;\n                descriptorSetAllocateInfo.pSetLayouts = &descriptorset_layout;\n\n                VkResult ret = vkAllocateDescriptorSets(vkdev->vkdevice(), &descriptorSetAllocateInfo, &descriptorset);\n                if (ret != VK_SUCCESS)\n                {\n                    NCNN_LOGE(\"vkAllocateDescriptorSets failed %d\", ret);\n                    return;\n                }\n            }\n            d->descriptorsets.push_back(descriptorset);\n\n        
    if (vkdev->info.support_VK_KHR_descriptor_update_template())\n            {\n                vkdev->vkUpdateDescriptorSetWithTemplateKHR(vkdev->vkdevice(), descriptorset, pipeline->descriptor_update_template(), descriptorInfos.data());\n            }\n            else\n            {\n                std::vector<VkWriteDescriptorSet> writeDescriptorSets(binding_count);\n                {\n                    const unsigned char* p_descriptorInfos = descriptorInfos.data();\n                    for (int i = 0; i < binding_count; i++)\n                    {\n                        int binding_type = shader_info.binding_types[i];\n\n                        writeDescriptorSets[i].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;\n                        writeDescriptorSets[i].pNext = 0;\n                        writeDescriptorSets[i].dstSet = descriptorset;\n                        writeDescriptorSets[i].dstBinding = i;\n                        writeDescriptorSets[i].dstArrayElement = 0;\n                        writeDescriptorSets[i].descriptorCount = 1;\n                        writeDescriptorSets[i].pTexelBufferView = 0;\n\n                        if (binding_type == 1)\n                        {\n                            writeDescriptorSets[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;\n                            writeDescriptorSets[i].pImageInfo = 0;\n                            writeDescriptorSets[i].pBufferInfo = (const VkDescriptorBufferInfo*)p_descriptorInfos;\n\n                            p_descriptorInfos += sizeof(VkDescriptorBufferInfo);\n                        }\n                        else if (binding_type == 2)\n                        {\n                            writeDescriptorSets[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;\n                            writeDescriptorSets[i].pImageInfo = (const VkDescriptorImageInfo*)p_descriptorInfos;\n                            writeDescriptorSets[i].pBufferInfo = 0;\n\n                
            p_descriptorInfos += sizeof(VkDescriptorImageInfo);\n                        }\n                        else // if (binding_type == 3)\n                        {\n                            writeDescriptorSets[i].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;\n                            writeDescriptorSets[i].pImageInfo = (const VkDescriptorImageInfo*)p_descriptorInfos;\n                            writeDescriptorSets[i].pBufferInfo = 0;\n\n                            p_descriptorInfos += sizeof(VkDescriptorImageInfo);\n                        }\n                    }\n                }\n\n                vkUpdateDescriptorSets(vkdev->vkdevice(), binding_count, writeDescriptorSets.data(), 0, 0);\n            }\n\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_bind_descriptorsets;\n            r.command_buffer = d->compute_command_buffer;\n            r.bind_descriptorsets.bind_point = VK_PIPELINE_BIND_POINT_COMPUTE;\n            r.bind_descriptorsets.pipeline_layout = pipeline->pipeline_layout();\n            r.bind_descriptorsets.descriptorset_count = 1;\n            r.bind_descriptorsets.descriptorset_offset = d->descriptorsets.size() - 1;\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // record push constants\n    if (constant_count > 0)\n    {\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPushConstants(d->compute_command_buffer, pipeline->pipeline_layout(), VK_SHADER_STAGE_COMPUTE_BIT, 0, constant_count * sizeof(vk_constant_type), constants.data());\n        }\n        else\n        {\n            uint32_t size = constant_count * sizeof(vk_constant_type);\n            unsigned char* constant_values = new unsigned char[size];\n            memcpy(constant_values, constants.data(), size);\n\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_push_constants;\n            r.command_buffer 
= d->compute_command_buffer;\n            r.push_constants.pipeline_layout = pipeline->pipeline_layout();\n            r.push_constants.stage_flags = VK_SHADER_STAGE_COMPUTE_BIT;\n            r.push_constants.size = size;\n            r.push_constants.values = constant_values;\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // record dispatch\n    {\n        uint32_t group_count_x = (dispatcher.w + pipeline->local_size_x() - 1) / pipeline->local_size_x();\n        uint32_t group_count_y = (dispatcher.h * (dispatcher.d ? dispatcher.d : 1) + pipeline->local_size_y() - 1) / pipeline->local_size_y();\n        uint32_t group_count_z = (dispatcher.c + pipeline->local_size_z() - 1) / pipeline->local_size_z();\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdDispatch(d->compute_command_buffer, group_count_x, group_count_y, group_count_z);\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_dispatch;\n            r.command_buffer = d->compute_command_buffer;\n            r.dispatch.group_count_x = group_count_x;\n            r.dispatch.group_count_y = group_count_y;\n            r.dispatch.group_count_z = group_count_z;\n            d->delayed_records.push_back(r);\n        }\n\n        d->pending_dispatch_total += group_count_x * group_count_y * group_count_z;\n    }\n}\n\n#if NCNN_BENCHMARK\nvoid VkCompute::record_write_timestamp(uint32_t query)\n{\n    if (vkdev->info.support_VK_KHR_push_descriptor())\n    {\n        if (d->query_pool)\n            vkCmdWriteTimestamp(d->compute_command_buffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, d->query_pool, query);\n    }\n    else\n    {\n        VkComputePrivate::record r;\n        r.type = VkComputePrivate::record::TYPE_write_timestamp;\n        r.command_buffer = d->compute_command_buffer;\n        r.write_timestamp.query = query;\n        d->delayed_records.push_back(r);\n    
}\n}\n#endif // NCNN_BENCHMARK\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 26\nvoid VkCompute::record_import_android_hardware_buffer(const ImportAndroidHardwareBufferPipeline* pipeline, const VkImageMat& src, const VkMat& dst)\n{\n    // image layout transform undefined @ null to shader-readonly-optimal @ compute\n    {\n        VkImageMemoryBarrier* barriers = new VkImageMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = 0;\n        barriers[0].dstAccessMask = VK_ACCESS_SHADER_READ_BIT;\n        barriers[0].oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;\n        barriers[0].newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].image = src.image();\n        barriers[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        barriers[0].subresourceRange.baseMipLevel = 0;\n        barriers[0].subresourceRange.levelCount = 1;\n        barriers[0].subresourceRange.baseArrayLayer = 0;\n        barriers[0].subresourceRange.layerCount = 1;\n\n        VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 0, 0, 1, barriers);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_image_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.image_barrers.src_stage = src_stage;\n            r.image_barrers.dst_stage = dst_stage;\n            r.image_barrers.barrier_count = 1;\n            r.image_barrers.barriers = barriers;\n            
d->delayed_records.push_back(r);\n        }\n    }\n\n    // record bind pipeline\n    {\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdBindPipeline(d->compute_command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline->pipeline());\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_bind_pipeline;\n            r.command_buffer = d->compute_command_buffer;\n            r.bind_pipeline.bind_point = VK_PIPELINE_BIND_POINT_COMPUTE;\n            r.bind_pipeline.pipeline = pipeline->pipeline();\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // record update bindings\n    {\n        VkDescriptorImageInfo descriptorImageInfo;\n        descriptorImageInfo.sampler = pipeline->sampler;\n        descriptorImageInfo.imageView = src.imageview();\n        descriptorImageInfo.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;\n\n        VkDescriptorBufferInfo descriptorBufferInfo;\n        descriptorBufferInfo.buffer = dst.buffer();\n        descriptorBufferInfo.offset = dst.buffer_offset();\n        descriptorBufferInfo.range = dst.total() * dst.elemsize;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            struct ImportAndroidHardwareBufferDescriptorInfo\n            {\n                VkDescriptorImageInfo imageInfo;\n                VkDescriptorBufferInfo bufferInfo;\n                VkDescriptorBufferInfo buffer4Info;\n            };\n\n            ImportAndroidHardwareBufferDescriptorInfo info;\n            info.imageInfo = descriptorImageInfo;\n            info.bufferInfo = descriptorBufferInfo;\n            info.buffer4Info = descriptorBufferInfo;\n\n            vkdev->vkCmdPushDescriptorSetWithTemplateKHR(d->compute_command_buffer, pipeline->descriptor_update_template(), pipeline->pipeline_layout(), 0, &info);\n        }\n        else\n        {\n            // create new descriptor_pool and 
descriptorset\n            VkDescriptorPool descriptor_pool;\n            {\n                VkDescriptorPoolSize poolSizes[2];\n                poolSizes[0].type = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;\n                poolSizes[0].descriptorCount = 1;\n                poolSizes[1].type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;\n                poolSizes[1].descriptorCount = 2;\n\n                VkDescriptorPoolCreateInfo descriptorPoolCreateInfo;\n                descriptorPoolCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;\n                descriptorPoolCreateInfo.pNext = 0;\n                descriptorPoolCreateInfo.flags = VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT;\n                descriptorPoolCreateInfo.maxSets = 1;\n                descriptorPoolCreateInfo.poolSizeCount = 2;\n                descriptorPoolCreateInfo.pPoolSizes = poolSizes;\n\n                VkResult ret = vkCreateDescriptorPool(vkdev->vkdevice(), &descriptorPoolCreateInfo, 0, &descriptor_pool);\n                if (ret != VK_SUCCESS)\n                {\n                    NCNN_LOGE(\"vkCreateDescriptorPool failed %d\", ret);\n                    return;\n                }\n            }\n            d->descriptor_pools.push_back(descriptor_pool);\n\n            VkDescriptorSet descriptorset;\n            {\n                VkDescriptorSetLayout descriptorset_layout = pipeline->descriptorset_layout();\n\n                VkDescriptorSetAllocateInfo descriptorSetAllocateInfo;\n                descriptorSetAllocateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;\n                descriptorSetAllocateInfo.pNext = 0;\n                descriptorSetAllocateInfo.descriptorPool = descriptor_pool;\n                descriptorSetAllocateInfo.descriptorSetCount = 1;\n                descriptorSetAllocateInfo.pSetLayouts = &descriptorset_layout;\n\n                VkResult ret = vkAllocateDescriptorSets(vkdev->vkdevice(), &descriptorSetAllocateInfo, 
&descriptorset);\n                if (ret != VK_SUCCESS)\n                {\n                    NCNN_LOGE(\"vkAllocateDescriptorSets failed %d\", ret);\n                    return;\n                }\n            }\n            d->descriptorsets.push_back(descriptorset);\n\n            if (vkdev->info.support_VK_KHR_descriptor_update_template())\n            {\n                struct ImportAndroidHardwareBufferDescriptorInfo\n                {\n                    VkDescriptorImageInfo imageInfo;\n                    VkDescriptorBufferInfo bufferInfo;\n                    VkDescriptorBufferInfo buffer4Info;\n                };\n\n                ImportAndroidHardwareBufferDescriptorInfo info;\n                info.imageInfo = descriptorImageInfo;\n                info.bufferInfo = descriptorBufferInfo;\n                info.buffer4Info = descriptorBufferInfo;\n\n                vkdev->vkUpdateDescriptorSetWithTemplateKHR(vkdev->vkdevice(), descriptorset, pipeline->descriptor_update_template(), &info);\n            }\n            else\n            {\n                VkWriteDescriptorSet writeDescriptorSets[3];\n                writeDescriptorSets[0].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;\n                writeDescriptorSets[0].pNext = 0;\n                writeDescriptorSets[0].dstSet = descriptorset;\n                writeDescriptorSets[0].dstBinding = 0;\n                writeDescriptorSets[0].dstArrayElement = 0;\n                writeDescriptorSets[0].descriptorCount = 1;\n                writeDescriptorSets[0].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;\n                writeDescriptorSets[0].pImageInfo = &descriptorImageInfo;\n                writeDescriptorSets[0].pBufferInfo = 0;\n                writeDescriptorSets[0].pTexelBufferView = 0;\n                writeDescriptorSets[1].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;\n                writeDescriptorSets[1].pNext = 0;\n                writeDescriptorSets[1].dstSet = 
descriptorset;\n                writeDescriptorSets[1].dstBinding = 1;\n                writeDescriptorSets[1].dstArrayElement = 0;\n                writeDescriptorSets[1].descriptorCount = 1;\n                writeDescriptorSets[1].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;\n                writeDescriptorSets[1].pImageInfo = 0;\n                writeDescriptorSets[1].pBufferInfo = &descriptorBufferInfo;\n                writeDescriptorSets[1].pTexelBufferView = 0;\n                writeDescriptorSets[2].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;\n                writeDescriptorSets[2].pNext = 0;\n                writeDescriptorSets[2].dstSet = descriptorset;\n                writeDescriptorSets[2].dstBinding = 2;\n                writeDescriptorSets[2].dstArrayElement = 0;\n                writeDescriptorSets[2].descriptorCount = 1;\n                writeDescriptorSets[2].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;\n                writeDescriptorSets[2].pImageInfo = 0;\n                writeDescriptorSets[2].pBufferInfo = &descriptorBufferInfo;\n                writeDescriptorSets[2].pTexelBufferView = 0;\n\n                vkUpdateDescriptorSets(vkdev->vkdevice(), 3, writeDescriptorSets, 0, 0);\n            }\n\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_bind_descriptorsets;\n            r.command_buffer = d->compute_command_buffer;\n            r.bind_descriptorsets.bind_point = VK_PIPELINE_BIND_POINT_COMPUTE;\n            r.bind_descriptorsets.pipeline_layout = pipeline->pipeline_layout();\n            r.bind_descriptorsets.descriptorset_count = 1;\n            r.bind_descriptorsets.descriptorset_offset = d->descriptorsets.size() - 1;\n            d->delayed_records.push_back(r);\n        }\n    }\n\n    // record dispatch\n    {\n        uint32_t group_count_x = (dst.w + pipeline->local_size_x() - 1) / pipeline->local_size_x();\n        uint32_t group_count_y = (dst.h + 
pipeline->local_size_y() - 1) / pipeline->local_size_y();\n        uint32_t group_count_z = (dst.c + pipeline->local_size_z() - 1) / pipeline->local_size_z();\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdDispatch(d->compute_command_buffer, group_count_x, group_count_y, group_count_z);\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_dispatch;\n            r.command_buffer = d->compute_command_buffer;\n            r.dispatch.group_count_x = group_count_x;\n            r.dispatch.group_count_y = group_count_y;\n            r.dispatch.group_count_z = group_count_z;\n            d->delayed_records.push_back(r);\n        }\n    }\n}\n#endif // __ANDROID_API__ >= 26\n#endif // NCNN_PLATFORM_API\n\nint VkCompute::submit_and_wait()\n{\n    //     NCNN_LOGE(\"submit_and_wait\");\n\n    if (!vkdev->info.support_VK_KHR_push_descriptor())\n    {\n        d->begin_command_buffer();\n\n#if NCNN_BENCHMARK\n        if (d->query_pool)\n            vkCmdResetQueryPool(d->compute_command_buffer, d->query_pool, 0, d->query_count);\n#endif // NCNN_BENCHMARK\n\n        const size_t record_count = d->delayed_records.size();\n\n        // handle delayed records\n        for (size_t i = 0; i < record_count; i++)\n        {\n            const VkComputePrivate::record& r = d->delayed_records[i];\n\n            switch (r.type)\n            {\n            case VkComputePrivate::record::TYPE_copy_buffer:\n            {\n                vkCmdCopyBuffer(r.command_buffer, r.copy_buffer.src, r.copy_buffer.dst, r.copy_buffer.region_count, r.copy_buffer.regions);\n                delete[] r.copy_buffer.regions;\n                break;\n            }\n            case VkComputePrivate::record::TYPE_copy_image:\n            {\n                vkCmdCopyImage(r.command_buffer, r.copy_image.src, r.copy_image.src_layout, r.copy_image.dst, r.copy_image.dst_layout, 
r.copy_image.region_count, r.copy_image.regions);\n                delete[] r.copy_image.regions;\n                break;\n            }\n            case VkComputePrivate::record::TYPE_copy_buffer_to_image:\n            {\n                vkCmdCopyBufferToImage(r.command_buffer, r.copy_buffer_to_image.src, r.copy_buffer_to_image.dst, r.copy_buffer_to_image.layout, r.copy_buffer_to_image.region_count, r.copy_buffer_to_image.regions);\n                delete[] r.copy_buffer_to_image.regions;\n                break;\n            }\n            case VkComputePrivate::record::TYPE_copy_image_to_buffer:\n            {\n                vkCmdCopyImageToBuffer(r.command_buffer, r.copy_image_to_buffer.src, r.copy_image_to_buffer.layout, r.copy_image_to_buffer.dst, r.copy_image_to_buffer.region_count, r.copy_image_to_buffer.regions);\n                delete[] r.copy_image_to_buffer.regions;\n                break;\n            }\n            case VkComputePrivate::record::TYPE_bind_pipeline:\n            {\n                vkCmdBindPipeline(r.command_buffer, r.bind_pipeline.bind_point, r.bind_pipeline.pipeline);\n                break;\n            }\n            case VkComputePrivate::record::TYPE_bind_descriptorsets:\n            {\n                vkCmdBindDescriptorSets(r.command_buffer, r.bind_descriptorsets.bind_point, r.bind_descriptorsets.pipeline_layout, 0, r.bind_descriptorsets.descriptorset_count, &d->descriptorsets[r.bind_descriptorsets.descriptorset_offset], 0, 0);\n                break;\n            }\n            case VkComputePrivate::record::TYPE_push_constants:\n            {\n                vkCmdPushConstants(r.command_buffer, r.push_constants.pipeline_layout, r.push_constants.stage_flags, 0, r.push_constants.size, r.push_constants.values);\n                delete[](unsigned char*) r.push_constants.values;\n                break;\n            }\n            case VkComputePrivate::record::TYPE_dispatch:\n            {\n                
vkCmdDispatch(r.command_buffer, r.dispatch.group_count_x, r.dispatch.group_count_y, r.dispatch.group_count_z);\n                break;\n            }\n            case VkComputePrivate::record::TYPE_memory_barrers:\n            {\n                vkCmdPipelineBarrier(r.command_buffer, r.memory_barrers.src_stage, r.memory_barrers.dst_stage, 0, r.memory_barrers.barrier_count, r.memory_barrers.barriers, 0, 0, 0, 0);\n                delete[] r.memory_barrers.barriers;\n                break;\n            }\n            case VkComputePrivate::record::TYPE_buffer_barrers:\n            {\n                vkCmdPipelineBarrier(r.command_buffer, r.buffer_barrers.src_stage, r.buffer_barrers.dst_stage, 0, 0, 0, r.buffer_barrers.barrier_count, r.buffer_barrers.barriers, 0, 0);\n                delete[] r.buffer_barrers.barriers;\n                break;\n            }\n            case VkComputePrivate::record::TYPE_image_barrers:\n            {\n                vkCmdPipelineBarrier(r.command_buffer, r.image_barrers.src_stage, r.image_barrers.dst_stage, 0, 0, 0, 0, 0, r.image_barrers.barrier_count, r.image_barrers.barriers);\n                delete[] r.image_barrers.barriers;\n                break;\n            }\n#if NCNN_BENCHMARK\n            case VkComputePrivate::record::TYPE_write_timestamp:\n            {\n                if (d->query_pool)\n                    vkCmdWriteTimestamp(r.command_buffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, d->query_pool, r.write_timestamp.query);\n                break;\n            }\n#endif // NCNN_BENCHMARK\n            case VkComputePrivate::record::TYPE_post_download:\n            case VkComputePrivate::record::TYPE_post_cast_float16_to_float32:\n            case VkComputePrivate::record::TYPE_post_cast_bfloat16_to_float32:\n            default:\n                break;\n            }\n        }\n    }\n\n    // end command buffer\n    {\n        d->end_command_buffer();\n    }\n\n    // acquire queue and reclaim on return\n    VkQueue 
compute_queue = vkdev->acquire_queue(vkdev->info.compute_queue_family_index());\n    if (compute_queue == 0)\n    {\n        NCNN_LOGE(\"out of compute queue\");\n        return -1;\n    }\n\n    // submit compute\n    {\n        VkSubmitInfo submitInfo;\n        submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;\n        submitInfo.pNext = 0;\n        submitInfo.waitSemaphoreCount = 0;\n        submitInfo.pWaitSemaphores = 0;\n        submitInfo.pWaitDstStageMask = 0;\n        submitInfo.commandBufferCount = 1;\n        submitInfo.pCommandBuffers = &d->compute_command_buffer;\n        submitInfo.signalSemaphoreCount = 0;\n        submitInfo.pSignalSemaphores = 0;\n\n        VkResult ret = vkQueueSubmit(compute_queue, 1, &submitInfo, d->compute_command_fence);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkQueueSubmit failed %d\", ret);\n            vkdev->reclaim_queue(vkdev->info.compute_queue_family_index(), compute_queue);\n            return -1;\n        }\n    }\n\n    vkdev->reclaim_queue(vkdev->info.compute_queue_family_index(), compute_queue);\n\n    // wait\n    {\n        VkResult ret = vkWaitForFences(vkdev->vkdevice(), 1, &d->compute_command_fence, VK_TRUE, (uint64_t)-1);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkWaitForFences failed %d\", ret);\n            return -1;\n        }\n    }\n\n    // handle delayed post records\n    for (size_t i = 0; i < d->delayed_records.size(); i++)\n    {\n        const VkComputePrivate::record& r = d->delayed_records[i];\n\n        switch (r.type)\n        {\n        case VkComputePrivate::record::TYPE_post_download:\n        {\n            const VkMat& src = d->download_post_buffers[r.post_download.download_post_buffer_mat_offset];\n            Mat& dst = d->download_post_mats_fp16[r.post_download.download_post_mat_fp16_offset];\n\n            // NCNN_LOGE(\"post_download  %p +%d ~%d  -> %p\", src.buffer(), src.buffer_offset(), src.buffer_capacity(), dst.data);\n\n   
         src.allocator->invalidate(src.data);\n            memcpy(dst.data, src.mapped_ptr(), dst.total() * dst.elemsize);\n            break;\n        }\n        case VkComputePrivate::record::TYPE_post_cast_float16_to_float32:\n        {\n            // NCNN_LOGE(\"post_cast_float16_to_float32\");\n\n            const Mat& src = d->download_post_mats_fp16[r.post_cast_float16_to_float32.download_post_mat_fp16_offset];\n            Mat& dst = d->download_post_mats[r.post_cast_float16_to_float32.download_post_mat_offset];\n\n            Option opt;\n            opt.num_threads = r.post_cast_float16_to_float32.num_threads;\n            opt.blob_allocator = dst.allocator;\n            ncnn::cast_float16_to_float32(src, dst, opt);\n            break;\n        }\n        case VkComputePrivate::record::TYPE_post_cast_bfloat16_to_float32:\n        {\n            // NCNN_LOGE(\"post_cast_bfloat16_to_float32\");\n\n            const Mat& src = d->download_post_mats_fp16[r.post_cast_bfloat16_to_float32.download_post_mat_bf16_offset];\n            Mat& dst = d->download_post_mats[r.post_cast_bfloat16_to_float32.download_post_mat_offset];\n\n            Option opt;\n            opt.num_threads = r.post_cast_bfloat16_to_float32.num_threads;\n            opt.blob_allocator = dst.allocator;\n            ncnn::cast_bfloat16_to_float32(src, dst, opt);\n            break;\n        }\n        default:\n            break;\n        }\n    }\n\n    d->delayed_records.clear();\n\n    d->pending_dispatch_total = 0;\n\n    return 0;\n}\n\nint VkCompute::reset()\n{\n    d->upload_staging_buffers.clear();\n    d->download_post_buffers.clear();\n    d->download_post_mats_fp16.clear();\n    d->download_post_mats.clear();\n\n    for (size_t i = 0; i < d->image_blocks_to_destroy.size(); i++)\n    {\n        VkImageMemory* ptr = d->image_blocks_to_destroy[i];\n\n        int old_command_refcount = NCNN_XADD(&ptr->command_refcount, -1);\n        if (ptr->refcount == 0 && old_command_refcount == 
1)\n        {\n            // no userspace reference and we are the last command reference\n            vkDestroyImageView(vkdev->vkdevice(), ptr->imageview, 0);\n            vkDestroyImage(vkdev->vkdevice(), ptr->image, 0);\n\n            delete ptr;\n        }\n        else\n        {\n            // reference exists in user code or other command\n        }\n    }\n    d->image_blocks_to_destroy.clear();\n\n    if (!vkdev->info.support_VK_KHR_push_descriptor())\n    {\n        for (size_t i = 0; i < d->descriptorsets.size(); i++)\n        {\n            vkFreeDescriptorSets(vkdev->vkdevice(), d->descriptor_pools[i], 1, &d->descriptorsets[i]);\n            vkDestroyDescriptorPool(vkdev->vkdevice(), d->descriptor_pools[i], 0);\n        }\n        d->descriptor_pools.clear();\n        d->descriptorsets.clear();\n    }\n\n    d->delayed_records.clear();\n\n    d->pending_dispatch_total = 0;\n\n    // reset command buffer and fence\n    {\n        VkResult ret = vkResetCommandBuffer(d->compute_command_buffer, 0);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkResetCommandBuffer failed %d\", ret);\n            return -1;\n        }\n    }\n    {\n        VkResult ret = vkResetFences(vkdev->vkdevice(), 1, &d->compute_command_fence);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkResetFences failed %d\", ret);\n            return -1;\n        }\n    }\n\n    if (vkdev->info.support_VK_KHR_push_descriptor())\n    {\n        d->begin_command_buffer();\n\n#if NCNN_BENCHMARK\n        if (d->query_pool)\n            vkCmdResetQueryPool(d->compute_command_buffer, d->query_pool, 0, d->query_count);\n#endif // NCNN_BENCHMARK\n    }\n\n    return 0;\n}\n\nuint64_t VkCompute::pending_dispatch_total() const\n{\n    return d->pending_dispatch_total;\n}\n\n#if NCNN_BENCHMARK\nint VkCompute::create_query_pool(uint32_t _query_count)\n{\n    d->query_count = _query_count;\n\n    VkQueryPoolCreateInfo queryPoolCreateInfo;\n    
queryPoolCreateInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;\n    queryPoolCreateInfo.pNext = 0;\n    queryPoolCreateInfo.flags = 0;\n    queryPoolCreateInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;\n    queryPoolCreateInfo.queryCount = d->query_count;\n    queryPoolCreateInfo.pipelineStatistics = 0;\n\n    VkResult ret = vkCreateQueryPool(vkdev->vkdevice(), &queryPoolCreateInfo, 0, &d->query_pool);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateQueryPool failed %d\", ret);\n        return -1;\n    }\n\n    if (vkdev->info.support_VK_KHR_push_descriptor())\n    {\n        if (d->query_pool)\n            vkCmdResetQueryPool(d->compute_command_buffer, d->query_pool, 0, d->query_count);\n    }\n\n    return 0;\n}\n\nint VkCompute::get_query_pool_results(uint32_t first_query, uint32_t query_count, std::vector<uint64_t>& results)\n{\n    if (results.size() < first_query + query_count)\n    {\n        NCNN_LOGE(\"results not large enough\");\n        return -1;\n    }\n\n    VkResult ret = vkGetQueryPoolResults(vkdev->vkdevice(), d->query_pool, first_query, query_count,\n                                         query_count * sizeof(uint64_t), results.data() + first_query, sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);\n    if (ret != VK_SUCCESS && ret != VK_NOT_READY)\n    {\n        NCNN_LOGE(\"vkGetQueryPoolResults failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n#endif // NCNN_BENCHMARK\n\nvoid VkCompute::barrier_readwrite(const VkMat& binding)\n{\n    if (binding.data->access_flags & VK_ACCESS_SHADER_WRITE_BIT || binding.data->stage_flags != VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)\n    {\n        // barrier device any @ compute/null to shader-readwrite @ compute\n        VkBufferMemoryBarrier* barriers = new VkBufferMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = binding.data->access_flags;\n        barriers[0].dstAccessMask = 
VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].buffer = binding.buffer();\n        barriers[0].offset = binding.buffer_offset();\n        barriers[0].size = binding.buffer_capacity();\n\n        VkPipelineStageFlags src_stage = binding.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, barriers, 0, 0);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_buffer_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.buffer_barrers.src_stage = src_stage;\n            r.buffer_barrers.dst_stage = dst_stage;\n            r.buffer_barrers.barrier_count = 1;\n            r.buffer_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark device shader-readwrite @ compute\n        binding.data->access_flags = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;\n        binding.data->stage_flags = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n    }\n}\n\nvoid VkCompute::barrier_readwrite(const VkImageMat& binding)\n{\n    if (binding.data->access_flags & VK_ACCESS_SHADER_WRITE_BIT || binding.data->image_layout != VK_IMAGE_LAYOUT_GENERAL || binding.data->stage_flags != VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)\n    {\n        // image layout transform any @ any to shader-write @ compute\n        VkImageMemoryBarrier* barriers = new VkImageMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = binding.data->access_flags;\n        
barriers[0].dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;\n        barriers[0].oldLayout = binding.data->image_layout;\n        barriers[0].newLayout = VK_IMAGE_LAYOUT_GENERAL;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].image = binding.image();\n        barriers[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        barriers[0].subresourceRange.baseMipLevel = 0;\n        barriers[0].subresourceRange.levelCount = 1;\n        barriers[0].subresourceRange.baseArrayLayer = 0;\n        barriers[0].subresourceRange.layerCount = 1;\n\n        VkPipelineStageFlags src_stage = binding.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 0, 0, 1, barriers);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_image_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.image_barrers.src_stage = src_stage;\n            r.image_barrers.dst_stage = dst_stage;\n            r.image_barrers.barrier_count = 1;\n            r.image_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark image shader-write @ compute\n        binding.data->access_flags = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;\n        binding.data->image_layout = VK_IMAGE_LAYOUT_GENERAL;\n        binding.data->stage_flags = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n    }\n}\n\nvoid VkCompute::barrier_readonly(const VkImageMat& binding)\n{\n    if (binding.data->access_flags & VK_ACCESS_SHADER_WRITE_BIT || binding.data->image_layout != 
VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL || binding.data->stage_flags != VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)\n    {\n        // image layout transform any @ any to shader-readonly-optimal @ compute\n        VkImageMemoryBarrier* barriers = new VkImageMemoryBarrier[1];\n        barriers[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;\n        barriers[0].pNext = 0;\n        barriers[0].srcAccessMask = binding.data->access_flags;\n        barriers[0].dstAccessMask = VK_ACCESS_SHADER_READ_BIT;\n        barriers[0].oldLayout = binding.data->image_layout;\n        barriers[0].newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;\n        barriers[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barriers[0].image = binding.image();\n        barriers[0].subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\n        barriers[0].subresourceRange.baseMipLevel = 0;\n        barriers[0].subresourceRange.levelCount = 1;\n        barriers[0].subresourceRange.baseArrayLayer = 0;\n        barriers[0].subresourceRange.layerCount = 1;\n\n        VkPipelineStageFlags src_stage = binding.data->stage_flags;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n        if (vkdev->info.support_VK_KHR_push_descriptor())\n        {\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 0, 0, 1, barriers);\n            delete[] barriers;\n        }\n        else\n        {\n            VkComputePrivate::record r;\n            r.type = VkComputePrivate::record::TYPE_image_barrers;\n            r.command_buffer = d->compute_command_buffer;\n            r.image_barrers.src_stage = src_stage;\n            r.image_barrers.dst_stage = dst_stage;\n            r.image_barrers.barrier_count = 1;\n            r.image_barrers.barriers = barriers;\n            d->delayed_records.push_back(r);\n        }\n\n        // mark image shader-readonly-optimal @ 
compute\n        binding.data->access_flags = VK_ACCESS_SHADER_READ_BIT;\n        binding.data->image_layout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;\n        binding.data->stage_flags = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n    }\n}\n\nclass VkTransferPrivate\n{\npublic:\n    VkTransferPrivate(const VulkanDevice* _vkdev);\n    ~VkTransferPrivate();\n\n    int init();\n    int begin_command_buffer();\n    int end_command_buffer();\n\n    const VulkanDevice* vkdev;\n\n    uint64_t pending_upload_total;\n\n    VkCommandPool compute_command_pool;\n    VkCommandPool transfer_command_pool;\n\n    VkCommandBuffer upload_command_buffer;\n    VkCommandBuffer compute_command_buffer;\n\n    VkSemaphore upload_compute_semaphore;\n\n    VkFence upload_command_fence;\n    VkFence compute_command_fence;\n\n    std::vector<VkMat> upload_staging_buffers;\n};\n\nVkTransferPrivate::VkTransferPrivate(const VulkanDevice* _vkdev)\n    : vkdev(_vkdev)\n{\n    pending_upload_total = 0;\n\n    compute_command_pool = 0;\n    transfer_command_pool = 0;\n\n    upload_command_buffer = 0;\n    compute_command_buffer = 0;\n\n    upload_compute_semaphore = 0;\n\n    upload_command_fence = 0;\n    compute_command_fence = 0;\n\n    init();\n}\n\nVkTransferPrivate::~VkTransferPrivate()\n{\n    vkDestroyFence(vkdev->vkdevice(), compute_command_fence, 0);\n\n    vkFreeCommandBuffers(vkdev->vkdevice(), compute_command_pool, 1, &compute_command_buffer);\n    vkDestroyCommandPool(vkdev->vkdevice(), compute_command_pool, 0);\n\n    if (!vkdev->info.unified_compute_transfer_queue())\n    {\n        vkDestroyFence(vkdev->vkdevice(), upload_command_fence, 0);\n\n        vkDestroySemaphore(vkdev->vkdevice(), upload_compute_semaphore, 0);\n\n        vkFreeCommandBuffers(vkdev->vkdevice(), transfer_command_pool, 1, &upload_command_buffer);\n        vkDestroyCommandPool(vkdev->vkdevice(), transfer_command_pool, 0);\n    }\n}\n\nint VkTransferPrivate::init()\n{\n    // compute_command_pool\n    {\n        
VkCommandPoolCreateInfo commandPoolCreateInfo;\n        commandPoolCreateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;\n        commandPoolCreateInfo.pNext = 0;\n        commandPoolCreateInfo.flags = 0;\n        commandPoolCreateInfo.queueFamilyIndex = vkdev->info.compute_queue_family_index();\n\n        VkResult ret = vkCreateCommandPool(vkdev->vkdevice(), &commandPoolCreateInfo, 0, &compute_command_pool);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkCreateCommandPool failed %d\", ret);\n            return -1;\n        }\n    }\n\n    // compute_command_buffer\n    {\n        VkCommandBufferAllocateInfo commandBufferAllocateInfo;\n        commandBufferAllocateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;\n        commandBufferAllocateInfo.pNext = 0;\n        commandBufferAllocateInfo.commandPool = compute_command_pool;\n        commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;\n        commandBufferAllocateInfo.commandBufferCount = 1;\n\n        VkResult ret = vkAllocateCommandBuffers(vkdev->vkdevice(), &commandBufferAllocateInfo, &compute_command_buffer);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkAllocateCommandBuffers failed %d\", ret);\n            return -1;\n        }\n    }\n\n    // compute_command_fence\n    {\n        VkFenceCreateInfo fenceCreateInfo;\n        fenceCreateInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;\n        fenceCreateInfo.pNext = 0;\n        fenceCreateInfo.flags = 0;\n\n        VkResult ret = vkCreateFence(vkdev->vkdevice(), &fenceCreateInfo, 0, &compute_command_fence);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkCreateFence failed %d\", ret);\n            return -1;\n        }\n    }\n\n    if (!vkdev->info.unified_compute_transfer_queue())\n    {\n        // transfer_command_pool\n        {\n            VkCommandPoolCreateInfo commandPoolCreateInfo;\n            commandPoolCreateInfo.sType = 
VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;\n            commandPoolCreateInfo.pNext = 0;\n            commandPoolCreateInfo.flags = 0;\n            commandPoolCreateInfo.queueFamilyIndex = vkdev->info.transfer_queue_family_index();\n\n            VkResult ret = vkCreateCommandPool(vkdev->vkdevice(), &commandPoolCreateInfo, 0, &transfer_command_pool);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkCreateCommandPool failed %d\", ret);\n                return -1;\n            }\n        }\n\n        // upload_command_buffer\n        {\n            VkCommandBufferAllocateInfo commandBufferAllocateInfo;\n            commandBufferAllocateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;\n            commandBufferAllocateInfo.pNext = 0;\n            commandBufferAllocateInfo.commandPool = transfer_command_pool;\n            commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;\n            commandBufferAllocateInfo.commandBufferCount = 1;\n\n            VkResult ret = vkAllocateCommandBuffers(vkdev->vkdevice(), &commandBufferAllocateInfo, &upload_command_buffer);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkAllocateCommandBuffers failed %d\", ret);\n                return -1;\n            }\n        }\n\n        // upload_compute_semaphore\n        {\n            VkSemaphoreCreateInfo semaphoreCreateInfo;\n            semaphoreCreateInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;\n            semaphoreCreateInfo.pNext = 0;\n            semaphoreCreateInfo.flags = 0;\n\n            VkResult ret = vkCreateSemaphore(vkdev->vkdevice(), &semaphoreCreateInfo, 0, &upload_compute_semaphore);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkCreateSemaphore failed %d\", ret);\n                return -1;\n            }\n        }\n\n        // upload_command_fence\n        {\n            VkFenceCreateInfo fenceCreateInfo;\n            
fenceCreateInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;\n            fenceCreateInfo.pNext = 0;\n            fenceCreateInfo.flags = 0;\n\n            VkResult ret = vkCreateFence(vkdev->vkdevice(), &fenceCreateInfo, 0, &upload_command_fence);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkCreateFence failed %d\", ret);\n                return -1;\n            }\n        }\n    }\n\n    begin_command_buffer();\n\n    return 0;\n}\n\nint VkTransferPrivate::begin_command_buffer()\n{\n    {\n        VkCommandBufferBeginInfo commandBufferBeginInfo;\n        commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;\n        commandBufferBeginInfo.pNext = 0;\n        commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;\n        commandBufferBeginInfo.pInheritanceInfo = 0;\n\n        VkResult ret = vkBeginCommandBuffer(compute_command_buffer, &commandBufferBeginInfo);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkBeginCommandBuffer failed %d\", ret);\n            return -1;\n        }\n    }\n\n    if (!vkdev->info.unified_compute_transfer_queue())\n    {\n        {\n            VkCommandBufferBeginInfo commandBufferBeginInfo;\n            commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;\n            commandBufferBeginInfo.pNext = 0;\n            commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;\n            commandBufferBeginInfo.pInheritanceInfo = 0;\n\n            VkResult ret = vkBeginCommandBuffer(upload_command_buffer, &commandBufferBeginInfo);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkBeginCommandBuffer failed %d\", ret);\n                return -1;\n            }\n        }\n    }\n\n    return 0;\n}\n\nint VkTransferPrivate::end_command_buffer()\n{\n    {\n        VkResult ret = vkEndCommandBuffer(compute_command_buffer);\n        if (ret != VK_SUCCESS)\n        {\n    
        NCNN_LOGE(\"vkEndCommandBuffer failed %d\", ret);\n            return -1;\n        }\n    }\n\n    if (!vkdev->info.unified_compute_transfer_queue())\n    {\n        {\n            VkResult ret = vkEndCommandBuffer(upload_command_buffer);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkEndCommandBuffer failed %d\", ret);\n                return -1;\n            }\n        }\n    }\n\n    return 0;\n}\n\nVkTransfer::VkTransfer(const VulkanDevice* _vkdev)\n    : vkdev(_vkdev), d(new VkTransferPrivate(_vkdev))\n{\n}\n\nVkTransfer::~VkTransfer()\n{\n    delete d;\n}\n\nvoid VkTransfer::record_upload(const Mat& src, VkMat& dst, const Option& opt, bool flatten)\n{\n    //     NCNN_LOGE(\"record_upload src = %d | %d %d %d @ %d\", src.dims, src.w, src.h, src.c, src.elempack);\n\n    // NOTE keep the hack here ?\n    if (src.elembits() == 32)\n    {\n        if (opt.use_bf16_storage || opt.use_bf16_packed)\n        {\n            Mat src_bf16;\n            cast_float32_to_bfloat16(src, src_bf16, opt);\n\n            record_upload(src_bf16, dst, opt, flatten);\n\n            return;\n        }\n        else if (opt.use_fp16_storage || opt.use_fp16_packed)\n        {\n            Mat src_fp16;\n            cast_float32_to_float16(src, src_fp16, opt);\n\n            record_upload(src_fp16, dst, opt, flatten);\n\n            return;\n        }\n    }\n\n    Mat src_flattened = flatten ? 
src.reshape(src.w * src.h * src.c) : src;\n\n    // create dst\n    dst.create_like(src_flattened, opt.blob_vkallocator);\n\n    if (dst.empty())\n    {\n        return;\n    }\n\n    d->pending_upload_total += dst.total() * dst.elemsize;\n\n    if (dst.allocator->mappable)\n    {\n        // memcpy src_flattened to device\n        memcpy(dst.mapped_ptr(), src_flattened.data, src_flattened.total() * src_flattened.elemsize);\n        dst.allocator->flush(dst.data);\n\n        // barrier device host-write @ null to shader-read @ compute\n        {\n            VkBufferMemoryBarrier barrier;\n            barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n            barrier.pNext = 0;\n            barrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;\n            barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;\n            barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n            barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n            barrier.buffer = dst.buffer();\n            barrier.offset = dst.buffer_offset();\n            barrier.size = dst.buffer_capacity();\n\n            VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_HOST_BIT;\n            VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, &barrier, 0, 0);\n        }\n\n        // mark device shader-read @ compute\n        dst.data->access_flags = VK_ACCESS_SHADER_READ_BIT;\n        dst.data->stage_flags = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n        return;\n    }\n\n    // create staging\n    VkMat dst_staging;\n    dst_staging.create_like(src_flattened, opt.staging_vkallocator);\n\n    // memcpy src_flattened to staging\n    memcpy(dst_staging.mapped_ptr(), src_flattened.data, src_flattened.total() * src_flattened.elemsize);\n    dst_staging.allocator->flush(dst_staging.data);\n\n    VkCommandBuffer command_buffer;\n    if 
(vkdev->info.unified_compute_transfer_queue())\n    {\n        command_buffer = d->compute_command_buffer;\n    }\n    else\n    {\n        command_buffer = d->upload_command_buffer;\n    }\n\n    // barrier staging host-write @ null to transfer-read @ queue\n    {\n        VkBufferMemoryBarrier barrier;\n        barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n        barrier.pNext = 0;\n        barrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;\n        barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;\n        barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n        barrier.buffer = dst_staging.buffer();\n        barrier.offset = dst_staging.buffer_offset();\n        barrier.size = dst_staging.buffer_capacity();\n\n        VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_HOST_BIT;\n        VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n\n        vkCmdPipelineBarrier(command_buffer, src_stage, dst_stage, 0, 0, 0, 1, &barrier, 0, 0);\n    }\n\n    // record staging to device\n    {\n        VkBufferCopy region;\n        region.srcOffset = dst_staging.buffer_offset();\n        region.dstOffset = dst.buffer_offset();\n        region.size = std::min(dst_staging.buffer_capacity(), dst.buffer_capacity());\n\n        vkCmdCopyBuffer(command_buffer, dst_staging.buffer(), dst.buffer(), 1, &region);\n    }\n\n    if (vkdev->info.unified_compute_transfer_queue())\n    {\n        // barrier device transfer-write @ compute to shader-read @ compute\n        {\n            VkBufferMemoryBarrier barrier;\n            barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n            barrier.pNext = 0;\n            barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;\n            barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;\n            barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n            barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;\n           
 barrier.buffer = dst.buffer();\n            barrier.offset = dst.buffer_offset();\n            barrier.size = dst.buffer_capacity();\n\n            VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n            VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n            vkCmdPipelineBarrier(command_buffer, src_stage, dst_stage, 0, 0, 0, 1, &barrier, 0, 0);\n        }\n    }\n    else\n    {\n        // queue ownership transfer transfer-write @ transfer to shader-read @ compute\n\n        // release\n        {\n            VkBufferMemoryBarrier barrier;\n            barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n            barrier.pNext = 0;\n            barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;\n            barrier.dstAccessMask = 0;\n            barrier.srcQueueFamilyIndex = vkdev->info.transfer_queue_family_index();\n            barrier.dstQueueFamilyIndex = vkdev->info.compute_queue_family_index();\n            barrier.buffer = dst.buffer();\n            barrier.offset = dst.buffer_offset();\n            barrier.size = dst.buffer_capacity();\n\n            VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;\n            VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;\n\n            vkCmdPipelineBarrier(d->upload_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, &barrier, 0, 0);\n        }\n\n        // acquire\n        {\n            VkBufferMemoryBarrier barrier;\n            barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;\n            barrier.pNext = 0;\n            barrier.srcAccessMask = 0;\n            barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;\n            barrier.srcQueueFamilyIndex = vkdev->info.transfer_queue_family_index();\n            barrier.dstQueueFamilyIndex = vkdev->info.compute_queue_family_index();\n            barrier.buffer = dst.buffer();\n            barrier.offset = dst.buffer_offset();\n            barrier.size = 
dst.buffer_capacity();\n\n            VkPipelineStageFlags src_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;\n            VkPipelineStageFlags dst_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n            vkCmdPipelineBarrier(d->compute_command_buffer, src_stage, dst_stage, 0, 0, 0, 1, &barrier, 0, 0);\n        }\n    }\n\n    // mark device shader-read @ compute\n    dst.data->access_flags = VK_ACCESS_SHADER_READ_BIT;\n    dst.data->stage_flags = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;\n\n    // stash staging\n    d->upload_staging_buffers.push_back(dst_staging);\n}\n\nint VkTransfer::submit_and_wait()\n{\n    //     NCNN_LOGE(\"submit_and_wait\");\n\n    // end command buffer\n    {\n        d->end_command_buffer();\n    }\n\n    VkQueue compute_queue = vkdev->acquire_queue(vkdev->info.compute_queue_family_index());\n    if (compute_queue == 0)\n    {\n        NCNN_LOGE(\"out of compute queue\");\n        return -1;\n    }\n\n    if (vkdev->info.unified_compute_transfer_queue())\n    {\n        // submit compute\n        {\n            VkSubmitInfo submitInfo;\n            submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;\n            submitInfo.pNext = 0;\n            submitInfo.waitSemaphoreCount = 0;\n            submitInfo.pWaitSemaphores = 0;\n            submitInfo.pWaitDstStageMask = 0;\n            submitInfo.commandBufferCount = 1;\n            submitInfo.pCommandBuffers = &d->compute_command_buffer;\n            submitInfo.signalSemaphoreCount = 0;\n            submitInfo.pSignalSemaphores = 0;\n\n            VkResult ret = vkQueueSubmit(compute_queue, 1, &submitInfo, d->compute_command_fence);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkQueueSubmit failed %d\", ret);\n                vkdev->reclaim_queue(vkdev->info.compute_queue_family_index(), compute_queue);\n                return -1;\n            }\n        }\n    }\n    else\n    {\n        VkQueue transfer_queue = 
vkdev->acquire_queue(vkdev->info.transfer_queue_family_index());\n        if (transfer_queue == 0)\n        {\n            NCNN_LOGE(\"out of transfer queue\");\n            vkdev->reclaim_queue(vkdev->info.compute_queue_family_index(), compute_queue);\n            return -1;\n        }\n\n        // submit upload compute\n        {\n            VkSubmitInfo submitInfo;\n            submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;\n            submitInfo.pNext = 0;\n            submitInfo.waitSemaphoreCount = 0;\n            submitInfo.pWaitSemaphores = 0;\n            submitInfo.pWaitDstStageMask = 0;\n            submitInfo.commandBufferCount = 1;\n            submitInfo.pCommandBuffers = &d->upload_command_buffer;\n            submitInfo.signalSemaphoreCount = 1;\n            submitInfo.pSignalSemaphores = &d->upload_compute_semaphore;\n\n            VkResult ret = vkQueueSubmit(transfer_queue, 1, &submitInfo, d->upload_command_fence);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkQueueSubmit failed %d\", ret);\n                vkdev->reclaim_queue(vkdev->info.transfer_queue_family_index(), transfer_queue);\n                vkdev->reclaim_queue(vkdev->info.compute_queue_family_index(), compute_queue);\n                return -1;\n            }\n        }\n        {\n            VkPipelineStageFlags wait_dst_stage = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT; // FIXME\n\n            VkSubmitInfo submitInfo;\n            submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;\n            submitInfo.pNext = 0;\n            submitInfo.waitSemaphoreCount = 1;\n            submitInfo.pWaitSemaphores = &d->upload_compute_semaphore;\n            submitInfo.pWaitDstStageMask = &wait_dst_stage;\n            submitInfo.commandBufferCount = 1;\n            submitInfo.pCommandBuffers = &d->compute_command_buffer;\n            submitInfo.signalSemaphoreCount = 0;\n            submitInfo.pSignalSemaphores = 0;\n\n            VkResult ret = 
vkQueueSubmit(compute_queue, 1, &submitInfo, d->compute_command_fence);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkQueueSubmit failed %d\", ret);\n                vkdev->reclaim_queue(vkdev->info.transfer_queue_family_index(), transfer_queue);\n                vkdev->reclaim_queue(vkdev->info.compute_queue_family_index(), compute_queue);\n                return -1;\n            }\n        }\n\n        vkdev->reclaim_queue(vkdev->info.transfer_queue_family_index(), transfer_queue);\n    }\n\n    vkdev->reclaim_queue(vkdev->info.compute_queue_family_index(), compute_queue);\n\n    // wait\n    if (vkdev->info.unified_compute_transfer_queue())\n    {\n        VkResult ret = vkWaitForFences(vkdev->vkdevice(), 1, &d->compute_command_fence, VK_TRUE, (uint64_t)-1);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkWaitForFences failed %d\", ret);\n            return -1;\n        }\n    }\n    else\n    {\n        VkFence fences[2] = {d->upload_command_fence, d->compute_command_fence};\n\n        VkResult ret = vkWaitForFences(vkdev->vkdevice(), 2, fences, VK_TRUE, (uint64_t)-1);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkWaitForFences failed %d\", ret);\n            return -1;\n        }\n    }\n\n    d->pending_upload_total = 0;\n\n    return 0;\n}\n\nint VkTransfer::reset()\n{\n    d->upload_staging_buffers.clear();\n\n    d->pending_upload_total = 0;\n\n    // reset command buffer and fence\n    {\n        VkResult ret = vkResetCommandBuffer(d->compute_command_buffer, 0);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkResetCommandBuffer failed %d\", ret);\n            return -1;\n        }\n    }\n    {\n        VkResult ret = vkResetFences(vkdev->vkdevice(), 1, &d->compute_command_fence);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkResetFences failed %d\", ret);\n            return -1;\n        }\n    }\n\n    if 
(!vkdev->info.unified_compute_transfer_queue())\n    {\n        {\n            VkResult ret = vkResetCommandBuffer(d->upload_command_buffer, 0);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkResetCommandBuffer failed %d\", ret);\n                return -1;\n            }\n        }\n        {\n            VkResult ret = vkResetFences(vkdev->vkdevice(), 1, &d->upload_command_fence);\n            if (ret != VK_SUCCESS)\n            {\n                NCNN_LOGE(\"vkResetFences failed %d\", ret);\n                return -1;\n            }\n        }\n    }\n\n    d->begin_command_buffer();\n\n    return 0;\n}\n\nuint64_t VkTransfer::pending_upload_total() const\n{\n    return d->pending_upload_total;\n}\n\n} // namespace ncnn\n\n#endif // NCNN_VULKAN\n"
  },
  {
    "path": "src/command.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_COMMAND_H\n#define NCNN_COMMAND_H\n\n#include \"platform.h\"\n\n#if NCNN_VULKAN\n\n#include \"mat.h\"\n\nnamespace ncnn {\n\nclass Pipeline;\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 26\nclass ImportAndroidHardwareBufferPipeline;\n#endif // __ANDROID_API__ >= 26\n#endif // NCNN_PLATFORM_API\nclass VkComputePrivate;\nclass NCNN_EXPORT VkCompute\n{\npublic:\n    explicit VkCompute(const VulkanDevice* vkdev);\n    virtual ~VkCompute();\n\npublic:\n    void record_upload(const Mat& src, VkMat& dst, const Option& opt);\n\n    void record_download(const VkMat& src, Mat& dst, const Option& opt);\n\n    void record_clone(const Mat& src, VkMat& dst, const Option& opt);\n\n    void record_clone(const Mat& src, VkImageMat& dst, const Option& opt);\n\n    void record_clone(const VkMat& src, Mat& dst, const Option& opt);\n\n    void record_clone(const VkImageMat& src, Mat& dst, const Option& opt);\n\n    void record_clone(const VkMat& src, VkMat& dst, const Option& opt);\n\n    void record_clone(const VkImageMat& src, VkImageMat& dst, const Option& opt);\n\n    void record_clone(const VkMat& src, VkImageMat& dst, const Option& opt);\n\n    void record_clone(const VkImageMat& src, VkMat& dst, const Option& opt);\n\n    void record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& bindings, const std::vector<vk_constant_type>& constants, const VkMat& dispatcher);\n\n    void record_pipeline(const Pipeline* pipeline, const std::vector<VkImageMat>& bindings, const std::vector<vk_constant_type>& constants, const VkImageMat& dispatcher);\n\n    void record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& buffer_bindings, const std::vector<VkImageMat>& image_bindings, const std::vector<vk_constant_type>& constants, const VkMat& dispatcher);\n    void record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& buffer_bindings, const 
std::vector<VkImageMat>& image_bindings, const std::vector<vk_constant_type>& constants, const VkImageMat& dispatcher);\n    void record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& buffer_bindings, const std::vector<VkImageMat>& image_bindings, const std::vector<vk_constant_type>& constants, const Mat& dispatcher);\n\n#if NCNN_BENCHMARK\n    void record_write_timestamp(uint32_t query);\n#endif // NCNN_BENCHMARK\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 26\n    void record_import_android_hardware_buffer(const ImportAndroidHardwareBufferPipeline* pipeline, const VkImageMat& src, const VkMat& dst);\n#endif // __ANDROID_API__ >= 26\n#endif // NCNN_PLATFORM_API\n\n    int submit_and_wait();\n\n    int reset();\n\n    uint64_t pending_dispatch_total() const;\n\n#if NCNN_BENCHMARK\n    int create_query_pool(uint32_t query_count);\n\n    int get_query_pool_results(uint32_t first_query, uint32_t query_count, std::vector<uint64_t>& results);\n#endif // NCNN_BENCHMARK\n\nprotected:\n    const VulkanDevice* vkdev;\n\n    void barrier_readwrite(const VkMat& binding);\n    void barrier_readwrite(const VkImageMat& binding);\n    void barrier_readonly(const VkImageMat& binding);\n\nprivate:\n    VkComputePrivate* const d;\n};\n\nclass VkTransferPrivate;\nclass NCNN_EXPORT VkTransfer\n{\npublic:\n    explicit VkTransfer(const VulkanDevice* vkdev);\n    virtual ~VkTransfer();\n\npublic:\n    void record_upload(const Mat& src, VkMat& dst, const Option& opt, bool flatten = true);\n\n    int submit_and_wait();\n\n    int reset();\n\n    uint64_t pending_upload_total() const;\n\nprotected:\n    const VulkanDevice* vkdev;\n\nprivate:\n    VkTransferPrivate* const d;\n};\n\n} // namespace ncnn\n\n#endif // NCNN_VULKAN\n\n#endif // NCNN_COMMAND_H\n"
  },
  {
    "path": "src/convert_ycbcr.comp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#version 450\n\nlayout (constant_id = 0) const int w = 0;\nlayout (constant_id = 1) const int h = 0;\nlayout (constant_id = 2) const int outw = 0;\nlayout (constant_id = 3) const int outh = 0;\nlayout (constant_id = 4) const int type_to = 0;\nlayout (constant_id = 5) const int rotate_from = 0;\nlayout (constant_id = 6) const int need_resize = 0;\n\nlayout (binding = 0) uniform sampler2D android_hardware_buffer_image;\nlayout (binding = 1) writeonly buffer vkmat_blob { sfp vkmat_blob_data[]; };\nlayout (binding = 2) writeonly buffer vkmat_pack4_blob { sfpvec4 vkmat_pack4_blob_data[]; };\n\nvoid main()\n{\n    int gx = int(gl_GlobalInvocationID.x);\n    int gy = int(gl_GlobalInvocationID.y);\n    int gz = int(gl_GlobalInvocationID.z);\n\n    if (gx >= outw || gy >= outh || gz >= 1)\n        return;\n\n    vec2 pos;\n\n    if (rotate_from == 1)\n    {\n        pos = vec2(gx, gy);\n    }\n\n    if (rotate_from == 2)\n    {\n        pos = vec2(outw - 1 - gx, gy);\n    }\n\n    if (rotate_from == 3)\n    {\n        pos = vec2(outw - 1 - gx, outh - 1 - gy);\n    }\n\n    if (rotate_from == 4)\n    {\n        pos = vec2(gx, outh - 1 - gy);\n    }\n\n    if (rotate_from == 5)\n    {\n        pos = vec2(gy, gx);\n    }\n\n    if (rotate_from == 6)\n    {\n        pos = vec2(gy, outw - 1 - gx);\n    }\n\n    if (rotate_from == 7)\n    {\n        pos = vec2(outh - 1 - gy, outw - 1 - gx);\n    }\n\n    if (rotate_from == 8)\n    {\n        pos = vec2(outh - 1 - gy, gx);\n    }\n\n    if (need_resize == 1)\n    {\n        if (rotate_from < 5) // 1 2 3 4\n        {\n            pos.x = pos.x * (float(w) / outw);\n            pos.y = pos.y * (float(h) / outh);\n        }\n        else // 5 6 7 8\n        {\n            pos.x = pos.x * (float(w) / outh);\n            pos.y = pos.y * (float(h) / outw);\n        }\n    }\n\n    vec3 rgb = texture(android_hardware_buffer_image, pos).rgb * 255.f;\n\n 
   const int outcstep = outw * outh / 4 * 4;\n\n    if (type_to == 1) // PIXEL_RGB\n    {\n        ivec3 v_offset = (gy * outw + gx) + ivec3(0, 1, 2) * outcstep;\n\n        buffer_st1(vkmat_blob_data, v_offset.r, afp(rgb.r));\n        buffer_st1(vkmat_blob_data, v_offset.g, afp(rgb.g));\n        buffer_st1(vkmat_blob_data, v_offset.b, afp(rgb.b));\n    }\n\n    if (type_to == 2) // PIXEL_BGR\n    {\n        ivec3 v_offset = (gy * outw + gx) + ivec3(0, 1, 2) * outcstep;\n\n        buffer_st1(vkmat_blob_data, v_offset.r, afp(rgb.b));\n        buffer_st1(vkmat_blob_data, v_offset.g, afp(rgb.g));\n        buffer_st1(vkmat_blob_data, v_offset.b, afp(rgb.r));\n    }\n\n    if (type_to == 3) // PIXEL_GRAY\n    {\n        // coeffs for r g b = 0.299f, 0.587f, 0.114f\n        float v = clamp(rgb.r * 0.299f + rgb.g * 0.587f + rgb.b * 0.114f, 0.f, 255.f);\n\n        int v_offset = gy * outw + gx;\n\n        buffer_st1(vkmat_blob_data, v_offset, afp(v));\n    }\n\n    if (type_to == 4) // PIXEL_RGBA\n    {\n        vec4 rgba;\n        rgba.rgb = rgb;\n        rgba.a = 255.f;\n\n        int v_offset = gy * outw + gx;\n\n        buffer_st4(vkmat_pack4_blob_data, v_offset, afpvec4(rgba));\n    }\n\n    if (type_to == 5) // PIXEL_BGRA\n    {\n        vec4 rgba;\n        rgba.bgr = rgb;\n        rgba.a = 255.f;\n\n        int v_offset = gy * outw + gx;\n\n        buffer_st4(vkmat_pack4_blob_data, v_offset, afpvec4(rgba));\n    }\n}\n"
  },
  {
    "path": "src/cpu.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n\n#include \"platform.h\"\n\n#include <limits.h>\n#ifndef __wasi__\n#include <setjmp.h>\n#include <signal.h>\n#endif // __wasi__\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n\n#ifdef _OPENMP\n#if NCNN_SIMPLEOMP\n#include \"simpleomp.h\"\n#else\n#include <omp.h>\n#endif\n#endif\n\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n#ifdef _MSC_VER\n#include <intrin.h>    // __cpuid()\n#include <immintrin.h> // _xgetbv()\n#endif\n#if defined(__clang__) || defined(__GNUC__)\n#include <cpuid.h> // __get_cpuid() and __cpuid_count()\n#endif\n#endif\n\n#ifdef __EMSCRIPTEN__\n#include <emscripten/threading.h>\n#endif\n\n#if defined _WIN32\n#define WIN32_LEAN_AND_MEAN\n#include <windows.h>\n#endif\n\n#if defined __ANDROID__ || defined __OHOS__ || __linux__\n#if defined __ANDROID__\n#if __ANDROID_API__ >= 18\n#include <sys/auxv.h> // getauxval()\n#endif\n#include <sys/system_properties.h> // __system_property_get()\n#include <dlfcn.h>\n#endif\n#if defined __OHOS__\n#include <sys/auxv.h> // getauxval()\n#endif\n#include <ctype.h>\n#include <stdint.h>\n#include <fcntl.h>\n#include <sys/stat.h>\n#include <sys/syscall.h>\n#include <unistd.h>\n#endif\n\n#if __APPLE__\n#include <mach/mach.h>\n#include <mach/machine.h>\n#include <mach/thread_act.h>\n#include <sys/sysctl.h>\n#include <sys/types.h>\n#include <unistd.h>\n#include \"TargetConditionals.h\"\n#if TARGET_OS_IPHONE\n#define __IOS__ 1\n#endif\n// define missing cpu model for old sdk\n#ifndef CPUFAMILY_ARM_HURRICANE\n#define CPUFAMILY_ARM_HURRICANE 0x67ceee93\n#endif\n// A11\n#ifndef CPUFAMILY_ARM_MONSOON_MISTRAL\n#define CPUFAMILY_ARM_MONSOON_MISTRAL 0xe81e7ef6\n#endif\n// A12\n#ifndef CPUFAMILY_ARM_VORTEX_TEMPEST\n#define CPUFAMILY_ARM_VORTEX_TEMPEST 0x07d34b9f\n#endif\n// A13\n#ifndef CPUFAMILY_ARM_LIGHTNING_THUNDER\n#define CPUFAMILY_ARM_LIGHTNING_THUNDER 
0x462504d2\n#endif\n// A14 / M1\n#ifndef CPUFAMILY_ARM_FIRESTORM_ICESTORM\n#define CPUFAMILY_ARM_FIRESTORM_ICESTORM 0x1b588bb3\n#endif\n// A15 / M2\n#ifndef CPUFAMILY_ARM_AVALANCHE_BLIZZARD\n#define CPUFAMILY_ARM_AVALANCHE_BLIZZARD 0xda33d83d\n#endif\n// A16\n#ifndef CPUFAMILY_ARM_EVEREST_SAWTOOTH\n#define CPUFAMILY_ARM_EVEREST_SAWTOOTH 0x8765edea\n#endif\n// A17\n#ifndef CPUFAMILY_ARM_COLL\n#define CPUFAMILY_ARM_COLL 0x2876f5b5\n#endif\n// A18\n#ifndef CPUFAMILY_ARM_TUPAI\n#define CPUFAMILY_ARM_TUPAI 0x204526d0\n#endif\n// A18 Pro\n#ifndef CPUFAMILY_ARM_TAHITI\n#define CPUFAMILY_ARM_TAHITI 0x75d4acb9\n#endif\n// M3\n#ifndef CPUFAMILY_ARM_IBIZA\n#define CPUFAMILY_ARM_IBIZA 0xfa33415e\n#endif\n// M3 Pro\n#ifndef CPUFAMILY_ARM_LOBOS\n#define CPUFAMILY_ARM_LOBOS 0x5f4dea93\n#endif\n// M3 Max\n#ifndef CPUFAMILY_ARM_PALMA\n#define CPUFAMILY_ARM_PALMA 0x72015832\n#endif\n// M4\n#ifndef CPUFAMILY_ARM_DONAN\n#define CPUFAMILY_ARM_DONAN 0x6f5129ac\n#endif\n// M4 Pro / M4 Max\n#ifndef CPUFAMILY_ARM_BRAVA\n#define CPUFAMILY_ARM_BRAVA 0x17d5b93a\n#endif\n#endif // __APPLE__\n\n#if defined(__SSE3__)\n#include <immintrin.h>\n#endif\n\n#if (defined _WIN32 && (__aarch64__ || __arm__)) || ((defined __ANDROID__ || defined __linux__) && __riscv)\n#define RUAPU_IMPLEMENTATION\n#include \"ruapu.h\"\n#endif\n\n#if defined(_OPENMP) && (__clang__ || defined(_OPENMP_LLVM_RUNTIME))\n__attribute__((constructor)) void ncnn_kmp_env_initializer()\n{\n    // this function should be called before touching all openmp stuff\n    // the env setting here helps prevent abort from happening inside openmp\n\n    // the internal affinity routines in llvm openmp call abort on __NR_sched_getaffinity / __NR_sched_setaffinity fails\n    // ref KMPNativeAffinity::get_system_affinity/set_system_affinity in openmp/runtime/src/kmp_affinity.h\n    // and cpu core goes offline in powersave mode on android, which triggers abort\n    // disable affinity capability, we handle thread affinity for openmp threads\n#if 
defined _WIN32\n#if _WIN32_WINNT >= 0x0600\n    _putenv_s(\"KMP_AFFINITY\", \"disabled\");\n#else\n    _putenv(\"KMP_AFFINITY=disabled\");\n#endif\n#else\n    setenv(\"KMP_AFFINITY\", \"disabled\", 1);\n#endif\n\n    // openmp initialization triggers abort when another openmp runtime detected\n    // ref __kmp_register_library_startup in openmp/runtime/src/kmp_runtime.cpp\n    // this happens when loading multiple libraries that are static linked openmp\n    // just let it continue to work, it works well in most cases, at least it won't crash unexpectedly\n#if defined _WIN32\n#if _WIN32_WINNT >= 0x0600\n    _putenv_s(\"KMP_DUPLICATE_LIB_OK\", \"1\");\n#else\n    _putenv(\"KMP_DUPLICATE_LIB_OK=1\");\n#endif\n#else\n    setenv(\"KMP_DUPLICATE_LIB_OK\", \"1\", 1);\n#endif\n}\n#endif\n\n// topology info\nstatic int g_cpucount;\nstatic int g_physical_cpucount;\nstatic int g_powersave;\nstatic ncnn::CpuSet g_cpu_affinity_mask_all;\nstatic ncnn::CpuSet g_cpu_affinity_mask_little;\nstatic ncnn::CpuSet g_cpu_affinity_mask_big;\n\n// isa info\n#if defined _WIN32\n#if __aarch64__\nstatic int g_cpu_support_arm_asimdhp;\nstatic int g_cpu_support_arm_cpuid;\nstatic int g_cpu_support_arm_asimddp;\nstatic int g_cpu_support_arm_asimdfhm;\nstatic int g_cpu_support_arm_bf16;\nstatic int g_cpu_support_arm_i8mm;\nstatic int g_cpu_support_arm_sve;\nstatic int g_cpu_support_arm_sve2;\nstatic int g_cpu_support_arm_svebf16;\nstatic int g_cpu_support_arm_svei8mm;\nstatic int g_cpu_support_arm_svef32mm;\n#elif __arm__\nstatic int g_cpu_support_arm_edsp;\nstatic int g_cpu_support_arm_neon;\nstatic int g_cpu_support_arm_vfpv4;\n#endif // __aarch64__ || __arm__\n#elif defined __ANDROID__ || defined __linux__\nstatic unsigned int g_hwcaps;\nstatic unsigned int g_hwcaps2;\n#elif __APPLE__\nstatic unsigned int g_hw_cpufamily;\nstatic cpu_type_t g_hw_cputype;\nstatic cpu_subtype_t g_hw_cpusubtype;\n#if __aarch64__\nstatic int g_hw_optional_arm_FEAT_FP16;\nstatic int 
g_hw_optional_arm_FEAT_DotProd;\nstatic int g_hw_optional_arm_FEAT_FHM;\nstatic int g_hw_optional_arm_FEAT_BF16;\nstatic int g_hw_optional_arm_FEAT_I8MM;\n#endif // __aarch64__\n#endif\n\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\nstatic int g_cpu_support_x86_avx;\nstatic int g_cpu_support_x86_fma;\nstatic int g_cpu_support_x86_xop;\nstatic int g_cpu_support_x86_f16c;\nstatic int g_cpu_support_x86_avx2;\nstatic int g_cpu_support_x86_avx_vnni;\nstatic int g_cpu_support_x86_avx_vnni_int8;\nstatic int g_cpu_support_x86_avx_vnni_int16;\nstatic int g_cpu_support_x86_avx_ne_convert;\nstatic int g_cpu_support_x86_avx512;\nstatic int g_cpu_support_x86_avx512_vnni;\nstatic int g_cpu_support_x86_avx512_bf16;\nstatic int g_cpu_support_x86_avx512_fp16;\n#endif // defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n\n#if defined __ANDROID__ || defined __linux__\n#if __riscv\nstatic int g_cpu_support_riscv_zfh;\nstatic int g_cpu_support_riscv_zvfh;\nstatic int g_cpu_support_riscv_xtheadvector;\n#endif // __riscv\n#endif // defined __ANDROID__ || defined __linux__\n\nstatic int g_cpu_level2_cachesize;\nstatic int g_cpu_level3_cachesize;\n\n// misc info\n#if defined __ANDROID__ || defined __linux__\n#if __aarch64__\nstatic int g_cpu_is_arm_a53_a55;\n#endif // __aarch64__\n#endif // defined __ANDROID__ || defined __linux__\n\nstatic bool is_being_debugged()\n{\n#if defined _WIN32\n    return IsDebuggerPresent();\n#elif defined __ANDROID__ || defined __linux__\n    // https://stackoverflow.com/questions/3596781/how-to-detect-if-the-current-process-is-being-run-by-gdb\n    int status_fd = open(\"/proc/self/status\", O_RDONLY);\n    if (status_fd == -1)\n        return false;\n\n    char buf[4096];\n    ssize_t num_read = read(status_fd, buf, sizeof(buf) - 1);\n    close(status_fd);\n\n    if (num_read <= 0)\n        return false;\n\n    buf[num_read] = '\\0';\n    const char tracerPidString[] = \"TracerPid:\";\n   
 const char* tracer_pid_ptr = strstr(buf, tracerPidString);\n    if (!tracer_pid_ptr)\n        return false;\n\n    for (const char* ch = tracer_pid_ptr + sizeof(tracerPidString) - 1; ch <= buf + num_read; ++ch)\n    {\n        if (isspace(*ch))\n            continue;\n\n        return isdigit(*ch) != 0 && *ch != '0';\n    }\n\n    return false;\n#elif defined __APPLE__\n    // https://stackoverflow.com/questions/2200277/detecting-debugger-on-mac-os-x\n    struct kinfo_proc info;\n    info.kp_proc.p_flag = 0;\n\n    int mib[4];\n    mib[0] = CTL_KERN;\n    mib[1] = KERN_PROC;\n    mib[2] = KERN_PROC_PID;\n    mib[3] = getpid();\n\n    size_t size = sizeof(info);\n    sysctl(mib, sizeof(mib) / sizeof(*mib), &info, &size, NULL, 0);\n\n    return ((info.kp_proc.p_flag & P_TRACED) != 0);\n#else\n    // unknown platform :(\n    fprintf(stderr, \"unknown platform!\\n\");\n    return false;\n#endif\n}\n\n#if defined __ANDROID__ || defined __OHOS__ || defined __linux__\n\n#define AT_HWCAP  16\n#define AT_HWCAP2 26\n\n#if __aarch64__\n// from arch/arm64/include/uapi/asm/hwcap.h\n#define HWCAP_ASIMD     (1 << 1)\n#define HWCAP_ASIMDHP   (1 << 10)\n#define HWCAP_CPUID     (1 << 11)\n#define HWCAP_ASIMDDP   (1 << 20)\n#define HWCAP_SVE       (1 << 22)\n#define HWCAP_ASIMDFHM  (1 << 23)\n#define HWCAP2_SVE2     (1 << 1)\n#define HWCAP2_SVEI8MM  (1 << 9)\n#define HWCAP2_SVEF32MM (1 << 10)\n#define HWCAP2_SVEBF16  (1 << 12)\n#define HWCAP2_I8MM     (1 << 13)\n#define HWCAP2_BF16     (1 << 14)\n#else\n// from arch/arm/include/uapi/asm/hwcap.h\n#define HWCAP_EDSP  (1 << 7)\n#define HWCAP_NEON  (1 << 12)\n#define HWCAP_VFPv4 (1 << 16)\n#endif\n\n#if __mips__\n// from arch/mips/include/uapi/asm/hwcap.h\n#define HWCAP_MIPS_MSA     (1 << 1)\n#define HWCAP_LOONGSON_MMI (1 << 11)\n#endif\n\n#if __loongarch64\n// from arch/loongarch/include/uapi/asm/hwcap.h\n#define HWCAP_LOONGARCH_LSX  (1 << 4)\n#define HWCAP_LOONGARCH_LASX (1 << 5)\n#endif\n\n#if __riscv\n// from 
arch/riscv/include/uapi/asm/hwcap.h\n#define COMPAT_HWCAP_ISA_F (1 << ('F' - 'A'))\n#define COMPAT_HWCAP_ISA_V (1 << ('V' - 'A'))\n#endif\n\n#if defined __ANDROID__ || defined __OHOS__\n// Probe the system's C library for a 'getauxval' function and call it if\n// it exists, or return 0 for failure. This function is available since API\n// level 18.\n//\n// HarmonyOS NEXT supports `getauxval` directly.\n//\n// Note that getauxval() can't really be re-implemented here, because\n// its implementation does not parse /proc/self/auxv. Instead it depends\n// on values that are passed by the kernel at process-init time to the\n// C runtime initialization layer.\nstatic unsigned int get_elf_hwcap_from_getauxval(unsigned int type)\n{\n#if defined __OHOS__\n    return getauxval(type);\n#else\n#if __ANDROID_API__ >= 18\n    unsigned int hwcap = getauxval(type);\n    if (hwcap)\n        return hwcap;\n#endif\n\n    typedef unsigned long getauxval_func_t(unsigned long);\n\n    dlerror();\n    void* libc_handle = dlopen(\"libc.so\", RTLD_NOW);\n    if (!libc_handle)\n    {\n        NCNN_LOGE(\"dlopen libc.so failed %s\", dlerror());\n        return 0;\n    }\n\n    unsigned int result = 0;\n    getauxval_func_t* func = (getauxval_func_t*)dlsym(libc_handle, \"getauxval\");\n    if (!func)\n    {\n        NCNN_LOGE(\"dlsym getauxval failed\");\n    }\n    else\n    {\n        // Note: getauxval() returns 0 on failure. 
Doesn't touch errno.\n        result = (unsigned int)(*func)(type);\n    }\n    dlclose(libc_handle);\n\n    return result;\n#endif\n}\n#endif // defined __ANDROID__ || defined __OHOS__\n\n// extract the ELF HW capabilities bitmap from /proc/self/auxv\nstatic unsigned int get_elf_hwcap_from_proc_self_auxv(unsigned int type)\n{\n    FILE* fp = fopen(\"/proc/self/auxv\", \"rb\");\n    if (!fp)\n    {\n        NCNN_LOGE(\"fopen /proc/self/auxv failed\");\n        return 0;\n    }\n\n#if __aarch64__ || __mips64 || __riscv_xlen == 64 || __loongarch64\n    struct\n    {\n        uint64_t tag;\n        uint64_t value;\n    } entry;\n#else\n    struct\n    {\n        unsigned int tag;\n        unsigned int value;\n    } entry;\n\n#endif\n\n    unsigned int result = 0;\n    while (!feof(fp))\n    {\n        int nread = fread((char*)&entry, sizeof(entry), 1, fp);\n        if (nread != 1)\n            break;\n\n        if (entry.tag == 0 && entry.value == 0)\n            break;\n\n        if (entry.tag == type)\n        {\n            result = entry.value;\n            break;\n        }\n    }\n\n    fclose(fp);\n\n    return result;\n}\n\nstatic unsigned int get_elf_hwcap(unsigned int type)\n{\n    unsigned int hwcap = 0;\n\n#if defined __ANDROID__ || defined __OHOS__\n    hwcap = get_elf_hwcap_from_getauxval(type);\n#endif\n\n    if (!hwcap)\n        hwcap = get_elf_hwcap_from_proc_self_auxv(type);\n\n#if defined __ANDROID__\n#if __aarch64__\n    if (type == AT_HWCAP)\n    {\n        // samsung exynos9810 on android pre-9 incorrectly reports armv8.2\n        // for little cores, but big cores only support armv8.0\n        // drop all armv8.2 features used by ncnn for preventing SIGILLs\n        // ref https://reviews.llvm.org/D114523\n        char arch[PROP_VALUE_MAX];\n        int len = __system_property_get(\"ro.arch\", arch);\n        if (len > 0 && strncmp(arch, \"exynos9810\", 10) == 0)\n        {\n            hwcap &= ~HWCAP_ASIMDHP;\n            hwcap &= 
~HWCAP_ASIMDDP;\n        }\n    }\n#endif // __aarch64__\n#endif // defined __ANDROID__\n\n    return hwcap;\n}\n#endif // defined __ANDROID__ || defined __OHOS__ || defined __linux__\n\n#if __APPLE__\nstatic unsigned int get_hw_cpufamily()\n{\n    unsigned int value = 0;\n    size_t len = sizeof(value);\n    sysctlbyname(\"hw.cpufamily\", &value, &len, NULL, 0);\n    return value;\n}\n\nstatic cpu_type_t get_hw_cputype()\n{\n    cpu_type_t value = 0;\n    size_t len = sizeof(value);\n    sysctlbyname(\"hw.cputype\", &value, &len, NULL, 0);\n    return value;\n}\n\nstatic cpu_subtype_t get_hw_cpusubtype()\n{\n    cpu_subtype_t value = 0;\n    size_t len = sizeof(value);\n    sysctlbyname(\"hw.cpusubtype\", &value, &len, NULL, 0);\n    return value;\n}\n\nstatic int get_hw_capability(const char* cap)\n{\n    int64_t value = 0;\n    size_t len = sizeof(value);\n    sysctlbyname(cap, &value, &len, NULL, 0);\n    return value;\n}\n#endif // __APPLE__\n\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\nstatic inline void x86_cpuid(int level, unsigned int out[4])\n{\n#if defined(_MSC_VER) && !defined(__clang__)\n    __cpuid((int*)out, level);\n#elif defined(__clang__) || defined(__GNUC__)\n    __get_cpuid(level, out, out + 1, out + 2, out + 3);\n#else\n    NCNN_LOGE(\"x86_cpuid is unknown for current compiler\");\n    out[0] = 0;\n    out[1] = 0;\n    out[2] = 0;\n    out[3] = 0;\n#endif\n}\n\nstatic inline void x86_cpuid_sublevel(int level, int sublevel, unsigned int out[4])\n{\n#if defined(_MSC_VER)\n    __cpuidex((int*)out, level, sublevel);\n#elif defined(__clang__) || defined(__GNUC__)\n    __cpuid_count(level, sublevel, out[0], out[1], out[2], out[3]);\n#else\n    NCNN_LOGE(\"x86_cpuid_sublevel is unknown for current compiler\");\n    out[0] = 0;\n    out[1] = 0;\n    out[2] = 0;\n    out[3] = 0;\n#endif\n}\n\nstatic inline int x86_get_xcr0()\n{\n#if defined(_MSC_FULL_VER) && (_MSC_FULL_VER >= 160040219)\n    return 
_xgetbv(0);\n#elif defined(__i386__) || defined(__x86_64__)\n    int xcr0 = 0;\n    asm(\".byte 0x0f, 0x01, 0xd0\"\n        : \"=a\"(xcr0)\n        : \"c\"(0)\n        : \"%edx\");\n    return xcr0;\n#else\n    NCNN_LOGE(\"x86_get_xcr0 is unknown for current compiler\");\n    return 0xffffffff; // assume it will work\n#endif\n}\n\nstatic int get_cpu_support_x86_avx()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 1)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    return 1;\n}\n\nstatic int get_cpu_support_x86_fma()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    return cpu_info[2] & (1u << 12);\n}\n\nstatic int get_cpu_support_x86_xop()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0x80000000, cpu_info);\n\n    if (cpu_info[0] < 0x80000001)\n        return 0;\n\n    x86_cpuid(0x80000001, cpu_info);\n\n    return cpu_info[2] & (1u << 11);\n}\n\nstatic int get_cpu_support_x86_f16c()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 1)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n\n    return cpu_info[2] & (1u << 29);\n}\n\nstatic int get_cpu_support_x86_avx2()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    
x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    x86_cpuid_sublevel(7, 0, cpu_info);\n    return cpu_info[1] & (1u << 5);\n}\n\nstatic int get_cpu_support_x86_avx_vnni()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    x86_cpuid_sublevel(7, 1, cpu_info);\n    return cpu_info[0] & (1u << 4);\n}\n\nstatic int get_cpu_support_x86_avx_vnni_int8()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    x86_cpuid_sublevel(7, 1, cpu_info);\n    return cpu_info[3] & (1u << 4);\n}\n\nstatic int get_cpu_support_x86_avx_vnni_int16()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    x86_cpuid_sublevel(7, 1, cpu_info);\n    return cpu_info[3] & (1u << 
10);\n}\n\nstatic int get_cpu_support_x86_avx_ne_convert()\n{\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    x86_cpuid_sublevel(7, 1, cpu_info);\n    return cpu_info[3] & (1u << 5);\n}\n\nstatic int get_cpu_support_x86_avx512()\n{\n#if __APPLE__\n    return get_hw_capability(\"hw.optional.avx512f\")\n           && get_hw_capability(\"hw.optional.avx512bw\")\n           && get_hw_capability(\"hw.optional.avx512cd\")\n           && get_hw_capability(\"hw.optional.avx512dq\")\n           && get_hw_capability(\"hw.optional.avx512vl\");\n#else\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    // check avx512 XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 0xe0) != 0xe0)\n        return 0;\n\n    x86_cpuid_sublevel(7, 0, cpu_info);\n    return (cpu_info[1] & (1u << 16)) && (cpu_info[1] & (1u << 17)) && (cpu_info[1] & (1u << 28)) && (cpu_info[1] & (1u << 30)) && (cpu_info[1] & (1u << 31));\n#endif\n}\n\nstatic int get_cpu_support_x86_avx512_vnni()\n{\n#if __APPLE__\n    return get_hw_capability(\"hw.optional.avx512vnni\");\n#else\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if 
(!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    // check avx512 XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 0xe0) != 0xe0)\n        return 0;\n\n    x86_cpuid_sublevel(7, 0, cpu_info);\n    return cpu_info[2] & (1u << 11);\n#endif\n}\n\nstatic int get_cpu_support_x86_avx512_bf16()\n{\n#if __APPLE__\n    return get_hw_capability(\"hw.optional.avx512bf16\");\n#else\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    x86_cpuid_sublevel(7, 1, cpu_info);\n    return cpu_info[0] & (1u << 5);\n#endif\n}\n\nstatic int get_cpu_support_x86_avx512_fp16()\n{\n#if __APPLE__\n    return get_hw_capability(\"hw.optional.avx512fp16\");\n#else\n    unsigned int cpu_info[4] = {0};\n    x86_cpuid(0, cpu_info);\n\n    int nIds = cpu_info[0];\n    if (nIds < 7)\n        return 0;\n\n    x86_cpuid(1, cpu_info);\n    // check AVX XSAVE OSXSAVE\n    if (!(cpu_info[2] & (1u << 28)) || !(cpu_info[2] & (1u << 26)) || !(cpu_info[2] & (1u << 27)))\n        return 0;\n\n    // check XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 6) != 6)\n        return 0;\n\n    // check avx512 XSAVE enabled by kernel\n    if ((x86_get_xcr0() & 0xe0) != 0xe0)\n        return 0;\n\n    x86_cpuid_sublevel(7, 0, cpu_info);\n    return cpu_info[3] & (1u << 23);\n#endif\n}\n#endif // defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n\nstatic int get_cpucount()\n{\n    int count = 0;\n#ifdef __EMSCRIPTEN__\n    if (emscripten_has_threading_support())\n   
     count = emscripten_num_logical_cores();\n    else\n        count = 1;\n#elif defined _WIN32\n    SYSTEM_INFO system_info;\n    GetSystemInfo(&system_info);\n    count = system_info.dwNumberOfProcessors;\n#elif defined __ANDROID__ || defined __linux__\n    // get cpu count from /proc/cpuinfo\n    FILE* fp = fopen(\"/proc/cpuinfo\", \"rb\");\n    if (!fp)\n        return 1;\n\n    char line[1024];\n    while (!feof(fp))\n    {\n        char* s = fgets(line, 1024, fp);\n        if (!s)\n            break;\n\n        if (memcmp(line, \"processor\", 9) == 0)\n        {\n            count++;\n        }\n    }\n\n    fclose(fp);\n#elif __APPLE__\n    size_t len = sizeof(count);\n    sysctlbyname(\"hw.ncpu\", &count, &len, NULL, 0);\n#else\n#ifdef _OPENMP\n    count = omp_get_max_threads();\n#else\n    count = 1;\n#endif // _OPENMP\n#endif\n\n    if (count < 1)\n        count = 1;\n\n    return count;\n}\n\n#if defined __ANDROID__ || defined __linux__\nstatic int get_thread_siblings(int cpuid)\n{\n    char path[256];\n    sprintf(path, \"/sys/devices/system/cpu/cpu%d/topology/thread_siblings\", cpuid);\n\n    FILE* fp = 0; //fopen(path, \"rb\");\n    if (fp)\n    {\n        int thread_siblings = -1;\n        int nscan = fscanf(fp, \"%x\", &thread_siblings);\n        if (nscan != 1)\n        {\n            // ignore\n        }\n\n        fclose(fp);\n\n        return thread_siblings;\n    }\n\n    // second try, parse from human-readable thread_siblings_list\n    sprintf(path, \"/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list\", cpuid);\n\n    fp = fopen(path, \"rb\");\n    if (fp)\n    {\n        int thread_siblings = -1;\n\n        int id0;\n        char sep;\n        int id1;\n\n        int nscan = fscanf(fp, \"%d\", &id0);\n        if (nscan == 1)\n        {\n            thread_siblings = (1 << id0);\n\n            while (fscanf(fp, \"%c%d\", &sep, &id1) == 2)\n            {\n                if (sep == ',')\n                {\n                    
thread_siblings |= (1 << id1);\n                }\n                if (sep == '-' && id0 < id1)\n                {\n                    for (int i = id0 + 1; i <= id1; i++)\n                    {\n                        thread_siblings |= (1 << i);\n                    }\n                }\n\n                id0 = id1;\n            }\n        }\n        else\n        {\n            // ignore\n        }\n\n        fclose(fp);\n\n        return thread_siblings;\n    }\n\n    return -1;\n}\n#endif // defined __ANDROID__ || defined __linux__\n\nstatic int get_physical_cpucount()\n{\n    int count = 0;\n#if defined _WIN32\n    typedef BOOL(WINAPI * LPFN_GLPI)(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, PDWORD);\n    LPFN_GLPI glpi = (LPFN_GLPI)GetProcAddress(GetModuleHandle(TEXT(\"kernel32\")), \"GetLogicalProcessorInformation\");\n    if (glpi == NULL)\n    {\n        NCNN_LOGE(\"GetLogicalProcessorInformation is not supported\");\n        return g_cpucount;\n    }\n\n    DWORD return_length = 0;\n    glpi(NULL, &return_length);\n\n    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(return_length);\n    glpi(buffer, &return_length);\n\n    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION ptr = buffer;\n    DWORD byte_offset = 0;\n    while (byte_offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= return_length)\n    {\n        if (ptr->Relationship == RelationProcessorCore)\n        {\n            count++;\n        }\n\n        byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);\n        ptr++;\n    }\n\n    free(buffer);\n#elif defined __ANDROID__ || defined __linux__\n    std::vector<int> thread_set;\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        int thread_siblings = get_thread_siblings(i);\n        if (thread_siblings == -1)\n        {\n            // ignore malformed one\n            continue;\n        }\n\n        bool thread_siblings_exists = false;\n        for (size_t j = 0; j < thread_set.size(); j++)\n   
     {\n            if (thread_set[j] == thread_siblings)\n            {\n                thread_siblings_exists = true;\n                break;\n            }\n        }\n\n        if (!thread_siblings_exists)\n        {\n            thread_set.push_back(thread_siblings);\n            count++;\n        }\n    }\n    if (count == 0)\n    {\n        // cannot resolve siblings, fallback to all cpu count\n        count = g_cpucount;\n    }\n#elif __APPLE__\n    size_t len = sizeof(count);\n    sysctlbyname(\"hw.physicalcpu_max\", &count, &len, NULL, 0);\n#else\n    count = g_cpucount;\n#endif\n\n    if (count > g_cpucount)\n        count = g_cpucount;\n\n    return count;\n}\n\n#if defined __ANDROID__ || defined __linux__\nstatic int get_data_cache_size(int cpuid, int level)\n{\n    char path[256];\n\n    // discover sysfs cache entry\n    int indexid = -1;\n    for (int i = 0;; i++)\n    {\n        // check level\n        {\n            sprintf(path, \"/sys/devices/system/cpu/cpu%d/cache/index%d/level\", cpuid, i);\n            FILE* fp = fopen(path, \"rb\");\n            if (!fp)\n                break;\n\n            int cache_level = -1;\n            int nscan = fscanf(fp, \"%d\", &cache_level);\n            fclose(fp);\n            if (nscan != 1 || cache_level != level)\n                continue;\n        }\n\n        // check type\n        {\n            sprintf(path, \"/sys/devices/system/cpu/cpu%d/cache/index%d/type\", cpuid, i);\n            FILE* fp = fopen(path, \"rb\");\n            if (!fp)\n                break;\n\n            char type[32];\n            int nscan = fscanf(fp, \"%31s\", type);\n            fclose(fp);\n            if (nscan != 1 || (strcmp(type, \"Data\") != 0 && strcmp(type, \"Unified\") != 0))\n                continue;\n        }\n\n        indexid = i;\n        break;\n    }\n\n    if (indexid == -1)\n    {\n        // no sysfs entry\n        return 0;\n    }\n\n    // get size\n    int cache_size_K = 0;\n    {\n        
sprintf(path, \"/sys/devices/system/cpu/cpu%d/cache/index%d/size\", cpuid, indexid);\n        FILE* fp = fopen(path, \"rb\");\n        if (!fp)\n            return 0;\n\n        int nscan = fscanf(fp, \"%dK\", &cache_size_K);\n        fclose(fp);\n        if (nscan != 1)\n        {\n            NCNN_LOGE(\"fscanf cache_size_K error %d\", nscan);\n            return 0;\n        }\n    }\n\n    // parse shared_cpu_map mask\n    ncnn::CpuSet shared_cpu_map;\n    {\n        sprintf(path, \"/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_map\", cpuid, indexid);\n        FILE* fp = fopen(path, \"rb\");\n        if (!fp)\n            return 0;\n\n        char shared_cpu_map_str[256];\n        int nscan = fscanf(fp, \"%255s\", shared_cpu_map_str);\n        fclose(fp);\n        if (nscan != 1)\n        {\n            NCNN_LOGE(\"fscanf shared_cpu_map error %d\", nscan);\n            return 0;\n        }\n\n        const char* hex = shared_cpu_map_str;\n        int len = strlen(shared_cpu_map_str);\n\n        if (hex[0] == '0' && hex[1] == 'x')\n        {\n            // skip leading 0x\n            hex += 2;\n            len -= 2;\n        }\n\n        int ci = 0;\n        for (int i = len - 1; i >= 0; i--)\n        {\n            char c = hex[i];\n            if (c == ',')\n                continue; // the kernel separates 32-bit groups with commas\n\n            // decode one hex digit, the raw ascii code is not the nibble value\n            int x = 0;\n            if (c >= '0' && c <= '9') x = c - '0';\n            else if (c >= 'a' && c <= 'f') x = c - 'a' + 10;\n            else if (c >= 'A' && c <= 'F') x = c - 'A' + 10;\n\n            if (x & 1) shared_cpu_map.enable(ci + 0);\n            if (x & 2) shared_cpu_map.enable(ci + 1);\n            if (x & 4) shared_cpu_map.enable(ci + 2);\n            if (x & 8) shared_cpu_map.enable(ci + 3);\n\n            ci += 4;\n        }\n    }\n\n    if (shared_cpu_map.num_enabled() == 1)\n        return cache_size_K * 1024;\n\n    // resolve physical cpu count in the shared_cpu_map\n    int shared_physical_cpu_count = 0;\n    {\n        std::vector<int> thread_set;\n        for (int i = 0; i < g_cpucount; i++)\n        {\n            if (!shared_cpu_map.is_enabled(i))\n                continue;\n\n            int thread_siblings = get_thread_siblings(i);\n            if (thread_siblings == -1)\n            {\n                // 
ignore malformed one\n                continue;\n            }\n\n            bool thread_siblings_exists = false;\n            for (size_t j = 0; j < thread_set.size(); j++)\n            {\n                if (thread_set[j] == thread_siblings)\n                {\n                    thread_siblings_exists = true;\n                    break;\n                }\n            }\n\n            if (!thread_siblings_exists)\n            {\n                thread_set.push_back(thread_siblings);\n                shared_physical_cpu_count++;\n            }\n        }\n    }\n\n    if (shared_physical_cpu_count == 0)\n    {\n        // no sibling info could be resolved, avoid division by zero\n        return cache_size_K * 1024;\n    }\n\n    // return per-physical-core cache size rounded up to a multiple of 4K\n    cache_size_K = (cache_size_K / shared_physical_cpu_count + 3) / 4 * 4;\n\n    return cache_size_K * 1024;\n}\n\nstatic int get_big_cpu_data_cache_size(int level)\n{\n    if (g_cpu_affinity_mask_big.num_enabled() == 0)\n    {\n        // smp cpu\n        return get_data_cache_size(0, level);\n    }\n\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        if (g_cpu_affinity_mask_big.is_enabled(i))\n        {\n            return get_data_cache_size(i, level);\n        }\n    }\n\n    // should never reach here, fallback to cpu0\n    return get_data_cache_size(0, level);\n}\n#endif // defined __ANDROID__ || defined __linux__\n\nstatic int get_cpu_level2_cachesize()\n{\n    int size = 0;\n#if defined _WIN32\n    typedef BOOL(WINAPI * LPFN_GLPI)(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, PDWORD);\n    LPFN_GLPI glpi = (LPFN_GLPI)GetProcAddress(GetModuleHandle(TEXT(\"kernel32\")), \"GetLogicalProcessorInformation\");\n    if (glpi != NULL)\n    {\n        DWORD return_length = 0;\n        glpi(NULL, &return_length);\n\n        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(return_length);\n        glpi(buffer, &return_length);\n\n        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION ptr = buffer;\n        DWORD byte_offset = 0;\n        while (byte_offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= 
return_length)\n        {\n            if (ptr->Relationship == RelationCache)\n            {\n                PCACHE_DESCRIPTOR Cache = &ptr->Cache;\n                if (Cache->Level == 2)\n                {\n                    size = std::max(size, (int)Cache->Size);\n                }\n            }\n\n            byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);\n            ptr++;\n        }\n\n        free(buffer);\n    }\n#elif defined __ANDROID__ || defined __linux__\n    size = get_big_cpu_data_cache_size(2);\n#if defined(_SC_LEVEL2_CACHE_SIZE)\n    if (size <= 0)\n        size = sysconf(_SC_LEVEL2_CACHE_SIZE);\n#endif\n#elif __APPLE__\n    // perflevel 0 is the higher performance cluster\n    int cpusperl2 = get_hw_capability(\"hw.perflevel0.cpusperl2\");\n    int l2cachesize = get_hw_capability(\"hw.perflevel0.l2cachesize\");\n    size = cpusperl2 > 1 ? l2cachesize / cpusperl2 : l2cachesize;\n#endif\n\n    // fallback to a common value\n    if (size <= 0)\n    {\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n        size = 64 * 1024;\n        if (g_cpu_support_x86_avx)\n            size = 128 * 1024;\n        if (g_cpu_support_x86_avx2)\n            size = 256 * 1024;\n        if (g_cpu_support_x86_avx512)\n            size = 1024 * 1024;\n#elif __aarch64__\n        size = 256 * 1024;\n#elif __arm__\n        size = 128 * 1024;\n#else\n        // is 64k still too large here ?\n        size = 64 * 1024;\n#endif\n    }\n\n    return size;\n}\n\nstatic int get_cpu_level3_cachesize()\n{\n    int size = 0;\n#if defined _WIN32\n    typedef BOOL(WINAPI * LPFN_GLPI)(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, PDWORD);\n    LPFN_GLPI glpi = (LPFN_GLPI)GetProcAddress(GetModuleHandle(TEXT(\"kernel32\")), \"GetLogicalProcessorInformation\");\n    if (glpi != NULL)\n    {\n        DWORD return_length = 0;\n        glpi(NULL, &return_length);\n\n        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = 
(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(return_length);\n        glpi(buffer, &return_length);\n\n        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION ptr = buffer;\n        DWORD byte_offset = 0;\n        while (byte_offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= return_length)\n        {\n            if (ptr->Relationship == RelationCache)\n            {\n                PCACHE_DESCRIPTOR Cache = &ptr->Cache;\n                if (Cache->Level == 3)\n                {\n                    size = std::max(size, (int)Cache->Size);\n                }\n            }\n\n            byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);\n            ptr++;\n        }\n\n        free(buffer);\n    }\n#elif defined __ANDROID__ || defined __linux__\n    size = get_big_cpu_data_cache_size(3);\n#if defined(_SC_LEVEL3_CACHE_SIZE)\n    if (size <= 0)\n        size = sysconf(_SC_LEVEL3_CACHE_SIZE);\n#endif\n#elif __APPLE__\n    // perflevel 0 is the higher performance cluster\n    // get the size shared among all cpus\n    size = get_hw_capability(\"hw.perflevel0.l3cachesize\");\n#endif\n\n    // l3 cache size can be zero\n\n    return size;\n}\n\n#if defined _WIN32\nstatic ncnn::CpuSet get_smt_cpu_mask()\n{\n    ncnn::CpuSet smt_cpu_mask;\n\n    typedef BOOL(WINAPI * LPFN_GLPI)(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION, PDWORD);\n    LPFN_GLPI glpi = (LPFN_GLPI)GetProcAddress(GetModuleHandle(TEXT(\"kernel32\")), \"GetLogicalProcessorInformation\");\n    if (glpi == NULL)\n    {\n        NCNN_LOGE(\"GetLogicalProcessorInformation is not supported\");\n        return smt_cpu_mask;\n    }\n\n    DWORD return_length = 0;\n    glpi(NULL, &return_length);\n\n    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(return_length);\n    glpi(buffer, &return_length);\n\n    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION ptr = buffer;\n    DWORD byte_offset = 0;\n    while (byte_offset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= 
return_length)\n    {\n        if (ptr->Relationship == RelationProcessorCore)\n        {\n            ncnn::CpuSet smt_set;\n            smt_set.mask = ptr->ProcessorMask;\n            if (smt_set.num_enabled() > 1)\n            {\n                // this core is smt\n                smt_cpu_mask.mask |= smt_set.mask;\n            }\n        }\n\n        byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);\n        ptr++;\n    }\n\n    free(buffer);\n\n    return smt_cpu_mask;\n}\n\nstatic std::vector<int> get_max_freq_mhz()\n{\n    typedef struct _PROCESSOR_POWER_INFORMATION\n    {\n        ULONG Number;\n        ULONG MaxMhz;\n        ULONG CurrentMhz;\n        ULONG MhzLimit;\n        ULONG MaxIdleState;\n        ULONG CurrentIdleState;\n    } PROCESSOR_POWER_INFORMATION, *PPROCESSOR_POWER_INFORMATION;\n\n    HMODULE powrprof = LoadLibrary(TEXT(\"powrprof.dll\"));\n\n    typedef LONG(WINAPI * LPFN_CNPI)(POWER_INFORMATION_LEVEL, PVOID, ULONG, PVOID, ULONG);\n    LPFN_CNPI cnpi = (LPFN_CNPI)GetProcAddress(powrprof, \"CallNtPowerInformation\");\n    if (cnpi == NULL)\n    {\n        NCNN_LOGE(\"CallNtPowerInformation is not supported\");\n        FreeLibrary(powrprof);\n        return std::vector<int>(g_cpucount, 0);\n    }\n\n    DWORD return_length = sizeof(PROCESSOR_POWER_INFORMATION) * g_cpucount;\n    PPROCESSOR_POWER_INFORMATION buffer = (PPROCESSOR_POWER_INFORMATION)malloc(return_length);\n\n    cnpi(ProcessorInformation, NULL, 0, buffer, return_length);\n\n    std::vector<int> ret;\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        ULONG max_mhz = buffer[i].MaxMhz;\n        ret.push_back(max_mhz);\n    }\n\n    free(buffer);\n    FreeLibrary(powrprof);\n    return ret;\n}\n\nstatic int set_sched_affinity(const ncnn::CpuSet& thread_affinity_mask)\n{\n    DWORD_PTR prev_mask = SetThreadAffinityMask(GetCurrentThread(), thread_affinity_mask.mask);\n    if (prev_mask == 0)\n    {\n        NCNN_LOGE(\"SetThreadAffinityMask failed %d\", 
GetLastError());\n        return -1;\n    }\n\n    return 0;\n}\n#endif // defined _WIN32\n\n#if defined __ANDROID__ || defined __linux__\nstatic int get_max_freq_khz(int cpuid)\n{\n    // first try, for all possible cpu\n    char path[256];\n    sprintf(path, \"/sys/devices/system/cpu/cpufreq/stats/cpu%d/time_in_state\", cpuid);\n\n    FILE* fp = fopen(path, \"rb\");\n\n    if (!fp)\n    {\n        // second try, for online cpu\n        sprintf(path, \"/sys/devices/system/cpu/cpu%d/cpufreq/stats/time_in_state\", cpuid);\n        fp = fopen(path, \"rb\");\n\n        if (fp)\n        {\n            int max_freq_khz = 0;\n            while (!feof(fp))\n            {\n                int freq_khz = 0;\n                int nscan = fscanf(fp, \"%d %*d\", &freq_khz);\n                if (nscan != 1)\n                    break;\n\n                if (freq_khz > max_freq_khz)\n                    max_freq_khz = freq_khz;\n            }\n\n            fclose(fp);\n\n            if (max_freq_khz != 0)\n                return max_freq_khz;\n\n            fp = NULL;\n        }\n\n        if (!fp)\n        {\n            // third try, for online cpu\n            sprintf(path, \"/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq\", cpuid);\n            fp = fopen(path, \"rb\");\n\n            if (!fp)\n                return -1;\n\n            int max_freq_khz = -1;\n            int nscan = fscanf(fp, \"%d\", &max_freq_khz);\n            if (nscan != 1)\n            {\n                NCNN_LOGE(\"fscanf cpuinfo_max_freq error %d\", nscan);\n            }\n            fclose(fp);\n\n            return max_freq_khz;\n        }\n    }\n\n    int max_freq_khz = 0;\n    while (!feof(fp))\n    {\n        int freq_khz = 0;\n        int nscan = fscanf(fp, \"%d %*d\", &freq_khz);\n        if (nscan != 1)\n            break;\n\n        if (freq_khz > max_freq_khz)\n            max_freq_khz = freq_khz;\n    }\n\n    fclose(fp);\n\n    return max_freq_khz;\n}\n\nstatic bool 
is_smt_cpu(int cpuid)\n{\n    // https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/stable/sysfs-devices-system-cpu#L68-72\n    char path[256];\n    sprintf(path, \"/sys/devices/system/cpu/cpu%d/topology/core_cpus_list\", cpuid);\n\n    FILE* fp = fopen(path, \"rb\");\n\n    if (!fp)\n    {\n        sprintf(path, \"/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list\", cpuid);\n        fp = fopen(path, \"rb\");\n\n        if (!fp)\n            return false;\n    }\n\n    bool is_smt = false;\n    while (!feof(fp))\n    {\n        char ch = fgetc(fp);\n        if (ch == ',' || ch == '-')\n        {\n            is_smt = true;\n            break;\n        }\n    }\n\n    fclose(fp);\n\n    return is_smt;\n}\n\nstatic int set_sched_affinity(const ncnn::CpuSet& thread_affinity_mask)\n{\n    // set affinity for thread\n#if defined(__BIONIC__) && !defined(__OHOS__)\n    pid_t pid = gettid();\n#else\n    pid_t pid = syscall(SYS_gettid);\n#endif\n\n    int syscallret = syscall(__NR_sched_setaffinity, pid, sizeof(cpu_set_t), &thread_affinity_mask.cpu_set);\n    if (syscallret)\n    {\n        NCNN_LOGE(\"syscall error %d\", syscallret);\n        return -1;\n    }\n\n    return 0;\n}\n#endif // defined __ANDROID__ || defined __linux__\n\n#if __APPLE__\nstatic int set_sched_affinity(const ncnn::CpuSet& thread_affinity_mask)\n{\n    // https://developer.apple.com/library/archive/releasenotes/Performance/RN-AffinityAPI/index.html\n    // http://www.hybridkernel.com/2015/01/18/binding_threads_to_cores_osx.html\n    // https://gist.github.com/Coneko/4234842\n\n    // This is a quite outdated document. 
Apple will not allow developers to set CPU affinity.\n    // In OS X 10.5 it worked, later it became a suggestion to OS X, then in 10.10 or so (as well in later ones), macOS will ignore any affinity settings.\n    // see https://github.com/Tencent/ncnn/pull/2335#discussion_r528233919   --- AmeAkio\n\n    int affinity_tag = THREAD_AFFINITY_TAG_NULL;\n    for (int i = 0; i < (int)sizeof(thread_affinity_mask.policy) * 8; i++)\n    {\n        if (thread_affinity_mask.is_enabled(i))\n        {\n            affinity_tag = i + 1;\n            break;\n        }\n    }\n\n    mach_port_t tid = pthread_mach_thread_np(pthread_self());\n\n    thread_affinity_policy_data_t policy_data;\n    policy_data.affinity_tag = affinity_tag;\n    int ret = thread_policy_set(tid, THREAD_AFFINITY_POLICY, (thread_policy_t)&policy_data, THREAD_AFFINITY_POLICY_COUNT);\n    if (ret && ret != KERN_NOT_SUPPORTED)\n    {\n        NCNN_LOGE(\"thread_policy_set error %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n#endif // __APPLE__\n\nstatic void initialize_cpu_thread_affinity_mask(ncnn::CpuSet& mask_all, ncnn::CpuSet& mask_little, ncnn::CpuSet& mask_big)\n{\n    mask_all.disable_all();\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        mask_all.enable(i);\n    }\n\n#if defined _WIN32\n// Check SDK >= Win7\n#if _WIN32_WINNT >= _WIN32_WINNT_WIN7 // win7\n\n    // Load GetLogicalProcessorInformationEx\n    HMODULE kernel32 = LoadLibrary(TEXT(\"kernel32.dll\"));\n    if (!kernel32)\n    {\n        NCNN_LOGE(\"LoadLibrary kernel32.dll failed\");\n        return;\n    }\n\n    typedef BOOL(WINAPI * LPFN_GLPIE)(LOGICAL_PROCESSOR_RELATIONSHIP, PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, PDWORD);\n    LPFN_GLPIE glpie = (LPFN_GLPIE)GetProcAddress(kernel32, \"GetLogicalProcessorInformationEx\");\n\n    if (glpie != NULL)\n    {\n        DWORD bufferSize = 0;\n        glpie(RelationProcessorCore, nullptr, &bufferSize);\n        std::vector<BYTE> buffer(bufferSize);\n        if 
(!glpie(RelationProcessorCore, (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*)(buffer.data()), &bufferSize))\n        {\n            NCNN_LOGE(\"GetLogicalProcessorInformationEx failed\");\n            return;\n        }\n\n        // A map from processor number to whether it is an E core\n        std::vector<std::pair<DWORD, bool> > processorCoreType;\n        BYTE maxEfficiencyClass = 0; // In a system without E cores, every core reports EfficiencyClass 0\n\n        BYTE* ptr = buffer.data();\n        while (ptr < buffer.data() + bufferSize)\n        {\n            SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX* info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*)ptr;\n            if (info->Relationship == RelationProcessorCore)\n            {\n                // MinGW and some older MSVC SDKs do not declare EfficiencyClass in PROCESSOR_RELATIONSHIP.\n                // ncnn must still build as C++98, so C++11 features are unavailable for a cleaner redefinition.\n                // Read the field directly instead: it is the second byte of the struct, right after Flags.\n\n                BYTE efficiencyClass = ((BYTE*)&info->Processor)[1];\n\n                bool isECore = (efficiencyClass == 0);\n                maxEfficiencyClass = (std::max)(maxEfficiencyClass, efficiencyClass);\n\n                for (WORD g = 0; g < info->Processor.GroupCount; ++g)\n                {\n                    const GROUP_AFFINITY& ga = info->Processor.GroupMask[g];\n                    KAFFINITY mask = ga.Mask;\n                    WORD group = ga.Group;\n                    for (int bit = 0; bit < 64; ++bit)\n                    {   // for each bit in the mask\n                        if (mask & (static_cast<KAFFINITY>(1) << bit))\n                        {\n                            DWORD processorNumber = group * 64 + bit;\n                            processorCoreType.push_back(std::pair<DWORD, bool>(processorNumber, isECore));\n                        }\n                    }\n                }\n            
}\n            ptr += info->Size;\n        }\n\n        if (maxEfficiencyClass == 0)\n        {\n            // All cores are P cores\n            mask_little.disable_all();\n            mask_big = mask_all;\n        }\n        else\n        {\n            for (int i = 0; i < g_cpucount; i++)\n            {\n                bool isECore = false;\n                for (int j = 0; j < (int)processorCoreType.size(); j++)\n                {\n                    std::pair<DWORD, bool> p = processorCoreType[j];\n                    if ((int)p.first == i)\n                    {\n                        isECore = p.second;\n                        break;\n                    }\n                }\n                // fprintf(stderr, \"processor %d is %s\\n\", i, isECore ? \"E\" : \"P\");\n\n                if (isECore)\n                {\n                    mask_little.enable(i);\n                }\n                else\n                {\n                    mask_big.enable(i);\n                }\n            }\n        }\n    }\n    else\n#endif\n    {\n        // get max freq mhz for all cores\n        int max_freq_mhz_min = INT_MAX;\n        int max_freq_mhz_max = 0;\n        std::vector<int> cpu_max_freq_mhz = get_max_freq_mhz();\n        for (int i = 0; i < g_cpucount; i++)\n        {\n            int max_freq_mhz = cpu_max_freq_mhz[i];\n\n            // NCNN_LOGE(\"%d max freq = %d mhz\", i, max_freq_mhz);\n\n            if (max_freq_mhz > max_freq_mhz_max)\n                max_freq_mhz_max = max_freq_mhz;\n            if (max_freq_mhz < max_freq_mhz_min)\n                max_freq_mhz_min = max_freq_mhz;\n        }\n\n        int max_freq_mhz_medium = (max_freq_mhz_min + max_freq_mhz_max) / 2;\n        if (max_freq_mhz_medium == max_freq_mhz_max)\n        {\n            mask_little.disable_all();\n            mask_big = mask_all;\n            return;\n        }\n\n        ncnn::CpuSet smt_cpu_mask = get_smt_cpu_mask();\n\n        for (int i = 0; i < g_cpucount; i++)\n        
{\n            if (smt_cpu_mask.is_enabled(i))\n            {\n                // always treat smt core as big core\n                mask_big.enable(i);\n                continue;\n            }\n\n            if (cpu_max_freq_mhz[i] < max_freq_mhz_medium)\n                mask_little.enable(i);\n            else\n                mask_big.enable(i);\n        }\n    }\n#elif defined __ANDROID__ || defined __linux__\n    int max_freq_khz_min = INT_MAX;\n    int max_freq_khz_max = 0;\n    std::vector<int> cpu_max_freq_khz(g_cpucount);\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        int max_freq_khz = get_max_freq_khz(i);\n\n        // NCNN_LOGE(\"%d max freq = %d khz\", i, max_freq_khz);\n\n        cpu_max_freq_khz[i] = max_freq_khz;\n\n        if (max_freq_khz > max_freq_khz_max)\n            max_freq_khz_max = max_freq_khz;\n        if (max_freq_khz < max_freq_khz_min)\n            max_freq_khz_min = max_freq_khz;\n    }\n\n    int max_freq_khz_medium = (max_freq_khz_min + max_freq_khz_max) / 2;\n    if (max_freq_khz_medium == max_freq_khz_max)\n    {\n        mask_little.disable_all();\n        mask_big = mask_all;\n        return;\n    }\n\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        if (is_smt_cpu(i))\n        {\n            // always treat smt core as big core\n            mask_big.enable(i);\n            continue;\n        }\n\n        if (cpu_max_freq_khz[i] < max_freq_khz_medium)\n            mask_little.enable(i);\n        else\n            mask_big.enable(i);\n    }\n#elif __APPLE__\n    int nperflevels = get_hw_capability(\"hw.nperflevels\");\n    if (nperflevels == 1)\n    {\n        // smp models\n        mask_little.disable_all();\n        mask_big = mask_all;\n    }\n    else\n    {\n        // two or more clusters, level0 is the high-performance cluster\n        int perflevel0_logicalcpu = get_hw_capability(\"hw.perflevel0.logicalcpu_max\");\n        for (int i = 0; i < perflevel0_logicalcpu; i++)\n        {\n            
mask_big.enable(i);\n        }\n        for (int i = perflevel0_logicalcpu; i < g_cpucount; i++)\n        {\n            mask_little.enable(i);\n        }\n    }\n#else\n    // TODO implement me for other platforms\n    mask_little.disable_all();\n    mask_big = mask_all;\n#endif\n}\n\n#if defined __ANDROID__ || defined __linux__\n#if __aarch64__\nunion midr_info_t\n{\n    struct __attribute__((packed))\n    {\n        unsigned int revision : 4;\n        unsigned int part : 12;\n        unsigned int architecture : 4;\n        unsigned int variant : 4;\n        unsigned int implementer : 8;\n    };\n    unsigned int midr;\n\n    midr_info_t(unsigned int _midr)\n        : midr(_midr)\n    {\n    }\n};\n\nstatic unsigned int get_midr_from_sysfs(int cpuid)\n{\n    char path[256];\n    sprintf(path, \"/sys/devices/system/cpu/cpu%d/regs/identification/midr_el1\", cpuid);\n\n    FILE* fp = fopen(path, \"rb\");\n    if (!fp)\n        return 0;\n\n    unsigned int midr_el1 = 0;\n    int nscan = fscanf(fp, \"%x\", &midr_el1);\n    if (nscan != 1)\n    {\n        // ignore\n    }\n\n    fclose(fp);\n\n    return midr_el1;\n}\n\nstatic int get_midr_from_proc_cpuinfo(std::vector<unsigned int>& midrs)\n{\n    FILE* fp = fopen(\"/proc/cpuinfo\", \"rb\");\n    if (!fp)\n        return -1;\n\n    midrs.resize(g_cpucount, 0);\n\n    int cpuid = -1;\n    midr_info_t midr_info(0);\n\n    char line[1024];\n    while (!feof(fp))\n    {\n        char* s = fgets(line, 1024, fp);\n        if (!s)\n            break;\n\n        if (memcmp(line, \"processor\", 9) == 0)\n        {\n            // processor       : 4\n            int id = -1;\n            int nscan = sscanf(line, \"%*[^:]: %d\", &id);\n            if (nscan != 1)\n                continue;\n\n            if (cpuid >= 0 && cpuid < g_cpucount)\n            {\n                if (midr_info.midr == 0)\n                {\n                    // shared midr\n                    midrs[cpuid] = (unsigned int)-1;\n                }\n   
             else\n                {\n                    // save midr and reset\n                    midrs[cpuid] = midr_info.midr;\n                    for (int i = 0; i < g_cpucount; i++)\n                    {\n                        if (midrs[i] == (unsigned int)-1)\n                            midrs[i] = midr_info.midr;\n                    }\n                }\n\n                midr_info.midr = 0;\n            }\n\n            cpuid = id;\n        }\n\n        if (cpuid == -1)\n            continue;\n\n        if (memcmp(line, \"CPU implementer\", 15) == 0)\n        {\n            // CPU implementer : 0x51\n            unsigned int id = 0;\n            int nscan = sscanf(line, \"%*[^:]: %x\", &id);\n            if (nscan != 1)\n                continue;\n\n            midr_info.implementer = id;\n        }\n        else if (memcmp(line, \"CPU architecture\", 16) == 0)\n        {\n            // CPU architecture: 8\n            int id = 0;\n            int nscan = sscanf(line, \"%*[^:]: %d\", &id);\n            if (nscan != 1)\n                continue;\n\n            midr_info.architecture = id;\n        }\n        else if (memcmp(line, \"CPU variant\", 11) == 0)\n        {\n            // CPU variant     : 0xd\n            int id = 0;\n            int nscan = sscanf(line, \"%*[^:]: %x\", &id);\n            if (nscan != 1)\n                continue;\n\n            midr_info.variant = id;\n        }\n        else if (memcmp(line, \"CPU part\", 8) == 0)\n        {\n            // CPU part        : 0x804\n            int id = 0;\n            int nscan = sscanf(line, \"%*[^:]: %x\", &id);\n            if (nscan != 1)\n                continue;\n\n            midr_info.part = id;\n        }\n        else if (memcmp(line, \"CPU revision\", 12) == 0)\n        {\n            // CPU revision    : 14\n            int id = 0;\n            int nscan = sscanf(line, \"%*[^:]: %d\", &id);\n            if (nscan != 1)\n                continue;\n\n            
midr_info.revision = id;\n        }\n    }\n\n    fclose(fp);\n\n    if (cpuid >= 0 && cpuid < g_cpucount)\n    {\n        if (midr_info.midr == 0)\n        {\n            // shared midr\n            midrs[cpuid] = (unsigned int)-1;\n        }\n        else\n        {\n            // save midr and reset\n            midrs[cpuid] = midr_info.midr;\n            for (int i = 0; i < g_cpucount; i++)\n            {\n                if (midrs[i] == (unsigned int)-1)\n                    midrs[i] = midr_info.midr;\n            }\n        }\n\n        midr_info.midr = 0;\n    }\n\n    // /proc/cpuinfo may only report little/online cores on old kernel\n    if (g_cpu_affinity_mask_big.num_enabled() == g_cpucount)\n    {\n        // assign the remaining unknown midrs for smp cpu\n        for (int i = 0; i < g_cpucount; i++)\n        {\n            if (midrs[i] == 0)\n                midrs[i] = midr_info.midr;\n        }\n    }\n    else\n    {\n        // clear the big core midrs for hmp cpu if they are the same as little cores\n        unsigned int little_midr = 0;\n        for (int i = 0; i < g_cpucount; i++)\n        {\n            if (g_cpu_affinity_mask_little.is_enabled(i))\n            {\n                little_midr = midrs[i];\n                break;\n            }\n        }\n\n        for (int i = 0; i < g_cpucount; i++)\n        {\n            if (g_cpu_affinity_mask_big.is_enabled(i))\n            {\n                if (midrs[i] == little_midr)\n                {\n                    midrs[i] = 0;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n// return midr for the current running core\nstatic unsigned int get_midr_from_register()\n{\n    uint64_t midr;\n    asm volatile(\"mrs   %0, MIDR_EL1\"\n                 : \"=r\"(midr));\n\n    return (unsigned int)midr;\n}\n\nstatic int get_sched_affinity(ncnn::CpuSet& thread_affinity_mask)\n{\n    // get affinity for thread\n#if defined(__BIONIC__) && !defined(__OHOS__)\n    pid_t pid = 
gettid();\n#else\n    pid_t pid = syscall(SYS_gettid);\n#endif\n\n    thread_affinity_mask.disable_all();\n\n    int syscallret = syscall(__NR_sched_getaffinity, pid, sizeof(cpu_set_t), &thread_affinity_mask.cpu_set);\n    if (syscallret)\n    {\n        // handle get error silently\n        return -1;\n    }\n\n    return 0;\n}\n\nstatic int midr_is_a53_a55(unsigned int midr)\n{\n    // 0x 41 ? f d03 ? = arm cortex-a53\n    // 0x 51 ? f 801 ? = qcom kryo200 a53\n    // 0x 41 ? f d04 ? = arm cortex-a35\n    // 0x 41 ? f d05 ? = arm cortex-a55\n    // 0x 51 ? f 803 ? = qcom kryo300 a55\n    // 0x 51 ? f 805 ? = qcom kryo400 a55\n\n    midr_info_t midr_info(midr);\n\n    return (midr_info.implementer == 0x41 && midr_info.part == 0xd03)\n           || (midr_info.implementer == 0x51 && midr_info.part == 0x801)\n           || (midr_info.implementer == 0x41 && midr_info.part == 0xd04)\n           || (midr_info.implementer == 0x41 && midr_info.part == 0xd05)\n           || (midr_info.implementer == 0x51 && midr_info.part == 0x803)\n           || (midr_info.implementer == 0x51 && midr_info.part == 0x805);\n}\n\nstatic int detect_cpu_is_arm_a53_a55()\n{\n    int a53_a55_cpu_count = 0;\n\n    // first try, iterate /sys/devices/system/cpu/cpuX/regs/identification/midr_el1\n    bool sysfs_midr = true;\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        unsigned int midr = 0;\n\n        // for kernel 4.7+\n        midr = get_midr_from_sysfs(i);\n        if (midr == 0)\n        {\n            sysfs_midr = false;\n            break;\n        }\n\n        if (midr_is_a53_a55(midr))\n        {\n            a53_a55_cpu_count++;\n        }\n    }\n\n    if (!sysfs_midr)\n    {\n        // second try, collect midr from /proc/cpuinfo\n        std::vector<unsigned int> midrs;\n        int ret = get_midr_from_proc_cpuinfo(midrs);\n        if (ret == 0 && (int)midrs.size() == g_cpucount)\n        {\n            for (int i = 0; i < g_cpucount; i++)\n            {\n                if 
(midr_is_a53_a55(midrs[i]))\n                {\n                    a53_a55_cpu_count++;\n                }\n            }\n        }\n        else\n        {\n            // third try, assume all aarch64 little cores are a53/a55\n            a53_a55_cpu_count = g_cpu_affinity_mask_little.num_enabled();\n        }\n    }\n\n    if (a53_a55_cpu_count == 0)\n        return 0; // all non a53/a55\n\n    if (a53_a55_cpu_count == g_cpucount)\n        return 1; // all a53/a55\n\n    // little cores are a53/a55\n    return 2;\n}\n#endif // __aarch64__\n#endif // defined __ANDROID__ || defined __linux__\n\n// the initialization\nstatic void initialize_global_cpu_info()\n{\n#if defined(_OPENMP) && (__clang__ || defined(_OPENMP_LLVM_RUNTIME))\n    ncnn_kmp_env_initializer();\n#endif\n\n    g_cpucount = get_cpucount();\n    g_physical_cpucount = get_physical_cpucount();\n    g_powersave = 0;\n    initialize_cpu_thread_affinity_mask(g_cpu_affinity_mask_all, g_cpu_affinity_mask_little, g_cpu_affinity_mask_big);\n\n#if (defined _WIN32 && (__aarch64__ || __arm__)) || ((defined __ANDROID__ || defined __linux__) && __riscv)\n    if (!is_being_debugged())\n    {\n        ruapu_init();\n    }\n#endif\n\n#if defined _WIN32\n#if __aarch64__\n    g_cpu_support_arm_cpuid = ruapu_supports(\"cpuid\");\n    g_cpu_support_arm_asimdhp = ruapu_supports(\"asimdhp\") || IsProcessorFeaturePresent(43) || IsProcessorFeaturePresent(67);   // dp implies hp, 67 is PF_ARM_V82_FP16_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_asimddp = ruapu_supports(\"asimddp\") || IsProcessorFeaturePresent(43);                                    // 43 is PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_asimdfhm = ruapu_supports(\"asimdfhm\") || IsProcessorFeaturePresent(66) || IsProcessorFeaturePresent(68); // bf16 or i8mm implies fhm\n    g_cpu_support_arm_bf16 = ruapu_supports(\"bf16\") || IsProcessorFeaturePresent(68);                                          // 68 is 
PF_ARM_V86_BF16_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_i8mm = ruapu_supports(\"i8mm\") || IsProcessorFeaturePresent(66);                                          // 66 is PF_ARM_V82_I8MM_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_sve = ruapu_supports(\"sve\") || IsProcessorFeaturePresent(46);                                            // 46 is PF_ARM_SVE_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_sve2 = ruapu_supports(\"sve2\") || IsProcessorFeaturePresent(47);                                          // 47 is PF_ARM_SVE2_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_svebf16 = ruapu_supports(\"svebf16\") || IsProcessorFeaturePresent(52);                                    // 52 is PF_ARM_SVE_BF16_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_svei8mm = ruapu_supports(\"svei8mm\") || IsProcessorFeaturePresent(57);                                    // 57 is PF_ARM_SVE_I8MM_INSTRUCTIONS_AVAILABLE\n    g_cpu_support_arm_svef32mm = ruapu_supports(\"svef32mm\") || IsProcessorFeaturePresent(58);                                  // 58 is PF_ARM_SVE_F32MM_INSTRUCTIONS_AVAILABLE\n#elif __arm__\n    g_cpu_support_arm_edsp = ruapu_supports(\"edsp\");\n    g_cpu_support_arm_neon = 1; // all modern windows arm devices have neon\n    g_cpu_support_arm_vfpv4 = ruapu_supports(\"vfpv4\");\n#endif // __aarch64__ || __arm__\n#elif defined __ANDROID__ || defined __linux__\n    g_hwcaps = get_elf_hwcap(AT_HWCAP);\n    g_hwcaps2 = get_elf_hwcap(AT_HWCAP2);\n#elif __APPLE__\n    g_hw_cpufamily = get_hw_cpufamily();\n    g_hw_cputype = get_hw_cputype();\n    g_hw_cpusubtype = get_hw_cpusubtype();\n#if __aarch64__\n    g_hw_optional_arm_FEAT_FP16 = get_hw_capability(\"hw.optional.arm.FEAT_FP16\");\n    g_hw_optional_arm_FEAT_DotProd = get_hw_capability(\"hw.optional.arm.FEAT_DotProd\");\n    g_hw_optional_arm_FEAT_FHM = get_hw_capability(\"hw.optional.arm.FEAT_FHM\");\n    g_hw_optional_arm_FEAT_BF16 = get_hw_capability(\"hw.optional.arm.FEAT_BF16\");\n    
g_hw_optional_arm_FEAT_I8MM = get_hw_capability(\"hw.optional.arm.FEAT_I8MM\");\n\n    switch (g_hw_cpufamily)\n    {\n    case CPUFAMILY_ARM_TUPAI:\n    case CPUFAMILY_ARM_TAHITI:\n    case CPUFAMILY_ARM_DONAN:\n    case CPUFAMILY_ARM_BRAVA:\n    // TODO check sve sme\n    case CPUFAMILY_ARM_AVALANCHE_BLIZZARD:\n    case CPUFAMILY_ARM_EVEREST_SAWTOOTH:\n    case CPUFAMILY_ARM_COLL:\n    case CPUFAMILY_ARM_IBIZA:\n    case CPUFAMILY_ARM_LOBOS:\n    case CPUFAMILY_ARM_PALMA:\n        g_hw_optional_arm_FEAT_BF16 = 1;\n        g_hw_optional_arm_FEAT_I8MM = 1;\n    case CPUFAMILY_ARM_LIGHTNING_THUNDER:\n    case CPUFAMILY_ARM_FIRESTORM_ICESTORM:\n        g_hw_optional_arm_FEAT_DotProd = 1;\n        g_hw_optional_arm_FEAT_FHM = 1;\n    case CPUFAMILY_ARM_MONSOON_MISTRAL:\n    case CPUFAMILY_ARM_VORTEX_TEMPEST:\n        g_hw_optional_arm_FEAT_FP16 = 1;\n    default:\n        break;\n    }\n#endif // __aarch64__\n#endif\n\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    g_cpu_support_x86_avx = get_cpu_support_x86_avx();\n    g_cpu_support_x86_fma = get_cpu_support_x86_fma();\n    g_cpu_support_x86_xop = get_cpu_support_x86_xop();\n    g_cpu_support_x86_f16c = get_cpu_support_x86_f16c();\n    g_cpu_support_x86_avx2 = get_cpu_support_x86_avx2();\n    g_cpu_support_x86_avx_vnni = get_cpu_support_x86_avx_vnni();\n    g_cpu_support_x86_avx_vnni_int8 = get_cpu_support_x86_avx_vnni_int8();\n    g_cpu_support_x86_avx_vnni_int16 = get_cpu_support_x86_avx_vnni_int16();\n    g_cpu_support_x86_avx_ne_convert = get_cpu_support_x86_avx_ne_convert();\n    g_cpu_support_x86_avx512 = get_cpu_support_x86_avx512();\n    g_cpu_support_x86_avx512_vnni = get_cpu_support_x86_avx512_vnni();\n    g_cpu_support_x86_avx512_bf16 = get_cpu_support_x86_avx512_bf16();\n    g_cpu_support_x86_avx512_fp16 = get_cpu_support_x86_avx512_fp16();\n#endif // defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n\n#if defined __ANDROID__ || 
defined __linux__\n#if __riscv\n    g_cpu_support_riscv_zfh = ruapu_supports(\"zfh\") || ruapu_supports(\"xtheadvector\");   // xtheadvector implies zfh\n    g_cpu_support_riscv_zvfh = ruapu_supports(\"zvfh\") || ruapu_supports(\"xtheadvector\"); // xtheadvector implies zvfh\n    g_cpu_support_riscv_xtheadvector = ruapu_supports(\"xtheadvector\");\n#endif // __riscv\n#endif // defined __ANDROID__ || defined __linux__\n\n    g_cpu_level2_cachesize = get_cpu_level2_cachesize();\n    g_cpu_level3_cachesize = get_cpu_level3_cachesize();\n\n#if defined __ANDROID__ || defined __linux__\n#if __aarch64__\n    g_cpu_is_arm_a53_a55 = detect_cpu_is_arm_a53_a55();\n#endif // __aarch64__\n#endif // defined __ANDROID__ || defined __linux__\n}\n\nstatic int g_cpu_info_initialized = 0;\n\nstatic inline void try_initialize_global_cpu_info()\n{\n    if (!g_cpu_info_initialized)\n    {\n        initialize_global_cpu_info();\n        g_cpu_info_initialized = 1;\n    }\n}\n\nnamespace ncnn {\n\n#if defined _WIN32\nCpuSet::CpuSet()\n{\n    disable_all();\n}\n\nvoid CpuSet::enable(int cpu)\n{\n    mask |= ((ULONG_PTR)1 << cpu);\n}\n\nvoid CpuSet::disable(int cpu)\n{\n    mask &= ~((ULONG_PTR)1 << cpu);\n}\n\nvoid CpuSet::disable_all()\n{\n    mask = 0;\n}\n\nbool CpuSet::is_enabled(int cpu) const\n{\n    return mask & ((ULONG_PTR)1 << cpu);\n}\n\nint CpuSet::num_enabled() const\n{\n    int num_enabled = 0;\n    for (int i = 0; i < (int)sizeof(mask) * 8; i++)\n    {\n        if (is_enabled(i))\n            num_enabled++;\n    }\n\n    return num_enabled;\n}\n#elif defined __ANDROID__ || defined __linux__\nCpuSet::CpuSet()\n{\n    disable_all();\n}\n\nvoid CpuSet::enable(int cpu)\n{\n    CPU_SET(cpu, &cpu_set);\n}\n\nvoid CpuSet::disable(int cpu)\n{\n    CPU_CLR(cpu, &cpu_set);\n}\n\nvoid CpuSet::disable_all()\n{\n    CPU_ZERO(&cpu_set);\n}\n\nbool CpuSet::is_enabled(int cpu) const\n{\n    return CPU_ISSET(cpu, &cpu_set);\n}\n\nint CpuSet::num_enabled() const\n{\n    int num_enabled = 0;\n 
   for (int i = 0; i < (int)sizeof(cpu_set_t) * 8; i++)\n    {\n        if (is_enabled(i))\n            num_enabled++;\n    }\n\n    return num_enabled;\n}\n#elif __APPLE__\nCpuSet::CpuSet()\n{\n    disable_all();\n}\n\nvoid CpuSet::enable(int cpu)\n{\n    policy |= ((unsigned int)1 << cpu);\n}\n\nvoid CpuSet::disable(int cpu)\n{\n    policy &= ~((unsigned int)1 << cpu);\n}\n\nvoid CpuSet::disable_all()\n{\n    policy = 0;\n}\n\nbool CpuSet::is_enabled(int cpu) const\n{\n    return policy & ((unsigned int)1 << cpu);\n}\n\nint CpuSet::num_enabled() const\n{\n    int num_enabled = 0;\n    for (int i = 0; i < (int)sizeof(policy) * 8; i++)\n    {\n        if (is_enabled(i))\n            num_enabled++;\n    }\n\n    return num_enabled;\n}\n#else\nCpuSet::CpuSet()\n{\n}\n\nvoid CpuSet::enable(int /* cpu */)\n{\n}\n\nvoid CpuSet::disable(int /* cpu */)\n{\n}\n\nvoid CpuSet::disable_all()\n{\n}\n\nbool CpuSet::is_enabled(int /* cpu */) const\n{\n    return true;\n}\n\nint CpuSet::num_enabled() const\n{\n    return get_cpu_count();\n}\n#endif\n\nint cpu_support_arm_edsp()\n{\n    try_initialize_global_cpu_info();\n#if __arm__ && !__aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_edsp;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_EDSP;\n#elif __APPLE__\n    return g_hw_cputype == CPU_TYPE_ARM;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_neon()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n    return 1;\n#elif __arm__\n#if defined _WIN32\n    return g_cpu_support_arm_neon;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_NEON;\n#elif __APPLE__\n    return g_hw_cputype == CPU_TYPE_ARM && g_hw_cpusubtype > CPU_SUBTYPE_ARM_V7;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_vfpv4()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n    return 1;\n#elif __arm__\n#if defined _WIN32\n    return 
g_cpu_support_arm_vfpv4;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_VFPv4;\n#elif __APPLE__\n    return g_hw_cputype == CPU_TYPE_ARM && g_hw_cpusubtype > CPU_SUBTYPE_ARM_V7S;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_asimdhp()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_asimdhp;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_ASIMDHP;\n#elif __APPLE__\n    return g_hw_optional_arm_FEAT_FP16;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_cpuid()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_cpuid;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_CPUID;\n#elif __APPLE__\n    return 0;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_asimddp()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_asimddp;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_ASIMDDP;\n#elif __APPLE__\n    return g_hw_optional_arm_FEAT_DotProd;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_asimdfhm()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_asimdfhm;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_ASIMDFHM;\n#elif __APPLE__\n    return g_hw_optional_arm_FEAT_FHM;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_bf16()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_bf16;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps2 & HWCAP2_BF16;\n#elif __APPLE__\n    return g_hw_optional_arm_FEAT_BF16;\n#else\n    return 
0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_i8mm()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_i8mm;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps2 & HWCAP2_I8MM;\n#elif __APPLE__\n    return g_hw_optional_arm_FEAT_I8MM;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_sve()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_sve;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps & HWCAP_SVE;\n#elif __APPLE__\n    return 0; // no known apple cpu support armv8.6 sve\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_sve2()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_sve2;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps2 & HWCAP2_SVE2;\n#elif __APPLE__\n    return 0; // no known apple cpu support armv8.6 sve2\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_svebf16()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_svebf16;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps2 & HWCAP2_SVEBF16;\n#elif __APPLE__\n    return 0; // no known apple cpu support armv8.6 svebf16\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_svei8mm()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n    return g_cpu_support_arm_svei8mm;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps2 & HWCAP2_SVEI8MM;\n#elif __APPLE__\n    return 0; // no known apple cpu support armv8.6 svei8mm\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_arm_svef32mm()\n{\n    try_initialize_global_cpu_info();\n#if __aarch64__\n#if defined _WIN32\n 
   return g_cpu_support_arm_svef32mm;\n#elif defined __ANDROID__ || defined __linux__\n    return g_hwcaps2 & HWCAP2_SVEF32MM;\n#elif __APPLE__\n    return 0; // no known apple cpu support armv8.6 svef32mm\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_fma()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_fma;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_xop()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_xop;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_f16c()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_f16c;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx2()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx2;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx_vnni()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx_vnni;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx_vnni_int8()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx_vnni_int8;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx_vnni_int16()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) 
|| defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx_vnni_int16;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx_ne_convert()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx_ne_convert;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx512()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx512;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx512_vnni()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx512_vnni;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx512_bf16()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx512_bf16;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_x86_avx512_fp16()\n{\n    try_initialize_global_cpu_info();\n#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)\n    return g_cpu_support_x86_avx512_fp16;\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_mips_msa()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __mips__\n    return g_hwcaps & HWCAP_MIPS_MSA;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_loongarch_lsx()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __loongarch64\n    return g_hwcaps & HWCAP_LOONGARCH_LSX;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_loongarch_lasx()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __loongarch64\n    return g_hwcaps & HWCAP_LOONGARCH_LASX;\n#else\n    
return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_loongson_mmi()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __mips__\n    return g_hwcaps & HWCAP_LOONGSON_MMI;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_riscv_v()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __riscv\n    return g_hwcaps & COMPAT_HWCAP_ISA_V;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_riscv_zfh()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __riscv\n    return g_cpu_support_riscv_zfh;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_riscv_zvfh()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __riscv\n    return g_cpu_support_riscv_zvfh;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_support_riscv_xtheadvector()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __riscv\n    return g_cpu_support_riscv_xtheadvector;\n#else\n    return 0;\n#endif\n#else\n    return 0;\n#endif\n}\n\nint cpu_riscv_vlenb()\n{\n#if C906\n    // FIXME xuantie qemu reports all zero auxv flags\n    return 16;\n#endif\n    try_initialize_global_cpu_info();\n#if __riscv\n    if (!cpu_support_riscv_v())\n        return 0;\n\n    int a = 0;\n    asm volatile(\n        \".word  0xc22026f3  \\n\" // csrr  a3, vlenb\n        \"mv     %0, a3      \\n\"\n        : \"=r\"(a)\n        :\n        : \"memory\", \"a3\");\n    return a;\n#else\n    return 0;\n#endif\n}\n\nint get_cpu_count()\n{\n    try_initialize_global_cpu_info();\n    return g_cpucount;\n}\n\nint get_little_cpu_count()\n{\n    try_initialize_global_cpu_info();\n    return get_cpu_thread_affinity_mask(1).num_enabled();\n}\n\nint get_big_cpu_count()\n{\n    
try_initialize_global_cpu_info();\n    int big_cpu_count = get_cpu_thread_affinity_mask(2).num_enabled();\n    return big_cpu_count ? big_cpu_count : g_cpucount;\n}\n\nint get_physical_cpu_count()\n{\n    try_initialize_global_cpu_info();\n    return g_physical_cpucount;\n}\n\nint get_physical_little_cpu_count()\n{\n    try_initialize_global_cpu_info();\n    if (g_physical_cpucount == g_cpucount)\n        return get_little_cpu_count();\n\n    return g_physical_cpucount * 2 - g_cpucount;\n}\n\nint get_physical_big_cpu_count()\n{\n    try_initialize_global_cpu_info();\n    if (g_physical_cpucount == g_cpucount)\n        return get_big_cpu_count();\n\n    return g_cpucount - g_physical_cpucount;\n}\n\nint get_cpu_level2_cache_size()\n{\n    try_initialize_global_cpu_info();\n    return g_cpu_level2_cachesize;\n}\n\nint get_cpu_level3_cache_size()\n{\n    try_initialize_global_cpu_info();\n    return g_cpu_level3_cachesize;\n}\n\nint get_cpu_powersave()\n{\n    try_initialize_global_cpu_info();\n    return g_powersave;\n}\n\nint set_cpu_powersave(int powersave)\n{\n    try_initialize_global_cpu_info();\n    if (powersave < 0 || powersave > 2)\n    {\n        NCNN_LOGE(\"powersave %d not supported\", powersave);\n        return -1;\n    }\n\n    const CpuSet& thread_affinity_mask = get_cpu_thread_affinity_mask(powersave);\n\n    int ret = set_cpu_thread_affinity(thread_affinity_mask);\n    if (ret != 0)\n        return ret;\n\n    g_powersave = powersave;\n\n    return 0;\n}\n\nconst CpuSet& get_cpu_thread_affinity_mask(int powersave)\n{\n    try_initialize_global_cpu_info();\n    if (powersave == 0)\n        return g_cpu_affinity_mask_all;\n\n    if (powersave == 1)\n        return g_cpu_affinity_mask_little;\n\n    if (powersave == 2)\n        return g_cpu_affinity_mask_big;\n\n    NCNN_LOGE(\"powersave %d not supported\", powersave);\n\n    // fallback to all cores anyway\n    return g_cpu_affinity_mask_all;\n}\n\nint set_cpu_thread_affinity(const CpuSet& 
thread_affinity_mask)\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__ || defined _WIN32\n#ifdef _OPENMP\n    int num_threads = thread_affinity_mask.num_enabled();\n\n    // set affinity for each thread\n    set_omp_num_threads(num_threads);\n    std::vector<int> ssarets(num_threads, 0);\n    #pragma omp parallel for num_threads(num_threads)\n    for (int i = 0; i < num_threads; i++)\n    {\n        ssarets[i] = set_sched_affinity(thread_affinity_mask);\n    }\n    for (int i = 0; i < num_threads; i++)\n    {\n        if (ssarets[i] != 0)\n            return -1;\n    }\n#else\n    int ssaret = set_sched_affinity(thread_affinity_mask);\n    if (ssaret != 0)\n        return -1;\n#endif\n\n    return 0;\n#elif __APPLE__\n\n#ifdef _OPENMP\n    int num_threads = thread_affinity_mask.num_enabled();\n\n    // set affinity for each thread\n    set_omp_num_threads(num_threads);\n    std::vector<int> ssarets(num_threads, 0);\n    #pragma omp parallel for num_threads(num_threads)\n    for (int i = 0; i < num_threads; i++)\n    {\n        // assign one core for each thread\n        int core = -1 - i;\n        for (int j = 0; j < (int)sizeof(thread_affinity_mask.policy) * 8; j++)\n        {\n            if (thread_affinity_mask.is_enabled(j))\n            {\n                if (core == -1)\n                {\n                    core = j;\n                    break;\n                }\n                else\n                {\n                    core++;\n                }\n            }\n        }\n        CpuSet this_thread_affinity_mask;\n        if (core != -1 - i)\n        {\n            this_thread_affinity_mask.enable(core);\n        }\n\n        ssarets[i] = set_sched_affinity(this_thread_affinity_mask);\n    }\n    for (int i = 0; i < num_threads; i++)\n    {\n        if (ssarets[i] != 0)\n            return -1;\n    }\n#else\n    int ssaret = set_sched_affinity(thread_affinity_mask);\n    if (ssaret != 0)\n        return 
-1;\n#endif\n\n    return 0;\n#else\n    // TODO\n    (void)thread_affinity_mask;\n    return -1;\n#endif\n}\n\nint is_current_thread_running_on_a53_a55()\n{\n    try_initialize_global_cpu_info();\n#if defined __ANDROID__ || defined __linux__\n#if __aarch64__\n    if (g_cpu_is_arm_a53_a55 == 0)\n        return 0; // all non a53/a55\n\n    if (g_cpu_is_arm_a53_a55 == 1)\n        return 1; // all a53/a55\n\n    if (g_powersave == 2)\n        return 0; // big clusters\n\n    if (g_powersave == 1)\n        return 1; // little clusters\n\n    // little cores are a53/a55\n\n    // use cpuid for retrieving midr since kernel 4.7+\n    if (cpu_support_arm_cpuid())\n    {\n        unsigned int midr = get_midr_from_register();\n        if (midr)\n            return midr_is_a53_a55(midr);\n    }\n\n    // check if affinity cpuid is in the little ones\n    CpuSet thread_cs;\n    int ret = get_sched_affinity(thread_cs);\n    if (ret != 0)\n    {\n        // no affinity capability\n        return 0;\n    }\n\n    const CpuSet& little_cs = get_cpu_thread_affinity_mask(1);\n    for (int i = 0; i < g_cpucount; i++)\n    {\n        if (!thread_cs.is_enabled(i))\n            continue;\n\n        if (!little_cs.is_enabled(i))\n            return 0;\n    }\n\n    // all affinity cpuids are little core\n    return 1;\n#else\n    return 0;\n#endif // __aarch64__\n#else\n    return 0;\n#endif // defined __ANDROID__ || defined __linux__\n}\n\nint get_omp_num_threads()\n{\n#ifdef _OPENMP\n    return omp_get_num_threads();\n#else\n    return 1;\n#endif\n}\n\nvoid set_omp_num_threads(int num_threads)\n{\n#ifdef _OPENMP\n    omp_set_num_threads(num_threads);\n#else\n    (void)num_threads;\n#endif\n}\n\nint get_omp_dynamic()\n{\n#ifdef _OPENMP\n    return omp_get_dynamic();\n#else\n    return 0;\n#endif\n}\n\nvoid set_omp_dynamic(int dynamic)\n{\n#ifdef _OPENMP\n    omp_set_dynamic(dynamic);\n#else\n    (void)dynamic;\n#endif\n}\n\nint get_omp_thread_num()\n{\n#ifdef _OPENMP\n    return 
omp_get_thread_num();\n#else\n    return 0;\n#endif\n}\n\nint get_kmp_blocktime()\n{\n#if defined(_OPENMP) && (__clang__ || defined(_OPENMP_LLVM_RUNTIME))\n    return kmp_get_blocktime();\n#else\n    return 0;\n#endif\n}\n\nvoid set_kmp_blocktime(int time_ms)\n{\n#if defined(_OPENMP) && (__clang__ || defined(_OPENMP_LLVM_RUNTIME))\n    kmp_set_blocktime(time_ms);\n#else\n    (void)time_ms;\n#endif\n}\n\nstatic ncnn::ThreadLocalStorage tls_flush_denormals;\n\nint get_flush_denormals()\n{\n#if defined(__SSE3__)\n    return (int)reinterpret_cast<size_t>(tls_flush_denormals.get());\n#else\n    return 0;\n#endif\n}\n\nint set_flush_denormals(int flush_denormals)\n{\n    if (flush_denormals < 0 || flush_denormals > 3)\n    {\n        NCNN_LOGE(\"denormals_zero %d not supported\", flush_denormals);\n        return -1;\n    }\n#if defined(__SSE3__)\n    if (flush_denormals == 0)\n    {\n        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);\n        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);\n    }\n    else if (flush_denormals == 1)\n    {\n        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);\n    }\n    else if (flush_denormals == 2)\n    {\n        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);\n        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n    }\n    else if (flush_denormals == 3)\n    {\n        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);\n        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);\n    }\n\n    tls_flush_denormals.set(reinterpret_cast<void*>((size_t)flush_denormals));\n    return 0;\n#else\n    return 0;\n#endif\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/cpu.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_CPU_H\n#define NCNN_CPU_H\n\n#include <stddef.h>\n\n#if defined _WIN32\n#define WIN32_LEAN_AND_MEAN\n#include <windows.h>\n#endif\n#if defined __ANDROID__ || defined __linux__\n#include <sched.h> // cpu_set_t\n#endif\n\n#include \"platform.h\"\n\nnamespace ncnn {\n\nclass NCNN_EXPORT CpuSet\n{\npublic:\n    CpuSet();\n    void enable(int cpu);\n    void disable(int cpu);\n    void disable_all();\n    bool is_enabled(int cpu) const;\n    int num_enabled() const;\n\npublic:\n#if defined _WIN32\n    ULONG_PTR mask;\n#endif\n#if defined __ANDROID__ || defined __linux__\n    cpu_set_t cpu_set;\n#endif\n#if __APPLE__\n    unsigned int policy;\n#endif\n};\n\n// test optional cpu features\n// edsp = armv7 edsp\nNCNN_EXPORT int cpu_support_arm_edsp();\n// neon = armv7 neon or aarch64 asimd\nNCNN_EXPORT int cpu_support_arm_neon();\n// vfpv4 = armv7 fp16 + fma\nNCNN_EXPORT int cpu_support_arm_vfpv4();\n// asimdhp = aarch64 asimd half precision\nNCNN_EXPORT int cpu_support_arm_asimdhp();\n// cpuid = aarch64 cpuid info\nNCNN_EXPORT int cpu_support_arm_cpuid();\n// asimddp = aarch64 asimd dot product\nNCNN_EXPORT int cpu_support_arm_asimddp();\n// asimdfhm = aarch64 asimd fhm\nNCNN_EXPORT int cpu_support_arm_asimdfhm();\n// bf16 = aarch64 bf16\nNCNN_EXPORT int cpu_support_arm_bf16();\n// i8mm = aarch64 i8mm\nNCNN_EXPORT int cpu_support_arm_i8mm();\n// sve = aarch64 sve\nNCNN_EXPORT int cpu_support_arm_sve();\n// sve2 = aarch64 sve2\nNCNN_EXPORT int cpu_support_arm_sve2();\n// svebf16 = aarch64 svebf16\nNCNN_EXPORT int cpu_support_arm_svebf16();\n// svei8mm = aarch64 svei8mm\nNCNN_EXPORT int cpu_support_arm_svei8mm();\n// svef32mm = aarch64 svef32mm\nNCNN_EXPORT int cpu_support_arm_svef32mm();\n\n// avx = x86 avx\nNCNN_EXPORT int cpu_support_x86_avx();\n// fma = x86 fma\nNCNN_EXPORT int cpu_support_x86_fma();\n// xop = x86 xop\nNCNN_EXPORT int cpu_support_x86_xop();\n// f16c = x86 
f16c\nNCNN_EXPORT int cpu_support_x86_f16c();\n// avx2 = x86 avx2 + fma + f16c\nNCNN_EXPORT int cpu_support_x86_avx2();\n// avx_vnni = x86 avx vnni\nNCNN_EXPORT int cpu_support_x86_avx_vnni();\n// avx_vnni_int8 = x86 avx vnni int8\nNCNN_EXPORT int cpu_support_x86_avx_vnni_int8();\n// avx_vnni_int16 = x86 avx vnni int16\nNCNN_EXPORT int cpu_support_x86_avx_vnni_int16();\n// avx_ne_convert = x86 avx ne convert\nNCNN_EXPORT int cpu_support_x86_avx_ne_convert();\n// avx512 = x86 avx512f + avx512cd + avx512bw + avx512dq + avx512vl\nNCNN_EXPORT int cpu_support_x86_avx512();\n// avx512_vnni = x86 avx512 vnni\nNCNN_EXPORT int cpu_support_x86_avx512_vnni();\n// avx512_bf16 = x86 avx512 bf16\nNCNN_EXPORT int cpu_support_x86_avx512_bf16();\n// avx512_fp16 = x86 avx512 fp16\nNCNN_EXPORT int cpu_support_x86_avx512_fp16();\n\n// lsx = loongarch lsx\nNCNN_EXPORT int cpu_support_loongarch_lsx();\n// lasx = loongarch lasx\nNCNN_EXPORT int cpu_support_loongarch_lasx();\n\n// msa = mips msa\nNCNN_EXPORT int cpu_support_mips_msa();\n// mmi = loongson mmi\nNCNN_EXPORT int cpu_support_loongson_mmi();\n\n// v = riscv vector\nNCNN_EXPORT int cpu_support_riscv_v();\n// zfh = riscv half-precision float\nNCNN_EXPORT int cpu_support_riscv_zfh();\n// zvfh = riscv vector half-precision float\nNCNN_EXPORT int cpu_support_riscv_zvfh();\n// xtheadvector = riscv xtheadvector\nNCNN_EXPORT int cpu_support_riscv_xtheadvector();\n// vlenb = riscv vector length in bytes\nNCNN_EXPORT int cpu_riscv_vlenb();\n\n// cpu info\nNCNN_EXPORT int get_cpu_count();\nNCNN_EXPORT int get_little_cpu_count();\nNCNN_EXPORT int get_big_cpu_count();\n\nNCNN_EXPORT int get_physical_cpu_count();\nNCNN_EXPORT int get_physical_little_cpu_count();\nNCNN_EXPORT int get_physical_big_cpu_count();\n\n// cpu l2 varies from 64k to 1M, but l3 can be zero\nNCNN_EXPORT int get_cpu_level2_cache_size();\nNCNN_EXPORT int get_cpu_level3_cache_size();\n\n// bind all threads on little clusters if powersave enabled\n// affects HMP arch cpu 
like ARM big.LITTLE\n// only implemented on android at the moment\n// switching powersave is expensive and not thread-safe\n// 0 = all cores enabled(default)\n// 1 = only little clusters enabled\n// 2 = only big clusters enabled\n// return 0 if success for setter function\nNCNN_EXPORT int get_cpu_powersave();\nNCNN_EXPORT int set_cpu_powersave(int powersave);\n\n// convenient wrapper\nNCNN_EXPORT const CpuSet& get_cpu_thread_affinity_mask(int powersave);\n\n// set explicit thread affinity\nNCNN_EXPORT int set_cpu_thread_affinity(const CpuSet& thread_affinity_mask);\n\n// runtime thread affinity info\nNCNN_EXPORT int is_current_thread_running_on_a53_a55();\n\n// misc function wrapper for openmp routines\nNCNN_EXPORT int get_omp_num_threads();\nNCNN_EXPORT void set_omp_num_threads(int num_threads);\n\nNCNN_EXPORT int get_omp_dynamic();\nNCNN_EXPORT void set_omp_dynamic(int dynamic);\n\nNCNN_EXPORT int get_omp_thread_num();\n\nNCNN_EXPORT int get_kmp_blocktime();\nNCNN_EXPORT void set_kmp_blocktime(int time_ms);\n\n// need to flush denormals on Intel Chipset.\n// Other architectures such as ARM can be added as needed.\n// 0 = DAZ OFF, FTZ OFF\n// 1 = DAZ ON , FTZ OFF\n// 2 = DAZ OFF, FTZ ON\n// 3 = DAZ ON,  FTZ ON\nNCNN_EXPORT int get_flush_denormals();\nNCNN_EXPORT int set_flush_denormals(int flush_denormals);\n\n} // namespace ncnn\n\n#endif // NCNN_CPU_H\n"
  },
  {
    "path": "src/datareader.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"datareader.h\"\n\n#include <string.h>\n\nnamespace ncnn {\n\nDataReader::DataReader()\n{\n}\n\nDataReader::~DataReader()\n{\n}\n\n#if NCNN_STRING\nint DataReader::scan(const char* /*format*/, void* /*p*/) const\n{\n    return 0;\n}\n#endif // NCNN_STRING\n\nsize_t DataReader::read(void* /*buf*/, size_t /*size*/) const\n{\n    return 0;\n}\n\nsize_t DataReader::reference(size_t /*size*/, const void** /*buf*/) const\n{\n    return 0;\n}\n\n#if NCNN_STDIO\nclass DataReaderFromStdioPrivate\n{\npublic:\n    DataReaderFromStdioPrivate(FILE* _fp)\n        : fp(_fp)\n    {\n    }\n    FILE* fp;\n};\n\nDataReaderFromStdio::DataReaderFromStdio(FILE* _fp)\n    : DataReader(), d(new DataReaderFromStdioPrivate(_fp))\n{\n}\n\nDataReaderFromStdio::~DataReaderFromStdio()\n{\n    delete d;\n}\n\nDataReaderFromStdio::DataReaderFromStdio(const DataReaderFromStdio&)\n    : d(0)\n{\n}\n\nDataReaderFromStdio& DataReaderFromStdio::operator=(const DataReaderFromStdio&)\n{\n    return *this;\n}\n\n#if NCNN_STRING\nint DataReaderFromStdio::scan(const char* format, void* p) const\n{\n    return fscanf(d->fp, format, p);\n}\n#endif // NCNN_STRING\n\nsize_t DataReaderFromStdio::read(void* buf, size_t size) const\n{\n    return fread(buf, 1, size, d->fp);\n}\n#endif // NCNN_STDIO\n\nclass DataReaderFromMemoryPrivate\n{\npublic:\n    DataReaderFromMemoryPrivate(const unsigned char*& _mem)\n        : mem(_mem)\n    {\n    }\n    const unsigned char*& mem;\n};\n\nDataReaderFromMemory::DataReaderFromMemory(const unsigned char*& _mem)\n    : DataReader(), d(new DataReaderFromMemoryPrivate(_mem))\n{\n}\n\nDataReaderFromMemory::~DataReaderFromMemory()\n{\n    delete d;\n}\n\nDataReaderFromMemory::DataReaderFromMemory(const DataReaderFromMemory&)\n    : d(0)\n{\n}\n\nDataReaderFromMemory& DataReaderFromMemory::operator=(const DataReaderFromMemory&)\n{\n    return *this;\n}\n\n#if NCNN_STRING\nint 
DataReaderFromMemory::scan(const char* format, void* p) const\n{\n    size_t fmtlen = strlen(format);\n\n    const size_t nlen = fmtlen + 4;\n    char* format_with_n = new char[nlen];\n    snprintf(format_with_n, nlen, \"%s%%n\", format);\n\n    int nconsumed = 0;\n    int nscan = sscanf((const char*)d->mem, format_with_n, p, &nconsumed);\n    d->mem += nconsumed;\n\n    delete[] format_with_n;\n\n    return nconsumed > 0 ? nscan : 0;\n}\n#endif // NCNN_STRING\n\nsize_t DataReaderFromMemory::read(void* buf, size_t size) const\n{\n    memcpy(buf, d->mem, size);\n    d->mem += size;\n    return size;\n}\n\nsize_t DataReaderFromMemory::reference(size_t size, const void** buf) const\n{\n    *buf = d->mem;\n    d->mem += size;\n    return size;\n}\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 9\nclass DataReaderFromAndroidAssetPrivate\n{\npublic:\n    DataReaderFromAndroidAssetPrivate(AAsset* _asset)\n        : asset(_asset), mem(0)\n    {\n    }\n    AAsset* asset;\n    mutable const unsigned char* mem;\n};\n\nDataReaderFromAndroidAsset::DataReaderFromAndroidAsset(AAsset* _asset)\n    : DataReader(), d(new DataReaderFromAndroidAssetPrivate(_asset))\n{\n}\n\nDataReaderFromAndroidAsset::~DataReaderFromAndroidAsset()\n{\n    delete d;\n}\n\nDataReaderFromAndroidAsset::DataReaderFromAndroidAsset(const DataReaderFromAndroidAsset&)\n    : d(0)\n{\n}\n\nDataReaderFromAndroidAsset& DataReaderFromAndroidAsset::operator=(const DataReaderFromAndroidAsset&)\n{\n    return *this;\n}\n\n#if NCNN_STRING\nint DataReaderFromAndroidAsset::scan(const char* format, void* p) const\n{\n    if (!d->mem)\n    {\n        off_t pos = AAsset_seek(d->asset, 0, SEEK_CUR);\n        d->mem = (const unsigned char*)AAsset_getBuffer(d->asset);\n        d->mem += pos;\n    }\n\n    // asset internal buffer may not be NULL-terminated\n    // create a NULL-terminated string for sscanf\n    std::string line;\n    {\n        off64_t remain_length = AAsset_getRemainingLength64(d->asset);\n        const 
char* newline_pos;\n        if (remain_length > 1 && ((const char*)d->mem)[0] == '\\n')\n        {\n            // skip the leading newline\n            // however, it is fine to create \"\\nXYZ 123 abc\" as sscanf will skip the leading newline silently\n            newline_pos = (const char*)memchr((const char*)d->mem + 1, '\\n', remain_length - 1);\n        }\n        else if (remain_length > 2 && ((const char*)d->mem)[0] == '\\r' && ((const char*)d->mem)[1] == '\\n')\n        {\n            // skip the leading newline\n            // however, it is fine to create \"\\r\\nXYZ 123 abc\" as sscanf will skip the leading newline silently\n            newline_pos = (const char*)memchr((const char*)d->mem + 2, '\\n', remain_length - 2);\n        }\n        else\n        {\n            newline_pos = (const char*)memchr((const char*)d->mem, '\\n', remain_length);\n        }\n\n        size_t line_length = newline_pos ? newline_pos - (const char*)d->mem : (size_t)remain_length;\n        line = std::string((const char*)d->mem, line_length);\n    }\n\n    int fmtlen = strlen(format);\n\n    char* format_with_n = new char[fmtlen + 3];\n    sprintf(format_with_n, \"%s%%n\", format);\n\n    int nconsumed = 0;\n    int nscan = sscanf(line.c_str(), format_with_n, p, &nconsumed);\n    d->mem += nconsumed;\n\n    delete[] format_with_n;\n\n    if (nconsumed == 0)\n        return 0;\n\n    AAsset_seek(d->asset, nconsumed, SEEK_CUR);\n\n    return nscan;\n}\n#endif // NCNN_STRING\n\nsize_t DataReaderFromAndroidAsset::read(void* buf, size_t size) const\n{\n    int nread = AAsset_read(d->asset, buf, size);\n    if (nread < 0)\n        return 0;\n\n    if (d->mem)\n    {\n        d->mem += nread;\n    }\n\n    return nread;\n}\n#endif // __ANDROID_API__ >= 9\n#endif // NCNN_PLATFORM_API\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/datareader.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_DATAREADER_H\n#define NCNN_DATAREADER_H\n\n#include \"platform.h\"\n#if NCNN_STDIO\n#include <stdio.h>\n#endif\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 9\n#include <android/asset_manager.h>\n#endif\n#endif // NCNN_PLATFORM_API\n\nnamespace ncnn {\n\n// data read wrapper\nclass NCNN_EXPORT DataReader\n{\npublic:\n    DataReader();\n    virtual ~DataReader();\n\n#if NCNN_STRING\n    // parse plain param text\n    // return 1 if scan success\n    virtual int scan(const char* format, void* p) const;\n#endif // NCNN_STRING\n\n    // read binary param and model data\n    // return bytes read\n    virtual size_t read(void* buf, size_t size) const;\n\n    // get model data reference\n    // return bytes referenced\n    virtual size_t reference(size_t size, const void** buf) const;\n};\n\n#if NCNN_STDIO\nclass DataReaderFromStdioPrivate;\nclass NCNN_EXPORT DataReaderFromStdio : public DataReader\n{\npublic:\n    explicit DataReaderFromStdio(FILE* fp);\n    virtual ~DataReaderFromStdio();\n\n#if NCNN_STRING\n    virtual int scan(const char* format, void* p) const;\n#endif // NCNN_STRING\n    virtual size_t read(void* buf, size_t size) const;\n\nprivate:\n    DataReaderFromStdio(const DataReaderFromStdio&);\n    DataReaderFromStdio& operator=(const DataReaderFromStdio&);\n\nprivate:\n    DataReaderFromStdioPrivate* const d;\n};\n#endif // NCNN_STDIO\n\nclass DataReaderFromMemoryPrivate;\nclass NCNN_EXPORT DataReaderFromMemory : public DataReader\n{\npublic:\n    explicit DataReaderFromMemory(const unsigned char*& mem);\n    virtual ~DataReaderFromMemory();\n\n#if NCNN_STRING\n    virtual int scan(const char* format, void* p) const;\n#endif // NCNN_STRING\n    virtual size_t read(void* buf, size_t size) const;\n    virtual size_t reference(size_t size, const void** buf) const;\n\nprivate:\n    DataReaderFromMemory(const DataReaderFromMemory&);\n    DataReaderFromMemory& 
operator=(const DataReaderFromMemory&);\n\nprivate:\n    DataReaderFromMemoryPrivate* const d;\n};\n\n#if NCNN_PLATFORM_API\n#if __ANDROID_API__ >= 9\nclass DataReaderFromAndroidAssetPrivate;\nclass NCNN_EXPORT DataReaderFromAndroidAsset : public DataReader\n{\npublic:\n    explicit DataReaderFromAndroidAsset(AAsset* asset);\n    virtual ~DataReaderFromAndroidAsset();\n\n#if NCNN_STRING\n    virtual int scan(const char* format, void* p) const;\n#endif // NCNN_STRING\n    virtual size_t read(void* buf, size_t size) const;\n\nprivate:\n    DataReaderFromAndroidAsset(const DataReaderFromAndroidAsset&);\n    DataReaderFromAndroidAsset& operator=(const DataReaderFromAndroidAsset&);\n\nprivate:\n    DataReaderFromAndroidAssetPrivate* const d;\n};\n#endif // __ANDROID_API__ >= 9\n#endif // NCNN_PLATFORM_API\n\n} // namespace ncnn\n\n#endif // NCNN_DATAREADER_H\n"
  },
  {
    "path": "src/expression.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"expression.h\"\n\n#include <stdio.h> // sscanf\n\nnamespace ncnn {\n\nint count_expression_blobs(const std::string& expr)\n{\n    int count = 0;\n\n    std::string t;\n    for (size_t i = 0; i < expr.size(); i++)\n    {\n        char ch = expr[i];\n\n        if (ch == '(' || ch == ')' || ch == ',')\n        {\n            if (!t.empty())\n            {\n                if (t.size() == 2 && (t[0] >= '0' && t[0] <= '9') && (t[1] == 'w' || t[1] == 'h' || t[1] == 'd' || t[1] == 'c'))\n                {\n                    int blob_index = t[0] - '0';\n                    count = std::max(count, blob_index + 1);\n                }\n\n                t.clear();\n            }\n        }\n        else\n        {\n#if NCNN_SIMPLESTL\n            t.resize(t.size() + 1);\n            t[t.size() - 1] = ch;\n#else\n            t += ch;\n#endif\n        }\n    }\n\n    if (!t.empty())\n    {\n        if (t.size() == 2 && (t[0] >= '0' && t[0] <= '9') && (t[1] == 'w' || t[1] == 'h' || t[1] == 'd' || t[1] == 'c'))\n        {\n            int blob_index = t[0] - '0';\n            count = std::max(count, blob_index + 1);\n        }\n    }\n\n    return count;\n}\n\nstruct typed_value\n{\n    int type; // 0=i 1=f\n    union\n    {\n        int i;\n        float f;\n    };\n\n    typed_value()\n        : type(0), i(0)\n    {\n    }\n    typed_value(int _i)\n        : type(0), i(_i)\n    {\n    }\n    typed_value(float _f)\n        : type(1), f(_f)\n    {\n    }\n\n    int to_int()\n    {\n        if (type == 0)\n            return i;\n\n        // trunc by default\n        return (int)f;\n    }\n};\n\nint eval_list_expression(const std::string& expr, const std::vector<Mat>& blobs, std::vector<int>& outlist)\n{\n    // /(0w,2),*(0h,2),0c\n\n    // split by , ( )\n    //\n    //     /\n    //         0w\n    //         2\n    // -------------------\n    //     *\n    //         0h\n    //  
       2\n    // -------------------\n    //     0c\n    // -------------------\n\n    // split by , ( )\n\n    // split into tokens\n    std::vector<std::string> tokens;\n    {\n        std::string t;\n        for (size_t i = 0; i < expr.size(); i++)\n        {\n            char ch = expr[i];\n\n            if (ch == '(' || ch == ')' || ch == ',')\n            {\n                if (!t.empty())\n                {\n                    tokens.push_back(t);\n                    t.clear();\n                }\n            }\n            else\n            {\n#if NCNN_SIMPLESTL\n                t.resize(t.size() + 1);\n                t[t.size() - 1] = ch;\n#else\n                t += ch;\n#endif\n            }\n        }\n\n        if (!t.empty())\n        {\n            tokens.push_back(t);\n        }\n    }\n\n    //      / 0w 2 * 0h 2 0c\n\n    // scan and stack\n    std::stack<typed_value> exprstack;\n    for (int i = (int)tokens.size() - 1; i >= 0; i--)\n    {\n        const std::string& t = tokens[i];\n\n        // NCNN_LOGE(\"t = %s\", t.c_str());\n\n        // + - * / 0w 0h 0d 0c 12345\n\n        if (t.size() == 2 && (t[0] >= '0' && t[0] <= '9') && (t[1] == 'w' || t[1] == 'h' || t[1] == 'd' || t[1] == 'c'))\n        {\n            size_t blob_index = t[0] - '0';\n            if (blob_index >= blobs.size())\n            {\n                NCNN_LOGE(\"shape expression blob index %d out of bound!\", (int)blob_index);\n                return -1;\n            }\n\n            const Mat& blob = blobs[blob_index].shape();\n            int size;\n            if (t[1] == 'w')\n                size = blob.w;\n            else if (t[1] == 'h')\n                size = blob.h;\n            else if (t[1] == 'd')\n                size = blob.d;\n            else // if (t[1] == 'c')\n                size = blob.c;\n\n            // NCNN_LOGE(\"t = %s  =>  %d\", t.c_str(), size);\n\n            exprstack.push(size);\n        }\n        else if (t == \"+\" || t == \"-\" || t == 
\"*\" || t == \"//\" || t == \"max\" || t == \"min\")\n        {\n            typed_value ta = exprstack.top();\n            exprstack.pop();\n            typed_value tb = exprstack.top();\n            exprstack.pop();\n\n            if (ta.type == 0 && tb.type == 0)\n            {\n                const int a = ta.i;\n                const int b = tb.i;\n\n                int r = 0;\n                if (t == \"+\")\n                {\n                    r = a + b;\n                }\n                else if (t == \"-\")\n                {\n                    r = a - b;\n                }\n                else if (t == \"*\")\n                {\n                    r = a * b;\n                }\n                else if (t == \"//\")\n                {\n                    if (b == 0)\n                    {\n                        NCNN_LOGE(\"expr divide by zero\");\n                        return -1;\n                    }\n                    else\n                    {\n                        r = a / b;\n                    }\n                }\n                else if (t == \"max\")\n                {\n                    r = std::max(a, b);\n                }\n                else // if (t == \"min\")\n                {\n                    r = std::min(a, b);\n                }\n                exprstack.push(r);\n            }\n            else\n            {\n                const float a = ta.type == 0 ? ta.i : ta.f;\n                const float b = tb.type == 0 ? 
tb.i : tb.f;\n\n                float r = 0.f;\n                if (t == \"+\")\n                {\n                    r = a + b;\n                }\n                else if (t == \"-\")\n                {\n                    r = a - b;\n                }\n                else if (t == \"*\")\n                {\n                    r = a * b;\n                }\n                else if (t == \"//\")\n                {\n                    r = floorf(a / b);\n                }\n                else if (t == \"max\")\n                {\n                    r = std::max(a, b);\n                }\n                else // if (t == \"min\")\n                {\n                    r = std::min(a, b);\n                }\n                exprstack.push(r);\n            }\n        }\n        else if (t == \"abs\" || t == \"neg\" || t == \"sign\" || t == \"square\")\n        {\n            typed_value ta = exprstack.top();\n            exprstack.pop();\n\n            if (ta.type == 0)\n            {\n                const int a = ta.i;\n\n                int r = 0;\n                if (t == \"abs\")\n                {\n                    r = a > 0 ? a : -a;\n                }\n                else if (t == \"neg\")\n                {\n                    r = -a;\n                }\n                else if (t == \"sign\")\n                {\n                    r = a > 0 ? 1 : (a == 0 ? 
0 : -1);\n                }\n                else // if (t == \"square\")\n                {\n                    r = a * a;\n                }\n                exprstack.push(r);\n            }\n            else\n            {\n                const float a = ta.f;\n\n                float r = 0;\n                if (t == \"abs\")\n                {\n                    r = fabsf(a);\n                }\n                else if (t == \"neg\")\n                {\n                    r = -a;\n                }\n                else if (t == \"sign\")\n                {\n                    r = a > 0.f ? 1 : (a == 0.f ? 0 : -1);\n                }\n                else // if (t == \"square\")\n                {\n                    r = a * a;\n                }\n                exprstack.push(r);\n            }\n        }\n        else if (t == \"trunc\" || t == \"ceil\" || t == \"floor\" || t == \"round\")\n        {\n            typed_value ta = exprstack.top();\n            exprstack.pop();\n\n            if (ta.type == 0)\n            {\n                const int a = ta.i;\n                exprstack.push(a);\n            }\n            else\n            {\n                const float a = ta.f;\n\n                int r = 0;\n                if (t == \"trunc\")\n                {\n                    r = (int)a;\n                }\n                else if (t == \"ceil\")\n                {\n                    r = (int)ceil(a);\n                }\n                else if (t == \"floor\")\n                {\n                    r = (int)floor(a);\n                }\n                else // if (t == \"round\")\n                {\n                    r = (int)round(a);\n                }\n                exprstack.push(r);\n            }\n        }\n        else if (t == \"acos\"\n                 || t == \"acosh\"\n                 || t == \"asin\"\n                 || t == \"asinh\"\n                 || t == \"atan\"\n                 || t == \"atanh\"\n              
   || t == \"cos\"\n                 || t == \"cosh\"\n                 || t == \"erf\"\n                 || t == \"exp\"\n                 || t == \"log\"\n                 || t == \"log10\"\n                 || t == \"reciprocal\"\n                 || t == \"rsqrt\"\n                 || t == \"sin\"\n                 || t == \"sinh\"\n                 || t == \"sqrt\"\n                 || t == \"tan\"\n                 || t == \"tanh\")\n        {\n            typed_value ta = exprstack.top();\n            exprstack.pop();\n\n            const float a = ta.type == 0 ? ta.i : ta.f;\n\n            float r = 0;\n            if (t == \"acos\")\n            {\n                r = acosf(a);\n            }\n            else if (t == \"acosh\")\n            {\n                r = acoshf(a);\n            }\n            else if (t == \"asin\")\n            {\n                r = asinf(a);\n            }\n            else if (t == \"asinh\")\n            {\n                r = asinhf(a);\n            }\n            else if (t == \"atan\")\n            {\n                r = atanf(a);\n            }\n            else if (t == \"atanh\")\n            {\n                r = atanhf(a);\n            }\n            else if (t == \"cos\")\n            {\n                r = cosf(a);\n            }\n            else if (t == \"cosh\")\n            {\n                r = coshf(a);\n            }\n            else if (t == \"erf\")\n            {\n                r = erff(a);\n            }\n            else if (t == \"exp\")\n            {\n                r = expf(a);\n            }\n            else if (t == \"log\")\n            {\n                r = logf(a);\n            }\n            else if (t == \"log10\")\n            {\n                r = log10f(a);\n            }\n            else if (t == \"reciprocal\")\n            {\n                r = 1.f / a;\n            }\n            else if (t == \"rsqrt\")\n            {\n                r = 1.f / sqrtf(a);\n            }\n  
          else if (t == \"sin\")\n            {\n                r = sinf(a);\n            }\n            else if (t == \"sinh\")\n            {\n                r = sinhf(a);\n            }\n            else if (t == \"sqrt\")\n            {\n                r = sqrtf(a);\n            }\n            else if (t == \"tan\")\n            {\n                r = tanf(a);\n            }\n            else // if (t == \"tanh\")\n            {\n                r = tanhf(a);\n            }\n            exprstack.push(r);\n        }\n        else if (t == \"/\"\n                 || t == \"atan2\"\n                 || t == \"fmod\"\n                 || t == \"pow\"\n                 || t == \"remainder\"\n                 || t == \"logaddexp\")\n        {\n            typed_value ta = exprstack.top();\n            exprstack.pop();\n            typed_value tb = exprstack.top();\n            exprstack.pop();\n\n            const float a = ta.type == 0 ? ta.i : ta.f;\n            const float b = tb.type == 0 ? 
tb.i : tb.f;\n\n            float r = 0.f;\n            if (t == \"/\")\n            {\n                r = a / b;\n            }\n            else if (t == \"atan2\")\n            {\n                r = atan2f(a, b);\n            }\n            else if (t == \"fmod\")\n            {\n                r = fmodf(a, b);\n            }\n            else if (t == \"pow\")\n            {\n                r = powf(a, b);\n            }\n            else if (t == \"remainder\")\n            {\n                // result takes the sign of the divisor\n                // skip the adjustment for exact multiples so that remainder(-4, 2) stays 0 instead of becoming 2\n                r = fmodf(a, b);\n                if (r != 0.f && ((r < 0.f) != (b < 0.f)))\n                    r += b;\n            }\n            else // if (t == \"logaddexp\")\n            {\n                r = logf(expf(a) + expf(b));\n            }\n            exprstack.push(r);\n        }\n        else if (t == \"and\" || t == \"or\" || t == \"xor\" || t == \"lshift\" || t == \"rshift\")\n        {\n            typed_value ta = exprstack.top();\n            exprstack.pop();\n            typed_value tb = exprstack.top();\n            exprstack.pop();\n\n            // assert ta.type == 0 && tb.type == 0\n\n            const int a = ta.i;\n            const int b = tb.i;\n\n            int r = 0;\n            if (t == \"and\")\n            {\n                r = a & b;\n            }\n            else if (t == \"or\")\n            {\n                r = a | b;\n            }\n            else if (t == \"xor\")\n            {\n                r = a ^ b;\n            }\n            else if (t == \"lshift\")\n            {\n                r = a << b;\n            }\n            else // if (t == \"rshift\")\n            {\n                r = a >> b;\n            }\n            exprstack.push(r);\n        }\n        else\n        {\n            // literal\n            int vi;\n            float vf;\n            int nscani = sscanf(t.c_str(), \"%d\", &vi);\n            int nscanf = sscanf(t.c_str(), \"%f\", &vf);\n            if (nscani == 1 && nscanf == 1 && vi == vf)\n            {\n                
exprstack.push(vi);\n            }\n            else if (nscanf == 1)\n            {\n                exprstack.push(vf);\n            }\n            else\n            {\n                NCNN_LOGE(\"malformed literal token %s\", t.c_str());\n                return -1;\n            }\n        }\n    }\n\n    // an empty expression leaves nothing on the stack, calling top() would be undefined behavior\n    if (exprstack.empty())\n    {\n        NCNN_LOGE(\"empty expression %s\", expr.c_str());\n        return -1;\n    }\n\n    // pop the remaining values, the top of stack is the first list element\n    while (!exprstack.empty())\n    {\n        int size = exprstack.top().to_int();\n        exprstack.pop();\n        outlist.push_back(size);\n    }\n\n    // NCNN_LOGE(\"shape %s = %d %d\", expr.c_str(), outlist[0], outlist[1]);\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/expression.h",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_EXPRESSION_H\n#define NCNN_EXPRESSION_H\n\n#include \"mat.h\"\n\nnamespace ncnn {\n\n// count how many blobs are referenced inside the expression\nNCNN_EXPORT int count_expression_blobs(const std::string& expr);\n\n// evaluate a comma-separated list expression against the input blob shapes\n// used to resolve the reshape shape, and the slice indices (starts, ends)\n// see docs/developer-guide/expression.md\n// return 0 on success, -1 on error\nNCNN_EXPORT int eval_list_expression(const std::string& expr, const std::vector<Mat>& blobs, std::vector<int>& outlist);\n\n} // namespace ncnn\n\n#endif // NCNN_EXPRESSION_H\n"
  },
  {
    "path": "src/gpu.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gpu.h\"\n\n#if NCNN_VULKAN\n\n#include <float.h>\n#include <limits.h>\n#include <stdlib.h>\n#include <string.h>\n\n#include \"glslang/SPIRV/GlslangToSpv.h\"\n#if NCNN_SYSTEM_GLSLANG\n#include \"glslang/Public/ShaderLang.h\"\n#else\n#include \"glslang/glslang/Public/ShaderLang.h\"\n#endif\n\n#include \"layer/vulkan/shader/vulkan_activation.comp.hex.h\"\n\n#include \"command.h\"\n#include \"layer.h\"\n#include \"layer_type.h\"\n#include \"mat.h\"\n#include \"pipelinecache.h\"\n\n// There is known issue that vkDestroyDebugUtilsMessengerEXT crash on exit when vulkan validation layer enabled\n// upstream fix https://github.com/KhronosGroup/Vulkan-Loader/pull/539\n#define ENABLE_VALIDATION_LAYER 0\n\nnamespace ncnn {\n\n// global\nstatic Mutex g_instance_lock;\n\nclass __ncnn_vulkan_instance_holder\n{\npublic:\n    __ncnn_vulkan_instance_holder()\n    {\n        instance = 0;\n        instance_api_version = 0;\n        created = 0;\n        glslang_initialized = false;\n\n#if NCNN_VULKAN_LOADER\n        libvulkan = 0;\n#if defined __ANDROID__\n        hvkdi = 0;\n#endif\n#endif // NCNN_VULKAN_LOADER\n\n#if ENABLE_VALIDATION_LAYER\n        callback = 0;\n#endif\n    }\n\n    ~__ncnn_vulkan_instance_holder()\n    {\n        destroy_gpu_instance();\n    }\n\n    operator VkInstance()\n    {\n        return instance;\n    }\n\n    VkInstance instance;\n    uint32_t instance_api_version;\n    int created;\n    bool glslang_initialized;\n\n#if ENABLE_VALIDATION_LAYER\n    VkDebugUtilsMessengerEXT callback;\n#endif\n};\nstatic __ncnn_vulkan_instance_holder g_instance;\n\nstatic int g_gpu_count = 0;\nstatic int g_default_gpu_index = -1;\n\n// NOTE 32 is large enough i think ...\n#define NCNN_MAX_GPU_COUNT 32\nstatic GpuInfo* g_gpu_infos[NCNN_MAX_GPU_COUNT] = {0};\n\n// default vulkan device\nstatic Mutex g_default_vkdev_lock;\nstatic VulkanDevice* g_default_vkdev[NCNN_MAX_GPU_COUNT] 
= {0};\n\nstruct layer_shader_registry_entry\n{\n    const char* comp_data;\n    int comp_data_size;\n};\n\n#include \"layer_shader_spv_data.h\"\n\nstatic const layer_shader_registry_entry layer_shader_registry[] = {\n#include \"layer_shader_registry.h\"\n};\n\nstatic const int layer_shader_registry_entry_count = sizeof(layer_shader_registry) / sizeof(layer_shader_registry_entry);\n\n// vulkan core\nPFN_vkAllocateCommandBuffers vkAllocateCommandBuffers = 0;\nPFN_vkAllocateDescriptorSets vkAllocateDescriptorSets = 0;\nPFN_vkAllocateMemory vkAllocateMemory = 0;\nPFN_vkBeginCommandBuffer vkBeginCommandBuffer = 0;\nPFN_vkBindBufferMemory vkBindBufferMemory = 0;\nPFN_vkBindImageMemory vkBindImageMemory = 0;\nPFN_vkCmdBeginQuery vkCmdBeginQuery = 0;\nPFN_vkCmdBindDescriptorSets vkCmdBindDescriptorSets = 0;\nPFN_vkCmdBindIndexBuffer vkCmdBindIndexBuffer = 0;\nPFN_vkCmdBindPipeline vkCmdBindPipeline = 0;\nPFN_vkCmdCopyBuffer vkCmdCopyBuffer = 0;\nPFN_vkCmdCopyBufferToImage vkCmdCopyBufferToImage = 0;\nPFN_vkCmdCopyImage vkCmdCopyImage = 0;\nPFN_vkCmdCopyImageToBuffer vkCmdCopyImageToBuffer = 0;\nPFN_vkCmdCopyQueryPoolResults vkCmdCopyQueryPoolResults = 0;\nPFN_vkCmdDispatch vkCmdDispatch = 0;\nPFN_vkCmdDispatchIndirect vkCmdDispatchIndirect = 0;\nPFN_vkCmdEndQuery vkCmdEndQuery = 0;\nPFN_vkCmdExecuteCommands vkCmdExecuteCommands = 0;\nPFN_vkCmdFillBuffer vkCmdFillBuffer = 0;\nPFN_vkCmdPipelineBarrier vkCmdPipelineBarrier = 0;\nPFN_vkCmdPushConstants vkCmdPushConstants = 0;\nPFN_vkCmdResetQueryPool vkCmdResetQueryPool = 0;\nPFN_vkCmdResolveImage vkCmdResolveImage = 0;\nPFN_vkCmdUpdateBuffer vkCmdUpdateBuffer = 0;\nPFN_vkCmdWriteTimestamp vkCmdWriteTimestamp = 0;\nPFN_vkCreateBuffer vkCreateBuffer = 0;\nPFN_vkCreateBufferView vkCreateBufferView = 0;\nPFN_vkCreateCommandPool vkCreateCommandPool = 0;\nPFN_vkCreateComputePipelines vkCreateComputePipelines = 0;\nPFN_vkCreateDescriptorPool vkCreateDescriptorPool = 0;\nPFN_vkCreateDescriptorSetLayout vkCreateDescriptorSetLayout = 
0;\nPFN_vkCreateDevice vkCreateDevice = 0;\nPFN_vkCreateFence vkCreateFence = 0;\nPFN_vkCreateImage vkCreateImage = 0;\nPFN_vkCreateImageView vkCreateImageView = 0;\nPFN_vkCreatePipelineCache vkCreatePipelineCache = 0;\nPFN_vkCreatePipelineLayout vkCreatePipelineLayout = 0;\nPFN_vkCreateQueryPool vkCreateQueryPool = 0;\nPFN_vkCreateSampler vkCreateSampler = 0;\nPFN_vkCreateSemaphore vkCreateSemaphore = 0;\nPFN_vkCreateShaderModule vkCreateShaderModule = 0;\nPFN_vkDestroyBuffer vkDestroyBuffer = 0;\nPFN_vkDestroyBufferView vkDestroyBufferView = 0;\nPFN_vkDestroyCommandPool vkDestroyCommandPool = 0;\nPFN_vkDestroyDescriptorPool vkDestroyDescriptorPool = 0;\nPFN_vkDestroyDescriptorSetLayout vkDestroyDescriptorSetLayout = 0;\nPFN_vkDestroyDevice vkDestroyDevice = 0;\nPFN_vkDestroyFence vkDestroyFence = 0;\nPFN_vkDestroyImage vkDestroyImage = 0;\nPFN_vkDestroyImageView vkDestroyImageView = 0;\nPFN_vkDestroyInstance vkDestroyInstance = 0;\nPFN_vkDestroyPipeline vkDestroyPipeline = 0;\nPFN_vkDestroyPipelineCache vkDestroyPipelineCache = 0;\nPFN_vkDestroyPipelineLayout vkDestroyPipelineLayout = 0;\nPFN_vkDestroyQueryPool vkDestroyQueryPool = 0;\nPFN_vkDestroySampler vkDestroySampler = 0;\nPFN_vkDestroySemaphore vkDestroySemaphore = 0;\nPFN_vkDestroyShaderModule vkDestroyShaderModule = 0;\nPFN_vkDeviceWaitIdle vkDeviceWaitIdle = 0;\nPFN_vkEndCommandBuffer vkEndCommandBuffer = 0;\nPFN_vkEnumerateDeviceExtensionProperties vkEnumerateDeviceExtensionProperties = 0;\nPFN_vkEnumerateDeviceLayerProperties vkEnumerateDeviceLayerProperties = 0;\nPFN_vkEnumeratePhysicalDevices vkEnumeratePhysicalDevices = 0;\nPFN_vkFlushMappedMemoryRanges vkFlushMappedMemoryRanges = 0;\nPFN_vkFreeCommandBuffers vkFreeCommandBuffers = 0;\nPFN_vkFreeDescriptorSets vkFreeDescriptorSets = 0;\nPFN_vkFreeMemory vkFreeMemory = 0;\nPFN_vkGetBufferMemoryRequirements vkGetBufferMemoryRequirements = 0;\nPFN_vkGetDeviceMemoryCommitment vkGetDeviceMemoryCommitment = 0;\nPFN_vkGetDeviceProcAddr vkGetDeviceProcAddr 
= 0;\nPFN_vkGetDeviceQueue vkGetDeviceQueue = 0;\nPFN_vkGetFenceStatus vkGetFenceStatus = 0;\nPFN_vkGetImageMemoryRequirements vkGetImageMemoryRequirements = 0;\nPFN_vkGetImageSubresourceLayout vkGetImageSubresourceLayout = 0;\nPFN_vkGetPhysicalDeviceFeatures vkGetPhysicalDeviceFeatures = 0;\nPFN_vkGetPhysicalDeviceFormatProperties vkGetPhysicalDeviceFormatProperties = 0;\nPFN_vkGetPhysicalDeviceImageFormatProperties vkGetPhysicalDeviceImageFormatProperties = 0;\nPFN_vkGetPhysicalDeviceMemoryProperties vkGetPhysicalDeviceMemoryProperties = 0;\nPFN_vkGetPhysicalDeviceProperties vkGetPhysicalDeviceProperties = 0;\nPFN_vkGetPhysicalDeviceQueueFamilyProperties vkGetPhysicalDeviceQueueFamilyProperties = 0;\nPFN_vkGetPipelineCacheData vkGetPipelineCacheData = 0;\nPFN_vkGetQueryPoolResults vkGetQueryPoolResults = 0;\nPFN_vkInvalidateMappedMemoryRanges vkInvalidateMappedMemoryRanges = 0;\nPFN_vkMapMemory vkMapMemory = 0;\nPFN_vkMergePipelineCaches vkMergePipelineCaches = 0;\nPFN_vkQueueSubmit vkQueueSubmit = 0;\nPFN_vkQueueWaitIdle vkQueueWaitIdle = 0;\nPFN_vkResetCommandBuffer vkResetCommandBuffer = 0;\nPFN_vkResetCommandPool vkResetCommandPool = 0;\nPFN_vkResetDescriptorPool vkResetDescriptorPool = 0;\nPFN_vkResetFences vkResetFences = 0;\nPFN_vkUnmapMemory vkUnmapMemory = 0;\nPFN_vkUpdateDescriptorSets vkUpdateDescriptorSets = 0;\nPFN_vkWaitForFences vkWaitForFences = 0;\n\nint support_VK_KHR_external_memory_capabilities = 0;\nint support_VK_KHR_get_physical_device_properties2 = 0;\nint support_VK_KHR_get_surface_capabilities2 = 0;\nint support_VK_KHR_portability_enumeration = 0;\nint support_VK_KHR_surface = 0;\nint support_VK_EXT_debug_utils = 0;\nint support_VK_EXT_validation_features = 0;\nint support_VK_EXT_validation_flags = 0;\n#if __ANDROID_API__ >= 26\nint support_VK_KHR_android_surface = 0;\n#endif // __ANDROID_API__ >= 26\n\n// VK_KHR_cooperative_matrix\nPFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR = 
0;\n\n// VK_KHR_external_memory_capabilities\nPFN_vkGetPhysicalDeviceExternalBufferPropertiesKHR vkGetPhysicalDeviceExternalBufferPropertiesKHR = 0;\n\n// VK_KHR_get_physical_device_properties2\nPFN_vkGetPhysicalDeviceFeatures2KHR vkGetPhysicalDeviceFeatures2KHR = 0;\nPFN_vkGetPhysicalDeviceProperties2KHR vkGetPhysicalDeviceProperties2KHR = 0;\nPFN_vkGetPhysicalDeviceFormatProperties2KHR vkGetPhysicalDeviceFormatProperties2KHR = 0;\nPFN_vkGetPhysicalDeviceImageFormatProperties2KHR vkGetPhysicalDeviceImageFormatProperties2KHR = 0;\nPFN_vkGetPhysicalDeviceQueueFamilyProperties2KHR vkGetPhysicalDeviceQueueFamilyProperties2KHR = 0;\nPFN_vkGetPhysicalDeviceMemoryProperties2KHR vkGetPhysicalDeviceMemoryProperties2KHR = 0;\n\n// VK_KHR_get_surface_capabilities2\nPFN_vkGetPhysicalDeviceSurfaceCapabilities2KHR vkGetPhysicalDeviceSurfaceCapabilities2KHR = 0;\nPFN_vkGetPhysicalDeviceSurfaceFormats2KHR vkGetPhysicalDeviceSurfaceFormats2KHR = 0;\n\n// VK_KHR_surface\nPFN_vkDestroySurfaceKHR vkDestroySurfaceKHR = 0;\nPFN_vkGetPhysicalDeviceSurfaceSupportKHR vkGetPhysicalDeviceSurfaceSupportKHR = 0;\nPFN_vkGetPhysicalDeviceSurfaceCapabilitiesKHR vkGetPhysicalDeviceSurfaceCapabilitiesKHR = 0;\nPFN_vkGetPhysicalDeviceSurfaceFormatsKHR vkGetPhysicalDeviceSurfaceFormatsKHR = 0;\nPFN_vkGetPhysicalDeviceSurfacePresentModesKHR vkGetPhysicalDeviceSurfacePresentModesKHR = 0;\n\n#if __ANDROID_API__ >= 26\n// VK_KHR_android_surface\nPFN_vkCreateAndroidSurfaceKHR vkCreateAndroidSurfaceKHR = 0;\n#endif // __ANDROID_API__ >= 26\n\n// VK_NV_cooperative_matrix\nPFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesNV vkGetPhysicalDeviceCooperativeMatrixPropertiesNV = 0;\n\n// VK_NV_cooperative_matrix2\nPFN_vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV = 0;\n\n// VK_NV_cooperative_vector\nPFN_vkGetPhysicalDeviceCooperativeVectorPropertiesNV vkGetPhysicalDeviceCooperativeVectorPropertiesNV = 0;\n\nclass 
GpuInfoPrivate\n{\npublic:\n    void query_features();\n    void query_properties();\n    void query_queue_properties();\n    void query_memory_properties();\n    int query_extensions();\n    void query_extension_features();\n    void query_extension_properties();\n    void evaluate_rough_score();\n\npublic:\n    int device_index;\n\n    // physical device\n    VkPhysicalDevice physicalDevice;\n\n    // features\n    VkPhysicalDeviceFeatures physicalDevicefeatures;\n\n    // properties\n    VkPhysicalDeviceProperties physicalDeviceProperties;\n\n    // memory properties\n    VkPhysicalDeviceMemoryProperties physicalDeviceMemoryProperties;\n\n    // extension properties\n    std::vector<VkExtensionProperties> deviceExtensionProperties;\n\n    // 0 = discrete gpu\n    // 1 = integrated gpu\n    // 2 = virtual gpu\n    // 3 = cpu\n    int type;\n\n    uint32_t rough_score;\n\n    // runtime\n    uint32_t compute_queue_family_index;\n    uint32_t transfer_queue_family_index;\n\n    uint32_t compute_queue_count;\n    uint32_t transfer_queue_count;\n\n    // property\n    bool unified_compute_transfer_queue;\n    bool resizable_bar_enabled;\n\n    // bug is not feature\n    bool bug_storage_buffer_no_l1;\n    bool bug_corrupted_online_pipeline_cache;\n    bool bug_buffer_image_load_zero;\n\n    // but sometimes bug is a feature\n    bool bug_implicit_fp16_arithmetic;\n\n    // cooperative matrix\n    bool support_cooperative_matrix_8_8_16;\n    bool support_cooperative_matrix_16_8_8;\n    bool support_cooperative_matrix_16_8_16;\n    bool support_cooperative_matrix_16_16_16;\n\n    // bf16 cooperative matrix feature\n    bool support_bf16_cooperative_matrix;\n\n    // extension capability\n    int support_VK_KHR_8bit_storage;\n    int support_VK_KHR_16bit_storage;\n    int support_VK_KHR_bind_memory2;\n    int support_VK_KHR_buffer_device_address;\n    int support_VK_KHR_create_renderpass2;\n    int support_VK_KHR_cooperative_matrix;\n    int 
support_VK_KHR_dedicated_allocation;\n    int support_VK_KHR_descriptor_update_template;\n    int support_VK_KHR_driver_properties;\n    int support_VK_KHR_external_memory;\n    int support_VK_KHR_get_memory_requirements2;\n    int support_VK_KHR_maintenance1;\n    int support_VK_KHR_maintenance2;\n    int support_VK_KHR_maintenance3;\n    int support_VK_KHR_multiview;\n    int support_VK_KHR_portability_subset;\n    int support_VK_KHR_push_descriptor;\n    int support_VK_KHR_robustness2;\n    int support_VK_KHR_sampler_ycbcr_conversion;\n    int support_VK_KHR_shader_bfloat16;\n    int support_VK_KHR_shader_float16_int8;\n    int support_VK_KHR_shader_float_controls;\n    int support_VK_KHR_shader_float_controls2;\n    int support_VK_KHR_shader_integer_dot_product;\n    int support_VK_KHR_shader_non_semantic_info;\n    int support_VK_KHR_shader_subgroup_extended_types;\n    int support_VK_KHR_shader_subgroup_rotate;\n    int support_VK_KHR_storage_buffer_storage_class;\n    int support_VK_KHR_swapchain;\n    int support_VK_KHR_vulkan_memory_model;\n    int support_VK_KHR_zero_initialize_workgroup_memory;\n    int support_VK_EXT_buffer_device_address;\n    int support_VK_EXT_descriptor_indexing;\n    int support_VK_EXT_external_memory_host;\n    int support_VK_EXT_memory_budget;\n    int support_VK_EXT_memory_priority;\n    int support_VK_EXT_queue_family_foreign;\n    int support_VK_EXT_robustness2;\n    int support_VK_EXT_shader_atomic_float;\n    int support_VK_EXT_shader_atomic_float2;\n    int support_VK_EXT_shader_float8;\n    int support_VK_EXT_subgroup_size_control;\n    int support_VK_AMD_device_coherent_memory;\n#if __ANDROID_API__ >= 26\n    int support_VK_ANDROID_external_memory_android_hardware_buffer;\n#endif // __ANDROID_API__ >= 26\n    int support_VK_NV_cooperative_matrix;\n    int support_VK_NV_cooperative_matrix2;\n    int support_VK_NV_cooperative_vector;\n\n    // extension features\n    void* queryExtensionFeatures;\n    
VkPhysicalDevice8BitStorageFeaturesKHR query8BitStorageFeatures;\n    VkPhysicalDevice16BitStorageFeaturesKHR query16BitStorageFeatures;\n    VkPhysicalDeviceFloat16Int8FeaturesKHR queryFloat16Int8Features;\n    VkPhysicalDeviceSamplerYcbcrConversionFeaturesKHR querySamplerYcbcrConversionFeatures;\n    VkPhysicalDeviceCooperativeMatrixFeaturesKHR queryCooperativeMatrixFeatures;\n    VkPhysicalDeviceCooperativeMatrixFeaturesNV queryCooperativeMatrixFeaturesNV;\n    VkPhysicalDeviceCooperativeMatrix2FeaturesNV queryCooperativeMatrix2FeaturesNV;\n    VkPhysicalDeviceCooperativeVectorFeaturesNV queryCooperativeVectorFeaturesNV;\n    VkPhysicalDeviceRobustness2FeaturesKHR queryRobustness2Features;\n    VkPhysicalDeviceShaderBfloat16FeaturesKHR queryShaderBfloat16Features;\n    VkPhysicalDeviceShaderFloat8FeaturesEXT queryShaderFloat8Features;\n    VkPhysicalDeviceShaderFloatControls2FeaturesKHR queryShaderFloatControls2Features;\n    VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR queryShaderIntegerDotProductFeatures;\n    VkPhysicalDeviceSubgroupSizeControlFeaturesEXT querySubgroupSizeControlFeatures;\n    VkPhysicalDeviceShaderSubgroupRotateFeaturesKHR queryShaderSubgroupRotateFeatures;\n    VkPhysicalDeviceShaderAtomicFloatFeaturesEXT queryShaderAtomicFloatFeatures;\n    VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT queryShaderAtomicFloat2Features;\n    VkPhysicalDeviceVulkanMemoryModelFeaturesKHR queryVulkanMemoryModelFeatures;\n\n    // extension properties\n    void* queryExtensionProperties;\n    VkPhysicalDeviceFloatControlsPropertiesKHR queryFloatControlsProperties;\n    VkPhysicalDeviceRobustness2PropertiesKHR queryRobustness2Properties;\n    VkPhysicalDeviceShaderIntegerDotProductProperties queryShaderIntegerDotProductProperties;\n    VkPhysicalDeviceSubgroupProperties querySubgroupProperties;\n    VkPhysicalDeviceDriverPropertiesKHR queryDriverProperties;\n    VkPhysicalDeviceSubgroupSizeControlPropertiesEXT querySubgroupSizeControlProperties;\n    
VkPhysicalDeviceExternalMemoryHostPropertiesEXT queryExternalMemoryHostProperties;\n    VkPhysicalDeviceCooperativeMatrix2PropertiesNV queryCooperativeMatrix2PropertiesNV;\n    VkPhysicalDeviceCooperativeVectorPropertiesNV queryCooperativeVectorPropertiesNV;\n\n    // extension sub properties\n    std::vector<VkCooperativeMatrixPropertiesKHR> queryCooperativeMatrixSubProperties;\n    std::vector<VkCooperativeMatrixPropertiesNV> queryCooperativeMatrixSubPropertiesNV;\n    std::vector<VkCooperativeMatrixFlexibleDimensionsPropertiesNV> queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV;\n    std::vector<VkCooperativeVectorPropertiesNV> queryCooperativeVectorSubPropertiesNV;\n};\n\nvoid GpuInfoPrivate::query_features()\n{\n    vkGetPhysicalDeviceFeatures(physicalDevice, &physicalDevicefeatures);\n}\n\nvoid GpuInfoPrivate::query_properties()\n{\n    vkGetPhysicalDeviceProperties(physicalDevice, &physicalDeviceProperties);\n\n    // NCNN_LOGE(\"[%u] apiVersion = %u.%u.%u\", i, VK_VERSION_MAJOR(physicalDeviceProperties.apiVersion),\n    //     VK_VERSION_MINOR(physicalDeviceProperties.apiVersion), VK_VERSION_PATCH(physicalDeviceProperties.apiVersion));\n    // NCNN_LOGE(\"[%u] driverVersion = %u.%u.%u\", i, VK_VERSION_MAJOR(physicalDeviceProperties.driverVersion),\n    //     VK_VERSION_MINOR(physicalDeviceProperties.driverVersion), VK_VERSION_PATCH(physicalDeviceProperties.driverVersion));\n    // NCNN_LOGE(\"[%u] vendorID = %x\", i, physicalDeviceProperties.vendorID);\n    // NCNN_LOGE(\"[%u] deviceID = %x\", i, physicalDeviceProperties.deviceID);\n    // NCNN_LOGE(\"[%u] deviceType = %x\", i, physicalDeviceProperties.deviceType);\n    // NCNN_LOGE(\"[%u] deviceName = %s\", i, physicalDeviceProperties.deviceName);\n    // NCNN_LOGE(\"[%u] pipelineCacheUUID = %u\", i, physicalDeviceProperties.pipelineCacheUUID);\n\n    // device type\n    {\n        type = -1;\n        if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU)\n            type 
= 0;\n        if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU)\n            type = 1;\n        if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU)\n            type = 2;\n        if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_CPU)\n            type = 3;\n    }\n\n    // mali\n    // t760 = 0x13b5 0x7500001 / 0x7501000\n    // t860 = 0x13b5 0x8602000\n    // t880 = 0x13b5 0x8800020\n    // g31  = 0x13b5 0x70930000\n    // g51  = 0x13b5 0x70901010\n    // g52  = 0x13b5 0x74021000 / 0x72120000\n    // g71  = 0x13b5 0x60a00002\n    // g72  = 0x13b5 0x62210001\n    // g76  = 0x13b5 0x72110000\n    // g77  = 0x13b5 0x90800011\n\n    // adreno\n    // 506 = 0x5143 0x5000600\n    // 510 = 0x5143 0x5010000\n    // 512 = 0x5143 0x5010200\n    // 530 = 0x5143 0x5030004\n    // 540 = 0x5143 0x5040001\n    // 616 = 0x5143 0x6010600\n    // 630 = 0x5143 0x6030001\n    // 640 = 0x5143 0x6040001\n    // 650 = 0x5143 0x6050002\n\n    bug_storage_buffer_no_l1 = false;\n    bug_corrupted_online_pipeline_cache = false;\n    bug_implicit_fp16_arithmetic = false;\n    bug_buffer_image_load_zero = false;\n\n    if (physicalDeviceProperties.vendorID == 0x5143 && physicalDeviceProperties.apiVersion < VK_MAKE_VERSION(1, 0, 66))\n    {\n        // qcom adreno with old buggy driver cannot share created pipeline properly\n        bug_corrupted_online_pipeline_cache = true;\n    }\n\n    if (physicalDeviceProperties.vendorID == 0x5143 && !(physicalDeviceProperties.deviceID == 0x6040001 || physicalDeviceProperties.deviceID == 0x6050002))\n    {\n        // NOTE but qcom855/qcom855plus/qcom865 are known exceptions\n        // qcom adreno storage buffer without L1 cache\n        bug_storage_buffer_no_l1 = true;\n    }\n\n    if (physicalDeviceProperties.vendorID == 0x5143 && physicalDeviceProperties.apiVersion < VK_MAKE_VERSION(1, 1, 87))\n    {\n        // HACK buffer2image before image-read dependency does not work 
properly\n        // even promised with full image memory barrier on old adreno driver\n        // TODO figure out a proper workaround without hurt speed too much\n        // TODO only for old drivers\n        bug_buffer_image_load_zero = true;\n    }\n\n    if (physicalDeviceProperties.vendorID == 0x13b5\n            && (physicalDeviceProperties.deviceID == 0x7500001\n                || physicalDeviceProperties.deviceID == 0x7501000\n                || physicalDeviceProperties.deviceID == 0x8602000\n                || physicalDeviceProperties.deviceID == 0x8800020\n                || physicalDeviceProperties.deviceID == 0x70930000\n                || physicalDeviceProperties.deviceID == 0x70901010\n                || physicalDeviceProperties.deviceID == 0x72120000\n                || physicalDeviceProperties.deviceID == 0x74021000\n                || physicalDeviceProperties.deviceID == 0x60a00002\n                || physicalDeviceProperties.deviceID == 0x62210001))\n    {\n        // NOTE rk3288/rk3399/t880/g31/g51/g52/g71/g72\n        // however, g76/g77 has explicit fp16 arithmetic\n        // arm mali driver accept spirv with fp16 arithmetic\n        bug_implicit_fp16_arithmetic = true;\n    }\n\n    if (physicalDeviceProperties.vendorID == 0x5143\n            && (physicalDeviceProperties.deviceID == 0x6030001\n                || physicalDeviceProperties.deviceID == 0x6040001\n                || physicalDeviceProperties.deviceID == 0x6050002))\n    {\n        // TODO enable devices other than qcom845/qcom855/qcom855plus/qcom865\n        // qcom adreno driver accept spirv with fp16 arithmetic\n        bug_implicit_fp16_arithmetic = true;\n    }\n}\n\nstatic uint32_t find_device_compute_queue(const std::vector<VkQueueFamilyProperties>& queueFamilyProperties)\n{\n    // first try, compute only queue\n    for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)\n    {\n        const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];\n\n    
    if ((queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)\n                && !(queueFamilyProperty.queueFlags & VK_QUEUE_GRAPHICS_BIT))\n        {\n            return i;\n        }\n    }\n\n    // second try, any queue with compute and graphics\n    for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)\n    {\n        const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];\n\n        if ((queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)\n                && (queueFamilyProperty.queueFlags & VK_QUEUE_GRAPHICS_BIT))\n        {\n            return i;\n        }\n    }\n\n    // third try, any queue with compute\n    for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)\n    {\n        const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];\n\n        if (queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)\n        {\n            return i;\n        }\n    }\n\n    //     NCNN_LOGE(\"no compute queue\");\n    return -1;\n}\n\nstatic uint32_t find_device_transfer_queue(const std::vector<VkQueueFamilyProperties>& queueFamilyProperties)\n{\n    // first try, transfer only queue\n    for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)\n    {\n        const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];\n\n        if ((queueFamilyProperty.queueFlags & VK_QUEUE_TRANSFER_BIT)\n                && !(queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)\n                && !(queueFamilyProperty.queueFlags & VK_QUEUE_GRAPHICS_BIT))\n        {\n            return i;\n        }\n    }\n\n    // second try, any queue with transfer\n    for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)\n    {\n        const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];\n\n        if (queueFamilyProperty.queueFlags & VK_QUEUE_TRANSFER_BIT)\n        {\n            return i;\n        }\n    }\n\n    // third try, use compute queue\n    uint32_t 
compute_queue_index = find_device_compute_queue(queueFamilyProperties);\n    if (compute_queue_index != (uint32_t)-1)\n    {\n        return compute_queue_index;\n    }\n\n    //     NCNN_LOGE(\"no transfer queue\");\n    return -1;\n}\n\nvoid GpuInfoPrivate::query_queue_properties()\n{\n    // find compute queue\n    uint32_t queueFamilyPropertiesCount;\n    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyPropertiesCount, 0);\n\n    std::vector<VkQueueFamilyProperties> queueFamilyProperties(queueFamilyPropertiesCount);\n    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyPropertiesCount, queueFamilyProperties.data());\n\n    compute_queue_family_index = find_device_compute_queue(queueFamilyProperties);\n    transfer_queue_family_index = find_device_transfer_queue(queueFamilyProperties);\n\n    compute_queue_count = queueFamilyProperties[compute_queue_family_index].queueCount;\n    transfer_queue_count = queueFamilyProperties[transfer_queue_family_index].queueCount;\n\n    unified_compute_transfer_queue = compute_queue_family_index == transfer_queue_family_index;\n}\n\nvoid GpuInfoPrivate::query_memory_properties()\n{\n    // cache memory properties\n    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &physicalDeviceMemoryProperties);\n\n    if (type == 0)\n    {\n        // discrete gpu\n        resizable_bar_enabled = false;\n\n        // find heap that is device local and host visible and not host cached\n        for (uint32_t i = 0; i < physicalDeviceMemoryProperties.memoryHeapCount; i++)\n        {\n            const VkMemoryHeap& memoryHeap = physicalDeviceMemoryProperties.memoryHeaps[i];\n            if (memoryHeap.flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)\n            {\n                VkFlags required = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;\n                VkFlags disallowed = VK_MEMORY_PROPERTY_HOST_CACHED_BIT;\n                for (uint32_t j = 0; j < 
physicalDeviceMemoryProperties.memoryTypeCount; j++)\n                {\n                    const VkMemoryType& memoryType = physicalDeviceMemoryProperties.memoryTypes[j];\n                    if (memoryType.heapIndex != i)\n                        continue;\n\n                    if ((memoryType.propertyFlags & disallowed) != 0)\n                    {\n                        // some drivers treat a portion of host memory as a device local heap, do not select this option\n                        resizable_bar_enabled = false;\n                        break;\n                    }\n\n                    if ((memoryType.propertyFlags & required) == required)\n                    {\n                        resizable_bar_enabled = true;\n                    }\n                }\n\n                // subsequent device local heaps are not considered\n                // amd may declare a small device local + host visible heap for uploading\n                // resizable bar feature is for the main device heap anyway\n                break;\n            }\n        }\n    }\n    else\n    {\n        // integrated gpu\n        resizable_bar_enabled = true;\n    }\n}\n\nint GpuInfoPrivate::query_extensions()\n{\n    // get device extension\n    uint32_t deviceExtensionPropertyCount = 0;\n    VkResult ret = vkEnumerateDeviceExtensionProperties(physicalDevice, NULL, &deviceExtensionPropertyCount, NULL);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEnumerateDeviceExtensionProperties failed %d\", ret);\n        return -1;\n    }\n\n    deviceExtensionProperties.resize(deviceExtensionPropertyCount);\n    ret = vkEnumerateDeviceExtensionProperties(physicalDevice, NULL, &deviceExtensionPropertyCount, deviceExtensionProperties.data());\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEnumerateDeviceExtensionProperties failed %d\", ret);\n        return -1;\n    }\n\n    // extension capability\n    support_VK_KHR_8bit_storage = 0;\n    
support_VK_KHR_16bit_storage = 0;\n    support_VK_KHR_bind_memory2 = 0;\n    support_VK_KHR_buffer_device_address = 0;\n    support_VK_KHR_create_renderpass2 = 0;\n    support_VK_KHR_cooperative_matrix = 0;\n    support_VK_KHR_dedicated_allocation = 0;\n    support_VK_KHR_descriptor_update_template = 0;\n    support_VK_KHR_driver_properties = 0;\n    support_VK_KHR_external_memory = 0;\n    support_VK_KHR_get_memory_requirements2 = 0;\n    support_VK_KHR_maintenance1 = 0;\n    support_VK_KHR_maintenance2 = 0;\n    support_VK_KHR_maintenance3 = 0;\n    support_VK_KHR_multiview = 0;\n    support_VK_KHR_portability_subset = 0;\n    support_VK_KHR_push_descriptor = 0;\n    support_VK_KHR_robustness2 = 0;\n    support_VK_KHR_sampler_ycbcr_conversion = 0;\n    support_VK_KHR_shader_bfloat16 = 0;\n    support_VK_KHR_shader_float16_int8 = 0;\n    support_VK_KHR_shader_float_controls = 0;\n    support_VK_KHR_shader_float_controls2 = 0;\n    support_VK_KHR_shader_integer_dot_product = 0;\n    support_VK_KHR_shader_non_semantic_info = 0;\n    support_VK_KHR_shader_subgroup_extended_types = 0;\n    support_VK_KHR_shader_subgroup_rotate = 0;\n    support_VK_KHR_storage_buffer_storage_class = 0;\n    support_VK_KHR_swapchain = 0;\n    support_VK_KHR_vulkan_memory_model = 0;\n    support_VK_KHR_zero_initialize_workgroup_memory = 0;\n    support_VK_EXT_buffer_device_address = 0;\n    support_VK_EXT_descriptor_indexing = 0;\n    support_VK_EXT_external_memory_host = 0;\n    support_VK_EXT_memory_budget = 0;\n    support_VK_EXT_memory_priority = 0;\n    support_VK_EXT_queue_family_foreign = 0;\n    support_VK_EXT_robustness2 = 0;\n    support_VK_EXT_shader_atomic_float = 0;\n    support_VK_EXT_shader_atomic_float2 = 0;\n    support_VK_EXT_shader_float8 = 0;\n    support_VK_EXT_subgroup_size_control = 0;\n    support_VK_AMD_device_coherent_memory = 0;\n#if __ANDROID_API__ >= 26\n    support_VK_ANDROID_external_memory_android_hardware_buffer = 0;\n#endif // __ANDROID_API__ >= 26\n    
support_VK_NV_cooperative_matrix = 0;\n    support_VK_NV_cooperative_matrix2 = 0;\n    support_VK_NV_cooperative_vector = 0;\n    for (uint32_t j = 0; j < deviceExtensionPropertyCount; j++)\n    {\n        const VkExtensionProperties& exp = deviceExtensionProperties[j];\n        // NCNN_LOGE(\"device extension %s = %u\", exp.extensionName, exp.specVersion);\n\n        if (strcmp(exp.extensionName, \"VK_KHR_8bit_storage\") == 0)\n            support_VK_KHR_8bit_storage = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_16bit_storage\") == 0)\n            support_VK_KHR_16bit_storage = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_bind_memory2\") == 0)\n            support_VK_KHR_bind_memory2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_buffer_device_address\") == 0)\n            support_VK_KHR_buffer_device_address = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_create_renderpass2\") == 0)\n            support_VK_KHR_create_renderpass2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_cooperative_matrix\") == 0)\n            support_VK_KHR_cooperative_matrix = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_dedicated_allocation\") == 0)\n            support_VK_KHR_dedicated_allocation = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_descriptor_update_template\") == 0)\n            support_VK_KHR_descriptor_update_template = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_driver_properties\") == 0)\n            support_VK_KHR_driver_properties = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_external_memory\") == 0)\n            support_VK_KHR_external_memory = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_get_memory_requirements2\") == 0)\n            support_VK_KHR_get_memory_requirements2 = exp.specVersion;\n        else if 
(strcmp(exp.extensionName, \"VK_KHR_maintenance1\") == 0)\n            support_VK_KHR_maintenance1 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_maintenance2\") == 0)\n            support_VK_KHR_maintenance2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_maintenance3\") == 0)\n            support_VK_KHR_maintenance3 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_multiview\") == 0)\n            support_VK_KHR_multiview = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_portability_subset\") == 0)\n            support_VK_KHR_portability_subset = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_push_descriptor\") == 0)\n            support_VK_KHR_push_descriptor = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_robustness2\") == 0)\n            support_VK_KHR_robustness2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_sampler_ycbcr_conversion\") == 0)\n            support_VK_KHR_sampler_ycbcr_conversion = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_bfloat16\") == 0)\n            support_VK_KHR_shader_bfloat16 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_float16_int8\") == 0)\n            support_VK_KHR_shader_float16_int8 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_float_controls\") == 0)\n            support_VK_KHR_shader_float_controls = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_float_controls2\") == 0)\n            support_VK_KHR_shader_float_controls2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_integer_dot_product\") == 0)\n            support_VK_KHR_shader_integer_dot_product = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_non_semantic_info\") == 0)\n            support_VK_KHR_shader_non_semantic_info = 
exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_subgroup_extended_types\") == 0)\n            support_VK_KHR_shader_subgroup_extended_types = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_subgroup_rotate\") == 0)\n            support_VK_KHR_shader_subgroup_rotate = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_storage_buffer_storage_class\") == 0)\n            support_VK_KHR_storage_buffer_storage_class = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_swapchain\") == 0)\n            support_VK_KHR_swapchain = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_vulkan_memory_model\") == 0)\n            support_VK_KHR_vulkan_memory_model = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_zero_initialize_workgroup_memory\") == 0)\n            support_VK_KHR_zero_initialize_workgroup_memory = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_buffer_device_address\") == 0)\n            support_VK_EXT_buffer_device_address = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_descriptor_indexing\") == 0)\n            support_VK_EXT_descriptor_indexing = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_external_memory_host\") == 0)\n            support_VK_EXT_external_memory_host = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_memory_budget\") == 0)\n            support_VK_EXT_memory_budget = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_memory_priority\") == 0)\n            support_VK_EXT_memory_priority = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_queue_family_foreign\") == 0)\n            support_VK_EXT_queue_family_foreign = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_robustness2\") == 0)\n            support_VK_EXT_robustness2 = exp.specVersion;\n        else if 
(strcmp(exp.extensionName, \"VK_EXT_shader_atomic_float\") == 0)\n            support_VK_EXT_shader_atomic_float = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_shader_atomic_float2\") == 0)\n            support_VK_EXT_shader_atomic_float2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_shader_float8\") == 0)\n            support_VK_EXT_shader_float8 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_subgroup_size_control\") == 0)\n            support_VK_EXT_subgroup_size_control = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_AMD_device_coherent_memory\") == 0)\n            support_VK_AMD_device_coherent_memory = exp.specVersion;\n#if __ANDROID_API__ >= 26\n        else if (strcmp(exp.extensionName, \"VK_ANDROID_external_memory_android_hardware_buffer\") == 0)\n            support_VK_ANDROID_external_memory_android_hardware_buffer = exp.specVersion;\n#endif // __ANDROID_API__ >= 26\n        else if (strcmp(exp.extensionName, \"VK_NV_cooperative_matrix\") == 0)\n            support_VK_NV_cooperative_matrix = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_NV_cooperative_matrix2\") == 0)\n            support_VK_NV_cooperative_matrix2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_NV_cooperative_vector\") == 0)\n            support_VK_NV_cooperative_vector = exp.specVersion;\n    }\n\n    if (support_VK_KHR_buffer_device_address)\n    {\n        // we prefer khr extension\n        support_VK_EXT_buffer_device_address = 0;\n    }\n\n    if (support_VK_KHR_cooperative_matrix)\n    {\n        // we prefer khr extension\n        support_VK_NV_cooperative_matrix = 0;\n    }\n\n    if (support_VK_KHR_robustness2)\n    {\n        // we prefer khr extension\n        support_VK_EXT_robustness2 = 0;\n    }\n\n    return 0;\n}\n\nvoid GpuInfoPrivate::query_extension_features()\n{\n    queryExtensionFeatures = 0;\n\n    // query int8 storage\n    
memset(&query8BitStorageFeatures, 0, sizeof(query8BitStorageFeatures));\n    query8BitStorageFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_8BIT_STORAGE_FEATURES_KHR;\n    query8BitStorageFeatures.pNext = 0;\n    if (support_VK_KHR_8bit_storage)\n    {\n        query8BitStorageFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &query8BitStorageFeatures;\n    }\n\n    // query fp16/int16 storage\n    memset(&query16BitStorageFeatures, 0, sizeof(query16BitStorageFeatures));\n    query16BitStorageFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES_KHR;\n    query16BitStorageFeatures.pNext = 0;\n    if (support_VK_KHR_16bit_storage)\n    {\n        query16BitStorageFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &query16BitStorageFeatures;\n    }\n\n    // query fp16/int8 arithmetic\n    memset(&queryFloat16Int8Features, 0, sizeof(queryFloat16Int8Features));\n    queryFloat16Int8Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT16_INT8_FEATURES_KHR;\n    queryFloat16Int8Features.pNext = 0;\n    if (support_VK_KHR_shader_float16_int8)\n    {\n        queryFloat16Int8Features.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryFloat16Int8Features;\n    }\n\n    // query ycbcr_conversion\n    memset(&querySamplerYcbcrConversionFeatures, 0, sizeof(querySamplerYcbcrConversionFeatures));\n    querySamplerYcbcrConversionFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SAMPLER_YCBCR_CONVERSION_FEATURES_KHR;\n    querySamplerYcbcrConversionFeatures.pNext = 0;\n    if (support_VK_KHR_sampler_ycbcr_conversion)\n    {\n        querySamplerYcbcrConversionFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &querySamplerYcbcrConversionFeatures;\n    }\n\n    // query cooperative_matrix\n    memset(&queryCooperativeMatrixFeatures, 0, sizeof(queryCooperativeMatrixFeatures));\n    queryCooperativeMatrixFeatures.sType = 
VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_FEATURES_KHR;\n    queryCooperativeMatrixFeatures.pNext = 0;\n    if (support_VK_KHR_cooperative_matrix)\n    {\n        queryCooperativeMatrixFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryCooperativeMatrixFeatures;\n    }\n\n    // query nv cooperative matrix\n    memset(&queryCooperativeMatrixFeaturesNV, 0, sizeof(queryCooperativeMatrixFeaturesNV));\n    queryCooperativeMatrixFeaturesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_FEATURES_NV;\n    queryCooperativeMatrixFeaturesNV.pNext = 0;\n    if (support_VK_NV_cooperative_matrix)\n    {\n        queryCooperativeMatrixFeaturesNV.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryCooperativeMatrixFeaturesNV;\n    }\n\n    // query nv cooperative matrix2\n    memset(&queryCooperativeMatrix2FeaturesNV, 0, sizeof(queryCooperativeMatrix2FeaturesNV));\n    queryCooperativeMatrix2FeaturesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_2_FEATURES_NV;\n    queryCooperativeMatrix2FeaturesNV.pNext = 0;\n    if (support_VK_NV_cooperative_matrix2)\n    {\n        queryCooperativeMatrix2FeaturesNV.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryCooperativeMatrix2FeaturesNV;\n    }\n\n    // query nv cooperative vector\n    memset(&queryCooperativeVectorFeaturesNV, 0, sizeof(queryCooperativeVectorFeaturesNV));\n    queryCooperativeVectorFeaturesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_VECTOR_FEATURES_NV;\n    queryCooperativeVectorFeaturesNV.pNext = 0;\n    if (support_VK_NV_cooperative_vector)\n    {\n        queryCooperativeVectorFeaturesNV.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryCooperativeVectorFeaturesNV;\n    }\n\n    // query robustness2\n    memset(&queryRobustness2Features, 0, sizeof(queryRobustness2Features));\n    queryRobustness2Features.sType = 
VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ROBUSTNESS_2_FEATURES_KHR;\n    queryRobustness2Features.pNext = 0;\n    if (support_VK_KHR_robustness2 || support_VK_EXT_robustness2)\n    {\n        queryRobustness2Features.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryRobustness2Features;\n    }\n\n    // query bfloat16\n    memset(&queryShaderBfloat16Features, 0, sizeof(queryShaderBfloat16Features));\n    queryShaderBfloat16Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_BFLOAT16_FEATURES_KHR;\n    queryShaderBfloat16Features.pNext = 0;\n    if (support_VK_KHR_shader_bfloat16)\n    {\n        queryShaderBfloat16Features.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryShaderBfloat16Features;\n    }\n\n    // query float8\n    memset(&queryShaderFloat8Features, 0, sizeof(queryShaderFloat8Features));\n    queryShaderFloat8Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_FLOAT8_FEATURES_EXT;\n    queryShaderFloat8Features.pNext = 0;\n    if (support_VK_EXT_shader_float8)\n    {\n        queryShaderFloat8Features.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryShaderFloat8Features;\n    }\n\n    // query float controls 2\n    memset(&queryShaderFloatControls2Features, 0, sizeof(queryShaderFloatControls2Features));\n    queryShaderFloatControls2Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_FLOAT_CONTROLS_2_FEATURES_KHR;\n    queryShaderFloatControls2Features.pNext = 0;\n    if (support_VK_KHR_shader_float_controls2)\n    {\n        queryShaderFloatControls2Features.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryShaderFloatControls2Features;\n    }\n\n    // query integer dot product\n    memset(&queryShaderIntegerDotProductFeatures, 0, sizeof(queryShaderIntegerDotProductFeatures));\n    queryShaderIntegerDotProductFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_INTEGER_DOT_PRODUCT_FEATURES_KHR;\n    queryShaderIntegerDotProductFeatures.pNext = 
0;\n    if (support_VK_KHR_shader_integer_dot_product)\n    {\n        queryShaderIntegerDotProductFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryShaderIntegerDotProductFeatures;\n    }\n\n    // query subgroup size control\n    memset(&querySubgroupSizeControlFeatures, 0, sizeof(querySubgroupSizeControlFeatures));\n    querySubgroupSizeControlFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_FEATURES_EXT;\n    querySubgroupSizeControlFeatures.pNext = 0;\n    if (support_VK_EXT_subgroup_size_control >= 2)\n    {\n        querySubgroupSizeControlFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &querySubgroupSizeControlFeatures;\n    }\n\n    // query subgroup rotate\n    memset(&queryShaderSubgroupRotateFeatures, 0, sizeof(queryShaderSubgroupRotateFeatures));\n    queryShaderSubgroupRotateFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_SUBGROUP_ROTATE_FEATURES_KHR;\n    queryShaderSubgroupRotateFeatures.pNext = 0;\n    if (support_VK_KHR_shader_subgroup_rotate)\n    {\n        queryShaderSubgroupRotateFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryShaderSubgroupRotateFeatures;\n    }\n\n    // query atomic float\n    memset(&queryShaderAtomicFloatFeatures, 0, sizeof(queryShaderAtomicFloatFeatures));\n    queryShaderAtomicFloatFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_ATOMIC_FLOAT_FEATURES_EXT;\n    queryShaderAtomicFloatFeatures.pNext = 0;\n    if (support_VK_EXT_shader_atomic_float)\n    {\n        queryShaderAtomicFloatFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryShaderAtomicFloatFeatures;\n    }\n\n    // query atomic float2\n    memset(&queryShaderAtomicFloat2Features, 0, sizeof(queryShaderAtomicFloat2Features));\n    queryShaderAtomicFloat2Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_ATOMIC_FLOAT_2_FEATURES_EXT;\n    queryShaderAtomicFloat2Features.pNext = 0;\n    if 
(support_VK_EXT_shader_atomic_float2)\n    {\n        queryShaderAtomicFloat2Features.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryShaderAtomicFloat2Features;\n    }\n\n    // query vulkan memory model\n    memset(&queryVulkanMemoryModelFeatures, 0, sizeof(queryVulkanMemoryModelFeatures));\n    queryVulkanMemoryModelFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_MEMORY_MODEL_FEATURES_KHR;\n    queryVulkanMemoryModelFeatures.pNext = 0;\n    if (support_VK_KHR_vulkan_memory_model)\n    {\n        queryVulkanMemoryModelFeatures.pNext = queryExtensionFeatures;\n        queryExtensionFeatures = &queryVulkanMemoryModelFeatures;\n    }\n\n    if (support_VK_KHR_get_physical_device_properties2)\n    {\n        VkPhysicalDeviceFeatures2KHR queryFeatures;\n        queryFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2_KHR;\n        queryFeatures.pNext = queryExtensionFeatures;\n\n        vkGetPhysicalDeviceFeatures2KHR(physicalDevice, &queryFeatures);\n    }\n\n    // apply known blacklist\n    if (physicalDeviceProperties.vendorID == 0x13b5 && physicalDeviceProperties.apiVersion < VK_MAKE_VERSION(1, 0, 82))\n    {\n        // the 16bit_storage implementation of arm mali driver is buggy :[\n        query16BitStorageFeatures.storageBuffer16BitAccess = VK_FALSE;\n    }\n\n    if (physicalDeviceProperties.vendorID == 0x10002 && physicalDeviceProperties.deviceID == 0x70006214 && physicalDeviceProperties.apiVersion == VK_MAKE_VERSION(1, 1, 82))\n    {\n        // the 16bit_storage implementation of vivante gc1700 driver is buggy :[\n        query16BitStorageFeatures.storageBuffer16BitAccess = VK_FALSE;\n    }\n\n    if (bug_implicit_fp16_arithmetic)\n    {\n        // force capability on as long as the driver accept spirv with fp16 arithmetic :D\n        queryFloat16Int8Features.shaderFloat16 = VK_TRUE;\n    }\n\n    if (physicalDeviceProperties.vendorID == 0x5143 && !query16BitStorageFeatures.storageBuffer16BitAccess)\n    {\n    
    // fp16 arithmetic yields wrong result on old adreno drivers :(\n        queryFloat16Int8Features.shaderFloat16 = VK_FALSE;\n    }\n\n    if (physicalDeviceProperties.vendorID == 0x1002)\n    {\n        // emulated cooperative matrix on amd rdna2 is slow\n        switch (physicalDeviceProperties.deviceID)\n        {\n        case 0x73a1: // V620\n        case 0x73a2: // W6900X\n        case 0x73a3: // W6800\n        case 0x73a4: // NAVI21-USB\n        case 0x73a5: // 6950XT\n        case 0x73ab: // W6800X/W6800X-DUO\n        case 0x73ae: // V620-MX\n        case 0x73af: // 6900XT\n        case 0x73bf: // 6800/6800XT/6900XT\n        case 0x73c3: // NAVI22-?\n        case 0x73c4: // NAVI22-USB\n        case 0x73df: // 6700/6700XT/6750XT/6750GRE-12G/6800M/6850MXT\n        case 0x73e0: // NAVI23-?\n        case 0x73e1: // W6600M\n        case 0x73e3: // W6600\n        case 0x73e4: // NAVI23-USB\n        case 0x73ef: // 6650XT/6700S/6800S\n        case 0x73ff: // 6600/6600XT/6750GRE-10G/6600M\n        case 0x7421: // W6500M\n        case 0x7422: // W6400\n        case 0x7423: // W6300/W6300M\n        case 0x7424: // 6300\n        case 0x743f: // 6400/6500XT/6500M\n        {\n            queryCooperativeMatrixFeatures.cooperativeMatrix = VK_FALSE;\n            queryCooperativeMatrixFeaturesNV.cooperativeMatrix = VK_FALSE;\n            break;\n        }\n        default:\n            break;\n        }\n    }\n}\n\nvoid GpuInfoPrivate::evaluate_rough_score()\n{\n    rough_score = 0;\n\n    // device type score\n    if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU)\n        rough_score += 50;\n    if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU)\n        rough_score += 5;\n    if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU)\n        rough_score += 4;\n\n    // simd width score\n    rough_score += querySubgroupProperties.subgroupSize / 32;\n\n    // extension score\n    for 
(size_t i = 0; i < deviceExtensionProperties.size(); i++)\n    {\n        const VkExtensionProperties& exp = deviceExtensionProperties[i];\n\n        if (strcmp(exp.extensionName, \"VK_KHR_cooperative_matrix\") == 0)\n            rough_score += 10;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_bfloat16\") == 0)\n            rough_score += 2;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_integer_dot_product\") == 0)\n            rough_score += 2;\n        else if (strcmp(exp.extensionName, \"VK_KHR_shader_float16_int8\") == 0)\n            rough_score += 2;\n        else if (strcmp(exp.extensionName, \"VK_EXT_shader_float8\") == 0)\n            rough_score += 2;\n    }\n\n    // device local heap size score\n    if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU)\n    {\n        VkDeviceSize max_device_local = 0;\n        for (uint32_t i = 0; i < physicalDeviceMemoryProperties.memoryHeapCount; i++)\n        {\n            const VkMemoryHeap& memoryHeap = physicalDeviceMemoryProperties.memoryHeaps[i];\n            if (memoryHeap.flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)\n            {\n                max_device_local = std::max(max_device_local, memoryHeap.size);\n            }\n        }\n        uint32_t mem_gb = max_device_local / (1024 * 1024 * 1024);\n        rough_score += mem_gb;\n    }\n}\n\nstatic int get_vendor_default_subgroup_size(uint32_t vendorID)\n{\n    int default_size = 64;  // default to 64\n    if (vendorID == 0x5143) // qcom adreno\n        default_size = 128;\n    else if (vendorID == 0x13b5) // arm mali\n        default_size = 16;\n    else if (vendorID == 0x1010) // imgtec powervr\n        default_size = 32;\n    else if (vendorID == 0x1002) // amd\n        default_size = 64;\n    else if (vendorID == 0x10de) // nvidia\n        default_size = 32;\n    else if (vendorID == 0x8086) // intel\n        default_size = 32;\n    return default_size;\n}\n\nvoid 
GpuInfoPrivate::query_extension_properties()\n{\n    queryExtensionProperties = 0;\n\n    // query float controls\n    memset(&queryFloatControlsProperties, 0, sizeof(queryFloatControlsProperties));\n    queryFloatControlsProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT_CONTROLS_PROPERTIES;\n    queryFloatControlsProperties.pNext = 0;\n    if (support_VK_KHR_shader_float_controls)\n    {\n        queryFloatControlsProperties.pNext = queryExtensionProperties;\n        queryExtensionProperties = &queryFloatControlsProperties;\n    }\n\n    // query integer dot product\n    memset(&queryShaderIntegerDotProductProperties, 0, sizeof(queryShaderIntegerDotProductProperties));\n    queryShaderIntegerDotProductProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_INTEGER_DOT_PRODUCT_PROPERTIES_KHR;\n    queryShaderIntegerDotProductProperties.pNext = 0;\n    if (support_VK_KHR_shader_integer_dot_product)\n    {\n        queryShaderIntegerDotProductProperties.pNext = queryExtensionProperties;\n        queryExtensionProperties = &queryShaderIntegerDotProductProperties;\n    }\n\n    // query subgroup\n    memset(&querySubgroupProperties, 0, sizeof(querySubgroupProperties));\n    querySubgroupProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;\n    querySubgroupProperties.pNext = 0;\n    if (g_instance.instance_api_version >= VK_MAKE_VERSION(1, 1, 0))\n    {\n        querySubgroupProperties.pNext = queryExtensionProperties;\n        queryExtensionProperties = &querySubgroupProperties;\n    }\n    else\n    {\n        querySubgroupProperties.subgroupSize = get_vendor_default_subgroup_size(physicalDeviceProperties.vendorID);\n    }\n\n    // query driver properties\n    memset(&queryDriverProperties, 0, sizeof(queryDriverProperties));\n    queryDriverProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES_KHR;\n    queryDriverProperties.pNext = 0;\n    if 
(support_VK_KHR_driver_properties)\n    {\n        queryDriverProperties.pNext = queryExtensionProperties;\n        queryExtensionProperties = &queryDriverProperties;\n    }\n\n    // query robustness2\n    memset(&queryRobustness2Properties, 0, sizeof(queryRobustness2Properties));\n    queryRobustness2Properties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ROBUSTNESS_2_PROPERTIES_KHR;\n    queryRobustness2Properties.pNext = 0;\n    if (support_VK_KHR_robustness2 || support_VK_EXT_robustness2)\n    {\n        queryRobustness2Properties.pNext = queryExtensionProperties;\n        queryExtensionProperties = &queryRobustness2Properties;\n    }\n\n    // query subgroup size control\n    memset(&querySubgroupSizeControlProperties, 0, sizeof(querySubgroupSizeControlProperties));\n    querySubgroupSizeControlProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_PROPERTIES_EXT;\n    querySubgroupSizeControlProperties.pNext = 0;\n    if (support_VK_EXT_subgroup_size_control)\n    {\n        querySubgroupSizeControlProperties.pNext = queryExtensionProperties;\n        queryExtensionProperties = &querySubgroupSizeControlProperties;\n    }\n\n    // query external memory host\n    memset(&queryExternalMemoryHostProperties, 0, sizeof(queryExternalMemoryHostProperties));\n    queryExternalMemoryHostProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_EXTERNAL_MEMORY_HOST_PROPERTIES_EXT;\n    queryExternalMemoryHostProperties.pNext = 0;\n    if (support_VK_EXT_external_memory_host)\n    {\n        queryExternalMemoryHostProperties.pNext = queryExtensionProperties;\n        queryExtensionProperties = &queryExternalMemoryHostProperties;\n    }\n\n    // query nv cooperative matrix2\n    memset(&queryCooperativeMatrix2PropertiesNV, 0, sizeof(queryCooperativeMatrix2PropertiesNV));\n    queryCooperativeMatrix2PropertiesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_2_PROPERTIES_NV;\n    queryCooperativeMatrix2PropertiesNV.pNext = 0;\n    if 
(support_VK_NV_cooperative_matrix2)\n    {\n        queryCooperativeMatrix2PropertiesNV.pNext = queryExtensionProperties;\n        queryExtensionProperties = &queryCooperativeMatrix2PropertiesNV;\n    }\n\n    // query nv cooperative vector\n    memset(&queryCooperativeVectorPropertiesNV, 0, sizeof(queryCooperativeVectorPropertiesNV));\n    queryCooperativeVectorPropertiesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_VECTOR_PROPERTIES_NV;\n    queryCooperativeVectorPropertiesNV.pNext = 0;\n    if (support_VK_NV_cooperative_vector)\n    {\n        queryCooperativeVectorPropertiesNV.pNext = queryExtensionProperties;\n        queryExtensionProperties = &queryCooperativeVectorPropertiesNV;\n    }\n\n    if (support_VK_KHR_get_physical_device_properties2)\n    {\n        VkPhysicalDeviceProperties2KHR queryProperties;\n        queryProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2_KHR;\n        queryProperties.pNext = queryExtensionProperties;\n\n        vkGetPhysicalDeviceProperties2KHR(physicalDevice, &queryProperties);\n\n        // append subgroup rotate\n        if (support_VK_KHR_shader_subgroup_rotate)\n        {\n            if (queryShaderSubgroupRotateFeatures.shaderSubgroupRotate)\n                querySubgroupProperties.supportedOperations |= VK_SUBGROUP_FEATURE_ROTATE_BIT_KHR;\n            if (queryShaderSubgroupRotateFeatures.shaderSubgroupRotateClustered)\n                querySubgroupProperties.supportedOperations |= VK_SUBGROUP_FEATURE_ROTATE_CLUSTERED_BIT_KHR;\n        }\n        // Avoid invalid subgroup size\n        bool is_subgroup_size_valid = (querySubgroupProperties.subgroupSize > 0) && ((querySubgroupProperties.subgroupSize & (querySubgroupProperties.subgroupSize - 1)) == 0);\n        if (!is_subgroup_size_valid)\n        {\n            querySubgroupProperties.subgroupSize = get_vendor_default_subgroup_size(physicalDeviceProperties.vendorID);\n        }\n    }\n\n    if (!support_VK_EXT_subgroup_size_control)\n    {\n 
       querySubgroupSizeControlProperties.minSubgroupSize = querySubgroupProperties.subgroupSize;\n        querySubgroupSizeControlProperties.maxSubgroupSize = querySubgroupProperties.subgroupSize;\n        querySubgroupSizeControlProperties.maxComputeWorkgroupSubgroups = std::max(physicalDeviceProperties.limits.maxComputeWorkGroupInvocations / querySubgroupProperties.subgroupSize, 1u);\n    }\n\n    // query supported cooperative matrix types and operations\n    queryCooperativeMatrixSubProperties.clear();\n    queryCooperativeMatrixSubPropertiesNV.clear();\n    support_cooperative_matrix_8_8_16 = false;\n    support_cooperative_matrix_16_8_8 = false;\n    support_cooperative_matrix_16_8_16 = false;\n    support_cooperative_matrix_16_16_16 = false;\n    support_bf16_cooperative_matrix = false;\n    if (support_VK_KHR_cooperative_matrix && queryCooperativeMatrixFeatures.cooperativeMatrix)\n    {\n        uint32_t propertyCount = 0;\n        VkResult ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, 0);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR failed %d\", ret);\n        }\n\n        queryCooperativeMatrixSubProperties.resize(propertyCount);\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n            memset(&queryCooperativeMatrixSubProperties[j], 0, sizeof(queryCooperativeMatrixSubProperties[j]));\n            queryCooperativeMatrixSubProperties[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR;\n            queryCooperativeMatrixSubProperties[j].pNext = 0;\n        }\n        ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, queryCooperativeMatrixSubProperties.data());\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR failed %d\", ret);\n        }\n\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n   
         const VkCooperativeMatrixPropertiesKHR& cmp = queryCooperativeMatrixSubProperties[j];\n            // NCNN_LOGE(\"cpm %2d %2d %2d  %d %d %d %d  %d\", cmp.MSize, cmp.NSize, cmp.KSize, cmp.AType, cmp.BType, cmp.CType, cmp.ResultType, cmp.scope);\n\n            if (cmp.MSize == 8 && cmp.NSize == 8 && cmp.KSize == 16\n                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR\n                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)\n            {\n                support_cooperative_matrix_8_8_16 = true;\n            }\n            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 8\n                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR\n                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)\n            {\n                support_cooperative_matrix_16_8_8 = true;\n            }\n            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 16\n                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR\n                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)\n            {\n                support_cooperative_matrix_16_8_16 = true;\n            }\n            if (cmp.MSize == 16 && cmp.NSize == 16 && cmp.KSize == 16\n                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR\n                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)\n            {\n                
support_cooperative_matrix_16_16_16 = true;\n            }\n\n            if (cmp.AType == VK_COMPONENT_TYPE_BFLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_BFLOAT16_KHR\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR\n                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)\n            {\n                support_bf16_cooperative_matrix = true;\n            }\n        }\n    }\n    else if (support_VK_NV_cooperative_matrix && queryCooperativeMatrixFeaturesNV.cooperativeMatrix)\n    {\n        uint32_t propertyCount = 0;\n        VkResult ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesNV(physicalDevice, &propertyCount, 0);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeMatrixPropertiesNV failed %d\", ret);\n        }\n\n        queryCooperativeMatrixSubPropertiesNV.resize(propertyCount);\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n            memset(&queryCooperativeMatrixSubPropertiesNV[j], 0, sizeof(queryCooperativeMatrixSubPropertiesNV[j]));\n            queryCooperativeMatrixSubPropertiesNV[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_NV;\n            queryCooperativeMatrixSubPropertiesNV[j].pNext = 0;\n        }\n        ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesNV(physicalDevice, &propertyCount, queryCooperativeMatrixSubPropertiesNV.data());\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeMatrixPropertiesNV failed %d\", ret);\n        }\n\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n            const VkCooperativeMatrixPropertiesNV& cmp = queryCooperativeMatrixSubPropertiesNV[j];\n            // NCNN_LOGE(\"cpm %2d %2d %2d  %d %d %d %d  %d\", cmp.MSize, cmp.NSize, cmp.KSize, cmp.AType, cmp.BType, cmp.CType, cmp.DType, cmp.scope);\n\n            if (cmp.MSize == 8 && cmp.NSize == 8 && cmp.KSize == 16\n                    
&& cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV\n                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)\n            {\n                support_cooperative_matrix_8_8_16 = true;\n            }\n            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 8\n                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV\n                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)\n            {\n                support_cooperative_matrix_16_8_8 = true;\n            }\n            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 16\n                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV\n                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)\n            {\n                support_cooperative_matrix_16_8_16 = true;\n            }\n            if (cmp.MSize == 16 && cmp.NSize == 16 && cmp.KSize == 16\n                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV\n                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV\n                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)\n            {\n                support_cooperative_matrix_16_16_16 = true;\n            }\n        }\n    }\n\n    // query supported cooperative matrix2 types and operations\n    queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV.clear();\n    if (support_VK_NV_cooperative_matrix2 && queryCooperativeMatrix2FeaturesNV.cooperativeMatrixFlexibleDimensions)\n    {\n        uint32_t propertyCount = 0;\n        
VkResult ret = vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV(physicalDevice, &propertyCount, 0);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV failed %d\", ret);\n        }\n\n        queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV.resize(propertyCount);\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n            memset(&queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j], 0, sizeof(queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j]));\n            queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_FLEXIBLE_DIMENSIONS_PROPERTIES_NV;\n            queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j].pNext = 0;\n        }\n        ret = vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV(physicalDevice, &propertyCount, queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV.data());\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV failed %d\", ret);\n        }\n\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n            const VkCooperativeMatrixFlexibleDimensionsPropertiesNV& cmfdp = queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j];\n            // NCNN_LOGE(\"cmfdp %2d %2d %2d  %d %d %d %d  %d %d %d\", cmfdp.MGranularity, cmfdp.NGranularity, cmfdp.KGranularity, cmfdp.AType, cmfdp.BType, cmfdp.CType, cmfdp.ResultType, cmfdp.saturatingAccumulation, cmfdp.scope, cmfdp.workgroupInvocations);\n        }\n    }\n\n    // query supported cooperative vector types and operations\n    queryCooperativeVectorSubPropertiesNV.clear();\n    if (support_VK_NV_cooperative_vector && queryCooperativeVectorFeaturesNV.cooperativeVector)\n    {\n        uint32_t propertyCount = 0;\n        VkResult ret = 
vkGetPhysicalDeviceCooperativeVectorPropertiesNV(physicalDevice, &propertyCount, 0);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeVectorPropertiesNV failed %d\", ret);\n        }\n\n        queryCooperativeVectorSubPropertiesNV.resize(propertyCount);\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n            memset(&queryCooperativeVectorSubPropertiesNV[j], 0, sizeof(queryCooperativeVectorSubPropertiesNV[j]));\n            queryCooperativeVectorSubPropertiesNV[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_VECTOR_PROPERTIES_NV;\n            queryCooperativeVectorSubPropertiesNV[j].pNext = 0;\n        }\n        ret = vkGetPhysicalDeviceCooperativeVectorPropertiesNV(physicalDevice, &propertyCount, queryCooperativeVectorSubPropertiesNV.data());\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkGetPhysicalDeviceCooperativeVectorPropertiesNV failed %d\", ret);\n        }\n\n        for (uint32_t j = 0; j < propertyCount; j++)\n        {\n            const VkCooperativeVectorPropertiesNV& cvp = queryCooperativeVectorSubPropertiesNV[j];\n            // NCNN_LOGE(\"cvp %d %d %d %d %d  %d\", cvp.inputType, cvp.inputInterpretation, cvp.matrixInterpretation, cvp.biasInterpretation, cvp.resultType, cvp.transpose);\n        }\n    }\n\n    if (queryDriverProperties.driverID == VK_DRIVER_ID_MESA_TURNIP)\n    {\n        // turnip crash when compiling large shader with full subgroup\n        querySubgroupSizeControlFeatures.computeFullSubgroups = VK_FALSE;\n    }\n}\n\nGpuInfo::GpuInfo()\n    : d(new GpuInfoPrivate)\n{\n}\n\nGpuInfo::~GpuInfo()\n{\n    delete d;\n}\n\nGpuInfo::GpuInfo(const GpuInfo&)\n    : d(0)\n{\n}\n\nGpuInfo& GpuInfo::operator=(const GpuInfo&)\n{\n    return *this;\n}\n\nint GpuInfo::device_index() const\n{\n    return d->device_index;\n}\n\nVkPhysicalDevice GpuInfo::physicalDevice() const\n{\n    return d->physicalDevice;\n}\n\nVkPhysicalDevice GpuInfo::physical_device() 
const\n{\n    return d->physicalDevice;\n}\n\nconst VkPhysicalDeviceFeatures& GpuInfo::physicalDevicefeatures() const\n{\n    return d->physicalDevicefeatures;\n}\n\nconst VkPhysicalDeviceProperties& GpuInfo::physicalDeviceProperties() const\n{\n    return d->physicalDeviceProperties;\n}\n\nconst VkPhysicalDeviceMemoryProperties& GpuInfo::physicalDeviceMemoryProperties() const\n{\n    return d->physicalDeviceMemoryProperties;\n}\n\nconst VkPhysicalDeviceMemoryProperties& GpuInfo::physical_device_memory_properties() const\n{\n    return d->physicalDeviceMemoryProperties;\n}\n\nconst std::vector<VkExtensionProperties>& GpuInfo::deviceExtensionProperties() const\n{\n    return d->deviceExtensionProperties;\n}\n\nuint32_t GpuInfo::api_version() const\n{\n    return d->physicalDeviceProperties.apiVersion;\n}\n\nuint32_t GpuInfo::driver_version() const\n{\n    return d->physicalDeviceProperties.driverVersion;\n}\n\nuint32_t GpuInfo::vendor_id() const\n{\n    return d->physicalDeviceProperties.vendorID;\n}\n\nuint32_t GpuInfo::device_id() const\n{\n    return d->physicalDeviceProperties.deviceID;\n}\n\nconst char* GpuInfo::device_name() const\n{\n    return d->physicalDeviceProperties.deviceName;\n}\n\nuint8_t* GpuInfo::pipeline_cache_uuid() const\n{\n    return d->physicalDeviceProperties.pipelineCacheUUID;\n}\n\nuint32_t GpuInfo::driver_id() const\n{\n    return d->queryDriverProperties.driverID;\n}\n\nconst char* GpuInfo::driver_name() const\n{\n    return d->queryDriverProperties.driverName;\n}\n\nint GpuInfo::type() const\n{\n    return d->type;\n}\n\nuint32_t GpuInfo::rough_score() const\n{\n    return d->rough_score;\n}\n\nuint32_t GpuInfo::max_shared_memory_size() const\n{\n    return d->physicalDeviceProperties.limits.maxComputeSharedMemorySize;\n}\n\nuint32_t GpuInfo::max_workgroup_count_x() const\n{\n    return d->physicalDeviceProperties.limits.maxComputeWorkGroupCount[0];\n}\n\nuint32_t GpuInfo::max_workgroup_count_y() const\n{\n    return 
d->physicalDeviceProperties.limits.maxComputeWorkGroupCount[1];\n}\n\nuint32_t GpuInfo::max_workgroup_count_z() const\n{\n    return d->physicalDeviceProperties.limits.maxComputeWorkGroupCount[2];\n}\n\nuint32_t GpuInfo::max_workgroup_invocations() const\n{\n    return d->physicalDeviceProperties.limits.maxComputeWorkGroupInvocations;\n}\n\nuint32_t GpuInfo::max_workgroup_size_x() const\n{\n    return d->physicalDeviceProperties.limits.maxComputeWorkGroupSize[0];\n}\n\nuint32_t GpuInfo::max_workgroup_size_y() const\n{\n    return d->physicalDeviceProperties.limits.maxComputeWorkGroupSize[1];\n}\n\nuint32_t GpuInfo::max_workgroup_size_z() const\n{\n    return d->physicalDeviceProperties.limits.maxComputeWorkGroupSize[2];\n}\n\nsize_t GpuInfo::memory_map_alignment() const\n{\n    return d->physicalDeviceProperties.limits.minMemoryMapAlignment;\n}\n\nsize_t GpuInfo::buffer_offset_alignment() const\n{\n    return d->physicalDeviceProperties.limits.minStorageBufferOffsetAlignment;\n}\n\nsize_t GpuInfo::non_coherent_atom_size() const\n{\n    return d->physicalDeviceProperties.limits.nonCoherentAtomSize;\n}\n\nsize_t GpuInfo::buffer_image_granularity() const\n{\n    return d->physicalDeviceProperties.limits.bufferImageGranularity;\n}\n\nuint32_t GpuInfo::max_image_dimension_1d() const\n{\n    return d->physicalDeviceProperties.limits.maxImageDimension1D;\n}\n\nuint32_t GpuInfo::max_image_dimension_2d() const\n{\n    return d->physicalDeviceProperties.limits.maxImageDimension2D;\n}\n\nuint32_t GpuInfo::max_image_dimension_3d() const\n{\n    return d->physicalDeviceProperties.limits.maxImageDimension3D;\n}\n\nfloat GpuInfo::timestamp_period() const\n{\n    return d->physicalDeviceProperties.limits.timestampPeriod;\n}\n\nuint32_t GpuInfo::compute_queue_family_index() const\n{\n    return d->compute_queue_family_index;\n}\n\nuint32_t GpuInfo::transfer_queue_family_index() const\n{\n    return d->transfer_queue_family_index;\n}\n\nuint32_t GpuInfo::compute_queue_count() 
const\n{\n    return d->compute_queue_count;\n}\n\nuint32_t GpuInfo::transfer_queue_count() const\n{\n    return d->transfer_queue_count;\n}\n\nbool GpuInfo::unified_compute_transfer_queue() const\n{\n    return d->unified_compute_transfer_queue;\n}\n\nbool GpuInfo::resizable_bar_enabled() const\n{\n    return d->resizable_bar_enabled;\n}\n\nuint32_t GpuInfo::subgroup_size() const\n{\n    return d->querySubgroupProperties.subgroupSize;\n}\n\nuint32_t GpuInfo::min_subgroup_size() const\n{\n    return d->querySubgroupSizeControlProperties.minSubgroupSize;\n}\n\nuint32_t GpuInfo::max_subgroup_size() const\n{\n    return d->querySubgroupSizeControlProperties.maxSubgroupSize;\n}\n\nuint32_t GpuInfo::max_compute_workgroup_subgroups() const\n{\n    return d->querySubgroupSizeControlProperties.maxComputeWorkgroupSubgroups;\n}\n\nbool GpuInfo::support_subgroup_size_control() const\n{\n    return d->querySubgroupSizeControlFeatures.subgroupSizeControl;\n}\n\nbool GpuInfo::support_compute_full_subgroups() const\n{\n    return d->querySubgroupSizeControlFeatures.computeFullSubgroups;\n}\n\nuint32_t GpuInfo::support_subgroup_ops() const\n{\n    return d->querySubgroupProperties.supportedOperations;\n}\n\nbool GpuInfo::bug_storage_buffer_no_l1() const\n{\n    return d->bug_storage_buffer_no_l1;\n}\n\nbool GpuInfo::bug_corrupted_online_pipeline_cache() const\n{\n    return d->bug_corrupted_online_pipeline_cache;\n}\n\nbool GpuInfo::bug_buffer_image_load_zero() const\n{\n    return d->bug_buffer_image_load_zero;\n}\n\nbool GpuInfo::bug_implicit_fp16_arithmetic() const\n{\n    return d->bug_implicit_fp16_arithmetic;\n}\n\nbool GpuInfo::support_fp16_packed() const\n{\n    return true;\n}\n\nbool GpuInfo::support_fp16_storage() const\n{\n    return d->query16BitStorageFeatures.storageBuffer16BitAccess;\n}\n\nbool GpuInfo::support_fp16_uniform() const\n{\n    return d->query16BitStorageFeatures.uniformAndStorageBuffer16BitAccess;\n}\n\nbool GpuInfo::support_fp16_arithmetic() 
const\n{\n    return d->queryFloat16Int8Features.shaderFloat16;\n}\n\nbool GpuInfo::support_int8_packed() const\n{\n    return true;\n}\n\nbool GpuInfo::support_int8_storage() const\n{\n    return d->query8BitStorageFeatures.storageBuffer8BitAccess;\n}\n\nbool GpuInfo::support_int8_uniform() const\n{\n    return d->query8BitStorageFeatures.uniformAndStorageBuffer8BitAccess;\n}\n\nbool GpuInfo::support_int8_arithmetic() const\n{\n    return d->queryFloat16Int8Features.shaderInt8;\n}\n\nbool GpuInfo::support_bf16_packed() const\n{\n    return true;\n}\n\nbool GpuInfo::support_bf16_storage() const\n{\n    return d->queryShaderBfloat16Features.shaderBFloat16Type;\n}\n\nbool GpuInfo::support_fp16_image() const\n{\n    return d->physicalDevicefeatures.shaderStorageImageExtendedFormats;\n}\n\nbool GpuInfo::support_int8_image() const\n{\n    return d->physicalDevicefeatures.shaderStorageImageExtendedFormats;\n}\n\nbool GpuInfo::support_fp_fast_math() const\n{\n    return d->queryShaderFloatControls2Features.shaderFloatControls2;\n}\n\nbool GpuInfo::support_ycbcr_conversion() const\n{\n    return d->querySamplerYcbcrConversionFeatures.samplerYcbcrConversion;\n}\n\nbool GpuInfo::support_cooperative_matrix() const\n{\n    return d->queryCooperativeMatrixFeatures.cooperativeMatrix || d->queryCooperativeMatrixFeaturesNV.cooperativeMatrix;\n}\n\nbool GpuInfo::support_cooperative_matrix_8_8_16() const\n{\n    return d->support_cooperative_matrix_8_8_16;\n}\n\nbool GpuInfo::support_cooperative_matrix_16_8_8() const\n{\n    return d->support_cooperative_matrix_16_8_8;\n}\n\nbool GpuInfo::support_cooperative_matrix_16_8_16() const\n{\n    return d->support_cooperative_matrix_16_8_16;\n}\n\nbool GpuInfo::support_cooperative_matrix_16_16_16() const\n{\n    return d->support_cooperative_matrix_16_16_16;\n}\n\nbool GpuInfo::support_bf16_cooperative_matrix() const\n{\n    return d->support_bf16_cooperative_matrix;\n}\n\nint GpuInfo::support_VK_KHR_8bit_storage() const\n{\n    return 
d->support_VK_KHR_8bit_storage;\n}\n\nint GpuInfo::support_VK_KHR_16bit_storage() const\n{\n    return d->support_VK_KHR_16bit_storage;\n}\n\nint GpuInfo::support_VK_KHR_bind_memory2() const\n{\n    return d->support_VK_KHR_bind_memory2;\n}\n\nint GpuInfo::support_VK_KHR_buffer_device_address() const\n{\n    return d->support_VK_KHR_buffer_device_address;\n}\n\nint GpuInfo::support_VK_KHR_create_renderpass2() const\n{\n    return d->support_VK_KHR_create_renderpass2;\n}\n\nint GpuInfo::support_VK_KHR_cooperative_matrix() const\n{\n    return d->support_VK_KHR_cooperative_matrix;\n}\n\nint GpuInfo::support_VK_KHR_dedicated_allocation() const\n{\n    return d->support_VK_KHR_dedicated_allocation;\n}\n\nint GpuInfo::support_VK_KHR_descriptor_update_template() const\n{\n    return d->support_VK_KHR_descriptor_update_template;\n}\n\nint GpuInfo::support_VK_KHR_driver_properties() const\n{\n    return d->support_VK_KHR_driver_properties;\n}\n\nint GpuInfo::support_VK_KHR_external_memory() const\n{\n    return d->support_VK_KHR_external_memory;\n}\n\nint GpuInfo::support_VK_KHR_get_memory_requirements2() const\n{\n    return d->support_VK_KHR_get_memory_requirements2;\n}\n\nint GpuInfo::support_VK_KHR_maintenance1() const\n{\n    return d->support_VK_KHR_maintenance1;\n}\n\nint GpuInfo::support_VK_KHR_maintenance2() const\n{\n    return d->support_VK_KHR_maintenance2;\n}\n\nint GpuInfo::support_VK_KHR_maintenance3() const\n{\n    return d->support_VK_KHR_maintenance3;\n}\n\nint GpuInfo::support_VK_KHR_multiview() const\n{\n    return d->support_VK_KHR_multiview;\n}\n\nint GpuInfo::support_VK_KHR_portability_subset() const\n{\n    return d->support_VK_KHR_portability_subset;\n}\n\nint GpuInfo::support_VK_KHR_push_descriptor() const\n{\n    return d->support_VK_KHR_push_descriptor;\n}\n\nint GpuInfo::support_VK_KHR_robustness2() const\n{\n    return d->support_VK_KHR_robustness2;\n}\n\nint GpuInfo::support_VK_KHR_sampler_ycbcr_conversion() const\n{\n    return 
d->support_VK_KHR_sampler_ycbcr_conversion;\n}\n\nint GpuInfo::support_VK_KHR_shader_bfloat16() const\n{\n    return d->support_VK_KHR_shader_bfloat16;\n}\n\nint GpuInfo::support_VK_KHR_shader_float16_int8() const\n{\n    return d->support_VK_KHR_shader_float16_int8;\n}\n\nint GpuInfo::support_VK_KHR_shader_float_controls() const\n{\n    return d->support_VK_KHR_shader_float_controls;\n}\n\nint GpuInfo::support_VK_KHR_shader_float_controls2() const\n{\n    return d->support_VK_KHR_shader_float_controls2;\n}\n\nint GpuInfo::support_VK_KHR_shader_integer_dot_product() const\n{\n    return d->support_VK_KHR_shader_integer_dot_product;\n}\n\nint GpuInfo::support_VK_KHR_shader_non_semantic_info() const\n{\n    return d->support_VK_KHR_shader_non_semantic_info;\n}\n\nint GpuInfo::support_VK_KHR_shader_subgroup_extended_types() const\n{\n    return d->support_VK_KHR_shader_subgroup_extended_types;\n}\n\nint GpuInfo::support_VK_KHR_shader_subgroup_rotate() const\n{\n    return d->support_VK_KHR_shader_subgroup_rotate;\n}\n\nint GpuInfo::support_VK_KHR_storage_buffer_storage_class() const\n{\n    return d->support_VK_KHR_storage_buffer_storage_class;\n}\n\nint GpuInfo::support_VK_KHR_swapchain() const\n{\n    return d->support_VK_KHR_swapchain;\n}\n\nint GpuInfo::support_VK_KHR_vulkan_memory_model() const\n{\n    return d->support_VK_KHR_vulkan_memory_model;\n}\n\nint GpuInfo::support_VK_KHR_zero_initialize_workgroup_memory() const\n{\n    return d->support_VK_KHR_zero_initialize_workgroup_memory;\n}\n\nint GpuInfo::support_VK_EXT_buffer_device_address() const\n{\n    return d->support_VK_EXT_buffer_device_address;\n}\n\nint GpuInfo::support_VK_EXT_descriptor_indexing() const\n{\n    return d->support_VK_EXT_descriptor_indexing;\n}\n\nint GpuInfo::support_VK_EXT_external_memory_host() const\n{\n    return d->support_VK_EXT_external_memory_host;\n}\n\nint GpuInfo::support_VK_EXT_memory_budget() const\n{\n    return d->support_VK_EXT_memory_budget;\n}\n\nint 
GpuInfo::support_VK_EXT_memory_priority() const\n{\n    return d->support_VK_EXT_memory_priority;\n}\n\nint GpuInfo::support_VK_EXT_queue_family_foreign() const\n{\n    return d->support_VK_EXT_queue_family_foreign;\n}\n\nint GpuInfo::support_VK_EXT_robustness2() const\n{\n    return d->support_VK_EXT_robustness2;\n}\n\nint GpuInfo::support_VK_EXT_shader_atomic_float() const\n{\n    return d->support_VK_EXT_shader_atomic_float;\n}\n\nint GpuInfo::support_VK_EXT_shader_atomic_float2() const\n{\n    return d->support_VK_EXT_shader_atomic_float2;\n}\n\nint GpuInfo::support_VK_EXT_shader_float8() const\n{\n    return d->support_VK_EXT_shader_float8;\n}\n\nint GpuInfo::support_VK_EXT_subgroup_size_control() const\n{\n    return d->support_VK_EXT_subgroup_size_control;\n}\n\nint GpuInfo::support_VK_AMD_device_coherent_memory() const\n{\n    return d->support_VK_AMD_device_coherent_memory;\n}\n\n#if __ANDROID_API__ >= 26\nint GpuInfo::support_VK_ANDROID_external_memory_android_hardware_buffer() const\n{\n    return d->support_VK_ANDROID_external_memory_android_hardware_buffer;\n}\n#endif // __ANDROID_API__ >= 26\n\nint GpuInfo::support_VK_NV_cooperative_matrix() const\n{\n    return d->support_VK_NV_cooperative_matrix;\n}\n\nint GpuInfo::support_VK_NV_cooperative_matrix2() const\n{\n    return d->support_VK_NV_cooperative_matrix2;\n}\n\nint GpuInfo::support_VK_NV_cooperative_vector() const\n{\n    return d->support_VK_NV_cooperative_vector;\n}\n\nconst void* GpuInfo::queryExtensionFeatures() const\n{\n    return d->queryExtensionFeatures;\n}\n\nconst VkPhysicalDevice8BitStorageFeaturesKHR& GpuInfo::query8BitStorageFeatures() const\n{\n    return d->query8BitStorageFeatures;\n}\n\nconst VkPhysicalDevice16BitStorageFeaturesKHR& GpuInfo::query16BitStorageFeatures() const\n{\n    return d->query16BitStorageFeatures;\n}\n\nconst VkPhysicalDeviceFloat16Int8FeaturesKHR& GpuInfo::queryFloat16Int8Features() const\n{\n    return d->queryFloat16Int8Features;\n}\n\nconst 
VkPhysicalDeviceSamplerYcbcrConversionFeaturesKHR& GpuInfo::querySamplerYcbcrConversionFeatures() const\n{\n    return d->querySamplerYcbcrConversionFeatures;\n}\n\nconst VkPhysicalDeviceCooperativeMatrixFeaturesKHR& GpuInfo::queryCooperativeMatrixFeatures() const\n{\n    return d->queryCooperativeMatrixFeatures;\n}\n\nconst VkPhysicalDeviceCooperativeMatrixFeaturesNV& GpuInfo::queryCooperativeMatrixFeaturesNV() const\n{\n    return d->queryCooperativeMatrixFeaturesNV;\n}\n\nconst VkPhysicalDeviceCooperativeMatrix2FeaturesNV& GpuInfo::queryCooperativeMatrix2FeaturesNV() const\n{\n    return d->queryCooperativeMatrix2FeaturesNV;\n}\n\nconst VkPhysicalDeviceCooperativeVectorFeaturesNV& GpuInfo::queryCooperativeVectorFeaturesNV() const\n{\n    return d->queryCooperativeVectorFeaturesNV;\n}\n\nconst VkPhysicalDeviceRobustness2FeaturesKHR& GpuInfo::queryRobustness2Features() const\n{\n    return d->queryRobustness2Features;\n}\n\nconst VkPhysicalDeviceSubgroupSizeControlFeaturesEXT& GpuInfo::querySubgroupSizeControlFeatures() const\n{\n    return d->querySubgroupSizeControlFeatures;\n}\n\nconst VkPhysicalDeviceShaderBfloat16FeaturesKHR& GpuInfo::queryShaderBfloat16Features() const\n{\n    return d->queryShaderBfloat16Features;\n}\n\nconst VkPhysicalDeviceShaderFloat8FeaturesEXT& GpuInfo::queryShaderFloat8Features() const\n{\n    return d->queryShaderFloat8Features;\n}\n\nconst VkPhysicalDeviceShaderFloatControls2FeaturesKHR& GpuInfo::queryShaderFloatControls2Features() const\n{\n    return d->queryShaderFloatControls2Features;\n}\n\nconst VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR& GpuInfo::queryShaderIntegerDotProductFeatures() const\n{\n    return d->queryShaderIntegerDotProductFeatures;\n}\n\nconst VkPhysicalDeviceShaderSubgroupRotateFeaturesKHR& GpuInfo::queryShaderSubgroupRotateFeatures() const\n{\n    return d->queryShaderSubgroupRotateFeatures;\n}\n\nconst VkPhysicalDeviceShaderAtomicFloatFeaturesEXT& GpuInfo::queryShaderAtomicFloatFeatures() const\n{\n   
 return d->queryShaderAtomicFloatFeatures;\n}\n\nconst VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT& GpuInfo::queryShaderAtomicFloat2Features() const\n{\n    return d->queryShaderAtomicFloat2Features;\n}\n\nconst VkPhysicalDeviceVulkanMemoryModelFeaturesKHR& GpuInfo::queryVulkanMemoryModelFeatures() const\n{\n    return d->queryVulkanMemoryModelFeatures;\n}\n\nconst void* GpuInfo::queryExtensionProperties() const\n{\n    return d->queryExtensionProperties;\n}\n\nconst VkPhysicalDeviceCooperativeMatrix2PropertiesNV& GpuInfo::queryCooperativeMatrix2PropertiesNV() const\n{\n    return d->queryCooperativeMatrix2PropertiesNV;\n}\n\nconst VkPhysicalDeviceCooperativeVectorPropertiesNV& GpuInfo::queryCooperativeVectorPropertiesNV() const\n{\n    return d->queryCooperativeVectorPropertiesNV;\n}\n\nconst VkPhysicalDeviceDriverPropertiesKHR& GpuInfo::queryDriverProperties() const\n{\n    return d->queryDriverProperties;\n}\n\nconst VkPhysicalDeviceFloatControlsPropertiesKHR& GpuInfo::queryFloatControlsProperties() const\n{\n    return d->queryFloatControlsProperties;\n}\n\nconst VkPhysicalDeviceRobustness2PropertiesKHR& GpuInfo::queryRobustness2Properties() const\n{\n    return d->queryRobustness2Properties;\n}\n\nconst VkPhysicalDeviceShaderIntegerDotProductProperties& GpuInfo::queryShaderIntegerDotProductProperties() const\n{\n    return d->queryShaderIntegerDotProductProperties;\n}\n\nconst VkPhysicalDeviceSubgroupProperties& GpuInfo::querySubgroupProperties() const\n{\n    return d->querySubgroupProperties;\n}\n\nconst VkPhysicalDeviceSubgroupSizeControlPropertiesEXT& GpuInfo::querySubgroupSizeControlProperties() const\n{\n    return d->querySubgroupSizeControlProperties;\n}\n\nconst VkPhysicalDeviceExternalMemoryHostPropertiesEXT& GpuInfo::queryExternalMemoryHostProperties() const\n{\n    return d->queryExternalMemoryHostProperties;\n}\n\nconst std::vector<VkCooperativeMatrixPropertiesKHR>& GpuInfo::queryCooperativeMatrixSubProperties() const\n{\n    return 
d->queryCooperativeMatrixSubProperties;\n}\n\nconst std::vector<VkCooperativeMatrixPropertiesNV>& GpuInfo::queryCooperativeMatrixSubPropertiesNV() const\n{\n    return d->queryCooperativeMatrixSubPropertiesNV;\n}\n\nconst std::vector<VkCooperativeMatrixFlexibleDimensionsPropertiesNV>& GpuInfo::queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV() const\n{\n    return d->queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV;\n}\n\nconst std::vector<VkCooperativeVectorPropertiesNV>& GpuInfo::queryCooperativeVectorSubPropertiesNV() const\n{\n    return d->queryCooperativeVectorSubPropertiesNV;\n}\n\nvoid GpuInfo::get_optimal_cooperative_matrix_mnk(int M, int N, int K, VkComponentTypeKHR type, VkComponentTypeKHR acctype, VkScopeKHR scope, int& coopmat_M, int& coopmat_N, int& coopmat_K, int& coopmat_subgroup_size) const\n{\n    coopmat_M = 0;\n    coopmat_N = 0;\n    coopmat_K = 0;\n    coopmat_subgroup_size = d->querySubgroupProperties.subgroupSize;\n\n    // collect mnk candidates\n    std::vector<VkCooperativeMatrixPropertiesKHR> mnk_properties;\n\n    if (d->support_VK_KHR_cooperative_matrix && d->queryCooperativeMatrixFeatures.cooperativeMatrix)\n    {\n        for (size_t i = 0; i < d->queryCooperativeMatrixSubProperties.size(); i++)\n        {\n            const VkCooperativeMatrixPropertiesKHR& cmp = d->queryCooperativeMatrixSubProperties[i];\n\n            if (cmp.AType == type && cmp.BType == type\n                    && cmp.CType == acctype && cmp.ResultType == acctype\n                    && cmp.scope == scope)\n            {\n                mnk_properties.push_back(cmp);\n            }\n        }\n    }\n    else if (d->support_VK_NV_cooperative_matrix && d->queryCooperativeMatrixFeaturesNV.cooperativeMatrix)\n    {\n        for (size_t i = 0; i < d->queryCooperativeMatrixSubPropertiesNV.size(); i++)\n        {\n            const VkCooperativeMatrixPropertiesNV& cmp = d->queryCooperativeMatrixSubPropertiesNV[i];\n\n            if (cmp.AType == 
(VkComponentTypeNV)type && cmp.BType == (VkComponentTypeNV)type\n                    && cmp.CType == (VkComponentTypeNV)acctype && cmp.DType == (VkComponentTypeNV)acctype\n                    && cmp.scope == (VkScopeNV)scope)\n            {\n                VkCooperativeMatrixPropertiesKHR cmp_khr;\n                cmp_khr.MSize = cmp.MSize;\n                cmp_khr.NSize = cmp.NSize;\n                cmp_khr.KSize = cmp.KSize;\n\n                mnk_properties.push_back(cmp_khr);\n            }\n        }\n    }\n\n    if (mnk_properties.empty() && (acctype == VK_COMPONENT_TYPE_FLOAT16_KHR || acctype == VK_COMPONENT_TYPE_BFLOAT16_KHR))\n    {\n        // try acctype fp32\n        return get_optimal_cooperative_matrix_mnk(M, N, K, type, VK_COMPONENT_TYPE_FLOAT32_KHR, scope, coopmat_M, coopmat_N, coopmat_K, coopmat_subgroup_size);\n    }\n\n    if (mnk_properties.empty())\n        return;\n\n    // find the optimal, prefer the first mnk tuple with same cost\n    double min_cost = DBL_MAX;\n    for (size_t i = 0; i < mnk_properties.size(); i++)\n    {\n        const VkCooperativeMatrixPropertiesKHR& cmp = mnk_properties[i];\n\n        const int M_pad = (M + cmp.MSize - 1) / cmp.MSize * cmp.MSize;\n        const int N_pad = (N + cmp.NSize - 1) / cmp.NSize * cmp.NSize;\n        const int K_pad = (K + cmp.KSize - 1) / cmp.KSize * cmp.KSize;\n\n        // padding overhead, computed in double to avoid int overflow on large M*N*K\n        double cost = (double)M_pad * N_pad * K_pad - (double)M * N * K;\n        if (cost < min_cost)\n        {\n            min_cost = cost;\n            coopmat_M = cmp.MSize;\n            coopmat_N = cmp.NSize;\n            coopmat_K = cmp.KSize;\n        }\n    }\n}\n\nstatic int init_instance_core()\n{\n    vkAllocateCommandBuffers = (PFN_vkAllocateCommandBuffers)vkGetInstanceProcAddr(g_instance, \"vkAllocateCommandBuffers\");\n    vkAllocateDescriptorSets = (PFN_vkAllocateDescriptorSets)vkGetInstanceProcAddr(g_instance, \"vkAllocateDescriptorSets\");\n    vkAllocateMemory = (PFN_vkAllocateMemory)vkGetInstanceProcAddr(g_instance, 
\"vkAllocateMemory\");\n    vkBeginCommandBuffer = (PFN_vkBeginCommandBuffer)vkGetInstanceProcAddr(g_instance, \"vkBeginCommandBuffer\");\n    vkBindBufferMemory = (PFN_vkBindBufferMemory)vkGetInstanceProcAddr(g_instance, \"vkBindBufferMemory\");\n    vkBindImageMemory = (PFN_vkBindImageMemory)vkGetInstanceProcAddr(g_instance, \"vkBindImageMemory\");\n    vkCmdBeginQuery = (PFN_vkCmdBeginQuery)vkGetInstanceProcAddr(g_instance, \"vkCmdBeginQuery\");\n    vkCmdBindDescriptorSets = (PFN_vkCmdBindDescriptorSets)vkGetInstanceProcAddr(g_instance, \"vkCmdBindDescriptorSets\");\n    vkCmdBindIndexBuffer = (PFN_vkCmdBindIndexBuffer)vkGetInstanceProcAddr(g_instance, \"vkCmdBindIndexBuffer\");\n    vkCmdBindPipeline = (PFN_vkCmdBindPipeline)vkGetInstanceProcAddr(g_instance, \"vkCmdBindPipeline\");\n    vkCmdCopyBuffer = (PFN_vkCmdCopyBuffer)vkGetInstanceProcAddr(g_instance, \"vkCmdCopyBuffer\");\n    vkCmdCopyBufferToImage = (PFN_vkCmdCopyBufferToImage)vkGetInstanceProcAddr(g_instance, \"vkCmdCopyBufferToImage\");\n    vkCmdCopyImage = (PFN_vkCmdCopyImage)vkGetInstanceProcAddr(g_instance, \"vkCmdCopyImage\");\n    vkCmdCopyImageToBuffer = (PFN_vkCmdCopyImageToBuffer)vkGetInstanceProcAddr(g_instance, \"vkCmdCopyImageToBuffer\");\n    vkCmdCopyQueryPoolResults = (PFN_vkCmdCopyQueryPoolResults)vkGetInstanceProcAddr(g_instance, \"vkCmdCopyQueryPoolResults\");\n    vkCmdDispatch = (PFN_vkCmdDispatch)vkGetInstanceProcAddr(g_instance, \"vkCmdDispatch\");\n    vkCmdDispatchIndirect = (PFN_vkCmdDispatchIndirect)vkGetInstanceProcAddr(g_instance, \"vkCmdDispatchIndirect\");\n    vkCmdEndQuery = (PFN_vkCmdEndQuery)vkGetInstanceProcAddr(g_instance, \"vkCmdEndQuery\");\n    vkCmdExecuteCommands = (PFN_vkCmdExecuteCommands)vkGetInstanceProcAddr(g_instance, \"vkCmdExecuteCommands\");\n    vkCmdFillBuffer = (PFN_vkCmdFillBuffer)vkGetInstanceProcAddr(g_instance, \"vkCmdFillBuffer\");\n    vkCmdPipelineBarrier = (PFN_vkCmdPipelineBarrier)vkGetInstanceProcAddr(g_instance, 
\"vkCmdPipelineBarrier\");\n    vkCmdPushConstants = (PFN_vkCmdPushConstants)vkGetInstanceProcAddr(g_instance, \"vkCmdPushConstants\");\n    vkCmdResetQueryPool = (PFN_vkCmdResetQueryPool)vkGetInstanceProcAddr(g_instance, \"vkCmdResetQueryPool\");\n    vkCmdResolveImage = (PFN_vkCmdResolveImage)vkGetInstanceProcAddr(g_instance, \"vkCmdResolveImage\");\n    vkCmdUpdateBuffer = (PFN_vkCmdUpdateBuffer)vkGetInstanceProcAddr(g_instance, \"vkCmdUpdateBuffer\");\n    vkCmdWriteTimestamp = (PFN_vkCmdWriteTimestamp)vkGetInstanceProcAddr(g_instance, \"vkCmdWriteTimestamp\");\n    vkCreateBuffer = (PFN_vkCreateBuffer)vkGetInstanceProcAddr(g_instance, \"vkCreateBuffer\");\n    vkCreateBufferView = (PFN_vkCreateBufferView)vkGetInstanceProcAddr(g_instance, \"vkCreateBufferView\");\n    vkCreateCommandPool = (PFN_vkCreateCommandPool)vkGetInstanceProcAddr(g_instance, \"vkCreateCommandPool\");\n    vkCreateComputePipelines = (PFN_vkCreateComputePipelines)vkGetInstanceProcAddr(g_instance, \"vkCreateComputePipelines\");\n    vkCreateDescriptorPool = (PFN_vkCreateDescriptorPool)vkGetInstanceProcAddr(g_instance, \"vkCreateDescriptorPool\");\n    vkCreateDescriptorSetLayout = (PFN_vkCreateDescriptorSetLayout)vkGetInstanceProcAddr(g_instance, \"vkCreateDescriptorSetLayout\");\n    vkCreateDevice = (PFN_vkCreateDevice)vkGetInstanceProcAddr(g_instance, \"vkCreateDevice\");\n    vkCreateFence = (PFN_vkCreateFence)vkGetInstanceProcAddr(g_instance, \"vkCreateFence\");\n    vkCreateImage = (PFN_vkCreateImage)vkGetInstanceProcAddr(g_instance, \"vkCreateImage\");\n    vkCreateImageView = (PFN_vkCreateImageView)vkGetInstanceProcAddr(g_instance, \"vkCreateImageView\");\n    vkCreatePipelineCache = (PFN_vkCreatePipelineCache)vkGetInstanceProcAddr(g_instance, \"vkCreatePipelineCache\");\n    vkCreatePipelineLayout = (PFN_vkCreatePipelineLayout)vkGetInstanceProcAddr(g_instance, \"vkCreatePipelineLayout\");\n    vkCreateQueryPool = (PFN_vkCreateQueryPool)vkGetInstanceProcAddr(g_instance, 
\"vkCreateQueryPool\");\n    vkCreateSampler = (PFN_vkCreateSampler)vkGetInstanceProcAddr(g_instance, \"vkCreateSampler\");\n    vkCreateSemaphore = (PFN_vkCreateSemaphore)vkGetInstanceProcAddr(g_instance, \"vkCreateSemaphore\");\n    vkCreateShaderModule = (PFN_vkCreateShaderModule)vkGetInstanceProcAddr(g_instance, \"vkCreateShaderModule\");\n    vkDestroyBuffer = (PFN_vkDestroyBuffer)vkGetInstanceProcAddr(g_instance, \"vkDestroyBuffer\");\n    vkDestroyBufferView = (PFN_vkDestroyBufferView)vkGetInstanceProcAddr(g_instance, \"vkDestroyBufferView\");\n    vkDestroyCommandPool = (PFN_vkDestroyCommandPool)vkGetInstanceProcAddr(g_instance, \"vkDestroyCommandPool\");\n    vkDestroyDescriptorPool = (PFN_vkDestroyDescriptorPool)vkGetInstanceProcAddr(g_instance, \"vkDestroyDescriptorPool\");\n    vkDestroyDescriptorSetLayout = (PFN_vkDestroyDescriptorSetLayout)vkGetInstanceProcAddr(g_instance, \"vkDestroyDescriptorSetLayout\");\n    vkDestroyDevice = (PFN_vkDestroyDevice)vkGetInstanceProcAddr(g_instance, \"vkDestroyDevice\");\n    vkDestroyFence = (PFN_vkDestroyFence)vkGetInstanceProcAddr(g_instance, \"vkDestroyFence\");\n    vkDestroyImage = (PFN_vkDestroyImage)vkGetInstanceProcAddr(g_instance, \"vkDestroyImage\");\n    vkDestroyImageView = (PFN_vkDestroyImageView)vkGetInstanceProcAddr(g_instance, \"vkDestroyImageView\");\n    vkDestroyInstance = (PFN_vkDestroyInstance)vkGetInstanceProcAddr(g_instance, \"vkDestroyInstance\");\n    vkDestroyPipeline = (PFN_vkDestroyPipeline)vkGetInstanceProcAddr(g_instance, \"vkDestroyPipeline\");\n    vkDestroyPipelineCache = (PFN_vkDestroyPipelineCache)vkGetInstanceProcAddr(g_instance, \"vkDestroyPipelineCache\");\n    vkDestroyPipelineLayout = (PFN_vkDestroyPipelineLayout)vkGetInstanceProcAddr(g_instance, \"vkDestroyPipelineLayout\");\n    vkDestroyQueryPool = (PFN_vkDestroyQueryPool)vkGetInstanceProcAddr(g_instance, \"vkDestroyQueryPool\");\n    vkDestroySampler = (PFN_vkDestroySampler)vkGetInstanceProcAddr(g_instance, 
\"vkDestroySampler\");\n    vkDestroySemaphore = (PFN_vkDestroySemaphore)vkGetInstanceProcAddr(g_instance, \"vkDestroySemaphore\");\n    vkDestroyShaderModule = (PFN_vkDestroyShaderModule)vkGetInstanceProcAddr(g_instance, \"vkDestroyShaderModule\");\n    vkDeviceWaitIdle = (PFN_vkDeviceWaitIdle)vkGetInstanceProcAddr(g_instance, \"vkDeviceWaitIdle\");\n    vkEndCommandBuffer = (PFN_vkEndCommandBuffer)vkGetInstanceProcAddr(g_instance, \"vkEndCommandBuffer\");\n    vkEnumerateDeviceExtensionProperties = (PFN_vkEnumerateDeviceExtensionProperties)vkGetInstanceProcAddr(g_instance, \"vkEnumerateDeviceExtensionProperties\");\n    vkEnumerateDeviceLayerProperties = (PFN_vkEnumerateDeviceLayerProperties)vkGetInstanceProcAddr(g_instance, \"vkEnumerateDeviceLayerProperties\");\n    vkEnumeratePhysicalDevices = (PFN_vkEnumeratePhysicalDevices)vkGetInstanceProcAddr(g_instance, \"vkEnumeratePhysicalDevices\");\n    vkFlushMappedMemoryRanges = (PFN_vkFlushMappedMemoryRanges)vkGetInstanceProcAddr(g_instance, \"vkFlushMappedMemoryRanges\");\n    vkFreeCommandBuffers = (PFN_vkFreeCommandBuffers)vkGetInstanceProcAddr(g_instance, \"vkFreeCommandBuffers\");\n    vkFreeDescriptorSets = (PFN_vkFreeDescriptorSets)vkGetInstanceProcAddr(g_instance, \"vkFreeDescriptorSets\");\n    vkFreeMemory = (PFN_vkFreeMemory)vkGetInstanceProcAddr(g_instance, \"vkFreeMemory\");\n    vkGetBufferMemoryRequirements = (PFN_vkGetBufferMemoryRequirements)vkGetInstanceProcAddr(g_instance, \"vkGetBufferMemoryRequirements\");\n    vkGetDeviceMemoryCommitment = (PFN_vkGetDeviceMemoryCommitment)vkGetInstanceProcAddr(g_instance, \"vkGetDeviceMemoryCommitment\");\n    vkGetDeviceProcAddr = (PFN_vkGetDeviceProcAddr)vkGetInstanceProcAddr(g_instance, \"vkGetDeviceProcAddr\");\n    vkGetDeviceQueue = (PFN_vkGetDeviceQueue)vkGetInstanceProcAddr(g_instance, \"vkGetDeviceQueue\");\n    vkGetFenceStatus = (PFN_vkGetFenceStatus)vkGetInstanceProcAddr(g_instance, \"vkGetFenceStatus\");\n    vkGetImageMemoryRequirements = 
(PFN_vkGetImageMemoryRequirements)vkGetInstanceProcAddr(g_instance, \"vkGetImageMemoryRequirements\");\n    vkGetImageSubresourceLayout = (PFN_vkGetImageSubresourceLayout)vkGetInstanceProcAddr(g_instance, \"vkGetImageSubresourceLayout\");\n    vkGetPhysicalDeviceFeatures = (PFN_vkGetPhysicalDeviceFeatures)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceFeatures\");\n    vkGetPhysicalDeviceFormatProperties = (PFN_vkGetPhysicalDeviceFormatProperties)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceFormatProperties\");\n    vkGetPhysicalDeviceImageFormatProperties = (PFN_vkGetPhysicalDeviceImageFormatProperties)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceImageFormatProperties\");\n    vkGetPhysicalDeviceMemoryProperties = (PFN_vkGetPhysicalDeviceMemoryProperties)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceMemoryProperties\");\n    vkGetPhysicalDeviceProperties = (PFN_vkGetPhysicalDeviceProperties)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceProperties\");\n    vkGetPhysicalDeviceQueueFamilyProperties = (PFN_vkGetPhysicalDeviceQueueFamilyProperties)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceQueueFamilyProperties\");\n    vkGetPipelineCacheData = (PFN_vkGetPipelineCacheData)vkGetInstanceProcAddr(g_instance, \"vkGetPipelineCacheData\");\n    vkGetQueryPoolResults = (PFN_vkGetQueryPoolResults)vkGetInstanceProcAddr(g_instance, \"vkGetQueryPoolResults\");\n    vkInvalidateMappedMemoryRanges = (PFN_vkInvalidateMappedMemoryRanges)vkGetInstanceProcAddr(g_instance, \"vkInvalidateMappedMemoryRanges\");\n    vkMapMemory = (PFN_vkMapMemory)vkGetInstanceProcAddr(g_instance, \"vkMapMemory\");\n    vkMergePipelineCaches = (PFN_vkMergePipelineCaches)vkGetInstanceProcAddr(g_instance, \"vkMergePipelineCaches\");\n    vkQueueSubmit = (PFN_vkQueueSubmit)vkGetInstanceProcAddr(g_instance, \"vkQueueSubmit\");\n    vkQueueWaitIdle = (PFN_vkQueueWaitIdle)vkGetInstanceProcAddr(g_instance, \"vkQueueWaitIdle\");\n    vkResetCommandBuffer = 
(PFN_vkResetCommandBuffer)vkGetInstanceProcAddr(g_instance, \"vkResetCommandBuffer\");\n    vkResetCommandPool = (PFN_vkResetCommandPool)vkGetInstanceProcAddr(g_instance, \"vkResetCommandPool\");\n    vkResetDescriptorPool = (PFN_vkResetDescriptorPool)vkGetInstanceProcAddr(g_instance, \"vkResetDescriptorPool\");\n    vkResetFences = (PFN_vkResetFences)vkGetInstanceProcAddr(g_instance, \"vkResetFences\");\n    vkUnmapMemory = (PFN_vkUnmapMemory)vkGetInstanceProcAddr(g_instance, \"vkUnmapMemory\");\n    vkUpdateDescriptorSets = (PFN_vkUpdateDescriptorSets)vkGetInstanceProcAddr(g_instance, \"vkUpdateDescriptorSets\");\n    vkWaitForFences = (PFN_vkWaitForFences)vkGetInstanceProcAddr(g_instance, \"vkWaitForFences\");\n\n    return 0;\n}\n\nstatic int init_instance_extension()\n{\n    if (support_VK_KHR_external_memory_capabilities)\n    {\n        vkGetPhysicalDeviceExternalBufferPropertiesKHR = (PFN_vkGetPhysicalDeviceExternalBufferPropertiesKHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceExternalBufferPropertiesKHR\");\n    }\n\n    if (support_VK_KHR_get_physical_device_properties2)\n    {\n        vkGetPhysicalDeviceFeatures2KHR = (PFN_vkGetPhysicalDeviceFeatures2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceFeatures2KHR\");\n        vkGetPhysicalDeviceProperties2KHR = (PFN_vkGetPhysicalDeviceProperties2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceProperties2KHR\");\n        vkGetPhysicalDeviceFormatProperties2KHR = (PFN_vkGetPhysicalDeviceFormatProperties2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceFormatProperties2KHR\");\n        vkGetPhysicalDeviceImageFormatProperties2KHR = (PFN_vkGetPhysicalDeviceImageFormatProperties2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceImageFormatProperties2KHR\");\n        vkGetPhysicalDeviceQueueFamilyProperties2KHR = (PFN_vkGetPhysicalDeviceQueueFamilyProperties2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceQueueFamilyProperties2KHR\");\n        
vkGetPhysicalDeviceMemoryProperties2KHR = (PFN_vkGetPhysicalDeviceMemoryProperties2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceMemoryProperties2KHR\");\n    }\n\n    if (support_VK_KHR_get_surface_capabilities2)\n    {\n        vkGetPhysicalDeviceSurfaceCapabilities2KHR = (PFN_vkGetPhysicalDeviceSurfaceCapabilities2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceSurfaceCapabilities2KHR\");\n        vkGetPhysicalDeviceSurfaceFormats2KHR = (PFN_vkGetPhysicalDeviceSurfaceFormats2KHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceSurfaceFormats2KHR\");\n    }\n\n    if (support_VK_KHR_surface)\n    {\n        vkDestroySurfaceKHR = (PFN_vkDestroySurfaceKHR)vkGetInstanceProcAddr(g_instance, \"vkDestroySurfaceKHR\");\n        vkGetPhysicalDeviceSurfaceSupportKHR = (PFN_vkGetPhysicalDeviceSurfaceSupportKHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceSurfaceSupportKHR\");\n        vkGetPhysicalDeviceSurfaceCapabilitiesKHR = (PFN_vkGetPhysicalDeviceSurfaceCapabilitiesKHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceSurfaceCapabilitiesKHR\");\n        vkGetPhysicalDeviceSurfaceFormatsKHR = (PFN_vkGetPhysicalDeviceSurfaceFormatsKHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceSurfaceFormatsKHR\");\n        vkGetPhysicalDeviceSurfacePresentModesKHR = (PFN_vkGetPhysicalDeviceSurfacePresentModesKHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceSurfacePresentModesKHR\");\n    }\n\n#if __ANDROID_API__ >= 26\n    if (support_VK_KHR_android_surface)\n    {\n        vkCreateAndroidSurfaceKHR = (PFN_vkCreateAndroidSurfaceKHR)vkGetInstanceProcAddr(g_instance, \"vkCreateAndroidSurfaceKHR\");\n    }\n#endif // __ANDROID_API__ >= 26\n\n    // VK_KHR_cooperative_matrix\n    {\n        vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR = (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR\");\n    }\n\n    // 
VK_NV_cooperative_matrix\n    {\n        vkGetPhysicalDeviceCooperativeMatrixPropertiesNV = (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesNV)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceCooperativeMatrixPropertiesNV\");\n    }\n\n    // VK_NV_cooperative_matrix2\n    {\n        vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV = (PFN_vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV\");\n    }\n\n    // VK_NV_cooperative_vector\n    {\n        vkGetPhysicalDeviceCooperativeVectorPropertiesNV = (PFN_vkGetPhysicalDeviceCooperativeVectorPropertiesNV)vkGetInstanceProcAddr(g_instance, \"vkGetPhysicalDeviceCooperativeVectorPropertiesNV\");\n    }\n\n    return 0;\n}\n\n#if ENABLE_VALIDATION_LAYER\nstatic VKAPI_ATTR VkBool32 VKAPI_CALL debugCallback(\n    VkDebugUtilsMessageSeverityFlagBitsEXT /*messageSeverity*/,\n    VkDebugUtilsMessageTypeFlagsEXT /*messageType*/,\n    const VkDebugUtilsMessengerCallbackDataEXT* pCallbackData,\n    void* /*pUserData*/)\n{\n    NCNN_LOGE(\"validation layer: %s\", pCallbackData->pMessage);\n\n    return VK_FALSE;\n}\n\nstatic VkResult CreateDebugUtilsMessengerEXT(VkInstance instance, const VkDebugUtilsMessengerCreateInfoEXT* pCreateInfo, const VkAllocationCallbacks* pAllocator, VkDebugUtilsMessengerEXT* pCallback)\n{\n    PFN_vkCreateDebugUtilsMessengerEXT func = (PFN_vkCreateDebugUtilsMessengerEXT)vkGetInstanceProcAddr(instance, \"vkCreateDebugUtilsMessengerEXT\");\n    if (func)\n        return func(instance, pCreateInfo, pAllocator, pCallback);\n\n    return VK_ERROR_EXTENSION_NOT_PRESENT;\n}\n\nstatic void DestroyDebugUtilsMessengerEXT(VkInstance instance, VkDebugUtilsMessengerEXT callback, const VkAllocationCallbacks* pAllocator)\n{\n    PFN_vkDestroyDebugUtilsMessengerEXT func = (PFN_vkDestroyDebugUtilsMessengerEXT)vkGetInstanceProcAddr(instance, 
\"vkDestroyDebugUtilsMessengerEXT\");\n    if (func)\n        func(instance, callback, pAllocator);\n}\n#endif // ENABLE_VALIDATION_LAYER\n\nstatic int find_default_vulkan_device_index()\n{\n    // first try, discrete gpu\n    for (int i = 0; i < g_gpu_count; i++)\n    {\n        if (g_gpu_infos[i]->type() == 0)\n            return i;\n    }\n\n    // second try, integrated gpu\n    for (int i = 0; i < g_gpu_count; i++)\n    {\n        if (g_gpu_infos[i]->type() == 1)\n            return i;\n    }\n\n    // third try, any probed device\n    if (g_gpu_count > 0)\n        return 0;\n\n    NCNN_LOGE(\"no vulkan device\");\n    return -1;\n}\n\nint create_gpu_instance(const char* driver_path)\n{\n    MutexLockGuard lock(g_instance_lock);\n\n    if (g_instance.created != 0)\n        return g_instance.instance ? 0 : -1;\n\n    g_instance.created = 1;\n\n    // NCNN_LOGE(\"create_gpu_instance\");\n\n#if NCNN_SIMPLEVK\n    // load vulkan driver\n    {\n        int ret = load_vulkan_driver(driver_path);\n        if (ret != 0)\n        {\n            NCNN_LOGE(\"load vulkan driver failed\");\n            return -1;\n        }\n    }\n#else\n    if (driver_path)\n    {\n        NCNN_LOGE(\"custom vulkan driver is not supported when NCNN_SIMPLEVK is off\");\n        NCNN_LOGE(\"will always use the system vulkan driver\");\n    }\n#endif // NCNN_SIMPLEVK\n\n    VkResult ret;\n\n    std::vector<const char*> enabledLayers;\n\n#if ENABLE_VALIDATION_LAYER\n    uint32_t instanceLayerPropertyCount;\n    ret = vkEnumerateInstanceLayerProperties(&instanceLayerPropertyCount, NULL);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEnumerateInstanceLayerProperties failed %d\", ret);\n        return -1;\n    }\n\n    std::vector<VkLayerProperties> instanceLayerProperties(instanceLayerPropertyCount);\n    ret = vkEnumerateInstanceLayerProperties(&instanceLayerPropertyCount, instanceLayerProperties.data());\n    if (ret != VK_SUCCESS)\n    {\n        
NCNN_LOGE(\"vkEnumerateInstanceLayerProperties failed %d\", ret);\n        return -1;\n    }\n\n    for (uint32_t i = 0; i < instanceLayerPropertyCount; i++)\n    {\n        const VkLayerProperties& lp = instanceLayerProperties[i];\n        //         NCNN_LOGE(\"instance layer %s = %u\", lp.layerName, lp.implementationVersion);\n\n        if (strcmp(lp.layerName, \"VK_LAYER_LUNARG_standard_validation\") == 0)\n        {\n            enabledLayers.push_back(\"VK_LAYER_LUNARG_standard_validation\");\n        }\n        if (strcmp(lp.layerName, \"VK_LAYER_LUNARG_parameter_validation\") == 0)\n        {\n            enabledLayers.push_back(\"VK_LAYER_LUNARG_parameter_validation\");\n        }\n        if (strcmp(lp.layerName, \"VK_LAYER_KHRONOS_validation\") == 0)\n        {\n            enabledLayers.push_back(\"VK_LAYER_KHRONOS_validation\");\n        }\n    }\n#endif // ENABLE_VALIDATION_LAYER\n\n    std::vector<const char*> enabledExtensions;\n\n    uint32_t instanceExtensionPropertyCount;\n    ret = vkEnumerateInstanceExtensionProperties(NULL, &instanceExtensionPropertyCount, NULL);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEnumerateInstanceExtensionProperties failed %d\", ret);\n        return -1;\n    }\n\n    std::vector<VkExtensionProperties> instanceExtensionProperties(instanceExtensionPropertyCount);\n    ret = vkEnumerateInstanceExtensionProperties(NULL, &instanceExtensionPropertyCount, instanceExtensionProperties.data());\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEnumerateInstanceExtensionProperties failed %d\", ret);\n        return -1;\n    }\n\n    support_VK_KHR_external_memory_capabilities = 0;\n    support_VK_KHR_get_physical_device_properties2 = 0;\n    support_VK_KHR_get_surface_capabilities2 = 0;\n    support_VK_KHR_portability_enumeration = 0;\n    support_VK_KHR_surface = 0;\n    support_VK_EXT_debug_utils = 0;\n    support_VK_EXT_validation_features = 0;\n    support_VK_EXT_validation_flags = 0;\n#if __ANDROID_API__ >= 26\n    support_VK_KHR_android_surface = 
0;\n#endif // __ANDROID_API__ >= 26\n    for (uint32_t j = 0; j < instanceExtensionPropertyCount; j++)\n    {\n        const VkExtensionProperties& exp = instanceExtensionProperties[j];\n        //         NCNN_LOGE(\"instance extension %s = %u\", exp.extensionName, exp.specVersion);\n\n        if (strcmp(exp.extensionName, \"VK_KHR_external_memory_capabilities\") == 0)\n            support_VK_KHR_external_memory_capabilities = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_get_physical_device_properties2\") == 0)\n            support_VK_KHR_get_physical_device_properties2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_get_surface_capabilities2\") == 0)\n            support_VK_KHR_get_surface_capabilities2 = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_portability_enumeration\") == 0)\n            support_VK_KHR_portability_enumeration = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_KHR_surface\") == 0)\n            support_VK_KHR_surface = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_debug_utils\") == 0)\n            support_VK_EXT_debug_utils = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_validation_features\") == 0)\n            support_VK_EXT_validation_features = exp.specVersion;\n        else if (strcmp(exp.extensionName, \"VK_EXT_validation_flags\") == 0)\n            support_VK_EXT_validation_flags = exp.specVersion;\n#if __ANDROID_API__ >= 26\n        else if (strcmp(exp.extensionName, \"VK_KHR_android_surface\") == 0)\n            support_VK_KHR_android_surface = exp.specVersion;\n#endif // __ANDROID_API__ >= 26\n    }\n\n    if (support_VK_EXT_validation_features)\n    {\n        // we prefer the modern one\n        support_VK_EXT_validation_flags = 0;\n    }\n\n    if (support_VK_KHR_external_memory_capabilities)\n        enabledExtensions.push_back(\"VK_KHR_external_memory_capabilities\");\n    if 
(support_VK_KHR_get_physical_device_properties2)\n        enabledExtensions.push_back(\"VK_KHR_get_physical_device_properties2\");\n    if (support_VK_KHR_get_surface_capabilities2)\n        enabledExtensions.push_back(\"VK_KHR_get_surface_capabilities2\");\n    if (support_VK_KHR_portability_enumeration)\n        enabledExtensions.push_back(\"VK_KHR_portability_enumeration\");\n    if (support_VK_KHR_surface)\n        enabledExtensions.push_back(\"VK_KHR_surface\");\n#if ENABLE_VALIDATION_LAYER\n    if (support_VK_EXT_debug_utils)\n        enabledExtensions.push_back(\"VK_EXT_debug_utils\");\n    if (support_VK_EXT_validation_features)\n        enabledExtensions.push_back(\"VK_EXT_validation_features\");\n    if (support_VK_EXT_validation_flags)\n        enabledExtensions.push_back(\"VK_EXT_validation_flags\");\n#endif // ENABLE_VALIDATION_LAYER\n#if __ANDROID_API__ >= 26\n    if (support_VK_KHR_android_surface)\n        enabledExtensions.push_back(\"VK_KHR_android_surface\");\n#endif // __ANDROID_API__ >= 26\n\n    uint32_t instance_api_version = VK_MAKE_VERSION(1, 0, 0);\n    typedef VkResult(VKAPI_PTR * PFN_vkEnumerateInstanceVersion)(uint32_t * pApiVersion);\n    PFN_vkEnumerateInstanceVersion vkEnumerateInstanceVersion = (PFN_vkEnumerateInstanceVersion)vkGetInstanceProcAddr(0, \"vkEnumerateInstanceVersion\");\n    if (vkEnumerateInstanceVersion)\n    {\n        ret = vkEnumerateInstanceVersion(&instance_api_version);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkEnumerateInstanceVersion failed %d\", ret);\n            return -1;\n        }\n    }\n\n    // NCNN_LOGE(\"instance apiVersion = %u.%u.%u\", VK_VERSION_MAJOR(instance_api_version), VK_VERSION_MINOR(instance_api_version), VK_VERSION_PATCH(instance_api_version));\n\n    VkApplicationInfo applicationInfo;\n    applicationInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;\n    applicationInfo.pNext = 0;\n    applicationInfo.pApplicationName = \"ncnn\";\n    
applicationInfo.applicationVersion = 0;\n    applicationInfo.pEngineName = \"ncnn\";\n    applicationInfo.engineVersion = NCNN_VERSION_NUMBER;\n    applicationInfo.apiVersion = instance_api_version;\n\n    void* enabledExtensionFeatures = 0;\n\n#if ENABLE_VALIDATION_LAYER\n    std::vector<VkValidationFeatureEnableEXT> enabledValidationFeature;\n    enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_GPU_ASSISTED_EXT);\n    enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_GPU_ASSISTED_RESERVE_BINDING_SLOT_EXT);\n    enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_BEST_PRACTICES_EXT);\n    enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT);\n    enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_SYNCHRONIZATION_VALIDATION_EXT);\n\n    VkValidationFeaturesEXT validationFeatures;\n    validationFeatures.sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT;\n    validationFeatures.pNext = 0;\n    validationFeatures.enabledValidationFeatureCount = enabledValidationFeature.size();\n    validationFeatures.pEnabledValidationFeatures = enabledValidationFeature.data();\n    validationFeatures.disabledValidationFeatureCount = 0;\n    validationFeatures.pDisabledValidationFeatures = 0;\n    if (support_VK_EXT_validation_features)\n    {\n        validationFeatures.pNext = enabledExtensionFeatures;\n        enabledExtensionFeatures = &validationFeatures;\n    }\n\n    VkValidationFlagsEXT validationFlags;\n    validationFlags.sType = VK_STRUCTURE_TYPE_VALIDATION_FLAGS_EXT;\n    validationFlags.pNext = 0;\n    validationFlags.disabledValidationCheckCount = 0;\n    validationFlags.pDisabledValidationChecks = 0;\n    if (support_VK_EXT_validation_flags)\n    {\n        validationFlags.pNext = enabledExtensionFeatures;\n        enabledExtensionFeatures = &validationFlags;\n    }\n#endif // ENABLE_VALIDATION_LAYER\n\n    VkInstanceCreateInfo instanceCreateInfo;\n    instanceCreateInfo.sType = 
VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;\n    instanceCreateInfo.pNext = enabledExtensionFeatures;\n    instanceCreateInfo.flags = 0;\n    if (support_VK_KHR_portability_enumeration)\n        instanceCreateInfo.flags |= VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR;\n    instanceCreateInfo.pApplicationInfo = &applicationInfo;\n    instanceCreateInfo.enabledLayerCount = enabledLayers.size();\n    instanceCreateInfo.ppEnabledLayerNames = enabledLayers.data();\n    instanceCreateInfo.enabledExtensionCount = enabledExtensions.size();\n    instanceCreateInfo.ppEnabledExtensionNames = enabledExtensions.data();\n\n    VkInstance instance = 0;\n    ret = vkCreateInstance(&instanceCreateInfo, 0, &instance);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateInstance failed %d\", ret);\n        return -1;\n    }\n\n    g_instance.instance = instance;\n    g_instance.instance_api_version = instance_api_version;\n\n    init_instance_core();\n\n#if ENABLE_VALIDATION_LAYER\n    if (support_VK_EXT_debug_utils)\n    {\n        VkDebugUtilsMessengerCreateInfoEXT createInfo = {};\n        createInfo.sType = VK_STRUCTURE_TYPE_DEBUG_UTILS_MESSENGER_CREATE_INFO_EXT;\n        createInfo.messageSeverity = VK_DEBUG_UTILS_MESSAGE_SEVERITY_VERBOSE_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_SEVERITY_INFO_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_BIT_EXT;\n        createInfo.messageType = VK_DEBUG_UTILS_MESSAGE_TYPE_GENERAL_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_BIT_EXT;\n        createInfo.pfnUserCallback = debugCallback;\n        createInfo.pUserData = 0;\n        ret = CreateDebugUtilsMessengerEXT(g_instance, &createInfo, NULL, &g_instance.callback);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"CreateDebugUtilsMessengerEXT failed %d\", ret);\n            return -1;\n        }\n    }\n#endif // ENABLE_VALIDATION_LAYER\n\n    
init_instance_extension();\n\n    uint32_t physicalDeviceCount = 0;\n    ret = vkEnumeratePhysicalDevices(g_instance, &physicalDeviceCount, 0);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEnumeratePhysicalDevices failed %d\", ret);\n        return -1;\n    }\n\n    if (physicalDeviceCount > NCNN_MAX_GPU_COUNT)\n        physicalDeviceCount = NCNN_MAX_GPU_COUNT;\n\n    std::vector<VkPhysicalDevice> physicalDevices(physicalDeviceCount);\n\n    ret = vkEnumeratePhysicalDevices(g_instance, &physicalDeviceCount, physicalDevices.data());\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkEnumeratePhysicalDevices failed %d\", ret);\n        return -1;\n    }\n\n    // find proper device and queue\n    int gpu_info_index = 0;\n    for (uint32_t i = 0; i < physicalDeviceCount; i++)\n    {\n        const VkPhysicalDevice& physicalDevice = physicalDevices[i];\n        delete g_gpu_infos[gpu_info_index];\n        g_gpu_infos[gpu_info_index] = new GpuInfo;\n\n        GpuInfo& gpu_info = *g_gpu_infos[gpu_info_index];\n\n        gpu_info.d->device_index = gpu_info_index;\n\n        gpu_info.d->physicalDevice = physicalDevice;\n\n        gpu_info.d->query_features();\n        gpu_info.d->query_properties();\n\n        // device type\n\n        // info\n        // NCNN_LOGE(\"[%u] max_shared_memory_size = %u\", i, gpu_info.max_shared_memory_size);\n        // NCNN_LOGE(\"[%u] max_workgroup_count = %u %u %u\", i, gpu_info.max_workgroup_count[0], gpu_info.max_workgroup_count[1], gpu_info.max_workgroup_count[2]);\n        // NCNN_LOGE(\"[%u] max_workgroup_invocations = %u\", i, gpu_info.max_workgroup_invocations);\n        // NCNN_LOGE(\"[%u] max_workgroup_size = %u %u %u\", i, gpu_info.max_workgroup_size[0], gpu_info.max_workgroup_size[1], gpu_info.max_workgroup_size[2]);\n        // NCNN_LOGE(\"[%u] memory_map_alignment = %lu\", i, gpu_info.memory_map_alignment);\n        // NCNN_LOGE(\"[%u] buffer_offset_alignment = %lu\", i, 
gpu_info.buffer_offset_alignment);\n\n        gpu_info.d->query_queue_properties();\n\n        gpu_info.d->query_memory_properties();\n\n        int rqde = gpu_info.d->query_extensions();\n        if (rqde != 0)\n        {\n            return -1;\n        }\n\n        gpu_info.d->query_extension_features();\n        gpu_info.d->query_extension_properties();\n\n        gpu_info.d->evaluate_rough_score();\n\n        NCNN_LOGE(\"[%u %s]  queueC=%u[%u]  queueT=%u[%u]  rebar=%d  r-score=%u\", i, gpu_info.device_name(),\n                  gpu_info.compute_queue_family_index(), gpu_info.compute_queue_count(),\n                  gpu_info.transfer_queue_family_index(), gpu_info.transfer_queue_count(), gpu_info.resizable_bar_enabled(), gpu_info.rough_score());\n\n        NCNN_LOGE(\"[%u %s]  fp16-p/s/u/a=%d/%d/%d/%d  int8-p/s/u/a=%d/%d/%d/%d  bf16-p/s=%d/%d\", i, gpu_info.device_name(),\n                  gpu_info.support_fp16_packed(), gpu_info.support_fp16_storage(), gpu_info.support_fp16_uniform(), gpu_info.support_fp16_arithmetic(),\n                  gpu_info.support_int8_packed(), gpu_info.support_int8_storage(), gpu_info.support_int8_uniform(), gpu_info.support_int8_arithmetic(),\n                  gpu_info.support_bf16_packed(), gpu_info.support_bf16_storage());\n\n        NCNN_LOGE(\"[%u %s]  subgroup=%u(%u~%u)  ops=%d/%d/%d/%d/%d/%d/%d/%d/%d/%d\", i, gpu_info.device_name(),\n                  gpu_info.subgroup_size(), gpu_info.min_subgroup_size(), gpu_info.max_subgroup_size(),\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_BASIC_BIT) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_VOTE_BIT) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_BALLOT_BIT) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_SHUFFLE_BIT) != 0,\n                  
(gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_CLUSTERED_BIT) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_QUAD_BIT) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ROTATE_BIT_KHR) != 0,\n                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ROTATE_CLUSTERED_BIT_KHR) != 0);\n\n        // collect matrix mnk\n        std::vector<VkCooperativeMatrixPropertiesKHR> fp16_matrix_properties;\n        std::vector<VkCooperativeMatrixPropertiesKHR> int8_matrix_properties;\n        std::vector<VkCooperativeMatrixPropertiesKHR> bf16_matrix_properties;\n        std::vector<VkCooperativeMatrixPropertiesKHR> fp8_matrix_properties;\n        if (gpu_info.support_VK_KHR_cooperative_matrix())\n        {\n            const std::vector<VkCooperativeMatrixPropertiesKHR>& properties = gpu_info.queryCooperativeMatrixSubProperties();\n            for (uint32_t j = 0; j < properties.size(); j++)\n            {\n                const VkCooperativeMatrixPropertiesKHR& cmp = properties[j];\n\n                if (cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR)\n                {\n                    bool mnk_hit = false;\n                    for (size_t k = 0; k < fp16_matrix_properties.size(); k++)\n                    {\n                        const VkCooperativeMatrixPropertiesKHR& cmp0 = fp16_matrix_properties[k];\n                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)\n                        {\n                            mnk_hit = true;\n                            break;\n                        }\n                    }\n                    if (!mnk_hit)\n                        fp16_matrix_properties.push_back(cmp);\n                }\n                if ((cmp.AType == VK_COMPONENT_TYPE_SINT8_KHR || 
cmp.AType == VK_COMPONENT_TYPE_SINT8_PACKED_NV)\n                        && (cmp.BType == VK_COMPONENT_TYPE_SINT8_KHR || cmp.BType == VK_COMPONENT_TYPE_SINT8_PACKED_NV))\n                {\n                    bool mnk_hit = false;\n                    for (size_t k = 0; k < int8_matrix_properties.size(); k++)\n                    {\n                        const VkCooperativeMatrixPropertiesKHR& cmp0 = int8_matrix_properties[k];\n                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)\n                        {\n                            mnk_hit = true;\n                            break;\n                        }\n                    }\n                    if (!mnk_hit)\n                        int8_matrix_properties.push_back(cmp);\n                }\n                if (cmp.AType == VK_COMPONENT_TYPE_BFLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_BFLOAT16_KHR)\n                {\n                    bool mnk_hit = false;\n                    for (size_t k = 0; k < bf16_matrix_properties.size(); k++)\n                    {\n                        const VkCooperativeMatrixPropertiesKHR& cmp0 = bf16_matrix_properties[k];\n                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)\n                        {\n                            mnk_hit = true;\n                            break;\n                        }\n                    }\n                    if (!mnk_hit)\n                        bf16_matrix_properties.push_back(cmp);\n                }\n                if ((cmp.AType == VK_COMPONENT_TYPE_FLOAT8_E4M3_EXT || cmp.AType == VK_COMPONENT_TYPE_FLOAT8_E5M2_EXT\n                        || cmp.AType == VK_COMPONENT_TYPE_FLOAT_E4M3_NV || cmp.AType == VK_COMPONENT_TYPE_FLOAT_E5M2_NV)\n                        && (cmp.BType == VK_COMPONENT_TYPE_FLOAT8_E4M3_EXT || cmp.BType == VK_COMPONENT_TYPE_FLOAT8_E5M2_EXT\n                            || cmp.BType == 
VK_COMPONENT_TYPE_FLOAT_E4M3_NV || cmp.BType == VK_COMPONENT_TYPE_FLOAT_E5M2_NV))\n                {\n                    bool mnk_hit = false;\n                    for (size_t k = 0; k < fp8_matrix_properties.size(); k++)\n                    {\n                        const VkCooperativeMatrixPropertiesKHR& cmp0 = fp8_matrix_properties[k];\n                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)\n                        {\n                            mnk_hit = true;\n                            break;\n                        }\n                    }\n                    if (!mnk_hit)\n                        fp8_matrix_properties.push_back(cmp);\n                }\n            }\n        }\n        else if (gpu_info.support_VK_NV_cooperative_matrix())\n        {\n            const std::vector<VkCooperativeMatrixPropertiesNV>& properties = gpu_info.queryCooperativeMatrixSubPropertiesNV();\n            for (uint32_t j = 0; j < properties.size(); j++)\n            {\n                const VkCooperativeMatrixPropertiesNV& cmp = properties[j];\n\n                if (cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV)\n                {\n                    bool mnk_hit = false;\n                    for (size_t k = 0; k < fp16_matrix_properties.size(); k++)\n                    {\n                        const VkCooperativeMatrixPropertiesKHR& cmp0 = fp16_matrix_properties[k];\n                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)\n                        {\n                            mnk_hit = true;\n                            break;\n                        }\n                    }\n                    if (!mnk_hit)\n                    {\n                        VkCooperativeMatrixPropertiesKHR cmp_khr;\n                        cmp_khr.MSize = cmp.MSize;\n                        cmp_khr.NSize = cmp.NSize;\n                    
    cmp_khr.KSize = cmp.KSize;\n                        fp16_matrix_properties.push_back(cmp_khr);\n                    }\n                }\n                if (cmp.AType == VK_COMPONENT_TYPE_SINT8_NV && cmp.BType == VK_COMPONENT_TYPE_SINT8_NV)\n                {\n                    bool mnk_hit = false;\n                    for (size_t k = 0; k < int8_matrix_properties.size(); k++)\n                    {\n                        const VkCooperativeMatrixPropertiesKHR& cmp0 = int8_matrix_properties[k];\n                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)\n                        {\n                            mnk_hit = true;\n                            break;\n                        }\n                    }\n                    if (!mnk_hit)\n                    {\n                        VkCooperativeMatrixPropertiesKHR cmp_khr;\n                        cmp_khr.MSize = cmp.MSize;\n                        cmp_khr.NSize = cmp.NSize;\n                        cmp_khr.KSize = cmp.KSize;\n                        int8_matrix_properties.push_back(cmp_khr);\n                    }\n                }\n            }\n        }\n\n        std::string fp16_matrix_info_str;\n        std::string int8_matrix_info_str;\n        std::string bf16_matrix_info_str;\n        std::string fp8_matrix_info_str;\n        {\n            for (uint32_t j = 0; j < fp16_matrix_properties.size(); j++)\n            {\n                const VkCooperativeMatrixPropertiesKHR& cmp = fp16_matrix_properties[j];\n                char tmp[64];\n                sprintf(tmp, j > 0 ? \"/%ux%ux%u\" : \"%ux%ux%u\", cmp.MSize, cmp.NSize, cmp.KSize);\n                fp16_matrix_info_str += tmp;\n            }\n            for (uint32_t j = 0; j < int8_matrix_properties.size(); j++)\n            {\n                const VkCooperativeMatrixPropertiesKHR& cmp = int8_matrix_properties[j];\n                char tmp[64];\n                sprintf(tmp, j > 0 ? 
\"/%ux%ux%u\" : \"%ux%ux%u\", cmp.MSize, cmp.NSize, cmp.KSize);\n                int8_matrix_info_str += tmp;\n            }\n            for (uint32_t j = 0; j < bf16_matrix_properties.size(); j++)\n            {\n                const VkCooperativeMatrixPropertiesKHR& cmp = bf16_matrix_properties[j];\n                char tmp[64];\n                sprintf(tmp, j > 0 ? \"/%ux%ux%u\" : \"%ux%ux%u\", cmp.MSize, cmp.NSize, cmp.KSize);\n                bf16_matrix_info_str += tmp;\n            }\n            for (uint32_t j = 0; j < fp8_matrix_properties.size(); j++)\n            {\n                const VkCooperativeMatrixPropertiesKHR& cmp = fp8_matrix_properties[j];\n                char tmp[64];\n                sprintf(tmp, j > 0 ? \"/%ux%ux%u\" : \"%ux%ux%u\", cmp.MSize, cmp.NSize, cmp.KSize);\n                fp8_matrix_info_str += tmp;\n            }\n\n            if (fp16_matrix_info_str.empty())\n                fp16_matrix_info_str = \"0\";\n            if (int8_matrix_info_str.empty())\n                int8_matrix_info_str = \"0\";\n            if (bf16_matrix_info_str.empty())\n                bf16_matrix_info_str = \"0\";\n            if (fp8_matrix_info_str.empty())\n                fp8_matrix_info_str = \"0\";\n        }\n\n        NCNN_LOGE(\"[%u %s]  fp16-cm=%s  int8-cm=%s  bf16-cm=%s  fp8-cm=%s\", i, gpu_info.device_name(),\n                  fp16_matrix_info_str.c_str(), int8_matrix_info_str.c_str(), bf16_matrix_info_str.c_str(), fp8_matrix_info_str.c_str());\n\n        gpu_info_index++;\n    }\n\n    g_gpu_count = gpu_info_index;\n\n    // the default gpu device\n    g_default_gpu_index = find_default_vulkan_device_index();\n\n    g_instance.glslang_initialized = glslang::InitializeProcess();\n\n    // the global __ncnn_vulkan_instance_holder destructor will call destroy_gpu_instance() on exit\n    // but it seems to be too late for nvidia driver :(\n    // driver's internal data structure has been destroyed when called, causing segfault\n    // 
atexit() seems to be helpful for calling it earlier    --- nihui\n    static int destroy_gpu_instance_atexit_registered = 0;\n    if (!destroy_gpu_instance_atexit_registered)\n    {\n        atexit(destroy_gpu_instance);\n        destroy_gpu_instance_atexit_registered = 1;\n    }\n\n    return 0;\n}\n\nVkInstance get_gpu_instance()\n{\n    return (VkInstance)g_instance;\n}\n\nvoid destroy_gpu_instance()\n{\n    MutexLockGuard lock(g_instance_lock);\n\n    if (g_instance.created == 0)\n        return;\n\n    for (int i = 0; i < NCNN_MAX_GPU_COUNT; i++)\n    {\n        VulkanDevice* vulkan_device = g_default_vkdev[i];\n        if (vulkan_device)\n        {\n            VkDevice vkdev = g_default_vkdev[i]->vkdevice();\n            if (vkdev)\n            {\n                vkDeviceWaitIdle(vkdev);\n            }\n        }\n    }\n\n    // NCNN_LOGE(\"destroy_gpu_instance\");\n\n    if (g_instance.glslang_initialized)\n    {\n        glslang::FinalizeProcess();\n        g_instance.glslang_initialized = false;\n    }\n\n    for (int i = 0; i < NCNN_MAX_GPU_COUNT; i++)\n    {\n        delete g_default_vkdev[i];\n        g_default_vkdev[i] = 0;\n\n        delete g_gpu_infos[i];\n        g_gpu_infos[i] = 0;\n    }\n\n#if ENABLE_VALIDATION_LAYER\n    if (support_VK_EXT_debug_utils && g_instance.callback)\n    {\n        DestroyDebugUtilsMessengerEXT(g_instance, g_instance.callback, NULL);\n        g_instance.callback = 0;\n    }\n#endif // ENABLE_VALIDATION_LAYER\n\n    if (vkDestroyInstance)\n    {\n        vkDestroyInstance(g_instance, 0);\n        vkDestroyInstance = 0;\n    }\n\n    g_instance.instance = 0;\n\n#if NCNN_SIMPLEVK\n    unload_vulkan_driver();\n#endif\n\n    g_instance.created = 0;\n}\n\nstatic void try_create_gpu_instance()\n{\n    {\n        MutexLockGuard lock(g_instance_lock);\n\n        if (g_instance.created != 0)\n            return;\n    }\n\n    create_gpu_instance();\n}\n\nint get_gpu_count()\n{\n    try_create_gpu_instance();\n\n    return 
g_gpu_count;\n}\n\nint get_default_gpu_index()\n{\n    try_create_gpu_instance();\n\n    return g_default_gpu_index;\n}\n\nconst GpuInfo& get_gpu_info(int device_index)\n{\n    try_create_gpu_instance();\n\n    return *g_gpu_infos[device_index];\n}\n\nclass VkDummyAllocator : public VkBlobAllocator\n{\npublic:\n    // NOTE 16k is large enough I think ...\n    VkDummyAllocator(const VulkanDevice* _vkdev)\n        : VkBlobAllocator(_vkdev, 16 * 1024)\n    {\n    }\n};\n\nclass VkDummyCompute : public VkCompute\n{\npublic:\n    VkDummyCompute(const VulkanDevice* _vkdev)\n        : VkCompute(_vkdev)\n    {\n    }\n\n    void record_dummy(const VkMat& buffer)\n    {\n        barrier_readwrite(buffer);\n    }\n\n    void record_dummy(const VkImageMat& image)\n    {\n        barrier_readwrite(image);\n    }\n\n    void record_dummy_readonly(const VkImageMat& image)\n    {\n        barrier_readonly(image);\n    }\n};\n\nclass VulkanDevicePrivate\n{\npublic:\n    VulkanDevicePrivate(VulkanDevice* _vkdev);\n    VulkanDevice* const vkdev;\n\n    // dummy buffer and image\n    int create_dummy_buffer_image();\n    void destroy_dummy_buffer_image();\n\n    // utility operator\n    const ncnn::Layer* get_utility_operator(int cast_type_from_index, int cast_type_to_index, int packing_type_to_index) const;\n    void destroy_utility_operator();\n\n    VkDevice device;\n\n    // hardware queue\n    mutable std::vector<VkQueue> compute_queues;\n    mutable std::vector<VkQueue> transfer_queues;\n    mutable int free_compute_queue_count;\n    mutable int free_transfer_queue_count;\n    mutable Mutex compute_queue_lock;\n    mutable Mutex transfer_queue_lock;\n    mutable ConditionVariable compute_queue_condition;\n    mutable ConditionVariable transfer_queue_condition;\n\n    // default blob allocator for each queue\n    mutable std::vector<VkAllocator*> blob_allocators;\n    mutable Mutex blob_allocator_lock;\n\n    // default staging allocator for each queue\n    mutable 
std::vector<VkAllocator*> staging_allocators;\n    mutable Mutex staging_allocator_lock;\n\n    // nearest sampler for texelfetch\n    VkSampler texelfetch_sampler;\n\n    // dummy buffer and image\n    VkAllocator* dummy_allocator;\n    VkMat dummy_buffer;\n    VkImageMat dummy_image;\n    VkImageMat dummy_image_readonly;\n\n    // device-wide pipeline cache\n    PipelineCache* pipeline_cache;\n\n    // utility operator\n    // from fp32 | fp16\n    // to fp32 | fp16\n    // to pack1 | pack4\n    mutable ncnn::Layer* uop_packing[2][2][2];\n    // from int8\n    // to int8\n    // to pack1 | pack4\n    mutable ncnn::Layer* uop_packing_int8[2];\n    // from fp32 to bf16 / from bf16 to fp32 / bf16\n    // to pack1 | pack4\n    mutable ncnn::Layer* uop_packing_bf16[3][2];\n    mutable Mutex uop_lock;\n\n    // device is valid and successfully initialized\n    bool valid;\n};\n\nVulkanDevicePrivate::VulkanDevicePrivate(VulkanDevice* _vkdev)\n    : vkdev(_vkdev)\n{\n    device = 0;\n    texelfetch_sampler = 0;\n    dummy_allocator = 0;\n    pipeline_cache = 0;\n    valid = false;\n    memset(uop_packing, 0, sizeof(uop_packing));\n    memset(uop_packing_int8, 0, sizeof(uop_packing_int8));\n    memset(uop_packing_bf16, 0, sizeof(uop_packing_bf16));\n}\n\nint VulkanDevicePrivate::create_dummy_buffer_image()\n{\n    dummy_allocator = new VkDummyAllocator(vkdev);\n\n    dummy_buffer.create(1, 4u, dummy_allocator);\n    dummy_image.create(1, 4u, dummy_allocator);\n#if __APPLE__\n    if (vkdev->info.type() == 0)\n        dummy_image_readonly.create(1, 4u, dummy_allocator);\n#else\n    dummy_image_readonly.create(1, 4u, dummy_allocator);\n#endif\n\n    VkDummyCompute cmd(vkdev);\n\n    cmd.record_dummy(dummy_buffer);\n    cmd.record_dummy(dummy_image);\n#if __APPLE__\n    if (vkdev->info.type() == 0)\n        cmd.record_dummy_readonly(dummy_image_readonly);\n#else\n    cmd.record_dummy_readonly(dummy_image_readonly);\n#endif\n\n    return cmd.submit_and_wait();\n}\n\nvoid 
VulkanDevicePrivate::destroy_dummy_buffer_image()\n{\n    dummy_buffer.release();\n    dummy_image.release();\n#if __APPLE__\n    if (vkdev->info.type() == 0)\n        dummy_image_readonly.release();\n#else\n    dummy_image_readonly.release();\n#endif\n\n    if (dummy_allocator)\n    {\n        delete dummy_allocator;\n        dummy_allocator = 0;\n    }\n}\n\nconst ncnn::Layer* VulkanDevicePrivate::get_utility_operator(int cast_type_from_index, int cast_type_to_index, int packing_type_to_index) const\n{\n    bool use_fp16 = (cast_type_from_index == 1 || cast_type_to_index == 1);\n    bool use_int8 = (cast_type_from_index == 3 || cast_type_to_index == 3);\n    bool use_bf16 = (cast_type_from_index == 4 || cast_type_to_index == 4);\n\n    MutexLockGuard lock(uop_lock);\n\n    const ncnn::Layer* cached_uop = 0;\n    if (use_int8)\n    {\n        cached_uop = uop_packing_int8[packing_type_to_index];\n    }\n    else if (use_bf16)\n    {\n        if (cast_type_from_index == 4 && cast_type_to_index == 4)\n        {\n            cached_uop = uop_packing_bf16[2][packing_type_to_index];\n        }\n        else if (cast_type_to_index == 4)\n        {\n            cached_uop = uop_packing_bf16[1][packing_type_to_index];\n        }\n        else // if (cast_type_from_index == 4)\n        {\n            cached_uop = uop_packing_bf16[0][packing_type_to_index];\n        }\n    }\n    else\n    {\n        cached_uop = uop_packing[cast_type_from_index][cast_type_to_index][packing_type_to_index];\n    }\n    if (cached_uop)\n        return cached_uop;\n\n    // create uop\n    Option opt;\n    opt.use_fp16_packed = use_fp16; // fp16p is always supported\n    opt.use_fp16_storage = use_fp16 && vkdev->info.support_fp16_storage();\n    opt.use_int8_packed = use_int8; // int8p is always supported\n    opt.use_int8_storage = use_int8 && vkdev->info.support_int8_storage();\n    opt.use_bf16_packed = use_bf16; // bf16p is always supported\n    opt.use_bf16_storage = use_bf16 && 
vkdev->info.support_bf16_storage();\n\n    // fp16/int8 arithmetic are not necessary for packing\n    // and may conflict with storage options\n    opt.use_fp16_arithmetic = false;\n    opt.use_int8_arithmetic = false;\n\n    // do not enable spirv-1.3 from cooperative matrix\n    opt.use_cooperative_matrix = false;\n\n    opt.use_vulkan_compute = true;\n\n    // cache uop pipeline as device member explicitly\n    opt.pipeline_cache = 0;\n\n    opt.vulkan_device_index = vkdev->info.device_index();\n\n    ncnn::Layer* uop = ncnn::create_layer_vulkan(LayerType::Packing);\n    uop->vkdev = vkdev;\n\n    ncnn::ParamDict pd;\n    pd.set(0, packing_type_to_index == 0 ? 1 : 4); // out_elempack\n    pd.set(2, cast_type_from_index + 1);           // 0=auto 1=fp32 2=fp16 3=int32 4=int8 5=bf16\n    pd.set(3, cast_type_to_index + 1);\n\n    uop->load_param(pd);\n\n    uop->create_pipeline(opt);\n\n    if (use_int8)\n    {\n        uop_packing_int8[packing_type_to_index] = uop;\n    }\n    else if (use_bf16)\n    {\n        if (cast_type_from_index == 4 && cast_type_to_index == 4)\n        {\n            uop_packing_bf16[2][packing_type_to_index] = uop;\n        }\n        else if (cast_type_to_index == 4)\n        {\n            uop_packing_bf16[1][packing_type_to_index] = uop;\n        }\n        else // if (cast_type_from_index == 4)\n        {\n            uop_packing_bf16[0][packing_type_to_index] = uop;\n        }\n    }\n    else\n    {\n        uop_packing[cast_type_from_index][cast_type_to_index][packing_type_to_index] = uop;\n    }\n\n    return uop;\n}\n\nvoid VulkanDevicePrivate::destroy_utility_operator()\n{\n    Option opt;\n    opt.use_vulkan_compute = true;\n    opt.use_fp16_arithmetic = false;\n    opt.use_int8_arithmetic = false;\n    opt.use_cooperative_matrix = false;\n    opt.pipeline_cache = 0;\n    opt.vulkan_device_index = vkdev->info.device_index();\n\n    // from fp32 | fp16\n    for (int j0 = 0; j0 < 2; j0++)\n    {\n        // to fp32 | fp16\n        
for (int j1 = 0; j1 < 2; j1++)\n        {\n            bool use_fp16 = (j0 == 1 || j1 == 1);\n\n            opt.use_fp16_packed = use_fp16;\n            opt.use_fp16_storage = use_fp16 && vkdev->info.support_fp16_storage();\n            opt.use_int8_packed = false;\n            opt.use_int8_storage = false;\n            opt.use_bf16_packed = false;\n            opt.use_bf16_storage = false;\n\n            // to pack1 | pack4\n            for (int k = 0; k < 2; k++)\n            {\n                ncnn::Layer* uop = uop_packing[j0][j1][k];\n                if (!uop)\n                    continue;\n\n                uop->destroy_pipeline(opt);\n\n                delete uop;\n\n                uop_packing[j0][j1][k] = 0;\n            }\n        }\n    }\n\n    // int8\n    {\n        bool use_int8 = true;\n\n        opt.use_fp16_packed = false;\n        opt.use_fp16_storage = false;\n        opt.use_int8_packed = use_int8;\n        opt.use_int8_storage = use_int8 && vkdev->info.support_int8_storage();\n        opt.use_bf16_packed = false;\n        opt.use_bf16_storage = false;\n\n        // to pack1 | pack4\n        for (int k = 0; k < 2; k++)\n        {\n            ncnn::Layer* uop = uop_packing_int8[k];\n            if (!uop)\n                continue;\n\n            uop->destroy_pipeline(opt);\n\n            delete uop;\n\n            uop_packing_int8[k] = 0;\n        }\n    }\n\n    // from fp32 to bf16\n    // from bf16 to fp32\n    // bf16\n    for (int j = 0; j < 3; j++)\n    {\n        bool use_bf16 = true;\n\n        opt.use_fp16_packed = false;\n        opt.use_fp16_storage = false;\n        opt.use_int8_packed = false;\n        opt.use_int8_storage = false;\n        opt.use_bf16_packed = use_bf16;\n        opt.use_bf16_storage = use_bf16 && vkdev->info.support_bf16_storage();\n\n        // to pack1 | pack4\n        for (int k = 0; k < 2; k++)\n        {\n            ncnn::Layer* uop = uop_packing_bf16[j][k];\n            if (!uop)\n                
continue;\n\n            uop->destroy_pipeline(opt);\n\n            delete uop;\n\n            uop_packing_bf16[j][k] = 0;\n        }\n    }\n}\n\nVulkanDevice::VulkanDevice(int device_index)\n    : info(get_gpu_info(device_index)), d(new VulkanDevicePrivate(this))\n{\n    try_create_gpu_instance();\n\n    std::vector<const char*> enabledExtensions;\n    if (info.support_VK_KHR_8bit_storage())\n        enabledExtensions.push_back(\"VK_KHR_8bit_storage\");\n    if (info.support_VK_KHR_16bit_storage())\n        enabledExtensions.push_back(\"VK_KHR_16bit_storage\");\n    if (info.support_VK_KHR_bind_memory2())\n        enabledExtensions.push_back(\"VK_KHR_bind_memory2\");\n    if (info.support_VK_KHR_buffer_device_address())\n        enabledExtensions.push_back(\"VK_KHR_buffer_device_address\");\n    if (info.support_VK_KHR_create_renderpass2())\n        enabledExtensions.push_back(\"VK_KHR_create_renderpass2\");\n    if (info.support_VK_KHR_cooperative_matrix())\n        enabledExtensions.push_back(\"VK_KHR_cooperative_matrix\");\n    if (info.support_VK_KHR_dedicated_allocation())\n        enabledExtensions.push_back(\"VK_KHR_dedicated_allocation\");\n    if (info.support_VK_KHR_descriptor_update_template())\n        enabledExtensions.push_back(\"VK_KHR_descriptor_update_template\");\n    if (info.support_VK_KHR_driver_properties())\n        enabledExtensions.push_back(\"VK_KHR_driver_properties\");\n    if (info.support_VK_KHR_external_memory())\n        enabledExtensions.push_back(\"VK_KHR_external_memory\");\n    if (info.support_VK_KHR_get_memory_requirements2())\n        enabledExtensions.push_back(\"VK_KHR_get_memory_requirements2\");\n    if (info.support_VK_KHR_maintenance1())\n        enabledExtensions.push_back(\"VK_KHR_maintenance1\");\n    if (info.support_VK_KHR_maintenance2())\n        enabledExtensions.push_back(\"VK_KHR_maintenance2\");\n    if (info.support_VK_KHR_maintenance3())\n        enabledExtensions.push_back(\"VK_KHR_maintenance3\");\n    if 
(info.support_VK_KHR_multiview())\n        enabledExtensions.push_back(\"VK_KHR_multiview\");\n    if (info.support_VK_KHR_portability_subset())\n        enabledExtensions.push_back(\"VK_KHR_portability_subset\");\n    if (info.support_VK_KHR_push_descriptor())\n        enabledExtensions.push_back(\"VK_KHR_push_descriptor\");\n    if (info.support_VK_KHR_robustness2())\n        enabledExtensions.push_back(\"VK_KHR_robustness2\");\n    if (info.support_VK_KHR_sampler_ycbcr_conversion())\n        enabledExtensions.push_back(\"VK_KHR_sampler_ycbcr_conversion\");\n    if (info.support_VK_KHR_shader_bfloat16())\n        enabledExtensions.push_back(\"VK_KHR_shader_bfloat16\");\n    if (info.support_VK_KHR_shader_float16_int8())\n        enabledExtensions.push_back(\"VK_KHR_shader_float16_int8\");\n    if (info.support_VK_KHR_shader_float_controls())\n        enabledExtensions.push_back(\"VK_KHR_shader_float_controls\");\n    if (info.support_VK_KHR_shader_float_controls2())\n        enabledExtensions.push_back(\"VK_KHR_shader_float_controls2\");\n    if (info.support_VK_KHR_shader_integer_dot_product())\n        enabledExtensions.push_back(\"VK_KHR_shader_integer_dot_product\");\n    if (info.support_VK_KHR_shader_non_semantic_info())\n        enabledExtensions.push_back(\"VK_KHR_shader_non_semantic_info\");\n    if (info.support_VK_KHR_shader_subgroup_extended_types())\n        enabledExtensions.push_back(\"VK_KHR_shader_subgroup_extended_types\");\n    if (info.support_VK_KHR_shader_subgroup_rotate())\n        enabledExtensions.push_back(\"VK_KHR_shader_subgroup_rotate\");\n    if (info.support_VK_KHR_storage_buffer_storage_class())\n        enabledExtensions.push_back(\"VK_KHR_storage_buffer_storage_class\");\n    if (info.support_VK_KHR_swapchain())\n        enabledExtensions.push_back(\"VK_KHR_swapchain\");\n    if (info.support_VK_KHR_vulkan_memory_model())\n        enabledExtensions.push_back(\"VK_KHR_vulkan_memory_model\");\n    if 
(info.support_VK_KHR_zero_initialize_workgroup_memory())\n        enabledExtensions.push_back(\"VK_KHR_zero_initialize_workgroup_memory\");\n    if (info.support_VK_EXT_buffer_device_address())\n        enabledExtensions.push_back(\"VK_EXT_buffer_device_address\");\n    if (info.support_VK_EXT_descriptor_indexing())\n        enabledExtensions.push_back(\"VK_EXT_descriptor_indexing\");\n    if (info.support_VK_EXT_external_memory_host())\n        enabledExtensions.push_back(\"VK_EXT_external_memory_host\");\n    if (info.support_VK_EXT_memory_budget())\n        enabledExtensions.push_back(\"VK_EXT_memory_budget\");\n    if (info.support_VK_EXT_memory_priority())\n        enabledExtensions.push_back(\"VK_EXT_memory_priority\");\n    if (info.support_VK_EXT_queue_family_foreign())\n        enabledExtensions.push_back(\"VK_EXT_queue_family_foreign\");\n    if (info.support_VK_EXT_robustness2())\n        enabledExtensions.push_back(\"VK_EXT_robustness2\");\n    if (info.support_VK_EXT_shader_atomic_float())\n        enabledExtensions.push_back(\"VK_EXT_shader_atomic_float\");\n    if (info.support_VK_EXT_shader_atomic_float2())\n        enabledExtensions.push_back(\"VK_EXT_shader_atomic_float2\");\n    if (info.support_VK_EXT_shader_float8())\n        enabledExtensions.push_back(\"VK_EXT_shader_float8\");\n    if (info.support_VK_EXT_subgroup_size_control())\n        enabledExtensions.push_back(\"VK_EXT_subgroup_size_control\");\n    if (info.support_VK_AMD_device_coherent_memory())\n        enabledExtensions.push_back(\"VK_AMD_device_coherent_memory\");\n#if __ANDROID_API__ >= 26\n    if (info.support_VK_ANDROID_external_memory_android_hardware_buffer())\n        enabledExtensions.push_back(\"VK_ANDROID_external_memory_android_hardware_buffer\");\n#endif // __ANDROID_API__ >= 26\n    if (info.support_VK_NV_cooperative_matrix())\n        enabledExtensions.push_back(\"VK_NV_cooperative_matrix\");\n    if (info.support_VK_NV_cooperative_matrix2())\n        
enabledExtensions.push_back(\"VK_NV_cooperative_matrix2\");\n    if (info.support_VK_NV_cooperative_vector())\n        enabledExtensions.push_back(\"VK_NV_cooperative_vector\");\n\n    const void* enabledExtensionFeatures = info.queryExtensionFeatures();\n\n    std::vector<float> compute_queue_priorities(info.compute_queue_count(), 1.f);   // 0.f ~ 1.f\n    std::vector<float> transfer_queue_priorities(info.transfer_queue_count(), 1.f); // 0.f ~ 1.f\n\n    VkDeviceQueueCreateInfo deviceQueueCreateInfos[3];\n\n    VkDeviceQueueCreateInfo deviceComputeQueueCreateInfo;\n    deviceComputeQueueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;\n    deviceComputeQueueCreateInfo.pNext = 0;\n    deviceComputeQueueCreateInfo.flags = 0;\n    deviceComputeQueueCreateInfo.queueFamilyIndex = info.compute_queue_family_index();\n    deviceComputeQueueCreateInfo.queueCount = info.compute_queue_count();\n    deviceComputeQueueCreateInfo.pQueuePriorities = compute_queue_priorities.data();\n\n    VkDeviceQueueCreateInfo deviceTransferQueueCreateInfo;\n    deviceTransferQueueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;\n    deviceTransferQueueCreateInfo.pNext = 0;\n    deviceTransferQueueCreateInfo.flags = 0;\n    deviceTransferQueueCreateInfo.queueFamilyIndex = info.transfer_queue_family_index();\n    deviceTransferQueueCreateInfo.queueCount = info.transfer_queue_count();\n    deviceTransferQueueCreateInfo.pQueuePriorities = transfer_queue_priorities.data();\n\n    VkDeviceCreateInfo deviceCreateInfo;\n    deviceCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;\n    deviceCreateInfo.pNext = enabledExtensionFeatures;\n    deviceCreateInfo.flags = 0;\n    if (info.compute_queue_family_index() == info.transfer_queue_family_index())\n    {\n        deviceQueueCreateInfos[0] = deviceComputeQueueCreateInfo;\n        deviceCreateInfo.queueCreateInfoCount = 1;\n    }\n    else // if (info.compute_queue_family_index() != 
info.transfer_queue_family_index())\n    {\n        deviceQueueCreateInfos[0] = deviceComputeQueueCreateInfo;\n        deviceQueueCreateInfos[1] = deviceTransferQueueCreateInfo;\n        deviceCreateInfo.queueCreateInfoCount = 2;\n    }\n\n    deviceCreateInfo.pQueueCreateInfos = deviceQueueCreateInfos;\n    deviceCreateInfo.enabledLayerCount = 0;\n    deviceCreateInfo.ppEnabledLayerNames = 0;\n    deviceCreateInfo.enabledExtensionCount = enabledExtensions.size();\n    deviceCreateInfo.ppEnabledExtensionNames = enabledExtensions.data();\n    deviceCreateInfo.pEnabledFeatures = 0; // VkPhysicalDeviceFeatures pointer\n\n    VkResult ret = vkCreateDevice(info.physicalDevice(), &deviceCreateInfo, 0, &d->device);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateDevice failed %d\", ret);\n        return;\n    }\n\n    init_device_extension();\n\n    d->free_compute_queue_count = 0;\n    d->free_transfer_queue_count = 0;\n\n    d->free_compute_queue_count = info.compute_queue_count();\n    d->compute_queues.resize(info.compute_queue_count());\n    d->blob_allocators.resize(info.compute_queue_count());\n    d->staging_allocators.resize(info.compute_queue_count());\n    for (uint32_t i = 0; i < info.compute_queue_count(); i++)\n    {\n        vkGetDeviceQueue(d->device, info.compute_queue_family_index(), i, &d->compute_queues[i]);\n        d->blob_allocators[i] = new VkBlobAllocator(this);\n        d->staging_allocators[i] = new VkStagingAllocator(this);\n    }\n    if (info.compute_queue_family_index() != info.transfer_queue_family_index())\n    {\n        d->free_transfer_queue_count = info.transfer_queue_count();\n        d->transfer_queues.resize(info.transfer_queue_count());\n        for (uint32_t i = 0; i < info.transfer_queue_count(); i++)\n        {\n            vkGetDeviceQueue(d->device, info.transfer_queue_family_index(), i, &d->transfer_queues[i]);\n        }\n    }\n\n    // prepare immutable texelfetch sampler\n    {\n        VkSamplerCreateInfo 
samplerCreateInfo;\n        samplerCreateInfo.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;\n        samplerCreateInfo.pNext = 0;\n        samplerCreateInfo.flags = 0;\n        samplerCreateInfo.magFilter = VK_FILTER_NEAREST;\n        samplerCreateInfo.minFilter = VK_FILTER_NEAREST;\n        samplerCreateInfo.mipmapMode = VK_SAMPLER_MIPMAP_MODE_NEAREST;\n        samplerCreateInfo.addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;\n        samplerCreateInfo.addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;\n        samplerCreateInfo.addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;\n        samplerCreateInfo.mipLodBias = 0.0f;\n        samplerCreateInfo.anisotropyEnable = VK_FALSE;\n        samplerCreateInfo.maxAnisotropy = 1;\n        samplerCreateInfo.compareEnable = VK_FALSE;\n        samplerCreateInfo.compareOp = VK_COMPARE_OP_NEVER;\n        samplerCreateInfo.minLod = 0.0f;\n        samplerCreateInfo.maxLod = 0.0f;\n        samplerCreateInfo.borderColor = VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK;\n        samplerCreateInfo.unnormalizedCoordinates = VK_TRUE;\n\n        ret = vkCreateSampler(d->device, &samplerCreateInfo, 0, &d->texelfetch_sampler);\n        if (ret != VK_SUCCESS)\n        {\n            NCNN_LOGE(\"vkCreateSampler failed %d\", ret);\n        }\n    }\n\n    int cret = d->create_dummy_buffer_image();\n    if (cret != 0)\n    {\n        NCNN_LOGE(\"VulkanDevice create_dummy_buffer_image failed %d\", cret);\n        return;\n    }\n\n    d->pipeline_cache = new PipelineCache(this);\n\n    d->valid = true;\n}\n\nVulkanDevice::~VulkanDevice()\n{\n    d->destroy_utility_operator();\n\n    d->destroy_dummy_buffer_image();\n\n    if (d->texelfetch_sampler)\n    {\n        vkDestroySampler(d->device, d->texelfetch_sampler, 0);\n    }\n\n    for (size_t i = 0; i < d->blob_allocators.size(); i++)\n    {\n        delete d->blob_allocators[i];\n    }\n    d->blob_allocators.clear();\n    for (size_t i = 0; i < d->staging_allocators.size(); i++)\n  
  {\n        delete d->staging_allocators[i];\n    }\n    d->staging_allocators.clear();\n\n    if (d->pipeline_cache)\n    {\n        delete d->pipeline_cache;\n    }\n\n    if (d->device)\n    {\n        vkDestroyDevice(d->device, 0);\n    }\n\n    delete d;\n}\n\nVulkanDevice::VulkanDevice(const VulkanDevice&)\n    : info(get_gpu_info(0)), d(0)\n{\n}\n\nVulkanDevice& VulkanDevice::operator=(const VulkanDevice&)\n{\n    return *this;\n}\n\nVkDevice VulkanDevice::vkdevice() const\n{\n    return d->device;\n}\n\nbool VulkanDevice::is_valid() const\n{\n    return d->valid;\n}\n\nVkShaderModule VulkanDevice::compile_shader_module(const uint32_t* spv_data, size_t spv_data_size) const\n{\n    VkShaderModuleCreateInfo shaderModuleCreateInfo;\n    shaderModuleCreateInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;\n    shaderModuleCreateInfo.pNext = 0;\n    shaderModuleCreateInfo.flags = 0;\n    shaderModuleCreateInfo.codeSize = spv_data_size;\n    shaderModuleCreateInfo.pCode = spv_data;\n\n    VkShaderModule shader_module;\n    VkResult ret = vkCreateShaderModule(d->device, &shaderModuleCreateInfo, 0, &shader_module);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateShaderModule failed %d\", ret);\n        return 0;\n    }\n\n    return shader_module;\n}\n\nstatic void inject_local_size_xyz(const uint32_t* code, size_t size, uint32_t local_size_x, uint32_t local_size_y, uint32_t local_size_z, uint32_t* dstcode, size_t* dstsize)\n{\n    uint32_t local_size_x_id = -1;\n    uint32_t local_size_y_id = -1;\n    uint32_t local_size_z_id = -1;\n    uint32_t gl_WorkGroupSize_id = -1;\n\n    const uint32_t* p = code;\n    uint32_t* dp = dstcode;\n\n    // skip magic version generator bound schema\n    memcpy(dp, p, 5 * sizeof(uint32_t));\n    p += 5;\n    dp += 5;\n\n    // foreach op\n    while ((const unsigned char*)p < (const unsigned char*)code + size)\n    {\n        uint32_t opcode = p[0];\n\n        uint16_t wordcount = opcode >> 16;\n        
uint16_t op = opcode & 0xffff;\n\n        if (op == 16) // OpExecutionMode\n        {\n            uint32_t mode = p[2];\n            if (mode == 17) // LocalSize\n            {\n                memcpy(dp, p, wordcount * sizeof(uint32_t));\n\n                // set local_size_xyz\n                dp[3] = local_size_x;\n                dp[4] = local_size_y;\n                dp[5] = local_size_z;\n\n                p += wordcount;\n                dp += wordcount;\n                continue;\n            }\n        }\n        else if (op == 50) // OpSpecConstant\n        {\n            uint32_t id = p[2];\n            if (id == local_size_x_id || id == local_size_y_id || id == local_size_z_id)\n            {\n                p += wordcount;\n                continue;\n            }\n        }\n        else if (op == 51) // OpSpecConstantComposite\n        {\n            uint32_t id = p[2];\n            if (id == gl_WorkGroupSize_id)\n            {\n                if (wordcount == 6 && (p[3] == local_size_x_id || p[4] == local_size_y_id || p[5] == local_size_z_id))\n                {\n                    p += wordcount;\n                    continue;\n                }\n            }\n        }\n        else if (op == 71) // OpDecorate\n        {\n            uint32_t id = p[1];\n            uint32_t decoration = p[2];\n            if (decoration == 1) // SpecId\n            {\n                uint32_t specid = p[3];\n                if (specid == 233) local_size_x_id = id;\n                if (specid == 234) local_size_y_id = id;\n                if (specid == 235) local_size_z_id = id;\n                if (specid == 233 || specid == 234 || specid == 235)\n                {\n                    p += wordcount;\n                    continue;\n                }\n            }\n            else if (decoration == 11) // BuiltIn\n            {\n                uint32_t builtin = p[3];\n                if (builtin == 25) // WorkgroupSize\n                {\n                
    gl_WorkGroupSize_id = id;\n                    p += wordcount;\n                    continue;\n                }\n            }\n        }\n\n        memcpy(dp, p, wordcount * sizeof(uint32_t));\n        p += wordcount;\n        dp += wordcount;\n    }\n\n    *dstsize = (unsigned char*)dp - (unsigned char*)dstcode;\n}\n\nVkShaderModule VulkanDevice::compile_shader_module(const uint32_t* spv_data, size_t spv_data_size, uint32_t local_size_x, uint32_t local_size_y, uint32_t local_size_z) const\n{\n    uint32_t* spv_data_modified = (uint32_t*)malloc(spv_data_size);\n    size_t spv_data_size_modified = spv_data_size;\n    inject_local_size_xyz(spv_data, spv_data_size, local_size_x, local_size_y, local_size_z, spv_data_modified, &spv_data_size_modified);\n\n    VkShaderModule shader_module = compile_shader_module(spv_data_modified, spv_data_size_modified);\n\n    free(spv_data_modified);\n\n    return shader_module;\n}\n\nint VulkanDevice::create_descriptorset_layout(int binding_count, const int* binding_types, VkDescriptorSetLayout* descriptorset_layout) const\n{\n    if (binding_count == 0)\n    {\n        *descriptorset_layout = 0;\n        return 0;\n    }\n\n    std::vector<VkDescriptorSetLayoutBinding> descriptorSetLayoutBindings(binding_count);\n    for (int i = 0; i < binding_count; i++)\n    {\n        int binding_type = binding_types[i];\n\n        descriptorSetLayoutBindings[i].binding = i;\n        descriptorSetLayoutBindings[i].descriptorCount = 1;\n        descriptorSetLayoutBindings[i].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;\n\n        if (binding_type == 1)\n        {\n            descriptorSetLayoutBindings[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;\n            descriptorSetLayoutBindings[i].pImmutableSamplers = 0;\n        }\n        else if (binding_type == 2)\n        {\n            descriptorSetLayoutBindings[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;\n            descriptorSetLayoutBindings[i].pImmutableSamplers = 0;\n 
       }\n        else // if (binding_type == 3)\n        {\n            descriptorSetLayoutBindings[i].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;\n            descriptorSetLayoutBindings[i].pImmutableSamplers = immutable_texelfetch_sampler(); // we always use texelfetch\n        }\n    }\n\n    VkDescriptorSetLayoutCreateInfo descriptorSetLayoutCreateInfo;\n    descriptorSetLayoutCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;\n    descriptorSetLayoutCreateInfo.pNext = 0;\n    descriptorSetLayoutCreateInfo.flags = 0;\n    descriptorSetLayoutCreateInfo.bindingCount = binding_count;\n    descriptorSetLayoutCreateInfo.pBindings = descriptorSetLayoutBindings.data();\n\n    if (info.support_VK_KHR_push_descriptor())\n    {\n        descriptorSetLayoutCreateInfo.flags |= VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR;\n    }\n\n    VkResult ret = vkCreateDescriptorSetLayout(d->device, &descriptorSetLayoutCreateInfo, 0, descriptorset_layout);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateDescriptorSetLayout failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nint VulkanDevice::create_pipeline_layout(int push_constant_count, VkDescriptorSetLayout descriptorset_layout, VkPipelineLayout* pipeline_layout) const\n{\n    VkPushConstantRange pushConstantRange;\n    pushConstantRange.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;\n    pushConstantRange.offset = 0;\n    pushConstantRange.size = sizeof(vk_constant_type) * push_constant_count;\n\n    VkPipelineLayoutCreateInfo pipelineLayoutCreateInfo;\n    pipelineLayoutCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;\n    pipelineLayoutCreateInfo.pNext = 0;\n    pipelineLayoutCreateInfo.flags = 0;\n\n    if (descriptorset_layout)\n    {\n        pipelineLayoutCreateInfo.setLayoutCount = 1;\n        pipelineLayoutCreateInfo.pSetLayouts = &descriptorset_layout;\n    }\n    else\n    {\n        pipelineLayoutCreateInfo.setLayoutCount = 
0;\n        pipelineLayoutCreateInfo.pSetLayouts = 0;\n    }\n\n    if (push_constant_count > 0)\n    {\n        pipelineLayoutCreateInfo.pushConstantRangeCount = 1;\n        pipelineLayoutCreateInfo.pPushConstantRanges = &pushConstantRange;\n    }\n    else\n    {\n        pipelineLayoutCreateInfo.pushConstantRangeCount = 0;\n        pipelineLayoutCreateInfo.pPushConstantRanges = 0;\n    }\n\n    VkResult ret = vkCreatePipelineLayout(d->device, &pipelineLayoutCreateInfo, 0, pipeline_layout);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreatePipelineLayout failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nint VulkanDevice::create_pipeline(VkShaderModule shader_module, VkPipelineLayout pipeline_layout, const std::vector<vk_specialization_type>& specializations, uint32_t subgroup_size, VkPipeline* pipeline) const\n{\n    const int specialization_count = specializations.size();\n\n    std::vector<VkSpecializationMapEntry> specializationMapEntries(specialization_count);\n    for (int i = 0; i < specialization_count; i++)\n    {\n        specializationMapEntries[i].constantID = i;\n        specializationMapEntries[i].offset = i * sizeof(vk_specialization_type);\n        specializationMapEntries[i].size = sizeof(vk_specialization_type);\n    }\n\n    VkSpecializationInfo specializationInfo;\n    specializationInfo.mapEntryCount = specializationMapEntries.size();\n    specializationInfo.pMapEntries = specializationMapEntries.data();\n    specializationInfo.dataSize = specializations.size() * sizeof(vk_specialization_type);\n    specializationInfo.pData = specializations.data();\n\n    VkPipelineShaderStageCreateInfo pipelineShaderStageCreateInfo;\n    pipelineShaderStageCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;\n    pipelineShaderStageCreateInfo.pNext = 0;\n    pipelineShaderStageCreateInfo.flags = 0;\n    pipelineShaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT;\n    
pipelineShaderStageCreateInfo.module = shader_module;\n    pipelineShaderStageCreateInfo.pName = \"main\";\n    pipelineShaderStageCreateInfo.pSpecializationInfo = &specializationInfo;\n\n    // but full subgroup bits enforce local_size_x be multiple of subgroup size\n    // if (info.support_compute_full_subgroups())\n    // {\n    //     pipelineShaderStageCreateInfo.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT;\n    // }\n\n    void* enabledExtensionFeatures = 0;\n\n    // subgroup size control\n    VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT pipelineShaderStageRequiredSubgroupSizeCreateInfo;\n    pipelineShaderStageRequiredSubgroupSizeCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;\n    pipelineShaderStageRequiredSubgroupSizeCreateInfo.pNext = 0;\n    pipelineShaderStageRequiredSubgroupSizeCreateInfo.requiredSubgroupSize = subgroup_size;\n    if (info.support_subgroup_size_control())\n    {\n        // pipelineShaderStageCreateInfo.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_ALLOW_VARYING_SUBGROUP_SIZE_BIT;\n        pipelineShaderStageRequiredSubgroupSizeCreateInfo.pNext = enabledExtensionFeatures;\n        enabledExtensionFeatures = &pipelineShaderStageRequiredSubgroupSizeCreateInfo;\n    }\n\n    pipelineShaderStageCreateInfo.pNext = enabledExtensionFeatures;\n\n    VkComputePipelineCreateInfo computePipelineCreateInfo;\n    computePipelineCreateInfo.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;\n    computePipelineCreateInfo.pNext = 0;\n    computePipelineCreateInfo.flags = 0;\n    computePipelineCreateInfo.stage = pipelineShaderStageCreateInfo;\n    computePipelineCreateInfo.layout = pipeline_layout;\n    computePipelineCreateInfo.basePipelineHandle = 0;\n    computePipelineCreateInfo.basePipelineIndex = 0;\n\n    VkResult ret = vkCreateComputePipelines(d->device, 0, 1, &computePipelineCreateInfo, 0, pipeline);\n    if (ret != VK_SUCCESS)\n    {\n        
NCNN_LOGE(\"vkCreateComputePipelines failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nint VulkanDevice::create_descriptor_update_template(int binding_count, const int* binding_types, VkDescriptorSetLayout descriptorset_layout, VkPipelineLayout pipeline_layout, VkDescriptorUpdateTemplateKHR* descriptor_update_template) const\n{\n    if (binding_count == 0)\n    {\n        *descriptor_update_template = 0;\n        return 0;\n    }\n\n    std::vector<VkDescriptorUpdateTemplateEntryKHR> descriptorUpdateTemplateEntries(binding_count);\n    size_t offset = 0;\n    for (int i = 0; i < binding_count; i++) // TODO do not update weights\n    {\n        int binding_type = binding_types[i];\n\n        descriptorUpdateTemplateEntries[i].dstBinding = i;\n        descriptorUpdateTemplateEntries[i].dstArrayElement = 0;\n        descriptorUpdateTemplateEntries[i].descriptorCount = 1;\n        descriptorUpdateTemplateEntries[i].offset = offset;\n\n        if (binding_type == 1)\n        {\n            descriptorUpdateTemplateEntries[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;\n            descriptorUpdateTemplateEntries[i].stride = sizeof(VkDescriptorBufferInfo);\n        }\n        else if (binding_type == 2)\n        {\n            descriptorUpdateTemplateEntries[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;\n            descriptorUpdateTemplateEntries[i].stride = sizeof(VkDescriptorImageInfo);\n        }\n        else // if (binding_type == 3)\n        {\n            descriptorUpdateTemplateEntries[i].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;\n            descriptorUpdateTemplateEntries[i].stride = sizeof(VkDescriptorImageInfo);\n        }\n\n        offset += descriptorUpdateTemplateEntries[i].stride;\n    }\n\n    VkDescriptorUpdateTemplateCreateInfoKHR descriptorUpdateTemplateCreateInfo;\n    descriptorUpdateTemplateCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO_KHR;\n    
descriptorUpdateTemplateCreateInfo.pNext = 0;\n    descriptorUpdateTemplateCreateInfo.flags = 0;\n    descriptorUpdateTemplateCreateInfo.descriptorUpdateEntryCount = binding_count; // TODO do not update weights\n    descriptorUpdateTemplateCreateInfo.pDescriptorUpdateEntries = descriptorUpdateTemplateEntries.data();\n    if (info.support_VK_KHR_push_descriptor())\n    {\n        descriptorUpdateTemplateCreateInfo.templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_PUSH_DESCRIPTORS_KHR;\n    }\n    else\n    {\n        descriptorUpdateTemplateCreateInfo.templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET_KHR;\n    }\n    // descriptorSetLayout should be ignored if VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_PUSH_DESCRIPTORS_KHR\n    // FIXME HACK WARNING TODO NOTE but crash on radv if set NULL  :(\n    descriptorUpdateTemplateCreateInfo.descriptorSetLayout = descriptorset_layout;\n    descriptorUpdateTemplateCreateInfo.pipelineBindPoint = VK_PIPELINE_BIND_POINT_COMPUTE;\n    descriptorUpdateTemplateCreateInfo.pipelineLayout = pipeline_layout;\n    descriptorUpdateTemplateCreateInfo.set = 0;\n\n    VkResult ret = vkCreateDescriptorUpdateTemplateKHR(d->device, &descriptorUpdateTemplateCreateInfo, 0, descriptor_update_template);\n    if (ret != VK_SUCCESS)\n    {\n        NCNN_LOGE(\"vkCreateDescriptorUpdateTemplateKHR failed %d\", ret);\n        return -1;\n    }\n\n    return 0;\n}\n\nuint32_t VulkanDevice::find_memory_index(uint32_t memory_type_bits, VkFlags required, VkFlags preferred, VkFlags preferred_not) const\n{\n    const VkPhysicalDeviceMemoryProperties& memory_properties = info.physicalDeviceMemoryProperties();\n\n    // first try, find required and with preferred and without preferred_not\n    for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)\n    {\n        bool is_required = (1 << i) & memory_type_bits;\n        if (is_required)\n        {\n            const VkMemoryType& memoryType = memory_properties.memoryTypes[i];\n            if 
((memoryType.propertyFlags & required) == required\n                    && (preferred && (memoryType.propertyFlags & preferred))\n                    && (preferred_not && !(memoryType.propertyFlags & preferred_not)))\n            {\n                return i;\n            }\n        }\n    }\n\n    // second try, find required and with preferred\n    for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)\n    {\n        bool is_required = (1 << i) & memory_type_bits;\n        if (is_required)\n        {\n            const VkMemoryType& memoryType = memory_properties.memoryTypes[i];\n            if ((memoryType.propertyFlags & required) == required\n                    && (preferred && (memoryType.propertyFlags & preferred)))\n            {\n                return i;\n            }\n        }\n    }\n\n    // third try, find required and without preferred_not\n    for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)\n    {\n        bool is_required = (1 << i) & memory_type_bits;\n        if (is_required)\n        {\n            const VkMemoryType& memoryType = memory_properties.memoryTypes[i];\n            if ((memoryType.propertyFlags & required) == required\n                    && (preferred_not && !(memoryType.propertyFlags & preferred_not)))\n            {\n                return i;\n            }\n        }\n    }\n\n    // fourth try, find any required\n    for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)\n    {\n        bool is_required = (1 << i) & memory_type_bits;\n        if (is_required)\n        {\n            const VkMemoryType& memoryType = memory_properties.memoryTypes[i];\n            if ((memoryType.propertyFlags & required) == required)\n            {\n                return i;\n            }\n        }\n    }\n\n    if (info.driver_id() == VK_DRIVER_ID_GOOGLE_SWIFTSHADER)\n    {\n        // buggy swiftshader may set memory property flags in memory_type_bits field\n        for (uint32_t i = 0; i < 
memory_properties.memoryTypeCount; i++)\n        {\n            const VkMemoryType& memoryType = memory_properties.memoryTypes[i];\n            if ((memoryType.propertyFlags & (required | memory_type_bits)) == (required | memory_type_bits))\n            {\n                return i;\n            }\n        }\n    }\n\n    NCNN_LOGE(\"no such memory type %u %u %u %u\", memory_type_bits, required, preferred, preferred_not);\n    return -1;\n}\n\nbool VulkanDevice::is_mappable(uint32_t memory_type_index) const\n{\n    const VkMemoryType& memoryType = info.physicalDeviceMemoryProperties().memoryTypes[memory_type_index];\n\n    return memoryType.propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;\n}\n\nbool VulkanDevice::is_coherent(uint32_t memory_type_index) const\n{\n    const VkMemoryType& memoryType = info.physicalDeviceMemoryProperties().memoryTypes[memory_type_index];\n\n    return memoryType.propertyFlags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;\n}\n\nbool VulkanDevice::is_device_local(uint32_t memory_type_index) const\n{\n    const VkMemoryType& memoryType = info.physicalDeviceMemoryProperties().memoryTypes[memory_type_index];\n\n    return memoryType.propertyFlags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;\n}\n\nVkQueue VulkanDevice::acquire_queue(uint32_t queue_family_index) const\n{\n    if (queue_family_index != info.compute_queue_family_index() && queue_family_index != info.transfer_queue_family_index())\n    {\n        NCNN_LOGE(\"invalid queue_family_index %u\", queue_family_index);\n        return 0;\n    }\n\n    Mutex& queue_lock = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_lock : d->transfer_queue_lock;\n\n    queue_lock.lock();\n\n    ConditionVariable& queue_condition = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_condition : d->transfer_queue_condition;\n\n    int& free_queue_count = queue_family_index == info.compute_queue_family_index() ? 
d->free_compute_queue_count : d->free_transfer_queue_count;\n\n    while (free_queue_count == 0)\n    {\n        // no free queues, wait for reclaims from other threads\n        queue_condition.wait(queue_lock);\n    }\n\n    std::vector<VkQueue>& queues = queue_family_index == info.compute_queue_family_index() ? d->compute_queues : d->transfer_queues;\n\n    VkQueue queue = 0;\n    for (size_t i = 0; i < queues.size(); i++)\n    {\n        if (queues[i])\n        {\n            queue = queues[i];\n            queues[i] = 0;\n            break;\n        }\n    }\n\n    if (!queue)\n    {\n        NCNN_LOGE(\"FATAL ERROR! out of hardware queue %u\", queue_family_index);\n    }\n\n    free_queue_count -= 1;\n\n    queue_lock.unlock();\n\n    queue_condition.signal();\n\n    return queue;\n}\n\nvoid VulkanDevice::reclaim_queue(uint32_t queue_family_index, VkQueue queue) const\n{\n    if (queue_family_index != info.compute_queue_family_index() && queue_family_index != info.transfer_queue_family_index())\n    {\n        NCNN_LOGE(\"invalid queue_family_index %u\", queue_family_index);\n        return;\n    }\n\n    Mutex& queue_lock = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_lock : d->transfer_queue_lock;\n\n    queue_lock.lock();\n\n    ConditionVariable& queue_condition = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_condition : d->transfer_queue_condition;\n\n    int& free_queue_count = queue_family_index == info.compute_queue_family_index() ? d->free_compute_queue_count : d->free_transfer_queue_count;\n\n    std::vector<VkQueue>& queues = queue_family_index == info.compute_queue_family_index() ? d->compute_queues : d->transfer_queues;\n\n    size_t i = 0;\n    for (; i < queues.size(); i++)\n    {\n        if (!queues[i])\n        {\n            queues[i] = queue;\n            break;\n        }\n    }\n\n    if (i == queues.size())\n    {\n        NCNN_LOGE(\"FATAL ERROR! 
reclaim_queue get wild queue %u %p\", queue_family_index, queue);\n    }\n\n    free_queue_count += 1;\n\n    queue_lock.unlock();\n\n    queue_condition.signal();\n}\n\nVkAllocator* VulkanDevice::acquire_blob_allocator() const\n{\n    MutexLockGuard lock(d->blob_allocator_lock);\n\n    for (int i = 0; i < (int)d->blob_allocators.size(); i++)\n    {\n        VkAllocator* allocator = d->blob_allocators[i];\n        if (allocator)\n        {\n            d->blob_allocators[i] = 0;\n            return allocator;\n        }\n    }\n\n    // pre-allocated allocator exhausted, create new\n    VkAllocator* allocator = new VkBlobAllocator(this);\n    d->blob_allocators.push_back(allocator);\n    d->blob_allocators[d->blob_allocators.size() - 1] = 0;\n    return allocator;\n}\n\nvoid VulkanDevice::reclaim_blob_allocator(VkAllocator* allocator) const\n{\n    MutexLockGuard lock(d->blob_allocator_lock);\n\n    for (int i = 0; i < (int)d->blob_allocators.size(); i++)\n    {\n        if (!d->blob_allocators[i])\n        {\n            d->blob_allocators[i] = allocator;\n            return;\n        }\n    }\n\n    NCNN_LOGE(\"FATAL ERROR! 
reclaim_blob_allocator get wild allocator %p\", allocator);\n}\n\nVkAllocator* VulkanDevice::acquire_staging_allocator() const\n{\n    MutexLockGuard lock(d->staging_allocator_lock);\n\n    for (int i = 0; i < (int)d->staging_allocators.size(); i++)\n    {\n        VkAllocator* allocator = d->staging_allocators[i];\n        if (allocator)\n        {\n            d->staging_allocators[i] = 0;\n            return allocator;\n        }\n    }\n\n    // pre-allocated allocator exhausted, create new\n    VkAllocator* allocator = new VkStagingAllocator(this);\n    d->staging_allocators.push_back(allocator);\n    d->staging_allocators[d->staging_allocators.size() - 1] = 0;\n    return allocator;\n}\n\nvoid VulkanDevice::reclaim_staging_allocator(VkAllocator* allocator) const\n{\n    MutexLockGuard lock(d->staging_allocator_lock);\n\n    for (int i = 0; i < (int)d->staging_allocators.size(); i++)\n    {\n        if (!d->staging_allocators[i])\n        {\n            d->staging_allocators[i] = allocator;\n            return;\n        }\n    }\n\n    NCNN_LOGE(\"FATAL ERROR! 
reclaim_staging_allocator get wild allocator %p\", allocator);\n}\n\nconst VkSampler* VulkanDevice::immutable_texelfetch_sampler() const\n{\n    return &d->texelfetch_sampler;\n}\n\nVkMat VulkanDevice::get_dummy_buffer() const\n{\n    return d->dummy_buffer;\n}\n\nVkImageMat VulkanDevice::get_dummy_image() const\n{\n    return d->dummy_image;\n}\n\nVkImageMat VulkanDevice::get_dummy_image_readonly() const\n{\n#if __APPLE__\n    if (info.type() != 0)\n        return d->dummy_image;\n#endif\n    return d->dummy_image_readonly;\n}\n\nconst PipelineCache* VulkanDevice::get_pipeline_cache() const\n{\n    return d->pipeline_cache;\n}\n\nbool VulkanDevice::shape_support_image_storage(const Mat& shape) const\n{\n    int dims = shape.dims;\n    int width = shape.w;\n    int height = shape.h;\n    int depth = shape.c;\n    int elempack = shape.elempack;\n\n    // large elempack spills on image w\n    if (elempack == 8) width *= 2;\n    if (elempack == 16) width *= 4;\n    if (elempack == 32) width *= 8;\n    if (elempack == 64) width *= 16;\n\n    if (dims == 1)\n    {\n        if (width > (int)info.max_image_dimension_1d())\n        {\n            return false;\n        }\n    }\n    else if (dims == 2)\n    {\n        if (width > (int)info.max_image_dimension_2d() || height > (int)info.max_image_dimension_2d())\n        {\n            return false;\n        }\n    }\n    else // if (dims == 3)\n    {\n        if (width > (int)info.max_image_dimension_3d() || height > (int)info.max_image_dimension_3d() || depth > (int)info.max_image_dimension_3d())\n        {\n            return false;\n        }\n    }\n\n    return true;\n}\n\nuint32_t VulkanDevice::get_heap_budget() const\n{\n    const VkPhysicalDeviceMemoryProperties& memory_properties = info.physicalDeviceMemoryProperties();\n\n    uint32_t buffer_memory_type_index = d->dummy_allocator->buffer_memory_type_index;\n    uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;\n\n    
if (!info.support_VK_EXT_memory_budget())\n    {\n        //         NCNN_LOGE(\"heap budget from assumption\\n\");\n        uint32_t device_local_heap_size = memory_properties.memoryHeaps[buffer_heap_index].size / 1024 / 1024;\n\n        // we usually cannot use all heap\n        // 70% for 4G+\n        // 50% for 4G-\n        return device_local_heap_size >= 4000 ? device_local_heap_size * 0.7 : device_local_heap_size * 0.5;\n    }\n\n    VkPhysicalDeviceMemoryBudgetPropertiesEXT memoryBudgetProperties;\n    memoryBudgetProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;\n    memoryBudgetProperties.pNext = 0;\n\n    VkPhysicalDeviceMemoryProperties2KHR memoryProperties;\n    memoryProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2_KHR;\n    memoryProperties.pNext = &memoryBudgetProperties;\n\n    vkGetPhysicalDeviceMemoryProperties2KHR(info.physicalDevice(), &memoryProperties);\n\n    return memoryBudgetProperties.heapBudget[buffer_heap_index] / 1024 / 1024;\n}\n\nvoid VulkanDevice::convert_packing(const VkMat& src, VkMat& dst, int dst_elempack, VkCompute& cmd, const Option& opt) const\n{\n    convert_packing(src, dst, dst_elempack, 0, cmd, opt);\n}\n\nvoid VulkanDevice::convert_packing(const VkMat& src, VkMat& dst, int dst_elempack, int cast_type_to, VkCompute& cmd, const Option& opt) const\n{\n    int packing_type_to_index = dst_elempack == 1 ? 0 : dst_elempack == 4 ? 1 : 2;\n\n    int cast_type_from_index;\n    if (src.elembits() == 32)\n    {\n        cast_type_from_index = 0;\n    }\n    else if (src.elembits() == 16)\n    {\n        if (opt.use_bf16_storage || opt.use_bf16_packed)\n            cast_type_from_index = 4;\n        else\n            cast_type_from_index = 1;\n    }\n    else // if (src.elembits() == 8)\n    {\n        cast_type_from_index = 3;\n    }\n\n    int cast_type_to_index = cast_type_to ? 
cast_type_to - 1 : cast_type_from_index;\n\n    // NCNN_LOGE(\"convert_packing b2b %d %d %d\", cast_type_from_index, cast_type_to_index, packing_type_to_index);\n\n    if ((cast_type_from_index == 0 || cast_type_from_index == 1 || cast_type_from_index == 4) && (cast_type_to_index == 2 || cast_type_to_index == 3))\n    {\n        NCNN_LOGE(\"convert_packing from fp32/fp16/bf16 to int32/int8 is not supported\");\n        return;\n    }\n    if ((cast_type_from_index == 2 || cast_type_from_index == 3) && (cast_type_to_index == 0 || cast_type_to_index == 1 || cast_type_to_index == 4))\n    {\n        NCNN_LOGE(\"convert_packing from int32/int8 to fp32/fp16/bf16 is not supported\");\n        return;\n    }\n    if (cast_type_from_index == 1 && cast_type_to_index == 4)\n    {\n        NCNN_LOGE(\"convert_packing from fp16 to bf16 is not supported\");\n        return;\n    }\n    if (cast_type_from_index == 4 && cast_type_to_index == 1)\n    {\n        NCNN_LOGE(\"convert_packing from bf16 to fp16 is not supported\");\n        return;\n    }\n\n    Option opt2 = opt;\n    opt2.use_fp16_packed = (cast_type_from_index == 1 || cast_type_to_index == 1);\n    opt2.use_fp16_storage = (cast_type_from_index == 1 || cast_type_to_index == 1) && info.support_fp16_storage();\n    opt2.use_int8_packed = (cast_type_from_index == 3 || cast_type_to_index == 3);\n    opt2.use_int8_storage = (cast_type_from_index == 3 || cast_type_to_index == 3) && info.support_int8_storage();\n    opt2.use_bf16_packed = (cast_type_from_index == 4 || cast_type_to_index == 4);\n    opt2.use_bf16_storage = (cast_type_from_index == 4 || cast_type_to_index == 4) && info.support_bf16_storage();\n\n    const ncnn::Layer* uop = d->get_utility_operator(cast_type_from_index, cast_type_to_index, packing_type_to_index);\n    uop->forward(src, dst, cmd, opt2);\n}\n\nint VulkanDevice::init_device_extension()\n{\n    if (info.support_VK_KHR_bind_memory2())\n    {\n        vkBindBufferMemory2KHR = 
(PFN_vkBindBufferMemory2KHR)vkGetDeviceProcAddr(d->device, \"vkBindBufferMemory2KHR\");\n        vkBindImageMemory2KHR = (PFN_vkBindImageMemory2KHR)vkGetDeviceProcAddr(d->device, \"vkBindImageMemory2KHR\");\n    }\n\n    if (info.support_VK_KHR_buffer_device_address())\n    {\n        vkGetBufferDeviceAddressKHR = (PFN_vkGetBufferDeviceAddressKHR)vkGetDeviceProcAddr(d->device, \"vkGetBufferDeviceAddressKHR\");\n        vkGetBufferOpaqueCaptureAddressKHR = (PFN_vkGetBufferOpaqueCaptureAddressKHR)vkGetDeviceProcAddr(d->device, \"vkGetBufferOpaqueCaptureAddressKHR\");\n        vkGetDeviceMemoryOpaqueCaptureAddressKHR = (PFN_vkGetDeviceMemoryOpaqueCaptureAddressKHR)vkGetDeviceProcAddr(d->device, \"vkGetDeviceMemoryOpaqueCaptureAddressKHR\");\n    }\n\n    if (info.support_VK_KHR_descriptor_update_template())\n    {\n        vkCreateDescriptorUpdateTemplateKHR = (PFN_vkCreateDescriptorUpdateTemplateKHR)vkGetDeviceProcAddr(d->device, \"vkCreateDescriptorUpdateTemplateKHR\");\n        vkDestroyDescriptorUpdateTemplateKHR = (PFN_vkDestroyDescriptorUpdateTemplateKHR)vkGetDeviceProcAddr(d->device, \"vkDestroyDescriptorUpdateTemplateKHR\");\n        vkUpdateDescriptorSetWithTemplateKHR = (PFN_vkUpdateDescriptorSetWithTemplateKHR)vkGetDeviceProcAddr(d->device, \"vkUpdateDescriptorSetWithTemplateKHR\");\n    }\n\n    if (info.support_VK_KHR_get_memory_requirements2())\n    {\n        vkGetImageMemoryRequirements2KHR = (PFN_vkGetImageMemoryRequirements2KHR)vkGetDeviceProcAddr(d->device, \"vkGetImageMemoryRequirements2KHR\");\n        vkGetBufferMemoryRequirements2KHR = (PFN_vkGetBufferMemoryRequirements2KHR)vkGetDeviceProcAddr(d->device, \"vkGetBufferMemoryRequirements2KHR\");\n    }\n\n    if (info.support_VK_KHR_maintenance1())\n    {\n        vkTrimCommandPoolKHR = (PFN_vkTrimCommandPoolKHR)vkGetDeviceProcAddr(d->device, \"vkTrimCommandPoolKHR\");\n    }\n\n    if (info.support_VK_KHR_maintenance3())\n    {\n        vkGetDescriptorSetLayoutSupportKHR = 
(PFN_vkGetDescriptorSetLayoutSupportKHR)vkGetDeviceProcAddr(d->device, \"vkGetDescriptorSetLayoutSupportKHR\");\n    }\n\n    if (info.support_VK_KHR_push_descriptor())\n    {\n        if (info.support_VK_KHR_descriptor_update_template())\n        {\n            vkCmdPushDescriptorSetWithTemplateKHR = (PFN_vkCmdPushDescriptorSetWithTemplateKHR)vkGetDeviceProcAddr(d->device, \"vkCmdPushDescriptorSetWithTemplateKHR\");\n        }\n\n        vkCmdPushDescriptorSetKHR = (PFN_vkCmdPushDescriptorSetKHR)vkGetDeviceProcAddr(d->device, \"vkCmdPushDescriptorSetKHR\");\n    }\n\n    if (info.support_VK_KHR_sampler_ycbcr_conversion())\n    {\n        vkCreateSamplerYcbcrConversionKHR = (PFN_vkCreateSamplerYcbcrConversionKHR)vkGetDeviceProcAddr(d->device, \"vkCreateSamplerYcbcrConversionKHR\");\n        vkDestroySamplerYcbcrConversionKHR = (PFN_vkDestroySamplerYcbcrConversionKHR)vkGetDeviceProcAddr(d->device, \"vkDestroySamplerYcbcrConversionKHR\");\n    }\n\n    if (info.support_VK_KHR_swapchain())\n    {\n        vkCreateSwapchainKHR = (PFN_vkCreateSwapchainKHR)vkGetDeviceProcAddr(d->device, \"vkCreateSwapchainKHR\");\n        vkDestroySwapchainKHR = (PFN_vkDestroySwapchainKHR)vkGetDeviceProcAddr(d->device, \"vkDestroySwapchainKHR\");\n        vkGetSwapchainImagesKHR = (PFN_vkGetSwapchainImagesKHR)vkGetDeviceProcAddr(d->device, \"vkGetSwapchainImagesKHR\");\n        vkAcquireNextImageKHR = (PFN_vkAcquireNextImageKHR)vkGetDeviceProcAddr(d->device, \"vkAcquireNextImageKHR\");\n        vkQueuePresentKHR = (PFN_vkQueuePresentKHR)vkGetDeviceProcAddr(d->device, \"vkQueuePresentKHR\");\n    }\n\n    if (info.support_VK_EXT_buffer_device_address())\n    {\n        vkGetBufferDeviceAddressEXT = (PFN_vkGetBufferDeviceAddressEXT)vkGetDeviceProcAddr(d->device, \"vkGetBufferDeviceAddressEXT\");\n    }\n\n    if (info.support_VK_EXT_external_memory_host())\n    {\n        vkGetMemoryHostPointerPropertiesEXT = (PFN_vkGetMemoryHostPointerPropertiesEXT)vkGetDeviceProcAddr(d->device, 
\"vkGetMemoryHostPointerPropertiesEXT\");\n    }\n\n#if __ANDROID_API__ >= 26\n    if (info.support_VK_ANDROID_external_memory_android_hardware_buffer())\n    {\n        vkGetAndroidHardwareBufferPropertiesANDROID = (PFN_vkGetAndroidHardwareBufferPropertiesANDROID)vkGetDeviceProcAddr(d->device, \"vkGetAndroidHardwareBufferPropertiesANDROID\");\n        vkGetMemoryAndroidHardwareBufferANDROID = (PFN_vkGetMemoryAndroidHardwareBufferANDROID)vkGetDeviceProcAddr(d->device, \"vkGetMemoryAndroidHardwareBufferANDROID\");\n    }\n#endif // __ANDROID_API__ >= 26\n\n    if (info.support_VK_NV_cooperative_vector())\n    {\n        vkCmdConvertCooperativeVectorMatrixNV = (PFN_vkCmdConvertCooperativeVectorMatrixNV)vkGetDeviceProcAddr(d->device, \"vkCmdConvertCooperativeVectorMatrixNV\");\n        vkConvertCooperativeVectorMatrixNV = (PFN_vkConvertCooperativeVectorMatrixNV)vkGetDeviceProcAddr(d->device, \"vkConvertCooperativeVectorMatrixNV\");\n    }\n\n    return 0;\n}\n\nVulkanDevice* get_gpu_device(int device_index)\n{\n    try_create_gpu_instance();\n\n    if (device_index < 0 || device_index >= g_gpu_count)\n        return 0;\n\n    MutexLockGuard lock(g_default_vkdev_lock);\n\n    if (!g_default_vkdev[device_index])\n        g_default_vkdev[device_index] = new VulkanDevice(device_index);\n\n    return g_default_vkdev[device_index];\n}\n\nstatic TBuiltInResource get_default_TBuiltInResource()\n{\n    TBuiltInResource resource;\n\n    resource.maxLights = 32;\n    resource.maxClipPlanes = 6;\n    resource.maxTextureUnits = 32;\n    resource.maxTextureCoords = 32;\n    resource.maxVertexAttribs = 64;\n    resource.maxVertexUniformComponents = 4096;\n    resource.maxVaryingFloats = 64;\n    resource.maxVertexTextureImageUnits = 32;\n    resource.maxCombinedTextureImageUnits = 80;\n    resource.maxTextureImageUnits = 32;\n    resource.maxFragmentUniformComponents = 4096;\n    resource.maxDrawBuffers = 32;\n    resource.maxVertexUniformVectors = 128;\n    
resource.maxVaryingVectors = 8;\n    resource.maxFragmentUniformVectors = 16;\n    resource.maxVertexOutputVectors = 16;\n    resource.maxFragmentInputVectors = 15;\n    resource.minProgramTexelOffset = -8;\n    resource.maxProgramTexelOffset = 7;\n    resource.maxClipDistances = 8;\n    resource.maxComputeWorkGroupCountX = 65535;\n    resource.maxComputeWorkGroupCountY = 65535;\n    resource.maxComputeWorkGroupCountZ = 65535;\n    resource.maxComputeWorkGroupSizeX = 1024;\n    resource.maxComputeWorkGroupSizeY = 1024;\n    resource.maxComputeWorkGroupSizeZ = 64;\n    resource.maxComputeUniformComponents = 1024;\n    resource.maxComputeTextureImageUnits = 16;\n    resource.maxComputeImageUniforms = 8;\n    resource.maxComputeAtomicCounters = 8;\n    resource.maxComputeAtomicCounterBuffers = 1;\n    resource.maxVaryingComponents = 60;\n    resource.maxVertexOutputComponents = 64;\n    resource.maxGeometryInputComponents = 64;\n    resource.maxGeometryOutputComponents = 128;\n    resource.maxFragmentInputComponents = 128;\n    resource.maxImageUnits = 8;\n    resource.maxCombinedImageUnitsAndFragmentOutputs = 8;\n    resource.maxCombinedShaderOutputResources = 8;\n    resource.maxImageSamples = 0;\n    resource.maxVertexImageUniforms = 0;\n    resource.maxTessControlImageUniforms = 0;\n    resource.maxTessEvaluationImageUniforms = 0;\n    resource.maxGeometryImageUniforms = 0;\n    resource.maxFragmentImageUniforms = 8;\n    resource.maxCombinedImageUniforms = 8;\n    resource.maxGeometryTextureImageUnits = 16;\n    resource.maxGeometryOutputVertices = 256;\n    resource.maxGeometryTotalOutputComponents = 1024;\n    resource.maxGeometryUniformComponents = 1024;\n    resource.maxGeometryVaryingComponents = 64;\n    resource.maxTessControlInputComponents = 128;\n    resource.maxTessControlOutputComponents = 128;\n    resource.maxTessControlTextureImageUnits = 16;\n    resource.maxTessControlUniformComponents = 1024;\n    resource.maxTessControlTotalOutputComponents = 
4096;\n    resource.maxTessEvaluationInputComponents = 128;\n    resource.maxTessEvaluationOutputComponents = 128;\n    resource.maxTessEvaluationTextureImageUnits = 16;\n    resource.maxTessEvaluationUniformComponents = 1024;\n    resource.maxTessPatchComponents = 120;\n    resource.maxPatchVertices = 32;\n    resource.maxTessGenLevel = 64;\n    resource.maxViewports = 16;\n    resource.maxVertexAtomicCounters = 0;\n    resource.maxTessControlAtomicCounters = 0;\n    resource.maxTessEvaluationAtomicCounters = 0;\n    resource.maxGeometryAtomicCounters = 0;\n    resource.maxFragmentAtomicCounters = 8;\n    resource.maxCombinedAtomicCounters = 8;\n    resource.maxAtomicCounterBindings = 1;\n    resource.maxVertexAtomicCounterBuffers = 0;\n    resource.maxTessControlAtomicCounterBuffers = 0;\n    resource.maxTessEvaluationAtomicCounterBuffers = 0;\n    resource.maxGeometryAtomicCounterBuffers = 0;\n    resource.maxFragmentAtomicCounterBuffers = 1;\n    resource.maxCombinedAtomicCounterBuffers = 1;\n    resource.maxAtomicCounterBufferSize = 16384;\n    resource.maxTransformFeedbackBuffers = 4;\n    resource.maxTransformFeedbackInterleavedComponents = 64;\n    resource.maxCullDistances = 8;\n    resource.maxCombinedClipAndCullDistances = 8;\n    resource.maxSamples = 4;\n    resource.maxMeshOutputVerticesNV = 256;\n    resource.maxMeshOutputPrimitivesNV = 512;\n    resource.maxMeshWorkGroupSizeX_NV = 32;\n    resource.maxMeshWorkGroupSizeY_NV = 1;\n    resource.maxMeshWorkGroupSizeZ_NV = 1;\n    resource.maxTaskWorkGroupSizeX_NV = 32;\n    resource.maxTaskWorkGroupSizeY_NV = 1;\n    resource.maxTaskWorkGroupSizeZ_NV = 1;\n    resource.maxMeshViewCountNV = 4;\n\n    // TODO compile-time glslang version check\n    // resource.maxDualSourceDrawBuffersEXT = 1;\n\n    resource.limits.nonInductiveForLoops = 1;\n    resource.limits.whileLoops = 1;\n    resource.limits.doWhileLoops = 1;\n    resource.limits.generalUniformIndexing = 1;\n    
resource.limits.generalAttributeMatrixVectorIndexing = 1;\n    resource.limits.generalVaryingIndexing = 1;\n    resource.limits.generalSamplerIndexing = 1;\n    resource.limits.generalVariableIndexing = 1;\n    resource.limits.generalConstantMatrixVectorIndexing = 1;\n\n    return resource;\n}\n\nclass VulkanShaderIncluder : public glslang::TShader::Includer\n{\npublic:\n    virtual glslang::TShader::Includer::IncludeResult* includeLocal(const char* headerName, const char* /*includerName*/, size_t /*inclusionDepth*/)\n    {\n        if (strcmp(headerName, \"vulkan_activation.comp\") == 0)\n        {\n            const char* const headerData = vulkan_activation_comp_data;\n            const size_t headerLength = sizeof(vulkan_activation_comp_data);\n            glslang::TShader::Includer::IncludeResult* r = new glslang::TShader::Includer::IncludeResult(headerName, headerData, headerLength, 0);\n            return r;\n        }\n\n        return 0;\n    }\n\n    virtual void releaseInclude(glslang::TShader::Includer::IncludeResult* r)\n    {\n        delete r;\n    }\n};\n\nclass DefinitionCollector\n{\npublic:\n    template<typename T>\n    void append(const char* key, T def)\n    {\n        definitions.push_back(std::make_pair(key, def));\n    }\n\npublic:\n    struct typed_value\n    {\n        typed_value(const char* _s)\n            : type(0), s(_s)\n        {\n        }\n        typed_value(uint8_t _u8)\n            : type(1), u8(_u8)\n        {\n        }\n        typed_value(uint32_t _u32)\n            : type(2), u32(_u32)\n        {\n        }\n        typed_value(int32_t _i32)\n            : type(3), i32(_i32)\n        {\n        }\n        typed_value(uint64_t _u64)\n            : type(4), u64(_u64)\n        {\n        }\n        typed_value(float _f32)\n            : type(5), f32(_f32)\n        {\n        }\n\n        int type;\n        union\n        {\n            const char* s;\n            uint8_t u8;\n            uint32_t u32;\n            int32_t 
i32;\n            uint64_t u64;\n            float f32;\n        };\n    };\n\n    std::vector<std::pair<const char*, typed_value> > definitions;\n};\n\nint compile_spirv_module(const char* comp_string, const Option& opt, std::vector<uint32_t>& spirv)\n{\n    // -1 for omitting the tail '\\0'\n    int length = strlen(comp_string) - 1;\n    return compile_spirv_module(comp_string, length, opt, spirv);\n}\n\nint compile_spirv_module(const char* comp_data, int comp_data_size, const Option& opt, std::vector<uint32_t>& spirv)\n{\n    DefinitionCollector custom_defines;\n    DefinitionCollector device_defines;\n\n    int device_index = opt.vulkan_device_index;\n    if (device_index < 0 || device_index >= get_gpu_count())\n        device_index = get_default_gpu_index();\n\n    const GpuInfo& info = get_gpu_info(device_index);\n    const bool support_fp16_storage = info.support_fp16_storage();\n    const bool support_fp16_uniform = info.support_fp16_uniform();\n\n    if (opt.use_bf16_storage)\n    {\n        custom_defines.append(\"sfp\", \"bfloat16_t\");\n        custom_defines.append(\"sfpvec2\", \"bf16vec2\");\n        custom_defines.append(\"sfpvec4\", \"bf16vec4\");\n\n        // define pack and unpack macro for bf16s\n        custom_defines.append(\"unpackBFloat2x16(v)\", \"vec2(uintBitsToBFloat16EXT(unpackUint2x16(v)))\");\n        custom_defines.append(\"packBFloat2x16(v)\", \"packUint2x16(bfloat16BitsToUintEXT(bf16vec2(v)))\");\n    }\n    else if (opt.use_bf16_packed)\n    {\n        if (support_fp16_storage)\n        {\n            custom_defines.append(\"sfp\", \"uint16_t\");\n        }\n        else\n        {\n            custom_defines.append(\"sfp\", \"uint\");\n        }\n        custom_defines.append(\"sfpvec2\", \"uint\");\n        custom_defines.append(\"sfpvec4\", \"uvec2\");\n\n        // define pack and unpack macro for bf16p\n        custom_defines.append(\"unpackBFloat2x16(v)\", \"vec2(uintBitsToFloat(v<<16),uintBitsToFloat(v&0xffff0000u))\");\n    
    custom_defines.append(\"packBFloat2x16(v)\", \"uint((floatBitsToUint(v.x)>>16)|(floatBitsToUint(v.y)&0xffff0000u))\");\n    }\n    else if (opt.use_fp16_storage)\n    {\n        custom_defines.append(\"sfp\", \"float16_t\");\n        custom_defines.append(\"sfpvec2\", \"f16vec2\");\n        custom_defines.append(\"sfpvec4\", \"f16vec4\");\n\n        if (opt.use_fp16_arithmetic)\n        {\n            custom_defines.append(\"sfpmat4\", \"f16mat4\");\n        }\n    }\n    else if (opt.use_fp16_packed)\n    {\n        custom_defines.append(\"sfp\", \"uint\");\n        custom_defines.append(\"sfpvec2\", \"uint\");\n        custom_defines.append(\"sfpvec4\", \"uvec2\");\n    }\n    else\n    {\n        custom_defines.append(\"sfp\", \"float\");\n        custom_defines.append(\"sfpvec2\", \"vec2\");\n        custom_defines.append(\"sfpvec4\", \"vec4\");\n        custom_defines.append(\"sfpmat4\", \"mat4\");\n    }\n\n    if (opt.use_bf16_storage || opt.use_bf16_packed)\n    {\n        // bf16 conflicts with fp16a\n        custom_defines.append(\"afp\", \"float\");\n        custom_defines.append(\"afpvec2\", \"vec2\");\n        custom_defines.append(\"afpvec4\", \"vec4\");\n        custom_defines.append(\"afpmat4\", \"mat4\");\n    }\n    else if (opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"afp\", \"float16_t\");\n        custom_defines.append(\"afpvec2\", \"f16vec2\");\n        custom_defines.append(\"afpvec4\", \"f16vec4\");\n        custom_defines.append(\"afpmat4\", \"f16mat4\");\n    }\n    else\n    {\n        custom_defines.append(\"afp\", \"float\");\n        custom_defines.append(\"afpvec2\", \"vec2\");\n        custom_defines.append(\"afpvec4\", \"vec4\");\n        custom_defines.append(\"afpmat4\", \"mat4\");\n    }\n\n    if (opt.use_bf16_storage)\n    {\n        // bf16s implies 16bit uniform\n        custom_defines.append(\"lfp\", \"bfloat16_t\");\n        custom_defines.append(\"lfpvec4\", \"bf16vec4\");\n    }\n    else if 
(opt.use_bf16_packed)\n    {\n        if (support_fp16_uniform)\n        {\n            custom_defines.append(\"lfp\", \"uint16_t\");\n        }\n        else\n        {\n            custom_defines.append(\"lfp\", \"float\");\n        }\n        custom_defines.append(\"lfpvec4\", \"uvec2\");\n    }\n    else if (opt.use_fp16_storage && opt.use_fp16_uniform && opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"lfp\", \"float16_t\");\n        custom_defines.append(\"lfpvec4\", \"f16vec4\");\n    }\n    else if (opt.use_fp16_storage && opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"lfp\", \"float\");\n        custom_defines.append(\"lfpvec4\", \"uint64_t\");\n    }\n    else if (opt.use_fp16_storage || opt.use_fp16_packed)\n    {\n        custom_defines.append(\"lfp\", \"float\");\n        custom_defines.append(\"lfpvec4\", \"uvec2\");\n    }\n    else\n    {\n        custom_defines.append(\"lfp\", \"float\");\n        custom_defines.append(\"lfpvec4\", \"vec4\");\n    }\n\n    if (opt.use_bf16_storage)\n    {\n        custom_defines.append(\"buffer_sm1(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"buf[i]\");\n\n        custom_defines.append(\"lfp2afp(v)\", \"float(v)\");\n        custom_defines.append(\"afp2lfp(v)\", \"bfloat16_t(v)\");\n        custom_defines.append(\"lfp2afpvec4(v)\", \"vec4(v)\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"bf16vec4(v)\");\n    }\n    else if (opt.use_bf16_packed)\n    {\n        if (support_fp16_uniform)\n        {\n            custom_defines.append(\"buffer_sm1(buf,i)\", \"buf[i]\");\n        }\n        else if (support_fp16_storage)\n        {\n            custom_defines.append(\"buffer_sm1(buf,i)\", \"uintBitsToFloat(uint(buf[i])<<16)\");\n        }\n        else\n        {\n            custom_defines.append(\"buffer_sm1(buf,i)\", \"unpackBFloat2x16(buf[(i)/2])[(i)%2]\");\n        }\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"buf[i]\");\n\n   
     if (support_fp16_uniform)\n        {\n            custom_defines.append(\"lfp2afp(v)\", \"uintBitsToFloat(uint(v)<<16)\");\n            custom_defines.append(\"afp2lfp(v)\", \"uint16_t(floatBitsToUint(v)>>16)\");\n        }\n        else\n        {\n            custom_defines.append(\"lfp2afp(v)\", \"v\");\n            custom_defines.append(\"afp2lfp(v)\", \"v\");\n        }\n        custom_defines.append(\"lfp2afpvec4(v)\", \"vec4(unpackBFloat2x16(v.x),unpackBFloat2x16(v.y))\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"uvec2(packBFloat2x16(v.rg),packBFloat2x16(v.ba))\");\n    }\n    else if (opt.use_fp16_storage && opt.use_fp16_uniform && opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"buffer_sm1(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"buf[i]\");\n\n        custom_defines.append(\"lfp2afp(v)\", \"v\");\n        custom_defines.append(\"afp2lfp(v)\", \"v\");\n        custom_defines.append(\"lfp2afpvec4(v)\", \"v\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"v\");\n    }\n    else if (opt.use_fp16_storage && opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"buffer_sm1(buf,i)\", \"float(buf[i])\");\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"pack64(halfBitsToUint16(buf[i]))\");\n\n        custom_defines.append(\"lfp2afp(v)\", \"float16_t(v)\");\n        custom_defines.append(\"afp2lfp(v)\", \"float(v)\");\n        custom_defines.append(\"lfp2afpvec4(v)\", \"uint16BitsToHalf(unpack16(v))\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"pack64(halfBitsToUint16(v))\");\n    }\n    else if (opt.use_fp16_packed && opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"buffer_sm1(buf,i)\", \"unpackHalf2x16(buf[(i)/2])[(i)%2]\");\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"buf[i]\");\n\n        custom_defines.append(\"lfp2afp(v)\", \"float16_t(v)\");\n        custom_defines.append(\"afp2lfp(v)\", \"float(v)\");\n        
custom_defines.append(\"lfp2afpvec4(v)\", \"f16vec4(unpackFloat2x16(v.x),unpackFloat2x16(v.y))\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"uvec2(packFloat2x16(v.rg),packFloat2x16(v.ba))\");\n    }\n    else if (opt.use_fp16_storage)\n    {\n        custom_defines.append(\"buffer_sm1(buf,i)\", \"float(buf[i])\");\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"uvec2(packHalf2x16(vec4(buf[i]).rg),packHalf2x16(vec4(buf[i]).ba))\");\n\n        custom_defines.append(\"lfp2afp(v)\", \"v\");\n        custom_defines.append(\"afp2lfp(v)\", \"float(v)\");\n        custom_defines.append(\"lfp2afpvec4(v)\", \"vec4(unpackHalf2x16(v.x),unpackHalf2x16(v.y))\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"uvec2(packHalf2x16(v.rg),packHalf2x16(v.ba))\");\n    }\n    else if (opt.use_fp16_packed)\n    {\n        custom_defines.append(\"buffer_sm1(buf,i)\", \"unpackHalf2x16(buf[(i)/2])[(i)%2]\");\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"buf[i]\");\n\n        custom_defines.append(\"lfp2afp(v)\", \"v\");\n        custom_defines.append(\"afp2lfp(v)\", \"v\");\n        custom_defines.append(\"lfp2afpvec4(v)\", \"vec4(unpackHalf2x16(v.x),unpackHalf2x16(v.y))\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"uvec2(packHalf2x16(v.rg),packHalf2x16(v.ba))\");\n    }\n    else\n    {\n        custom_defines.append(\"buffer_sm1(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_sm4(buf,i)\", \"buf[i]\");\n\n        custom_defines.append(\"lfp2afp(v)\", \"v\");\n        custom_defines.append(\"afp2lfp(v)\", \"v\");\n        custom_defines.append(\"lfp2afpvec4(v)\", \"v\");\n        custom_defines.append(\"afp2lfpvec4(v)\", \"v\");\n    }\n\n    if (opt.use_bf16_storage)\n    {\n        custom_defines.append(\"buffer_ld1(buf,i)\", \"float(buf[i])\");\n        custom_defines.append(\"buffer_st1(buf,i,v)\", \"{buf[i]=bfloat16_t(v);}\");\n        custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        
custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{buf[i].r=sbuf[si4.r];buf[i].g=sbuf[si4.g];buf[i].b=sbuf[si4.b];buf[i].a=sbuf[si4.a];}\");\n        custom_defines.append(\"buffer_ld2(buf,i)\", \"vec2(buf[i])\");\n        custom_defines.append(\"buffer_st2(buf,i,v)\", \"{buf[i]=bf16vec2(v);}\");\n        custom_defines.append(\"buffer_cp2(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_ld4(buf,i)\", \"vec4(buf[i])\");\n        custom_defines.append(\"buffer_st4(buf,i,v)\", \"{buf[i]=bf16vec4(v);}\");\n        custom_defines.append(\"buffer_cp4(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{buf[i4.r]=sbuf[si].r;buf[i4.g]=sbuf[si].g;buf[i4.b]=sbuf[si].b;buf[i4.a]=sbuf[si].a;}\");\n    }\n    else if (opt.use_bf16_packed)\n    {\n        if (support_fp16_storage)\n        {\n            custom_defines.append(\"buffer_ld1(buf,i)\", \"uintBitsToFloat(uint(buf[i])<<16)\");\n            custom_defines.append(\"buffer_st1(buf,i,v)\", \"{buf[i]=uint16_t(floatBitsToUint(v)>>16);}\");\n            custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n\n            custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{buf[i]=uvec2(pack32(u16vec2(sbuf[si4.r],sbuf[si4.g])),pack32(u16vec2(sbuf[si4.b],sbuf[si4.a])));}\");\n            custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{buf[i4.r]=unpack16(sbuf[si].x).x;buf[i4.g]=unpack16(sbuf[si].x).y;buf[i4.b]=unpack16(sbuf[si].y).x;buf[i4.a]=unpack16(sbuf[si].y).y;}\");\n        }\n        else\n        {\n            custom_defines.append(\"buffer_ld1(buf,i)\", \"unpackBFloat2x16(buf[(i)/2])[(i)%2]\");\n            custom_defines.append(\"buffer_st1(buf,i,v)\", \"{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;float _vs=float(v);uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackBFloat2x16(_old_v);_v[_im2]=_vs;_new_v=packBFloat2x16(_v);} 
while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}\");\n            custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", \"{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;uint _si=uint(si);uint _sid2=_si/2;uint _sim2=_si%2;float v=unpackBFloat2x16(sbuf[_sid2])[_sim2];uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackBFloat2x16(_old_v);_v[_im2]=v;_new_v=packBFloat2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}\");\n\n            custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{uvec4 _si4d2=uvec4(si4)/2;uvec4 _si4m2=uvec4(si4)%2; buf[i]=uvec2(packBFloat2x16(vec2(unpackBFloat2x16(sbuf[_si4d2.r])[_si4m2.r],unpackBFloat2x16(sbuf[_si4d2.g])[_si4m2.g])),packBFloat2x16(vec2(unpackBFloat2x16(sbuf[_si4d2.b])[_si4m2.b],unpackBFloat2x16(sbuf[_si4d2.a])[_si4m2.a])));}\");\n            custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{uvec2 _v=sbuf[si];vec2 _v0=unpackBFloat2x16(_v.x);vec2 _v1=unpackBFloat2x16(_v.y);buffer_st1(buf,i4.r,_v0.r);buffer_st1(buf,i4.g,_v0.g);buffer_st1(buf,i4.b,_v1.r);buffer_st1(buf,i4.a,_v1.g);}\");\n        }\n\n        custom_defines.append(\"buffer_ld2(buf,i)\", \"unpackBFloat2x16(buf[i])\");\n        custom_defines.append(\"buffer_st2(buf,i,v)\", \"{buf[i]=packBFloat2x16(v);}\");\n        custom_defines.append(\"buffer_cp2(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_ld4(buf,i)\", \"vec4(unpackBFloat2x16(buf[i].x),unpackBFloat2x16(buf[i].y))\");\n        custom_defines.append(\"buffer_st4(buf,i,v)\", \"{buf[i]=uvec2(packBFloat2x16(v.rg),packBFloat2x16(v.ba));}\");\n        custom_defines.append(\"buffer_cp4(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n    }\n    else if (opt.use_fp16_storage && opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"buffer_ld1(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_st1(buf,i,v)\", \"{buf[i]=v;}\");\n        custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", 
\"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{buf[i]=f16vec4(sbuf[si4.r],sbuf[si4.g],sbuf[si4.b],sbuf[si4.a]);}\");\n        custom_defines.append(\"buffer_ld2(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_st2(buf,i,v)\", \"{buf[i]=v;}\");\n        custom_defines.append(\"buffer_cp2(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_ld4(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_st4(buf,i,v)\", \"{buf[i]=v;}\");\n        custom_defines.append(\"buffer_cp4(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{buf[i4.r]=sbuf[si].r;buf[i4.g]=sbuf[si].g;buf[i4.b]=sbuf[si].b;buf[i4.a]=sbuf[si].a;}\");\n        custom_defines.append(\"sfp2afpmat4(v)\", \"v\");\n        custom_defines.append(\"afp2sfpmat4(v)\", \"v\");\n    }\n    else if (opt.use_fp16_packed && opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"buffer_ld1(buf,i)\", \"float16_t(unpackHalf2x16(buf[(i)/2])[(i)%2])\");\n        custom_defines.append(\"buffer_st1(buf,i,v)\", \"{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;float _vs=float(v);uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=_vs;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}\");\n        custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", \"{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;uint _si=uint(si);uint _sid2=_si/2;uint _sim2=_si%2;float v=unpackHalf2x16(sbuf[_sid2])[_sim2];uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=v;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}\");\n\n        custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{uvec4 _si4d2=uvec4(si4)/2;uvec4 _si4m2=uvec4(si4)%2; 
buf[i]=uvec2(packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.r])[_si4m2.r],unpackHalf2x16(sbuf[_si4d2.g])[_si4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.b])[_si4m2.b],unpackHalf2x16(sbuf[_si4d2.a])[_si4m2.a])));}\");\n\n        custom_defines.append(\"buffer_ld2(buf,i)\", \"unpackFloat2x16(buf[i])\");\n        custom_defines.append(\"buffer_st2(buf,i,v)\", \"{buf[i]=packFloat2x16(v);}\");\n        custom_defines.append(\"buffer_cp2(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_ld4(buf,i)\", \"f16vec4(unpackFloat2x16(buf[i].x),unpackFloat2x16(buf[i].y))\");\n        custom_defines.append(\"buffer_st4(buf,i,v)\", \"{buf[i]=uvec2(packFloat2x16(v.rg),packFloat2x16(v.ba));}\");\n        custom_defines.append(\"buffer_cp4(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n\n        custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{uvec2 _v=sbuf[si];vec2 _v0=unpackHalf2x16(_v.x);vec2 _v1=unpackHalf2x16(_v.y);buffer_st1(buf,i4.r,_v0.r);buffer_st1(buf,i4.g,_v0.g);buffer_st1(buf,i4.b,_v1.r);buffer_st1(buf,i4.a,_v1.g);}\");\n    }\n    else if (opt.use_fp16_storage)\n    {\n        custom_defines.append(\"buffer_ld1(buf,i)\", \"float(buf[i])\");\n        custom_defines.append(\"buffer_st1(buf,i,v)\", \"{buf[i]=float16_t(v);}\");\n        custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{buf[i].r=sbuf[si4.r];buf[i].g=sbuf[si4.g];buf[i].b=sbuf[si4.b];buf[i].a=sbuf[si4.a];}\");\n        custom_defines.append(\"buffer_ld2(buf,i)\", \"vec2(buf[i])\");\n        custom_defines.append(\"buffer_st2(buf,i,v)\", \"{buf[i]=f16vec2(v);}\");\n        custom_defines.append(\"buffer_cp2(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_ld4(buf,i)\", \"vec4(buf[i])\");\n        custom_defines.append(\"buffer_st4(buf,i,v)\", \"{buf[i]=f16vec4(v);}\");\n        custom_defines.append(\"buffer_cp4(buf,i,sbuf,si)\", 
\"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{buf[i4.r]=sbuf[si].r;buf[i4.g]=sbuf[si].g;buf[i4.b]=sbuf[si].b;buf[i4.a]=sbuf[si].a;}\");\n    }\n    else if (opt.use_fp16_packed)\n    {\n        custom_defines.append(\"buffer_ld1(buf,i)\", \"unpackHalf2x16(buf[(i)/2])[(i)%2]\");\n        custom_defines.append(\"buffer_st1(buf,i,v)\", \"{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;float _vs=float(v);uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=_vs;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}\");\n        custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", \"{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;uint _si=uint(si);uint _sid2=_si/2;uint _sim2=_si%2;float v=unpackHalf2x16(sbuf[_sid2])[_sim2];uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=v;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}\");\n\n        custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{uvec4 _si4d2=uvec4(si4)/2;uvec4 _si4m2=uvec4(si4)%2; buf[i]=uvec2(packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.r])[_si4m2.r],unpackHalf2x16(sbuf[_si4d2.g])[_si4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.b])[_si4m2.b],unpackHalf2x16(sbuf[_si4d2.a])[_si4m2.a])));}\");\n\n        custom_defines.append(\"buffer_ld2(buf,i)\", \"unpackHalf2x16(buf[i])\");\n        custom_defines.append(\"buffer_st2(buf,i,v)\", \"{buf[i]=packHalf2x16(v);}\");\n        custom_defines.append(\"buffer_cp2(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_ld4(buf,i)\", \"vec4(unpackHalf2x16(buf[i].x),unpackHalf2x16(buf[i].y))\");\n        custom_defines.append(\"buffer_st4(buf,i,v)\", \"{buf[i]=uvec2(packHalf2x16(v.rg),packHalf2x16(v.ba));}\");\n        custom_defines.append(\"buffer_cp4(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n\n        
custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{uvec2 _v=sbuf[si];vec2 _v0=unpackHalf2x16(_v.x);vec2 _v1=unpackHalf2x16(_v.y);buffer_st1(buf,i4.r,_v0.r);buffer_st1(buf,i4.g,_v0.g);buffer_st1(buf,i4.b,_v1.r);buffer_st1(buf,i4.a,_v1.g);}\");\n    }\n    else\n    {\n        custom_defines.append(\"buffer_ld1(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_st1(buf,i,v)\", \"{buf[i]=v;}\");\n        custom_defines.append(\"buffer_cp1(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_cp1to4(buf,i,sbuf,si4)\", \"{buf[i]=vec4(sbuf[si4.r],sbuf[si4.g],sbuf[si4.b],sbuf[si4.a]);}\");\n        custom_defines.append(\"buffer_ld2(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_st2(buf,i,v)\", \"{buf[i]=v;}\");\n        custom_defines.append(\"buffer_cp2(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_ld4(buf,i)\", \"buf[i]\");\n        custom_defines.append(\"buffer_st4(buf,i,v)\", \"{buf[i]=v;}\");\n        custom_defines.append(\"buffer_cp4(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n        custom_defines.append(\"buffer_cp4to1(buf,i4,sbuf,si)\", \"{vec4 _v=sbuf[si]; buf[i4.r]=_v.r;buf[i4.g]=_v.g;buf[i4.b]=_v.b;buf[i4.a]=_v.a;}\");\n        custom_defines.append(\"sfp2afpmat4(v)\", \"v\");\n        custom_defines.append(\"afp2sfpmat4(v)\", \"v\");\n    }\n\n    if (opt.use_int8_storage)\n    {\n        custom_defines.append(\"sint8\", \"int8_t\");\n    }\n    else if (opt.use_int8_packed)\n    {\n        custom_defines.append(\"sint8\", \"int\");\n    }\n    else\n    {\n        custom_defines.append(\"sint8\", \"int\");\n    }\n\n    custom_defines.append(\"sint8vec4\", \"int\");\n\n    custom_defines.append(\"aint8\", \"int\");\n    custom_defines.append(\"aint8vec4\", \"ivec4\");\n\n    custom_defines.append(\"unpackInt4x8(v)\", \"ivec4((v<<24)>>24,(v<<16)>>24,(v<<8)>>24,v>>24)\");\n    custom_defines.append(\"packInt4x8(v)\", 
\"int((uint(v.r)&0xFFu)|((uint(v.g)&0xFFu)<<8)|((uint(v.b)&0xFFu)<<16)|((uint(v.a)&0xFFu)<<24))\");\n\n    if (opt.use_int8_storage)\n    {\n        custom_defines.append(\"i8buffer_ld1(buf,i)\", \"int(buf[i])\");\n        custom_defines.append(\"i8buffer_st1(buf,i,v)\", \"{buf[i]=int8_t(v);}\");\n        custom_defines.append(\"i8buffer_cp1(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n    }\n    else\n    {\n        custom_defines.append(\"i8buffer_ld1(buf,i)\", \"int(((buf[(i)/4])<<(24-((i)%4)*8))>>24)\");\n        custom_defines.append(\"i8buffer_st1(buf,i,v)\", \"{uint _i=uint(i);uint _id4=_i/4;uint _im4=_i%4;int _vs=int(v);int _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id4],0,0);ivec4 _v=unpackInt4x8(_old_v);_v[_im4]=_vs;_new_v=packInt4x8(_v);} while(atomicCompSwap(buf[_id4],_old_v,_new_v)!=_old_v);}\");\n        custom_defines.append(\"i8buffer_cp1(buf,i,sbuf,si)\", \"{int _v=i8buffer_ld1(sbuf,si);i8buffer_st1(buf,i,_v);}\");\n    }\n\n    custom_defines.append(\"i8buffer_ld4(buf,i)\", \"unpackInt4x8(buf[i])\");\n    custom_defines.append(\"i8buffer_st4(buf,i,v)\", \"{buf[i]=packInt4x8(v);}\");\n    custom_defines.append(\"i8buffer_cp4(buf,i,sbuf,si)\", \"{buf[i]=sbuf[si];}\");\n\n    custom_defines.append(\"psc(x)\", \"(x==0?p.x:x)\");\n\n    if (opt.use_bf16_storage)\n    {\n        custom_defines.append(\"NCNN_bf16_storage\", 1);\n    }\n    else if (opt.use_bf16_packed)\n    {\n        custom_defines.append(\"NCNN_bf16_packed\", 1);\n    }\n    else if (opt.use_fp16_storage)\n    {\n        custom_defines.append(\"NCNN_fp16_storage\", 1);\n    }\n    else if (opt.use_fp16_packed)\n    {\n        custom_defines.append(\"NCNN_fp16_packed\", 1);\n    }\n\n    if (opt.use_fp16_uniform)\n    {\n        custom_defines.append(\"NCNN_fp16_uniform\", 1);\n    }\n\n    if (opt.use_fp16_arithmetic)\n    {\n        custom_defines.append(\"NCNN_fp16_arithmetic\", 1);\n    }\n\n    if (opt.use_int8_storage)\n    {\n        custom_defines.append(\"NCNN_int8_storage\", 
1);\n    }\n    else if (opt.use_int8_packed)\n    {\n        custom_defines.append(\"NCNN_int8_packed\", 1);\n    }\n\n    if (opt.use_int8_uniform)\n    {\n        custom_defines.append(\"NCNN_int8_uniform\", 1);\n    }\n\n    if (opt.use_int8_arithmetic)\n    {\n        custom_defines.append(\"NCNN_int8_arithmetic\", 1);\n    }\n\n    if (opt.use_shader_local_memory)\n    {\n        custom_defines.append(\"NCNN_shader_local_memory\", 1);\n    }\n\n#if __APPLE__\n    custom_defines.append(\"NCNN_moltenvk\", 1);\n#endif\n\n    custom_defines.append(\"ncnn_glsl_version\", 1);\n\n    const bool support_shader_int64 = info.physicalDevicefeatures().shaderInt64;\n    const bool support_shader_int16 = info.physicalDevicefeatures().shaderInt16;\n\n    // fill device macros\n    {\n        // pull in device extensions\n        {\n            const std::vector<VkExtensionProperties>& properties = info.deviceExtensionProperties();\n\n            for (size_t i = 0; i < properties.size(); i++)\n            {\n                const VkExtensionProperties& exp = properties[i];\n                device_defines.append(exp.extensionName, exp.specVersion);\n            }\n        }\n\n#define DD_APPEND_FEATURE(X) device_defines.append(#X, features.X ? 
1 : 0);\n\n        // pull in device features macros\n        {\n            const VkPhysicalDeviceFeatures& features = info.physicalDevicefeatures();\n            DD_APPEND_FEATURE(robustBufferAccess)\n            DD_APPEND_FEATURE(fullDrawIndexUint32)\n            DD_APPEND_FEATURE(imageCubeArray)\n            DD_APPEND_FEATURE(independentBlend)\n            DD_APPEND_FEATURE(geometryShader)\n            DD_APPEND_FEATURE(tessellationShader)\n            DD_APPEND_FEATURE(sampleRateShading)\n            DD_APPEND_FEATURE(dualSrcBlend)\n            DD_APPEND_FEATURE(logicOp)\n            DD_APPEND_FEATURE(multiDrawIndirect)\n            DD_APPEND_FEATURE(drawIndirectFirstInstance)\n            DD_APPEND_FEATURE(depthClamp)\n            DD_APPEND_FEATURE(depthBiasClamp)\n            DD_APPEND_FEATURE(fillModeNonSolid)\n            DD_APPEND_FEATURE(depthBounds)\n            DD_APPEND_FEATURE(wideLines)\n            DD_APPEND_FEATURE(largePoints)\n            DD_APPEND_FEATURE(alphaToOne)\n            DD_APPEND_FEATURE(multiViewport)\n            DD_APPEND_FEATURE(samplerAnisotropy)\n            DD_APPEND_FEATURE(textureCompressionETC2)\n            DD_APPEND_FEATURE(textureCompressionASTC_LDR)\n            DD_APPEND_FEATURE(textureCompressionBC)\n            DD_APPEND_FEATURE(occlusionQueryPrecise)\n            DD_APPEND_FEATURE(pipelineStatisticsQuery)\n            DD_APPEND_FEATURE(vertexPipelineStoresAndAtomics)\n            DD_APPEND_FEATURE(fragmentStoresAndAtomics)\n            DD_APPEND_FEATURE(shaderTessellationAndGeometryPointSize)\n            DD_APPEND_FEATURE(shaderImageGatherExtended)\n            DD_APPEND_FEATURE(shaderStorageImageExtendedFormats)\n            DD_APPEND_FEATURE(shaderStorageImageMultisample)\n            DD_APPEND_FEATURE(shaderStorageImageReadWithoutFormat)\n            DD_APPEND_FEATURE(shaderStorageImageWriteWithoutFormat)\n            DD_APPEND_FEATURE(shaderUniformBufferArrayDynamicIndexing)\n            
DD_APPEND_FEATURE(shaderSampledImageArrayDynamicIndexing)\n            DD_APPEND_FEATURE(shaderStorageBufferArrayDynamicIndexing)\n            DD_APPEND_FEATURE(shaderStorageImageArrayDynamicIndexing)\n            DD_APPEND_FEATURE(shaderClipDistance)\n            DD_APPEND_FEATURE(shaderCullDistance)\n            DD_APPEND_FEATURE(shaderFloat64)\n            DD_APPEND_FEATURE(shaderInt64)\n            DD_APPEND_FEATURE(shaderInt16)\n            DD_APPEND_FEATURE(shaderResourceResidency)\n            DD_APPEND_FEATURE(shaderResourceMinLod)\n            DD_APPEND_FEATURE(sparseBinding)\n            DD_APPEND_FEATURE(sparseResidencyBuffer)\n            DD_APPEND_FEATURE(sparseResidencyImage2D)\n            DD_APPEND_FEATURE(sparseResidencyImage3D)\n            DD_APPEND_FEATURE(sparseResidency2Samples)\n            DD_APPEND_FEATURE(sparseResidency4Samples)\n            DD_APPEND_FEATURE(sparseResidency8Samples)\n            DD_APPEND_FEATURE(sparseResidency16Samples)\n            DD_APPEND_FEATURE(sparseResidencyAliased)\n            DD_APPEND_FEATURE(variableMultisampleRate)\n            DD_APPEND_FEATURE(inheritedQueries)\n        }\n        if (info.support_VK_KHR_8bit_storage())\n        {\n            const VkPhysicalDevice8BitStorageFeaturesKHR& features = info.query8BitStorageFeatures();\n            DD_APPEND_FEATURE(storageBuffer8BitAccess)\n            DD_APPEND_FEATURE(uniformAndStorageBuffer8BitAccess)\n            DD_APPEND_FEATURE(storagePushConstant8)\n        }\n        if (info.support_VK_KHR_16bit_storage())\n        {\n            const VkPhysicalDevice16BitStorageFeaturesKHR& features = info.query16BitStorageFeatures();\n            DD_APPEND_FEATURE(storageBuffer16BitAccess)\n            DD_APPEND_FEATURE(uniformAndStorageBuffer16BitAccess)\n            DD_APPEND_FEATURE(storagePushConstant16)\n            DD_APPEND_FEATURE(storageInputOutput16)\n        }\n        if (info.support_VK_KHR_robustness2() || info.support_VK_EXT_robustness2())\n     
   {\n            const VkPhysicalDeviceRobustness2FeaturesKHR& features = info.queryRobustness2Features();\n            DD_APPEND_FEATURE(robustBufferAccess2)\n            DD_APPEND_FEATURE(robustImageAccess2)\n            DD_APPEND_FEATURE(nullDescriptor)\n        }\n        if (info.support_VK_KHR_shader_float16_int8())\n        {\n            const VkPhysicalDeviceFloat16Int8FeaturesKHR& features = info.queryFloat16Int8Features();\n            DD_APPEND_FEATURE(shaderFloat16)\n            DD_APPEND_FEATURE(shaderInt8)\n        }\n        if (info.support_VK_KHR_sampler_ycbcr_conversion())\n        {\n            const VkPhysicalDeviceSamplerYcbcrConversionFeaturesKHR& features = info.querySamplerYcbcrConversionFeatures();\n            DD_APPEND_FEATURE(samplerYcbcrConversion)\n        }\n        if (info.support_VK_KHR_cooperative_matrix())\n        {\n            const VkPhysicalDeviceCooperativeMatrixFeaturesKHR& features = info.queryCooperativeMatrixFeatures();\n            DD_APPEND_FEATURE(cooperativeMatrix)\n            DD_APPEND_FEATURE(cooperativeMatrixRobustBufferAccess)\n        }\n        else if (info.support_VK_NV_cooperative_matrix())\n        {\n            const VkPhysicalDeviceCooperativeMatrixFeaturesNV& features = info.queryCooperativeMatrixFeaturesNV();\n            DD_APPEND_FEATURE(cooperativeMatrix)\n            DD_APPEND_FEATURE(cooperativeMatrixRobustBufferAccess)\n        }\n        if (info.support_VK_NV_cooperative_matrix2())\n        {\n            const VkPhysicalDeviceCooperativeMatrix2FeaturesNV& features = info.queryCooperativeMatrix2FeaturesNV();\n            DD_APPEND_FEATURE(cooperativeMatrixWorkgroupScope)\n            DD_APPEND_FEATURE(cooperativeMatrixFlexibleDimensions)\n            DD_APPEND_FEATURE(cooperativeMatrixReductions)\n            DD_APPEND_FEATURE(cooperativeMatrixConversions)\n            DD_APPEND_FEATURE(cooperativeMatrixPerElementOperations)\n            
DD_APPEND_FEATURE(cooperativeMatrixTensorAddressing)\n            DD_APPEND_FEATURE(cooperativeMatrixBlockLoads)\n        }\n        if (info.support_VK_NV_cooperative_vector())\n        {\n            const VkPhysicalDeviceCooperativeVectorFeaturesNV& features = info.queryCooperativeVectorFeaturesNV();\n            DD_APPEND_FEATURE(cooperativeVector)\n            DD_APPEND_FEATURE(cooperativeVectorTraining)\n        }\n        if (info.support_VK_EXT_subgroup_size_control())\n        {\n            const VkPhysicalDeviceSubgroupSizeControlFeaturesEXT& features = info.querySubgroupSizeControlFeatures();\n            DD_APPEND_FEATURE(subgroupSizeControl)\n            DD_APPEND_FEATURE(computeFullSubgroups)\n        }\n        if (info.support_VK_KHR_shader_bfloat16())\n        {\n            const VkPhysicalDeviceShaderBfloat16FeaturesKHR& features = info.queryShaderBfloat16Features();\n            DD_APPEND_FEATURE(shaderBFloat16Type)\n            DD_APPEND_FEATURE(shaderBFloat16DotProduct)\n            DD_APPEND_FEATURE(shaderBFloat16CooperativeMatrix)\n        }\n        if (info.support_VK_EXT_shader_float8())\n        {\n            const VkPhysicalDeviceShaderFloat8FeaturesEXT& features = info.queryShaderFloat8Features();\n            DD_APPEND_FEATURE(shaderFloat8)\n            DD_APPEND_FEATURE(shaderFloat8CooperativeMatrix)\n        }\n        if (info.support_VK_KHR_shader_float_controls2())\n        {\n            const VkPhysicalDeviceShaderFloatControls2FeaturesKHR& features = info.queryShaderFloatControls2Features();\n            DD_APPEND_FEATURE(shaderFloatControls2)\n        }\n        if (info.support_VK_KHR_shader_integer_dot_product())\n        {\n            const VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR& features = info.queryShaderIntegerDotProductFeatures();\n            DD_APPEND_FEATURE(shaderIntegerDotProduct)\n        }\n        if (info.support_VK_KHR_shader_subgroup_rotate())\n        {\n            const 
VkPhysicalDeviceShaderSubgroupRotateFeaturesKHR& features = info.queryShaderSubgroupRotateFeatures();\n            DD_APPEND_FEATURE(shaderSubgroupRotate)\n            DD_APPEND_FEATURE(shaderSubgroupRotateClustered)\n        }\n        if (info.support_VK_EXT_shader_atomic_float())\n        {\n            const VkPhysicalDeviceShaderAtomicFloatFeaturesEXT& features = info.queryShaderAtomicFloatFeatures();\n            DD_APPEND_FEATURE(shaderBufferFloat32Atomics)\n            DD_APPEND_FEATURE(shaderBufferFloat32AtomicAdd)\n            DD_APPEND_FEATURE(shaderBufferFloat64Atomics)\n            DD_APPEND_FEATURE(shaderBufferFloat64AtomicAdd)\n            DD_APPEND_FEATURE(shaderSharedFloat32Atomics)\n            DD_APPEND_FEATURE(shaderSharedFloat32AtomicAdd)\n            DD_APPEND_FEATURE(shaderSharedFloat64Atomics)\n            DD_APPEND_FEATURE(shaderSharedFloat64AtomicAdd)\n            DD_APPEND_FEATURE(shaderImageFloat32Atomics)\n            DD_APPEND_FEATURE(shaderImageFloat32AtomicAdd)\n            DD_APPEND_FEATURE(sparseImageFloat32Atomics)\n            DD_APPEND_FEATURE(sparseImageFloat32AtomicAdd)\n        }\n        if (info.support_VK_EXT_shader_atomic_float2())\n        {\n            const VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT& features = info.queryShaderAtomicFloat2Features();\n            DD_APPEND_FEATURE(shaderBufferFloat16Atomics)\n            DD_APPEND_FEATURE(shaderBufferFloat16AtomicAdd)\n            DD_APPEND_FEATURE(shaderBufferFloat16AtomicMinMax)\n            DD_APPEND_FEATURE(shaderBufferFloat32AtomicMinMax)\n            DD_APPEND_FEATURE(shaderBufferFloat64AtomicMinMax)\n            DD_APPEND_FEATURE(shaderSharedFloat16Atomics)\n            DD_APPEND_FEATURE(shaderSharedFloat16AtomicAdd)\n            DD_APPEND_FEATURE(shaderSharedFloat16AtomicMinMax)\n            DD_APPEND_FEATURE(shaderSharedFloat32AtomicMinMax)\n            DD_APPEND_FEATURE(shaderSharedFloat64AtomicMinMax)\n            
DD_APPEND_FEATURE(shaderImageFloat32AtomicMinMax)\n            DD_APPEND_FEATURE(sparseImageFloat32AtomicMinMax)\n        }\n        if (info.support_VK_KHR_vulkan_memory_model())\n        {\n            const VkPhysicalDeviceVulkanMemoryModelFeaturesKHR& features = info.queryVulkanMemoryModelFeatures();\n            DD_APPEND_FEATURE(vulkanMemoryModel)\n            DD_APPEND_FEATURE(vulkanMemoryModelDeviceScope)\n            DD_APPEND_FEATURE(vulkanMemoryModelAvailabilityVisibilityChains)\n        }\n\n#undef DD_APPEND_FEATURE\n\n#define DD_APPEND_PROPERTY(X) device_defines.append(#X, properties.X);\n\n        // pull in device properties macros\n        {\n            const VkPhysicalDeviceProperties& properties = info.physicalDeviceProperties();\n            DD_APPEND_PROPERTY(apiVersion)\n            DD_APPEND_PROPERTY(driverVersion)\n            DD_APPEND_PROPERTY(vendorID)\n            DD_APPEND_PROPERTY(deviceID)\n            DD_APPEND_PROPERTY(deviceType)\n            // DD_APPEND_PROPERTY(deviceName)\n\n            // DD_APPEND_PROPERTY(pipelineCacheUUID)\n\n#define DD_APPEND_PROPERTY_LIMIT(X) device_defines.append(#X, properties.limits.X);\n#define DD_APPEND_PROPERTY_LIMIT_2(X)                       \\\n    device_defines.append(#X \"_0\", properties.limits.X[0]); \\\n    device_defines.append(#X \"_1\", properties.limits.X[1]);\n#define DD_APPEND_PROPERTY_LIMIT_3(X)                       \\\n    device_defines.append(#X \"_0\", properties.limits.X[0]); \\\n    device_defines.append(#X \"_1\", properties.limits.X[1]); \\\n    device_defines.append(#X \"_2\", properties.limits.X[2]);\n\n            DD_APPEND_PROPERTY_LIMIT(maxImageDimension1D)\n            DD_APPEND_PROPERTY_LIMIT(maxImageDimension2D)\n            DD_APPEND_PROPERTY_LIMIT(maxImageDimension3D)\n            DD_APPEND_PROPERTY_LIMIT(maxImageDimensionCube)\n            DD_APPEND_PROPERTY_LIMIT(maxImageArrayLayers)\n            DD_APPEND_PROPERTY_LIMIT(maxTexelBufferElements)\n            
DD_APPEND_PROPERTY_LIMIT(maxUniformBufferRange)\n            DD_APPEND_PROPERTY_LIMIT(maxStorageBufferRange)\n            DD_APPEND_PROPERTY_LIMIT(maxPushConstantsSize)\n            DD_APPEND_PROPERTY_LIMIT(maxMemoryAllocationCount)\n            DD_APPEND_PROPERTY_LIMIT(maxSamplerAllocationCount)\n            DD_APPEND_PROPERTY_LIMIT(bufferImageGranularity)\n            DD_APPEND_PROPERTY_LIMIT(sparseAddressSpaceSize)\n            DD_APPEND_PROPERTY_LIMIT(maxBoundDescriptorSets)\n            DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorSamplers)\n            DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorUniformBuffers)\n            DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorStorageBuffers)\n            DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorSampledImages)\n            DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorStorageImages)\n            DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorInputAttachments)\n            DD_APPEND_PROPERTY_LIMIT(maxPerStageResources)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetSamplers)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetUniformBuffers)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetUniformBuffersDynamic)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetStorageBuffers)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetStorageBuffersDynamic)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetSampledImages)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetStorageImages)\n            DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetInputAttachments)\n            DD_APPEND_PROPERTY_LIMIT(maxVertexInputAttributes)\n            DD_APPEND_PROPERTY_LIMIT(maxVertexInputBindings)\n            DD_APPEND_PROPERTY_LIMIT(maxVertexInputAttributeOffset)\n            DD_APPEND_PROPERTY_LIMIT(maxVertexInputBindingStride)\n            DD_APPEND_PROPERTY_LIMIT(maxVertexOutputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxTessellationGenerationLevel)\n            
DD_APPEND_PROPERTY_LIMIT(maxTessellationPatchSize)\n            DD_APPEND_PROPERTY_LIMIT(maxTessellationControlPerVertexInputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxTessellationControlPerVertexOutputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxTessellationControlPerPatchOutputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxTessellationControlTotalOutputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxTessellationEvaluationInputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxTessellationEvaluationOutputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxGeometryShaderInvocations)\n            DD_APPEND_PROPERTY_LIMIT(maxGeometryInputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxGeometryOutputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxGeometryOutputVertices)\n            DD_APPEND_PROPERTY_LIMIT(maxGeometryTotalOutputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxFragmentInputComponents)\n            DD_APPEND_PROPERTY_LIMIT(maxFragmentOutputAttachments)\n            DD_APPEND_PROPERTY_LIMIT(maxFragmentDualSrcAttachments)\n            DD_APPEND_PROPERTY_LIMIT(maxFragmentCombinedOutputResources)\n            DD_APPEND_PROPERTY_LIMIT(maxComputeSharedMemorySize)\n            DD_APPEND_PROPERTY_LIMIT_3(maxComputeWorkGroupCount)\n            DD_APPEND_PROPERTY_LIMIT(maxComputeWorkGroupInvocations)\n            DD_APPEND_PROPERTY_LIMIT_3(maxComputeWorkGroupSize)\n            DD_APPEND_PROPERTY_LIMIT(subPixelPrecisionBits)\n            DD_APPEND_PROPERTY_LIMIT(subTexelPrecisionBits)\n            DD_APPEND_PROPERTY_LIMIT(mipmapPrecisionBits)\n            DD_APPEND_PROPERTY_LIMIT(maxDrawIndexedIndexValue)\n            DD_APPEND_PROPERTY_LIMIT(maxDrawIndirectCount)\n            DD_APPEND_PROPERTY_LIMIT(maxSamplerLodBias)\n            DD_APPEND_PROPERTY_LIMIT(maxSamplerAnisotropy)\n            DD_APPEND_PROPERTY_LIMIT(maxViewports)\n            DD_APPEND_PROPERTY_LIMIT_2(maxViewportDimensions)\n            
DD_APPEND_PROPERTY_LIMIT_2(viewportBoundsRange)\n            DD_APPEND_PROPERTY_LIMIT(viewportSubPixelBits)\n            device_defines.append(\"minMemoryMapAlignment\", (uint32_t)properties.limits.minMemoryMapAlignment);\n            DD_APPEND_PROPERTY_LIMIT(minTexelBufferOffsetAlignment)\n            DD_APPEND_PROPERTY_LIMIT(minUniformBufferOffsetAlignment)\n            DD_APPEND_PROPERTY_LIMIT(minStorageBufferOffsetAlignment)\n            DD_APPEND_PROPERTY_LIMIT(minTexelOffset)\n            DD_APPEND_PROPERTY_LIMIT(maxTexelOffset)\n            DD_APPEND_PROPERTY_LIMIT(minTexelGatherOffset)\n            DD_APPEND_PROPERTY_LIMIT(maxTexelGatherOffset)\n            DD_APPEND_PROPERTY_LIMIT(minInterpolationOffset)\n            DD_APPEND_PROPERTY_LIMIT(maxInterpolationOffset)\n            DD_APPEND_PROPERTY_LIMIT(subPixelInterpolationOffsetBits)\n            DD_APPEND_PROPERTY_LIMIT(maxFramebufferWidth)\n            DD_APPEND_PROPERTY_LIMIT(maxFramebufferHeight)\n            DD_APPEND_PROPERTY_LIMIT(maxFramebufferLayers)\n            DD_APPEND_PROPERTY_LIMIT(framebufferColorSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(framebufferDepthSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(framebufferStencilSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(framebufferNoAttachmentsSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(maxColorAttachments)\n            DD_APPEND_PROPERTY_LIMIT(sampledImageColorSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(sampledImageIntegerSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(sampledImageDepthSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(sampledImageStencilSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(storageImageSampleCounts)\n            DD_APPEND_PROPERTY_LIMIT(maxSampleMaskWords)\n            DD_APPEND_PROPERTY_LIMIT(timestampComputeAndGraphics)\n            DD_APPEND_PROPERTY_LIMIT(timestampPeriod)\n            DD_APPEND_PROPERTY_LIMIT(maxClipDistances)\n            
DD_APPEND_PROPERTY_LIMIT(maxCullDistances)\n            DD_APPEND_PROPERTY_LIMIT(maxCombinedClipAndCullDistances)\n            DD_APPEND_PROPERTY_LIMIT(discreteQueuePriorities)\n            DD_APPEND_PROPERTY_LIMIT_2(pointSizeRange)\n            DD_APPEND_PROPERTY_LIMIT_2(lineWidthRange)\n            DD_APPEND_PROPERTY_LIMIT(pointSizeGranularity)\n            DD_APPEND_PROPERTY_LIMIT(lineWidthGranularity)\n            DD_APPEND_PROPERTY_LIMIT(strictLines)\n            DD_APPEND_PROPERTY_LIMIT(standardSampleLocations)\n            DD_APPEND_PROPERTY_LIMIT(optimalBufferCopyOffsetAlignment)\n            DD_APPEND_PROPERTY_LIMIT(optimalBufferCopyRowPitchAlignment)\n            DD_APPEND_PROPERTY_LIMIT(nonCoherentAtomSize)\n\n#undef DD_APPEND_PROPERTY_LIMIT\n#undef DD_APPEND_PROPERTY_LIMIT_2\n#undef DD_APPEND_PROPERTY_LIMIT_3\n\n#define DD_APPEND_PROPERTY_SPARSE(X) device_defines.append(#X, properties.sparseProperties.X);\n\n            DD_APPEND_PROPERTY_SPARSE(residencyStandard2DBlockShape)\n            DD_APPEND_PROPERTY_SPARSE(residencyStandard2DMultisampleBlockShape)\n            DD_APPEND_PROPERTY_SPARSE(residencyStandard3DBlockShape)\n            DD_APPEND_PROPERTY_SPARSE(residencyAlignedMipSize)\n            DD_APPEND_PROPERTY_SPARSE(residencyNonResidentStrict)\n\n#undef DD_APPEND_PROPERTY_SPARSE\n        }\n        {\n            const VkPhysicalDeviceSubgroupProperties& properties = info.querySubgroupProperties();\n            DD_APPEND_PROPERTY(subgroupSize)\n            DD_APPEND_PROPERTY(supportedStages)\n            DD_APPEND_PROPERTY(supportedOperations)\n            DD_APPEND_PROPERTY(quadOperationsInAllStages)\n\n            // append subgroup ops\n            device_defines.append(\"subgroup_basic\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_BASIC_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_vote\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_VOTE_BIT) ? 
1 : 0);\n            device_defines.append(\"subgroup_arithmetic\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_ballot\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_BALLOT_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_shuffle\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_SHUFFLE_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_shuffle_relative\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_clustered\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_CLUSTERED_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_quad\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_QUAD_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_rotate\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_ROTATE_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_rotate_relative\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_ROTATE_CLUSTERED_BIT) ? 1 : 0);\n            device_defines.append(\"subgroup_partitioned\", (properties.supportedOperations & VK_SUBGROUP_FEATURE_PARTITIONED_BIT_NV) ? 
1 : 0);\n        }\n        if (info.support_VK_NV_cooperative_matrix2())\n        {\n            const VkPhysicalDeviceCooperativeMatrix2PropertiesNV& properties = info.queryCooperativeMatrix2PropertiesNV();\n            DD_APPEND_PROPERTY(cooperativeMatrixWorkgroupScopeMaxWorkgroupSize)\n            DD_APPEND_PROPERTY(cooperativeMatrixFlexibleDimensionsMaxDimension)\n            DD_APPEND_PROPERTY(cooperativeMatrixWorkgroupScopeReservedSharedMemory)\n        }\n        if (info.support_VK_NV_cooperative_vector())\n        {\n            const VkPhysicalDeviceCooperativeVectorPropertiesNV& properties = info.queryCooperativeVectorPropertiesNV();\n            DD_APPEND_PROPERTY(cooperativeVectorSupportedStages)\n            DD_APPEND_PROPERTY(cooperativeVectorTrainingFloat16Accumulation)\n            DD_APPEND_PROPERTY(cooperativeVectorTrainingFloat32Accumulation)\n            DD_APPEND_PROPERTY(maxCooperativeVectorComponents)\n        }\n        if (info.support_VK_KHR_driver_properties())\n        {\n            const VkPhysicalDeviceDriverPropertiesKHR& properties = info.queryDriverProperties();\n            DD_APPEND_PROPERTY(driverID)\n            // DD_APPEND_PROPERTY(driverName)\n            // DD_APPEND_PROPERTY(driverInfo)\n            device_defines.append(\"conformanceVersion_major\", properties.conformanceVersion.major);\n            device_defines.append(\"conformanceVersion_minor\", properties.conformanceVersion.minor);\n            device_defines.append(\"conformanceVersion_subminor\", properties.conformanceVersion.subminor);\n            device_defines.append(\"conformanceVersion_patch\", properties.conformanceVersion.patch);\n        }\n        if (info.support_VK_KHR_robustness2() || info.support_VK_EXT_robustness2())\n        {\n            const VkPhysicalDeviceRobustness2PropertiesKHR& properties = info.queryRobustness2Properties();\n            DD_APPEND_PROPERTY(robustStorageBufferAccessSizeAlignment)\n            
DD_APPEND_PROPERTY(robustUniformBufferAccessSizeAlignment)\n        }\n        if (info.support_VK_KHR_shader_integer_dot_product())\n        {\n            const VkPhysicalDeviceShaderIntegerDotProductProperties& properties = info.queryShaderIntegerDotProductProperties();\n            DD_APPEND_PROPERTY(integerDotProduct8BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct8BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct8BitMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct4x8BitPackedUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct4x8BitPackedSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct4x8BitPackedMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct16BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct16BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct16BitMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct32BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct32BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct32BitMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct64BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct64BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProduct64BitMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating8BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating8BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating4x8BitPackedSignedAccelerated)\n            
DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating4x8BitPackedMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating16BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating16BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating16BitMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating32BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating32BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating32BitMixedSignednessAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating64BitUnsignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating64BitSignedAccelerated)\n            DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating64BitMixedSignednessAccelerated)\n        }\n        if (info.support_VK_EXT_subgroup_size_control())\n        {\n            const VkPhysicalDeviceSubgroupSizeControlPropertiesEXT& properties = info.querySubgroupSizeControlProperties();\n            DD_APPEND_PROPERTY(minSubgroupSize)\n            DD_APPEND_PROPERTY(maxSubgroupSize)\n            DD_APPEND_PROPERTY(maxComputeWorkgroupSubgroups)\n            DD_APPEND_PROPERTY(requiredSubgroupSizeStages)\n        }\n\n#if ENABLE_VALIDATION_LAYER\n        if (info.support_VK_KHR_shader_non_semantic_info())\n        {\n            device_defines.append(\"enable_validation_layer\", VK_TRUE);\n            custom_defines.append(\"NCNN_LOGE\", \"debugPrintfEXT\");\n        }\n#endif\n\n#undef DD_APPEND_PROPERTY\n    }\n\n    std::string define_macro_data;\n\n    for (size_t i = 0; i < custom_defines.definitions.size(); i++)\n    {\n        const char* key = custom_defines.definitions[i].first;\n        const DefinitionCollector::typed_value& def = 
custom_defines.definitions[i].second;\n\n        if (def.type == 0)\n        {\n            define_macro_data += std::string(\"#define \") + key + \" \" + def.s + \"\\n\";\n        }\n        else\n        {\n            char defstr[256];\n            if (def.type == 1)\n            {\n                sprintf(defstr, \"%u\", def.u8);\n            }\n            if (def.type == 2)\n            {\n                sprintf(defstr, \"%u\", def.u32);\n            }\n            if (def.type == 3)\n            {\n                sprintf(defstr, \"%d\", def.i32);\n            }\n            if (def.type == 4)\n            {\n                if (support_shader_int64)\n                {\n                    sprintf(defstr, \"%luull\", def.u64);\n                }\n                else\n                {\n                    uint32_t u32 = def.u64 > UINT_MAX ? UINT_MAX : (uint32_t)def.u64;\n                    sprintf(defstr, \"%u\", u32);\n                }\n            }\n            if (def.type == 5)\n            {\n                sprintf(defstr, \"%e\", def.f32);\n            }\n\n            define_macro_data += std::string(\"#define \") + key + \" \" + defstr + \"\\n\";\n        }\n    }\n    for (size_t i = 0; i < device_defines.definitions.size(); i++)\n    {\n        const char* key = device_defines.definitions[i].first;\n        const DefinitionCollector::typed_value& def = device_defines.definitions[i].second;\n\n        if (def.type == 0)\n        {\n            define_macro_data += std::string(\"#define ncnn_\") + key + \" \\\"\" + def.s + \"\\\"\\n\";\n        }\n        else\n        {\n            char defstr[256];\n            if (def.type == 1)\n            {\n                sprintf(defstr, \"%u\", def.u8);\n            }\n            if (def.type == 2)\n            {\n                sprintf(defstr, \"%u\", def.u32);\n            }\n            if (def.type == 3)\n            {\n                sprintf(defstr, \"%d\", def.i32);\n            }\n           
 if (def.type == 4)\n            {\n                if (support_shader_int64)\n                {\n                    sprintf(defstr, \"%luull\", def.u64);\n                }\n                else\n                {\n                    uint32_t u32 = def.u64 > UINT_MAX ? UINT_MAX : (uint32_t)def.u64;\n                    sprintf(defstr, \"%u\", u32);\n                }\n            }\n            if (def.type == 5)\n            {\n                sprintf(defstr, \"%e\", def.f32);\n            }\n\n            define_macro_data += std::string(\"#define ncnn_\") + key + \" \" + defstr + \"\\n\";\n        }\n    }\n\n    // enable extensions\n    std::string custom_exts;\n    if (support_shader_int64)\n    {\n        custom_exts += \"#extension GL_EXT_shader_explicit_arithmetic_types_int64: require\\n\";\n    }\n    if (support_shader_int16)\n    {\n        custom_exts += \"#extension GL_EXT_shader_explicit_arithmetic_types_int16: require\\n\";\n    }\n    if (opt.use_bf16_storage)\n    {\n        custom_exts += \"#extension GL_EXT_bfloat16: require\\n\";\n    }\n    if (opt.use_fp16_storage || opt.use_bf16_storage)\n    {\n        custom_exts += \"#extension GL_EXT_shader_16bit_storage: require\\n\";\n    }\n    if (opt.use_fp16_arithmetic)\n    {\n        custom_exts += \"#extension GL_EXT_shader_explicit_arithmetic_types_float16: require\\n\";\n    }\n    if (opt.use_int8_storage)\n    {\n        custom_exts += \"#extension GL_EXT_shader_8bit_storage: require\\n\";\n    }\n    if (opt.use_int8_arithmetic)\n    {\n        custom_exts += \"#extension GL_EXT_shader_explicit_arithmetic_types_int8: require\\n\";\n    }\n#if ENABLE_VALIDATION_LAYER\n    {\n        custom_exts += \"#extension GL_EXT_debug_printf : require\\n\";\n    }\n#endif\n\n    // debug\n    // NCNN_LOGE(\"%s\", define_macro_data.c_str());\n\n    bool compile_success = true;\n\n    {\n        glslang::TShader s(EShLangCompute);\n\n        // split shader source by token \"#version 450\\n\"\n        
int version_end_pos = -1;\n        {\n            for (int i = 0; i < comp_data_size - 8; i++)\n            {\n                if (strncmp(comp_data + i, \"#version\", 8) != 0)\n                    continue;\n\n                // #version shall be the very beginning or after newline\n                if (i != 0 && comp_data[i - 1] != '\\n')\n                    continue;\n\n                int nversion = 0;\n                sscanf(comp_data + i, \"#version %*d\\n%n\", &nversion);\n                if (nversion == 0)\n                    continue;\n\n                version_end_pos = i + nversion;\n                break;\n            }\n\n            if (version_end_pos == -1)\n            {\n                NCNN_LOGE(\"shader source has no #version token\");\n                return -1;\n            }\n\n            // NCNN_LOGE(\"version_end_pos = %d\", version_end_pos);\n        }\n\n        const char* comp_data_2 = comp_data + version_end_pos;\n        int comp_data_size_1 = version_end_pos;\n        int comp_data_size_2 = comp_data_size - comp_data_size_1;\n\n        const char* comp_datas[4] = {comp_data, custom_exts.c_str(), define_macro_data.c_str(), comp_data_2};\n        const int comp_data_sizes[4] = {comp_data_size_1, (int)custom_exts.size(), (int)define_macro_data.size(), comp_data_size_2};\n\n        s.setStringsWithLengths(comp_datas, comp_data_sizes, 4);\n\n        s.setEntryPoint(\"main\");\n        s.setSourceEntryPoint(\"main\");\n\n        s.setEnvInput(glslang::EShSourceGlsl, EShLangCompute, glslang::EShClientVulkan, 1);\n\n        if (opt.use_subgroup_ops || opt.use_cooperative_matrix)\n        {\n            // subgroup / cooperative_matrix need vulkan-1.1 and spirv-1.3\n            s.setEnvClient(glslang::EShClientVulkan, glslang::EShTargetVulkan_1_1);\n            s.setEnvTarget(glslang::EshTargetSpv, glslang::EShTargetSpv_1_3);\n        }\n        else\n        {\n            s.setEnvClient(glslang::EShClientVulkan, 
glslang::EShTargetVulkan_1_0);\n            s.setEnvTarget(glslang::EshTargetSpv, glslang::EShTargetSpv_1_0);\n        }\n\n        TBuiltInResource resources = get_default_TBuiltInResource();\n\n        VulkanShaderIncluder includer;\n\n        bool pr = s.parse(&resources, 100, ENoProfile, false, false, EShMsgDefault, includer);\n        if (!pr)\n        {\n            NCNN_LOGE(\"compile spir-v module failed\");\n            NCNN_LOGE(\"%s\", s.getInfoLog());\n            NCNN_LOGE(\"%s\", s.getInfoDebugLog());\n\n            // print as line_number: code\n            {\n                const char* p = comp_datas[3];\n                const char* line_end;\n                int line_number = 1;\n\n                while ((line_end = strchr(p, '\\n')) != NULL)\n                {\n                    NCNN_LOGE(\"%d:\\t%.*s\", line_number++, (int)(line_end - p), p);\n                    p = line_end + 1;\n                }\n\n                if (*p != '\\0')\n                {\n                    NCNN_LOGE(\"%d:\\t%s\", line_number, p);\n                }\n            }\n\n            compile_success = false;\n        }\n        else\n        {\n            glslang::TIntermediate* ir = s.getIntermediate();\n            glslang::GlslangToSpv(*ir, spirv);\n        }\n    }\n\n    return compile_success ? 
0 : -1;\n}\n\nint compile_spirv_module(int shader_type_index, const Option& opt, std::vector<uint32_t>& spirv)\n{\n    if (shader_type_index < 0 || shader_type_index >= layer_shader_registry_entry_count)\n    {\n        NCNN_LOGE(\"no such shader module %d\", shader_type_index);\n        return -1;\n    }\n\n    const char* comp_data = layer_shader_registry[shader_type_index].comp_data;\n    int comp_data_size = layer_shader_registry[shader_type_index].comp_data_size;\n\n    return compile_spirv_module(comp_data, comp_data_size, opt, spirv);\n}\n\nint resolve_shader_info(const uint32_t* spv_data, size_t spv_data_size, ShaderInfo& shader_info)\n{\n    shader_info.specialization_count = 0;\n    shader_info.binding_count = 0;\n    shader_info.push_constant_count = 0;\n\n    uint32_t parameter_id = -233;\n\n    int specialization_count = 0;\n    int binding_count = 0;\n    int push_constant_count = 0;\n\n    // id -> binding_type\n    std::vector<int> id_types;\n\n    // binding_id -> binding_type\n    std::vector<int> binding_types;\n\n    const uint32_t* p = spv_data;\n\n    int bound = p[3];\n\n    id_types.resize(bound);\n\n    // skip magic version generator bound schema\n    p += 5;\n\n    // foreach op\n    while ((const unsigned char*)p < (const unsigned char*)spv_data + spv_data_size)\n    {\n        uint32_t opcode = p[0];\n\n        uint16_t wordcount = opcode >> 16;\n        uint16_t op = opcode & 0xffff;\n\n        if (op == 5) // OpName\n        {\n            uint32_t id = p[1];\n            const char* name = (const char*)&p[2];\n            if (strcmp(name, \"parameter\") == 0)\n            {\n                parameter_id = id;\n            }\n        }\n        else if (op == 6) // OpMemberName\n        {\n            uint32_t id = p[1];\n            if (id == parameter_id)\n            {\n                push_constant_count++;\n            }\n        }\n        else if (op == 25) // OpTypeImage\n        {\n            uint32_t id = p[1];\n            
id_types[id] = 2;\n        }\n        else if (op == 27) // OpTypeSampledImage\n        {\n            uint32_t id = p[1];\n            id_types[id] = 3;\n        }\n        else if (op == 32) // OpTypePointer\n        {\n            uint32_t id = p[1];\n            uint32_t storage_class = p[2];\n            uint32_t type = p[3];\n            if (storage_class == 0) // UniformConstant\n            {\n                id_types[id] = id_types[type];\n            }\n            if (storage_class == 2) // Uniform\n            {\n                id_types[id] = id_types[type];\n            }\n            if (storage_class == 12) // StorageBuffer\n            {\n                id_types[type] = 1;\n                id_types[id] = id_types[type];\n            }\n        }\n        else if (op == 59) // OpVariable\n        {\n            uint32_t id = p[1];\n            uint32_t var_id = p[2];\n            uint32_t storage_class = p[3];\n            if (storage_class == 0) // UniformConstant\n            {\n                id_types[var_id] = id_types[id];\n            }\n            if (storage_class == 2) // Uniform\n            {\n                id_types[var_id] = id_types[id];\n            }\n            if (storage_class == 12) // StorageBuffer\n            {\n                id_types[var_id] = id_types[id];\n            }\n        }\n        else if (op == 71) // OpDecorate\n        {\n            uint32_t id = p[1];\n            uint32_t decoration = p[2];\n            uint32_t binding_id = p[3];\n            if (decoration == 1) // SpecId\n            {\n                specialization_count++;\n            }\n            if (decoration == 3) // BufferBlock\n            {\n                id_types[id] = 1;\n            }\n            else if (decoration == 33) // Binding\n            {\n                binding_count = std::max(binding_count, (int)binding_id + 1);\n\n                binding_types.resize(binding_count);\n                binding_types[binding_id] = id;\n 
           }\n        }\n\n        p += wordcount;\n    }\n\n    if (binding_count > 16)\n    {\n        NCNN_LOGE(\"too many binding %d\", binding_count);\n        return -1;\n    }\n\n    shader_info.specialization_count = specialization_count;\n    shader_info.binding_count = binding_count;\n    shader_info.push_constant_count = push_constant_count;\n\n    // resolve binding_types\n    for (int i = 0; i < binding_count; i++)\n    {\n        shader_info.binding_types[i] = id_types[binding_types[i]];\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n\n#endif // NCNN_VULKAN\n"
  },
  {
    "path": "src/gpu.h",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NCNN_GPU_H\n#define NCNN_GPU_H\n\n#include \"platform.h\"\n\n#if NCNN_VULKAN\n\n#include \"mat.h\"\n\nnamespace ncnn {\n\n// instance\n\n// Create VkInstance and initialize some objects that need to be calculated by GPU\n// Creates a VkInstance object, Checks the extended attributes supported by the Vulkan instance concerned,\n// Initializes, and creates Vulkan validation layers (if ENABLE_VALIDATION_LAYER is enabled),\n// Iterates over all supported physical devices, etc.\nNCNN_EXPORT int create_gpu_instance(const char* driver_path = 0);\n\n// Get global VkInstance variable\n// Must be called after create_gpu_instance() and before destroy_gpu_instance()\nNCNN_EXPORT VkInstance get_gpu_instance();\n\n// Destroy VkInstance object and free the memory of the associated object\n// Usually called in the destructor of the main program exit\n// The function will internally ensure that all vulkan devices are idle before proceeding with destruction.\nNCNN_EXPORT void destroy_gpu_instance();\n\n// vulkan core\nextern PFN_vkAllocateCommandBuffers vkAllocateCommandBuffers;\nextern PFN_vkAllocateDescriptorSets vkAllocateDescriptorSets;\nextern PFN_vkAllocateMemory vkAllocateMemory;\nextern PFN_vkBeginCommandBuffer vkBeginCommandBuffer;\nextern PFN_vkBindBufferMemory vkBindBufferMemory;\nextern PFN_vkBindImageMemory vkBindImageMemory;\nextern PFN_vkCmdBeginQuery vkCmdBeginQuery;\nextern PFN_vkCmdBindDescriptorSets vkCmdBindDescriptorSets;\nextern PFN_vkCmdBindIndexBuffer vkCmdBindIndexBuffer;\nextern PFN_vkCmdBindPipeline vkCmdBindPipeline;\nextern PFN_vkCmdCopyBuffer vkCmdCopyBuffer;\nextern PFN_vkCmdCopyBufferToImage vkCmdCopyBufferToImage;\nextern PFN_vkCmdCopyImage vkCmdCopyImage;\nextern PFN_vkCmdCopyImageToBuffer vkCmdCopyImageToBuffer;\nextern PFN_vkCmdCopyQueryPoolResults vkCmdCopyQueryPoolResults;\nextern PFN_vkCmdDispatch vkCmdDispatch;\nextern PFN_vkCmdDispatchIndirect 
vkCmdDispatchIndirect;\nextern PFN_vkCmdEndQuery vkCmdEndQuery;\nextern PFN_vkCmdExecuteCommands vkCmdExecuteCommands;\nextern PFN_vkCmdFillBuffer vkCmdFillBuffer;\nextern PFN_vkCmdPipelineBarrier vkCmdPipelineBarrier;\nextern PFN_vkCmdPushConstants vkCmdPushConstants;\nextern PFN_vkCmdResetQueryPool vkCmdResetQueryPool;\nextern PFN_vkCmdResolveImage vkCmdResolveImage;\nextern PFN_vkCmdUpdateBuffer vkCmdUpdateBuffer;\nextern PFN_vkCmdWriteTimestamp vkCmdWriteTimestamp;\nextern PFN_vkCreateBuffer vkCreateBuffer;\nextern PFN_vkCreateBufferView vkCreateBufferView;\nextern PFN_vkCreateCommandPool vkCreateCommandPool;\nextern PFN_vkCreateComputePipelines vkCreateComputePipelines;\nextern PFN_vkCreateDescriptorPool vkCreateDescriptorPool;\nextern PFN_vkCreateDescriptorSetLayout vkCreateDescriptorSetLayout;\nextern PFN_vkCreateDevice vkCreateDevice;\nextern PFN_vkCreateFence vkCreateFence;\nextern PFN_vkCreateImage vkCreateImage;\nextern PFN_vkCreateImageView vkCreateImageView;\nextern PFN_vkCreatePipelineCache vkCreatePipelineCache;\nextern PFN_vkCreatePipelineLayout vkCreatePipelineLayout;\nextern PFN_vkCreateQueryPool vkCreateQueryPool;\nextern PFN_vkCreateSampler vkCreateSampler;\nextern PFN_vkCreateSemaphore vkCreateSemaphore;\nextern PFN_vkCreateShaderModule vkCreateShaderModule;\nextern PFN_vkDestroyBuffer vkDestroyBuffer;\nextern PFN_vkDestroyBufferView vkDestroyBufferView;\nextern PFN_vkDestroyCommandPool vkDestroyCommandPool;\nextern PFN_vkDestroyDescriptorPool vkDestroyDescriptorPool;\nextern PFN_vkDestroyDescriptorSetLayout vkDestroyDescriptorSetLayout;\nextern PFN_vkDestroyDevice vkDestroyDevice;\nextern PFN_vkDestroyFence vkDestroyFence;\nextern PFN_vkDestroyImage vkDestroyImage;\nextern PFN_vkDestroyImageView vkDestroyImageView;\nextern PFN_vkDestroyInstance vkDestroyInstance;\nextern PFN_vkDestroyPipeline vkDestroyPipeline;\nextern PFN_vkDestroyPipelineCache vkDestroyPipelineCache;\nextern PFN_vkDestroyPipelineLayout vkDestroyPipelineLayout;\nextern 
PFN_vkDestroyQueryPool vkDestroyQueryPool;\nextern PFN_vkDestroySampler vkDestroySampler;\nextern PFN_vkDestroySemaphore vkDestroySemaphore;\nextern PFN_vkDestroyShaderModule vkDestroyShaderModule;\nextern PFN_vkDeviceWaitIdle vkDeviceWaitIdle;\nextern PFN_vkEndCommandBuffer vkEndCommandBuffer;\nextern PFN_vkEnumerateDeviceExtensionProperties vkEnumerateDeviceExtensionProperties;\nextern PFN_vkEnumerateDeviceLayerProperties vkEnumerateDeviceLayerProperties;\nextern PFN_vkEnumeratePhysicalDevices vkEnumeratePhysicalDevices;\nextern PFN_vkFlushMappedMemoryRanges vkFlushMappedMemoryRanges;\nextern PFN_vkFreeCommandBuffers vkFreeCommandBuffers;\nextern PFN_vkFreeDescriptorSets vkFreeDescriptorSets;\nextern PFN_vkFreeMemory vkFreeMemory;\nextern PFN_vkGetBufferMemoryRequirements vkGetBufferMemoryRequirements;\nextern PFN_vkGetDeviceMemoryCommitment vkGetDeviceMemoryCommitment;\nextern PFN_vkGetDeviceProcAddr vkGetDeviceProcAddr;\nextern PFN_vkGetDeviceQueue vkGetDeviceQueue;\nextern PFN_vkGetFenceStatus vkGetFenceStatus;\nextern PFN_vkGetImageMemoryRequirements vkGetImageMemoryRequirements;\nextern PFN_vkGetImageSubresourceLayout vkGetImageSubresourceLayout;\nextern PFN_vkGetPhysicalDeviceFeatures vkGetPhysicalDeviceFeatures;\nextern PFN_vkGetPhysicalDeviceFormatProperties vkGetPhysicalDeviceFormatProperties;\nextern PFN_vkGetPhysicalDeviceImageFormatProperties vkGetPhysicalDeviceImageFormatProperties;\nextern PFN_vkGetPhysicalDeviceMemoryProperties vkGetPhysicalDeviceMemoryProperties;\nextern PFN_vkGetPhysicalDeviceProperties vkGetPhysicalDeviceProperties;\nextern PFN_vkGetPhysicalDeviceQueueFamilyProperties vkGetPhysicalDeviceQueueFamilyProperties;\nextern PFN_vkGetPipelineCacheData vkGetPipelineCacheData;\nextern PFN_vkGetQueryPoolResults vkGetQueryPoolResults;\nextern PFN_vkInvalidateMappedMemoryRanges vkInvalidateMappedMemoryRanges;\nextern PFN_vkMapMemory vkMapMemory;\nextern PFN_vkMergePipelineCaches vkMergePipelineCaches;\nextern PFN_vkQueueSubmit 
vkQueueSubmit;\nextern PFN_vkQueueWaitIdle vkQueueWaitIdle;\nextern PFN_vkResetCommandBuffer vkResetCommandBuffer;\nextern PFN_vkResetCommandPool vkResetCommandPool;\nextern PFN_vkResetDescriptorPool vkResetDescriptorPool;\nextern PFN_vkResetFences vkResetFences;\nextern PFN_vkUnmapMemory vkUnmapMemory;\nextern PFN_vkUpdateDescriptorSets vkUpdateDescriptorSets;\nextern PFN_vkWaitForFences vkWaitForFences;\n\n// instance extension capability\nextern int support_VK_KHR_external_memory_capabilities;\nextern int support_VK_KHR_get_physical_device_properties2;\nextern int support_VK_KHR_get_surface_capabilities2;\nextern int support_VK_KHR_surface;\nextern int support_VK_EXT_debug_utils;\nextern int support_VK_EXT_validation_features;\nextern int support_VK_EXT_validation_flags;\n#if __ANDROID_API__ >= 26\nextern int support_VK_KHR_android_surface;\n#endif // __ANDROID_API__ >= 26\n\n// VK_KHR_cooperative_matrix\nextern PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR;\n\n// VK_KHR_external_memory_capabilities\nextern PFN_vkGetPhysicalDeviceExternalBufferPropertiesKHR vkGetPhysicalDeviceExternalBufferPropertiesKHR;\n\n// VK_KHR_get_physical_device_properties2\nextern PFN_vkGetPhysicalDeviceFeatures2KHR vkGetPhysicalDeviceFeatures2KHR;\nextern PFN_vkGetPhysicalDeviceProperties2KHR vkGetPhysicalDeviceProperties2KHR;\nextern PFN_vkGetPhysicalDeviceFormatProperties2KHR vkGetPhysicalDeviceFormatProperties2KHR;\nextern PFN_vkGetPhysicalDeviceImageFormatProperties2KHR vkGetPhysicalDeviceImageFormatProperties2KHR;\nextern PFN_vkGetPhysicalDeviceQueueFamilyProperties2KHR vkGetPhysicalDeviceQueueFamilyProperties2KHR;\nextern PFN_vkGetPhysicalDeviceMemoryProperties2KHR vkGetPhysicalDeviceMemoryProperties2KHR;\n\n// VK_KHR_get_surface_capabilities2\nextern PFN_vkGetPhysicalDeviceSurfaceCapabilities2KHR vkGetPhysicalDeviceSurfaceCapabilities2KHR;\nextern PFN_vkGetPhysicalDeviceSurfaceFormats2KHR 
vkGetPhysicalDeviceSurfaceFormats2KHR;\n\n// VK_KHR_surface\nextern PFN_vkDestroySurfaceKHR vkDestroySurfaceKHR;\nextern PFN_vkGetPhysicalDeviceSurfaceSupportKHR vkGetPhysicalDeviceSurfaceSupportKHR;\nextern PFN_vkGetPhysicalDeviceSurfaceCapabilitiesKHR vkGetPhysicalDeviceSurfaceCapabilitiesKHR;\nextern PFN_vkGetPhysicalDeviceSurfaceFormatsKHR vkGetPhysicalDeviceSurfaceFormatsKHR;\nextern PFN_vkGetPhysicalDeviceSurfacePresentModesKHR vkGetPhysicalDeviceSurfacePresentModesKHR;\n\n#if __ANDROID_API__ >= 26\n// VK_KHR_android_surface\nextern PFN_vkCreateAndroidSurfaceKHR vkCreateAndroidSurfaceKHR;\n#endif // __ANDROID_API__ >= 26\n\n// VK_NV_cooperative_matrix\nextern PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesNV vkGetPhysicalDeviceCooperativeMatrixPropertiesNV;\n\n// VK_NV_cooperative_matrix2\nextern PFN_vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV;\n\n// VK_NV_cooperative_vector\nextern PFN_vkGetPhysicalDeviceCooperativeVectorPropertiesNV vkGetPhysicalDeviceCooperativeVectorPropertiesNV;\n\n// get info\nNCNN_EXPORT int get_gpu_count();\nNCNN_EXPORT int get_default_gpu_index();\n\nclass GpuInfoPrivate;\nclass NCNN_EXPORT GpuInfo\n{\npublic:\n    explicit GpuInfo();\n    virtual ~GpuInfo();\n\n    int device_index() const;\n\n    // vulkan physical device\n    VkPhysicalDevice physicalDevice() const;\n    VkPhysicalDevice physical_device() const; // api compatibility\n\n    // features\n    const VkPhysicalDeviceFeatures& physicalDevicefeatures() const;\n\n    // properties\n    const VkPhysicalDeviceProperties& physicalDeviceProperties() const;\n\n    // memory properties\n    const VkPhysicalDeviceMemoryProperties& physicalDeviceMemoryProperties() const;\n    const VkPhysicalDeviceMemoryProperties& physical_device_memory_properties() const; // api compatibility\n\n    // extension properties\n    const std::vector<VkExtensionProperties>& deviceExtensionProperties() 
const;\n\n    // info\n    uint32_t api_version() const;\n    uint32_t driver_version() const;\n    uint32_t vendor_id() const;\n    uint32_t device_id() const;\n    const char* device_name() const;\n    uint8_t* pipeline_cache_uuid() const;\n\n    // driver properties\n    uint32_t driver_id() const;\n    const char* driver_name() const;\n\n    // 0 = discrete gpu\n    // 1 = integrated gpu\n    // 2 = virtual gpu\n    // 3 = cpu\n    int type() const;\n\n    // performance score roughly evaluated based on parameters such as device type,\n    // supported extensions, video memory size etc.\n    // high-end device scores over 75\n    // low-end device scores below 10\n    uint32_t rough_score() const;\n\n    // hardware limit\n    uint32_t max_shared_memory_size() const;\n    uint32_t max_workgroup_count_x() const;\n    uint32_t max_workgroup_count_y() const;\n    uint32_t max_workgroup_count_z() const;\n    uint32_t max_workgroup_invocations() const;\n    uint32_t max_workgroup_size_x() const;\n    uint32_t max_workgroup_size_y() const;\n    uint32_t max_workgroup_size_z() const;\n    size_t memory_map_alignment() const;\n    size_t buffer_offset_alignment() const;\n    size_t non_coherent_atom_size() const;\n    size_t buffer_image_granularity() const;\n    uint32_t max_image_dimension_1d() const;\n    uint32_t max_image_dimension_2d() const;\n    uint32_t max_image_dimension_3d() const;\n    float timestamp_period() const;\n\n    // runtime\n    uint32_t compute_queue_family_index() const;\n    uint32_t transfer_queue_family_index() const;\n\n    uint32_t compute_queue_count() const;\n    uint32_t transfer_queue_count() const;\n\n    // property\n    bool unified_compute_transfer_queue() const;\n    bool resizable_bar_enabled() const;\n\n    // subgroup\n    uint32_t subgroup_size() const;\n    uint32_t min_subgroup_size() const;\n    uint32_t max_subgroup_size() const;\n    uint32_t max_compute_workgroup_subgroups() const;\n    bool 
support_subgroup_size_control() const;\n    bool support_compute_full_subgroups() const;\n    uint32_t support_subgroup_ops() const;\n\n    // bug is not feature\n    bool bug_storage_buffer_no_l1() const;\n    bool bug_corrupted_online_pipeline_cache() const;\n    bool bug_buffer_image_load_zero() const;\n\n    // but sometimes bug is a feature\n    bool bug_implicit_fp16_arithmetic() const;\n\n    // fp16 and int8 feature\n    bool support_fp16_packed() const;\n    bool support_fp16_storage() const;\n    bool support_fp16_uniform() const;\n    bool support_fp16_arithmetic() const;\n    bool support_int8_packed() const;\n    bool support_int8_storage() const;\n    bool support_int8_uniform() const;\n    bool support_int8_arithmetic() const;\n\n    // bf16 feature\n    bool support_bf16_packed() const;\n    bool support_bf16_storage() const; // bf16s implies bf16u\n\n    // r16f and r8s format in storage image\n    bool support_fp16_image() const;\n    bool support_int8_image() const;\n\n    // shader float controls2\n    bool support_fp_fast_math() const;\n\n    // ycbcr conversion feature\n    bool support_ycbcr_conversion() const;\n\n    // cooperative matrix feature\n    bool support_cooperative_matrix() const;\n    bool support_cooperative_matrix_8_8_16() const;\n    bool support_cooperative_matrix_16_8_8() const;\n    bool support_cooperative_matrix_16_8_16() const;\n    bool support_cooperative_matrix_16_16_16() const;\n\n    // bf16 cooperative matrix feature\n    bool support_bf16_cooperative_matrix() const;\n\n    // extension capability\n    int support_VK_KHR_8bit_storage() const;\n    int support_VK_KHR_16bit_storage() const;\n    int support_VK_KHR_bind_memory2() const;\n    int support_VK_KHR_buffer_device_address() const;\n    int support_VK_KHR_create_renderpass2() const;\n    int support_VK_KHR_cooperative_matrix() const;\n    int support_VK_KHR_dedicated_allocation() const;\n    int support_VK_KHR_descriptor_update_template() const;\n    int 
support_VK_KHR_driver_properties() const;\n    int support_VK_KHR_external_memory() const;\n    int support_VK_KHR_get_memory_requirements2() const;\n    int support_VK_KHR_maintenance1() const;\n    int support_VK_KHR_maintenance2() const;\n    int support_VK_KHR_maintenance3() const;\n    int support_VK_KHR_multiview() const;\n    int support_VK_KHR_portability_subset() const;\n    int support_VK_KHR_push_descriptor() const;\n    int support_VK_KHR_robustness2() const;\n    int support_VK_KHR_sampler_ycbcr_conversion() const;\n    int support_VK_KHR_shader_bfloat16() const;\n    int support_VK_KHR_shader_float16_int8() const;\n    int support_VK_KHR_shader_float_controls() const;\n    int support_VK_KHR_shader_float_controls2() const;\n    int support_VK_KHR_shader_integer_dot_product() const;\n    int support_VK_KHR_shader_non_semantic_info() const;\n    int support_VK_KHR_shader_subgroup_extended_types() const;\n    int support_VK_KHR_shader_subgroup_rotate() const;\n    int support_VK_KHR_storage_buffer_storage_class() const;\n    int support_VK_KHR_swapchain() const;\n    int support_VK_KHR_vulkan_memory_model() const;\n    int support_VK_KHR_zero_initialize_workgroup_memory() const;\n    int support_VK_EXT_buffer_device_address() const;\n    int support_VK_EXT_descriptor_indexing() const;\n    int support_VK_EXT_external_memory_host() const;\n    int support_VK_EXT_memory_budget() const;\n    int support_VK_EXT_memory_priority() const;\n    int support_VK_EXT_queue_family_foreign() const;\n    int support_VK_EXT_robustness2() const;\n    int support_VK_EXT_shader_atomic_float() const;\n    int support_VK_EXT_shader_atomic_float2() const;\n    int support_VK_EXT_shader_float8() const;\n    int support_VK_EXT_subgroup_size_control() const;\n    int support_VK_AMD_device_coherent_memory() const;\n#if __ANDROID_API__ >= 26\n    int support_VK_ANDROID_external_memory_android_hardware_buffer() const;\n#endif // __ANDROID_API__ >= 26\n    int 
support_VK_NV_cooperative_matrix() const;\n    int support_VK_NV_cooperative_matrix2() const;\n    int support_VK_NV_cooperative_vector() const;\n\n    // extension features\n    const void* queryExtensionFeatures() const;\n    const VkPhysicalDevice8BitStorageFeaturesKHR& query8BitStorageFeatures() const;\n    const VkPhysicalDevice16BitStorageFeaturesKHR& query16BitStorageFeatures() const;\n    const VkPhysicalDeviceFloat16Int8FeaturesKHR& queryFloat16Int8Features() const;\n    const VkPhysicalDeviceSamplerYcbcrConversionFeaturesKHR& querySamplerYcbcrConversionFeatures() const;\n    const VkPhysicalDeviceCooperativeMatrixFeaturesKHR& queryCooperativeMatrixFeatures() const;\n    const VkPhysicalDeviceCooperativeMatrixFeaturesNV& queryCooperativeMatrixFeaturesNV() const;\n    const VkPhysicalDeviceCooperativeMatrix2FeaturesNV& queryCooperativeMatrix2FeaturesNV() const;\n    const VkPhysicalDeviceCooperativeVectorFeaturesNV& queryCooperativeVectorFeaturesNV() const;\n    const VkPhysicalDeviceRobustness2FeaturesKHR& queryRobustness2Features() const;\n    const VkPhysicalDeviceSubgroupSizeControlFeaturesEXT& querySubgroupSizeControlFeatures() const;\n    const VkPhysicalDeviceShaderBfloat16FeaturesKHR& queryShaderBfloat16Features() const;\n    const VkPhysicalDeviceShaderFloat8FeaturesEXT& queryShaderFloat8Features() const;\n    const VkPhysicalDeviceShaderFloatControls2FeaturesKHR& queryShaderFloatControls2Features() const;\n    const VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR& queryShaderIntegerDotProductFeatures() const;\n    const VkPhysicalDeviceShaderSubgroupRotateFeaturesKHR& queryShaderSubgroupRotateFeatures() const;\n    const VkPhysicalDeviceShaderAtomicFloatFeaturesEXT& queryShaderAtomicFloatFeatures() const;\n    const VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT& queryShaderAtomicFloat2Features() const;\n    const VkPhysicalDeviceVulkanMemoryModelFeaturesKHR& queryVulkanMemoryModelFeatures() const;\n\n    // extension properties\n    const void* 
queryExtensionProperties() const;\n    const VkPhysicalDeviceCooperativeMatrix2PropertiesNV& queryCooperativeMatrix2PropertiesNV() const;\n    const VkPhysicalDeviceCooperativeVectorPropertiesNV& queryCooperativeVectorPropertiesNV() const;\n    const VkPhysicalDeviceDriverPropertiesKHR& queryDriverProperties() const;\n    const VkPhysicalDeviceFloatControlsPropertiesKHR& queryFloatControlsProperties() const;\n    const VkPhysicalDeviceRobustness2PropertiesKHR& queryRobustness2Properties() const;\n    const VkPhysicalDeviceShaderIntegerDotProductProperties& queryShaderIntegerDotProductProperties() const;\n    const VkPhysicalDeviceSubgroupProperties& querySubgroupProperties() const;\n    const VkPhysicalDeviceSubgroupSizeControlPropertiesEXT& querySubgroupSizeControlProperties() const;\n    const VkPhysicalDeviceExternalMemoryHostPropertiesEXT& queryExternalMemoryHostProperties() const;\n\n    // extension sub properties\n    const std::vector<VkCooperativeMatrixPropertiesKHR>& queryCooperativeMatrixSubProperties() const;\n    const std::vector<VkCooperativeMatrixPropertiesNV>& queryCooperativeMatrixSubPropertiesNV() const;\n    const std::vector<VkCooperativeMatrixFlexibleDimensionsPropertiesNV>& queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV() const;\n    const std::vector<VkCooperativeVectorPropertiesNV>& queryCooperativeVectorSubPropertiesNV() const;\n\n    // some utility functions\n    void get_optimal_cooperative_matrix_mnk(int M, int N, int K, VkComponentTypeKHR type, VkComponentTypeKHR acctype, VkScopeKHR scope, int& coopmat_M, int& coopmat_N, int& coopmat_K, int& coopmat_subgroup_size) const;\n\nprivate:\n    GpuInfo(const GpuInfo&);\n    GpuInfo& operator=(const GpuInfo&);\n\nprivate:\n    friend int create_gpu_instance(const char* driver_path);\n    GpuInfoPrivate* const d;\n};\n\nNCNN_EXPORT const GpuInfo& get_gpu_info(int device_index = get_default_gpu_index());\n\nclass VkAllocator;\nclass VkCompute;\nclass Option;\nclass 
PipelineCache;\nclass VulkanDevicePrivate;\nclass NCNN_EXPORT VulkanDevice\n{\npublic:\n    VulkanDevice(int device_index = get_default_gpu_index());\n    ~VulkanDevice();\n\n    const GpuInfo& info;\n\n    VkDevice vkdevice() const;\n    bool is_valid() const;\n\n    VkShaderModule compile_shader_module(const uint32_t* spv_data, size_t spv_data_size) const;\n\n    // with fixed workgroup size\n    VkShaderModule compile_shader_module(const uint32_t* spv_data, size_t spv_data_size, uint32_t local_size_x, uint32_t local_size_y, uint32_t local_size_z) const;\n\n    // helper for creating pipeline\n    int create_descriptorset_layout(int binding_count, const int* binding_types, VkDescriptorSetLayout* descriptorset_layout) const;\n    int create_pipeline_layout(int push_constant_count, VkDescriptorSetLayout descriptorset_layout, VkPipelineLayout* pipeline_layout) const;\n    int create_pipeline(VkShaderModule shader_module, VkPipelineLayout pipeline_layout, const std::vector<vk_specialization_type>& specializations, uint32_t subgroup_size, VkPipeline* pipeline) const;\n    int create_descriptor_update_template(int binding_count, const int* binding_types, VkDescriptorSetLayout descriptorset_layout, VkPipelineLayout pipeline_layout, VkDescriptorUpdateTemplateKHR* descriptor_update_template) const;\n\n    uint32_t find_memory_index(uint32_t memory_type_bits, VkFlags required, VkFlags preferred, VkFlags preferred_not) const;\n    bool is_mappable(uint32_t memory_type_index) const;\n    bool is_coherent(uint32_t memory_type_index) const;\n    bool is_device_local(uint32_t memory_type_index) const;\n\n    VkQueue acquire_queue(uint32_t queue_family_index) const;\n    void reclaim_queue(uint32_t queue_family_index, VkQueue queue) const;\n\n    // allocator on this device\n    VkAllocator* acquire_blob_allocator() const;\n    void reclaim_blob_allocator(VkAllocator* allocator) const;\n\n    VkAllocator* acquire_staging_allocator() const;\n    void 
reclaim_staging_allocator(VkAllocator* allocator) const;\n\n    // immutable sampler for texelfetch\n    const VkSampler* immutable_texelfetch_sampler() const;\n\n    // dummy buffer image\n    VkMat get_dummy_buffer() const;\n    VkImageMat get_dummy_image() const;\n    VkImageMat get_dummy_image_readonly() const;\n\n    // pipeline cache on this device\n    const PipelineCache* get_pipeline_cache() const;\n\n    // test image allocation\n    bool shape_support_image_storage(const Mat& shape) const;\n\n    // current gpu heap memory budget in MB\n    uint32_t get_heap_budget() const;\n\n    // utility operator\n    void convert_packing(const VkMat& src, VkMat& dst, int dst_elempack, VkCompute& cmd, const Option& opt) const;\n    // cast_type_to   0=auto(same as src)  1=fp32  2=fp16  3=int32  4=int8  5=bf16\n    void convert_packing(const VkMat& src, VkMat& dst, int dst_elempack, int cast_type_to, VkCompute& cmd, const Option& opt) const;\n\n    // VK_KHR_bind_memory2\n    PFN_vkBindBufferMemory2KHR vkBindBufferMemory2KHR;\n    PFN_vkBindImageMemory2KHR vkBindImageMemory2KHR;\n\n    // VK_KHR_buffer_device_address\n    PFN_vkGetBufferDeviceAddressKHR vkGetBufferDeviceAddressKHR;\n    PFN_vkGetBufferOpaqueCaptureAddressKHR vkGetBufferOpaqueCaptureAddressKHR;\n    PFN_vkGetDeviceMemoryOpaqueCaptureAddressKHR vkGetDeviceMemoryOpaqueCaptureAddressKHR;\n\n    // VK_KHR_descriptor_update_template\n    PFN_vkCreateDescriptorUpdateTemplateKHR vkCreateDescriptorUpdateTemplateKHR;\n    PFN_vkDestroyDescriptorUpdateTemplateKHR vkDestroyDescriptorUpdateTemplateKHR;\n    PFN_vkUpdateDescriptorSetWithTemplateKHR vkUpdateDescriptorSetWithTemplateKHR;\n\n    // VK_KHR_get_memory_requirements2\n    PFN_vkGetImageMemoryRequirements2KHR vkGetImageMemoryRequirements2KHR;\n    PFN_vkGetBufferMemoryRequirements2KHR vkGetBufferMemoryRequirements2KHR;\n\n    // VK_KHR_maintenance1\n    PFN_vkTrimCommandPoolKHR vkTrimCommandPoolKHR;\n\n    // VK_KHR_maintenance3\n    
PFN_vkGetDescriptorSetLayoutSupportKHR vkGetDescriptorSetLayoutSupportKHR;\n\n    // VK_KHR_push_descriptor\n    PFN_vkCmdPushDescriptorSetWithTemplateKHR vkCmdPushDescriptorSetWithTemplateKHR;\n    PFN_vkCmdPushDescriptorSetKHR vkCmdPushDescriptorSetKHR;\n\n    // VK_KHR_sampler_ycbcr_conversion\n    PFN_vkCreateSamplerYcbcrConversionKHR vkCreateSamplerYcbcrConversionKHR;\n    PFN_vkDestroySamplerYcbcrConversionKHR vkDestroySamplerYcbcrConversionKHR;\n\n    // VK_KHR_swapchain\n    PFN_vkCreateSwapchainKHR vkCreateSwapchainKHR;\n    PFN_vkDestroySwapchainKHR vkDestroySwapchainKHR;\n    PFN_vkGetSwapchainImagesKHR vkGetSwapchainImagesKHR;\n    PFN_vkAcquireNextImageKHR vkAcquireNextImageKHR;\n    PFN_vkQueuePresentKHR vkQueuePresentKHR;\n\n    // VK_EXT_buffer_device_address\n    PFN_vkGetBufferDeviceAddressEXT vkGetBufferDeviceAddressEXT;\n\n    // VK_EXT_external_memory_host\n    PFN_vkGetMemoryHostPointerPropertiesEXT vkGetMemoryHostPointerPropertiesEXT;\n\n#if __ANDROID_API__ >= 26\n    // VK_ANDROID_external_memory_android_hardware_buffer\n    PFN_vkGetAndroidHardwareBufferPropertiesANDROID vkGetAndroidHardwareBufferPropertiesANDROID;\n    PFN_vkGetMemoryAndroidHardwareBufferANDROID vkGetMemoryAndroidHardwareBufferANDROID;\n#endif // __ANDROID_API__ >= 26\n\n    // VK_NV_cooperative_vector\n    PFN_vkCmdConvertCooperativeVectorMatrixNV vkCmdConvertCooperativeVectorMatrixNV;\n    PFN_vkConvertCooperativeVectorMatrixNV vkConvertCooperativeVectorMatrixNV;\n\nprotected:\n    // device extension\n    int init_device_extension();\n\nprivate:\n    VulkanDevice(const VulkanDevice&);\n    VulkanDevice& operator=(const VulkanDevice&);\n\nprivate:\n    VulkanDevicePrivate* const d;\n};\n\nNCNN_EXPORT VulkanDevice* get_gpu_device(int device_index = get_default_gpu_index());\n\n// online spirv compilation\nNCNN_EXPORT int compile_spirv_module(const char* comp_string, const Option& opt, std::vector<uint32_t>& spirv);\nNCNN_EXPORT int compile_spirv_module(const char* 
comp_data, int comp_data_size, const Option& opt, std::vector<uint32_t>& spirv);\nNCNN_EXPORT int compile_spirv_module(int shader_type_index, const Option& opt, std::vector<uint32_t>& spirv);\n\n// info from spirv\nclass NCNN_EXPORT ShaderInfo\n{\npublic:\n    int specialization_count;\n    int binding_count;\n    int push_constant_count;\n\n    // 0 = null\n    // 1 = storage buffer\n    // 2 = storage image\n    // 3 = combined image sampler\n    int binding_types[16]; // 16 is large enough I think ...\n\n    int reserved_0;\n    int reserved_1;\n    int reserved_2;\n    int reserved_3;\n};\n\nNCNN_EXPORT int resolve_shader_info(const uint32_t* spv_data, size_t spv_data_size, ShaderInfo& shader_info);\n\n} // namespace ncnn\n\n#endif // NCNN_VULKAN\n\n#endif // NCNN_GPU_H\n"
  },
  {
    "path": "src/layer/absval.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"absval.h\"\n\nnamespace ncnn {\n\nAbsVal::AbsVal()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint AbsVal::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    // include depth so 4-D blobs are fully covered, matching AbsVal_arm\n    int size = w * h * d;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            if (ptr[i] < 0)\n                ptr[i] = -ptr[i];\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/absval.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ABSVAL_H\n#define LAYER_ABSVAL_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass AbsVal : public Layer\n{\npublic:\n    AbsVal();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ABSVAL_H\n"
  },
  {
    "path": "src/layer/argmax.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"argmax.h\"\n\n#include <functional>\n\nnamespace ncnn {\n\nArgMax::ArgMax()\n{\n    one_blob_only = true;\n}\n\nint ArgMax::load_param(const ParamDict& pd)\n{\n    out_max_val = pd.get(0, 0);\n    topk = pd.get(1, 1);\n\n    return 0;\n}\n\nint ArgMax::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int size = bottom_blob.total();\n\n    if (out_max_val)\n        top_blob.create(topk, 2, 4u, opt.blob_allocator);\n    else\n        top_blob.create(topk, 1, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const float* ptr = bottom_blob;\n\n    // partial sort topk with index\n    // optional value\n    std::vector<std::pair<float, int> > vec;\n    vec.resize(size);\n    for (int i = 0; i < size; i++)\n    {\n        vec[i] = std::make_pair(ptr[i], i);\n    }\n\n    std::partial_sort(vec.begin(), vec.begin() + topk, vec.end(),\n                      std::greater<std::pair<float, int> >());\n\n    float* outptr = top_blob;\n    if (out_max_val)\n    {\n        float* valptr = outptr + topk;\n        for (int i = 0; i < topk; i++)\n        {\n            outptr[i] = vec[i].first;\n            valptr[i] = vec[i].second;\n        }\n    }\n    else\n    {\n        for (int i = 0; i < topk; i++)\n        {\n            outptr[i] = vec[i].second;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/argmax.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ARGMAX_H\n#define LAYER_ARGMAX_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass ArgMax : public Layer\n{\npublic:\n    ArgMax();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    int out_max_val;\n    int topk;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ARGMAX_H\n"
  },
  {
    "path": "src/layer/arm/absval_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"absval_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\nAbsVal_arm::AbsVal_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#endif // __ARM_NEON\n}\n\nint AbsVal_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                \"fabs   v0.4s, v0.4s            \\n\"\n                \"fabs   v1.4s, v1.4s            \\n\"\n                \"fabs   v2.4s, v2.4s            \\n\"\n                \"fabs   v3.4s, v3.4s            \\n\"\n                \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr)\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #512]      \\n\"\n                \"vldm       %0, {d0-d7}     \\n\"\n                \"vabs.f32   q0, q0          \\n\"\n                \"vabs.f32   q1, q1          \\n\"\n                \"vabs.f32   q2, q2          \\n\"\n                \"vabs.f32   q3, q3          \\n\"\n                \"vstm       %0!, {d0-d7}    \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr)\n                : \"memory\", \"q0\", \"q1\", 
\"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _p2 = vld1q_f32(ptr + 8);\n            float32x4_t _p3 = vld1q_f32(ptr + 12);\n            _p0 = vabsq_f32(_p0);\n            _p1 = vabsq_f32(_p1);\n            _p2 = vabsq_f32(_p2);\n            _p3 = vabsq_f32(_p3);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            vst1q_f32(ptr + 8, _p2);\n            vst1q_f32(ptr + 12, _p3);\n            ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            _p0 = vabsq_f32(_p0);\n            _p1 = vabsq_f32(_p1);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vabsq_f32(_p);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = *ptr > 0 ? *ptr : -*ptr;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/absval_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ABSVAL_ARM_H\n#define LAYER_ABSVAL_ARM_H\n\n#include \"absval.h\"\n\nnamespace ncnn {\n\nclass AbsVal_arm : public AbsVal\n{\npublic:\n    AbsVal_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ABSVAL_ARM_H\n"
  },
  {
    "path": "src/layer/arm/arm_activation.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef ARM_ACTIVATION_H\n#define ARM_ACTIVATION_H\n\n#include \"fused_activation.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n\nstatic inline float32x4_t activation_ps(float32x4_t _v, int activation_type, const ncnn::Mat& activation_params)\n{\n    if (activation_type == 1)\n    {\n        const float32x4_t _zero = vdupq_n_f32(0.f);\n        _v = vmaxq_f32(_v, _zero);\n    }\n    else if (activation_type == 2)\n    {\n        const float32x4_t _zero = vdupq_n_f32(0.f);\n        const float32x4_t _slope = vdupq_n_f32(activation_params[0]);\n        const uint32x4_t _lemask = vcleq_f32(_v, _zero);\n        float32x4_t _ps = vmulq_f32(_v, _slope);\n        _v = vbslq_f32(_lemask, _ps, _v);\n    }\n    else if (activation_type == 3)\n    {\n        const float32x4_t _min = vdupq_n_f32(activation_params[0]);\n        const float32x4_t _max = vdupq_n_f32(activation_params[1]);\n        _v = vmaxq_f32(_v, _min);\n        _v = vminq_f32(_v, _max);\n    }\n    else if (activation_type == 4)\n    {\n        _v = sigmoid_ps(_v);\n    }\n    else if (activation_type == 5)\n    {\n        _v = vmulq_f32(_v, tanh_ps(log_ps(vaddq_f32(exp_ps(_v), vdupq_n_f32(1.f)))));\n    }\n    else if (activation_type == 6)\n    {\n        const float alpha = activation_params[0];\n        const float beta = activation_params[1];\n        const float32x4_t _zero = vdupq_n_f32(0.f);\n        const float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _ans = vdupq_n_f32(beta);\n        _ans = vmlaq_n_f32(_ans, _v, alpha);\n        _ans = vmaxq_f32(_ans, _zero);\n        _ans = vminq_f32(_ans, _one);\n        _v = vmulq_f32(_ans, _v);\n    }\n\n    return _v;\n}\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"arm_usability.h\"\n#include \"neon_mathfun_fp16s.h\"\n\nstatic inline __fp16 activation_ss_f16(__fp16 v, int activation_type, const ncnn::Mat& 
activation_params)\n{\n    if (activation_type == 1)\n    {\n        v = std::max(v, (__fp16)0.f);\n    }\n    else if (activation_type == 2)\n    {\n        const __fp16 slope = (__fp16)(activation_params[0]);\n        v = v > 0.f ? v : v * slope;\n    }\n    else if (activation_type == 3)\n    {\n        const __fp16 min = (__fp16)(activation_params[0]);\n        const __fp16 max = (__fp16)(activation_params[1]);\n        if (v < min)\n            v = min;\n        if (v > max)\n            v = max;\n    }\n    else if (activation_type == 4)\n    {\n        v = (__fp16)1.f / ((__fp16)1.f + (__fp16)expf(-v));\n    }\n    else if (activation_type == 5)\n    {\n        v = v * (__fp16)tanhf(logf(expf((float)v) + 1.f));\n    }\n    else if (activation_type == 6)\n    {\n        const __fp16 alpha = (__fp16)(activation_params[0]);\n        const __fp16 beta = (__fp16)(activation_params[1]);\n        const __fp16 lower = -beta / alpha;\n        const __fp16 upper = ((__fp16)1.f / alpha) + lower;\n        if (v < lower)\n            v = (__fp16)0.f;\n        else if (v > upper)\n            ;\n        else\n            v = v * (v * alpha + beta);\n    }\n\n    return v;\n}\n\nstatic inline float16x4_t activation_ps_f16(float16x4_t _v, int activation_type, const ncnn::Mat& activation_params)\n{\n    if (activation_type == 1)\n    {\n        const float16x4_t _zero = vdup_n_f16(0.f);\n        _v = vmax_f16(_v, _zero);\n    }\n    else if (activation_type == 2)\n    {\n        const float16x4_t _zero = vdup_n_f16(0.f);\n#if defined(_MSC_VER) && !defined(__clang__)\n        const float16x4_t _slope = vcvt_f16_f32(vdupq_n_f32(activation_params[0]));\n#else\n        const float16x4_t _slope = vdup_n_f16((__fp16)activation_params[0]);\n#endif\n        const uint16x4_t _lemask = vcle_f16(_v, _zero);\n        float16x4_t _ps = vmul_f16(_v, _slope);\n        _v = vbsl_f16(_lemask, _ps, _v);\n    }\n    else if (activation_type == 3)\n    {\n        const float16x4_t _min = 
vdup_n_f16((__fp16)activation_params[0]);\n        const float16x4_t _max = vdup_n_f16((__fp16)activation_params[1]);\n        _v = vmax_f16(_v, _min);\n        _v = vmin_f16(_v, _max);\n    }\n    else if (activation_type == 4)\n    {\n        _v = sigmoid_ps_f16(_v);\n    }\n    else if (activation_type == 5)\n    {\n        _v = vmul_f16(_v, tanh_ps_f16(log_ps_f16(vadd_f16(exp_ps_f16(_v), vdup_n_f16(1.f)))));\n    }\n    else if (activation_type == 6)\n    {\n        const __fp16 alpha = (__fp16)activation_params[0];\n        const __fp16 beta = (__fp16)activation_params[1];\n        const float16x4_t _zero = vdup_n_f16(0.f);\n        const float16x4_t _one = vdup_n_f16(1.f);\n        float16x4_t _ans = vdup_n_f16(beta);\n        _ans = vfma_n_f16(_ans, _v, alpha);\n        _ans = vmax_f16(_ans, _zero);\n        _ans = vmin_f16(_ans, _one);\n        _v = vmul_f16(_ans, _v);\n    }\n\n    return _v;\n}\n\nstatic inline float16x8_t activation_ps_f16(float16x8_t _v, int activation_type, const ncnn::Mat& activation_params)\n{\n    if (activation_type == 1)\n    {\n        const float16x8_t _zero = vdupq_n_f16(0.f);\n        _v = vmaxq_f16(_v, _zero);\n    }\n    else if (activation_type == 2)\n    {\n        const float16x8_t _zero = vdupq_n_f16(0.f);\n#if defined(_MSC_VER) && !defined(__clang__)\n        const float16x4_t _slope0 = vcvt_f16_f32(vdupq_n_f32(activation_params[0]));\n        const float16x8_t _slope = vcombine_f16(_slope0, _slope0);\n#else\n        const float16x8_t _slope = vdupq_n_f16((__fp16)activation_params[0]);\n#endif\n        const uint16x8_t _lemask = vcleq_f16(_v, _zero);\n        float16x8_t _ps = vmulq_f16(_v, _slope);\n        _v = vbslq_f16(_lemask, _ps, _v);\n    }\n    else if (activation_type == 3)\n    {\n        const float16x8_t _min = vdupq_n_f16((__fp16)activation_params[0]);\n        const float16x8_t _max = vdupq_n_f16((__fp16)activation_params[1]);\n        _v = vmaxq_f16(_v, _min);\n        _v = vminq_f16(_v, _max);\n    }\n  
  else if (activation_type == 4)\n    {\n        _v = sigmoid_ps_f16(_v);\n    }\n    else if (activation_type == 5)\n    {\n        _v = vmulq_f16(_v, tanh_ps_f16(log_ps_f16(vaddq_f16(exp_ps_f16(_v), vdupq_n_f16(1.f)))));\n    }\n    else if (activation_type == 6)\n    {\n        const __fp16 alpha_fp16 = (__fp16)activation_params[0];\n        const __fp16 beta_fp16 = (__fp16)activation_params[1];\n        const float16x8_t _zero = vdupq_n_f16(0.f);\n        const float16x8_t _one = vdupq_n_f16(1.f);\n        float16x8_t _ans = vdupq_n_f16(beta_fp16);\n        _ans = vfmaq_n_f16(_ans, _v, alpha_fp16);\n        _ans = vmaxq_f16(_ans, _zero);\n        _ans = vminq_f16(_ans, _one);\n        _v = vmulq_f16(_ans, _v);\n    }\n    return _v;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#endif // __ARM_NEON\n\n#endif // ARM_ACTIVATION_H\n"
  },
  {
    "path": "src/layer/arm/arm_usability.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef ARM_USABILITY_H\n#define ARM_USABILITY_H\n\nstatic inline signed char float2int8(float v)\n{\n    int int32 = (int)roundf(v);\n    if (int32 > 127) return 127;\n    if (int32 < -127) return -127;\n    return (signed char)int32;\n}\n\n#if __ARM_NEON\n#include <arm_neon.h>\n\nstatic inline uint16x4_t float2bfloat(float32x4_t _v)\n{\n    return vshrn_n_u32(vreinterpretq_u32_f32(_v), 16);\n}\nstatic inline float32x4_t bfloat2float(uint16x4_t _v)\n{\n    return vreinterpretq_f32_u32(vshll_n_u16(_v, 16));\n}\n\nstatic inline int8x8_t float2int8(float32x4_t _vlow, float32x4_t _vhigh)\n{\n#if __aarch64__\n    int32x4_t _vlow32 = vcvtaq_s32_f32(_vlow);\n    int32x4_t _vhigh32 = vcvtaq_s32_f32(_vhigh);\n#else\n    // vcvtq_s32_f32 is round to zero\n    // simulate round to nearest via +/-0.5\n    float32x4_t _p5 = vdupq_n_f32(0.5f);\n    int32x4_t _signmask = vdupq_n_s32(1 << 31);\n    int32x4_t _signlow = vandq_s32(vreinterpretq_s32_f32(_vlow), _signmask);\n    int32x4_t _signhigh = vandq_s32(vreinterpretq_s32_f32(_vhigh), _signmask);\n    float32x4_t _p5low = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signlow));\n    float32x4_t _p5high = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signhigh));\n    float32x4_t _vlow5 = vaddq_f32(_vlow, _p5low);\n    float32x4_t _vhigh5 = vaddq_f32(_vhigh, _p5high);\n    int32x4_t _vlow32 = vcvtq_s32_f32(_vlow5);\n    int32x4_t _vhigh32 = vcvtq_s32_f32(_vhigh5);\n#endif\n    int16x8_t _v16 = vcombine_s16(vqmovn_s32(_vlow32), vqmovn_s32(_vhigh32));\n    int8x8_t _v8 = vqmovn_s16(_v16);\n    return vmax_s8(_v8, vdup_n_s8(-127));\n}\n\nstatic inline int8x8_t float2int8relu(float32x4_t _vlow, float32x4_t _vhigh)\n{\n#if __aarch64__\n    int32x4_t _vlow32 = vcvtaq_s32_f32(_vlow);\n    int32x4_t _vhigh32 = vcvtaq_s32_f32(_vhigh);\n#else\n    // vcvtq_s32_f32 is round to zero\n    // simulate round to nearest via 
+/-0.5\n    float32x4_t _p5 = vdupq_n_f32(0.5f);\n    int32x4_t _signmask = vdupq_n_s32(1 << 31);\n    int32x4_t _signlow = vandq_s32(vreinterpretq_s32_f32(_vlow), _signmask);\n    int32x4_t _signhigh = vandq_s32(vreinterpretq_s32_f32(_vhigh), _signmask);\n    float32x4_t _p5low = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signlow));\n    float32x4_t _p5high = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signhigh));\n    float32x4_t _vlow5 = vaddq_f32(_vlow, _p5low);\n    float32x4_t _vhigh5 = vaddq_f32(_vhigh, _p5high);\n    int32x4_t _vlow32 = vcvtq_s32_f32(_vlow5);\n    int32x4_t _vhigh32 = vcvtq_s32_f32(_vhigh5);\n#endif\n    int16x8_t _v16 = vcombine_s16(vqmovn_s32(_vlow32), vqmovn_s32(_vhigh32));\n    int8x8_t _v8 = vqmovn_s16(_v16);\n    return vmax_s8(_v8, vdup_n_s8(0));\n}\n\nstatic inline int8x8_t float2int8leakyrelu(float32x4_t _vlow, float32x4_t _vhigh, float32x4_t _slope)\n{\n    float32x4_t _vlow_leaky = vmulq_f32(_vlow, _slope);\n    float32x4_t _vhigh_leaky = vmulq_f32(_vhigh, _slope);\n#if __aarch64__\n    int32x4_t _vlow32 = vcvtaq_s32_f32(_vlow);\n    int32x4_t _vhigh32 = vcvtaq_s32_f32(_vhigh);\n    int32x4_t _vlow32_leaky = vcvtaq_s32_f32(_vlow_leaky);\n    int32x4_t _vhigh32_leaky = vcvtaq_s32_f32(_vhigh_leaky);\n#else\n    // vcvtq_s32_f32 is round to zero\n    // simulate round to nearest via +/-0.5\n    float32x4_t _p5 = vdupq_n_f32(0.5f);\n    int32x4_t _signmask = vdupq_n_s32(1 << 31);\n    int32x4_t _signlow = vandq_s32(vreinterpretq_s32_f32(_vlow), _signmask);\n    int32x4_t _signhigh = vandq_s32(vreinterpretq_s32_f32(_vhigh), _signmask);\n    float32x4_t _p5low = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signlow));\n    float32x4_t _p5high = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signhigh));\n    float32x4_t _vlow5 = vaddq_f32(_vlow, _p5low);\n    float32x4_t _vhigh5 = vaddq_f32(_vhigh, _p5high);\n    int32x4_t _vlow32 = vcvtq_s32_f32(_vlow5);\n    int32x4_t 
_vhigh32 = vcvtq_s32_f32(_vhigh5);\n\n    int32x4_t _signlow_leaky = vandq_s32(vreinterpretq_s32_f32(_vlow_leaky), _signmask);\n    int32x4_t _signhigh_leaky = vandq_s32(vreinterpretq_s32_f32(_vhigh_leaky), _signmask);\n    float32x4_t _p5low_leaky = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signlow_leaky));\n    float32x4_t _p5high_leaky = vreinterpretq_f32_s32(vorrq_s32(vreinterpretq_s32_f32(_p5), _signhigh_leaky));\n    float32x4_t _vlow5_leaky = vaddq_f32(_vlow_leaky, _p5low_leaky);\n    float32x4_t _vhigh5_leaky = vaddq_f32(_vhigh_leaky, _p5high_leaky);\n    int32x4_t _vlow32_leaky = vcvtq_s32_f32(_vlow5_leaky);\n    int32x4_t _vhigh32_leaky = vcvtq_s32_f32(_vhigh5_leaky);\n#endif\n    int16x8_t _v16 = vcombine_s16(vqmovn_s32(_vlow32), vqmovn_s32(_vhigh32));\n    int16x8_t _v16_leaky = vcombine_s16(vqmovn_s32(_vlow32_leaky), vqmovn_s32(_vhigh32_leaky));\n    int8x8_t _v8 = vqmovn_s16(_v16);\n    int8x8_t _v8_leaky = vqmovn_s16(_v16_leaky);\n    return vmax_s8(_v8, _v8_leaky);\n}\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#if defined(_MSC_VER) && !defined(__clang__)\nstruct __fp16\n{\n    __fp16()\n    {\n        _u16 = 0;\n    }\n\n    __fp16(float f32)\n    {\n        _u16 = vget_lane_u16(vreinterpretq_u16_f16(vcvt_f16_f32(vdupq_n_f32(f32))), 0);\n    }\n\n    __fp16(__n16 n16)\n    {\n        _u16 = n16.n16_u16[0];\n    }\n\n    operator const float() const\n    {\n        return vgetq_lane_f32(vcvt_f32_f16(vreinterpretq_f16_u16(vdup_n_u16(_u16))), 0);\n    }\n\n    __fp16& operator+=(const __fp16& b)\n    {\n        float a = (float)*this;\n        float f32 = (a + (float)b);\n        _u16 = vget_lane_u16(vreinterpretq_u16_f16(vcvt_f16_f32(vdupq_n_f32(f32))), 0);\n        return *this;\n    }\n\n    __fp16& operator-=(const __fp16& b)\n    {\n        float a = (float)*this;\n        float f32 = (a - (float)b);\n        _u16 = vget_lane_u16(vreinterpretq_u16_f16(vcvt_f16_f32(vdupq_n_f32(f32))), 0);\n        return *this;\n    }\n\n    
__fp16& operator*=(const __fp16& b)\n    {\n        float a = (float)*this;\n        float f32 = (a * (float)b);\n        _u16 = vget_lane_u16(vreinterpretq_u16_f16(vcvt_f16_f32(vdupq_n_f32(f32))), 0);\n        return *this;\n    }\n\n    __fp16& operator/=(const __fp16& b)\n    {\n        float a = (float)*this;\n        float f32 = (a / (float)b);\n        _u16 = vget_lane_u16(vreinterpretq_u16_f16(vcvt_f16_f32(vdupq_n_f32(f32))), 0);\n        return *this;\n    }\n\n    unsigned short _u16;\n};\n\nstatic inline __fp16 operator-(const __fp16& a)\n{\n    return __fp16(-(float)a);\n}\nstatic inline __fp16 operator+(const __fp16& a, const __fp16& b)\n{\n    return __fp16((float)a + (float)b);\n}\nstatic inline __fp16 operator-(const __fp16& a, const __fp16& b)\n{\n    return __fp16((float)a - (float)b);\n}\nstatic inline __fp16 operator*(const __fp16& a, const __fp16& b)\n{\n    return __fp16((float)a * (float)b);\n}\nstatic inline __fp16 operator/(const __fp16& a, const __fp16& b)\n{\n    return __fp16((float)a / (float)b);\n}\n\nstatic inline float16x4_t vdup_n_f16(const __fp16& f16)\n{\n    return vreinterpret_f16_u16(vdup_n_u16(f16._u16));\n}\n\nstatic inline float16x8_t vdupq_n_f16(const __fp16& f16)\n{\n    return vreinterpretq_f16_u16(vdupq_n_u16(f16._u16));\n}\n\nstatic inline __fp16 vmaxv_f16(float16x4_t a)\n{\n    return __fp16(vmaxvq_f32(vcvt_f32_f16(a)));\n}\n\nstatic inline __fp16 vmaxvq_f16(float16x8_t a)\n{\n    float x = vmaxvq_f32(vcvt_f32_f16(vget_low_f16(a)));\n    float y = vmaxvq_f32(vcvt_f32_f16(vget_high_f16(a)));\n    return __fp16(x > y ? 
x : y);\n}\n\n#define vld1q_f16 vld1q_u16\n#define vst1q_f16 vst1q_u16\n\n#define vld2_f16 vld2_u16\n#define vst2_f16 vst2_u16\n\n#define vld2q_f16 vld2q_u16\n#define vst2q_f16 vst2q_u16\n\n#define vld4_f16 vld4_u16\n#define vst4_f16 vst4_u16\n\n#define vld4q_f16 vld4q_u16\n#define vst4q_f16 vst4q_u16\n\n#define vld1q_dup_f16 vld1q_dup_u16\n\n#define vset_lane_f16(x, v, i)  vset_lane_u16(x._u16, (uint16x4_t)v, i)\n#define vsetq_lane_f16(x, v, i) vsetq_lane_u16(x._u16, (uint16x8_t)v, i)\n\n#define vfma_n_f16(va, vb, x)  vfma_f16(va, vb, vdup_n_f16(x))\n#define vfmaq_n_f16(va, vb, x) vfmaq_f16(va, vb, vdupq_n_f16(x))\n\n#endif\n\nstatic inline signed char float2int8(__fp16 v)\n{\n    int int32 = (int)roundf(v);\n    if (int32 > 127) return 127;\n    if (int32 < -127) return -127;\n    return (signed char)int32;\n}\n\nstatic inline int8x8_t float2int8(float16x8_t _v)\n{\n    int16x8_t _v16 = vcvtaq_s16_f16(_v);\n    int8x8_t _v8 = vqmovn_s16(_v16);\n    return vmax_s8(_v8, vdup_n_s8(-127));\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\nstatic inline void transpose4x4_u16(uint16x4_t& _r0, uint16x4_t& _r1, uint16x4_t& _r2, uint16x4_t& _r3)\n{\n    uint16x4x2_t _r01z = vzip_u16(_r0, _r1);\n    uint16x4x2_t _r23z = vzip_u16(_r2, _r3);\n    uint32x2x2_t _r01 = vzip_u32(vreinterpret_u32_u16(_r01z.val[0]), vreinterpret_u32_u16(_r23z.val[0]));\n    uint32x2x2_t _r23 = vzip_u32(vreinterpret_u32_u16(_r01z.val[1]), vreinterpret_u32_u16(_r23z.val[1]));\n    _r0 = vreinterpret_u16_u32(_r01.val[0]);\n    _r1 = vreinterpret_u16_u32(_r01.val[1]);\n    _r2 = vreinterpret_u16_u32(_r23.val[0]);\n    _r3 = vreinterpret_u16_u32(_r23.val[1]);\n}\n\nstatic inline void transpose4x8_u16(uint16x4_t& _r0, uint16x4_t& _r1, uint16x4_t& _r2, uint16x4_t& _r3, uint16x4_t& _r4, uint16x4_t& _r5, uint16x4_t& _r6, uint16x4_t& _r7)\n{\n    uint16x4x2_t _r01z = vzip_u16(_r0, _r1);\n    uint16x4x2_t _r23z = vzip_u16(_r2, _r3);\n    uint16x4x2_t _r45z = vzip_u16(_r4, _r5);\n    uint16x4x2_t _r67z = 
vzip_u16(_r6, _r7);\n    uint32x2x2_t _r01_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[0]), vreinterpret_u32_u16(_r23z.val[0]));\n    uint32x2x2_t _r23_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[1]), vreinterpret_u32_u16(_r23z.val[1]));\n    uint32x2x2_t _r01_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[0]), vreinterpret_u32_u16(_r67z.val[0]));\n    uint32x2x2_t _r23_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[1]), vreinterpret_u32_u16(_r67z.val[1]));\n    _r0 = vreinterpret_u16_u32(_r01_0.val[0]);\n    _r1 = vreinterpret_u16_u32(_r01_1.val[0]);\n    _r2 = vreinterpret_u16_u32(_r01_0.val[1]);\n    _r3 = vreinterpret_u16_u32(_r01_1.val[1]);\n    _r4 = vreinterpret_u16_u32(_r23_0.val[0]);\n    _r5 = vreinterpret_u16_u32(_r23_1.val[0]);\n    _r6 = vreinterpret_u16_u32(_r23_0.val[1]);\n    _r7 = vreinterpret_u16_u32(_r23_1.val[1]);\n}\n\nstatic inline void transpose4x12_u16(uint16x4_t& _r0, uint16x4_t& _r1, uint16x4_t& _r2, uint16x4_t& _r3, uint16x4_t& _r4, uint16x4_t& _r5, uint16x4_t& _r6, uint16x4_t& _r7, uint16x4_t& _r8, uint16x4_t& _r9, uint16x4_t& _ra, uint16x4_t& _rb)\n{\n    uint16x4x2_t _r01z = vzip_u16(_r0, _r1);\n    uint16x4x2_t _r23z = vzip_u16(_r2, _r3);\n    uint16x4x2_t _r45z = vzip_u16(_r4, _r5);\n    uint16x4x2_t _r67z = vzip_u16(_r6, _r7);\n    uint16x4x2_t _r89z = vzip_u16(_r8, _r9);\n    uint16x4x2_t _rabz = vzip_u16(_ra, _rb);\n    uint32x2x2_t _r01_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[0]), vreinterpret_u32_u16(_r23z.val[0]));\n    uint32x2x2_t _r23_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[1]), vreinterpret_u32_u16(_r23z.val[1]));\n    uint32x2x2_t _r01_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[0]), vreinterpret_u32_u16(_r67z.val[0]));\n    uint32x2x2_t _r23_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[1]), vreinterpret_u32_u16(_r67z.val[1]));\n    uint32x2x2_t _r01_2 = vzip_u32(vreinterpret_u32_u16(_r89z.val[0]), vreinterpret_u32_u16(_rabz.val[0]));\n    uint32x2x2_t _r23_2 = vzip_u32(vreinterpret_u32_u16(_r89z.val[1]), 
vreinterpret_u32_u16(_rabz.val[1]));\n    _r0 = vreinterpret_u16_u32(_r01_0.val[0]);\n    _r1 = vreinterpret_u16_u32(_r01_1.val[0]);\n    _r2 = vreinterpret_u16_u32(_r01_2.val[0]);\n    _r3 = vreinterpret_u16_u32(_r01_0.val[1]);\n    _r4 = vreinterpret_u16_u32(_r01_1.val[1]);\n    _r5 = vreinterpret_u16_u32(_r01_2.val[1]);\n    _r6 = vreinterpret_u16_u32(_r23_0.val[0]);\n    _r7 = vreinterpret_u16_u32(_r23_1.val[0]);\n    _r8 = vreinterpret_u16_u32(_r23_2.val[0]);\n    _r9 = vreinterpret_u16_u32(_r23_0.val[1]);\n    _ra = vreinterpret_u16_u32(_r23_1.val[1]);\n    _rb = vreinterpret_u16_u32(_r23_2.val[1]);\n}\n\nstatic inline void transpose8x4_u16(uint16x8_t& _r0, uint16x8_t& _r1, uint16x8_t& _r2, uint16x8_t& _r3)\n{\n    uint16x8x2_t _r01t = vzipq_u16(_r0, _r1);\n    uint16x8x2_t _r23t = vzipq_u16(_r2, _r3);\n    uint32x4x2_t _r01_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[0]), vreinterpretq_u32_u16(_r23t.val[0]));\n    uint32x4x2_t _r23_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[1]), vreinterpretq_u32_u16(_r23t.val[1]));\n    _r0 = vreinterpretq_u16_u32(_r01_0.val[0]);\n    _r1 = vreinterpretq_u16_u32(_r01_0.val[1]);\n    _r2 = vreinterpretq_u16_u32(_r23_0.val[0]);\n    _r3 = vreinterpretq_u16_u32(_r23_0.val[1]);\n}\n\nstatic inline void transpose8x8_u16(uint16x8_t& _r0, uint16x8_t& _r1, uint16x8_t& _r2, uint16x8_t& _r3, uint16x8_t& _r4, uint16x8_t& _r5, uint16x8_t& _r6, uint16x8_t& _r7)\n{\n    uint16x8x2_t _r01t = vzipq_u16(_r0, _r1);\n    uint16x8x2_t _r23t = vzipq_u16(_r2, _r3);\n    uint16x8x2_t _r45t = vzipq_u16(_r4, _r5);\n    uint16x8x2_t _r67t = vzipq_u16(_r6, _r7);\n    uint32x4x2_t _r01_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[0]), vreinterpretq_u32_u16(_r23t.val[0]));\n    uint32x4x2_t _r23_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[1]), vreinterpretq_u32_u16(_r23t.val[1]));\n    uint32x4x2_t _r01_1 = vzipq_u32(vreinterpretq_u32_u16(_r45t.val[0]), vreinterpretq_u32_u16(_r67t.val[0]));\n    uint32x4x2_t _r23_1 = 
vzipq_u32(vreinterpretq_u32_u16(_r45t.val[1]), vreinterpretq_u32_u16(_r67t.val[1]));\n    _r0 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r01_0.val[0]), vget_low_u32(_r01_1.val[0])));\n    _r1 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r01_0.val[0]), vget_high_u32(_r01_1.val[0])));\n    _r2 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r01_0.val[1]), vget_low_u32(_r01_1.val[1])));\n    _r3 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r01_0.val[1]), vget_high_u32(_r01_1.val[1])));\n    _r4 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r23_0.val[0]), vget_low_u32(_r23_1.val[0])));\n    _r5 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r23_0.val[0]), vget_high_u32(_r23_1.val[0])));\n    _r6 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r23_0.val[1]), vget_low_u32(_r23_1.val[1])));\n    _r7 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r23_0.val[1]), vget_high_u32(_r23_1.val[1])));\n}\n\nstatic inline void transpose8x12_u16(uint16x8_t& _r0, uint16x8_t& _r1, uint16x8_t& _r2, uint16x8_t& _r3, uint16x8_t& _r4, uint16x8_t& _r5, uint16x8_t& _r6, uint16x8_t& _r7, uint16x8_t& _r8, uint16x8_t& _r9, uint16x8_t& _ra, uint16x8_t& _rb)\n{\n    uint16x8x2_t _r01t = vzipq_u16(_r0, _r1);\n    uint16x8x2_t _r23t = vzipq_u16(_r2, _r3);\n    uint16x8x2_t _r45t = vzipq_u16(_r4, _r5);\n    uint16x8x2_t _r67t = vzipq_u16(_r6, _r7);\n    uint16x8x2_t _r89t = vzipq_u16(_r8, _r9);\n    uint16x8x2_t _rabt = vzipq_u16(_ra, _rb);\n    uint32x4x2_t _r01_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[0]), vreinterpretq_u32_u16(_r23t.val[0]));\n    uint32x4x2_t _r23_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[1]), vreinterpretq_u32_u16(_r23t.val[1]));\n    uint32x4x2_t _r01_1 = vzipq_u32(vreinterpretq_u32_u16(_r45t.val[0]), vreinterpretq_u32_u16(_r67t.val[0]));\n    uint32x4x2_t _r23_1 = vzipq_u32(vreinterpretq_u32_u16(_r45t.val[1]), vreinterpretq_u32_u16(_r67t.val[1]));\n    uint32x4x2_t _r01_2 = vzipq_u32(vreinterpretq_u32_u16(_r89t.val[0]), 
vreinterpretq_u32_u16(_rabt.val[0]));\n    uint32x4x2_t _r23_2 = vzipq_u32(vreinterpretq_u32_u16(_r89t.val[1]), vreinterpretq_u32_u16(_rabt.val[1]));\n    _r0 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r01_0.val[0]), vget_low_u32(_r01_1.val[0])));\n    _r1 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r01_2.val[0]), vget_high_u32(_r01_0.val[0])));\n    _r2 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r01_1.val[0]), vget_high_u32(_r01_2.val[0])));\n    _r3 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r01_0.val[1]), vget_low_u32(_r01_1.val[1])));\n    _r4 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r01_2.val[1]), vget_high_u32(_r01_0.val[1])));\n    _r5 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r01_1.val[1]), vget_high_u32(_r01_2.val[1])));\n    _r6 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r23_0.val[0]), vget_low_u32(_r23_1.val[0])));\n    _r7 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r23_2.val[0]), vget_high_u32(_r23_0.val[0])));\n    _r8 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r23_1.val[0]), vget_high_u32(_r23_2.val[0])));\n    _r9 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r23_0.val[1]), vget_low_u32(_r23_1.val[1])));\n    _ra = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_r23_2.val[1]), vget_high_u32(_r23_0.val[1])));\n    _rb = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_r23_1.val[1]), vget_high_u32(_r23_2.val[1])));\n}\n\nstatic inline void transpose4x4_ps(float32x4_t& _r0, float32x4_t& _r1, float32x4_t& _r2, float32x4_t& _r3)\n{\n    float32x4x2_t _r01z = vzipq_f32(_r0, _r1);\n    float32x4x2_t _r23z = vzipq_f32(_r2, _r3);\n    _r0 = vcombine_f32(vget_low_f32(_r01z.val[0]), vget_low_f32(_r23z.val[0]));\n    _r1 = vcombine_f32(vget_high_f32(_r01z.val[0]), vget_high_f32(_r23z.val[0]));\n    _r2 = vcombine_f32(vget_low_f32(_r01z.val[1]), vget_low_f32(_r23z.val[1]));\n    _r3 = vcombine_f32(vget_high_f32(_r01z.val[1]), vget_high_f32(_r23z.val[1]));\n}\n\nstatic inline void 
transpose4x8_ps(float32x4_t& _r0, float32x4_t& _r1, float32x4_t& _r2, float32x4_t& _r3, float32x4_t& _r4, float32x4_t& _r5, float32x4_t& _r6, float32x4_t& _r7)\n{\n    float32x4x2_t _r01z = vzipq_f32(_r0, _r1);\n    float32x4x2_t _r23z = vzipq_f32(_r2, _r3);\n    float32x4x2_t _r45z = vzipq_f32(_r4, _r5);\n    float32x4x2_t _r67z = vzipq_f32(_r6, _r7);\n    _r0 = vcombine_f32(vget_low_f32(_r01z.val[0]), vget_low_f32(_r23z.val[0]));\n    _r1 = vcombine_f32(vget_low_f32(_r45z.val[0]), vget_low_f32(_r67z.val[0]));\n    _r2 = vcombine_f32(vget_high_f32(_r01z.val[0]), vget_high_f32(_r23z.val[0]));\n    _r3 = vcombine_f32(vget_high_f32(_r45z.val[0]), vget_high_f32(_r67z.val[0]));\n    _r4 = vcombine_f32(vget_low_f32(_r01z.val[1]), vget_low_f32(_r23z.val[1]));\n    _r5 = vcombine_f32(vget_low_f32(_r45z.val[1]), vget_low_f32(_r67z.val[1]));\n    _r6 = vcombine_f32(vget_high_f32(_r01z.val[1]), vget_high_f32(_r23z.val[1]));\n    _r7 = vcombine_f32(vget_high_f32(_r45z.val[1]), vget_high_f32(_r67z.val[1]));\n}\n\nstatic inline void transpose4x12_ps(float32x4_t& _r0, float32x4_t& _r1, float32x4_t& _r2, float32x4_t& _r3, float32x4_t& _r4, float32x4_t& _r5, float32x4_t& _r6, float32x4_t& _r7, float32x4_t& _r8, float32x4_t& _r9, float32x4_t& _ra, float32x4_t& _rb)\n{\n    float32x4x2_t _r01z = vzipq_f32(_r0, _r1);\n    float32x4x2_t _r23z = vzipq_f32(_r2, _r3);\n    float32x4x2_t _r45z = vzipq_f32(_r4, _r5);\n    float32x4x2_t _r67z = vzipq_f32(_r6, _r7);\n    float32x4x2_t _r89z = vzipq_f32(_r8, _r9);\n    float32x4x2_t _rabz = vzipq_f32(_ra, _rb);\n    _r0 = vcombine_f32(vget_low_f32(_r01z.val[0]), vget_low_f32(_r23z.val[0]));\n    _r1 = vcombine_f32(vget_low_f32(_r45z.val[0]), vget_low_f32(_r67z.val[0]));\n    _r2 = vcombine_f32(vget_low_f32(_r89z.val[0]), vget_low_f32(_rabz.val[0]));\n    _r3 = vcombine_f32(vget_high_f32(_r01z.val[0]), vget_high_f32(_r23z.val[0]));\n    _r4 = vcombine_f32(vget_high_f32(_r45z.val[0]), vget_high_f32(_r67z.val[0]));\n    _r5 = 
vcombine_f32(vget_high_f32(_r89z.val[0]), vget_high_f32(_rabz.val[0]));\n    _r6 = vcombine_f32(vget_low_f32(_r01z.val[1]), vget_low_f32(_r23z.val[1]));\n    _r7 = vcombine_f32(vget_low_f32(_r45z.val[1]), vget_low_f32(_r67z.val[1]));\n    _r8 = vcombine_f32(vget_low_f32(_r89z.val[1]), vget_low_f32(_rabz.val[1]));\n    _r9 = vcombine_f32(vget_high_f32(_r01z.val[1]), vget_high_f32(_r23z.val[1]));\n    _ra = vcombine_f32(vget_high_f32(_r45z.val[1]), vget_high_f32(_r67z.val[1]));\n    _rb = vcombine_f32(vget_high_f32(_r89z.val[1]), vget_high_f32(_rabz.val[1]));\n}\n\nstatic inline void transpose8x4_ps(float32x4_t& _r0l, float32x4_t& _r0h,\n                                   float32x4_t& _r1l, float32x4_t& _r1h,\n                                   float32x4_t& _r2l, float32x4_t& _r2h,\n                                   float32x4_t& _r3l, float32x4_t& _r3h)\n{\n    float32x4x2_t _r01lz = vzipq_f32(_r0l, _r1l);\n    float32x4x2_t _r23lz = vzipq_f32(_r2l, _r3l);\n    float32x4x2_t _r01hz = vzipq_f32(_r0h, _r1h);\n    float32x4x2_t _r23hz = vzipq_f32(_r2h, _r3h);\n    _r0l = vcombine_f32(vget_low_f32(_r01lz.val[0]), vget_low_f32(_r23lz.val[0]));\n    _r0h = vcombine_f32(vget_high_f32(_r01lz.val[0]), vget_high_f32(_r23lz.val[0]));\n    _r1l = vcombine_f32(vget_low_f32(_r01lz.val[1]), vget_low_f32(_r23lz.val[1]));\n    _r1h = vcombine_f32(vget_high_f32(_r01lz.val[1]), vget_high_f32(_r23lz.val[1]));\n    _r2l = vcombine_f32(vget_low_f32(_r01hz.val[0]), vget_low_f32(_r23hz.val[0]));\n    _r2h = vcombine_f32(vget_high_f32(_r01hz.val[0]), vget_high_f32(_r23hz.val[0]));\n    _r3l = vcombine_f32(vget_low_f32(_r01hz.val[1]), vget_low_f32(_r23hz.val[1]));\n    _r3h = vcombine_f32(vget_high_f32(_r01hz.val[1]), vget_high_f32(_r23hz.val[1]));\n}\n\nstatic inline void transpose12x4_ps(float32x4_t& _r0l, float32x4_t& _r0m, float32x4_t& _r0h,\n                                    float32x4_t& _r1l, float32x4_t& _r1m, float32x4_t& _r1h,\n                                    float32x4_t& 
_r2l, float32x4_t& _r2m, float32x4_t& _r2h,\n                                    float32x4_t& _r3l, float32x4_t& _r3m, float32x4_t& _r3h)\n{\n    float32x4x2_t _r01lz = vzipq_f32(_r0l, _r1l);\n    float32x4x2_t _r23lz = vzipq_f32(_r2l, _r3l);\n    float32x4x2_t _r01mz = vzipq_f32(_r0m, _r1m);\n    float32x4x2_t _r23mz = vzipq_f32(_r2m, _r3m);\n    float32x4x2_t _r01hz = vzipq_f32(_r0h, _r1h);\n    float32x4x2_t _r23hz = vzipq_f32(_r2h, _r3h);\n    _r0l = vcombine_f32(vget_low_f32(_r01lz.val[0]), vget_low_f32(_r23lz.val[0]));\n    _r0m = vcombine_f32(vget_high_f32(_r01lz.val[0]), vget_high_f32(_r23lz.val[0]));\n    _r0h = vcombine_f32(vget_low_f32(_r01lz.val[1]), vget_low_f32(_r23lz.val[1]));\n    _r1l = vcombine_f32(vget_high_f32(_r01lz.val[1]), vget_high_f32(_r23lz.val[1]));\n    _r1m = vcombine_f32(vget_low_f32(_r01mz.val[0]), vget_low_f32(_r23mz.val[0]));\n    _r1h = vcombine_f32(vget_high_f32(_r01mz.val[0]), vget_high_f32(_r23mz.val[0]));\n    _r2l = vcombine_f32(vget_low_f32(_r01mz.val[1]), vget_low_f32(_r23mz.val[1]));\n    _r2m = vcombine_f32(vget_high_f32(_r01mz.val[1]), vget_high_f32(_r23mz.val[1]));\n    _r2h = vcombine_f32(vget_low_f32(_r01hz.val[0]), vget_low_f32(_r23hz.val[0]));\n    _r3l = vcombine_f32(vget_high_f32(_r01hz.val[0]), vget_high_f32(_r23hz.val[0]));\n    _r3m = vcombine_f32(vget_low_f32(_r01hz.val[1]), vget_low_f32(_r23hz.val[1]));\n    _r3h = vcombine_f32(vget_high_f32(_r01hz.val[1]), vget_high_f32(_r23hz.val[1]));\n}\n\n#if __aarch64__\nstatic inline void transpose8x8_ps(float32x4_t& _r0l, float32x4_t& _r0h,\n                                   float32x4_t& _r1l, float32x4_t& _r1h,\n                                   float32x4_t& _r2l, float32x4_t& _r2h,\n                                   float32x4_t& _r3l, float32x4_t& _r3h,\n                                   float32x4_t& _r4l, float32x4_t& _r4h,\n                                   float32x4_t& _r5l, float32x4_t& _r5h,\n                                   float32x4_t& _r6l, 
float32x4_t& _r6h,\n                                   float32x4_t& _r7l, float32x4_t& _r7h)\n{\n    float32x4x2_t _r01lz = vzipq_f32(_r0l, _r1l);\n    float32x4x2_t _r23lz = vzipq_f32(_r2l, _r3l);\n    float32x4x2_t _r01hz = vzipq_f32(_r0h, _r1h);\n    float32x4x2_t _r23hz = vzipq_f32(_r2h, _r3h);\n    float32x4x2_t _r45lz = vzipq_f32(_r4l, _r5l);\n    float32x4x2_t _r67lz = vzipq_f32(_r6l, _r7l);\n    float32x4x2_t _r45hz = vzipq_f32(_r4h, _r5h);\n    float32x4x2_t _r67hz = vzipq_f32(_r6h, _r7h);\n    _r0l = vcombine_f32(vget_low_f32(_r01lz.val[0]), vget_low_f32(_r23lz.val[0]));\n    _r0h = vcombine_f32(vget_low_f32(_r45lz.val[0]), vget_low_f32(_r67lz.val[0]));\n    _r1l = vcombine_f32(vget_high_f32(_r01lz.val[0]), vget_high_f32(_r23lz.val[0]));\n    _r1h = vcombine_f32(vget_high_f32(_r45lz.val[0]), vget_high_f32(_r67lz.val[0]));\n    _r2l = vcombine_f32(vget_low_f32(_r01lz.val[1]), vget_low_f32(_r23lz.val[1]));\n    _r2h = vcombine_f32(vget_low_f32(_r45lz.val[1]), vget_low_f32(_r67lz.val[1]));\n    _r3l = vcombine_f32(vget_high_f32(_r01lz.val[1]), vget_high_f32(_r23lz.val[1]));\n    _r3h = vcombine_f32(vget_high_f32(_r45lz.val[1]), vget_high_f32(_r67lz.val[1]));\n    _r4l = vcombine_f32(vget_low_f32(_r01hz.val[0]), vget_low_f32(_r23hz.val[0]));\n    _r4h = vcombine_f32(vget_low_f32(_r45hz.val[0]), vget_low_f32(_r67hz.val[0]));\n    _r5l = vcombine_f32(vget_high_f32(_r01hz.val[0]), vget_high_f32(_r23hz.val[0]));\n    _r5h = vcombine_f32(vget_high_f32(_r45hz.val[0]), vget_high_f32(_r67hz.val[0]));\n    _r6l = vcombine_f32(vget_low_f32(_r01hz.val[1]), vget_low_f32(_r23hz.val[1]));\n    _r6h = vcombine_f32(vget_low_f32(_r45hz.val[1]), vget_low_f32(_r67hz.val[1]));\n    _r7l = vcombine_f32(vget_high_f32(_r01hz.val[1]), vget_high_f32(_r23hz.val[1]));\n    _r7h = vcombine_f32(vget_high_f32(_r45hz.val[1]), vget_high_f32(_r67hz.val[1]));\n}\n\nstatic inline void transpose8x12_ps(float32x4_t& _r0l, float32x4_t& _r0h,\n                                    float32x4_t& _r1l, 
float32x4_t& _r1h,\n                                    float32x4_t& _r2l, float32x4_t& _r2h,\n                                    float32x4_t& _r3l, float32x4_t& _r3h,\n                                    float32x4_t& _r4l, float32x4_t& _r4h,\n                                    float32x4_t& _r5l, float32x4_t& _r5h,\n                                    float32x4_t& _r6l, float32x4_t& _r6h,\n                                    float32x4_t& _r7l, float32x4_t& _r7h,\n                                    float32x4_t& _r8l, float32x4_t& _r8h,\n                                    float32x4_t& _r9l, float32x4_t& _r9h,\n                                    float32x4_t& _ral, float32x4_t& _rah,\n                                    float32x4_t& _rbl, float32x4_t& _rbh)\n{\n    float32x4x2_t _r01lz = vzipq_f32(_r0l, _r1l);\n    float32x4x2_t _r23lz = vzipq_f32(_r2l, _r3l);\n    float32x4x2_t _r01hz = vzipq_f32(_r0h, _r1h);\n    float32x4x2_t _r23hz = vzipq_f32(_r2h, _r3h);\n    float32x4x2_t _r45lz = vzipq_f32(_r4l, _r5l);\n    float32x4x2_t _r67lz = vzipq_f32(_r6l, _r7l);\n    float32x4x2_t _r45hz = vzipq_f32(_r4h, _r5h);\n    float32x4x2_t _r67hz = vzipq_f32(_r6h, _r7h);\n    float32x4x2_t _r89lz = vzipq_f32(_r8l, _r9l);\n    float32x4x2_t _rablz = vzipq_f32(_ral, _rbl);\n    float32x4x2_t _r89hz = vzipq_f32(_r8h, _r9h);\n    float32x4x2_t _rabhz = vzipq_f32(_rah, _rbh);\n    _r0l = vcombine_f32(vget_low_f32(_r01lz.val[0]), vget_low_f32(_r23lz.val[0]));\n    _r0h = vcombine_f32(vget_low_f32(_r45lz.val[0]), vget_low_f32(_r67lz.val[0]));\n    _r1l = vcombine_f32(vget_low_f32(_r89lz.val[0]), vget_low_f32(_rablz.val[0]));\n    _r1h = vcombine_f32(vget_high_f32(_r01lz.val[0]), vget_high_f32(_r23lz.val[0]));\n    _r2l = vcombine_f32(vget_high_f32(_r45lz.val[0]), vget_high_f32(_r67lz.val[0]));\n    _r2h = vcombine_f32(vget_high_f32(_r89lz.val[0]), vget_high_f32(_rablz.val[0]));\n    _r3l = vcombine_f32(vget_low_f32(_r01lz.val[1]), vget_low_f32(_r23lz.val[1]));\n    _r3h = 
vcombine_f32(vget_low_f32(_r45lz.val[1]), vget_low_f32(_r67lz.val[1]));\n    _r4l = vcombine_f32(vget_low_f32(_r89lz.val[1]), vget_low_f32(_rablz.val[1]));\n    _r4h = vcombine_f32(vget_high_f32(_r01lz.val[1]), vget_high_f32(_r23lz.val[1]));\n    _r5l = vcombine_f32(vget_high_f32(_r45lz.val[1]), vget_high_f32(_r67lz.val[1]));\n    _r5h = vcombine_f32(vget_high_f32(_r89lz.val[1]), vget_high_f32(_rablz.val[1]));\n    _r6l = vcombine_f32(vget_low_f32(_r01hz.val[0]), vget_low_f32(_r23hz.val[0]));\n    _r6h = vcombine_f32(vget_low_f32(_r45hz.val[0]), vget_low_f32(_r67hz.val[0]));\n    _r7l = vcombine_f32(vget_low_f32(_r89hz.val[0]), vget_low_f32(_rabhz.val[0]));\n    _r7h = vcombine_f32(vget_high_f32(_r01hz.val[0]), vget_high_f32(_r23hz.val[0]));\n    _r8l = vcombine_f32(vget_high_f32(_r45hz.val[0]), vget_high_f32(_r67hz.val[0]));\n    _r8h = vcombine_f32(vget_high_f32(_r89hz.val[0]), vget_high_f32(_rabhz.val[0]));\n    _r9l = vcombine_f32(vget_low_f32(_r01hz.val[1]), vget_low_f32(_r23hz.val[1]));\n    _r9h = vcombine_f32(vget_low_f32(_r45hz.val[1]), vget_low_f32(_r67hz.val[1]));\n    _ral = vcombine_f32(vget_low_f32(_r89hz.val[1]), vget_low_f32(_rabhz.val[1]));\n    _rah = vcombine_f32(vget_high_f32(_r01hz.val[1]), vget_high_f32(_r23hz.val[1]));\n    _rbl = vcombine_f32(vget_high_f32(_r45hz.val[1]), vget_high_f32(_r67hz.val[1]));\n    _rbh = vcombine_f32(vget_high_f32(_r89hz.val[1]), vget_high_f32(_rabhz.val[1]));\n}\n\nstatic inline void transpose12x8_ps(float32x4_t& _r0l, float32x4_t& _r0m, float32x4_t& _r0h,\n                                    float32x4_t& _r1l, float32x4_t& _r1m, float32x4_t& _r1h,\n                                    float32x4_t& _r2l, float32x4_t& _r2m, float32x4_t& _r2h,\n                                    float32x4_t& _r3l, float32x4_t& _r3m, float32x4_t& _r3h,\n                                    float32x4_t& _r4l, float32x4_t& _r4m, float32x4_t& _r4h,\n                                    float32x4_t& _r5l, float32x4_t& _r5m, float32x4_t& 
_r5h,\n                                    float32x4_t& _r6l, float32x4_t& _r6m, float32x4_t& _r6h,\n                                    float32x4_t& _r7l, float32x4_t& _r7m, float32x4_t& _r7h)\n{\n    float32x4x2_t _r01lz = vzipq_f32(_r0l, _r1l);\n    float32x4x2_t _r23lz = vzipq_f32(_r2l, _r3l);\n    float32x4x2_t _r01mz = vzipq_f32(_r0m, _r1m);\n    float32x4x2_t _r23mz = vzipq_f32(_r2m, _r3m);\n    float32x4x2_t _r01hz = vzipq_f32(_r0h, _r1h);\n    float32x4x2_t _r23hz = vzipq_f32(_r2h, _r3h);\n    float32x4x2_t _r45lz = vzipq_f32(_r4l, _r5l);\n    float32x4x2_t _r67lz = vzipq_f32(_r6l, _r7l);\n    float32x4x2_t _r45mz = vzipq_f32(_r4m, _r5m);\n    float32x4x2_t _r67mz = vzipq_f32(_r6m, _r7m);\n    float32x4x2_t _r45hz = vzipq_f32(_r4h, _r5h);\n    float32x4x2_t _r67hz = vzipq_f32(_r6h, _r7h);\n    _r0l = vcombine_f32(vget_low_f32(_r01lz.val[0]), vget_low_f32(_r23lz.val[0]));\n    _r0m = vcombine_f32(vget_low_f32(_r45lz.val[0]), vget_low_f32(_r67lz.val[0]));\n    _r0h = vcombine_f32(vget_high_f32(_r01lz.val[0]), vget_high_f32(_r23lz.val[0]));\n    _r1l = vcombine_f32(vget_high_f32(_r45lz.val[0]), vget_high_f32(_r67lz.val[0]));\n    _r1m = vcombine_f32(vget_low_f32(_r01lz.val[1]), vget_low_f32(_r23lz.val[1]));\n    _r1h = vcombine_f32(vget_low_f32(_r45lz.val[1]), vget_low_f32(_r67lz.val[1]));\n    _r2l = vcombine_f32(vget_high_f32(_r01lz.val[1]), vget_high_f32(_r23lz.val[1]));\n    _r2m = vcombine_f32(vget_high_f32(_r45lz.val[1]), vget_high_f32(_r67lz.val[1]));\n    _r2h = vcombine_f32(vget_low_f32(_r01mz.val[0]), vget_low_f32(_r23mz.val[0]));\n    _r3l = vcombine_f32(vget_low_f32(_r45mz.val[0]), vget_low_f32(_r67mz.val[0]));\n    _r3m = vcombine_f32(vget_high_f32(_r01mz.val[0]), vget_high_f32(_r23mz.val[0]));\n    _r3h = vcombine_f32(vget_high_f32(_r45mz.val[0]), vget_high_f32(_r67mz.val[0]));\n    _r4l = vcombine_f32(vget_low_f32(_r01mz.val[1]), vget_low_f32(_r23mz.val[1]));\n    _r4m = vcombine_f32(vget_low_f32(_r45mz.val[1]), vget_low_f32(_r67mz.val[1]));\n  
  _r4h = vcombine_f32(vget_high_f32(_r01mz.val[1]), vget_high_f32(_r23mz.val[1]));\n    _r5l = vcombine_f32(vget_high_f32(_r45mz.val[1]), vget_high_f32(_r67mz.val[1]));\n    _r5m = vcombine_f32(vget_low_f32(_r01hz.val[0]), vget_low_f32(_r23hz.val[0]));\n    _r5h = vcombine_f32(vget_low_f32(_r45hz.val[0]), vget_low_f32(_r67hz.val[0]));\n    _r6l = vcombine_f32(vget_high_f32(_r01hz.val[0]), vget_high_f32(_r23hz.val[0]));\n    _r6m = vcombine_f32(vget_high_f32(_r45hz.val[0]), vget_high_f32(_r67hz.val[0]));\n    _r6h = vcombine_f32(vget_low_f32(_r01hz.val[1]), vget_low_f32(_r23hz.val[1]));\n    _r7l = vcombine_f32(vget_low_f32(_r45hz.val[1]), vget_low_f32(_r67hz.val[1]));\n    _r7m = vcombine_f32(vget_high_f32(_r01hz.val[1]), vget_high_f32(_r23hz.val[1]));\n    _r7h = vcombine_f32(vget_high_f32(_r45hz.val[1]), vget_high_f32(_r67hz.val[1]));\n}\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic inline void transpose4x4_ph(float16x4_t& _r0, float16x4_t& _r1, float16x4_t& _r2, float16x4_t& _r3)\n{\n    uint16x4x2_t _r01z = vzip_u16(vreinterpret_u16_f16(_r0), vreinterpret_u16_f16(_r1));\n    uint16x4x2_t _r23z = vzip_u16(vreinterpret_u16_f16(_r2), vreinterpret_u16_f16(_r3));\n    uint32x2x2_t _r01 = vzip_u32(vreinterpret_u32_u16(_r01z.val[0]), vreinterpret_u32_u16(_r23z.val[0]));\n    uint32x2x2_t _r23 = vzip_u32(vreinterpret_u32_u16(_r01z.val[1]), vreinterpret_u32_u16(_r23z.val[1]));\n    _r0 = vreinterpret_f16_u32(_r01.val[0]);\n    _r1 = vreinterpret_f16_u32(_r01.val[1]);\n    _r2 = vreinterpret_f16_u32(_r23.val[0]);\n    _r3 = vreinterpret_f16_u32(_r23.val[1]);\n}\n\nstatic inline void transpose4x8_ph(float16x4_t& _r0, float16x4_t& _r1, float16x4_t& _r2, float16x4_t& _r3, float16x4_t& _r4, float16x4_t& _r5, float16x4_t& _r6, float16x4_t& _r7)\n{\n    uint16x4x2_t _r01z = vzip_u16(vreinterpret_u16_f16(_r0), vreinterpret_u16_f16(_r1));\n    uint16x4x2_t _r23z = vzip_u16(vreinterpret_u16_f16(_r2), vreinterpret_u16_f16(_r3));\n    uint16x4x2_t _r45z = 
vzip_u16(vreinterpret_u16_f16(_r4), vreinterpret_u16_f16(_r5));\n    uint16x4x2_t _r67z = vzip_u16(vreinterpret_u16_f16(_r6), vreinterpret_u16_f16(_r7));\n    uint32x2x2_t _r01_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[0]), vreinterpret_u32_u16(_r23z.val[0]));\n    uint32x2x2_t _r23_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[1]), vreinterpret_u32_u16(_r23z.val[1]));\n    uint32x2x2_t _r01_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[0]), vreinterpret_u32_u16(_r67z.val[0]));\n    uint32x2x2_t _r23_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[1]), vreinterpret_u32_u16(_r67z.val[1]));\n    _r0 = vreinterpret_f16_u32(_r01_0.val[0]);\n    _r1 = vreinterpret_f16_u32(_r01_1.val[0]);\n    _r2 = vreinterpret_f16_u32(_r01_0.val[1]);\n    _r3 = vreinterpret_f16_u32(_r01_1.val[1]);\n    _r4 = vreinterpret_f16_u32(_r23_0.val[0]);\n    _r5 = vreinterpret_f16_u32(_r23_1.val[0]);\n    _r6 = vreinterpret_f16_u32(_r23_0.val[1]);\n    _r7 = vreinterpret_f16_u32(_r23_1.val[1]);\n}\n\nstatic inline void transpose4x12_ph(float16x4_t& _r0, float16x4_t& _r1, float16x4_t& _r2, float16x4_t& _r3, float16x4_t& _r4, float16x4_t& _r5, float16x4_t& _r6, float16x4_t& _r7, float16x4_t& _r8, float16x4_t& _r9, float16x4_t& _ra, float16x4_t& _rb)\n{\n    uint16x4x2_t _r01z = vzip_u16(vreinterpret_u16_f16(_r0), vreinterpret_u16_f16(_r1));\n    uint16x4x2_t _r23z = vzip_u16(vreinterpret_u16_f16(_r2), vreinterpret_u16_f16(_r3));\n    uint16x4x2_t _r45z = vzip_u16(vreinterpret_u16_f16(_r4), vreinterpret_u16_f16(_r5));\n    uint16x4x2_t _r67z = vzip_u16(vreinterpret_u16_f16(_r6), vreinterpret_u16_f16(_r7));\n    uint16x4x2_t _r89z = vzip_u16(vreinterpret_u16_f16(_r8), vreinterpret_u16_f16(_r9));\n    uint16x4x2_t _rabz = vzip_u16(vreinterpret_u16_f16(_ra), vreinterpret_u16_f16(_rb));\n    uint32x2x2_t _r01_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[0]), vreinterpret_u32_u16(_r23z.val[0]));\n    uint32x2x2_t _r23_0 = vzip_u32(vreinterpret_u32_u16(_r01z.val[1]), vreinterpret_u32_u16(_r23z.val[1]));\n    
uint32x2x2_t _r01_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[0]), vreinterpret_u32_u16(_r67z.val[0]));\n    uint32x2x2_t _r23_1 = vzip_u32(vreinterpret_u32_u16(_r45z.val[1]), vreinterpret_u32_u16(_r67z.val[1]));\n    uint32x2x2_t _r01_2 = vzip_u32(vreinterpret_u32_u16(_r89z.val[0]), vreinterpret_u32_u16(_rabz.val[0]));\n    uint32x2x2_t _r23_2 = vzip_u32(vreinterpret_u32_u16(_r89z.val[1]), vreinterpret_u32_u16(_rabz.val[1]));\n    _r0 = vreinterpret_f16_u32(_r01_0.val[0]);\n    _r1 = vreinterpret_f16_u32(_r01_1.val[0]);\n    _r2 = vreinterpret_f16_u32(_r01_2.val[0]);\n    _r3 = vreinterpret_f16_u32(_r01_0.val[1]);\n    _r4 = vreinterpret_f16_u32(_r01_1.val[1]);\n    _r5 = vreinterpret_f16_u32(_r01_2.val[1]);\n    _r6 = vreinterpret_f16_u32(_r23_0.val[0]);\n    _r7 = vreinterpret_f16_u32(_r23_1.val[0]);\n    _r8 = vreinterpret_f16_u32(_r23_2.val[0]);\n    _r9 = vreinterpret_f16_u32(_r23_0.val[1]);\n    _ra = vreinterpret_f16_u32(_r23_1.val[1]);\n    _rb = vreinterpret_f16_u32(_r23_2.val[1]);\n}\n\nstatic inline void transpose8x4_ph(float16x8_t& _r0, float16x8_t& _r1, float16x8_t& _r2, float16x8_t& _r3)\n{\n    uint16x8x2_t _r01t = vzipq_u16(vreinterpretq_u16_f16(_r0), vreinterpretq_u16_f16(_r1));\n    uint16x8x2_t _r23t = vzipq_u16(vreinterpretq_u16_f16(_r2), vreinterpretq_u16_f16(_r3));\n    uint32x4x2_t _r01 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[0]), vreinterpretq_u32_u16(_r23t.val[0]));\n    uint32x4x2_t _r23 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[1]), vreinterpretq_u32_u16(_r23t.val[1]));\n    _r0 = vreinterpretq_f16_u32(_r01.val[0]);\n    _r1 = vreinterpretq_f16_u32(_r01.val[1]);\n    _r2 = vreinterpretq_f16_u32(_r23.val[0]);\n    _r3 = vreinterpretq_f16_u32(_r23.val[1]);\n}\n\nstatic inline void transpose8x8_ph(float16x8_t& _r0, float16x8_t& _r1, float16x8_t& _r2, float16x8_t& _r3, float16x8_t& _r4, float16x8_t& _r5, float16x8_t& _r6, float16x8_t& _r7)\n{\n    uint16x8x2_t _r01t = vzipq_u16(vreinterpretq_u16_f16(_r0), vreinterpretq_u16_f16(_r1));\n 
   uint16x8x2_t _r23t = vzipq_u16(vreinterpretq_u16_f16(_r2), vreinterpretq_u16_f16(_r3));\n    uint16x8x2_t _r45t = vzipq_u16(vreinterpretq_u16_f16(_r4), vreinterpretq_u16_f16(_r5));\n    uint16x8x2_t _r67t = vzipq_u16(vreinterpretq_u16_f16(_r6), vreinterpretq_u16_f16(_r7));\n    uint32x4x2_t _r01_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[0]), vreinterpretq_u32_u16(_r23t.val[0]));\n    uint32x4x2_t _r23_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[1]), vreinterpretq_u32_u16(_r23t.val[1]));\n    uint32x4x2_t _r01_1 = vzipq_u32(vreinterpretq_u32_u16(_r45t.val[0]), vreinterpretq_u32_u16(_r67t.val[0]));\n    uint32x4x2_t _r23_1 = vzipq_u32(vreinterpretq_u32_u16(_r45t.val[1]), vreinterpretq_u32_u16(_r67t.val[1]));\n    _r0 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r01_0.val[0]), vget_low_u32(_r01_1.val[0])));\n    _r1 = vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r01_0.val[0]), vget_high_u32(_r01_1.val[0])));\n    _r2 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r01_0.val[1]), vget_low_u32(_r01_1.val[1])));\n    _r3 = vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r01_0.val[1]), vget_high_u32(_r01_1.val[1])));\n    _r4 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r23_0.val[0]), vget_low_u32(_r23_1.val[0])));\n    _r5 = vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r23_0.val[0]), vget_high_u32(_r23_1.val[0])));\n    _r6 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r23_0.val[1]), vget_low_u32(_r23_1.val[1])));\n    _r7 = vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r23_0.val[1]), vget_high_u32(_r23_1.val[1])));\n}\n\nstatic inline void transpose8x12_ph(float16x8_t& _r0, float16x8_t& _r1, float16x8_t& _r2, float16x8_t& _r3, float16x8_t& _r4, float16x8_t& _r5, float16x8_t& _r6, float16x8_t& _r7, float16x8_t& _r8, float16x8_t& _r9, float16x8_t& _ra, float16x8_t& _rb)\n{\n    uint16x8x2_t _r01t = vzipq_u16(vreinterpretq_u16_f16(_r0), vreinterpretq_u16_f16(_r1));\n    uint16x8x2_t _r23t = vzipq_u16(vreinterpretq_u16_f16(_r2), 
vreinterpretq_u16_f16(_r3));\n    uint16x8x2_t _r45t = vzipq_u16(vreinterpretq_u16_f16(_r4), vreinterpretq_u16_f16(_r5));\n    uint16x8x2_t _r67t = vzipq_u16(vreinterpretq_u16_f16(_r6), vreinterpretq_u16_f16(_r7));\n    uint16x8x2_t _r89t = vzipq_u16(vreinterpretq_u16_f16(_r8), vreinterpretq_u16_f16(_r9));\n    uint16x8x2_t _rabt = vzipq_u16(vreinterpretq_u16_f16(_ra), vreinterpretq_u16_f16(_rb));\n    uint32x4x2_t _r01_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[0]), vreinterpretq_u32_u16(_r23t.val[0]));\n    uint32x4x2_t _r23_0 = vzipq_u32(vreinterpretq_u32_u16(_r01t.val[1]), vreinterpretq_u32_u16(_r23t.val[1]));\n    uint32x4x2_t _r01_1 = vzipq_u32(vreinterpretq_u32_u16(_r45t.val[0]), vreinterpretq_u32_u16(_r67t.val[0]));\n    uint32x4x2_t _r23_1 = vzipq_u32(vreinterpretq_u32_u16(_r45t.val[1]), vreinterpretq_u32_u16(_r67t.val[1]));\n    uint32x4x2_t _r01_2 = vzipq_u32(vreinterpretq_u32_u16(_r89t.val[0]), vreinterpretq_u32_u16(_rabt.val[0]));\n    uint32x4x2_t _r23_2 = vzipq_u32(vreinterpretq_u32_u16(_r89t.val[1]), vreinterpretq_u32_u16(_rabt.val[1]));\n    _r0 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r01_0.val[0]), vget_low_u32(_r01_1.val[0])));\n    _r1 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r01_2.val[0]), vget_high_u32(_r01_0.val[0])));\n    _r2 = vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r01_1.val[0]), vget_high_u32(_r01_2.val[0])));\n    _r3 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r01_0.val[1]), vget_low_u32(_r01_1.val[1])));\n    _r4 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r01_2.val[1]), vget_high_u32(_r01_0.val[1])));\n    _r5 = vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r01_1.val[1]), vget_high_u32(_r01_2.val[1])));\n    _r6 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r23_0.val[0]), vget_low_u32(_r23_1.val[0])));\n    _r7 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r23_2.val[0]), vget_high_u32(_r23_0.val[0])));\n    _r8 = 
vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r23_1.val[0]), vget_high_u32(_r23_2.val[0])));\n    _r9 = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r23_0.val[1]), vget_low_u32(_r23_1.val[1])));\n    _ra = vreinterpretq_f16_u32(vcombine_u32(vget_low_u32(_r23_2.val[1]), vget_high_u32(_r23_0.val[1])));\n    _rb = vreinterpretq_f16_u32(vcombine_u32(vget_high_u32(_r23_1.val[1]), vget_high_u32(_r23_2.val[1])));\n}\n\nstatic inline void transpose12x4_ph(float16x4_t& _r0l, float16x4_t& _r0m, float16x4_t& _r0h,\n                                    float16x4_t& _r1l, float16x4_t& _r1m, float16x4_t& _r1h,\n                                    float16x4_t& _r2l, float16x4_t& _r2m, float16x4_t& _r2h,\n                                    float16x4_t& _r3l, float16x4_t& _r3m, float16x4_t& _r3h)\n{\n    uint16x4x2_t _r01lz = vzip_u16(vreinterpret_u16_f16(_r0l), vreinterpret_u16_f16(_r1l));\n    uint16x4x2_t _r23lz = vzip_u16(vreinterpret_u16_f16(_r2l), vreinterpret_u16_f16(_r3l));\n    uint16x4x2_t _r01mz = vzip_u16(vreinterpret_u16_f16(_r0m), vreinterpret_u16_f16(_r1m));\n    uint16x4x2_t _r23mz = vzip_u16(vreinterpret_u16_f16(_r2m), vreinterpret_u16_f16(_r3m));\n    uint16x4x2_t _r01hz = vzip_u16(vreinterpret_u16_f16(_r0h), vreinterpret_u16_f16(_r1h));\n    uint16x4x2_t _r23hz = vzip_u16(vreinterpret_u16_f16(_r2h), vreinterpret_u16_f16(_r3h));\n    uint32x2x2_t _r01 = vzip_u32(vreinterpret_u32_u16(_r01lz.val[0]), vreinterpret_u32_u16(_r23lz.val[0]));\n    uint32x2x2_t _r23 = vzip_u32(vreinterpret_u32_u16(_r01lz.val[1]), vreinterpret_u32_u16(_r23lz.val[1]));\n    uint32x2x2_t _r45 = vzip_u32(vreinterpret_u32_u16(_r01mz.val[0]), vreinterpret_u32_u16(_r23mz.val[0]));\n    uint32x2x2_t _r67 = vzip_u32(vreinterpret_u32_u16(_r01mz.val[1]), vreinterpret_u32_u16(_r23mz.val[1]));\n    uint32x2x2_t _r89 = vzip_u32(vreinterpret_u32_u16(_r01hz.val[0]), vreinterpret_u32_u16(_r23hz.val[0]));\n    uint32x2x2_t _rab = vzip_u32(vreinterpret_u32_u16(_r01hz.val[1]), 
vreinterpret_u32_u16(_r23hz.val[1]));\n    _r0l = vreinterpret_f16_u32(_r01.val[0]);\n    _r0m = vreinterpret_f16_u32(_r01.val[1]);\n    _r0h = vreinterpret_f16_u32(_r23.val[0]);\n    _r1l = vreinterpret_f16_u32(_r23.val[1]);\n    _r1m = vreinterpret_f16_u32(_r45.val[0]);\n    _r1h = vreinterpret_f16_u32(_r45.val[1]);\n    _r2l = vreinterpret_f16_u32(_r67.val[0]);\n    _r2m = vreinterpret_f16_u32(_r67.val[1]);\n    _r2h = vreinterpret_f16_u32(_r89.val[0]);\n    _r3l = vreinterpret_f16_u32(_r89.val[1]);\n    _r3m = vreinterpret_f16_u32(_rab.val[0]);\n    _r3h = vreinterpret_f16_u32(_rab.val[1]);\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n#endif // __aarch64__\n#endif // __ARM_NEON\n\n#endif // ARM_USABILITY_H\n"
  },
  {
    "path": "src/layer/arm/batchnorm_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"batchnorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nBatchNorm_arm::BatchNorm_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint BatchNorm_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float* ptr = (float*)bottom_top_blob + i * 4;\n\n                float32x4_t _a = vld1q_f32((const float*)a_data + i * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + i * 4);\n\n                float32x4_t _p = vld1q_f32(ptr);\n                _p = vmlaq_f32(_a, _p, _b);\n                vst1q_f32(ptr, _p);\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; 
i++)\n            {\n                float32x4_t _a = vld1q_f32((const float*)a_data + i * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + i * 4);\n\n                float* ptr = bottom_top_blob.row(i);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float32x4_t _p = vld1q_f32(ptr);\n                    _p = vmlaq_f32(_a, _p, _b);\n                    vst1q_f32(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3 || dims == 4)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int d = bottom_top_blob.d;\n            int c = bottom_top_blob.c;\n            int size = w * h * d;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < c; q++)\n            {\n                float32x4_t _a = vld1q_f32((const float*)a_data + q * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + q * 4);\n\n                float* ptr = bottom_top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p = vld1q_f32(ptr);\n                    _p = vmlaq_f32(_a, _p, _b);\n                    vst1q_f32(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        float* ptr = bottom_top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < w; i++)\n        {\n            ptr[i] = b_data[i] * ptr[i] + a_data[i];\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n\n       
     float a = a_data[i];\n            float b = b_data[i];\n\n            int j = 0;\n#if __ARM_NEON\n            float32x4_t _a = vdupq_n_f32(a);\n            float32x4_t _b = vdupq_n_f32(b);\n\n            for (; j + 3 < w; j += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                _p = vmlaq_f32(_a, _p, _b);\n                vst1q_f32(ptr, _p);\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < w; j++)\n            {\n                *ptr = b * *ptr + a;\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int d = bottom_top_blob.d;\n        int c = bottom_top_blob.c;\n        int size = w * h * d;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            float a = a_data[q];\n            float b = b_data[q];\n\n            int j = 0;\n#if __ARM_NEON\n            float32x4_t _a = vdupq_n_f32(a);\n            float32x4_t _b = vdupq_n_f32(b);\n\n            for (; j + 15 < size; j += 16)\n            {\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                float32x4_t _p2 = vld1q_f32(ptr + 8);\n                float32x4_t _p3 = vld1q_f32(ptr + 12);\n                _p0 = vmlaq_f32(_a, _p0, _b);\n                _p1 = vmlaq_f32(_a, _p1, _b);\n                _p2 = vmlaq_f32(_a, _p2, _b);\n                _p3 = vmlaq_f32(_a, _p3, _b);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                vst1q_f32(ptr + 8, _p2);\n                vst1q_f32(ptr + 12, _p3);\n                ptr += 16;\n            }\n            for (; j + 7 < size; j += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = 
vld1q_f32(ptr + 4);\n                _p0 = vmlaq_f32(_a, _p0, _b);\n                _p1 = vmlaq_f32(_a, _p1, _b);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                ptr += 8;\n            }\n            for (; j + 3 < size; j += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                _p = vmlaq_f32(_a, _p, _b);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < size; j++)\n            {\n                *ptr = b * *ptr + a;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint BatchNorm_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                unsigned short* ptr = (unsigned short*)bottom_top_blob + i * 4;\n\n                float32x4_t _a = vld1q_f32((const float*)a_data + i * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + i * 4);\n\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                _p = vmlaq_f32(_a, _p, _b);\n                vst1_u16(ptr, float2bfloat(_p));\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                float32x4_t _a = vld1q_f32((const float*)a_data + i * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + i * 4);\n\n                unsigned short* ptr = bottom_top_blob.row<unsigned short>(i);\n\n         
       for (int j = 0; j < w; j++)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    _p = vmlaq_f32(_a, _p, _b);\n                    vst1_u16(ptr, float2bfloat(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3 || dims == 4)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int d = bottom_top_blob.d;\n            int c = bottom_top_blob.c;\n            int size = w * h * d;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < c; q++)\n            {\n                float32x4_t _a = vld1q_f32((const float*)a_data + q * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + q * 4);\n\n                unsigned short* ptr = bottom_top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    _p = vmlaq_f32(_a, _p, _b);\n                    vst1_u16(ptr, float2bfloat(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        unsigned short* ptr = bottom_top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < w; i++)\n        {\n            ptr[i] = float32_to_bfloat16(b_data[i] * bfloat16_to_float32(ptr[i]) + a_data[i]);\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            unsigned short* ptr = bottom_top_blob.row<unsigned short>(i);\n\n            float a = a_data[i];\n            float b = b_data[i];\n\n            int j = 0;\n#if 
__ARM_NEON\n            float32x4_t _a = vdupq_n_f32(a);\n            float32x4_t _b = vdupq_n_f32(b);\n\n            for (; j + 3 < w; j += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                _p = vmlaq_f32(_a, _p, _b);\n                vst1_u16(ptr, float2bfloat(_p));\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < w; j++)\n            {\n                *ptr = float32_to_bfloat16(b * bfloat16_to_float32(*ptr) + a);\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int d = bottom_top_blob.d;\n        int c = bottom_top_blob.c;\n        int size = w * h * d;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            float a = a_data[q];\n            float b = b_data[q];\n\n            int j = 0;\n#if __ARM_NEON\n            float32x4_t _a = vdupq_n_f32(a);\n            float32x4_t _b = vdupq_n_f32(b);\n\n            for (; j + 3 < size; j += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                _p = vmlaq_f32(_a, _p, _b);\n                vst1_u16(ptr, float2bfloat(_p));\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < size; j++)\n            {\n                *ptr = float32_to_bfloat16(b * bfloat16_to_float32(*ptr) + a);\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/batchnorm_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BATCHNORM_ARM_H\n#define LAYER_BATCHNORM_ARM_H\n\n#include \"batchnorm.h\"\n\nnamespace ncnn {\n\nclass BatchNorm_arm : public BatchNorm\n{\npublic:\n    BatchNorm_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BATCHNORM_ARM_H\n"
  },
  {
    "path": "src/layer/arm/batchnorm_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"batchnorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint BatchNorm_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                __fp16* ptr = (__fp16*)bottom_top_blob + i * 4;\n\n                float32x4_t _a = vld1q_f32((const float*)a_data + i * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + i * 4);\n\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                _p = vfmaq_f32(_a, _p, _b);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                float32x4_t _a = vld1q_f32((const float*)a_data + i * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + i * 4);\n\n                __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    _p = vfmaq_f32(_a, _p, _b);\n                    vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3 || dims == 4)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n           
 int d = bottom_top_blob.d;\n            int c = bottom_top_blob.c;\n            int size = w * h * d;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < c; q++)\n            {\n                float32x4_t _a = vld1q_f32((const float*)a_data + q * 4);\n                float32x4_t _b = vld1q_f32((const float*)b_data + q * 4);\n\n                __fp16* ptr = bottom_top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    _p = vfmaq_f32(_a, _p, _b);\n                    vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        __fp16* ptr = bottom_top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < w; i++)\n        {\n            ptr[i] = b_data[i] * (float)ptr[i] + a_data[i];\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n            float a = a_data[i];\n            float b = b_data[i];\n\n            float32x4_t _a = vdupq_n_f32(a);\n            float32x4_t _b = vdupq_n_f32(b);\n\n            int j = 0;\n            for (; j + 3 < w; j += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                _p = vfmaq_f32(_a, _p, _b);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                ptr += 4;\n            }\n            for (; j < w; j++)\n            {\n                *ptr = b * (float)*ptr + a;\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n    
    int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int d = bottom_top_blob.d;\n        int c = bottom_top_blob.c;\n        int size = w * h * d;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            float a = a_data[q];\n            float b = b_data[q];\n\n            float32x4_t _a = vdupq_n_f32(a);\n            float32x4_t _b = vdupq_n_f32(b);\n\n            int j = 0;\n            for (; j + 3 < size; j += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                _p = vfmaq_f32(_a, _p, _b);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                ptr += 4;\n            }\n            for (; j < size; j++)\n            {\n                *ptr = b * (float)*ptr + a;\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\nint BatchNorm_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 8)\n    {\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                __fp16* ptr = (__fp16*)bottom_top_blob + i * 8;\n\n                float16x8_t _a = vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)a_data + i * 8)), vcvt_f16_f32(vld1q_f32((const float*)a_data + i * 8 + 4)));\n                float16x8_t _b = vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)b_data + i * 8)), vcvt_f16_f32(vld1q_f32((const float*)b_data + i * 8 + 4)));\n\n                float16x8_t _p = vld1q_f16(ptr);\n                _p = vfmaq_f16(_a, _p, _b);\n                vst1q_f16(ptr, _p);\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = 
bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                float16x8_t _a = vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)a_data + i * 8)), vcvt_f16_f32(vld1q_f32((const float*)a_data + i * 8 + 4)));\n                float16x8_t _b = vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)b_data + i * 8)), vcvt_f16_f32(vld1q_f32((const float*)b_data + i * 8 + 4)));\n\n                __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float16x8_t _p = vld1q_f16(ptr);\n                    _p = vfmaq_f16(_a, _p, _b);\n                    vst1q_f16(ptr, _p);\n\n                    ptr += 8;\n                }\n            }\n        }\n\n        if (dims == 3 || dims == 4)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int d = bottom_top_blob.d;\n            int c = bottom_top_blob.c;\n            int size = w * h * d;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < c; q++)\n            {\n                float16x8_t _a = vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)a_data + q * 8)), vcvt_f16_f32(vld1q_f32((const float*)a_data + q * 8 + 4)));\n                float16x8_t _b = vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)b_data + q * 8)), vcvt_f16_f32(vld1q_f32((const float*)b_data + q * 8 + 4)));\n\n                __fp16* ptr = bottom_top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float16x8_t _p = vld1q_f16(ptr);\n                    _p = vfmaq_f16(_a, _p, _b);\n                    vst1q_f16(ptr, _p);\n\n                    ptr += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n       
 {\n            int w = bottom_top_blob.w;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                __fp16* ptr = (__fp16*)bottom_top_blob + i * 4;\n\n                float16x4_t _a = vcvt_f16_f32(vld1q_f32((const float*)a_data + i * 4));\n                float16x4_t _b = vcvt_f16_f32(vld1q_f32((const float*)b_data + i * 4));\n\n                float16x4_t _p = vld1_f16(ptr);\n                _p = vfma_f16(_a, _p, _b);\n                vst1_f16(ptr, _p);\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                float16x4_t _a = vcvt_f16_f32(vld1q_f32((const float*)a_data + i * 4));\n                float16x4_t _b = vcvt_f16_f32(vld1q_f32((const float*)b_data + i * 4));\n\n                __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    _p = vfma_f16(_a, _p, _b);\n                    vst1_f16(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3 || dims == 4)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int d = bottom_top_blob.d;\n            int c = bottom_top_blob.c;\n            int size = w * h * d;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < c; q++)\n            {\n                float16x4_t _a = vcvt_f16_f32(vld1q_f32((const float*)a_data + q * 4));\n                float16x4_t _b = vcvt_f16_f32(vld1q_f32((const float*)b_data + q * 4));\n\n                __fp16* ptr = bottom_top_blob.channel(q);\n\n                for (int i = 0; i < size; 
i++)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    _p = vfma_f16(_a, _p, _b);\n                    vst1_f16(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        __fp16* ptr = bottom_top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < w; i++)\n        {\n            ptr[i] = (__fp16)b_data[i] * ptr[i] + (__fp16)a_data[i];\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n            __fp16 a = (__fp16)a_data[i];\n            __fp16 b = (__fp16)b_data[i];\n\n            float16x4_t _a = vdup_n_f16(a);\n#if defined(_MSC_VER) && !defined(__clang__)\n            float16x4_t _b = vcvt_f16_f32(vdupq_n_f32(b_data[i]));\n#else\n            float16x4_t _b = vdup_n_f16(b);\n#endif\n\n            int j = 0;\n            for (; j + 3 < w; j += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                _p = vfma_f16(_a, _p, _b);\n                vst1_f16(ptr, _p);\n\n                ptr += 4;\n            }\n            for (; j < w; j++)\n            {\n                *ptr = b * *ptr + a;\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int d = bottom_top_blob.d;\n        int c = bottom_top_blob.c;\n        int size = w * h * d;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            __fp16 a = (__fp16)a_data[q];\n           
 __fp16 b = (__fp16)b_data[q];\n\n            float16x4_t _a = vdup_n_f16(a);\n#if defined(_MSC_VER) && !defined(__clang__)\n            float16x4_t _b = vcvt_f16_f32(vdupq_n_f32(b_data[q]));\n#else\n            float16x4_t _b = vdup_n_f16(b);\n#endif\n\n            int j = 0;\n            for (; j + 3 < size; j += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                _p = vfma_f16(_a, _p, _b);\n                vst1_f16(ptr, _p);\n\n                ptr += 4;\n            }\n            for (; j < size; j++)\n            {\n                *ptr = b * *ptr + a;\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/bias_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"bias_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\nint Bias_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    const float* bias_ptr = bias_data;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        float bias = bias_ptr[q];\n\n#if __ARM_NEON\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n        float32x4_t _bias = vdupq_n_f32(bias);\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _outp = vaddq_f32(_p, _bias);\n            vst1q_f32(ptr, _outp);\n\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n\n        for (; remain > 0; remain--)\n        {\n            *ptr = *ptr + bias;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/bias_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BIAS_ARM_H\n#define LAYER_BIAS_ARM_H\n\n#include \"bias.h\"\n\nnamespace ncnn {\n\nclass Bias_arm : public Bias\n{\npublic:\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BIAS_ARM_H\n"
  },
  {
    "path": "src/layer/arm/binaryop_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"binaryop_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nBinaryOp_arm::BinaryOp_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_no_broadcast(const float* ptr, const float* ptr1, float* outptr, int size)\n{\n    const Op op;\n\n    int i = 0;\n#if __ARM_NEON\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _p = vld1q_f32(ptr);\n        float32x4_t _b = vld1q_f32(ptr1);\n        float32x4_t _outp = op(_p, _b);\n        vst1q_f32(outptr, _outp);\n        ptr += 4;\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, *ptr1);\n        ptr += 1;\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_b(const float* ptr, const float* ptr1, float* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float b = *ptr1;\n\n    int i = 0;\n#if __ARM_NEON\n    float32x4_t _b_128 = (elempack == 4) ? 
vld1q_f32(ptr1) : vdupq_n_f32(b);\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _p = vld1q_f32(ptr);\n        float32x4_t _outp = op(_p, _b_128);\n        vst1q_f32(outptr, _outp);\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, b);\n        ptr += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_a(const float* ptr, const float* ptr1, float* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float a = *ptr;\n\n    int i = 0;\n#if __ARM_NEON\n    float32x4_t _a_128 = (elempack == 4) ? vld1q_f32(ptr) : vdupq_n_f32(a);\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _b = vld1q_f32(ptr1);\n        float32x4_t _outp = op(_a_128, _b);\n        vst1q_f32(outptr, _outp);\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        *outptr = op(a, *ptr1);\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb(const float* ptr, const float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        int i = 0;\n        for (; i < w; i++)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _b = vdupq_n_f32(*ptr1);\n            float32x4_t _outp = op(_p, _b);\n            vst1q_f32(outptr, _outp);\n            ptr += 4;\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __ARM_NEON\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_b(const float* ptr, const float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n    const int size = w * elempack;\n\n    int i = 0;\n#if __ARM_NEON\n    float32x4_t _b = vdupq_n_f32(*ptr1);\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _p = vld1q_f32(ptr);\n        float32x4_t _outp = 
op(_p, _b);\n        vst1q_f32(outptr, _outp);\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_a(const float* ptr, const float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        int i = 0;\n        float32x4_t _p = vld1q_f32(ptr);\n        for (; i < w; i++)\n        {\n            float32x4_t _b = vdupq_n_f32(*ptr1);\n            float32x4_t _outp = op(_p, _b);\n            vst1q_f32(outptr, _outp);\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __ARM_NEON\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector(const float* ptr, const float* ptr1, float* outptr, int aw, int bw, int ap, int bp)\n{\n    const int w = std::max(aw, bw);\n    const int elempack = std::max(ap, bp);\n    const int size = w * elempack;\n\n    if (ap == bp)\n    {\n        if (aw == bw)\n        {\n            // no broadcast\n            return binary_op_vector_no_broadcast<Op>(ptr, ptr1, outptr, size);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast single b\n            return binary_op_vector_broadcast_b<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a\n            return binary_op_vector_broadcast_a<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n    }\n\n    if (bp == 1)\n    {\n        if (aw == bw)\n        {\n            // broadcast pack1 b\n            return binary_op_vector_broadcast_pb<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast pack1 single b\n            return binary_op_vector_broadcast_pb_b<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a and pack1 b\n            return binary_op_vector_broadcast_pb_a<Op>(ptr, ptr1, outptr, w, elempack);\n        
}\n    }\n\n    // shall never reach here\n}\n\nnamespace BinaryOp_arm_functor {\n\n#if __ARM_NEON\n#define MAKE_FUNCTION(NAME, IMPL, IMPL4)                                         \\\n    struct NAME                                                                  \\\n    {                                                                            \\\n        float operator()(const float& x, const float& y) const                   \\\n        {                                                                        \\\n            return IMPL;                                                         \\\n        }                                                                        \\\n        float32x4_t operator()(const float32x4_t& x, const float32x4_t& y) const \\\n        {                                                                        \\\n            return IMPL4;                                                        \\\n        }                                                                        \\\n    };\n#else\n#define MAKE_FUNCTION(NAME, IMPL, IMPL4)                       \\\n    struct NAME                                                \\\n    {                                                          \\\n        float operator()(const float& x, const float& y) const \\\n        {                                                      \\\n            return IMPL;                                       \\\n        }                                                      \\\n    };\n#endif\n\n// clang-format off\n// *INDENT-OFF*\nMAKE_FUNCTION(binary_op_add, x + y, vaddq_f32(x, y))\nMAKE_FUNCTION(binary_op_sub, x - y, vsubq_f32(x, y))\nMAKE_FUNCTION(binary_op_mul, x * y, vmulq_f32(x, y))\n#if __aarch64__\nMAKE_FUNCTION(binary_op_div, x / y, vdivq_f32(x, y))\n#else\nMAKE_FUNCTION(binary_op_div, x / y, div_ps(x, y))\n#endif\nMAKE_FUNCTION(binary_op_max, std::max(x, y), vmaxq_f32(x, y))\nMAKE_FUNCTION(binary_op_min, std::min(x, y), vminq_f32(x, 
y))\nMAKE_FUNCTION(binary_op_pow, (float)powf(x, y), pow_ps(x, y))\nMAKE_FUNCTION(binary_op_rsub, y - x, vsubq_f32(y, x))\n#if __aarch64__\nMAKE_FUNCTION(binary_op_rdiv, y / x, vdivq_f32(y, x))\n#else\nMAKE_FUNCTION(binary_op_rdiv, y / x, div_ps(y, x))\n#endif\nMAKE_FUNCTION(binary_op_rpow, (float)powf(y, x), pow_ps(y, x))\nMAKE_FUNCTION(binary_op_atan2, atan2f(x, y), atan2_ps(x, y))\nMAKE_FUNCTION(binary_op_ratan2, atan2f(y, x), atan2_ps(y, x))\nMAKE_FUNCTION(binary_op_fmod, (float)fmodf(x, y), fmod_ps(x, y))\nMAKE_FUNCTION(binary_op_rfmod, (float)fmodf(y, x), fmod_ps(y, x))\nMAKE_FUNCTION(binary_op_logaddexp, (float)(std::max(x, y) + log1pf(expf(std::min(x, y) - std::max(x, y)))), logaddexp_ps(x, y))\nMAKE_FUNCTION(binary_op_floor_divide, (float)floorf(x / y), floor_divide_ps(x, y))\nMAKE_FUNCTION(binary_op_rfloor_divide, (float)floorf(y / x), floor_divide_ps(y, x))\nMAKE_FUNCTION(binary_op_remainder, (float)remainderf(x, y), remainder_ps(x, y))\nMAKE_FUNCTION(binary_op_rremainder, (float)remainderf(y, x), remainder_ps(y, x))\n// *INDENT-ON*\n// clang-format on\n\n#undef MAKE_FUNCTION\n\n} // namespace BinaryOp_arm_functor\n\nstatic void binary_op_vector(const float* ptr, const float* ptr1, float* outptr, int aw, int bw, int ap, int bp, int op_type)\n{\n    using namespace BinaryOp_arm_functor;\n\n    if (op_type == BinaryOp::Operation_ADD) return binary_op_vector<binary_op_add>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_vector<binary_op_sub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MUL) return binary_op_vector<binary_op_mul>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_vector<binary_op_div>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_vector<binary_op_max>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MIN) return 
binary_op_vector<binary_op_min>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_vector<binary_op_pow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_vector<binary_op_rsub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_vector<binary_op_rdiv>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_vector<binary_op_rpow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_ATAN2) return binary_op_vector<binary_op_atan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_vector<binary_op_ratan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_vector<binary_op_fmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_vector<binary_op_rfmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_LOGADDEXP) return binary_op_vector<binary_op_logaddexp>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return binary_op_vector<binary_op_floor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return binary_op_vector<binary_op_rfloor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_vector<binary_op_remainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_vector<binary_op_rremainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n\n    // should never reach here\n}\n\nstatic void binary_op_scalar(const Mat& a, float b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for 
num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = a.channel(q);\n        float* outptr = c.channel(q);\n\n        binary_op_vector(ptr, &b, outptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_no_broadcast(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = a.channel(q);\n        const float* ptr1 = b.channel(q);\n        float* outptr = c.channel(q);\n\n        binary_op_vector(ptr, ptr1, outptr, size, size, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_broadcast(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    if (b.w * b.h * b.d * b.c * b.elempack == 1)\n    {\n        return binary_op_scalar(a, b[0], c, op_type, opt);\n    }\n\n    if (a.dims == b.dims && a.w == b.w && a.h == b.h && a.d == b.d && a.c == b.c && a.elempack == b.elempack)\n    {\n        return binary_op_no_broadcast(a, b, c, op_type, opt);\n    }\n\n    const int dims = c.dims;\n\n    if (dims == 2)\n    {\n        const int h = c.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int y = 0; y < h; y++)\n        {\n            const int y0 = std::min(y, a.h - 1);\n            const int y1 = std::min(y, b.h - 1);\n\n            const float* ptr = a.row(y0);\n            const float* ptr1 = b.row(y1);\n            float* outptr = c.row(y);\n\n            binary_op_vector(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int channels = c.c;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int q0 = std::min(q, a.c - 1);\n            const int q1 = std::min(q, b.c 
- 1);\n\n            if (b.d * b.h * b.w == 1)\n            {\n                const float* ptr = a.channel(q0);\n                const float* ptr1 = b.channel(q1);\n                float* outptr = c.channel(q);\n\n                binary_op_vector(ptr, ptr1, outptr, a.w * a.h * a.d, 1, a.elempack, b.elempack, op_type);\n                continue;\n            }\n\n            if (b.h * b.w == 1)\n            {\n                for (int z = 0; z < c.d; z++)\n                {\n                    const int z0 = std::min(z, a.d - 1);\n                    const int z1 = std::min(z, b.d - 1);\n\n                    const float* ptr = a.channel(q0).depth(z0);\n                    const float* ptr1 = b.channel(q1).depth(z1);\n                    float* outptr = c.channel(q).depth(z);\n\n                    binary_op_vector(ptr, ptr1, outptr, a.w * a.h, 1, a.elempack, b.elempack, op_type);\n                }\n                continue;\n            }\n\n            for (int z = 0; z < c.d; z++)\n            {\n                const int z0 = std::min(z, a.d - 1);\n                const int z1 = std::min(z, b.d - 1);\n\n                for (int y = 0; y < c.h; y++)\n                {\n                    const int y0 = std::min(y, a.h - 1);\n                    const int y1 = std::min(y, b.h - 1);\n\n                    const float* ptr = a.channel(q0).depth(z0).row(y0);\n                    const float* ptr1 = b.channel(q1).depth(z1).row(y1);\n                    float* outptr = c.channel(q).depth(z).row(y);\n\n                    binary_op_vector(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n                }\n            }\n        }\n    }\n}\n\nstatic void binary_op_scalar_inplace(Mat& a, float b, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = 
a.channel(q);\n\n        binary_op_vector(ptr, &b, ptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic int get_reverse_op_type(int op_type)\n{\n    if (op_type == BinaryOp::Operation_SUB) return BinaryOp::Operation_RSUB;\n    if (op_type == BinaryOp::Operation_DIV) return BinaryOp::Operation_RDIV;\n    if (op_type == BinaryOp::Operation_POW) return BinaryOp::Operation_RPOW;\n    if (op_type == BinaryOp::Operation_ATAN2) return BinaryOp::Operation_RATAN2;\n    if (op_type == BinaryOp::Operation_FMOD) return BinaryOp::Operation_RFMOD;\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return BinaryOp::Operation_RFLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_REMAINDER) return BinaryOp::Operation_RREMAINDER;\n\n    if (op_type == BinaryOp::Operation_RSUB) return BinaryOp::Operation_SUB;\n    if (op_type == BinaryOp::Operation_RDIV) return BinaryOp::Operation_DIV;\n    if (op_type == BinaryOp::Operation_RPOW) return BinaryOp::Operation_POW;\n    if (op_type == BinaryOp::Operation_RATAN2) return BinaryOp::Operation_ATAN2;\n    if (op_type == BinaryOp::Operation_RFMOD) return BinaryOp::Operation_FMOD;\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return BinaryOp::Operation_FLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_RREMAINDER) return BinaryOp::Operation_REMAINDER;\n\n    return op_type;\n}\n\nint BinaryOp_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int elembits = std::max(bottom_blobs[0].elembits(), bottom_blobs[1].elembits());\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    const int outdims = std::max(A.dims, B.dims);\n\n    Mat A2 = A;\n    Mat B2 = 
B;\n    if (A.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (A.w * A.elempack == B.h * B.elempack)\n                A2 = A.reshape(1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 2;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 3;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 2)\n            A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 4;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 4 && A.dims == 2)\n            A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 3)\n            A2 = A.reshape(1, A.w, A.h, A.c, opt.workspace_allocator);\n    }\n    if (B.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (B.w * B.elempack == A.h * A.elempack)\n                B2 = B.reshape(1, B.w, 
opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 2;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 3;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 2)\n            B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 4;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 4 && B.dims == 2)\n            B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 3)\n            B2 = B.reshape(1, B.w, B.h, B.c, opt.workspace_allocator);\n    }\n\n    const int outw = std::max(A2.w, B2.w);\n    const int outh = std::max(A2.h, B2.h);\n    const int outd = std::max(A2.d, B2.d);\n    const int outc = std::max(A2.c, B2.c);\n    const size_t out_elemsize = std::max(A2.elemsize, B2.elemsize);\n    const int out_elempack = std::max(A2.elempack, B2.elempack);\n\n    Mat& top_blob = top_blobs[0];\n    if (outdims == 1)\n    {\n        
top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 2)\n    {\n        top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 3)\n    {\n        top_blob.create(outw, outh, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 4)\n    {\n        top_blob.create(outw, outh, outd, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    const bool a_pack_is_lower = A2.elempack < B2.elempack;\n    const bool a_pack_is_equal = A2.elempack == B2.elempack;\n    const bool a_size_is_lower = A2.w * A2.h * A2.d * A2.c * A2.elempack < B2.w * B2.h * B2.d * B2.c * B2.elempack;\n    if (a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))\n    {\n        binary_op_broadcast(B2, A2, top_blob, get_reverse_op_type(op_type), opt);\n    }\n    else\n    {\n        binary_op_broadcast(A2, B2, top_blob, op_type, opt);\n    }\n\n    return 0;\n}\n\nint BinaryOp_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    binary_op_scalar_inplace(bottom_top_blob, b, op_type, opt);\n\n    return 0;\n}\n\n#if NCNN_BF16\ntemplate<typename Op>\nstatic void binary_op_vector_no_broadcast_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int size)\n{\n    const Op op;\n\n    int i = 0;\n#if __ARM_NEON\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _p = bfloat2float(vld1_u16(ptr));\n        float32x4_t _b = bfloat2float(vld1_u16(ptr1));\n        float32x4_t _outp = op(_p, _b);\n        vst1_u16(outptr, 
float2bfloat(_outp));\n        ptr += 4;\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        *outptr = float32_to_bfloat16(op(bfloat16_to_float32(*ptr), bfloat16_to_float32(*ptr1)));\n        ptr += 1;\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_b_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float b = bfloat16_to_float32(*ptr1);\n\n    int i = 0;\n#if __ARM_NEON\n    float32x4_t _b_128 = (elempack == 4) ? bfloat2float(vld1_u16(ptr1)) : vdupq_n_f32(b);\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _p = bfloat2float(vld1_u16(ptr));\n        float32x4_t _outp = op(_p, _b_128);\n        vst1_u16(outptr, float2bfloat(_outp));\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        *outptr = float32_to_bfloat16(op(bfloat16_to_float32(*ptr), b));\n        ptr += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_a_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float a = bfloat16_to_float32(*ptr);\n\n    int i = 0;\n#if __ARM_NEON\n    float32x4_t _a_128 = (elempack == 4) ? 
bfloat2float(vld1_u16(ptr)) : vdupq_n_f32(a);\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _b = bfloat2float(vld1_u16(ptr1));\n        float32x4_t _outp = op(_a_128, _b);\n        vst1_u16(outptr, float2bfloat(_outp));\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        *outptr = float32_to_bfloat16(op(a, bfloat16_to_float32(*ptr1)));\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        int i = 0;\n        for (; i < w; i++)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _b = bfloat2float(vdup_n_u16(*ptr1));\n            float32x4_t _outp = op(_p, _b);\n            vst1_u16(outptr, float2bfloat(_outp));\n            ptr += 4;\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __ARM_NEON\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_b_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int w, int elempack)\n{\n    const Op op;\n\n    const int size = w * elempack;\n\n    int i = 0;\n#if __ARM_NEON\n    float32x4_t _b = bfloat2float(vdup_n_u16(*ptr1));\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _p = bfloat2float(vld1_u16(ptr));\n        float32x4_t _outp = op(_p, _b);\n        vst1_u16(outptr, float2bfloat(_outp));\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_a_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        int i = 0;\n        float32x4_t _p = 
bfloat2float(vld1_u16(ptr));\n        for (; i < w; i++)\n        {\n            float32x4_t _b = bfloat2float(vdup_n_u16(*ptr1));\n            float32x4_t _outp = op(_p, _b);\n            vst1_u16(outptr, float2bfloat(_outp));\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __ARM_NEON\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int aw, int bw, int ap, int bp)\n{\n    const int w = std::max(aw, bw);\n    const int elempack = std::max(ap, bp);\n    const int size = w * elempack;\n\n    if (ap == bp)\n    {\n        if (aw == bw)\n        {\n            // no broadcast\n            return binary_op_vector_no_broadcast_bf16s<Op>(ptr, ptr1, outptr, size);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast single b\n            return binary_op_vector_broadcast_b_bf16s<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a\n            return binary_op_vector_broadcast_a_bf16s<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n    }\n\n    if (bp == 1)\n    {\n        if (aw == bw)\n        {\n            // broadcast pack1 b\n            return binary_op_vector_broadcast_pb_bf16s<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast pack1 single b\n            return binary_op_vector_broadcast_pb_b_bf16s<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a and pack1 b\n            return binary_op_vector_broadcast_pb_a_bf16s<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n    }\n\n    // shall never reach here\n}\n\nstatic void binary_op_vector_bf16s(const unsigned short* ptr, const unsigned short* ptr1, unsigned short* outptr, int aw, int bw, int ap, int bp, int op_type)\n{\n    using namespace BinaryOp_arm_functor;\n\n    if (op_type 
== BinaryOp::Operation_ADD) return binary_op_vector_bf16s<binary_op_add>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_vector_bf16s<binary_op_sub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MUL) return binary_op_vector_bf16s<binary_op_mul>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_vector_bf16s<binary_op_div>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_vector_bf16s<binary_op_max>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MIN) return binary_op_vector_bf16s<binary_op_min>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_vector_bf16s<binary_op_pow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_vector_bf16s<binary_op_rsub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_vector_bf16s<binary_op_rdiv>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_vector_bf16s<binary_op_rpow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_ATAN2) return binary_op_vector_bf16s<binary_op_atan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_vector_bf16s<binary_op_ratan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_vector_bf16s<binary_op_fmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_vector_bf16s<binary_op_rfmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_LOGADDEXP) return binary_op_vector_bf16s<binary_op_logaddexp>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return 
binary_op_vector_bf16s<binary_op_floor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return binary_op_vector_bf16s<binary_op_rfloor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_vector_bf16s<binary_op_remainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_vector_bf16s<binary_op_rremainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n\n    // should never reach here\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_scalar_b_bf16s(const unsigned short* ptr, float b, unsigned short* outptr, int size)\n{\n    const Op op;\n\n    int i = 0;\n#if __ARM_NEON\n    float32x4_t _b_128 = vdupq_n_f32(b);\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _p = bfloat2float(vld1_u16(ptr));\n        float32x4_t _outp = op(_p, _b_128);\n        vst1_u16(outptr, float2bfloat(_outp));\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        *outptr = float32_to_bfloat16(op(bfloat16_to_float32(*ptr), b));\n        ptr += 1;\n        outptr += 1;\n    }\n}\n\nstatic void binary_op_vector_scalar_b_bf16s(const unsigned short* ptr, float b, unsigned short* outptr, int size, int op_type)\n{\n    using namespace BinaryOp_arm_functor;\n\n    if (op_type == BinaryOp::Operation_ADD) return binary_op_vector_scalar_b_bf16s<binary_op_add>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_vector_scalar_b_bf16s<binary_op_sub>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_MUL) return binary_op_vector_scalar_b_bf16s<binary_op_mul>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_vector_scalar_b_bf16s<binary_op_div>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_vector_scalar_b_bf16s<binary_op_max>(ptr, b, outptr, size);\n    if 
(op_type == BinaryOp::Operation_MIN) return binary_op_vector_scalar_b_bf16s<binary_op_min>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_vector_scalar_b_bf16s<binary_op_pow>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_vector_scalar_b_bf16s<binary_op_rsub>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_vector_scalar_b_bf16s<binary_op_rdiv>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_vector_scalar_b_bf16s<binary_op_rpow>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_ATAN2) return binary_op_vector_scalar_b_bf16s<binary_op_atan2>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_vector_scalar_b_bf16s<binary_op_ratan2>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_vector_scalar_b_bf16s<binary_op_fmod>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_vector_scalar_b_bf16s<binary_op_rfmod>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_LOGADDEXP) return binary_op_vector_scalar_b_bf16s<binary_op_logaddexp>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return binary_op_vector_scalar_b_bf16s<binary_op_floor_divide>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return binary_op_vector_scalar_b_bf16s<binary_op_rfloor_divide>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_vector_scalar_b_bf16s<binary_op_remainder>(ptr, b, outptr, size);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_vector_scalar_b_bf16s<binary_op_rremainder>(ptr, b, outptr, size);\n\n    // should never reach here\n}\n\nstatic void binary_op_scalar_bf16s(const Mat& a, float b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * 
a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const unsigned short* ptr = a.channel(q);\n        unsigned short* outptr = c.channel(q);\n\n        binary_op_vector_scalar_b_bf16s(ptr, b, outptr, size, op_type);\n    }\n}\n\nstatic void binary_op_no_broadcast_bf16s(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const unsigned short* ptr = a.channel(q);\n        const unsigned short* ptr1 = b.channel(q);\n        unsigned short* outptr = c.channel(q);\n\n        binary_op_vector_bf16s(ptr, ptr1, outptr, size, size, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_broadcast_bf16s(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    if (b.w * b.h * b.d * b.c * b.elempack == 1)\n    {\n        return binary_op_scalar_bf16s(a, bfloat16_to_float32(((const unsigned short*)b)[0]), c, op_type, opt);\n    }\n\n    if (a.dims == b.dims && a.w == b.w && a.h == b.h && a.d == b.d && a.c == b.c && a.elempack == b.elempack)\n    {\n        return binary_op_no_broadcast_bf16s(a, b, c, op_type, opt);\n    }\n\n    const int dims = c.dims;\n\n    if (dims == 2)\n    {\n        const int h = c.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int y = 0; y < h; y++)\n        {\n            const int y0 = std::min(y, a.h - 1);\n            const int y1 = std::min(y, b.h - 1);\n\n            const unsigned short* ptr = a.row<const unsigned short>(y0);\n            const unsigned short* ptr1 = b.row<const unsigned short>(y1);\n            unsigned short* outptr = c.row<unsigned short>(y);\n\n            binary_op_vector_bf16s(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n        }\n    }\n\n    if (dims 
== 3 || dims == 4)\n    {\n        const int channels = c.c;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int q0 = std::min(q, a.c - 1);\n            const int q1 = std::min(q, b.c - 1);\n\n            if (b.d * b.h * b.w == 1)\n            {\n                const unsigned short* ptr = a.channel(q0);\n                const unsigned short* ptr1 = b.channel(q1);\n                unsigned short* outptr = c.channel(q);\n\n                binary_op_vector_bf16s(ptr, ptr1, outptr, a.w * a.h * a.d, 1, a.elempack, b.elempack, op_type);\n                continue;\n            }\n\n            if (b.h * b.w == 1)\n            {\n                for (int z = 0; z < c.d; z++)\n                {\n                    const int z0 = std::min(z, a.d - 1);\n                    const int z1 = std::min(z, b.d - 1);\n\n                    const unsigned short* ptr = a.channel(q0).depth(z0);\n                    const unsigned short* ptr1 = b.channel(q1).depth(z1);\n                    unsigned short* outptr = c.channel(q).depth(z);\n\n                    binary_op_vector_bf16s(ptr, ptr1, outptr, a.w * a.h, 1, a.elempack, b.elempack, op_type);\n                }\n                continue;\n            }\n\n            for (int z = 0; z < c.d; z++)\n            {\n                const int z0 = std::min(z, a.d - 1);\n                const int z1 = std::min(z, b.d - 1);\n\n                for (int y = 0; y < c.h; y++)\n                {\n                    const int y0 = std::min(y, a.h - 1);\n                    const int y1 = std::min(y, b.h - 1);\n\n                    const unsigned short* ptr = a.channel(q0).depth(z0).row<const unsigned short>(y0);\n                    const unsigned short* ptr1 = b.channel(q1).depth(z1).row<const unsigned short>(y1);\n                    unsigned short* outptr = c.channel(q).depth(z).row<unsigned short>(y);\n\n                    
binary_op_vector_bf16s(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n                }\n            }\n        }\n    }\n}\n\nstatic void binary_op_scalar_inplace_bf16s(Mat& a, float b, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = a.channel(q);\n\n        binary_op_vector_scalar_b_bf16s(ptr, b, ptr, size, op_type);\n    }\n}\n\nint BinaryOp_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    const int outdims = std::max(A.dims, B.dims);\n\n    Mat A2 = A;\n    Mat B2 = B;\n    if (A.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (A.w * A.elempack == B.h * B.elempack)\n                A2 = A.reshape(1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 2;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 3;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 2)\n            A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 1)\n        {\n         
   if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 4;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 4 && A.dims == 2)\n            A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 3)\n            A2 = A.reshape(1, A.w, A.h, A.c, opt.workspace_allocator);\n    }\n    if (B.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (B.w * B.elempack == A.h * A.elempack)\n                B2 = B.reshape(1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 2;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 3;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 2)\n            B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 4;\n        
        B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 4 && B.dims == 2)\n            B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 3)\n            B2 = B.reshape(1, B.w, B.h, B.c, opt.workspace_allocator);\n    }\n\n    const int outw = std::max(A2.w, B2.w);\n    const int outh = std::max(A2.h, B2.h);\n    const int outd = std::max(A2.d, B2.d);\n    const int outc = std::max(A2.c, B2.c);\n    const size_t out_elemsize = std::max(A2.elemsize, B2.elemsize);\n    const int out_elempack = std::max(A2.elempack, B2.elempack);\n\n    Mat& top_blob = top_blobs[0];\n    if (outdims == 1)\n    {\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 2)\n    {\n        top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 3)\n    {\n        top_blob.create(outw, outh, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 4)\n    {\n        top_blob.create(outw, outh, outd, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    const bool a_pack_is_lower = A2.elempack < B2.elempack;\n    const bool a_pack_is_equal = A2.elempack == B2.elempack;\n    const bool a_size_is_lower = A2.w * A2.h * A2.d * A2.c * A2.elempack < B2.w * B2.h * B2.d * B2.c * B2.elempack;\n    if (a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))\n    {\n        binary_op_broadcast_bf16s(B2, A2, top_blob, get_reverse_op_type(op_type), opt);\n    }\n    else\n    {\n        binary_op_broadcast_bf16s(A2, B2, top_blob, op_type, opt);\n    }\n\n    return 0;\n}\n\nint BinaryOp_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    
binary_op_scalar_inplace_bf16s(bottom_top_blob, b, op_type, opt);\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/binaryop_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BINARYOP_ARM_H\n#define LAYER_BINARYOP_ARM_H\n\n#include \"binaryop.h\"\n\nnamespace ncnn {\n\nclass BinaryOp_arm : public BinaryOp\n{\npublic:\n    BinaryOp_arm();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BINARYOP_ARM_H\n"
  },
  {
    "path": "src/layer/arm/binaryop_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"binaryop_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic inline float16x4_t fmod_f16(const float16x4_t& x, const float16x4_t& y)\n{\n    float32x4_t fx = vcvt_f32_f16(x);\n    float32x4_t fy = vcvt_f32_f16(y);\n    return vcvt_f16_f32(fmod_ps(fx, fy));\n}\n\nstatic inline float16x8_t fmodq_f16(const float16x8_t& x, const float16x8_t& y)\n{\n    float16x4_t xl = vget_low_f16(x);\n    float16x4_t xh = vget_high_f16(x);\n    float16x4_t yl = vget_low_f16(y);\n    float16x4_t yh = vget_high_f16(y);\n\n    float16x4_t rl = fmod_f16(xl, yl);\n    float16x4_t rh = fmod_f16(xh, yh);\n    return vcombine_f16(rl, rh);\n}\n\nstatic inline float16x4_t round_f16(const float16x4_t& x)\n{\n    return vcvt_f16_f32(round_ps(vcvt_f32_f16(x)));\n}\n\nstatic inline float16x8_t roundq_f16(const float16x8_t& x)\n{\n    float16x4_t xl = vget_low_f16(x);\n    float16x4_t xh = vget_high_f16(x);\n    float16x4_t rl = round_f16(xl);\n    float16x4_t rh = round_f16(xh);\n    return vcombine_f16(rl, rh);\n}\n\nstatic inline float16x4_t logaddexp_f16(const float16x4_t& x, const float16x4_t& y)\n{\n    return vcvt_f16_f32(logaddexp_ps(vcvt_f32_f16(x), vcvt_f32_f16(y)));\n}\n\nstatic inline float16x8_t logaddexpq_f16(const float16x8_t& x, const float16x8_t& y)\n{\n    float16x4_t xl = vget_low_f16(x);\n    float16x4_t xh = vget_high_f16(x);\n    float16x4_t yl = vget_low_f16(y);\n    float16x4_t yh = vget_high_f16(y);\n    float16x4_t rl = logaddexp_f16(xl, yl);\n    float16x4_t rh = logaddexp_f16(xh, yh);\n    return vcombine_f16(rl, rh);\n}\n\nstatic inline float16x4_t floor_divide_f16(const float16x4_t& x, const float16x4_t& y)\n{\n    return vcvt_f16_f32(floor_divide_ps(vcvt_f32_f16(x), vcvt_f32_f16(y)));\n}\n\nstatic inline float16x8_t 
floor_divideq_f16(const float16x8_t& x, const float16x8_t& y)\n{\n    float16x4_t xl = vget_low_f16(x);\n    float16x4_t xh = vget_high_f16(x);\n    float16x4_t yl = vget_low_f16(y);\n    float16x4_t yh = vget_high_f16(y);\n    float16x4_t rl = floor_divide_f16(xl, yl);\n    float16x4_t rh = floor_divide_f16(xh, yh);\n    return vcombine_f16(rl, rh);\n}\n\nstatic inline float16x4_t remainder_f16(const float16x4_t& x, const float16x4_t& y)\n{\n    return vcvt_f16_f32(remainder_ps(vcvt_f32_f16(x), vcvt_f32_f16(y)));\n}\n\nstatic inline float16x8_t remainderq_f16(const float16x8_t& x, const float16x8_t& y)\n{\n    float16x4_t xl = vget_low_f16(x);\n    float16x4_t xh = vget_high_f16(x);\n    float16x4_t yl = vget_low_f16(y);\n    float16x4_t yh = vget_high_f16(y);\n    float16x4_t rl = remainder_f16(xl, yl);\n    float16x4_t rh = remainder_f16(xh, yh);\n    return vcombine_f16(rl, rh);\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_no_broadcast_fp16s(const __fp16* ptr, const __fp16* ptr1, __fp16* outptr, int size)\n{\n    const Op op;\n\n    int i = 0;\n    for (; i + 7 < size; i += 8)\n    {\n        float16x8_t _p = vld1q_f16(ptr);\n        float16x8_t _b = vld1q_f16(ptr1);\n        float16x8_t _outp = op(_p, _b);\n        vst1q_f16(outptr, _outp);\n        ptr += 8;\n        ptr1 += 8;\n        outptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        float16x4_t _p = vld1_f16(ptr);\n        float16x4_t _b = vld1_f16(ptr1);\n        float16x4_t _outp = op(_p, _b);\n        vst1_f16(outptr, _outp);\n        ptr += 4;\n        ptr1 += 4;\n        outptr += 4;\n    }\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, *ptr1);\n        ptr += 1;\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_b_fp16s(const __fp16* ptr, const __fp16* ptr1, __fp16* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const __fp16 b = *ptr1;\n\n    int i = 0;\n    float16x4_t 
_b_128 = (elempack == 4) ? vld1_f16(ptr1) : vdup_n_f16(b);\n    float16x8_t _b_256 = (elempack == 8) ? vld1q_f16(ptr1) : vcombine_f16(_b_128, _b_128);\n    for (; i + 7 < size; i += 8)\n    {\n        float16x8_t _p = vld1q_f16(ptr);\n        float16x8_t _outp = op(_p, _b_256);\n        vst1q_f16(outptr, _outp);\n        ptr += 8;\n        outptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        float16x4_t _p = vld1_f16(ptr);\n        float16x4_t _outp = op(_p, _b_128);\n        vst1_f16(outptr, _outp);\n        ptr += 4;\n        outptr += 4;\n    }\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, b);\n        ptr += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_a_fp16s(const __fp16* ptr, const __fp16* ptr1, __fp16* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const __fp16 a = *ptr;\n\n    int i = 0;\n    float16x4_t _a_128 = (elempack == 4) ? vld1_f16(ptr) : vdup_n_f16(a);\n    float16x8_t _a_256 = (elempack == 8) ? 
vld1q_f16(ptr) : vcombine_f16(_a_128, _a_128);\n    for (; i + 7 < size; i += 8)\n    {\n        float16x8_t _b = vld1q_f16(ptr1);\n        float16x8_t _outp = op(_a_256, _b);\n        vst1q_f16(outptr, _outp);\n        ptr1 += 8;\n        outptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        float16x4_t _b = vld1_f16(ptr1);\n        float16x4_t _outp = op(_a_128, _b);\n        vst1_f16(outptr, _outp);\n        ptr1 += 4;\n        outptr += 4;\n    }\n    for (; i < size; i++)\n    {\n        *outptr = op(a, *ptr1);\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_fp16s(const __fp16* ptr, const __fp16* ptr1, __fp16* outptr, int w, int elempack)\n{\n    const Op op;\n\n    if (elempack == 8)\n    {\n        int i = 0;\n        for (; i < w; i++)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x8_t _b = vdupq_n_f16(*ptr1);\n            float16x8_t _outp = op(_p, _b);\n            vst1q_f16(outptr, _outp);\n            ptr += 8;\n            ptr1 += 1;\n            outptr += 8;\n        }\n    }\n    if (elempack == 4)\n    {\n        int i = 0;\n        for (; i + 1 < w; i += 2)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x4_t _b0 = vdup_n_f16(ptr1[0]);\n            float16x4_t _b1 = vdup_n_f16(ptr1[1]);\n            float16x8_t _b = vcombine_f16(_b0, _b1);\n            float16x8_t _outp = op(_p, _b);\n            vst1q_f16(outptr, _outp);\n            ptr += 8;\n            ptr1 += 2;\n            outptr += 8;\n        }\n        for (; i < w; i++)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            float16x4_t _b = vdup_n_f16(*ptr1);\n            float16x4_t _outp = op(_p, _b);\n            vst1_f16(outptr, _outp);\n            ptr += 4;\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_b_fp16s(const __fp16* ptr, 
const __fp16* ptr1, __fp16* outptr, int w, int elempack)\n{\n    const Op op;\n\n    const int size = w * elempack;\n\n    int i = 0;\n    float16x8_t _b = vdupq_n_f16(*ptr1);\n    for (; i + 7 < size; i += 8)\n    {\n        float16x8_t _p = vld1q_f16(ptr);\n        float16x8_t _outp = op(_p, _b);\n        vst1q_f16(outptr, _outp);\n        ptr += 8;\n        outptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        float16x4_t _p = vld1_f16(ptr);\n        float16x4_t _outp = op(_p, vget_low_f16(_b));\n        vst1_f16(outptr, _outp);\n        ptr += 4;\n        outptr += 4;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_a_fp16s(const __fp16* ptr, const __fp16* ptr1, __fp16* outptr, int w, int elempack)\n{\n    const Op op;\n\n    if (elempack == 8)\n    {\n        int i = 0;\n        float16x8_t _p = vld1q_f16(ptr);\n        for (; i < w; i++)\n        {\n            float16x8_t _b = vdupq_n_f16(*ptr1);\n            float16x8_t _outp = op(_p, _b);\n            vst1q_f16(outptr, _outp);\n            ptr1 += 1;\n            outptr += 8;\n        }\n    }\n    if (elempack == 4)\n    {\n        int i = 0;\n        float16x4_t _p0 = vld1_f16(ptr);\n        float16x8_t _p = vcombine_f16(_p0, _p0);\n        for (; i + 1 < w; i += 2)\n        {\n            float16x4_t _b0 = vdup_n_f16(ptr1[0]);\n            float16x4_t _b1 = vdup_n_f16(ptr1[1]);\n            float16x8_t _b = vcombine_f16(_b0, _b1);\n            float16x8_t _outp = op(_p, _b);\n            vst1q_f16(outptr, _outp);\n            ptr1 += 2;\n            outptr += 8;\n        }\n        for (; i < w; i++)\n        {\n            float16x4_t _b = vdup_n_f16(*ptr1);\n            float16x4_t _outp = op(_p0, _b);\n            vst1_f16(outptr, _outp);\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_fp16s(const __fp16* ptr, const __fp16* ptr1, __fp16* outptr, int aw, int bw, int ap, int 
bp)\n{\n    const int w = std::max(aw, bw);\n    const int elempack = std::max(ap, bp);\n    const int size = w * elempack;\n\n    if (ap == bp)\n    {\n        if (aw == bw)\n        {\n            // no broadcast\n            return binary_op_vector_no_broadcast_fp16s<Op>(ptr, ptr1, outptr, size);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast single b\n            return binary_op_vector_broadcast_b_fp16s<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a\n            return binary_op_vector_broadcast_a_fp16s<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n    }\n\n    if (bp == 1)\n    {\n        if (aw == bw)\n        {\n            // broadcast pack1 b\n            return binary_op_vector_broadcast_pb_fp16s<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast pack1 single b\n            return binary_op_vector_broadcast_pb_b_fp16s<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a and pack1 b\n            return binary_op_vector_broadcast_pb_a_fp16s<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n    }\n\n    // shall never reach here\n}\n\nnamespace BinaryOp_arm_functor {\n\n#define MAKE_FUNCTION(NAME, IMPL, IMPL4, IMPL8)                                  \\\n    struct NAME                                                                  \\\n    {                                                                            \\\n        __fp16 operator()(const __fp16& x, const __fp16& y) const                \\\n        {                                                                        \\\n            return IMPL;                                                         \\\n        }                                                                        \\\n        float16x4_t operator()(const float16x4_t& x, const float16x4_t& y) const 
\\\n        {                                                                        \\\n            return IMPL4;                                                        \\\n        }                                                                        \\\n        float16x8_t operator()(const float16x8_t& x, const float16x8_t& y) const \\\n        {                                                                        \\\n            return IMPL8;                                                        \\\n        }                                                                        \\\n    };\n\n// clang-format off\n// *INDENT-OFF*\nMAKE_FUNCTION(binary_op_add_fp16s, x + y, vadd_f16(x, y), vaddq_f16(x, y))\nMAKE_FUNCTION(binary_op_sub_fp16s, x - y, vsub_f16(x, y), vsubq_f16(x, y))\nMAKE_FUNCTION(binary_op_mul_fp16s, x * y, vmul_f16(x, y), vmulq_f16(x, y))\nMAKE_FUNCTION(binary_op_div_fp16s, x / y, vdiv_f16(x, y), vdivq_f16(x, y))\nMAKE_FUNCTION(binary_op_max_fp16s, std::max(x, y), vmax_f16(x, y), vmaxq_f16(x, y))\nMAKE_FUNCTION(binary_op_min_fp16s, std::min(x, y), vmin_f16(x, y), vminq_f16(x, y))\nMAKE_FUNCTION(binary_op_pow_fp16s, (__fp16)powf(x, y), vcvt_f16_f32(pow_ps(vcvt_f32_f16(x), vcvt_f32_f16(y))), vcombine_f16(vcvt_f16_f32(pow_ps(vcvt_f32_f16(vget_low_f16(x)), vcvt_f32_f16(vget_low_f16(y)))), vcvt_f16_f32(pow_ps(vcvt_f32_f16(vget_high_f16(x)), vcvt_f32_f16(vget_high_f16(y))))))\nMAKE_FUNCTION(binary_op_rsub_fp16s, y - x, vsub_f16(y, x), vsubq_f16(y, x))\nMAKE_FUNCTION(binary_op_rdiv_fp16s, y / x, vdiv_f16(y, x), vdivq_f16(y, x))\nMAKE_FUNCTION(binary_op_rpow_fp16s, (__fp16)powf(y, x), vcvt_f16_f32(pow_ps(vcvt_f32_f16(y), vcvt_f32_f16(x))), vcombine_f16(vcvt_f16_f32(pow_ps(vcvt_f32_f16(vget_low_f16(y)), vcvt_f32_f16(vget_low_f16(x)))), vcvt_f16_f32(pow_ps(vcvt_f32_f16(vget_high_f16(y)), vcvt_f32_f16(vget_high_f16(x))))))\nMAKE_FUNCTION(binary_op_atan2_fp16s, (__fp16)atan2f(x, y), vcvt_f16_f32(atan2_ps(vcvt_f32_f16(x), vcvt_f32_f16(y))), 
vcombine_f16(vcvt_f16_f32(atan2_ps(vcvt_f32_f16(vget_low_f16(x)), vcvt_f32_f16(vget_low_f16(y)))), vcvt_f16_f32(atan2_ps(vcvt_f32_f16(vget_high_f16(x)), vcvt_f32_f16(vget_high_f16(y))))))\nMAKE_FUNCTION(binary_op_ratan2_fp16s, (__fp16)atan2f(y, x), vcvt_f16_f32(atan2_ps(vcvt_f32_f16(y), vcvt_f32_f16(x))), vcombine_f16(vcvt_f16_f32(atan2_ps(vcvt_f32_f16(vget_low_f16(y)), vcvt_f32_f16(vget_low_f16(x)))), vcvt_f16_f32(atan2_ps(vcvt_f32_f16(vget_high_f16(y)), vcvt_f32_f16(vget_high_f16(x))))))\nMAKE_FUNCTION(binary_op_fmod_fp16s, (__fp16)fmodf((float)x, (float)y), fmod_f16(x, y), fmodq_f16(x, y))\nMAKE_FUNCTION(binary_op_rfmod_fp16s, (__fp16)fmodf((float)y, (float)x), fmod_f16(y, x), fmodq_f16(y, x))\nMAKE_FUNCTION(binary_op_logaddexp_fp16s, (__fp16)(std::max((float)x, (float)y) + log1pf(expf(std::min((float)x, (float)y) - std::max((float)x, (float)y)))), logaddexp_f16(x, y), logaddexpq_f16(x, y))\nMAKE_FUNCTION(binary_op_floor_divide_fp16s, (__fp16)floorf((float)x / (float)y), floor_divide_f16(x, y), floor_divideq_f16(x, y))\nMAKE_FUNCTION(binary_op_rfloor_divide_fp16s, (__fp16)floorf((float)y / (float)x), floor_divide_f16(y, x), floor_divideq_f16(y, x))\nMAKE_FUNCTION(binary_op_remainder_fp16s, (__fp16)remainderf((float)x, (float)y), remainder_f16(x, y), remainderq_f16(x, y))\nMAKE_FUNCTION(binary_op_rremainder_fp16s, (__fp16)remainderf((float)y, (float)x), remainder_f16(y, x), remainderq_f16(y, x))\n// *INDENT-ON*\n// clang-format on\n\n#undef MAKE_FUNCTION\n\n} // namespace BinaryOp_arm_functor\n\nstatic void binary_op_vector_fp16s(const __fp16* ptr, const __fp16* ptr1, __fp16* outptr, int aw, int bw, int ap, int bp, int op_type)\n{\n    using namespace BinaryOp_arm_functor;\n\n    if (op_type == BinaryOp::Operation_ADD) return binary_op_vector_fp16s<binary_op_add_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_vector_fp16s<binary_op_sub_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == 
BinaryOp::Operation_MUL) return binary_op_vector_fp16s<binary_op_mul_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_vector_fp16s<binary_op_div_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_vector_fp16s<binary_op_max_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MIN) return binary_op_vector_fp16s<binary_op_min_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_vector_fp16s<binary_op_pow_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_vector_fp16s<binary_op_rsub_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_vector_fp16s<binary_op_rdiv_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_vector_fp16s<binary_op_rpow_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_ATAN2) return binary_op_vector_fp16s<binary_op_atan2_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_vector_fp16s<binary_op_ratan2_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_vector_fp16s<binary_op_fmod_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_vector_fp16s<binary_op_rfmod_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_LOGADDEXP) return binary_op_vector_fp16s<binary_op_logaddexp_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return binary_op_vector_fp16s<binary_op_floor_divide_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return 
binary_op_vector_fp16s<binary_op_rfloor_divide_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_vector_fp16s<binary_op_remainder_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_vector_fp16s<binary_op_rremainder_fp16s>(ptr, ptr1, outptr, aw, bw, ap, bp);\n\n    // should never reach here\n}\n\nstatic void binary_op_scalar_fp16s(const Mat& a, __fp16 b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const __fp16* ptr = a.channel(q);\n        __fp16* outptr = c.channel(q);\n\n        binary_op_vector_fp16s(ptr, &b, outptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_no_broadcast_fp16s(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const __fp16* ptr = a.channel(q);\n        const __fp16* ptr1 = b.channel(q);\n        __fp16* outptr = c.channel(q);\n\n        binary_op_vector_fp16s(ptr, ptr1, outptr, size, size, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_broadcast_fp16s(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    if (b.w * b.h * b.d * b.c * b.elempack == 1)\n    {\n        return binary_op_scalar_fp16s(a, ((const __fp16*)b)[0], c, op_type, opt);\n    }\n\n    if (a.dims == b.dims && a.w == b.w && a.h == b.h && a.d == b.d && a.c == b.c && a.elempack == b.elempack)\n    {\n        return binary_op_no_broadcast_fp16s(a, b, c, op_type, opt);\n    }\n\n    const int dims = c.dims;\n\n    if (dims == 2)\n    {\n        const int h = c.h;\n\n        #pragma omp parallel for 
num_threads(opt.num_threads)\n        for (int y = 0; y < h; y++)\n        {\n            const int y0 = std::min(y, a.h - 1);\n            const int y1 = std::min(y, b.h - 1);\n\n            const __fp16* ptr = a.row<const __fp16>(y0);\n            const __fp16* ptr1 = b.row<const __fp16>(y1);\n            __fp16* outptr = c.row<__fp16>(y);\n\n            binary_op_vector_fp16s(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int channels = c.c;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int q0 = std::min(q, a.c - 1);\n            const int q1 = std::min(q, b.c - 1);\n\n            if (b.d * b.h * b.w == 1)\n            {\n                const __fp16* ptr = a.channel(q0);\n                const __fp16* ptr1 = b.channel(q1);\n                __fp16* outptr = c.channel(q);\n\n                binary_op_vector_fp16s(ptr, ptr1, outptr, a.w * a.h * a.d, 1, a.elempack, b.elempack, op_type);\n                continue;\n            }\n\n            if (b.h * b.w == 1)\n            {\n                for (int z = 0; z < c.d; z++)\n                {\n                    const int z0 = std::min(z, a.d - 1);\n                    const int z1 = std::min(z, b.d - 1);\n\n                    const __fp16* ptr = a.channel(q0).depth(z0);\n                    const __fp16* ptr1 = b.channel(q1).depth(z1);\n                    __fp16* outptr = c.channel(q).depth(z);\n\n                    binary_op_vector_fp16s(ptr, ptr1, outptr, a.w * a.h, 1, a.elempack, b.elempack, op_type);\n                }\n                continue;\n            }\n\n            for (int z = 0; z < c.d; z++)\n            {\n                const int z0 = std::min(z, a.d - 1);\n                const int z1 = std::min(z, b.d - 1);\n\n                for (int y = 0; y < c.h; y++)\n                {\n                    const int y0 = 
std::min(y, a.h - 1);\n                    const int y1 = std::min(y, b.h - 1);\n\n                    const __fp16* ptr = a.channel(q0).depth(z0).row<const __fp16>(y0);\n                    const __fp16* ptr1 = b.channel(q1).depth(z1).row<const __fp16>(y1);\n                    __fp16* outptr = c.channel(q).depth(z).row<__fp16>(y);\n\n                    binary_op_vector_fp16s(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n                }\n            }\n        }\n    }\n}\n\nstatic void binary_op_scalar_inplace_fp16s(Mat& a, __fp16 b, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = a.channel(q);\n\n        binary_op_vector_fp16s(ptr, &b, ptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic int get_reverse_op_type(int op_type)\n{\n    if (op_type == BinaryOp::Operation_SUB) return BinaryOp::Operation_RSUB;\n    if (op_type == BinaryOp::Operation_DIV) return BinaryOp::Operation_RDIV;\n    if (op_type == BinaryOp::Operation_POW) return BinaryOp::Operation_RPOW;\n    if (op_type == BinaryOp::Operation_ATAN2) return BinaryOp::Operation_RATAN2;\n    if (op_type == BinaryOp::Operation_FMOD) return BinaryOp::Operation_RFMOD;\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return BinaryOp::Operation_RFLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_REMAINDER) return BinaryOp::Operation_RREMAINDER;\n\n    if (op_type == BinaryOp::Operation_RSUB) return BinaryOp::Operation_SUB;\n    if (op_type == BinaryOp::Operation_RDIV) return BinaryOp::Operation_DIV;\n    if (op_type == BinaryOp::Operation_RPOW) return BinaryOp::Operation_POW;\n    if (op_type == BinaryOp::Operation_RATAN2) return BinaryOp::Operation_ATAN2;\n    if (op_type == BinaryOp::Operation_RFMOD) return BinaryOp::Operation_FMOD;\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) 
return BinaryOp::Operation_FLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_RREMAINDER) return BinaryOp::Operation_REMAINDER;\n\n    return op_type;\n}\n\nint BinaryOp_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    const int outdims = std::max(A.dims, B.dims);\n\n    Mat A2 = A;\n    Mat B2 = B;\n    if (A.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (A.w * A.elempack == B.h * B.elempack)\n                A2 = A.reshape(1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 2;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 3;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 2)\n            A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 4;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n   
     }\n        if (outdims == 4 && A.dims == 2)\n            A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 3)\n            A2 = A.reshape(1, A.w, A.h, A.c, opt.workspace_allocator);\n    }\n    if (B.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (B.w * B.elempack == A.h * A.elempack)\n                B2 = B.reshape(1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 2;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 3;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 2)\n            B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 4;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 4 && B.dims == 2)\n            B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 3)\n            B2 = B.reshape(1, B.w, B.h, B.c, 
opt.workspace_allocator);\n    }\n\n    const int outw = std::max(A2.w, B2.w);\n    const int outh = std::max(A2.h, B2.h);\n    const int outd = std::max(A2.d, B2.d);\n    const int outc = std::max(A2.c, B2.c);\n    const size_t out_elemsize = std::max(A2.elemsize, B2.elemsize);\n    const int out_elempack = std::max(A2.elempack, B2.elempack);\n\n    Mat& top_blob = top_blobs[0];\n    if (outdims == 1)\n    {\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 2)\n    {\n        top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 3)\n    {\n        top_blob.create(outw, outh, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 4)\n    {\n        top_blob.create(outw, outh, outd, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    const bool a_pack_is_lower = A2.elempack < B2.elempack;\n    const bool a_pack_is_equal = A2.elempack == B2.elempack;\n    const bool a_size_is_lower = A2.w * A2.h * A2.d * A2.c * A2.elempack < B2.w * B2.h * B2.d * B2.c * B2.elempack;\n    if (a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))\n    {\n        binary_op_broadcast_fp16s(B2, A2, top_blob, get_reverse_op_type(op_type), opt);\n    }\n    else\n    {\n        binary_op_broadcast_fp16s(A2, B2, top_blob, op_type, opt);\n    }\n\n    return 0;\n}\n\nint BinaryOp_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    binary_op_scalar_inplace_fp16s(bottom_top_blob, b, op_type, opt);\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/cast_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cast_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"cast_bf16.h\"\n#include \"cast_fp16.h\"\n\nCast_arm::Cast_arm()\n{\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n\n    support_bf16_storage = true;\n}\n\nint Cast_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (type_from == type_to)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    size_t out_elemsize = elemsize;\n    if (type_to == 1)\n    {\n        if (type_from == 3)\n        {\n            Cast::forward(bottom_blob, top_blob, opt);\n        }\n\n        // float32\n        out_elemsize = 4 * elempack;\n    }\n    else if (type_to == 2)\n    {\n        // float16\n        out_elemsize = 2 * elempack;\n    }\n    else if (type_to == 3)\n    {\n        // int8\n        out_elemsize = elempack;\n    }\n    else if (type_to == 4)\n    {\n        // bfloat16\n        out_elemsize = 2 * elempack;\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 4)\n    {\n        top_blob.create(w, h, d, channels, out_elemsize, elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int size = w * h * d * 
elempack;\n\n    if (type_from == 1 && type_to == 2)\n    {\n        cast_fp32_to_fp16_neon(bottom_blob, top_blob, opt);\n    }\n\n    if (type_from == 2 && type_to == 1)\n    {\n        cast_fp16_to_fp32_neon(bottom_blob, top_blob, opt);\n    }\n\n    if (type_from == 3 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const signed char* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = (float)ptr[i];\n            }\n        }\n    }\n\n    if (type_from == 1 && type_to == 4)\n    {\n        cast_fp32_to_bf16_neon(bottom_blob, top_blob, opt);\n    }\n\n    if (type_from == 4 && type_to == 1)\n    {\n        cast_bf16_to_fp32_neon(bottom_blob, top_blob, opt);\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/cast_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CAST_ARM_H\n#define LAYER_CAST_ARM_H\n\n#include \"cast.h\"\n\nnamespace ncnn {\n\nclass Cast_arm : public Cast\n{\npublic:\n    Cast_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CAST_ARM_H\n"
  },
  {
    "path": "src/layer/arm/cast_arm_bf16.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n\nnamespace ncnn {\n\n#include \"cast_bf16.h\"\n\nvoid cast_fp32_to_bf16_neon_bf16(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    cast_fp32_to_bf16_neon(bottom_blob, top_blob, opt);\n}\n\nvoid cast_bf16_to_fp32_neon_bf16(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    cast_bf16_to_fp32_neon(bottom_blob, top_blob, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/cast_arm_vfpv4.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n\nnamespace ncnn {\n\n#include \"cast_fp16.h\"\n\nvoid cast_fp32_to_fp16_neon_vfpv4(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    cast_fp32_to_fp16_neon(bottom_blob, top_blob, opt);\n}\n\nvoid cast_fp16_to_fp32_neon_vfpv4(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    cast_fp16_to_fp32_neon(bottom_blob, top_blob, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/cast_bf16.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM84BF16 && __aarch64__ && !__ARM_FEATURE_BF16_VECTOR_ARITHMETIC\nvoid cast_fp32_to_bf16_neon_bf16(const Mat& bottom_blob, Mat& top_blob, const Option& opt);\nvoid cast_bf16_to_fp32_neon_bf16(const Mat& bottom_blob, Mat& top_blob, const Option& opt);\n#endif\n\nstatic void cast_fp32_to_bf16_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84BF16 && __aarch64__ && !__ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n    if (ncnn::cpu_support_arm_bf16())\n    {\n        cast_fp32_to_bf16_neon_bf16(bottom_blob, top_blob, opt);\n        return;\n    }\n#endif\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int d = bottom_blob.d;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    const int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = bottom_blob.channel(q);\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n        __bf16* outptr = top_blob.channel(q);\n#else\n        unsigned short* outptr = top_blob.channel(q);\n#endif\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                \"bfcvtn v0.4h, v0.4s            \\n\"\n                \"bfcvtn2 v0.8h, v1.4s           \\n\"\n                \"bfcvtn v1.4h, v2.4s            \\n\"\n                \"bfcvtn2 v1.8h, v3.4s           \\n\"\n                \"st1    {v0.8h, v1.8h}, [%1], #32 \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n       
         \"1\"(outptr)\n                : \"memory\", \"v0\", \"v1\");\n#else  // __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                \"shrn   v0.4h, v0.4s, #16       \\n\"\n                \"shrn   v1.4h, v1.4s, #16       \\n\"\n                \"shrn   v2.4h, v2.4s, #16       \\n\"\n                \"shrn   v3.4h, v3.4s, #16       \\n\"\n                \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#endif // __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #512]      \\n\"\n                \"vldm       %0!, {d0-d7}    \\n\"\n                \"vshrn.u32  d0, q0, #16     \\n\"\n                \"vshrn.u32  d1, q1, #16     \\n\"\n                \"vshrn.u32  d2, q2, #16     \\n\"\n                \"vshrn.u32  d3, q3, #16     \\n\"\n                \"vst1.u16   {d0-d3}, [%1]!  
\\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0_fp32 = vld1q_f32(ptr);\n            float32x4_t _p1_fp32 = vld1q_f32(ptr + 4);\n            float32x4_t _p2_fp32 = vld1q_f32(ptr + 8);\n            float32x4_t _p3_fp32 = vld1q_f32(ptr + 12);\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            bfloat16x4_t _p0_bf16 = vcvt_bf16_f32(_p0_fp32);\n            bfloat16x4_t _p1_bf16 = vcvt_bf16_f32(_p1_fp32);\n            bfloat16x4_t _p2_bf16 = vcvt_bf16_f32(_p2_fp32);\n            bfloat16x4_t _p3_bf16 = vcvt_bf16_f32(_p3_fp32);\n            bfloat16x8_t _p_bf16 = vcombine_bf16(_p0_bf16, _p1_bf16);\n            bfloat16x8_t _q_bf16 = vcombine_bf16(_p2_bf16, _p3_bf16);\n            vst1q_bf16(outptr, _p_bf16);\n            vst1q_bf16(outptr + 8, _q_bf16);\n#else\n            uint16x4_t _p0_bf16 = float2bfloat(_p0_fp32);\n            uint16x4_t _p1_bf16 = float2bfloat(_p1_fp32);\n            uint16x4_t _p2_bf16 = float2bfloat(_p2_fp32);\n            uint16x4_t _p3_bf16 = float2bfloat(_p3_fp32);\n            uint16x8_t _p_bf16 = vcombine_u16(_p0_bf16, _p1_bf16);\n            uint16x8_t _q_bf16 = vcombine_u16(_p2_bf16, _p3_bf16);\n            vst1q_u16(outptr, _p_bf16);\n            vst1q_u16(outptr + 8, _q_bf16);\n#endif\n            ptr += 16;\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0_fp32 = vld1q_f32(ptr);\n            float32x4_t _p1_fp32 = vld1q_f32(ptr + 4);\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            bfloat16x4_t _p0_bf16 = vcvt_bf16_f32(_p0_fp32);\n            bfloat16x4_t _p1_bf16 = vcvt_bf16_f32(_p1_fp32);\n            bfloat16x8_t _p_bf16 = vcombine_bf16(_p0_bf16, _p1_bf16);\n            vst1q_bf16(outptr, 
_p_bf16);\n#else\n            uint16x4_t _p0_bf16 = float2bfloat(_p0_fp32);\n            uint16x4_t _p1_bf16 = float2bfloat(_p1_fp32);\n            uint16x8_t _p_bf16 = vcombine_u16(_p0_bf16, _p1_bf16);\n            vst1q_u16(outptr, _p_bf16);\n#endif\n            ptr += 8;\n            outptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p_fp32 = vld1q_f32(ptr);\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            bfloat16x4_t _p_bf16 = vcvt_bf16_f32(_p_fp32);\n            vst1_bf16(outptr, _p_bf16);\n#else\n            uint16x4_t _p_bf16 = float2bfloat(_p_fp32);\n            vst1_u16(outptr, _p_bf16);\n#endif\n            ptr += 4;\n            outptr += 4;\n        }\n#endif\n        for (; i < size; i++)\n        {\n#if NCNN_GNU_INLINE_ASM && __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            asm volatile(\n                \"ldr    s0, [%0], #4    \\n\"\n                \"bfcvt  h0, s0          \\n\"\n                \"str    h0, [%1], #2    \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"s0\");\n            // inline asm is used because the intrinsic crashes ndk clang\n            // *outptr++ = vcvth_bf16_f32(*ptr++);\n#else\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            *(unsigned short*)outptr = float32_to_bfloat16(*ptr++);\n            outptr++;\n#else\n            *outptr++ = float32_to_bfloat16(*ptr++);\n#endif\n#endif\n        }\n    }\n}\n\nstatic void cast_bf16_to_fp32_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84BF16 && __aarch64__ && !__ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n    if (ncnn::cpu_support_arm_bf16())\n    {\n        cast_bf16_to_fp32_neon_bf16(bottom_blob, top_blob, opt);\n        return;\n    }\n#endif\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int d = bottom_blob.d;\n    const int 
channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    const int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n        const __bf16* ptr = bottom_blob.channel(q);\n#else\n        const unsigned short* ptr = bottom_blob.channel(q);\n#endif\n        float* outptr = top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #256]   \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n                \"shll   v0.4s, v0.4h, #16       \\n\"\n                \"shll   v1.4s, v1.4h, #16       \\n\"\n                \"shll   v2.4s, v2.4h, #16       \\n\"\n                \"shll   v3.4s, v3.4h, #16       \\n\"\n                \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #256]      \\n\"\n                \"vld1.u16   {d4-d7}, [%0]!  
\\n\"\n                \"vshll.u16  q0, d4, #16     \\n\"\n                \"vshll.u16  q1, d5, #16     \\n\"\n                \"vshll.u16  q2, d6, #16     \\n\"\n                \"vshll.u16  q3, d7, #16     \\n\"\n                \"vstm       %1!, {d0-d7}    \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            bfloat16x8_t _p_bf16 = vld1q_bf16(ptr);\n            bfloat16x8_t _q_bf16 = vld1q_bf16(ptr + 8);\n            float32x4_t _p0_fp32 = vcvt_f32_bf16(vget_low_bf16(_p_bf16));\n            float32x4_t _p1_fp32 = vcvt_f32_bf16(vget_high_bf16(_p_bf16));\n            float32x4_t _p2_fp32 = vcvt_f32_bf16(vget_low_bf16(_q_bf16));\n            float32x4_t _p3_fp32 = vcvt_f32_bf16(vget_high_bf16(_q_bf16));\n#else\n            uint16x8_t _p_bf16 = vld1q_u16(ptr);\n            uint16x8_t _q_bf16 = vld1q_u16(ptr + 8);\n            float32x4_t _p0_fp32 = bfloat2float(vget_low_u16(_p_bf16));\n            float32x4_t _p1_fp32 = bfloat2float(vget_high_u16(_p_bf16));\n            float32x4_t _p2_fp32 = bfloat2float(vget_low_u16(_q_bf16));\n            float32x4_t _p3_fp32 = bfloat2float(vget_high_u16(_q_bf16));\n#endif\n            vst1q_f32(outptr, _p0_fp32);\n            vst1q_f32(outptr + 4, _p1_fp32);\n            vst1q_f32(outptr + 8, _p2_fp32);\n            vst1q_f32(outptr + 12, _p3_fp32);\n            ptr += 16;\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            bfloat16x8_t _p_bf16 = vld1q_bf16(ptr);\n            float32x4_t _p0_fp32 = vcvt_f32_bf16(vget_low_bf16(_p_bf16));\n            float32x4_t _p1_fp32 = vcvt_f32_bf16(vget_high_bf16(_p_bf16));\n#else\n            uint16x8_t _p_bf16 
= vld1q_u16(ptr);\n            float32x4_t _p0_fp32 = bfloat2float(vget_low_u16(_p_bf16));\n            float32x4_t _p1_fp32 = bfloat2float(vget_high_u16(_p_bf16));\n#endif\n            vst1q_f32(outptr, _p0_fp32);\n            vst1q_f32(outptr + 4, _p1_fp32);\n            ptr += 8;\n            outptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            bfloat16x4_t _p_bf16 = vld1_bf16(ptr);\n            float32x4_t _p_fp32 = vcvt_f32_bf16(_p_bf16);\n#else\n            uint16x4_t _p_bf16 = vld1_u16(ptr);\n            float32x4_t _p_fp32 = bfloat2float(_p_bf16);\n#endif\n            vst1q_f32(outptr, _p_fp32);\n            ptr += 4;\n            outptr += 4;\n        }\n#endif\n        for (; i < size; i++)\n        {\n#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC\n            *outptr++ = vcvtah_f32_bf16(*ptr++);\n#else\n            *outptr++ = bfloat16_to_float32(*ptr++);\n#endif\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/cast_fp16.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\nvoid cast_fp32_to_fp16_neon_vfpv4(const Mat& bottom_blob, Mat& top_blob, const Option& opt);\nvoid cast_fp16_to_fp32_neon_vfpv4(const Mat& bottom_blob, Mat& top_blob, const Option& opt);\n#endif\n\nstatic void cast_fp32_to_fp16_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\n    if (ncnn::cpu_support_arm_vfpv4())\n    {\n        cast_fp32_to_fp16_neon_vfpv4(bottom_blob, top_blob, opt);\n        return;\n    }\n#endif\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int d = bottom_blob.d;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    const int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = bottom_blob.channel(q);\n        unsigned short* outptr = top_blob.channel(q);\n\n        int i = 0;\n#if (__ARM_FP & 2)\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]       \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                \"fcvtn  v0.4h, v0.4s                \\n\"\n                \"fcvtn  v1.4h, v1.4s                \\n\"\n                \"fcvtn  v2.4h, v2.4s                \\n\"\n                \"fcvtn  v3.4h, v3.4s                \\n\"\n                \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n            asm volatile(\n                
\"pld        [%0, #512]      \\n\"\n                \"vldm       %0!, {d0-d7}    \\n\"\n                \"vcvt.f16.f32 d0, q0        \\n\"\n                \"vcvt.f16.f32 d1, q1        \\n\"\n                \"vcvt.f16.f32 d2, q2        \\n\"\n                \"vcvt.f16.f32 d3, q3        \\n\"\n                \"vst1.u16   {d0-d3}, [%1]!  \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0_fp32 = vld1q_f32(ptr);\n            float32x4_t _p1_fp32 = vld1q_f32(ptr + 4);\n            float32x4_t _p2_fp32 = vld1q_f32(ptr + 8);\n            float32x4_t _p3_fp32 = vld1q_f32(ptr + 12);\n            uint16x4_t _p0_fp16 = (uint16x4_t)vcvt_f16_f32(_p0_fp32);\n            uint16x4_t _p1_fp16 = (uint16x4_t)vcvt_f16_f32(_p1_fp32);\n            uint16x4_t _p2_fp16 = (uint16x4_t)vcvt_f16_f32(_p2_fp32);\n            uint16x4_t _p3_fp16 = (uint16x4_t)vcvt_f16_f32(_p3_fp32);\n            uint16x8_t _p_fp16 = vcombine_u16(_p0_fp16, _p1_fp16);\n            uint16x8_t _q_fp16 = vcombine_u16(_p2_fp16, _p3_fp16);\n            vst1q_u16(outptr, _p_fp16);\n            vst1q_u16(outptr + 8, _q_fp16);\n            ptr += 16;\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n#if NCNN_GNU_INLINE_ASM\n            // This is originally implemented with neon fp16 intrinsics.\n            // In the new version of gcc, __ARM_FP16_FORMAT_IEEE or __ARM_FP16_FORMAT_ALTERNATIVE needs to be defined to use the float16x4_t type.\n            // That leads to compiler error when compiled with -mfpu=neon-vfpv4 but without -mfp16-format=ieee flag.\n            // We could add more macro conditions to differentiate between old and new versions, but that's pretty ugly!\n            // Just use all inline 
assembly here ~\n            //          --- nihui\n#if __aarch64__\n            asm volatile(\n                \"ld1    {v0.4s, v1.4s}, [%0], #32   \\n\"\n                \"fcvtn  v0.4h, v0.4s                \\n\"\n                \"fcvtn  v1.4h, v1.4s                \\n\"\n                \"st1    {v0.4h, v1.4h}, [%1], #16   \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\", \"v1\");\n#else  // __aarch64__\n            asm volatile(\n                \"vld1.f32   {d0-d3}, [%0]!  \\n\"\n                \"vcvt.f16.f32 d0, q0        \\n\"\n                \"vcvt.f16.f32 d1, q1        \\n\"\n                \"vst1.u16   {d0-d1}, [%1]!  \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\", \"q1\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0_fp32 = vld1q_f32(ptr);\n            float32x4_t _p1_fp32 = vld1q_f32(ptr + 4);\n            uint16x4_t _p0_fp16 = (uint16x4_t)vcvt_f16_f32(_p0_fp32);\n            uint16x4_t _p1_fp16 = (uint16x4_t)vcvt_f16_f32(_p1_fp32);\n            uint16x8_t _p_fp16 = vcombine_u16(_p0_fp16, _p1_fp16);\n            vst1q_u16(outptr, _p_fp16);\n            ptr += 8;\n            outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 3 < size; i += 4)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"ld1    {v0.4s}, [%0], #16  \\n\"\n                \"fcvtn  v0.4h, v0.4s        \\n\"\n                \"st1    {v0.4h}, [%1], #8   \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\");\n#else  // __aarch64__\n            asm volatile(\n                
\"vld1.f32   {d0-d1}, [%0]!  \\n\"\n                \"vcvt.f16.f32 d0, q0        \\n\"\n                \"vst1.u16   {d0}, [%1]!     \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p_fp32 = vld1q_f32(ptr);\n            uint16x4_t _p_fp16 = (uint16x4_t)vcvt_f16_f32(_p_fp32);\n            vst1_u16(outptr, _p_fp16);\n            ptr += 4;\n            outptr += 4;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // (__ARM_FP & 2)\n        for (; i < size; i++)\n        {\n            *outptr++ = float32_to_float16(*ptr++);\n        }\n    }\n}\n\nstatic void cast_fp16_to_fp32_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\n    if (ncnn::cpu_support_arm_vfpv4())\n    {\n        cast_fp16_to_fp32_neon_vfpv4(bottom_blob, top_blob, opt);\n        return;\n    }\n#endif\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int d = bottom_blob.d;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    const int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const unsigned short* ptr = bottom_blob.channel(q);\n        float* outptr = top_blob.channel(q);\n\n        int i = 0;\n#if (__ARM_FP & 2)\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #256]       \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n                \"fcvtl  v0.4s, v0.4h                \\n\"\n                \"fcvtl  v1.4s, v1.4h                \\n\"\n                \"fcvtl  v2.4s, v2.4h              
  \\n\"\n                \"fcvtl  v3.4s, v3.4h                \\n\"\n                \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #256]      \\n\"\n                \"vld1.u16   {d4-d7}, [%0]!  \\n\"\n                \"vcvt.f32.f16 q0, d4        \\n\"\n                \"vcvt.f32.f16 q1, d5        \\n\"\n                \"vcvt.f32.f16 q2, d6        \\n\"\n                \"vcvt.f32.f16 q3, d7        \\n\"\n                \"vstm       %1!, {d0-d7}    \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            uint16x8_t _p_fp16 = vld1q_u16(ptr);\n            uint16x8_t _q_fp16 = vld1q_u16(ptr + 8);\n            float32x4_t _p0_fp32 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p_fp16));\n            float32x4_t _p1_fp32 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p_fp16));\n            float32x4_t _p2_fp32 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q_fp16));\n            float32x4_t _p3_fp32 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q_fp16));\n            vst1q_f32(outptr, _p0_fp32);\n            vst1q_f32(outptr + 4, _p1_fp32);\n            vst1q_f32(outptr + 8, _p2_fp32);\n            vst1q_f32(outptr + 12, _p3_fp32);\n            ptr += 16;\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"ld1    {v0.4h, v1.4h}, [%0], #16   \\n\"\n                \"fcvtl  v0.4s, v0.4h                \\n\"\n                
\"fcvtl  v1.4s, v1.4h                \\n\"\n                \"st1    {v0.4s, v1.4s}, [%1], #32   \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\", \"v1\");\n#else  // __aarch64__\n            asm volatile(\n                \"vld1.u16   {d4-d5}, [%0]!  \\n\"\n                \"vcvt.f32.f16 q0, d4        \\n\"\n                \"vcvt.f32.f16 q1, d5        \\n\"\n                \"vst1.f32   {d0-d3}, [%1]!  \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\", \"q1\", \"q2\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            uint16x8_t _p_fp16 = vld1q_u16(ptr);\n            float32x4_t _p0_fp32 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p_fp16));\n            float32x4_t _p1_fp32 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p_fp16));\n            vst1q_f32(outptr, _p0_fp32);\n            vst1q_f32(outptr + 4, _p1_fp32);\n            ptr += 8;\n            outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 3 < size; i += 4)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"ld1    {v0.4h}, [%0], #8   \\n\"\n                \"fcvtl  v0.4s, v0.4h        \\n\"\n                \"st1    {v0.4s}, [%1], #16  \\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"v0\");\n#else  // __aarch64__\n            asm volatile(\n                \"vld1.u16   {d2}, [%0]!     \\n\"\n                \"vcvt.f32.f16 q0, d2        \\n\"\n                \"vst1.f32   {d0-d1}, [%1]!  
\\n\"\n                : \"=r\"(ptr),   // %0\n                \"=r\"(outptr) // %1\n                : \"0\"(ptr),\n                \"1\"(outptr)\n                : \"memory\", \"q0\", \"q1\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            uint16x4_t _p_fp16 = vld1_u16(ptr);\n            float32x4_t _p_fp32 = vcvt_f32_f16((float16x4_t)_p_fp16);\n            vst1q_f32(outptr, _p_fp32);\n            ptr += 4;\n            outptr += 4;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // (__ARM_FP & 2)\n        for (; i < size; i++)\n        {\n            *outptr++ = float16_to_float32(*ptr++);\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/clip_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"clip_arm.h\"\n\n#ifdef __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nClip_arm::Clip_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Clip_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _min = vdupq_n_f32(min);\n        float32x4_t _max = vdupq_n_f32(max);\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                \"fmax   v0.4s, v0.4s, %2.4s     \\n\"\n                \"fmax   v1.4s, v1.4s, %2.4s     \\n\"\n                \"fmax   v2.4s, v2.4s, %2.4s     \\n\"\n                \"fmax   v3.4s, v3.4s, %2.4s     \\n\"\n                \"fmin   v0.4s, v0.4s, %3.4s     \\n\"\n                \"fmin   v1.4s, v1.4s, %3.4s     \\n\"\n 
               \"fmin   v2.4s, v2.4s, %3.4s     \\n\"\n                \"fmin   v3.4s, v3.4s, %3.4s     \\n\"\n                \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_min), // %2\n                \"w\"(_max)  // %3\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #512]      \\n\"\n                \"vldm       %0, {d0-d7}     \\n\"\n                \"vmax.f32   q0, q0, %q2     \\n\"\n                \"vmax.f32   q1, q1, %q2     \\n\"\n                \"vmax.f32   q2, q2, %q2     \\n\"\n                \"vmax.f32   q3, q3, %q2     \\n\"\n                \"vmin.f32   q0, q0, %q3     \\n\"\n                \"vmin.f32   q1, q1, %q3     \\n\"\n                \"vmin.f32   q2, q2, %q3     \\n\"\n                \"vmin.f32   q3, q3, %q3     \\n\"\n                \"vstm       %0!, {d0-d7}    \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_min), // %2\n                \"w\"(_max)  // %3\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _p2 = vld1q_f32(ptr + 8);\n            float32x4_t _p3 = vld1q_f32(ptr + 12);\n            _p0 = vmaxq_f32(_p0, _min);\n            _p1 = vmaxq_f32(_p1, _min);\n            _p2 = vmaxq_f32(_p2, _min);\n            _p3 = vmaxq_f32(_p3, _min);\n            _p0 = vminq_f32(_p0, _max);\n            _p1 = vminq_f32(_p1, _max);\n            _p2 = vminq_f32(_p2, _max);\n            _p3 = vminq_f32(_p3, _max);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            vst1q_f32(ptr + 8, _p2);\n            vst1q_f32(ptr + 12, _p3);\n            ptr += 16;\n#endif // 
NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            _p0 = vmaxq_f32(_p0, _min);\n            _p1 = vmaxq_f32(_p1, _min);\n            _p0 = vminq_f32(_p0, _max);\n            _p1 = vminq_f32(_p1, _max);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vmaxq_f32(_p, _min);\n            _p = vminq_f32(_p, _max);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            if (*ptr < min)\n                *ptr = min;\n\n            if (*ptr > max)\n                *ptr = max;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Clip_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _min = vdupq_n_f32(min);\n        float32x4_t _max = vdupq_n_f32(max);\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #256]   \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0] \\n\"\n                \"shll   v0.4s, v0.4h, #16       \\n\"\n                \"shll   v1.4s, v1.4h, #16       \\n\"\n                \"shll   v2.4s, v2.4h, #16       \\n\"\n                \"shll   v3.4s, 
v3.4h, #16       \\n\"\n                \"fmax   v0.4s, v0.4s, %2.4s     \\n\"\n                \"fmax   v1.4s, v1.4s, %2.4s     \\n\"\n                \"fmax   v2.4s, v2.4s, %2.4s     \\n\"\n                \"fmax   v3.4s, v3.4s, %2.4s     \\n\"\n                \"fmin   v0.4s, v0.4s, %3.4s     \\n\"\n                \"fmin   v1.4s, v1.4s, %3.4s     \\n\"\n                \"fmin   v2.4s, v2.4s, %3.4s     \\n\"\n                \"fmin   v3.4s, v3.4s, %3.4s     \\n\"\n                \"shrn   v0.4h, v0.4s, #16       \\n\"\n                \"shrn   v1.4h, v1.4s, #16       \\n\"\n                \"shrn   v2.4h, v2.4s, #16       \\n\"\n                \"shrn   v3.4h, v3.4s, #16       \\n\"\n                \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_min), // %2\n                \"w\"(_max)  // %3\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #256]      \\n\"\n                \"vld1.u16   {d4-d7}, [%0]   \\n\"\n                \"vshll.u16  q0, d4, #16     \\n\"\n                \"vshll.u16  q1, d5, #16     \\n\"\n                \"vshll.u16  q2, d6, #16     \\n\"\n                \"vshll.u16  q3, d7, #16     \\n\"\n                \"vmax.f32   q0, q0, %q2     \\n\"\n                \"vmax.f32   q1, q1, %q2     \\n\"\n                \"vmax.f32   q2, q2, %q2     \\n\"\n                \"vmax.f32   q3, q3, %q2     \\n\"\n                \"vmin.f32   q0, q0, %q3     \\n\"\n                \"vmin.f32   q1, q1, %q3     \\n\"\n                \"vmin.f32   q2, q2, %q3     \\n\"\n                \"vmin.f32   q3, q3, %q3     \\n\"\n                \"vshrn.u32  d0, q0, #16     \\n\"\n                \"vshrn.u32  d1, q1, #16     \\n\"\n                \"vshrn.u32  d2, q2, #16     \\n\"\n                \"vshrn.u32  d3, q3, #16     \\n\"\n                \"vst1.u16   {d0-d3}, 
[%0]!  \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_min), // %2\n                \"w\"(_max)  // %3\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            uint16x8_t _p = vld1q_u16(ptr);\n            uint16x8_t _q = vld1q_u16(ptr + 8);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n            float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n            _p0 = vmaxq_f32(_p0, _min);\n            _p1 = vmaxq_f32(_p1, _min);\n            _p2 = vmaxq_f32(_p2, _min);\n            _p3 = vmaxq_f32(_p3, _min);\n            _p0 = vminq_f32(_p0, _max);\n            _p1 = vminq_f32(_p1, _max);\n            _p2 = vminq_f32(_p2, _max);\n            _p3 = vminq_f32(_p3, _max);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            _q = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n            vst1q_u16(ptr, _p);\n            vst1q_u16(ptr + 8, _q);\n            ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _p0 = vmaxq_f32(_p0, _min);\n            _p1 = vmaxq_f32(_p1, _min);\n            _p0 = vminq_f32(_p0, _max);\n            _p1 = vminq_f32(_p1, _max);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = vmaxq_f32(_p, _min);\n            _p = vminq_f32(_p, _max);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 
4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            if (v < min)\n                v = min;\n\n            if (v > max)\n                v = max;\n\n            *ptr = float32_to_bfloat16(v);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/clip_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CLIP_ARM_H\n#define LAYER_CLIP_ARM_H\n\n#include \"clip.h\"\n\nnamespace ncnn {\n\nclass Clip_arm : public Clip\n{\npublic:\n    Clip_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CLIP_ARM_H\n"
  },
  {
    "path": "src/layer/arm/clip_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"clip_arm.h\"\n\n#ifdef __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Clip_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        __fp16 min_fp16 = min;\n        __fp16 max_fp16 = max;\n\n        float16x8_t _min = vdupq_n_f16(min_fp16);\n        float16x8_t _max = vdupq_n_f16(max_fp16);\n\n        int i = 0;\n        for (; i + 31 < size; i += 32)\n        {\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                \"fmax   v0.8h, v0.8h, %2.8h     \\n\"\n                \"fmax   v1.8h, v1.8h, %2.8h     \\n\"\n                \"fmax   v2.8h, v2.8h, %2.8h     \\n\"\n                \"fmax   v3.8h, v3.8h, %2.8h     \\n\"\n                \"fmin   v0.8h, v0.8h, %3.8h     \\n\"\n                \"fmin   v1.8h, v1.8h, %3.8h     \\n\"\n                \"fmin   v2.8h, v2.8h, %3.8h     \\n\"\n                \"fmin   v3.8h, v3.8h, %3.8h     \\n\"\n                \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_min), // %2\n                \"w\"(_max)  // %3\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // NCNN_GNU_INLINE_ASM\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = 
vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            _p0 = vmaxq_f16(_p0, _min);\n            _p1 = vmaxq_f16(_p1, _min);\n            _p2 = vmaxq_f16(_p2, _min);\n            _p3 = vmaxq_f16(_p3, _min);\n            _p0 = vminq_f16(_p0, _max);\n            _p1 = vminq_f16(_p1, _max);\n            _p2 = vminq_f16(_p2, _max);\n            _p3 = vminq_f16(_p3, _max);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            ptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            _p0 = vmaxq_f16(_p0, _min);\n            _p1 = vmaxq_f16(_p1, _min);\n            _p0 = vminq_f16(_p0, _max);\n            _p1 = vminq_f16(_p1, _max);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            ptr += 16;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _p = vmaxq_f16(_p, _min);\n            _p = vminq_f16(_p, _max);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = vmax_f16(_p, vget_low_f16(_min));\n            _p = vmin_f16(_p, vget_low_f16(_max));\n            vst1_f16(ptr, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = *ptr;\n            if (v < min_fp16)\n                v = min_fp16;\n\n            if (v > max_fp16)\n                v = max_fp16;\n\n            *ptr = v;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/concat_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"concat_arm.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nConcat_arm::Concat_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Concat_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int elembits = bottom_blobs[0].elembits();\n\n#if NCNN_ARM82\n    if (support_packing && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    int dims = bottom_blobs[0].dims;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // concat vector\n        // total length\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_w % 4 == 0 ? 
4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        float* outptr = top_blob;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            const float* ptr = bottom_blob;\n            memcpy(outptr, ptr, bottom_blob.w * bottom_blob.elemsize);\n\n            outptr += bottom_blob.w * bottom_blob.elempack;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // concat image\n        int w = bottom_blobs[0].w;\n\n        // total height\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = std::min(elemsize, bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_h += bottom_blob.h * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_h % 4 == 0 ? 
4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, top_h / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n        }\n\n        float* outptr = top_blob_unpacked;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                for (int i = 0; i < bottom_blob.h; i++)\n                {\n                    const float* r0 = bottom_blob.row(i);\n\n                    float* outptr0 = outptr;\n                    float* outptr1 = outptr + w;\n                    float* outptr2 = outptr + w * 2;\n                    float* outptr3 = outptr + w * 3;\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    outptr += w * 4;\n                }\n            }\n            else // if (bottom_blob.elempack == 1 && elempack == 1) if (bottom_blob.elempack == 4 && elempack == 4)\n            {\n                int size = w * bottom_blob.h;\n\n                const float* ptr = bottom_blob;\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                outptr += size * bottom_blob.elempack;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            
convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // interleave image row\n        int h = bottom_blobs[0].h;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* outptr = top_blob.row(i);\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                const float* ptr = bottom_blob.row(i);\n                memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                outptr += bottom_blob.w * elempack;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // concat dim\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int d = bottom_blobs[0].d;\n\n        // total channels\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_channels = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = std::min(elemsize, bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_channels += bottom_blob.c * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_channels % 4 == 0 ? 
4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, d, top_channels / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, h, d, top_channels / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n\n            top_blob_unpacked.dims = dims;\n        }\n\n        int p = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                for (int q = 0; q < bottom_blob.c; q++)\n                {\n                    const float* r0 = bottom_blob.channel(q);\n\n                    float* outptr0 = top_blob_unpacked.channel(p);\n                    float* outptr1 = top_blob_unpacked.channel(p + 1);\n                    float* outptr2 = top_blob_unpacked.channel(p + 2);\n                    float* outptr3 = top_blob_unpacked.channel(p + 3);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    p += 4;\n                }\n            }\n            else // if (bottom_blob.elempack == 1 && elempack == 1) if (bottom_blob.elempack == 4 && elempack == 4)\n            {\n                int size = bottom_blob.total();\n\n                const float* ptr = 
bottom_blob;\n                float* outptr = top_blob_unpacked.channel(p);\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                p += bottom_blob.c;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // interleave dim height\n        int w = bottom_blobs[0].w;\n        int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total height\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_h += bottom_blob.h;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (size_t b = 0; b < bottom_blobs.size(); b++)\n                {\n                    const Mat& bottom_blob = bottom_blobs[b];\n\n                    int size = bottom_blob.w * bottom_blob.h;\n\n                    const float* ptr = bottom_blob.channel(q).depth(i);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    outptr += size * elempack;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // interleave dim width\n        int h = bottom_blobs[0].h;\n   
     int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    for (size_t b = 0; b < bottom_blobs.size(); b++)\n                    {\n                        const Mat& bottom_blob = bottom_blobs[b];\n\n                        const float* ptr = bottom_blob.channel(q).depth(i).row(j);\n                        memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                        outptr += bottom_blob.w * elempack;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        // interleave dim depth\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total depth\n        int top_d = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_d += bottom_blob.d;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, top_d, channels, elemsize, 
elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                const float* ptr = bottom_blob.channel(q);\n                memcpy(outptr, ptr, size * elemsize);\n\n                outptr += size * elempack;\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Concat_arm::forward_bf16s_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int dims = bottom_blobs[0].dims;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // concat vector\n        // total length\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w * bottom_blob.elempack;\n        }\n\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n#if NCNN_ARM82\n            out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && top_w % 8 == 0 ? 8 : top_w % 4 == 0 ? 4 : 1;\n#else\n            out_elempack = top_w % 4 == 0 ? 
4 : 1;\n#endif\n        }\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        unsigned short* outptr = top_blob;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            const unsigned short* ptr = bottom_blob;\n            memcpy(outptr, ptr, bottom_blob.w * bottom_blob.elemsize);\n\n            outptr += bottom_blob.w * bottom_blob.elempack;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // concat image\n        int w = bottom_blobs[0].w;\n\n        // total height\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = std::min(elemsize, bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_h += bottom_blob.h * bottom_blob.elempack;\n        }\n\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n#if NCNN_ARM82\n            out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && top_h % 8 == 0 ? 8 : top_h % 4 == 0 ? 4 : 1;\n#else\n            out_elempack = top_h % 4 == 0 ? 
4 : 1;\n#endif\n        }\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, top_h / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n        }\n\n        unsigned short* outptr = top_blob_unpacked;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n#if NCNN_ARM82\n            if (bottom_blob.elempack == 8 && elempack == 4)\n            {\n                for (int i = 0; i < bottom_blob.h; i++)\n                {\n                    const unsigned short* r0 = bottom_blob.row<const unsigned short>(i);\n\n                    unsigned short* outptr0 = outptr;\n                    unsigned short* outptr1 = outptr + w * 4;\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        outptr0[0] = r0[0];\n                        outptr0[1] = r0[1];\n                        outptr0[2] = r0[2];\n                        outptr0[3] = r0[3];\n                        outptr1[0] = r0[4];\n                        outptr1[1] = r0[5];\n                        outptr1[2] = r0[6];\n                        outptr1[3] = r0[7];\n\n                        outptr0 += 4;\n                        outptr1 += 4;\n                        r0 += 8;\n                    }\n\n                    outptr += w * 8;\n                }\n            }\n            if (bottom_blob.elempack == 8 && elempack == 1)\n            {\n                for (int i = 0; i < bottom_blob.h; i++)\n                {\n                    const unsigned short* r0 = bottom_blob.row<const 
unsigned short>(i);\n\n                    unsigned short* outptr0 = outptr;\n                    unsigned short* outptr1 = outptr + w;\n                    unsigned short* outptr2 = outptr + w * 2;\n                    unsigned short* outptr3 = outptr + w * 3;\n                    unsigned short* outptr4 = outptr + w * 4;\n                    unsigned short* outptr5 = outptr + w * 5;\n                    unsigned short* outptr6 = outptr + w * 6;\n                    unsigned short* outptr7 = outptr + w * 7;\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n                        *outptr4++ = r0[4];\n                        *outptr5++ = r0[5];\n                        *outptr6++ = r0[6];\n                        *outptr7++ = r0[7];\n\n                        r0 += 8;\n                    }\n\n                    outptr += w * 8;\n                }\n            }\n#endif // NCNN_ARM82\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                for (int i = 0; i < bottom_blob.h; i++)\n                {\n                    const unsigned short* r0 = bottom_blob.row<const unsigned short>(i);\n\n                    unsigned short* outptr0 = outptr;\n                    unsigned short* outptr1 = outptr + w;\n                    unsigned short* outptr2 = outptr + w * 2;\n                    unsigned short* outptr3 = outptr + w * 3;\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    outptr += w * 4;\n                }\n            }\n            if 
(bottom_blob.elempack == elempack) // 1-1 4-4 8-8\n            {\n                int size = w * bottom_blob.h;\n\n                const unsigned short* ptr = bottom_blob;\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                outptr += size * bottom_blob.elempack;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // interleave image row\n        int h = bottom_blobs[0].h;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            unsigned short* outptr = top_blob.row<unsigned short>(i);\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                const unsigned short* ptr = bottom_blob.row<unsigned short>(i);\n                memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                outptr += bottom_blob.w * elempack;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // concat dim\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int d = bottom_blobs[0].d;\n\n        // total channels\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n      
  int top_channels = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = std::min(elemsize, bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_channels += bottom_blob.c * bottom_blob.elempack;\n        }\n\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n#if NCNN_ARM82\n            out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && top_channels % 8 == 0 ? 8 : top_channels % 4 == 0 ? 4 : 1;\n#else\n            out_elempack = top_channels % 4 == 0 ? 4 : 1;\n#endif\n        }\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, d, top_channels / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, h, d, top_channels / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n\n            top_blob_unpacked.dims = dims;\n        }\n\n        int p = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n#if NCNN_ARM82\n            if (bottom_blob.elempack == 8 && elempack == 4)\n            {\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                for (int q = 0; q < bottom_blob.c; q++)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q);\n\n                    unsigned short* outptr0 = top_blob_unpacked.channel(p);\n                    unsigned short* outptr1 = top_blob_unpacked.channel(p + 1);\n\n                    for (int i = 0; i < 
size; i++)\n                    {\n                        outptr0[0] = r0[0];\n                        outptr0[1] = r0[1];\n                        outptr0[2] = r0[2];\n                        outptr0[3] = r0[3];\n                        outptr1[0] = r0[4];\n                        outptr1[1] = r0[5];\n                        outptr1[2] = r0[6];\n                        outptr1[3] = r0[7];\n\n                        outptr0 += 4;\n                        outptr1 += 4;\n                        r0 += 8;\n                    }\n\n                    p += 2;\n                }\n            }\n            if (bottom_blob.elempack == 8 && elempack == 1)\n            {\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                for (int q = 0; q < bottom_blob.c; q++)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q);\n\n                    unsigned short* outptr0 = top_blob_unpacked.channel(p);\n                    unsigned short* outptr1 = top_blob_unpacked.channel(p + 1);\n                    unsigned short* outptr2 = top_blob_unpacked.channel(p + 2);\n                    unsigned short* outptr3 = top_blob_unpacked.channel(p + 3);\n                    unsigned short* outptr4 = top_blob_unpacked.channel(p + 4);\n                    unsigned short* outptr5 = top_blob_unpacked.channel(p + 5);\n                    unsigned short* outptr6 = top_blob_unpacked.channel(p + 6);\n                    unsigned short* outptr7 = top_blob_unpacked.channel(p + 7);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n                        *outptr4++ = r0[4];\n                        *outptr5++ = r0[5];\n                        *outptr6++ = r0[6];\n                        *outptr7++ = r0[7];\n\n                 
       r0 += 8;\n                    }\n\n                    p += 8;\n                }\n            }\n#endif // NCNN_ARM82\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                for (int q = 0; q < bottom_blob.c; q++)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q);\n\n                    unsigned short* outptr0 = top_blob_unpacked.channel(p);\n                    unsigned short* outptr1 = top_blob_unpacked.channel(p + 1);\n                    unsigned short* outptr2 = top_blob_unpacked.channel(p + 2);\n                    unsigned short* outptr3 = top_blob_unpacked.channel(p + 3);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    p += 4;\n                }\n            }\n            if (bottom_blob.elempack == elempack) // 1-1 4-4 8-8\n            {\n                int size = bottom_blob.total();\n\n                const unsigned short* ptr = bottom_blob;\n                unsigned short* outptr = top_blob_unpacked.channel(p);\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                p += bottom_blob.c;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // interleave dim height\n        int w = bottom_blobs[0].w;\n        int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int 
elempack = bottom_blobs[0].elempack;\n\n        // total height\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_h += bottom_blob.h;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (size_t b = 0; b < bottom_blobs.size(); b++)\n                {\n                    const Mat& bottom_blob = bottom_blobs[b];\n\n                    int size = bottom_blob.w * bottom_blob.h;\n\n                    const unsigned short* ptr = bottom_blob.channel(q).depth(i);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    outptr += size * elempack;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // interleave dim width\n        int h = bottom_blobs[0].h;\n        int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for 
num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    for (size_t b = 0; b < bottom_blobs.size(); b++)\n                    {\n                        const Mat& bottom_blob = bottom_blobs[b];\n\n                        const unsigned short* ptr = bottom_blob.channel(q).depth(i).row<const unsigned short>(j);\n                        memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                        outptr += bottom_blob.w * elempack;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        // interleave dim depth\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total depth\n        int top_d = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_d += bottom_blob.d;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, top_d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* outptr = top_blob.channel(q);\n\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                const unsigned short* ptr = bottom_blob.channel(q);\n                memcpy(outptr, ptr, size * elemsize);\n\n              
  outptr += size * elempack;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/concat_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONCAT_ARM_H\n#define LAYER_CONCAT_ARM_H\n\n#include \"concat.h\"\n\nnamespace ncnn {\n\nclass Concat_arm : public Concat\n{\npublic:\n    Concat_arm();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONCAT_ARM_H\n"
  },
  {
    "path": "src/layer/arm/convolution1d_arm.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution1d_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"arm_activation.h\"\n\n#include \"cpu.h\"\n#include \"layer_type.h\"\n\nnamespace ncnn {\n\n#include \"convolution1d_packed.h\"\n#if NCNN_BF16\n#include \"convolution1d_packed_bf16s.h\"\n#endif // NCNN_BF16\n\nConvolution1D_arm::Convolution1D_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Convolution1D_arm::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n    const int num_input = weight_data_size / kernel_w / num_output;\n\n    convolution1d_transform_kernel_packed(weight_data, weight_data_tm, num_input, num_output, kernel_w);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution1D_arm::destroy_pipeline(const Option& /*opt*/)\n{\n    return 0;\n}\n\nint Convolution1D_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    int w = bottom_blob.w;\n    
size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = num_output / out_elempack;\n\n    top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    convolution1d_packed(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, dilation_w, stride_w, activation_type, activation_params, opt);\n\n    return 0;\n}\n\nint Convolution1D_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n#if NCNN_ARM82\n    if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_float16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n    if (opt.use_bf16_storage && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        
cast_bfloat16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_BF16\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n#if NCNN_ARM82\n        if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_float16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n        if (opt.use_bf16_storage && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_bfloat16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_BF16\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution1D);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(2, dilation_w);\n    pd.set(3, stride_w);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    
op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Convolution1D_arm::create_pipeline_bf16s(const Option& opt)\n{\n    const int num_input = weight_data_size / kernel_w / num_output;\n\n    convolution1d_transform_kernel_packed_bf16s(weight_data, weight_data_tm, num_input, num_output, kernel_w);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution1D_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = num_output / out_elempack;\n\n    top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    convolution1d_packed_bf16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, dilation_w, stride_w, activation_type, activation_params, opt);\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/convolution1d_arm.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION1D_ARM_H\n#define LAYER_CONVOLUTION1D_ARM_H\n\n#include \"convolution1d.h\"\n\nnamespace ncnn {\n\nclass Convolution1D_arm : public Convolution1D\n{\npublic:\n    Convolution1D_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Mat weight_data_tm;\n\n    // fp16\n    Mat bias_data_fp16;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION1D_ARM_H\n"
  },
  {
    "path": "src/layer/arm/convolution1d_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution1d_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"convolution1d_packed_fp16s.h\"\n\nint Convolution1D_arm::create_pipeline_fp16s(const Option& opt)\n{\n    const int num_input = weight_data_size / kernel_w / num_output;\n\n    convolution1d_transform_kernel_packed_fp16s(weight_data, weight_data_tm, num_input, num_output, kernel_w);\n\n    ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution1D_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n\n    int out_elempack = (opt.use_packing_layout && num_output % 4 == 0) ? 
4 : 1;\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = num_output / out_elempack;\n\n    top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    convolution1d_packed_fp16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, dilation_w, stride_w, activation_type, activation_params, opt);\n\n    return 0;\n}\n\nint Convolution1D_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = num_output / out_elempack;\n\n    top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    convolution1d_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, dilation_w, stride_w, activation_type, activation_params, opt);\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/convolution1d_packed.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution1d_transform_kernel_packed(const Mat& kernel, Mat& kernel_tm, int inh, int outh, int kernel_w)\n{\n    // src = kw-inh-outh\n    // dst = pb-pa-kw-inh/pa-outh/pb\n\n    // clang-format off\n    // *INDENT-OFF*\n#if __ARM_NEON\n#if __aarch64__\n    if (outh >= 8)\n    {\n        if (inh >= 8)\n            kernel_tm.create(8 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2);\n        else if (inh >= 4)\n            kernel_tm.create(8 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2);\n        else if (inh >= 2)\n            kernel_tm.create(8 * 2 * kernel_w, inh / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2);\n        else\n            kernel_tm.create(8 * kernel_w, inh, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2);\n    }\n    else\n#endif // __aarch64__\n    if (outh >= 4)\n    {\n#if __aarch64__\n        if (inh >= 8)\n            kernel_tm.create(4 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2);\n        else\n#endif // __aarch64__\n        if (inh >= 4)\n            kernel_tm.create(4 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2);\n        else if (inh >= 2)\n            kernel_tm.create(4 * 2 * kernel_w, inh / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2);\n        else\n            kernel_tm.create(4 * kernel_w, inh, outh / 4 + (outh % 4) / 2 + outh % 2);\n    }\n    else\n#endif // __ARM_NEON\n    if (outh >= 2)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inh >= 8)\n            kernel_tm.create(2 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 2 + outh % 2);\n        else\n#endif // __aarch64__\n        if (inh >= 4)\n            kernel_tm.create(2 * 4 * 
kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 2 + outh % 2);\n        else\n#endif // __ARM_NEON\n        if (inh >= 2)\n            kernel_tm.create(2 * 2 * kernel_w, inh / 2 + inh % 2, outh / 2 + outh % 2);\n        else\n            kernel_tm.create(2 * kernel_w, inh, outh / 2 + outh % 2);\n    }\n    else\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inh >= 8)\n            kernel_tm.create(8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh);\n        else\n#endif // __aarch64__\n        if (inh >= 4)\n            kernel_tm.create(4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh);\n        else\n#endif // __ARM_NEON\n        if (inh >= 2)\n            kernel_tm.create(2 * kernel_w, inh / 2 + inh % 2, outh);\n        else\n            kernel_tm.create(kernel_w, inh, outh);\n    }\n    // *INDENT-ON*\n    // clang-format on\n\n    int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; q + 7 < outh; q += 8)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inh * kernel_w;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inh * kernel_w;\n        const float* kptr4 = (const float*)kernel + (q + 4) * inh * kernel_w;\n        const float* kptr5 = (const float*)kernel + (q + 5) * inh * kernel_w;\n        const float* kptr6 = (const float*)kernel + (q + 6) * inh * kernel_w;\n        const float* kptr7 = (const float*)kernel + (q + 7) * inh * kernel_w;\n\n        float* g00 = kernel_tm.channel(q / 8);\n\n        int p = 0;\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * 
kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    g00[4] = k4[k];\n                    g00[5] = k5[k];\n                    g00[6] = k6[k];\n                    g00[7] = k7[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    g00[4] = k4[k];\n                    g00[5] = k5[k];\n                    g00[6] = k6[k];\n                    g00[7] = k7[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 
+= kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    g00[4] = k4[k];\n                    g00[5] = k5[k];\n                    g00[6] = k6[k];\n                    g00[7] = k7[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n            const float* k2 = kptr2 + p * kernel_w;\n            const float* k3 = kptr3 + p * kernel_w;\n            const float* k4 = kptr4 + p * kernel_w;\n            const float* k5 = kptr5 + p * kernel_w;\n            const float* k6 = kptr6 + p * kernel_w;\n            
const float* k7 = kptr7 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = k0[k];\n                g00[1] = k1[k];\n                g00[2] = k2[k];\n                g00[3] = k3[k];\n                g00[4] = k4[k];\n                g00[5] = k5[k];\n                g00[6] = k6[k];\n                g00[7] = k7[k];\n                g00 += 8;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; q + 3 < outh; q += 4)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inh * kernel_w;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inh * kernel_w;\n\n#if __aarch64__\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n#else\n        float* g00 = kernel_tm.channel(q / 4);\n#endif\n\n        int p = 0;\n#if __aarch64__\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p 
* kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n            const float* k2 = kptr2 + p * kernel_w;\n            const float* k3 = kptr3 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = k0[k];\n                g00[1] = k1[k];\n                g00[2] = k2[k];\n                g00[3] = k3[k];\n                g00 += 4;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; q + 1 < outh; q += 2)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        
const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n\n#if __aarch64__\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2);\n#elif __ARM_NEON\n        float* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2);\n#else\n        float* g00 = kernel_tm.channel(q / 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w + k;\n                const float* k1 = kptr1 + p * kernel_w + k;\n\n                g00[0] = k0[0];\n                g00[1] = k0[kernel_w];\n                g00[2] = k0[kernel_w * 2];\n                g00[3] = k0[kernel_w * 3];\n                g00[4] = k0[kernel_w * 4];\n                g00[5] = k0[kernel_w * 5];\n                g00[6] = k0[kernel_w * 6];\n                g00[7] = k0[kernel_w * 7];\n                g00[8] = k1[0];\n                g00[9] = k1[kernel_w];\n                g00[10] = k1[kernel_w * 2];\n                g00[11] = k1[kernel_w * 3];\n                g00[12] = k1[kernel_w * 4];\n                g00[13] = k1[kernel_w * 5];\n                g00[14] = k1[kernel_w * 6];\n                g00[15] = k1[kernel_w * 7];\n                g00 += 16;\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w + k;\n                const float* k1 = kptr1 + p * kernel_w + k;\n\n                g00[0] = k0[0];\n                g00[1] = k0[kernel_w];\n                g00[2] = k0[kernel_w * 2];\n                g00[3] = k0[kernel_w * 3];\n                g00[4] = k1[0];\n                g00[5] = k1[kernel_w];\n                g00[6] = k1[kernel_w * 2];\n                g00[7] = k1[kernel_w * 3];\n                g00 += 8;\n            }\n        }\n#endif // __ARM_NEON\n        for 
(; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    g00 += 2;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = k0[k];\n                g00[1] = k1[k];\n                g00 += 2;\n            }\n        }\n    }\n    for (; q < outh; q++)\n    {\n        const float* kptr = (const float*)kernel + q * inh * kernel_w;\n\n#if __aarch64__\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2 + q % 2);\n#elif __ARM_NEON\n        float* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2 + q % 2);\n#else\n        float* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[k];\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[k];\n                    k0 += kernel_w;\n                    
g00 += 1;\n                }\n            }\n        }\n#endif // __ARM_NEON\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = k0[k];\n                g00++;\n            }\n        }\n    }\n}\n\nstatic void convolution1d_packed(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int dilation_w, int stride_w, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int elempack = bottom_blob.elempack;\n    const int inh = bottom_blob.h * elempack;\n\n    const int N = bottom_blob.w * elempack;\n\n    const int outw = top_blob.w;\n    const int out_elempack = top_blob.elempack;\n    const int outh = top_blob.h * out_elempack;\n\n    const int M = top_blob.w * out_elempack;\n\n    const float* bias_data_ptr = bias_data;\n\n    int nn_outh = 0;\n    int remain_outh_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_outh = (outh - remain_outh_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        float* outptr = top_blob.row(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            
float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            float32x4_t _sum4 = vdupq_n_f32(0.f);\n            float32x4_t _sum5 = vdupq_n_f32(0.f);\n            float32x4_t _sum6 = vdupq_n_f32(0.f);\n            float32x4_t _sum7 = vdupq_n_f32(0.f);\n\n            if (bias_data_ptr)\n            {\n                _sum0 = vld1q_f32(bias_data_ptr + p);\n                _sum1 = vld1q_f32(bias_data_ptr + p + 4);\n            }\n\n            const float* kptr = weight_data_tm.channel(p / 8);\n\n            int q = 0;\n            for (; q + 7 < inh; q += 8)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        _r1 = vld1q_f32(r0 + N);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r1 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        _r1 = vsetq_lane_f32(r0[N * 4], _r1, 0);\n                        _r1 = vsetq_lane_f32(r0[N * 5], _r1, 1);\n                        _r1 = vsetq_lane_f32(r0[N * 6], _r1, 2);\n                        _r1 = vsetq_lane_f32(r0[N * 7], _r1, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n 
                   float32x4_t _w1 = vld1q_f32(kptr + 4);\n                    float32x4_t _w2 = vld1q_f32(kptr + 4 * 2);\n                    float32x4_t _w3 = vld1q_f32(kptr + 4 * 3);\n                    float32x4_t _w4 = vld1q_f32(kptr + 4 * 4);\n                    float32x4_t _w5 = vld1q_f32(kptr + 4 * 5);\n                    float32x4_t _w6 = vld1q_f32(kptr + 4 * 6);\n                    float32x4_t _w7 = vld1q_f32(kptr + 4 * 7);\n                    float32x4_t _w8 = vld1q_f32(kptr + 4 * 8);\n                    float32x4_t _w9 = vld1q_f32(kptr + 4 * 9);\n                    float32x4_t _wa = vld1q_f32(kptr + 4 * 10);\n                    float32x4_t _wb = vld1q_f32(kptr + 4 * 11);\n                    float32x4_t _wc = vld1q_f32(kptr + 4 * 12);\n                    float32x4_t _wd = vld1q_f32(kptr + 4 * 13);\n                    float32x4_t _we = vld1q_f32(kptr + 4 * 14);\n                    float32x4_t _wf = vld1q_f32(kptr + 4 * 15);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w8, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w9, _r1, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _wa, _r1, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _wb, _r1, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _wc, _r1, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _wd, _r1, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _we, _r1, 3);\n                    _sum7 = 
vfmaq_laneq_f32(_sum7, _wf, _r1, 3);\n\n                    kptr += 64;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n                    float32x4_t _w2 = vld1q_f32(kptr + 4 * 2);\n                    float32x4_t _w3 = vld1q_f32(kptr + 4 * 3);\n                    float32x4_t _w4 = vld1q_f32(kptr + 4 * 4);\n                    float32x4_t _w5 = vld1q_f32(kptr + 4 * 5);\n                    float32x4_t _w6 = vld1q_f32(kptr + 4 * 6);\n                    float32x4_t _w7 = vld1q_f32(kptr + 4 * 7);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, 
_w7, _r0, 3);\n\n                    kptr += 32;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n                    float32x4_t _w2 = vld1q_f32(kptr + 8);\n                    float32x4_t _w3 = vld1q_f32(kptr + 12);\n                    _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vfmaq_n_f32(_sum1, _w1, val0);\n                    _sum2 = vfmaq_n_f32(_sum2, _w2, val1);\n                    _sum3 = vfmaq_n_f32(_sum3, _w3, val1);\n\n                    kptr += 16;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = vdupq_n_f32(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n                    _sum0 = vfmaq_f32(_sum0, _w0, _val);\n                    _sum1 = vfmaq_f32(_sum1, _w1, _val);\n\n                    kptr += 8;\n                }\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum2);\n            _sum1 = vaddq_f32(_sum1, _sum3);\n            _sum4 = vaddq_f32(_sum4, _sum6);\n            _sum5 = 
vaddq_f32(_sum5, _sum7);\n            _sum0 = vaddq_f32(_sum0, _sum4);\n            _sum1 = vaddq_f32(_sum1, _sum5);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n            _sum1 = activation_ps(_sum1, activation_type, activation_params);\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + M, _sum1);\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                outptr[0] = vgetq_lane_f32(_sum0, 0);\n                outptr[M] = vgetq_lane_f32(_sum0, 1);\n                outptr[M * 2] = vgetq_lane_f32(_sum0, 2);\n                outptr[M * 3] = vgetq_lane_f32(_sum0, 3);\n                outptr[M * 4] = vgetq_lane_f32(_sum1, 0);\n                outptr[M * 5] = vgetq_lane_f32(_sum1, 1);\n                outptr[M * 6] = vgetq_lane_f32(_sum1, 2);\n                outptr[M * 7] = vgetq_lane_f32(_sum1, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 8;\n    nn_outh = (outh - remain_outh_start) / 4;\n#else // __aarch64__\n    nn_outh = (outh - remain_outh_start) / 4;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __aarch64__\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        float* outptr = top_blob.row(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            if (bias_data_ptr)\n         
   {\n                _sum0 = vld1q_f32(bias_data_ptr + p);\n            }\n\n#if __aarch64__\n            const float* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n#else\n            const float* kptr = weight_data_tm.channel(p / 4);\n#endif\n\n            int q = 0;\n#if __aarch64__\n            for (; q + 7 < inh; q += 8)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        _r1 = vld1q_f32(r0 + N);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r1 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        _r1 = vsetq_lane_f32(r0[N * 4], _r1, 0);\n                        _r1 = vsetq_lane_f32(r0[N * 5], _r1, 1);\n                        _r1 = vsetq_lane_f32(r0[N * 6], _r1, 2);\n                        _r1 = vsetq_lane_f32(r0[N * 7], _r1, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n                    float32x4_t _w2 = vld1q_f32(kptr + 8);\n                    float32x4_t _w3 = vld1q_f32(kptr + 12);\n                    float32x4_t _w4 = vld1q_f32(kptr + 16);\n                    float32x4_t _w5 = vld1q_f32(kptr + 20);\n                    float32x4_t _w6 = vld1q_f32(kptr + 24);\n        
            float32x4_t _w7 = vld1q_f32(kptr + 28);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w4, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w5, _r1, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w6, _r1, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w7, _r1, 3);\n\n                    kptr += 32;\n                }\n            }\n#endif // __aarch64__\n            for (; q + 3 < inh; q += 4)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n                    float32x4_t _w2 = vld1q_f32(kptr + 8);\n                    float32x4_t _w3 = vld1q_f32(kptr + 12);\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 
2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_r0), 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_r0), 1);\n                    _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_r0), 0);\n                    _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_r0), 1);\n#endif\n\n                    kptr += 16;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n#if __aarch64__\n                    _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vfmaq_n_f32(_sum1, _w1, val1);\n#else\n                    _sum0 = vmlaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vmlaq_n_f32(_sum1, _w1, val1);\n#endif\n\n                    kptr += 8;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = vdupq_n_f32(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w = vld1q_f32(kptr);\n#if __aarch64__\n                    _sum0 = vfmaq_f32(_sum0, _val, _w);\n#else\n                    _sum0 = 
vmlaq_f32(_sum0, _val, _w);\n#endif\n\n                    kptr += 4;\n                }\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _sum0 = vaddq_f32(_sum0, _sum2);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(outptr, _sum0);\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                outptr[0] = vgetq_lane_f32(_sum0, 0);\n                outptr[M] = vgetq_lane_f32(_sum0, 1);\n                outptr[M * 2] = vgetq_lane_f32(_sum0, 2);\n                outptr[M * 3] = vgetq_lane_f32(_sum0, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 4;\n    nn_outh = (outh - remain_outh_start) / 2;\n#else // __ARM_NEON\n    nn_outh = (outh - remain_outh_start) / 2;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __ARM_NEON\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n\n        float* outptr0 = top_blob.row(p);\n        float* outptr1 = top_blob.row(p + 1);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float sum0 = 0.f;\n            float sum1 = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum0 = bias_data_ptr[p];\n                sum1 = bias_data_ptr[p + 1];\n            }\n\n#if __aarch64__\n            const float* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n#elif __ARM_NEON\n            const float* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2);\n#else\n            const float* kptr = weight_data_tm.channel(p / 2);\n#endif\n\n  
          int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        _r1 = vld1q_f32(r0 + N);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r1 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        _r1 = vsetq_lane_f32(r0[N * 4], _r1, 0);\n                        _r1 = vsetq_lane_f32(r0[N * 5], _r1, 1);\n                        _r1 = vsetq_lane_f32(r0[N * 6], _r1, 2);\n                        _r1 = vsetq_lane_f32(r0[N * 7], _r1, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n                    float32x4_t _w2 = vld1q_f32(kptr + 8);\n                    float32x4_t _w3 = vld1q_f32(kptr + 12);\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n                    _sum2 = vfmaq_f32(_sum2, _r0, _w2);\n                    _sum3 = vfmaq_f32(_sum3, _r1, 
_w3);\n\n                    kptr += 16;\n                }\n            }\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            sum0 += vaddvq_f32(_sum0);\n            sum1 += vaddvq_f32(_sum2);\n            _sum0 = vdupq_n_f32(0.f);\n            _sum1 = vdupq_n_f32(0.f);\n#else  // __aarch64__\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n#endif // __aarch64__\n            for (; q + 3 < inh; q += 4)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = vld1q_f32(kptr + 4);\n#if __aarch64__\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r0, _w1);\n#else\n                    _sum0 = vmlaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vmlaq_f32(_sum1, _r0, _w1);\n#endif\n\n                    kptr += 8;\n                }\n            }\n#if __aarch64__\n            sum0 += vaddvq_f32(_sum0);\n            sum1 += vaddvq_f32(_sum1);\n#else\n            float32x2_t _ss0 = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n            
float32x2_t _ss1 = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n            float32x2_t _ss = vpadd_f32(_ss0, _ss1);\n            sum0 += vget_lane_f32(_ss, 0);\n            sum1 += vget_lane_f32(_ss, 1);\n#endif\n#endif // __ARM_NEON\n            for (; q + 1 < inh; q += 2)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val0 * kptr[0];\n                    sum1 += val0 * kptr[1];\n                    sum0 += val1 * kptr[2];\n                    sum1 += val1 * kptr[3];\n\n                    kptr += 4;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val;\n                    // if (elempack == 1)\n                    {\n                        val = r0[0];\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val * kptr[0];\n                    sum1 += val * kptr[1];\n\n                    kptr += 2;\n                }\n            }\n\n            sum0 = activation_ss(sum0, activation_type, activation_params);\n            sum1 = activation_ss(sum1, activation_type, activation_params);\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n            outptr0 += 1;\n            outptr1 += 1;\n        }\n    }\n    remain_outh_start += nn_outh * 2;\n    for (int p = remain_outh_start; p < outh; p++)\n    {\n        float* outptr = top_blob.row(p);\n\n        for (int j = 0; j < outw; j++)\n  
      {\n            float sum = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum = bias_data_ptr[p];\n            }\n\n#if __aarch64__\n            const float* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n#elif __ARM_NEON\n            const float* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2 + p % 2);\n#else\n            const float* kptr = weight_data_tm.channel(p / 2 + p % 2);\n#endif\n\n            int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        _r1 = vld1q_f32(r0 + N);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r1 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        _r1 = vsetq_lane_f32(r0[N * 4], _r1, 0);\n                        _r1 = vsetq_lane_f32(r0[N * 5], _r1, 1);\n                        _r1 = vsetq_lane_f32(r0[N * 6], _r1, 2);\n                        _r1 = vsetq_lane_f32(r0[N * 7], _r1, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w0 = vld1q_f32(kptr);\n                    float32x4_t _w1 = 
vld1q_f32(kptr + 4);\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n\n                    kptr += 8;\n                }\n            }\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            sum += vaddvq_f32(_sum0);\n#endif // __aarch64__\n            float32x4_t _sum = vdupq_n_f32(0.f);\n            for (; q + 3 < inh; q += 4)\n            {\n                const float* r0 = bottom_blob.row(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float32x4_t();\n                        _r0 = vsetq_lane_f32(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f32(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f32(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f32(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w = vld1q_f32(kptr);\n#if __aarch64__\n                    _sum = vfmaq_f32(_sum, _r0, _w);\n#else\n                    _sum = vmlaq_f32(_sum, _r0, _w);\n#endif\n\n                    kptr += 4;\n                }\n            }\n#if __aarch64__\n            sum += vaddvq_f32(_sum);\n#else\n            float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n            _ss = vpadd_f32(_ss, _ss);\n            sum += vget_lane_f32(_ss, 0);\n#endif\n#endif // __ARM_NEON\n            for (; q + 1 < inh; q += 2)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n       
             float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    sum += val0 * kptr[0];\n                    sum += val1 * kptr[1];\n\n                    kptr += 2;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const float* r0 = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val;\n                    // if (elempack == 1)\n                    {\n                        val = r0[0];\n                        r0 += dilation_w;\n                    }\n\n                    sum += val * kptr[0];\n\n                    kptr += 1;\n                }\n            }\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            outptr[0] = sum;\n            outptr += 1;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution1d_packed_bf16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution1d_transform_kernel_packed_bf16s(const Mat& kernel, Mat& kernel_tm, int inh, int outh, int kernel_w)\n{\n    // src = kw-inh-outh\n    // dst = pb-pa-kw-inh/pa-outh/pb\n\n    // clang-format off\n    // *INDENT-OFF*\n#if __ARM_NEON\n#if __aarch64__\n    if (outh >= 8)\n    {\n        if (inh >= 8)\n            kernel_tm.create(8 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 4)\n            kernel_tm.create(8 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 2)\n            kernel_tm.create(8 * 2 * kernel_w, inh / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else\n            kernel_tm.create(8 * kernel_w, inh, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n    }\n    else\n#endif // __aarch64__\n    if (outh >= 4)\n    {\n#if __aarch64__\n        if (inh >= 8)\n            kernel_tm.create(4 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else\n#endif // __aarch64__\n        if (inh >= 4)\n            kernel_tm.create(4 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 2)\n            kernel_tm.create(4 * 2 * kernel_w, inh / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else\n            kernel_tm.create(4 * kernel_w, inh, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n    }\n    else\n#endif // __ARM_NEON\n    if (outh >= 2)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inh >= 8)\n            kernel_tm.create(2 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 2 + outh % 
2, (size_t)2u);\n        else\n#endif // __aarch64__\n        if (inh >= 4)\n            kernel_tm.create(2 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 2 + outh % 2, (size_t)2u);\n        else\n#endif // __ARM_NEON\n        if (inh >= 2)\n            kernel_tm.create(2 * 2 * kernel_w, inh / 2 + inh % 2, outh / 2 + outh % 2, (size_t)2u);\n        else\n            kernel_tm.create(2 * kernel_w, inh, outh / 2 + outh % 2, (size_t)2u);\n    }\n    else\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inh >= 8)\n            kernel_tm.create(8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh, (size_t)2u);\n        else\n#endif // __aarch64__\n        if (inh >= 4)\n            kernel_tm.create(4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh, (size_t)2u);\n        else\n#endif // __ARM_NEON\n        if (inh >= 2)\n            kernel_tm.create(2 * kernel_w, inh / 2 + inh % 2, outh, (size_t)2u);\n        else\n            kernel_tm.create(kernel_w, inh, outh, (size_t)2u);\n    }\n    // *INDENT-ON*\n    // clang-format on\n\n    int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; q + 7 < outh; q += 8)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inh * kernel_w;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inh * kernel_w;\n        const float* kptr4 = (const float*)kernel + (q + 4) * inh * kernel_w;\n        const float* kptr5 = (const float*)kernel + (q + 5) * inh * kernel_w;\n        const float* kptr6 = (const float*)kernel + (q + 6) * inh * kernel_w;\n        const float* kptr7 = (const float*)kernel + (q + 7) * inh * kernel_w;\n\n        unsigned short* g00 = kernel_tm.channel(q / 8);\n\n        int p = 0;\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                
const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    g00[4] = float32_to_bfloat16(k4[k]);\n                    g00[5] = float32_to_bfloat16(k5[k]);\n                    g00[6] = float32_to_bfloat16(k6[k]);\n                    g00[7] = float32_to_bfloat16(k7[k]);\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    
g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    g00[4] = float32_to_bfloat16(k4[k]);\n                    g00[5] = float32_to_bfloat16(k5[k]);\n                    g00[6] = float32_to_bfloat16(k6[k]);\n                    g00[7] = float32_to_bfloat16(k7[k]);\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    g00[4] = float32_to_bfloat16(k4[k]);\n                    g00[5] = float32_to_bfloat16(k5[k]);\n                    g00[6] = float32_to_bfloat16(k6[k]);\n                    g00[7] = float32_to_bfloat16(k7[k]);\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n             
       k2 += kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n            const float* k2 = kptr2 + p * kernel_w;\n            const float* k3 = kptr3 + p * kernel_w;\n            const float* k4 = kptr4 + p * kernel_w;\n            const float* k5 = kptr5 + p * kernel_w;\n            const float* k6 = kptr6 + p * kernel_w;\n            const float* k7 = kptr7 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00[1] = float32_to_bfloat16(k1[k]);\n                g00[2] = float32_to_bfloat16(k2[k]);\n                g00[3] = float32_to_bfloat16(k3[k]);\n                g00[4] = float32_to_bfloat16(k4[k]);\n                g00[5] = float32_to_bfloat16(k5[k]);\n                g00[6] = float32_to_bfloat16(k6[k]);\n                g00[7] = float32_to_bfloat16(k7[k]);\n                g00 += 8;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; q + 3 < outh; q += 4)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inh * kernel_w;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inh * kernel_w;\n\n#if __aarch64__\n        unsigned short* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n#else\n        unsigned short* g00 = kernel_tm.channel(q / 4);\n#endif\n\n        int p = 0;\n#if __aarch64__\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n       
         const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; 
i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n            const float* k2 = kptr2 + p * kernel_w;\n            const float* k3 = kptr3 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00[1] = float32_to_bfloat16(k1[k]);\n                g00[2] = float32_to_bfloat16(k2[k]);\n                g00[3] = float32_to_bfloat16(k3[k]);\n                g00 += 4;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; q + 1 < outh; q += 2)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n\n#if __aarch64__\n        unsigned short* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2);\n#elif __ARM_NEON\n        unsigned short* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2);\n#else\n        unsigned short* g00 = kernel_tm.channel(q / 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w + k;\n                const float* k1 = kptr1 + p * kernel_w + k;\n\n                g00[0] = float32_to_bfloat16(k0[0]);\n                g00[1] = float32_to_bfloat16(k0[kernel_w]);\n                g00[2] = 
float32_to_bfloat16(k0[kernel_w * 2]);\n                g00[3] = float32_to_bfloat16(k0[kernel_w * 3]);\n                g00[4] = float32_to_bfloat16(k0[kernel_w * 4]);\n                g00[5] = float32_to_bfloat16(k0[kernel_w * 5]);\n                g00[6] = float32_to_bfloat16(k0[kernel_w * 6]);\n                g00[7] = float32_to_bfloat16(k0[kernel_w * 7]);\n                g00[8] = float32_to_bfloat16(k1[0]);\n                g00[9] = float32_to_bfloat16(k1[kernel_w]);\n                g00[10] = float32_to_bfloat16(k1[kernel_w * 2]);\n                g00[11] = float32_to_bfloat16(k1[kernel_w * 3]);\n                g00[12] = float32_to_bfloat16(k1[kernel_w * 4]);\n                g00[13] = float32_to_bfloat16(k1[kernel_w * 5]);\n                g00[14] = float32_to_bfloat16(k1[kernel_w * 6]);\n                g00[15] = float32_to_bfloat16(k1[kernel_w * 7]);\n                g00 += 16;\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w + k;\n                const float* k1 = kptr1 + p * kernel_w + k;\n\n                g00[0] = float32_to_bfloat16(k0[0]);\n                g00[1] = float32_to_bfloat16(k0[kernel_w]);\n                g00[2] = float32_to_bfloat16(k0[kernel_w * 2]);\n                g00[3] = float32_to_bfloat16(k0[kernel_w * 3]);\n                g00[4] = float32_to_bfloat16(k1[0]);\n                g00[5] = float32_to_bfloat16(k1[kernel_w]);\n                g00[6] = float32_to_bfloat16(k1[kernel_w * 2]);\n                g00[7] = float32_to_bfloat16(k1[kernel_w * 3]);\n                g00 += 8;\n            }\n        }\n#endif // __ARM_NEON\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n\n                for 
(int i = 0; i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    g00 += 2;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00[1] = float32_to_bfloat16(k1[k]);\n                g00 += 2;\n            }\n        }\n    }\n    for (; q < outh; q++)\n    {\n        const float* kptr = (const float*)kernel + q * inh * kernel_w;\n\n#if __aarch64__\n        unsigned short* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2 + q % 2);\n#elif __ARM_NEON\n        unsigned short* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2 + q % 2);\n#else\n        unsigned short* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n#endif // __ARM_NEON\n       
 for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00++;\n            }\n        }\n    }\n}\n\nstatic void convolution1d_packed_bf16s(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int dilation_w, int stride_w, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int elempack = bottom_blob.elempack;\n    const int inh = bottom_blob.h * elempack;\n\n    const int N = bottom_blob.w * elempack;\n\n    const int outw = top_blob.w;\n    const int out_elempack = top_blob.elempack;\n    const int outh = top_blob.h * out_elempack;\n\n    const int M = top_blob.w * out_elempack;\n\n    const float* bias_data_ptr = bias_data;\n\n    int nn_outh = 0;\n    int remain_outh_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_outh = (outh - remain_outh_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        unsigned short* outptr = top_blob.row<unsigned short>(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float32x4_t 
_sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            float32x4_t _sum4 = vdupq_n_f32(0.f);\n            float32x4_t _sum5 = vdupq_n_f32(0.f);\n            float32x4_t _sum6 = vdupq_n_f32(0.f);\n            float32x4_t _sum7 = vdupq_n_f32(0.f);\n\n            if (bias_data_ptr)\n            {\n                _sum0 = vld1q_f32(bias_data_ptr + p);\n                _sum1 = vld1q_f32(bias_data_ptr + p + 4);\n            }\n\n            const unsigned short* kptr = weight_data_tm.channel(p / 8);\n\n            int q = 0;\n            for (; q + 7 < inh; q += 8)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        _r1 = bfloat2float(vld1_u16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        uint16x8_t _r_u16 = uint16x8_t();\n                        _r_u16 = vsetq_lane_u16(r0[0], _r_u16, 0);\n                        _r_u16 = vsetq_lane_u16(r0[N], _r_u16, 1);\n                        _r_u16 = vsetq_lane_u16(r0[N * 2], _r_u16, 2);\n                        _r_u16 = vsetq_lane_u16(r0[N * 3], _r_u16, 3);\n                        _r_u16 = vsetq_lane_u16(r0[N * 4], _r_u16, 4);\n                        _r_u16 = vsetq_lane_u16(r0[N * 5], _r_u16, 5);\n                        _r_u16 = vsetq_lane_u16(r0[N * 6], _r_u16, 6);\n                        _r_u16 = vsetq_lane_u16(r0[N * 7], _r_u16, 7);\n                        _r0 = 
bfloat2float(vget_low_u16(_r_u16));\n                        _r1 = bfloat2float(vget_high_u16(_r_u16));\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w01 = vld1q_u16(kptr);\n                    uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                    uint16x8_t _w45 = vld1q_u16(kptr + 16);\n                    uint16x8_t _w67 = vld1q_u16(kptr + 24);\n                    uint16x8_t _w89 = vld1q_u16(kptr + 32);\n                    uint16x8_t _wab = vld1q_u16(kptr + 40);\n                    uint16x8_t _wcd = vld1q_u16(kptr + 48);\n                    uint16x8_t _wef = vld1q_u16(kptr + 56);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                    float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                    float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                    float32x4_t _w4 = bfloat2float(vget_low_u16(_w45));\n                    float32x4_t _w5 = bfloat2float(vget_high_u16(_w45));\n                    float32x4_t _w6 = bfloat2float(vget_low_u16(_w67));\n                    float32x4_t _w7 = bfloat2float(vget_high_u16(_w67));\n                    float32x4_t _w8 = bfloat2float(vget_low_u16(_w89));\n                    float32x4_t _w9 = bfloat2float(vget_high_u16(_w89));\n                    float32x4_t _wa = bfloat2float(vget_low_u16(_wab));\n                    float32x4_t _wb = bfloat2float(vget_high_u16(_wab));\n                    float32x4_t _wc = bfloat2float(vget_low_u16(_wcd));\n                    float32x4_t _wd = bfloat2float(vget_high_u16(_wcd));\n                    float32x4_t _we = bfloat2float(vget_low_u16(_wef));\n                    float32x4_t _wf = bfloat2float(vget_high_u16(_wef));\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                    _sum2 = 
vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w8, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w9, _r1, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _wa, _r1, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _wb, _r1, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _wc, _r1, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _wd, _r1, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _we, _r1, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _wf, _r1, 3);\n\n                    kptr += 64;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        uint16x4_t _r_u16 = uint16x4_t();\n                        _r_u16 = vset_lane_u16(r0[0], _r_u16, 0);\n                        _r_u16 = vset_lane_u16(r0[N], _r_u16, 1);\n                        _r_u16 = vset_lane_u16(r0[N * 2], _r_u16, 2);\n                        _r_u16 = vset_lane_u16(r0[N * 3], _r_u16, 3);\n                        _r0 = bfloat2float(_r_u16);\n                        r0 += dilation_w;\n                    }\n\n                    
uint16x8_t _w01 = vld1q_u16(kptr);\n                    uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                    uint16x8_t _w45 = vld1q_u16(kptr + 16);\n                    uint16x8_t _w67 = vld1q_u16(kptr + 24);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                    float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                    float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                    float32x4_t _w4 = bfloat2float(vget_low_u16(_w45));\n                    float32x4_t _w5 = bfloat2float(vget_high_u16(_w45));\n                    float32x4_t _w6 = bfloat2float(vget_low_u16(_w67));\n                    float32x4_t _w7 = bfloat2float(vget_high_u16(_w67));\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n\n                    kptr += 32;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = bfloat16_to_float32(r0[0]);\n                        val1 = bfloat16_to_float32(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w01 = 
vld1q_u16(kptr);\n                    uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                    float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                    float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                    _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vfmaq_n_f32(_sum1, _w1, val0);\n                    _sum2 = vfmaq_n_f32(_sum2, _w2, val1);\n                    _sum3 = vfmaq_n_f32(_sum3, _w3, val1);\n\n                    kptr += 16;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = bfloat2float(vdup_n_u16(r0[0]));\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w = vld1q_u16(kptr);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n                    _sum0 = vfmaq_f32(_sum0, _w0, _val);\n                    _sum1 = vfmaq_f32(_sum1, _w1, _val);\n\n                    kptr += 8;\n                }\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum2);\n            _sum1 = vaddq_f32(_sum1, _sum3);\n            _sum4 = vaddq_f32(_sum4, _sum6);\n            _sum5 = vaddq_f32(_sum5, _sum7);\n            _sum0 = vaddq_f32(_sum0, _sum4);\n            _sum1 = vaddq_f32(_sum1, _sum5);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n            _sum1 = activation_ps(_sum1, activation_type, activation_params);\n\n            if (out_elempack == 4)\n 
           {\n                vst1_u16(outptr, float2bfloat(_sum0));\n                vst1_u16(outptr + M, float2bfloat(_sum1));\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                uint16x4_t _sum0_u16 = float2bfloat(_sum0);\n                uint16x4_t _sum1_u16 = float2bfloat(_sum1);\n                outptr[0] = vget_lane_u16(_sum0_u16, 0);\n                outptr[M] = vget_lane_u16(_sum0_u16, 1);\n                outptr[M * 2] = vget_lane_u16(_sum0_u16, 2);\n                outptr[M * 3] = vget_lane_u16(_sum0_u16, 3);\n                outptr[M * 4] = vget_lane_u16(_sum1_u16, 0);\n                outptr[M * 5] = vget_lane_u16(_sum1_u16, 1);\n                outptr[M * 6] = vget_lane_u16(_sum1_u16, 2);\n                outptr[M * 7] = vget_lane_u16(_sum1_u16, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 8;\n    nn_outh = (outh - remain_outh_start) / 4;\n#else // __aarch64__\n    nn_outh = (outh - remain_outh_start) / 4;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __aarch64__\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        unsigned short* outptr = top_blob.row<unsigned short>(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            if (bias_data_ptr)\n            {\n                _sum0 = vld1q_f32(bias_data_ptr + p);\n            }\n\n#if __aarch64__\n            const 
unsigned short* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n#else\n            const unsigned short* kptr = weight_data_tm.channel(p / 4);\n#endif\n\n            int q = 0;\n#if __aarch64__\n            for (; q + 7 < inh; q += 8)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        _r1 = bfloat2float(vld1_u16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        uint16x8_t _r_u16 = uint16x8_t();\n                        _r_u16 = vsetq_lane_u16(r0[0], _r_u16, 0);\n                        _r_u16 = vsetq_lane_u16(r0[N], _r_u16, 1);\n                        _r_u16 = vsetq_lane_u16(r0[N * 2], _r_u16, 2);\n                        _r_u16 = vsetq_lane_u16(r0[N * 3], _r_u16, 3);\n                        _r_u16 = vsetq_lane_u16(r0[N * 4], _r_u16, 4);\n                        _r_u16 = vsetq_lane_u16(r0[N * 5], _r_u16, 5);\n                        _r_u16 = vsetq_lane_u16(r0[N * 6], _r_u16, 6);\n                        _r_u16 = vsetq_lane_u16(r0[N * 7], _r_u16, 7);\n                        _r0 = bfloat2float(vget_low_u16(_r_u16));\n                        _r1 = bfloat2float(vget_high_u16(_r_u16));\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w01 = vld1q_u16(kptr);\n                    uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                    uint16x8_t _w45 = vld1q_u16(kptr + 16);\n                    uint16x8_t _w67 = vld1q_u16(kptr + 24);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n     
               float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                    float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                    float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                    float32x4_t _w4 = bfloat2float(vget_low_u16(_w45));\n                    float32x4_t _w5 = bfloat2float(vget_high_u16(_w45));\n                    float32x4_t _w6 = bfloat2float(vget_low_u16(_w67));\n                    float32x4_t _w7 = bfloat2float(vget_high_u16(_w67));\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w4, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w5, _r1, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w6, _r1, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w7, _r1, 3);\n\n                    kptr += 32;\n                }\n            }\n#endif // __aarch64__\n            for (; q + 3 < inh; q += 4)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        uint16x4_t _r_u16 = uint16x4_t();\n                        _r_u16 = vset_lane_u16(r0[0], _r_u16, 0);\n                        _r_u16 = vset_lane_u16(r0[N], _r_u16, 1);\n                        _r_u16 = vset_lane_u16(r0[N * 2], _r_u16, 2);\n                        _r_u16 = vset_lane_u16(r0[N 
* 3], _r_u16, 3);\n                        _r0 = bfloat2float(_r_u16);\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w01 = vld1q_u16(kptr);\n                    uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                    float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                    float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_r0), 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_r0), 1);\n                    _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_r0), 0);\n                    _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_r0), 1);\n#endif\n\n                    kptr += 16;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = bfloat16_to_float32(r0[0]);\n                        val1 = bfloat16_to_float32(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w = vld1q_u16(kptr);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n#if __aarch64__\n   
                 _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vfmaq_n_f32(_sum1, _w1, val1);\n#else\n                    _sum0 = vmlaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vmlaq_n_f32(_sum1, _w1, val1);\n#endif\n\n                    kptr += 8;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = bfloat2float(vdup_n_u16(r0[0]));\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w = bfloat2float(vld1_u16(kptr));\n#if __aarch64__\n                    _sum0 = vfmaq_f32(_sum0, _val, _w);\n#else\n                    _sum0 = vmlaq_f32(_sum0, _val, _w);\n#endif\n\n                    kptr += 4;\n                }\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _sum0 = vaddq_f32(_sum0, _sum2);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n            if (out_elempack == 4)\n            {\n                vst1_u16(outptr, float2bfloat(_sum0));\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                uint16x4_t _sum0_u16 = float2bfloat(_sum0);\n                outptr[0] = vget_lane_u16(_sum0_u16, 0);\n                outptr[M] = vget_lane_u16(_sum0_u16, 1);\n                outptr[M * 2] = vget_lane_u16(_sum0_u16, 2);\n                outptr[M * 3] = vget_lane_u16(_sum0_u16, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 4;\n    nn_outh = (outh - remain_outh_start) / 2;\n#else // __ARM_NEON\n    nn_outh = (outh - 
remain_outh_start) / 2;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __ARM_NEON\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n\n        unsigned short* outptr0 = top_blob.row<unsigned short>(p);\n        unsigned short* outptr1 = top_blob.row<unsigned short>(p + 1);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float sum0 = 0.f;\n            float sum1 = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum0 = bias_data_ptr[p];\n                sum1 = bias_data_ptr[p + 1];\n            }\n\n#if __aarch64__\n            const unsigned short* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n#elif __ARM_NEON\n            const unsigned short* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2);\n#else\n            const unsigned short* kptr = weight_data_tm.channel(p / 2);\n#endif\n\n            int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        _r1 = bfloat2float(vld1_u16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else 
// if (elempack == 1)\n                    {\n                        uint16x8_t _r01_u16 = uint16x8_t();\n                        _r01_u16 = vsetq_lane_u16(r0[0], _r01_u16, 0);\n                        _r01_u16 = vsetq_lane_u16(r0[N], _r01_u16, 1);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 2], _r01_u16, 2);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 3], _r01_u16, 3);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 4], _r01_u16, 4);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 5], _r01_u16, 5);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 6], _r01_u16, 6);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 7], _r01_u16, 7);\n                        _r0 = bfloat2float(vget_low_u16(_r01_u16));\n                        _r1 = bfloat2float(vget_high_u16(_r01_u16));\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w01 = vld1q_u16(kptr);\n                    uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                    float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                    float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n                    _sum2 = vfmaq_f32(_sum2, _r0, _w2);\n                    _sum3 = vfmaq_f32(_sum3, _r1, _w3);\n\n                    kptr += 16;\n                }\n            }\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            sum0 += vaddvq_f32(_sum0);\n            sum1 += vaddvq_f32(_sum2);\n            _sum0 = vdupq_n_f32(0.f);\n            _sum1 = vdupq_n_f32(0.f);\n#else  // __aarch64__\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = 
vdupq_n_f32(0.f);\n#endif // __aarch64__\n            for (; q + 3 < inh; q += 4)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        uint16x4_t _r0_u16 = uint16x4_t();\n                        _r0_u16 = vset_lane_u16(r0[0], _r0_u16, 0);\n                        _r0_u16 = vset_lane_u16(r0[N], _r0_u16, 1);\n                        _r0_u16 = vset_lane_u16(r0[N * 2], _r0_u16, 2);\n                        _r0_u16 = vset_lane_u16(r0[N * 3], _r0_u16, 3);\n                        _r0 = bfloat2float(_r0_u16);\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w = vld1q_u16(kptr);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n#if __aarch64__\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r0, _w1);\n#else\n                    _sum0 = vmlaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vmlaq_f32(_sum1, _r0, _w1);\n#endif\n\n                    kptr += 8;\n                }\n            }\n#if __aarch64__\n            sum0 += vaddvq_f32(_sum0);\n            sum1 += vaddvq_f32(_sum1);\n#else\n            float32x2_t _ss0 = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n            float32x2_t _ss1 = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n            float32x2_t _ss = vpadd_f32(_ss0, _ss1);\n            sum0 += vget_lane_f32(_ss, 0);\n            sum1 += vget_lane_f32(_ss, 
1);\n#endif\n#endif // __ARM_NEON\n            for (; q + 1 < inh; q += 2)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = bfloat16_to_float32(r0[0]);\n                        val1 = bfloat16_to_float32(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val0 * bfloat16_to_float32(kptr[0]);\n                    sum1 += val0 * bfloat16_to_float32(kptr[1]);\n                    sum0 += val1 * bfloat16_to_float32(kptr[2]);\n                    sum1 += val1 * bfloat16_to_float32(kptr[3]);\n\n                    kptr += 4;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val;\n                    // if (elempack == 1)\n                    {\n                        val = bfloat16_to_float32(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val * bfloat16_to_float32(kptr[0]);\n                    sum1 += val * bfloat16_to_float32(kptr[1]);\n\n                    kptr += 2;\n                }\n            }\n\n            sum0 = activation_ss(sum0, activation_type, activation_params);\n            sum1 = activation_ss(sum1, activation_type, activation_params);\n\n            outptr0[0] = float32_to_bfloat16(sum0);\n            outptr1[0] = float32_to_bfloat16(sum1);\n            outptr0 += 1;\n            outptr1 += 1;\n        }\n    }\n    remain_outh_start += nn_outh * 2;\n    for (int p = remain_outh_start; p < outh; p++)\n    {\n   
     unsigned short* outptr = top_blob.row<unsigned short>(p);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float sum = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum = bias_data_ptr[p];\n            }\n\n#if __aarch64__\n            const unsigned short* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n#elif __ARM_NEON\n            const unsigned short* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2 + p % 2);\n#else\n            const unsigned short* kptr = weight_data_tm.channel(p / 2 + p % 2);\n#endif\n\n            int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        _r1 = bfloat2float(vld1_u16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        uint16x8_t _r01_u16 = uint16x8_t();\n                        _r01_u16 = vsetq_lane_u16(r0[0], _r01_u16, 0);\n                        _r01_u16 = vsetq_lane_u16(r0[N], _r01_u16, 1);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 2], _r01_u16, 2);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 3], _r01_u16, 3);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 4], _r01_u16, 4);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 5], _r01_u16, 5);\n                        _r01_u16 = vsetq_lane_u16(r0[N * 6], _r01_u16, 6);\n           
             _r01_u16 = vsetq_lane_u16(r0[N * 7], _r01_u16, 7);\n                        _r0 = bfloat2float(vget_low_u16(_r01_u16));\n                        _r1 = bfloat2float(vget_high_u16(_r01_u16));\n                        r0 += dilation_w;\n                    }\n\n                    uint16x8_t _w = vld1q_u16(kptr);\n                    float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n\n                    kptr += 8;\n                }\n            }\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            sum += vaddvq_f32(_sum0);\n#endif // __aarch64__\n            float32x4_t _sum = vdupq_n_f32(0.f);\n            for (; q + 3 < inh; q += 4)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        uint16x4_t _r0_u16 = uint16x4_t();\n                        _r0_u16 = vset_lane_u16(r0[0], _r0_u16, 0);\n                        _r0_u16 = vset_lane_u16(r0[N], _r0_u16, 1);\n                        _r0_u16 = vset_lane_u16(r0[N * 2], _r0_u16, 2);\n                        _r0_u16 = vset_lane_u16(r0[N * 3], _r0_u16, 3);\n                        _r0 = bfloat2float(_r0_u16);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w = bfloat2float(vld1_u16(kptr));\n#if __aarch64__\n                    _sum = vfmaq_f32(_sum, _r0, _w);\n#else\n                    _sum 
= vmlaq_f32(_sum, _r0, _w);\n#endif\n\n                    kptr += 4;\n                }\n            }\n#if __aarch64__\n            sum += vaddvq_f32(_sum);\n#else\n            float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n            _ss = vpadd_f32(_ss, _ss);\n            sum += vget_lane_f32(_ss, 0);\n#endif\n#endif // __ARM_NEON\n            for (; q + 1 < inh; q += 2)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = bfloat16_to_float32(r0[0]);\n                        val1 = bfloat16_to_float32(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    sum += val0 * bfloat16_to_float32(kptr[0]);\n                    sum += val1 * bfloat16_to_float32(kptr[1]);\n\n                    kptr += 2;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val;\n                    // if (elempack == 1)\n                    {\n                        val = bfloat16_to_float32(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    sum += val * bfloat16_to_float32(kptr[0]);\n\n                    kptr += 1;\n                }\n            }\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            outptr[0] = float32_to_bfloat16(sum);\n            outptr += 1;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution1d_packed_fp16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution1d_transform_kernel_packed_fp16s(const Mat& kernel, Mat& kernel_tm, int inh, int outh, int kernel_w)\n{\n    // src = kw-inh-outh\n    // dst = pb-pa-kw-inh/pa-outh/pb\n\n    // clang-format off\n    // *INDENT-OFF*\n    if (outh >= 8)\n    {\n        if (inh >= 8)\n            kernel_tm.create(8 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 4)\n            kernel_tm.create(8 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 2)\n            kernel_tm.create(8 * 2 * kernel_w, inh / 2 + inh % 2, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else\n            kernel_tm.create(8 * kernel_w, inh, outh / 8 + (outh % 8) / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n    }\n    else if (outh >= 4)\n    {\n        if (inh >= 8)\n            kernel_tm.create(4 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 4)\n            kernel_tm.create(4 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 2)\n            kernel_tm.create(4 * 2 * kernel_w, inh / 2 + inh % 2, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n        else\n            kernel_tm.create(4 * kernel_w, inh, outh / 4 + (outh % 4) / 2 + outh % 2, (size_t)2u);\n    }\n    else if (outh >= 2)\n    {\n        if (inh >= 8)\n            kernel_tm.create(2 * 8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh / 2 + outh % 2, (size_t)2u);\n        else if (inh >= 4)\n            kernel_tm.create(2 * 4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh / 2 + outh % 2, (size_t)2u);\n        
else if (inh >= 2)\n            kernel_tm.create(2 * 2 * kernel_w, inh / 2 + inh % 2, outh / 2 + outh % 2, (size_t)2u);\n        else\n            kernel_tm.create(2 * kernel_w, inh, outh / 2 + outh % 2, (size_t)2u);\n    }\n    else\n    {\n        if (inh >= 8)\n            kernel_tm.create(8 * kernel_w, inh / 8 + (inh % 8) / 4 + (inh % 4) / 2 + inh % 2, outh, (size_t)2u);\n        else if (inh >= 4)\n            kernel_tm.create(4 * kernel_w, inh / 4 + (inh % 4) / 2 + inh % 2, outh, (size_t)2u);\n        else if (inh >= 2)\n            kernel_tm.create(2 * kernel_w, inh / 2 + inh % 2, outh, (size_t)2u);\n        else\n            kernel_tm.create(kernel_w, inh, outh, (size_t)2u);\n    }\n    // *INDENT-ON*\n    // clang-format on\n\n    int q = 0;\n    for (; q + 7 < outh; q += 8)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inh * kernel_w;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inh * kernel_w;\n        const float* kptr4 = (const float*)kernel + (q + 4) * inh * kernel_w;\n        const float* kptr5 = (const float*)kernel + (q + 5) * inh * kernel_w;\n        const float* kptr6 = (const float*)kernel + (q + 6) * inh * kernel_w;\n        const float* kptr7 = (const float*)kernel + (q + 7) * inh * kernel_w;\n\n        __fp16* g00 = kernel_tm.channel(q / 8);\n\n        int p = 0;\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p 
* kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    g00[4] = (__fp16)k4[k];\n                    g00[5] = (__fp16)k5[k];\n                    g00[6] = (__fp16)k6[k];\n                    g00[7] = (__fp16)k7[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    g00[4] = (__fp16)k4[k];\n                    g00[5] = (__fp16)k5[k];\n                    g00[6] = (__fp16)k6[k];\n                    g00[7] = (__fp16)k7[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += 
kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n                const float* k4 = kptr4 + p * kernel_w;\n                const float* k5 = kptr5 + p * kernel_w;\n                const float* k6 = kptr6 + p * kernel_w;\n                const float* k7 = kptr7 + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    g00[4] = (__fp16)k4[k];\n                    g00[5] = (__fp16)k5[k];\n                    g00[6] = (__fp16)k6[k];\n                    g00[7] = (__fp16)k7[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    k4 += kernel_w;\n                    k5 += kernel_w;\n                    k6 += kernel_w;\n                    k7 += kernel_w;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n            const float* k2 = kptr2 + p * kernel_w;\n            const float* k3 = kptr3 + p * kernel_w;\n            const float* k4 = kptr4 + p * kernel_w;\n            const float* k5 = kptr5 + p * kernel_w;\n            const float* k6 = kptr6 + p 
* kernel_w;\n            const float* k7 = kptr7 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = (__fp16)k0[k];\n                g00[1] = (__fp16)k1[k];\n                g00[2] = (__fp16)k2[k];\n                g00[3] = (__fp16)k3[k];\n                g00[4] = (__fp16)k4[k];\n                g00[5] = (__fp16)k5[k];\n                g00[6] = (__fp16)k6[k];\n                g00[7] = (__fp16)k7[k];\n                g00 += 8;\n            }\n        }\n    }\n    for (; q + 3 < outh; q += 4)\n    {\n        const float* kptr0 = (const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inh * kernel_w;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inh * kernel_w;\n\n        __fp16* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n\n        int p = 0;\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n         
       const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n                const float* k2 = kptr2 + p * kernel_w;\n                const float* k3 = kptr3 + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    k2 += kernel_w;\n                    k3 += kernel_w;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n            const float* k2 = kptr2 + p * kernel_w;\n            const float* k3 = kptr3 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = (__fp16)k0[k];\n                g00[1] = (__fp16)k1[k];\n                g00[2] = (__fp16)k2[k];\n                g00[3] = (__fp16)k3[k];\n                g00 += 4;\n            }\n        }\n    }\n    for (; q + 1 < outh; q += 2)\n    {\n        const float* kptr0 = 
(const float*)kernel + q * inh * kernel_w;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inh * kernel_w;\n\n        __fp16* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2);\n\n        int p = 0;\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w + k;\n                const float* k1 = kptr1 + p * kernel_w + k;\n\n                g00[0] = (__fp16)k0[0];\n                g00[1] = (__fp16)k0[kernel_w];\n                g00[2] = (__fp16)k0[kernel_w * 2];\n                g00[3] = (__fp16)k0[kernel_w * 3];\n                g00[4] = (__fp16)k0[kernel_w * 4];\n                g00[5] = (__fp16)k0[kernel_w * 5];\n                g00[6] = (__fp16)k0[kernel_w * 6];\n                g00[7] = (__fp16)k0[kernel_w * 7];\n                g00[8] = (__fp16)k1[0];\n                g00[9] = (__fp16)k1[kernel_w];\n                g00[10] = (__fp16)k1[kernel_w * 2];\n                g00[11] = (__fp16)k1[kernel_w * 3];\n                g00[12] = (__fp16)k1[kernel_w * 4];\n                g00[13] = (__fp16)k1[kernel_w * 5];\n                g00[14] = (__fp16)k1[kernel_w * 6];\n                g00[15] = (__fp16)k1[kernel_w * 7];\n                g00 += 16;\n            }\n        }\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w + k;\n                const float* k1 = kptr1 + p * kernel_w + k;\n\n                g00[0] = (__fp16)k0[0];\n                g00[1] = (__fp16)k0[kernel_w];\n                g00[2] = (__fp16)k0[kernel_w * 2];\n                g00[3] = (__fp16)k0[kernel_w * 3];\n                g00[4] = (__fp16)k1[0];\n                g00[5] = (__fp16)k1[kernel_w];\n                g00[6] = (__fp16)k1[kernel_w * 2];\n                g00[7] = (__fp16)k1[kernel_w * 3];\n                g00 += 8;\n            }\n        }\n     
   for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr0 + p * kernel_w;\n                const float* k1 = kptr1 + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    k0 += kernel_w;\n                    k1 += kernel_w;\n                    g00 += 2;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr0 + p * kernel_w;\n            const float* k1 = kptr1 + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = (__fp16)k0[k];\n                g00[1] = (__fp16)k1[k];\n                g00 += 2;\n            }\n        }\n    }\n    for (; q < outh; q++)\n    {\n        const float* kptr = (const float*)kernel + q * inh * kernel_w;\n\n        __fp16* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2 + q % 2);\n\n        int p = 0;\n        for (; p + 7 < inh; p += 8)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p + 3 < inh; p += 4)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n                const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p + 1 < inh; p += 2)\n        {\n            for (int k = 0; k < kernel_w; k++)\n            {\n            
    const float* k0 = kptr + p * kernel_w;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    k0 += kernel_w;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p < inh; p++)\n        {\n            const float* k0 = kptr + p * kernel_w;\n\n            for (int k = 0; k < kernel_w; k++)\n            {\n                g00[0] = (__fp16)k0[k];\n                g00++;\n            }\n        }\n    }\n}\n\nstatic void convolution1d_packed_fp16s(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int dilation_w, int stride_w, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int elempack = bottom_blob.elempack;\n    const int inh = bottom_blob.h * elempack;\n\n    const int N = bottom_blob.w * elempack;\n\n    const int outw = top_blob.w;\n    const int out_elempack = top_blob.elempack;\n    const int outh = top_blob.h * out_elempack;\n\n    const int M = top_blob.w * out_elempack;\n\n    const float* bias_data_ptr = bias_data;\n\n    int nn_outh = 0;\n    int remain_outh_start = 0;\n    nn_outh = (outh - remain_outh_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.row<__fp16>(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            
float32x4_t _sum4 = vdupq_n_f32(0.f);\n            float32x4_t _sum5 = vdupq_n_f32(0.f);\n            float32x4_t _sum6 = vdupq_n_f32(0.f);\n            float32x4_t _sum7 = vdupq_n_f32(0.f);\n\n            if (bias_data_ptr)\n            {\n                _sum0 = vld1q_f32(bias_data_ptr + p);\n                _sum1 = vld1q_f32(bias_data_ptr + p + 4);\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8);\n\n            int q = 0;\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = vcvt_f32_f16(vld1_f16(r0));\n                        _r1 = vcvt_f32_f16(vld1_f16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x8_t _r_f16 = float16x8_t();\n                        _r_f16 = vsetq_lane_f16(r0[0], _r_f16, 0);\n                        _r_f16 = vsetq_lane_f16(r0[N], _r_f16, 1);\n                        _r_f16 = vsetq_lane_f16(r0[N * 2], _r_f16, 2);\n                        _r_f16 = vsetq_lane_f16(r0[N * 3], _r_f16, 3);\n                        _r_f16 = vsetq_lane_f16(r0[N * 4], _r_f16, 4);\n                        _r_f16 = vsetq_lane_f16(r0[N * 5], _r_f16, 5);\n                        _r_f16 = vsetq_lane_f16(r0[N * 6], _r_f16, 6);\n                        _r_f16 = vsetq_lane_f16(r0[N * 7], _r_f16, 7);\n                        _r0 = vcvt_f32_f16(vget_low_f16(_r_f16));\n                        _r1 = vcvt_f32_f16(vget_high_f16(_r_f16));\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w01 = vld1q_f16(kptr);\n          
          float16x8_t _w23 = vld1q_f16(kptr + 8);\n                    float16x8_t _w45 = vld1q_f16(kptr + 16);\n                    float16x8_t _w67 = vld1q_f16(kptr + 24);\n                    float16x8_t _w89 = vld1q_f16(kptr + 32);\n                    float16x8_t _wab = vld1q_f16(kptr + 40);\n                    float16x8_t _wcd = vld1q_f16(kptr + 48);\n                    float16x8_t _wef = vld1q_f16(kptr + 56);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                    float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                    float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                    float32x4_t _w4 = vcvt_f32_f16(vget_low_f16(_w45));\n                    float32x4_t _w5 = vcvt_f32_f16(vget_high_f16(_w45));\n                    float32x4_t _w6 = vcvt_f32_f16(vget_low_f16(_w67));\n                    float32x4_t _w7 = vcvt_f32_f16(vget_high_f16(_w67));\n                    float32x4_t _w8 = vcvt_f32_f16(vget_low_f16(_w89));\n                    float32x4_t _w9 = vcvt_f32_f16(vget_high_f16(_w89));\n                    float32x4_t _wa = vcvt_f32_f16(vget_low_f16(_wab));\n                    float32x4_t _wb = vcvt_f32_f16(vget_high_f16(_wab));\n                    float32x4_t _wc = vcvt_f32_f16(vget_low_f16(_wcd));\n                    float32x4_t _wd = vcvt_f32_f16(vget_high_f16(_wcd));\n                    float32x4_t _we = vcvt_f32_f16(vget_low_f16(_wef));\n                    float32x4_t _wf = vcvt_f32_f16(vget_high_f16(_wef));\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n     
               _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w8, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w9, _r1, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _wa, _r1, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _wb, _r1, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _wc, _r1, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _wd, _r1, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _we, _r1, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _wf, _r1, 3);\n\n                    kptr += 64;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vcvt_f32_f16(vld1_f16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x4_t _r_f16 = float16x4_t();\n                        _r_f16 = vset_lane_f16(r0[0], _r_f16, 0);\n                        _r_f16 = vset_lane_f16(r0[N], _r_f16, 1);\n                        _r_f16 = vset_lane_f16(r0[N * 2], _r_f16, 2);\n                        _r_f16 = vset_lane_f16(r0[N * 3], _r_f16, 3);\n                        _r0 = vcvt_f32_f16(_r_f16);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float16x8_t _w23 = vld1q_f16(kptr + 8);\n                    float16x8_t _w45 = vld1q_f16(kptr + 16);\n                    float16x8_t _w67 = vld1q_f16(kptr + 24);\n                    float32x4_t _w0 = 
vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                    float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                    float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                    float32x4_t _w4 = vcvt_f32_f16(vget_low_f16(_w45));\n                    float32x4_t _w5 = vcvt_f32_f16(vget_high_f16(_w45));\n                    float32x4_t _w6 = vcvt_f32_f16(vget_low_f16(_w67));\n                    float32x4_t _w7 = vcvt_f32_f16(vget_high_f16(_w67));\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n\n                    kptr += 32;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = (float)(r0[0]);\n                        val1 = (float)(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float16x8_t _w23 = vld1q_f16(kptr + 8);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                    float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n     
               float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                    _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vfmaq_n_f32(_sum1, _w1, val0);\n                    _sum2 = vfmaq_n_f32(_sum2, _w2, val1);\n                    _sum3 = vfmaq_n_f32(_sum3, _w3, val1);\n\n                    kptr += 16;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = vcvt_f32_f16(vdup_n_f16(r0[0]));\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w = vld1q_f16(kptr);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                    _sum0 = vfmaq_f32(_sum0, _w0, _val);\n                    _sum1 = vfmaq_f32(_sum1, _w1, _val);\n\n                    kptr += 8;\n                }\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum2);\n            _sum1 = vaddq_f32(_sum1, _sum3);\n            _sum4 = vaddq_f32(_sum4, _sum6);\n            _sum5 = vaddq_f32(_sum5, _sum7);\n            _sum0 = vaddq_f32(_sum0, _sum4);\n            _sum1 = vaddq_f32(_sum1, _sum5);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n            _sum1 = activation_ps(_sum1, activation_type, activation_params);\n\n            if (out_elempack == 4)\n            {\n                vst1_f16(outptr, vcvt_f16_f32(_sum0));\n                vst1_f16(outptr + M, vcvt_f16_f32(_sum1));\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                float16x4_t _sum0_f16 = vcvt_f16_f32(_sum0);\n                
float16x4_t _sum1_f16 = vcvt_f16_f32(_sum1);\n                outptr[0] = vget_lane_f16(_sum0_f16, 0);\n                outptr[M] = vget_lane_f16(_sum0_f16, 1);\n                outptr[M * 2] = vget_lane_f16(_sum0_f16, 2);\n                outptr[M * 3] = vget_lane_f16(_sum0_f16, 3);\n                outptr[M * 4] = vget_lane_f16(_sum1_f16, 0);\n                outptr[M * 5] = vget_lane_f16(_sum1_f16, 1);\n                outptr[M * 6] = vget_lane_f16(_sum1_f16, 2);\n                outptr[M * 7] = vget_lane_f16(_sum1_f16, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 8;\n    nn_outh = (outh - remain_outh_start) / 4;\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.row<__fp16>(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            if (bias_data_ptr)\n            {\n                _sum0 = vld1q_f32(bias_data_ptr + p);\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n\n            int q = 0;\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = 
vcvt_f32_f16(vld1_f16(r0));\n                        _r1 = vcvt_f32_f16(vld1_f16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x8_t _r_f16 = float16x8_t();\n                        _r_f16 = vsetq_lane_f16(r0[0], _r_f16, 0);\n                        _r_f16 = vsetq_lane_f16(r0[N], _r_f16, 1);\n                        _r_f16 = vsetq_lane_f16(r0[N * 2], _r_f16, 2);\n                        _r_f16 = vsetq_lane_f16(r0[N * 3], _r_f16, 3);\n                        _r_f16 = vsetq_lane_f16(r0[N * 4], _r_f16, 4);\n                        _r_f16 = vsetq_lane_f16(r0[N * 5], _r_f16, 5);\n                        _r_f16 = vsetq_lane_f16(r0[N * 6], _r_f16, 6);\n                        _r_f16 = vsetq_lane_f16(r0[N * 7], _r_f16, 7);\n                        _r0 = vcvt_f32_f16(vget_low_f16(_r_f16));\n                        _r1 = vcvt_f32_f16(vget_high_f16(_r_f16));\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float16x8_t _w23 = vld1q_f16(kptr + 8);\n                    float16x8_t _w45 = vld1q_f16(kptr + 16);\n                    float16x8_t _w67 = vld1q_f16(kptr + 24);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                    float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                    float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                    float32x4_t _w4 = vcvt_f32_f16(vget_low_f16(_w45));\n                    float32x4_t _w5 = vcvt_f32_f16(vget_high_f16(_w45));\n                    float32x4_t _w6 = vcvt_f32_f16(vget_low_f16(_w67));\n                    float32x4_t _w7 = vcvt_f32_f16(vget_high_f16(_w67));\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    _sum1 = 
vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w4, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w5, _r1, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w6, _r1, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w7, _r1, 3);\n\n                    kptr += 32;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vcvt_f32_f16(vld1_f16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x4_t _r_f16 = float16x4_t();\n                        _r_f16 = vset_lane_f16(r0[0], _r_f16, 0);\n                        _r_f16 = vset_lane_f16(r0[N], _r_f16, 1);\n                        _r_f16 = vset_lane_f16(r0[N * 2], _r_f16, 2);\n                        _r_f16 = vset_lane_f16(r0[N * 3], _r_f16, 3);\n                        _r0 = vcvt_f32_f16(_r_f16);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float16x8_t _w23 = vld1q_f16(kptr + 8);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                    float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                    float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                    
_sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n\n                    kptr += 16;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = (float)(r0[0]);\n                        val1 = (float)(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w = vld1q_f16(kptr);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                    _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                    _sum1 = vfmaq_n_f32(_sum1, _w1, val1);\n\n                    kptr += 8;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = vcvt_f32_f16(vdup_n_f16(r0[0]));\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n                    _sum0 = vfmaq_f32(_sum0, _val, _w);\n\n                    kptr += 4;\n                }\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _sum0 = vaddq_f32(_sum0, _sum2);\n\n            _sum0 = activation_ps(_sum0, activation_type, 
activation_params);\n\n            if (out_elempack == 4)\n            {\n                vst1_f16(outptr, vcvt_f16_f32(_sum0));\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                float16x4_t _sum0_f16 = vcvt_f16_f32(_sum0);\n                outptr[0] = vget_lane_f16(_sum0_f16, 0);\n                outptr[M] = vget_lane_f16(_sum0_f16, 1);\n                outptr[M * 2] = vget_lane_f16(_sum0_f16, 2);\n                outptr[M * 3] = vget_lane_f16(_sum0_f16, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 4;\n    nn_outh = (outh - remain_outh_start) / 2;\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n\n        __fp16* outptr0 = top_blob.row<__fp16>(p);\n        __fp16* outptr1 = top_blob.row<__fp16>(p + 1);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float sum0 = 0.f;\n            float sum1 = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum0 = bias_data_ptr[p];\n                sum1 = bias_data_ptr[p + 1];\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n\n            int q = 0;\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n  
                  if (elempack == 4)\n                    {\n                        _r0 = vcvt_f32_f16(vld1_f16(r0));\n                        _r1 = vcvt_f32_f16(vld1_f16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x8_t _r01_f16 = float16x8_t();\n                        _r01_f16 = vsetq_lane_f16(r0[0], _r01_f16, 0);\n                        _r01_f16 = vsetq_lane_f16(r0[N], _r01_f16, 1);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 2], _r01_f16, 2);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 3], _r01_f16, 3);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 4], _r01_f16, 4);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 5], _r01_f16, 5);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 6], _r01_f16, 6);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 7], _r01_f16, 7);\n                        _r0 = vcvt_f32_f16(vget_low_f16(_r01_f16));\n                        _r1 = vcvt_f32_f16(vget_high_f16(_r01_f16));\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float16x8_t _w23 = vld1q_f16(kptr + 8);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                    float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                    float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n                    _sum2 = vfmaq_f32(_sum2, _r0, _w2);\n                    _sum3 = vfmaq_f32(_sum3, _r1, _w3);\n\n                    kptr += 16;\n                }\n            }\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n     
       sum0 += vaddvq_f32(_sum0);\n            sum1 += vaddvq_f32(_sum2);\n            _sum0 = vdupq_n_f32(0.f);\n            _sum1 = vdupq_n_f32(0.f);\n            for (; q + 3 < inh; q += 4)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vcvt_f32_f16(vld1_f16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x4_t _r0_f16 = float16x4_t();\n                        _r0_f16 = vset_lane_f16(r0[0], _r0_f16, 0);\n                        _r0_f16 = vset_lane_f16(r0[N], _r0_f16, 1);\n                        _r0_f16 = vset_lane_f16(r0[N * 2], _r0_f16, 2);\n                        _r0_f16 = vset_lane_f16(r0[N * 3], _r0_f16, 3);\n                        _r0 = vcvt_f32_f16(_r0_f16);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w = vld1q_f16(kptr);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r0, _w1);\n\n                    kptr += 8;\n                }\n            }\n            sum0 += vaddvq_f32(_sum0);\n            sum1 += vaddvq_f32(_sum1);\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = 
(float)(r0[0]);\n                        val1 = (float)(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val0 * (float)(kptr[0]);\n                    sum1 += val0 * (float)(kptr[1]);\n                    sum0 += val1 * (float)(kptr[2]);\n                    sum1 += val1 * (float)(kptr[3]);\n\n                    kptr += 4;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val;\n                    // if (elempack == 1)\n                    {\n                        val = (float)(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val * (float)(kptr[0]);\n                    sum1 += val * (float)(kptr[1]);\n\n                    kptr += 2;\n                }\n            }\n\n            sum0 = activation_ss(sum0, activation_type, activation_params);\n            sum1 = activation_ss(sum1, activation_type, activation_params);\n\n            outptr0[0] = (__fp16)(sum0);\n            outptr1[0] = (__fp16)(sum1);\n            outptr0 += 1;\n            outptr1 += 1;\n        }\n    }\n    remain_outh_start += nn_outh * 2;\n    for (int p = remain_outh_start; p < outh; p++)\n    {\n        __fp16* outptr = top_blob.row<__fp16>(p);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float sum = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum = bias_data_ptr[p];\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n\n            int q = 0;\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = 
bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    float32x4_t _r1;\n                    if (elempack == 4)\n                    {\n                        _r0 = vcvt_f32_f16(vld1_f16(r0));\n                        _r1 = vcvt_f32_f16(vld1_f16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x8_t _r01_f16 = float16x8_t();\n                        _r01_f16 = vsetq_lane_f16(r0[0], _r01_f16, 0);\n                        _r01_f16 = vsetq_lane_f16(r0[N], _r01_f16, 1);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 2], _r01_f16, 2);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 3], _r01_f16, 3);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 4], _r01_f16, 4);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 5], _r01_f16, 5);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 6], _r01_f16, 6);\n                        _r01_f16 = vsetq_lane_f16(r0[N * 7], _r01_f16, 7);\n                        _r0 = vcvt_f32_f16(vget_low_f16(_r01_f16));\n                        _r1 = vcvt_f32_f16(vget_high_f16(_r01_f16));\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w = vld1q_f16(kptr);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                    _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n\n                    kptr += 8;\n                }\n            }\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            sum += vaddvq_f32(_sum0);\n            float32x4_t _sum = vdupq_n_f32(0.f);\n            for (; q + 3 < inh; q += 4)\n            {\n            
    const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float32x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vcvt_f32_f16(vld1_f16(r0));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        float16x4_t _r0_f16 = float16x4_t();\n                        _r0_f16 = vset_lane_f16(r0[0], _r0_f16, 0);\n                        _r0_f16 = vset_lane_f16(r0[N], _r0_f16, 1);\n                        _r0_f16 = vset_lane_f16(r0[N * 2], _r0_f16, 2);\n                        _r0_f16 = vset_lane_f16(r0[N * 3], _r0_f16, 3);\n                        _r0 = vcvt_f32_f16(_r0_f16);\n                        r0 += dilation_w;\n                    }\n\n                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n                    _sum = vfmaq_f32(_sum, _r0, _w);\n\n                    kptr += 4;\n                }\n            }\n            sum += vaddvq_f32(_sum);\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val0;\n                    float val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = (float)(r0[0]);\n                        val1 = (float)(r0[N]);\n                        r0 += dilation_w;\n                    }\n\n                    sum += val0 * (float)(kptr[0]);\n                    sum += val1 * (float)(kptr[1]);\n\n                    kptr += 2;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n               
 for (int k = 0; k < kernel_w; k++)\n                {\n                    float val;\n                    // if (elempack == 1)\n                    {\n                        val = (float)(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    sum += val * (float)(kptr[0]);\n\n                    kptr += 1;\n                }\n            }\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            outptr[0] = (__fp16)(sum);\n            outptr += 1;\n        }\n    }\n}\n\nstatic void convolution1d_packed_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int dilation_w, int stride_w, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int elempack = bottom_blob.elempack;\n    const int inh = bottom_blob.h * elempack;\n\n    const int N = bottom_blob.w * elempack;\n\n    const int outw = top_blob.w;\n    const int out_elempack = top_blob.elempack;\n    const int outh = top_blob.h * out_elempack;\n\n    const int M = top_blob.w * out_elempack;\n\n    const __fp16* bias_data_ptr = bias_data;\n\n    int nn_outh = 0;\n    int remain_outh_start = 0;\n    nn_outh = (outh - remain_outh_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.row<__fp16>(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float16x8_t _sum0 = vdupq_n_f16(0.f);\n            float16x8_t _sum1 = vdupq_n_f16(0.f);\n            float16x8_t _sum2 = vdupq_n_f16(0.f);\n            float16x8_t _sum3 = 
vdupq_n_f16(0.f);\n\n            if (bias_data_ptr)\n            {\n                _sum0 = vld1q_f16(bias_data_ptr + p);\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8);\n\n            int q = 0;\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x8_t _r0;\n                    if (elempack == 8)\n                    {\n                        _r0 = vld1q_f16(r0);\n                        r0 += dilation_w * 8;\n                    }\n                    else if (elempack == 4)\n                    {\n                        _r0 = vcombine_f16(vld1_f16(r0), vld1_f16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x8_t();\n                        _r0 = vsetq_lane_f16(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f16(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f16(r0[N * 3], _r0, 3);\n                        _r0 = vsetq_lane_f16(r0[N * 4], _r0, 4);\n                        _r0 = vsetq_lane_f16(r0[N * 5], _r0, 5);\n                        _r0 = vsetq_lane_f16(r0[N * 6], _r0, 6);\n                        _r0 = vsetq_lane_f16(r0[N * 7], _r0, 7);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w0 = vld1q_f16(kptr);\n                    float16x8_t _w1 = vld1q_f16(kptr + 8);\n                    float16x8_t _w2 = vld1q_f16(kptr + 8 * 2);\n                    float16x8_t _w3 = vld1q_f16(kptr + 8 * 3);\n                    float16x8_t _w4 = vld1q_f16(kptr + 8 * 4);\n                    float16x8_t _w5 = vld1q_f16(kptr + 8 * 5);\n                    
float16x8_t _w6 = vld1q_f16(kptr + 8 * 6);\n                    float16x8_t _w7 = vld1q_f16(kptr + 8 * 7);\n                    _sum0 = vfmaq_laneq_f16(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_laneq_f16(_sum1, _w1, _r0, 1);\n                    _sum2 = vfmaq_laneq_f16(_sum2, _w2, _r0, 2);\n                    _sum3 = vfmaq_laneq_f16(_sum3, _w3, _r0, 3);\n                    _sum0 = vfmaq_laneq_f16(_sum0, _w4, _r0, 4);\n                    _sum1 = vfmaq_laneq_f16(_sum1, _w5, _r0, 5);\n                    _sum2 = vfmaq_laneq_f16(_sum2, _w6, _r0, 6);\n                    _sum3 = vfmaq_laneq_f16(_sum3, _w7, _r0, 7);\n\n                    kptr += 64;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x4_t();\n                        _r0 = vset_lane_f16(r0[0], _r0, 0);\n                        _r0 = vset_lane_f16(r0[N], _r0, 1);\n                        _r0 = vset_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vset_lane_f16(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w0 = vld1q_f16(kptr);\n                    float16x8_t _w1 = vld1q_f16(kptr + 8);\n                    float16x8_t _w2 = vld1q_f16(kptr + 8 * 2);\n                    float16x8_t _w3 = vld1q_f16(kptr + 8 * 3);\n                    _sum0 = vfmaq_lane_f16(_sum0, _w0, _r0, 0);\n                    _sum1 = vfmaq_lane_f16(_sum1, _w1, _r0, 1);\n                    _sum2 = 
vfmaq_lane_f16(_sum2, _w2, _r0, 2);\n                    _sum3 = vfmaq_lane_f16(_sum3, _w3, _r0, 3);\n\n                    kptr += 32;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    __fp16 val0;\n                    __fp16 val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w0 = vld1q_f16(kptr);\n                    float16x8_t _w1 = vld1q_f16(kptr + 8);\n                    _sum0 = vfmaq_n_f16(_sum0, _w0, val0);\n                    _sum1 = vfmaq_n_f16(_sum1, _w1, val1);\n\n                    kptr += 16;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x8_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = vdupq_n_f16(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w0 = vld1q_f16(kptr);\n                    _sum0 = vfmaq_f16(_sum0, _w0, _val);\n\n                    kptr += 8;\n                }\n            }\n\n            _sum0 = vaddq_f16(_sum0, _sum1);\n            _sum2 = vaddq_f16(_sum2, _sum3);\n            _sum0 = vaddq_f16(_sum0, _sum2);\n\n            _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n\n            if (out_elempack == 8)\n            {\n                vst1q_f16(outptr, _sum0);\n                outptr += 8;\n            }\n            else if (out_elempack == 4)\n         
   {\n                vst1_f16(outptr, vget_low_f16(_sum0));\n                vst1_f16(outptr + M, vget_high_f16(_sum0));\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                outptr[0] = vgetq_lane_f16(_sum0, 0);\n                outptr[M] = vgetq_lane_f16(_sum0, 1);\n                outptr[M * 2] = vgetq_lane_f16(_sum0, 2);\n                outptr[M * 3] = vgetq_lane_f16(_sum0, 3);\n                outptr[M * 4] = vgetq_lane_f16(_sum0, 4);\n                outptr[M * 5] = vgetq_lane_f16(_sum0, 5);\n                outptr[M * 6] = vgetq_lane_f16(_sum0, 6);\n                outptr[M * 7] = vgetq_lane_f16(_sum0, 7);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 8;\n    nn_outh = (outh - remain_outh_start) / 4;\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.row<__fp16>(p / out_elempack);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float16x4_t _sum0 = vdup_n_f16(0.f);\n            float16x4_t _sum1 = vdup_n_f16(0.f);\n            float16x4_t _sum2 = vdup_n_f16(0.f);\n            float16x4_t _sum3 = vdup_n_f16(0.f);\n\n            if (bias_data_ptr)\n            {\n                _sum0 = vld1_f16(bias_data_ptr + p);\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n\n            int q = 0;\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    
float16x4_t _r0;\n                    float16x4_t _r1;\n                    if (elempack == 8)\n                    {\n                        float16x8_t _r01 = vld1q_f16(r0);\n                        _r0 = vget_low_f16(_r01);\n                        _r1 = vget_high_f16(_r01);\n                        r0 += dilation_w * 8;\n                    }\n                    else if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        _r1 = vld1_f16(r0 + N);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x4_t();\n                        _r1 = float16x4_t();\n                        _r0 = vset_lane_f16(r0[0], _r0, 0);\n                        _r0 = vset_lane_f16(r0[N], _r0, 1);\n                        _r0 = vset_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vset_lane_f16(r0[N * 3], _r0, 3);\n                        _r1 = vset_lane_f16(r0[N * 4], _r1, 0);\n                        _r1 = vset_lane_f16(r0[N * 5], _r1, 1);\n                        _r1 = vset_lane_f16(r0[N * 6], _r1, 2);\n                        _r1 = vset_lane_f16(r0[N * 7], _r1, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float16x4_t _w0 = vld1_f16(kptr);\n                    float16x4_t _w1 = vld1_f16(kptr + 4);\n                    float16x4_t _w2 = vld1_f16(kptr + 8);\n                    float16x4_t _w3 = vld1_f16(kptr + 12);\n                    float16x4_t _w4 = vld1_f16(kptr + 16);\n                    float16x4_t _w5 = vld1_f16(kptr + 20);\n                    float16x4_t _w6 = vld1_f16(kptr + 24);\n                    float16x4_t _w7 = vld1_f16(kptr + 28);\n                    _sum0 = vfma_lane_f16(_sum0, _w0, _r0, 0);\n                    _sum1 = vfma_lane_f16(_sum1, _w1, _r0, 1);\n                    _sum2 = vfma_lane_f16(_sum2, _w2, _r0, 2);\n         
           _sum3 = vfma_lane_f16(_sum3, _w3, _r0, 3);\n                    _sum0 = vfma_lane_f16(_sum0, _w4, _r1, 0);\n                    _sum1 = vfma_lane_f16(_sum1, _w5, _r1, 1);\n                    _sum2 = vfma_lane_f16(_sum2, _w6, _r1, 2);\n                    _sum3 = vfma_lane_f16(_sum3, _w7, _r1, 3);\n\n                    kptr += 32;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x4_t();\n                        _r0 = vset_lane_f16(r0[0], _r0, 0);\n                        _r0 = vset_lane_f16(r0[N], _r0, 1);\n                        _r0 = vset_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vset_lane_f16(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float16x4_t _w0 = vld1_f16(kptr);\n                    float16x4_t _w1 = vld1_f16(kptr + 4);\n                    float16x4_t _w2 = vld1_f16(kptr + 8);\n                    float16x4_t _w3 = vld1_f16(kptr + 12);\n                    _sum0 = vfma_lane_f16(_sum0, _w0, _r0, 0);\n                    _sum1 = vfma_lane_f16(_sum1, _w1, _r0, 1);\n                    _sum2 = vfma_lane_f16(_sum2, _w2, _r0, 2);\n                    _sum3 = vfma_lane_f16(_sum3, _w3, _r0, 3);\n\n                    kptr += 16;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 
0; k < kernel_w; k++)\n                {\n                    __fp16 val0;\n                    __fp16 val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    float16x4_t _w0 = vld1_f16(kptr);\n                    float16x4_t _w1 = vld1_f16(kptr + 4);\n                    _sum0 = vfma_n_f16(_sum0, _w0, val0);\n                    _sum1 = vfma_n_f16(_sum1, _w1, val1);\n\n                    kptr += 8;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x4_t _val;\n                    // if (elempack == 1)\n                    {\n                        _val = vdup_n_f16(r0[0]);\n                        r0 += dilation_w;\n                    }\n\n                    float16x4_t _w = vld1_f16(kptr);\n                    _sum0 = vfma_f16(_sum0, _val, _w);\n\n                    kptr += 4;\n                }\n            }\n\n            _sum0 = vadd_f16(_sum0, _sum1);\n            _sum2 = vadd_f16(_sum2, _sum3);\n            _sum0 = vadd_f16(_sum0, _sum2);\n\n            _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n\n            if (out_elempack == 4)\n            {\n                vst1_f16(outptr, _sum0);\n                outptr += 4;\n            }\n            else // if (out_elempack == 1)\n            {\n                outptr[0] = vget_lane_f16(_sum0, 0);\n                outptr[M] = vget_lane_f16(_sum0, 1);\n                outptr[M * 2] = vget_lane_f16(_sum0, 2);\n                outptr[M * 3] = vget_lane_f16(_sum0, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outh_start += nn_outh * 4;\n    nn_outh = (outh - 
remain_outh_start) / 2;\n    for (int pp = 0; pp < nn_outh; pp++)\n    {\n        const int p = remain_outh_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inh = bottom_blob.h * elempack;\n        const int outw = top_blob.w;\n\n        __fp16* outptr0 = top_blob.row<__fp16>(p);\n        __fp16* outptr1 = top_blob.row<__fp16>(p + 1);\n\n        for (int j = 0; j < outw; j++)\n        {\n            __fp16 sum0 = 0.f;\n            __fp16 sum1 = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum0 = bias_data_ptr[p];\n                sum1 = bias_data_ptr[p + 1];\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n\n            int q = 0;\n            float16x8_t _sum0 = vdupq_n_f16(0.f);\n            float16x8_t _sum1 = vdupq_n_f16(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x8_t _r0;\n                    if (elempack == 8)\n                    {\n                        _r0 = vld1q_f16(r0);\n                        r0 += dilation_w * 8;\n                    }\n                    else if (elempack == 4)\n                    {\n                        _r0 = vcombine_f16(vld1_f16(r0), vld1_f16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x8_t();\n                        _r0 = vsetq_lane_f16(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f16(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f16(r0[N * 3], _r0, 3);\n                        _r0 = 
vsetq_lane_f16(r0[N * 4], _r0, 4);\n                        _r0 = vsetq_lane_f16(r0[N * 5], _r0, 5);\n                        _r0 = vsetq_lane_f16(r0[N * 6], _r0, 6);\n                        _r0 = vsetq_lane_f16(r0[N * 7], _r0, 7);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w0 = vld1q_f16(kptr);\n                    float16x8_t _w1 = vld1q_f16(kptr + 8);\n                    _sum0 = vfmaq_f16(_sum0, _r0, _w0);\n                    _sum1 = vfmaq_f16(_sum1, _r0, _w1);\n\n                    kptr += 16;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x4_t();\n                        _r0 = vset_lane_f16(r0[0], _r0, 0);\n                        _r0 = vset_lane_f16(r0[N], _r0, 1);\n                        _r0 = vset_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vset_lane_f16(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float16x4_t _w0 = vld1_f16(kptr);\n                    float16x4_t _w1 = vld1_f16(kptr + 4);\n                    _sum0 = vcombine_f16(vfma_f16(vget_low_f16(_sum0), _r0, _w0), vget_high_f16(_sum0));\n                    _sum1 = vcombine_f16(vfma_f16(vget_low_f16(_sum1), _r0, _w1), vget_high_f16(_sum1));\n\n                    kptr += 8;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const 
__fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    __fp16 val0;\n                    __fp16 val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val0 * kptr[0];\n                    sum1 += val0 * kptr[1];\n                    sum0 += val1 * kptr[2];\n                    sum1 += val1 * kptr[3];\n\n                    kptr += 4;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    __fp16 val;\n                    // if (elempack == 1)\n                    {\n                        val = r0[0];\n                        r0 += dilation_w;\n                    }\n\n                    sum0 += val * kptr[0];\n                    sum1 += val * kptr[1];\n\n                    kptr += 2;\n                }\n            }\n\n            float16x4_t _ss0 = vadd_f16(vget_low_f16(_sum0), vget_high_f16(_sum0));\n            float16x4_t _ss1 = vadd_f16(vget_low_f16(_sum1), vget_high_f16(_sum1));\n            float16x4_t _ss = vpadd_f16(_ss0, _ss1);\n            _ss = vpadd_f16(_ss, _ss);\n            sum0 += vget_lane_f16(_ss, 0);\n            sum1 += vget_lane_f16(_ss, 1);\n\n            sum0 = activation_ss_f16(sum0, activation_type, activation_params);\n            sum1 = activation_ss_f16(sum1, activation_type, activation_params);\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n            outptr0 += 1;\n            outptr1 += 1;\n        }\n    }\n    remain_outh_start += nn_outh * 2;\n    for (int p = remain_outh_start; p < outh; p++)\n    {\n        __fp16* outptr = 
top_blob.row<__fp16>(p);\n\n        for (int j = 0; j < outw; j++)\n        {\n            __fp16 sum = 0.f;\n\n            if (bias_data_ptr)\n            {\n                sum = bias_data_ptr[p];\n            }\n\n            const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n\n            int q = 0;\n            float16x8_t _sum = vdupq_n_f16(0.f);\n            for (; q + 7 < inh; q += 8)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x8_t _r0;\n                    if (elempack == 8)\n                    {\n                        _r0 = vld1q_f16(r0);\n                        r0 += dilation_w * 8;\n                    }\n                    else if (elempack == 4)\n                    {\n                        _r0 = vcombine_f16(vld1_f16(r0), vld1_f16(r0 + N));\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x8_t();\n                        _r0 = vsetq_lane_f16(r0[0], _r0, 0);\n                        _r0 = vsetq_lane_f16(r0[N], _r0, 1);\n                        _r0 = vsetq_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vsetq_lane_f16(r0[N * 3], _r0, 3);\n                        _r0 = vsetq_lane_f16(r0[N * 4], _r0, 4);\n                        _r0 = vsetq_lane_f16(r0[N * 5], _r0, 5);\n                        _r0 = vsetq_lane_f16(r0[N * 6], _r0, 6);\n                        _r0 = vsetq_lane_f16(r0[N * 7], _r0, 7);\n                        r0 += dilation_w;\n                    }\n\n                    float16x8_t _w0 = vld1q_f16(kptr);\n                    _sum = vfmaq_f16(_sum, _r0, _w0);\n\n                    kptr += 8;\n                }\n            }\n            for (; q + 3 < inh; q += 4)\n            
{\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q / elempack) + j * stride_w * elempack;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float16x4_t _r0;\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        r0 += dilation_w * 4;\n                    }\n                    else // if (elempack == 1)\n                    {\n                        _r0 = float16x4_t();\n                        _r0 = vset_lane_f16(r0[0], _r0, 0);\n                        _r0 = vset_lane_f16(r0[N], _r0, 1);\n                        _r0 = vset_lane_f16(r0[N * 2], _r0, 2);\n                        _r0 = vset_lane_f16(r0[N * 3], _r0, 3);\n                        r0 += dilation_w;\n                    }\n\n                    float16x4_t _w = vld1_f16(kptr);\n                    _sum = vcombine_f16(vfma_f16(vget_low_f16(_sum), _r0, _w), vget_high_f16(_sum));\n\n                    kptr += 4;\n                }\n            }\n            for (; q + 1 < inh; q += 2)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    __fp16 val0;\n                    __fp16 val1;\n                    // if (elempack == 1)\n                    {\n                        val0 = r0[0];\n                        val1 = r0[N];\n                        r0 += dilation_w;\n                    }\n\n                    sum += val0 * kptr[0];\n                    sum += val1 * kptr[1];\n\n                    kptr += 2;\n                }\n            }\n            for (; q < inh; q++)\n            {\n                const __fp16* r0 = bottom_blob.row<const __fp16>(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    __fp16 val;\n                    // if (elempack == 1)\n       
             {\n                        val = r0[0];\n                        r0 += dilation_w;\n                    }\n\n                    sum += val * kptr[0];\n\n                    kptr += 1;\n                }\n            }\n\n            float16x4_t _ss = vadd_f16(vget_low_f16(_sum), vget_high_f16(_sum));\n            _ss = vpadd_f16(_ss, _ss);\n            _ss = vpadd_f16(_ss, _ss);\n            sum += vget_lane_f16(_ss, 0);\n\n            sum = activation_ss_f16(sum, activation_type, activation_params);\n\n            outptr[0] = sum;\n            outptr += 1;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_1x1.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n\n#if __ARM_NEON && __aarch64__\n\n    nn_outch = outch >> 3;\n    remain_outch_start = nn_outch << 3;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 8;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n        Mat out2 = top_blob.channel(p + 2);\n        Mat out3 = top_blob.channel(p + 3);\n        Mat out4 = top_blob.channel(p + 4);\n        Mat out5 = top_blob.channel(p + 5);\n        Mat out6 = top_blob.channel(p + 6);\n        Mat out7 = top_blob.channel(p + 7);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float bias1 = bias ? bias[p + 1] : 0.f;\n        const float bias2 = bias ? bias[p + 2] : 0.f;\n        const float bias3 = bias ? bias[p + 3] : 0.f;\n        const float bias4 = bias ? bias[p + 4] : 0.f;\n        const float bias5 = bias ? bias[p + 5] : 0.f;\n        const float bias6 = bias ? bias[p + 6] : 0.f;\n        const float bias7 = bias ? 
bias[p + 7] : 0.f;\n\n        out0.fill(bias0);\n        out1.fill(bias1);\n        out2.fill(bias2);\n        out3.fill(bias3);\n        out4.fill(bias4);\n        out5.fill(bias5);\n        out6.fill(bias6);\n        out7.fill(bias7);\n\n        int q = 0;\n\n        for (; q + 7 < inch; q += 8)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n            float* outptr4 = out4;\n            float* outptr5 = out5;\n            float* outptr6 = out6;\n            float* outptr7 = out7;\n\n            const float* img0 = bottom_blob.channel(q);\n            const float* img1 = bottom_blob.channel(q + 1);\n            const float* img2 = bottom_blob.channel(q + 2);\n            const float* img3 = bottom_blob.channel(q + 3);\n            const float* img4 = bottom_blob.channel(q + 4);\n            const float* img5 = bottom_blob.channel(q + 5);\n            const float* img6 = bottom_blob.channel(q + 6);\n            const float* img7 = bottom_blob.channel(q + 7);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n            const float* kernel4 = kernel + (p + 4) * inch + q;\n            const float* kernel5 = kernel + (p + 5) * inch + q;\n            const float* kernel6 = kernel + (p + 6) * inch + q;\n            const float* kernel7 = kernel + (p + 7) * inch + q;\n\n            const float* r0 = img0;\n            const float* r1 = img1;\n            const float* r2 = img2;\n            const float* r3 = img3;\n            const float* r4 = img4;\n            const float* r5 = img5;\n            const float* r6 = img6;\n            const float* r7 = img7;\n\n            int size = outw * outh;\n\n            int nn = size >> 2;\n            int remain = 
size & 3;\n\n            float32x4_t _k0 = vld1q_f32(kernel0);\n            float32x4_t _k1 = vld1q_f32(kernel1);\n            float32x4_t _k2 = vld1q_f32(kernel2);\n            float32x4_t _k3 = vld1q_f32(kernel3);\n            float32x4_t _k4 = vld1q_f32(kernel4);\n            float32x4_t _k5 = vld1q_f32(kernel5);\n            float32x4_t _k6 = vld1q_f32(kernel6);\n            float32x4_t _k7 = vld1q_f32(kernel7);\n\n            float32x4_t _k0n = vld1q_f32(kernel0 + 4);\n            float32x4_t _k1n = vld1q_f32(kernel1 + 4);\n            float32x4_t _k2n = vld1q_f32(kernel2 + 4);\n            float32x4_t _k3n = vld1q_f32(kernel3 + 4);\n            float32x4_t _k4n = vld1q_f32(kernel4 + 4);\n            float32x4_t _k5n = vld1q_f32(kernel5 + 4);\n            float32x4_t _k6n = vld1q_f32(kernel6 + 4);\n            float32x4_t _k7n = vld1q_f32(kernel7 + 4);\n\n#ifdef __clang__\n            // gcc rejects more than 30 operands :(\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%9, #128]       \\n\"\n                    \"ld1    {v17.4s}, [%9], #16         \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v18.4s}, [%1]              \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v19.4s}, [%2]              \\n\"\n\n                    \"0:                                 \\n\"\n\n                    \"fmla   v18.4s, v17.4s, %34.s[0]    \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v20.4s}, [%3]              \\n\"\n\n                    \"fmla   v19.4s, v17.4s, %35.s[0]    \\n\"\n\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v21.4s}, [%4]              \\n\"\n\n                    \"fmla   v20.4s, v17.4s, %36.s[0]    \\n\"\n\n                    \"prfm   pldl1keep, [%5, #128]       \\n\"\n              
      \"ld1    {v22.4s}, [%5]              \\n\"\n\n                    \"fmla   v21.4s, v17.4s, %37.s[0]    \\n\"\n\n                    \"prfm   pldl1keep, [%6, #128]       \\n\"\n                    \"ld1    {v23.4s}, [%6]              \\n\"\n\n                    \"fmla   v22.4s, v17.4s, %38.s[0]    \\n\"\n\n                    \"prfm   pldl1keep, [%10, #128]      \\n\"\n                    \"ld1    {v16.4s}, [%10], #16        \\n\"\n\n                    \"fmla   v23.4s, v17.4s, %39.s[0]    \\n\"\n\n                    \"prfm   pldl1keep, [%7, #128]       \\n\"\n                    \"ld1    {v24.4s}, [%7]              \\n\"\n\n                    \"fmla   v18.4s, v16.4s, %34.s[1]    \\n\"\n                    \"fmla   v19.4s, v16.4s, %35.s[1]    \\n\"\n\n                    \"prfm   pldl1keep, [%8, #128]       \\n\"\n                    \"ld1    {v25.4s}, [%8]              \\n\"\n\n                    \"fmla   v24.4s, v17.4s, %40.s[0]    \\n\"\n                    \"fmla   v25.4s, v17.4s, %41.s[0]    \\n\"\n\n                    \"fmla   v20.4s, v16.4s, %36.s[1]    \\n\"\n                    \"fmla   v21.4s, v16.4s, %37.s[1]    \\n\"\n\n                    \"prfm   pldl1keep, [%11, #128]      \\n\"\n                    \"ld1    {v17.4s}, [%11], #16        \\n\"\n\n                    \"fmla   v22.4s, v16.4s, %38.s[1]    \\n\"\n                    \"fmla   v23.4s, v16.4s, %39.s[1]    \\n\"\n\n                    \"fmla   v18.4s, v17.4s, %34.s[2]    \\n\"\n                    \"fmla   v19.4s, v17.4s, %35.s[2]    \\n\"\n\n                    \"fmla   v24.4s, v16.4s, %40.s[1]    \\n\"\n                    \"fmla   v25.4s, v16.4s, %41.s[1]    \\n\"\n\n                    \"fmla   v20.4s, v17.4s, %36.s[2]    \\n\"\n                    \"fmla   v21.4s, v17.4s, %37.s[2]    \\n\"\n\n                    \"prfm   pldl1keep, [%12, #128]      \\n\"\n                    \"ld1    {v16.4s}, [%12], #16        \\n\"\n\n                    \"fmla   v22.4s, v17.4s, %38.s[2]    
\\n\"\n                    \"fmla   v23.4s, v17.4s, %39.s[2]    \\n\"\n\n                    \"fmla   v18.4s, v16.4s, %34.s[3]    \\n\"\n                    \"fmla   v19.4s, v16.4s, %35.s[3]    \\n\"\n\n                    \"fmla   v24.4s, v17.4s, %40.s[2]    \\n\"\n                    \"fmla   v25.4s, v17.4s, %41.s[2]    \\n\"\n\n                    \"fmla   v20.4s, v16.4s, %36.s[3]    \\n\"\n                    \"fmla   v21.4s, v16.4s, %37.s[3]    \\n\"\n\n                    \"prfm   pldl1keep, [%13, #128]      \\n\"\n                    \"ld1    {v17.4s}, [%13], #16        \\n\"\n\n                    \"fmla   v22.4s, v16.4s, %38.s[3]    \\n\"\n                    \"fmla   v23.4s, v16.4s, %39.s[3]    \\n\"\n\n                    \"fmla   v18.4s, v17.4s, %42.s[0]    \\n\"\n                    \"fmla   v19.4s, v17.4s, %43.s[0]    \\n\"\n\n                    \"fmla   v24.4s, v16.4s, %40.s[3]    \\n\"\n                    \"fmla   v25.4s, v16.4s, %41.s[3]    \\n\"\n\n                    \"fmla   v20.4s, v17.4s, %44.s[0]    \\n\"\n                    \"fmla   v21.4s, v17.4s, %45.s[0]    \\n\"\n\n                    \"prfm   pldl1keep, [%14, #128]      \\n\"\n                    \"ld1    {v16.4s}, [%14], #16        \\n\"\n\n                    \"fmla   v22.4s, v17.4s, %46.s[0]    \\n\"\n                    \"fmla   v23.4s, v17.4s, %47.s[0]    \\n\"\n\n                    \"fmla   v18.4s, v16.4s, %42.s[1]    \\n\"\n                    \"fmla   v19.4s, v16.4s, %43.s[1]    \\n\"\n\n                    \"fmla   v24.4s, v17.4s, %48.s[0]    \\n\"\n                    \"fmla   v25.4s, v17.4s, %49.s[0]    \\n\"\n\n                    \"fmla   v20.4s, v16.4s, %44.s[1]    \\n\"\n                    \"fmla   v21.4s, v16.4s, %45.s[1]    \\n\"\n\n                    \"prfm   pldl1keep, [%15, #128]      \\n\"\n                    \"ld1    {v17.4s}, [%15], #16        \\n\"\n\n                    \"fmla   v22.4s, v16.4s, %46.s[1]    \\n\"\n                    \"fmla   v23.4s, 
v16.4s, %47.s[1]    \\n\"\n\n                    \"fmla   v18.4s, v17.4s, %42.s[2]    \\n\"\n                    \"fmla   v19.4s, v17.4s, %43.s[2]    \\n\"\n\n                    \"fmla   v24.4s, v16.4s, %48.s[1]    \\n\"\n                    \"fmla   v25.4s, v16.4s, %49.s[1]    \\n\"\n\n                    \"fmla   v20.4s, v17.4s, %44.s[2]    \\n\"\n                    \"fmla   v21.4s, v17.4s, %45.s[2]    \\n\"\n\n                    \"prfm   pldl1keep, [%16, #128]      \\n\"\n                    \"ld1    {v16.4s}, [%16], #16        \\n\"\n\n                    \"fmla   v22.4s, v17.4s, %46.s[2]    \\n\"\n                    \"fmla   v23.4s, v17.4s, %47.s[2]    \\n\"\n\n                    \"fmla   v18.4s, v16.4s, %42.s[3]    \\n\"\n                    \"fmla   v19.4s, v16.4s, %43.s[3]    \\n\"\n\n                    \"fmla   v24.4s, v17.4s, %48.s[2]    \\n\"\n                    \"fmla   v25.4s, v17.4s, %49.s[2]    \\n\"\n\n                    \"fmla   v20.4s, v16.4s, %44.s[3]    \\n\"\n                    \"fmla   v21.4s, v16.4s, %45.s[3]    \\n\"\n\n                    \"st1    {v18.4s}, [%1], #16         \\n\"\n\n                    \"fmla   v22.4s, v16.4s, %46.s[3]    \\n\"\n\n                    \"st1    {v19.4s}, [%2], #16         \\n\"\n\n                    \"fmla   v23.4s, v16.4s, %47.s[3]    \\n\"\n\n                    \"st1    {v20.4s}, [%3], #16         \\n\"\n\n                    \"prfm   pldl1keep, [%9, #128]       \\n\"\n                    \"ld1    {v17.4s}, [%9], #16         \\n\"\n\n                    \"fmla   v24.4s, v16.4s, %48.s[3]    \\n\"\n\n                    \"st1    {v21.4s}, [%4], #16         \\n\"\n\n                    \"fmla   v25.4s, v16.4s, %49.s[3]    \\n\"\n\n                    \"st1    {v22.4s}, [%5], #16         \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v18.4s}, [%1]              \\n\"\n\n                    \"st1    {v23.4s}, [%6], #16         \\n\"\n\n           
         \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v19.4s}, [%2]              \\n\"\n\n                    \"st1    {v24.4s}, [%7], #16         \\n\"\n\n                    \"subs   %w0, %w0, #1                \\n\"\n\n                    \"st1    {v25.4s}, [%8], #16         \\n\"\n\n                    \"bne    0b                          \\n\"\n                    \"sub    %9, %9, #16                 \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(outptr1), // %2\n                    \"=r\"(outptr2), // %3\n                    \"=r\"(outptr3), // %4\n                    \"=r\"(outptr4), // %5\n                    \"=r\"(outptr5), // %6\n                    \"=r\"(outptr6), // %7\n                    \"=r\"(outptr7), // %8\n                    \"=r\"(r0),      // %9\n                    \"=r\"(r1),      // %10\n                    \"=r\"(r2),      // %11\n                    \"=r\"(r3),      // %12\n                    \"=r\"(r4),      // %13\n                    \"=r\"(r5),      // %14\n                    \"=r\"(r6),      // %15\n                    \"=r\"(r7)       // %16\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr1),\n                    \"3\"(outptr2),\n                    \"4\"(outptr3),\n                    \"5\"(outptr4),\n                    \"6\"(outptr5),\n                    \"7\"(outptr6),\n                    \"8\"(outptr7),\n                    \"9\"(r0),\n                    \"10\"(r1),\n                    \"11\"(r2),\n                    \"12\"(r3),\n                    \"13\"(r4),\n                    \"14\"(r5),\n                    \"15\"(r6),\n                    \"16\"(r7),\n                    \"w\"(_k0),                                                                            // %34\n                    \"w\"(_k1),                                                           
                 // %35\n                    \"w\"(_k2),                                                                            // %36\n                    \"w\"(_k3),                                                                            // %37\n                    \"w\"(_k4),                                                                            // %38\n                    \"w\"(_k5),                                                                            // %39\n                    \"w\"(_k6),                                                                            // %40\n                    \"w\"(_k7),                                                                            // %41\n                    \"w\"(_k0n),                                                                           // %42\n                    \"w\"(_k1n),                                                                           // %43\n                    \"w\"(_k2n),                                                                           // %44\n                    \"w\"(_k3n),                                                                           // %45\n                    \"w\"(_k4n),                                                                           // %46\n                    \"w\"(_k5n),                                                                           // %47\n                    \"w\"(_k6n),                                                                           // %48\n                    \"w\"(_k7n)                                                                            // %49\n                    : \"cc\", \"memory\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\" //, \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\"\n                );\n            }\n#else\n            for (; nn > 0; nn--)\n            {\n                float32x4_t _p = vld1q_f32(r0);\n\n                float32x4_t 
_out0p = vld1q_f32(outptr0);\n                float32x4_t _out1p = vld1q_f32(outptr1);\n                float32x4_t _out2p = vld1q_f32(outptr2);\n                float32x4_t _out3p = vld1q_f32(outptr3);\n                float32x4_t _out4p = vld1q_f32(outptr4);\n                float32x4_t _out5p = vld1q_f32(outptr5);\n                float32x4_t _out6p = vld1q_f32(outptr6);\n                float32x4_t _out7p = vld1q_f32(outptr7);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p, _k0, 0);\n                _out1p = vfmaq_laneq_f32(_out1p, _p, _k1, 0);\n                _out2p = vfmaq_laneq_f32(_out2p, _p, _k2, 0);\n                _out3p = vfmaq_laneq_f32(_out3p, _p, _k3, 0);\n                _out4p = vfmaq_laneq_f32(_out4p, _p, _k4, 0);\n                _out5p = vfmaq_laneq_f32(_out5p, _p, _k5, 0);\n                _out6p = vfmaq_laneq_f32(_out6p, _p, _k6, 0);\n                _out7p = vfmaq_laneq_f32(_out7p, _p, _k7, 0);\n\n                float32x4_t _p1 = vld1q_f32(r1);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p1, _k0, 1);\n                _out1p = vfmaq_laneq_f32(_out1p, _p1, _k1, 1);\n                _out2p = vfmaq_laneq_f32(_out2p, _p1, _k2, 1);\n                _out3p = vfmaq_laneq_f32(_out3p, _p1, _k3, 1);\n                _out4p = vfmaq_laneq_f32(_out4p, _p1, _k4, 1);\n                _out5p = vfmaq_laneq_f32(_out5p, _p1, _k5, 1);\n                _out6p = vfmaq_laneq_f32(_out6p, _p1, _k6, 1);\n                _out7p = vfmaq_laneq_f32(_out7p, _p1, _k7, 1);\n\n                float32x4_t _p2 = vld1q_f32(r2);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p2, _k0, 2);\n                _out1p = vfmaq_laneq_f32(_out1p, _p2, _k1, 2);\n                _out2p = vfmaq_laneq_f32(_out2p, _p2, _k2, 2);\n                _out3p = vfmaq_laneq_f32(_out3p, _p2, _k3, 2);\n                _out4p = vfmaq_laneq_f32(_out4p, _p2, _k4, 2);\n                _out5p = vfmaq_laneq_f32(_out5p, _p2, _k5, 2);\n                _out6p = vfmaq_laneq_f32(_out6p, _p2, 
_k6, 2);\n                _out7p = vfmaq_laneq_f32(_out7p, _p2, _k7, 2);\n\n                float32x4_t _p3 = vld1q_f32(r3);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p3, _k0, 3);\n                _out1p = vfmaq_laneq_f32(_out1p, _p3, _k1, 3);\n                _out2p = vfmaq_laneq_f32(_out2p, _p3, _k2, 3);\n                _out3p = vfmaq_laneq_f32(_out3p, _p3, _k3, 3);\n                _out4p = vfmaq_laneq_f32(_out4p, _p3, _k4, 3);\n                _out5p = vfmaq_laneq_f32(_out5p, _p3, _k5, 3);\n                _out6p = vfmaq_laneq_f32(_out6p, _p3, _k6, 3);\n                _out7p = vfmaq_laneq_f32(_out7p, _p3, _k7, 3);\n\n                float32x4_t _p4 = vld1q_f32(r4);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p4, _k0n, 0);\n                _out1p = vfmaq_laneq_f32(_out1p, _p4, _k1n, 0);\n                _out2p = vfmaq_laneq_f32(_out2p, _p4, _k2n, 0);\n                _out3p = vfmaq_laneq_f32(_out3p, _p4, _k3n, 0);\n                _out4p = vfmaq_laneq_f32(_out4p, _p4, _k4n, 0);\n                _out5p = vfmaq_laneq_f32(_out5p, _p4, _k5n, 0);\n                _out6p = vfmaq_laneq_f32(_out6p, _p4, _k6n, 0);\n                _out7p = vfmaq_laneq_f32(_out7p, _p4, _k7n, 0);\n\n                float32x4_t _p5 = vld1q_f32(r5);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p5, _k0n, 1);\n                _out1p = vfmaq_laneq_f32(_out1p, _p5, _k1n, 1);\n                _out2p = vfmaq_laneq_f32(_out2p, _p5, _k2n, 1);\n                _out3p = vfmaq_laneq_f32(_out3p, _p5, _k3n, 1);\n                _out4p = vfmaq_laneq_f32(_out4p, _p5, _k4n, 1);\n                _out5p = vfmaq_laneq_f32(_out5p, _p5, _k5n, 1);\n                _out6p = vfmaq_laneq_f32(_out6p, _p5, _k6n, 1);\n                _out7p = vfmaq_laneq_f32(_out7p, _p5, _k7n, 1);\n\n                float32x4_t _p6 = vld1q_f32(r6);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p6, _k0n, 2);\n                _out1p = vfmaq_laneq_f32(_out1p, _p6, _k1n, 2);\n                _out2p = 
vfmaq_laneq_f32(_out2p, _p6, _k2n, 2);\n                _out3p = vfmaq_laneq_f32(_out3p, _p6, _k3n, 2);\n                _out4p = vfmaq_laneq_f32(_out4p, _p6, _k4n, 2);\n                _out5p = vfmaq_laneq_f32(_out5p, _p6, _k5n, 2);\n                _out6p = vfmaq_laneq_f32(_out6p, _p6, _k6n, 2);\n                _out7p = vfmaq_laneq_f32(_out7p, _p6, _k7n, 2);\n\n                float32x4_t _p7 = vld1q_f32(r7);\n\n                _out0p = vfmaq_laneq_f32(_out0p, _p7, _k0n, 3);\n                _out1p = vfmaq_laneq_f32(_out1p, _p7, _k1n, 3);\n                _out2p = vfmaq_laneq_f32(_out2p, _p7, _k2n, 3);\n                _out3p = vfmaq_laneq_f32(_out3p, _p7, _k3n, 3);\n                _out4p = vfmaq_laneq_f32(_out4p, _p7, _k4n, 3);\n                _out5p = vfmaq_laneq_f32(_out5p, _p7, _k5n, 3);\n                _out6p = vfmaq_laneq_f32(_out6p, _p7, _k6n, 3);\n                _out7p = vfmaq_laneq_f32(_out7p, _p7, _k7n, 3);\n\n                vst1q_f32(outptr0, _out0p);\n                vst1q_f32(outptr1, _out1p);\n                vst1q_f32(outptr2, _out2p);\n                vst1q_f32(outptr3, _out3p);\n                vst1q_f32(outptr4, _out4p);\n                vst1q_f32(outptr5, _out5p);\n                vst1q_f32(outptr6, _out6p);\n                vst1q_f32(outptr7, _out7p);\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                r3 += 4;\n                r4 += 4;\n                r5 += 4;\n                r6 += 4;\n                r7 += 4;\n                outptr0 += 4;\n                outptr1 += 4;\n                outptr2 += 4;\n                outptr3 += 4;\n                outptr4 += 4;\n                outptr5 += 4;\n                outptr6 += 4;\n                outptr7 += 4;\n            }\n#endif\n            for (; remain > 0; remain--)\n            {\n                // TODO neon optimize\n                float sum0 = *r0 * kernel0[0] + *r1 * kernel0[1] + *r2 * kernel0[2] + *r3 * kernel0[3] + *r4 * kernel0[4] 
+ *r5 * kernel0[5] + *r6 * kernel0[6] + *r7 * kernel0[7];\n                float sum1 = *r0 * kernel1[0] + *r1 * kernel1[1] + *r2 * kernel1[2] + *r3 * kernel1[3] + *r4 * kernel1[4] + *r5 * kernel1[5] + *r6 * kernel1[6] + *r7 * kernel1[7];\n                float sum2 = *r0 * kernel2[0] + *r1 * kernel2[1] + *r2 * kernel2[2] + *r3 * kernel2[3] + *r4 * kernel2[4] + *r5 * kernel2[5] + *r6 * kernel2[6] + *r7 * kernel2[7];\n                float sum3 = *r0 * kernel3[0] + *r1 * kernel3[1] + *r2 * kernel3[2] + *r3 * kernel3[3] + *r4 * kernel3[4] + *r5 * kernel3[5] + *r6 * kernel3[6] + *r7 * kernel3[7];\n                float sum4 = *r0 * kernel4[0] + *r1 * kernel4[1] + *r2 * kernel4[2] + *r3 * kernel4[3] + *r4 * kernel4[4] + *r5 * kernel4[5] + *r6 * kernel4[6] + *r7 * kernel4[7];\n                float sum5 = *r0 * kernel5[0] + *r1 * kernel5[1] + *r2 * kernel5[2] + *r3 * kernel5[3] + *r4 * kernel5[4] + *r5 * kernel5[5] + *r6 * kernel5[6] + *r7 * kernel5[7];\n                float sum6 = *r0 * kernel6[0] + *r1 * kernel6[1] + *r2 * kernel6[2] + *r3 * kernel6[3] + *r4 * kernel6[4] + *r5 * kernel6[5] + *r6 * kernel6[6] + *r7 * kernel6[7];\n                float sum7 = *r0 * kernel7[0] + *r1 * kernel7[1] + *r2 * kernel7[2] + *r3 * kernel7[3] + *r4 * kernel7[4] + *r5 * kernel7[5] + *r6 * kernel7[6] + *r7 * kernel7[7];\n\n                *outptr0 += sum0;\n                *outptr1 += sum1;\n                *outptr2 += sum2;\n                *outptr3 += sum3;\n                *outptr4 += sum4;\n                *outptr5 += sum5;\n                *outptr6 += sum6;\n                *outptr7 += sum7;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                r4++;\n                r5++;\n                r6++;\n                r7++;\n                outptr0++;\n                outptr1++;\n                outptr2++;\n                outptr3++;\n                outptr4++;\n                outptr5++;\n                outptr6++;\n          
      outptr7++;\n            }\n        }\n\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n            float* outptr4 = out4;\n            float* outptr5 = out5;\n            float* outptr6 = out6;\n            float* outptr7 = out7;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n            const float* kernel4 = kernel + (p + 4) * inch + q;\n            const float* kernel5 = kernel + (p + 5) * inch + q;\n            const float* kernel6 = kernel + (p + 6) * inch + q;\n            const float* kernel7 = kernel + (p + 7) * inch + q;\n\n            const float k0 = kernel0[0];\n            const float k1 = kernel1[0];\n            const float k2 = kernel2[0];\n            const float k3 = kernel3[0];\n            const float k4 = kernel4[0];\n            const float k5 = kernel5[0];\n            const float k6 = kernel6[0];\n            const float k7 = kernel7[0];\n\n            const float* r0 = img0;\n\n            int size = outw * outh;\n\n            int nn = size >> 2;\n            int remain = size & 3;\n\n            float32x4_t _k0 = vdupq_n_f32(k0);\n            float32x4_t _k1 = vdupq_n_f32(k1);\n            float32x4_t _k2 = vdupq_n_f32(k2);\n            float32x4_t _k3 = vdupq_n_f32(k3);\n            float32x4_t _k4 = vdupq_n_f32(k4);\n            float32x4_t _k5 = vdupq_n_f32(k5);\n            float32x4_t _k6 = vdupq_n_f32(k6);\n            float32x4_t _k7 = vdupq_n_f32(k7);\n\n            for (; nn > 0; nn--)\n            {\n                float32x4_t _p = vld1q_f32(r0);\n\n                float32x4_t _out0p = vld1q_f32(outptr0);\n                
float32x4_t _out1p = vld1q_f32(outptr1);\n                float32x4_t _out2p = vld1q_f32(outptr2);\n                float32x4_t _out3p = vld1q_f32(outptr3);\n                float32x4_t _out4p = vld1q_f32(outptr4);\n                float32x4_t _out5p = vld1q_f32(outptr5);\n                float32x4_t _out6p = vld1q_f32(outptr6);\n                float32x4_t _out7p = vld1q_f32(outptr7);\n\n                _out0p = vfmaq_f32(_out0p, _p, _k0);\n                _out1p = vfmaq_f32(_out1p, _p, _k1);\n                _out2p = vfmaq_f32(_out2p, _p, _k2);\n                _out3p = vfmaq_f32(_out3p, _p, _k3);\n                _out4p = vfmaq_f32(_out4p, _p, _k4);\n                _out5p = vfmaq_f32(_out5p, _p, _k5);\n                _out6p = vfmaq_f32(_out6p, _p, _k6);\n                _out7p = vfmaq_f32(_out7p, _p, _k7);\n\n                vst1q_f32(outptr0, _out0p);\n                vst1q_f32(outptr1, _out1p);\n                vst1q_f32(outptr2, _out2p);\n                vst1q_f32(outptr3, _out3p);\n                vst1q_f32(outptr4, _out4p);\n                vst1q_f32(outptr5, _out5p);\n                vst1q_f32(outptr6, _out6p);\n                vst1q_f32(outptr7, _out7p);\n\n                r0 += 4;\n                outptr0 += 4;\n                outptr1 += 4;\n                outptr2 += 4;\n                outptr3 += 4;\n                outptr4 += 4;\n                outptr5 += 4;\n                outptr6 += 4;\n                outptr7 += 4;\n            }\n            for (; remain > 0; remain--)\n            {\n                // TODO neon optimize\n                float sum0 = *r0 * k0;\n                float sum1 = *r0 * k1;\n                float sum2 = *r0 * k2;\n                float sum3 = *r0 * k3;\n                float sum4 = *r0 * k4;\n                float sum5 = *r0 * k5;\n                float sum6 = *r0 * k6;\n                float sum7 = *r0 * k7;\n\n                *outptr0 += sum0;\n                *outptr1 += sum1;\n                *outptr2 += 
sum2;\n                *outptr3 += sum3;\n                *outptr4 += sum4;\n                *outptr5 += sum5;\n                *outptr6 += sum6;\n                *outptr7 += sum7;\n\n                r0++;\n                outptr0++;\n                outptr1++;\n                outptr2++;\n                outptr3++;\n                outptr4++;\n                outptr5++;\n                outptr6++;\n                outptr7++;\n            }\n        }\n    }\n\n#else\n\n    nn_outch = outch / 6;\n    remain_outch_start = nn_outch * 6;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 6;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n        Mat out2 = top_blob.channel(p + 2);\n        Mat out3 = top_blob.channel(p + 3);\n        Mat out4 = top_blob.channel(p + 4);\n        Mat out5 = top_blob.channel(p + 5);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float bias1 = bias ? bias[p + 1] : 0.f;\n        const float bias2 = bias ? bias[p + 2] : 0.f;\n        const float bias3 = bias ? bias[p + 3] : 0.f;\n        const float bias4 = bias ? bias[p + 4] : 0.f;\n        const float bias5 = bias ? 
bias[p + 5] : 0.f;\n\n        out0.fill(bias0);\n        out1.fill(bias1);\n        out2.fill(bias2);\n        out3.fill(bias3);\n        out4.fill(bias4);\n        out5.fill(bias5);\n\n        int q = 0;\n\n        for (; q + 3 < inch; q += 4)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n            float* outptr4 = out4;\n            float* outptr5 = out5;\n\n            const float* img0 = bottom_blob.channel(q);\n            const float* img1 = bottom_blob.channel(q + 1);\n            const float* img2 = bottom_blob.channel(q + 2);\n            const float* img3 = bottom_blob.channel(q + 3);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n            const float* kernel4 = kernel + (p + 4) * inch + q;\n            const float* kernel5 = kernel + (p + 5) * inch + q;\n\n            const float* r0 = img0;\n            const float* r1 = img1;\n            const float* r2 = img2;\n            const float* r3 = img3;\n\n            int size = outw * outh;\n\n#if __ARM_NEON\n            int nn = size >> 2;\n            int remain = size & 3;\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _k0 = vld1q_f32(kernel0);\n            float32x4_t _k1 = vld1q_f32(kernel1);\n            float32x4_t _k2 = vld1q_f32(kernel2);\n            float32x4_t _k3 = vld1q_f32(kernel3);\n            float32x4_t _k4 = vld1q_f32(kernel4);\n            float32x4_t _k5 = vld1q_f32(kernel5);\n\n            for (; nn > 0; nn--)\n            {\n                asm volatile(\n                    \"pld        [%6, #128]              \\n\"\n                    \"vld1.f32   {d24-d25}, [%6 :128]!   
\\n\" // q12 = r0\n\n                    \"pld        [%0, #128]              \\n\"\n                    \"vld1.f32   {d12-d13}, [%0 :128]    \\n\" // q6 = outptr0\n\n                    \"pld        [%1, #128]              \\n\"\n                    \"vld1.f32   {d14-d15}, [%1 :128]    \\n\" // q7 = outptr1\n\n                    \"vmla.f32   q6, q12, %e20[0]        \\n\"\n\n                    \"pld        [%2, #128]              \\n\"\n                    \"vld1.f32   {d16-d17}, [%2 :128]    \\n\" // q8 = outptr2\n\n                    \"vmla.f32   q7, q12, %e21[0]        \\n\"\n\n                    \"pld        [%3, #128]              \\n\"\n                    \"vld1.f32   {d18-d19}, [%3 :128]    \\n\" // q9 = outptr3\n\n                    \"vmla.f32   q8, q12, %e22[0]        \\n\"\n\n                    \"pld        [%7, #128]              \\n\"\n                    \"vld1.f32   {d26-d27}, [%7 :128]!   \\n\" // q13 = r1\n\n                    \"vmla.f32   q9, q12, %e23[0]        \\n\"\n\n                    \"pld        [%4, #128]              \\n\"\n                    \"vld1.f32   {d20-d21}, [%4 :128]    \\n\" // q10 = outptr4\n\n                    \"vmla.f32   q6, q13, %e20[1]        \\n\"\n                    \"vmla.f32   q7, q13, %e21[1]        \\n\"\n\n                    \"pld        [%5, #128]              \\n\"\n                    \"vld1.f32   {d22-d23}, [%5 :128]    \\n\" // q11 = outptr5\n\n                    \"vmla.f32   q10, q12, %e24[0]       \\n\"\n                    \"vmla.f32   q11, q12, %e25[0]       \\n\"\n\n                    \"vmla.f32   q8, q13, %e22[1]        \\n\"\n                    \"vmla.f32   q9, q13, %e23[1]        \\n\"\n\n                    \"pld        [%8, #128]              \\n\"\n                    \"vld1.f32   {d28-d29}, [%8 :128]!   
\\n\" // q14 = r2\n\n                    \"vmla.f32   q10, q13, %e24[1]       \\n\"\n                    \"vmla.f32   q11, q13, %e25[1]       \\n\"\n\n                    \"vmla.f32   q6, q14, %f20[0]        \\n\"\n                    \"vmla.f32   q7, q14, %f21[0]        \\n\"\n                    \"vmla.f32   q8, q14, %f22[0]        \\n\"\n                    \"vmla.f32   q9, q14, %f23[0]        \\n\"\n\n                    \"pld        [%9, #128]             \\n\"\n                    \"vld1.f32   {d30-d31}, [%9 :128]!  \\n\" // q15 = r3\n\n                    \"vmla.f32   q10, q14, %f24[0]       \\n\"\n                    \"vmla.f32   q11, q14, %f25[0]       \\n\"\n\n                    \"vmla.f32   q6, q15, %f20[1]        \\n\"\n                    \"vmla.f32   q7, q15, %f21[1]        \\n\"\n                    \"vmla.f32   q8, q15, %f22[1]        \\n\"\n                    \"vmla.f32   q9, q15, %f23[1]        \\n\"\n\n                    \"vmla.f32   q10, q15, %f24[1]       \\n\"\n                    \"vmla.f32   q11, q15, %f25[1]       \\n\"\n\n                    \"vst1.f32   {d12-d13}, [%0 :128]!   \\n\"\n                    \"vst1.f32   {d14-d15}, [%1 :128]!   \\n\"\n                    \"vst1.f32   {d16-d17}, [%2 :128]!   \\n\"\n                    \"vst1.f32   {d18-d19}, [%3 :128]!   \\n\"\n                    \"vst1.f32   {d20-d21}, [%4 :128]!   \\n\"\n                    \"vst1.f32   {d22-d23}, [%5 :128]!   
\\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(outptr2), // %2\n                    \"=r\"(outptr3), // %3\n                    \"=r\"(outptr4), // %4\n                    \"=r\"(outptr5), // %5\n                    \"=r\"(r0),      // %6\n                    \"=r\"(r1),      // %7\n                    \"=r\"(r2),      // %8\n                    \"=r\"(r3)       // %9\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(outptr2),\n                    \"3\"(outptr3),\n                    \"4\"(outptr4),\n                    \"5\"(outptr5),\n                    \"6\"(r0),\n                    \"7\"(r1),\n                    \"8\"(r2),\n                    \"9\"(r3),\n                    \"w\"(_k0), // %20\n                    \"w\"(_k1), // %21\n                    \"w\"(_k2), // %22\n                    \"w\"(_k3), // %23\n                    \"w\"(_k4), // %24\n                    \"w\"(_k5)  // %25\n                    : \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __ARM_NEON\n\n            for (; remain > 0; remain--)\n            {\n                // TODO neon optimize\n                float sum0 = *r0 * kernel0[0] + *r1 * kernel0[1] + *r2 * kernel0[2] + *r3 * kernel0[3];\n                float sum1 = *r0 * kernel1[0] + *r1 * kernel1[1] + *r2 * kernel1[2] + *r3 * kernel1[3];\n                float sum2 = *r0 * kernel2[0] + *r1 * kernel2[1] + *r2 * kernel2[2] + *r3 * kernel2[3];\n                float sum3 = *r0 * kernel3[0] + *r1 * kernel3[1] + *r2 * kernel3[2] + *r3 * kernel3[3];\n                float sum4 = *r0 * kernel4[0] + *r1 * kernel4[1] + *r2 * kernel4[2] + *r3 * kernel4[3];\n                float sum5 = *r0 * kernel5[0] + *r1 * kernel5[1] + *r2 * kernel5[2] + *r3 * kernel5[3];\n\n                *outptr0 += sum0;\n                *outptr1 += 
sum1;\n                *outptr2 += sum2;\n                *outptr3 += sum3;\n                *outptr4 += sum4;\n                *outptr5 += sum5;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr0++;\n                outptr1++;\n                outptr2++;\n                outptr3++;\n                outptr4++;\n                outptr5++;\n            }\n        }\n\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n            float* outptr4 = out4;\n            float* outptr5 = out5;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n            const float* kernel4 = kernel + (p + 4) * inch + q;\n            const float* kernel5 = kernel + (p + 5) * inch + q;\n\n            const float k0 = kernel0[0];\n            const float k1 = kernel1[0];\n            const float k2 = kernel2[0];\n            const float k3 = kernel3[0];\n            const float k4 = kernel4[0];\n            const float k5 = kernel5[0];\n\n            const float* r0 = img0;\n\n            int size = outw * outh;\n\n#if __ARM_NEON\n            int nn = size >> 2;\n            int remain = size & 3;\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _k0 = vdupq_n_f32(k0);\n            float32x4_t _k1 = vdupq_n_f32(k1);\n            float32x4_t _k2 = vdupq_n_f32(k2);\n            float32x4_t _k3 = vdupq_n_f32(k3);\n            float32x4_t _k4 = vdupq_n_f32(k4);\n            float32x4_t _k5 = vdupq_n_f32(k5);\n\n            if (nn > 0)\n            {\n                asm volatile(\n 
                   \"pld        [%7, #128]              \\n\"\n                    \"vld1.f32   {d24-d25}, [%7 :128]!   \\n\" // q12 = r0\n\n                    \"pld        [%1, #128]              \\n\"\n                    \"vld1.f32   {d12-d13}, [%1 :128]    \\n\" // q6 = outptr0\n\n                    \"0:                                 \\n\"\n\n                    \"pld        [%2, #128]              \\n\"\n                    \"vld1.f32   {d14-d15}, [%2 :128]    \\n\" // q7 = outptr1\n\n                    \"vmla.f32   q6, q12, %q16           \\n\"\n\n                    \"pld        [%3, #128]              \\n\"\n                    \"vld1.f32   {d16-d17}, [%3 :128]    \\n\" // q8 = outptr2\n\n                    \"vmla.f32   q7, q12, %q17           \\n\"\n\n                    \"pld        [%4, #128]              \\n\"\n                    \"vld1.f32   {d18-d19}, [%4 :128]    \\n\" // q9 = outptr3\n\n                    \"vmla.f32   q8, q12, %q18           \\n\"\n\n                    \"pld        [%5, #128]              \\n\"\n                    \"vld1.f32   {d20-d21}, [%5 :128]    \\n\" // q10 = outptr4\n\n                    \"vmla.f32   q9, q12, %q19           \\n\"\n\n                    \"pld        [%6, #128]              \\n\"\n                    \"vld1.f32   {d22-d23}, [%6 :128]    \\n\" // q11 = outptr5\n\n                    \"vmla.f32   q10, q12, %q20          \\n\"\n                    \"vmla.f32   q11, q12, %q21          \\n\"\n\n                    \"pld        [%7, #128]              \\n\"\n                    \"vld1.f32   {d24-d25}, [%7 :128]!   \\n\" // q12 = r0\n\n                    \"vst1.f32   {d12-d13}, [%1 :128]!   \\n\"\n                    \"vst1.f32   {d14-d15}, [%2 :128]!   \\n\"\n\n                    \"pld        [%1, #128]              \\n\"\n                    \"vld1.f32   {d12-d13}, [%1 :128]    \\n\" // q6 = outptr0\n\n                    \"vst1.f32   {d16-d17}, [%3 :128]!   
\\n\"\n                    \"vst1.f32   {d18-d19}, [%4 :128]!   \\n\"\n\n                    \"subs       %0, #1                  \\n\"\n\n                    \"vst1.f32   {d20-d21}, [%5 :128]!   \\n\"\n                    \"vst1.f32   {d22-d23}, [%6 :128]!   \\n\"\n\n                    \"bne        0b                      \\n\"\n\n                    \"sub        %7, #16                 \\n\"\n\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(outptr1), // %2\n                    \"=r\"(outptr2), // %3\n                    \"=r\"(outptr3), // %4\n                    \"=r\"(outptr4), // %5\n                    \"=r\"(outptr5), // %6\n                    \"=r\"(r0)       // %7\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr1),\n                    \"3\"(outptr2),\n                    \"4\"(outptr3),\n                    \"5\"(outptr4),\n                    \"6\"(outptr5),\n                    \"7\"(r0),\n                    \"w\"(_k0), // %16\n                    \"w\"(_k1), // %17\n                    \"w\"(_k2), // %18\n                    \"w\"(_k3), // %19\n                    \"w\"(_k4), // %20\n                    \"w\"(_k5)  // %21\n                    : \"cc\", \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\");\n            }\n#endif // __ARM_NEON\n\n            for (; remain > 0; remain--)\n            {\n                // TODO neon optimize\n                float sum0 = *r0 * k0;\n                float sum1 = *r0 * k1;\n                float sum2 = *r0 * k2;\n                float sum3 = *r0 * k3;\n                float sum4 = *r0 * k4;\n                float sum5 = *r0 * k5;\n\n                *outptr0 += sum0;\n                *outptr1 += sum1;\n                *outptr2 += sum2;\n                *outptr3 += sum3;\n                *outptr4 += sum4;\n                *outptr5 += sum5;\n\n                r0++;\n  
              outptr0++;\n                outptr1++;\n                outptr2++;\n                outptr3++;\n                outptr4++;\n                outptr5++;\n            }\n        }\n    }\n#endif // __ARM_NEON && __aarch64__\n\n    nn_outch = (outch - remain_outch_start) >> 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = remain_outch_start + pp * 4;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n        Mat out2 = top_blob.channel(p + 2);\n        Mat out3 = top_blob.channel(p + 3);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float bias1 = bias ? bias[p + 1] : 0.f;\n        const float bias2 = bias ? bias[p + 2] : 0.f;\n        const float bias3 = bias ? bias[p + 3] : 0.f;\n\n        out0.fill(bias0);\n        out1.fill(bias1);\n        out2.fill(bias2);\n        out3.fill(bias3);\n\n        int q = 0;\n\n        for (; q + 3 < inch; q += 4)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n\n            const float* img0 = bottom_blob.channel(q);\n            const float* img1 = bottom_blob.channel(q + 1);\n            const float* img2 = bottom_blob.channel(q + 2);\n            const float* img3 = bottom_blob.channel(q + 3);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n\n            const float* r0 = img0;\n            const float* r1 = img1;\n            const float* r2 = img2;\n            const float* r3 = img3;\n\n            int size = outw * outh;\n\n#if __ARM_NEON\n            int nn = size >> 3;\n            int remain = size & 7;\n#else\n            int remain = size;\n#endif // 
__ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _k0 = vld1q_f32(kernel0);\n            float32x4_t _k1 = vld1q_f32(kernel1);\n            float32x4_t _k2 = vld1q_f32(kernel2);\n            float32x4_t _k3 = vld1q_f32(kernel3);\n\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v6.4s, v7.4s}, [%5], #32   \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v8.4s, v9.4s}, [%1]        \\n\"\n\n                    \"0:                                 \\n\"\n\n                    \"fmla   v8.4s, v6.4s, %18.s[0]      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%2]      \\n\"\n\n                    \"fmla   v9.4s, v7.4s, %18.s[0]      \\n\"\n\n                    \"fmla   v10.4s, v6.4s, %19.s[0]     \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v12.4s, v13.4s}, [%3]      \\n\"\n\n                    \"fmla   v11.4s, v7.4s, %19.s[0]     \\n\"\n\n                    \"fmla   v12.4s, v6.4s, %20.s[0]     \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v14.4s, v15.4s}, [%4]      \\n\"\n\n                    \"fmla   v13.4s, v7.4s, %20.s[0]     \\n\"\n\n                    \"prfm   pldl1keep, [%6, #256]       \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%6], #32   \\n\"\n\n                    \"fmla   v14.4s, v6.4s, %21.s[0]     \\n\"\n                    \"fmla   v15.4s, v7.4s, %21.s[0]     \\n\"\n\n                    \"fmla   v8.4s, v4.4s, %18.s[1]      \\n\"\n                    \"fmla   v9.4s, v5.4s, %18.s[1]      \\n\"\n\n                    \"fmla   v10.4s, v4.4s, %19.s[1]     \\n\"\n                    \"fmla   v11.4s, v5.4s, %19.s[1]     \\n\"\n\n                    \"fmla   v12.4s, v4.4s, 
%20.s[1]     \\n\"\n                    \"fmla   v13.4s, v5.4s, %20.s[1]     \\n\"\n\n                    \"prfm   pldl1keep, [%7, #256]       \\n\"\n                    \"ld1    {v6.4s, v7.4s}, [%7], #32   \\n\"\n\n                    \"fmla   v14.4s, v4.4s, %21.s[1]     \\n\"\n                    \"fmla   v15.4s, v5.4s, %21.s[1]     \\n\"\n\n                    \"fmla   v8.4s, v6.4s, %18.s[2]      \\n\"\n                    \"fmla   v9.4s, v7.4s, %18.s[2]      \\n\"\n\n                    \"fmla   v10.4s, v6.4s, %19.s[2]     \\n\"\n                    \"fmla   v11.4s, v7.4s, %19.s[2]     \\n\"\n\n                    \"fmla   v12.4s, v6.4s, %20.s[2]     \\n\"\n                    \"fmla   v13.4s, v7.4s, %20.s[2]     \\n\"\n\n                    \"prfm   pldl1keep, [%8, #256]       \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%8], #32   \\n\"\n\n                    \"fmla   v14.4s, v6.4s, %21.s[2]     \\n\"\n                    \"fmla   v15.4s, v7.4s, %21.s[2]     \\n\"\n\n                    \"fmla   v8.4s, v4.4s, %18.s[3]      \\n\"\n                    \"fmla   v9.4s, v5.4s, %18.s[3]      \\n\"\n\n                    \"fmla   v10.4s, v4.4s, %19.s[3]     \\n\"\n                    \"fmla   v11.4s, v5.4s, %19.s[3]     \\n\"\n\n                    \"st1    {v8.4s, v9.4s}, [%1], #32   \\n\"\n\n                    \"fmla   v12.4s, v4.4s, %20.s[3]     \\n\"\n                    \"fmla   v13.4s, v5.4s, %20.s[3]     \\n\"\n\n                    \"st1    {v10.4s, v11.4s}, [%2], #32 \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v6.4s, v7.4s}, [%5], #32   \\n\"\n\n                    \"fmla   v14.4s, v4.4s, %21.s[3]     \\n\"\n                    \"fmla   v15.4s, v5.4s, %21.s[3]     \\n\"\n\n                    \"st1    {v12.4s, v13.4s}, [%3], #32 \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v8.4s, v9.4s}, [%1]        \\n\"\n\n                    
\"subs   %w0, %w0, #1                \\n\"\n\n                    \"st1    {v14.4s, v15.4s}, [%4], #32 \\n\"\n\n                    \"bne    0b                          \\n\"\n                    \"sub    %5, %5, #32                 \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(outptr1), // %2\n                    \"=r\"(outptr2), // %3\n                    \"=r\"(outptr3), // %4\n                    \"=r\"(r0),      // %5\n                    \"=r\"(r1),      // %6\n                    \"=r\"(r2),      // %7\n                    \"=r\"(r3)       // %8\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr1),\n                    \"3\"(outptr2),\n                    \"4\"(outptr3),\n                    \"5\"(r0),\n                    \"6\"(r1),\n                    \"7\"(r2),\n                    \"8\"(r3),\n                    \"w\"(_k0), // %18\n                    \"w\"(_k1), // %19\n                    \"w\"(_k2), // %20\n                    \"w\"(_k3)  // %21\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%5, #256]              \\n\"\n                    \"vld1.f32   {d12-d15}, [%5 :128]!   
\\n\"\n                    \"pld        [%1, #256]              \\n\"\n                    \"vld1.f32   {d16-d19}, [%1 :128]    \\n\"\n                    \"0:                                 \\n\"\n\n                    \"vmla.f32   q8, q6, %e18[0]         \\n\"\n\n                    \"pld        [%2, #256]              \\n\"\n                    \"vld1.f32   {d20-d23}, [%2 :128]    \\n\"\n                    \"vmla.f32   q9, q7, %e18[0]         \\n\"\n\n                    \"vmla.f32   q10, q6, %e19[0]        \\n\"\n\n                    \"pld        [%3, #256]              \\n\"\n                    \"vld1.f32   {d24-d27}, [%3 :128]    \\n\"\n                    \"vmla.f32   q11, q7, %e19[0]        \\n\"\n\n                    \"vmla.f32   q12, q6, %e20[0]        \\n\"\n\n                    \"pld        [%4, #256]              \\n\"\n                    \"vld1.f32   {d28-d31}, [%4 :128]    \\n\"\n                    \"vmla.f32   q13, q7, %e20[0]        \\n\"\n\n                    \"pld        [%6, #256]              \\n\"\n                    \"vld1.f32   {d8-d11}, [%6 :128]!    \\n\"\n\n                    \"vmla.f32   q14, q6, %e21[0]        \\n\"\n                    \"vmla.f32   q15, q7, %e21[0]        \\n\"\n\n                    \"vmla.f32   q8, q4, %e18[1]         \\n\"\n                    \"vmla.f32   q9, q5, %e18[1]         \\n\"\n\n                    \"vmla.f32   q10, q4, %e19[1]        \\n\"\n                    \"vmla.f32   q11, q5, %e19[1]        \\n\"\n\n                    \"vmla.f32   q12, q4, %e20[1]        \\n\"\n                    \"vmla.f32   q13, q5, %e20[1]        \\n\"\n\n                    \"pld        [%7, #256]              \\n\"\n                    \"vld1.f32   {d12-d15}, [%7 :128]!   
\\n\"\n\n                    \"vmla.f32   q14, q4, %e21[1]        \\n\"\n                    \"vmla.f32   q15, q5, %e21[1]        \\n\"\n\n                    \"vmla.f32   q8, q6, %f18[0]         \\n\"\n                    \"vmla.f32   q9, q7, %f18[0]         \\n\"\n\n                    \"vmla.f32   q10, q6, %f19[0]        \\n\"\n                    \"vmla.f32   q11, q7, %f19[0]        \\n\"\n\n                    \"vmla.f32   q12, q6, %f20[0]        \\n\"\n                    \"vmla.f32   q13, q7, %f20[0]        \\n\"\n\n                    \"pld        [%8, #256]              \\n\"\n                    \"vld1.f32   {d8-d11}, [%8 :128]!    \\n\"\n\n                    \"vmla.f32   q14, q6, %f21[0]        \\n\"\n                    \"vmla.f32   q15, q7, %f21[0]        \\n\"\n\n                    \"vmla.f32   q8, q4, %f18[1]         \\n\"\n                    \"vmla.f32   q9, q5, %f18[1]         \\n\"\n\n                    \"vmla.f32   q10, q4, %f19[1]        \\n\"\n                    \"vmla.f32   q11, q5, %f19[1]        \\n\"\n\n                    \"vmla.f32   q12, q4, %f20[1]        \\n\"\n                    \"vst1.f32   {d16-d19}, [%1 :128]!   \\n\"\n\n                    \"vmla.f32   q13, q5, %f20[1]        \\n\"\n\n                    \"vst1.f32   {d20-d23}, [%2 :128]!   \\n\"\n\n                    \"vmla.f32   q14, q4, %f21[1]        \\n\"\n                    \"pld        [%5, #256]              \\n\"\n                    \"vld1.f32   {d12-d15}, [%5 :128]!   \\n\"\n\n                    \"vmla.f32   q15, q5, %f21[1]        \\n\"\n\n                    \"vst1.f32   {d24-d27}, [%3 :128]!   \\n\"\n\n                    \"pld        [%1, #256]              \\n\"\n                    \"vld1.f32   {d16-d19}, [%1 :128]    \\n\"\n\n                    \"subs       %0, #1                  \\n\"\n                    \"vst1.f32   {d28-d31}, [%4 :128]!   
\\n\"\n\n                    \"bne        0b                      \\n\"\n                    \"sub        %5, #32                 \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(outptr1), // %2\n                    \"=r\"(outptr2), // %3\n                    \"=r\"(outptr3), // %4\n                    \"=r\"(r0),      // %5\n                    \"=r\"(r1),      // %6\n                    \"=r\"(r2),      // %7\n                    \"=r\"(r3)       // %8\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr1),\n                    \"3\"(outptr2),\n                    \"4\"(outptr3),\n                    \"5\"(r0),\n                    \"6\"(r1),\n                    \"7\"(r2),\n                    \"8\"(r3),\n                    \"w\"(_k0), // %18\n                    \"w\"(_k1), // %19\n                    \"w\"(_k2), // %20\n                    \"w\"(_k3)  // %21\n                    : \"cc\", \"memory\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                // TODO neon optimize\n                float sum0 = *r0 * kernel0[0] + *r1 * kernel0[1] + *r2 * kernel0[2] + *r3 * kernel0[3];\n                float sum1 = *r0 * kernel1[0] + *r1 * kernel1[1] + *r2 * kernel1[2] + *r3 * kernel1[3];\n                float sum2 = *r0 * kernel2[0] + *r1 * kernel2[1] + *r2 * kernel2[2] + *r3 * kernel2[3];\n                float sum3 = *r0 * kernel3[0] + *r1 * kernel3[1] + *r2 * kernel3[2] + *r3 * kernel3[3];\n\n                *outptr0 += sum0;\n                *outptr1 += sum1;\n                *outptr2 += sum2;\n                *outptr3 += sum3;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr0++;\n         
       outptr1++;\n                outptr2++;\n                outptr3++;\n            }\n        }\n\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n\n            const float k0 = kernel0[0];\n            const float k1 = kernel1[0];\n            const float k2 = kernel2[0];\n            const float k3 = kernel3[0];\n\n            const float* r0 = img0;\n\n            int size = outw * outh;\n\n#if __ARM_NEON\n            int nn = size >> 3;\n            int remain = size & 7;\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _k0 = vdupq_n_f32(k0);\n            float32x4_t _k1 = vdupq_n_f32(k1);\n            float32x4_t _k2 = vdupq_n_f32(k2);\n            float32x4_t _k3 = vdupq_n_f32(k3);\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm       pldl1keep, [%5, #256]          \\n\"\n                    \"ld1        {v6.4s, v7.4s}, [%5], #32      \\n\"\n                    \"0:                                        \\n\"\n                    \"prfm       pldl1keep, [%1, #256]          \\n\"\n                    \"ld1        {v8.4s, v9.4s}, [%1]           \\n\"\n                    \"fmla       v8.4s, v6.4s, %12.4s           \\n\"\n                    \"fmla       v9.4s, v7.4s, %12.4s           \\n\"\n\n                    \"prfm       pldl1keep, [%2, #256]          \\n\"\n                    \"ld1        {v10.4s, v11.4s}, [%2]         \\n\"\n                    \"fmla       v10.4s, v6.4s, %13.4s       
   \\n\"\n                    \"fmla       v11.4s, v7.4s, %13.4s          \\n\"\n\n                    \"st1        {v8.4s, v9.4s}, [%1], #32      \\n\"\n\n                    \"prfm       pldl1keep, [%3, #256]          \\n\"\n                    \"ld1        {v12.4s, v13.4s}, [%3]         \\n\"\n                    \"fmla       v12.4s, v6.4s, %14.4s          \\n\"\n                    \"fmla       v13.4s, v7.4s, %14.4s          \\n\"\n\n                    \"st1        {v10.4s, v11.4s}, [%2], #32    \\n\"\n\n                    \"prfm       pldl1keep, [%4, #256]          \\n\"\n                    \"ld1        {v14.4s, v15.4s}, [%4]         \\n\"\n                    \"fmla       v14.4s, v6.4s, %15.4s          \\n\"\n                    \"fmla       v15.4s, v7.4s, %15.4s          \\n\"\n\n                    \"st1        {v12.4s, v13.4s}, [%3], #32    \\n\"\n\n                    \"prfm       pldl1keep, [%5, #256]          \\n\"\n                    \"ld1        {v6.4s, v7.4s}, [%5], #32      \\n\"\n                    \"subs       %w0, %w0, #1                   \\n\"\n                    \"st1        {v14.4s, v15.4s}, [%4], #32    \\n\"\n                    \"bne        0b                             \\n\"\n                    \"sub        %5, %5, #32                    \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(outptr1), // %2\n                    \"=r\"(outptr2), // %3\n                    \"=r\"(outptr3), // %4\n                    \"=r\"(r0)       // %5\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr1),\n                    \"3\"(outptr2),\n                    \"4\"(outptr3),\n                    \"5\"(r0),\n                    \"w\"(_k0), // %12\n                    \"w\"(_k1), // %13\n                    \"w\"(_k2), // %14\n                    \"w\"(_k3)  // %15\n                    : \"cc\", \"memory\", \"v6\", \"v7\", 
\"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%5, #256]              \\n\"\n                    \"vld1.f32   {d12-d15}, [%5 :128]!   \\n\"\n                    \"0:                                 \\n\"\n                    \"pld        [%1, #256]              \\n\"\n                    \"vld1.f32   {d16-d19}, [%1 :128]    \\n\"\n                    \"vmla.f32   q8, q6, %q12            \\n\"\n                    \"vmla.f32   q9, q7, %q12            \\n\"\n\n                    \"pld        [%2, #256]              \\n\"\n                    \"vld1.f32   {d20-d23}, [%2 :128]    \\n\"\n                    \"vmla.f32   q10, q6, %q13           \\n\"\n                    \"vmla.f32   q11, q7, %q13           \\n\"\n\n                    \"vst1.f32   {d16-d19}, [%1 :128]!   \\n\"\n\n                    \"pld        [%3, #256]              \\n\"\n                    \"vld1.f32   {d24-d27}, [%3 :128]    \\n\"\n                    \"vmla.f32   q12, q6, %q14           \\n\"\n                    \"vmla.f32   q13, q7, %q14           \\n\"\n\n                    \"vst1.f32   {d20-d23}, [%2 :128]!   \\n\"\n\n                    \"pld        [%4, #256]              \\n\"\n                    \"vld1.f32   {d28-d31}, [%4 :128]    \\n\"\n                    \"vmla.f32   q14, q6, %q15           \\n\"\n                    \"vmla.f32   q15, q7, %q15           \\n\"\n\n                    \"vst1.f32   {d24-d27}, [%3 :128]!   \\n\"\n\n                    \"pld        [%5, #256]              \\n\"\n                    \"vld1.f32   {d12-d15}, [%5 :128]!   \\n\"\n                    \"subs       %0, #1                  \\n\"\n                    \"vst1.f32   {d28-d31}, [%4 :128]!   
\\n\"\n                    \"bne        0b                      \\n\"\n                    \"sub        %5, #32                 \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(outptr1), // %2\n                    \"=r\"(outptr2), // %3\n                    \"=r\"(outptr3), // %4\n                    \"=r\"(r0)       // %5\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr1),\n                    \"3\"(outptr2),\n                    \"4\"(outptr3),\n                    \"5\"(r0),\n                    \"w\"(_k0), // %12\n                    \"w\"(_k1), // %13\n                    \"w\"(_k2), // %14\n                    \"w\"(_k3)  // %15\n                    : \"cc\", \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                // TODO neon optimize\n                float sum0 = *r0 * k0;\n                float sum1 = *r0 * k1;\n                float sum2 = *r0 * k2;\n                float sum3 = *r0 * k3;\n\n                *outptr0 += sum0;\n                *outptr1 += sum1;\n                *outptr2 += sum2;\n                *outptr3 += sum3;\n\n                r0++;\n                outptr0++;\n                outptr1++;\n                outptr2++;\n                outptr3++;\n            }\n        }\n    }\n\n    remain_outch_start += nn_outch << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        int q = 0;\n\n        for (; q + 3 < inch; q += 4)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n            const float* img1 = bottom_blob.channel(q + 1);\n            const float* img2 = bottom_blob.channel(q + 2);\n            const float* img3 = bottom_blob.channel(q + 3);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float k0 = kernel0[0];\n            const float k1 = kernel0[1];\n            const float k2 = kernel0[2];\n            const float k3 = kernel0[3];\n\n            const float* r0 = img0;\n            const float* r1 = img1;\n            const float* r2 = img2;\n            const float* r3 = img3;\n\n            int size = outw * outh;\n\n#if __ARM_NEON\n            int nn = size >> 3;\n            int remain = size & 7;\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _k0 = vdupq_n_f32(k0);\n            float32x4_t _k1 = vdupq_n_f32(k1);\n            float32x4_t _k2 = vdupq_n_f32(k2);\n            float32x4_t _k3 = vdupq_n_f32(k3);\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm       pldl1keep, [%2, #256]          \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                    \"0:                                        \\n\"\n                    \"prfm       pldl1keep, [%1, #256]          \\n\"\n                    \"ld1        {v0.4s, v1.4s}, [%1]           \\n\"\n                    \"fmla       v0.4s, v2.4s, %12.4s           \\n\"\n                    \"fmla       v1.4s, v3.4s, %12.4s           \\n\"\n\n                    \"prfm       pldl1keep, [%3, #256]          \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%3], #32      \\n\"\n                    \"fmla       v0.4s, v2.4s, %13.4s           \\n\"\n                    \"fmla     
  v1.4s, v3.4s, %13.4s           \\n\"\n\n                    \"prfm       pldl1keep, [%4, #256]          \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%4], #32      \\n\"\n                    \"fmla       v0.4s, v2.4s, %14.4s           \\n\"\n                    \"fmla       v1.4s, v3.4s, %14.4s           \\n\"\n\n                    \"prfm       pldl1keep, [%5, #256]          \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%5], #32      \\n\"\n                    \"fmla       v0.4s, v2.4s, %15.4s           \\n\"\n                    \"fmla       v1.4s, v3.4s, %15.4s           \\n\"\n\n                    \"prfm       pldl1keep, [%2, #256]          \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                    \"subs       %w0, %w0, #1                   \\n\"\n                    \"st1        {v0.4s, v1.4s}, [%1], #32      \\n\"\n                    \"bne        0b                             \\n\"\n                    \"sub        %2, %2, #32                    \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2),     // %4\n                    \"=r\"(r3)      // %5\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k0), // %12\n                    \"w\"(_k1), // %13\n                    \"w\"(_k2), // %14\n                    \"w\"(_k3)  // %15\n                    : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%2 :128]! 
\\n\"\n                    \"0:                             \\n\"\n                    \"pld        [%1, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%1 :128]  \\n\"\n                    \"vmla.f32   q0, q2, %q12        \\n\"\n                    \"vmla.f32   q1, q3, %q12        \\n\"\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%3 :128]! \\n\"\n                    \"vmla.f32   q0, q2, %q13        \\n\"\n                    \"vmla.f32   q1, q3, %q13        \\n\"\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%4 :128]! \\n\"\n                    \"vmla.f32   q0, q2, %q14        \\n\"\n                    \"vmla.f32   q1, q3, %q14        \\n\"\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%5 :128]! \\n\"\n                    \"vmla.f32   q0, q2, %q15        \\n\"\n                    \"vmla.f32   q1, q3, %q15        \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%2 :128]! \\n\"\n                    \"subs       %0, #1              \\n\"\n                    \"vst1.f32   {d0-d3}, [%1 :128]! 
\\n\"\n                    \"bne        0b                  \\n\"\n                    \"sub        %2, #32             \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2),     // %4\n                    \"=r\"(r3)      // %5\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k0), // %12\n                    \"w\"(_k1), // %13\n                    \"w\"(_k2), // %14\n                    \"w\"(_k3)  // %15\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                float sum = *r0 * k0;\n                float sum1 = *r1 * k1;\n                float sum2 = *r2 * k2;\n                float sum3 = *r3 * k3;\n\n                *outptr += sum + sum1 + sum2 + sum3;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr++;\n            }\n        }\n\n        for (; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float k0 = kernel0[0];\n\n            const float* r0 = img0;\n\n            int size = outw * outh;\n\n#if __ARM_NEON\n            int nn = size >> 3;\n            int remain = size & 7;\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _k0 = vdupq_n_f32(k0);\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm       pldl1keep, [%2, #256]          \\n\"\n         
           \"ld1        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                    \"0:                                        \\n\"\n                    \"prfm       pldl1keep, [%1, #256]          \\n\"\n                    \"ld1        {v0.4s, v1.4s}, [%1]           \\n\"\n                    \"fmla       v0.4s, v2.4s, %6.4s            \\n\"\n                    \"fmla       v1.4s, v3.4s, %6.4s            \\n\"\n                    \"prfm       pldl1keep, [%2, #256]          \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                    \"subs       %w0, %w0, #1                   \\n\"\n                    \"st1        {v0.4s, v1.4s}, [%1], #32      \\n\"\n                    \"bne        0b                             \\n\"\n                    \"sub        %2, %2, #32                    \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0)      // %2\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"w\"(_k0) // %6\n                    : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%2 :128]! \\n\"\n                    \"0:                             \\n\"\n                    \"pld        [%1, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%1 :128]  \\n\"\n                    \"vmla.f32   q0, q2, %q6         \\n\"\n                    \"vmla.f32   q1, q3, %q6         \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%2 :128]! \\n\"\n                    \"subs       %0, #1              \\n\"\n                    \"vst1.f32   {d0-d3}, [%1 :128]! 
\\n\"\n                    \"bne        0b                  \\n\"\n                    \"sub        %2, #32             \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0)      // %2\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"w\"(_k0) // %6\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                float sum = *r0 * k0;\n\n                *outptr += sum;\n\n                r0++;\n                outptr++;\n            }\n        }\n    }\n}\n\nstatic void conv1x1s2_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    int nn_outch = outch >> 2;\n    int remain_outch_start = nn_outch << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 4;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n        Mat out2 = top_blob.channel(p + 2);\n        Mat out3 = top_blob.channel(p + 3);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float bias1 = bias ? bias[p + 1] : 0.f;\n        const float bias2 = bias ? bias[p + 2] : 0.f;\n        const float bias3 = bias ? 
bias[p + 3] : 0.f;\n\n        out0.fill(bias0);\n        out1.fill(bias1);\n        out2.fill(bias2);\n        out3.fill(bias3);\n\n        int q = 0;\n\n        for (; q + 3 < inch; q += 4)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n\n            const float* img0 = bottom_blob.channel(q);\n            const float* img1 = bottom_blob.channel(q + 1);\n            const float* img2 = bottom_blob.channel(q + 2);\n            const float* img3 = bottom_blob.channel(q + 3);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n\n            const float* r0 = img0;\n            const float* r1 = img1;\n            const float* r2 = img2;\n            const float* r3 = img3;\n\n            for (int i = 0; i < outh; i++)\n            {\n                int size = outw;\n\n#if __ARM_NEON\n                int nn = size >> 3;\n                int remain = size & 7;\n#else\n                int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n                float32x4_t _k0 = vld1q_f32(kernel0);\n                float32x4_t _k1 = vld1q_f32(kernel1);\n                float32x4_t _k2 = vld1q_f32(kernel2);\n                float32x4_t _k3 = vld1q_f32(kernel3);\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                        \\n\"\n\n                        \"prfm       pldl1keep, [%5, #512]          \\n\"\n                        \"ld2        {v4.4s, v5.4s}, [%5], #32      \\n\"\n                        \"ld2        {v6.4s, v7.4s}, [%5], #32      \\n\"\n                        \"and        v5.16b, v6.16b, v6.16b         \\n\" // v4 v5\n\n                        
\"prfm       pldl1keep, [%1, #256]          \\n\"\n                        \"ld1        {v8.4s, v9.4s}, [%1]           \\n\"\n\n                        \"fmla       v8.4s, v4.4s, %18.s[0]         \\n\"\n                        \"fmla       v9.4s, v5.4s, %18.s[0]         \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld1        {v10.4s, v11.4s}, [%2]         \\n\"\n\n                        \"fmla       v10.4s, v4.4s, %19.s[0]        \\n\"\n                        \"fmla       v11.4s, v5.4s, %19.s[0]        \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n                        \"ld1        {v12.4s, v13.4s}, [%3]         \\n\"\n\n                        \"fmla       v12.4s, v4.4s, %20.s[0]        \\n\"\n                        \"fmla       v13.4s, v5.4s, %20.s[0]        \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n                        \"ld1        {v14.4s, v15.4s}, [%4]         \\n\"\n\n                        \"prfm       pldl1keep, [%6, #512]          \\n\"\n                        \"ld2        {v6.4s, v7.4s}, [%6], #32      \\n\"\n\n                        \"fmla       v14.4s, v4.4s, %21.s[0]        \\n\"\n                        \"fmla       v15.4s, v5.4s, %21.s[0]        \\n\"\n\n                        \"ld2        {v4.4s, v5.4s}, [%6], #32      \\n\"\n                        \"and        v7.16b, v4.16b, v4.16b         \\n\" // v6 v7\n\n                        \"fmla       v8.4s, v6.4s, %18.s[1]         \\n\"\n                        \"fmla       v9.4s, v7.4s, %18.s[1]         \\n\"\n\n                        \"fmla       v10.4s, v6.4s, %19.s[1]        \\n\"\n                        \"fmla       v11.4s, v7.4s, %19.s[1]        \\n\"\n\n                        \"fmla       v12.4s, v6.4s, %20.s[1]        \\n\"\n                        \"fmla       v13.4s, v7.4s, %20.s[1]        \\n\"\n\n                        \"prfm       
pldl1keep, [%7, #512]          \\n\"\n                        \"ld2        {v4.4s, v5.4s}, [%7], #32      \\n\"\n\n                        \"fmla       v14.4s, v6.4s, %21.s[1]        \\n\"\n                        \"fmla       v15.4s, v7.4s, %21.s[1]        \\n\"\n\n                        \"ld2        {v6.4s, v7.4s}, [%7], #32      \\n\"\n                        \"and        v5.16b, v6.16b, v6.16b         \\n\" // v4 v5\n\n                        \"fmla       v8.4s, v4.4s, %18.s[2]         \\n\"\n                        \"fmla       v9.4s, v5.4s, %18.s[2]         \\n\"\n\n                        \"fmla       v10.4s, v4.4s, %19.s[2]        \\n\"\n                        \"fmla       v11.4s, v5.4s, %19.s[2]        \\n\"\n\n                        \"fmla       v12.4s, v4.4s, %20.s[2]        \\n\"\n                        \"fmla       v13.4s, v5.4s, %20.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%8, #512]          \\n\"\n                        \"ld2        {v6.4s, v7.4s}, [%8], #32      \\n\"\n\n                        \"fmla       v14.4s, v4.4s, %21.s[2]        \\n\"\n                        \"fmla       v15.4s, v5.4s, %21.s[2]        \\n\"\n\n                        \"ld2        {v4.4s, v5.4s}, [%8], #32      \\n\"\n                        \"and        v7.16b, v4.16b, v4.16b         \\n\" // v6 v7\n\n                        \"fmla       v8.4s, v6.4s, %18.s[3]         \\n\"\n                        \"fmla       v9.4s, v7.4s, %18.s[3]         \\n\"\n\n                        \"fmla       v10.4s, v6.4s, %19.s[3]        \\n\"\n                        \"fmla       v11.4s, v7.4s, %19.s[3]        \\n\"\n\n                        \"st1        {v8.4s, v9.4s}, [%1], #32      \\n\"\n\n                        \"fmla       v12.4s, v6.4s, %20.s[3]        \\n\"\n                        \"fmla       v13.4s, v7.4s, %20.s[3]        \\n\"\n\n                        \"st1        {v10.4s, v11.4s}, [%2], #32    \\n\"\n\n                        \"fmla       
v14.4s, v6.4s, %21.s[3]        \\n\"\n                        \"fmla       v15.4s, v7.4s, %21.s[3]        \\n\"\n\n                        \"st1        {v12.4s, v13.4s}, [%3], #32    \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"st1        {v14.4s, v15.4s}, [%4], #32    \\n\"\n\n                        \"bne        0b                             \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(r0),      // %5\n                        \"=r\"(r1),      // %6\n                        \"=r\"(r2),      // %7\n                        \"=r\"(r3)       // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(r0),\n                        \"6\"(r1),\n                        \"7\"(r2),\n                        \"8\"(r3),\n                        \"w\"(_k0), // %18\n                        \"w\"(_k1), // %19\n                        \"w\"(_k2), // %20\n                        \"w\"(_k3)  // %21\n                        : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                             \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vld2.f32   {d8-d11}, [%5]!     \\n\"\n                        \"vld2.f32   {d12-d15}, [%5]!    
\\n\"\n                        \"vand       q5, q6, q6          \\n\" // q4 q5\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d16-d19}, [%1]     \\n\"\n\n                        \"vmla.f32   q8, q4, %e18[0]     \\n\"\n                        \"vmla.f32   q9, q5, %e18[0]     \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.f32   {d20-d23}, [%2]     \\n\"\n\n                        \"vmla.f32   q10, q4, %e19[0]    \\n\"\n                        \"vmla.f32   q11, q5, %e19[0]    \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%3]     \\n\"\n\n                        \"vmla.f32   q12, q4, %e20[0]    \\n\"\n                        \"vmla.f32   q13, q5, %e20[0]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.f32   {d28-d31}, [%4]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vld2.f32   {d12-d15}, [%6]!    \\n\"\n\n                        \"vmla.f32   q14, q4, %e21[0]    \\n\"\n                        \"vmla.f32   q15, q5, %e21[0]    \\n\"\n\n                        \"vld2.f32   {d8-d11}, [%6]!     \\n\"\n                        \"vand       q7, q4, q4          \\n\" // q6 q7\n\n                        \"vmla.f32   q8, q6, %e18[1]     \\n\"\n                        \"vmla.f32   q9, q7, %e18[1]     \\n\"\n\n                        \"vmla.f32   q10, q6, %e19[1]    \\n\"\n                        \"vmla.f32   q11, q7, %e19[1]    \\n\"\n\n                        \"vmla.f32   q12, q6, %e20[1]    \\n\"\n                        \"vmla.f32   q13, q7, %e20[1]    \\n\"\n\n                        \"pld        [%7, #512]          \\n\"\n                        \"vld2.f32   {d8-d11}, [%7]!     
\\n\"\n\n                        \"vmla.f32   q14, q6, %e21[1]    \\n\"\n                        \"vmla.f32   q15, q7, %e21[1]    \\n\"\n\n                        \"vld2.f32   {d12-d15}, [%7]!    \\n\"\n                        \"vand       q5, q6, q6          \\n\" // q4 q5\n\n                        \"vmla.f32   q8, q4, %f18[0]     \\n\"\n                        \"vmla.f32   q9, q5, %f18[0]     \\n\"\n\n                        \"vmla.f32   q10, q4, %f19[0]    \\n\"\n                        \"vmla.f32   q11, q5, %f19[0]    \\n\"\n\n                        \"vmla.f32   q12, q4, %f20[0]    \\n\"\n                        \"vmla.f32   q13, q5, %f20[0]    \\n\"\n\n                        \"pld        [%8, #512]          \\n\"\n                        \"vld2.f32   {d12-d15}, [%8]!    \\n\"\n\n                        \"vmla.f32   q14, q4, %f21[0]    \\n\"\n                        \"vmla.f32   q15, q5, %f21[0]    \\n\"\n\n                        \"vld2.f32   {d8-d11}, [%8]!     \\n\"\n                        \"vand       q7, q4, q4          \\n\" // q6 q7\n\n                        \"vmla.f32   q8, q6, %f18[1]     \\n\"\n                        \"vmla.f32   q9, q7, %f18[1]     \\n\"\n\n                        \"vmla.f32   q10, q6, %f19[1]    \\n\"\n                        \"vmla.f32   q11, q7, %f19[1]    \\n\"\n\n                        \"vst1.f32   {d16-d19}, [%1]!    \\n\"\n\n                        \"vmla.f32   q12, q6, %f20[1]    \\n\"\n                        \"vmla.f32   q13, q7, %f20[1]    \\n\"\n\n                        \"vst1.f32   {d20-d23}, [%2]!    \\n\"\n\n                        \"vmla.f32   q14, q6, %f21[1]    \\n\"\n                        \"vmla.f32   q15, q7, %f21[1]    \\n\"\n\n                        \"vst1.f32   {d24-d27}, [%3]!    \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.f32   {d28-d31}, [%4]!    
\\n\"\n\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(r0),      // %5\n                        \"=r\"(r1),      // %6\n                        \"=r\"(r2),      // %7\n                        \"=r\"(r3)       // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(r0),\n                        \"6\"(r1),\n                        \"7\"(r2),\n                        \"8\"(r3),\n                        \"w\"(_k0), // %18\n                        \"w\"(_k1), // %19\n                        \"w\"(_k2), // %20\n                        \"w\"(_k3)  // %21\n                        : \"cc\", \"memory\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    // TODO neon optimize\n                    float sum0 = *r0 * kernel0[0] + *r1 * kernel0[1] + *r2 * kernel0[2] + *r3 * kernel0[3];\n                    float sum1 = *r0 * kernel1[0] + *r1 * kernel1[1] + *r2 * kernel1[2] + *r3 * kernel1[3];\n                    float sum2 = *r0 * kernel2[0] + *r1 * kernel2[1] + *r2 * kernel2[2] + *r3 * kernel2[3];\n                    float sum3 = *r0 * kernel3[0] + *r1 * kernel3[1] + *r2 * kernel3[2] + *r3 * kernel3[3];\n\n                    *outptr0 += sum0;\n                    *outptr1 += sum1;\n                    *outptr2 += sum2;\n                    *outptr3 += sum3;\n\n                    r0 += 2;\n                    r1 
+= 2;\n                    r2 += 2;\n                    r3 += 2;\n                    outptr0++;\n                    outptr1++;\n                    outptr2++;\n                    outptr3++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n            }\n        }\n\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float* kernel1 = kernel + (p + 1) * inch + q;\n            const float* kernel2 = kernel + (p + 2) * inch + q;\n            const float* kernel3 = kernel + (p + 3) * inch + q;\n\n            const float k0 = kernel0[0];\n            const float k1 = kernel1[0];\n            const float k2 = kernel2[0];\n            const float k3 = kernel3[0];\n\n            const float* r0 = img0;\n\n            for (int i = 0; i < outh; i++)\n            {\n                int size = outw;\n\n#if __ARM_NEON\n                int nn = size >> 3;\n                int remain = size & 7;\n#else\n                int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n                float32x4_t _k0 = vdupq_n_f32(k0);\n                float32x4_t _k1 = vdupq_n_f32(k1);\n                float32x4_t _k2 = vdupq_n_f32(k2);\n                float32x4_t _k3 = vdupq_n_f32(k3);\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                        \\n\"\n\n                        \"prfm       pldl1keep, [%5, #512]          \\n\"\n                        \"ld2        {v4.4s, v5.4s}, [%5], #32      \\n\"\n                        \"ld2        {v6.4s, v7.4s}, [%5], #32      \\n\"\n                        \"and  
      v5.16b, v6.16b, v6.16b         \\n\"\n\n                        \"prfm       pldl1keep, [%1, #256]          \\n\"\n                        \"ld1        {v8.4s, v9.4s}, [%1]           \\n\"\n\n                        \"fmla       v8.4s, v4.4s, %12.4s           \\n\"\n                        \"fmla       v9.4s, v5.4s, %12.4s           \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld1        {v10.4s, v11.4s}, [%2]         \\n\"\n\n                        \"fmla       v10.4s, v4.4s, %13.4s          \\n\"\n                        \"fmla       v11.4s, v5.4s, %13.4s          \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n                        \"ld1        {v12.4s, v13.4s}, [%3]         \\n\"\n\n                        \"st1        {v8.4s, v9.4s}, [%1], #32      \\n\"\n\n                        \"fmla       v12.4s, v4.4s, %14.4s          \\n\"\n                        \"fmla       v13.4s, v5.4s, %14.4s          \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n                        \"ld1        {v14.4s, v15.4s}, [%4]         \\n\"\n\n                        \"st1        {v10.4s, v11.4s}, [%2], #32    \\n\"\n\n                        \"fmla       v14.4s, v4.4s, %15.4s          \\n\"\n                        \"fmla       v15.4s, v5.4s, %15.4s          \\n\"\n\n                        \"st1        {v12.4s, v13.4s}, [%3], #32    \\n\"\n                        \"subs       %w0, %w0, #1                   \\n\"\n\n                        \"st1        {v14.4s, v15.4s}, [%4], #32    \\n\"\n                        \"bne        0b                             \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(r0)      
 // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(r0),\n                        \"w\"(_k0), // %12\n                        \"w\"(_k1), // %13\n                        \"w\"(_k2), // %14\n                        \"w\"(_k3)  // %15\n                        : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                             \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vld2.f32   {d8-d11}, [%5]!     \\n\"\n                        \"vld2.f32   {d12-d15}, [%5]!    \\n\"\n                        \"vand       q5, q6, q6          \\n\" // q4 q5\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d16-d19}, [%1]     \\n\"\n\n                        \"vmla.f32   q8, q4, %q12        \\n\"\n                        \"vmla.f32   q9, q5, %q12        \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.f32   {d20-d23}, [%2]     \\n\"\n\n                        \"vmla.f32   q10, q4, %q13       \\n\"\n                        \"vmla.f32   q11, q5, %q13       \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%3]     \\n\"\n\n                        \"vst1.f32   {d16-d19}, [%1]!    
\\n\"\n\n                        \"vmla.f32   q12, q4, %q14       \\n\"\n                        \"vmla.f32   q13, q5, %q14       \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.f32   {d28-d31}, [%4]     \\n\"\n\n                        \"vst1.f32   {d20-d23}, [%2]!    \\n\"\n\n                        \"vmla.f32   q14, q4, %q15       \\n\"\n                        \"vmla.f32   q15, q5, %q15       \\n\"\n\n                        \"vst1.f32   {d24-d27}, [%3]!    \\n\"\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.f32   {d28-d31}, [%4]!    \\n\"\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(r0)       // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(r0),\n                        \"w\"(_k0), // %12\n                        \"w\"(_k1), // %13\n                        \"w\"(_k2), // %14\n                        \"w\"(_k3)  // %15\n                        : \"cc\", \"memory\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    // TODO neon optimize\n                    float sum0 = *r0 * k0;\n                    float sum1 = *r0 * k1;\n                    float sum2 = *r0 * k2;\n                    float sum3 = *r0 * k3;\n\n                    *outptr0 += sum0;\n                    *outptr1 += 
sum1;\n                    *outptr2 += sum2;\n                    *outptr3 += sum3;\n\n                    r0 += 2;\n                    outptr0++;\n                    outptr1++;\n                    outptr2++;\n                    outptr3++;\n                }\n\n                r0 += tailstep;\n            }\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        int q = 0;\n\n        for (; q + 3 < inch; q += 4)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n            const float* img1 = bottom_blob.channel(q + 1);\n            const float* img2 = bottom_blob.channel(q + 2);\n            const float* img3 = bottom_blob.channel(q + 3);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float k0 = kernel0[0];\n            const float k1 = kernel0[1];\n            const float k2 = kernel0[2];\n            const float k3 = kernel0[3];\n\n            const float* r0 = img0;\n            const float* r1 = img1;\n            const float* r2 = img2;\n            const float* r3 = img3;\n\n            for (int i = 0; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 3;\n                int remain = outw & 7;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n                float32x4_t _k0 = vdupq_n_f32(k0);\n                float32x4_t _k1 = vdupq_n_f32(k1);\n                float32x4_t _k2 = vdupq_n_f32(k2);\n                float32x4_t _k3 = vdupq_n_f32(k3);\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%2, #512]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n  
                      \"ld2        {v8.4s, v9.4s}, [%2], #32      \\n\"\n                        \"0:                                        \\n\"\n\n                        \"prfm       pldl1keep, [%1, #256]          \\n\"\n                        \"ld1        {v0.4s, v1.4s}, [%1]           \\n\"\n                        \"fmla       v0.4s, v2.4s, %12.4s           \\n\"\n                        \"fmla       v1.4s, v8.4s, %12.4s           \\n\"\n\n                        \"prfm       pldl1keep, [%3, #512]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%3], #32      \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%3], #32      \\n\"\n                        \"fmla       v0.4s, v2.4s, %13.4s           \\n\"\n                        \"fmla       v1.4s, v8.4s, %13.4s           \\n\"\n\n                        \"prfm       pldl1keep, [%4, #512]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%4], #32      \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%4], #32      \\n\"\n                        \"fmla       v0.4s, v2.4s, %14.4s           \\n\"\n                        \"fmla       v1.4s, v8.4s, %14.4s           \\n\"\n\n                        \"prfm       pldl1keep, [%5, #512]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%5], #32      \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%5], #32      \\n\"\n                        \"fmla       v0.4s, v2.4s, %15.4s           \\n\"\n                        \"fmla       v1.4s, v8.4s, %15.4s           \\n\"\n\n                        \"prfm       pldl1keep, [%2, #512]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%2], #32      \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"st1        {v0.4s, v1.4s}, [%1], #32      \\n\"\n                        \"bne        0b 
                            \\n\"\n                        \"sub        %2, %2, #64                    \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3)      // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"w\"(_k0), // %12\n                        \"w\"(_k1), // %13\n                        \"w\"(_k2), // %14\n                        \"w\"(_k3)  // %15\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v8\", \"v9\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%2, #512]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n                        \"vld2.f32   {d16-d19}, [%2]!    \\n\"\n                        \"0:                             \\n\"\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%1]       \\n\"\n                        \"vmla.f32   q0, q2, %q12        \\n\"\n                        \"vmla.f32   q1, q8, %q12        \\n\"\n                        \"pld        [%3, #512]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%3]!      \\n\"\n                        \"vld2.f32   {d16-d19}, [%3]!    \\n\"\n                        \"vmla.f32   q0, q2, %q13        \\n\"\n                        \"vmla.f32   q1, q8, %q13        \\n\"\n                        \"pld        [%4, #512]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%4]!      
\\n\"\n                        \"vld2.f32   {d16-d19}, [%4]!    \\n\"\n                        \"vmla.f32   q0, q2, %q14        \\n\"\n                        \"vmla.f32   q1, q8, %q14        \\n\"\n                        \"pld        [%5, #512]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%5]!      \\n\"\n                        \"vld2.f32   {d16-d19}, [%5]!    \\n\"\n                        \"vmla.f32   q0, q2, %q15        \\n\"\n                        \"vmla.f32   q1, q8, %q15        \\n\"\n                        \"pld        [%2, #512]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n                        \"vld2.f32   {d16-d19}, [%2]!    \\n\"\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.f32   {d0-d3}, [%1]!      \\n\"\n                        \"bne        0b                  \\n\"\n                        \"sub        %2, #64             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3)      // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"w\"(_k0), // %12\n                        \"w\"(_k1), // %13\n                        \"w\"(_k2), // %14\n                        \"w\"(_k3)  // %15\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    float sum = *r0 * k0;\n                    float sum1 = *r1 * k1;\n                    float sum2 = *r2 * 
k2;\n                    float sum3 = *r3 * k3;\n\n                    *outptr += sum + sum1 + sum2 + sum3;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    r3 += 2;\n                    outptr++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n            }\n        }\n\n        for (; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch + q;\n            const float k0 = kernel0[0];\n\n            const float* r0 = img0;\n\n            for (int i = 0; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 3;\n                int remain = outw & 7;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n                float32x4_t _k0 = vdupq_n_f32(k0);\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%2, #512]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%2], #32      \\n\"\n\n                        \"0:                                        \\n\"\n\n                        \"prfm       pldl1keep, [%1, #256]          \\n\"\n                        \"ld1        {v0.4s, v1.4s}, [%1]           \\n\"\n                        \"fmla       v0.4s, v2.4s, %6.4s            \\n\"\n                        \"fmla       v1.4s, v8.4s, %6.4s            \\n\"\n\n                        \"prfm       pldl1keep, [%2, #512]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%2], #32      \\n\"\n\n                        \"subs       %w0, %w0, #1         
          \\n\"\n                        \"st1        {v0.4s, v1.4s}, [%1], #32      \\n\"\n                        \"bne        0b                             \\n\"\n                        \"sub        %2, %2, #64                    \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0)      // %2\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"w\"(_k0) // %6\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v8\", \"v9\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%2, #512]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n                        \"vld2.f32   {d16-d19}, [%2]!    \\n\"\n                        \"0:                             \\n\"\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%1]       \\n\"\n                        \"vmla.f32   q0, q2, %q6         \\n\"\n                        \"vmla.f32   q1, q8, %q6         \\n\"\n                        \"pld        [%2, #512]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n                        \"vld2.f32   {d16-d19}, [%2]!    \\n\"\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.f32   {d0-d3}, [%1]!      
\\n\"\n                        \"bne        0b                  \\n\"\n                        \"sub        %2, #64             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0)      // %2\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"w\"(_k0) // %6\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    float sum = *r0 * k0;\n\n                    *outptr += sum;\n\n                    r0 += 2;\n                    outptr++;\n                }\n\n                r0 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_2x2.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv2x2s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        int q = 0;\n\n        for (; q + 1 < inch; q += 2)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n            const float* img1 = bottom_blob.channel(q + 1);\n\n            const float* kernel0 = kernel + p * inch * 4 + q * 4;\n            const float* kernel1 = kernel0 + 4;\n\n            const float* r00 = img0;\n            const float* r01 = img0 + w;\n\n            const float* r10 = img1;\n            const float* r11 = img1 + w;\n\n#if __ARM_NEON\n            float32x4_t _k0 = vld1q_f32(kernel0);\n            float32x4_t _k1 = vld1q_f32(kernel1);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v0.4s}, [%1], #16             \\n\"\n                        \"prfm       pldl1keep, [%2, #128]          \\n\"\n                        \"ld1        {v2.4s}, [%2], #16             \\n\"\n                        \"prfm       pldl1keep, [%3, 
#128]          \\n\"\n                        \"ld1        {v12.4s}, [%3], #16            \\n\"\n                        \"prfm       pldl1keep, [%4, #128]          \\n\"\n                        \"ld1        {v14.4s}, [%4], #16            \\n\"\n\n                        \"0:                                        \\n\"\n                        \"prfm       pldl1keep, [%5, #128]          \\n\"\n                        \"ld1        {v9.4s}, [%5]                  \\n\"\n\n                        \"fmul       v8.4s, v0.4s, %12.s[0]         \\n\"\n                        \"fmla       v9.4s, v2.4s, %12.s[2]         \\n\"\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v1.4s}, [%1], #16             \\n\"\n\n                        \"prfm       pldl1keep, [%2, #128]          \\n\"\n                        \"ld1        {v3.4s}, [%2], #16             \\n\"\n\n                        \"ext        v10.16b, v0.16b, v1.16b, #4    \\n\"\n                        \"ext        v11.16b, v2.16b, v3.16b, #4    \\n\"\n\n                        \"fmla       v8.4s, v12.4s, %13.s[0]        \\n\"\n                        \"fmla       v9.4s, v14.4s, %13.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%3, #128]          \\n\"\n                        \"ld1        {v13.4s}, [%3], #16            \\n\"\n\n                        \"prfm       pldl1keep, [%4, #128]          \\n\"\n                        \"ld1        {v15.4s}, [%4], #16            \\n\"\n\n                        \"fmla       v8.4s, v10.4s, %12.s[1]        \\n\"\n                        \"fmla       v9.4s, v11.4s, %12.s[3]        \\n\"\n\n                        \"ext        v10.16b, v12.16b, v13.16b, #4  \\n\"\n                        \"ext        v11.16b, v14.16b, v15.16b, #4  \\n\"\n\n                        \"fmla       v8.4s, v10.4s, %13.s[1]        \\n\"\n                        \"fmla       v9.4s, v11.4s, %13.s[3]        \\n\"\n\n    
                    \"orr        v0.16b, v1.16b, v1.16b         \\n\"\n                        \"orr        v2.16b, v3.16b, v3.16b         \\n\"\n\n                        \"fadd       v8.4s, v8.4s, v9.4s            \\n\"\n\n                        \"orr        v12.16b, v13.16b, v13.16b      \\n\"\n                        \"orr        v14.16b, v15.16b, v15.16b      \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"st1        {v8.4s}, [%5], #16             \\n\"\n                        \"bne        0b                             \\n\"\n                        \"sub        %1, %1, #16                    \\n\"\n                        \"sub        %2, %2, #16                    \\n\"\n                        \"sub        %3, %3, #16                    \\n\"\n                        \"sub        %4, %4, #16                    \\n\"\n                        : \"=r\"(nn),    // %0\n                        \"=r\"(r00),   // %1\n                        \"=r\"(r01),   // %2\n                        \"=r\"(r10),   // %3\n                        \"=r\"(r11),   // %4\n                        \"=r\"(outptr) // %5\n                        : \"0\"(nn),\n                        \"1\"(r00),\n                        \"2\"(r01),\n                        \"3\"(r10),\n                        \"4\"(r11),\n                        \"5\"(outptr),\n                        \"w\"(_k0), // %12\n                        \"w\"(_k1)  // %13\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1]!      \\n\"\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d4-d5}, [%2]!    
  \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%3]!    \\n\"\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.f32   {d28-d29}, [%4]!    \\n\"\n\n                        \"0:                             \\n\"\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.f32   {d18-d19}, [%5]     \\n\" // q9 = sum\n\n                        \"vmul.f32   q8, q0, %e12[0]     \\n\"\n                        \"vmla.f32   q9, q2, %f12[0]     \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d2-d3}, [%1]!      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d6-d7}, [%2]!      \\n\"\n\n                        \"vext.f32   q10, q0, q1, #1     \\n\"\n                        \"vext.f32   q11, q2, q3, #1     \\n\"\n\n                        \"vmla.f32   q8, q12, %e13[0]    \\n\"\n                        \"vmla.f32   q9, q14, %f13[0]    \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d26-d27}, [%3]!    \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.f32   {d30-d31}, [%4]!    
\\n\"\n\n                        \"vmla.f32   q8, q10, %e12[1]    \\n\"\n                        \"vmla.f32   q9, q11, %f12[1]    \\n\"\n\n                        \"vext.f32   q10, q12, q13, #1   \\n\"\n                        \"vext.f32   q11, q14, q15, #1   \\n\"\n\n                        \"vmla.f32   q8, q10, %e13[1]    \\n\"\n                        \"vmla.f32   q9, q11, %f13[1]    \\n\"\n\n                        \"vorr       q0, q1, q1          \\n\"\n                        \"vorr       q2, q3, q3          \\n\"\n\n                        \"vadd.f32   q8, q8, q9          \\n\"\n\n                        \"vorr       q12, q13, q13       \\n\"\n                        \"vorr       q14, q15, q15       \\n\"\n\n                        \"subs       %0, #1              \\n\"\n\n                        \"vst1.f32   {d16-d17}, [%5]!    \\n\"\n\n                        \"bne        0b                  \\n\"\n                        \"sub        %1, #16             \\n\"\n                        \"sub        %2, #16             \\n\"\n                        \"sub        %3, #16             \\n\"\n                        \"sub        %4, #16             \\n\"\n                        : \"=r\"(nn),    // %0\n                        \"=r\"(r00),   // %1\n                        \"=r\"(r01),   // %2\n                        \"=r\"(r10),   // %3\n                        \"=r\"(r11),   // %4\n                        \"=r\"(outptr) // %5\n                        : \"0\"(nn),\n                        \"1\"(r00),\n                        \"2\"(r01),\n                        \"3\"(r10),\n                        \"4\"(r11),\n                        \"5\"(outptr),\n                        \"w\"(_k0), // %12\n                        \"w\"(_k1)  // %13\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n\n               
 for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x2_t _r00 = vld1_f32(r00);\n                    float32x2_t _r01 = vld1_f32(r01);\n                    float32x4_t _r00r1 = vcombine_f32(_r00, _r01);\n                    float32x4_t _s0s1 = vmulq_f32(_r00r1, _k0);\n\n                    float32x2_t _r10 = vld1_f32(r10);\n                    float32x2_t _r11 = vld1_f32(r11);\n                    float32x4_t _r10r1 = vcombine_f32(_r10, _r11);\n                    _s0s1 = vmlaq_f32(_s0s1, _r10r1, _k1);\n\n                    float32x2_t _s = vadd_f32(vget_low_f32(_s0s1), vget_high_f32(_s0s1));\n                    _s = vpadd_f32(_s, _s);\n                    *outptr += vget_lane_f32(_s, 0);\n#else\n                    float sum = 0.f;\n\n                    sum += r00[0] * kernel0[0];\n                    sum += r00[1] * kernel0[1];\n                    sum += r01[0] * kernel0[2];\n                    sum += r01[1] * kernel0[3];\n\n                    sum += r10[0] * kernel1[0];\n                    sum += r10[1] * kernel1[1];\n                    sum += r11[0] * kernel1[2];\n                    sum += r11[1] * kernel1[3];\n\n                    *outptr += sum;\n#endif // __ARM_NEON\n\n                    r00 += 1;\n                    r01 += 1;\n                    r10 += 1;\n                    r11 += 1;\n                    outptr++;\n                }\n\n                r00 += 1;\n                r01 += 1;\n                r10 += 1;\n                r11 += 1;\n            }\n        }\n\n        for (; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 4 + q * 4;\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n\n#if __ARM_NEON\n            float32x4_t _k0 = vdupq_n_f32(kernel0[0]);\n            float32x4_t _k1 = vdupq_n_f32(kernel0[1]);\n            float32x4_t _k2 
= vdupq_n_f32(kernel0[2]);\n            float32x4_t _k3 = vdupq_n_f32(kernel0[3]);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v0.4s}, [%1], #16             \\n\"\n                        \"prfm       pldl1keep, [%2, #128]          \\n\"\n                        \"ld1        {v2.4s}, [%2], #16             \\n\"\n\n                        \"0:                                        \\n\"\n                        \"prfm       pldl1keep, [%3, #128]          \\n\"\n                        \"ld1        {v9.4s}, [%3]                  \\n\"\n\n                        \"fmul       v8.4s, v0.4s, %8.4s            \\n\"\n                        \"fmla       v9.4s, v2.4s, %10.4s           \\n\"\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v1.4s}, [%1], #16             \\n\"\n                        \"ext        v10.16b, v0.16b, v1.16b, #4    \\n\"\n\n                        \"fmla       v8.4s, v10.4s, %9.4s           \\n\"\n\n                        \"prfm       pldl1keep, [%2, #128]          \\n\"\n                        \"ld1        {v3.4s}, [%2], #16             \\n\"\n                        \"ext        v11.16b, v2.16b, v3.16b, #4    \\n\"\n\n                        \"fmla       v9.4s, v11.4s, %11.4s          \\n\"\n\n                        \"orr        v0.16b, v1.16b, v1.16b         \\n\"\n                        \"fadd       v8.4s, v8.4s, v9.4s            \\n\"\n                        \"orr        v2.16b, v3.16b, v3.16b         \\n\"\n\n                        \"subs       
%w0, %w0, #1                   \\n\"\n                        \"st1        {v8.4s}, [%3], #16             \\n\"\n                        \"bne        0b                             \\n\"\n                        \"sub        %1, %1, #16                    \\n\"\n                        \"sub        %2, %2, #16                    \\n\"\n                        : \"=r\"(nn),    // %0\n                        \"=r\"(r0),    // %1\n                        \"=r\"(r1),    // %2\n                        \"=r\"(outptr) // %3\n                        : \"0\"(nn),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(outptr),\n                        \"w\"(_k0), // %8\n                        \"w\"(_k1), // %9\n                        \"w\"(_k2), // %10\n                        \"w\"(_k3)  // %11\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v8\", \"v9\", \"v10\", \"v11\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1]!      \\n\"\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d4-d5}, [%2]!      \\n\"\n\n                        \"0:                             \\n\"\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d18-d19}, [%3]     \\n\" // q9 = sum\n\n                        \"vmul.f32   q8, q0, %q8         \\n\"\n                        \"vmla.f32   q9, q2, %q10        \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d2-d3}, [%1]!      
\\n\"\n                        \"vext.f32   q10, q0, q1, #1     \\n\"\n\n                        \"vmla.f32   q8, q10, %q9        \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d6-d7}, [%2]!      \\n\"\n                        \"vext.f32   q11, q2, q3, #1     \\n\"\n\n                        \"vmla.f32   q9, q11, %q11       \\n\"\n\n                        \"vorr       q0, q1, q1          \\n\"\n                        \"vadd.f32   q8, q8, q9          \\n\"\n                        \"vorr       q2, q3, q3          \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.f32   {d16-d17}, [%3]!    \\n\"\n                        \"bne        0b                  \\n\"\n                        \"sub        %1, #16             \\n\"\n                        \"sub        %2, #16             \\n\"\n                        : \"=r\"(nn),    // %0\n                        \"=r\"(r0),    // %1\n                        \"=r\"(r1),    // %2\n                        \"=r\"(outptr) // %3\n                        : \"0\"(nn),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(outptr),\n                        \"w\"(_k0), // %8\n                        \"w\"(_k1), // %9\n                        \"w\"(_k2), // %10\n                        \"w\"(_k3)  // %11\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n                float32x4_t _k0123 = vld1q_f32(kernel0);\n#endif\n\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x2_t _r0 = vld1_f32(r0);\n                    float32x2_t _r1 = vld1_f32(r1);\n                    float32x4_t _r0r1 = vcombine_f32(_r0, _r1);\n                    float32x4_t _s0s1 = vmulq_f32(_r0r1, 
_k0123);\n                    float32x2_t _s = vadd_f32(vget_low_f32(_s0s1), vget_high_f32(_s0s1));\n                    _s = vpadd_f32(_s, _s);\n                    *outptr += vget_lane_f32(_s, 0);\n#else\n                    float sum = 0.f;\n                    sum += r0[0] * kernel0[0];\n                    sum += r0[1] * kernel0[1];\n                    sum += r1[0] * kernel0[2];\n                    sum += r1[1] * kernel0[3];\n                    *outptr += sum;\n#endif\n\n                    r0 += 1;\n                    r1 += 1;\n                    outptr++;\n                }\n\n                r0 += 1;\n                r1 += 1;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    int nn_outch = outch >> 1;\n    int remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float bias1 = bias ? bias[p + 1] : 0.f;\n\n        out0.fill(bias0);\n        out1.fill(bias1);\n\n        const float* k0 = kernel + p * inch * 9;\n        const float* k1 = kernel + (p + 1) * inch * 9;\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr0n = outptr0 + outw;\n            float* outptr1n = outptr1 + outw;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n            const float* r3 = img0 + w * 3;\n\n#if __ARM_NEON\n            float32x4_t _k00 = vld1q_f32(k0);\n            float32x4_t _k03 = vld1q_f32(k0 + 3);\n            float32x4_t _k06 = vld1q_f32(k0 + 6);\n\n            float32x4_t _k10 = vld1q_f32(k1);\n            float32x4_t _k13 = vld1q_f32(k1 + 3);\n            float32x4_t _k16 = vld1q_f32(k1 + 6);\n#endif // __ARM_NEON\n\n            int i = 0;\n\n            for (; i + 1 < outh; i += 2)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int 
remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%5]        \\n\" // r0\n                        \"add    %5, %5, #16                 \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v14.4s, v15.4s}, [%8]      \\n\" // r3\n                        \"add    %8, %8, #16                 \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v14.16b, v15.16b, #8 \\n\"\n\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v6.4s}, [%1]               \\n\" // _sum0\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v7.4s}, [%2]               \\n\" // _sum1\n\n                        \"fmla   v6.4s, v8.4s, %18.s[0]      \\n\"\n                        \"fmla   v7.4s, v8.4s, %21.s[0]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v12.4s}, [%3]              \\n\" // _sum0n\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v13.4s}, [%4]              \\n\" // _sum1n\n\n                        \"fmla   v12.4s, v14.4s, %20.s[0]    \\n\"\n                        \"fmla   v13.4s, v14.4s, %23.s[0]    \\n\"\n\n                        \"ext    v8.16b, v8.16b, v9.16b, #8  \\n\"\n                        \"ext    v9.16b, v14.16b, v15.16b, #4 \\n\"\n\n                        \"fmla   v6.4s, v10.4s, %18.s[1]     \\n\"\n                        \"fmla   v7.4s, v10.4s, %21.s[1]     \\n\"\n                        \"fmla   v12.4s, v11.4s, %20.s[2]    \\n\"\n 
                       \"fmla   v13.4s, v11.4s, %23.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v14.4s, v15.4s}, [%6]      \\n\" // r1\n                        \"add    %6, %6, #16                 \\n\"\n\n                        \"fmla   v6.4s, v8.4s, %18.s[2]      \\n\"\n                        \"fmla   v7.4s, v8.4s, %21.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, %20.s[1]     \\n\"\n                        \"fmla   v13.4s, v9.4s, %23.s[1]     \\n\"\n\n                        \"ext    v10.16b, v14.16b, v15.16b, #4 \\n\"\n\n                        \"fmla   v6.4s, v14.4s, %19.s[0]     \\n\"\n                        \"fmla   v7.4s, v14.4s, %22.s[0]     \\n\"\n                        \"fmla   v12.4s, v14.4s, %18.s[0]    \\n\"\n                        \"fmla   v13.4s, v14.4s, %21.s[0]    \\n\"\n\n                        \"ext    v11.16b, v14.16b, v15.16b, #8 \\n\"\n\n                        \"fmla   v6.4s, v10.4s, %19.s[1]     \\n\"\n                        \"fmla   v7.4s, v10.4s, %22.s[1]     \\n\"\n                        \"fmla   v12.4s, v10.4s, %18.s[1]    \\n\"\n                        \"fmla   v13.4s, v10.4s, %21.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%7]        \\n\" // r2\n                        \"add    %7, %7, #16                 \\n\"\n\n                        \"fmla   v6.4s, v11.4s, %19.s[2]     \\n\"\n                        \"fmla   v7.4s, v11.4s, %22.s[2]     \\n\"\n                        \"fmla   v12.4s, v11.4s, %18.s[2]    \\n\"\n                        \"fmla   v13.4s, v11.4s, %21.s[2]    \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n\n                        \"fmla   v6.4s, v8.4s, %20.s[0]      \\n\"\n                        \"fmla   v7.4s, v8.4s, %23.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, %19.s[0]     
\\n\"\n                        \"fmla   v13.4s, v8.4s, %22.s[0]     \\n\"\n\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"fmla   v6.4s, v10.4s, %20.s[1]     \\n\"\n                        \"fmla   v7.4s, v10.4s, %23.s[1]     \\n\"\n                        \"fmla   v12.4s, v10.4s, %19.s[1]    \\n\"\n                        \"fmla   v13.4s, v10.4s, %22.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%5]        \\n\" // r0\n                        \"add    %5, %5, #16                 \\n\"\n\n                        \"fmla   v6.4s, v11.4s, %20.s[2]     \\n\"\n                        \"fmla   v7.4s, v11.4s, %23.s[2]     \\n\"\n                        \"fmla   v12.4s, v11.4s, %19.s[2]    \\n\"\n                        \"fmla   v13.4s, v11.4s, %22.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v14.4s, v15.4s}, [%8]      \\n\" // r3\n                        \"add    %8, %8, #16                 \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n\n                        \"st1    {v6.4s}, [%1], #16          \\n\"\n                        \"st1    {v7.4s}, [%2], #16          \\n\"\n\n                        \"ext    v11.16b, v14.16b, v15.16b, #8 \\n\"\n\n                        \"st1    {v12.4s}, [%3], #16         \\n\"\n                        \"st1    {v13.4s}, [%4], #16         \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n                        \"bne    0b                          \\n\"\n\n                        \"sub    %5, %5, #16                 \\n\"\n                        \"sub    %8, %8, #16                 \\n\"\n                        : \"=r\"(nn),       // %0\n                        \"=r\"(outptr0),  // %1\n                        \"=r\"(outptr1),  // %2\n                        
\"=r\"(outptr0n), // %3\n                        \"=r\"(outptr1n), // %4\n                        \"=r\"(r0),       // %5\n                        \"=r\"(r1),       // %6\n                        \"=r\"(r2),       // %7\n                        \"=r\"(r3)        // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr0n),\n                        \"4\"(outptr1n),\n                        \"5\"(r0),\n                        \"6\"(r1),\n                        \"7\"(r2),\n                        \"8\"(r3),\n                        \"w\"(_k00), // %18\n                        \"w\"(_k03), // %19\n                        \"w\"(_k06), // %20\n                        \"w\"(_k10), // %21\n                        \"w\"(_k13), // %22\n                        \"w\"(_k16)  // %23\n                        : \"cc\", \"memory\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%5 :64] \\n\" // r0\n                        \"add        %5, #16             \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.f32   {d28-d30}, [%8]     \\n\" // r3\n                        \"add        %8, #16             \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q14, q15, #2   \\n\"\n\n                        \"0:                             \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d12-d13}, [%1 :64] \\n\" // _sum0\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d14-d15}, [%2 :64] \\n\" // 
_sum1\n\n                        \"vmla.f32   q6, q8, %e18[0]     \\n\"\n                        \"vmla.f32   q7, q8, %e21[0]     \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%3]     \\n\" // _sum0n\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.f32   {d26-d27}, [%4]     \\n\" // _sum1n\n\n                        \"vmla.f32   q12, q14, %e20[0]   \\n\"\n                        \"vmla.f32   q13, q14, %e23[0]   \\n\"\n\n                        \"vext.32    q8, q8, q9, #2      \\n\"\n                        \"vext.32    q9, q14, q15, #1    \\n\"\n\n                        \"vmla.f32   q6, q10, %e18[1]    \\n\"\n                        \"vmla.f32   q7, q10, %e21[1]    \\n\"\n                        \"vmla.f32   q12, q11, %f20[0]   \\n\"\n                        \"vmla.f32   q13, q11, %f23[0]   \\n\"\n\n                        \"pld        [%6, #192]          \\n\"\n                        \"vld1.f32   {d28-d30}, [%6]     \\n\" // r1\n                        \"add        %6, #16             \\n\"\n\n                        \"vmla.f32   q6, q8, %f18[0]     \\n\"\n                        \"vmla.f32   q7, q8, %f21[0]     \\n\"\n                        \"vmla.f32   q12, q9, %e20[1]    \\n\"\n                        \"vmla.f32   q13, q9, %e23[1]    \\n\"\n\n                        \"vext.32    q10, q14, q15, #1   \\n\"\n\n                        \"vmla.f32   q6, q14, %e19[0]    \\n\"\n                        \"vmla.f32   q7, q14, %e22[0]    \\n\"\n                        \"vmla.f32   q12, q14, %e18[0]   \\n\"\n                        \"vmla.f32   q13, q14, %e21[0]   \\n\"\n\n                        \"vext.32    q11, q14, q15, #2   \\n\"\n\n                        \"vmla.f32   q6, q10, %e19[1]    \\n\"\n                        \"vmla.f32   q7, q10, %e22[1]    \\n\"\n                        \"vmla.f32   q12, q10, %e18[1]   \\n\"\n                       
 \"vmla.f32   q13, q10, %e21[1]   \\n\"\n\n                        \"pld        [%7, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%7 :64] \\n\" // r2\n                        \"add        %7, #16             \\n\"\n\n                        \"vmla.f32   q6, q11, %f19[0]    \\n\"\n                        \"vmla.f32   q7, q11, %f22[0]    \\n\"\n                        \"vmla.f32   q12, q11, %f18[0]   \\n\"\n                        \"vmla.f32   q13, q11, %f21[0]   \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n\n                        \"vmla.f32   q6, q8, %e20[0]     \\n\"\n                        \"vmla.f32   q7, q8, %e23[0]     \\n\"\n                        \"vmla.f32   q12, q8, %e19[0]    \\n\"\n                        \"vmla.f32   q13, q8, %e22[0]    \\n\"\n\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"vmla.f32   q6, q10, %e20[1]    \\n\"\n                        \"vmla.f32   q7, q10, %e23[1]    \\n\"\n                        \"vmla.f32   q12, q10, %e19[1]   \\n\"\n                        \"vmla.f32   q13, q10, %e22[1]   \\n\"\n\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%5 :64] \\n\" // r0\n                        \"add        %5, #16             \\n\"\n\n                        \"vmla.f32   q6, q11, %f20[0]    \\n\"\n                        \"vmla.f32   q7, q11, %f23[0]    \\n\"\n                        \"vmla.f32   q12, q11, %f19[0]   \\n\"\n                        \"vmla.f32   q13, q11, %f22[0]   \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.f32   {d28-d30}, [%8]     \\n\" // r3\n                        \"add        %8, #16             \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n\n                        \"vst1.f32   {d12-d13}, [%1 : 64]!\\n\"\n                        \"vst1.f32   {d14-d15}, [%2 : 
64]!\\n\"\n\n                        \"vext.32    q11, q14, q15, #2   \\n\"\n\n                        \"vst1.f32   {d24-d25}, [%3]!    \\n\"\n                        \"vst1.f32   {d26-d27}, [%4]!    \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n\n                        \"sub        %5, #16             \\n\"\n                        \"sub        %8, #16             \\n\"\n                        : \"=r\"(nn),       // %0\n                        \"=r\"(outptr0),  // %1\n                        \"=r\"(outptr1),  // %2\n                        \"=r\"(outptr0n), // %3\n                        \"=r\"(outptr1n), // %4\n                        \"=r\"(r0),       // %5\n                        \"=r\"(r1),       // %6\n                        \"=r\"(r2),       // %7\n                        \"=r\"(r3)        // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr0n),\n                        \"4\"(outptr1n),\n                        \"5\"(r0),\n                        \"6\"(r1),\n                        \"7\"(r2),\n                        \"8\"(r3),\n                        \"w\"(_k00), // %18\n                        \"w\"(_k03), // %19\n                        \"w\"(_k06), // %20\n                        \"w\"(_k10), // %21\n                        \"w\"(_k13), // %22\n                        \"w\"(_k16)  // %23\n                        : \"cc\", \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x4_t _r00 = vld1q_f32(r0);\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r20 = vld1q_f32(r2);\n                    
float32x4_t _r30 = vld1q_f32(r3);\n\n                    float32x4_t _sum0 = vmulq_f32(_r00, _k00);\n                    float32x4_t _sum1 = vmulq_f32(_r00, _k10);\n                    _sum0 = vmlaq_f32(_sum0, _r10, _k03);\n                    _sum1 = vmlaq_f32(_sum1, _r10, _k13);\n                    _sum0 = vmlaq_f32(_sum0, _r20, _k06);\n                    _sum1 = vmlaq_f32(_sum1, _r20, _k16);\n\n                    float32x4_t _sum0n = vmulq_f32(_r10, _k00);\n                    float32x4_t _sum1n = vmulq_f32(_r10, _k10);\n                    _sum0n = vmlaq_f32(_sum0n, _r20, _k03);\n                    _sum1n = vmlaq_f32(_sum1n, _r20, _k13);\n                    _sum0n = vmlaq_f32(_sum0n, _r30, _k06);\n                    _sum1n = vmlaq_f32(_sum1n, _r30, _k16);\n\n                    _sum0 = vsetq_lane_f32(*outptr0, _sum0, 3);\n                    _sum1 = vsetq_lane_f32(*outptr1, _sum1, 3);\n                    _sum0n = vsetq_lane_f32(*outptr0n, _sum0n, 3);\n                    _sum1n = vsetq_lane_f32(*outptr1n, _sum1n, 3);\n#if __aarch64__\n                    *outptr0 = vaddvq_f32(_sum0);\n                    *outptr1 = vaddvq_f32(_sum1);\n                    *outptr0n = vaddvq_f32(_sum0n);\n                    *outptr1n = vaddvq_f32(_sum1n);\n#else\n                    float32x2_t _ss0 = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n                    float32x2_t _ss1 = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n                    float32x2_t _ss0n = vadd_f32(vget_low_f32(_sum0n), vget_high_f32(_sum0n));\n                    float32x2_t _ss1n = vadd_f32(vget_low_f32(_sum1n), vget_high_f32(_sum1n));\n\n                    float32x2_t _ss01 = vpadd_f32(_ss0, _ss1);\n                    float32x2_t _ss01n = vpadd_f32(_ss0n, _ss1n);\n\n                    *outptr0 = vget_lane_f32(_ss01, 0);\n                    *outptr1 = vget_lane_f32(_ss01, 1);\n                    *outptr0n = vget_lane_f32(_ss01n, 0);\n                    *outptr1n = 
vget_lane_f32(_ss01n, 1);\n#endif // __aarch64__\n#else\n                    float sum0 = 0.f;\n                    float sum0n = 0.f;\n                    float sum1 = 0.f;\n                    float sum1n = 0.f;\n\n                    sum0 += r0[0] * k0[0];\n                    sum0 += r0[1] * k0[1];\n                    sum0 += r0[2] * k0[2];\n                    sum0 += r1[0] * k0[3];\n                    sum0 += r1[1] * k0[4];\n                    sum0 += r1[2] * k0[5];\n                    sum0 += r2[0] * k0[6];\n                    sum0 += r2[1] * k0[7];\n                    sum0 += r2[2] * k0[8];\n\n                    sum1 += r0[0] * k1[0];\n                    sum1 += r0[1] * k1[1];\n                    sum1 += r0[2] * k1[2];\n                    sum1 += r1[0] * k1[3];\n                    sum1 += r1[1] * k1[4];\n                    sum1 += r1[2] * k1[5];\n                    sum1 += r2[0] * k1[6];\n                    sum1 += r2[1] * k1[7];\n                    sum1 += r2[2] * k1[8];\n\n                    sum0n += r1[0] * k0[0];\n                    sum0n += r1[1] * k0[1];\n                    sum0n += r1[2] * k0[2];\n                    sum0n += r2[0] * k0[3];\n                    sum0n += r2[1] * k0[4];\n                    sum0n += r2[2] * k0[5];\n                    sum0n += r3[0] * k0[6];\n                    sum0n += r3[1] * k0[7];\n                    sum0n += r3[2] * k0[8];\n\n                    sum1n += r1[0] * k1[0];\n                    sum1n += r1[1] * k1[1];\n                    sum1n += r1[2] * k1[2];\n                    sum1n += r2[0] * k1[3];\n                    sum1n += r2[1] * k1[4];\n                    sum1n += r2[2] * k1[5];\n                    sum1n += r3[0] * k1[6];\n                    sum1n += r3[1] * k1[7];\n                    sum1n += r3[2] * k1[8];\n\n                    *outptr0 += sum0;\n                    *outptr1 += sum1;\n                    *outptr0n += sum0n;\n                    *outptr1n += sum1n;\n#endif // 
__ARM_NEON\n                    r0++;\n                    r1++;\n                    r2++;\n                    r3++;\n                    outptr0++;\n                    outptr1++;\n                    outptr0n++;\n                    outptr1n++;\n                }\n\n                r0 += 2 + w;\n                r1 += 2 + w;\n                r2 += 2 + w;\n                r3 += 2 + w;\n\n                outptr0 += outw;\n                outptr1 += outw;\n                outptr0n += outw;\n                outptr1n += outw;\n            }\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%3]        \\n\" // r0\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v6.4s}, [%1]               \\n\" // _sum0\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v7.4s}, [%2]               \\n\" // _sum1\n\n                        \"fmul   v14.4s, v8.4s, %12.s[0]     \\n\"\n                        \"fmul   v15.4s, v8.4s, %15.s[0]     \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"fmla   v6.4s, v10.4s, %12.s[1]     \\n\"\n                        \"fmla   v7.4s, v10.4s, %15.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%4]        \\n\" 
// r1\n                        \"add    %4, %4, #16                 \\n\"\n\n                        \"fmla   v14.4s, v11.4s, %12.s[2]    \\n\"\n                        \"fmla   v15.4s, v11.4s, %15.s[2]    \\n\"\n\n                        \"fmla   v6.4s, v8.4s, %13.s[0]      \\n\"\n                        \"fmla   v7.4s, v8.4s, %16.s[0]      \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"fmla   v14.4s, v10.4s, %13.s[1]    \\n\"\n                        \"fmla   v15.4s, v10.4s, %16.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%5]        \\n\" // r2\n                        \"add    %5, %5, #16                 \\n\"\n\n                        \"fmla   v6.4s, v11.4s, %13.s[2]     \\n\"\n                        \"fmla   v7.4s, v11.4s, %16.s[2]     \\n\"\n\n                        \"fmla   v14.4s, v8.4s, %14.s[0]     \\n\"\n                        \"fmla   v15.4s, v8.4s, %17.s[0]     \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"fmla   v6.4s, v10.4s, %14.s[1]     \\n\"\n                        \"fmla   v7.4s, v10.4s, %17.s[1]     \\n\"\n\n                        \"fmla   v14.4s, v11.4s, %14.s[2]    \\n\"\n                        \"fmla   v15.4s, v11.4s, %17.s[2]    \\n\"\n\n                        \"fadd   v6.4s, v6.4s, v14.4s        \\n\"\n                        \"fadd   v7.4s, v7.4s, v15.4s        \\n\"\n\n                        \"st1    {v6.4s}, [%1], #16          \\n\"\n                        \"st1    {v7.4s}, [%2], #16          \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n                        \"bne    0b                          \\n\"\n\n                        : \"=r\"(nn),      // %0\n     
                   \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2)       // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00), // %12\n                        \"w\"(_k03), // %13\n                        \"w\"(_k06), // %14\n                        \"w\"(_k10), // %15\n                        \"w\"(_k13), // %16\n                        \"w\"(_k16)  // %17\n                        : \"cc\", \"memory\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                             \\n\"\n\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%3]     \\n\" // r0\n                        \"add        %3, #16             \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d12-d13}, [%1]     \\n\" // _sum0\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d14-d15}, [%2]     \\n\" // _sum1\n\n                        \"vmul.f32   q14, q8, %e12[0]    \\n\"\n                        \"vmul.f32   q15, q8, %e15[0]    \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"vmla.f32   q6, q10, %e12[1]    \\n\"\n                        \"vmla.f32   q7, q10, %e15[1]    \\n\"\n\n                        \"pld        [%4, #192]          \\n\"\n                        
\"vld1.f32   {d16-d18}, [%4]     \\n\" // r1\n                        \"add        %4, #16             \\n\"\n\n                        \"vmla.f32   q14, q11, %f12[0]   \\n\"\n                        \"vmla.f32   q15, q11, %f15[0]   \\n\"\n\n                        \"vmla.f32   q6, q8, %e13[0]     \\n\"\n                        \"vmla.f32   q7, q8, %e16[0]     \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"vmla.f32   q14, q10, %e13[1]   \\n\"\n                        \"vmla.f32   q15, q10, %e16[1]   \\n\"\n\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%5]     \\n\" // r2\n                        \"add        %5, #16             \\n\"\n\n                        \"vmla.f32   q6, q11, %f13[0]    \\n\"\n                        \"vmla.f32   q7, q11, %f16[0]    \\n\"\n\n                        \"vmla.f32   q14, q8, %e14[0]    \\n\"\n                        \"vmla.f32   q15, q8, %e17[0]    \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"vmla.f32   q6, q10, %e14[1]    \\n\"\n                        \"vmla.f32   q7, q10, %e17[1]    \\n\"\n\n                        \"vmla.f32   q14, q11, %f14[0]   \\n\"\n                        \"vmla.f32   q15, q11, %f17[0]   \\n\"\n\n                        \"vadd.f32   q6, q6, q14         \\n\"\n                        \"vadd.f32   q7, q7, q15         \\n\"\n\n                        \"vst1.f32   {d12-d13}, [%1]!    \\n\"\n\n                        \"vst1.f32   {d14-d15}, [%2]!    
\\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2)       // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00), // %12\n                        \"w\"(_k03), // %13\n                        \"w\"(_k06), // %14\n                        \"w\"(_k10), // %15\n                        \"w\"(_k13), // %16\n                        \"w\"(_k16)  // %17\n                        : \"cc\", \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x4_t _r00 = vld1q_f32(r0);\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r20 = vld1q_f32(r2);\n\n                    float32x4_t _sum0 = vmulq_f32(_r00, _k00);\n                    float32x4_t _sum1 = vmulq_f32(_r00, _k10);\n                    _sum0 = vmlaq_f32(_sum0, _r10, _k03);\n                    _sum1 = vmlaq_f32(_sum1, _r10, _k13);\n                    _sum0 = vmlaq_f32(_sum0, _r20, _k06);\n                    _sum1 = vmlaq_f32(_sum1, _r20, _k16);\n\n                    _sum0 = vsetq_lane_f32(*outptr0, _sum0, 3);\n                    _sum1 = vsetq_lane_f32(*outptr1, _sum1, 3);\n#if __aarch64__\n                    *outptr0 = vaddvq_f32(_sum0);\n                    *outptr1 = vaddvq_f32(_sum1);\n#else\n     
               float32x2_t _ss0 = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n                    float32x2_t _ss1 = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n                    float32x2_t _ss01 = vpadd_f32(_ss0, _ss1);\n\n                    *outptr0 = vget_lane_f32(_ss01, 0);\n                    *outptr1 = vget_lane_f32(_ss01, 1);\n#endif // __aarch64__\n#else\n                    float sum0 = 0.f;\n                    float sum1 = 0.f;\n\n                    sum0 += r0[0] * k0[0];\n                    sum0 += r0[1] * k0[1];\n                    sum0 += r0[2] * k0[2];\n                    sum0 += r1[0] * k0[3];\n                    sum0 += r1[1] * k0[4];\n                    sum0 += r1[2] * k0[5];\n                    sum0 += r2[0] * k0[6];\n                    sum0 += r2[1] * k0[7];\n                    sum0 += r2[2] * k0[8];\n\n                    sum1 += r0[0] * k1[0];\n                    sum1 += r0[1] * k1[1];\n                    sum1 += r0[2] * k1[2];\n                    sum1 += r1[0] * k1[3];\n                    sum1 += r1[1] * k1[4];\n                    sum1 += r1[2] * k1[5];\n                    sum1 += r2[0] * k1[6];\n                    sum1 += r2[1] * k1[7];\n                    sum1 += r2[2] * k1[8];\n\n                    *outptr0 += sum0;\n                    *outptr1 += sum1;\n#endif // __ARM_NEON\n                    r0++;\n                    r1++;\n                    r2++;\n                    outptr0++;\n                    outptr1++;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9;\n            k1 += 9;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        const float* kernel0 = kernel + p * inch * 9;\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n            float* outptr2 = outptr + outw;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n            const float* r3 = img0 + w * 3;\n\n#if __ARM_NEON\n            float32x4_t _k0123 = vld1q_f32(kernel0);\n            float32x4_t _k3456 = vld1q_f32(kernel0 + 3);\n            float32x4_t _k6789 = vld1q_f32(kernel0 + 6);\n#else\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n#endif // __ARM_NEON\n\n            int i = 0;\n\n            for (; i + 1 < outh; i += 2)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v9.4s, v10.4s}, [%3]       \\n\" // r0\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"ext    v11.16b, v9.16b, v10.16b, #4 \\n\"\n                        \"ext    v12.16b, v9.16b, v10.16b, #8 \\n\"\n\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v7.4s}, [%1]               \\n\" // _sum\n\n                        \"fmla   v7.4s, v9.4s, %14.s[0]      \\n\"\n                        \"fmul   v6.4s, v11.4s, %14.s[1]     \\n\"\n                        \"fmul   v13.4s, v12.4s, %14.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]    
   \\n\"\n                        \"ld1    {v9.4s, v10.4s}, [%4]       \\n\" // r1\n                        \"add    %4, %4, #16                 \\n\"\n\n                        \"fmla   v7.4s, v9.4s, %15.s[0]      \\n\"\n\n                        \"ext    v11.16b, v9.16b, v10.16b, #4 \\n\"\n                        \"ext    v12.16b, v9.16b, v10.16b, #8 \\n\"\n\n                        \"fmla   v6.4s, v11.4s, %15.s[1]     \\n\"\n                        \"fmla   v13.4s, v12.4s, %15.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v8.4s}, [%2]               \\n\" // _sum2\n\n                        \"fmla   v8.4s, v9.4s, %14.s[0]      \\n\"\n                        \"fmul   v14.4s, v11.4s, %14.s[1]    \\n\"\n                        \"fmul   v15.4s, v12.4s, %14.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v9.4s, v10.4s}, [%5]       \\n\" // r2\n                        \"add    %5, %5, #16                 \\n\"\n\n                        \"fmla   v7.4s, v9.4s, %16.s[0]      \\n\"\n\n                        \"ext    v11.16b, v9.16b, v10.16b, #4 \\n\"\n                        \"ext    v12.16b, v9.16b, v10.16b, #8 \\n\"\n\n                        \"fmla   v6.4s, v11.4s, %16.s[1]     \\n\"\n                        \"fmla   v13.4s, v12.4s, %16.s[2]    \\n\"\n\n                        \"fmla   v8.4s, v9.4s, %15.s[0]      \\n\"\n                        \"fmla   v14.4s, v11.4s, %15.s[1]    \\n\"\n                        \"fmla   v15.4s, v12.4s, %15.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v9.4s, v10.4s}, [%6]       \\n\" // r3\n                        \"add    %6, %6, #16                 \\n\"\n\n                        \"fmla   v8.4s, v9.4s, %16.s[0]      \\n\"\n\n                        \"ext    v11.16b, v9.16b, v10.16b, #4 \\n\"\n                        
\"ext    v12.16b, v9.16b, v10.16b, #8 \\n\"\n\n                        \"fmla   v14.4s, v11.4s, %16.s[1]    \\n\"\n                        \"fmla   v15.4s, v12.4s, %16.s[2]    \\n\"\n\n                        \"fadd   v7.4s, v7.4s, v6.4s         \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v9.4s, v10.4s}, [%3]       \\n\" // r0\n\n                        \"fadd   v8.4s, v8.4s, v14.4s        \\n\"\n                        \"fadd   v7.4s, v7.4s, v13.4s        \\n\"\n                        \"fadd   v8.4s, v8.4s, v15.4s        \\n\"\n\n                        \"ext    v11.16b, v9.16b, v10.16b, #4 \\n\"\n                        \"ext    v12.16b, v9.16b, v10.16b, #8 \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"st1    {v7.4s}, [%1], #16          \\n\"\n                        \"st1    {v8.4s}, [%2], #16          \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n                        \"bne    0b                          \\n\"\n\n                        \"sub    %3, %3, #16                 \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr),  // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2),      // %5\n                        \"=r\"(r3)       // %6\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(outptr2),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"6\"(r3),\n                        \"w\"(_k0123), // %14\n                        \"w\"(_k3456), // %15\n                        \"w\"(_k6789)  // %16\n                        : \"cc\", \"memory\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", 
\"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.f32   {d18-d20}, [%3 :64] \\n\" // r0\n                        \"add        %3, #16             \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\"\n                        \"vext.32    q12, q9, q10, #2    \\n\"\n\n                        \"0:                             \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d14-d15}, [%1 :64] \\n\" // _sum\n\n                        \"vmla.f32   q7, q9, %e14[0]     \\n\"\n                        \"vmul.f32   q6, q11, %e14[1]    \\n\"\n                        \"vmul.f32   q13, q12, %f14[0]   \\n\"\n\n                        \"pld        [%4, #192]          \\n\"\n                        \"vld1.f32   {d18-d20}, [%4]     \\n\" // r1\n                        \"add        %4, #16             \\n\"\n\n                        \"vmla.f32   q7, q9, %e15[0]     \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\"\n                        \"vext.32    q12, q9, q10, #2    \\n\"\n\n                        \"vmla.f32   q6, q11, %e15[1]    \\n\"\n                        \"vmla.f32   q13, q12, %f15[0]   \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d16-d17}, [%2]     \\n\" // _sum2\n\n                        \"vmla.f32   q8, q9, %e14[0]     \\n\"\n                        \"vmul.f32   q14, q11, %e14[1]   \\n\"\n                        \"vmul.f32   q15, q12, %f14[0]   \\n\"\n\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.f32   {d18-d20}, [%5 :64] \\n\" // r2\n                        \"add        %5, #16             \\n\"\n\n                        \"vmla.f32   q7, q9, %e16[0]     
\\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\"\n                        \"vext.32    q12, q9, q10, #2    \\n\"\n\n                        \"vmla.f32   q6, q11, %e16[1]    \\n\"\n                        \"vmla.f32   q13, q12, %f16[0]   \\n\"\n\n                        \"vmla.f32   q8, q9, %e15[0]     \\n\"\n                        \"vmla.f32   q14, q11, %e15[1]   \\n\"\n                        \"vmla.f32   q15, q12, %f15[0]   \\n\"\n\n                        \"pld        [%6, #192]          \\n\"\n                        \"vld1.f32   {d18-d20}, [%6]     \\n\" // r3\n                        \"add        %6, #16             \\n\"\n\n                        \"vmla.f32   q8, q9, %e16[0]     \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\"\n                        \"vext.32    q12, q9, q10, #2    \\n\"\n\n                        \"vmla.f32   q14, q11, %e16[1]   \\n\"\n                        \"vmla.f32   q15, q12, %f16[0]   \\n\"\n\n                        \"vadd.f32   q7, q7, q6          \\n\"\n\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.f32   {d18-d20}, [%3 :64] \\n\" // r0\n\n                        \"vadd.f32   q8, q8, q14         \\n\"\n                        \"vadd.f32   q7, q7, q13         \\n\"\n                        \"vadd.f32   q8, q8, q15         \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\"\n                        \"vext.32    q12, q9, q10, #2    \\n\"\n\n                        \"add        %3, #16             \\n\"\n\n                        \"vst1.f32   {d14-d15}, [%1]!    \\n\"\n                        \"vst1.f32   {d16-d17}, [%2]!    
\\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n\n                        \"sub        %3, #16             \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr),  // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2),      // %5\n                        \"=r\"(r3)       // %6\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(outptr2),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"6\"(r3),\n                        \"w\"(_k0123), // %14\n                        \"w\"(_k3456), // %15\n                        \"w\"(_k6789)  // %16\n                        : \"cc\", \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x4_t _r00 = vld1q_f32(r0);\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r20 = vld1q_f32(r2);\n                    float32x4_t _r30 = vld1q_f32(r3);\n\n                    float32x4_t _sum = vmulq_f32(_r00, _k0123);\n                    _sum = vmlaq_f32(_sum, _r10, _k3456);\n                    _sum = vmlaq_f32(_sum, _r20, _k6789);\n\n                    float32x4_t _sum2 = vmulq_f32(_r10, _k0123);\n                    _sum2 = vmlaq_f32(_sum2, _r20, _k3456);\n                    _sum2 = vmlaq_f32(_sum2, _r30, _k6789);\n\n                    _sum = vsetq_lane_f32(*outptr, _sum, 3);\n                    _sum2 = vsetq_lane_f32(*outptr2, _sum2, 3);\n\n#if __aarch64__\n                    *outptr = 
vaddvq_f32(_sum);\n                    *outptr2 = vaddvq_f32(_sum2);\n#else\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    float32x2_t _ss2 = vadd_f32(vget_low_f32(_sum2), vget_high_f32(_sum2));\n\n                    float32x2_t _sss2 = vpadd_f32(_ss, _ss2);\n\n                    *outptr = vget_lane_f32(_sss2, 0);\n                    *outptr2 = vget_lane_f32(_sss2, 1);\n#endif // __aarch64__\n#else\n                    float sum = 0;\n                    float sum2 = 0;\n\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r2[0] * k2[0];\n                    sum += r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n\n                    sum2 += r1[0] * k0[0];\n                    sum2 += r1[1] * k0[1];\n                    sum2 += r1[2] * k0[2];\n                    sum2 += r2[0] * k1[0];\n                    sum2 += r2[1] * k1[1];\n                    sum2 += r2[2] * k1[2];\n                    sum2 += r3[0] * k2[0];\n                    sum2 += r3[1] * k2[1];\n                    sum2 += r3[2] * k2[2];\n\n                    *outptr += sum;\n                    *outptr2 += sum2;\n#endif\n                    r0++;\n                    r1++;\n                    r2++;\n                    r3++;\n                    outptr++;\n                    outptr2++;\n                }\n\n                r0 += 2 + w;\n                r1 += 2 + w;\n                r2 += 2 + w;\n                r3 += 2 + w;\n\n                outptr += outw;\n                outptr2 += outw;\n            }\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = 
outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%2]        \\n\" // r0\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v7.4s}, [%1]               \\n\" // _sum\n\n                        \"fmla   v7.4s, v8.4s, %10.s[0]      \\n\"\n                        \"fmul   v13.4s, v10.4s, %10.s[1]    \\n\"\n                        \"fmul   v14.4s, v11.4s, %10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%3]        \\n\" // r1\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"fmla   v7.4s, v8.4s, %11.s[0]      \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"fmla   v13.4s, v10.4s, %11.s[1]    \\n\"\n                        \"fmla   v14.4s, v11.4s, %11.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%4]        \\n\" // r2\n                        \"add    %4, %4, #16                 \\n\"\n\n                        \"fmla   v7.4s, v8.4s, %12.s[0]      \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"fmla   v13.4s, v10.4s, %12.s[1]    \\n\"\n                        \"fmla   
v14.4s, v11.4s, %12.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%2]        \\n\" // r0\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"fadd   v7.4s, v7.4s, v13.4s        \\n\"\n                        \"fadd   v7.4s, v7.4s, v14.4s        \\n\"\n\n                        \"ext    v10.16b, v8.16b, v9.16b, #4 \\n\"\n                        \"ext    v11.16b, v8.16b, v9.16b, #8 \\n\"\n\n                        \"st1    {v7.4s}, [%1], #16          \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n                        \"bne    0b                          \\n\"\n\n                        \"sub    %2, %2, #16                 \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2)      // %4\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k0123), // %10\n                        \"w\"(_k3456), // %11\n                        \"w\"(_k6789)  // %12\n                        : \"cc\", \"memory\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%2]     \\n\" // r0\n                        \"add        %2, #16             \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"0:                             \\n\"\n\n    
                    \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d14-d15}, [%1]     \\n\" // _sum\n\n                        \"vmla.f32   q7, q8, %e10[0]     \\n\"\n                        \"vmul.f32   q13, q10, %e10[1]   \\n\"\n                        \"vmul.f32   q14, q11, %f10[0]   \\n\"\n\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%3]     \\n\" // r1\n                        \"add        %3, #16             \\n\"\n\n                        \"vmla.f32   q7, q8, %e11[0]     \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"vmla.f32   q13, q10, %e11[1]   \\n\"\n                        \"vmla.f32   q14, q11, %f11[0]   \\n\"\n\n                        \"pld        [%4, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%4]     \\n\" // r2\n                        \"add        %4, #16             \\n\"\n\n                        \"vmla.f32   q7, q8, %e12[0]     \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"vmla.f32   q13, q10, %e12[1]   \\n\"\n                        \"vmla.f32   q14, q11, %f12[0]   \\n\"\n\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.f32   {d16-d18}, [%2]     \\n\" // r0\n                        \"add        %2, #16             \\n\"\n\n                        \"vadd.f32   q7, q7, q13         \\n\"\n                        \"vadd.f32   q7, q7, q14         \\n\"\n\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n\n                        \"vst1.f32   {d14-d15}, [%1]!    
\\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n\n                        \"sub        %2, #16             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2)      // %4\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k0123), // %10\n                        \"w\"(_k3456), // %11\n                        \"w\"(_k6789)  // %12\n                        : \"cc\", \"memory\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x4_t _r00 = vld1q_f32(r0);\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r20 = vld1q_f32(r2);\n\n                    float32x4_t _sum = vmulq_f32(_r00, _k0123);\n                    _sum = vmlaq_f32(_sum, _r10, _k3456);\n                    _sum = vmlaq_f32(_sum, _r20, _k6789);\n\n                    _sum = vsetq_lane_f32(*outptr, _sum, 3);\n\n#if __aarch64__\n                    *outptr = vaddvq_f32(_sum);\n#else\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    _ss = vpadd_f32(_ss, _ss);\n\n                    *outptr = vget_lane_f32(_ss, 0);\n#endif // __aarch64__\n#else\n                    float sum = 0;\n\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * 
k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r2[0] * k2[0];\n                    sum += r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n\n                    *outptr += sum;\n#endif\n                    r0++;\n                    r1++;\n                    r2++;\n                    outptr++;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            kernel0 += 9;\n        }\n    }\n}\n\nstatic void conv3x3s2_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    int nn_outch = outch >> 1;\n    int remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float bias1 = bias ? 
bias[p + 1] : 0.f;\n\n        out0.fill(bias0);\n        out1.fill(bias1);\n\n        const float* k0 = kernel + p * inch * 9;\n        const float* k1 = kernel + (p + 1) * inch * 9;\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n\n#if __ARM_NEON\n            float32x4_t _k00 = vld1q_f32(k0);\n            float32x4_t _k03 = vld1q_f32(k0 + 3);\n            float32x4_t _k06 = vld1q_f32(k0 + 6);\n\n            float32x4_t _k10 = vld1q_f32(k1);\n            float32x4_t _k13 = vld1q_f32(k1 + 3);\n            float32x4_t _k16 = vld1q_f32(k1 + 6);\n#endif // __ARM_NEON\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld2    {v8.4s, v9.4s}, [%3], #32   \\n\" // v8 v9 = r0\n\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v6.4s}, [%1]               \\n\" // v6 = _sum0\n\n                        \"fmul   v12.4s, v8.4s, %12.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v7.4s}, [%2]               \\n\" // v7 = _sum1\n\n                        \"fmul   v13.4s, v8.4s, %15.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld2    {v10.4s, v11.4s}, [%3]      \\n\" // v10\n\n  
                      \"fmla   v6.4s, v9.4s, %12.s[1]      \\n\"\n\n                        \"ext    v14.16b, v8.16b, v10.16b, #4\\n\"\n\n                        \"fmla   v7.4s, v9.4s, %15.s[1]      \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld2    {v8.4s, v9.4s}, [%4], #32   \\n\" // r1\n\n                        \"fmla   v12.4s, v14.4s, %12.s[2]    \\n\"\n                        \"fmla   v13.4s, v14.4s, %15.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld2    {v10.4s, v11.4s}, [%4]      \\n\"\n\n                        \"fmla   v6.4s, v8.4s, %13.s[0]      \\n\"\n                        \"fmla   v7.4s, v8.4s, %16.s[0]      \\n\"\n\n                        \"ext    v14.16b, v8.16b, v10.16b, #4\\n\"\n\n                        \"fmla   v12.4s, v9.4s, %13.s[1]     \\n\"\n                        \"fmla   v13.4s, v9.4s, %16.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld2    {v8.4s, v9.4s}, [%5], #32   \\n\" // r2\n\n                        \"fmla   v6.4s, v14.4s, %13.s[2]     \\n\"\n                        \"fmla   v7.4s, v14.4s, %16.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld2    {v10.4s, v11.4s}, [%5]      \\n\"\n\n                        \"fmla   v12.4s, v8.4s, %14.s[0]     \\n\"\n                        \"fmla   v13.4s, v8.4s, %17.s[0]     \\n\"\n\n                        \"ext    v14.16b, v8.16b, v10.16b, #4\\n\"\n\n                        \"fmla   v6.4s, v9.4s, %14.s[1]      \\n\"\n                        \"fmla   v7.4s, v9.4s, %17.s[1]      \\n\"\n\n                        \"fmla   v12.4s, v14.4s, %14.s[2]    \\n\"\n                        \"fmla   v13.4s, v14.4s, %17.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld2    {v8.4s, v9.4s}, 
[%3], #32   \\n\" // v8 v9 = r0\n\n                        \"fadd   v6.4s, v6.4s, v12.4s        \\n\"\n                        \"fadd   v7.4s, v7.4s, v13.4s        \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n\n                        \"st1    {v6.4s}, [%1], #16          \\n\"\n                        \"st1    {v7.4s}, [%2], #16          \\n\"\n\n                        \"bne    0b                          \\n\"\n                        \"sub    %3, %3, #32                 \\n\"\n\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2)       // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00), // %12\n                        \"w\"(_k03), // %13\n                        \"w\"(_k06), // %14\n                        \"w\"(_k10), // %15\n                        \"w\"(_k13), // %16\n                        \"w\"(_k16)  // %17\n                        : \"cc\", \"memory\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld2.f32   {d16-d19}, [%3]!    
\\n\" // q8 q9 = r0\n\n                        \"0:                             \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d12-d13}, [%1]     \\n\" // q6 = _sum0\n\n                        \"vmul.f32   q12, q8, %e12[0]    \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d14-d15}, [%2]     \\n\" // q7 = _sum1\n\n                        \"vmul.f32   q13, q8, %e15[0]    \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld2.f32   {d20-d21}, [%3]     \\n\" // q10\n\n                        \"vmla.f32   q6, q9, %e12[1]     \\n\"\n\n                        \"vext.32    q11, q8, q10, #1    \\n\"\n\n                        \"vmla.f32   q7, q9, %e15[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld2.f32   {d16-d19}, [%4]!    \\n\" // r1\n\n                        \"vmla.f32   q12, q11, %f12[0]   \\n\"\n                        \"vmla.f32   q13, q11, %f15[0]   \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld2.f32   {d20-d21}, [%4]     \\n\"\n\n                        \"vmla.f32   q6, q8, %e13[0]     \\n\"\n                        \"vmla.f32   q7, q8, %e16[0]     \\n\"\n\n                        \"vext.32    q11, q8, q10, #1    \\n\"\n\n                        \"vmla.f32   q12, q9, %e13[1]    \\n\"\n                        \"vmla.f32   q13, q9, %e16[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld2.f32   {d16-d19}, [%5]!    
\\n\" // r2\n\n                        \"vmla.f32   q6, q11, %f13[0]    \\n\"\n                        \"vmla.f32   q7, q11, %f16[0]    \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld2.f32   {d20-d21}, [%5]     \\n\"\n\n                        \"vmla.f32   q12, q8, %e14[0]    \\n\"\n                        \"vmla.f32   q13, q8, %e17[0]    \\n\"\n\n                        \"vext.32    q11, q8, q10, #1    \\n\"\n\n                        \"vmla.f32   q6, q9, %e14[1]     \\n\"\n                        \"vmla.f32   q7, q9, %e17[1]     \\n\"\n\n                        \"vmla.f32   q12, q11, %f14[0]   \\n\"\n                        \"vmla.f32   q13, q11, %f17[0]   \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld2.f32   {d16-d19}, [%3]!    \\n\" // q8 q9 = r0\n\n                        \"vadd.f32   q6, q6, q12         \\n\"\n                        \"vadd.f32   q7, q7, q13         \\n\"\n\n                        \"subs       %0, #1              \\n\"\n\n                        \"vst1.f32   {d12-d13}, [%1]!    \\n\"\n                        \"vst1.f32   {d14-d15}, [%2]!    
\\n\"\n\n                        \"bne        0b                  \\n\"\n                        \"sub        %3, #32             \\n\"\n\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2)       // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00), // %12\n                        \"w\"(_k03), // %13\n                        \"w\"(_k06), // %14\n                        \"w\"(_k10), // %15\n                        \"w\"(_k13), // %16\n                        \"w\"(_k16)  // %17\n                        : \"cc\", \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x4_t _r00 = vld1q_f32(r0);\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r20 = vld1q_f32(r2);\n\n                    float32x4_t _sum0 = vmulq_f32(_r00, _k00);\n                    float32x4_t _sum1 = vmulq_f32(_r00, _k10);\n                    _sum0 = vmlaq_f32(_sum0, _r10, _k03);\n                    _sum1 = vmlaq_f32(_sum1, _r10, _k13);\n                    _sum0 = vmlaq_f32(_sum0, _r20, _k06);\n                    _sum1 = vmlaq_f32(_sum1, _r20, _k16);\n\n                    _sum0 = vsetq_lane_f32(*outptr0, _sum0, 3);\n                    _sum1 = vsetq_lane_f32(*outptr1, _sum1, 3);\n#if __aarch64__\n                    *outptr0 = vaddvq_f32(_sum0);\n                    *outptr1 = vaddvq_f32(_sum1);\n#else\n     
               float32x2_t _ss0 = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n                    float32x2_t _ss1 = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n                    float32x2_t _ss01 = vpadd_f32(_ss0, _ss1);\n\n                    *outptr0 = vget_lane_f32(_ss01, 0);\n                    *outptr1 = vget_lane_f32(_ss01, 1);\n#endif // __aarch64__\n#else\n                    float sum0 = 0.f;\n                    float sum1 = 0.f;\n\n                    sum0 += r0[0] * k0[0];\n                    sum0 += r0[1] * k0[1];\n                    sum0 += r0[2] * k0[2];\n                    sum0 += r1[0] * k0[3];\n                    sum0 += r1[1] * k0[4];\n                    sum0 += r1[2] * k0[5];\n                    sum0 += r2[0] * k0[6];\n                    sum0 += r2[1] * k0[7];\n                    sum0 += r2[2] * k0[8];\n\n                    sum1 += r0[0] * k1[0];\n                    sum1 += r0[1] * k1[1];\n                    sum1 += r0[2] * k1[2];\n                    sum1 += r1[0] * k1[3];\n                    sum1 += r1[1] * k1[4];\n                    sum1 += r1[2] * k1[5];\n                    sum1 += r2[0] * k1[6];\n                    sum1 += r2[1] * k1[7];\n                    sum1 += r2[2] * k1[8];\n\n                    *outptr0 += sum0;\n                    *outptr1 += sum1;\n#endif // __ARM_NEON\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr0++;\n                    outptr1++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9;\n            k1 += 9;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        const float* kernel0 = kernel + p * inch * 9;\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n#if __ARM_NEON\n            float32x4_t _k0123 = vld1q_f32(k0);\n            float32x4_t _k3456 = vld1q_f32(k1);\n            float32x4_t _k6789 = vld1q_f32(k2);\n#endif // __ARM_NEON\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                        \"0:                                        \\n\"\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v0.4s}, [%1]                  \\n\"\n\n                        \"fmla       v0.4s,  v2.4s, %10.s[0]        \\n\"\n                        \"fmul       v10.4s, v3.4s, %10.s[1]        \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%2]           \\n\"\n                        \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                        \"fmul       v11.4s, v1.4s, %10.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n                        \"ld2        {v2.4s, 
v3.4s}, [%3], #32      \\n\"\n\n                        \"fmla       v0.4s,  v2.4s, %11.s[0]        \\n\"\n                        \"fmla       v10.4s, v3.4s, %11.s[1]        \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%3]           \\n\"\n                        \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                        \"fmla       v11.4s, v1.4s, %11.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%4], #32      \\n\"\n\n                        \"fmla       v0.4s,  v2.4s, %12.s[0]        \\n\"\n                        \"fmla       v10.4s, v3.4s, %12.s[1]        \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%4]           \\n\"\n                        \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                        \"fmla       v11.4s, v1.4s, %12.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n\n                        \"fadd       v0.4s, v0.4s, v10.4s           \\n\"\n                        \"fadd       v0.4s, v0.4s, v11.4s           \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"st1        {v0.4s}, [%1], #16             \\n\"\n                        \"bne        0b                             \\n\"\n                        \"sub        %2, %2, #32                    \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2)      // %4\n                        : \"0\"(nn),\n                        
\"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k0123), // %10\n                        \"w\"(_k3456), // %11\n                        \"w\"(_k6789)  // %12\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n\n                        \"0:                             \\n\"\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1]       \\n\"\n\n                        \"vmla.f32   q0, q2, %e10[0]     \\n\"\n                        \"vmul.f32   q10, q3, %e10[1]    \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld2.f32   {d16-d17}, [%2]     \\n\"\n                        \"vext.32    q1, q2, q8, #1      \\n\"\n\n                        \"vmul.f32   q11, q1, %f10[0]    \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%3]!      \\n\"\n\n                        \"vmla.f32   q0, q2, %e11[0]     \\n\"\n                        \"vmla.f32   q10, q3, %e11[1]    \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld2.f32   {d16-d17}, [%3]     \\n\"\n                        \"vext.32    q1, q2, q8, #1      \\n\"\n\n                        \"vmla.f32   q11, q1, %f11[0]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%4]!      
\\n\"\n\n                        \"vmla.f32   q0, q2, %e12[0]     \\n\"\n                        \"vmla.f32   q10, q3, %e12[1]    \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld2.f32   {d16-d17}, [%4]     \\n\"\n                        \"vext.32    q1, q2, q8, #1      \\n\"\n\n                        \"vmla.f32   q11, q1, %f12[0]    \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n\n                        \"vadd.f32   q0, q0, q10         \\n\"\n                        \"vadd.f32   q0, q0, q11         \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.f32   {d0-d1}, [%1]!      \\n\"\n                        \"bne        0b                  \\n\"\n                        \"sub        %2, #32             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2)      // %4\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k0123), // %10\n                        \"w\"(_k3456), // %11\n                        \"w\"(_k6789)  // %12\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x4_t _r00 = vld1q_f32(r0);\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r20 = vld1q_f32(r2);\n\n                    float32x4_t _sum = vmulq_f32(_r00, 
_k0123);\n                    _sum = vmlaq_f32(_sum, _r10, _k3456);\n                    _sum = vmlaq_f32(_sum, _r20, _k6789);\n\n                    _sum = vsetq_lane_f32(*outptr, _sum, 3);\n\n#if __aarch64__\n                    *outptr = vaddvq_f32(_sum);\n#else\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    _ss = vpadd_f32(_ss, _ss);\n\n                    *outptr = vget_lane_f32(_ss, 0);\n#endif // __aarch64__\n#else\n                    float sum = 0;\n\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r2[0] * k2[0];\n                    sum += r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n\n                    *outptr += sum;\n#endif // __ARM_NEON\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            kernel0 += 9;\n        }\n    }\n}\n\nstatic void conv3x3s2_transform_kernel_neon(const Mat& _kernel, Mat& kernel_tm, int inch, int outch)\n{\n    kernel_tm.create(8 * 9, inch, outch / 8 + outch % 8);\n\n    const float* kernel = _kernel;\n\n    int p = 0;\n    for (; p + 7 < outch; p += 8)\n    {\n        const float* k0 = kernel + (p + 0) * inch * 9;\n        const float* k1 = kernel + (p + 1) * inch * 9;\n        const float* k2 = kernel + (p + 2) * inch * 9;\n        const float* k3 = kernel + (p + 3) * inch * 9;\n        const float* k4 = kernel + (p + 4) * inch * 9;\n        const float* k5 = kernel + (p + 5) * inch * 9;\n        const float* k6 = kernel + (p + 6) * inch * 9;\n        const float* k7 = kernel + (p + 7) * inch * 9;\n\n 
       float* ktmp = kernel_tm.channel(p / 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            for (int k = 0; k < 9; k++)\n            {\n                ktmp[0] = k0[k];\n                ktmp[1] = k1[k];\n                ktmp[2] = k2[k];\n                ktmp[3] = k3[k];\n                ktmp[4] = k4[k];\n                ktmp[5] = k5[k];\n                ktmp[6] = k6[k];\n                ktmp[7] = k7[k];\n                ktmp += 8;\n            }\n\n            k0 += 9;\n            k1 += 9;\n            k2 += 9;\n            k3 += 9;\n            k4 += 9;\n            k5 += 9;\n            k6 += 9;\n            k7 += 9;\n        }\n    }\n    for (; p < outch; p++)\n    {\n        const float* k0 = kernel + (p + 0) * inch * 9;\n\n        float* ktmp = kernel_tm.channel(p / 8 + p % 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            for (int k = 0; k < 9; k++)\n            {\n                ktmp[k] = k0[k];\n            }\n            ktmp += 9;\n\n            k0 += 9;\n        }\n    }\n}\n\nstatic void conv3x3s2_packed_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    //     const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    int nn_outch = outch >> 3;\n    int remain_outch_start = nn_outch << 3;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 8;\n\n        Mat out0 = top_blob.channel(p + 0);\n        Mat out1 = top_blob.channel(p + 1);\n        Mat out2 = top_blob.channel(p + 2);\n        Mat out3 = top_blob.channel(p + 3);\n        Mat out4 = top_blob.channel(p + 4);\n        Mat out5 = top_blob.channel(p + 5);\n        Mat out6 = top_blob.channel(p + 6);\n        Mat out7 
= top_blob.channel(p + 7);\n\n        const float bias0 = bias ? bias[p + 0] : 0.f;\n        const float bias1 = bias ? bias[p + 1] : 0.f;\n        const float bias2 = bias ? bias[p + 2] : 0.f;\n        const float bias3 = bias ? bias[p + 3] : 0.f;\n        const float bias4 = bias ? bias[p + 4] : 0.f;\n        const float bias5 = bias ? bias[p + 5] : 0.f;\n        const float bias6 = bias ? bias[p + 6] : 0.f;\n        const float bias7 = bias ? bias[p + 7] : 0.f;\n\n        out0.fill(bias0);\n        out1.fill(bias1);\n        out2.fill(bias2);\n        out3.fill(bias3);\n        out4.fill(bias4);\n        out5.fill(bias5);\n        out6.fill(bias6);\n        out7.fill(bias7);\n\n        const float* ktmp = _kernel.channel(p / 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n            float* outptr2 = out2;\n            float* outptr3 = out3;\n            float* outptr4 = out4;\n            float* outptr5 = out5;\n            float* outptr6 = out6;\n            float* outptr7 = out7;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v8.4s}, [%1]               \\n\"\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v9.4s}, [%2]               \\n\"\n\n                        \"prfm   
pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v10.4s}, [%3]              \\n\"\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v11.4s}, [%4]              \\n\"\n\n                        ///\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld2    {v4.4s, v5.4s}, [%9], #32   \\n\" // v4=00 v5=01\n\n                        \"ld1    {v0.4s, v1.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                        \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v12.4s}, [%5]              \\n\"\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    {v13.4s}, [%6]              \\n\"\n\n                        \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%7, #128]       \\n\"\n                        \"ld1    {v14.4s}, [%7]              \\n\"\n                        \"prfm   pldl1keep, [%8, #128]       \\n\"\n                        \"ld1    {v15.4s}, [%8]              \\n\"\n\n                        \"ld1    {v2.4s, v3.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld2    {v6.4s, v7.4s}, [%9]        \\n\" // v6\n\n                        \"fmla   v8.4s, v5.4s, v2.s[0]       \\n\"\n                        \"fmla   v9.4s, v5.4s, v2.s[1]       \\n\"\n                        \"fmla   v10.4s, v5.4s, v2.s[2]    
  \\n\"\n                        \"fmla   v11.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"ext    v6.16b, v4.16b, v6.16b, #4  \\n\" // v6=02\n\n                        \"ld1    {v0.4s, v1.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v12.4s, v5.4s, v3.s[0]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v3.s[1]      \\n\"\n                        \"fmla   v14.4s, v5.4s, v3.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v3.s[3]      \\n\"\n\n                        ///\n                        \"prfm   pldl1keep, [%10, #256]      \\n\"\n                        \"ld2    {v4.4s, v5.4s}, [%10], #32  \\n\" // v4=10 v5=11\n\n                        \"fmla   v8.4s, v6.4s, v0.s[0]       \\n\"\n                        \"fmla   v9.4s, v6.4s, v0.s[1]       \\n\"\n                        \"fmla   v10.4s, v6.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v0.s[3]      \\n\"\n\n                        \"ld1    {v2.4s, v3.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v12.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v13.4s, v6.4s, v1.s[1]      \\n\"\n                        \"fmla   v14.4s, v6.4s, v1.s[2]      \\n\"\n                        \"fmla   v15.4s, v6.4s, v1.s[3]      \\n\"\n\n                        \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                        \"fmla   v9.4s, v4.4s, v2.s[1]       \\n\"\n                        \"fmla   v10.4s, v4.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v4.4s, v2.s[3]      \\n\"\n\n                        \"ld1    {v0.4s, v1.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v12.4s, v4.4s, v3.s[0]      \\n\"\n                        \"fmla   v13.4s, v4.4s, v3.s[1]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v3.s[2]      \\n\"\n                        \"fmla   v15.4s, v4.4s, v3.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%10, #256]      \\n\"\n            
            \"ld2    {v6.4s, v7.4s}, [%10]       \\n\" // v6\n\n                        \"fmla   v8.4s, v5.4s, v0.s[0]       \\n\"\n                        \"fmla   v9.4s, v5.4s, v0.s[1]       \\n\"\n                        \"fmla   v10.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v5.4s, v0.s[3]      \\n\"\n\n                        \"ld1    {v2.4s, v3.4s}, [%12], #32  \\n\"\n\n                        \"ext    v6.16b, v4.16b, v6.16b, #4  \\n\" // v6=12\n\n                        \"fmla   v12.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v14.4s, v5.4s, v1.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v1.s[3]      \\n\"\n\n                        ///\n                        \"prfm   pldl1keep, [%11, #256]      \\n\"\n                        \"ld2    {v4.4s, v5.4s}, [%11], #32  \\n\" // v4=20 v5=21\n\n                        \"fmla   v8.4s, v6.4s, v2.s[0]       \\n\"\n                        \"fmla   v9.4s, v6.4s, v2.s[1]       \\n\"\n                        \"fmla   v10.4s, v6.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v2.s[3]      \\n\"\n\n                        \"ld1    {v0.4s, v1.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v12.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v13.4s, v6.4s, v3.s[1]      \\n\"\n                        \"fmla   v14.4s, v6.4s, v3.s[2]      \\n\"\n                        \"fmla   v15.4s, v6.4s, v3.s[3]      \\n\"\n\n                        \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                        \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                        \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n\n                        \"ld1    {v2.4s, v3.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                        
\"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%11, #256]      \\n\"\n                        \"ld2    {v6.4s, v7.4s}, [%11]       \\n\" // v6\n\n                        \"fmla   v8.4s, v5.4s, v2.s[0]       \\n\"\n                        \"fmla   v9.4s, v5.4s, v2.s[1]       \\n\"\n                        \"fmla   v10.4s, v5.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"ext    v6.16b, v4.16b, v6.16b, #4  \\n\" // v6=22\n\n                        \"ld1    {v0.4s, v1.4s}, [%12], #32  \\n\"\n\n                        \"fmla   v12.4s, v5.4s, v3.s[0]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v3.s[1]      \\n\"\n                        \"fmla   v14.4s, v5.4s, v3.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v3.s[3]      \\n\"\n\n                        \"fmla   v8.4s, v6.4s, v0.s[0]       \\n\"\n                        \"fmla   v9.4s, v6.4s, v0.s[1]       \\n\"\n                        \"fmla   v10.4s, v6.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v0.s[3]      \\n\"\n\n                        \"fmla   v12.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v13.4s, v6.4s, v1.s[1]      \\n\"\n\n                        \"st1    {v8.4s}, [%1], #16          \\n\"\n                        \"st1    {v9.4s}, [%2], #16          \\n\"\n\n                        \"fmla   v14.4s, v6.4s, v1.s[2]      \\n\"\n                        \"fmla   v15.4s, v6.4s, v1.s[3]      \\n\"\n\n                        \"st1    {v10.4s}, [%3], #16         \\n\"\n                        \"st1    {v11.4s}, [%4], #16         \\n\"\n\n                        \"sub    %12, %12, #288              \\n\"\n\n                        \"st1    {v12.4s}, [%5], #16         \\n\"\n             
           \"st1    {v13.4s}, [%6], #16         \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n\n                        \"st1    {v14.4s}, [%7], #16         \\n\"\n                        \"st1    {v15.4s}, [%8], #16         \\n\"\n\n                        \"bne    0b                          \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(outptr4), // %5\n                        \"=r\"(outptr5), // %6\n                        \"=r\"(outptr6), // %7\n                        \"=r\"(outptr7), // %8\n                        \"=r\"(r0),      // %9\n                        \"=r\"(r1),      // %10\n                        \"=r\"(r2),      // %11\n                        \"=r\"(ktmp)     // %12\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(outptr4),\n                        \"6\"(outptr5),\n                        \"7\"(outptr6),\n                        \"8\"(outptr7),\n                        \"9\"(r0),\n                        \"10\"(r1),\n                        \"11\"(r2),\n                        \"12\"(ktmp)\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else  // __aarch64__\n                for (; nn > 0; nn--)\n                {\n                    asm volatile(\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d16-d17}, [%0]     \\n\"\n                        \"pld        [%1, #128]          \\n\"\n                
        \"vld1.f32   {d18-d19}, [%1]     \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d20-d21}, [%2]     \\n\"\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d22-d23}, [%3]     \\n\"\n\n                        ///\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld2.f32   {d8-d11}, [%8]!     \\n\" // q4=00 q5=01\n\n                        \"vld1.f32   {d0-d3}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q8, q4, d0[0]       \\n\"\n                        \"vmla.f32   q9, q4, d0[1]       \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%4]     \\n\"\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.f32   {d26-d27}, [%5]     \\n\"\n\n                        \"vmla.f32   q10, q4, d1[0]      \\n\"\n                        \"vmla.f32   q11, q4, d1[1]      \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.f32   {d28-d29}, [%6]     \\n\"\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.f32   {d30-d31}, [%7]     \\n\"\n\n                        \"vld1.f32   {d4-d7}, [%11 :128]! 
\\n\"\n\n                        \"vmla.f32   q12, q4, d2[0]      \\n\"\n                        \"vmla.f32   q13, q4, d2[1]      \\n\"\n                        \"vmla.f32   q14, q4, d3[0]      \\n\"\n                        \"vmla.f32   q15, q4, d3[1]      \\n\"\n\n                        \"pld        [%8, #128]          \\n\"\n                        \"vld2.f32   {d12-d13}, [%8]     \\n\" // q6\n\n                        \"vmla.f32   q8, q5, d4[0]       \\n\"\n                        \"vmla.f32   q9, q5, d4[1]       \\n\"\n                        \"vmla.f32   q10, q5, d5[0]      \\n\"\n                        \"vmla.f32   q11, q5, d5[1]      \\n\"\n\n                        \"vext.f32   q6, q4, q6, #1      \\n\" // q6=02\n\n                        \"vld1.f32   {d0-d3}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q12, q5, d6[0]      \\n\"\n                        \"vmla.f32   q13, q5, d6[1]      \\n\"\n                        \"vmla.f32   q14, q5, d7[0]      \\n\"\n                        \"vmla.f32   q15, q5, d7[1]      \\n\"\n\n                        ///\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld2.f32   {d8-d11}, [%9]!     \\n\" // q4=10 q5=11\n\n                        \"vmla.f32   q8, q6, d0[0]       \\n\"\n                        \"vmla.f32   q9, q6, d0[1]       \\n\"\n                        \"vmla.f32   q10, q6, d1[0]      \\n\"\n                        \"vmla.f32   q11, q6, d1[1]      \\n\"\n\n                        \"vld1.f32   {d4-d7}, [%11 :128]! 
\\n\"\n\n                        \"vmla.f32   q12, q6, d2[0]      \\n\"\n                        \"vmla.f32   q13, q6, d2[1]      \\n\"\n                        \"vmla.f32   q14, q6, d3[0]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"vmla.f32   q8, q4, d4[0]       \\n\"\n                        \"vmla.f32   q9, q4, d4[1]       \\n\"\n                        \"vmla.f32   q10, q4, d5[0]      \\n\"\n                        \"vmla.f32   q11, q4, d5[1]      \\n\"\n\n                        \"vld1.f32   {d0-d3}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q12, q4, d6[0]      \\n\"\n                        \"vmla.f32   q13, q4, d6[1]      \\n\"\n                        \"vmla.f32   q14, q4, d7[0]      \\n\"\n                        \"vmla.f32   q15, q4, d7[1]      \\n\"\n\n                        \"pld        [%9, #128]          \\n\"\n                        \"vld2.f32   {d12-d13}, [%9]     \\n\" // q6\n\n                        \"vmla.f32   q8, q5, d0[0]       \\n\"\n                        \"vmla.f32   q9, q5, d0[1]       \\n\"\n                        \"vmla.f32   q10, q5, d1[0]      \\n\"\n                        \"vmla.f32   q11, q5, d1[1]      \\n\"\n\n                        \"vld1.f32   {d4-d7}, [%11 :128]! \\n\"\n\n                        \"vext.f32   q6, q4, q6, #1      \\n\" // q6=12\n\n                        \"vmla.f32   q12, q5, d2[0]      \\n\"\n                        \"vmla.f32   q13, q5, d2[1]      \\n\"\n                        \"vmla.f32   q14, q5, d3[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[1]      \\n\"\n\n                        ///\n                        \"pld        [%10, #256]         \\n\"\n                        \"vld2.f32   {d8-d11}, [%10]!    
\\n\" // q4=20 q5=21\n\n                        \"vmla.f32   q8, q6, d4[0]       \\n\"\n                        \"vmla.f32   q9, q6, d4[1]       \\n\"\n                        \"vmla.f32   q10, q6, d5[0]      \\n\"\n                        \"vmla.f32   q11, q6, d5[1]      \\n\"\n\n                        \"vld1.f32   {d0-d3}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q12, q6, d6[0]      \\n\"\n                        \"vmla.f32   q13, q6, d6[1]      \\n\"\n                        \"vmla.f32   q14, q6, d7[0]      \\n\"\n                        \"vmla.f32   q15, q6, d7[1]      \\n\"\n\n                        \"vmla.f32   q8, q4, d0[0]       \\n\"\n                        \"vmla.f32   q9, q4, d0[1]       \\n\"\n                        \"vmla.f32   q10, q4, d1[0]      \\n\"\n                        \"vmla.f32   q11, q4, d1[1]      \\n\"\n\n                        \"vld1.f32   {d4-d7}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q12, q4, d2[0]      \\n\"\n                        \"vmla.f32   q13, q4, d2[1]      \\n\"\n                        \"vmla.f32   q14, q4, d3[0]      \\n\"\n                        \"vmla.f32   q15, q4, d3[1]      \\n\"\n\n                        \"pld        [%10, #128]         \\n\"\n                        \"vld2.f32   {d12-d13}, [%10]    \\n\" // q6\n\n                        \"vmla.f32   q8, q5, d4[0]       \\n\"\n                        \"vmla.f32   q9, q5, d4[1]       \\n\"\n                        \"vmla.f32   q10, q5, d5[0]      \\n\"\n                        \"vmla.f32   q11, q5, d5[1]      \\n\"\n\n                        \"vext.f32   q6, q4, q6, #1      \\n\" // q6=22\n\n                        \"vld1.f32   {d0-d3}, [%11 :128]! 
\\n\"\n\n                        \"vmla.f32   q12, q5, d6[0]      \\n\"\n                        \"vmla.f32   q13, q5, d6[1]      \\n\"\n                        \"vmla.f32   q14, q5, d7[0]      \\n\"\n                        \"vmla.f32   q15, q5, d7[1]      \\n\"\n\n                        \"vmla.f32   q8, q6, d0[0]       \\n\"\n                        \"vmla.f32   q9, q6, d0[1]       \\n\"\n                        \"vmla.f32   q10, q6, d1[0]      \\n\"\n                        \"vmla.f32   q11, q6, d1[1]      \\n\"\n\n                        \"vmla.f32   q12, q6, d2[0]      \\n\"\n                        \"vmla.f32   q13, q6, d2[1]      \\n\"\n\n                        \"vst1.f32   {d16-d17}, [%0]!    \\n\"\n                        \"vst1.f32   {d18-d19}, [%1]!    \\n\"\n\n                        \"vmla.f32   q14, q6, d3[0]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"vst1.f32   {d20-d21}, [%2]!    \\n\"\n                        \"vst1.f32   {d22-d23}, [%3]!    \\n\"\n\n                        \"sub        %11, %11, #288      \\n\"\n\n                        \"vst1.f32   {d24-d25}, [%4]!    \\n\"\n                        \"vst1.f32   {d26-d27}, [%5]!    \\n\"\n                        \"vst1.f32   {d28-d29}, [%6]!    \\n\"\n                        \"vst1.f32   {d30-d31}, [%7]!    
\\n\"\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(outptr3), // %3\n                        \"=r\"(outptr4), // %4\n                        \"=r\"(outptr5), // %5\n                        \"=r\"(outptr6), // %6\n                        \"=r\"(outptr7), // %7\n                        \"=r\"(r0),      // %8\n                        \"=r\"(r1),      // %9\n                        \"=r\"(r2),      // %10\n                        \"=r\"(ktmp)     // %11\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n                        \"2\"(outptr2),\n                        \"3\"(outptr3),\n                        \"4\"(outptr4),\n                        \"5\"(outptr5),\n                        \"6\"(outptr6),\n                        \"7\"(outptr7),\n                        \"8\"(r0),\n                        \"9\"(r1),\n                        \"10\"(r2),\n                        \"11\"(ktmp)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n#if __aarch64__\n                    asm volatile(\n                        \"ld1    {v10.4s, v11.4s}, [%11], #32    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #128]   \\n\"\n                        \"ld1    {v0.4s}, [%8]           \\n\"\n\n                        \"ld1    {v12.4s, v13.4s}, [%11], #32    \\n\"\n\n                        \"ld1    {v8.s}[0], [%0]         \\n\"\n                        \"ld1    {v8.s}[1], [%1]         \\n\"\n                        \"ld1    {v8.s}[2], [%2]         \\n\"\n                        \"ld1    {v8.s}[3], [%3]         \\n\"\n\n                        
\"fmul   v14.4s, v10.4s, v0.s[0] \\n\"\n                        \"fmul   v15.4s, v11.4s, v0.s[0] \\n\"\n\n                        \"ld1    {v9.s}[0], [%4]         \\n\"\n                        \"ld1    {v9.s}[1], [%5]         \\n\"\n                        \"ld1    {v9.s}[2], [%6]         \\n\"\n                        \"ld1    {v9.s}[3], [%7]         \\n\"\n\n                        \"ld1    {v10.4s, v11.4s}, [%11], #32    \\n\"\n\n                        \"fmla   v8.4s, v12.4s, v0.s[1]  \\n\"\n                        \"fmla   v9.4s, v13.4s, v0.s[1]  \\n\"\n\n                        \"ld1    {v12.4s, v13.4s}, [%11], #32    \\n\"\n\n                        \"fmla   v14.4s, v10.4s, v0.s[2] \\n\"\n                        \"fmla   v15.4s, v11.4s, v0.s[2] \\n\"\n\n                        \"prfm   pldl1keep, [%9, #128]   \\n\"\n                        \"ld1    {v1.4s}, [%9]           \\n\"\n\n                        \"ld1    {v10.4s, v11.4s}, [%11], #32    \\n\"\n\n                        \"fmla   v8.4s, v12.4s, v1.s[0]  \\n\"\n                        \"fmla   v9.4s, v13.4s, v1.s[0]  \\n\"\n\n                        \"ld1    {v12.4s, v13.4s}, [%11], #32    \\n\"\n\n                        \"fmla   v14.4s, v10.4s, v1.s[1] \\n\"\n                        \"fmla   v15.4s, v11.4s, v1.s[1] \\n\"\n\n                        \"ld1    {v10.4s, v11.4s}, [%11], #32    \\n\"\n\n                        \"fmla   v8.4s, v12.4s, v1.s[2]  \\n\"\n                        \"fmla   v9.4s, v13.4s, v1.s[2]  \\n\"\n\n                        \"prfm   pldl1keep, [%10, #128]  \\n\"\n                        \"ld1    {v0.4s}, [%10]          \\n\"\n\n                        \"ld1    {v12.4s, v13.4s}, [%11], #32    \\n\"\n\n                        \"fmla   v14.4s, v10.4s, v0.s[0] \\n\"\n                        \"fmla   v15.4s, v11.4s, v0.s[0] \\n\"\n\n                        \"ld1    {v10.4s, v11.4s}, [%11], #32    \\n\"\n\n                        \"fmla   v8.4s, v12.4s, v0.s[1]  \\n\"\n              
          \"fmla   v9.4s, v13.4s, v0.s[1]  \\n\"\n\n                        \"fmla   v14.4s, v10.4s, v0.s[2] \\n\"\n                        \"fmla   v15.4s, v11.4s, v0.s[2] \\n\"\n\n                        \"fadd   v8.4s, v8.4s, v14.4s    \\n\"\n                        \"fadd   v9.4s, v9.4s, v15.4s    \\n\"\n\n                        \"sub    %11, %11, #288          \\n\"\n\n                        \"st1    {v8.s}[0], [%0], #4     \\n\"\n                        \"st1    {v8.s}[1], [%1], #4     \\n\"\n                        \"st1    {v8.s}[2], [%2], #4     \\n\"\n                        \"st1    {v8.s}[3], [%3], #4     \\n\"\n\n                        \"st1    {v9.s}[0], [%4], #4     \\n\"\n                        \"st1    {v9.s}[1], [%5], #4     \\n\"\n                        \"st1    {v9.s}[2], [%6], #4     \\n\"\n                        \"st1    {v9.s}[3], [%7], #4     \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(outptr3), // %3\n                        \"=r\"(outptr4), // %4\n                        \"=r\"(outptr5), // %5\n                        \"=r\"(outptr6), // %6\n                        \"=r\"(outptr7), // %7\n                        \"=r\"(r0),      // %8\n                        \"=r\"(r1),      // %9\n                        \"=r\"(r2),      // %10\n                        \"=r\"(ktmp)     // %11\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n                        \"2\"(outptr2),\n                        \"3\"(outptr3),\n                        \"4\"(outptr4),\n                        \"5\"(outptr5),\n                        \"6\"(outptr6),\n                        \"7\"(outptr7),\n                        \"8\"(r0),\n                        \"9\"(r1),\n                        \"10\"(r2),\n                        \"11\"(ktmp)\n                        : \"memory\", \"v0\", 
\"v1\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"vld1.f32   {d20-d23}, [%11 :128]! \\n\"\n\n                        \"pld        [%8, #128]      \\n\"\n                        \"vld1.f32   {d0-d1}, [%8]   \\n\"\n\n                        \"vld1.f32   {d24-d27}, [%11 :128]! \\n\"\n\n                        \"vld1.f32   {d16[0]}, [%0]  \\n\"\n                        \"vld1.f32   {d16[1]}, [%1]  \\n\"\n                        \"vld1.f32   {d17[0]}, [%2]  \\n\"\n                        \"vld1.f32   {d17[1]}, [%3]  \\n\"\n\n                        \"vmul.f32   q14, q10, d0[0] \\n\"\n                        \"vmul.f32   q15, q11, d0[0] \\n\"\n\n                        \"vld1.f32   {d18[0]}, [%4]  \\n\"\n                        \"vld1.f32   {d18[1]}, [%5]  \\n\"\n                        \"vld1.f32   {d19[0]}, [%6]  \\n\"\n                        \"vld1.f32   {d19[1]}, [%7]  \\n\"\n\n                        \"vld1.f32   {d20-d23}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q8, q12, d0[1]  \\n\"\n                        \"vmla.f32   q9, q13, d0[1]  \\n\"\n\n                        \"vld1.f32   {d24-d27}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0] \\n\"\n                        \"vmla.f32   q15, q11, d1[0] \\n\"\n\n                        \"pld        [%9, #128]      \\n\"\n                        \"vld1.f32   {d2-d3}, [%9]   \\n\"\n\n                        \"vld1.f32   {d20-d23}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q8, q12, d2[0]  \\n\"\n                        \"vmla.f32   q9, q13, d2[0]  \\n\"\n\n                        \"vld1.f32   {d24-d27}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1] \\n\"\n                        \"vmla.f32   q15, q11, d2[1] \\n\"\n\n                        \"vld1.f32   {d20-d23}, [%11 :128]! 
\\n\"\n\n                        \"vmla.f32   q8, q12, d3[0]  \\n\"\n                        \"vmla.f32   q9, q13, d3[0]  \\n\"\n\n                        \"pld        [%10, #128]     \\n\"\n                        \"vld1.f32   {d0-d1}, [%10]  \\n\"\n\n                        \"vld1.f32   {d24-d27}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d0[0] \\n\"\n                        \"vmla.f32   q15, q11, d0[0] \\n\"\n\n                        \"vld1.f32   {d20-d23}, [%11 :128]! \\n\"\n\n                        \"vmla.f32   q8, q12, d0[1]  \\n\"\n                        \"vmla.f32   q9, q13, d0[1]  \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0] \\n\"\n                        \"vmla.f32   q15, q11, d1[0] \\n\"\n\n                        \"vadd.f32   q8, q8, q14     \\n\"\n                        \"vadd.f32   q9, q9, q15     \\n\"\n\n                        \"sub        %11, %11, #288  \\n\"\n\n                        \"vst1.f32   {d16[0]}, [%0]! \\n\"\n                        \"vst1.f32   {d16[1]}, [%1]! \\n\"\n                        \"vst1.f32   {d17[0]}, [%2]! \\n\"\n                        \"vst1.f32   {d17[1]}, [%3]! \\n\"\n\n                        \"vst1.f32   {d18[0]}, [%4]! \\n\"\n                        \"vst1.f32   {d18[1]}, [%5]! \\n\"\n                        \"vst1.f32   {d19[0]}, [%6]! \\n\"\n                        \"vst1.f32   {d19[1]}, [%7]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(outptr3), // %3\n                        \"=r\"(outptr4), // %4\n                        \"=r\"(outptr5), // %5\n                        \"=r\"(outptr6), // %6\n                        \"=r\"(outptr7), // %7\n                        \"=r\"(r0),      // %8\n                        \"=r\"(r1),      // %9\n                        \"=r\"(r2),      // %10\n                        \"=r\"(ktmp)     // %11\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n                        \"2\"(outptr2),\n                        \"3\"(outptr3),\n                        \"4\"(outptr4),\n                        \"5\"(outptr5),\n                        \"6\"(outptr6),\n                        \"7\"(outptr7),\n                        \"8\"(r0),\n                        \"9\"(r1),\n                        \"10\"(r2),\n                        \"11\"(ktmp)\n                        : \"memory\", \"q0\", \"q1\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // __ARM_NEON\n                    float sum0 = 0.f;\n                    float sum1 = 0.f;\n                    float sum2 = 0.f;\n                    float sum3 = 0.f;\n                    float sum4 = 0.f;\n                    float sum5 = 0.f;\n                    float sum6 = 0.f;\n                    float sum7 = 0.f;\n\n                    sum0 += r0[0] * ktmp[0];\n                    sum1 += r0[0] * ktmp[1];\n                    sum2 += r0[0] * ktmp[2];\n                    sum3 += r0[0] * ktmp[3];\n                    sum4 += r0[0] * ktmp[4];\n                    sum5 += r0[0] * ktmp[5];\n                    sum6 += r0[0] * ktmp[6];\n                    sum7 += r0[0] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r0[1] * ktmp[0];\n 
                   sum1 += r0[1] * ktmp[1];\n                    sum2 += r0[1] * ktmp[2];\n                    sum3 += r0[1] * ktmp[3];\n                    sum4 += r0[1] * ktmp[4];\n                    sum5 += r0[1] * ktmp[5];\n                    sum6 += r0[1] * ktmp[6];\n                    sum7 += r0[1] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r0[2] * ktmp[0];\n                    sum1 += r0[2] * ktmp[1];\n                    sum2 += r0[2] * ktmp[2];\n                    sum3 += r0[2] * ktmp[3];\n                    sum4 += r0[2] * ktmp[4];\n                    sum5 += r0[2] * ktmp[5];\n                    sum6 += r0[2] * ktmp[6];\n                    sum7 += r0[2] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r1[0] * ktmp[0];\n                    sum1 += r1[0] * ktmp[1];\n                    sum2 += r1[0] * ktmp[2];\n                    sum3 += r1[0] * ktmp[3];\n                    sum4 += r1[0] * ktmp[4];\n                    sum5 += r1[0] * ktmp[5];\n                    sum6 += r1[0] * ktmp[6];\n                    sum7 += r1[0] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r1[1] * ktmp[0];\n                    sum1 += r1[1] * ktmp[1];\n                    sum2 += r1[1] * ktmp[2];\n                    sum3 += r1[1] * ktmp[3];\n                    sum4 += r1[1] * ktmp[4];\n                    sum5 += r1[1] * ktmp[5];\n                    sum6 += r1[1] * ktmp[6];\n                    sum7 += r1[1] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r1[2] * ktmp[0];\n                    sum1 += r1[2] * ktmp[1];\n                    sum2 += r1[2] * ktmp[2];\n                    sum3 += r1[2] * ktmp[3];\n                    sum4 += r1[2] * ktmp[4];\n                    sum5 += r1[2] * ktmp[5];\n                    sum6 += r1[2] * ktmp[6];\n                    sum7 += r1[2] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r2[0] * 
ktmp[0];\n                    sum1 += r2[0] * ktmp[1];\n                    sum2 += r2[0] * ktmp[2];\n                    sum3 += r2[0] * ktmp[3];\n                    sum4 += r2[0] * ktmp[4];\n                    sum5 += r2[0] * ktmp[5];\n                    sum6 += r2[0] * ktmp[6];\n                    sum7 += r2[0] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r2[1] * ktmp[0];\n                    sum1 += r2[1] * ktmp[1];\n                    sum2 += r2[1] * ktmp[2];\n                    sum3 += r2[1] * ktmp[3];\n                    sum4 += r2[1] * ktmp[4];\n                    sum5 += r2[1] * ktmp[5];\n                    sum6 += r2[1] * ktmp[6];\n                    sum7 += r2[1] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += r2[2] * ktmp[0];\n                    sum1 += r2[2] * ktmp[1];\n                    sum2 += r2[2] * ktmp[2];\n                    sum3 += r2[2] * ktmp[3];\n                    sum4 += r2[2] * ktmp[4];\n                    sum5 += r2[2] * ktmp[5];\n                    sum6 += r2[2] * ktmp[6];\n                    sum7 += r2[2] * ktmp[7];\n                    ktmp += 8;\n\n                    *outptr0 += sum0;\n                    *outptr1 += sum1;\n                    *outptr2 += sum2;\n                    *outptr3 += sum3;\n                    *outptr4 += sum4;\n                    *outptr5 += sum5;\n                    *outptr6 += sum6;\n                    *outptr7 += sum7;\n\n                    ktmp -= 8 * 9;\n\n                    outptr0++;\n                    outptr1++;\n                    outptr2++;\n                    outptr3++;\n                    outptr4++;\n                    outptr5++;\n                    outptr6++;\n                    outptr7++;\n#endif // __ARM_NEON\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += 
tailstep;\n            }\n\n            ktmp += 8 * 9;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        const float* ktmp = _kernel.channel(p / 8 + p % 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n\n            const float* k0 = ktmp;\n            const float* k1 = ktmp + 3;\n            const float* k2 = ktmp + 6;\n\n#if __ARM_NEON\n            float32x4_t _k0123 = vld1q_f32(k0);\n            float32x4_t _k3456 = vld1q_f32(k1);\n            float32x4_t _k6789 = vld1q_f32(k2);\n#endif // __ARM_NEON\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n                        \"0:                                        \\n\"\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v0.4s}, [%1]                  \\n\"\n\n                        \"fmla       v0.4s,  v2.4s, %10.s[0]        \\n\"\n                        \"fmul       v10.4s, v3.4s, %10.s[1]        \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%2]           \\n\"\n      
                  \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                        \"fmul       v11.4s, v1.4s, %10.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%3], #32      \\n\"\n\n                        \"fmla       v0.4s,  v2.4s, %11.s[0]        \\n\"\n                        \"fmla       v10.4s, v3.4s, %11.s[1]        \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%3]           \\n\"\n                        \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                        \"fmla       v11.4s, v1.4s, %11.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%4], #32      \\n\"\n\n                        \"fmla       v0.4s,  v2.4s, %12.s[0]        \\n\"\n                        \"fmla       v10.4s, v3.4s, %12.s[1]        \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%4]           \\n\"\n                        \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                        \"fmla       v11.4s, v1.4s, %12.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n\n                        \"fadd       v0.4s, v0.4s, v10.4s           \\n\"\n                        \"fadd       v0.4s, v0.4s, v11.4s           \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"st1        {v0.4s}, [%1], #16             \\n\"\n                        \"bne        0b                             \\n\"\n                        \"sub        %2, %2, #32                    \\n\"\n                        : 
\"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2)      // %4\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k0123), // %10\n                        \"w\"(_k3456), // %11\n                        \"w\"(_k6789)  // %12\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n\n                        \"0:                             \\n\"\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1]       \\n\"\n\n                        \"vmla.f32   q0, q2, %e10[0]     \\n\"\n                        \"vmul.f32   q10, q3, %e10[1]    \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld2.f32   {d16-d17}, [%2]     \\n\"\n                        \"vext.32    q1, q2, q8, #1      \\n\"\n\n                        \"vmul.f32   q11, q1, %f10[0]    \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%3]!      
\\n\"\n\n                        \"vmla.f32   q0, q2, %e11[0]     \\n\"\n                        \"vmla.f32   q10, q3, %e11[1]    \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld2.f32   {d16-d17}, [%3]     \\n\"\n                        \"vext.32    q1, q2, q8, #1      \\n\"\n\n                        \"vmla.f32   q11, q1, %f11[0]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%4]!      \\n\"\n\n                        \"vmla.f32   q0, q2, %e12[0]     \\n\"\n                        \"vmla.f32   q10, q3, %e12[1]    \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld2.f32   {d16-d17}, [%4]     \\n\"\n                        \"vext.32    q1, q2, q8, #1      \\n\"\n\n                        \"vmla.f32   q11, q1, %f12[0]    \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n\n                        \"vadd.f32   q0, q0, q10         \\n\"\n                        \"vadd.f32   q0, q0, q11         \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.f32   {d0-d1}, [%1]!      
\\n\"\n                        \"bne        0b                  \\n\"\n                        \"sub        %2, #32             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2)      // %4\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k0123), // %10\n                        \"w\"(_k3456), // %11\n                        \"w\"(_k6789)  // %12\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n                    float32x4_t _r00 = vld1q_f32(r0);\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r20 = vld1q_f32(r2);\n\n                    float32x4_t _sum = vmulq_f32(_r00, _k0123);\n                    _sum = vmlaq_f32(_sum, _r10, _k3456);\n                    _sum = vmlaq_f32(_sum, _r20, _k6789);\n\n                    _sum = vsetq_lane_f32(*outptr, _sum, 3);\n\n#if __aarch64__\n                    *outptr = vaddvq_f32(_sum);\n#else\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    _ss = vpadd_f32(_ss, _ss);\n\n                    *outptr = vget_lane_f32(_ss, 0);\n#endif // __aarch64__\n#else\n                    float sum = 0;\n\n                    sum += r0[0] * ktmp[0];\n                    sum += r0[1] * ktmp[1];\n                    sum += r0[2] * ktmp[2];\n                    sum += r1[0] * ktmp[3];\n                    sum += r1[1] * ktmp[4];\n                    sum += 
r1[2] * ktmp[5];\n                    sum += r2[0] * ktmp[6];\n                    sum += r2[1] * ktmp[7];\n                    sum += r2[2] * ktmp[8];\n\n                    *outptr += sum;\n#endif // __ARM_NEON\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            ktmp += 9;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s2_transform_kernel_int8_neon(const Mat& _kernel, Mat& kernel_tm, int inch, int outch)\n{\n    kernel_tm.create(8 * 9, inch, outch / 8 + outch % 8, (size_t)1u);\n\n    const signed char* kernel = _kernel;\n\n    int p = 0;\n    for (; p + 7 < outch; p += 8)\n    {\n        const signed char* k0 = kernel + (p + 0) * inch * 9;\n        const signed char* k1 = kernel + (p + 1) * inch * 9;\n        const signed char* k2 = kernel + (p + 2) * inch * 9;\n        const signed char* k3 = kernel + (p + 3) * inch * 9;\n        const signed char* k4 = kernel + (p + 4) * inch * 9;\n        const signed char* k5 = kernel + (p + 5) * inch * 9;\n        const signed char* k6 = kernel + (p + 6) * inch * 9;\n        const signed char* k7 = kernel + (p + 7) * inch * 9;\n\n        signed char* ktmp = kernel_tm.channel(p / 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            for (int k = 0; k < 9; k++)\n            {\n                ktmp[0] = k0[k];\n                ktmp[1] = k1[k];\n                ktmp[2] = k2[k];\n                ktmp[3] = k3[k];\n                ktmp[4] = k4[k];\n                ktmp[5] = k5[k];\n                ktmp[6] = k6[k];\n                ktmp[7] = k7[k];\n                ktmp += 8;\n            }\n\n            k0 += 9;\n            k1 += 9;\n            k2 += 9;\n            k3 += 9;\n            k4 += 9;\n            k5 += 9;\n            k6 += 9;\n            k7 += 9;\n        }\n    }\n    for (; p < outch; p++)\n    {\n        const signed char* k0 = kernel + (p + 0) * inch * 9;\n\n        signed char* ktmp = kernel_tm.channel(p / 8 + p % 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            for (int k = 0; k < 9; k++)\n            {\n                ktmp[k] = k0[k];\n            }\n            ktmp += 9;\n\n            k0 += 9;\n        }\n    }\n}\n\nstatic void conv3x3s2_packed_int8_neon(const Mat& bottom_blob, 
Mat& top_blob, const Mat& _kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    int nn_outch = outch >> 3;\n    int remain_outch_start = nn_outch << 3;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 8;\n\n        Mat out0 = top_blob.channel(p + 0);\n        Mat out1 = top_blob.channel(p + 1);\n        Mat out2 = top_blob.channel(p + 2);\n        Mat out3 = top_blob.channel(p + 3);\n        Mat out4 = top_blob.channel(p + 4);\n        Mat out5 = top_blob.channel(p + 5);\n        Mat out6 = top_blob.channel(p + 6);\n        Mat out7 = top_blob.channel(p + 7);\n\n        out0.fill(0);\n        out1.fill(0);\n        out2.fill(0);\n        out3.fill(0);\n        out4.fill(0);\n        out5.fill(0);\n        out6.fill(0);\n        out7.fill(0);\n\n        const signed char* ktmp = _kernel.channel(p / 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            int* outptr0 = out0;\n            int* outptr1 = out1;\n            int* outptr2 = out2;\n            int* outptr3 = out3;\n            int* outptr4 = out4;\n            int* outptr5 = out5;\n            int* outptr6 = out6;\n            int* outptr7 = out7;\n\n            const signed char* img0 = bottom_blob.channel(q);\n\n            const signed char* r0 = img0;\n            const signed char* r1 = img0 + w;\n            const signed char* r2 = img0 + w * 2;\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n#if __aarch64__\n                int nn = outw >> 3;\n                int remain = outw & 7;\n#else\n                int nn = outw >> 2;\n                int remain = outw & 3;\n#endif // __aarch64__\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if 
__aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                   \\n\"\n\n                        \"ld1    {v0.8b, v1.8b, v2.8b}, [%12], #24  \\n\" //ktmp\n                        \"ld2    {v3.8b, v4.8b}, [%9], #16     \\n\"      //r0-r2\n                        \"ld2    {v5.8b, v6.8b}, [%9]          \\n\"\n\n                        \"ld1    {v8.4s, v9.4s}, [%1]          \\n\" //out0\n                        \"ld1    {v10.4s, v11.4s}, [%2]        \\n\" //out1\n                        \"ld1    {v12.4s, v13.4s}, [%3]        \\n\" //out2\n                        \"ld1    {v14.4s, v15.4s}, [%4]        \\n\" //out3\n                        \"ld1    {v16.4s, v17.4s}, [%5]        \\n\" //out4\n                        \"ld1    {v18.4s, v19.4s}, [%6]        \\n\" //out5\n                        \"ld1    {v20.4s, v21.4s}, [%7]        \\n\" //out6\n                        \"ld1    {v22.4s, v23.4s}, [%8]        \\n\" //out7\n\n                        \"ext    v7.8b, v3.8b, v5.8b, #1       \\n\"\n\n                        \"sshll  v0.8h, v0.8b, #0              \\n\" //(k00-k70)\n                        \"sshll  v1.8h, v1.8b, #0              \\n\" //(k01-k71)\n                        \"sshll  v2.8h, v2.8b, #0              \\n\" //(k02-k72)\n                        \"sshll  v3.8h, v3.8b, #0              \\n\" // r0\n                        \"sshll  v4.8h, v4.8b, #0              \\n\" // r1\n                        \"sshll  v7.8h, v7.8b, #0              \\n\" // r2\n\n                        // r0\n                        \"smlal  v8.4s, v3.4h, v0.h[0]         \\n\" // out0 += (r00-r07)*k00\n                        \"smlal2  v9.4s, v3.8h, v0.h[0]        \\n\"\n                        \"smlal  v10.4s, v3.4h, v0.h[1]        \\n\" // out1 += (r00-r07)*k10\n                        \"smlal2  v11.4s, v3.8h, v0.h[1]       \\n\"\n                        \"smlal  v12.4s, v3.4h, v0.h[2]        
\\n\" // out2 += (r00-r07)*k20\n                        \"smlal2  v13.4s, v3.8h, v0.h[2]       \\n\"\n                        \"smlal  v14.4s, v3.4h, v0.h[3]        \\n\" // out3 += (r00-r07)*k30\n                        \"smlal2  v15.4s, v3.8h, v0.h[3]       \\n\"\n                        \"smlal  v16.4s, v3.4h, v0.h[4]        \\n\" // out4 += (r00-r07)*k40\n                        \"smlal2  v17.4s, v3.8h, v0.h[4]       \\n\"\n                        \"smlal  v18.4s, v3.4h, v0.h[5]        \\n\" // out5 += (r00-r07)*k50\n                        \"smlal2  v19.4s, v3.8h, v0.h[5]       \\n\"\n                        \"smlal  v20.4s, v3.4h, v0.h[6]        \\n\" // out6 += (r00-r07)*k60\n                        \"smlal2  v21.4s, v3.8h, v0.h[6]       \\n\"\n                        \"smlal  v22.4s, v3.4h, v0.h[7]        \\n\" // out7 += (r00-r07)*k70\n                        \"smlal2  v23.4s, v3.8h, v0.h[7]       \\n\"\n                        // r1\n                        \"smlal  v8.4s, v4.4h, v1.h[0]         \\n\" // out0 += (r10-r17)*k01\n                        \"smlal2  v9.4s, v4.8h, v1.h[0]        \\n\"\n                        \"smlal  v10.4s, v4.4h, v1.h[1]        \\n\" // out1 += (r10-r17)*k11\n                        \"smlal2  v11.4s, v4.8h, v1.h[1]       \\n\"\n                        \"smlal  v12.4s, v4.4h, v1.h[2]        \\n\" // out2 += (r10-r17)*k21\n                        \"smlal2  v13.4s, v4.8h, v1.h[2]       \\n\"\n                        \"smlal  v14.4s, v4.4h, v1.h[3]        \\n\" // out3 += (r10-r17)*k31\n                        \"smlal2  v15.4s, v4.8h, v1.h[3]       \\n\"\n                        \"smlal  v16.4s, v4.4h, v1.h[4]        \\n\" // out4 += (r10-r17)*k41\n                        \"smlal2  v17.4s, v4.8h, v1.h[4]       \\n\"\n                        \"smlal  v18.4s, v4.4h, v1.h[5]        \\n\" // out5 += (r10-r17)*k51\n                        \"smlal2  v19.4s, v4.8h, v1.h[5]       \\n\"\n                        \"smlal  v20.4s, v4.4h, 
v1.h[6]        \\n\" // out6 += (r10-r17)*k61\n                        \"smlal2  v21.4s, v4.8h, v1.h[6]       \\n\"\n                        \"smlal  v22.4s, v4.4h, v1.h[7]        \\n\" // out7 += (r10-r17)*k71\n                        \"smlal2  v23.4s, v4.8h, v1.h[7]       \\n\"\n                        // r2\n                        \"smlal  v8.4s, v7.4h, v2.h[0]         \\n\" // out0 += (r20-r27)*k02\n                        \"smlal2  v9.4s, v7.8h, v2.h[0]        \\n\"\n                        \"smlal  v10.4s, v7.4h, v2.h[1]        \\n\" // out1 += (r20-r27)*k12\n                        \"smlal2  v11.4s, v7.8h, v2.h[1]       \\n\"\n                        \"smlal  v12.4s, v7.4h, v2.h[2]        \\n\" // out2 += (r20-r27)*k22\n                        \"smlal2  v13.4s, v7.8h, v2.h[2]       \\n\"\n                        \"smlal  v14.4s, v7.4h, v2.h[3]        \\n\" // out3 += (r20-r27)*k32\n                        \"smlal2  v15.4s, v7.8h, v2.h[3]       \\n\"\n                        \"smlal  v16.4s, v7.4h, v2.h[4]        \\n\" // out4 += (r20-r27)*k42\n                        \"smlal2  v17.4s, v7.8h, v2.h[4]       \\n\"\n                        \"smlal  v18.4s, v7.4h, v2.h[5]        \\n\" // out5 += (r20-r27)*k52\n                        \"smlal2  v19.4s, v7.8h, v2.h[5]       \\n\"\n                        \"smlal  v20.4s, v7.4h, v2.h[6]        \\n\" // out6 += (r20-r27)*k62\n                        \"smlal2  v21.4s, v7.8h, v2.h[6]       \\n\"\n                        \"smlal  v22.4s, v7.4h, v2.h[7]        \\n\" // out7 += (r20-r27)*k72\n                        \"smlal2  v23.4s, v7.8h, v2.h[7]       \\n\"\n\n                        \"ld1    {v0.8b, v1.8b, v2.8b}, [%12], #24  \\n\" //ktmp\n                        \"ld2    {v3.8b, v4.8b}, [%10], #16    \\n\"      //r3-r5\n                        \"ld2    {v5.8b, v6.8b}, [%10]         \\n\"\n\n                        \"ext    v7.8b, v3.8b, v5.8b, #1       \\n\"\n\n                        \"sshll  v0.8h, v0.8b, #0       
       \\n\" //(k03-k73)\n                        \"sshll  v1.8h, v1.8b, #0              \\n\" //(k04-k74)\n                        \"sshll  v2.8h, v2.8b, #0              \\n\" //(k05-k75)\n                        \"sshll  v3.8h, v3.8b, #0              \\n\" // r3\n                        \"sshll  v4.8h, v4.8b, #0              \\n\" // r4\n                        \"sshll  v7.8h, v7.8b, #0              \\n\" // r5\n\n                        // r3\n                        \"smlal  v8.4s, v3.4h, v0.h[0]         \\n\" // out0 += (r30-r37)*k03\n                        \"smlal2  v9.4s, v3.8h, v0.h[0]        \\n\"\n                        \"smlal  v10.4s, v3.4h, v0.h[1]        \\n\" // out1 += (r30-r37)*k13\n                        \"smlal2  v11.4s, v3.8h, v0.h[1]       \\n\"\n                        \"smlal  v12.4s, v3.4h, v0.h[2]        \\n\" // out2 += (r30-r37)*k23\n                        \"smlal2  v13.4s, v3.8h, v0.h[2]       \\n\"\n                        \"smlal  v14.4s, v3.4h, v0.h[3]        \\n\" // out3 += (r30-r37)*k33\n                        \"smlal2  v15.4s, v3.8h, v0.h[3]       \\n\"\n                        \"smlal  v16.4s, v3.4h, v0.h[4]        \\n\" // out4 += (r30-r37)*k43\n                        \"smlal2  v17.4s, v3.8h, v0.h[4]       \\n\"\n                        \"smlal  v18.4s, v3.4h, v0.h[5]        \\n\" // out5 += (r30-r37)*k53\n                        \"smlal2  v19.4s, v3.8h, v0.h[5]       \\n\"\n                        \"smlal  v20.4s, v3.4h, v0.h[6]        \\n\" // out6 += (r30-r37)*k63\n                        \"smlal2  v21.4s, v3.8h, v0.h[6]       \\n\"\n                        \"smlal  v22.4s, v3.4h, v0.h[7]        \\n\" // out7 += (r30-r37)*k73\n                        \"smlal2  v23.4s, v3.8h, v0.h[7]       \\n\"\n                        // r4\n                        \"smlal  v8.4s, v4.4h, v1.h[0]         \\n\" // out0 += (r40-r47)*k04\n                        \"smlal2  v9.4s, v4.8h, v1.h[0]        \\n\"\n                        \"smlal  
v10.4s, v4.4h, v1.h[1]        \\n\" // out1 += (r40-r47)*k14\n                        \"smlal2  v11.4s, v4.8h, v1.h[1]       \\n\"\n                        \"smlal  v12.4s, v4.4h, v1.h[2]        \\n\" // out2 += (r40-r47)*k24\n                        \"smlal2  v13.4s, v4.8h, v1.h[2]       \\n\"\n                        \"smlal  v14.4s, v4.4h, v1.h[3]        \\n\" // out3 += (r40-r47)*k34\n                        \"smlal2  v15.4s, v4.8h, v1.h[3]       \\n\"\n                        \"smlal  v16.4s, v4.4h, v1.h[4]        \\n\" // out4 += (r40-r47)*k44\n                        \"smlal2  v17.4s, v4.8h, v1.h[4]       \\n\"\n                        \"smlal  v18.4s, v4.4h, v1.h[5]        \\n\" // out5 += (r40-r47)*k54\n                        \"smlal2  v19.4s, v4.8h, v1.h[5]       \\n\"\n                        \"smlal  v20.4s, v4.4h, v1.h[6]        \\n\" // out6 += (r40-r47)*k64\n                        \"smlal2  v21.4s, v4.8h, v1.h[6]       \\n\"\n                        \"smlal  v22.4s, v4.4h, v1.h[7]        \\n\" // out7 += (r40-r47)*k74\n                        \"smlal2  v23.4s, v4.8h, v1.h[7]       \\n\"\n                        // r5\n                        \"smlal  v8.4s, v7.4h, v2.h[0]         \\n\" // out0 += (r50-r57)*k05\n                        \"smlal2  v9.4s, v7.8h, v2.h[0]        \\n\"\n                        \"smlal  v10.4s, v7.4h, v2.h[1]        \\n\" // out1 += (r50-r57)*k15\n                        \"smlal2  v11.4s, v7.8h, v2.h[1]       \\n\"\n                        \"smlal  v12.4s, v7.4h, v2.h[2]        \\n\" // out2 += (r50-r57)*k25\n                        \"smlal2  v13.4s, v7.8h, v2.h[2]       \\n\"\n                        \"smlal  v14.4s, v7.4h, v2.h[3]        \\n\" // out3 += (r50-r57)*k35\n                        \"smlal2  v15.4s, v7.8h, v2.h[3]       \\n\"\n                        \"smlal  v16.4s, v7.4h, v2.h[4]        \\n\" // out4 += (r50-r57)*k45\n                        \"smlal2  v17.4s, v7.8h, v2.h[4]       \\n\"\n                      
  \"smlal  v18.4s, v7.4h, v2.h[5]        \\n\" // out5 += (r50-r57)*k55\n                        \"smlal2  v19.4s, v7.8h, v2.h[5]       \\n\"\n                        \"smlal  v20.4s, v7.4h, v2.h[6]        \\n\" // out6 += (r50-r57)*k65\n                        \"smlal2  v21.4s, v7.8h, v2.h[6]       \\n\"\n                        \"smlal  v22.4s, v7.4h, v2.h[7]        \\n\" // out7 += (r50-r57)*k75\n                        \"smlal2  v23.4s, v7.8h, v2.h[7]       \\n\"\n\n                        \"ld1    {v0.8b, v1.8b, v2.8b}, [%12], #24  \\n\" //ktmp\n                        \"ld2    {v3.8b, v4.8b}, [%11], #16    \\n\"      //r6-r8\n                        \"ld2    {v5.8b, v6.8b}, [%11]         \\n\"\n\n                        \"ext    v7.8b, v3.8b, v5.8b, #1       \\n\"\n\n                        \"sshll  v0.8h, v0.8b, #0              \\n\" //(k06-k76)\n                        \"sshll  v1.8h, v1.8b, #0              \\n\" //(k07-k77)\n                        \"sshll  v2.8h, v2.8b, #0              \\n\" //(k08-k78)\n                        \"sshll  v3.8h, v3.8b, #0              \\n\" // r6\n                        \"sshll  v4.8h, v4.8b, #0              \\n\" // r7\n                        \"sshll  v7.8h, v7.8b, #0              \\n\" // r8\n\n                        // r6\n                        \"smlal  v8.4s, v3.4h, v0.h[0]         \\n\" // out0 += (r60-r67)*k06\n                        \"smlal2  v9.4s, v3.8h, v0.h[0]        \\n\"\n                        \"smlal  v10.4s, v3.4h, v0.h[1]        \\n\" // out1 += (r60-r67)*k16\n                        \"smlal2  v11.4s, v3.8h, v0.h[1]       \\n\"\n                        \"smlal  v12.4s, v3.4h, v0.h[2]        \\n\" // out2 += (r60-r67)*k26\n                        \"smlal2  v13.4s, v3.8h, v0.h[2]       \\n\"\n                        \"smlal  v14.4s, v3.4h, v0.h[3]        \\n\" // out3 += (r60-r67)*k36\n                        \"smlal2  v15.4s, v3.8h, v0.h[3]       \\n\"\n                        \"smlal  v16.4s, v3.4h, 
v0.h[4]        \\n\" // out4 += (r60-r67)*k46\n                        \"smlal2  v17.4s, v3.8h, v0.h[4]       \\n\"\n                        \"smlal  v18.4s, v3.4h, v0.h[5]        \\n\" // out5 += (r60-r67)*k56\n                        \"smlal2  v19.4s, v3.8h, v0.h[5]       \\n\"\n                        \"smlal  v20.4s, v3.4h, v0.h[6]        \\n\" // out6 += (r60-r67)*k66\n                        \"smlal2  v21.4s, v3.8h, v0.h[6]       \\n\"\n                        \"smlal  v22.4s, v3.4h, v0.h[7]        \\n\" // out7 += (r60-r67)*k76\n                        \"smlal2  v23.4s, v3.8h, v0.h[7]       \\n\"\n                        // r7\n                        \"smlal  v8.4s, v4.4h, v1.h[0]         \\n\" // out0 += (r70-r77)*k07\n                        \"smlal2  v9.4s, v4.8h, v1.h[0]        \\n\"\n                        \"smlal  v10.4s, v4.4h, v1.h[1]        \\n\" // out1 += (r70-r77)*k17\n                        \"smlal2  v11.4s, v4.8h, v1.h[1]       \\n\"\n                        \"smlal  v12.4s, v4.4h, v1.h[2]        \\n\" // out2 += (r70-r77)*k27\n                        \"smlal2  v13.4s, v4.8h, v1.h[2]       \\n\"\n                        \"smlal  v14.4s, v4.4h, v1.h[3]        \\n\" // out3 += (r70-r77)*k37\n                        \"smlal2  v15.4s, v4.8h, v1.h[3]       \\n\"\n                        \"smlal  v16.4s, v4.4h, v1.h[4]        \\n\" // out4 += (r70-r77)*k47\n                        \"smlal2  v17.4s, v4.8h, v1.h[4]       \\n\"\n                        \"smlal  v18.4s, v4.4h, v1.h[5]        \\n\" // out5 += (r70-r77)*k57\n                        \"smlal2  v19.4s, v4.8h, v1.h[5]       \\n\"\n                        \"smlal  v20.4s, v4.4h, v1.h[6]        \\n\" // out6 += (r70-r77)*k67\n                        \"smlal2  v21.4s, v4.8h, v1.h[6]       \\n\"\n                        \"smlal  v22.4s, v4.4h, v1.h[7]        \\n\" // out7 += (r70-r77)*k77\n                        \"smlal2  v23.4s, v4.8h, v1.h[7]       \\n\"\n                        // r8\n      
                  \"smlal  v8.4s, v7.4h, v2.h[0]         \\n\" // out0 += (r80-r87)*k08\n                        \"smlal2  v9.4s, v7.8h, v2.h[0]        \\n\"\n                        \"smlal  v10.4s, v7.4h, v2.h[1]        \\n\" // out1 += (r80-r87)*k18\n                        \"smlal2  v11.4s, v7.8h, v2.h[1]       \\n\"\n                        \"smlal  v12.4s, v7.4h, v2.h[2]        \\n\" // out2 += (r80-r87)*k28\n                        \"smlal2  v13.4s, v7.8h, v2.h[2]       \\n\"\n                        \"smlal  v14.4s, v7.4h, v2.h[3]        \\n\" // out3 += (r80-r87)*k38\n                        \"smlal2  v15.4s, v7.8h, v2.h[3]       \\n\"\n                        \"smlal  v16.4s, v7.4h, v2.h[4]        \\n\" // out4 += (r80-r87)*k48\n                        \"smlal2  v17.4s, v7.8h, v2.h[4]       \\n\"\n                        \"smlal  v18.4s, v7.4h, v2.h[5]        \\n\" // out5 += (r80-r87)*k58\n                        \"smlal2  v19.4s, v7.8h, v2.h[5]       \\n\"\n                        \"smlal  v20.4s, v7.4h, v2.h[6]        \\n\" // out6 += (r80-r87)*k68\n                        \"smlal2  v21.4s, v7.8h, v2.h[6]       \\n\"\n                        \"smlal  v22.4s, v7.4h, v2.h[7]        \\n\" // out7 += (r80-r87)*k78\n                        \"smlal2  v23.4s, v7.8h, v2.h[7]       \\n\"\n\n                        \"st1    {v8.4s, v9.4s}, [%1], #32     \\n\"\n                        \"st1    {v10.4s, v11.4s}, [%2], #32   \\n\"\n                        \"st1    {v12.4s, v13.4s}, [%3], #32   \\n\"\n                        \"st1    {v14.4s, v15.4s}, [%4], #32   \\n\"\n                        \"st1    {v16.4s, v17.4s}, [%5], #32   \\n\"\n                        \"st1    {v18.4s, v19.4s}, [%6], #32   \\n\"\n                        \"st1    {v20.4s, v21.4s}, [%7], #32   \\n\"\n                        \"st1    {v22.4s, v23.4s}, [%8], #32   \\n\"\n\n                        \"subs   %w0, %w0, #1                  \\n\"\n                        \"sub    %12, %12, #72      
           \\n\" // reset ktmp\n\n                        \"bne    0b                            \\n\"\n\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(outptr4), // %5\n                        \"=r\"(outptr5), // %6\n                        \"=r\"(outptr6), // %7\n                        \"=r\"(outptr7), // %8\n                        \"=r\"(r0),      // %9\n                        \"=r\"(r1),      // %10\n                        \"=r\"(r2),      // %11\n                        \"=r\"(ktmp)     // %12\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(outptr4),\n                        \"6\"(outptr5),\n                        \"7\"(outptr6),\n                        \"8\"(outptr7),\n                        \"9\"(r0),\n                        \"10\"(r1),\n                        \"11\"(r2),\n                        \"12\"(ktmp)\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n                }\n#else  // __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                             \\n\"\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.s32   {d16-d17}, [%1]     \\n\" // out0\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.s32   {d18-d19}, [%2]     \\n\" // out1\n                        \"pld       
 [%3, #128]          \\n\"\n                        \"vld1.s32   {d20-d21}, [%3]     \\n\" // out2\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.s32   {d22-d23}, [%4]     \\n\" // out3\n\n                        // r0\n                        \"pld        [%9, #64]          \\n\"\n                        \"vld2.s8    {d8-d9}, [%9]       \\n\" // d8(a00 a02 a04 a06 a08 a010 a012 a014), d9(a01 a03 a05 a07 a09 a011 a013 a015)\n                        \"add        %9, #8              \\n\"\n                        \"pld        [%12, #64]         \\n\"\n                        \"vld1.s8    {d0-d2}, [%12]!     \\n\" // d0(k00-k70) d1(k01-k71) d2(k02-k72)\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.s32   {d24-d25}, [%5]     \\n\" // out4\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.s32   {d26-d27}, [%6]     \\n\" // out5\n\n                        \"vmovl.s8   q2, d2              \\n\" // q2(k02-k72)\n                        \"vmovl.s8   q1, d1              \\n\" // q1(k01-k71)\n                        \"vmovl.s8   q0, d0              \\n\" // q0(k00-k70)\n                        \"vext.s8    d12, d8, d8, #1     \\n\" // d12(a02 a04 a06 a08 x x x x)\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.s32   {d28-d29}, [%7]     \\n\" // out6\n\n                        \"vmovl.s8   q5, d9              \\n\" // q5(a01 a03 a05 a07 a09 a011 a013 a015) d11\n                        \"vmovl.s8   q4, d8              \\n\" // q4(a00 a02 a04 a06 a08 a010 a012 a014) d9\n                        \"vmovl.s8   q6, d12             \\n\" // q6(a02 a04 a06 a08 a010 a012 a014 a016) d13\n\n                        \"pld        [%8, #128]          \\n\"\n                        \"vld1.s32   {d30-d31}, [%8]     \\n\" // out7\n\n                        \"vmlal.s16  q8, d8, d0[0]       \\n\" // sum0 
+= (a00 a02 a04 a06) * k00\n                        \"vmlal.s16  q9, d8, d0[1]       \\n\" // sum1 += (a00 a02 a04 a06) * k10\n                        \"vmlal.s16  q10, d8, d0[2]      \\n\" // sum2 += (a00 a02 a04 a06) * k20\n                        \"vmlal.s16  q11, d8, d0[3]      \\n\" // sum3 += (a00 a02 a04 a06) * k30\n                        \"vmlal.s16  q12, d8, d1[0]      \\n\" // sum4 += (a00 a02 a04 a06) * k40\n                        \"vmlal.s16  q13, d8, d1[1]      \\n\" // sum5 += (a00 a02 a04 a06) * k50\n                        \"vmlal.s16  q14, d8, d1[2]      \\n\" // sum6 += (a00 a02 a04 a06) * k60\n                        \"vmlal.s16  q15, d8, d1[3]      \\n\" // sum7 += (a00 a02 a04 a06) * k70\n\n                        \"vmlal.s16  q8, d10, d2[0]      \\n\" // sum0 += (a01-a07) * k01\n                        \"vmlal.s16  q9, d10, d2[1]      \\n\" // sum1 += (a01-a07) * k11\n                        \"vmlal.s16  q10, d10, d2[2]     \\n\" // sum2 += (a01-a07) * k21\n                        \"vmlal.s16  q11, d10, d2[3]     \\n\" // sum3 += (a01-a07) * k31\n                        \"vmlal.s16  q12, d10, d3[0]     \\n\" // sum4 += (a01-a07) * k41\n                        \"vmlal.s16  q13, d10, d3[1]     \\n\" // sum5 += (a01-a07) * k51\n                        \"vmlal.s16  q14, d10, d3[2]     \\n\" // sum6 += (a01-a07) * k61\n                        \"vmlal.s16  q15, d10, d3[3]     \\n\" // sum7 += (a01-a07) * k71\n\n                        \"pld        [%10, #64]         \\n\"\n                        \"vld2.s8    {d8-d9}, [%10]      \\n\" // d8(a10 a12 a14 a16 a18 a110 a112 a114), d9(a11 a13 a15 a17 a19 a111 a113 a115)\n                        \"add        %10, #8             \\n\"\n\n                        \"vmlal.s16  q8, d12, d4[0]      \\n\" // sum0 += (a02-a08) * k02\n                        \"vmlal.s16  q9, d12, d4[1]      \\n\" // sum1 += (a02-a08) * k12\n                        \"vmlal.s16  q10, d12, d4[2]     \\n\" // sum2 += (a02-a08) * 
k22\n                        \"vmlal.s16  q11, d12, d4[3]     \\n\" // sum3 += (a02-a08) * k32\n\n                        \"pld        [%12, #64]         \\n\"\n                        \"vld1.s8    {d0-d2}, [%12]!     \\n\" // d0(k03-k73) d1(k04-k74) d2(k05-k75)\n\n                        \"vmlal.s16  q12, d12, d5[0]     \\n\" // sum4 += (a02-a08) * k42\n                        \"vmlal.s16  q13, d12, d5[1]     \\n\" // sum5 += (a02-a08) * k52\n                        \"vmlal.s16  q14, d12, d5[2]     \\n\" // sum6 += (a02-a08) * k62\n                        \"vmlal.s16  q15, d12, d5[3]     \\n\" // sum7 += (a02-a08) * k72\n\n                        // r1\n                        \"vext.s8    d12, d8, d8, #1     \\n\" // d12(a12 a14 a16 a18 x x x x)\n\n                        \"vmovl.s8   q2, d2              \\n\" // q2(k05-k75)\n                        \"vmovl.s8   q1, d1              \\n\" // q1(k04-k74)\n                        \"vmovl.s8   q0, d0              \\n\" // q0(k03-k73)\n                        \"vmovl.s8   q5, d9              \\n\" // q5(a11-a115)\n                        \"vmovl.s8   q4, d8              \\n\" // q4(a10-a114)\n                        \"vmovl.s8   q6, d12             \\n\" // q6(a12-a116)\n\n                        \"vmlal.s16  q8, d8, d0[0]       \\n\" // sum0 += (a10-a16) * k03\n                        \"vmlal.s16  q9, d8, d0[1]       \\n\" // sum1 += (a10-a16) * k13\n                        \"vmlal.s16  q10, d8, d0[2]      \\n\" // sum2 += (a10-a16) * k23\n                        \"vmlal.s16  q11, d8, d0[3]      \\n\" // sum3 += (a10-a16) * k33\n                        \"vmlal.s16  q12, d8, d1[0]      \\n\" // sum4 += (a10-a16) * k43\n                        \"vmlal.s16  q13, d8, d1[1]      \\n\" // sum5 += (a10-a16) * k53\n                        \"vmlal.s16  q14, d8, d1[2]      \\n\" // sum6 += (a10-a16) * k63\n                        \"vmlal.s16  q15, d8, d1[3]      \\n\" // sum7 += (a10-a16) * k73\n\n                        
\"vmlal.s16  q8, d10, d2[0]      \\n\" // sum0 += (a11-a17) * k04\n                        \"vmlal.s16  q9, d10, d2[1]      \\n\" // sum1 += (a11-a17) * k14\n                        \"vmlal.s16  q10, d10, d2[2]     \\n\" // sum2 += (a11-a17) * k24\n                        \"vmlal.s16  q11, d10, d2[3]     \\n\" // sum3 += (a11-a17) * k34\n                        \"vmlal.s16  q12, d10, d3[0]     \\n\" // sum4 += (a11-a17) * k44\n                        \"vmlal.s16  q13, d10, d3[1]     \\n\" // sum5 += (a11-a17) * k54\n                        \"vmlal.s16  q14, d10, d3[2]     \\n\" // sum6 += (a11-a17) * k64\n                        \"vmlal.s16  q15, d10, d3[3]     \\n\" // sum7 += (a11-a17) * k74\n\n                        \"pld        [%11, #64]         \\n\"\n                        \"vld2.s8    {d8-d9}, [%11]      \\n\" // d8(a20 a22 a24 a26 a28 a210 a212 a214), d9(a21 a23 a25 a27 a29 a211 a213 a215)\n                        \"add        %11, #8             \\n\"\n\n                        \"vmlal.s16  q8, d12, d4[0]      \\n\" // sum0 += (a12-a18) * k05\n                        \"vmlal.s16  q9, d12, d4[1]      \\n\" // sum1 += (a12-a18) * k15\n                        \"vmlal.s16  q10, d12, d4[2]     \\n\" // sum2 += (a12-a18) * k25\n                        \"vmlal.s16  q11, d12, d4[3]     \\n\" // sum3 += (a12-a18) * k35\n\n                        \"pld        [%12, #64]         \\n\"\n                        \"vld1.s8    {d0-d2}, [%12]!     
\\n\" // d0(k06-k76) d1(k07-k77) d2(k08-k78)\n\n                        \"vmlal.s16  q12, d12, d5[0]     \\n\" // sum4 += (a12-a18) * k45\n                        \"vmlal.s16  q13, d12, d5[1]     \\n\" // sum5 += (a12-a18) * k55\n                        \"vmlal.s16  q14, d12, d5[2]     \\n\" // sum6 += (a12-a18) * k65\n                        \"vmlal.s16  q15, d12, d5[3]     \\n\" // sum7 += (a12-a18) * k75\n\n                        // r2\n                        \"vext.s8    d12, d8, d8, #1     \\n\" // d12(a22 a24 a26 a28 x x x x)\n\n                        \"vmovl.s8   q2, d2              \\n\" // q2(k08-k78)\n                        \"vmovl.s8   q1, d1              \\n\" // q1(k07-k77)\n                        \"vmovl.s8   q0, d0              \\n\" // q0(k06-k76)\n                        \"vmovl.s8   q5, d9              \\n\" // q5(a21-a215)\n                        \"vmovl.s8   q4, d8              \\n\" // q4(a20-a214)\n                        \"vmovl.s8   q6, d12             \\n\" // q6(a22-a216)\n\n                        \"vmlal.s16  q8, d8, d0[0]       \\n\" // sum0 += (a20-a26) * k06\n                        \"vmlal.s16  q9, d8, d0[1]       \\n\" // sum1 += (a20-a26) * k16\n                        \"vmlal.s16  q10, d8, d0[2]      \\n\" // sum2 += (a20-a26) * k26\n                        \"vmlal.s16  q11, d8, d0[3]      \\n\" // sum3 += (a20-a26) * k36\n                        \"vmlal.s16  q12, d8, d1[0]      \\n\" // sum4 += (a20-a26) * k46\n                        \"vmlal.s16  q13, d8, d1[1]      \\n\" // sum5 += (a20-a26) * k56\n                        \"vmlal.s16  q14, d8, d1[2]      \\n\" // sum6 += (a20-a26) * k66\n                        \"vmlal.s16  q15, d8, d1[3]      \\n\" // sum7 += (a20-a26) * k76\n\n                        \"vmlal.s16  q8, d10, d2[0]      \\n\" // sum0 += (a21-a27) * k07\n                        \"vmlal.s16  q9, d10, d2[1]      \\n\" // sum1 += (a21-a27) * k17\n                        \"vmlal.s16  q10, d10, d2[2]     \\n\" // 
sum2 += (a21-a27) * k27\n                        \"vmlal.s16  q11, d10, d2[3]     \\n\" // sum3 += (a21-a27) * k37\n                        \"vmlal.s16  q12, d10, d3[0]     \\n\" // sum4 += (a21-a27) * k47\n                        \"vmlal.s16  q13, d10, d3[1]     \\n\" // sum5 += (a21-a27) * k57\n                        \"vmlal.s16  q14, d10, d3[2]     \\n\" // sum6 += (a21-a27) * k67\n                        \"vmlal.s16  q15, d10, d3[3]     \\n\" // sum7 += (a21-a27) * k77\n\n                        \"vmlal.s16  q8, d12, d4[0]      \\n\" // sum0 += (a22-a28) * k08\n                        \"vmlal.s16  q9, d12, d4[1]      \\n\" // sum1 += (a22-a28) * k18\n                        \"vmlal.s16  q10, d12, d4[2]     \\n\" // sum2 += (a22-a28) * k28\n                        \"vmlal.s16  q11, d12, d4[3]     \\n\" // sum3 += (a22-a28) * k38\n                        \"vmlal.s16  q12, d12, d5[0]     \\n\" // sum4 += (a22-a28) * k48\n                        \"vmlal.s16  q13, d12, d5[1]     \\n\" // sum5 += (a22-a28) * k58\n                        \"vmlal.s16  q14, d12, d5[2]     \\n\" // sum6 += (a22-a28) * k68\n                        \"vmlal.s16  q15, d12, d5[3]     \\n\" // sum7 += (a22-a28) * k78\n\n                        // save s32 to memory\n                        \"sub        %12, %12, #72       \\n\"\n                        \"vst1.s32   {d16-d17}, [%1]!    \\n\" // out0\n                        \"vst1.s32   {d18-d19}, [%2]!    \\n\" // out1\n                        \"vst1.s32   {d20-d21}, [%3]!    \\n\" // out2\n                        \"vst1.s32   {d22-d23}, [%4]!    \\n\" // out3\n                        \"subs       %0, #1              \\n\"\n                        \"vst1.s32   {d24-d25}, [%5]!    \\n\" // out4\n                        \"vst1.s32   {d26-d27}, [%6]!    \\n\" // out5\n                        \"vst1.s32   {d28-d29}, [%7]!    \\n\" // out6\n                        \"vst1.s32   {d30-d31}, [%8]!    
\\n\" // out7\n\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(outptr4), // %5\n                        \"=r\"(outptr5), // %6\n                        \"=r\"(outptr6), // %7\n                        \"=r\"(outptr7), // %8\n                        \"=r\"(r0),      // %9\n                        \"=r\"(r1),      // %10\n                        \"=r\"(r2),      // %11\n                        \"=r\"(ktmp)     // %12\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(outptr4),\n                        \"6\"(outptr5),\n                        \"7\"(outptr6),\n                        \"8\"(outptr7),\n                        \"9\"(r0),\n                        \"10\"(r1),\n                        \"11\"(r2),\n                        \"12\"(ktmp)\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n#if __aarch64__\n                    int8x8_t _r0_s8 = vld1_s8(r0); // (a00 a01 a02 ....)\n                    int8x8_t _r1_s8 = vld1_s8(r1); // (a10 a11 a12 ....)\n                    int8x8_t _r2_s8 = vld1_s8(r2); // (a20 a21 a22 ....)\n\n                    int16x8_t _r0 = vmovl_s8(_r0_s8);\n                    int16x8_t _r1 = vmovl_s8(_r1_s8);\n                    int16x8_t _r2 = vmovl_s8(_r2_s8);\n\n                    int32x4_t _sum03 = 
{};\n                    int32x4_t _sum47 = {};\n\n                    _sum03 = vld1q_lane_s32(outptr0, _sum03, 0); // out0\n                    _sum03 = vld1q_lane_s32(outptr1, _sum03, 1); // out1\n                    _sum03 = vld1q_lane_s32(outptr2, _sum03, 2); // out2\n                    _sum03 = vld1q_lane_s32(outptr3, _sum03, 3); // out3\n                    _sum47 = vld1q_lane_s32(outptr4, _sum47, 0); // out4\n                    _sum47 = vld1q_lane_s32(outptr5, _sum47, 1); // out5\n                    _sum47 = vld1q_lane_s32(outptr6, _sum47, 2); // out6\n                    _sum47 = vld1q_lane_s32(outptr7, _sum47, 3); // out7\n\n                    // k0 - k2\n                    int8x8_t _k0_8 = vld1_s8(ktmp);      //(k00-k70)\n                    int8x8_t _k1_8 = vld1_s8(ktmp + 8);  //(k01-k71)\n                    int8x8_t _k2_8 = vld1_s8(ktmp + 16); //(k02-k72)\n\n                    int16x8_t _k0 = vmovl_s8(_k0_8);\n                    int16x8_t _k1 = vmovl_s8(_k1_8);\n                    int16x8_t _k2 = vmovl_s8(_k2_8);\n\n                    int32x4_t _sum0 = vmull_laneq_s16(vget_low_s16(_k0), _r0, 0);\n                    int32x4_t _sum0n = vmull_laneq_s16(vget_high_s16(_k0), _r0, 0);\n                    int32x4_t _sum1 = vmull_laneq_s16(vget_low_s16(_k1), _r0, 1);\n                    int32x4_t _sum1n = vmull_laneq_s16(vget_high_s16(_k1), _r0, 1);\n                    _sum03 = vmlal_laneq_s16(_sum03, vget_low_s16(_k2), _r0, 2);\n                    _sum47 = vmlal_laneq_s16(_sum47, vget_high_s16(_k2), _r0, 2);\n\n                    // k3 - k5\n                    _k0_8 = vld1_s8(ktmp + 24); //(k03-k73)\n                    _k1_8 = vld1_s8(ktmp + 32); //(k04-k74)\n                    _k2_8 = vld1_s8(ktmp + 40); //(k05-k75)\n\n                    _k0 = vmovl_s8(_k0_8);\n                    _k1 = vmovl_s8(_k1_8);\n                    _k2 = vmovl_s8(_k2_8);\n\n                    _sum0 = vmlal_laneq_s16(_sum0, vget_low_s16(_k0), _r1, 0);\n             
       _sum0n = vmlal_laneq_s16(_sum0n, vget_high_s16(_k0), _r1, 0);\n                    _sum1 = vmlal_laneq_s16(_sum1, vget_low_s16(_k1), _r1, 1);\n                    _sum1n = vmlal_laneq_s16(_sum1n, vget_high_s16(_k1), _r1, 1);\n                    _sum03 = vmlal_laneq_s16(_sum03, vget_low_s16(_k2), _r1, 2);\n                    _sum47 = vmlal_laneq_s16(_sum47, vget_high_s16(_k2), _r1, 2);\n\n                    // k6 - k8\n                    _k0_8 = vld1_s8(ktmp + 48); //(k06-k76)\n                    _k1_8 = vld1_s8(ktmp + 56); //(k07-k77)\n                    _k2_8 = vld1_s8(ktmp + 64); //(k08-k78)\n\n                    _k0 = vmovl_s8(_k0_8);\n                    _k1 = vmovl_s8(_k1_8);\n                    _k2 = vmovl_s8(_k2_8);\n\n                    _sum0 = vmlal_laneq_s16(_sum0, vget_low_s16(_k0), _r2, 0);\n                    _sum0n = vmlal_laneq_s16(_sum0n, vget_high_s16(_k0), _r2, 0);\n                    _sum1 = vmlal_laneq_s16(_sum1, vget_low_s16(_k1), _r2, 1);\n                    _sum1n = vmlal_laneq_s16(_sum1n, vget_high_s16(_k1), _r2, 1);\n                    _sum03 = vmlal_laneq_s16(_sum03, vget_low_s16(_k2), _r2, 2);\n                    _sum47 = vmlal_laneq_s16(_sum47, vget_high_s16(_k2), _r2, 2);\n\n                    _sum0 = vaddq_s32(_sum0, _sum1);\n                    _sum0n = vaddq_s32(_sum0n, _sum1n);\n                    _sum03 = vaddq_s32(_sum03, _sum0);\n                    _sum47 = vaddq_s32(_sum47, _sum0n);\n\n                    vst1q_lane_s32(outptr0, _sum03, 0);\n                    vst1q_lane_s32(outptr1, _sum03, 1);\n                    vst1q_lane_s32(outptr2, _sum03, 2);\n                    vst1q_lane_s32(outptr3, _sum03, 3);\n                    vst1q_lane_s32(outptr4, _sum47, 0);\n                    vst1q_lane_s32(outptr5, _sum47, 1);\n                    vst1q_lane_s32(outptr6, _sum47, 2);\n                    vst1q_lane_s32(outptr7, _sum47, 3);\n\n                    outptr0++;\n                    outptr1++;\n        
            outptr2++;\n                    outptr3++;\n                    outptr4++;\n                    outptr5++;\n                    outptr6++;\n                    outptr7++;\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%8, #64]          \\n\"\n                        \"vld1.s8    {d0}, [%8]         \\n\" // d0(a00 a01 a02 ....)\n                        \"pld        [%9, #64]          \\n\"\n                        \"vld1.s8    {d2}, [%9]         \\n\" // d2(a10 a11 a12 ....)\n                        \"pld        [%10, #64]         \\n\"\n                        \"vld1.s8    {d4}, [%10]        \\n\" // d4(a20 a21 a22 ....)\n\n                        \"pld        [%11, #64]         \\n\"\n                        \"vld1.s8    {d6-d8}, [%11]!    \\n\" // d6(k00-k70) d7(k01-k71) d8(k02-k72)\n\n                        \"vmovl.s8   q0, d0             \\n\" // d0(a00 a01 a02 x)\n                        \"vmovl.s8   q1, d2             \\n\" // d2(a10 a11 a12 x)\n                        \"vmovl.s8   q2, d4             \\n\" // d4(a20 a21 a22 x)\n\n                        \"vmovl.s8   q5, d8             \\n\" // d10(k02-k32) d11(k42-k72)\n                        \"vmovl.s8   q4, d7             \\n\" // d8(k01-k31) d9(k41-k71)\n                        \"vmovl.s8   q3, d6             \\n\" // d6(k00-k30) d7(k40-k70)\n\n                        \"vld1.s32   {d20[0]}, [%0]     \\n\" // out0 q10\n                        \"vld1.s32   {d20[1]}, [%1]     \\n\" // out1\n                        \"vld1.s32   {d21[0]}, [%2]     \\n\" // out2\n                        \"vld1.s32   {d21[1]}, [%3]     \\n\" // out3\n\n                        \"pld        [%11, #64]         \\n\"\n                        \"vld1.s8    {d24-d26}, [%11]!  
\\n\"\n                        \"vmovl.s8   q14, d26           \\n\" // d28(k05-k35) d29(k45-k75)\n                        \"vmovl.s8   q13, d25           \\n\" // d26(k04-k34) d27(k44-k74)\n                        \"vmovl.s8   q12, d24           \\n\" // d24(k03-k33) d25(k43-k73)\n\n                        \"vld1.s32   {d22[0]}, [%4]     \\n\" // out4 q11\n                        \"vld1.s32   {d22[1]}, [%5]     \\n\" // out5\n                        \"vld1.s32   {d23[0]}, [%6]     \\n\" // out6\n                        \"vld1.s32   {d23[1]}, [%7]     \\n\" // out7\n\n                        \"vmull.s16  q6, d6, d0[0]      \\n\" // a00 x (k00-k30)\n                        \"vmull.s16  q7, d7, d0[0]      \\n\" // a00 x (k40-k70)\n                        \"vmull.s16  q8, d8, d0[1]      \\n\" // a01 x (k01-k31)\n                        \"vmull.s16  q9, d9, d0[1]      \\n\" // a01 x (k41-k71)\n                        \"vmlal.s16  q10, d10, d0[2]    \\n\" // a02 x (k02-k32)\n                        \"vmlal.s16  q11, d11, d0[2]    \\n\" // a02 x (k42-k72)\n\n                        \"pld        [%11, #64]         \\n\"\n                        \"vld1.s8    {d6-d8}, [%11]!    
\\n\"\n                        \"vmovl.s8   q5, d8             \\n\" // d10(k08-k38) d11(k48-k78)\n                        \"vmovl.s8   q4, d7             \\n\" // d8(k07-k37) d9(k47-k77)\n                        \"vmovl.s8   q3, d6             \\n\" // d6(k06-k36) d7(k46-k76)\n\n                        \"vmlal.s16  q6, d24, d2[0]     \\n\" // a10 x (k03-k33)\n                        \"vmlal.s16  q7, d25, d2[0]     \\n\" // a10 x (k43-k73)\n                        \"vmlal.s16  q8, d26, d2[1]     \\n\" // a11 x (k04-k34)\n                        \"vmlal.s16  q9, d27, d2[1]     \\n\" // a11 x (k44-k74)\n                        \"vmlal.s16  q10, d28, d2[2]    \\n\" // a12 x (k05-k35)\n                        \"vmlal.s16  q11, d29, d2[2]    \\n\" // a12 x (k45-k75)\n\n                        \"vmlal.s16  q6, d6, d4[0]      \\n\" // a20 x (k06-k36)\n                        \"vmlal.s16  q7, d7, d4[0]      \\n\" // a20 x (k46-k76)\n                        \"vmlal.s16  q8, d8, d4[1]      \\n\" // a21 x (k07-k37)\n                        \"vmlal.s16  q9, d9, d4[1]      \\n\" // a21 x (k47-k77)\n                        \"vmlal.s16  q10, d10, d4[2]    \\n\" // a22 x (k08-k38)\n                        \"vmlal.s16  q11, d11, d4[2]    \\n\" // a22 x (k48-k78)\n\n                        \"vadd.s32   q8, q8, q6         \\n\"\n                        \"vadd.s32   q9, q9, q7         \\n\"\n\n                        \"sub        %11, %11, #72      \\n\"\n\n                        \"vadd.s32   q10, q10, q8       \\n\"\n                        \"vadd.s32   q11, q11, q9       \\n\"\n\n                        \"vst1.s32   {d20[0]}, [%0]!    \\n\" // out0\n                        \"vst1.s32   {d20[1]}, [%1]!    \\n\" // out1\n                        \"vst1.s32   {d21[0]}, [%2]!    \\n\" // out2\n                        \"vst1.s32   {d21[1]}, [%3]!    \\n\" // out3\n                        \"vst1.s32   {d22[0]}, [%4]!    \\n\" // out4\n                        \"vst1.s32   {d22[1]}, [%5]!   
 \\n\" // out5\n                        \"vst1.s32   {d23[0]}, [%6]!    \\n\" // out6\n                        \"vst1.s32   {d23[1]}, [%7]!    \\n\" // out7\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(outptr3), // %3\n                        \"=r\"(outptr4), // %4\n                        \"=r\"(outptr5), // %5\n                        \"=r\"(outptr6), // %6\n                        \"=r\"(outptr7), // %7\n                        \"=r\"(r0),      // %8\n                        \"=r\"(r1),      // %9\n                        \"=r\"(r2),      // %10\n                        \"=r\"(ktmp)     // %11\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n                        \"2\"(outptr2),\n                        \"3\"(outptr3),\n                        \"4\"(outptr4),\n                        \"5\"(outptr5),\n                        \"6\"(outptr6),\n                        \"7\"(outptr7),\n                        \"8\"(r0),\n                        \"9\"(r1),\n                        \"10\"(r2),\n                        \"11\"(ktmp)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // __ARM_NEON\n                    int sum0 = 0;\n                    int sum1 = 0;\n                    int sum2 = 0;\n                    int sum3 = 0;\n                    int sum4 = 0;\n                    int sum5 = 0;\n                    int sum6 = 0;\n                    int sum7 = 0;\n\n                    sum0 += (int)r0[0] * ktmp[0];\n                    sum1 += (int)r0[0] * ktmp[1];\n                    sum2 += (int)r0[0] * ktmp[2];\n                    sum3 += (int)r0[0] * ktmp[3];\n                    sum4 += (int)r0[0] * ktmp[4];\n                    sum5 
+= (int)r0[0] * ktmp[5];\n                    sum6 += (int)r0[0] * ktmp[6];\n                    sum7 += (int)r0[0] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r0[1] * ktmp[0];\n                    sum1 += (int)r0[1] * ktmp[1];\n                    sum2 += (int)r0[1] * ktmp[2];\n                    sum3 += (int)r0[1] * ktmp[3];\n                    sum4 += (int)r0[1] * ktmp[4];\n                    sum5 += (int)r0[1] * ktmp[5];\n                    sum6 += (int)r0[1] * ktmp[6];\n                    sum7 += (int)r0[1] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r0[2] * ktmp[0];\n                    sum1 += (int)r0[2] * ktmp[1];\n                    sum2 += (int)r0[2] * ktmp[2];\n                    sum3 += (int)r0[2] * ktmp[3];\n                    sum4 += (int)r0[2] * ktmp[4];\n                    sum5 += (int)r0[2] * ktmp[5];\n                    sum6 += (int)r0[2] * ktmp[6];\n                    sum7 += (int)r0[2] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r1[0] * ktmp[0];\n                    sum1 += (int)r1[0] * ktmp[1];\n                    sum2 += (int)r1[0] * ktmp[2];\n                    sum3 += (int)r1[0] * ktmp[3];\n                    sum4 += (int)r1[0] * ktmp[4];\n                    sum5 += (int)r1[0] * ktmp[5];\n                    sum6 += (int)r1[0] * ktmp[6];\n                    sum7 += (int)r1[0] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r1[1] * ktmp[0];\n                    sum1 += (int)r1[1] * ktmp[1];\n                    sum2 += (int)r1[1] * ktmp[2];\n                    sum3 += (int)r1[1] * ktmp[3];\n                    sum4 += (int)r1[1] * ktmp[4];\n                    sum5 += (int)r1[1] * ktmp[5];\n                    sum6 += (int)r1[1] * ktmp[6];\n                    sum7 += (int)r1[1] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r1[2] * ktmp[0];\n                   
 sum1 += (int)r1[2] * ktmp[1];\n                    sum2 += (int)r1[2] * ktmp[2];\n                    sum3 += (int)r1[2] * ktmp[3];\n                    sum4 += (int)r1[2] * ktmp[4];\n                    sum5 += (int)r1[2] * ktmp[5];\n                    sum6 += (int)r1[2] * ktmp[6];\n                    sum7 += (int)r1[2] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r2[0] * ktmp[0];\n                    sum1 += (int)r2[0] * ktmp[1];\n                    sum2 += (int)r2[0] * ktmp[2];\n                    sum3 += (int)r2[0] * ktmp[3];\n                    sum4 += (int)r2[0] * ktmp[4];\n                    sum5 += (int)r2[0] * ktmp[5];\n                    sum6 += (int)r2[0] * ktmp[6];\n                    sum7 += (int)r2[0] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r2[1] * ktmp[0];\n                    sum1 += (int)r2[1] * ktmp[1];\n                    sum2 += (int)r2[1] * ktmp[2];\n                    sum3 += (int)r2[1] * ktmp[3];\n                    sum4 += (int)r2[1] * ktmp[4];\n                    sum5 += (int)r2[1] * ktmp[5];\n                    sum6 += (int)r2[1] * ktmp[6];\n                    sum7 += (int)r2[1] * ktmp[7];\n                    ktmp += 8;\n\n                    sum0 += (int)r2[2] * ktmp[0];\n                    sum1 += (int)r2[2] * ktmp[1];\n                    sum2 += (int)r2[2] * ktmp[2];\n                    sum3 += (int)r2[2] * ktmp[3];\n                    sum4 += (int)r2[2] * ktmp[4];\n                    sum5 += (int)r2[2] * ktmp[5];\n                    sum6 += (int)r2[2] * ktmp[6];\n                    sum7 += (int)r2[2] * ktmp[7];\n                    ktmp += 8;\n\n                    *outptr0 += sum0;\n                    *outptr1 += sum1;\n                    *outptr2 += sum2;\n                    *outptr3 += sum3;\n                    *outptr4 += sum4;\n                    *outptr5 += sum5;\n                    *outptr6 += sum6;\n                    *outptr7 
+= sum7;\n\n                    ktmp -= 8 * 9;\n\n                    outptr0++;\n                    outptr1++;\n                    outptr2++;\n                    outptr3++;\n                    outptr4++;\n                    outptr5++;\n                    outptr6++;\n                    outptr7++;\n#endif // __ARM_NEON\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            ktmp += 8 * 9;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        out.fill(0);\n\n        const signed char* ktmp = _kernel.channel(p / 8 + p % 8);\n\n        for (int q = 0; q < inch; q++)\n        {\n            int* outptr = out;\n\n            const signed char* img0 = bottom_blob.channel(q);\n\n            const signed char* r0 = img0;\n            const signed char* r1 = img0 + w;\n            const signed char* r2 = img0 + w * 2;\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 3;\n                int remain = outw & 7;\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                   \\n\"\n\n                        \"ld1    {v0.8b, v1.8b}, [%5]          \\n\" //ktmp\n                        \"ld2    {v2.8b, v3.8b}, [%2], #16     \\n\" //r0-r2\n                        \"ld2    {v4.8b, v5.8b}, [%2]          \\n\"\n\n                        \"ld2    {v6.8b, v7.8b}, [%3], #16     \\n\" //r3-r5\n                        \"ld2    {v8.8b, v9.8b}, [%3]          \\n\"\n\n                        \"ld2    {v10.8b, 
v11.8b}, [%4], #16   \\n\" //r6-r8\n                        \"ld2    {v12.8b, v13.8b}, [%4]        \\n\"\n\n                        \"ld1    {v14.4s, v15.4s}, [%1]        \\n\" //out0\n\n                        \"ext    v4.8b, v2.8b, v4.8b, #1       \\n\"\n                        \"ext    v8.8b, v6.8b, v8.8b, #1       \\n\"\n                        \"ext    v12.8b, v10.8b, v12.8b, #1    \\n\"\n\n                        \"sshll  v0.8h, v0.8b, #0              \\n\" //(k0-k7)\n                        \"sshll  v1.8h, v1.8b, #0              \\n\" //(k8)\n                        \"sshll  v2.8h, v2.8b, #0              \\n\" // r0\n                        \"sshll  v3.8h, v3.8b, #0              \\n\" // r1\n                        \"sshll  v4.8h, v4.8b, #0              \\n\" // r2\n                        \"sshll  v6.8h, v6.8b, #0              \\n\" // r3\n                        \"sshll  v7.8h, v7.8b, #0              \\n\" // r4\n                        \"sshll  v8.8h, v8.8b, #0              \\n\" // r5\n                        \"sshll  v10.8h, v10.8b, #0            \\n\" // r6\n                        \"sshll  v11.8h, v11.8b, #0            \\n\" // r7\n                        \"sshll  v12.8h, v12.8b, #0            \\n\" // r8\n\n                        // r0\n                        \"smull  v16.4s, v2.4h, v0.h[0]        \\n\" // out = r0*k0\n                        \"smull2  v17.4s, v2.8h, v0.h[0]       \\n\"\n                        \"smull  v18.4s, v3.4h, v0.h[1]        \\n\" // outn = r1*k1\n                        \"smull2  v19.4s, v3.8h, v0.h[1]       \\n\"\n                        \"smlal  v16.4s, v4.4h, v0.h[2]        \\n\" // out = r2*k2\n                        \"smlal2  v17.4s, v4.8h, v0.h[2]       \\n\"\n                        \"smlal  v18.4s, v6.4h, v0.h[3]        \\n\" // outn = r3*k3\n                        \"smlal2  v19.4s, v6.8h, v0.h[3]       \\n\"\n                        \"smlal  v16.4s, v7.4h, v0.h[4]        \\n\" // out = r4*k4\n                    
    \"smlal2  v17.4s, v7.8h, v0.h[4]       \\n\"\n                        \"smlal  v18.4s, v8.4h, v0.h[5]        \\n\" // outn = r5*k5\n                        \"smlal2  v19.4s, v8.8h, v0.h[5]       \\n\"\n                        \"smlal  v16.4s, v10.4h, v0.h[6]       \\n\" // out = r6*k6\n                        \"smlal2  v17.4s, v10.8h, v0.h[6]      \\n\"\n                        \"smlal  v18.4s, v11.4h, v0.h[7]       \\n\" // outn = r7*k7\n                        \"smlal2  v19.4s, v11.8h, v0.h[7]      \\n\"\n                        \"smlal  v16.4s, v12.4h, v1.h[0]       \\n\" // out = r8*k8\n                        \"smlal2  v17.4s, v12.8h, v1.h[0]      \\n\"\n\n                        \"add    v8.4s, v16.4s, v18.4s         \\n\"\n                        \"add    v9.4s, v17.4s, v19.4s         \\n\"\n\n                        \"st1    {v8.4s, v9.4s}, [%1], #32     \\n\"\n\n                        \"subs   %w0, %w0, #1                  \\n\"\n\n                        \"bne    0b                            \\n\"\n\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(ktmp)    // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(ktmp)\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"vld1.s8    {d0-d1}, [%5]       \\n\" // d0(k0 - k7) d1(k8 ...)\n                        \"vmovl.s8   q1, d1 
             \\n\" // d2(k8 ...)\n                        \"vmovl.s8   q0, d0              \\n\" // d0(k0 - k3) d1(k4 - k7)\n                        \"0:                             \\n\"\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld2.s8    {d4-d5}, [%2]!      \\n\" // r0 d4(a00 a02 ... a014) d5(a01 a03 ... a015)\n                        \"vld2.s8    {d8-d9}, [%2]       \\n\" //    d8(a016 ....)\n                        \"vld2.s8    {d10-d11}, [%3]!    \\n\" // r1 d10(a10 a12 ... a114) d11(a11 a13 ... a115)\n                        \"vld2.s8    {d14-d15}, [%3]     \\n\" //    d14(a116 ....)\n                        \"vld2.s8    {d16-d17}, [%4]!    \\n\" // r2 d16(a20 a22 ... a214) d17(a21 a23 ... a215)\n                        \"vld2.s8    {d20-d21}, [%4]     \\n\" //    d20(a216 ....)\n                        \"vld1.s32   {d22-d25}, [%1]     \\n\" // q11(out0 - out3) q12(out4 - out7)\n\n                        \"vext.s8    d8, d4, d8, #1      \\n\" //  d8(a02 a04 ... a016)\n                        \"vext.s8    d14, d10, d14, #1   \\n\" // d14(a12 a14 ... a116)\n                        \"vext.s8    d20, d16, d20, #1   \\n\" // d20(a22 a24 ... a216)\n\n                        \"vmovl.s8   q3, d5              \\n\" // q3(a01 a03 ... a015)\n                        \"vmovl.s8   q2, d4              \\n\" // q2(a00 a02 ... a014)\n                        \"vmovl.s8   q4, d8              \\n\" // q4(a02 a04 ... a016)\n\n                        \"vmovl.s8   q6, d11             \\n\" // q6(a11 a13 ... a115)\n                        \"vmovl.s8   q5, d10             \\n\" // q5(a10 a12 ... a114)\n                        \"vmovl.s8   q7, d14             \\n\" // q7(a12 a14 ... a116)\n\n                        \"vmovl.s8   q9, d17             \\n\" // q9(a21 a23 ... a215)\n                        \"vmovl.s8   q8, d16             \\n\" // q8(a20 a22 ... 
a214)\n                        \"vmovl.s8   q10, d20            \\n\" // q10(a22 a24 ... a216)\n\n                        \"vmlal.s16  q11, d4, d0[0]      \\n\" // k0\n                        \"vmlal.s16  q12, d5, d0[0]      \\n\"\n                        \"vmull.s16  q13, d6, d0[1]      \\n\" // k1\n                        \"vmull.s16  q14, d7, d0[1]      \\n\"\n                        \"vmlal.s16  q11, d8, d0[2]      \\n\" // k2\n                        \"vmlal.s16  q12, d9, d0[2]      \\n\"\n\n                        \"vmlal.s16  q13, d12, d1[0]     \\n\" // k4\n                        \"vmlal.s16  q14, d13, d1[0]     \\n\"\n                        \"vmlal.s16  q11, d10, d0[3]     \\n\" // k3\n                        \"vmlal.s16  q12, d11, d0[3]     \\n\"\n                        \"vmlal.s16  q13, d14, d1[1]     \\n\" // k5\n                        \"vmlal.s16  q14, d15, d1[1]     \\n\"\n\n                        \"vmlal.s16  q11, d16, d1[2]     \\n\" // k6\n                        \"vmlal.s16  q12, d17, d1[2]     \\n\"\n                        \"vmlal.s16  q13, d18, d1[3]     \\n\" // k7\n                        \"vmlal.s16  q14, d19, d1[3]     \\n\"\n                        \"vmlal.s16  q11, d20, d2[0]     \\n\" // k8\n                        \"vmlal.s16  q12, d21, d2[0]     \\n\"\n\n                        \"vadd.s32   q11, q11, q13       \\n\"\n                        \"vadd.s32   q12, q12, q14       \\n\"\n\n                        \"vst1.32    {d22-d25}, [%1]!    
\\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(ktmp)    // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(ktmp)\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                if (remain > 0)\n                {\n#if __ARM_NEON\n                    int8x8_t _k01234567s8 = vld1_s8(ktmp);\n                    int8x8_t _k8xxxxxxxs8 = vld1_s8(ktmp + 8);\n                    int8x8_t _k34567xxxs8 = vext_s8(_k01234567s8, _k01234567s8, 3);\n                    int8x8_t _k678xxxxxs8 = vext_s8(_k01234567s8, _k8xxxxxxxs8, 6);\n                    int16x8_t _k0123_s16 = vmovl_s8(_k01234567s8);\n                    int16x8_t _k3456_s16 = vmovl_s8(_k34567xxxs8);\n                    int16x8_t _k678x_s16 = vmovl_s8(_k678xxxxxs8);\n#endif\n                    for (; remain > 0; remain--)\n                    {\n#if __ARM_NEON\n                        int8x8_t _r00s8 = vld1_s8(r0);\n                        int8x8_t _r10s8 = vld1_s8(r1);\n                        int8x8_t _r20s8 = vld1_s8(r2);\n\n                        int16x8_t _r00s16 = vmovl_s8(_r00s8);\n                        int16x8_t _r10s16 = vmovl_s8(_r10s8);\n                        int16x8_t _r20s16 = vmovl_s8(_r20s8);\n\n                        int32x4_t _sum = vmull_s16(vget_low_s16(_r00s16), 
vget_low_s16(_k0123_s16));\n                        _sum = vmlal_s16(_sum, vget_low_s16(_r10s16), vget_low_s16(_k3456_s16));\n                        _sum = vmlal_s16(_sum, vget_low_s16(_r20s16), vget_low_s16(_k678x_s16));\n\n                        _sum = vsetq_lane_s32(*outptr, _sum, 3);\n\n#if __aarch64__\n                        *outptr = vaddvq_s32(_sum);\n#else\n                        int32x2_t _ss = vadd_s32(vget_low_s32(_sum), vget_high_s32(_sum));\n                        _ss = vpadd_s32(_ss, _ss);\n\n                        *outptr = vget_lane_s32(_ss, 0);\n#endif // __aarch64__\n#else\n                        int sum = 0;\n\n                        sum += (int)r0[0] * ktmp[0];\n                        sum += (int)r0[1] * ktmp[1];\n                        sum += (int)r0[2] * ktmp[2];\n                        sum += (int)r1[0] * ktmp[3];\n                        sum += (int)r1[1] * ktmp[4];\n                        sum += (int)r1[2] * ktmp[5];\n                        sum += (int)r2[0] * ktmp[6];\n                        sum += (int)r2[1] * ktmp[7];\n                        sum += (int)r2[2] * ktmp[8];\n\n                        *outptr += sum;\n#endif // __ARM_NEON\n                        r0 += 2;\n                        r1 += 2;\n                        r2 += 2;\n                        outptr++;\n                    }\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            ktmp += 9;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack1to4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack1to4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    int remain_outch_start = 0;\n\n#if __ARM_NEON && __aarch64__\n    int nn_outch = 0;\n    nn_outch = outch >> 1;\n    remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = bias ? vld1q_f32((const float*)bias + (p + 1) * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n        out1.fill(_bias1);\n\n        const float* k0 = kernel.channel(p);\n        const float* k1 = kernel.channel(p + 1);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            float32x4_t _k00_0 = vld1q_f32(k0);\n            float32x4_t _k01_0 = vld1q_f32(k0 + 4);\n            float32x4_t _k02_0 = vld1q_f32(k0 + 8);\n            float32x4_t _k10_0 = vld1q_f32(k0 + 12);\n            float32x4_t _k11_0 = vld1q_f32(k0 + 16);\n            float32x4_t _k12_0 = vld1q_f32(k0 + 20);\n            float32x4_t _k20_0 = vld1q_f32(k0 + 24);\n            float32x4_t _k21_0 = vld1q_f32(k0 + 28);\n            float32x4_t _k22_0 = vld1q_f32(k0 + 32);\n\n            float32x4_t _k00_1 = vld1q_f32(k1);\n            float32x4_t _k01_1 = 
vld1q_f32(k1 + 4);\n            float32x4_t _k02_1 = vld1q_f32(k1 + 8);\n            float32x4_t _k10_1 = vld1q_f32(k1 + 12);\n            float32x4_t _k11_1 = vld1q_f32(k1 + 16);\n            float32x4_t _k12_1 = vld1q_f32(k1 + 20);\n            float32x4_t _k20_1 = vld1q_f32(k1 + 24);\n            float32x4_t _k21_1 = vld1q_f32(k1 + 28);\n            float32x4_t _k22_1 = vld1q_f32(k1 + 32);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0] \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%1] \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%2], #16          \\n\"\n\n                        \"ld1    {v1.2s}, [%2]               \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %19.4s, v0.s[0]     \\n\"\n                        \"fmla   v29.4s, %19.4s, v0.s[1]     \\n\"\n                        \"fmla   v30.4s, %19.4s, v0.s[2]     \\n\"\n                        \"fmla   v31.4s, %19.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %11.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %11.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %20.4s, v0.s[1]     \\n\"\n   
                     \"fmla   v29.4s, %20.4s, v0.s[2]     \\n\"\n                        \"fmla   v30.4s, %20.4s, v0.s[3]     \\n\"\n                        \"fmla   v31.4s, %20.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v2.4s}, [%3], #16          \\n\"\n\n                        \"ld1    {v3.2s}, [%3]               \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %12.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %21.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %21.4s, v0.s[3]     \\n\"\n                        \"fmla   v30.4s, %21.4s, v1.s[0]     \\n\"\n                        \"fmla   v31.4s, %21.4s, v1.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v2.s[0]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v2.s[1]     \\n\"\n                        \"fmla   v26.4s, %13.4s, v2.s[2]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v2.s[3]     \\n\"\n                        \"fmla   v28.4s, %22.4s, v2.s[0]     \\n\"\n                        \"fmla   v29.4s, %22.4s, v2.s[1]     \\n\"\n                        \"fmla   v30.4s, %22.4s, v2.s[2]     \\n\"\n                        \"fmla   v31.4s, %22.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v2.s[1]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v2.s[2]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v2.s[3]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v3.s[0]     \\n\"\n                        \"fmla   v28.4s, %23.4s, v2.s[1]     \\n\"\n                        \"fmla   v29.4s, %23.4s, v2.s[2]     \\n\"\n                        \"fmla   v30.4s, %23.4s, v2.s[3]     \\n\"\n                     
   \"fmla   v31.4s, %23.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%4], #16          \\n\"\n\n                        \"ld1    {v1.2s}, [%4]               \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v2.s[2]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v2.s[3]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v3.s[0]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v3.s[1]     \\n\"\n                        \"fmla   v28.4s, %24.4s, v2.s[2]     \\n\"\n                        \"fmla   v29.4s, %24.4s, v2.s[3]     \\n\"\n                        \"fmla   v30.4s, %24.4s, v3.s[0]     \\n\"\n                        \"fmla   v31.4s, %24.4s, v3.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %25.4s, v0.s[0]     \\n\"\n                        \"fmla   v29.4s, %25.4s, v0.s[1]     \\n\"\n                        \"fmla   v30.4s, %25.4s, v0.s[2]     \\n\"\n                        \"fmla   v31.4s, %25.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %17.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %17.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %17.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %17.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %26.4s, v0.s[1]     \\n\"\n                        \"fmla   v29.4s, %26.4s, v0.s[2]     \\n\"\n                        \"fmla   v30.4s, %26.4s, v0.s[3]     \\n\"\n                        \"fmla   v31.4s, %26.4s, v1.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %18.4s, v0.s[2]     \\n\"\n                        \"fmla   
v25.4s, %18.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %18.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %18.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %27.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %27.4s, v0.s[3]     \\n\"\n                        \"fmla   v30.4s, %27.4s, v1.s[0]     \\n\"\n                        \"fmla   v31.4s, %27.4s, v1.s[1]     \\n\"\n\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%1], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(r0),      // %2\n                        \"=r\"(r1),      // %3\n                        \"=r\"(r2)       // %4\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_0), // %10\n                        \"w\"(_k01_0), // %11\n                        \"w\"(_k02_0), // %12\n                        \"w\"(_k10_0), // %13\n                        \"w\"(_k11_0), // %14\n                        \"w\"(_k12_0), // %15\n                        \"w\"(_k20_0), // %16\n                        \"w\"(_k21_0), // %17\n                        \"w\"(_k22_0), // %18\n                        \"w\"(_k00_1), // %19\n                        \"w\"(_k01_1), // %20\n                        \"w\"(_k02_1), // %21\n                        \"w\"(_k10_1), // %22\n                        \"w\"(_k11_1), // %23\n                        \"w\"(_k12_1), // %24\n                        \"w\"(_k20_1), // %25\n                        \"w\"(_k21_1), // %26\n                        \"w\"(_k22_1)  // %27\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v24\", \"v25\", \"v26\", 
\"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v24.4s, v25.4s}, [%0]      \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v26.4s, v27.4s}, [%1]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%2]               \\n\"\n                        \"add    %2, %2, #8                  \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %19.4s, v0.s[0]     \\n\"\n                        \"fmla   v27.4s, %19.4s, v0.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %20.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %20.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v1.4s}, [%3]               \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %21.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %21.4s, v0.s[3]     \\n\"\n\n                        \"add    %3, %3, #8                  \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v1.s[0]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v1.s[1]     \\n\"\n                        \"fmla   v26.4s, %22.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %22.4s, v1.s[1]     \\n\"\n\n                        \"fmla   v24.4s, 
%14.4s, v1.s[1]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v1.s[2]     \\n\"\n                        \"fmla   v26.4s, %23.4s, v1.s[1]     \\n\"\n                        \"fmla   v27.4s, %23.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%4]               \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v1.s[2]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v1.s[3]     \\n\"\n                        \"fmla   v26.4s, %24.4s, v1.s[2]     \\n\"\n                        \"fmla   v27.4s, %24.4s, v1.s[3]     \\n\"\n\n                        \"add    %4, %4, #8                  \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %25.4s, v0.s[0]     \\n\"\n                        \"fmla   v27.4s, %25.4s, v0.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %17.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %17.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %26.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %26.4s, v0.s[2]     \\n\"\n\n                        \"fmla   v24.4s, %18.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %18.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %27.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %27.4s, v0.s[3]     \\n\"\n\n                        \"st1    {v24.4s, v25.4s}, [%0], #32 \\n\"\n                        \"st1    {v26.4s, v27.4s}, [%1], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(r0),      // %2\n                        \"=r\"(r1),      // %3\n                        \"=r\"(r2)       // %4\n                        : \"0\"(outptr0),\n                        
\"1\"(outptr1),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_0), // %10\n                        \"w\"(_k01_0), // %11\n                        \"w\"(_k02_0), // %12\n                        \"w\"(_k10_0), // %13\n                        \"w\"(_k11_0), // %14\n                        \"w\"(_k12_0), // %15\n                        \"w\"(_k20_0), // %16\n                        \"w\"(_k21_0), // %17\n                        \"w\"(_k22_0), // %18\n                        \"w\"(_k00_1), // %19\n                        \"w\"(_k01_1), // %20\n                        \"w\"(_k02_1), // %21\n                        \"w\"(_k10_1), // %22\n                        \"w\"(_k11_1), // %23\n                        \"w\"(_k12_1), // %24\n                        \"w\"(_k20_1), // %25\n                        \"w\"(_k21_1), // %26\n                        \"w\"(_k22_1)  // %27\n                        : \"memory\", \"v0\", \"v1\", \"v24\", \"v25\", \"v26\", \"v27\");\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum00 = vld1q_f32(outptr0);\n                    float32x4_t _sum10 = vld1q_f32(outptr1);\n\n                    float32x4_t _r0 = vld1q_f32(r0);\n                    float32x4_t _r1 = vld1q_f32(r1);\n                    float32x4_t _r2 = vld1q_f32(r2);\n\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k00_0, _r0, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k01_0, _r0, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k02_0, _r0, 2);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k10_0, _r1, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k11_0, _r1, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k12_0, _r1, 2);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k20_0, _r2, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k21_0, _r2, 1);\n  
                  _sum00 = vfmaq_laneq_f32(_sum00, _k22_0, _r2, 2);\n\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k00_1, _r0, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k01_1, _r0, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k02_1, _r0, 2);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k10_1, _r1, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k11_1, _r1, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k12_1, _r1, 2);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k20_1, _r2, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k21_1, _r2, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k22_1, _r2, 2);\n\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr1, _sum10);\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n            k1 += 9 * 4;\n        }\n    }\n#endif // __ARM_NEON && __aarch64__\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float32x4_t _bias0 = bias ? 
vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        const float* k0 = kernel.channel(p);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            float32x4_t _k00 = vld1q_f32(k0);\n            float32x4_t _k01 = vld1q_f32(k0 + 4);\n            float32x4_t _k02 = vld1q_f32(k0 + 8);\n            float32x4_t _k10 = vld1q_f32(k0 + 12);\n            float32x4_t _k11 = vld1q_f32(k0 + 16);\n            float32x4_t _k12 = vld1q_f32(k0 + 20);\n            float32x4_t _k20 = vld1q_f32(k0 + 24);\n            float32x4_t _k21 = vld1q_f32(k0 + 28);\n            float32x4_t _k22 = vld1q_f32(k0 + 32);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n#if __aarch64__\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%1], #32   \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0] \\n\"\n\n                        \"fmla   v24.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmla   v25.4s, %8.4s, v0.s[1]      \\n\"\n                        \"fmla   v26.4s, %8.4s, v0.s[2]      \\n\"\n                        \"fmla   v27.4s, %8.4s, v0.s[3]      \\n\"\n                        \"fmla   v28.4s, %8.4s, v1.s[0]      \\n\"\n                        \"fmla   v29.4s, %8.4s, v1.s[1]      \\n\"\n                        \"fmla   v30.4s, 
%8.4s, v1.s[2]      \\n\"\n                        \"fmla   v31.4s, %8.4s, v1.s[3]      \\n\"\n\n                        \"ld1    {v2.2s}, [%1]               \\n\"\n\n                        \"fmla   v24.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v25.4s, %9.4s, v0.s[2]      \\n\"\n                        \"fmla   v26.4s, %9.4s, v0.s[3]      \\n\"\n                        \"fmla   v27.4s, %9.4s, v1.s[0]      \\n\"\n                        \"fmla   v28.4s, %9.4s, v1.s[1]      \\n\"\n                        \"fmla   v29.4s, %9.4s, v1.s[2]      \\n\"\n                        \"fmla   v30.4s, %9.4s, v1.s[3]      \\n\"\n                        \"fmla   v31.4s, %9.4s, v2.s[0]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%2], #32   \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %10.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %10.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %10.4s, v1.s[3]     \\n\"\n                        \"fmla   v30.4s, %10.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %10.4s, v2.s[1]     \\n\"\n\n                        \"ld1    {v2.2s}, [%2]               \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v4.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v4.s[1]     \\n\"\n                        \"fmla   v26.4s, %11.4s, v4.s[2]     \\n\"\n                        \"fmla   v27.4s, %11.4s, v4.s[3]     \\n\"\n                        \"fmla   v28.4s, %11.4s, v5.s[0]     \\n\"\n                        \"fmla   v29.4s, %11.4s, v5.s[1]     \\n\"\n                        \"fmla   v30.4s, %11.4s, v5.s[2]     \\n\"\n                        \"fmla   v31.4s, %11.4s, v5.s[3] 
    \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v4.s[1]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v4.s[2]     \\n\"\n                        \"fmla   v26.4s, %12.4s, v4.s[3]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v5.s[0]     \\n\"\n                        \"fmla   v28.4s, %12.4s, v5.s[1]     \\n\"\n                        \"fmla   v29.4s, %12.4s, v5.s[2]     \\n\"\n                        \"fmla   v30.4s, %12.4s, v5.s[3]     \\n\"\n                        \"fmla   v31.4s, %12.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%3], #32   \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v4.s[2]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v4.s[3]     \\n\"\n                        \"fmla   v26.4s, %13.4s, v5.s[0]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v5.s[1]     \\n\"\n                        \"fmla   v28.4s, %13.4s, v5.s[2]     \\n\"\n                        \"fmla   v29.4s, %13.4s, v5.s[3]     \\n\"\n                        \"fmla   v30.4s, %13.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %13.4s, v2.s[1]     \\n\"\n\n                        \"ld1    {v2.2s}, [%3]               \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %14.4s, v1.s[0]     \\n\"\n                        \"fmla   v29.4s, %14.4s, v1.s[1]     \\n\"\n                        \"fmla   v30.4s, %14.4s, v1.s[2]     \\n\"\n                        \"fmla   v31.4s, %14.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v0.s[2]     \\n\"\n     
                   \"fmla   v26.4s, %15.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %15.4s, v1.s[1]     \\n\"\n                        \"fmla   v29.4s, %15.4s, v1.s[2]     \\n\"\n                        \"fmla   v30.4s, %15.4s, v1.s[3]     \\n\"\n                        \"fmla   v31.4s, %15.4s, v2.s[0]     \\n\"\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %16.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %16.4s, v1.s[3]     \\n\"\n                        \"fmla   v30.4s, %16.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %16.4s, v2.s[1]     \\n\"\n\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", 
\"v2\", \"v4\", \"v5\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n#endif // __aarch64__\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0] \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%1], #16          \\n\"\n\n                        \"fmla   v24.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmla   v25.4s, %8.4s, v0.s[1]      \\n\"\n                        \"fmla   v26.4s, %8.4s, v0.s[2]      \\n\"\n                        \"fmla   v27.4s, %8.4s, v0.s[3]      \\n\"\n\n                        \"ld1    {v1.2s}, [%1]               \\n\"\n\n                        \"fmla   v24.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v25.4s, %9.4s, v0.s[2]      \\n\"\n                        \"fmla   v26.4s, %9.4s, v0.s[3]      \\n\"\n                        \"fmla   v27.4s, %9.4s, v1.s[0]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v2.4s}, [%2], #16          \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %10.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v1.s[1]     \\n\"\n\n                        \"ld1    {v3.2s}, [%2]               \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v2.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v2.s[1]     \\n\"\n                        \"fmla   v26.4s, %11.4s, v2.s[2]     \\n\"\n                        \"fmla   v27.4s, %11.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v2.s[1]     \\n\"\n              
          \"fmla   v25.4s, %12.4s, v2.s[2]     \\n\"\n                        \"fmla   v26.4s, %12.4s, v2.s[3]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%3], #16          \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v2.s[2]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v2.s[3]     \\n\"\n                        \"fmla   v26.4s, %13.4s, v3.s[0]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v3.s[1]     \\n\"\n\n                        \"ld1    {v1.2s}, [%3]               \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v1.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v1.s[1]     \\n\"\n\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        
\"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1]!      \\n\"\n\n                        \"vmla.f32   q12, %q8, d0[0]     \\n\"\n                        \"vmla.f32   q13, %q8, d0[1]     \\n\"\n                        \"vmla.f32   q14, %q8, d1[0]     \\n\"\n                        \"vmla.f32   q15, %q8, d1[1]     \\n\"\n\n                        \"vld1.f32   {d2}, [%1]          \\n\"\n\n                        \"vmla.f32   q12, %q9, d0[1]     \\n\"\n                        \"vmla.f32   q13, %q9, d1[0]     \\n\"\n                        \"vmla.f32   q14, %q9, d1[1]     \\n\"\n                        \"vmla.f32   q15, %q9, d2[0]     \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d4-d5}, [%2]!      
\\n\"\n\n                        \"vmla.f32   q12, %q10, d1[0]    \\n\"\n                        \"vmla.f32   q13, %q10, d1[1]    \\n\"\n                        \"vmla.f32   q14, %q10, d2[0]    \\n\"\n                        \"vmla.f32   q15, %q10, d2[1]    \\n\"\n\n                        \"vmla.f32   q12, %q11, d4[0]    \\n\"\n                        \"vmla.f32   q13, %q11, d4[1]    \\n\"\n                        \"vmla.f32   q14, %q11, d5[0]    \\n\"\n                        \"vmla.f32   q15, %q11, d5[1]    \\n\"\n\n                        \"vld1.f32   {d3}, [%2]          \\n\"\n\n                        \"vmla.f32   q12, %q12, d4[1]    \\n\"\n                        \"vmla.f32   q13, %q12, d5[0]    \\n\"\n                        \"vmla.f32   q14, %q12, d5[1]    \\n\"\n                        \"vmla.f32   q15, %q12, d3[0]    \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%3]!      \\n\"\n\n                        \"vmla.f32   q12, %q13, d5[0]    \\n\"\n                        \"vmla.f32   q13, %q13, d5[1]    \\n\"\n                        \"vmla.f32   q14, %q13, d3[0]    \\n\"\n                        \"vmla.f32   q15, %q13, d3[1]    \\n\"\n\n                        \"vmla.f32   q12, %q14, d0[0]    \\n\"\n                        \"vmla.f32   q13, %q14, d0[1]    \\n\"\n                        \"vmla.f32   q14, %q14, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q14, d1[1]    \\n\"\n\n                        \"vld1.f32   {d2}, [%3]          \\n\"\n\n                        \"vmla.f32   q12, %q15, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q15, d1[0]    \\n\"\n                        \"vmla.f32   q14, %q15, d1[1]    \\n\"\n                        \"vmla.f32   q15, %q15, d2[0]    \\n\"\n\n                        \"vmla.f32   q12, %q16, d1[0]    \\n\"\n                        \"vmla.f32   q13, %q16, d1[1]    \\n\"\n                        \"vmla.f32   q14, %q16, 
d2[0]    \\n\"\n                        \"vmla.f32   q15, %q16, d2[1]    \\n\"\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v24.4s, v25.4s}, [%0]      \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%1]               \\n\"\n\n                        \"fmul   v26.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmul   v27.4s, %8.4s, v0.s[1]      \\n\"\n                        \"fmla   v24.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v25.4s, %9.4s, v0.s[2]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v1.4s}, [%2]               \\n\"\n\n                        \"fmla   v26.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v0.s[3]     \\n\"\n                     
   \"fmla   v24.4s, %11.4s, v1.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v1.s[1]     \\n\"\n\n                        \"add    %1, %1, #8                  \\n\"\n\n                        \"fmla   v26.4s, %12.4s, v1.s[1]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%3]               \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v1.s[2]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v1.s[3]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v0.s[1]     \\n\"\n\n                        \"add    %2, %2, #8                  \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v0.s[3]     \\n\"\n\n                        \"add    %3, %3, #8                  \\n\"\n\n                        \"fadd   v24.4s, v24.4s, v26.4s      \\n\"\n                        \"fadd   v25.4s, v25.4s, v27.4s      \\n\"\n\n                        \"st1    {v24.4s, v25.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        
\"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%0 :128] \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1]       \\n\"\n\n                        \"vmul.f32   q14, %q8, d0[0]     \\n\"\n                        \"vmul.f32   q15, %q8, d0[1]     \\n\"\n                        \"vmla.f32   q12, %q9, d0[1]     \\n\"\n                        \"vmla.f32   q13, %q9, d1[0]     \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d2-d3}, [%2]       \\n\"\n\n                        \"vmla.f32   q14, %q10, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q10, d1[1]    \\n\"\n                        \"vmla.f32   q12, %q11, d2[0]    \\n\"\n                        \"vmla.f32   q13, %q11, d2[1]    \\n\"\n\n                        \"add        %1, %1, #8          \\n\"\n\n                        \"vmla.f32   q14, %q12, d2[1]    \\n\"\n                        \"vmla.f32   q15, %q12, d3[0]    \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%3]       \\n\"\n\n                        \"vmla.f32   q12, %q13, d3[0]    \\n\"\n                        \"vmla.f32   q13, %q13, d3[1]    \\n\"\n                        \"vmla.f32   q14, %q14, d0[0]    \\n\"\n                        \"vmla.f32   q15, %q14, d0[1]    \\n\"\n\n                        \"add        %2, %2, #8          \\n\"\n\n                        \"vmla.f32   q12, %q15, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q15, d1[0]    \\n\"\n                        \"vmla.f32   q14, %q16, d1[0] 
   \\n\"\n                        \"vmla.f32   q15, %q16, d1[1]    \\n\"\n\n                        \"add        %3, %3, #8          \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"vst1.f32   {d24-d27}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"q0\", \"q1\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n\n                    float32x4_t _r0 = vld1q_f32(r0);\n                    float32x4_t _r1 = vld1q_f32(r1);\n                    float32x4_t _r2 = vld1q_f32(r2);\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20, _r2, 0);\n       
             _sum0 = vfmaq_laneq_f32(_sum0, _k21, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22, _r2, 2);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _k00, vget_low_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k01, vget_low_f32(_r0), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k02, vget_high_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k10, vget_low_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k11, vget_low_f32(_r1), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k12, vget_high_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k20, vget_low_f32(_r2), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k21, vget_low_f32(_r2), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k22, vget_high_f32(_r2), 0);\n#endif\n\n                    vst1q_f32(outptr0, _sum0);\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                    outptr0 += 4;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n\nstatic void conv3x3s2_pack1to4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* bias = _bias;\n\n    int remain_outch_start = 0;\n\n#if __ARM_NEON && __aarch64__\n    int nn_outch = 0;\n    nn_outch = outch >> 1;\n    remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n\n        float32x4_t _bias0 = bias ? 
vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = bias ? vld1q_f32((const float*)bias + (p + 1) * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n        out1.fill(_bias1);\n\n        const float* k0 = kernel.channel(p);\n        const float* k1 = kernel.channel(p + 1);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            float32x4_t _k00_0 = vld1q_f32(k0);\n            float32x4_t _k01_0 = vld1q_f32(k0 + 4);\n            float32x4_t _k02_0 = vld1q_f32(k0 + 8);\n            float32x4_t _k10_0 = vld1q_f32(k0 + 12);\n            float32x4_t _k11_0 = vld1q_f32(k0 + 16);\n            float32x4_t _k12_0 = vld1q_f32(k0 + 20);\n            float32x4_t _k20_0 = vld1q_f32(k0 + 24);\n            float32x4_t _k21_0 = vld1q_f32(k0 + 28);\n            float32x4_t _k22_0 = vld1q_f32(k0 + 32);\n\n            float32x4_t _k00_1 = vld1q_f32(k1);\n            float32x4_t _k01_1 = vld1q_f32(k1 + 4);\n            float32x4_t _k02_1 = vld1q_f32(k1 + 8);\n            float32x4_t _k10_1 = vld1q_f32(k1 + 12);\n            float32x4_t _k11_1 = vld1q_f32(k1 + 16);\n            float32x4_t _k12_1 = vld1q_f32(k1 + 20);\n            float32x4_t _k20_1 = vld1q_f32(k1 + 24);\n            float32x4_t _k21_1 = vld1q_f32(k1 + 28);\n            float32x4_t _k22_1 = vld1q_f32(k1 + 32);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int nn = outw >> 2;\n                int remain = outw & 3;\n\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1  
  {v6.4s, v7.4s, v8.4s, v9.4s}, [%1] \\n\" // sum0\n\n                        // r0\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%3], #32   \\n\"\n                        \"ld1r   {v4.4s}, [%3]               \\n\"\n\n                        \"fmla   v6.4s, %12.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v0.s[2]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%2] \\n\" // sum1\n\n                        \"fmla   v8.4s, %12.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %12.4s, v1.s[2]      \\n\"\n\n                        \"fmla   v10.4s, %21.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %21.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %21.4s, v1.s[0]     \\n\"\n                        \"fmla   v13.4s, %21.4s, v1.s[2]     \\n\"\n\n                        \"fmla   v6.4s, %13.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %13.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %13.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v1.s[3]      \\n\"\n                        \"fmla   v10.4s, %22.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %22.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %22.4s, v1.s[1]     \\n\"\n                        \"fmla   v13.4s, %22.4s, v1.s[3]     \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v2.4s, v3.4s}, [%4], #32   \\n\"\n                        \"ld1r   {v5.4s}, [%4]               \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %14.4s, v1.s[2]      \\n\"\n               
         \"fmla   v9.4s, %14.4s, v4.s[0]      \\n\"\n                        \"fmla   v10.4s, %23.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %23.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %23.4s, v1.s[2]     \\n\"\n                        \"fmla   v13.4s, %23.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v6.4s, %15.4s, v2.s[0]      \\n\"\n                        \"fmla   v7.4s, %15.4s, v2.s[2]      \\n\"\n                        \"fmla   v8.4s, %15.4s, v3.s[0]      \\n\"\n                        \"fmla   v9.4s, %15.4s, v3.s[2]      \\n\"\n                        \"fmla   v10.4s, %24.4s, v2.s[0]     \\n\"\n                        \"fmla   v11.4s, %24.4s, v2.s[2]     \\n\"\n                        \"fmla   v12.4s, %24.4s, v3.s[0]     \\n\"\n                        \"fmla   v13.4s, %24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v2.s[3]      \\n\"\n                        \"fmla   v8.4s, %16.4s, v3.s[1]      \\n\"\n                        \"fmla   v9.4s, %16.4s, v3.s[3]      \\n\"\n                        \"fmla   v10.4s, %25.4s, v2.s[1]     \\n\"\n                        \"fmla   v11.4s, %25.4s, v2.s[3]     \\n\"\n                        \"fmla   v12.4s, %25.4s, v3.s[1]     \\n\"\n                        \"fmla   v13.4s, %25.4s, v3.s[3]     \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%5], #32   \\n\"\n                        \"ld1r   {v4.4s}, [%5]               \\n\"\n\n                        \"fmla   v6.4s, %17.4s, v2.s[2]      \\n\"\n                        \"fmla   v7.4s, %17.4s, v3.s[0]      \\n\"\n                        \"fmla   v8.4s, %17.4s, v3.s[2]      \\n\"\n                        \"fmla   v9.4s, %17.4s, v5.s[0]      \\n\"\n                        \"fmla   v10.4s, %26.4s, v2.s[2]     \\n\"\n    
                    \"fmla   v11.4s, %26.4s, v3.s[0]     \\n\"\n                        \"fmla   v12.4s, %26.4s, v3.s[2]     \\n\"\n                        \"fmla   v13.4s, %26.4s, v5.s[0]     \\n\"\n\n                        \"fmla   v6.4s, %18.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %18.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %18.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %18.4s, v1.s[2]      \\n\"\n                        \"fmla   v10.4s, %27.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %27.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %27.4s, v1.s[0]     \\n\"\n                        \"fmla   v13.4s, %27.4s, v1.s[2]     \\n\"\n\n                        \"fmla   v6.4s, %19.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %19.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %19.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %19.4s, v1.s[3]      \\n\"\n                        \"fmla   v10.4s, %28.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %28.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %28.4s, v1.s[1]     \\n\"\n                        \"fmla   v13.4s, %28.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v6.4s, %20.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %20.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %20.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %20.4s, v4.s[0]      \\n\"\n                        \"fmla   v10.4s, %29.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %29.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %29.4s, v1.s[2]     \\n\"\n                        \"fmla   v13.4s, %29.4s, v4.s[0]     \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n\n                        \"st1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%1], #64 \\n\"\n          
              \"st1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%2], #64 \\n\"\n\n                        \"bne    0b                          \\n\"\n\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2)       // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00_0), // %12\n                        \"w\"(_k01_0), // %13\n                        \"w\"(_k02_0), // %14\n                        \"w\"(_k10_0), // %15\n                        \"w\"(_k11_0), // %16\n                        \"w\"(_k12_0), // %17\n                        \"w\"(_k20_0), // %18\n                        \"w\"(_k21_0), // %19\n                        \"w\"(_k22_0), // %20\n                        \"w\"(_k00_1), // %21\n                        \"w\"(_k01_1), // %22\n                        \"w\"(_k02_1), // %23\n                        \"w\"(_k10_1), // %24\n                        \"w\"(_k11_1), // %25\n                        \"w\"(_k12_1), // %26\n                        \"w\"(_k20_1), // %27\n                        \"w\"(_k21_1), // %28\n                        \"w\"(_k22_1)  // %29\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n\n                for (; remain > 0; remain--)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n                    float32x4_t _sum1 = vld1q_f32(outptr1);\n\n                    float32x4_t _r0 = vld1q_f32(r0);\n                    float32x4_t _r1 = vld1q_f32(r1);\n        
            float32x4_t _r2 = vld1q_f32(r2);\n\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00_0, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01_0, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02_0, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10_0, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11_0, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12_0, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20_0, _r2, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k21_0, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22_0, _r2, 2);\n\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k00_1, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k01_1, _r0, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k02_1, _r0, 2);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k10_1, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k11_1, _r1, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k12_1, _r1, 2);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k20_1, _r2, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k21_1, _r2, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k22_1, _r2, 2);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr1, _sum1);\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n            k1 += 9 * 4;\n        }\n    }\n#endif // __ARM_NEON && __aarch64__\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float32x4_t _bias0 = bias ? 
vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        const float* k0 = kernel.channel(p);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            float32x4_t _k00 = vld1q_f32(k0);\n            float32x4_t _k01 = vld1q_f32(k0 + 4);\n            float32x4_t _k02 = vld1q_f32(k0 + 8);\n            float32x4_t _k10 = vld1q_f32(k0 + 12);\n            float32x4_t _k11 = vld1q_f32(k0 + 16);\n            float32x4_t _k12 = vld1q_f32(k0 + 20);\n            float32x4_t _k20 = vld1q_f32(k0 + 24);\n            float32x4_t _k21 = vld1q_f32(k0 + 28);\n            float32x4_t _k22 = vld1q_f32(k0 + 32);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int nn = outw >> 2;\n                int remain = outw & 3;\n\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%1] \\n\" // sum0\n\n                        // r0\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                        \"ld1r   {v4.4s}, [%2]               \\n\"\n\n                        \"fmla   v6.4s, %10.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %10.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %10.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %10.4s, v1.s[2]      \\n\"\n\n                        \"fmla   v6.4s, %11.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %11.4s, v0.s[3]      
\\n\"\n                        \"fmla   v8.4s, %11.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %11.4s, v1.s[3]      \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v2.4s, v3.4s}, [%3], #32   \\n\"\n                        \"ld1r   {v5.4s}, [%3]               \\n\"\n\n                        \"fmla   v6.4s, %12.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %12.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %12.4s, v4.s[0]      \\n\"\n\n                        \"fmla   v6.4s, %13.4s, v2.s[0]      \\n\"\n                        \"fmla   v7.4s, %13.4s, v2.s[2]      \\n\"\n                        \"fmla   v8.4s, %13.4s, v3.s[0]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v3.s[2]      \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v2.s[3]      \\n\"\n                        \"fmla   v8.4s, %14.4s, v3.s[1]      \\n\"\n                        \"fmla   v9.4s, %14.4s, v3.s[3]      \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%4], #32   \\n\"\n                        \"ld1r   {v4.4s}, [%4]               \\n\"\n\n                        \"fmla   v6.4s, %15.4s, v2.s[2]      \\n\"\n                        \"fmla   v7.4s, %15.4s, v3.s[0]      \\n\"\n                        \"fmla   v8.4s, %15.4s, v3.s[2]      \\n\"\n                        \"fmla   v9.4s, %15.4s, v5.s[0]      \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %16.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %16.4s, v1.s[2]      \\n\"\n\n           
             \"fmla   v6.4s, %17.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %17.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %17.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %17.4s, v1.s[3]      \\n\"\n\n                        \"fmla   v6.4s, %18.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %18.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %18.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %18.4s, v4.s[0]      \\n\"\n\n                        \"subs   %w0, %w0, #1                \\n\"\n\n                        \"st1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%1], #64 \\n\"\n\n                        \"bne    0b                          \\n\"\n\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(r0),      // %2\n                        \"=r\"(r1),      // %3\n                        \"=r\"(r2)       // %4\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n                }\n#else  // __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                             \\n\"\n\n                        \"pld        [%1, #512]          \\n\"\n                        
\"vldm       %1, {d0-d7}         \\n\" // sum0\n\n                        // r0\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%2]!     \\n\"\n                        \"vld1.f32   {d12[]}, [%2]       \\n\"\n\n                        \"vmla.f32   q0, %q10, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q10, d9[0]     \\n\"\n                        \"vmla.f32   q2, %q10, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q10, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q11, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q11, d9[1]     \\n\"\n                        \"vmla.f32   q2, %q11, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q11, d11[1]    \\n\"\n\n                        \"vmla.f32   q0, %q12, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q12, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q12, d11[0]    \\n\"\n\n                        // r1\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%3]!     
\\n\"\n                        \"vld1.f32   {d13[]}, [%3]       \\n\"\n\n                        \"vmla.f32   q3, %q12, d12[0]    \\n\"\n\n                        \"vmla.f32   q0, %q13, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q13, d9[0]     \\n\"\n                        \"vmla.f32   q2, %q13, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q13, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q14, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q14, d9[1]     \\n\"\n                        \"vmla.f32   q2, %q14, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q14, d11[1]    \\n\"\n\n                        \"vmla.f32   q0, %q15, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q15, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q15, d11[0]    \\n\"\n\n                        // r2\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%4]!     \\n\"\n                        \"vld1.f32   {d12[]}, [%4]       \\n\"\n\n                        \"vmla.f32   q3, %q15, d13[0]    \\n\"\n\n                        \"vmla.f32   q0, %q16, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q16, d9[0]     \\n\"\n                        \"vmla.f32   q2, %q16, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q16, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q17, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q17, d9[1]     \\n\"\n                        \"vmla.f32   q2, %q17, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q17, d11[1]    \\n\"\n\n                        \"vmla.f32   q0, %q18, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q18, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q18, d11[0]    \\n\"\n                        \"vmla.f32   q3, %q18, d12[0]    \\n\"\n\n                        \"subs       %0, %0, #1          \\n\"\n\n                    
    \"vstm       %1!, {d0-d7}        \\n\"\n\n                        \"bne        0b                  \\n\"\n\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(r0),      // %2\n                        \"=r\"(r1),      // %3\n                        \"=r\"(r2)       // %4\n                        : \"0\"(nn),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\");\n                }\n#endif // __aarch64__\n\n                for (; remain > 0; remain--)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n\n                    float32x4_t _r0 = vld1q_f32(r0);\n                    float32x4_t _r1 = vld1q_f32(r1);\n                    float32x4_t _r2 = vld1q_f32(r2);\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20, _r2, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k21, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22, 
_r2, 2);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _k00, vget_low_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k01, vget_low_f32(_r0), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k02, vget_high_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k10, vget_low_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k11, vget_low_f32(_r1), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k12, vget_high_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k20, vget_low_f32(_r2), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k21, vget_low_f32(_r2), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k22, vget_high_f32(_r2), 0);\n#endif\n\n                    vst1q_f32(outptr0, _sum0);\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr0 += 4;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack1to4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack1to4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n#if __ARM_NEON && __aarch64__\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4 * 2, 4 * 2, opt.workspace_allocator);\n#else\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4, 4, opt.workspace_allocator);\n#endif\n\n    const float* bias = _bias;\n\n    int remain_outch_start = 0;\n\n#if __ARM_NEON && __aarch64__\n    int nn_outch = 0;\n    nn_outch = outch >> 1;\n    remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = bias ? 
vld1q_f32((const float*)bias + (p + 1) * 4) : vdupq_n_f32(0.f);\n        {\n            float* ptr = (float*)out0;\n\n            for (int i = 0; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    vst1q_f32(ptr, _bias0);\n                    vst1q_f32(ptr + 4, _bias0);\n                    vst1q_f32(ptr + 8, _bias0);\n                    vst1q_f32(ptr + 12, _bias0);\n                    vst1q_f32(ptr + 16, _bias1);\n                    vst1q_f32(ptr + 20, _bias1);\n                    vst1q_f32(ptr + 24, _bias1);\n                    vst1q_f32(ptr + 28, _bias1);\n                    ptr += 32;\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    vst1q_f32(ptr, _bias0);\n                    vst1q_f32(ptr + 4, _bias0);\n                    vst1q_f32(ptr + 8, _bias1);\n                    vst1q_f32(ptr + 12, _bias1);\n                    ptr += 16;\n                }\n                for (; j < outw; j++)\n                {\n                    vst1q_f32(ptr, _bias0);\n                    vst1q_f32(ptr + 4, _bias1);\n                    ptr += 8;\n                }\n            }\n        }\n\n        const unsigned short* k0 = kernel.channel(p);\n        const unsigned short* k1 = kernel.channel(p + 1);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            float32x4_t _k00_0 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01_0 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02_0 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10_0 = 
bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11_0 = bfloat2float(vld1_u16(k0 + 16));\n            float32x4_t _k12_0 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20_0 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21_0 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22_0 = bfloat2float(vld1_u16(k0 + 32));\n\n            float32x4_t _k00_1 = bfloat2float(vld1_u16(k1));\n            float32x4_t _k01_1 = bfloat2float(vld1_u16(k1 + 4));\n            float32x4_t _k02_1 = bfloat2float(vld1_u16(k1 + 8));\n            float32x4_t _k10_1 = bfloat2float(vld1_u16(k1 + 12));\n            float32x4_t _k11_1 = bfloat2float(vld1_u16(k1 + 16));\n            float32x4_t _k12_1 = bfloat2float(vld1_u16(k1 + 20));\n            float32x4_t _k20_1 = bfloat2float(vld1_u16(k1 + 24));\n            float32x4_t _k21_1 = bfloat2float(vld1_u16(k1 + 28));\n            float32x4_t _k22_1 = bfloat2float(vld1_u16(k1 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1], #8           \\n\"\n                        \"ld1    {v1.s}[0], [%1]             \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0] \\n\"\n\n                        \"fmla   v24.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmla   v25.4s, %8.4s, v0.s[1]      \\n\"\n                        \"fmla   v26.4s, 
%8.4s, v0.s[2]      \\n\"\n                        \"fmla   v27.4s, %8.4s, v0.s[3]      \\n\"\n                        \"fmla   v28.4s, %17.4s, v0.s[0]     \\n\"\n                        \"fmla   v29.4s, %17.4s, v0.s[1]     \\n\"\n                        \"fmla   v30.4s, %17.4s, v0.s[2]     \\n\"\n                        \"fmla   v31.4s, %17.4s, v0.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%2], #8           \\n\"\n                        \"ld1    {v3.s}[0], [%2]             \\n\"\n\n                        \"fmla   v24.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v25.4s, %9.4s, v0.s[2]      \\n\"\n                        \"fmla   v26.4s, %9.4s, v0.s[3]      \\n\"\n                        \"fmla   v27.4s, %9.4s, v1.s[0]      \\n\"\n                        \"fmla   v28.4s, %18.4s, v0.s[1]     \\n\"\n                        \"fmla   v29.4s, %18.4s, v0.s[2]     \\n\"\n                        \"fmla   v30.4s, %18.4s, v0.s[3]     \\n\"\n                        \"fmla   v31.4s, %18.4s, v1.s[0]     \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %10.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %19.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %19.4s, v0.s[3]     \\n\"\n                        \"fmla   v30.4s, %19.4s, v1.s[0]     \\n\"\n                        \"fmla   v31.4s, %19.4s, v1.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v2.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v2.s[1]     \\n\"\n                        \"fmla   v26.4s, %11.4s, v2.s[2]   
  \\n\"\n                        \"fmla   v27.4s, %11.4s, v2.s[3]     \\n\"\n                        \"fmla   v28.4s, %20.4s, v2.s[0]     \\n\"\n                        \"fmla   v29.4s, %20.4s, v2.s[1]     \\n\"\n                        \"fmla   v30.4s, %20.4s, v2.s[2]     \\n\"\n                        \"fmla   v31.4s, %20.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\"\n                        \"ld1    {v1.s}[0], [%3]             \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v2.s[1]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v2.s[2]     \\n\"\n                        \"fmla   v26.4s, %12.4s, v2.s[3]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v3.s[0]     \\n\"\n                        \"fmla   v28.4s, %21.4s, v2.s[1]     \\n\"\n                        \"fmla   v29.4s, %21.4s, v2.s[2]     \\n\"\n                        \"fmla   v30.4s, %21.4s, v2.s[3]     \\n\"\n                        \"fmla   v31.4s, %21.4s, v3.s[0]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v2.s[2]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v2.s[3]     \\n\"\n                        \"fmla   v26.4s, %13.4s, v3.s[0]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v3.s[1]     \\n\"\n                        \"fmla   v28.4s, %22.4s, v2.s[2]     \\n\"\n                        \"fmla   v29.4s, %22.4s, v2.s[3]     \\n\"\n                        \"fmla   v30.4s, %22.4s, v3.s[0]     \\n\"\n                        \"fmla   v31.4s, %22.4s, v3.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v0.s[2]     \\n\"\n         
               \"fmla   v27.4s, %14.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %23.4s, v0.s[0]     \\n\"\n                        \"fmla   v29.4s, %23.4s, v0.s[1]     \\n\"\n                        \"fmla   v30.4s, %23.4s, v0.s[2]     \\n\"\n                        \"fmla   v31.4s, %23.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %24.4s, v0.s[1]     \\n\"\n                        \"fmla   v29.4s, %24.4s, v0.s[2]     \\n\"\n                        \"fmla   v30.4s, %24.4s, v0.s[3]     \\n\"\n                        \"fmla   v31.4s, %24.4s, v1.s[0]     \\n\"\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %25.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %25.4s, v0.s[3]     \\n\"\n                        \"fmla   v30.4s, %25.4s, v1.s[0]     \\n\"\n                        \"fmla   v31.4s, %25.4s, v1.s[1]     \\n\"\n\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        
\"3\"(r2),\n                        \"w\"(_k00_0), // %8\n                        \"w\"(_k01_0), // %9\n                        \"w\"(_k02_0), // %10\n                        \"w\"(_k10_0), // %11\n                        \"w\"(_k11_0), // %12\n                        \"w\"(_k12_0), // %13\n                        \"w\"(_k20_0), // %14\n                        \"w\"(_k21_0), // %15\n                        \"w\"(_k22_0), // %16\n                        \"w\"(_k00_1), // %17\n                        \"w\"(_k01_1), // %18\n                        \"w\"(_k02_1), // %19\n                        \"w\"(_k10_1), // %20\n                        \"w\"(_k11_1), // %21\n                        \"w\"(_k12_1), // %22\n                        \"w\"(_k20_1), // %23\n                        \"w\"(_k21_1), // %24\n                        \"w\"(_k22_1)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1]               \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0] \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmla   v25.4s, %8.4s, v0.s[1]      \\n\"\n                        \"fmla   v26.4s, %17.4s, v0.s[0]     \\n\"\n                        \"fmla   v27.4s, %17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]       \\n\"\n                        \"ld1    {v1.4h}, [%2]               \\n\"\n\n                        \"fmla   v24.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v25.4s, 
%9.4s, v0.s[2]      \\n\"\n                        \"fmla   v26.4s, %18.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %18.4s, v0.s[2]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %19.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %19.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v1.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v1.s[1]     \\n\"\n                        \"fmla   v26.4s, %20.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %20.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]       \\n\"\n                        \"ld1    {v0.4h}, [%3]               \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v1.s[1]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v1.s[2]     \\n\"\n                        \"fmla   v26.4s, %21.4s, v1.s[1]     \\n\"\n                        \"fmla   v27.4s, %21.4s, v1.s[2]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v1.s[2]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v1.s[3]     \\n\"\n                        \"fmla   v26.4s, %22.4s, v1.s[2]     \\n\"\n                        \"fmla   v27.4s, %22.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %23.4s, v0.s[0]     \\n\"\n                        \"fmla   v27.4s, %23.4s, v0.s[1]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %15.4s, 
v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %24.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %24.4s, v0.s[2]     \\n\"\n\n                        \"add    %2, %2, #4                  \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %25.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %25.4s, v0.s[3]     \\n\"\n\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_0), // %8\n                        \"w\"(_k01_0), // %9\n                        \"w\"(_k02_0), // %10\n                        \"w\"(_k10_0), // %11\n                        \"w\"(_k11_0), // %12\n                        \"w\"(_k12_0), // %13\n                        \"w\"(_k20_0), // %14\n                        \"w\"(_k21_0), // %15\n                        \"w\"(_k22_0), // %16\n                        \"w\"(_k00_1), // %17\n                        \"w\"(_k01_1), // %18\n                        \"w\"(_k02_1), // %19\n                        \"w\"(_k10_1), // %20\n                        \"w\"(_k11_1), // %21\n                        \"w\"(_k12_1), // %22\n                        \"w\"(_k20_1), // %23\n                        \"w\"(_k21_1), // %24\n                        \"w\"(_k22_1)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v24\", \"v25\", \"v26\", \"v27\");\n                }\n                for (; j < outw; j++)\n    
            {\n                    float32x4_t _sum00 = vld1q_f32(outptr0);\n                    float32x4_t _sum10 = vld1q_f32(outptr0 + 4);\n\n                    float32x4_t _r0 = bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k00_0, _r0, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k01_0, _r0, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k02_0, _r0, 2);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k10_0, _r1, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k11_0, _r1, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k12_0, _r1, 2);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k20_0, _r2, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k21_0, _r2, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k22_0, _r2, 2);\n\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k00_1, _r0, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k01_1, _r0, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k02_1, _r0, 2);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k10_1, _r1, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k11_1, _r1, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k12_1, _r1, 2);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k20_1, _r2, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k21_1, _r2, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k22_1, _r2, 2);\n\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                    outptr0 += 8;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 
4;\n            k1 += 9 * 4;\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n            unsigned short* outptr1_bf16 = top_blob.channel(p + 1);\n\n            const float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            float32x4_t _k00_0 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01_0 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02_0 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10_0 = bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11_0 = bfloat2float(vld1_u16(k0 + 16));\n            float32x4_t _k12_0 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20_0 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21_0 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22_0 = bfloat2float(vld1_u16(k0 + 32));\n\n            float32x4_t _k00_1 = bfloat2float(vld1_u16(k1));\n            float32x4_t _k01_1 = bfloat2float(vld1_u16(k1 + 4));\n            float32x4_t _k02_1 = bfloat2float(vld1_u16(k1 + 8));\n            float32x4_t _k10_1 = bfloat2float(vld1_u16(k1 + 12));\n            float32x4_t _k11_1 = bfloat2float(vld1_u16(k1 + 16));\n            float32x4_t _k12_1 = bfloat2float(vld1_u16(k1 + 20));\n            float32x4_t _k20_1 = bfloat2float(vld1_u16(k1 + 24));\n            float32x4_t _k21_1 = bfloat2float(vld1_u16(k1 + 28));\n            float32x4_t _k22_1 = bfloat2float(vld1_u16(k1 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%3, 
#64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\"\n                        \"ld1    {v1.s}[0], [%3]             \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%2], #64 \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%2], #64 \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %12.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %21.4s, v0.s[0]     \\n\"\n                        \"fmla   v29.4s, %21.4s, v0.s[1]     \\n\"\n                        \"fmla   v30.4s, %21.4s, v0.s[2]     \\n\"\n                        \"fmla   v31.4s, %21.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %13.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %22.4s, v0.s[1]     \\n\"\n                        \"fmla   v29.4s, %22.4s, v0.s[2]     \\n\"\n                        \"fmla   v30.4s, %22.4s, v0.s[3]     \\n\"\n                        \"fmla   v31.4s, %22.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%4], #8           \\n\"\n                        \"ld1    {v3.s}[0], [%4]             \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla 
  v25.4s, %14.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %23.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %23.4s, v0.s[3]     \\n\"\n                        \"fmla   v30.4s, %23.4s, v1.s[0]     \\n\"\n                        \"fmla   v31.4s, %23.4s, v1.s[1]     \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v2.s[0]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v2.s[1]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v2.s[2]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v2.s[3]     \\n\"\n                        \"fmla   v28.4s, %24.4s, v2.s[0]     \\n\"\n                        \"fmla   v29.4s, %24.4s, v2.s[1]     \\n\"\n                        \"fmla   v30.4s, %24.4s, v2.s[2]     \\n\"\n                        \"fmla   v31.4s, %24.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v2.s[1]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v2.s[2]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v2.s[3]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v3.s[0]     \\n\"\n                        \"fmla   v28.4s, %25.4s, v2.s[1]     \\n\"\n                        \"fmla   v29.4s, %25.4s, v2.s[2]     \\n\"\n                        \"fmla   v30.4s, %25.4s, v2.s[3]     \\n\"\n                        \"fmla   v31.4s, %25.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%5], #8           \\n\"\n                        \"ld1    {v1.s}[0], [%5]             \\n\"\n\n                        \"fmla   v24.4s, %17.4s, v2.s[2]     \\n\"\n                        \"fmla   v25.4s, %17.4s, 
v2.s[3]     \\n\"\n                        \"fmla   v26.4s, %17.4s, v3.s[0]     \\n\"\n                        \"fmla   v27.4s, %17.4s, v3.s[1]     \\n\"\n                        \"fmla   v28.4s, %26.4s, v2.s[2]     \\n\"\n                        \"fmla   v29.4s, %26.4s, v2.s[3]     \\n\"\n                        \"fmla   v30.4s, %26.4s, v3.s[0]     \\n\"\n                        \"fmla   v31.4s, %26.4s, v3.s[1]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %18.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %18.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %18.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %18.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %27.4s, v0.s[0]     \\n\"\n                        \"fmla   v29.4s, %27.4s, v0.s[1]     \\n\"\n                        \"fmla   v30.4s, %27.4s, v0.s[2]     \\n\"\n                        \"fmla   v31.4s, %27.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %19.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %19.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %19.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %19.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %28.4s, v0.s[1]     \\n\"\n                        \"fmla   v29.4s, %28.4s, v0.s[2]     \\n\"\n                        \"fmla   v30.4s, %28.4s, v0.s[3]     \\n\"\n                        \"fmla   v31.4s, %28.4s, v1.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %20.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %20.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %20.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %20.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %29.4s, v0.s[2]     \\n\"\n 
                       \"fmla   v29.4s, %29.4s, v0.s[3]     \\n\"\n                        \"fmla   v30.4s, %29.4s, v1.s[0]     \\n\"\n                        \"fmla   v31.4s, %29.4s, v1.s[1]     \\n\"\n\n                        \"shrn   v24.4h, v24.4s, #16         \\n\"\n                        \"shrn   v25.4h, v25.4s, #16         \\n\"\n                        \"shrn   v26.4h, v26.4s, #16         \\n\"\n                        \"shrn   v27.4h, v27.4s, #16         \\n\"\n                        \"shrn   v28.4h, v28.4s, #16         \\n\"\n                        \"shrn   v29.4h, v29.4s, #16         \\n\"\n                        \"shrn   v30.4h, v30.4s, #16         \\n\"\n                        \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                        \"st1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%0], #32 \\n\"\n                        \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%1], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr1_bf16), // %1\n                        \"=r\"(outptr0),      // %2\n                        \"=r\"(r0),           // %3\n                        \"=r\"(r1),           // %4\n                        \"=r\"(r2)            // %5\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr1_bf16),\n                        \"2\"(outptr0),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00_0), // %12\n                        \"w\"(_k01_0), // %13\n                        \"w\"(_k02_0), // %14\n                        \"w\"(_k10_0), // %15\n                        \"w\"(_k11_0), // %16\n                        \"w\"(_k12_0), // %17\n                        \"w\"(_k20_0), // %18\n                        \"w\"(_k21_0), // %19\n                        \"w\"(_k22_0), // %20\n                        \"w\"(_k00_1), // %21\n                        \"w\"(_k01_1), // 
%22\n                        \"w\"(_k02_1), // %23\n                        \"w\"(_k10_1), // %24\n                        \"w\"(_k11_1), // %25\n                        \"w\"(_k12_1), // %26\n                        \"w\"(_k20_1), // %27\n                        \"w\"(_k21_1), // %28\n                        \"w\"(_k22_1)  // %29\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3]               \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%2], #64 \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %21.4s, v0.s[0]     \\n\"\n                        \"fmla   v27.4s, %21.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v1.4h}, [%4]               \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %22.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %22.4s, v0.s[2]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %23.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %23.4s, v0.s[3]     \\n\"\n\n           
             \"fmla   v24.4s, %15.4s, v1.s[0]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v1.s[1]     \\n\"\n                        \"fmla   v26.4s, %24.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %24.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%5]               \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v1.s[1]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v1.s[2]     \\n\"\n                        \"fmla   v26.4s, %25.4s, v1.s[1]     \\n\"\n                        \"fmla   v27.4s, %25.4s, v1.s[2]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %17.4s, v1.s[2]     \\n\"\n                        \"fmla   v25.4s, %17.4s, v1.s[3]     \\n\"\n                        \"fmla   v26.4s, %26.4s, v1.s[2]     \\n\"\n                        \"fmla   v27.4s, %26.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %18.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %18.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %27.4s, v0.s[0]     \\n\"\n                        \"fmla   v27.4s, %27.4s, v0.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %19.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %19.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %28.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %28.4s, v0.s[2]     \\n\"\n\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"fmla   v24.4s, %20.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %20.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %29.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %29.4s, v0.s[3]     \\n\"\n\n                        \"add    %4, %4, #4                  \\n\"\n\n                   
     \"shrn   v24.4h, v24.4s, #16         \\n\"\n                        \"shrn   v25.4h, v25.4s, #16         \\n\"\n                        \"shrn   v26.4h, v26.4s, #16         \\n\"\n                        \"shrn   v27.4h, v27.4s, #16         \\n\"\n\n                        \"add    %5, %5, #4                  \\n\"\n\n                        \"st1    {v24.4h, v25.4h}, [%0], #16 \\n\"\n                        \"st1    {v26.4h, v27.4h}, [%1], #16 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr1_bf16), // %1\n                        \"=r\"(outptr0),      // %2\n                        \"=r\"(r0),           // %3\n                        \"=r\"(r1),           // %4\n                        \"=r\"(r2)            // %5\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr1_bf16),\n                        \"2\"(outptr0),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00_0), // %12\n                        \"w\"(_k01_0), // %13\n                        \"w\"(_k02_0), // %14\n                        \"w\"(_k10_0), // %15\n                        \"w\"(_k11_0), // %16\n                        \"w\"(_k12_0), // %17\n                        \"w\"(_k20_0), // %18\n                        \"w\"(_k21_0), // %19\n                        \"w\"(_k22_0), // %20\n                        \"w\"(_k00_1), // %21\n                        \"w\"(_k01_1), // %22\n                        \"w\"(_k02_1), // %23\n                        \"w\"(_k10_1), // %24\n                        \"w\"(_k11_1), // %25\n                        \"w\"(_k12_1), // %26\n                        \"w\"(_k20_1), // %27\n                        \"w\"(_k21_1), // %28\n                        \"w\"(_k22_1)  // %29\n                        : \"memory\", \"v0\", \"v1\", \"v24\", \"v25\", \"v26\", \"v27\");\n                }\n                
for (; j < outw; j++)\n                {\n                    float32x4_t _sum00 = vld1q_f32(outptr0);\n                    float32x4_t _sum10 = vld1q_f32(outptr0 + 4);\n\n                    float32x4_t _r0 = bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k00_0, _r0, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k01_0, _r0, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k02_0, _r0, 2);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k10_0, _r1, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k11_0, _r1, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k12_0, _r1, 2);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k20_0, _r2, 0);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k21_0, _r2, 1);\n                    _sum00 = vfmaq_laneq_f32(_sum00, _k22_0, _r2, 2);\n\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k00_1, _r0, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k01_1, _r0, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k02_1, _r0, 2);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k10_1, _r1, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k11_1, _r1, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k12_1, _r1, 2);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k20_1, _r2, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k21_1, _r2, 1);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _k22_1, _r2, 2);\n\n                    vst1_u16(outptr0_bf16, float2bfloat(_sum00));\n                    vst1_u16(outptr1_bf16, float2bfloat(_sum10));\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                    outptr0 += 8;\n                    outptr0_bf16 += 4;\n                    outptr1_bf16 += 4;\n  
              }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n            k1 += 9 * 4;\n        }\n    }\n#endif // __ARM_NEON && __aarch64__\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        const unsigned short* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<unsigned short>(0);\n            const unsigned short* r1 = img0.row<unsigned short>(1);\n            const unsigned short* r2 = img0.row<unsigned short>(2);\n\n            float32x4_t _k00 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10 = bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11 = bfloat2float(vld1_u16(k0 + 16));\n            float32x4_t _k12 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22 = bfloat2float(vld1_u16(k0 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n#if __aarch64__\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n\n                        //                         \"prfm   pldl1keep, 
[%0, #512]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0] \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%1], #16   \\n\"\n                        \"ld1    {v2.s}[0], [%1]             \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmla   v25.4s, %8.4s, v0.s[1]      \\n\"\n                        \"fmla   v26.4s, %8.4s, v0.s[2]      \\n\"\n                        \"fmla   v27.4s, %8.4s, v0.s[3]      \\n\"\n                        \"fmla   v28.4s, %8.4s, v1.s[0]      \\n\"\n                        \"fmla   v29.4s, %8.4s, v1.s[1]      \\n\"\n                        \"fmla   v30.4s, %8.4s, v1.s[2]      \\n\"\n                        \"fmla   v31.4s, %8.4s, v1.s[3]      \\n\"\n\n                        \"fmla   v24.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v25.4s, %9.4s, v0.s[2]      \\n\"\n                        \"fmla   v26.4s, %9.4s, v0.s[3]      \\n\"\n                        \"fmla   v27.4s, %9.4s, v1.s[0]      \\n\"\n                        \"fmla   v28.4s, %9.4s, v1.s[1]      \\n\"\n                        \"fmla   v29.4s, %9.4s, v1.s[2]      \\n\"\n                        \"fmla   v30.4s, %9.4s, v1.s[3]      \\n\"\n                        \"fmla   v31.4s, %9.4s, v2.s[0]      \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %10.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %10.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %10.4s, 
v1.s[3]     \\n\"\n                        \"fmla   v30.4s, %10.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %10.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%2], #16   \\n\"\n                        \"ld1    {v2.s}[0], [%2]             \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v4.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v4.s[1]     \\n\"\n                        \"fmla   v26.4s, %11.4s, v4.s[2]     \\n\"\n                        \"fmla   v27.4s, %11.4s, v4.s[3]     \\n\"\n                        \"fmla   v28.4s, %11.4s, v5.s[0]     \\n\"\n                        \"fmla   v29.4s, %11.4s, v5.s[1]     \\n\"\n                        \"fmla   v30.4s, %11.4s, v5.s[2]     \\n\"\n                        \"fmla   v31.4s, %11.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v4.s[1]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v4.s[2]     \\n\"\n                        \"fmla   v26.4s, %12.4s, v4.s[3]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v5.s[0]     \\n\"\n                        \"fmla   v28.4s, %12.4s, v5.s[1]     \\n\"\n                        \"fmla   v29.4s, %12.4s, v5.s[2]     \\n\"\n                        \"fmla   v30.4s, %12.4s, v5.s[3]     \\n\"\n                        \"fmla   v31.4s, %12.4s, v2.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v4.s[2]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v4.s[3]     \\n\"\n                        \"fmla   v26.4s, %13.4s, v5.s[0]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v5.s[1]     \\n\"\n                        \"fmla   v28.4s, %13.4s, v5.s[2]     
\\n\"\n                        \"fmla   v29.4s, %13.4s, v5.s[3]     \\n\"\n                        \"fmla   v30.4s, %13.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %13.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\"\n                        \"ld1    {v2.s}[0], [%3]             \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %14.4s, v1.s[0]     \\n\"\n                        \"fmla   v29.4s, %14.4s, v1.s[1]     \\n\"\n                        \"fmla   v30.4s, %14.4s, v1.s[2]     \\n\"\n                        \"fmla   v31.4s, %14.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %15.4s, v1.s[1]     \\n\"\n                        \"fmla   v29.4s, %15.4s, v1.s[2]     \\n\"\n                        \"fmla   v30.4s, %15.4s, v1.s[3]     \\n\"\n                        \"fmla   v31.4s, %15.4s, v2.s[0]     \\n\"\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v1.s[0]     \\n\"\n         
               \"fmla   v27.4s, %16.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %16.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %16.4s, v1.s[3]     \\n\"\n                        \"fmla   v30.4s, %16.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %16.4s, v2.s[1]     \\n\"\n\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v4\", \"v5\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n#endif // __aarch64__\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1], #8           \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0] \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"ld1    {v1.s}[0], [%1]             \\n\"\n\n           
             \"fmla   v24.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmla   v25.4s, %8.4s, v0.s[1]      \\n\"\n\n                        \"fmla   v26.4s, %8.4s, v0.s[2]      \\n\"\n                        \"fmla   v27.4s, %8.4s, v0.s[3]      \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v25.4s, %9.4s, v0.s[2]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%2], #8           \\n\"\n\n                        \"fmla   v26.4s, %9.4s, v0.s[3]      \\n\"\n                        \"fmla   v27.4s, %9.4s, v1.s[0]      \\n\"\n\n                        \"ld1    {v3.s}[0], [%2]             \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[3]     \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v26.4s, %10.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v1.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v2.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v2.s[1]     \\n\"\n\n                        \"fmla   v26.4s, %11.4s, v2.s[2]     \\n\"\n                        \"fmla   v27.4s, %11.4s, v2.s[3]     \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v2.s[1]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\"\n\n                        \"fmla   v26.4s, %12.4s, v2.s[3]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v3.s[0]     \\n\"\n\n                        \"ld1    {v1.s}[0], [%3]             \\n\"\n\n     
                   \"fmla   v24.4s, %13.4s, v2.s[2]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v2.s[3]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v26.4s, %13.4s, v3.s[0]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v3.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[1]     \\n\"\n\n                        \"fmla   v26.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v0.s[3]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v1.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v1.s[1]     \\n\"\n\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                     
   \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\"\n\n                        \"pld        [%1, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%1]!         \\n\"\n                        \"vld1.u32   {d2[0]}, [%1]       \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n                        \"vshll.u16  q1, d2, #16         \\n\"\n\n                        \"vmla.f32   q12, %q8, d0[0]     \\n\"\n                        \"vmla.f32   q13, %q8, d0[1]     \\n\"\n                        \"vmla.f32   q14, %q8, d1[0]     \\n\"\n                        \"vmla.f32   q15, %q8, d1[1]     \\n\"\n\n                        \"vmla.f32   q12, %q9, d0[1]     \\n\"\n                        \"vmla.f32   q13, %q9, d1[0]     \\n\"\n                        \"vmla.f32   q14, %q9, d1[1]     \\n\"\n                        \"vmla.f32   q15, %q9, d2[0]     \\n\"\n\n                        \"vmla.f32   q12, %q10, d1[0]    \\n\"\n                        \"vmla.f32   q13, %q10, d1[1]    \\n\"\n                        \"vmla.f32   q14, %q10, d2[0]    \\n\"\n                        \"vmla.f32   q15, %q10, d2[1]    \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%2]!         
\\n\"\n                        \"vld1.u32   {d3[0]}, [%2]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, %q11, d4[0]    \\n\"\n                        \"vmla.f32   q13, %q11, d4[1]    \\n\"\n                        \"vmla.f32   q14, %q11, d5[0]    \\n\"\n                        \"vmla.f32   q15, %q11, d5[1]    \\n\"\n\n                        \"vmla.f32   q12, %q12, d4[1]    \\n\"\n                        \"vmla.f32   q13, %q12, d5[0]    \\n\"\n                        \"vmla.f32   q14, %q12, d5[1]    \\n\"\n                        \"vmla.f32   q15, %q12, d2[0]    \\n\"\n\n                        \"vmla.f32   q12, %q13, d5[0]    \\n\"\n                        \"vmla.f32   q13, %q13, d5[1]    \\n\"\n                        \"vmla.f32   q14, %q13, d2[0]    \\n\"\n                        \"vmla.f32   q15, %q13, d2[1]    \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%3]!         
\\n\"\n                        \"vld1.u32   {d2[0]}, [%3]       \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n                        \"vshll.u16  q1, d2, #16         \\n\"\n\n                        \"vmla.f32   q12, %q14, d0[0]    \\n\"\n                        \"vmla.f32   q13, %q14, d0[1]    \\n\"\n                        \"vmla.f32   q14, %q14, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q14, d1[1]    \\n\"\n\n                        \"vmla.f32   q12, %q15, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q15, d1[0]    \\n\"\n                        \"vmla.f32   q14, %q15, d1[1]    \\n\"\n                        \"vmla.f32   q15, %q15, d2[0]    \\n\"\n\n                        \"vmla.f32   q12, %q16, d1[0]    \\n\"\n                        \"vmla.f32   q13, %q16, d1[1]    \\n\"\n                        \"vmla.f32   q14, %q16, d2[0]    \\n\"\n                        \"vmla.f32   q15, %q16, d2[1]    \\n\"\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                
{\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1]               \\n\"\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v28.4s, v29.4s}, [%0]      \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmul   v24.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmul   v25.4s, %8.4s, v0.s[1]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v1.4h}, [%2]               \\n\"\n\n                        \"fmul   v26.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmul   v27.4s, %9.4s, v0.s[2]      \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v28.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %10.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v1.s[0]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3]               \\n\"\n\n                        \"fmla   v26.4s, %12.4s, v1.s[1]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v1.s[2]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v28.4s, %13.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %13.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v0.s[2]     \\n\"\n\n                        \"fmla   v28.4s, %16.4s, v0.s[2]     \\n\"\n             
           \"fmla   v29.4s, %16.4s, v0.s[3]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n\n                        \"fadd   v24.4s, v24.4s, v26.4s      \\n\"\n                        \"fadd   v25.4s, v25.4s, v27.4s      \\n\"\n\n                        \"add    %2, %2, #4                  \\n\"\n\n                        \"fadd   v28.4s, v28.4s, v24.4s      \\n\"\n                        \"fadd   v29.4s, v29.4s, v25.4s      \\n\"\n\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"st1    {v28.4s, v29.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%1]          \\n\"\n\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%0 :128] \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmul.f32   q14, %q8, d0[0]     \\n\"\n                        \"vmul.f32   q15, %q8, d0[1]     \\n\"\n                
        \"vmla.f32   q12, %q9, d0[1]     \\n\"\n                        \"vmla.f32   q13, %q9, d1[0]     \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d3}, [%2]          \\n\"\n\n                        \"vmla.f32   q14, %q10, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q10, d1[1]    \\n\"\n\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, %q11, d2[0]    \\n\"\n                        \"vmla.f32   q13, %q11, d2[1]    \\n\"\n\n                        \"vmla.f32   q14, %q12, d2[1]    \\n\"\n                        \"vmla.f32   q15, %q12, d3[0]    \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%3]          \\n\"\n\n                        \"vmla.f32   q12, %q13, d3[0]    \\n\"\n                        \"vmla.f32   q13, %q13, d3[1]    \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q14, %q14, d0[0]    \\n\"\n                        \"vmla.f32   q15, %q14, d0[1]    \\n\"\n\n                        \"vmla.f32   q12, %q15, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q15, d1[0]    \\n\"\n\n                        \"add        %1, %1, #4          \\n\"\n\n                        \"vmla.f32   q14, %q16, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q16, d1[1]    \\n\"\n\n                        \"add        %2, %2, #4          \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"add        %3, %3, #4          \\n\"\n\n                        \"vst1.f32   {d24-d27}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"q0\", \"q1\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n\n                    float32x4_t _r0 = bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20, _r2, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k21, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22, _r2, 2);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _k00, vget_low_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k01, vget_low_f32(_r0), 
1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k02, vget_high_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k10, vget_low_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k11, vget_low_f32(_r1), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k12, vget_high_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k20, vget_low_f32(_r2), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k21, vget_low_f32(_r2), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k22, vget_high_f32(_r2), 0);\n#endif\n\n                    vst1q_f32(outptr0, _sum0);\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                    outptr0 += 4;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n\n            const float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<unsigned short>(0);\n            const unsigned short* r1 = img0.row<unsigned short>(1);\n            const unsigned short* r2 = img0.row<unsigned short>(2);\n\n            float32x4_t _k00 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10 = bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11 = bfloat2float(vld1_u16(k0 + 16));\n            float32x4_t _k12 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22 = bfloat2float(vld1_u16(k0 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n             
   int j = 0;\n\n#if __aarch64__\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%1], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%1], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\"\n                        \"ld1    {v2.s}[0], [%2]             \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %10.4s, v1.s[0]     \\n\"\n                        \"fmla   v29.4s, %10.4s, v1.s[1]     \\n\"\n                        \"fmla   v30.4s, %10.4s, v1.s[2]     \\n\"\n                        \"fmla   v31.4s, %10.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %11.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %11.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %11.4s, v1.s[1]     \\n\"\n                        \"fmla   v29.4s, %11.4s, v1.s[2]     \\n\"\n                        \"fmla   v30.4s, %11.4s, v1.s[3]     \\n\"\n                        \"fmla   v31.4s, %11.4s, v2.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %12.4s, 
v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %12.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %12.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %12.4s, v1.s[3]     \\n\"\n                        \"fmla   v30.4s, %12.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %12.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%3], #16   \\n\"\n                        \"ld1    {v2.s}[0], [%3]             \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v4.s[0]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v4.s[1]     \\n\"\n                        \"fmla   v26.4s, %13.4s, v4.s[2]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v4.s[3]     \\n\"\n                        \"fmla   v28.4s, %13.4s, v5.s[0]     \\n\"\n                        \"fmla   v29.4s, %13.4s, v5.s[1]     \\n\"\n                        \"fmla   v30.4s, %13.4s, v5.s[2]     \\n\"\n                        \"fmla   v31.4s, %13.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v4.s[1]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v4.s[2]     \\n\"\n                        \"fmla   v26.4s, %14.4s, v4.s[3]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v5.s[0]     \\n\"\n                        \"fmla   v28.4s, %14.4s, v5.s[1]     \\n\"\n                        \"fmla   v29.4s, %14.4s, v5.s[2]     \\n\"\n                        \"fmla   v30.4s, %14.4s, v5.s[3]     \\n\"\n                        \"fmla   v31.4s, %14.4s, v2.s[0]     
\\n\"\n\n                        \"fmla   v24.4s, %15.4s, v4.s[2]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v4.s[3]     \\n\"\n                        \"fmla   v26.4s, %15.4s, v5.s[0]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v5.s[1]     \\n\"\n                        \"fmla   v28.4s, %15.4s, v5.s[2]     \\n\"\n                        \"fmla   v29.4s, %15.4s, v5.s[3]     \\n\"\n                        \"fmla   v30.4s, %15.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %15.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%4], #16   \\n\"\n                        \"ld1    {v2.s}[0], [%4]             \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v0.s[3]     \\n\"\n                        \"fmla   v28.4s, %16.4s, v1.s[0]     \\n\"\n                        \"fmla   v29.4s, %16.4s, v1.s[1]     \\n\"\n                        \"fmla   v30.4s, %16.4s, v1.s[2]     \\n\"\n                        \"fmla   v31.4s, %16.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %17.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %17.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %17.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %17.4s, v1.s[0]     \\n\"\n                        \"fmla   v28.4s, %17.4s, v1.s[1]     \\n\"\n                        \"fmla   v29.4s, %17.4s, v1.s[2]     \\n\"\n                        \"fmla   v30.4s, %17.4s, v1.s[3]     \\n\"\n           
             \"fmla   v31.4s, %17.4s, v2.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %18.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %18.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %18.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %18.4s, v1.s[1]     \\n\"\n                        \"fmla   v28.4s, %18.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %18.4s, v1.s[3]     \\n\"\n                        \"fmla   v30.4s, %18.4s, v2.s[0]     \\n\"\n                        \"fmla   v31.4s, %18.4s, v2.s[1]     \\n\"\n\n                        \"shrn   v24.4h, v24.4s, #16         \\n\"\n                        \"shrn   v25.4h, v25.4s, #16         \\n\"\n                        \"shrn   v26.4h, v26.4s, #16         \\n\"\n                        \"shrn   v27.4h, v27.4s, #16         \\n\"\n                        \"shrn   v28.4h, v28.4s, #16         \\n\"\n                        \"shrn   v29.4h, v29.4s, #16         \\n\"\n                        \"shrn   v30.4h, v30.4s, #16         \\n\"\n                        \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                        \"st1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%0], #32 \\n\"\n                        \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        
\"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v4\", \"v5\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n#endif // __aarch64__\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%2], #8           \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%1], #64 \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"ld1    {v1.s}[0], [%2]             \\n\"\n\n                        \"fmla   v24.4s, %10.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %10.4s, v0.s[1]     \\n\"\n\n                        \"fmla   v26.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %10.4s, v0.s[3]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %11.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %11.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%3], #8           \\n\"\n\n                        \"fmla   v26.4s, %11.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %11.4s, v1.s[0]     \\n\"\n\n                        \"ld1    {v3.s}[0], [%3]             \\n\"\n\n                        \"fmla   v24.4s, %12.4s, v0.s[2]     \\n\"\n                        \"fmla   v25.4s, %12.4s, v0.s[3]     \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                  
      \"fmla   v26.4s, %12.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %12.4s, v1.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v2.s[0]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v2.s[1]     \\n\"\n\n                        \"fmla   v26.4s, %13.4s, v2.s[2]     \\n\"\n                        \"fmla   v27.4s, %13.4s, v2.s[3]     \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %14.4s, v2.s[1]     \\n\"\n                        \"fmla   v25.4s, %14.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%4], #8           \\n\"\n\n                        \"fmla   v26.4s, %14.4s, v2.s[3]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v3.s[0]     \\n\"\n\n                        \"ld1    {v1.s}[0], [%4]             \\n\"\n\n                        \"fmla   v24.4s, %15.4s, v2.s[2]     \\n\"\n                        \"fmla   v25.4s, %15.4s, v2.s[3]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v26.4s, %15.4s, v3.s[0]     \\n\"\n                        \"fmla   v27.4s, %15.4s, v3.s[1]     \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[1]     \\n\"\n\n                        \"fmla   v26.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v27.4s, %16.4s, v0.s[3]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v24.4s, %17.4s, v0.s[1]     \\n\"\n                        \"fmla   v25.4s, %17.4s, v0.s[2]     \\n\"\n                        \"fmla   v26.4s, %17.4s, v0.s[3]     \\n\"\n                        \"fmla   v27.4s, %17.4s, v1.s[0]     \\n\"\n\n                        \"fmla   v24.4s, %18.4s, v0.s[2]     \\n\"\n                
        \"fmla   v25.4s, %18.4s, v0.s[3]     \\n\"\n                        \"fmla   v26.4s, %18.4s, v1.s[0]     \\n\"\n                        \"fmla   v27.4s, %18.4s, v1.s[1]     \\n\"\n\n                        \"shrn   v24.4h, v24.4s, #16         \\n\"\n                        \"shrn   v25.4h, v25.4s, #16         \\n\"\n                        \"shrn   v26.4h, v26.4s, #16         \\n\"\n                        \"shrn   v27.4h, v27.4s, #16         \\n\"\n\n                        \"st1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d24-d31}      \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%2]!         
\\n\"\n                        \"vld1.u32   {d2[0]}, [%2]       \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n                        \"vshll.u16  q1, d2, #16         \\n\"\n\n                        \"vmla.f32   q12, %q10, d0[0]    \\n\"\n                        \"vmla.f32   q13, %q10, d0[1]    \\n\"\n                        \"vmla.f32   q14, %q10, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q10, d1[1]    \\n\"\n\n                        \"vmla.f32   q12, %q11, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q11, d1[0]    \\n\"\n                        \"vmla.f32   q14, %q11, d1[1]    \\n\"\n                        \"vmla.f32   q15, %q11, d2[0]    \\n\"\n\n                        \"vmla.f32   q12, %q12, d1[0]    \\n\"\n                        \"vmla.f32   q13, %q12, d1[1]    \\n\"\n                        \"vmla.f32   q14, %q12, d2[0]    \\n\"\n                        \"vmla.f32   q15, %q12, d2[1]    \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%3]!         
\\n\"\n                        \"vld1.u32   {d3[0]}, [%3]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, %q13, d4[0]    \\n\"\n                        \"vmla.f32   q13, %q13, d4[1]    \\n\"\n                        \"vmla.f32   q14, %q13, d5[0]    \\n\"\n                        \"vmla.f32   q15, %q13, d5[1]    \\n\"\n\n                        \"vmla.f32   q12, %q14, d4[1]    \\n\"\n                        \"vmla.f32   q13, %q14, d5[0]    \\n\"\n                        \"vmla.f32   q14, %q14, d5[1]    \\n\"\n                        \"vmla.f32   q15, %q14, d2[0]    \\n\"\n\n                        \"vmla.f32   q12, %q15, d5[0]    \\n\"\n                        \"vmla.f32   q13, %q15, d5[1]    \\n\"\n                        \"vmla.f32   q14, %q15, d2[0]    \\n\"\n                        \"vmla.f32   q15, %q15, d2[1]    \\n\"\n\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%4]!         
\\n\"\n                        \"vld1.u32   {d2[0]}, [%4]       \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n                        \"vshll.u16  q1, d2, #16         \\n\"\n\n                        \"vmla.f32   q12, %q16, d0[0]    \\n\"\n                        \"vmla.f32   q13, %q16, d0[1]    \\n\"\n                        \"vmla.f32   q14, %q16, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q16, d1[1]    \\n\"\n\n                        \"vmla.f32   q12, %q17, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q17, d1[0]    \\n\"\n                        \"vmla.f32   q14, %q17, d1[1]    \\n\"\n                        \"vmla.f32   q15, %q17, d2[0]    \\n\"\n\n                        \"vmla.f32   q12, %q18, d1[0]    \\n\"\n                        \"vmla.f32   q13, %q18, d1[1]    \\n\"\n                        \"vmla.f32   q14, %q18, d2[0]    \\n\"\n                        \"vmla.f32   q15, %q18, d2[1]    \\n\"\n\n                        \"vshrn.s32  d24, q12, #16       \\n\"\n                        \"vshrn.s32  d25, q13, #16       \\n\"\n                        \"vshrn.s32  d26, q14, #16       \\n\"\n                        \"vshrn.s32  d27, q15, #16       \\n\"\n\n                        \"vst1.u16   {d24-d27}, [%0 :64]! 
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%2]               \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v28.4s, v29.4s}, [%1], #32 \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmul   v24.4s, %10.4s, v0.s[0]     \\n\"\n                        \"fmul   v25.4s, %10.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v1.4h}, [%3]               \\n\"\n\n                        \"fmul   v26.4s, %11.4s, v0.s[1]     \\n\"\n                        \"fmul   v27.4s, %11.4s, v0.s[2]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   
v28.4s, %12.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %12.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %13.4s, v1.s[0]     \\n\"\n                        \"fmla   v25.4s, %13.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%4]               \\n\"\n\n                        \"fmla   v26.4s, %14.4s, v1.s[1]     \\n\"\n                        \"fmla   v27.4s, %14.4s, v1.s[2]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v28.4s, %15.4s, v1.s[2]     \\n\"\n                        \"fmla   v29.4s, %15.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v24.4s, %16.4s, v0.s[0]     \\n\"\n                        \"fmla   v25.4s, %16.4s, v0.s[1]     \\n\"\n                        \"fmla   v26.4s, %17.4s, v0.s[1]     \\n\"\n                        \"fmla   v27.4s, %17.4s, v0.s[2]     \\n\"\n\n                        \"fmla   v28.4s, %18.4s, v0.s[2]     \\n\"\n                        \"fmla   v29.4s, %18.4s, v0.s[3]     \\n\"\n\n                        \"add    %2, %2, #4                  \\n\"\n\n                        \"fadd   v24.4s, v24.4s, v26.4s      \\n\"\n                        \"fadd   v25.4s, v25.4s, v27.4s      \\n\"\n\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"fadd   v28.4s, v28.4s, v24.4s      \\n\"\n                        \"fadd   v29.4s, v29.4s, v25.4s      \\n\"\n\n                        \"add    %4, %4, #4                  \\n\"\n\n                        \"shrn   v28.4h, v28.4s, #16         \\n\"\n                        \"shrn   v29.4h, v29.4s, #16         \\n\"\n\n                        \"st1    {v28.4h, v29.4h}, [%0], #16 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n      
                  \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"memory\", \"v0\", \"v1\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%2]          \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%1 :128]! 
\\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmul.f32   q14, %q10, d0[0]    \\n\"\n                        \"vmul.f32   q15, %q10, d0[1]    \\n\"\n                        \"vmla.f32   q12, %q11, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q11, d1[0]    \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d3}, [%3]          \\n\"\n\n                        \"vmla.f32   q14, %q12, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q12, d1[1]    \\n\"\n\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, %q13, d2[0]    \\n\"\n                        \"vmla.f32   q13, %q13, d2[1]    \\n\"\n\n                        \"vmla.f32   q14, %q14, d2[1]    \\n\"\n                        \"vmla.f32   q15, %q14, d3[0]    \\n\"\n\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%4]          \\n\"\n\n                        \"vmla.f32   q12, %q15, d3[0]    \\n\"\n                        \"vmla.f32   q13, %q15, d3[1]    \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q14, %q16, d0[0]    \\n\"\n                        \"vmla.f32   q15, %q16, d0[1]    \\n\"\n\n                        \"vmla.f32   q12, %q17, d0[1]    \\n\"\n                        \"vmla.f32   q13, %q17, d1[0]    \\n\"\n\n                        \"add        %2, %2, #4          \\n\"\n\n                        \"vmla.f32   q14, %q18, d1[0]    \\n\"\n                        \"vmla.f32   q15, %q18, d1[1]    \\n\"\n\n                        \"add        %3, %3, #4          \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"add        %4, %4, #4          \\n\"\n\n                        \"vshrn.s32  
d24, q12, #16       \\n\"\n                        \"vshrn.s32  d25, q13, #16       \\n\"\n\n                        \"vst1.f32   {d24-d25}, [%0 :64]! \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"memory\", \"q0\", \"q1\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n\n                    float32x4_t _r0 = bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20, _r2, 0);\n                    _sum0 
= vfmaq_laneq_f32(_sum0, _k21, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22, _r2, 2);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _k00, vget_low_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k01, vget_low_f32(_r0), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k02, vget_high_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k10, vget_low_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k11, vget_low_f32(_r1), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k12, vget_high_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k20, vget_low_f32(_r2), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k21, vget_low_f32(_r2), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k22, vget_high_f32(_r2), 0);\n#endif\n\n                    vst1_u16(outptr0_bf16, float2bfloat(_sum0));\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                    outptr0 += 4;\n                    outptr0_bf16 += 4;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n\nstatic void conv3x3s2_pack1to4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n#if __ARM_NEON && __aarch64__\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4 * 2, 4 * 2, opt.workspace_allocator);\n#else\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4, 4, opt.workspace_allocator);\n#endif\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* bias = _bias;\n\n    int remain_outch_start = 0;\n\n#if __ARM_NEON && __aarch64__\n    int nn_outch = 0;\n    nn_outch = outch >> 1;\n    
remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = bias ? vld1q_f32((const float*)bias + (p + 1) * 4) : vdupq_n_f32(0.f);\n        {\n            float* ptr = (float*)out0;\n\n            for (int i = 0; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    vst1q_f32(ptr, _bias0);\n                    vst1q_f32(ptr + 4, _bias0);\n                    vst1q_f32(ptr + 8, _bias0);\n                    vst1q_f32(ptr + 12, _bias0);\n                    vst1q_f32(ptr + 16, _bias1);\n                    vst1q_f32(ptr + 20, _bias1);\n                    vst1q_f32(ptr + 24, _bias1);\n                    vst1q_f32(ptr + 28, _bias1);\n                    ptr += 32;\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    vst1q_f32(ptr, _bias0);\n                    vst1q_f32(ptr + 4, _bias0);\n                    vst1q_f32(ptr + 8, _bias1);\n                    vst1q_f32(ptr + 12, _bias1);\n                    ptr += 16;\n                }\n                for (; j < outw; j++)\n                {\n                    vst1q_f32(ptr, _bias0);\n                    vst1q_f32(ptr + 4, _bias1);\n                    ptr += 8;\n                }\n            }\n        }\n\n        const unsigned short* k0 = kernel.channel(p);\n        const unsigned short* k1 = kernel.channel(p + 1);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const 
unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            float32x4_t _k00_0 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01_0 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02_0 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10_0 = bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11_0 = bfloat2float(vld1_u16(k0 + 16));\n            float32x4_t _k12_0 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20_0 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21_0 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22_0 = bfloat2float(vld1_u16(k0 + 32));\n\n            float32x4_t _k00_1 = bfloat2float(vld1_u16(k1));\n            float32x4_t _k01_1 = bfloat2float(vld1_u16(k1 + 4));\n            float32x4_t _k02_1 = bfloat2float(vld1_u16(k1 + 8));\n            float32x4_t _k10_1 = bfloat2float(vld1_u16(k1 + 12));\n            float32x4_t _k11_1 = bfloat2float(vld1_u16(k1 + 16));\n            float32x4_t _k12_1 = bfloat2float(vld1_u16(k1 + 20));\n            float32x4_t _k20_1 = bfloat2float(vld1_u16(k1 + 24));\n            float32x4_t _k21_1 = bfloat2float(vld1_u16(k1 + 28));\n            float32x4_t _k22_1 = bfloat2float(vld1_u16(k1 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%1], #16   \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%0], #64 \\n\" // sum0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        //                         \"prfm   
pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%0] \\n\" // sum1\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %8.4s, v0.s[0]       \\n\"\n                        \"fmla   v7.4s, %8.4s, v0.s[2]       \\n\"\n                        \"fmla   v8.4s, %8.4s, v1.s[0]       \\n\"\n                        \"fmla   v9.4s, %8.4s, v1.s[2]       \\n\"\n                        \"fmla   v10.4s, %17.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %17.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %17.4s, v1.s[0]     \\n\"\n                        \"fmla   v13.4s, %17.4s, v1.s[2]     \\n\"\n\n                        \"ld1    {v4.h}[0], [%1]             \\n\"\n\n                        \"fmla   v6.4s, %9.4s, v0.s[1]       \\n\"\n                        \"fmla   v7.4s, %9.4s, v0.s[3]       \\n\"\n                        \"fmla   v8.4s, %9.4s, v1.s[1]       \\n\"\n                        \"fmla   v9.4s, %9.4s, v1.s[3]       \\n\"\n                        \"fmla   v10.4s, %18.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %18.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %18.4s, v1.s[1]     \\n\"\n                        \"fmla   v13.4s, %18.4s, v1.s[3]     \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v2.4h, v3.4h}, [%2], #16   \\n\"\n\n                        \"fmla   v6.4s, %10.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %10.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %10.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %10.4s, v4.s[0]      \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %19.4s, v0.s[2]     
\\n\"\n                        \"fmla   v11.4s, %19.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %19.4s, v1.s[2]     \\n\"\n                        \"fmla   v13.4s, %19.4s, v4.s[0]     \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %11.4s, v2.s[0]      \\n\"\n                        \"fmla   v7.4s, %11.4s, v2.s[2]      \\n\"\n                        \"fmla   v8.4s, %11.4s, v3.s[0]      \\n\"\n                        \"fmla   v9.4s, %11.4s, v3.s[2]      \\n\"\n                        \"fmla   v10.4s, %20.4s, v2.s[0]     \\n\"\n                        \"fmla   v11.4s, %20.4s, v2.s[2]     \\n\"\n                        \"fmla   v12.4s, %20.4s, v3.s[0]     \\n\"\n                        \"fmla   v13.4s, %20.4s, v3.s[2]     \\n\"\n\n                        \"ld1    {v5.h}[0], [%2]             \\n\"\n\n                        \"fmla   v6.4s, %12.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v2.s[3]      \\n\"\n                        \"fmla   v8.4s, %12.4s, v3.s[1]      \\n\"\n                        \"fmla   v9.4s, %12.4s, v3.s[3]      \\n\"\n\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %21.4s, v2.s[1]     \\n\"\n                        \"fmla   v11.4s, %21.4s, v2.s[3]     \\n\"\n                        \"fmla   v12.4s, %21.4s, v3.s[1]     \\n\"\n                        \"fmla   v13.4s, %21.4s, v3.s[3]     \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\"\n\n                        \"fmla   v6.4s, %13.4s, v2.s[2]      \\n\"\n                        \"fmla   v7.4s, %13.4s, v3.s[0]      \\n\"\n                        \"fmla   v8.4s, %13.4s, v3.s[2]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v5.s[0]      \\n\"\n\n                        \"shll   v0.4s, 
v0.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %22.4s, v2.s[2]     \\n\"\n                        \"fmla   v11.4s, %22.4s, v3.s[0]     \\n\"\n                        \"fmla   v12.4s, %22.4s, v3.s[2]     \\n\"\n                        \"fmla   v13.4s, %22.4s, v5.s[0]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %14.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %14.4s, v1.s[2]      \\n\"\n                        \"fmla   v10.4s, %23.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %23.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %23.4s, v1.s[0]     \\n\"\n                        \"fmla   v13.4s, %23.4s, v1.s[2]     \\n\"\n\n                        \"ld1    {v4.h}[0], [%3]             \\n\"\n\n                        \"fmla   v6.4s, %15.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %15.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %15.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %15.4s, v1.s[3]      \\n\"\n                        \"fmla   v10.4s, %24.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %24.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %24.4s, v1.s[1]     \\n\"\n                        \"fmla   v13.4s, %24.4s, v1.s[3]     \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %16.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %16.4s, v4.s[0]      \\n\"\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"fmla   v10.4s, %25.4s, 
v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %25.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %25.4s, v1.s[2]     \\n\"\n                        \"fmla   v13.4s, %25.4s, v4.s[0]     \\n\"\n\n                        \"st1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%0], #64 \\n\"\n                        \"st1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_0), // %8\n                        \"w\"(_k01_0), // %9\n                        \"w\"(_k02_0), // %10\n                        \"w\"(_k10_0), // %11\n                        \"w\"(_k11_0), // %12\n                        \"w\"(_k12_0), // %13\n                        \"w\"(_k20_0), // %14\n                        \"w\"(_k21_0), // %15\n                        \"w\"(_k22_0), // %16\n                        \"w\"(_k00_1), // %17\n                        \"w\"(_k01_1), // %18\n                        \"w\"(_k02_1), // %19\n                        \"w\"(_k10_1), // %20\n                        \"w\"(_k11_1), // %21\n                        \"w\"(_k12_1), // %22\n                        \"w\"(_k20_1), // %23\n                        \"w\"(_k21_1), // %24\n                        \"w\"(_k22_1)  // %25\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    
{v0.4h}, [%1], #8           \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%0] \\n\" // sum0 sum1\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %8.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, %8.4s, v0.s[2]      \\n\"\n                        \"fmla   v12.4s, %17.4s, v0.s[0]     \\n\"\n                        \"fmla   v13.4s, %17.4s, v0.s[2]     \\n\"\n\n                        \"ld1    {v1.h}[0], [%1]             \\n\"\n\n                        \"fmla   v10.4s, %9.4s, v0.s[1]      \\n\"\n                        \"fmla   v11.4s, %9.4s, v0.s[3]      \\n\"\n                        \"fmla   v12.4s, %18.4s, v0.s[1]     \\n\"\n                        \"fmla   v13.4s, %18.4s, v0.s[3]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%2], #8           \\n\"\n\n                        \"fmla   v10.4s, %10.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %10.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %19.4s, v0.s[2]     \\n\"\n                        \"fmla   v13.4s, %19.4s, v1.s[0]     \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %11.4s, v2.s[0]     \\n\"\n                        \"fmla   v11.4s, %11.4s, v2.s[2]     \\n\"\n                        \"fmla   v12.4s, %20.4s, v2.s[0]     \\n\"\n                        \"fmla   v13.4s, %20.4s, v2.s[2]     \\n\"\n\n                        \"ld1    {v3.h}[0], [%2]             \\n\"\n\n                        \"fmla   v10.4s, %12.4s, v2.s[1]     \\n\"\n                        \"fmla   v11.4s, %12.4s, v2.s[3]     \\n\"\n                        \"fmla   v12.4s, %21.4s, 
v2.s[1]     \\n\"\n                        \"fmla   v13.4s, %21.4s, v2.s[3]     \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\"\n\n                        \"fmla   v10.4s, %13.4s, v2.s[2]     \\n\"\n                        \"fmla   v11.4s, %13.4s, v3.s[0]     \\n\"\n                        \"fmla   v12.4s, %22.4s, v2.s[2]     \\n\"\n                        \"fmla   v13.4s, %22.4s, v3.s[0]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %14.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %23.4s, v0.s[0]     \\n\"\n                        \"fmla   v13.4s, %23.4s, v0.s[2]     \\n\"\n\n                        \"ld1    {v1.h}[0], [%3]             \\n\"\n\n                        \"fmla   v10.4s, %15.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %15.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %24.4s, v0.s[1]     \\n\"\n                        \"fmla   v13.4s, %24.4s, v0.s[3]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %16.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %16.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %25.4s, v0.s[2]     \\n\"\n                        \"fmla   v13.4s, %25.4s, v1.s[0]     \\n\"\n\n                        \"st1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n       
                 \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_0), // %8\n                        \"w\"(_k01_0), // %9\n                        \"w\"(_k02_0), // %10\n                        \"w\"(_k10_0), // %11\n                        \"w\"(_k11_0), // %12\n                        \"w\"(_k12_0), // %13\n                        \"w\"(_k20_0), // %14\n                        \"w\"(_k21_0), // %15\n                        \"w\"(_k22_0), // %16\n                        \"w\"(_k00_1), // %17\n                        \"w\"(_k01_1), // %18\n                        \"w\"(_k02_1), // %19\n                        \"w\"(_k10_1), // %20\n                        \"w\"(_k11_1), // %21\n                        \"w\"(_k12_1), // %22\n                        \"w\"(_k20_1), // %23\n                        \"w\"(_k21_1), // %24\n                        \"w\"(_k22_1)  // %25\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n                    float32x4_t _sum1 = vld1q_f32(outptr0 + 4);\n\n                    float32x4_t _r0 = bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00_0, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01_0, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02_0, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10_0, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11_0, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12_0, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20_0, _r2, 0);\n                    _sum0 = 
vfmaq_laneq_f32(_sum0, _k21_0, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22_0, _r2, 2);\n\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k00_1, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k01_1, _r0, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k02_1, _r0, 2);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k10_1, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k11_1, _r1, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k12_1, _r1, 2);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k20_1, _r2, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k21_1, _r2, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k22_1, _r2, 2);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr0 += 8;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n            k1 += 9 * 4;\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n            unsigned short* outptr1_bf16 = top_blob.channel(p + 1);\n\n            const float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            float32x4_t _k00_0 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01_0 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02_0 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10_0 = bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11_0 = bfloat2float(vld1_u16(k0 + 16));\n     
       float32x4_t _k12_0 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20_0 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21_0 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22_0 = bfloat2float(vld1_u16(k0 + 32));\n\n            float32x4_t _k00_1 = bfloat2float(vld1_u16(k1));\n            float32x4_t _k01_1 = bfloat2float(vld1_u16(k1 + 4));\n            float32x4_t _k02_1 = bfloat2float(vld1_u16(k1 + 8));\n            float32x4_t _k10_1 = bfloat2float(vld1_u16(k1 + 12));\n            float32x4_t _k11_1 = bfloat2float(vld1_u16(k1 + 16));\n            float32x4_t _k12_1 = bfloat2float(vld1_u16(k1 + 20));\n            float32x4_t _k20_1 = bfloat2float(vld1_u16(k1 + 24));\n            float32x4_t _k21_1 = bfloat2float(vld1_u16(k1 + 28));\n            float32x4_t _k22_1 = bfloat2float(vld1_u16(k1 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%2], #64 \\n\" // sum0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%2], #64 \\n\" // sum1\n\n                        \"fmla   v6.4s, %12.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %12.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %12.4s, v1.s[2]      \\n\"\n                        \"fmla  
 v10.4s, %21.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %21.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %21.4s, v1.s[0]     \\n\"\n                        \"fmla   v13.4s, %21.4s, v1.s[2]     \\n\"\n\n                        \"ld1    {v4.h}[0], [%3]             \\n\"\n\n                        \"fmla   v6.4s, %13.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %13.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %13.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v1.s[3]      \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %22.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %22.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %22.4s, v1.s[1]     \\n\"\n                        \"fmla   v13.4s, %22.4s, v1.s[3]     \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v2.4h, v3.4h}, [%4], #16   \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %14.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %14.4s, v4.s[0]      \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %23.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %23.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %23.4s, v1.s[2]     \\n\"\n                        \"fmla   v13.4s, %23.4s, v4.s[0]     \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %15.4s, v2.s[0]      \\n\"\n                        \"fmla   v7.4s, %15.4s, v2.s[2]      \\n\"\n                        \"fmla   v8.4s, %15.4s, v3.s[0]      \\n\"\n         
               \"fmla   v9.4s, %15.4s, v3.s[2]      \\n\"\n                        \"fmla   v10.4s, %24.4s, v2.s[0]     \\n\"\n                        \"fmla   v11.4s, %24.4s, v2.s[2]     \\n\"\n                        \"fmla   v12.4s, %24.4s, v3.s[0]     \\n\"\n                        \"fmla   v13.4s, %24.4s, v3.s[2]     \\n\"\n\n                        \"ld1    {v5.h}[0], [%4]             \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v2.s[3]      \\n\"\n                        \"fmla   v8.4s, %16.4s, v3.s[1]      \\n\"\n                        \"fmla   v9.4s, %16.4s, v3.s[3]      \\n\"\n\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %25.4s, v2.s[1]     \\n\"\n                        \"fmla   v11.4s, %25.4s, v2.s[3]     \\n\"\n                        \"fmla   v12.4s, %25.4s, v3.s[1]     \\n\"\n                        \"fmla   v13.4s, %25.4s, v3.s[3]     \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%5], #16   \\n\"\n\n                        \"fmla   v6.4s, %17.4s, v2.s[2]      \\n\"\n                        \"fmla   v7.4s, %17.4s, v3.s[0]      \\n\"\n                        \"fmla   v8.4s, %17.4s, v3.s[2]      \\n\"\n                        \"fmla   v9.4s, %17.4s, v5.s[0]      \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %26.4s, v2.s[2]     \\n\"\n                        \"fmla   v11.4s, %26.4s, v3.s[0]     \\n\"\n                        \"fmla   v12.4s, %26.4s, v3.s[2]     \\n\"\n                        \"fmla   v13.4s, %26.4s, v5.s[0]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %18.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %18.4s, 
v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %18.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %18.4s, v1.s[2]      \\n\"\n                        \"fmla   v10.4s, %27.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %27.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %27.4s, v1.s[0]     \\n\"\n                        \"fmla   v13.4s, %27.4s, v1.s[2]     \\n\"\n\n                        \"ld1    {v4.h}[0], [%5]             \\n\"\n\n                        \"fmla   v6.4s, %19.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %19.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %19.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %19.4s, v1.s[3]      \\n\"\n                        \"fmla   v10.4s, %28.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %28.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %28.4s, v1.s[1]     \\n\"\n                        \"fmla   v13.4s, %28.4s, v1.s[3]     \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %20.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %20.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %20.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %20.4s, v4.s[0]      \\n\"\n                        \"fmla   v10.4s, %29.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %29.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %29.4s, v1.s[2]     \\n\"\n                        \"fmla   v13.4s, %29.4s, v4.s[0]     \\n\"\n\n                        \"shrn   v6.4h, v6.4s, #16           \\n\"\n                        \"shrn   v7.4h, v7.4s, #16           \\n\"\n                        \"shrn   v8.4h, v8.4s, #16           \\n\"\n                        \"shrn   v9.4h, v9.4s, #16           \\n\"\n                        \"shrn   v10.4h, v10.4s, #16         
\\n\"\n                        \"shrn   v11.4h, v11.4s, #16         \\n\"\n\n                        \"st1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%0], #32 \\n\"\n\n                        \"shrn   v12.4h, v12.4s, #16         \\n\"\n                        \"shrn   v13.4h, v13.4s, #16         \\n\"\n\n                        \"st1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%1], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr1_bf16), // %1\n                        \"=r\"(outptr0),      // %2\n                        \"=r\"(r0),           // %3\n                        \"=r\"(r1),           // %4\n                        \"=r\"(r2)            // %5\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr1_bf16),\n                        \"2\"(outptr0),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00_0), // %12\n                        \"w\"(_k01_0), // %13\n                        \"w\"(_k02_0), // %14\n                        \"w\"(_k10_0), // %15\n                        \"w\"(_k11_0), // %16\n                        \"w\"(_k12_0), // %17\n                        \"w\"(_k20_0), // %18\n                        \"w\"(_k21_0), // %19\n                        \"w\"(_k22_0), // %20\n                        \"w\"(_k00_1), // %21\n                        \"w\"(_k01_1), // %22\n                        \"w\"(_k02_1), // %23\n                        \"w\"(_k10_1), // %24\n                        \"w\"(_k11_1), // %25\n                        \"w\"(_k12_1), // %26\n                        \"w\"(_k20_1), // %27\n                        \"w\"(_k21_1), // %28\n                        \"w\"(_k22_1)  // %29\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n                for (; j 
+ 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%2], #64 \\n\" // sum0 sum1\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %12.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %12.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %21.4s, v0.s[0]     \\n\"\n                        \"fmla   v13.4s, %21.4s, v0.s[2]     \\n\"\n\n                        \"ld1    {v1.h}[0], [%3]             \\n\"\n\n                        \"fmla   v10.4s, %13.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %13.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %22.4s, v0.s[1]     \\n\"\n                        \"fmla   v13.4s, %22.4s, v0.s[3]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%4], #8           \\n\"\n\n                        \"fmla   v10.4s, %14.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %14.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %23.4s, v0.s[2]     \\n\"\n                        \"fmla   v13.4s, %23.4s, v1.s[0]     \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %15.4s, v2.s[0]     \\n\"\n                        \"fmla   v11.4s, %15.4s, v2.s[2]     \\n\"\n                        \"fmla   v12.4s, %24.4s, v2.s[0]     \\n\"\n                        \"fmla   v13.4s, %24.4s, v2.s[2]     \\n\"\n\n                        \"ld1    {v3.h}[0], 
[%4]             \\n\"\n\n                        \"fmla   v10.4s, %16.4s, v2.s[1]     \\n\"\n                        \"fmla   v11.4s, %16.4s, v2.s[3]     \\n\"\n                        \"fmla   v12.4s, %25.4s, v2.s[1]     \\n\"\n                        \"fmla   v13.4s, %25.4s, v2.s[3]     \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%5, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%5], #8           \\n\"\n\n                        \"fmla   v10.4s, %17.4s, v2.s[2]     \\n\"\n                        \"fmla   v11.4s, %17.4s, v3.s[0]     \\n\"\n                        \"fmla   v12.4s, %26.4s, v2.s[2]     \\n\"\n                        \"fmla   v13.4s, %26.4s, v3.s[0]     \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %18.4s, v0.s[0]     \\n\"\n                        \"fmla   v11.4s, %18.4s, v0.s[2]     \\n\"\n                        \"fmla   v12.4s, %27.4s, v0.s[0]     \\n\"\n                        \"fmla   v13.4s, %27.4s, v0.s[2]     \\n\"\n\n                        \"ld1    {v1.h}[0], [%5]             \\n\"\n\n                        \"fmla   v10.4s, %19.4s, v0.s[1]     \\n\"\n                        \"fmla   v11.4s, %19.4s, v0.s[3]     \\n\"\n                        \"fmla   v12.4s, %28.4s, v0.s[1]     \\n\"\n                        \"fmla   v13.4s, %28.4s, v0.s[3]     \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v10.4s, %20.4s, v0.s[2]     \\n\"\n                        \"fmla   v11.4s, %20.4s, v1.s[0]     \\n\"\n                        \"fmla   v12.4s, %29.4s, v0.s[2]     \\n\"\n                        \"fmla   v13.4s, %29.4s, v1.s[0]     \\n\"\n\n                        \"shrn   v10.4h, v10.4s, #16         \\n\"\n                        \"shrn   v11.4h, v11.4s, #16         \\n\"\n                   
     \"shrn   v12.4h, v12.4s, #16         \\n\"\n                        \"shrn   v13.4h, v13.4s, #16         \\n\"\n\n                        \"st1    {v10.4h, v11.4h}, [%0], #16 \\n\"\n                        \"st1    {v12.4h, v13.4h}, [%1], #16 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr1_bf16), // %1\n                        \"=r\"(outptr0),      // %2\n                        \"=r\"(r0),           // %3\n                        \"=r\"(r1),           // %4\n                        \"=r\"(r2)            // %5\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr1_bf16),\n                        \"2\"(outptr0),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"w\"(_k00_0), // %12\n                        \"w\"(_k01_0), // %13\n                        \"w\"(_k02_0), // %14\n                        \"w\"(_k10_0), // %15\n                        \"w\"(_k11_0), // %16\n                        \"w\"(_k12_0), // %17\n                        \"w\"(_k20_0), // %18\n                        \"w\"(_k21_0), // %19\n                        \"w\"(_k22_0), // %20\n                        \"w\"(_k00_1), // %21\n                        \"w\"(_k01_1), // %22\n                        \"w\"(_k02_1), // %23\n                        \"w\"(_k10_1), // %24\n                        \"w\"(_k11_1), // %25\n                        \"w\"(_k12_1), // %26\n                        \"w\"(_k20_1), // %27\n                        \"w\"(_k21_1), // %28\n                        \"w\"(_k22_1)  // %29\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n                    float32x4_t 
_sum1 = vld1q_f32(outptr0 + 4);\n\n                    float32x4_t _r0 = bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00_0, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01_0, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02_0, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10_0, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11_0, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12_0, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20_0, _r2, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k21_0, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22_0, _r2, 2);\n\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k00_1, _r0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k01_1, _r0, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k02_1, _r0, 2);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k10_1, _r1, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k11_1, _r1, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k12_1, _r1, 2);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k20_1, _r2, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k21_1, _r2, 1);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _k22_1, _r2, 2);\n\n                    vst1_u16(outptr0_bf16, float2bfloat(_sum0));\n                    vst1_u16(outptr1_bf16, float2bfloat(_sum1));\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr0 += 8;\n                    outptr0_bf16 += 4;\n                    outptr1_bf16 += 4;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n            
k1 += 9 * 4;\n        }\n    }\n#endif // __ARM_NEON && __aarch64__\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        const unsigned short* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            float32x4_t _k00 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10 = bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11 = bfloat2float(vld1_u16(k0 + 16));\n            float32x4_t _k12 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22 = bfloat2float(vld1_u16(k0 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%1], #16   \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%0] \\n\" // sum0\n\n                        \"shll   v0.4s, v0.4h, 
#16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %8.4s, v0.s[0]       \\n\"\n                        \"fmla   v7.4s, %8.4s, v0.s[2]       \\n\"\n                        \"fmla   v8.4s, %8.4s, v1.s[0]       \\n\"\n                        \"fmla   v9.4s, %8.4s, v1.s[2]       \\n\"\n\n                        \"ld1    {v4.h}[0], [%1]             \\n\"\n\n                        \"fmla   v6.4s, %9.4s, v0.s[1]       \\n\"\n                        \"fmla   v7.4s, %9.4s, v0.s[3]       \\n\"\n                        \"fmla   v8.4s, %9.4s, v1.s[1]       \\n\"\n                        \"fmla   v9.4s, %9.4s, v1.s[3]       \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v2.4h, v3.4h}, [%2], #16   \\n\"\n\n                        \"fmla   v6.4s, %10.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %10.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %10.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %10.4s, v4.s[0]      \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %11.4s, v2.s[0]      \\n\"\n                        \"fmla   v7.4s, %11.4s, v2.s[2]      \\n\"\n                        \"fmla   v8.4s, %11.4s, v3.s[0]      \\n\"\n                        \"fmla   v9.4s, %11.4s, v3.s[2]      \\n\"\n\n                        \"ld1    {v5.h}[0], [%2]             \\n\"\n\n                        \"fmla   v6.4s, %12.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v2.s[3]      \\n\"\n                        \"fmla   v8.4s, %12.4s, v3.s[1]      \\n\"\n                        \"fmla   v9.4s, %12.4s, v3.s[3]      \\n\"\n\n                      
  \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\"\n\n                        \"fmla   v6.4s, %13.4s, v2.s[2]      \\n\"\n                        \"fmla   v7.4s, %13.4s, v3.s[0]      \\n\"\n                        \"fmla   v8.4s, %13.4s, v3.s[2]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v5.s[0]      \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %14.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %14.4s, v1.s[2]      \\n\"\n\n                        \"ld1    {v4.h}[0], [%3]             \\n\"\n\n                        \"fmla   v6.4s, %15.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %15.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %15.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %15.4s, v1.s[3]      \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %16.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %16.4s, v4.s[0]      \\n\"\n\n                        \"st1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                  
      \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n#else  // __aarch64__\n                    asm volatile(\n                        // r0\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d12-d13}, [%1]!    \\n\"\n\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d0-d7}         \\n\" // sum0\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n\n                        \"vld1.u16   {d12[0]}, [%1]      \\n\"\n\n                        \"vmla.f32   q0, %q8, d8[0]      \\n\"\n                        \"vmla.f32   q1, %q8, d9[0]      \\n\"\n\n                        \"vmla.f32   q2, %q8, d10[0]     \\n\"\n                        \"vmla.f32   q3, %q8, d11[0]     \\n\"\n\n                        \"vmla.f32   q0, %q9, d8[1]      \\n\"\n                        \"vmla.f32   q1, %q9, d9[1]      \\n\"\n\n                        \"vshl.u32   d8, d12, #16        \\n\"\n\n                        \"vmla.f32   q2, %q9, d10[1]     \\n\"\n                        \"vmla.f32   q3, %q9, d11[1]     \\n\"\n\n                        // r1\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d12-d13}, [%2]!    
\\n\"\n\n                        \"vmla.f32   q0, %q10, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q10, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q10, d11[0]    \\n\"\n                        \"vmla.f32   q3, %q10, d8[0]     \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n\n                        \"vld1.u16   {d12[0]}, [%2]      \\n\"\n\n                        \"vmla.f32   q0, %q11, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q11, d9[0]     \\n\"\n\n                        \"vmla.f32   q2, %q11, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q11, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q12, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q12, d9[1]     \\n\"\n\n                        \"vshl.u32   d8, d12, #16        \\n\"\n\n                        \"vmla.f32   q2, %q12, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q12, d11[1]    \\n\"\n\n                        // r2\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d12-d13}, [%3]!    
\\n\"\n\n                        \"vmla.f32   q0, %q13, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q13, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q13, d11[0]    \\n\"\n                        \"vmla.f32   q3, %q13, d8[0]     \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n\n                        \"vld1.u16   {d12[0]}, [%3]      \\n\"\n\n                        \"vmla.f32   q0, %q14, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q14, d9[0]     \\n\"\n\n                        \"vmla.f32   q2, %q14, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q14, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q15, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q15, d9[1]     \\n\"\n\n                        \"vshl.u32   d8, d12, #16        \\n\"\n\n                        \"vmla.f32   q2, %q15, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q15, d11[1]    \\n\"\n\n                        \"vmla.f32   q0, %q16, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q16, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q16, d11[0]    \\n\"\n                        \"vmla.f32   q3, %q16, d8[0]     \\n\"\n\n                        \"vstm       %0!, {d0-d7}        \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n              
          \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1], #8           \\n\"\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%0]        \\n\" // sum0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmul   v6.4s, %8.4s, v0.s[0]       \\n\"\n                        \"fmul   v7.4s, %8.4s, v0.s[2]       \\n\"\n\n                        \"ld1    {v1.h}[0], [%1]             \\n\"\n\n                        \"fmla   v8.4s, %9.4s, v0.s[1]       \\n\"\n                        \"fmla   v9.4s, %9.4s, v0.s[3]       \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%2], #8           \\n\"\n\n                        \"fmla   v6.4s, %10.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %10.4s, v1.s[0]      \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v8.4s, %11.4s, v2.s[0]      \\n\"\n                        \"fmla   v9.4s, %11.4s, v2.s[2]      \\n\"\n\n                        \"ld1    {v3.h}[0], [%2]             \\n\"\n\n                        \"fmla   v6.4s, %12.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v2.s[3]      \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                      
  // r2\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\"\n\n                        \"fmla   v8.4s, %13.4s, v2.s[2]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v3.s[0]      \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v0.s[2]      \\n\"\n\n                        \"ld1    {v1.h}[0], [%3]             \\n\"\n\n                        \"fmla   v8.4s, %15.4s, v0.s[1]      \\n\"\n                        \"fmla   v9.4s, %15.4s, v0.s[3]      \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v1.s[0]      \\n\"\n\n                        \"fadd   v8.4s, v8.4s, v6.4s         \\n\"\n                        \"fadd   v9.4s, v9.4s, v7.4s         \\n\"\n\n                        \"st1    {v8.4s, v9.4s}, [%0], #32   \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n#else  
// __aarch64__\n                    asm volatile(\n                        // r0\n                        \"pld        [%1, #64]           \\n\"\n                        \"vld1.u16   {d9}, [%1]!         \\n\"\n\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%0]       \\n\" // sum0\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmul.f32   q0, %q8, d8[0]      \\n\"\n                        \"vmul.f32   q1, %q8, d9[0]      \\n\"\n\n                        \"vld1.u16   {d11[]}, [%1]       \\n\"\n\n                        \"vmla.f32   q2, %q9, d8[1]      \\n\"\n                        \"vmla.f32   q3, %q9, d9[1]      \\n\"\n\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        // r1\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d13}, [%2]!        \\n\"\n\n                        \"vmla.f32   q0, %q10, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q10, d10[0]    \\n\"\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q2, %q11, d12[0]    \\n\"\n                        \"vmla.f32   q3, %q11, d13[0]    \\n\"\n\n                        \"vld1.u16   {d9[]}, [%2]        \\n\"\n\n                        \"vmla.f32   q0, %q12, d12[1]    \\n\"\n                        \"vmla.f32   q1, %q12, d13[1]    \\n\"\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        // r2\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d11}, [%3]!        
\\n\"\n\n                        \"vmla.f32   q2, %q13, d13[0]    \\n\"\n                        \"vmla.f32   q3, %q13, d8[0]     \\n\"\n\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        \"vmla.f32   q0, %q14, d10[0]    \\n\"\n                        \"vmla.f32   q1, %q14, d11[0]    \\n\"\n\n                        \"vld1.u16   {d13[]}, [%3]       \\n\"\n\n                        \"vmla.f32   q2, %q15, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q15, d11[1]    \\n\"\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q0, %q16, d11[0]    \\n\"\n                        \"vmla.f32   q1, %q16, d12[0]    \\n\"\n\n                        \"vadd.f32   q2, q2, q0          \\n\"\n                        \"vadd.f32   q3, q3, q1          \\n\"\n\n                        \"vst1.f32   {d4-d7}, [%0]!      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n\n                    float32x4_t _r0 = 
bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20, _r2, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k21, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22, _r2, 2);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _k00, vget_low_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k01, vget_low_f32(_r0), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k02, vget_high_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k10, vget_low_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k11, vget_low_f32(_r1), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k12, vget_high_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k20, vget_low_f32(_r2), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k21, vget_low_f32(_r2), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k22, vget_high_f32(_r2), 0);\n#endif\n\n                    vst1q_f32(outptr0, _sum0);\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr0 += 4;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n\n            
const float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            float32x4_t _k00 = bfloat2float(vld1_u16(k0));\n            float32x4_t _k01 = bfloat2float(vld1_u16(k0 + 4));\n            float32x4_t _k02 = bfloat2float(vld1_u16(k0 + 8));\n            float32x4_t _k10 = bfloat2float(vld1_u16(k0 + 12));\n            float32x4_t _k11 = bfloat2float(vld1_u16(k0 + 16));\n            float32x4_t _k12 = bfloat2float(vld1_u16(k0 + 20));\n            float32x4_t _k20 = bfloat2float(vld1_u16(k0 + 24));\n            float32x4_t _k21 = bfloat2float(vld1_u16(k0 + 28));\n            float32x4_t _k22 = bfloat2float(vld1_u16(k0 + 32));\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%1], #64 \\n\" // sum0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %10.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %10.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %10.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %10.4s, v1.s[2]      \\n\"\n\n                        \"ld1    {v4.h}[0], [%2]             \\n\"\n\n                        \"fmla   v6.4s, %11.4s, v0.s[1]      \\n\"\n     
                   \"fmla   v7.4s, %11.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %11.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %11.4s, v1.s[3]      \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        // r1\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v2.4h, v3.4h}, [%3], #16   \\n\"\n\n                        \"fmla   v6.4s, %12.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %12.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %12.4s, v4.s[0]      \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %13.4s, v2.s[0]      \\n\"\n                        \"fmla   v7.4s, %13.4s, v2.s[2]      \\n\"\n                        \"fmla   v8.4s, %13.4s, v3.s[0]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v3.s[2]      \\n\"\n\n                        \"ld1    {v5.h}[0], [%3]             \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v2.s[3]      \\n\"\n                        \"fmla   v8.4s, %14.4s, v3.s[1]      \\n\"\n                        \"fmla   v9.4s, %14.4s, v3.s[3]      \\n\"\n\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%4], #16   \\n\"\n\n                        \"fmla   v6.4s, %15.4s, v2.s[2]      \\n\"\n                        \"fmla   v7.4s, %15.4s, v3.s[0]      \\n\"\n                        \"fmla   v8.4s, %15.4s, v3.s[2]      \\n\"\n                        \"fmla   v9.4s, %15.4s, v5.s[0]      \\n\"\n\n                 
       \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v0.s[2]      \\n\"\n                        \"fmla   v8.4s, %16.4s, v1.s[0]      \\n\"\n                        \"fmla   v9.4s, %16.4s, v1.s[2]      \\n\"\n\n                        \"ld1    {v4.h}[0], [%4]             \\n\"\n\n                        \"fmla   v6.4s, %17.4s, v0.s[1]      \\n\"\n                        \"fmla   v7.4s, %17.4s, v0.s[3]      \\n\"\n                        \"fmla   v8.4s, %17.4s, v1.s[1]      \\n\"\n                        \"fmla   v9.4s, %17.4s, v1.s[3]      \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %18.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %18.4s, v1.s[0]      \\n\"\n                        \"fmla   v8.4s, %18.4s, v1.s[2]      \\n\"\n                        \"fmla   v9.4s, %18.4s, v4.s[0]      \\n\"\n\n                        \"shrn   v6.4h, v6.4s, #16           \\n\"\n                        \"shrn   v7.4h, v7.4s, #16           \\n\"\n                        \"shrn   v8.4h, v8.4s, #16           \\n\"\n                        \"shrn   v9.4h, v9.4s, #16           \\n\"\n\n                        \"st1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n       
                 \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n#else  // __aarch64__\n                    asm volatile(\n                        // r0\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d12-d13}, [%2]!    \\n\"\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d0-d7}        \\n\" // sum0\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n\n                        \"vld1.u16   {d12[0]}, [%2]      \\n\"\n\n                        \"vmla.f32   q0, %q10, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q10, d9[0]     \\n\"\n\n                        \"vmla.f32   q2, %q10, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q10, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q11, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q11, d9[1]     \\n\"\n\n                        \"vshl.u32   d8, d12, #16        \\n\"\n\n                        \"vmla.f32   q2, %q11, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q11, d11[1]    \\n\"\n\n                        // r1\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d12-d13}, [%3]!    
\\n\"\n\n                        \"vmla.f32   q0, %q12, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q12, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q12, d11[0]    \\n\"\n                        \"vmla.f32   q3, %q12, d8[0]     \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n\n                        \"vld1.u16   {d12[0]}, [%3]      \\n\"\n\n                        \"vmla.f32   q0, %q13, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q13, d9[0]     \\n\"\n\n                        \"vmla.f32   q2, %q13, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q13, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q14, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q14, d9[1]     \\n\"\n\n                        \"vshl.u32   d8, d12, #16        \\n\"\n\n                        \"vmla.f32   q2, %q14, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q14, d11[1]    \\n\"\n\n                        // r2\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d12-d13}, [%4]!    
\\n\"\n\n                        \"vmla.f32   q0, %q15, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q15, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q15, d11[0]    \\n\"\n                        \"vmla.f32   q3, %q15, d8[0]     \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n\n                        \"vld1.u16   {d12[0]}, [%4]      \\n\"\n\n                        \"vmla.f32   q0, %q16, d8[0]     \\n\"\n                        \"vmla.f32   q1, %q16, d9[0]     \\n\"\n\n                        \"vmla.f32   q2, %q16, d10[0]    \\n\"\n                        \"vmla.f32   q3, %q16, d11[0]    \\n\"\n\n                        \"vmla.f32   q0, %q17, d8[1]     \\n\"\n                        \"vmla.f32   q1, %q17, d9[1]     \\n\"\n\n                        \"vshl.u32   d8, d12, #16        \\n\"\n\n                        \"vmla.f32   q2, %q17, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q17, d11[1]    \\n\"\n\n                        \"vmla.f32   q0, %q18, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q18, d10[0]    \\n\"\n                        \"vmla.f32   q2, %q18, d11[0]    \\n\"\n                        \"vmla.f32   q3, %q18, d8[0]     \\n\"\n\n                        \"vshrn.u32  d0, q0, #16         \\n\"\n                        \"vshrn.u32  d1, q1, #16         \\n\"\n                        \"vshrn.u32  d2, q2, #16         \\n\"\n                        \"vshrn.u32  d3, q3, #16         \\n\"\n\n                        \"vst1.u16   {d0-d3}, [%0 :64]!  
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        // r0\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%2], #8           \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%1], #32   \\n\" // sum0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmul   v6.4s, %10.4s, v0.s[0]      \\n\"\n                        \"fmul   v7.4s, %10.4s, v0.s[2]      \\n\"\n\n                        \"ld1    {v1.h}[0], [%2]             \\n\"\n\n                        \"fmla   v8.4s, %11.4s, v0.s[1]      \\n\"\n                        \"fmla   v9.4s, %11.4s, v0.s[3]      \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        // r1\n                        
\"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%3], #8           \\n\"\n\n                        \"fmla   v6.4s, %12.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %12.4s, v1.s[0]      \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v8.4s, %13.4s, v2.s[0]      \\n\"\n                        \"fmla   v9.4s, %13.4s, v2.s[2]      \\n\"\n\n                        \"ld1    {v3.h}[0], [%3]             \\n\"\n\n                        \"fmla   v6.4s, %14.4s, v2.s[1]      \\n\"\n                        \"fmla   v7.4s, %14.4s, v2.s[3]      \\n\"\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        // r2\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%4], #8           \\n\"\n\n                        \"fmla   v8.4s, %15.4s, v2.s[2]      \\n\"\n                        \"fmla   v9.4s, %15.4s, v3.s[0]      \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %16.4s, v0.s[0]      \\n\"\n                        \"fmla   v7.4s, %16.4s, v0.s[2]      \\n\"\n\n                        \"ld1    {v1.h}[0], [%4]             \\n\"\n\n                        \"fmla   v8.4s, %17.4s, v0.s[1]      \\n\"\n                        \"fmla   v9.4s, %17.4s, v0.s[3]      \\n\"\n\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v6.4s, %18.4s, v0.s[2]      \\n\"\n                        \"fmla   v7.4s, %18.4s, v1.s[0]      \\n\"\n\n                        \"fadd   v8.4s, v8.4s, v6.4s         \\n\"\n                        \"fadd   v9.4s, v9.4s, v7.4s         \\n\"\n\n                        \"shrn   v8.4h, v8.4s, #16           \\n\"\n                        \"shrn   v9.4h, v9.4s, #16           \\n\"\n\n                        \"st1    {v8.4h, v9.4h}, [%0], 
#16   \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n#else  // __aarch64__\n                    asm volatile(\n                        // r0\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d9}, [%2]!         \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%1]!      \\n\" // sum0\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmul.f32   q0, %q10, d8[0]     \\n\"\n                        \"vmul.f32   q1, %q10, d9[0]     \\n\"\n\n                        \"vld1.u16   {d11[]}, [%2]       \\n\"\n\n                        \"vmla.f32   q2, %q11, d8[1]     \\n\"\n                        \"vmla.f32   q3, %q11, d9[1]     \\n\"\n\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        // r1\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d13}, [%3]!        
\\n\"\n\n                        \"vmla.f32   q0, %q12, d9[0]     \\n\"\n                        \"vmla.f32   q1, %q12, d10[0]    \\n\"\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q2, %q13, d12[0]    \\n\"\n                        \"vmla.f32   q3, %q13, d13[0]    \\n\"\n\n                        \"vld1.u16   {d9[]}, [%3]        \\n\"\n\n                        \"vmla.f32   q0, %q14, d12[1]    \\n\"\n                        \"vmla.f32   q1, %q14, d13[1]    \\n\"\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        // r2\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.u16   {d11}, [%4]!        \\n\"\n\n                        \"vmla.f32   q2, %q15, d13[0]    \\n\"\n                        \"vmla.f32   q3, %q15, d8[0]     \\n\"\n\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        \"vmla.f32   q0, %q16, d10[0]    \\n\"\n                        \"vmla.f32   q1, %q16, d11[0]    \\n\"\n\n                        \"vld1.u16   {d13[]}, [%4]       \\n\"\n\n                        \"vmla.f32   q2, %q17, d10[1]    \\n\"\n                        \"vmla.f32   q3, %q17, d11[1]    \\n\"\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q0, %q18, d11[0]    \\n\"\n                        \"vmla.f32   q1, %q18, d12[0]    \\n\"\n\n                        \"vadd.f32   q2, q2, q0          \\n\"\n                        \"vadd.f32   q3, q3, q1          \\n\"\n\n                        \"vshrn.u32  d2, q2, #16         \\n\"\n                        \"vshrn.u32  d3, q3, #16         \\n\"\n\n                        \"vst1.u16   {d2-d3}, [%0 :64]!  
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00), // %10\n                        \"w\"(_k01), // %11\n                        \"w\"(_k02), // %12\n                        \"w\"(_k10), // %13\n                        \"w\"(_k11), // %14\n                        \"w\"(_k12), // %15\n                        \"w\"(_k20), // %16\n                        \"w\"(_k21), // %17\n                        \"w\"(_k22)  // %18\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n                    float32x4_t _sum0 = vld1q_f32(outptr0);\n\n                    float32x4_t _r0 = bfloat2float(vld1_u16(r0));\n                    float32x4_t _r1 = bfloat2float(vld1_u16(r1));\n                    float32x4_t _r2 = bfloat2float(vld1_u16(r2));\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k00, _r0, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k01, _r0, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k02, _r0, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k10, _r1, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k11, _r1, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k12, _r1, 2);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k20, _r2, 0);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k21, _r2, 1);\n                    _sum0 = vfmaq_laneq_f32(_sum0, _k22, _r2, 2);\n#else\n                    
_sum0 = vmlaq_lane_f32(_sum0, _k00, vget_low_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k01, vget_low_f32(_r0), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k02, vget_high_f32(_r0), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k10, vget_low_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k11, vget_low_f32(_r1), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k12, vget_high_f32(_r1), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k20, vget_low_f32(_r2), 0);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k21, vget_low_f32(_r2), 1);\n                    _sum0 = vmlaq_lane_f32(_sum0, _k22, vget_high_f32(_r2), 0);\n#endif\n\n                    vst1_u16(outptr0_bf16, float2bfloat(_sum0));\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    outptr0 += 4;\n                    outptr0_bf16 += 4;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack1to4_fp16s.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack1to4_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x4_t _bias0 = bias ? vld1_f16(bias + p * 4) : vdup_n_f16((__fp16)0.f);\n        out0.fill(_bias0);\n\n        const __fp16* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n\n            float16x4_t _k00 = vld1_f16(k0);\n            float16x4_t _k01 = vld1_f16(k0 + 4);\n            float16x4_t _k02 = vld1_f16(k0 + 8);\n            float16x4_t _k10 = vld1_f16(k0 + 12);\n            float16x4_t _k11 = vld1_f16(k0 + 16);\n            float16x4_t _k12 = vld1_f16(k0 + 20);\n            float16x4_t _k20 = vld1_f16(k0 + 24);\n            float16x4_t _k21 = vld1_f16(k0 + 28);\n            float16x4_t _k22 = vld1_f16(k0 + 32);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%0], #32 \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h, v31.4h}, 
[%0] \\n\" // sum4 sum5 sum6 sum7\n\n                        \"sub    %0, %0, #32                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%1], #16          \\n\" // r0\n                        \"ld1    {v1.4h}, [%1]               \\n\"\n\n                        \"fmla   v24.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v25.4h, %8.4h, v0.h[1]      \\n\"\n                        \"fmla   v26.4h, %8.4h, v0.h[2]      \\n\"\n                        \"fmla   v27.4h, %8.4h, v0.h[3]      \\n\"\n                        \"fmla   v28.4h, %8.4h, v0.h[4]      \\n\"\n                        \"fmla   v29.4h, %8.4h, v0.h[5]      \\n\"\n                        \"fmla   v30.4h, %8.4h, v0.h[6]      \\n\"\n                        \"fmla   v31.4h, %8.4h, v0.h[7]      \\n\"\n\n                        \"fmla   v24.4h, %9.4h, v0.h[1]      \\n\"\n                        \"fmla   v25.4h, %9.4h, v0.h[2]      \\n\"\n                        \"fmla   v26.4h, %9.4h, v0.h[3]      \\n\"\n                        \"fmla   v27.4h, %9.4h, v0.h[4]      \\n\"\n                        \"fmla   v28.4h, %9.4h, v0.h[5]      \\n\"\n                        \"fmla   v29.4h, %9.4h, v0.h[6]      \\n\"\n                        \"fmla   v30.4h, %9.4h, v0.h[7]      \\n\"\n                        \"fmla   v31.4h, %9.4h, v1.h[0]      \\n\"\n\n                        \"fmla   v24.4h, %10.4h, v0.h[2]     \\n\"\n                        \"fmla   v25.4h, %10.4h, v0.h[3]     \\n\"\n                        \"fmla   v26.4h, %10.4h, v0.h[4]     \\n\"\n                        \"fmla   v27.4h, %10.4h, v0.h[5]     \\n\"\n                        \"fmla   v28.4h, %10.4h, v0.h[6]     \\n\"\n                        \"fmla   v29.4h, %10.4h, v0.h[7]     \\n\"\n                        \"fmla   v30.4h, %10.4h, v1.h[0]     \\n\"\n                        \"fmla   v31.4h, %10.4h, v1.h[1]     \\n\"\n\n                        \"prfm   
pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v2.8h}, [%2], #16          \\n\" // r1\n                        \"ld1    {v3.4h}, [%2]               \\n\"\n\n                        \"fmla   v24.4h, %11.4h, v2.h[0]     \\n\"\n                        \"fmla   v25.4h, %11.4h, v2.h[1]     \\n\"\n                        \"fmla   v26.4h, %11.4h, v2.h[2]     \\n\"\n                        \"fmla   v27.4h, %11.4h, v2.h[3]     \\n\"\n                        \"fmla   v28.4h, %11.4h, v2.h[4]     \\n\"\n                        \"fmla   v29.4h, %11.4h, v2.h[5]     \\n\"\n                        \"fmla   v30.4h, %11.4h, v2.h[6]     \\n\"\n                        \"fmla   v31.4h, %11.4h, v2.h[7]     \\n\"\n\n                        \"fmla   v24.4h, %12.4h, v2.h[1]     \\n\"\n                        \"fmla   v25.4h, %12.4h, v2.h[2]     \\n\"\n                        \"fmla   v26.4h, %12.4h, v2.h[3]     \\n\"\n                        \"fmla   v27.4h, %12.4h, v2.h[4]     \\n\"\n                        \"fmla   v28.4h, %12.4h, v2.h[5]     \\n\"\n                        \"fmla   v29.4h, %12.4h, v2.h[6]     \\n\"\n                        \"fmla   v30.4h, %12.4h, v2.h[7]     \\n\"\n                        \"fmla   v31.4h, %12.4h, v3.h[0]     \\n\"\n\n                        \"fmla   v24.4h, %13.4h, v2.h[2]     \\n\"\n                        \"fmla   v25.4h, %13.4h, v2.h[3]     \\n\"\n                        \"fmla   v26.4h, %13.4h, v2.h[4]     \\n\"\n                        \"fmla   v27.4h, %13.4h, v2.h[5]     \\n\"\n                        \"fmla   v28.4h, %13.4h, v2.h[6]     \\n\"\n                        \"fmla   v29.4h, %13.4h, v2.h[7]     \\n\"\n                        \"fmla   v30.4h, %13.4h, v3.h[0]     \\n\"\n                        \"fmla   v31.4h, %13.4h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v4.8h}, [%3], #16          \\n\" // r2\n                        \"ld1    
{v5.4h}, [%3]               \\n\"\n\n                        \"fmla   v24.4h, %14.4h, v4.h[0]     \\n\"\n                        \"fmla   v25.4h, %14.4h, v4.h[1]     \\n\"\n                        \"fmla   v26.4h, %14.4h, v4.h[2]     \\n\"\n                        \"fmla   v27.4h, %14.4h, v4.h[3]     \\n\"\n                        \"fmla   v28.4h, %14.4h, v4.h[4]     \\n\"\n                        \"fmla   v29.4h, %14.4h, v4.h[5]     \\n\"\n                        \"fmla   v30.4h, %14.4h, v4.h[6]     \\n\"\n                        \"fmla   v31.4h, %14.4h, v4.h[7]     \\n\"\n\n                        \"fmla   v24.4h, %15.4h, v4.h[1]     \\n\"\n                        \"fmla   v25.4h, %15.4h, v4.h[2]     \\n\"\n                        \"fmla   v26.4h, %15.4h, v4.h[3]     \\n\"\n                        \"fmla   v27.4h, %15.4h, v4.h[4]     \\n\"\n                        \"fmla   v28.4h, %15.4h, v4.h[5]     \\n\"\n                        \"fmla   v29.4h, %15.4h, v4.h[6]     \\n\"\n                        \"fmla   v30.4h, %15.4h, v4.h[7]     \\n\"\n                        \"fmla   v31.4h, %15.4h, v5.h[0]     \\n\"\n\n                        \"fmla   v24.4h, %16.4h, v4.h[2]     \\n\"\n                        \"fmla   v25.4h, %16.4h, v4.h[3]     \\n\"\n                        \"fmla   v26.4h, %16.4h, v4.h[4]     \\n\"\n                        \"fmla   v27.4h, %16.4h, v4.h[5]     \\n\"\n                        \"fmla   v28.4h, %16.4h, v4.h[6]     \\n\"\n                        \"fmla   v29.4h, %16.4h, v4.h[7]     \\n\"\n                        \"fmla   v30.4h, %16.4h, v5.h[0]     \\n\"\n                        \"fmla   v31.4h, %16.4h, v5.h[1]     \\n\"\n\n                        \"st1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%0], #32 \\n\"\n                        \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n         
               \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%1]               \\n\" // r0\n\n                        \"fmla   v28.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v29.4h, %8.4h, v0.h[1]      \\n\"\n                        \"fmla   v30.4h, %8.4h, v0.h[2]      \\n\"\n                        \"fmla   v31.4h, %8.4h, v0.h[3]      \\n\"\n\n                        \"fmla   v28.4h, %9.4h, v0.h[1]      \\n\"\n                        \"fmla   v29.4h, %9.4h, v0.h[2]      \\n\"\n                        \"fmla   v30.4h, %9.4h, v0.h[3]      \\n\"\n                        \"fmla   v31.4h, %9.4h, v0.h[4]      \\n\"\n\n                        \"fmla   v28.4h, %10.4h, v0.h[2]     \\n\"\n                        \"fmla   v29.4h, %10.4h, v0.h[3]     \\n\"\n                        \"fmla   v30.4h, %10.4h, v0.h[4]     \\n\"\n                        \"fmla   v31.4h, %10.4h, 
v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v1.8h}, [%2]               \\n\" // r1\n\n                        \"fmla   v28.4h, %11.4h, v1.h[0]     \\n\"\n                        \"fmla   v29.4h, %11.4h, v1.h[1]     \\n\"\n                        \"fmla   v30.4h, %11.4h, v1.h[2]     \\n\"\n                        \"fmla   v31.4h, %11.4h, v1.h[3]     \\n\"\n\n                        \"fmla   v28.4h, %12.4h, v1.h[1]     \\n\"\n                        \"fmla   v29.4h, %12.4h, v1.h[2]     \\n\"\n                        \"fmla   v30.4h, %12.4h, v1.h[3]     \\n\"\n                        \"fmla   v31.4h, %12.4h, v1.h[4]     \\n\"\n\n                        \"fmla   v28.4h, %13.4h, v1.h[2]     \\n\"\n                        \"fmla   v29.4h, %13.4h, v1.h[3]     \\n\"\n                        \"fmla   v30.4h, %13.4h, v1.h[4]     \\n\"\n                        \"fmla   v31.4h, %13.4h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v2.8h}, [%3]               \\n\" // r2\n\n                        \"fmla   v28.4h, %14.4h, v2.h[0]     \\n\"\n                        \"fmla   v29.4h, %14.4h, v2.h[1]     \\n\"\n                        \"fmla   v30.4h, %14.4h, v2.h[2]     \\n\"\n                        \"fmla   v31.4h, %14.4h, v2.h[3]     \\n\"\n\n                        \"fmla   v28.4h, %15.4h, v2.h[1]     \\n\"\n                        \"fmla   v29.4h, %15.4h, v2.h[2]     \\n\"\n                        \"fmla   v30.4h, %15.4h, v2.h[3]     \\n\"\n                        \"fmla   v31.4h, %15.4h, v2.h[4]     \\n\"\n\n                        \"fmla   v28.4h, %16.4h, v2.h[2]     \\n\"\n                        \"fmla   v29.4h, %16.4h, v2.h[3]     \\n\"\n                        \"fmla   v30.4h, %16.4h, v2.h[4]     \\n\"\n                        \"fmla   v31.4h, %16.4h, v2.h[5]     \\n\"\n\n                        \"add    %1, %1, #8    
              \\n\"\n                        \"add    %2, %2, #8                  \\n\"\n                        \"add    %3, %3, #8                  \\n\"\n\n                        \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v30.4h, v31.4h}, [%0]      \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1]               \\n\" // r0\n\n                        \"fmla   v30.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v31.4h, %8.4h, v0.h[1]      \\n\"\n                        \"fmla   v30.4h, %9.4h, v0.h[1]      \\n\"\n                        \"fmla   v31.4h, %9.4h, v0.h[2]      \\n\"\n                        \"fmla   v30.4h, %10.4h, v0.h[2]     \\n\"\n                        \"fmla   v31.4h, %10.4h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n       
                 \"ld1    {v1.4h}, [%2]               \\n\" // r1\n\n                        \"fmla   v30.4h, %11.4h, v1.h[0]     \\n\"\n                        \"fmla   v31.4h, %11.4h, v1.h[1]     \\n\"\n                        \"fmla   v30.4h, %12.4h, v1.h[1]     \\n\"\n                        \"fmla   v31.4h, %12.4h, v1.h[2]     \\n\"\n                        \"fmla   v30.4h, %13.4h, v1.h[2]     \\n\"\n                        \"fmla   v31.4h, %13.4h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%3]               \\n\" // r2\n\n                        \"fmla   v30.4h, %14.4h, v2.h[0]     \\n\"\n                        \"fmla   v31.4h, %14.4h, v2.h[1]     \\n\"\n                        \"fmla   v30.4h, %15.4h, v2.h[1]     \\n\"\n                        \"fmla   v31.4h, %15.4h, v2.h[2]     \\n\"\n                        \"fmla   v30.4h, %16.4h, v2.h[2]     \\n\"\n                        \"fmla   v31.4h, %16.4h, v2.h[3]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n                        \"add    %2, %2, #4                  \\n\"\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"st1    {v30.4h, v31.4h}, [%0], #16 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        
\"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #64]        \\n\"\n                        \"ld1    {v30.4h}, [%0]              \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1]               \\n\" // r0\n\n                        \"fmla   v30.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v30.4h, %9.4h, v0.h[1]      \\n\"\n                        \"fmla   v30.4h, %10.4h, v0.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v1.4h}, [%2]               \\n\" // r1\n\n                        \"fmla   v30.4h, %11.4h, v1.h[0]     \\n\"\n                        \"fmla   v30.4h, %12.4h, v1.h[1]     \\n\"\n                        \"fmla   v30.4h, %13.4h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%3]               \\n\" // r2\n\n                        \"fmla   v30.4h, %14.4h, v2.h[0]     \\n\"\n                        \"fmla   v30.4h, %15.4h, v2.h[1]     \\n\"\n                        \"fmla   v30.4h, %16.4h, v2.h[2]     \\n\"\n\n                        \"add    %1, %1, #2                  \\n\"\n                        \"add    %2, %2, #2                  \\n\"\n                        \"add    %3, %3, #2                  \\n\"\n\n                        \"st1    {v30.4h}, [%0], #8          \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                       
 \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v30\");\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n\nstatic void conv3x3s2_pack1to4_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x4_t _bias0 = bias ? 
vld1_f16(bias + p * 4) : vdup_n_f16((__fp16)0.f);\n        out0.fill(_bias0);\n\n        const __fp16* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n\n            float16x4_t _k00 = vld1_f16(k0);\n            float16x4_t _k01 = vld1_f16(k0 + 4);\n            float16x4_t _k02 = vld1_f16(k0 + 8);\n            float16x4_t _k10 = vld1_f16(k0 + 12);\n            float16x4_t _k11 = vld1_f16(k0 + 16);\n            float16x4_t _k12 = vld1_f16(k0 + 20);\n            float16x4_t _k20 = vld1_f16(k0 + 24);\n            float16x4_t _k21 = vld1_f16(k0 + 28);\n            float16x4_t _k22 = vld1_f16(k0 + 32);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%1], #16          \\n\" // r0\n                        \"ld1    {v1.h}[0], [%1]             \\n\"\n\n                        \"fmla   v28.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v29.4h, %8.4h, v0.h[2]      \\n\"\n                        \"fmla   v30.4h, %8.4h, v0.h[4]      \\n\"\n                        \"fmla   v31.4h, %8.4h, v0.h[6]      \\n\"\n\n                        \"fmla   v28.4h, %9.4h, v0.h[1]      \\n\"\n                        \"fmla   v29.4h, %9.4h, v0.h[3]      \\n\"\n                        \"fmla   v30.4h, %9.4h, v0.h[5]      \\n\"\n                
        \"fmla   v31.4h, %9.4h, v0.h[7]      \\n\"\n\n                        \"fmla   v28.4h, %10.4h, v0.h[2]     \\n\"\n                        \"fmla   v29.4h, %10.4h, v0.h[4]     \\n\"\n                        \"fmla   v30.4h, %10.4h, v0.h[6]     \\n\"\n                        \"fmla   v31.4h, %10.4h, v1.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v2.8h}, [%2], #16          \\n\" // r1\n                        \"ld1    {v3.h}[0], [%2]             \\n\"\n\n                        \"fmla   v28.4h, %11.4h, v2.h[0]     \\n\"\n                        \"fmla   v29.4h, %11.4h, v2.h[2]     \\n\"\n                        \"fmla   v30.4h, %11.4h, v2.h[4]     \\n\"\n                        \"fmla   v31.4h, %11.4h, v2.h[6]     \\n\"\n\n                        \"fmla   v28.4h, %12.4h, v2.h[1]     \\n\"\n                        \"fmla   v29.4h, %12.4h, v2.h[3]     \\n\"\n                        \"fmla   v30.4h, %12.4h, v2.h[5]     \\n\"\n                        \"fmla   v31.4h, %12.4h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.4h, %13.4h, v2.h[2]     \\n\"\n                        \"fmla   v29.4h, %13.4h, v2.h[4]     \\n\"\n                        \"fmla   v30.4h, %13.4h, v2.h[6]     \\n\"\n                        \"fmla   v31.4h, %13.4h, v3.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v4.8h}, [%3], #16          \\n\" // r2\n                        \"ld1    {v5.h}[0], [%3]             \\n\"\n\n                        \"fmla   v28.4h, %14.4h, v4.h[0]     \\n\"\n                        \"fmla   v29.4h, %14.4h, v4.h[2]     \\n\"\n                        \"fmla   v30.4h, %14.4h, v4.h[4]     \\n\"\n                        \"fmla   v31.4h, %14.4h, v4.h[6]     \\n\"\n\n                        \"fmla   v28.4h, %15.4h, v4.h[1]     \\n\"\n                        \"fmla   v29.4h, %15.4h, v4.h[3]     \\n\"\n                
        \"fmla   v30.4h, %15.4h, v4.h[5]     \\n\"\n                        \"fmla   v31.4h, %15.4h, v4.h[7]     \\n\"\n\n                        \"fmla   v28.4h, %16.4h, v4.h[2]     \\n\"\n                        \"fmla   v29.4h, %16.4h, v4.h[4]     \\n\"\n                        \"fmla   v30.4h, %16.4h, v4.h[6]     \\n\"\n                        \"fmla   v31.4h, %16.4h, v5.h[0]     \\n\"\n\n                        \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v30.4h, v31.4h}, [%0]      \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1], #8           \\n\" // r0\n                        \"ld1    {v1.h}[0], [%1]             \\n\"\n\n                        \"fmla   v30.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v31.4h, %8.4h, v0.h[2]      \\n\"\n                    
    \"fmla   v30.4h, %9.4h, v0.h[1]      \\n\"\n                        \"fmla   v31.4h, %9.4h, v0.h[3]      \\n\"\n                        \"fmla   v30.4h, %10.4h, v0.h[2]     \\n\"\n                        \"fmla   v31.4h, %10.4h, v1.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%2], #8           \\n\" // r1\n                        \"ld1    {v3.h}[0], [%2]             \\n\"\n\n                        \"fmla   v30.4h, %11.4h, v2.h[0]     \\n\"\n                        \"fmla   v31.4h, %11.4h, v2.h[2]     \\n\"\n                        \"fmla   v30.4h, %12.4h, v2.h[1]     \\n\"\n                        \"fmla   v31.4h, %12.4h, v2.h[3]     \\n\"\n                        \"fmla   v30.4h, %13.4h, v2.h[2]     \\n\"\n                        \"fmla   v31.4h, %13.4h, v3.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v4.4h}, [%3], #8           \\n\" // r2\n                        \"ld1    {v5.h}[0], [%3]             \\n\"\n\n                        \"fmla   v30.4h, %14.4h, v4.h[0]     \\n\"\n                        \"fmla   v31.4h, %14.4h, v4.h[2]     \\n\"\n                        \"fmla   v30.4h, %15.4h, v4.h[1]     \\n\"\n                        \"fmla   v31.4h, %15.4h, v4.h[3]     \\n\"\n                        \"fmla   v30.4h, %16.4h, v4.h[2]     \\n\"\n                        \"fmla   v31.4h, %16.4h, v5.h[0]     \\n\"\n\n                        \"st1    {v30.4h, v31.4h}, [%0], #16 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // 
%9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #64]        \\n\"\n                        \"ld1    {v30.4h}, [%0]              \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1]               \\n\" // r0\n\n                        \"fmla   v30.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v30.4h, %9.4h, v0.h[1]      \\n\"\n                        \"fmla   v30.4h, %10.4h, v0.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v1.4h}, [%2]               \\n\" // r1\n\n                        \"fmla   v30.4h, %11.4h, v1.h[0]     \\n\"\n                        \"fmla   v30.4h, %12.4h, v1.h[1]     \\n\"\n                        \"fmla   v30.4h, %13.4h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v2.4h}, [%3]               \\n\" // r2\n\n                        \"fmla   v30.4h, %14.4h, v2.h[0]     \\n\"\n                        \"fmla   v30.4h, %15.4h, v2.h[1]     \\n\"\n                        \"fmla   v30.4h, %16.4h, v2.h[2]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n                        \"add    %2, %2, #4                  \\n\"\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"st1    {v30.4h}, [%0], #8  
        \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v30\");\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack1to8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack1to8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + p * 8) : vdupq_n_f16((__fp16)0.f);\n        out0.fill(_bias0);\n\n        const __fp16* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n\n            float16x8_t _k00 = vld1q_f16(k0);\n            float16x8_t _k01 = vld1q_f16(k0 + 8);\n            float16x8_t _k02 = vld1q_f16(k0 + 16);\n            float16x8_t _k10 = vld1q_f16(k0 + 24);\n            float16x8_t _k11 = vld1q_f16(k0 + 32);\n            float16x8_t _k12 = vld1q_f16(k0 + 40);\n            float16x8_t _k20 = vld1q_f16(k0 + 48);\n            float16x8_t _k21 = vld1q_f16(k0 + 56);\n            float16x8_t _k22 = vld1q_f16(k0 + 64);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, 
v30.8h, v31.8h}, [%0] \\n\" // sum4 sum5 sum6 sum7\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ldr    q0, [%1], #16               \\n\" // r0\n                        \"ldr    s1, [%1]                    \\n\"\n\n                        \"fmla   v24.8h, %8.8h, v0.h[0]      \\n\"\n                        \"fmla   v25.8h, %8.8h, v0.h[1]      \\n\"\n                        \"fmla   v26.8h, %8.8h, v0.h[2]      \\n\"\n                        \"fmla   v27.8h, %8.8h, v0.h[3]      \\n\"\n                        \"fmla   v28.8h, %8.8h, v0.h[4]      \\n\"\n                        \"fmla   v29.8h, %8.8h, v0.h[5]      \\n\"\n                        \"fmla   v30.8h, %8.8h, v0.h[6]      \\n\"\n                        \"fmla   v31.8h, %8.8h, v0.h[7]      \\n\"\n\n                        \"fmla   v24.8h, %9.8h, v0.h[1]      \\n\"\n                        \"fmla   v25.8h, %9.8h, v0.h[2]      \\n\"\n                        \"fmla   v26.8h, %9.8h, v0.h[3]      \\n\"\n                        \"fmla   v27.8h, %9.8h, v0.h[4]      \\n\"\n                        \"fmla   v28.8h, %9.8h, v0.h[5]      \\n\"\n                        \"fmla   v29.8h, %9.8h, v0.h[6]      \\n\"\n                        \"fmla   v30.8h, %9.8h, v0.h[7]      \\n\"\n                        \"fmla   v31.8h, %9.8h, v1.h[0]      \\n\"\n\n                        \"fmla   v24.8h, %10.8h, v0.h[2]     \\n\"\n                        \"fmla   v25.8h, %10.8h, v0.h[3]     \\n\"\n                        \"fmla   v26.8h, %10.8h, v0.h[4]     \\n\"\n                        \"fmla   v27.8h, %10.8h, v0.h[5]     \\n\"\n                        \"fmla   v28.8h, %10.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, %10.8h, v0.h[7]     \\n\"\n                        \"fmla   v30.8h, %10.8h, v1.h[0]     \\n\"\n                        \"fmla   v31.8h, %10.8h, v1.h[1]     \\n\"\n\n                        
\"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ldr    q2, [%2], #16               \\n\" // r1\n                        \"ldr    s3, [%2]                    \\n\"\n\n                        \"fmla   v24.8h, %11.8h, v2.h[0]     \\n\"\n                        \"fmla   v25.8h, %11.8h, v2.h[1]     \\n\"\n                        \"fmla   v26.8h, %11.8h, v2.h[2]     \\n\"\n                        \"fmla   v27.8h, %11.8h, v2.h[3]     \\n\"\n                        \"fmla   v28.8h, %11.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, %11.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, %11.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, %11.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v24.8h, %12.8h, v2.h[1]     \\n\"\n                        \"fmla   v25.8h, %12.8h, v2.h[2]     \\n\"\n                        \"fmla   v26.8h, %12.8h, v2.h[3]     \\n\"\n                        \"fmla   v27.8h, %12.8h, v2.h[4]     \\n\"\n                        \"fmla   v28.8h, %12.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, %12.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, %12.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, %12.8h, v3.h[0]     \\n\"\n\n                        \"fmla   v24.8h, %13.8h, v2.h[2]     \\n\"\n                        \"fmla   v25.8h, %13.8h, v2.h[3]     \\n\"\n                        \"fmla   v26.8h, %13.8h, v2.h[4]     \\n\"\n                        \"fmla   v27.8h, %13.8h, v2.h[5]     \\n\"\n                        \"fmla   v28.8h, %13.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, %13.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, %13.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, %13.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ldr    q4, [%3], #16               \\n\" // r2\n                        \"ldr   
 s5, [%3]                    \\n\"\n\n                        \"fmla   v24.8h, %14.8h, v4.h[0]     \\n\"\n                        \"fmla   v25.8h, %14.8h, v4.h[1]     \\n\"\n                        \"fmla   v26.8h, %14.8h, v4.h[2]     \\n\"\n                        \"fmla   v27.8h, %14.8h, v4.h[3]     \\n\"\n                        \"fmla   v28.8h, %14.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, %14.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, %14.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, %14.8h, v4.h[7]     \\n\"\n\n                        \"fmla   v24.8h, %15.8h, v4.h[1]     \\n\"\n                        \"fmla   v25.8h, %15.8h, v4.h[2]     \\n\"\n                        \"fmla   v26.8h, %15.8h, v4.h[3]     \\n\"\n                        \"fmla   v27.8h, %15.8h, v4.h[4]     \\n\"\n                        \"fmla   v28.8h, %15.8h, v4.h[5]     \\n\"\n                        \"fmla   v29.8h, %15.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, %15.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, %15.8h, v5.h[0]     \\n\"\n\n                        \"fmla   v24.8h, %16.8h, v4.h[2]     \\n\"\n                        \"fmla   v25.8h, %16.8h, v4.h[3]     \\n\"\n                        \"fmla   v26.8h, %16.8h, v4.h[4]     \\n\"\n                        \"fmla   v27.8h, %16.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, %16.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, %16.8h, v4.h[7]     \\n\"\n                        \"fmla   v30.8h, %16.8h, v5.h[0]     \\n\"\n                        \"fmla   v31.8h, %16.8h, v5.h[1]     \\n\"\n\n                        \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n        
                \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ldr    q0, [%1]                    \\n\" // r0\n\n                        \"fmla   v28.8h, %8.8h, v0.h[0]      \\n\"\n                        \"fmla   v29.8h, %8.8h, v0.h[1]      \\n\"\n                        \"fmla   v30.8h, %8.8h, v0.h[2]      \\n\"\n                        \"fmla   v31.8h, %8.8h, v0.h[3]      \\n\"\n\n                        \"fmla   v28.8h, %9.8h, v0.h[1]      \\n\"\n                        \"fmla   v29.8h, %9.8h, v0.h[2]      \\n\"\n                        \"fmla   v30.8h, %9.8h, v0.h[3]      \\n\"\n                        \"fmla   v31.8h, %9.8h, v0.h[4]      \\n\"\n\n                        \"fmla   v28.8h, %10.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, %10.8h, v0.h[3]     \\n\"\n                        \"fmla   v30.8h, %10.8h, v0.h[4]     \\n\"\n                        \"fmla   v31.8h, %10.8h, 
v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ldr    q1, [%2]                    \\n\" // r1\n\n                        \"fmla   v28.8h, %11.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, %11.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, %11.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, %11.8h, v1.h[3]     \\n\"\n\n                        \"fmla   v28.8h, %12.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, %12.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, %12.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, %12.8h, v1.h[4]     \\n\"\n\n                        \"fmla   v28.8h, %13.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, %13.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, %13.8h, v1.h[4]     \\n\"\n                        \"fmla   v31.8h, %13.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ldr    q2, [%3]                    \\n\" // r2\n\n                        \"fmla   v28.8h, %14.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, %14.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, %14.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, %14.8h, v2.h[3]     \\n\"\n\n                        \"fmla   v28.8h, %15.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, %15.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, %15.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, %15.8h, v2.h[4]     \\n\"\n\n                        \"fmla   v28.8h, %16.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, %16.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, %16.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, %16.8h, v2.h[5]     \\n\"\n\n                        \"add    %1, %1, #8    
              \\n\"\n                        \"add    %2, %2, #8                  \\n\"\n                        \"add    %3, %3, #8                  \\n\"\n\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v30.8h, v31.8h}, [%0]      \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ldr    d0, [%1]                    \\n\" // r0\n\n                        \"fmla   v30.8h, %8.8h, v0.h[0]      \\n\"\n                        \"fmla   v31.8h, %8.8h, v0.h[1]      \\n\"\n                        \"fmla   v30.8h, %9.8h, v0.h[1]      \\n\"\n                        \"fmla   v31.8h, %9.8h, v0.h[2]      \\n\"\n                        \"fmla   v30.8h, %10.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, %10.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n       
                 \"ldr    d1, [%2]                    \\n\" // r1\n\n                        \"fmla   v30.8h, %11.8h, v1.h[0]     \\n\"\n                        \"fmla   v31.8h, %11.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, %12.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, %12.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, %13.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, %13.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ldr    d2, [%3]                    \\n\" // r2\n\n                        \"fmla   v30.8h, %14.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, %14.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, %15.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, %15.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, %16.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, %16.8h, v2.h[3]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n                        \"add    %2, %2, #4                  \\n\"\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"st1    {v30.8h, v31.8h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        
\"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ldr    q30, [%0]                   \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ldr    d0, [%1]                    \\n\" // r0\n\n                        \"fmla   v30.8h, %8.8h, v0.h[0]      \\n\"\n                        \"fmla   v30.8h, %9.8h, v0.h[1]      \\n\"\n                        \"fmla   v30.8h, %10.8h, v0.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ldr    d1, [%2]                    \\n\" // r1\n\n                        \"fmla   v30.8h, %11.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, %12.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, %13.8h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ldr    d2, [%3]                    \\n\" // r2\n\n                        \"fmla   v30.8h, %14.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, %15.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, %16.8h, v2.h[2]     \\n\"\n\n                        \"add    %1, %1, #2                  \\n\"\n                        \"add    %2, %2, #2                  \\n\"\n                        \"add    %3, %3, #2                  \\n\"\n\n                        \"str    q30, [%0], #16              \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                       
 \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v30\");\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 8;\n        }\n    }\n}\n\nstatic void conv3x3s2_pack1to8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x8_t _bias0 = bias ? 
vld1q_f16(bias + p * 8) : vdupq_n_f16((__fp16)0.f);\n        out0.fill(_bias0);\n\n        const __fp16* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n\n            float16x8_t _k00 = vld1q_f16(k0);\n            float16x8_t _k01 = vld1q_f16(k0 + 8);\n            float16x8_t _k02 = vld1q_f16(k0 + 16);\n            float16x8_t _k10 = vld1q_f16(k0 + 24);\n            float16x8_t _k11 = vld1q_f16(k0 + 32);\n            float16x8_t _k12 = vld1q_f16(k0 + 40);\n            float16x8_t _k20 = vld1q_f16(k0 + 48);\n            float16x8_t _k21 = vld1q_f16(k0 + 56);\n            float16x8_t _k22 = vld1q_f16(k0 + 64);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ldr    q0, [%1], #16               \\n\" // r0\n                        \"ldr    h1, [%1]                    \\n\"\n\n                        \"fmla   v28.8h, %8.8h, v0.h[0]      \\n\"\n                        \"fmla   v29.8h, %8.8h, v0.h[2]      \\n\"\n                        \"fmla   v30.8h, %8.8h, v0.h[4]      \\n\"\n                        \"fmla   v31.8h, %8.8h, v0.h[6]      \\n\"\n\n                        \"fmla   v28.8h, %9.8h, v0.h[1]      \\n\"\n                        \"fmla   v29.8h, %9.8h, v0.h[3]      \\n\"\n                        \"fmla   v30.8h, %9.8h, v0.h[5]      \\n\"\n    
                    \"fmla   v31.8h, %9.8h, v0.h[7]      \\n\"\n\n                        \"fmla   v28.8h, %10.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, %10.8h, v0.h[4]     \\n\"\n                        \"fmla   v30.8h, %10.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, %10.8h, v1.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ldr    q2, [%2], #16               \\n\" // r1\n                        \"ldr    h3, [%2]                    \\n\"\n\n                        \"fmla   v28.8h, %11.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, %11.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, %11.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, %11.8h, v2.h[6]     \\n\"\n\n                        \"fmla   v28.8h, %12.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, %12.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, %12.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, %12.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, %13.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, %13.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, %13.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, %13.8h, v3.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ldr    q4, [%3], #16               \\n\" // r2\n                        \"ldr    h5, [%3]                    \\n\"\n\n                        \"fmla   v28.8h, %14.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, %14.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, %14.8h, v4.h[4]     \\n\"\n                        \"fmla   v31.8h, %14.8h, v4.h[6]     \\n\"\n\n                        \"fmla   v28.8h, %15.8h, v4.h[1]     \\n\"\n                        \"fmla   v29.8h, %15.8h, v4.h[3]     \\n\"\n    
                    \"fmla   v30.8h, %15.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, %15.8h, v4.h[7]     \\n\"\n\n                        \"fmla   v28.8h, %16.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, %16.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, %16.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, %16.8h, v5.h[0]     \\n\"\n\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v30.8h, v31.8h}, [%0]      \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ldr    d0, [%1], #8                \\n\" // r0\n                        \"ldr    h1, [%1]                    \\n\"\n\n                        \"fmla   v30.8h, %8.8h, v0.h[0]      \\n\"\n                        \"fmla   v31.8h, %8.8h, v0.h[2]      \\n\"\n        
                \"fmla   v30.8h, %9.8h, v0.h[1]      \\n\"\n                        \"fmla   v31.8h, %9.8h, v0.h[3]      \\n\"\n                        \"fmla   v30.8h, %10.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, %10.8h, v1.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ldr    d2, [%2], #8                \\n\" // r1\n                        \"ldr    h3, [%2]                    \\n\"\n\n                        \"fmla   v30.8h, %11.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, %11.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, %12.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, %12.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, %13.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, %13.8h, v3.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ldr    d4, [%3], #8                \\n\" // r2\n                        \"ldr    h5, [%3]                    \\n\"\n\n                        \"fmla   v30.8h, %14.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, %14.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, %15.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, %15.8h, v4.h[3]     \\n\"\n                        \"fmla   v30.8h, %16.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, %16.8h, v5.h[0]     \\n\"\n\n                        \"st1    {v30.8h, v31.8h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        
\"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ldr    q30, [%0]                   \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ldr    d0, [%1]                    \\n\" // r0\n\n                        \"fmla   v30.8h, %8.8h, v0.h[0]      \\n\"\n                        \"fmla   v30.8h, %9.8h, v0.h[1]      \\n\"\n                        \"fmla   v30.8h, %10.8h, v0.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ldr    d1, [%2]                    \\n\" // r1\n\n                        \"fmla   v30.8h, %11.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, %12.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, %13.8h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ldr    d2, [%3]                    \\n\" // r2\n\n                        \"fmla   v30.8h, %14.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, %15.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, %16.8h, v2.h[2]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n                        \"add    %2, %2, #4                  \\n\"\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"str    
q30, [%0], #16              \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v30\");\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 8;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s2_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            const float* kptr = (const float*)kernel.channel(p).row(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\" // r04 r05 r06 r07\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   
v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v28.4s}, [%1]              \\n\" // r08\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        
\"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, 
v19.4s, v28.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%2], #64 \\n\" // r14 r15 r16 r17\n\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v24.4s, v12.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v14.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v10.s[1]    \\n\"\n                        \"fmla   v22.4s, v25.4s, v12.s[1]    \\n\"\n                        \"fmla   v23.4s, v25.4s, v14.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v10.s[2]    \\n\"\n                        \"fmla   v22.4s, v26.4s, v12.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v14.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v10.s[3]    \\n\"\n                        \"fmla   v22.4s, v27.4s, v12.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v14.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v28.4s}, [%2]              \\n\" // r18\n\n                        \"fmla   v20.4s, v16.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v11.s[0]    \\n\"\n                        \"fmla   v22.4s, v16.4s, v13.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v15.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v11.s[1]    
\\n\"\n                        \"fmla   v22.4s, v17.4s, v13.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v15.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v11.s[2]    \\n\"\n                        \"fmla   v22.4s, v18.4s, v13.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v15.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v9.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v11.s[3]    \\n\"\n                        \"fmla   v22.4s, v19.4s, v13.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v15.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v20.4s, v24.4s, v10.s[0]    \\n\"\n                        \"fmla   v21.4s, v24.4s, v12.s[0]    \\n\"\n                        \"fmla   v22.4s, v24.4s, v14.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v10.s[1]    \\n\"\n                        \"fmla   v21.4s, v25.4s, v12.s[1]    \\n\"\n                        \"fmla   v22.4s, v25.4s, v14.s[1]    \\n\"\n                        \"fmla   v23.4s, v25.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v10.s[2]    \\n\"\n                        \"fmla   v21.4s, v26.4s, v12.s[2]    \\n\"\n                        \"fmla   v22.4s, v26.4s, v14.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v28.s[2]    \\n\"\n              
          \"fmla   v20.4s, v27.4s, v10.s[3]    \\n\"\n                        \"fmla   v21.4s, v27.4s, v12.s[3]    \\n\"\n                        \"fmla   v22.4s, v27.4s, v14.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v28.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%3], #64 \\n\" // r24 r25 r26 r27\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v28.4s}, [%3]              \\n\" // r28\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, 
v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4] \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        
\"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"sub    %4, %4, #512                \\n\" // kptr -= 8 * 16;\n\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d0-d7}        \\n\" // r00 r01 r02 r03\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d8-d15}       \\n\" // r04 r05 r06 r07\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        
\"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1 :128]  \\n\" // r08\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        
\"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d8-d15}       \\n\" // r10 r11 r12 r13\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d0-d7}        \\n\" // r14 r15 r16 r17\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]  
   \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d8-d9}, [%2 :128]  \\n\" // r18\n\n                        \"vmla.f32   q12, q8, d10[0]     \\n\"\n                        \"vmla.f32   q13, q8, d14[0]     \\n\"\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d10[1]     \\n\"\n                        \"vmla.f32   q13, q9, d14[1]     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d11[0]    \\n\"\n                        \"vmla.f32   q13, q10, d15[0]    \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d11[1]    \\n\"\n                        \"vmla.f32   q13, q11, d15[1]    \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%4, #512]        
  \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d12[0]     \\n\"\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d12[1]     \\n\"\n                        \"vmla.f32   q13, q9, d0[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d13[0]    \\n\"\n                        \"vmla.f32   q13, q10, d1[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d13[1]    \\n\"\n                        \"vmla.f32   q13, q11, d1[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d0-d7}        \\n\" // r20 r21 r22 r23\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d8-d15}       \\n\" // r24 r25 r26 r27\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                       
 \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%3 :128]  \\n\" // r28\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        //                         \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4, {d16-d23}       
\\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"sub        %4, %4, #512        \\n\" // kptr -= 8 * 16;\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, 
#256]       \\n\"\n                        \"ld1    {v20.4s, v21.4s}, [%0]      \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmul   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v4.4s}, [%1]               \\n\" // r04\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n\n                        
\"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v4.4s}, [%2]               \\n\" // r14\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm  
 pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"prfm 
  pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v4.4s}, [%3]               \\n\" // r24\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4] \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                        \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                        \"sub    %4, %4, #512                \\n\" // kptr -= 8 * 16;\n\n                        \"st1    {v20.4s, v21.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        
\"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%0 :128] \\n\" // sum0 sum1\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d0-d7}        \\n\" // r00 r01 r02 r03\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmul.f32   q14, q8, d0[0]      \\n\"\n                        \"vmul.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d8-d9}, [%1 :128]  \\n\" // r04\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        
\"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d0-d7}        \\n\" // r10 r11 r12 r13\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d8-d9}, [%2 :128]  \\n\" // r14\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      
\\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d0-d7}        \\n\" // r20 r21 r22 r23\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   
{d8-d9}, [%3 :128]  \\n\" // r24\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        //                         \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4, {d16-d23}       \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"sub        %4, %4, #512        \\n\" // kptr -= 8 * 16;\n\n                        \"vst1.f32   {d24-d27}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v20.4s}, [%0]              \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%1] \\n\" // r00 r01 r02\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmul   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmul   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]    
 \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v3.4s, v4.4s, v5.4s}, [%2] \\n\" // r10 r11 r12\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%3] \\n\" // r20 r21 r22\n\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4], #64 \\n\"\n\n                        \"fmla   
v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%4], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%4] \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"add    %1, %1, #32                 \\n\"\n\n                        \"fadd   v22.4s, v21.4s, v22.4s      \\n\"\n\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"fadd   v23.4s, v23.4s, v22.4s      \\n\"\n\n                        \"add    %3, %3, #32                 \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v23.4s      \\n\"\n\n                        \"sub    %4, %4, #512                \\n\" // kptr -= 8 * 16;\n\n                        \"st1    {v20.4s}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n              
          \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%0 :128] \\n\" // sum0\n\n                        \"pld        [%1, #384]          \\n\"\n                        \"vldm       %1, {d0-d5}         \\n\" // r00 r01 r02\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmul.f32   q13, q8, d0[0]      \\n\"\n                        \"vmul.f32   q14, q9, d0[1]      \\n\"\n                        \"vmul.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        
\"pld        [%2, #384]          \\n\"\n                        \"vldm       %2, {d0-d5}         \\n\" // r10 r11 r12\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%3, #384]          \\n\"\n                        \"vldm       %3, {d0-d5}         \\n\" // r20 r21 r22\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0] 
     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        //                         \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4, {d16-d23}       \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"vadd.f32   q14, q14, q13       \\n\"\n\n                        \"add        %1, %1, #32         \\n\"\n\n                        \"vadd.f32   q15, q15, q14       \\n\"\n\n                        \"add        %2, %2, #32         \\n\"\n\n                        \"vadd.f32   q12, q12, q15       \\n\"\n\n                        \"add        %3, %3, #32         \\n\"\n\n                        \"sub        %4, %4, #512        \\n\" // kptr -= 8 * 16;\n\n                        \"vst1.f32   {d24-d25}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s2_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4, 4, opt.workspace_allocator);\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            const unsigned short* kptr = (const unsigned short*)kernel.channel(p).row<const unsigned short>(q);\n\n#if __aarch64__\n            // 16 * 9\n            uint16x8_t _k00_01 = vld1q_u16(kptr);\n            uint16x8_t _k00_23 = vld1q_u16(kptr + 8);\n            uint16x8_t _k01_01 = vld1q_u16(kptr + 16);\n            uint16x8_t _k01_23 = vld1q_u16(kptr + 24);\n            uint16x8_t _k02_01 = vld1q_u16(kptr + 32);\n            uint16x8_t _k02_23 = vld1q_u16(kptr + 40);\n            uint16x8_t _k10_01 = vld1q_u16(kptr + 48);\n            uint16x8_t _k10_23 = vld1q_u16(kptr + 56);\n            uint16x8_t _k11_01 = vld1q_u16(kptr + 64);\n            uint16x8_t _k11_23 = vld1q_u16(kptr + 72);\n            uint16x8_t _k12_01 = 
vld1q_u16(kptr + 80);\n            uint16x8_t _k12_23 = vld1q_u16(kptr + 88);\n            uint16x8_t _k20_01 = vld1q_u16(kptr + 96);\n            uint16x8_t _k20_23 = vld1q_u16(kptr + 104);\n            uint16x8_t _k21_01 = vld1q_u16(kptr + 112);\n            uint16x8_t _k21_23 = vld1q_u16(kptr + 120);\n            uint16x8_t _k22_01 = vld1q_u16(kptr + 128);\n            uint16x8_t _k22_23 = vld1q_u16(kptr + 136);\n#endif // __aarch64__\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"shll   v0.4s, v4.4h, #16           \\n\"\n                        \"shll2  v1.4s, v4.8h, #16           \\n\"\n                        \"shll   v2.4s, v5.4h, #16           \\n\"\n                        \"shll2  v3.4s, v5.8h, #16           \\n\"\n\n                        \"shll   v4.4s, v6.4h, #16           \\n\"\n                        \"shll2  v5.4s, v6.8h, #16           \\n\"\n                        \"shll   v6.4s, v7.4h, #16           \\n\"\n                        \"shll2  v7.4s, v7.8h, #16           \\n\"\n\n                        \"shll   v8.4s, %8.4h, #16           \\n\"\n                        \"shll2  v9.4s, %8.8h, #16           \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   
v10.4s, v9.4s, v0.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[1]      \\n\"\n\n                        \"shll   v8.4s, %9.4h, #16           \\n\"\n                        \"shll2  v9.4s, %9.8h, #16           \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[3]      \\n\"\n\n                        \"shll   v8.4s, %10.4h, #16          \\n\"\n                        \"shll2  v9.4s, %10.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[1]      \\n\"\n\n                        \"shll   v8.4s, %11.4h, #16          \\n\"\n                        \"shll2  v9.4s, %11.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, 
v7.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1]               \\n\" // r08\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"shll   v8.4s, %12.4h, #16          \\n\"\n                        \"shll2  v9.4s, %12.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %13.4h, #16          \\n\"\n                        \"shll2  v9.4s, %13.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, 
v7.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"shll   v0.4s, v4.4h, #16           \\n\"\n                        \"shll2  v1.4s, v4.8h, #16           \\n\"\n                        \"shll   v2.4s, v5.4h, #16           \\n\"\n                        \"shll2  v3.4s, v5.8h, #16           \\n\"\n\n                        \"shll   v4.4s, v6.4h, #16           \\n\"\n                        \"shll2  v5.4s, v6.8h, #16           \\n\"\n                        \"shll   v6.4s, v7.4h, #16           \\n\"\n                        \"shll2  v7.4s, v7.8h, #16           \\n\"\n\n                        \"shll   v8.4s, %14.4h, #16          \\n\"\n                        \"shll2  v9.4s, %14.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[1]      \\n\"\n\n                        \"shll   v8.4s, %15.4h, #16          \\n\"\n                        \"shll2  v9.4s, %15.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[3]      \\n\"\n\n                        \"shll   
v8.4s, %16.4h, #16          \\n\"\n                        \"shll2  v9.4s, %16.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[1]      \\n\"\n\n                        \"shll   v8.4s, %17.4h, #16          \\n\"\n                        \"shll2  v9.4s, %17.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%2]               \\n\" // r18\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"shll   v8.4s, %18.4h, #16          \\n\"\n                        \"shll2  v9.4s, %18.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   
v10.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %19.4h, #16          \\n\"\n                        \"shll2  v9.4s, %19.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v4.4h, #16           \\n\"\n                        \"shll2  v1.4s, v4.8h, #16           \\n\"\n                        \"shll   v2.4s, v5.4h, #16           \\n\"\n                        \"shll2  v3.4s, v5.8h, #16           \\n\"\n\n                        \"shll   v4.4s, v6.4h, #16           \\n\"\n                        \"shll2  v5.4s, v6.8h, #16           \\n\"\n                        \"shll   v6.4s, v7.4h, #16           \\n\"\n                        \"shll2  v7.4s, v7.8h, #16           \\n\"\n\n                        \"shll   v8.4s, %20.4h, #16          \\n\"\n                        \"shll2  v9.4s, %20.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[0]      \\n\"\n                
        \"fmla   v13.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[1]      \\n\"\n\n                        \"shll   v8.4s, %21.4h, #16          \\n\"\n                        \"shll2  v9.4s, %21.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[3]      \\n\"\n\n                        \"shll   v8.4s, %22.4h, #16          \\n\"\n                        \"shll2  v9.4s, %22.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[1]      \\n\"\n\n                        \"shll   v8.4s, %23.4h, #16          \\n\"\n                        \"shll2  v9.4s, %23.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla  
 v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3]               \\n\" // r28\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"shll   v8.4s, %24.4h, #16          \\n\"\n                        \"shll2  v9.4s, %24.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %25.4h, #16          \\n\"\n                        \"shll2  v9.4s, %25.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"st1    
{v10.4s, v11.4s, v12.4s, v13.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_01), // %8\n                        \"w\"(_k00_23), // %9\n                        \"w\"(_k01_01), // %10\n                        \"w\"(_k01_23), // %11\n                        \"w\"(_k02_01), // %12\n                        \"w\"(_k02_23), // %13\n                        \"w\"(_k10_01), // %14\n                        \"w\"(_k10_23), // %15\n                        \"w\"(_k11_01), // %16\n                        \"w\"(_k11_23), // %17\n                        \"w\"(_k12_01), // %18\n                        \"w\"(_k12_23), // %19\n                        \"w\"(_k20_01), // %20\n                        \"w\"(_k20_23), // %21\n                        \"w\"(_k21_01), // %22\n                        \"w\"(_k21_23), // %23\n                        \"w\"(_k22_01), // %24\n                        \"w\"(_k22_23)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d8-d15}       \\n\" // r00 r01 r02 r03 r04 r05 r06 r07\n\n                        \"vshll.u16  q0, d8, #16         \\n\"\n                        \"vshll.u16  q1, d9, #16         \\n\"\n                        \"vshll.u16  q2, d10, #16        \\n\"\n                
        \"vshll.u16  q3, d11, #16        \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%1, #64]           \\n\"\n                        \"vld1.f32   {d1}, [%1 :64]      \\n\" // r08\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d8-d15}       \\n\" // r10 r11 r12 r13 r14 r15 r16 r17\n\n                        \"vshll.u16  q0, d8, #16         \\n\"\n                        \"vshll.u16  q1, d9, #16         \\n\"\n                        \"vshll.u16  q2, d10, #16        \\n\"\n                        \"vshll.u16  q3, d11, #16        \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        
\"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.f32   {d1}, [%2 :64]      \\n\" // r18\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vldm       %3!, {d8-d15}       \\n\" // r20 r21 r22 r23 r24 r25 r26 r27\n\n                        \"vshll.u16  q0, d8, #16         \\n\"\n                        \"vshll.u16  q1, d9, #16         \\n\"\n                        \"vshll.u16  q2, d10, #16        \\n\"\n                        \"vshll.u16  q3, d11, #16        \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        
\"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.f32   {d1}, [%3 :64]      \\n\" // r28\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        //                         \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                      
  \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"sub        %4, %4, #256        \\n\" // kptr -= 8 * 16;\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v12.4s, v13.4s}, [%0]      \\n\" // sum0 
sum1\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %8.4h, #16           \\n\"\n                        \"shll2  v7.4s, %8.8h, #16           \\n\"\n                        \"shll   v8.4s, %9.4h, #16           \\n\"\n                        \"shll2  v9.4s, %9.8h, #16           \\n\"\n\n                        \"fmul   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmul   v11.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v4.4h}, [%1]               \\n\" // r04\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %10.4h, #16          \\n\"\n                        \"shll2  v7.4s, %10.8h, #16          \\n\"\n                        \"shll   v8.4s, %11.4h, #16          \\n\"\n                        \"shll2  v9.4s, %11.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v1.s[1]      \\n\"\n                        \"fmla   
v13.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   v6.4s, %12.4h, #16          \\n\"\n                        \"shll2  v7.4s, %12.8h, #16          \\n\"\n                        \"shll   v8.4s, %13.4h, #16          \\n\"\n                        \"shll2  v9.4s, %13.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v2.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v4.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %14.4h, #16          \\n\"\n                        \"shll2  v7.4s, %14.8h, #16          \\n\"\n                        \"shll   v8.4s, %15.4h, #16          \\n\"\n                        \"shll2  v9.4s, %15.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v2.s[0]      \\n\"\n              
          \"fmla   v12.4s, v7.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v4.4h}, [%2]               \\n\" // r14\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %16.4h, #16          \\n\"\n                        \"shll2  v7.4s, %16.8h, #16          \\n\"\n                        \"shll   v8.4s, %17.4h, #16          \\n\"\n                        \"shll2  v9.4s, %17.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v1.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   v6.4s, %18.4h, #16          \\n\"\n                        \"shll2  v7.4s, %18.8h, #16          \\n\"\n                        \"shll   v8.4s, %19.4h, #16          \\n\"\n                        \"shll2  v9.4s, %19.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v2.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v4.s[1]      \\n\"\n\n                 
       \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %20.4h, #16          \\n\"\n                        \"shll2  v7.4s, %20.8h, #16          \\n\"\n                        \"shll   v8.4s, %21.4h, #16          \\n\"\n                        \"shll2  v9.4s, %21.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v4.4h}, [%3]               \\n\" // r24\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %22.4h, #16          \\n\"\n                        \"shll2  v7.4s, %22.8h, #16          \\n\"\n                        \"shll   v8.4s, %23.4h, #16          \\n\"\n                        \"shll2  v9.4s, %23.8h, #16        
  \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v1.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   v6.4s, %24.4h, #16          \\n\"\n                        \"shll2  v7.4s, %24.8h, #16          \\n\"\n                        \"shll   v8.4s, %25.4h, #16          \\n\"\n                        \"shll2  v9.4s, %25.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v2.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v4.s[1]      \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"fadd   v12.4s, v10.4s, v12.4s      \\n\"\n                        \"fadd   v13.4s, v11.4s, v13.4s      \\n\"\n\n                        \"st1    {v12.4s, v13.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_01), // %8\n                   
     \"w\"(_k00_23), // %9\n                        \"w\"(_k01_01), // %10\n                        \"w\"(_k01_23), // %11\n                        \"w\"(_k02_01), // %12\n                        \"w\"(_k02_23), // %13\n                        \"w\"(_k10_01), // %14\n                        \"w\"(_k10_23), // %15\n                        \"w\"(_k11_01), // %16\n                        \"w\"(_k11_23), // %17\n                        \"w\"(_k12_01), // %18\n                        \"w\"(_k12_23), // %19\n                        \"w\"(_k20_01), // %20\n                        \"w\"(_k20_23), // %21\n                        \"w\"(_k21_01), // %22\n                        \"w\"(_k21_23), // %23\n                        \"w\"(_k22_01), // %24\n                        \"w\"(_k22_23)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d28-d31}, [%0 :128] \\n\" // sum0 sum1\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%1 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmul.f32   q12, q8, d0[0]      \\n\"\n                        \"vmul.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%1, #64]           \\n\"\n                        \"vld1.f32   {d9}, [%1 :64]      \\n\" // r04\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   
{d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r10 r11 r12 r13\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.f32   {d9}, [%2 :64]      \\n\" // r14\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   
{d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  \\n\" // r20 r21 r22 r23\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.f32   {d9}, [%3 :64]      \\n\" // r24\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        //                         \"pld        [%4, #256]          \\n\"\n                 
       \"vld1.u16   {d20-d23}, [%4 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q14, q12, q14       \\n\"\n                        \"vadd.f32   q15, q13, q15       \\n\"\n\n                        \"sub        %4, %4, #256        \\n\" // kptr -= 8 * 16;\n\n                        \"vst1.f32   {d28-d31}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v13.4s}, [%0]              \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%1] \\n\" // r00 r01 r02\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %8.4h, #16           \\n\"\n                        \"shll2  v7.4s, %8.8h, #16           \\n\"\n\n                        \"fmul   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmul   v11.4s, v7.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %9.4h, #16           \\n\"\n                        \"shll2  v9.4s, %9.8h, #16           \\n\"\n\n                        \"fmul   v12.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"shll   v6.4s, %10.4h, #16          \\n\"\n                        \"shll2  v7.4s, %10.8h, #16          \\n\"\n\n                        \"fmla   
v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v1.s[1]      \\n\"\n\n                        \"shll   v8.4s, %11.4h, #16          \\n\"\n                        \"shll2  v9.4s, %11.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v1.s[3]      \\n\"\n\n                        \"shll   v6.4s, %12.4h, #16          \\n\"\n                        \"shll2  v7.4s, %12.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"shll   v8.4s, %13.4h, #16          \\n\"\n                        \"shll2  v9.4s, %13.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v3.4h, v4.4h, v5.4h}, [%2] \\n\" // r10 r11 r12\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %14.4h, #16          \\n\"\n                        \"shll2  v7.4s, %14.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"shll   v8.4s, %15.4h, #16          \\n\"\n                        \"shll2  v9.4s, %15.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   v6.4s, %16.4h, #16          \\n\"\n                        \"shll2  v7.4s, %16.8h, #16          \\n\"\n\n                  
      \"fmla   v10.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v4.s[1]      \\n\"\n\n                        \"shll   v8.4s, %17.4h, #16          \\n\"\n                        \"shll2  v9.4s, %17.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"shll   v6.4s, %18.4h, #16          \\n\"\n                        \"shll2  v7.4s, %18.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v5.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v5.s[1]      \\n\"\n\n                        \"shll   v8.4s, %19.4h, #16          \\n\"\n                        \"shll2  v9.4s, %19.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v5.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%3] \\n\" // r20 r21 r22\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %20.4h, #16          \\n\"\n                        \"shll2  v7.4s, %20.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %21.4h, #16          \\n\"\n                        \"shll2  v9.4s, %21.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"shll   v6.4s, %22.4h, #16          \\n\"\n                        \"shll2  v7.4s, %22.8h, #16          \\n\"\n\n   
                     \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v1.s[1]      \\n\"\n\n                        \"shll   v8.4s, %23.4h, #16          \\n\"\n                        \"shll2  v9.4s, %23.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v1.s[3]      \\n\"\n\n                        \"shll   v6.4s, %24.4h, #16          \\n\"\n                        \"shll2  v7.4s, %24.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"shll   v8.4s, %25.4h, #16          \\n\"\n                        \"shll2  v9.4s, %25.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"fadd   v11.4s, v10.4s, v11.4s      \\n\"\n\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fadd   v13.4s, v12.4s, v13.4s      \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fadd   v13.4s, v11.4s, v13.4s      \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"st1    {v13.4s}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_01), // %8\n                        \"w\"(_k00_23), // %9\n                        \"w\"(_k01_01), // %10\n                        \"w\"(_k01_23), // %11\n                
        \"w\"(_k02_01), // %12\n                        \"w\"(_k02_23), // %13\n                        \"w\"(_k10_01), // %14\n                        \"w\"(_k10_23), // %15\n                        \"w\"(_k11_01), // %16\n                        \"w\"(_k11_23), // %17\n                        \"w\"(_k12_01), // %18\n                        \"w\"(_k12_23), // %19\n                        \"w\"(_k20_01), // %20\n                        \"w\"(_k20_23), // %21\n                        \"w\"(_k21_01), // %22\n                        \"w\"(_k21_23), // %23\n                        \"w\"(_k22_01), // %24\n                        \"w\"(_k22_23)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d30-d31}, [%0 :128] \\n\" // sum0\n\n                        \"pld        [%1, #192]          \\n\"\n                        \"vld1.u16   {d2-d4}, [%1 :64]   \\n\" // r00 r01 r02\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshll.u16  q2, d4, #16         \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmul.f32   q12, q8, d0[0]      \\n\"\n                        \"vmul.f32   q13, q9, d0[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmul.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.u16   {d2-d4}, [%2 :64]   \\n\" // r10 r11 r12\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshll.u16  q2, d4, #16         \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q9, d0[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.u16   {d2-d4}, [%3 :64]   \\n\" // r20 r21 r22\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshll.u16  q2, d4, #16         \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q9, d0[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        //                         \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%4 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"add        %1, %1, #16         \\n\"\n                        \"vadd.f32   q13, q12, q13       \\n\"\n\n           
             \"add        %2, %2, #16         \\n\"\n                        \"vadd.f32   q15, q14, q15       \\n\"\n\n                        \"add        %3, %3, #16         \\n\"\n                        \"vadd.f32   q15, q13, q15       \\n\"\n\n                        \"sub        %4, %4, #256        \\n\" // kptr -= 8 * 16;\n\n                        \"vst1.f32   {d30-d31}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n\n            const float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n            const unsigned short* kptr = (const unsigned short*)kernel.channel(p).row<const unsigned short>(q);\n\n#if __aarch64__\n            // 16 * 9\n            uint16x8_t _k00_01 = vld1q_u16(kptr);\n            uint16x8_t _k00_23 = vld1q_u16(kptr + 8);\n            uint16x8_t _k01_01 = vld1q_u16(kptr + 16);\n            uint16x8_t _k01_23 = vld1q_u16(kptr + 24);\n     
       uint16x8_t _k02_01 = vld1q_u16(kptr + 32);\n            uint16x8_t _k02_23 = vld1q_u16(kptr + 40);\n            uint16x8_t _k10_01 = vld1q_u16(kptr + 48);\n            uint16x8_t _k10_23 = vld1q_u16(kptr + 56);\n            uint16x8_t _k11_01 = vld1q_u16(kptr + 64);\n            uint16x8_t _k11_23 = vld1q_u16(kptr + 72);\n            uint16x8_t _k12_01 = vld1q_u16(kptr + 80);\n            uint16x8_t _k12_23 = vld1q_u16(kptr + 88);\n            uint16x8_t _k20_01 = vld1q_u16(kptr + 96);\n            uint16x8_t _k20_23 = vld1q_u16(kptr + 104);\n            uint16x8_t _k21_01 = vld1q_u16(kptr + 112);\n            uint16x8_t _k21_23 = vld1q_u16(kptr + 120);\n            uint16x8_t _k22_01 = vld1q_u16(kptr + 128);\n            uint16x8_t _k22_23 = vld1q_u16(kptr + 136);\n#endif // __aarch64__\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%1], #64 \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%2], #64 \\n\" // r00 r01 r02 r03\n\n                        \"shll   v0.4s, v4.4h, #16           \\n\"\n                        \"shll2  v1.4s, v4.8h, #16           \\n\"\n                        \"shll   v2.4s, v5.4h, #16           \\n\"\n                        \"shll2  v3.4s, v5.8h, #16           \\n\"\n\n                        \"shll   v4.4s, v6.4h, #16           \\n\"\n                        \"shll2  v5.4s, v6.8h, #16           \\n\"\n                        \"shll   v6.4s, v7.4h, #16           \\n\"\n                        \"shll2  v7.4s, v7.8h, #16           \\n\"\n\n                        \"shll   v8.4s, %10.4h, #16          \\n\"\n          
              \"shll2  v9.4s, %10.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[1]      \\n\"\n\n                        \"shll   v8.4s, %11.4h, #16          \\n\"\n                        \"shll2  v9.4s, %11.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[3]      \\n\"\n\n                        \"shll   v8.4s, %12.4h, #16          \\n\"\n                        \"shll2  v9.4s, %12.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[1]      \\n\"\n\n                        
\"shll   v8.4s, %13.4h, #16          \\n\"\n                        \"shll2  v9.4s, %13.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%2]               \\n\" // r08\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"shll   v8.4s, %14.4h, #16          \\n\"\n                        \"shll2  v9.4s, %14.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %15.4h, #16          \\n\"\n                        \"shll2  v9.4s, %15.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla 
  v10.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%3], #64 \\n\" // r10 r11 r12 r13\n\n                        \"shll   v0.4s, v4.4h, #16           \\n\"\n                        \"shll2  v1.4s, v4.8h, #16           \\n\"\n                        \"shll   v2.4s, v5.4h, #16           \\n\"\n                        \"shll2  v3.4s, v5.8h, #16           \\n\"\n\n                        \"shll   v4.4s, v6.4h, #16           \\n\"\n                        \"shll2  v5.4s, v6.8h, #16           \\n\"\n                        \"shll   v6.4s, v7.4h, #16           \\n\"\n                        \"shll2  v7.4s, v7.8h, #16           \\n\"\n\n                        \"shll   v8.4s, %16.4h, #16          \\n\"\n                        \"shll2  v9.4s, %16.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[1]      \\n\"\n\n                        \"shll   v8.4s, %17.4h, #16          \\n\"\n                        \"shll2  v9.4s, %17.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n              
          \"fmla   v13.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[3]      \\n\"\n\n                        \"shll   v8.4s, %18.4h, #16          \\n\"\n                        \"shll2  v9.4s, %18.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[1]      \\n\"\n\n                        \"shll   v8.4s, %19.4h, #16          \\n\"\n                        \"shll2  v9.4s, %19.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3]               \\n\" // r18\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"shll   v8.4s, %20.4h, #16          \\n\"\n                     
   \"shll2  v9.4s, %20.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %21.4h, #16          \\n\"\n                        \"shll2  v9.4s, %21.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%4], #64 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v4.4h, #16           \\n\"\n                        \"shll2  v1.4s, v4.8h, #16           \\n\"\n                        \"shll   v2.4s, v5.4h, #16           \\n\"\n                        \"shll2  v3.4s, v5.8h, #16           \\n\"\n\n                        \"shll   v4.4s, v6.4h, #16           \\n\"\n                        \"shll2  v5.4s, v6.8h, #16           \\n\"\n                        \"shll   v6.4s, v7.4h, #16           \\n\"\n                        \"shll2  v7.4s, v7.8h, #16           \\n\"\n\n    
                    \"shll   v8.4s, %22.4h, #16          \\n\"\n                        \"shll2  v9.4s, %22.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[1]      \\n\"\n\n                        \"shll   v8.4s, %23.4h, #16          \\n\"\n                        \"shll2  v9.4s, %23.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v6.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v6.s[3]      \\n\"\n\n                        \"shll   v8.4s, %24.4h, #16          \\n\"\n                        \"shll2  v9.4s, %24.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[1]      \\n\"\n                      
  \"fmla   v13.4s, v9.4s, v7.s[1]      \\n\"\n\n                        \"shll   v8.4s, %25.4h, #16          \\n\"\n                        \"shll2  v9.4s, %25.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v7.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v3.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v5.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v7.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%4]               \\n\" // r28\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"shll   v8.4s, %26.4h, #16          \\n\"\n                        \"shll2  v9.4s, %26.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[0]      \\n\"\n                        \"fmla   v13.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[1]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[1]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[1]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %27.4h, #16          \\n\"\n                        \"shll2  v9.4s, %27.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v8.4s, v6.s[2]      \\n\"\n                        
\"fmla   v13.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v10.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v11.4s, v9.4s, v4.s[3]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v6.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"shrn   v10.4h, v10.4s, #16         \\n\"\n                        \"shrn   v11.4h, v11.4s, #16         \\n\"\n                        \"shrn   v12.4h, v12.4s, #16         \\n\"\n                        \"shrn   v13.4h, v13.4s, #16         \\n\"\n\n                        \"st1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_01), // %10\n                        \"w\"(_k00_23), // %11\n                        \"w\"(_k01_01), // %12\n                        \"w\"(_k01_23), // %13\n                        \"w\"(_k02_01), // %14\n                        \"w\"(_k02_23), // %15\n                        \"w\"(_k10_01), // %16\n                        \"w\"(_k10_23), // %17\n                        \"w\"(_k11_01), // %18\n                        \"w\"(_k11_23), // %19\n                        \"w\"(_k12_01), // %20\n                        \"w\"(_k12_23), // %21\n                        \"w\"(_k20_01), // %22\n                        \"w\"(_k20_23), // %23\n                        \"w\"(_k21_01), // %24\n                        \"w\"(_k21_23), // %25\n                        \"w\"(_k22_01), // %26\n                        \"w\"(_k22_23)  
// %27\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d24-d31}      \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d8-d15}       \\n\" // r00 r01 r02 r03 r04 r05 r06 r07\n\n                        \"vshll.u16  q0, d8, #16         \\n\"\n                        \"vshll.u16  q1, d9, #16         \\n\"\n                        \"vshll.u16  q2, d10, #16        \\n\"\n                        \"vshll.u16  q3, d11, #16        \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.f32   {d1}, [%2 :64]      \\n\" // r08\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d8-d15}       \\n\" // r10 r11 r12 r13 r14 r15 r16 r17\n\n                        \"vshll.u16  q0, d8, #16         \\n\"\n                        \"vshll.u16  q1, d9, #16         \\n\"\n                        \"vshll.u16  q2, d10, #16        \\n\"\n                        \"vshll.u16  q3, d11, #16        \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        
\"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.f32   {d1}, [%3 :64]      \\n\" // r18\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vldm       %4!, {d8-d15}       \\n\" // r20 r21 r22 r23 r24 r25 r26 r27\n\n                        \"vshll.u16  q0, d8, #16         \\n\"\n                        \"vshll.u16  q1, d9, #16         \\n\"\n                        \"vshll.u16  q2, d10, #16        \\n\"\n                        \"vshll.u16  q3, d11, #16        \\n\"\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        
\"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.f32   {d1}, [%4 :64]      \\n\" // r28\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        //                         \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                      
  \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"sub        %5, %5, #256        \\n\" // kptr -= 8 * 16;\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n                        \"vshrn.u32  d25, q13, #16       \\n\"\n                        \"vshrn.u32  d26, q14, #16       \\n\"\n                        \"vshrn.u32  d27, q15, #16       \\n\"\n\n                        \"vst1.f32   {d24-d27}, [%0 :64]! 
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(kptr)          // %5\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v12.4s, v13.4s}, [%1], #32 \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r00 r01 r02 r03\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %10.4h, #16          \\n\"\n                        \"shll2  v7.4s, %10.8h, #16          \\n\"\n\n                        \"fmul   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmul   v11.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"shll   v8.4s, %11.4h, #16          \\n\"\n                        
\"shll2  v9.4s, %11.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v4.4h}, [%2]               \\n\" // r04\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %12.4h, #16          \\n\"\n                        \"shll2  v7.4s, %12.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v1.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"shll   v8.4s, %13.4h, #16          \\n\"\n                        \"shll2  v9.4s, %13.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   v6.4s, %14.4h, #16          \\n\"\n                        \"shll2  v7.4s, %14.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v2.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v4.s[1]      \\n\"\n\n                        \"shll   v8.4s, %15.4h, #16          \\n\"\n                        \"shll2  v9.4s, %15.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                       
 \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r10 r11 r12 r13\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %16.4h, #16          \\n\"\n                        \"shll2  v7.4s, %16.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"shll   v8.4s, %17.4h, #16          \\n\"\n                        \"shll2  v9.4s, %17.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v4.4h}, [%3]               \\n\" // r14\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %18.4h, #16          \\n\"\n                        \"shll2  v7.4s, %18.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v1.s[1]      
\\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"shll   v8.4s, %19.4h, #16          \\n\"\n                        \"shll2  v9.4s, %19.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   v6.4s, %20.4h, #16          \\n\"\n                        \"shll2  v7.4s, %20.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v2.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v4.s[1]      \\n\"\n\n                        \"shll   v8.4s, %21.4h, #16          \\n\"\n                        \"shll2  v9.4s, %21.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %22.4h, #16          \\n\"\n                        \"shll2  v7.4s, %22.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmla   
v11.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"shll   v8.4s, %23.4h, #16          \\n\"\n                        \"shll2  v9.4s, %23.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v4.4h}, [%4]               \\n\" // r24\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %24.4h, #16          \\n\"\n                        \"shll2  v7.4s, %24.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v1.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"shll   v8.4s, %25.4h, #16          \\n\"\n                        \"shll2  v9.4s, %25.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   v6.4s, %26.4h, #16          \\n\"\n                        \"shll2  v7.4s, %26.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v12.4s, v7.4s, v2.s[1]      \\n\"\n                        \"fmla   
v13.4s, v7.4s, v4.s[1]      \\n\"\n\n                        \"shll   v8.4s, %27.4h, #16          \\n\"\n                        \"shll2  v9.4s, %27.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v11.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v12.4s, v9.4s, v2.s[3]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"fadd   v12.4s, v10.4s, v12.4s      \\n\"\n                        \"fadd   v13.4s, v11.4s, v13.4s      \\n\"\n\n                        \"shrn   v12.4h, v12.4s, #16         \\n\"\n                        \"shrn   v13.4h, v13.4s, #16         \\n\"\n\n                        \"st1    {v12.4h, v13.4h}, [%0], #16 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_01), // %10\n                        \"w\"(_k00_23), // %11\n                        \"w\"(_k01_01), // %12\n                        \"w\"(_k01_23), // %13\n                        \"w\"(_k02_01), // %14\n                        \"w\"(_k02_23), // %15\n                        \"w\"(_k10_01), // %16\n                        \"w\"(_k10_23), // %17\n                        \"w\"(_k11_01), // %18\n                        \"w\"(_k11_23), // %19\n                        \"w\"(_k12_01), // %20\n                        \"w\"(_k12_23), // %21\n                        \"w\"(_k20_01), // %22\n                        \"w\"(_k20_23), // %23\n                        \"w\"(_k21_01), // %24\n                     
   \"w\"(_k21_23), // %25\n                        \"w\"(_k22_01), // %26\n                        \"w\"(_k22_23)  // %27\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d28-d31}, [%1 :128]! \\n\" // sum0 sum1\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmul.f32   q12, q8, d0[0]      \\n\"\n                        \"vmul.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.f32   {d9}, [%2 :64]      \\n\" // r04\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  \\n\" // r10 r11 r12 r13\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.f32   {d9}, [%3 :64]      \\n\" // r14\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   
{d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%4 :64]!  \\n\" // r20 r21 r22 r23\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.f32   {d9}, [%4 :64]      \\n\" // r24\n\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        //                         \"pld        [%5, #256]          \\n\"\n                 
       \"vld1.u16   {d20-d23}, [%5 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q14, q12, q14       \\n\"\n                        \"vadd.f32   q15, q13, q15       \\n\"\n\n                        \"sub        %5, %5, #256        \\n\" // kptr -= 8 * 16;\n\n                        \"vshrn.u32  d28, q14, #16       \\n\"\n                        \"vshrn.u32  d29, q15, #16       \\n\"\n\n                        \"vst1.f32   {d28-d29}, [%0 :64]! 
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(kptr)          // %5\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v13.4s}, [%1], #16         \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%2] \\n\" // r00 r01 r02\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %10.4h, #16          \\n\"\n                        \"shll2  v7.4s, %10.8h, #16          \\n\"\n\n                        \"fmul   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmul   v11.4s, v7.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %11.4h, #16          \\n\"\n                        \"shll2  v9.4s, %11.8h, #16          \\n\"\n\n                        \"fmul   v12.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"shll   v6.4s, %12.4h, #16   
       \\n\"\n                        \"shll2  v7.4s, %12.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v1.s[1]      \\n\"\n\n                        \"shll   v8.4s, %13.4h, #16          \\n\"\n                        \"shll2  v9.4s, %13.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v1.s[3]      \\n\"\n\n                        \"shll   v6.4s, %14.4h, #16          \\n\"\n                        \"shll2  v7.4s, %14.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"shll   v8.4s, %15.4h, #16          \\n\"\n                        \"shll2  v9.4s, %15.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v3.4h, v4.4h, v5.4h}, [%3] \\n\" // r10 r11 r12\n\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %16.4h, #16          \\n\"\n                        \"shll2  v7.4s, %16.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v3.s[1]      \\n\"\n\n                        \"shll   v8.4s, %17.4h, #16          \\n\"\n                        \"shll2  v9.4s, %17.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v3.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v3.s[3]      \\n\"\n\n                        \"shll   
v6.4s, %18.4h, #16          \\n\"\n                        \"shll2  v7.4s, %18.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v4.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v4.s[1]      \\n\"\n\n                        \"shll   v8.4s, %19.4h, #16          \\n\"\n                        \"shll2  v9.4s, %19.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v4.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v4.s[3]      \\n\"\n\n                        \"shll   v6.4s, %20.4h, #16          \\n\"\n                        \"shll2  v7.4s, %20.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v5.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v5.s[1]      \\n\"\n\n                        \"shll   v8.4s, %21.4h, #16          \\n\"\n                        \"shll2  v9.4s, %21.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v5.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v5.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%4] \\n\" // r20 r21 r22\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"shll   v6.4s, %22.4h, #16          \\n\"\n                        \"shll2  v7.4s, %22.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v0.s[1]      \\n\"\n\n                        \"shll   v8.4s, %23.4h, #16          \\n\"\n                        \"shll2  v9.4s, %23.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v0.s[3]      \\n\"\n\n                  
      \"shll   v6.4s, %24.4h, #16          \\n\"\n                        \"shll2  v7.4s, %24.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v1.s[1]      \\n\"\n\n                        \"shll   v8.4s, %25.4h, #16          \\n\"\n                        \"shll2  v9.4s, %25.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v1.s[3]      \\n\"\n\n                        \"shll   v6.4s, %26.4h, #16          \\n\"\n                        \"shll2  v7.4s, %26.8h, #16          \\n\"\n\n                        \"fmla   v10.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v11.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"shll   v8.4s, %27.4h, #16          \\n\"\n                        \"shll2  v9.4s, %27.8h, #16          \\n\"\n\n                        \"fmla   v12.4s, v8.4s, v2.s[2]      \\n\"\n                        \"fmla   v13.4s, v9.4s, v2.s[3]      \\n\"\n\n                        \"fadd   v11.4s, v10.4s, v11.4s      \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fadd   v13.4s, v12.4s, v13.4s      \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n                        \"fadd   v13.4s, v11.4s, v13.4s      \\n\"\n\n                        \"add    %4, %4, #16                 \\n\"\n                        \"shrn   v13.4h, v13.4s, #16         \\n\"\n\n                        \"st1    {v13.4h}, [%0], #8          \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2)            // %4\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n    
                    \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_01), // %10\n                        \"w\"(_k00_23), // %11\n                        \"w\"(_k01_01), // %12\n                        \"w\"(_k01_23), // %13\n                        \"w\"(_k02_01), // %14\n                        \"w\"(_k02_23), // %15\n                        \"w\"(_k10_01), // %16\n                        \"w\"(_k10_23), // %17\n                        \"w\"(_k11_01), // %18\n                        \"w\"(_k11_23), // %19\n                        \"w\"(_k12_01), // %20\n                        \"w\"(_k12_23), // %21\n                        \"w\"(_k20_01), // %22\n                        \"w\"(_k20_23), // %23\n                        \"w\"(_k21_01), // %24\n                        \"w\"(_k21_23), // %25\n                        \"w\"(_k22_01), // %26\n                        \"w\"(_k22_23)  // %27\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d30-d31}, [%1 :128]! \\n\" // sum0\n\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.u16   {d2-d4}, [%2 :64]   \\n\" // r00 r01 r02\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshll.u16  q2, d4, #16         \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmul.f32   q12, q8, d0[0]      \\n\"\n                        \"vmul.f32   q13, q9, d0[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmul.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.u16   {d2-d4}, [%3 :64]   \\n\" // r10 r11 r12\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshll.u16  q2, d4, #16         \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q9, d0[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%4, #192]          \\n\"\n                        \"vld1.u16   {d2-d4}, [%4 :64]   \\n\" // r20 r21 r22\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshll.u16  q2, d4, #16         \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q9, d0[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        //                         \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%5 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"add        %2, %2, #16         \\n\"\n                        \"vadd.f32   q13, q12, q13       \\n\"\n\n           
             \"add        %3, %3, #16         \\n\"\n                        \"vadd.f32   q15, q14, q15       \\n\"\n\n                        \"add        %4, %4, #16         \\n\"\n                        \"vadd.f32   q15, q13, q15       \\n\"\n\n                        \"sub        %5, %5, #256        \\n\" // kptr -= 8 * 16 * 2;\n\n                        \"vshrn.u32  d31, q15, #16       \\n\"\n\n                        \"vst1.u16   {d31}, [%0 :64]!    \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(kptr)          // %5\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack4_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack4_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x4_t _bias0 = bias ? vld1_f16(bias + p * 4) : vdup_n_f16((__fp16)0.f);\n        out0.fill(_bias0);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            __fp16* outptr0 = out0.row<__fp16>(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n\n            const __fp16* kptr = kernel.channel(p).row<const __fp16>(q);\n\n            // 16 * 9\n            float16x8_t _k00_01 = vld1q_f16(kptr);\n            float16x8_t _k00_23 = vld1q_f16(kptr + 8);\n            float16x8_t _k01_01 = vld1q_f16(kptr + 16);\n            float16x8_t _k01_23 = vld1q_f16(kptr + 24);\n            float16x8_t _k02_01 = vld1q_f16(kptr + 32);\n            float16x8_t _k02_23 = vld1q_f16(kptr + 40);\n            float16x8_t _k10_01 = vld1q_f16(kptr + 48);\n            float16x8_t _k10_23 = vld1q_f16(kptr + 56);\n            float16x8_t _k11_01 = vld1q_f16(kptr + 64);\n            float16x8_t _k11_23 = vld1q_f16(kptr + 72);\n            float16x8_t _k12_01 = vld1q_f16(kptr + 80);\n            float16x8_t _k12_23 = vld1q_f16(kptr + 88);\n            float16x8_t _k20_01 = vld1q_f16(kptr + 96);\n            float16x8_t _k20_23 = vld1q_f16(kptr + 104);\n            float16x8_t _k21_01 = vld1q_f16(kptr + 112);\n            float16x8_t _k21_23 = vld1q_f16(kptr + 
120);\n            float16x8_t _k22_01 = vld1q_f16(kptr + 128);\n            float16x8_t _k22_23 = vld1q_f16(kptr + 136);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%1] \\n\" // r00 r01 r02 r03 r04 r05\n\n                        \"ext    v6.16b, %8.16b, %8.16b, #8  \\n\"\n                        \"fmla   v10.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmla   v11.4h, %8.4h, v0.h[4]      \\n\"\n                        \"fmla   v12.4h, %8.4h, v1.h[0]      \\n\"\n                        \"fmla   v13.4h, %8.4h, v1.h[4]      \\n\"\n                        \"fmla   v10.4h, v6.4h, v0.h[1]      \\n\"\n                        \"fmla   v11.4h, v6.4h, v0.h[5]      \\n\"\n                        \"fmla   v12.4h, v6.4h, v1.h[1]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v1.h[5]      \\n\"\n                        \"ext    v7.16b, %9.16b, %9.16b, #8  \\n\"\n                        \"fmla   v10.4h, %9.4h, v0.h[2]      \\n\"\n                        \"fmla   v11.4h, %9.4h, v0.h[6]      \\n\"\n                        \"fmla   v12.4h, %9.4h, v1.h[2]      \\n\"\n                        \"fmla   v13.4h, %9.4h, v1.h[6]      \\n\"\n                        \"fmla   v10.4h, v7.4h, v0.h[3]      \\n\"\n                        \"fmla   v11.4h, v7.4h, v0.h[7]      \\n\"\n                        \"fmla   v12.4h, v7.4h, v1.h[3]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v1.h[7]      \\n\"\n\n                        \"ext    v8.16b, %10.16b, %10.16b, #8 \\n\"\n                        \"fmla   v10.4h, 
%10.4h, v0.h[4]     \\n\"\n                        \"fmla   v11.4h, %10.4h, v1.h[0]     \\n\"\n                        \"fmla   v12.4h, %10.4h, v1.h[4]     \\n\"\n                        \"fmla   v13.4h, %10.4h, v2.h[0]     \\n\"\n                        \"fmla   v10.4h, v8.4h, v0.h[5]      \\n\"\n                        \"fmla   v11.4h, v8.4h, v1.h[1]      \\n\"\n                        \"fmla   v12.4h, v8.4h, v1.h[5]      \\n\"\n                        \"fmla   v13.4h, v8.4h, v2.h[1]      \\n\"\n                        \"ext    v9.16b, %11.16b, %11.16b, #8 \\n\"\n                        \"fmla   v10.4h, %11.4h, v0.h[6]     \\n\"\n                        \"fmla   v11.4h, %11.4h, v1.h[2]     \\n\"\n                        \"fmla   v12.4h, %11.4h, v1.h[6]     \\n\"\n                        \"fmla   v13.4h, %11.4h, v2.h[2]     \\n\"\n                        \"fmla   v10.4h, v9.4h, v0.h[7]      \\n\"\n                        \"fmla   v11.4h, v9.4h, v1.h[3]      \\n\"\n                        \"fmla   v12.4h, v9.4h, v1.h[7]      \\n\"\n                        \"fmla   v13.4h, v9.4h, v2.h[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v3.8h, v4.8h, v5.8h}, [%2] \\n\" // r10 r11 r12 r13 r14 r15\n\n                        \"ext    v6.16b, %12.16b, %12.16b, #8 \\n\"\n                        \"fmla   v10.4h, %12.4h, v1.h[0]     \\n\"\n                        \"fmla   v11.4h, %12.4h, v1.h[4]     \\n\"\n                        \"fmla   v12.4h, %12.4h, v2.h[0]     \\n\"\n                        \"fmla   v13.4h, %12.4h, v2.h[4]     \\n\"\n                        \"fmla   v10.4h, v6.4h, v1.h[1]      \\n\"\n                        \"fmla   v11.4h, v6.4h, v1.h[5]      \\n\"\n                        \"fmla   v12.4h, v6.4h, v2.h[1]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v2.h[5]      \\n\"\n                        \"ext    v7.16b, %13.16b, %13.16b, #8 \\n\"\n                        \"fmla   
v10.4h, %13.4h, v1.h[2]     \\n\"\n                        \"fmla   v11.4h, %13.4h, v1.h[6]     \\n\"\n                        \"fmla   v12.4h, %13.4h, v2.h[2]     \\n\"\n                        \"fmla   v13.4h, %13.4h, v2.h[6]     \\n\"\n                        \"fmla   v10.4h, v7.4h, v1.h[3]      \\n\"\n                        \"fmla   v11.4h, v7.4h, v1.h[7]      \\n\"\n                        \"fmla   v12.4h, v7.4h, v2.h[3]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v2.h[7]      \\n\"\n\n                        \"ext    v8.16b, %14.16b, %14.16b, #8 \\n\"\n                        \"fmla   v10.4h, %14.4h, v3.h[0]     \\n\"\n                        \"fmla   v11.4h, %14.4h, v3.h[4]     \\n\"\n                        \"fmla   v12.4h, %14.4h, v4.h[0]     \\n\"\n                        \"fmla   v13.4h, %14.4h, v4.h[4]     \\n\"\n                        \"fmla   v10.4h, v8.4h, v3.h[1]      \\n\"\n                        \"fmla   v11.4h, v8.4h, v3.h[5]      \\n\"\n                        \"fmla   v12.4h, v8.4h, v4.h[1]      \\n\"\n                        \"fmla   v13.4h, v8.4h, v4.h[5]      \\n\"\n                        \"ext    v9.16b, %15.16b, %15.16b, #8 \\n\"\n                        \"fmla   v10.4h, %15.4h, v3.h[2]     \\n\"\n                        \"fmla   v11.4h, %15.4h, v3.h[6]     \\n\"\n                        \"fmla   v12.4h, %15.4h, v4.h[2]     \\n\"\n                        \"fmla   v13.4h, %15.4h, v4.h[6]     \\n\"\n                        \"fmla   v10.4h, v9.4h, v3.h[3]      \\n\"\n                        \"fmla   v11.4h, v9.4h, v3.h[7]      \\n\"\n                        \"fmla   v12.4h, v9.4h, v4.h[3]      \\n\"\n                        \"fmla   v13.4h, v9.4h, v4.h[7]      \\n\"\n\n                        \"ext    v6.16b, %16.16b, %16.16b, #8 \\n\"\n                        \"fmla   v10.4h, %16.4h, v3.h[4]     \\n\"\n                        \"fmla   v11.4h, %16.4h, v4.h[0]     \\n\"\n                        \"fmla   v12.4h, %16.4h, 
v4.h[4]     \\n\"\n                        \"fmla   v13.4h, %16.4h, v5.h[0]     \\n\"\n                        \"fmla   v10.4h, v6.4h, v3.h[5]      \\n\"\n                        \"fmla   v11.4h, v6.4h, v4.h[1]      \\n\"\n                        \"fmla   v12.4h, v6.4h, v4.h[5]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v5.h[1]      \\n\"\n                        \"ext    v7.16b, %17.16b, %17.16b, #8 \\n\"\n                        \"fmla   v10.4h, %17.4h, v3.h[6]     \\n\"\n                        \"fmla   v11.4h, %17.4h, v4.h[2]     \\n\"\n                        \"fmla   v12.4h, %17.4h, v4.h[6]     \\n\"\n                        \"fmla   v13.4h, %17.4h, v5.h[2]     \\n\"\n                        \"fmla   v10.4h, v7.4h, v3.h[7]      \\n\"\n                        \"fmla   v11.4h, v7.4h, v4.h[3]      \\n\"\n                        \"fmla   v12.4h, v7.4h, v4.h[7]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v5.h[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%3] \\n\" // r20 r21 r22 r23 r24 r25\n\n                        \"ext    v8.16b, %18.16b, %18.16b, #8 \\n\"\n                        \"fmla   v10.4h, %18.4h, v4.h[0]     \\n\"\n                        \"fmla   v11.4h, %18.4h, v4.h[4]     \\n\"\n                        \"fmla   v12.4h, %18.4h, v5.h[0]     \\n\"\n                        \"fmla   v13.4h, %18.4h, v5.h[4]     \\n\"\n                        \"fmla   v10.4h, v8.4h, v4.h[1]      \\n\"\n                        \"fmla   v11.4h, v8.4h, v4.h[5]      \\n\"\n                        \"fmla   v12.4h, v8.4h, v5.h[1]      \\n\"\n                        \"fmla   v13.4h, v8.4h, v5.h[5]      \\n\"\n                        \"ext    v9.16b, %19.16b, %19.16b, #8 \\n\"\n                        \"fmla   v10.4h, %19.4h, v4.h[2]     \\n\"\n                        \"fmla   v11.4h, %19.4h, v4.h[6]     \\n\"\n                        \"fmla   v12.4h, 
%19.4h, v5.h[2]     \\n\"\n                        \"fmla   v13.4h, %19.4h, v5.h[6]     \\n\"\n                        \"fmla   v10.4h, v9.4h, v4.h[3]      \\n\"\n                        \"fmla   v11.4h, v9.4h, v4.h[7]      \\n\"\n                        \"fmla   v12.4h, v9.4h, v5.h[3]      \\n\"\n                        \"fmla   v13.4h, v9.4h, v5.h[7]      \\n\"\n\n                        \"ext    v6.16b, %20.16b, %20.16b, #8 \\n\"\n                        \"fmla   v10.4h, %20.4h, v0.h[0]     \\n\"\n                        \"fmla   v11.4h, %20.4h, v0.h[4]     \\n\"\n                        \"fmla   v12.4h, %20.4h, v1.h[0]     \\n\"\n                        \"fmla   v13.4h, %20.4h, v1.h[4]     \\n\"\n                        \"fmla   v10.4h, v6.4h, v0.h[1]      \\n\"\n                        \"fmla   v11.4h, v6.4h, v0.h[5]      \\n\"\n                        \"fmla   v12.4h, v6.4h, v1.h[1]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v1.h[5]      \\n\"\n                        \"ext    v7.16b, %21.16b, %21.16b, #8 \\n\"\n                        \"fmla   v10.4h, %21.4h, v0.h[2]     \\n\"\n                        \"fmla   v11.4h, %21.4h, v0.h[6]     \\n\"\n                        \"fmla   v12.4h, %21.4h, v1.h[2]     \\n\"\n                        \"fmla   v13.4h, %21.4h, v1.h[6]     \\n\"\n                        \"fmla   v10.4h, v7.4h, v0.h[3]      \\n\"\n                        \"fmla   v11.4h, v7.4h, v0.h[7]      \\n\"\n                        \"fmla   v12.4h, v7.4h, v1.h[3]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v1.h[7]      \\n\"\n\n                        \"ext    v8.16b, %22.16b, %22.16b, #8 \\n\"\n                        \"fmla   v10.4h, %22.4h, v0.h[4]     \\n\"\n                        \"fmla   v11.4h, %22.4h, v1.h[0]     \\n\"\n                        \"fmla   v12.4h, %22.4h, v1.h[4]     \\n\"\n                        \"fmla   v13.4h, %22.4h, v2.h[0]     \\n\"\n                        \"fmla   v10.4h, v8.4h, v0.h[5]      
\\n\"\n                        \"fmla   v11.4h, v8.4h, v1.h[1]      \\n\"\n                        \"fmla   v12.4h, v8.4h, v1.h[5]      \\n\"\n                        \"fmla   v13.4h, v8.4h, v2.h[1]      \\n\"\n                        \"ext    v9.16b, %23.16b, %23.16b, #8 \\n\"\n                        \"fmla   v10.4h, %23.4h, v0.h[6]     \\n\"\n                        \"fmla   v11.4h, %23.4h, v1.h[2]     \\n\"\n                        \"fmla   v12.4h, %23.4h, v1.h[6]     \\n\"\n                        \"fmla   v13.4h, %23.4h, v2.h[2]     \\n\"\n                        \"fmla   v10.4h, v9.4h, v0.h[7]      \\n\"\n                        \"fmla   v11.4h, v9.4h, v1.h[3]      \\n\"\n                        \"fmla   v12.4h, v9.4h, v1.h[7]      \\n\"\n                        \"fmla   v13.4h, v9.4h, v2.h[3]      \\n\"\n\n                        \"ext    v6.16b, %24.16b, %24.16b, #8 \\n\"\n                        \"fmla   v10.4h, %24.4h, v1.h[0]     \\n\"\n                        \"fmla   v11.4h, %24.4h, v1.h[4]     \\n\"\n                        \"fmla   v12.4h, %24.4h, v2.h[0]     \\n\"\n                        \"fmla   v13.4h, %24.4h, v2.h[4]     \\n\"\n\n                        \"add    %1, %1, #32                 \\n\"\n\n                        \"fmla   v10.4h, v6.4h, v1.h[1]      \\n\"\n                        \"fmla   v11.4h, v6.4h, v1.h[5]      \\n\"\n                        \"fmla   v12.4h, v6.4h, v2.h[1]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v2.h[5]      \\n\"\n                        \"ext    v7.16b, %25.16b, %25.16b, #8 \\n\"\n                        \"fmla   v10.4h, %25.4h, v1.h[2]     \\n\"\n                        \"fmla   v11.4h, %25.4h, v1.h[6]     \\n\"\n                        \"fmla   v12.4h, %25.4h, v2.h[2]     \\n\"\n                        \"fmla   v13.4h, %25.4h, v2.h[6]     \\n\"\n\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"fmla   v10.4h, v7.4h, v1.h[3]      \\n\"\n        
                \"fmla   v11.4h, v7.4h, v1.h[7]      \\n\"\n                        \"fmla   v12.4h, v7.4h, v2.h[3]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v2.h[7]      \\n\"\n\n                        \"add    %3, %3, #32                 \\n\"\n\n                        \"st1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_01), // %8\n                        \"w\"(_k00_23), // %9\n                        \"w\"(_k01_01), // %10\n                        \"w\"(_k01_23), // %11\n                        \"w\"(_k02_01), // %12\n                        \"w\"(_k02_23), // %13\n                        \"w\"(_k10_01), // %14\n                        \"w\"(_k10_23), // %15\n                        \"w\"(_k11_01), // %16\n                        \"w\"(_k11_23), // %17\n                        \"w\"(_k12_01), // %18\n                        \"w\"(_k12_23), // %19\n                        \"w\"(_k20_01), // %20\n                        \"w\"(_k20_23), // %21\n                        \"w\"(_k21_01), // %22\n                        \"w\"(_k21_23), // %23\n                        \"w\"(_k22_01), // %24\n                        \"w\"(_k22_23)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%1]        \\n\" // r00 r01 r02 r03\n\n       
                 \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v12.4h, v13.4h}, [%0]      \\n\" // sum0 sum1\n\n                        \"ext    v4.16b, %8.16b, %8.16b, #8  \\n\"\n                        \"fmul   v10.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmul   v11.4h, %8.4h, v0.h[4]      \\n\"\n                        \"fmla   v12.4h, v4.4h, v0.h[1]      \\n\"\n                        \"fmla   v13.4h, v4.4h, v0.h[5]      \\n\"\n                        \"ext    v5.16b, %9.16b, %9.16b, #8  \\n\"\n                        \"fmla   v10.4h, %9.4h, v0.h[2]      \\n\"\n                        \"fmla   v11.4h, %9.4h, v0.h[6]      \\n\"\n                        \"fmla   v12.4h, v5.4h, v0.h[3]      \\n\"\n                        \"fmla   v13.4h, v5.4h, v0.h[7]      \\n\"\n\n                        \"ext    v6.16b, %10.16b, %10.16b, #8 \\n\"\n                        \"fmla   v10.4h, %10.4h, v0.h[4]     \\n\"\n                        \"fmla   v11.4h, %10.4h, v1.h[0]     \\n\"\n                        \"fmla   v12.4h, v6.4h, v0.h[5]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v1.h[1]      \\n\"\n                        \"ext    v7.16b, %11.16b, %11.16b, #8 \\n\"\n                        \"fmla   v10.4h, %11.4h, v0.h[6]     \\n\"\n                        \"fmla   v11.4h, %11.4h, v1.h[2]     \\n\"\n                        \"fmla   v12.4h, v7.4h, v0.h[7]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v1.h[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v2.8h, v3.8h}, [%2]        \\n\" // r10 r11 r12 r13\n\n                        \"ext    v8.16b, %12.16b, %12.16b, #8 \\n\"\n                        \"fmla   v10.4h, %12.4h, v1.h[0]     \\n\"\n                        \"fmla   v11.4h, %12.4h, v1.h[4]     \\n\"\n                        \"fmla   v12.4h, v8.4h, v1.h[1]      \\n\"\n                        \"fmla   v13.4h, v8.4h, v1.h[5]     
 \\n\"\n                        \"ext    v9.16b, %13.16b, %13.16b, #8 \\n\"\n                        \"fmla   v10.4h, %13.4h, v1.h[2]     \\n\"\n                        \"fmla   v11.4h, %13.4h, v1.h[6]     \\n\"\n                        \"fmla   v12.4h, v9.4h, v1.h[3]      \\n\"\n                        \"fmla   v13.4h, v9.4h, v1.h[7]      \\n\"\n\n                        \"ext    v4.16b, %14.16b, %14.16b, #8 \\n\"\n                        \"fmla   v10.4h, %14.4h, v2.h[0]     \\n\"\n                        \"fmla   v11.4h, %14.4h, v2.h[4]     \\n\"\n                        \"fmla   v12.4h, v4.4h, v2.h[1]      \\n\"\n                        \"fmla   v13.4h, v4.4h, v2.h[5]      \\n\"\n                        \"ext    v5.16b, %15.16b, %15.16b, #8 \\n\"\n                        \"fmla   v10.4h, %15.4h, v2.h[2]     \\n\"\n                        \"fmla   v11.4h, %15.4h, v2.h[6]     \\n\"\n                        \"fmla   v12.4h, v5.4h, v2.h[3]      \\n\"\n                        \"fmla   v13.4h, v5.4h, v2.h[7]      \\n\"\n\n                        \"ext    v6.16b, %16.16b, %16.16b, #8 \\n\"\n                        \"fmla   v10.4h, %16.4h, v2.h[4]     \\n\"\n                        \"fmla   v11.4h, %16.4h, v3.h[0]     \\n\"\n                        \"fmla   v12.4h, v6.4h, v2.h[5]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v3.h[1]      \\n\"\n                        \"ext    v7.16b, %17.16b, %17.16b, #8 \\n\"\n                        \"fmla   v10.4h, %17.4h, v2.h[6]     \\n\"\n                        \"fmla   v11.4h, %17.4h, v3.h[2]     \\n\"\n                        \"fmla   v12.4h, v7.4h, v2.h[7]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v3.h[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%3]        \\n\" // r20 r21 r22 r23\n\n                        \"ext    v8.16b, %18.16b, %18.16b, #8 \\n\"\n                        \"fmla   v10.4h, %18.4h, 
v3.h[0]     \\n\"\n                        \"fmla   v11.4h, %18.4h, v3.h[4]     \\n\"\n                        \"fmla   v12.4h, v8.4h, v3.h[1]      \\n\"\n                        \"fmla   v13.4h, v8.4h, v3.h[5]      \\n\"\n                        \"ext    v9.16b, %19.16b, %19.16b, #8 \\n\"\n                        \"fmla   v10.4h, %19.4h, v3.h[2]     \\n\"\n                        \"fmla   v11.4h, %19.4h, v3.h[6]     \\n\"\n                        \"fmla   v12.4h, v9.4h, v3.h[3]      \\n\"\n                        \"fmla   v13.4h, v9.4h, v3.h[7]      \\n\"\n\n                        \"ext    v4.16b, %20.16b, %20.16b, #8 \\n\"\n                        \"fmla   v10.4h, %20.4h, v0.h[0]     \\n\"\n                        \"fmla   v11.4h, %20.4h, v0.h[4]     \\n\"\n                        \"fmla   v12.4h, v4.4h, v0.h[1]      \\n\"\n                        \"fmla   v13.4h, v4.4h, v0.h[5]      \\n\"\n                        \"ext    v5.16b, %21.16b, %21.16b, #8 \\n\"\n                        \"fmla   v10.4h, %21.4h, v0.h[2]     \\n\"\n                        \"fmla   v11.4h, %21.4h, v0.h[6]     \\n\"\n                        \"fmla   v12.4h, v5.4h, v0.h[3]      \\n\"\n                        \"fmla   v13.4h, v5.4h, v0.h[7]      \\n\"\n\n                        \"ext    v6.16b, %22.16b, %22.16b, #8 \\n\"\n                        \"fmla   v10.4h, %22.4h, v0.h[4]     \\n\"\n                        \"fmla   v11.4h, %22.4h, v1.h[0]     \\n\"\n                        \"fmla   v12.4h, v6.4h, v0.h[5]      \\n\"\n                        \"fmla   v13.4h, v6.4h, v1.h[1]      \\n\"\n                        \"ext    v7.16b, %23.16b, %23.16b, #8 \\n\"\n                        \"fmla   v10.4h, %23.4h, v0.h[6]     \\n\"\n                        \"fmla   v11.4h, %23.4h, v1.h[2]     \\n\"\n                        \"fmla   v12.4h, v7.4h, v0.h[7]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v1.h[3]      \\n\"\n\n                        \"ext    v8.16b, %24.16b, %24.16b, #8 
\\n\"\n                        \"fmla   v10.4h, %24.4h, v1.h[0]     \\n\"\n                        \"fmla   v11.4h, %24.4h, v1.h[4]     \\n\"\n                        \"fmla   v12.4h, v8.4h, v1.h[1]      \\n\"\n                        \"fmla   v13.4h, v8.4h, v1.h[5]      \\n\"\n                        \"ext    v9.16b, %25.16b, %25.16b, #8 \\n\"\n                        \"fmla   v10.4h, %25.4h, v1.h[2]     \\n\"\n                        \"fmla   v11.4h, %25.4h, v1.h[6]     \\n\"\n                        \"fmla   v12.4h, v9.4h, v1.h[3]      \\n\"\n                        \"fmla   v13.4h, v9.4h, v1.h[7]      \\n\"\n\n                        \"add    %1, %1, #16                 \\n\"\n\n                        \"fadd   v10.4h, v10.4h, v12.4h      \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"fadd   v11.4h, v11.4h, v13.4h      \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"st1    {v10.4h, v11.4h}, [%0], #16 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_01), // %8\n                        \"w\"(_k00_23), // %9\n                        \"w\"(_k01_01), // %10\n                        \"w\"(_k01_23), // %11\n                        \"w\"(_k02_01), // %12\n                        \"w\"(_k02_23), // %13\n                        \"w\"(_k10_01), // %14\n                        \"w\"(_k10_23), // %15\n                        \"w\"(_k11_01), // %16\n                        \"w\"(_k11_23), // %17\n                        \"w\"(_k12_01), // %18\n                        \"w\"(_k12_23), // %19\n                        
\"w\"(_k20_01), // %20\n                        \"w\"(_k20_23), // %21\n                        \"w\"(_k21_01), // %22\n                        \"w\"(_k21_23), // %23\n                        \"w\"(_k22_01), // %24\n                        \"w\"(_k22_23)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%1] \\n\" // r00 r01 r02\n\n                        \"prfm   pldl1keep, [%0, #64]        \\n\"\n                        \"ld1    {v13.4h}, [%0]              \\n\" // sum0\n\n                        \"ext    v6.16b, %8.16b, %8.16b, #8  \\n\"\n                        \"fmul   v10.4h, %8.4h, v0.h[0]      \\n\"\n                        \"fmul   v11.4h, v6.4h, v0.h[1]      \\n\"\n                        \"ext    v7.16b, %9.16b, %9.16b, #8  \\n\"\n                        \"fmul   v12.4h, %9.4h, v0.h[2]      \\n\"\n                        \"fmla   v13.4h, v7.4h, v0.h[3]      \\n\"\n\n                        \"ext    v8.16b, %10.16b, %10.16b, #8 \\n\"\n                        \"fmla   v10.4h, %10.4h, v1.h[0]     \\n\"\n                        \"fmla   v11.4h, v8.4h, v1.h[1]      \\n\"\n                        \"ext    v9.16b, %11.16b, %11.16b, #8 \\n\"\n                        \"fmla   v12.4h, %11.4h, v1.h[2]     \\n\"\n                        \"fmla   v13.4h, v9.4h, v1.h[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v3.4h, v4.4h, v5.4h}, [%2] \\n\" // r10 r11 r12\n\n                        \"ext    v6.16b, %12.16b, %12.16b, #8 \\n\"\n                        \"fmla   v10.4h, %12.4h, v2.h[0]     \\n\"\n                        \"fmla   v11.4h, v6.4h, v2.h[1]  
    \\n\"\n                        \"ext    v7.16b, %13.16b, %13.16b, #8 \\n\"\n                        \"fmla   v12.4h, %13.4h, v2.h[2]     \\n\"\n                        \"fmla   v13.4h, v7.4h, v2.h[3]      \\n\"\n\n                        \"ext    v8.16b, %14.16b, %14.16b, #8 \\n\"\n                        \"fmla   v10.4h, %14.4h, v3.h[0]     \\n\"\n                        \"fmla   v11.4h, v8.4h, v3.h[1]      \\n\"\n                        \"ext    v9.16b, %15.16b, %15.16b, #8 \\n\"\n                        \"fmla   v12.4h, %15.4h, v3.h[2]     \\n\"\n                        \"fmla   v13.4h, v9.4h, v3.h[3]      \\n\"\n\n                        \"ext    v6.16b, %16.16b, %16.16b, #8 \\n\"\n                        \"fmla   v10.4h, %16.4h, v4.h[0]     \\n\"\n                        \"fmla   v11.4h, v6.4h, v4.h[1]      \\n\"\n                        \"ext    v7.16b, %17.16b, %17.16b, #8 \\n\"\n                        \"fmla   v12.4h, %17.4h, v4.h[2]     \\n\"\n                        \"fmla   v13.4h, v7.4h, v4.h[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%3] \\n\" // r20 r21 r22\n\n                        \"ext    v8.16b, %18.16b, %18.16b, #8 \\n\"\n                        \"fmla   v10.4h, %18.4h, v5.h[0]     \\n\"\n                        \"fmla   v11.4h, v8.4h, v5.h[1]      \\n\"\n                        \"ext    v9.16b, %19.16b, %19.16b, #8 \\n\"\n                        \"fmla   v12.4h, %19.4h, v5.h[2]     \\n\"\n                        \"fmla   v13.4h, v9.4h, v5.h[3]      \\n\"\n\n                        \"ext    v6.16b, %20.16b, %20.16b, #8 \\n\"\n                        \"fmla   v10.4h, %20.4h, v0.h[0]     \\n\"\n                        \"fmla   v11.4h, v6.4h, v0.h[1]      \\n\"\n                        \"ext    v7.16b, %21.16b, %21.16b, #8 \\n\"\n                        \"fmla   v12.4h, %21.4h, v0.h[2]     \\n\"\n                        \"fmla   v13.4h, v7.4h, 
v0.h[3]      \\n\"\n\n                        \"ext    v8.16b, %22.16b, %22.16b, #8 \\n\"\n                        \"fmla   v10.4h, %22.4h, v1.h[0]     \\n\"\n                        \"fmla   v11.4h, v8.4h, v1.h[1]      \\n\"\n                        \"ext    v9.16b, %23.16b, %23.16b, #8 \\n\"\n                        \"fmla   v12.4h, %23.4h, v1.h[2]     \\n\"\n                        \"fmla   v13.4h, v9.4h, v1.h[3]      \\n\"\n\n                        \"ext    v6.16b, %24.16b, %24.16b, #8 \\n\"\n                        \"fmla   v10.4h, %24.4h, v2.h[0]     \\n\"\n                        \"fmla   v11.4h, v6.4h, v2.h[1]      \\n\"\n                        \"ext    v7.16b, %25.16b, %25.16b, #8 \\n\"\n                        \"fmla   v12.4h, %25.4h, v2.h[2]     \\n\"\n                        \"fmla   v13.4h, v7.4h, v2.h[3]      \\n\"\n\n                        \"fadd   v10.4h, v10.4h, v11.4h      \\n\"\n\n                        \"add    %1, %1, #8                  \\n\"\n\n                        \"fadd   v12.4h, v12.4h, v13.4h      \\n\"\n\n                        \"add    %2, %2, #8                  \\n\"\n\n                        \"fadd   v10.4h, v10.4h, v12.4h      \\n\"\n\n                        \"add    %3, %3, #8                  \\n\"\n\n                        \"st1    {v10.4h}, [%0], #8          \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00_01), // %8\n                        \"w\"(_k00_23), // %9\n                        \"w\"(_k01_01), // %10\n                        \"w\"(_k01_23), // %11\n                        \"w\"(_k02_01), // %12\n                        \"w\"(_k02_23), // %13\n                        
\"w\"(_k10_01), // %14\n                        \"w\"(_k10_23), // %15\n                        \"w\"(_k11_01), // %16\n                        \"w\"(_k11_23), // %17\n                        \"w\"(_k12_01), // %18\n                        \"w\"(_k12_23), // %19\n                        \"w\"(_k20_01), // %20\n                        \"w\"(_k20_23), // %21\n                        \"w\"(_k21_01), // %22\n                        \"w\"(_k21_23), // %23\n                        \"w\"(_k22_01), // %24\n                        \"w\"(_k22_23)  // %25\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n                }\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack4to1.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack4to1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    int remain_outch_start = 0;\n\n#if __ARM_NEON && __aarch64__\n    int nn_outch = 0;\n    nn_outch = outch >> 1;\n    remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        Mat out0 = top_blob.channel(p);\n        Mat out1 = top_blob.channel(p + 1);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float bias1 = bias ? bias[p + 1] : 0.f;\n        out0.fill(bias0);\n        out1.fill(bias1);\n\n        const float* k0 = kernel.channel(p);\n        const float* k1 = kernel.channel(p + 1);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n            float* outptr1 = out1;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            float32x4_t _k00_0 = vld1q_f32(k0);\n            float32x4_t _k01_0 = vld1q_f32(k0 + 4);\n            float32x4_t _k02_0 = vld1q_f32(k0 + 8);\n            float32x4_t _k10_0 = vld1q_f32(k0 + 12);\n            float32x4_t _k11_0 = vld1q_f32(k0 + 16);\n            float32x4_t _k12_0 = vld1q_f32(k0 + 20);\n            float32x4_t _k20_0 = vld1q_f32(k0 + 24);\n            float32x4_t _k21_0 = vld1q_f32(k0 + 28);\n            float32x4_t _k22_0 = vld1q_f32(k0 + 32);\n\n            float32x4_t _k00_1 = vld1q_f32(k1);\n            float32x4_t _k01_1 = vld1q_f32(k1 + 4);\n            float32x4_t _k02_1 = vld1q_f32(k1 + 8);\n            float32x4_t _k10_1 
= vld1q_f32(k1 + 12);\n            float32x4_t _k11_1 = vld1q_f32(k1 + 16);\n            float32x4_t _k12_1 = vld1q_f32(k1 + 20);\n            float32x4_t _k20_1 = vld1q_f32(k1 + 24);\n            float32x4_t _k21_1 = vld1q_f32(k1 + 28);\n            float32x4_t _k22_1 = vld1q_f32(k1 + 32);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\" // r00 r01 r02 r03\n\n                        \"fmul   v16.4s, %10.4s, v0.4s       \\n\"\n                        \"fmul   v17.4s, %19.4s, v0.4s       \\n\"\n                        \"fmul   v18.4s, %10.4s, v1.4s       \\n\"\n                        \"fmul   v19.4s, %19.4s, v1.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%2]        \\n\" // r04 r05\n\n                        \"fmul   v6.4s, %10.4s, v2.4s        \\n\"\n                        \"fmul   v7.4s, %19.4s, v2.4s        \\n\"\n                        \"fmul   v8.4s, %10.4s, v3.4s        \\n\"\n                        \"fmul   v9.4s, %19.4s, v3.4s        \\n\"\n\n                        \"fmla   v16.4s, %11.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %20.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %11.4s, v2.4s       \\n\"\n                        \"fmla   v19.4s, %20.4s, v2.4s       \\n\"\n                        \"fmla   v6.4s, %11.4s, v3.4s        \\n\"\n                        \"fmla   v7.4s, %20.4s, v3.4s        \\n\"\n                        \"fmla   v8.4s, %11.4s, v4.4s        \\n\"\n                        \"fmla   v9.4s, %20.4s, v4.4s        \\n\"\n\n                        \"fmla   v16.4s, %12.4s, v2.4s       \\n\"\n                   
     \"fmla   v17.4s, %21.4s, v2.4s       \\n\"\n                        \"fmla   v18.4s, %12.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %21.4s, v3.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v6.4s, %12.4s, v4.4s        \\n\"\n                        \"fmla   v7.4s, %21.4s, v4.4s        \\n\"\n                        \"fmla   v8.4s, %12.4s, v5.4s        \\n\"\n                        \"fmla   v9.4s, %21.4s, v5.4s        \\n\"\n\n                        \"fmla   v16.4s, %13.4s, v0.4s       \\n\"\n                        \"fmla   v17.4s, %22.4s, v0.4s       \\n\"\n                        \"fmla   v18.4s, %13.4s, v1.4s       \\n\"\n                        \"fmla   v19.4s, %22.4s, v1.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%3]        \\n\" // r14 r15\n\n                        \"fmla   v6.4s, %13.4s, v2.4s        \\n\"\n                        \"fmla   v7.4s, %22.4s, v2.4s        \\n\"\n                        \"fmla   v8.4s, %13.4s, v3.4s        \\n\"\n                        \"fmla   v9.4s, %22.4s, v3.4s        \\n\"\n\n                        \"fmla   v16.4s, %14.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %23.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %14.4s, v2.4s       \\n\"\n                        \"fmla   v19.4s, %23.4s, v2.4s       \\n\"\n                        \"fmla   v6.4s, %14.4s, v3.4s        \\n\"\n                        \"fmla   v7.4s, %23.4s, v3.4s        \\n\"\n                        \"fmla   v8.4s, %14.4s, v4.4s        \\n\"\n                        \"fmla   v9.4s, %23.4s, v4.4s        \\n\"\n\n                        \"fmla   v16.4s, %15.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %24.4s, v2.4s     
  \\n\"\n                        \"fmla   v18.4s, %15.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %24.4s, v3.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%4], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v6.4s, %15.4s, v4.4s        \\n\"\n                        \"fmla   v7.4s, %24.4s, v4.4s        \\n\"\n                        \"fmla   v8.4s, %15.4s, v5.4s        \\n\"\n                        \"fmla   v9.4s, %24.4s, v5.4s        \\n\"\n\n                        \"fmla   v16.4s, %16.4s, v0.4s       \\n\"\n                        \"fmla   v17.4s, %25.4s, v0.4s       \\n\"\n                        \"fmla   v18.4s, %16.4s, v1.4s       \\n\"\n                        \"fmla   v19.4s, %25.4s, v1.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%4]        \\n\" // r24 r25\n\n                        \"fmla   v6.4s, %16.4s, v2.4s        \\n\"\n                        \"fmla   v7.4s, %25.4s, v2.4s        \\n\"\n                        \"fmla   v8.4s, %16.4s, v3.4s        \\n\"\n                        \"fmla   v9.4s, %25.4s, v3.4s        \\n\"\n\n                        \"fmla   v16.4s, %17.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %26.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %17.4s, v2.4s       \\n\"\n                        \"fmla   v19.4s, %26.4s, v2.4s       \\n\"\n                        \"fmla   v6.4s, %17.4s, v3.4s        \\n\"\n                        \"fmla   v7.4s, %26.4s, v3.4s        \\n\"\n                        \"fmla   v8.4s, %17.4s, v4.4s        \\n\"\n                        \"fmla   v9.4s, %26.4s, v4.4s        \\n\"\n\n                        \"fmla   v16.4s, %18.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %27.4s, v2.4s       \\n\"\n                        \"fmla 
  v18.4s, %18.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %27.4s, v3.4s       \\n\"\n                        \"fmla   v6.4s, %18.4s, v4.4s        \\n\"\n                        \"fmla   v7.4s, %27.4s, v4.4s        \\n\"\n                        \"fmla   v8.4s, %18.4s, v5.4s        \\n\"\n                        \"fmla   v9.4s, %27.4s, v5.4s        \\n\"\n\n                        \"ld1    {v0.4s}, [%0]               \\n\" // sum00 sum01 sum02 sum03\n                        \"ld1    {v1.4s}, [%1]               \\n\" // sum10 sum11 sum12 sum13\n\n                        \"faddp  v16.4s, v16.4s, v16.4s      \\n\"\n                        \"faddp  v17.4s, v17.4s, v17.4s      \\n\"\n                        \"faddp  v18.4s, v18.4s, v18.4s      \\n\"\n                        \"faddp  v19.4s, v19.4s, v19.4s      \\n\"\n                        \"faddp  v6.4s, v6.4s, v6.4s         \\n\"\n                        \"faddp  v7.4s, v7.4s, v7.4s         \\n\"\n                        \"faddp  v8.4s, v8.4s, v8.4s         \\n\"\n                        \"faddp  v9.4s, v9.4s, v9.4s         \\n\"\n\n                        \"faddp  v16.2s, v16.2s, v18.2s      \\n\"\n                        \"faddp  v17.2s, v17.2s, v19.2s      \\n\"\n                        \"faddp  v6.2s, v6.2s, v8.2s         \\n\"\n                        \"faddp  v7.2s, v7.2s, v9.2s         \\n\"\n\n                        \"trn1   v16.2d, v16.2d, v6.2d       \\n\"\n                        \"trn1   v17.2d, v17.2d, v7.2d       \\n\"\n\n                        \"fadd   v0.4s, v0.4s, v16.4s        \\n\"\n                        \"fadd   v1.4s, v1.4s, v17.4s        \\n\"\n\n                        \"st1    {v0.4s}, [%0], #16          \\n\"\n                        \"st1    {v1.4s}, [%1], #16          \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(r0),      // %2\n                        \"=r\"(r1),      // %3\n 
                       \"=r\"(r2)       // %4\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_0), // %10\n                        \"w\"(_k01_0), // %11\n                        \"w\"(_k02_0), // %12\n                        \"w\"(_k10_0), // %13\n                        \"w\"(_k11_0), // %14\n                        \"w\"(_k12_0), // %15\n                        \"w\"(_k20_0), // %16\n                        \"w\"(_k21_0), // %17\n                        \"w\"(_k22_0), // %18\n                        \"w\"(_k00_1), // %19\n                        \"w\"(_k01_1), // %20\n                        \"w\"(_k02_1), // %21\n                        \"w\"(_k10_1), // %22\n                        \"w\"(_k11_1), // %23\n                        \"w\"(_k12_1), // %24\n                        \"w\"(_k20_1), // %25\n                        \"w\"(_k21_1), // %26\n                        \"w\"(_k22_1)  // %27\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v16\", \"v17\", \"v18\", \"v19\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2] \\n\" // r00 r01 r02 r03\n\n                        \"fmul   v16.4s, %10.4s, v0.4s       \\n\"\n                        \"fmul   v17.4s, %19.4s, v0.4s       \\n\"\n                        \"fmul   v18.4s, %10.4s, v1.4s       \\n\"\n                        \"fmul   v19.4s, %19.4s, v1.4s       \\n\"\n                        \"fmla   v16.4s, %11.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %20.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %11.4s, v2.4s       
\\n\"\n                        \"fmla   v19.4s, %20.4s, v2.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%3] \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v16.4s, %12.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %21.4s, v2.4s       \\n\"\n                        \"fmla   v18.4s, %12.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %21.4s, v3.4s       \\n\"\n\n                        \"fmla   v16.4s, %13.4s, v4.4s       \\n\"\n                        \"fmla   v17.4s, %22.4s, v4.4s       \\n\"\n                        \"fmla   v18.4s, %13.4s, v5.4s       \\n\"\n                        \"fmla   v19.4s, %22.4s, v5.4s       \\n\"\n                        \"fmla   v16.4s, %14.4s, v5.4s       \\n\"\n                        \"fmla   v17.4s, %23.4s, v5.4s       \\n\"\n                        \"fmla   v18.4s, %14.4s, v6.4s       \\n\"\n                        \"fmla   v19.4s, %23.4s, v6.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%4] \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v16.4s, %15.4s, v6.4s       \\n\"\n                        \"fmla   v17.4s, %24.4s, v6.4s       \\n\"\n                        \"fmla   v18.4s, %15.4s, v7.4s       \\n\"\n                        \"fmla   v19.4s, %24.4s, v7.4s       \\n\"\n\n                        \"fmla   v16.4s, %16.4s, v0.4s       \\n\"\n                        \"fmla   v17.4s, %25.4s, v0.4s       \\n\"\n                        \"fmla   v18.4s, %16.4s, v1.4s       \\n\"\n                        \"fmla   v19.4s, %25.4s, v1.4s       \\n\"\n                        \"fmla   v16.4s, %17.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %26.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %17.4s, v2.4s       \\n\"\n                        
\"fmla   v19.4s, %26.4s, v2.4s       \\n\"\n                        \"fmla   v16.4s, %18.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %27.4s, v2.4s       \\n\"\n                        \"fmla   v18.4s, %18.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %27.4s, v3.4s       \\n\"\n\n                        \"ld1    {v4.2s}, [%0]               \\n\" // sum00 sum01\n                        \"ld1    {v5.2s}, [%1]               \\n\" // sum10 sum11\n\n                        \"faddp  v16.4s, v16.4s, v16.4s      \\n\"\n                        \"faddp  v17.4s, v17.4s, v17.4s      \\n\"\n                        \"faddp  v18.4s, v18.4s, v18.4s      \\n\"\n                        \"faddp  v19.4s, v19.4s, v19.4s      \\n\"\n\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"faddp  v16.2s, v16.2s, v18.2s      \\n\"\n                        \"faddp  v17.2s, v17.2s, v19.2s      \\n\"\n\n                        \"add    %3, %3, #32                 \\n\"\n\n                        \"fadd   v4.2s, v4.2s, v16.2s        \\n\"\n                        \"fadd   v5.2s, v5.2s, v17.2s        \\n\"\n\n                        \"add    %4, %4, #32                 \\n\"\n\n                        \"st1    {v4.2s}, [%0], #8           \\n\"\n                        \"st1    {v5.2s}, [%1], #8           \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(r0),      // %2\n                        \"=r\"(r1),      // %3\n                        \"=r\"(r2)       // %4\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_0), // %10\n                        \"w\"(_k01_0), // %11\n                        \"w\"(_k02_0), // %12\n                        \"w\"(_k10_0), 
// %13\n                        \"w\"(_k11_0), // %14\n                        \"w\"(_k12_0), // %15\n                        \"w\"(_k20_0), // %16\n                        \"w\"(_k21_0), // %17\n                        \"w\"(_k22_0), // %18\n                        \"w\"(_k00_1), // %19\n                        \"w\"(_k01_1), // %20\n                        \"w\"(_k02_1), // %21\n                        \"w\"(_k10_1), // %22\n                        \"w\"(_k11_1), // %23\n                        \"w\"(_k12_1), // %24\n                        \"w\"(_k20_1), // %25\n                        \"w\"(_k21_1), // %26\n                        \"w\"(_k22_1)  // %27\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%2] \\n\" // r00 r01 r02\n\n                        \"fmul   v16.4s, %10.4s, v0.4s       \\n\"\n                        \"fmul   v17.4s, %19.4s, v0.4s       \\n\"\n                        \"fmul   v18.4s, %11.4s, v1.4s       \\n\"\n                        \"fmul   v19.4s, %20.4s, v1.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v3.4s, v4.4s, v5.4s}, [%3] \\n\" // r10 r11 r12\n\n                        \"fmla   v16.4s, %12.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %21.4s, v2.4s       \\n\"\n\n                        \"fmla   v18.4s, %13.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %22.4s, v3.4s       \\n\"\n                        \"fmla   v16.4s, %14.4s, v4.4s       \\n\"\n                        \"fmla   v17.4s, %23.4s, v4.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                      
  \"ld1    {v0.4s, v1.4s, v2.4s}, [%4] \\n\" // r20 r21 r22\n\n                        \"fmla   v18.4s, %15.4s, v5.4s       \\n\"\n                        \"fmla   v19.4s, %24.4s, v5.4s       \\n\"\n\n                        \"fmla   v16.4s, %16.4s, v0.4s       \\n\"\n                        \"fmla   v17.4s, %25.4s, v0.4s       \\n\"\n                        \"fmla   v18.4s, %17.4s, v1.4s       \\n\"\n                        \"fmla   v19.4s, %26.4s, v1.4s       \\n\"\n                        \"fmla   v16.4s, %18.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %27.4s, v2.4s       \\n\"\n\n                        \"ld1    {v3.s}[0], [%0]             \\n\" // sum00\n                        \"ld1    {v4.s}[0], [%1]             \\n\" // sum10\n\n                        \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n                        \"fadd   v17.4s, v17.4s, v19.4s      \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"faddp  v16.4s, v16.4s, v16.4s      \\n\"\n                        \"faddp  v17.4s, v17.4s, v17.4s      \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"faddp  v16.2s, v16.2s, v16.2s      \\n\"\n                        \"faddp  v17.2s, v17.2s, v17.2s      \\n\"\n\n                        \"add    %4, %4, #16                 \\n\"\n\n                        \"fadd   v3.2s, v3.2s, v16.2s        \\n\"\n                        \"fadd   v4.2s, v4.2s, v17.2s        \\n\"\n\n                        \"st1    {v3.s}[0], [%0], #4         \\n\"\n                        \"st1    {v4.s}[0], [%1], #4         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(outptr1), // %1\n                        \"=r\"(r0),      // %2\n                        \"=r\"(r1),      // %3\n                        \"=r\"(r2)       // %4\n                        : \"0\"(outptr0),\n                        \"1\"(outptr1),\n       
                 \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"w\"(_k00_0), // %10\n                        \"w\"(_k01_0), // %11\n                        \"w\"(_k02_0), // %12\n                        \"w\"(_k10_0), // %13\n                        \"w\"(_k11_0), // %14\n                        \"w\"(_k12_0), // %15\n                        \"w\"(_k20_0), // %16\n                        \"w\"(_k21_0), // %17\n                        \"w\"(_k22_0), // %18\n                        \"w\"(_k00_1), // %19\n                        \"w\"(_k01_1), // %20\n                        \"w\"(_k02_1), // %21\n                        \"w\"(_k10_1), // %22\n                        \"w\"(_k11_1), // %23\n                        \"w\"(_k12_1), // %24\n                        \"w\"(_k20_1), // %25\n                        \"w\"(_k21_1), // %26\n                        \"w\"(_k22_1)  // %27\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\");\n                }\n\n                r0 += 2 * 4;\n                r1 += 2 * 4;\n                r2 += 2 * 4;\n            }\n\n            k0 += 9 * 4;\n            k1 += 9 * 4;\n        }\n    }\n#endif // __ARM_NEON && __aarch64__\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n        out0.fill(bias0);\n\n        const float* k0 = kernel.channel(p);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            float32x4_t _k00 = vld1q_f32(k0);\n            float32x4_t _k01 = vld1q_f32(k0 + 4);\n            float32x4_t _k02 = vld1q_f32(k0 + 8);\n            float32x4_t _k10 = vld1q_f32(k0 + 12);\n            float32x4_t _k11 = vld1q_f32(k0 + 16);\n            float32x4_t _k12 = vld1q_f32(k0 + 20);\n            float32x4_t _k20 = vld1q_f32(k0 + 24);\n            float32x4_t _k21 = vld1q_f32(k0 + 28);\n            float32x4_t _k22 = vld1q_f32(k0 + 32);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n\n#if __aarch64__\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\" // r04 r05 r06 r07\n\n                        \"fmul   v16.4s, %8.4s, v0.4s        \\n\"\n                        \"fmul   v17.4s, %8.4s, v1.4s        \\n\"\n                        \"fmul   v18.4s, %8.4s, v2.4s        \\n\"\n                        \"fmul   v19.4s, %8.4s, v3.4s        \\n\"\n                        \"fmul   v20.4s, %8.4s, v4.4s        \\n\"\n                        \"fmul   v21.4s, %8.4s, v5.4s        \\n\"\n                        \"fmul   v22.4s, %8.4s, v6.4s        \\n\"\n                        \"fmul   v23.4s, %8.4s, v7.4s        \\n\"\n\n                        \"prfm   pldl1keep, [%1, 
#256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%1]        \\n\" // r08 r09\n\n                        \"fmla   v16.4s, %9.4s, v1.4s        \\n\"\n                        \"fmla   v17.4s, %9.4s, v2.4s        \\n\"\n                        \"fmla   v18.4s, %9.4s, v3.4s        \\n\"\n                        \"fmla   v19.4s, %9.4s, v4.4s        \\n\"\n                        \"fmla   v20.4s, %9.4s, v5.4s        \\n\"\n                        \"fmla   v21.4s, %9.4s, v6.4s        \\n\"\n                        \"fmla   v22.4s, %9.4s, v7.4s        \\n\"\n                        \"fmla   v23.4s, %9.4s, v8.4s        \\n\"\n\n                        \"fmla   v16.4s, %10.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %10.4s, v3.4s       \\n\"\n                        \"fmla   v18.4s, %10.4s, v4.4s       \\n\"\n                        \"fmla   v19.4s, %10.4s, v5.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v20.4s, %10.4s, v6.4s       \\n\"\n                        \"fmla   v21.4s, %10.4s, v7.4s       \\n\"\n                        \"fmla   v22.4s, %10.4s, v8.4s       \\n\"\n                        \"fmla   v23.4s, %10.4s, v9.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\" // r14 r15 r16 r17\n\n                        \"fmla   v16.4s, %11.4s, v0.4s       \\n\"\n                        \"fmla   v17.4s, %11.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %11.4s, v2.4s       \\n\"\n                        \"fmla   v19.4s, %11.4s, v3.4s       \\n\"\n                        \"fmla   v20.4s, %11.4s, v4.4s       \\n\"\n                        \"fmla   v21.4s, %11.4s, v5.4s       \\n\"\n                        \"fmla   v22.4s, %11.4s, v6.4s      
 \\n\"\n                        \"fmla   v23.4s, %11.4s, v7.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%2]        \\n\" // r18 r19\n\n                        \"fmla   v16.4s, %12.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %12.4s, v2.4s       \\n\"\n                        \"fmla   v18.4s, %12.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %12.4s, v4.4s       \\n\"\n                        \"fmla   v20.4s, %12.4s, v5.4s       \\n\"\n                        \"fmla   v21.4s, %12.4s, v6.4s       \\n\"\n                        \"fmla   v22.4s, %12.4s, v7.4s       \\n\"\n                        \"fmla   v23.4s, %12.4s, v8.4s       \\n\"\n\n                        \"fmla   v16.4s, %13.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %13.4s, v3.4s       \\n\"\n                        \"fmla   v18.4s, %13.4s, v4.4s       \\n\"\n                        \"fmla   v19.4s, %13.4s, v5.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v20.4s, %13.4s, v6.4s       \\n\"\n                        \"fmla   v21.4s, %13.4s, v7.4s       \\n\"\n                        \"fmla   v22.4s, %13.4s, v8.4s       \\n\"\n                        \"fmla   v23.4s, %13.4s, v9.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%3], #64 \\n\" // r24 r25 r26 r27\n\n                        \"fmla   v16.4s, %14.4s, v0.4s       \\n\"\n                        \"fmla   v17.4s, %14.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %14.4s, v2.4s       \\n\"\n                        \"fmla   v19.4s, %14.4s, v3.4s       \\n\"\n                        \"fmla   v20.4s, %14.4s, v4.4s       \\n\"\n 
                       \"fmla   v21.4s, %14.4s, v5.4s       \\n\"\n                        \"fmla   v22.4s, %14.4s, v6.4s       \\n\"\n                        \"fmla   v23.4s, %14.4s, v7.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%3]        \\n\" // r28 r29\n\n                        \"fmla   v16.4s, %15.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %15.4s, v2.4s       \\n\"\n                        \"fmla   v18.4s, %15.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %15.4s, v4.4s       \\n\"\n                        \"fmla   v20.4s, %15.4s, v5.4s       \\n\"\n                        \"fmla   v21.4s, %15.4s, v6.4s       \\n\"\n                        \"fmla   v22.4s, %15.4s, v7.4s       \\n\"\n                        \"fmla   v23.4s, %15.4s, v8.4s       \\n\"\n\n                        \"fmla   v16.4s, %16.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %16.4s, v3.4s       \\n\"\n                        \"fmla   v18.4s, %16.4s, v4.4s       \\n\"\n                        \"fmla   v19.4s, %16.4s, v5.4s       \\n\"\n                        \"fmla   v20.4s, %16.4s, v6.4s       \\n\"\n                        \"fmla   v21.4s, %16.4s, v7.4s       \\n\"\n                        \"fmla   v22.4s, %16.4s, v8.4s       \\n\"\n                        \"fmla   v23.4s, %16.4s, v9.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%0]        \\n\" // sum0 sum1 sum2 sum3 sum4 sum5 sum6 sum7\n\n                        \"faddp  v16.4s, v16.4s, v17.4s      \\n\"\n                        \"faddp  v18.4s, v18.4s, v19.4s      \\n\"\n                        \"faddp  v20.4s, v20.4s, v21.4s      \\n\"\n                        \"faddp  v22.4s, v22.4s, v23.4s      \\n\"\n\n                        \"faddp  v16.4s, v16.4s, v18.4s      \\n\"\n                        
\"faddp  v20.4s, v20.4s, v22.4s      \\n\"\n\n                        \"fadd   v0.4s, v0.4s, v16.4s        \\n\"\n                        \"fadd   v1.4s, v1.4s, v20.4s        \\n\"\n\n                        \"st1    {v0.4s, v1.4s}, [%0], #32   \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n                }\n#endif // __aarch64__\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%1]        \\n\" // r04 r05\n\n                        \"fmul   v16.4s, %8.4s, v0.4s        \\n\"\n                        \"fmul   v17.4s, %8.4s, v1.4s        \\n\"\n                        \"fmul   v18.4s, %8.4s, v2.4s        \\n\"\n                        \"fmul   v19.4s, %8.4s, v3.4s        \\n\"\n\n                        \"fmla   v16.4s, %9.4s, v1.4s  
      \\n\"\n                        \"fmla   v17.4s, %9.4s, v2.4s        \\n\"\n                        \"fmla   v18.4s, %9.4s, v3.4s        \\n\"\n                        \"fmla   v19.4s, %9.4s, v8.4s        \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v16.4s, %10.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %10.4s, v3.4s       \\n\"\n                        \"fmla   v18.4s, %10.4s, v8.4s       \\n\"\n                        \"fmla   v19.4s, %10.4s, v9.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%2]        \\n\" // r14 r15\n\n                        \"fmla   v16.4s, %11.4s, v4.4s       \\n\"\n                        \"fmla   v17.4s, %11.4s, v5.4s       \\n\"\n                        \"fmla   v18.4s, %11.4s, v6.4s       \\n\"\n                        \"fmla   v19.4s, %11.4s, v7.4s       \\n\"\n\n                        \"fmla   v16.4s, %12.4s, v5.4s       \\n\"\n                        \"fmla   v17.4s, %12.4s, v6.4s       \\n\"\n                        \"fmla   v18.4s, %12.4s, v7.4s       \\n\"\n                        \"fmla   v19.4s, %12.4s, v8.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v16.4s, %13.4s, v6.4s       \\n\"\n                        \"fmla   v17.4s, %13.4s, v7.4s       \\n\"\n                        \"fmla   v18.4s, %13.4s, v8.4s       \\n\"\n                        \"fmla   v19.4s, %13.4s, v9.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v8.4s, v9.4s}, [%3]        \\n\" // r24 r25\n\n                        \"fmla   v16.4s, %14.4s, 
v0.4s       \\n\"\n                        \"fmla   v17.4s, %14.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %14.4s, v2.4s       \\n\"\n                        \"fmla   v19.4s, %14.4s, v3.4s       \\n\"\n\n                        \"fmla   v16.4s, %15.4s, v1.4s       \\n\"\n                        \"fmla   v17.4s, %15.4s, v2.4s       \\n\"\n                        \"fmla   v18.4s, %15.4s, v3.4s       \\n\"\n                        \"fmla   v19.4s, %15.4s, v8.4s       \\n\"\n\n                        \"fmla   v16.4s, %16.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %16.4s, v3.4s       \\n\"\n                        \"fmla   v18.4s, %16.4s, v8.4s       \\n\"\n                        \"fmla   v19.4s, %16.4s, v9.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%0]               \\n\" // sum0 sum1 sum2 sum3\n\n                        \"faddp  v16.4s, v16.4s, v17.4s      \\n\"\n                        \"faddp  v18.4s, v18.4s, v19.4s      \\n\"\n\n                        \"faddp  v16.4s, v16.4s, v18.4s      \\n\"\n\n                        \"fadd   v0.4s, v0.4s, v16.4s        \\n\"\n\n                        \"st1    {v0.4s}, [%0], #16          \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n               
         \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v16\", \"v17\", \"v18\", \"v19\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%1 :128]! \\n\" // r00 r01\n\n                        \"vmul.f32   q3, %q8, q0     \\n\"\n\n                        \"pld        [%1, #128]      \\n\"\n                        \"vld1.f32   {d4-d5}, [%1 :128]! \\n\" // r02\n\n                        \"vmul.f32   q4, %q8, q1     \\n\"\n                        \"vmla.f32   q3, %q9, q1     \\n\"\n\n                        \"pld        [%1, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%1 :128]! \\n\" // r03 r04\n\n                        \"vmul.f32   q5, %q8, q2     \\n\"\n                        \"vmla.f32   q4, %q9, q2     \\n\"\n                        \"vmla.f32   q3, %q10, q2    \\n\"\n\n                        \"vmul.f32   q6, %q8, q0     \\n\"\n                        \"vmla.f32   q5, %q9, q0     \\n\"\n                        \"vmla.f32   q4, %q10, q0    \\n\"\n\n                        \"pld        [%1, #128]      \\n\"\n                        \"vld1.f32   {d4-d5}, [%1 :128] \\n\" // r05\n\n                        \"vmla.f32   q6, %q9, q1     \\n\"\n                        \"vmla.f32   q5, %q10, q1    \\n\"\n\n                        \"pld        [%2, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%2 :128]! \\n\" // r10 r11\n\n                        \"vmla.f32   q6, %q10, q2    \\n\"\n\n                        \"vmla.f32   q3, %q11, q0    \\n\"\n\n                        \"pld        [%2, #128]      \\n\"\n                        \"vld1.f32   {d4-d5}, [%2 :128]! 
\\n\" // r12\n\n                        \"vmla.f32   q4, %q11, q1    \\n\"\n                        \"vmla.f32   q3, %q12, q1    \\n\"\n\n                        \"pld        [%2, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%2 :128]! \\n\" // r13 r14\n\n                        \"vmla.f32   q5, %q11, q2    \\n\"\n                        \"vmla.f32   q4, %q12, q2    \\n\"\n                        \"vmla.f32   q3, %q13, q2    \\n\"\n\n                        \"vmla.f32   q6, %q11, q0    \\n\"\n                        \"vmla.f32   q5, %q12, q0    \\n\"\n                        \"vmla.f32   q4, %q13, q0    \\n\"\n\n                        \"pld        [%2, #128]      \\n\"\n                        \"vld1.f32   {d4-d5}, [%2 :128] \\n\" // r15\n\n                        \"vmla.f32   q6, %q12, q1    \\n\"\n                        \"vmla.f32   q5, %q13, q1    \\n\"\n\n                        \"pld        [%3, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%3 :128]! \\n\" // r20 r21\n\n                        \"vmla.f32   q6, %q13, q2    \\n\"\n\n                        \"vmla.f32   q3, %q14, q0    \\n\"\n\n                        \"pld        [%3, #128]      \\n\"\n                        \"vld1.f32   {d4-d5}, [%3 :128]! \\n\" // r22\n\n                        \"vmla.f32   q4, %q14, q1    \\n\"\n                        \"vmla.f32   q3, %q15, q1    \\n\"\n\n                        \"pld        [%3, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%3 :128]! 
\\n\" // r23 r24\n\n                        \"vmla.f32   q5, %q14, q2    \\n\"\n                        \"vmla.f32   q4, %q15, q2    \\n\"\n                        \"vmla.f32   q3, %q16, q2    \\n\"\n\n                        \"vmla.f32   q6, %q14, q0    \\n\"\n                        \"vmla.f32   q5, %q15, q0    \\n\"\n                        \"vmla.f32   q4, %q16, q0    \\n\"\n\n                        \"pld        [%3, #128]      \\n\"\n                        \"vld1.f32   {d4-d5}, [%3 :128] \\n\" // r25\n\n                        \"vmla.f32   q6, %q15, q1    \\n\"\n                        \"vmla.f32   q5, %q16, q1    \\n\"\n\n                        \"vld1.f32   {d0-d1}, [%0]   \\n\" // sum0 sum1 sum2 sum3\n\n                        \"vmla.f32   q6, %q16, q2    \\n\"\n\n                        \"vadd.f32   d6, d6, d7      \\n\"\n                        \"vadd.f32   d8, d8, d9      \\n\"\n                        \"vadd.f32   d10, d10, d11   \\n\"\n                        \"vadd.f32   d12, d12, d13   \\n\"\n\n                        \"sub        %1, %1, #16     \\n\"\n\n                        \"vpadd.f32  d6, d6, d8      \\n\"\n                        \"vpadd.f32  d7, d10, d12    \\n\"\n\n                        \"sub        %2, %2, #16     \\n\"\n\n                        \"vadd.f32   q0, q0, q3      \\n\"\n\n                        \"sub        %3, %3, #16     \\n\"\n\n                        \"vst1.f32   {d0-d1}, [%0]!  
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1] \\n\" // r00 r01 r02 r03\n\n                        \"fmul   v16.4s, %8.4s, v0.4s        \\n\"\n                        \"fmul   v17.4s, %8.4s, v1.4s        \\n\"\n                        \"fmul   v18.4s, %9.4s, v1.4s        \\n\"\n                        \"fmul   v19.4s, %9.4s, v2.4s        \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2] \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v16.4s, %10.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %10.4s, v3.4s       \\n\"\n\n                        \"fmla   v18.4s, %11.4s, v4.4s       \\n\"\n                        \"fmla   v19.4s, %11.4s, v5.4s       \\n\"\n                        \"fmla   v16.4s, %12.4s, v5.4s       \\n\"\n                        \"fmla   v17.4s, 
%12.4s, v6.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3] \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v18.4s, %13.4s, v6.4s       \\n\"\n                        \"fmla   v19.4s, %13.4s, v7.4s       \\n\"\n\n                        \"fmla   v16.4s, %14.4s, v0.4s       \\n\"\n                        \"fmla   v17.4s, %14.4s, v1.4s       \\n\"\n                        \"fmla   v18.4s, %15.4s, v1.4s       \\n\"\n                        \"fmla   v19.4s, %15.4s, v2.4s       \\n\"\n                        \"fmla   v16.4s, %16.4s, v2.4s       \\n\"\n                        \"fmla   v17.4s, %16.4s, v3.4s       \\n\"\n\n                        \"ld1    {v0.2s}, [%0]               \\n\" // sum0 sum1\n\n                        \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n                        \"fadd   v17.4s, v17.4s, v19.4s      \\n\"\n\n                        \"add    %1, %1, #32                 \\n\"\n\n                        \"faddp  v16.4s, v16.4s, v17.4s      \\n\"\n\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"faddp  v16.4s, v16.4s, v16.4s      \\n\"\n\n                        \"add    %3, %3, #32                 \\n\"\n\n                        \"fadd   v0.2s, v0.2s, v16.2s        \\n\"\n\n                        \"st1    {v0.2s}, [%0], #8           \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        
\"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%1 :128]! \\n\" // r00 r01\n\n                        \"vmul.f32   q5, %q8, q0     \\n\"\n                        \"vmul.f32   q6, %q8, q1     \\n\"\n                        \"vmul.f32   q2, %q9, q1     \\n\"\n\n                        \"pld        [%1, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%1 :128] \\n\" // r02 r03\n\n                        \"vmul.f32   q3, %q9, q0     \\n\"\n                        \"vmla.f32   q5, %q10, q0    \\n\"\n                        \"vmla.f32   q6, %q10, q1    \\n\"\n\n                        \"pld        [%2, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%2 :128]! \\n\" // r10 r11\n\n                        \"vmla.f32   q2, %q11, q0    \\n\"\n                        \"vmla.f32   q3, %q11, q1    \\n\"\n                        \"vmla.f32   q5, %q12, q1    \\n\"\n\n                        \"pld        [%2, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%2 :128] \\n\" // r12 r13\n\n                        \"vmla.f32   q6, %q12, q0    \\n\"\n                        \"vmla.f32   q2, %q13, q0    \\n\"\n                        \"vmla.f32   q3, %q13, q1    \\n\"\n\n                        \"pld        [%3, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%3 :128]! 
\\n\" // r20 r21\n\n                        \"vmla.f32   q5, %q14, q0    \\n\"\n                        \"vmla.f32   q6, %q14, q1    \\n\"\n                        \"vmla.f32   q2, %q15, q1    \\n\"\n\n                        \"pld        [%3, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%3 :128] \\n\" // r22 r23\n\n                        \"vmla.f32   q3, %q15, q0    \\n\"\n                        \"vmla.f32   q5, %q16, q0    \\n\"\n                        \"vmla.f32   q6, %q16, q1    \\n\"\n\n                        \"vld1.f32   {d8}, [%0]      \\n\" // sum0 sum1\n\n                        \"vadd.f32   q5, q5, q2      \\n\"\n                        \"vadd.f32   q6, q6, q3      \\n\"\n\n                        \"vadd.f32   d10, d10, d11   \\n\"\n                        \"vadd.f32   d12, d12, d13   \\n\"\n\n                        \"vpadd.f32  d10, d10, d12   \\n\"\n\n                        \"vadd.f32   d8, d8, d10     \\n\"\n\n                        \"vst1.f32   {d8}, [%0]!     
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%1] \\n\" // r00 r01 r02\n\n                        \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                        \"ld1    {v16.s}[0], [%0]            \\n\" // sum0\n\n                        \"fmul   v17.4s, %8.4s, v0.4s        \\n\"\n                        \"fmul   v18.4s, %9.4s, v1.4s        \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v3.4s, v4.4s, v5.4s}, [%2] \\n\" // r10 r11 r12\n\n                        \"fmla   v16.4s, %10.4s, v2.4s       \\n\"\n\n                        \"fmla   v17.4s, %11.4s, v3.4s       \\n\"\n                        \"fmla   v18.4s, %12.4s, v4.4s       \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%3] \\n\" // r20 r21 r22\n\n                        \"fmla   v16.4s, 
%13.4s, v5.4s       \\n\"\n\n                        \"fmla   v17.4s, %14.4s, v0.4s       \\n\"\n                        \"fmla   v18.4s, %15.4s, v1.4s       \\n\"\n                        \"fmla   v16.4s, %16.4s, v2.4s       \\n\"\n\n                        \"fadd   v17.4s, v17.4s, v18.4s      \\n\"\n                        \"fadd   v16.4s, v16.4s, v17.4s      \\n\"\n\n                        \"add    %1, %1, #16                 \\n\"\n\n                        \"faddp  v16.4s, v16.4s, v16.4s      \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"faddp  v16.2s, v16.2s, v16.2s      \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"st1    {v16.s}[0], [%0], #4        \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #384]      \\n\"\n                        \"vldm       %1, {d0-d5}     \\n\" // r00 r01 r02\n\n                        \"veor       q3, q3          \\n\"\n                        \"vld1.f32   {d6[0]}, [%0]   \\n\" // sum0\n\n                   
     \"vmul.f32   q4, %q8, q0     \\n\"\n                        \"vmul.f32   q5, %q9, q1     \\n\"\n                        \"vmla.f32   q3, %q10, q2    \\n\"\n\n                        \"pld        [%2, #384]      \\n\"\n                        \"vldm       %2, {d0-d5}     \\n\" // r10 r11 r12\n\n                        \"vmla.f32   q4, %q11, q0    \\n\"\n                        \"vmla.f32   q5, %q12, q1    \\n\"\n                        \"vmla.f32   q3, %q13, q2    \\n\"\n\n                        \"pld        [%3, #384]      \\n\"\n                        \"vldm       %3, {d0-d5}     \\n\" // r20 r21 r22\n\n                        \"vmla.f32   q4, %q14, q0    \\n\"\n                        \"vmla.f32   q5, %q15, q1    \\n\"\n                        \"vmla.f32   q3, %q16, q2    \\n\"\n\n                        \"vadd.f32   q4, q4, q5      \\n\"\n                        \"vadd.f32   q3, q3, q4      \\n\"\n\n                        \"add        %1, %1, #16     \\n\"\n\n                        \"vadd.f32   d6, d6, d7      \\n\"\n\n                        \"add        %2, %2, #16     \\n\"\n\n                        \"vpadd.f32  d6, d6, d6      \\n\"\n\n                        \"add        %3, %3, #16     \\n\"\n\n                        \"vst1.f32   {d6[0]}, [%0]!  
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2)       // %3\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"w\"(_k00), // %8\n                        \"w\"(_k01), // %9\n                        \"w\"(_k02), // %10\n                        \"w\"(_k10), // %11\n                        \"w\"(_k11), // %12\n                        \"w\"(_k12), // %13\n                        \"w\"(_k20), // %14\n                        \"w\"(_k21), // %15\n                        \"w\"(_k22)  // %16\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\");\n#endif // __aarch64__\n                }\n\n                r0 += 2 * 4;\n                r1 += 2 * 4;\n                r2 += 2 * 4;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_pack8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + p * 8) : vdupq_n_f16(0.f);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            __fp16* outptr0 = out0.row<__fp16>(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n\n            const __fp16* kptr = kernel.channel(p).row<const __fp16>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v4.8h, v5.8h}, [%1]        \\n\" // r04 r05\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, 
v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v3.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v3.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v3.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        
\"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v3.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v4.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v4.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v4.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                    
    \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v4.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v5.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v5.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[3]     
\\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v5.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v5.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v12.8h, v13.8h}, [%2]      \\n\" // r14 r15\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v11.h[0]    \\n\"\n\n           
             \"fmla   v28.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v11.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v11.h[2]    \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v11.h[4]    \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v11.h[5]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v11.h[6]    \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v8.h[7]   
  \\n\"\n                        \"fmla   v29.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v12.h[0]    \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v12.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v12.h[2]    \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v12.h[4]    \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, 
v12.h[5]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v12.h[6]    \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v13.h[0]    \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v13.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v13.h[2]    \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v29.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v13.h[3]    \\n\"\n\n                        
\"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v13.h[4]    \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v13.h[5]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v13.h[6]    \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v13.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.8h, v5.8h}, [%3]        \\n\" // r24 r25\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v3.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, 
v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v3.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v3.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v3.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                     
   \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v4.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v4.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v4.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 
\\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v4.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v5.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n\n                        // \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v5.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v4.h[4]   
  \\n\"\n                        \"fmla   v31.8h, v20.8h, v5.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v5.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"sub    %4, %4, #1088               \\n\" // kptr -= 8.5 * 64;\n\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                       
 \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1] \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v30.8h, v31.8h}, [%0]      \\n\" // sum0\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                     
   \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, 
v6.8h, v7.8h}, [%2] \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, 
v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v7.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v7.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n           
             \"fmla   v28.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v7.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v7.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3] \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v28.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v7.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v7.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v7.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v7.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   
v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     
\\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        // \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"add    %1, %1, #32                 \\n\"\n\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"add    %3, %3, #32                 \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n                        \"fadd   v29.8h, v29.8h, v31.8h      \\n\"\n\n                        \"sub    %4, %4, #1088               \\n\" // kptr -= 8.5 * 64;\n\n                        \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                       
 \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%1] \\n\" // r00 r01 r02\n\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v31.8h}, [%0]              \\n\" // sum0\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmul   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    
{v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v3.8h, v4.8h, v5.8h}, [%2] \\n\" // r10 r11 r12\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n        
                \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v5.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%3] \\n\" // r20 r21 r22\n\n                        \"fmla   v28.8h, v20.8h, v5.h[4]     \\n\"\n 
                       \"fmla   v29.8h, v21.8h, v5.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     
\\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n\n                        // \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"add    %1, %1, #16                 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v29.8h      \\n\"\n                        \"fadd   v30.8h, v30.8h, v31.8h      \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n\n                        \"sub    %4, %4, #1088               \\n\" // kptr -= 8.5 * 64;\n\n                        \"st1    {v28.8h}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                
}\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s2_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 8;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + p * 8) : vdupq_n_f16(0.f);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n\n            const __fp16* kptr = kernel.channel(p).row<const __fp16>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\" // r04 r05 r06 
r07\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v6.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v6.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v6.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, 
v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v6.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v7.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v7.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v7.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v7.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   
v31.8h, v20.8h, v7.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v7.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v7.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v7.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%1]               \\n\" // r08\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v0.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n    
                    \"fmla   v30.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v0.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v0.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v0.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%2], #64 \\n\" // r14 r15 r16 
r17\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v14.h[0]    \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v14.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v14.h[2]    \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v14.h[3]    \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v14.h[4]    \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v14.h[5]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, 
v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v14.h[6]    \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v8.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v14.h[7]    \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v15.h[0]    \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v15.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v15.h[2]    \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v13.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v15.h[3]    \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   
v31.8h, v20.8h, v15.h[4]    \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v15.h[5]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v15.h[6]    \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v13.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v15.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v8.8h}, [%2]               \\n\" // r18\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v14.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v8.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v14.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v8.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v12.h[2]    \\n\"\n    
                    \"fmla   v30.8h, v18.8h, v14.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v8.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v29.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v14.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v8.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v14.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v8.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v14.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v8.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v14.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v8.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v14.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v8.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%3], #64 \\n\" // r24 r25 r26 r27\n\n    
                    \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v6.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v6.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v6.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, 
v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v6.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v7.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v7.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v7.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v7.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v31.8h, 
v20.8h, v7.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v7.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v7.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v7.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%3]               \\n\" // r28\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v0.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v0.h[1]     \\n\"\n\n                        // \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n              
          \"fmla   v30.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v0.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v0.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v0.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"sub    %4, %4, #1088               \\n\" // kptr -= 8.5 * 64;\n\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n        
                \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v30.8h, v31.8h}, [%0]      \\n\" // sum0\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512] 
      \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%1]               \\n\" // r04\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n          
              \"fmla   v29.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n         
               \"fmla   v29.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v7.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v7.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v7.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v7.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v4.8h}, [%2]               \\n\" // r14\n\n                        \"fmla   v28.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v7.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, 
v7.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v7.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v7.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v28.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[7]     \\n\"\n                        \"fmla   v31.8h, 
v23.8h, v4.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n     
                   \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%3]               \\n\" // r24\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v0.h[1]     \\n\"\n\n                        // \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, 
v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n                        \"fadd   v29.8h, v29.8h, v31.8h      \\n\"\n\n                        \"sub    %4, %4, #1088               \\n\" // kptr -= 8.5 * 64;\n\n                        \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%1] \\n\" // r00 r01 r02\n\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v31.8h}, [%0]              \\n\" // sum0\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, 
v23.8h}, [%4], #64 \\n\"\n\n                        \"fmul   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v3.8h, v4.8h, 
v5.8h}, [%2] \\n\" // r10 r11 r12\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    
{v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v5.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%3] \\n\" // r20 r21 r22\n\n                        \"fmla   v28.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v5.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n        
                \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n\n                        // \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n\n                        \"add    %1, %1, #32                 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v29.8h      \\n\"\n                        \"fadd   v30.8h, v30.8h, v31.8h      \\n\"\n\n                        \"add    %3, %3, #32                 \\n\"\n\n                        
\"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n\n                        \"sub    %4, %4, #1088               \\n\" // kptr -= 8.5 * 64;\n\n                        \"st1    {v28.8h}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(kptr)     // %4\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_winograd.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd_pack_A_tile(const Mat& A, Mat& AT, int batch, int max_ii, int max_kk)\n{\n    const int N = max_kk * batch;\n\n    for (int b = 0; b < batch; b++)\n    {\n        float* pp = AT.row(b);\n\n        int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; ii + 7 < max_ii; ii += 8)\n        {\n            const float* p0 = (const float*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                pp[2] = p0[2 * N];\n                pp[3] = p0[3 * N];\n                pp[4] = p0[4 * N];\n                pp[5] = p0[5 * N];\n                pp[6] = p0[6 * N];\n                pp[7] = p0[7 * N];\n                p0 += batch;\n                pp += 8;\n            }\n        }\n#endif // __aarch64__\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n            const float* p0 = (const float*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                pp[2] = p0[2 * N];\n                pp[3] = p0[3 * N];\n                p0 += batch;\n                pp += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; ii + 1 < max_ii; ii += 2)\n        {\n            const float* p0 = (const float*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                p0 += batch;\n                pp += 2;\n            }\n        }\n        for (; ii < max_ii; ii++)\n        {\n            const float* p0 = (const float*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                p0 += batch;\n                pp += 1;\n       
     }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd_transpose_pack_B_tile(const Mat& B, Mat& BT, int batch, int max_jj, int max_kk, int nT)\n{\n    #pragma omp parallel for num_threads(nT)\n    for (int b = 0; b < batch; b++)\n    {\n        float* pp = BT.row(b);\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const float* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x12\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0]  \\n\"\n\n                    \"uzp1   v24.4s, v0.4s, v4.4s        \\n\"\n                    \"uzp2   v25.4s, v0.4s, v4.4s        \\n\"\n                    \"uzp1   v26.4s, v1.4s, v5.4s        \\n\"\n                    \"uzp2   v27.4s, v1.4s, v5.4s        \\n\"\n                    \"uzp1   v28.4s, v2.4s, v6.4s        \\n\"\n                    \"uzp2   v29.4s, v2.4s, v6.4s        \\n\"\n                    \"uzp1   v30.4s, v3.4s, v7.4s        \\n\"\n                    \"uzp2   v31.4s, v3.4s, v7.4s        
\\n\"\n\n                    \"uzp1   v0.4s, v8.4s, v12.4s        \\n\"\n                    \"uzp2   v1.4s, v8.4s, v12.4s        \\n\"\n                    \"uzp1   v2.4s, v9.4s, v13.4s        \\n\"\n                    \"uzp2   v3.4s, v9.4s, v13.4s        \\n\"\n                    \"uzp1   v4.4s, v10.4s, v14.4s       \\n\"\n                    \"uzp2   v5.4s, v10.4s, v14.4s       \\n\"\n                    \"uzp1   v6.4s, v11.4s, v15.4s       \\n\"\n                    \"uzp2   v7.4s, v11.4s, v15.4s       \\n\"\n\n                    \"sub    %0, %0, #320                \\n\"\n\n                    \"uzp1   v8.4s, v16.4s, v20.4s       \\n\"\n                    \"uzp2   v9.4s, v16.4s, v20.4s       \\n\"\n                    \"uzp1   v10.4s, v17.4s, v21.4s      \\n\"\n                    \"uzp2   v11.4s, v17.4s, v21.4s      \\n\"\n                    \"uzp1   v12.4s, v18.4s, v22.4s      \\n\"\n                    \"uzp2   v13.4s, v18.4s, v22.4s      \\n\"\n                    \"uzp1   v14.4s, v19.4s, v23.4s      \\n\"\n                    \"uzp2   v15.4s, v19.4s, v23.4s      \\n\"\n\n                    \"st1    {v24.4s}, [%1], #16         \\n\"\n                    \"st1    {v0.4s}, [%1], #16          \\n\"\n                    \"st1    {v8.4s}, [%1], #16          \\n\"\n                    \"st1    {v26.4s}, [%1], #16         \\n\"\n                    \"st1    {v2.4s}, [%1], #16          \\n\"\n                    \"st1    {v10.4s}, [%1], #16         \\n\"\n                    \"st1    {v28.4s}, [%1], #16         \\n\"\n                    \"st1    {v4.4s}, [%1], #16          \\n\"\n                    \"st1    {v12.4s}, [%1], #16         \\n\"\n                    \"st1    {v30.4s}, [%1], #16         \\n\"\n                    \"st1    {v6.4s}, [%1], #16          \\n\"\n                    \"st1    {v14.4s}, [%1], #16         \\n\"\n\n                    \"st1    {v25.4s}, [%1], #16         \\n\"\n                    \"st1    {v1.4s}, [%1], #16          
\\n\"\n                    \"st1    {v9.4s}, [%1], #16          \\n\"\n                    \"st1    {v27.4s}, [%1], #16         \\n\"\n                    \"st1    {v3.4s}, [%1], #16          \\n\"\n                    \"st1    {v11.4s}, [%1], #16         \\n\"\n                    \"st1    {v29.4s}, [%1], #16         \\n\"\n                    \"st1    {v5.4s}, [%1], #16          \\n\"\n                    \"st1    {v13.4s}, [%1], #16         \\n\"\n                    \"st1    {v31.4s}, [%1], #16         \\n\"\n                    \"st1    {v7.4s}, [%1], #16          \\n\"\n                    \"st1    {v15.4s}, [%1], #16         \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0 = vld4q_f32(p0);\n                float32x4x4_t _r1 = vld4q_f32(p0 + 16);\n                float32x4x4_t _r2 = vld4q_f32(p0 + 32);\n                float32x4x4_t _r3 = vld4q_f32(p0 + 48);\n                float32x4x4_t _r4 = vld4q_f32(p0 + 64);\n                float32x4x4_t _r5 = vld4q_f32(p0 + 80);\n                float32x4x2_t _r04l = vuzpq_f32(_r0.val[0], _r1.val[0]);\n                float32x4x2_t _r15l = vuzpq_f32(_r0.val[1], _r1.val[1]);\n                float32x4x2_t _r26l = vuzpq_f32(_r0.val[2], _r1.val[2]);\n                float32x4x2_t _r37l = vuzpq_f32(_r0.val[3], _r1.val[3]);\n                float32x4x2_t _r04m = vuzpq_f32(_r2.val[0], _r3.val[0]);\n                float32x4x2_t _r15m = vuzpq_f32(_r2.val[1], _r3.val[1]);\n                float32x4x2_t _r26m = 
vuzpq_f32(_r2.val[2], _r3.val[2]);\n                float32x4x2_t _r37m = vuzpq_f32(_r2.val[3], _r3.val[3]);\n                float32x4x2_t _r04h = vuzpq_f32(_r4.val[0], _r5.val[0]);\n                float32x4x2_t _r15h = vuzpq_f32(_r4.val[1], _r5.val[1]);\n                float32x4x2_t _r26h = vuzpq_f32(_r4.val[2], _r5.val[2]);\n                float32x4x2_t _r37h = vuzpq_f32(_r4.val[3], _r5.val[3]);\n                vst1q_f32(pp, _r04l.val[0]);\n                vst1q_f32(pp + 4, _r04m.val[0]);\n                vst1q_f32(pp + 4 * 2, _r04h.val[0]);\n                vst1q_f32(pp + 4 * 3, _r15l.val[0]);\n                vst1q_f32(pp + 4 * 4, _r15m.val[0]);\n                vst1q_f32(pp + 4 * 5, _r15h.val[0]);\n                vst1q_f32(pp + 4 * 6, _r26l.val[0]);\n                vst1q_f32(pp + 4 * 7, _r26m.val[0]);\n                vst1q_f32(pp + 4 * 8, _r26h.val[0]);\n                vst1q_f32(pp + 4 * 9, _r37l.val[0]);\n                vst1q_f32(pp + 4 * 10, _r37m.val[0]);\n                vst1q_f32(pp + 4 * 11, _r37h.val[0]);\n                vst1q_f32(pp + 4 * 12, _r04l.val[1]);\n                vst1q_f32(pp + 4 * 13, _r04m.val[1]);\n                vst1q_f32(pp + 4 * 14, _r04h.val[1]);\n                vst1q_f32(pp + 4 * 15, _r15l.val[1]);\n                vst1q_f32(pp + 4 * 16, _r15m.val[1]);\n                vst1q_f32(pp + 4 * 17, _r15h.val[1]);\n                vst1q_f32(pp + 4 * 18, _r26l.val[1]);\n                vst1q_f32(pp + 4 * 19, _r26m.val[1]);\n                vst1q_f32(pp + 4 * 20, _r26h.val[1]);\n                vst1q_f32(pp + 4 * 21, _r37l.val[1]);\n                vst1q_f32(pp + 4 * 22, _r37m.val[1]);\n                vst1q_f32(pp + 4 * 23, _r37h.val[1]);\n                p0 += max_jj * batch * 8;\n                pp += 96;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 
4x12\n                float32x4x4_t _r0 = vld4q_f32(p0);\n                float32x4x4_t _r1 = vld4q_f32(p0 + 16);\n                float32x4x4_t _r2 = vld4q_f32(p0 + 32);\n                vst1q_f32(pp, _r0.val[0]);\n                vst1q_f32(pp + 4, _r1.val[0]);\n                vst1q_f32(pp + 4 * 2, _r2.val[0]);\n                vst1q_f32(pp + 4 * 3, _r0.val[1]);\n                vst1q_f32(pp + 4 * 4, _r1.val[1]);\n                vst1q_f32(pp + 4 * 5, _r2.val[1]);\n                vst1q_f32(pp + 4 * 6, _r0.val[2]);\n                vst1q_f32(pp + 4 * 7, _r1.val[2]);\n                vst1q_f32(pp + 4 * 8, _r2.val[2]);\n                vst1q_f32(pp + 4 * 9, _r0.val[3]);\n                vst1q_f32(pp + 4 * 10, _r1.val[3]);\n                vst1q_f32(pp + 4 * 11, _r2.val[3]);\n                p0 += max_jj * batch * 4;\n                pp += 48;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // transpose 2x12\n                float32x4x2_t _r0 = vld2q_f32(p0);\n                float32x4x2_t _r1 = vld2q_f32(p0 + 8);\n                float32x4x2_t _r2 = vld2q_f32(p0 + 16);\n                vst1q_f32(pp, _r0.val[0]);\n                vst1q_f32(pp + 4, _r1.val[0]);\n                vst1q_f32(pp + 4 * 2, _r2.val[0]);\n                vst1q_f32(pp + 4 * 3, _r0.val[1]);\n                vst1q_f32(pp + 4 * 4, _r1.val[1]);\n                vst1q_f32(pp + 4 * 5, _r2.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 24;\n            }\n            p0 -= (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _r0 = vld1q_f32(p0);\n                float32x4_t _r1 = vld1q_f32(p0 + 4);\n                float32x4_t _r2 = vld1q_f32(p0 + 8);\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r1);\n                vst1q_f32(pp + 8, _r2);\n                p0 += max_jj * batch;\n                pp += 12;\n      
      }\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const float* p0 = B;\n\n            int kk = 0;\n#if __aarch64__\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x8\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0] \\n\"\n\n                    \"uzp1   v16.4s, v0.4s, v4.4s        \\n\"\n                    \"uzp2   v24.4s, v0.4s, v4.4s        \\n\"\n                    \"uzp1   v18.4s, v1.4s, v5.4s        \\n\"\n                    \"uzp2   v26.4s, v1.4s, v5.4s        \\n\"\n                    \"uzp1   v20.4s, v2.4s, v6.4s        \\n\"\n                    \"uzp2   v28.4s, v2.4s, v6.4s        \\n\"\n                    \"uzp1   v22.4s, v3.4s, v7.4s        \\n\"\n                    \"uzp2   v30.4s, v3.4s, v7.4s        \\n\"\n\n                    \"sub    %0, %0, #192                \\n\"\n\n                    \"uzp1   v17.4s, v8.4s, v12.4s       \\n\"\n                    \"uzp2   v25.4s, v8.4s, v12.4s       \\n\"\n                    \"uzp1   v19.4s, v9.4s, v13.4s       \\n\"\n                    \"uzp2   v27.4s, v9.4s, v13.4s       \\n\"\n                    \"uzp1   v21.4s, v10.4s, v14.4s      \\n\"\n                    \"uzp2   v29.4s, v10.4s, v14.4s      \\n\"\n                    \"uzp1   v23.4s, v11.4s, v15.4s      \\n\"\n                    \"uzp2   v31.4s, 
v11.4s, v15.4s      \\n\"\n\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%1], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%1], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0 = vld4q_f32(p0);\n                float32x4x4_t _r1 = vld4q_f32(p0 + 16);\n                float32x4x4_t _r2 = vld4q_f32(p0 + 32);\n                float32x4x4_t _r3 = vld4q_f32(p0 + 48);\n                float32x4x2_t _r04l = vuzpq_f32(_r0.val[0], _r1.val[0]);\n                float32x4x2_t _r15l = vuzpq_f32(_r0.val[1], _r1.val[1]);\n                float32x4x2_t _r26l = vuzpq_f32(_r0.val[2], _r1.val[2]);\n                float32x4x2_t _r37l = vuzpq_f32(_r0.val[3], _r1.val[3]);\n                float32x4x2_t _r04h = vuzpq_f32(_r2.val[0], _r3.val[0]);\n                float32x4x2_t _r15h = vuzpq_f32(_r2.val[1], _r3.val[1]);\n                float32x4x2_t _r26h = vuzpq_f32(_r2.val[2], _r3.val[2]);\n                float32x4x2_t _r37h = vuzpq_f32(_r2.val[3], _r3.val[3]);\n                vst1q_f32(pp, _r04l.val[0]);\n                vst1q_f32(pp + 4, _r04h.val[0]);\n                vst1q_f32(pp + 4 * 2, _r15l.val[0]);\n                vst1q_f32(pp + 4 * 3, _r15h.val[0]);\n                vst1q_f32(pp + 4 * 4, _r26l.val[0]);\n                vst1q_f32(pp + 4 * 5, 
_r26h.val[0]);\n                vst1q_f32(pp + 4 * 6, _r37l.val[0]);\n                vst1q_f32(pp + 4 * 7, _r37h.val[0]);\n                vst1q_f32(pp + 4 * 8, _r04l.val[1]);\n                vst1q_f32(pp + 4 * 9, _r04h.val[1]);\n                vst1q_f32(pp + 4 * 10, _r15l.val[1]);\n                vst1q_f32(pp + 4 * 11, _r15h.val[1]);\n                vst1q_f32(pp + 4 * 12, _r26l.val[1]);\n                vst1q_f32(pp + 4 * 13, _r26h.val[1]);\n                vst1q_f32(pp + 4 * 14, _r37l.val[1]);\n                vst1q_f32(pp + 4 * 15, _r37h.val[1]);\n                p0 += max_jj * batch * 8;\n                pp += 64;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n#endif // __aarch64__\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 4x8\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0] \\n\"\n                    \"sub    %0, %0, #64                 \\n\"\n                    \"st1    {v0.4s}, [%1], #16          \\n\"\n                    \"st1    {v4.4s}, [%1], #16          \\n\"\n                    \"st1    {v1.4s}, [%1], #16          \\n\"\n                    \"st1    {v5.4s}, [%1], #16          \\n\"\n                    \"st1    {v2.4s}, [%1], #16          \\n\"\n                    \"st1    {v6.4s}, [%1], #16          \\n\"\n                    \"st1    {v3.4s}, [%1], #16          \\n\"\n                    \"st1    {v7.4s}, [%1], #16          \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", 
\"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #512]          \\n\"\n                    \"vldm       %0!, {d0-d7}        \\n\"\n                    \"pld        [%0, #512]          \\n\"\n                    \"vldm       %0, {d16-d23}       \\n\"\n\n                    \"vtrn.32    q0, q1              \\n\"\n                    \"vtrn.32    q2, q3              \\n\"\n                    \"vtrn.32    q8, q9              \\n\"\n                    \"vtrn.32    q10, q11            \\n\"\n                    \"vswp       d1, d4              \\n\"\n                    \"vswp       d3, d6              \\n\"\n                    \"vswp       d17, d20            \\n\"\n                    \"vswp       d19, d22            \\n\"\n                    \"vswp       q1, q8              \\n\"\n                    \"vswp       q3, q10             \\n\"\n\n                    \"vst1.f32   {d0-d3}, [%1 :128]! \\n\"\n                    \"vst1.f32   {d16-d19}, [%1 :128]! \\n\"\n                    \"sub        %0, %0, #64         \\n\"\n                    \"vst1.f32   {d4-d7}, [%1 :128]! \\n\"\n                    \"vst1.f32   {d20-d23}, [%1 :128]! 
\\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\");\n#endif // __aarch64__\n                p0 += max_jj * batch * 4;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0 = vld4q_f32(p0);\n                float32x4x4_t _r1 = vld4q_f32(p0 + 16);\n                vst1q_f32(pp, _r0.val[0]);\n                vst1q_f32(pp + 4, _r1.val[0]);\n                vst1q_f32(pp + 4 * 2, _r0.val[1]);\n                vst1q_f32(pp + 4 * 3, _r1.val[1]);\n                vst1q_f32(pp + 4 * 4, _r0.val[2]);\n                vst1q_f32(pp + 4 * 5, _r1.val[2]);\n                vst1q_f32(pp + 4 * 6, _r0.val[3]);\n                vst1q_f32(pp + 4 * 7, _r1.val[3]);\n                p0 += max_jj * batch * 4;\n                pp += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 4;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // transpose 2x8\n                float32x4x2_t _r0 = vld2q_f32(p0);\n                float32x4x2_t _r1 = vld2q_f32(p0 + 8);\n                vst1q_f32(pp, _r0.val[0]);\n                vst1q_f32(pp + 4, _r1.val[0]);\n                vst1q_f32(pp + 4 * 2, _r0.val[1]);\n                vst1q_f32(pp + 4 * 3, _r1.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 16;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _r0 = vld1q_f32(p0);\n                float32x4_t _r1 = vld1q_f32(p0 + 4);\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r1);\n                p0 += max_jj * batch;\n                pp += 8;\n            }\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n   
     {\n            const float* p0 = B;\n\n            int kk = 0;\n#if __aarch64__\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x4\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0] \\n\"\n\n                    \"uzp1   v8.4s, v0.4s, v4.4s         \\n\"\n                    \"uzp2   v12.4s, v0.4s, v4.4s        \\n\"\n                    \"uzp1   v9.4s, v1.4s, v5.4s         \\n\"\n                    \"uzp2   v13.4s, v1.4s, v5.4s        \\n\"\n\n                    \"sub    %0, %0, #64                 \\n\"\n\n                    \"uzp1   v10.4s, v2.4s, v6.4s        \\n\"\n                    \"uzp2   v14.4s, v2.4s, v6.4s        \\n\"\n                    \"uzp1   v11.4s, v3.4s, v7.4s        \\n\"\n                    \"uzp2   v15.4s, v3.4s, v7.4s        \\n\"\n\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0;\n                float32x4x4_t _r1;\n                _r0.val[0] = vld1q_f32(p0);\n                _r1.val[0] = vld1q_f32(p0 + 4);\n                _r0.val[1] = vld1q_f32(p0 + 8);\n                _r1.val[1] = vld1q_f32(p0 + 12);\n                _r0.val[2] = vld1q_f32(p0 + 
16);\n                _r1.val[2] = vld1q_f32(p0 + 20);\n                _r0.val[3] = vld1q_f32(p0 + 24);\n                _r1.val[3] = vld1q_f32(p0 + 28);\n                vst4q_f32(pp, _r0);\n                vst4q_f32(pp + 16, _r1);\n                p0 += max_jj * batch * 8;\n                pp += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n#endif // __aarch64__\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 4x4\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                    \"st4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #512]          \\n\"\n                    \"vldm       %0, {d0-d7}         \\n\"\n                    \"vtrn.32    q0, q1              \\n\"\n                    \"vtrn.32    q2, q3              \\n\"\n                    \"vswp       d1, d4              \\n\"\n                    \"vswp       d3, d6              \\n\"\n                    \"vstm       %1!, {d0-d7}        \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n                p0 += max_jj * batch * 4;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0;\n                _r0.val[0] = vld1q_f32(p0);\n                _r0.val[1] = vld1q_f32(p0 + 4);\n                _r0.val[2] = 
vld1q_f32(p0 + 8);\n                _r0.val[3] = vld1q_f32(p0 + 12);\n                vst4q_f32(pp, _r0);\n                p0 += max_jj * batch * 4;\n                pp += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 4;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // transpose 2x4\n                float32x4x2_t _r0 = vld2q_f32(p0);\n                vst1q_f32(pp, _r0.val[0]);\n                vst1q_f32(pp + 4, _r0.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 8;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _r0 = vld1q_f32(p0);\n                vst1q_f32(pp, _r0);\n                p0 += max_jj * batch;\n                pp += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const float* p0 = B;\n\n            int kk = 0;\n#if __ARM_NEON\n#if __aarch64__\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x2\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n\n                    \"zip1   v4.4s, v0.4s, v2.4s         \\n\"\n                    \"zip2   v5.4s, v0.4s, v2.4s         \\n\"\n                    \"zip1   v6.4s, v1.4s, v3.4s         \\n\"\n                    \"zip2   v7.4s, v1.4s, v3.4s         \\n\"\n\n                    \"st1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", 
\"v5\", \"v6\", \"v7\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x2_t _r0;\n                float32x4x2_t _r1;\n                _r0.val[0] = vld1q_f32(p0);\n                _r1.val[0] = vld1q_f32(p0 + 4);\n                _r0.val[1] = vld1q_f32(p0 + 8);\n                _r1.val[1] = vld1q_f32(p0 + 12);\n                vst2q_f32(pp, _r0);\n                vst2q_f32(pp + 8, _r1);\n                p0 += max_jj * batch * 8;\n                pp += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n#endif // __aarch64__\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 4x2\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]       \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%0]        \\n\"\n                    \"st2    {v0.4s, v1.4s}, [%1], #32   \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%0 :128]  \\n\"\n                    \"vst2.f32   {d0-d3}, [%1 :128]! 
\\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\");\n#endif // __aarch64__\n                p0 += max_jj * batch * 4;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x2_t _r0;\n                _r0.val[0] = vld1q_f32(p0);\n                _r0.val[1] = vld1q_f32(p0 + 4);\n                vst2q_f32(pp, _r0);\n                p0 += max_jj * batch * 4;\n                pp += 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 4;\n#endif // __ARM_NEON\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[2];\n                pp[2] = p0[1];\n                pp[3] = p0[3];\n                p0 += max_jj * batch * 2;\n                pp += 4;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                p0 += max_jj * batch;\n                pp += 2;\n            }\n        }\n        for (; jj < max_jj; jj++)\n        {\n            const float* p0 = B;\n\n            int kk = 0;\n#if __ARM_NEON\n#if __aarch64__\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]       \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%0]        \\n\"\n                    \"st1    {v0.4s, v1.4s}, [%1], #32   \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\");\n                p0 += max_jj * batch * 8;\n#else  // 
NCNN_GNU_INLINE_ASM\n                float32x4_t _r0 = vld1q_f32(p0);\n                float32x4_t _r1 = vld1q_f32(p0 + 4);\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r1);\n                p0 += max_jj * batch * 8;\n                pp += 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n#endif // __aarch64__\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #128]       \\n\"\n                    \"ld1    {v0.4s}, [%0]               \\n\"\n                    \"st1    {v0.4s}, [%1], #16          \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #128]          \\n\"\n                    \"vld1.f32   {d0-d1}, [%0]       \\n\"\n                    \"vst1.f32   {d0-d1}, [%1]!      
\\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\");\n#endif // __aarch64__\n                p0 += max_jj * batch * 4;\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _r0 = vld1q_f32(p0);\n                vst1q_f32(pp, _r0);\n                p0 += max_jj * batch * 4;\n                pp += 4;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 4;\n#endif // __ARM_NEON\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                p0 += max_jj * batch * 2;\n                pp += 2;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                p0 += max_jj * batch;\n                pp += 1;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd_gemm_transB_packed_tile(const Mat& AT_tile, const Mat& BT_tile, Mat& top_blob, int batch, int max_ii, int max_jj, int k, int max_kk, int use_a53_a55_optimized_kernel)\n{\n    // NCNN_LOGE(\"conv3x3s1_winograd_gemm_transB_packed_tile %d %d %d\", max_ii, max_jj, max_kk);\n    float* outptr = top_blob;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const float* pAT = AT_tile.row(b) + max_kk * ii;\n            const float* pB = BT_tile.row(b);\n\n            int jj = 0;\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                if (use_a53_a55_optimized_kernel && cpu_support_arm_asimdhp())\n                {\n                    // a55\n                    asm volatile(\n           
             \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                        \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                        \"subs   %0, %0, #320                \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                        \"eor    v9.16b, v9.16b, v9.16b      \\n\"\n                        \"eor    v10.16b, v10.16b, v10.16b   \\n\"\n                        \"eor    v11.16b, v11.16b, v11.16b   \\n\"\n                        \"eor    v12.16b, v12.16b, v12.16b   \\n\"\n                        \"eor    v13.16b, v13.16b, v13.16b   \\n\"\n                        \"eor    v14.16b, v14.16b, v14.16b   \\n\"\n                        \"eor    v15.16b, v15.16b, v15.16b   \\n\"\n                        \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                        \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                        \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                        \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                        \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n       
                 \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.4s}, [%1], #16          \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s}, [%2], #16          \\n\"\n\n                        \"ldr    d5, [%1], #8                \\n\"\n                        \"ldr    x25, [%1], #8               \\n\"\n\n                        \".align 4                           \\n\"\n                        \"2:                                 \\n\"\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v10.4s, v4.4s, v0.s[1]      \\n\"\n                        \"ins    v5.d[1], x25                \\n\"\n                        \"fmla   v12.4s, v4.4s, v0.s[2]      \\n\"\n                        \"ldr    d2, [%2], #8                \\n\"\n                        \"fmla   v14.4s, v4.4s, v0.s[3]      \\n\"\n                        \"ldr    x22, [%2], #8               \\n\"\n                        \"fmla   v9.4s, v5.4s, v0.s[0]       \\n\"\n                        \"ldr    d6, [%1], #8                \\n\"\n       
                 \"fmla   v11.4s, v5.4s, v0.s[1]      \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v13.4s, v5.4s, v0.s[2]      \\n\"\n                        \"ldr    x26, [%1], #8               \\n\"\n                        \"fmla   v15.4s, v5.4s, v0.s[3]      \\n\"\n                        \"ldr    d3, [%2], #8                \\n\"\n                        \"fmla   v16.4s, v4.4s, v1.s[0]      \\n\"\n                        \"ldr    x23, [%2], #8               \\n\"\n                        \"fmla   v18.4s, v4.4s, v1.s[1]      \\n\"\n                        \"ldr    d7, [%1], #8                \\n\"\n                        \"fmla   v20.4s, v4.4s, v1.s[2]      \\n\"\n                        \"ldr    x27, [%1], #8               \\n\"\n                        \"fmla   v22.4s, v4.4s, v1.s[3]      \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v17.4s, v5.4s, v1.s[0]      \\n\"\n                        \"ldr    d0, [%2], #8                \\n\"\n                        \"fmla   v19.4s, v5.4s, v1.s[1]      \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v21.4s, v5.4s, v1.s[2]      \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v23.4s, v5.4s, v1.s[3]      \\n\"\n                        \"fmla   v24.4s, v4.4s, v2.s[0]      \\n\"\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"fmla   v26.4s, v4.4s, v2.s[1]      \\n\"\n                        \"ins    v6.d[1], x26                \\n\"\n                        \"fmla   v28.4s, v4.4s, v2.s[2]      \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v30.4s, v4.4s, v2.s[3]      \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n   
                     \"fmla   v25.4s, v5.4s, v2.s[0]      \\n\"\n                        \"ldr    d4, [%1], #8                \\n\"\n                        \"fmla   v27.4s, v5.4s, v2.s[1]      \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v29.4s, v5.4s, v2.s[2]      \\n\"\n                        \"ldr    x24, [%1], #8               \\n\"\n                        \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n                        \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                        \"ldr    d2, [%2], #8                \\n\"\n                        \"fmla   v10.4s, v6.4s, v3.s[1]      \\n\"\n                        \"ins    v7.d[1], x27                \\n\"\n                        \"fmla   v12.4s, v6.4s, v3.s[2]      \\n\"\n                        \"ldr    x22, [%2], #8               \\n\"\n                        \"fmla   v14.4s, v6.4s, v3.s[3]      \\n\"\n                        \"fmla   v9.4s, v7.4s, v3.s[0]       \\n\"\n                        \"ldr    d5, [%1], #8                \\n\"\n                        \"fmla   v11.4s, v7.4s, v3.s[1]      \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[2]      \\n\"\n                        \"ldr    x25, [%1], #8               \\n\"\n                        \"fmla   v15.4s, v7.4s, v3.s[3]      \\n\"\n                        \"fmla   v16.4s, v6.4s, v0.s[0]      \\n\"\n                        \"ldr    d3, [%2], #8                \\n\"\n                        \"fmla   v18.4s, v6.4s, v0.s[1]      \\n\"\n                        \"ldr    x23, [%2], #8               \\n\"\n                        \"fmla   v20.4s, v6.4s, v0.s[2]      \\n\"\n                        \"fmla   v22.4s, v6.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v7.4s, v0.s[0]      \\n\"\n                        \"fmla   v19.4s, v7.4s, v0.s[1]      \\n\"\n                        \"ins  
  v1.d[1], x21                \\n\"\n                        \"fmla   v21.4s, v7.4s, v0.s[2]      \\n\"\n                        \"fmla   v23.4s, v7.4s, v0.s[3]      \\n\"\n                        \"prfm   pldl1keep, [%2, #256]       \\n\" // NOTE PRELOAD\n                        \"fmla   v24.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v26.4s, v6.4s, v1.s[1]      \\n\"\n                        \"ins    v4.d[1], x24                \\n\"\n                        \"fmla   v28.4s, v6.4s, v1.s[2]      \\n\"\n                        \"ldr    d0, [%2], #8                \\n\"\n                        \"fmla   v30.4s, v6.4s, v1.s[3]      \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v25.4s, v7.4s, v1.s[0]      \\n\"\n                        \"ldr    d6, [%1], #8                \\n\"\n                        \"fmla   v27.4s, v7.4s, v1.s[1]      \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v29.4s, v7.4s, v1.s[2]      \\n\"\n                        \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n                        \"ldr    x26, [%1], #8               \\n\"\n                        \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"fmla   v10.4s, v4.4s, v2.s[1]      \\n\"\n                        \"ins    v5.d[1], x25                \\n\"\n                        \"fmla   v12.4s, v4.4s, v2.s[2]      \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v14.4s, v4.4s, v2.s[3]      \\n\"\n                        \"ldr    d7, [%1], #8                \\n\"\n                        \"fmla   v9.4s, v5.4s, v2.s[0]       \\n\"\n                        \"ldr    x27, [%1], #8               \\n\"\n                        \"fmla   v11.4s, v5.4s, v2.s[1]      \\n\"\n                        \"ins    v3.d[1], 
x23                \\n\"\n                        \"fmla   v13.4s, v5.4s, v2.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v2.s[3]      \\n\"\n                        \"fmla   v16.4s, v4.4s, v3.s[0]      \\n\"\n                        \"ldr    d2, [%2], #8                \\n\"\n                        \"fmla   v18.4s, v4.4s, v3.s[1]      \\n\"\n                        \"ldr    x22, [%2], #8               \\n\"\n                        \"fmla   v20.4s, v4.4s, v3.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v3.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v3.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v3.s[1]      \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v21.4s, v5.4s, v3.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v3.s[3]      \\n\"\n                        \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                        \"ldr    d3, [%2], #8                \\n\"\n                        \"fmla   v26.4s, v4.4s, v0.s[1]      \\n\"\n                        \"ldr    x23, [%2], #8               \\n\"\n                        \"fmla   v28.4s, v4.4s, v0.s[2]      \\n\"\n                        \"ins    v6.d[1], x26                \\n\"\n                        \"fmla   v30.4s, v4.4s, v0.s[3]      \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v25.4s, v5.4s, v0.s[0]      \\n\"\n                        \"ldr    d4, [%1], #8                \\n\"\n                        \"fmla   v27.4s, v5.4s, v0.s[1]      \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v29.4s, v5.4s, v0.s[2]      \\n\"\n                        \"ldr    x24, [%1], #8               \\n\"\n                        \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                        \"prfm   pldl1keep, [%2, 
#512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                        \"ldr    d0, [%2], #8                \\n\"\n                        \"fmla   v10.4s, v6.4s, v1.s[1]      \\n\"\n                        \"ins    v7.d[1], x27                \\n\"\n                        \"fmla   v12.4s, v6.4s, v1.s[2]      \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v14.4s, v6.4s, v1.s[3]      \\n\"\n                        \"ldr    d5, [%1], #8                \\n\"\n                        \"fmla   v9.4s, v7.4s, v1.s[0]       \\n\"\n                        \"ldr    x25, [%1], #8               \\n\"\n                        \"fmla   v11.4s, v7.4s, v1.s[1]      \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v13.4s, v7.4s, v1.s[2]      \\n\"\n                        \"fmla   v15.4s, v7.4s, v1.s[3]      \\n\"\n                        \"fmla   v16.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v18.4s, v6.4s, v2.s[1]      \\n\"\n                        \"fmla   v20.4s, v6.4s, v2.s[2]      \\n\"\n                        \"fmla   v22.4s, v6.4s, v2.s[3]      \\n\"\n                        \"fmla   v17.4s, v7.4s, v2.s[0]      \\n\"\n                        \"fmla   v19.4s, v7.4s, v2.s[1]      \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v21.4s, v7.4s, v2.s[2]      \\n\"\n                        \"fmla   v23.4s, v7.4s, v2.s[3]      \\n\"\n                        \"fmla   v24.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v26.4s, v6.4s, v3.s[1]      \\n\"\n                        \"fmla   v28.4s, v6.4s, v3.s[2]      \\n\"\n                        \"ins    v4.d[1], x24                \\n\"\n                        \"fmla   v30.4s, v6.4s, v3.s[3]      \\n\"\n                        \"fmla   v25.4s, v7.4s, v3.s[0]      
\\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v27.4s, v7.4s, v3.s[1]      \\n\"\n                        \"fmla   v29.4s, v7.4s, v3.s[2]      \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n                        \"bne    2b                          \\n\"\n\n                        \"sub    %1, %1, #32                 \\n\"\n                        \"sub    %2, %2, #16                 \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                        \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                        \"fmla   v10.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v12.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v16.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v1.s[3]      \\n\"\n                        \"fmla   v24.4s, v4.4s, v2.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v2.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v2.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v2.s[3]      \\n\"\n\n                        \"subs   w4, w4, #1                  \\n\"\n\n                        \"fmla   v9.4s, 
v5.4s, v0.s[0]       \\n\"\n                        \"fmla   v11.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v1.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v1.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v2.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v2.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v2.s[2]      \\n\"\n                        \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                        \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", 
\"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                else if (use_a53_a55_optimized_kernel && !cpu_support_arm_asimdhp())\n                {\n                    // a53\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                        \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                        \"subs   %0, %0, #320                \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                        \"eor    v9.16b, v9.16b, v9.16b      \\n\"\n                        \"eor    v10.16b, v10.16b, v10.16b   \\n\"\n                        \"eor    v11.16b, v11.16b, v11.16b   \\n\"\n                        \"eor    v12.16b, v12.16b, v12.16b   \\n\"\n                        \"eor    v13.16b, v13.16b, v13.16b   \\n\"\n                        \"eor    v14.16b, v14.16b, v14.16b   \\n\"\n                        \"eor    v15.16b, v15.16b, v15.16b   \\n\"\n                        \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                        \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                        \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                        \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                        \"eor    
v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v4.4s}, [%1], #16          \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v0.4s}, [%2], #16          \\n\"\n\n                        \"ldr    d1, [%2]                    \\n\"\n                        \"ldr    x21, [%2, #8]               \\n\"\n                        \"ldr    d2, [%2, #16]               \\n\"\n                        \"ldr    x22, [%2, #24]              \\n\"\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \".align 4                           \\n\"\n                        \"2:                                 \\n\"\n\n                        \"ldr    d5, [%1]                    \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        
\"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                        \"ldr    x25, [%1, #8]               \\n\"\n                        \"fmla   v10.4s, v4.4s, v0.s[1]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v12.4s, v4.4s, v0.s[2]      \\n\"\n\n                        \"ldr    d6, [%1]                    \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v14.4s, v4.4s, v0.s[3]      \\n\"\n                        \"ldr    x26, [%1, #8]               \\n\"\n                        \"fmla   v16.4s, v4.4s, v1.s[0]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v18.4s, v4.4s, v1.s[1]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                        \"fmla   v20.4s, v4.4s, v1.s[2]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v22.4s, v4.4s, v1.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v24.4s, v4.4s, v2.s[0]      \\n\"\n\n                        \"ldr    d3, [%2]                    \\n\"\n                        \"ins    v5.d[1], x25                \\n\"\n                        \"fmla   v26.4s, v4.4s, v2.s[1]      \\n\"\n                        \"ldr    x23, [%2, #8]               \\n\"\n                        \"fmla   v28.4s, v4.4s, v2.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v30.4s, v4.4s, v2.s[3]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                        \"fmla   v9.4s, v5.4s, v0.s[0]       \\n\"\n            
            \"nop                                \\n\"\n                        \"fmla   v11.4s, v5.4s, v0.s[1]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v13.4s, v5.4s, v0.s[2]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v15.4s, v5.4s, v0.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v17.4s, v5.4s, v1.s[0]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v19.4s, v5.4s, v1.s[1]      \\n\"\n\n                        \"ldr    d0, [%2]                    \\n\"\n                        \"ins    v6.d[1], x26                \\n\"\n                        \"fmla   v21.4s, v5.4s, v1.s[2]      \\n\"\n                        \"ldr    x20, [%2, #8]               \\n\"\n                        \"fmla   v23.4s, v5.4s, v1.s[3]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v25.4s, v5.4s, v2.s[0]      \\n\"\n\n                        \"ldr    d1, [%2]                    \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v27.4s, v5.4s, v2.s[1]      \\n\"\n                        \"ldr    x21, [%2, #8]               \\n\"\n                        \"fmla   v29.4s, v5.4s, v2.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"ldr    d7, [%1]                    \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                        \"ldr    x27, [%1, #8]               \\n\"\n                        \"fmla  
 v10.4s, v6.4s, v3.s[1]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v12.4s, v6.4s, v3.s[2]      \\n\"\n\n                        \"ldr    d4, [%1]                    \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v14.4s, v6.4s, v3.s[3]      \\n\"\n                        \"ldr    x24, [%1, #8]               \\n\"\n                        \"fmla   v16.4s, v6.4s, v0.s[0]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v18.4s, v6.4s, v0.s[1]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                        \"fmla   v20.4s, v6.4s, v0.s[2]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v22.4s, v6.4s, v0.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v24.4s, v6.4s, v1.s[0]      \\n\"\n\n                        \"ldr    d2, [%2]                    \\n\"\n                        \"ins    v7.d[1], x27                \\n\"\n                        \"fmla   v26.4s, v6.4s, v1.s[1]      \\n\"\n                        \"ldr    x22, [%2, #8]               \\n\"\n                        \"fmla   v28.4s, v6.4s, v1.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v30.4s, v6.4s, v1.s[3]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                        \"fmla   v9.4s, v7.4s, v3.s[0]       \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v11.4s, v7.4s, v3.s[1]      \\n\"\n                    
    \"nop                                \\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[2]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v15.4s, v7.4s, v3.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v17.4s, v7.4s, v0.s[0]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v19.4s, v7.4s, v0.s[1]      \\n\"\n\n                        \"ldr    d3, [%2]                    \\n\"\n                        \"ins    v4.d[1], x24                \\n\"\n                        \"fmla   v21.4s, v7.4s, v0.s[2]      \\n\"\n                        \"ldr    x23, [%2, #8]               \\n\"\n                        \"fmla   v23.4s, v7.4s, v0.s[3]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v25.4s, v7.4s, v1.s[0]      \\n\"\n\n                        \"ldr    d0, [%2]                    \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v27.4s, v7.4s, v1.s[1]      \\n\"\n                        \"ldr    x20, [%2, #8]               \\n\"\n                        \"fmla   v29.4s, v7.4s, v1.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n\n                        \"ldr    d5, [%1]                    \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                        \"ldr    x25, [%1, #8]               \\n\"\n                        \"fmla   v10.4s, v4.4s, v2.s[1]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   
v12.4s, v4.4s, v2.s[2]      \\n\"\n\n                        \"ldr    d6, [%1]                    \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v14.4s, v4.4s, v2.s[3]      \\n\"\n                        \"ldr    x26, [%1, #8]               \\n\"\n                        \"fmla   v16.4s, v4.4s, v3.s[0]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v18.4s, v4.4s, v3.s[1]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                        \"fmla   v20.4s, v4.4s, v3.s[2]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v22.4s, v4.4s, v3.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n\n                        \"ldr    d1, [%2]                    \\n\"\n                        \"ins    v5.d[1], x25                \\n\"\n                        \"fmla   v26.4s, v4.4s, v0.s[1]      \\n\"\n                        \"ldr    x21, [%2, #8]               \\n\"\n                        \"fmla   v28.4s, v4.4s, v0.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v30.4s, v4.4s, v0.s[3]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                        \"fmla   v9.4s, v5.4s, v2.s[0]       \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v11.4s, v5.4s, v2.s[1]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v13.4s, v5.4s, v2.s[2]      \\n\"\n\n                   
     \"nop                                \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v15.4s, v5.4s, v2.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v17.4s, v5.4s, v3.s[0]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v19.4s, v5.4s, v3.s[1]      \\n\"\n\n                        \"ldr    d2, [%2]                    \\n\"\n                        \"ins    v6.d[1], x26                \\n\"\n                        \"fmla   v21.4s, v5.4s, v3.s[2]      \\n\"\n                        \"ldr    x22, [%2, #8]               \\n\"\n                        \"fmla   v23.4s, v5.4s, v3.s[3]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v25.4s, v5.4s, v0.s[0]      \\n\"\n\n                        \"ldr    d3, [%2]                    \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v27.4s, v5.4s, v0.s[1]      \\n\"\n                        \"ldr    x23, [%2, #8]               \\n\"\n                        \"fmla   v29.4s, v5.4s, v0.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n\n                        \"ldr    d7, [%1]                    \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                        \"ldr    x27, [%1, #8]               \\n\"\n                        \"fmla   v10.4s, v6.4s, v1.s[1]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v12.4s, v6.4s, v1.s[2]      \\n\"\n\n                        \"ldr    d4, [%1]                    \\n\"\n                        \"ins    
v3.d[1], x23                \\n\"\n                        \"fmla   v14.4s, v6.4s, v1.s[3]      \\n\"\n                        \"ldr    x24, [%1, #8]               \\n\"\n                        \"fmla   v16.4s, v6.4s, v2.s[0]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v18.4s, v6.4s, v2.s[1]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                        \"fmla   v20.4s, v6.4s, v2.s[2]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v22.4s, v6.4s, v2.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v24.4s, v6.4s, v3.s[0]      \\n\"\n\n                        \"ldr    d0, [%2]                    \\n\"\n                        \"ins    v7.d[1], x27                \\n\"\n                        \"fmla   v26.4s, v6.4s, v3.s[1]      \\n\"\n                        \"ldr    x20, [%2, #8]               \\n\"\n                        \"fmla   v28.4s, v6.4s, v3.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v30.4s, v6.4s, v3.s[3]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                        \"fmla   v9.4s, v7.4s, v1.s[0]       \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v11.4s, v7.4s, v1.s[1]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v13.4s, v7.4s, v1.s[2]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"nop                                \\n\"\n                     
   \"fmla   v15.4s, v7.4s, v1.s[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v17.4s, v7.4s, v2.s[0]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v19.4s, v7.4s, v2.s[1]      \\n\"\n\n                        \"ldr    d1, [%2]                    \\n\"\n                        \"ins    v4.d[1], x24                \\n\"\n                        \"fmla   v21.4s, v7.4s, v2.s[2]      \\n\"\n                        \"ldr    x21, [%2, #8]               \\n\"\n                        \"fmla   v23.4s, v7.4s, v2.s[3]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v25.4s, v7.4s, v3.s[0]      \\n\"\n\n                        \"ldr    d2, [%2]                    \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v27.4s, v7.4s, v3.s[1]      \\n\"\n                        \"ldr    x22, [%2, #8]               \\n\"\n                        \"fmla   v29.4s, v7.4s, v3.s[2]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n\n                        \"bne    2b                          \\n\"\n\n                        \"sub    %1, %1, #16                 \\n\"\n                        \"sub    %2, %2, #48                 \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n  
                      \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                        \"fmla   v10.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v12.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v16.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v1.s[3]      \\n\"\n                        \"fmla   v24.4s, v4.4s, v2.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v2.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v2.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v2.s[3]      \\n\"\n\n                        \"subs   w4, w4, #1                  \\n\"\n\n                        \"fmla   v9.4s, v5.4s, v0.s[0]       \\n\"\n                        \"fmla   v11.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v1.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v1.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v2.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v2.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v2.s[2]      \\n\"\n                        \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                        \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], 
#64 \\n\"\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                else\n                {\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                        \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                        \"subs   %0, %0, #320                \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                     
   \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                        \"eor    v9.16b, v9.16b, v9.16b      \\n\"\n                        \"eor    v10.16b, v10.16b, v10.16b   \\n\"\n                        \"eor    v11.16b, v11.16b, v11.16b   \\n\"\n                        \"eor    v12.16b, v12.16b, v12.16b   \\n\"\n                        \"eor    v13.16b, v13.16b, v13.16b   \\n\"\n                        \"eor    v14.16b, v14.16b, v14.16b   \\n\"\n                        \"eor    v15.16b, v15.16b, v15.16b   \\n\"\n                        \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                        \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                        \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                        \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                        \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"2:                                 \\n\"\n                        
\"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n\n                        \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                        \"fmla   v10.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v12.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v9.4s, v5.4s, v0.s[0]       \\n\"\n                        \"fmla   v11.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v0.s[3]      \\n\"\n\n                        \"fmla   v16.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v1.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v1.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v1.s[3]      \\n\"\n\n                        \"fmla   v24.4s, v4.4s, v2.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v2.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v2.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v2.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v2.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v2.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v2.s[2]      \\n\"\n                        \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                  
      \"fmla   v10.4s, v6.4s, v3.s[1]      \\n\"\n                        \"fmla   v12.4s, v6.4s, v3.s[2]      \\n\"\n                        \"fmla   v14.4s, v6.4s, v3.s[3]      \\n\"\n                        \"fmla   v9.4s, v7.4s, v3.s[0]       \\n\"\n                        \"fmla   v11.4s, v7.4s, v3.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v3.s[2]      \\n\"\n                        \"fmla   v15.4s, v7.4s, v3.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n\n                        \"fmla   v16.4s, v6.4s, v0.s[0]      \\n\"\n                        \"fmla   v18.4s, v6.4s, v0.s[1]      \\n\"\n                        \"fmla   v20.4s, v6.4s, v0.s[2]      \\n\"\n                        \"fmla   v22.4s, v6.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v7.4s, v0.s[0]      \\n\"\n                        \"fmla   v19.4s, v7.4s, v0.s[1]      \\n\"\n                        \"fmla   v21.4s, v7.4s, v0.s[2]      \\n\"\n                        \"fmla   v23.4s, v7.4s, v0.s[3]      \\n\"\n\n                        \"fmla   v24.4s, v6.4s, v1.s[0]      \\n\"\n                        \"fmla   v26.4s, v6.4s, v1.s[1]      \\n\"\n                        \"fmla   v28.4s, v6.4s, v1.s[2]      \\n\"\n                        \"fmla   v30.4s, v6.4s, v1.s[3]      \\n\"\n                        \"fmla   v25.4s, v7.4s, v1.s[0]      \\n\"\n                        \"fmla   v27.4s, v7.4s, v1.s[1]      \\n\"\n                        \"fmla   v29.4s, v7.4s, v1.s[2]      \\n\"\n                        \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n\n                        \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                        \"fmla   v10.4s, v4.4s, v2.s[1]      \\n\"\n            
            \"fmla   v12.4s, v4.4s, v2.s[2]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v2.s[3]      \\n\"\n                        \"fmla   v9.4s, v5.4s, v2.s[0]       \\n\"\n                        \"fmla   v11.4s, v5.4s, v2.s[1]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v2.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"fmla   v16.4s, v4.4s, v3.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v3.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v3.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v3.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v3.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v3.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v3.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v3.s[3]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n\n                        \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v0.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n\n                        \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                        \"fmla   v10.4s, v6.4s, v1.s[1]      \\n\"\n                        \"fmla   v12.4s, v6.4s, v1.s[2]      \\n\"\n                        \"fmla   v14.4s, v6.4s, v1.s[3]      \\n\"\n                        \"fmla   v9.4s, v7.4s, v1.s[0]       \\n\"\n                    
    \"fmla   v11.4s, v7.4s, v1.s[1]      \\n\"\n                        \"fmla   v13.4s, v7.4s, v1.s[2]      \\n\"\n                        \"fmla   v15.4s, v7.4s, v1.s[3]      \\n\"\n\n                        \"fmla   v16.4s, v6.4s, v2.s[0]      \\n\"\n                        \"fmla   v18.4s, v6.4s, v2.s[1]      \\n\"\n                        \"fmla   v20.4s, v6.4s, v2.s[2]      \\n\"\n                        \"fmla   v22.4s, v6.4s, v2.s[3]      \\n\"\n                        \"fmla   v17.4s, v7.4s, v2.s[0]      \\n\"\n                        \"fmla   v19.4s, v7.4s, v2.s[1]      \\n\"\n                        \"fmla   v21.4s, v7.4s, v2.s[2]      \\n\"\n                        \"fmla   v23.4s, v7.4s, v2.s[3]      \\n\"\n\n                        \"subs   w4, w4, #1                  \\n\"\n\n                        \"fmla   v24.4s, v6.4s, v3.s[0]      \\n\"\n                        \"fmla   v26.4s, v6.4s, v3.s[1]      \\n\"\n                        \"fmla   v28.4s, v6.4s, v3.s[2]      \\n\"\n                        \"fmla   v30.4s, v6.4s, v3.s[3]      \\n\"\n                        \"fmla   v25.4s, v7.4s, v3.s[0]      \\n\"\n                        \"fmla   v27.4s, v7.4s, v3.s[1]      \\n\"\n                        \"fmla   v29.4s, v7.4s, v3.s[2]      \\n\"\n                        \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n\n                        \"bne    2b                          \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                        \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n 
                       \"fmla   v10.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v12.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v14.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v16.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v1.s[3]      \\n\"\n                        \"fmla   v24.4s, v4.4s, v2.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v2.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v2.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v2.s[3]      \\n\"\n\n                        \"subs   w4, w4, #1                  \\n\"\n\n                        \"fmla   v9.4s, v5.4s, v0.s[0]       \\n\"\n                        \"fmla   v11.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v13.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v15.4s, v5.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v1.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v1.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v2.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v2.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v2.s[2]      \\n\"\n                        \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                        \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                        \"st1    {v16.4s, v17.4s, 
v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum00;\n                float32x4_t _sum01;\n                float32x4_t _sum10;\n                float32x4_t _sum11;\n                float32x4_t _sum20;\n                float32x4_t _sum21;\n                float32x4_t _sum30;\n                float32x4_t _sum31;\n                float32x4_t _sum40;\n                float32x4_t _sum41;\n                float32x4_t _sum50;\n                float32x4_t _sum51;\n                float32x4_t _sum60;\n                float32x4_t _sum61;\n                float32x4_t _sum70;\n                float32x4_t _sum71;\n                float32x4_t _sum80;\n                float32x4_t _sum81;\n                float32x4_t _sum90;\n                float32x4_t _sum91;\n                float32x4_t _suma0;\n                float32x4_t _suma1;\n                float32x4_t _sumb0;\n                float32x4_t _sumb1;\n\n                if (k == 0)\n                {\n                    _sum00 = 
vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                    _sum40 = vdupq_n_f32(0.f);\n                    _sum41 = vdupq_n_f32(0.f);\n                    _sum50 = vdupq_n_f32(0.f);\n                    _sum51 = vdupq_n_f32(0.f);\n                    _sum60 = vdupq_n_f32(0.f);\n                    _sum61 = vdupq_n_f32(0.f);\n                    _sum70 = vdupq_n_f32(0.f);\n                    _sum71 = vdupq_n_f32(0.f);\n                    _sum80 = vdupq_n_f32(0.f);\n                    _sum81 = vdupq_n_f32(0.f);\n                    _sum90 = vdupq_n_f32(0.f);\n                    _sum91 = vdupq_n_f32(0.f);\n                    _suma0 = vdupq_n_f32(0.f);\n                    _suma1 = vdupq_n_f32(0.f);\n                    _sumb0 = vdupq_n_f32(0.f);\n                    _sumb1 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum00 = vld1q_f32(outptr);\n                    _sum01 = vld1q_f32(outptr + 4 * 1);\n                    _sum10 = vld1q_f32(outptr + 4 * 2);\n                    _sum11 = vld1q_f32(outptr + 4 * 3);\n                    _sum20 = vld1q_f32(outptr + 4 * 4);\n                    _sum21 = vld1q_f32(outptr + 4 * 5);\n                    _sum30 = vld1q_f32(outptr + 4 * 6);\n                    _sum31 = vld1q_f32(outptr + 4 * 7);\n                    _sum40 = vld1q_f32(outptr + 4 * 8);\n                    _sum41 = vld1q_f32(outptr + 4 * 9);\n                    _sum50 = vld1q_f32(outptr + 4 * 10);\n                    _sum51 = vld1q_f32(outptr + 4 * 11);\n                    _sum60 = vld1q_f32(outptr + 4 * 12);\n                    _sum61 = vld1q_f32(outptr + 4 * 13);\n                    
_sum70 = vld1q_f32(outptr + 4 * 14);\n                    _sum71 = vld1q_f32(outptr + 4 * 15);\n                    _sum80 = vld1q_f32(outptr + 4 * 16);\n                    _sum81 = vld1q_f32(outptr + 4 * 17);\n                    _sum90 = vld1q_f32(outptr + 4 * 18);\n                    _sum91 = vld1q_f32(outptr + 4 * 19);\n                    _suma0 = vld1q_f32(outptr + 4 * 20);\n                    _suma1 = vld1q_f32(outptr + 4 * 21);\n                    _sumb0 = vld1q_f32(outptr + 4 * 22);\n                    _sumb1 = vld1q_f32(outptr + 4 * 23);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA0 = vld1q_f32(pA);\n                    float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n                    float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                    _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                    _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                    _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                    _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                    _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                    _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                    _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                    _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                    _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                    _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                    _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                    _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                    _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                    _sum70 = vfmaq_laneq_f32(_sum70, 
_pA0, _pB1, 3);\n                    _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                    _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                    _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                    _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                    _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                    _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                    _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                    _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                    _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                    pA += 8;\n                    pB += 12;\n                }\n\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n                vst1q_f32(outptr + 4 * 16, _sum80);\n                vst1q_f32(outptr + 4 * 17, _sum81);\n                vst1q_f32(outptr + 4 * 18, _sum90);\n                vst1q_f32(outptr + 4 * 19, _sum91);\n                vst1q_f32(outptr + 4 * 20, _suma0);\n                vst1q_f32(outptr + 4 * 21, _suma1);\n                vst1q_f32(outptr + 4 * 22, _sumb0);\n                vst1q_f32(outptr + 4 * 23, _sumb1);\n                
outptr += 8 * 12;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                if (use_a53_a55_optimized_kernel && cpu_support_arm_asimdhp())\n                {\n                    // a55\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                        \"subs   %0, %0, #192                \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                        \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                        \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                        \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                        \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n             
           \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v8.4s}, [%1], #16          \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s}, [%2], #16          \\n\"\n\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n\n                        \".align 4                           \\n\"\n                        \"2:                                 \\n\"\n                        \"ldr    d9, [%1], #8                \\n\"\n                        \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                        \"ldr    x25, [%1], #8               \\n\"\n                        \"fmla   v18.4s, v8.4s, v0.s[1]      \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v20.4s, v8.4s, v0.s[2]      \\n\"\n                        \"ldr    d10, [%1], #8               \\n\"\n                        \"fmla   v22.4s, v8.4s, v0.s[3]      \\n\"\n                        \"ldr    x26, [%1], #8               \\n\"\n                        \"fmla   v24.4s, v8.4s, v1.s[0]      \\n\"\n                        \"ldr    d2, [%2], #8                \\n\"\n                        \"fmla   v26.4s, v8.4s, v1.s[1]      \\n\"\n                        \"ins    v9.d[1], x25                \\n\"\n                        \"fmla   v28.4s, v8.4s, v1.s[2]      \\n\"\n                        \"ldr    x22, [%2], #8               \\n\"\n                        \"fmla   v30.4s, v8.4s, v1.s[3]      \\n\"\n             
           \"ldr    d3, [%2], #8                \\n\"\n                        \"fmla   v17.4s, v9.4s, v0.s[0]      \\n\"\n                        \"ldr    x23, [%2], #8               \\n\"\n                        \"fmla   v19.4s, v9.4s, v0.s[1]      \\n\"\n                        \"ins    v10.d[1], x26               \\n\"\n                        \"fmla   v21.4s, v9.4s, v0.s[2]      \\n\"\n                        \"ldr    d11, [%1], #8               \\n\"\n                        \"fmla   v23.4s, v9.4s, v0.s[3]      \\n\"\n                        \"ldr    x27, [%1], #8               \\n\"\n                        \"fmla   v25.4s, v9.4s, v1.s[0]      \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v27.4s, v9.4s, v1.s[1]      \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v29.4s, v9.4s, v1.s[2]      \\n\"\n                        \"ldr    d12, [%1], #8               \\n\"\n                        \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                        \"ldr    x24, [%1], #8               \\n\"\n                        \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v18.4s, v10.4s, v2.s[1]     \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v20.4s, v10.4s, v2.s[2]     \\n\"\n                        \"ldr    d4, [%2], #8                \\n\"\n                        \"fmla   v22.4s, v10.4s, v2.s[3]     \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v24.4s, v10.4s, v3.s[0]     \\n\"\n                        \"ldr    d5, [%2], #8                \\n\"\n                        \"fmla   v26.4s, v10.4s, v3.s[1]     \\n\"\n                        \"ins    v11.d[1], x27               \\n\"\n         
               \"fmla   v28.4s, v10.4s, v3.s[2]     \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v30.4s, v10.4s, v3.s[3]     \\n\"\n                        \"ldr    d13, [%1], #8               \\n\"\n                        \"fmla   v17.4s, v11.4s, v2.s[0]     \\n\"\n                        \"ldr    x25, [%1], #8               \\n\"\n                        \"fmla   v19.4s, v11.4s, v2.s[1]     \\n\"\n                        \"ins    v12.d[1], x24               \\n\"\n                        \"fmla   v21.4s, v11.4s, v2.s[2]     \\n\"\n                        \"ldr    d14, [%1], #8               \\n\"\n                        \"fmla   v23.4s, v11.4s, v2.s[3]     \\n\"\n                        \"ldr    x26, [%1], #8               \\n\"\n                        \"fmla   v25.4s, v11.4s, v3.s[0]     \\n\"\n                        \"ldr    d6, [%2], #8                \\n\"\n                        \"fmla   v27.4s, v11.4s, v3.s[1]     \\n\"\n                        \"ins    v4.d[1], x20                \\n\"\n                        \"fmla   v29.4s, v11.4s, v3.s[2]     \\n\"\n                        \"ldr    x22, [%2], #8               \\n\"\n                        \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                        \"ldr    d7, [%2], #8                \\n\"\n                        \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n                        \"ldr    x23, [%2], #8               \\n\"\n                        \"fmla   v18.4s, v12.4s, v4.s[1]     \\n\"\n                        \"ins    v5.d[1], x21                \\n\"\n                        \"fmla   v20.4s, v12.4s, v4.s[2]     \\n\"\n                        \"ldr    d15, [%1], #8               \\n\"\n                        \"fmla   v22.4s, v12.4s, v4.s[3]     \\n\"\n                        \"ldr    x27, [%1], #8               \\n\"\n                        \"fmla   v24.4s, v12.4s, v5.s[0]     \\n\"\n                        \"prfm   
pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v26.4s, v12.4s, v5.s[1]     \\n\"\n                        \"ins    v13.d[1], x25               \\n\"\n                        \"fmla   v28.4s, v12.4s, v5.s[2]     \\n\"\n                        \"ldr    d8, [%1], #8                \\n\"\n                        \"fmla   v30.4s, v12.4s, v5.s[3]     \\n\"\n                        \"ldr    x24, [%1], #8               \\n\"\n                        \"fmla   v17.4s, v13.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v19.4s, v13.4s, v4.s[1]     \\n\"\n                        \"ins    v14.d[1], x26               \\n\"\n                        \"fmla   v21.4s, v13.4s, v4.s[2]     \\n\"\n                        \"ldr    d0, [%2], #8                \\n\"\n                        \"fmla   v23.4s, v13.4s, v4.s[3]     \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v25.4s, v13.4s, v5.s[0]     \\n\"\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"fmla   v27.4s, v13.4s, v5.s[1]     \\n\"\n                        \"ins    v6.d[1], x22                \\n\"\n                        \"fmla   v29.4s, v13.4s, v5.s[2]     \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v14.4s, v6.s[1]     \\n\"\n                        \"ins    v7.d[1], x23                \\n\"\n                        \"fmla   v20.4s, v14.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v14.4s, v6.s[3]     \\n\"\n                        \"fmla   v24.4s, v14.4s, v7.s[0]     \\n\"\n                        \"fmla   v26.4s, v14.4s, v7.s[1]     \\n\"\n                        
\"ins    v15.d[1], x27               \\n\"\n                        \"fmla   v28.4s, v14.4s, v7.s[2]     \\n\"\n                        \"fmla   v30.4s, v14.4s, v7.s[3]     \\n\"\n                        \"fmla   v17.4s, v15.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v15.4s, v6.s[1]     \\n\"\n                        \"ins    v8.d[1], x24                \\n\"\n                        \"fmla   v21.4s, v15.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v15.4s, v6.s[3]     \\n\"\n                        \"fmla   v25.4s, v15.4s, v7.s[0]     \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v27.4s, v15.4s, v7.s[1]     \\n\"\n                        \"fmla   v29.4s, v15.4s, v7.s[2]     \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                        \"bne    2b                          \\n\"\n\n                        \"sub    %1, %1, #16                 \\n\"\n                        \"sub    %2, %2, #32                 \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n                        \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v0.s[0]      \\n\"\n                  
      \"fmla   v19.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v24.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v1.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v1.s[2]      \\n\"\n                        \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", 
\"v29\", \"v30\", \"v31\");\n                }\n                else if (use_a53_a55_optimized_kernel && !cpu_support_arm_asimdhp())\n                {\n                    // a53\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                        \"subs   %0, %0, #192                \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                        \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                        \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                        \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                        \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n              
          \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ldr    d0, [%2]                    \\n\"\n                        \"ldr    x20, [%2, #8]               \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"ldr    d8, [%1]                    \\n\"\n                        \"ldr    x24, [%1, #8]               \\n\"\n                        \"ins    v8.d[1], x24                \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n\n                        \"ldr    d1, [%2]                    \\n\"\n                        \"ldr    x21, [%2, #8]               \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"ldr    d9, [%1]                    \\n\"\n                        \"ldr    x25, [%1, #8]               \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n\n                        \".align 4                           \\n\"\n                        \"2:                                 \\n\"\n\n                        \"ldr    d2, [%2]                    \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                        \"ldr    x22, [%2, #8]               \\n\"\n                        \"fmla   v18.4s, v8.4s, v0.s[1]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v20.4s, v8.4s, v0.s[2]      \\n\"\n\n                        \"ldr    d10, [%1]                   \\n\"\n        
                \"ins    v9.d[1], x25                \\n\"\n                        \"fmla   v22.4s, v8.4s, v0.s[3]      \\n\"\n                        \"ldr    x26, [%1, #8]               \\n\"\n                        \"fmla   v24.4s, v8.4s, v1.s[0]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v26.4s, v8.4s, v1.s[1]      \\n\"\n\n                        \"ldr    d3, [%2]                    \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v28.4s, v8.4s, v1.s[2]      \\n\"\n                        \"ldr    x23, [%2, #8]               \\n\"\n                        \"fmla   v30.4s, v8.4s, v1.s[3]      \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v17.4s, v9.4s, v0.s[0]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v19.4s, v9.4s, v0.s[1]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v21.4s, v9.4s, v0.s[2]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v23.4s, v9.4s, v0.s[3]      \\n\"\n\n                        \"ldr    d11, [%1]                   \\n\"\n                        \"ins    v10.d[1], x26               \\n\"\n                        \"fmla   v25.4s, v9.4s, v1.s[0]      \\n\"\n                        \"ldr    x27, [%1, #8]               \\n\"\n                        \"fmla   v27.4s, v9.4s, v1.s[1]      \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v29.4s, v9.4s, v1.s[2]      \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE 
PRELOAD\n                        \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v18.4s, v10.4s, v2.s[1]     \\n\"\n\n                        \"ldr    d4, [%2]                    \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v20.4s, v10.4s, v2.s[2]     \\n\"\n                        \"ldr    x20, [%2, #8]               \\n\"\n                        \"fmla   v22.4s, v10.4s, v2.s[3]     \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v24.4s, v10.4s, v3.s[0]     \\n\"\n\n                        \"ldr    d12, [%1]                   \\n\"\n                        \"ins    v11.d[1], x27               \\n\"\n                        \"fmla   v26.4s, v10.4s, v3.s[1]     \\n\"\n                        \"ldr    x24, [%1, #8]               \\n\"\n                        \"fmla   v28.4s, v10.4s, v3.s[2]     \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v30.4s, v10.4s, v3.s[3]     \\n\"\n\n                        \"ldr    d5, [%2]                    \\n\"\n                        \"ins    v4.d[1], x20                \\n\"\n                        \"fmla   v17.4s, v11.4s, v2.s[0]     \\n\"\n                        \"ldr    x21, [%2, #8]               \\n\"\n                        \"fmla   v19.4s, v11.4s, v2.s[1]     \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v21.4s, v11.4s, v2.s[2]     \\n\"\n\n                        \"ldr    d13, [%1]                   \\n\"\n                        \"ins    v12.d[1], x24               \\n\"\n                        \"fmla   v23.4s, v11.4s, v2.s[3]     \\n\"\n           
             \"ldr    x25, [%1, #8]               \\n\"\n                        \"fmla   v25.4s, v11.4s, v3.s[0]     \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v27.4s, v11.4s, v3.s[1]     \\n\"\n\n                        \"ldr    d6, [%2]                    \\n\"\n                        \"ins    v5.d[1], x21                \\n\"\n                        \"fmla   v29.4s, v11.4s, v3.s[2]     \\n\"\n                        \"ldr    x22, [%2, #8]               \\n\"\n                        \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n\n                        \"ldr    d14, [%1]                   \\n\"\n                        \"ins    v13.d[1], x25               \\n\"\n                        \"fmla   v18.4s, v12.4s, v4.s[1]     \\n\"\n                        \"ldr    x26, [%1, #8]               \\n\"\n                        \"fmla   v20.4s, v12.4s, v4.s[2]     \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v22.4s, v12.4s, v4.s[3]     \\n\"\n\n                        \"ldr    d7, [%2]                    \\n\"\n                        \"ins    v6.d[1], x22                \\n\"\n                        \"fmla   v24.4s, v12.4s, v5.s[0]     \\n\"\n                        \"ldr    x23, [%2, #8]               \\n\"\n                        \"fmla   v26.4s, v12.4s, v5.s[1]     \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v28.4s, v12.4s, v5.s[2]     \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v30.4s, v12.4s, v5.s[3]     \\n\"\n                        \"nop                                \\n\"\n               
         \"fmla   v17.4s, v13.4s, v4.s[0]     \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v19.4s, v13.4s, v4.s[1]     \\n\"\n\n                        \"ldr    d15, [%1]                   \\n\"\n                        \"ins    v14.d[1], x26               \\n\"\n                        \"fmla   v21.4s, v13.4s, v4.s[2]     \\n\"\n                        \"ldr    x27, [%1, #8]               \\n\"\n                        \"fmla   v23.4s, v13.4s, v4.s[3]     \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v25.4s, v13.4s, v5.s[0]     \\n\"\n\n                        \"nop                                \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v27.4s, v13.4s, v5.s[1]     \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v29.4s, v13.4s, v5.s[2]     \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n\n                        \"ldr    d0, [%2]                    \\n\"\n                        \"ins    v7.d[1], x23                \\n\"\n                        \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                        \"ldr    x20, [%2, #8]               \\n\"\n                        \"fmla   v18.4s, v14.4s, v6.s[1]     \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v20.4s, v14.4s, v6.s[2]     \\n\"\n\n                        \"ldr    d8, [%1]                    \\n\"\n                        \"ins    v15.d[1], x27               \\n\"\n                        \"fmla   v22.4s, v14.4s, v6.s[3]     \\n\"\n                        \"ldr    x24, [%1, #8]               \\n\"\n                        \"fmla   v24.4s, v14.4s, v7.s[0]     \\n\"\n                   
     \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v26.4s, v14.4s, v7.s[1]     \\n\"\n\n                        \"ldr    d1, [%2]                    \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v28.4s, v14.4s, v7.s[2]     \\n\"\n                        \"ldr    x21, [%2, #8]               \\n\"\n                        \"fmla   v30.4s, v14.4s, v7.s[3]     \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"fmla   v17.4s, v15.4s, v6.s[0]     \\n\"\n\n                        \"ldr    d9, [%1]                    \\n\"\n                        \"ins    v8.d[1], x24                \\n\"\n                        \"fmla   v19.4s, v15.4s, v6.s[1]     \\n\"\n                        \"ldr    x25, [%1, #8]               \\n\"\n                        \"fmla   v21.4s, v15.4s, v6.s[2]     \\n\"\n                        \"add    %1, %1, #16                 \\n\"\n                        \"fmla   v23.4s, v15.4s, v6.s[3]     \\n\"\n\n                        \"nop                                \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v25.4s, v15.4s, v7.s[0]     \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v27.4s, v15.4s, v7.s[1]     \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v29.4s, v15.4s, v7.s[2]     \\n\"\n\n                        \"nop                                \\n\"\n                        \"nop                                \\n\"\n                        \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                        \"nop                                \\n\"\n                        \"nop                                \\n\"\n                        \"nop                                \\n\"\n                        \"nop          
                      \\n\"\n\n                        \"bne    2b                          \\n\"\n\n                        \"sub    %1, %1, #32                 \\n\"\n                        \"sub    %2, %2, #32                 \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n                        \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v0.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v24.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v1.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v1.s[2]      \\n\"\n                        \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n                        \"bne    4b                          \\n\"\n\n                        \"5: 
                                \\n\"\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                else\n                {\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                        \"subs   %0, %0, #192                \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                        \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                        
\"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                        \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                        \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"2:                                 \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                        \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                        \"fmla   v18.4s, v8.4s, v0.s[1]      \\n\"\n                        \"fmla   v20.4s, v8.4s, v0.s[2]      \\n\"\n                        \"fmla   v22.4s, v8.4s, v0.s[3]      \\n\"\n                        \"fmla   v24.4s, v8.4s, v1.s[0]      \\n\"\n                        \"fmla   v26.4s, v8.4s, v1.s[1]      \\n\"\n  
                      \"fmla   v28.4s, v8.4s, v1.s[2]      \\n\"\n                        \"fmla   v30.4s, v8.4s, v1.s[3]      \\n\"\n                        \"fmla   v17.4s, v9.4s, v0.s[0]      \\n\"\n                        \"fmla   v19.4s, v9.4s, v0.s[1]      \\n\"\n                        \"fmla   v21.4s, v9.4s, v0.s[2]      \\n\"\n                        \"fmla   v23.4s, v9.4s, v0.s[3]      \\n\"\n                        \"fmla   v25.4s, v9.4s, v1.s[0]      \\n\"\n                        \"fmla   v27.4s, v9.4s, v1.s[1]      \\n\"\n                        \"fmla   v29.4s, v9.4s, v1.s[2]      \\n\"\n                        \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                        \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v10.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v10.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v10.4s, v2.s[3]     \\n\"\n                        \"fmla   v24.4s, v10.4s, v3.s[0]     \\n\"\n                        \"fmla   v26.4s, v10.4s, v3.s[1]     \\n\"\n                        \"fmla   v28.4s, v10.4s, v3.s[2]     \\n\"\n                        \"fmla   v30.4s, v10.4s, v3.s[3]     \\n\"\n                        \"fmla   v17.4s, v11.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v11.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v11.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v11.4s, v2.s[3]     \\n\"\n                        \"fmla   v25.4s, v11.4s, v3.s[0]     \\n\"\n                        \"fmla   v27.4s, v11.4s, v3.s[1]     \\n\"\n                        \"fmla   v29.4s, v11.4s, v3.s[2]     \\n\"\n                        \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%1], #64 \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n              
          \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n                        \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n                        \"fmla   v18.4s, v12.4s, v4.s[1]     \\n\"\n                        \"fmla   v20.4s, v12.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v12.4s, v4.s[3]     \\n\"\n                        \"fmla   v24.4s, v12.4s, v5.s[0]     \\n\"\n                        \"fmla   v26.4s, v12.4s, v5.s[1]     \\n\"\n                        \"fmla   v28.4s, v12.4s, v5.s[2]     \\n\"\n                        \"fmla   v30.4s, v12.4s, v5.s[3]     \\n\"\n                        \"fmla   v17.4s, v13.4s, v4.s[0]     \\n\"\n                        \"fmla   v19.4s, v13.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v13.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v13.4s, v4.s[3]     \\n\"\n                        \"fmla   v25.4s, v13.4s, v5.s[0]     \\n\"\n                        \"fmla   v27.4s, v13.4s, v5.s[1]     \\n\"\n                        \"fmla   v29.4s, v13.4s, v5.s[2]     \\n\"\n                        \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v14.4s, v6.s[1]     \\n\"\n                        \"fmla   v20.4s, v14.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v14.4s, v6.s[3]     \\n\"\n                        \"fmla   v24.4s, v14.4s, v7.s[0]     \\n\"\n                        \"fmla   v26.4s, v14.4s, v7.s[1]     \\n\"\n                        \"fmla   v28.4s, v14.4s, v7.s[2]     \\n\"\n                        \"fmla   v30.4s, v14.4s, v7.s[3]     \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v17.4s, v15.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v15.4s, v6.s[1]     \\n\"\n                        \"fmla   v21.4s, v15.4s, v6.s[2]     \\n\"\n                        
\"fmla   v23.4s, v15.4s, v6.s[3]     \\n\"\n                        \"fmla   v25.4s, v15.4s, v7.s[0]     \\n\"\n                        \"fmla   v27.4s, v15.4s, v7.s[1]     \\n\"\n                        \"fmla   v29.4s, v15.4s, v7.s[2]     \\n\"\n                        \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                        \"bne    2b                          \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n                        \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                        \"fmla   v18.4s, v4.4s, v0.s[1]      \\n\"\n                        \"fmla   v20.4s, v4.4s, v0.s[2]      \\n\"\n                        \"fmla   v22.4s, v4.4s, v0.s[3]      \\n\"\n                        \"fmla   v17.4s, v5.4s, v0.s[0]      \\n\"\n                        \"fmla   v19.4s, v5.4s, v0.s[1]      \\n\"\n                        \"fmla   v21.4s, v5.4s, v0.s[2]      \\n\"\n                        \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v24.4s, v4.4s, v1.s[0]      \\n\"\n                        \"fmla   v26.4s, v4.4s, v1.s[1]      \\n\"\n                        \"fmla   v28.4s, v4.4s, v1.s[2]      \\n\"\n                        \"fmla   v30.4s, v4.4s, v1.s[3]      \\n\"\n                        \"fmla   v25.4s, v5.4s, v1.s[0]      \\n\"\n                        \"fmla   v27.4s, v5.4s, v1.s[1]      \\n\"\n                        \"fmla   v29.4s, v5.4s, v1.s[2]      \\n\"\n                    
    \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                        \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                        \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum00;\n                float32x4_t _sum01;\n                float32x4_t _sum10;\n                float32x4_t _sum11;\n                float32x4_t _sum20;\n                float32x4_t _sum21;\n                float32x4_t _sum30;\n                float32x4_t _sum31;\n                float32x4_t _sum40;\n                float32x4_t _sum41;\n                float32x4_t _sum50;\n                float32x4_t _sum51;\n                float32x4_t _sum60;\n                float32x4_t _sum61;\n                float32x4_t _sum70;\n                float32x4_t _sum71;\n\n                if (k == 0)\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n     
               _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                    _sum40 = vdupq_n_f32(0.f);\n                    _sum41 = vdupq_n_f32(0.f);\n                    _sum50 = vdupq_n_f32(0.f);\n                    _sum51 = vdupq_n_f32(0.f);\n                    _sum60 = vdupq_n_f32(0.f);\n                    _sum61 = vdupq_n_f32(0.f);\n                    _sum70 = vdupq_n_f32(0.f);\n                    _sum71 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum00 = vld1q_f32(outptr);\n                    _sum01 = vld1q_f32(outptr + 4 * 1);\n                    _sum10 = vld1q_f32(outptr + 4 * 2);\n                    _sum11 = vld1q_f32(outptr + 4 * 3);\n                    _sum20 = vld1q_f32(outptr + 4 * 4);\n                    _sum21 = vld1q_f32(outptr + 4 * 5);\n                    _sum30 = vld1q_f32(outptr + 4 * 6);\n                    _sum31 = vld1q_f32(outptr + 4 * 7);\n                    _sum40 = vld1q_f32(outptr + 4 * 8);\n                    _sum41 = vld1q_f32(outptr + 4 * 9);\n                    _sum50 = vld1q_f32(outptr + 4 * 10);\n                    _sum51 = vld1q_f32(outptr + 4 * 11);\n                    _sum60 = vld1q_f32(outptr + 4 * 12);\n                    _sum61 = vld1q_f32(outptr + 4 * 13);\n                    _sum70 = vld1q_f32(outptr + 4 * 14);\n                    _sum71 = vld1q_f32(outptr + 4 * 15);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA0 = vld1q_f32(pA);\n                    float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                   
 _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                    _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                    _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                    _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                    _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                    _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                    _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                    _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                    _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                    _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                    _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                    _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                    _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                    _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                    _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n\n                    pA += 8;\n                    pB += 8;\n                }\n\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, 
_sum71);\n                outptr += 8 * 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #64                 \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                    \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                    \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                    \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v26.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v28.4s, v4.4s, v0.s[2]      \\n\"\n   
                 \"fmla   v30.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v6.4s, v1.s[0]      \\n\"\n                    \"fmla   v26.4s, v6.4s, v1.s[1]      \\n\"\n                    \"fmla   v28.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v30.4s, v6.4s, v1.s[3]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v1.s[0]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                    \"fmla   v24.4s, v8.4s, v2.s[0]      \\n\"\n                    \"fmla   v26.4s, v8.4s, v2.s[1]      \\n\"\n                    \"fmla   v28.4s, v8.4s, v2.s[2]      \\n\"\n                    \"fmla   v30.4s, v8.4s, v2.s[3]      \\n\"\n                    \"fmla   v25.4s, v9.4s, v2.s[0]      \\n\"\n                    \"fmla   v27.4s, v9.4s, v2.s[1]      \\n\"\n                    \"fmla   v29.4s, v9.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v9.4s, v2.s[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v24.4s, v10.4s, v3.s[0]     \\n\"\n                    \"fmla   v26.4s, v10.4s, v3.s[1]     \\n\"\n                    \"fmla   v28.4s, v10.4s, v3.s[2]     \\n\"\n                    \"fmla   v30.4s, v10.4s, v3.s[3]     \\n\"\n                    \"fmla   v25.4s, v11.4s, v3.s[0]     \\n\"\n                    \"fmla   v27.4s, v11.4s, v3.s[1]     \\n\"\n                    \"fmla   v29.4s, v11.4s, v3.s[2]     \\n\"\n     
               \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v0.4s}, [%2], #16          \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n                    \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v26.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v28.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v30.4s, v4.4s, v0.s[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v25.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", 
\"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum00;\n                float32x4_t _sum01;\n                float32x4_t _sum10;\n                float32x4_t _sum11;\n                float32x4_t _sum20;\n                float32x4_t _sum21;\n                float32x4_t _sum30;\n                float32x4_t _sum31;\n\n                if (k == 0)\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum00 = vld1q_f32(outptr);\n                    _sum01 = vld1q_f32(outptr + 4 * 1);\n                    _sum10 = vld1q_f32(outptr + 4 * 2);\n                    _sum11 = vld1q_f32(outptr + 4 * 3);\n                    _sum20 = vld1q_f32(outptr + 4 * 4);\n                    _sum21 = vld1q_f32(outptr + 4 * 5);\n                    _sum30 = vld1q_f32(outptr + 4 * 6);\n                    _sum31 = vld1q_f32(outptr + 4 * 7);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA0 = vld1q_f32(pA);\n                    float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                    float32x4_t _pB0 = vld1q_f32(pB);\n\n                    _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                    _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                    _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                    _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                    _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                    _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n               
     _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                    _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n\n                    pA += 8;\n                    pB += 4;\n                }\n\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                outptr += 8 * 4;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n     
               \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v30.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v28.4s, v6.4s, v0.s[2]      \\n\"\n                    \"fmla   v30.4s, v6.4s, v0.s[3]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v0.s[3]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                    \"fmla   v28.4s, v8.4s, v1.s[0]      \\n\"\n                    \"fmla   v30.4s, v8.4s, v1.s[1]      \\n\"\n                    \"fmla   v29.4s, v9.4s, v1.s[0]      \\n\"\n                    \"fmla   v31.4s, v9.4s, v1.s[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.4s, v10.4s, v1.s[2]     \\n\"\n                    \"fmla   v30.4s, v10.4s, v1.s[3]     \\n\"\n                    \"fmla   v29.4s, v11.4s, v1.s[2]     \\n\"\n                    \"fmla   v31.4s, v11.4s, v1.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v0.2s}, [%2], #8           \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n                    \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v30.4s, v4.4s, v0.s[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v29.4s, 
v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[1]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum00;\n                float32x4_t _sum01;\n                float32x4_t _sum10;\n                float32x4_t _sum11;\n\n                if (k == 0)\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum00 = vld1q_f32(outptr);\n                    _sum01 = vld1q_f32(outptr + 4 * 1);\n                    _sum10 = vld1q_f32(outptr + 4 * 2);\n                    _sum11 = vld1q_f32(outptr + 4 * 3);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA0 = vld1q_f32(pA);\n                    float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                    float32x2_t _pB0 = vld1_f32(pB);\n\n                    _sum00 = vfmaq_lane_f32(_sum00, _pA0, _pB0, 0);\n                    _sum01 = vfmaq_lane_f32(_sum01, _pA1, _pB0, 0);\n                    _sum10 = vfmaq_lane_f32(_sum10, _pA0, _pB0, 1);\n                  
  _sum11 = vfmaq_lane_f32(_sum11, _pA1, _pB0, 1);\n\n                    pA += 8;\n                    pB += 2;\n                }\n\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                outptr += 8 * 2;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v30.4s, v31.4s}, [%0]      \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v0.4s}, [%2], #16          \\n\"\n                    \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v30.4s, v6.4s, v0.s[1]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v0.s[1]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]    
   \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                    \"fmla   v28.4s, v8.4s, v0.s[2]      \\n\"\n                    \"fmla   v29.4s, v9.4s, v0.s[2]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v10.4s, v0.s[3]     \\n\"\n                    \"fmla   v31.4s, v11.4s, v0.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n                    \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1r   {v0.4s}, [%2], #4           \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n                    \"fmla   v30.4s, v4.4s, v0.4s        \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.4s        \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v30.4s, v31.4s}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                
float32x4_t _sum00;\n                float32x4_t _sum01;\n\n                if (k == 0)\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum00 = vld1q_f32(outptr);\n                    _sum01 = vld1q_f32(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA0 = vld1q_f32(pA);\n                    float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                    float32x4_t _pB = vld1q_dup_f32(pB);\n\n                    _sum00 = vfmaq_f32(_sum00, _pA0, _pB);\n                    _sum01 = vfmaq_f32(_sum01, _pA1, _pB);\n\n                    pA += 8;\n                    pB += 1;\n                }\n\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const float* pAT = AT_tile.row(b) + max_kk * ii;\n            const float* pB = BT_tile.row(b);\n\n            int jj = 0;\n#if __aarch64__\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #128                \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    
\"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                    \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                    \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                    \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                    \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                    \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                    \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v21.4s, v16.4s, v0.s[1]     \\n\"\n                    \"fmla   v22.4s, v16.4s, v0.s[2]     \\n\"\n                    \"fmla   v23.4s, v16.4s, v0.s[3]     \\n\"\n                    \"fmla   v24.4s, v16.4s, v1.s[0]     \\n\"\n                    \"fmla   v25.4s, v16.4s, v1.s[1]     \\n\"\n                    \"fmla   v26.4s, v16.4s, v1.s[2]     \\n\"\n                    \"fmla   v27.4s, v16.4s, v1.s[3]     \\n\"\n                    \"fmla   v28.4s, v16.4s, v2.s[0]     \\n\"\n                    \"fmla   v29.4s, v16.4s, 
v2.s[1]     \\n\"\n                    \"fmla   v30.4s, v16.4s, v2.s[2]     \\n\"\n                    \"fmla   v31.4s, v16.4s, v2.s[3]     \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n                    \"fmla   v20.4s, v17.4s, v3.s[0]     \\n\"\n                    \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                    \"fmla   v22.4s, v17.4s, v3.s[2]     \\n\"\n                    \"fmla   v23.4s, v17.4s, v3.s[3]     \\n\"\n                    \"fmla   v24.4s, v17.4s, v4.s[0]     \\n\"\n                    \"fmla   v25.4s, v17.4s, v4.s[1]     \\n\"\n                    \"fmla   v26.4s, v17.4s, v4.s[2]     \\n\"\n                    \"fmla   v27.4s, v17.4s, v4.s[3]     \\n\"\n                    \"fmla   v28.4s, v17.4s, v5.s[0]     \\n\"\n                    \"fmla   v29.4s, v17.4s, v5.s[1]     \\n\"\n                    \"fmla   v30.4s, v17.4s, v5.s[2]     \\n\"\n                    \"fmla   v31.4s, v17.4s, v5.s[3]     \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                    \"fmla   v20.4s, v18.4s, v6.s[0]     \\n\"\n                    \"fmla   v21.4s, v18.4s, v6.s[1]     \\n\"\n                    \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                    \"fmla   v23.4s, v18.4s, v6.s[3]     \\n\"\n                    \"fmla   v24.4s, v18.4s, v7.s[0]     \\n\"\n                    \"fmla   v25.4s, v18.4s, v7.s[1]     \\n\"\n                    \"fmla   v26.4s, v18.4s, v7.s[2]     \\n\"\n                    \"fmla   v27.4s, v18.4s, v7.s[3]     \\n\"\n                    \"fmla   v28.4s, v18.4s, v0.s[0]     \\n\"\n                    \"fmla   v29.4s, v18.4s, v0.s[1]     \\n\"\n                    \"fmla   v30.4s, v18.4s, v0.s[2]     \\n\"\n                    \"fmla   v31.4s, v18.4s, v0.s[3]     \\n\"\n                    \"subs   w4, w4, 
#1                  \\n\"\n                    \"fmla   v20.4s, v19.4s, v1.s[0]     \\n\"\n                    \"fmla   v21.4s, v19.4s, v1.s[1]     \\n\"\n                    \"fmla   v22.4s, v19.4s, v1.s[2]     \\n\"\n                    \"fmla   v23.4s, v19.4s, v1.s[3]     \\n\"\n                    \"fmla   v24.4s, v19.4s, v2.s[0]     \\n\"\n                    \"fmla   v25.4s, v19.4s, v2.s[1]     \\n\"\n                    \"fmla   v26.4s, v19.4s, v2.s[2]     \\n\"\n                    \"fmla   v27.4s, v19.4s, v2.s[3]     \\n\"\n                    \"fmla   v28.4s, v19.4s, v3.s[0]     \\n\"\n                    \"fmla   v29.4s, v19.4s, v3.s[1]     \\n\"\n                    \"fmla   v30.4s, v19.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v19.4s, v3.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                    \"ld1    {v16.4s}, [%1], #16         \\n\"\n                    \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v21.4s, v16.4s, v0.s[1]     \\n\"\n                    \"fmla   v22.4s, v16.4s, v0.s[2]     \\n\"\n                    \"fmla   v23.4s, v16.4s, v0.s[3]     \\n\"\n                    \"fmla   v24.4s, v16.4s, v1.s[0]     \\n\"\n                    \"fmla   v25.4s, v16.4s, v1.s[1]     \\n\"\n                    \"fmla   v26.4s, v16.4s, v1.s[2]     \\n\"\n                    \"fmla   v27.4s, v16.4s, v1.s[3]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.4s, v16.4s, v2.s[0]     \\n\"\n                
    \"fmla   v29.4s, v16.4s, v2.s[1]     \\n\"\n                    \"fmla   v30.4s, v16.4s, v2.s[2]     \\n\"\n                    \"fmla   v31.4s, v16.4s, v2.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum0;\n                float32x4_t _sum1;\n                float32x4_t _sum2;\n                float32x4_t _sum3;\n                float32x4_t _sum4;\n                float32x4_t _sum5;\n                float32x4_t _sum6;\n                float32x4_t _sum7;\n                float32x4_t _sum8;\n                float32x4_t _sum9;\n                float32x4_t _suma;\n                float32x4_t _sumb;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n                    _sum4 = vdupq_n_f32(0.f);\n                    _sum5 = vdupq_n_f32(0.f);\n                    _sum6 = vdupq_n_f32(0.f);\n                    _sum7 = 
vdupq_n_f32(0.f);\n                    _sum8 = vdupq_n_f32(0.f);\n                    _sum9 = vdupq_n_f32(0.f);\n                    _suma = vdupq_n_f32(0.f);\n                    _sumb = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f32(outptr);\n                    _sum1 = vld1q_f32(outptr + 4);\n                    _sum2 = vld1q_f32(outptr + 8);\n                    _sum3 = vld1q_f32(outptr + 12);\n                    _sum4 = vld1q_f32(outptr + 16);\n                    _sum5 = vld1q_f32(outptr + 20);\n                    _sum6 = vld1q_f32(outptr + 24);\n                    _sum7 = vld1q_f32(outptr + 28);\n                    _sum8 = vld1q_f32(outptr + 32);\n                    _sum9 = vld1q_f32(outptr + 36);\n                    _suma = vld1q_f32(outptr + 40);\n                    _sumb = vld1q_f32(outptr + 44);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA = vld1q_f32(pA);\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n                    float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                    _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n                    _sum8 = vfmaq_laneq_f32(_sum8, _pA, _pB2, 0);\n                    _sum9 = vfmaq_laneq_f32(_sum9, _pA, _pB2, 1);\n                    _suma = vfmaq_laneq_f32(_suma, _pA, _pB2, 2);\n                    _sumb 
= vfmaq_laneq_f32(_sumb, _pA, _pB2, 3);\n\n                    pA += 4;\n                    pB += 12;\n                }\n\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n                vst1q_f32(outptr + 4 * 8, _sum8);\n                vst1q_f32(outptr + 4 * 9, _sum9);\n                vst1q_f32(outptr + 4 * 10, _suma);\n                vst1q_f32(outptr + 4 * 11, _sumb);\n                outptr += 4 * 12;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#endif // __aarch64__\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #64                 \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                    \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                    \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 
\\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                    \"fmla   v24.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v25.4s, v16.4s, v0.s[1]     \\n\"\n                    \"fmla   v26.4s, v16.4s, v0.s[2]     \\n\"\n                    \"fmla   v27.4s, v16.4s, v0.s[3]     \\n\"\n                    \"fmla   v28.4s, v16.4s, v1.s[0]     \\n\"\n                    \"fmla   v29.4s, v16.4s, v1.s[1]     \\n\"\n                    \"fmla   v30.4s, v16.4s, v1.s[2]     \\n\"\n                    \"fmla   v31.4s, v16.4s, v1.s[3]     \\n\"\n                    \"fmla   v24.4s, v17.4s, v2.s[0]     \\n\"\n                    \"fmla   v25.4s, v17.4s, v2.s[1]     \\n\"\n                    \"fmla   v26.4s, v17.4s, v2.s[2]     \\n\"\n                    \"fmla   v27.4s, v17.4s, v2.s[3]     \\n\"\n                    \"fmla   v28.4s, v17.4s, v3.s[0]     \\n\"\n                    \"fmla   v29.4s, v17.4s, v3.s[1]     \\n\"\n                    \"fmla   v30.4s, v17.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v17.4s, v3.s[3]     \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n                    \"fmla   v24.4s, v18.4s, v4.s[0]     \\n\"\n                    \"fmla   v25.4s, v18.4s, v4.s[1]     \\n\"\n                    \"fmla   v26.4s, v18.4s, v4.s[2]     \\n\"\n                    \"fmla   v27.4s, v18.4s, v4.s[3]     \\n\"\n           
         \"fmla   v28.4s, v18.4s, v5.s[0]     \\n\"\n                    \"fmla   v29.4s, v18.4s, v5.s[1]     \\n\"\n                    \"fmla   v30.4s, v18.4s, v5.s[2]     \\n\"\n                    \"fmla   v31.4s, v18.4s, v5.s[3]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v24.4s, v19.4s, v6.s[0]     \\n\"\n                    \"fmla   v25.4s, v19.4s, v6.s[1]     \\n\"\n                    \"fmla   v26.4s, v19.4s, v6.s[2]     \\n\"\n                    \"fmla   v27.4s, v19.4s, v6.s[3]     \\n\"\n                    \"fmla   v28.4s, v19.4s, v7.s[0]     \\n\"\n                    \"fmla   v29.4s, v19.4s, v7.s[1]     \\n\"\n                    \"fmla   v30.4s, v19.4s, v7.s[2]     \\n\"\n                    \"fmla   v31.4s, v19.4s, v7.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                    \"ld1    {v16.4s}, [%1], #16         \\n\"\n                    \"fmla   v24.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v25.4s, v16.4s, v0.s[1]     \\n\"\n                    \"fmla   v26.4s, v16.4s, v0.s[2]     \\n\"\n                    \"fmla   v27.4s, v16.4s, v0.s[3]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.4s, v16.4s, v1.s[0]     \\n\"\n                    \"fmla   v29.4s, v16.4s, v1.s[1]     \\n\"\n                    \"fmla   v30.4s, v16.4s, v1.s[2]     \\n\"\n                    \"fmla   v31.4s, v16.4s, v1.s[3]     \\n\"\n                    \"bne    4b                          
\\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                    \"vldm       %0!, {d16-d23}      \\n\"\n                    \"vldm       %0, {d24-d31}       \\n\"\n                    \"sub        %0, %0, #64         \\n\"\n                    \"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q8, q8              \\n\"\n                    \"veor       q9, q9              \\n\"\n                    \"veor       q10, q10            \\n\"\n                    \"veor       q11, q11            \\n\"\n                    \"veor       q12, q12            \\n\"\n                    \"veor       q13, q13            \\n\"\n                    \"veor       q14, q14            \\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n  
                  \"2:                             \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vldm       %1!, {d8-d15}       \\n\"\n                    \"pld        [%2, #512]          \\n\"\n                    \"vldm       %2!, {d0-d7}        \\n\"\n                    \"vmla.f32   q8, q4, d0[0]       \\n\"\n                    \"vmla.f32   q9, q4, d0[1]       \\n\"\n                    \"vmla.f32   q10, q4, d1[0]      \\n\"\n                    \"vmla.f32   q11, q4, d1[1]      \\n\"\n                    \"vmla.f32   q12, q4, d2[0]      \\n\"\n                    \"vmla.f32   q13, q4, d2[1]      \\n\"\n                    \"vmla.f32   q14, q4, d3[0]      \\n\"\n                    \"vmla.f32   q15, q4, d3[1]      \\n\"\n                    \"vmla.f32   q8, q5, d4[0]       \\n\"\n                    \"vmla.f32   q9, q5, d4[1]       \\n\"\n                    \"vmla.f32   q10, q5, d5[0]      \\n\"\n                    \"vmla.f32   q11, q5, d5[1]      \\n\"\n                    \"vmla.f32   q12, q5, d6[0]      \\n\"\n                    \"vmla.f32   q13, q5, d6[1]      \\n\"\n                    \"vmla.f32   q14, q5, d7[0]      \\n\"\n                    \"vmla.f32   q15, q5, d7[1]      \\n\"\n                    \"pld        [%2, #512]          \\n\"\n                    \"vldm       %2!, {d0-d7}        \\n\"\n                    \"vmla.f32   q8, q6, d0[0]       \\n\"\n                    \"vmla.f32   q9, q6, d0[1]       \\n\"\n                    \"vmla.f32   q10, q6, d1[0]      \\n\"\n                    \"vmla.f32   q11, q6, d1[1]      \\n\"\n                    \"vmla.f32   q12, q6, d2[0]      \\n\"\n                    \"vmla.f32   q13, q6, d2[1]      \\n\"\n                    \"vmla.f32   q14, q6, d3[0]      \\n\"\n                    \"vmla.f32   q15, q6, d3[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q8, q7, d4[0]       \\n\"\n                    
\"vmla.f32   q9, q7, d4[1]       \\n\"\n                    \"vmla.f32   q10, q7, d5[0]      \\n\"\n                    \"vmla.f32   q11, q7, d5[1]      \\n\"\n                    \"vmla.f32   q12, q7, d6[0]      \\n\"\n                    \"vmla.f32   q13, q7, d6[1]      \\n\"\n                    \"vmla.f32   q14, q7, d7[0]      \\n\"\n                    \"vmla.f32   q15, q7, d7[1]      \\n\"\n                    \"bne        2b                  \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #3          \\n\" // r4 = remain = max_kk & 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        5f                  \\n\"\n\n                    \"4:                             \\n\"\n                    \"vldm       %2!, {d0-d3}        \\n\"\n                    \"vld1.f32   {d8-d9}, [%1 :128]! \\n\"\n                    \"vmla.f32   q8, q4, d0[0]       \\n\"\n                    \"vmla.f32   q9, q4, d0[1]       \\n\"\n                    \"vmla.f32   q10, q4, d1[0]      \\n\"\n                    \"vmla.f32   q11, q4, d1[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q12, q4, d2[0]      \\n\"\n                    \"vmla.f32   q13, q4, d2[1]      \\n\"\n                    \"vmla.f32   q14, q4, d3[0]      \\n\"\n                    \"vmla.f32   q15, q4, d3[1]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vstm       %0!, {d16-d23}      \\n\"\n                    \"vstm       %0!, {d24-d31}      \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k) 
      // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum0;\n                float32x4_t _sum1;\n                float32x4_t _sum2;\n                float32x4_t _sum3;\n                float32x4_t _sum4;\n                float32x4_t _sum5;\n                float32x4_t _sum6;\n                float32x4_t _sum7;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n                    _sum4 = vdupq_n_f32(0.f);\n                    _sum5 = vdupq_n_f32(0.f);\n                    _sum6 = vdupq_n_f32(0.f);\n                    _sum7 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f32(outptr);\n                    _sum1 = vld1q_f32(outptr + 4);\n                    _sum2 = vld1q_f32(outptr + 8);\n                    _sum3 = vld1q_f32(outptr + 12);\n                    _sum4 = vld1q_f32(outptr + 16);\n                    _sum5 = vld1q_f32(outptr + 20);\n                    _sum6 = vld1q_f32(outptr + 24);\n                    _sum7 = vld1q_f32(outptr + 28);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA = vld1q_f32(pA);\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _pA, 
_pB0, 3);\n                    _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                    _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                    _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                    _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB0), 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB0), 1);\n                    _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB0), 0);\n                    _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB0), 1);\n                    _sum4 = vmlaq_lane_f32(_sum4, _pA, vget_low_f32(_pB1), 0);\n                    _sum5 = vmlaq_lane_f32(_sum5, _pA, vget_low_f32(_pB1), 1);\n                    _sum6 = vmlaq_lane_f32(_sum6, _pA, vget_high_f32(_pB1), 0);\n                    _sum7 = vmlaq_lane_f32(_sum7, _pA, vget_high_f32(_pB1), 1);\n#endif\n\n                    pA += 4;\n                    pB += 8;\n                }\n\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n                outptr += 4 * 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0] \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b  
 \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                    \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                    \"fmla   v30.4s, v16.4s, v0.s[2]     \\n\"\n                    \"fmla   v31.4s, v16.4s, v0.s[3]     \\n\"\n                    \"fmla   v28.4s, v17.4s, v1.s[0]     \\n\"\n                    \"fmla   v29.4s, v17.4s, v1.s[1]     \\n\"\n                    \"fmla   v30.4s, v17.4s, v1.s[2]     \\n\"\n                    \"fmla   v31.4s, v17.4s, v1.s[3]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.4s, v18.4s, v2.s[0]     \\n\"\n                    \"fmla   v29.4s, v18.4s, v2.s[1]     \\n\"\n                    \"fmla   v30.4s, v18.4s, v2.s[2]     \\n\"\n                    \"fmla   v31.4s, v18.4s, v2.s[3]     \\n\"\n                    \"fmla   v28.4s, v19.4s, v3.s[0]     \\n\"\n                    \"fmla   v29.4s, v19.4s, v3.s[1]     \\n\"\n                    \"fmla   v30.4s, v19.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v19.4s, v3.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n\n                  
  \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v0.4s}, [%2], #16          \\n\"\n                    \"ld1    {v16.4s}, [%1], #16         \\n\"\n                    \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v16.4s, v0.s[2]     \\n\"\n                    \"fmla   v31.4s, v16.4s, v0.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                    \"vldm       %0, {d24-d31}       \\n\"\n                    \"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q12, q12            \\n\"\n                    \"veor       q13, q13            \\n\"\n                    \"veor       q14, q14            
\\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n                    \"2:                             \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vldm       %1!, {d8-d15}       \\n\"\n                    \"pld        [%2, #512]          \\n\"\n                    \"vldm       %2!, {d0-d7}        \\n\"\n                    \"vmla.f32   q12, q4, d0[0]      \\n\"\n                    \"vmla.f32   q13, q4, d0[1]      \\n\"\n                    \"vmla.f32   q14, q4, d1[0]      \\n\"\n                    \"vmla.f32   q15, q4, d1[1]      \\n\"\n                    \"vmla.f32   q12, q5, d2[0]      \\n\"\n                    \"vmla.f32   q13, q5, d2[1]      \\n\"\n                    \"vmla.f32   q14, q5, d3[0]      \\n\"\n                    \"vmla.f32   q15, q5, d3[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q12, q6, d4[0]      \\n\"\n                    \"vmla.f32   q13, q6, d4[1]      \\n\"\n                    \"vmla.f32   q14, q6, d5[0]      \\n\"\n                    \"vmla.f32   q15, q6, d5[1]      \\n\"\n                    \"vmla.f32   q12, q7, d6[0]      \\n\"\n                    \"vmla.f32   q13, q7, d6[1]      \\n\"\n                    \"vmla.f32   q14, q7, d7[0]      \\n\"\n                    \"vmla.f32   q15, q7, d7[1]      \\n\"\n                    \"bne        2b                  \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #3          \\n\" // r4 = remain = max_kk & 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        5f                  \\n\"\n\n                 
   \"4:                             \\n\"\n                    \"vld1.f32   {d0-d1}, [%2 :128]! \\n\"\n                    \"vld1.f32   {d8-d9}, [%1 :128]! \\n\"\n                    \"vmla.f32   q12, q4, d0[0]      \\n\"\n                    \"vmla.f32   q13, q4, d0[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q14, q4, d1[0]      \\n\"\n                    \"vmla.f32   q15, q4, d1[1]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vstm       %0!, {d24-d31}      \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum0;\n                float32x4_t _sum1;\n                float32x4_t _sum2;\n                float32x4_t _sum3;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f32(outptr);\n                    _sum1 = vld1q_f32(outptr + 4);\n                    _sum2 = vld1q_f32(outptr + 8);\n                    _sum3 = vld1q_f32(outptr + 12);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA = vld1q_f32(pA);\n                 
   float32x4_t _pB = vld1q_f32(pB);\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB, 3);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB), 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB), 1);\n                    _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB), 0);\n                    _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB), 1);\n#endif\n\n                    pA += 4;\n                    pB += 4;\n                }\n\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                outptr += 4 * 4;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v30.4s, v31.4s}, [%0]      \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    
\"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                    \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                    \"fmla   v30.4s, v17.4s, v0.s[2]     \\n\"\n                    \"fmla   v31.4s, v17.4s, v0.s[3]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.4s, v18.4s, v1.s[0]     \\n\"\n                    \"fmla   v29.4s, v18.4s, v1.s[1]     \\n\"\n                    \"fmla   v30.4s, v19.4s, v1.s[2]     \\n\"\n                    \"fmla   v31.4s, v19.4s, v1.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n                    \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v0.2s}, [%2], #8           \\n\"\n                    \"ld1    {v16.4s}, [%1], #16         \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v31.4s, v16.4s, v0.s[1]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v30.4s, v31.4s}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr), // %0\n   
                 \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                    \"vld1.f32   {d28-d31}, [%0 :128] \\n\"\n                    \"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q14, q14            \\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n                    \"veor       q12, q12            \\n\"\n                    \"veor       q13, q13            \\n\"\n                    \"2:                             \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vldm       %1!, {d8-d15}       \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%2 :128]! 
\\n\"\n                    \"vmla.f32   q12, q4, d0[0]      \\n\"\n                    \"vmla.f32   q13, q4, d0[1]      \\n\"\n                    \"vmla.f32   q14, q5, d1[0]      \\n\"\n                    \"vmla.f32   q15, q5, d1[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q12, q6, d2[0]      \\n\"\n                    \"vmla.f32   q13, q6, d2[1]      \\n\"\n                    \"vmla.f32   q14, q7, d3[0]      \\n\"\n                    \"vmla.f32   q15, q7, d3[1]      \\n\"\n                    \"bne        2b                  \\n\"\n                    \"vadd.f32   q14, q14, q12       \\n\"\n                    \"vadd.f32   q15, q15, q13       \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #3          \\n\" // r4 = remain = max_kk & 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        5f                  \\n\"\n\n                    \"4:                             \\n\"\n                    \"vld1.f32   {d0}, [%2 :64]!     \\n\"\n                    \"vld1.f32   {d8-d9}, [%1 :128]! \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q14, q4, d0[0]      \\n\"\n                    \"vmla.f32   q15, q4, d0[1]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vst1.f32   {d28-d31}, [%0 :128]! 
\\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum0;\n                float32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f32(outptr);\n                    _sum1 = vld1q_f32(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA = vld1q_f32(pA);\n                    float32x2_t _pB = vld1_f32(pB);\n\n#if __aarch64__\n                    _sum0 = vfmaq_lane_f32(_sum0, _pA, _pB, 0);\n                    _sum1 = vfmaq_lane_f32(_sum1, _pA, _pB, 1);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _pA, _pB, 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _pA, _pB, 1);\n#endif\n\n                    pA += 4;\n                    pB += 2;\n                }\n\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                outptr += 4 * 2;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v31.4s}, [%0]              \\n\"\n                    \"b    
  1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v0.4s}, [%2], #16          \\n\"\n                    \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                    \"fmla   v29.4s, v17.4s, v0.s[1]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v18.4s, v0.s[2]     \\n\"\n                    \"fmla   v31.4s, v19.4s, v0.s[3]     \\n\"\n                    \"bne    2b                          \\n\"\n                    \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v30.4s      \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1r   {v0.4s}, [%2], #4           \\n\"\n                    \"ld1    
{v16.4s}, [%1], #16         \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v31.4s, v16.4s, v0.4s       \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v31.4s}, [%0], #16         \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                    \"vld1.f32   {d30-d31}, [%0 :128] \\n\"\n                    \"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n                    \"veor       q12, q12            \\n\"\n                    \"veor       q13, q13            \\n\"\n                    \"veor       q14, q14            \\n\"\n                    \"2:                             \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vldm       %1!, {d8-d15}       \\n\"\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld1.f32   {d0-d1}, [%2 :64]!  
\\n\"\n                    \"vmla.f32   q12, q4, d0[0]      \\n\"\n                    \"vmla.f32   q13, q5, d0[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q14, q6, d1[0]      \\n\"\n                    \"vmla.f32   q15, q7, d1[1]      \\n\"\n                    \"bne        2b                  \\n\"\n                    \"vadd.f32   q14, q14, q12       \\n\"\n                    \"vadd.f32   q15, q15, q13       \\n\"\n                    \"vadd.f32   q15, q15, q14       \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #3          \\n\" // r4 = remain = max_kk & 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        5f                  \\n\"\n\n                    \"4:                             \\n\"\n                    \"vld1.f32   {d0[0]}, [%2]!      \\n\"\n                    \"vld1.f32   {d8-d9}, [%1 :128]! \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmla.f32   q15, q4, d0[0]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vst1.f32   {d30-d31}, [%0 :128]! 
\\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _sum;\n\n                if (k == 0)\n                {\n                    _sum = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum = vld1q_f32(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pA = vld1q_f32(pA);\n                    float32x4_t _pB = vdupq_n_f32(pB[0]);\n\n#if __aarch64__\n                    _sum = vfmaq_f32(_sum, _pA, _pB);\n#else\n                    _sum = vmlaq_f32(_sum, _pA, _pB);\n#endif\n\n                    pA += 4;\n                    pB += 1;\n                }\n\n                vst1q_f32(outptr, _sum);\n                outptr += 4;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const float* pAT = AT_tile.row(b) + max_kk * ii;\n            const float* pB = BT_tile.row(b);\n\n            int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const float* pA = pAT;\n\n                float32x4_t _sum00;\n                float32x4_t _sum01;\n                float32x4_t _sum02;\n                float32x4_t _sum10;\n                float32x4_t _sum11;\n                float32x4_t _sum12;\n\n                if (k == 
0)\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum02 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum12 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                    float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                    float32x4x2_t _tmp45 = vld2q_f32(outptr + 16);\n                    _sum00 = _tmp01.val[0];\n                    _sum01 = _tmp23.val[0];\n                    _sum02 = _tmp45.val[0];\n                    _sum10 = _tmp01.val[1];\n                    _sum11 = _tmp23.val[1];\n                    _sum12 = _tmp45.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n                    float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                    float32x2_t _pA = vld1_f32(pA);\n#if __aarch64__\n                    _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                    _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 0);\n                    _sum02 = vfmaq_lane_f32(_sum02, _pB2, _pA, 0);\n                    _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                    _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n                    _sum12 = vfmaq_lane_f32(_sum12, _pB2, _pA, 1);\n#else\n                    _sum00 = vmlaq_lane_f32(_sum00, _pB0, _pA, 0);\n                    _sum01 = vmlaq_lane_f32(_sum01, _pB1, _pA, 0);\n                    _sum02 = vmlaq_lane_f32(_sum02, _pB2, _pA, 0);\n                    _sum10 = vmlaq_lane_f32(_sum10, _pB0, _pA, 1);\n                    _sum11 = vmlaq_lane_f32(_sum11, _pB1, _pA, 1);\n                    _sum12 = 
vmlaq_lane_f32(_sum12, _pB2, _pA, 1);\n#endif\n\n                    pA += 2;\n                    pB += 12;\n                }\n\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float32x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n                vst2q_f32(outptr + 16, _tmp45);\n                outptr += 2 * 12;\n            }\n#endif // __aarch64__\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const float* pA = pAT;\n\n                float32x4_t _sum00;\n                float32x4_t _sum01;\n                float32x4_t _sum10;\n                float32x4_t _sum11;\n\n                if (k == 0)\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                    float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                    _sum00 = _tmp01.val[0];\n                    _sum01 = _tmp23.val[0];\n                    _sum10 = _tmp01.val[1];\n                    _sum11 = _tmp23.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                    float32x2_t _pA = vld1_f32(pA);\n#if __aarch64__\n                    _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                    _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 
0);\n                    _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                    _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n#else\n                    _sum00 = vmlaq_lane_f32(_sum00, _pB0, _pA, 0);\n                    _sum01 = vmlaq_lane_f32(_sum01, _pB1, _pA, 0);\n                    _sum10 = vmlaq_lane_f32(_sum10, _pB0, _pA, 1);\n                    _sum11 = vmlaq_lane_f32(_sum11, _pB1, _pA, 1);\n#endif\n\n                    pA += 2;\n                    pB += 8;\n                }\n\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n                outptr += 2 * 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const float* pA = pAT;\n\n                float32x4_t _sum0;\n                float32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                    _sum0 = _tmp01.val[0];\n                    _sum1 = _tmp01.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pB = vld1q_f32(pB);\n\n                    float32x2_t _pA = vld1_f32(pA);\n#if __aarch64__\n                    _sum0 = vfmaq_lane_f32(_sum0, _pB, _pA, 0);\n                    _sum1 = vfmaq_lane_f32(_sum1, _pB, _pA, 1);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _pB, _pA, 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _pB, _pA, 1);\n#endif\n\n                    pA += 2;\n                    pB += 4;\n 
               }\n\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2q_f32(outptr, _tmp01);\n                outptr += 2 * 4;\n            }\n#endif // __ARM_NEON\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const float* pA = pAT;\n\n                float sum00 = 0.f;\n                float sum01 = 0.f;\n                float sum10 = 0.f;\n                float sum11 = 0.f;\n\n                if (k == 0)\n                {\n                    sum00 = 0.f;\n                    sum01 = 0.f;\n                    sum10 = 0.f;\n                    sum11 = 0.f;\n                }\n                else\n                {\n                    sum00 = outptr[0];\n                    sum01 = outptr[1];\n                    sum10 = outptr[2];\n                    sum11 = outptr[3];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    sum00 += pA[0] * pB[0];\n                    sum01 += pA[1] * pB[0];\n                    sum10 += pA[0] * pB[1];\n                    sum11 += pA[1] * pB[1];\n                    pA += 2;\n                    pB += 2;\n                }\n\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n                outptr += 2 * 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const float* pA = pAT;\n\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n\n                if (k == 0)\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n                else\n                {\n                    sum0 = outptr[0];\n                    sum1 = outptr[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n              
  {\n                    sum0 += pA[0] * pB[0];\n                    sum1 += pA[1] * pB[0];\n                    pA += 2;\n                    pB += 1;\n                }\n\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n                outptr += 2;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const float* pAT = AT_tile.row(b) + max_kk * ii;\n            const float* pB = BT_tile.row(b);\n\n            int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const float* pA = pAT;\n\n                float32x4_t _sum0;\n                float32x4_t _sum1;\n                float32x4_t _sum2;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f32(outptr);\n                    _sum1 = vld1q_f32(outptr + 4);\n                    _sum2 = vld1q_f32(outptr + 8);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n                    float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                    float32x4_t _pA0 = vdupq_n_f32(pA[0]);\n#if __aarch64__\n                    _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                    _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n                    _sum2 = vfmaq_f32(_sum2, _pA0, _pB2);\n#else\n                    _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                    _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n                    _sum2 = vmlaq_f32(_sum2, _pA0, _pB2);\n#endif\n\n                    pA += 1;\n                    pB += 12;\n                }\n\n 
               vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 8, _sum2);\n                outptr += 12;\n            }\n#endif // __aarch64__\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const float* pA = pAT;\n\n                float32x4_t _sum0;\n                float32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f32(outptr);\n                    _sum1 = vld1q_f32(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pB0 = vld1q_f32(pB);\n                    float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                    float32x4_t _pA0 = vdupq_n_f32(pA[0]);\n#if __aarch64__\n                    _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                    _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n#else\n                    _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                    _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n#endif\n\n                    pA += 1;\n                    pB += 8;\n                }\n\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                outptr += 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const float* pA = pAT;\n\n                float32x4_t _sum;\n\n                if (k == 0)\n                {\n                    _sum = vdupq_n_f32(0.f);\n                }\n                else\n                {\n                    _sum = vld1q_f32(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float32x4_t _pB = vld1q_f32(pB);\n                    float32x4_t 
_pA = vdupq_n_f32(pA[0]);\n\n#if __aarch64__\n                    _sum = vfmaq_f32(_sum, _pA, _pB);\n#else\n                    _sum = vmlaq_f32(_sum, _pA, _pB);\n#endif\n\n                    pA += 1;\n                    pB += 4;\n                }\n\n                vst1q_f32(outptr, _sum);\n                outptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const float* pA = pAT;\n\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n\n                if (k == 0)\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n                else\n                {\n                    sum0 = outptr[0];\n                    sum1 = outptr[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    sum0 += pA[0] * pB[0];\n                    sum1 += pA[0] * pB[1];\n                    pA += 1;\n                    pB += 2;\n                }\n\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n                outptr += 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const float* pA = pAT;\n\n                float sum = 0.f;\n\n                if (k == 0)\n                {\n                    sum = 0.f;\n                }\n                else\n                {\n                    sum = outptr[0];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    sum += pA[0] * pB[0];\n                    pA += 1;\n                    pB += 1;\n                }\n\n                outptr[0] = sum;\n                outptr += 1;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd_get_optimal_tile_mnk(int M, int N, int K, int B, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size 
from cache size\n    const int l2_cache_size_fp32 = (int)(get_cpu_level2_cache_size() / sizeof(float));\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    // TODO: B should be taken into account for batched gemm, but doing so is slower on arm in practice (cause not yet understood)\n    (void)B;\n\n    // solve K\n    {\n        // try not to split K\n#if __aarch64__\n        int tile_size = (l2_cache_size_fp32 - 32) / 12;\n#elif __ARM_NEON\n        int tile_size = (l2_cache_size_fp32 - 16) / 8;\n#else\n        int tile_size = (l2_cache_size_fp32 - 2) / 3;\n#endif\n\n#if __aarch64__\n        TILE_K = std::max(8, tile_size / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = std::max(4, tile_size / 4 * 4);\n#else\n        TILE_K = std::max(2, tile_size / 2 * 2);\n#endif\n\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n#if __aarch64__\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 3) / 4 * 4);\n#else\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 1) / 2 * 2);\n#endif\n    }\n\n    // solve M\n    {\n#if __aarch64__\n        TILE_M = 8;\n#elif __ARM_NEON\n        TILE_M = 4;\n#else\n        TILE_M = 2;\n#endif\n    }\n\n    {\n        TILE_M *= std::min(nT, get_physical_cpu_count());\n\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n#if __aarch64__\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 3) / 4 * 4);\n#else\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n\n        if (nT > 1)\n        {\n#if __aarch64__\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n#elif __ARM_NEON\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 3) / 4 * 4);\n#else\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 1) / 2 * 2);\n#endif\n        }\n\n#if __aarch64__\n        TILE_M = 
std::max(8, TILE_M);\n#elif __ARM_NEON\n        TILE_M = std::max(4, TILE_M);\n#else\n        TILE_M = std::max(2, TILE_M);\n#endif\n    }\n\n    if (N > 0)\n    {\n        int tile_size;\n        if (TILE_K >= K)\n        {\n            tile_size = (l2_cache_size_fp32 - TILE_M * TILE_K) / TILE_K;\n        }\n        else\n        {\n            tile_size = (l2_cache_size_fp32 - TILE_M * TILE_K) / (TILE_M + TILE_K);\n        }\n\n#if __aarch64__\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#else\n        TILE_N = std::max(1, tile_size);\n#endif\n\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n\n#if __aarch64__\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#else\n        TILE_N = std::min(TILE_N, (N + nn_N - 1) / nn_N);\n#endif\n\n#if __aarch64__\n        TILE_N = std::max(4, TILE_N);\n#elif __ARM_NEON\n        TILE_N = std::max(4, TILE_N);\n#else\n        TILE_N = std::max(1, TILE_N);\n#endif\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_kernel_tile(const Mat& kernel, Mat& A, int inch, int i, int max_ii, int k, int max_kk)\n{\n    // const float ktm[4][3] = {\n    //     {1.0f, 0.0f, 0.0f},\n    //     {1.0f / 2, 1.0f / 2, 1.0f / 2},\n    //     {1.0f / 2, -1.0f / 2, 1.0f / 2},\n    //     {0.0f, 0.0f, 1.0f}\n    // };\n\n    float* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            float tmp[4][3];\n\n            const float* k0 = (const float*)kernel + (i + ii) * inch * 9 + (k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                float r0 = k0[0];\n                float r1 = k0[1];\n                float r2 = k0[2];\n\n                tmp[0][m] = r0;\n                tmp[1][m] = r0 * 0.5f + r1 * 0.5f + r2 * 0.5f;\n                
tmp[2][m] = r0 * 0.5f - r1 * 0.5f + r2 * 0.5f;\n                tmp[3][m] = r2;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 4; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n\n                float z0 = r0;\n                float z1 = r0 * 0.5f + r1 * 0.5f + r2 * 0.5f;\n                float z2 = r0 * 0.5f - r1 * 0.5f + r2 * 0.5f;\n                float z3 = r2;\n\n                ptmp[0] = z0;\n                ptmp[1] = z1;\n                ptmp[2] = z2;\n                ptmp[3] = z3;\n                ptmp += 4;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_transform_kernel(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    const int K = inch;\n    const int B = 16;\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, 0, K, B, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads);\n\n    AT.create(TILE_K * TILE_M, B, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd23_transform_kernel_tile(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            conv3x3s1_winograd_pack_A_tile(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_input_tile(const Mat& bottom_blob, Mat& B, int j, int max_jj, int 
k, int max_kk, int nT)\n{\n    // const float itm[4][4] = {\n    //     {1.0f,  0.0f, -1.0f,  0.0f},\n    //     {0.0f,  1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  0.00f, 1.0f}\n    // };\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w - 1) / 2;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel((k + kk) / elempack).row(ti * 2) + (tj * 2) * elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r00 = vdupq_n_f32(0.f);\n                float32x4_t _r01 = vdupq_n_f32(0.f);\n                float32x4_t _r10 = vdupq_n_f32(0.f);\n                float32x4_t _r11 = vdupq_n_f32(0.f);\n                float32x4_t _r20 = vdupq_n_f32(0.f);\n                float32x4_t _r21 = vdupq_n_f32(0.f);\n                float32x4_t _r30 = vdupq_n_f32(0.f);\n                float32x4_t _r31 = vdupq_n_f32(0.f);\n\n                if (ti * 2 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        const float* r1 = r0 + N;\n\n                        _r00 = vld1q_f32(r0);\n                        _r01 = vld1q_f32(r1);\n                        if (tj * 2 + 1 < w)\n                        {\n                          
  _r10 = vld1q_f32(r0 + 4);\n                            _r11 = vld1q_f32(r1 + 4);\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            _r20 = vld1q_f32(r0 + 8);\n                            _r21 = vld1q_f32(r1 + 8);\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            _r30 = vld1q_f32(r0 + 12);\n                            _r31 = vld1q_f32(r1 + 12);\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n                        const float* r2 = r0 + N * 2;\n                        const float* r3 = r0 + N * 3;\n                        const float* r4 = r0 + N * 4;\n                        const float* r5 = r0 + N * 5;\n                        const float* r6 = r0 + N * 6;\n                        const float* r7 = r0 + N * 7;\n\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4_t _t2 = vld1q_f32(r2);\n                        float32x4_t _t3 = vld1q_f32(r3);\n                        float32x4_t _t4 = vld1q_f32(r4);\n                        float32x4_t _t5 = vld1q_f32(r5);\n                        float32x4_t _t6 = vld1q_f32(r6);\n                        float32x4_t _t7 = vld1q_f32(r7);\n\n                        transpose4x4_ps(_t0, _t1, _t2, _t3);\n                        transpose4x4_ps(_t4, _t5, _t6, _t7);\n\n                        _r00 = _t0;\n                        _r01 = _t4;\n                        if (tj * 2 + 1 < w)\n                        {\n                            _r10 = _t1;\n                            _r11 = _t5;\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            _r20 = _t2;\n                            _r21 = 
_t6;\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            _r30 = _t3;\n                            _r31 = _t7;\n                        }\n                    }\n                }\n\n                float32x4_t _tmp00 = vsubq_f32(_r00, _r20);\n                float32x4_t _tmp01 = vsubq_f32(_r01, _r21);\n                float32x4_t _tmp10 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp11 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp20 = vsubq_f32(_r20, _r10);\n                float32x4_t _tmp21 = vsubq_f32(_r21, _r11);\n                float32x4_t _tmp30 = vsubq_f32(_r30, _r10);\n                float32x4_t _tmp31 = vsubq_f32(_r31, _r11);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj * 8;\n            float* p1 = p0 + max_jj * 8;\n            float* p2 = p0 + max_jj * 8 * 2;\n            float* p3 = p0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n\n                float32x4_t _tmp00 = vsubq_f32(_r00, _r20);\n     
           float32x4_t _tmp01 = vsubq_f32(_r01, _r21);\n                float32x4_t _tmp10 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp11 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp20 = vsubq_f32(_r20, _r10);\n                float32x4_t _tmp21 = vsubq_f32(_r21, _r11);\n                float32x4_t _tmp30 = vsubq_f32(_r30, _r10);\n                float32x4_t _tmp31 = vsubq_f32(_r31, _r11);\n\n                vst1q_f32(p0, _tmp00);\n                vst1q_f32(p0 + 4, _tmp01);\n                vst1q_f32(p1, _tmp10);\n                vst1q_f32(p1 + 4, _tmp11);\n                vst1q_f32(p2, _tmp20);\n                vst1q_f32(p2 + 4, _tmp21);\n                vst1q_f32(p3, _tmp30);\n                vst1q_f32(p3 + 4, _tmp31);\n\n                p0 += max_jj * 4 * 8;\n                p1 += max_jj * 4 * 8;\n                p2 += max_jj * 4 * 8;\n                p3 += max_jj * 4 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n#else // __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    #pragma omp parallel for num_threads(nT)\n#endif // __aarch64__\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel((k + kk) / elempack).row(ti * 2) + (tj * 2) * elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r0 = vdupq_n_f32(0.f);\n                float32x4_t _r1 = vdupq_n_f32(0.f);\n                float32x4_t _r2 = vdupq_n_f32(0.f);\n                float32x4_t _r3 = vdupq_n_f32(0.f);\n\n                if (ti * 2 + 
m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        if (tj * 2 + 1 < w) _r1 = vld1q_f32(r0 + 4);\n                        if (tj * 2 + 2 < w) _r2 = vld1q_f32(r0 + 8);\n                        if (tj * 2 + 3 < w) _r3 = vld1q_f32(r0 + 12);\n                    }\n                    if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n                        const float* r2 = r0 + N * 2;\n                        const float* r3 = r0 + N * 3;\n\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4_t _t2 = vld1q_f32(r2);\n                        float32x4_t _t3 = vld1q_f32(r3);\n\n                        transpose4x4_ps(_t0, _t1, _t2, _t3);\n\n                        _r0 = _t0;\n                        if (tj * 2 + 1 < w) _r1 = _t1;\n                        if (tj * 2 + 2 < w) _r2 = _t2;\n                        if (tj * 2 + 3 < w) _r3 = _t3;\n                    }\n                }\n\n                float32x4_t _tmp0 = vsubq_f32(_r0, _r2);\n                float32x4_t _tmp1 = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp2 = vsubq_f32(_r2, _r1);\n                float32x4_t _tmp3 = vsubq_f32(_r3, _r1);\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj * 4;\n            float* p1 = p0 + max_jj * 4;\n            float* p2 = p0 + max_jj * 4 * 2;\n            float* p3 = p0 + max_jj * 4 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n              
  float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n\n                float32x4_t _tmp0 = vsubq_f32(_r0, _r2);\n                float32x4_t _tmp1 = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp2 = vsubq_f32(_r2, _r1);\n                float32x4_t _tmp3 = vsubq_f32(_r3, _r1);\n\n                vst1q_f32(p0, _tmp0);\n                vst1q_f32(p1, _tmp1);\n                vst1q_f32(p2, _tmp2);\n                vst1q_f32(p3, _tmp3);\n\n                p0 += max_jj * 4 * 4;\n                p1 += max_jj * 4 * 4;\n                p2 += max_jj * 4 * 4;\n                p3 += max_jj * 4 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[4][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel(k + kk).row(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vdup_n_f32(0.f);\n                float32x2_t _r1 = vdup_n_f32(0.f);\n                float32x2_t _r2 = vdup_n_f32(0.f);\n                float32x2_t _r3 = vdup_n_f32(0.f);\n#else\n                float r00 = 0.f;\n                float r01 = 0.f;\n                float r10 = 0.f;\n                float r11 = 0.f;\n                float r20 = 0.f;\n                float r21 = 0.f;\n                float r30 = 0.f;\n                float r31 = 0.f;\n#endif\n\n                if (ti * 2 
+ m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n\n#if __ARM_NEON\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4x2_t _t01 = vzipq_f32(_t0, _t1);\n\n                        _r0 = vget_low_f32(_t01.val[0]);\n                        if (tj * 2 + 1 < w) _r1 = vget_high_f32(_t01.val[0]);\n                        if (tj * 2 + 2 < w) _r2 = vget_low_f32(_t01.val[1]);\n                        if (tj * 2 + 3 < w) _r3 = vget_high_f32(_t01.val[1]);\n#else\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 2 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n#endif\n                    }\n                }\n\n#if __ARM_NEON\n                float32x2_t _tmp0 = vsub_f32(_r0, _r2);\n                float32x2_t _tmp1 = vadd_f32(_r1, _r2);\n                float32x2_t _tmp2 = vsub_f32(_r2, _r1);\n                float32x2_t _tmp3 = vsub_f32(_r3, _r1);\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n#else\n                tmp[0][m][0] = r00 - r20;\n                tmp[0][m][1] = r01 - r21;\n                tmp[1][m][0] = r10 + r20;\n                tmp[1][m][1] = r11 + r21;\n                tmp[2][m][0] = r20 - r10;\n                
tmp[2][m][1] = r21 - r11;\n                tmp[3][m][0] = r30 - r10;\n                tmp[3][m][1] = r31 - r11;\n#endif\n\n                r0 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj * 2;\n            float* p1 = p0 + max_jj * 2;\n            float* p2 = p0 + max_jj * 2 * 2;\n            float* p3 = p0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n\n                float32x2_t _tmp0 = vsub_f32(_r0, _r2);\n                float32x2_t _tmp1 = vadd_f32(_r1, _r2);\n                float32x2_t _tmp2 = vsub_f32(_r2, _r1);\n                float32x2_t _tmp3 = vsub_f32(_r3, _r1);\n\n                vst1_f32(p0, _tmp0);\n                vst1_f32(p1, _tmp1);\n                vst1_f32(p2, _tmp2);\n                vst1_f32(p3, _tmp3);\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n\n                p0[0] = r00 - r20;\n                p0[1] = r01 - r21;\n                p1[0] = r10 + r20;\n                p1[1] = r11 + r21;\n                p2[0] = r20 - r10;\n                p2[1] = r21 - r11;\n                p3[0] = r30 - r10;\n                p3[1] = r31 - r11;\n#endif\n\n                p0 += max_jj * 4 * 2;\n                p1 += max_jj * 4 * 2;\n                p2 += max_jj * 4 * 2;\n                p3 += max_jj * 4 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n     
   float tmp[4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0123 = bottom_blob.channel(k + kk).row(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n                float r0 = 0.f;\n                float r1 = 0.f;\n                float r2 = 0.f;\n                float r3 = 0.f;\n\n                if (ti * 2 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 2 + 1 < w) r1 = r0123[1];\n                        if (tj * 2 + 2 < w) r2 = r0123[2];\n                        if (tj * 2 + 3 < w) r3 = r0123[3];\n                    }\n                }\n\n                tmp[0][m] = r0 - r2;\n                tmp[1][m] = r1 + r2;\n                tmp[2][m] = r2 - r1;\n                tmp[3][m] = r3 - r1;\n\n                r0123 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj;\n            float* p1 = p0 + max_jj;\n            float* p2 = p0 + max_jj * 2;\n            float* p3 = p0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n\n                p0[0] = r0 - r2;\n                p1[0] = r1 + r2;\n                p2[0] = r2 - r1;\n                p3[0] = r3 - r1;\n\n                p0 += max_jj * 4;\n                p1 += max_jj * 4;\n                p2 += max_jj * 4;\n                p3 += max_jj * 4;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_output_tile(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    // const float otm[2][4] = {\n    //     {1.0f,  1.0f,  1.0f,  0.0f},\n    //     
{0.0f,  1.0f, -1.0f,  1.0f}\n    // };\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 1) / 2;\n\n    const float* biasptr = bias;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = biasptr ? vld1q_f32(biasptr + i + ii + 4) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[2][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj * 8;\n            const float* r1 = r0 + max_jj * 8;\n            const float* r2 = r0 + max_jj * 8 * 2;\n            const float* r3 = r0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _r10), _r20);\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _r11), _r21);\n                float32x4_t _tmp10 = vaddq_f32(vsubq_f32(_r10, _r20), _r30);\n                float32x4_t _tmp11 = vaddq_f32(vsubq_f32(_r11, _r21), _r31);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n              
  vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n\n                r0 += max_jj * 4 * 8;\n                r1 += max_jj * 4 * 8;\n                r2 += max_jj * 4 * 8;\n                r3 += max_jj * 4 * 8;\n            }\n\n            float* outptr0 = top_blob.channel((i + ii) / out_elempack).row(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n\n                float32x4_t _tmp00 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r00, _r10), _r20));\n                float32x4_t _tmp01 = vaddq_f32(_bias1, vaddq_f32(vaddq_f32(_r01, _r11), _r21));\n                float32x4_t _tmp10 = vaddq_f32(_bias0, vaddq_f32(vsubq_f32(_r10, _r20), _r30));\n                float32x4_t _tmp11 = vaddq_f32(_bias1, vaddq_f32(vsubq_f32(_r11, _r21), _r31));\n\n                if (out_elempack == 4)\n                {\n                    float* outptr1 = outptr0 + N;\n\n                    vst1q_f32(outptr0, _tmp00);\n                    vst1q_f32(outptr1, _tmp01);\n                    if (tj * 2 + 1 < outw)\n                    {\n                        vst1q_f32(outptr0 + 4, _tmp10);\n                        vst1q_f32(outptr1 + 4, _tmp11);\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    float tmp0[8];\n                    float tmp1[8];\n                    vst1q_f32(tmp0, _tmp00);\n                
    vst1q_f32(tmp0 + 4, _tmp01);\n                    vst1q_f32(tmp1, _tmp10);\n                    vst1q_f32(tmp1 + 4, _tmp11);\n\n                    float* outptr1 = outptr0 + N;\n                    float* outptr2 = outptr0 + N * 2;\n                    float* outptr3 = outptr0 + N * 3;\n                    float* outptr4 = outptr0 + N * 4;\n                    float* outptr5 = outptr0 + N * 5;\n                    float* outptr6 = outptr0 + N * 6;\n                    float* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float32x4_t _bias0 = biasptr ? 
vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[2][4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj * 4;\n            const float* r1 = r0 + max_jj * 4;\n            const float* r2 = r0 + max_jj * 4 * 2;\n            const float* r3 = r0 + max_jj * 4 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _r1 = vld1q_f32(r1);\n                float32x4_t _r2 = vld1q_f32(r2);\n                float32x4_t _r3 = vld1q_f32(r3);\n\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _r1), _r2);\n                float32x4_t _tmp1 = vaddq_f32(vsubq_f32(_r1, _r2), _r3);\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n\n                r0 += max_jj * 4 * 4;\n                r1 += max_jj * 4 * 4;\n                r2 += max_jj * 4 * 4;\n                r3 += max_jj * 4 * 4;\n            }\n\n            float* outptr0 = top_blob.channel((i + ii) / out_elempack).row(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n\n                float32x4_t _tmp0 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r0, _r1), _r2));\n                float32x4_t _tmp1 = vaddq_f32(_bias0, vaddq_f32(vsubq_f32(_r1, _r2), _r3));\n\n                if (out_elempack == 4)\n                {\n                    
vst1q_f32(outptr0, _tmp0);\n                    if (tj * 2 + 1 < outw) vst1q_f32(outptr0 + 4, _tmp1);\n                }\n                if (out_elempack == 1)\n                {\n                    float tmp0[4];\n                    float tmp1[4];\n                    vst1q_f32(tmp0, _tmp0);\n                    vst1q_f32(tmp1, _tmp1);\n\n                    float* outptr1 = outptr0 + N;\n                    float* outptr2 = outptr0 + N * 2;\n                    float* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        float32x2_t _bias0 = biasptr ? vld1_f32(biasptr + i + ii) : vdup_n_f32(0.f);\n#else\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        float bias1 = biasptr ? 
biasptr[i + ii + 1] : 0.f;\n#endif\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[2][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj * 2;\n            const float* r1 = r0 + max_jj * 2;\n            const float* r2 = r0 + max_jj * 2 * 2;\n            const float* r3 = r0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(r0);\n                float32x2_t _r1 = vld1_f32(r1);\n                float32x2_t _r2 = vld1_f32(r2);\n                float32x2_t _r3 = vld1_f32(r3);\n\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _r1), _r2);\n                float32x2_t _tmp1 = vadd_f32(vsub_f32(_r1, _r2), _r3);\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n#else\n                tmp[0][m][0] = r0[0] + r1[0] + r2[0];\n                tmp[0][m][1] = r0[1] + r1[1] + r2[1];\n                tmp[1][m][0] = r1[0] - r2[0] + r3[0];\n                tmp[1][m][1] = r1[1] - r2[1] + r3[1];\n#endif\n\n                r0 += max_jj * 4 * 2;\n                r1 += max_jj * 4 * 2;\n                r2 += max_jj * 4 * 2;\n                r3 += max_jj * 4 * 2;\n            }\n\n            float* outptr0 = top_blob.channel(i + ii).row(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n\n                float32x2_t _tmp0 = vadd_f32(_bias0, 
vadd_f32(vadd_f32(_r0, _r1), _r2));\n                float32x2_t _tmp1 = vadd_f32(_bias0, vadd_f32(vsub_f32(_r1, _r2), _r3));\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n\n                float tmp00 = bias0 + r00 + r10 + r20;\n                float tmp01 = bias1 + r01 + r11 + r21;\n                float tmp10 = bias0 + r10 - r20 + r30;\n                float tmp11 = bias1 + r11 - r21 + r31;\n#endif\n\n                // if (out_elempack == 1)\n                {\n                    float* outptr1 = outptr0 + N;\n\n#if __ARM_NEON\n                    outptr0[0] = vget_lane_f32(_tmp0, 0);\n                    outptr1[0] = vget_lane_f32(_tmp0, 1);\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = vget_lane_f32(_tmp1, 0);\n                        outptr1[1] = vget_lane_f32(_tmp1, 1);\n                    }\n#else\n                    outptr0[0] = tmp00;\n                    outptr1[0] = tmp01;\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] = tmp11;\n                    }\n#endif\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        float bias0 = biasptr ? 
biasptr[i + ii] : 0.f;\n\n        float tmp[2][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj;\n            const float* r1 = r0 + max_jj;\n            const float* r2 = r0 + max_jj * 2;\n            const float* r3 = r0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                tmp[0][m] = r0[0] + r1[0] + r2[0];\n                tmp[1][m] = r1[0] - r2[0] + r3[0];\n\n                r0 += max_jj * 4;\n                r1 += max_jj * 4;\n                r2 += max_jj * 4;\n                r3 += max_jj * 4;\n            }\n\n            float* outptr0 = top_blob.channel(i + ii).row(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n\n                float tmp0 = bias0 + r0 + r1 + r2;\n                float tmp1 = bias0 + r1 - r2 + r3;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 2 + 1 < outw) outptr0[1] = tmp1;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd23(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 2n+2, winograd F(2,3)\n    int w_tiles = (outw + 1) / 2;\n    int h_tiles = (outh + 1) / 2;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 16;\n\n    // 
NCNN_LOGE(\"conv3x3s1_winograd23 %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 4u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd23_transform_input_tile(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 4u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - 
k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd23_transform_input_tile(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd23_transform_output_tile(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n\nstatic inline void conv3x3s1_winograd43_transform_kernel_tile(const Mat& kernel, Mat& A, int inch, int i, int max_ii, int k, int max_kk)\n{\n    float* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            const float sq2 = 1.41421356237f;\n            // const float ktm[6][3] = {\n            //     {1.0f, 0.0f, 0.0f},\n            //     {-2.0f / 3, -sq2 
/ 3, -1.0f / 3},\n            //     {-2.0f / 3, sq2 / 3, -1.0f / 3},\n            //     {1.0f / 6, sq2 / 6, 1.0f / 3},\n            //     {1.0f / 6, -sq2 / 6, 1.0f / 3},\n            //     {0.0f, 0.0f, 1.0f}\n            // };\n            const float ktm0 = 2.0f / 3;\n            const float ktm1 = sq2 / 3;\n            const float ktm2 = 1.0f / 3;\n            const float ktm3 = 1.0f / 6;\n            const float ktm4 = sq2 / 6;\n\n            float tmp[6][3];\n\n            const float* k0 = (const float*)kernel + (i + ii) * inch * 9 + (k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                float r0 = k0[0];\n                float r1 = k0[1];\n                float r2 = k0[2];\n\n                tmp[0][m] = r0;\n                tmp[1][m] = -r0 * ktm0 - r1 * ktm1 - r2 * ktm2;\n                tmp[2][m] = -r0 * ktm0 + r1 * ktm1 - r2 * ktm2;\n                tmp[3][m] = r0 * ktm3 + r1 * ktm4 + r2 * ktm2;\n                tmp[4][m] = r0 * ktm3 - r1 * ktm4 + r2 * ktm2;\n                tmp[5][m] = r2;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 6; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n\n                float z0 = r0;\n                float z1 = -r0 * ktm0 - r1 * ktm1 - r2 * ktm2;\n                float z2 = -r0 * ktm0 + r1 * ktm1 - r2 * ktm2;\n                float z3 = r0 * ktm3 + r1 * ktm4 + r2 * ktm2;\n                float z4 = r0 * ktm3 - r1 * ktm4 + r2 * ktm2;\n                float z5 = r2;\n\n                ptmp[0] = z0;\n                ptmp[1] = z1;\n                ptmp[2] = z2;\n                ptmp[3] = z3;\n                ptmp[4] = z4;\n                ptmp[5] = z5;\n                ptmp += 6;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_transform_kernel(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    
const int K = inch;\n    const int B = 36;\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, 0, K, B, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads);\n\n    AT.create(TILE_K * TILE_M, B, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd43_transform_kernel_tile(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            conv3x3s1_winograd_pack_A_tile(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd43_transform_input_tile(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    const float sq2 = 1.41421356237;\n    const float sq2_d2 = 1.41421356237 / 2;\n\n    // const float itm[6][6] = {\n    //     {1.0f,  0.0f,  -2.5f,  0.0f,  1.0f, 0.0f},\n    //     {0.0f, -sq2,   -2.0f,  sq2/2, 1.0f, 0.0f},\n    //     {0.0f,  sq2,   -2.0f, -sq2/2, 1.0f, 0.0f},\n    //     {0.0f, -sq2/2, -0.5f,  sq2,   1.0f, 0.0f},\n    //     {0.0f,  sq2/2, -0.5f, -sq2,   1.0f, 0.0f},\n    //     {0.0f,  1.0f,   0.0f,  -2.5f, 0.0f, 1.0f}\n    // };\n\n    // 0 =  r00 + r04 - 2.5f * r02\n    // 1 = -(sq2 * r01 - sq2_d2 * r03) + (r04 - 2 * r02)\n    // 2 =  (sq2 * r01 - sq2_d2 * r03) + (r04 - 2 * r02)\n    // 3 =  (sq2 * r03 - sq2_d2 * r01) + (r04 - 0.5f * r02)\n    // 4 = -(sq2 * r03 - sq2_d2 * r01) + (r04 - 0.5f * r02)\n    // 5 =  r01 + r05 - 2.5f * r03\n\n    const int w = bottom_blob.w;\n  
  const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w + 1) / 4;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][6][8];\n\n        const float coeffs[4] = {sq2, -sq2_d2, -2.f, -0.5f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _vm2_5 = vdupq_n_f32(-2.5f);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel((k + kk) / elempack).row(ti * 4) + (tj * 4) * elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r00 = vdupq_n_f32(0.f);\n                float32x4_t _r01 = vdupq_n_f32(0.f);\n                float32x4_t _r10 = vdupq_n_f32(0.f);\n                float32x4_t _r11 = vdupq_n_f32(0.f);\n                float32x4_t _r20 = vdupq_n_f32(0.f);\n                float32x4_t _r21 = vdupq_n_f32(0.f);\n                float32x4_t _r30 = vdupq_n_f32(0.f);\n                float32x4_t _r31 = vdupq_n_f32(0.f);\n                float32x4_t _r40 = vdupq_n_f32(0.f);\n                float32x4_t _r41 = vdupq_n_f32(0.f);\n                float32x4_t _r50 = vdupq_n_f32(0.f);\n                float32x4_t _r51 = vdupq_n_f32(0.f);\n\n                if (ti * 4 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        const float* r1 = r0 + N;\n\n                        _r00 = vld1q_f32(r0);\n                        _r01 = 
vld1q_f32(r1);\n                        if (tj * 4 + 1 < w)\n                        {\n                            _r10 = vld1q_f32(r0 + 4);\n                            _r11 = vld1q_f32(r1 + 4);\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            _r20 = vld1q_f32(r0 + 8);\n                            _r21 = vld1q_f32(r1 + 8);\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            _r30 = vld1q_f32(r0 + 12);\n                            _r31 = vld1q_f32(r1 + 12);\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            _r40 = vld1q_f32(r0 + 16);\n                            _r41 = vld1q_f32(r1 + 16);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            _r50 = vld1q_f32(r0 + 20);\n                            _r51 = vld1q_f32(r1 + 20);\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n                        const float* r2 = r0 + N * 2;\n                        const float* r3 = r0 + N * 3;\n                        const float* r4 = r0 + N * 4;\n                        const float* r5 = r0 + N * 5;\n                        const float* r6 = r0 + N * 6;\n                        const float* r7 = r0 + N * 7;\n\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4_t _t2 = vld1q_f32(r2);\n                        float32x4_t _t3 = vld1q_f32(r3);\n                        float32x4_t _t4 = vld1q_f32(r4);\n                        float32x4_t _t5 = vld1q_f32(r5);\n                        float32x4_t _t6 = vld1q_f32(r6);\n                        float32x4_t _t7 = 
vld1q_f32(r7);\n\n                        transpose4x4_ps(_t0, _t1, _t2, _t3);\n                        transpose4x4_ps(_t4, _t5, _t6, _t7);\n\n                        _r00 = _t0;\n                        _r01 = _t4;\n                        if (tj * 4 + 1 < w)\n                        {\n                            _r10 = _t1;\n                            _r11 = _t5;\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            _r20 = _t2;\n                            _r21 = _t6;\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            _r30 = _t3;\n                            _r31 = _t7;\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            float tmp[8] = {r0[4], r1[4], r2[4], r3[4], r4[4], r5[4], r6[4], r7[4]};\n                            _r40 = vld1q_f32(tmp);\n                            _r41 = vld1q_f32(tmp + 4);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            float tmp[8] = {r0[5], r1[5], r2[5], r3[5], r4[5], r5[5], r6[5], r7[5]};\n                            _r50 = vld1q_f32(tmp);\n                            _r51 = vld1q_f32(tmp + 4);\n                        }\n                    }\n                }\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs, 0), _r30, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs, 0), _r31, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 2);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 2);\n                float32x4_t _tmp34a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r30, _coeffs, 0), _r10, _coeffs, 1);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r31, 
_coeffs, 0), _r11, _coeffs, 1);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 3);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 3);\n\n                float32x4_t _tmp00 = vfmaq_f32(vaddq_f32(_r00, _r40), _r20, _vm2_5);\n                float32x4_t _tmp01 = vfmaq_f32(vaddq_f32(_r01, _r41), _r21, _vm2_5);\n                float32x4_t _tmp10 = vsubq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp11 = vsubq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp20 = vaddq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp21 = vaddq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp50 = vfmaq_f32(vaddq_f32(_r10, _r50), _r30, _vm2_5);\n                float32x4_t _tmp51 = vfmaq_f32(vaddq_f32(_r11, _r51), _r31, _vm2_5);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n                vst1q_f32(tmp[4][m], _tmp40);\n                vst1q_f32(tmp[4][m] + 4, _tmp41);\n                vst1q_f32(tmp[5][m], _tmp50);\n                vst1q_f32(tmp[5][m] + 4, _tmp51);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj * 8;\n            float* p1 = p0 + max_jj * 8;\n            float* p2 = p0 + max_jj * 8 * 2;\n            float* p3 = p0 + max_jj * 8 * 3;\n            float* p4 = p0 + max_jj * 8 * 4;\n            
float* p5 = p0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs, 0), _r30, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs, 0), _r31, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 2);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 2);\n                float32x4_t _tmp34a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r30, _coeffs, 0), _r10, _coeffs, 1);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r31, _coeffs, 0), _r11, _coeffs, 1);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 3);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 3);\n\n                float32x4_t _tmp00 = vfmaq_f32(vaddq_f32(_r00, _r40), _r20, _vm2_5);\n                float32x4_t _tmp01 = vfmaq_f32(vaddq_f32(_r01, _r41), _r21, _vm2_5);\n                float32x4_t _tmp10 = vsubq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp11 = vsubq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp20 = vaddq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp21 = 
vaddq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp50 = vfmaq_f32(vaddq_f32(_r10, _r50), _r30, _vm2_5);\n                float32x4_t _tmp51 = vfmaq_f32(vaddq_f32(_r11, _r51), _r31, _vm2_5);\n\n                vst1q_f32(p0, _tmp00);\n                vst1q_f32(p0 + 4, _tmp01);\n                vst1q_f32(p1, _tmp10);\n                vst1q_f32(p1 + 4, _tmp11);\n                vst1q_f32(p2, _tmp20);\n                vst1q_f32(p2 + 4, _tmp21);\n                vst1q_f32(p3, _tmp30);\n                vst1q_f32(p3 + 4, _tmp31);\n                vst1q_f32(p4, _tmp40);\n                vst1q_f32(p4 + 4, _tmp41);\n                vst1q_f32(p5, _tmp50);\n                vst1q_f32(p5 + 4, _tmp51);\n\n                p0 += max_jj * 6 * 8;\n                p1 += max_jj * 6 * 8;\n                p2 += max_jj * 6 * 8;\n                p3 += max_jj * 6 * 8;\n                p4 += max_jj * 6 * 8;\n                p5 += max_jj * 6 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n#else // __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    #pragma omp parallel for num_threads(nT)\n#endif // __aarch64__\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][6][4];\n\n        const float coeffs[4] = {sq2, -sq2_d2, -2.f, -0.5f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _vm2_5 = vdupq_n_f32(-2.5f);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + 
jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel((k + kk) / elempack).row(ti * 4) + (tj * 4) * elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r0 = vdupq_n_f32(0.f);\n                float32x4_t _r1 = vdupq_n_f32(0.f);\n                float32x4_t _r2 = vdupq_n_f32(0.f);\n                float32x4_t _r3 = vdupq_n_f32(0.f);\n                float32x4_t _r4 = vdupq_n_f32(0.f);\n                float32x4_t _r5 = vdupq_n_f32(0.f);\n\n                if (ti * 4 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        if (tj * 4 + 1 < w) _r1 = vld1q_f32(r0 + 4);\n                        if (tj * 4 + 2 < w) _r2 = vld1q_f32(r0 + 8);\n                        if (tj * 4 + 3 < w) _r3 = vld1q_f32(r0 + 12);\n                        if (tj * 4 + 4 < w) _r4 = vld1q_f32(r0 + 16);\n                        if (tj * 4 + 5 < w) _r5 = vld1q_f32(r0 + 20);\n                    }\n                    if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n                        const float* r2 = r0 + N * 2;\n                        const float* r3 = r0 + N * 3;\n\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4_t _t2 = vld1q_f32(r2);\n                        float32x4_t _t3 = vld1q_f32(r3);\n\n                        transpose4x4_ps(_t0, _t1, _t2, _t3);\n\n                        _r0 = _t0;\n                        if (tj * 4 + 1 < w) _r1 = _t1;\n                        if (tj * 4 + 2 < w) _r2 = _t2;\n                        if (tj * 4 + 3 < w) _r3 = _t3;\n                        if (tj * 4 + 4 < w)\n                        {\n                            float tmp[4] = {r0[4], r1[4], r2[4], r3[4]};\n                            _r4 = 
vld1q_f32(tmp);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            float tmp[4] = {r0[5], r1[5], r2[5], r3[5]};\n                            _r5 = vld1q_f32(tmp);\n                        }\n                    }\n                }\n\n#if __aarch64__\n                float32x4_t _tmp12a = vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vmulq_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmulq_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x4_t _tmp0 = vmlaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x4_t _tmp1 = vsubq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp2 = vaddq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34b, _tmp34a);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x4_t _tmp5 = vfmaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x4_t _tmp5 = vmlaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], 
_tmp3);\n                vst1q_f32(tmp[4][m], _tmp4);\n                vst1q_f32(tmp[5][m], _tmp5);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj * 4;\n            float* p1 = p0 + max_jj * 4;\n            float* p2 = p0 + max_jj * 4 * 2;\n            float* p3 = p0 + max_jj * 4 * 3;\n            float* p4 = p0 + max_jj * 4 * 4;\n            float* p5 = p0 + max_jj * 4 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n\n#if __aarch64__\n                float32x4_t _tmp12a = vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vmulq_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmulq_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x4_t _tmp0 = vmlaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x4_t _tmp1 = vsubq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp2 
= vaddq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34b, _tmp34a);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x4_t _tmp5 = vfmaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x4_t _tmp5 = vmlaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                vst1q_f32(p0, _tmp0);\n                vst1q_f32(p1, _tmp1);\n                vst1q_f32(p2, _tmp2);\n                vst1q_f32(p3, _tmp3);\n                vst1q_f32(p4, _tmp4);\n                vst1q_f32(p5, _tmp5);\n\n                p0 += max_jj * 6 * 4;\n                p1 += max_jj * 6 * 4;\n                p2 += max_jj * 6 * 4;\n                p3 += max_jj * 6 * 4;\n                p4 += max_jj * 6 * 4;\n                p5 += max_jj * 6 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[6][6][2];\n\n#if __ARM_NEON\n        const float coeffs[4] = {sq2, -sq2_d2, -2.f, -0.5f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x2_t _vm2_5 = vdup_n_f32(-2.5f);\n#endif\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel(k + kk).row(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 6; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vdup_n_f32(0.f);\n                float32x2_t _r1 = vdup_n_f32(0.f);\n                float32x2_t _r2 = 
vdup_n_f32(0.f);\n                float32x2_t _r3 = vdup_n_f32(0.f);\n                float32x2_t _r4 = vdup_n_f32(0.f);\n                float32x2_t _r5 = vdup_n_f32(0.f);\n#else\n                float r00 = 0.f;\n                float r01 = 0.f;\n                float r10 = 0.f;\n                float r11 = 0.f;\n                float r20 = 0.f;\n                float r21 = 0.f;\n                float r30 = 0.f;\n                float r31 = 0.f;\n                float r40 = 0.f;\n                float r41 = 0.f;\n                float r50 = 0.f;\n                float r51 = 0.f;\n#endif\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n\n#if __ARM_NEON\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4x2_t _t01 = vzipq_f32(_t0, _t1);\n\n                        _r0 = vget_low_f32(_t01.val[0]);\n                        if (tj * 4 + 1 < w) _r1 = vget_high_f32(_t01.val[0]);\n                        if (tj * 4 + 2 < w) _r2 = vget_low_f32(_t01.val[1]);\n                        if (tj * 4 + 3 < w) _r3 = vget_high_f32(_t01.val[1]);\n                        if (tj * 4 + 4 < w)\n                        {\n                            float tmp[2] = {r0[4], r1[4]};\n                            _r4 = vld1_f32(tmp);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            float tmp[2] = {r0[5], r1[5]};\n                            _r5 = vld1_f32(tmp);\n                        }\n#else\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 4 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 4 + 2 < w)\n  
                      {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            r40 = r0[4];\n                            r41 = r1[4];\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            r50 = r0[5];\n                            r51 = r1[5];\n                        }\n#endif\n                    }\n                }\n\n#if __ARM_NEON\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x2_t _tmp12b = vfma_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x2_t _tmp34a = vfma_laneq_f32(vmul_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x2_t _tmp34b = vfma_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34a = vmla_lane_f32(vmul_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x2_t _tmp0 = vmla_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x2_t _tmp1 = vsub_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp2 = vadd_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp3 = vadd_f32(_tmp34b, _tmp34a);\n                
float32x2_t _tmp4 = vsub_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x2_t _tmp5 = vfma_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x2_t _tmp5 = vmla_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n                vst1_f32(tmp[4][m], _tmp4);\n                vst1_f32(tmp[5][m], _tmp5);\n#else\n                float tmp12a0 = sq2 * r10 - sq2_d2 * r30;\n                float tmp12a1 = sq2 * r11 - sq2_d2 * r31;\n                float tmp12b0 = r40 - 2 * r20;\n                float tmp12b1 = r41 - 2 * r21;\n                float tmp34a0 = sq2 * r30 - sq2_d2 * r10;\n                float tmp34a1 = sq2 * r31 - sq2_d2 * r11;\n                float tmp34b0 = r40 - 0.5f * r20;\n                float tmp34b1 = r41 - 0.5f * r21;\n\n                tmp[0][m][0] = r00 + r40 - 2.5f * r20;\n                tmp[0][m][1] = r01 + r41 - 2.5f * r21;\n                tmp[1][m][0] = tmp12b0 - tmp12a0;\n                tmp[1][m][1] = tmp12b1 - tmp12a1;\n                tmp[2][m][0] = tmp12b0 + tmp12a0;\n                tmp[2][m][1] = tmp12b1 + tmp12a1;\n                tmp[3][m][0] = tmp34b0 + tmp34a0;\n                tmp[3][m][1] = tmp34b1 + tmp34a1;\n                tmp[4][m][0] = tmp34b0 - tmp34a0;\n                tmp[4][m][1] = tmp34b1 - tmp34a1;\n                tmp[5][m][0] = r10 + r50 - 2.5f * r30;\n                tmp[5][m][1] = r11 + r51 - 2.5f * r31;\n#endif\n\n                r0 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj * 2;\n            float* p1 = p0 + max_jj * 2;\n            float* p2 = p0 + max_jj * 2 * 2;\n            float* p3 = p0 + max_jj * 2 * 3;\n            float* p4 = p0 + max_jj * 2 * 4;\n            float* p5 = p0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n#if 
__ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t _r4 = vld1_f32(tmp[m][4]);\n                float32x2_t _r5 = vld1_f32(tmp[m][5]);\n\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x2_t _tmp12b = vfma_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x2_t _tmp34a = vfma_laneq_f32(vmul_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x2_t _tmp34b = vfma_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34a = vmla_lane_f32(vmul_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x2_t _tmp0 = vmla_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x2_t _tmp1 = vsub_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp2 = vadd_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp3 = vadd_f32(_tmp34b, _tmp34a);\n                float32x2_t _tmp4 = vsub_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x2_t _tmp5 = vfma_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x2_t _tmp5 = vmla_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                vst1_f32(p0, _tmp0);\n                vst1_f32(p1, _tmp1);\n                vst1_f32(p2, _tmp2);\n                vst1_f32(p3, _tmp3);\n          
      vst1_f32(p4, _tmp4);\n                vst1_f32(p5, _tmp5);\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n\n                float tmp12a0 = sq2 * r10 - sq2_d2 * r30;\n                float tmp12a1 = sq2 * r11 - sq2_d2 * r31;\n                float tmp12b0 = r40 - 2 * r20;\n                float tmp12b1 = r41 - 2 * r21;\n                float tmp34a0 = sq2 * r30 - sq2_d2 * r10;\n                float tmp34a1 = sq2 * r31 - sq2_d2 * r11;\n                float tmp34b0 = r40 - 0.5f * r20;\n                float tmp34b1 = r41 - 0.5f * r21;\n\n                p0[0] = r00 + r40 - 2.5f * r20;\n                p0[1] = r01 + r41 - 2.5f * r21;\n                p1[0] = tmp12b0 - tmp12a0;\n                p1[1] = tmp12b1 - tmp12a1;\n                p2[0] = tmp12b0 + tmp12a0;\n                p2[1] = tmp12b1 + tmp12a1;\n                p3[0] = tmp34b0 + tmp34a0;\n                p3[1] = tmp34b1 + tmp34a1;\n                p4[0] = tmp34b0 - tmp34a0;\n                p4[1] = tmp34b1 - tmp34a1;\n                p5[0] = r10 + r50 - 2.5f * r30;\n                p5[1] = r11 + r51 - 2.5f * r31;\n#endif\n\n                p0 += max_jj * 6 * 2;\n                p1 += max_jj * 6 * 2;\n                p2 += max_jj * 6 * 2;\n                p3 += max_jj * 6 * 2;\n                p4 += max_jj * 6 * 2;\n                p5 += max_jj * 6 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        float tmp[6][6];\n\n        
int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0123 = bottom_blob.channel(k + kk).row(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 6; m++)\n            {\n                float r0 = 0.f;\n                float r1 = 0.f;\n                float r2 = 0.f;\n                float r3 = 0.f;\n                float r4 = 0.f;\n                float r5 = 0.f;\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 4 + 1 < w) r1 = r0123[1];\n                        if (tj * 4 + 2 < w) r2 = r0123[2];\n                        if (tj * 4 + 3 < w) r3 = r0123[3];\n                        if (tj * 4 + 4 < w) r4 = r0123[4];\n                        if (tj * 4 + 5 < w) r5 = r0123[5];\n                    }\n                }\n\n                float tmp12a = sq2 * r1 - sq2_d2 * r3;\n                float tmp12b = r4 - 2 * r2;\n                float tmp34a = sq2 * r3 - sq2_d2 * r1;\n                float tmp34b = r4 - 0.5f * r2;\n\n                tmp[0][m] = r0 + r4 - 2.5f * r2;\n                tmp[1][m] = tmp12b - tmp12a;\n                tmp[2][m] = tmp12b + tmp12a;\n                tmp[3][m] = tmp34b + tmp34a;\n                tmp[4][m] = tmp34b - tmp34a;\n                tmp[5][m] = r1 + r5 - 2.5f * r3;\n\n                r0123 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj;\n            float* p1 = p0 + max_jj;\n            float* p2 = p0 + max_jj * 2;\n            float* p3 = p0 + max_jj * 3;\n            float* p4 = p0 + max_jj * 4;\n            float* p5 = p0 + max_jj * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 
= tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n\n                float tmp12a = sq2 * r1 - sq2_d2 * r3;\n                float tmp12b = r4 - 2 * r2;\n                float tmp34a = sq2 * r3 - sq2_d2 * r1;\n                float tmp34b = r4 - 0.5f * r2;\n\n                p0[0] = r0 + r4 - 2.5f * r2;\n                p1[0] = tmp12b - tmp12a;\n                p2[0] = tmp12b + tmp12a;\n                p3[0] = tmp34b + tmp34a;\n                p4[0] = tmp34b - tmp34a;\n                p5[0] = r1 + r5 - 2.5f * r3;\n\n                p0 += max_jj * 6;\n                p1 += max_jj * 6;\n                p2 += max_jj * 6;\n                p3 += max_jj * 6;\n                p4 += max_jj * 6;\n                p5 += max_jj * 6;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd43_transform_output_tile(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    const float sq2 = 1.41421356237;\n    const float sq2_m2 = 1.41421356237 * 2;\n    const float sq2_d2 = 1.41421356237 / 2;\n    const float sq2_d4 = 1.41421356237 / 4;\n\n    // const float otm[4][6] = {\n    //     {1.0f, 1.0f,   1.0f,  1.0f,  1.0f,   0.0f},\n    //     {0.0f, sq2/2, -sq2/2, sq2,   -sq2,   0.0f},\n    //     {0.0f, 0.5f,   0.5f,  2.0f,  2.0f,   0.0f},\n    //     {0.0f, sq2/4, -sq2/4, sq2*2, -sq2*2, 1.0f}\n    // };\n\n    // 0 = r00 + (r01 + r02) + (r03 + r04)\n    // 1 =       (r01 - r02) * sq2_d2 + (r03 - r04) * sq2\n    // 2 =       (r01 + r02) * 0.5f + (r03 + r04) * 2\n    // 3 = r05 + (r01 - r02) * sq2_d4 + (r03 - r04) * sq2_m2\n\n#if __ARM_NEON\n    const float coeffs[6] = {sq2, sq2_d2, sq2_d4, sq2_m2, 0.5f, 2.f};\n    float32x4_t _coeffs = vld1q_f32(coeffs);\n    float32x2_t _coeffs2 = vld1_f32(coeffs + 4);\n#endif // __ARM_NEON\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * 
out_elempack;\n\n    const int w_tiles = (outw + 3) / 4;\n\n    const float* biasptr = bias;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = biasptr ? vld1q_f32(biasptr + i + ii + 4) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj * 8;\n            const float* r1 = r0 + max_jj * 8;\n            const float* r2 = r0 + max_jj * 8 * 2;\n            const float* r3 = r0 + max_jj * 8 * 3;\n            const float* r4 = r0 + max_jj * 8 * 4;\n            const float* r5 = r0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r50 = vld1q_f32(r5);\n                float32x4_t _r51 = vld1q_f32(r5 + 4);\n\n                float32x4_t _tmp02a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp02a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp02b0 = vaddq_f32(_r30, _r40);\n                float32x4_t _tmp02b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp13a0 = vsubq_f32(_r10, 
_r20);\n                float32x4_t _tmp13a1 = vsubq_f32(_r11, _r21);\n                float32x4_t _tmp13b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp13b1 = vsubq_f32(_r31, _r41);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _tmp02a0), _tmp02b0);\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _tmp02a1), _tmp02b1);\n                float32x4_t _tmp10 = vfmaq_laneq_f32(vmulq_laneq_f32(_tmp13a0, _coeffs, 1), _tmp13b0, _coeffs, 0);\n                float32x4_t _tmp11 = vfmaq_laneq_f32(vmulq_laneq_f32(_tmp13a1, _coeffs, 1), _tmp13b1, _coeffs, 0);\n                float32x4_t _tmp20 = vfmaq_lane_f32(vmulq_lane_f32(_tmp02a0, _coeffs2, 0), _tmp02b0, _coeffs2, 1);\n                float32x4_t _tmp21 = vfmaq_lane_f32(vmulq_lane_f32(_tmp02a1, _coeffs2, 0), _tmp02b1, _coeffs2, 1);\n                float32x4_t _tmp30 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r50, _tmp13a0, _coeffs, 2), _tmp13b0, _coeffs, 3);\n                float32x4_t _tmp31 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r51, _tmp13a1, _coeffs, 2), _tmp13b1, _coeffs, 3);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n\n                r0 += max_jj * 6 * 8;\n                r1 += max_jj * 6 * 8;\n                r2 += max_jj * 6 * 8;\n                r3 += max_jj * 6 * 8;\n                r4 += max_jj * 6 * 8;\n                r5 += max_jj * 6 * 8;\n            }\n\n            float* outptr0 = top_blob.channel((i + ii) / out_elempack).row(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float32x4_t 
_r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n\n                float32x4_t _tmp02a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp02a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp02b0 = vaddq_f32(_r30, _r40);\n                float32x4_t _tmp02b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp13a0 = vsubq_f32(_r10, _r20);\n                float32x4_t _tmp13a1 = vsubq_f32(_r11, _r21);\n                float32x4_t _tmp13b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp13b1 = vsubq_f32(_r31, _r41);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _tmp02a0), vaddq_f32(_tmp02b0, _bias0));\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _tmp02a1), vaddq_f32(_tmp02b1, _bias1));\n                float32x4_t _tmp10 = vfmaq_laneq_f32(vfmaq_laneq_f32(_bias0, _tmp13a0, _coeffs, 1), _tmp13b0, _coeffs, 0);\n                float32x4_t _tmp11 = vfmaq_laneq_f32(vfmaq_laneq_f32(_bias1, _tmp13a1, _coeffs, 1), _tmp13b1, _coeffs, 0);\n                float32x4_t _tmp20 = vfmaq_lane_f32(vfmaq_lane_f32(_bias0, _tmp02a0, _coeffs2, 0), _tmp02b0, _coeffs2, 1);\n                float32x4_t _tmp21 = vfmaq_lane_f32(vfmaq_lane_f32(_bias1, _tmp02a1, _coeffs2, 0), _tmp02b1, _coeffs2, 1);\n                float32x4_t _tmp30 = vfmaq_laneq_f32(vfmaq_laneq_f32(vaddq_f32(_r50, _bias0), 
_tmp13a0, _coeffs, 2), _tmp13b0, _coeffs, 3);\n                float32x4_t _tmp31 = vfmaq_laneq_f32(vfmaq_laneq_f32(vaddq_f32(_r51, _bias1), _tmp13a1, _coeffs, 2), _tmp13b1, _coeffs, 3);\n\n                if (out_elempack == 4)\n                {\n                    float* outptr1 = outptr0 + N;\n\n                    vst1q_f32(outptr0, _tmp00);\n                    vst1q_f32(outptr1, _tmp01);\n                    if (tj * 4 + 1 < outw)\n                    {\n                        vst1q_f32(outptr0 + 4, _tmp10);\n                        vst1q_f32(outptr1 + 4, _tmp11);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        vst1q_f32(outptr0 + 8, _tmp20);\n                        vst1q_f32(outptr1 + 8, _tmp21);\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        vst1q_f32(outptr0 + 12, _tmp30);\n                        vst1q_f32(outptr1 + 12, _tmp31);\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    float tmp0[8];\n                    float tmp1[8];\n                    float tmp2[8];\n                    float tmp3[8];\n                    vst1q_f32(tmp0, _tmp00);\n                    vst1q_f32(tmp0 + 4, _tmp01);\n                    vst1q_f32(tmp1, _tmp10);\n                    vst1q_f32(tmp1 + 4, _tmp11);\n                    vst1q_f32(tmp2, _tmp20);\n                    vst1q_f32(tmp2 + 4, _tmp21);\n                    vst1q_f32(tmp3, _tmp30);\n                    vst1q_f32(tmp3 + 4, _tmp31);\n\n                    float* outptr1 = outptr0 + N;\n                    float* outptr2 = outptr0 + N * 2;\n                    float* outptr3 = outptr0 + N * 3;\n                    float* outptr4 = outptr0 + N * 4;\n                    float* outptr5 = outptr0 + N * 5;\n                    float* outptr6 = outptr0 + N * 6;\n                    float* outptr7 = outptr0 + N * 
7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                        outptr4[2] = tmp2[4];\n                        outptr5[2] = tmp2[5];\n                        outptr6[2] = tmp2[6];\n                        outptr7[2] = tmp2[7];\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                        outptr4[3] = tmp3[4];\n                        outptr5[3] = tmp3[5];\n                        outptr6[3] = tmp3[6];\n                        outptr7[3] = tmp3[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float32x4_t _bias0 = biasptr ? 
vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][6][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj * 4;\n            const float* r1 = r0 + max_jj * 4;\n            const float* r2 = r0 + max_jj * 4 * 2;\n            const float* r3 = r0 + max_jj * 4 * 3;\n            const float* r4 = r0 + max_jj * 4 * 4;\n            const float* r5 = r0 + max_jj * 4 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _r1 = vld1q_f32(r1);\n                float32x4_t _r2 = vld1q_f32(r2);\n                float32x4_t _r3 = vld1q_f32(r3);\n                float32x4_t _r4 = vld1q_f32(r4);\n                float32x4_t _r5 = vld1q_f32(r5);\n\n                float32x4_t _tmp02a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp02b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp13a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp13b = vsubq_f32(_r3, _r4);\n\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp02a), _tmp02b);\n#if __aarch64__\n                float32x4_t _tmp1 = vfmaq_laneq_f32(vmulq_laneq_f32(_tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x4_t _tmp2 = vfmaq_lane_f32(vmulq_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x4_t _tmp3 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r5, _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x4_t _tmp1 = vmlaq_lane_f32(vmulq_lane_f32(_tmp13a, vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x4_t _tmp2 = vmlaq_lane_f32(vmulq_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                
float32x4_t _tmp3 = vmlaq_lane_f32(vmlaq_lane_f32(_r5, _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n\n                r0 += max_jj * 6 * 4;\n                r1 += max_jj * 6 * 4;\n                r2 += max_jj * 6 * 4;\n                r3 += max_jj * 6 * 4;\n                r4 += max_jj * 6 * 4;\n                r5 += max_jj * 6 * 4;\n            }\n\n            float* outptr0 = top_blob.channel((i + ii) / out_elempack).row(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n\n                float32x4_t _tmp02a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp02b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp13a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp13b = vsubq_f32(_r3, _r4);\n\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp02a), vaddq_f32(_tmp02b, _bias0));\n#if __aarch64__\n                float32x4_t _tmp1 = vfmaq_laneq_f32(vfmaq_laneq_f32(_bias0, _tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x4_t _tmp2 = vfmaq_lane_f32(vfmaq_lane_f32(_bias0, _tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x4_t _tmp3 = vfmaq_laneq_f32(vfmaq_laneq_f32(vaddq_f32(_r5, _bias0), _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x4_t _tmp1 = vmlaq_lane_f32(vmlaq_lane_f32(_bias0, _tmp13a, 
vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x4_t _tmp2 = vmlaq_lane_f32(vmlaq_lane_f32(_bias0, _tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x4_t _tmp3 = vmlaq_lane_f32(vmlaq_lane_f32(vaddq_f32(_r5, _bias0), _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _tmp0);\n                    if (tj * 4 + 1 < outw) vst1q_f32(outptr0 + 4, _tmp1);\n                    if (tj * 4 + 2 < outw) vst1q_f32(outptr0 + 8, _tmp2);\n                    if (tj * 4 + 3 < outw) vst1q_f32(outptr0 + 12, _tmp3);\n                }\n                if (out_elempack == 1)\n                {\n                    float tmp0[4];\n                    float tmp1[4];\n                    float tmp2[4];\n                    float tmp3[4];\n                    vst1q_f32(tmp0, _tmp0);\n                    vst1q_f32(tmp1, _tmp1);\n                    vst1q_f32(tmp2, _tmp2);\n                    vst1q_f32(tmp3, _tmp3);\n\n                    float* outptr1 = outptr0 + N;\n                    float* outptr2 = outptr0 + N * 2;\n                    float* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                    }\n            
        if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        float32x2_t _bias0 = biasptr ? vld1_f32(biasptr + i + ii) : vdup_n_f32(0.f);\n#else\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        float bias1 = biasptr ? biasptr[i + ii + 1] : 0.f;\n#endif\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[4][6][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj * 2;\n            const float* r1 = r0 + max_jj * 2;\n            const float* r2 = r0 + max_jj * 2 * 2;\n            const float* r3 = r0 + max_jj * 2 * 3;\n            const float* r4 = r0 + max_jj * 2 * 4;\n            const float* r5 = r0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(r0);\n                float32x2_t _r1 = vld1_f32(r1);\n                float32x2_t _r2 = vld1_f32(r2);\n                float32x2_t _r3 = vld1_f32(r3);\n                float32x2_t _r4 = vld1_f32(r4);\n                float32x2_t _r5 = vld1_f32(r5);\n\n                float32x2_t _tmp02a = vadd_f32(_r1, _r2);\n                float32x2_t _tmp02b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp13a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp13b = vsub_f32(_r3, _r4);\n\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp02a), _tmp02b);\n#if __aarch64__\n  
              float32x2_t _tmp1 = vfma_laneq_f32(vmul_laneq_f32(_tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x2_t _tmp2 = vfma_lane_f32(vmul_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = vfma_laneq_f32(vfma_laneq_f32(_r5, _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x2_t _tmp1 = vmla_lane_f32(vmul_lane_f32(_tmp13a, vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x2_t _tmp2 = vmla_lane_f32(vmul_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = vmla_lane_f32(vmla_lane_f32(_r5, _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n#else\n                float tmp02a0 = r1[0] + r2[0];\n                float tmp02a1 = r1[1] + r2[1];\n                float tmp02b0 = r3[0] + r4[0];\n                float tmp02b1 = r3[1] + r4[1];\n                float tmp13a0 = r1[0] - r2[0];\n                float tmp13a1 = r1[1] - r2[1];\n                float tmp13b0 = r3[0] - r4[0];\n                float tmp13b1 = r3[1] - r4[1];\n\n                tmp[0][m][0] = r0[0] + tmp02a0 + tmp02b0;\n                tmp[0][m][1] = r0[1] + tmp02a1 + tmp02b1;\n                tmp[1][m][0] = tmp13a0 * sq2_d2 + tmp13b0 * sq2;\n                tmp[1][m][1] = tmp13a1 * sq2_d2 + tmp13b1 * sq2;\n                tmp[2][m][0] = tmp02a0 * 0.5f + tmp02b0 * 2;\n                tmp[2][m][1] = tmp02a1 * 0.5f + tmp02b1 * 2;\n                tmp[3][m][0] = r5[0] + tmp13a0 * sq2_d4 + tmp13b0 * sq2_m2;\n                tmp[3][m][1] = r5[1] + tmp13a1 * sq2_d4 + tmp13b1 * sq2_m2;\n#endif\n\n                r0 += max_jj * 6 * 2;\n                r1 += max_jj * 6 * 2;\n                r2 += max_jj * 6 * 2;\n                r3 += 
max_jj * 6 * 2;\n                r4 += max_jj * 6 * 2;\n                r5 += max_jj * 6 * 2;\n            }\n\n            float* outptr0 = top_blob.channel(i + ii).row(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t _r4 = vld1_f32(tmp[m][4]);\n                float32x2_t _r5 = vld1_f32(tmp[m][5]);\n\n                float32x2_t _tmp02a = vadd_f32(_r1, _r2);\n                float32x2_t _tmp02b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp13a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp13b = vsub_f32(_r3, _r4);\n\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp02a), vadd_f32(_tmp02b, _bias0));\n#if __aarch64__\n                float32x2_t _tmp1 = vfma_laneq_f32(vfma_laneq_f32(_bias0, _tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x2_t _tmp2 = vfma_lane_f32(vfma_lane_f32(_bias0, _tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = vfma_laneq_f32(vfma_laneq_f32(vadd_f32(_r5, _bias0), _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x2_t _tmp1 = vmla_lane_f32(vmla_lane_f32(_bias0, _tmp13a, vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x2_t _tmp2 = vmla_lane_f32(vmla_lane_f32(_bias0, _tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = vmla_lane_f32(vmla_lane_f32(vadd_f32(_r5, _bias0), _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                
float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n\n                float tmp02a0 = r10 + r20;\n                float tmp02a1 = r11 + r21;\n                float tmp02b0 = r30 + r40;\n                float tmp02b1 = r31 + r41;\n                float tmp13a0 = r10 - r20;\n                float tmp13a1 = r11 - r21;\n                float tmp13b0 = r30 - r40;\n                float tmp13b1 = r31 - r41;\n\n                float tmp00 = bias0 + r00 + tmp02a0 + tmp02b0;\n                float tmp01 = bias1 + r01 + tmp02a1 + tmp02b1;\n                float tmp10 = bias0 + tmp13a0 * sq2_d2 + tmp13b0 * sq2;\n                float tmp11 = bias1 + tmp13a1 * sq2_d2 + tmp13b1 * sq2;\n                float tmp20 = bias0 + tmp02a0 * 0.5f + tmp02b0 * 2;\n                float tmp21 = bias1 + tmp02a1 * 0.5f + tmp02b1 * 2;\n                float tmp30 = bias0 + r50 + tmp13a0 * sq2_d4 + tmp13b0 * sq2_m2;\n                float tmp31 = bias1 + r51 + tmp13a1 * sq2_d4 + tmp13b1 * sq2_m2;\n#endif\n\n                // if (out_elempack == 1)\n                {\n                    float* outptr1 = outptr0 + N;\n\n#if __ARM_NEON\n                    outptr0[0] = vget_lane_f32(_tmp0, 0);\n                    outptr1[0] = vget_lane_f32(_tmp0, 1);\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = vget_lane_f32(_tmp1, 0);\n                        outptr1[1] = vget_lane_f32(_tmp1, 1);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = vget_lane_f32(_tmp2, 0);\n                        outptr1[2] = vget_lane_f32(_tmp2, 1);\n                    }\n      
              if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = vget_lane_f32(_tmp3, 0);\n                        outptr1[3] = vget_lane_f32(_tmp3, 1);\n                    }\n#else\n                    outptr0[0] = tmp00;\n                    outptr1[0] = tmp01;\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] = tmp11;\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp20;\n                        outptr1[2] = tmp21;\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp30;\n                        outptr1[3] = tmp31;\n                    }\n#endif\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        float bias0 = biasptr ? 
biasptr[i + ii] : 0.f;\n\n        float tmp[4][6];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj;\n            const float* r1 = r0 + max_jj;\n            const float* r2 = r0 + max_jj * 2;\n            const float* r3 = r0 + max_jj * 3;\n            const float* r4 = r0 + max_jj * 4;\n            const float* r5 = r0 + max_jj * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float tmp02a = r1[0] + r2[0];\n                float tmp02b = r3[0] + r4[0];\n                float tmp13a = r1[0] - r2[0];\n                float tmp13b = r3[0] - r4[0];\n\n                tmp[0][m] = r0[0] + tmp02a + tmp02b;\n                tmp[1][m] = tmp13a * sq2_d2 + tmp13b * sq2;\n                tmp[2][m] = tmp02a * 0.5f + tmp02b * 2;\n                tmp[3][m] = r5[0] + tmp13a * sq2_d4 + tmp13b * sq2_m2;\n\n                r0 += max_jj * 6;\n                r1 += max_jj * 6;\n                r2 += max_jj * 6;\n                r3 += max_jj * 6;\n                r4 += max_jj * 6;\n                r5 += max_jj * 6;\n            }\n\n            float* outptr0 = top_blob.channel(i + ii).row(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n\n                float tmp02a = r1 + r2;\n                float tmp02b = r3 + r4;\n                float tmp13a = r1 - r2;\n                float tmp13b = r3 - r4;\n\n                float tmp0 = bias0 + r0 + tmp02a + tmp02b;\n                float tmp1 = bias0 + tmp13a * sq2_d2 + tmp13b * sq2;\n              
  float tmp2 = bias0 + tmp02a * 0.5f + tmp02b * 2;\n                float tmp3 = bias0 + r5 + tmp13a * sq2_d4 + tmp13b * sq2_m2;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 4 + 1 < outw) outptr0[1] = tmp1;\n                    if (tj * 4 + 2 < outw) outptr0[2] = tmp2;\n                    if (tj * 4 + 3 < outw) outptr0[3] = tmp3;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd43(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 4n+2, winograd F(4,3)\n    int w_tiles = (outw + 3) / 4;\n    int h_tiles = (outh + 3) / 4;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 36;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd43 %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 4u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n       
     const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd43_transform_input_tile(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 4u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd43_transform_input_tile(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n            
    const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd43_transform_output_tile(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n\nstatic inline void conv3x3s1_winograd63_transform_kernel_tile(const Mat& kernel, Mat& A, int inch, int i, int max_ii, int k, int max_kk)\n{\n    float* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            // const float ktm[8][3] = {\n            //     {1.0f, 0.0f, 0.0f},\n            //     {-2.0f / 9, -2.0f / 9, -2.0f / 9},\n            //     {-2.0f / 9, 2.0f / 9, -2.0f / 9},\n            //     {1.0f / 90, 1.0f / 45, 2.0f / 45},\n            //     {1.0f / 90, -1.0f / 45, 2.0f / 45},\n            //     {1.0f / 45, 1.0f / 90, 1.0f / 180},\n            //     {1.0f / 45, -1.0f / 90, 1.0f / 180},\n            //     {0.0f, 0.0f, 1.0f}\n            // };\n            const float ktm0 = 2.0f / 9;\n            const float ktm1 = 1.0f / 45;\n            const float ktm2 = 2.0f / 45;\n            const float ktm3 = 1.0f / 90;\n            const float ktm4 = 1.0f / 180;\n\n            float tmp[8][3];\n\n            const float* k0 = (const float*)kernel + (i + ii) * inch * 9 + (k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                float r0 = k0[0];\n                float r1 = k0[1];\n                float r2 = k0[2];\n\n                tmp[0][m] = r0;\n                tmp[1][m] = -r0 * ktm0 - r1 * ktm0 - r2 * ktm0;\n                tmp[2][m] = -r0 * ktm0 + r1 * ktm0 - r2 * ktm0;\n          
      tmp[3][m] = r0 * ktm3 + r1 * ktm1 + r2 * ktm2;\n                tmp[4][m] = r0 * ktm3 - r1 * ktm1 + r2 * ktm2;\n                tmp[5][m] = r0 * ktm1 + r1 * ktm3 + r2 * ktm4;\n                tmp[6][m] = r0 * ktm1 - r1 * ktm3 + r2 * ktm4;\n                tmp[7][m] = r2;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 8; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n\n                float z0 = r0;\n                float z1 = -r0 * ktm0 - r1 * ktm0 - r2 * ktm0;\n                float z2 = -r0 * ktm0 + r1 * ktm0 - r2 * ktm0;\n                float z3 = r0 * ktm3 + r1 * ktm1 + r2 * ktm2;\n                float z4 = r0 * ktm3 - r1 * ktm1 + r2 * ktm2;\n                float z5 = r0 * ktm1 + r1 * ktm3 + r2 * ktm4;\n                float z6 = r0 * ktm1 - r1 * ktm3 + r2 * ktm4;\n                float z7 = r2;\n\n                ptmp[0] = z0;\n                ptmp[1] = z1;\n                ptmp[2] = z2;\n                ptmp[3] = z3;\n                ptmp[4] = z4;\n                ptmp[5] = z5;\n                ptmp[6] = z6;\n                ptmp[7] = z7;\n                ptmp += 8;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd63_transform_kernel(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    const int K = inch;\n    const int B = 64;\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, 0, K, B, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads);\n\n    AT.create(TILE_K * TILE_M, B, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = 
A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd63_transform_kernel_tile(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            conv3x3s1_winograd_pack_A_tile(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd63_transform_input_tile(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    // const float itm[8][8] = {\n    //     {1.0f, 0.0f,-5.25f, 0.00f, 5.25f, 0.00f,-1.0f, 0.0f},\n    //     {0.0f, 1.0f, 1.00f,-4.25f,-4.25f, 1.00f, 1.0f, 0.0f},\n    //     {0.0f,-1.0f, 1.00f, 4.25f,-4.25f,-1.00f, 1.0f, 0.0f},\n    //     {0.0f, 0.5f, 0.25f,-2.50f,-1.25f, 2.00f, 1.0f, 0.0f},\n    //     {0.0f,-0.5f, 0.25f, 2.50f,-1.25f,-2.00f, 1.0f, 0.0f},\n    //     {0.0f, 2.0f, 4.00f,-2.50f,-5.00f, 0.50f, 1.0f, 0.0f},\n    //     {0.0f,-2.0f, 4.00f, 2.50f,-5.00f,-0.50f, 1.0f, 0.0f},\n    //     {0.0f,-1.0f, 0.00f, 5.25f, 0.00f,-5.25f, 0.0f, 1.0f}\n    // };\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w + 3) / 6;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[8][8][8];\n\n        const float coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n  
      float32x4_t _coeffs2 = vld1q_f32(coeffs + 4);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel((k + kk) / elempack).row(ti * 6) + (tj * 6) * elempack;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r00 = vdupq_n_f32(0.f);\n                float32x4_t _r01 = vdupq_n_f32(0.f);\n                float32x4_t _r10 = vdupq_n_f32(0.f);\n                float32x4_t _r11 = vdupq_n_f32(0.f);\n                float32x4_t _r20 = vdupq_n_f32(0.f);\n                float32x4_t _r21 = vdupq_n_f32(0.f);\n                float32x4_t _r30 = vdupq_n_f32(0.f);\n                float32x4_t _r31 = vdupq_n_f32(0.f);\n                float32x4_t _r40 = vdupq_n_f32(0.f);\n                float32x4_t _r41 = vdupq_n_f32(0.f);\n                float32x4_t _r50 = vdupq_n_f32(0.f);\n                float32x4_t _r51 = vdupq_n_f32(0.f);\n                float32x4_t _r60 = vdupq_n_f32(0.f);\n                float32x4_t _r61 = vdupq_n_f32(0.f);\n                float32x4_t _r70 = vdupq_n_f32(0.f);\n                float32x4_t _r71 = vdupq_n_f32(0.f);\n\n                if (ti * 6 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        const float* r1 = r0 + N;\n\n                        _r00 = vld1q_f32(r0);\n                        _r01 = vld1q_f32(r1);\n                        if (tj * 6 + 1 < w)\n                        {\n                            _r10 = vld1q_f32(r0 + 4);\n                            _r11 = vld1q_f32(r1 + 4);\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            _r20 = vld1q_f32(r0 + 8);\n                            _r21 = vld1q_f32(r1 + 8);\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n     
                       _r30 = vld1q_f32(r0 + 12);\n                            _r31 = vld1q_f32(r1 + 12);\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            _r40 = vld1q_f32(r0 + 16);\n                            _r41 = vld1q_f32(r1 + 16);\n                        }\n                        if (tj * 6 + 5 < w)\n                        {\n                            _r50 = vld1q_f32(r0 + 20);\n                            _r51 = vld1q_f32(r1 + 20);\n                        }\n                        if (tj * 6 + 6 < w)\n                        {\n                            _r60 = vld1q_f32(r0 + 24);\n                            _r61 = vld1q_f32(r1 + 24);\n                        }\n                        if (tj * 6 + 7 < w)\n                        {\n                            _r70 = vld1q_f32(r0 + 28);\n                            _r71 = vld1q_f32(r1 + 28);\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n                        const float* r2 = r0 + N * 2;\n                        const float* r3 = r0 + N * 3;\n                        const float* r4 = r0 + N * 4;\n                        const float* r5 = r0 + N * 5;\n                        const float* r6 = r0 + N * 6;\n                        const float* r7 = r0 + N * 7;\n\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4_t _t2 = vld1q_f32(r2);\n                        float32x4_t _t3 = vld1q_f32(r3);\n                        float32x4_t _t4 = vld1q_f32(r4);\n                        float32x4_t _t5 = vld1q_f32(r5);\n                        float32x4_t _t6 = vld1q_f32(r6);\n                        float32x4_t _t7 = vld1q_f32(r7);\n\n                        transpose4x4_ps(_t0, _t1, _t2, _t3);\n                        
transpose4x4_ps(_t4, _t5, _t6, _t7);\n\n                        _r00 = _t0;\n                        _r01 = _t4;\n                        if (tj * 6 + 1 < w)\n                        {\n                            _r10 = _t1;\n                            _r11 = _t5;\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            _r20 = _t2;\n                            _r21 = _t6;\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            _r30 = _t3;\n                            _r31 = _t7;\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            _t0 = vld1q_f32(r0 + 4);\n                            _t1 = vld1q_f32(r1 + 4);\n                            _t2 = vld1q_f32(r2 + 4);\n                            _t3 = vld1q_f32(r3 + 4);\n                            _t4 = vld1q_f32(r4 + 4);\n                            _t5 = vld1q_f32(r5 + 4);\n                            _t6 = vld1q_f32(r6 + 4);\n                            _t7 = vld1q_f32(r7 + 4);\n\n                            transpose4x4_ps(_t0, _t1, _t2, _t3);\n                            transpose4x4_ps(_t4, _t5, _t6, _t7);\n\n                            _r40 = _t0;\n                            _r41 = _t4;\n                            if (tj * 6 + 5 < w)\n                            {\n                                _r50 = _t1;\n                                _r51 = _t5;\n                            }\n                            if (tj * 6 + 6 < w)\n                            {\n                                _r60 = _t2;\n                                _r61 = _t6;\n                            }\n                            if (tj * 6 + 7 < w)\n                            {\n                                _r70 = _t3;\n                                _r71 = _t7;\n                            }\n                  
      }\n                    }\n                }\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vaddq_f32(_r20, _r60), _r40, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vaddq_f32(_r21, _r61), _r41, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(vaddq_f32(_r10, _r50), _r30, _coeffs, 1);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(vaddq_f32(_r11, _r51), _r31, _coeffs, 1);\n                float32x4_t _tmp34a0 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r60, _r20, _coeffs, 3), _r40, _coeffs, 2);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r61, _r21, _coeffs, 3), _r41, _coeffs, 2);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 1), _r30, _coeffs2, 0), _r50, _coeffs2, 2);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 1), _r31, _coeffs2, 0), _r51, _coeffs2, 2);\n                float32x4_t _tmp56a0 = vfmaq_laneq_f32(_r60, vfmaq_laneq_f32(_r20, _r40, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56a1 = vfmaq_laneq_f32(_r61, vfmaq_laneq_f32(_r21, _r41, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 2), _r30, _coeffs2, 0), _r50, _coeffs2, 1);\n                float32x4_t _tmp56b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 2), _r31, _coeffs2, 0), _r51, _coeffs2, 1);\n\n                float32x4_t _tmp00 = vfmaq_laneq_f32(vsubq_f32(_r00, _r60), vsubq_f32(_r40, _r20), _coeffs, 0);\n                float32x4_t _tmp01 = vfmaq_laneq_f32(vsubq_f32(_r01, _r61), vsubq_f32(_r41, _r21), _coeffs, 0);\n                float32x4_t _tmp10 = vaddq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp11 = vaddq_f32(_tmp12a1, _tmp12b1);\n                float32x4_t _tmp20 = vsubq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp21 = vsubq_f32(_tmp12a1, 
_tmp12b1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp50 = vaddq_f32(_tmp56a0, _tmp56b0);\n                float32x4_t _tmp51 = vaddq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp60 = vsubq_f32(_tmp56a0, _tmp56b0);\n                float32x4_t _tmp61 = vsubq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp70 = vfmaq_laneq_f32(vsubq_f32(_r70, _r10), vsubq_f32(_r30, _r50), _coeffs, 0);\n                float32x4_t _tmp71 = vfmaq_laneq_f32(vsubq_f32(_r71, _r11), vsubq_f32(_r31, _r51), _coeffs, 0);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n                vst1q_f32(tmp[4][m], _tmp40);\n                vst1q_f32(tmp[4][m] + 4, _tmp41);\n                vst1q_f32(tmp[5][m], _tmp50);\n                vst1q_f32(tmp[5][m] + 4, _tmp51);\n                vst1q_f32(tmp[6][m], _tmp60);\n                vst1q_f32(tmp[6][m] + 4, _tmp61);\n                vst1q_f32(tmp[7][m], _tmp70);\n                vst1q_f32(tmp[7][m] + 4, _tmp71);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj * 8;\n            float* p1 = p0 + max_jj * 8;\n            float* p2 = p0 + max_jj * 8 * 2;\n            float* p3 = p0 + max_jj * 8 * 3;\n            float* p4 = p0 + max_jj * 8 * 4;\n            float* p5 = p0 + max_jj * 8 * 5;\n            float* p6 = p0 + max_jj * 8 * 6;\n            float* p7 = p0 + 
max_jj * 8 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n                float32x4_t _r60 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r61 = vld1q_f32(tmp[m][6] + 4);\n                float32x4_t _r70 = vld1q_f32(tmp[m][7]);\n                float32x4_t _r71 = vld1q_f32(tmp[m][7] + 4);\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vaddq_f32(_r20, _r60), _r40, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vaddq_f32(_r21, _r61), _r41, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(vaddq_f32(_r10, _r50), _r30, _coeffs, 1);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(vaddq_f32(_r11, _r51), _r31, _coeffs, 1);\n                float32x4_t _tmp34a0 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r60, _r20, _coeffs, 3), _r40, _coeffs, 2);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r61, _r21, _coeffs, 3), _r41, _coeffs, 2);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 1), _r30, _coeffs2, 0), _r50, _coeffs2, 2);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 1), _r31, _coeffs2, 0), _r51, _coeffs2, 2);\n                float32x4_t _tmp56a0 = 
vfmaq_laneq_f32(_r60, vfmaq_laneq_f32(_r20, _r40, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56a1 = vfmaq_laneq_f32(_r61, vfmaq_laneq_f32(_r21, _r41, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 2), _r30, _coeffs2, 0), _r50, _coeffs2, 1);\n                float32x4_t _tmp56b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 2), _r31, _coeffs2, 0), _r51, _coeffs2, 1);\n\n                float32x4_t _tmp00 = vfmaq_laneq_f32(vsubq_f32(_r00, _r60), vsubq_f32(_r40, _r20), _coeffs, 0);\n                float32x4_t _tmp01 = vfmaq_laneq_f32(vsubq_f32(_r01, _r61), vsubq_f32(_r41, _r21), _coeffs, 0);\n                float32x4_t _tmp10 = vaddq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp11 = vaddq_f32(_tmp12a1, _tmp12b1);\n                float32x4_t _tmp20 = vsubq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp21 = vsubq_f32(_tmp12a1, _tmp12b1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp50 = vaddq_f32(_tmp56a0, _tmp56b0);\n                float32x4_t _tmp51 = vaddq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp60 = vsubq_f32(_tmp56a0, _tmp56b0);\n                float32x4_t _tmp61 = vsubq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp70 = vfmaq_laneq_f32(vsubq_f32(_r70, _r10), vsubq_f32(_r30, _r50), _coeffs, 0);\n                float32x4_t _tmp71 = vfmaq_laneq_f32(vsubq_f32(_r71, _r11), vsubq_f32(_r31, _r51), _coeffs, 0);\n\n                vst1q_f32(p0, _tmp00);\n                vst1q_f32(p0 + 4, _tmp01);\n                vst1q_f32(p1, _tmp10);\n                vst1q_f32(p1 + 4, _tmp11);\n                vst1q_f32(p2, _tmp20);\n                
vst1q_f32(p2 + 4, _tmp21);\n                vst1q_f32(p3, _tmp30);\n                vst1q_f32(p3 + 4, _tmp31);\n                vst1q_f32(p4, _tmp40);\n                vst1q_f32(p4 + 4, _tmp41);\n                vst1q_f32(p5, _tmp50);\n                vst1q_f32(p5 + 4, _tmp51);\n                vst1q_f32(p6, _tmp60);\n                vst1q_f32(p6 + 4, _tmp61);\n                vst1q_f32(p7, _tmp70);\n                vst1q_f32(p7 + 4, _tmp71);\n\n                p0 += max_jj * 8 * 8;\n                p1 += max_jj * 8 * 8;\n                p2 += max_jj * 8 * 8;\n                p3 += max_jj * 8 * 8;\n                p4 += max_jj * 8 * 8;\n                p5 += max_jj * 8 * 8;\n                p6 += max_jj * 8 * 8;\n                p7 += max_jj * 8 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n#else // __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    #pragma omp parallel for num_threads(nT)\n#endif // __aarch64__\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[8][8][4];\n\n        const float coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _coeffs2 = vld1q_f32(coeffs + 4);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel((k + kk) / elempack).row(ti * 6) + (tj * 6) * elempack;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r0 = vdupq_n_f32(0.f);\n                float32x4_t _r1 = vdupq_n_f32(0.f);\n                float32x4_t _r2 = vdupq_n_f32(0.f);\n                float32x4_t _r3 = 
vdupq_n_f32(0.f);\n                float32x4_t _r4 = vdupq_n_f32(0.f);\n                float32x4_t _r5 = vdupq_n_f32(0.f);\n                float32x4_t _r6 = vdupq_n_f32(0.f);\n                float32x4_t _r7 = vdupq_n_f32(0.f);\n\n                if (ti * 6 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1q_f32(r0);\n                        if (tj * 6 + 1 < w) _r1 = vld1q_f32(r0 + 4);\n                        if (tj * 6 + 2 < w) _r2 = vld1q_f32(r0 + 8);\n                        if (tj * 6 + 3 < w) _r3 = vld1q_f32(r0 + 12);\n                        if (tj * 6 + 4 < w) _r4 = vld1q_f32(r0 + 16);\n                        if (tj * 6 + 5 < w) _r5 = vld1q_f32(r0 + 20);\n                        if (tj * 6 + 6 < w) _r6 = vld1q_f32(r0 + 24);\n                        if (tj * 6 + 7 < w) _r7 = vld1q_f32(r0 + 28);\n                    }\n                    if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n                        const float* r2 = r0 + N * 2;\n                        const float* r3 = r0 + N * 3;\n\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4_t _t2 = vld1q_f32(r2);\n                        float32x4_t _t3 = vld1q_f32(r3);\n\n                        transpose4x4_ps(_t0, _t1, _t2, _t3);\n\n                        _r0 = _t0;\n                        if (tj * 6 + 1 < w) _r1 = _t1;\n                        if (tj * 6 + 2 < w) _r2 = _t2;\n                        if (tj * 6 + 3 < w) _r3 = _t3;\n                        if (tj * 6 + 4 < w)\n                        {\n                            _t0 = vld1q_f32(r0 + 4);\n                            _t1 = vld1q_f32(r1 + 4);\n                            _t2 = vld1q_f32(r2 + 4);\n                            _t3 = vld1q_f32(r3 + 4);\n\n                            transpose4x4_ps(_t0, _t1, _t2, 
_t3);\n\n                            _r4 = _t0;\n                            if (tj * 6 + 5 < w) _r5 = _t1;\n                            if (tj * 6 + 6 < w) _r6 = _t2;\n                            if (tj * 6 + 7 < w) _r7 = _t3;\n                        }\n                    }\n                }\n\n#if __aarch64__\n                float32x4_t _tmp12a = vfmaq_laneq_f32(vaddq_f32(_r2, _r6), _r4, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(vaddq_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vfmaq_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x4_t _tmp56a = vfmaq_laneq_f32(_r6, vfmaq_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vaddq_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(vaddq_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmlaq_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x4_t _tmp56a = vmlaq_lane_f32(_r6, vmlaq_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x4_t _tmp56b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_laneq_f32(vsubq_f32(_r0, _r6), 
vsubq_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x4_t _tmp0 = vmlaq_lane_f32(vsubq_f32(_r0, _r6), vsubq_f32(_r4, _r2), vget_low_f32(_coeffs), 0);\n#endif\n                float32x4_t _tmp1 = vaddq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp2 = vsubq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34a, _tmp34b);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34a, _tmp34b);\n                float32x4_t _tmp5 = vaddq_f32(_tmp56a, _tmp56b);\n                float32x4_t _tmp6 = vsubq_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x4_t _tmp7 = vfmaq_laneq_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x4_t _tmp7 = vmlaq_lane_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n                vst1q_f32(tmp[4][m], _tmp4);\n                vst1q_f32(tmp[5][m], _tmp5);\n                vst1q_f32(tmp[6][m], _tmp6);\n                vst1q_f32(tmp[7][m], _tmp7);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj * 4;\n            float* p1 = p0 + max_jj * 4;\n            float* p2 = p0 + max_jj * 4 * 2;\n            float* p3 = p0 + max_jj * 4 * 3;\n            float* p4 = p0 + max_jj * 4 * 4;\n            float* p5 = p0 + max_jj * 4 * 5;\n            float* p6 = p0 + max_jj * 4 * 6;\n            float* p7 = p0 + max_jj * 4 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = vld1q_f32(tmp[m][4]);\n                
float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r6 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r7 = vld1q_f32(tmp[m][7]);\n\n#if __aarch64__\n                float32x4_t _tmp12a = vfmaq_laneq_f32(vaddq_f32(_r2, _r6), _r4, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(vaddq_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vfmaq_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x4_t _tmp56a = vfmaq_laneq_f32(_r6, vfmaq_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vaddq_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(vaddq_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmlaq_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x4_t _tmp56a = vmlaq_lane_f32(_r6, vmlaq_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x4_t _tmp56b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_laneq_f32(vsubq_f32(_r0, _r6), vsubq_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x4_t _tmp0 = vmlaq_lane_f32(vsubq_f32(_r0, _r6), vsubq_f32(_r4, _r2), 
vget_low_f32(_coeffs), 0);\n#endif\n                float32x4_t _tmp1 = vaddq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp2 = vsubq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34a, _tmp34b);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34a, _tmp34b);\n                float32x4_t _tmp5 = vaddq_f32(_tmp56a, _tmp56b);\n                float32x4_t _tmp6 = vsubq_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x4_t _tmp7 = vfmaq_laneq_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x4_t _tmp7 = vmlaq_lane_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1q_f32(p0, _tmp0);\n                vst1q_f32(p1, _tmp1);\n                vst1q_f32(p2, _tmp2);\n                vst1q_f32(p3, _tmp3);\n                vst1q_f32(p4, _tmp4);\n                vst1q_f32(p5, _tmp5);\n                vst1q_f32(p6, _tmp6);\n                vst1q_f32(p7, _tmp7);\n\n                p0 += max_jj * 8 * 4;\n                p1 += max_jj * 8 * 4;\n                p2 += max_jj * 8 * 4;\n                p3 += max_jj * 8 * 4;\n                p4 += max_jj * 8 * 4;\n                p5 += max_jj * 8 * 4;\n                p6 += max_jj * 8 * 4;\n                p7 += max_jj * 8 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[8][8][2];\n\n#if __ARM_NEON\n        const float coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        
float32x4_t _coeffs2 = vld1q_f32(coeffs + 4);\n#endif\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = bottom_blob.channel(k + kk).row(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 8; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vdup_n_f32(0.f);\n                float32x2_t _r1 = vdup_n_f32(0.f);\n                float32x2_t _r2 = vdup_n_f32(0.f);\n                float32x2_t _r3 = vdup_n_f32(0.f);\n                float32x2_t _r4 = vdup_n_f32(0.f);\n                float32x2_t _r5 = vdup_n_f32(0.f);\n                float32x2_t _r6 = vdup_n_f32(0.f);\n                float32x2_t _r7 = vdup_n_f32(0.f);\n#else\n                float r00 = 0.f;\n                float r01 = 0.f;\n                float r10 = 0.f;\n                float r11 = 0.f;\n                float r20 = 0.f;\n                float r21 = 0.f;\n                float r30 = 0.f;\n                float r31 = 0.f;\n                float r40 = 0.f;\n                float r41 = 0.f;\n                float r50 = 0.f;\n                float r51 = 0.f;\n                float r60 = 0.f;\n                float r61 = 0.f;\n                float r70 = 0.f;\n                float r71 = 0.f;\n#endif\n\n                if (ti * 6 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const float* r1 = r0 + N;\n\n#if __ARM_NEON\n                        float32x4_t _t0 = vld1q_f32(r0);\n                        float32x4_t _t1 = vld1q_f32(r1);\n                        float32x4x2_t _t01 = vzipq_f32(_t0, _t1);\n\n                        _r0 = vget_low_f32(_t01.val[0]);\n                        if (tj * 6 + 1 < w) _r1 = vget_high_f32(_t01.val[0]);\n                        if (tj * 6 + 2 < w) _r2 = vget_low_f32(_t01.val[1]);\n                        if (tj * 6 + 3 < w) _r3 = 
vget_high_f32(_t01.val[1]);\n                        if (tj * 6 + 4 < w)\n                        {\n                            _t0 = vld1q_f32(r0 + 4);\n                            _t1 = vld1q_f32(r1 + 4);\n                            _t01 = vzipq_f32(_t0, _t1);\n\n                            _r4 = vget_low_f32(_t01.val[0]);\n                            if (tj * 6 + 5 < w) _r5 = vget_high_f32(_t01.val[0]);\n                            if (tj * 6 + 6 < w) _r6 = vget_low_f32(_t01.val[1]);\n                            if (tj * 6 + 7 < w) _r7 = vget_high_f32(_t01.val[1]);\n                        }\n#else\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 6 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            r40 = r0[4];\n                            r41 = r1[4];\n                        }\n                        if (tj * 6 + 5 < w)\n                        {\n                            r50 = r0[5];\n                            r51 = r1[5];\n                        }\n                        if (tj * 6 + 6 < w)\n                        {\n                            r60 = r0[6];\n                            r61 = r1[6];\n                        }\n                        if (tj * 6 + 7 < w)\n                        {\n                            r70 = r0[7];\n                            r71 = r1[7];\n                        }\n#endif\n                    
}\n                }\n\n#if __ARM_NEON\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vadd_f32(_r2, _r6), _r4, _coeffs, 1);\n                float32x2_t _tmp12b = vfma_laneq_f32(vadd_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x2_t _tmp34a = vfma_laneq_f32(vfma_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x2_t _tmp34b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x2_t _tmp56a = vfma_laneq_f32(_r6, vfma_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x2_t _tmp56b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vadd_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(vadd_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34a = vmla_lane_f32(vmla_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x2_t _tmp56a = vmla_lane_f32(_r6, vmla_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x2_t _tmp56b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_laneq_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x2_t _tmp0 = vmla_lane_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), vget_low_f32(_coeffs), 0);\n#endif\n                float32x2_t _tmp1 = vadd_f32(_tmp12a, _tmp12b);\n                float32x2_t _tmp2 = vsub_f32(_tmp12a, _tmp12b);\n     
           float32x2_t _tmp3 = vadd_f32(_tmp34a, _tmp34b);\n                float32x2_t _tmp4 = vsub_f32(_tmp34a, _tmp34b);\n                float32x2_t _tmp5 = vadd_f32(_tmp56a, _tmp56b);\n                float32x2_t _tmp6 = vsub_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x2_t _tmp7 = vfma_laneq_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x2_t _tmp7 = vmla_lane_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n                vst1_f32(tmp[4][m], _tmp4);\n                vst1_f32(tmp[5][m], _tmp5);\n                vst1_f32(tmp[6][m], _tmp6);\n                vst1_f32(tmp[7][m], _tmp7);\n#else\n                float tmp12a0 = r20 + r60 - r40 * 4.25f;\n                float tmp12a1 = r21 + r61 - r41 * 4.25f;\n                float tmp12b0 = r10 + r50 - r30 * 4.25f;\n                float tmp12b1 = r11 + r51 - r31 * 4.25f;\n                float tmp34a0 = r60 + r20 * 0.25f - r40 * 1.25f;\n                float tmp34a1 = r61 + r21 * 0.25f - r41 * 1.25f;\n                float tmp34b0 = r10 * 0.5f - r30 * 2.5f + r50 * 2.f;\n                float tmp34b1 = r11 * 0.5f - r31 * 2.5f + r51 * 2.f;\n                float tmp56a0 = r20 * 4.f - r40 * 5.f + r60;\n                float tmp56a1 = r21 * 4.f - r41 * 5.f + r61;\n                float tmp56b0 = r10 * 2.f - r30 * 2.5f + r50 * 0.5f;\n                float tmp56b1 = r11 * 2.f - r31 * 2.5f + r51 * 0.5f;\n\n                tmp[0][m][0] = r00 - r60 + (r40 - r20) * 5.25f;\n                tmp[0][m][1] = r01 - r61 + (r41 - r21) * 5.25f;\n                tmp[1][m][0] = tmp12a0 + tmp12b0;\n                tmp[1][m][1] = tmp12a1 + tmp12b1;\n                tmp[2][m][0] = tmp12a0 - tmp12b0;\n                tmp[2][m][1] = tmp12a1 - tmp12b1;\n                
tmp[3][m][0] = tmp34a0 + tmp34b0;\n                tmp[3][m][1] = tmp34a1 + tmp34b1;\n                tmp[4][m][0] = tmp34a0 - tmp34b0;\n                tmp[4][m][1] = tmp34a1 - tmp34b1;\n                tmp[5][m][0] = tmp56a0 + tmp56b0;\n                tmp[5][m][1] = tmp56a1 + tmp56b1;\n                tmp[6][m][0] = tmp56a0 - tmp56b0;\n                tmp[6][m][1] = tmp56a1 - tmp56b1;\n                tmp[7][m][0] = r70 - r10 + (r30 - r50) * 5.25f;\n                tmp[7][m][1] = r71 - r11 + (r31 - r51) * 5.25f;\n#endif\n\n                r0 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj * 2;\n            float* p1 = p0 + max_jj * 2;\n            float* p2 = p0 + max_jj * 2 * 2;\n            float* p3 = p0 + max_jj * 2 * 3;\n            float* p4 = p0 + max_jj * 2 * 4;\n            float* p5 = p0 + max_jj * 2 * 5;\n            float* p6 = p0 + max_jj * 2 * 6;\n            float* p7 = p0 + max_jj * 2 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t _r4 = vld1_f32(tmp[m][4]);\n                float32x2_t _r5 = vld1_f32(tmp[m][5]);\n                float32x2_t _r6 = vld1_f32(tmp[m][6]);\n                float32x2_t _r7 = vld1_f32(tmp[m][7]);\n\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vadd_f32(_r2, _r6), _r4, _coeffs, 1);\n                float32x2_t _tmp12b = vfma_laneq_f32(vadd_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x2_t _tmp34a = vfma_laneq_f32(vfma_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x2_t _tmp34b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x2_t _tmp56a = vfma_laneq_f32(_r6, 
vfma_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x2_t _tmp56b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vadd_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(vadd_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34a = vmla_lane_f32(vmla_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x2_t _tmp56a = vmla_lane_f32(_r6, vmla_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x2_t _tmp56b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_laneq_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x2_t _tmp0 = vmla_lane_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), vget_low_f32(_coeffs), 0);\n#endif\n                float32x2_t _tmp1 = vadd_f32(_tmp12a, _tmp12b);\n                float32x2_t _tmp2 = vsub_f32(_tmp12a, _tmp12b);\n                float32x2_t _tmp3 = vadd_f32(_tmp34a, _tmp34b);\n                float32x2_t _tmp4 = vsub_f32(_tmp34a, _tmp34b);\n                float32x2_t _tmp5 = vadd_f32(_tmp56a, _tmp56b);\n                float32x2_t _tmp6 = vsub_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x2_t _tmp7 = vfma_laneq_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x2_t _tmp7 = vmla_lane_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1_f32(p0, 
_tmp0);\n                vst1_f32(p1, _tmp1);\n                vst1_f32(p2, _tmp2);\n                vst1_f32(p3, _tmp3);\n                vst1_f32(p4, _tmp4);\n                vst1_f32(p5, _tmp5);\n                vst1_f32(p6, _tmp6);\n                vst1_f32(p7, _tmp7);\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n                float r60 = tmp[m][6][0];\n                float r61 = tmp[m][6][1];\n                float r70 = tmp[m][7][0];\n                float r71 = tmp[m][7][1];\n\n                float tmp12a0 = r20 + r60 - r40 * 4.25f;\n                float tmp12a1 = r21 + r61 - r41 * 4.25f;\n                float tmp12b0 = r10 + r50 - r30 * 4.25f;\n                float tmp12b1 = r11 + r51 - r31 * 4.25f;\n                float tmp34a0 = r60 + r20 * 0.25f - r40 * 1.25f;\n                float tmp34a1 = r61 + r21 * 0.25f - r41 * 1.25f;\n                float tmp34b0 = r10 * 0.5f - r30 * 2.5f + r50 * 2.f;\n                float tmp34b1 = r11 * 0.5f - r31 * 2.5f + r51 * 2.f;\n                float tmp56a0 = r20 * 4.f - r40 * 5.f + r60;\n                float tmp56a1 = r21 * 4.f - r41 * 5.f + r61;\n                float tmp56b0 = r10 * 2.f - r30 * 2.5f + r50 * 0.5f;\n                float tmp56b1 = r11 * 2.f - r31 * 2.5f + r51 * 0.5f;\n\n                p0[0] = r00 - r60 + (r40 - r20) * 5.25f;\n                p0[1] = r01 - r61 + (r41 - r21) * 5.25f;\n                p1[0] = tmp12a0 + tmp12b0;\n                p1[1] = tmp12a1 + tmp12b1;\n                p2[0] = tmp12a0 - tmp12b0;\n          
      p2[1] = tmp12a1 - tmp12b1;\n                p3[0] = tmp34a0 + tmp34b0;\n                p3[1] = tmp34a1 + tmp34b1;\n                p4[0] = tmp34a0 - tmp34b0;\n                p4[1] = tmp34a1 - tmp34b1;\n                p5[0] = tmp56a0 + tmp56b0;\n                p5[1] = tmp56a1 + tmp56b1;\n                p6[0] = tmp56a0 - tmp56b0;\n                p6[1] = tmp56a1 - tmp56b1;\n                p7[0] = r70 - r10 + (r30 - r50) * 5.25f;\n                p7[1] = r71 - r11 + (r31 - r51) * 5.25f;\n#endif\n\n                p0 += max_jj * 8 * 2;\n                p1 += max_jj * 8 * 2;\n                p2 += max_jj * 8 * 2;\n                p3 += max_jj * 8 * 2;\n                p4 += max_jj * 8 * 2;\n                p5 += max_jj * 8 * 2;\n                p6 += max_jj * 8 * 2;\n                p7 += max_jj * 8 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        float tmp[8][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0123 = bottom_blob.channel(k + kk).row(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 8; m++)\n            {\n                float r0 = 0.f;\n                float r1 = 0.f;\n                float r2 = 0.f;\n                float r3 = 0.f;\n                float r4 = 0.f;\n                float r5 = 0.f;\n                float r6 = 0.f;\n                float r7 = 0.f;\n\n                if (ti * 6 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 6 + 1 < w) r1 = r0123[1];\n                        if (tj * 6 + 2 < w) r2 = r0123[2];\n                        if (tj * 6 + 3 < w) r3 = r0123[3];\n                        if (tj * 6 + 4 < w) r4 = r0123[4];\n                        if (tj * 6 + 5 < 
w) r5 = r0123[5];\n                        if (tj * 6 + 6 < w) r6 = r0123[6];\n                        if (tj * 6 + 7 < w) r7 = r0123[7];\n                    }\n                }\n\n                float tmp12a = r2 + r6 - r4 * 4.25f;\n                float tmp12b = r1 + r5 - r3 * 4.25f;\n                float tmp34a = r6 + r2 * 0.25f - r4 * 1.25f;\n                float tmp34b = r1 * 0.5f - r3 * 2.5f + r5 * 2.f;\n                float tmp56a = r2 * 4.f - r4 * 5.f + r6;\n                float tmp56b = r1 * 2.f - r3 * 2.5f + r5 * 0.5f;\n\n                tmp[0][m] = r0 - r6 + (r4 - r2) * 5.25f;\n                tmp[1][m] = tmp12a + tmp12b;\n                tmp[2][m] = tmp12a - tmp12b;\n                tmp[3][m] = tmp34a + tmp34b;\n                tmp[4][m] = tmp34a - tmp34b;\n                tmp[5][m] = tmp56a + tmp56b;\n                tmp[6][m] = tmp56a - tmp56b;\n                tmp[7][m] = r7 - r1 + (r3 - r5) * 5.25f;\n\n                r0123 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj;\n            float* p1 = p0 + max_jj;\n            float* p2 = p0 + max_jj * 2;\n            float* p3 = p0 + max_jj * 3;\n            float* p4 = p0 + max_jj * 4;\n            float* p5 = p0 + max_jj * 5;\n            float* p6 = p0 + max_jj * 6;\n            float* p7 = p0 + max_jj * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n                float r6 = tmp[m][6];\n                float r7 = tmp[m][7];\n\n                float tmp12a = r2 + r6 - r4 * 4.25f;\n                float tmp12b = r1 + r5 - r3 * 4.25f;\n                float tmp34a = r6 + r2 * 0.25f - r4 * 1.25f;\n                float tmp34b = r1 * 0.5f - r3 * 2.5f + r5 * 2.f;\n                float tmp56a = r2 * 4.f - r4 * 5.f + 
r6;\n                float tmp56b = r1 * 2.f - r3 * 2.5f + r5 * 0.5f;\n\n                p0[0] = r0 - r6 + (r4 - r2) * 5.25f;\n                p1[0] = tmp12a + tmp12b;\n                p2[0] = tmp12a - tmp12b;\n                p3[0] = tmp34a + tmp34b;\n                p4[0] = tmp34a - tmp34b;\n                p5[0] = tmp56a + tmp56b;\n                p6[0] = tmp56a - tmp56b;\n                p7[0] = r7 - r1 + (r3 - r5) * 5.25f;\n\n                p0 += max_jj * 8;\n                p1 += max_jj * 8;\n                p2 += max_jj * 8;\n                p3 += max_jj * 8;\n                p4 += max_jj * 8;\n                p5 += max_jj * 8;\n                p6 += max_jj * 8;\n                p7 += max_jj * 8;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd63_transform_output_tile(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    // const float otm[6][8] = {\n    //     {1.0f, 1.0f,  1.0f,  1.0f,  1.0f, 32.0f, 32.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f,  2.0f, -2.0f, 16.0f,-16.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f,  4.0f,  4.0f,  8.0f,  8.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f,  8.0f, -8.0f,  4.0f, -4.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f, 16.0f, 16.0f,  2.0f,  2.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 32.0f,-32.0f,  1.0f, -1.0f, 1.0f}\n    // };\n\n#if __ARM_NEON\n    const float coeffs[4] = {32.f, 16.f, 8.f, 4.f};\n    float32x4_t _coeffs = vld1q_f32(coeffs);\n    float32x2_t _v2 = vdup_n_f32(2.f);\n#endif\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 5) / 6;\n\n    const float* biasptr = bias;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = biasptr ? 
vld1q_f32(biasptr + i + ii + 4) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][8][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj * 8;\n            const float* r1 = r0 + max_jj * 8;\n            const float* r2 = r0 + max_jj * 8 * 2;\n            const float* r3 = r0 + max_jj * 8 * 3;\n            const float* r4 = r0 + max_jj * 8 * 4;\n            const float* r5 = r0 + max_jj * 8 * 5;\n            const float* r6 = r0 + max_jj * 8 * 6;\n            const float* r7 = r0 + max_jj * 8 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r50 = vld1q_f32(r5);\n                float32x4_t _r51 = vld1q_f32(r5 + 4);\n                float32x4_t _r60 = vld1q_f32(r6);\n                float32x4_t _r61 = vld1q_f32(r6 + 4);\n                float32x4_t _r70 = vld1q_f32(r7);\n                float32x4_t _r71 = vld1q_f32(r7 + 4);\n\n                float32x4_t _tmp024a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp024a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp135a0 = vsubq_f32(_r10, _r20);\n                float32x4_t _tmp135a1 = vsubq_f32(_r11, _r21);\n                float32x4_t _tmp024b0 = 
vaddq_f32(_r30, _r40);\n                float32x4_t _tmp024b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp135b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp135b1 = vsubq_f32(_r31, _r41);\n                float32x4_t _tmp024c0 = vaddq_f32(_r50, _r60);\n                float32x4_t _tmp024c1 = vaddq_f32(_r51, _r61);\n                float32x4_t _tmp135c0 = vsubq_f32(_r50, _r60);\n                float32x4_t _tmp135c1 = vsubq_f32(_r51, _r61);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _tmp024a0), vfmaq_laneq_f32(_tmp024b0, _tmp024c0, _coeffs, 0));\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _tmp024a1), vfmaq_laneq_f32(_tmp024b1, _tmp024c1, _coeffs, 0));\n                float32x4_t _tmp10 = vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a0, _tmp135b0, _v2, 0), _tmp135c0, _coeffs, 1);\n                float32x4_t _tmp11 = vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a1, _tmp135b1, _v2, 0), _tmp135c1, _coeffs, 1);\n                float32x4_t _tmp20 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 3), _tmp024c0, _coeffs, 2);\n                float32x4_t _tmp21 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 3), _tmp024c1, _coeffs, 2);\n                float32x4_t _tmp30 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a0, _tmp135b0, _coeffs, 2), _tmp135c0, _coeffs, 3);\n                float32x4_t _tmp31 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a1, _tmp135b1, _coeffs, 2), _tmp135c1, _coeffs, 3);\n                float32x4_t _tmp40 = vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 1), _tmp024c0, _v2, 0);\n                float32x4_t _tmp41 = vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 1), _tmp024c1, _v2, 0);\n                float32x4_t _tmp50 = vaddq_f32(vaddq_f32(_r70, _tmp135a0), vfmaq_laneq_f32(_tmp135c0, _tmp135b0, _coeffs, 0));\n                float32x4_t _tmp51 = vaddq_f32(vaddq_f32(_r71, _tmp135a1), vfmaq_laneq_f32(_tmp135c1, _tmp135b1, _coeffs, 
0));\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n                vst1q_f32(tmp[4][m], _tmp40);\n                vst1q_f32(tmp[4][m] + 4, _tmp41);\n                vst1q_f32(tmp[5][m], _tmp50);\n                vst1q_f32(tmp[5][m] + 4, _tmp51);\n\n                r0 += max_jj * 8 * 8;\n                r1 += max_jj * 8 * 8;\n                r2 += max_jj * 8 * 8;\n                r3 += max_jj * 8 * 8;\n                r4 += max_jj * 8 * 8;\n                r5 += max_jj * 8 * 8;\n                r6 += max_jj * 8 * 8;\n                r7 += max_jj * 8 * 8;\n            }\n\n            float* outptr0 = top_blob.channel((i + ii) / out_elempack).row(ti * 6) + (tj * 6) * out_elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n                float32x4_t _r60 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r61 = 
vld1q_f32(tmp[m][6] + 4);\n                float32x4_t _r70 = vld1q_f32(tmp[m][7]);\n                float32x4_t _r71 = vld1q_f32(tmp[m][7] + 4);\n\n                float32x4_t _tmp024a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp024a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp135a0 = vsubq_f32(_r10, _r20);\n                float32x4_t _tmp135a1 = vsubq_f32(_r11, _r21);\n                float32x4_t _tmp024b0 = vaddq_f32(_r30, _r40);\n                float32x4_t _tmp024b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp135b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp135b1 = vsubq_f32(_r31, _r41);\n                float32x4_t _tmp024c0 = vaddq_f32(_r50, _r60);\n                float32x4_t _tmp024c1 = vaddq_f32(_r51, _r61);\n                float32x4_t _tmp135c0 = vsubq_f32(_r50, _r60);\n                float32x4_t _tmp135c1 = vsubq_f32(_r51, _r61);\n\n                float32x4_t _tmp00 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r00, _tmp024a0), vfmaq_laneq_f32(_tmp024b0, _tmp024c0, _coeffs, 0)));\n                float32x4_t _tmp01 = vaddq_f32(_bias1, vaddq_f32(vaddq_f32(_r01, _tmp024a1), vfmaq_laneq_f32(_tmp024b1, _tmp024c1, _coeffs, 0)));\n                float32x4_t _tmp10 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a0, _tmp135b0, _v2, 0), _tmp135c0, _coeffs, 1));\n                float32x4_t _tmp11 = vaddq_f32(_bias1, vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a1, _tmp135b1, _v2, 0), _tmp135c1, _coeffs, 1));\n                float32x4_t _tmp20 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 3), _tmp024c0, _coeffs, 2));\n                float32x4_t _tmp21 = vaddq_f32(_bias1, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 3), _tmp024c1, _coeffs, 2));\n                float32x4_t _tmp30 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a0, _tmp135b0, _coeffs, 2), _tmp135c0, _coeffs, 3));\n                float32x4_t _tmp31 = vaddq_f32(_bias1, 
vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a1, _tmp135b1, _coeffs, 2), _tmp135c1, _coeffs, 3));\n                float32x4_t _tmp40 = vaddq_f32(_bias0, vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 1), _tmp024c0, _v2, 0));\n                float32x4_t _tmp41 = vaddq_f32(_bias1, vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 1), _tmp024c1, _v2, 0));\n                float32x4_t _tmp50 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r70, _tmp135a0), vfmaq_laneq_f32(_tmp135c0, _tmp135b0, _coeffs, 0)));\n                float32x4_t _tmp51 = vaddq_f32(_bias1, vaddq_f32(vaddq_f32(_r71, _tmp135a1), vfmaq_laneq_f32(_tmp135c1, _tmp135b1, _coeffs, 0)));\n\n                if (out_elempack == 4)\n                {\n                    float* outptr1 = outptr0 + N;\n\n                    vst1q_f32(outptr0, _tmp00);\n                    vst1q_f32(outptr1, _tmp01);\n                    if (tj * 6 + 1 < outw)\n                    {\n                        vst1q_f32(outptr0 + 4, _tmp10);\n                        vst1q_f32(outptr1 + 4, _tmp11);\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        vst1q_f32(outptr0 + 8, _tmp20);\n                        vst1q_f32(outptr1 + 8, _tmp21);\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        vst1q_f32(outptr0 + 12, _tmp30);\n                        vst1q_f32(outptr1 + 12, _tmp31);\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        vst1q_f32(outptr0 + 16, _tmp40);\n                        vst1q_f32(outptr1 + 16, _tmp41);\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        vst1q_f32(outptr0 + 20, _tmp50);\n                        vst1q_f32(outptr1 + 20, _tmp51);\n                    }\n                }\n                if (out_elempack == 1)\n                {\n           
         float tmp0[8];\n                    float tmp1[8];\n                    float tmp2[8];\n                    float tmp3[8];\n                    float tmp4[8];\n                    float tmp5[8];\n                    vst1q_f32(tmp0, _tmp00);\n                    vst1q_f32(tmp0 + 4, _tmp01);\n                    vst1q_f32(tmp1, _tmp10);\n                    vst1q_f32(tmp1 + 4, _tmp11);\n                    vst1q_f32(tmp2, _tmp20);\n                    vst1q_f32(tmp2 + 4, _tmp21);\n                    vst1q_f32(tmp3, _tmp30);\n                    vst1q_f32(tmp3 + 4, _tmp31);\n                    vst1q_f32(tmp4, _tmp40);\n                    vst1q_f32(tmp4 + 4, _tmp41);\n                    vst1q_f32(tmp5, _tmp50);\n                    vst1q_f32(tmp5 + 4, _tmp51);\n\n                    float* outptr1 = outptr0 + N;\n                    float* outptr2 = outptr0 + N * 2;\n                    float* outptr3 = outptr0 + N * 3;\n                    float* outptr4 = outptr0 + N * 4;\n                    float* outptr5 = outptr0 + N * 5;\n                    float* outptr6 = outptr0 + N * 6;\n                    float* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                    if 
(tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                        outptr4[2] = tmp2[4];\n                        outptr5[2] = tmp2[5];\n                        outptr6[2] = tmp2[6];\n                        outptr7[2] = tmp2[7];\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                        outptr4[3] = tmp3[4];\n                        outptr5[3] = tmp3[5];\n                        outptr6[3] = tmp3[6];\n                        outptr7[3] = tmp3[7];\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = tmp4[0];\n                        outptr1[4] = tmp4[1];\n                        outptr2[4] = tmp4[2];\n                        outptr3[4] = tmp4[3];\n                        outptr4[4] = tmp4[4];\n                        outptr5[4] = tmp4[5];\n                        outptr6[4] = tmp4[6];\n                        outptr7[4] = tmp4[7];\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp5[0];\n                        outptr1[5] = tmp5[1];\n                        outptr2[5] = tmp5[2];\n                        outptr3[5] = tmp5[3];\n                        outptr4[5] = tmp5[4];\n                        outptr5[5] = tmp5[5];\n                        outptr6[5] = tmp5[6];\n                        outptr7[5] = tmp5[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii 
+= 4)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][8][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj * 4;\n            const float* r1 = r0 + max_jj * 4;\n            const float* r2 = r0 + max_jj * 4 * 2;\n            const float* r3 = r0 + max_jj * 4 * 3;\n            const float* r4 = r0 + max_jj * 4 * 4;\n            const float* r5 = r0 + max_jj * 4 * 5;\n            const float* r6 = r0 + max_jj * 4 * 6;\n            const float* r7 = r0 + max_jj * 4 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _r1 = vld1q_f32(r1);\n                float32x4_t _r2 = vld1q_f32(r2);\n                float32x4_t _r3 = vld1q_f32(r3);\n                float32x4_t _r4 = vld1q_f32(r4);\n                float32x4_t _r5 = vld1q_f32(r5);\n                float32x4_t _r6 = vld1q_f32(r6);\n                float32x4_t _r7 = vld1q_f32(r7);\n\n                float32x4_t _tmp024a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp135a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp024b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp135b = vsubq_f32(_r3, _r4);\n                float32x4_t _tmp024c = vaddq_f32(_r5, _r6);\n                float32x4_t _tmp135c = vsubq_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp024a), vfmaq_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0));\n                float32x4_t _tmp1 = vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, _coeffs, 1);\n                float32x4_t _tmp2 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a, 
_tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2);\n                float32x4_t _tmp3 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3);\n                float32x4_t _tmp4 = vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2, 0);\n                float32x4_t _tmp5 = vaddq_f32(vaddq_f32(_r7, _tmp135a), vfmaq_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0));\n#else\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp024a), vmlaq_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0));\n                float32x4_t _tmp1 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp2 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp3 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1);\n                float32x4_t _tmp4 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2, 0);\n                float32x4_t _tmp5 = vaddq_f32(vaddq_f32(_r7, _tmp135a), vmlaq_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0));\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n                vst1q_f32(tmp[4][m], _tmp4);\n                vst1q_f32(tmp[5][m], _tmp5);\n\n                r0 += max_jj * 8 * 4;\n                r1 += max_jj * 8 * 4;\n                r2 += max_jj * 8 * 4;\n                r3 += max_jj * 8 * 4;\n                r4 += max_jj * 8 * 4;\n                r5 += max_jj * 8 * 4;\n                r6 += max_jj * 8 * 4;\n                r7 += max_jj * 8 * 4;\n            }\n\n            float* outptr0 = top_blob.channel((i + ii) / out_elempack).row(ti * 6) + (tj * 6) * out_elempack;\n\n    
        for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r6 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r7 = vld1q_f32(tmp[m][7]);\n\n                float32x4_t _tmp024a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp135a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp024b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp135b = vsubq_f32(_r3, _r4);\n                float32x4_t _tmp024c = vaddq_f32(_r5, _r6);\n                float32x4_t _tmp135c = vsubq_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x4_t _tmp0 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r0, _tmp024a), vfmaq_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0)));\n                float32x4_t _tmp1 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, _coeffs, 1));\n                float32x4_t _tmp2 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2));\n                float32x4_t _tmp3 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3));\n                float32x4_t _tmp4 = vaddq_f32(_bias0, vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2, 0));\n                float32x4_t _tmp5 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r7, _tmp135a), vfmaq_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0)));\n#else\n                float32x4_t _tmp0 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r0, _tmp024a), vmlaq_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0)));\n                
float32x4_t _tmp1 = vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, vget_low_f32(_coeffs), 1));\n                float32x4_t _tmp2 = vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0));\n                float32x4_t _tmp3 = vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1));\n                float32x4_t _tmp4 = vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2, 0));\n                float32x4_t _tmp5 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r7, _tmp135a), vmlaq_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0)));\n#endif\n\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _tmp0);\n                    if (tj * 6 + 1 < outw) vst1q_f32(outptr0 + 4, _tmp1);\n                    if (tj * 6 + 2 < outw) vst1q_f32(outptr0 + 8, _tmp2);\n                    if (tj * 6 + 3 < outw) vst1q_f32(outptr0 + 12, _tmp3);\n                    if (tj * 6 + 4 < outw) vst1q_f32(outptr0 + 16, _tmp4);\n                    if (tj * 6 + 5 < outw) vst1q_f32(outptr0 + 20, _tmp5);\n                }\n                if (out_elempack == 1)\n                {\n                    float tmp0[4];\n                    float tmp1[4];\n                    float tmp2[4];\n                    float tmp3[4];\n                    float tmp4[4];\n                    float tmp5[4];\n                    vst1q_f32(tmp0, _tmp0);\n                    vst1q_f32(tmp1, _tmp1);\n                    vst1q_f32(tmp2, _tmp2);\n                    vst1q_f32(tmp3, _tmp3);\n                    vst1q_f32(tmp4, _tmp4);\n                    vst1q_f32(tmp5, _tmp5);\n\n                    float* outptr1 = outptr0 + N;\n                    float* outptr2 = outptr0 + N * 2;\n                    float* outptr3 = outptr0 
+ N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = tmp4[0];\n                        outptr1[4] = tmp4[1];\n                        outptr2[4] = tmp4[2];\n                        outptr3[4] = tmp4[3];\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp5[0];\n                        outptr1[5] = tmp5[1];\n                        outptr2[5] = tmp5[2];\n                        outptr3[5] = tmp5[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        float32x2_t _bias0 = biasptr ? vld1_f32(biasptr + i + ii) : vdup_n_f32(0.f);\n#else\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        float bias1 = biasptr ? 
biasptr[i + ii + 1] : 0.f;\n#endif\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[6][8][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj * 2;\n            const float* r1 = r0 + max_jj * 2;\n            const float* r2 = r0 + max_jj * 2 * 2;\n            const float* r3 = r0 + max_jj * 2 * 3;\n            const float* r4 = r0 + max_jj * 2 * 4;\n            const float* r5 = r0 + max_jj * 2 * 5;\n            const float* r6 = r0 + max_jj * 2 * 6;\n            const float* r7 = r0 + max_jj * 2 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(r0);\n                float32x2_t _r1 = vld1_f32(r1);\n                float32x2_t _r2 = vld1_f32(r2);\n                float32x2_t _r3 = vld1_f32(r3);\n                float32x2_t _r4 = vld1_f32(r4);\n                float32x2_t _r5 = vld1_f32(r5);\n                float32x2_t _r6 = vld1_f32(r6);\n                float32x2_t _r7 = vld1_f32(r7);\n\n                float32x2_t _tmp024a = vadd_f32(_r1, _r2);\n                float32x2_t _tmp135a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp024b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp135b = vsub_f32(_r3, _r4);\n                float32x2_t _tmp024c = vadd_f32(_r5, _r6);\n                float32x2_t _tmp135c = vsub_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp024a), vfma_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0));\n                float32x2_t _tmp1 = vfma_laneq_f32(vfma_f32(_tmp135a, _tmp135b, _v2), _tmp135c, _coeffs, 1);\n                float32x2_t _tmp2 = vfma_laneq_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2);\n                float32x2_t _tmp3 = 
vfma_laneq_f32(vfma_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3);\n                float32x2_t _tmp4 = vfma_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2);\n                float32x2_t _tmp5 = vadd_f32(vadd_f32(_r7, _tmp135a), vfma_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0));\n#else\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp024a), vmla_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0));\n                float32x2_t _tmp1 = vmla_lane_f32(vmla_f32(_tmp135a, _tmp135b, _v2), _tmp135c, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp2 = vmla_lane_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp3 = vmla_lane_f32(vmla_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1);\n                float32x2_t _tmp4 = vmla_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2);\n                float32x2_t _tmp5 = vadd_f32(vadd_f32(_r7, _tmp135a), vmla_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0));\n#endif\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n                vst1_f32(tmp[4][m], _tmp4);\n                vst1_f32(tmp[5][m], _tmp5);\n#else\n                float tmp024a0 = r1[0] + r2[0];\n                float tmp024a1 = r1[1] + r2[1];\n                float tmp135a0 = r1[0] - r2[0];\n                float tmp135a1 = r1[1] - r2[1];\n                float tmp024b0 = r3[0] + r4[0];\n                float tmp024b1 = r3[1] + r4[1];\n                float tmp135b0 = r3[0] - r4[0];\n                float tmp135b1 = r3[1] - r4[1];\n                float tmp024c0 = r5[0] + r6[0];\n                float tmp024c1 = r5[1] + r6[1];\n                float tmp135c0 = r5[0] - r6[0];\n                float tmp135c1 = 
r5[1] - r6[1];\n\n                tmp[0][m][0] = r0[0] + tmp024a0 + tmp024b0 + tmp024c0 * 32;\n                tmp[0][m][1] = r0[1] + tmp024a1 + tmp024b1 + tmp024c1 * 32;\n                tmp[1][m][0] = tmp135a0 + tmp135b0 + tmp135b0 + tmp135c0 * 16;\n                tmp[1][m][1] = tmp135a1 + tmp135b1 + tmp135b1 + tmp135c1 * 16;\n                tmp[2][m][0] = tmp024a0 + tmp024b0 * 4 + tmp024c0 * 8;\n                tmp[2][m][1] = tmp024a1 + tmp024b1 * 4 + tmp024c1 * 8;\n                tmp[3][m][0] = tmp135a0 + tmp135b0 * 8 + tmp135c0 * 4;\n                tmp[3][m][1] = tmp135a1 + tmp135b1 * 8 + tmp135c1 * 4;\n                tmp[4][m][0] = tmp024a0 + tmp024b0 * 16 + tmp024c0 + tmp024c0;\n                tmp[4][m][1] = tmp024a1 + tmp024b1 * 16 + tmp024c1 + tmp024c1;\n                tmp[5][m][0] = r7[0] + tmp135a0 + tmp135b0 * 32 + tmp135c0;\n                tmp[5][m][1] = r7[1] + tmp135a1 + tmp135b1 * 32 + tmp135c1;\n#endif\n\n                r0 += max_jj * 8 * 2;\n                r1 += max_jj * 8 * 2;\n                r2 += max_jj * 8 * 2;\n                r3 += max_jj * 8 * 2;\n                r4 += max_jj * 8 * 2;\n                r5 += max_jj * 8 * 2;\n                r6 += max_jj * 8 * 2;\n                r7 += max_jj * 8 * 2;\n            }\n\n            float* outptr0 = top_blob.channel(i + ii).row(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t _r4 = vld1_f32(tmp[m][4]);\n                float32x2_t _r5 = vld1_f32(tmp[m][5]);\n                float32x2_t _r6 = vld1_f32(tmp[m][6]);\n                float32x2_t _r7 = vld1_f32(tmp[m][7]);\n\n                float32x2_t _tmp024a = 
vadd_f32(_r1, _r2);\n                float32x2_t _tmp135a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp024b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp135b = vsub_f32(_r3, _r4);\n                float32x2_t _tmp024c = vadd_f32(_r5, _r6);\n                float32x2_t _tmp135c = vsub_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x2_t _tmp0 = vadd_f32(_bias0, vadd_f32(vadd_f32(_r0, _tmp024a), vfma_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0)));\n                float32x2_t _tmp1 = vadd_f32(_bias0, vfma_laneq_f32(vfma_f32(_tmp135a, _tmp135b, _v2), _tmp135c, _coeffs, 1));\n                float32x2_t _tmp2 = vadd_f32(_bias0, vfma_laneq_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2));\n                float32x2_t _tmp3 = vadd_f32(_bias0, vfma_laneq_f32(vfma_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3));\n                float32x2_t _tmp4 = vadd_f32(_bias0, vfma_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2));\n                float32x2_t _tmp5 = vadd_f32(_bias0, vadd_f32(vadd_f32(_r7, _tmp135a), vfma_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0)));\n#else\n                float32x2_t _tmp0 = vadd_f32(_bias0, vadd_f32(vadd_f32(_r0, _tmp024a), vmla_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0)));\n                float32x2_t _tmp1 = vadd_f32(_bias0, vmla_lane_f32(vmla_f32(_tmp135a, _tmp135b, _v2), _tmp135c, vget_low_f32(_coeffs), 1));\n                float32x2_t _tmp2 = vadd_f32(_bias0, vmla_lane_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0));\n                float32x2_t _tmp3 = vadd_f32(_bias0, vmla_lane_f32(vmla_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1));\n                float32x2_t _tmp4 = vadd_f32(_bias0, vmla_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2));\n                float32x2_t _tmp5 = vadd_f32(_bias0, 
vadd_f32(vadd_f32(_r7, _tmp135a), vmla_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0)));\n#endif\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n                float r60 = tmp[m][6][0];\n                float r61 = tmp[m][6][1];\n                float r70 = tmp[m][7][0];\n                float r71 = tmp[m][7][1];\n\n                float tmp024a0 = r10 + r20;\n                float tmp024a1 = r11 + r21;\n                float tmp135a0 = r10 - r20;\n                float tmp135a1 = r11 - r21;\n                float tmp024b0 = r30 + r40;\n                float tmp024b1 = r31 + r41;\n                float tmp135b0 = r30 - r40;\n                float tmp135b1 = r31 - r41;\n                float tmp024c0 = r50 + r60;\n                float tmp024c1 = r51 + r61;\n                float tmp135c0 = r50 - r60;\n                float tmp135c1 = r51 - r61;\n\n                float tmp00 = bias0 + r00 + tmp024a0 + tmp024b0 + tmp024c0 * 32;\n                float tmp01 = bias1 + r01 + tmp024a1 + tmp024b1 + tmp024c1 * 32;\n                float tmp10 = bias0 + tmp135a0 + tmp135b0 + tmp135b0 + tmp135c0 * 16;\n                float tmp11 = bias1 + tmp135a1 + tmp135b1 + tmp135b1 + tmp135c1 * 16;\n                float tmp20 = bias0 + tmp024a0 + tmp024b0 * 4 + tmp024c0 * 8;\n                float tmp21 = bias1 + tmp024a1 + tmp024b1 * 4 + tmp024c1 * 8;\n                float tmp30 = bias0 + tmp135a0 + tmp135b0 * 8 + tmp135c0 * 4;\n                float tmp31 = bias1 + tmp135a1 + tmp135b1 * 8 + tmp135c1 * 4;\n 
               float tmp40 = bias0 + tmp024a0 + tmp024b0 * 16 + tmp024c0 + tmp024c0;\n                float tmp41 = bias1 + tmp024a1 + tmp024b1 * 16 + tmp024c1 + tmp024c1;\n                float tmp50 = bias0 + r70 + tmp135a0 + tmp135b0 * 32 + tmp135c0;\n                float tmp51 = bias1 + r71 + tmp135a1 + tmp135b1 * 32 + tmp135c1;\n#endif\n\n                // if (out_elempack == 1)\n                {\n                    float* outptr1 = outptr0 + N;\n\n#if __ARM_NEON\n                    outptr0[0] = vget_lane_f32(_tmp0, 0);\n                    outptr1[0] = vget_lane_f32(_tmp0, 1);\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = vget_lane_f32(_tmp1, 0);\n                        outptr1[1] = vget_lane_f32(_tmp1, 1);\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = vget_lane_f32(_tmp2, 0);\n                        outptr1[2] = vget_lane_f32(_tmp2, 1);\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = vget_lane_f32(_tmp3, 0);\n                        outptr1[3] = vget_lane_f32(_tmp3, 1);\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = vget_lane_f32(_tmp4, 0);\n                        outptr1[4] = vget_lane_f32(_tmp4, 1);\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = vget_lane_f32(_tmp5, 0);\n                        outptr1[5] = vget_lane_f32(_tmp5, 1);\n                    }\n#else\n                    outptr0[0] = tmp00;\n                    outptr1[0] = tmp01;\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] = tmp11;\n                    }\n                    if (tj * 6 + 2 < outw)\n           
         {\n                        outptr0[2] = tmp20;\n                        outptr1[2] = tmp21;\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp30;\n                        outptr1[3] = tmp31;\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = tmp40;\n                        outptr1[4] = tmp41;\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp50;\n                        outptr1[5] = tmp51;\n                    }\n#endif\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n\n        float tmp[6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj;\n            const float* r1 = r0 + max_jj;\n            const float* r2 = r0 + max_jj * 2;\n            const float* r3 = r0 + max_jj * 3;\n            const float* r4 = r0 + max_jj * 4;\n            const float* r5 = r0 + max_jj * 5;\n            const float* r6 = r0 + max_jj * 6;\n            const float* r7 = r0 + max_jj * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float tmp024a = r1[0] + r2[0];\n                float tmp135a = r1[0] - r2[0];\n                float tmp024b = r3[0] + r4[0];\n                float tmp135b = r3[0] - r4[0];\n                float tmp024c = r5[0] + r6[0];\n                float tmp135c = r5[0] - r6[0];\n\n                tmp[0][m] = r0[0] + tmp024a + tmp024b + tmp024c * 32;\n                tmp[1][m] = tmp135a + tmp135b + tmp135b + tmp135c * 16;\n                tmp[2][m] = tmp024a + tmp024b * 4 + 
tmp024c * 8;\n                tmp[3][m] = tmp135a + tmp135b * 8 + tmp135c * 4;\n                tmp[4][m] = tmp024a + tmp024b * 16 + tmp024c + tmp024c;\n                tmp[5][m] = r7[0] + tmp135a + tmp135b * 32 + tmp135c;\n\n                r0 += max_jj * 8;\n                r1 += max_jj * 8;\n                r2 += max_jj * 8;\n                r3 += max_jj * 8;\n                r4 += max_jj * 8;\n                r5 += max_jj * 8;\n                r6 += max_jj * 8;\n                r7 += max_jj * 8;\n            }\n\n            float* outptr0 = top_blob.channel(i + ii).row(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n                float r6 = tmp[m][6];\n                float r7 = tmp[m][7];\n\n                float tmp024a = r1 + r2;\n                float tmp135a = r1 - r2;\n                float tmp024b = r3 + r4;\n                float tmp135b = r3 - r4;\n                float tmp024c = r5 + r6;\n                float tmp135c = r5 - r6;\n\n                float tmp0 = bias0 + r0 + tmp024a + tmp024b + tmp024c * 32;\n                float tmp1 = bias0 + tmp135a + tmp135b + tmp135b + tmp135c * 16;\n                float tmp2 = bias0 + tmp024a + tmp024b * 4 + tmp024c * 8;\n                float tmp3 = bias0 + tmp135a + tmp135b * 8 + tmp135c * 4;\n                float tmp4 = bias0 + tmp024a + tmp024b * 16 + tmp024c + tmp024c;\n                float tmp5 = bias0 + r7 + tmp135a + tmp135b * 32 + tmp135c;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 6 + 1 < outw) outptr0[1] = tmp1;\n                    if (tj * 6 + 2 < outw) outptr0[2] = 
tmp2;\n                    if (tj * 6 + 3 < outw) outptr0[3] = tmp3;\n                    if (tj * 6 + 4 < outw) outptr0[4] = tmp4;\n                    if (tj * 6 + 5 < outw) outptr0[5] = tmp5;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd63(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 6n+2, winograd F(6,3)\n    int w_tiles = (outw + 5) / 6;\n    int h_tiles = (outh + 5) / 6;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 64;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd63 %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 4u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            
conv3x3s1_winograd63_transform_input_tile(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 4u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd63_transform_input_tile(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / 
TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd63_transform_output_tile(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_winograd_bf16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic inline void conv3x3s1_winograd23_transform_input_tile_bf16s(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    // const float itm[4][4] = {\n    //     {1.0f,  0.0f, -1.0f,  0.0f},\n    //     {0.0f,  1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  0.00f, 1.0f}\n    // };\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w - 1) / 2;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0 = bottom_blob.channel((k + kk) / elempack).row<const unsigned short>(ti * 2) + (tj * 2) * elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r00 = vdupq_n_f32(0.f);\n                float32x4_t _r01 = vdupq_n_f32(0.f);\n                float32x4_t _r10 = vdupq_n_f32(0.f);\n                float32x4_t _r11 = vdupq_n_f32(0.f);\n                float32x4_t _r20 = vdupq_n_f32(0.f);\n                float32x4_t _r21 = vdupq_n_f32(0.f);\n                float32x4_t _r30 = vdupq_n_f32(0.f);\n                float32x4_t _r31 = vdupq_n_f32(0.f);\n\n                if (ti * 2 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n     
                   const unsigned short* r1 = r0 + N;\n\n                        _r00 = bfloat2float(vld1_u16(r0));\n                        _r01 = bfloat2float(vld1_u16(r1));\n                        if (tj * 2 + 1 < w)\n                        {\n                            _r10 = bfloat2float(vld1_u16(r0 + 4));\n                            _r11 = bfloat2float(vld1_u16(r1 + 4));\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            _r20 = bfloat2float(vld1_u16(r0 + 8));\n                            _r21 = bfloat2float(vld1_u16(r1 + 8));\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            _r30 = bfloat2float(vld1_u16(r0 + 12));\n                            _r31 = bfloat2float(vld1_u16(r1 + 12));\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n                        const unsigned short* r2 = r0 + N * 2;\n                        const unsigned short* r3 = r0 + N * 3;\n                        const unsigned short* r4 = r0 + N * 4;\n                        const unsigned short* r5 = r0 + N * 5;\n                        const unsigned short* r6 = r0 + N * 6;\n                        const unsigned short* r7 = r0 + N * 7;\n\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4_t _t2 = vld1_u16(r2);\n                        uint16x4_t _t3 = vld1_u16(r3);\n                        uint16x4_t _t4 = vld1_u16(r4);\n                        uint16x4_t _t5 = vld1_u16(r5);\n                        uint16x4_t _t6 = vld1_u16(r6);\n                        uint16x4_t _t7 = vld1_u16(r7);\n\n                        transpose4x4_u16(_t0, _t1, _t2, _t3);\n                        transpose4x4_u16(_t4, _t5, _t6, _t7);\n\n      
                  _r00 = bfloat2float(_t0);\n                        _r01 = bfloat2float(_t4);\n                        if (tj * 2 + 1 < w)\n                        {\n                            _r10 = bfloat2float(_t1);\n                            _r11 = bfloat2float(_t5);\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            _r20 = bfloat2float(_t2);\n                            _r21 = bfloat2float(_t6);\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            _r30 = bfloat2float(_t3);\n                            _r31 = bfloat2float(_t7);\n                        }\n                    }\n                }\n\n                float32x4_t _tmp00 = vsubq_f32(_r00, _r20);\n                float32x4_t _tmp01 = vsubq_f32(_r01, _r21);\n                float32x4_t _tmp10 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp11 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp20 = vsubq_f32(_r20, _r10);\n                float32x4_t _tmp21 = vsubq_f32(_r21, _r11);\n                float32x4_t _tmp30 = vsubq_f32(_r30, _r10);\n                float32x4_t _tmp31 = vsubq_f32(_r31, _r11);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj * 8;\n            float* p1 = p0 + max_jj * 8;\n            float* p2 = p0 + max_jj * 8 * 2;\n            float* p3 = p0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r00 = 
vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n\n                float32x4_t _tmp00 = vsubq_f32(_r00, _r20);\n                float32x4_t _tmp01 = vsubq_f32(_r01, _r21);\n                float32x4_t _tmp10 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp11 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp20 = vsubq_f32(_r20, _r10);\n                float32x4_t _tmp21 = vsubq_f32(_r21, _r11);\n                float32x4_t _tmp30 = vsubq_f32(_r30, _r10);\n                float32x4_t _tmp31 = vsubq_f32(_r31, _r11);\n\n                vst1q_f32(p0, _tmp00);\n                vst1q_f32(p0 + 4, _tmp01);\n                vst1q_f32(p1, _tmp10);\n                vst1q_f32(p1 + 4, _tmp11);\n                vst1q_f32(p2, _tmp20);\n                vst1q_f32(p2 + 4, _tmp21);\n                vst1q_f32(p3, _tmp30);\n                vst1q_f32(p3 + 4, _tmp31);\n\n                p0 += max_jj * 4 * 8;\n                p1 += max_jj * 4 * 8;\n                p2 += max_jj * 4 * 8;\n                p3 += max_jj * 4 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n#else // __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    #pragma omp parallel for num_threads(nT)\n#endif // __aarch64__\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; 
jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0 = bottom_blob.channel((k + kk) / elempack).row<const unsigned short>(ti * 2) + (tj * 2) * elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r0 = vdupq_n_f32(0.f);\n                float32x4_t _r1 = vdupq_n_f32(0.f);\n                float32x4_t _r2 = vdupq_n_f32(0.f);\n                float32x4_t _r3 = vdupq_n_f32(0.f);\n\n                if (ti * 2 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        if (tj * 2 + 1 < w) _r1 = bfloat2float(vld1_u16(r0 + 4));\n                        if (tj * 2 + 2 < w) _r2 = bfloat2float(vld1_u16(r0 + 8));\n                        if (tj * 2 + 3 < w) _r3 = bfloat2float(vld1_u16(r0 + 12));\n                    }\n                    if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n                        const unsigned short* r2 = r0 + N * 2;\n                        const unsigned short* r3 = r0 + N * 3;\n\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4_t _t2 = vld1_u16(r2);\n                        uint16x4_t _t3 = vld1_u16(r3);\n\n                        transpose4x4_u16(_t0, _t1, _t2, _t3);\n\n                        _r0 = bfloat2float(_t0);\n                        if (tj * 2 + 1 < w) _r1 = bfloat2float(_t1);\n                        if (tj * 2 + 2 < w) _r2 = bfloat2float(_t2);\n                        if (tj * 2 + 3 < w) _r3 = bfloat2float(_t3);\n                    }\n                }\n\n                float32x4_t _tmp0 = vsubq_f32(_r0, _r2);\n                float32x4_t _tmp1 = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp2 = vsubq_f32(_r2, _r1);\n             
   float32x4_t _tmp3 = vsubq_f32(_r3, _r1);\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj * 4;\n            float* p1 = p0 + max_jj * 4;\n            float* p2 = p0 + max_jj * 4 * 2;\n            float* p3 = p0 + max_jj * 4 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n\n                float32x4_t _tmp0 = vsubq_f32(_r0, _r2);\n                float32x4_t _tmp1 = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp2 = vsubq_f32(_r2, _r1);\n                float32x4_t _tmp3 = vsubq_f32(_r3, _r1);\n\n                vst1q_f32(p0, _tmp0);\n                vst1q_f32(p1, _tmp1);\n                vst1q_f32(p2, _tmp2);\n                vst1q_f32(p3, _tmp3);\n\n                p0 += max_jj * 4 * 4;\n                p1 += max_jj * 4 * 4;\n                p2 += max_jj * 4 * 4;\n                p3 += max_jj * 4 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[4][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n          
  const unsigned short* r0 = bottom_blob.channel(k + kk).row<const unsigned short>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vdup_n_f32(0.f);\n                float32x2_t _r1 = vdup_n_f32(0.f);\n                float32x2_t _r2 = vdup_n_f32(0.f);\n                float32x2_t _r3 = vdup_n_f32(0.f);\n#else\n                float r00 = 0.f;\n                float r01 = 0.f;\n                float r10 = 0.f;\n                float r11 = 0.f;\n                float r20 = 0.f;\n                float r21 = 0.f;\n                float r30 = 0.f;\n                float r31 = 0.f;\n#endif\n\n                if (ti * 2 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n\n#if __ARM_NEON\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4x2_t _t01 = vzip_u16(_t0, _t1);\n                        float32x4_t _t0_fp32 = bfloat2float(_t01.val[0]);\n                        float32x4_t _t1_fp32 = bfloat2float(_t01.val[1]);\n\n                        _r0 = vget_low_f32(_t0_fp32);\n                        if (tj * 2 + 1 < w) _r1 = vget_high_f32(_t0_fp32);\n                        if (tj * 2 + 2 < w) _r2 = vget_low_f32(_t1_fp32);\n                        if (tj * 2 + 3 < w) _r3 = vget_high_f32(_t1_fp32);\n#else\n                        r00 = bfloat16_to_float32(r0[0]);\n                        r01 = bfloat16_to_float32(r1[0]);\n                        if (tj * 2 + 1 < w)\n                        {\n                            r10 = bfloat16_to_float32(r0[1]);\n                            r11 = bfloat16_to_float32(r1[1]);\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            r20 = bfloat16_to_float32(r0[2]);\n                            
r21 = bfloat16_to_float32(r1[2]);\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            r30 = bfloat16_to_float32(r0[3]);\n                            r31 = bfloat16_to_float32(r1[3]);\n                        }\n#endif\n                    }\n                }\n\n#if __ARM_NEON\n                float32x2_t _tmp0 = vsub_f32(_r0, _r2);\n                float32x2_t _tmp1 = vadd_f32(_r1, _r2);\n                float32x2_t _tmp2 = vsub_f32(_r2, _r1);\n                float32x2_t _tmp3 = vsub_f32(_r3, _r1);\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n#else\n                tmp[0][m][0] = r00 - r20;\n                tmp[0][m][1] = r01 - r21;\n                tmp[1][m][0] = r10 + r20;\n                tmp[1][m][1] = r11 + r21;\n                tmp[2][m][0] = r20 - r10;\n                tmp[2][m][1] = r21 - r11;\n                tmp[3][m][0] = r30 - r10;\n                tmp[3][m][1] = r31 - r11;\n#endif\n\n                r0 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj * 2;\n            float* p1 = p0 + max_jj * 2;\n            float* p2 = p0 + max_jj * 2 * 2;\n            float* p3 = p0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n\n                float32x2_t _tmp0 = vsub_f32(_r0, _r2);\n                float32x2_t _tmp1 = vadd_f32(_r1, _r2);\n                float32x2_t _tmp2 = vsub_f32(_r2, _r1);\n                float32x2_t _tmp3 = vsub_f32(_r3, _r1);\n\n                vst1_f32(p0, _tmp0);\n                vst1_f32(p1, _tmp1);\n                
vst1_f32(p2, _tmp2);\n                vst1_f32(p3, _tmp3);\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n\n                p0[0] = r00 - r20;\n                p0[1] = r01 - r21;\n                p1[0] = r10 + r20;\n                p1[1] = r11 + r21;\n                p2[0] = r20 - r10;\n                p2[1] = r21 - r11;\n                p3[0] = r30 - r10;\n                p3[1] = r31 - r11;\n#endif\n\n                p0 += max_jj * 4 * 2;\n                p1 += max_jj * 4 * 2;\n                p2 += max_jj * 4 * 2;\n                p3 += max_jj * 4 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        float tmp[4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0123 = bottom_blob.channel(k + kk).row<const unsigned short>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n                float r0 = 0.f;\n                float r1 = 0.f;\n                float r2 = 0.f;\n                float r3 = 0.f;\n\n                if (ti * 2 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = bfloat16_to_float32(r0123[0]);\n                        if (tj * 2 + 1 < w) r1 = bfloat16_to_float32(r0123[1]);\n                        if (tj * 2 + 2 < w) r2 = bfloat16_to_float32(r0123[2]);\n                        if (tj * 2 + 3 < w) r3 = bfloat16_to_float32(r0123[3]);\n                    }\n                }\n\n                tmp[0][m] = r0 
- r2;\n                tmp[1][m] = r1 + r2;\n                tmp[2][m] = r2 - r1;\n                tmp[3][m] = r3 - r1;\n\n                r0123 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 16 + jj;\n            float* p1 = p0 + max_jj;\n            float* p2 = p0 + max_jj * 2;\n            float* p3 = p0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n\n                p0[0] = r0 - r2;\n                p1[0] = r1 + r2;\n                p2[0] = r2 - r1;\n                p3[0] = r3 - r1;\n\n                p0 += max_jj * 4;\n                p1 += max_jj * 4;\n                p2 += max_jj * 4;\n                p3 += max_jj * 4;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_output_tile_bf16s(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    // const float otm[2][4] = {\n    //     {1.0f,  1.0f,  1.0f,  0.0f},\n    //     {0.0f,  1.0f, -1.0f,  1.0f}\n    // };\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 1) / 2;\n\n    const float* biasptr = bias;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = biasptr ? 
vld1q_f32(biasptr + i + ii + 4) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[2][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj * 8;\n            const float* r1 = r0 + max_jj * 8;\n            const float* r2 = r0 + max_jj * 8 * 2;\n            const float* r3 = r0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _r10), _r20);\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _r11), _r21);\n                float32x4_t _tmp10 = vaddq_f32(vsubq_f32(_r10, _r20), _r30);\n                float32x4_t _tmp11 = vaddq_f32(vsubq_f32(_r11, _r21), _r31);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n\n                r0 += max_jj * 4 * 8;\n                r1 += max_jj * 4 * 8;\n                r2 += max_jj * 4 * 8;\n                r3 += max_jj * 4 * 8;\n            }\n\n            unsigned short* outptr0 = top_blob.channel((i + ii) / out_elempack).row<unsigned short>(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    
continue;\n\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n\n                float32x4_t _tmp00 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r00, _r10), _r20));\n                float32x4_t _tmp01 = vaddq_f32(_bias1, vaddq_f32(vaddq_f32(_r01, _r11), _r21));\n                float32x4_t _tmp10 = vaddq_f32(_bias0, vaddq_f32(vsubq_f32(_r10, _r20), _r30));\n                float32x4_t _tmp11 = vaddq_f32(_bias1, vaddq_f32(vsubq_f32(_r11, _r21), _r31));\n\n                if (out_elempack == 4)\n                {\n                    unsigned short* outptr1 = outptr0 + N;\n\n                    vst1_u16(outptr0, float2bfloat(_tmp00));\n                    vst1_u16(outptr1, float2bfloat(_tmp01));\n                    if (tj * 2 + 1 < outw)\n                    {\n                        vst1_u16(outptr0 + 4, float2bfloat(_tmp10));\n                        vst1_u16(outptr1 + 4, float2bfloat(_tmp11));\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short tmp0[8];\n                    unsigned short tmp1[8];\n                    vst1_u16(tmp0, float2bfloat(_tmp00));\n                    vst1_u16(tmp0 + 4, float2bfloat(_tmp01));\n                    vst1_u16(tmp1, float2bfloat(_tmp10));\n                    vst1_u16(tmp1 + 4, float2bfloat(_tmp11));\n\n                    unsigned short* outptr1 = outptr0 + N;\n                    unsigned short* outptr2 = outptr0 + N * 2;\n                    unsigned short* outptr3 = outptr0 + N * 3;\n                    unsigned short* 
outptr4 = outptr0 + N * 4;\n                    unsigned short* outptr5 = outptr0 + N * 5;\n                    unsigned short* outptr6 = outptr0 + N * 6;\n                    unsigned short* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float32x4_t _bias0 = biasptr ? 
vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[2][4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj * 4;\n            const float* r1 = r0 + max_jj * 4;\n            const float* r2 = r0 + max_jj * 4 * 2;\n            const float* r3 = r0 + max_jj * 4 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _r1 = vld1q_f32(r1);\n                float32x4_t _r2 = vld1q_f32(r2);\n                float32x4_t _r3 = vld1q_f32(r3);\n\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _r1), _r2);\n                float32x4_t _tmp1 = vaddq_f32(vsubq_f32(_r1, _r2), _r3);\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n\n                r0 += max_jj * 4 * 4;\n                r1 += max_jj * 4 * 4;\n                r2 += max_jj * 4 * 4;\n                r3 += max_jj * 4 * 4;\n            }\n\n            unsigned short* outptr0 = top_blob.channel((i + ii) / out_elempack).row<unsigned short>(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n\n                float32x4_t _tmp0 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r0, _r1), _r2));\n                float32x4_t _tmp1 = vaddq_f32(_bias0, vaddq_f32(vsubq_f32(_r1, _r2), _r3));\n\n                if (out_elempack == 4)\n                {\n       
             vst1_u16(outptr0, float2bfloat(_tmp0));\n                    if (tj * 2 + 1 < outw) vst1_u16(outptr0 + 4, float2bfloat(_tmp1));\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short tmp0[4];\n                    unsigned short tmp1[4];\n                    vst1_u16(tmp0, float2bfloat(_tmp0));\n                    vst1_u16(tmp1, float2bfloat(_tmp1));\n\n                    unsigned short* outptr1 = outptr0 + N;\n                    unsigned short* outptr2 = outptr0 + N * 2;\n                    unsigned short* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        float32x2_t _bias0 = biasptr ? vld1_f32(biasptr + i + ii) : vdup_n_f32(0.f);\n#else\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        float bias1 = biasptr ? 
biasptr[i + ii + 1] : 0.f;\n#endif\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[2][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj * 2;\n            const float* r1 = r0 + max_jj * 2;\n            const float* r2 = r0 + max_jj * 2 * 2;\n            const float* r3 = r0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(r0);\n                float32x2_t _r1 = vld1_f32(r1);\n                float32x2_t _r2 = vld1_f32(r2);\n                float32x2_t _r3 = vld1_f32(r3);\n\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _r1), _r2);\n                float32x2_t _tmp1 = vadd_f32(vsub_f32(_r1, _r2), _r3);\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n#else\n                tmp[0][m][0] = r0[0] + r1[0] + r2[0];\n                tmp[0][m][1] = r0[1] + r1[1] + r2[1];\n                tmp[1][m][0] = r1[0] - r2[0] + r3[0];\n                tmp[1][m][1] = r1[1] - r2[1] + r3[1];\n#endif\n\n                r0 += max_jj * 4 * 2;\n                r1 += max_jj * 4 * 2;\n                r2 += max_jj * 4 * 2;\n                r3 += max_jj * 4 * 2;\n            }\n\n            unsigned short* outptr0 = top_blob.channel(i + ii).row<unsigned short>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n\n                float32x2_t _tmp0 = 
vadd_f32(_bias0, vadd_f32(vadd_f32(_r0, _r1), _r2));\n                float32x2_t _tmp1 = vadd_f32(_bias0, vadd_f32(vsub_f32(_r1, _r2), _r3));\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n\n                float tmp00 = bias0 + r00 + r10 + r20;\n                float tmp01 = bias1 + r01 + r11 + r21;\n                float tmp10 = bias0 + r10 - r20 + r30;\n                float tmp11 = bias1 + r11 - r21 + r31;\n#endif\n\n                // if (out_elempack == 1)\n                {\n                    unsigned short* outptr1 = outptr0 + N;\n\n#if __ARM_NEON\n                    uint16x4_t _tmp01 = float2bfloat(vcombine_f32(_tmp0, _tmp1));\n\n                    outptr0[0] = vget_lane_u16(_tmp01, 0);\n                    outptr1[0] = vget_lane_u16(_tmp01, 1);\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = vget_lane_u16(_tmp01, 2);\n                        outptr1[1] = vget_lane_u16(_tmp01, 3);\n                    }\n#else\n                    outptr0[0] = float32_to_bfloat16(tmp00);\n                    outptr1[0] = float32_to_bfloat16(tmp01);\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = float32_to_bfloat16(tmp10);\n                        outptr1[1] = float32_to_bfloat16(tmp11);\n                    }\n#endif\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        float bias0 = biasptr ? 
biasptr[i + ii] : 0.f;\n\n        float tmp[2][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 16 + jj;\n            const float* r1 = r0 + max_jj;\n            const float* r2 = r0 + max_jj * 2;\n            const float* r3 = r0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                tmp[0][m] = r0[0] + r1[0] + r2[0];\n                tmp[1][m] = r1[0] - r2[0] + r3[0];\n\n                r0 += max_jj * 4;\n                r1 += max_jj * 4;\n                r2 += max_jj * 4;\n                r3 += max_jj * 4;\n            }\n\n            unsigned short* outptr0 = top_blob.channel(i + ii).row<unsigned short>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n\n                float tmp0 = bias0 + r0 + r1 + r2;\n                float tmp1 = bias0 + r1 - r2 + r3;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(tmp0);\n                    if (tj * 2 + 1 < outw) outptr0[1] = float32_to_bfloat16(tmp1);\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd23_bf16s(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 2n+2, winograd F(2,3)\n    int w_tiles = (outw + 1) / 2;\n    int h_tiles = (outh + 1) / 2;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c 
* bottom_blob.elempack;\n    const int B = 16;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd23_bf16s %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 4u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd23_transform_input_tile_bf16s(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 4u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = 
std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd23_transform_input_tile_bf16s(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd23_transform_output_tile_bf16s(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n\nstatic inline void conv3x3s1_winograd43_transform_input_tile_bf16s(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    const float sq2 = 1.41421356237;\n    const float sq2_d2 = 1.41421356237 / 2;\n\n    // const float itm[6][6] = {\n    //     {1.0f,  0.0f,  -2.5f,  0.0f,  1.0f, 0.0f},\n    //     {0.0f, -sq2,   -2.0f,  sq2/2, 
1.0f, 0.0f},\n    //     {0.0f,  sq2,   -2.0f, -sq2/2, 1.0f, 0.0f},\n    //     {0.0f, -sq2/2, -0.5f,  sq2,   1.0f, 0.0f},\n    //     {0.0f,  sq2/2, -0.5f, -sq2,   1.0f, 0.0f},\n    //     {0.0f,  1.0f,   0.0f,  -2.5f, 0.0f, 1.0f}\n    // };\n\n    // 0 =  r00 + r04 - 2.5f * r02\n    // 1 = -(sq2 * r01 - sq2_d2 * r03) + (r04 - 2 * r02)\n    // 2 =  (sq2 * r01 - sq2_d2 * r03) + (r04 - 2 * r02)\n    // 3 =  (sq2 * r03 - sq2_d2 * r01) + (r04 - 0.5f * r02)\n    // 4 = -(sq2 * r03 - sq2_d2 * r01) + (r04 - 0.5f * r02)\n    // 5 =  r01 + r05 - 2.5f * r03\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w + 1) / 4;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][6][8];\n\n        const float coeffs[4] = {sq2, -sq2_d2, -2.f, -0.5f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _vm2_5 = vdupq_n_f32(-2.5f);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0 = bottom_blob.channel((k + kk) / elempack).row<const unsigned short>(ti * 4) + (tj * 4) * elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r00 = vdupq_n_f32(0.f);\n                float32x4_t _r01 = vdupq_n_f32(0.f);\n                float32x4_t _r10 = vdupq_n_f32(0.f);\n                float32x4_t _r11 = vdupq_n_f32(0.f);\n                float32x4_t _r20 = vdupq_n_f32(0.f);\n            
    float32x4_t _r21 = vdupq_n_f32(0.f);\n                float32x4_t _r30 = vdupq_n_f32(0.f);\n                float32x4_t _r31 = vdupq_n_f32(0.f);\n                float32x4_t _r40 = vdupq_n_f32(0.f);\n                float32x4_t _r41 = vdupq_n_f32(0.f);\n                float32x4_t _r50 = vdupq_n_f32(0.f);\n                float32x4_t _r51 = vdupq_n_f32(0.f);\n\n                if (ti * 4 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        const unsigned short* r1 = r0 + N;\n\n                        _r00 = bfloat2float(vld1_u16(r0));\n                        _r01 = bfloat2float(vld1_u16(r1));\n                        if (tj * 4 + 1 < w)\n                        {\n                            _r10 = bfloat2float(vld1_u16(r0 + 4));\n                            _r11 = bfloat2float(vld1_u16(r1 + 4));\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            _r20 = bfloat2float(vld1_u16(r0 + 8));\n                            _r21 = bfloat2float(vld1_u16(r1 + 8));\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            _r30 = bfloat2float(vld1_u16(r0 + 12));\n                            _r31 = bfloat2float(vld1_u16(r1 + 12));\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            _r40 = bfloat2float(vld1_u16(r0 + 16));\n                            _r41 = bfloat2float(vld1_u16(r1 + 16));\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            _r50 = bfloat2float(vld1_u16(r0 + 20));\n                            _r51 = bfloat2float(vld1_u16(r1 + 20));\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n    
                    const unsigned short* r2 = r0 + N * 2;\n                        const unsigned short* r3 = r0 + N * 3;\n                        const unsigned short* r4 = r0 + N * 4;\n                        const unsigned short* r5 = r0 + N * 5;\n                        const unsigned short* r6 = r0 + N * 6;\n                        const unsigned short* r7 = r0 + N * 7;\n\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4_t _t2 = vld1_u16(r2);\n                        uint16x4_t _t3 = vld1_u16(r3);\n                        uint16x4_t _t4 = vld1_u16(r4);\n                        uint16x4_t _t5 = vld1_u16(r5);\n                        uint16x4_t _t6 = vld1_u16(r6);\n                        uint16x4_t _t7 = vld1_u16(r7);\n\n                        transpose4x4_u16(_t0, _t1, _t2, _t3);\n                        transpose4x4_u16(_t4, _t5, _t6, _t7);\n\n                        _r00 = bfloat2float(_t0);\n                        _r01 = bfloat2float(_t4);\n                        if (tj * 4 + 1 < w)\n                        {\n                            _r10 = bfloat2float(_t1);\n                            _r11 = bfloat2float(_t5);\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            _r20 = bfloat2float(_t2);\n                            _r21 = bfloat2float(_t6);\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            _r30 = bfloat2float(_t3);\n                            _r31 = bfloat2float(_t7);\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            unsigned short tmp[8] = {r0[4], r1[4], r2[4], r3[4], r4[4], r5[4], r6[4], r7[4]};\n                            _r40 = bfloat2float(vld1_u16(tmp));\n                            _r41 = bfloat2float(vld1_u16(tmp + 
4));\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            unsigned short tmp[8] = {r0[5], r1[5], r2[5], r3[5], r4[5], r5[5], r6[5], r7[5]};\n                            _r50 = bfloat2float(vld1_u16(tmp));\n                            _r51 = bfloat2float(vld1_u16(tmp + 4));\n                        }\n                    }\n                }\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs, 0), _r30, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs, 0), _r31, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 2);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 2);\n                float32x4_t _tmp34a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r30, _coeffs, 0), _r10, _coeffs, 1);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r31, _coeffs, 0), _r11, _coeffs, 1);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 3);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 3);\n\n                float32x4_t _tmp00 = vfmaq_f32(vaddq_f32(_r00, _r40), _r20, _vm2_5);\n                float32x4_t _tmp01 = vfmaq_f32(vaddq_f32(_r01, _r41), _r21, _vm2_5);\n                float32x4_t _tmp10 = vsubq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp11 = vsubq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp20 = vaddq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp21 = vaddq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp50 = vfmaq_f32(vaddq_f32(_r10, _r50), _r30, _vm2_5);\n   
             float32x4_t _tmp51 = vfmaq_f32(vaddq_f32(_r11, _r51), _r31, _vm2_5);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n                vst1q_f32(tmp[4][m], _tmp40);\n                vst1q_f32(tmp[4][m] + 4, _tmp41);\n                vst1q_f32(tmp[5][m], _tmp50);\n                vst1q_f32(tmp[5][m] + 4, _tmp51);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj * 8;\n            float* p1 = p0 + max_jj * 8;\n            float* p2 = p0 + max_jj * 8 * 2;\n            float* p3 = p0 + max_jj * 8 * 3;\n            float* p4 = p0 + max_jj * 8 * 4;\n            float* p5 = p0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs, 0), _r30, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs, 0), 
_r31, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 2);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 2);\n                float32x4_t _tmp34a0 = vfmaq_laneq_f32(vmulq_laneq_f32(_r30, _coeffs, 0), _r10, _coeffs, 1);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vmulq_laneq_f32(_r31, _coeffs, 0), _r11, _coeffs, 1);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(_r40, _r20, _coeffs, 3);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(_r41, _r21, _coeffs, 3);\n\n                float32x4_t _tmp00 = vfmaq_f32(vaddq_f32(_r00, _r40), _r20, _vm2_5);\n                float32x4_t _tmp01 = vfmaq_f32(vaddq_f32(_r01, _r41), _r21, _vm2_5);\n                float32x4_t _tmp10 = vsubq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp11 = vsubq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp20 = vaddq_f32(_tmp12b0, _tmp12a0);\n                float32x4_t _tmp21 = vaddq_f32(_tmp12b1, _tmp12a1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34b0, _tmp34a0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34b1, _tmp34a1);\n                float32x4_t _tmp50 = vfmaq_f32(vaddq_f32(_r10, _r50), _r30, _vm2_5);\n                float32x4_t _tmp51 = vfmaq_f32(vaddq_f32(_r11, _r51), _r31, _vm2_5);\n\n                vst1q_f32(p0, _tmp00);\n                vst1q_f32(p0 + 4, _tmp01);\n                vst1q_f32(p1, _tmp10);\n                vst1q_f32(p1 + 4, _tmp11);\n                vst1q_f32(p2, _tmp20);\n                vst1q_f32(p2 + 4, _tmp21);\n                vst1q_f32(p3, _tmp30);\n                vst1q_f32(p3 + 4, _tmp31);\n                vst1q_f32(p4, _tmp40);\n                vst1q_f32(p4 + 4, _tmp41);\n                vst1q_f32(p5, _tmp50);\n                vst1q_f32(p5 + 4, _tmp51);\n\n                p0 += max_jj 
* 6 * 8;\n                p1 += max_jj * 6 * 8;\n                p2 += max_jj * 6 * 8;\n                p3 += max_jj * 6 * 8;\n                p4 += max_jj * 6 * 8;\n                p5 += max_jj * 6 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n#else // __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    #pragma omp parallel for num_threads(nT)\n#endif // __aarch64__\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][6][4];\n\n        const float coeffs[4] = {sq2, -sq2_d2, -2.f, -0.5f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _vm2_5 = vdupq_n_f32(-2.5f);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0 = bottom_blob.channel((k + kk) / elempack).row<const unsigned short>(ti * 4) + (tj * 4) * elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r0 = vdupq_n_f32(0.f);\n                float32x4_t _r1 = vdupq_n_f32(0.f);\n                float32x4_t _r2 = vdupq_n_f32(0.f);\n                float32x4_t _r3 = vdupq_n_f32(0.f);\n                float32x4_t _r4 = vdupq_n_f32(0.f);\n                float32x4_t _r5 = vdupq_n_f32(0.f);\n\n                if (ti * 4 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        if (tj * 4 + 1 < w) _r1 = bfloat2float(vld1_u16(r0 + 4));\n                        if (tj * 4 + 2 < w) _r2 = bfloat2float(vld1_u16(r0 + 8));\n                        if (tj * 4 + 3 < w) _r3 = bfloat2float(vld1_u16(r0 + 12));\n              
          if (tj * 4 + 4 < w) _r4 = bfloat2float(vld1_u16(r0 + 16));\n                        if (tj * 4 + 5 < w) _r5 = bfloat2float(vld1_u16(r0 + 20));\n                    }\n                    if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n                        const unsigned short* r2 = r0 + N * 2;\n                        const unsigned short* r3 = r0 + N * 3;\n\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4_t _t2 = vld1_u16(r2);\n                        uint16x4_t _t3 = vld1_u16(r3);\n\n                        transpose4x4_u16(_t0, _t1, _t2, _t3);\n\n                        _r0 = bfloat2float(_t0);\n                        if (tj * 4 + 1 < w) _r1 = bfloat2float(_t1);\n                        if (tj * 4 + 2 < w) _r2 = bfloat2float(_t2);\n                        if (tj * 4 + 3 < w) _r3 = bfloat2float(_t3);\n                        if (tj * 4 + 4 < w)\n                        {\n                            unsigned short tmp[4] = {r0[4], r1[4], r2[4], r3[4]};\n                            _r4 = bfloat2float(vld1_u16(tmp));\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            unsigned short tmp[4] = {r0[5], r1[5], r2[5], r3[5]};\n                            _r5 = bfloat2float(vld1_u16(tmp));\n                        }\n                    }\n                }\n\n#if __aarch64__\n                float32x4_t _tmp12a = vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vmulq_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vmulq_lane_f32(_r1, 
vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmulq_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x4_t _tmp0 = vmlaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x4_t _tmp1 = vsubq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp2 = vaddq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34b, _tmp34a);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x4_t _tmp5 = vfmaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x4_t _tmp5 = vmlaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n                vst1q_f32(tmp[4][m], _tmp4);\n                vst1q_f32(tmp[5][m], _tmp5);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj * 4;\n            float* p1 = p0 + max_jj * 4;\n            float* p2 = p0 + max_jj * 4 * 2;\n            float* p3 = p0 + max_jj * 4 * 3;\n            float* p4 = p0 + max_jj * 4 * 4;\n            float* p5 = p0 + max_jj * 4 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = 
vld1q_f32(tmp[m][4]);\n                float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n\n#if __aarch64__\n                float32x4_t _tmp12a = vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vmulq_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmulq_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34b = vmlaq_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x4_t _tmp0 = vmlaq_f32(vaddq_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x4_t _tmp1 = vsubq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp2 = vaddq_f32(_tmp12b, _tmp12a);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34b, _tmp34a);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x4_t _tmp5 = vfmaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x4_t _tmp5 = vmlaq_f32(vaddq_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                vst1q_f32(p0, _tmp0);\n                vst1q_f32(p1, _tmp1);\n                vst1q_f32(p2, _tmp2);\n                vst1q_f32(p3, _tmp3);\n                vst1q_f32(p4, _tmp4);\n                vst1q_f32(p5, _tmp5);\n\n                p0 += max_jj * 6 * 4;\n                p1 += max_jj * 6 * 4;\n                p2 += max_jj * 6 * 4;\n                p3 += max_jj * 6 * 4;\n              
  p4 += max_jj * 6 * 4;\n                p5 += max_jj * 6 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[6][6][2];\n\n#if __ARM_NEON\n        const float coeffs[4] = {sq2, -sq2_d2, -2.f, -0.5f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x2_t _vm2_5 = vdup_n_f32(-2.5f);\n#endif\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0 = bottom_blob.channel(k + kk).row<const unsigned short>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 6; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vdup_n_f32(0.f);\n                float32x2_t _r1 = vdup_n_f32(0.f);\n                float32x2_t _r2 = vdup_n_f32(0.f);\n                float32x2_t _r3 = vdup_n_f32(0.f);\n                float32x2_t _r4 = vdup_n_f32(0.f);\n                float32x2_t _r5 = vdup_n_f32(0.f);\n#else\n                float r00 = 0.f;\n                float r01 = 0.f;\n                float r10 = 0.f;\n                float r11 = 0.f;\n                float r20 = 0.f;\n                float r21 = 0.f;\n                float r30 = 0.f;\n                float r31 = 0.f;\n                float r40 = 0.f;\n                float r41 = 0.f;\n                float r50 = 0.f;\n                float r51 = 0.f;\n#endif\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const 
unsigned short* r1 = r0 + N;\n\n#if __ARM_NEON\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4x2_t _t01 = vzip_u16(_t0, _t1);\n                        float32x4_t _t0_fp32 = bfloat2float(_t01.val[0]);\n                        float32x4_t _t1_fp32 = bfloat2float(_t01.val[1]);\n\n                        _r0 = vget_low_f32(_t0_fp32);\n                        if (tj * 4 + 1 < w) _r1 = vget_high_f32(_t0_fp32);\n                        if (tj * 4 + 2 < w) _r2 = vget_low_f32(_t1_fp32);\n                        if (tj * 4 + 3 < w) _r3 = vget_high_f32(_t1_fp32);\n                        if (tj * 4 + 4 < w)\n                        {\n                            float tmp[2] = {bfloat16_to_float32(r0[4]), bfloat16_to_float32(r1[4])};\n                            _r4 = vld1_f32(tmp);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            float tmp[2] = {bfloat16_to_float32(r0[5]), bfloat16_to_float32(r1[5])};\n                            _r5 = vld1_f32(tmp);\n                        }\n#else\n                        r00 = bfloat16_to_float32(r0[0]);\n                        r01 = bfloat16_to_float32(r1[0]);\n                        if (tj * 4 + 1 < w)\n                        {\n                            r10 = bfloat16_to_float32(r0[1]);\n                            r11 = bfloat16_to_float32(r1[1]);\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            r20 = bfloat16_to_float32(r0[2]);\n                            r21 = bfloat16_to_float32(r1[2]);\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            r30 = bfloat16_to_float32(r0[3]);\n                            r31 = bfloat16_to_float32(r1[3]);\n                        }\n                        if (tj * 4 + 4 
< w)\n                        {\n                            r40 = bfloat16_to_float32(r0[4]);\n                            r41 = bfloat16_to_float32(r1[4]);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            r50 = bfloat16_to_float32(r0[5]);\n                            r51 = bfloat16_to_float32(r1[5]);\n                        }\n#endif\n                    }\n                }\n\n#if __ARM_NEON\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x2_t _tmp12b = vfma_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x2_t _tmp34a = vfma_laneq_f32(vmul_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x2_t _tmp34b = vfma_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34a = vmla_lane_f32(vmul_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x2_t _tmp0 = vmla_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x2_t _tmp1 = vsub_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp2 = vadd_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp3 = vadd_f32(_tmp34b, _tmp34a);\n                float32x2_t _tmp4 = vsub_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x2_t _tmp5 = vfma_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x2_t _tmp5 = vmla_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                
vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n                vst1_f32(tmp[4][m], _tmp4);\n                vst1_f32(tmp[5][m], _tmp5);\n#else\n                float tmp12a0 = sq2 * r10 - sq2_d2 * r30;\n                float tmp12a1 = sq2 * r11 - sq2_d2 * r31;\n                float tmp12b0 = r40 - 2 * r20;\n                float tmp12b1 = r41 - 2 * r21;\n                float tmp34a0 = sq2 * r30 - sq2_d2 * r10;\n                float tmp34a1 = sq2 * r31 - sq2_d2 * r11;\n                float tmp34b0 = r40 - 0.5f * r20;\n                float tmp34b1 = r41 - 0.5f * r21;\n\n                tmp[0][m][0] = r00 + r40 - 2.5f * r20;\n                tmp[0][m][1] = r01 + r41 - 2.5f * r21;\n                tmp[1][m][0] = tmp12b0 - tmp12a0;\n                tmp[1][m][1] = tmp12b1 - tmp12a1;\n                tmp[2][m][0] = tmp12b0 + tmp12a0;\n                tmp[2][m][1] = tmp12b1 + tmp12a1;\n                tmp[3][m][0] = tmp34b0 + tmp34a0;\n                tmp[3][m][1] = tmp34b1 + tmp34a1;\n                tmp[4][m][0] = tmp34b0 - tmp34a0;\n                tmp[4][m][1] = tmp34b1 - tmp34a1;\n                tmp[5][m][0] = r10 + r50 - 2.5f * r30;\n                tmp[5][m][1] = r11 + r51 - 2.5f * r31;\n#endif\n\n                r0 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj * 2;\n            float* p1 = p0 + max_jj * 2;\n            float* p2 = p0 + max_jj * 2 * 2;\n            float* p3 = p0 + max_jj * 2 * 3;\n            float* p4 = p0 + max_jj * 2 * 4;\n            float* p5 = p0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t 
_r4 = vld1_f32(tmp[m][4]);\n                float32x2_t _r5 = vld1_f32(tmp[m][5]);\n\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float32x2_t _tmp12b = vfma_laneq_f32(_r4, _r2, _coeffs, 2);\n                float32x2_t _tmp34a = vfma_laneq_f32(vmul_laneq_f32(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float32x2_t _tmp34b = vfma_laneq_f32(_r4, _r2, _coeffs, 3);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs), 0), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34a = vmla_lane_f32(vmul_lane_f32(_r3, vget_low_f32(_coeffs), 0), _r1, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34b = vmla_lane_f32(_r4, _r2, vget_high_f32(_coeffs), 1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#else\n                float32x2_t _tmp0 = vmla_f32(vadd_f32(_r0, _r4), _r2, _vm2_5);\n#endif\n                float32x2_t _tmp1 = vsub_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp2 = vadd_f32(_tmp12b, _tmp12a);\n                float32x2_t _tmp3 = vadd_f32(_tmp34b, _tmp34a);\n                float32x2_t _tmp4 = vsub_f32(_tmp34b, _tmp34a);\n#if __aarch64__\n                float32x2_t _tmp5 = vfma_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#else\n                float32x2_t _tmp5 = vmla_f32(vadd_f32(_r1, _r5), _r3, _vm2_5);\n#endif\n\n                vst1_f32(p0, _tmp0);\n                vst1_f32(p1, _tmp1);\n                vst1_f32(p2, _tmp2);\n                vst1_f32(p3, _tmp3);\n                vst1_f32(p4, _tmp4);\n                vst1_f32(p5, _tmp5);\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                
float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n\n                float tmp12a0 = sq2 * r10 - sq2_d2 * r30;\n                float tmp12a1 = sq2 * r11 - sq2_d2 * r31;\n                float tmp12b0 = r40 - 2 * r20;\n                float tmp12b1 = r41 - 2 * r21;\n                float tmp34a0 = sq2 * r30 - sq2_d2 * r10;\n                float tmp34a1 = sq2 * r31 - sq2_d2 * r11;\n                float tmp34b0 = r40 - 0.5f * r20;\n                float tmp34b1 = r41 - 0.5f * r21;\n\n                p0[0] = r00 + r40 - 2.5f * r20;\n                p0[1] = r01 + r41 - 2.5f * r21;\n                p1[0] = tmp12b0 - tmp12a0;\n                p1[1] = tmp12b1 - tmp12a1;\n                p2[0] = tmp12b0 + tmp12a0;\n                p2[1] = tmp12b1 + tmp12a1;\n                p3[0] = tmp34b0 + tmp34a0;\n                p3[1] = tmp34b1 + tmp34a1;\n                p4[0] = tmp34b0 - tmp34a0;\n                p4[1] = tmp34b1 - tmp34a1;\n                p5[0] = r10 + r50 - 2.5f * r30;\n                p5[1] = r11 + r51 - 2.5f * r31;\n#endif\n\n                p0 += max_jj * 6 * 2;\n                p1 += max_jj * 6 * 2;\n                p2 += max_jj * 6 * 2;\n                p3 += max_jj * 6 * 2;\n                p4 += max_jj * 6 * 2;\n                p5 += max_jj * 6 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        float tmp[6][6];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0123 = bottom_blob.channel(k + kk).row<const unsigned short>(ti * 4) + (tj * 
4);\n\n            for (int m = 0; m < 6; m++)\n            {\n                float r0 = 0.f;\n                float r1 = 0.f;\n                float r2 = 0.f;\n                float r3 = 0.f;\n                float r4 = 0.f;\n                float r5 = 0.f;\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = bfloat16_to_float32(r0123[0]);\n                        if (tj * 4 + 1 < w) r1 = bfloat16_to_float32(r0123[1]);\n                        if (tj * 4 + 2 < w) r2 = bfloat16_to_float32(r0123[2]);\n                        if (tj * 4 + 3 < w) r3 = bfloat16_to_float32(r0123[3]);\n                        if (tj * 4 + 4 < w) r4 = bfloat16_to_float32(r0123[4]);\n                        if (tj * 4 + 5 < w) r5 = bfloat16_to_float32(r0123[5]);\n                    }\n                }\n\n                float tmp12a = sq2 * r1 - sq2_d2 * r3;\n                float tmp12b = r4 - 2 * r2;\n                float tmp34a = sq2 * r3 - sq2_d2 * r1;\n                float tmp34b = r4 - 0.5f * r2;\n\n                tmp[0][m] = r0 + r4 - 2.5f * r2;\n                tmp[1][m] = tmp12b - tmp12a;\n                tmp[2][m] = tmp12b + tmp12a;\n                tmp[3][m] = tmp34b + tmp34a;\n                tmp[4][m] = tmp34b - tmp34a;\n                tmp[5][m] = r1 + r5 - 2.5f * r3;\n\n                r0123 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 36 + jj;\n            float* p1 = p0 + max_jj;\n            float* p2 = p0 + max_jj * 2;\n            float* p3 = p0 + max_jj * 3;\n            float* p4 = p0 + max_jj * 4;\n            float* p5 = p0 + max_jj * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n\n      
          float tmp12a = sq2 * r1 - sq2_d2 * r3;\n                float tmp12b = r4 - 2 * r2;\n                float tmp34a = sq2 * r3 - sq2_d2 * r1;\n                float tmp34b = r4 - 0.5f * r2;\n\n                p0[0] = r0 + r4 - 2.5f * r2;\n                p1[0] = tmp12b - tmp12a;\n                p2[0] = tmp12b + tmp12a;\n                p3[0] = tmp34b + tmp34a;\n                p4[0] = tmp34b - tmp34a;\n                p5[0] = r1 + r5 - 2.5f * r3;\n\n                p0 += max_jj * 6;\n                p1 += max_jj * 6;\n                p2 += max_jj * 6;\n                p3 += max_jj * 6;\n                p4 += max_jj * 6;\n                p5 += max_jj * 6;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd43_transform_output_tile_bf16s(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    const float sq2 = 1.41421356237;\n    const float sq2_m2 = 1.41421356237 * 2;\n    const float sq2_d2 = 1.41421356237 / 2;\n    const float sq2_d4 = 1.41421356237 / 4;\n\n    // const float otm[4][6] = {\n    //     {1.0f, 1.0f,   1.0f,  1.0f,  1.0f,   0.0f},\n    //     {0.0f, sq2/2, -sq2/2, sq2,   -sq2,   0.0f},\n    //     {0.0f, 0.5f,   0.5f,  2.0f,  2.0f,   0.0f},\n    //     {0.0f, sq2/4, -sq2/4, sq2*2, -sq2*2, 1.0f}\n    // };\n\n    // 0 = r00 + (r01 + r02) + (r03 + r04)\n    // 1 =       (r01 - r02) * sq2_d2 + (r03 - r04) * sq2\n    // 2 =       (r01 + r02) * 0.5f + (r03 + r04) * 2\n    // 3 = r05 + (r01 - r02) * sq2_d4 + (r03 - r04) * sq2_m2\n\n#if __ARM_NEON\n    const float coeffs[6] = {sq2, sq2_d2, sq2_d4, sq2_m2, 0.5f, 2.f};\n    float32x4_t _coeffs = vld1q_f32(coeffs);\n    float32x2_t _coeffs2 = vld1_f32(coeffs + 4);\n#endif // __ARM_NEON\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 3) / 4;\n\n    const float* biasptr = bias;\n\n 
   int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = biasptr ? vld1q_f32(biasptr + i + ii + 4) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj * 8;\n            const float* r1 = r0 + max_jj * 8;\n            const float* r2 = r0 + max_jj * 8 * 2;\n            const float* r3 = r0 + max_jj * 8 * 3;\n            const float* r4 = r0 + max_jj * 8 * 4;\n            const float* r5 = r0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r50 = vld1q_f32(r5);\n                float32x4_t _r51 = vld1q_f32(r5 + 4);\n\n                float32x4_t _tmp02a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp02a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp02b0 = vaddq_f32(_r30, _r40);\n                float32x4_t _tmp02b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp13a0 = vsubq_f32(_r10, _r20);\n                float32x4_t _tmp13a1 = vsubq_f32(_r11, _r21);\n                
float32x4_t _tmp13b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp13b1 = vsubq_f32(_r31, _r41);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _tmp02a0), _tmp02b0);\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _tmp02a1), _tmp02b1);\n                float32x4_t _tmp10 = vfmaq_laneq_f32(vmulq_laneq_f32(_tmp13a0, _coeffs, 1), _tmp13b0, _coeffs, 0);\n                float32x4_t _tmp11 = vfmaq_laneq_f32(vmulq_laneq_f32(_tmp13a1, _coeffs, 1), _tmp13b1, _coeffs, 0);\n                float32x4_t _tmp20 = vfmaq_lane_f32(vmulq_lane_f32(_tmp02a0, _coeffs2, 0), _tmp02b0, _coeffs2, 1);\n                float32x4_t _tmp21 = vfmaq_lane_f32(vmulq_lane_f32(_tmp02a1, _coeffs2, 0), _tmp02b1, _coeffs2, 1);\n                float32x4_t _tmp30 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r50, _tmp13a0, _coeffs, 2), _tmp13b0, _coeffs, 3);\n                float32x4_t _tmp31 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r51, _tmp13a1, _coeffs, 2), _tmp13b1, _coeffs, 3);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n\n                r0 += max_jj * 6 * 8;\n                r1 += max_jj * 6 * 8;\n                r2 += max_jj * 6 * 8;\n                r3 += max_jj * 6 * 8;\n                r4 += max_jj * 6 * 8;\n                r5 += max_jj * 6 * 8;\n            }\n\n            unsigned short* outptr0 = top_blob.channel((i + ii) / out_elempack).row<unsigned short>(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = 
vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n\n                float32x4_t _tmp02a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp02a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp02b0 = vaddq_f32(_r30, _r40);\n                float32x4_t _tmp02b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp13a0 = vsubq_f32(_r10, _r20);\n                float32x4_t _tmp13a1 = vsubq_f32(_r11, _r21);\n                float32x4_t _tmp13b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp13b1 = vsubq_f32(_r31, _r41);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _tmp02a0), vaddq_f32(_tmp02b0, _bias0));\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _tmp02a1), vaddq_f32(_tmp02b1, _bias1));\n                float32x4_t _tmp10 = vfmaq_laneq_f32(vfmaq_laneq_f32(_bias0, _tmp13a0, _coeffs, 1), _tmp13b0, _coeffs, 0);\n                float32x4_t _tmp11 = vfmaq_laneq_f32(vfmaq_laneq_f32(_bias1, _tmp13a1, _coeffs, 1), _tmp13b1, _coeffs, 0);\n                float32x4_t _tmp20 = vfmaq_lane_f32(vfmaq_lane_f32(_bias0, _tmp02a0, _coeffs2, 0), _tmp02b0, _coeffs2, 1);\n                float32x4_t _tmp21 = vfmaq_lane_f32(vfmaq_lane_f32(_bias1, _tmp02a1, _coeffs2, 0), _tmp02b1, _coeffs2, 1);\n                float32x4_t _tmp30 = vfmaq_laneq_f32(vfmaq_laneq_f32(vaddq_f32(_r50, _bias0), _tmp13a0, _coeffs, 2), _tmp13b0, _coeffs, 3);\n                
float32x4_t _tmp31 = vfmaq_laneq_f32(vfmaq_laneq_f32(vaddq_f32(_r51, _bias1), _tmp13a1, _coeffs, 2), _tmp13b1, _coeffs, 3);\n\n                if (out_elempack == 4)\n                {\n                    unsigned short* outptr1 = outptr0 + N;\n\n                    vst1_u16(outptr0, float2bfloat(_tmp00));\n                    vst1_u16(outptr1, float2bfloat(_tmp01));\n                    if (tj * 4 + 1 < outw)\n                    {\n                        vst1_u16(outptr0 + 4, float2bfloat(_tmp10));\n                        vst1_u16(outptr1 + 4, float2bfloat(_tmp11));\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        vst1_u16(outptr0 + 8, float2bfloat(_tmp20));\n                        vst1_u16(outptr1 + 8, float2bfloat(_tmp21));\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        vst1_u16(outptr0 + 12, float2bfloat(_tmp30));\n                        vst1_u16(outptr1 + 12, float2bfloat(_tmp31));\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short tmp0[8];\n                    unsigned short tmp1[8];\n                    unsigned short tmp2[8];\n                    unsigned short tmp3[8];\n                    vst1_u16(tmp0, float2bfloat(_tmp00));\n                    vst1_u16(tmp0 + 4, float2bfloat(_tmp01));\n                    vst1_u16(tmp1, float2bfloat(_tmp10));\n                    vst1_u16(tmp1 + 4, float2bfloat(_tmp11));\n                    vst1_u16(tmp2, float2bfloat(_tmp20));\n                    vst1_u16(tmp2 + 4, float2bfloat(_tmp21));\n                    vst1_u16(tmp3, float2bfloat(_tmp30));\n                    vst1_u16(tmp3 + 4, float2bfloat(_tmp31));\n\n                    unsigned short* outptr1 = outptr0 + N;\n                    unsigned short* outptr2 = outptr0 + N * 2;\n                    unsigned short* outptr3 = outptr0 + N * 
3;\n                    unsigned short* outptr4 = outptr0 + N * 4;\n                    unsigned short* outptr5 = outptr0 + N * 5;\n                    unsigned short* outptr6 = outptr0 + N * 6;\n                    unsigned short* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                        outptr4[2] = tmp2[4];\n                        outptr5[2] = tmp2[5];\n                        outptr6[2] = tmp2[6];\n                        outptr7[2] = tmp2[7];\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                        outptr4[3] = tmp3[4];\n                        outptr5[3] = tmp3[5];\n                        outptr6[3] = tmp3[6];\n                        outptr7[3] = tmp3[7];\n                   
 }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[4][6][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj * 4;\n            const float* r1 = r0 + max_jj * 4;\n            const float* r2 = r0 + max_jj * 4 * 2;\n            const float* r3 = r0 + max_jj * 4 * 3;\n            const float* r4 = r0 + max_jj * 4 * 4;\n            const float* r5 = r0 + max_jj * 4 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _r1 = vld1q_f32(r1);\n                float32x4_t _r2 = vld1q_f32(r2);\n                float32x4_t _r3 = vld1q_f32(r3);\n                float32x4_t _r4 = vld1q_f32(r4);\n                float32x4_t _r5 = vld1q_f32(r5);\n\n                float32x4_t _tmp02a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp02b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp13a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp13b = vsubq_f32(_r3, _r4);\n\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp02a), _tmp02b);\n#if __aarch64__\n                float32x4_t _tmp1 = vfmaq_laneq_f32(vmulq_laneq_f32(_tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x4_t _tmp2 = vfmaq_lane_f32(vmulq_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x4_t _tmp3 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r5, _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x4_t _tmp1 = 
vmlaq_lane_f32(vmulq_lane_f32(_tmp13a, vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x4_t _tmp2 = vmlaq_lane_f32(vmulq_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x4_t _tmp3 = vmlaq_lane_f32(vmlaq_lane_f32(_r5, _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n\n                r0 += max_jj * 6 * 4;\n                r1 += max_jj * 6 * 4;\n                r2 += max_jj * 6 * 4;\n                r3 += max_jj * 6 * 4;\n                r4 += max_jj * 6 * 4;\n                r5 += max_jj * 6 * 4;\n            }\n\n            unsigned short* outptr0 = top_blob.channel((i + ii) / out_elempack).row<unsigned short>(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n\n                float32x4_t _tmp02a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp02b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp13a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp13b = vsubq_f32(_r3, _r4);\n\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp02a), vaddq_f32(_tmp02b, _bias0));\n#if __aarch64__\n                float32x4_t _tmp1 = vfmaq_laneq_f32(vfmaq_laneq_f32(_bias0, _tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x4_t _tmp2 = vfmaq_lane_f32(vfmaq_lane_f32(_bias0, _tmp02a, _coeffs2, 0), 
_tmp02b, _coeffs2, 1);\n                float32x4_t _tmp3 = vfmaq_laneq_f32(vfmaq_laneq_f32(vaddq_f32(_r5, _bias0), _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x4_t _tmp1 = vmlaq_lane_f32(vmlaq_lane_f32(_bias0, _tmp13a, vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x4_t _tmp2 = vmlaq_lane_f32(vmlaq_lane_f32(_bias0, _tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x4_t _tmp3 = vmlaq_lane_f32(vmlaq_lane_f32(vaddq_f32(_r5, _bias0), _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_tmp0));\n                    if (tj * 4 + 1 < outw) vst1_u16(outptr0 + 4, float2bfloat(_tmp1));\n                    if (tj * 4 + 2 < outw) vst1_u16(outptr0 + 8, float2bfloat(_tmp2));\n                    if (tj * 4 + 3 < outw) vst1_u16(outptr0 + 12, float2bfloat(_tmp3));\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short tmp0[4];\n                    unsigned short tmp1[4];\n                    unsigned short tmp2[4];\n                    unsigned short tmp3[4];\n                    vst1_u16(tmp0, float2bfloat(_tmp0));\n                    vst1_u16(tmp1, float2bfloat(_tmp1));\n                    vst1_u16(tmp2, float2bfloat(_tmp2));\n                    vst1_u16(tmp3, float2bfloat(_tmp3));\n\n                    unsigned short* outptr1 = outptr0 + N;\n                    unsigned short* outptr2 = outptr0 + N * 2;\n                    unsigned short* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] 
= tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        float32x2_t _bias0 = biasptr ? vld1_f32(biasptr + i + ii) : vdup_n_f32(0.f);\n#else\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        float bias1 = biasptr ? 
biasptr[i + ii + 1] : 0.f;\n#endif\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[4][6][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj * 2;\n            const float* r1 = r0 + max_jj * 2;\n            const float* r2 = r0 + max_jj * 2 * 2;\n            const float* r3 = r0 + max_jj * 2 * 3;\n            const float* r4 = r0 + max_jj * 2 * 4;\n            const float* r5 = r0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(r0);\n                float32x2_t _r1 = vld1_f32(r1);\n                float32x2_t _r2 = vld1_f32(r2);\n                float32x2_t _r3 = vld1_f32(r3);\n                float32x2_t _r4 = vld1_f32(r4);\n                float32x2_t _r5 = vld1_f32(r5);\n\n                float32x2_t _tmp02a = vadd_f32(_r1, _r2);\n                float32x2_t _tmp02b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp13a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp13b = vsub_f32(_r3, _r4);\n\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp02a), _tmp02b);\n#if __aarch64__\n                float32x2_t _tmp1 = vfma_laneq_f32(vmul_laneq_f32(_tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x2_t _tmp2 = vfma_lane_f32(vmul_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = vfma_laneq_f32(vfma_laneq_f32(_r5, _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x2_t _tmp1 = vmla_lane_f32(vmul_lane_f32(_tmp13a, vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x2_t _tmp2 = vmla_lane_f32(vmul_lane_f32(_tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = 
vmla_lane_f32(vmla_lane_f32(_r5, _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n#else\n                float tmp02a0 = r1[0] + r2[0];\n                float tmp02a1 = r1[1] + r2[1];\n                float tmp02b0 = r3[0] + r4[0];\n                float tmp02b1 = r3[1] + r4[1];\n                float tmp13a0 = r1[0] - r2[0];\n                float tmp13a1 = r1[1] - r2[1];\n                float tmp13b0 = r3[0] - r4[0];\n                float tmp13b1 = r3[1] - r4[1];\n\n                tmp[0][m][0] = r0[0] + tmp02a0 + tmp02b0;\n                tmp[0][m][1] = r0[1] + tmp02a1 + tmp02b1;\n                tmp[1][m][0] = tmp13a0 * sq2_d2 + tmp13b0 * sq2;\n                tmp[1][m][1] = tmp13a1 * sq2_d2 + tmp13b1 * sq2;\n                tmp[2][m][0] = tmp02a0 * 0.5f + tmp02b0 * 2;\n                tmp[2][m][1] = tmp02a1 * 0.5f + tmp02b1 * 2;\n                tmp[3][m][0] = r5[0] + tmp13a0 * sq2_d4 + tmp13b0 * sq2_m2;\n                tmp[3][m][1] = r5[1] + tmp13a1 * sq2_d4 + tmp13b1 * sq2_m2;\n#endif\n\n                r0 += max_jj * 6 * 2;\n                r1 += max_jj * 6 * 2;\n                r2 += max_jj * 6 * 2;\n                r3 += max_jj * 6 * 2;\n                r4 += max_jj * 6 * 2;\n                r5 += max_jj * 6 * 2;\n            }\n\n            unsigned short* outptr0 = top_blob.channel(i + ii).row<unsigned short>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t _r4 = 
vld1_f32(tmp[m][4]);\n                float32x2_t _r5 = vld1_f32(tmp[m][5]);\n\n                float32x2_t _tmp02a = vadd_f32(_r1, _r2);\n                float32x2_t _tmp02b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp13a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp13b = vsub_f32(_r3, _r4);\n\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp02a), vadd_f32(_tmp02b, _bias0));\n#if __aarch64__\n                float32x2_t _tmp1 = vfma_laneq_f32(vfma_laneq_f32(_bias0, _tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float32x2_t _tmp2 = vfma_lane_f32(vfma_lane_f32(_bias0, _tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = vfma_laneq_f32(vfma_laneq_f32(vadd_f32(_r5, _bias0), _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n#else\n                float32x2_t _tmp1 = vmla_lane_f32(vmla_lane_f32(_bias0, _tmp13a, vget_low_f32(_coeffs), 1), _tmp13b, vget_low_f32(_coeffs), 0);\n                float32x2_t _tmp2 = vmla_lane_f32(vmla_lane_f32(_bias0, _tmp02a, _coeffs2, 0), _tmp02b, _coeffs2, 1);\n                float32x2_t _tmp3 = vmla_lane_f32(vmla_lane_f32(vadd_f32(_r5, _bias0), _tmp13a, vget_high_f32(_coeffs), 0), _tmp13b, vget_high_f32(_coeffs), 1);\n#endif\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n\n                float tmp02a0 = r10 + r20;\n                float tmp02a1 = r11 + r21;\n                float tmp02b0 = r30 + r40;\n                float tmp02b1 = r31 + r41;\n                float tmp13a0 = r10 - r20;\n         
       float tmp13a1 = r11 - r21;\n                float tmp13b0 = r30 - r40;\n                float tmp13b1 = r31 - r41;\n\n                float tmp00 = bias0 + r00 + tmp02a0 + tmp02b0;\n                float tmp01 = bias1 + r01 + tmp02a1 + tmp02b1;\n                float tmp10 = bias0 + tmp13a0 * sq2_d2 + tmp13b0 * sq2;\n                float tmp11 = bias1 + tmp13a1 * sq2_d2 + tmp13b1 * sq2;\n                float tmp20 = bias0 + tmp02a0 * 0.5f + tmp02b0 * 2;\n                float tmp21 = bias1 + tmp02a1 * 0.5f + tmp02b1 * 2;\n                float tmp30 = bias0 + r50 + tmp13a0 * sq2_d4 + tmp13b0 * sq2_m2;\n                float tmp31 = bias1 + r51 + tmp13a1 * sq2_d4 + tmp13b1 * sq2_m2;\n#endif\n\n                // if (out_elempack == 1)\n                {\n                    unsigned short* outptr1 = outptr0 + N;\n\n#if __ARM_NEON\n                    uint16x4_t _tmp01 = float2bfloat(vcombine_f32(_tmp0, _tmp1));\n                    uint16x4_t _tmp23 = float2bfloat(vcombine_f32(_tmp2, _tmp3));\n\n                    outptr0[0] = vget_lane_u16(_tmp01, 0);\n                    outptr1[0] = vget_lane_u16(_tmp01, 1);\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = vget_lane_u16(_tmp01, 2);\n                        outptr1[1] = vget_lane_u16(_tmp01, 3);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = vget_lane_u16(_tmp23, 0);\n                        outptr1[2] = vget_lane_u16(_tmp23, 1);\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = vget_lane_u16(_tmp23, 2);\n                        outptr1[3] = vget_lane_u16(_tmp23, 3);\n                    }\n#else\n                    outptr0[0] = float32_to_bfloat16(tmp00);\n                    outptr1[0] = float32_to_bfloat16(tmp01);\n                    if (tj * 4 + 1 < outw)\n                    {\n         
               outptr0[1] = float32_to_bfloat16(tmp10);\n                        outptr1[1] = float32_to_bfloat16(tmp11);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = float32_to_bfloat16(tmp20);\n                        outptr1[2] = float32_to_bfloat16(tmp21);\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = float32_to_bfloat16(tmp30);\n                        outptr1[3] = float32_to_bfloat16(tmp31);\n                    }\n#endif\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n\n        float tmp[4][6];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 36 + jj;\n            const float* r1 = r0 + max_jj;\n            const float* r2 = r0 + max_jj * 2;\n            const float* r3 = r0 + max_jj * 3;\n            const float* r4 = r0 + max_jj * 4;\n            const float* r5 = r0 + max_jj * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float tmp02a = r1[0] + r2[0];\n                float tmp02b = r3[0] + r4[0];\n                float tmp13a = r1[0] - r2[0];\n                float tmp13b = r3[0] - r4[0];\n\n                tmp[0][m] = r0[0] + tmp02a + tmp02b;\n                tmp[1][m] = tmp13a * sq2_d2 + tmp13b * sq2;\n                tmp[2][m] = tmp02a * 0.5f + tmp02b * 2;\n                tmp[3][m] = r5[0] + tmp13a * sq2_d4 + tmp13b * sq2_m2;\n\n                r0 += max_jj * 6;\n                r1 += max_jj * 6;\n                r2 += max_jj * 6;\n                r3 += max_jj * 6;\n                r4 += max_jj * 6;\n                r5 += max_jj * 6;\n            }\n\n 
           unsigned short* outptr0 = top_blob.channel(i + ii).row<unsigned short>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n\n                float tmp02a = r1 + r2;\n                float tmp02b = r3 + r4;\n                float tmp13a = r1 - r2;\n                float tmp13b = r3 - r4;\n\n                float tmp0 = bias0 + r0 + tmp02a + tmp02b;\n                float tmp1 = bias0 + tmp13a * sq2_d2 + tmp13b * sq2;\n                float tmp2 = bias0 + tmp02a * 0.5f + tmp02b * 2;\n                float tmp3 = bias0 + r5 + tmp13a * sq2_d4 + tmp13b * sq2_m2;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(tmp0);\n                    if (tj * 4 + 1 < outw) outptr0[1] = float32_to_bfloat16(tmp1);\n                    if (tj * 4 + 2 < outw) outptr0[2] = float32_to_bfloat16(tmp2);\n                    if (tj * 4 + 3 < outw) outptr0[3] = float32_to_bfloat16(tmp3);\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd43_bf16s(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 4n+2, winograd F(4,3)\n    int w_tiles = (outw + 3) / 4;\n    int h_tiles = (outh + 3) / 4;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 36;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd43_bf16s %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    
conv3x3s1_winograd_get_optimal_tile_mnk(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 4u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd43_transform_input_tile_bf16s(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 4u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n         
   // transform input\n            conv3x3s1_winograd43_transform_input_tile_bf16s(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd43_transform_output_tile_bf16s(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n\nstatic inline void conv3x3s1_winograd63_transform_input_tile_bf16s(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    // const float itm[8][8] = {\n    //     {1.0f, 0.0f,-5.25f, 0.00f, 5.25f, 0.00f,-1.0f, 0.0f},\n    //     {0.0f, 1.0f, 1.00f,-4.25f,-4.25f, 1.00f, 1.0f, 0.0f},\n    //     {0.0f,-1.0f, 1.00f, 4.25f,-4.25f,-1.00f, 1.0f, 0.0f},\n    //     {0.0f, 0.5f, 0.25f,-2.50f,-1.25f, 2.00f, 1.0f, 0.0f},\n    //     {0.0f,-0.5f, 0.25f, 2.50f,-1.25f,-2.00f, 1.0f, 0.0f},\n    //     {0.0f, 
2.0f, 4.00f,-2.50f,-5.00f, 0.50f, 1.0f, 0.0f},\n    //     {0.0f,-2.0f, 4.00f, 2.50f,-5.00f,-0.50f, 1.0f, 0.0f},\n    //     {0.0f,-1.0f, 0.00f, 5.25f, 0.00f,-5.25f, 0.0f, 1.0f}\n    // };\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w + 3) / 6;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[8][8][8];\n\n        const float coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _coeffs2 = vld1q_f32(coeffs + 4);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0 = bottom_blob.channel((k + kk) / elempack).row<const unsigned short>(ti * 6) + (tj * 6) * elempack;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r00 = vdupq_n_f32(0.f);\n                float32x4_t _r01 = vdupq_n_f32(0.f);\n                float32x4_t _r10 = vdupq_n_f32(0.f);\n                float32x4_t _r11 = vdupq_n_f32(0.f);\n                float32x4_t _r20 = vdupq_n_f32(0.f);\n                float32x4_t _r21 = vdupq_n_f32(0.f);\n                float32x4_t _r30 = vdupq_n_f32(0.f);\n                float32x4_t _r31 = vdupq_n_f32(0.f);\n                float32x4_t _r40 = vdupq_n_f32(0.f);\n                float32x4_t _r41 = vdupq_n_f32(0.f);\n                float32x4_t _r50 = vdupq_n_f32(0.f);\n                
float32x4_t _r51 = vdupq_n_f32(0.f);\n                float32x4_t _r60 = vdupq_n_f32(0.f);\n                float32x4_t _r61 = vdupq_n_f32(0.f);\n                float32x4_t _r70 = vdupq_n_f32(0.f);\n                float32x4_t _r71 = vdupq_n_f32(0.f);\n\n                if (ti * 6 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        const unsigned short* r1 = r0 + N;\n\n                        _r00 = bfloat2float(vld1_u16(r0));\n                        _r01 = bfloat2float(vld1_u16(r1));\n                        if (tj * 6 + 1 < w)\n                        {\n                            _r10 = bfloat2float(vld1_u16(r0 + 4));\n                            _r11 = bfloat2float(vld1_u16(r1 + 4));\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            _r20 = bfloat2float(vld1_u16(r0 + 8));\n                            _r21 = bfloat2float(vld1_u16(r1 + 8));\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            _r30 = bfloat2float(vld1_u16(r0 + 12));\n                            _r31 = bfloat2float(vld1_u16(r1 + 12));\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            _r40 = bfloat2float(vld1_u16(r0 + 16));\n                            _r41 = bfloat2float(vld1_u16(r1 + 16));\n                        }\n                        if (tj * 6 + 5 < w)\n                        {\n                            _r50 = bfloat2float(vld1_u16(r0 + 20));\n                            _r51 = bfloat2float(vld1_u16(r1 + 20));\n                        }\n                        if (tj * 6 + 6 < w)\n                        {\n                            _r60 = bfloat2float(vld1_u16(r0 + 24));\n                            _r61 = bfloat2float(vld1_u16(r1 + 24));\n                        }\n                        
if (tj * 6 + 7 < w)\n                        {\n                            _r70 = bfloat2float(vld1_u16(r0 + 28));\n                            _r71 = bfloat2float(vld1_u16(r1 + 28));\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n                        const unsigned short* r2 = r0 + N * 2;\n                        const unsigned short* r3 = r0 + N * 3;\n                        const unsigned short* r4 = r0 + N * 4;\n                        const unsigned short* r5 = r0 + N * 5;\n                        const unsigned short* r6 = r0 + N * 6;\n                        const unsigned short* r7 = r0 + N * 7;\n\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4_t _t2 = vld1_u16(r2);\n                        uint16x4_t _t3 = vld1_u16(r3);\n                        uint16x4_t _t4 = vld1_u16(r4);\n                        uint16x4_t _t5 = vld1_u16(r5);\n                        uint16x4_t _t6 = vld1_u16(r6);\n                        uint16x4_t _t7 = vld1_u16(r7);\n\n                        transpose4x4_u16(_t0, _t1, _t2, _t3);\n                        transpose4x4_u16(_t4, _t5, _t6, _t7);\n\n                        _r00 = bfloat2float(_t0);\n                        _r01 = bfloat2float(_t4);\n                        if (tj * 6 + 1 < w)\n                        {\n                            _r10 = bfloat2float(_t1);\n                            _r11 = bfloat2float(_t5);\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            _r20 = bfloat2float(_t2);\n                            _r21 = bfloat2float(_t6);\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            _r30 = bfloat2float(_t3);\n                            
_r31 = bfloat2float(_t7);\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            _t0 = vld1_u16(r0 + 4);\n                            _t1 = vld1_u16(r1 + 4);\n                            _t2 = vld1_u16(r2 + 4);\n                            _t3 = vld1_u16(r3 + 4);\n                            _t4 = vld1_u16(r4 + 4);\n                            _t5 = vld1_u16(r5 + 4);\n                            _t6 = vld1_u16(r6 + 4);\n                            _t7 = vld1_u16(r7 + 4);\n\n                            transpose4x4_u16(_t0, _t1, _t2, _t3);\n                            transpose4x4_u16(_t4, _t5, _t6, _t7);\n\n                            _r40 = bfloat2float(_t0);\n                            _r41 = bfloat2float(_t4);\n                            if (tj * 6 + 5 < w)\n                            {\n                                _r50 = bfloat2float(_t1);\n                                _r51 = bfloat2float(_t5);\n                            }\n                            if (tj * 6 + 6 < w)\n                            {\n                                _r60 = bfloat2float(_t2);\n                                _r61 = bfloat2float(_t6);\n                            }\n                            if (tj * 6 + 7 < w)\n                            {\n                                _r70 = bfloat2float(_t3);\n                                _r71 = bfloat2float(_t7);\n                            }\n                        }\n                    }\n                }\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vaddq_f32(_r20, _r60), _r40, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vaddq_f32(_r21, _r61), _r41, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(vaddq_f32(_r10, _r50), _r30, _coeffs, 1);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(vaddq_f32(_r11, _r51), _r31, _coeffs, 1);\n                float32x4_t _tmp34a0 = 
vfmaq_laneq_f32(vfmaq_laneq_f32(_r60, _r20, _coeffs, 3), _r40, _coeffs, 2);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r61, _r21, _coeffs, 3), _r41, _coeffs, 2);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 1), _r30, _coeffs2, 0), _r50, _coeffs2, 2);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 1), _r31, _coeffs2, 0), _r51, _coeffs2, 2);\n                float32x4_t _tmp56a0 = vfmaq_laneq_f32(_r60, vfmaq_laneq_f32(_r20, _r40, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56a1 = vfmaq_laneq_f32(_r61, vfmaq_laneq_f32(_r21, _r41, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 2), _r30, _coeffs2, 0), _r50, _coeffs2, 1);\n                float32x4_t _tmp56b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 2), _r31, _coeffs2, 0), _r51, _coeffs2, 1);\n\n                float32x4_t _tmp00 = vfmaq_laneq_f32(vsubq_f32(_r00, _r60), vsubq_f32(_r40, _r20), _coeffs, 0);\n                float32x4_t _tmp01 = vfmaq_laneq_f32(vsubq_f32(_r01, _r61), vsubq_f32(_r41, _r21), _coeffs, 0);\n                float32x4_t _tmp10 = vaddq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp11 = vaddq_f32(_tmp12a1, _tmp12b1);\n                float32x4_t _tmp20 = vsubq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp21 = vsubq_f32(_tmp12a1, _tmp12b1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp50 = vaddq_f32(_tmp56a0, _tmp56b0);\n                float32x4_t _tmp51 = vaddq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp60 = vsubq_f32(_tmp56a0, 
_tmp56b0);\n                float32x4_t _tmp61 = vsubq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp70 = vfmaq_laneq_f32(vsubq_f32(_r70, _r10), vsubq_f32(_r30, _r50), _coeffs, 0);\n                float32x4_t _tmp71 = vfmaq_laneq_f32(vsubq_f32(_r71, _r11), vsubq_f32(_r31, _r51), _coeffs, 0);\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n                vst1q_f32(tmp[4][m], _tmp40);\n                vst1q_f32(tmp[4][m] + 4, _tmp41);\n                vst1q_f32(tmp[5][m], _tmp50);\n                vst1q_f32(tmp[5][m] + 4, _tmp51);\n                vst1q_f32(tmp[6][m], _tmp60);\n                vst1q_f32(tmp[6][m] + 4, _tmp61);\n                vst1q_f32(tmp[7][m], _tmp70);\n                vst1q_f32(tmp[7][m] + 4, _tmp71);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj * 8;\n            float* p1 = p0 + max_jj * 8;\n            float* p2 = p0 + max_jj * 8 * 2;\n            float* p3 = p0 + max_jj * 8 * 3;\n            float* p4 = p0 + max_jj * 8 * 4;\n            float* p5 = p0 + max_jj * 8 * 5;\n            float* p6 = p0 + max_jj * 8 * 6;\n            float* p7 = p0 + max_jj * 8 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = 
vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n                float32x4_t _r60 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r61 = vld1q_f32(tmp[m][6] + 4);\n                float32x4_t _r70 = vld1q_f32(tmp[m][7]);\n                float32x4_t _r71 = vld1q_f32(tmp[m][7] + 4);\n\n                float32x4_t _tmp12a0 = vfmaq_laneq_f32(vaddq_f32(_r20, _r60), _r40, _coeffs, 1);\n                float32x4_t _tmp12a1 = vfmaq_laneq_f32(vaddq_f32(_r21, _r61), _r41, _coeffs, 1);\n                float32x4_t _tmp12b0 = vfmaq_laneq_f32(vaddq_f32(_r10, _r50), _r30, _coeffs, 1);\n                float32x4_t _tmp12b1 = vfmaq_laneq_f32(vaddq_f32(_r11, _r51), _r31, _coeffs, 1);\n                float32x4_t _tmp34a0 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r60, _r20, _coeffs, 3), _r40, _coeffs, 2);\n                float32x4_t _tmp34a1 = vfmaq_laneq_f32(vfmaq_laneq_f32(_r61, _r21, _coeffs, 3), _r41, _coeffs, 2);\n                float32x4_t _tmp34b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 1), _r30, _coeffs2, 0), _r50, _coeffs2, 2);\n                float32x4_t _tmp34b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 1), _r31, _coeffs2, 0), _r51, _coeffs2, 2);\n                float32x4_t _tmp56a0 = vfmaq_laneq_f32(_r60, vfmaq_laneq_f32(_r20, _r40, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56a1 = vfmaq_laneq_f32(_r61, vfmaq_laneq_f32(_r21, _r41, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b0 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r10, _coeffs2, 2), _r30, _coeffs2, 0), _r50, _coeffs2, 1);\n                float32x4_t _tmp56b1 = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r11, _coeffs2, 2), _r31, _coeffs2, 0), 
_r51, _coeffs2, 1);\n\n                float32x4_t _tmp00 = vfmaq_laneq_f32(vsubq_f32(_r00, _r60), vsubq_f32(_r40, _r20), _coeffs, 0);\n                float32x4_t _tmp01 = vfmaq_laneq_f32(vsubq_f32(_r01, _r61), vsubq_f32(_r41, _r21), _coeffs, 0);\n                float32x4_t _tmp10 = vaddq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp11 = vaddq_f32(_tmp12a1, _tmp12b1);\n                float32x4_t _tmp20 = vsubq_f32(_tmp12a0, _tmp12b0);\n                float32x4_t _tmp21 = vsubq_f32(_tmp12a1, _tmp12b1);\n                float32x4_t _tmp30 = vaddq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp31 = vaddq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp40 = vsubq_f32(_tmp34a0, _tmp34b0);\n                float32x4_t _tmp41 = vsubq_f32(_tmp34a1, _tmp34b1);\n                float32x4_t _tmp50 = vaddq_f32(_tmp56a0, _tmp56b0);\n                float32x4_t _tmp51 = vaddq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp60 = vsubq_f32(_tmp56a0, _tmp56b0);\n                float32x4_t _tmp61 = vsubq_f32(_tmp56a1, _tmp56b1);\n                float32x4_t _tmp70 = vfmaq_laneq_f32(vsubq_f32(_r70, _r10), vsubq_f32(_r30, _r50), _coeffs, 0);\n                float32x4_t _tmp71 = vfmaq_laneq_f32(vsubq_f32(_r71, _r11), vsubq_f32(_r31, _r51), _coeffs, 0);\n\n                vst1q_f32(p0, _tmp00);\n                vst1q_f32(p0 + 4, _tmp01);\n                vst1q_f32(p1, _tmp10);\n                vst1q_f32(p1 + 4, _tmp11);\n                vst1q_f32(p2, _tmp20);\n                vst1q_f32(p2 + 4, _tmp21);\n                vst1q_f32(p3, _tmp30);\n                vst1q_f32(p3 + 4, _tmp31);\n                vst1q_f32(p4, _tmp40);\n                vst1q_f32(p4 + 4, _tmp41);\n                vst1q_f32(p5, _tmp50);\n                vst1q_f32(p5 + 4, _tmp51);\n                vst1q_f32(p6, _tmp60);\n                vst1q_f32(p6 + 4, _tmp61);\n                vst1q_f32(p7, _tmp70);\n                vst1q_f32(p7 + 4, _tmp71);\n\n                p0 
+= max_jj * 8 * 8;\n                p1 += max_jj * 8 * 8;\n                p2 += max_jj * 8 * 8;\n                p3 += max_jj * 8 * 8;\n                p4 += max_jj * 8 * 8;\n                p5 += max_jj * 8 * 8;\n                p6 += max_jj * 8 * 8;\n                p7 += max_jj * 8 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n#else // __aarch64__\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    #pragma omp parallel for num_threads(nT)\n#endif // __aarch64__\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[8][8][4];\n\n        const float coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _coeffs2 = vld1q_f32(coeffs + 4);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0 = bottom_blob.channel((k + kk) / elempack).row<const unsigned short>(ti * 6) + (tj * 6) * elempack;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r0 = vdupq_n_f32(0.f);\n                float32x4_t _r1 = vdupq_n_f32(0.f);\n                float32x4_t _r2 = vdupq_n_f32(0.f);\n                float32x4_t _r3 = vdupq_n_f32(0.f);\n                float32x4_t _r4 = vdupq_n_f32(0.f);\n                float32x4_t _r5 = vdupq_n_f32(0.f);\n                float32x4_t _r6 = vdupq_n_f32(0.f);\n                float32x4_t _r7 = vdupq_n_f32(0.f);\n\n                if (ti * 6 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = bfloat2float(vld1_u16(r0));\n                        if (tj * 6 + 1 
< w) _r1 = bfloat2float(vld1_u16(r0 + 4));\n                        if (tj * 6 + 2 < w) _r2 = bfloat2float(vld1_u16(r0 + 8));\n                        if (tj * 6 + 3 < w) _r3 = bfloat2float(vld1_u16(r0 + 12));\n                        if (tj * 6 + 4 < w) _r4 = bfloat2float(vld1_u16(r0 + 16));\n                        if (tj * 6 + 5 < w) _r5 = bfloat2float(vld1_u16(r0 + 20));\n                        if (tj * 6 + 6 < w) _r6 = bfloat2float(vld1_u16(r0 + 24));\n                        if (tj * 6 + 7 < w) _r7 = bfloat2float(vld1_u16(r0 + 28));\n                    }\n                    if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n                        const unsigned short* r2 = r0 + N * 2;\n                        const unsigned short* r3 = r0 + N * 3;\n\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4_t _t2 = vld1_u16(r2);\n                        uint16x4_t _t3 = vld1_u16(r3);\n\n                        transpose4x4_u16(_t0, _t1, _t2, _t3);\n\n                        _r0 = bfloat2float(_t0);\n                        if (tj * 6 + 1 < w) _r1 = bfloat2float(_t1);\n                        if (tj * 6 + 2 < w) _r2 = bfloat2float(_t2);\n                        if (tj * 6 + 3 < w) _r3 = bfloat2float(_t3);\n                        if (tj * 6 + 4 < w)\n                        {\n                            _t0 = vld1_u16(r0 + 4);\n                            _t1 = vld1_u16(r1 + 4);\n                            _t2 = vld1_u16(r2 + 4);\n                            _t3 = vld1_u16(r3 + 4);\n\n                            transpose4x4_u16(_t0, _t1, _t2, _t3);\n\n                            _r4 = bfloat2float(_t0);\n                            if (tj * 6 + 5 < w) _r5 = bfloat2float(_t1);\n                            if (tj * 6 + 6 < w) _r6 = bfloat2float(_t2);\n                            if (tj * 6 + 7 < w) _r7 = 
bfloat2float(_t3);\n                        }\n                    }\n                }\n\n#if __aarch64__\n                float32x4_t _tmp12a = vfmaq_laneq_f32(vaddq_f32(_r2, _r6), _r4, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(vaddq_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vfmaq_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x4_t _tmp56a = vfmaq_laneq_f32(_r6, vfmaq_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vaddq_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(vaddq_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmlaq_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x4_t _tmp56a = vmlaq_lane_f32(_r6, vmlaq_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x4_t _tmp56b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_laneq_f32(vsubq_f32(_r0, _r6), vsubq_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x4_t _tmp0 = vmlaq_lane_f32(vsubq_f32(_r0, _r6), vsubq_f32(_r4, _r2), vget_low_f32(_coeffs), 0);\n#endif\n                float32x4_t _tmp1 = 
vaddq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp2 = vsubq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34a, _tmp34b);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34a, _tmp34b);\n                float32x4_t _tmp5 = vaddq_f32(_tmp56a, _tmp56b);\n                float32x4_t _tmp6 = vsubq_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x4_t _tmp7 = vfmaq_laneq_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x4_t _tmp7 = vmlaq_lane_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n                vst1q_f32(tmp[4][m], _tmp4);\n                vst1q_f32(tmp[5][m], _tmp5);\n                vst1q_f32(tmp[6][m], _tmp6);\n                vst1q_f32(tmp[7][m], _tmp7);\n\n                r0 += w * elempack;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj * 4;\n            float* p1 = p0 + max_jj * 4;\n            float* p2 = p0 + max_jj * 4 * 2;\n            float* p3 = p0 + max_jj * 4 * 3;\n            float* p4 = p0 + max_jj * 4 * 4;\n            float* p5 = p0 + max_jj * 4 * 5;\n            float* p6 = p0 + max_jj * 4 * 6;\n            float* p7 = p0 + max_jj * 4 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r6 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r7 = vld1q_f32(tmp[m][7]);\n\n#if __aarch64__\n                float32x4_t _tmp12a = 
vfmaq_laneq_f32(vaddq_f32(_r2, _r6), _r4, _coeffs, 1);\n                float32x4_t _tmp12b = vfmaq_laneq_f32(vaddq_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x4_t _tmp34a = vfmaq_laneq_f32(vfmaq_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x4_t _tmp34b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x4_t _tmp56a = vfmaq_laneq_f32(_r6, vfmaq_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x4_t _tmp56b = vfmaq_laneq_f32(vfmaq_laneq_f32(vmulq_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x4_t _tmp12a = vmlaq_lane_f32(vaddq_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp12b = vmlaq_lane_f32(vaddq_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp34a = vmlaq_lane_f32(vmlaq_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp34b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x4_t _tmp56a = vmlaq_lane_f32(_r6, vmlaq_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x4_t _tmp56b = vmlaq_lane_f32(vmlaq_lane_f32(vmulq_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 1);\n#endif\n\n#if __aarch64__\n                float32x4_t _tmp0 = vfmaq_laneq_f32(vsubq_f32(_r0, _r6), vsubq_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x4_t _tmp0 = vmlaq_lane_f32(vsubq_f32(_r0, _r6), vsubq_f32(_r4, _r2), vget_low_f32(_coeffs), 0);\n#endif\n                float32x4_t _tmp1 = vaddq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp2 = vsubq_f32(_tmp12a, _tmp12b);\n                float32x4_t _tmp3 = vaddq_f32(_tmp34a, 
_tmp34b);\n                float32x4_t _tmp4 = vsubq_f32(_tmp34a, _tmp34b);\n                float32x4_t _tmp5 = vaddq_f32(_tmp56a, _tmp56b);\n                float32x4_t _tmp6 = vsubq_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x4_t _tmp7 = vfmaq_laneq_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x4_t _tmp7 = vmlaq_lane_f32(vsubq_f32(_r7, _r1), vsubq_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1q_f32(p0, _tmp0);\n                vst1q_f32(p1, _tmp1);\n                vst1q_f32(p2, _tmp2);\n                vst1q_f32(p3, _tmp3);\n                vst1q_f32(p4, _tmp4);\n                vst1q_f32(p5, _tmp5);\n                vst1q_f32(p6, _tmp6);\n                vst1q_f32(p7, _tmp7);\n\n                p0 += max_jj * 8 * 4;\n                p1 += max_jj * 8 * 4;\n                p2 += max_jj * 8 * 4;\n                p3 += max_jj * 8 * 4;\n                p4 += max_jj * 8 * 4;\n                p5 += max_jj * 8 * 4;\n                p6 += max_jj * 8 * 4;\n                p7 += max_jj * 8 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[8][8][2];\n\n#if __ARM_NEON\n        const float coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float32x4_t _coeffs = vld1q_f32(coeffs);\n        float32x4_t _coeffs2 = vld1q_f32(coeffs + 4);\n#endif\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            
const unsigned short* r0 = bottom_blob.channel(k + kk).row<const unsigned short>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 8; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vdup_n_f32(0.f);\n                float32x2_t _r1 = vdup_n_f32(0.f);\n                float32x2_t _r2 = vdup_n_f32(0.f);\n                float32x2_t _r3 = vdup_n_f32(0.f);\n                float32x2_t _r4 = vdup_n_f32(0.f);\n                float32x2_t _r5 = vdup_n_f32(0.f);\n                float32x2_t _r6 = vdup_n_f32(0.f);\n                float32x2_t _r7 = vdup_n_f32(0.f);\n#else\n                float r00 = 0.f;\n                float r01 = 0.f;\n                float r10 = 0.f;\n                float r11 = 0.f;\n                float r20 = 0.f;\n                float r21 = 0.f;\n                float r30 = 0.f;\n                float r31 = 0.f;\n                float r40 = 0.f;\n                float r41 = 0.f;\n                float r50 = 0.f;\n                float r51 = 0.f;\n                float r60 = 0.f;\n                float r61 = 0.f;\n                float r70 = 0.f;\n                float r71 = 0.f;\n#endif\n\n                if (ti * 6 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const unsigned short* r1 = r0 + N;\n\n#if __ARM_NEON\n                        uint16x4_t _t0 = vld1_u16(r0);\n                        uint16x4_t _t1 = vld1_u16(r1);\n                        uint16x4x2_t _t01 = vzip_u16(_t0, _t1);\n                        float32x4_t _t0_fp32 = bfloat2float(_t01.val[0]);\n                        float32x4_t _t1_fp32 = bfloat2float(_t01.val[1]);\n\n                        _r0 = vget_low_f32(_t0_fp32);\n                        if (tj * 6 + 1 < w) _r1 = vget_high_f32(_t0_fp32);\n                        if (tj * 6 + 2 < w) _r2 = vget_low_f32(_t1_fp32);\n                        if (tj * 6 + 3 < w) _r3 = vget_high_f32(_t1_fp32);\n                        if (tj * 6 
+ 4 < w)\n                        {\n                            _t0 = vld1_u16(r0 + 4);\n                            _t1 = vld1_u16(r1 + 4);\n                            _t01 = vzip_u16(_t0, _t1);\n                            _t0_fp32 = bfloat2float(_t01.val[0]);\n                            _t1_fp32 = bfloat2float(_t01.val[1]);\n\n                            _r4 = vget_low_f32(_t0_fp32);\n                            if (tj * 6 + 5 < w) _r5 = vget_high_f32(_t0_fp32);\n                            if (tj * 6 + 6 < w) _r6 = vget_low_f32(_t1_fp32);\n                            if (tj * 6 + 7 < w) _r7 = vget_high_f32(_t1_fp32);\n                        }\n#else\n                        r00 = bfloat16_to_float32(r0[0]);\n                        r01 = bfloat16_to_float32(r1[0]);\n                        if (tj * 6 + 1 < w)\n                        {\n                            r10 = bfloat16_to_float32(r0[1]);\n                            r11 = bfloat16_to_float32(r1[1]);\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            r20 = bfloat16_to_float32(r0[2]);\n                            r21 = bfloat16_to_float32(r1[2]);\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            r30 = bfloat16_to_float32(r0[3]);\n                            r31 = bfloat16_to_float32(r1[3]);\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            r40 = bfloat16_to_float32(r0[4]);\n                            r41 = bfloat16_to_float32(r1[4]);\n                        }\n                        if (tj * 6 + 5 < w)\n                        {\n                            r50 = bfloat16_to_float32(r0[5]);\n                            r51 = bfloat16_to_float32(r1[5]);\n                        }\n                        if (tj * 6 + 6 < w)\n                        {\n                  
          r60 = bfloat16_to_float32(r0[6]);\n                            r61 = bfloat16_to_float32(r1[6]);\n                        }\n                        if (tj * 6 + 7 < w)\n                        {\n                            r70 = bfloat16_to_float32(r0[7]);\n                            r71 = bfloat16_to_float32(r1[7]);\n                        }\n#endif\n                    }\n                }\n\n#if __ARM_NEON\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vadd_f32(_r2, _r6), _r4, _coeffs, 1);\n                float32x2_t _tmp12b = vfma_laneq_f32(vadd_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x2_t _tmp34a = vfma_laneq_f32(vfma_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x2_t _tmp34b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x2_t _tmp56a = vfma_laneq_f32(_r6, vfma_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x2_t _tmp56b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vadd_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(vadd_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34a = vmla_lane_f32(vmla_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x2_t _tmp56a = vmla_lane_f32(_r6, vmla_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x2_t _tmp56b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 
1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_laneq_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x2_t _tmp0 = vmla_lane_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), vget_low_f32(_coeffs), 0);\n#endif\n                float32x2_t _tmp1 = vadd_f32(_tmp12a, _tmp12b);\n                float32x2_t _tmp2 = vsub_f32(_tmp12a, _tmp12b);\n                float32x2_t _tmp3 = vadd_f32(_tmp34a, _tmp34b);\n                float32x2_t _tmp4 = vsub_f32(_tmp34a, _tmp34b);\n                float32x2_t _tmp5 = vadd_f32(_tmp56a, _tmp56b);\n                float32x2_t _tmp6 = vsub_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x2_t _tmp7 = vfma_laneq_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x2_t _tmp7 = vmla_lane_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n                vst1_f32(tmp[4][m], _tmp4);\n                vst1_f32(tmp[5][m], _tmp5);\n                vst1_f32(tmp[6][m], _tmp6);\n                vst1_f32(tmp[7][m], _tmp7);\n#else\n                float tmp12a0 = r20 + r60 - r40 * 4.25f;\n                float tmp12a1 = r21 + r61 - r41 * 4.25f;\n                float tmp12b0 = r10 + r50 - r30 * 4.25f;\n                float tmp12b1 = r11 + r51 - r31 * 4.25f;\n                float tmp34a0 = r60 + r20 * 0.25f - r40 * 1.25f;\n                float tmp34a1 = r61 + r21 * 0.25f - r41 * 1.25f;\n                float tmp34b0 = r10 * 0.5f - r30 * 2.5f + r50 * 2.f;\n                float tmp34b1 = r11 * 0.5f - r31 * 2.5f + r51 * 2.f;\n                float tmp56a0 = r20 * 4.f - r40 * 5.f + r60;\n                float tmp56a1 = r21 * 4.f - r41 * 5.f + r61;\n                float tmp56b0 = r10 * 2.f - r30 * 2.5f + r50 * 0.5f;\n                
float tmp56b1 = r11 * 2.f - r31 * 2.5f + r51 * 0.5f;\n\n                tmp[0][m][0] = r00 - r60 + (r40 - r20) * 5.25f;\n                tmp[0][m][1] = r01 - r61 + (r41 - r21) * 5.25f;\n                tmp[1][m][0] = tmp12a0 + tmp12b0;\n                tmp[1][m][1] = tmp12a1 + tmp12b1;\n                tmp[2][m][0] = tmp12a0 - tmp12b0;\n                tmp[2][m][1] = tmp12a1 - tmp12b1;\n                tmp[3][m][0] = tmp34a0 + tmp34b0;\n                tmp[3][m][1] = tmp34a1 + tmp34b1;\n                tmp[4][m][0] = tmp34a0 - tmp34b0;\n                tmp[4][m][1] = tmp34a1 - tmp34b1;\n                tmp[5][m][0] = tmp56a0 + tmp56b0;\n                tmp[5][m][1] = tmp56a1 + tmp56b1;\n                tmp[6][m][0] = tmp56a0 - tmp56b0;\n                tmp[6][m][1] = tmp56a1 - tmp56b1;\n                tmp[7][m][0] = r70 - r10 + (r30 - r50) * 5.25f;\n                tmp[7][m][1] = r71 - r11 + (r31 - r51) * 5.25f;\n#endif\n\n                r0 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj * 2;\n            float* p1 = p0 + max_jj * 2;\n            float* p2 = p0 + max_jj * 2 * 2;\n            float* p3 = p0 + max_jj * 2 * 3;\n            float* p4 = p0 + max_jj * 2 * 4;\n            float* p5 = p0 + max_jj * 2 * 5;\n            float* p6 = p0 + max_jj * 2 * 6;\n            float* p7 = p0 + max_jj * 2 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t _r4 = vld1_f32(tmp[m][4]);\n                float32x2_t _r5 = vld1_f32(tmp[m][5]);\n                float32x2_t _r6 = vld1_f32(tmp[m][6]);\n                float32x2_t _r7 = vld1_f32(tmp[m][7]);\n\n#if __aarch64__\n                float32x2_t _tmp12a = vfma_laneq_f32(vadd_f32(_r2, _r6), _r4, _coeffs, 1);\n    
            float32x2_t _tmp12b = vfma_laneq_f32(vadd_f32(_r1, _r5), _r3, _coeffs, 1);\n                float32x2_t _tmp34a = vfma_laneq_f32(vfma_laneq_f32(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float32x2_t _tmp34b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 1), _r3, _coeffs2, 0), _r5, _coeffs2, 2);\n                float32x2_t _tmp56a = vfma_laneq_f32(_r6, vfma_laneq_f32(_r2, _r4, _coeffs, 2), _coeffs2, 3);\n                float32x2_t _tmp56b = vfma_laneq_f32(vfma_laneq_f32(vmul_laneq_f32(_r1, _coeffs2, 2), _r3, _coeffs2, 0), _r5, _coeffs2, 1);\n#else\n                float32x2_t _tmp12a = vmla_lane_f32(vadd_f32(_r2, _r6), _r4, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp12b = vmla_lane_f32(vadd_f32(_r1, _r5), _r3, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp34a = vmla_lane_f32(vmla_lane_f32(_r6, _r2, vget_high_f32(_coeffs), 1), _r4, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp34b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_low_f32(_coeffs2), 1), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_high_f32(_coeffs2), 0);\n                float32x2_t _tmp56a = vmla_lane_f32(_r6, vmla_lane_f32(_r2, _r4, vget_high_f32(_coeffs), 0), vget_high_f32(_coeffs2), 1);\n                float32x2_t _tmp56b = vmla_lane_f32(vmla_lane_f32(vmul_lane_f32(_r1, vget_high_f32(_coeffs2), 0), _r3, vget_low_f32(_coeffs2), 0), _r5, vget_low_f32(_coeffs2), 1);\n#endif\n\n#if __aarch64__\n                float32x2_t _tmp0 = vfma_laneq_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), _coeffs, 0);\n#else\n                float32x2_t _tmp0 = vmla_lane_f32(vsub_f32(_r0, _r6), vsub_f32(_r4, _r2), vget_low_f32(_coeffs), 0);\n#endif\n                float32x2_t _tmp1 = vadd_f32(_tmp12a, _tmp12b);\n                float32x2_t _tmp2 = vsub_f32(_tmp12a, _tmp12b);\n                float32x2_t _tmp3 = vadd_f32(_tmp34a, _tmp34b);\n                float32x2_t _tmp4 = vsub_f32(_tmp34a, _tmp34b);\n                float32x2_t 
_tmp5 = vadd_f32(_tmp56a, _tmp56b);\n                float32x2_t _tmp6 = vsub_f32(_tmp56a, _tmp56b);\n#if __aarch64__\n                float32x2_t _tmp7 = vfma_laneq_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), _coeffs, 0);\n#else\n                float32x2_t _tmp7 = vmla_lane_f32(vsub_f32(_r7, _r1), vsub_f32(_r3, _r5), vget_low_f32(_coeffs), 0);\n#endif\n\n                vst1_f32(p0, _tmp0);\n                vst1_f32(p1, _tmp1);\n                vst1_f32(p2, _tmp2);\n                vst1_f32(p3, _tmp3);\n                vst1_f32(p4, _tmp4);\n                vst1_f32(p5, _tmp5);\n                vst1_f32(p6, _tmp6);\n                vst1_f32(p7, _tmp7);\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n                float r60 = tmp[m][6][0];\n                float r61 = tmp[m][6][1];\n                float r70 = tmp[m][7][0];\n                float r71 = tmp[m][7][1];\n\n                float tmp12a0 = r20 + r60 - r40 * 4.25f;\n                float tmp12a1 = r21 + r61 - r41 * 4.25f;\n                float tmp12b0 = r10 + r50 - r30 * 4.25f;\n                float tmp12b1 = r11 + r51 - r31 * 4.25f;\n                float tmp34a0 = r60 + r20 * 0.25f - r40 * 1.25f;\n                float tmp34a1 = r61 + r21 * 0.25f - r41 * 1.25f;\n                float tmp34b0 = r10 * 0.5f - r30 * 2.5f + r50 * 2.f;\n                float tmp34b1 = r11 * 0.5f - r31 * 2.5f + r51 * 2.f;\n                float tmp56a0 = r20 * 4.f - r40 * 5.f + r60;\n                float tmp56a1 = r21 * 4.f - r41 * 5.f + r61;\n            
    float tmp56b0 = r10 * 2.f - r30 * 2.5f + r50 * 0.5f;\n                float tmp56b1 = r11 * 2.f - r31 * 2.5f + r51 * 0.5f;\n\n                p0[0] = r00 - r60 + (r40 - r20) * 5.25f;\n                p0[1] = r01 - r61 + (r41 - r21) * 5.25f;\n                p1[0] = tmp12a0 + tmp12b0;\n                p1[1] = tmp12a1 + tmp12b1;\n                p2[0] = tmp12a0 - tmp12b0;\n                p2[1] = tmp12a1 - tmp12b1;\n                p3[0] = tmp34a0 + tmp34b0;\n                p3[1] = tmp34a1 + tmp34b1;\n                p4[0] = tmp34a0 - tmp34b0;\n                p4[1] = tmp34a1 - tmp34b1;\n                p5[0] = tmp56a0 + tmp56b0;\n                p5[1] = tmp56a1 + tmp56b1;\n                p6[0] = tmp56a0 - tmp56b0;\n                p6[1] = tmp56a1 - tmp56b1;\n                p7[0] = r70 - r10 + (r30 - r50) * 5.25f;\n                p7[1] = r71 - r11 + (r31 - r51) * 5.25f;\n#endif\n\n                p0 += max_jj * 8 * 2;\n                p1 += max_jj * 8 * 2;\n                p2 += max_jj * 8 * 2;\n                p3 += max_jj * 8 * 2;\n                p4 += max_jj * 8 * 2;\n                p5 += max_jj * 8 * 2;\n                p6 += max_jj * 8 * 2;\n                p7 += max_jj * 8 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        float tmp[8][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const unsigned short* r0123 = bottom_blob.channel(k + kk).row<const unsigned short>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 8; m++)\n            {\n                float r0 = 0.f;\n                float r1 = 0.f;\n                float r2 = 0.f;\n                float r3 = 0.f;\n                float r4 = 0.f;\n                float r5 = 0.f;\n                float r6 = 0.f;\n                float r7 = 0.f;\n\n                if (ti 
* 6 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = bfloat16_to_float32(r0123[0]);\n                        if (tj * 6 + 1 < w) r1 = bfloat16_to_float32(r0123[1]);\n                        if (tj * 6 + 2 < w) r2 = bfloat16_to_float32(r0123[2]);\n                        if (tj * 6 + 3 < w) r3 = bfloat16_to_float32(r0123[3]);\n                        if (tj * 6 + 4 < w) r4 = bfloat16_to_float32(r0123[4]);\n                        if (tj * 6 + 5 < w) r5 = bfloat16_to_float32(r0123[5]);\n                        if (tj * 6 + 6 < w) r6 = bfloat16_to_float32(r0123[6]);\n                        if (tj * 6 + 7 < w) r7 = bfloat16_to_float32(r0123[7]);\n                    }\n                }\n\n                float tmp12a = r2 + r6 - r4 * 4.25f;\n                float tmp12b = r1 + r5 - r3 * 4.25f;\n                float tmp34a = r6 + r2 * 0.25f - r4 * 1.25f;\n                float tmp34b = r1 * 0.5f - r3 * 2.5f + r5 * 2.f;\n                float tmp56a = r2 * 4.f - r4 * 5.f + r6;\n                float tmp56b = r1 * 2.f - r3 * 2.5f + r5 * 0.5f;\n\n                tmp[0][m] = r0 - r6 + (r4 - r2) * 5.25f;\n                tmp[1][m] = tmp12a + tmp12b;\n                tmp[2][m] = tmp12a - tmp12b;\n                tmp[3][m] = tmp34a + tmp34b;\n                tmp[4][m] = tmp34a - tmp34b;\n                tmp[5][m] = tmp56a + tmp56b;\n                tmp[6][m] = tmp56a - tmp56b;\n                tmp[7][m] = r7 - r1 + (r3 - r5) * 5.25f;\n\n                r0123 += w;\n            }\n\n            float* p0 = (float*)B + kk * max_jj * 64 + jj;\n            float* p1 = p0 + max_jj;\n            float* p2 = p0 + max_jj * 2;\n            float* p3 = p0 + max_jj * 3;\n            float* p4 = p0 + max_jj * 4;\n            float* p5 = p0 + max_jj * 5;\n            float* p6 = p0 + max_jj * 6;\n            float* p7 = p0 + max_jj * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                
float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n                float r6 = tmp[m][6];\n                float r7 = tmp[m][7];\n\n                float tmp12a = r2 + r6 - r4 * 4.25f;\n                float tmp12b = r1 + r5 - r3 * 4.25f;\n                float tmp34a = r6 + r2 * 0.25f - r4 * 1.25f;\n                float tmp34b = r1 * 0.5f - r3 * 2.5f + r5 * 2.f;\n                float tmp56a = r2 * 4.f - r4 * 5.f + r6;\n                float tmp56b = r1 * 2.f - r3 * 2.5f + r5 * 0.5f;\n\n                p0[0] = r0 - r6 + (r4 - r2) * 5.25f;\n                p1[0] = tmp12a + tmp12b;\n                p2[0] = tmp12a - tmp12b;\n                p3[0] = tmp34a + tmp34b;\n                p4[0] = tmp34a - tmp34b;\n                p5[0] = tmp56a + tmp56b;\n                p6[0] = tmp56a - tmp56b;\n                p7[0] = r7 - r1 + (r3 - r5) * 5.25f;\n\n                p0 += max_jj * 8;\n                p1 += max_jj * 8;\n                p2 += max_jj * 8;\n                p3 += max_jj * 8;\n                p4 += max_jj * 8;\n                p5 += max_jj * 8;\n                p6 += max_jj * 8;\n                p7 += max_jj * 8;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd63_transform_output_tile_bf16s(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    // const float otm[6][8] = {\n    //     {1.0f, 1.0f,  1.0f,  1.0f,  1.0f, 32.0f, 32.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f,  2.0f, -2.0f, 16.0f,-16.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f,  4.0f,  4.0f,  8.0f,  8.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f,  8.0f, -8.0f,  4.0f, -4.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f, 16.0f, 16.0f,  2.0f,  2.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 32.0f,-32.0f,  1.0f, -1.0f, 1.0f}\n    // };\n\n#if __ARM_NEON\n    const float coeffs[4] = 
{32.f, 16.f, 8.f, 4.f};\n    float32x4_t _coeffs = vld1q_f32(coeffs);\n    float32x2_t _v2 = vdup_n_f32(2.f);\n#endif\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 5) / 6;\n\n    const float* biasptr = bias;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float32x4_t _bias0 = biasptr ? vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n        float32x4_t _bias1 = biasptr ? vld1q_f32(biasptr + i + ii + 4) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][8][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj * 8;\n            const float* r1 = r0 + max_jj * 8;\n            const float* r2 = r0 + max_jj * 8 * 2;\n            const float* r3 = r0 + max_jj * 8 * 3;\n            const float* r4 = r0 + max_jj * 8 * 4;\n            const float* r5 = r0 + max_jj * 8 * 5;\n            const float* r6 = r0 + max_jj * 8 * 6;\n            const float* r7 = r0 + max_jj * 8 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r50 = 
vld1q_f32(r5);\n                float32x4_t _r51 = vld1q_f32(r5 + 4);\n                float32x4_t _r60 = vld1q_f32(r6);\n                float32x4_t _r61 = vld1q_f32(r6 + 4);\n                float32x4_t _r70 = vld1q_f32(r7);\n                float32x4_t _r71 = vld1q_f32(r7 + 4);\n\n                float32x4_t _tmp024a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp024a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp135a0 = vsubq_f32(_r10, _r20);\n                float32x4_t _tmp135a1 = vsubq_f32(_r11, _r21);\n                float32x4_t _tmp024b0 = vaddq_f32(_r30, _r40);\n                float32x4_t _tmp024b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp135b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp135b1 = vsubq_f32(_r31, _r41);\n                float32x4_t _tmp024c0 = vaddq_f32(_r50, _r60);\n                float32x4_t _tmp024c1 = vaddq_f32(_r51, _r61);\n                float32x4_t _tmp135c0 = vsubq_f32(_r50, _r60);\n                float32x4_t _tmp135c1 = vsubq_f32(_r51, _r61);\n\n                float32x4_t _tmp00 = vaddq_f32(vaddq_f32(_r00, _tmp024a0), vfmaq_laneq_f32(_tmp024b0, _tmp024c0, _coeffs, 0));\n                float32x4_t _tmp01 = vaddq_f32(vaddq_f32(_r01, _tmp024a1), vfmaq_laneq_f32(_tmp024b1, _tmp024c1, _coeffs, 0));\n                float32x4_t _tmp10 = vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a0, _tmp135b0, _v2, 0), _tmp135c0, _coeffs, 1);\n                float32x4_t _tmp11 = vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a1, _tmp135b1, _v2, 0), _tmp135c1, _coeffs, 1);\n                float32x4_t _tmp20 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 3), _tmp024c0, _coeffs, 2);\n                float32x4_t _tmp21 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 3), _tmp024c1, _coeffs, 2);\n                float32x4_t _tmp30 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a0, _tmp135b0, _coeffs, 2), _tmp135c0, _coeffs, 3);\n                float32x4_t _tmp31 = 
vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a1, _tmp135b1, _coeffs, 2), _tmp135c1, _coeffs, 3);\n                float32x4_t _tmp40 = vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 1), _tmp024c0, _v2, 0);\n                float32x4_t _tmp41 = vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 1), _tmp024c1, _v2, 0);\n                float32x4_t _tmp50 = vaddq_f32(vaddq_f32(_r70, _tmp135a0), vfmaq_laneq_f32(_tmp135c0, _tmp135b0, _coeffs, 0));\n                float32x4_t _tmp51 = vaddq_f32(vaddq_f32(_r71, _tmp135a1), vfmaq_laneq_f32(_tmp135c1, _tmp135b1, _coeffs, 0));\n\n                vst1q_f32(tmp[0][m], _tmp00);\n                vst1q_f32(tmp[0][m] + 4, _tmp01);\n                vst1q_f32(tmp[1][m], _tmp10);\n                vst1q_f32(tmp[1][m] + 4, _tmp11);\n                vst1q_f32(tmp[2][m], _tmp20);\n                vst1q_f32(tmp[2][m] + 4, _tmp21);\n                vst1q_f32(tmp[3][m], _tmp30);\n                vst1q_f32(tmp[3][m] + 4, _tmp31);\n                vst1q_f32(tmp[4][m], _tmp40);\n                vst1q_f32(tmp[4][m] + 4, _tmp41);\n                vst1q_f32(tmp[5][m], _tmp50);\n                vst1q_f32(tmp[5][m] + 4, _tmp51);\n\n                r0 += max_jj * 8 * 8;\n                r1 += max_jj * 8 * 8;\n                r2 += max_jj * 8 * 8;\n                r3 += max_jj * 8 * 8;\n                r4 += max_jj * 8 * 8;\n                r5 += max_jj * 8 * 8;\n                r6 += max_jj * 8 * 8;\n                r7 += max_jj * 8 * 8;\n            }\n\n            unsigned short* outptr0 = top_blob.channel((i + ii) / out_elempack).row<unsigned short>(ti * 6) + (tj * 6) * out_elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float32x4_t _r00 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r01 = vld1q_f32(tmp[m][0] + 4);\n                float32x4_t _r10 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r11 = 
vld1q_f32(tmp[m][1] + 4);\n                float32x4_t _r20 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r21 = vld1q_f32(tmp[m][2] + 4);\n                float32x4_t _r30 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r31 = vld1q_f32(tmp[m][3] + 4);\n                float32x4_t _r40 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r41 = vld1q_f32(tmp[m][4] + 4);\n                float32x4_t _r50 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r51 = vld1q_f32(tmp[m][5] + 4);\n                float32x4_t _r60 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r61 = vld1q_f32(tmp[m][6] + 4);\n                float32x4_t _r70 = vld1q_f32(tmp[m][7]);\n                float32x4_t _r71 = vld1q_f32(tmp[m][7] + 4);\n\n                float32x4_t _tmp024a0 = vaddq_f32(_r10, _r20);\n                float32x4_t _tmp024a1 = vaddq_f32(_r11, _r21);\n                float32x4_t _tmp135a0 = vsubq_f32(_r10, _r20);\n                float32x4_t _tmp135a1 = vsubq_f32(_r11, _r21);\n                float32x4_t _tmp024b0 = vaddq_f32(_r30, _r40);\n                float32x4_t _tmp024b1 = vaddq_f32(_r31, _r41);\n                float32x4_t _tmp135b0 = vsubq_f32(_r30, _r40);\n                float32x4_t _tmp135b1 = vsubq_f32(_r31, _r41);\n                float32x4_t _tmp024c0 = vaddq_f32(_r50, _r60);\n                float32x4_t _tmp024c1 = vaddq_f32(_r51, _r61);\n                float32x4_t _tmp135c0 = vsubq_f32(_r50, _r60);\n                float32x4_t _tmp135c1 = vsubq_f32(_r51, _r61);\n\n                float32x4_t _tmp00 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r00, _tmp024a0), vfmaq_laneq_f32(_tmp024b0, _tmp024c0, _coeffs, 0)));\n                float32x4_t _tmp01 = vaddq_f32(_bias1, vaddq_f32(vaddq_f32(_r01, _tmp024a1), vfmaq_laneq_f32(_tmp024b1, _tmp024c1, _coeffs, 0)));\n                float32x4_t _tmp10 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a0, _tmp135b0, _v2, 0), _tmp135c0, _coeffs, 1));\n                float32x4_t _tmp11 = 
vaddq_f32(_bias1, vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a1, _tmp135b1, _v2, 0), _tmp135c1, _coeffs, 1));\n                float32x4_t _tmp20 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 3), _tmp024c0, _coeffs, 2));\n                float32x4_t _tmp21 = vaddq_f32(_bias1, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 3), _tmp024c1, _coeffs, 2));\n                float32x4_t _tmp30 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a0, _tmp135b0, _coeffs, 2), _tmp135c0, _coeffs, 3));\n                float32x4_t _tmp31 = vaddq_f32(_bias1, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a1, _tmp135b1, _coeffs, 2), _tmp135c1, _coeffs, 3));\n                float32x4_t _tmp40 = vaddq_f32(_bias0, vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a0, _tmp024b0, _coeffs, 1), _tmp024c0, _v2, 0));\n                float32x4_t _tmp41 = vaddq_f32(_bias1, vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a1, _tmp024b1, _coeffs, 1), _tmp024c1, _v2, 0));\n                float32x4_t _tmp50 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r70, _tmp135a0), vfmaq_laneq_f32(_tmp135c0, _tmp135b0, _coeffs, 0)));\n                float32x4_t _tmp51 = vaddq_f32(_bias1, vaddq_f32(vaddq_f32(_r71, _tmp135a1), vfmaq_laneq_f32(_tmp135c1, _tmp135b1, _coeffs, 0)));\n\n                if (out_elempack == 4)\n                {\n                    unsigned short* outptr1 = outptr0 + N;\n\n                    vst1_u16(outptr0, float2bfloat(_tmp00));\n                    vst1_u16(outptr1, float2bfloat(_tmp01));\n                    if (tj * 6 + 1 < outw)\n                    {\n                        vst1_u16(outptr0 + 4, float2bfloat(_tmp10));\n                        vst1_u16(outptr1 + 4, float2bfloat(_tmp11));\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        vst1_u16(outptr0 + 8, float2bfloat(_tmp20));\n                        vst1_u16(outptr1 + 8, float2bfloat(_tmp21));\n                    }\n               
     if (tj * 6 + 3 < outw)\n                    {\n                        vst1_u16(outptr0 + 12, float2bfloat(_tmp30));\n                        vst1_u16(outptr1 + 12, float2bfloat(_tmp31));\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        vst1_u16(outptr0 + 16, float2bfloat(_tmp40));\n                        vst1_u16(outptr1 + 16, float2bfloat(_tmp41));\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        vst1_u16(outptr0 + 20, float2bfloat(_tmp50));\n                        vst1_u16(outptr1 + 20, float2bfloat(_tmp51));\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short tmp0[8];\n                    unsigned short tmp1[8];\n                    unsigned short tmp2[8];\n                    unsigned short tmp3[8];\n                    unsigned short tmp4[8];\n                    unsigned short tmp5[8];\n                    vst1_u16(tmp0, float2bfloat(_tmp00));\n                    vst1_u16(tmp0 + 4, float2bfloat(_tmp01));\n                    vst1_u16(tmp1, float2bfloat(_tmp10));\n                    vst1_u16(tmp1 + 4, float2bfloat(_tmp11));\n                    vst1_u16(tmp2, float2bfloat(_tmp20));\n                    vst1_u16(tmp2 + 4, float2bfloat(_tmp21));\n                    vst1_u16(tmp3, float2bfloat(_tmp30));\n                    vst1_u16(tmp3 + 4, float2bfloat(_tmp31));\n                    vst1_u16(tmp4, float2bfloat(_tmp40));\n                    vst1_u16(tmp4 + 4, float2bfloat(_tmp41));\n                    vst1_u16(tmp5, float2bfloat(_tmp50));\n                    vst1_u16(tmp5 + 4, float2bfloat(_tmp51));\n\n                    unsigned short* outptr1 = outptr0 + N;\n                    unsigned short* outptr2 = outptr0 + N * 2;\n                    unsigned short* outptr3 = outptr0 + N * 3;\n                    unsigned short* outptr4 = 
outptr0 + N * 4;\n                    unsigned short* outptr5 = outptr0 + N * 5;\n                    unsigned short* outptr6 = outptr0 + N * 6;\n                    unsigned short* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                        outptr4[2] = tmp2[4];\n                        outptr5[2] = tmp2[5];\n                        outptr6[2] = tmp2[6];\n                        outptr7[2] = tmp2[7];\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                        outptr4[3] = tmp3[4];\n                        outptr5[3] = tmp3[5];\n                        outptr6[3] = tmp3[6];\n                        outptr7[3] = tmp3[7];\n                    }\n                    if (tj * 6 + 4 < outw)\n  
                  {\n                        outptr0[4] = tmp4[0];\n                        outptr1[4] = tmp4[1];\n                        outptr2[4] = tmp4[2];\n                        outptr3[4] = tmp4[3];\n                        outptr4[4] = tmp4[4];\n                        outptr5[4] = tmp4[5];\n                        outptr6[4] = tmp4[6];\n                        outptr7[4] = tmp4[7];\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp5[0];\n                        outptr1[5] = tmp5[1];\n                        outptr2[5] = tmp5[2];\n                        outptr3[5] = tmp5[3];\n                        outptr4[5] = tmp5[4];\n                        outptr5[5] = tmp5[5];\n                        outptr6[5] = tmp5[6];\n                        outptr7[5] = tmp5[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float32x4_t _bias0 = biasptr ? 
vld1q_f32(biasptr + i + ii) : vdupq_n_f32(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        float tmp[6][8][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj * 4;\n            const float* r1 = r0 + max_jj * 4;\n            const float* r2 = r0 + max_jj * 4 * 2;\n            const float* r3 = r0 + max_jj * 4 * 3;\n            const float* r4 = r0 + max_jj * 4 * 4;\n            const float* r5 = r0 + max_jj * 4 * 5;\n            const float* r6 = r0 + max_jj * 4 * 6;\n            const float* r7 = r0 + max_jj * 4 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _r1 = vld1q_f32(r1);\n                float32x4_t _r2 = vld1q_f32(r2);\n                float32x4_t _r3 = vld1q_f32(r3);\n                float32x4_t _r4 = vld1q_f32(r4);\n                float32x4_t _r5 = vld1q_f32(r5);\n                float32x4_t _r6 = vld1q_f32(r6);\n                float32x4_t _r7 = vld1q_f32(r7);\n\n                float32x4_t _tmp024a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp135a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp024b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp135b = vsubq_f32(_r3, _r4);\n                float32x4_t _tmp024c = vaddq_f32(_r5, _r6);\n                float32x4_t _tmp135c = vsubq_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp024a), vfmaq_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0));\n                float32x4_t _tmp1 = vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, _coeffs, 1);\n                float32x4_t _tmp2 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2);\n        
        float32x4_t _tmp3 = vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3);\n                float32x4_t _tmp4 = vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2, 0);\n                float32x4_t _tmp5 = vaddq_f32(vaddq_f32(_r7, _tmp135a), vfmaq_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0));\n#else\n                float32x4_t _tmp0 = vaddq_f32(vaddq_f32(_r0, _tmp024a), vmlaq_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0));\n                float32x4_t _tmp1 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, vget_low_f32(_coeffs), 1);\n                float32x4_t _tmp2 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0);\n                float32x4_t _tmp3 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1);\n                float32x4_t _tmp4 = vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2, 0);\n                float32x4_t _tmp5 = vaddq_f32(vaddq_f32(_r7, _tmp135a), vmlaq_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0));\n#endif\n\n                vst1q_f32(tmp[0][m], _tmp0);\n                vst1q_f32(tmp[1][m], _tmp1);\n                vst1q_f32(tmp[2][m], _tmp2);\n                vst1q_f32(tmp[3][m], _tmp3);\n                vst1q_f32(tmp[4][m], _tmp4);\n                vst1q_f32(tmp[5][m], _tmp5);\n\n                r0 += max_jj * 8 * 4;\n                r1 += max_jj * 8 * 4;\n                r2 += max_jj * 8 * 4;\n                r3 += max_jj * 8 * 4;\n                r4 += max_jj * 8 * 4;\n                r5 += max_jj * 8 * 4;\n                r6 += max_jj * 8 * 4;\n                r7 += max_jj * 8 * 4;\n            }\n\n            unsigned short* outptr0 = top_blob.channel((i + ii) / out_elempack).row<unsigned short>(ti * 6) + (tj * 6) * out_elempack;\n\n            for (int m = 0; m < 
6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float32x4_t _r0 = vld1q_f32(tmp[m][0]);\n                float32x4_t _r1 = vld1q_f32(tmp[m][1]);\n                float32x4_t _r2 = vld1q_f32(tmp[m][2]);\n                float32x4_t _r3 = vld1q_f32(tmp[m][3]);\n                float32x4_t _r4 = vld1q_f32(tmp[m][4]);\n                float32x4_t _r5 = vld1q_f32(tmp[m][5]);\n                float32x4_t _r6 = vld1q_f32(tmp[m][6]);\n                float32x4_t _r7 = vld1q_f32(tmp[m][7]);\n\n                float32x4_t _tmp024a = vaddq_f32(_r1, _r2);\n                float32x4_t _tmp135a = vsubq_f32(_r1, _r2);\n                float32x4_t _tmp024b = vaddq_f32(_r3, _r4);\n                float32x4_t _tmp135b = vsubq_f32(_r3, _r4);\n                float32x4_t _tmp024c = vaddq_f32(_r5, _r6);\n                float32x4_t _tmp135c = vsubq_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x4_t _tmp0 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r0, _tmp024a), vfmaq_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0)));\n                float32x4_t _tmp1 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, _coeffs, 1));\n                float32x4_t _tmp2 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2));\n                float32x4_t _tmp3 = vaddq_f32(_bias0, vfmaq_laneq_f32(vfmaq_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3));\n                float32x4_t _tmp4 = vaddq_f32(_bias0, vfmaq_lane_f32(vfmaq_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2, 0));\n                float32x4_t _tmp5 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r7, _tmp135a), vfmaq_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0)));\n#else\n                float32x4_t _tmp0 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r0, _tmp024a), vmlaq_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0)));\n                float32x4_t _tmp1 = 
vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, _v2, 0), _tmp135c, vget_low_f32(_coeffs), 1));\n                float32x4_t _tmp2 = vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0));\n                float32x4_t _tmp3 = vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1));\n                float32x4_t _tmp4 = vaddq_f32(_bias0, vmlaq_lane_f32(vmlaq_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2, 0));\n                float32x4_t _tmp5 = vaddq_f32(_bias0, vaddq_f32(vaddq_f32(_r7, _tmp135a), vmlaq_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0)));\n#endif\n\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_tmp0));\n                    if (tj * 6 + 1 < outw) vst1_u16(outptr0 + 4, float2bfloat(_tmp1));\n                    if (tj * 6 + 2 < outw) vst1_u16(outptr0 + 8, float2bfloat(_tmp2));\n                    if (tj * 6 + 3 < outw) vst1_u16(outptr0 + 12, float2bfloat(_tmp3));\n                    if (tj * 6 + 4 < outw) vst1_u16(outptr0 + 16, float2bfloat(_tmp4));\n                    if (tj * 6 + 5 < outw) vst1_u16(outptr0 + 20, float2bfloat(_tmp5));\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short tmp0[4];\n                    unsigned short tmp1[4];\n                    unsigned short tmp2[4];\n                    unsigned short tmp3[4];\n                    unsigned short tmp4[4];\n                    unsigned short tmp5[4];\n                    vst1_u16(tmp0, float2bfloat(_tmp0));\n                    vst1_u16(tmp1, float2bfloat(_tmp1));\n                    vst1_u16(tmp2, float2bfloat(_tmp2));\n                    vst1_u16(tmp3, float2bfloat(_tmp3));\n                    vst1_u16(tmp4, float2bfloat(_tmp4));\n                    
vst1_u16(tmp5, float2bfloat(_tmp5));\n\n                    unsigned short* outptr1 = outptr0 + N;\n                    unsigned short* outptr2 = outptr0 + N * 2;\n                    unsigned short* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = tmp4[0];\n                        outptr1[4] = tmp4[1];\n                        outptr2[4] = tmp4[2];\n                        outptr3[4] = tmp4[3];\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp5[0];\n                        outptr1[5] = tmp5[1];\n                        outptr2[5] = tmp5[2];\n                        outptr3[5] = tmp5[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        float32x2_t 
_bias0 = biasptr ? vld1_f32(biasptr + i + ii) : vdup_n_f32(0.f);\n#else\n        float bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        float bias1 = biasptr ? biasptr[i + ii + 1] : 0.f;\n#endif\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        float tmp[6][8][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj * 2;\n            const float* r1 = r0 + max_jj * 2;\n            const float* r2 = r0 + max_jj * 2 * 2;\n            const float* r3 = r0 + max_jj * 2 * 3;\n            const float* r4 = r0 + max_jj * 2 * 4;\n            const float* r5 = r0 + max_jj * 2 * 5;\n            const float* r6 = r0 + max_jj * 2 * 6;\n            const float* r7 = r0 + max_jj * 2 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(r0);\n                float32x2_t _r1 = vld1_f32(r1);\n                float32x2_t _r2 = vld1_f32(r2);\n                float32x2_t _r3 = vld1_f32(r3);\n                float32x2_t _r4 = vld1_f32(r4);\n                float32x2_t _r5 = vld1_f32(r5);\n                float32x2_t _r6 = vld1_f32(r6);\n                float32x2_t _r7 = vld1_f32(r7);\n\n                float32x2_t _tmp024a = vadd_f32(_r1, _r2);\n                float32x2_t _tmp135a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp024b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp135b = vsub_f32(_r3, _r4);\n                float32x2_t _tmp024c = vadd_f32(_r5, _r6);\n                float32x2_t _tmp135c = vsub_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp024a), vfma_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0));\n                float32x2_t _tmp1 = vfma_laneq_f32(vfma_f32(_tmp135a, _tmp135b, _v2), _tmp135c, _coeffs, 
1);\n                float32x2_t _tmp2 = vfma_laneq_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2);\n                float32x2_t _tmp3 = vfma_laneq_f32(vfma_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3);\n                float32x2_t _tmp4 = vfma_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2);\n                float32x2_t _tmp5 = vadd_f32(vadd_f32(_r7, _tmp135a), vfma_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0));\n#else\n                float32x2_t _tmp0 = vadd_f32(vadd_f32(_r0, _tmp024a), vmla_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0));\n                float32x2_t _tmp1 = vmla_lane_f32(vmla_f32(_tmp135a, _tmp135b, _v2), _tmp135c, vget_low_f32(_coeffs), 1);\n                float32x2_t _tmp2 = vmla_lane_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0);\n                float32x2_t _tmp3 = vmla_lane_f32(vmla_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1);\n                float32x2_t _tmp4 = vmla_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2);\n                float32x2_t _tmp5 = vadd_f32(vadd_f32(_r7, _tmp135a), vmla_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0));\n#endif\n\n                vst1_f32(tmp[0][m], _tmp0);\n                vst1_f32(tmp[1][m], _tmp1);\n                vst1_f32(tmp[2][m], _tmp2);\n                vst1_f32(tmp[3][m], _tmp3);\n                vst1_f32(tmp[4][m], _tmp4);\n                vst1_f32(tmp[5][m], _tmp5);\n#else\n                float tmp024a0 = r1[0] + r2[0];\n                float tmp024a1 = r1[1] + r2[1];\n                float tmp135a0 = r1[0] - r2[0];\n                float tmp135a1 = r1[1] - r2[1];\n                float tmp024b0 = r3[0] + r4[0];\n                float tmp024b1 = r3[1] + r4[1];\n                float tmp135b0 = r3[0] - r4[0];\n                float tmp135b1 = r3[1] - r4[1];\n                
float tmp024c0 = r5[0] + r6[0];\n                float tmp024c1 = r5[1] + r6[1];\n                float tmp135c0 = r5[0] - r6[0];\n                float tmp135c1 = r5[1] - r6[1];\n\n                tmp[0][m][0] = r0[0] + tmp024a0 + tmp024b0 + tmp024c0 * 32;\n                tmp[0][m][1] = r0[1] + tmp024a1 + tmp024b1 + tmp024c1 * 32;\n                tmp[1][m][0] = tmp135a0 + tmp135b0 + tmp135b0 + tmp135c0 * 16;\n                tmp[1][m][1] = tmp135a1 + tmp135b1 + tmp135b1 + tmp135c1 * 16;\n                tmp[2][m][0] = tmp024a0 + tmp024b0 * 4 + tmp024c0 * 8;\n                tmp[2][m][1] = tmp024a1 + tmp024b1 * 4 + tmp024c1 * 8;\n                tmp[3][m][0] = tmp135a0 + tmp135b0 * 8 + tmp135c0 * 4;\n                tmp[3][m][1] = tmp135a1 + tmp135b1 * 8 + tmp135c1 * 4;\n                tmp[4][m][0] = tmp024a0 + tmp024b0 * 16 + tmp024c0 + tmp024c0;\n                tmp[4][m][1] = tmp024a1 + tmp024b1 * 16 + tmp024c1 + tmp024c1;\n                tmp[5][m][0] = r7[0] + tmp135a0 + tmp135b0 * 32 + tmp135c0;\n                tmp[5][m][1] = r7[1] + tmp135a1 + tmp135b1 * 32 + tmp135c1;\n#endif\n\n                r0 += max_jj * 8 * 2;\n                r1 += max_jj * 8 * 2;\n                r2 += max_jj * 8 * 2;\n                r3 += max_jj * 8 * 2;\n                r4 += max_jj * 8 * 2;\n                r5 += max_jj * 8 * 2;\n                r6 += max_jj * 8 * 2;\n                r7 += max_jj * 8 * 2;\n            }\n\n            unsigned short* outptr0 = top_blob.channel(i + ii).row<unsigned short>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n#if __ARM_NEON\n                float32x2_t _r0 = vld1_f32(tmp[m][0]);\n                float32x2_t _r1 = vld1_f32(tmp[m][1]);\n                float32x2_t _r2 = vld1_f32(tmp[m][2]);\n                float32x2_t _r3 = vld1_f32(tmp[m][3]);\n                float32x2_t _r4 = vld1_f32(tmp[m][4]);\n                float32x2_t 
_r5 = vld1_f32(tmp[m][5]);\n                float32x2_t _r6 = vld1_f32(tmp[m][6]);\n                float32x2_t _r7 = vld1_f32(tmp[m][7]);\n\n                float32x2_t _tmp024a = vadd_f32(_r1, _r2);\n                float32x2_t _tmp135a = vsub_f32(_r1, _r2);\n                float32x2_t _tmp024b = vadd_f32(_r3, _r4);\n                float32x2_t _tmp135b = vsub_f32(_r3, _r4);\n                float32x2_t _tmp024c = vadd_f32(_r5, _r6);\n                float32x2_t _tmp135c = vsub_f32(_r5, _r6);\n\n#if __aarch64__\n                float32x2_t _tmp0 = vadd_f32(_bias0, vadd_f32(vadd_f32(_r0, _tmp024a), vfma_laneq_f32(_tmp024b, _tmp024c, _coeffs, 0)));\n                float32x2_t _tmp1 = vadd_f32(_bias0, vfma_laneq_f32(vfma_f32(_tmp135a, _tmp135b, _v2), _tmp135c, _coeffs, 1));\n                float32x2_t _tmp2 = vadd_f32(_bias0, vfma_laneq_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2));\n                float32x2_t _tmp3 = vadd_f32(_bias0, vfma_laneq_f32(vfma_laneq_f32(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3));\n                float32x2_t _tmp4 = vadd_f32(_bias0, vfma_f32(vfma_laneq_f32(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _v2));\n                float32x2_t _tmp5 = vadd_f32(_bias0, vadd_f32(vadd_f32(_r7, _tmp135a), vfma_laneq_f32(_tmp135c, _tmp135b, _coeffs, 0)));\n#else\n                float32x2_t _tmp0 = vadd_f32(_bias0, vadd_f32(vadd_f32(_r0, _tmp024a), vmla_lane_f32(_tmp024b, _tmp024c, vget_low_f32(_coeffs), 0)));\n                float32x2_t _tmp1 = vadd_f32(_bias0, vmla_lane_f32(vmla_f32(_tmp135a, _tmp135b, _v2), _tmp135c, vget_low_f32(_coeffs), 1));\n                float32x2_t _tmp2 = vadd_f32(_bias0, vmla_lane_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_high_f32(_coeffs), 1), _tmp024c, vget_high_f32(_coeffs), 0));\n                float32x2_t _tmp3 = vadd_f32(_bias0, vmla_lane_f32(vmla_lane_f32(_tmp135a, _tmp135b, vget_high_f32(_coeffs), 0), _tmp135c, vget_high_f32(_coeffs), 1));\n                float32x2_t 
_tmp4 = vadd_f32(_bias0, vmla_f32(vmla_lane_f32(_tmp024a, _tmp024b, vget_low_f32(_coeffs), 1), _tmp024c, _v2));\n                float32x2_t _tmp5 = vadd_f32(_bias0, vadd_f32(vadd_f32(_r7, _tmp135a), vmla_lane_f32(_tmp135c, _tmp135b, vget_low_f32(_coeffs), 0)));\n#endif\n#else\n                float r00 = tmp[m][0][0];\n                float r01 = tmp[m][0][1];\n                float r10 = tmp[m][1][0];\n                float r11 = tmp[m][1][1];\n                float r20 = tmp[m][2][0];\n                float r21 = tmp[m][2][1];\n                float r30 = tmp[m][3][0];\n                float r31 = tmp[m][3][1];\n                float r40 = tmp[m][4][0];\n                float r41 = tmp[m][4][1];\n                float r50 = tmp[m][5][0];\n                float r51 = tmp[m][5][1];\n                float r60 = tmp[m][6][0];\n                float r61 = tmp[m][6][1];\n                float r70 = tmp[m][7][0];\n                float r71 = tmp[m][7][1];\n\n                float tmp024a0 = r10 + r20;\n                float tmp024a1 = r11 + r21;\n                float tmp135a0 = r10 - r20;\n                float tmp135a1 = r11 - r21;\n                float tmp024b0 = r30 + r40;\n                float tmp024b1 = r31 + r41;\n                float tmp135b0 = r30 - r40;\n                float tmp135b1 = r31 - r41;\n                float tmp024c0 = r50 + r60;\n                float tmp024c1 = r51 + r61;\n                float tmp135c0 = r50 - r60;\n                float tmp135c1 = r51 - r61;\n\n                float tmp00 = bias0 + r00 + tmp024a0 + tmp024b0 + tmp024c0 * 32;\n                float tmp01 = bias1 + r01 + tmp024a1 + tmp024b1 + tmp024c1 * 32;\n                float tmp10 = bias0 + tmp135a0 + tmp135b0 + tmp135b0 + tmp135c0 * 16;\n                float tmp11 = bias1 + tmp135a1 + tmp135b1 + tmp135b1 + tmp135c1 * 16;\n                float tmp20 = bias0 + tmp024a0 + tmp024b0 * 4 + tmp024c0 * 8;\n                float tmp21 = bias1 + tmp024a1 + tmp024b1 * 4 + 
tmp024c1 * 8;\n                float tmp30 = bias0 + tmp135a0 + tmp135b0 * 8 + tmp135c0 * 4;\n                float tmp31 = bias1 + tmp135a1 + tmp135b1 * 8 + tmp135c1 * 4;\n                float tmp40 = bias0 + tmp024a0 + tmp024b0 * 16 + tmp024c0 + tmp024c0;\n                float tmp41 = bias1 + tmp024a1 + tmp024b1 * 16 + tmp024c1 + tmp024c1;\n                float tmp50 = bias0 + r70 + tmp135a0 + tmp135b0 * 32 + tmp135c0;\n                float tmp51 = bias1 + r71 + tmp135a1 + tmp135b1 * 32 + tmp135c1;\n#endif\n\n                // if (out_elempack == 1)\n                {\n                    unsigned short* outptr1 = outptr0 + N;\n\n#if __ARM_NEON\n                    uint16x4_t _tmp01 = float2bfloat(vcombine_f32(_tmp0, _tmp1));\n                    uint16x4_t _tmp23 = float2bfloat(vcombine_f32(_tmp2, _tmp3));\n                    uint16x4_t _tmp45 = float2bfloat(vcombine_f32(_tmp4, _tmp5));\n\n                    outptr0[0] = vget_lane_u16(_tmp01, 0);\n                    outptr1[0] = vget_lane_u16(_tmp01, 1);\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = vget_lane_u16(_tmp01, 2);\n                        outptr1[1] = vget_lane_u16(_tmp01, 3);\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = vget_lane_u16(_tmp23, 0);\n                        outptr1[2] = vget_lane_u16(_tmp23, 1);\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = vget_lane_u16(_tmp23, 2);\n                        outptr1[3] = vget_lane_u16(_tmp23, 3);\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = vget_lane_u16(_tmp45, 0);\n                        outptr1[4] = vget_lane_u16(_tmp45, 1);\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        
outptr0[5] = vget_lane_u16(_tmp45, 2);\n                        outptr1[5] = vget_lane_u16(_tmp45, 3);\n                    }\n#else\n                    outptr0[0] = float32_to_bfloat16(tmp00);\n                    outptr1[0] = float32_to_bfloat16(tmp01);\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = float32_to_bfloat16(tmp10);\n                        outptr1[1] = float32_to_bfloat16(tmp11);\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = float32_to_bfloat16(tmp20);\n                        outptr1[2] = float32_to_bfloat16(tmp21);\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = float32_to_bfloat16(tmp30);\n                        outptr1[3] = float32_to_bfloat16(tmp31);\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = float32_to_bfloat16(tmp40);\n                        outptr1[4] = float32_to_bfloat16(tmp41);\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = float32_to_bfloat16(tmp50);\n                        outptr1[5] = float32_to_bfloat16(tmp51);\n                    }\n#endif\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        float bias0 = biasptr ? 
biasptr[i + ii] : 0.f;\n\n        float tmp[6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const float* r0 = (const float*)top_tile + ii * max_jj * 64 + jj;\n            const float* r1 = r0 + max_jj;\n            const float* r2 = r0 + max_jj * 2;\n            const float* r3 = r0 + max_jj * 3;\n            const float* r4 = r0 + max_jj * 4;\n            const float* r5 = r0 + max_jj * 5;\n            const float* r6 = r0 + max_jj * 6;\n            const float* r7 = r0 + max_jj * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float tmp024a = r1[0] + r2[0];\n                float tmp135a = r1[0] - r2[0];\n                float tmp024b = r3[0] + r4[0];\n                float tmp135b = r3[0] - r4[0];\n                float tmp024c = r5[0] + r6[0];\n                float tmp135c = r5[0] - r6[0];\n\n                tmp[0][m] = r0[0] + tmp024a + tmp024b + tmp024c * 32;\n                tmp[1][m] = tmp135a + tmp135b + tmp135b + tmp135c * 16;\n                tmp[2][m] = tmp024a + tmp024b * 4 + tmp024c * 8;\n                tmp[3][m] = tmp135a + tmp135b * 8 + tmp135c * 4;\n                tmp[4][m] = tmp024a + tmp024b * 16 + tmp024c + tmp024c;\n                tmp[5][m] = r7[0] + tmp135a + tmp135b * 32 + tmp135c;\n\n                r0 += max_jj * 8;\n                r1 += max_jj * 8;\n                r2 += max_jj * 8;\n                r3 += max_jj * 8;\n                r4 += max_jj * 8;\n                r5 += max_jj * 8;\n                r6 += max_jj * 8;\n                r7 += max_jj * 8;\n            }\n\n            unsigned short* outptr0 = top_blob.channel(i + ii).row<unsigned short>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n 
               float r2 = tmp[m][2];\n                float r3 = tmp[m][3];\n                float r4 = tmp[m][4];\n                float r5 = tmp[m][5];\n                float r6 = tmp[m][6];\n                float r7 = tmp[m][7];\n\n                float tmp024a = r1 + r2;\n                float tmp135a = r1 - r2;\n                float tmp024b = r3 + r4;\n                float tmp135b = r3 - r4;\n                float tmp024c = r5 + r6;\n                float tmp135c = r5 - r6;\n\n                float tmp0 = bias0 + r0 + tmp024a + tmp024b + tmp024c * 32;\n                float tmp1 = bias0 + tmp135a + tmp135b + tmp135b + tmp135c * 16;\n                float tmp2 = bias0 + tmp024a + tmp024b * 4 + tmp024c * 8;\n                float tmp3 = bias0 + tmp135a + tmp135b * 8 + tmp135c * 4;\n                float tmp4 = bias0 + tmp024a + tmp024b * 16 + tmp024c + tmp024c;\n                float tmp5 = bias0 + r7 + tmp135a + tmp135b * 32 + tmp135c;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(tmp0);\n                    if (tj * 6 + 1 < outw) outptr0[1] = float32_to_bfloat16(tmp1);\n                    if (tj * 6 + 2 < outw) outptr0[2] = float32_to_bfloat16(tmp2);\n                    if (tj * 6 + 3 < outw) outptr0[3] = float32_to_bfloat16(tmp3);\n                    if (tj * 6 + 4 < outw) outptr0[4] = float32_to_bfloat16(tmp4);\n                    if (tj * 6 + 5 < outw) outptr0[5] = float32_to_bfloat16(tmp5);\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd63_bf16s(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 6n+2, winograd F(6,3)\n    int w_tiles = (outw + 5) / 6;\n    int h_tiles = (outh + 5) / 6;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const 
int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 64;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd63_bf16s %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 4u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd63_transform_input_tile_bf16s(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 4u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * 
TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd63_transform_input_tile_bf16s(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd63_transform_output_tile_bf16s(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_winograd_fp16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd_pack_A_tile_fp16(const Mat& A, Mat& AT, int batch, int max_ii, int max_kk)\n{\n    const int N = max_kk * batch;\n\n    for (int b = 0; b < batch; b++)\n    {\n        unsigned short* pp = AT.row<unsigned short>(b);\n\n        int ii = 0;\n        for (; ii + 7 < max_ii; ii += 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                pp[2] = p0[2 * N];\n                pp[3] = p0[3 * N];\n                pp[4] = p0[4 * N];\n                pp[5] = p0[5 * N];\n                pp[6] = p0[6 * N];\n                pp[7] = p0[7 * N];\n                p0 += batch;\n                pp += 8;\n            }\n        }\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                pp[2] = p0[2 * N];\n                pp[3] = p0[3 * N];\n                p0 += batch;\n                pp += 4;\n            }\n        }\n        for (; ii + 1 < max_ii; ii += 2)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                p0 += batch;\n                pp += 2;\n            }\n        }\n        for (; ii < max_ii; ii++)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                p0 += batch;\n         
       pp += 1;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd_transpose_pack_B_tile_fp16(const Mat& B, Mat& BT, int batch, int max_jj, int max_kk, int nT)\n{\n    #pragma omp parallel for num_threads(nT)\n    for (int b = 0; b < batch; b++)\n    {\n        unsigned short* pp = BT.row<unsigned short>(b);\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const unsigned short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x12\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v8.8h, v9.8h, v10.8h, v11.8h}, [%0] \\n\"\n\n                    \"uzp1   v12.8h, v0.8h, v4.8h        \\n\"\n                    \"uzp2   v16.8h, v0.8h, v4.8h        \\n\"\n                    \"uzp1   v13.8h, v1.8h, v5.8h        \\n\"\n                    \"uzp2   v17.8h, v1.8h, v5.8h        \\n\"\n                    \"uzp1   v14.8h, v2.8h, v6.8h        \\n\"\n                    \"uzp2   v18.8h, v2.8h, v6.8h        \\n\"\n                    \"uzp1   v15.8h, v3.8h, v7.8h        \\n\"\n                    \"uzp2   v19.8h, v3.8h, v7.8h        \\n\"\n                    \"uzp1   v20.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp2   v22.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp1   v21.8h, v10.8h, v11.8h      \\n\"\n                    \"uzp2   v23.8h, v10.8h, v11.8h      \\n\"\n\n                    \"sub    %0, %0, #128                \\n\"\n\n                    \"ext    v24.16b, v20.16b, v20.16b, #8 \\n\"\n       
             \"ext    v26.16b, v22.16b, v22.16b, #8 \\n\"\n                    \"ext    v25.16b, v21.16b, v21.16b, #8 \\n\"\n                    \"ext    v27.16b, v23.16b, v23.16b, #8 \\n\"\n\n                    \"st1    {v12.8h}, [%1], #16         \\n\"\n                    \"st1    {v20.4h}, [%1], #8          \\n\"\n                    \"st1    {v13.8h}, [%1], #16         \\n\"\n                    \"st1    {v24.4h}, [%1], #8          \\n\"\n                    \"st1    {v14.8h}, [%1], #16         \\n\"\n                    \"st1    {v21.4h}, [%1], #8          \\n\"\n                    \"st1    {v15.8h}, [%1], #16         \\n\"\n                    \"st1    {v25.4h}, [%1], #8          \\n\"\n                    \"st1    {v16.8h}, [%1], #16         \\n\"\n                    \"st1    {v22.4h}, [%1], #8          \\n\"\n                    \"st1    {v17.8h}, [%1], #16         \\n\"\n                    \"st1    {v26.4h}, [%1], #8          \\n\"\n                    \"st1    {v18.8h}, [%1], #16         \\n\"\n                    \"st1    {v23.4h}, [%1], #8          \\n\"\n                    \"st1    {v19.8h}, [%1], #16         \\n\"\n                    \"st1    {v27.4h}, [%1], #8          \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8x4_t _r0 = vld4q_u16(p0);\n                uint16x8x4_t _r1 = vld4q_u16(p0 + 32);\n                uint16x8x4_t _r2 = vld4q_u16(p0 + 64);\n                uint16x8x2_t _r04lm = vuzpq_u16(_r0.val[0], _r1.val[0]);\n                uint16x8x2_t _r15lm = 
vuzpq_u16(_r0.val[1], _r1.val[1]);\n                uint16x8x2_t _r26lm = vuzpq_u16(_r0.val[2], _r1.val[2]);\n                uint16x8x2_t _r37lm = vuzpq_u16(_r0.val[3], _r1.val[3]);\n                uint16x8x2_t _r0145h = vuzpq_u16(_r2.val[0], _r2.val[1]);\n                uint16x8x2_t _r2367h = vuzpq_u16(_r2.val[2], _r2.val[3]);\n                vst1q_u16(pp, _r04lm.val[0]);\n                vst1_u16(pp + 8, vget_low_u16(_r0145h.val[0]));\n                vst1q_u16(pp + 12, _r15lm.val[0]);\n                vst1_u16(pp + 20, vget_high_u16(_r0145h.val[0]));\n                vst1q_u16(pp + 24, _r26lm.val[0]);\n                vst1_u16(pp + 32, vget_low_u16(_r2367h.val[0]));\n                vst1q_u16(pp + 36, _r37lm.val[0]);\n                vst1_u16(pp + 44, vget_high_u16(_r2367h.val[0]));\n                vst1q_u16(pp + 48, _r04lm.val[1]);\n                vst1_u16(pp + 56, vget_low_u16(_r0145h.val[1]));\n                vst1q_u16(pp + 60, _r15lm.val[1]);\n                vst1_u16(pp + 68, vget_high_u16(_r0145h.val[1]));\n                vst1q_u16(pp + 72, _r26lm.val[1]);\n                vst1_u16(pp + 80, vget_low_u16(_r2367h.val[1]));\n                vst1q_u16(pp + 84, _r37lm.val[1]);\n                vst1_u16(pp + 92, vget_high_u16(_r2367h.val[1]));\n                p0 += max_jj * batch * 8;\n                pp += 96;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 4x12\n                uint16x8x4_t _r01 = vld4q_u16(p0);\n                uint16x4x4_t _r2 = vld4_u16(p0 + 32);\n                vst1q_u16(pp, _r01.val[0]);\n                vst1_u16(pp + 8, _r2.val[0]);\n                vst1q_u16(pp + 12, _r01.val[1]);\n                vst1_u16(pp + 20, _r2.val[1]);\n                vst1q_u16(pp + 24, _r01.val[2]);\n                vst1_u16(pp + 32, _r2.val[2]);\n                vst1q_u16(pp + 
36, _r01.val[3]);\n                vst1_u16(pp + 44, _r2.val[3]);\n                p0 += max_jj * batch * 4;\n                pp += 48;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // transpose 2x12\n                uint16x8x2_t _r01 = vld2q_u16(p0);\n                uint16x4x2_t _r2 = vld2_u16(p0 + 16);\n                vst1q_u16(pp, _r01.val[0]);\n                vst1_u16(pp + 8, _r2.val[0]);\n                vst1q_u16(pp + 12, _r01.val[1]);\n                vst1_u16(pp + 20, _r2.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 24;\n            }\n            p0 -= (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _r01 = vld1q_u16(p0);\n                uint16x4_t _r2 = vld1_u16(p0 + 8);\n                vst1q_u16(pp, _r01);\n                vst1_u16(pp + 8, _r2);\n                p0 += max_jj * batch;\n                pp += 12;\n            }\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const unsigned short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x8\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0] \\n\"\n\n                    \"uzp1   v8.8h, v0.8h, v4.8h         \\n\"\n                    \"uzp2   v12.8h, v0.8h, v4.8h        \\n\"\n                    \"uzp1   v9.8h, v1.8h, v5.8h         \\n\"\n                    \"uzp2   v13.8h, v1.8h, v5.8h        \\n\"\n\n                    \"sub    %0, %0, #64                 \\n\"\n\n                    \"uzp1   v10.8h, v2.8h, 
v6.8h        \\n\"\n                    \"uzp2   v14.8h, v2.8h, v6.8h        \\n\"\n                    \"uzp1   v11.8h, v3.8h, v7.8h        \\n\"\n                    \"uzp2   v15.8h, v3.8h, v7.8h        \\n\"\n\n                    \"st1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%1], #64 \\n\"\n                    \"st1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8x4_t _r0 = vld4q_u16(p0);\n                uint16x8x4_t _r1 = vld4q_u16(p0 + 32);\n                uint16x8x2_t _r04 = vuzpq_u16(_r0.val[0], _r1.val[0]);\n                uint16x8x2_t _r15 = vuzpq_u16(_r0.val[1], _r1.val[1]);\n                uint16x8x2_t _r26 = vuzpq_u16(_r0.val[2], _r1.val[2]);\n                uint16x8x2_t _r37 = vuzpq_u16(_r0.val[3], _r1.val[3]);\n                vst1q_u16(pp, _r04.val[0]);\n                vst1q_u16(pp + 8, _r15.val[0]);\n                vst1q_u16(pp + 8 * 2, _r26.val[0]);\n                vst1q_u16(pp + 8 * 3, _r37.val[0]);\n                vst1q_u16(pp + 8 * 4, _r04.val[1]);\n                vst1q_u16(pp + 8 * 5, _r15.val[1]);\n                vst1q_u16(pp + 8 * 6, _r26.val[1]);\n                vst1q_u16(pp + 8 * 7, _r37.val[1]);\n                p0 += max_jj * batch * 8;\n                pp += 64;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 4x8\n                uint16x8x4_t _r0 = vld4q_u16(p0);\n                vst1q_u16(pp, _r0.val[0]);\n                vst1q_u16(pp + 8, 
_r0.val[1]);\n                vst1q_u16(pp + 16, _r0.val[2]);\n                vst1q_u16(pp + 24, _r0.val[3]);\n                p0 += max_jj * batch * 4;\n                pp += 32;\n            }\n            p0 -= (b * max_jj + jj) * 4;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // transpose 2x8\n                uint16x8x2_t _r0 = vld2q_u16(p0);\n                vst1q_u16(pp, _r0.val[0]);\n                vst1q_u16(pp + 8, _r0.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 16;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _r0 = vld1q_u16(p0);\n                vst1q_u16(pp, _r0);\n                p0 += max_jj * batch;\n                pp += 8;\n            }\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const unsigned short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x4\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                    \"st4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8x4_t _r0;\n                _r0.val[0] = vld1q_u16(p0);\n                _r0.val[1] = vld1q_u16(p0 + 8);\n                _r0.val[2] = vld1q_u16(p0 + 16);\n                _r0.val[3] = vld1q_u16(p0 + 24);\n                vst4q_u16(pp, _r0);\n     
           p0 += max_jj * batch * 8;\n                pp += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 4x4\n                uint16x4x4_t _r0;\n                _r0.val[0] = vld1_u16(p0);\n                _r0.val[1] = vld1_u16(p0 + 4);\n                _r0.val[2] = vld1_u16(p0 + 8);\n                _r0.val[3] = vld1_u16(p0 + 12);\n                vst4_u16(pp, _r0);\n                p0 += max_jj * batch * 4;\n                pp += 16;\n            }\n            p0 -= (b * max_jj + jj) * 4;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // transpose 2x4\n                uint16x4x2_t _r0 = vld2_u16(p0);\n                vst1_u16(pp, _r0.val[0]);\n                vst1_u16(pp + 4, _r0.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 8;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                uint16x4_t _r0 = vld1_u16(p0);\n                vst1_u16(pp, _r0);\n                p0 += max_jj * batch;\n                pp += 4;\n            }\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const unsigned short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x2\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]       \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%0]        \\n\"\n                    \"st2    {v0.8h, v1.8h}, [%1], #32   \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    
\"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8x2_t _r0;\n                _r0.val[0] = vld1q_u16(p0);\n                _r0.val[1] = vld1q_u16(p0 + 8);\n                vst2q_u16(pp, _r0);\n                p0 += max_jj * batch * 8;\n                pp += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                // transpose 4x2\n                uint16x4x2_t _r0;\n                _r0.val[0] = vld1_u16(p0);\n                _r0.val[1] = vld1_u16(p0 + 4);\n                vst2_u16(pp, _r0);\n                p0 += max_jj * batch * 4;\n                pp += 8;\n            }\n            p0 -= (b * max_jj + jj) * 4;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[2];\n                pp[2] = p0[1];\n                pp[3] = p0[3];\n                p0 += max_jj * batch * 2;\n                pp += 4;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                p0 += max_jj * batch;\n                pp += 2;\n            }\n        }\n        for (; jj < max_jj; jj++)\n        {\n            const unsigned short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _r0 = vld1q_u16(p0);\n                vst1q_u16(pp, _r0);\n                p0 += max_jj * batch * 8;\n                pp += 8;\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 4;\n            for (; kk + 3 < max_kk; 
kk += 4)\n            {\n                uint16x4_t _r0 = vld1_u16(p0);\n                vst1_u16(pp, _r0);\n                p0 += max_jj * batch * 4;\n                pp += 4;\n            }\n            p0 -= (b * max_jj + jj) * 4;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                p0 += max_jj * batch * 2;\n                pp += 2;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                p0 += max_jj * batch;\n                pp += 1;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd_gemm_transB_packed_tile_fp16sa(const Mat& AT_tile, const Mat& BT_tile, Mat& top_blob, int batch, int max_ii, int max_jj, int k, int max_kk, int use_a53_a55_optimized_kernel)\n{\n    // NCNN_LOGE(\"conv3x3s1_winograd_gemm_transB_packed_tile_fp16sa %d %d %d\", max_ii, max_jj, max_kk);\n    __fp16* outptr = top_blob;\n\n    int ii = 0;\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const __fp16* pAT = AT_tile.row<const __fp16>(b) + max_kk * ii;\n            const __fp16* pB = BT_tile.row<const __fp16>(b);\n\n            int jj = 0;\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const __fp16* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                if (use_a53_a55_optimized_kernel)\n                {\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                        \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                        \"subs   %0, %0, #128            
    \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.8h}, [%1], #16          \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.8h}, [%2], #16          \\n\"\n\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n\n                        \".align 4                           \\n\"\n                        \"2:                                 \\n\"\n                        \"ldr    d5, [%1], #8                \\n\"\n                        \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                        \"ldr    x25, [%1], #8         
      \\n\"\n                        \"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                        \"ldr    d2, [%2], #8                \\n\"\n                        \"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                        \"ldr    x22, [%2], #8               \\n\"\n                        \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                        \"ldr    d6, [%1], #8                \\n\"\n                        \"fmla   v24.8h, v4.8h, v0.h[4]      \\n\"\n                        \"ldr    x26, [%1], #8               \\n\"\n                        \"fmla   v25.8h, v4.8h, v0.h[5]      \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v26.8h, v4.8h, v0.h[6]      \\n\"\n                        \"ldr    d3, [%2], #8                \\n\"\n                        \"fmla   v27.8h, v4.8h, v0.h[7]      \\n\"\n                        \"ldr    x23, [%2], #8               \\n\"\n                        \"fmla   v28.8h, v4.8h, v1.h[0]      \\n\"\n                        \"prfm   pldl1keep, [%2, #256]       \\n\" // NOTE PRELOAD\n                        \"fmla   v29.8h, v4.8h, v1.h[1]      \\n\"\n                        \"ins    v5.d[1], x25                \\n\"\n                        \"fmla   v30.8h, v4.8h, v1.h[2]      \\n\"\n                        \"ldr    d8, [%2], #8                \\n\"\n                        \"fmla   v31.8h, v4.8h, v1.h[3]      \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v20.8h, v5.8h, v1.h[4]      \\n\"\n                        \"ldr    d7, [%1], #8                \\n\"\n                        \"fmla   v21.8h, v5.8h, v1.h[5]      \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v22.8h, v5.8h, v1.h[6]      \\n\"\n                        \"ldr    x27, [%1], #8               \\n\"\n                        \"fmla   v23.8h, v5.8h, v1.h[7]      
\\n\"\n                        \"ldr    d9, [%2], #8                \\n\"\n                        \"fmla   v24.8h, v5.8h, v2.h[0]      \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v25.8h, v5.8h, v2.h[1]      \\n\"\n                        \"ins    v6.d[1], x26                \\n\"\n                        \"fmla   v26.8h, v5.8h, v2.h[2]      \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v27.8h, v5.8h, v2.h[3]      \\n\"\n                        \"ldr    d4, [%1], #8                \\n\"\n                        \"fmla   v28.8h, v5.8h, v2.h[4]      \\n\"\n                        \"ldr    x24, [%1], #8               \\n\"\n                        \"fmla   v29.8h, v5.8h, v2.h[5]      \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v30.8h, v5.8h, v2.h[6]      \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v31.8h, v5.8h, v2.h[7]      \\n\"\n                        \"ldr    d0, [%2], #8                \\n\"\n                        \"fmla   v20.8h, v6.8h, v3.h[0]      \\n\"\n                        \"fmla   v21.8h, v6.8h, v3.h[1]      \\n\"\n                        \"fmla   v22.8h, v6.8h, v3.h[2]      \\n\"\n                        \"fmla   v23.8h, v6.8h, v3.h[3]      \\n\"\n                        \"fmla   v24.8h, v6.8h, v3.h[4]      \\n\"\n                        \"fmla   v25.8h, v6.8h, v3.h[5]      \\n\"\n                        \"ins    v8.d[1], x20                \\n\"\n                        \"fmla   v26.8h, v6.8h, v3.h[6]      \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v27.8h, v6.8h, v3.h[7]      \\n\"\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"fmla   v28.8h, v6.8h, v8.h[0]  
    \\n\"\n                        \"fmla   v29.8h, v6.8h, v8.h[1]      \\n\"\n                        \"ins    v7.d[1], x27                \\n\"\n                        \"fmla   v30.8h, v6.8h, v8.h[2]      \\n\"\n                        \"fmla   v31.8h, v6.8h, v8.h[3]      \\n\"\n                        \"fmla   v20.8h, v7.8h, v8.h[4]      \\n\"\n                        \"fmla   v21.8h, v7.8h, v8.h[5]      \\n\"\n                        \"ins    v9.d[1], x21                \\n\"\n                        \"fmla   v22.8h, v7.8h, v8.h[6]      \\n\"\n                        \"fmla   v23.8h, v7.8h, v8.h[7]      \\n\"\n                        \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v24.8h, v7.8h, v9.h[0]      \\n\"\n                        \"fmla   v25.8h, v7.8h, v9.h[1]      \\n\"\n                        \"ins    v4.d[1], x24                \\n\"\n                        \"fmla   v26.8h, v7.8h, v9.h[2]      \\n\"\n                        \"fmla   v27.8h, v7.8h, v9.h[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v28.8h, v7.8h, v9.h[4]      \\n\"\n                        \"fmla   v29.8h, v7.8h, v9.h[5]      \\n\"\n                        \"fmla   v30.8h, v7.8h, v9.h[6]      \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v31.8h, v7.8h, v9.h[7]      \\n\"\n                        \"bne    2b                          \\n\"\n\n                        \"sub    %1, %1, #16                 \\n\"\n                        \"sub    %2, %2, #32                 \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                       
          \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n                        \"ld1    {v4.8h}, [%1], #16          \\n\"\n                        \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                        \"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                        \"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                        \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                        \"fmla   v24.8h, v4.8h, v1.h[0]      \\n\"\n                        \"fmla   v25.8h, v4.8h, v1.h[1]      \\n\"\n                        \"fmla   v26.8h, v4.8h, v1.h[2]      \\n\"\n                        \"fmla   v27.8h, v4.8h, v1.h[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v28.8h, v4.8h, v2.h[0]      \\n\"\n                        \"fmla   v29.8h, v4.8h, v2.h[1]      \\n\"\n                        \"fmla   v30.8h, v4.8h, v2.h[2]      \\n\"\n                        \"fmla   v31.8h, v4.8h, v2.h[3]      \\n\"\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                        \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v20\", 
\"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                else\n                {\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                        \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                        \"subs   %0, %0, #128                \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                        \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                        \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                        \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"2:                                 \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        
\"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%2], #64 \\n\"\n\n                        \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                        \"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                        \"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                        \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                        \"fmla   v24.8h, v4.8h, v0.h[4]      \\n\"\n                        \"fmla   v25.8h, v4.8h, v0.h[5]      \\n\"\n                        \"fmla   v26.8h, v4.8h, v0.h[6]      \\n\"\n                        \"fmla   v27.8h, v4.8h, v0.h[7]      \\n\"\n                        \"fmla   v28.8h, v4.8h, v1.h[0]      \\n\"\n                        \"fmla   v29.8h, v4.8h, v1.h[1]      \\n\"\n                        \"fmla   v30.8h, v4.8h, v1.h[2]      \\n\"\n                        \"fmla   v31.8h, v4.8h, v1.h[3]      \\n\"\n\n                        \"fmla   v20.8h, v5.8h, v1.h[4]      \\n\"\n                        \"fmla   v21.8h, v5.8h, v1.h[5]      \\n\"\n                        \"fmla   v22.8h, v5.8h, v1.h[6]      \\n\"\n                        \"fmla   v23.8h, v5.8h, v1.h[7]      \\n\"\n                        \"fmla   v24.8h, v5.8h, v2.h[0]      \\n\"\n                        \"fmla   v25.8h, v5.8h, v2.h[1]      \\n\"\n                        \"fmla   v26.8h, v5.8h, v2.h[2]      \\n\"\n                        \"fmla   v27.8h, v5.8h, v2.h[3]      \\n\"\n                        \"fmla   v28.8h, v5.8h, v2.h[4]      \\n\"\n                        \"fmla   v29.8h, v5.8h, v2.h[5]      \\n\"\n                        \"fmla   v30.8h, v5.8h, v2.h[6]      \\n\"\n                        \"fmla   v31.8h, v5.8h, v2.h[7]      \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v8.8h, v9.8h}, [%2], #32   \\n\"\n\n                  
      \"fmla   v20.8h, v6.8h, v3.h[0]      \\n\"\n                        \"fmla   v21.8h, v6.8h, v3.h[1]      \\n\"\n                        \"fmla   v22.8h, v6.8h, v3.h[2]      \\n\"\n                        \"fmla   v23.8h, v6.8h, v3.h[3]      \\n\"\n                        \"fmla   v24.8h, v6.8h, v3.h[4]      \\n\"\n                        \"fmla   v25.8h, v6.8h, v3.h[5]      \\n\"\n                        \"fmla   v26.8h, v6.8h, v3.h[6]      \\n\"\n                        \"fmla   v27.8h, v6.8h, v3.h[7]      \\n\"\n                        \"fmla   v28.8h, v6.8h, v8.h[0]      \\n\"\n                        \"fmla   v29.8h, v6.8h, v8.h[1]      \\n\"\n                        \"fmla   v30.8h, v6.8h, v8.h[2]      \\n\"\n                        \"fmla   v31.8h, v6.8h, v8.h[3]      \\n\"\n\n                        \"subs   w4, w4, #1                  \\n\"\n\n                        \"fmla   v20.8h, v7.8h, v8.h[4]      \\n\"\n                        \"fmla   v21.8h, v7.8h, v8.h[5]      \\n\"\n                        \"fmla   v22.8h, v7.8h, v8.h[6]      \\n\"\n                        \"fmla   v23.8h, v7.8h, v8.h[7]      \\n\"\n                        \"fmla   v24.8h, v7.8h, v9.h[0]      \\n\"\n                        \"fmla   v25.8h, v7.8h, v9.h[1]      \\n\"\n                        \"fmla   v26.8h, v7.8h, v9.h[2]      \\n\"\n                        \"fmla   v27.8h, v7.8h, v9.h[3]      \\n\"\n                        \"fmla   v28.8h, v7.8h, v9.h[4]      \\n\"\n                        \"fmla   v29.8h, v7.8h, v9.h[5]      \\n\"\n                        \"fmla   v30.8h, v7.8h, v9.h[6]      \\n\"\n                        \"fmla   v31.8h, v7.8h, v9.h[7]      \\n\"\n\n                        \"bne    2b                          \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n          
              \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n                        \"ld1    {v4.8h}, [%1], #16          \\n\"\n                        \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                        \"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                        \"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                        \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                        \"fmla   v24.8h, v4.8h, v1.h[0]      \\n\"\n                        \"fmla   v25.8h, v4.8h, v1.h[1]      \\n\"\n                        \"fmla   v26.8h, v4.8h, v1.h[2]      \\n\"\n                        \"fmla   v27.8h, v4.8h, v1.h[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v28.8h, v4.8h, v2.h[0]      \\n\"\n                        \"fmla   v29.8h, v4.8h, v2.h[1]      \\n\"\n                        \"fmla   v30.8h, v4.8h, v2.h[2]      \\n\"\n                        \"fmla   v31.8h, v4.8h, v2.h[3]      \\n\"\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                        \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", 
\"v6\", \"v7\", \"v8\", \"v9\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n#else  // NCNN_GNU_INLINE_ASM\n                float16x8_t _sum0;\n                float16x8_t _sum1;\n                float16x8_t _sum2;\n                float16x8_t _sum3;\n                float16x8_t _sum4;\n                float16x8_t _sum5;\n                float16x8_t _sum6;\n                float16x8_t _sum7;\n                float16x8_t _sum8;\n                float16x8_t _sum9;\n                float16x8_t _suma;\n                float16x8_t _sumb;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                    _sum1 = vdupq_n_f16(0.f);\n                    _sum2 = vdupq_n_f16(0.f);\n                    _sum3 = vdupq_n_f16(0.f);\n                    _sum4 = vdupq_n_f16(0.f);\n                    _sum5 = vdupq_n_f16(0.f);\n                    _sum6 = vdupq_n_f16(0.f);\n                    _sum7 = vdupq_n_f16(0.f);\n                    _sum8 = vdupq_n_f16(0.f);\n                    _sum9 = vdupq_n_f16(0.f);\n                    _suma = vdupq_n_f16(0.f);\n                    _sumb = vdupq_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f16(outptr);\n                    _sum1 = vld1q_f16(outptr + 8);\n                    _sum2 = vld1q_f16(outptr + 16);\n                    _sum3 = vld1q_f16(outptr + 24);\n                    _sum4 = vld1q_f16(outptr + 32);\n                    _sum5 = vld1q_f16(outptr + 40);\n                    _sum6 = vld1q_f16(outptr + 48);\n                    _sum7 = vld1q_f16(outptr + 56);\n                    _sum8 = vld1q_f16(outptr + 64);\n                    _sum9 = vld1q_f16(outptr + 72);\n                    _suma = vld1q_f16(outptr + 80);\n                    _sumb = vld1q_f16(outptr + 88);\n                }\n\n                int kk = 0;\n                for 
(; kk < max_kk; kk++)\n                {\n                    float16x8_t _pA = vld1q_f16(pA);\n                    float16x8_t _pB0 = vld1q_f16(pB);\n                    float16x4_t _pB2 = vld1_f16(pB + 8);\n                    _sum0 = vfmaq_laneq_f16(_sum0, _pA, _pB0, 0);\n                    _sum1 = vfmaq_laneq_f16(_sum1, _pA, _pB0, 1);\n                    _sum2 = vfmaq_laneq_f16(_sum2, _pA, _pB0, 2);\n                    _sum3 = vfmaq_laneq_f16(_sum3, _pA, _pB0, 3);\n                    _sum4 = vfmaq_laneq_f16(_sum4, _pA, _pB0, 4);\n                    _sum5 = vfmaq_laneq_f16(_sum5, _pA, _pB0, 5);\n                    _sum6 = vfmaq_laneq_f16(_sum6, _pA, _pB0, 6);\n                    _sum7 = vfmaq_laneq_f16(_sum7, _pA, _pB0, 7);\n                    _sum8 = vfmaq_lane_f16(_sum8, _pA, _pB2, 0);\n                    _sum9 = vfmaq_lane_f16(_sum9, _pA, _pB2, 1);\n                    _suma = vfmaq_lane_f16(_suma, _pA, _pB2, 2);\n                    _sumb = vfmaq_lane_f16(_sumb, _pA, _pB2, 3);\n\n                    pA += 8;\n                    pB += 12;\n                }\n\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n                vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n                vst1q_f16(outptr + 8 * 4, _sum4);\n                vst1q_f16(outptr + 8 * 5, _sum5);\n                vst1q_f16(outptr + 8 * 6, _sum6);\n                vst1q_f16(outptr + 8 * 7, _sum7);\n                vst1q_f16(outptr + 8 * 8, _sum8);\n                vst1q_f16(outptr + 8 * 9, _sum9);\n                vst1q_f16(outptr + 8 * 10, _suma);\n                vst1q_f16(outptr + 8 * 11, _sumb);\n                outptr += 8 * 12;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const __fp16* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                if (use_a53_a55_optimized_kernel)\n                {\n                    asm 
volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                        \"subs   %0, %0, #64                 \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.8h}, [%1], #16          \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.8h}, [%2], #16          \\n\"\n\n                        \"ldr    d5, [%1], #8                \\n\"\n                        \"ldr    x25, [%1], #8               \\n\"\n\n                        \".align 4                           \\n\"\n                        \"2:                                 \\n\"\n                        \"ldr    d1, [%2], #8                \\n\"\n                        \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                     
   \"ldr    x21, [%2], #8               \\n\"\n                        \"fmla   v25.8h, v4.8h, v0.h[1]      \\n\"\n                        \"ins    v5.d[1], x25                \\n\"\n                        \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                        \"ldr    d6, [%1], #8                \\n\"\n                        \"fmla   v27.8h, v4.8h, v0.h[3]      \\n\"\n                        \"ldr    x26, [%1], #8               \\n\"\n                        \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                        \"ldr    d2, [%2], #8                \\n\"\n                        \"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                        \"ins    v1.d[1], x21                \\n\"\n                        \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                        \"ldr    x22, [%2], #8               \\n\"\n                        \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n                        \"ldr    d7, [%1], #8                \\n\"\n                        \"fmla   v24.8h, v5.8h, v1.h[0]      \\n\"\n                        \"ldr    x27, [%1], #8               \\n\"\n                        \"fmla   v25.8h, v5.8h, v1.h[1]      \\n\"\n                        \"ins    v6.d[1], x26                \\n\"\n                        \"fmla   v26.8h, v5.8h, v1.h[2]      \\n\"\n                        \"ldr    d3, [%2], #8                \\n\"\n                        \"fmla   v27.8h, v5.8h, v1.h[3]      \\n\"\n                        \"ldr    x23, [%2], #8               \\n\"\n                        \"fmla   v28.8h, v5.8h, v1.h[4]      \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v29.8h, v5.8h, v1.h[5]      \\n\"\n                        \"ins    v2.d[1], x22                \\n\"\n                        \"fmla   v30.8h, v5.8h, v1.h[6]      \\n\"\n                        \"ldr    d4, [%1], #8                \\n\"\n                        \"fmla   
v31.8h, v5.8h, v1.h[7]      \\n\"\n                        \"ldr    x24, [%1], #8               \\n\"\n                        \"fmla   v24.8h, v6.8h, v2.h[0]      \\n\"\n                        \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                        \"fmla   v25.8h, v6.8h, v2.h[1]      \\n\"\n                        \"ins    v7.d[1], x27                \\n\"\n                        \"fmla   v26.8h, v6.8h, v2.h[2]      \\n\"\n                        \"ldr    d0, [%2], #8                \\n\"\n                        \"fmla   v27.8h, v6.8h, v2.h[3]      \\n\"\n                        \"ldr    x20, [%2], #8               \\n\"\n                        \"fmla   v28.8h, v6.8h, v2.h[4]      \\n\"\n                        \"ldr    d5, [%1], #8                \\n\"\n                        \"fmla   v29.8h, v6.8h, v2.h[5]      \\n\"\n                        \"ins    v3.d[1], x23                \\n\"\n                        \"fmla   v30.8h, v6.8h, v2.h[6]      \\n\"\n                        \"ldr    x25, [%1], #8               \\n\"\n                        \"fmla   v31.8h, v6.8h, v2.h[7]      \\n\"\n                        \"fmla   v24.8h, v7.8h, v3.h[0]      \\n\"\n                        \"fmla   v25.8h, v7.8h, v3.h[1]      \\n\"\n                        \"fmla   v26.8h, v7.8h, v3.h[2]      \\n\"\n                        \"ins    v4.d[1], x24                \\n\"\n                        \"fmla   v27.8h, v7.8h, v3.h[3]      \\n\"\n                        \"fmla   v28.8h, v7.8h, v3.h[4]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v29.8h, v7.8h, v3.h[5]      \\n\"\n                        \"fmla   v30.8h, v7.8h, v3.h[6]      \\n\"\n                        \"ins    v0.d[1], x20                \\n\"\n                        \"fmla   v31.8h, v7.8h, v3.h[7]      \\n\"\n                        \"bne    2b                          \\n\"\n\n                        \"sub    %1, %1, 
#32                 \\n\"\n                        \"sub    %2, %2, #16                 \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.8h}, [%2], #16          \\n\"\n                        \"ld1    {v4.8h}, [%1], #16          \\n\"\n                        \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                        \"fmla   v25.8h, v4.8h, v0.h[1]      \\n\"\n                        \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                        \"fmla   v27.8h, v4.8h, v0.h[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                        \"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                        \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                        \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n                        : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", 
\"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                else\n                {\n                    asm volatile(\n                        \"cbz    %w7, 0f                     \\n\"\n\n                        \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                        \"subs   %0, %0, #64                 \\n\"\n                        \"b      1f                          \\n\"\n\n                        \"0:                                 \\n\"\n                        \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                        \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                        \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                        \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                        \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                        \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                        \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                        \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                        \"1:                                 \\n\"\n                        \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    3f                          \\n\"\n\n                        \"2:                                 \\n\"\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%2], #64 \\n\"\n\n                        \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                        \"fmla   v25.8h, 
v4.8h, v0.h[1]      \\n\"\n                        \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                        \"fmla   v27.8h, v4.8h, v0.h[3]      \\n\"\n                        \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                        \"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                        \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                        \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n\n                        \"fmla   v24.8h, v5.8h, v1.h[0]      \\n\"\n                        \"fmla   v25.8h, v5.8h, v1.h[1]      \\n\"\n                        \"fmla   v26.8h, v5.8h, v1.h[2]      \\n\"\n                        \"fmla   v27.8h, v5.8h, v1.h[3]      \\n\"\n                        \"fmla   v28.8h, v5.8h, v1.h[4]      \\n\"\n                        \"fmla   v29.8h, v5.8h, v1.h[5]      \\n\"\n                        \"fmla   v30.8h, v5.8h, v1.h[6]      \\n\"\n                        \"fmla   v31.8h, v5.8h, v1.h[7]      \\n\"\n\n                        \"fmla   v24.8h, v6.8h, v2.h[0]      \\n\"\n                        \"fmla   v25.8h, v6.8h, v2.h[1]      \\n\"\n                        \"fmla   v26.8h, v6.8h, v2.h[2]      \\n\"\n                        \"fmla   v27.8h, v6.8h, v2.h[3]      \\n\"\n                        \"fmla   v28.8h, v6.8h, v2.h[4]      \\n\"\n                        \"fmla   v29.8h, v6.8h, v2.h[5]      \\n\"\n                        \"fmla   v30.8h, v6.8h, v2.h[6]      \\n\"\n                        \"fmla   v31.8h, v6.8h, v2.h[7]      \\n\"\n\n                        \"subs   w4, w4, #1                  \\n\"\n\n                        \"fmla   v24.8h, v7.8h, v3.h[0]      \\n\"\n                        \"fmla   v25.8h, v7.8h, v3.h[1]      \\n\"\n                        \"fmla   v26.8h, v7.8h, v3.h[2]      \\n\"\n                        \"fmla   v27.8h, v7.8h, v3.h[3]      \\n\"\n                        \"fmla   v28.8h, v7.8h, v3.h[4]      \\n\"\n                        \"fmla   v29.8h, v7.8h, v3.h[5]      
\\n\"\n                        \"fmla   v30.8h, v7.8h, v3.h[6]      \\n\"\n                        \"fmla   v31.8h, v7.8h, v3.h[7]      \\n\"\n\n                        \"bne    2b                          \\n\"\n\n                        \"3:                                 \\n\"\n                        \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                        \"cmp    w4, #0                      \\n\"\n                        \"beq    5f                          \\n\"\n\n                        \"4:                                 \\n\"\n                        \"ld1    {v0.8h}, [%2], #16          \\n\"\n                        \"ld1    {v4.8h}, [%1], #16          \\n\"\n                        \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                        \"fmla   v25.8h, v4.8h, v0.h[1]      \\n\"\n                        \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                        \"fmla   v27.8h, v4.8h, v0.h[3]      \\n\"\n                        \"subs   w4, w4, #1                  \\n\"\n                        \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                        \"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                        \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                        \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n                        \"bne    4b                          \\n\"\n\n                        \"5:                                 \\n\"\n                        \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr), // %0\n                        \"=r\"(pA),     // %1\n                        \"=r\"(pB)      // %2\n                        : \"0\"(outptr),\n                        \"1\"(pA),\n                        \"2\"(pB),\n                        \"r\"(max_kk), // %6\n                        \"r\"(k)       // %7\n          
              : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n#else  // NCNN_GNU_INLINE_ASM\n                float16x8_t _sum0;\n                float16x8_t _sum1;\n                float16x8_t _sum2;\n                float16x8_t _sum3;\n                float16x8_t _sum4;\n                float16x8_t _sum5;\n                float16x8_t _sum6;\n                float16x8_t _sum7;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                    _sum1 = vdupq_n_f16(0.f);\n                    _sum2 = vdupq_n_f16(0.f);\n                    _sum3 = vdupq_n_f16(0.f);\n                    _sum4 = vdupq_n_f16(0.f);\n                    _sum5 = vdupq_n_f16(0.f);\n                    _sum6 = vdupq_n_f16(0.f);\n                    _sum7 = vdupq_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f16(outptr);\n                    _sum1 = vld1q_f16(outptr + 8);\n                    _sum2 = vld1q_f16(outptr + 16);\n                    _sum3 = vld1q_f16(outptr + 24);\n                    _sum4 = vld1q_f16(outptr + 32);\n                    _sum5 = vld1q_f16(outptr + 40);\n                    _sum6 = vld1q_f16(outptr + 48);\n                    _sum7 = vld1q_f16(outptr + 56);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pA = vld1q_f16(pA);\n                    float16x8_t _pB = vld1q_f16(pB);\n                    _sum0 = vfmaq_laneq_f16(_sum0, _pA, _pB, 0);\n                    _sum1 = vfmaq_laneq_f16(_sum1, _pA, _pB, 1);\n                    _sum2 = vfmaq_laneq_f16(_sum2, _pA, _pB, 2);\n                    _sum3 = vfmaq_laneq_f16(_sum3, _pA, _pB, 3);\n                    _sum4 = vfmaq_laneq_f16(_sum4, _pA, _pB, 4);\n                    _sum5 
= vfmaq_laneq_f16(_sum5, _pA, _pB, 5);\n                    _sum6 = vfmaq_laneq_f16(_sum6, _pA, _pB, 6);\n                    _sum7 = vfmaq_laneq_f16(_sum7, _pA, _pB, 7);\n\n                    pA += 8;\n                    pB += 8;\n                }\n\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n                vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n                vst1q_f16(outptr + 8 * 4, _sum4);\n                vst1q_f16(outptr + 8 * 5, _sum5);\n                vst1q_f16(outptr + 8 * 6, _sum6);\n                vst1q_f16(outptr + 8 * 7, _sum7);\n                outptr += 8 * 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const __fp16* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"cbz    %w7, 0f                     \\n\"\n\n                    \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"2:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    
{v0.8h, v1.8h}, [%2], #32   \\n\"\n\n                    \"fmla   v28.8h, v4.8h, v0.h[0]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v0.h[1]      \\n\"\n                    \"fmla   v30.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v0.h[3]      \\n\"\n\n                    \"fmla   v28.8h, v5.8h, v0.h[4]      \\n\"\n                    \"fmla   v29.8h, v5.8h, v0.h[5]      \\n\"\n                    \"fmla   v30.8h, v5.8h, v0.h[6]      \\n\"\n                    \"fmla   v31.8h, v5.8h, v0.h[7]      \\n\"\n\n                    \"fmla   v28.8h, v6.8h, v1.h[0]      \\n\"\n                    \"fmla   v29.8h, v6.8h, v1.h[1]      \\n\"\n                    \"fmla   v30.8h, v6.8h, v1.h[2]      \\n\"\n                    \"fmla   v31.8h, v6.8h, v1.h[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v28.8h, v7.8h, v1.h[4]      \\n\"\n                    \"fmla   v29.8h, v7.8h, v1.h[5]      \\n\"\n                    \"fmla   v30.8h, v7.8h, v1.h[6]      \\n\"\n                    \"fmla   v31.8h, v7.8h, v1.h[7]      \\n\"\n\n                    \"bne    2b                          \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v0.4h}, [%2], #8           \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"fmla   v28.8h, v4.8h, v0.h[0]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v0.h[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v0.h[3]      \\n\"\n 
                   \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v4\", \"v5\", \"v6\", \"v7\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x8_t _sum0;\n                float16x8_t _sum1;\n                float16x8_t _sum2;\n                float16x8_t _sum3;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                    _sum1 = vdupq_n_f16(0.f);\n                    _sum2 = vdupq_n_f16(0.f);\n                    _sum3 = vdupq_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f16(outptr);\n                    _sum1 = vld1q_f16(outptr + 8);\n                    _sum2 = vld1q_f16(outptr + 16);\n                    _sum3 = vld1q_f16(outptr + 24);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pA = vld1q_f16(pA);\n                    float16x4_t _pB = vld1_f16(pB);\n                    _sum0 = vfmaq_lane_f16(_sum0, _pA, _pB, 0);\n                    _sum1 = vfmaq_lane_f16(_sum1, _pA, _pB, 1);\n                    _sum2 = vfmaq_lane_f16(_sum2, _pA, _pB, 2);\n                    _sum3 = vfmaq_lane_f16(_sum3, _pA, _pB, 3);\n\n                    pA += 8;\n                    pB += 4;\n                }\n\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n     
           vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n                outptr += 8 * 4;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const __fp16* pA = pAT;\n\n                float16x8_t _sum0;\n                float16x8_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                    _sum1 = vdupq_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1q_f16(outptr);\n                    _sum1 = vld1q_f16(outptr + 8);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pA = vld1q_f16(pA);\n                    _sum0 = vfmaq_n_f16(_sum0, _pA, pB[0]);\n                    _sum1 = vfmaq_n_f16(_sum1, _pA, pB[1]);\n\n                    pA += 8;\n                    pB += 2;\n                }\n\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n                outptr += 8 * 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const __fp16* pA = pAT;\n\n                float16x8_t _sum;\n\n                if (k == 0)\n                {\n                    _sum = vdupq_n_f16(0.f);\n                }\n                else\n                {\n                    _sum = vld1q_f16(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pA = vld1q_f16(pA);\n                    _sum = vfmaq_n_f16(_sum, _pA, pB[0]);\n\n                    pA += 8;\n                    pB += 1;\n                }\n\n                vst1q_f16(outptr, _sum);\n                outptr += 8;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        for (int b = 0; b < batch; 
b++)\n        {\n            const __fp16* pAT = AT_tile.row<const __fp16>(b) + max_kk * ii;\n            const __fp16* pB = BT_tile.row<const __fp16>(b);\n\n            int jj = 0;\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const __fp16* pA = pAT;\n\n                float16x4_t _sum0;\n                float16x4_t _sum1;\n                float16x4_t _sum2;\n                float16x4_t _sum3;\n                float16x4_t _sum4;\n                float16x4_t _sum5;\n                float16x4_t _sum6;\n                float16x4_t _sum7;\n                float16x4_t _sum8;\n                float16x4_t _sum9;\n                float16x4_t _suma;\n                float16x4_t _sumb;\n\n                if (k == 0)\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                    _sum3 = vdup_n_f16(0.f);\n                    _sum4 = vdup_n_f16(0.f);\n                    _sum5 = vdup_n_f16(0.f);\n                    _sum6 = vdup_n_f16(0.f);\n                    _sum7 = vdup_n_f16(0.f);\n                    _sum8 = vdup_n_f16(0.f);\n                    _sum9 = vdup_n_f16(0.f);\n                    _suma = vdup_n_f16(0.f);\n                    _sumb = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1_f16(outptr);\n                    _sum1 = vld1_f16(outptr + 4);\n                    _sum2 = vld1_f16(outptr + 8);\n                    _sum3 = vld1_f16(outptr + 12);\n                    _sum4 = vld1_f16(outptr + 16);\n                    _sum5 = vld1_f16(outptr + 20);\n                    _sum6 = vld1_f16(outptr + 24);\n                    _sum7 = vld1_f16(outptr + 28);\n                    _sum8 = vld1_f16(outptr + 32);\n                    _sum9 = vld1_f16(outptr + 36);\n                    _suma = vld1_f16(outptr + 40);\n                    _sumb = vld1_f16(outptr + 
44);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x4_t _pA = vld1_f16(pA);\n                    float16x8_t _pB0 = vld1q_f16(pB);\n                    float16x4_t _pB2 = vld1_f16(pB + 8);\n                    _sum0 = vfma_laneq_f16(_sum0, _pA, _pB0, 0);\n                    _sum1 = vfma_laneq_f16(_sum1, _pA, _pB0, 1);\n                    _sum2 = vfma_laneq_f16(_sum2, _pA, _pB0, 2);\n                    _sum3 = vfma_laneq_f16(_sum3, _pA, _pB0, 3);\n                    _sum4 = vfma_laneq_f16(_sum4, _pA, _pB0, 4);\n                    _sum5 = vfma_laneq_f16(_sum5, _pA, _pB0, 5);\n                    _sum6 = vfma_laneq_f16(_sum6, _pA, _pB0, 6);\n                    _sum7 = vfma_laneq_f16(_sum7, _pA, _pB0, 7);\n                    _sum8 = vfma_lane_f16(_sum8, _pA, _pB2, 0);\n                    _sum9 = vfma_lane_f16(_sum9, _pA, _pB2, 1);\n                    _suma = vfma_lane_f16(_suma, _pA, _pB2, 2);\n                    _sumb = vfma_lane_f16(_sumb, _pA, _pB2, 3);\n\n                    pA += 4;\n                    pB += 12;\n                }\n\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n                vst1_f16(outptr + 4 * 4, _sum4);\n                vst1_f16(outptr + 4 * 5, _sum5);\n                vst1_f16(outptr + 4 * 6, _sum6);\n                vst1_f16(outptr + 4 * 7, _sum7);\n                vst1_f16(outptr + 4 * 8, _sum8);\n                vst1_f16(outptr + 4 * 9, _sum9);\n                vst1_f16(outptr + 4 * 10, _suma);\n                vst1_f16(outptr + 4 * 11, _sumb);\n                outptr += 4 * 12;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const __fp16* pA = pAT;\n\n                float16x4_t _sum0;\n                float16x4_t _sum1;\n                float16x4_t 
_sum2;\n                float16x4_t _sum3;\n                float16x4_t _sum4;\n                float16x4_t _sum5;\n                float16x4_t _sum6;\n                float16x4_t _sum7;\n\n                if (k == 0)\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                    _sum3 = vdup_n_f16(0.f);\n                    _sum4 = vdup_n_f16(0.f);\n                    _sum5 = vdup_n_f16(0.f);\n                    _sum6 = vdup_n_f16(0.f);\n                    _sum7 = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1_f16(outptr);\n                    _sum1 = vld1_f16(outptr + 4);\n                    _sum2 = vld1_f16(outptr + 8);\n                    _sum3 = vld1_f16(outptr + 12);\n                    _sum4 = vld1_f16(outptr + 16);\n                    _sum5 = vld1_f16(outptr + 20);\n                    _sum6 = vld1_f16(outptr + 24);\n                    _sum7 = vld1_f16(outptr + 28);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x4_t _pA = vld1_f16(pA);\n                    float16x8_t _pB = vld1q_f16(pB);\n                    _sum0 = vfma_laneq_f16(_sum0, _pA, _pB, 0);\n                    _sum1 = vfma_laneq_f16(_sum1, _pA, _pB, 1);\n                    _sum2 = vfma_laneq_f16(_sum2, _pA, _pB, 2);\n                    _sum3 = vfma_laneq_f16(_sum3, _pA, _pB, 3);\n                    _sum4 = vfma_laneq_f16(_sum4, _pA, _pB, 4);\n                    _sum5 = vfma_laneq_f16(_sum5, _pA, _pB, 5);\n                    _sum6 = vfma_laneq_f16(_sum6, _pA, _pB, 6);\n                    _sum7 = vfma_laneq_f16(_sum7, _pA, _pB, 7);\n\n                    pA += 4;\n                    pB += 8;\n                }\n\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n               
 vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n                vst1_f16(outptr + 4 * 4, _sum4);\n                vst1_f16(outptr + 4 * 5, _sum5);\n                vst1_f16(outptr + 4 * 6, _sum6);\n                vst1_f16(outptr + 4 * 7, _sum7);\n                outptr += 4 * 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const __fp16* pA = pAT;\n\n                float16x4_t _sum0;\n                float16x4_t _sum1;\n                float16x4_t _sum2;\n                float16x4_t _sum3;\n\n                if (k == 0)\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                    _sum3 = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1_f16(outptr);\n                    _sum1 = vld1_f16(outptr + 4);\n                    _sum2 = vld1_f16(outptr + 8);\n                    _sum3 = vld1_f16(outptr + 12);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x4_t _pA = vld1_f16(pA);\n                    float16x4_t _pB = vld1_f16(pB);\n                    _sum0 = vfma_lane_f16(_sum0, _pA, _pB, 0);\n                    _sum1 = vfma_lane_f16(_sum1, _pA, _pB, 1);\n                    _sum2 = vfma_lane_f16(_sum2, _pA, _pB, 2);\n                    _sum3 = vfma_lane_f16(_sum3, _pA, _pB, 3);\n\n                    pA += 4;\n                    pB += 4;\n                }\n\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n                outptr += 4 * 4;\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const __fp16* pA = pAT;\n\n                
float16x4_t _sum0;\n                float16x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    _sum0 = vld1_f16(outptr);\n                    _sum1 = vld1_f16(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x4_t _pA = vld1_f16(pA);\n                    _sum0 = vfma_n_f16(_sum0, _pA, pB[0]);\n                    _sum1 = vfma_n_f16(_sum1, _pA, pB[1]);\n\n                    pA += 4;\n                    pB += 2;\n                }\n\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                outptr += 4 * 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const __fp16* pA = pAT;\n\n                float16x4_t _sum;\n\n                if (k == 0)\n                {\n                    _sum = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    _sum = vld1_f16(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x4_t _pA = vld1_f16(pA);\n                    _sum = vfma_n_f16(_sum, _pA, pB[0]);\n\n                    pA += 4;\n                    pB += 1;\n                }\n\n                vst1_f16(outptr, _sum);\n                outptr += 4;\n            }\n        }\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const __fp16* pAT = AT_tile.row<const __fp16>(b) + max_kk * ii;\n            const __fp16* pB = BT_tile.row<const __fp16>(b);\n\n            int jj = 0;\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const __fp16* pA = pAT;\n\n                
float16x8_t _sum01;\n                float16x4_t _sum2;\n                float16x8_t _sum34;\n                float16x4_t _sum5;\n\n                if (k == 0)\n                {\n                    _sum01 = vdupq_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                    _sum34 = vdupq_n_f16(0.f);\n                    _sum5 = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    float16x8x2_t _tmp0123 = vld2q_f16(outptr);\n                    float16x4x2_t _tmp45 = vld2_f16(outptr + 16);\n                    _sum01 = _tmp0123.val[0];\n                    _sum2 = _tmp45.val[0];\n                    _sum34 = _tmp0123.val[1];\n                    _sum5 = _tmp45.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pB0 = vld1q_f16(pB);\n                    float16x4_t _pB2 = vld1_f16(pB + 8);\n                    _sum01 = vfmaq_n_f16(_sum01, _pB0, pA[0]);\n                    _sum2 = vfma_n_f16(_sum2, _pB2, pA[0]);\n                    _sum34 = vfmaq_n_f16(_sum34, _pB0, pA[1]);\n                    _sum5 = vfma_n_f16(_sum5, _pB2, pA[1]);\n                    pA += 2;\n                    pB += 12;\n                }\n\n                float16x8x2_t _tmp0123;\n                _tmp0123.val[0] = _sum01;\n                _tmp0123.val[1] = _sum34;\n                float16x4x2_t _tmp45;\n                _tmp45.val[0] = _sum2;\n                _tmp45.val[1] = _sum5;\n                vst2q_f16(outptr, _tmp0123);\n                vst2_f16(outptr + 16, _tmp45);\n                outptr += 2 * 12;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const __fp16* pA = pAT;\n\n                float16x8_t _sum01;\n                float16x8_t _sum23;\n\n                if (k == 0)\n                {\n                    _sum01 = vdupq_n_f16(0.f);\n                    _sum23 = 
vdupq_n_f16(0.f);\n                }\n                else\n                {\n                    float16x8x2_t _tmp0123 = vld2q_f16(outptr);\n                    _sum01 = _tmp0123.val[0];\n                    _sum23 = _tmp0123.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pB = vld1q_f16(pB);\n                    _sum01 = vfmaq_n_f16(_sum01, _pB, pA[0]);\n                    _sum23 = vfmaq_n_f16(_sum23, _pB, pA[1]);\n                    pA += 2;\n                    pB += 8;\n                }\n\n                float16x8x2_t _tmp0123;\n                _tmp0123.val[0] = _sum01;\n                _tmp0123.val[1] = _sum23;\n                vst2q_f16(outptr, _tmp0123);\n                outptr += 2 * 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const __fp16* pA = pAT;\n\n                float16x4_t _sum0;\n                float16x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    float16x4x2_t _tmp01 = vld2_f16(outptr);\n                    _sum0 = _tmp01.val[0];\n                    _sum1 = _tmp01.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x4_t _pB = vld1_f16(pB);\n                    _sum0 = vfma_n_f16(_sum0, _pB, pA[0]);\n                    _sum1 = vfma_n_f16(_sum1, _pB, pA[1]);\n                    pA += 2;\n                    pB += 4;\n                }\n\n                float16x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2_f16(outptr, _tmp01);\n                outptr += 2 * 4;\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n 
               const __fp16* pA = pAT;\n\n                __fp16 sum00 = 0.f;\n                __fp16 sum01 = 0.f;\n                __fp16 sum10 = 0.f;\n                __fp16 sum11 = 0.f;\n\n                if (k == 0)\n                {\n                    sum00 = 0.f;\n                    sum01 = 0.f;\n                    sum10 = 0.f;\n                    sum11 = 0.f;\n                }\n                else\n                {\n                    sum00 = outptr[0];\n                    sum01 = outptr[1];\n                    sum10 = outptr[2];\n                    sum11 = outptr[3];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    sum00 += pA[0] * pB[0];\n                    sum01 += pA[1] * pB[0];\n                    sum10 += pA[0] * pB[1];\n                    sum11 += pA[1] * pB[1];\n                    pA += 2;\n                    pB += 2;\n                }\n\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n                outptr += 2 * 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const __fp16* pA = pAT;\n\n                __fp16 sum0 = 0.f;\n                __fp16 sum1 = 0.f;\n\n                if (k == 0)\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n                else\n                {\n                    sum0 = outptr[0];\n                    sum1 = outptr[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    sum0 += pA[0] * pB[0];\n                    sum1 += pA[1] * pB[0];\n                    pA += 2;\n                    pB += 1;\n                }\n\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n                outptr += 2;\n            }\n        }\n    }\n 
   for (; ii < max_ii; ii++)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const __fp16* pAT = AT_tile.row<const __fp16>(b) + max_kk * ii;\n            const __fp16* pB = BT_tile.row<const __fp16>(b);\n\n            int jj = 0;\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const __fp16* pA = pAT;\n\n                float16x8_t _sum01;\n                float16x4_t _sum2;\n\n                if (k == 0)\n                {\n                    _sum01 = vdupq_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    _sum01 = vld1q_f16(outptr);\n                    _sum2 = vld1_f16(outptr + 8);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pB0 = vld1q_f16(pB);\n                    float16x4_t _pB2 = vld1_f16(pB + 8);\n                    _sum01 = vfmaq_n_f16(_sum01, _pB0, pA[0]);\n                    _sum2 = vfma_n_f16(_sum2, _pB2, pA[0]);\n                    pA += 1;\n                    pB += 12;\n                }\n\n                vst1q_f16(outptr, _sum01);\n                vst1_f16(outptr + 8, _sum2);\n                outptr += 12;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const __fp16* pA = pAT;\n\n                float16x8_t _sum01;\n\n                if (k == 0)\n                {\n                    _sum01 = vdupq_n_f16(0.f);\n                }\n                else\n                {\n                    _sum01 = vld1q_f16(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x8_t _pB = vld1q_f16(pB);\n                    _sum01 = vfmaq_n_f16(_sum01, _pB, pA[0]);\n                    pA += 1;\n                    pB += 8;\n                }\n\n                vst1q_f16(outptr, 
_sum01);\n                outptr += 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const __fp16* pA = pAT;\n\n                float16x4_t _sum;\n\n                if (k == 0)\n                {\n                    _sum = vdup_n_f16(0.f);\n                }\n                else\n                {\n                    _sum = vld1_f16(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    float16x4_t _pB = vld1_f16(pB);\n                    _sum = vfma_n_f16(_sum, _pB, pA[0]);\n                    pA += 1;\n                    pB += 4;\n                }\n\n                vst1_f16(outptr, _sum);\n                outptr += 4;\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const __fp16* pA = pAT;\n\n                __fp16 sum0 = 0.f;\n                __fp16 sum1 = 0.f;\n\n                if (k == 0)\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n                else\n                {\n                    sum0 = outptr[0];\n                    sum1 = outptr[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    sum0 += pA[0] * pB[0];\n                    sum1 += pA[0] * pB[1];\n                    pA += 1;\n                    pB += 2;\n                }\n\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n                outptr += 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const __fp16* pA = pAT;\n\n                __fp16 sum = 0.f;\n\n                if (k == 0)\n                {\n                    sum = 0.f;\n                }\n                else\n                {\n                    sum = outptr[0];\n                }\n\n                int kk = 0;\n                for (; kk 
< max_kk; kk++)\n                {\n                    sum += pA[0] * pB[0];\n                    pA += 1;\n                    pB += 1;\n                }\n\n                outptr[0] = sum;\n                outptr += 1;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd_get_optimal_tile_mnk_fp16(int M, int N, int K, int B, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const int l2_cache_size_fp16 = (int)(get_cpu_level2_cache_size() / sizeof(unsigned short));\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    // B should be taken into account for batched gemm, but doing so is slower on arm in practice (reason unclear)\n    (void)B;\n\n    // solve K\n    {\n        // try not to split K\n        int tile_size = (l2_cache_size_fp16 - 32) / 12;\n\n        TILE_K = std::max(8, tile_size / 8 * 8);\n\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n    }\n\n    // solve M\n    {\n        TILE_M = 8;\n    }\n\n    {\n        TILE_M *= std::min(nT, get_physical_cpu_count());\n\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n\n        if (nT > 1)\n        {\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n        }\n    }\n\n    // solve N (skipped when N == 0, e.g. for the kernel transform)\n    if (N > 0)\n    {\n        int tile_size;\n        if (TILE_K >= K)\n        {\n            tile_size = (l2_cache_size_fp16 - TILE_M * TILE_K) / TILE_K;\n        }\n        else\n        {\n            tile_size = (l2_cache_size_fp16 - TILE_M * TILE_K) / (TILE_M + TILE_K);\n        }\n\n        TILE_N = std::max(4, tile_size / 4 * 4);\n\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_kernel_tile_fp16sa(const Mat& kernel, Mat& A, int inch, int i, int 
max_ii, int k, int max_kk)\n{\n    // const float ktm[4][3] = {\n    //     {1.0f, 0.0f, 0.0f},\n    //     {1.0f / 2, 1.0f / 2, 1.0f / 2},\n    //     {1.0f / 2, -1.0f / 2, 1.0f / 2},\n    //     {0.0f, 0.0f, 1.0f}\n    // };\n\n    __fp16* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            float tmp[4][3];\n\n            const float* k0 = (const float*)kernel + (i + ii) * inch * 9 + (k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                float r0 = k0[0];\n                float r1 = k0[1];\n                float r2 = k0[2];\n\n                tmp[0][m] = r0;\n                tmp[1][m] = r0 * 0.5f + r1 * 0.5f + r2 * 0.5f;\n                tmp[2][m] = r0 * 0.5f - r1 * 0.5f + r2 * 0.5f;\n                tmp[3][m] = r2;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 4; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n\n                float z0 = r0;\n                float z1 = r0 * 0.5f + r1 * 0.5f + r2 * 0.5f;\n                float z2 = r0 * 0.5f - r1 * 0.5f + r2 * 0.5f;\n                float z3 = r2;\n\n                ptmp[0] = (__fp16)z0;\n                ptmp[1] = (__fp16)z1;\n                ptmp[2] = (__fp16)z2;\n                ptmp[3] = (__fp16)z3;\n                ptmp += 4;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_transform_kernel_fp16sa(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    const int K = inch;\n    const int B = 16;\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk_fp16(M, 0, K, B, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads, (size_t)2u);\n\n    AT.create(TILE_K * TILE_M, B, (K + 
TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, (size_t)2u);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd23_transform_kernel_tile_fp16sa(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            conv3x3s1_winograd_pack_A_tile_fp16(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_input_tile_fp16sa(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    // const float itm[4][4] = {\n    //     {1.0f,  0.0f, -1.0f,  0.0f},\n    //     {0.0f,  1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  0.00f, 1.0f}\n    // };\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const int N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w - 1) / 2;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[4][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel((k + kk) / elempack).row<const __fp16>(ti * 2) + (tj * 2) * elempack;\n\n          
  for (int m = 0; m < 4; m++)\n            {\n                float16x8_t _r0 = vdupq_n_f16(0.f);\n                float16x8_t _r1 = vdupq_n_f16(0.f);\n                float16x8_t _r2 = vdupq_n_f16(0.f);\n                float16x8_t _r3 = vdupq_n_f16(0.f);\n\n                if (ti * 2 + m < h)\n                {\n                    if (elempack == 8)\n                    {\n                        _r0 = vld1q_f16(r0);\n                        if (tj * 2 + 1 < w) _r1 = vld1q_f16(r0 + 8);\n                        if (tj * 2 + 2 < w) _r2 = vld1q_f16(r0 + 16);\n                        if (tj * 2 + 3 < w) _r3 = vld1q_f16(r0 + 24);\n                    }\n                    if (elempack == 4)\n                    {\n                        const __fp16* r1 = r0 + N;\n\n                        _r0 = vcombine_f16(vld1_f16(r0), vld1_f16(r1));\n                        if (tj * 2 + 1 < w)\n                        {\n                            _r1 = vcombine_f16(vld1_f16(r0 + 4), vld1_f16(r1 + 4));\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            _r2 = vcombine_f16(vld1_f16(r0 + 8), vld1_f16(r1 + 8));\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            _r3 = vcombine_f16(vld1_f16(r0 + 12), vld1_f16(r1 + 12));\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n                        const __fp16* r2 = r0 + N * 2;\n                        const __fp16* r3 = r0 + N * 3;\n                        const __fp16* r4 = r0 + N * 4;\n                        const __fp16* r5 = r0 + N * 5;\n                        const __fp16* r6 = r0 + N * 6;\n                        const __fp16* r7 = r0 + N * 7;\n\n                        float16x4_t _t0 = vld1_f16(r0);\n                        float16x4_t _t1 = vld1_f16(r1);\n      
                  float16x4_t _t2 = vld1_f16(r2);\n                        float16x4_t _t3 = vld1_f16(r3);\n                        float16x4_t _t4 = vld1_f16(r4);\n                        float16x4_t _t5 = vld1_f16(r5);\n                        float16x4_t _t6 = vld1_f16(r6);\n                        float16x4_t _t7 = vld1_f16(r7);\n\n                        transpose4x4_ph(_t0, _t1, _t2, _t3);\n                        transpose4x4_ph(_t4, _t5, _t6, _t7);\n\n                        _r0 = vcombine_f16(_t0, _t4);\n                        if (tj * 2 + 1 < w)\n                        {\n                            _r1 = vcombine_f16(_t1, _t5);\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            _r2 = vcombine_f16(_t2, _t6);\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            _r3 = vcombine_f16(_t3, _t7);\n                        }\n                    }\n                }\n\n                float16x8_t _tmp0 = vsubq_f16(_r0, _r2);\n                float16x8_t _tmp1 = vaddq_f16(_r1, _r2);\n                float16x8_t _tmp2 = vsubq_f16(_r2, _r1);\n                float16x8_t _tmp3 = vsubq_f16(_r3, _r1);\n\n                vst1q_f16(tmp[0][m], _tmp0);\n                vst1q_f16(tmp[1][m], _tmp1);\n                vst1q_f16(tmp[2][m], _tmp2);\n                vst1q_f16(tmp[3][m], _tmp3);\n\n                r0 += w * elempack;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 16 + jj * 8;\n            __fp16* p1 = p0 + max_jj * 8;\n            __fp16* p2 = p0 + max_jj * 8 * 2;\n            __fp16* p3 = p0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float16x8_t _r0 = vld1q_f16(tmp[m][0]);\n                float16x8_t _r1 = vld1q_f16(tmp[m][1]);\n                float16x8_t _r2 = vld1q_f16(tmp[m][2]);\n                float16x8_t _r3 = 
vld1q_f16(tmp[m][3]);\n\n                float16x8_t _tmp0 = vsubq_f16(_r0, _r2);\n                float16x8_t _tmp1 = vaddq_f16(_r1, _r2);\n                float16x8_t _tmp2 = vsubq_f16(_r2, _r1);\n                float16x8_t _tmp3 = vsubq_f16(_r3, _r1);\n\n                vst1q_f16(p0, _tmp0);\n                vst1q_f16(p1, _tmp1);\n                vst1q_f16(p2, _tmp2);\n                vst1q_f16(p3, _tmp3);\n\n                p0 += max_jj * 4 * 8;\n                p1 += max_jj * 4 * 8;\n                p2 += max_jj * 4 * 8;\n                p3 += max_jj * 4 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[4][4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel((k + kk) / elempack).row<const __fp16>(ti * 2) + (tj * 2) * elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float16x4_t _r0 = vdup_n_f16(0.f);\n                float16x4_t _r1 = vdup_n_f16(0.f);\n                float16x4_t _r2 = vdup_n_f16(0.f);\n                float16x4_t _r3 = vdup_n_f16(0.f);\n\n                if (ti * 2 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        if (tj * 2 + 1 < w) _r1 = vld1_f16(r0 + 4);\n                        if (tj * 2 + 2 < w) _r2 = vld1_f16(r0 + 8);\n                        if (tj * 2 + 3 < w) _r3 = vld1_f16(r0 + 12);\n                    }\n                    if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + 
N;\n                        const __fp16* r2 = r0 + N * 2;\n                        const __fp16* r3 = r0 + N * 3;\n\n                        float16x4_t _t0 = vld1_f16(r0);\n                        float16x4_t _t1 = vld1_f16(r1);\n                        float16x4_t _t2 = vld1_f16(r2);\n                        float16x4_t _t3 = vld1_f16(r3);\n\n                        transpose4x4_ph(_t0, _t1, _t2, _t3);\n\n                        _r0 = _t0;\n                        if (tj * 2 + 1 < w) _r1 = _t1;\n                        if (tj * 2 + 2 < w) _r2 = _t2;\n                        if (tj * 2 + 3 < w) _r3 = _t3;\n                    }\n                }\n\n                float16x4_t _tmp0 = vsub_f16(_r0, _r2);\n                float16x4_t _tmp1 = vadd_f16(_r1, _r2);\n                float16x4_t _tmp2 = vsub_f16(_r2, _r1);\n                float16x4_t _tmp3 = vsub_f16(_r3, _r1);\n\n                vst1_f16(tmp[0][m], _tmp0);\n                vst1_f16(tmp[1][m], _tmp1);\n                vst1_f16(tmp[2][m], _tmp2);\n                vst1_f16(tmp[3][m], _tmp3);\n\n                r0 += w * elempack;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 16 + jj * 4;\n            __fp16* p1 = p0 + max_jj * 4;\n            __fp16* p2 = p0 + max_jj * 4 * 2;\n            __fp16* p3 = p0 + max_jj * 4 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float16x4_t _r0 = vld1_f16(tmp[m][0]);\n                float16x4_t _r1 = vld1_f16(tmp[m][1]);\n                float16x4_t _r2 = vld1_f16(tmp[m][2]);\n                float16x4_t _r3 = vld1_f16(tmp[m][3]);\n\n                float16x4_t _tmp0 = vsub_f16(_r0, _r2);\n                float16x4_t _tmp1 = vadd_f16(_r1, _r2);\n                float16x4_t _tmp2 = vsub_f16(_r2, _r1);\n                float16x4_t _tmp3 = vsub_f16(_r3, _r1);\n\n                vst1_f16(p0, _tmp0);\n                vst1_f16(p1, _tmp1);\n                vst1_f16(p2, _tmp2);\n                vst1_f16(p3, _tmp3);\n\n   
             p0 += max_jj * 4 * 4;\n                p1 += max_jj * 4 * 4;\n                p2 += max_jj * 4 * 4;\n                p3 += max_jj * 4 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        __fp16 tmp[4][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel(k + kk).row<const __fp16>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n                __fp16 r00 = 0.f;\n                __fp16 r01 = 0.f;\n                __fp16 r10 = 0.f;\n                __fp16 r11 = 0.f;\n                __fp16 r20 = 0.f;\n                __fp16 r21 = 0.f;\n                __fp16 r30 = 0.f;\n                __fp16 r31 = 0.f;\n\n                if (ti * 2 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 2 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n                    }\n                }\n\n                tmp[0][m][0] = r00 
- r20;\n                tmp[0][m][1] = r01 - r21;\n                tmp[1][m][0] = r10 + r20;\n                tmp[1][m][1] = r11 + r21;\n                tmp[2][m][0] = r20 - r10;\n                tmp[2][m][1] = r21 - r11;\n                tmp[3][m][0] = r30 - r10;\n                tmp[3][m][1] = r31 - r11;\n\n                r0 += w;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 16 + jj * 2;\n            __fp16* p1 = p0 + max_jj * 2;\n            __fp16* p2 = p0 + max_jj * 2 * 2;\n            __fp16* p3 = p0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                __fp16 r00 = tmp[m][0][0];\n                __fp16 r01 = tmp[m][0][1];\n                __fp16 r10 = tmp[m][1][0];\n                __fp16 r11 = tmp[m][1][1];\n                __fp16 r20 = tmp[m][2][0];\n                __fp16 r21 = tmp[m][2][1];\n                __fp16 r30 = tmp[m][3][0];\n                __fp16 r31 = tmp[m][3][1];\n\n                p0[0] = r00 - r20;\n                p0[1] = r01 - r21;\n                p1[0] = r10 + r20;\n                p1[1] = r11 + r21;\n                p2[0] = r20 - r10;\n                p2[1] = r21 - r11;\n                p3[0] = r30 - r10;\n                p3[1] = r31 - r11;\n\n                p0 += max_jj * 4 * 2;\n                p1 += max_jj * 4 * 2;\n                p2 += max_jj * 4 * 2;\n                p3 += max_jj * 4 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        __fp16 tmp[4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0123 = bottom_blob.channel(k + kk).row<const __fp16>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n                __fp16 r0 = 0.f;\n                __fp16 r1 = 0.f;\n                __fp16 r2 = 
0.f;\n                __fp16 r3 = 0.f;\n\n                if (ti * 2 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 2 + 1 < w) r1 = r0123[1];\n                        if (tj * 2 + 2 < w) r2 = r0123[2];\n                        if (tj * 2 + 3 < w) r3 = r0123[3];\n                    }\n                }\n\n                tmp[0][m] = r0 - r2;\n                tmp[1][m] = r1 + r2;\n                tmp[2][m] = r2 - r1;\n                tmp[3][m] = r3 - r1;\n\n                r0123 += w;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 16 + jj;\n            __fp16* p1 = p0 + max_jj;\n            __fp16* p2 = p0 + max_jj * 2;\n            __fp16* p3 = p0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                __fp16 r0 = tmp[m][0];\n                __fp16 r1 = tmp[m][1];\n                __fp16 r2 = tmp[m][2];\n                __fp16 r3 = tmp[m][3];\n\n                p0[0] = r0 - r2;\n                p1[0] = r1 + r2;\n                p2[0] = r2 - r1;\n                p3[0] = r3 - r1;\n\n                p0 += max_jj * 4;\n                p1 += max_jj * 4;\n                p2 += max_jj * 4;\n                p3 += max_jj * 4;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_output_tile_fp16sa(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    // const float otm[2][4] = {\n    //     {1.0f,  1.0f,  1.0f,  0.0f},\n    //     {0.0f,  1.0f, -1.0f,  1.0f}\n    // };\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const int N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 1) / 2;\n\n    const __fp16* biasptr = bias;\n\n    int ii = 0;\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float16x8_t _bias0 = biasptr ? 
vld1q_f16(biasptr + i + ii) : vdupq_n_f16(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[2][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 16 + jj * 8;\n            const __fp16* r1 = r0 + max_jj * 8;\n            const __fp16* r2 = r0 + max_jj * 8 * 2;\n            const __fp16* r3 = r0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float16x8_t _r0 = vld1q_f16(r0);\n                float16x8_t _r1 = vld1q_f16(r1);\n                float16x8_t _r2 = vld1q_f16(r2);\n                float16x8_t _r3 = vld1q_f16(r3);\n\n                float16x8_t _tmp0 = vaddq_f16(vaddq_f16(_r0, _r1), _r2);\n                float16x8_t _tmp1 = vaddq_f16(vsubq_f16(_r1, _r2), _r3);\n\n                vst1q_f16(tmp[0][m], _tmp0);\n                vst1q_f16(tmp[1][m], _tmp1);\n\n                r0 += max_jj * 4 * 8;\n                r1 += max_jj * 4 * 8;\n                r2 += max_jj * 4 * 8;\n                r3 += max_jj * 4 * 8;\n            }\n\n            __fp16* outptr0 = top_blob.channel((i + ii) / out_elempack).row<__fp16>(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                float16x8_t _r0 = vld1q_f16(tmp[m][0]);\n                float16x8_t _r1 = vld1q_f16(tmp[m][1]);\n                float16x8_t _r2 = vld1q_f16(tmp[m][2]);\n                float16x8_t _r3 = vld1q_f16(tmp[m][3]);\n\n                float16x8_t _tmp0 = vaddq_f16(_bias0, vaddq_f16(vaddq_f16(_r0, _r1), _r2));\n                float16x8_t _tmp1 = vaddq_f16(_bias0, vaddq_f16(vsubq_f16(_r1, _r2), _r3));\n\n                if (out_elempack == 8)\n                {\n                 
   vst1q_f16(outptr0, _tmp0);\n                    if (tj * 2 + 1 < outw)\n                    {\n                        vst1q_f16(outptr0 + 8, _tmp1);\n                    }\n                }\n                if (out_elempack == 4)\n                {\n                    __fp16* outptr1 = outptr0 + N;\n\n                    vst1_f16(outptr0, vget_low_f16(_tmp0));\n                    vst1_f16(outptr1, vget_high_f16(_tmp0));\n                    if (tj * 2 + 1 < outw)\n                    {\n                        vst1_f16(outptr0 + 4, vget_low_f16(_tmp1));\n                        vst1_f16(outptr1 + 4, vget_high_f16(_tmp1));\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 tmp0[8];\n                    __fp16 tmp1[8];\n                    vst1q_f16(tmp0, _tmp0);\n                    vst1q_f16(tmp1, _tmp1);\n\n                    __fp16* outptr1 = outptr0 + N;\n                    __fp16* outptr2 = outptr0 + N * 2;\n                    __fp16* outptr3 = outptr0 + N * 3;\n                    __fp16* outptr4 = outptr0 + N * 4;\n                    __fp16* outptr5 = outptr0 + N * 5;\n                    __fp16* outptr6 = outptr0 + N * 6;\n                    __fp16* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n          
              outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float16x4_t _bias0 = biasptr ? vld1_f16(biasptr + i + ii) : vdup_n_f16(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[2][4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 16 + jj * 4;\n            const __fp16* r1 = r0 + max_jj * 4;\n            const __fp16* r2 = r0 + max_jj * 4 * 2;\n            const __fp16* r3 = r0 + max_jj * 4 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                float16x4_t _r0 = vld1_f16(r0);\n                float16x4_t _r1 = vld1_f16(r1);\n                float16x4_t _r2 = vld1_f16(r2);\n                float16x4_t _r3 = vld1_f16(r3);\n\n                float16x4_t _tmp0 = vadd_f16(vadd_f16(_r0, _r1), _r2);\n                float16x4_t _tmp1 = vadd_f16(vsub_f16(_r1, _r2), _r3);\n\n                vst1_f16(tmp[0][m], _tmp0);\n                vst1_f16(tmp[1][m], _tmp1);\n\n                r0 += max_jj * 4 * 4;\n                r1 += max_jj * 4 * 4;\n                r2 += max_jj * 4 * 4;\n                r3 += max_jj * 4 * 4;\n            }\n\n            __fp16* outptr0 = top_blob.channel((i + ii) / out_elempack).row<__fp16>(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                float16x4_t _r0 = vld1_f16(tmp[m][0]);\n                float16x4_t _r1 = vld1_f16(tmp[m][1]);\n                float16x4_t _r2 = vld1_f16(tmp[m][2]);\n                float16x4_t _r3 = 
vld1_f16(tmp[m][3]);\n\n                float16x4_t _tmp0 = vadd_f16(_bias0, vadd_f16(vadd_f16(_r0, _r1), _r2));\n                float16x4_t _tmp1 = vadd_f16(_bias0, vadd_f16(vsub_f16(_r1, _r2), _r3));\n\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _tmp0);\n                    if (tj * 2 + 1 < outw) vst1_f16(outptr0 + 4, _tmp1);\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 tmp0[4];\n                    __fp16 tmp1[4];\n                    vst1_f16(tmp0, _tmp0);\n                    vst1_f16(tmp1, _tmp1);\n\n                    __fp16* outptr1 = outptr0 + N;\n                    __fp16* outptr2 = outptr0 + N * 2;\n                    __fp16* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        __fp16 bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        __fp16 bias1 = biasptr ? 
biasptr[i + ii + 1] : 0.f;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        __fp16 tmp[2][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 16 + jj * 2;\n            const __fp16* r1 = r0 + max_jj * 2;\n            const __fp16* r2 = r0 + max_jj * 2 * 2;\n            const __fp16* r3 = r0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                tmp[0][m][0] = r0[0] + r1[0] + r2[0];\n                tmp[0][m][1] = r0[1] + r1[1] + r2[1];\n                tmp[1][m][0] = r1[0] - r2[0] + r3[0];\n                tmp[1][m][1] = r1[1] - r2[1] + r3[1];\n\n                r0 += max_jj * 4 * 2;\n                r1 += max_jj * 4 * 2;\n                r2 += max_jj * 4 * 2;\n                r3 += max_jj * 4 * 2;\n            }\n\n            __fp16* outptr0 = top_blob.channel(i + ii).row<__fp16>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                __fp16 r00 = tmp[m][0][0];\n                __fp16 r01 = tmp[m][0][1];\n                __fp16 r10 = tmp[m][1][0];\n                __fp16 r11 = tmp[m][1][1];\n                __fp16 r20 = tmp[m][2][0];\n                __fp16 r21 = tmp[m][2][1];\n                __fp16 r30 = tmp[m][3][0];\n                __fp16 r31 = tmp[m][3][1];\n\n                __fp16 tmp00 = bias0 + r00 + r10 + r20;\n                __fp16 tmp01 = bias1 + r01 + r11 + r21;\n                __fp16 tmp10 = bias0 + r10 - r20 + r30;\n                __fp16 tmp11 = bias1 + r11 - r21 + r31;\n\n                // if (out_elempack == 1)\n                {\n                    __fp16* outptr1 = outptr0 + N;\n\n                    outptr0[0] = tmp00;\n                    outptr1[0] 
= tmp01;\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] = tmp11;\n                    }\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        __fp16 bias0 = biasptr ? biasptr[i + ii] : 0.f;\n\n        __fp16 tmp[2][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 16 + jj;\n            const __fp16* r1 = r0 + max_jj;\n            const __fp16* r2 = r0 + max_jj * 2;\n            const __fp16* r3 = r0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                tmp[0][m] = r0[0] + r1[0] + r2[0];\n                tmp[1][m] = r1[0] - r2[0] + r3[0];\n\n                r0 += max_jj * 4;\n                r1 += max_jj * 4;\n                r2 += max_jj * 4;\n                r3 += max_jj * 4;\n            }\n\n            __fp16* outptr0 = top_blob.channel(i + ii).row<__fp16>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                __fp16 r0 = tmp[m][0];\n                __fp16 r1 = tmp[m][1];\n                __fp16 r2 = tmp[m][2];\n                __fp16 r3 = tmp[m][3];\n\n                __fp16 tmp0 = bias0 + r0 + r1 + r2;\n                __fp16 tmp1 = bias0 + r1 - r2 + r3;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 2 + 1 < outw) outptr0[1] = tmp1;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd23_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int 
outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 2n+2, winograd F(2,3)\n    int w_tiles = (outw + 1) / 2;\n    int h_tiles = (outh + 1) / 2;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 16;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd23_fp16sa %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk_fp16(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 2u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd23_transform_input_tile_fp16sa(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile_fp16(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 2u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n   
     #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd23_transform_input_tile_fp16sa(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile_fp16(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 2u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd23_transform_output_tile_fp16sa(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n\nstatic inline void 
conv3x3s1_winograd43_transform_kernel_tile_fp16sa(const Mat& kernel, Mat& A, int inch, int i, int max_ii, int k, int max_kk)\n{\n    __fp16* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            const float sq2 = 1.41421356237f;\n            // const float ktm[6][3] = {\n            //     {1.0f, 0.0f, 0.0f},\n            //     {-2.0f / 3, -sq2 / 3, -1.0f / 3},\n            //     {-2.0f / 3, sq2 / 3, -1.0f / 3},\n            //     {1.0f / 6, sq2 / 6, 1.0f / 3},\n            //     {1.0f / 6, -sq2 / 6, 1.0f / 3},\n            //     {0.0f, 0.0f, 1.0f}\n            // };\n            const float ktm0 = 2.0f / 3;\n            const float ktm1 = sq2 / 3;\n            const float ktm2 = 1.0f / 3;\n            const float ktm3 = 1.0f / 6;\n            const float ktm4 = sq2 / 6;\n\n            float tmp[6][3];\n\n            const float* k0 = (const float*)kernel + (i + ii) * inch * 9 + (k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                float r0 = k0[0];\n                float r1 = k0[1];\n                float r2 = k0[2];\n\n                tmp[0][m] = r0;\n                tmp[1][m] = -r0 * ktm0 - r1 * ktm1 - r2 * ktm2;\n                tmp[2][m] = -r0 * ktm0 + r1 * ktm1 - r2 * ktm2;\n                tmp[3][m] = r0 * ktm3 + r1 * ktm4 + r2 * ktm2;\n                tmp[4][m] = r0 * ktm3 - r1 * ktm4 + r2 * ktm2;\n                tmp[5][m] = r2;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 6; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n\n                float z0 = r0;\n                float z1 = -r0 * ktm0 - r1 * ktm1 - r2 * ktm2;\n                float z2 = -r0 * ktm0 + r1 * ktm1 - r2 * ktm2;\n                float z3 = r0 * ktm3 + r1 * ktm4 + r2 * ktm2;\n                float z4 = r0 * ktm3 - r1 * ktm4 + r2 * 
ktm2;\n                float z5 = r2;\n\n                ptmp[0] = (__fp16)z0;\n                ptmp[1] = (__fp16)z1;\n                ptmp[2] = (__fp16)z2;\n                ptmp[3] = (__fp16)z3;\n                ptmp[4] = (__fp16)z4;\n                ptmp[5] = (__fp16)z5;\n                ptmp += 6;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_transform_kernel_fp16sa(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    const int K = inch;\n    const int B = 36;\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk_fp16(M, 0, K, B, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads, (size_t)2u);\n\n    AT.create(TILE_K * TILE_M, B, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, (size_t)2u);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd43_transform_kernel_tile_fp16sa(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            conv3x3s1_winograd_pack_A_tile_fp16(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd43_transform_input_tile_fp16sa(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    const __fp16 sq2 = 1.41421356237;\n    const __fp16 msq2_d2 = -1.41421356237 / 2;\n\n    // const float itm[6][6] = {\n    //     {1.0f,  0.0f,  -2.5f,  0.0f,  1.0f, 0.0f},\n    //     {0.0f, -sq2,   -2.0f,  sq2/2, 1.0f, 0.0f},\n    //     {0.0f,  sq2,   
-2.0f, -sq2/2, 1.0f, 0.0f},\n    //     {0.0f, -sq2/2, -0.5f,  sq2,   1.0f, 0.0f},\n    //     {0.0f,  sq2/2, -0.5f, -sq2,   1.0f, 0.0f},\n    //     {0.0f,  1.0f,   0.0f,  -2.5f, 0.0f, 1.0f}\n    // };\n\n    // 0 =  r00 + r04 - 2.5f * r02\n    // 1 = -(sq2 * r01 - sq2_d2 * r03) + (r04 - 2 * r02)\n    // 2 =  (sq2 * r01 - sq2_d2 * r03) + (r04 - 2 * r02)\n    // 3 =  (sq2 * r03 - sq2_d2 * r01) + (r04 - 0.5f * r02)\n    // 4 = -(sq2 * r03 - sq2_d2 * r01) + (r04 - 0.5f * r02)\n    // 5 =  r01 + r05 - 2.5f * r03\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const int N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w + 1) / 4;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[6][6][8];\n\n        const __fp16 coeffs[8] = {sq2, msq2_d2, -2.f, -0.5f, -2.5f, 0.f, 0.f, 0.f};\n        float16x8_t _coeffs = vld1q_f16(coeffs);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel((k + kk) / elempack).row<const __fp16>(ti * 4) + (tj * 4) * elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float16x8_t _r0 = vdupq_n_f16(0.f);\n                float16x8_t _r1 = vdupq_n_f16(0.f);\n                float16x8_t _r2 = vdupq_n_f16(0.f);\n                float16x8_t _r3 = vdupq_n_f16(0.f);\n                float16x8_t _r4 = vdupq_n_f16(0.f);\n                float16x8_t _r5 = vdupq_n_f16(0.f);\n\n                if (ti * 4 + m < h)\n                {\n                    if 
(elempack == 8)\n                    {\n                        _r0 = vld1q_f16(r0);\n                        if (tj * 4 + 1 < w) _r1 = vld1q_f16(r0 + 8);\n                        if (tj * 4 + 2 < w) _r2 = vld1q_f16(r0 + 16);\n                        if (tj * 4 + 3 < w) _r3 = vld1q_f16(r0 + 24);\n                        if (tj * 4 + 4 < w) _r4 = vld1q_f16(r0 + 32);\n                        if (tj * 4 + 5 < w) _r5 = vld1q_f16(r0 + 40);\n                    }\n                    if (elempack == 4)\n                    {\n                        const __fp16* r1 = r0 + N;\n\n                        _r0 = vcombine_f16(vld1_f16(r0), vld1_f16(r1));\n                        if (tj * 4 + 1 < w)\n                        {\n                            _r1 = vcombine_f16(vld1_f16(r0 + 4), vld1_f16(r1 + 4));\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            _r2 = vcombine_f16(vld1_f16(r0 + 8), vld1_f16(r1 + 8));\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            _r3 = vcombine_f16(vld1_f16(r0 + 12), vld1_f16(r1 + 12));\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            _r4 = vcombine_f16(vld1_f16(r0 + 16), vld1_f16(r1 + 16));\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            _r5 = vcombine_f16(vld1_f16(r0 + 20), vld1_f16(r1 + 20));\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n                        const __fp16* r2 = r0 + N * 2;\n                        const __fp16* r3 = r0 + N * 3;\n                        const __fp16* r4 = r0 + N * 4;\n                        const __fp16* r5 = r0 + N * 5;\n                        const __fp16* r6 = r0 + N * 6;\n   
                     const __fp16* r7 = r0 + N * 7;\n\n                        float16x4_t _t0 = vld1_f16(r0);\n                        float16x4_t _t1 = vld1_f16(r1);\n                        float16x4_t _t2 = vld1_f16(r2);\n                        float16x4_t _t3 = vld1_f16(r3);\n                        float16x4_t _t4 = vld1_f16(r4);\n                        float16x4_t _t5 = vld1_f16(r5);\n                        float16x4_t _t6 = vld1_f16(r6);\n                        float16x4_t _t7 = vld1_f16(r7);\n\n                        transpose4x4_ph(_t0, _t1, _t2, _t3);\n                        transpose4x4_ph(_t4, _t5, _t6, _t7);\n\n                        _r0 = vcombine_f16(_t0, _t4);\n                        if (tj * 4 + 1 < w)\n                        {\n                            _r1 = vcombine_f16(_t1, _t5);\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            _r2 = vcombine_f16(_t2, _t6);\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            _r3 = vcombine_f16(_t3, _t7);\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            __fp16 tmp[8] = {r0[4], r1[4], r2[4], r3[4], r4[4], r5[4], r6[4], r7[4]};\n                            _r4 = vld1q_f16(tmp);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            __fp16 tmp[8] = {r0[5], r1[5], r2[5], r3[5], r4[5], r5[5], r6[5], r7[5]};\n                            _r5 = vld1q_f16(tmp);\n                        }\n                    }\n                }\n\n                float16x8_t _tmp12a = vfmaq_laneq_f16(vmulq_laneq_f16(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float16x8_t _tmp12b = vfmaq_laneq_f16(_r4, _r2, _coeffs, 2);\n                float16x8_t _tmp34a = vfmaq_laneq_f16(vmulq_laneq_f16(_r3, _coeffs, 0), _r1, 
_coeffs, 1);\n                float16x8_t _tmp34b = vfmaq_laneq_f16(_r4, _r2, _coeffs, 3);\n\n                float16x8_t _tmp0 = vfmaq_laneq_f16(vaddq_f16(_r0, _r4), _r2, _coeffs, 4);\n                float16x8_t _tmp1 = vsubq_f16(_tmp12b, _tmp12a);\n                float16x8_t _tmp2 = vaddq_f16(_tmp12b, _tmp12a);\n                float16x8_t _tmp3 = vaddq_f16(_tmp34b, _tmp34a);\n                float16x8_t _tmp4 = vsubq_f16(_tmp34b, _tmp34a);\n                float16x8_t _tmp5 = vfmaq_laneq_f16(vaddq_f16(_r1, _r5), _r3, _coeffs, 4);\n\n                vst1q_f16(tmp[0][m], _tmp0);\n                vst1q_f16(tmp[1][m], _tmp1);\n                vst1q_f16(tmp[2][m], _tmp2);\n                vst1q_f16(tmp[3][m], _tmp3);\n                vst1q_f16(tmp[4][m], _tmp4);\n                vst1q_f16(tmp[5][m], _tmp5);\n\n                r0 += w * elempack;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 36 + jj * 8;\n            __fp16* p1 = p0 + max_jj * 8;\n            __fp16* p2 = p0 + max_jj * 8 * 2;\n            __fp16* p3 = p0 + max_jj * 8 * 3;\n            __fp16* p4 = p0 + max_jj * 8 * 4;\n            __fp16* p5 = p0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float16x8_t _r0 = vld1q_f16(tmp[m][0]);\n                float16x8_t _r1 = vld1q_f16(tmp[m][1]);\n                float16x8_t _r2 = vld1q_f16(tmp[m][2]);\n                float16x8_t _r3 = vld1q_f16(tmp[m][3]);\n                float16x8_t _r4 = vld1q_f16(tmp[m][4]);\n                float16x8_t _r5 = vld1q_f16(tmp[m][5]);\n\n                float16x8_t _tmp12a = vfmaq_laneq_f16(vmulq_laneq_f16(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float16x8_t _tmp12b = vfmaq_laneq_f16(_r4, _r2, _coeffs, 2);\n                float16x8_t _tmp34a = vfmaq_laneq_f16(vmulq_laneq_f16(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float16x8_t _tmp34b = vfmaq_laneq_f16(_r4, _r2, _coeffs, 3);\n\n                float16x8_t _tmp0 = 
vfmaq_laneq_f16(vaddq_f16(_r0, _r4), _r2, _coeffs, 4);\n                float16x8_t _tmp1 = vsubq_f16(_tmp12b, _tmp12a);\n                float16x8_t _tmp2 = vaddq_f16(_tmp12b, _tmp12a);\n                float16x8_t _tmp3 = vaddq_f16(_tmp34b, _tmp34a);\n                float16x8_t _tmp4 = vsubq_f16(_tmp34b, _tmp34a);\n                float16x8_t _tmp5 = vfmaq_laneq_f16(vaddq_f16(_r1, _r5), _r3, _coeffs, 4);\n\n                vst1q_f16(p0, _tmp0);\n                vst1q_f16(p1, _tmp1);\n                vst1q_f16(p2, _tmp2);\n                vst1q_f16(p3, _tmp3);\n                vst1q_f16(p4, _tmp4);\n                vst1q_f16(p5, _tmp5);\n\n                p0 += max_jj * 6 * 8;\n                p1 += max_jj * 6 * 8;\n                p2 += max_jj * 6 * 8;\n                p3 += max_jj * 6 * 8;\n                p4 += max_jj * 6 * 8;\n                p5 += max_jj * 6 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[6][6][4];\n\n        const __fp16 coeffs[8] = {sq2, msq2_d2, -2.f, -0.5f, -2.5f, 0.f, 0.f, 0.f};\n        float16x8_t _coeffs = vld1q_f16(coeffs);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel((k + kk) / elempack).row<const __fp16>(ti * 4) + (tj * 4) * elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float16x4_t _r0 = vdup_n_f16(0.f);\n                float16x4_t _r1 = vdup_n_f16(0.f);\n                float16x4_t _r2 = vdup_n_f16(0.f);\n                float16x4_t _r3 = vdup_n_f16(0.f);\n                float16x4_t _r4 = vdup_n_f16(0.f);\n         
       float16x4_t _r5 = vdup_n_f16(0.f);\n\n                if (ti * 4 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        if (tj * 4 + 1 < w) _r1 = vld1_f16(r0 + 4);\n                        if (tj * 4 + 2 < w) _r2 = vld1_f16(r0 + 8);\n                        if (tj * 4 + 3 < w) _r3 = vld1_f16(r0 + 12);\n                        if (tj * 4 + 4 < w) _r4 = vld1_f16(r0 + 16);\n                        if (tj * 4 + 5 < w) _r5 = vld1_f16(r0 + 20);\n                    }\n                    if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n                        const __fp16* r2 = r0 + N * 2;\n                        const __fp16* r3 = r0 + N * 3;\n\n                        float16x4_t _t0 = vld1_f16(r0);\n                        float16x4_t _t1 = vld1_f16(r1);\n                        float16x4_t _t2 = vld1_f16(r2);\n                        float16x4_t _t3 = vld1_f16(r3);\n\n                        transpose4x4_ph(_t0, _t1, _t2, _t3);\n\n                        _r0 = _t0;\n                        if (tj * 4 + 1 < w) _r1 = _t1;\n                        if (tj * 4 + 2 < w) _r2 = _t2;\n                        if (tj * 4 + 3 < w) _r3 = _t3;\n                        if (tj * 4 + 4 < w)\n                        {\n                            __fp16 tmp[4] = {r0[4], r1[4], r2[4], r3[4]};\n                            _r4 = vld1_f16(tmp);\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            __fp16 tmp[4] = {r0[5], r1[5], r2[5], r3[5]};\n                            _r5 = vld1_f16(tmp);\n                        }\n                    }\n                }\n\n                float16x4_t _tmp12a = vfma_laneq_f16(vmul_laneq_f16(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float16x4_t _tmp12b = vfma_laneq_f16(_r4, _r2, _coeffs, 2);\n                
float16x4_t _tmp34a = vfma_laneq_f16(vmul_laneq_f16(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float16x4_t _tmp34b = vfma_laneq_f16(_r4, _r2, _coeffs, 3);\n\n                float16x4_t _tmp0 = vfma_laneq_f16(vadd_f16(_r0, _r4), _r2, _coeffs, 4);\n                float16x4_t _tmp1 = vsub_f16(_tmp12b, _tmp12a);\n                float16x4_t _tmp2 = vadd_f16(_tmp12b, _tmp12a);\n                float16x4_t _tmp3 = vadd_f16(_tmp34b, _tmp34a);\n                float16x4_t _tmp4 = vsub_f16(_tmp34b, _tmp34a);\n                float16x4_t _tmp5 = vfma_laneq_f16(vadd_f16(_r1, _r5), _r3, _coeffs, 4);\n\n                vst1_f16(tmp[0][m], _tmp0);\n                vst1_f16(tmp[1][m], _tmp1);\n                vst1_f16(tmp[2][m], _tmp2);\n                vst1_f16(tmp[3][m], _tmp3);\n                vst1_f16(tmp[4][m], _tmp4);\n                vst1_f16(tmp[5][m], _tmp5);\n\n                r0 += w * elempack;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 36 + jj * 4;\n            __fp16* p1 = p0 + max_jj * 4;\n            __fp16* p2 = p0 + max_jj * 4 * 2;\n            __fp16* p3 = p0 + max_jj * 4 * 3;\n            __fp16* p4 = p0 + max_jj * 4 * 4;\n            __fp16* p5 = p0 + max_jj * 4 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float16x4_t _r0 = vld1_f16(tmp[m][0]);\n                float16x4_t _r1 = vld1_f16(tmp[m][1]);\n                float16x4_t _r2 = vld1_f16(tmp[m][2]);\n                float16x4_t _r3 = vld1_f16(tmp[m][3]);\n                float16x4_t _r4 = vld1_f16(tmp[m][4]);\n                float16x4_t _r5 = vld1_f16(tmp[m][5]);\n\n                float16x4_t _tmp12a = vfma_laneq_f16(vmul_laneq_f16(_r1, _coeffs, 0), _r3, _coeffs, 1);\n                float16x4_t _tmp12b = vfma_laneq_f16(_r4, _r2, _coeffs, 2);\n                float16x4_t _tmp34a = vfma_laneq_f16(vmul_laneq_f16(_r3, _coeffs, 0), _r1, _coeffs, 1);\n                float16x4_t _tmp34b = vfma_laneq_f16(_r4, _r2, _coeffs, 3);\n\n     
           float16x4_t _tmp0 = vfma_laneq_f16(vadd_f16(_r0, _r4), _r2, _coeffs, 4);\n                float16x4_t _tmp1 = vsub_f16(_tmp12b, _tmp12a);\n                float16x4_t _tmp2 = vadd_f16(_tmp12b, _tmp12a);\n                float16x4_t _tmp3 = vadd_f16(_tmp34b, _tmp34a);\n                float16x4_t _tmp4 = vsub_f16(_tmp34b, _tmp34a);\n                float16x4_t _tmp5 = vfma_laneq_f16(vadd_f16(_r1, _r5), _r3, _coeffs, 4);\n\n                vst1_f16(p0, _tmp0);\n                vst1_f16(p1, _tmp1);\n                vst1_f16(p2, _tmp2);\n                vst1_f16(p3, _tmp3);\n                vst1_f16(p4, _tmp4);\n                vst1_f16(p5, _tmp5);\n\n                p0 += max_jj * 6 * 4;\n                p1 += max_jj * 6 * 4;\n                p2 += max_jj * 6 * 4;\n                p3 += max_jj * 6 * 4;\n                p4 += max_jj * 6 * 4;\n                p5 += max_jj * 6 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        __fp16 tmp[6][6][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel(k + kk).row<const __fp16>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 6; m++)\n            {\n                __fp16 r00 = 0.f;\n                __fp16 r01 = 0.f;\n                __fp16 r10 = 0.f;\n                __fp16 r11 = 0.f;\n                __fp16 r20 = 0.f;\n                __fp16 r21 = 0.f;\n                __fp16 r30 = 0.f;\n                __fp16 r31 = 0.f;\n                __fp16 r40 = 0.f;\n                __fp16 r41 = 0.f;\n                __fp16 r50 = 0.f;\n                __fp16 r51 = 
0.f;\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 4 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            r40 = r0[4];\n                            r41 = r1[4];\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            r50 = r0[5];\n                            r51 = r1[5];\n                        }\n                    }\n                }\n\n                __fp16 tmp12a0 = sq2 * r10 + msq2_d2 * r30;\n                __fp16 tmp12a1 = sq2 * r11 + msq2_d2 * r31;\n                __fp16 tmp12b0 = r40 - (__fp16)2.f * r20;\n                __fp16 tmp12b1 = r41 - (__fp16)2.f * r21;\n                __fp16 tmp34a0 = sq2 * r30 + msq2_d2 * r10;\n                __fp16 tmp34a1 = sq2 * r31 + msq2_d2 * r11;\n                __fp16 tmp34b0 = r40 - (__fp16)0.5f * r20;\n                __fp16 tmp34b1 = r41 - (__fp16)0.5f * r21;\n\n                tmp[0][m][0] = r00 + r40 - (__fp16)2.5f * r20;\n                tmp[0][m][1] = r01 + r41 - (__fp16)2.5f * r21;\n                tmp[1][m][0] = tmp12b0 - tmp12a0;\n                tmp[1][m][1] = tmp12b1 - tmp12a1;\n                tmp[2][m][0] = tmp12b0 + tmp12a0;\n                
tmp[2][m][1] = tmp12b1 + tmp12a1;\n                tmp[3][m][0] = tmp34b0 + tmp34a0;\n                tmp[3][m][1] = tmp34b1 + tmp34a1;\n                tmp[4][m][0] = tmp34b0 - tmp34a0;\n                tmp[4][m][1] = tmp34b1 - tmp34a1;\n                tmp[5][m][0] = r10 + r50 - (__fp16)2.5f * r30;\n                tmp[5][m][1] = r11 + r51 - (__fp16)2.5f * r31;\n\n                r0 += w;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 36 + jj * 2;\n            __fp16* p1 = p0 + max_jj * 2;\n            __fp16* p2 = p0 + max_jj * 2 * 2;\n            __fp16* p3 = p0 + max_jj * 2 * 3;\n            __fp16* p4 = p0 + max_jj * 2 * 4;\n            __fp16* p5 = p0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                __fp16 r00 = tmp[m][0][0];\n                __fp16 r01 = tmp[m][0][1];\n                __fp16 r10 = tmp[m][1][0];\n                __fp16 r11 = tmp[m][1][1];\n                __fp16 r20 = tmp[m][2][0];\n                __fp16 r21 = tmp[m][2][1];\n                __fp16 r30 = tmp[m][3][0];\n                __fp16 r31 = tmp[m][3][1];\n                __fp16 r40 = tmp[m][4][0];\n                __fp16 r41 = tmp[m][4][1];\n                __fp16 r50 = tmp[m][5][0];\n                __fp16 r51 = tmp[m][5][1];\n\n                __fp16 tmp12a0 = sq2 * r10 + msq2_d2 * r30;\n                __fp16 tmp12a1 = sq2 * r11 + msq2_d2 * r31;\n                __fp16 tmp12b0 = r40 - (__fp16)2.f * r20;\n                __fp16 tmp12b1 = r41 - (__fp16)2.f * r21;\n                __fp16 tmp34a0 = sq2 * r30 + msq2_d2 * r10;\n                __fp16 tmp34a1 = sq2 * r31 + msq2_d2 * r11;\n                __fp16 tmp34b0 = r40 - (__fp16)0.5f * r20;\n                __fp16 tmp34b1 = r41 - (__fp16)0.5f * r21;\n\n                p0[0] = r00 + r40 - (__fp16)2.5f * r20;\n                p0[1] = r01 + r41 - (__fp16)2.5f * r21;\n                p1[0] = tmp12b0 - tmp12a0;\n                p1[1] = tmp12b1 - tmp12a1;\n                
p2[0] = tmp12b0 + tmp12a0;\n                p2[1] = tmp12b1 + tmp12a1;\n                p3[0] = tmp34b0 + tmp34a0;\n                p3[1] = tmp34b1 + tmp34a1;\n                p4[0] = tmp34b0 - tmp34a0;\n                p4[1] = tmp34b1 - tmp34a1;\n                p5[0] = r10 + r50 - (__fp16)2.5f * r30;\n                p5[1] = r11 + r51 - (__fp16)2.5f * r31;\n\n                p0 += max_jj * 6 * 2;\n                p1 += max_jj * 6 * 2;\n                p2 += max_jj * 6 * 2;\n                p3 += max_jj * 6 * 2;\n                p4 += max_jj * 6 * 2;\n                p5 += max_jj * 6 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        __fp16 tmp[6][6];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0123 = bottom_blob.channel(k + kk).row<const __fp16>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 6; m++)\n            {\n                __fp16 r0 = 0.f;\n                __fp16 r1 = 0.f;\n                __fp16 r2 = 0.f;\n                __fp16 r3 = 0.f;\n                __fp16 r4 = 0.f;\n                __fp16 r5 = 0.f;\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 4 + 1 < w) r1 = r0123[1];\n                        if (tj * 4 + 2 < w) r2 = r0123[2];\n                        if (tj * 4 + 3 < w) r3 = r0123[3];\n                        if (tj * 4 + 4 < w) r4 = r0123[4];\n                        if (tj * 4 + 5 < w) r5 = r0123[5];\n                    }\n                }\n\n                __fp16 tmp12a = sq2 * r1 + msq2_d2 * r3;\n                __fp16 tmp12b = r4 - (__fp16)2.f * r2;\n                __fp16 tmp34a = sq2 * r3 + msq2_d2 * r1;\n                __fp16 tmp34b = r4 
- (__fp16)0.5f * r2;\n\n                tmp[0][m] = r0 + r4 - (__fp16)2.5f * r2;\n                tmp[1][m] = tmp12b - tmp12a;\n                tmp[2][m] = tmp12b + tmp12a;\n                tmp[3][m] = tmp34b + tmp34a;\n                tmp[4][m] = tmp34b - tmp34a;\n                tmp[5][m] = r1 + r5 - (__fp16)2.5f * r3;\n\n                r0123 += w;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 36 + jj;\n            __fp16* p1 = p0 + max_jj;\n            __fp16* p2 = p0 + max_jj * 2;\n            __fp16* p3 = p0 + max_jj * 3;\n            __fp16* p4 = p0 + max_jj * 4;\n            __fp16* p5 = p0 + max_jj * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                __fp16 r0 = tmp[m][0];\n                __fp16 r1 = tmp[m][1];\n                __fp16 r2 = tmp[m][2];\n                __fp16 r3 = tmp[m][3];\n                __fp16 r4 = tmp[m][4];\n                __fp16 r5 = tmp[m][5];\n\n                __fp16 tmp12a = sq2 * r1 + msq2_d2 * r3;\n                __fp16 tmp12b = r4 - (__fp16)2.f * r2;\n                __fp16 tmp34a = sq2 * r3 + msq2_d2 * r1;\n                __fp16 tmp34b = r4 - (__fp16)0.5f * r2;\n\n                p0[0] = r0 + r4 - (__fp16)2.5f * r2;\n                p1[0] = tmp12b - tmp12a;\n                p2[0] = tmp12b + tmp12a;\n                p3[0] = tmp34b + tmp34a;\n                p4[0] = tmp34b - tmp34a;\n                p5[0] = r1 + r5 - (__fp16)2.5f * r3;\n\n                p0 += max_jj * 6;\n                p1 += max_jj * 6;\n                p2 += max_jj * 6;\n                p3 += max_jj * 6;\n                p4 += max_jj * 6;\n                p5 += max_jj * 6;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd43_transform_output_tile_fp16sa(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    const __fp16 sq2 = 1.41421356237;\n    const __fp16 sq2_m2 = 1.41421356237 * 2;\n    const __fp16 sq2_d2 = 1.41421356237 / 2;\n   
 const __fp16 sq2_d4 = 1.41421356237 / 4;\n\n    // const float otm[4][6] = {\n    //     {1.0f, 1.0f,   1.0f,  1.0f,  1.0f,   0.0f},\n    //     {0.0f, sq2/2, -sq2/2, sq2,   -sq2,   0.0f},\n    //     {0.0f, 0.5f,   0.5f,  2.0f,  2.0f,   0.0f},\n    //     {0.0f, sq2/4, -sq2/4, sq2*2, -sq2*2, 1.0f}\n    // };\n\n    // 0 = r00 + (r01 + r02) + (r03 + r04)\n    // 1 =       (r01 - r02) * sq2_d2 + (r03 - r04) * sq2\n    // 2 =       (r01 + r02) * 0.5f + (r03 + r04) * 2\n    // 3 = r05 + (r01 - r02) * sq2_d4 + (r03 - r04) * sq2_m2\n\n    const __fp16 coeffs[8] = {sq2, sq2_d2, sq2_d4, sq2_m2, 0.5f, 2.f, 0.f, 0.f};\n    float16x8_t _coeffs = vld1q_f16(coeffs);\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const int N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 3) / 4;\n\n    const __fp16* biasptr = bias;\n\n    int ii = 0;\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float16x8_t _bias0 = biasptr ? 
vld1q_f16(biasptr + i + ii) : vdupq_n_f16(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[4][6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 36 + jj * 8;\n            const __fp16* r1 = r0 + max_jj * 8;\n            const __fp16* r2 = r0 + max_jj * 8 * 2;\n            const __fp16* r3 = r0 + max_jj * 8 * 3;\n            const __fp16* r4 = r0 + max_jj * 8 * 4;\n            const __fp16* r5 = r0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float16x8_t _r0 = vld1q_f16(r0);\n                float16x8_t _r1 = vld1q_f16(r1);\n                float16x8_t _r2 = vld1q_f16(r2);\n                float16x8_t _r3 = vld1q_f16(r3);\n                float16x8_t _r4 = vld1q_f16(r4);\n                float16x8_t _r5 = vld1q_f16(r5);\n\n                float16x8_t _tmp02a = vaddq_f16(_r1, _r2);\n                float16x8_t _tmp02b = vaddq_f16(_r3, _r4);\n                float16x8_t _tmp13a = vsubq_f16(_r1, _r2);\n                float16x8_t _tmp13b = vsubq_f16(_r3, _r4);\n\n                float16x8_t _tmp0 = vaddq_f16(vaddq_f16(_r0, _tmp02a), _tmp02b);\n                float16x8_t _tmp1 = vfmaq_laneq_f16(vmulq_laneq_f16(_tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float16x8_t _tmp2 = vfmaq_laneq_f16(vmulq_laneq_f16(_tmp02a, _coeffs, 4), _tmp02b, _coeffs, 5);\n                float16x8_t _tmp3 = vfmaq_laneq_f16(vfmaq_laneq_f16(_r5, _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n\n                vst1q_f16(tmp[0][m], _tmp0);\n                vst1q_f16(tmp[1][m], _tmp1);\n                vst1q_f16(tmp[2][m], _tmp2);\n                vst1q_f16(tmp[3][m], _tmp3);\n\n                r0 += max_jj * 6 * 8;\n                r1 += max_jj * 6 * 8;\n                r2 += 
max_jj * 6 * 8;\n                r3 += max_jj * 6 * 8;\n                r4 += max_jj * 6 * 8;\n                r5 += max_jj * 6 * 8;\n            }\n\n            __fp16* outptr0 = top_blob.channel((i + ii) / out_elempack).row<__fp16>(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float16x8_t _r0 = vld1q_f16(tmp[m][0]);\n                float16x8_t _r1 = vld1q_f16(tmp[m][1]);\n                float16x8_t _r2 = vld1q_f16(tmp[m][2]);\n                float16x8_t _r3 = vld1q_f16(tmp[m][3]);\n                float16x8_t _r4 = vld1q_f16(tmp[m][4]);\n                float16x8_t _r5 = vld1q_f16(tmp[m][5]);\n\n                float16x8_t _tmp02a = vaddq_f16(_r1, _r2);\n                float16x8_t _tmp02b = vaddq_f16(_r3, _r4);\n                float16x8_t _tmp13a = vsubq_f16(_r1, _r2);\n                float16x8_t _tmp13b = vsubq_f16(_r3, _r4);\n\n                float16x8_t _tmp0 = vaddq_f16(vaddq_f16(_r0, _tmp02a), vaddq_f16(_tmp02b, _bias0));\n                float16x8_t _tmp1 = vfmaq_laneq_f16(vfmaq_laneq_f16(_bias0, _tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float16x8_t _tmp2 = vfmaq_laneq_f16(vfmaq_laneq_f16(_bias0, _tmp02a, _coeffs, 4), _tmp02b, _coeffs, 5);\n                float16x8_t _tmp3 = vfmaq_laneq_f16(vfmaq_laneq_f16(vaddq_f16(_r5, _bias0), _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _tmp0);\n                    if (tj * 4 + 1 < outw) vst1q_f16(outptr0 + 8, _tmp1);\n                    if (tj * 4 + 2 < outw) vst1q_f16(outptr0 + 16, _tmp2);\n                    if (tj * 4 + 3 < outw) vst1q_f16(outptr0 + 24, _tmp3);\n                }\n                if (out_elempack == 4)\n                {\n                    __fp16* outptr1 = outptr0 + N;\n\n                    vst1_f16(outptr0, vget_low_f16(_tmp0));\n  
                  vst1_f16(outptr1, vget_high_f16(_tmp0));\n                    if (tj * 4 + 1 < outw)\n                    {\n                        vst1_f16(outptr0 + 4, vget_low_f16(_tmp1));\n                        vst1_f16(outptr1 + 4, vget_high_f16(_tmp1));\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        vst1_f16(outptr0 + 8, vget_low_f16(_tmp2));\n                        vst1_f16(outptr1 + 8, vget_high_f16(_tmp2));\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        vst1_f16(outptr0 + 12, vget_low_f16(_tmp3));\n                        vst1_f16(outptr1 + 12, vget_high_f16(_tmp3));\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 tmp0[8];\n                    __fp16 tmp1[8];\n                    __fp16 tmp2[8];\n                    __fp16 tmp3[8];\n                    vst1q_f16(tmp0, _tmp0);\n                    vst1q_f16(tmp1, _tmp1);\n                    vst1q_f16(tmp2, _tmp2);\n                    vst1q_f16(tmp3, _tmp3);\n\n                    __fp16* outptr1 = outptr0 + N;\n                    __fp16* outptr2 = outptr0 + N * 2;\n                    __fp16* outptr3 = outptr0 + N * 3;\n                    __fp16* outptr4 = outptr0 + N * 4;\n                    __fp16* outptr5 = outptr0 + N * 5;\n                    __fp16* outptr6 = outptr0 + N * 6;\n                    __fp16* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = 
tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                        outptr4[2] = tmp2[4];\n                        outptr5[2] = tmp2[5];\n                        outptr6[2] = tmp2[6];\n                        outptr7[2] = tmp2[7];\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                        outptr4[3] = tmp3[4];\n                        outptr5[3] = tmp3[5];\n                        outptr6[3] = tmp3[6];\n                        outptr7[3] = tmp3[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float16x4_t _bias0 = biasptr ? 
vld1_f16(biasptr + i + ii) : vdup_n_f16(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[4][6][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 36 + jj * 4;\n            const __fp16* r1 = r0 + max_jj * 4;\n            const __fp16* r2 = r0 + max_jj * 4 * 2;\n            const __fp16* r3 = r0 + max_jj * 4 * 3;\n            const __fp16* r4 = r0 + max_jj * 4 * 4;\n            const __fp16* r5 = r0 + max_jj * 4 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                float16x4_t _r0 = vld1_f16(r0);\n                float16x4_t _r1 = vld1_f16(r1);\n                float16x4_t _r2 = vld1_f16(r2);\n                float16x4_t _r3 = vld1_f16(r3);\n                float16x4_t _r4 = vld1_f16(r4);\n                float16x4_t _r5 = vld1_f16(r5);\n\n                float16x4_t _tmp02a = vadd_f16(_r1, _r2);\n                float16x4_t _tmp02b = vadd_f16(_r3, _r4);\n                float16x4_t _tmp13a = vsub_f16(_r1, _r2);\n                float16x4_t _tmp13b = vsub_f16(_r3, _r4);\n\n                float16x4_t _tmp0 = vadd_f16(vadd_f16(_r0, _tmp02a), _tmp02b);\n                float16x4_t _tmp1 = vfma_laneq_f16(vmul_laneq_f16(_tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float16x4_t _tmp2 = vfma_laneq_f16(vmul_laneq_f16(_tmp02a, _coeffs, 4), _tmp02b, _coeffs, 5);\n                float16x4_t _tmp3 = vfma_laneq_f16(vfma_laneq_f16(_r5, _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n\n                vst1_f16(tmp[0][m], _tmp0);\n                vst1_f16(tmp[1][m], _tmp1);\n                vst1_f16(tmp[2][m], _tmp2);\n                vst1_f16(tmp[3][m], _tmp3);\n\n                r0 += max_jj * 6 * 4;\n                r1 += max_jj * 6 * 4;\n                r2 += max_jj * 6 * 4;\n        
        r3 += max_jj * 6 * 4;\n                r4 += max_jj * 6 * 4;\n                r5 += max_jj * 6 * 4;\n            }\n\n            __fp16* outptr0 = top_blob.channel((i + ii) / out_elempack).row<__fp16>(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                float16x4_t _r0 = vld1_f16(tmp[m][0]);\n                float16x4_t _r1 = vld1_f16(tmp[m][1]);\n                float16x4_t _r2 = vld1_f16(tmp[m][2]);\n                float16x4_t _r3 = vld1_f16(tmp[m][3]);\n                float16x4_t _r4 = vld1_f16(tmp[m][4]);\n                float16x4_t _r5 = vld1_f16(tmp[m][5]);\n\n                float16x4_t _tmp02a = vadd_f16(_r1, _r2);\n                float16x4_t _tmp02b = vadd_f16(_r3, _r4);\n                float16x4_t _tmp13a = vsub_f16(_r1, _r2);\n                float16x4_t _tmp13b = vsub_f16(_r3, _r4);\n\n                float16x4_t _tmp0 = vadd_f16(vadd_f16(_r0, _tmp02a), vadd_f16(_tmp02b, _bias0));\n                float16x4_t _tmp1 = vfma_laneq_f16(vfma_laneq_f16(_bias0, _tmp13a, _coeffs, 1), _tmp13b, _coeffs, 0);\n                float16x4_t _tmp2 = vfma_laneq_f16(vfma_laneq_f16(_bias0, _tmp02a, _coeffs, 4), _tmp02b, _coeffs, 5);\n                float16x4_t _tmp3 = vfma_laneq_f16(vfma_laneq_f16(vadd_f16(_r5, _bias0), _tmp13a, _coeffs, 2), _tmp13b, _coeffs, 3);\n\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _tmp0);\n                    if (tj * 4 + 1 < outw) vst1_f16(outptr0 + 4, _tmp1);\n                    if (tj * 4 + 2 < outw) vst1_f16(outptr0 + 8, _tmp2);\n                    if (tj * 4 + 3 < outw) vst1_f16(outptr0 + 12, _tmp3);\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 tmp0[4];\n                    __fp16 tmp1[4];\n                    __fp16 tmp2[4];\n                    __fp16 tmp3[4];\n                   
 vst1_f16(tmp0, _tmp0);\n                    vst1_f16(tmp1, _tmp1);\n                    vst1_f16(tmp2, _tmp2);\n                    vst1_f16(tmp3, _tmp3);\n\n                    __fp16* outptr1 = outptr0 + N;\n                    __fp16* outptr2 = outptr0 + N * 2;\n                    __fp16* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        __fp16 bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        __fp16 bias1 = biasptr ? 
biasptr[i + ii + 1] : 0.f;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        __fp16 tmp[4][6][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 36 + jj * 2;\n            const __fp16* r1 = r0 + max_jj * 2;\n            const __fp16* r2 = r0 + max_jj * 2 * 2;\n            const __fp16* r3 = r0 + max_jj * 2 * 3;\n            const __fp16* r4 = r0 + max_jj * 2 * 4;\n            const __fp16* r5 = r0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                __fp16 tmp02a0 = r1[0] + r2[0];\n                __fp16 tmp02a1 = r1[1] + r2[1];\n                __fp16 tmp02b0 = r3[0] + r4[0];\n                __fp16 tmp02b1 = r3[1] + r4[1];\n                __fp16 tmp13a0 = r1[0] - r2[0];\n                __fp16 tmp13a1 = r1[1] - r2[1];\n                __fp16 tmp13b0 = r3[0] - r4[0];\n                __fp16 tmp13b1 = r3[1] - r4[1];\n\n                tmp[0][m][0] = r0[0] + tmp02a0 + tmp02b0;\n                tmp[0][m][1] = r0[1] + tmp02a1 + tmp02b1;\n                tmp[1][m][0] = tmp13a0 * sq2_d2 + tmp13b0 * sq2;\n                tmp[1][m][1] = tmp13a1 * sq2_d2 + tmp13b1 * sq2;\n                tmp[2][m][0] = tmp02a0 * (__fp16)0.5f + tmp02b0 * (__fp16)2;\n                tmp[2][m][1] = tmp02a1 * (__fp16)0.5f + tmp02b1 * (__fp16)2;\n                tmp[3][m][0] = r5[0] + tmp13a0 * sq2_d4 + tmp13b0 * sq2_m2;\n                tmp[3][m][1] = r5[1] + tmp13a1 * sq2_d4 + tmp13b1 * sq2_m2;\n\n                r0 += max_jj * 6 * 2;\n                r1 += max_jj * 6 * 2;\n                r2 += max_jj * 6 * 2;\n                r3 += max_jj * 6 * 2;\n                r4 += max_jj * 6 * 2;\n                r5 += max_jj * 6 * 2;\n            }\n\n            __fp16* outptr0 = top_blob.channel(i + ii).row<__fp16>(ti * 4) 
+ (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                __fp16 r00 = tmp[m][0][0];\n                __fp16 r01 = tmp[m][0][1];\n                __fp16 r10 = tmp[m][1][0];\n                __fp16 r11 = tmp[m][1][1];\n                __fp16 r20 = tmp[m][2][0];\n                __fp16 r21 = tmp[m][2][1];\n                __fp16 r30 = tmp[m][3][0];\n                __fp16 r31 = tmp[m][3][1];\n                __fp16 r40 = tmp[m][4][0];\n                __fp16 r41 = tmp[m][4][1];\n                __fp16 r50 = tmp[m][5][0];\n                __fp16 r51 = tmp[m][5][1];\n\n                __fp16 tmp02a0 = r10 + r20;\n                __fp16 tmp02a1 = r11 + r21;\n                __fp16 tmp02b0 = r30 + r40;\n                __fp16 tmp02b1 = r31 + r41;\n                __fp16 tmp13a0 = r10 - r20;\n                __fp16 tmp13a1 = r11 - r21;\n                __fp16 tmp13b0 = r30 - r40;\n                __fp16 tmp13b1 = r31 - r41;\n\n                __fp16 tmp00 = bias0 + r00 + tmp02a0 + tmp02b0;\n                __fp16 tmp01 = bias1 + r01 + tmp02a1 + tmp02b1;\n                __fp16 tmp10 = bias0 + tmp13a0 * sq2_d2 + tmp13b0 * sq2;\n                __fp16 tmp11 = bias1 + tmp13a1 * sq2_d2 + tmp13b1 * sq2;\n                __fp16 tmp20 = bias0 + tmp02a0 * (__fp16)0.5f + tmp02b0 * (__fp16)2;\n                __fp16 tmp21 = bias1 + tmp02a1 * (__fp16)0.5f + tmp02b1 * (__fp16)2;\n                __fp16 tmp30 = bias0 + r50 + tmp13a0 * sq2_d4 + tmp13b0 * sq2_m2;\n                __fp16 tmp31 = bias1 + r51 + tmp13a1 * sq2_d4 + tmp13b1 * sq2_m2;\n\n                // if (out_elempack == 1)\n                {\n                    __fp16* outptr1 = outptr0 + N;\n\n                    outptr0[0] = tmp00;\n                    outptr1[0] = tmp01;\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] 
= tmp11;\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp20;\n                        outptr1[2] = tmp21;\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp30;\n                        outptr1[3] = tmp31;\n                    }\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        __fp16 bias0 = biasptr ? biasptr[i + ii] : 0.f;\n\n        __fp16 tmp[4][6];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 36 + jj;\n            const __fp16* r1 = r0 + max_jj;\n            const __fp16* r2 = r0 + max_jj * 2;\n            const __fp16* r3 = r0 + max_jj * 3;\n            const __fp16* r4 = r0 + max_jj * 4;\n            const __fp16* r5 = r0 + max_jj * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                __fp16 tmp02a = r1[0] + r2[0];\n                __fp16 tmp02b = r3[0] + r4[0];\n                __fp16 tmp13a = r1[0] - r2[0];\n                __fp16 tmp13b = r3[0] - r4[0];\n\n                tmp[0][m] = r0[0] + tmp02a + tmp02b;\n                tmp[1][m] = tmp13a * sq2_d2 + tmp13b * sq2;\n                tmp[2][m] = tmp02a * (__fp16)0.5f + tmp02b * (__fp16)2;\n                tmp[3][m] = r5[0] + tmp13a * sq2_d4 + tmp13b * sq2_m2;\n\n                r0 += max_jj * 6;\n                r1 += max_jj * 6;\n                r2 += max_jj * 6;\n                r3 += max_jj * 6;\n                r4 += max_jj * 6;\n                r5 += max_jj * 6;\n            }\n\n            __fp16* outptr0 = top_blob.channel(i + ii).row<__fp16>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m 
>= outh)\n                    continue;\n\n                __fp16 r0 = tmp[m][0];\n                __fp16 r1 = tmp[m][1];\n                __fp16 r2 = tmp[m][2];\n                __fp16 r3 = tmp[m][3];\n                __fp16 r4 = tmp[m][4];\n                __fp16 r5 = tmp[m][5];\n\n                __fp16 tmp02a = r1 + r2;\n                __fp16 tmp02b = r3 + r4;\n                __fp16 tmp13a = r1 - r2;\n                __fp16 tmp13b = r3 - r4;\n\n                __fp16 tmp0 = bias0 + r0 + tmp02a + tmp02b;\n                __fp16 tmp1 = bias0 + tmp13a * sq2_d2 + tmp13b * sq2;\n                __fp16 tmp2 = bias0 + tmp02a * (__fp16)0.5f + tmp02b * (__fp16)2;\n                __fp16 tmp3 = bias0 + r5 + tmp13a * sq2_d4 + tmp13b * sq2_m2;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 4 + 1 < outw) outptr0[1] = tmp1;\n                    if (tj * 4 + 2 < outw) outptr0[2] = tmp2;\n                    if (tj * 4 + 3 < outw) outptr0[3] = tmp3;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd43_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 4n+2, winograd F(4,3)\n    int w_tiles = (outw + 3) / 4;\n    int h_tiles = (outh + 3) / 4;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 36;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd43_fp16sa %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk_fp16(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // 
NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 2u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd43_transform_input_tile_fp16sa(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile_fp16(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 2u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd43_transform_input_tile_fp16sa(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            
conv3x3s1_winograd_transpose_pack_B_tile_fp16(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 2u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd43_transform_output_tile_fp16sa(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n\nstatic inline void conv3x3s1_winograd63_transform_kernel_tile_fp16sa(const Mat& kernel, Mat& A, int inch, int i, int max_ii, int k, int max_kk)\n{\n    __fp16* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            // const float ktm[8][3] = {\n            //     {1.0f, 0.0f, 0.0f},\n            //     {-2.0f / 9, -2.0f / 9, -2.0f / 9},\n            //     {-2.0f / 9, 2.0f / 9, -2.0f / 9},\n            //     {1.0f / 90, 1.0f / 45, 2.0f / 45},\n            //     {1.0f / 90, -1.0f / 45, 2.0f / 45},\n            //     {1.0f / 45, 1.0f / 90, 1.0f / 180},\n            //     {1.0f / 45, -1.0f / 90, 1.0f / 180},\n            //     {0.0f, 
0.0f, 1.0f}\n            // };\n            const float ktm0 = 2.0f / 9;\n            const float ktm1 = 1.0f / 45;\n            const float ktm2 = 2.0f / 45;\n            const float ktm3 = 1.0f / 90;\n            const float ktm4 = 1.0f / 180;\n\n            float tmp[8][3];\n\n            const float* k0 = (const float*)kernel + (i + ii) * inch * 9 + (k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                float r0 = k0[0];\n                float r1 = k0[1];\n                float r2 = k0[2];\n\n                tmp[0][m] = r0;\n                tmp[1][m] = -r0 * ktm0 - r1 * ktm0 - r2 * ktm0;\n                tmp[2][m] = -r0 * ktm0 + r1 * ktm0 - r2 * ktm0;\n                tmp[3][m] = r0 * ktm3 + r1 * ktm1 + r2 * ktm2;\n                tmp[4][m] = r0 * ktm3 - r1 * ktm1 + r2 * ktm2;\n                tmp[5][m] = r0 * ktm1 + r1 * ktm3 + r2 * ktm4;\n                tmp[6][m] = r0 * ktm1 - r1 * ktm3 + r2 * ktm4;\n                tmp[7][m] = r2;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 8; m++)\n            {\n                float r0 = tmp[m][0];\n                float r1 = tmp[m][1];\n                float r2 = tmp[m][2];\n\n                float z0 = r0;\n                float z1 = -r0 * ktm0 - r1 * ktm0 - r2 * ktm0;\n                float z2 = -r0 * ktm0 + r1 * ktm0 - r2 * ktm0;\n                float z3 = r0 * ktm3 + r1 * ktm1 + r2 * ktm2;\n                float z4 = r0 * ktm3 - r1 * ktm1 + r2 * ktm2;\n                float z5 = r0 * ktm1 + r1 * ktm3 + r2 * ktm4;\n                float z6 = r0 * ktm1 - r1 * ktm3 + r2 * ktm4;\n                float z7 = r2;\n\n                ptmp[0] = (__fp16)z0;\n                ptmp[1] = (__fp16)z1;\n                ptmp[2] = (__fp16)z2;\n                ptmp[3] = (__fp16)z3;\n                ptmp[4] = (__fp16)z4;\n                ptmp[5] = (__fp16)z5;\n                ptmp[6] = (__fp16)z6;\n                ptmp[7] = (__fp16)z7;\n                ptmp += 8;\n      
      }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd63_transform_kernel_fp16sa(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    const int K = inch;\n    const int B = 64;\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk_fp16(M, 0, K, B, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads, (size_t)2u);\n\n    AT.create(TILE_K * TILE_M, B, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, (size_t)2u);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd63_transform_kernel_tile_fp16sa(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            conv3x3s1_winograd_pack_A_tile_fp16(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd63_transform_input_tile_fp16sa(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    // const float itm[8][8] = {\n    //     {1.0f, 0.0f,-5.25f, 0.00f, 5.25f, 0.00f,-1.0f, 0.0f},\n    //     {0.0f, 1.0f, 1.00f,-4.25f,-4.25f, 1.00f, 1.0f, 0.0f},\n    //     {0.0f,-1.0f, 1.00f, 4.25f,-4.25f,-1.00f, 1.0f, 0.0f},\n    //     {0.0f, 0.5f, 0.25f,-2.50f,-1.25f, 2.00f, 1.0f, 0.0f},\n    //     {0.0f,-0.5f, 0.25f, 2.50f,-1.25f,-2.00f, 1.0f, 0.0f},\n    //     {0.0f, 2.0f, 4.00f,-2.50f,-5.00f, 0.50f, 1.0f, 0.0f},\n    //     {0.0f,-2.0f, 4.00f, 2.50f,-5.00f,-0.50f, 1.0f, 0.0f},\n    //     {0.0f,-1.0f, 0.00f, 5.25f, 0.00f,-5.25f, 0.0f, 1.0f}\n    // 
};\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const int N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w + 3) / 6;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[8][8][8];\n\n        const __fp16 coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float16x8_t _coeffs = vld1q_f16(coeffs);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel((k + kk) / elempack).row<const __fp16>(ti * 6) + (tj * 6) * elempack;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float16x8_t _r0 = vdupq_n_f16(0.f);\n                float16x8_t _r1 = vdupq_n_f16(0.f);\n                float16x8_t _r2 = vdupq_n_f16(0.f);\n                float16x8_t _r3 = vdupq_n_f16(0.f);\n                float16x8_t _r4 = vdupq_n_f16(0.f);\n                float16x8_t _r5 = vdupq_n_f16(0.f);\n                float16x8_t _r6 = vdupq_n_f16(0.f);\n                float16x8_t _r7 = vdupq_n_f16(0.f);\n\n                if (ti * 6 + m < h)\n                {\n                    if (elempack == 8)\n                    {\n                        _r0 = vld1q_f16(r0);\n                        if (tj * 6 + 1 < w) _r1 = vld1q_f16(r0 + 8);\n                        if (tj * 6 + 2 < w) _r2 = vld1q_f16(r0 + 16);\n                        if (tj * 6 + 3 < w) _r3 = vld1q_f16(r0 + 24);\n                        if (tj * 6 + 4 < w) _r4 = vld1q_f16(r0 + 32);\n                        if (tj * 
6 + 5 < w) _r5 = vld1q_f16(r0 + 40);\n                        if (tj * 6 + 6 < w) _r6 = vld1q_f16(r0 + 48);\n                        if (tj * 6 + 7 < w) _r7 = vld1q_f16(r0 + 56);\n                    }\n                    if (elempack == 4)\n                    {\n                        const __fp16* r1 = r0 + N;\n\n                        _r0 = vcombine_f16(vld1_f16(r0), vld1_f16(r1));\n                        if (tj * 6 + 1 < w)\n                        {\n                            _r1 = vcombine_f16(vld1_f16(r0 + 4), vld1_f16(r1 + 4));\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            _r2 = vcombine_f16(vld1_f16(r0 + 8), vld1_f16(r1 + 8));\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            _r3 = vcombine_f16(vld1_f16(r0 + 12), vld1_f16(r1 + 12));\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            _r4 = vcombine_f16(vld1_f16(r0 + 16), vld1_f16(r1 + 16));\n                        }\n                        if (tj * 6 + 5 < w)\n                        {\n                            _r5 = vcombine_f16(vld1_f16(r0 + 20), vld1_f16(r1 + 20));\n                        }\n                        if (tj * 6 + 6 < w)\n                        {\n                            _r6 = vcombine_f16(vld1_f16(r0 + 24), vld1_f16(r1 + 24));\n                        }\n                        if (tj * 6 + 7 < w)\n                        {\n                            _r7 = vcombine_f16(vld1_f16(r0 + 28), vld1_f16(r1 + 28));\n                        }\n                    }\n                    if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n                        const __fp16* r2 = r0 + N * 2;\n                        const __fp16* r3 = r0 + N * 3;\n                        const __fp16* r4 = r0 + N * 4;\n   
                     const __fp16* r5 = r0 + N * 5;\n                        const __fp16* r6 = r0 + N * 6;\n                        const __fp16* r7 = r0 + N * 7;\n\n                        float16x4_t _t0 = vld1_f16(r0);\n                        float16x4_t _t1 = vld1_f16(r1);\n                        float16x4_t _t2 = vld1_f16(r2);\n                        float16x4_t _t3 = vld1_f16(r3);\n                        float16x4_t _t4 = vld1_f16(r4);\n                        float16x4_t _t5 = vld1_f16(r5);\n                        float16x4_t _t6 = vld1_f16(r6);\n                        float16x4_t _t7 = vld1_f16(r7);\n\n                        transpose4x4_ph(_t0, _t1, _t2, _t3);\n                        transpose4x4_ph(_t4, _t5, _t6, _t7);\n\n                        _r0 = vcombine_f16(_t0, _t4);\n                        if (tj * 6 + 1 < w)\n                        {\n                            _r1 = vcombine_f16(_t1, _t5);\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            _r2 = vcombine_f16(_t2, _t6);\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            _r3 = vcombine_f16(_t3, _t7);\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            _t0 = vld1_f16(r0 + 4);\n                            _t1 = vld1_f16(r1 + 4);\n                            _t2 = vld1_f16(r2 + 4);\n                            _t3 = vld1_f16(r3 + 4);\n                            _t4 = vld1_f16(r4 + 4);\n                            _t5 = vld1_f16(r5 + 4);\n                            _t6 = vld1_f16(r6 + 4);\n                            _t7 = vld1_f16(r7 + 4);\n\n                            transpose4x4_ph(_t0, _t1, _t2, _t3);\n                            transpose4x4_ph(_t4, _t5, _t6, _t7);\n\n                            _r4 = vcombine_f16(_t0, _t4);\n                            
if (tj * 6 + 5 < w)\n                            {\n                                _r5 = vcombine_f16(_t1, _t5);\n                            }\n                            if (tj * 6 + 6 < w)\n                            {\n                                _r6 = vcombine_f16(_t2, _t6);\n                            }\n                            if (tj * 6 + 7 < w)\n                            {\n                                _r7 = vcombine_f16(_t3, _t7);\n                            }\n                        }\n                    }\n                }\n\n                float16x8_t _tmp12a = vfmaq_laneq_f16(vaddq_f16(_r2, _r6), _r4, _coeffs, 1);\n                float16x8_t _tmp12b = vfmaq_laneq_f16(vaddq_f16(_r1, _r5), _r3, _coeffs, 1);\n                float16x8_t _tmp34a = vfmaq_laneq_f16(vfmaq_laneq_f16(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float16x8_t _tmp34b = vfmaq_laneq_f16(vfmaq_laneq_f16(vmulq_laneq_f16(_r1, _coeffs, 5), _r3, _coeffs, 4), _r5, _coeffs, 6);\n                float16x8_t _tmp56a = vfmaq_laneq_f16(_r6, vfmaq_laneq_f16(_r2, _r4, _coeffs, 2), _coeffs, 7);\n                float16x8_t _tmp56b = vfmaq_laneq_f16(vfmaq_laneq_f16(vmulq_laneq_f16(_r1, _coeffs, 6), _r3, _coeffs, 4), _r5, _coeffs, 5);\n\n                float16x8_t _tmp0 = vfmaq_laneq_f16(vsubq_f16(_r0, _r6), vsubq_f16(_r4, _r2), _coeffs, 0);\n                float16x8_t _tmp1 = vaddq_f16(_tmp12a, _tmp12b);\n                float16x8_t _tmp2 = vsubq_f16(_tmp12a, _tmp12b);\n                float16x8_t _tmp3 = vaddq_f16(_tmp34a, _tmp34b);\n                float16x8_t _tmp4 = vsubq_f16(_tmp34a, _tmp34b);\n                float16x8_t _tmp5 = vaddq_f16(_tmp56a, _tmp56b);\n                float16x8_t _tmp6 = vsubq_f16(_tmp56a, _tmp56b);\n                float16x8_t _tmp7 = vfmaq_laneq_f16(vsubq_f16(_r7, _r1), vsubq_f16(_r3, _r5), _coeffs, 0);\n\n                vst1q_f16(tmp[0][m], _tmp0);\n                vst1q_f16(tmp[1][m], _tmp1);\n                
vst1q_f16(tmp[2][m], _tmp2);\n                vst1q_f16(tmp[3][m], _tmp3);\n                vst1q_f16(tmp[4][m], _tmp4);\n                vst1q_f16(tmp[5][m], _tmp5);\n                vst1q_f16(tmp[6][m], _tmp6);\n                vst1q_f16(tmp[7][m], _tmp7);\n\n                r0 += w * elempack;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 64 + jj * 8;\n            __fp16* p1 = p0 + max_jj * 8;\n            __fp16* p2 = p0 + max_jj * 8 * 2;\n            __fp16* p3 = p0 + max_jj * 8 * 3;\n            __fp16* p4 = p0 + max_jj * 8 * 4;\n            __fp16* p5 = p0 + max_jj * 8 * 5;\n            __fp16* p6 = p0 + max_jj * 8 * 6;\n            __fp16* p7 = p0 + max_jj * 8 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float16x8_t _r0 = vld1q_f16(tmp[m][0]);\n                float16x8_t _r1 = vld1q_f16(tmp[m][1]);\n                float16x8_t _r2 = vld1q_f16(tmp[m][2]);\n                float16x8_t _r3 = vld1q_f16(tmp[m][3]);\n                float16x8_t _r4 = vld1q_f16(tmp[m][4]);\n                float16x8_t _r5 = vld1q_f16(tmp[m][5]);\n                float16x8_t _r6 = vld1q_f16(tmp[m][6]);\n                float16x8_t _r7 = vld1q_f16(tmp[m][7]);\n\n                float16x8_t _tmp12a = vfmaq_laneq_f16(vaddq_f16(_r2, _r6), _r4, _coeffs, 1);\n                float16x8_t _tmp12b = vfmaq_laneq_f16(vaddq_f16(_r1, _r5), _r3, _coeffs, 1);\n                float16x8_t _tmp34a = vfmaq_laneq_f16(vfmaq_laneq_f16(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float16x8_t _tmp34b = vfmaq_laneq_f16(vfmaq_laneq_f16(vmulq_laneq_f16(_r1, _coeffs, 5), _r3, _coeffs, 4), _r5, _coeffs, 6);\n                float16x8_t _tmp56a = vfmaq_laneq_f16(_r6, vfmaq_laneq_f16(_r2, _r4, _coeffs, 2), _coeffs, 7);\n                float16x8_t _tmp56b = vfmaq_laneq_f16(vfmaq_laneq_f16(vmulq_laneq_f16(_r1, _coeffs, 6), _r3, _coeffs, 4), _r5, _coeffs, 5);\n\n                float16x8_t _tmp0 = vfmaq_laneq_f16(vsubq_f16(_r0, _r6), 
vsubq_f16(_r4, _r2), _coeffs, 0);\n                float16x8_t _tmp1 = vaddq_f16(_tmp12a, _tmp12b);\n                float16x8_t _tmp2 = vsubq_f16(_tmp12a, _tmp12b);\n                float16x8_t _tmp3 = vaddq_f16(_tmp34a, _tmp34b);\n                float16x8_t _tmp4 = vsubq_f16(_tmp34a, _tmp34b);\n                float16x8_t _tmp5 = vaddq_f16(_tmp56a, _tmp56b);\n                float16x8_t _tmp6 = vsubq_f16(_tmp56a, _tmp56b);\n                float16x8_t _tmp7 = vfmaq_laneq_f16(vsubq_f16(_r7, _r1), vsubq_f16(_r3, _r5), _coeffs, 0);\n\n                vst1q_f16(p0, _tmp0);\n                vst1q_f16(p1, _tmp1);\n                vst1q_f16(p2, _tmp2);\n                vst1q_f16(p3, _tmp3);\n                vst1q_f16(p4, _tmp4);\n                vst1q_f16(p5, _tmp5);\n                vst1q_f16(p6, _tmp6);\n                vst1q_f16(p7, _tmp7);\n\n                p0 += max_jj * 8 * 8;\n                p1 += max_jj * 8 * 8;\n                p2 += max_jj * 8 * 8;\n                p3 += max_jj * 8 * 8;\n                p4 += max_jj * 8 * 8;\n                p5 += max_jj * 8 * 8;\n                p6 += max_jj * 8 * 8;\n                p7 += max_jj * 8 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 4;\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 4;\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[8][8][4];\n\n        const __fp16 coeffs[8] = {5.25f, -4.25f, -1.25f, 0.25f, -2.5f, 0.5f, 2.f, 4.f};\n        float16x8_t _coeffs = vld1q_f16(coeffs);\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel((k + kk) / elempack).row<const __fp16>(ti * 6) + (tj * 6) * elempack;\n\n            for (int m = 0; m < 8; 
m++)\n            {\n                float16x4_t _r0 = vdup_n_f16(0.f);\n                float16x4_t _r1 = vdup_n_f16(0.f);\n                float16x4_t _r2 = vdup_n_f16(0.f);\n                float16x4_t _r3 = vdup_n_f16(0.f);\n                float16x4_t _r4 = vdup_n_f16(0.f);\n                float16x4_t _r5 = vdup_n_f16(0.f);\n                float16x4_t _r6 = vdup_n_f16(0.f);\n                float16x4_t _r7 = vdup_n_f16(0.f);\n\n                if (ti * 6 + m < h)\n                {\n                    if (elempack == 4)\n                    {\n                        _r0 = vld1_f16(r0);\n                        if (tj * 6 + 1 < w) _r1 = vld1_f16(r0 + 4);\n                        if (tj * 6 + 2 < w) _r2 = vld1_f16(r0 + 8);\n                        if (tj * 6 + 3 < w) _r3 = vld1_f16(r0 + 12);\n                        if (tj * 6 + 4 < w) _r4 = vld1_f16(r0 + 16);\n                        if (tj * 6 + 5 < w) _r5 = vld1_f16(r0 + 20);\n                        if (tj * 6 + 6 < w) _r6 = vld1_f16(r0 + 24);\n                        if (tj * 6 + 7 < w) _r7 = vld1_f16(r0 + 28);\n                    }\n                    if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n                        const __fp16* r2 = r0 + N * 2;\n                        const __fp16* r3 = r0 + N * 3;\n\n                        float16x4_t _t0 = vld1_f16(r0);\n                        float16x4_t _t1 = vld1_f16(r1);\n                        float16x4_t _t2 = vld1_f16(r2);\n                        float16x4_t _t3 = vld1_f16(r3);\n\n                        transpose4x4_ph(_t0, _t1, _t2, _t3);\n\n                        _r0 = _t0;\n                        if (tj * 6 + 1 < w) _r1 = _t1;\n                        if (tj * 6 + 2 < w) _r2 = _t2;\n                        if (tj * 6 + 3 < w) _r3 = _t3;\n                        if (tj * 6 + 4 < w)\n                        {\n                            _t0 = vld1_f16(r0 + 4);\n                            _t1 = 
vld1_f16(r1 + 4);\n                            _t2 = vld1_f16(r2 + 4);\n                            _t3 = vld1_f16(r3 + 4);\n\n                            transpose4x4_ph(_t0, _t1, _t2, _t3);\n\n                            _r4 = _t0;\n                            if (tj * 6 + 5 < w) _r5 = _t1;\n                            if (tj * 6 + 6 < w) _r6 = _t2;\n                            if (tj * 6 + 7 < w) _r7 = _t3;\n                        }\n                    }\n                }\n\n                float16x4_t _tmp12a = vfma_laneq_f16(vadd_f16(_r2, _r6), _r4, _coeffs, 1);\n                float16x4_t _tmp12b = vfma_laneq_f16(vadd_f16(_r1, _r5), _r3, _coeffs, 1);\n                float16x4_t _tmp34a = vfma_laneq_f16(vfma_laneq_f16(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float16x4_t _tmp34b = vfma_laneq_f16(vfma_laneq_f16(vmul_laneq_f16(_r1, _coeffs, 5), _r3, _coeffs, 4), _r5, _coeffs, 6);\n                float16x4_t _tmp56a = vfma_laneq_f16(_r6, vfma_laneq_f16(_r2, _r4, _coeffs, 2), _coeffs, 7);\n                float16x4_t _tmp56b = vfma_laneq_f16(vfma_laneq_f16(vmul_laneq_f16(_r1, _coeffs, 6), _r3, _coeffs, 4), _r5, _coeffs, 5);\n\n                float16x4_t _tmp0 = vfma_laneq_f16(vsub_f16(_r0, _r6), vsub_f16(_r4, _r2), _coeffs, 0);\n                float16x4_t _tmp1 = vadd_f16(_tmp12a, _tmp12b);\n                float16x4_t _tmp2 = vsub_f16(_tmp12a, _tmp12b);\n                float16x4_t _tmp3 = vadd_f16(_tmp34a, _tmp34b);\n                float16x4_t _tmp4 = vsub_f16(_tmp34a, _tmp34b);\n                float16x4_t _tmp5 = vadd_f16(_tmp56a, _tmp56b);\n                float16x4_t _tmp6 = vsub_f16(_tmp56a, _tmp56b);\n                float16x4_t _tmp7 = vfma_laneq_f16(vsub_f16(_r7, _r1), vsub_f16(_r3, _r5), _coeffs, 0);\n\n                vst1_f16(tmp[0][m], _tmp0);\n                vst1_f16(tmp[1][m], _tmp1);\n                vst1_f16(tmp[2][m], _tmp2);\n                vst1_f16(tmp[3][m], _tmp3);\n                vst1_f16(tmp[4][m], _tmp4);\n     
           vst1_f16(tmp[5][m], _tmp5);\n                vst1_f16(tmp[6][m], _tmp6);\n                vst1_f16(tmp[7][m], _tmp7);\n\n                r0 += w * elempack;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 64 + jj * 4;\n            __fp16* p1 = p0 + max_jj * 4;\n            __fp16* p2 = p0 + max_jj * 4 * 2;\n            __fp16* p3 = p0 + max_jj * 4 * 3;\n            __fp16* p4 = p0 + max_jj * 4 * 4;\n            __fp16* p5 = p0 + max_jj * 4 * 5;\n            __fp16* p6 = p0 + max_jj * 4 * 6;\n            __fp16* p7 = p0 + max_jj * 4 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float16x4_t _r0 = vld1_f16(tmp[m][0]);\n                float16x4_t _r1 = vld1_f16(tmp[m][1]);\n                float16x4_t _r2 = vld1_f16(tmp[m][2]);\n                float16x4_t _r3 = vld1_f16(tmp[m][3]);\n                float16x4_t _r4 = vld1_f16(tmp[m][4]);\n                float16x4_t _r5 = vld1_f16(tmp[m][5]);\n                float16x4_t _r6 = vld1_f16(tmp[m][6]);\n                float16x4_t _r7 = vld1_f16(tmp[m][7]);\n\n                float16x4_t _tmp12a = vfma_laneq_f16(vadd_f16(_r2, _r6), _r4, _coeffs, 1);\n                float16x4_t _tmp12b = vfma_laneq_f16(vadd_f16(_r1, _r5), _r3, _coeffs, 1);\n                float16x4_t _tmp34a = vfma_laneq_f16(vfma_laneq_f16(_r6, _r2, _coeffs, 3), _r4, _coeffs, 2);\n                float16x4_t _tmp34b = vfma_laneq_f16(vfma_laneq_f16(vmul_laneq_f16(_r1, _coeffs, 5), _r3, _coeffs, 4), _r5, _coeffs, 6);\n                float16x4_t _tmp56a = vfma_laneq_f16(_r6, vfma_laneq_f16(_r2, _r4, _coeffs, 2), _coeffs, 7);\n                float16x4_t _tmp56b = vfma_laneq_f16(vfma_laneq_f16(vmul_laneq_f16(_r1, _coeffs, 6), _r3, _coeffs, 4), _r5, _coeffs, 5);\n\n                float16x4_t _tmp0 = vfma_laneq_f16(vsub_f16(_r0, _r6), vsub_f16(_r4, _r2), _coeffs, 0);\n                float16x4_t _tmp1 = vadd_f16(_tmp12a, _tmp12b);\n                float16x4_t _tmp2 = vsub_f16(_tmp12a, 
_tmp12b);\n                float16x4_t _tmp3 = vadd_f16(_tmp34a, _tmp34b);\n                float16x4_t _tmp4 = vsub_f16(_tmp34a, _tmp34b);\n                float16x4_t _tmp5 = vadd_f16(_tmp56a, _tmp56b);\n                float16x4_t _tmp6 = vsub_f16(_tmp56a, _tmp56b);\n                float16x4_t _tmp7 = vfma_laneq_f16(vsub_f16(_r7, _r1), vsub_f16(_r3, _r5), _coeffs, 0);\n\n                vst1_f16(p0, _tmp0);\n                vst1_f16(p1, _tmp1);\n                vst1_f16(p2, _tmp2);\n                vst1_f16(p3, _tmp3);\n                vst1_f16(p4, _tmp4);\n                vst1_f16(p5, _tmp5);\n                vst1_f16(p6, _tmp6);\n                vst1_f16(p7, _tmp7);\n\n                p0 += max_jj * 8 * 4;\n                p1 += max_jj * 8 * 4;\n                p2 += max_jj * 8 * 4;\n                p3 += max_jj * 8 * 4;\n                p4 += max_jj * 8 * 4;\n                p5 += max_jj * 8 * 4;\n                p6 += max_jj * 8 * 4;\n                p7 += max_jj * 8 * 4;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 4;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        __fp16 tmp[8][8][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = bottom_blob.channel(k + kk).row<const __fp16>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 8; m++)\n            {\n                __fp16 r00 = 0.f;\n                __fp16 r01 = 0.f;\n                __fp16 r10 = 0.f;\n                __fp16 r11 = 0.f;\n                __fp16 r20 = 0.f;\n                __fp16 r21 = 0.f;\n                __fp16 r30 = 0.f;\n                __fp16 r31 = 0.f;\n                __fp16 r40 = 0.f;\n     
           __fp16 r41 = 0.f;\n                __fp16 r50 = 0.f;\n                __fp16 r51 = 0.f;\n                __fp16 r60 = 0.f;\n                __fp16 r61 = 0.f;\n                __fp16 r70 = 0.f;\n                __fp16 r71 = 0.f;\n\n                if (ti * 6 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const __fp16* r1 = r0 + N;\n\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 6 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 6 + 2 < w)\n                        {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 6 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n                        if (tj * 6 + 4 < w)\n                        {\n                            r40 = r0[4];\n                            r41 = r1[4];\n                        }\n                        if (tj * 6 + 5 < w)\n                        {\n                            r50 = r0[5];\n                            r51 = r1[5];\n                        }\n                        if (tj * 6 + 6 < w)\n                        {\n                            r60 = r0[6];\n                            r61 = r1[6];\n                        }\n                        if (tj * 6 + 7 < w)\n                        {\n                            r70 = r0[7];\n                            r71 = r1[7];\n                        }\n                    }\n                }\n\n                __fp16 tmp12a0 = r20 + r60 - r40 * (__fp16)4.25f;\n                __fp16 tmp12a1 = r21 + r61 - r41 * (__fp16)4.25f;\n                __fp16 tmp12b0 = r10 + r50 - 
r30 * (__fp16)4.25f;\n                __fp16 tmp12b1 = r11 + r51 - r31 * (__fp16)4.25f;\n                __fp16 tmp34a0 = r60 + r20 * (__fp16)0.25f - r40 * (__fp16)1.25f;\n                __fp16 tmp34a1 = r61 + r21 * (__fp16)0.25f - r41 * (__fp16)1.25f;\n                __fp16 tmp34b0 = r10 * (__fp16)0.5f - r30 * (__fp16)2.5f + r50 * (__fp16)2.f;\n                __fp16 tmp34b1 = r11 * (__fp16)0.5f - r31 * (__fp16)2.5f + r51 * (__fp16)2.f;\n                __fp16 tmp56a0 = r20 * (__fp16)4.f - r40 * (__fp16)5.f + r60;\n                __fp16 tmp56a1 = r21 * (__fp16)4.f - r41 * (__fp16)5.f + r61;\n                __fp16 tmp56b0 = r10 * (__fp16)2.f - r30 * (__fp16)2.5f + r50 * (__fp16)0.5f;\n                __fp16 tmp56b1 = r11 * (__fp16)2.f - r31 * (__fp16)2.5f + r51 * (__fp16)0.5f;\n\n                tmp[0][m][0] = r00 - r60 + (r40 - r20) * (__fp16)5.25f;\n                tmp[0][m][1] = r01 - r61 + (r41 - r21) * (__fp16)5.25f;\n                tmp[1][m][0] = tmp12a0 + tmp12b0;\n                tmp[1][m][1] = tmp12a1 + tmp12b1;\n                tmp[2][m][0] = tmp12a0 - tmp12b0;\n                tmp[2][m][1] = tmp12a1 - tmp12b1;\n                tmp[3][m][0] = tmp34a0 + tmp34b0;\n                tmp[3][m][1] = tmp34a1 + tmp34b1;\n                tmp[4][m][0] = tmp34a0 - tmp34b0;\n                tmp[4][m][1] = tmp34a1 - tmp34b1;\n                tmp[5][m][0] = tmp56a0 + tmp56b0;\n                tmp[5][m][1] = tmp56a1 + tmp56b1;\n                tmp[6][m][0] = tmp56a0 - tmp56b0;\n                tmp[6][m][1] = tmp56a1 - tmp56b1;\n                tmp[7][m][0] = r70 - r10 + (r30 - r50) * (__fp16)5.25f;\n                tmp[7][m][1] = r71 - r11 + (r31 - r51) * (__fp16)5.25f;\n\n                r0 += w;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 64 + jj * 2;\n            __fp16* p1 = p0 + max_jj * 2;\n            __fp16* p2 = p0 + max_jj * 2 * 2;\n            __fp16* p3 = p0 + max_jj * 2 * 3;\n            __fp16* p4 = p0 + max_jj * 2 * 4;\n      
      __fp16* p5 = p0 + max_jj * 2 * 5;\n            __fp16* p6 = p0 + max_jj * 2 * 6;\n            __fp16* p7 = p0 + max_jj * 2 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                __fp16 r00 = tmp[m][0][0];\n                __fp16 r01 = tmp[m][0][1];\n                __fp16 r10 = tmp[m][1][0];\n                __fp16 r11 = tmp[m][1][1];\n                __fp16 r20 = tmp[m][2][0];\n                __fp16 r21 = tmp[m][2][1];\n                __fp16 r30 = tmp[m][3][0];\n                __fp16 r31 = tmp[m][3][1];\n                __fp16 r40 = tmp[m][4][0];\n                __fp16 r41 = tmp[m][4][1];\n                __fp16 r50 = tmp[m][5][0];\n                __fp16 r51 = tmp[m][5][1];\n                __fp16 r60 = tmp[m][6][0];\n                __fp16 r61 = tmp[m][6][1];\n                __fp16 r70 = tmp[m][7][0];\n                __fp16 r71 = tmp[m][7][1];\n\n                __fp16 tmp12a0 = r20 + r60 - r40 * (__fp16)4.25f;\n                __fp16 tmp12a1 = r21 + r61 - r41 * (__fp16)4.25f;\n                __fp16 tmp12b0 = r10 + r50 - r30 * (__fp16)4.25f;\n                __fp16 tmp12b1 = r11 + r51 - r31 * (__fp16)4.25f;\n                __fp16 tmp34a0 = r60 + r20 * (__fp16)0.25f - r40 * (__fp16)1.25f;\n                __fp16 tmp34a1 = r61 + r21 * (__fp16)0.25f - r41 * (__fp16)1.25f;\n                __fp16 tmp34b0 = r10 * (__fp16)0.5f - r30 * (__fp16)2.5f + r50 * (__fp16)2.f;\n                __fp16 tmp34b1 = r11 * (__fp16)0.5f - r31 * (__fp16)2.5f + r51 * (__fp16)2.f;\n                __fp16 tmp56a0 = r20 * (__fp16)4.f - r40 * (__fp16)5.f + r60;\n                __fp16 tmp56a1 = r21 * (__fp16)4.f - r41 * (__fp16)5.f + r61;\n                __fp16 tmp56b0 = r10 * (__fp16)2.f - r30 * (__fp16)2.5f + r50 * (__fp16)0.5f;\n                __fp16 tmp56b1 = r11 * (__fp16)2.f - r31 * (__fp16)2.5f + r51 * (__fp16)0.5f;\n\n                p0[0] = r00 - r60 + (r40 - r20) * (__fp16)5.25f;\n                p0[1] = r01 - r61 + (r41 - r21) * 
(__fp16)5.25f;\n                p1[0] = tmp12a0 + tmp12b0;\n                p1[1] = tmp12a1 + tmp12b1;\n                p2[0] = tmp12a0 - tmp12b0;\n                p2[1] = tmp12a1 - tmp12b1;\n                p3[0] = tmp34a0 + tmp34b0;\n                p3[1] = tmp34a1 + tmp34b1;\n                p4[0] = tmp34a0 - tmp34b0;\n                p4[1] = tmp34a1 - tmp34b1;\n                p5[0] = tmp56a0 + tmp56b0;\n                p5[1] = tmp56a1 + tmp56b1;\n                p6[0] = tmp56a0 - tmp56b0;\n                p6[1] = tmp56a1 - tmp56b1;\n                p7[0] = r70 - r10 + (r30 - r50) * (__fp16)5.25f;\n                p7[1] = r71 - r11 + (r31 - r51) * (__fp16)5.25f;\n\n                p0 += max_jj * 8 * 2;\n                p1 += max_jj * 8 * 2;\n                p2 += max_jj * 8 * 2;\n                p3 += max_jj * 8 * 2;\n                p4 += max_jj * 8 * 2;\n                p5 += max_jj * 8 * 2;\n                p6 += max_jj * 8 * 2;\n                p7 += max_jj * 8 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        __fp16 tmp[8][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0123 = bottom_blob.channel(k + kk).row<const __fp16>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 8; m++)\n            {\n                __fp16 r0 = 0.f;\n                __fp16 r1 = 0.f;\n                __fp16 r2 = 0.f;\n                __fp16 r3 = 0.f;\n                __fp16 r4 = 0.f;\n                __fp16 r5 = 0.f;\n                __fp16 r6 = 0.f;\n                __fp16 r7 = 0.f;\n\n                if (ti * 6 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 6 + 1 < w) r1 = r0123[1];\n                        if (tj 
* 6 + 2 < w) r2 = r0123[2];\n                        if (tj * 6 + 3 < w) r3 = r0123[3];\n                        if (tj * 6 + 4 < w) r4 = r0123[4];\n                        if (tj * 6 + 5 < w) r5 = r0123[5];\n                        if (tj * 6 + 6 < w) r6 = r0123[6];\n                        if (tj * 6 + 7 < w) r7 = r0123[7];\n                    }\n                }\n\n                __fp16 tmp12a = r2 + r6 - r4 * (__fp16)4.25f;\n                __fp16 tmp12b = r1 + r5 - r3 * (__fp16)4.25f;\n                __fp16 tmp34a = r6 + r2 * (__fp16)0.25f - r4 * (__fp16)1.25f;\n                __fp16 tmp34b = r1 * (__fp16)0.5f - r3 * (__fp16)2.5f + r5 * (__fp16)2.f;\n                __fp16 tmp56a = r2 * (__fp16)4.f - r4 * (__fp16)5.f + r6;\n                __fp16 tmp56b = r1 * (__fp16)2.f - r3 * (__fp16)2.5f + r5 * (__fp16)0.5f;\n\n                tmp[0][m] = r0 - r6 + (r4 - r2) * (__fp16)5.25f;\n                tmp[1][m] = tmp12a + tmp12b;\n                tmp[2][m] = tmp12a - tmp12b;\n                tmp[3][m] = tmp34a + tmp34b;\n                tmp[4][m] = tmp34a - tmp34b;\n                tmp[5][m] = tmp56a + tmp56b;\n                tmp[6][m] = tmp56a - tmp56b;\n                tmp[7][m] = r7 - r1 + (r3 - r5) * (__fp16)5.25f;\n\n                r0123 += w;\n            }\n\n            __fp16* p0 = (__fp16*)B + kk * max_jj * 64 + jj;\n            __fp16* p1 = p0 + max_jj;\n            __fp16* p2 = p0 + max_jj * 2;\n            __fp16* p3 = p0 + max_jj * 3;\n            __fp16* p4 = p0 + max_jj * 4;\n            __fp16* p5 = p0 + max_jj * 5;\n            __fp16* p6 = p0 + max_jj * 6;\n            __fp16* p7 = p0 + max_jj * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                __fp16 r0 = tmp[m][0];\n                __fp16 r1 = tmp[m][1];\n                __fp16 r2 = tmp[m][2];\n                __fp16 r3 = tmp[m][3];\n                __fp16 r4 = tmp[m][4];\n                __fp16 r5 = tmp[m][5];\n                __fp16 r6 = tmp[m][6];\n          
      __fp16 r7 = tmp[m][7];\n\n                __fp16 tmp12a = r2 + r6 - r4 * (__fp16)4.25f;\n                __fp16 tmp12b = r1 + r5 - r3 * (__fp16)4.25f;\n                __fp16 tmp34a = r6 + r2 * (__fp16)0.25f - r4 * (__fp16)1.25f;\n                __fp16 tmp34b = r1 * (__fp16)0.5f - r3 * (__fp16)2.5f + r5 * (__fp16)2.f;\n                __fp16 tmp56a = r2 * (__fp16)4.f - r4 * (__fp16)5.f + r6;\n                __fp16 tmp56b = r1 * (__fp16)2.f - r3 * (__fp16)2.5f + r5 * (__fp16)0.5f;\n\n                p0[0] = r0 - r6 + (r4 - r2) * (__fp16)5.25f;\n                p1[0] = tmp12a + tmp12b;\n                p2[0] = tmp12a - tmp12b;\n                p3[0] = tmp34a + tmp34b;\n                p4[0] = tmp34a - tmp34b;\n                p5[0] = tmp56a + tmp56b;\n                p6[0] = tmp56a - tmp56b;\n                p7[0] = r7 - r1 + (r3 - r5) * (__fp16)5.25f;\n\n                p0 += max_jj * 8;\n                p1 += max_jj * 8;\n                p2 += max_jj * 8;\n                p3 += max_jj * 8;\n                p4 += max_jj * 8;\n                p5 += max_jj * 8;\n                p6 += max_jj * 8;\n                p7 += max_jj * 8;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd63_transform_output_tile_fp16sa(const Mat& top_tile, Mat& top_blob, const Mat& bias, int i, int max_ii, int j, int max_jj)\n{\n    // const float otm[6][8] = {\n    //     {1.0f, 1.0f,  1.0f,  1.0f,  1.0f, 32.0f, 32.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f,  2.0f, -2.0f, 16.0f,-16.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f,  4.0f,  4.0f,  8.0f,  8.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f,  8.0f, -8.0f,  4.0f, -4.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f, 16.0f, 16.0f,  2.0f,  2.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 32.0f,-32.0f,  1.0f, -1.0f, 1.0f}\n    // };\n\n    const __fp16 coeffs[8] = {32.f, 16.f, 8.f, 4.f, 2.f, 0.f, 0.f, 0.f};\n    float16x8_t _coeffs = vld1q_f16(coeffs);\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    
const int out_elempack = top_blob.elempack;\n    const int N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 5) / 6;\n\n    const __fp16* biasptr = bias;\n\n    int ii = 0;\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float16x8_t _bias0 = biasptr ? vld1q_f16(biasptr + i + ii) : vdupq_n_f16(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[6][8][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 64 + jj * 8;\n            const __fp16* r1 = r0 + max_jj * 8;\n            const __fp16* r2 = r0 + max_jj * 8 * 2;\n            const __fp16* r3 = r0 + max_jj * 8 * 3;\n            const __fp16* r4 = r0 + max_jj * 8 * 4;\n            const __fp16* r5 = r0 + max_jj * 8 * 5;\n            const __fp16* r6 = r0 + max_jj * 8 * 6;\n            const __fp16* r7 = r0 + max_jj * 8 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float16x8_t _r0 = vld1q_f16(r0);\n                float16x8_t _r1 = vld1q_f16(r1);\n                float16x8_t _r2 = vld1q_f16(r2);\n                float16x8_t _r3 = vld1q_f16(r3);\n                float16x8_t _r4 = vld1q_f16(r4);\n                float16x8_t _r5 = vld1q_f16(r5);\n                float16x8_t _r6 = vld1q_f16(r6);\n                float16x8_t _r7 = vld1q_f16(r7);\n\n                float16x8_t _tmp024a = vaddq_f16(_r1, _r2);\n                float16x8_t _tmp135a = vsubq_f16(_r1, _r2);\n                float16x8_t _tmp024b = vaddq_f16(_r3, _r4);\n                float16x8_t _tmp135b = vsubq_f16(_r3, _r4);\n                float16x8_t _tmp024c = vaddq_f16(_r5, _r6);\n                float16x8_t _tmp135c = vsubq_f16(_r5, _r6);\n\n                float16x8_t _tmp0 = vaddq_f16(vaddq_f16(_r0, _tmp024a), vfmaq_laneq_f16(_tmp024b, 
_tmp024c, _coeffs, 0));\n                float16x8_t _tmp1 = vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp135a, _tmp135b, _coeffs, 4), _tmp135c, _coeffs, 1);\n                float16x8_t _tmp2 = vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2);\n                float16x8_t _tmp3 = vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3);\n                float16x8_t _tmp4 = vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _coeffs, 4);\n                float16x8_t _tmp5 = vaddq_f16(vaddq_f16(_r7, _tmp135a), vfmaq_laneq_f16(_tmp135c, _tmp135b, _coeffs, 0));\n\n                vst1q_f16(tmp[0][m], _tmp0);\n                vst1q_f16(tmp[1][m], _tmp1);\n                vst1q_f16(tmp[2][m], _tmp2);\n                vst1q_f16(tmp[3][m], _tmp3);\n                vst1q_f16(tmp[4][m], _tmp4);\n                vst1q_f16(tmp[5][m], _tmp5);\n\n                r0 += max_jj * 8 * 8;\n                r1 += max_jj * 8 * 8;\n                r2 += max_jj * 8 * 8;\n                r3 += max_jj * 8 * 8;\n                r4 += max_jj * 8 * 8;\n                r5 += max_jj * 8 * 8;\n                r6 += max_jj * 8 * 8;\n                r7 += max_jj * 8 * 8;\n            }\n\n            __fp16* outptr0 = top_blob.channel((i + ii) / out_elempack).row<__fp16>(ti * 6) + (tj * 6) * out_elempack;\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float16x8_t _r0 = vld1q_f16(tmp[m][0]);\n                float16x8_t _r1 = vld1q_f16(tmp[m][1]);\n                float16x8_t _r2 = vld1q_f16(tmp[m][2]);\n                float16x8_t _r3 = vld1q_f16(tmp[m][3]);\n                float16x8_t _r4 = vld1q_f16(tmp[m][4]);\n                float16x8_t _r5 = vld1q_f16(tmp[m][5]);\n                float16x8_t _r6 = vld1q_f16(tmp[m][6]);\n                float16x8_t _r7 = vld1q_f16(tmp[m][7]);\n\n                float16x8_t 
_tmp024a = vaddq_f16(_r1, _r2);\n                float16x8_t _tmp135a = vsubq_f16(_r1, _r2);\n                float16x8_t _tmp024b = vaddq_f16(_r3, _r4);\n                float16x8_t _tmp135b = vsubq_f16(_r3, _r4);\n                float16x8_t _tmp024c = vaddq_f16(_r5, _r6);\n                float16x8_t _tmp135c = vsubq_f16(_r5, _r6);\n\n                float16x8_t _tmp0 = vaddq_f16(_bias0, vaddq_f16(vaddq_f16(_r0, _tmp024a), vfmaq_laneq_f16(_tmp024b, _tmp024c, _coeffs, 0)));\n                float16x8_t _tmp1 = vaddq_f16(_bias0, vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp135a, _tmp135b, _coeffs, 4), _tmp135c, _coeffs, 1));\n                float16x8_t _tmp2 = vaddq_f16(_bias0, vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2));\n                float16x8_t _tmp3 = vaddq_f16(_bias0, vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3));\n                float16x8_t _tmp4 = vaddq_f16(_bias0, vfmaq_laneq_f16(vfmaq_laneq_f16(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _coeffs, 4));\n                float16x8_t _tmp5 = vaddq_f16(_bias0, vaddq_f16(vaddq_f16(_r7, _tmp135a), vfmaq_laneq_f16(_tmp135c, _tmp135b, _coeffs, 0)));\n\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _tmp0);\n                    if (tj * 6 + 1 < outw) vst1q_f16(outptr0 + 8, _tmp1);\n                    if (tj * 6 + 2 < outw) vst1q_f16(outptr0 + 16, _tmp2);\n                    if (tj * 6 + 3 < outw) vst1q_f16(outptr0 + 24, _tmp3);\n                    if (tj * 6 + 4 < outw) vst1q_f16(outptr0 + 32, _tmp4);\n                    if (tj * 6 + 5 < outw) vst1q_f16(outptr0 + 40, _tmp5);\n                }\n                if (out_elempack == 4)\n                {\n                    __fp16* outptr1 = outptr0 + N;\n\n                    vst1_f16(outptr0, vget_low_f16(_tmp0));\n                    vst1_f16(outptr1, vget_high_f16(_tmp0));\n                    if (tj * 6 + 1 < outw)\n             
       {\n                        vst1_f16(outptr0 + 4, vget_low_f16(_tmp1));\n                        vst1_f16(outptr1 + 4, vget_high_f16(_tmp1));\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        vst1_f16(outptr0 + 8, vget_low_f16(_tmp2));\n                        vst1_f16(outptr1 + 8, vget_high_f16(_tmp2));\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        vst1_f16(outptr0 + 12, vget_low_f16(_tmp3));\n                        vst1_f16(outptr1 + 12, vget_high_f16(_tmp3));\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        vst1_f16(outptr0 + 16, vget_low_f16(_tmp4));\n                        vst1_f16(outptr1 + 16, vget_high_f16(_tmp4));\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        vst1_f16(outptr0 + 20, vget_low_f16(_tmp5));\n                        vst1_f16(outptr1 + 20, vget_high_f16(_tmp5));\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 tmp0[8];\n                    __fp16 tmp1[8];\n                    __fp16 tmp2[8];\n                    __fp16 tmp3[8];\n                    __fp16 tmp4[8];\n                    __fp16 tmp5[8];\n                    vst1q_f16(tmp0, _tmp0);\n                    vst1q_f16(tmp1, _tmp1);\n                    vst1q_f16(tmp2, _tmp2);\n                    vst1q_f16(tmp3, _tmp3);\n                    vst1q_f16(tmp4, _tmp4);\n                    vst1q_f16(tmp5, _tmp5);\n\n                    __fp16* outptr1 = outptr0 + N;\n                    __fp16* outptr2 = outptr0 + N * 2;\n                    __fp16* outptr3 = outptr0 + N * 3;\n                    __fp16* outptr4 = outptr0 + N * 4;\n                    __fp16* outptr5 = outptr0 + N * 5;\n                    __fp16* outptr6 = outptr0 + N * 6;\n  
                  __fp16* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    outptr4[0] = tmp0[4];\n                    outptr5[0] = tmp0[5];\n                    outptr6[0] = tmp0[6];\n                    outptr7[0] = tmp0[7];\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                        outptr4[1] = tmp1[4];\n                        outptr5[1] = tmp1[5];\n                        outptr6[1] = tmp1[6];\n                        outptr7[1] = tmp1[7];\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                        outptr4[2] = tmp2[4];\n                        outptr5[2] = tmp2[5];\n                        outptr6[2] = tmp2[6];\n                        outptr7[2] = tmp2[7];\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                        outptr4[3] = tmp3[4];\n                        outptr5[3] = tmp3[5];\n                        outptr6[3] = tmp3[6];\n                        outptr7[3] = tmp3[7];\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = tmp4[0];\n                        outptr1[4] = tmp4[1];\n                        outptr2[4] = 
tmp4[2];\n                        outptr3[4] = tmp4[3];\n                        outptr4[4] = tmp4[4];\n                        outptr5[4] = tmp4[5];\n                        outptr6[4] = tmp4[6];\n                        outptr7[4] = tmp4[7];\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp5[0];\n                        outptr1[5] = tmp5[1];\n                        outptr2[5] = tmp5[2];\n                        outptr3[5] = tmp5[3];\n                        outptr4[5] = tmp5[4];\n                        outptr5[5] = tmp5[5];\n                        outptr6[5] = tmp5[6];\n                        outptr7[5] = tmp5[7];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float16x4_t _bias0 = biasptr ? vld1_f16(biasptr + i + ii) : vdup_n_f16(0.f);\n\n#ifdef _MSC_VER\n        __declspec(align(16))\n#else\n        __attribute__((aligned(16)))\n#endif\n        __fp16 tmp[6][8][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 64 + jj * 4;\n            const __fp16* r1 = r0 + max_jj * 4;\n            const __fp16* r2 = r0 + max_jj * 4 * 2;\n            const __fp16* r3 = r0 + max_jj * 4 * 3;\n            const __fp16* r4 = r0 + max_jj * 4 * 4;\n            const __fp16* r5 = r0 + max_jj * 4 * 5;\n            const __fp16* r6 = r0 + max_jj * 4 * 6;\n            const __fp16* r7 = r0 + max_jj * 4 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                float16x4_t _r0 = vld1_f16(r0);\n                float16x4_t _r1 = vld1_f16(r1);\n                float16x4_t _r2 = vld1_f16(r2);\n                float16x4_t _r3 = vld1_f16(r3);\n                float16x4_t _r4 = 
vld1_f16(r4);\n                float16x4_t _r5 = vld1_f16(r5);\n                float16x4_t _r6 = vld1_f16(r6);\n                float16x4_t _r7 = vld1_f16(r7);\n\n                float16x4_t _tmp024a = vadd_f16(_r1, _r2);\n                float16x4_t _tmp135a = vsub_f16(_r1, _r2);\n                float16x4_t _tmp024b = vadd_f16(_r3, _r4);\n                float16x4_t _tmp135b = vsub_f16(_r3, _r4);\n                float16x4_t _tmp024c = vadd_f16(_r5, _r6);\n                float16x4_t _tmp135c = vsub_f16(_r5, _r6);\n\n                float16x4_t _tmp0 = vadd_f16(vadd_f16(_r0, _tmp024a), vfma_laneq_f16(_tmp024b, _tmp024c, _coeffs, 0));\n                float16x4_t _tmp1 = vfma_laneq_f16(vfma_laneq_f16(_tmp135a, _tmp135b, _coeffs, 4), _tmp135c, _coeffs, 1);\n                float16x4_t _tmp2 = vfma_laneq_f16(vfma_laneq_f16(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2);\n                float16x4_t _tmp3 = vfma_laneq_f16(vfma_laneq_f16(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3);\n                float16x4_t _tmp4 = vfma_laneq_f16(vfma_laneq_f16(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _coeffs, 4);\n                float16x4_t _tmp5 = vadd_f16(vadd_f16(_r7, _tmp135a), vfma_laneq_f16(_tmp135c, _tmp135b, _coeffs, 0));\n\n                vst1_f16(tmp[0][m], _tmp0);\n                vst1_f16(tmp[1][m], _tmp1);\n                vst1_f16(tmp[2][m], _tmp2);\n                vst1_f16(tmp[3][m], _tmp3);\n                vst1_f16(tmp[4][m], _tmp4);\n                vst1_f16(tmp[5][m], _tmp5);\n\n                r0 += max_jj * 8 * 4;\n                r1 += max_jj * 8 * 4;\n                r2 += max_jj * 8 * 4;\n                r3 += max_jj * 8 * 4;\n                r4 += max_jj * 8 * 4;\n                r5 += max_jj * 8 * 4;\n                r6 += max_jj * 8 * 4;\n                r7 += max_jj * 8 * 4;\n            }\n\n            __fp16* outptr0 = top_blob.channel((i + ii) / out_elempack).row<__fp16>(ti * 6) + (tj * 6) * out_elempack;\n\n            for 
(int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                float16x4_t _r0 = vld1_f16(tmp[m][0]);\n                float16x4_t _r1 = vld1_f16(tmp[m][1]);\n                float16x4_t _r2 = vld1_f16(tmp[m][2]);\n                float16x4_t _r3 = vld1_f16(tmp[m][3]);\n                float16x4_t _r4 = vld1_f16(tmp[m][4]);\n                float16x4_t _r5 = vld1_f16(tmp[m][5]);\n                float16x4_t _r6 = vld1_f16(tmp[m][6]);\n                float16x4_t _r7 = vld1_f16(tmp[m][7]);\n\n                float16x4_t _tmp024a = vadd_f16(_r1, _r2);\n                float16x4_t _tmp135a = vsub_f16(_r1, _r2);\n                float16x4_t _tmp024b = vadd_f16(_r3, _r4);\n                float16x4_t _tmp135b = vsub_f16(_r3, _r4);\n                float16x4_t _tmp024c = vadd_f16(_r5, _r6);\n                float16x4_t _tmp135c = vsub_f16(_r5, _r6);\n\n                float16x4_t _tmp0 = vadd_f16(_bias0, vadd_f16(vadd_f16(_r0, _tmp024a), vfma_laneq_f16(_tmp024b, _tmp024c, _coeffs, 0)));\n                float16x4_t _tmp1 = vadd_f16(_bias0, vfma_laneq_f16(vfma_laneq_f16(_tmp135a, _tmp135b, _coeffs, 4), _tmp135c, _coeffs, 1));\n                float16x4_t _tmp2 = vadd_f16(_bias0, vfma_laneq_f16(vfma_laneq_f16(_tmp024a, _tmp024b, _coeffs, 3), _tmp024c, _coeffs, 2));\n                float16x4_t _tmp3 = vadd_f16(_bias0, vfma_laneq_f16(vfma_laneq_f16(_tmp135a, _tmp135b, _coeffs, 2), _tmp135c, _coeffs, 3));\n                float16x4_t _tmp4 = vadd_f16(_bias0, vfma_laneq_f16(vfma_laneq_f16(_tmp024a, _tmp024b, _coeffs, 1), _tmp024c, _coeffs, 4));\n                float16x4_t _tmp5 = vadd_f16(_bias0, vadd_f16(vadd_f16(_r7, _tmp135a), vfma_laneq_f16(_tmp135c, _tmp135b, _coeffs, 0)));\n\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _tmp0);\n                    if (tj * 6 + 1 < outw) vst1_f16(outptr0 + 4, _tmp1);\n                    if (tj * 6 + 2 < outw) 
vst1_f16(outptr0 + 8, _tmp2);\n                    if (tj * 6 + 3 < outw) vst1_f16(outptr0 + 12, _tmp3);\n                    if (tj * 6 + 4 < outw) vst1_f16(outptr0 + 16, _tmp4);\n                    if (tj * 6 + 5 < outw) vst1_f16(outptr0 + 20, _tmp5);\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 tmp0[4];\n                    __fp16 tmp1[4];\n                    __fp16 tmp2[4];\n                    __fp16 tmp3[4];\n                    __fp16 tmp4[4];\n                    __fp16 tmp5[4];\n                    vst1_f16(tmp0, _tmp0);\n                    vst1_f16(tmp1, _tmp1);\n                    vst1_f16(tmp2, _tmp2);\n                    vst1_f16(tmp3, _tmp3);\n                    vst1_f16(tmp4, _tmp4);\n                    vst1_f16(tmp5, _tmp5);\n\n                    __fp16* outptr1 = outptr0 + N;\n                    __fp16* outptr2 = outptr0 + N * 2;\n                    __fp16* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = tmp0[0];\n                    outptr1[0] = tmp0[1];\n                    outptr2[0] = tmp0[2];\n                    outptr3[0] = tmp0[3];\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp1[0];\n                        outptr1[1] = tmp1[1];\n                        outptr2[1] = tmp1[2];\n                        outptr3[1] = tmp1[3];\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = tmp2[0];\n                        outptr1[2] = tmp2[1];\n                        outptr2[2] = tmp2[2];\n                        outptr3[2] = tmp2[3];\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp3[0];\n                        outptr1[3] = tmp3[1];\n                        outptr2[3] = tmp3[2];\n                        outptr3[3] = tmp3[3];\n                    }\n      
              if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = tmp4[0];\n                        outptr1[4] = tmp4[1];\n                        outptr2[4] = tmp4[2];\n                        outptr3[4] = tmp4[3];\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp5[0];\n                        outptr1[5] = tmp5[1];\n                        outptr2[5] = tmp5[2];\n                        outptr3[5] = tmp5[3];\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        __fp16 bias0 = biasptr ? biasptr[i + ii] : 0.f;\n        __fp16 bias1 = biasptr ? biasptr[i + ii + 1] : 0.f;\n\n#ifdef _MSC_VER\n        __declspec(align(8))\n#else\n        __attribute__((aligned(8)))\n#endif\n        __fp16 tmp[6][8][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 64 + jj * 2;\n            const __fp16* r1 = r0 + max_jj * 2;\n            const __fp16* r2 = r0 + max_jj * 2 * 2;\n            const __fp16* r3 = r0 + max_jj * 2 * 3;\n            const __fp16* r4 = r0 + max_jj * 2 * 4;\n            const __fp16* r5 = r0 + max_jj * 2 * 5;\n            const __fp16* r6 = r0 + max_jj * 2 * 6;\n            const __fp16* r7 = r0 + max_jj * 2 * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                __fp16 tmp024a0 = r1[0] + r2[0];\n                __fp16 tmp024a1 = r1[1] + r2[1];\n                __fp16 tmp135a0 = r1[0] - r2[0];\n                __fp16 tmp135a1 = r1[1] - r2[1];\n                __fp16 tmp024b0 = r3[0] + r4[0];\n                __fp16 tmp024b1 = r3[1] + r4[1];\n                __fp16 tmp135b0 = r3[0] - r4[0];\n                __fp16 tmp135b1 = 
r3[1] - r4[1];\n                __fp16 tmp024c0 = r5[0] + r6[0];\n                __fp16 tmp024c1 = r5[1] + r6[1];\n                __fp16 tmp135c0 = r5[0] - r6[0];\n                __fp16 tmp135c1 = r5[1] - r6[1];\n\n                tmp[0][m][0] = r0[0] + tmp024a0 + tmp024b0 + tmp024c0 * (__fp16)32;\n                tmp[0][m][1] = r0[1] + tmp024a1 + tmp024b1 + tmp024c1 * (__fp16)32;\n                tmp[1][m][0] = tmp135a0 + tmp135b0 + tmp135b0 + tmp135c0 * (__fp16)16;\n                tmp[1][m][1] = tmp135a1 + tmp135b1 + tmp135b1 + tmp135c1 * (__fp16)16;\n                tmp[2][m][0] = tmp024a0 + tmp024b0 * (__fp16)4 + tmp024c0 * (__fp16)8;\n                tmp[2][m][1] = tmp024a1 + tmp024b1 * (__fp16)4 + tmp024c1 * (__fp16)8;\n                tmp[3][m][0] = tmp135a0 + tmp135b0 * (__fp16)8 + tmp135c0 * (__fp16)4;\n                tmp[3][m][1] = tmp135a1 + tmp135b1 * (__fp16)8 + tmp135c1 * (__fp16)4;\n                tmp[4][m][0] = tmp024a0 + tmp024b0 * (__fp16)16 + tmp024c0 + tmp024c0;\n                tmp[4][m][1] = tmp024a1 + tmp024b1 * (__fp16)16 + tmp024c1 + tmp024c1;\n                tmp[5][m][0] = r7[0] + tmp135a0 + tmp135b0 * (__fp16)32 + tmp135c0;\n                tmp[5][m][1] = r7[1] + tmp135a1 + tmp135b1 * (__fp16)32 + tmp135c1;\n\n                r0 += max_jj * 8 * 2;\n                r1 += max_jj * 8 * 2;\n                r2 += max_jj * 8 * 2;\n                r3 += max_jj * 8 * 2;\n                r4 += max_jj * 8 * 2;\n                r5 += max_jj * 8 * 2;\n                r6 += max_jj * 8 * 2;\n                r7 += max_jj * 8 * 2;\n            }\n\n            __fp16* outptr0 = top_blob.channel(i + ii).row<__fp16>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n                __fp16 r00 = tmp[m][0][0];\n                __fp16 r01 = tmp[m][0][1];\n                __fp16 r10 = tmp[m][1][0];\n                __fp16 r11 = tmp[m][1][1];\n           
     __fp16 r20 = tmp[m][2][0];\n                __fp16 r21 = tmp[m][2][1];\n                __fp16 r30 = tmp[m][3][0];\n                __fp16 r31 = tmp[m][3][1];\n                __fp16 r40 = tmp[m][4][0];\n                __fp16 r41 = tmp[m][4][1];\n                __fp16 r50 = tmp[m][5][0];\n                __fp16 r51 = tmp[m][5][1];\n                __fp16 r60 = tmp[m][6][0];\n                __fp16 r61 = tmp[m][6][1];\n                __fp16 r70 = tmp[m][7][0];\n                __fp16 r71 = tmp[m][7][1];\n\n                __fp16 tmp024a0 = r10 + r20;\n                __fp16 tmp024a1 = r11 + r21;\n                __fp16 tmp135a0 = r10 - r20;\n                __fp16 tmp135a1 = r11 - r21;\n                __fp16 tmp024b0 = r30 + r40;\n                __fp16 tmp024b1 = r31 + r41;\n                __fp16 tmp135b0 = r30 - r40;\n                __fp16 tmp135b1 = r31 - r41;\n                __fp16 tmp024c0 = r50 + r60;\n                __fp16 tmp024c1 = r51 + r61;\n                __fp16 tmp135c0 = r50 - r60;\n                __fp16 tmp135c1 = r51 - r61;\n\n                __fp16 tmp00 = bias0 + r00 + tmp024a0 + tmp024b0 + tmp024c0 * (__fp16)32;\n                __fp16 tmp01 = bias1 + r01 + tmp024a1 + tmp024b1 + tmp024c1 * (__fp16)32;\n                __fp16 tmp10 = bias0 + tmp135a0 + tmp135b0 + tmp135b0 + tmp135c0 * (__fp16)16;\n                __fp16 tmp11 = bias1 + tmp135a1 + tmp135b1 + tmp135b1 + tmp135c1 * (__fp16)16;\n                __fp16 tmp20 = bias0 + tmp024a0 + tmp024b0 * (__fp16)4 + tmp024c0 * (__fp16)8;\n                __fp16 tmp21 = bias1 + tmp024a1 + tmp024b1 * (__fp16)4 + tmp024c1 * (__fp16)8;\n                __fp16 tmp30 = bias0 + tmp135a0 + tmp135b0 * (__fp16)8 + tmp135c0 * (__fp16)4;\n                __fp16 tmp31 = bias1 + tmp135a1 + tmp135b1 * (__fp16)8 + tmp135c1 * (__fp16)4;\n                __fp16 tmp40 = bias0 + tmp024a0 + tmp024b0 * (__fp16)16 + tmp024c0 + tmp024c0;\n                __fp16 tmp41 = bias1 + tmp024a1 + tmp024b1 * (__fp16)16 
+ tmp024c1 + tmp024c1;\n                __fp16 tmp50 = bias0 + r70 + tmp135a0 + tmp135b0 * (__fp16)32 + tmp135c0;\n                __fp16 tmp51 = bias1 + r71 + tmp135a1 + tmp135b1 * (__fp16)32 + tmp135c1;\n\n                // if (out_elempack == 1)\n                {\n                    __fp16* outptr1 = outptr0 + N;\n\n                    outptr0[0] = tmp00;\n                    outptr1[0] = tmp01;\n                    if (tj * 6 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] = tmp11;\n                    }\n                    if (tj * 6 + 2 < outw)\n                    {\n                        outptr0[2] = tmp20;\n                        outptr1[2] = tmp21;\n                    }\n                    if (tj * 6 + 3 < outw)\n                    {\n                        outptr0[3] = tmp30;\n                        outptr1[3] = tmp31;\n                    }\n                    if (tj * 6 + 4 < outw)\n                    {\n                        outptr0[4] = tmp40;\n                        outptr1[4] = tmp41;\n                    }\n                    if (tj * 6 + 5 < outw)\n                    {\n                        outptr0[5] = tmp50;\n                        outptr1[5] = tmp51;\n                    }\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        __fp16 bias0 = biasptr ? 
biasptr[i + ii] : 0.f;\n\n        __fp16 tmp[6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const __fp16* r0 = (const __fp16*)top_tile + ii * max_jj * 64 + jj;\n            const __fp16* r1 = r0 + max_jj;\n            const __fp16* r2 = r0 + max_jj * 2;\n            const __fp16* r3 = r0 + max_jj * 3;\n            const __fp16* r4 = r0 + max_jj * 4;\n            const __fp16* r5 = r0 + max_jj * 5;\n            const __fp16* r6 = r0 + max_jj * 6;\n            const __fp16* r7 = r0 + max_jj * 7;\n\n            for (int m = 0; m < 8; m++)\n            {\n                __fp16 tmp024a = r1[0] + r2[0];\n                __fp16 tmp135a = r1[0] - r2[0];\n                __fp16 tmp024b = r3[0] + r4[0];\n                __fp16 tmp135b = r3[0] - r4[0];\n                __fp16 tmp024c = r5[0] + r6[0];\n                __fp16 tmp135c = r5[0] - r6[0];\n\n                tmp[0][m] = r0[0] + tmp024a + tmp024b + tmp024c * (__fp16)32;\n                tmp[1][m] = tmp135a + tmp135b + tmp135b + tmp135c * (__fp16)16;\n                tmp[2][m] = tmp024a + tmp024b * (__fp16)4 + tmp024c * (__fp16)8;\n                tmp[3][m] = tmp135a + tmp135b * (__fp16)8 + tmp135c * (__fp16)4;\n                tmp[4][m] = tmp024a + tmp024b * (__fp16)16 + tmp024c + tmp024c;\n                tmp[5][m] = r7[0] + tmp135a + tmp135b * (__fp16)32 + tmp135c;\n\n                r0 += max_jj * 8;\n                r1 += max_jj * 8;\n                r2 += max_jj * 8;\n                r3 += max_jj * 8;\n                r4 += max_jj * 8;\n                r5 += max_jj * 8;\n                r6 += max_jj * 8;\n                r7 += max_jj * 8;\n            }\n\n            __fp16* outptr0 = top_blob.channel(i + ii).row<__fp16>(ti * 6) + (tj * 6);\n\n            for (int m = 0; m < 6; m++)\n            {\n                if (ti * 6 + m >= outh)\n                    continue;\n\n               
 __fp16 r0 = tmp[m][0];\n                __fp16 r1 = tmp[m][1];\n                __fp16 r2 = tmp[m][2];\n                __fp16 r3 = tmp[m][3];\n                __fp16 r4 = tmp[m][4];\n                __fp16 r5 = tmp[m][5];\n                __fp16 r6 = tmp[m][6];\n                __fp16 r7 = tmp[m][7];\n\n                __fp16 tmp024a = r1 + r2;\n                __fp16 tmp135a = r1 - r2;\n                __fp16 tmp024b = r3 + r4;\n                __fp16 tmp135b = r3 - r4;\n                __fp16 tmp024c = r5 + r6;\n                __fp16 tmp135c = r5 - r6;\n\n                __fp16 tmp0 = bias0 + r0 + tmp024a + tmp024b + tmp024c * (__fp16)32;\n                __fp16 tmp1 = bias0 + tmp135a + tmp135b + tmp135b + tmp135c * (__fp16)16;\n                __fp16 tmp2 = bias0 + tmp024a + tmp024b * (__fp16)4 + tmp024c * (__fp16)8;\n                __fp16 tmp3 = bias0 + tmp135a + tmp135b * (__fp16)8 + tmp135c * (__fp16)4;\n                __fp16 tmp4 = bias0 + tmp024a + tmp024b * (__fp16)16 + tmp024c + tmp024c;\n                __fp16 tmp5 = bias0 + r7 + tmp135a + tmp135b * (__fp16)32 + tmp135c;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 6 + 1 < outw) outptr0[1] = tmp1;\n                    if (tj * 6 + 2 < outw) outptr0[2] = tmp2;\n                    if (tj * 6 + 3 < outw) outptr0[3] = tmp3;\n                    if (tj * 6 + 4 < outw) outptr0[4] = tmp4;\n                    if (tj * 6 + 5 < outw) outptr0[5] = tmp5;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd63_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 6n+2, winograd F(6,3)\n    int w_tiles = (outw + 5) / 6;\n    int h_tiles = (outh + 5) / 6;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * 
top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 64;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd63_fp16sa %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    conv3x3s1_winograd_get_optimal_tile_mnk_fp16(M, N, K, B, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 2u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd63_transform_input_tile_fp16sa(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile_fp16(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 2u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * 
TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd63_transform_input_tile_fp16sa(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            conv3x3s1_winograd_transpose_pack_B_tile_fp16(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 2u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                conv3x3s1_winograd_gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk, opt.use_a53_a55_optimized_kernel);\n            }\n\n            // transform output\n            conv3x3s1_winograd63_transform_output_tile_fp16sa(top_tile, top_blob, bias, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_3x3_winograd_int8.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pack_A_tile_int8(const Mat& A, Mat& AT, int batch, int max_ii, int max_kk)\n{\n    const int N = max_kk * batch;\n\n    for (int b = 0; b < batch; b++)\n    {\n        short* pp = AT.row<short>(b);\n\n        int ii = 0;\n#if __ARM_NEON\n        for (; ii + 7 < max_ii; ii += 8)\n        {\n            const short* p0 = (const short*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                pp[2] = p0[N * 2];\n                pp[3] = p0[N * 3];\n                pp[4] = p0[N * 4];\n                pp[5] = p0[N * 5];\n                pp[6] = p0[N * 6];\n                pp[7] = p0[N * 7];\n                p0 += batch;\n                pp += 8;\n            }\n        }\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n            const short* p0 = (const short*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                pp[2] = p0[N * 2];\n                pp[3] = p0[N * 3];\n                p0 += batch;\n                pp += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; ii + 1 < max_ii; ii += 2)\n        {\n            const short* p0 = (const short*)A + ii * N + b;\n\n            int kk = 0;\n#if !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[batch];\n                pp[2] = p0[N];\n                pp[3] = p0[batch + N];\n                p0 += batch * 2;\n                pp += 4;\n            }\n#endif\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[N];\n                p0 += batch;\n                pp += 2;\n        
    }\n        }\n        for (; ii < max_ii; ii++)\n        {\n            const short* p0 = (const short*)A + ii * N + b;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                p0 += batch;\n                pp += 1;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile_int8(const Mat& B, Mat& BT, int batch, int max_jj, int max_kk, int nT)\n{\n    // NCNN_LOGE(\"transpose_pack_B_tile_int8 %d %d\", max_jj, max_kk);\n\n    #pragma omp parallel for num_threads(nT)\n    for (int b = 0; b < batch; b++)\n    {\n        short* pp = BT.row<short>(b);\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x12\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"prfm   pldl1keep, [%0, #1024]  \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    \"ld4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0], #64 \\n\"\n                    \"ld4    {v16.8h, v17.8h, v18.8h, v19.8h}, [%0] \\n\"\n                    \"sub    %0, %0, #128            \\n\"\n                    \"uzp1   v20.8h, v0.8h, v4.8h    \\n\"\n                    \"uzp2   v26.8h, v0.8h, v4.8h    \\n\"\n                    \"uzp1   v23.8h, v2.8h, v6.8h    \\n\"\n                    \"uzp2   v29.8h, v2.8h, v6.8h    \\n\"\n                    \"uzp1   v21.8h, v16.8h, v1.8h   \\n\"\n                    \"uzp2   v27.8h, v16.8h, v1.8h   \\n\"\n                    \"uzp1   v22.8h, v5.8h, v17.8h   \\n\"\n                    \"uzp2   v28.8h, v5.8h, v17.8h   \\n\"\n                    \"uzp1   v24.8h, v18.8h, v3.8h   \\n\"\n                    \"uzp2   v30.8h, 
v18.8h, v3.8h   \\n\"\n                    \"uzp1   v25.8h, v7.8h, v19.8h   \\n\"\n                    \"uzp2   v31.8h, v7.8h, v19.8h   \\n\"\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%1], #64 \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%1], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                int16x8x4_t _r0 = vld4q_s16(p0);\n                int16x8x4_t _r1 = vld4q_s16(p0 + 32);\n                int16x8x4_t _r2 = vld4q_s16(p0 + 64);\n                int16x8x2_t _t0 = vuzpq_s16(_r0.val[0], _r1.val[0]);\n                int16x8x2_t _t1 = vuzpq_s16(_r2.val[0], _r0.val[1]);\n                int16x8x2_t _t2 = vuzpq_s16(_r1.val[1], _r2.val[1]);\n                int16x8x2_t _t3 = vuzpq_s16(_r0.val[2], _r1.val[2]);\n                int16x8x2_t _t4 = vuzpq_s16(_r2.val[2], _r0.val[3]);\n                int16x8x2_t _t5 = vuzpq_s16(_r1.val[3], _r2.val[3]);\n                vst1q_s16(pp, _t0.val[0]);\n                vst1q_s16(pp + 8, _t1.val[0]);\n                vst1q_s16(pp + 16, _t2.val[0]);\n                vst1q_s16(pp + 24, _t3.val[0]);\n                vst1q_s16(pp + 32, _t4.val[0]);\n                vst1q_s16(pp + 40, _t5.val[0]);\n                vst1q_s16(pp + 48, _t0.val[1]);\n                vst1q_s16(pp + 56, _t1.val[1]);\n                vst1q_s16(pp + 64, _t2.val[1]);\n                vst1q_s16(pp + 72, _t3.val[1]);\n                vst1q_s16(pp + 80, _t4.val[1]);\n                vst1q_s16(pp + 88, 
_t5.val[1]);\n                p0 += max_jj * batch * 8;\n                pp += 96;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x8x2_t _r01 = vld2q_s16(p0);\n                int16x4x2_t _r2 = vld2_s16(p0 + 16);\n                vst1q_s16(pp, _r01.val[0]);\n                vst1_s16(pp + 8, _r2.val[0]);\n                vst1q_s16(pp + 12, _r01.val[1]);\n                vst1_s16(pp + 20, _r2.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 24;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                int16x8_t _r0 = vld1q_s16(p0);\n                int16x4_t _r1 = vld1_s16(p0 + 8);\n                vst1q_s16(pp, _r0);\n                vst1_s16(pp + 8, _r1);\n                p0 += max_jj * batch;\n                pp += 12;\n            }\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                // transpose 8x8\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"prfm   pldl1keep, [%0, #1024]  \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0] \\n\"\n                    \"sub    %0, %0, #64             \\n\"\n                    \"zip1   v16.8h, v0.8h, v4.8h    \\n\"\n                    \"zip2   v20.8h, v0.8h, v4.8h    \\n\"\n                    \"zip1   v17.8h, v1.8h, v5.8h    \\n\"\n                    \"zip2   v21.8h, v1.8h, v5.8h    \\n\"\n                    \"zip1   v18.8h, v2.8h, v6.8h    \\n\"\n         
           \"zip2   v22.8h, v2.8h, v6.8h    \\n\"\n                    \"zip1   v19.8h, v3.8h, v7.8h    \\n\"\n                    \"zip2   v23.8h, v3.8h, v7.8h    \\n\"\n                    \"st4    {v16.8h, v17.8h, v18.8h, v19.8h}, [%1], #64 \\n\"\n                    \"st4    {v20.8h, v21.8h, v22.8h, v23.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n                p0 += max_jj * batch * 8;\n#else  // NCNN_GNU_INLINE_ASM\n                int16x8_t _r0 = vld1q_s16(p0);\n                int16x8_t _r1 = vld1q_s16(p0 + 8);\n                int16x8_t _r2 = vld1q_s16(p0 + 16);\n                int16x8_t _r3 = vld1q_s16(p0 + 24);\n                int16x8_t _r4 = vld1q_s16(p0 + 32);\n                int16x8_t _r5 = vld1q_s16(p0 + 40);\n                int16x8_t _r6 = vld1q_s16(p0 + 48);\n                int16x8_t _r7 = vld1q_s16(p0 + 56);\n                int16x8x2_t _r04 = vzipq_s16(_r0, _r4);\n                int16x8x2_t _r15 = vzipq_s16(_r1, _r5);\n                int16x8x2_t _r26 = vzipq_s16(_r2, _r6);\n                int16x8x2_t _r37 = vzipq_s16(_r3, _r7);\n                int16x8x4_t _r0123;\n                _r0123.val[0] = _r04.val[0];\n                _r0123.val[1] = _r15.val[0];\n                _r0123.val[2] = _r26.val[0];\n                _r0123.val[3] = _r37.val[0];\n                int16x8x4_t _r4567;\n                _r4567.val[0] = _r04.val[1];\n                _r4567.val[1] = _r15.val[1];\n                _r4567.val[2] = _r26.val[1];\n                _r4567.val[3] = _r37.val[1];\n                vst4q_s16(pp, _r0123);\n                vst4q_s16(pp + 32, _r4567);\n                p0 += max_jj * batch * 8;\n                pp += 64;\n#endif // NCNN_GNU_INLINE_ASM\n    
        }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x8x2_t _r01 = vld2q_s16(p0);\n                vst1q_s16(pp, _r01.val[0]);\n                vst1q_s16(pp + 8, _r01.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 16;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                int16x8_t _r0 = vld1q_s16(p0);\n                vst1q_s16(pp, _r0);\n                p0 += max_jj * batch;\n                pp += 8;\n            }\n        }\n#endif // __aarch64__\n        for (; jj + 5 < max_jj; jj += 6)\n        {\n            const short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #768]   \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h}, [%0], #48 \\n\"\n                    \"ld1    {v3.8h, v4.8h, v5.8h}, [%0] \\n\"\n                    \"sub    %0, %0, #48             \\n\"\n                    \"zip1   v16.8h, v0.8h, v3.8h    \\n\"\n                    \"zip2   v20.8h, v0.8h, v3.8h    \\n\"\n                    \"zip1   v17.8h, v1.8h, v4.8h    \\n\"\n                    \"zip2   v21.8h, v1.8h, v4.8h    \\n\"\n                    \"zip1   v18.8h, v2.8h, v5.8h    \\n\"\n                    \"zip2   v22.8h, v2.8h, v5.8h    \\n\"\n                    \"st3    {v16.8h, v17.8h, v18.8h}, [%1], #48 \\n\"\n                    \"st3    {v20.8h, v21.8h, v22.8h}, [%1], #48 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", 
\"v16\", \"v17\", \"v18\", \"v20\", \"v21\", \"v22\");\n                p0 += max_jj * batch * 8;\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #768]          \\n\"\n                    \"vldm       %0, {d0-d11}        \\n\"\n                    \"vzip.16    q0, q3              \\n\"\n                    \"vzip.16    q1, q4              \\n\"\n                    \"vzip.16    q2, q5              \\n\"\n                    \"vst3.s16   {d0,d2,d4}, [%1]!   \\n\"\n                    \"vst3.s16   {d1,d3,d5}, [%1]!   \\n\"\n                    \"vst3.s16   {d6,d8,d10}, [%1]!  \\n\"\n                    \"vst3.s16   {d7,d9,d11}, [%1]!  \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\");\n                p0 += max_jj * batch * 8;\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int16x8_t _r0 = vld1q_s16(p0);\n                int16x8_t _r1 = vld1q_s16(p0 + 8);\n                int16x8_t _r2 = vld1q_s16(p0 + 16);\n                int16x8_t _r3 = vld1q_s16(p0 + 24);\n                int16x8_t _r4 = vld1q_s16(p0 + 32);\n                int16x8_t _r5 = vld1q_s16(p0 + 40);\n                int16x8x2_t _r03 = vzipq_s16(_r0, _r3);\n                int16x8x2_t _r14 = vzipq_s16(_r1, _r4);\n                int16x8x2_t _r25 = vzipq_s16(_r2, _r5);\n                int16x8x3_t _r012;\n                _r012.val[0] = _r03.val[0];\n                _r012.val[1] = _r14.val[0];\n                _r012.val[2] = _r25.val[0];\n                int16x8x3_t _r345;\n                _r345.val[0] = _r03.val[1];\n                _r345.val[1] = _r14.val[1];\n                _r345.val[2] = _r25.val[1];\n                vst3q_s16(pp, _r012);\n                vst3q_s16(pp + 24, _r345);\n                p0 += max_jj * batch * 8;\n                pp += 
48;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x8x2_t _r01 = vld2q_s16(p0);\n                int32x4x2_t _r01x = vtrnq_s32(vreinterpretq_s32_s16(_r01.val[0]), vreinterpretq_s32_s16(_r01.val[1]));\n                int32x2x3_t _r012;\n                _r012.val[0] = vget_low_s32(_r01x.val[0]);\n                _r012.val[1] = vget_low_s32(_r01x.val[1]);\n                _r012.val[2] = vget_high_s32(_r01x.val[0]);\n                vst3_s32((int*)pp, _r012);\n                p0 += max_jj * batch * 2;\n                pp += 12;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                int16x4_t _r0 = vld1_s16(p0);\n                vst1_s16(pp, _r0);\n                pp[4] = p0[4];\n                pp[5] = p0[5];\n                p0 += max_jj * batch;\n                pp += 6;\n            }\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const short* p0 = B;\n\n            int kk = 0;\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                    \"st4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n                p0 += max_jj * batch * 8;\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #512]          \\n\"\n                    \"vldm       %0, {d0-d7}    
     \\n\"\n                    \"vst4.s16   {d0,d2,d4,d6}, [%1]! \\n\"\n                    \"vst4.s16   {d1,d3,d5,d7}, [%1]! \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n                p0 += max_jj * batch * 8;\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int16x8x4_t _r0123;\n                _r0123.val[0] = vld1q_s16(p0);\n                _r0123.val[1] = vld1q_s16(p0 + 8);\n                _r0123.val[2] = vld1q_s16(p0 + 16);\n                _r0123.val[3] = vld1q_s16(p0 + 24);\n                vst4q_s16(pp, _r0123);\n                p0 += max_jj * batch * 8;\n                pp += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x4x2_t _r01 = vld2_s16(p0);\n                vst1_s16(pp, _r01.val[0]);\n                vst1_s16(pp + 4, _r01.val[1]);\n                p0 += max_jj * batch * 2;\n                pp += 8;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                int16x4_t _r0 = vld1_s16(p0);\n                vst1_s16(pp, _r0);\n                p0 += max_jj * batch;\n                pp += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const short* p0 = B;\n\n            int kk = 0;\n#if __ARM_NEON\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]   \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%0]    \\n\"\n                    \"st2    
{v0.8h, v1.8h}, [%1], #32 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\");\n                p0 += max_jj * batch * 8;\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld1.s16   {d0-d3}, [%0]       \\n\"\n                    \"vst2.s16   {d0-d3}, [%1]!      \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\");\n                p0 += max_jj * batch * 8;\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int16x8x2_t _r01;\n                _r01.val[0] = vld1q_s16(p0);\n                _r01.val[1] = vld1q_s16(p0 + 8);\n                vst2q_s16(pp, _r01);\n                p0 += max_jj * batch * 8;\n                pp += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n#endif // __ARM_NEON\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                pp[2] = p0[2];\n                pp[3] = p0[3];\n#else\n                pp[0] = p0[0];\n                pp[1] = p0[2];\n                pp[2] = p0[1];\n                pp[3] = p0[3];\n#endif\n                p0 += max_jj * batch * 2;\n                pp += 4;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                p0 += max_jj * batch;\n                pp += 2;\n            }\n        }\n        for (; jj < max_jj; jj++)\n    
    {\n            const short* p0 = B;\n\n            int kk = 0;\n#if __ARM_NEON\n            p0 += (b * max_jj + jj) * 8;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #128]   \\n\"\n                    \"ld1    {v0.8h}, [%0]           \\n\"\n                    \"st1    {v0.8h}, [%1], #16      \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\");\n                p0 += max_jj * batch * 8;\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #128]          \\n\"\n                    \"vld1.s16   {d0-d1}, [%0]       \\n\"\n                    \"vst1.s16   {d0-d1}, [%1]!      \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\");\n                p0 += max_jj * batch * 8;\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int16x8_t _r0 = vld1q_s16(p0);\n                vst1q_s16(pp, _r0);\n                p0 += max_jj * batch * 8;\n                pp += 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            p0 -= (b * max_jj + jj) * 8;\n#endif // __ARM_NEON\n            p0 += (b * max_jj + jj) * 2;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                p0 += max_jj * batch * 2;\n                pp += 2;\n            }\n            p0 -= (b * max_jj + jj) * 2;\n            p0 += (b * max_jj + jj);\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                p0 += max_jj * batch;\n                pp += 1;\n            }\n        }\n    }\n}\n\nstatic void 
gemm_transB_packed_tile_int8(const Mat& AT_tile, const Mat& BT_tile, Mat& top_blob, int batch, int max_ii, int max_jj, int k, int max_kk)\n{\n    // return;\n    // NCNN_LOGE(\"gemm_transB_packed_tile_int8 %d %d %d\", max_ii, max_jj, max_kk);\n\n    int* outptr = top_blob;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const short* pAT = AT_tile.row<const short>(b) + max_kk * ii;\n            const short* pB = BT_tile.row<const short>(b);\n\n            int jj = 0;\n#if __aarch64__\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"cmp    %w7, #0                     \\n\"\n                    \"beq    0f                          \\n\"\n\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"sub    %0, %0, #320                \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                    \"eor    v9.16b, v9.16b, v9.16b      \\n\"\n                    \"eor    v10.16b, v10.16b, v10.16b   \\n\"\n                    \"eor    v11.16b, v11.16b, v11.16b   \\n\"\n                    \"eor    v12.16b, v12.16b, v12.16b   \\n\"\n 
                   \"eor    v13.16b, v13.16b, v13.16b   \\n\"\n                    \"eor    v14.16b, v14.16b, v14.16b   \\n\"\n                    \"eor    v15.16b, v15.16b, v15.16b   \\n\"\n                    \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                    \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                    \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                    \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                    \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                    \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                    \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                    \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                    \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \".align 4                           \\n\"\n                    \"2:                                 \\n\"\n                    \"smlal  v8.4s, v4.4h, v0.h[0]       \\n\"\n                    \"smlal  v10.4s, v4.4h, v0.h[1]      \\n\"\n                    \"ld1    {v2.8h, v3.8h}, [%2], #32   \\n\"\n                    \"smlal2 v9.4s, v4.8h, v0.h[0]       
\\n\"\n                    \"smlal2 v11.4s, v4.8h, v0.h[1]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal  v12.4s, v4.4h, v0.h[2]      \\n\"\n                    \"smlal  v14.4s, v4.4h, v0.h[3]      \\n\"\n                    \"smlal2 v13.4s, v4.8h, v0.h[2]      \\n\"\n                    \"smlal2 v15.4s, v4.8h, v0.h[3]      \\n\"\n                    \"smlal  v16.4s, v4.4h, v0.h[4]      \\n\"\n                    \"smlal  v18.4s, v4.4h, v0.h[5]      \\n\"\n                    \"smlal2 v17.4s, v4.8h, v0.h[4]      \\n\"\n                    \"smlal2 v19.4s, v4.8h, v0.h[5]      \\n\"\n                    \"smlal  v20.4s, v4.4h, v0.h[6]      \\n\"\n                    \"smlal  v22.4s, v4.4h, v0.h[7]      \\n\"\n                    \"smlal2 v21.4s, v4.8h, v0.h[6]      \\n\"\n                    \"smlal2 v23.4s, v4.8h, v0.h[7]      \\n\"\n                    \"smlal  v24.4s, v4.4h, v1.h[0]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v1.h[1]      \\n\"\n                    \"smlal2 v25.4s, v4.8h, v1.h[0]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v1.h[1]      \\n\"\n                    \"smlal  v28.4s, v4.4h, v1.h[2]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v1.h[3]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, v1.h[2]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v1.h[3]      \\n\"\n                    \"smlal  v8.4s, v5.4h, v1.h[4]       \\n\"\n                    \"smlal  v10.4s, v5.4h, v1.h[5]      \\n\"\n                    \"smlal2 v9.4s, v5.8h, v1.h[4]       \\n\"\n                    \"smlal2 v11.4s, v5.8h, v1.h[5]      \\n\"\n                    \"smlal  v12.4s, v5.4h, v1.h[6]      \\n\"\n                    \"smlal  v14.4s, v5.4h, v1.h[7]      \\n\"\n                    \"smlal2 v13.4s, v5.8h, v1.h[6]      \\n\"\n                    \"smlal2 v15.4s, v5.8h, v1.h[7]      \\n\"\n                    \"smlal  v16.4s, v5.4h, v2.h[0]      \\n\"\n         
           \"smlal  v18.4s, v5.4h, v2.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \"smlal2 v17.4s, v5.8h, v2.h[0]      \\n\"\n                    \"smlal2 v19.4s, v5.8h, v2.h[1]      \\n\"\n                    \"smlal  v20.4s, v5.4h, v2.h[2]      \\n\"\n                    \"smlal  v22.4s, v5.4h, v2.h[3]      \\n\"\n                    \"smlal2 v21.4s, v5.8h, v2.h[2]      \\n\"\n                    \"smlal2 v23.4s, v5.8h, v2.h[3]      \\n\"\n                    \"smlal  v24.4s, v5.4h, v2.h[4]      \\n\"\n                    \"smlal  v26.4s, v5.4h, v2.h[5]      \\n\"\n                    \"smlal2 v25.4s, v5.8h, v2.h[4]      \\n\"\n                    \"smlal2 v27.4s, v5.8h, v2.h[5]      \\n\"\n                    \"smlal  v28.4s, v5.4h, v2.h[6]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v2.h[7]      \\n\"\n                    \"smlal2 v29.4s, v5.8h, v2.h[6]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v2.h[7]      \\n\"\n                    \"smlal  v8.4s, v6.4h, v3.h[0]       \\n\"\n                    \"smlal  v10.4s, v6.4h, v3.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v9.4s, v6.8h, v3.h[0]       \\n\"\n                    \"smlal2 v11.4s, v6.8h, v3.h[1]      \\n\"\n                    \"smlal  v12.4s, v6.4h, v3.h[2]      \\n\"\n                    \"smlal  v14.4s, v6.4h, v3.h[3]      \\n\"\n                    \"smlal2 v13.4s, v6.8h, v3.h[2]      \\n\"\n                    \"smlal2 v15.4s, v6.8h, v3.h[3]      \\n\"\n                    \"smlal  v16.4s, v6.4h, v3.h[4]      \\n\"\n                    \"smlal  v18.4s, v6.4h, v3.h[5]      \\n\"\n                    \"smlal2 v17.4s, v6.8h, v3.h[4]      \\n\"\n                    \"smlal2 v19.4s, v6.8h, v3.h[5]      \\n\"\n                    
\"smlal  v20.4s, v6.4h, v3.h[6]      \\n\"\n                    \"smlal  v22.4s, v6.4h, v3.h[7]      \\n\"\n                    \"smlal2 v21.4s, v6.8h, v3.h[6]      \\n\"\n                    \"smlal2 v23.4s, v6.8h, v3.h[7]      \\n\"\n                    \"smlal  v24.4s, v6.4h, v0.h[0]      \\n\"\n                    \"smlal  v26.4s, v6.4h, v0.h[1]      \\n\"\n                    \"ld1    {v2.8h, v3.8h}, [%2], #32   \\n\"\n                    \"smlal2 v25.4s, v6.8h, v0.h[0]      \\n\"\n                    \"smlal2 v27.4s, v6.8h, v0.h[1]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v0.h[2]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v0.h[3]      \\n\"\n                    \"smlal2 v29.4s, v6.8h, v0.h[2]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v0.h[3]      \\n\"\n                    \"smlal  v8.4s, v7.4h, v0.h[4]       \\n\"\n                    \"smlal  v10.4s, v7.4h, v0.h[5]      \\n\"\n                    \"smlal2 v9.4s, v7.8h, v0.h[4]       \\n\"\n                    \"smlal2 v11.4s, v7.8h, v0.h[5]      \\n\"\n                    \"smlal  v12.4s, v7.4h, v0.h[6]      \\n\"\n                    \"smlal  v14.4s, v7.4h, v0.h[7]      \\n\"\n                    \"smlal2 v13.4s, v7.8h, v0.h[6]      \\n\"\n                    \"smlal2 v15.4s, v7.8h, v0.h[7]      \\n\"\n                    \"smlal  v16.4s, v7.4h, v1.h[0]      \\n\"\n                    \"smlal  v18.4s, v7.4h, v1.h[1]      \\n\"\n                    \"smlal2 v17.4s, v7.8h, v1.h[0]      \\n\"\n                    \"smlal2 v19.4s, v7.8h, v1.h[1]      \\n\"\n                    \"smlal  v20.4s, v7.4h, v1.h[2]      \\n\"\n                    \"smlal  v22.4s, v7.4h, v1.h[3]      \\n\"\n                    \"smlal2 v21.4s, v7.8h, v1.h[2]      \\n\"\n                    \"smlal2 v23.4s, v7.8h, v1.h[3]      \\n\"\n                    \"smlal  v24.4s, v7.4h, v1.h[4]      \\n\"\n                    \"smlal  v26.4s, v7.4h, v1.h[5]      \\n\"\n                    \"smlal2 
v25.4s, v7.8h, v1.h[4]      \\n\"\n                    \"smlal2 v27.4s, v7.8h, v1.h[5]      \\n\"\n                    \"smlal  v28.4s, v7.4h, v1.h[6]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v1.h[7]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v1.h[6]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v1.h[7]      \\n\"\n                    \"smlal  v8.4s, v4.4h, v2.h[0]       \\n\"\n                    \"smlal  v10.4s, v4.4h, v2.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \"smlal2 v9.4s, v4.8h, v2.h[0]       \\n\"\n                    \"smlal2 v11.4s, v4.8h, v2.h[1]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal  v12.4s, v4.4h, v2.h[2]      \\n\"\n                    \"smlal  v14.4s, v4.4h, v2.h[3]      \\n\"\n                    \"smlal2 v13.4s, v4.8h, v2.h[2]      \\n\"\n                    \"smlal2 v15.4s, v4.8h, v2.h[3]      \\n\"\n                    \"smlal  v16.4s, v4.4h, v2.h[4]      \\n\"\n                    \"smlal  v18.4s, v4.4h, v2.h[5]      \\n\"\n                    \"smlal2 v17.4s, v4.8h, v2.h[4]      \\n\"\n                    \"smlal2 v19.4s, v4.8h, v2.h[5]      \\n\"\n                    \"smlal  v20.4s, v4.4h, v2.h[6]      \\n\"\n                    \"smlal  v22.4s, v4.4h, v2.h[7]      \\n\"\n                    \"smlal2 v21.4s, v4.8h, v2.h[6]      \\n\"\n                    \"smlal2 v23.4s, v4.8h, v2.h[7]      \\n\"\n                    \"smlal  v24.4s, v4.4h, v3.h[0]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v3.h[1]      \\n\"\n                    \"smlal2 v25.4s, v4.8h, v3.h[0]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v3.h[1]      \\n\"\n                    \"smlal  v28.4s, v4.4h, v3.h[2]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v3.h[3]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, 
v3.h[2]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v3.h[3]      \\n\"\n                    \"smlal  v8.4s, v5.4h, v3.h[4]       \\n\"\n                    \"smlal  v10.4s, v5.4h, v3.h[5]      \\n\"\n                    \"smlal2 v9.4s, v5.8h, v3.h[4]       \\n\"\n                    \"smlal2 v11.4s, v5.8h, v3.h[5]      \\n\"\n                    \"smlal  v12.4s, v5.4h, v3.h[6]      \\n\"\n                    \"smlal  v14.4s, v5.4h, v3.h[7]      \\n\"\n                    \"smlal2 v13.4s, v5.8h, v3.h[6]      \\n\"\n                    \"smlal2 v15.4s, v5.8h, v3.h[7]      \\n\"\n                    \"smlal  v16.4s, v5.4h, v0.h[0]      \\n\"\n                    \"smlal  v18.4s, v5.4h, v0.h[1]      \\n\"\n                    \"ld1    {v2.8h, v3.8h}, [%2], #32   \\n\"\n                    \"smlal2 v17.4s, v5.8h, v0.h[0]      \\n\"\n                    \"smlal2 v19.4s, v5.8h, v0.h[1]      \\n\"\n                    \"smlal  v20.4s, v5.4h, v0.h[2]      \\n\"\n                    \"smlal  v22.4s, v5.4h, v0.h[3]      \\n\"\n                    \"smlal2 v21.4s, v5.8h, v0.h[2]      \\n\"\n                    \"smlal2 v23.4s, v5.8h, v0.h[3]      \\n\"\n                    \"smlal  v24.4s, v5.4h, v0.h[4]      \\n\"\n                    \"smlal  v26.4s, v5.4h, v0.h[5]      \\n\"\n                    \"smlal2 v25.4s, v5.8h, v0.h[4]      \\n\"\n                    \"smlal2 v27.4s, v5.8h, v0.h[5]      \\n\"\n                    \"smlal  v28.4s, v5.4h, v0.h[6]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v0.h[7]      \\n\"\n                    \"smlal2 v29.4s, v5.8h, v0.h[6]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v0.h[7]      \\n\"\n                    \"smlal  v8.4s, v6.4h, v1.h[0]       \\n\"\n                    \"smlal  v10.4s, v6.4h, v1.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v9.4s, v6.8h, v1.h[0]       
\\n\"\n                    \"smlal2 v11.4s, v6.8h, v1.h[1]      \\n\"\n                    \"smlal  v12.4s, v6.4h, v1.h[2]      \\n\"\n                    \"smlal  v14.4s, v6.4h, v1.h[3]      \\n\"\n                    \"smlal2 v13.4s, v6.8h, v1.h[2]      \\n\"\n                    \"smlal2 v15.4s, v6.8h, v1.h[3]      \\n\"\n                    \"smlal  v16.4s, v6.4h, v1.h[4]      \\n\"\n                    \"smlal  v18.4s, v6.4h, v1.h[5]      \\n\"\n                    \"smlal2 v17.4s, v6.8h, v1.h[4]      \\n\"\n                    \"smlal2 v19.4s, v6.8h, v1.h[5]      \\n\"\n                    \"smlal  v20.4s, v6.4h, v1.h[6]      \\n\"\n                    \"smlal  v22.4s, v6.4h, v1.h[7]      \\n\"\n                    \"smlal2 v21.4s, v6.8h, v1.h[6]      \\n\"\n                    \"smlal2 v23.4s, v6.8h, v1.h[7]      \\n\"\n                    \"smlal  v24.4s, v6.4h, v2.h[0]      \\n\"\n                    \"smlal  v26.4s, v6.4h, v2.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \"smlal2 v25.4s, v6.8h, v2.h[0]      \\n\"\n                    \"smlal2 v27.4s, v6.8h, v2.h[1]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v2.h[2]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v2.h[3]      \\n\"\n                    \"smlal2 v29.4s, v6.8h, v2.h[2]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v2.h[3]      \\n\"\n                    \"smlal  v8.4s, v7.4h, v2.h[4]       \\n\"\n                    \"smlal  v10.4s, v7.4h, v2.h[5]      \\n\"\n                    \"smlal2 v9.4s, v7.8h, v2.h[4]       \\n\"\n                    \"smlal2 v11.4s, v7.8h, v2.h[5]      \\n\"\n                    \"smlal  v12.4s, v7.4h, v2.h[6]      \\n\"\n                    \"smlal  v14.4s, v7.4h, v2.h[7]      \\n\"\n                    \"smlal2 v13.4s, v7.8h, v2.h[6]      \\n\"\n                    \"smlal2 v15.4s, v7.8h, v2.h[7]      \\n\"\n         
           \"smlal  v16.4s, v7.4h, v3.h[0]      \\n\"\n                    \"smlal  v18.4s, v7.4h, v3.h[1]      \\n\"\n                    \"smlal2 v17.4s, v7.8h, v3.h[0]      \\n\"\n                    \"smlal2 v19.4s, v7.8h, v3.h[1]      \\n\"\n                    \"smlal  v20.4s, v7.4h, v3.h[2]      \\n\"\n                    \"smlal  v22.4s, v7.4h, v3.h[3]      \\n\"\n                    \"smlal2 v21.4s, v7.8h, v3.h[2]      \\n\"\n                    \"smlal2 v23.4s, v7.8h, v3.h[3]      \\n\"\n                    \"smlal  v24.4s, v7.4h, v3.h[4]      \\n\"\n                    \"smlal  v26.4s, v7.4h, v3.h[5]      \\n\"\n                    \"smlal2 v25.4s, v7.8h, v3.h[4]      \\n\"\n                    \"smlal2 v27.4s, v7.8h, v3.h[5]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v28.4s, v7.4h, v3.h[6]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v3.h[7]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v3.h[6]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v3.h[7]      \\n\"\n                    \"bne    2b                          \\n\"\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #32                 \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #7                 \\n\" // w4 = remain = max_kk & 7\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n                    \"smlal  v8.4s, v4.4h, v0.h[0]       \\n\"\n                    \"smlal  v10.4s, v4.4h, v0.h[1]      \\n\"\n                    \"smlal2 v9.4s, v4.8h, v0.h[0]       \\n\"\n                    \"smlal2 v11.4s, v4.8h, v0.h[1] 
     \\n\"\n                    \"smlal  v12.4s, v4.4h, v0.h[2]      \\n\"\n                    \"smlal  v14.4s, v4.4h, v0.h[3]      \\n\"\n                    \"smlal2 v13.4s, v4.8h, v0.h[2]      \\n\"\n                    \"smlal2 v15.4s, v4.8h, v0.h[3]      \\n\"\n                    \"smlal  v16.4s, v4.4h, v1.h[0]      \\n\"\n                    \"smlal  v18.4s, v4.4h, v1.h[1]      \\n\"\n                    \"smlal2 v17.4s, v4.8h, v1.h[0]      \\n\"\n                    \"smlal2 v19.4s, v4.8h, v1.h[1]      \\n\"\n                    \"smlal  v20.4s, v4.4h, v1.h[2]      \\n\"\n                    \"smlal  v22.4s, v4.4h, v1.h[3]      \\n\"\n                    \"smlal2 v21.4s, v4.8h, v1.h[2]      \\n\"\n                    \"smlal2 v23.4s, v4.8h, v1.h[3]      \\n\"\n                    \"smlal  v24.4s, v4.4h, v2.h[0]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v2.h[1]      \\n\"\n                    \"smlal2 v25.4s, v4.8h, v2.h[0]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v2.h[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v28.4s, v4.4h, v2.h[2]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v2.h[3]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, v2.h[2]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v2.h[3]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : 
\"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n                int32x4_t _sum4;\n                int32x4_t _sum5;\n                int32x4_t _sum6;\n                int32x4_t _sum7;\n                int32x4_t _sum8;\n                int32x4_t _sum9;\n                int32x4_t _suma;\n                int32x4_t _sumb;\n                int32x4_t _sumc;\n                int32x4_t _sumd;\n                int32x4_t _sume;\n                int32x4_t _sumf;\n                int32x4_t _sumg;\n                int32x4_t _sumh;\n                int32x4_t _sumi;\n                int32x4_t _sumj;\n                int32x4_t _sumk;\n                int32x4_t _suml;\n                int32x4_t _summ;\n                int32x4_t _sumn;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                    _sum6 = vdupq_n_s32(0);\n                    _sum7 = vdupq_n_s32(0);\n                    _sum8 = vdupq_n_s32(0);\n                    _sum9 = vdupq_n_s32(0);\n                    _suma = 
vdupq_n_s32(0);\n                    _sumb = vdupq_n_s32(0);\n                    _sumc = vdupq_n_s32(0);\n                    _sumd = vdupq_n_s32(0);\n                    _sume = vdupq_n_s32(0);\n                    _sumf = vdupq_n_s32(0);\n                    _sumg = vdupq_n_s32(0);\n                    _sumh = vdupq_n_s32(0);\n                    _sumi = vdupq_n_s32(0);\n                    _sumj = vdupq_n_s32(0);\n                    _sumk = vdupq_n_s32(0);\n                    _suml = vdupq_n_s32(0);\n                    _summ = vdupq_n_s32(0);\n                    _sumn = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                    _sum8 = vld1q_s32(outptr + 32);\n                    _sum9 = vld1q_s32(outptr + 36);\n                    _suma = vld1q_s32(outptr + 40);\n                    _sumb = vld1q_s32(outptr + 44);\n                    _sumc = vld1q_s32(outptr + 48);\n                    _sumd = vld1q_s32(outptr + 52);\n                    _sume = vld1q_s32(outptr + 56);\n                    _sumf = vld1q_s32(outptr + 60);\n                    _sumg = vld1q_s32(outptr + 64);\n                    _sumh = vld1q_s32(outptr + 68);\n                    _sumi = vld1q_s32(outptr + 72);\n                    _sumj = vld1q_s32(outptr + 76);\n                    _sumk = vld1q_s32(outptr + 80);\n                    _suml = vld1q_s32(outptr + 84);\n                    _summ = vld1q_s32(outptr + 88);\n                    _sumn = vld1q_s32(outptr + 92);\n                }\n\n                int kk = 0;\n                for (; 
kk < max_kk; kk++)\n                {\n                    int16x8_t _pA = vld1q_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    int16x4_t _pB2 = vld1_s16(pB + 8);\n                    _sum0 = vmlal_laneq_s16(_sum0, vget_low_s16(_pA), _pB, 0);\n                    _sum1 = vmlal_laneq_s16(_sum1, vget_high_s16(_pA), _pB, 0);\n                    _sum2 = vmlal_laneq_s16(_sum2, vget_low_s16(_pA), _pB, 1);\n                    _sum3 = vmlal_laneq_s16(_sum3, vget_high_s16(_pA), _pB, 1);\n                    _sum4 = vmlal_laneq_s16(_sum4, vget_low_s16(_pA), _pB, 2);\n                    _sum5 = vmlal_laneq_s16(_sum5, vget_high_s16(_pA), _pB, 2);\n                    _sum6 = vmlal_laneq_s16(_sum6, vget_low_s16(_pA), _pB, 3);\n                    _sum7 = vmlal_laneq_s16(_sum7, vget_high_s16(_pA), _pB, 3);\n                    _sum8 = vmlal_laneq_s16(_sum8, vget_low_s16(_pA), _pB, 4);\n                    _sum9 = vmlal_laneq_s16(_sum9, vget_high_s16(_pA), _pB, 4);\n                    _suma = vmlal_laneq_s16(_suma, vget_low_s16(_pA), _pB, 5);\n                    _sumb = vmlal_laneq_s16(_sumb, vget_high_s16(_pA), _pB, 5);\n                    _sumc = vmlal_laneq_s16(_sumc, vget_low_s16(_pA), _pB, 6);\n                    _sumd = vmlal_laneq_s16(_sumd, vget_high_s16(_pA), _pB, 6);\n                    _sume = vmlal_laneq_s16(_sume, vget_low_s16(_pA), _pB, 7);\n                    _sumf = vmlal_laneq_s16(_sumf, vget_high_s16(_pA), _pB, 7);\n                    _sumg = vmlal_lane_s16(_sumg, vget_low_s16(_pA), _pB2, 0);\n                    _sumh = vmlal_lane_s16(_sumh, vget_high_s16(_pA), _pB2, 0);\n                    _sumi = vmlal_lane_s16(_sumi, vget_low_s16(_pA), _pB2, 1);\n                    _sumj = vmlal_lane_s16(_sumj, vget_high_s16(_pA), _pB2, 1);\n                    _sumk = vmlal_lane_s16(_sumk, vget_low_s16(_pA), _pB2, 2);\n                    _suml = vmlal_lane_s16(_suml, vget_high_s16(_pA), _pB2, 2);\n                    _summ = 
vmlal_lane_s16(_summ, vget_low_s16(_pA), _pB2, 3);\n                    _sumn = vmlal_lane_s16(_sumn, vget_high_s16(_pA), _pB2, 3);\n                    pA += 8;\n                    pB += 12;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n                vst1q_s32(outptr + 32, _sum8);\n                vst1q_s32(outptr + 36, _sum9);\n                vst1q_s32(outptr + 40, _suma);\n                vst1q_s32(outptr + 44, _sumb);\n                vst1q_s32(outptr + 48, _sumc);\n                vst1q_s32(outptr + 52, _sumd);\n                vst1q_s32(outptr + 56, _sume);\n                vst1q_s32(outptr + 60, _sumf);\n                vst1q_s32(outptr + 64, _sumg);\n                vst1q_s32(outptr + 68, _sumh);\n                vst1q_s32(outptr + 72, _sumi);\n                vst1q_s32(outptr + 76, _sumj);\n                vst1q_s32(outptr + 80, _sumk);\n                vst1q_s32(outptr + 84, _suml);\n                vst1q_s32(outptr + 88, _summ);\n                vst1q_s32(outptr + 92, _sumn);\n                outptr += 96;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"cmp    %w7, #0                     \\n\"\n                    \"beq    0f                          \\n\"\n\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, 
[%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"sub    %0, %0, #192                \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                    \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                    \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                    \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                    \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                    \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                    \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                    \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                    \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \".align 4                           \\n\"\n                    \"2:                                 \\n\"\n                    \"smlal  v16.4s, v4.4h, v0.h[0]      \\n\"\n     
               \"smlal  v18.4s, v4.4h, v0.h[1]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v17.4s, v4.8h, v0.h[0]      \\n\"\n                    \"smlal2 v19.4s, v4.8h, v0.h[1]      \\n\"\n                    \"ld1    {v2.8h, v3.8h}, [%2], #32   \\n\"\n                    \"smlal  v20.4s, v4.4h, v0.h[2]      \\n\"\n                    \"smlal  v22.4s, v4.4h, v0.h[3]      \\n\"\n                    \"smlal2 v21.4s, v4.8h, v0.h[2]      \\n\"\n                    \"smlal2 v23.4s, v4.8h, v0.h[3]      \\n\"\n                    \"smlal  v24.4s, v4.4h, v0.h[4]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v0.h[5]      \\n\"\n                    \"smlal2 v25.4s, v4.8h, v0.h[4]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v0.h[5]      \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[6]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[7]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[6]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[7]      \\n\"\n                    \"smlal  v16.4s, v5.4h, v1.h[0]      \\n\"\n                    \"smlal  v18.4s, v5.4h, v1.h[1]      \\n\"\n                    \"smlal2 v17.4s, v5.8h, v1.h[0]      \\n\"\n                    \"smlal2 v19.4s, v5.8h, v1.h[1]      \\n\"\n                    \"smlal  v20.4s, v5.4h, v1.h[2]      \\n\"\n                    \"smlal  v22.4s, v5.4h, v1.h[3]      \\n\"\n                    \"smlal2 v21.4s, v5.8h, v1.h[2]      \\n\"\n                    \"smlal2 v23.4s, v5.8h, v1.h[3]      \\n\"\n                    \"smlal  v24.4s, v5.4h, v1.h[4]      \\n\"\n                    \"smlal  v26.4s, v5.4h, v1.h[5]      \\n\"\n                    \"smlal2 v25.4s, v5.8h, v1.h[4]      \\n\"\n                    \"smlal2 v27.4s, v5.8h, v1.h[5]      \\n\"\n                    \"smlal  v28.4s, v5.4h, v1.h[6]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v1.h[7]      \\n\"\n                    
\"smlal2 v29.4s, v5.8h, v1.h[6]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v1.h[7]      \\n\"\n                    \"smlal  v16.4s, v6.4h, v2.h[0]      \\n\"\n                    \"smlal  v18.4s, v6.4h, v2.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v17.4s, v6.8h, v2.h[0]      \\n\"\n                    \"smlal2 v19.4s, v6.8h, v2.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \"smlal  v20.4s, v6.4h, v2.h[2]      \\n\"\n                    \"smlal  v22.4s, v6.4h, v2.h[3]      \\n\"\n                    \"smlal2 v21.4s, v6.8h, v2.h[2]      \\n\"\n                    \"smlal2 v23.4s, v6.8h, v2.h[3]      \\n\"\n                    \"smlal  v24.4s, v6.4h, v2.h[4]      \\n\"\n                    \"smlal  v26.4s, v6.4h, v2.h[5]      \\n\"\n                    \"smlal2 v25.4s, v6.8h, v2.h[4]      \\n\"\n                    \"smlal2 v27.4s, v6.8h, v2.h[5]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v2.h[6]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v2.h[7]      \\n\"\n                    \"smlal2 v29.4s, v6.8h, v2.h[6]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v2.h[7]      \\n\"\n                    \"smlal  v16.4s, v7.4h, v3.h[0]      \\n\"\n                    \"smlal  v18.4s, v7.4h, v3.h[1]      \\n\"\n                    \"smlal2 v17.4s, v7.8h, v3.h[0]      \\n\"\n                    \"smlal2 v19.4s, v7.8h, v3.h[1]      \\n\"\n                    \"smlal  v20.4s, v7.4h, v3.h[2]      \\n\"\n                    \"smlal  v22.4s, v7.4h, v3.h[3]      \\n\"\n                    \"smlal2 v21.4s, v7.8h, v3.h[2]      \\n\"\n                    \"smlal2 v23.4s, v7.8h, v3.h[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  
v24.4s, v7.4h, v3.h[4]      \\n\"\n                    \"smlal  v26.4s, v7.4h, v3.h[5]      \\n\"\n                    \"smlal2 v25.4s, v7.8h, v3.h[4]      \\n\"\n                    \"smlal2 v27.4s, v7.8h, v3.h[5]      \\n\"\n                    \"smlal  v28.4s, v7.4h, v3.h[6]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v3.h[7]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v3.h[6]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v3.h[7]      \\n\"\n                    \"bne    2b                          \\n\"\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #32                 \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"ld1    {v0.8h}, [%2], #16          \\n\"\n                    \"smlal  v16.4s, v4.4h, v0.h[0]      \\n\"\n                    \"smlal  v18.4s, v4.4h, v0.h[1]      \\n\"\n                    \"smlal2 v17.4s, v4.8h, v0.h[0]      \\n\"\n                    \"smlal2 v19.4s, v4.8h, v0.h[1]      \\n\"\n                    \"smlal  v20.4s, v4.4h, v0.h[2]      \\n\"\n                    \"smlal  v22.4s, v4.4h, v0.h[3]      \\n\"\n                    \"smlal2 v21.4s, v4.8h, v0.h[2]      \\n\"\n                    \"smlal2 v23.4s, v4.8h, v0.h[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v24.4s, v4.4h, v0.h[4]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v0.h[5]      \\n\"\n                    \"smlal2 v25.4s, v4.8h, v0.h[4]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v0.h[5]      \\n\"\n             
       \"smlal  v28.4s, v4.4h, v0.h[6]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[7]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[6]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[7]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n                int32x4_t _sum4;\n                int32x4_t _sum5;\n                int32x4_t _sum6;\n                int32x4_t _sum7;\n                int32x4_t _sum8;\n                int32x4_t _sum9;\n                int32x4_t _suma;\n                int32x4_t _sumb;\n                int32x4_t _sumc;\n                int32x4_t _sumd;\n                int32x4_t _sume;\n                int32x4_t _sumf;\n\n                if (k == 0)\n                {\n                    _sum0 = 
vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                    _sum6 = vdupq_n_s32(0);\n                    _sum7 = vdupq_n_s32(0);\n                    _sum8 = vdupq_n_s32(0);\n                    _sum9 = vdupq_n_s32(0);\n                    _suma = vdupq_n_s32(0);\n                    _sumb = vdupq_n_s32(0);\n                    _sumc = vdupq_n_s32(0);\n                    _sumd = vdupq_n_s32(0);\n                    _sume = vdupq_n_s32(0);\n                    _sumf = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                    _sum8 = vld1q_s32(outptr + 32);\n                    _sum9 = vld1q_s32(outptr + 36);\n                    _suma = vld1q_s32(outptr + 40);\n                    _sumb = vld1q_s32(outptr + 44);\n                    _sumc = vld1q_s32(outptr + 48);\n                    _sumd = vld1q_s32(outptr + 52);\n                    _sume = vld1q_s32(outptr + 56);\n                    _sumf = vld1q_s32(outptr + 60);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x8_t _pA = vld1q_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    _sum0 = vmlal_laneq_s16(_sum0, vget_low_s16(_pA), _pB, 0);\n                    _sum1 = vmlal_laneq_s16(_sum1, vget_high_s16(_pA), _pB, 0);\n                    _sum2 = 
vmlal_laneq_s16(_sum2, vget_low_s16(_pA), _pB, 1);\n                    _sum3 = vmlal_laneq_s16(_sum3, vget_high_s16(_pA), _pB, 1);\n                    _sum4 = vmlal_laneq_s16(_sum4, vget_low_s16(_pA), _pB, 2);\n                    _sum5 = vmlal_laneq_s16(_sum5, vget_high_s16(_pA), _pB, 2);\n                    _sum6 = vmlal_laneq_s16(_sum6, vget_low_s16(_pA), _pB, 3);\n                    _sum7 = vmlal_laneq_s16(_sum7, vget_high_s16(_pA), _pB, 3);\n                    _sum8 = vmlal_laneq_s16(_sum8, vget_low_s16(_pA), _pB, 4);\n                    _sum9 = vmlal_laneq_s16(_sum9, vget_high_s16(_pA), _pB, 4);\n                    _suma = vmlal_laneq_s16(_suma, vget_low_s16(_pA), _pB, 5);\n                    _sumb = vmlal_laneq_s16(_sumb, vget_high_s16(_pA), _pB, 5);\n                    _sumc = vmlal_laneq_s16(_sumc, vget_low_s16(_pA), _pB, 6);\n                    _sumd = vmlal_laneq_s16(_sumd, vget_high_s16(_pA), _pB, 6);\n                    _sume = vmlal_laneq_s16(_sume, vget_low_s16(_pA), _pB, 7);\n                    _sumf = vmlal_laneq_s16(_sumf, vget_high_s16(_pA), _pB, 7);\n                    pA += 8;\n                    pB += 8;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n                vst1q_s32(outptr + 32, _sum8);\n                vst1q_s32(outptr + 36, _sum9);\n                vst1q_s32(outptr + 40, _suma);\n                vst1q_s32(outptr + 44, _sumb);\n                vst1q_s32(outptr + 48, _sumc);\n                vst1q_s32(outptr + 52, _sumd);\n                vst1q_s32(outptr + 56, _sume);\n                vst1q_s32(outptr + 60, _sumf);\n                outptr += 64;\n#endif // 
NCNN_GNU_INLINE_ASM\n            }\n#endif // __aarch64__\n            for (; jj + 5 < max_jj; jj += 6)\n            {\n                const short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"cmp    %w7, #0                     \\n\"\n                    \"beq    0f                          \\n\"\n\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"sub    %0, %0, #128                \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                    \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                    \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                    \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                    \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                    \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    
\"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \".align 4                           \\n\"\n                    \"2:                                 \\n\"\n                    \"smlal  v20.4s, v6.4h, v0.h[0]      \\n\"\n                    \"smlal  v22.4s, v6.4h, v0.h[1]      \\n\"\n                    \"ld1    {v8.8h, v9.8h}, [%1], #32   \\n\"\n                    \"smlal2 v21.4s, v6.8h, v0.h[0]      \\n\"\n                    \"smlal2 v23.4s, v6.8h, v0.h[1]      \\n\"\n                    \"ld1    {v2.8h, v3.8h}, [%2], #32   \\n\"\n                    \"smlal  v24.4s, v6.4h, v0.h[2]      \\n\"\n                    \"smlal  v26.4s, v6.4h, v0.h[3]      \\n\"\n                    \"smlal2 v25.4s, v6.8h, v0.h[2]      \\n\"\n                    \"smlal2 v27.4s, v6.8h, v0.h[3]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v0.h[4]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v0.h[5]      \\n\"\n                    \"smlal2 v29.4s, v6.8h, v0.h[4]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v0.h[5]      \\n\"\n                    \"smlal  v20.4s, v7.4h, v0.h[6]      \\n\"\n                    \"smlal  v22.4s, v7.4h, v0.h[7]      \\n\"\n                    \"smlal2 v21.4s, v7.8h, v0.h[6]      \\n\"\n                    \"smlal2 v23.4s, v7.8h, v0.h[7]      \\n\"\n                    \"smlal  v24.4s, v7.4h, v1.h[0]      \\n\"\n                    \"smlal  v26.4s, v7.4h, v1.h[1]      \\n\"\n                    \"smlal2 v25.4s, v7.8h, v1.h[0]      \\n\"\n                    \"smlal2 v27.4s, v7.8h, v1.h[1]      \\n\"\n                    \"smlal  v28.4s, v7.4h, v1.h[2]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v1.h[3]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v1.h[2]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v1.h[3]      \\n\"\n                    \"smlal  v20.4s, v8.4h, v1.h[4]      \\n\"\n                    \"smlal  
v22.4s, v8.4h, v1.h[5]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v21.4s, v8.8h, v1.h[4]      \\n\"\n                    \"smlal2 v23.4s, v8.8h, v1.h[5]      \\n\"\n                    \"smlal  v24.4s, v8.4h, v1.h[6]      \\n\"\n                    \"smlal  v26.4s, v8.4h, v1.h[7]      \\n\"\n                    \"smlal2 v25.4s, v8.8h, v1.h[6]      \\n\"\n                    \"smlal2 v27.4s, v8.8h, v1.h[7]      \\n\"\n                    \"smlal  v28.4s, v8.4h, v2.h[0]      \\n\"\n                    \"smlal  v30.4s, v8.4h, v2.h[1]      \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%2], #32   \\n\"\n                    \"smlal2 v29.4s, v8.8h, v2.h[0]      \\n\"\n                    \"smlal2 v31.4s, v8.8h, v2.h[1]      \\n\"\n                    \"smlal  v20.4s, v9.4h, v2.h[2]      \\n\"\n                    \"smlal  v22.4s, v9.4h, v2.h[3]      \\n\"\n                    \"smlal2 v21.4s, v9.8h, v2.h[2]      \\n\"\n                    \"smlal2 v23.4s, v9.8h, v2.h[3]      \\n\"\n                    \"smlal  v24.4s, v9.4h, v2.h[4]      \\n\"\n                    \"smlal  v26.4s, v9.4h, v2.h[5]      \\n\"\n                    \"smlal2 v25.4s, v9.8h, v2.h[4]      \\n\"\n                    \"smlal2 v27.4s, v9.8h, v2.h[5]      \\n\"\n                    \"smlal  v28.4s, v9.4h, v2.h[6]      \\n\"\n                    \"smlal  v30.4s, v9.4h, v2.h[7]      \\n\"\n                    \"smlal2 v29.4s, v9.8h, v2.h[6]      \\n\"\n                    \"smlal2 v31.4s, v9.8h, v2.h[7]      \\n\"\n                    \"smlal  v20.4s, v6.4h, v3.h[0]      \\n\"\n                    \"smlal  v22.4s, v6.4h, v3.h[1]      \\n\"\n                    \"ld1    {v8.8h, v9.8h}, [%1], #32   \\n\"\n                    \"smlal2 v21.4s, v6.8h, v3.h[0]      \\n\"\n                    \"smlal2 v23.4s, v6.8h, v3.h[1]      \\n\"\n                    \"smlal  v24.4s, v6.4h, 
v3.h[2]      \\n\"\n                    \"smlal  v26.4s, v6.4h, v3.h[3]      \\n\"\n                    \"smlal2 v25.4s, v6.8h, v3.h[2]      \\n\"\n                    \"smlal2 v27.4s, v6.8h, v3.h[3]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v3.h[4]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v3.h[5]      \\n\"\n                    \"smlal2 v29.4s, v6.8h, v3.h[4]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v3.h[5]      \\n\"\n                    \"smlal  v20.4s, v7.4h, v3.h[6]      \\n\"\n                    \"smlal  v22.4s, v7.4h, v3.h[7]      \\n\"\n                    \"smlal2 v21.4s, v7.8h, v3.h[6]      \\n\"\n                    \"smlal2 v23.4s, v7.8h, v3.h[7]      \\n\"\n                    \"smlal  v24.4s, v7.4h, v4.h[0]      \\n\"\n                    \"smlal  v26.4s, v7.4h, v4.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \"smlal2 v25.4s, v7.8h, v4.h[0]      \\n\"\n                    \"smlal2 v27.4s, v7.8h, v4.h[1]      \\n\"\n                    \"smlal  v28.4s, v7.4h, v4.h[2]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v4.h[3]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v4.h[2]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v4.h[3]      \\n\"\n                    \"smlal  v20.4s, v8.4h, v4.h[4]      \\n\"\n                    \"smlal  v22.4s, v8.4h, v4.h[5]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v21.4s, v8.8h, v4.h[4]      \\n\"\n                    \"smlal2 v23.4s, v8.8h, v4.h[5]      \\n\"\n                    \"smlal  v24.4s, v8.4h, v4.h[6]      \\n\"\n                    \"smlal  v26.4s, v8.4h, v4.h[7]      \\n\"\n                    \"smlal2 v25.4s, v8.8h, v4.h[6]      \\n\"\n                    \"smlal2 v27.4s, v8.8h, v4.h[7]      
\\n\"\n                    \"smlal  v28.4s, v8.4h, v5.h[0]      \\n\"\n                    \"smlal  v30.4s, v8.4h, v5.h[1]      \\n\"\n                    \"smlal2 v29.4s, v8.8h, v5.h[0]      \\n\"\n                    \"smlal2 v31.4s, v8.8h, v5.h[1]      \\n\"\n                    \"smlal  v20.4s, v9.4h, v5.h[2]      \\n\"\n                    \"smlal  v22.4s, v9.4h, v5.h[3]      \\n\"\n                    \"smlal2 v21.4s, v9.8h, v5.h[2]      \\n\"\n                    \"smlal2 v23.4s, v9.8h, v5.h[3]      \\n\"\n                    \"smlal  v24.4s, v9.4h, v5.h[4]      \\n\"\n                    \"smlal  v26.4s, v9.4h, v5.h[5]      \\n\"\n                    \"smlal2 v25.4s, v9.8h, v5.h[4]      \\n\"\n                    \"smlal2 v27.4s, v9.8h, v5.h[5]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v28.4s, v9.4h, v5.h[6]      \\n\"\n                    \"smlal  v30.4s, v9.4h, v5.h[7]      \\n\"\n                    \"smlal2 v29.4s, v9.8h, v5.h[6]      \\n\"\n                    \"smlal2 v31.4s, v9.8h, v5.h[7]      \\n\"\n                    \"bne    2b                          \\n\"\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #32                 \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #7                 \\n\" // w4 = remain = max_kk & 7\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"ld1    {v0.8h}, [%2]               \\n\"\n                    \"add    %2, %2, #12                 \\n\"\n                    \"smlal  v20.4s, v4.4h, v0.h[0]      \\n\"\n                    \"smlal  v22.4s, v4.4h, v0.h[1]      \\n\"\n                    \"smlal2 v21.4s, 
v4.8h, v0.h[0]      \\n\"\n                    \"smlal2 v23.4s, v4.8h, v0.h[1]      \\n\"\n                    \"smlal  v24.4s, v4.4h, v0.h[2]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v0.h[3]      \\n\"\n                    \"smlal2 v25.4s, v4.8h, v0.h[2]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v0.h[3]      \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[4]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[5]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[4]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[5]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #512]          \\n\"\n                    \"pld        [%2, #384]          \\n\"\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                  
  \"vldm       %0!, {d8-d15}       \\n\"\n                    \"vldm       %0, {d16-d31}       \\n\"\n                    \"sub        %0, %0, #64         \\n\"\n                    \"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q4, q4              \\n\"\n                    \"veor       q5, q5              \\n\"\n                    \"veor       q6, q6              \\n\"\n                    \"veor       q7, q7              \\n\"\n                    \"veor       q8, q8              \\n\"\n                    \"veor       q9, q9              \\n\"\n                    \"veor       q10, q10            \\n\"\n                    \"veor       q11, q11            \\n\"\n                    \"veor       q12, q12            \\n\"\n                    \"veor       q13, q13            \\n\"\n                    \"veor       q14, q14            \\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #3          \\n\" // r4 = max_kk >> 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n                    \"vld1.s16   {d4-d5}, [%1]!      \\n\"\n                    \"vld1.s16   {d0-d1}, [%2]!      \\n\"\n                    \".align 4                       \\n\"\n                    \"2:                             \\n\"\n                    \"vmlal.s16  q4, d4, d0[0]       \\n\"\n                    \"vld1.s16   {d6-d7}, [%1]!      \\n\"\n                    \"vmlal.s16  q6, d4, d0[1]       \\n\"\n                    \"vld1.s16   {d2-d3}, [%2]!      
\\n\"\n                    \"vmlal.s16  q8, d4, d0[2]       \\n\"\n                    \"vmlal.s16  q10, d4, d0[3]      \\n\"\n                    \"vmlal.s16  q5, d5, d0[0]       \\n\"\n                    \"vmlal.s16  q7, d5, d0[1]       \\n\"\n                    \"vmlal.s16  q9, d5, d0[2]       \\n\"\n                    \"vmlal.s16  q11, d5, d0[3]      \\n\"\n                    \"vmlal.s16  q12, d4, d1[0]      \\n\"\n                    \"vmlal.s16  q14, d4, d1[1]      \\n\"\n                    \"vmlal.s16  q13, d5, d1[0]      \\n\"\n                    \"vmlal.s16  q15, d5, d1[1]      \\n\"\n                    \"vmlal.s16  q4, d6, d1[2]       \\n\"\n                    \"vld1.s16   {d4-d5}, [%1]!      \\n\"\n                    \"vmlal.s16  q6, d6, d1[3]       \\n\"\n                    \"vmlal.s16  q5, d7, d1[2]       \\n\"\n                    \"vmlal.s16  q7, d7, d1[3]       \\n\"\n                    \"vmlal.s16  q8, d6, d2[0]       \\n\"\n                    \"pld        [%2, #384]          \\n\"\n                    \"vld1.s16   {d0-d1}, [%2]!      \\n\"\n                    \"vmlal.s16  q10, d6, d2[1]      \\n\"\n                    \"vmlal.s16  q12, d6, d2[2]      \\n\"\n                    \"vmlal.s16  q14, d6, d2[3]      \\n\"\n                    \"vmlal.s16  q9, d7, d2[0]       \\n\"\n                    \"vmlal.s16  q11, d7, d2[1]      \\n\"\n                    \"vmlal.s16  q13, d7, d2[2]      \\n\"\n                    \"vmlal.s16  q15, d7, d2[3]      \\n\"\n                    \"vmlal.s16  q4, d4, d3[0]       \\n\"\n                    \"vld1.s16   {d6-d7}, [%1]!      
\\n\"\n                    \"vmlal.s16  q6, d4, d3[1]       \\n\"\n                    \"vmlal.s16  q8, d4, d3[2]       \\n\"\n                    \"vmlal.s16  q10, d4, d3[3]      \\n\"\n                    \"vmlal.s16  q5, d5, d3[0]       \\n\"\n                    \"vmlal.s16  q7, d5, d3[1]       \\n\"\n                    \"vmlal.s16  q9, d5, d3[2]       \\n\"\n                    \"vmlal.s16  q11, d5, d3[3]      \\n\"\n                    \"vmlal.s16  q12, d4, d0[0]      \\n\"\n                    \"vld1.s16   {d2-d3}, [%2]!      \\n\"\n                    \"vmlal.s16  q14, d4, d0[1]      \\n\"\n                    \"vmlal.s16  q13, d5, d0[0]      \\n\"\n                    \"vmlal.s16  q15, d5, d0[1]      \\n\"\n                    \"vmlal.s16  q4, d6, d0[2]       \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vld1.s16   {d4-d5}, [%1]!      \\n\"\n                    \"vmlal.s16  q6, d6, d0[3]       \\n\"\n                    \"vmlal.s16  q5, d7, d0[2]       \\n\"\n                    \"vmlal.s16  q7, d7, d0[3]       \\n\"\n                    \"vmlal.s16  q8, d6, d1[0]       \\n\"\n                    \"vmlal.s16  q10, d6, d1[1]      \\n\"\n                    \"vmlal.s16  q12, d6, d1[2]      \\n\"\n                    \"vmlal.s16  q14, d6, d1[3]      \\n\"\n                    \"vmlal.s16  q9, d7, d1[0]       \\n\"\n                    \"vmlal.s16  q11, d7, d1[1]      \\n\"\n                    \"vmlal.s16  q13, d7, d1[2]      \\n\"\n                    \"vmlal.s16  q15, d7, d1[3]      \\n\"\n                    \"vmlal.s16  q4, d4, d2[0]       \\n\"\n                    \"vld1.s16   {d6-d7}, [%1]!      \\n\"\n                    \"vmlal.s16  q6, d4, d2[1]       \\n\"\n                    \"vld1.s16   {d0-d1}, [%2]!      
\\n\"\n                    \"vmlal.s16  q8, d4, d2[2]       \\n\"\n                    \"vmlal.s16  q10, d4, d2[3]      \\n\"\n                    \"vmlal.s16  q5, d5, d2[0]       \\n\"\n                    \"vmlal.s16  q7, d5, d2[1]       \\n\"\n                    \"vmlal.s16  q9, d5, d2[2]       \\n\"\n                    \"vmlal.s16  q11, d5, d2[3]      \\n\"\n                    \"vmlal.s16  q12, d4, d3[0]      \\n\"\n                    \"vmlal.s16  q14, d4, d3[1]      \\n\"\n                    \"vmlal.s16  q13, d5, d3[0]      \\n\"\n                    \"vmlal.s16  q15, d5, d3[1]      \\n\"\n                    \"vmlal.s16  q4, d6, d3[2]       \\n\"\n                    \"vld1.s16   {d4-d5}, [%1]!      \\n\"\n                    \"vmlal.s16  q6, d6, d3[3]       \\n\"\n                    \"vmlal.s16  q5, d7, d3[2]       \\n\"\n                    \"vmlal.s16  q7, d7, d3[3]       \\n\"\n                    \"vmlal.s16  q8, d6, d0[0]       \\n\"\n                    \"pld        [%2, #384]          \\n\"\n                    \"vld1.s16   {d2-d3}, [%2]!      \\n\"\n                    \"vmlal.s16  q10, d6, d0[1]      \\n\"\n                    \"vmlal.s16  q12, d6, d0[2]      \\n\"\n                    \"vmlal.s16  q14, d6, d0[3]      \\n\"\n                    \"vmlal.s16  q9, d7, d0[0]       \\n\"\n                    \"vmlal.s16  q11, d7, d0[1]      \\n\"\n                    \"vmlal.s16  q13, d7, d0[2]      \\n\"\n                    \"vmlal.s16  q15, d7, d0[3]      \\n\"\n                    \"vmlal.s16  q4, d4, d1[0]       \\n\"\n                    \"vld1.s16   {d6-d7}, [%1]!      
\\n\"\n                    \"vmlal.s16  q6, d4, d1[1]       \\n\"\n                    \"vmlal.s16  q8, d4, d1[2]       \\n\"\n                    \"vmlal.s16  q10, d4, d1[3]      \\n\"\n                    \"vmlal.s16  q5, d5, d1[0]       \\n\"\n                    \"vmlal.s16  q7, d5, d1[1]       \\n\"\n                    \"vmlal.s16  q9, d5, d1[2]       \\n\"\n                    \"vmlal.s16  q11, d5, d1[3]      \\n\"\n                    \"vmlal.s16  q12, d4, d2[0]      \\n\"\n                    \"vld1.s16   {d0-d1}, [%2]!      \\n\"\n                    \"vmlal.s16  q14, d4, d2[1]      \\n\"\n                    \"vmlal.s16  q13, d5, d2[0]      \\n\"\n                    \"vmlal.s16  q15, d5, d2[1]      \\n\"\n                    \"vmlal.s16  q4, d6, d2[2]       \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vld1.s16   {d4-d5}, [%1]!      \\n\"\n                    \"vmlal.s16  q6, d6, d2[3]       \\n\"\n                    \"vmlal.s16  q5, d7, d2[2]       \\n\"\n                    \"vmlal.s16  q7, d7, d2[3]       \\n\"\n                    \"vmlal.s16  q8, d6, d3[0]       \\n\"\n                    \"vmlal.s16  q10, d6, d3[1]      \\n\"\n                    \"vmlal.s16  q12, d6, d3[2]      \\n\"\n                    \"vmlal.s16  q14, d6, d3[3]      \\n\"\n                    \"vmlal.s16  q9, d7, d3[0]       \\n\"\n                    \"vmlal.s16  q11, d7, d3[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q13, d7, d3[2]      \\n\"\n                    \"vmlal.s16  q15, d7, d3[3]      \\n\"\n                    \"bne        2b                  \\n\"\n                    \"sub        %1, %1, #16         \\n\"\n                    \"sub        %2, %2, #16         \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #7          \\n\" // r4 = remain = max_kk & 7\n                    \"cmp        r4, #0    
          \\n\"\n                    \"beq        5f                  \\n\"\n\n                    \"4:                             \\n\"\n                    \"vld1.s16   {d0-d1}, [%1]!      \\n\"\n                    \"vld1.s16   {d2-d3}, [%2]       \\n\"\n                    \"add        %2, %2, #12         \\n\"\n                    \"vmlal.s16  q4, d0, d2[0]       \\n\"\n                    \"vmlal.s16  q6, d0, d2[1]       \\n\"\n                    \"vmlal.s16  q8, d0, d2[2]       \\n\"\n                    \"vmlal.s16  q10, d0, d2[3]      \\n\"\n                    \"vmlal.s16  q5, d1, d2[0]       \\n\"\n                    \"vmlal.s16  q7, d1, d2[1]       \\n\"\n                    \"vmlal.s16  q9, d1, d2[2]       \\n\"\n                    \"vmlal.s16  q11, d1, d2[3]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q12, d0, d3[0]      \\n\"\n                    \"vmlal.s16  q14, d0, d3[1]      \\n\"\n                    \"vmlal.s16  q13, d1, d3[0]      \\n\"\n                    \"vmlal.s16  q15, d1, d3[1]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vstm       %0!, {d8-d15}       \\n\"\n                    \"vstm       %0!, {d16-d31}      \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t 
_sum2;\n                int32x4_t _sum3;\n                int32x4_t _sum4;\n                int32x4_t _sum5;\n                int32x4_t _sum6;\n                int32x4_t _sum7;\n                int32x4_t _sum8;\n                int32x4_t _sum9;\n                int32x4_t _suma;\n                int32x4_t _sumb;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                    _sum6 = vdupq_n_s32(0);\n                    _sum7 = vdupq_n_s32(0);\n                    _sum8 = vdupq_n_s32(0);\n                    _sum9 = vdupq_n_s32(0);\n                    _suma = vdupq_n_s32(0);\n                    _sumb = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                    _sum8 = vld1q_s32(outptr + 32);\n                    _sum9 = vld1q_s32(outptr + 36);\n                    _suma = vld1q_s32(outptr + 40);\n                    _sumb = vld1q_s32(outptr + 44);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x8_t _pA = vld1q_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    _sum0 = vmlal_lane_s16(_sum0, vget_low_s16(_pA), vget_low_s16(_pB), 0);\n                    _sum1 = vmlal_lane_s16(_sum1, vget_high_s16(_pA), vget_low_s16(_pB), 0);\n    
                _sum2 = vmlal_lane_s16(_sum2, vget_low_s16(_pA), vget_low_s16(_pB), 1);\n                    _sum3 = vmlal_lane_s16(_sum3, vget_high_s16(_pA), vget_low_s16(_pB), 1);\n                    _sum4 = vmlal_lane_s16(_sum4, vget_low_s16(_pA), vget_low_s16(_pB), 2);\n                    _sum5 = vmlal_lane_s16(_sum5, vget_high_s16(_pA), vget_low_s16(_pB), 2);\n                    _sum6 = vmlal_lane_s16(_sum6, vget_low_s16(_pA), vget_low_s16(_pB), 3);\n                    _sum7 = vmlal_lane_s16(_sum7, vget_high_s16(_pA), vget_low_s16(_pB), 3);\n                    _sum8 = vmlal_lane_s16(_sum8, vget_low_s16(_pA), vget_high_s16(_pB), 0);\n                    _sum9 = vmlal_lane_s16(_sum9, vget_high_s16(_pA), vget_high_s16(_pB), 0);\n                    _suma = vmlal_lane_s16(_suma, vget_low_s16(_pA), vget_high_s16(_pB), 1);\n                    _sumb = vmlal_lane_s16(_sumb, vget_high_s16(_pA), vget_high_s16(_pB), 1);\n                    pA += 8;\n                    pB += 6;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n                vst1q_s32(outptr + 32, _sum8);\n                vst1q_s32(outptr + 36, _sum9);\n                vst1q_s32(outptr + 40, _suma);\n                vst1q_s32(outptr + 44, _sumb);\n                outptr += 48;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"cmp    %w7, 
#0                     \\n\"\n                    \"beq    0f                          \\n\"\n\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"sub    %0, %0, #64                 \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                    \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                    \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \".align 4                           \\n\"\n                    \"2:                                 \\n\"\n                    \"smlal  v24.4s, v4.4h, v0.h[0]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v0.h[1]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v25.4s, v4.8h, v0.h[0]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v0.h[1]      \\n\"\n                    \"ld1    {v2.8h, v3.8h}, [%2], #32   \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[2]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[3]   
   \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[2]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[3]      \\n\"\n                    \"smlal  v24.4s, v5.4h, v0.h[4]      \\n\"\n                    \"smlal  v26.4s, v5.4h, v0.h[5]      \\n\"\n                    \"smlal2 v25.4s, v5.8h, v0.h[4]      \\n\"\n                    \"smlal2 v27.4s, v5.8h, v0.h[5]      \\n\"\n                    \"smlal  v28.4s, v5.4h, v0.h[6]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v0.h[7]      \\n\"\n                    \"smlal2 v29.4s, v5.8h, v0.h[6]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v0.h[7]      \\n\"\n                    \"smlal  v24.4s, v6.4h, v1.h[0]      \\n\"\n                    \"smlal  v26.4s, v6.4h, v1.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v25.4s, v6.8h, v1.h[0]      \\n\"\n                    \"smlal2 v27.4s, v6.8h, v1.h[1]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v1.h[2]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v1.h[3]      \\n\"\n                    \"smlal2 v29.4s, v6.8h, v1.h[2]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v1.h[3]      \\n\"\n                    \"smlal  v24.4s, v7.4h, v1.h[4]      \\n\"\n                    \"smlal  v26.4s, v7.4h, v1.h[5]      \\n\"\n                    \"smlal2 v25.4s, v7.8h, v1.h[4]      \\n\"\n                    \"smlal2 v27.4s, v7.8h, v1.h[5]      \\n\"\n                    \"smlal  v28.4s, v7.4h, v1.h[6]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v1.h[7]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v1.h[6]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v1.h[7]      \\n\"\n                    \"smlal  v24.4s, v4.4h, v2.h[0]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v2.h[1]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n      
              \"smlal2 v25.4s, v4.8h, v2.h[0]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v2.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                    \"smlal  v28.4s, v4.4h, v2.h[2]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v2.h[3]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, v2.h[2]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v2.h[3]      \\n\"\n                    \"smlal  v24.4s, v5.4h, v2.h[4]      \\n\"\n                    \"smlal  v26.4s, v5.4h, v2.h[5]      \\n\"\n                    \"smlal2 v25.4s, v5.8h, v2.h[4]      \\n\"\n                    \"smlal2 v27.4s, v5.8h, v2.h[5]      \\n\"\n                    \"smlal  v28.4s, v5.4h, v2.h[6]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v2.h[7]      \\n\"\n                    \"smlal2 v29.4s, v5.8h, v2.h[6]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v2.h[7]      \\n\"\n                    \"smlal  v24.4s, v6.4h, v3.h[0]      \\n\"\n                    \"smlal  v26.4s, v6.4h, v3.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v25.4s, v6.8h, v3.h[0]      \\n\"\n                    \"smlal2 v27.4s, v6.8h, v3.h[1]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v3.h[2]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v3.h[3]      \\n\"\n                    \"smlal2 v29.4s, v6.8h, v3.h[2]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v3.h[3]      \\n\"\n                    \"smlal  v24.4s, v7.4h, v3.h[4]      \\n\"\n                    \"smlal  v26.4s, v7.4h, v3.h[5]      \\n\"\n                    \"smlal2 v25.4s, v7.8h, v3.h[4]      \\n\"\n                    \"smlal2 v27.4s, v7.8h, v3.h[5]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    
\"smlal  v28.4s, v7.4h, v3.h[6]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v3.h[7]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v3.h[6]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v3.h[7]      \\n\"\n                    \"bne    2b                          \\n\"\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #32                 \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #7                 \\n\" // w4 = remain = max_kk & 7\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"ld1    {v0.4h}, [%2], #8           \\n\"\n                    \"smlal  v24.4s, v4.4h, v0.h[0]      \\n\"\n                    \"smlal  v26.4s, v4.4h, v0.h[1]      \\n\"\n                    \"smlal2 v25.4s, v4.8h, v0.h[0]      \\n\"\n                    \"smlal2 v27.4s, v4.8h, v0.h[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[2]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[3]      \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[2]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[3]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    
\"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #512]          \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                    \"vldm       %0, {d16-d31}       \\n\"\n                    \"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q8, q8              \\n\"\n                    \"veor       q9, q9              \\n\"\n                    \"veor       q10, q10            \\n\"\n                    \"veor       q11, q11            \\n\"\n                    \"veor       q12, q12            \\n\"\n                    \"veor       q13, q13            \\n\"\n                    \"veor       q14, q14            \\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n                    \"vld1.s16   {d4-d5}, [%1]!      \\n\"\n                    \"vld1.s16   {d0-d1}, [%2]!      \\n\"\n                    \".align 4                       \\n\"\n                    \"2:                             \\n\"\n                    \"vmlal.s16  q8, d4, d0[0]       \\n\"\n                    \"vld1.s16   {d6-d7}, [%1]!     
 \\n\"\n                    \"vmlal.s16  q10, d4, d0[1]      \\n\"\n                    \"vmlal.s16  q12, d4, d0[2]      \\n\"\n                    \"vmlal.s16  q14, d4, d0[3]      \\n\"\n                    \"vmlal.s16  q9, d5, d0[0]       \\n\"\n                    \"vld1.s16   {d8-d9}, [%1]!      \\n\"\n                    \"vmlal.s16  q11, d5, d0[1]      \\n\"\n                    \"vld1.s16   {d2-d3}, [%2]!      \\n\"\n                    \"vmlal.s16  q13, d5, d0[2]      \\n\"\n                    \"vmlal.s16  q15, d5, d0[3]      \\n\"\n                    \"vmlal.s16  q8, d6, d1[0]       \\n\"\n                    \"vmlal.s16  q10, d6, d1[1]      \\n\"\n                    \"vmlal.s16  q12, d6, d1[2]      \\n\"\n                    \"vmlal.s16  q14, d6, d1[3]      \\n\"\n                    \"vmlal.s16  q9, d7, d1[0]       \\n\"\n                    \"vld1.s16   {d10-d11}, [%1]!    \\n\"\n                    \"vmlal.s16  q11, d7, d1[1]      \\n\"\n                    \"vmlal.s16  q13, d7, d1[2]      \\n\"\n                    \"vmlal.s16  q15, d7, d1[3]      \\n\"\n                    \"vmlal.s16  q8, d8, d2[0]       \\n\"\n                    \"vmlal.s16  q10, d8, d2[1]      \\n\"\n                    \"vmlal.s16  q12, d8, d2[2]      \\n\"\n                    \"vmlal.s16  q14, d8, d2[3]      \\n\"\n                    \"vmlal.s16  q9, d9, d2[0]       \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vld1.s16   {d4-d5}, [%1]!      \\n\"\n                    \"vmlal.s16  q11, d9, d2[1]      \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.s16   {d0-d1}, [%2]!      
\\n\"\n                    \"vmlal.s16  q13, d9, d2[2]      \\n\"\n                    \"vmlal.s16  q15, d9, d2[3]      \\n\"\n                    \"vmlal.s16  q8, d10, d3[0]      \\n\"\n                    \"vmlal.s16  q10, d10, d3[1]     \\n\"\n                    \"vmlal.s16  q12, d10, d3[2]     \\n\"\n                    \"vmlal.s16  q14, d10, d3[3]     \\n\"\n                    \"vmlal.s16  q9, d11, d3[0]      \\n\"\n                    \"vmlal.s16  q11, d11, d3[1]     \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q13, d11, d3[2]     \\n\"\n                    \"vmlal.s16  q15, d11, d3[3]     \\n\"\n                    \"bne        2b                  \\n\"\n                    \"sub        %1, %1, #16         \\n\"\n                    \"sub        %2, %2, #16         \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #3          \\n\" // r4 = remain = max_kk & 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        5f                  \\n\"\n\n                    \"4:                             \\n\"\n                    \"vld1.s16   {d0-d1}, [%1]!      \\n\"\n                    \"vld1.s16   {d2}, [%2]!         
\\n\"\n                    \"vmlal.s16  q8, d0, d2[0]       \\n\"\n                    \"vmlal.s16  q10, d0, d2[1]      \\n\"\n                    \"vmlal.s16  q12, d0, d2[2]      \\n\"\n                    \"vmlal.s16  q14, d0, d2[3]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q9, d1, d2[0]       \\n\"\n                    \"vmlal.s16  q11, d1, d2[1]      \\n\"\n                    \"vmlal.s16  q13, d1, d2[2]      \\n\"\n                    \"vmlal.s16  q15, d1, d2[3]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vstm       %0!, {d16-d31}      \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n                int32x4_t _sum4;\n                int32x4_t _sum5;\n                int32x4_t _sum6;\n                int32x4_t _sum7;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                    _sum6 = vdupq_n_s32(0);\n                    _sum7 = vdupq_n_s32(0);\n                }\n           
     else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x8_t _pA = vld1q_s16(pA);\n                    int16x4_t _pB = vld1_s16(pB);\n                    _sum0 = vmlal_lane_s16(_sum0, vget_low_s16(_pA), _pB, 0);\n                    _sum1 = vmlal_lane_s16(_sum1, vget_high_s16(_pA), _pB, 0);\n                    _sum2 = vmlal_lane_s16(_sum2, vget_low_s16(_pA), _pB, 1);\n                    _sum3 = vmlal_lane_s16(_sum3, vget_high_s16(_pA), _pB, 1);\n                    _sum4 = vmlal_lane_s16(_sum4, vget_low_s16(_pA), _pB, 2);\n                    _sum5 = vmlal_lane_s16(_sum5, vget_high_s16(_pA), _pB, 2);\n                    _sum6 = vmlal_lane_s16(_sum6, vget_low_s16(_pA), _pB, 3);\n                    _sum7 = vmlal_lane_s16(_sum7, vget_high_s16(_pA), _pB, 3);\n                    pA += 8;\n                    pB += 4;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n                outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm 
volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"cmp    %w7, #0                     \\n\"\n                    \"beq    0f                          \\n\"\n\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0] \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"ld1    {v0.8h}, [%2], #16          \\n\"\n                    \".align 4                           \\n\"\n                    \"2:                                 \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[0]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[1]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[0]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[1]      \\n\"\n                    \"ld1    {v1.8h}, [%2], #16          \\n\"\n                    \"smlal  v28.4s, v5.4h, v0.h[2]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v0.h[3]      \\n\"\n                    \"smlal2 v29.4s, v5.8h, v0.h[2]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v0.h[3]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v0.h[4]      \\n\"\n                    \"smlal  
v30.4s, v6.4h, v0.h[5]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v6.8h, v0.h[4]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v0.h[5]      \\n\"\n                    \"smlal  v28.4s, v7.4h, v0.h[6]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v0.h[7]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v0.h[6]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v0.h[7]      \\n\"\n                    \"smlal  v28.4s, v4.4h, v1.h[0]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v1.h[1]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v4.8h, v1.h[0]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v1.h[1]      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v0.8h}, [%2], #16          \\n\"\n                    \"smlal  v28.4s, v5.4h, v1.h[2]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v1.h[3]      \\n\"\n                    \"smlal2 v29.4s, v5.8h, v1.h[2]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v1.h[3]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v1.h[4]      \\n\"\n                    \"smlal  v30.4s, v6.4h, v1.h[5]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v6.8h, v1.h[4]      \\n\"\n                    \"smlal2 v31.4s, v6.8h, v1.h[5]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v28.4s, v7.4h, v1.h[6]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v1.h[7]      \\n\"\n                    \"smlal2 v29.4s, v7.8h, v1.h[6]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v1.h[7]      \\n\"\n                    \"bne    2b              
            \\n\"\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #16                 \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #7                 \\n\" // w4 = remain = max_kk & 7\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"ld1    {v0.4h}, [%2]               \\n\"\n                    \"add    %2, %2, #4                  \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[0]      \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[0]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[1]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #512]          \\n\"\n                
    \"pld        [%2, #128]          \\n\"\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                    \"vldm       %0, {d24-d31}       \\n\"\n                    \"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q12, q12            \\n\"\n                    \"veor       q13, q13            \\n\"\n                    \"veor       q14, q14            \\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n                    \"vld1.s16   {d2-d5}, [%1]!      \\n\"\n                    \"vld1.s16   {d0}, [%2]!         \\n\"\n                    \".align 4                       \\n\"\n                    \"2:                             \\n\"\n                    \"vmlal.s16  q12, d2, d0[0]      \\n\"\n                    \"vld1.s16   {d6-d9}, [%1]!      \\n\"\n                    \"vmlal.s16  q14, d2, d0[1]      \\n\"\n                    \"vld1.s16   {d1}, [%2]!         \\n\"\n                    \"vmlal.s16  q13, d3, d0[0]      \\n\"\n                    \"vmlal.s16  q15, d3, d0[1]      \\n\"\n                    \"vmlal.s16  q12, d4, d0[2]      \\n\"\n                    \"vmlal.s16  q14, d4, d0[3]      \\n\"\n                    \"vmlal.s16  q13, d5, d0[2]      \\n\"\n                    \"vmlal.s16  q15, d5, d0[3]      \\n\"\n                    \"vmlal.s16  q12, d6, d1[0]      \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vld1.s16   {d2-d5}, [%1]!      
\\n\"\n                    \"vmlal.s16  q14, d6, d1[1]      \\n\"\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld1.s16   {d0}, [%2]!         \\n\"\n                    \"vmlal.s16  q13, d7, d1[0]      \\n\"\n                    \"vmlal.s16  q15, d7, d1[1]      \\n\"\n                    \"vmlal.s16  q12, d8, d1[2]      \\n\"\n                    \"vmlal.s16  q14, d8, d1[3]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q13, d9, d1[2]      \\n\"\n                    \"vmlal.s16  q15, d9, d1[3]      \\n\"\n                    \"bne        2b                  \\n\"\n                    \"sub        %1, %1, #32         \\n\"\n                    \"sub        %2, %2, #8          \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #3          \\n\" // r4 = remain = max_kk & 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        5f                  \\n\"\n\n                    \"4:                             \\n\"\n                    \"vld1.s16   {d0-d1}, [%1]!      
\\n\"\n                    \"vld1.s16   {d2}, [%2]          \\n\"\n                    \"add        %2, %2, #4          \\n\"\n                    \"vmlal.s16  q12, d0, d2[0]      \\n\"\n                    \"vmlal.s16  q14, d0, d2[1]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q13, d1, d2[0]      \\n\"\n                    \"vmlal.s16  q15, d1, d2[1]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vstm       %0!, {d24-d31}      \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x8_t _pA = vld1q_s16(pA);\n                    int16x4_t _pB0 = 
vdup_n_s16(pB[0]);\n                    int16x4_t _pB1 = vdup_n_s16(pB[1]);\n                    _sum0 = vmlal_s16(_sum0, vget_low_s16(_pA), _pB0);\n                    _sum1 = vmlal_s16(_sum1, vget_high_s16(_pA), _pB0);\n                    _sum2 = vmlal_s16(_sum2, vget_low_s16(_pA), _pB1);\n                    _sum3 = vmlal_s16(_sum3, vget_high_s16(_pA), _pB1);\n                    pA += 8;\n                    pB += 2;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"cmp    %w7, #0                     \\n\"\n                    \"beq    0f                          \\n\"\n\n                    \"ld1    {v30.4s, v31.4s}, [%0]      \\n\"\n                    \"b      1f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                    \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                    \"1:                                 \\n\"\n                    \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    3f                          \\n\"\n\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"ld1    {v1.8h}, [%2], #16          \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                    \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n     
               \".align 4                           \\n\"\n                    \"2:                                 \\n\"\n                    \"mov    v0.16b, v1.16b              \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[0]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[0]      \\n\"\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v1.8h}, [%2], #16          \\n\"\n                    \"smlal  v30.4s, v5.4h, v0.h[1]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v0.h[1]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v0.h[2]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v6.8h, v0.h[2]      \\n\"\n                    \"smlal  v30.4s, v7.4h, v0.h[3]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v0.h[3]      \\n\"\n                    \"smlal  v28.4s, v4.4h, v0.h[4]      \\n\"\n                    \"ld1    {v6.8h, v7.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v4.8h, v0.h[4]      \\n\"\n                    \"smlal  v30.4s, v5.4h, v0.h[5]      \\n\"\n                    \"smlal2 v31.4s, v5.8h, v0.h[5]      \\n\"\n                    \"smlal  v28.4s, v6.4h, v0.h[6]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h}, [%1], #32   \\n\"\n                    \"smlal2 v29.4s, v6.8h, v0.h[6]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v30.4s, v7.4h, v0.h[7]      \\n\"\n                    \"smlal2 v31.4s, v7.8h, v0.h[7]      \\n\"\n                    \"bne    2b                          \\n\"\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #16                 \\n\"\n                    
\"add    v30.4s, v30.4s, v28.4s      \\n\"\n                    \"add    v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"3:                                 \\n\"\n                    \"and    w4, %w6, #7                 \\n\" // w4 = remain = max_kk & 7\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"ld1r   {v0.4h}, [%2], #2           \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"smlal  v30.4s, v4.4h, v0.h[0]      \\n\"\n                    \"smlal2 v31.4s, v4.8h, v0.h[0]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"st1    {v30.4s, v31.4s}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #512]          \\n\"\n                    \"pld        [%2, #64]           \\n\"\n                    \"cmp        %7, #0              \\n\"\n                    \"beq        0f                  \\n\"\n\n                    \"vld1.s32   {d28-d31}, [%0]     \\n\"\n                    
\"b          1f                  \\n\"\n\n                    \"0:                             \\n\"\n                    \"veor       q14, q14            \\n\"\n                    \"veor       q15, q15            \\n\"\n\n                    \"1:                             \\n\"\n                    \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        3f                  \\n\"\n\n                    \"vld1.s16   {d2-d5}, [%1]!      \\n\"\n                    \".align 4                       \\n\"\n                    \"2:                             \\n\"\n                    \"pld        [%2, #64]           \\n\"\n                    \"vld1.s16   {d0}, [%2]!         \\n\"\n                    \"vmlal.s16  q14, d2, d0[0]      \\n\"\n                    \"vld1.s16   {d6-d9}, [%1]!      \\n\"\n                    \"vmlal.s16  q15, d3, d0[0]      \\n\"\n                    \"vmlal.s16  q14, d4, d0[1]      \\n\"\n                    \"vmlal.s16  q15, d5, d0[1]      \\n\"\n                    \"vmlal.s16  q14, d6, d0[2]      \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vld1.s16   {d2-d5}, [%1]!      
\\n\"\n                    \"vmlal.s16  q15, d7, d0[2]      \\n\"\n                    \"vmlal.s16  q14, d8, d0[3]      \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q15, d9, d0[3]      \\n\"\n                    \"bne        2b                  \\n\"\n                    \"sub        %1, %1, #32         \\n\"\n\n                    \"3:                             \\n\"\n                    \"and        r4, %6, #3          \\n\" // r4 = remain = max_kk & 3\n                    \"cmp        r4, #0              \\n\"\n                    \"beq        5f                  \\n\"\n\n                    \"4:                             \\n\"\n                    \"vld1.s16   {d0-d1}, [%1]!      \\n\"\n                    \"vld1.s16   {d2[]}, [%2]!       \\n\"\n                    \"subs       r4, r4, #1          \\n\"\n                    \"vmlal.s16  q14, d0, d2[0]      \\n\"\n                    \"vmlal.s16  q15, d1, d2[0]      \\n\"\n                    \"bne        4b                  \\n\"\n\n                    \"5:                             \\n\"\n                    \"vst1.s32   {d28-d31}, [%0]!    
\\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"r\"(max_kk), // %6\n                    \"r\"(k)       // %7\n                    : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x8_t _pA = vld1q_s16(pA);\n                    int16x4_t _pB = vld1_dup_s16(pB);\n                    _sum0 = vmlal_s16(_sum0, vget_low_s16(_pA), _pB);\n                    _sum1 = vmlal_s16(_sum1, vget_high_s16(_pA), _pB);\n                    pA += 8;\n                    pB += 1;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const short* pAT = AT_tile.row<const short>(b) + max_kk * ii;\n            const short* pB = BT_tile.row<const short>(b);\n\n            int jj = 0;\n#if __aarch64__\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                
int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n                int32x4_t _sum4;\n                int32x4_t _sum5;\n                int32x4_t _sum6;\n                int32x4_t _sum7;\n                int32x4_t _sum8;\n                int32x4_t _sum9;\n                int32x4_t _suma;\n                int32x4_t _sumb;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                    _sum6 = vdupq_n_s32(0);\n                    _sum7 = vdupq_n_s32(0);\n                    _sum8 = vdupq_n_s32(0);\n                    _sum9 = vdupq_n_s32(0);\n                    _suma = vdupq_n_s32(0);\n                    _sumb = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                    _sum8 = vld1q_s32(outptr + 32);\n                    _sum9 = vld1q_s32(outptr + 36);\n                    _suma = vld1q_s32(outptr + 40);\n                    _sumb = vld1q_s32(outptr + 44);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    int16x4_t _pB2 = vld1_s16(pB + 8);\n                    _sum0 = vmlal_laneq_s16(_sum0, _pA, _pB, 0);\n                    _sum1 
= vmlal_laneq_s16(_sum1, _pA, _pB, 1);\n                    _sum2 = vmlal_laneq_s16(_sum2, _pA, _pB, 2);\n                    _sum3 = vmlal_laneq_s16(_sum3, _pA, _pB, 3);\n                    _sum4 = vmlal_laneq_s16(_sum4, _pA, _pB, 4);\n                    _sum5 = vmlal_laneq_s16(_sum5, _pA, _pB, 5);\n                    _sum6 = vmlal_laneq_s16(_sum6, _pA, _pB, 6);\n                    _sum7 = vmlal_laneq_s16(_sum7, _pA, _pB, 7);\n                    _sum8 = vmlal_lane_s16(_sum8, _pA, _pB2, 0);\n                    _sum9 = vmlal_lane_s16(_sum9, _pA, _pB2, 1);\n                    _suma = vmlal_lane_s16(_suma, _pA, _pB2, 2);\n                    _sumb = vmlal_lane_s16(_sumb, _pA, _pB2, 3);\n                    pA += 4;\n                    pB += 12;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n                vst1q_s32(outptr + 32, _sum8);\n                vst1q_s32(outptr + 36, _sum9);\n                vst1q_s32(outptr + 40, _suma);\n                vst1q_s32(outptr + 44, _sumb);\n                outptr += 48;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n                int32x4_t _sum4;\n                int32x4_t _sum5;\n                int32x4_t _sum6;\n                int32x4_t _sum7;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = 
vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                    _sum6 = vdupq_n_s32(0);\n                    _sum7 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    _sum0 = vmlal_laneq_s16(_sum0, _pA, _pB, 0);\n                    _sum1 = vmlal_laneq_s16(_sum1, _pA, _pB, 1);\n                    _sum2 = vmlal_laneq_s16(_sum2, _pA, _pB, 2);\n                    _sum3 = vmlal_laneq_s16(_sum3, _pA, _pB, 3);\n                    _sum4 = vmlal_laneq_s16(_sum4, _pA, _pB, 4);\n                    _sum5 = vmlal_laneq_s16(_sum5, _pA, _pB, 5);\n                    _sum6 = vmlal_laneq_s16(_sum6, _pA, _pB, 6);\n                    _sum7 = vmlal_laneq_s16(_sum7, _pA, _pB, 7);\n                    pA += 4;\n                    pB += 8;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n                outptr += 32;\n            }\n#endif // __aarch64__\n            for (; jj + 5 < max_jj; jj += 6)\n          
  {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n                int32x4_t _sum4;\n                int32x4_t _sum5;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    _sum0 = vmlal_lane_s16(_sum0, _pA, vget_low_s16(_pB), 0);\n                    _sum1 = vmlal_lane_s16(_sum1, _pA, vget_low_s16(_pB), 1);\n                    _sum2 = vmlal_lane_s16(_sum2, _pA, vget_low_s16(_pB), 2);\n                    _sum3 = vmlal_lane_s16(_sum3, _pA, vget_low_s16(_pB), 3);\n                    _sum4 = vmlal_lane_s16(_sum4, _pA, vget_high_s16(_pB), 0);\n                    _sum5 = vmlal_lane_s16(_sum5, _pA, vget_high_s16(_pB), 1);\n                    pA += 4;\n                    pB += 6;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n     
           outptr += 24;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_s16(pA);\n                    int16x4_t _pB = vld1_s16(pB);\n                    _sum0 = vmlal_lane_s16(_sum0, _pA, _pB, 0);\n                    _sum1 = vmlal_lane_s16(_sum1, _pA, _pB, 1);\n                    _sum2 = vmlal_lane_s16(_sum2, _pA, _pB, 2);\n                    _sum3 = vmlal_lane_s16(_sum3, _pA, _pB, 3);\n                    pA += 4;\n                    pB += 4;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                outptr += 16;\n            }\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    
_sum1 = vld1q_s32(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_s16(pA);\n                    int16x4_t _pB0 = vdup_n_s16(pB[0]);\n                    int16x4_t _pB1 = vdup_n_s16(pB[1]);\n                    _sum0 = vmlal_s16(_sum0, _pA, _pB0);\n                    _sum1 = vmlal_s16(_sum1, _pA, _pB1);\n                    pA += 4;\n                    pB += 2;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                outptr += 8;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_s16(pA);\n                    int16x4_t _pB = vld1_dup_s16(pB);\n                    _sum0 = vmlal_s16(_sum0, _pA, _pB);\n                    pA += 4;\n                    pB += 1;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                outptr += 4;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const short* pAT = AT_tile.row<const short>(b) + max_kk * ii;\n            const short* pB = BT_tile.row<const short>(b);\n\n            int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n         
       int32x4_t _sum4;\n                int32x4_t _sum5;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                    _sum4 = vdupq_n_s32(0);\n                    _sum5 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    int32x4x2_t _s01 = vld2q_s32(outptr);\n                    int32x4x2_t _s23 = vld2q_s32(outptr + 8);\n                    int32x4x2_t _s45 = vld2q_s32(outptr + 16);\n                    _sum0 = _s01.val[0];\n                    _sum3 = _s01.val[1];\n                    _sum1 = _s23.val[0];\n                    _sum4 = _s23.val[1];\n                    _sum2 = _s45.val[0];\n                    _sum5 = _s45.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA0 = vdup_n_s16(pA[0]);\n                    int16x4_t _pA1 = vdup_n_s16(pA[1]);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    int16x4_t _pB2 = vld1_s16(pB + 8);\n                    _sum0 = vmlal_s16(_sum0, _pA0, vget_low_s16(_pB));\n                    _sum1 = vmlal_s16(_sum1, _pA0, vget_high_s16(_pB));\n                    _sum2 = vmlal_s16(_sum2, _pA0, _pB2);\n                    _sum3 = vmlal_s16(_sum3, _pA1, vget_low_s16(_pB));\n                    _sum4 = vmlal_s16(_sum4, _pA1, vget_high_s16(_pB));\n                    _sum5 = vmlal_s16(_sum5, _pA1, _pB2);\n                    pA += 2;\n                    pB += 12;\n                }\n\n                int32x4x2_t _s01;\n                _s01.val[0] = _sum0;\n                _s01.val[1] = _sum3;\n                int32x4x2_t _s23;\n                _s23.val[0] = _sum1;\n                _s23.val[1] = _sum4;\n                int32x4x2_t _s45;\n                _s45.val[0] = _sum2;\n    
            _s45.val[1] = _sum5;\n                vst2q_s32(outptr, _s01);\n                vst2q_s32(outptr + 8, _s23);\n                vst2q_s32(outptr + 16, _s45);\n                outptr += 24;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n                int32x4_t _sum3;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                    _sum3 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    int32x4x2_t _s01 = vld2q_s32(outptr);\n                    int32x4x2_t _s23 = vld2q_s32(outptr + 8);\n                    _sum0 = _s01.val[0];\n                    _sum2 = _s01.val[1];\n                    _sum1 = _s23.val[0];\n                    _sum3 = _s23.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA0 = vdup_n_s16(pA[0]);\n                    int16x4_t _pA1 = vdup_n_s16(pA[1]);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    _sum0 = vmlal_s16(_sum0, _pA0, vget_low_s16(_pB));\n                    _sum1 = vmlal_s16(_sum1, _pA0, vget_high_s16(_pB));\n                    _sum2 = vmlal_s16(_sum2, _pA1, vget_low_s16(_pB));\n                    _sum3 = vmlal_s16(_sum3, _pA1, vget_high_s16(_pB));\n                    pA += 2;\n                    pB += 8;\n                }\n\n                int32x4x2_t _s01;\n                _s01.val[0] = _sum0;\n                _s01.val[1] = _sum2;\n                int32x4x2_t _s23;\n                _s23.val[0] = _sum1;\n                _s23.val[1] = _sum3;\n                vst2q_s32(outptr, _s01);\n                vst2q_s32(outptr + 8, _s23);\n       
         outptr += 16;\n            }\n#endif // __aarch64__\n            for (; jj + 5 < max_jj; jj += 6)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    int32x4x2_t _s01 = vld2q_s32(outptr);\n                    _sum0 = _s01.val[0];\n                    _sum1 = _s01.val[1];\n                    _sum2 = vld1q_s32(outptr + 8);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vreinterpret_s16_s32(vld1_dup_s32((const int*)pA));\n                    int16x8_t _pB = vld1q_s16(pB);\n                    int16x4_t _pB2 = vzip_s16(vget_high_s16(_pB), vget_high_s16(_pB)).val[0];\n                    _sum0 = vmlal_lane_s16(_sum0, vget_low_s16(_pB), _pA, 0);\n                    _sum1 = vmlal_lane_s16(_sum1, vget_low_s16(_pB), _pA, 1);\n                    _sum2 = vmlal_s16(_sum2, _pA, _pB2);\n                    pA += 2;\n                    pB += 6;\n                }\n\n                int32x4x2_t _s01;\n                _s01.val[0] = _sum0;\n                _s01.val[1] = _sum1;\n                vst2q_s32(outptr, _s01);\n                vst1q_s32(outptr + 8, _sum2);\n                outptr += 12;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    
int32x4x2_t _s01 = vld2q_s32(outptr);\n                    _sum0 = _s01.val[0];\n                    _sum1 = _s01.val[1];\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA0 = vdup_n_s16(pA[0]);\n                    int16x4_t _pA1 = vdup_n_s16(pA[1]);\n                    int16x4_t _pB = vld1_s16(pB);\n                    _sum0 = vmlal_s16(_sum0, _pA0, _pB);\n                    _sum1 = vmlal_s16(_sum1, _pA1, _pB);\n                    pA += 2;\n                    pB += 4;\n                }\n\n                int32x4x2_t _s01;\n                _s01.val[0] = _sum0;\n                _s01.val[1] = _sum1;\n                vst2q_s32(outptr, _s01);\n                outptr += 8;\n            }\n#endif // __ARM_NEON\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const short* pA = pAT;\n\n                int sum00 = 0;\n                int sum01 = 0;\n                int sum10 = 0;\n                int sum11 = 0;\n\n                if (k == 0)\n                {\n                    sum00 = 0;\n                    sum01 = 0;\n                    sum10 = 0;\n                    sum11 = 0;\n                }\n                else\n                {\n                    sum00 = outptr[0];\n                    sum01 = outptr[1];\n                    sum10 = outptr[2];\n                    sum11 = outptr[3];\n                }\n\n                int kk = 0;\n#if !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    // fomit-frame-pointer implied in optimized flag spare one register\n                    // let us stay away from error: ‘asm’ operand has impossible constraints   --- nihui\n#if __OPTIMIZE__\n                    asm volatile(\n                        \"ldr    r2, [%0], #4    \\n\" // int16x2_t _pA0 = *((int16x2_t*)pA); pA += 2;\n                   
     \"ldr    r3, [%0], #4    \\n\" // int16x2_t _pA1 = *((int16x2_t*)pA); pA += 2;\n                        \"ldr    r4, [%1], #4    \\n\" // int16x2_t _pB0 = *((int16x2_t*)pB); pB += 2;\n                        \"ldr    r5, [%1], #4    \\n\" // int16x2_t _pB1 = *((int16x2_t*)pB); pB += 2;\n                        \"smlad  %2, r2, r4, %2  \\n\" // sum00 = __smlad(_pA0, _pB0, sum00);\n                        \"smlad  %3, r3, r4, %3  \\n\" // sum01 = __smlad(_pA1, _pB0, sum01);\n                        \"smlad  %4, r2, r5, %4  \\n\" // sum10 = __smlad(_pA0, _pB1, sum10);\n                        \"smlad  %5, r3, r5, %5  \\n\" // sum11 = __smlad(_pA1, _pB1, sum11);\n                        : \"=r\"(pA),\n                        \"=r\"(pB),\n                        \"=r\"(sum00),\n                        \"=r\"(sum01),\n                        \"=r\"(sum10),\n                        \"=r\"(sum11)\n                        : \"0\"(pA),\n                        \"1\"(pB),\n                        \"2\"(sum00),\n                        \"3\"(sum01),\n                        \"4\"(sum10),\n                        \"5\"(sum11)\n                        : \"memory\", \"r2\", \"r3\", \"r4\", \"r5\");\n#else\n                    int _pA0 = *((int*)pA);\n                    int _pA1 = *((int*)(pA + 2));\n                    int _pB0 = *((int*)pB);\n                    int _pB1 = *((int*)(pB + 2));\n                    asm volatile(\"smlad %0, %2, %3, %0\"\n                                 : \"=r\"(sum00)\n                                 : \"0\"(sum00), \"r\"(_pA0), \"r\"(_pB0)\n                                 :);\n                    asm volatile(\"smlad %0, %2, %3, %0\"\n                                 : \"=r\"(sum01)\n                                 : \"0\"(sum01), \"r\"(_pA1), \"r\"(_pB0)\n                                 :);\n                    asm volatile(\"smlad %0, %2, %3, %0\"\n                                 : \"=r\"(sum10)\n                                 : 
\"0\"(sum10), \"r\"(_pA0), \"r\"(_pB1)\n                                 :);\n                    asm volatile(\"smlad %0, %2, %3, %0\"\n                                 : \"=r\"(sum11)\n                                 : \"0\"(sum11), \"r\"(_pA1), \"r\"(_pB1)\n                                 :);\n                    pA += 4;\n                    pB += 4;\n#endif\n                }\n#endif // !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                for (; kk < max_kk; kk++)\n                {\n                    sum00 += pA[0] * pB[0];\n                    sum01 += pA[1] * pB[0];\n                    sum10 += pA[0] * pB[1];\n                    sum11 += pA[1] * pB[1];\n                    pA += 2;\n                    pB += 2;\n                }\n\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n                outptr += 2 * 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const short* pA = pAT;\n\n                int sum0 = 0;\n                int sum1 = 0;\n\n                if (k == 0)\n                {\n                    sum0 = 0;\n                    sum1 = 0;\n                }\n                else\n                {\n                    sum0 = outptr[0];\n                    sum1 = outptr[1];\n                }\n\n                int kk = 0;\n#if !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    asm volatile(\n                        \"ldr    r2, [%0], #4    \\n\" // int16x2_t _pA0 = *((int16x2_t*)pA); pA += 2;\n                        \"ldr    r3, [%0], #4    \\n\" // int16x2_t _pA1 = *((int16x2_t*)pA); pA += 2;\n                        \"ldr    r4, [%1], #4    \\n\" // int16x2_t _pB = *((int16x2_t*)pB); pB += 2;\n                        \"smlad  %2, r2, r4, %2  \\n\" // sum0 = __smlad(_pA0, _pB, 
sum0);\n                        \"smlad  %3, r3, r4, %3  \\n\" // sum1 = __smlad(_pA1, _pB, sum1);\n                        : \"=r\"(pA),\n                        \"=r\"(pB),\n                        \"=r\"(sum0),\n                        \"=r\"(sum1)\n                        : \"0\"(pA),\n                        \"1\"(pB),\n                        \"2\"(sum0),\n                        \"3\"(sum1)\n                        : \"memory\", \"r2\", \"r3\", \"r4\");\n                }\n#endif // !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                for (; kk < max_kk; kk++)\n                {\n                    sum0 += pA[0] * pB[0];\n                    sum1 += pA[1] * pB[0];\n                    pA += 2;\n                    pB += 1;\n                }\n\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n                outptr += 2;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        for (int b = 0; b < batch; b++)\n        {\n            const short* pAT = AT_tile.row<const short>(b) + max_kk * ii;\n            const short* pB = BT_tile.row<const short>(b);\n\n            int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n            for (; jj + 11 < max_jj; jj += 12)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n                int32x4_t _sum2;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                    _sum2 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_dup_s16(pA);\n                    
int16x8_t _pB = vld1q_s16(pB);\n                    int16x4_t _pB2 = vld1_s16(pB + 8);\n                    _sum0 = vmlal_s16(_sum0, _pA, vget_low_s16(_pB));\n                    _sum1 = vmlal_s16(_sum1, _pA, vget_high_s16(_pB));\n                    _sum2 = vmlal_s16(_sum2, _pA, _pB2);\n                    pA += 1;\n                    pB += 12;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                outptr += 12;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_dup_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    _sum0 = vmlal_s16(_sum0, _pA, vget_low_s16(_pB));\n                    _sum1 = vmlal_s16(_sum1, _pA, vget_high_s16(_pB));\n                    pA += 1;\n                    pB += 8;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                outptr += 8;\n            }\n#endif // __aarch64__\n            for (; jj + 5 < max_jj; jj += 6)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n                int32x4_t _sum1;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                    _sum1 = vdupq_n_s32(0);\n                }\n                else\n                {\n   
                 _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_dup_s16(pA);\n                    int16x8_t _pB = vld1q_s16(pB);\n                    _sum0 = vmlal_s16(_sum0, _pA, vget_low_s16(_pB));\n                    _sum1 = vmlal_s16(_sum1, _pA, vget_high_s16(_pB));\n                    pA += 1;\n                    pB += 6;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                vst1_s32(outptr + 4, vget_low_s32(_sum1));\n                outptr += 6;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                const short* pA = pAT;\n\n                int32x4_t _sum0;\n\n                if (k == 0)\n                {\n                    _sum0 = vdupq_n_s32(0);\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                }\n\n                int kk = 0;\n                for (; kk < max_kk; kk++)\n                {\n                    int16x4_t _pA = vld1_dup_s16(pA);\n                    int16x4_t _pB = vld1_s16(pB);\n                    _sum0 = vmlal_s16(_sum0, _pA, _pB);\n                    pA += 1;\n                    pB += 4;\n                }\n\n                vst1q_s32(outptr, _sum0);\n                outptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; jj + 1 < max_jj; jj += 2)\n            {\n                const short* pA = pAT;\n\n                int sum0 = 0;\n                int sum1 = 0;\n\n                if (k == 0)\n                {\n                    sum0 = 0;\n                    sum1 = 0;\n                }\n                else\n                {\n                    sum0 = outptr[0];\n                    sum1 = outptr[1];\n                }\n\n                int kk = 0;\n#if !__ARM_NEON && __ARM_FEATURE_SIMD32 
&& NCNN_GNU_INLINE_ASM\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    asm volatile(\n                        \"ldr    r2, [%0], #4    \\n\" // int16x2_t _pA = *((int16x2_t*)pA); pA += 2;\n                        \"ldr    r3, [%1], #4    \\n\" // int16x2_t _pB0 = *((int16x2_t*)pB); pB += 2;\n                        \"ldr    r4, [%1], #4    \\n\" // int16x2_t _pB1 = *((int16x2_t*)pB); pB += 2;\n                        \"smlad  %2, r2, r3, %2  \\n\" // sum0 = __smlad(_pA, _pB0, sum0);\n                        \"smlad  %3, r2, r4, %3  \\n\" // sum1 = __smlad(_pA, _pB1, sum1);\n                        : \"=r\"(pA),\n                        \"=r\"(pB),\n                        \"=r\"(sum0),\n                        \"=r\"(sum1)\n                        : \"0\"(pA),\n                        \"1\"(pB),\n                        \"2\"(sum0),\n                        \"3\"(sum1)\n                        : \"memory\", \"r2\", \"r3\", \"r4\");\n                }\n#endif // !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                for (; kk < max_kk; kk++)\n                {\n                    sum0 += pA[0] * pB[0];\n                    sum1 += pA[0] * pB[1];\n                    pA += 1;\n                    pB += 2;\n                }\n\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n                outptr += 2;\n            }\n            for (; jj < max_jj; jj++)\n            {\n                const short* pA = pAT;\n\n                int sum = 0;\n\n                if (k == 0)\n                {\n                    sum = 0;\n                }\n                else\n                {\n                    sum = outptr[0];\n                }\n\n                int kk = 0;\n#if !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    asm volatile(\n                        \"ldr    r2, [%0], #4    
\\n\" // int16x2_t _pA = *((int16x2_t*)pA); pA += 2;\n                        \"ldr    r3, [%1], #4    \\n\" // int16x2_t _pB = *((int16x2_t*)pB); pB += 2;\n                        \"smlad  %2, r2, r3, %2  \\n\" // sum = __smlad(_pA, _pB, sum);\n                        : \"=r\"(pA),\n                        \"=r\"(pB),\n                        \"=r\"(sum)\n                        : \"0\"(pA),\n                        \"1\"(pB),\n                        \"2\"(sum)\n                        : \"memory\", \"r2\", \"r3\");\n                }\n#endif // !__ARM_NEON && __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n                for (; kk < max_kk; kk++)\n                {\n                    sum += pA[0] * pB[0];\n                    pA += 1;\n                    pB += 1;\n                }\n\n                outptr[0] = sum;\n                outptr += 1;\n            }\n        }\n    }\n}\n\nstatic void get_optimal_tile_mnk_int8(int M, int N, int K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const int l2_cache_size_int8 = (int)(get_cpu_level2_cache_size() / sizeof(short));\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    // we shall take B into account for batched gemm, but that will be slower on arm in practice, why ?\n    // (void)B;\n\n    // solve K\n    {\n        // try not to split K\n#if __aarch64__\n        int tile_size = (l2_cache_size_int8 - 32) / 12;\n#elif __ARM_NEON\n        int tile_size = (l2_cache_size_int8 - 32) / 6;\n#else\n        int tile_size = (l2_cache_size_int8 - 2) / 3;\n#endif\n\n#if __aarch64__\n        TILE_K = std::max(8, tile_size / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = std::max(4, tile_size / 4 * 4);\n#else\n        TILE_K = std::max(2, tile_size / 2 * 2);\n#endif\n\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n#if __aarch64__\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = 
std::min(TILE_K, ((K + nn_K - 1) / nn_K + 3) / 4 * 4);\n#else\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 1) / 2 * 2);\n#endif\n    }\n\n    // solve M\n    {\n#if __ARM_NEON\n        TILE_M = 8;\n#else\n        TILE_M = 2;\n#endif\n    }\n\n    {\n        TILE_M *= std::min(nT, get_physical_cpu_count());\n\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n#if __ARM_NEON\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#else\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n\n        if (nT > 1)\n        {\n#if __ARM_NEON\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n#else\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 1) / 2 * 2);\n#endif\n        }\n\n#if __ARM_NEON\n        TILE_M = std::max(8, TILE_M);\n#else\n        TILE_M = std::max(2, TILE_M);\n#endif\n    }\n\n    if (N > 0)\n    {\n        int tile_size;\n        if (TILE_K >= K)\n        {\n            tile_size = (l2_cache_size_int8 - TILE_M * TILE_K) / TILE_K;\n        }\n        else\n        {\n            tile_size = (l2_cache_size_int8 - TILE_M * TILE_K) / (TILE_M * 2 + TILE_K);\n        }\n\n#if __aarch64__\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#else\n        TILE_N = std::max(1, tile_size);\n#endif\n\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n\n#if __aarch64__\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#else\n        TILE_N = std::min(TILE_N, (N + nn_N - 1) / nn_N);\n#endif\n\n#if __aarch64__\n        TILE_N = std::max(4, TILE_N);\n#elif __ARM_NEON\n        TILE_N = std::max(4, TILE_N);\n#else\n        TILE_N = std::max(1, TILE_N);\n#endif\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_kernel_tile_int8(const Mat& kernel, Mat& A, int inch, int i, int 
max_ii, int k, int max_kk)\n{\n    // const signed char ktm[4][3] = {\n    //     {2, 0, 0},\n    //     {1, 1, 1},\n    //     {1, -1, 1},\n    //     {0, 0, 2}\n    // };\n\n    short* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            short tmp[4][3];\n\n            const signed char* k0 = (const signed char*)kernel + (i + ii) * inch * 9 + (k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                signed char r0 = k0[0];\n                signed char r1 = k0[1];\n                signed char r2 = k0[2];\n\n                tmp[0][m] = r0 * 2;\n                tmp[1][m] = r0 + r1 + r2;\n                tmp[2][m] = r0 - r1 + r2;\n                tmp[3][m] = r2 * 2;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 4; m++)\n            {\n                short r0 = tmp[m][0];\n                short r1 = tmp[m][1];\n                short r2 = tmp[m][2];\n\n                short z0 = r0 * 2;\n                short z1 = r0 + r1 + r2;\n                short z2 = r0 - r1 + r2;\n                short z3 = r2 * 2;\n\n                ptmp[0] = z0;\n                ptmp[1] = z1;\n                ptmp[2] = z2;\n                ptmp[3] = z3;\n                ptmp += 4;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_transform_kernel_int8(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    const int K = inch;\n    const int B = 16;\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_int8(M, 0, K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads, 2u, (Allocator*)0);\n\n    AT.create(TILE_K * TILE_M, B, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, 2u, (Allocator*)0);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for 
(int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd23_transform_kernel_tile_int8(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            pack_A_tile_int8(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_input_tile_int8(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    // const signed char itm[4][4] = {\n    //     {1,  0, -1,  0},\n    //     {0,  1,  1,  0},\n    //     {0, -1,  1,  0},\n    //     {0, -1,  0,  1}\n    // };\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w - 1) / 2;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n    nn_max_kk = max_kk / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n        short tmp[4][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const signed char* r0 = bottom_blob.channel((k + kk) / elempack).row<const signed char>(ti * 2) + (tj * 2) * elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                int8x8_t _r0 = vdup_n_s8(0);\n                int8x8_t _r1 = vdup_n_s8(0);\n                int8x8_t _r2 = vdup_n_s8(0);\n                int8x8_t _r3 = vdup_n_s8(0);\n\n                if (ti * 2 + m < h)\n                {\n      
              if (elempack == 8)\n                    {\n                        _r0 = vld1_s8(r0);\n                        if (tj * 2 + 1 < w) _r1 = vld1_s8(r0 + 8);\n                        if (tj * 2 + 2 < w) _r2 = vld1_s8(r0 + 16);\n                        if (tj * 2 + 3 < w) _r3 = vld1_s8(r0 + 24);\n                    }\n                    if (elempack == 1)\n                    {\n                        const signed char* r1 = r0 + N;\n                        const signed char* r2 = r0 + N * 2;\n                        const signed char* r3 = r0 + N * 3;\n                        const signed char* r4 = r0 + N * 4;\n                        const signed char* r5 = r0 + N * 5;\n                        const signed char* r6 = r0 + N * 6;\n                        const signed char* r7 = r0 + N * 7;\n\n                        int8x8_t _t0 = vld1_s8(r0);\n                        int8x8_t _t1 = vld1_s8(r1);\n                        int8x8_t _t2 = vld1_s8(r2);\n                        int8x8_t _t3 = vld1_s8(r3);\n                        int8x8_t _t4 = vld1_s8(r4);\n                        int8x8_t _t5 = vld1_s8(r5);\n                        int8x8_t _t6 = vld1_s8(r6);\n                        int8x8_t _t7 = vld1_s8(r7);\n\n                        int8x8_t _t01 = vzip_s8(_t0, _t1).val[0];\n                        int8x8_t _t23 = vzip_s8(_t2, _t3).val[0];\n                        int8x8_t _t45 = vzip_s8(_t4, _t5).val[0];\n                        int8x8_t _t67 = vzip_s8(_t6, _t7).val[0];\n                        int16x4x2_t _t0123 = vzip_s16(vreinterpret_s16_s8(_t01), vreinterpret_s16_s8(_t23));\n                        int16x4x2_t _t4567 = vzip_s16(vreinterpret_s16_s8(_t45), vreinterpret_s16_s8(_t67));\n                        int16x8_t _ta = vcombine_s16(_t0123.val[0], _t0123.val[1]);\n                        int16x8_t _tb = vcombine_s16(_t4567.val[0], _t4567.val[1]);\n                        int32x4x2_t _tab = vzipq_s32(vreinterpretq_s32_s16(_ta), 
vreinterpretq_s32_s16(_tb));\n\n                        _r0 = vreinterpret_s8_s32(vget_low_s32(_tab.val[0]));\n                        if (tj * 2 + 1 < w) _r1 = vreinterpret_s8_s32(vget_high_s32(_tab.val[0]));\n                        if (tj * 2 + 2 < w) _r2 = vreinterpret_s8_s32(vget_low_s32(_tab.val[1]));\n                        if (tj * 2 + 3 < w) _r3 = vreinterpret_s8_s32(vget_high_s32(_tab.val[1]));\n                    }\n                }\n\n                int16x8_t _tmp0 = vsubl_s8(_r0, _r2);\n                int16x8_t _tmp1 = vaddl_s8(_r1, _r2);\n                int16x8_t _tmp2 = vsubl_s8(_r2, _r1);\n                int16x8_t _tmp3 = vsubl_s8(_r3, _r1);\n\n                vst1q_s16(tmp[0][m], _tmp0);\n                vst1q_s16(tmp[1][m], _tmp1);\n                vst1q_s16(tmp[2][m], _tmp2);\n                vst1q_s16(tmp[3][m], _tmp3);\n\n                r0 += w * elempack;\n            }\n\n            short* p0 = (short*)B + kk * max_jj * 16 + jj * 8;\n            short* p1 = p0 + max_jj * 8;\n            short* p2 = p0 + max_jj * 8 * 2;\n            short* p3 = p0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                int16x8_t _r0 = vld1q_s16(tmp[m][0]);\n                int16x8_t _r1 = vld1q_s16(tmp[m][1]);\n                int16x8_t _r2 = vld1q_s16(tmp[m][2]);\n                int16x8_t _r3 = vld1q_s16(tmp[m][3]);\n\n                int16x8_t _tmp0 = vsubq_s16(_r0, _r2);\n                int16x8_t _tmp1 = vaddq_s16(_r1, _r2);\n                int16x8_t _tmp2 = vsubq_s16(_r2, _r1);\n                int16x8_t _tmp3 = vsubq_s16(_r3, _r1);\n\n                vst1q_s16(p0, _tmp0);\n                vst1q_s16(p1, _tmp1);\n                vst1q_s16(p2, _tmp2);\n                vst1q_s16(p3, _tmp3);\n\n                p0 += max_jj * 4 * 8;\n                p1 += max_jj * 4 * 8;\n                p2 += max_jj * 4 * 8;\n                p3 += max_jj * 4 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += 
nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n        short tmp[4][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const signed char* r0 = bottom_blob.channel(k + kk).row<const signed char>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n                signed char r00 = 0;\n                signed char r01 = 0;\n                signed char r10 = 0;\n                signed char r11 = 0;\n                signed char r20 = 0;\n                signed char r21 = 0;\n                signed char r30 = 0;\n                signed char r31 = 0;\n\n                if (ti * 2 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const signed char* r1 = r0 + N;\n\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 2 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 2 + 2 < w)\n                        {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 2 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n                    }\n                }\n\n                tmp[0][m][0] = r00 - r20;\n                tmp[0][m][1] = r01 - r21;\n                tmp[1][m][0] = r10 + r20;\n                tmp[1][m][1] = r11 + 
r21;\n                tmp[2][m][0] = r20 - r10;\n                tmp[2][m][1] = r21 - r11;\n                tmp[3][m][0] = r30 - r10;\n                tmp[3][m][1] = r31 - r11;\n\n                r0 += w;\n            }\n\n            short* p0 = (short*)B + kk * max_jj * 16 + jj * 2;\n            short* p1 = p0 + max_jj * 2;\n            short* p2 = p0 + max_jj * 2 * 2;\n            short* p3 = p0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                short r00 = tmp[m][0][0];\n                short r01 = tmp[m][0][1];\n                short r10 = tmp[m][1][0];\n                short r11 = tmp[m][1][1];\n                short r20 = tmp[m][2][0];\n                short r21 = tmp[m][2][1];\n                short r30 = tmp[m][3][0];\n                short r31 = tmp[m][3][1];\n\n                p0[0] = r00 - r20;\n                p0[1] = r01 - r21;\n                p1[0] = r10 + r20;\n                p1[1] = r11 + r21;\n                p2[0] = r20 - r10;\n                p2[1] = r21 - r11;\n                p3[0] = r30 - r10;\n                p3[1] = r31 - r11;\n\n                p0 += max_jj * 4 * 2;\n                p1 += max_jj * 4 * 2;\n                p2 += max_jj * 4 * 2;\n                p3 += max_jj * 4 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        short tmp[4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const signed char* r0123 = bottom_blob.channel(k + kk).row<const signed char>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 4; m++)\n            {\n                signed char r0 = 0;\n                signed char r1 = 0;\n                signed char r2 = 0;\n                signed char r3 = 0;\n\n                if (ti * 2 + m < h)\n                {\n                    // if 
(elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 2 + 1 < w) r1 = r0123[1];\n                        if (tj * 2 + 2 < w) r2 = r0123[2];\n                        if (tj * 2 + 3 < w) r3 = r0123[3];\n                    }\n                }\n\n                tmp[0][m] = r0 - r2;\n                tmp[1][m] = r1 + r2;\n                tmp[2][m] = r2 - r1;\n                tmp[3][m] = r3 - r1;\n\n                r0123 += w;\n            }\n\n            short* p0 = (short*)B + kk * max_jj * 16 + jj;\n            short* p1 = p0 + max_jj;\n            short* p2 = p0 + max_jj * 2;\n            short* p3 = p0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                short r0 = tmp[m][0];\n                short r1 = tmp[m][1];\n                short r2 = tmp[m][2];\n                short r3 = tmp[m][3];\n\n                p0[0] = r0 - r2;\n                p1[0] = r1 + r2;\n                p2[0] = r2 - r1;\n                p3[0] = r3 - r1;\n\n                p0 += max_jj * 4;\n                p1 += max_jj * 4;\n                p2 += max_jj * 4;\n                p3 += max_jj * 4;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd23_transform_output_tile_int8(const Mat& top_tile, Mat& top_blob, int i, int max_ii, int j, int max_jj)\n{\n    // const int otm[2][4] = {\n    //     {1,  1,  1,  0},\n    //     {0,  1, -1,  1}\n    // };\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 1) / 2;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        int tmp[2][4][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * 
max_jj * 16 + jj * 8;\n            const int* r1 = r0 + max_jj * 8;\n            const int* r2 = r0 + max_jj * 8 * 2;\n            const int* r3 = r0 + max_jj * 8 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                int32x4_t _r00 = vld1q_s32(r0);\n                int32x4_t _r01 = vld1q_s32(r0 + 4);\n                int32x4_t _r10 = vld1q_s32(r1);\n                int32x4_t _r11 = vld1q_s32(r1 + 4);\n                int32x4_t _r20 = vld1q_s32(r2);\n                int32x4_t _r21 = vld1q_s32(r2 + 4);\n                int32x4_t _r30 = vld1q_s32(r3);\n                int32x4_t _r31 = vld1q_s32(r3 + 4);\n\n                int32x4_t _tmp00 = vaddq_s32(vaddq_s32(_r00, _r10), _r20);\n                int32x4_t _tmp01 = vaddq_s32(vaddq_s32(_r01, _r11), _r21);\n                int32x4_t _tmp10 = vaddq_s32(vsubq_s32(_r10, _r20), _r30);\n                int32x4_t _tmp11 = vaddq_s32(vsubq_s32(_r11, _r21), _r31);\n\n                vst1q_s32(tmp[0][m], _tmp00);\n                vst1q_s32(tmp[0][m] + 4, _tmp01);\n                vst1q_s32(tmp[1][m], _tmp10);\n                vst1q_s32(tmp[1][m] + 4, _tmp11);\n\n                r0 += max_jj * 4 * 8;\n                r1 += max_jj * 4 * 8;\n                r2 += max_jj * 4 * 8;\n                r3 += max_jj * 4 * 8;\n            }\n\n            int* outptr0 = top_blob.channel((i + ii) / out_elempack).row<int>(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                int32x4_t _r00 = vld1q_s32(tmp[m][0]);\n                int32x4_t _r01 = vld1q_s32(tmp[m][0] + 4);\n                int32x4_t _r10 = vld1q_s32(tmp[m][1]);\n                int32x4_t _r11 = vld1q_s32(tmp[m][1] + 4);\n                int32x4_t _r20 = vld1q_s32(tmp[m][2]);\n                int32x4_t _r21 = vld1q_s32(tmp[m][2] + 4);\n                int32x4_t _r30 = vld1q_s32(tmp[m][3]);\n                int32x4_t _r31 = 
vld1q_s32(tmp[m][3] + 4);\n\n                int32x4_t _tmp00 = vaddq_s32(vaddq_s32(_r00, _r10), _r20);\n                int32x4_t _tmp01 = vaddq_s32(vaddq_s32(_r01, _r11), _r21);\n                int32x4_t _tmp10 = vaddq_s32(vsubq_s32(_r10, _r20), _r30);\n                int32x4_t _tmp11 = vaddq_s32(vsubq_s32(_r11, _r21), _r31);\n\n                _tmp00 = vshrq_n_s32(_tmp00, 2);\n                _tmp01 = vshrq_n_s32(_tmp01, 2);\n                _tmp10 = vshrq_n_s32(_tmp10, 2);\n                _tmp11 = vshrq_n_s32(_tmp11, 2);\n\n                if (out_elempack == 8)\n                {\n                    vst1q_s32(outptr0, _tmp00);\n                    vst1q_s32(outptr0 + 4, _tmp01);\n                    if (tj * 2 + 1 < outw)\n                    {\n                        vst1q_s32(outptr0 + 8, _tmp10);\n                        vst1q_s32(outptr0 + 12, _tmp11);\n                    }\n                }\n                if (out_elempack == 4)\n                {\n                    int* outptr1 = outptr0 + N;\n\n                    vst1q_s32(outptr0, _tmp00);\n                    vst1q_s32(outptr1, _tmp01);\n                    if (tj * 2 + 1 < outw)\n                    {\n                        vst1q_s32(outptr0 + 4, _tmp10);\n                        vst1q_s32(outptr1 + 4, _tmp11);\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    int* outptr1 = outptr0 + N;\n                    int* outptr2 = outptr0 + N * 2;\n                    int* outptr3 = outptr0 + N * 3;\n                    int* outptr4 = outptr0 + N * 4;\n                    int* outptr5 = outptr0 + N * 5;\n                    int* outptr6 = outptr0 + N * 6;\n                    int* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = vgetq_lane_s32(_tmp00, 0);\n                    outptr1[0] = vgetq_lane_s32(_tmp00, 1);\n                    outptr2[0] = vgetq_lane_s32(_tmp00, 2);\n                    outptr3[0] = 
vgetq_lane_s32(_tmp00, 3);\n                    outptr4[0] = vgetq_lane_s32(_tmp01, 0);\n                    outptr5[0] = vgetq_lane_s32(_tmp01, 1);\n                    outptr6[0] = vgetq_lane_s32(_tmp01, 2);\n                    outptr7[0] = vgetq_lane_s32(_tmp01, 3);\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = vgetq_lane_s32(_tmp10, 0);\n                        outptr1[1] = vgetq_lane_s32(_tmp10, 1);\n                        outptr2[1] = vgetq_lane_s32(_tmp10, 2);\n                        outptr3[1] = vgetq_lane_s32(_tmp10, 3);\n                        outptr4[1] = vgetq_lane_s32(_tmp11, 0);\n                        outptr5[1] = vgetq_lane_s32(_tmp11, 1);\n                        outptr6[1] = vgetq_lane_s32(_tmp11, 2);\n                        outptr7[1] = vgetq_lane_s32(_tmp11, 3);\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        int tmp[2][4][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * max_jj * 16 + jj * 4;\n            const int* r1 = r0 + max_jj * 4;\n            const int* r2 = r0 + max_jj * 4 * 2;\n            const int* r3 = r0 + max_jj * 4 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                int32x4_t _r0 = vld1q_s32(r0);\n                int32x4_t _r1 = vld1q_s32(r1);\n                int32x4_t _r2 = vld1q_s32(r2);\n                int32x4_t _r3 = vld1q_s32(r3);\n\n                int32x4_t _tmp0 = vaddq_s32(vaddq_s32(_r0, _r1), _r2);\n                int32x4_t _tmp1 = vaddq_s32(vsubq_s32(_r1, _r2), _r3);\n\n                vst1q_s32(tmp[0][m], _tmp0);\n                vst1q_s32(tmp[1][m], _tmp1);\n\n                r0 += max_jj * 4 * 4;\n                r1 += max_jj 
* 4 * 4;\n                r2 += max_jj * 4 * 4;\n                r3 += max_jj * 4 * 4;\n            }\n\n            int* outptr0 = top_blob.channel((i + ii) / out_elempack).row<int>(ti * 2) + (tj * 2) * out_elempack;\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                int32x4_t _r0 = vld1q_s32(tmp[m][0]);\n                int32x4_t _r1 = vld1q_s32(tmp[m][1]);\n                int32x4_t _r2 = vld1q_s32(tmp[m][2]);\n                int32x4_t _r3 = vld1q_s32(tmp[m][3]);\n\n                int32x4_t _tmp0 = vaddq_s32(vaddq_s32(_r0, _r1), _r2);\n                int32x4_t _tmp1 = vaddq_s32(vsubq_s32(_r1, _r2), _r3);\n\n                _tmp0 = vshrq_n_s32(_tmp0, 2);\n                _tmp1 = vshrq_n_s32(_tmp1, 2);\n\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _tmp0);\n                    if (tj * 2 + 1 < outw) vst1q_s32(outptr0 + 4, _tmp1);\n                }\n                if (out_elempack == 1)\n                {\n                    int* outptr1 = outptr0 + N;\n                    int* outptr2 = outptr0 + N * 2;\n                    int* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = vgetq_lane_s32(_tmp0, 0);\n                    outptr1[0] = vgetq_lane_s32(_tmp0, 1);\n                    outptr2[0] = vgetq_lane_s32(_tmp0, 2);\n                    outptr3[0] = vgetq_lane_s32(_tmp0, 3);\n\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = vgetq_lane_s32(_tmp1, 0);\n                        outptr1[1] = vgetq_lane_s32(_tmp1, 1);\n                        outptr2[1] = vgetq_lane_s32(_tmp1, 2);\n                        outptr3[1] = vgetq_lane_s32(_tmp1, 3);\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        
int tmp[2][4][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * max_jj * 16 + jj * 2;\n            const int* r1 = r0 + max_jj * 2;\n            const int* r2 = r0 + max_jj * 2 * 2;\n            const int* r3 = r0 + max_jj * 2 * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                tmp[0][m][0] = r0[0] + r1[0] + r2[0];\n                tmp[0][m][1] = r0[1] + r1[1] + r2[1];\n                tmp[1][m][0] = r1[0] - r2[0] + r3[0];\n                tmp[1][m][1] = r1[1] - r2[1] + r3[1];\n\n                r0 += max_jj * 4 * 2;\n                r1 += max_jj * 4 * 2;\n                r2 += max_jj * 4 * 2;\n                r3 += max_jj * 4 * 2;\n            }\n\n            int* outptr0 = top_blob.channel(i + ii).row<int>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                int tmp00 = tmp[m][0][0] + tmp[m][1][0] + tmp[m][2][0];\n                int tmp01 = tmp[m][0][1] + tmp[m][1][1] + tmp[m][2][1];\n                int tmp10 = tmp[m][1][0] - tmp[m][2][0] + tmp[m][3][0];\n                int tmp11 = tmp[m][1][1] - tmp[m][2][1] + tmp[m][3][1];\n\n                tmp00 = tmp00 >> 2;\n                tmp01 = tmp01 >> 2;\n                tmp10 = tmp10 >> 2;\n                tmp11 = tmp11 >> 2;\n\n                // if (out_elempack == 1)\n                {\n                    int* outptr1 = outptr0 + N;\n\n                    outptr0[0] = tmp00;\n                    outptr1[0] = tmp01;\n                    if (tj * 2 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] = tmp11;\n                    }\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; 
ii++)\n    {\n        int tmp[2][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * max_jj * 16 + jj;\n            const int* r1 = r0 + max_jj;\n            const int* r2 = r0 + max_jj * 2;\n            const int* r3 = r0 + max_jj * 3;\n\n            for (int m = 0; m < 4; m++)\n            {\n                tmp[0][m] = r0[0] + r1[0] + r2[0];\n                tmp[1][m] = r1[0] - r2[0] + r3[0];\n\n                r0 += max_jj * 4;\n                r1 += max_jj * 4;\n                r2 += max_jj * 4;\n                r3 += max_jj * 4;\n            }\n\n            int* outptr0 = top_blob.channel(i + ii).row<int>(ti * 2) + (tj * 2);\n\n            for (int m = 0; m < 2; m++)\n            {\n                if (ti * 2 + m >= outh)\n                    continue;\n\n                int tmp0 = tmp[m][0] + tmp[m][1] + tmp[m][2];\n                int tmp1 = tmp[m][1] - tmp[m][2] + tmp[m][3];\n\n                tmp0 = tmp0 >> 2;\n                tmp1 = tmp1 >> 2;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 2 + 1 < outw) outptr0[1] = tmp1;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd23_int8(Mat& bottom_blob, Mat& top_blob, const Mat& AT, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 2n+2, winograd F(2,3)\n    int w_tiles = (outw + 1) / 2;\n    int h_tiles = (outh + 1) / 2;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 16;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd23_int8 %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    
get_optimal_tile_mnk_int8(M, N, K, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 2u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd23_transform_input_tile_int8(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            transpose_pack_B_tile_int8(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 2u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        // #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n      
      conv3x3s1_winograd23_transform_input_tile_int8(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            transpose_pack_B_tile_int8(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    bottom_blob.release();\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                gemm_transB_packed_tile_int8(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk);\n            }\n\n            // transform output\n            conv3x3s1_winograd23_transform_output_tile_int8(top_tile, top_blob, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n\nstatic inline void conv3x3s1_winograd43_transform_kernel_tile_int8(const Mat& kernel, Mat& A, int inch, int i, int max_ii, int k, int max_kk)\n{\n    // const short ktm[6][3] = {\n    //     {6, 0, 0},\n    //     {-4, -4, -4},\n    //     {-4, 4, -4},\n    //     {1, 2, 4},\n    //     {1, -2, 4},\n    //     {0, 0, 6}\n    // };\n\n    short* ptmp = A;\n\n    int ii = 0;\n    for (; ii < max_ii; ii++)\n    {\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            short tmp[6][3];\n\n            const signed char* k0 = (const signed char*)kernel + (i + ii) * inch * 9 + 
(k + kk) * 9;\n\n            for (int m = 0; m < 3; m++)\n            {\n                signed char r0 = k0[0];\n                signed char r1 = k0[1];\n                signed char r2 = k0[2];\n\n                tmp[0][m] = r0 * 6;\n                tmp[1][m] = -r0 * 4 - r1 * 4 - r2 * 4;\n                tmp[2][m] = -r0 * 4 + r1 * 4 - r2 * 4;\n                tmp[3][m] = r0 + r1 * 2 + r2 * 4;\n                tmp[4][m] = r0 - r1 * 2 + r2 * 4;\n                tmp[5][m] = r2 * 6;\n\n                k0 += 3;\n            }\n\n            for (int m = 0; m < 6; m++)\n            {\n                short r0 = tmp[m][0];\n                short r1 = tmp[m][1];\n                short r2 = tmp[m][2];\n\n                short z0 = r0 * 6;\n                short z1 = -r0 * 4 - r1 * 4 - r2 * 4;\n                short z2 = -r0 * 4 + r1 * 4 - r2 * 4;\n                short z3 = r0 + r1 * 2 + r2 * 4;\n                short z4 = r0 - r1 * 2 + r2 * 4;\n                short z5 = r2 * 6;\n\n                ptmp[0] = z0;\n                ptmp[1] = z1;\n                ptmp[2] = z2;\n                ptmp[3] = z3;\n                ptmp[4] = z4;\n                ptmp[5] = z5;\n                ptmp += 6;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_transform_kernel_int8(const Mat& kernel, Mat& AT, int inch, int outch, const Option& opt)\n{\n    const int M = outch;\n    const int K = inch;\n    const int B = 36;\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_int8(M, 0, K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    Mat A_tileX(B * TILE_M * TILE_K, 1, opt.num_threads, 2u, (Allocator*)0);\n\n    AT.create(TILE_K * TILE_M, B, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, 2u, (Allocator*)0);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat A_tile = 
A_tileX.channel(get_omp_thread_num());\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_ii = std::min((M - i), TILE_M);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            conv3x3s1_winograd43_transform_kernel_tile_int8(kernel, A_tile, inch, i, max_ii, k, max_kk);\n\n            Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n            pack_A_tile_int8(A_tile, AT_tile, B, max_ii, max_kk);\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd43_transform_input_tile_int8(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int nT)\n{\n    // const float itm[6][6] = {\n    //     {4,  0, -5,  0, 1, 0},\n    //     {0, -4, -4,  1, 1, 0},\n    //     {0,  4, -4, -1, 1, 0},\n    //     {0, -2, -1,  2, 1, 0},\n    //     {0,  2, -1, -2, 1, 0},\n    //     {0,  4,  0, -5, 0, 1}\n    // };\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int elempack = bottom_blob.elempack;\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int w_tiles = (w + 1) / 4;\n\n    int nn_max_kk = 0;\n    int remain_max_kk_start = 0;\n#if __ARM_NEON\n    nn_max_kk = max_kk / 8;\n    #pragma omp parallel for num_threads(nT)\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 8;\n\n        short tmp[6][6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const signed char* r0 = bottom_blob.channel((k + kk) / elempack).row<const signed char>(ti * 4) + (tj * 4) * elempack;\n\n            int8x8_t _v5 = vdup_n_s8(5);\n\n            for (int m = 0; m < 6; m++)\n            {\n                int8x8_t _r0 = vdup_n_s8(0);\n                int8x8_t _r1 = vdup_n_s8(0);\n                int8x8_t _r2 = vdup_n_s8(0);\n                int8x8_t _r3 = vdup_n_s8(0);\n                int8x8_t _r4 = 
vdup_n_s8(0);\n                int8x8_t _r5 = vdup_n_s8(0);\n\n                if (ti * 4 + m < h)\n                {\n                    if (elempack == 8)\n                    {\n                        _r0 = vld1_s8(r0);\n                        if (tj * 4 + 1 < w) _r1 = vld1_s8(r0 + 8);\n                        if (tj * 4 + 2 < w) _r2 = vld1_s8(r0 + 16);\n                        if (tj * 4 + 3 < w) _r3 = vld1_s8(r0 + 24);\n                        if (tj * 4 + 4 < w) _r4 = vld1_s8(r0 + 32);\n                        if (tj * 4 + 5 < w) _r5 = vld1_s8(r0 + 40);\n                    }\n                    if (elempack == 1)\n                    {\n                        const signed char* r1 = r0 + N;\n                        const signed char* r2 = r0 + N * 2;\n                        const signed char* r3 = r0 + N * 3;\n                        const signed char* r4 = r0 + N * 4;\n                        const signed char* r5 = r0 + N * 5;\n                        const signed char* r6 = r0 + N * 6;\n                        const signed char* r7 = r0 + N * 7;\n\n                        int8x8_t _t0 = vld1_s8(r0);\n                        int8x8_t _t1 = vld1_s8(r1);\n                        int8x8_t _t2 = vld1_s8(r2);\n                        int8x8_t _t3 = vld1_s8(r3);\n                        int8x8_t _t4 = vld1_s8(r4);\n                        int8x8_t _t5 = vld1_s8(r5);\n                        int8x8_t _t6 = vld1_s8(r6);\n                        int8x8_t _t7 = vld1_s8(r7);\n\n                        int8x8_t _t01 = vzip_s8(_t0, _t1).val[0];\n                        int8x8_t _t23 = vzip_s8(_t2, _t3).val[0];\n                        int8x8_t _t45 = vzip_s8(_t4, _t5).val[0];\n                        int8x8_t _t67 = vzip_s8(_t6, _t7).val[0];\n                        int16x4x2_t _t0123 = vzip_s16(vreinterpret_s16_s8(_t01), vreinterpret_s16_s8(_t23));\n                        int16x4x2_t _t4567 = vzip_s16(vreinterpret_s16_s8(_t45), vreinterpret_s16_s8(_t67));\n     
                   int16x8_t _ta = vcombine_s16(_t0123.val[0], _t0123.val[1]);\n                        int16x8_t _tb = vcombine_s16(_t4567.val[0], _t4567.val[1]);\n                        int32x4x2_t _tab = vzipq_s32(vreinterpretq_s32_s16(_ta), vreinterpretq_s32_s16(_tb));\n\n                        _r0 = vreinterpret_s8_s32(vget_low_s32(_tab.val[0]));\n                        if (tj * 4 + 1 < w) _r1 = vreinterpret_s8_s32(vget_high_s32(_tab.val[0]));\n                        if (tj * 4 + 2 < w) _r2 = vreinterpret_s8_s32(vget_low_s32(_tab.val[1]));\n                        if (tj * 4 + 3 < w) _r3 = vreinterpret_s8_s32(vget_high_s32(_tab.val[1]));\n                        if (tj * 4 + 4 < w)\n                        {\n                            _t01 = vzip_s8(_t0, _t1).val[1];\n                            _t23 = vzip_s8(_t2, _t3).val[1];\n                            _t45 = vzip_s8(_t4, _t5).val[1];\n                            _t67 = vzip_s8(_t6, _t7).val[1];\n                            int16x4_t _tc = vzip_s16(vreinterpret_s16_s8(_t01), vreinterpret_s16_s8(_t23)).val[0];\n                            int16x4_t _td = vzip_s16(vreinterpret_s16_s8(_t45), vreinterpret_s16_s8(_t67)).val[0];\n                            int32x2x2_t _tcd = vzip_s32(vreinterpret_s32_s16(_tc), vreinterpret_s32_s16(_td));\n\n                            _r4 = vreinterpret_s8_s32(_tcd.val[0]);\n                            if (tj * 4 + 5 < w) _r5 = vreinterpret_s8_s32(_tcd.val[1]);\n                        }\n                    }\n                }\n\n                int16x8_t _tmp12a = vsubw_s8(vshll_n_s8(_r1, 2), _r3);\n                int16x8_t _tmp12b = vsubw_s8(vshll_n_s8(_r2, 2), _r4);\n                int16x8_t _tmp34a = vshlq_n_s16(vsubl_s8(_r3, _r1), 1);\n                int16x8_t _tmp34b = vsubl_s8(_r4, _r2);\n\n                int16x8_t _tmp0 = vaddq_s16(vmovl_s8(_r4), vsubq_s16(vshll_n_s8(_r0, 2), vmull_s8(_r2, _v5)));\n                int16x8_t _tmp1 = 
vnegq_s16(vaddq_s16(_tmp12a, _tmp12b));\n                int16x8_t _tmp2 = vsubq_s16(_tmp12a, _tmp12b);\n                int16x8_t _tmp3 = vaddq_s16(_tmp34b, _tmp34a);\n                int16x8_t _tmp4 = vsubq_s16(_tmp34b, _tmp34a);\n                int16x8_t _tmp5 = vaddq_s16(vmovl_s8(_r5), vsubq_s16(vshll_n_s8(_r1, 2), vmull_s8(_r3, _v5)));\n\n                vst1q_s16(tmp[0][m], _tmp0);\n                vst1q_s16(tmp[1][m], _tmp1);\n                vst1q_s16(tmp[2][m], _tmp2);\n                vst1q_s16(tmp[3][m], _tmp3);\n                vst1q_s16(tmp[4][m], _tmp4);\n                vst1q_s16(tmp[5][m], _tmp5);\n\n                r0 += w * elempack;\n            }\n\n            int16x8_t _v5q = vdupq_n_s16(5);\n\n            short* p0 = (short*)B + kk * max_jj * 36 + jj * 8;\n            short* p1 = p0 + max_jj * 8;\n            short* p2 = p0 + max_jj * 8 * 2;\n            short* p3 = p0 + max_jj * 8 * 3;\n            short* p4 = p0 + max_jj * 8 * 4;\n            short* p5 = p0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                int16x8_t _r0 = vld1q_s16(tmp[m][0]);\n                int16x8_t _r1 = vld1q_s16(tmp[m][1]);\n                int16x8_t _r2 = vld1q_s16(tmp[m][2]);\n                int16x8_t _r3 = vld1q_s16(tmp[m][3]);\n                int16x8_t _r4 = vld1q_s16(tmp[m][4]);\n                int16x8_t _r5 = vld1q_s16(tmp[m][5]);\n\n                int16x8_t _tmp12a = vsubq_s16(_r3, vshlq_n_s16(_r1, 2));\n                int16x8_t _tmp12b = vsubq_s16(_r4, vshlq_n_s16(_r2, 2));\n                int16x8_t _tmp34a = vshlq_n_s16(vsubq_s16(_r3, _r1), 1);\n                int16x8_t _tmp34b = vsubq_s16(_r4, _r2);\n\n                int16x8_t _tmp0 = vaddq_s16(_r4, vsubq_s16(vshlq_n_s16(_r0, 2), vmulq_s16(_r2, _v5q)));\n                int16x8_t _tmp1 = vaddq_s16(_tmp12b, _tmp12a);\n                int16x8_t _tmp2 = vsubq_s16(_tmp12b, _tmp12a);\n                int16x8_t _tmp3 = vaddq_s16(_tmp34b, _tmp34a);\n              
  int16x8_t _tmp4 = vsubq_s16(_tmp34b, _tmp34a);\n                int16x8_t _tmp5 = vaddq_s16(_r5, vsubq_s16(vshlq_n_s16(_r1, 2), vmulq_s16(_r3, _v5q)));\n\n                vst1q_s16(p0, _tmp0);\n                vst1q_s16(p1, _tmp1);\n                vst1q_s16(p2, _tmp2);\n                vst1q_s16(p3, _tmp3);\n                vst1q_s16(p4, _tmp4);\n                vst1q_s16(p5, _tmp5);\n\n                p0 += max_jj * 6 * 8;\n                p1 += max_jj * 6 * 8;\n                p2 += max_jj * 6 * 8;\n                p3 += max_jj * 6 * 8;\n                p4 += max_jj * 6 * 8;\n                p5 += max_jj * 6 * 8;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 8;\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n#else // __ARM_NEON\n    nn_max_kk = (max_kk - remain_max_kk_start) / 2;\n    #pragma omp parallel for num_threads(nT)\n#endif // __ARM_NEON\n    for (int ppkk = 0; ppkk < nn_max_kk; ppkk++)\n    {\n        const int kk = remain_max_kk_start + ppkk * 2;\n\n        short tmp[6][6][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const signed char* r0 = bottom_blob.channel(k + kk).row<const signed char>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 6; m++)\n            {\n                signed char r00 = 0;\n                signed char r01 = 0;\n                signed char r10 = 0;\n                signed char r11 = 0;\n                signed char r20 = 0;\n                signed char r21 = 0;\n                signed char r30 = 0;\n                signed char r31 = 0;\n                signed char r40 = 0;\n                signed char r41 = 0;\n                signed char r50 = 0;\n                signed char r51 = 0;\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        const signed char* r1 = r0 + 
N;\n\n                        r00 = r0[0];\n                        r01 = r1[0];\n                        if (tj * 4 + 1 < w)\n                        {\n                            r10 = r0[1];\n                            r11 = r1[1];\n                        }\n                        if (tj * 4 + 2 < w)\n                        {\n                            r20 = r0[2];\n                            r21 = r1[2];\n                        }\n                        if (tj * 4 + 3 < w)\n                        {\n                            r30 = r0[3];\n                            r31 = r1[3];\n                        }\n                        if (tj * 4 + 4 < w)\n                        {\n                            r40 = r0[4];\n                            r41 = r1[4];\n                        }\n                        if (tj * 4 + 5 < w)\n                        {\n                            r50 = r0[5];\n                            r51 = r1[5];\n                        }\n                    }\n                }\n\n                short tmp120a = r30 - r10 * 4;\n                short tmp121a = r31 - r11 * 4;\n                short tmp120b = r40 - r20 * 4;\n                short tmp121b = r41 - r21 * 4;\n                short tmp340a = (r30 - r10) * 2;\n                short tmp341a = (r31 - r11) * 2;\n                short tmp340b = r40 - r20;\n                short tmp341b = r41 - r21;\n\n                tmp[0][m][0] = r40 + r00 * 4 - r20 * 5;\n                tmp[0][m][1] = r41 + r01 * 4 - r21 * 5;\n                tmp[1][m][0] = tmp120b + tmp120a;\n                tmp[1][m][1] = tmp121b + tmp121a;\n                tmp[2][m][0] = tmp120b - tmp120a;\n                tmp[2][m][1] = tmp121b - tmp121a;\n                tmp[3][m][0] = tmp340b + tmp340a;\n                tmp[3][m][1] = tmp341b + tmp341a;\n                tmp[4][m][0] = tmp340b - tmp340a;\n                tmp[4][m][1] = tmp341b - tmp341a;\n                tmp[5][m][0] = r50 + r10 * 4 - r30 * 
5;\n                tmp[5][m][1] = r51 + r11 * 4 - r31 * 5;\n\n                r0 += w;\n            }\n\n            short* p0 = (short*)B + kk * max_jj * 36 + jj * 2;\n            short* p1 = p0 + max_jj * 2;\n            short* p2 = p0 + max_jj * 2 * 2;\n            short* p3 = p0 + max_jj * 2 * 3;\n            short* p4 = p0 + max_jj * 2 * 4;\n            short* p5 = p0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                short r00 = tmp[m][0][0];\n                short r01 = tmp[m][0][1];\n                short r10 = tmp[m][1][0];\n                short r11 = tmp[m][1][1];\n                short r20 = tmp[m][2][0];\n                short r21 = tmp[m][2][1];\n                short r30 = tmp[m][3][0];\n                short r31 = tmp[m][3][1];\n                short r40 = tmp[m][4][0];\n                short r41 = tmp[m][4][1];\n                short r50 = tmp[m][5][0];\n                short r51 = tmp[m][5][1];\n\n                short tmp120a = r30 - r10 * 4;\n                short tmp121a = r31 - r11 * 4;\n                short tmp120b = r40 - r20 * 4;\n                short tmp121b = r41 - r21 * 4;\n                short tmp340a = (r30 - r10) * 2;\n                short tmp341a = (r31 - r11) * 2;\n                short tmp340b = r40 - r20;\n                short tmp341b = r41 - r21;\n\n                p0[0] = r40 + r00 * 4 - r20 * 5;\n                p0[1] = r41 + r01 * 4 - r21 * 5;\n                p1[0] = tmp120b + tmp120a;\n                p1[1] = tmp121b + tmp121a;\n                p2[0] = tmp120b - tmp120a;\n                p2[1] = tmp121b - tmp121a;\n                p3[0] = tmp340b + tmp340a;\n                p3[1] = tmp341b + tmp341a;\n                p4[0] = tmp340b - tmp340a;\n                p4[1] = tmp341b - tmp341a;\n                p5[0] = r50 + r10 * 4 - r30 * 5;\n                p5[1] = r51 + r11 * 4 - r31 * 5;\n\n                p0 += max_jj * 6 * 2;\n                p1 += max_jj * 6 * 2;\n              
  p2 += max_jj * 6 * 2;\n                p3 += max_jj * 6 * 2;\n                p4 += max_jj * 6 * 2;\n                p5 += max_jj * 6 * 2;\n            }\n        }\n    }\n    remain_max_kk_start += nn_max_kk * 2;\n    for (int kk = remain_max_kk_start; kk < max_kk; kk++)\n    {\n        short tmp[6][6];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const signed char* r0123 = bottom_blob.channel(k + kk).row<const signed char>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 6; m++)\n            {\n                signed char r0 = 0;\n                signed char r1 = 0;\n                signed char r2 = 0;\n                signed char r3 = 0;\n                signed char r4 = 0;\n                signed char r5 = 0;\n\n                if (ti * 4 + m < h)\n                {\n                    // if (elempack == 1)\n                    {\n                        r0 = r0123[0];\n                        if (tj * 4 + 1 < w) r1 = r0123[1];\n                        if (tj * 4 + 2 < w) r2 = r0123[2];\n                        if (tj * 4 + 3 < w) r3 = r0123[3];\n                        if (tj * 4 + 4 < w) r4 = r0123[4];\n                        if (tj * 4 + 5 < w) r5 = r0123[5];\n                    }\n                }\n\n                short tmp12a = r3 - r1 * 4;\n                short tmp12b = r4 - r2 * 4;\n                short tmp34a = (r3 - r1) * 2;\n                short tmp34b = r4 - r2;\n\n                tmp[0][m] = r4 + r0 * 4 - r2 * 5;\n                tmp[1][m] = tmp12b + tmp12a;\n                tmp[2][m] = tmp12b - tmp12a;\n                tmp[3][m] = tmp34b + tmp34a;\n                tmp[4][m] = tmp34b - tmp34a;\n                tmp[5][m] = r5 + r1 * 4 - r3 * 5;\n\n                r0123 += w;\n            }\n\n            short* p0 = (short*)B + kk * max_jj * 36 + jj;\n            short* p1 = p0 + max_jj;\n            short* p2 = 
p0 + max_jj * 2;\n            short* p3 = p0 + max_jj * 3;\n            short* p4 = p0 + max_jj * 4;\n            short* p5 = p0 + max_jj * 5;\n\n            for (int m = 0; m < 6; m++)\n            {\n                short r0 = tmp[m][0];\n                short r1 = tmp[m][1];\n                short r2 = tmp[m][2];\n                short r3 = tmp[m][3];\n                short r4 = tmp[m][4];\n                short r5 = tmp[m][5];\n\n                short tmp12a = r3 - r1 * 4;\n                short tmp12b = r4 - r2 * 4;\n                short tmp34a = (r3 - r1) * 2;\n                short tmp34b = r4 - r2;\n\n                p0[0] = r4 + r0 * 4 - r2 * 5;\n                p1[0] = tmp12b + tmp12a;\n                p2[0] = tmp12b - tmp12a;\n                p3[0] = tmp34b + tmp34a;\n                p4[0] = tmp34b - tmp34a;\n                p5[0] = r5 + r1 * 4 - r3 * 5;\n\n                p0 += max_jj * 6;\n                p1 += max_jj * 6;\n                p2 += max_jj * 6;\n                p3 += max_jj * 6;\n                p4 += max_jj * 6;\n                p5 += max_jj * 6;\n            }\n        }\n    }\n}\n\nstatic inline void conv3x3s1_winograd43_transform_output_tile_int8(const Mat& top_tile, Mat& top_blob, int i, int max_ii, int j, int max_jj)\n{\n    // const int otm[4][6] = {\n    //     {1, 1,  1, 1,  1, 0},\n    //     {0, 1, -1, 2, -2, 0},\n    //     {0, 1,  1, 4,  4, 0},\n    //     {0, 1, -1, 8, -8, 1}\n    // };\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const size_t N = top_blob.cstep * out_elempack;\n\n    const int w_tiles = (outw + 3) / 4;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        int tmp[4][6][8];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * max_jj * 36 + 
jj * 8;\n            const int* r1 = r0 + max_jj * 8;\n            const int* r2 = r0 + max_jj * 8 * 2;\n            const int* r3 = r0 + max_jj * 8 * 3;\n            const int* r4 = r0 + max_jj * 8 * 4;\n            const int* r5 = r0 + max_jj * 8 * 5;\n\n            for (int m = 0; m < 5; m++)\n            {\n                int32x4_t _r00 = vld1q_s32(r0);\n                int32x4_t _r01 = vld1q_s32(r0 + 4);\n                int32x4_t _r10 = vld1q_s32(r1);\n                int32x4_t _r11 = vld1q_s32(r1 + 4);\n                int32x4_t _r20 = vld1q_s32(r2);\n                int32x4_t _r21 = vld1q_s32(r2 + 4);\n                int32x4_t _r30 = vld1q_s32(r3);\n                int32x4_t _r31 = vld1q_s32(r3 + 4);\n                int32x4_t _r40 = vld1q_s32(r4);\n                int32x4_t _r41 = vld1q_s32(r4 + 4);\n                int32x4_t _r50 = vld1q_s32(r5);\n                int32x4_t _r51 = vld1q_s32(r5 + 4);\n\n                int32x4_t _tmp02a0 = vaddq_s32(_r10, _r20);\n                int32x4_t _tmp02a1 = vaddq_s32(_r11, _r21);\n                int32x4_t _tmp02b0 = vaddq_s32(_r30, _r40);\n                int32x4_t _tmp02b1 = vaddq_s32(_r31, _r41);\n                int32x4_t _tmp13a0 = vsubq_s32(_r10, _r20);\n                int32x4_t _tmp13a1 = vsubq_s32(_r11, _r21);\n                int32x4_t _tmp13b0 = vsubq_s32(_r30, _r40);\n                int32x4_t _tmp13b1 = vsubq_s32(_r31, _r41);\n\n                int32x4_t _tmp00 = vaddq_s32(vaddq_s32(_tmp02a0, _tmp02b0), _r00);\n                int32x4_t _tmp01 = vaddq_s32(vaddq_s32(_tmp02a1, _tmp02b1), _r01);\n                int32x4_t _tmp10 = vaddq_s32(_tmp13a0, vshlq_n_s32(_tmp13b0, 1));\n                int32x4_t _tmp11 = vaddq_s32(_tmp13a1, vshlq_n_s32(_tmp13b1, 1));\n                int32x4_t _tmp20 = vaddq_s32(_tmp02a0, vshlq_n_s32(_tmp02b0, 2));\n                int32x4_t _tmp21 = vaddq_s32(_tmp02a1, vshlq_n_s32(_tmp02b1, 2));\n                int32x4_t _tmp30 = vaddq_s32(vaddq_s32(_tmp13a0, 
vshlq_n_s32(_tmp13b0, 3)), vshlq_n_s32(_r50, 2));\n                int32x4_t _tmp31 = vaddq_s32(vaddq_s32(_tmp13a1, vshlq_n_s32(_tmp13b1, 3)), vshlq_n_s32(_r51, 2));\n\n                vst1q_s32(tmp[0][m], _tmp00);\n                vst1q_s32(tmp[0][m] + 4, _tmp01);\n                vst1q_s32(tmp[1][m], _tmp10);\n                vst1q_s32(tmp[1][m] + 4, _tmp11);\n                vst1q_s32(tmp[2][m], _tmp20);\n                vst1q_s32(tmp[2][m] + 4, _tmp21);\n                vst1q_s32(tmp[3][m], _tmp30);\n                vst1q_s32(tmp[3][m] + 4, _tmp31);\n\n                r0 += max_jj * 6 * 8;\n                r1 += max_jj * 6 * 8;\n                r2 += max_jj * 6 * 8;\n                r3 += max_jj * 6 * 8;\n                r4 += max_jj * 6 * 8;\n                r5 += max_jj * 6 * 8;\n            }\n            for (int m = 5; m < 6; m++)\n            {\n                int32x4_t _r00 = vld1q_s32(r0);\n                int32x4_t _r01 = vld1q_s32(r0 + 4);\n                int32x4_t _r10 = vld1q_s32(r1);\n                int32x4_t _r11 = vld1q_s32(r1 + 4);\n                int32x4_t _r20 = vld1q_s32(r2);\n                int32x4_t _r21 = vld1q_s32(r2 + 4);\n                int32x4_t _r30 = vld1q_s32(r3);\n                int32x4_t _r31 = vld1q_s32(r3 + 4);\n                int32x4_t _r40 = vld1q_s32(r4);\n                int32x4_t _r41 = vld1q_s32(r4 + 4);\n                int32x4_t _r50 = vld1q_s32(r5);\n                int32x4_t _r51 = vld1q_s32(r5 + 4);\n\n                int32x4_t _tmp02a0 = vaddq_s32(_r10, _r20);\n                int32x4_t _tmp02a1 = vaddq_s32(_r11, _r21);\n                int32x4_t _tmp02b0 = vaddq_s32(_r30, _r40);\n                int32x4_t _tmp02b1 = vaddq_s32(_r31, _r41);\n                int32x4_t _tmp13a0 = vsubq_s32(_r10, _r20);\n                int32x4_t _tmp13a1 = vsubq_s32(_r11, _r21);\n                int32x4_t _tmp13b0 = vsubq_s32(_r30, _r40);\n                int32x4_t _tmp13b1 = vsubq_s32(_r31, _r41);\n\n                int32x4_t 
_tmp00 = vaddq_s32(vaddq_s32(_tmp02a0, _tmp02b0), _r00);\n                int32x4_t _tmp01 = vaddq_s32(vaddq_s32(_tmp02a1, _tmp02b1), _r01);\n                int32x4_t _tmp10 = vaddq_s32(_tmp13a0, vshlq_n_s32(_tmp13b0, 1));\n                int32x4_t _tmp11 = vaddq_s32(_tmp13a1, vshlq_n_s32(_tmp13b1, 1));\n                int32x4_t _tmp20 = vaddq_s32(_tmp02a0, vshlq_n_s32(_tmp02b0, 2));\n                int32x4_t _tmp21 = vaddq_s32(_tmp02a1, vshlq_n_s32(_tmp02b1, 2));\n                int32x4_t _tmp30 = vaddq_s32(vaddq_s32(_tmp13a0, vshlq_n_s32(_tmp13b0, 3)), vshlq_n_s32(_r50, 2));\n                int32x4_t _tmp31 = vaddq_s32(vaddq_s32(_tmp13a1, vshlq_n_s32(_tmp13b1, 3)), vshlq_n_s32(_r51, 2));\n\n                _tmp00 = vshlq_n_s32(_tmp00, 2);\n                _tmp01 = vshlq_n_s32(_tmp01, 2);\n                _tmp10 = vshlq_n_s32(_tmp10, 2);\n                _tmp11 = vshlq_n_s32(_tmp11, 2);\n                _tmp20 = vshlq_n_s32(_tmp20, 2);\n                _tmp21 = vshlq_n_s32(_tmp21, 2);\n                _tmp30 = vshlq_n_s32(_tmp30, 2);\n                _tmp31 = vshlq_n_s32(_tmp31, 2);\n\n                vst1q_s32(tmp[0][m], _tmp00);\n                vst1q_s32(tmp[0][m] + 4, _tmp01);\n                vst1q_s32(tmp[1][m], _tmp10);\n                vst1q_s32(tmp[1][m] + 4, _tmp11);\n                vst1q_s32(tmp[2][m], _tmp20);\n                vst1q_s32(tmp[2][m] + 4, _tmp21);\n                vst1q_s32(tmp[3][m], _tmp30);\n                vst1q_s32(tmp[3][m] + 4, _tmp31);\n\n                r0 += max_jj * 6 * 8;\n                r1 += max_jj * 6 * 8;\n                r2 += max_jj * 6 * 8;\n                r3 += max_jj * 6 * 8;\n                r4 += max_jj * 6 * 8;\n                r5 += max_jj * 6 * 8;\n            }\n\n            int* outptr0 = top_blob.channel((i + ii) / out_elempack).row<int>(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    
continue;\n\n                int32x4_t _r00 = vld1q_s32(tmp[m][0]);\n                int32x4_t _r01 = vld1q_s32(tmp[m][0] + 4);\n                int32x4_t _r10 = vld1q_s32(tmp[m][1]);\n                int32x4_t _r11 = vld1q_s32(tmp[m][1] + 4);\n                int32x4_t _r20 = vld1q_s32(tmp[m][2]);\n                int32x4_t _r21 = vld1q_s32(tmp[m][2] + 4);\n                int32x4_t _r30 = vld1q_s32(tmp[m][3]);\n                int32x4_t _r31 = vld1q_s32(tmp[m][3] + 4);\n                int32x4_t _r40 = vld1q_s32(tmp[m][4]);\n                int32x4_t _r41 = vld1q_s32(tmp[m][4] + 4);\n                int32x4_t _r50 = vld1q_s32(tmp[m][5]);\n                int32x4_t _r51 = vld1q_s32(tmp[m][5] + 4);\n\n                int32x4_t _tmp02a0 = vaddq_s32(_r10, _r20);\n                int32x4_t _tmp02a1 = vaddq_s32(_r11, _r21);\n                int32x4_t _tmp02b0 = vaddq_s32(_r30, _r40);\n                int32x4_t _tmp02b1 = vaddq_s32(_r31, _r41);\n                int32x4_t _tmp13a0 = vsubq_s32(_r10, _r20);\n                int32x4_t _tmp13a1 = vsubq_s32(_r11, _r21);\n                int32x4_t _tmp13b0 = vsubq_s32(_r30, _r40);\n                int32x4_t _tmp13b1 = vsubq_s32(_r31, _r41);\n\n                int32x4_t _tmp00 = vaddq_s32(vaddq_s32(_tmp02a0, _tmp02b0), _r00);\n                int32x4_t _tmp01 = vaddq_s32(vaddq_s32(_tmp02a1, _tmp02b1), _r01);\n                int32x4_t _tmp10 = vaddq_s32(_tmp13a0, vshlq_n_s32(_tmp13b0, 1));\n                int32x4_t _tmp11 = vaddq_s32(_tmp13a1, vshlq_n_s32(_tmp13b1, 1));\n                int32x4_t _tmp20 = vaddq_s32(_tmp02a0, vshlq_n_s32(_tmp02b0, 2));\n                int32x4_t _tmp21 = vaddq_s32(_tmp02a1, vshlq_n_s32(_tmp02b1, 2));\n                int32x4_t _tmp30 = vaddq_s32(vaddq_s32(_tmp13a0, vshlq_n_s32(_tmp13b0, 3)), _r50);\n                int32x4_t _tmp31 = vaddq_s32(vaddq_s32(_tmp13a1, vshlq_n_s32(_tmp13b1, 3)), _r51);\n\n                // TODO use integer trick for division by 576\n                float32x4_t _v576 
= vdupq_n_f32(1.0 / 576);\n                _tmp00 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp00), _v576));\n                _tmp01 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp01), _v576));\n                _tmp10 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp10), _v576));\n                _tmp11 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp11), _v576));\n                _tmp20 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp20), _v576));\n                _tmp21 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp21), _v576));\n                _tmp30 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp30), _v576));\n                _tmp31 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp31), _v576));\n\n                if (out_elempack == 8)\n                {\n                    vst1q_s32(outptr0, _tmp00);\n                    vst1q_s32(outptr0 + 4, _tmp01);\n                    if (tj * 4 + 1 < outw)\n                    {\n                        vst1q_s32(outptr0 + 8, _tmp10);\n                        vst1q_s32(outptr0 + 12, _tmp11);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        vst1q_s32(outptr0 + 16, _tmp20);\n                        vst1q_s32(outptr0 + 20, _tmp21);\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        vst1q_s32(outptr0 + 24, _tmp30);\n                        vst1q_s32(outptr0 + 28, _tmp31);\n                    }\n                }\n                if (out_elempack == 4)\n                {\n                    int* outptr1 = outptr0 + N;\n\n                    vst1q_s32(outptr0, _tmp00);\n                    vst1q_s32(outptr1, _tmp01);\n                    if (tj * 4 + 1 < outw)\n                    {\n                        vst1q_s32(outptr0 + 4, _tmp10);\n                        vst1q_s32(outptr1 + 4, _tmp11);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        
vst1q_s32(outptr0 + 8, _tmp20);\n                        vst1q_s32(outptr1 + 8, _tmp21);\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        vst1q_s32(outptr0 + 12, _tmp30);\n                        vst1q_s32(outptr1 + 12, _tmp31);\n                    }\n                }\n                if (out_elempack == 1)\n                {\n                    int* outptr1 = outptr0 + N;\n                    int* outptr2 = outptr0 + N * 2;\n                    int* outptr3 = outptr0 + N * 3;\n                    int* outptr4 = outptr0 + N * 4;\n                    int* outptr5 = outptr0 + N * 5;\n                    int* outptr6 = outptr0 + N * 6;\n                    int* outptr7 = outptr0 + N * 7;\n\n                    outptr0[0] = vgetq_lane_s32(_tmp00, 0);\n                    outptr1[0] = vgetq_lane_s32(_tmp00, 1);\n                    outptr2[0] = vgetq_lane_s32(_tmp00, 2);\n                    outptr3[0] = vgetq_lane_s32(_tmp00, 3);\n                    outptr4[0] = vgetq_lane_s32(_tmp01, 0);\n                    outptr5[0] = vgetq_lane_s32(_tmp01, 1);\n                    outptr6[0] = vgetq_lane_s32(_tmp01, 2);\n                    outptr7[0] = vgetq_lane_s32(_tmp01, 3);\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = vgetq_lane_s32(_tmp10, 0);\n                        outptr1[1] = vgetq_lane_s32(_tmp10, 1);\n                        outptr2[1] = vgetq_lane_s32(_tmp10, 2);\n                        outptr3[1] = vgetq_lane_s32(_tmp10, 3);\n                        outptr4[1] = vgetq_lane_s32(_tmp11, 0);\n                        outptr5[1] = vgetq_lane_s32(_tmp11, 1);\n                        outptr6[1] = vgetq_lane_s32(_tmp11, 2);\n                        outptr7[1] = vgetq_lane_s32(_tmp11, 3);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = vgetq_lane_s32(_tmp20, 0);\n    
                    outptr1[2] = vgetq_lane_s32(_tmp20, 1);\n                        outptr2[2] = vgetq_lane_s32(_tmp20, 2);\n                        outptr3[2] = vgetq_lane_s32(_tmp20, 3);\n                        outptr4[2] = vgetq_lane_s32(_tmp21, 0);\n                        outptr5[2] = vgetq_lane_s32(_tmp21, 1);\n                        outptr6[2] = vgetq_lane_s32(_tmp21, 2);\n                        outptr7[2] = vgetq_lane_s32(_tmp21, 3);\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = vgetq_lane_s32(_tmp30, 0);\n                        outptr1[3] = vgetq_lane_s32(_tmp30, 1);\n                        outptr2[3] = vgetq_lane_s32(_tmp30, 2);\n                        outptr3[3] = vgetq_lane_s32(_tmp30, 3);\n                        outptr4[3] = vgetq_lane_s32(_tmp31, 0);\n                        outptr5[3] = vgetq_lane_s32(_tmp31, 1);\n                        outptr6[3] = vgetq_lane_s32(_tmp31, 2);\n                        outptr7[3] = vgetq_lane_s32(_tmp31, 3);\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        int tmp[4][6][4];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * max_jj * 36 + jj * 4;\n            const int* r1 = r0 + max_jj * 4;\n            const int* r2 = r0 + max_jj * 4 * 2;\n            const int* r3 = r0 + max_jj * 4 * 3;\n            const int* r4 = r0 + max_jj * 4 * 4;\n            const int* r5 = r0 + max_jj * 4 * 5;\n\n            for (int m = 0; m < 5; m++)\n            {\n                int32x4_t _r0 = vld1q_s32(r0);\n                int32x4_t _r1 = vld1q_s32(r1);\n                int32x4_t _r2 = vld1q_s32(r2);\n                int32x4_t _r3 = vld1q_s32(r3);\n           
     int32x4_t _r4 = vld1q_s32(r4);\n                int32x4_t _r5 = vld1q_s32(r5);\n\n                int32x4_t _tmp02a = vaddq_s32(_r1, _r2);\n                int32x4_t _tmp02b = vaddq_s32(_r3, _r4);\n                int32x4_t _tmp13a = vsubq_s32(_r1, _r2);\n                int32x4_t _tmp13b = vsubq_s32(_r3, _r4);\n\n                int32x4_t _tmp0 = vaddq_s32(vaddq_s32(_tmp02a, _tmp02b), _r0);\n                int32x4_t _tmp1 = vaddq_s32(_tmp13a, vshlq_n_s32(_tmp13b, 1));\n                int32x4_t _tmp2 = vaddq_s32(_tmp02a, vshlq_n_s32(_tmp02b, 2));\n                int32x4_t _tmp3 = vaddq_s32(vaddq_s32(_tmp13a, vshlq_n_s32(_tmp13b, 3)), vshlq_n_s32(_r5, 2));\n\n                vst1q_s32(tmp[0][m], _tmp0);\n                vst1q_s32(tmp[1][m], _tmp1);\n                vst1q_s32(tmp[2][m], _tmp2);\n                vst1q_s32(tmp[3][m], _tmp3);\n\n                r0 += max_jj * 6 * 4;\n                r1 += max_jj * 6 * 4;\n                r2 += max_jj * 6 * 4;\n                r3 += max_jj * 6 * 4;\n                r4 += max_jj * 6 * 4;\n                r5 += max_jj * 6 * 4;\n            }\n            for (int m = 5; m < 6; m++)\n            {\n                int32x4_t _r0 = vld1q_s32(r0);\n                int32x4_t _r1 = vld1q_s32(r1);\n                int32x4_t _r2 = vld1q_s32(r2);\n                int32x4_t _r3 = vld1q_s32(r3);\n                int32x4_t _r4 = vld1q_s32(r4);\n                int32x4_t _r5 = vld1q_s32(r5);\n\n                int32x4_t _tmp02a = vaddq_s32(_r1, _r2);\n                int32x4_t _tmp02b = vaddq_s32(_r3, _r4);\n                int32x4_t _tmp13a = vsubq_s32(_r1, _r2);\n                int32x4_t _tmp13b = vsubq_s32(_r3, _r4);\n\n                int32x4_t _tmp0 = vaddq_s32(vaddq_s32(_tmp02a, _tmp02b), _r0);\n                int32x4_t _tmp1 = vaddq_s32(_tmp13a, vshlq_n_s32(_tmp13b, 1));\n                int32x4_t _tmp2 = vaddq_s32(_tmp02a, vshlq_n_s32(_tmp02b, 2));\n                int32x4_t _tmp3 = vaddq_s32(vaddq_s32(_tmp13a, 
vshlq_n_s32(_tmp13b, 3)), vshlq_n_s32(_r5, 2));\n\n                _tmp0 = vshlq_n_s32(_tmp0, 2);\n                _tmp1 = vshlq_n_s32(_tmp1, 2);\n                _tmp2 = vshlq_n_s32(_tmp2, 2);\n                _tmp3 = vshlq_n_s32(_tmp3, 2);\n\n                vst1q_s32(tmp[0][m], _tmp0);\n                vst1q_s32(tmp[1][m], _tmp1);\n                vst1q_s32(tmp[2][m], _tmp2);\n                vst1q_s32(tmp[3][m], _tmp3);\n\n                r0 += max_jj * 6 * 4;\n                r1 += max_jj * 6 * 4;\n                r2 += max_jj * 6 * 4;\n                r3 += max_jj * 6 * 4;\n                r4 += max_jj * 6 * 4;\n                r5 += max_jj * 6 * 4;\n            }\n\n            int* outptr0 = top_blob.channel((i + ii) / out_elempack).row<int>(ti * 4) + (tj * 4) * out_elempack;\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                int32x4_t _r0 = vld1q_s32(tmp[m][0]);\n                int32x4_t _r1 = vld1q_s32(tmp[m][1]);\n                int32x4_t _r2 = vld1q_s32(tmp[m][2]);\n                int32x4_t _r3 = vld1q_s32(tmp[m][3]);\n                int32x4_t _r4 = vld1q_s32(tmp[m][4]);\n                int32x4_t _r5 = vld1q_s32(tmp[m][5]);\n\n                int32x4_t _tmp02a = vaddq_s32(_r1, _r2);\n                int32x4_t _tmp02b = vaddq_s32(_r3, _r4);\n                int32x4_t _tmp13a = vsubq_s32(_r1, _r2);\n                int32x4_t _tmp13b = vsubq_s32(_r3, _r4);\n\n                int32x4_t _tmp0 = vaddq_s32(vaddq_s32(_tmp02a, _tmp02b), _r0);\n                int32x4_t _tmp1 = vaddq_s32(_tmp13a, vshlq_n_s32(_tmp13b, 1));\n                int32x4_t _tmp2 = vaddq_s32(_tmp02a, vshlq_n_s32(_tmp02b, 2));\n                int32x4_t _tmp3 = vaddq_s32(vaddq_s32(_tmp13a, vshlq_n_s32(_tmp13b, 3)), _r5);\n\n                // TODO use integer trick for division by 576\n                float32x4_t _v576 = vdupq_n_f32(1.0 / 576);\n                _tmp0 = 
vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp0), _v576));\n                _tmp1 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp1), _v576));\n                _tmp2 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp2), _v576));\n                _tmp3 = vcvtq_s32_f32(vmulq_f32(vcvtq_f32_s32(_tmp3), _v576));\n\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _tmp0);\n                    if (tj * 4 + 1 < outw) vst1q_s32(outptr0 + 4, _tmp1);\n                    if (tj * 4 + 2 < outw) vst1q_s32(outptr0 + 8, _tmp2);\n                    if (tj * 4 + 3 < outw) vst1q_s32(outptr0 + 12, _tmp3);\n                }\n                if (out_elempack == 1)\n                {\n                    int* outptr1 = outptr0 + N;\n                    int* outptr2 = outptr0 + N * 2;\n                    int* outptr3 = outptr0 + N * 3;\n\n                    outptr0[0] = vgetq_lane_s32(_tmp0, 0);\n                    outptr1[0] = vgetq_lane_s32(_tmp0, 1);\n                    outptr2[0] = vgetq_lane_s32(_tmp0, 2);\n                    outptr3[0] = vgetq_lane_s32(_tmp0, 3);\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = vgetq_lane_s32(_tmp1, 0);\n                        outptr1[1] = vgetq_lane_s32(_tmp1, 1);\n                        outptr2[1] = vgetq_lane_s32(_tmp1, 2);\n                        outptr3[1] = vgetq_lane_s32(_tmp1, 3);\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = vgetq_lane_s32(_tmp2, 0);\n                        outptr1[2] = vgetq_lane_s32(_tmp2, 1);\n                        outptr2[2] = vgetq_lane_s32(_tmp2, 2);\n                        outptr3[2] = vgetq_lane_s32(_tmp2, 3);\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = vgetq_lane_s32(_tmp3, 0);\n                        outptr1[3] = vgetq_lane_s32(_tmp3, 1);\n       
                 outptr2[3] = vgetq_lane_s32(_tmp3, 2);\n                        outptr3[3] = vgetq_lane_s32(_tmp3, 3);\n                    }\n                }\n\n                outptr0 += outw * out_elempack;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        int tmp[4][6][2];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * max_jj * 36 + jj * 2;\n            const int* r1 = r0 + max_jj * 2;\n            const int* r2 = r0 + max_jj * 2 * 2;\n            const int* r3 = r0 + max_jj * 2 * 3;\n            const int* r4 = r0 + max_jj * 2 * 4;\n            const int* r5 = r0 + max_jj * 2 * 5;\n\n            for (int m = 0; m < 5; m++)\n            {\n                int tmp02a0 = r1[0] + r2[0];\n                int tmp02a1 = r1[1] + r2[1];\n                int tmp02b0 = r3[0] + r4[0];\n                int tmp02b1 = r3[1] + r4[1];\n                int tmp13a0 = r1[0] - r2[0];\n                int tmp13a1 = r1[1] - r2[1];\n                int tmp13b0 = r3[0] - r4[0];\n                int tmp13b1 = r3[1] - r4[1];\n\n                int tmp00 = tmp02a0 + tmp02b0 + r0[0];\n                int tmp01 = tmp02a1 + tmp02b1 + r0[1];\n                int tmp10 = tmp13a0 + tmp13b0 * 2;\n                int tmp11 = tmp13a1 + tmp13b1 * 2;\n                int tmp20 = tmp02a0 + tmp02b0 * 4;\n                int tmp21 = tmp02a1 + tmp02b1 * 4;\n                int tmp30 = tmp13a0 + tmp13b0 * 8 + r5[0] * 4;\n                int tmp31 = tmp13a1 + tmp13b1 * 8 + r5[1] * 4;\n\n                tmp[0][m][0] = tmp00;\n                tmp[0][m][1] = tmp01;\n                tmp[1][m][0] = tmp10;\n                tmp[1][m][1] = tmp11;\n                tmp[2][m][0] = tmp20;\n                tmp[2][m][1] = tmp21;\n                tmp[3][m][0] = tmp30;\n                
tmp[3][m][1] = tmp31;\n\n                r0 += max_jj * 6 * 2;\n                r1 += max_jj * 6 * 2;\n                r2 += max_jj * 6 * 2;\n                r3 += max_jj * 6 * 2;\n                r4 += max_jj * 6 * 2;\n                r5 += max_jj * 6 * 2;\n            }\n            for (int m = 5; m < 6; m++)\n            {\n                int tmp02a0 = r1[0] + r2[0];\n                int tmp02a1 = r1[1] + r2[1];\n                int tmp02b0 = r3[0] + r4[0];\n                int tmp02b1 = r3[1] + r4[1];\n                int tmp13a0 = r1[0] - r2[0];\n                int tmp13a1 = r1[1] - r2[1];\n                int tmp13b0 = r3[0] - r4[0];\n                int tmp13b1 = r3[1] - r4[1];\n\n                int tmp00 = tmp02a0 + tmp02b0 + r0[0];\n                int tmp01 = tmp02a1 + tmp02b1 + r0[1];\n                int tmp10 = tmp13a0 + tmp13b0 * 2;\n                int tmp11 = tmp13a1 + tmp13b1 * 2;\n                int tmp20 = tmp02a0 + tmp02b0 * 4;\n                int tmp21 = tmp02a1 + tmp02b1 * 4;\n                int tmp30 = tmp13a0 + tmp13b0 * 8 + r5[0] * 4;\n                int tmp31 = tmp13a1 + tmp13b1 * 8 + r5[1] * 4;\n\n                tmp00 = tmp00 * 4;\n                tmp01 = tmp01 * 4;\n                tmp10 = tmp10 * 4;\n                tmp11 = tmp11 * 4;\n                tmp20 = tmp20 * 4;\n                tmp21 = tmp21 * 4;\n                tmp30 = tmp30 * 4;\n                tmp31 = tmp31 * 4;\n\n                tmp[0][m][0] = tmp00;\n                tmp[0][m][1] = tmp01;\n                tmp[1][m][0] = tmp10;\n                tmp[1][m][1] = tmp11;\n                tmp[2][m][0] = tmp20;\n                tmp[2][m][1] = tmp21;\n                tmp[3][m][0] = tmp30;\n                tmp[3][m][1] = tmp31;\n\n                r0 += max_jj * 6 * 2;\n                r1 += max_jj * 6 * 2;\n                r2 += max_jj * 6 * 2;\n                r3 += max_jj * 6 * 2;\n                r4 += max_jj * 6 * 2;\n                r5 += max_jj * 6 * 2;\n            
}\n\n            int* outptr0 = top_blob.channel(i + ii).row<int>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                int tmp02a0 = tmp[m][1][0] + tmp[m][2][0];\n                int tmp02a1 = tmp[m][1][1] + tmp[m][2][1];\n                int tmp02b0 = tmp[m][3][0] + tmp[m][4][0];\n                int tmp02b1 = tmp[m][3][1] + tmp[m][4][1];\n                int tmp13a0 = tmp[m][1][0] - tmp[m][2][0];\n                int tmp13a1 = tmp[m][1][1] - tmp[m][2][1];\n                int tmp13b0 = tmp[m][3][0] - tmp[m][4][0];\n                int tmp13b1 = tmp[m][3][1] - tmp[m][4][1];\n\n                int tmp00 = tmp02a0 + tmp02b0 + tmp[m][0][0];\n                int tmp01 = tmp02a1 + tmp02b1 + tmp[m][0][1];\n                int tmp10 = tmp13a0 + tmp13b0 * 2;\n                int tmp11 = tmp13a1 + tmp13b1 * 2;\n                int tmp20 = tmp02a0 + tmp02b0 * 4;\n                int tmp21 = tmp02a1 + tmp02b1 * 4;\n                int tmp30 = tmp13a0 + tmp13b0 * 8 + tmp[m][5][0];\n                int tmp31 = tmp13a1 + tmp13b1 * 8 + tmp[m][5][1];\n\n                tmp00 = tmp00 / 576;\n                tmp01 = tmp01 / 576;\n                tmp10 = tmp10 / 576;\n                tmp11 = tmp11 / 576;\n                tmp20 = tmp20 / 576;\n                tmp21 = tmp21 / 576;\n                tmp30 = tmp30 / 576;\n                tmp31 = tmp31 / 576;\n\n                // if (out_elempack == 1)\n                {\n                    int* outptr1 = outptr0 + N;\n\n                    outptr0[0] = tmp00;\n                    outptr1[0] = tmp01;\n                    if (tj * 4 + 1 < outw)\n                    {\n                        outptr0[1] = tmp10;\n                        outptr1[1] = tmp11;\n                    }\n                    if (tj * 4 + 2 < outw)\n                    {\n                        outptr0[2] = tmp20;\n                        outptr1[2] 
= tmp21;\n                    }\n                    if (tj * 4 + 3 < outw)\n                    {\n                        outptr0[3] = tmp30;\n                        outptr1[3] = tmp31;\n                    }\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n    for (; ii < max_ii; ii++)\n    {\n        int tmp[4][6];\n\n        int jj = 0;\n        for (; jj < max_jj; jj++)\n        {\n            int ti = (j + jj) / w_tiles;\n            int tj = (j + jj) % w_tiles;\n\n            const int* r0 = (const int*)top_tile + ii * max_jj * 36 + jj;\n            const int* r1 = r0 + max_jj;\n            const int* r2 = r0 + max_jj * 2;\n            const int* r3 = r0 + max_jj * 3;\n            const int* r4 = r0 + max_jj * 4;\n            const int* r5 = r0 + max_jj * 5;\n\n            for (int m = 0; m < 5; m++)\n            {\n                int tmp02a = r1[0] + r2[0];\n                int tmp02b = r3[0] + r4[0];\n                int tmp13a = r1[0] - r2[0];\n                int tmp13b = r3[0] - r4[0];\n\n                int tmp0 = tmp02a + tmp02b + r0[0];\n                int tmp1 = tmp13a + tmp13b * 2;\n                int tmp2 = tmp02a + tmp02b * 4;\n                int tmp3 = tmp13a + tmp13b * 8 + r5[0] * 4;\n\n                tmp[0][m] = tmp0;\n                tmp[1][m] = tmp1;\n                tmp[2][m] = tmp2;\n                tmp[3][m] = tmp3;\n\n                r0 += max_jj * 6;\n                r1 += max_jj * 6;\n                r2 += max_jj * 6;\n                r3 += max_jj * 6;\n                r4 += max_jj * 6;\n                r5 += max_jj * 6;\n            }\n            for (int m = 5; m < 6; m++)\n            {\n                int tmp02a = r1[0] + r2[0];\n                int tmp02b = r3[0] + r4[0];\n                int tmp13a = r1[0] - r2[0];\n                int tmp13b = r3[0] - r4[0];\n\n                int tmp0 = tmp02a + tmp02b + r0[0];\n                int tmp1 = tmp13a + tmp13b * 2;\n                int 
tmp2 = tmp02a + tmp02b * 4;\n                int tmp3 = tmp13a + tmp13b * 8 + r5[0] * 4;\n\n                tmp0 = tmp0 * 4;\n                tmp1 = tmp1 * 4;\n                tmp2 = tmp2 * 4;\n                tmp3 = tmp3 * 4;\n\n                tmp[0][m] = tmp0;\n                tmp[1][m] = tmp1;\n                tmp[2][m] = tmp2;\n                tmp[3][m] = tmp3;\n\n                r0 += max_jj * 6;\n                r1 += max_jj * 6;\n                r2 += max_jj * 6;\n                r3 += max_jj * 6;\n                r4 += max_jj * 6;\n                r5 += max_jj * 6;\n            }\n\n            int* outptr0 = top_blob.channel(i + ii).row<int>(ti * 4) + (tj * 4);\n\n            for (int m = 0; m < 4; m++)\n            {\n                if (ti * 4 + m >= outh)\n                    continue;\n\n                int tmp02a = tmp[m][1] + tmp[m][2];\n                int tmp02b = tmp[m][3] + tmp[m][4];\n                int tmp13a = tmp[m][1] - tmp[m][2];\n                int tmp13b = tmp[m][3] - tmp[m][4];\n\n                int tmp0 = tmp02a + tmp02b + tmp[m][0];\n                int tmp1 = tmp13a + tmp13b * 2;\n                int tmp2 = tmp02a + tmp02b * 4;\n                int tmp3 = tmp13a + tmp13b * 8 + tmp[m][5];\n\n                tmp0 = tmp0 / 576;\n                tmp1 = tmp1 / 576;\n                tmp2 = tmp2 / 576;\n                tmp3 = tmp3 / 576;\n\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = tmp0;\n                    if (tj * 4 + 1 < outw) outptr0[1] = tmp1;\n                    if (tj * 4 + 2 < outw) outptr0[2] = tmp2;\n                    if (tj * 4 + 3 < outw) outptr0[3] = tmp3;\n                }\n\n                outptr0 += outw;\n            }\n        }\n    }\n}\n\nstatic int conv3x3s1_winograd43_int8(Mat& bottom_blob, Mat& top_blob, const Mat& AT, int nT, const Option& opt)\n{\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    // pad to 4n+2, winograd F(4,3)\n    int 
w_tiles = (outw + 3) / 4;\n    int h_tiles = (outh + 3) / 4;\n    int tiles = w_tiles * h_tiles;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = tiles;\n    const int K = bottom_blob.c * bottom_blob.elempack;\n    const int B = 36;\n\n    // NCNN_LOGE(\"conv3x3s1_winograd43_int8 %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_int8(M, N, K, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, B, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    if (nT > 1 && nn_NK < nT)\n    {\n        Mat B_tile(TILE_N * B * TILE_K, 2u, opt.workspace_allocator);\n        if (B_tile.empty())\n            return -100;\n\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            // transform input\n            conv3x3s1_winograd43_transform_input_tile_int8(bottom_blob, B_tile, j, max_jj, k, max_kk, nT);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            transpose_pack_B_tile_int8(B_tile, BT_tile, B, max_jj, max_kk, nT);\n        }\n    }\n    else\n    {\n        Mat B_tileX(TILE_N * B * TILE_K, 1, nT, 2u, opt.workspace_allocator);\n        if (B_tileX.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(nT)\n        for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n        {\n            const int ppj = 
ppjk / nn_K;\n            const int ppk = ppjk % nn_K;\n\n            const int j = ppj * TILE_N;\n            const int k = ppk * TILE_K;\n\n            const int max_jj = std::min((N - j), TILE_N);\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat B_tile = B_tileX.channel(get_omp_thread_num());\n\n            // transform input\n            conv3x3s1_winograd43_transform_input_tile_int8(bottom_blob, B_tile, j, max_jj, k, max_kk, 1);\n\n            Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n            transpose_pack_B_tile_int8(B_tile, BT_tile, B, max_jj, max_kk, 1);\n        }\n    }\n\n    bottom_blob.release();\n\n    Mat top_tileX(TILE_N * B * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (top_tileX.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat top_tile = top_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).depth(k / TILE_K);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).depth(k / TILE_K);\n\n                gemm_transB_packed_tile_int8(AT_tile, BT_tile, top_tile, B, max_ii, max_jj, k, max_kk);\n            }\n\n            // transform output\n            conv3x3s1_winograd43_transform_output_tile_int8(top_tile, top_blob, i, max_ii, j, max_jj);\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_4x4.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv4x4s4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 4 * outw + w * 3;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 16 + q * 16;\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n            const float* r3 = img0 + w * 3;\n\n#if __ARM_NEON\n            float32x4_t _k0123 = vld1q_f32(kernel0);\n            float32x4_t _k4567 = vld1q_f32(kernel0 + 4);\n            float32x4_t _k891011 = vld1q_f32(kernel0 + 8);\n            float32x4_t _k12131415 = vld1q_f32(kernel0 + 12);\n#else\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 4;\n            const float* k2 = kernel0 + 8;\n            const float* k3 = kernel0 + 12;\n#endif // __ARM_NEON\n\n            for (int i = 0; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw - (nn << 2);\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"0:    
                                    \\n\"\n\n                        \"prfm       pldl1keep, [%2, #512]          \\n\"\n                        \"prfm       pldl1keep, [%3, #512]          \\n\"\n\n                        \"ld1        {v7.4s}, [%1]                  \\n\" // v7 = outptr\n\n                        \"ld1        {v8.4s}, [%2], #16             \\n\" // v8  = r0\n                        \"ld1        {v9.4s}, [%3], #16             \\n\" // v9  = r1\n\n                        \"prfm       pldl1keep, [%4, #512]          \\n\"\n                        \"prfm       pldl1keep, [%5, #512]          \\n\"\n\n                        \"fmul       v12.4s, v8.4s, %12.4s          \\n\"\n                        \"fmul       v13.4s, v9.4s, %13.4s          \\n\"\n\n                        \"ld1        {v10.4s}, [%4], #16            \\n\" // v10 = r2\n                        \"ld1        {v11.4s}, [%5], #16            \\n\" // v11 = r3\n\n                        \"fmla       v12.4s, v10.4s, %14.4s         \\n\"\n                        \"fmla       v13.4s, v11.4s, %15.4s         \\n\"\n\n                        \"fadd       v5.4s, v12.4s, v13.4s          \\n\"\n\n                        \"ld1        {v8.4s}, [%2], #16             \\n\" // v8  = r0\n                        \"ld1        {v9.4s}, [%3], #16             \\n\" // v9  = r1\n\n                        \"fmul       v12.4s, v8.4s, %12.4s          \\n\"\n                        \"fmul       v13.4s, v9.4s, %13.4s          \\n\"\n\n                        \"ld1        {v10.4s}, [%4], #16            \\n\" // v10 = r2\n                        \"ld1        {v11.4s}, [%5], #16            \\n\" // v11 = r3\n\n                        \"fmla       v12.4s, v10.4s, %14.4s         \\n\"\n                        \"fmla       v13.4s, v11.4s, %15.4s         \\n\"\n\n                        \"fadd       v6.4s, v12.4s, v13.4s          \\n\"\n\n                        \"ld1        {v8.4s}, [%2], #16             \\n\" // v8  = r0\n      
                  \"ld1        {v9.4s}, [%3], #16             \\n\" // v9  = r1\n\n                        \"fmul       v12.4s, v8.4s, %12.4s          \\n\"\n                        \"fmul       v13.4s, v9.4s, %13.4s          \\n\"\n\n                        \"ld1        {v10.4s}, [%4], #16            \\n\" // v10 = r2\n                        \"ld1        {v11.4s}, [%5], #16            \\n\" // v11 = r3\n\n                        \"fmla       v12.4s, v10.4s, %14.4s         \\n\"\n                        \"fmla       v13.4s, v11.4s, %15.4s         \\n\"\n\n                        \"fadd       v14.4s, v12.4s, v13.4s         \\n\"\n                        \"faddp      v5.4s, v5.4s, v6.4s            \\n\" // Move to here to enhance ILP\n\n                        \"ld1        {v8.4s}, [%2], #16             \\n\" // v8  = r0\n                        \"ld1        {v9.4s}, [%3], #16             \\n\" // v9  = r1\n\n                        \"fmul       v12.4s, v8.4s, %12.4s          \\n\"\n                        \"fmul       v13.4s, v9.4s, %13.4s          \\n\"\n\n                        \"ld1        {v10.4s}, [%4], #16            \\n\" // v10 = r2\n                        \"ld1        {v11.4s}, [%5], #16            \\n\" // v11 = r3\n\n                        \"fmla       v12.4s, v10.4s, %14.4s         \\n\"\n                        \"fmla       v13.4s, v11.4s, %15.4s         \\n\"\n\n                        \"fadd       v15.4s, v12.4s, v13.4s         \\n\"\n\n                        //                  \"faddp      v5.4s ,  v5.4s,  v6.4s         \\n\"  // Move this line upward.\n                        \"faddp      v14.4s, v14.4s, v15.4s         \\n\"\n                        \"faddp      v5.4s ,  v5.4s, v14.4s         \\n\"\n\n                        \"fadd       v7.4s, v7.4s, v5.4s            \\n\"\n\n                        \"st1        {v7.4s}, [%1], #16             \\n\"\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n\n                
        \"subs       %w0, %w0, #1                   \\n\"\n                        \"bne        0b                             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3)      // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"w\"(_k0123),    // %12\n                        \"w\"(_k4567),    // %13\n                        \"w\"(_k891011),  // %14\n                        \"w\"(_k12131415) // %15\n                        : \"cc\", \"memory\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n\n                        \"pld        [%1, #128]          \\n\"\n\n                        \"0:                             \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"pld        [%3, #512]          \\n\"\n\n                        \"vld1.f32   {d14-d15}, [%1]     \\n\" // q7 = outptr\n\n                        \"vld1.f32   {d16-d17}, [%2]!    \\n\" // q8  = r0\n                        \"vld1.f32   {d18-d19}, [%3]!    \\n\" // q9  = r1\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"pld        [%5, #512]          \\n\"\n\n                        \"vmul.f32   q12, q8, %q12       \\n\"\n                        \"vmul.f32   q13, q9, %q13       \\n\"\n\n                        \"vld1.f32   {d20-d21}, [%4]!    \\n\" // q10 = r2\n                        \"vld1.f32   {d22-d23}, [%5]!    
\\n\" // q11 = r3\n\n                        \"vmla.f32   q12, q10, %q14      \\n\"\n                        \"vmla.f32   q13, q11, %q15      \\n\"\n\n                        \"vadd.f32   q5, q12, q13        \\n\"\n\n                        \"vld1.f32   {d16-d17}, [%2]!    \\n\" // q8  = r0\n                        \"vld1.f32   {d18-d19}, [%3]!    \\n\" // q9  = r1\n\n                        \"vmul.f32   q12, q8, %q12       \\n\"\n                        \"vmul.f32   q13, q9, %q13       \\n\"\n\n                        \"vld1.f32   {d20-d21}, [%4]!    \\n\" // q10 = r2\n                        \"vld1.f32   {d22-d23}, [%5]!    \\n\" // q11 = r3\n\n                        \"vmla.f32   q12, q10, %q14      \\n\"\n                        \"vmla.f32   q13, q11, %q15      \\n\"\n\n                        \"vadd.f32   q6, q12, q13        \\n\"\n\n                        \"vld1.f32   {d16-d17}, [%2]!    \\n\" // q8  = r0\n                        \"vld1.f32   {d18-d19}, [%3]!    \\n\" // q9  = r1\n\n                        \"vmul.f32   q12, q8, %q12       \\n\"\n                        \"vmul.f32   q13, q9, %q13       \\n\"\n\n                        \"vld1.f32   {d20-d21}, [%4]!    \\n\" // q10 = r2\n                        \"vld1.f32   {d22-d23}, [%5]!    \\n\" // q11 = r3\n\n                        \"vmla.f32   q12, q10, %q14      \\n\"\n                        \"vmla.f32   q13, q11, %q15      \\n\"\n\n                        \"vadd.f32   q14, q12, q13       \\n\"\n\n                        \"vld1.f32   {d16-d17}, [%2]!    \\n\" // q8  = r0\n                        \"vld1.f32   {d18-d19}, [%3]!    \\n\" // q9  = r1\n\n                        \"vmul.f32   q12, q8, %q12       \\n\"\n                        \"vmul.f32   q13, q9, %q13       \\n\"\n\n                        \"vld1.f32   {d20-d21}, [%4]!    \\n\" // q10 = r2\n                        \"vld1.f32   {d22-d23}, [%5]!    
\\n\" // q11 = r3\n\n                        \"vmla.f32   q12, q10, %q14      \\n\"\n                        \"vmla.f32   q13, q11, %q15      \\n\"\n\n                        \"vadd.f32   q15, q12, q13       \\n\"\n\n                        \"vadd.f32   d10, d10, d11       \\n\"\n                        \"vadd.f32   d28, d28, d29       \\n\"\n                        \"vadd.f32   d11, d12, d13       \\n\"\n                        \"vadd.f32   d29, d30, d31       \\n\"\n\n                        \"vpadd.f32  d10, d10, d11       \\n\"\n                        \"vpadd.f32  d11, d28, d29       \\n\"\n\n                        \"vadd.f32   q7, q7, q5          \\n\"\n\n                        \"vst1.f32   {d14-d15}, [%1]!    \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3)      // %5\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"w\"(_k0123),    // %12\n                        \"w\"(_k4567),    // %13\n                        \"w\"(_k891011),  // %14\n                        \"w\"(_k12131415) // %15\n                        : \"cc\", \"memory\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n#if __ARM_NEON\n#if __aarch64__\n                    float sum = 0.f;\n\n                   
 asm volatile(\n                        \"ld1        {v8.4s}, [%0], #16             \\n\" // v8  = r0\n                        \"ld1        {v9.4s}, [%1], #16             \\n\" // v9  = r1\n\n                        \"fmul       v12.4s, v8.4s, %9.4s           \\n\"\n                        \"fmul       v13.4s, v9.4s, %10.4s          \\n\"\n\n                        \"ld1        {v10.4s}, [%2], #16            \\n\" // v10 = r2\n                        \"ld1        {v11.4s}, [%3], #16            \\n\" // v11 = r3\n\n                        \"fmla       v12.4s, v10.4s, %11.4s         \\n\"\n                        \"fmla       v13.4s, v11.4s, %12.4s         \\n\"\n\n                        \"fadd       v5.4s, v12.4s, v13.4s          \\n\"\n                        \"faddp      v5.4s, v5.4s, v5.4s            \\n\"\n                        \"faddp      s5, v5.2s                      \\n\"\n                        \"fmov       %w4, s5                        \\n\"\n                        : \"=r\"(r0), // %0\n                        \"=r\"(r1), // %1\n                        \"=r\"(r2), // %2\n                        \"=r\"(r3), // %3\n                        \"=r\"(sum) // %4\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(r2),\n                        \"3\"(r3),\n                        \"w\"(_k0123),    // %9\n                        \"w\"(_k4567),    // %10\n                        \"w\"(_k891011),  // %11\n                        \"w\"(_k12131415) // %12\n                        : \"cc\", \"memory\", \"v5\", \"v6\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\");\n\n                    *outptr += sum;\n#else\n                    float sum = 0.f;\n\n                    asm volatile(\n                        \"vld1.f32   {d16-d17}, [%0]!    \\n\" // q8  = r0\n                        \"vld1.f32   {d18-d19}, [%1]!    
\\n\" // q9  = r1\n\n                        \"vmul.f32   q12, q8, %q9        \\n\"\n                        \"vmul.f32   q13, q9, %q10       \\n\"\n\n                        \"vld1.f32   {d20-d21}, [%2]!    \\n\" // q10 = r2\n                        \"vld1.f32   {d22-d23}, [%3]!    \\n\" // q11 = r3\n\n                        \"vmla.f32   q12, q10, %q11      \\n\"\n                        \"vmla.f32   q13, q11, %q12      \\n\"\n\n                        \"vadd.f32   q5, q12, q13        \\n\"\n                        \"vadd.f32   d10, d10, d11       \\n\"\n                        \"vpadd.f32  d10, d10, d10       \\n\"\n                        \"vmov.f32   %4, d10[0]          \\n\"\n                        : \"=r\"(r0), // %0\n                        \"=r\"(r1), // %1\n                        \"=r\"(r2), // %2\n                        \"=r\"(r3), // %3\n                        \"=r\"(sum) // %4\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(r2),\n                        \"3\"(r3),\n                        \"w\"(_k0123),    // %9\n                        \"w\"(_k4567),    // %10\n                        \"w\"(_k891011),  // %11\n                        \"w\"(_k12131415) // %12\n                        : \"cc\", \"memory\", \"q5\", \"q6\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\");\n\n                    *outptr += sum;\n#endif // __aarch64__\n#else\n                    float sum = 0;\n\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r0[3] * k0[3];\n\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r1[3] * k1[3];\n\n                    sum += r2[0] * k2[0];\n                    sum += r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n                    sum += r2[3] * k2[3];\n\n  
                  sum += r3[0] * k3[0];\n                    sum += r3[1] * k3[1];\n                    sum += r3[2] * k3[2];\n                    sum += r3[3] * k3[3];\n\n                    *outptr += sum;\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                    r3 += 4;\n#endif // __ARM_NEON\n                    outptr++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_5x5.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv5x5s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n            float* outptr2 = outptr + outw;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 25 + q * 25;\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n            const float* r3 = img0 + w * 3;\n            const float* r4 = img0 + w * 4;\n            const float* r5 = img0 + w * 5;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 5;\n            const float* k2 = kernel0 + 10;\n            const float* k3 = kernel0 + 15;\n            const float* k4 = kernel0 + 20;\n\n#if __ARM_NEON\n            float32x4_t _k0123 = vld1q_f32(kernel0);\n            float32x4_t _k4567 = vld1q_f32(kernel0 + 4);\n            float32x4_t _k891011 = vld1q_f32(kernel0 + 8);\n            float32x4_t _k12131415 = vld1q_f32(kernel0 + 12);\n            float32x4_t _k16171819 = vld1q_f32(kernel0 + 16);\n            float32x4_t _k20212223 = vld1q_f32(kernel0 + 20);\n            float32x4_t _k24242424 = vdupq_n_f32(kernel0[24]);\n#endif // __ARM_NEON\n\n            int i = 0;\n\n            for (; i + 1 < outh; i += 2)\n            {\n#if __ARM_NEON\n                int nn = outw >> 
2;\n                int remain = outw - (nn << 2);\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        // v11 = rx1 / rx3\n                        // v12 = rx2\n                        // v13 v14 = intermediate sum register\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v7.4s}, [%1]                  \\n\" // v7 = out\n\n                        \"0:                                        \\n\"\n\n                        \"prfm       pldl1keep, [%2, #128]          \\n\"\n                        \"ld1        {v8.4s}, [%2]                  \\n\" // v8 = out2\n\n                        // r1\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n                        \"ld1        {v9.4s, v10.4s}, [%4]          \\n\" // v9 v10 = r10 r14\n                        \"add        %4, %4, #16                    \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #4   \\n\" //r11\n                        \"fmul       v13.4s, v9.4s, %19.s[1]        \\n\"\n                        \"fmla       v8.4s,  v9.4s, %18.s[0]        \\n\"\n\n                        \"ext        v12.16b, v9.16b, v10.16b, #8   \\n\" //r12\n                        \"fmla       v7.4s,  v11.4s, %19.s[2]       \\n\"\n                        \"fmul       v14.4s, v11.4s, %18.s[1]       \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #12  \\n\" //r13\n                        \"fmla       v13.4s, v12.4s, %19.s[3]       \\n\"\n                        \"fmla       v8.4s,  v12.4s, %18.s[2]       \\n\"\n\n                        \"fmla       v7.4s,  v11.4s, %20.s[0]       \\n\"\n                        \"fmla       v14.4s, v11.4s, %18.s[3]       \\n\"\n\n                        \"prfm       pldl1keep, [%5, #256]          \\n\"\n\n     
                   \"fmla       v13.4s, v10.4s, %20.s[1]       \\n\"\n                        \"fmla       v8.4s,  v10.4s, %19.s[0]       \\n\"\n\n                        // r2\n                        \"ld1        {v9.4s, v10.4s}, [%5]          \\n\" // v9 v10 = r20 r24\n                        \"add        %5, %5, #16                    \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #4   \\n\" //r21\n                        \"fmla       v7.4s,  v9.4s, %20.s[2]        \\n\"\n                        \"fmla       v14.4s, v9.4s, %19.s[1]        \\n\"\n\n                        \"ext        v12.16b, v9.16b, v10.16b, #8   \\n\" //r22\n                        \"fmla       v13.4s, v11.4s, %20.s[3]       \\n\"\n                        \"fmla       v8.4s,  v11.4s, %19.s[2]       \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #12  \\n\" //r23\n                        \"fmla       v7.4s,  v12.4s, %21.s[0]       \\n\"\n                        \"fmla       v14.4s, v12.4s, %19.s[3]       \\n\"\n\n                        \"fmla       v13.4s, v11.4s, %21.s[1]       \\n\"\n                        \"fmla       v8.4s,  v11.4s, %20.s[0]       \\n\"\n\n                        \"prfm       pldl1keep, [%6, #256]          \\n\"\n\n                        \"fmla       v7.4s,  v10.4s, %21.s[2]       \\n\"\n                        \"fmla       v14.4s, v10.4s, %20.s[1]       \\n\"\n\n                        // r3\n                        \"ld1        {v9.4s, v10.4s}, [%6]          \\n\" // v9 v10 = r30 r34\n                        \"add        %6, %6, #16                    \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #4   \\n\" //r31\n                        \"fmla       v13.4s, v9.4s, %21.s[3]        \\n\"\n                        \"fmla       v8.4s,  v9.4s, %20.s[2]        \\n\"\n\n                        \"ext        v12.16b, v9.16b, v10.16b, #8   \\n\" //r32\n                        \"fmla       v7.4s,  v11.4s, 
%22.s[0]       \\n\"\n                        \"fmla       v14.4s, v11.4s, %20.s[3]       \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #12  \\n\" //r33\n                        \"fmla       v13.4s, v12.4s, %22.s[1]       \\n\"\n                        \"fmla       v8.4s,  v12.4s, %21.s[0]       \\n\"\n\n                        \"fmla       v7.4s,  v11.4s, %22.s[2]       \\n\"\n                        \"fmla       v14.4s, v11.4s, %21.s[1]       \\n\"\n\n                        \"prfm       pldl1keep, [%7, #256]          \\n\"\n\n                        \"fmla       v13.4s, v10.4s, %22.s[3]       \\n\"\n                        \"fmla       v8.4s,  v10.4s, %21.s[2]       \\n\"\n\n                        // r4\n                        \"ld1        {v9.4s, v10.4s}, [%7]          \\n\" // v9 v10 = r40 r44\n                        \"add        %7, %7, #16                    \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #4   \\n\" //r41\n                        \"fmla       v7.4s,  v9.4s, %23.s[0]        \\n\"\n                        \"fmla       v14.4s, v9.4s, %21.s[3]        \\n\"\n\n                        \"ext        v12.16b, v9.16b, v10.16b, #8   \\n\" //r42\n                        \"fmla       v13.4s, v11.4s, %23.s[1]       \\n\"\n                        \"fmla       v8.4s,  v11.4s, %22.s[0]       \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #12  \\n\" //r43\n                        \"fmla       v7.4s,  v12.4s, %23.s[2]       \\n\"\n                        \"fmla       v14.4s, v12.4s, %22.s[1]       \\n\"\n\n                        \"fmla       v13.4s, v11.4s, %23.s[3]       \\n\"\n                        \"fmla       v8.4s,  v11.4s, %22.s[2]       \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n\n                        \"fmla       v7.4s,  v10.4s, %24.s[0]       \\n\"\n                        \"fmla       v14.4s, v10.4s, %22.s[3]       \\n\"\n\n    
                    // r0 and r5\n                        \"ld1        {v9.4s, v10.4s}, [%3]          \\n\" // v9 v10 = r00 r04\n                        \"add        %3, %3, #16                    \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #4   \\n\" //r01\n                        \"fmla       v13.4s, v11.4s, %18.s[1]       \\n\"\n\n                        \"ext        v12.16b, v9.16b, v10.16b, #8   \\n\" //r02\n                        \"fmla       v7.4s, v12.4s, %18.s[2]        \\n\"\n\n                        \"ext        v11.16b, v9.16b, v10.16b, #12  \\n\" //r03\n\n                        \"prfm       pldl1keep, [%8, #256]          \\n\"\n\n                        \"fmla       v13.4s, v11.4s, %18.s[3]       \\n\"\n\n                        // r5\n                        \"ld1        {v11.4s, v12.4s}, [%8]         \\n\" // v11 v12 = r50 r54\n                        \"add        %8, %8, #16                    \\n\"\n\n                        \"fmla       v8.4s,  v11.4s, %23.s[0]       \\n\"\n                        \"fmla       v14.4s, v12.4s, %24.s[0]       \\n\"\n\n                        \"fmla       v7.4s,  v9.4s,  %18.s[0]       \\n\"\n                        \"fmla       v13.4s, v10.4s, %19.s[0]       \\n\"\n\n                        \"ext        v9.16b,  v11.16b, v12.16b, #4  \\n\" //r51\n                        \"ext        v10.16b, v11.16b, v12.16b, #8  \\n\" //r52\n\n                        \"fmla       v14.4s, v9.4s, %23.s[1]        \\n\"\n\n                        \"ext        v9.16b, v11.16b, v12.16b, #12  \\n\" //r53\n                        \"fmla       v8.4s, v10.4s, %23.s[2]        \\n\"\n\n                        \"fmla       v14.4s, v9.4s, %23.s[3]        \\n\"\n\n                        \"fadd       v7.4s, v7.4s, v13.4s           \\n\"\n\n                        \"st1        {v7.4s}, [%1], #16             \\n\"\n\n                        \"fadd       v8.4s, v8.4s, v14.4s           \\n\"\n\n                        
\"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v7.4s}, [%1]                  \\n\" // v7 = out\n                        \"st1        {v8.4s}, [%2], #16             \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"bne        0b                             \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr),  // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2),      // %5\n                        \"=r\"(r3),      // %6\n                        \"=r\"(r4),      // %7\n                        \"=r\"(r5)       // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(outptr2),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"6\"(r3),\n                        \"7\"(r4),\n                        \"8\"(r5),\n                        \"w\"(_k0123),     // %18\n                        \"w\"(_k4567),     // %19\n                        \"w\"(_k891011),   // %20\n                        \"w\"(_k12131415), // %21\n                        \"w\"(_k16171819), // %22\n                        \"w\"(_k20212223), // %23\n                        \"w\"(_k24242424)  // %24\n                        : \"cc\", \"memory\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        //                     \"veor       q13, q13            \\n\"\n                        //                     \"veor       q14, q14            \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n\n                        
\"vld1.f32   {d14-d15}, [%1]     \\n\" // q7 = out\n\n                        \"0:                             \\n\"\n\n                        // q11 = rx1 / rx3\n                        // q12 = rx2\n\n                        // q13 q14 = intermediate sum register\n\n                        \"pld        [%2, #128]          \\n\"\n\n                        \"vld1.f32   {d16-d17}, [%2]     \\n\" // q8 = out2\n\n                        \"pld        [%4, #256]          \\n\"\n\n                        // r1\n                        \"vld1.f32   {d18-d21}, [%4]     \\n\" // q9 q10 = r10 r14\n                        \"add        %4, #16             \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\" // r11\n                        \"vmul.f32   q13, q9, %e19[1]    \\n\"\n                        \"vmla.f32   q8, q9, %e18[0]     \\n\"\n\n                        \"vext.32    q12, q9, q10, #2    \\n\" // r12\n                        \"vmla.f32   q7, q11, %f19[0]    \\n\"\n                        \"vmul.f32   q14, q11, %e18[1]   \\n\"\n\n                        \"vext.32    q11, q9, q10, #3    \\n\" // r13\n                        \"vmla.f32   q13, q12, %f19[1]   \\n\"\n                        \"vmla.f32   q8, q12, %f18[0]    \\n\"\n\n                        \"vmla.f32   q7, q11, %e20[0]    \\n\"\n                        \"vmla.f32   q14, q11, %f18[1]   \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n\n                        \"vmla.f32   q13, q10, %e20[1]   \\n\"\n                        \"vmla.f32   q8, q10, %e19[0]    \\n\"\n\n                        // r2\n                        \"vld1.f32   {d18-d21}, [%5]     \\n\" // q9 q10 = r20 r24\n                        \"add        %5, #16             \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\" // r21\n                        \"vmla.f32   q7, q9, %f20[0]     \\n\"\n                        \"vmla.f32   q14, q9, %e19[1]    \\n\"\n\n                        
\"vext.32    q12, q9, q10, #2    \\n\" // r22\n                        \"vmla.f32   q13, q11, %f20[1]   \\n\"\n                        \"vmla.f32   q8, q11, %f19[0]    \\n\"\n\n                        \"vext.32    q11, q9, q10, #3    \\n\" // r23\n                        \"vmla.f32   q7, q12, %e21[0]    \\n\"\n                        \"vmla.f32   q14, q12, %f19[1]   \\n\"\n\n                        \"vmla.f32   q13, q11, %e21[1]   \\n\"\n                        \"vmla.f32   q8, q11, %e20[0]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n\n                        \"vmla.f32   q7, q10, %f21[0]    \\n\"\n                        \"vmla.f32   q14, q10, %e20[1]   \\n\"\n\n                        // r3\n                        \"vld1.f32   {d18-d21}, [%6]     \\n\" // q9 q10 = r30 r34\n                        \"add        %6, #16             \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\" // r31\n                        \"vmla.f32   q13, q9, %f21[1]    \\n\"\n                        \"vmla.f32   q8, q9, %f20[0]     \\n\"\n\n                        \"vext.32    q12, q9, q10, #2    \\n\" // r32\n                        \"vmla.f32   q7, q11, %e22[0]    \\n\"\n                        \"vmla.f32   q14, q11, %f20[1]   \\n\"\n\n                        \"vext.32    q11, q9, q10, #3    \\n\" // r33\n                        \"vmla.f32   q13, q12, %e22[1]   \\n\"\n                        \"vmla.f32   q8, q12, %e21[0]    \\n\"\n\n                        \"vmla.f32   q7, q11, %f22[0]    \\n\"\n                        \"vmla.f32   q14, q11, %e21[1]   \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n\n                        \"vmla.f32   q13, q10, %f22[1]   \\n\"\n                        \"vmla.f32   q8, q10, %f21[0]    \\n\"\n\n                        // r4\n                        \"vld1.f32   {d18-d21}, [%7]     \\n\" // q9 q10 = r40 r44\n                        \"add        %7, #16             \\n\"\n\n     
                   \"vext.32    q11, q9, q10, #1    \\n\" // r41\n                        \"vmla.f32   q7, q9, %e23[0]     \\n\"\n                        \"vmla.f32   q14, q9, %f21[1]    \\n\"\n\n                        \"vext.32    q12, q9, q10, #2    \\n\" // r42\n                        \"vmla.f32   q13, q11, %e23[1]   \\n\"\n                        \"vmla.f32   q8, q11, %e22[0]    \\n\"\n\n                        \"vext.32    q11, q9, q10, #3    \\n\" // r43\n                        \"vmla.f32   q7, q12, %f23[0]    \\n\"\n                        \"vmla.f32   q14, q12, %e22[1]   \\n\"\n\n                        \"vmla.f32   q13, q11, %f23[1]   \\n\"\n                        \"vmla.f32   q8, q11, %f22[0]    \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n\n                        \"vmla.f32   q7, q10, %e24[0]    \\n\"\n                        \"vmla.f32   q14, q10, %f22[1]   \\n\"\n\n                        // r0 and r5\n                        \"vld1.f32   {d18-d21}, [%3]     \\n\" // q9 q10 = r00 r04\n                        \"add        %3, #16             \\n\"\n\n                        \"vext.32    q11, q9, q10, #1    \\n\" // r01\n                        \"vmla.f32   q13, q11, %e18[1]   \\n\"\n\n                        \"vext.32    q12, q9, q10, #2    \\n\" // r02\n                        \"vmla.f32   q7, q12, %f18[0]    \\n\"\n\n                        \"vext.32    q11, q9, q10, #3    \\n\" // r03\n\n                        \"pld        [%8, #256]          \\n\"\n\n                        \"vmla.f32   q13, q11, %f18[1]   \\n\"\n\n                        // r5\n                        \"vld1.f32   {d22-d25}, [%8]     \\n\" // q11 q12 = r50 r54\n                        \"add        %8, #16             \\n\"\n\n                        \"vmla.f32   q8, q11, %e23[0]    \\n\"\n                        \"vmla.f32   q14, q12, %e24[0]   \\n\"\n\n                        \"vmla.f32   q7, q9, %e18[0]     \\n\"\n                        
\"vmla.f32   q13, q10, %e19[0]   \\n\"\n\n                        \"vext.32    q9, q11, q12, #1    \\n\" // r51\n                        \"vext.32    q10, q11, q12, #2   \\n\" // r52\n\n                        \"vmla.f32   q14, q9, %e23[1]    \\n\"\n\n                        \"vext.32    q9, q11, q12, #3    \\n\" // r53\n                        \"vmla.f32   q8, q10, %f23[0]    \\n\"\n\n                        \"vmla.f32   q14, q9, %f23[1]    \\n\"\n\n                        \"vadd.f32   q7, q7, q13         \\n\"\n\n                        //                     \"veor       q13, q13            \\n\"\n\n                        \"vst1.f32   {d14-d15}, [%1]!    \\n\"\n\n                        \"vadd.f32   q8, q8, q14         \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n\n                        \"vld1.f32   {d14-d15}, [%1]     \\n\" // q7 = out\n\n                        //                     \"veor       q14, q14            \\n\"\n\n                        \"vst1.f32   {d16-d17}, [%2]!    
\\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),      // %0\n                        \"=r\"(outptr),  // %1\n                        \"=r\"(outptr2), // %2\n                        \"=r\"(r0),      // %3\n                        \"=r\"(r1),      // %4\n                        \"=r\"(r2),      // %5\n                        \"=r\"(r3),      // %6\n                        \"=r\"(r4),      // %7\n                        \"=r\"(r5)       // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(outptr2),\n                        \"3\"(r0),\n                        \"4\"(r1),\n                        \"5\"(r2),\n                        \"6\"(r3),\n                        \"7\"(r4),\n                        \"8\"(r5),\n                        \"w\"(_k0123),     // %18\n                        \"w\"(_k4567),     // %19\n                        \"w\"(_k891011),   // %20\n                        \"w\"(_k12131415), // %21\n                        \"w\"(_k16171819), // %22\n                        \"w\"(_k20212223), // %23\n                        \"w\"(_k24242424)  // %24\n                        : \"cc\", \"memory\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    float sum = 0;\n                    float sum2 = 0;\n#if __ARM_NEON\n                    float32x4_t _r1 = vld1q_f32(r1);\n                    float32x4_t _k1 = vld1q_f32(k1);\n                    float32x4_t _sum = vmulq_f32(_r1, _k1);\n                    float32x4_t _sum2 = vmulq_f32(_r1, _k0123);\n\n                    float32x4_t _r2 = vld1q_f32(r2);\n                    float32x4_t _k2 = vld1q_f32(k2);\n                    _sum = vmlaq_f32(_sum, _r2, 
_k2);\n                    _sum2 = vmlaq_f32(_sum2, _r2, _k1);\n\n                    float32x4_t _r3 = vld1q_f32(r3);\n                    float32x4_t _k3 = vld1q_f32(k3);\n                    _sum = vmlaq_f32(_sum, _r3, _k3);\n                    _sum2 = vmlaq_f32(_sum2, _r3, _k2);\n\n                    float32x4_t _r4 = vld1q_f32(r4);\n                    _sum = vmlaq_f32(_sum, _r4, _k20212223);\n                    _sum2 = vmlaq_f32(_sum2, _r4, _k3);\n\n                    float32x4_t _r0 = vld1q_f32(r0);\n                    _sum = vmlaq_f32(_sum, _r0, _k0123);\n                    float32x4_t _r5 = vld1q_f32(r5);\n                    _sum2 = vmlaq_f32(_sum2, _r5, _k20212223);\n\n                    float32x4_t _k_t4 = {};\n\n                    _k_t4 = vsetq_lane_f32(k0[4], _k_t4, 0);\n                    _k_t4 = vsetq_lane_f32(k1[4], _k_t4, 1);\n                    _k_t4 = vsetq_lane_f32(k2[4], _k_t4, 2);\n                    _k_t4 = vsetq_lane_f32(k3[4], _k_t4, 3);\n\n                    float32x4_t _r_t4 = {};\n\n                    _r_t4 = vsetq_lane_f32(r0[4], _r_t4, 0);\n                    _r_t4 = vsetq_lane_f32(r1[4], _r_t4, 1);\n                    _r_t4 = vsetq_lane_f32(r2[4], _r_t4, 2);\n                    _r_t4 = vsetq_lane_f32(r3[4], _r_t4, 3);\n                    _sum = vmlaq_f32(_sum, _r_t4, _k_t4);\n\n                    sum = r4[4] * k4[4];\n\n                    _r_t4 = vextq_f32(_r_t4, _r_t4, 1);\n                    _r_t4 = vsetq_lane_f32(r4[4], _r_t4, 3);\n                    _sum2 = vmlaq_f32(_sum2, _r_t4, _k_t4);\n\n                    sum2 = r5[4] * k4[4];\n\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    float32x2_t _ss2 = vadd_f32(vget_low_f32(_sum2), vget_high_f32(_sum2));\n                    float32x2_t _ss_ss2 = vpadd_f32(_ss, _ss2);\n\n                    sum += vget_lane_f32(_ss_ss2, 0);\n                    sum2 += vget_lane_f32(_ss_ss2, 1);\n#else\n             
       sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r0[3] * k0[3];\n                    sum += r0[4] * k0[4];\n\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r1[3] * k1[3];\n                    sum += r1[4] * k1[4];\n\n                    sum += r2[0] * k2[0];\n                    sum += r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n                    sum += r2[3] * k2[3];\n                    sum += r2[4] * k2[4];\n\n                    sum += r3[0] * k3[0];\n                    sum += r3[1] * k3[1];\n                    sum += r3[2] * k3[2];\n                    sum += r3[3] * k3[3];\n                    sum += r3[4] * k3[4];\n\n                    sum += r4[0] * k4[0];\n                    sum += r4[1] * k4[1];\n                    sum += r4[2] * k4[2];\n                    sum += r4[3] * k4[3];\n                    sum += r4[4] * k4[4];\n\n                    sum2 += r1[0] * k0[0];\n                    sum2 += r1[1] * k0[1];\n                    sum2 += r1[2] * k0[2];\n                    sum2 += r1[3] * k0[3];\n                    sum2 += r1[4] * k0[4];\n\n                    sum2 += r2[0] * k1[0];\n                    sum2 += r2[1] * k1[1];\n                    sum2 += r2[2] * k1[2];\n                    sum2 += r2[3] * k1[3];\n                    sum2 += r2[4] * k1[4];\n\n                    sum2 += r3[0] * k2[0];\n                    sum2 += r3[1] * k2[1];\n                    sum2 += r3[2] * k2[2];\n                    sum2 += r3[3] * k2[3];\n                    sum2 += r3[4] * k2[4];\n\n                    sum2 += r4[0] * k3[0];\n                    sum2 += r4[1] * k3[1];\n                    sum2 += r4[2] * k3[2];\n                    sum2 += r4[3] * k3[3];\n                    sum2 += r4[4] * k3[4];\n\n                    sum2 += r5[0] * 
k4[0];\n                    sum2 += r5[1] * k4[1];\n                    sum2 += r5[2] * k4[2];\n                    sum2 += r5[3] * k4[3];\n                    sum2 += r5[4] * k4[4];\n#endif // __ARM_NEON\n                    *outptr += sum;\n                    *outptr2 += sum2;\n\n                    r0++;\n                    r1++;\n                    r2++;\n                    r3++;\n                    r4++;\n                    r5++;\n                    outptr++;\n                    outptr2++;\n                }\n\n                r0 += 4 + w;\n                r1 += 4 + w;\n                r2 += 4 + w;\n                r3 += 4 + w;\n                r4 += 4 + w;\n                r5 += 4 + w;\n\n                outptr += outw;\n                outptr2 += outw;\n            }\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw - (nn << 2);\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n\n                        \"ld1        {v8.4s, v9.4s}, [%2]           \\n\" // _r00 = vld1q_f32(r0+j);\n                        \"add        %2, %2, #16                    \\n\"\n\n                        \"0:                                        \\n\"\n\n                        \"ld1        {v7.4s}, [%1]                  \\n\" // _sum = vld1q_f32(outptr+j);\n\n                        \"ext        v10.16b, v8.16b, v9.16b, #4    \\n\" //_r01\n                        \"ext        v11.16b, v8.16b, v9.16b, #8    \\n\" //_r02\n                        \"ext        v12.16b, v8.16b, v9.16b, #12   \\n\" //_r03\n\n                        \"fmla       v7.4s,   v8.4s, %14.s[0]       \\n\"\n                        \"fmul 
      v13.4s, v10.4s, %14.s[1]       \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n\n                        \"fmul       v14.4s, v11.4s, %14.s[2]       \\n\"\n                        \"fmul       v15.4s, v12.4s, %14.s[3]       \\n\"\n                        \"fmla       v7.4s,   v9.4s, %15.s[0]       \\n\"\n\n                        \"ld1        {v8.4s, v9.4s}, [%3]           \\n\"\n                        \"add        %3, %3, #16                    \\n\"\n                        \"ext        v10.16b, v8.16b, v9.16b, #4    \\n\" //_r11\n                        \"ext        v11.16b, v8.16b, v9.16b, #8    \\n\" //_r12\n                        \"ext        v12.16b, v8.16b, v9.16b, #12   \\n\" //_r13\n\n                        \"fmla       v7.4s,   v8.4s, %15.s[1]       \\n\"\n                        \"fmla       v13.4s, v10.4s, %15.s[2]       \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n\n                        \"fmla       v14.4s, v11.4s, %15.s[3]       \\n\"\n                        \"fmla       v15.4s, v12.4s, %16.s[0]       \\n\"\n                        \"fmla       v7.4s,   v9.4s, %16.s[1]       \\n\"\n\n                        \"ld1        {v8.4s, v9.4s}, [%4]           \\n\"\n                        \"add        %4, %4, #16                    \\n\"\n                        \"ext        v10.16b, v8.16b, v9.16b, #4    \\n\" //_r21\n                        \"ext        v11.16b, v8.16b, v9.16b, #8    \\n\" //_r22\n                        \"ext        v12.16b, v8.16b, v9.16b, #12   \\n\" //_r23\n\n                        \"fmla       v7.4s,   v8.4s, %16.s[2]       \\n\"\n                        \"fmla       v13.4s, v10.4s, %16.s[3]       \\n\"\n\n                        \"prfm       pldl1keep, [%5, #256]          \\n\"\n\n                        \"fmla       v14.4s, v11.4s, %17.s[0]       \\n\"\n                        \"fmla       v15.4s, v12.4s, %17.s[1]       \\n\"\n                   
     \"fmla       v7.4s,   v9.4s, %17.s[2]       \\n\"\n\n                        \"ld1        {v8.4s, v9.4s}, [%5]           \\n\"\n                        \"add        %5, %5, #16                    \\n\"\n                        \"ext        v10.16b, v8.16b, v9.16b, #4    \\n\" //_r31\n                        \"ext        v11.16b, v8.16b, v9.16b, #8    \\n\" //_r32\n                        \"ext        v12.16b, v8.16b, v9.16b, #12   \\n\" //_r33\n\n                        \"fmla       v7.4s,   v8.4s, %17.s[3]       \\n\"\n                        \"fmla       v13.4s, v10.4s, %18.s[0]       \\n\"\n\n                        \"prfm       pldl1keep, [%6, #256]          \\n\"\n\n                        \"fmla       v14.4s, v11.4s, %18.s[1]       \\n\"\n                        \"fmla       v15.4s, v12.4s, %18.s[2]       \\n\"\n                        \"fmla       v7.4s,   v9.4s, %18.s[3]       \\n\"\n\n                        \"ld1        {v8.4s, v9.4s}, [%6]           \\n\"\n                        \"add        %6, %6, #16                    \\n\"\n                        \"ext        v10.16b, v8.16b, v9.16b, #4    \\n\" //_r41\n                        \"ext        v11.16b, v8.16b, v9.16b, #8    \\n\" //_r42\n                        \"ext        v12.16b, v8.16b, v9.16b, #12   \\n\" //_r43\n\n                        \"fmla       v7.4s,   v8.4s, %19.s[0]       \\n\"\n                        \"fmla       v13.4s, v10.4s, %19.s[1]       \\n\"\n                        \"fmla       v14.4s, v11.4s, %19.s[2]       \\n\"\n                        \"fmla       v15.4s, v12.4s, %19.s[3]       \\n\"\n                        \"fmla       v7.4s,   v9.4s, %20.s[0]       \\n\"\n\n                        \"fadd       v14.4s, v14.4s, v15.4s         \\n\"\n                        \"fadd       v7.4s,   v7.4s, v13.4s         \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n\n                        \"fadd       v7.4s,   v7.4s, v14.4s         \\n\"\n\n       
                 \"ld1        {v8.4s, v9.4s}, [%2]           \\n\"\n                        \"add        %2, %2, #16                    \\n\"\n\n                        \"st1        {v7.4s}, [%1], #16             \\n\"\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"bne        0b                             \\n\"\n\n                        \"sub        %2, %2, #16                    \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4)      // %6\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"w\"(_k0123),     // %14\n                        \"w\"(_k4567),     // %15\n                        \"w\"(_k891011),   // %16\n                        \"w\"(_k12131415), // %17\n                        \"w\"(_k16171819), // %18\n                        \"w\"(_k20212223), // %19\n                        \"w\"(_k24242424)  // %20\n                        : \"cc\", \"memory\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        //                     \"veor       q15, q15            \\n\"// _sum3 = 0;\n\n                        \"pld        [%1, #128]          \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n\n                        \"vld1.f32   {d16-d19}, [%2]     
\\n\" // _r00 = vld1q_f32(r0+j);\n                        \"add        %2, #16             \\n\"\n\n                        \"0:                             \\n\"\n\n                        \"vld1.f32   {d14-d15}, [%1]     \\n\" // _sum = vld1q_f32(outptr+j);\n                        //                     \"veor       q13, q13            \\n\"// _sum2 = 0;\n                        //                     \"veor       q14, q14            \\n\"// _sum3 = 0;\n\n                        \"vext.32    q10, q8, q9, #1     \\n\" // _r01\n                        \"vext.32    q11, q8, q9, #2     \\n\" // _r02\n                        \"vext.32    q12, q8, q9, #3     \\n\" // _r03\n\n                        \"vmla.f32   q7, q8, %e14[0]     \\n\"\n                        \"vmul.f32   q13, q10, %e14[1]   \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n\n                        \"vmul.f32   q14, q11, %f14[0]   \\n\"\n                        \"vmul.f32   q15, q12, %f14[1]   \\n\"\n                        \"vmla.f32   q7, q9, %e15[0]     \\n\"\n\n                        \"vld1.f32   {d16-d19}, [%3]     \\n\"\n                        \"add        %3, #16             \\n\"\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n                        \"vext.32    q12, q8, q9, #3     \\n\"\n\n                        \"vmla.f32   q7, q8, %e15[1]     \\n\"\n                        \"vmla.f32   q13, q10, %f15[0]   \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n\n                        \"vmla.f32   q14, q11, %f15[1]   \\n\"\n                        \"vmla.f32   q15, q12, %e16[0]   \\n\"\n                        \"vmla.f32   q7, q9, %e16[1]     \\n\"\n\n                        \"vld1.f32   {d16-d19}, [%4]     \\n\"\n                        \"add        %4, #16             \\n\"\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        
\"vext.32    q11, q8, q9, #2     \\n\"\n                        \"vext.32    q12, q8, q9, #3     \\n\"\n\n                        \"vmla.f32   q7, q8, %f16[0]     \\n\"\n                        \"vmla.f32   q13, q10, %f16[1]   \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n\n                        \"vmla.f32   q14, q11, %e17[0]   \\n\"\n                        \"vmla.f32   q15, q12, %e17[1]   \\n\"\n                        \"vmla.f32   q7, q9, %f17[0]     \\n\"\n\n                        \"vld1.f32   {d16-d19}, [%5]     \\n\"\n                        \"add        %5, #16             \\n\"\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n                        \"vext.32    q12, q8, q9, #3     \\n\"\n\n                        \"vmla.f32   q7, q8, %f17[1]     \\n\"\n                        \"vmla.f32   q13, q10, %e18[0]   \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n\n                        \"vmla.f32   q14, q11, %e18[1]   \\n\"\n                        \"vmla.f32   q15, q12, %f18[0]   \\n\"\n                        \"vmla.f32   q7, q9, %f18[1]     \\n\"\n\n                        \"vld1.f32   {d16-d19}, [%6]     \\n\"\n                        \"add        %6, #16             \\n\"\n                        \"vext.32    q10, q8, q9, #1     \\n\"\n                        \"vext.32    q11, q8, q9, #2     \\n\"\n                        \"vext.32    q12, q8, q9, #3     \\n\"\n\n                        \"vmla.f32   q7, q8, %e19[0]     \\n\"\n                        \"vmla.f32   q13, q10, %e19[1]   \\n\"\n                        \"vmla.f32   q14, q11, %f19[0]   \\n\"\n                        \"vmla.f32   q15, q12, %f19[1]   \\n\"\n                        \"vmla.f32   q7, q9, %e20[0]     \\n\"\n\n                        \"vadd.f32   q14, q14, q15       \\n\"\n                        \"vadd.f32   q7, q7, q13         \\n\"\n                    
    //                     \"veor       q15, q15            \\n\"// _sum3 = 0;\n\n                        \"pld        [%2, #256]          \\n\"\n\n                        \"vadd.f32   q7, q7, q14         \\n\"\n\n                        \"vld1.f32   {d16-d19}, [%2]     \\n\" // _r00 = vld1q_f32(r0+j);\n                        \"add        %2, #16             \\n\"\n\n                        \"vst1.f32   {d14-d15}, [%1]!    \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n\n                        \"sub        %2, #16             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4)      // %6\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"w\"(_k0123),     // %14\n                        \"w\"(_k4567),     // %15\n                        \"w\"(_k891011),   // %16\n                        \"w\"(_k12131415), // %17\n                        \"w\"(_k16171819), // %18\n                        \"w\"(_k20212223), // %19\n                        \"w\"(_k24242424)  // %20\n                        : \"cc\", \"memory\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    float sum = 0;\n#if __ARM_NEON\n                    float32x4_t _r0 = 
vld1q_f32(r0);\n                    float32x4_t _sum = vmulq_f32(_r0, _k0123);\n\n                    float32x4_t _r1 = vld1q_f32(r1);\n                    _sum = vmlaq_f32(_sum, _r1, vld1q_f32(k1));\n\n                    float32x4_t _r2 = vld1q_f32(r2);\n                    _sum = vmlaq_f32(_sum, _r2, vld1q_f32(k2));\n\n                    float32x4_t _r3 = vld1q_f32(r3);\n                    _sum = vmlaq_f32(_sum, _r3, vld1q_f32(k3));\n\n                    float32x4_t _r4 = vld1q_f32(r4);\n                    _sum = vmlaq_f32(_sum, _r4, _k20212223);\n\n                    float32x4_t _k_t4 = {};\n\n                    _k_t4 = vsetq_lane_f32(k0[4], _k_t4, 0);\n                    _k_t4 = vsetq_lane_f32(k1[4], _k_t4, 1);\n                    _k_t4 = vsetq_lane_f32(k2[4], _k_t4, 2);\n                    _k_t4 = vsetq_lane_f32(k3[4], _k_t4, 3);\n\n                    float32x4_t _r_t4 = {};\n\n                    _r_t4 = vsetq_lane_f32(r0[4], _r_t4, 0);\n                    _r_t4 = vsetq_lane_f32(r1[4], _r_t4, 1);\n                    _r_t4 = vsetq_lane_f32(r2[4], _r_t4, 2);\n                    _r_t4 = vsetq_lane_f32(r3[4], _r_t4, 3);\n                    _sum = vmlaq_f32(_sum, _r_t4, _k_t4);\n\n                    sum = r4[4] * k4[4];\n\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    _ss = vpadd_f32(_ss, _ss);\n\n                    sum += vget_lane_f32(_ss, 0);\n#else\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r0[3] * k0[3];\n                    sum += r0[4] * k0[4];\n\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r1[3] * k1[3];\n                    sum += r1[4] * k1[4];\n\n                    sum += r2[0] * k2[0];\n                    sum += r2[1] * k2[1];\n                    
sum += r2[2] * k2[2];\n                    sum += r2[3] * k2[3];\n                    sum += r2[4] * k2[4];\n\n                    sum += r3[0] * k3[0];\n                    sum += r3[1] * k3[1];\n                    sum += r3[2] * k3[2];\n                    sum += r3[3] * k3[3];\n                    sum += r3[4] * k3[4];\n\n                    sum += r4[0] * k4[0];\n                    sum += r4[1] * k4[1];\n                    sum += r4[2] * k4[2];\n                    sum += r4[3] * k4[3];\n                    sum += r4[4] * k4[4];\n#endif\n                    *outptr += sum;\n\n                    r0++;\n                    r1++;\n                    r2++;\n                    r3++;\n                    r4++;\n                    outptr++;\n                }\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                r3 += 4;\n                r4 += 4;\n            }\n        }\n    }\n}\n\nstatic void conv5x5s2_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 25 + q * 25;\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n            const float* r3 = img0 + w * 3;\n            const float* r4 = img0 + w * 4;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 5;\n            const float* k2 = kernel0 + 10;\n            const float* k3 = kernel0 + 15;\n            const float* k4 = kernel0 + 20;\n\n#if __ARM_NEON\n            float32x4_t _k0123 = vld1q_f32(kernel0);\n            float32x4_t _k4567 = vld1q_f32(kernel0 + 4);\n            float32x4_t _k891011 = vld1q_f32(kernel0 + 8);\n            float32x4_t _k12131415 = vld1q_f32(kernel0 + 12);\n            float32x4_t _k16171819 = vld1q_f32(kernel0 + 16);\n            float32x4_t _k20212223 = vld1q_f32(kernel0 + 20);\n            float32x4_t _k24242424 = vdupq_n_f32(kernel0[24]);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw - (nn << 2);\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v8.4s, v9.4s}, [%2], #32      \\n\" // v8  = 0  2  4  6   v9  = 1  3  5  7\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n                        \"ld2        {v10.4s, v11.4s}, [%2]         \\n\" // v10 = 8 10 12 14   v11 = 9 11 13 15\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"0:        
                                \\n\"\n\n                        \"ld1        {v7.4s}, [%1]                  \\n\" // v7 = outptr\n\n                        \"ext        v12.16b, v8.16b, v10.16b, #4   \\n\" // v12 = 2 4 6 8\n                        \"ext        v11.16b, v9.16b, v11.16b, #4   \\n\" // v11 = 3 5 7 9\n                        \"ext        v10.16b, v8.16b, v10.16b, #8   \\n\" // v10 = 4 6 8 10\n\n                        \"fmla       v7.4s,  v8.4s, %14.s[0]        \\n\"\n                        \"fmul       v13.4s, v9.4s, %14.s[1]        \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n\n                        \"fmul       v14.4s, v12.4s, %14.s[2]       \\n\"\n                        \"fmul       v15.4s, v11.4s, %14.s[3]       \\n\"\n                        \"fmla       v7.4s,  v10.4s, %15.s[0]       \\n\"\n\n                        \"ld2        {v8.4s, v9.4s}, [%3], #32      \\n\"\n\n                        \"prfm       pldl1keep, [%3, #256]          \\n\"\n\n                        \"ld2        {v10.4s, v11.4s}, [%3]         \\n\"\n                        \"ext        v12.16b, v8.16b, v10.16b, #4   \\n\"\n                        \"ext        v11.16b, v9.16b, v11.16b, #4   \\n\"\n                        \"ext        v10.16b, v8.16b, v10.16b, #8   \\n\"\n\n                        \"fmla       v7.4s,  v8.4s, %15.s[1]        \\n\"\n                        \"fmla       v13.4s, v9.4s, %15.s[2]        \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n\n                        \"fmla       v14.4s, v12.4s, %15.s[3]       \\n\"\n                        \"fmla       v15.4s, v11.4s, %16.s[0]       \\n\"\n                        \"fmla       v7.4s,  v10.4s, %16.s[1]       \\n\"\n\n                        \"ld2        {v8.4s, v9.4s}, [%4], #32      \\n\"\n\n                        \"prfm       pldl1keep, [%4, #256]          \\n\"\n\n                        \"ld2        {v10.4s, v11.4s}, [%4]       
  \\n\"\n                        \"ext        v12.16b, v8.16b, v10.16b, #4   \\n\"\n                        \"ext        v11.16b, v9.16b, v11.16b, #4   \\n\"\n                        \"ext        v10.16b, v8.16b, v10.16b, #8   \\n\"\n\n                        \"fmla       v7.4s,  v8.4s, %16.s[2]        \\n\"\n                        \"fmla       v13.4s, v9.4s, %16.s[3]        \\n\"\n\n                        \"prfm       pldl1keep, [%5, #256]          \\n\"\n\n                        \"fmla       v14.4s, v12.4s, %17.s[0]       \\n\"\n                        \"fmla       v15.4s, v11.4s, %17.s[1]       \\n\"\n                        \"fmla       v7.4s,  v10.4s, %17.s[2]       \\n\"\n\n                        \"ld2        {v8.4s, v9.4s}, [%5], #32      \\n\"\n\n                        \"prfm       pldl1keep, [%5, #256]          \\n\"\n\n                        \"ld2        {v10.4s, v11.4s}, [%5]         \\n\"\n                        \"ext        v12.16b, v8.16b, v10.16b, #4   \\n\"\n                        \"ext        v11.16b, v9.16b, v11.16b, #4   \\n\"\n                        \"ext        v10.16b, v8.16b, v10.16b, #8   \\n\"\n\n                        \"fmla       v7.4s,  v8.4s, %17.s[3]        \\n\"\n                        \"fmla       v13.4s, v9.4s, %18.s[0]        \\n\"\n\n                        \"prfm       pldl1keep, [%6, #256]          \\n\"\n\n                        \"fmla       v14.4s, v12.4s, %18.s[1]       \\n\"\n                        \"fmla       v15.4s, v11.4s, %18.s[2]       \\n\"\n                        \"fmla       v7.4s,  v10.4s, %18.s[3]       \\n\"\n\n                        \"ld2        {v8.4s, v9.4s}, [%6], #32      \\n\"\n\n                        \"prfm       pldl1keep, [%6, #256]          \\n\"\n\n                        \"ld2        {v10.4s, v11.4s}, [%6]         \\n\"\n                        \"ext        v12.16b, v8.16b, v10.16b, #4   \\n\"\n                        \"ext        v11.16b, v9.16b, v11.16b, #4   \\n\"\n                 
       \"ext        v10.16b, v8.16b, v10.16b, #8   \\n\"\n\n                        \"fmla       v7.4s,   v8.4s, %19.s[0]       \\n\"\n                        \"fmla       v13.4s,  v9.4s, %19.s[1]       \\n\"\n                        \"fmla       v14.4s, v12.4s, %19.s[2]       \\n\"\n                        \"fmla       v15.4s, v11.4s, %19.s[3]       \\n\"\n                        \"fmla       v7.4s,  v10.4s, %20.s[0]       \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n\n                        \"ld2        {v8.4s, v9.4s}, [%2], #32      \\n\"\n\n                        \"fadd       v14.4s, v14.4s, v15.4s         \\n\"\n                        \"fadd       v7.4s,   v7.4s, v13.4s         \\n\"\n\n                        \"prfm       pldl1keep, [%2, #256]          \\n\"\n\n                        \"fadd       v7.4s, v7.4s, v14.4s           \\n\"\n\n                        \"ld2        {v10.4s, v11.4s}, [%2]         \\n\"\n                        \"st1        {v7.4s}, [%1], #16             \\n\"\n\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"bne        0b                             \\n\"\n\n                        \"sub        %2, %2, #32                    \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4)      // %6\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"w\"(_k0123),     // %14\n       
                 \"w\"(_k4567),     // %15\n                        \"w\"(_k891011),   // %16\n                        \"w\"(_k12131415), // %17\n                        \"w\"(_k16171819), // %18\n                        \"w\"(_k20212223), // %19\n                        \"w\"(_k24242424)  // %20\n                        : \"cc\", \"memory\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n                }\n\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        //                     \"veor       q15, q15            \\n\"// _sump3 = 0;\n                        //                     \"veor       q13, q13            \\n\"// _sump2 = 0;\n                        //                     \"veor       q14, q14            \\n\"// _sump3 = 0;\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld2.f32   {d16-d19}, [%2]!    \\n\" // q8  = 0  2  4  6   q9  = 1  3  5  7\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld2.f32   {d20-d23}, [%2]     \\n\" // q10 = 8 10 12 14   q11 = 9 11 13 15\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"0:                             \\n\"\n\n                        \"vld1.f32   {d14-d15}, [%1]     \\n\" // q7 = outptr\n\n                        \"vext.32    q12, q8, q10, #1    \\n\" // q12 = 2 4 6 8\n                        \"vext.32    q11, q9, q11, #1    \\n\" // q11 = 3 5 7 9\n                        \"vext.32    q10, q8, q10, #2    \\n\" // q10 = 4 6 8 10\n\n                        \"vmla.f32   q7, q8, %e14[0]     \\n\"\n                        \"vmul.f32   q13, q9, %e14[1]    \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n\n                        \"vmul.f32   q14, q12, %f14[0]   \\n\"\n                        \"vmul.f32   q15, q11, %f14[1]   \\n\"\n                        \"vmla.f32   
q7, q10, %e15[0]    \\n\"\n\n                        \"vld2.f32   {d16-d19}, [%3]!    \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n\n                        \"vld2.f32   {d20-d23}, [%3]     \\n\"\n                        \"vext.32    q12, q8, q10, #1    \\n\"\n                        \"vext.32    q11, q9, q11, #1    \\n\"\n                        \"vext.32    q10, q8, q10, #2    \\n\"\n\n                        \"vmla.f32   q7, q8, %e15[1]     \\n\"\n                        \"vmla.f32   q13, q9, %f15[0]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n\n                        \"vmla.f32   q14, q12, %f15[1]   \\n\"\n                        \"vmla.f32   q15, q11, %e16[0]   \\n\"\n                        \"vmla.f32   q7, q10, %e16[1]    \\n\"\n\n                        \"vld2.f32   {d16-d19}, [%4]!    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n\n                        \"vld2.f32   {d20-d23}, [%4]     \\n\"\n                        \"vext.32    q12, q8, q10, #1    \\n\"\n                        \"vext.32    q11, q9, q11, #1    \\n\"\n                        \"vext.32    q10, q8, q10, #2    \\n\"\n\n                        \"vmla.f32   q7, q8, %f16[0]     \\n\"\n                        \"vmla.f32   q13, q9, %f16[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n\n                        \"vmla.f32   q14, q12, %e17[0]   \\n\"\n                        \"vmla.f32   q15, q11, %e17[1]   \\n\"\n                        \"vmla.f32   q7, q10, %f17[0]    \\n\"\n\n                        \"vld2.f32   {d16-d19}, [%5]!    
\\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n\n                        \"vld2.f32   {d20-d23}, [%5]     \\n\"\n                        \"vext.32    q12, q8, q10, #1    \\n\"\n                        \"vext.32    q11, q9, q11, #1    \\n\"\n                        \"vext.32    q10, q8, q10, #2    \\n\"\n\n                        \"vmla.f32   q7, q8, %f17[1]     \\n\"\n                        \"vmla.f32   q13, q9, %e18[0]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n\n                        \"vmla.f32   q14, q12, %e18[1]   \\n\"\n                        \"vmla.f32   q15, q11, %f18[0]   \\n\"\n                        \"vmla.f32   q7, q10, %f18[1]    \\n\"\n\n                        \"vld2.f32   {d16-d19}, [%6]!    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n\n                        \"vld2.f32   {d20-d23}, [%6]     \\n\"\n                        \"vext.32    q12, q8, q10, #1    \\n\"\n                        \"vext.32    q11, q9, q11, #1    \\n\"\n                        \"vext.32    q10, q8, q10, #2    \\n\"\n\n                        \"vmla.f32   q7, q8, %e19[0]     \\n\"\n                        \"vmla.f32   q13, q9, %e19[1]    \\n\"\n                        \"vmla.f32   q14, q12, %f19[0]   \\n\"\n                        \"vmla.f32   q15, q11, %f19[1]   \\n\"\n                        \"vmla.f32   q7, q10, %e20[0]    \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n\n                        \"vld2.f32   {d16-d19}, [%2]!    
\\n\" // q8  = 0  2  4  6   q9  = 1  3  5  7\n\n                        \"vadd.f32   q14, q14, q15       \\n\"\n                        \"vadd.f32   q7, q7, q13         \\n\"\n                        //                     \"veor       q15, q15            \\n\"// _sump3 = 0;\n                        //                     \"veor       q13, q13            \\n\"// _sump2 = 0;\n\n                        \"pld        [%2, #256]          \\n\"\n\n                        \"vadd.f32   q7, q7, q14         \\n\"\n\n                        \"vld2.f32   {d20-d23}, [%2]     \\n\" // q10 = 8 10 12 14   q11 = 9 11 13 15\n\n                        //                     \"veor       q14, q14            \\n\"// _sump3 = 0;\n\n                        \"vst1.f32   {d14-d15}, [%1]!    \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n\n                        \"sub        %2, #32             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4)      // %6\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"w\"(_k0123),     // %14\n                        \"w\"(_k4567),     // %15\n                        \"w\"(_k891011),   // %16\n                        \"w\"(_k12131415), // %17\n                        \"w\"(_k16171819), // %18\n                        \"w\"(_k20212223), // %19\n                        \"w\"(_k24242424)  // %20\n        
                : \"cc\", \"memory\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    float sum = 0;\n#if __ARM_NEON\n                    float32x4_t _r0 = vld1q_f32(r0);\n                    float32x4_t _sum = vmulq_f32(_r0, _k0123);\n\n                    float32x4_t _r1 = vld1q_f32(r1);\n                    _sum = vmlaq_f32(_sum, _r1, vld1q_f32(k1));\n\n                    float32x4_t _r2 = vld1q_f32(r2);\n                    _sum = vmlaq_f32(_sum, _r2, vld1q_f32(k2));\n\n                    float32x4_t _r3 = vld1q_f32(r3);\n                    _sum = vmlaq_f32(_sum, _r3, vld1q_f32(k3));\n\n                    float32x4_t _r4 = vld1q_f32(r4);\n                    _sum = vmlaq_f32(_sum, _r4, _k20212223);\n\n                    sum += r0[4] * k0[4];\n                    sum += r1[4] * k1[4];\n                    sum += r2[4] * k2[4];\n                    sum += r3[4] * k3[4];\n                    sum += r4[4] * k4[4];\n\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    _ss = vpadd_f32(_ss, _ss);\n\n                    sum += vget_lane_f32(_ss, 0);\n#else\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r0[3] * k0[3];\n                    sum += r0[4] * k0[4];\n\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r1[3] * k1[3];\n                    sum += r1[4] * k1[4];\n\n                    sum += r2[0] * k2[0];\n                    sum += r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n                    sum += r2[3] * k2[3];\n                    sum += r2[4] * k2[4];\n\n                    sum += r3[0] 
* k3[0];\n                    sum += r3[1] * k3[1];\n                    sum += r3[2] * k3[2];\n                    sum += r3[3] * k3[3];\n                    sum += r3[4] * k3[4];\n\n                    sum += r4[0] * k4[0];\n                    sum += r4[1] * k4[1];\n                    sum += r4[2] * k4[2];\n                    sum += r4[3] * k4[3];\n                    sum += r4[4] * k4[4];\n#endif\n                    *outptr += sum;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    r3 += 2;\n                    r4 += 2;\n                    outptr++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_5x5_pack4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv5x5s1_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n            const float* r3 = img0.row(3);\n            const float* r4 = img0.row(4);\n\n            const float* kptr = (const float*)kernel.channel(p).row(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]   
  \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1] \\n\" // r04 r05 r06 r07\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                 
       \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                     
   \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, 
v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2] \\n\" // r14 r15 r16 r17\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                 
       \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n   
                     \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, 
v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        
\"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%3] \\n\" // r24 r25 r26 r27\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     
\\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, 
v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%4], #64 \\n\" // r30 r31 r32 r33\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n            
            \"fmla   v23.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%4] \\n\" // r34 r35 r36 r37\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   
v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   
pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%5], #64 \\n\" // r40 r41 r42 r43\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, 
v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%5] \\n\" // r44 r45 r46 r47\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        
\"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n        
                \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6] \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]    
 \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"sub    %6, %6, #1536               \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d0-d7}        \\n\" // r00 r01 r02 r03\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, 
d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1, {d8-d15}        \\n\" // r04 r05 r06 r07\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   
q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, 
q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d0-d7}        \\n\" // r10 r11 r12 r13\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        
\"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2, {d8-d15}        \\n\" // r14 r15 r16 r17\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n               
         \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                       
 \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d0-d7}        \\n\" // r20 r21 r22 r23\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n          
              \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3, {d8-d15}        \\n\" // r24 r25 r26 r27\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n 
                       \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n         
               \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d0-d7}        \\n\" // r30 r31 r32 r33\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #512]          
\\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4, {d8-d15}        \\n\" // r34 r35 r36 r37\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, 
q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, 
d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vldm       %5!, {d0-d7}        \\n\" // r40 r41 r42 r43\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   
q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vldm       %5, {d8-d15}        \\n\" // r44 r45 r46 r47\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        
\"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        
\"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        //                         \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6, {d16-d23}       \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n       
                 \"sub        %6, %6, #1536       \\n\" // kptr -= 24 * 16;\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v20.4s, v21.4s}, [%0]      \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%1], #32   \\n\" // r00 r01\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmul   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, 
v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s, v5.4s}, [%1] \\n\" // r02 r03 r04 r05\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     
\\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\" // r10 r11\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n            
            \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s, v5.4s}, [%2] \\n\" // r12 r13 r14 r15\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1   
 {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%3], #32   \\n\" // r20 r21\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, 
v4.4s, v5.4s}, [%3] \\n\" // r22 r23 r24 r25\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, 
v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%4], #32   \\n\" // r30 r31\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s, v5.4s}, [%4] \\n\" // r32 r33 r34 r35\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]  
   \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n\n                    
    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%5], #32   \\n\" // r40 r41\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s, v5.4s}, [%5] \\n\" // r42 r43 r44 r45\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n\n                        
\"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6] \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   
v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                        \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                        \"sub    %6, %6, #1536               \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%0 :128] \\n\" // sum0 sum1\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%1 :128]! 
\\n\" // r00 r01\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmul.f32   q14, q8, d0[0]      \\n\"\n                        \"vmul.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1, {d4-d11}        \\n\" // r02 r03 r04 r05\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        
\"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%2 :128]! \\n\" // r10 r11\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n               
         \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2, {d4-d11}        \\n\" // r12 r13 r14 r15\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      
\\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%3 :128]! \\n\" // r20 r21\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3, {d4-d11}        \\n\" // r22 r23 r24 r25\n\n                        
\"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        
\"vld1.f32   {d0-d3}, [%4 :128]! \\n\" // r30 r31\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4, {d4-d11}        \\n\" // r32 r33 r34 r35\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     
\\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%5 :128]! 
\\n\" // r40 r41\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vldm       %5, {d4-d11}        \\n\" // r42 r43 r44 r45\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        
\"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        //                         \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6, {d16-d23}       \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n 
                       \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"sub        %6, %6, #1536       \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d27}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v20.4s}, [%0]              \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%1], #16          \\n\" // r00\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v1.4s, v2.4s, v3.4s, v4.4s}, [%1] \\n\" // r01 r02 r03 r04\n\n                        \"fmul   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, 
v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmul   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%2], #16          \\n\" // r10\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                       
 \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v1.4s, v2.4s, v3.4s, v4.4s}, [%2] \\n\" // r11 r12 r13 r14\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n 
                       \"ld1    {v0.4s}, [%3], #16          \\n\" // r20\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v1.4s, v2.4s, v3.4s, v4.4s}, [%3] \\n\" // r21 r22 r23 r24\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   
v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%4], #16          \\n\" // r30\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v1.4s, v2.4s, v3.4s, v4.4s}, [%4] \\n\" // r31 r32 r33 r34\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n\n    
                    \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.4s}, [%5], #16          \\n\" // r40\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v1.4s, v2.4s, v3.4s, v4.4s}, [%5] \\n\" // r41 r42 r43 r44\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, 
v19.4s, v0.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6] \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fadd   v22.4s, v21.4s, v22.4s      \\n\"\n                        \"fadd   v23.4s, v22.4s, v23.4s      \\n\"\n                        \"fadd   v20.4s, v20.4s, v23.4s      \\n\"\n\n                        \"sub    %6, %6, #1536               \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s}, [%0], #16         \\n\"\n\n                   
     : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%0 :128] \\n\" // sum0\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%1 :128]! 
\\n\" // r00\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmul.f32   q13, q8, d0[0]      \\n\"\n                        \"vmul.f32   q14, q9, d0[1]      \\n\"\n                        \"vmul.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1, {d2-d9}         \\n\" // r01 r02 r03 r04\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%2 :128]! 
\\n\" // r10\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2, {d2-d9}         \\n\" // r11 r12 r13 r14\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        
\"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%3 :128]! \\n\" // r20\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3, {d2-d9}         \\n\" // r21 r22 r23 r24\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]     
 \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%4 :128]! \\n\" // r30\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4, {d2-d9}         \\n\" // r31 r32 r33 r34\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        
\"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%5 :128]! \\n\" // r40\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vldm       %5, {d2-d9}         \\n\" // r41 r42 r43 r44\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}     
 \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        //                         \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6, {d16-d23}       \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q13, q13, q14       \\n\"\n                        \"vadd.f32   q12, q12, q15       \\n\"\n                        \"vadd.f32   q12, q12, q13       \\n\"\n\n                        \"sub        %6, %6, #1536       \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d25}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += 4 * 4;\n                r1 += 4 * 4;\n                r2 += 4 * 4;\n                r3 += 4 * 4;\n                r4 += 4 * 4;\n            }\n        }\n    }\n}\n\nstatic void conv5x5s2_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float32x4_t _bias0 = bias ? 
vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n            const float* r3 = img0.row(3);\n            const float* r4 = img0.row(4);\n\n            const float* kptr = (const float*)kernel.channel(p).row(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\" // r04 r05 r06 r07\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   
pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%1] \\n\" // r08 r09 r010\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     
\\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, 
v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n           
             \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\" // r14 r15 r16 r17\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%2] \\n\" // r18 r19 r110\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla 
  v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                 
       \"fmla   v23.4s, v27.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v25.4s, 
v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%3], #64 \\n\" // r24 r25 r26 r27\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n  
                      \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%3] \\n\" // r28 r29 r210\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, 
v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   
v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%4], #64 \\n\" // r30 r31 r32 r33\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%4], #64 \\n\" // r34 r35 r36 r37\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   
v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%4] \\n\" // r38 r39 r310\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     
\\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v7.s[1]     
\\n\"\n                        \"fmla   v23.4s, v17.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%5], #64 \\n\" // r40 r41 r42 r43\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v25.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n              
          \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%5], #64 \\n\" // r44 r45 r46 r47\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%5] \\n\" // r48 r49 r410\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla  
 v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                  
      \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6] \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    
\\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"sub    %6, %6, #1536               \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d0-d7}        \\n\" // r00 r01 r02 r03\n\n      
                  \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d8-d15}       \\n\" // r04 r05 r06 r07\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     
\\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%1 :128]! \\n\" // r08 r09\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, 
d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d14[0]     \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d14[1]     \\n\"\n                        \"vmla.f32   q15, q9, d2[1]      \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d15[0]    \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d15[1]    \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d4-d5}, [%1 :128]  \\n\" // r010\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, 
d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d8-d15}       \\n\" // r10 r11 r12 r13\n\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d0-d7}        \\n\" // r14 r15 r16 r17\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d10[0]     \\n\"\n                        \"vmla.f32   q13, q8, d14[0]     \\n\"\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n           
             \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d10[1]     \\n\"\n                        \"vmla.f32   q13, q9, d14[1]     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d11[0]    \\n\"\n                        \"vmla.f32   q13, q10, d15[0]    \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d11[1]    \\n\"\n                        \"vmla.f32   q13, q11, d15[1]    \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%2 :128]! 
\\n\" // r18 r19\n\n                        \"vmla.f32   q12, q8, d12[0]     \\n\"\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d12[1]     \\n\"\n                        \"vmla.f32   q13, q9, d0[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d13[0]    \\n\"\n                        \"vmla.f32   q13, q10, d1[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d13[1]    \\n\"\n                        \"vmla.f32   q13, q11, d1[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d14[0]     \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d14[1]     \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d15[0]    \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d15[1]   
 \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d12-d13}, [%2 :128] \\n\" // r110\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d0-d7}        \\n\" // r20 r21 r22 r23\n\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d8-d15}       \\n\" // r24 r25 r26 r27\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n         
               \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                     
   \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%3 :128]! \\n\" // r28 r29\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d14[0]     \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d14[1]     \\n\"\n                
        \"vmla.f32   q15, q9, d2[1]      \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d15[0]    \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d15[1]    \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d4-d5}, [%3 :128]  \\n\" // r210\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d8-d15}       \\n\" // r30 r31 r32 r33\n\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     
\\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d0-d7}        \\n\" // r34 r35 r36 r37\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d10[0]     \\n\"\n                        \"vmla.f32   q13, q8, d14[0]     \\n\"\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d10[1]     \\n\"\n                        \"vmla.f32   q13, q9, d14[1]     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, 
q10, d11[0]    \\n\"\n                        \"vmla.f32   q13, q10, d15[0]    \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d11[1]    \\n\"\n                        \"vmla.f32   q13, q11, d15[1]    \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%4 :128]! \\n\" // r38 r39\n\n                        \"vmla.f32   q12, q8, d12[0]     \\n\"\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d12[1]     \\n\"\n                        \"vmla.f32   q13, q9, d0[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d13[0]    \\n\"\n                        \"vmla.f32   q13, q10, d1[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d13[1]    \\n\"\n                        \"vmla.f32   q13, q11, d1[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d14[0]     \\n\"\n                        \"vmla.f32  
 q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d14[1]     \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d15[0]    \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d15[1]    \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.f32   {d12-d13}, [%4 :128] \\n\" // r310\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   
q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vldm       %5!, {d0-d7}        \\n\" // r40 r41 r42 r43\n\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vldm       %5!, {d8-d15}       \\n\" // r44 r45 r46 r47\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n 
                       \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%5 :128]! 
\\n\" // r48 r49\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d14[0]     \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d14[1]     \\n\"\n                        \"vmla.f32   q15, q9, d2[1]      \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d15[0]    \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]    
 \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d15[1]    \\n\"\n                        \"vmla.f32   q15, q11, d3[1]     \\n\"\n\n                        //                         \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6, {d16-d23}       \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.f32   {d4-d5}, [%5 :128]  \\n\" // r410\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"sub        %6, %6, #1536       \\n\" // kptr -= 24 * 16;\n\n                        \"sub        %1, %1, #32         \\n\"\n                        \"sub        %2, %2, #32         \\n\"\n                        \"sub        %3, %3, #32         \\n\"\n                        \"sub        %4, %4, #32         \\n\"\n                        \"sub        %5, %5, #32         \\n\"\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n   
                     : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v20.4s, v21.4s}, [%0]      \\n\" // sum0 sum1\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmul   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   
v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%1] \\n\" // r04 r05 r06\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       
\\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]      
 \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%2] \\n\" // r14 r15 r16\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n       
                 \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%3] \\n\" // r24 r25 r26\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n       
                 \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   
v21.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%4], #64 \\n\" // r30 r31 r32 r33\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%4] \\n\" // r34 r35 r36\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   
v21.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, 
[%5], #64 \\n\" // r40 r41 r42 r43\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #384]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%5] \\n\" // r44 r45 r46\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, 
v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6] \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     
\\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                        \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                        \"sub    %6, %6, #1536               \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%0 :128] \\n\" // sum0 sum1\n\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d0-d7}        \\n\" // r00 r01 r02 r03\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmul.f32   q14, q8, d0[0]      \\n\"\n                        \"vmul.f32   q15, q8, d4[0]      \\n\"\n          
              \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #384]          \\n\"\n                        \"vldm       %1, {d8-d13}        \\n\" // r04 r05 r06\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      
\\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vldm       %2!, {d0-d7}        \\n\" // r10 r11 r12 r13\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, 
{d16-d23}      \\n\"\n\n                        \"pld        [%2, #384]          \\n\"\n                        \"vldm       %2, {d8-d13}        \\n\" // r14 r15 r16\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        
\"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vldm       %3!, {d0-d7}        \\n\" // r20 r21 r22 r23\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #384]          \\n\"\n                        \"vldm       %3, {d8-d13}        \\n\" // r24 r25 r26\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]    
 \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vldm       %4!, {d0-d7}        \\n\" // r30 r31 r32 r33\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   
q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #384]          \\n\"\n                        \"vldm       %4, {d8-d13}        \\n\" // r34 r35 r36\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        
\"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vldm       %5!, {d0-d7}        \\n\" // r40 r41 r42 r43\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n       
                 \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #384]          \\n\"\n                        \"vldm       %5, {d8-d13}        \\n\" // r44 r45 r46\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      
\\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        //                         \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"sub        %6, %6, #1536       \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d27}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v20.4s}, [%0]              \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%1], #32   \\n\" // r00 r01\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmul   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmul   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s}, [%1] \\n\" // r02 r03 r04\n\n                        \"fmla   
v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\" // r10 r11\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n\n                        
\"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s}, [%2] \\n\" // r12 r13 r14\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%3], #32   \\n\" // r20 r21\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n\n  
                      \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s}, [%3] \\n\" // r22 r23 r24\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, 
v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%4], #32   \\n\" // r30 r31\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s}, [%4] \\n\" // r32 r33 r34\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    
{v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%5], #32   \\n\" // r40 r41\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #384]       \\n\"\n                        \"ld1    {v2.4s, v3.4s, v4.4s}, [%5] \\n\" // r42 r43 r44\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n           
             \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%6], #64 \\n\"\n\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%6] \\n\"\n\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fadd   v22.4s, v21.4s, v22.4s      \\n\"\n                        \"fadd   v23.4s, v22.4s, v23.4s      \\n\"\n                        \"fadd   v20.4s, v20.4s, v23.4s      \\n\"\n\n                        \"sub    %6, %6, #1536               \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                       
 \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%0 :128] \\n\" // sum0\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%1 :128]! \\n\" // r00 r01\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmul.f32   q13, q8, d0[0]      \\n\"\n                        \"vmul.f32   q14, q9, d0[1]      \\n\"\n                        \"vmul.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%1, #384]          \\n\"\n                        \"vldm       %1, {d4-d9}         \\n\" // r02 r03 r04\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        
\"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%2 :128]! \\n\" // r10 r11\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%2, #384]          \\n\"\n                        \"vldm       %2, {d4-d9}         \\n\" // r12 r13 r14\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     
\\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%3 :128]! 
\\n\" // r20 r21\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%3, #384]          \\n\"\n                        \"vldm       %3, {d4-d9}         \\n\" // r22 r23 r24\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        
\"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%4 :128]! \\n\" // r30 r31\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%4, #384]          \\n\"\n                        \"vldm       %4, {d4-d9}         \\n\" // r32 r33 r34\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]     
 \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.f32   {d0-d3}, [%5 :128]! \\n\" // r40 r41\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"pld        [%5, #384]          \\n\"\n                        \"vldm       %5, {d4-d9}         \\n\" // r42 r43 r44\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        
\"vmla.f32   q12, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6!, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n\n                        //                         \"pld        [%6, #512]          \\n\"\n                        \"vldm       %6, {d16-d23}      \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q14, q13, q14       \\n\"\n                        \"vadd.f32   q15, q14, q15       \\n\"\n                        \"vadd.f32   q12, q12, q15       \\n\"\n\n                        \"sub        %6, %6, #1536       \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d25}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_5x5_pack4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv5x5s1_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4, 4, opt.workspace_allocator);\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n            const unsigned short* r3 = img0.row<const unsigned short>(3);\n            const unsigned short* r4 = img0.row<const unsigned short>(4);\n\n            const unsigned short* kptr = kernel.channel(p).row<const unsigned short>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                        \"shll   
v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                 
       \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%1] \\n\" // r04 r05 r06 r07\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]   
  \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     
\\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        
\"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n        
                \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%2] \\n\" // r14 r15 r16 r17\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, 
#16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, 
v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                  
      \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                    
    \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%3] \\n\" // r24 r25 r26 r27\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     
\\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, 
v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, 
v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r30 r31 r32 r33\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n          
              \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%4] \\n\" // r34 r35 r36 r37\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16  
         \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, 
v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   
v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5], #32 \\n\" // r40 r41 r42 r43\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n  
                      \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n    
                    \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%5] \\n\" // r44 r45 r46 r47\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, 
v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, 
v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6] \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                       
 \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"sub    %6, %6, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%1 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%1 :64] \\n\" // r04 r05 r06 r07\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vmla.f32   q14, q9, d7[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d9[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  
\\n\" // r10 r11 r12 r13\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d6[1]     \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d3[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d7[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%2 :64] \\n\" // r14 r15 r16 r17\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d4[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d10[1]    \\n\"\n                        \"vmla.f32   q12, q8, d5[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d11[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  
\\n\" // r20 r21 r22 r23\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d12[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d12[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d13[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d13[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%3 :64] \\n\" // r24 r25 r26 r27\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vmla.f32   q14, q9, d7[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d9[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%4 :64]!  
\\n\" // r30 r31 r32 r33\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d6[1]     \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d3[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d7[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%4 :64] \\n\" // r34 r35 r36 r37\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d4[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d10[1]    \\n\"\n                        \"vmla.f32   q12, q8, d5[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d11[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%5 :64]!  
\\n\" // r40 r41 r42 r43\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d12[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d12[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d13[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d13[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%5 :64] \\n\" // r44 r45 r46 r47\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vmla.f32   q14, q9, d7[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d9[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        //                         \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :64] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32  
 q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"sub        %6, %6, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%1], #16   \\n\" // r00 r01\n\n                        \"shll   
v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v20.4s, v21.4s}, [%0]      \\n\" // sum0 sum1\n\n                        \"fmul   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%1] \\n\" // r02 r03 r04 r05\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v24.4s, 
v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, 
v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\" // r10 r11\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        
\"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%2] \\n\" // r12 r13 r14 r15\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16          
 \\n\"\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16  
       \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\" // r20 r21\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   
v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%3] \\n\" // r22 r23 r24 r25\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                   
     \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n             
           \"ld1    {v0.4h, v1.4h}, [%4], #16   \\n\" // r30 r31\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     
\\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%4] \\n\" // r32 r33 r34 r35\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll 
  v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%5], #16   \\n\" // r40 r41\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        
\"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%5] \\n\" // r42 r43 r44 r45\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]    
 \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, 
v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        //                         \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6] \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                        \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                        \"sub    
%6, %6, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%1 :64]!  \\n\" // r00 r01\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%1 :64]  \\n\" // r02 r03 r04 r05\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%0 :128] \\n\" // sum0 sum1\n\n                        \"vmul.f32   q14, q8, d0[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmul.f32   q15, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2 :64]!  
\\n\" // r10 r11\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%2 :64]  \\n\" // r12 r13 r14 r15\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3 :64]!  \\n\" // r20 r21\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%3 :64]  \\n\" // r22 r23 r24 r25\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4 :64]!  \\n\" // r30 r31\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%4 :64]  \\n\" // r32 r33 r34 r35\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5 :64]!  \\n\" // r40 r41\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%5 :64]  \\n\" // r42 r43 r44 r45\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128] \\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    
\\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"sub        %6, %6, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d27}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%1], #8           \\n\" // r00\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%1] \\n\" // r01 r02 r03 r04\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16  
       \\n\"\n\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v20.4s}, [%0]              \\n\" // sum0\n\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmul   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmul   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmul   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    
{v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%2], #8           \\n\" // r10\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n      
                  \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%2] \\n\" // r11 r12 r13 r14\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, 
v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\" // r20\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n              
          \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%3] \\n\" // r21 r22 r23 r24\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, 
v1.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%4], #8           \\n\" // r30\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                      
  \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%4] \\n\" // r31 r32 r33 r34\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]  
   \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%5], #8           \\n\" // r40\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, 
[%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%5] \\n\" // r41 r42 r43 r44\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     
\\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        //                         \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6] \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   
v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fadd   v22.4s, v21.4s, v22.4s      \\n\"\n                        \"fadd   v23.4s, v22.4s, v23.4s      \\n\"\n                        \"fadd   v20.4s, v20.4s, v23.4s      \\n\"\n\n                        \"sub    %6, %6, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%1 :64]!     \\n\" // r00\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%0 :128] \\n\" // sum0\n\n                        \"vmul.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmul.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmul.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%1 :64]   \\n\" // r01 r02 r03 r04\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%2 :64]!     
\\n\" // r10\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%2 :64]   \\n\" // r11 r12 r13 r14\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%3 :64]!     \\n\" // r20\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%3 :64]   \\n\" // r21 r22 r23 r24\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%4 :64]!     
\\n\" // r30\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%4 :64]   \\n\" // r31 r32 r33 r34\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%5, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%5 :64]!     \\n\" // r40\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%5 :64]   \\n\" // r41 r42 r43 r44\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        //                         \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128] \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   
q12, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q13, q13, q14       \\n\"\n                        \"vadd.f32   q12, q12, q15       \\n\"\n                        \"vadd.f32   q12, q12, q13       \\n\"\n\n                        \"sub        %6, %6, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d25}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += 4 * 4;\n                r1 += 4 * 4;\n                r2 += 4 * 4;\n                r3 += 4 * 4;\n                r4 += 4 * 4;\n            }\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n\n            const float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n            const unsigned short* r3 = img0.row<const unsigned short>(3);\n            const unsigned short* r4 = img0.row<const unsigned short>(4);\n\n            const unsigned short* 
kptr = kernel.channel(p).row<const unsigned short>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%1], #64 \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r00 r01 r02 r03\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, 
v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%2] \\n\" // r04 r05 r06 r07\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n              
          \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                
        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r10 
r11 r12 r13\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, 
v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%3] \\n\" // r14 r15 r16 r17\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                       
 \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n       
                 \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n         
               \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, 
v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%4] \\n\" // r24 r25 r26 r27\n\n               
         \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                 
       \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n 
                       \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5], #32 \\n\" // r30 r31 r32 r33\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   
v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   
v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%5] \\n\" // r34 r35 r36 r37\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n       
                 \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n         
               \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%6], #32 \\n\" // r40 r41 r42 r43\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, 
v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v3.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        
\"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%6] \\n\" // r44 r45 r46 r47\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v4.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         
\\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v5.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     
\\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        //                         \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7] \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, 
v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"sub    %7, %7, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"shrn   v20.4h, v20.4s, #16         \\n\"\n                        \"shrn   v21.4h, v21.4s, #16         \\n\"\n                        \"shrn   v22.4h, v22.4s, #16         \\n\"\n                        \"shrn   v23.4h, v23.4s, #16         \\n\"\n\n                        \"st1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n              
          \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d24-d31}      \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%2 :64] \\n\" // r04 r05 r06 r07\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vmla.f32   q14, q9, d7[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d9[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  
\\n\" // r10 r11 r12 r13\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d6[1]     \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d3[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d7[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%3 :64] \\n\" // r14 r15 r16 r17\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d4[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d10[1]    \\n\"\n                        \"vmla.f32   q12, q8, d5[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d11[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%4 :64]!  
\\n\" // r20 r21 r22 r23\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d12[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d12[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d13[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d13[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%4 :64] \\n\" // r24 r25 r26 r27\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vmla.f32   q14, q9, d7[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d9[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%5 :64]!  
\\n\" // r30 r31 r32 r33\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d6[1]     \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d3[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d7[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%5 :64] \\n\" // r34 r35 r36 r37\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d2[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d8[1]      \\n\"\n                        \"vmla.f32   q12, q10, d3[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d9[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d4[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d10[1]    \\n\"\n                        \"vmla.f32   q12, q8, d5[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d11[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d6[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d10[0]     \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d10[1]     \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d7[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d11[0]    \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d11[1]    \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%6 :64]!  
\\n\" // r40 r41 r42 r43\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d12[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d12[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d13[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d13[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d5[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%6 :64] \\n\" // r44 r45 r46 r47\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vmla.f32   q14, q9, d7[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d7[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d9[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        //                         \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :64] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d10[0]     \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d14[0]     \\n\"\n                        \"vmla.f32  
 q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d14[1]     \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d11[0]    \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d15[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d15[1]    \\n\"\n\n                        \"sub        %7, %7, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n                        \"vshrn.u32  d25, q13, #16       \\n\"\n                        \"vshrn.u32  d26, q14, #16       \\n\"\n                        \"vshrn.u32  d27, q15, #16       \\n\"\n\n                        \"vst1.u16   {d24-d27}, [%0 :64]! 
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\" // r00 r01\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v20.4s, v21.4s}, [%1], #32 \\n\" // sum0 sum1\n\n                        \"fmul   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmul   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]      
 \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%2] \\n\" // r02 r03 r04 r05\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        
\"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                  
      \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\" // r10 r11\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n 
                       \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%3] \\n\" // r12 r13 r14 r15\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, 
v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%4], #16   \\n\" // r20 r21\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   
v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%4] \\n\" // r22 r23 r24 r25\n 
                       \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     
\\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%5], #16   \\n\" // r30 r31\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, 
v17.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%5] \\n\" // r32 r33 r34 r35\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        
\"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                  
      \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%6], #16   \\n\" // r40 r41\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n   
                     \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h, v5.4h}, [%6] \\n\" // r42 r43 r44 r45\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   
v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        //                         \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7] \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n     
                   \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                        \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                        \"sub    %7, %7, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"shrn   v20.4h, v20.4s, #16         \\n\"\n                        \"shrn   v21.4h, v21.4s, #16         \\n\"\n\n                        \"st1    {v20.4h, v21.4h}, [%0], #16 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n     
                   \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2 :64]!  \\n\" // r00 r01\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%2 :64]  \\n\" // r02 r03 r04 r05\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%1 :128]! \\n\" // sum0 sum1\n\n                        \"vmul.f32   q14, q8, d0[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmul.f32   q15, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3 :64]!  \\n\" // r10 r11\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%3 :64]  \\n\" // r12 r13 r14 r15\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4 :64]!  \\n\" // r20 r21\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%4 :64]  \\n\" // r22 r23 r24 r25\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5 :64]!  \\n\" // r30 r31\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%5 :64]  \\n\" // r32 r33 r34 r35\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%6 :64]!  \\n\" // r40 r41\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d8-d11}, [%6 :64]  \\n\" // r42 r43 r44 r45\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vmla.f32   q15, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vshll.u16  q2, d8, #16         \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q3, d9, #16         \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128] \\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    
\\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"sub        %7, %7, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n                        \"vshrn.u32  d25, q13, #16       \\n\"\n\n                        \"vst1.u16   {d24-d25}, [%0 :64]! \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%2], #8           \\n\" // r00\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                       
 \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%2] \\n\" // r01 r02 r03 r04\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v20.4s}, [%1], #16         \\n\" // sum0\n\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmul   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmul   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmul   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, 
v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%3], #8           \\n\" // r10\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n              
          \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%3] \\n\" // r11 r12 r13 r14\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, 
v1.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%4], #8           \\n\" // r20\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   
pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%4] \\n\" // r21 r22 r23 r24\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, 
v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #64]        \\n\"\n                        \"ld1    {v0.4h}, [%5], #8           \\n\" // r30\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   
v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%5] \\n\" // r31 r32 r33 r34\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, 
[%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #64]        \\n\"\n                        \"ld1    {v0.4h}, 
[%6], #8           \\n\" // r40\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v1.4h, v2.4h, v3.4h, v4.4h}, [%6] \\n\" // r41 r42 r43 r44\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n   
                     \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        //                         \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7] \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        
\"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fadd   v22.4s, v21.4s, v22.4s      \\n\"\n                        \"fadd   v23.4s, v22.4s, v23.4s      \\n\"\n                        \"fadd   v20.4s, v20.4s, v23.4s      \\n\"\n\n                        \"sub    %7, %7, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"shrn   v20.4h, v20.4s, #16         \\n\"\n\n                        \"st1    {v20.4h}, [%0], #8          \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%2 :64]!     
\\n\" // r00\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%1 :128]! \\n\" // sum0\n\n                        \"vmul.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmul.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmul.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%2 :64]   \\n\" // r01 r02 r03 r04\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%3 :64]!     
\\n\" // r10\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%3 :64]   \\n\" // r11 r12 r13 r14\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%4 :64]!     \\n\" // r20\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%4 :64]   \\n\" // r21 r22 r23 r24\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%5, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%5 :64]!     
\\n\" // r30\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%5 :64]   \\n\" // r31 r32 r33 r34\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%6, #64]           \\n\"\n                        \"vld1.u16   {d1}, [%6 :64]!     \\n\" // r40\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d1, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d6-d9}, [%6 :64]   \\n\" // r41 r42 r43 r44\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q1, d6, #16         \\n\"\n                        \"vshll.u16  q2, d7, #16         \\n\"\n                        \"vshll.u16  q3, d8, #16         \\n\"\n                        \"vshll.u16  q4, d9, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        //                         \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128] \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   
q12, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q13, q13, q14       \\n\"\n                        \"vadd.f32   q12, q12, q15       \\n\"\n                        \"vadd.f32   q12, q12, q13       \\n\"\n\n                        \"sub        %7, %7, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n\n                        \"vst1.u16   {d24}, [%0 :64]!    \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += 4 * 4;\n                r1 += 4 * 4;\n                r2 += 4 * 4;\n                r3 += 4 * 4;\n                r4 += 4 * 4;\n            }\n        }\n    }\n}\n\nstatic void conv5x5s2_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4, 4, opt.workspace_allocator);\n\n    const int tailstep = (w - 2 * 
outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n            const unsigned short* r3 = img0.row<const unsigned short>(3);\n            const unsigned short* r4 = img0.row<const unsigned short>(4);\n\n            const unsigned short* kptr = kernel.channel(p).row<const unsigned short>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%1], #32 \\n\" // r04 r05 r06 r07\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll  
 v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     
\\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%1] \\n\" // r08 r09 r010\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, 
v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, 
v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n          
              \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%2], #32 \\n\" // r14 r15 r16 r17\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, 
v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%2] \\n\" // r18 r19 r110\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                    
    \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n    
                    \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v29.s[3]    \\n\"\n\n      
                  \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v25.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v30.s[3]    \\n\"\n\n                        \"prfm   
pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%3], #32 \\n\" // r24 r25 r26 r27\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n  
                      \"ld1    {v28.4h, v29.4h, v30.4h}, [%3] \\n\" // r28 r29 r210\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, 
v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, 
v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r30 r31 r32 r33\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n 
                       \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%4], #32 \\n\" // r34 r35 r36 r37\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   
v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%4] \\n\" // r38 r39 r310\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n              
          \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                
        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5], #32 \\n\" // r40 r41 r42 r43\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16   
        \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v25.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%5], #32 \\n\" // r44 r45 r46 r47\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                      
  \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%5] \\n\" // r48 r49 r410\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n 
                       \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n     
                   \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        //                         \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6] \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   
v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"sub    %6, %6, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n     
                   \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%1 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%1 :64]! 
\\n\" // r04 r05 r06 r07\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%1 :64]!  
\\n\" // r08 r09\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d14[0]    \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d14[1]    \\n\"\n                        \"vmla.f32   q15, q11, d2[1]     \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d15[0]     \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d15[1]     \\n\"\n                        \"vmla.f32   q15, q9, d3[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%1, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%1 :64]      \\n\" // r010\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%2 :64]! 
\\n\" // r10 r11 r12 r13\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r14 r15 r16 r17\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d12[0]    \\n\"\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"vmla.f32   q15, q11, d4[1]     \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d13[0]     \\n\"\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                   
     \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vmla.f32   q14, q9, d1[1]      \\n\"\n                        \"vmla.f32   q15, q9, d5[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d10[0]     \\n\"\n                        \"vmla.f32   q13, q8, d14[0]     \\n\"\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d10[1]     \\n\"\n                        \"vmla.f32   q13, q9, d14[1]     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d11[0]    \\n\"\n                        \"vmla.f32   q13, q10, d15[0]    \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d11[1]    \\n\"\n                        \"vmla.f32   q13, q11, d15[1]    \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d10-d11}, [%2 :64]! \\n\" // r18 r19\n\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d12[0]    \\n\"\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d12[1]    \\n\"\n                        \"vmla.f32   q13, q11, d0[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d13[0]     \\n\"\n                        \"vmla.f32   q13, q8, d1[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d13[1]     \\n\"\n                        \"vmla.f32   q13, q9, d1[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d14[0]     \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d14[1]     \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d15[0]    \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d15[1]    \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d13}, [%2 :64]     \\n\" // r110\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  
\\n\" // r20 r21 r22 r23\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%3 :64]! \\n\" // r24 r25 r26 r27\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                  
      \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3 :64]!  \\n\" // r28 r29\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d14[0]    \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d14[1]    \\n\"\n                        \"vmla.f32   q15, q11, d2[1]     \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d15[0]     \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d15[1]     \\n\"\n                        \"vmla.f32   q15, q9, d3[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%3 :64]      \\n\" // r210\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%4 :64]! 
\\n\" // r30 r31 r32 r33\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%4 :64]!  \\n\" // r34 r35 r36 r37\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d12[0]    \\n\"\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"vmla.f32   q15, q11, d4[1]     \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d13[0]     \\n\"\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                   
     \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vmla.f32   q14, q9, d1[1]      \\n\"\n                        \"vmla.f32   q15, q9, d5[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d10[0]     \\n\"\n                        \"vmla.f32   q13, q8, d14[0]     \\n\"\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d10[1]     \\n\"\n                        \"vmla.f32   q13, q9, d14[1]     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d11[0]    \\n\"\n                        \"vmla.f32   q13, q10, d15[0]    \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d11[1]    \\n\"\n                        \"vmla.f32   q13, q11, d15[1]    \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d10-d11}, [%4 :64]! \\n\" // r38 r39\n\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d12[0]    \\n\"\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d12[1]    \\n\"\n                        \"vmla.f32   q13, q11, d0[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d13[0]     \\n\"\n                        \"vmla.f32   q13, q8, d1[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d13[1]     \\n\"\n                        \"vmla.f32   q13, q9, d1[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d14[0]     \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d14[1]     \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d15[0]    \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d15[1]    \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.u16   {d13}, [%4 :64]     \\n\" // r310\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%5 :64]!  
\\n\" // r40 r41 r42 r43\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%5 :64]! \\n\" // r44 r45 r46 r47\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                  
      \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5 :64]!  \\n\" // r48 r49\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d14[0]    \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d14[1]    \\n\"\n                        \"vmla.f32   q15, q11, d2[1]     \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d15[0]     \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d15[1]     \\n\"\n                        \"vmla.f32   q15, q9, d3[1]      \\n\"\n\n                        //                         \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%5, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%5 :64]      \\n\" // r410\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                       
 \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"sub        %6, %6, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"sub        %1, %1, #16         \\n\"\n                        \"sub        %2, %2, #16         \\n\"\n                        \"sub        %3, %3, #16         \\n\"\n                        \"sub        %4, %4, #16         \\n\"\n                        \"sub        %5, %5, #16         \\n\"\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        
\"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v20.4s, v21.4s}, [%0]      \\n\" // sum0 sum1\n\n                        \"fmul   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmul   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n  
                      \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%1, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%1] \\n\" // r04 r05 r06\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, 
v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n              
          \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%2] \\n\" // r14 r15 r16\n                        \"shll   v17.4s, v17.4h, #16     
    \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, 
v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     
\\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%3] \\n\" // r24 r25 r26\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, 
#256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    
{v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r30 r31 r32 r33\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16        
 \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%4] \\n\" // r34 r35 r36\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   
v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                     
   \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5], #32 \\n\" // r40 r41 r42 r43\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       
\\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%5] \\n\" // r44 r45 r46\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   
v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        //                         \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6] \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     
\\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                        \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                        \"sub    %6, %6, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s, v21.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n               
         \"pld        [%1, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%1 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%0 :128] \\n\" // sum0 sum1\n\n                        \"vmul.f32   q14, q8, d0[0]      \\n\"\n                        \"vmul.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"pld        [%1, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%1 :64] \\n\" // r04 r05 r06\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r10 r11 r12 r13\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%2 :64] \\n\" // r14 r15 r16\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, 
d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  \\n\" // r20 r21 r22 r23\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%3 :64] \\n\" // r24 r25 r26\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, 
d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%4 :64]!  \\n\" // r30 r31 r32 r33\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"pld        [%4, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%4 :64] \\n\" // r34 r35 r36\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, 
d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%5 :64]!  \\n\" // r40 r41 r42 r43\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%5 :64] \\n\" // r44 r45 r46\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, 
d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        //                         \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128] \\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   
q13, q11, d13[1]    \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"sub        %6, %6, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d27}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v20.4s}, [%0]              \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%1], #16   \\n\" // r00 r01\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmul   
v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmul   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmul   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%1, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%1] \\n\" // r02 r03 r04\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         
\\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\" // r10 r11\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, 
v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%2] \\n\" // r12 r13 r14\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                
        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\" // r20 r21\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, 
v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%3] \\n\" // r22 r23 r24\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        
\"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%4], #16   \\n\" // r30 r31\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1] 
    \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%4] \\n\" // r32 r33 r34\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   
v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%5], #16   \\n\" // r40 r41\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 
\\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%5] \\n\" // r42 r43 r44\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   
v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%6], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        //                         \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6] \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n 
                       \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fadd   v22.4s, v21.4s, v22.4s      \\n\"\n                        \"fadd   v23.4s, v22.4s, v23.4s      \\n\"\n                        \"fadd   v20.4s, v20.4s, v23.4s      \\n\"\n\n                        \"sub    %6, %6, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"st1    {v20.4s}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%1 :64]!  \\n\" // r00 r01\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%0 :128] \\n\" // sum0\n\n                        \"vmul.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmul.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmul.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%1, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%1 :64]   \\n\" // r02 r03 r04\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2 :64]!  
\\n\" // r10 r11\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%2 :64]   \\n\" // r12 r13 r14\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3 :64]!  \\n\" // r20 r21\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%3 :64]   \\n\" // r22 r23 r24\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4 :64]!  
\\n\" // r30 r31\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%4, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%4 :64]   \\n\" // r32 r33 r34\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5 :64]!  \\n\" // r40 r41\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%5 :64]   \\n\" // r42 r43 r44\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%6 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        //                         \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%6 :128] \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   
q12, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q14, q13, q14       \\n\"\n                        \"vadd.f32   q15, q14, q15       \\n\"\n                        \"vadd.f32   q12, q12, q15       \\n\"\n\n                        \"sub        %6, %6, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vst1.f32   {d24-d25}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n            }\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n\n            const float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n            const unsigned short* r3 = img0.row<const unsigned short>(3);\n            const unsigned short* r4 = img0.row<const unsigned short>(4);\n\n            const 
unsigned short* kptr = kernel.channel(p).row<const unsigned short>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%2], #32 \\n\" // r04 r05 r06 r07\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%1], #64 \\n\" // sum0 sum1 sum2 sum3\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n            
            \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%2] \\n\" // r08 r09 r010\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     
\\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, 
v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, 
v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r10 r11 r12 r13\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n          
              \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%3], #32 \\n\" // r14 r15 r16 r17\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, 
v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%3] \\n\" // r18 r19 r110\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                      
  \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        
\"fmla   v22.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r20 r21 r22 r23\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v28.s[1]    
\\n\"\n                        \"fmla   v23.4s, v25.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%4], #32 \\n\" // r24 r25 r26 r27\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        
\"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%4] \\n\" // r28 r29 r210\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n   
                     \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, 
v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, 
#256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5], #32 \\n\" // r30 r31 r32 r33\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                  
      \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%5], #32 \\n\" // r34 r35 r36 r37\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, 
v30.4h}, [%5] \\n\" // r38 r39 r310\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla 
  v23.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v29.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n               
         \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v29.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%6], #32 \\n\" // r40 r41 r42 r43\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v24.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v25.4s, v30.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, 
#16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v26.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v27.4s, v30.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%6], #32 \\n\" // r44 r45 r46 r47\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v6.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                  
      \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%6] \\n\" // r48 r49 r410\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v7.s[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         
\\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v7.s[3]     \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v23.4s, v17.4s, v28.s[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     
\\n\"\n                        \"fmla   v22.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v23.4s, v19.4s, v28.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v29.s[0]    \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v29.s[1]    \\n\"\n\n                        //                         \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7] \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v20.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v29.s[2]    \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v29.s[3]    \\n\"\n\n                        \"fmla   v20.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v21.4s, v16.4s, v6.s[0]     \\n\"\n                        \"fmla   v22.4s, v16.4s, v28.s[0]    \\n\"\n                        \"fmla   v23.4s, v16.4s, v30.s[0]    \\n\"\n                        \"fmla   v20.4s, 
v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"fmla   v22.4s, v17.4s, v28.s[1]    \\n\"\n                        \"fmla   v23.4s, v17.4s, v30.s[1]    \\n\"\n                        \"fmla   v20.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v21.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v22.4s, v18.4s, v28.s[2]    \\n\"\n                        \"fmla   v23.4s, v18.4s, v30.s[2]    \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"fmla   v22.4s, v19.4s, v28.s[3]    \\n\"\n                        \"fmla   v23.4s, v19.4s, v30.s[3]    \\n\"\n\n                        \"sub    %7, %7, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"shrn   v20.4h, v20.4s, #16         \\n\"\n                        \"shrn   v21.4h, v21.4s, #16         \\n\"\n                        \"shrn   v22.4h, v22.4s, #16         \\n\"\n                        \"shrn   v23.4h, v23.4s, #16         \\n\"\n\n                        \"st1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", 
\"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #512]          \\n\"\n                        \"vldm       %1!, {d24-d31}      \\n\" // sum0 sum1 sum2 sum3\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%2 :64]! 
\\n\" // r04 r05 r06 r07\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2 :64]!  
\\n\" // r08 r09\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d14[0]    \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d14[1]    \\n\"\n                        \"vmla.f32   q15, q11, d2[1]     \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d15[0]     \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d15[1]     \\n\"\n                        \"vmla.f32   q15, q9, d3[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%2, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%2 :64]      \\n\" // r010\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%3 :64]! 
\\n\" // r10 r11 r12 r13\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  \\n\" // r14 r15 r16 r17\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d12[0]    \\n\"\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"vmla.f32   q15, q11, d4[1]     \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d13[0]     \\n\"\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                   
     \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vmla.f32   q14, q9, d1[1]      \\n\"\n                        \"vmla.f32   q15, q9, d5[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d10[0]     \\n\"\n                        \"vmla.f32   q13, q8, d14[0]     \\n\"\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d10[1]     \\n\"\n                        \"vmla.f32   q13, q9, d14[1]     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d11[0]    \\n\"\n                        \"vmla.f32   q13, q10, d15[0]    \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d11[1]    \\n\"\n                        \"vmla.f32   q13, q11, d15[1]    \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d10-d11}, [%3 :64]! \\n\" // r18 r19\n\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d12[0]    \\n\"\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d12[1]    \\n\"\n                        \"vmla.f32   q13, q11, d0[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d13[0]     \\n\"\n                        \"vmla.f32   q13, q8, d1[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d13[1]     \\n\"\n                        \"vmla.f32   q13, q9, d1[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d14[0]     \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d14[1]     \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d15[0]    \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d15[1]    \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%3, #64]           \\n\"\n                        \"vld1.u16   {d13}, [%3 :64]     \\n\" // r110\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%4 :64]!  
\\n\" // r20 r21 r22 r23\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%4 :64]! \\n\" // r24 r25 r26 r27\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                  
      \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4 :64]!  \\n\" // r28 r29\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d14[0]    \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d14[1]    \\n\"\n                        \"vmla.f32   q15, q11, d2[1]     \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d15[0]     \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d15[1]     \\n\"\n                        \"vmla.f32   q15, q9, d3[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%4, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%4 :64]      \\n\" // r210\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                        \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%5 :64]! 
\\n\" // r30 r31 r32 r33\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%5 :64]!  \\n\" // r34 r35 r36 r37\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q12, q10, d8[0]     \\n\"\n                        \"vmla.f32   q13, q10, d12[0]    \\n\"\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"vmla.f32   q15, q11, d4[1]     \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q8, d13[0]     \\n\"\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                   
     \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vmla.f32   q14, q9, d1[1]      \\n\"\n                        \"vmla.f32   q15, q9, d5[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d10[0]     \\n\"\n                        \"vmla.f32   q13, q8, d14[0]     \\n\"\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vmla.f32   q12, q9, d10[1]     \\n\"\n                        \"vmla.f32   q13, q9, d14[1]     \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"vmla.f32   q15, q9, d6[1]      \\n\"\n                        \"vmla.f32   q12, q10, d11[0]    \\n\"\n                        \"vmla.f32   q13, q10, d15[0]    \\n\"\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vmla.f32   q12, q11, d11[1]    \\n\"\n                        \"vmla.f32   q13, q11, d15[1]    \\n\"\n                        \"vmla.f32   q14, q11, d3[1]     \\n\"\n                        \"vmla.f32   q15, q11, d7[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d10-d11}, [%5 :64]! \\n\" // r38 r39\n\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d12[0]    \\n\"\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vmla.f32   q12, q11, d12[1]    \\n\"\n                        \"vmla.f32   q13, q11, d0[1]     \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"vmla.f32   q15, q11, d8[1]     \\n\"\n                        \"vmla.f32   q12, q8, d13[0]     \\n\"\n                        \"vmla.f32   q13, q8, d1[0]      \\n\"\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q12, q9, d13[1]     \\n\"\n                        \"vmla.f32   q13, q9, d1[1]      \\n\"\n                        \"vmla.f32   q14, q9, d5[1]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q12, q8, d14[0]     \\n\"\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vmla.f32   q12, q9, d14[1]     \\n\"\n                        \"vmla.f32   q13, q9, d2[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"vmla.f32   q15, q9, d10[1]     \\n\"\n                        \"vmla.f32   q12, q10, d15[0]    \\n\"\n                        \"vmla.f32   q13, q10, d3[0]     \\n\"\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vmla.f32   q12, q11, d15[1]    \\n\"\n                        \"vmla.f32   q13, q11, d3[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[1]     \\n\"\n                        \"vmla.f32   q15, q11, d11[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"pld        [%5, #64]           \\n\"\n                        \"vld1.u16   {d13}, [%5 :64]     \\n\" // r310\n\n                        \"vshll.u16  q6, d13, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d0[0]     \\n\"\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"vmla.f32   q15, q11, d12[1]    \\n\"\n                        \"vmla.f32   q12, q8, d1[0]      \\n\"\n                        \"vmla.f32   q13, q8, d5[0]      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%6 :64]!  
\\n\" // r40 r41 r42 r43\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d12-d15}, [%6 :64]! \\n\" // r44 r45 r46 r47\n\n                        \"vshll.u16  q4, d12, #16        \\n\"\n                        \"vshll.u16  q5, d13, #16        \\n\"\n                        \"vshll.u16  q6, d14, #16        \\n\"\n                        \"vshll.u16  q7, d15, #16        \\n\"\n\n                        \"vmla.f32   q12, q8, d0[0]      \\n\"\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vmla.f32   q15, q9, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d1[0]     \\n\"\n                        \"vmla.f32   q13, q10, d5[0]     \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                  
      \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"vmla.f32   q14, q11, d9[1]     \\n\"\n                        \"vmla.f32   q15, q11, d13[1]    \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[0]     \\n\"\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n                        \"vmla.f32   q12, q8, d3[0]      \\n\"\n                        \"vmla.f32   q13, q8, d7[0]      \\n\"\n                        \"vmla.f32   q14, q8, d11[0]     \\n\"\n                        \"vmla.f32   q15, q8, d15[0]     \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vmla.f32   q14, q9, d11[1]     \\n\"\n                        \"vmla.f32   q15, q9, d15[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%6 :64]!  \\n\" // r48 r49\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d4[0]      \\n\"\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q8, d12[0]     \\n\"\n                        \"vmla.f32   q15, q8, d0[0]      \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"vmla.f32   q14, q9, d12[1]     \\n\"\n                        \"vmla.f32   q15, q9, d0[1]      \\n\"\n                        \"vmla.f32   q12, q10, d5[0]     \\n\"\n                        \"vmla.f32   q13, q10, d9[0]     \\n\"\n                        \"vmla.f32   q14, q10, d13[0]    \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vmla.f32   q14, q11, d13[1]    \\n\"\n                        \"vmla.f32   q15, q11, d1[1]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[0]     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q10, d14[0]    \\n\"\n                        \"vmla.f32   q15, q10, d2[0]     \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"vmla.f32   q14, q11, d14[1]    \\n\"\n                        \"vmla.f32   q15, q11, d2[1]     \\n\"\n                        \"vmla.f32   q12, q8, d7[0]      \\n\"\n                        \"vmla.f32   q13, q8, d11[0]     \\n\"\n                        \"vmla.f32   q14, q8, d15[0]     \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vmla.f32   q14, q9, d15[1]     \\n\"\n                        \"vmla.f32   q15, q9, d3[1]      \\n\"\n\n                        //                         \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128] \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"pld        [%6, #64]           \\n\"\n                        \"vld1.u16   {d5}, [%6 :64]      \\n\" // r410\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d8[0]      \\n\"\n                       
 \"vmla.f32   q13, q8, d12[0]     \\n\"\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"vmla.f32   q15, q9, d4[1]      \\n\"\n                        \"vmla.f32   q12, q10, d9[0]     \\n\"\n                        \"vmla.f32   q13, q10, d13[0]    \\n\"\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vmla.f32   q14, q11, d1[1]     \\n\"\n                        \"vmla.f32   q15, q11, d5[1]     \\n\"\n\n                        \"sub        %7, %7, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"sub        %2, %2, #16         \\n\"\n                        \"sub        %3, %3, #16         \\n\"\n                        \"sub        %4, %4, #16         \\n\"\n                        \"sub        %5, %5, #16         \\n\"\n                        \"sub        %6, %6, #16         \\n\"\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n                        \"vshrn.u32  d25, q13, #16       \\n\"\n                        \"vshrn.u32  d26, q14, #16       \\n\"\n                        \"vshrn.u32  d27, q15, #16       \\n\"\n\n                        \"vst1.u16   {d24-d27}, [%0 :64]! 
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r00 r01 r02 r03\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v20.4s, v21.4s}, [%1], #32 \\n\" // sum0 sum1\n\n                        \"fmul   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   
pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"fmul   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%2] \\n\" // r04 r05 r06\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n        
                \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         
\\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r10 r11 r12 r13\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                    
    \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%3] \\n\" // r14 r15 r16\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n         
               \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r20 r21 r22 r23\n                        \"shll   
v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                     
   \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%4] \\n\" // r24 r25 r26\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     
\\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5], #32 \\n\" // r30 r31 r32 r33\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   
v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v2.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%5] \\n\" // r34 r35 r36\n        
                \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v3.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v3.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v4.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         
\\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v5.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v5.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v5.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v5.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%6], #32 \\n\" // r40 r41 r42 r43\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v6.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                    
    \"fmla   v20.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v6.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v6.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v6.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v2.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v0.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v0.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v2.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%6] \\n\" // r44 r45 r46\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v1.s[0]     \\n\"\n         
               \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"fmla   v23.4s, v16.4s, v4.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v18.4s, v2.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        //                         \"prfm   
pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7] \\n\"\n                        \"fmla   v23.4s, v24.4s, v5.s[0]     \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v20.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v5.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v5.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v5.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v22.4s, v16.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v16.4s, v6.s[0]     \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v20.4s, v17.4s, v4.s[1]     \\n\"\n                        \"fmla   v21.4s, v17.4s, v6.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v22.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v18.4s, v6.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"fmla   v21.4s, v19.4s, v6.s[3]     \\n\"\n\n                        \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                        \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                        \"sub    %7, %7, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"shrn   v20.4h, v20.4s, #16         \\n\"\n                        \"shrn   v21.4h, v21.4s, #16         \\n\"\n\n                        \"st1    {v20.4h, v21.4h}, [%0], #16 \\n\"\n\n                
        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r00 r01 r02 r03\n\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d24-d27}, [%1 :128]! 
\\n\" // sum0 sum1\n\n                        \"vmul.f32   q14, q8, d0[0]      \\n\"\n                        \"vmul.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%2 :64] \\n\" // r04 r05 r06\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%3 :64]!  \\n\" // r10 r11 r12 r13\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%3 :64] \\n\" // r14 r15 r16\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, 
d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%4 :64]!  \\n\" // r20 r21 r22 r23\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"pld        [%4, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%4 :64] \\n\" // r24 r25 r26\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, 
d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%5 :64]!  \\n\" // r30 r31 r32 r33\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   q13, q11, d13[1]    \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d0[0]     \\n\"\n                        \"vmla.f32   q15, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d0[1]     \\n\"\n                        \"vmla.f32   q13, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d1[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"vmla.f32   q13, q9, d5[1]      \\n\"\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%5 :64] \\n\" // r34 r35 r36\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d2[0]      \\n\"\n                        \"vmla.f32   q15, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d2[1]      \\n\"\n                        \"vmla.f32   q13, q9, 
d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d3[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vmla.f32   q13, q11, d7[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d4[0]     \\n\"\n                        \"vmla.f32   q15, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d4[1]     \\n\"\n                        \"vmla.f32   q13, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d5[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d6[0]      \\n\"\n                        \"vmla.f32   q15, q8, d10[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d6[1]      \\n\"\n                        \"vmla.f32   q13, q9, d10[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d7[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d11[0]    \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"vmla.f32   q13, q11, d11[1]    \\n\"\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.u16   {d4-d7}, [%6 :64]!  \\n\" // r40 r41 r42 r43\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d4, #16         \\n\"\n                        \"vshll.u16  q1, d5, #16         \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q14, q10, d8[0]     \\n\"\n                        \"vmla.f32   q15, q10, d12[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d8[1]     \\n\"\n                        \"vmla.f32   q13, q11, d12[1]    \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d0[0]      \\n\"\n                        \"vmla.f32   q15, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d0[1]      \\n\"\n                        \"vmla.f32   q13, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q10, d1[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"vmla.f32   q13, q11, d5[1]     \\n\"\n                        \"pld        [%6, #192]          \\n\"\n                        \"vld1.u16   {d10-d12}, [%6 :64] \\n\" // r44 r45 r46\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q4, d10, #16        \\n\"\n                        \"vshll.u16  q5, d11, #16        \\n\"\n                        \"vshll.u16  q6, d12, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[0]     \\n\"\n                        \"vmla.f32   q15, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d2[1]     \\n\"\n                        \"vmla.f32   q13, q11, 
d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vmla.f32   q14, q8, d3[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vmla.f32   q13, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d4[0]      \\n\"\n                        \"vmla.f32   q15, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d4[1]      \\n\"\n                        \"vmla.f32   q13, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! 
\\n\"\n\n                        \"vmla.f32   q14, q10, d5[0]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vmla.f32   q13, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q14, q10, d6[0]     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q12, q11, d6[1]     \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n                        //                         \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128] \\n\"\n\n                        \"vmla.f32   q14, q8, d7[0]      \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d11[0]     \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vmla.f32   q13, q9, d11[1]     \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q14, q8, d8[0]      \\n\"\n                        \"vmla.f32   q15, q8, d12[0]     \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q12, q9, d8[1]      \\n\"\n                        \"vmla.f32   q13, q9, d12[1]     \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q14, q10, d9[0]     \\n\"\n                        \"vmla.f32   q15, q10, d13[0]    \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vmla.f32   
q13, q11, d13[1]    \\n\"\n\n                        \"vadd.f32   q12, q12, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n\n                        \"sub        %7, %7, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n                        \"vshrn.u32  d25, q13, #16       \\n\"\n\n                        \"vst1.u16   {d24-d25}, [%0 :64]! \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v20.4s}, [%1], #16         \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\" // r00 r01\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, 
[%7], #32 \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmul   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmul   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmul   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%2] \\n\" // r02 r03 r04\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        
\"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3], #16   \\n\" // r10 r11\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1] 
    \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%3] \\n\" // r12 r13 r14\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   
v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%4], #16   \\n\" // r20 r21\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 
\\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%4] \\n\" // r22 r23 r24\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   
v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%5], #16   \\n\" // r30 r31\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n            
            \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v0.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v0.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%5] \\n\" // r32 r33 r34\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, 
v17.4s, v1.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v1.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v1.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v2.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v2.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v3.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v3.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v3.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v3.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%6], #16   \\n\" // r40 r41\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                      
  \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v4.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v4.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v4.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v0.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v0.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v0.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v0.s[3]     \\n\"\n                        \"prfm   pldl1keep, [%6, #192]       \\n\"\n                        \"ld1    {v2.4h, v3.4h, v4.4h}, [%6] \\n\" // r42 r43 r44\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v1.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, 
v19.4h}, [%7], #32 \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v1.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v1.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v1.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v2.s[0]     \\n\"\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7], #32 \\n\"\n                        \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v2.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v2.s[2]     \\n\"\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"fmla   v20.4s, v19.4s, v2.s[3]     \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v24.4s, v3.s[0]     \\n\"\n                        //                         \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7] \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n                        \"fmla   v23.4s, v26.4s, v3.s[2]     \\n\"\n                        \"shll   v16.4s, v16.4h, #16         \\n\"\n                        \"fmla   v20.4s, v27.4s, v3.s[3]     \\n\"\n                        \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                        \"fmla   v21.4s, v16.4s, v4.s[0]     \\n\"\n          
              \"shll   v18.4s, v18.4h, #16         \\n\"\n                        \"fmla   v22.4s, v17.4s, v4.s[1]     \\n\"\n                        \"shll   v19.4s, v19.4h, #16         \\n\"\n                        \"fmla   v23.4s, v18.4s, v4.s[2]     \\n\"\n                        \"fmla   v20.4s, v19.4s, v4.s[3]     \\n\"\n\n                        \"fadd   v22.4s, v21.4s, v22.4s      \\n\"\n                        \"fadd   v23.4s, v22.4s, v23.4s      \\n\"\n                        \"fadd   v20.4s, v20.4s, v23.4s      \\n\"\n\n                        \"sub    %7, %7, #768                \\n\" // kptr -= 24 * 16;\n\n                        \"shrn   v20.4h, v20.4s, #16         \\n\"\n\n                        \"st1    {v20.4h}, [%0], #8          \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2 :64]!  
\\n\" // r00 r01\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%1 :128]! \\n\" // sum0\n\n                        \"vmul.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmul.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmul.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%2, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%2 :64]   \\n\" // r02 r03 r04\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3 :64]!  
\\n\" // r10 r11\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%3, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%3 :64]   \\n\" // r12 r13 r14\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4 :64]!  \\n\" // r20 r21\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%4, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%4 :64]   \\n\" // r22 r23 r24\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5 :64]!  
\\n\" // r30 r31\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d9[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d0[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d0[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d1[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d1[1]      \\n\"\n                        \"pld        [%5, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%5 :64]   \\n\" // r32 r33 r34\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q8, d2[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d2[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d3[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d3[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d4[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d4[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d5[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d5[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d6[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d6[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d7[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d7[1]     \\n\"\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%6 :64]!  \\n\" // r40 r41\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d8[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d8[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d0[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d0[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d1[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d1[1]     \\n\"\n                        \"pld        [%6, #192]          \\n\"\n                        \"vld1.u16   {d6-d8}, [%6 :64]   \\n\" // r42 r43 r44\n                        \"vshll.u16  q11, d17, #16       \\n\"\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshll.u16  q4, d8, #16         \\n\"\n\n                        \"vmla.f32   q13, q10, d2[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d2[1]     \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128]! 
\\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d3[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d3[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d4[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d4[1]      \\n\"\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d16-d19}, [%7 :128]! \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d5[0]     \\n\"\n                        \"vshll.u16  q10, d16, #16       \\n\"\n                        \"vmla.f32   q12, q11, d5[1]     \\n\"\n                        \"vshll.u16  q11, d17, #16       \\n\"\n\n                        \"vmla.f32   q13, q10, d6[0]     \\n\"\n                        \"vshll.u16  q8, d18, #16        \\n\"\n                        \"vmla.f32   q14, q11, d6[1]     \\n\"\n                        //                         \"pld        [%7, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%7 :128] \\n\"\n                        \"vshll.u16  q9, d19, #16        \\n\"\n                        \"vmla.f32   q15, q8, d7[0]      \\n\"\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vmla.f32   q12, q9, d7[1]      \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n\n                        \"vmla.f32   q13, q8, d8[0]      \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vmla.f32   q14, q9, d8[1]      \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n                        \"vmla.f32   q15, q10, d9[0]     \\n\"\n                        \"vmla.f32   
q12, q11, d9[1]     \\n\"\n\n                        \"vadd.f32   q14, q13, q14       \\n\"\n                        \"vadd.f32   q15, q14, q15       \\n\"\n                        \"vadd.f32   q12, q12, q15       \\n\"\n\n                        \"sub        %7, %7, #768        \\n\" // kptr -= 24 * 16;\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n\n                        \"vst1.u16   {d24}, [%0 :64]!    \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(kptr)          // %7\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_5x5_pack8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv5x5s1_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + p * 8) : vdupq_n_f16(0.f);\n        out0.fill(_bias0);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n            const __fp16* r3 = img0.row<const __fp16>(3);\n            const __fp16* r4 = img0.row<const __fp16>(4);\n\n            const __fp16* kptr = kernel.channel(p).row<const __fp16>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\" // sum0 sum1 sum2 sum3\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n 
                       \"fmla   v30.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1] \\n\" // r04 r05 r06 r07\n\n                        
\"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n            
            \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v5.h[2]     \\n\"\n                    
    \"fmla   v28.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v3.h[1]     \\n\"\n        
                \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n              
          \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v7.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v7.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v7.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v7.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v7.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   
v30.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v7.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v7.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v7.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v11.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%2] \\n\" // r14 r15 r16 r17\n\n                        \"fmla   v28.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v10.h[3]    \\n\"\n       
                 \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v8.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v12.h[1]    
\\n\"\n                        \"fmla   v28.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, 
v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v29.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v13.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, 
v13.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v13.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v14.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v14.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v14.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v29.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v13.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v14.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v14.h[4]    \\n\"\n                        \"fmla   
v28.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v14.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v14.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v13.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v14.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v14.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v15.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v14.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v15.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v28.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v14.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v15.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v12.h[3]    \\n\"\n      
                  \"fmla   v29.8h, v19.8h, v13.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v14.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v15.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v14.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v15.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v14.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v15.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v14.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v15.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v13.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v14.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v15.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     
\\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%3] \\n\" // r24 r25 r26 r27\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   
v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                     
   \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n         
               \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v5.h[2]     \\n\"\n                 
       \"fmla   v31.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v7.h[0]     \\n\"\n     
                   \"fmla   v28.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v7.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%4], #64 \\n\" // r30 r31 r32 r33\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v7.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v7.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v7.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v7.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v7.h[6]     \\n\"\n                        \"fmla   v28.8h, 
v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v7.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v11.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%4] \\n\" // r34 r35 r36 r37\n\n                        \"fmla   v28.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v10.h[4]    
\\n\"\n                        \"fmla   v31.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v8.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v10.h[3]    \\n\"\n 
                       \"fmla   v30.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, 
v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v29.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v13.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v13.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, 
v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v14.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v14.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v14.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v29.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v13.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v14.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v14.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v14.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   
v30.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v14.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v13.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v14.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v30.8h, v16.8h, v14.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v15.h[0]    \\n\"\n                        \"fmla   v28.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v14.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v15.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%5], #64 \\n\" // r40 r41 r42 r43\n\n                        \"fmla   v28.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v14.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v15.h[2]    \\n\"\n                        \"fmla   v28.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v29.8h, v19.8h, v13.h[3]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v14.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v15.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, 
v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v30.8h, v20.8h, v14.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v15.h[4]    \\n\"\n                        \"fmla   v28.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v14.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v15.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v14.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v15.h[6]    \\n\"\n                        \"fmla   v28.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v29.8h, v23.8h, v13.h[7]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v14.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v15.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%5] \\n\" // r44 r45 r46 r47\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        
\"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n            
            \"fmla   v30.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v1.h[7]     \\n\"\n                    
    \"fmla   v29.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n        
                \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n              
          \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6] \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v7.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v7.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n         
               \"fmla   v30.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v7.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v7.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v7.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v7.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v7.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v5.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v7.h[7]     \\n\"\n\n                        \"sub    %6, %6, #3136               \\n\" // kptr -= 24.5 * 64;\n\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     
// %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%1], #32   \\n\" // r00 r01\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v30.8h, v31.8h}, [%0]      \\n\" // sum0 sum1\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                      
  \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v1.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v2.8h, v3.8h, v4.8h, v5.8h}, [%1] \\n\" // r02 r03 r04 r05\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, 
v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n         
               \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v8.8h, v9.8h}, [%2], #32   \\n\" // r10 r11\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        
\"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v9.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v10.8h, v11.8h, v12.8h, v13.8h}, [%2] \\n\" // r12 r13 r14 r15\n\n                        \"fmla   v30.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v9.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v8.h[7]     \\n\"\n                        \"fmla   v31.8h, 
v23.8h, v9.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v10.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v11.h[2]    \\n\"\n         
               \"fmla   v30.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, 
v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v13.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%3], #32   \\n\" // r20 r21\n\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v13.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v13.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     
\\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v2.8h, v3.8h, v4.8h, v5.8h}, [%3] \\n\" // r22 r23 r24 r25\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                   
     \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, 
v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n\n               
         \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v8.8h, v9.8h}, [%4], #32   \\n\" // r30 r31\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v9.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v10.8h, v11.8h, v12.8h, v13.8h}, [%4] \\n\" // r32 r33 r34 r35\n\n                        \"fmla   v30.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v9.h[3]     \\n\"\n\n                      
  \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v8.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v9.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v10.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v9.h[7]    
 \\n\"\n                        \"fmla   v31.8h, v23.8h, v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        
\"fmla   v31.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v31.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v31.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v13.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%5], #32   \\n\" // r40 r41\n\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v13.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v30.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v31.8h, 
v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v13.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v2.8h, v3.8h, v4.8h, v5.8h}, [%5] \\n\" // r42 r43 r44 r45\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     
\\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        
\"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]    
 \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6] \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n                        \"fadd   v29.8h, v29.8h, v31.8h      \\n\"\n\n                        \"sub    %6, %6, #3136               \\n\" // kptr -= 24.5 * 64;\n\n                        \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n     
                   \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%1], #16          \\n\" // r00\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v31.8h}, [%0]              \\n\" // sum0\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmul   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v1.8h, v2.8h, 
v3.8h, v4.8h}, [%1] \\n\" // r01 r02 r03 r04\n\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    
{v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v8.8h}, [%2], #16          \\n\" // r10\n\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                    
    \"fmla   v29.8h, v17.8h, v8.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v9.8h, v10.8h, v11.8h, v12.8h}, [%2] \\n\" // r11 r12 r13 r14\n\n                        \"fmla   v30.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v8.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v8.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v9.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v9.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n   
                     \"fmla   v31.8h, v19.8h, v10.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v12.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%3], #16          \\n\" // r20\n\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n      
                  \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v1.8h, v2.8h, v3.8h, v4.8h}, [%3] \\n\" // r21 r22 r23 r24\n\n                        \"fmla   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, 
v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512] 
      \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v8.8h}, [%4], #16          \\n\" // r30\n\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v8.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v9.8h, v10.8h, v11.8h, v12.8h}, [%4] \\n\" // r31 r32 r33 r34\n\n                        \"fmla   v30.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v8.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v8.h[7]     \\n\"\n\n                        
\"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v9.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v9.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v10.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                      
  \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v12.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%5], #16          \\n\" // r40\n\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v1.8h, v2.8h, v3.8h, v4.8h}, [%5] \\n\" // r41 r42 r43 
r44\n\n                        \"fmla   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], 
#64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6] \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v29.8h      \\n\"\n                        \"fadd   v30.8h, v30.8h, v31.8h      \\n\"\n                        
\"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n\n                        \"sub    %6, %6, #3136               \\n\" // kptr -= 24.5 * 64;\n\n                        \"st1    {v28.8h}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n\n                r0 += 4 * 8;\n                r1 += 4 * 8;\n                r2 += 4 * 8;\n                r3 += 4 * 8;\n                r4 += 4 * 8;\n            }\n        }\n    }\n}\n\nstatic void conv5x5s2_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 8;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x8_t _bias0 = bias ? 
vld1q_f16(bias + p * 8) : vdupq_n_f16(0.f);\n        out0.fill(_bias0);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n            const __fp16* r3 = img0.row<const __fp16>(3);\n            const __fp16* r4 = img0.row<const __fp16>(4);\n\n            const __fp16* kptr = kernel.channel(p).row<const __fp16>(q);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 1 < outw; j += 2)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v30.8h, v31.8h}, [%0]      \\n\" // sum0 sum1\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                 
       \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h}, [%1] \\n\" // r04 r05 r06\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   
v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n 
                       \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                  
      \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v10.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v8.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, 
v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v11.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v12.8h, v13.8h, v14.8h}, [%2] \\n\" // r14 r15 r16\n\n                        \"fmla   v28.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v12.h[1]    
\\n\"\n                        \"fmla   v28.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v13.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v13.h[4]    \\n\"\n                        
\"fmla   v30.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v13.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v14.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v14.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                        \"fmla   v28.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v14.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v14.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v14.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v14.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v14.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v14.h[7]    \\n\"\n\n                        \"prfm   
pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h}, [%3] \\n\" // r24 r25 r26\n\n                        \"fmla   v28.8h, v18.8h, 
v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                 
       \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, 
v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%4], #64 \\n\" // r30 r31 r32 r33\n\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v8.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v10.h[3]    
\\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v8.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v9.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v11.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                        \"ld1    {v12.8h, v13.8h, v14.8h}, [%4] \\n\" // r34 r35 r36\n\n                        \"fmla   v28.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v9.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v11.h[5]    \\n\"\n       
                 \"fmla   v28.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v9.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v10.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v10.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v13.h[0]    \\n\"\n                        \"fmla   v30.8h, 
v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v13.h[1]    \\n\"\n                        \"fmla   v28.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v13.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v11.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v13.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v13.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v13.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v13.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v11.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v13.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v16.8h, v14.h[0]    \\n\"\n                        \"fmla   v30.8h, v17.8h, v12.h[1]    \\n\"\n                        \"fmla   v31.8h, v17.8h, v14.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%5], #64 \\n\" // r40 r41 r42 r43\n\n                        \"fmla   v28.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v29.8h, v18.8h, v14.h[2]    \\n\"\n                        \"fmla   v30.8h, v19.8h, v12.h[3]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v14.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]      
 \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v20.8h, v14.h[4]    \\n\"\n                        \"fmla   v30.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v31.8h, v21.8h, v14.h[5]    \\n\"\n                        \"fmla   v28.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v29.8h, v22.8h, v14.h[6]    \\n\"\n                        \"fmla   v30.8h, v23.8h, v12.h[7]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v14.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        
\"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #384]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h}, [%5] \\n\" // r44 r45 r46\n\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, 
v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v5.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[4]     \\n\"\n         
               \"fmla   v30.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6] \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v6.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v6.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[5]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v6.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v4.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[7]     \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n                        \"fadd   v29.8h, v29.8h, v31.8h      \\n\"\n\n                        \"sub    %6, %6, #3136               \\n\" // kptr -= 24.5 * 64;\n\n                        \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n          
              \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%1], #32   \\n\" // r00 r01\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v31.8h}, [%0]              \\n\" // sum0\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmul   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, 
v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v2.8h, v3.8h, v4.8h}, [%1] \\n\" // r02 r03 r04\n\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, 
v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v8.8h, v9.8h}, [%2], #32   \\n\" // r10 r11\n\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   
pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v8.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v8.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v9.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v10.8h, v11.8h, v12.8h}, [%2] \\n\" // r12 r13 r14\n\n                        \"fmla   v30.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v9.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v9.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n         
               \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v10.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v12.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%3], #32   \\n\" // r20 r21\n\n          
              \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v2.8h, v3.8h, v4.8h}, [%3] \\n\" // r22 r23 r24\n\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n     
                   \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n   
                     \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v8.8h, v9.8h}, [%4], #32   \\n\" // r30 r31\n\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v4.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v8.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v8.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v8.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v8.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v8.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v8.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v8.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v8.h[7]     \\n\"\n\n  
                      \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v9.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v9.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                        \"ld1    {v10.8h, v11.8h, v12.8h}, [%4] \\n\" // r32 r33 r34\n\n                        \"fmla   v30.8h, v18.8h, v9.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v9.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v9.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v9.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v9.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v9.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v10.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v10.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v10.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v10.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v10.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v10.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v10.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v10.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, 
v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v11.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v11.h[1]    \\n\"\n                        \"fmla   v30.8h, v18.8h, v11.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v11.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v11.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v11.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v11.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v11.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v12.h[0]    \\n\"\n                        \"fmla   v29.8h, v17.8h, v12.h[1]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%5], #32   \\n\" // r40 r41\n\n                        \"fmla   v30.8h, v18.8h, v12.h[2]    \\n\"\n                        \"fmla   v31.8h, v19.8h, v12.h[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v12.h[4]    \\n\"\n                        \"fmla   v29.8h, v21.8h, v12.h[5]    \\n\"\n                        \"fmla   v30.8h, v22.8h, v12.h[6]    \\n\"\n                        \"fmla   v31.8h, v23.8h, v12.h[7]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, 
v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #384]       \\n\"\n                        \"ld1    {v2.8h, v3.8h, v4.8h}, [%5] \\n\" // r42 r43 r44\n\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   
v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v3.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v3.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v3.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6] \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v4.h[0]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v4.h[1]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v4.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v4.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v4.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v4.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v4.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, 
v4.h[7]     \\n\"\n\n                        \"fadd   v28.8h, v28.8h, v29.8h      \\n\"\n                        \"fadd   v30.8h, v30.8h, v31.8h      \\n\"\n                        \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n\n                        \"sub    %6, %6, #3136               \\n\" // kptr -= 24.5 * 64;\n\n                        \"st1    {v28.8h}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(kptr)     // %6\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_7x7.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv7x7s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 49 + q * 49;\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n            const float* r3 = img0 + w * 3;\n            const float* r4 = img0 + w * 4;\n            const float* r5 = img0 + w * 5;\n            const float* r6 = img0 + w * 6;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 7;\n            const float* k2 = kernel0 + 14;\n            const float* k3 = kernel0 + 21;\n            const float* k4 = kernel0 + 28;\n            const float* k5 = kernel0 + 35;\n            const float* k6 = kernel0 + 42;\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw - (nn << 2);\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                float32x4_t _k0123 = vld1q_f32(k0);\n                float32x4_t _k4567 = vld1q_f32(k0 + 4);\n                float32x4_t _k78910 = vld1q_f32(k1);\n                float32x4_t _k11121314 = vld1q_f32(k1 + 4);\n       
         float32x4_t _k14151617 = vld1q_f32(k2);\n                float32x4_t _k18192021 = vld1q_f32(k2 + 4);\n                float32x4_t _k21222324 = vld1q_f32(k3);\n                float32x4_t _k25262728 = vld1q_f32(k3 + 4);\n                float32x4_t _k28293031 = vld1q_f32(k4);\n                float32x4_t _k32333435 = vld1q_f32(k4 + 4);\n                float32x4_t _k35363738 = vld1q_f32(k5);\n                float32x4_t _k39404142 = vld1q_f32(k5 + 4);\n                float32x4_t _k42434445 = vld1q_f32(k6);\n                float32x4_t _k46474849 = vld1q_f32(k6 + 4);\n#ifdef __clang__ // __ARM_NEON && __aarch64__ && __clang__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        // v0:  input / final output\n                        // v1 v2 v3: = ri0 ri4 ri0n , i <-  1-7\n                        // v4 = ri1 / ri3 / ri6\n                        // v5 = ri2 / ri5\n                        // v9 = intermediate sum register\n                        \"0:                                        \\n\"\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v0.4s}, [%1]                  \\n\"\n\n                        //i = 1\n                        \"prfm       pldl1keep, [%2, #384]          \\n\"\n                        \"ld1        {v1.4s, v2.4s, v3.4s}, [%2]    \\n\"\n                        \"add        %2, %2, #16                    \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #4     \\n\"\n                        \"fmul       v9.4s, v1.4s, %18.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v2.16b, #8     \\n\"\n                        \"fmla       v0.4s, v4.4s, %18.s[1]         \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #12    \\n\"\n                        \"fmla       v9.4s, v5.4s, %18.s[2]         \\n\"\n                        \"ext        v5.16b, v2.16b, v3.16b, #4     
\\n\"\n                        \"fmla       v0.4s, v4.4s, %18.s[3]         \\n\"\n                        \"ext        v4.16b, v2.16b, v3.16b, #8     \\n\"\n                        \"fmla       v9.4s, v2.4s, %19.s[0]         \\n\"\n                        \"fmla       v0.4s, v5.4s, %19.s[1]         \\n\"\n                        \"fmla       v9.4s, v4.4s, %19.s[2]         \\n\"\n\n                        //i = 2\n                        \"prfm       pldl1keep, [%3, #384]          \\n\"\n                        \"ld1        {v1.4s, v2.4s, v3.4s}, [%3]    \\n\" // v1 v2 v3: = r20 r24 r20n\n                        \"add        %3, %3, #16                    \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #4     \\n\" // v4 = r21\n                        \"fmla       v9.4s, v1.4s, %20.s[0]         \\n\" // *+ r10\n                        \"ext        v5.16b, v1.16b, v2.16b, #8     \\n\" // v5 = r22\n                        \"fmla       v0.4s, v4.4s, %20.s[1]         \\n\" // *+ r11\n                        \"ext        v4.16b, v1.16b, v2.16b, #12    \\n\" // v4 = r23\n                        \"fmla       v9.4s, v5.4s, %20.s[2]         \\n\" // *+ r12\n                        \"ext        v5.16b, v2.16b, v3.16b, #4     \\n\" // v5 = r25\n                        \"fmla       v0.4s, v4.4s, %20.s[3]         \\n\" // *+ r13\n                        \"ext        v4.16b, v2.16b, v3.16b, #8     \\n\" // v4 = r26\n                        \"fmla       v9.4s, v2.4s, %21.s[0]         \\n\" // *+ r14\n                        \"fmla       v0.4s, v5.4s, %21.s[1]         \\n\" // *+ r15\n                        \"fmla       v9.4s, v4.4s, %21.s[2]         \\n\" // *+ r16\n\n                        //i = 3\n                        \"prfm       pldl1keep, [%4, #384]          \\n\"\n                        \"ld1        {v1.4s, v2.4s, v3.4s}, [%4]    \\n\"\n                        \"add        %4, %4, #16                    \\n\"\n                        \"ext        
v4.16b, v1.16b, v2.16b, #4     \\n\"\n                        \"fmla       v9.4s, v1.4s, %22.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v2.16b, #8     \\n\"\n                        \"fmla       v0.4s, v4.4s, %22.s[1]         \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #12    \\n\"\n                        \"fmla       v9.4s, v5.4s, %22.s[2]         \\n\"\n                        \"ext        v5.16b, v2.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v4.4s, %22.s[3]         \\n\"\n                        \"ext        v4.16b, v2.16b, v3.16b, #8     \\n\"\n                        \"fmla       v9.4s, v2.4s, %23.s[0]         \\n\"\n                        \"fmla       v0.4s, v5.4s, %23.s[1]         \\n\"\n                        \"fmla       v9.4s, v4.4s, %23.s[2]         \\n\"\n\n                        //i = 4\n                        \"prfm       pldl1keep, [%5, #384]          \\n\"\n                        \"ld1        {v1.4s, v2.4s, v3.4s}, [%5]    \\n\"\n                        \"add        %5, %5, #16                    \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #4     \\n\"\n                        \"fmla       v9.4s, v1.4s, %24.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v2.16b, #8     \\n\"\n                        \"fmla       v0.4s, v4.4s, %24.s[1]         \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #12    \\n\"\n                        \"fmla       v9.4s, v5.4s, %24.s[2]         \\n\"\n                        \"ext        v5.16b, v2.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v4.4s, %24.s[3]         \\n\"\n                        \"ext        v4.16b, v2.16b, v3.16b, #8     \\n\"\n                        \"fmla       v9.4s, v2.4s, %25.s[0]         \\n\"\n                        \"fmla       v0.4s, v5.4s, %25.s[1]         \\n\"\n                        \"fmla       v9.4s, v4.4s, 
%25.s[2]         \\n\"\n\n                        //i = 5\n                        \"prfm       pldl1keep, [%6, #384]          \\n\"\n                        \"ld1        {v1.4s, v2.4s, v3.4s}, [%6]    \\n\"\n                        \"add        %6, %6, #16                    \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #4     \\n\"\n                        \"fmla       v9.4s, v1.4s, %26.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v2.16b, #8     \\n\"\n                        \"fmla       v0.4s, v4.4s, %26.s[1]         \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #12    \\n\"\n                        \"fmla       v9.4s, v5.4s, %26.s[2]         \\n\"\n                        \"ext        v5.16b, v2.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v4.4s, %26.s[3]         \\n\"\n                        \"ext        v4.16b, v2.16b, v3.16b, #8     \\n\"\n                        \"fmla       v9.4s, v2.4s, %27.s[0]         \\n\"\n                        \"fmla       v0.4s, v5.4s, %27.s[1]         \\n\"\n                        \"fmla       v9.4s, v4.4s, %27.s[2]         \\n\"\n\n                        //i = 6\n                        \"prfm       pldl1keep, [%7, #384]          \\n\"\n                        \"ld1        {v1.4s, v2.4s, v3.4s}, [%7]    \\n\"\n                        \"add        %7, %7, #16                    \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #4     \\n\"\n                        \"fmla       v9.4s, v1.4s, %28.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v2.16b, #8     \\n\"\n                        \"fmla       v0.4s, v4.4s, %28.s[1]         \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #12    \\n\"\n                        \"fmla       v9.4s, v5.4s, %28.s[2]         \\n\"\n                        \"ext        v5.16b, v2.16b, v3.16b, #4     \\n\"\n                        \"fmla 
      v0.4s, v4.4s, %28.s[3]         \\n\"\n                        \"ext        v4.16b, v2.16b, v3.16b, #8     \\n\"\n                        \"fmla       v9.4s, v2.4s, %29.s[0]         \\n\"\n                        \"fmla       v0.4s, v5.4s, %29.s[1]         \\n\"\n                        \"fmla       v9.4s, v4.4s, %29.s[2]         \\n\"\n\n                        //i = 7\n                        \"prfm       pldl1keep, [%8, #384]          \\n\"\n                        \"ld1        {v1.4s, v2.4s, v3.4s}, [%8]    \\n\"\n                        \"add        %8, %8, #16                    \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #4     \\n\"\n                        \"fmla       v9.4s, v1.4s, %30.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v2.16b, #8     \\n\"\n                        \"fmla       v0.4s, v4.4s, %30.s[1]         \\n\"\n                        \"ext        v4.16b, v1.16b, v2.16b, #12    \\n\"\n                        \"fmla       v9.4s, v5.4s, %30.s[2]         \\n\"\n                        \"ext        v5.16b, v2.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v4.4s, %30.s[3]         \\n\"\n                        \"ext        v4.16b, v2.16b, v3.16b, #8     \\n\"\n                        \"fmla       v9.4s, v2.4s, %31.s[0]         \\n\"\n                        \"fmla       v0.4s, v5.4s, %31.s[1]         \\n\"\n                        \"fmla       v9.4s, v4.4s, %31.s[2]         \\n\"\n\n                        \"fadd       v0.4s, v0.4s, v9.4s            \\n\"\n                        \"st1        {v0.4s}, [%1], #16             \\n\"\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"bne        0b                             \\n\"\n\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n  
                      \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4),     // %6\n                        \"=r\"(r5),     // %7\n                        \"=r\"(r6)      // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"w\"(_k0123),     // %18\n                        \"w\"(_k4567),     // %19\n                        \"w\"(_k78910),    // %20\n                        \"w\"(_k11121314), // %21\n                        \"w\"(_k14151617), // %22\n                        \"w\"(_k18192021), // %23\n                        \"w\"(_k21222324), // %24\n                        \"w\"(_k25262728), // %25\n                        \"w\"(_k28293031), // %26\n                        \"w\"(_k32333435), // %27\n                        \"w\"(_k35363738), // %28\n                        \"w\"(_k39404142), // %29\n                        \"w\"(_k42434445), // %30\n                        \"w\"(_k46474849)  // %31\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v9\");\n                }\n#else\n                /**\n                * __ARM_NEON && __aarch64__ defined, but __clang__ not defined\n                * When compiled with gcc, gcc does not accept over 30 operands\n                */\n                for (; nn > 0; nn--)\n                {\n                    float32x4_t _sum = vld1q_f32(outptr);\n\n                    float32x4_t _r00 = vld1q_f32(r0);             // 0 1 2 3\n                    float32x4_t _r04 = vld1q_f32(r0 + 4);         // 4 5 6 7\n                    float32x4_t _r00n = vld1q_f32(r0 + 8);        // 8 9 10 11\n                    float32x4_t _r01 
= vextq_f32(_r00, _r04, 1);  // 1 2 3 4\n                    float32x4_t _r02 = vextq_f32(_r00, _r04, 2);  // 2 3 4 5\n                    float32x4_t _r03 = vextq_f32(_r00, _r04, 3);  // 3 4 5 6\n                    float32x4_t _r05 = vextq_f32(_r04, _r00n, 1); // 5 6 7 8\n                    float32x4_t _r06 = vextq_f32(_r04, _r00n, 2); // 6 7 8 9\n\n                    _sum = vfmaq_laneq_f32(_sum, _r00, _k0123, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r01, _k0123, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r02, _k0123, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r03, _k0123, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r04, _k4567, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r05, _k4567, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r06, _k4567, 2);\n\n                    float32x4_t _r10 = vld1q_f32(r1);\n                    float32x4_t _r14 = vld1q_f32(r1 + 4);\n                    float32x4_t _r10n = vld1q_f32(r1 + 8);\n                    float32x4_t _r11 = vextq_f32(_r10, _r14, 1);\n                    float32x4_t _r12 = vextq_f32(_r10, _r14, 2);\n                    float32x4_t _r13 = vextq_f32(_r10, _r14, 3);\n                    float32x4_t _r15 = vextq_f32(_r14, _r10n, 1);\n                    float32x4_t _r16 = vextq_f32(_r14, _r10n, 2);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r10, _k78910, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r11, _k78910, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r12, _k78910, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r13, _k78910, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r14, _k11121314, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r15, _k11121314, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r16, _k11121314, 2);\n\n                    float32x4_t _r20 = vld1q_f32(r2);\n                    float32x4_t _r24 = vld1q_f32(r2 + 4);\n                    float32x4_t _r20n = 
vld1q_f32(r2 + 8);\n                    float32x4_t _r21 = vextq_f32(_r20, _r24, 1);\n                    float32x4_t _r22 = vextq_f32(_r20, _r24, 2);\n                    float32x4_t _r23 = vextq_f32(_r20, _r24, 3);\n                    float32x4_t _r25 = vextq_f32(_r24, _r20n, 1);\n                    float32x4_t _r26 = vextq_f32(_r24, _r20n, 2);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r20, _k14151617, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r21, _k14151617, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r22, _k14151617, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r23, _k14151617, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r24, _k18192021, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r25, _k18192021, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r26, _k18192021, 2);\n\n                    float32x4_t _r30 = vld1q_f32(r3);\n                    float32x4_t _r34 = vld1q_f32(r3 + 4);\n                    float32x4_t _r30n = vld1q_f32(r3 + 8);\n                    float32x4_t _r31 = vextq_f32(_r30, _r34, 1);\n                    float32x4_t _r32 = vextq_f32(_r30, _r34, 2);\n                    float32x4_t _r33 = vextq_f32(_r30, _r34, 3);\n                    float32x4_t _r35 = vextq_f32(_r34, _r30n, 1);\n                    float32x4_t _r36 = vextq_f32(_r34, _r30n, 2);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r30, _k21222324, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r31, _k21222324, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r32, _k21222324, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r33, _k21222324, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r34, _k25262728, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r35, _k25262728, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r36, _k25262728, 2);\n\n                    float32x4_t _r40 = vld1q_f32(r4);\n                    float32x4_t _r44 = vld1q_f32(r4 + 4);\n            
        float32x4_t _r40n = vld1q_f32(r4 + 8);\n                    float32x4_t _r41 = vextq_f32(_r40, _r44, 1);\n                    float32x4_t _r42 = vextq_f32(_r40, _r44, 2);\n                    float32x4_t _r43 = vextq_f32(_r40, _r44, 3);\n                    float32x4_t _r45 = vextq_f32(_r44, _r40n, 1);\n                    float32x4_t _r46 = vextq_f32(_r44, _r40n, 2);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r40, _k28293031, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r41, _k28293031, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r42, _k28293031, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r43, _k28293031, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r44, _k32333435, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r45, _k32333435, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r46, _k32333435, 2);\n\n                    float32x4_t _r50 = vld1q_f32(r5);\n                    float32x4_t _r54 = vld1q_f32(r5 + 4);\n                    float32x4_t _r50n = vld1q_f32(r5 + 8);\n                    float32x4_t _r51 = vextq_f32(_r50, _r54, 1);\n                    float32x4_t _r52 = vextq_f32(_r50, _r54, 2);\n                    float32x4_t _r53 = vextq_f32(_r50, _r54, 3);\n                    float32x4_t _r55 = vextq_f32(_r54, _r50n, 1);\n                    float32x4_t _r56 = vextq_f32(_r54, _r50n, 2);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r50, _k35363738, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r51, _k35363738, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r52, _k35363738, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r53, _k35363738, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r54, _k39404142, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r55, _k39404142, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r56, _k39404142, 2);\n\n                    float32x4_t _r60 = vld1q_f32(r6);\n                    float32x4_t _r64 = 
vld1q_f32(r6 + 4);\n                    float32x4_t _r60n = vld1q_f32(r6 + 8);\n                    float32x4_t _r61 = vextq_f32(_r60, _r64, 1);\n                    float32x4_t _r62 = vextq_f32(_r60, _r64, 2);\n                    float32x4_t _r63 = vextq_f32(_r60, _r64, 3);\n                    float32x4_t _r65 = vextq_f32(_r64, _r60n, 1);\n                    float32x4_t _r66 = vextq_f32(_r64, _r60n, 2);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r60, _k42434445, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r61, _k42434445, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r62, _k42434445, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r63, _k42434445, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r64, _k46474849, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r65, _k46474849, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r66, _k46474849, 2);\n\n                    vst1q_f32(outptr, _sum);\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                    r3 += 4;\n                    r4 += 4;\n                    r5 += 4;\n                    r6 += 4;\n                    outptr += 4;\n                }\n#endif // __clang__\n#else  //__aarch32__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                             \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d24-d25}, [%1]     \\n\" // _sum\n                        //                     \"veor       q13, q13            \\n\"// _sum2 = 0;\n                        //                     \"veor       q14, q14            \\n\"// _sum3 = 0;\n                        //                     \"veor       q15, q15            \\n\"// _sum4 = 0;\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = 
k0123 k4567\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%2]!      \\n\" // q0 = 0  1  2  3\n                        \"vmla.f32   q12, q0, d8[0]      \\n\"\n\n                        \"pld        [%2, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%2]       \\n\" // q2 = 4  5  6  7  q3 = 8  9 10 11\n                        \"vmul.f32   q13, q2, d10[0]     \\n\"\n\n                        \"vext.32    q1, q0, q2, #1      \\n\" // q1 = 1  2  3  4\n                        \"vext.32    q10, q2, q3, #1     \\n\" // q10= 5  6  7  8\n                        \"vmul.f32   q14, q1, d8[1]      \\n\"\n                        \"vmul.f32   q15, q10, d10[1]    \\n\"\n\n                        \"vext.32    q8, q0, q2, #2      \\n\" // q8 = 2  3  4  5\n                        \"vext.32    q11, q2, q3, #2     \\n\" // q11= 6  7  8  9\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q11, d11[0]    \\n\"\n\n                        \"vext.32    q9, q0, q2, #3      \\n\" // q9 = 3  4  5  6\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d12-d15}, [%9]     \\n\" // q6 q7 = k78910 k11121314\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%3]!      
\\n\"\n                        \"vmla.f32   q15, q0, d12[0]     \\n\"\n\n                        \"pld        [%3, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%3]       \\n\"\n                        \"vmla.f32   q12, q2, d14[0]     \\n\"\n\n                        \"vext.32    q1, q0, q2, #1      \\n\"\n                        \"vext.32    q10, q2, q3, #1     \\n\"\n                        \"vmla.f32   q13, q1, d12[1]     \\n\"\n                        \"vmla.f32   q14, q10, d14[1]    \\n\"\n\n                        \"vext.32    q8, q0, q2, #2      \\n\"\n                        \"vext.32    q11, q2, q3, #2     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q11, d15[0]    \\n\"\n\n                        \"vext.32    q9, q0, q2, #3      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = k14151617 k18192021\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%4]!      
\\n\"\n                        \"vmla.f32   q14, q0, d8[0]      \\n\"\n\n                        \"pld        [%4, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%4]       \\n\"\n                        \"vmla.f32   q15, q2, d10[0]     \\n\"\n\n                        \"vext.32    q1, q0, q2, #1      \\n\"\n                        \"vext.32    q10, q2, q3, #1     \\n\"\n                        \"vmla.f32   q12, q1, d8[1]      \\n\"\n                        \"vmla.f32   q13, q10, d10[1]    \\n\"\n\n                        \"vext.32    q8, q0, q2, #2      \\n\"\n                        \"vext.32    q11, q2, q3, #2     \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q11, d11[0]    \\n\"\n\n                        \"vext.32    q9, q0, q2, #3      \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d12-d15}, [%9]     \\n\" // q6 q7 = k21222324 k25262728\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%5]!      
\\n\"\n                        \"vmla.f32   q13, q0, d12[0]     \\n\"\n\n                        \"pld        [%5, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%5]       \\n\"\n                        \"vmla.f32   q14, q2, d14[0]     \\n\"\n\n                        \"vext.32    q1, q0, q2, #1      \\n\"\n                        \"vext.32    q10, q2, q3, #1     \\n\"\n                        \"vmla.f32   q15, q1, d12[1]     \\n\"\n                        \"vmla.f32   q12, q10, d14[1]    \\n\"\n\n                        \"vext.32    q8, q0, q2, #2      \\n\"\n                        \"vext.32    q11, q2, q3, #2     \\n\"\n                        \"vmla.f32   q13, q8, d13[0]     \\n\"\n                        \"vmla.f32   q14, q11, d15[0]    \\n\"\n\n                        \"vext.32    q9, q0, q2, #3      \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = k28293031 k32333435\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%6]!      
\\n\"\n                        \"vmla.f32   q12, q0, d8[0]      \\n\"\n\n                        \"pld        [%6, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%6]       \\n\"\n                        \"vmla.f32   q13, q2, d10[0]     \\n\"\n\n                        \"vext.32    q1, q0, q2, #1      \\n\"\n                        \"vext.32    q10, q2, q3, #1     \\n\"\n                        \"vmla.f32   q14, q1, d8[1]      \\n\"\n                        \"vmla.f32   q15, q10, d10[1]    \\n\"\n\n                        \"vext.32    q8, q0, q2, #2      \\n\"\n                        \"vext.32    q11, q2, q3, #2     \\n\"\n                        \"vmla.f32   q12, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q11, d11[0]    \\n\"\n\n                        \"vext.32    q9, q0, q2, #3      \\n\"\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d12-d15}, [%9]     \\n\" // q6 q7 = k35363738 k39404142\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%7]!      
\\n\"\n                        \"vmla.f32   q15, q0, d12[0]     \\n\"\n\n                        \"pld        [%7, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%7]       \\n\"\n                        \"vmla.f32   q12, q2, d14[0]     \\n\"\n\n                        \"vext.32    q1, q0, q2, #1      \\n\"\n                        \"vext.32    q10, q2, q3, #1     \\n\"\n                        \"vmla.f32   q13, q1, d12[1]     \\n\"\n                        \"vmla.f32   q14, q10, d14[1]    \\n\"\n\n                        \"vext.32    q8, q0, q2, #2      \\n\"\n                        \"vext.32    q11, q2, q3, #2     \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q12, q11, d15[0]    \\n\"\n\n                        \"vext.32    q9, q0, q2, #3      \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = k42434445 k46474849\n                        \"sub        %9, #168            \\n\" // restore k0\n\n                        \"pld        [%8, #128]          \\n\"\n                        \"vld1.f32   {d0-d1}, [%8]!      
\\n\"\n                        \"vmla.f32   q14, q0, d8[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.f32   {d4-d7}, [%8]       \\n\"\n                        \"vmla.f32   q15, q2, d10[0]     \\n\"\n\n                        \"vext.32    q1, q0, q2, #1      \\n\"\n                        \"vext.32    q10, q2, q3, #1     \\n\"\n                        \"vmla.f32   q12, q1, d8[1]      \\n\"\n                        \"vmla.f32   q13, q10, d10[1]    \\n\"\n\n                        \"vext.32    q8, q0, q2, #2      \\n\"\n                        \"vext.32    q11, q2, q3, #2     \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q11, d11[0]    \\n\"\n\n                        \"vext.32    q9, q0, q2, #3      \\n\"\n                        \"vmla.f32   q12, q9, d9[1]      \\n\"\n\n                        \"vadd.f32   q13, q13, q14       \\n\"\n                        \"vadd.f32   q13, q13, q15       \\n\"\n                        \"vadd.f32   q12, q12, q13       \\n\"\n\n                        \"vst1.f32   {d24-d25}, [%1]!    
\\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4),     // %6\n                        \"=r\"(r5),     // %7\n                        \"=r\"(r6),     // %8\n                        \"=r\"(k0)      // %9\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(k0)\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n\n                for (; remain > 0; remain--)\n                {\n                    float sum = 0;\n\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r0[3] * k0[3];\n                    sum += r0[4] * k0[4];\n                    sum += r0[5] * k0[5];\n                    sum += r0[6] * k0[6];\n\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r1[3] * k1[3];\n                    sum += r1[4] * k1[4];\n                    sum += r1[5] * k1[5];\n                    sum += r1[6] * k1[6];\n\n                    sum += r2[0] * k2[0];\n                    sum 
+= r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n                    sum += r2[3] * k2[3];\n                    sum += r2[4] * k2[4];\n                    sum += r2[5] * k2[5];\n                    sum += r2[6] * k2[6];\n\n                    sum += r3[0] * k3[0];\n                    sum += r3[1] * k3[1];\n                    sum += r3[2] * k3[2];\n                    sum += r3[3] * k3[3];\n                    sum += r3[4] * k3[4];\n                    sum += r3[5] * k3[5];\n                    sum += r3[6] * k3[6];\n\n                    sum += r4[0] * k4[0];\n                    sum += r4[1] * k4[1];\n                    sum += r4[2] * k4[2];\n                    sum += r4[3] * k4[3];\n                    sum += r4[4] * k4[4];\n                    sum += r4[5] * k4[5];\n                    sum += r4[6] * k4[6];\n\n                    sum += r5[0] * k5[0];\n                    sum += r5[1] * k5[1];\n                    sum += r5[2] * k5[2];\n                    sum += r5[3] * k5[3];\n                    sum += r5[4] * k5[4];\n                    sum += r5[5] * k5[5];\n                    sum += r5[6] * k5[6];\n\n                    sum += r6[0] * k6[0];\n                    sum += r6[1] * k6[1];\n                    sum += r6[2] * k6[2];\n                    sum += r6[3] * k6[3];\n                    sum += r6[4] * k6[4];\n                    sum += r6[5] * k6[5];\n                    sum += r6[6] * k6[6];\n\n                    *outptr += sum;\n\n                    r0++;\n                    r1++;\n                    r2++;\n                    r3++;\n                    r4++;\n                    r5++;\n                    r6++;\n                    outptr++;\n                }\n\n                r0 += 6;\n                r1 += 6;\n                r2 += 6;\n                r3 += 6;\n                r4 += 6;\n                r5 += 6;\n                r6 += 6;\n            }\n        }\n    }\n}\n\nstatic void conv7x7s2_neon(const Mat& bottom_blob, 
Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr = out;\n\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 49 + q * 49;\n\n            const float* r0 = img0;\n            const float* r1 = img0 + w;\n            const float* r2 = img0 + w * 2;\n            const float* r3 = img0 + w * 3;\n            const float* r4 = img0 + w * 4;\n            const float* r5 = img0 + w * 5;\n            const float* r6 = img0 + w * 6;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 7;\n            const float* k2 = kernel0 + 14;\n            const float* k3 = kernel0 + 21;\n            const float* k4 = kernel0 + 28;\n            const float* k5 = kernel0 + 35;\n            const float* k6 = kernel0 + 42;\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n#if __ARM_NEON\n                int nn = outw >> 2;\n                int remain = outw - (nn << 2);\n#else\n                int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n                float32x4_t _k0123 = vld1q_f32(k0);\n                float32x4_t _k4567 = vld1q_f32(k0 + 4);\n                float32x4_t _k78910 = vld1q_f32(k1);\n                float32x4_t _k11121314 = vld1q_f32(k1 + 4);\n                float32x4_t _k14151617 = vld1q_f32(k2);\n                float32x4_t _k18192021 = 
vld1q_f32(k2 + 4);\n                float32x4_t _k21222324 = vld1q_f32(k3);\n                float32x4_t _k25262728 = vld1q_f32(k3 + 4);\n                float32x4_t _k28293031 = vld1q_f32(k4);\n                float32x4_t _k32333435 = vld1q_f32(k4 + 4);\n                float32x4_t _k35363738 = vld1q_f32(k5);\n                float32x4_t _k39404142 = vld1q_f32(k5 + 4);\n                float32x4_t _k42434445 = vld1q_f32(k6);\n                float32x4_t _k46474849 = vld1q_f32(k6 + 4);\n#ifdef __clang__ // __ARM_NEON && __aarch64__ && __clang__\n                if (nn > 0)\n                {\n                    asm volatile(\n                        // v0:  input / final output\n                        // v1 v2: = _ri0/_ri1  first\n                        // v3 v4: =                  then _r0_8101214/_r0_9111315\n                        // v5 = ri2 / ri4 / ri6\n                        // v6 = ri3 / ri5\n                        // v9 = intermediate sum register\n                        \"0:                                        \\n\"\n                        \"prfm       pldl1keep, [%1, #128]          \\n\"\n                        \"ld1        {v0.4s}, [%1]                  \\n\"\n\n                        //i = 1\n                        \"prfm       pldl1keep, [%2, #512]          \\n\"\n                        \"ld2        {v1.4s, v2.4s}, [%2]           \\n\" // v1  v2 = _r00  _r01\n                        \"add        %2, %2, #32                    \\n\"\n                        \"ld2        {v3.4s, v4.4s}, [%2]           \\n\" // v3  v4 = _r0_8101214 / _r0_9111315\n                        \"fmul       v9.4s, v1.4s, %18.s[0]         \\n\" // *+ _r00\n                        \"ext        v5.16b, v1.16b, v3.16b, #4     \\n\" // v5 = _r02\n                        \"fmla       v0.4s, v2.4s, %18.s[1]         \\n\" // *+ _r01\n                        \"ext        v6.16b, v2.16b, v4.16b, #4     \\n\" // v6 = _r03\n                        \"fmla       v9.4s, v5.4s, 
%18.s[2]         \\n\" // *+ _r02\n                        \"ext        v5.16b, v1.16b, v3.16b, #8     \\n\" // v5 = _r04\n                        \"fmla       v0.4s, v6.4s, %18.s[3]         \\n\" // *+ _r03\n                        \"ext        v6.16b, v2.16b, v4.16b, #8     \\n\" // v6 = _r05\n                        \"fmla       v9.4s, v5.4s, %19.s[0]         \\n\" // *+ _r04\n                        \"ext        v5.16b, v1.16b, v3.16b, #12    \\n\" // v5 = _r06\n                        \"fmla       v0.4s, v6.4s, %19.s[1]         \\n\" // *+ _r05\n                        \"fmla       v9.4s, v5.4s, %19.s[2]         \\n\" // *+ _r06\n\n                        //i = 2\n                        \"prfm       pldl1keep, [%3, #512]          \\n\"\n                        \"ld2        {v1.4s, v2.4s}, [%3]           \\n\"\n                        \"add        %3, %3, #32                    \\n\"\n                        \"ld2        {v3.4s, v4.4s}, [%3]           \\n\"\n                        \"fmla       v9.4s, v1.4s, %20.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v2.4s, %20.s[1]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #4     \\n\"\n                        \"fmla       v9.4s, v5.4s, %20.s[2]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #8     \\n\"\n                        \"fmla       v0.4s, v6.4s, %20.s[3]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #8     \\n\"\n                        \"fmla       v9.4s, v5.4s, %21.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #12    \\n\"\n                        \"fmla       v0.4s, v6.4s, %21.s[1]         \\n\"\n                        \"fmla       v9.4s, v5.4s, %21.s[2]         \\n\"\n\n                        //i = 3\n                        \"prfm       pldl1keep, [%4, #512]          \\n\"\n            
            \"ld2        {v1.4s, v2.4s}, [%4]           \\n\"\n                        \"add        %4, %4, #32                    \\n\"\n                        \"ld2        {v3.4s, v4.4s}, [%4]           \\n\"\n                        \"fmla       v9.4s, v1.4s, %22.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v2.4s, %22.s[1]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #4     \\n\"\n                        \"fmla       v9.4s, v5.4s, %22.s[2]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #8     \\n\"\n                        \"fmla       v0.4s, v6.4s, %22.s[3]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #8     \\n\"\n                        \"fmla       v9.4s, v5.4s, %23.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #12    \\n\"\n                        \"fmla       v0.4s, v6.4s, %23.s[1]         \\n\"\n                        \"fmla       v9.4s, v5.4s, %23.s[2]         \\n\"\n\n                        //i = 4\n                        \"prfm       pldl1keep, [%5, #512]          \\n\"\n                        \"ld2        {v1.4s, v2.4s}, [%5]           \\n\"\n                        \"add        %5, %5, #32                    \\n\"\n                        \"ld2        {v3.4s, v4.4s}, [%5]           \\n\"\n                        \"fmla       v9.4s, v1.4s, %24.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v2.4s, %24.s[1]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #4     \\n\"\n                        \"fmla       v9.4s, v5.4s, %24.s[2]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #8     \\n\"\n                        \"fmla       v0.4s, v6.4s, %24.s[3]         \\n\"\n                        
\"ext        v6.16b, v2.16b, v4.16b, #8     \\n\"\n                        \"fmla       v9.4s, v5.4s, %25.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #12    \\n\"\n                        \"fmla       v0.4s, v6.4s, %25.s[1]         \\n\"\n                        \"fmla       v9.4s, v5.4s, %25.s[2]         \\n\"\n\n                        //i = 5\n                        \"prfm       pldl1keep, [%6, #512]          \\n\"\n                        \"ld2        {v1.4s, v2.4s}, [%6]           \\n\"\n                        \"add        %6, %6, #32                    \\n\"\n                        \"ld2        {v3.4s, v4.4s}, [%6]           \\n\"\n                        \"fmla       v9.4s, v1.4s, %26.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v2.4s, %26.s[1]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #4     \\n\"\n                        \"fmla       v9.4s, v5.4s, %26.s[2]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #8     \\n\"\n                        \"fmla       v0.4s, v6.4s, %26.s[3]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #8     \\n\"\n                        \"fmla       v9.4s, v5.4s, %27.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #12    \\n\"\n                        \"fmla       v0.4s, v6.4s, %27.s[1]         \\n\"\n                        \"fmla       v9.4s, v5.4s, %27.s[2]         \\n\"\n\n                        //i = 6\n                        \"prfm       pldl1keep, [%7, #512]          \\n\"\n                        \"ld2        {v1.4s, v2.4s}, [%7]           \\n\"\n                        \"add        %7, %7, #32                    \\n\"\n                        \"ld2        {v3.4s, v4.4s}, [%7]           \\n\"\n                        \"fmla       v9.4s, v1.4s, %28.s[0]         \\n\"\n    
                    \"ext        v5.16b, v1.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v2.4s, %28.s[1]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #4     \\n\"\n                        \"fmla       v9.4s, v5.4s, %28.s[2]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #8     \\n\"\n                        \"fmla       v0.4s, v6.4s, %28.s[3]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #8     \\n\"\n                        \"fmla       v9.4s, v5.4s, %29.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #12    \\n\"\n                        \"fmla       v0.4s, v6.4s, %29.s[1]         \\n\"\n                        \"fmla       v9.4s, v5.4s, %29.s[2]         \\n\"\n\n                        //i = 7\n                        \"prfm       pldl1keep, [%8, #512]          \\n\"\n                        \"ld2        {v1.4s, v2.4s}, [%8]           \\n\"\n                        \"add        %8, %8, #32                    \\n\"\n                        \"ld2        {v3.4s, v4.4s}, [%8]           \\n\"\n                        \"fmla       v9.4s, v1.4s, %30.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #4     \\n\"\n                        \"fmla       v0.4s, v2.4s, %30.s[1]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #4     \\n\"\n                        \"fmla       v9.4s, v5.4s, %30.s[2]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #8     \\n\"\n                        \"fmla       v0.4s, v6.4s, %30.s[3]         \\n\"\n                        \"ext        v6.16b, v2.16b, v4.16b, #8     \\n\"\n                        \"fmla       v9.4s, v5.4s, %31.s[0]         \\n\"\n                        \"ext        v5.16b, v1.16b, v3.16b, #12    \\n\"\n                        \"fmla       v0.4s, v6.4s, %31.s[1]         \\n\"\n                   
     \"fmla       v9.4s, v5.4s, %31.s[2]         \\n\"\n\n                        \"fadd       v0.4s, v0.4s, v9.4s            \\n\"\n                        \"st1        {v0.4s}, [%1], #16             \\n\"\n                        \"subs       %w0, %w0, #1                   \\n\"\n                        \"bne        0b                             \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4),     // %6\n                        \"=r\"(r5),     // %7\n                        \"=r\"(r6)      // %8\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"w\"(_k0123),     // %18\n                        \"w\"(_k4567),     // %19\n                        \"w\"(_k78910),    // %20\n                        \"w\"(_k11121314), // %21\n                        \"w\"(_k14151617), // %22\n                        \"w\"(_k18192021), // %23\n                        \"w\"(_k21222324), // %24\n                        \"w\"(_k25262728), // %25\n                        \"w\"(_k28293031), // %26\n                        \"w\"(_k32333435), // %27\n                        \"w\"(_k35363738), // %28\n                        \"w\"(_k39404142), // %29\n                        \"w\"(_k42434445), // %30\n                        \"w\"(_k46474849)  // %31\n                        : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v9\");\n                }\n#else\n                /**\n              
  * __ARM_NEON && __aarch64__ defined, but __clang__ not defined\n                * When compiled with gcc, gcc does not accept over 30 operands\n                */\n                for (; nn > 0; nn--)\n                {\n                    float32x4_t _sum = vld1q_f32(outptr);\n\n                    float32x4x2_t _r00_02461357 = vld2q_f32(r0);\n                    float32x4x2_t _r00nx2 = vld2q_f32(r0 + 8);\n                    float32x4_t _r0_8101214 = _r00nx2.val[0];           // 8 10 12 14\n                    float32x4_t _r0_9111315 = _r00nx2.val[1];           // 9 11 13 15\n                    float32x4_t _r00 = _r00_02461357.val[0];            // 0 2 4 6\n                    float32x4_t _r01 = _r00_02461357.val[1];            // 1 3 5 7\n                    float32x4_t _r02 = vextq_f32(_r00, _r0_8101214, 1); // 2 4 6 8\n                    float32x4_t _r03 = vextq_f32(_r01, _r0_9111315, 1); // 3 5 7 9\n                    float32x4_t _r04 = vextq_f32(_r00, _r0_8101214, 2); // 4 6 8 10\n                    float32x4_t _r05 = vextq_f32(_r01, _r0_9111315, 2); // 5 7 9 11\n                    float32x4_t _r06 = vextq_f32(_r00, _r0_8101214, 3); // 6 8 10 12\n\n                    _sum = vfmaq_laneq_f32(_sum, _r00, _k0123, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r01, _k0123, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r02, _k0123, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r03, _k0123, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r04, _k4567, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r05, _k4567, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r06, _k4567, 2);\n\n                    float32x4x2_t _r10_02461357 = vld2q_f32(r1);\n                    float32x4x2_t _r10nx2 = vld2q_f32(r1 + 8);\n                    float32x4_t _r1_8101214 = _r10nx2.val[0];\n                    float32x4_t _r1_9111315 = _r10nx2.val[1];\n                    float32x4_t _r10 = _r10_02461357.val[0];\n                    
float32x4_t _r11 = _r10_02461357.val[1];\n                    float32x4_t _r12 = vextq_f32(_r10, _r1_8101214, 1);\n                    float32x4_t _r13 = vextq_f32(_r11, _r1_9111315, 1);\n                    float32x4_t _r14 = vextq_f32(_r10, _r1_8101214, 2);\n                    float32x4_t _r15 = vextq_f32(_r11, _r1_9111315, 2);\n                    float32x4_t _r16 = vextq_f32(_r10, _r1_8101214, 3);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r10, _k78910, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r11, _k78910, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r12, _k78910, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r13, _k78910, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r14, _k11121314, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r15, _k11121314, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r16, _k11121314, 2);\n\n                    float32x4x2_t _r20_02461357 = vld2q_f32(r2);\n                    float32x4x2_t _r20nx2 = vld2q_f32(r2 + 8);\n                    float32x4_t _r2_8101214 = _r20nx2.val[0];\n                    float32x4_t _r2_9111315 = _r20nx2.val[1];\n                    float32x4_t _r20 = _r20_02461357.val[0];\n                    float32x4_t _r21 = _r20_02461357.val[1];\n                    float32x4_t _r22 = vextq_f32(_r20, _r2_8101214, 1);\n                    float32x4_t _r23 = vextq_f32(_r21, _r2_9111315, 1);\n                    float32x4_t _r24 = vextq_f32(_r20, _r2_8101214, 2);\n                    float32x4_t _r25 = vextq_f32(_r21, _r2_9111315, 2);\n                    float32x4_t _r26 = vextq_f32(_r20, _r2_8101214, 3);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r20, _k14151617, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r21, _k14151617, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r22, _k14151617, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r23, _k14151617, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r24, 
_k18192021, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r25, _k18192021, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r26, _k18192021, 2);\n\n                    float32x4x2_t _r30_02461357 = vld2q_f32(r3);\n                    float32x4x2_t _r30nx2 = vld2q_f32(r3 + 8);\n                    float32x4_t _r3_8101214 = _r30nx2.val[0];\n                    float32x4_t _r3_9111315 = _r30nx2.val[1];\n                    float32x4_t _r30 = _r30_02461357.val[0];\n                    float32x4_t _r31 = _r30_02461357.val[1];\n                    float32x4_t _r32 = vextq_f32(_r30, _r3_8101214, 1);\n                    float32x4_t _r33 = vextq_f32(_r31, _r3_9111315, 1);\n                    float32x4_t _r34 = vextq_f32(_r30, _r3_8101214, 2);\n                    float32x4_t _r35 = vextq_f32(_r31, _r3_9111315, 2);\n                    float32x4_t _r36 = vextq_f32(_r30, _r3_8101214, 3);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r30, _k21222324, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r31, _k21222324, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r32, _k21222324, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r33, _k21222324, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r34, _k25262728, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r35, _k25262728, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r36, _k25262728, 2);\n\n                    float32x4x2_t _r40_02461357 = vld2q_f32(r4);\n                    float32x4x2_t _r40nx2 = vld2q_f32(r4 + 8);\n                    float32x4_t _r4_8101214 = _r40nx2.val[0];\n                    float32x4_t _r4_9111315 = _r40nx2.val[1];\n                    float32x4_t _r40 = _r40_02461357.val[0];\n                    float32x4_t _r41 = _r40_02461357.val[1];\n                    float32x4_t _r42 = vextq_f32(_r40, _r4_8101214, 1);\n                    float32x4_t _r43 = vextq_f32(_r41, _r4_9111315, 1);\n                    float32x4_t _r44 = vextq_f32(_r40, 
_r4_8101214, 2);\n                    float32x4_t _r45 = vextq_f32(_r41, _r4_9111315, 2);\n                    float32x4_t _r46 = vextq_f32(_r40, _r4_8101214, 3);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r40, _k28293031, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r41, _k28293031, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r42, _k28293031, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r43, _k28293031, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r44, _k32333435, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r45, _k32333435, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r46, _k32333435, 2);\n\n                    float32x4x2_t _r50_02461357 = vld2q_f32(r5);\n                    float32x4x2_t _r50nx2 = vld2q_f32(r5 + 8);\n                    float32x4_t _r5_8101214 = _r50nx2.val[0];\n                    float32x4_t _r5_9111315 = _r50nx2.val[1];\n                    float32x4_t _r50 = _r50_02461357.val[0];\n                    float32x4_t _r51 = _r50_02461357.val[1];\n                    float32x4_t _r52 = vextq_f32(_r50, _r5_8101214, 1);\n                    float32x4_t _r53 = vextq_f32(_r51, _r5_9111315, 1);\n                    float32x4_t _r54 = vextq_f32(_r50, _r5_8101214, 2);\n                    float32x4_t _r55 = vextq_f32(_r51, _r5_9111315, 2);\n                    float32x4_t _r56 = vextq_f32(_r50, _r5_8101214, 3);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r50, _k35363738, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r51, _k35363738, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r52, _k35363738, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r53, _k35363738, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r54, _k39404142, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r55, _k39404142, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r56, _k39404142, 2);\n\n                    float32x4x2_t _r60_02461357 = vld2q_f32(r6);\n         
           float32x4x2_t _r60nx2 = vld2q_f32(r6 + 8);\n                    float32x4_t _r6_8101214 = _r60nx2.val[0];\n                    float32x4_t _r6_9111315 = _r60nx2.val[1];\n                    float32x4_t _r60 = _r60_02461357.val[0];\n                    float32x4_t _r61 = _r60_02461357.val[1];\n                    float32x4_t _r62 = vextq_f32(_r60, _r6_8101214, 1);\n                    float32x4_t _r63 = vextq_f32(_r61, _r6_9111315, 1);\n                    float32x4_t _r64 = vextq_f32(_r60, _r6_8101214, 2);\n                    float32x4_t _r65 = vextq_f32(_r61, _r6_9111315, 2);\n                    float32x4_t _r66 = vextq_f32(_r60, _r6_8101214, 3);\n\n                    _sum = vfmaq_laneq_f32(_sum, _r60, _k42434445, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r61, _k42434445, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r62, _k42434445, 2);\n                    _sum = vfmaq_laneq_f32(_sum, _r63, _k42434445, 3);\n                    _sum = vfmaq_laneq_f32(_sum, _r64, _k46474849, 0);\n                    _sum = vfmaq_laneq_f32(_sum, _r65, _k46474849, 1);\n                    _sum = vfmaq_laneq_f32(_sum, _r66, _k46474849, 2);\n\n                    vst1q_f32(outptr, _sum);\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                    r3 += 8;\n                    r4 += 8;\n                    r5 += 8;\n                    r6 += 8;\n                    outptr += 4;\n                }\n#endif // __clang__\n#else\n                if (nn > 0)\n                {\n                    asm volatile(\n                        \"0:                             \\n\"\n\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d26-d27}, [%1]     \\n\" // _sum\n                        //                     \"veor       q14, q14            \\n\"// _sum2 = 0;\n                        //                     \"veor       q15, q15            \\n\"// _sum3 = 
0;\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = k0123 k4567\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%2, #512]          \\n\"\n                        \"vld2.f32   {d0-d3}, [%2]!      \\n\" // q0 = 0  2  4  6  q1 = 1  3  5  7\n                        \"vmla.f32   q13, q0, d8[0]      \\n\"\n                        \"vmul.f32   q14, q1, d8[1]      \\n\"\n\n                        \"vld2.f32   {d4-d7}, [%2]       \\n\" // q2 = 8 10 12 14  q3 = 9 11 13 15\n                        \"vext.32    q8, q0, q2, #1      \\n\" // q8 = 2  4  6  8\n                        \"vext.32    q9, q1, q3, #1      \\n\" // q9 = 3  5  7  9\n                        \"vmul.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n\n                        \"vext.32    q10, q0, q2, #2     \\n\" // q10= 4  6  8 10\n                        \"vext.32    q11, q1, q3, #2     \\n\" // q11= 5  7  9 11\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q11, d10[1]    \\n\"\n\n                        \"vext.32    q12, q0, q2, #3     \\n\" // q12= 6  8 10 12\n                        \"vmla.f32   q13, q12, d11[0]    \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d12-d15}, [%9]     \\n\" // q6 q7 = k78910 k11121314\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%3, #512]          \\n\"\n                        \"vld2.f32   {d0-d3}, [%3]!      
\\n\"\n                        \"vmla.f32   q14, q0, d12[0]     \\n\"\n                        \"vmla.f32   q15, q1, d12[1]     \\n\"\n\n                        \"vld2.f32   {d4-d7}, [%3]       \\n\"\n                        \"vext.32    q8, q0, q2, #1      \\n\"\n                        \"vext.32    q9, q1, q3, #1      \\n\"\n                        \"vmla.f32   q13, q8, d13[0]     \\n\"\n                        \"vmla.f32   q14, q9, d13[1]     \\n\"\n\n                        \"vext.32    q10, q0, q2, #2     \\n\"\n                        \"vext.32    q11, q1, q3, #2     \\n\"\n                        \"vmla.f32   q15, q10, d14[0]    \\n\"\n                        \"vmla.f32   q13, q11, d14[1]    \\n\"\n\n                        \"vext.32    q12, q0, q2, #3     \\n\"\n                        \"vmla.f32   q14, q12, d15[0]    \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = k14151617 k18192021\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%4, #512]          \\n\"\n                        \"vld2.f32   {d0-d3}, [%4]!      
\\n\"\n                        \"vmla.f32   q15, q0, d8[0]      \\n\"\n                        \"vmla.f32   q13, q1, d8[1]      \\n\"\n\n                        \"vld2.f32   {d4-d7}, [%4]       \\n\"\n                        \"vext.32    q8, q0, q2, #1      \\n\"\n                        \"vext.32    q9, q1, q3, #1      \\n\"\n                        \"vmla.f32   q14, q8, d9[0]      \\n\"\n                        \"vmla.f32   q15, q9, d9[1]      \\n\"\n\n                        \"vext.32    q10, q0, q2, #2     \\n\"\n                        \"vext.32    q11, q1, q3, #2     \\n\"\n                        \"vmla.f32   q13, q10, d10[0]    \\n\"\n                        \"vmla.f32   q14, q11, d10[1]    \\n\"\n\n                        \"vext.32    q12, q0, q2, #3     \\n\"\n                        \"vmla.f32   q15, q12, d11[0]    \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d12-d15}, [%9]     \\n\" // q6 q7 = k21222324 k25262728\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%5, #512]          \\n\"\n                        \"vld2.f32   {d0-d3}, [%5]!      
\\n\"\n                        \"vmla.f32   q13, q0, d12[0]     \\n\"\n                        \"vmla.f32   q14, q1, d12[1]     \\n\"\n\n                        \"vld2.f32   {d4-d7}, [%5]       \\n\"\n                        \"vext.32    q8, q0, q2, #1      \\n\"\n                        \"vext.32    q9, q1, q3, #1      \\n\"\n                        \"vmla.f32   q15, q8, d13[0]     \\n\"\n                        \"vmla.f32   q13, q9, d13[1]     \\n\"\n\n                        \"vext.32    q10, q0, q2, #2     \\n\"\n                        \"vext.32    q11, q1, q3, #2     \\n\"\n                        \"vmla.f32   q14, q10, d14[0]    \\n\"\n                        \"vmla.f32   q15, q11, d14[1]    \\n\"\n\n                        \"vext.32    q12, q0, q2, #3     \\n\"\n                        \"vmla.f32   q13, q12, d15[0]    \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = k28293031 k32333435\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%6, #512]          \\n\"\n                        \"vld2.f32   {d0-d3}, [%6]!      
\\n\"\n                        \"vmla.f32   q14, q0, d8[0]      \\n\"\n                        \"vmla.f32   q15, q1, d8[1]      \\n\"\n\n                        \"vld2.f32   {d4-d7}, [%6]       \\n\"\n                        \"vext.32    q8, q0, q2, #1      \\n\"\n                        \"vext.32    q9, q1, q3, #1      \\n\"\n                        \"vmla.f32   q13, q8, d9[0]      \\n\"\n                        \"vmla.f32   q14, q9, d9[1]      \\n\"\n\n                        \"vext.32    q10, q0, q2, #2     \\n\"\n                        \"vext.32    q11, q1, q3, #2     \\n\"\n                        \"vmla.f32   q15, q10, d10[0]    \\n\"\n                        \"vmla.f32   q13, q11, d10[1]    \\n\"\n\n                        \"vext.32    q12, q0, q2, #3     \\n\"\n                        \"vmla.f32   q14, q12, d11[0]    \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d12-d15}, [%9]     \\n\" // q6 q7 = k35363738 k39404142\n                        \"add        %9, #28             \\n\"\n\n                        \"pld        [%7, #512]          \\n\"\n                        \"vld2.f32   {d0-d3}, [%7]!      
\\n\"\n                        \"vmla.f32   q15, q0, d12[0]     \\n\"\n                        \"vmla.f32   q13, q1, d12[1]     \\n\"\n\n                        \"vld2.f32   {d4-d7}, [%7]       \\n\"\n                        \"vext.32    q8, q0, q2, #1      \\n\"\n                        \"vext.32    q9, q1, q3, #1      \\n\"\n                        \"vmla.f32   q14, q8, d13[0]     \\n\"\n                        \"vmla.f32   q15, q9, d13[1]     \\n\"\n\n                        \"vext.32    q10, q0, q2, #2     \\n\"\n                        \"vext.32    q11, q1, q3, #2     \\n\"\n                        \"vmla.f32   q13, q10, d14[0]    \\n\"\n                        \"vmla.f32   q14, q11, d14[1]    \\n\"\n\n                        \"vext.32    q12, q0, q2, #3     \\n\"\n                        \"vmla.f32   q15, q12, d15[0]    \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.f32   {d8-d11}, [%9]      \\n\" // q4 q5 = k42434445 k46474849\n                        \"sub        %9, #168            \\n\" // restore k0\n\n                        \"pld        [%8, #512]          \\n\"\n                        \"vld2.f32   {d0-d3}, [%8]!      
\\n\"\n                        \"vmla.f32   q13, q0, d8[0]      \\n\"\n                        \"vmla.f32   q14, q1, d8[1]      \\n\"\n\n                        \"vld2.f32   {d4-d7}, [%8]       \\n\"\n                        \"vext.32    q8, q0, q2, #1      \\n\"\n                        \"vext.32    q9, q1, q3, #1      \\n\"\n                        \"vmla.f32   q15, q8, d9[0]      \\n\"\n                        \"vmla.f32   q13, q9, d9[1]      \\n\"\n\n                        \"vext.32    q10, q0, q2, #2     \\n\"\n                        \"vext.32    q11, q1, q3, #2     \\n\"\n                        \"vmla.f32   q14, q10, d10[0]    \\n\"\n                        \"vmla.f32   q15, q11, d10[1]    \\n\"\n\n                        \"vext.32    q12, q0, q2, #3     \\n\"\n                        \"vmla.f32   q13, q12, d11[0]    \\n\"\n\n                        \"vadd.f32   q14, q14, q15       \\n\"\n                        \"vadd.f32   q13, q13, q14       \\n\"\n\n                        \"vst1.f32   {d26-d27}, [%1]!    
\\n\"\n\n                        \"subs       %0, #1              \\n\"\n                        \"bne        0b                  \\n\"\n                        : \"=r\"(nn),     // %0\n                        \"=r\"(outptr), // %1\n                        \"=r\"(r0),     // %2\n                        \"=r\"(r1),     // %3\n                        \"=r\"(r2),     // %4\n                        \"=r\"(r3),     // %5\n                        \"=r\"(r4),     // %6\n                        \"=r\"(r5),     // %7\n                        \"=r\"(r6),     // %8\n                        \"=r\"(k0)      // %9\n                        : \"0\"(nn),\n                        \"1\"(outptr),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(k0)\n                        : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n                }\n#endif // __aarch64__\n#endif // __ARM_NEON\n\n                for (; remain > 0; remain--)\n                {\n                    float sum = 0;\n\n                    sum += r0[0] * k0[0];\n                    sum += r0[1] * k0[1];\n                    sum += r0[2] * k0[2];\n                    sum += r0[3] * k0[3];\n                    sum += r0[4] * k0[4];\n                    sum += r0[5] * k0[5];\n                    sum += r0[6] * k0[6];\n\n                    sum += r1[0] * k1[0];\n                    sum += r1[1] * k1[1];\n                    sum += r1[2] * k1[2];\n                    sum += r1[3] * k1[3];\n                    sum += r1[4] * k1[4];\n                    sum += r1[5] * k1[5];\n                    sum += r1[6] * k1[6];\n\n                    sum += r2[0] * k2[0];\n                    sum 
+= r2[1] * k2[1];\n                    sum += r2[2] * k2[2];\n                    sum += r2[3] * k2[3];\n                    sum += r2[4] * k2[4];\n                    sum += r2[5] * k2[5];\n                    sum += r2[6] * k2[6];\n\n                    sum += r3[0] * k3[0];\n                    sum += r3[1] * k3[1];\n                    sum += r3[2] * k3[2];\n                    sum += r3[3] * k3[3];\n                    sum += r3[4] * k3[4];\n                    sum += r3[5] * k3[5];\n                    sum += r3[6] * k3[6];\n\n                    sum += r4[0] * k4[0];\n                    sum += r4[1] * k4[1];\n                    sum += r4[2] * k4[2];\n                    sum += r4[3] * k4[3];\n                    sum += r4[4] * k4[4];\n                    sum += r4[5] * k4[5];\n                    sum += r4[6] * k4[6];\n\n                    sum += r5[0] * k5[0];\n                    sum += r5[1] * k5[1];\n                    sum += r5[2] * k5[2];\n                    sum += r5[3] * k5[3];\n                    sum += r5[4] * k5[4];\n                    sum += r5[5] * k5[5];\n                    sum += r5[6] * k5[6];\n\n                    sum += r6[0] * k6[0];\n                    sum += r6[1] * k6[1];\n                    sum += r6[2] * k6[2];\n                    sum += r6[3] * k6[3];\n                    sum += r6[4] * k6[4];\n                    sum += r6[5] * k6[5];\n                    sum += r6[6] * k6[6];\n\n                    *outptr += sum;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    r3 += 2;\n                    r4 += 2;\n                    r5 += 2;\n                    r6 += 2;\n                    outptr++;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n                r5 += tailstep;\n                r6 += tailstep;\n            }\n     
   }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_7x7_pack1to4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv7x7s2_pack1to4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n            const float* r3 = img0.row(3);\n            const float* r4 = img0.row(4);\n            const float* r5 = img0.row(5);\n            const float* r6 = img0.row(6);\n\n            const float* kptr = (const float*)kernel.channel(p).row(q);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n#if __aarch64__\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\" // r0\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0] 
    \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%1]        \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, 
v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%2], #64 \\n\" // r1\n\n                        \"fmla 
  v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v10.4s, v11.4s}, [%2]      \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, 
v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, 
v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\" // r2\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%3]        \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, 
v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, 
#512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%4], #64 \\n\" // r3\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v10.4s, v11.4s}, [%4]      \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   
v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, 
v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%5], #64 \\n\" // r4\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%5]        \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                       
 \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   
v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v6.4s, v7.4s, v8.4s, v9.4s}, [%6], #64 \\n\" // r5\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                 
       \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v10.4s, v11.4s}, [%6]      \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        
\"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #512]       \\n\"\n                        
\"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%7], #64 \\n\" // r6\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%7]        \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n              
          \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   
v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"sub    %8, %8, #784                \\n\"\n\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", 
\"v27\", \"v28\", \"v29\", \"v30\");\n                }\n#endif // __aarch64__\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0] \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1] \\n\" // r0\n                        \"add    %1, %1, #32                 \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2] \\n\" // r1\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        
\"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                    
    \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3] \\n\" // r2\n                        \"add    %3, %3, #32                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     
\\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%4] \\n\" // r3\n                        \"add    %4, %4, #32                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, 
v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%5] \\n\" // r4\n                        \"add    %5, %5, #32                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        
\"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #512]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%6] \\n\" // r5\n                        \"add    %6, %6, #32                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                       
 \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #512]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%7] \\n\" // r6\n    
                    \"add    %7, %7, #32                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     
\\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"sub    %8, %8, #784                \\n\"\n\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : 
\"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]      \\n\"\n                        \"vldm       %0, {d24-d31}   \\n\"\n\n                        \"pld        [%1, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%1]!  \\n\" // r0\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q5, d2[0]  \\n\"\n                        \"vmla.f32   q15, q5, d3[0]  \\n\"\n\n                        \"pld        [%1, #192]      \\n\"\n                        \"vld1.f32   {d4-d6}, [%1]   \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]  \\n\"\n                        \"vmla.f32   q13, q6, d1[1]  \\n\"\n                        \"vmla.f32   q14, q6, d2[1]  \\n\"\n                        \"vmla.f32   q15, q6, d3[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                        \"vmla.f32   q14, q7, d3[0]  \\n\"\n                        \"vmla.f32   q15, q7, d4[0]  \\n\"\n                        \"vmla.f32   q12, q8, d1[1]  \\n\"\n                        
\"vmla.f32   q13, q8, d2[1]  \\n\"\n                        \"vmla.f32   q14, q8, d3[1]  \\n\"\n                        \"vmla.f32   q15, q8, d4[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q9, d4[0]  \\n\"\n                        \"vmla.f32   q15, q9, d5[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1] \\n\"\n                        \"vmla.f32   q13, q10, d3[1] \\n\"\n                        \"vmla.f32   q14, q10, d4[1] \\n\"\n                        \"vmla.f32   q15, q10, d5[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d4[0] \\n\"\n\n                        \"pld        [%2, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%2]!  \\n\" // r1\n\n                        \"vmla.f32   q14, q11, d5[0] \\n\"\n                        \"vmla.f32   q15, q11, d6[0] \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q5, d2[0]  \\n\"\n                        \"vmla.f32   q15, q5, d3[0]  \\n\"\n\n                        \"pld        [%2, #192]      \\n\"\n                        \"vld1.f32   {d4-d6}, [%2]   \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]  \\n\"\n                        \"vmla.f32   q13, q6, d1[1]  \\n\"\n                        \"vmla.f32   q14, q6, d2[1]  \\n\"\n                        \"vmla.f32   q15, q6, d3[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                    
    \"vmla.f32   q14, q7, d3[0]  \\n\"\n                        \"vmla.f32   q15, q7, d4[0]  \\n\"\n                        \"vmla.f32   q12, q8, d1[1]  \\n\"\n                        \"vmla.f32   q13, q8, d2[1]  \\n\"\n                        \"vmla.f32   q14, q8, d3[1]  \\n\"\n                        \"vmla.f32   q15, q8, d4[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q9, d4[0]  \\n\"\n                        \"vmla.f32   q15, q9, d5[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1] \\n\"\n                        \"vmla.f32   q13, q10, d3[1] \\n\"\n                        \"vmla.f32   q14, q10, d4[1] \\n\"\n                        \"vmla.f32   q15, q10, d5[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d4[0] \\n\"\n\n                        \"pld        [%3, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%3]!  
\\n\" // r2\n\n                        \"vmla.f32   q14, q11, d5[0] \\n\"\n                        \"vmla.f32   q15, q11, d6[0] \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q5, d2[0]  \\n\"\n                        \"vmla.f32   q15, q5, d3[0]  \\n\"\n\n                        \"pld        [%3, #192]      \\n\"\n                        \"vld1.f32   {d4-d6}, [%3]   \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]  \\n\"\n                        \"vmla.f32   q13, q6, d1[1]  \\n\"\n                        \"vmla.f32   q14, q6, d2[1]  \\n\"\n                        \"vmla.f32   q15, q6, d3[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                        \"vmla.f32   q14, q7, d3[0]  \\n\"\n                        \"vmla.f32   q15, q7, d4[0]  \\n\"\n                        \"vmla.f32   q12, q8, d1[1]  \\n\"\n                        \"vmla.f32   q13, q8, d2[1]  \\n\"\n                        \"vmla.f32   q14, q8, d3[1]  \\n\"\n                        \"vmla.f32   q15, q8, d4[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q9, d4[0]  \\n\"\n                        \"vmla.f32   q15, q9, d5[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1] \\n\"\n                        \"vmla.f32   q13, q10, d3[1] \\n\"\n                        \"vmla.f32   q14, q10, d4[1] \\n\"\n                        \"vmla.f32   q15, q10, d5[1] \\n\"\n                        \"vmla.f32   q12, q11, 
d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d4[0] \\n\"\n\n                        \"pld        [%4, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%4]!  \\n\" // r3\n\n                        \"vmla.f32   q14, q11, d5[0] \\n\"\n                        \"vmla.f32   q15, q11, d6[0] \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q5, d2[0]  \\n\"\n                        \"vmla.f32   q15, q5, d3[0]  \\n\"\n\n                        \"pld        [%4, #192]      \\n\"\n                        \"vld1.f32   {d4-d6}, [%4]   \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]  \\n\"\n                        \"vmla.f32   q13, q6, d1[1]  \\n\"\n                        \"vmla.f32   q14, q6, d2[1]  \\n\"\n                        \"vmla.f32   q15, q6, d3[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                        \"vmla.f32   q14, q7, d3[0]  \\n\"\n                        \"vmla.f32   q15, q7, d4[0]  \\n\"\n                        \"vmla.f32   q12, q8, d1[1]  \\n\"\n                        \"vmla.f32   q13, q8, d2[1]  \\n\"\n                        \"vmla.f32   q14, q8, d3[1]  \\n\"\n                        \"vmla.f32   q15, q8, d4[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q9, d4[0]  \\n\"\n                        \"vmla.f32   q15, q9, d5[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1] \\n\"\n                        \"vmla.f32   q13, 
q10, d3[1] \\n\"\n                        \"vmla.f32   q14, q10, d4[1] \\n\"\n                        \"vmla.f32   q15, q10, d5[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d4[0] \\n\"\n\n                        \"pld        [%5, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%5]!  \\n\" // r4\n\n                        \"vmla.f32   q14, q11, d5[0] \\n\"\n                        \"vmla.f32   q15, q11, d6[0] \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q5, d2[0]  \\n\"\n                        \"vmla.f32   q15, q5, d3[0]  \\n\"\n\n                        \"pld        [%5, #192]      \\n\"\n                        \"vld1.f32   {d4-d6}, [%5]   \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]  \\n\"\n                        \"vmla.f32   q13, q6, d1[1]  \\n\"\n                        \"vmla.f32   q14, q6, d2[1]  \\n\"\n                        \"vmla.f32   q15, q6, d3[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                        \"vmla.f32   q14, q7, d3[0]  \\n\"\n                        \"vmla.f32   q15, q7, d4[0]  \\n\"\n                        \"vmla.f32   q12, q8, d1[1]  \\n\"\n                        \"vmla.f32   q13, q8, d2[1]  \\n\"\n                        \"vmla.f32   q14, q8, d3[1]  \\n\"\n                        \"vmla.f32   q15, q8, d4[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q9, d4[0]  \\n\"\n                        \"vmla.f32   q15, q9, d5[0]  \\n\"\n\n                        \"pld        
[%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1] \\n\"\n                        \"vmla.f32   q13, q10, d3[1] \\n\"\n                        \"vmla.f32   q14, q10, d4[1] \\n\"\n                        \"vmla.f32   q15, q10, d5[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d4[0] \\n\"\n\n                        \"pld        [%6, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%6]!  \\n\" // r5\n\n                        \"vmla.f32   q14, q11, d5[0] \\n\"\n                        \"vmla.f32   q15, q11, d6[0] \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q5, d2[0]  \\n\"\n                        \"vmla.f32   q15, q5, d3[0]  \\n\"\n\n                        \"pld        [%6, #192]      \\n\"\n                        \"vld1.f32   {d4-d6}, [%6]   \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]  \\n\"\n                        \"vmla.f32   q13, q6, d1[1]  \\n\"\n                        \"vmla.f32   q14, q6, d2[1]  \\n\"\n                        \"vmla.f32   q15, q6, d3[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                        \"vmla.f32   q14, q7, d3[0]  \\n\"\n                        \"vmla.f32   q15, q7, d4[0]  \\n\"\n                        \"vmla.f32   q12, q8, d1[1]  \\n\"\n                        \"vmla.f32   q13, q8, d2[1]  \\n\"\n                        \"vmla.f32   q14, q8, d3[1]  \\n\"\n                        \"vmla.f32   q15, q8, d4[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32 
  q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q9, d4[0]  \\n\"\n                        \"vmla.f32   q15, q9, d5[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1] \\n\"\n                        \"vmla.f32   q13, q10, d3[1] \\n\"\n                        \"vmla.f32   q14, q10, d4[1] \\n\"\n                        \"vmla.f32   q15, q10, d5[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d4[0] \\n\"\n\n                        \"pld        [%7, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%7]!  \\n\" // r6\n\n                        \"vmla.f32   q14, q11, d5[0] \\n\"\n                        \"vmla.f32   q15, q11, d6[0] \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q5, d2[0]  \\n\"\n                        \"vmla.f32   q15, q5, d3[0]  \\n\"\n\n                        \"pld        [%7, #192]      \\n\"\n                        \"vld1.f32   {d4-d6}, [%7]   \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]  \\n\"\n                        \"vmla.f32   q13, q6, d1[1]  \\n\"\n                        \"vmla.f32   q14, q6, d2[1]  \\n\"\n                        \"vmla.f32   q15, q6, d3[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                        \"vmla.f32   q14, q7, d3[0]  \\n\"\n                        \"vmla.f32   q15, q7, d4[0]  \\n\"\n                        \"vmla.f32   q12, q8, d1[1]  \\n\"\n                        \"vmla.f32   q13, q8, d2[1]  \\n\"\n                        
\"vmla.f32   q14, q8, d3[1]  \\n\"\n                        \"vmla.f32   q15, q8, d4[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q9, d4[0]  \\n\"\n                        \"vmla.f32   q15, q9, d5[0]  \\n\"\n                        \"vmla.f32   q12, q10, d2[1] \\n\"\n                        \"vmla.f32   q13, q10, d3[1] \\n\"\n                        \"vmla.f32   q14, q10, d4[1] \\n\"\n                        \"vmla.f32   q15, q10, d5[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d4[0] \\n\"\n                        \"vmla.f32   q14, q11, d5[0] \\n\"\n                        \"vmla.f32   q15, q11, d6[0] \\n\"\n\n                        \"sub        %8, %8, #784    \\n\"\n\n                        \"vstm       %0!, {d24-d31}  \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if 
__aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v16.4s, v17.4s}, [%0]      \\n\"\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%1] \\n\" // r0\n                        \"add    %1, %1, #16                 \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmul   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmul   v19.4s, v24.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%2] \\n\" // r1\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, 
v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%3] \\n\" // r2\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n               
         \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%4] \\n\" // r3\n                        \"add    %4, %4, #16                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%5] \\n\" // r4\n                        \"add    %5, %5, #16                 \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]   
  \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #384]       \\n\"\n                        \"ld1    {v4.4s, v5.4s, v6.4s}, [%6] \\n\" // r5\n                        \"add    %6, %6, #16                 \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                       
 \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #384]       \\n\"\n                        \"ld1    {v0.4s, v1.4s, v2.4s}, [%7] \\n\" // r6\n                        \"add    %7, %7, #16                 \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     
\\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n                        \"fadd   v17.4s, v17.4s, v19.4s      \\n\"\n\n                        \"sub    %8, %8, #784                \\n\"\n\n                        \"st1    {v16.4s, v17.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #256]      \\n\"\n                        \"vld1.f32   {d28-d31}, [%0 :128] \\n\"\n\n                        \"pld        [%1, #256]      \\n\"\n                        \"vld1.f32   
{d0-d3}, [%1]!  \\n\" // r0\n                        \"vld1.f32   {d8[0]}, [%1]   \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmul.f32   q12, q5, d0[0]  \\n\"\n                        \"vmul.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q6, d0[1]  \\n\"\n                        \"vmla.f32   q15, q6, d1[1]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n\n                        \"pld        [%2, #256]      \\n\"\n                        \"vld1.f32   {d4-d7}, [%2]!  \\n\" // r1\n                        \"vld1.f32   {d9[0]}, [%2]   \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]  \\n\"\n                        \"vmla.f32   q15, q8, d2[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1] \\n\"\n                        \"vmla.f32   q15, q10, d3[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d8[0] \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]  \\n\"\n                        \"vmla.f32   q15, q5, d5[0]  \\n\"\n                        \"vmla.f32   q12, q6, d4[1]  \\n\"\n                        \"vmla.f32   q13, q6, d5[1]  \\n\"\n                        \"vmla.f32   q14, q7, d5[0]  \\n\"\n                        \"vmla.f32   q15, q7, d6[0]  \\n\"\n\n                       
 \"pld        [%3, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%3]!  \\n\" // r2\n                        \"vld1.f32   {d8[0]}, [%3]   \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]  \\n\"\n                        \"vmla.f32   q13, q8, d6[1]  \\n\"\n                        \"vmla.f32   q14, q9, d6[0]  \\n\"\n                        \"vmla.f32   q15, q9, d7[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1] \\n\"\n                        \"vmla.f32   q13, q10, d7[1] \\n\"\n                        \"vmla.f32   q14, q11, d7[0] \\n\"\n                        \"vmla.f32   q15, q11, d9[0] \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q6, d0[1]  \\n\"\n                        \"vmla.f32   q15, q6, d1[1]  \\n\"\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n\n                        \"pld        [%4, #256]      \\n\"\n                        \"vld1.f32   {d4-d7}, [%4]!  
\\n\" // r3\n                        \"vld1.f32   {d9[0]}, [%4]   \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]  \\n\"\n                        \"vmla.f32   q15, q8, d2[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1] \\n\"\n                        \"vmla.f32   q15, q10, d3[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d8[0] \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]  \\n\"\n                        \"vmla.f32   q15, q5, d5[0]  \\n\"\n                        \"vmla.f32   q12, q6, d4[1]  \\n\"\n                        \"vmla.f32   q13, q6, d5[1]  \\n\"\n                        \"vmla.f32   q14, q7, d5[0]  \\n\"\n                        \"vmla.f32   q15, q7, d6[0]  \\n\"\n\n                        \"pld        [%5, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%5]!  
\\n\" // r4\n                        \"vld1.f32   {d8[0]}, [%5]   \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]  \\n\"\n                        \"vmla.f32   q13, q8, d6[1]  \\n\"\n                        \"vmla.f32   q14, q9, d6[0]  \\n\"\n                        \"vmla.f32   q15, q9, d7[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1] \\n\"\n                        \"vmla.f32   q13, q10, d7[1] \\n\"\n                        \"vmla.f32   q14, q11, d7[0] \\n\"\n                        \"vmla.f32   q15, q11, d9[0] \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q6, d0[1]  \\n\"\n                        \"vmla.f32   q15, q6, d1[1]  \\n\"\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n\n                        \"pld        [%6, #256]      \\n\"\n                        \"vld1.f32   {d4-d7}, [%6]!  
\\n\" // r5\n                        \"vld1.f32   {d9[0]}, [%6]   \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]  \\n\"\n                        \"vmla.f32   q15, q8, d2[1]  \\n\"\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1] \\n\"\n                        \"vmla.f32   q15, q10, d3[1] \\n\"\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d8[0] \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]  \\n\"\n                        \"vmla.f32   q15, q5, d5[0]  \\n\"\n                        \"vmla.f32   q12, q6, d4[1]  \\n\"\n                        \"vmla.f32   q13, q6, d5[1]  \\n\"\n                        \"vmla.f32   q14, q7, d5[0]  \\n\"\n                        \"vmla.f32   q15, q7, d6[0]  \\n\"\n\n                        \"pld        [%7, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%7]!  
\\n\" // r6\n                        \"vld1.f32   {d8[0]}, [%7]   \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]  \\n\"\n                        \"vmla.f32   q13, q8, d6[1]  \\n\"\n                        \"vmla.f32   q14, q9, d6[0]  \\n\"\n                        \"vmla.f32   q15, q9, d7[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d10-d17}  \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1] \\n\"\n                        \"vmla.f32   q13, q10, d7[1] \\n\"\n                        \"vmla.f32   q14, q11, d7[0] \\n\"\n                        \"vmla.f32   q15, q11, d9[0] \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d18-d23}  \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]  \\n\"\n                        \"vmla.f32   q13, q5, d1[0]  \\n\"\n                        \"vmla.f32   q14, q6, d0[1]  \\n\"\n                        \"vmla.f32   q15, q6, d1[1]  \\n\"\n\n                        \"sub        %1, %1, #16     \\n\"\n                        \"sub        %2, %2, #16     \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]  \\n\"\n                        \"vmla.f32   q13, q7, d2[0]  \\n\"\n                        \"vmla.f32   q14, q8, d1[1]  \\n\"\n                        \"vmla.f32   q15, q8, d2[1]  \\n\"\n\n                        \"sub        %8, %8, #784    \\n\"\n\n                        \"vmla.f32   q12, q9, d2[0]  \\n\"\n                        \"vmla.f32   q13, q9, d3[0]  \\n\"\n                        \"vmla.f32   q14, q10, d2[1] \\n\"\n                        \"vmla.f32   q15, q10, d3[1] \\n\"\n\n                        \"sub        %3, %3, #16     \\n\"\n                        \"sub        %4, %4, #16     \\n\"\n\n                        \"vmla.f32   q12, q11, d3[0] \\n\"\n                        \"vmla.f32   q13, q11, d8[0] \\n\"\n\n                        \"sub        %5, 
%5, #16     \\n\"\n                        \"sub        %6, %6, #16     \\n\"\n\n                        \"vadd.f32   q14, q14, q12   \\n\"\n                        \"vadd.f32   q15, q15, q13   \\n\"\n\n                        \"sub        %7, %7, #16     \\n\"\n\n                        \"vst1.f32   {d28-d31}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v16.4s}, [%0]              \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%1]        \\n\" // r0\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmul   v17.4s, v24.4s, v0.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n 
                       \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmul   v18.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmul   v19.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%2]        \\n\" // r1\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%3]        \\n\" // r2\n\n                        \"fmla   v19.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v17.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v19.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n   
                     \"fmla   v17.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v4.4s, v5.4s}, [%4]        \\n\" // r3\n\n                        \"fmla   v18.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%5]        \\n\" // r4\n\n                        \"fmla   v17.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v19.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v17.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n             
           \"ld1    {v4.4s, v5.4s}, [%6]        \\n\" // r5\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v0.4s, v1.4s}, [%7]        \\n\" // r6\n\n                        \"fmla   v19.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%8], #64 \\n\"\n\n                        \"fmla   v17.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #384]       \\n\"\n                        \"ld1    {v28.4s, v29.4s, v30.4s}, [%8], #48 \\n\"\n\n                        \"fmla   v19.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"add    %1, %1, #8                  \\n\"\n                        \"add    %2, %2, #8                  \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v0.s[3]     \\n\"\n                       
 \"fmla   v19.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"add    %3, %3, #8                  \\n\"\n                        \"add    %4, %4, #8                  \\n\"\n\n                        \"fadd   v18.4s, v18.4s, v19.4s      \\n\"\n\n                        \"add    %5, %5, #8                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v17.4s      \\n\"\n\n                        \"add    %6, %6, #8                  \\n\"\n                        \"add    %7, %7, #8                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n\n                        \"sub    %8, %8, #784                \\n\"\n\n                        \"st1    {v16.4s}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #128]      \\n\"\n                        \"vld1.f32   {d8-d9}, [%0 :128] \\n\"\n\n 
                       \"pld        [%1, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%1]   \\n\" // r0\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d16-d23}  \\n\"\n\n                        \"vmul.f32   q5, q8, d0[0]   \\n\"\n                        \"vmul.f32   q6, q9, d0[1]   \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d24-d29}  \\n\"\n\n                        \"vmul.f32   q7, q10, d1[0]  \\n\"\n                        \"vmla.f32   q4, q11, d1[1]  \\n\"\n\n                        \"pld        [%2, #256]      \\n\"\n                        \"vld1.f32   {d4-d7}, [%2]   \\n\" // r1\n\n                        \"vmla.f32   q5, q12, d2[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d16-d23}  \\n\"\n\n                        \"vmla.f32   q6, q13, d2[1]  \\n\"\n                        \"vmla.f32   q7, q14, d3[0]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d24-d29}  \\n\"\n\n                        \"vmla.f32   q4, q8, d4[0]   \\n\"\n                        \"vmla.f32   q5, q9, d4[1]   \\n\"\n                        \"vmla.f32   q6, q10, d5[0]  \\n\"\n\n                        \"pld        [%3, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%3]   \\n\" // r2\n\n                        \"vmla.f32   q7, q11, d5[1]  \\n\"\n                        \"vmla.f32   q4, q12, d6[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d16-d23}  \\n\"\n\n                        \"vmla.f32   q5, q13, d6[1]  \\n\"\n                        \"vmla.f32   q6, q14, d7[0]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d24-d29}  \\n\"\n\n                        
\"vmla.f32   q7, q8, d0[0]   \\n\"\n                        \"vmla.f32   q4, q9, d0[1]   \\n\"\n                        \"vmla.f32   q5, q10, d1[0]  \\n\"\n\n                        \"pld        [%4, #256]      \\n\"\n                        \"vld1.f32   {d4-d7}, [%4]   \\n\" // r3\n\n                        \"vmla.f32   q6, q11, d1[1]  \\n\"\n                        \"vmla.f32   q7, q12, d2[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d16-d23}  \\n\"\n\n                        \"vmla.f32   q4, q13, d2[1]  \\n\"\n                        \"vmla.f32   q5, q14, d3[0]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d24-d29}  \\n\"\n\n                        \"vmla.f32   q6, q8, d4[0]   \\n\"\n                        \"vmla.f32   q7, q9, d4[1]   \\n\"\n                        \"vmla.f32   q4, q10, d5[0]  \\n\"\n\n                        \"pld        [%5, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%5]   \\n\" // r4\n\n                        \"vmla.f32   q5, q11, d5[1]  \\n\"\n                        \"vmla.f32   q6, q12, d6[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d16-d23}  \\n\"\n\n                        \"vmla.f32   q7, q13, d6[1]  \\n\"\n                        \"vmla.f32   q4, q14, d7[0]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d24-d29}  \\n\"\n\n                        \"vmla.f32   q5, q8, d0[0]   \\n\"\n                        \"vmla.f32   q6, q9, d0[1]   \\n\"\n                        \"vmla.f32   q7, q10, d1[0]  \\n\"\n\n                        \"pld        [%6, #256]      \\n\"\n                        \"vld1.f32   {d4-d7}, [%6]   \\n\" // r5\n\n                        \"vmla.f32   q4, q11, d1[1]  \\n\"\n                        \"vmla.f32   q5, q12, d2[0]  
\\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d16-d23}  \\n\"\n\n                        \"vmla.f32   q6, q13, d2[1]  \\n\"\n                        \"vmla.f32   q7, q14, d3[0]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d24-d29}  \\n\"\n\n                        \"vmla.f32   q4, q8, d4[0]   \\n\"\n                        \"vmla.f32   q5, q9, d4[1]   \\n\"\n                        \"vmla.f32   q6, q10, d5[0]  \\n\"\n\n                        \"pld        [%7, #256]      \\n\"\n                        \"vld1.f32   {d0-d3}, [%7]   \\n\" // r6\n\n                        \"vmla.f32   q7, q11, d5[1]  \\n\"\n                        \"vmla.f32   q4, q12, d6[0]  \\n\"\n\n                        \"pld        [%8, #512]      \\n\"\n                        \"vldm       %8!, {d16-d23}  \\n\"\n\n                        \"vmla.f32   q5, q13, d6[1]  \\n\"\n                        \"vmla.f32   q6, q14, d7[0]  \\n\"\n\n                        \"pld        [%8, #384]      \\n\"\n                        \"vldm       %8!, {d24-d29}  \\n\"\n\n                        \"vmla.f32   q7, q8, d0[0]   \\n\"\n                        \"vmla.f32   q4, q9, d0[1]   \\n\"\n\n                        \"add        %1, %1, #8      \\n\"\n                        \"add        %2, %2, #8      \\n\"\n\n                        \"vmla.f32   q5, q10, d1[0]  \\n\"\n                        \"vmla.f32   q6, q11, d1[1]  \\n\"\n\n                        \"sub        %8, %8, #784    \\n\"\n\n                        \"vmla.f32   q7, q12, d2[0]  \\n\"\n                        \"vmla.f32   q4, q13, d2[1]  \\n\"\n                        \"vmla.f32   q5, q14, d3[0]  \\n\"\n\n                        \"add        %3, %3, #8      \\n\"\n                        \"add        %4, %4, #8      \\n\"\n\n                        \"vadd.f32   q6, q6, q7      \\n\"\n\n                        \"add    
    %5, %5, #8      \\n\"\n\n                        \"vadd.f32   q4, q4, q5      \\n\"\n\n                        \"add        %6, %6, #8      \\n\"\n\n                        \"vadd.f32   q4, q4, q6      \\n\"\n\n                        \"add        %7, %7, #8      \\n\"\n\n                        \"vst1.f32   {d8-d9}, [%0 :128]! \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n                r5 += tailstep;\n                r6 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_7x7_pack1to4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv7x7s2_pack1to4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // fp32 accumulation scratch: one channel per thread, elemsize 16 bytes (4 floats), elempack 4\n    Mat top_blob_fp32(outw, outh, opt.num_threads, (size_t)4u * 4, 4, opt.workspace_allocator);\n\n    // stride 2: an output row consumes 2 * outw input columns, then the row pointers\n    // jump to the input row two below, so the step is 2 * w - 2 * outw elements\n    const int tailstep = w - 2 * outw + w;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        // each thread accumulates into its own fp32 channel, initialised with the bias\n        Mat out0 = top_blob_fp32.channel(get_omp_thread_num());\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + p * 4) : vdupq_n_f32(0.f);\n        out0.fill(_bias0);\n\n        int q = 0;\n        for (; q < inch - 1; q++)\n        {\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            // seven consecutive bf16 input rows covered by the 7x7 window\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n            const unsigned short* r3 = img0.row<const unsigned short>(3);\n            const unsigned short* r4 = img0.row<const unsigned short>(4);\n            const unsigned short* r5 = img0.row<const unsigned short>(5);\n            const unsigned short* r6 = img0.row<const unsigned short>(6);\n\n            // bf16 weights for output channel p, input channel q\n            const unsigned short* kptr = kernel.channel(p).row<const unsigned short>(q);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n#if __aarch64__\n                // 8 outputs per iteration; bf16 lanes are widened to fp32 with shll #16\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], 
#64 \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%1]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                    
    \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        
\"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%2], #32 \\n\" // r1\n\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n                        \"shll   v8.4s, v8.4h, #16           \\n\"\n                        \"shll   v9.4s, v9.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n      
                  \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v10.4h, v11.4h}, [%2]      \\n\"\n\n                        \"shll   v10.4s, v10.4h, #16         \\n\"\n                        \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n              
          \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                
        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%3]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n            
            \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                      
  \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        
\"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%4], #32 \\n\" // r3\n\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n                        \"shll   v8.4s, v8.4h, #16           \\n\"\n                        \"shll   v9.4s, v9.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v10.4h, v11.4h}, [%4]      \\n\"\n\n                        \"shll   v10.4s, v10.4h, #16         \\n\"\n                        \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                      
  \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla  
 v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5], #32 \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n              
          \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%5]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                    
    \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        
\"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%6], #32 \\n\" // r5\n\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n                        \"shll   v8.4s, v8.4h, #16           \\n\"\n                        \"shll   v9.4s, v9.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    {v10.4h, v11.4h}, [%6]      \\n\"\n\n                        \"shll   v10.4s, v10.4h, #16         \\n\"\n                        \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                    
    \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        
\"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        
\"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%7], #32 \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%7, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%7]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                      
  \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla  
 v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"sub    %8, %8, #392                \\n\"\n\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                  
      \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n                }\n#endif // __aarch64__\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0] \\n\"\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1] \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, 
v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%2] \\n\" // r1\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   
pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        
\"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3] \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n           
             \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%4] \\n\" // r3\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n      
                  \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     
\\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%5] \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, 
v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%6] \\n\" // r5\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   
v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        
\"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%7] \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n           
             \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        
\"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"add    %1, %1, #16                 \\n\"\n                        \"add    %2, %2, #16                 \\n\"\n                        \"add    %3, %3, #16                 \\n\"\n                        \"add    %4, %4, #16                 \\n\"\n                        \"add    %5, %5, #16                 \\n\"\n                        \"add    %6, %6, #16                 \\n\"\n                        \"add    %7, %7, #16                 \\n\"\n\n                        \"sub    %8, %8, #392                \\n\"\n\n                        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                       
 \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #512]          \\n\"\n                        \"vldm       %0, {d24-d31}       \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%1]!      \\n\" // r0\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%1]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, 
d3[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2]!      \\n\" // r1\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%2]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3]!      
\\n\" // r2\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%3]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4]!      
\\n\" // r3\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%4]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5]!      
\\n\" // r4\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%5]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%6]!      
\\n\" // r5\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%6]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%7]!      
\\n\" // r6\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%7]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"sub        %8, %8, #392        \\n\"\n\n                        \"vstm       %0!, {d24-d31}      \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // 
%7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #256]       \\n\"\n                        \"ld1    {v16.4s, v17.4s}, [%0]      \\n\"\n\n                        \"prfm   pldl1keep, [%1, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%1] \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmul   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmul   v19.4s, v24.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n               
         \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%2] \\n\" // r1\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         
\\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%3] \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, 
v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%4] \\n\" // r3\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1  
  {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%5] \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n           
             \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%6] \\n\" // r5\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     
\\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%7] \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, 
v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"add    %1, %1, #8                  \\n\"\n                        \"add    %2, %2, #8                  \\n\"\n                        \"add    %3, %3, #8                  \\n\"\n                        \"add    %4, %4, #8                  \\n\"\n                        \"add    %5, %5, #8                  \\n\"\n                        \"add    %6, %6, #8                  \\n\"\n                        \"add    %7, %7, #8                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, 
v18.4s      \\n\"\n                        \"fadd   v17.4s, v17.4s, v19.4s      \\n\"\n\n                        \"sub    %8, %8, #392                \\n\"\n\n                        \"st1    {v16.4s, v17.4s}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #256]          \\n\"\n                        \"vld1.f32   {d28-d31}, [%0 :128] \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%1]!      \\n\" // r0\n                        \"vld1.u16   {d8[0]}, [%1]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmul.f32   q12, q5, d0[0]      \\n\"\n                        \"vmul.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%2]!      \\n\" // r1\n                        \"vld1.u16   {d9[0]}, [%2]       \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshl.u32   d9, d9, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]      \\n\"\n                        \"vmla.f32   q15, q5, d5[0]      \\n\"\n                        \"vmla.f32   q12, q6, d4[1]      \\n\"\n                        \"vmla.f32   q13, q6, d5[1]      \\n\"\n                        \"vmla.f32   q14, q7, d5[0]      \\n\"\n                        \"vmla.f32   q15, q7, d6[0]      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3]!      
\\n\" // r2\n                        \"vld1.u16   {d8[0]}, [%3]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]      \\n\"\n                        \"vmla.f32   q13, q8, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[0]      \\n\"\n                        \"vmla.f32   q15, q9, d7[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1]     \\n\"\n                        \"vmla.f32   q13, q10, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[0]     \\n\"\n                        \"vmla.f32   q15, q11, d9[0]     \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%4]!      \\n\" // r3\n                        \"vld1.u16   {d9[0]}, [%4]       \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshl.u32   d9, d9, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]      \\n\"\n                        \"vmla.f32   q15, q5, d5[0]      \\n\"\n                        \"vmla.f32   q12, q6, d4[1]      \\n\"\n                        \"vmla.f32   q13, q6, d5[1]      \\n\"\n                        \"vmla.f32   q14, q7, d5[0]      \\n\"\n                        \"vmla.f32   q15, q7, d6[0]      \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5]!      
\\n\" // r4\n                        \"vld1.u16   {d8[0]}, [%5]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]      \\n\"\n                        \"vmla.f32   q13, q8, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[0]      \\n\"\n                        \"vmla.f32   q15, q9, d7[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1]     \\n\"\n                        \"vmla.f32   q13, q10, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[0]     \\n\"\n                        \"vmla.f32   q15, q11, d9[0]     \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%6]!      \\n\" // r5\n                        \"vld1.u16   {d9[0]}, [%6]       \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshl.u32   d9, d9, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]      \\n\"\n                        \"vmla.f32   q15, q5, d5[0]      \\n\"\n                        \"vmla.f32   q12, q6, d4[1]      \\n\"\n                        \"vmla.f32   q13, q6, d5[1]      \\n\"\n                        \"vmla.f32   q14, q7, d5[0]      \\n\"\n                        \"vmla.f32   q15, q7, d6[0]      \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%7]!      
\\n\" // r6\n                        \"vld1.u16   {d8[0]}, [%7]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]      \\n\"\n                        \"vmla.f32   q13, q8, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[0]      \\n\"\n                        \"vmla.f32   q15, q9, d7[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%8]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1]     \\n\"\n                        \"vmla.f32   q13, q10, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[0]     \\n\"\n                        \"vmla.f32   q15, q11, d9[0]     \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n\n                        \"sub        %1, %1, #8          \\n\"\n                        \"sub        %2, %2, #8          \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n\n                        \"sub        %8, %8, #392        \\n\"\n\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n\n                        \"sub        %3, %3, #8          \\n\"\n                        \"sub        %4, %4, #8          \\n\"\n\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"sub        %5, %5, #8          \\n\"\n                        \"sub        %6, %6, #8          \\n\"\n\n                        \"vadd.f32   q14, q14, q12       \\n\"\n                        \"vadd.f32   q15, q15, q13       \\n\"\n\n                        \"sub        %7, %7, #8          \\n\"\n\n                        \"vst1.f32   {d28-d31}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v16.4s}, [%0]              \\n\"\n\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%1]        \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmul   v17.4s, v24.4s, 
v0.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmul   v18.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmul   v19.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%2]        \\n\" // r1\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[1]     \\n\"\n               
         \"fmla   v18.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%3]        \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v19.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v17.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v19.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%4]        \\n\" // r3\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v19.4s, v28.4s, 
v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%5]        \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v17.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n       
                 \"fmla   v19.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v17.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%6]        \\n\" // r5\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, 
v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%7]        \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v19.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%8], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v17.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%8], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v19.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n                        \"add    %2, %2, #4                  \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v0.s[3]     \\n\"\n               
         \"fmla   v19.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"add    %3, %3, #4                  \\n\"\n                        \"add    %4, %4, #4                  \\n\"\n\n                        \"fadd   v18.4s, v18.4s, v19.4s      \\n\"\n\n                        \"add    %5, %5, #4                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v17.4s      \\n\"\n\n                        \"add    %6, %6, #4                  \\n\"\n                        \"add    %7, %7, #4                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n\n                        \"sub    %8, %8, #392                \\n\"\n\n                        \"st1    {v16.4s}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%0, #128]          \\n\"\n                        \"vld1.f32   {d8-d9}, [%0 
:128]  \\n\"\n\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%1]       \\n\" // r0\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%8]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmul.f32   q5, q8, d0[0]       \\n\"\n                        \"vmul.f32   q6, q9, d0[1]       \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%8]!    \\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmul.f32   q7, q10, d1[0]      \\n\"\n                        \"vmla.f32   q4, q11, d1[1]      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%2]       \\n\" // r1\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q5, q12, d2[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q6, q13, d2[1]      \\n\"\n                        \"vmla.f32   q7, q14, d3[0]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%8]!    \\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q4, q8, d4[0]       \\n\"\n                        \"vmla.f32   q5, q9, d4[1]       \\n\"\n                        \"vmla.f32   q6, q10, d5[0]      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3]       \\n\" // r2\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q7, q11, d5[1]      \\n\"\n                        \"vmla.f32   q4, q12, d6[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%8]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q5, q13, d6[1]      \\n\"\n                        \"vmla.f32   q6, q14, d7[0]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q7, q8, d0[0]       \\n\"\n                        \"vmla.f32   q4, q9, d0[1]       \\n\"\n                        \"vmla.f32   q5, q10, d1[0]      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%4]       \\n\" // r3\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q6, q11, d1[1]      \\n\"\n                        \"vmla.f32   q7, q12, d2[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%8]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q4, q13, d2[1]      \\n\"\n                        \"vmla.f32   q5, q14, d3[0]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q6, q8, d4[0]       \\n\"\n                        \"vmla.f32   q7, q9, d4[1]       \\n\"\n                        \"vmla.f32   q4, q10, d5[0]      \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5]       \\n\" // r4\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q5, q11, d5[1]      \\n\"\n                        \"vmla.f32   q6, q12, d6[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%8]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q7, q13, d6[1]      \\n\"\n                        \"vmla.f32   q4, q14, d7[0]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q5, q8, d0[0]       \\n\"\n                        \"vmla.f32   q6, q9, d0[1]       \\n\"\n                        \"vmla.f32   q7, q10, d1[0]      \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%6]       \\n\" // r5\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q4, q11, d1[1]      \\n\"\n                        \"vmla.f32   q5, q12, d2[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%8]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q6, q13, d2[1]      \\n\"\n                        \"vmla.f32   q7, q14, d3[0]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q4, q8, d4[0]       \\n\"\n                        \"vmla.f32   q5, q9, d4[1]       \\n\"\n                        \"vmla.f32   q6, q10, d5[0]      \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%7]       \\n\" // r6\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q7, q11, d5[1]      \\n\"\n                        \"vmla.f32   q4, q12, d6[0]      \\n\"\n\n                        \"pld        [%8, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%8]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q5, q13, d6[1]      \\n\"\n                        \"vmla.f32   q6, q14, d7[0]      \\n\"\n\n                        \"pld        [%8, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%8]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q7, q8, d0[0]       \\n\"\n                        \"vmla.f32   q4, q9, d0[1]       \\n\"\n\n                        \"add        %1, %1, #4          \\n\"\n                        \"add        %2, %2, #4          \\n\"\n\n                        \"vmla.f32   q5, q10, d1[0]      \\n\"\n                        \"vmla.f32   q6, q11, d1[1]      \\n\"\n\n                        \"sub        %8, %8, #392        \\n\"\n\n                        \"vmla.f32   q7, q12, d2[0]      \\n\"\n                        \"vmla.f32   q4, q13, d2[1]      \\n\"\n                        \"vmla.f32   q5, q14, d3[0]      \\n\"\n\n                        \"add        %3, %3, #4          \\n\"\n                        \"add        %4, %4, #4          \\n\"\n\n                        \"vadd.f32   q6, q6, q7          \\n\"\n\n                        \"add        %5, %5, #4          \\n\"\n\n                        \"vadd.f32   q4, q4, q5          \\n\"\n\n                        \"add        %6, %6, #4          \\n\"\n\n                        \"vadd.f32   q4, q4, q6          \\n\"\n\n                        \"add        %7, %7, #4          \\n\"\n\n                        \"vst1.f32   {d8-d9}, [%0 :128]! 
\\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n                r5 += tailstep;\n                r6 += tailstep;\n            }\n        }\n        for (; q < inch; q++)\n        {\n            unsigned short* outptr0_bf16 = top_blob.channel(p);\n\n            float* outptr0 = out0.row(0);\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const unsigned short* r0 = img0.row<const unsigned short>(0);\n            const unsigned short* r1 = img0.row<const unsigned short>(1);\n            const unsigned short* r2 = img0.row<const unsigned short>(2);\n            const unsigned short* r3 = img0.row<const unsigned short>(3);\n            const unsigned short* r4 = img0.row<const unsigned short>(4);\n            const unsigned short* r5 = img0.row<const unsigned short>(5);\n            const unsigned short* r6 = img0.row<const unsigned short>(6);\n\n            const 
unsigned short* kptr = kernel.channel(p).row<const unsigned short>(q);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n#if __aarch64__\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%1], #64 \\n\"\n\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     
\\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%2]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     
\\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%3], #32 \\n\" // r1\n\n                        \"shll   v6.4s, 
v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n                        \"shll   v8.4s, v8.4h, #16           \\n\"\n                        \"shll   v9.4s, v9.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v10.4h, v11.4h}, [%3]      \\n\"\n\n                        \"shll   v10.4s, v10.4h, #16         \\n\"\n                        \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]  
   \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     
\\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4], #32 \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, 
v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%4]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, 
v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, 
v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%5], #32 \\n\" // r3\n\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n                        \"shll   v8.4s, v8.4h, #16           \\n\"\n                        \"shll   v9.4s, v9.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v10.4h, v11.4h}, [%5]      \\n\"\n\n                        \"shll   v10.4s, v10.4h, #16         \\n\"\n                        \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   
v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, 
v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%6], #32 \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                       
 \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%6]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, 
v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, 
v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v6.4h, v7.4h, v8.4h, v9.4h}, [%7], #32 \\n\" // r5\n\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n                        \"shll   v8.4s, v8.4h, #16           \\n\"\n                        \"shll   v9.4s, v9.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla 
  v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #128]       \\n\"\n                        \"ld1    {v10.4h, v11.4h}, [%7]      \\n\"\n\n                        \"shll   v10.4s, v10.4h, #16         \\n\"\n                        \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v6.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v6.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v7.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v7.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v8.s[0]     \\n\"\n                        \"fmla   v21.4s, v24.4s, v8.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v9.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v9.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v6.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v6.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v7.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v7.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v8.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v8.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v9.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v9.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v6.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v7.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v7.s[2]     \\n\"\n                        \"fmla   
v19.4s, v26.4s, v8.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v8.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v9.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v9.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v10.s[0]    \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v6.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v7.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v7.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v8.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v8.s[3]     \\n\"\n                        \"fmla   v21.4s, v27.4s, v9.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v9.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v10.s[1]    \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v7.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v7.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v8.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v8.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v9.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v9.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v10.s[0]    \\n\"\n                        \"fmla   v23.4s, v28.4s, v10.s[2]    \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v7.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v7.s[3]     \\n\"\n                        \"fmla   
v18.4s, v29.4s, v8.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v8.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v9.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v9.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v10.s[1]    \\n\"\n                        \"fmla   v23.4s, v29.4s, v10.s[3]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%8], #32 \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v7.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v8.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v8.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v9.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v9.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v10.s[0]    \\n\"\n                        \"fmla   v22.4s, v30.4s, v10.s[2]    \\n\"\n                        \"fmla   v23.4s, v30.4s, v11.s[0]    \\n\"\n\n                        \"prfm   pldl1keep, [%8, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%8]        \\n\"\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v20.4s, v24.4s, v2.s[0]     \\n\"\n                        \"fmla 
  v21.4s, v24.4s, v2.s[2]     \\n\"\n                        \"fmla   v22.4s, v24.4s, v3.s[0]     \\n\"\n                        \"fmla   v23.4s, v24.4s, v3.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v20.4s, v25.4s, v2.s[1]     \\n\"\n                        \"fmla   v21.4s, v25.4s, v2.s[3]     \\n\"\n                        \"fmla   v22.4s, v25.4s, v3.s[1]     \\n\"\n                        \"fmla   v23.4s, v25.4s, v3.s[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v20.4s, v26.4s, v2.s[2]     \\n\"\n                        \"fmla   v21.4s, v26.4s, v3.s[0]     \\n\"\n                        \"fmla   v22.4s, v26.4s, v3.s[2]     \\n\"\n                        \"fmla   v23.4s, v26.4s, v4.s[0]     \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v20.4s, v27.4s, v2.s[3]     \\n\"\n                        \"fmla   v21.4s, 
v27.4s, v3.s[1]     \\n\"\n                        \"fmla   v22.4s, v27.4s, v3.s[3]     \\n\"\n                        \"fmla   v23.4s, v27.4s, v4.s[1]     \\n\"\n\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v20.4s, v28.4s, v3.s[0]     \\n\"\n                        \"fmla   v21.4s, v28.4s, v3.s[2]     \\n\"\n                        \"fmla   v22.4s, v28.4s, v4.s[0]     \\n\"\n                        \"fmla   v23.4s, v28.4s, v4.s[2]     \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v20.4s, v29.4s, v3.s[1]     \\n\"\n                        \"fmla   v21.4s, v29.4s, v3.s[3]     \\n\"\n                        \"fmla   v22.4s, v29.4s, v4.s[1]     \\n\"\n                        \"fmla   v23.4s, v29.4s, v4.s[3]     \\n\"\n\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n                        \"fmla   v20.4s, v30.4s, v3.s[2]     \\n\"\n                        \"fmla   v21.4s, v30.4s, v4.s[0]     \\n\"\n                        \"fmla   v22.4s, v30.4s, v4.s[2]     \\n\"\n                        \"fmla   v23.4s, v30.4s, v5.s[0]     \\n\"\n\n                        \"sub    %9, %9, #392                \\n\"\n\n                        \"shrn   v16.4h, v16.4s, #16         \\n\"\n                        \"shrn   v17.4h, v17.4s, #16       
  \\n\"\n                        \"shrn   v18.4h, v18.4s, #16         \\n\"\n                        \"shrn   v19.4h, v19.4s, #16         \\n\"\n                        \"shrn   v20.4h, v20.4s, #16         \\n\"\n                        \"shrn   v21.4h, v21.4s, #16         \\n\"\n                        \"shrn   v22.4h, v22.4s, #16         \\n\"\n                        \"shrn   v23.4h, v23.4s, #16         \\n\"\n\n                        \"st1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%0], #32 \\n\"\n                        \"st1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(r5),           // %7\n                        \"=r\"(r6),           // %8\n                        \"=r\"(kptr)          // %9\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n                }\n#endif // __aarch64__\n                for (; j + 3 < outw; j += 4)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #512]       \\n\"\n                        \"ld1    
{v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2] \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]    
 \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%3] \\n\" // r1\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, 
v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%4] \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   
v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        
\"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%5] \\n\" // r3\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n           
             \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%6] \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n      
                  \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     
\\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%7] \\n\" // r5\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n                        \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, 
v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v5.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v5.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v5.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #256]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%8] \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   
v2.4s, v2.4h, #16           \\n\"\n                        \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v5.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v6.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v6.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v6.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v6.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v6.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v6.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v7.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                       
 \"fmla   v16.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v18.4s, v24.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v1.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v2.s[0]     \\n\"\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v27.4s, v1.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v2.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v2.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v2.s[2]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v29.4s, v2.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v2.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v2.s[0]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v2.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v3.s[0]     \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n                        \"add    %3, %3, #16     
            \\n\"\n                        \"add    %4, %4, #16                 \\n\"\n                        \"add    %5, %5, #16                 \\n\"\n                        \"add    %6, %6, #16                 \\n\"\n                        \"add    %7, %7, #16                 \\n\"\n                        \"add    %8, %8, #16                 \\n\"\n\n                        \"sub    %9, %9, #392                \\n\"\n\n                        \"shrn   v16.4h, v16.4s, #16         \\n\"\n                        \"shrn   v17.4h, v17.4s, #16         \\n\"\n                        \"shrn   v18.4h, v18.4s, #16         \\n\"\n                        \"shrn   v19.4h, v19.4s, #16         \\n\"\n\n                        \"st1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%0], #32 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(r5),           // %7\n                        \"=r\"(r6),           // %8\n                        \"=r\"(kptr)          // %9\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        
[%1, #512]          \\n\"\n                        \"vldm       %1!, {d24-d31}      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2]!      \\n\" // r0\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%2]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%3]!      
\\n\" // r1\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%3]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4]!      
\\n\" // r2\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%4]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%5]!      
\\n\" // r3\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%5]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%6]!      
\\n\" // r4\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%6]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%7]!      
\\n\" // r5\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%7]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n\n                        \"pld        [%8, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%8]!      
\\n\" // r6\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q5, d2[0]      \\n\"\n                        \"vmla.f32   q15, q5, d3[0]      \\n\"\n\n                        \"pld        [%8, #128]          \\n\"\n                        \"vld1.u16   {d5-d6}, [%8]       \\n\"\n\n                        \"vshll.u16  q2, d5, #16         \\n\"\n                        \"vshl.u32   d6, d6, #16         \\n\"\n\n                        \"vmla.f32   q12, q6, d0[1]      \\n\"\n                        \"vmla.f32   q13, q6, d1[1]      \\n\"\n                        \"vmla.f32   q14, q6, d2[1]      \\n\"\n                        \"vmla.f32   q15, q6, d3[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q7, d3[0]      \\n\"\n                        \"vmla.f32   q15, q7, d4[0]      \\n\"\n                        \"vmla.f32   q12, q8, d1[1]      \\n\"\n                        \"vmla.f32   q13, q8, d2[1]      \\n\"\n                        \"vmla.f32   q14, q8, d3[1]      \\n\"\n                        \"vmla.f32   q15, q8, d4[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q9, d4[0]      \\n\"\n                        \"vmla.f32   q15, q9, d5[0]      \\n\"\n                        \"vmla.f32   q12, q10, d2[1]     \\n\"\n                        \"vmla.f32   q13, q10, d3[1]     \\n\"\n                        \"vmla.f32   q14, q10, d4[1]     \\n\"\n                        \"vmla.f32   q15, q10, d5[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d4[0]     \\n\"\n                        \"vmla.f32   q14, q11, d5[0]     \\n\"\n                        \"vmla.f32   q15, q11, d6[0]     \\n\"\n\n                        \"sub        %9, %9, #392        \\n\"\n\n                        \"vshrn.u32  d24, q12, #16       \\n\"\n                        \"vshrn.u32  d25, q13, #16       \\n\"\n                        \"vshrn.u32  d26, q14, #16       \\n\"\n                        \"vshrn.u32  d27, q15, #16       \\n\"\n\n                        \"vst1.u16   {d24-d27}, [%0]!    
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(r5),           // %7\n                        \"=r\"(r6),           // %8\n                        \"=r\"(kptr)          // %9\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v16.4s, v17.4s}, [%1], #32 \\n\"\n\n                        \"prfm   pldl1keep, [%2, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%2] \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   
v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmul   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmul   v19.4s, v24.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%3] \\n\" // r1\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n           
             \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%4] \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         
\\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%5] \\n\" // r3\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, 
v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%6] \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                     
   \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #192]       \\n\"\n                        \"ld1    {v4.4h, v5.4h, v6.4h}, [%7] \\n\" // r5\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n                        \"shll   v6.4s, v6.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                   
     \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v2.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v24.4s, v4.s[2]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #192]       \\n\"\n                        \"ld1    {v0.4h, v1.4h, v2.4h}, [%8] \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n                        \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v19.4s, v27.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v5.s[2]     
\\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v19.4s, v29.4s, v5.s[3]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v6.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v19.4s, v24.4s, v0.s[2]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v0.s[3]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v0.s[2]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v1.s[0]     \\n\"\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v27.4s, v1.s[1]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[2]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v29.4s, v1.s[3]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v1.s[2]     \\n\"\n                        \"fmla   v19.4s, v30.4s, 
v2.s[0]     \\n\"\n\n                        \"add    %2, %2, #8                  \\n\"\n                        \"add    %3, %3, #8                  \\n\"\n                        \"add    %4, %4, #8                  \\n\"\n                        \"add    %5, %5, #8                  \\n\"\n                        \"add    %6, %6, #8                  \\n\"\n                        \"add    %7, %7, #8                  \\n\"\n                        \"add    %8, %8, #8                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n                        \"fadd   v17.4s, v17.4s, v19.4s      \\n\"\n\n                        \"sub    %9, %9, #392                \\n\"\n\n                        \"shrn   v16.4h, v16.4s, #16         \\n\"\n                        \"shrn   v17.4h, v17.4s, #16         \\n\"\n\n                        \"st1    {v16.4h, v17.4h}, [%0], #16 \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(r5),           // %7\n                        \"=r\"(r6),           // %8\n                        \"=r\"(kptr)          // %9\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", 
\"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #256]          \\n\"\n                        \"vld1.f32   {d28-d31}, [%1 :128]! \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2]!      \\n\" // r0\n                        \"vld1.u16   {d8[0]}, [%2]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmul.f32   q12, q5, d0[0]      \\n\"\n                        \"vmul.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%3]!      
\\n\" // r1\n                        \"vld1.u16   {d9[0]}, [%3]       \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshl.u32   d9, d9, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]      \\n\"\n                        \"vmla.f32   q15, q5, d5[0]      \\n\"\n                        \"vmla.f32   q12, q6, d4[1]      \\n\"\n                        \"vmla.f32   q13, q6, d5[1]      \\n\"\n                        \"vmla.f32   q14, q7, d5[0]      \\n\"\n                        \"vmla.f32   q15, q7, d6[0]      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4]!      \\n\" // r2\n                        \"vld1.u16   {d8[0]}, [%4]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]      \\n\"\n                        \"vmla.f32   q13, q8, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[0]      \\n\"\n                        \"vmla.f32   q15, q9, d7[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1]     \\n\"\n                        \"vmla.f32   q13, q10, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[0]     \\n\"\n                        \"vmla.f32   q15, q11, d9[0]     \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%5]!      
\\n\" // r3\n                        \"vld1.u16   {d9[0]}, [%5]       \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshl.u32   d9, d9, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]      \\n\"\n                        \"vmla.f32   q15, q5, d5[0]      \\n\"\n                        \"vmla.f32   q12, q6, d4[1]      \\n\"\n                        \"vmla.f32   q13, q6, d5[1]      \\n\"\n                        \"vmla.f32   q14, q7, d5[0]      \\n\"\n                        \"vmla.f32   q15, q7, d6[0]      \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%6]!      \\n\" // r4\n                        \"vld1.u16   {d8[0]}, [%6]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]      \\n\"\n                        \"vmla.f32   q13, q8, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[0]      \\n\"\n                        \"vmla.f32   q15, q9, d7[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1]     \\n\"\n                        \"vmla.f32   q13, q10, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[0]     \\n\"\n                        \"vmla.f32   q15, q11, d9[0]     \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%7]!      
\\n\" // r5\n                        \"vld1.u16   {d9[0]}, [%7]       \\n\"\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n                        \"vshl.u32   d9, d9, #16         \\n\"\n\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    \\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n                        \"vmla.f32   q12, q11, d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q14, q5, d4[0]      \\n\"\n                        \"vmla.f32   q15, q5, d5[0]      \\n\"\n                        \"vmla.f32   q12, q6, d4[1]      \\n\"\n                        \"vmla.f32   q13, q6, d5[1]      \\n\"\n                        \"vmla.f32   q14, q7, d5[0]      \\n\"\n                        \"vmla.f32   q15, q7, d6[0]      \\n\"\n\n                        \"pld        [%8, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%8]!      \\n\" // r6\n                        \"vld1.u16   {d8[0]}, [%8]       \\n\"\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n                        \"vshl.u32   d8, d8, #16         \\n\"\n\n                        \"vmla.f32   q12, q8, d5[1]      \\n\"\n                        \"vmla.f32   q13, q8, d6[1]      \\n\"\n                        \"vmla.f32   q14, q9, d6[0]      \\n\"\n                        \"vmla.f32   q15, q9, d7[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d14-d17}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q5, d14, #16        \\n\"\n                        \"vshll.u16  q6, d15, #16        \\n\"\n                        \"vshll.u16  q7, d16, #16        \\n\"\n                        \"vshll.u16  q8, d17, #16        \\n\"\n\n                        \"vmla.f32   q12, q10, d6[1]     \\n\"\n                        \"vmla.f32   q13, q10, d7[1]     \\n\"\n                        \"vmla.f32   q14, q11, d7[0]     \\n\"\n                        \"vmla.f32   q15, q11, d9[0]     \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d20-d22}, [%9]!    \\n\"\n\n                        \"vshll.u16  q9, d20, #16        \\n\"\n                        \"vshll.u16  q10, d21, #16       \\n\"\n                        \"vshll.u16  q11, d22, #16       \\n\"\n\n                        \"vmla.f32   q12, q5, d0[0]      \\n\"\n                        \"vmla.f32   q13, q5, d1[0]      \\n\"\n                        \"vmla.f32   q14, q6, d0[1]      \\n\"\n                        \"vmla.f32   q15, q6, d1[1]      \\n\"\n\n                        \"sub        %2, %2, #8          \\n\"\n                        \"sub        %3, %3, #8          \\n\"\n\n                        \"vmla.f32   q12, q7, d1[0]      \\n\"\n                        \"vmla.f32   q13, q7, d2[0]      \\n\"\n                        \"vmla.f32   q14, q8, d1[1]      \\n\"\n                        \"vmla.f32   q15, q8, d2[1]      \\n\"\n\n                        \"sub        %9, %9, #392        \\n\"\n\n                        \"vmla.f32   q12, q9, d2[0]      \\n\"\n                        \"vmla.f32   q13, q9, d3[0]      \\n\"\n                        \"vmla.f32   q14, q10, d2[1]     \\n\"\n                        \"vmla.f32   q15, q10, d3[1]     \\n\"\n\n                        \"sub        %4, %4, #8          \\n\"\n                        \"sub        %5, %5, #8          \\n\"\n\n                        \"vmla.f32   q12, q11, 
d3[0]     \\n\"\n                        \"vmla.f32   q13, q11, d8[0]     \\n\"\n\n                        \"sub        %6, %6, #8          \\n\"\n                        \"sub        %7, %7, #8          \\n\"\n\n                        \"vadd.f32   q14, q14, q12       \\n\"\n                        \"vadd.f32   q15, q15, q13       \\n\"\n\n                        \"sub        %8, %8, #8          \\n\"\n\n                        \"vshrn.u32  d28, q14, #16       \\n\"\n                        \"vshrn.u32  d29, q15, #16       \\n\"\n\n                        \"vst1.u16   {d28-d29}, [%0 :64]! \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(r5),           // %7\n                        \"=r\"(r6),           // %8\n                        \"=r\"(kptr)          // %9\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n                }\n                for (; j < outw; j++)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #128]       \\n\"\n                        \"ld1    {v16.4s}, [%1], #16         \\n\"\n\n                    
    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%2]        \\n\" // r0\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmul   v17.4s, v24.4s, v0.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmul   v18.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmul   v19.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%3]        \\n\" // r1\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   
v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%4]        \\n\" // r2\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v19.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v17.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         
\\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v19.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%5]        \\n\" // r3\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v19.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v16.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    
{v0.4h, v1.4h}, [%6]        \\n\" // r4\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v17.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v18.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v19.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v16.4s, v30.4s, v5.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v17.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v18.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v19.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #128]       \\n\"\n                        \"ld1    {v4.4h, v5.4h}, [%7]        \\n\" // r5\n\n                        \"shll   v4.4s, v4.4h, #16           \\n\"\n                        \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                        \"fmla   v16.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v17.4s, v28.4s, v1.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, 
[%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v18.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v19.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v16.4s, v24.4s, v4.s[0]     \\n\"\n                        \"fmla   v17.4s, v25.4s, v4.s[1]     \\n\"\n                        \"fmla   v18.4s, v26.4s, v4.s[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #128]       \\n\"\n                        \"ld1    {v0.4h, v1.4h}, [%8]        \\n\" // r6\n\n                        \"shll   v0.4s, v0.4h, #16           \\n\"\n                        \"shll   v1.4s, v1.4h, #16           \\n\"\n\n                        \"fmla   v19.4s, v27.4s, v4.s[3]     \\n\"\n                        \"fmla   v16.4s, v28.4s, v5.s[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%9, #256]       \\n\"\n                        \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%9], #32 \\n\"\n\n                        \"shll   v24.4s, v24.4h, #16         \\n\"\n                        \"shll   v25.4s, v25.4h, #16         \\n\"\n                        \"shll   v26.4s, v26.4h, #16         \\n\"\n                        \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                        \"fmla   v17.4s, v29.4s, v5.s[1]     \\n\"\n                        \"fmla   v18.4s, v30.4s, v5.s[2]     \\n\"\n\n                        
\"prfm   pldl1keep, [%9, #192]       \\n\"\n                        \"ld1    {v28.4h, v29.4h, v30.4h}, [%9], #24 \\n\"\n\n                        \"shll   v28.4s, v28.4h, #16         \\n\"\n                        \"shll   v29.4s, v29.4h, #16         \\n\"\n                        \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                        \"fmla   v19.4s, v24.4s, v0.s[0]     \\n\"\n                        \"fmla   v16.4s, v25.4s, v0.s[1]     \\n\"\n                        \"fmla   v17.4s, v26.4s, v0.s[2]     \\n\"\n\n                        \"add    %2, %2, #4                  \\n\"\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"fmla   v18.4s, v27.4s, v0.s[3]     \\n\"\n                        \"fmla   v19.4s, v28.4s, v1.s[0]     \\n\"\n                        \"fmla   v16.4s, v29.4s, v1.s[1]     \\n\"\n                        \"fmla   v17.4s, v30.4s, v1.s[2]     \\n\"\n\n                        \"add    %4, %4, #4                  \\n\"\n                        \"add    %5, %5, #4                  \\n\"\n\n                        \"fadd   v18.4s, v18.4s, v19.4s      \\n\"\n\n                        \"add    %6, %6, #4                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v17.4s      \\n\"\n\n                        \"add    %7, %7, #4                  \\n\"\n                        \"add    %8, %8, #4                  \\n\"\n\n                        \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n\n                        \"sub    %9, %9, #392                \\n\"\n\n                        \"shrn   v16.4h, v16.4s, #16         \\n\"\n\n                        \"st1    {v16.4h}, [%0], #8          \\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                     
   \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(r5),           // %7\n                        \"=r\"(r6),           // %8\n                        \"=r\"(kptr)          // %9\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v4\", \"v5\", \"v16\", \"v17\", \"v18\", \"v19\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\");\n#else  // __aarch64__\n                    asm volatile(\n                        \"pld        [%1, #128]          \\n\"\n                        \"vld1.f32   {d8-d9}, [%1 :128]! \\n\"\n\n                        \"pld        [%2, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%2]       \\n\" // r0\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%9]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmul.f32   q5, q8, d0[0]       \\n\"\n                        \"vmul.f32   q6, q9, d0[1]       \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmul.f32   q7, q10, d1[0]      \\n\"\n                        \"vmla.f32   q4, q11, d1[1]      \\n\"\n\n                        \"pld        [%3, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%3]       \\n\" // r1\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q5, q12, d2[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%9]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q6, q13, d2[1]      \\n\"\n                        \"vmla.f32   q7, q14, d3[0]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q4, q8, d4[0]       \\n\"\n                        \"vmla.f32   q5, q9, d4[1]       \\n\"\n                        \"vmla.f32   q6, q10, d5[0]      \\n\"\n\n                        \"pld        [%4, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%4]       \\n\" // r2\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q7, q11, d5[1]      \\n\"\n                        \"vmla.f32   q4, q12, d6[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%9]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q5, q13, d6[1]      \\n\"\n                        \"vmla.f32   q6, q14, d7[0]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q7, q8, d0[0]       \\n\"\n                        \"vmla.f32   q4, q9, d0[1]       \\n\"\n                        \"vmla.f32   q5, q10, d1[0]      \\n\"\n\n                        \"pld        [%5, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%5]       \\n\" // r3\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q6, q11, d1[1]      \\n\"\n                        \"vmla.f32   q7, q12, d2[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%9]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q4, q13, d2[1]      \\n\"\n                        \"vmla.f32   q5, q14, d3[0]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q6, q8, d4[0]       \\n\"\n                        \"vmla.f32   q7, q9, d4[1]       \\n\"\n                        \"vmla.f32   q4, q10, d5[0]      \\n\"\n\n                        \"pld        [%6, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%6]       \\n\" // r4\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q5, q11, d5[1]      \\n\"\n                        \"vmla.f32   q6, q12, d6[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%9]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q7, q13, d6[1]      \\n\"\n                        \"vmla.f32   q4, q14, d7[0]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q5, q8, d0[0]       \\n\"\n                        \"vmla.f32   q6, q9, d0[1]       \\n\"\n                        \"vmla.f32   q7, q10, d1[0]      \\n\"\n\n                        \"pld        [%7, #128]          \\n\"\n                        \"vld1.u16   {d6-d7}, [%7]       \\n\" // r5\n\n                        \"vshll.u16  q2, d6, #16         \\n\"\n                        \"vshll.u16  q3, d7, #16         \\n\"\n\n                        \"vmla.f32   q4, q11, d1[1]      \\n\"\n                        \"vmla.f32   q5, q12, d2[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%9]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q6, q13, d2[1]      \\n\"\n                        \"vmla.f32   q7, q14, d3[0]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q4, q8, d4[0]       \\n\"\n                        \"vmla.f32   q5, q9, d4[1]       \\n\"\n                        \"vmla.f32   q6, q10, d5[0]      \\n\"\n\n                        \"pld        [%8, #128]          \\n\"\n                        \"vld1.u16   {d2-d3}, [%8]       \\n\" // r6\n\n                        \"vshll.u16  q0, d2, #16         \\n\"\n                        \"vshll.u16  q1, d3, #16         \\n\"\n\n                        \"vmla.f32   q7, q11, d5[1]      \\n\"\n                        \"vmla.f32   q4, q12, d6[0]      \\n\"\n\n                        \"pld        [%9, #256]          \\n\"\n                        \"vld1.u16   {d20-d23}, [%9]!    \\n\"\n\n                        \"vshll.u16  q8, d20, #16        \\n\"\n                        \"vshll.u16  q9, d21, #16        \\n\"\n                        \"vshll.u16  q10, d22, #16       \\n\"\n                        \"vshll.u16  q11, d23, #16       \\n\"\n\n                        \"vmla.f32   q5, q13, d6[1]      \\n\"\n                        \"vmla.f32   q6, q14, d7[0]      \\n\"\n\n                        \"pld        [%9, #192]          \\n\"\n                        \"vld1.u16   {d26-d28}, [%9]!    
\\n\"\n\n                        \"vshll.u16  q12, d26, #16       \\n\"\n                        \"vshll.u16  q13, d27, #16       \\n\"\n                        \"vshll.u16  q14, d28, #16       \\n\"\n\n                        \"vmla.f32   q7, q8, d0[0]       \\n\"\n                        \"vmla.f32   q4, q9, d0[1]       \\n\"\n\n                        \"add        %2, %2, #4          \\n\"\n                        \"add        %3, %3, #4          \\n\"\n\n                        \"vmla.f32   q5, q10, d1[0]      \\n\"\n                        \"vmla.f32   q6, q11, d1[1]      \\n\"\n\n                        \"sub        %9, %9, #392        \\n\"\n\n                        \"vmla.f32   q7, q12, d2[0]      \\n\"\n                        \"vmla.f32   q4, q13, d2[1]      \\n\"\n                        \"vmla.f32   q5, q14, d3[0]      \\n\"\n\n                        \"add        %4, %4, #4          \\n\"\n                        \"add        %5, %5, #4          \\n\"\n\n                        \"vadd.f32   q6, q6, q7          \\n\"\n\n                        \"add        %6, %6, #4          \\n\"\n\n                        \"vadd.f32   q4, q4, q5          \\n\"\n\n                        \"add        %7, %7, #4          \\n\"\n\n                        \"vadd.f32   q4, q4, q6          \\n\"\n\n                        \"add        %8, %8, #4          \\n\"\n\n                        \"vshrn.u32  d8, q4, #16         \\n\"\n\n                        \"vst1.u16   {d8}, [%0 :64]!     
\\n\"\n\n                        : \"=r\"(outptr0_bf16), // %0\n                        \"=r\"(outptr0),      // %1\n                        \"=r\"(r0),           // %2\n                        \"=r\"(r1),           // %3\n                        \"=r\"(r2),           // %4\n                        \"=r\"(r3),           // %5\n                        \"=r\"(r4),           // %6\n                        \"=r\"(r5),           // %7\n                        \"=r\"(r6),           // %8\n                        \"=r\"(kptr)          // %9\n                        : \"0\"(outptr0_bf16),\n                        \"1\"(outptr0),\n                        \"2\"(r0),\n                        \"3\"(r1),\n                        \"4\"(r2),\n                        \"5\"(r3),\n                        \"6\"(r4),\n                        \"7\"(r5),\n                        \"8\"(r6),\n                        \"9\"(kptr)\n                        : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\");\n#endif // __aarch64__\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n                r5 += tailstep;\n                r6 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_7x7_pack1to8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv7x7s2_pack1to8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + p * 8) : vdupq_n_f16(0.f);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            __fp16* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const __fp16* r0 = img0.row<const __fp16>(0);\n            const __fp16* r1 = img0.row<const __fp16>(1);\n            const __fp16* r2 = img0.row<const __fp16>(2);\n            const __fp16* r3 = img0.row<const __fp16>(3);\n            const __fp16* r4 = img0.row<const __fp16>(4);\n            const __fp16* r5 = img0.row<const __fp16>(5);\n            const __fp16* r6 = img0.row<const __fp16>(6);\n\n            const __fp16* kptr = kernel.channel(p).row<const __fp16>(q);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 7 < outw; j += 8)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%1] \\n\" // r0\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, 
[%8], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\" // sum0\n\n                        \"fmla   v24.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v25.8h, v16.8h, v0.h[2]     \\n\"\n                        \"fmla   v26.8h, v16.8h, v0.h[4]     \\n\"\n                        \"fmla   v27.8h, v16.8h, v0.h[6]     \\n\"\n                        \"fmla   v28.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v1.h[4]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v1.h[6]     \\n\"\n\n                        \"sub    %0, %0, #64                 \\n\"\n\n                        \"fmla   v24.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v25.8h, v17.8h, v0.h[3]     \\n\"\n                        \"fmla   v26.8h, v17.8h, v0.h[5]     \\n\"\n                        \"fmla   v27.8h, v17.8h, v0.h[7]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v25.8h, v18.8h, v0.h[4]     \\n\"\n                        \"fmla   v26.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v27.8h, v18.8h, v1.h[0]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v1.h[6]     \\n\"\n                        \"fmla  
 v31.8h, v18.8h, v2.h[0]     \\n\"\n\n                        \"fmla   v24.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v25.8h, v19.8h, v0.h[5]     \\n\"\n                        \"fmla   v26.8h, v19.8h, v0.h[7]     \\n\"\n                        \"fmla   v27.8h, v19.8h, v1.h[1]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #384]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h}, [%2] \\n\" // r1\n\n                        \"fmla   v24.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v25.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v26.8h, v20.8h, v1.h[0]     \\n\"\n                        \"fmla   v27.8h, v20.8h, v1.h[2]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v2.h[2]     \\n\"\n\n                        \"fmla   v24.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v25.8h, v21.8h, v0.h[7]     \\n\"\n                        \"fmla   v26.8h, v21.8h, v1.h[1]     \\n\"\n                        \"fmla   v27.8h, v21.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        
\"fmla   v24.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v25.8h, v22.8h, v1.h[0]     \\n\"\n                        \"fmla   v26.8h, v22.8h, v1.h[2]     \\n\"\n                        \"fmla   v27.8h, v22.8h, v1.h[4]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v2.h[4]     \\n\"\n\n                        \"fmla   v24.8h, v23.8h, v4.h[0]     \\n\"\n                        \"fmla   v25.8h, v23.8h, v4.h[2]     \\n\"\n                        \"fmla   v26.8h, v23.8h, v4.h[4]     \\n\"\n                        \"fmla   v27.8h, v23.8h, v4.h[6]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v5.h[0]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[4]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v5.h[6]     \\n\"\n\n                        \"fmla   v24.8h, v16.8h, v4.h[1]     \\n\"\n                        \"fmla   v25.8h, v16.8h, v4.h[3]     \\n\"\n                        \"fmla   v26.8h, v16.8h, v4.h[5]     \\n\"\n                        \"fmla   v27.8h, v16.8h, v4.h[7]     \\n\"\n                        \"fmla   v28.8h, v16.8h, v5.h[1]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[3]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v5.h[7]     \\n\"\n\n                        \"fmla   v24.8h, v17.8h, v4.h[2]     \\n\"\n                        \"fmla   v25.8h, v17.8h, v4.h[4]     \\n\"\n                        \"fmla   v26.8h, v17.8h, v4.h[6]     \\n\"\n                        \"fmla   v27.8h, v17.8h, v5.h[0]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v5.h[2]     \\n\"\n                        \"fmla   v29.8h, 
v17.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v18.8h, v4.h[3]     \\n\"\n                        \"fmla   v25.8h, v18.8h, v4.h[5]     \\n\"\n                        \"fmla   v26.8h, v18.8h, v4.h[7]     \\n\"\n                        \"fmla   v27.8h, v18.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v5.h[3]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[5]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v6.h[1]     \\n\"\n\n                        \"fmla   v24.8h, v19.8h, v4.h[4]     \\n\"\n                        \"fmla   v25.8h, v19.8h, v4.h[6]     \\n\"\n                        \"fmla   v26.8h, v19.8h, v5.h[0]     \\n\"\n                        \"fmla   v27.8h, v19.8h, v5.h[2]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v5.h[4]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%3] \\n\" // r2\n\n                        \"fmla   v24.8h, v20.8h, v4.h[5]     \\n\"\n                        \"fmla   v25.8h, v20.8h, v4.h[7]     \\n\"\n                        \"fmla   v26.8h, v20.8h, v5.h[1]     \\n\"\n                        \"fmla   v27.8h, v20.8h, v5.h[3]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v5.h[5]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[7]     \\n\"\n                        \"fmla   
v30.8h, v20.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v6.h[3]     \\n\"\n\n                        \"fmla   v24.8h, v21.8h, v4.h[6]     \\n\"\n                        \"fmla   v25.8h, v21.8h, v5.h[0]     \\n\"\n                        \"fmla   v26.8h, v21.8h, v5.h[2]     \\n\"\n                        \"fmla   v27.8h, v21.8h, v5.h[4]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v5.h[6]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[4]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v22.8h, v0.h[0]     \\n\"\n                        \"fmla   v25.8h, v22.8h, v0.h[2]     \\n\"\n                        \"fmla   v26.8h, v22.8h, v0.h[4]     \\n\"\n                        \"fmla   v27.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[4]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v1.h[6]     \\n\"\n\n                        \"fmla   v24.8h, v23.8h, v0.h[1]     \\n\"\n                        \"fmla   v25.8h, v23.8h, v0.h[3]     \\n\"\n                        \"fmla   v26.8h, v23.8h, v0.h[5]     \\n\"\n                        \"fmla   v27.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v24.8h, v16.8h, v0.h[2]     \\n\"\n                        \"fmla   
v25.8h, v16.8h, v0.h[4]     \\n\"\n                        \"fmla   v26.8h, v16.8h, v0.h[6]     \\n\"\n                        \"fmla   v27.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v28.8h, v16.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v2.h[0]     \\n\"\n\n                        \"fmla   v24.8h, v17.8h, v0.h[3]     \\n\"\n                        \"fmla   v25.8h, v17.8h, v0.h[5]     \\n\"\n                        \"fmla   v26.8h, v17.8h, v0.h[7]     \\n\"\n                        \"fmla   v27.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v18.8h, v0.h[4]     \\n\"\n                        \"fmla   v25.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v26.8h, v18.8h, v1.h[0]     \\n\"\n                        \"fmla   v27.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v2.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #384]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h}, [%4] \\n\" // r3\n\n                        \"fmla   v24.8h, v19.8h, v0.h[5]     \\n\"\n                        \"fmla   v25.8h, v19.8h, v0.h[7]     \\n\"\n                        
\"fmla   v26.8h, v19.8h, v1.h[1]     \\n\"\n                        \"fmla   v27.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v2.h[3]     \\n\"\n\n                        \"fmla   v24.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v25.8h, v20.8h, v1.h[0]     \\n\"\n                        \"fmla   v26.8h, v20.8h, v1.h[2]     \\n\"\n                        \"fmla   v27.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v2.h[4]     \\n\"\n\n                        \"fmla   v24.8h, v21.8h, v4.h[0]     \\n\"\n                        \"fmla   v25.8h, v21.8h, v4.h[2]     \\n\"\n                        \"fmla   v26.8h, v21.8h, v4.h[4]     \\n\"\n                        \"fmla   v27.8h, v21.8h, v4.h[6]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v5.h[0]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[4]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v5.h[6]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v22.8h, v4.h[1]     \\n\"\n                        \"fmla   v25.8h, v22.8h, v4.h[3]     \\n\"\n                        \"fmla   v26.8h, v22.8h, v4.h[5]     \\n\"\n                        \"fmla   v27.8h, v22.8h, v4.h[7]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v5.h[1]     \\n\"\n                        
\"fmla   v29.8h, v22.8h, v5.h[3]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v5.h[7]     \\n\"\n\n                        \"fmla   v24.8h, v23.8h, v4.h[2]     \\n\"\n                        \"fmla   v25.8h, v23.8h, v4.h[4]     \\n\"\n                        \"fmla   v26.8h, v23.8h, v4.h[6]     \\n\"\n                        \"fmla   v27.8h, v23.8h, v5.h[0]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v5.h[2]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[0]     \\n\"\n\n                        \"fmla   v24.8h, v16.8h, v4.h[3]     \\n\"\n                        \"fmla   v25.8h, v16.8h, v4.h[5]     \\n\"\n                        \"fmla   v26.8h, v16.8h, v4.h[7]     \\n\"\n                        \"fmla   v27.8h, v16.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v16.8h, v5.h[3]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[5]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v6.h[1]     \\n\"\n\n                        \"fmla   v24.8h, v17.8h, v4.h[4]     \\n\"\n                        \"fmla   v25.8h, v17.8h, v4.h[6]     \\n\"\n                        \"fmla   v26.8h, v17.8h, v5.h[0]     \\n\"\n                        \"fmla   v27.8h, v17.8h, v5.h[2]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v5.h[4]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        
\"fmla   v24.8h, v18.8h, v4.h[5]     \\n\"\n                        \"fmla   v25.8h, v18.8h, v4.h[7]     \\n\"\n                        \"fmla   v26.8h, v18.8h, v5.h[1]     \\n\"\n                        \"fmla   v27.8h, v18.8h, v5.h[3]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v5.h[5]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v5.h[7]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v6.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%5] \\n\" // r4\n\n                        \"fmla   v24.8h, v19.8h, v4.h[6]     \\n\"\n                        \"fmla   v25.8h, v19.8h, v5.h[0]     \\n\"\n                        \"fmla   v26.8h, v19.8h, v5.h[2]     \\n\"\n                        \"fmla   v27.8h, v19.8h, v5.h[4]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v5.h[6]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v6.h[4]     \\n\"\n\n                        \"fmla   v24.8h, v20.8h, v0.h[0]     \\n\"\n                        \"fmla   v25.8h, v20.8h, v0.h[2]     \\n\"\n                        \"fmla   v26.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v27.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v1.h[6]     \\n\"\n\n                        \"fmla   v24.8h, v21.8h, v0.h[1]     \\n\"\n                        \"fmla   v25.8h, v21.8h, v0.h[3]     \\n\"\n                        \"fmla   v26.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   
v27.8h, v21.8h, v0.h[7]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v1.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v22.8h, v0.h[2]     \\n\"\n                        \"fmla   v25.8h, v22.8h, v0.h[4]     \\n\"\n                        \"fmla   v26.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v27.8h, v22.8h, v1.h[0]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v2.h[0]     \\n\"\n\n                        \"fmla   v24.8h, v23.8h, v0.h[3]     \\n\"\n                        \"fmla   v25.8h, v23.8h, v0.h[5]     \\n\"\n                        \"fmla   v26.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v27.8h, v23.8h, v1.h[1]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[5]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #384]       \\n\"\n                        \"ld1    {v4.8h, v5.8h, v6.8h}, [%6] \\n\" // r5\n\n                        \"fmla   v24.8h, v16.8h, v0.h[4]     \\n\"\n                        \"fmla   v25.8h, v16.8h, v0.h[6]     \\n\"\n                        \"fmla   v26.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v27.8h, v16.8h, v1.h[2]     \\n\"\n                        
\"fmla   v28.8h, v16.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v2.h[2]     \\n\"\n\n                        \"fmla   v24.8h, v17.8h, v0.h[5]     \\n\"\n                        \"fmla   v25.8h, v17.8h, v0.h[7]     \\n\"\n                        \"fmla   v26.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v27.8h, v17.8h, v1.h[3]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v2.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v25.8h, v18.8h, v1.h[0]     \\n\"\n                        \"fmla   v26.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v27.8h, v18.8h, v1.h[4]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v2.h[4]     \\n\"\n\n                        \"fmla   v24.8h, v19.8h, v4.h[0]     \\n\"\n                        \"fmla   v25.8h, v19.8h, v4.h[2]     \\n\"\n                        \"fmla   v26.8h, v19.8h, v4.h[4]     \\n\"\n                        \"fmla   v27.8h, v19.8h, v4.h[6]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v5.h[0]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v5.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v5.h[4]     \\n\"\n                        
\"fmla   v31.8h, v19.8h, v5.h[6]     \\n\"\n\n                        \"fmla   v24.8h, v20.8h, v4.h[1]     \\n\"\n                        \"fmla   v25.8h, v20.8h, v4.h[3]     \\n\"\n                        \"fmla   v26.8h, v20.8h, v4.h[5]     \\n\"\n                        \"fmla   v27.8h, v20.8h, v4.h[7]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v5.h[1]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v5.h[3]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v5.h[5]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v5.h[7]     \\n\"\n\n                        \"fmla   v24.8h, v21.8h, v4.h[2]     \\n\"\n                        \"fmla   v25.8h, v21.8h, v4.h[4]     \\n\"\n                        \"fmla   v26.8h, v21.8h, v4.h[6]     \\n\"\n                        \"fmla   v27.8h, v21.8h, v5.h[0]     \\n\"\n                        \"fmla   v28.8h, v21.8h, v5.h[2]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v5.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v5.h[6]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v6.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v22.8h, v4.h[3]     \\n\"\n                        \"fmla   v25.8h, v22.8h, v4.h[5]     \\n\"\n                        \"fmla   v26.8h, v22.8h, v4.h[7]     \\n\"\n                        \"fmla   v27.8h, v22.8h, v5.h[1]     \\n\"\n                        \"fmla   v28.8h, v22.8h, v5.h[3]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v5.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v5.h[7]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v6.h[1]     \\n\"\n\n                        \"fmla   v24.8h, v23.8h, v4.h[4]     \\n\"\n                        \"fmla   v25.8h, v23.8h, v4.h[6]     \\n\"\n                        
\"fmla   v26.8h, v23.8h, v5.h[0]     \\n\"\n                        \"fmla   v27.8h, v23.8h, v5.h[2]     \\n\"\n                        \"fmla   v28.8h, v23.8h, v5.h[4]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v5.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v6.h[0]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v6.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #384]       \\n\"\n                        \"ld1    {v0.8h, v1.8h, v2.8h}, [%7] \\n\" // r6\n\n                        \"fmla   v24.8h, v16.8h, v4.h[5]     \\n\"\n                        \"fmla   v25.8h, v16.8h, v4.h[7]     \\n\"\n                        \"fmla   v26.8h, v16.8h, v5.h[1]     \\n\"\n                        \"fmla   v27.8h, v16.8h, v5.h[3]     \\n\"\n                        \"fmla   v28.8h, v16.8h, v5.h[5]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v5.h[7]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v6.h[1]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v6.h[3]     \\n\"\n\n                        \"fmla   v24.8h, v17.8h, v4.h[6]     \\n\"\n                        \"fmla   v25.8h, v17.8h, v5.h[0]     \\n\"\n                        \"fmla   v26.8h, v17.8h, v5.h[2]     \\n\"\n                        \"fmla   v27.8h, v17.8h, v5.h[4]     \\n\"\n                        \"fmla   v28.8h, v17.8h, v5.h[6]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v6.h[0]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v6.h[2]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v6.h[4]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v24.8h, v18.8h, v0.h[0]     \\n\"\n                        \"fmla   v25.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v26.8h, v18.8h, v0.h[4]     \\n\"\n                    
    \"fmla   v27.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v28.8h, v18.8h, v1.h[0]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v1.h[4]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v1.h[6]     \\n\"\n\n                        \"fmla   v24.8h, v19.8h, v0.h[1]     \\n\"\n                        \"fmla   v25.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v26.8h, v19.8h, v0.h[5]     \\n\"\n                        \"fmla   v27.8h, v19.8h, v0.h[7]     \\n\"\n                        \"fmla   v28.8h, v19.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v1.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[5]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[7]     \\n\"\n\n                        \"fmla   v24.8h, v20.8h, v0.h[2]     \\n\"\n                        \"fmla   v25.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v26.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v27.8h, v20.8h, v1.h[0]     \\n\"\n                        \"fmla   v28.8h, v20.8h, v1.h[2]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v1.h[6]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v2.h[0]     \\n\"\n\n                        \"add    %1, %1, #32                 \\n\"\n\n                        \"fmla   v24.8h, v21.8h, v0.h[3]     \\n\"\n                        \"fmla   v25.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v26.8h, v21.8h, v0.h[7]     \\n\"\n                        \"fmla   v27.8h, v21.8h, v1.h[1]     \\n\"\n\n                        \"add    %2, %2, #32                 \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[5]     \\n\"\n                        \"fmla   
v30.8h, v21.8h, v1.h[7]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #128]       \\n\"\n                        \"ld1    {v16.8h}, [%8]              \\n\"\n\n                        \"fmla   v24.8h, v22.8h, v0.h[4]     \\n\"\n                        \"fmla   v25.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v26.8h, v22.8h, v1.h[0]     \\n\"\n                        \"fmla   v27.8h, v22.8h, v1.h[2]     \\n\"\n\n                        \"add    %3, %3, #32                 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v1.h[4]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[0]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v2.h[2]     \\n\"\n\n                        \"add    %4, %4, #32                 \\n\"\n\n                        \"fmla   v24.8h, v23.8h, v0.h[5]     \\n\"\n                        \"fmla   v25.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v26.8h, v23.8h, v1.h[1]     \\n\"\n                        \"fmla   v27.8h, v23.8h, v1.h[3]     \\n\"\n\n                        \"add    %5, %5, #32                 \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v1.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[1]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[3]     \\n\"\n\n                        \"add    %6, %6, #32                 \\n\"\n\n                        \"fmla   v24.8h, v16.8h, v0.h[6]     \\n\"\n                        \"fmla   v25.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v26.8h, v16.8h, v1.h[2]     \\n\"\n                        \"fmla   v27.8h, v16.8h, v1.h[4]     \\n\"\n\n                        \"add    %7, %7, #32                 \\n\"\n\n                        \"fmla   
v28.8h, v16.8h, v1.h[6]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[2]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v2.h[4]     \\n\"\n\n                        \"sub    %8, %8, #768                \\n\" // kptr -= 48 * 8;\n\n                        \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v4\", \"v5\", \"v6\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j + 3 < outw; j += 4)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%0, #512]       \\n\"\n                        \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0] \\n\" // sum0\n\n                        \"prfm   pldl1keep, [%1, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%1]        \\n\" // r0\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n              
          \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v0.h[2]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v0.h[4]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v0.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[3]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v0.h[7]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v0.h[4]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v1.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #256]       \\n\"\n                        \"ld1    {v2.8h, v3.8h}, [%2]        \\n\" // r1\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v1.h[0]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v1.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[7]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v21.8h, 
v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v1.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v2.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v3.h[1]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[2]     \\n\"\n\n                        
\"prfm   pldl1keep, [%3, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%3]        \\n\" // r2\n\n                        \"fmla   v28.8h, v20.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v3.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[4]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v0.h[2]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v0.h[4]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v0.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v0.h[3]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v0.h[4]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v1.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n              
          \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v1.h[0]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #256]       \\n\"\n                        \"ld1    {v2.8h, v3.8h}, [%4]        \\n\" // r3\n\n                        \"fmla   v28.8h, v19.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v0.h[7]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v1.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[4]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v2.h[6]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, 
v3.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v3.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%5]        \\n\" // r4\n\n                        \"fmla   v28.8h, v17.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v3.h[3]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v3.h[0]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v3.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v0.h[2]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v0.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[3]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v0.h[7]     \\n\"\n\n                        \"prfm   
pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v0.h[4]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v1.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #256]       \\n\"\n                        \"ld1    {v2.8h, v3.8h}, [%6]        \\n\" // r5\n\n                        \"fmla   v28.8h, v16.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v0.h[6]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v1.h[2]     \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[7]     \\n\"\n                        \"fmla   v30.8h, v17.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v1.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v1.h[4]     \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v2.h[0]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v2.h[2]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v2.h[4]     \\n\"\n       
                 \"fmla   v31.8h, v19.8h, v2.h[6]     \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v2.h[1]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v2.h[3]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v2.h[5]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v2.h[7]     \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v2.h[2]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v2.h[4]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v2.h[6]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v3.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v2.h[3]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v2.h[5]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v2.h[7]     \\n\"\n                        \"fmla   v31.8h, v22.8h, v3.h[1]     \\n\"\n\n                        \"add    %1, %1, #16                 \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v2.h[4]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v2.h[6]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v3.h[0]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v3.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #256]       \\n\"\n                        \"ld1    {v0.8h, v1.8h}, [%7]        \\n\" // r6\n\n                        \"fmla   v28.8h, v16.8h, v2.h[5]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v2.h[7]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v3.h[1]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v3.h[3]     \\n\"\n\n                        \"add    %2, %2, #16                 \\n\"\n\n                        \"fmla   v28.8h, v17.8h, v2.h[6]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v3.h[0]     
\\n\"\n                        \"fmla   v30.8h, v17.8h, v3.h[2]     \\n\"\n                        \"fmla   v31.8h, v17.8h, v3.h[4]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v28.8h, v18.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v30.8h, v18.8h, v0.h[4]     \\n\"\n                        \"fmla   v31.8h, v18.8h, v0.h[6]     \\n\"\n\n                        \"add    %3, %3, #16                 \\n\"\n\n                        \"fmla   v28.8h, v19.8h, v0.h[1]     \\n\"\n                        \"fmla   v29.8h, v19.8h, v0.h[3]     \\n\"\n                        \"fmla   v30.8h, v19.8h, v0.h[5]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[7]     \\n\"\n\n                        \"add    %4, %4, #16                 \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v30.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v20.8h, v1.h[0]     \\n\"\n\n                        \"add    %5, %5, #16                 \\n\"\n\n                        \"fmla   v28.8h, v21.8h, v0.h[3]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n                        \"fmla   v30.8h, v21.8h, v0.h[7]     \\n\"\n                        \"fmla   v31.8h, v21.8h, v1.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #128]       \\n\"\n                        \"ld1    {v16.8h}, [%8]              \\n\"\n\n                        \"fmla   v28.8h, v22.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v30.8h, v22.8h, v1.h[0]     \\n\"\n                        \"fmla   v31.8h, v22.8h, 
v1.h[2]     \\n\"\n\n                        \"add    %6, %6, #16                 \\n\"\n\n                        \"fmla   v28.8h, v23.8h, v0.h[5]     \\n\"\n                        \"fmla   v29.8h, v23.8h, v0.h[7]     \\n\"\n                        \"fmla   v30.8h, v23.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[3]     \\n\"\n\n                        \"add    %7, %7, #16                 \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v16.8h, v1.h[0]     \\n\"\n                        \"fmla   v30.8h, v16.8h, v1.h[2]     \\n\"\n                        \"fmla   v31.8h, v16.8h, v1.h[4]     \\n\"\n\n                        \"sub    %8, %8, #768                \\n\" // kptr -= 48 * 8;\n\n                        \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                        \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n                for (; j < outw; j++)\n                {\n                    asm volatile(\n                        \"prfm   pldl1keep, [%1, #128]    
   \\n\"\n                        \"ld1    {v0.8h}, [%1]               \\n\" // r0\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"prfm   pldl1keep, [%0, #128]       \\n\"\n                        \"ld1    {v31.8h}, [%0]              \\n\" // sum0\n\n                        \"fmul   v28.8h, v16.8h, v0.h[0]     \\n\"\n                        \"fmul   v29.8h, v17.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmul   v30.8h, v18.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%2, #128]       \\n\"\n                        \"ld1    {v1.8h}, [%2]               \\n\" // r1\n\n                        \"fmla   v28.8h, v20.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[0]     \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[4]     \\n\"\n\n                        \"prfm   pldl1keep, [%3, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%3]               \\n\" // r2\n\n                        \"fmla   
v28.8h, v20.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[6]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[0]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%4, #128]       \\n\"\n                        \"ld1    {v1.8h}, [%4]               \\n\" // r3\n\n                        \"fmla   v28.8h, v16.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v0.h[4]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[5]     \\n\"\n\n                        \"add    %1, %1, #4                  \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[6]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[0]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v1.h[1]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%5, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%5]               \\n\" // r4\n\n                        \"fmla   v28.8h, v16.8h, v1.h[3]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[4]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v1.h[5]     \\n\"\n                        
\"fmla   v31.8h, v19.8h, v1.h[6]     \\n\"\n\n                        \"add    %2, %2, #4                  \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[0]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[1]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[2]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%6, #128]       \\n\"\n                        \"ld1    {v1.8h}, [%6]               \\n\" // r5\n\n                        \"fmla   v28.8h, v16.8h, v0.h[4]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v0.h[5]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v0.h[6]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v1.h[0]     \\n\"\n\n                        \"add    %3, %3, #4                  \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v1.h[1]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v1.h[2]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v1.h[3]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v1.h[4]     \\n\"\n\n                        \"prfm   pldl1keep, [%7, #128]       \\n\"\n                        \"ld1    {v0.8h}, [%7]               \\n\" // r6\n\n                        \"fmla   v28.8h, v16.8h, v1.h[5]     \\n\"\n                        \"fmla   v29.8h, v17.8h, v1.h[6]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #512]       \\n\"\n                        \"ld1 
   {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n\n                        \"fmla   v30.8h, v18.8h, v0.h[0]     \\n\"\n                        \"fmla   v31.8h, v19.8h, v0.h[1]     \\n\"\n\n                        \"add    %4, %4, #4                  \\n\"\n\n                        \"fmla   v28.8h, v20.8h, v0.h[2]     \\n\"\n                        \"fmla   v29.8h, v21.8h, v0.h[3]     \\n\"\n\n                        \"prfm   pldl1keep, [%8, #128]       \\n\"\n                        \"ld1    {v16.8h}, [%8]              \\n\"\n\n                        \"fmla   v30.8h, v22.8h, v0.h[4]     \\n\"\n                        \"fmla   v31.8h, v23.8h, v0.h[5]     \\n\"\n\n                        \"add    %5, %5, #4                  \\n\"\n\n                        \"fmla   v28.8h, v16.8h, v0.h[6]     \\n\"\n\n                        \"add    %6, %6, #4                  \\n\"\n\n                        \"fadd   v29.8h, v29.8h, v30.8h      \\n\"\n                        \"fadd   v31.8h, v31.8h, v28.8h      \\n\"\n\n                        \"add    %7, %7, #4                  \\n\"\n\n                        \"fadd   v29.8h, v29.8h, v31.8h      \\n\"\n\n                        \"sub    %8, %8, #768                \\n\" // kptr -= 48 * 8;\n\n                        \"st1    {v29.8h}, [%0], #16         \\n\"\n\n                        : \"=r\"(outptr0), // %0\n                        \"=r\"(r0),      // %1\n                        \"=r\"(r1),      // %2\n                        \"=r\"(r2),      // %3\n                        \"=r\"(r3),      // %4\n                        \"=r\"(r4),      // %5\n                        \"=r\"(r5),      // %6\n                        \"=r\"(r6),      // %7\n                        \"=r\"(kptr)     // %8\n                        : \"0\"(outptr0),\n                        \"1\"(r0),\n                        \"2\"(r1),\n                        \"3\"(r2),\n                        \"4\"(r3),\n                        \"5\"(r4),\n                  
      \"6\"(r5),\n                        \"7\"(r6),\n                        \"8\"(kptr)\n                        : \"memory\", \"v0\", \"v1\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n                r5 += tailstep;\n                r6 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution_arm.h\"\n\n#include \"benchmark.h\"\n#include \"cpu.h\"\n#include \"layer_type.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#if NCNN_GNU_INLINE_ASM\n#include \"convolution_1x1.h\"\n#include \"convolution_2x2.h\"\n#include \"convolution_3x3.h\"\n#include \"convolution_4x4.h\"\n#include \"convolution_5x5.h\"\n#include \"convolution_7x7.h\"\n#endif // NCNN_GNU_INLINE_ASM\n\n#include \"convolution_packed.h\"\n#include \"convolution_3x3_winograd.h\"\n#include \"convolution_im2col_gemm.h\"\n\n#if NCNN_BF16\n#include \"convolution_packed_bf16s.h\"\n#include \"convolution_3x3_winograd_bf16s.h\"\n#include \"convolution_im2col_gemm_bf16s_fp16s.h\"\n#include \"convolution_im2col_gemm_bf16s.h\"\n#endif // NCNN_BF16\n\n#if NCNN_INT8\n#include \"convolution_packed_int8.h\"\n#include \"convolution_im2col_gemm_int8.h\"\n#include \"convolution_3x3_winograd_int8.h\"\n\n// #include \"convolution_3x3_int8.h\"\n#endif // NCNN_INT8\n\n#if __ARM_NEON\n#if NCNN_GNU_INLINE_ASM\n#include \"convolution_3x3_pack1to4.h\"\n#include \"convolution_3x3_pack4.h\"\n#include \"convolution_3x3_pack4to1.h\"\n#include \"convolution_5x5_pack4.h\"\n#include \"convolution_7x7_pack1to4.h\"\n\n#if NCNN_BF16\n#include \"convolution_3x3_pack1to4_bf16s.h\"\n#include \"convolution_3x3_pack4_bf16s.h\"\n#include \"convolution_5x5_pack4_bf16s.h\"\n#include \"convolution_7x7_pack1to4_bf16s.h\"\n#endif // NCNN_BF16\n#endif // NCNN_GNU_INLINE_ASM\n#endif // __ARM_NEON\n\nConvolution_arm::Convolution_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n\n    activation = 0;\n    nT = 0;\n    convolution_dilation1 = 0;\n}\n\nstatic void 
convolution_transform_kernel_packed_neon(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, int kernel_w, int kernel_h, int elempack, int out_elempack)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)4u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            float* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n\nint Convolution_arm::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    activation = create_activation_layer(activation_type, activation_params, opt);\n    nT = opt.num_threads;\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return create_pipeline_int8_arm(opt);\n    }\n#endif\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n    if ((!support_packing || !opt.use_packing_layout) && 
!opt.use_bf16_storage && kernel_w == kernel_h && dilation_w != 1 && dilation_h == dilation_w && stride_w == 1 && stride_h == 1)\n    {\n        convolution_dilation1 = ncnn::create_layer_cpu(ncnn::LayerType::Convolution);\n\n        // set param\n        ncnn::ParamDict pd;\n        pd.set(0, num_output); // num_output\n        pd.set(1, kernel_w);\n        pd.set(11, kernel_h);\n        pd.set(2, 1);\n        pd.set(12, 1);\n        pd.set(3, 1);  // stride_w\n        pd.set(13, 1); // stride_h\n        pd.set(4, 0);  // pad_w\n        pd.set(14, 0); // pad_h\n        pd.set(5, bias_term);\n        pd.set(6, weight_data_size);\n\n        convolution_dilation1->load_param(pd);\n\n        // set weights\n        if (bias_term)\n        {\n            ncnn::Mat weights[2];\n            weights[0] = weight_data;\n            weights[1] = bias_data;\n\n            convolution_dilation1->load_model(ModelBinFromMatArray(weights));\n        }\n        else\n        {\n            ncnn::Mat weights[1];\n            weights[0] = weight_data;\n\n            convolution_dilation1->load_model(ModelBinFromMatArray(weights));\n        }\n\n        convolution_dilation1->create_pipeline(opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && (num_input >= 8 || num_output >= 8);\n\n    if (opt.use_winograd_convolution && prefer_winograd && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        // dynamic shape\n        if (opt.use_winograd63_convolution && (num_input <= 128 && num_output <= 128))\n            conv3x3s1_winograd63_transform_kernel(weight_data, weight_winograd63_data, num_input, num_output, opt);\n        else if (opt.use_winograd43_convolution && (num_input >= 8 && num_output >= 8))\n            conv3x3s1_winograd43_transform_kernel(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        else\n            conv3x3s1_winograd23_transform_kernel(weight_data, weight_winograd23_data, num_input, num_output, opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    int l2_cache_size_fp32 = get_cpu_level2_cache_size() / sizeof(float);\n    bool prefer_sgemm = num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * 2 > l2_cache_size_fp32 || (num_input > 16 || num_output > 16);\n\n#if NCNN_GNU_INLINE_ASM\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 4 || num_output < 32))\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 8 || num_output < 44))\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 1 && 
out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    if ((opt.use_sgemm_convolution && prefer_sgemm) || (kernel_w == 1 && kernel_h == 1))\n    {\n        convolution_im2col_gemm_transform_kernel(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h, opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n#if NCNN_GNU_INLINE_ASM\n    if ((elempack == 4 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 4 && out_elempack == 4 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (elempack == 4 && out_elempack == 4 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 1 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (elempack == 1 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 1 && out_elempack == 4 && kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2))\n    {\n        convolution_transform_kernel_packed_neon(weight_data, weight_data_tm, num_input, 
num_output, kernel_w, kernel_h, elempack, out_elempack);\n    }\n    else if (elempack == 1 && out_elempack == 1 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n    {\n        conv3x3s2_transform_kernel_neon(weight_data, weight_3x3s2_data, num_input, num_output);\n    }\n    else if ((elempack == 1 && out_elempack == 1 && kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n             || (elempack == 1 && out_elempack == 1 && kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n             || (elempack == 1 && out_elempack == 1 && kernel_w == 4 && kernel_h == 4 && dilation_w == 1 && dilation_h == 1 && stride_w == 4 && stride_h == 4)\n             || (elempack == 1 && out_elempack == 1 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n             || (elempack == 1 && out_elempack == 1 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n             || (elempack == 1 && out_elempack == 1 && kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n             || (elempack == 1 && out_elempack == 1 && kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2))\n    {\n        weight_data_tm = weight_data;\n    }\n    else\n#endif // NCNN_GNU_INLINE_ASM\n    {\n        convolution_transform_kernel_packed(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h);\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_arm::destroy_pipeline(const Option& opt)\n{\n    if (activation)\n    {\n        activation->destroy_pipeline(opt);\n        delete activation;\n        activation = 0;\n    }\n\n    if (convolution_dilation1)\n    {\n        
convolution_dilation1->destroy_pipeline(opt);\n        delete convolution_dilation1;\n        convolution_dilation1 = 0;\n    }\n\n    return 0;\n}\n\nint Convolution_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && int8_scale_term)\n    {\n        return forward_int8_arm(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    // flattened blob, implement as InnerProduct\n    if (bottom_blob.dims == 1 && kernel_w == 1 && kernel_h == 1)\n    {\n        Mat bottom_blob_3d;\n        if (bottom_blob.elemsize % 16 == 0)\n        {\n            bottom_blob_3d = bottom_blob;\n            bottom_blob_3d.dims = 3;\n            bottom_blob_3d.w = 1;\n            bottom_blob_3d.h = 1;\n            bottom_blob_3d.c = bottom_blob.w;\n            bottom_blob_3d.cstep = 1;\n        }\n        else\n        {\n            bottom_blob_3d = bottom_blob.reshape(1, 1, bottom_blob.w, opt.workspace_allocator);\n        }\n\n        Mat top_blob_3d;\n        int ret = forward(bottom_blob_3d, top_blob_3d, opt);\n        if (ret != 0)\n            return ret;\n\n        if (top_blob_3d.elemsize % 16 == 0)\n        {\n            top_blob = top_blob_3d;\n            top_blob.dims = 1;\n            top_blob.w = top_blob_3d.c;\n            top_blob.h = 1;\n            top_blob.c = 1;\n            top_blob.cstep = top_blob_3d.c;\n        }\n        else\n        {\n            top_blob = top_blob_3d.reshape(top_blob_3d.c, opt.blob_allocator);\n        }\n\n        return 0;\n    }\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, 
top_blob, opt);\n#endif\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Convolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if ((!support_packing || !opt.use_packing_layout) && kernel_w == kernel_h && dilation_w != 1 && dilation_h == dilation_w && stride_w == 1 && stride_h == 1)\n    {\n        if (outw >= dilation_w && outh >= dilation_h)\n        {\n            return forwardDilation_arm(bottom_blob_bordered, top_blob, opt);\n        }\n    }\n\n    const int num_input = channels * elempack;\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && (num_input >= 8 || num_output >= 8);\n\n    if (opt.use_winograd_convolution && prefer_winograd && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        bool prefer_winograd63 = false;\n        bool prefer_winograd23 = 
false;\n        bool prefer_winograd43 = !prefer_winograd63 && !prefer_winograd23;\n\n        if (prefer_winograd23 && (!opt.use_winograd23_convolution || weight_winograd23_data.empty()))\n        {\n            // f23 fallback to f43\n            prefer_winograd23 = false;\n            prefer_winograd43 = true;\n        }\n\n        if (prefer_winograd63 && (!opt.use_winograd63_convolution || weight_winograd63_data.empty()))\n        {\n            // f63 fallback to f43\n            prefer_winograd63 = false;\n            prefer_winograd43 = true;\n        }\n\n        if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))\n        {\n            // f43 fallback to f63 or f23\n            prefer_winograd43 = false;\n            if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())\n            {\n                prefer_winograd63 = true;\n            }\n            else\n            {\n                prefer_winograd23 = true;\n            }\n        }\n        // NCNN_LOGE(\"prefer_winograd %d %d %d\", prefer_winograd23, prefer_winograd43, prefer_winograd63);\n\n        int _nT = nT ? 
nT : opt.num_threads;\n        if (nT != 0 && opt.num_threads != nT)\n        {\n            // force num_threads the same as in create_pipeline\n            // so we could use pre-packed A/B from the same tile config\n            NCNN_LOGE(\"opt.num_threads %d changed, convolution winograd will use load-time value %d\", opt.num_threads, nT);\n        }\n\n        int ret = 0;\n        if (prefer_winograd23)\n        {\n            ret = conv3x3s1_winograd23(bottom_blob_bordered, top_blob, weight_winograd23_data, bias_data, _nT, opt);\n        }\n        else if (prefer_winograd43)\n        {\n            ret = conv3x3s1_winograd43(bottom_blob_bordered, top_blob, weight_winograd43_data, bias_data, _nT, opt);\n        }\n        else if (prefer_winograd63)\n        {\n            ret = conv3x3s1_winograd63(bottom_blob_bordered, top_blob, weight_winograd63_data, bias_data, _nT, opt);\n        }\n        else\n        {\n            // should never reach here\n        }\n        if (ret != 0)\n            return ret;\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n        return 0;\n    }\n\n    int l2_cache_size_fp32 = get_cpu_level2_cache_size() / sizeof(float);\n    bool prefer_sgemm = num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * 2 > l2_cache_size_fp32 || (num_input > 16 || num_output > 16);\n\n#if NCNN_GNU_INLINE_ASM\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 4 || num_output < 32))\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h 
== 2 && (num_input < 8 || num_output < 44))\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    if ((opt.use_sgemm_convolution && prefer_sgemm) || (kernel_w == 1 && kernel_h == 1))\n    {\n        int _nT = nT ? nT : opt.num_threads;\n        if (nT != 0 && opt.num_threads != nT)\n        {\n            // force num_threads the same as in create_pipeline\n            // so we could use pre-packed A/B from the same tile config\n            NCNN_LOGE(\"opt.num_threads %d changed, convolution gemm will use load-time value %d\", opt.num_threads, nT);\n        }\n\n        int ret = convolution_im2col_gemm(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, _nT, opt);\n        if (ret != 0)\n            return ret;\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n        return 0;\n    }\n\n#if NCNN_GNU_INLINE_ASM\n#if __ARM_NEON\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n 
           }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv5x5s1_pack4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv5x5s2_pack4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_pack1to4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack1to4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv7x7s2_pack1to4_neon(bottom_blob_bordered, top_blob, weight_data_tm, 
bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            convolution_packed(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#endif // __ARM_NEON\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_packed_neon(bottom_blob_bordered, top_blob, weight_3x3s2_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 4 && kernel_h == 4 && dilation_w == 1 && dilation_h == 1 && stride_w == 4 && stride_h == 4)\n        {\n            
conv4x4s4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv5x5s1_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv5x5s2_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv7x7s1_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv7x7s2_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#else  // NCNN_GNU_INLINE_ASM\n    {\n        convolution_packed(bottom_blob_bordered, top_blob, 
weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    return 0;\n}\n\nint Convolution_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n#if NCNN_ARM82\n    if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_float16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n    if (opt.use_bf16_storage && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_bfloat16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_BF16\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n#if NCNN_ARM82\n        if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && bias_data_flattened.elembits() == 16)\n      
  {\n            Mat bias_data_flattened_fp32;\n            cast_float16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n        if (opt.use_bf16_storage && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_bfloat16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_BF16\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(8, int8_scale_term);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic void convolution_transform_kernel_packed_bf16s_neon(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, int kernel_w, int kernel_h, int elempack, int out_elempack)\n{\n    const int maxk = kernel_w * 
kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)2u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            unsigned short* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = float32_to_bfloat16(k00[k]);\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n\nint Convolution_arm::create_pipeline_bf16s(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && (num_input >= 8 || num_output >= 8);\n\n    if (opt.use_winograd_convolution && prefer_winograd && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        // dynamic shape\n        if (opt.use_winograd63_convolution && (num_input <= 128 && num_output <= 128))\n            conv3x3s1_winograd63_transform_kernel(weight_data, weight_winograd63_data, num_input, num_output, opt);\n        else if (opt.use_winograd43_convolution && (num_input >= 8 && num_output >= 8))\n            conv3x3s1_winograd43_transform_kernel(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        else\n            conv3x3s1_winograd23_transform_kernel(weight_data, weight_winograd23_data, num_input, num_output, opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    int l2_cache_size_bf16 = get_cpu_level2_cache_size() / sizeof(unsigned short);\n    bool prefer_sgemm = num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * 2 > l2_cache_size_bf16 || (num_input > 16 || num_output > 16);\n\n#if NCNN_GNU_INLINE_ASM\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 4 || num_output < 32))\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 8 || num_output < 44))\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if 
(elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    if ((opt.use_sgemm_convolution && prefer_sgemm) || (kernel_w == 1 && kernel_h == 1))\n    {\n        convolution_im2col_gemm_transform_kernel_bf16s(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h, opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n#if NCNN_GNU_INLINE_ASM\n    if ((elempack == 4 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 4 && out_elempack == 4 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (elempack == 4 && out_elempack == 4 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 1 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (elempack == 1 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 1 && out_elempack == 4 && kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2))\n    {\n        convolution_transform_kernel_packed_bf16s_neon(weight_data, 
weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n    }\n    else\n#endif // NCNN_GNU_INLINE_ASM\n    {\n        convolution_transform_kernel_packed_bf16s(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h);\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Convolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // TODO dilated conv for bf16s\n    //     if ((!support_packing || !opt.use_packing_layout) && kernel_w == kernel_h && dilation_w != 1 && dilation_h == dilation_w && stride_w == 1 && stride_h == 1)\n    //     {\n    //         return forwardDilation_arm(bottom_blob_bordered, top_blob, opt);\n    //     }\n\n    const int num_input = channels * elempack;\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && (num_input >= 8 || num_output >= 8);\n\n    if (opt.use_winograd_convolution && prefer_winograd && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        bool prefer_winograd63 = false;\n        bool prefer_winograd23 = false;\n        bool prefer_winograd43 = !prefer_winograd63 && !prefer_winograd23;\n\n        if (prefer_winograd23 && (!opt.use_winograd23_convolution || weight_winograd23_data.empty()))\n        {\n            // f23 fallback to f43\n            prefer_winograd23 = false;\n            prefer_winograd43 = true;\n        }\n\n        if (prefer_winograd63 && (!opt.use_winograd63_convolution || weight_winograd63_data.empty()))\n        {\n            // f63 fallback to f43\n            prefer_winograd63 = false;\n            prefer_winograd43 = true;\n        }\n\n        if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))\n        {\n            // f43 fallback to f63 or f23\n            prefer_winograd43 = false;\n            if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())\n            {\n                prefer_winograd63 = true;\n            }\n            else\n            {\n              
  prefer_winograd23 = true;\n            }\n        }\n        // NCNN_LOGE(\"prefer_winograd %d %d %d\", prefer_winograd23, prefer_winograd43, prefer_winograd63);\n\n        int _nT = nT ? nT : opt.num_threads;\n        if (nT != 0 && opt.num_threads != nT)\n        {\n            // force num_threads the same as in create_pipeline\n            // so we could use pre-packed A/B from the same tile config\n            NCNN_LOGE(\"opt.num_threads %d changed, convolution winograd will use load-time value %d\", opt.num_threads, nT);\n        }\n\n        int ret = 0;\n        if (prefer_winograd23)\n        {\n            ret = conv3x3s1_winograd23_bf16s(bottom_blob_bordered, top_blob, weight_winograd23_data, bias_data, _nT, opt);\n        }\n        else if (prefer_winograd43)\n        {\n            ret = conv3x3s1_winograd43_bf16s(bottom_blob_bordered, top_blob, weight_winograd43_data, bias_data, _nT, opt);\n        }\n        else if (prefer_winograd63)\n        {\n            ret = conv3x3s1_winograd63_bf16s(bottom_blob_bordered, top_blob, weight_winograd63_data, bias_data, _nT, opt);\n        }\n        else\n        {\n            // should never reach here\n        }\n        if (ret != 0)\n            return ret;\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n        return 0;\n    }\n\n    int l2_cache_size_bf16 = get_cpu_level2_cache_size() / sizeof(unsigned short);\n    bool prefer_sgemm = num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * 2 > l2_cache_size_bf16 || (num_input > 16 || num_output > 16);\n\n#if NCNN_GNU_INLINE_ASM\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 4 || num_output < 32))\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && 
dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 8 || num_output < 44))\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    if ((opt.use_sgemm_convolution && prefer_sgemm) || (kernel_w == 1 && kernel_h == 1))\n    {\n        int _nT = nT ? 
nT : opt.num_threads;\n        if (nT != 0 && opt.num_threads != nT)\n        {\n            // force num_threads the same as in create_pipeline\n            // so we could use pre-packed A/B from the same tile config\n            NCNN_LOGE(\"opt.num_threads %d changed, convolution gemm will use load-time value %d\", opt.num_threads, nT);\n        }\n\n        int ret = convolution_im2col_gemm_bf16s(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, _nT, opt);\n        if (ret != 0)\n            return ret;\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n        return 0;\n    }\n\n#if NCNN_GNU_INLINE_ASM\n#if __ARM_NEON\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv5x5s1_pack4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv5x5s2_pack4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            
convolution_packed_bf16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_pack1to4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack1to4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv7x7s2_pack1to4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed_bf16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            convolution_packed_bf16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#endif // __ARM_NEON\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        {\n 
           convolution_packed_bf16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#else  // NCNN_GNU_INLINE_ASM\n    {\n        convolution_packed_bf16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n#if NCNN_INT8\nint Convolution_arm::create_pipeline_int8_arm(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution) && (num_input >= 8 && num_output >= 8) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1;\n#if NCNN_ARM82DOT\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        prefer_winograd = false;\n    }\n#endif\n\n    if (opt.use_winograd_convolution && prefer_winograd)\n    {\n        if (opt.use_winograd43_convolution)\n            conv3x3s1_winograd43_transform_kernel_int8(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        else\n            conv3x3s1_winograd23_transform_kernel_int8(weight_data, weight_winograd23_data, num_input, num_output, opt);\n    }\n    else if (opt.use_sgemm_convolution)\n    {\n        convolution_im2col_gemm_transform_kernel_int8(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h, opt);\n    }\n    else\n    {\n        convolution_transform_kernel_packed_int8(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h);\n    }\n\n    scale_in_data.create(num_output);\n    for (int p = 0; p < num_output; p++)\n    {\n        // requantize and relu\n        float scale_in;\n        if (weight_data_int8_scales[p] == 0)\n  
          scale_in = 0;\n        else\n            scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n        scale_in_data[p] = scale_in;\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_arm::forward_int8_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elembits != 8)\n    {\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_q);\n        if (bottom_blob_int8.empty())\n            return -100;\n    }\n\n    //     NCNN_LOGE(\"Convolution_arm input %d x %d  ksize=%d %d  stride=%d %d\", w, h, kernel_w, kernel_h, stride_w, stride_h);\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob_int8, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    int w = bottom_blob_bordered.w;\n    int h = bottom_blob_bordered.h;\n    int elempack = bottom_blob_bordered.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    bool use_int8_requantize = int8_scale_term > 100;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        if (use_int8_requantize)\n            out_elempack = num_output % 8 == 0 ? 8 : 1;\n        else\n            out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __ARM_NEON\n    size_t out_elemsize = use_int8_requantize ? 1u * out_elempack : 4u * out_elempack;\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        out_elemsize = use_int8_requantize ? 
1u * out_elempack : 2u * out_elempack;\n    }\n#endif\n    if (opt.use_bf16_storage)\n        out_elemsize = use_int8_requantize ? 1u * out_elempack : 2u * out_elempack;\n\n    //     NCNN_LOGE(\"forward_int8_arm %d %d %d    %d %d\", w, h, bottom_blob_bordered.c, elempack, out_elempack);\n\n    int channels = bottom_blob_bordered.c;\n    const int num_input = channels * elempack;\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution) && (num_input >= 8 && num_output >= 8) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1;\n#if NCNN_ARM82DOT\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        prefer_winograd = false;\n    }\n#endif\n\n    int out_elempack_int32 = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        if (use_int8_requantize)\n        {\n            out_elempack_int32 = num_output % 8 == 0 ? 8 : 1;\n        }\n        else\n        {\n#if NCNN_ARM82\n            if (ncnn::cpu_support_arm_asimdhp() && opt.use_fp16_storage && opt.use_fp16_arithmetic)\n            {\n                out_elempack_int32 = num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n            }\n            else\n#endif // NCNN_ARM82\n            {\n                out_elempack_int32 = num_output % 4 == 0 ? 4 : 1;\n            }\n        }\n    }\n#endif // __ARM_NEON\n\n    Mat top_blob_int32;\n    top_blob_int32.create(outw, outh, num_output / out_elempack_int32, (size_t)(4u * out_elempack_int32), out_elempack_int32, opt.workspace_allocator);\n    if (top_blob_int32.empty())\n        return -100;\n\n    int _nT = nT ? 
nT : opt.num_threads;\n    if (nT != 0 && opt.num_threads != nT)\n    {\n        // force num_threads the same as in create_pipeline\n        // so we could use pre-packed A/B from the same tile config\n        NCNN_LOGE(\"opt.num_threads %d changed, convolution gemm will use load-time value %d\", opt.num_threads, nT);\n    }\n\n    int ret = 0;\n    if (opt.use_winograd_convolution && prefer_winograd)\n    {\n        if (opt.use_winograd43_convolution && !weight_winograd43_data.empty())\n            ret = conv3x3s1_winograd43_int8(bottom_blob_bordered, top_blob_int32, weight_winograd43_data, _nT, opt);\n        else\n            ret = conv3x3s1_winograd23_int8(bottom_blob_bordered, top_blob_int32, weight_winograd23_data, _nT, opt);\n    }\n    else if (opt.use_sgemm_convolution)\n    {\n        ret = convolution_im2col_gemm_int8(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, _nT, opt);\n    }\n    else\n    {\n        convolution_packed_int8(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n    }\n    if (ret != 0)\n        return ret;\n\n    bottom_blob_bordered.release();\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (use_int8_requantize)\n    {\n        requantize_from_int32_to_int8(top_blob_int32, top_blob, scale_in_data, top_blob_int8_scales, bias_data, activation_type, activation_params, opt);\n    }\n    else\n    {\n        dequantize_from_int32(top_blob_int32, top_blob, scale_in_data, bias_data, opt);\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\nint Convolution_arm::forwardDilation_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = 
bottom_blob.w;\n    int h = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_size = kernel_w;\n    const int stride = stride_w;\n    const int dilation = dilation_w;\n    const int kernel_extent = dilation * (kernel_size - 1) + 1;\n\n    int outw = (w - kernel_extent) / stride + 1;\n    int outh = (h - kernel_extent) / stride + 1;\n\n    top_blob.create(outw, outh, num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Make (dilation * dilation) batches\n    Mat inner_bottom_blob;\n    Mat inner_top_blob;\n    for (int x = 0; x < dilation; x++)\n    {\n        for (int y = 0; y < dilation; y++)\n        {\n            int inner_w = (w - y + dilation - 1) / dilation;\n            int inner_h = (h - x + dilation - 1) / dilation;\n\n            int inner_outw = (inner_w - kernel_size) / stride + 1;\n            int inner_outh = (inner_h - kernel_size) / stride + 1;\n\n            inner_bottom_blob.create(inner_w, inner_h, bottom_blob.c, elemsize, opt.workspace_allocator);\n            if (inner_bottom_blob.empty())\n                return -100;\n\n            inner_top_blob.create(inner_outw, inner_outh, num_output, elemsize, opt.workspace_allocator);\n            if (inner_top_blob.empty())\n                return -100;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int c = 0; c < bottom_blob.c; c++)\n            {\n                float* outptr = inner_bottom_blob.channel(c);\n\n                for (int i = 0; i < inner_h; i++)\n                {\n                    const float* ptr = (const float*)bottom_blob.channel(c) + dilation * i * w + x * w + y;\n                    for (int j = 0; j < inner_w; j++)\n                    {\n                        outptr[j] = ptr[j * dilation];\n                    }\n                    outptr += inner_w;\n                }\n            }\n\n            Option opt_g = opt;\n            
opt_g.blob_allocator = inner_top_blob.allocator;\n            convolution_dilation1->forward(inner_bottom_blob, inner_top_blob, opt_g);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int c = 0; c < num_output; c++)\n            {\n                float* outptr = (float*)top_blob.channel(c) + x * outw + y;\n                for (int i = 0; i < inner_outh; i++)\n                {\n                    const float* ptr = (const float*)inner_top_blob.channel(c) + i * inner_outw;\n                    for (int j = 0; j < inner_outw; j++)\n                    {\n                        outptr[j * dilation] = ptr[j];\n                    }\n                    outptr += dilation * outw;\n                }\n            }\n        }\n    }\n\n    if (activation)\n    {\n        activation->forward_inplace(top_blob, opt);\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/convolution_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION_ARM_H\n#define LAYER_CONVOLUTION_ARM_H\n\n#include \"convolution.h\"\n\nnamespace ncnn {\n\nclass Convolution_arm : public Convolution\n{\npublic:\n    Convolution_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8_arm(const Option& opt);\n    int forward_int8_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n    int forwardDilation_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    Layer* activation;\n\n    // thread count fixed in create_pipeline; 0 means forward uses opt.num_threads\n    int nT;\n\n    Mat weight_data_tm;\n    Mat weight_3x3s2_data;\n\n    Mat weight_sgemm_data;\n    Mat weight_winograd23_data;\n    Mat weight_winograd43_data;\n    Mat weight_winograd63_data;\n\n    // forwardDilation\n    Layer* convolution_dilation1;\n\n    // fp16\n    Mat bias_data_fp16;\n\n#if NCNN_INT8\n    Mat scale_in_data;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION_ARM_H\n"
  },
  {
    "path": "src/layer/arm/convolution_arm_asimddp.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n\nnamespace ncnn {\n\n// the kernel headers are included inside namespace ncnn, so their code is compiled as part of this translation unit\n#include \"convolution_packed_int8.h\"\n#include \"convolution_im2col_gemm_int8.h\"\n\n// packed: _asimddp entry points that forward to the kernels from convolution_packed_int8.h\nvoid convolution_transform_kernel_packed_int8_asimddp(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    convolution_transform_kernel_packed_int8(kernel, kernel_tm, inch, outch, kernel_w, kernel_h);\n}\n\nvoid convolution_packed_int8_asimddp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    convolution_packed_int8(bottom_blob, top_blob, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n}\n\n// gemm: _asimddp entry points that forward to the kernels from convolution_im2col_gemm_int8.h\nvoid convolution_im2col_gemm_transform_kernel_int8_asimddp(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt)\n{\n    convolution_im2col_gemm_transform_kernel_int8(kernel, AT, inch, outch, kernel_w, kernel_h, opt);\n}\n\nint convolution_im2col_gemm_int8_asimddp(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt)\n{\n    return convolution_im2col_gemm_int8(bottom_blob, top_blob, AT, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, nT, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/convolution_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution_arm.h\"\n\n#include \"cpu.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"convolution_packed_fp16s.h\"\n\n#include \"convolution_3x3_winograd_fp16s.h\"\n\n#include \"convolution_im2col_gemm_bf16s_fp16s.h\"\n#include \"convolution_im2col_gemm_fp16s.h\"\n\n#if NCNN_GNU_INLINE_ASM\n#include \"convolution_3x3_pack4_fp16s.h\"\n#include \"convolution_3x3_pack1to8_fp16s.h\"\n#include \"convolution_3x3_pack1to4_fp16s.h\"\n#include \"convolution_3x3_pack8_fp16s.h\"\n#include \"convolution_5x5_pack8_fp16s.h\"\n#include \"convolution_7x7_pack1to8_fp16s.h\"\n#endif // NCNN_GNU_INLINE_ASM\n#endif\n#endif // __ARM_NEON\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic void convolution_transform_kernel_packed_fp16s_neon(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, int kernel_w, int kernel_h, int elempack, int out_elempack)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)2u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            __fp16* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n        
                    const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = (__fp16)k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n\nint Convolution_arm::create_pipeline_fp16s(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n\n    if (opt.use_packing_layout)\n    {\n        elempack = opt.use_fp16_arithmetic && num_input % 8 == 0 ? 8 : num_input % 4 == 0 ? 4 : 1;\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n    }\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && (num_input >= 16 || num_output >= 16);\n\n    if (opt.use_fp16_arithmetic && opt.use_winograd_convolution && prefer_winograd && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        // dynamic shape\n        if (opt.use_winograd63_convolution && (num_input <= 128 && num_output <= 128))\n            conv3x3s1_winograd63_transform_kernel_fp16sa(weight_data, weight_winograd63_data, num_input, num_output, opt);\n        else if (opt.use_winograd43_convolution && (num_input >= 16 && num_output >= 16))\n            conv3x3s1_winograd43_transform_kernel_fp16sa(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        else\n            conv3x3s1_winograd23_transform_kernel_fp16sa(weight_data, weight_winograd23_data, num_input, num_output, opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        if (opt.use_fp16_arithmetic)\n        {\n            ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n        }\n\n        return 0;\n    }\n\n    int l2_cache_size_fp16 = 
get_cpu_level2_cache_size() / sizeof(unsigned short);\n    bool prefer_sgemm = num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * 2 > l2_cache_size_fp16 || (num_input > 16 || num_output > 16);\n\n#if NCNN_GNU_INLINE_ASM\n    if (elempack == 8 && out_elempack == 8)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 64 || num_output < 128))\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 16 || num_output < 88))\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 8)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && 
kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    if (opt.use_fp16_arithmetic && ((opt.use_sgemm_convolution && prefer_sgemm) || (kernel_w == 1 && kernel_h == 1)))\n    {\n        convolution_im2col_gemm_transform_kernel_fp16sa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h, opt);\n\n        ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n#if NCNN_GNU_INLINE_ASM\n    if ((elempack == 8 && out_elempack == 8 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (elempack == 8 && out_elempack == 8 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 8 && out_elempack == 8 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (elempack == 8 && out_elempack == 8 && kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 1 && out_elempack == 8 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (elempack == 1 && out_elempack == 8 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (elempack == 1 && out_elempack == 8 
&& kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            || (opt.use_fp16_arithmetic && elempack == 4 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (opt.use_fp16_arithmetic && elempack == 1 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            || (opt.use_fp16_arithmetic && elempack == 1 && out_elempack == 4 && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2))\n    {\n        convolution_transform_kernel_packed_fp16s_neon(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n    }\n    else\n#endif // NCNN_GNU_INLINE_ASM\n    {\n        convolution_transform_kernel_packed_fp16s(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h);\n    }\n\n    if (opt.use_fp16_arithmetic)\n    {\n        ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    // NCNN_LOGE(\"Convolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - 
kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = (opt.use_packing_layout && num_output % 4 == 0) ? 4 : 1;\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // TODO dilated conv for bf16s\n    //     if ((!support_packing || !opt.use_packing_layout) && kernel_w == kernel_h && dilation_w != 1 && dilation_h == dilation_w && stride_w == 1 && stride_h == 1)\n    //     {\n    //         return forwardDilation_arm(bottom_blob_bordered, top_blob, opt);\n    //     }\n\n    if (elempack == 4 && out_elempack == 4)\n    {\n        convolution_packed_fp16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        convolution_packed_fp16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        convolution_packed_fp16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n    }\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        convolution_packed_fp16s(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n    }\n\n    return 0;\n}\n\nint Convolution_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = 
bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    // NCNN_LOGE(\"Convolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // TODO dilated conv for bf16s\n    //     if ((!support_packing || !opt.use_packing_layout) && kernel_w == kernel_h && dilation_w != 1 && dilation_h == dilation_w && stride_w == 1 && stride_h == 1)\n    //     {\n    //         return forwardDilation_arm(bottom_blob_bordered, top_blob, opt);\n    //     }\n\n    const int num_input = channels * elempack;\n\n    bool prefer_winograd = (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && (num_input >= 16 || num_output >= 16);\n\n    if (opt.use_winograd_convolution && prefer_winograd && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        bool prefer_winograd63 = false;\n        bool prefer_winograd23 = false;\n        bool prefer_winograd43 = !prefer_winograd63 && !prefer_winograd23;\n\n        if (prefer_winograd23 && (!opt.use_winograd23_convolution || 
weight_winograd23_data.empty()))\n        {\n            // f23 fallback to f43\n            prefer_winograd23 = false;\n            prefer_winograd43 = true;\n        }\n\n        if (prefer_winograd63 && (!opt.use_winograd63_convolution || weight_winograd63_data.empty()))\n        {\n            // f63 fallback to f43\n            prefer_winograd63 = false;\n            prefer_winograd43 = true;\n        }\n\n        if (prefer_winograd43 && (!opt.use_winograd43_convolution || weight_winograd43_data.empty()))\n        {\n            // f43 fallback to f63 or f23\n            prefer_winograd43 = false;\n            if (opt.use_winograd63_convolution && !weight_winograd63_data.empty())\n            {\n                prefer_winograd63 = true;\n            }\n            else\n            {\n                prefer_winograd23 = true;\n            }\n        }\n        // NCNN_LOGE(\"prefer_winograd %d %d %d\", prefer_winograd23, prefer_winograd43, prefer_winograd63);\n\n        int _nT = nT ? 
nT : opt.num_threads;\n        if (nT != 0 && opt.num_threads != nT)\n        {\n            // force num_threads the same as in create_pipeline\n            // so we could use pre-packed A/B from the same tile config\n            NCNN_LOGE(\"opt.num_threads %d changed, convolution winograd will use load-time value %d\", opt.num_threads, nT);\n        }\n\n        int ret = 0;\n        if (prefer_winograd23)\n        {\n            ret = conv3x3s1_winograd23_fp16sa(bottom_blob_bordered, top_blob, weight_winograd23_data, bias_data_fp16, _nT, opt);\n        }\n        else if (prefer_winograd43)\n        {\n            ret = conv3x3s1_winograd43_fp16sa(bottom_blob_bordered, top_blob, weight_winograd43_data, bias_data_fp16, _nT, opt);\n        }\n        else if (prefer_winograd63)\n        {\n            ret = conv3x3s1_winograd63_fp16sa(bottom_blob_bordered, top_blob, weight_winograd63_data, bias_data_fp16, _nT, opt);\n        }\n        else\n        {\n            // should never reach here\n        }\n        if (ret != 0)\n            return ret;\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n        return 0;\n    }\n\n    int l2_cache_size_fp16 = get_cpu_level2_cache_size() / sizeof(unsigned short);\n    bool prefer_sgemm = num_input * num_output * kernel_w * kernel_h * dilation_w * dilation_h * stride_w * stride_h * 2 > l2_cache_size_fp16 || (num_input > 16 || num_output > 16);\n\n#if NCNN_GNU_INLINE_ASM\n    if (elempack == 8 && out_elempack == 8)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 64 || num_output < 128))\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && 
dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (num_input < 16 || num_output < 88))\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 8)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            prefer_sgemm = false;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    if ((opt.use_sgemm_convolution && prefer_sgemm) || (kernel_w == 1 && kernel_h == 1))\n    {\n        int _nT = nT ? 
nT : opt.num_threads;\n        if (nT != 0 && opt.num_threads != nT)\n        {\n            // force num_threads the same as in create_pipeline\n            // so we could use pre-packed A/B from the same tile config\n            NCNN_LOGE(\"opt.num_threads %d changed, convolution gemm will use load-time value %d\", opt.num_threads, nT);\n        }\n\n        int ret = convolution_im2col_gemm_fp16sa(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, _nT, opt);\n        if (ret != 0)\n            return ret;\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n        return 0;\n    }\n\n#if NCNN_GNU_INLINE_ASM\n    if (elempack == 8 && out_elempack == 8)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv5x5s1_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 
&& dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv5x5s2_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 8)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_pack1to8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack1to8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv7x7s2_pack1to8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    
}\n\n    if (elempack == 4 && out_elempack == 8)\n    {\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 8 && out_elempack == 1)\n    {\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 8 && out_elempack == 4)\n    {\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_pack4_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_pack1to4_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else 
if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack1to4_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        {\n            convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#else  // NCNN_GNU_INLINE_ASM\n    {\n        convolution_packed_fp16sa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n    }\n#endif // NCNN_GNU_INLINE_ASM\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/convolution_arm_i8mm.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n\nnamespace ncnn {\n\n#include \"convolution_packed_int8.h\"\n#include \"convolution_im2col_gemm_int8.h\"\n\n// packed\nvoid convolution_transform_kernel_packed_int8_i8mm(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    convolution_transform_kernel_packed_int8(kernel, kernel_tm, inch, outch, kernel_w, kernel_h);\n}\n\nvoid convolution_packed_int8_i8mm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    convolution_packed_int8(bottom_blob, top_blob, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n}\n\n// gemm\nvoid convolution_im2col_gemm_transform_kernel_int8_i8mm(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt)\n{\n    convolution_im2col_gemm_transform_kernel_int8(kernel, AT, inch, outch, kernel_w, kernel_h, opt);\n}\n\nint convolution_im2col_gemm_int8_i8mm(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt)\n{\n    return convolution_im2col_gemm_int8(bottom_blob, top_blob, AT, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, nT, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/convolution_im2col_gemm.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_im2col_pack_A_tile(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    // A = (pa, maxk, inch/pa), outch\n    const int A_hstep = A.w;\n\n    float* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n        const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n        const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n        const float* p4 = (const float*)A + (i + ii + 4) * A_hstep + k;\n        const float* p5 = (const float*)A + (i + ii + 5) * A_hstep + k;\n        const float* p6 = (const float*)A + (i + ii + 6) * A_hstep + k;\n        const float* p7 = (const float*)A + (i + ii + 7) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            float32x4_t _r0l = vld1q_f32(p0);\n            float32x4_t _r0h = vld1q_f32(p0 + 4);\n            float32x4_t _r1l = vld1q_f32(p1);\n            float32x4_t _r1h = vld1q_f32(p1 + 4);\n            float32x4_t _r2l = vld1q_f32(p2);\n            float32x4_t _r2h = vld1q_f32(p2 + 4);\n            float32x4_t _r3l = vld1q_f32(p3);\n            float32x4_t _r3h = vld1q_f32(p3 + 4);\n            float32x4_t _r4l = vld1q_f32(p4);\n            float32x4_t _r4h = vld1q_f32(p4 + 4);\n            float32x4_t _r5l = vld1q_f32(p5);\n            float32x4_t _r5h = vld1q_f32(p5 + 4);\n            float32x4_t _r6l = vld1q_f32(p6);\n            float32x4_t _r6h = vld1q_f32(p6 + 4);\n            float32x4_t _r7l = vld1q_f32(p7);\n            float32x4_t _r7h = vld1q_f32(p7 + 4);\n            transpose8x8_ps(_r0l, _r0h, _r1l, _r1h, _r2l, _r2h, _r3l, _r3h, _r4l, _r4h, _r5l, _r5h, _r6l, _r6h, _r7l, _r7h);\n            vst1q_f32(pp, _r0l);\n            
vst1q_f32(pp + 4, _r0h);\n            vst1q_f32(pp + 8, _r1l);\n            vst1q_f32(pp + 12, _r1h);\n            vst1q_f32(pp + 8 * 2, _r2l);\n            vst1q_f32(pp + 8 * 2 + 4, _r2h);\n            vst1q_f32(pp + 8 * 3, _r3l);\n            vst1q_f32(pp + 8 * 3 + 4, _r3h);\n            vst1q_f32(pp + 8 * 4, _r4l);\n            vst1q_f32(pp + 8 * 4 + 4, _r4h);\n            vst1q_f32(pp + 8 * 5, _r5l);\n            vst1q_f32(pp + 8 * 5 + 4, _r5h);\n            vst1q_f32(pp + 8 * 6, _r6l);\n            vst1q_f32(pp + 8 * 6 + 4, _r6h);\n            vst1q_f32(pp + 8 * 7, _r7l);\n            vst1q_f32(pp + 8 * 7 + 4, _r7h);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp[4] = p4[0];\n            pp[5] = p5[0];\n            pp[6] = p6[0];\n            pp[7] = p7[0];\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n        const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n        const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            float32x4x4_t _r0123;\n            _r0123.val[0] = vld1q_f32(p0);\n            _r0123.val[1] = vld1q_f32(p1);\n            _r0123.val[2] = vld1q_f32(p2);\n            _r0123.val[3] = vld1q_f32(p3);\n            vst4q_f32(pp, _r0123);\n            pp += 
16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            float32x4x2_t _r01;\n            _r01.val[0] = vld1q_f32(p0);\n            _r01.val[1] = vld1q_f32(p1);\n            vst2q_f32(pp, _r01);\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            vst1q_f32(pp, vld1q_f32(p0));\n            pp += 4;\n            p0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void convolution_gemm_transB_packed_tile(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end, int use_a53_a55_optimized_kernel)\n{\n    // NCNN_LOGE(\"convolution_gemm_transB_packed_tile %d %d %d %d %d %d\", i, max_ii, j, max_jj, k, max_kk);\n\n    const int out_elempack = top_blob.elempack;\n    const size_t 
out_hstep = top_blob.cstep;\n\n    const float* pAT = AT_tile;\n    const float* pBT = BT_tile;\n    const float* pC = CT_tile;\n\n    float* outptr = topT_tile;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            if (use_a53_a55_optimized_kernel && cpu_support_arm_asimdhp())\n            {\n                // a55\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #320                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v8.4s}, [%8]               \\n\"\n                    \"ld1    {v20.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v8.16b, v8.16b, v8.16b    
  \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v9.16b, v8.16b              \\n\"\n                    \"mov    v10.16b, v8.16b             \\n\"\n                    \"mov    v11.16b, v8.16b             \\n\"\n                    \"mov    v12.16b, v8.16b             \\n\"\n                    \"mov    v13.16b, v8.16b             \\n\"\n                    \"mov    v14.16b, v8.16b             \\n\"\n                    \"mov    v15.16b, v8.16b             \\n\"\n                    \"mov    v16.16b, v8.16b             \\n\"\n                    \"mov    v17.16b, v8.16b             \\n\"\n                    \"mov    v18.16b, v8.16b             \\n\"\n                    \"mov    v19.16b, v8.16b             \\n\"\n\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            \\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n                    \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4s}, [%1], #16          \\n\"\n                    \"prfm   pldl1keep, 
[%2, #512]       \\n\"\n                    \"ld1    {v0.4s}, [%2], #16          \\n\"\n\n                    \"ldr    d5, [%1], #8                \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"ins    v5.d[1], x25                \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"ldr    x22, [%2], #8               \\n\"\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"ldr    d6, [%1], #8                \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"ldr    x26, [%1], #8               \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"ldr    x23, [%2], #8               \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"ldr    d7, [%1], #8                \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"ldr    x27, [%1], #8               \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   
v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"ins    v6.d[1], x26                \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"ldr    d4, [%1], #8                \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"ldr    x24, [%1], #8               \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n                    \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v9.4s, v6.4s, v3.s[1]       \\n\"\n                    \"ins    v7.d[1], x27                \\n\"\n                    \"fmla   v10.4s, v6.4s, v3.s[2]      \\n\"\n                    \"ldr    x22, [%2], #8               \\n\"\n                    \"fmla   v11.4s, v6.4s, v3.s[3]      \\n\"\n                    \"fmla   v20.4s, v7.4s, v3.s[0]      \\n\"\n                    \"ldr    d5, [%1], #8                \\n\"\n                    \"fmla   
v21.4s, v7.4s, v3.s[1]      \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v22.4s, v7.4s, v3.s[2]      \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n                    \"fmla   v23.4s, v7.4s, v3.s[3]      \\n\"\n                    \"fmla   v12.4s, v6.4s, v0.s[0]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v13.4s, v6.4s, v0.s[1]      \\n\"\n                    \"ldr    x23, [%2], #8               \\n\"\n                    \"fmla   v14.4s, v6.4s, v0.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v7.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v0.s[1]      \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v26.4s, v7.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v0.s[3]      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v16.4s, v6.4s, v1.s[0]      \\n\"\n                    \"fmla   v17.4s, v6.4s, v1.s[1]      \\n\"\n                    \"ins    v4.d[1], x24                \\n\"\n                    \"fmla   v18.4s, v6.4s, v1.s[2]      \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v19.4s, v6.4s, v1.s[3]      \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n                    \"ldr    d6, [%1], #8                \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n                    \"ldr    x26, [%1], #8               \\n\"\n                    \"fmla   
v8.4s, v4.4s, v2.s[0]       \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v9.4s, v4.4s, v2.s[1]       \\n\"\n                    \"ins    v5.d[1], x25                \\n\"\n                    \"fmla   v10.4s, v4.4s, v2.s[2]      \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v11.4s, v4.4s, v2.s[3]      \\n\"\n                    \"ldr    d7, [%1], #8                \\n\"\n                    \"fmla   v20.4s, v5.4s, v2.s[0]      \\n\"\n                    \"ldr    x27, [%1], #8               \\n\"\n                    \"fmla   v21.4s, v5.4s, v2.s[1]      \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v22.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v2.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v3.s[0]      \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v13.4s, v4.4s, v3.s[1]      \\n\"\n                    \"ldr    x22, [%2], #8               \\n\"\n                    \"fmla   v14.4s, v4.4s, v3.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v3.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v3.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v3.s[1]      \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v26.4s, v5.4s, v3.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v3.s[3]      \\n\"\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"ldr    x23, [%2], #8               \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"ins    v6.d[1], x26                \\n\"\n                    \"fmla   v19.4s, v4.4s, 
v0.s[3]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"ldr    d4, [%1], #8                \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"ldr    x24, [%1], #8               \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v9.4s, v6.4s, v1.s[1]       \\n\"\n                    \"ins    v7.d[1], x27                \\n\"\n                    \"fmla   v10.4s, v6.4s, v1.s[2]      \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v11.4s, v6.4s, v1.s[3]      \\n\"\n                    \"ldr    d5, [%1], #8                \\n\"\n                    \"fmla   v20.4s, v7.4s, v1.s[0]      \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n                    \"fmla   v21.4s, v7.4s, v1.s[1]      \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v22.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v7.4s, v1.s[3]      \\n\"\n                    \"fmla   v12.4s, v6.4s, v2.s[0]      \\n\"\n                    \"fmla   v13.4s, v6.4s, v2.s[1]      \\n\"\n                    \"fmla   v14.4s, v6.4s, v2.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v2.s[3]      \\n\"\n                    \"fmla   v24.4s, v7.4s, v2.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v2.s[1]      \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla  
 v26.4s, v7.4s, v2.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v2.s[3]      \\n\"\n                    \"fmla   v16.4s, v6.4s, v3.s[0]      \\n\"\n                    \"fmla   v17.4s, v6.4s, v3.s[1]      \\n\"\n                    \"fmla   v18.4s, v6.4s, v3.s[2]      \\n\"\n                    \"ins    v4.d[1], x24                \\n\"\n                    \"fmla   v19.4s, v6.4s, v3.s[3]      \\n\"\n                    \"fmla   v28.4s, v7.4s, v3.s[0]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v29.4s, v7.4s, v3.s[1]      \\n\"\n                    \"fmla   v30.4s, v7.4s, v3.s[2]      \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #16                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n   
                 \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 2          \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%3], #64 \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%3], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [x4], #64 \\n\"\n                    \"st1   
 {v24.4s, v25.4s, v26.4s, v27.4s}, [x4], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x12\n                    \"zip1   v6.4s, v8.4s, v9.4s         \\n\"\n                    \"zip2   v7.4s, v8.4s, v9.4s         \\n\"\n                    \"zip1   v8.4s, v10.4s, v11.4s       \\n\"\n                    \"zip2   v9.4s, v10.4s, v11.4s       \\n\"\n                    \"zip1   v10.4s, v12.4s, v13.4s      \\n\"\n                    \"zip2   v11.4s, v12.4s, v13.4s      \\n\"\n                    \"zip1   v12.4s, v14.4s, v15.4s      \\n\"\n                    \"zip2   v13.4s, v14.4s, v15.4s      \\n\"\n                    \"zip1   v14.4s, v16.4s, v17.4s      \\n\"\n                    \"zip2   v15.4s, v16.4s, v17.4s      \\n\"\n                    \"zip1   v16.4s, v18.4s, v19.4s      \\n\"\n                    \"zip2   v17.4s, v18.4s, v19.4s      \\n\"\n\n                    \"zip1   v18.4s, v20.4s, v21.4s      \\n\"\n                    \"zip2   v19.4s, v20.4s, v21.4s      \\n\"\n                    \"zip1   v20.4s, v22.4s, v23.4s      \\n\"\n                    \"zip2   v21.4s, v22.4s, v23.4s      \\n\"\n                    \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                    \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                    \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                    \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                    \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                    \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                    \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                    \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                    \"zip1   v0.2d, v6.2d, v8.2d         \\n\"\n                    \"zip2   v3.2d, v6.2d, v8.2d        
 \\n\"\n                    \"zip1   v1.2d, v10.2d, v12.2d       \\n\"\n                    \"zip2   v4.2d, v10.2d, v12.2d       \\n\"\n                    \"zip1   v2.2d, v14.2d, v16.2d       \\n\"\n                    \"zip2   v5.2d, v14.2d, v16.2d       \\n\"\n\n                    \"zip1   v6.2d, v7.2d, v9.2d         \\n\"\n                    \"zip2   v9.2d, v7.2d, v9.2d         \\n\"\n                    \"zip1   v7.2d, v11.2d, v13.2d       \\n\"\n                    \"zip2   v10.2d, v11.2d, v13.2d      \\n\"\n                    \"zip1   v8.2d, v15.2d, v17.2d       \\n\"\n                    \"zip2   v11.2d, v15.2d, v17.2d      \\n\"\n\n                    \"zip1   v12.2d, v18.2d, v20.2d      \\n\"\n                    \"zip2   v15.2d, v18.2d, v20.2d      \\n\"\n                    \"zip1   v13.2d, v22.2d, v24.2d      \\n\"\n                    \"zip2   v16.2d, v22.2d, v24.2d      \\n\"\n                    \"zip1   v14.2d, v26.2d, v28.2d      \\n\"\n                    \"zip2   v17.2d, v26.2d, v28.2d      \\n\"\n\n                    \"zip1   v18.2d, v19.2d, v21.2d      \\n\"\n                    \"zip2   v21.2d, v19.2d, v21.2d      \\n\"\n                    \"zip1   v19.2d, v23.2d, v25.2d      \\n\"\n                    \"zip2   v22.2d, v23.2d, v25.2d      \\n\"\n                    \"zip1   v20.2d, v27.2d, v29.2d      \\n\"\n                    \"zip2   v23.2d, v27.2d, v29.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 2        \\n\"\n                    \"st1    {v0.4s, v1.4s, v2.4s}, [%3], #48 \\n\"\n                    \"st1    {v3.4s, v4.4s, v5.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v6.4s, v7.4s, v8.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v9.4s, v10.4s, v11.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s}, 
[x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v15.4s, v16.4s, v17.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v18.4s, v19.4s, v20.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v21.4s, v22.4s, v23.4s}, [x4] \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #384                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", 
\"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else if (use_a53_a55_optimized_kernel && !cpu_support_arm_asimdhp())\n            {\n                // a53\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #320                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v8.4s}, [%8]               \\n\"\n                    \"ld1    {v20.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v9.16b, v8.16b              \\n\"\n                    \"mov    v10.16b, v8.16b             \\n\"\n                    \"mov    v11.16b, v8.16b             \\n\"\n                    \"mov    
v12.16b, v8.16b             \\n\"\n                    \"mov    v13.16b, v8.16b             \\n\"\n                    \"mov    v14.16b, v8.16b             \\n\"\n                    \"mov    v15.16b, v8.16b             \\n\"\n                    \"mov    v16.16b, v8.16b             \\n\"\n                    \"mov    v17.16b, v8.16b             \\n\"\n                    \"mov    v18.16b, v8.16b             \\n\"\n                    \"mov    v19.16b, v8.16b             \\n\"\n\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            \\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n                    \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v4.4s}, [%1], #16          \\n\"\n\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"ld1    {v0.4s}, [%2], #16          \\n\"\n\n                    \"ldr    d1, [%2]                    \\n\"\n                    \"ldr    x21, [%2, #8]               \\n\"\n                    \"ldr    d2, [%2, #16]               \\n\"\n                    \"ldr    x22, [%2, #24]              \\n\"\n               
     \"add    %2, %2, #32                 \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n\n                    \"ldr    d5, [%1]                    \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"ldr    x25, [%1, #8]               \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n\n                    \"ldr    d6, [%1]                    \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"ldr    x26, [%1, #8]               \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n\n                    \"ldr    d3, [%2]                    \\n\"\n                    \"ins    v5.d[1], x25                \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"ldr    x23, [%2, #8]               \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n   
                 \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n\n                    \"ldr    d0, [%2]                    \\n\"\n                    \"ins    v6.d[1], x26                \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"ldr    x20, [%2, #8]               \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n\n                    \"ldr    d1, [%2]                    \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"ldr    x21, [%2, #8]               \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"ldr    d7, [%1]                    \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v8.4s, v6.4s, v3.s[0]       
\\n\"\n                    \"ldr    x27, [%1, #8]               \\n\"\n                    \"fmla   v9.4s, v6.4s, v3.s[1]       \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v10.4s, v6.4s, v3.s[2]      \\n\"\n\n                    \"ldr    d4, [%1]                    \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v11.4s, v6.4s, v3.s[3]      \\n\"\n                    \"ldr    x24, [%1, #8]               \\n\"\n                    \"fmla   v12.4s, v6.4s, v0.s[0]      \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v13.4s, v6.4s, v0.s[1]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v14.4s, v6.4s, v0.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v15.4s, v6.4s, v0.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v16.4s, v6.4s, v1.s[0]      \\n\"\n\n                    \"ldr    d2, [%2]                    \\n\"\n                    \"ins    v7.d[1], x27                \\n\"\n                    \"fmla   v17.4s, v6.4s, v1.s[1]      \\n\"\n                    \"ldr    x22, [%2, #8]               \\n\"\n                    \"fmla   v18.4s, v6.4s, v1.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v19.4s, v6.4s, v1.s[3]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                    \"fmla   v20.4s, v7.4s, v3.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v21.4s, v7.4s, v3.s[1]      \\n\"\n                    \"nop        
                        \\n\"\n                    \"fmla   v22.4s, v7.4s, v3.s[2]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v23.4s, v7.4s, v3.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v24.4s, v7.4s, v0.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v25.4s, v7.4s, v0.s[1]      \\n\"\n\n                    \"ldr    d3, [%2]                    \\n\"\n                    \"ins    v4.d[1], x24                \\n\"\n                    \"fmla   v26.4s, v7.4s, v0.s[2]      \\n\"\n                    \"ldr    x23, [%2, #8]               \\n\"\n                    \"fmla   v27.4s, v7.4s, v0.s[3]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n\n                    \"ldr    d0, [%2]                    \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"ldr    x20, [%2, #8]               \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n\n                    \"ldr    d5, [%1]                    \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                    \"ldr    x25, [%1, #8]               \\n\"\n                    \"fmla   v9.4s, v4.4s, v2.s[1]       \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v10.4s, v4.4s, v2.s[2]      \\n\"\n\n                    \"ldr    d6, [%1]                    \\n\"\n                    \"ins    v0.d[1], 
x20                \\n\"\n                    \"fmla   v11.4s, v4.4s, v2.s[3]      \\n\"\n                    \"ldr    x26, [%1, #8]               \\n\"\n                    \"fmla   v12.4s, v4.4s, v3.s[0]      \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v13.4s, v4.4s, v3.s[1]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v14.4s, v4.4s, v3.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v15.4s, v4.4s, v3.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n\n                    \"ldr    d1, [%2]                    \\n\"\n                    \"ins    v5.d[1], x25                \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"ldr    x21, [%2, #8]               \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                    \"fmla   v20.4s, v5.4s, v2.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v21.4s, v5.4s, v2.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v22.4s, v5.4s, v2.s[2]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v23.4s, v5.4s, v2.s[3]      \\n\"\n                    \"nop                                \\n\"\n              
      \"fmla   v24.4s, v5.4s, v3.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v25.4s, v5.4s, v3.s[1]      \\n\"\n\n                    \"ldr    d2, [%2]                    \\n\"\n                    \"ins    v6.d[1], x26                \\n\"\n                    \"fmla   v26.4s, v5.4s, v3.s[2]      \\n\"\n                    \"ldr    x22, [%2, #8]               \\n\"\n                    \"fmla   v27.4s, v5.4s, v3.s[3]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n\n                    \"ldr    d3, [%2]                    \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"ldr    x23, [%2, #8]               \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n\n                    \"ldr    d7, [%1]                    \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                    \"ldr    x27, [%1, #8]               \\n\"\n                    \"fmla   v9.4s, v6.4s, v1.s[1]       \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v10.4s, v6.4s, v1.s[2]      \\n\"\n\n                    \"ldr    d4, [%1]                    \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v11.4s, v6.4s, v1.s[3]      \\n\"\n                    \"ldr    x24, [%1, #8]               \\n\"\n                    \"fmla   v12.4s, v6.4s, v2.s[0]      \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v13.4s, v6.4s, v2.s[1]      \\n\"\n\n                    
\"nop                                \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v14.4s, v6.4s, v2.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v15.4s, v6.4s, v2.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v16.4s, v6.4s, v3.s[0]      \\n\"\n\n                    \"ldr    d0, [%2]                    \\n\"\n                    \"ins    v7.d[1], x27                \\n\"\n                    \"fmla   v17.4s, v6.4s, v3.s[1]      \\n\"\n                    \"ldr    x20, [%2, #8]               \\n\"\n                    \"fmla   v18.4s, v6.4s, v3.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v19.4s, v6.4s, v3.s[3]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                    \"fmla   v20.4s, v7.4s, v1.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v21.4s, v7.4s, v1.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v22.4s, v7.4s, v1.s[2]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v23.4s, v7.4s, v1.s[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v24.4s, v7.4s, v2.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v25.4s, v7.4s, v2.s[1]      \\n\"\n\n                    \"ldr    d1, [%2]                    \\n\"\n                    \"ins    v4.d[1], x24                \\n\"\n                    \"fmla   v26.4s, v7.4s, v2.s[2]      
\\n\"\n                    \"ldr    x21, [%2, #8]               \\n\"\n                    \"fmla   v27.4s, v7.4s, v2.s[3]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v28.4s, v7.4s, v3.s[0]      \\n\"\n\n                    \"ldr    d2, [%2]                    \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v29.4s, v7.4s, v3.s[1]      \\n\"\n                    \"ldr    x22, [%2, #8]               \\n\"\n                    \"fmla   v30.4s, v7.4s, v3.s[2]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #16                 \\n\"\n                    \"sub    %2, %2, #48                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla  
 v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 2          \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%3], #64 \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%3], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [x4], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [x4], #64 \\n\"\n                    \"st1    
{v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x12\n                    \"zip1   v6.4s, v8.4s, v9.4s         \\n\"\n                    \"zip2   v7.4s, v8.4s, v9.4s         \\n\"\n                    \"zip1   v8.4s, v10.4s, v11.4s       \\n\"\n                    \"zip2   v9.4s, v10.4s, v11.4s       \\n\"\n                    \"zip1   v10.4s, v12.4s, v13.4s      \\n\"\n                    \"zip2   v11.4s, v12.4s, v13.4s      \\n\"\n                    \"zip1   v12.4s, v14.4s, v15.4s      \\n\"\n                    \"zip2   v13.4s, v14.4s, v15.4s      \\n\"\n                    \"zip1   v14.4s, v16.4s, v17.4s      \\n\"\n                    \"zip2   v15.4s, v16.4s, v17.4s      \\n\"\n                    \"zip1   v16.4s, v18.4s, v19.4s      \\n\"\n                    \"zip2   v17.4s, v18.4s, v19.4s      \\n\"\n\n                    \"zip1   v18.4s, v20.4s, v21.4s      \\n\"\n                    \"zip2   v19.4s, v20.4s, v21.4s      \\n\"\n                    \"zip1   v20.4s, v22.4s, v23.4s      \\n\"\n                    \"zip2   v21.4s, v22.4s, v23.4s      \\n\"\n                    \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                    \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                    \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                    \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                    \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                    \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                    \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                    \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                    \"zip1   v0.2d, v6.2d, v8.2d         \\n\"\n                    \"zip2   v3.2d, v6.2d, v8.2d         \\n\"\n                    \"zip1   v1.2d, v10.2d, v12.2d       \\n\"\n         
           \"zip2   v4.2d, v10.2d, v12.2d       \\n\"\n                    \"zip1   v2.2d, v14.2d, v16.2d       \\n\"\n                    \"zip2   v5.2d, v14.2d, v16.2d       \\n\"\n\n                    \"zip1   v6.2d, v7.2d, v9.2d         \\n\"\n                    \"zip2   v9.2d, v7.2d, v9.2d         \\n\"\n                    \"zip1   v7.2d, v11.2d, v13.2d       \\n\"\n                    \"zip2   v10.2d, v11.2d, v13.2d      \\n\"\n                    \"zip1   v8.2d, v15.2d, v17.2d       \\n\"\n                    \"zip2   v11.2d, v15.2d, v17.2d      \\n\"\n\n                    \"zip1   v12.2d, v18.2d, v20.2d      \\n\"\n                    \"zip2   v15.2d, v18.2d, v20.2d      \\n\"\n                    \"zip1   v13.2d, v22.2d, v24.2d      \\n\"\n                    \"zip2   v16.2d, v22.2d, v24.2d      \\n\"\n                    \"zip1   v14.2d, v26.2d, v28.2d      \\n\"\n                    \"zip2   v17.2d, v26.2d, v28.2d      \\n\"\n\n                    \"zip1   v18.2d, v19.2d, v21.2d      \\n\"\n                    \"zip2   v21.2d, v19.2d, v21.2d      \\n\"\n                    \"zip1   v19.2d, v23.2d, v25.2d      \\n\"\n                    \"zip2   v22.2d, v23.2d, v25.2d      \\n\"\n                    \"zip1   v20.2d, v27.2d, v29.2d      \\n\"\n                    \"zip2   v23.2d, v27.2d, v29.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 2        \\n\"\n                    \"st1    {v0.4s, v1.4s, v2.4s}, [%3], #48 \\n\"\n                    \"st1    {v3.4s, v4.4s, v5.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v6.4s, v7.4s, v8.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v9.4s, v10.4s, v11.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n       
             \"st1    {v15.4s, v16.4s, v17.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v18.4s, v19.4s, v20.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v21.4s, v22.4s, v23.4s}, [x4] \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #384                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", 
\"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #320                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v8.4s}, [%8]               \\n\"\n                    \"ld1    {v20.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v9.16b, v8.16b              \\n\"\n                    \"mov    v10.16b, v8.16b             \\n\"\n                    \"mov    v11.16b, v8.16b             \\n\"\n                    \"mov    v12.16b, v8.16b             \\n\"\n                    \"mov    v13.16b, v8.16b             \\n\"\n                    \"mov    v14.16b, v8.16b             \\n\"\n                   
 \"mov    v15.16b, v8.16b             \\n\"\n                    \"mov    v16.16b, v8.16b             \\n\"\n                    \"mov    v17.16b, v8.16b             \\n\"\n                    \"mov    v18.16b, v8.16b             \\n\"\n                    \"mov    v19.16b, v8.16b             \\n\"\n\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            \\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n                    \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, 
v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                    \"fmla   v9.4s, v6.4s, v3.s[1]       \\n\"\n                    \"fmla   v10.4s, v6.4s, v3.s[2]      \\n\"\n                    \"fmla   v11.4s, v6.4s, v3.s[3]      \\n\"\n                    \"fmla   v20.4s, v7.4s, v3.s[0]      \\n\"\n                    \"fmla   v21.4s, v7.4s, v3.s[1]      \\n\"\n                    \"fmla   v22.4s, v7.4s, v3.s[2]      \\n\"\n                    \"fmla   v23.4s, v7.4s, v3.s[3]      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n\n                    \"fmla   v12.4s, v6.4s, v0.s[0]      \\n\"\n                    \"fmla   v13.4s, v6.4s, v0.s[1]      \\n\"\n                    \"fmla   v14.4s, 
v6.4s, v0.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v7.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v7.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v0.s[3]      \\n\"\n\n                    \"fmla   v16.4s, v6.4s, v1.s[0]      \\n\"\n                    \"fmla   v17.4s, v6.4s, v1.s[1]      \\n\"\n                    \"fmla   v18.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v19.4s, v6.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v2.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v2.s[3]      \\n\"\n                    \"fmla   v20.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"fmla   v12.4s, v4.4s, v3.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v3.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v3.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v3.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v3.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v3.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v3.s[2]      \\n\"\n                    \"fmla   
v27.4s, v5.4s, v3.s[3]      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n\n                    \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                    \"fmla   v9.4s, v6.4s, v1.s[1]       \\n\"\n                    \"fmla   v10.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v11.4s, v6.4s, v1.s[3]      \\n\"\n                    \"fmla   v20.4s, v7.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v7.4s, v1.s[3]      \\n\"\n\n                    \"fmla   v12.4s, v6.4s, v2.s[0]      \\n\"\n                    \"fmla   v13.4s, v6.4s, v2.s[1]      \\n\"\n                    \"fmla   v14.4s, v6.4s, v2.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v2.s[3]      \\n\"\n                    \"fmla   v24.4s, v7.4s, v2.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v2.s[1]      \\n\"\n                    \"fmla   v26.4s, v7.4s, v2.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v16.4s, v6.4s, v3.s[0]      \\n\"\n                    \"fmla   v17.4s, v6.4s, v3.s[1]      \\n\"\n                    \"fmla   v18.4s, v6.4s, v3.s[2]      \\n\"\n                    
\"fmla   v19.4s, v6.4s, v3.s[3]      \\n\"\n                    \"fmla   v28.4s, v7.4s, v3.s[0]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v3.s[1]      \\n\"\n                    \"fmla   v30.4s, v7.4s, v3.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]    
  \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 2          \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%3], #64 \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%3], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [x4], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [x4], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x12\n                    \"zip1   v6.4s, v8.4s, v9.4s         \\n\"\n                    \"zip2   v7.4s, v8.4s, v9.4s         \\n\"\n                    \"zip1   v8.4s, v10.4s, v11.4s       \\n\"\n                    \"zip2   v9.4s, v10.4s, v11.4s       \\n\"\n                    \"zip1   
v10.4s, v12.4s, v13.4s      \\n\"\n                    \"zip2   v11.4s, v12.4s, v13.4s      \\n\"\n                    \"zip1   v12.4s, v14.4s, v15.4s      \\n\"\n                    \"zip2   v13.4s, v14.4s, v15.4s      \\n\"\n                    \"zip1   v14.4s, v16.4s, v17.4s      \\n\"\n                    \"zip2   v15.4s, v16.4s, v17.4s      \\n\"\n                    \"zip1   v16.4s, v18.4s, v19.4s      \\n\"\n                    \"zip2   v17.4s, v18.4s, v19.4s      \\n\"\n\n                    \"zip1   v18.4s, v20.4s, v21.4s      \\n\"\n                    \"zip2   v19.4s, v20.4s, v21.4s      \\n\"\n                    \"zip1   v20.4s, v22.4s, v23.4s      \\n\"\n                    \"zip2   v21.4s, v22.4s, v23.4s      \\n\"\n                    \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                    \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                    \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                    \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                    \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                    \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                    \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                    \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                    \"zip1   v0.2d, v6.2d, v8.2d         \\n\"\n                    \"zip2   v3.2d, v6.2d, v8.2d         \\n\"\n                    \"zip1   v1.2d, v10.2d, v12.2d       \\n\"\n                    \"zip2   v4.2d, v10.2d, v12.2d       \\n\"\n                    \"zip1   v2.2d, v14.2d, v16.2d       \\n\"\n                    \"zip2   v5.2d, v14.2d, v16.2d       \\n\"\n\n                    \"zip1   v6.2d, v7.2d, v9.2d         \\n\"\n                    \"zip2   v9.2d, v7.2d, v9.2d         \\n\"\n                    \"zip1   v7.2d, v11.2d, v13.2d       \\n\"\n                    \"zip2   v10.2d, v11.2d, v13.2d      \\n\"\n                    \"zip1   v8.2d, v15.2d, v17.2d       \\n\"\n                    \"zip2   v11.2d, 
v15.2d, v17.2d      \\n\"\n\n                    \"zip1   v12.2d, v18.2d, v20.2d      \\n\"\n                    \"zip2   v15.2d, v18.2d, v20.2d      \\n\"\n                    \"zip1   v13.2d, v22.2d, v24.2d      \\n\"\n                    \"zip2   v16.2d, v22.2d, v24.2d      \\n\"\n                    \"zip1   v14.2d, v26.2d, v28.2d      \\n\"\n                    \"zip2   v17.2d, v26.2d, v28.2d      \\n\"\n\n                    \"zip1   v18.2d, v19.2d, v21.2d      \\n\"\n                    \"zip2   v21.2d, v19.2d, v21.2d      \\n\"\n                    \"zip1   v19.2d, v23.2d, v25.2d      \\n\"\n                    \"zip2   v22.2d, v23.2d, v25.2d      \\n\"\n                    \"zip1   v20.2d, v27.2d, v29.2d      \\n\"\n                    \"zip2   v23.2d, v27.2d, v29.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 2        \\n\"\n                    \"st1    {v0.4s, v1.4s, v2.4s}, [%3], #48 \\n\"\n                    \"st1    {v3.4s, v4.4s, v5.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v6.4s, v7.4s, v8.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v9.4s, v10.4s, v11.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v15.4s, v16.4s, v17.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v18.4s, v19.4s, v20.4s}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v21.4s, v22.4s, v23.4s}, [x4] \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #384                \\n\"\n                    \"b      11f                         \\n\"\n\n                    
\"10:                                \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t 
_sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n            float32x4_t _sum80;\n            float32x4_t _sum81;\n            float32x4_t _sum90;\n            float32x4_t _sum91;\n            float32x4_t _suma0;\n            float32x4_t _suma1;\n            float32x4_t _sumb0;\n            float32x4_t _sumb1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                    _sum20 = _sum00;\n                    _sum21 = _sum01;\n                    _sum30 = _sum00;\n                    _sum31 = _sum01;\n                    _sum40 = _sum00;\n                    _sum41 = _sum01;\n                    _sum50 = _sum00;\n                    _sum51 = _sum01;\n                    _sum60 = _sum00;\n                    _sum61 = _sum01;\n                    _sum70 = _sum00;\n                    _sum71 = _sum01;\n                    _sum80 = _sum00;\n                    _sum81 = _sum01;\n                    _sum90 = _sum00;\n                    _sum91 = _sum01;\n                    _suma0 = _sum00;\n                    _suma1 = _sum01;\n                    _sumb0 = _sum00;\n                    _sumb1 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                    _sum40 = vdupq_n_f32(0.f);\n                    _sum41 = vdupq_n_f32(0.f);\n                    _sum50 = vdupq_n_f32(0.f);\n                    
_sum51 = vdupq_n_f32(0.f);\n                    _sum60 = vdupq_n_f32(0.f);\n                    _sum61 = vdupq_n_f32(0.f);\n                    _sum70 = vdupq_n_f32(0.f);\n                    _sum71 = vdupq_n_f32(0.f);\n                    _sum80 = vdupq_n_f32(0.f);\n                    _sum81 = vdupq_n_f32(0.f);\n                    _sum90 = vdupq_n_f32(0.f);\n                    _sum91 = vdupq_n_f32(0.f);\n                    _suma0 = vdupq_n_f32(0.f);\n                    _suma1 = vdupq_n_f32(0.f);\n                    _sumb0 = vdupq_n_f32(0.f);\n                    _sumb1 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n                _sum80 = vld1q_f32(outptr + 4 * 16);\n                _sum81 = vld1q_f32(outptr + 4 * 17);\n                _sum90 = vld1q_f32(outptr + 4 * 18);\n                _sum91 = vld1q_f32(outptr + 4 * 19);\n                _suma0 = vld1q_f32(outptr + 4 * 20);\n                _suma1 = vld1q_f32(outptr + 4 * 21);\n                _sumb0 = vld1q_f32(outptr + 4 * 22);\n                _sumb1 = vld1q_f32(outptr + 4 * 23);\n            }\n\n            int kk = 0;\n         
   for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n            }\n\n            if (k_end)\n 
           {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n                    vst1q_f32(outptr0 + 4 * 2, _sum20);\n                    vst1q_f32(outptr0 + 4 * 3, _sum30);\n                    vst1q_f32(outptr0 + 4 * 4, _sum40);\n                    vst1q_f32(outptr0 + 4 * 5, _sum50);\n                    vst1q_f32(outptr0 + 4 * 6, _sum60);\n                    vst1q_f32(outptr0 + 4 * 7, _sum70);\n                    vst1q_f32(outptr0 + 4 * 8, _sum80);\n                    vst1q_f32(outptr0 + 4 * 9, _sum90);\n                    vst1q_f32(outptr0 + 4 * 10, _suma0);\n                    vst1q_f32(outptr0 + 4 * 11, _sumb0);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 2, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 3, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 4, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 5, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 6, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 7, _sum71);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 8, _sum81);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 9, _sum91);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 10, _suma1);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 11, _sumb1);\n\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x12_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71, _sum80, _sum81, _sum90, _sum91, _suma0, _suma1, _sumb0, _sumb1);\n\n                    
vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + 8, _sum10);\n                    vst1q_f32(outptr0 + out_hstep, _sum11);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum20);\n                    vst1q_f32(outptr0 + out_hstep + 8, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum30);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 8, _sum40);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _sum50);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 8, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum60);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 8, _sum70);\n                    vst1q_f32(outptr0 + out_hstep * 5, _sum71);\n                    vst1q_f32(outptr0 + out_hstep * 5 + 4, _sum80);\n                    vst1q_f32(outptr0 + out_hstep * 5 + 8, _sum81);\n                    vst1q_f32(outptr0 + out_hstep * 6, _sum90);\n                    vst1q_f32(outptr0 + out_hstep * 6 + 4, _sum91);\n                    vst1q_f32(outptr0 + out_hstep * 6 + 8, _suma0);\n                    vst1q_f32(outptr0 + out_hstep * 7, _suma1);\n                    vst1q_f32(outptr0 + out_hstep * 7 + 4, _sumb0);\n                    vst1q_f32(outptr0 + out_hstep * 7 + 8, _sumb1);\n\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, 
_sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n                vst1q_f32(outptr + 4 * 16, _sum80);\n                vst1q_f32(outptr + 4 * 17, _sum81);\n                vst1q_f32(outptr + 4 * 18, _sum90);\n                vst1q_f32(outptr + 4 * 19, _sum91);\n                vst1q_f32(outptr + 4 * 20, _suma0);\n                vst1q_f32(outptr + 4 * 21, _suma1);\n                vst1q_f32(outptr + 4 * 22, _sumb0);\n                vst1q_f32(outptr + 4 * 23, _sumb1);\n            }\n\n            outptr += 96;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            if (use_a53_a55_optimized_kernel && cpu_support_arm_asimdhp())\n            {\n                // a55\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #192                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n    
                \"ld1    {v16.4s}, [%8]              \\n\"\n                    \"ld1    {v24.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v17.16b, v16.16b            \\n\"\n                    \"mov    v18.16b, v16.16b            \\n\"\n                    \"mov    v19.16b, v16.16b            \\n\"\n                    \"mov    v20.16b, v16.16b            \\n\"\n                    \"mov    v21.16b, v16.16b            \\n\"\n                    \"mov    v22.16b, v16.16b            \\n\"\n                    \"mov    v23.16b, v16.16b            \\n\"\n\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n                    \"mov    v28.16b, v24.16b            \\n\"\n                    \"mov    v29.16b, v24.16b            \\n\"\n                    \"mov    v30.16b, v24.16b            \\n\"\n                    \"mov    v31.16b, v24.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v8.4s}, [%1], #16          \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s}, [%2], #16          \\n\"\n\n                    \"ldr    d1, [%2], #8                \\n\"\n                    
\"ldr    x21, [%2], #8               \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n                    \"ldr    d9, [%1], #8                \\n\"\n                    \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n                    \"fmla   v17.4s, v8.4s, v0.s[1]      \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v18.4s, v8.4s, v0.s[2]      \\n\"\n                    \"ldr    d10, [%1], #8               \\n\"\n                    \"fmla   v19.4s, v8.4s, v0.s[3]      \\n\"\n                    \"ldr    x26, [%1], #8               \\n\"\n                    \"fmla   v20.4s, v8.4s, v1.s[0]      \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v21.4s, v8.4s, v1.s[1]      \\n\"\n                    \"ins    v9.d[1], x25                \\n\"\n                    \"fmla   v22.4s, v8.4s, v1.s[2]      \\n\"\n                    \"ldr    x22, [%2], #8               \\n\"\n                    \"fmla   v23.4s, v8.4s, v1.s[3]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v24.4s, v9.4s, v0.s[0]      \\n\"\n                    \"ldr    x23, [%2], #8               \\n\"\n                    \"fmla   v25.4s, v9.4s, v0.s[1]      \\n\"\n                    \"ins    v10.d[1], x26               \\n\"\n                    \"fmla   v26.4s, v9.4s, v0.s[2]      \\n\"\n                    \"ldr    d11, [%1], #8               \\n\"\n                    \"fmla   v27.4s, v9.4s, v0.s[3]      \\n\"\n                    \"ldr    x27, [%1], #8               \\n\"\n                    \"fmla   v28.4s, v9.4s, v1.s[0]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v29.4s, v9.4s, v1.s[1]      \\n\"\n                  
  \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v30.4s, v9.4s, v1.s[2]      \\n\"\n                    \"ldr    d12, [%1], #8               \\n\"\n                    \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                    \"ldr    x24, [%1], #8               \\n\"\n                    \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v17.4s, v10.4s, v2.s[1]     \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v18.4s, v10.4s, v2.s[2]     \\n\"\n                    \"ldr    d4, [%2], #8                \\n\"\n                    \"fmla   v19.4s, v10.4s, v2.s[3]     \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v20.4s, v10.4s, v3.s[0]     \\n\"\n                    \"ldr    d5, [%2], #8                \\n\"\n                    \"fmla   v21.4s, v10.4s, v3.s[1]     \\n\"\n                    \"ins    v11.d[1], x27               \\n\"\n                    \"fmla   v22.4s, v10.4s, v3.s[2]     \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v23.4s, v10.4s, v3.s[3]     \\n\"\n                    \"ldr    d13, [%1], #8               \\n\"\n                    \"fmla   v24.4s, v11.4s, v2.s[0]     \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n                    \"fmla   v25.4s, v11.4s, v2.s[1]     \\n\"\n                    \"ins    v12.d[1], x24               \\n\"\n                    \"fmla   v26.4s, v11.4s, v2.s[2]     \\n\"\n                    \"ldr    d14, [%1], #8               \\n\"\n                    \"fmla   v27.4s, v11.4s, v2.s[3]     \\n\"\n                    \"ldr    x26, [%1], #8               \\n\"\n                    \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                    \"ldr    d6, [%2], #8                \\n\"\n                  
  \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n                    \"ins    v4.d[1], x20                \\n\"\n                    \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                    \"ldr    x22, [%2], #8               \\n\"\n                    \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"ldr    d7, [%2], #8                \\n\"\n                    \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n                    \"ldr    x23, [%2], #8               \\n\"\n                    \"fmla   v17.4s, v12.4s, v4.s[1]     \\n\"\n                    \"ins    v5.d[1], x21                \\n\"\n                    \"fmla   v18.4s, v12.4s, v4.s[2]     \\n\"\n                    \"ldr    d15, [%1], #8               \\n\"\n                    \"fmla   v19.4s, v12.4s, v4.s[3]     \\n\"\n                    \"ldr    x27, [%1], #8               \\n\"\n                    \"fmla   v20.4s, v12.4s, v5.s[0]     \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v21.4s, v12.4s, v5.s[1]     \\n\"\n                    \"ins    v13.d[1], x25               \\n\"\n                    \"fmla   v22.4s, v12.4s, v5.s[2]     \\n\"\n                    \"ldr    d8, [%1], #8                \\n\"\n                    \"fmla   v23.4s, v12.4s, v5.s[3]     \\n\"\n                    \"ldr    x24, [%1], #8               \\n\"\n                    \"fmla   v24.4s, v13.4s, v4.s[0]     \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v25.4s, v13.4s, v4.s[1]     \\n\"\n                    \"ins    v14.d[1], x26               \\n\"\n                    \"fmla   v26.4s, v13.4s, v4.s[2]     \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v27.4s, v13.4s, v4.s[3]     \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v28.4s, v13.4s, v5.s[0]     \\n\"\n  
                  \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v29.4s, v13.4s, v5.s[1]     \\n\"\n                    \"ins    v6.d[1], x22                \\n\"\n                    \"fmla   v30.4s, v13.4s, v5.s[2]     \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n                    \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                    \"fmla   v17.4s, v14.4s, v6.s[1]     \\n\"\n                    \"ins    v7.d[1], x23                \\n\"\n                    \"fmla   v18.4s, v14.4s, v6.s[2]     \\n\"\n                    \"fmla   v19.4s, v14.4s, v6.s[3]     \\n\"\n                    \"fmla   v20.4s, v14.4s, v7.s[0]     \\n\"\n                    \"fmla   v21.4s, v14.4s, v7.s[1]     \\n\"\n                    \"ins    v15.d[1], x27               \\n\"\n                    \"fmla   v22.4s, v14.4s, v7.s[2]     \\n\"\n                    \"fmla   v23.4s, v14.4s, v7.s[3]     \\n\"\n                    \"fmla   v24.4s, v15.4s, v6.s[0]     \\n\"\n                    \"fmla   v25.4s, v15.4s, v6.s[1]     \\n\"\n                    \"ins    v8.d[1], x24                \\n\"\n                    \"fmla   v26.4s, v15.4s, v6.s[2]     \\n\"\n                    \"fmla   v27.4s, v15.4s, v6.s[3]     \\n\"\n                    \"fmla   v28.4s, v15.4s, v7.s[0]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v29.4s, v15.4s, v7.s[1]     \\n\"\n                    \"fmla   v30.4s, v15.4s, v7.s[2]     \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #16                 \\n\"\n                    \"sub    %2, %2, #32                 \\n\"\n\n                    \"5:                                 \\n\"\n              
      \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v4.4s, v1.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v24.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n     
               \"add    x4, %3, w4, sxtw 2          \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [x4], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x8\n                    \"zip1   v14.4s, v16.4s, v17.4s      \\n\"\n                    \"zip2   v15.4s, v16.4s, v17.4s      \\n\"\n                    \"zip1   v16.4s, v18.4s, v19.4s      \\n\"\n                    \"zip2   v17.4s, v18.4s, v19.4s      \\n\"\n                    \"zip1   v18.4s, v20.4s, v21.4s      \\n\"\n                    \"zip2   v19.4s, v20.4s, v21.4s      \\n\"\n                    \"zip1   v20.4s, v22.4s, v23.4s      \\n\"\n                    \"zip2   v21.4s, v22.4s, v23.4s      \\n\"\n\n                    \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                    \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                    \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                    \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                    \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                    \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                    \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                    \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                    \"zip1   v0.2d, v14.2d, v16.2d       \\n\"\n                    \"zip2   v2.2d, v14.2d, v16.2d       \\n\"\n                    \"zip1   v4.2d, v15.2d, v17.2d       \\n\"\n                    \"zip2   v6.2d, v15.2d, v17.2d       \\n\"\n                    \"zip1   v1.2d, v18.2d, v20.2d       \\n\"\n                    \"zip2   v3.2d, v18.2d, v20.2d       \\n\"\n  
                  \"zip1   v5.2d, v19.2d, v21.2d       \\n\"\n                    \"zip2   v7.2d, v19.2d, v21.2d       \\n\"\n\n                    \"zip1   v8.2d, v22.2d, v24.2d       \\n\"\n                    \"zip2   v10.2d, v22.2d, v24.2d      \\n\"\n                    \"zip1   v12.2d, v23.2d, v25.2d      \\n\"\n                    \"zip2   v14.2d, v23.2d, v25.2d      \\n\"\n                    \"zip1   v9.2d, v26.2d, v28.2d       \\n\"\n                    \"zip2   v11.2d, v26.2d, v28.2d      \\n\"\n                    \"zip1   v13.2d, v27.2d, v29.2d      \\n\"\n                    \"zip2   v15.2d, v27.2d, v29.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 2        \\n\"\n                    \"st1    {v0.4s, v1.4s}, [%3], #32   \\n\"\n                    \"st1    {v2.4s, v3.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v4.4s, v5.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v6.4s, v7.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v8.4s, v9.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v10.4s, v11.4s}, [x4]      \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v12.4s, v13.4s}, [x4]      \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v14.4s, v15.4s}, [x4]      \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #256                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, 
v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else if (use_a53_a55_optimized_kernel && !cpu_support_arm_asimdhp())\n            {\n                // a53\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #192                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                
 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v16.4s}, [%8]              \\n\"\n                    \"ld1    {v24.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v17.16b, v16.16b            \\n\"\n                    \"mov    v18.16b, v16.16b            \\n\"\n                    \"mov    v19.16b, v16.16b            \\n\"\n                    \"mov    v20.16b, v16.16b            \\n\"\n                    \"mov    v21.16b, v16.16b            \\n\"\n                    \"mov    v22.16b, v16.16b            \\n\"\n                    \"mov    v23.16b, v16.16b            \\n\"\n\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n                    \"mov    v28.16b, v24.16b            \\n\"\n                    \"mov    v29.16b, v24.16b            \\n\"\n                    \"mov    v30.16b, v24.16b            \\n\"\n                    \"mov    v31.16b, v24.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ldr    d0, [%2]        
            \\n\"\n                    \"ldr    x20, [%2, #8]               \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n\n                    \"ldr    d8, [%1]                    \\n\"\n                    \"ldr    x24, [%1, #8]               \\n\"\n                    \"ins    v8.d[1], x24                \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n\n                    \"ldr    d1, [%2]                    \\n\"\n                    \"ldr    x21, [%2, #8]               \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n\n                    \"ldr    d9, [%1]                    \\n\"\n                    \"ldr    x25, [%1, #8]               \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n\n                    \"ldr    d2, [%2]                    \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                    \"ldr    x22, [%2, #8]               \\n\"\n                    \"fmla   v17.4s, v8.4s, v0.s[1]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v18.4s, v8.4s, v0.s[2]      \\n\"\n\n                    \"ldr    d10, [%1]                   \\n\"\n                    \"ins    v9.d[1], x25                \\n\"\n                    \"fmla   v19.4s, v8.4s, v0.s[3]      \\n\"\n                    \"ldr    x26, [%1, #8]               \\n\"\n                    \"fmla   v20.4s, v8.4s, v1.s[0]      \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v21.4s, v8.4s, v1.s[1]      \\n\"\n\n                    \"ldr    d3, [%2]                    \\n\"\n                    \"ins    v2.d[1], x22      
          \\n\"\n                    \"fmla   v22.4s, v8.4s, v1.s[2]      \\n\"\n                    \"ldr    x23, [%2, #8]               \\n\"\n                    \"fmla   v23.4s, v8.4s, v1.s[3]      \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v24.4s, v9.4s, v0.s[0]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v25.4s, v9.4s, v0.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v26.4s, v9.4s, v0.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v27.4s, v9.4s, v0.s[3]      \\n\"\n\n                    \"ldr    d11, [%1]                   \\n\"\n                    \"ins    v10.d[1], x26               \\n\"\n                    \"fmla   v28.4s, v9.4s, v1.s[0]      \\n\"\n                    \"ldr    x27, [%1, #8]               \\n\"\n                    \"fmla   v29.4s, v9.4s, v1.s[1]      \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v30.4s, v9.4s, v1.s[2]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v17.4s, v10.4s, v2.s[1]     \\n\"\n\n                    \"ldr    d4, [%2]                    \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v18.4s, v10.4s, v2.s[2]     \\n\"\n                    \"ldr    x20, [%2, #8]               \\n\"\n                    
\"fmla   v19.4s, v10.4s, v2.s[3]     \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v20.4s, v10.4s, v3.s[0]     \\n\"\n\n                    \"ldr    d12, [%1]                   \\n\"\n                    \"ins    v11.d[1], x27               \\n\"\n                    \"fmla   v21.4s, v10.4s, v3.s[1]     \\n\"\n                    \"ldr    x24, [%1, #8]               \\n\"\n                    \"fmla   v22.4s, v10.4s, v3.s[2]     \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v23.4s, v10.4s, v3.s[3]     \\n\"\n\n                    \"ldr    d5, [%2]                    \\n\"\n                    \"ins    v4.d[1], x20                \\n\"\n                    \"fmla   v24.4s, v11.4s, v2.s[0]     \\n\"\n                    \"ldr    x21, [%2, #8]               \\n\"\n                    \"fmla   v25.4s, v11.4s, v2.s[1]     \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v26.4s, v11.4s, v2.s[2]     \\n\"\n\n                    \"ldr    d13, [%1]                   \\n\"\n                    \"ins    v12.d[1], x24               \\n\"\n                    \"fmla   v27.4s, v11.4s, v2.s[3]     \\n\"\n                    \"ldr    x25, [%1, #8]               \\n\"\n                    \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n\n                    \"ldr    d6, [%2]                    \\n\"\n                    \"ins    v5.d[1], x21                \\n\"\n                    \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                    \"ldr    x22, [%2, #8]               \\n\"\n                    \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n\n                    \"ldr 
   d14, [%1]                   \\n\"\n                    \"ins    v13.d[1], x25               \\n\"\n                    \"fmla   v17.4s, v12.4s, v4.s[1]     \\n\"\n                    \"ldr    x26, [%1, #8]               \\n\"\n                    \"fmla   v18.4s, v12.4s, v4.s[2]     \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v19.4s, v12.4s, v4.s[3]     \\n\"\n\n                    \"ldr    d7, [%2]                    \\n\"\n                    \"ins    v6.d[1], x22                \\n\"\n                    \"fmla   v20.4s, v12.4s, v5.s[0]     \\n\"\n                    \"ldr    x23, [%2, #8]               \\n\"\n                    \"fmla   v21.4s, v12.4s, v5.s[1]     \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v22.4s, v12.4s, v5.s[2]     \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v23.4s, v12.4s, v5.s[3]     \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v24.4s, v13.4s, v4.s[0]     \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v25.4s, v13.4s, v4.s[1]     \\n\"\n\n                    \"ldr    d15, [%1]                   \\n\"\n                    \"ins    v14.d[1], x26               \\n\"\n                    \"fmla   v26.4s, v13.4s, v4.s[2]     \\n\"\n                    \"ldr    x27, [%1, #8]               \\n\"\n                    \"fmla   v27.4s, v13.4s, v4.s[3]     \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v28.4s, v13.4s, v5.s[0]     \\n\"\n\n                    \"nop                                \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v29.4s, v13.4s, v5.s[1]     \\n\"\n  
                  \"nop                                \\n\"\n                    \"fmla   v30.4s, v13.4s, v5.s[2]     \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n\n                    \"ldr    d0, [%2]                    \\n\"\n                    \"ins    v7.d[1], x23                \\n\"\n                    \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                    \"ldr    x20, [%2, #8]               \\n\"\n                    \"fmla   v17.4s, v14.4s, v6.s[1]     \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v18.4s, v14.4s, v6.s[2]     \\n\"\n\n                    \"ldr    d8, [%1]                    \\n\"\n                    \"ins    v15.d[1], x27               \\n\"\n                    \"fmla   v19.4s, v14.4s, v6.s[3]     \\n\"\n                    \"ldr    x24, [%1, #8]               \\n\"\n                    \"fmla   v20.4s, v14.4s, v7.s[0]     \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n                    \"fmla   v21.4s, v14.4s, v7.s[1]     \\n\"\n\n                    \"ldr    d1, [%2]                    \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v22.4s, v14.4s, v7.s[2]     \\n\"\n                    \"ldr    x21, [%2, #8]               \\n\"\n                    \"fmla   v23.4s, v14.4s, v7.s[3]     \\n\"\n                    \"add    %2, %2, #16                 \\n\"\n                    \"fmla   v24.4s, v15.4s, v6.s[0]     \\n\"\n\n                    \"ldr    d9, [%1]                    \\n\"\n                    \"ins    v8.d[1], x24                \\n\"\n                    \"fmla   v25.4s, v15.4s, v6.s[1]     \\n\"\n                    \"ldr    x25, [%1, #8]               \\n\"\n                    \"fmla   v26.4s, v15.4s, v6.s[2]     \\n\"\n                    \"add    %1, %1, #16                 \\n\"\n          
          \"fmla   v27.4s, v15.4s, v6.s[3]     \\n\"\n\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v28.4s, v15.4s, v7.s[0]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v29.4s, v15.4s, v7.s[1]     \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v30.4s, v15.4s, v7.s[2]     \\n\"\n\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n                    \"nop                                \\n\"\n\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #32                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v4.4s, 
v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v4.4s, v1.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v24.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 2          \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [x4], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x8\n                    \"zip1   v14.4s, v16.4s, v17.4s      \\n\"\n                    \"zip2   v15.4s, v16.4s, v17.4s      \\n\"\n                    \"zip1   v16.4s, v18.4s, v19.4s 
     \\n\"\n                    \"zip2   v17.4s, v18.4s, v19.4s      \\n\"\n                    \"zip1   v18.4s, v20.4s, v21.4s      \\n\"\n                    \"zip2   v19.4s, v20.4s, v21.4s      \\n\"\n                    \"zip1   v20.4s, v22.4s, v23.4s      \\n\"\n                    \"zip2   v21.4s, v22.4s, v23.4s      \\n\"\n\n                    \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                    \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                    \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                    \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                    \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                    \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                    \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                    \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                    \"zip1   v0.2d, v14.2d, v16.2d       \\n\"\n                    \"zip2   v2.2d, v14.2d, v16.2d       \\n\"\n                    \"zip1   v4.2d, v15.2d, v17.2d       \\n\"\n                    \"zip2   v6.2d, v15.2d, v17.2d       \\n\"\n                    \"zip1   v1.2d, v18.2d, v20.2d       \\n\"\n                    \"zip2   v3.2d, v18.2d, v20.2d       \\n\"\n                    \"zip1   v5.2d, v19.2d, v21.2d       \\n\"\n                    \"zip2   v7.2d, v19.2d, v21.2d       \\n\"\n\n                    \"zip1   v8.2d, v22.2d, v24.2d       \\n\"\n                    \"zip2   v10.2d, v22.2d, v24.2d      \\n\"\n                    \"zip1   v12.2d, v23.2d, v25.2d      \\n\"\n                    \"zip2   v14.2d, v23.2d, v25.2d      \\n\"\n                    \"zip1   v9.2d, v26.2d, v28.2d       \\n\"\n                    \"zip2   v11.2d, v26.2d, v28.2d      \\n\"\n                    \"zip1   v13.2d, v27.2d, v29.2d      \\n\"\n                    \"zip2   v15.2d, v27.2d, v29.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 2        \\n\"\n                    \"st1    {v0.4s, v1.4s}, [%3], #32   
\\n\"\n                    \"st1    {v2.4s, v3.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v4.4s, v5.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v6.4s, v7.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v8.4s, v9.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v10.4s, v11.4s}, [x4]      \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v12.4s, v13.4s}, [x4]      \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v14.4s, v15.4s}, [x4]      \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #256                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n      
              \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #192                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v16.4s}, [%8]              \\n\"\n                    \"ld1    {v24.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v17.16b, v16.16b            \\n\"\n                    \"mov    v18.16b, v16.16b            \\n\"\n                    \"mov    v19.16b, v16.16b            \\n\"\n              
      \"mov    v20.16b, v16.16b            \\n\"\n                    \"mov    v21.16b, v16.16b            \\n\"\n                    \"mov    v22.16b, v16.16b            \\n\"\n                    \"mov    v23.16b, v16.16b            \\n\"\n\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n                    \"mov    v28.16b, v24.16b            \\n\"\n                    \"mov    v29.16b, v24.16b            \\n\"\n                    \"mov    v30.16b, v24.16b            \\n\"\n                    \"mov    v31.16b, v24.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                    \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v8.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v8.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v8.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v8.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v8.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v8.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v8.4s, v1.s[3]      \\n\"\n                    \"fmla   v24.4s, v9.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v9.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, 
v9.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v9.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v9.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v9.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v9.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                    \"fmla   v17.4s, v10.4s, v2.s[1]     \\n\"\n                    \"fmla   v18.4s, v10.4s, v2.s[2]     \\n\"\n                    \"fmla   v19.4s, v10.4s, v2.s[3]     \\n\"\n                    \"fmla   v20.4s, v10.4s, v3.s[0]     \\n\"\n                    \"fmla   v21.4s, v10.4s, v3.s[1]     \\n\"\n                    \"fmla   v22.4s, v10.4s, v3.s[2]     \\n\"\n                    \"fmla   v23.4s, v10.4s, v3.s[3]     \\n\"\n                    \"fmla   v24.4s, v11.4s, v2.s[0]     \\n\"\n                    \"fmla   v25.4s, v11.4s, v2.s[1]     \\n\"\n                    \"fmla   v26.4s, v11.4s, v2.s[2]     \\n\"\n                    \"fmla   v27.4s, v11.4s, v2.s[3]     \\n\"\n                    \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                    \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n                    \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%1], #64 \\n\"\n                    \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n                    \"fmla   v17.4s, v12.4s, v4.s[1]     \\n\"\n                    \"fmla   v18.4s, v12.4s, v4.s[2]     \\n\"\n                    \"fmla   v19.4s, v12.4s, v4.s[3]     \\n\"\n                    \"fmla   v20.4s, v12.4s, v5.s[0]     \\n\"\n                    
\"fmla   v21.4s, v12.4s, v5.s[1]     \\n\"\n                    \"fmla   v22.4s, v12.4s, v5.s[2]     \\n\"\n                    \"fmla   v23.4s, v12.4s, v5.s[3]     \\n\"\n                    \"fmla   v24.4s, v13.4s, v4.s[0]     \\n\"\n                    \"fmla   v25.4s, v13.4s, v4.s[1]     \\n\"\n                    \"fmla   v26.4s, v13.4s, v4.s[2]     \\n\"\n                    \"fmla   v27.4s, v13.4s, v4.s[3]     \\n\"\n                    \"fmla   v28.4s, v13.4s, v5.s[0]     \\n\"\n                    \"fmla   v29.4s, v13.4s, v5.s[1]     \\n\"\n                    \"fmla   v30.4s, v13.4s, v5.s[2]     \\n\"\n                    \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n                    \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                    \"fmla   v17.4s, v14.4s, v6.s[1]     \\n\"\n                    \"fmla   v18.4s, v14.4s, v6.s[2]     \\n\"\n                    \"fmla   v19.4s, v14.4s, v6.s[3]     \\n\"\n                    \"fmla   v20.4s, v14.4s, v7.s[0]     \\n\"\n                    \"fmla   v21.4s, v14.4s, v7.s[1]     \\n\"\n                    \"fmla   v22.4s, v14.4s, v7.s[2]     \\n\"\n                    \"fmla   v23.4s, v14.4s, v7.s[3]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v24.4s, v15.4s, v6.s[0]     \\n\"\n                    \"fmla   v25.4s, v15.4s, v6.s[1]     \\n\"\n                    \"fmla   v26.4s, v15.4s, v6.s[2]     \\n\"\n                    \"fmla   v27.4s, v15.4s, v6.s[3]     \\n\"\n                    \"fmla   v28.4s, v15.4s, v7.s[0]     \\n\"\n                    \"fmla   v29.4s, v15.4s, v7.s[1]     \\n\"\n                    \"fmla   v30.4s, v15.4s, v7.s[2]     \\n\"\n                    \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n      
              \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v4.4s, v1.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v24.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 2          \\n\"\n                    
\"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [x4], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x8\n                    \"zip1   v14.4s, v16.4s, v17.4s      \\n\"\n                    \"zip2   v15.4s, v16.4s, v17.4s      \\n\"\n                    \"zip1   v16.4s, v18.4s, v19.4s      \\n\"\n                    \"zip2   v17.4s, v18.4s, v19.4s      \\n\"\n                    \"zip1   v18.4s, v20.4s, v21.4s      \\n\"\n                    \"zip2   v19.4s, v20.4s, v21.4s      \\n\"\n                    \"zip1   v20.4s, v22.4s, v23.4s      \\n\"\n                    \"zip2   v21.4s, v22.4s, v23.4s      \\n\"\n\n                    \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                    \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                    \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                    \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                    \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                    \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                    \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                    \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                    \"zip1   v0.2d, v14.2d, v16.2d       \\n\"\n                    \"zip2   v2.2d, v14.2d, v16.2d       \\n\"\n                    \"zip1   v4.2d, v15.2d, v17.2d       \\n\"\n                    \"zip2   v6.2d, v15.2d, v17.2d       \\n\"\n                    \"zip1   v1.2d, v18.2d, v20.2d       \\n\"\n                    \"zip2   v3.2d, v18.2d, v20.2d       \\n\"\n                    \"zip1   v5.2d, v19.2d, v21.2d       \\n\"\n                 
   \"zip2   v7.2d, v19.2d, v21.2d       \\n\"\n\n                    \"zip1   v8.2d, v22.2d, v24.2d       \\n\"\n                    \"zip2   v10.2d, v22.2d, v24.2d      \\n\"\n                    \"zip1   v12.2d, v23.2d, v25.2d      \\n\"\n                    \"zip2   v14.2d, v23.2d, v25.2d      \\n\"\n                    \"zip1   v9.2d, v26.2d, v28.2d       \\n\"\n                    \"zip2   v11.2d, v26.2d, v28.2d      \\n\"\n                    \"zip1   v13.2d, v27.2d, v29.2d      \\n\"\n                    \"zip2   v15.2d, v27.2d, v29.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 2        \\n\"\n                    \"st1    {v0.4s, v1.4s}, [%3], #32   \\n\"\n                    \"st1    {v2.4s, v3.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v4.4s, v5.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v6.4s, v7.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v8.4s, v9.4s}, [x4]        \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v10.4s, v11.4s}, [x4]      \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v12.4s, v13.4s}, [x4]      \\n\"\n                    \"add    x4, x4, %w13, sxtw 2        \\n\"\n                    \"st1    {v14.4s, v15.4s}, [x4]      \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #256                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, 
v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                    _sum20 = _sum00;\n                    
_sum21 = _sum01;\n                    _sum30 = _sum00;\n                    _sum31 = _sum01;\n                    _sum40 = _sum00;\n                    _sum41 = _sum01;\n                    _sum50 = _sum00;\n                    _sum51 = _sum01;\n                    _sum60 = _sum00;\n                    _sum61 = _sum01;\n                    _sum70 = _sum00;\n                    _sum71 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                    _sum40 = vdupq_n_f32(0.f);\n                    _sum41 = vdupq_n_f32(0.f);\n                    _sum50 = vdupq_n_f32(0.f);\n                    _sum51 = vdupq_n_f32(0.f);\n                    _sum60 = vdupq_n_f32(0.f);\n                    _sum61 = vdupq_n_f32(0.f);\n                    _sum70 = vdupq_n_f32(0.f);\n                    _sum71 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 
= vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n\n                pA += 8;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n                    vst1q_f32(outptr0 + 4 * 2, _sum20);\n                    vst1q_f32(outptr0 + 4 * 3, _sum30);\n                    vst1q_f32(outptr0 + 4 * 4, _sum40);\n                    vst1q_f32(outptr0 + 4 * 5, 
_sum50);\n                    vst1q_f32(outptr0 + 4 * 6, _sum60);\n                    vst1q_f32(outptr0 + 4 * 7, _sum70);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 2, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 3, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 4, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 5, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 6, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 7, _sum71);\n\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x8_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71);\n\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep, _sum10);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum20);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum30);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum40);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 5, _sum50);\n                    vst1q_f32(outptr0 + out_hstep * 5 + 4, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 6, _sum60);\n                    vst1q_f32(outptr0 + out_hstep * 6 + 4, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 7, _sum70);\n                    vst1q_f32(outptr0 + out_hstep * 7 
+ 4, _sum71);\n\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n            }\n\n            outptr += 64;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"subs   %0, %0, #64                 \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"add    x4, %8, #16                 \\n\"\n                \"ld1    {v24.4s}, [%8]              \\n\"\n                \"ld1    {v28.4s}, [x4]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n               
 \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v25.16b, v24.16b            \\n\"\n                \"mov    v26.16b, v24.16b            \\n\"\n                \"mov    v27.16b, v24.16b            \\n\"\n\n                \"mov    v29.16b, v28.16b            \\n\"\n                \"mov    v30.16b, v28.16b            \\n\"\n                \"mov    v31.16b, v28.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v25.4s, v4.4s, v0.s[1]      \\n\"\n                \"fmla   v26.4s, v4.4s, v0.s[2]      \\n\"\n                \"fmla   v27.4s, v4.4s, v0.s[3]      \\n\"\n                \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                \"fmla   v24.4s, v6.4s, v1.s[0]      \\n\"\n                \"fmla   v25.4s, v6.4s, v1.s[1]      \\n\"\n                \"fmla   v26.4s, v6.4s, v1.s[2]      \\n\"\n                \"fmla   v27.4s, v6.4s, v1.s[3]      \\n\"\n                \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n                \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                \"fmla   v30.4s, v7.4s, 
v1.s[2]      \\n\"\n                \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                \"fmla   v24.4s, v8.4s, v2.s[0]      \\n\"\n                \"fmla   v25.4s, v8.4s, v2.s[1]      \\n\"\n                \"fmla   v26.4s, v8.4s, v2.s[2]      \\n\"\n                \"fmla   v27.4s, v8.4s, v2.s[3]      \\n\"\n                \"fmla   v28.4s, v9.4s, v2.s[0]      \\n\"\n                \"fmla   v29.4s, v9.4s, v2.s[1]      \\n\"\n                \"fmla   v30.4s, v9.4s, v2.s[2]      \\n\"\n                \"fmla   v31.4s, v9.4s, v2.s[3]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v24.4s, v10.4s, v3.s[0]     \\n\"\n                \"fmla   v25.4s, v10.4s, v3.s[1]     \\n\"\n                \"fmla   v26.4s, v10.4s, v3.s[2]     \\n\"\n                \"fmla   v27.4s, v10.4s, v3.s[3]     \\n\"\n                \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n                \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.4s}, [%2], #16          \\n\"\n                \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n                \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v25.4s, v4.4s, v0.s[1]      \\n\"\n                \"fmla   v26.4s, v4.4s, v0.s[2]      \\n\"\n                \"fmla   v27.4s, v4.4s, v0.s[3]      \\n\"\n              
  \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"lsl    w4, %w13, #2                \\n\"\n                \"add    x4, %3, w4, sxtw 2          \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%3], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose8x4\n                \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                \"zip1   v0.2d, v22.2d, v24.2d       \\n\"\n                \"zip2   v1.2d, v22.2d, v24.2d       \\n\"\n                \"zip1   v2.2d, v23.2d, v25.2d       \\n\"\n                \"zip2   v3.2d, v23.2d, v25.2d       \\n\"\n                \"zip1   v4.2d, v26.2d, v28.2d       \\n\"\n                \"zip2   v5.2d, v26.2d, v28.2d       \\n\"\n  
              \"zip1   v6.2d, v27.2d, v29.2d       \\n\"\n                \"zip2   v7.2d, v27.2d, v29.2d       \\n\"\n\n                \"add    x4, %3, %w13, sxtw 2        \\n\"\n                \"st1    {v0.4s}, [%3], #16          \\n\"\n                \"st1    {v1.4s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v2.4s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v3.4s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v4.4s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v5.4s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v6.4s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v7.4s}, [x4]               \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #128                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                
\"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                    _sum20 = _sum00;\n                    _sum21 = _sum01;\n                    _sum30 = _sum00;\n                    _sum31 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n      
          float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x4_t _pB0 = vld1q_f32(pB);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n                    vst1q_f32(outptr0 + 4 * 2, _sum20);\n                    vst1q_f32(outptr0 + 4 * 3, _sum30);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 2, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 3, _sum31);\n\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x4_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31);\n\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + out_hstep * 1, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum10);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum20);\n                    vst1q_f32(outptr0 + out_hstep * 5, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 6, 
_sum30);\n                    vst1q_f32(outptr0 + out_hstep * 7, _sum31);\n\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n            }\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"add    x4, %8, #16                 \\n\"\n                \"ld1    {v28.4s}, [%8]              \\n\"\n                \"ld1    {v30.4s}, [x4]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v29.16b, v28.16b            \\n\"\n                \"mov    v31.16b, v30.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n         
       \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #256]       \\n\"\n                \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v4.4s, v0.s[1]      \\n\"\n                \"fmla   v30.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v31.4s, v5.4s, v0.s[1]      \\n\"\n                \"fmla   v28.4s, v6.4s, v0.s[2]      \\n\"\n                \"fmla   v29.4s, v6.4s, v0.s[3]      \\n\"\n                \"fmla   v30.4s, v7.4s, v0.s[2]      \\n\"\n                \"fmla   v31.4s, v7.4s, v0.s[3]      \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                \"fmla   v28.4s, v8.4s, v1.s[0]      \\n\"\n                \"fmla   v29.4s, v8.4s, v1.s[1]      \\n\"\n                \"fmla   v30.4s, v9.4s, v1.s[0]      \\n\"\n                \"fmla   v31.4s, v9.4s, v1.s[1]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v10.4s, v1.s[2]     \\n\"\n                \"fmla   v29.4s, v10.4s, v1.s[3]     \\n\"\n                \"fmla   v30.4s, v11.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v11.4s, v1.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.2s}, [%2], #8           \\n\"\n                \"ld1    
{v4.4s, v5.4s}, [%1], #32   \\n\"\n                \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v4.4s, v0.s[1]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v31.4s, v5.4s, v0.s[1]      \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"lsl    w4, %w13, #2                \\n\"\n                \"add    x4, %3, w4, sxtw 2          \\n\"\n                \"st1    {v28.4s, v29.4s}, [%3], #32 \\n\"\n                \"st1    {v30.4s, v31.4s}, [x4]      \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose8x2\n                \"zip1   v0.4s, v28.4s, v29.4s       \\n\"\n                \"zip2   v2.4s, v28.4s, v29.4s       \\n\"\n                \"zip1   v4.4s, v30.4s, v31.4s       \\n\"\n                \"zip2   v6.4s, v30.4s, v31.4s       \\n\"\n\n                \"mov    v1.d[0], v0.d[1]            \\n\"\n                \"mov    v3.d[0], v2.d[1]            \\n\"\n                \"mov    v5.d[0], v4.d[1]            \\n\"\n                \"mov    v7.d[0], v6.d[1]            \\n\"\n\n                \"add    x4, %3, %w13, sxtw 2        \\n\"\n                \"st1    {v0.2s}, [%3], #8           \\n\"\n                \"st1    {v1.2s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v2.2s}, [x4]               \\n\"\n                \"add    x4, x4, 
%w13, sxtw 2        \\n\"\n                \"st1    {v3.2s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v4.2s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v5.2s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v6.2s}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v7.2s}, [x4]               \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #64                 \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    
_sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x2_t _pB0 = vld1_f32(pB);\n\n                _sum00 = vfmaq_lane_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pA1, _pB0, 1);\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[8];\n                    float sum1[8];\n                    vst1q_f32(sum0, _sum00);\n                    vst1q_f32(sum0 + 4, _sum01);\n                    vst1q_f32(sum1, _sum10);\n                    vst1q_f32(sum1 + 4, _sum11);\n\n                    outptr0[0] = sum0[0];\n                    
outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0[out_hstep * 4 + 1] = sum1[4];\n                    outptr0[out_hstep * 5 + 1] = sum1[5];\n                    outptr0[out_hstep * 6 + 1] = sum1[6];\n                    outptr0[out_hstep * 7 + 1] = sum1[7];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n            }\n\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      2f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v30.4s, v31.4s}, [%8]      \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n 
               \"2:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"3:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #128]       \\n\"\n                \"ld1    {v0.4s}, [%2], #16          \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v30.4s, v6.4s, v0.s[1]      \\n\"\n                \"fmla   v31.4s, v7.4s, v0.s[1]      \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%1], #64 \\n\"\n                \"fmla   v28.4s, v8.4s, v0.s[2]      \\n\"\n                \"fmla   v29.4s, v9.4s, v0.s[2]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v10.4s, v0.s[3]     \\n\"\n                \"fmla   v31.4s, v11.4s, v0.s[3]     \\n\"\n                \"bne    3b                          \\n\"\n                \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                \"4:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    6f                          \\n\"\n\n                \"5:                                 \\n\"\n                \"ld1r   {v0.4s}, [%2], #4           \\n\"\n                \"ld1    {v4.4s, v5.4s}, [%1], #32   \\n\"\n 
               \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v4.4s, v0.4s        \\n\"\n                \"fmla   v31.4s, v5.4s, v0.4s        \\n\"\n                \"bne    5b                          \\n\"\n\n                \"6:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    9f                          \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    7f                          \\n\"\n\n                \"lsl    w4, %w13, #2                \\n\"\n                \"add    x4, %3, w4, sxtw 2          \\n\"\n                \"st1    {v30.4s}, [%3], #16         \\n\"\n                \"st1    {v31.4s}, [x4]              \\n\"\n                \"b      8f                          \\n\"\n\n                // if out_elempack == 1\n                \"7:                                 \\n\"\n                \"add    x4, %3, %w13, sxtw 2        \\n\"\n                \"st1    {v30.s}[0], [%3], #4        \\n\"\n                \"st1    {v30.s}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v30.s}[2], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v30.s}[3], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v31.s}[0], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v31.s}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v31.s}[2], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v31.s}[3], [x4]            \\n\"\n\n                \"8:                                 \\n\"\n                \"add    %0, %0, #32             
    \\n\"\n                \"b      10f                         \\n\"\n\n                \"9:                                 \\n\"\n                \"st1    {v30.4s, v31.4s}, [%0], #32 \\n\"\n\n                \"10:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x4_t _pB = vld1q_dup_f32(pB);\n\n                _sum00 = vfmaq_f32(_sum00, _pA0, _pB);\n                _sum01 = vfmaq_f32(_sum01, _pA1, _pB);\n\n                pA += 8;\n                pB += 1;\n            
}\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[8];\n                    vst1q_f32(sum0, _sum00);\n                    vst1q_f32(sum0 + 4, _sum01);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep * 1] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n            }\n\n            outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n\n        pAT += max_kk * 8;\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"subs   %0, %0, #128    
            \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v20.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v21.16b, v20.16b            \\n\"\n                \"mov    v22.16b, v20.16b            \\n\"\n                \"mov    v23.16b, v20.16b            \\n\"\n                \"mov    v24.16b, v20.16b            \\n\"\n                \"mov    v25.16b, v20.16b            \\n\"\n                \"mov    v26.16b, v20.16b            \\n\"\n                \"mov    v27.16b, v20.16b            \\n\"\n                \"mov    v28.16b, v20.16b            \\n\"\n                \"mov    v29.16b, v20.16b            \\n\"\n                \"mov    v30.16b, v20.16b            \\n\"\n                \"mov    v31.16b, v20.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v21.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v22.4s, v16.4s, v0.s[2]     \\n\"\n          
      \"fmla   v23.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v24.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v1.s[3]     \\n\"\n                \"fmla   v28.4s, v16.4s, v2.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v2.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v2.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v2.s[3]     \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n                \"fmla   v20.4s, v17.4s, v3.s[0]     \\n\"\n                \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                \"fmla   v22.4s, v17.4s, v3.s[2]     \\n\"\n                \"fmla   v23.4s, v17.4s, v3.s[3]     \\n\"\n                \"fmla   v24.4s, v17.4s, v4.s[0]     \\n\"\n                \"fmla   v25.4s, v17.4s, v4.s[1]     \\n\"\n                \"fmla   v26.4s, v17.4s, v4.s[2]     \\n\"\n                \"fmla   v27.4s, v17.4s, v4.s[3]     \\n\"\n                \"fmla   v28.4s, v17.4s, v5.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v5.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v5.s[2]     \\n\"\n                \"fmla   v31.4s, v17.4s, v5.s[3]     \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                \"fmla   v20.4s, v18.4s, v6.s[0]     \\n\"\n                \"fmla   v21.4s, v18.4s, v6.s[1]     \\n\"\n                \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                \"fmla   v23.4s, v18.4s, v6.s[3]     \\n\"\n                \"fmla   v24.4s, v18.4s, v7.s[0]     \\n\"\n                \"fmla   v25.4s, v18.4s, v7.s[1]     \\n\"\n                \"fmla   v26.4s, v18.4s, v7.s[2]     \\n\"\n                \"fmla   v27.4s, v18.4s, v7.s[3]     \\n\"\n      
          \"fmla   v28.4s, v18.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v0.s[1]     \\n\"\n                \"fmla   v30.4s, v18.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v18.4s, v0.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v20.4s, v19.4s, v1.s[0]     \\n\"\n                \"fmla   v21.4s, v19.4s, v1.s[1]     \\n\"\n                \"fmla   v22.4s, v19.4s, v1.s[2]     \\n\"\n                \"fmla   v23.4s, v19.4s, v1.s[3]     \\n\"\n                \"fmla   v24.4s, v19.4s, v2.s[0]     \\n\"\n                \"fmla   v25.4s, v19.4s, v2.s[1]     \\n\"\n                \"fmla   v26.4s, v19.4s, v2.s[2]     \\n\"\n                \"fmla   v27.4s, v19.4s, v2.s[3]     \\n\"\n                \"fmla   v28.4s, v19.4s, v3.s[0]     \\n\"\n                \"fmla   v29.4s, v19.4s, v3.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v3.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v3.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s}, [%2], #48 \\n\"\n                \"ld1    {v16.4s}, [%1], #16         \\n\"\n                \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v21.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v22.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v23.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v24.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v1.s[3] 
    \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v16.4s, v2.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v2.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v2.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v2.s[3]     \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%3], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x12\n                \"zip1   v18.4s, v20.4s, v21.4s      \\n\"\n                \"zip2   v19.4s, v20.4s, v21.4s      \\n\"\n                \"zip1   v20.4s, v22.4s, v23.4s      \\n\"\n                \"zip2   v21.4s, v22.4s, v23.4s      \\n\"\n                \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                \"zip1   v12.2d, v18.2d, v20.2d      \\n\"\n                \"zip2   v15.2d, v18.2d, v20.2d      \\n\"\n               
 \"zip1   v13.2d, v22.2d, v24.2d      \\n\"\n                \"zip2   v16.2d, v22.2d, v24.2d      \\n\"\n                \"zip1   v14.2d, v26.2d, v28.2d      \\n\"\n                \"zip2   v17.2d, v26.2d, v28.2d      \\n\"\n\n                \"zip1   v18.2d, v19.2d, v21.2d      \\n\"\n                \"zip2   v21.2d, v19.2d, v21.2d      \\n\"\n                \"zip1   v19.2d, v23.2d, v25.2d      \\n\"\n                \"zip2   v22.2d, v23.2d, v25.2d      \\n\"\n                \"zip1   v20.2d, v27.2d, v29.2d      \\n\"\n                \"zip2   v23.2d, v27.2d, v29.2d      \\n\"\n\n                \"add    x4, %3, %w13, sxtw 2        \\n\"\n                \"st1    {v12.4s, v13.4s, v14.4s}, [%3], #48 \\n\"\n                \"st1    {v15.4s, v16.4s, v17.4s}, [x4] \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v18.4s, v19.4s, v20.4s}, [x4] \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v21.4s, v22.4s, v23.4s}, [x4] \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #192                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                
\"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n            float32x4_t _sum8;\n            float32x4_t _sum9;\n            float32x4_t _suma;\n            float32x4_t _sumb;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                    _sum4 = _sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                    _sum8 = _sum0;\n                    _sum9 = _sum0;\n                    _suma = _sum0;\n                    _sumb = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n                    _sum4 = vdupq_n_f32(0.f);\n                    _sum5 = vdupq_n_f32(0.f);\n                    _sum6 = vdupq_n_f32(0.f);\n                    _sum7 = vdupq_n_f32(0.f);\n                    _sum8 = vdupq_n_f32(0.f);\n                    _sum9 = vdupq_n_f32(0.f);\n                    _suma = vdupq_n_f32(0.f);\n                    _sumb = vdupq_n_f32(0.f);\n                }\n     
       }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n                _sum8 = vld1q_f32(outptr + 4 * 8);\n                _sum9 = vld1q_f32(outptr + 4 * 9);\n                _suma = vld1q_f32(outptr + 4 * 10);\n                _sumb = vld1q_f32(outptr + 4 * 11);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n                _sum8 = vfmaq_laneq_f32(_sum8, _pA, _pB2, 0);\n                _sum9 = vfmaq_laneq_f32(_sum9, _pA, _pB2, 1);\n                _suma = vfmaq_laneq_f32(_suma, _pA, _pB2, 2);\n                _sumb = vfmaq_laneq_f32(_sumb, _pA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n 
                   vst1q_f32(outptr0 + 4 * 2, _sum2);\n                    vst1q_f32(outptr0 + 4 * 3, _sum3);\n                    vst1q_f32(outptr0 + 4 * 4, _sum4);\n                    vst1q_f32(outptr0 + 4 * 5, _sum5);\n                    vst1q_f32(outptr0 + 4 * 6, _sum6);\n                    vst1q_f32(outptr0 + 4 * 7, _sum7);\n                    vst1q_f32(outptr0 + 4 * 8, _sum8);\n                    vst1q_f32(outptr0 + 4 * 9, _sum9);\n                    vst1q_f32(outptr0 + 4 * 10, _suma);\n                    vst1q_f32(outptr0 + 4 * 11, _sumb);\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x12_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7, _sum8, _sum9, _suma, _sumb);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 8, _sum2);\n                    vst1q_f32(outptr0 + out_hstep, _sum3);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum4);\n                    vst1q_f32(outptr0 + out_hstep + 8, _sum5);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum6);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum7);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 8, _sum8);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum9);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _suma);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 8, _sumb);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n            
    vst1q_f32(outptr + 4 * 7, _sum7);\n                vst1q_f32(outptr + 4 * 8, _sum8);\n                vst1q_f32(outptr + 4 * 9, _sum9);\n                vst1q_f32(outptr + 4 * 10, _suma);\n                vst1q_f32(outptr + 4 * 11, _sumb);\n            }\n\n            outptr += 48;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"subs   %0, %0, #64                 \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v24.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v25.16b, v24.16b            \\n\"\n                \"mov    v26.16b, v24.16b            \\n\"\n                \"mov    v27.16b, v24.16b            \\n\"\n                \"mov    v28.16b, v24.16b            \\n\"\n                \"mov    v29.16b, v24.16b            \\n\"\n                \"mov    v30.16b, v24.16b            \\n\"\n                \"mov    v31.16b, v24.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                        
  \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                \"fmla   v24.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v28.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v1.s[3]     \\n\"\n                \"fmla   v24.4s, v17.4s, v2.s[0]     \\n\"\n                \"fmla   v25.4s, v17.4s, v2.s[1]     \\n\"\n                \"fmla   v26.4s, v17.4s, v2.s[2]     \\n\"\n                \"fmla   v27.4s, v17.4s, v2.s[3]     \\n\"\n                \"fmla   v28.4s, v17.4s, v3.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v3.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v3.s[2]     \\n\"\n                \"fmla   v31.4s, v17.4s, v3.s[3]     \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n                \"fmla   v24.4s, v18.4s, v4.s[0]     \\n\"\n                \"fmla   v25.4s, v18.4s, v4.s[1]     \\n\"\n                \"fmla   v26.4s, v18.4s, v4.s[2]     \\n\"\n                \"fmla   v27.4s, v18.4s, v4.s[3]     \\n\"\n                \"fmla   v28.4s, v18.4s, v5.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v5.s[1]     \\n\"\n                \"fmla   v30.4s, v18.4s, v5.s[2]     \\n\"\n                \"fmla   v31.4s, v18.4s, v5.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   
v24.4s, v19.4s, v6.s[0]     \\n\"\n                \"fmla   v25.4s, v19.4s, v6.s[1]     \\n\"\n                \"fmla   v26.4s, v19.4s, v6.s[2]     \\n\"\n                \"fmla   v27.4s, v19.4s, v6.s[3]     \\n\"\n                \"fmla   v28.4s, v19.4s, v7.s[0]     \\n\"\n                \"fmla   v29.4s, v19.4s, v7.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v7.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v7.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                \"ld1    {v16.4s}, [%1], #16         \\n\"\n                \"fmla   v24.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v0.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v1.s[3]     \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%3], #64 \\n\"\n          
      \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x8\n                \"zip1   v22.4s, v24.4s, v25.4s      \\n\"\n                \"zip2   v23.4s, v24.4s, v25.4s      \\n\"\n                \"zip1   v24.4s, v26.4s, v27.4s      \\n\"\n                \"zip2   v25.4s, v26.4s, v27.4s      \\n\"\n                \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                \"zip1   v12.2d, v22.2d, v24.2d      \\n\"\n                \"zip2   v14.2d, v22.2d, v24.2d      \\n\"\n                \"zip1   v13.2d, v26.2d, v28.2d      \\n\"\n                \"zip2   v15.2d, v26.2d, v28.2d      \\n\"\n\n                \"zip1   v16.2d, v23.2d, v25.2d      \\n\"\n                \"zip2   v18.2d, v23.2d, v25.2d      \\n\"\n                \"zip1   v17.2d, v27.2d, v29.2d      \\n\"\n                \"zip2   v19.2d, v27.2d, v29.2d      \\n\"\n\n                \"add    x4, %3, %w13, sxtw 2        \\n\"\n                \"st1    {v12.4s, v13.4s}, [%3], #32 \\n\"\n                \"st1    {v14.4s, v15.4s}, [x4]      \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v16.4s, v17.4s}, [x4]      \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v18.4s, v19.4s}, [x4]      \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #128                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"st1    
{v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0!, {d16-d23}      \\n\"\n                \"vldm       %0, {d24-d31}       \\n\"\n                \"sub        %0, %0, #64         \\n\"\n                \"b          3f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n                \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d16-d17}, [%8]     \\n\"\n                \"b          2f                  \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q8, q8              \\n\"\n\n                \"2:                             \\n\"\n                \"vmov       q9, q8              \\n\"\n                \"vmov       q10, q8             \\n\"\n                \"vmov       q11, q8             
\\n\"\n                \"vmov       q12, q8             \\n\"\n                \"vmov       q13, q8             \\n\"\n                \"vmov       q14, q8             \\n\"\n                \"vmov       q15, q8             \\n\"\n\n                \"3:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                \"4:                             \\n\"\n                \"pld        [%2, #512]          \\n\"\n                \"vldm       %2!, {d0-d7}        \\n\"\n                \"pld        [%1, #512]          \\n\"\n                \"vldm       %1!, {d8-d15}       \\n\"\n                \"vmla.f32   q8, q4, d0[0]       \\n\"\n                \"vmla.f32   q9, q4, d0[1]       \\n\"\n                \"vmla.f32   q10, q4, d1[0]      \\n\"\n                \"vmla.f32   q11, q4, d1[1]      \\n\"\n                \"vmla.f32   q12, q4, d2[0]      \\n\"\n                \"vmla.f32   q13, q4, d2[1]      \\n\"\n                \"vmla.f32   q14, q4, d3[0]      \\n\"\n                \"vmla.f32   q15, q4, d3[1]      \\n\"\n                \"vmla.f32   q8, q5, d4[0]       \\n\"\n                \"vmla.f32   q9, q5, d4[1]       \\n\"\n                \"vmla.f32   q10, q5, d5[0]      \\n\"\n                \"vmla.f32   q11, q5, d5[1]      \\n\"\n                \"vmla.f32   q12, q5, d6[0]      \\n\"\n                \"vmla.f32   q13, q5, d6[1]      \\n\"\n                \"vmla.f32   q14, q5, d7[0]      \\n\"\n                \"vmla.f32   q15, q5, d7[1]      \\n\"\n                \"pld        [%2, #512]          \\n\"\n                \"vldm       %2!, {d0-d7}        \\n\"\n                \"vmla.f32   q8, q6, d0[0]       \\n\"\n                \"vmla.f32   q9, q6, d0[1]       \\n\"\n                \"vmla.f32   q10, q6, d1[0]      \\n\"\n                \"vmla.f32   q11, q6, d1[1]      \\n\"\n         
       \"vmla.f32   q12, q6, d2[0]      \\n\"\n                \"vmla.f32   q13, q6, d2[1]      \\n\"\n                \"vmla.f32   q14, q6, d3[0]      \\n\"\n                \"vmla.f32   q15, q6, d3[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q8, q7, d4[0]       \\n\"\n                \"vmla.f32   q9, q7, d4[1]       \\n\"\n                \"vmla.f32   q10, q7, d5[0]      \\n\"\n                \"vmla.f32   q11, q7, d5[1]      \\n\"\n                \"vmla.f32   q12, q7, d6[0]      \\n\"\n                \"vmla.f32   q13, q7, d6[1]      \\n\"\n                \"vmla.f32   q14, q7, d7[0]      \\n\"\n                \"vmla.f32   q15, q7, d7[1]      \\n\"\n                \"bne        4b                  \\n\"\n\n                \"5:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        7f                  \\n\"\n\n                \"6:                             \\n\"\n                \"vld1.f32   {d0-d3}, [%2 :128]! \\n\"\n                \"vld1.f32   {d8-d9}, [%1 :128]! 
\\n\"\n                \"vmla.f32   q8, q4, d0[0]       \\n\"\n                \"vmla.f32   q9, q4, d0[1]       \\n\"\n                \"vmla.f32   q10, q4, d1[0]      \\n\"\n                \"vmla.f32   q11, q4, d1[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q12, q4, d2[0]      \\n\"\n                \"vmla.f32   q13, q4, d2[1]      \\n\"\n                \"vmla.f32   q14, q4, d3[0]      \\n\"\n                \"vmla.f32   q15, q4, d3[1]      \\n\"\n                \"bne        6b                  \\n\"\n\n                \"7:                             \\n\"\n                \"cmp        %11, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        8f                  \\n\"\n\n                \"vstm       %3!, {d16-d23}      \\n\"\n                \"vstm       %3!, {d24-d31}      \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // transpose4x8\n                \"vtrn.32    q8, q9              \\n\"\n                \"vtrn.32    q10, q11            \\n\"\n                \"vtrn.32    q12, q13            \\n\"\n                \"vtrn.32    q14, q15            \\n\"\n                \"vswp       d17, d20            \\n\"\n                \"vswp       d19, d22            \\n\"\n                \"vswp       d25, d28            \\n\"\n                \"vswp       d27, d30            \\n\"\n                \"vswp       q9, q12             \\n\"\n                \"vswp       q11, q14            \\n\"\n\n                \"add        r4, %3, %13, lsl #2 \\n\"\n                \"vst1.f32   {d16-d19}, [%3 :128]! 
\\n\"\n                \"vst1.f32   {d24-d27}, [r4 :128] \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d20-d23}, [r4 :128] \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d28-d31}, [r4 :128] \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #128        \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vstm       %0!, {d16-d23}      \\n\"\n                \"vstm       %0!, {d24-d31}      \\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                    _sum4 = 
_sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n                    _sum4 = vdupq_n_f32(0.f);\n                    _sum5 = vdupq_n_f32(0.f);\n                    _sum6 = vdupq_n_f32(0.f);\n                    _sum7 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB0), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB0), 1);\n                _sum2 = 
vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB0), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB0), 1);\n                _sum4 = vmlaq_lane_f32(_sum4, _pA, vget_low_f32(_pB1), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _pA, vget_low_f32(_pB1), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _pA, vget_high_f32(_pB1), 0);\n                _sum7 = vmlaq_lane_f32(_sum7, _pA, vget_high_f32(_pB1), 1);\n#endif\n\n                pA += 4;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 4 * 2, _sum2);\n                    vst1q_f32(outptr0 + 4 * 3, _sum3);\n                    vst1q_f32(outptr0 + 4 * 4, _sum4);\n                    vst1q_f32(outptr0 + 4 * 5, _sum5);\n                    vst1q_f32(outptr0 + 4 * 6, _sum6);\n                    vst1q_f32(outptr0 + 4 * 7, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x8_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + out_hstep, _sum2);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum3);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum4);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum6);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _sum7);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n          
      vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n            }\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v28.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v29.16b, v28.16b            \\n\"\n                \"mov    v30.16b, v28.16b            \\n\"\n                \"mov    v31.16b, v28.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     
\\n\"\n                \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v28.4s, v17.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v17.4s, v1.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v18.4s, v2.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v2.s[1]     \\n\"\n                \"fmla   v30.4s, v18.4s, v2.s[2]     \\n\"\n                \"fmla   v31.4s, v18.4s, v2.s[3]     \\n\"\n                \"fmla   v28.4s, v19.4s, v3.s[0]     \\n\"\n                \"fmla   v29.4s, v19.4s, v3.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v3.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v3.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.4s}, [%2], #16          \\n\"\n                \"ld1    {v16.4s}, [%1], #16         \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v0.s[3]     \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f           
              \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x4\n                \"zip1   v26.4s, v28.4s, v29.4s      \\n\"\n                \"zip2   v27.4s, v28.4s, v29.4s      \\n\"\n                \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                \"zip1   v12.2d, v26.2d, v28.2d      \\n\"\n                \"zip2   v13.2d, v26.2d, v28.2d      \\n\"\n                \"zip1   v14.2d, v27.2d, v29.2d      \\n\"\n                \"zip2   v15.2d, v27.2d, v29.2d      \\n\"\n\n                \"add    x4, %3, %w13, sxtw 2        \\n\"\n                \"st1    {v12.4s}, [%3], #16         \\n\"\n                \"st1    {v13.4s}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v14.4s}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v15.4s}, [x4]              \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #64                 \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n      
          \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0, {d24-d31}       \\n\"\n                \"b          3f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n                \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d24-d25}, [%8]     \\n\"\n                \"b          2f                  \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q12, q12            \\n\"\n\n                \"2:                             \\n\"\n                \"vmov       q13, q12            \\n\"\n                \"vmov       q14, q12            \\n\"\n                \"vmov       q15, q12            \\n\"\n\n                \"3:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                \"4:                             \\n\"\n                \"pld        [%2, #512]          \\n\"\n                \"vldm       %2!, {d0-d7}        \\n\"\n                \"pld        [%1, #512]          \\n\"\n                \"vldm       %1!, {d8-d15}       \\n\"\n                
\"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q4, d0[1]      \\n\"\n                \"vmla.f32   q14, q4, d1[0]      \\n\"\n                \"vmla.f32   q15, q4, d1[1]      \\n\"\n                \"vmla.f32   q12, q5, d2[0]      \\n\"\n                \"vmla.f32   q13, q5, d2[1]      \\n\"\n                \"vmla.f32   q14, q5, d3[0]      \\n\"\n                \"vmla.f32   q15, q5, d3[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q12, q6, d4[0]      \\n\"\n                \"vmla.f32   q13, q6, d4[1]      \\n\"\n                \"vmla.f32   q14, q6, d5[0]      \\n\"\n                \"vmla.f32   q15, q6, d5[1]      \\n\"\n                \"vmla.f32   q12, q7, d6[0]      \\n\"\n                \"vmla.f32   q13, q7, d6[1]      \\n\"\n                \"vmla.f32   q14, q7, d7[0]      \\n\"\n                \"vmla.f32   q15, q7, d7[1]      \\n\"\n                \"bne        4b                  \\n\"\n\n                \"5:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        7f                  \\n\"\n\n                \"6:                             \\n\"\n                \"vld1.f32   {d0-d1}, [%2 :128]! \\n\"\n                \"vld1.f32   {d8-d9}, [%1 :128]! 
\\n\"\n                \"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q4, d0[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q14, q4, d1[0]      \\n\"\n                \"vmla.f32   q15, q4, d1[1]      \\n\"\n                \"bne        6b                  \\n\"\n\n                \"7:                             \\n\"\n                \"cmp        %11, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        8f                  \\n\"\n\n                \"vstm       %3!, {d24-d31}      \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // transpose4x4\n                \"vtrn.32    q12, q13            \\n\"\n                \"vtrn.32    q14, q15            \\n\"\n                \"vswp       d25, d28            \\n\"\n                \"vswp       d27, d30            \\n\"\n\n                \"add        r4, %3, %13, lsl #2 \\n\"\n                \"vst1.f32   {d24-d25}, [%3 :128]! 
\\n\"\n                \"vst1.f32   {d26-d27}, [r4 :128] \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d28-d29}, [r4 :128] \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d30-d31}, [r4 :128] \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #64         \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vstm       %0!, {d24-d31}      \\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n 
               }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB = vld1q_f32(pB);\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB), 1);\n#endif\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 4 * 2, _sum2);\n                    vst1q_f32(outptr0 + 4 * 3, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x4_ps(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + out_hstep * 1, _sum1);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum3);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n   
             vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n            }\n\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v30.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v31.16b, v30.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #256]       \\n\"\n                \"ld1    {v0.4s, v1.4s}, [%2], #32   \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v0.s[2]     \\n\"\n       
         \"fmla   v31.4s, v17.4s, v0.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v18.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v1.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n                \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.2s}, [%2], #8           \\n\"\n                \"ld1    {v16.4s}, [%1], #16         \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v31.4s, v16.4s, v0.s[1]     \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v30.4s, v31.4s}, [%3], #32 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x2\n                \"zip1   v28.4s, v30.4s, v31.4s      \\n\"\n                \"zip2   v29.4s, v30.4s, v31.4s      \\n\"\n\n                \"add    x4, %3, %w13, sxtw 2     
   \\n\"\n                \"st1    {v28.d}[0], [%3], #8        \\n\"\n                \"st1    {v28.d}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v29.d}[0], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v29.d}[1], [x4]            \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #32                 \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v30.4s, v31.4s}, [%0], #32 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v16\", \"v17\", \"v18\", \"v19\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vld1.f32   {d28-d31}, [%0 :128] \\n\"\n                \"b          3f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n                \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d28-d29}, [%8]     \\n\"\n                \"b          2f               
   \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q14, q14            \\n\"\n\n                \"2:                             \\n\"\n                \"vmov       q15, q14            \\n\"\n\n                \"3:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                \"veor       q12, q12            \\n\"\n                \"veor       q13, q13            \\n\"\n                \"4:                             \\n\"\n                \"pld        [%2, #256]          \\n\"\n                \"vld1.f32   {d0-d3}, [%2 :128]! \\n\"\n                \"pld        [%1, #512]          \\n\"\n                \"vldm       %1!, {d8-d15}       \\n\"\n                \"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q4, d0[1]      \\n\"\n                \"vmla.f32   q14, q5, d1[0]      \\n\"\n                \"vmla.f32   q15, q5, d1[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q12, q6, d2[0]      \\n\"\n                \"vmla.f32   q13, q6, d2[1]      \\n\"\n                \"vmla.f32   q14, q7, d3[0]      \\n\"\n                \"vmla.f32   q15, q7, d3[1]      \\n\"\n                \"bne        4b                  \\n\"\n                \"vadd.f32   q14, q14, q12       \\n\"\n                \"vadd.f32   q15, q15, q13       \\n\"\n\n                \"5:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        7f                  \\n\"\n\n                \"6:                             \\n\"\n                \"vld1.f32   {d0}, [%2 :64]!     \\n\"\n                \"vld1.f32   {d8-d9}, [%1 :128]! 
\\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q14, q4, d0[0]      \\n\"\n                \"vmla.f32   q15, q4, d0[1]      \\n\"\n                \"bne        6b                  \\n\"\n\n                \"7:                             \\n\"\n                \"cmp        %11, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        8f                  \\n\"\n\n                \"vst1.f32   {d28-d31}, [%3 :128]! \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // transpose4x2\n                \"vtrn.32    q14, q15            \\n\"\n\n                \"add        r4, %3, %13, lsl #2 \\n\"\n                \"vst1.f32   {d28}, [%3 :64]!    \\n\"\n                \"vst1.f32   {d30}, [r4 :64]     \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d29}, [r4 :64]     \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d31}, [r4 :64]     \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #32         \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vst1.f32   {d28-d31}, [%0 :128]! 
\\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x2_t _pB = vld1_f32(pB);\n\n#if __aarch64__\n                _sum0 = vfmaq_lane_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_lane_f32(_sum1, _pA, _pB, 1);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, _pB, 1);\n#endif\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n          
          vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[4];\n                    float sum1[4];\n                    vst1q_f32(sum0, _sum0);\n                    vst1q_f32(sum1, _sum1);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const float* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v31.4s}, [%0]              \\n\"\n                \"b      2f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v31.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0        
              \\n\"\n                \"beq    4f                          \\n\"\n\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"3:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #128]       \\n\"\n                \"ld1    {v0.4s}, [%2], #16          \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v0.s[1]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v18.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v0.s[3]     \\n\"\n                \"bne    3b                          \\n\"\n                \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v30.4s      \\n\"\n\n                \"4:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    6f                          \\n\"\n\n                \"5:                                 \\n\"\n                \"ld1r   {v0.4s}, [%2], #4           \\n\"\n                \"ld1    {v16.4s}, [%1], #16         \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v31.4s, v16.4s, v0.4s       \\n\"\n                \"bne    5b                          \\n\"\n\n                \"6:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    9f                          \\n\"\n\n                // if out_elempack == 4\n                \"cmp   
 %w12, #4                    \\n\"\n                \"bne    7f                          \\n\"\n\n                \"st1    {v31.4s}, [%3], #16         \\n\"\n                \"b      8f                          \\n\"\n\n                // if out_elempack == 1\n                \"7:                                 \\n\"\n                \"add    x4, %3, %w13, sxtw 2        \\n\"\n                \"st1    {v31.s}[0], [%3], #4        \\n\"\n                \"st1    {v31.s}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v31.s}[2], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 2        \\n\"\n                \"st1    {v31.s}[3], [x4]            \\n\"\n\n                \"8:                                 \\n\"\n                \"add    %0, %0, #16                 \\n\"\n                \"b      10f                         \\n\"\n\n                \"9:                                 \\n\"\n                \"st1    {v31.4s}, [%0], #16         \\n\"\n\n                \"10:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v16\", \"v17\", \"v18\", \"v19\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vld1.f32   {d30-d31}, [%0 :128] 
\\n\"\n                \"b          2f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n                \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d30-d31}, [%8]     \\n\"\n                \"b          2f                  \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q15, q15            \\n\"\n\n                \"2:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        4f                  \\n\"\n\n                \"veor       q12, q12            \\n\"\n                \"veor       q13, q13            \\n\"\n                \"veor       q14, q14            \\n\"\n                \"3:                             \\n\"\n                \"pld        [%2, #128]          \\n\"\n                \"vld1.f32   {d0-d1}, [%2 :64]!  
\\n\"\n                \"pld        [%1, #512]          \\n\"\n                \"vldm       %1!, {d8-d15}       \\n\"\n                \"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q5, d0[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q14, q6, d1[0]      \\n\"\n                \"vmla.f32   q15, q7, d1[1]      \\n\"\n                \"bne        3b                  \\n\"\n                \"vadd.f32   q14, q14, q12       \\n\"\n                \"vadd.f32   q15, q15, q13       \\n\"\n                \"vadd.f32   q15, q15, q14       \\n\"\n\n                \"4:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        6f                  \\n\"\n\n                \"5:                             \\n\"\n                \"vld1.f32   {d0[0]}, [%2]!      \\n\"\n                \"vld1.f32   {d8-d9}, [%1 :128]! \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q15, q4, d0[0]      \\n\"\n                \"bne        5b                  \\n\"\n\n                \"6:                             \\n\"\n                \"cmp        %11, #0             \\n\"\n                \"beq        9f                  \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        7f                  \\n\"\n\n                \"vst1.f32   {d30-d31}, [%3 :128]! \\n\"\n                \"b          8f                  \\n\"\n\n                // if out_elempack == 1\n                \"7:                             \\n\"\n                \"add        r4, %3, %13, lsl #2 \\n\"\n                \"vst1.f32   {d30[0]}, [%3]!     
\\n\"\n                \"vst1.f32   {d30[1]}, [r4]      \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d31[0]}, [r4]      \\n\"\n                \"add        r4, r4, %13, lsl #2 \\n\"\n                \"vst1.f32   {d31[1]}, [r4]      \\n\"\n\n                \"8:                             \\n\"\n                \"add        %0, %0, #16         \\n\"\n                \"b          10f                 \\n\"\n\n                \"9:                             \\n\"\n                \"vst1.f32   {d30-d31}, [%0 :128]! \\n\"\n\n                \"10:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB = vdupq_n_f32(pB[0]);\n\n#if __aarch64__\n                _sum0 = 
vfmaq_f32(_sum0, _pA, _pB);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA, _pB);\n#endif\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[4];\n                    vst1q_f32(sum0, _sum0);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n            }\n\n            outptr += 4;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n\n        pAT += max_kk * 4;\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum02;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum12;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vdupq_n_f32(pC[0]);\n                    _sum01 = vdupq_n_f32(pC[0]);\n                    _sum02 = vdupq_n_f32(pC[0]);\n                    _sum10 = vdupq_n_f32(pC[1]);\n                    _sum11 = vdupq_n_f32(pC[1]);\n                    _sum12 = vdupq_n_f32(pC[1]);\n                }\n                else\n                {\n    
                _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum02 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum12 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                float32x4x2_t _tmp45 = vld2q_f32(outptr + 16);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum02 = _tmp45.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n                _sum12 = _tmp45.val[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                float32x2_t _pA = vld1_f32(pA);\n\n                _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum02 = vfmaq_lane_f32(_sum02, _pB2, _pA, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n                _sum12 = vfmaq_lane_f32(_sum12, _pB2, _pA, 1);\n\n                pA += 2;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + 8, _sum02);\n                    vst1q_f32(outptr0 + out_hstep, _sum10);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum11);\n                    
vst1q_f32(outptr0 + out_hstep + 8, _sum12);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float32x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n                vst2q_f32(outptr + 16, _tmp45);\n            }\n\n            outptr += 24;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vdupq_n_f32(pC[0]);\n                    _sum01 = vdupq_n_f32(pC[0]);\n                    _sum10 = vdupq_n_f32(pC[1]);\n                    _sum11 = vdupq_n_f32(pC[1]);\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n        
        float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                float32x2_t _pA = vld1_f32(pA);\n#if __aarch64__\n                _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n#else\n                _sum00 = vmlaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vmlaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum10 = vmlaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vmlaq_lane_f32(_sum11, _pB1, _pA, 1);\n#endif\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep, _sum10);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum11);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vdupq_n_f32(pC[0]);\n                    _sum1 = vdupq_n_f32(pC[1]);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n            
    }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                _sum0 = _tmp01.val[0];\n                _sum1 = _tmp01.val[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB = vld1q_f32(pB);\n\n                float32x2_t _pA = vld1_f32(pA);\n#if __aarch64__\n                _sum0 = vfmaq_lane_f32(_sum0, _pB, _pA, 0);\n                _sum1 = vfmaq_lane_f32(_sum1, _pB, _pA, 1);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pB, _pA, 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pB, _pA, 1);\n#endif\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + out_hstep, _sum1);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2q_f32(outptr, _tmp01);\n            }\n\n            outptr += 8;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum00;\n            float sum01;\n            float sum10;\n            float sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum00 = pC[0];\n                    sum01 = pC[1];\n                    sum10 = pC[0];\n                    sum11 = pC[1];\n                }\n                else\n                {\n                    sum00 = 0.f;\n                    sum01 = 0.f;\n                    sum10 = 0.f;\n                    sum11 = 0.f;\n                }\n            }\n            else\n            {\n                sum00 
= outptr[0];\n                sum01 = outptr[1];\n                sum10 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum00 += pA[0] * pB[0];\n                sum01 += pA[1] * pB[0];\n                sum10 += pA[0] * pB[1];\n                sum11 += pA[1] * pB[1];\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum00;\n                    outptr0[1] = sum10;\n                    outptr0[out_hstep] = sum01;\n                    outptr0[out_hstep + 1] = sum11;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n            }\n\n            outptr += 4;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum0 = pC[0];\n                    sum1 = pC[1];\n                }\n                else\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[1] * pB[0];\n                pA += 2;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    
outptr0[0] = sum0;\n                    outptr0[out_hstep] = sum1;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vdupq_n_f32(pC[0]);\n                    _sum1 = vdupq_n_f32(pC[0]);\n                    _sum2 = vdupq_n_f32(pC[0]);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n                _sum2 = vld1q_f32(outptr + 8);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                float32x4_t _pA0 = vdupq_n_f32(pA[0]);\n\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n                _sum2 = vfmaq_f32(_sum2, _pA0, _pB2);\n\n                pA += 1;\n                pB += 12;\n           
 }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 8, _sum2);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 8, _sum2);\n            }\n\n            outptr += 12;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vdupq_n_f32(pC[0]);\n                    _sum1 = vdupq_n_f32(pC[0]);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                float32x4_t _pA0 = vdupq_n_f32(pA[0]);\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n#endif\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 
+ 4, _sum1);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum = vdupq_n_f32(pC[0]);\n                }\n                else\n                {\n                    _sum = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum = vld1q_f32(outptr);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB = vld1q_f32(pB);\n                float32x4_t _pA = vdupq_n_f32(pA[0]);\n\n#if __aarch64__\n                _sum = vfmaq_f32(_sum, _pA, _pB);\n#else\n                _sum = vmlaq_f32(_sum, _pA, _pB);\n#endif\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum);\n            }\n\n            outptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum0 = pC[0];\n                    sum1 = pC[0];\n                }\n                else\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n    
            sum1 = outptr[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[0] * pB[1];\n\n                pA += 1;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[1] = sum1;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum = pC[0];\n                }\n                else\n                {\n                    sum = 0.f;\n                }\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum += pA[0] * pB[0];\n                pA += 1;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n\nstatic void convolution_im2col_gemm_get_optimal_tile_mnk(int M, int N, int K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const int l2_cache_size_fp32 = (int)(get_cpu_level2_cache_size() / sizeof(float));\n\n    if (nT 
== 0)\n        nT = get_physical_big_cpu_count();\n\n    // solve K\n    {\n        // try not to split K\n#if __aarch64__\n        int tile_size = (l2_cache_size_fp32 - 32) / 12;\n#elif __ARM_NEON\n        int tile_size = (l2_cache_size_fp32 - 16) / 8;\n#else\n        int tile_size = (l2_cache_size_fp32 - 2) / 3;\n#endif\n\n#if __aarch64__\n        TILE_K = std::max(8, tile_size / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = std::max(4, tile_size / 4 * 4);\n#else\n        TILE_K = std::max(2, tile_size / 2 * 2);\n#endif\n\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n#if __aarch64__\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 3) / 4 * 4);\n#else\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 1) / 2 * 2);\n#endif\n    }\n\n    // solve M\n    {\n#if __aarch64__\n        int nn_M = (M + 31) / 32;\n#elif __ARM_NEON\n        int nn_M = (M + 15) / 16;\n#else\n        int nn_M = (M + 7) / 8;\n#endif\n\n#if __aarch64__\n        TILE_M = std::max(8, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_M = std::max(4, ((M + nn_M - 1) / nn_M + 3) / 4 * 4);\n#else\n        TILE_M = std::max(2, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n    }\n\n    {\n        TILE_M *= std::min(nT, get_physical_cpu_count());\n\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n#if __aarch64__\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 3) / 4 * 4);\n#else\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n\n        if (nT > 1)\n        {\n#if __aarch64__\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n#elif __ARM_NEON\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 3) / 4 * 4);\n#else\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 1) / 2 * 
2);\n#endif\n        }\n    }\n\n    if (N > 0)\n    {\n        int tile_size;\n        if (TILE_K >= K)\n        {\n            tile_size = (l2_cache_size_fp32 - TILE_M * TILE_K) / TILE_K;\n        }\n        else\n        {\n            tile_size = (l2_cache_size_fp32 - TILE_M * TILE_K) / (TILE_M + TILE_K);\n        }\n\n#if __aarch64__\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#else\n        TILE_N = std::max(1, tile_size);\n#endif\n\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n#if __aarch64__\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#else\n        TILE_N = std::min(TILE_N, (N + nn_N - 1) / nn_N);\n#endif\n\n#if __aarch64__\n        TILE_N = std::max(4, TILE_N);\n#elif __ARM_NEON\n        TILE_N = std::max(4, TILE_N);\n#else\n        TILE_N = std::max(1, TILE_N);\n#endif\n    }\n}\n\nstatic void convolution_im2col_input_tile_conv1x1s1d1(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk)\n{\n    const int elempack = bottom_blob.elempack;\n\n    float* pp = B;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                // transpose4x12\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v8.4s, v9.4s, v10.4s, 
v11.4s}, [%0] \\n\"\n                    \"st1    {v0.4s}, [%1], #16          \\n\"\n                    \"st1    {v4.4s}, [%1], #16          \\n\"\n                    \"st1    {v8.4s}, [%1], #16          \\n\"\n                    \"sub    %0, %0, #128                \\n\"\n                    \"st1    {v1.4s}, [%1], #16          \\n\"\n                    \"st1    {v5.4s}, [%1], #16          \\n\"\n                    \"st1    {v9.4s}, [%1], #16          \\n\"\n                    \"st1    {v2.4s}, [%1], #16          \\n\"\n                    \"st1    {v6.4s}, [%1], #16          \\n\"\n                    \"st1    {v10.4s}, [%1], #16         \\n\"\n                    \"st1    {v3.4s}, [%1], #16          \\n\"\n                    \"st1    {v7.4s}, [%1], #16          \\n\"\n                    \"st1    {v11.4s}, [%1], #16         \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\");\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0 = vld4q_f32(p0);\n                float32x4x4_t _r1 = vld4q_f32(p0 + 16);\n                float32x4x4_t _r2 = vld4q_f32(p0 + 32);\n                vst1q_f32(pp, _r0.val[0]);\n                vst1q_f32(pp + 4, _r1.val[0]);\n                vst1q_f32(pp + 4 * 2, _r2.val[0]);\n                vst1q_f32(pp + 4 * 3, _r0.val[1]);\n                vst1q_f32(pp + 4 * 4, _r1.val[1]);\n                vst1q_f32(pp + 4 * 5, _r2.val[1]);\n                vst1q_f32(pp + 4 * 6, _r0.val[2]);\n                vst1q_f32(pp + 4 * 7, _r1.val[2]);\n                vst1q_f32(pp + 4 * 8, _r2.val[2]);\n                vst1q_f32(pp + 4 * 9, _r0.val[3]);\n                vst1q_f32(pp + 4 * 10, _r1.val[3]);\n                vst1q_f32(pp + 4 * 11, _r2.val[3]);\n                pp += 48;\n#endif // 
NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _r0 = vld1q_f32(p0);\n                float32x4_t _r1 = vld1q_f32(p0 + 4);\n                float32x4_t _r2 = vld1q_f32(p0 + 8);\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r1);\n                vst1q_f32(pp + 8, _r2);\n                pp += 12;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                // transpose4x8\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0] \\n\"\n                    \"st1    {v0.4s}, [%1], #16          \\n\"\n                    \"st1    {v4.4s}, [%1], #16          \\n\"\n                    \"st1    {v1.4s}, [%1], #16          \\n\"\n                    \"st1    {v5.4s}, [%1], #16          \\n\"\n                    \"sub    %0, %0, #64                 \\n\"\n                    \"st1    {v2.4s}, [%1], #16          \\n\"\n                    \"st1    {v6.4s}, [%1], #16          \\n\"\n                    \"st1    {v3.4s}, [%1], #16          \\n\"\n                    \"st1    {v7.4s}, [%1], #16          \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n         
           : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #512]          \\n\"\n                    \"vldm       %0!, {d0-d7}        \\n\"\n                    \"pld        [%0, #512]          \\n\"\n                    \"vldm       %0, {d16-d23}       \\n\"\n                    \"vzip.32    q0, q1              \\n\"\n                    \"vzip.32    q2, q3              \\n\"\n                    \"vzip.32    q8, q9              \\n\"\n                    \"vzip.32    q10, q11            \\n\"\n                    \"vswp       d1, d4              \\n\"\n                    \"vswp       d3, d6              \\n\"\n                    \"vswp       d17, d20            \\n\"\n                    \"vswp       d19, d22            \\n\"\n                    \"vswp       q1, q8              \\n\"\n                    \"vswp       q3, q10             \\n\"\n                    \"sub        %0, %0, #64         \\n\"\n                    \"vstm       %1!, {d0-d7}        \\n\"\n                    \"vstm       %1!, {d16-d23}      \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0 = vld4q_f32(p0);\n                float32x4x4_t _r1 = vld4q_f32(p0 + 16);\n                vst1q_f32(pp, _r0.val[0]);\n                vst1q_f32(pp + 4, _r1.val[0]);\n                vst1q_f32(pp + 4 * 2, _r0.val[1]);\n                vst1q_f32(pp + 4 * 3, _r1.val[1]);\n                vst1q_f32(pp + 4 * 4, _r0.val[2]);\n                vst1q_f32(pp + 4 * 5, _r1.val[2]);\n                vst1q_f32(pp + 4 * 6, _r0.val[3]);\n        
        vst1q_f32(pp + 4 * 7, _r1.val[3]);\n                pp += 32;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _r0 = vld1q_f32(p0);\n                float32x4_t _r1 = vld1q_f32(p0 + 4);\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r1);\n                pp += 8;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                // transpose4x4\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                    \"st4    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #512]          \\n\"\n                    \"vldm       %0, {d0-d7}         \\n\"\n                    \"vtrn.32    q0, q1              \\n\"\n                    \"vtrn.32    q2, q3              \\n\"\n                    \"vswp       d1, d4              \\n\"\n                    \"vswp       d3, d6              \\n\"\n                    \"vstm       %1!, {d0-d7}        \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // 
%1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4x4_t _r0;\n                _r0.val[0] = vld1q_f32(p0);\n                _r0.val[1] = vld1q_f32(p0 + 4);\n                _r0.val[2] = vld1q_f32(p0 + 4 * 2);\n                _r0.val[3] = vld1q_f32(p0 + 4 * 3);\n                vst4q_f32(pp, _r0);\n                pp += 16;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k / elempack) + (j + jj) * elempack;\n\n            int kk = 0;\n            for (; kk < max_kk / elempack; kk++)\n            {\n                // transpose4x2\n                float32x4x2_t _r0;\n                _r0.val[0] = vld1q_f32(p0);\n                _r0.val[1] = vld1q_f32(p0 + 4);\n                vst2q_f32(pp, _r0);\n                pp += 8;\n                p0 += bottom_blob.cstep * elempack;\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n#if __ARM_NEON\n                vst1_f32(pp, vld1_f32(p0));\n#else\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n#endif // __ARM_NEON\n                pp += 2;\n                p0 += 
bottom_blob.cstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj++)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_input_tile(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h)\n{\n    if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        convolution_im2col_input_tile_conv1x1s1d1(bottom_blob, B, j, max_jj, k, max_kk);\n        return;\n    }\n\n    const int w = bottom_blob.w;\n    // const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n\n    // j max_jj     outw*outh    split w and h\n\n    // k max_kk     pa*maxk*(inch/pa)    split inch\n\n    // k/max_kk shall be multiple of maxk\n\n    const int maxk = kernel_w * kernel_h;\n\n    float* pp = B;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dy2 = (j + jj + 2) / outw;\n        int dy3 = (j + jj + 3) / outw;\n        
int dy4 = (j + jj + 4) / outw;\n        int dy5 = (j + jj + 5) / outw;\n        int dy6 = (j + jj + 6) / outw;\n        int dy7 = (j + jj + 7) / outw;\n        int dy8 = (j + jj + 8) / outw;\n        int dy9 = (j + jj + 9) / outw;\n        int dya = (j + jj + 10) / outw;\n        int dyb = (j + jj + 11) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n        int dx2 = (j + jj + 2) % outw;\n        int dx3 = (j + jj + 3) % outw;\n        int dx4 = (j + jj + 4) % outw;\n        int dx5 = (j + jj + 5) % outw;\n        int dx6 = (j + jj + 6) % outw;\n        int dx7 = (j + jj + 7) % outw;\n        int dx8 = (j + jj + 8) % outw;\n        int dx9 = (j + jj + 9) % outw;\n        int dxa = (j + jj + 10) % outw;\n        int dxb = (j + jj + 11) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int x2 = stride_w * dx2 + dilation_w * v;\n            int x3 = stride_w * dx3 + dilation_w * v;\n            int x4 = stride_w * dx4 + dilation_w * v;\n            int x5 = stride_w * dx5 + dilation_w * v;\n            int x6 = stride_w * dx6 + dilation_w * v;\n            int x7 = stride_w * dx7 + dilation_w * v;\n            int x8 = stride_w * dx8 + dilation_w * v;\n            int x9 = stride_w * dx9 + dilation_w * v;\n            int xa = stride_w * dxa + dilation_w * v;\n            int xb = stride_w * dxb + dilation_w * v;\n\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n            int y2 = stride_h * dy2 + dilation_h * u;\n            int y3 = stride_h * dy3 + dilation_h * u;\n            int 
y4 = stride_h * dy4 + dilation_h * u;\n            int y5 = stride_h * dy5 + dilation_h * u;\n            int y6 = stride_h * dy6 + dilation_h * u;\n            int y7 = stride_h * dy7 + dilation_h * u;\n            int y8 = stride_h * dy8 + dilation_h * u;\n            int y9 = stride_h * dy9 + dilation_h * u;\n            int ya = stride_h * dya + dilation_h * u;\n            int yb = stride_h * dyb + dilation_h * u;\n\n            const float* sptr0 = img.row(y0) + x0 * elempack;\n            const float* sptr1 = img.row(y1) + x1 * elempack;\n            const float* sptr2 = img.row(y2) + x2 * elempack;\n            const float* sptr3 = img.row(y3) + x3 * elempack;\n            const float* sptr4 = img.row(y4) + x4 * elempack;\n            const float* sptr5 = img.row(y5) + x5 * elempack;\n            const float* sptr6 = img.row(y6) + x6 * elempack;\n            const float* sptr7 = img.row(y7) + x7 * elempack;\n            const float* sptr8 = img.row(y8) + x8 * elempack;\n            const float* sptr9 = img.row(y9) + x9 * elempack;\n            const float* sptra = img.row(ya) + xa * elempack;\n            const float* sptrb = img.row(yb) + xb * elempack;\n\n            if (elempack == 4)\n            {\n                float32x4_t _r0 = vld1q_f32(sptr0);\n                float32x4_t _r1 = vld1q_f32(sptr1);\n                float32x4_t _r2 = vld1q_f32(sptr2);\n                float32x4_t _r3 = vld1q_f32(sptr3);\n                float32x4_t _r4 = vld1q_f32(sptr4);\n                float32x4_t _r5 = vld1q_f32(sptr5);\n                float32x4_t _r6 = vld1q_f32(sptr6);\n                float32x4_t _r7 = vld1q_f32(sptr7);\n                float32x4_t _r8 = vld1q_f32(sptr8);\n                float32x4_t _r9 = vld1q_f32(sptr9);\n                float32x4_t _ra = vld1q_f32(sptra);\n                float32x4_t _rb = vld1q_f32(sptrb);\n                transpose4x12_ps(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7, _r8, _r9, _ra, _rb);\n                vst1q_f32(pp, _r0);\n 
               vst1q_f32(pp + 4, _r1);\n                vst1q_f32(pp + 4 * 2, _r2);\n                vst1q_f32(pp + 4 * 3, _r3);\n                vst1q_f32(pp + 4 * 4, _r4);\n                vst1q_f32(pp + 4 * 5, _r5);\n                vst1q_f32(pp + 4 * 6, _r6);\n                vst1q_f32(pp + 4 * 7, _r7);\n                vst1q_f32(pp + 4 * 8, _r8);\n                vst1q_f32(pp + 4 * 9, _r9);\n                vst1q_f32(pp + 4 * 10, _ra);\n                vst1q_f32(pp + 4 * 11, _rb);\n                pp += 48;\n            }\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr2[0];\n                pp[3] = sptr3[0];\n                pp[4] = sptr4[0];\n                pp[5] = sptr5[0];\n                pp[6] = sptr6[0];\n                pp[7] = sptr7[0];\n                pp[8] = sptr8[0];\n                pp[9] = sptr9[0];\n                pp[10] = sptra[0];\n                pp[11] = sptrb[0];\n                pp += 12;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dy2 = (j + jj + 2) / outw;\n        int dy3 = (j + jj + 3) / outw;\n        int dy4 = (j + jj + 4) / outw;\n        int dy5 = (j + jj + 5) / outw;\n        int dy6 = (j + jj + 6) / outw;\n        int dy7 = (j + jj + 7) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n        int dx2 = (j + jj + 2) % outw;\n        int dx3 = (j + jj + 3) % outw;\n        int dx4 = (j + jj + 4) % outw;\n        int dx5 = (j + jj + 5) % outw;\n        int dx6 = (j + jj + 6) % outw;\n        int dx7 = (j + jj + 7) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = 
uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int x2 = stride_w * dx2 + dilation_w * v;\n            int x3 = stride_w * dx3 + dilation_w * v;\n            int x4 = stride_w * dx4 + dilation_w * v;\n            int x5 = stride_w * dx5 + dilation_w * v;\n            int x6 = stride_w * dx6 + dilation_w * v;\n            int x7 = stride_w * dx7 + dilation_w * v;\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n            int y2 = stride_h * dy2 + dilation_h * u;\n            int y3 = stride_h * dy3 + dilation_h * u;\n            int y4 = stride_h * dy4 + dilation_h * u;\n            int y5 = stride_h * dy5 + dilation_h * u;\n            int y6 = stride_h * dy6 + dilation_h * u;\n            int y7 = stride_h * dy7 + dilation_h * u;\n\n            const float* sptr0 = img.row(y0) + x0 * elempack;\n            const float* sptr1 = img.row(y1) + x1 * elempack;\n            const float* sptr2 = img.row(y2) + x2 * elempack;\n            const float* sptr3 = img.row(y3) + x3 * elempack;\n            const float* sptr4 = img.row(y4) + x4 * elempack;\n            const float* sptr5 = img.row(y5) + x5 * elempack;\n            const float* sptr6 = img.row(y6) + x6 * elempack;\n            const float* sptr7 = img.row(y7) + x7 * elempack;\n\n            if (elempack == 4)\n            {\n                float32x4_t _r0 = vld1q_f32(sptr0);\n                float32x4_t _r1 = vld1q_f32(sptr1);\n                float32x4_t _r2 = vld1q_f32(sptr2);\n                float32x4_t _r3 = vld1q_f32(sptr3);\n                float32x4_t _r4 = vld1q_f32(sptr4);\n                float32x4_t _r5 = vld1q_f32(sptr5);\n                float32x4_t _r6 = vld1q_f32(sptr6);\n                float32x4_t _r7 = vld1q_f32(sptr7);\n                transpose4x8_ps(_r0, _r1, _r2, _r3, _r4, _r5, _r6, 
_r7);\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r1);\n                vst1q_f32(pp + 4 * 2, _r2);\n                vst1q_f32(pp + 4 * 3, _r3);\n                vst1q_f32(pp + 4 * 4, _r4);\n                vst1q_f32(pp + 4 * 5, _r5);\n                vst1q_f32(pp + 4 * 6, _r6);\n                vst1q_f32(pp + 4 * 7, _r7);\n                pp += 32;\n            }\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr2[0];\n                pp[3] = sptr3[0];\n                pp[4] = sptr4[0];\n                pp[5] = sptr5[0];\n                pp[6] = sptr6[0];\n                pp[7] = sptr7[0];\n                pp += 8;\n            }\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dy2 = (j + jj + 2) / outw;\n        int dy3 = (j + jj + 3) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n        int dx2 = (j + jj + 2) % outw;\n        int dx3 = (j + jj + 3) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int x2 = stride_w * dx2 + dilation_w * v;\n            int x3 = stride_w * dx3 + dilation_w * v;\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n            int y2 = stride_h * dy2 + dilation_h * u;\n            int y3 = stride_h * dy3 + dilation_h * u;\n\n            const float* sptr0 = img.row(y0) + x0 * elempack;\n            const float* sptr1 = img.row(y1) + 
x1 * elempack;\n            const float* sptr2 = img.row(y2) + x2 * elempack;\n            const float* sptr3 = img.row(y3) + x3 * elempack;\n\n            if (elempack == 4)\n            {\n                float32x4x4_t _r0;\n                _r0.val[0] = vld1q_f32(sptr0);\n                _r0.val[1] = vld1q_f32(sptr1);\n                _r0.val[2] = vld1q_f32(sptr2);\n                _r0.val[3] = vld1q_f32(sptr3);\n                vst4q_f32(pp, _r0);\n                pp += 16;\n            }\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr2[0];\n                pp[3] = sptr3[0];\n                pp += 4;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n\n            const float* sptr0 = img.row(y0) + x0 * elempack;\n            const float* sptr1 = img.row(y1) + x1 * elempack;\n\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr0[1];\n                pp[3] = sptr1[1];\n                pp[4] = sptr0[2];\n                pp[5] = sptr1[2];\n                pp[6] = sptr0[3];\n                pp[7] = sptr1[3];\n                pp += 8;\n          
  }\n#endif // __ARM_NEON\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp += 2;\n            }\n        }\n    }\n    for (; jj < max_jj; jj++)\n    {\n        int dy = (j + jj) / outw;\n        int dx = (j + jj) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x = stride_w * dx + dilation_w * v;\n            int y = stride_h * dy + dilation_h * u;\n\n            const float* sptr = img.row(y) + x * elempack;\n\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                pp[0] = sptr[0];\n                pp[1] = sptr[1];\n                pp[2] = sptr[2];\n                pp[3] = sptr[3];\n                pp += 4;\n            }\n#endif // __ARM_NEON\n            if (elempack == 1)\n            {\n                pp[0] = sptr[0];\n                pp += 1;\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_gemm_transform_kernel(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt)\n{\n    // NCNN_LOGE(\"convolution_im2col_gemm_transform_kernel\");\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = outch;\n    const int K = inch * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk(M, 0, K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    int elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        elempack = inch % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n\n    // maxk-inch-outch to pa-maxk-inch/pa-outch\n    Mat A_data;\n    if (maxk == 1)\n    {\n        A_data = kernel.reshape(maxk * inch, outch);\n    }\n    else\n    {\n        Mat weight_data_r2 = kernel.reshape(maxk, inch, outch);\n\n        A_data.create(maxk * inch, outch);\n\n        for (int q = 0; q < outch; q += 1)\n        {\n            float* g00 = A_data.row(q);\n\n            for (int p = 0; p + (elempack - 1) < inch; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        const float* k00 = weight_data_r2.channel(q).row(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n\n    AT.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n            convolution_im2col_pack_A_tile(A_data, AT_tile, i, max_ii, k, max_kk);\n        }\n    }\n}\n\nstatic int convolution_im2col_gemm(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = top_blob.w * top_blob.h;\n    const int K = bottom_blob.c * bottom_blob.elempack * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk(M, N, 
K, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        // im2col\n        convolution_im2col_input_tile(bottom_blob, BT_tile, j, max_jj, k, max_kk, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h);\n    }\n\n    Mat topT_tileX;\n    if (K > TILE_K)\n    {\n        topT_tileX.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT_tileX.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat topT_tile;\n        if (K > TILE_K)\n            topT_tile = topT_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).row_range(k 
/ TILE_K, 1);\n\n                bool k_end = k + TILE_K >= K;\n\n                convolution_gemm_transB_packed_tile(AT_tile, BT_tile, bias, topT_tile, top_blob, i, max_ii, j, max_jj, k, max_kk, k_end, opt.use_a53_a55_optimized_kernel);\n            }\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_im2col_gemm_bf16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_gemm_transB_packed_tile_bf16s(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end, int use_a53_a55_optimized_kernel)\n{\n    // NCNN_LOGE(\"convolution_gemm_transB_packed_tile_bf16s %d %d %d %d %d %d\", i, max_ii, j, max_jj, k, max_kk);\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.cstep;\n\n    const unsigned short* pAT = AT_tile;\n    const unsigned short* pBT = BT_tile;\n    const float* pC = CT_tile;\n\n    float* outptr = topT_tile;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            if (use_a53_a55_optimized_kernel && cpu_support_arm_asimdhp())\n            {\n                // a55\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #320                \\n\"\n                    \"b      3f                       
   \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v8.4s}, [%8]               \\n\"\n                    \"ld1    {v20.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v9.16b, v8.16b              \\n\"\n                    \"mov    v10.16b, v8.16b             \\n\"\n                    \"mov    v11.16b, v8.16b             \\n\"\n                    \"mov    v12.16b, v8.16b             \\n\"\n                    \"mov    v13.16b, v8.16b             \\n\"\n                    \"mov    v14.16b, v8.16b             \\n\"\n                    \"mov    v15.16b, v8.16b             \\n\"\n                    \"mov    v16.16b, v8.16b             \\n\"\n                    \"mov    v17.16b, v8.16b             \\n\"\n                    \"mov    v18.16b, v8.16b             \\n\"\n                    \"mov    v19.16b, v8.16b             \\n\"\n\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            \\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n 
                   \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4h, v5.4h}, [%1], #16   \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"ldr    d6, [%1], #8                \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"ldr    d7, [%1], #8                \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, 
v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"ldr    d4, [%1], #8                \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n                    \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v9.4s, v6.4s, v3.s[1]       \\n\"\n                    \"ldr    d5, [%1], #8                \\n\"\n                    \"fmla   v10.4s, v6.4s, v3.s[2]      \\n\"\n                    \"fmla   v11.4s, v6.4s, v3.s[3]      \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v20.4s, v7.4s, v3.s[0]      \\n\"\n                    \"fmla   v21.4s, v7.4s, v3.s[1]      \\n\"\n                    \"fmla   v22.4s, v7.4s, v3.s[2]      \\n\"\n                    \"fmla   v23.4s, v7.4s, v3.s[3]      \\n\"\n                    \"fmla   v12.4s, v6.4s, v0.s[0]      
\\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v13.4s, v6.4s, v0.s[1]      \\n\"\n                    \"fmla   v14.4s, v6.4s, v0.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v0.s[3]      \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v24.4s, v7.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v7.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v0.s[3]      \\n\"\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"fmla   v16.4s, v6.4s, v1.s[0]      \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v17.4s, v6.4s, v1.s[1]      \\n\"\n                    \"fmla   v18.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v19.4s, v6.4s, v1.s[3]      \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n                    \"ldr    d6, [%1], #8                \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v9.4s, v4.4s, v2.s[1]       \\n\"\n                    \"ldr    d7, [%1], #8                \\n\"\n                    \"fmla   v10.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v2.s[3]      \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"fmla   v20.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v2.s[1]      \\n\"\n         
           \"fmla   v22.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v2.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v3.s[0]      \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v13.4s, v4.4s, v3.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v3.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v3.s[3]      \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v24.4s, v5.4s, v3.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v3.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v3.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v3.s[3]      \\n\"\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"ldr    d4, [%1], #8                \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n                    \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v9.4s, v6.4s, v1.s[1]       \\n\"\n                    \"ldr    d5, [%1], #8                
 \\n\"\n                    \"fmla   v10.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v11.4s, v6.4s, v1.s[3]      \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"fmla   v20.4s, v7.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v7.4s, v1.s[3]      \\n\"\n                    \"fmla   v12.4s, v6.4s, v2.s[0]      \\n\"\n                    \"ldr    d1, [%2], #8                 \\n\"\n                    \"fmla   v13.4s, v6.4s, v2.s[1]      \\n\"\n                    \"fmla   v14.4s, v6.4s, v2.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v2.s[3]      \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"fmla   v24.4s, v7.4s, v2.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v2.s[1]      \\n\"\n                    \"fmla   v26.4s, v7.4s, v2.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v2.s[3]      \\n\"\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"fmla   v16.4s, v6.4s, v3.s[0]      \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v17.4s, v6.4s, v3.s[1]      \\n\"\n                    \"fmla   v18.4s, v6.4s, v3.s[2]      \\n\"\n                    \"fmla   v19.4s, v6.4s, v3.s[3]      \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v28.4s, v7.4s, v3.s[0]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v3.s[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v7.4s, v3.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #16                 \\n\"\n     
               \"sub    %2, %2, #24                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                    \"ld1    {v4.4h, v5.4h}, [%1], #16   \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   
v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"shrn   v0.4h, v8.4s, #16           \\n\"\n                    \"shrn2  v0.8h, v9.4s, #16           \\n\"\n                    \"shrn   v1.4h, v10.4s, #16          \\n\"\n                    \"shrn2  v1.8h, v11.4s, #16          \\n\"\n                    \"shrn   v2.4h, v12.4s, #16          \\n\"\n                    \"shrn2  v2.8h, v13.4s, #16          \\n\"\n                    \"shrn   v3.4h, v14.4s, #16          \\n\"\n                    \"shrn2  v3.8h, v15.4s, #16          \\n\"\n                    \"shrn   v4.4h, v16.4s, #16          \\n\"\n                    \"shrn2  v4.8h, v17.4s, #16          \\n\"\n                    \"shrn   v5.4h, v18.4s, #16          \\n\"\n                    \"shrn2  v5.8h, v19.4s, #16          \\n\"\n                    \"shrn   v6.4h, v20.4s, #16          \\n\"\n                    \"shrn2  v6.8h, v21.4s, #16          \\n\"\n                    \"shrn   v7.4h, v22.4s, #16          \\n\"\n                    \"shrn2  v7.8h, v23.4s, #16          \\n\"\n                    \"shrn   v8.4h, v24.4s, #16          \\n\"\n                    \"shrn2  v8.8h, v25.4s, #16          \\n\"\n                    \"shrn   v9.4h, v26.4s, #16          \\n\"\n                    \"shrn2  v9.8h, v27.4s, #16          \\n\"\n                    \"shrn   v10.4h, v28.4s, #16         \\n\"\n                    \"shrn2  v10.8h, 
v29.4s, #16         \\n\"\n                    \"shrn   v11.4h, v30.4s, #16         \\n\"\n                    \"shrn2  v11.8h, v31.4s, #16         \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\"\n                    \"st1    {v4.8h, v5.8h}, [%3], #32 \\n\"\n                    \"st1    {v6.8h, v7.8h, v8.8h, v9.8h}, [x4], #64 \\n\"\n                    \"st1    {v10.8h, v11.8h}, [x4]      \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x12\n                    \"uzp1   v20.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp2   v21.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp1   v22.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp2   v23.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp1   v24.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp2   v25.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp1   v26.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp2   v27.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp1   v28.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp2   v29.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp1   v30.8h, v10.8h, v11.8h      \\n\"\n                    \"uzp2   v31.8h, v10.8h, v11.8h      \\n\"\n\n                    \"uzp1   v0.8h, v20.8h, v22.8h       \\n\"\n                    \"uzp2   v6.8h, v20.8h, v22.8h       \\n\"\n                    
\"uzp1   v3.8h, v21.8h, v23.8h       \\n\"\n                    \"uzp2   v9.8h, v21.8h, v23.8h       \\n\"\n                    \"mov    v1.d[0], v0.d[1]            \\n\"\n                    \"mov    v7.d[0], v6.d[1]            \\n\"\n                    \"mov    v4.d[0], v3.d[1]            \\n\"\n                    \"mov    v10.d[0], v9.d[1]           \\n\"\n                    \"uzp1   v2.8h, v24.8h, v24.8h       \\n\"\n                    \"uzp2   v8.8h, v24.8h, v24.8h       \\n\"\n                    \"uzp1   v5.8h, v25.8h, v25.8h       \\n\"\n                    \"uzp2   v11.8h, v25.8h, v25.8h      \\n\"\n\n                    \"uzp1   v12.8h, v26.8h, v28.8h      \\n\"\n                    \"uzp2   v18.8h, v26.8h, v28.8h      \\n\"\n                    \"uzp1   v15.8h, v27.8h, v29.8h      \\n\"\n                    \"uzp2   v21.8h, v27.8h, v29.8h      \\n\"\n                    \"mov    v13.d[0], v12.d[1]          \\n\"\n                    \"mov    v19.d[0], v18.d[1]          \\n\"\n                    \"mov    v16.d[0], v15.d[1]          \\n\"\n                    \"mov    v22.d[0], v21.d[1]          \\n\"\n                    \"uzp1   v14.8h, v30.8h, v30.8h      \\n\"\n                    \"uzp2   v20.8h, v30.8h, v30.8h      \\n\"\n                    \"uzp1   v17.8h, v31.8h, v31.8h      \\n\"\n                    \"uzp2   v23.8h, v31.8h, v31.8h      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.4h, v1.4h, v2.4h}, [%3], #24 \\n\"\n                    \"st1    {v3.4h, v4.4h, v5.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.4h, v7.4h, v8.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v9.4h, v10.4h, v11.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v12.4h, v13.4h, v14.4h}, [x4] \\n\"\n                    
\"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v15.4h, v16.4h, v17.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v18.4h, v19.4h, v20.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v21.4h, v22.4h, v23.4h}, [x4] \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #384                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", 
\"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else if (use_a53_a55_optimized_kernel && !cpu_support_arm_asimdhp())\n            {\n                // a53\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #320                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v8.4s}, [%8]               \\n\"\n                    \"ld1    {v20.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v9.16b, v8.16b              \\n\"\n                    \"mov    v10.16b, v8.16b             \\n\"\n                    \"mov    v11.16b, v8.16b             \\n\"\n                    \"mov    v12.16b, v8.16b             \\n\"\n                    \"mov    v13.16b, v8.16b             \\n\"\n             
       \"mov    v14.16b, v8.16b             \\n\"\n                    \"mov    v15.16b, v8.16b             \\n\"\n                    \"mov    v16.16b, v8.16b             \\n\"\n                    \"mov    v17.16b, v8.16b             \\n\"\n                    \"mov    v18.16b, v8.16b             \\n\"\n                    \"mov    v19.16b, v8.16b             \\n\"\n\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            \\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n                    \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v4.4h}, [%1], #8           \\n\"\n\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n\n                    \"ldr    x25, [%1]                   \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                              
   \\n\"\n\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"ldr    x26, [%1]                   \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"ldr    x23, [%2]                   \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"ins    v5.d[0], x25                \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"ldr    x20, [%2]                   \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"ldr    x21, [%2]                   \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n                    \"ins    v6.d[0], x26                \\n\"\n                    \"ins    v3.d[0], x23                \\n\"\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"nop                   
             \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"ldr    x27, [%1]                   \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n\n                    \"ins    v0.d[0], x20                \\n\"\n                    \"ins    v1.d[0], x21                \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                    \"ldr    x24, [%1]                   \\n\"\n                    \"fmla   v9.4s, v6.4s, v3.s[1]       \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v10.4s, v6.4s, v3.s[2]      \\n\"\n\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v11.4s, v6.4s, v3.s[3]      \\n\"\n                    \"ldr    x22, [%2]                   \\n\"\n                    \"fmla   v12.4s, v6.4s, v0.s[0]      \\n\"\n                    \"add    %2, 
%2, #8                  \\n\"\n                    \"fmla   v13.4s, v6.4s, v0.s[1]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"ins    v7.d[0], x27                \\n\"\n                    \"fmla   v14.4s, v6.4s, v0.s[2]      \\n\"\n                    \"ldr    x23, [%2]                   \\n\"\n                    \"fmla   v15.4s, v6.4s, v0.s[3]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v16.4s, v6.4s, v1.s[0]      \\n\"\n\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n                    \"fmla   v17.4s, v6.4s, v1.s[1]      \\n\"\n                    \"ldr    x20, [%2]                   \\n\"\n                    \"fmla   v18.4s, v6.4s, v1.s[2]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v19.4s, v6.4s, v1.s[3]      \\n\"\n\n                    \"ins    v4.d[0], x24                \\n\"\n                    \"ins    v2.d[0], x22                \\n\"\n                    \"fmla   v20.4s, v7.4s, v3.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v21.4s, v7.4s, v3.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v22.4s, v7.4s, v3.s[2]      \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"fmla   v23.4s, v7.4s, v3.s[3]      \\n\"\n                    \"ldr    x25, [%1]                   \\n\"\n                    \"fmla   v24.4s, v7.4s, v0.s[0]      \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v25.4s, v7.4s, v0.s[1]      \\n\"\n\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"fmla   v26.4s, v7.4s, v0.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v27.4s, 
v7.4s, v0.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n\n                    \"ins    v3.d[0], x23                \\n\"\n                    \"ins    v0.d[0], x20                \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                    \"ldr    x26, [%1]                   \\n\"\n                    \"fmla   v9.4s, v4.4s, v2.s[1]       \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v10.4s, v4.4s, v2.s[2]      \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v11.4s, v4.4s, v2.s[3]      \\n\"\n                    \"ldr    x21, [%2]                   \\n\"\n                    \"fmla   v12.4s, v4.4s, v3.s[0]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v13.4s, v4.4s, v3.s[1]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"ins    v5.d[0], x25                \\n\"\n                    \"fmla   v14.4s, v4.4s, v3.s[2]      \\n\"\n                    \"ldr    x22, [%2]                   \\n\"\n                    \"fmla   v15.4s, v4.4s, v3.s[3]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"ldr    x23, [%2]     
              \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n\n                    \"ins    v6.d[0], x26                \\n\"\n                    \"ins    v1.d[0], x21                \\n\"\n                    \"fmla   v20.4s, v5.4s, v2.s[0]      \\n\"\n                    \"prfm   pldl1keep, [%2, #384]       \\n\" // NOTE PRELOAD\n                    \"fmla   v21.4s, v5.4s, v2.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v22.4s, v5.4s, v2.s[2]      \\n\"\n\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"fmla   v23.4s, v5.4s, v2.s[3]      \\n\"\n                    \"ldr    x27, [%1]                   \\n\"\n                    \"fmla   v24.4s, v5.4s, v3.s[0]      \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v25.4s, v5.4s, v3.s[1]      \\n\"\n\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v26.4s, v5.4s, v3.s[2]      \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v27.4s, v5.4s, v3.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n\n                    \"ins    v2.d[0], x22                \\n\"\n                    \"ins    v3.d[0], x23                \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                 
   \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                    \"ldr    x24, [%1]                   \\n\"\n                    \"fmla   v9.4s, v6.4s, v1.s[1]       \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v10.4s, v6.4s, v1.s[2]      \\n\"\n\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"fmla   v11.4s, v6.4s, v1.s[3]      \\n\"\n                    \"ldr    x20, [%2]                   \\n\"\n                    \"fmla   v12.4s, v6.4s, v2.s[0]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v13.4s, v6.4s, v2.s[1]      \\n\"\n\n                    \"nop                                \\n\"\n                    \"ins    v7.d[0], x27                \\n\"\n                    \"fmla   v14.4s, v6.4s, v2.s[2]      \\n\"\n                    \"ldr    x21, [%2]                   \\n\"\n                    \"fmla   v15.4s, v6.4s, v2.s[3]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v16.4s, v6.4s, v3.s[0]      \\n\"\n\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n                    \"fmla   v17.4s, v6.4s, v3.s[1]      \\n\"\n                    \"ldr    x22, [%2]                   \\n\"\n                    \"fmla   v18.4s, v6.4s, v3.s[2]      \\n\"\n                    \"add    %2, %2, #8                  \\n\"\n                    \"fmla   v19.4s, v6.4s, v3.s[3]      \\n\"\n\n                    \"ins    v4.d[0], x24                \\n\"\n                    \"ins    v0.d[0], x20                \\n\"\n                    \"fmla   v20.4s, v7.4s, v1.s[0]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v21.4s, v7.4s, v1.s[1]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v22.4s, v7.4s, v1.s[2]      \\n\"\n\n                    
\"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"fmla   v23.4s, v7.4s, v1.s[3]      \\n\"\n                    \"ldr    x25, [%1]                   \\n\"\n                    \"fmla   v24.4s, v7.4s, v2.s[0]      \\n\"\n                    \"add    %1, %1, #8                  \\n\"\n                    \"fmla   v25.4s, v7.4s, v2.s[1]      \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v26.4s, v7.4s, v2.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v27.4s, v7.4s, v2.s[3]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v28.4s, v7.4s, v3.s[0]      \\n\"\n\n                    \"ins    v1.d[0], x21                \\n\"\n                    \"ins    v2.d[0], x22                \\n\"\n                    \"fmla   v29.4s, v7.4s, v3.s[1]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v7.4s, v3.s[2]      \\n\"\n                    \"nop                                \\n\"\n                    \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #16                 \\n\"\n                    \"sub    %2, %2, #24                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"shll   v2.4s, v2.4h, #16       
    \\n\"\n\n                    \"ld1    {v4.4h, v5.4h}, [%1], #16   \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"shrn   v0.4h, v8.4s, #16         
  \\n\"\n                    \"shrn2  v0.8h, v9.4s, #16           \\n\"\n                    \"shrn   v1.4h, v10.4s, #16          \\n\"\n                    \"shrn2  v1.8h, v11.4s, #16          \\n\"\n                    \"shrn   v2.4h, v12.4s, #16          \\n\"\n                    \"shrn2  v2.8h, v13.4s, #16          \\n\"\n                    \"shrn   v3.4h, v14.4s, #16          \\n\"\n                    \"shrn2  v3.8h, v15.4s, #16          \\n\"\n                    \"shrn   v4.4h, v16.4s, #16          \\n\"\n                    \"shrn2  v4.8h, v17.4s, #16          \\n\"\n                    \"shrn   v5.4h, v18.4s, #16          \\n\"\n                    \"shrn2  v5.8h, v19.4s, #16          \\n\"\n                    \"shrn   v6.4h, v20.4s, #16          \\n\"\n                    \"shrn2  v6.8h, v21.4s, #16          \\n\"\n                    \"shrn   v7.4h, v22.4s, #16          \\n\"\n                    \"shrn2  v7.8h, v23.4s, #16          \\n\"\n                    \"shrn   v8.4h, v24.4s, #16          \\n\"\n                    \"shrn2  v8.8h, v25.4s, #16          \\n\"\n                    \"shrn   v9.4h, v26.4s, #16          \\n\"\n                    \"shrn2  v9.8h, v27.4s, #16          \\n\"\n                    \"shrn   v10.4h, v28.4s, #16         \\n\"\n                    \"shrn2  v10.8h, v29.4s, #16         \\n\"\n                    \"shrn   v11.4h, v30.4s, #16         \\n\"\n                    \"shrn2  v11.8h, v31.4s, #16         \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\"\n          
          \"st1    {v4.8h, v5.8h}, [%3], #32 \\n\"\n                    \"st1    {v6.8h, v7.8h, v8.8h, v9.8h}, [x4], #64 \\n\"\n                    \"st1    {v10.8h, v11.8h}, [x4]      \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x12\n                    \"uzp1   v20.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp2   v21.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp1   v22.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp2   v23.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp1   v24.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp2   v25.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp1   v26.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp2   v27.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp1   v28.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp2   v29.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp1   v30.8h, v10.8h, v11.8h      \\n\"\n                    \"uzp2   v31.8h, v10.8h, v11.8h      \\n\"\n\n                    \"uzp1   v0.8h, v20.8h, v22.8h       \\n\"\n                    \"uzp2   v6.8h, v20.8h, v22.8h       \\n\"\n                    \"uzp1   v3.8h, v21.8h, v23.8h       \\n\"\n                    \"uzp2   v9.8h, v21.8h, v23.8h       \\n\"\n                    \"mov    v1.d[0], v0.d[1]            \\n\"\n                    \"mov    v7.d[0], v6.d[1]            \\n\"\n                    \"mov    v4.d[0], v3.d[1]            \\n\"\n                    \"mov    v10.d[0], v9.d[1]           \\n\"\n                    \"uzp1   v2.8h, v24.8h, v24.8h       \\n\"\n                    \"uzp2   v8.8h, v24.8h, v24.8h       \\n\"\n                    \"uzp1   v5.8h, v25.8h, v25.8h       \\n\"\n                    \"uzp2   v11.8h, v25.8h, v25.8h      \\n\"\n\n                    \"uzp1   v12.8h, v26.8h, v28.8h    
  \\n\"\n                    \"uzp2   v18.8h, v26.8h, v28.8h      \\n\"\n                    \"uzp1   v15.8h, v27.8h, v29.8h      \\n\"\n                    \"uzp2   v21.8h, v27.8h, v29.8h      \\n\"\n                    \"mov    v13.d[0], v12.d[1]          \\n\"\n                    \"mov    v19.d[0], v18.d[1]          \\n\"\n                    \"mov    v16.d[0], v15.d[1]          \\n\"\n                    \"mov    v22.d[0], v21.d[1]          \\n\"\n                    \"uzp1   v14.8h, v30.8h, v30.8h      \\n\"\n                    \"uzp2   v20.8h, v30.8h, v30.8h      \\n\"\n                    \"uzp1   v17.8h, v31.8h, v31.8h      \\n\"\n                    \"uzp2   v23.8h, v31.8h, v31.8h      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.4h, v1.4h, v2.4h}, [%3], #24 \\n\"\n                    \"st1    {v3.4h, v4.4h, v5.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.4h, v7.4h, v8.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v9.4h, v10.4h, v11.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v12.4h, v13.4h, v14.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v15.4h, v16.4h, v17.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v18.4h, v19.4h, v20.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v21.4h, v22.4h, v23.4h}, [x4] \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #384                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v8.4s, v9.4s, 
v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    
\"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #320                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v8.4s}, [%8]               \\n\"\n                    \"ld1    {v20.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v9.16b, v8.16b              \\n\"\n                    \"mov    v10.16b, v8.16b             \\n\"\n                    \"mov    v11.16b, v8.16b             \\n\"\n                    \"mov    v12.16b, v8.16b             \\n\"\n                    \"mov    v13.16b, v8.16b             \\n\"\n                    \"mov    v14.16b, v8.16b             \\n\"\n                    \"mov    v15.16b, v8.16b             \\n\"\n                    \"mov    v16.16b, v8.16b             \\n\"\n                    \"mov    v17.16b, v8.16b             \\n\"\n                    \"mov    v18.16b, v8.16b             \\n\"\n                    \"mov    v19.16b, v8.16b             \\n\"\n\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            
\\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n                    \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%1], #32 \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                 
   \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"fmla   v8.4s, v6.4s, v3.s[0]       \\n\"\n                    \"fmla   v9.4s, v6.4s, v3.s[1]       \\n\"\n                    \"fmla   v10.4s, v6.4s, v3.s[2]      \\n\"\n                    \"fmla   v11.4s, v6.4s, v3.s[3]      \\n\"\n                    \"fmla   v20.4s, v7.4s, v3.s[0]      \\n\"\n                    \"fmla   v21.4s, v7.4s, v3.s[1]      \\n\"\n                    \"fmla   v22.4s, v7.4s, v3.s[2]      \\n\"\n                    \"fmla   v23.4s, v7.4s, v3.s[3]      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n\n         
           \"fmla   v12.4s, v6.4s, v0.s[0]      \\n\"\n                    \"fmla   v13.4s, v6.4s, v0.s[1]      \\n\"\n                    \"fmla   v14.4s, v6.4s, v0.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v7.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v7.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v0.s[3]      \\n\"\n\n                    \"fmla   v16.4s, v6.4s, v1.s[0]      \\n\"\n                    \"fmla   v17.4s, v6.4s, v1.s[1]      \\n\"\n                    \"fmla   v18.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v19.4s, v6.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%1], #32 \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v2.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v2.s[1]       \\n\"\n                    \"fmla   v10.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v2.s[3]      \\n\"\n                    \"fmla   v20.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"fmla   v12.4s, v4.4s, v3.s[0]      \\n\"\n   
                 \"fmla   v13.4s, v4.4s, v3.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v3.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v3.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v3.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v3.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v3.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v3.s[3]      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n\n                    \"fmla   v8.4s, v6.4s, v1.s[0]       \\n\"\n                    \"fmla   v9.4s, v6.4s, v1.s[1]       \\n\"\n                    \"fmla   v10.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v11.4s, v6.4s, v1.s[3]      \\n\"\n                    \"fmla   v20.4s, v7.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v7.4s, v1.s[3]      \\n\"\n\n                    \"fmla   v12.4s, v6.4s, v2.s[0]      \\n\"\n                    \"fmla   v13.4s, v6.4s, v2.s[1]      
\\n\"\n                    \"fmla   v14.4s, v6.4s, v2.s[2]      \\n\"\n                    \"fmla   v15.4s, v6.4s, v2.s[3]      \\n\"\n                    \"fmla   v24.4s, v7.4s, v2.s[0]      \\n\"\n                    \"fmla   v25.4s, v7.4s, v2.s[1]      \\n\"\n                    \"fmla   v26.4s, v7.4s, v2.s[2]      \\n\"\n                    \"fmla   v27.4s, v7.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v16.4s, v6.4s, v3.s[0]      \\n\"\n                    \"fmla   v17.4s, v6.4s, v3.s[1]      \\n\"\n                    \"fmla   v18.4s, v6.4s, v3.s[2]      \\n\"\n                    \"fmla   v19.4s, v6.4s, v3.s[3]      \\n\"\n                    \"fmla   v28.4s, v7.4s, v3.s[0]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v3.s[1]      \\n\"\n                    \"fmla   v30.4s, v7.4s, v3.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v3.s[3]      \\n\"\n\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n\n                    \"ld1    {v4.4h, v5.4h}, [%1], #16   \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n\n                    \"fmla   v8.4s, v4.4s, v0.s[0]       \\n\"\n                    \"fmla   v9.4s, v4.4s, v0.s[1]       \\n\"\n                    
\"fmla   v10.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v11.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v12.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v13.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v14.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v15.4s, v4.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v4.4s, v2.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v2.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v2.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v2.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v20.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v21.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v22.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v23.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v2.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"shrn   v0.4h, v8.4s, #16           \\n\"\n                    \"shrn2  v0.8h, v9.4s, #16           \\n\"\n                    \"shrn   v1.4h, v10.4s, #16          \\n\"\n                    \"shrn2  v1.8h, v11.4s, #16          \\n\"\n                    \"shrn   v2.4h, v12.4s, #16          \\n\"\n                    \"shrn2  v2.8h, v13.4s, #16          \\n\"\n                    \"shrn  
 v3.4h, v14.4s, #16          \\n\"\n                    \"shrn2  v3.8h, v15.4s, #16          \\n\"\n                    \"shrn   v4.4h, v16.4s, #16          \\n\"\n                    \"shrn2  v4.8h, v17.4s, #16          \\n\"\n                    \"shrn   v5.4h, v18.4s, #16          \\n\"\n                    \"shrn2  v5.8h, v19.4s, #16          \\n\"\n                    \"shrn   v6.4h, v20.4s, #16          \\n\"\n                    \"shrn2  v6.8h, v21.4s, #16          \\n\"\n                    \"shrn   v7.4h, v22.4s, #16          \\n\"\n                    \"shrn2  v7.8h, v23.4s, #16          \\n\"\n                    \"shrn   v8.4h, v24.4s, #16          \\n\"\n                    \"shrn2  v8.8h, v25.4s, #16          \\n\"\n                    \"shrn   v9.4h, v26.4s, #16          \\n\"\n                    \"shrn2  v9.8h, v27.4s, #16          \\n\"\n                    \"shrn   v10.4h, v28.4s, #16         \\n\"\n                    \"shrn2  v10.8h, v29.4s, #16         \\n\"\n                    \"shrn   v11.4h, v30.4s, #16         \\n\"\n                    \"shrn2  v11.8h, v31.4s, #16         \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\"\n                    \"st1    {v4.8h, v5.8h}, [%3], #32 \\n\"\n                    \"st1    {v6.8h, v7.8h, v8.8h, v9.8h}, [x4], #64 \\n\"\n                    \"st1    {v10.8h, v11.8h}, [x4]      \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                              
   \\n\"\n                    // transpose8x12\n                    \"uzp1   v20.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp2   v21.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp1   v22.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp2   v23.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp1   v24.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp2   v25.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp1   v26.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp2   v27.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp1   v28.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp2   v29.8h, v8.8h, v9.8h        \\n\"\n                    \"uzp1   v30.8h, v10.8h, v11.8h      \\n\"\n                    \"uzp2   v31.8h, v10.8h, v11.8h      \\n\"\n\n                    \"uzp1   v0.8h, v20.8h, v22.8h       \\n\"\n                    \"uzp2   v6.8h, v20.8h, v22.8h       \\n\"\n                    \"uzp1   v3.8h, v21.8h, v23.8h       \\n\"\n                    \"uzp2   v9.8h, v21.8h, v23.8h       \\n\"\n                    \"mov    v1.d[0], v0.d[1]            \\n\"\n                    \"mov    v7.d[0], v6.d[1]            \\n\"\n                    \"mov    v4.d[0], v3.d[1]            \\n\"\n                    \"mov    v10.d[0], v9.d[1]           \\n\"\n                    \"uzp1   v2.8h, v24.8h, v24.8h       \\n\"\n                    \"uzp2   v8.8h, v24.8h, v24.8h       \\n\"\n                    \"uzp1   v5.8h, v25.8h, v25.8h       \\n\"\n                    \"uzp2   v11.8h, v25.8h, v25.8h      \\n\"\n\n                    \"uzp1   v12.8h, v26.8h, v28.8h      \\n\"\n                    \"uzp2   v18.8h, v26.8h, v28.8h      \\n\"\n                    \"uzp1   v15.8h, v27.8h, v29.8h      \\n\"\n                    \"uzp2   v21.8h, v27.8h, v29.8h      \\n\"\n                    \"mov    v13.d[0], v12.d[1]          \\n\"\n                    \"mov    v19.d[0], v18.d[1]          \\n\"\n                    \"mov   
 v16.d[0], v15.d[1]          \\n\"\n                    \"mov    v22.d[0], v21.d[1]          \\n\"\n                    \"uzp1   v14.8h, v30.8h, v30.8h      \\n\"\n                    \"uzp2   v20.8h, v30.8h, v30.8h      \\n\"\n                    \"uzp1   v17.8h, v31.8h, v31.8h      \\n\"\n                    \"uzp2   v23.8h, v31.8h, v31.8h      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.4h, v1.4h, v2.4h}, [%3], #24 \\n\"\n                    \"st1    {v3.4h, v4.4h, v5.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.4h, v7.4h, v8.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v9.4h, v10.4h, v11.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v12.4h, v13.4h, v14.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v15.4h, v16.4h, v17.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v18.4h, v19.4h, v20.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v21.4h, v22.4h, v23.4h}, [x4] \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #384                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64   \\n\"\n                    \"st1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0], #64 \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n   
                 \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n            float32x4_t _sum80;\n            float32x4_t _sum81;\n            float32x4_t _sum90;\n            float32x4_t _sum91;\n            float32x4_t _suma0;\n            float32x4_t _suma1;\n            float32x4_t _sumb0;\n            float32x4_t _sumb1;\n\n            if (k == 0)\n            {\n                if (pC)\n               
 {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                    _sum20 = _sum00;\n                    _sum21 = _sum01;\n                    _sum30 = _sum00;\n                    _sum31 = _sum01;\n                    _sum40 = _sum00;\n                    _sum41 = _sum01;\n                    _sum50 = _sum00;\n                    _sum51 = _sum01;\n                    _sum60 = _sum00;\n                    _sum61 = _sum01;\n                    _sum70 = _sum00;\n                    _sum71 = _sum01;\n                    _sum80 = _sum00;\n                    _sum81 = _sum01;\n                    _sum90 = _sum00;\n                    _sum91 = _sum01;\n                    _suma0 = _sum00;\n                    _suma1 = _sum01;\n                    _sumb0 = _sum00;\n                    _sumb1 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                    _sum40 = vdupq_n_f32(0.f);\n                    _sum41 = vdupq_n_f32(0.f);\n                    _sum50 = vdupq_n_f32(0.f);\n                    _sum51 = vdupq_n_f32(0.f);\n                    _sum60 = vdupq_n_f32(0.f);\n                    _sum61 = vdupq_n_f32(0.f);\n                    _sum70 = vdupq_n_f32(0.f);\n                    _sum71 = vdupq_n_f32(0.f);\n                    _sum80 = vdupq_n_f32(0.f);\n                    _sum81 = vdupq_n_f32(0.f);\n                    _sum90 = vdupq_n_f32(0.f);\n                    _sum91 = vdupq_n_f32(0.f);\n                    _suma0 = 
vdupq_n_f32(0.f);\n                    _suma1 = vdupq_n_f32(0.f);\n                    _sumb0 = vdupq_n_f32(0.f);\n                    _sumb1 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n                _sum80 = vld1q_f32(outptr + 4 * 16);\n                _sum81 = vld1q_f32(outptr + 4 * 17);\n                _sum90 = vld1q_f32(outptr + 4 * 18);\n                _sum91 = vld1q_f32(outptr + 4 * 19);\n                _suma0 = vld1q_f32(outptr + 4 * 20);\n                _suma1 = vld1q_f32(outptr + 4 * 21);\n                _sumb0 = vld1q_f32(outptr + 4 * 22);\n                _sumb1 = vld1q_f32(outptr + 4 * 23);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = bfloat2float(vld1_u16(pA));\n                float32x4_t _pA1 = bfloat2float(vld1_u16(pA + 4));\n\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, 
_pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum30));\n                    
vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum40));\n                    vst1_u16(outptr0 + 4 * 5, float2bfloat(_sum50));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum60));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum70));\n                    vst1_u16(outptr0 + 4 * 8, float2bfloat(_sum80));\n                    vst1_u16(outptr0 + 4 * 9, float2bfloat(_sum90));\n                    vst1_u16(outptr0 + 4 * 10, float2bfloat(_suma0));\n                    vst1_u16(outptr0 + 4 * 11, float2bfloat(_sumb0));\n\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, float2bfloat(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 4, float2bfloat(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 5, float2bfloat(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 6, float2bfloat(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 7, float2bfloat(_sum71));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 8, float2bfloat(_sum81));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 9, float2bfloat(_sum91));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 10, float2bfloat(_suma1));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 11, float2bfloat(_sumb1));\n\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    uint16x8_t _t0 = vcombine_u16(float2bfloat(_sum00), float2bfloat(_sum01));\n                    uint16x8_t _t1 = vcombine_u16(float2bfloat(_sum10), float2bfloat(_sum11));\n                    uint16x8_t _t2 = vcombine_u16(float2bfloat(_sum20), float2bfloat(_sum21));\n                    
uint16x8_t _t3 = vcombine_u16(float2bfloat(_sum30), float2bfloat(_sum31));\n                    uint16x8_t _t4 = vcombine_u16(float2bfloat(_sum40), float2bfloat(_sum41));\n                    uint16x8_t _t5 = vcombine_u16(float2bfloat(_sum50), float2bfloat(_sum51));\n                    uint16x8_t _t6 = vcombine_u16(float2bfloat(_sum60), float2bfloat(_sum61));\n                    uint16x8_t _t7 = vcombine_u16(float2bfloat(_sum70), float2bfloat(_sum71));\n                    uint16x8_t _t8 = vcombine_u16(float2bfloat(_sum80), float2bfloat(_sum81));\n                    uint16x8_t _t9 = vcombine_u16(float2bfloat(_sum90), float2bfloat(_sum91));\n                    uint16x8_t _ta = vcombine_u16(float2bfloat(_suma0), float2bfloat(_suma1));\n                    uint16x8_t _tb = vcombine_u16(float2bfloat(_sumb0), float2bfloat(_sumb1));\n                    transpose8x12_u16(_t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7, _t8, _t9, _ta, _tb);\n\n                    vst1_u16(outptr0, vget_low_u16(_t0));\n                    vst1_u16(outptr0 + 4, vget_high_u16(_t0));\n                    vst1_u16(outptr0 + 8, vget_low_u16(_t1));\n                    vst1_u16(outptr0 + out_hstep, vget_high_u16(_t1));\n                    vst1_u16(outptr0 + out_hstep + 4, vget_low_u16(_t2));\n                    vst1_u16(outptr0 + out_hstep + 8, vget_high_u16(_t2));\n                    vst1_u16(outptr0 + out_hstep * 2, vget_low_u16(_t3));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, vget_high_u16(_t3));\n                    vst1_u16(outptr0 + out_hstep * 2 + 8, vget_low_u16(_t4));\n                    vst1_u16(outptr0 + out_hstep * 3, vget_high_u16(_t4));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, vget_low_u16(_t5));\n                    vst1_u16(outptr0 + out_hstep * 3 + 8, vget_high_u16(_t5));\n                    vst1_u16(outptr0 + out_hstep * 4, vget_low_u16(_t6));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, vget_high_u16(_t6));\n                    
vst1_u16(outptr0 + out_hstep * 4 + 8, vget_low_u16(_t7));\n                    vst1_u16(outptr0 + out_hstep * 5, vget_high_u16(_t7));\n                    vst1_u16(outptr0 + out_hstep * 5 + 4, vget_low_u16(_t8));\n                    vst1_u16(outptr0 + out_hstep * 5 + 8, vget_high_u16(_t8));\n                    vst1_u16(outptr0 + out_hstep * 6, vget_low_u16(_t9));\n                    vst1_u16(outptr0 + out_hstep * 6 + 4, vget_high_u16(_t9));\n                    vst1_u16(outptr0 + out_hstep * 6 + 8, vget_low_u16(_ta));\n                    vst1_u16(outptr0 + out_hstep * 7, vget_high_u16(_ta));\n                    vst1_u16(outptr0 + out_hstep * 7 + 4, vget_low_u16(_tb));\n                    vst1_u16(outptr0 + out_hstep * 7 + 8, vget_high_u16(_tb));\n\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n                vst1q_f32(outptr + 4 * 16, _sum80);\n                vst1q_f32(outptr + 4 * 17, _sum81);\n                vst1q_f32(outptr + 4 * 18, _sum90);\n                vst1q_f32(outptr + 4 * 19, _sum91);\n                vst1q_f32(outptr + 4 * 20, _suma0);\n                vst1q_f32(outptr + 4 * 
21, _suma1);\n                vst1q_f32(outptr + 4 * 22, _sumb0);\n                vst1q_f32(outptr + 4 * 23, _sumb1);\n            }\n\n            outptr += 96;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            if (use_a53_a55_optimized_kernel && cpu_support_arm_asimdhp())\n            {\n                // a55\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #192                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v16.4s}, [%8]              \\n\"\n                    \"ld1    {v24.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v17.16b, v16.16b            \\n\"\n                    \"mov    v18.16b, v16.16b            \\n\"\n                    \"mov    v19.16b, v16.16b            \\n\"\n                    \"mov    v20.16b, v16.16b            \\n\"\n                    \"mov    v21.16b, v16.16b 
           \\n\"\n                    \"mov    v22.16b, v16.16b            \\n\"\n                    \"mov    v23.16b, v16.16b            \\n\"\n\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n                    \"mov    v28.16b, v24.16b            \\n\"\n                    \"mov    v29.16b, v24.16b            \\n\"\n                    \"mov    v30.16b, v24.16b            \\n\"\n                    \"mov    v31.16b, v24.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v8.4h, v9.4h}, [%1], #16   \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v8.4s, v8.4h, #16           \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                    \"ldr    d10, [%1], #8               \\n\"\n                    \"fmla   v17.4s, v8.4s, v0.s[1]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v18.4s, v8.4s, v0.s[2]      \\n\"\n                    \"ldr    d11, [%1], #8               \\n\"\n                    \"fmla   v19.4s, v8.4s, v0.s[3]      \\n\"\n                    \"shll   v9.4s, v9.4h, #16           \\n\"\n                    
\"fmla   v20.4s, v8.4s, v1.s[0]      \\n\"\n                    \"ldr    d4, [%2], #8                \\n\"\n                    \"fmla   v21.4s, v8.4s, v1.s[1]      \\n\"\n                    \"ldr    d12, [%1], #8               \\n\"\n                    \"fmla   v22.4s, v8.4s, v1.s[2]      \\n\"\n                    \"ldr    d5, [%2], #8                \\n\"\n                    \"fmla   v23.4s, v8.4s, v1.s[3]      \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"fmla   v24.4s, v9.4s, v0.s[0]      \\n\"\n                    \"ldr    d13, [%1], #8               \\n\"\n                    \"fmla   v25.4s, v9.4s, v0.s[1]      \\n\"\n                    \"ldr    d6, [%2], #8                \\n\"\n                    \"fmla   v26.4s, v9.4s, v0.s[2]      \\n\"\n                    \"ldr    d14, [%1], #8               \\n\"\n                    \"fmla   v27.4s, v9.4s, v0.s[3]      \\n\"\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"fmla   v28.4s, v9.4s, v1.s[0]      \\n\"\n                    \"ldr    d7, [%2], #8                \\n\"\n                    \"fmla   v29.4s, v9.4s, v1.s[1]      \\n\"\n                    \"ldr    d15, [%1], #8               \\n\"\n                    \"fmla   v30.4s, v9.4s, v1.s[2]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                    \"ldr    d8, [%1], #8                \\n\"\n                    \"fmla   v17.4s, v10.4s, v2.s[1]     \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v18.4s, v10.4s, v2.s[2]     \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v19.4s, v10.4s, v2.s[3]     \\n\"\n    
                \"shll   v11.4s, v11.4h, #16         \\n\"\n                    \"fmla   v20.4s, v10.4s, v3.s[0]     \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v21.4s, v10.4s, v3.s[1]     \\n\"\n                    \"ldr    d9, [%1], #8                \\n\"\n                    \"fmla   v22.4s, v10.4s, v3.s[2]     \\n\"\n                    \"fmla   v23.4s, v10.4s, v3.s[3]     \\n\"\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"fmla   v24.4s, v11.4s, v2.s[0]     \\n\"\n                    \"fmla   v25.4s, v11.4s, v2.s[1]     \\n\"\n                    \"fmla   v26.4s, v11.4s, v2.s[2]     \\n\"\n                    \"fmla   v27.4s, v11.4s, v2.s[3]     \\n\"\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n                    \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n                    \"fmla   v17.4s, v12.4s, v4.s[1]     \\n\"\n                    \"fmla   v18.4s, v12.4s, v4.s[2]     \\n\"\n                    \"fmla   v19.4s, v12.4s, v4.s[3]     \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n                    \"fmla   v20.4s, v12.4s, v5.s[0]     \\n\"\n                    \"fmla   v21.4s, v12.4s, v5.s[1]     \\n\"\n                    \"fmla   v22.4s, v12.4s, v5.s[2]     \\n\"\n                    \"fmla   v23.4s, v12.4s, v5.s[3]     \\n\"\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"fmla   v24.4s, v13.4s, v4.s[0]     \\n\"\n                    \"fmla   v25.4s, v13.4s, v4.s[1]     \\n\"\n                    
\"fmla   v26.4s, v13.4s, v4.s[2]     \\n\"\n                    \"fmla   v27.4s, v13.4s, v4.s[3]     \\n\"\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n                    \"fmla   v28.4s, v13.4s, v5.s[0]     \\n\"\n                    \"fmla   v29.4s, v13.4s, v5.s[1]     \\n\"\n                    \"fmla   v30.4s, v13.4s, v5.s[2]     \\n\"\n                    \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n                    \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                    \"fmla   v17.4s, v14.4s, v6.s[1]     \\n\"\n                    \"fmla   v18.4s, v14.4s, v6.s[2]     \\n\"\n                    \"fmla   v19.4s, v14.4s, v6.s[3]     \\n\"\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n                    \"fmla   v20.4s, v14.4s, v7.s[0]     \\n\"\n                    \"fmla   v21.4s, v14.4s, v7.s[1]     \\n\"\n                    \"fmla   v22.4s, v14.4s, v7.s[2]     \\n\"\n                    \"fmla   v23.4s, v14.4s, v7.s[3]     \\n\"\n                    \"shll   v8.4s, v8.4h, #16           \\n\"\n                    \"fmla   v24.4s, v15.4s, v6.s[0]     \\n\"\n                    \"fmla   v25.4s, v15.4s, v6.s[1]     \\n\"\n                    \"fmla   v26.4s, v15.4s, v6.s[2]     \\n\"\n                    \"fmla   v27.4s, v15.4s, v6.s[3]     \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v28.4s, v15.4s, v7.s[0]     \\n\"\n                    \"fmla   v29.4s, v15.4s, v7.s[1]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v15.4s, v7.s[2]     \\n\"\n                    \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #16                 \\n\"\n                    \"sub    %2, %2, #24                 \\n\"\n\n                    \"5:        
                         \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v2.8h}, [%2], #16          \\n\"\n                    \"shll   v0.4s, v2.4h, #16           \\n\"\n                    \"shll2  v1.4s, v2.8h, #16           \\n\"\n                    \"ld1    {v3.8h}, [%1], #16          \\n\"\n                    \"shll   v4.4s, v3.4h, #16           \\n\"\n                    \"shll2  v5.4s, v3.8h, #16           \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v4.4s, v1.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v24.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"shrn   v0.4h, v16.4s, #16          \\n\"\n        
            \"shrn2  v0.8h, v17.4s, #16          \\n\"\n                    \"shrn   v1.4h, v18.4s, #16          \\n\"\n                    \"shrn2  v1.8h, v19.4s, #16          \\n\"\n                    \"shrn   v2.4h, v20.4s, #16          \\n\"\n                    \"shrn2  v2.8h, v21.4s, #16          \\n\"\n                    \"shrn   v3.4h, v22.4s, #16          \\n\"\n                    \"shrn2  v3.8h, v23.4s, #16          \\n\"\n                    \"shrn   v4.4h, v24.4s, #16          \\n\"\n                    \"shrn2  v4.8h, v25.4s, #16          \\n\"\n                    \"shrn   v5.4h, v26.4s, #16          \\n\"\n                    \"shrn2  v5.8h, v27.4s, #16          \\n\"\n                    \"shrn   v6.4h, v28.4s, #16          \\n\"\n                    \"shrn2  v6.8h, v29.4s, #16          \\n\"\n                    \"shrn   v7.4h, v30.4s, #16          \\n\"\n                    \"shrn2  v7.8h, v31.4s, #16          \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\"\n                    \"st1    {v4.8h, v5.8h, v6.8h, v7.8h}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x8\n                    \"uzp1   v24.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp2   v25.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp1   v26.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp2   v27.8h, v2.8h, v3.8h        
\\n\"\n                    \"uzp1   v28.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp2   v29.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp1   v30.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp2   v31.8h, v6.8h, v7.8h        \\n\"\n\n                    \"uzp1   v0.8h, v24.8h, v26.8h       \\n\"\n                    \"uzp2   v2.8h, v24.8h, v26.8h       \\n\"\n                    \"uzp1   v1.8h, v25.8h, v27.8h       \\n\"\n                    \"uzp2   v3.8h, v25.8h, v27.8h       \\n\"\n\n                    \"uzp1   v4.8h, v28.8h, v30.8h       \\n\"\n                    \"uzp2   v6.8h, v28.8h, v30.8h       \\n\"\n                    \"uzp1   v5.8h, v29.8h, v31.8h       \\n\"\n                    \"uzp2   v7.8h, v29.8h, v31.8h       \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.8h}, [%3], #16          \\n\"\n                    \"st1    {v1.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v2.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v3.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v4.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v5.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v7.8h}, [x4]               \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #256                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                
\\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #192                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n           
         // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v16.4s}, [%8]              \\n\"\n                    \"ld1    {v24.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v17.16b, v16.16b            \\n\"\n                    \"mov    v18.16b, v16.16b            \\n\"\n                    \"mov    v19.16b, v16.16b            \\n\"\n                    \"mov    v20.16b, v16.16b            \\n\"\n                    \"mov    v21.16b, v16.16b            \\n\"\n                    \"mov    v22.16b, v16.16b            \\n\"\n                    \"mov    v23.16b, v16.16b            \\n\"\n\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n                    \"mov    v28.16b, v24.16b            \\n\"\n                    \"mov    v29.16b, v24.16b            \\n\"\n                    \"mov    v30.16b, v24.16b            \\n\"\n                    \"mov    v31.16b, v24.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, 
[%2], #64 \\n\"\n                    \"shll   v0.4s, v4.4h, #16           \\n\"\n                    \"shll2  v1.4s, v4.8h, #16           \\n\"\n                    \"shll   v2.4s, v5.4h, #16           \\n\"\n                    \"shll2  v3.4s, v5.8h, #16           \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\"\n                    \"shll   v8.4s, v12.4h, #16          \\n\"\n                    \"shll2  v9.4s, v12.8h, #16          \\n\"\n                    \"shll   v10.4s, v13.4h, #16         \\n\"\n                    \"shll2  v11.4s, v13.8h, #16         \\n\"\n                    \"fmla   v16.4s, v8.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v8.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v8.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v8.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v8.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v8.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v8.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v8.4s, v1.s[3]      \\n\"\n                    \"fmla   v24.4s, v9.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v9.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v9.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v9.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v9.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v9.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v9.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v9.4s, v1.s[3]      \\n\"\n                    \"fmla   v16.4s, v10.4s, v2.s[0]     \\n\"\n                    \"fmla   v17.4s, v10.4s, v2.s[1]     \\n\"\n                    \"fmla   v18.4s, v10.4s, v2.s[2]     \\n\"\n                    \"fmla   v19.4s, v10.4s, v2.s[3]     \\n\"\n                    \"fmla   v20.4s, v10.4s, 
v3.s[0]     \\n\"\n                    \"fmla   v21.4s, v10.4s, v3.s[1]     \\n\"\n                    \"fmla   v22.4s, v10.4s, v3.s[2]     \\n\"\n                    \"fmla   v23.4s, v10.4s, v3.s[3]     \\n\"\n                    \"fmla   v24.4s, v11.4s, v2.s[0]     \\n\"\n                    \"fmla   v25.4s, v11.4s, v2.s[1]     \\n\"\n                    \"fmla   v26.4s, v11.4s, v2.s[2]     \\n\"\n                    \"fmla   v27.4s, v11.4s, v2.s[3]     \\n\"\n                    \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                    \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n                    \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"shll   v4.4s, v6.4h, #16           \\n\"\n                    \"shll2  v5.4s, v6.8h, #16           \\n\"\n                    \"shll   v6.4s, v7.4h, #16           \\n\"\n                    \"shll2  v7.4s, v7.8h, #16           \\n\"\n                    \"shll   v12.4s, v14.4h, #16         \\n\"\n                    \"shll2  v13.4s, v14.8h, #16         \\n\"\n                    \"shll   v14.4s, v15.4h, #16         \\n\"\n                    \"shll2  v15.4s, v15.8h, #16         \\n\"\n                    \"fmla   v16.4s, v12.4s, v4.s[0]     \\n\"\n                    \"fmla   v17.4s, v12.4s, v4.s[1]     \\n\"\n                    \"fmla   v18.4s, v12.4s, v4.s[2]     \\n\"\n                    \"fmla   v19.4s, v12.4s, v4.s[3]     \\n\"\n                    \"fmla   v20.4s, v12.4s, v5.s[0]     \\n\"\n                    \"fmla   v21.4s, v12.4s, v5.s[1]     \\n\"\n                    \"fmla   v22.4s, v12.4s, v5.s[2]     \\n\"\n                    \"fmla   v23.4s, v12.4s, v5.s[3]     \\n\"\n                    \"fmla   v24.4s, v13.4s, v4.s[0]     \\n\"\n                    \"fmla   v25.4s, v13.4s, v4.s[1]     \\n\"\n                    \"fmla   v26.4s, v13.4s, v4.s[2]     \\n\"\n                    \"fmla   v27.4s, v13.4s, v4.s[3]     
\\n\"\n                    \"fmla   v28.4s, v13.4s, v5.s[0]     \\n\"\n                    \"fmla   v29.4s, v13.4s, v5.s[1]     \\n\"\n                    \"fmla   v30.4s, v13.4s, v5.s[2]     \\n\"\n                    \"fmla   v31.4s, v13.4s, v5.s[3]     \\n\"\n                    \"fmla   v16.4s, v14.4s, v6.s[0]     \\n\"\n                    \"fmla   v17.4s, v14.4s, v6.s[1]     \\n\"\n                    \"fmla   v18.4s, v14.4s, v6.s[2]     \\n\"\n                    \"fmla   v19.4s, v14.4s, v6.s[3]     \\n\"\n                    \"fmla   v20.4s, v14.4s, v7.s[0]     \\n\"\n                    \"fmla   v21.4s, v14.4s, v7.s[1]     \\n\"\n                    \"fmla   v22.4s, v14.4s, v7.s[2]     \\n\"\n                    \"fmla   v23.4s, v14.4s, v7.s[3]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v24.4s, v15.4s, v6.s[0]     \\n\"\n                    \"fmla   v25.4s, v15.4s, v6.s[1]     \\n\"\n                    \"fmla   v26.4s, v15.4s, v6.s[2]     \\n\"\n                    \"fmla   v27.4s, v15.4s, v6.s[3]     \\n\"\n                    \"fmla   v28.4s, v15.4s, v7.s[0]     \\n\"\n                    \"fmla   v29.4s, v15.4s, v7.s[1]     \\n\"\n                    \"fmla   v30.4s, v15.4s, v7.s[2]     \\n\"\n                    \"fmla   v31.4s, v15.4s, v7.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v2.8h}, [%2], #16          \\n\"\n                    \"shll   v0.4s, v2.4h, #16           \\n\"\n                    \"shll2  v1.4s, v2.8h, #16           \\n\"\n                    \"ld1    {v3.8h}, 
[%1], #16          \\n\"\n                    \"shll   v4.4s, v3.4h, #16           \\n\"\n                    \"shll2  v5.4s, v3.8h, #16           \\n\"\n\n                    \"fmla   v16.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v17.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v18.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v19.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v20.4s, v4.4s, v1.s[0]      \\n\"\n                    \"fmla   v21.4s, v4.4s, v1.s[1]      \\n\"\n                    \"fmla   v22.4s, v4.4s, v1.s[2]      \\n\"\n                    \"fmla   v23.4s, v4.4s, v1.s[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v24.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v1.s[3]      \\n\"\n\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"shrn   v0.4h, v16.4s, #16          \\n\"\n                    \"shrn2  v0.8h, v17.4s, #16          \\n\"\n                    \"shrn   v1.4h, v18.4s, #16          \\n\"\n                    \"shrn2  v1.8h, v19.4s, #16          \\n\"\n                    \"shrn   v2.4h, v20.4s, #16          \\n\"\n                    \"shrn2  v2.8h, v21.4s, #16          \\n\"\n                    \"shrn   v3.4h, v22.4s, #16          \\n\"\n                    \"shrn2  v3.8h, v23.4s, #16          \\n\"\n                    \"shrn   v4.4h, v24.4s, #16          \\n\"\n                    \"shrn2  v4.8h, v25.4s, 
#16          \\n\"\n                    \"shrn   v5.4h, v26.4s, #16          \\n\"\n                    \"shrn2  v5.8h, v27.4s, #16          \\n\"\n                    \"shrn   v6.4h, v28.4s, #16          \\n\"\n                    \"shrn2  v6.8h, v29.4s, #16          \\n\"\n                    \"shrn   v7.4h, v30.4s, #16          \\n\"\n                    \"shrn2  v7.8h, v31.4s, #16          \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\"\n                    \"st1    {v4.8h, v5.8h, v6.8h, v7.8h}, [x4] \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x8\n                    \"uzp1   v24.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp2   v25.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp1   v26.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp2   v27.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp1   v28.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp2   v29.8h, v4.8h, v5.8h        \\n\"\n                    \"uzp1   v30.8h, v6.8h, v7.8h        \\n\"\n                    \"uzp2   v31.8h, v6.8h, v7.8h        \\n\"\n\n                    \"uzp1   v0.8h, v24.8h, v26.8h       \\n\"\n                    \"uzp2   v2.8h, v24.8h, v26.8h       \\n\"\n                    \"uzp1   v1.8h, v25.8h, v27.8h       \\n\"\n                    \"uzp2   v3.8h, v25.8h, v27.8h       \\n\"\n\n                    \"uzp1   
v4.8h, v28.8h, v30.8h       \\n\"\n                    \"uzp2   v6.8h, v28.8h, v30.8h       \\n\"\n                    \"uzp1   v5.8h, v29.8h, v31.8h       \\n\"\n                    \"uzp2   v7.8h, v29.8h, v31.8h       \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.8h}, [%3], #16          \\n\"\n                    \"st1    {v1.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v2.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v3.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v4.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v5.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.8h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v7.8h}, [x4]               \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #256                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    
\"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                    _sum20 = _sum00;\n                    _sum21 = _sum01;\n                    _sum30 = _sum00;\n                    _sum31 = _sum01;\n                    _sum40 = _sum00;\n                    _sum41 = _sum01;\n                    _sum50 = _sum00;\n                    _sum51 = _sum01;\n                    _sum60 = _sum00;\n                    _sum61 = _sum01;\n      
              _sum70 = _sum00;\n                    _sum71 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                    _sum40 = vdupq_n_f32(0.f);\n                    _sum41 = vdupq_n_f32(0.f);\n                    _sum50 = vdupq_n_f32(0.f);\n                    _sum51 = vdupq_n_f32(0.f);\n                    _sum60 = vdupq_n_f32(0.f);\n                    _sum61 = vdupq_n_f32(0.f);\n                    _sum70 = vdupq_n_f32(0.f);\n                    _sum71 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = bfloat2float(vld1_u16(pA));\n                
float32x4_t _pA1 = bfloat2float(vld1_u16(pA + 4));\n\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n\n                pA += 8;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum30));\n                    vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum40));\n                    vst1_u16(outptr0 + 4 * 5, float2bfloat(_sum50));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum60));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum70));\n\n                    vst1_u16(outptr0 
+ out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, float2bfloat(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 4, float2bfloat(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 5, float2bfloat(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 6, float2bfloat(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 7, float2bfloat(_sum71));\n\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    uint16x8_t _t0 = vcombine_u16(float2bfloat(_sum00), float2bfloat(_sum01));\n                    uint16x8_t _t1 = vcombine_u16(float2bfloat(_sum10), float2bfloat(_sum11));\n                    uint16x8_t _t2 = vcombine_u16(float2bfloat(_sum20), float2bfloat(_sum21));\n                    uint16x8_t _t3 = vcombine_u16(float2bfloat(_sum30), float2bfloat(_sum31));\n                    uint16x8_t _t4 = vcombine_u16(float2bfloat(_sum40), float2bfloat(_sum41));\n                    uint16x8_t _t5 = vcombine_u16(float2bfloat(_sum50), float2bfloat(_sum51));\n                    uint16x8_t _t6 = vcombine_u16(float2bfloat(_sum60), float2bfloat(_sum61));\n                    uint16x8_t _t7 = vcombine_u16(float2bfloat(_sum70), float2bfloat(_sum71));\n                    transpose8x8_u16(_t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7);\n\n                    vst1q_u16(outptr0, _t0);\n                    vst1q_u16(outptr0 + out_hstep, _t1);\n                    vst1q_u16(outptr0 + out_hstep * 2, _t2);\n                    vst1q_u16(outptr0 + out_hstep * 3, _t3);\n                    vst1q_u16(outptr0 + out_hstep * 4, _t4);\n                    vst1q_u16(outptr0 + out_hstep * 5, _t5);\n                    vst1q_u16(outptr0 + 
out_hstep * 6, _t6);\n                    vst1q_u16(outptr0 + out_hstep * 7, _t7);\n\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n            }\n\n            outptr += 64;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            if (use_a53_a55_optimized_kernel && cpu_support_arm_asimdhp())\n            {\n                // a55\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #64                 \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v24.4s}, [%8]      
        \\n\"\n                    \"ld1    {v28.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n\n                    \"mov    v29.16b, v28.16b            \\n\"\n                    \"mov    v30.16b, v28.16b            \\n\"\n                    \"mov    v31.16b, v28.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h}, [%1], #24 \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n                    \"shll   v5.4s, v5.4h, #16           \\n\"\n                    \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                    \"ldr    d7, [%1], #8                \\n\"\n                    \"fmla   v25.4s, v4.4s, v0.s[1]      \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"fmla   v26.4s, v4.4s, v0.s[2]      
\\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    \"fmla   v27.4s, v4.4s, v0.s[3]      \\n\"\n                    \"shll   v6.4s, v6.4h, #16           \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"ldr    d8, [%1], #8                \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"ldr    d9, [%1], #8                \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"shll   v7.4s, v7.4h, #16           \\n\"\n                    \"fmla   v24.4s, v6.4s, v1.s[0]      \\n\"\n                    \"ldr    d10, [%1], #8               \\n\"\n                    \"fmla   v25.4s, v6.4s, v1.s[1]      \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"fmla   v26.4s, v6.4s, v1.s[2]      \\n\"\n                    \"ldr    d11, [%1], #8               \\n\"\n                    \"fmla   v27.4s, v6.4s, v1.s[3]      \\n\"\n                    \"shll   v8.4s, v8.4h, #16           \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"ldr    d4, [%1], #8                \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n                    \"shll   v9.4s, v9.4h, #16           \\n\"\n                    \"fmla   v24.4s, v8.4s, v2.s[0]      \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v25.4s, v8.4s, v2.s[1]      \\n\"\n                    \"shll   v3.4s, 
v3.4h, #16           \\n\"\n                    \"fmla   v26.4s, v8.4s, v2.s[2]      \\n\"\n                    \"ldr    d5, [%1], #8                \\n\"\n                    \"fmla   v27.4s, v8.4s, v2.s[3]      \\n\"\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"fmla   v28.4s, v9.4s, v2.s[0]      \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v29.4s, v9.4s, v2.s[1]      \\n\"\n                    \"ldr    d6, [%1], #8                \\n\"\n                    \"fmla   v30.4s, v9.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v9.4s, v2.s[3]      \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n                    \"fmla   v24.4s, v10.4s, v3.s[0]     \\n\"\n                    \"fmla   v25.4s, v10.4s, v3.s[1]     \\n\"\n                    \"shll   v4.4s, v4.4h, #16           \\n\"\n                    \"fmla   v26.4s, v10.4s, v3.s[2]     \\n\"\n                    \"fmla   v27.4s, v10.4s, v3.s[3]     \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                    \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #24                 \\n\"\n                    \"sub    %2, %2, #16                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                  
  \"ld1    {v0.4h}, [%2], #8           \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"ld1    {v3.8h}, [%1], #16          \\n\"\n                    \"shll   v4.4s, v3.4h, #16           \\n\"\n                    \"shll2  v5.4s, v3.8h, #16           \\n\"\n                    \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v4.4s, v0.s[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"shrn   v0.4h, v24.4s, #16          \\n\"\n                    \"shrn2  v0.8h, v25.4s, #16          \\n\"\n                    \"shrn   v1.4h, v26.4s, #16          \\n\"\n                    \"shrn2  v1.8h, v27.4s, #16          \\n\"\n                    \"shrn   v2.4h, v28.4s, #16          \\n\"\n                    \"shrn2  v2.8h, v29.4s, #16          \\n\"\n                    \"shrn   v3.4h, v30.4s, #16          \\n\"\n                    \"shrn2  v3.8h, v31.4s, #16          \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v0.8h, v1.8h}, 
[%3], #32   \\n\"\n                    \"st1    {v2.8h, v3.8h}, [x4]        \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x4\n                    \"uzp1   v28.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp2   v29.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp1   v30.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp2   v31.8h, v2.8h, v3.8h        \\n\"\n\n                    \"uzp1   v0.8h, v28.8h, v29.8h       \\n\"\n                    \"uzp2   v2.8h, v28.8h, v29.8h       \\n\"\n                    \"uzp1   v4.8h, v30.8h, v31.8h       \\n\"\n                    \"uzp2   v6.8h, v30.8h, v31.8h       \\n\"\n\n                    \"mov    v1.d[0], v0.d[1]            \\n\"\n                    \"mov    v3.d[0], v2.d[1]            \\n\"\n                    \"mov    v5.d[0], v4.d[1]            \\n\"\n                    \"mov    v7.d[0], v6.d[1]            \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.4h}, [%3], #8           \\n\"\n                    \"st1    {v1.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v2.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v3.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v4.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v5.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    
{v7.4h}, [x4]               \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #128                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                    \"subs   %0, %0, #64                 \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                   
 \"add    x4, %8, #16                 \\n\"\n                    \"ld1    {v24.4s}, [%8]              \\n\"\n                    \"ld1    {v28.4s}, [x4]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                    \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n\n                    \"mov    v29.16b, v28.16b            \\n\"\n                    \"mov    v30.16b, v28.16b            \\n\"\n                    \"mov    v31.16b, v28.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"shll   v1.4s, v1.4h, #16           \\n\"\n                    \"shll   v2.4s, v2.4h, #16           \\n\"\n                    \"shll   v3.4s, v3.4h, #16           \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\"\n                    \"shll   v4.4s, v12.4h, #16          \\n\"\n                    \"shll2  v5.4s, v12.8h, #16          \\n\"\n                    \"shll   v6.4s, v13.4h, #16          \\n\"\n            
        \"shll2  v7.4s, v13.8h, #16          \\n\"\n                    \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v4.4s, v0.s[3]      \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"fmla   v24.4s, v6.4s, v1.s[0]      \\n\"\n                    \"fmla   v25.4s, v6.4s, v1.s[1]      \\n\"\n                    \"fmla   v26.4s, v6.4s, v1.s[2]      \\n\"\n                    \"fmla   v27.4s, v6.4s, v1.s[3]      \\n\"\n                    \"fmla   v28.4s, v7.4s, v1.s[0]      \\n\"\n                    \"fmla   v29.4s, v7.4s, v1.s[1]      \\n\"\n                    \"fmla   v30.4s, v7.4s, v1.s[2]      \\n\"\n                    \"fmla   v31.4s, v7.4s, v1.s[3]      \\n\"\n                    \"shll   v8.4s, v14.4h, #16          \\n\"\n                    \"shll2  v9.4s, v14.8h, #16          \\n\"\n                    \"shll   v10.4s, v15.4h, #16         \\n\"\n                    \"shll2  v11.4s, v15.8h, #16         \\n\"\n                    \"fmla   v24.4s, v8.4s, v2.s[0]      \\n\"\n                    \"fmla   v25.4s, v8.4s, v2.s[1]      \\n\"\n                    \"fmla   v26.4s, v8.4s, v2.s[2]      \\n\"\n                    \"fmla   v27.4s, v8.4s, v2.s[3]      \\n\"\n                    \"fmla   v28.4s, v9.4s, v2.s[0]      \\n\"\n                    \"fmla   v29.4s, v9.4s, v2.s[1]      \\n\"\n                    \"fmla   v30.4s, v9.4s, v2.s[2]      \\n\"\n                    \"fmla   v31.4s, v9.4s, v2.s[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v24.4s, v10.4s, v3.s[0]     \\n\"\n                    \"fmla  
 v25.4s, v10.4s, v3.s[1]     \\n\"\n                    \"fmla   v26.4s, v10.4s, v3.s[2]     \\n\"\n                    \"fmla   v27.4s, v10.4s, v3.s[3]     \\n\"\n                    \"fmla   v28.4s, v11.4s, v3.s[0]     \\n\"\n                    \"fmla   v29.4s, v11.4s, v3.s[1]     \\n\"\n                    \"fmla   v30.4s, v11.4s, v3.s[2]     \\n\"\n                    \"fmla   v31.4s, v11.4s, v3.s[3]     \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4h}, [%2], #8           \\n\"\n                    \"shll   v0.4s, v0.4h, #16           \\n\"\n                    \"ld1    {v3.8h}, [%1], #16          \\n\"\n                    \"shll   v4.4s, v3.4h, #16           \\n\"\n                    \"shll2  v5.4s, v3.8h, #16           \\n\"\n                    \"fmla   v24.4s, v4.4s, v0.s[0]      \\n\"\n                    \"fmla   v25.4s, v4.4s, v0.s[1]      \\n\"\n                    \"fmla   v26.4s, v4.4s, v0.s[2]      \\n\"\n                    \"fmla   v27.4s, v4.4s, v0.s[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.4s, v5.4s, v0.s[0]      \\n\"\n                    \"fmla   v29.4s, v5.4s, v0.s[1]      \\n\"\n                    \"fmla   v30.4s, v5.4s, v0.s[2]      \\n\"\n                    \"fmla   v31.4s, v5.4s, v0.s[3]      \\n\"\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"shrn   v0.4h, v24.4s, #16          \\n\"\n                    \"shrn2  v0.8h, v25.4s, #16          \\n\"\n          
          \"shrn   v1.4h, v26.4s, #16          \\n\"\n                    \"shrn2  v1.8h, v27.4s, #16          \\n\"\n                    \"shrn   v2.4h, v28.4s, #16          \\n\"\n                    \"shrn2  v2.8h, v29.4s, #16          \\n\"\n                    \"shrn   v3.4h, v30.4s, #16          \\n\"\n                    \"shrn2  v3.8h, v31.4s, #16          \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v0.8h, v1.8h}, [%3], #32   \\n\"\n                    \"st1    {v2.8h, v3.8h}, [x4]        \\n\"\n                    \"b      9f                          \\n\"\n\n                    // if out_elempack == 1\n                    \"8:                                 \\n\"\n                    // transpose8x4\n                    \"uzp1   v28.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp2   v29.8h, v0.8h, v1.8h        \\n\"\n                    \"uzp1   v30.8h, v2.8h, v3.8h        \\n\"\n                    \"uzp2   v31.8h, v2.8h, v3.8h        \\n\"\n\n                    \"uzp1   v0.8h, v28.8h, v29.8h       \\n\"\n                    \"uzp2   v2.8h, v28.8h, v29.8h       \\n\"\n                    \"uzp1   v4.8h, v30.8h, v31.8h       \\n\"\n                    \"uzp2   v6.8h, v30.8h, v31.8h       \\n\"\n\n                    \"mov    v1.d[0], v0.d[1]            \\n\"\n                    \"mov    v3.d[0], v2.d[1]            \\n\"\n                    \"mov    v5.d[0], v4.d[1]            \\n\"\n                    \"mov    v7.d[0], v6.d[1]            \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n               
     \"st1    {v0.4h}, [%3], #8           \\n\"\n                    \"st1    {v1.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v2.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v3.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v4.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v5.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.4h}, [x4]               \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v7.4h}, [x4]               \\n\"\n\n                    \"9:                                 \\n\"\n                    \"add    %0, %0, #128                \\n\"\n                    \"b      11f                         \\n\"\n\n                    \"10:                                \\n\"\n                    \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    \"11:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : 
\"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                    _sum20 = _sum00;\n                    _sum21 = _sum01;\n                    _sum30 = _sum00;\n                    _sum31 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum20 = vdupq_n_f32(0.f);\n                    _sum21 = vdupq_n_f32(0.f);\n                    _sum30 = vdupq_n_f32(0.f);\n                    _sum31 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                
float32x4_t _pA0 = bfloat2float(vld1_u16(pA));\n                float32x4_t _pA1 = bfloat2float(vld1_u16(pA + 4));\n\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum30));\n\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, float2bfloat(_sum31));\n\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    uint16x8_t _t0 = vcombine_u16(float2bfloat(_sum00), float2bfloat(_sum01));\n                    uint16x8_t _t1 = vcombine_u16(float2bfloat(_sum10), float2bfloat(_sum11));\n                    uint16x8_t _t2 = vcombine_u16(float2bfloat(_sum20), float2bfloat(_sum21));\n                    uint16x8_t _t3 = vcombine_u16(float2bfloat(_sum30), float2bfloat(_sum31));\n                 
   transpose8x4_u16(_t0, _t1, _t2, _t3);\n\n                    vst1_u16(outptr0, vget_low_u16(_t0));\n                    vst1_u16(outptr0 + out_hstep * 1, vget_high_u16(_t0));\n                    vst1_u16(outptr0 + out_hstep * 2, vget_low_u16(_t1));\n                    vst1_u16(outptr0 + out_hstep * 3, vget_high_u16(_t1));\n                    vst1_u16(outptr0 + out_hstep * 4, vget_low_u16(_t2));\n                    vst1_u16(outptr0 + out_hstep * 5, vget_high_u16(_t2));\n                    vst1_u16(outptr0 + out_hstep * 6, vget_low_u16(_t3));\n                    vst1_u16(outptr0 + out_hstep * 7, vget_high_u16(_t3));\n\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n            }\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"add    x4, %8, #16                 \\n\"\n                \"ld1    {v28.4s}, [%8]              \\n\"\n                \"ld1    {v30.4s}, [x4]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n       
         \"1:                                 \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v29.16b, v28.16b            \\n\"\n                \"mov    v31.16b, v30.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #128]       \\n\"\n                \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"shll   v1.4s, v1.4h, #16           \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\"\n                \"shll   v4.4s, v12.4h, #16          \\n\"\n                \"shll2  v5.4s, v12.8h, #16          \\n\"\n                \"shll   v6.4s, v13.4h, #16          \\n\"\n                \"shll2  v7.4s, v13.8h, #16          \\n\"\n                \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v4.4s, v0.s[1]      \\n\"\n                \"fmla   v30.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v31.4s, v5.4s, v0.s[1]      \\n\"\n                \"fmla   v28.4s, v6.4s, v0.s[2]      \\n\"\n                \"fmla   v29.4s, v6.4s, v0.s[3]      \\n\"\n                \"fmla   v30.4s, v7.4s, v0.s[2]      \\n\"\n                \"fmla   v31.4s, v7.4s, v0.s[3]      \\n\"\n                \"shll   v8.4s, v14.4h, #16          \\n\"\n                \"shll2  v9.4s, v14.8h, #16          \\n\"\n                \"shll   v10.4s, v15.4h, #16         \\n\"\n                \"shll2  v11.4s, v15.8h, #16 
        \\n\"\n                \"fmla   v28.4s, v8.4s, v1.s[0]      \\n\"\n                \"fmla   v29.4s, v8.4s, v1.s[1]      \\n\"\n                \"fmla   v30.4s, v9.4s, v1.s[0]      \\n\"\n                \"fmla   v31.4s, v9.4s, v1.s[1]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v10.4s, v1.s[2]     \\n\"\n                \"fmla   v29.4s, v10.4s, v1.s[3]     \\n\"\n                \"fmla   v30.4s, v11.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v11.4s, v1.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.s}[0], [%2], #4         \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"ld1    {v3.8h}, [%1], #16          \\n\"\n                \"shll   v4.4s, v3.4h, #16           \\n\"\n                \"shll2  v5.4s, v3.8h, #16           \\n\"\n                \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v4.4s, v0.s[1]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v31.4s, v5.4s, v0.s[1]      \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"shrn   v0.4h, v28.4s, #16          \\n\"\n                \"shrn2  v0.8h, v29.4s, #16          \\n\"\n                \"shrn   v1.4h, v30.4s, #16          \\n\"\n                \"shrn2  v1.8h, v31.4s, #16          \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f   
                      \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"lsl    w4, %w13, #2                \\n\"\n                \"add    x4, %3, w4, sxtw 1          \\n\"\n                \"st1    {v0.8h}, [%3], #16          \\n\"\n                \"st1    {v1.8h}, [x4]               \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose8x2\n                \"uzp1   v2.8h, v0.8h, v1.8h         \\n\"\n                \"uzp2   v3.8h, v0.8h, v1.8h         \\n\"\n                \"uzp1   v0.8h, v2.8h, v3.8h         \\n\"\n                \"uzp2   v1.8h, v2.8h, v3.8h         \\n\"\n\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v0.s}[0], [%3], #4         \\n\"\n                \"st1    {v0.s}[2], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v1.s}[0], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v1.s}[2], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v0.s}[1], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v0.s}[3], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v1.s}[1], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v1.s}[3], [x4]             \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #64                 \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                       
         \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                    _sum10 = _sum00;\n                    _sum11 = _sum01;\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = 
bfloat2float(vld1_u16(pA));\n                float32x4_t _pA1 = bfloat2float(vld1_u16(pA + 4));\n\n                float32x2_t _pB0 = vget_low_f32(bfloat2float(vld1_u16(pB)));\n\n                _sum00 = vfmaq_lane_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pA1, _pB0, 1);\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[8];\n                    unsigned short sum1[8];\n                    vst1_u16(sum0, float2bfloat(_sum00));\n                    vst1_u16(sum0 + 4, float2bfloat(_sum01));\n                    vst1_u16(sum1, float2bfloat(_sum10));\n                    vst1_u16(sum1 + 4, float2bfloat(_sum11));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n           
         outptr0[out_hstep * 4 + 1] = sum1[4];\n                    outptr0[out_hstep * 5 + 1] = sum1[5];\n                    outptr0[out_hstep * 6 + 1] = sum1[6];\n                    outptr0[out_hstep * 7 + 1] = sum1[7];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n            }\n\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      2f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v30.4s, v31.4s}, [%8]      \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"3:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #64]        \\n\"\n                \"ld1    {v0.4h}, [%2], #8           \\n\"\n                
\"shll   v0.4s, v0.4h, #16           \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\"\n                \"shll   v4.4s, v12.4h, #16          \\n\"\n                \"shll2  v5.4s, v12.8h, #16          \\n\"\n                \"shll   v6.4s, v13.4h, #16          \\n\"\n                \"shll2  v7.4s, v13.8h, #16          \\n\"\n                \"fmla   v28.4s, v4.4s, v0.s[0]      \\n\"\n                \"fmla   v29.4s, v5.4s, v0.s[0]      \\n\"\n                \"fmla   v30.4s, v6.4s, v0.s[1]      \\n\"\n                \"fmla   v31.4s, v7.4s, v0.s[1]      \\n\"\n                \"shll   v8.4s, v14.4h, #16          \\n\"\n                \"shll2  v9.4s, v14.8h, #16          \\n\"\n                \"shll   v10.4s, v15.4h, #16         \\n\"\n                \"shll2  v11.4s, v15.8h, #16         \\n\"\n                \"fmla   v28.4s, v8.4s, v0.s[2]      \\n\"\n                \"fmla   v29.4s, v9.4s, v0.s[2]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v10.4s, v0.s[3]     \\n\"\n                \"fmla   v31.4s, v11.4s, v0.s[3]     \\n\"\n                \"bne    3b                          \\n\"\n                \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                \"4:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    6f                          \\n\"\n\n                \"5:                                 \\n\"\n                \"ld1r   {v0.4h}, [%2], #2           \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"ld1    {v3.8h}, [%1], #16          \\n\"\n                \"shll   v4.4s, v3.4h, #16           \\n\"\n                \"shll2  v5.4s, v3.8h, #16      
     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v4.4s, v0.4s        \\n\"\n                \"fmla   v31.4s, v5.4s, v0.4s        \\n\"\n                \"bne    5b                          \\n\"\n\n                \"6:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    9f                          \\n\"\n\n                // narrow to bf16 only when k_end, so the fp32 sums stored at 9: stay intact\n                \"shrn   v30.4h, v30.4s, #16         \\n\"\n                \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    7f                          \\n\"\n\n                \"lsl    w4, %w13, #2                \\n\"\n                \"add    x4, %3, w4, sxtw 1          \\n\"\n                \"st1    {v30.4h}, [%3], #8          \\n\"\n                \"st1    {v31.4h}, [x4]              \\n\"\n                \"b      8f                          \\n\"\n\n                // if out_elempack == 1\n                \"7:                                 \\n\"\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v30.h}[0], [%3], #2        \\n\"\n                \"st1    {v30.h}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v30.h}[2], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v30.h}[3], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v31.h}[0], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v31.h}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v31.h}[2], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v31.h}[3], 
[x4]            \\n\"\n\n                \"8:                                 \\n\"\n                \"add    %0, %0, #32                 \\n\"\n                \"b      10f                         \\n\"\n\n                \"9:                                 \\n\"\n                \"st1    {v30.4s, v31.4s}, [%0], #32 \\n\"\n\n                \"10:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vld1q_f32(pC);\n                    _sum01 = vld1q_f32(pC + 4);\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = bfloat2float(vld1_u16(pA));\n                float32x4_t _pA1 = bfloat2float(vld1_u16(pA + 4));\n\n                
float32x4_t _pB = bfloat2float(vld1_dup_u16(pB));\n\n                _sum00 = vfmaq_f32(_sum00, _pA0, _pB);\n                _sum01 = vfmaq_f32(_sum01, _pA1, _pB);\n\n                pA += 8;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[8];\n                    vst1_u16(sum0, float2bfloat(_sum00));\n                    vst1_u16(sum0 + 4, float2bfloat(_sum01));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep * 1] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n            }\n\n            outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n\n        pAT += max_kk * 8;\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n   
             \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"subs   %0, %0, #128                \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v20.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v21.16b, v20.16b            \\n\"\n                \"mov    v22.16b, v20.16b            \\n\"\n                \"mov    v23.16b, v20.16b            \\n\"\n                \"mov    v24.16b, v20.16b            \\n\"\n                \"mov    v25.16b, v20.16b            \\n\"\n                \"mov    v26.16b, v20.16b            \\n\"\n                \"mov    v27.16b, v20.16b            \\n\"\n                \"mov    v28.16b, v20.16b            \\n\"\n                \"mov    v29.16b, v20.16b            \\n\"\n                \"mov    v30.16b, v20.16b            \\n\"\n                \"mov    v31.16b, v20.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%2], #64 
\\n\"\n                \"shll   v0.4s, v4.4h, #16           \\n\"\n                \"shll2  v1.4s, v4.8h, #16           \\n\"\n                \"shll   v2.4s, v5.4h, #16           \\n\"\n                \"shll2  v3.4s, v5.8h, #16           \\n\"\n                \"prfm   pldl1keep, [%1, #256]       \\n\"\n                \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"shll   v17.4s, v17.4h, #16         \\n\"\n                \"shll   v18.4s, v18.4h, #16         \\n\"\n                \"shll   v19.4s, v19.4h, #16         \\n\"\n                \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v21.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v22.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v23.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v24.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v1.s[3]     \\n\"\n                \"fmla   v28.4s, v16.4s, v2.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v2.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v2.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v2.s[3]     \\n\"\n                \"shll   v4.4s, v6.4h, #16           \\n\"\n                \"shll2  v5.4s, v6.8h, #16           \\n\"\n                \"shll   v6.4s, v7.4h, #16           \\n\"\n                \"shll2  v7.4s, v7.8h, #16           \\n\"\n                \"fmla   v20.4s, v17.4s, v3.s[0]     \\n\"\n                \"fmla   v21.4s, v17.4s, v3.s[1]     \\n\"\n                \"fmla   v22.4s, v17.4s, v3.s[2]     \\n\"\n                \"fmla   v23.4s, v17.4s, v3.s[3]     \\n\"\n                \"fmla   v24.4s, v17.4s, v4.s[0]     \\n\"\n                \"fmla   v25.4s, v17.4s, v4.s[1]     \\n\"\n                \"fmla   v26.4s, v17.4s, v4.s[2]     
\\n\"\n                \"fmla   v27.4s, v17.4s, v4.s[3]     \\n\"\n                \"fmla   v28.4s, v17.4s, v5.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v5.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v5.s[2]     \\n\"\n                \"fmla   v31.4s, v17.4s, v5.s[3]     \\n\"\n                \"prfm   pldl1keep, [%2, #256]       \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"shll   v1.4s, v1.4h, #16           \\n\"\n                \"shll   v2.4s, v2.4h, #16           \\n\"\n                \"shll   v3.4s, v3.4h, #16           \\n\"\n                \"fmla   v20.4s, v18.4s, v6.s[0]     \\n\"\n                \"fmla   v21.4s, v18.4s, v6.s[1]     \\n\"\n                \"fmla   v22.4s, v18.4s, v6.s[2]     \\n\"\n                \"fmla   v23.4s, v18.4s, v6.s[3]     \\n\"\n                \"fmla   v24.4s, v18.4s, v7.s[0]     \\n\"\n                \"fmla   v25.4s, v18.4s, v7.s[1]     \\n\"\n                \"fmla   v26.4s, v18.4s, v7.s[2]     \\n\"\n                \"fmla   v27.4s, v18.4s, v7.s[3]     \\n\"\n                \"fmla   v28.4s, v18.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v0.s[1]     \\n\"\n                \"fmla   v30.4s, v18.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v18.4s, v0.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v20.4s, v19.4s, v1.s[0]     \\n\"\n                \"fmla   v21.4s, v19.4s, v1.s[1]     \\n\"\n                \"fmla   v22.4s, v19.4s, v1.s[2]     \\n\"\n                \"fmla   v23.4s, v19.4s, v1.s[3]     \\n\"\n                \"fmla   v24.4s, v19.4s, v2.s[0]     \\n\"\n                \"fmla   v25.4s, v19.4s, v2.s[1]     \\n\"\n                \"fmla   v26.4s, v19.4s, v2.s[2]     \\n\"\n                \"fmla   v27.4s, v19.4s, v2.s[3]     \\n\"\n                \"fmla   v28.4s, v19.4s, v3.s[0]     \\n\"\n 
               \"fmla   v29.4s, v19.4s, v3.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v3.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v3.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"shll   v1.4s, v1.4h, #16           \\n\"\n                \"shll   v2.4s, v2.4h, #16           \\n\"\n                \"ld1    {v16.4h}, [%1], #8          \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"fmla   v20.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v21.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v22.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v23.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v24.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v1.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v16.4s, v2.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v2.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v2.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v2.s[3]     \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"shrn   v0.4h, v20.4s, #16          \\n\"\n                \"shrn2  v0.8h, v21.4s, #16          \\n\"\n                \"shrn   v1.4h, v22.4s, 
#16          \\n\"\n                \"shrn2  v1.8h, v23.4s, #16          \\n\"\n                \"shrn   v2.4h, v24.4s, #16          \\n\"\n                \"shrn2  v2.8h, v25.4s, #16          \\n\"\n                \"shrn   v3.4h, v26.4s, #16          \\n\"\n                \"shrn2  v3.8h, v27.4s, #16          \\n\"\n                \"shrn   v4.4h, v28.4s, #16          \\n\"\n                \"shrn2  v4.8h, v29.4s, #16          \\n\"\n                \"shrn   v5.4h, v30.4s, #16          \\n\"\n                \"shrn2  v5.8h, v31.4s, #16          \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\"\n                \"st1    {v4.8h, v5.8h}, [%3], #32   \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x12\n                \"uzp1   v20.8h, v0.8h, v1.8h        \\n\"\n                \"uzp2   v21.8h, v0.8h, v1.8h        \\n\"\n                \"uzp1   v22.8h, v2.8h, v3.8h        \\n\"\n                \"uzp2   v23.8h, v2.8h, v3.8h        \\n\"\n                \"uzp1   v24.8h, v4.8h, v5.8h        \\n\"\n                \"uzp2   v25.8h, v4.8h, v5.8h        \\n\"\n\n                \"uzp1   v0.8h, v20.8h, v21.8h       \\n\"\n                \"uzp2   v6.8h, v20.8h, v21.8h       \\n\"\n                \"uzp1   v1.8h, v22.8h, v23.8h       \\n\"\n                \"uzp2   v7.8h, v22.8h, v23.8h       \\n\"\n                \"uzp1   v2.8h, v24.8h, v25.8h       \\n\"\n                \"uzp2   v8.8h, v24.8h, v25.8h       \\n\"\n\n                \"mov    v3.d[0], v0.d[1]            \\n\"\n                \"mov    v4.d[0], v1.d[1] 
           \\n\"\n                \"mov    v5.d[0], v2.d[1]            \\n\"\n                \"mov    v9.d[0], v6.d[1]            \\n\"\n                \"mov    v10.d[0], v7.d[1]           \\n\"\n                \"mov    v11.d[0], v8.d[1]           \\n\"\n\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v0.4h, v1.4h, v2.4h}, [%3], #24 \\n\"\n                \"st1    {v3.4h, v4.4h, v5.4h}, [x4] \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v6.4h, v7.4h, v8.4h}, [x4] \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v9.4h, v10.4h, v11.4h}, [x4] \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #192                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", 
\"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n            float32x4_t _sum8;\n            float32x4_t _sum9;\n            float32x4_t _suma;\n            float32x4_t _sumb;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                    _sum4 = _sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                    _sum8 = _sum0;\n                    _sum9 = _sum0;\n                    _suma = _sum0;\n                    _sumb = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n                    _sum4 = vdupq_n_f32(0.f);\n                    _sum5 = vdupq_n_f32(0.f);\n                    _sum6 = vdupq_n_f32(0.f);\n                    _sum7 = vdupq_n_f32(0.f);\n                    _sum8 = vdupq_n_f32(0.f);\n                    _sum9 = vdupq_n_f32(0.f);\n                    _suma = vdupq_n_f32(0.f);\n                    _sumb = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 
5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n                _sum8 = vld1q_f32(outptr + 4 * 8);\n                _sum9 = vld1q_f32(outptr + 4 * 9);\n                _suma = vld1q_f32(outptr + 4 * 10);\n                _sumb = vld1q_f32(outptr + 4 * 11);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n                _sum8 = vfmaq_laneq_f32(_sum8, _pA, _pB2, 0);\n                _sum9 = vfmaq_laneq_f32(_sum9, _pA, _pB2, 1);\n                _suma = vfmaq_laneq_f32(_suma, _pA, _pB2, 2);\n                _sumb = vfmaq_laneq_f32(_sumb, _pA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum3));\n                    vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum4));\n                    vst1_u16(outptr0 + 4 * 5, 
float2bfloat(_sum5));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum6));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum7));\n                    vst1_u16(outptr0 + 4 * 8, float2bfloat(_sum8));\n                    vst1_u16(outptr0 + 4 * 9, float2bfloat(_sum9));\n                    vst1_u16(outptr0 + 4 * 10, float2bfloat(_suma));\n                    vst1_u16(outptr0 + 4 * 11, float2bfloat(_sumb));\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    uint16x4_t _t0 = float2bfloat(_sum0);\n                    uint16x4_t _t1 = float2bfloat(_sum1);\n                    uint16x4_t _t2 = float2bfloat(_sum2);\n                    uint16x4_t _t3 = float2bfloat(_sum3);\n                    uint16x4_t _t4 = float2bfloat(_sum4);\n                    uint16x4_t _t5 = float2bfloat(_sum5);\n                    uint16x4_t _t6 = float2bfloat(_sum6);\n                    uint16x4_t _t7 = float2bfloat(_sum7);\n                    uint16x4_t _t8 = float2bfloat(_sum8);\n                    uint16x4_t _t9 = float2bfloat(_sum9);\n                    uint16x4_t _ta = float2bfloat(_suma);\n                    uint16x4_t _tb = float2bfloat(_sumb);\n                    transpose4x12_u16(_t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7, _t8, _t9, _ta, _tb);\n\n                    vst1_u16(outptr0, _t0);\n                    vst1_u16(outptr0 + 4, _t1);\n                    vst1_u16(outptr0 + 8, _t2);\n                    vst1_u16(outptr0 + out_hstep, _t3);\n                    vst1_u16(outptr0 + out_hstep + 4, _t4);\n                    vst1_u16(outptr0 + out_hstep + 8, _t5);\n                    vst1_u16(outptr0 + out_hstep * 2, _t6);\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, _t7);\n                    vst1_u16(outptr0 + out_hstep * 2 + 8, _t8);\n                    vst1_u16(outptr0 + out_hstep * 3, _t9);\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, 
_ta);\n                    vst1_u16(outptr0 + out_hstep * 3 + 8, _tb);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n                vst1q_f32(outptr + 4 * 8, _sum8);\n                vst1q_f32(outptr + 4 * 9, _sum9);\n                vst1q_f32(outptr + 4 * 10, _suma);\n                vst1q_f32(outptr + 4 * 11, _sumb);\n            }\n\n            outptr += 48;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"subs   %0, %0, #64                 \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v24.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v25.16b, v24.16b            \\n\"\n                \"mov    v26.16b, v24.16b            \\n\"\n            
    \"mov    v27.16b, v24.16b            \\n\"\n                \"mov    v28.16b, v24.16b            \\n\"\n                \"mov    v29.16b, v24.16b            \\n\"\n                \"mov    v30.16b, v24.16b            \\n\"\n                \"mov    v31.16b, v24.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #512]       \\n\"\n                \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%2], #64 \\n\"\n                \"shll   v0.4s, v4.4h, #16           \\n\"\n                \"shll2  v1.4s, v4.8h, #16           \\n\"\n                \"shll   v2.4s, v5.4h, #16           \\n\"\n                \"shll2  v3.4s, v5.8h, #16           \\n\"\n                \"prfm   pldl1keep, [%1, #256]       \\n\"\n                \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"shll   v17.4s, v17.4h, #16         \\n\"\n                \"shll   v18.4s, v18.4h, #16         \\n\"\n                \"shll   v19.4s, v19.4h, #16         \\n\"\n                \"fmla   v24.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v28.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v1.s[3]     \\n\"\n                \"fmla   v24.4s, v17.4s, v2.s[0]     \\n\"\n                \"fmla   v25.4s, v17.4s, v2.s[1]     \\n\"\n                \"fmla   v26.4s, 
v17.4s, v2.s[2]     \\n\"\n                \"fmla   v27.4s, v17.4s, v2.s[3]     \\n\"\n                \"fmla   v28.4s, v17.4s, v3.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v3.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v3.s[2]     \\n\"\n                \"fmla   v31.4s, v17.4s, v3.s[3]     \\n\"\n                \"shll   v4.4s, v6.4h, #16           \\n\"\n                \"shll2  v5.4s, v6.8h, #16           \\n\"\n                \"shll   v6.4s, v7.4h, #16           \\n\"\n                \"shll2  v7.4s, v7.8h, #16           \\n\"\n                \"fmla   v24.4s, v18.4s, v4.s[0]     \\n\"\n                \"fmla   v25.4s, v18.4s, v4.s[1]     \\n\"\n                \"fmla   v26.4s, v18.4s, v4.s[2]     \\n\"\n                \"fmla   v27.4s, v18.4s, v4.s[3]     \\n\"\n                \"fmla   v28.4s, v18.4s, v5.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v5.s[1]     \\n\"\n                \"fmla   v30.4s, v18.4s, v5.s[2]     \\n\"\n                \"fmla   v31.4s, v18.4s, v5.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v24.4s, v19.4s, v6.s[0]     \\n\"\n                \"fmla   v25.4s, v19.4s, v6.s[1]     \\n\"\n                \"fmla   v26.4s, v19.4s, v6.s[2]     \\n\"\n                \"fmla   v27.4s, v19.4s, v6.s[3]     \\n\"\n                \"fmla   v28.4s, v19.4s, v7.s[0]     \\n\"\n                \"fmla   v29.4s, v19.4s, v7.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v7.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v7.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                
\"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"shll   v1.4s, v1.4h, #16           \\n\"\n                \"ld1    {v16.4h}, [%1], #8          \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"fmla   v24.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v25.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v26.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v27.4s, v16.4s, v0.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v16.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v1.s[3]     \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"shrn   v0.4h, v24.4s, #16          \\n\"\n                \"shrn2  v0.8h, v25.4s, #16          \\n\"\n                \"shrn   v1.4h, v26.4s, #16          \\n\"\n                \"shrn2  v1.8h, v27.4s, #16          \\n\"\n                \"shrn   v2.4h, v28.4s, #16          \\n\"\n                \"shrn2  v2.8h, v29.4s, #16          \\n\"\n                \"shrn   v3.4h, v30.4s, #16          \\n\"\n                \"shrn2  v3.8h, v31.4s, #16          \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x8\n                
\"uzp1   v20.8h, v0.8h, v1.8h        \\n\"\n                \"uzp2   v21.8h, v0.8h, v1.8h        \\n\"\n                \"uzp1   v22.8h, v2.8h, v3.8h        \\n\"\n                \"uzp2   v23.8h, v2.8h, v3.8h        \\n\"\n\n                \"uzp1   v0.8h, v20.8h, v22.8h       \\n\"\n                \"uzp2   v2.8h, v20.8h, v22.8h       \\n\"\n                \"uzp1   v1.8h, v21.8h, v23.8h       \\n\"\n                \"uzp2   v3.8h, v21.8h, v23.8h       \\n\"\n\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v0.8h}, [%3], #16          \\n\"\n                \"st1    {v1.8h}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v2.8h}, [x4]               \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v3.8h}, [x4]               \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #128                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", 
\"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0!, {d16-d23}      \\n\"\n                \"vldm       %0, {d24-d31}       \\n\"\n                \"sub        %0, %0, #64         \\n\"\n                \"b          3f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n                \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d16-d17}, [%8]     \\n\"\n                \"b          2f                  \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q8, q8              \\n\"\n\n                \"2:                             \\n\"\n                \"vmov       q9, q8              \\n\"\n                \"vmov       q10, q8             \\n\"\n                \"vmov       q11, q8             \\n\"\n                \"vmov       q12, q8             \\n\"\n                \"vmov       q13, q8             \\n\"\n                \"vmov       q14, q8             \\n\"\n                \"vmov       q15, q8             \\n\"\n\n                \"3:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                \"4:                             \\n\"\n                \"pld        [%2, #256]          \\n\"\n                \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.u16   {d12-d15}, [%1 :64]! 
\\n\"\n                \"vshll.u16  q0, d4, #16         \\n\"\n                \"vshll.u16  q1, d5, #16         \\n\"\n                \"vshll.u16  q2, d6, #16         \\n\"\n                \"vshll.u16  q3, d7, #16         \\n\"\n                \"vshll.u16  q4, d12, #16        \\n\"\n                \"vshll.u16  q5, d13, #16        \\n\"\n                \"vshll.u16  q6, d14, #16        \\n\"\n                \"vshll.u16  q7, d15, #16        \\n\"\n                \"vmla.f32   q8, q4, d0[0]       \\n\"\n                \"vmla.f32   q9, q4, d0[1]       \\n\"\n                \"vmla.f32   q10, q4, d1[0]      \\n\"\n                \"vmla.f32   q11, q4, d1[1]      \\n\"\n                \"vmla.f32   q12, q4, d2[0]      \\n\"\n                \"vmla.f32   q13, q4, d2[1]      \\n\"\n                \"vmla.f32   q14, q4, d3[0]      \\n\"\n                \"vmla.f32   q15, q4, d3[1]      \\n\"\n                \"vmla.f32   q8, q5, d4[0]       \\n\"\n                \"vmla.f32   q9, q5, d4[1]       \\n\"\n                \"vmla.f32   q10, q5, d5[0]      \\n\"\n                \"vmla.f32   q11, q5, d5[1]      \\n\"\n                \"vmla.f32   q12, q5, d6[0]      \\n\"\n                \"vmla.f32   q13, q5, d6[1]      \\n\"\n                \"vmla.f32   q14, q5, d7[0]      \\n\"\n                \"vmla.f32   q15, q5, d7[1]      \\n\"\n                \"pld        [%2, #256]          \\n\"\n                \"vld1.u16   {d4-d7}, [%2 :64]!  
\\n\"\n                \"vshll.u16  q0, d4, #16         \\n\"\n                \"vshll.u16  q1, d5, #16         \\n\"\n                \"vshll.u16  q2, d6, #16         \\n\"\n                \"vshll.u16  q3, d7, #16         \\n\"\n                \"vmla.f32   q8, q6, d0[0]       \\n\"\n                \"vmla.f32   q9, q6, d0[1]       \\n\"\n                \"vmla.f32   q10, q6, d1[0]      \\n\"\n                \"vmla.f32   q11, q6, d1[1]      \\n\"\n                \"vmla.f32   q12, q6, d2[0]      \\n\"\n                \"vmla.f32   q13, q6, d2[1]      \\n\"\n                \"vmla.f32   q14, q6, d3[0]      \\n\"\n                \"vmla.f32   q15, q6, d3[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q8, q7, d4[0]       \\n\"\n                \"vmla.f32   q9, q7, d4[1]       \\n\"\n                \"vmla.f32   q10, q7, d5[0]      \\n\"\n                \"vmla.f32   q11, q7, d5[1]      \\n\"\n                \"vmla.f32   q12, q7, d6[0]      \\n\"\n                \"vmla.f32   q13, q7, d6[1]      \\n\"\n                \"vmla.f32   q14, q7, d7[0]      \\n\"\n                \"vmla.f32   q15, q7, d7[1]      \\n\"\n                \"bne        4b                  \\n\"\n\n                \"5:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        7f                  \\n\"\n\n                \"6:                             \\n\"\n                \"vld1.u16   {d2-d3}, [%2 :64]!  \\n\"\n                \"vshll.u16  q0, d2, #16         \\n\"\n                \"vshll.u16  q1, d3, #16         \\n\"\n                \"vld1.u16   {d9}, [%1 :64]!     
\\n\"\n                \"vshll.u16  q4, d9, #16         \\n\"\n                \"vmla.f32   q8, q4, d0[0]       \\n\"\n                \"vmla.f32   q9, q4, d0[1]       \\n\"\n                \"vmla.f32   q10, q4, d1[0]      \\n\"\n                \"vmla.f32   q11, q4, d1[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q12, q4, d2[0]      \\n\"\n                \"vmla.f32   q13, q4, d2[1]      \\n\"\n                \"vmla.f32   q14, q4, d3[0]      \\n\"\n                \"vmla.f32   q15, q4, d3[1]      \\n\"\n                \"bne        6b                  \\n\"\n\n                \"7:                             \\n\"\n                // narrow to bf16 only when k_end is set, label 10 stores q8-q15 as fp32\n                \"cmp        %11, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                \"vshrn.u32  d16, q8, #16        \\n\"\n                \"vshrn.u32  d17, q9, #16        \\n\"\n                \"vshrn.u32  d18, q10, #16       \\n\"\n                \"vshrn.u32  d19, q11, #16       \\n\"\n                \"vshrn.u32  d20, q12, #16       \\n\"\n                \"vshrn.u32  d21, q13, #16       \\n\"\n                \"vshrn.u32  d22, q14, #16       \\n\"\n                \"vshrn.u32  d23, q15, #16       \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        8f                  \\n\"\n\n                \"vstm       %3!, {d16-d23}      \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // transpose4x8\n                \"vuzp.16    q8, q9              \\n\"\n                \"vuzp.16    q10, q11            \\n\"\n                \"vuzp.16    q8, q10             \\n\"\n                \"vuzp.16    q9, q11             \\n\"\n\n                \"add        r4, %3, %13, lsl #1 \\n\"\n                \"vst1.u16   {d16-d17}, [%3 :64]! 
\\n\"\n                \"vst1.u16   {d18-d19}, [r4 :64] \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u16   {d20-d21}, [r4 :64] \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u16   {d22-d23}, [r4 :64] \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #128        \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vstm       %0!, {d16-d23}      \\n\"\n                \"vstm       %0!, {d24-d31}      \\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                    _sum4 = 
_sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n                    _sum4 = vdupq_n_f32(0.f);\n                    _sum5 = vdupq_n_f32(0.f);\n                    _sum6 = vdupq_n_f32(0.f);\n                    _sum7 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB0), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB0), 
1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB0), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB0), 1);\n                _sum4 = vmlaq_lane_f32(_sum4, _pA, vget_low_f32(_pB1), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _pA, vget_low_f32(_pB1), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _pA, vget_high_f32(_pB1), 0);\n                _sum7 = vmlaq_lane_f32(_sum7, _pA, vget_high_f32(_pB1), 1);\n#endif\n\n                pA += 4;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum3));\n                    vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum4));\n                    vst1_u16(outptr0 + 4 * 5, float2bfloat(_sum5));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum6));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum7));\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    uint16x4_t _t0 = float2bfloat(_sum0);\n                    uint16x4_t _t1 = float2bfloat(_sum1);\n                    uint16x4_t _t2 = float2bfloat(_sum2);\n                    uint16x4_t _t3 = float2bfloat(_sum3);\n                    uint16x4_t _t4 = float2bfloat(_sum4);\n                    uint16x4_t _t5 = float2bfloat(_sum5);\n                    uint16x4_t _t6 = float2bfloat(_sum6);\n                    uint16x4_t _t7 = float2bfloat(_sum7);\n                    transpose4x8_u16(_t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7);\n\n                    vst1_u16(outptr0, _t0);\n                    vst1_u16(outptr0 + 4, _t1);\n                    vst1_u16(outptr0 + out_hstep, 
_t2);\n                    vst1_u16(outptr0 + out_hstep + 4, _t3);\n                    vst1_u16(outptr0 + out_hstep * 2, _t4);\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, _t5);\n                    vst1_u16(outptr0 + out_hstep * 3, _t6);\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, _t7);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n            }\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v28.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v29.16b, v28.16b            \\n\"\n                \"mov    v30.16b, v28.16b            \\n\"\n                \"mov    v31.16b, v28.16b            \\n\"\n\n                \"3:                                 \\n\"\n        
        \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #256]       \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"shll   v1.4s, v1.4h, #16           \\n\"\n                \"shll   v2.4s, v2.4h, #16           \\n\"\n                \"shll   v3.4s, v3.4h, #16           \\n\"\n                \"prfm   pldl1keep, [%1, #256]       \\n\"\n                \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"shll   v17.4s, v17.4h, #16         \\n\"\n                \"shll   v18.4s, v18.4h, #16         \\n\"\n                \"shll   v19.4s, v19.4h, #16         \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v30.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v0.s[3]     \\n\"\n                \"fmla   v28.4s, v17.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v17.4s, v1.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v18.4s, v2.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v2.s[1]     \\n\"\n                \"fmla   v30.4s, v18.4s, v2.s[2]     \\n\"\n                \"fmla   v31.4s, v18.4s, v2.s[3]     \\n\"\n                \"fmla   v28.4s, v19.4s, v3.s[0]     \\n\"\n                \"fmla   v29.4s, v19.4s, v3.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v3.s[2]     \\n\"\n                \"fmla   v31.4s, 
v19.4s, v3.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.4h}, [%2], #8           \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"ld1    {v16.4h}, [%1], #8          \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v16.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v16.4s, v0.s[3]     \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"shrn   v0.4h, v28.4s, #16          \\n\"\n                \"shrn2  v0.8h, v29.4s, #16          \\n\"\n                \"shrn   v1.4h, v30.4s, #16          \\n\"\n                \"shrn2  v1.8h, v31.4s, #16          \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v0.8h, v1.8h}, [%3], #32   \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x4\n                \"uzp1   v20.8h, v0.8h, v1.8h        \\n\"\n                \"uzp2   v21.8h, v0.8h, v1.8h        \\n\"\n\n                
\"uzp1   v0.8h, v20.8h, v21.8h       \\n\"\n                \"uzp2   v1.8h, v20.8h, v21.8h       \\n\"\n\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v0.d}[0], [%3], #8         \\n\"\n                \"st1    {v0.d}[1], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v1.d}[0], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v1.d}[1], [x4]             \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #64                 \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0, {d24-d31}       \\n\"\n                \"b          3f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n           
     \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d24-d25}, [%8]     \\n\"\n                \"b          2f                  \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q12, q12            \\n\"\n\n                \"2:                             \\n\"\n                \"vmov       q13, q12            \\n\"\n                \"vmov       q14, q12            \\n\"\n                \"vmov       q15, q12            \\n\"\n\n                \"3:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                \"4:                             \\n\"\n                \"pld        [%2, #256]          \\n\"\n                \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.u16   {d12-d15}, [%1 :64]! 
\\n\"\n                \"vshll.u16  q0, d4, #16         \\n\"\n                \"vshll.u16  q1, d5, #16         \\n\"\n                \"vshll.u16  q2, d6, #16         \\n\"\n                \"vshll.u16  q3, d7, #16         \\n\"\n                \"vshll.u16  q4, d12, #16        \\n\"\n                \"vshll.u16  q5, d13, #16        \\n\"\n                \"vshll.u16  q6, d14, #16        \\n\"\n                \"vshll.u16  q7, d15, #16        \\n\"\n                \"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q4, d0[1]      \\n\"\n                \"vmla.f32   q14, q4, d1[0]      \\n\"\n                \"vmla.f32   q15, q4, d1[1]      \\n\"\n                \"vmla.f32   q12, q5, d2[0]      \\n\"\n                \"vmla.f32   q13, q5, d2[1]      \\n\"\n                \"vmla.f32   q14, q5, d3[0]      \\n\"\n                \"vmla.f32   q15, q5, d3[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q12, q6, d4[0]      \\n\"\n                \"vmla.f32   q13, q6, d4[1]      \\n\"\n                \"vmla.f32   q14, q6, d5[0]      \\n\"\n                \"vmla.f32   q15, q6, d5[1]      \\n\"\n                \"vmla.f32   q12, q7, d6[0]      \\n\"\n                \"vmla.f32   q13, q7, d6[1]      \\n\"\n                \"vmla.f32   q14, q7, d7[0]      \\n\"\n                \"vmla.f32   q15, q7, d7[1]      \\n\"\n                \"bne        4b                  \\n\"\n\n                \"5:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        7f                  \\n\"\n\n                \"6:                             \\n\"\n                \"vld1.u16   {d0}, [%2 :64]!     \\n\"\n                \"vshll.u16  q0, d0, #16         \\n\"\n                \"vld1.u16   {d8}, [%1 :64]!     
\\n\"\n                \"vshll.u16  q4, d8, #16         \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q4, d0[1]      \\n\"\n                \"vmla.f32   q14, q4, d1[0]      \\n\"\n                \"vmla.f32   q15, q4, d1[1]      \\n\"\n                \"bne        6b                  \\n\"\n\n                \"7:                             \\n\"\n                // narrow to bf16 only when k_end is set, label 10 stores q12-q15 as fp32\n                \"cmp        %11, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                \"vshrn.u32  d24, q12, #16       \\n\"\n                \"vshrn.u32  d25, q13, #16       \\n\"\n                \"vshrn.u32  d26, q14, #16       \\n\"\n                \"vshrn.u32  d27, q15, #16       \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        8f                  \\n\"\n\n                \"vst1.u16   {d24-d27}, [%3]!    \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // transpose4x4\n                \"vuzp.16    q12, q13            \\n\"\n                \"vuzp.16    q12, q13            \\n\"\n\n                \"add        r4, %3, %13, lsl #1 \\n\"\n                \"vst1.u16   {d24}, [%3]!        
\\n\"\n                \"vst1.u16   {d25}, [r4]         \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u16   {d26}, [r4]         \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u16   {d27}, [r4]         \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #64         \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vstm       %0!, {d24-d31}      \\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                    _sum3 = vdupq_n_f32(0.f);\n    
            }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB = bfloat2float(vld1_u16(pB));\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB), 1);\n#endif\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum3));\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    uint16x4_t _t0 = float2bfloat(_sum0);\n                    uint16x4_t _t1 = float2bfloat(_sum1);\n                    uint16x4_t _t2 = float2bfloat(_sum2);\n                    uint16x4_t _t3 = float2bfloat(_sum3);\n                    transpose4x4_u16(_t0, _t1, _t2, _t3);\n\n                    vst1_u16(outptr0, _t0);\n                    vst1_u16(outptr0 + 
out_hstep * 1, _t1);\n                    vst1_u16(outptr0 + out_hstep * 2, _t2);\n                    vst1_u16(outptr0 + out_hstep * 3, _t3);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n            }\n\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v30.4s, v31.4s}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v30.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v31.16b, v30.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #128]       \\n\"\n                \"ld1    {v0.4h, v1.4h}, [%2], #16   \\n\"\n                
\"shll   v0.4s, v0.4h, #16           \\n\"\n                \"shll   v1.4s, v1.4h, #16           \\n\"\n                \"prfm   pldl1keep, [%1, #256]       \\n\"\n                \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"shll   v17.4s, v17.4h, #16         \\n\"\n                \"shll   v18.4s, v18.4h, #16         \\n\"\n                \"shll   v19.4s, v19.4h, #16         \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v16.4s, v0.s[1]     \\n\"\n                \"fmla   v30.4s, v17.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v17.4s, v0.s[3]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.4s, v18.4s, v1.s[0]     \\n\"\n                \"fmla   v29.4s, v18.4s, v1.s[1]     \\n\"\n                \"fmla   v30.4s, v19.4s, v1.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v1.s[3]     \\n\"\n                \"bne    4b                          \\n\"\n                \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.s}[0], [%2], #4         \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"ld1    {v16.4h}, [%1], #8          \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v31.4s, v16.4s, v0.s[1]     \\n\"\n                \"bne    6b                     
     \\n\"\n\n                \"7:                                 \\n\"\n                \"shrn   v0.4h, v30.4s, #16          \\n\"\n                \"shrn   v1.4h, v31.4s, #16          \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    10f                         \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v0.4h, v1.4h}, [%3], #16   \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // transpose4x2\n                \"zip1   v30.4h, v0.4h, v1.4h        \\n\"\n                \"zip2   v31.4h, v0.4h, v1.4h        \\n\"\n\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v30.s}[0], [%3], #4        \\n\"\n                \"st1    {v30.s}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v31.s}[0], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v31.s}[1], [x4]            \\n\"\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #32                 \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v30.4s, v31.4s}, [%0], #32 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                
\"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v16\", \"v17\", \"v18\", \"v19\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vld1.f32   {d28-d31}, [%0 :128] \\n\"\n                \"b          3f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n                \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d28-d29}, [%8]     \\n\"\n                \"b          2f                  \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q14, q14            \\n\"\n\n                \"2:                             \\n\"\n                \"vmov       q15, q14            \\n\"\n\n                \"3:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                \"veor       q12, q12            \\n\"\n                \"veor       q13, q13            \\n\"\n                \"4:                             \\n\"\n                \"pld        [%2, #128]          \\n\"\n                \"vld1.u16   {d2-d3}, [%2 :64]!  \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.u16   {d12-d15}, [%1 :64]! 
\\n\"\n                \"vshll.u16  q0, d2, #16         \\n\"\n                \"vshll.u16  q1, d3, #16         \\n\"\n                \"vshll.u16  q4, d12, #16        \\n\"\n                \"vshll.u16  q5, d13, #16        \\n\"\n                \"vshll.u16  q6, d14, #16        \\n\"\n                \"vshll.u16  q7, d15, #16        \\n\"\n                \"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q4, d0[1]      \\n\"\n                \"vmla.f32   q14, q5, d1[0]      \\n\"\n                \"vmla.f32   q15, q5, d1[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q12, q6, d2[0]      \\n\"\n                \"vmla.f32   q13, q6, d2[1]      \\n\"\n                \"vmla.f32   q14, q7, d3[0]      \\n\"\n                \"vmla.f32   q15, q7, d3[1]      \\n\"\n                \"bne        4b                  \\n\"\n                \"vadd.f32   q14, q14, q12       \\n\"\n                \"vadd.f32   q15, q15, q13       \\n\"\n\n                \"5:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        7f                  \\n\"\n\n                \"6:                             \\n\"\n                \"vld1.u32   {d0[0]}, [%2]!      \\n\"\n                \"vshll.u16  q0, d0, #16         \\n\"\n                \"vld1.u16   {d8}, [%1 :64]!     
\\n\"\n                \"vshll.u16  q4, d8, #16         \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q14, q4, d0[0]      \\n\"\n                \"vmla.f32   q15, q4, d0[1]      \\n\"\n                \"bne        6b                  \\n\"\n\n                \"7:                             \\n\"\n                \"vshrn.u32  d28, q14, #16       \\n\"\n                \"vshrn.u32  d29, q15, #16       \\n\"\n                \"cmp        %11, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        8f                  \\n\"\n\n                \"vst1.u16   {d28-d29}, [%3]!    \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // transpose4x2\n                \"vzip.16    d28, d29            \\n\"\n\n                \"add        r4, %3, %13, lsl #1 \\n\"\n                \"vst1.u32   {d28[0]}, [%3]!     \\n\"\n                \"vst1.u32   {d28[1]}, [r4]      \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u32   {d29[0]}, [r4]      \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u32   {d29[1]}, [r4]      \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #32         \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vst1.f32   {d28-d31}, [%0 :128]! 
\\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                    _sum1 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x2_t _pB = vget_low_f32(bfloat2float(vld1_u16(pB)));\n\n#if __aarch64__\n                _sum0 = vfmaq_lane_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_lane_f32(_sum1, _pA, _pB, 1);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, _pB, 1);\n#endif\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if 
(out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[4];\n                    unsigned short sum1[4];\n                    vst1_u16(sum0, float2bfloat(_sum0));\n                    vst1_u16(sum1, float2bfloat(_sum1));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const unsigned short* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v31.4s}, [%0]              \\n\"\n                \"b      2f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v31.4s}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                \"2:                              
   \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"3:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #64]        \\n\"\n                \"ld1    {v0.4h}, [%2], #8           \\n\"\n                \"shll   v0.4s, v0.4h, #16           \\n\"\n                \"prfm   pldl1keep, [%1, #256]       \\n\"\n                \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"shll   v17.4s, v17.4h, #16         \\n\"\n                \"shll   v18.4s, v18.4h, #16         \\n\"\n                \"shll   v19.4s, v19.4h, #16         \\n\"\n                \"fmla   v28.4s, v16.4s, v0.s[0]     \\n\"\n                \"fmla   v29.4s, v17.4s, v0.s[1]     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.4s, v18.4s, v0.s[2]     \\n\"\n                \"fmla   v31.4s, v19.4s, v0.s[3]     \\n\"\n                \"bne    3b                          \\n\"\n                \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n                \"fadd   v31.4s, v31.4s, v30.4s      \\n\"\n\n                \"4:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    6f                          \\n\"\n\n                \"5:                                 \\n\"\n                \"ld1r   {v0.4h}, [%2], #2           \\n\"\n                \"shll   v0.4s, v0.4h, #16           
\\n\"\n                \"ld1    {v16.4h}, [%1], #8          \\n\"\n                \"shll   v16.4s, v16.4h, #16         \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v31.4s, v16.4s, v0.4s       \\n\"\n                \"bne    5b                          \\n\"\n\n                \"6:                                 \\n\"\n                \"shrn   v0.4h, v31.4s, #16          \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    9f                          \\n\"\n\n                // if out_elempack == 4\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    7f                          \\n\"\n\n                \"st1    {v0.4h}, [%3], #8           \\n\"\n                \"b      8f                          \\n\"\n\n                // if out_elempack == 1\n                \"7:                                 \\n\"\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v0.h}[0], [%3], #2         \\n\"\n                \"st1    {v0.h}[1], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v0.h}[2], [x4]             \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v0.h}[3], [x4]             \\n\"\n\n                \"8:                                 \\n\"\n                \"add    %0, %0, #16                 \\n\"\n                \"b      10f                         \\n\"\n\n                \"9:                                 \\n\"\n                \"st1    {v31.4s}, [%0], #16         \\n\"\n\n                \"10:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                
\"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v16\", \"v17\", \"v18\", \"v19\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %10, #0             \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vld1.f32   {d30-d31}, [%0 :64] \\n\"\n                \"b          2f                  \\n\"\n\n                \"0:                             \\n\"\n                // if pC\n                \"cmp        %8, #0              \\n\"\n                \"beq        1f                  \\n\"\n\n                \"vld1.f32   {d30-d31}, [%8]     \\n\"\n                \"b          2f                  \\n\"\n\n                // else\n                \"1:                             \\n\"\n                \"veor       q15, q15            \\n\"\n\n                \"2:                             \\n\"\n                \"lsr        r4, %9, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        4f                  \\n\"\n\n                \"veor       q12, q12            \\n\"\n                \"veor       q13, q13            \\n\"\n                \"veor       q14, q14            \\n\"\n                \"3:                             \\n\"\n                \"pld        [%2, #64]           \\n\"\n                \"vld1.u16   {d1}, [%2]!         \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.u16   {d12-d15}, [%1 :64]! 
\\n\"\n                \"vshll.u16  q0, d1, #16         \\n\"\n                \"vshll.u16  q4, d12, #16        \\n\"\n                \"vshll.u16  q5, d13, #16        \\n\"\n                \"vshll.u16  q6, d14, #16        \\n\"\n                \"vshll.u16  q7, d15, #16        \\n\"\n                \"vmla.f32   q12, q4, d0[0]      \\n\"\n                \"vmla.f32   q13, q5, d0[1]      \\n\"\n                \"vmla.f32   q14, q6, d1[0]      \\n\"\n                \"vmla.f32   q15, q7, d1[1]      \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"bne        3b                  \\n\"\n                \"vadd.f32   q14, q14, q12       \\n\"\n                \"vadd.f32   q15, q15, q13       \\n\"\n                \"vadd.f32   q15, q15, q14       \\n\"\n\n                \"4:                             \\n\"\n                \"and        r4, %9, #3          \\n\" // r4 = remain = max_kk & 3\n                \"cmp        r4, #0              \\n\"\n                \"beq        6f                  \\n\"\n\n                \"5:                             \\n\"\n                \"vld1.u16   {d0[]}, [%2]!       \\n\"\n                \"vshll.u16  q0, d0, #16         \\n\"\n                \"vld1.u16   {d8}, [%1 :64]!     \\n\"\n                \"vshll.u16  q4, d8, #16         \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vmla.f32   q15, q4, q0         \\n\"\n                \"bne        5b                  \\n\"\n\n                \"6:                             \\n\"\n                \"vshrn.u32  d30, q15, #16       \\n\"\n                \"cmp        %11, #0             \\n\"\n                \"beq        9f                  \\n\"\n\n                // if out_elempack == 4\n                \"cmp        %12, #4             \\n\"\n                \"bne        7f                  \\n\"\n\n                \"vst1.u16   {d30}, [%3]!        
\\n\"\n                \"b          8f                  \\n\"\n\n                // if out_elempack == 1\n                \"7:                             \\n\"\n\n                \"add        r4, %3, %13, lsl #1 \\n\"\n                \"vst1.u16   {d30[0]}, [%3]!     \\n\"\n                \"vst1.u16   {d30[1]}, [r4]      \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u16   {d30[2]}, [r4]      \\n\"\n                \"add        r4, r4, %13, lsl #1 \\n\"\n                \"vst1.u16   {d30[3]}, [r4]      \\n\"\n\n                \"8:                             \\n\"\n                \"add        %0, %0, #16         \\n\"\n                \"b          10f                 \\n\"\n\n                \"9:                             \\n\"\n                \"vst1.f32   {d30-d31}, [%0 :64]! \\n\"\n\n                \"10:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q4\", \"q5\", \"q6\", \"q7\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _sum0;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f32(pC);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = 
vld1q_f32(outptr);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB = bfloat2float(vdup_n_u16(pB[0]));\n\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA, _pB);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA, _pB);\n#endif\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[4];\n                    vst1_u16(sum0, float2bfloat(_sum0));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n            }\n\n            outptr += 4;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n\n        pAT += max_kk * 4;\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum02;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum12;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n         
           _sum00 = vdupq_n_f32(pC[0]);\n                    _sum01 = vdupq_n_f32(pC[0]);\n                    _sum02 = vdupq_n_f32(pC[0]);\n                    _sum10 = vdupq_n_f32(pC[1]);\n                    _sum11 = vdupq_n_f32(pC[1]);\n                    _sum12 = vdupq_n_f32(pC[1]);\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum02 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                    _sum12 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                float32x4x2_t _tmp45 = vld2q_f32(outptr + 16);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum02 = _tmp45.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n                _sum12 = _tmp45.val[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                float32x2_t _pA = vget_low_f32(bfloat2float(vld1_u16(pA)));\n\n                _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum02 = vfmaq_lane_f32(_sum02, _pB2, _pA, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n                _sum12 = vfmaq_lane_f32(_sum12, _pB2, _pA, 1);\n\n                pA += 2;\n                pB += 
12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + 8, float2bfloat(_sum02));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep + 8, float2bfloat(_sum12));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float32x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n                vst2q_f32(outptr + 16, _tmp45);\n            }\n\n            outptr += 24;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vdupq_n_f32(pC[0]);\n                    _sum01 = vdupq_n_f32(pC[0]);\n                    _sum10 = vdupq_n_f32(pC[1]);\n                    _sum11 = vdupq_n_f32(pC[1]);\n                }\n                else\n                {\n                    _sum00 = vdupq_n_f32(0.f);\n                    _sum01 = vdupq_n_f32(0.f);\n                    _sum10 = vdupq_n_f32(0.f);\n                    _sum11 = vdupq_n_f32(0.f);\n                }\n            }\n           
 else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n\n                float32x2_t _pA = vget_low_f32(bfloat2float(vld1_u16(pA)));\n#if __aarch64__\n                _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n#else\n                _sum00 = vmlaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vmlaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum10 = vmlaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vmlaq_lane_f32(_sum11, _pB1, _pA, 1);\n#endif\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum11));\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = 
_sum11;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vdupq_n_f32(pC[0]);\n                    _sum1 = vdupq_n_f32(pC[1]);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                _sum0 = _tmp01.val[0];\n                _sum1 = _tmp01.val[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB = bfloat2float(vld1_u16(pB));\n\n                float32x2_t _pA = vget_low_f32(bfloat2float(vld1_u16(pA)));\n#if __aarch64__\n                _sum0 = vfmaq_lane_f32(_sum0, _pB, _pA, 0);\n                _sum1 = vfmaq_lane_f32(_sum1, _pB, _pA, 1);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pB, _pA, 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pB, _pA, 1);\n#endif\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum1));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2q_f32(outptr, _tmp01);\n            }\n\n            outptr += 8;\n       
 }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum00;\n            float sum01;\n            float sum10;\n            float sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum00 = pC[0];\n                    sum01 = pC[1];\n                    sum10 = pC[0];\n                    sum11 = pC[1];\n                }\n                else\n                {\n                    sum00 = 0.f;\n                    sum01 = 0.f;\n                    sum10 = 0.f;\n                    sum11 = 0.f;\n                }\n            }\n            else\n            {\n                sum00 = outptr[0];\n                sum01 = outptr[1];\n                sum10 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum00 += bfloat16_to_float32(pA[0]) * bfloat16_to_float32(pB[0]);\n                sum01 += bfloat16_to_float32(pA[1]) * bfloat16_to_float32(pB[0]);\n                sum10 += bfloat16_to_float32(pA[0]) * bfloat16_to_float32(pB[1]);\n                sum11 += bfloat16_to_float32(pA[1]) * bfloat16_to_float32(pB[1]);\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum00);\n                    outptr0[1] = float32_to_bfloat16(sum10);\n                    outptr0[out_hstep] = float32_to_bfloat16(sum01);\n                    outptr0[out_hstep + 1] = float32_to_bfloat16(sum11);\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n            }\n\n 
           outptr += 4;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum0 = pC[0];\n                    sum1 = pC[1];\n                }\n                else\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += bfloat16_to_float32(pA[0]) * bfloat16_to_float32(pB[0]);\n                sum1 += bfloat16_to_float32(pA[1]) * bfloat16_to_float32(pB[0]);\n                pA += 2;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum0);\n                    outptr0[out_hstep] = float32_to_bfloat16(sum1);\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const float*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    
_sum0 = vdupq_n_f32(pC[0]);\n                    _sum1 = vdupq_n_f32(pC[0]);\n                    _sum2 = vdupq_n_f32(pC[0]);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                    _sum2 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n                _sum2 = vld1q_f32(outptr + 8);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                float32x4_t _pA0 = bfloat2float(vdup_n_u16(pA[0]));\n\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n                _sum2 = vfmaq_f32(_sum2, _pA0, _pB2);\n\n                pA += 1;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 8, float2bfloat(_sum2));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 8, _sum2);\n            }\n\n            outptr += 12;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                
{\n                    _sum0 = vdupq_n_f32(pC[0]);\n                    _sum1 = vdupq_n_f32(pC[0]);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f32(0.f);\n                    _sum1 = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n\n                float32x4_t _pA0 = bfloat2float(vdup_n_u16(pA[0]));\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n#endif\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum = vdupq_n_f32(pC[0]);\n                }\n                else\n                {\n                    _sum = vdupq_n_f32(0.f);\n                }\n            }\n            else\n            {\n                _sum = vld1q_f32(outptr);\n         
   }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB = bfloat2float(vld1_u16(pB));\n                float32x4_t _pA = bfloat2float(vdup_n_u16(pA[0]));\n\n#if __aarch64__\n                _sum = vfmaq_f32(_sum, _pA, _pB);\n#else\n                _sum = vmlaq_f32(_sum, _pA, _pB);\n#endif\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum);\n            }\n\n            outptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum0 = pC[0];\n                    sum1 = pC[0];\n                }\n                else\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += bfloat16_to_float32(pA[0]) * bfloat16_to_float32(pB[0]);\n                sum1 += bfloat16_to_float32(pA[0]) * bfloat16_to_float32(pB[1]);\n\n                pA += 1;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum0);\n                    outptr0[1] = float32_to_bfloat16(sum1);\n                    
outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum = pC[0];\n                }\n                else\n                {\n                    sum = 0.f;\n                }\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum += bfloat16_to_float32(pA[0]) * bfloat16_to_float32(pB[0]);\n\n                pA += 1;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum);\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n\nstatic void convolution_im2col_gemm_get_optimal_tile_mnk_bf16s(int M, int N, int K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const int l2_cache_size_bf16 = (int)(get_cpu_level2_cache_size() / sizeof(unsigned short));\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    // solve K\n    {\n        // try not to split K\n#if __aarch64__\n        int tile_size = (l2_cache_size_bf16 - 32) / 12;\n#elif __ARM_NEON\n        int tile_size = (l2_cache_size_bf16 - 16) / 8;\n#else\n        int tile_size = (l2_cache_size_bf16 - 2) / 3;\n#endif\n\n#if __aarch64__\n        TILE_K = std::max(8, tile_size / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = 
std::max(4, tile_size / 4 * 4);\n#else\n        TILE_K = std::max(2, tile_size / 2 * 2);\n#endif\n\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n#if __aarch64__\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 3) / 4 * 4);\n#else\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 1) / 2 * 2);\n#endif\n    }\n\n    // solve M\n    {\n#if __aarch64__\n        int nn_M = (M + 31) / 32;\n#elif __ARM_NEON\n        int nn_M = (M + 15) / 16;\n#else\n        int nn_M = (M + 7) / 8;\n#endif\n\n#if __aarch64__\n        TILE_M = std::max(8, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_M = std::max(4, ((M + nn_M - 1) / nn_M + 3) / 4 * 4);\n#else\n        TILE_M = std::max(2, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n    }\n\n    {\n        TILE_M *= std::min(nT, get_physical_cpu_count());\n\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n#if __aarch64__\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 3) / 4 * 4);\n#else\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n\n        if (nT > 1)\n        {\n#if __aarch64__\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n#elif __ARM_NEON\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 3) / 4 * 4);\n#else\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 1) / 2 * 2);\n#endif\n        }\n    }\n\n    if (N > 0)\n    {\n        int tile_size;\n        if (TILE_K >= K)\n        {\n            tile_size = (l2_cache_size_bf16 - TILE_M * TILE_K) / TILE_K;\n        }\n        else\n        {\n            tile_size = (l2_cache_size_bf16 - TILE_M * TILE_K) / (TILE_M * 2 + TILE_K);\n        }\n\n#if __aarch64__\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = 
std::max(4, tile_size / 4 * 4);\n#else\n        TILE_N = std::max(1, tile_size);\n#endif\n\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n#if __aarch64__\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#else\n        TILE_N = std::min(TILE_N, (N + nn_N - 1) / nn_N);\n#endif\n    }\n}\n\nstatic void convolution_im2col_gemm_transform_kernel_bf16s(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt)\n{\n    // NCNN_LOGE(\"convolution_im2col_gemm_transform_kernel\");\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = outch;\n    const int K = inch * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk_bf16s(M, 0, K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    int elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        elempack = inch % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n\n    // maxk-inch-outch to pa-maxk-inch/pa-outch\n    Mat A_data;\n    if (maxk == 1)\n    {\n        cast_float32_to_bfloat16(kernel, A_data);\n        A_data = A_data.reshape(maxk * inch, outch);\n    }\n    else\n    {\n        Mat weight_data_r2 = kernel.reshape(maxk, inch, outch);\n\n        A_data.create(maxk * inch, outch, (size_t)2u);\n\n        for (int q = 0; q < outch; q += 1)\n        {\n            unsigned short* g00 = A_data.row<unsigned short>(q);\n\n            for (int p = 0; p + (elempack - 1) < inch; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        const float* k00 = weight_data_r2.channel(q).row(p + i);\n                        g00[0] = float32_to_bfloat16(k00[k]);\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n\n    AT.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, (size_t)2u);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n            convolution_im2col_pack_A_tile_bf16_fp16(A_data, AT_tile, i, max_ii, k, max_kk);\n        }\n    }\n}\n\nstatic int convolution_im2col_gemm_bf16s(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = top_blob.w * top_blob.h;\n    const int K 
= bottom_blob.c * bottom_blob.elempack * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk_bf16s(M, N, K, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        // im2col\n        convolution_im2col_input_tile_bf16_fp16(bottom_blob, BT_tile, j, max_jj, k, max_kk, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h);\n    }\n\n    Mat topT_tileX;\n    if (K > TILE_K)\n    {\n        topT_tileX.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT_tileX.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat topT_tile;\n        if (K > TILE_K)\n            topT_tile = topT_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n     
           const Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = k + TILE_K >= K;\n\n                convolution_gemm_transB_packed_tile_bf16s(AT_tile, BT_tile, bias, topT_tile, top_blob, i, max_ii, j, max_jj, k, max_kk, k_end, opt.use_a53_a55_optimized_kernel);\n            }\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_im2col_gemm_bf16s_fp16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_im2col_pack_A_tile_bf16_fp16(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    // A = (pa, maxk, inch/pa), outch\n    const int A_hstep = A.w;\n\n    unsigned short* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n        const unsigned short* p1 = (const unsigned short*)A + (i + ii + 1) * A_hstep + k;\n        const unsigned short* p2 = (const unsigned short*)A + (i + ii + 2) * A_hstep + k;\n        const unsigned short* p3 = (const unsigned short*)A + (i + ii + 3) * A_hstep + k;\n        const unsigned short* p4 = (const unsigned short*)A + (i + ii + 4) * A_hstep + k;\n        const unsigned short* p5 = (const unsigned short*)A + (i + ii + 5) * A_hstep + k;\n        const unsigned short* p6 = (const unsigned short*)A + (i + ii + 6) * A_hstep + k;\n        const unsigned short* p7 = (const unsigned short*)A + (i + ii + 7) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vld1q_u16(p0);\n            uint16x8_t _r1 = vld1q_u16(p1);\n            uint16x8_t _r2 = vld1q_u16(p2);\n            uint16x8_t _r3 = vld1q_u16(p3);\n            uint16x8_t _r4 = vld1q_u16(p4);\n            uint16x8_t _r5 = vld1q_u16(p5);\n            uint16x8_t _r6 = vld1q_u16(p6);\n            uint16x8_t _r7 = vld1q_u16(p7);\n            transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n            vst1q_u16(pp, _r0);\n            vst1q_u16(pp + 8, _r1);\n            vst1q_u16(pp + 8 * 2, _r2);\n            vst1q_u16(pp + 8 * 3, _r3);\n            vst1q_u16(pp + 8 * 4, _r4);\n            vst1q_u16(pp + 8 * 5, _r5);\n            vst1q_u16(pp + 8 * 6, _r6);\n            vst1q_u16(pp + 8 * 7, _r7);\n            pp += 64;\n            p0 += 8;\n     
       p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp[4] = p4[0];\n            pp[5] = p5[0];\n            pp[6] = p6[0];\n            pp[7] = p7[0];\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n        const unsigned short* p1 = (const unsigned short*)A + (i + ii + 1) * A_hstep + k;\n        const unsigned short* p2 = (const unsigned short*)A + (i + ii + 2) * A_hstep + k;\n        const unsigned short* p3 = (const unsigned short*)A + (i + ii + 3) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x4_t _r0123;\n            _r0123.val[0] = vld1q_u16(p0);\n            _r0123.val[1] = vld1q_u16(p1);\n            _r0123.val[2] = vld1q_u16(p2);\n            _r0123.val[3] = vld1q_u16(p3);\n            vst4q_u16(pp, _r0123);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x4_t _r0123;\n            _r0123.val[0] = vld1_u16(p0);\n            _r0123.val[1] = vld1_u16(p1);\n            _r0123.val[2] = vld1_u16(p2);\n            _r0123.val[3] = vld1_u16(p3);\n            vst4_u16(pp, _r0123);\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            
pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n        const unsigned short* p1 = (const unsigned short*)A + (i + ii + 1) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x2_t _r01;\n            _r01.val[0] = vld1q_u16(p0);\n            _r01.val[1] = vld1q_u16(p1);\n            vst2q_u16(pp, _r01);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x2_t _r01;\n            _r01.val[0] = vld1_u16(p0);\n            _r01.val[1] = vld1_u16(p1);\n            vst2_u16(pp, _r01);\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            vst1q_u16(pp, vld1q_u16(p0));\n            pp += 8;\n            p0 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            vst1_u16(pp, vld1_u16(p0));\n            pp += 4;\n            p0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = (unsigned short)p0[0];\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void convolution_im2col_input_tile_conv1x1s1d1_bf16_fp16(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, 
int max_kk)\n{\n    const int elempack = bottom_blob.elempack;\n\n    unsigned short* pp = B;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk < max_kk / 8; kk++)\n            {\n                // transpose8x12\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld4    {v16.8h, v17.8h, v18.8h, v19.8h}, [%0] \\n\"\n                    \"uzp1   v20.8h, v0.8h, v4.8h    \\n\"\n                    \"uzp2   v26.8h, v0.8h, v4.8h    \\n\"\n                    \"uzp1   v21.8h, v16.8h, v1.8h   \\n\"\n                    \"uzp2   v27.8h, v16.8h, v1.8h   \\n\"\n                    \"sub    %0, %0, #128            \\n\"\n                    \"uzp1   v22.8h, v5.8h, v17.8h   \\n\"\n                    \"uzp2   v28.8h, v5.8h, v17.8h   \\n\"\n                    \"uzp1   v23.8h, v2.8h, v6.8h    \\n\"\n                    \"uzp2   v29.8h, v2.8h, v6.8h    \\n\"\n                    \"uzp1   v24.8h, v18.8h, v3.8h   \\n\"\n                    \"uzp2   v30.8h, v18.8h, v3.8h   \\n\"\n                    \"uzp1   v25.8h, v7.8h, v19.8h   \\n\"\n                    \"uzp2   v31.8h, v7.8h, v19.8h   \\n\"\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%1], #64 \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%1], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%1], #64 \\n\"\n                 
   : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8_t _r0 = vld1q_u16(p0);\n                uint16x8_t _r1 = vld1q_u16(p0 + 8);\n                uint16x8_t _r2 = vld1q_u16(p0 + 8 * 2);\n                uint16x8_t _r3 = vld1q_u16(p0 + 8 * 3);\n                uint16x8_t _r4 = vld1q_u16(p0 + 8 * 4);\n                uint16x8_t _r5 = vld1q_u16(p0 + 8 * 5);\n                uint16x8_t _r6 = vld1q_u16(p0 + 8 * 6);\n                uint16x8_t _r7 = vld1q_u16(p0 + 8 * 7);\n                uint16x8_t _r8 = vld1q_u16(p0 + 8 * 8);\n                uint16x8_t _r9 = vld1q_u16(p0 + 8 * 9);\n                uint16x8_t _ra = vld1q_u16(p0 + 8 * 10);\n                uint16x8_t _rb = vld1q_u16(p0 + 8 * 11);\n                transpose8x12_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7, _r8, _r9, _ra, _rb);\n                vst1q_u16(pp, _r0);\n                vst1q_u16(pp + 8, _r1);\n                vst1q_u16(pp + 8 * 2, _r2);\n                vst1q_u16(pp + 8 * 3, _r3);\n                vst1q_u16(pp + 8 * 4, _r4);\n                vst1q_u16(pp + 8 * 5, _r5);\n                vst1q_u16(pp + 8 * 6, _r6);\n                vst1q_u16(pp + 8 * 7, _r7);\n                vst1q_u16(pp + 8 * 8, _r8);\n                vst1q_u16(pp + 8 * 9, _r9);\n                vst1q_u16(pp + 8 * 10, _ra);\n                vst1q_u16(pp + 8 * 11, _rb);\n                pp += 96;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 8;\n            }\n        }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned 
short*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                // transpose4x12\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #256]   \\n\"\n                    \"ld4    {v4.4h, v5.4h, v6.4h, v7.4h}, [%0]      \\n\"\n                    \"st1    {v0.8h}, [%1], #16          \\n\"\n                    \"st1    {v4.4h}, [%1], #8           \\n\"\n                    \"st1    {v1.8h}, [%1], #16          \\n\"\n                    \"st1    {v5.4h}, [%1], #8           \\n\"\n                    \"sub    %0, %0, #64                 \\n\"\n                    \"st1    {v2.8h}, [%1], #16          \\n\"\n                    \"st1    {v6.4h}, [%1], #8           \\n\"\n                    \"st1    {v3.8h}, [%1], #16          \\n\"\n                    \"st1    {v7.4h}, [%1], #8           \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x4x4_t _r0 = vld4_u16(p0);\n                uint16x4x4_t _r1 = vld4_u16(p0 + 16);\n                uint16x4x4_t _r2 = vld4_u16(p0 + 32);\n                vst1_u16(pp, _r0.val[0]);\n                vst1_u16(pp + 4, _r1.val[0]);\n                vst1_u16(pp + 4 * 2, _r2.val[0]);\n                vst1_u16(pp + 4 * 3, _r0.val[1]);\n                vst1_u16(pp + 4 * 4, _r1.val[1]);\n                vst1_u16(pp + 4 * 5, _r2.val[1]);\n                vst1_u16(pp + 4 * 6, _r0.val[2]);\n                vst1_u16(pp + 4 * 7, _r1.val[2]);\n                vst1_u16(pp + 4 * 8, _r2.val[2]);\n                vst1_u16(pp + 4 * 9, 
_r0.val[3]);\n                vst1_u16(pp + 4 * 10, _r1.val[3]);\n                vst1_u16(pp + 4 * 11, _r2.val[3]);\n                pp += 48;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _r01 = vld1q_u16(p0);\n                uint16x4_t _r2 = vld1_u16(p0 + 8);\n                vst1q_u16(pp, _r01);\n                vst1_u16(pp + 8, _r2);\n                pp += 12;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk < max_kk / 8; kk++)\n            {\n                // transpose8x8\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0] \\n\"\n                    \"uzp1   v16.8h, v0.8h, v4.8h    \\n\"\n                    \"uzp2   v20.8h, v0.8h, v4.8h    \\n\"\n                    \"uzp1   v17.8h, v1.8h, v5.8h    \\n\"\n                    \"uzp2   v21.8h, v1.8h, v5.8h    \\n\"\n                    \"sub    %0, %0, #64             \\n\"\n                    \"uzp1   v18.8h, v2.8h, v6.8h    \\n\"\n                    \"uzp2   v22.8h, v2.8h, v6.8h    \\n\"\n                    \"uzp1   v19.8h, v3.8h, v7.8h    \\n\"\n                    \"uzp2   v23.8h, v3.8h, v7.8h    \\n\"\n              
      \"st1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%1], #64 \\n\"\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8_t _r0 = vld1q_u16(p0);\n                uint16x8_t _r1 = vld1q_u16(p0 + 8);\n                uint16x8_t _r2 = vld1q_u16(p0 + 8 * 2);\n                uint16x8_t _r3 = vld1q_u16(p0 + 8 * 3);\n                uint16x8_t _r4 = vld1q_u16(p0 + 8 * 4);\n                uint16x8_t _r5 = vld1q_u16(p0 + 8 * 5);\n                uint16x8_t _r6 = vld1q_u16(p0 + 8 * 6);\n                uint16x8_t _r7 = vld1q_u16(p0 + 8 * 7);\n                transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n                vst1q_u16(pp, _r0);\n                vst1q_u16(pp + 8, _r1);\n                vst1q_u16(pp + 8 * 2, _r2);\n                vst1q_u16(pp + 8 * 3, _r3);\n                vst1q_u16(pp + 8 * 4, _r4);\n                vst1q_u16(pp + 8 * 5, _r5);\n                vst1q_u16(pp + 8 * 6, _r6);\n                vst1q_u16(pp + 8 * 7, _r7);\n                pp += 64;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 8;\n            }\n        }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                // transpose4x8\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, 
[%0] \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld4.u16   {d0,d2,d4,d6}, [%0 :64]! \\n\"\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld4.u16   {d1,d3,d5,d7}, [%0 :64] \\n\"\n                    \"sub        %0, %0, #32         \\n\"\n                    \"vstm       %1!, {d0-d7}        \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8x4_t _r0 = vld4q_u16(p0);\n                vst1q_u16(pp, _r0.val[0]);\n                vst1q_u16(pp + 8, _r0.val[1]);\n                vst1q_u16(pp + 16, _r0.val[2]);\n                vst1q_u16(pp + 24, _r0.val[3]);\n                pp += 32;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _r0 = vld1q_u16(p0);\n                vst1q_u16(pp, _r0);\n                pp += 8;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned 
short*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk < max_kk / 8; kk++)\n            {\n                // transpose8x4\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                    \"st4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8x4_t _r0;\n                _r0.val[0] = vld1q_u16(p0);\n                _r0.val[1] = vld1q_u16(p0 + 8);\n                _r0.val[2] = vld1q_u16(p0 + 8 * 2);\n                _r0.val[3] = vld1q_u16(p0 + 8 * 3);\n                vst4q_u16(pp, _r0);\n                pp += 32;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 8;\n            }\n        }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                // transpose4x4\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0] \\n\"\n                    \"st4    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #256]          
\\n\"\n                    \"vld1.u16   {d0-d3}, [%0 :64]   \\n\"\n                    \"vst4.u16   {d0-d3}, [%1 :64]!  \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp)\n                    : \"memory\", \"q0\", \"q1\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x4x4_t _r0;\n                _r0.val[0] = vld1_u16(p0);\n                _r0.val[1] = vld1_u16(p0 + 4);\n                _r0.val[2] = vld1_u16(p0 + 4 * 2);\n                _r0.val[3] = vld1_u16(p0 + 4 * 3);\n                vst4_u16(pp, _r0);\n                pp += 16;\n#endif // NCNN_GNU_INLINE_ASM\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                uint16x4_t _r0 = vld1_u16(p0);\n                vst1_u16(pp, _r0);\n                pp += 4;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk < max_kk / 8; kk++)\n            {\n                // transpose8x2\n                uint16x8x2_t _r0;\n                _r0.val[0] = vld1q_u16(p0);\n                _r0.val[1] = vld1q_u16(p0 + 8);\n                vst2q_u16(pp, _r0);\n                pp += 16;\n                p0 += bottom_blob.cstep * 8;\n            }\n        }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned 
short*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                // transpose4x2\n                uint16x4x2_t _r0;\n                _r0.val[0] = vld1_u16(p0);\n                _r0.val[1] = vld1_u16(p0 + 4);\n                vst2_u16(pp, _r0);\n                pp += 8;\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                pp += 2;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj++)\n    {\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk < max_kk / 8; kk++)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += bottom_blob.cstep * 8;\n            }\n        }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k / 4) + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk < max_kk / 4; kk++)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += bottom_blob.cstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)bottom_blob.channel(k) + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = 
p0[0];\n                pp += 1;\n                p0 += bottom_blob.cstep;\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_input_tile_bf16_fp16(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h)\n{\n    if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        convolution_im2col_input_tile_conv1x1s1d1_bf16_fp16(bottom_blob, B, j, max_jj, k, max_kk);\n        return;\n    }\n\n    const int w = bottom_blob.w;\n    // const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n\n    // j max_jj     outw*outh    split w and h\n\n    // k max_kk     pa*maxk*(inch/pa)    split inch\n\n    // k/max_kk shall be multiple of maxk\n\n    const int maxk = kernel_w * kernel_h;\n\n    unsigned short* pp = B;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dy2 = (j + jj + 2) / outw;\n        int dy3 = (j + jj + 3) / outw;\n        int dy4 = (j + jj + 4) / outw;\n        int dy5 = (j + jj + 5) / outw;\n        int dy6 = (j + jj + 6) / outw;\n        int dy7 = (j + jj + 7) / outw;\n        int dy8 = (j + jj + 8) / outw;\n        int dy9 = (j + jj + 9) / outw;\n        int dya = (j + jj + 10) / outw;\n        int dyb = (j + jj + 11) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n        int dx2 = (j + jj + 2) % outw;\n        int dx3 = (j + jj + 3) % outw;\n        int dx4 = (j + jj + 4) % outw;\n        int dx5 = (j + jj + 5) % outw;\n        int dx6 = (j + jj + 6) % outw;\n        int dx7 = (j + jj + 7) % outw;\n        int dx8 = (j + jj + 8) % outw;\n        int 
dx9 = (j + jj + 9) % outw;\n        int dxa = (j + jj + 10) % outw;\n        int dxb = (j + jj + 11) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int x2 = stride_w * dx2 + dilation_w * v;\n            int x3 = stride_w * dx3 + dilation_w * v;\n            int x4 = stride_w * dx4 + dilation_w * v;\n            int x5 = stride_w * dx5 + dilation_w * v;\n            int x6 = stride_w * dx6 + dilation_w * v;\n            int x7 = stride_w * dx7 + dilation_w * v;\n            int x8 = stride_w * dx8 + dilation_w * v;\n            int x9 = stride_w * dx9 + dilation_w * v;\n            int xa = stride_w * dxa + dilation_w * v;\n            int xb = stride_w * dxb + dilation_w * v;\n\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n            int y2 = stride_h * dy2 + dilation_h * u;\n            int y3 = stride_h * dy3 + dilation_h * u;\n            int y4 = stride_h * dy4 + dilation_h * u;\n            int y5 = stride_h * dy5 + dilation_h * u;\n            int y6 = stride_h * dy6 + dilation_h * u;\n            int y7 = stride_h * dy7 + dilation_h * u;\n            int y8 = stride_h * dy8 + dilation_h * u;\n            int y9 = stride_h * dy9 + dilation_h * u;\n            int ya = stride_h * dya + dilation_h * u;\n            int yb = stride_h * dyb + dilation_h * u;\n\n            const unsigned short* sptr0 = img.row<const unsigned short>(y0) + x0 * elempack;\n            const unsigned short* sptr1 = img.row<const unsigned short>(y1) + x1 * elempack;\n            const unsigned short* sptr2 = img.row<const unsigned 
short>(y2) + x2 * elempack;\n            const unsigned short* sptr3 = img.row<const unsigned short>(y3) + x3 * elempack;\n            const unsigned short* sptr4 = img.row<const unsigned short>(y4) + x4 * elempack;\n            const unsigned short* sptr5 = img.row<const unsigned short>(y5) + x5 * elempack;\n            const unsigned short* sptr6 = img.row<const unsigned short>(y6) + x6 * elempack;\n            const unsigned short* sptr7 = img.row<const unsigned short>(y7) + x7 * elempack;\n            const unsigned short* sptr8 = img.row<const unsigned short>(y8) + x8 * elempack;\n            const unsigned short* sptr9 = img.row<const unsigned short>(y9) + x9 * elempack;\n            const unsigned short* sptra = img.row<const unsigned short>(ya) + xa * elempack;\n            const unsigned short* sptrb = img.row<const unsigned short>(yb) + xb * elempack;\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 8)\n            {\n                uint16x8_t _r0 = vld1q_u16(sptr0);\n                uint16x8_t _r1 = vld1q_u16(sptr1);\n                uint16x8_t _r2 = vld1q_u16(sptr2);\n                uint16x8_t _r3 = vld1q_u16(sptr3);\n                uint16x8_t _r4 = vld1q_u16(sptr4);\n                uint16x8_t _r5 = vld1q_u16(sptr5);\n                uint16x8_t _r6 = vld1q_u16(sptr6);\n                uint16x8_t _r7 = vld1q_u16(sptr7);\n                uint16x8_t _r8 = vld1q_u16(sptr8);\n                uint16x8_t _r9 = vld1q_u16(sptr9);\n                uint16x8_t _ra = vld1q_u16(sptra);\n                uint16x8_t _rb = vld1q_u16(sptrb);\n                transpose8x12_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7, _r8, _r9, _ra, _rb);\n                vst1q_u16(pp, _r0);\n                vst1q_u16(pp + 8, _r1);\n                vst1q_u16(pp + 8 * 2, _r2);\n                vst1q_u16(pp + 8 * 3, _r3);\n                vst1q_u16(pp + 8 * 4, _r4);\n                vst1q_u16(pp + 8 * 5, _r5);\n                vst1q_u16(pp + 8 * 6, _r6);\n           
     vst1q_u16(pp + 8 * 7, _r7);\n                vst1q_u16(pp + 8 * 8, _r8);\n                vst1q_u16(pp + 8 * 9, _r9);\n                vst1q_u16(pp + 8 * 10, _ra);\n                vst1q_u16(pp + 8 * 11, _rb);\n                pp += 96;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 4)\n            {\n                uint16x4_t _r0 = vld1_u16(sptr0);\n                uint16x4_t _r1 = vld1_u16(sptr1);\n                uint16x4_t _r2 = vld1_u16(sptr2);\n                uint16x4_t _r3 = vld1_u16(sptr3);\n                uint16x4_t _r4 = vld1_u16(sptr4);\n                uint16x4_t _r5 = vld1_u16(sptr5);\n                uint16x4_t _r6 = vld1_u16(sptr6);\n                uint16x4_t _r7 = vld1_u16(sptr7);\n                uint16x4_t _r8 = vld1_u16(sptr8);\n                uint16x4_t _r9 = vld1_u16(sptr9);\n                uint16x4_t _ra = vld1_u16(sptra);\n                uint16x4_t _rb = vld1_u16(sptrb);\n                transpose4x12_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7, _r8, _r9, _ra, _rb);\n                vst1_u16(pp, _r0);\n                vst1_u16(pp + 4, _r1);\n                vst1_u16(pp + 4 * 2, _r2);\n                vst1_u16(pp + 4 * 3, _r3);\n                vst1_u16(pp + 4 * 4, _r4);\n                vst1_u16(pp + 4 * 5, _r5);\n                vst1_u16(pp + 4 * 6, _r6);\n                vst1_u16(pp + 4 * 7, _r7);\n                vst1_u16(pp + 4 * 8, _r8);\n                vst1_u16(pp + 4 * 9, _r9);\n                vst1_u16(pp + 4 * 10, _ra);\n                vst1_u16(pp + 4 * 11, _rb);\n                pp += 48;\n            }\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr2[0];\n                pp[3] = sptr3[0];\n                pp[4] = sptr4[0];\n                pp[5] = sptr5[0];\n                pp[6] = sptr6[0];\n                pp[7] = sptr7[0];\n                pp[8] = sptr8[0];\n                
pp[9] = sptr9[0];\n                pp[10] = sptra[0];\n                pp[11] = sptrb[0];\n                pp += 12;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dy2 = (j + jj + 2) / outw;\n        int dy3 = (j + jj + 3) / outw;\n        int dy4 = (j + jj + 4) / outw;\n        int dy5 = (j + jj + 5) / outw;\n        int dy6 = (j + jj + 6) / outw;\n        int dy7 = (j + jj + 7) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n        int dx2 = (j + jj + 2) % outw;\n        int dx3 = (j + jj + 3) % outw;\n        int dx4 = (j + jj + 4) % outw;\n        int dx5 = (j + jj + 5) % outw;\n        int dx6 = (j + jj + 6) % outw;\n        int dx7 = (j + jj + 7) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int x2 = stride_w * dx2 + dilation_w * v;\n            int x3 = stride_w * dx3 + dilation_w * v;\n            int x4 = stride_w * dx4 + dilation_w * v;\n            int x5 = stride_w * dx5 + dilation_w * v;\n            int x6 = stride_w * dx6 + dilation_w * v;\n            int x7 = stride_w * dx7 + dilation_w * v;\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n            int y2 = stride_h * dy2 + dilation_h * u;\n            int y3 = stride_h * dy3 + dilation_h * u;\n            int y4 = stride_h * dy4 + dilation_h * u;\n            int y5 = stride_h * dy5 + dilation_h * u;\n            int y6 = stride_h * dy6 + dilation_h * u;\n            int 
y7 = stride_h * dy7 + dilation_h * u;\n\n            const unsigned short* sptr0 = img.row<const unsigned short>(y0) + x0 * elempack;\n            const unsigned short* sptr1 = img.row<const unsigned short>(y1) + x1 * elempack;\n            const unsigned short* sptr2 = img.row<const unsigned short>(y2) + x2 * elempack;\n            const unsigned short* sptr3 = img.row<const unsigned short>(y3) + x3 * elempack;\n            const unsigned short* sptr4 = img.row<const unsigned short>(y4) + x4 * elempack;\n            const unsigned short* sptr5 = img.row<const unsigned short>(y5) + x5 * elempack;\n            const unsigned short* sptr6 = img.row<const unsigned short>(y6) + x6 * elempack;\n            const unsigned short* sptr7 = img.row<const unsigned short>(y7) + x7 * elempack;\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 8)\n            {\n                uint16x8_t _r0 = vld1q_u16(sptr0);\n                uint16x8_t _r1 = vld1q_u16(sptr1);\n                uint16x8_t _r2 = vld1q_u16(sptr2);\n                uint16x8_t _r3 = vld1q_u16(sptr3);\n                uint16x8_t _r4 = vld1q_u16(sptr4);\n                uint16x8_t _r5 = vld1q_u16(sptr5);\n                uint16x8_t _r6 = vld1q_u16(sptr6);\n                uint16x8_t _r7 = vld1q_u16(sptr7);\n                transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n                vst1q_u16(pp, _r0);\n                vst1q_u16(pp + 8, _r1);\n                vst1q_u16(pp + 8 * 2, _r2);\n                vst1q_u16(pp + 8 * 3, _r3);\n                vst1q_u16(pp + 8 * 4, _r4);\n                vst1q_u16(pp + 8 * 5, _r5);\n                vst1q_u16(pp + 8 * 6, _r6);\n                vst1q_u16(pp + 8 * 7, _r7);\n                pp += 64;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 4)\n            {\n                uint16x4_t _r0 = vld1_u16(sptr0);\n                uint16x4_t _r1 = vld1_u16(sptr1);\n                uint16x4_t _r2 = 
vld1_u16(sptr2);\n                uint16x4_t _r3 = vld1_u16(sptr3);\n                uint16x4_t _r4 = vld1_u16(sptr4);\n                uint16x4_t _r5 = vld1_u16(sptr5);\n                uint16x4_t _r6 = vld1_u16(sptr6);\n                uint16x4_t _r7 = vld1_u16(sptr7);\n                transpose4x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n                vst1_u16(pp, _r0);\n                vst1_u16(pp + 4, _r1);\n                vst1_u16(pp + 4 * 2, _r2);\n                vst1_u16(pp + 4 * 3, _r3);\n                vst1_u16(pp + 4 * 4, _r4);\n                vst1_u16(pp + 4 * 5, _r5);\n                vst1_u16(pp + 4 * 6, _r6);\n                vst1_u16(pp + 4 * 7, _r7);\n                pp += 32;\n            }\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr2[0];\n                pp[3] = sptr3[0];\n                pp[4] = sptr4[0];\n                pp[5] = sptr5[0];\n                pp[6] = sptr6[0];\n                pp[7] = sptr7[0];\n                pp += 8;\n            }\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dy2 = (j + jj + 2) / outw;\n        int dy3 = (j + jj + 3) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n        int dx2 = (j + jj + 2) % outw;\n        int dx3 = (j + jj + 3) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int x2 = stride_w * dx2 + dilation_w * v;\n            int x3 = stride_w * dx3 + dilation_w * 
v;\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n            int y2 = stride_h * dy2 + dilation_h * u;\n            int y3 = stride_h * dy3 + dilation_h * u;\n\n            const unsigned short* sptr0 = img.row<const unsigned short>(y0) + x0 * elempack;\n            const unsigned short* sptr1 = img.row<const unsigned short>(y1) + x1 * elempack;\n            const unsigned short* sptr2 = img.row<const unsigned short>(y2) + x2 * elempack;\n            const unsigned short* sptr3 = img.row<const unsigned short>(y3) + x3 * elempack;\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 8)\n            {\n                uint16x8x4_t _r0;\n                _r0.val[0] = vld1q_u16(sptr0);\n                _r0.val[1] = vld1q_u16(sptr1);\n                _r0.val[2] = vld1q_u16(sptr2);\n                _r0.val[3] = vld1q_u16(sptr3);\n                vst4q_u16(pp, _r0);\n                pp += 32;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 4)\n            {\n                uint16x4x4_t _r0;\n                _r0.val[0] = vld1_u16(sptr0);\n                _r0.val[1] = vld1_u16(sptr1);\n                _r0.val[2] = vld1_u16(sptr2);\n                _r0.val[3] = vld1_u16(sptr3);\n                vst4_u16(pp, _r0);\n                pp += 16;\n            }\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr2[0];\n                pp[3] = sptr3[0];\n                pp += 4;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        int dy0 = (j + jj) / outw;\n        int dy1 = (j + jj + 1) / outw;\n        int dx0 = (j + jj) % outw;\n        int dx1 = (j + jj + 1) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n           
 int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x0 = stride_w * dx0 + dilation_w * v;\n            int x1 = stride_w * dx1 + dilation_w * v;\n            int y0 = stride_h * dy0 + dilation_h * u;\n            int y1 = stride_h * dy1 + dilation_h * u;\n\n            const unsigned short* sptr0 = img.row<const unsigned short>(y0) + x0 * elempack;\n            const unsigned short* sptr1 = img.row<const unsigned short>(y1) + x1 * elempack;\n\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 8)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr0[1];\n                pp[3] = sptr1[1];\n                pp[4] = sptr0[2];\n                pp[5] = sptr1[2];\n                pp[6] = sptr0[3];\n                pp[7] = sptr1[3];\n                pp[8 + 0] = sptr0[4];\n                pp[8 + 1] = sptr1[4];\n                pp[8 + 2] = sptr0[5];\n                pp[8 + 3] = sptr1[5];\n                pp[8 + 4] = sptr0[6];\n                pp[8 + 5] = sptr1[6];\n                pp[8 + 6] = sptr0[7];\n                pp[8 + 7] = sptr1[7];\n                pp += 16;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 4)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp[2] = sptr0[1];\n                pp[3] = sptr1[1];\n                pp[4] = sptr0[2];\n                pp[5] = sptr1[2];\n                pp[6] = sptr0[3];\n                pp[7] = sptr1[3];\n                pp += 8;\n            }\n#endif // __ARM_NEON\n            if (elempack == 1)\n            {\n                pp[0] = sptr0[0];\n                pp[1] = sptr1[0];\n                pp += 2;\n            }\n        }\n    }\n    for (; jj < max_jj; jj++)\n    {\n        int dy = (j + jj) / 
outw;\n        int dx = (j + jj) % outw;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x = stride_w * dx + dilation_w * v;\n            int y = stride_h * dy + dilation_h * u;\n\n            const unsigned short* sptr = img.row<const unsigned short>(y) + x * elempack;\n\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 8)\n            {\n                pp[0] = sptr[0];\n                pp[1] = sptr[1];\n                pp[2] = sptr[2];\n                pp[3] = sptr[3];\n                pp[4] = sptr[4];\n                pp[5] = sptr[5];\n                pp[6] = sptr[6];\n                pp[7] = sptr[7];\n                pp += 8;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            if (elempack == 4)\n            {\n                pp[0] = sptr[0];\n                pp[1] = sptr[1];\n                pp[2] = sptr[2];\n                pp[3] = sptr[3];\n                pp += 4;\n            }\n#endif // __ARM_NEON\n            if (elempack == 1)\n            {\n                pp[0] = sptr[0];\n                pp += 1;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_im2col_gemm_fp16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_gemm_transB_packed_tile_fp16sa(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end, int use_a53_a55_optimized_kernel)\n{\n    // NCNN_LOGE(\"convolution_gemm_transB_packed_tile_fp16sa %d %d %d %d %d %d\", i, max_ii, j, max_jj, k, max_kk);\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.cstep;\n\n    const __fp16* pAT = AT_tile;\n    const __fp16* pBT = BT_tile;\n    const __fp16* pC = CT_tile;\n\n    __fp16* outptr = topT_tile;\n\n    int ii = 0;\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const __fp16*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            const __fp16* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            if (use_a53_a55_optimized_kernel)\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                    \"subs   %0, %0, #128                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"ld1    {v20.8h}, [%8]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:              
                   \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            \\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n                    \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.8h}, [%2], #16          \\n\"\n\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n                    \"ldr    d5, [%1], #8                \\n\"\n                    \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n                    \"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    
\"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                    \"ldr    x22, [%2], #8               \\n\"\n                    \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                    \"ldr    d6, [%1], #8                \\n\"\n                    \"fmla   v24.8h, v4.8h, v0.h[4]      \\n\"\n                    \"ldr    x26, [%1], #8               \\n\"\n                    \"fmla   v25.8h, v4.8h, v0.h[5]      \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v26.8h, v4.8h, v0.h[6]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v27.8h, v4.8h, v0.h[7]      \\n\"\n                    \"ldr    x23, [%2], #8               \\n\"\n                    \"fmla   v28.8h, v4.8h, v1.h[0]      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\" // NOTE PRELOAD\n                    \"fmla   v29.8h, v4.8h, v1.h[1]      \\n\"\n                    \"ins    v5.d[1], x25                \\n\"\n                    \"fmla   v30.8h, v4.8h, v1.h[2]      \\n\"\n                    \"ldr    d8, [%2], #8                \\n\"\n                    \"fmla   v31.8h, v4.8h, v1.h[3]      \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v20.8h, v5.8h, v1.h[4]      \\n\"\n                    \"ldr    d7, [%1], #8                \\n\"\n                    \"fmla   v21.8h, v5.8h, v1.h[5]      \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v22.8h, v5.8h, v1.h[6]      \\n\"\n                    \"ldr    x27, [%1], #8               \\n\"\n                    \"fmla   v23.8h, v5.8h, v1.h[7]      \\n\"\n                    \"ldr    d9, [%2], #8                \\n\"\n                    \"fmla   v24.8h, v5.8h, v2.h[0]      \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v25.8h, v5.8h, v2.h[1]      \\n\"\n                    
\"ins    v6.d[1], x26                \\n\"\n                    \"fmla   v26.8h, v5.8h, v2.h[2]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v27.8h, v5.8h, v2.h[3]      \\n\"\n                    \"ldr    d4, [%1], #8                \\n\"\n                    \"fmla   v28.8h, v5.8h, v2.h[4]      \\n\"\n                    \"ldr    x24, [%1], #8               \\n\"\n                    \"fmla   v29.8h, v5.8h, v2.h[5]      \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v30.8h, v5.8h, v2.h[6]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v31.8h, v5.8h, v2.h[7]      \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v20.8h, v6.8h, v3.h[0]      \\n\"\n                    \"fmla   v21.8h, v6.8h, v3.h[1]      \\n\"\n                    \"fmla   v22.8h, v6.8h, v3.h[2]      \\n\"\n                    \"fmla   v23.8h, v6.8h, v3.h[3]      \\n\"\n                    \"fmla   v24.8h, v6.8h, v3.h[4]      \\n\"\n                    \"fmla   v25.8h, v6.8h, v3.h[5]      \\n\"\n                    \"ins    v8.d[1], x20                \\n\"\n                    \"fmla   v26.8h, v6.8h, v3.h[6]      \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v27.8h, v6.8h, v3.h[7]      \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v28.8h, v6.8h, v8.h[0]      \\n\"\n                    \"fmla   v29.8h, v6.8h, v8.h[1]      \\n\"\n                    \"ins    v7.d[1], x27                \\n\"\n                    \"fmla   v30.8h, v6.8h, v8.h[2]      \\n\"\n                    \"fmla   v31.8h, v6.8h, v8.h[3]      \\n\"\n                    \"fmla   v20.8h, v7.8h, v8.h[4]      \\n\"\n                    \"fmla   v21.8h, v7.8h, v8.h[5]      \\n\"\n    
                \"ins    v9.d[1], x21                \\n\"\n                    \"fmla   v22.8h, v7.8h, v8.h[6]      \\n\"\n                    \"fmla   v23.8h, v7.8h, v8.h[7]      \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v24.8h, v7.8h, v9.h[0]      \\n\"\n                    \"fmla   v25.8h, v7.8h, v9.h[1]      \\n\"\n                    \"ins    v4.d[1], x24                \\n\"\n                    \"fmla   v26.8h, v7.8h, v9.h[2]      \\n\"\n                    \"fmla   v27.8h, v7.8h, v9.h[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.8h, v7.8h, v9.h[4]      \\n\"\n                    \"fmla   v29.8h, v7.8h, v9.h[5]      \\n\"\n                    \"fmla   v30.8h, v7.8h, v9.h[6]      \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v31.8h, v7.8h, v9.h[7]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #16                 \\n\"\n                    \"sub    %2, %2, #32                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                    \"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                    \"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                    \"fmla   v24.8h, v4.8h, v1.h[0]      \\n\"\n                    \"fmla   v25.8h, v4.8h, 
v1.h[1]      \\n\"\n                    \"fmla   v26.8h, v4.8h, v1.h[2]      \\n\"\n                    \"fmla   v27.8h, v4.8h, v1.h[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.8h, v4.8h, v2.h[0]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v2.h[1]      \\n\"\n                    \"fmla   v30.8h, v4.8h, v2.h[2]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v2.h[3]      \\n\"\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    11f                         \\n\"\n\n                    // if out_elempack == 8\n                    \"cmp    %w12, #8                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%3], #64 \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%3], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%3], #64 \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"8:                                 \\n\"\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    9f                          \\n\"\n\n                    \"zip1   v12.2d, v20.2d, v21.2d      \\n\"\n                    \"zip2   v18.2d, v20.2d, v21.2d      \\n\"\n                    \"zip1   v13.2d, v22.2d, v23.2d      \\n\"\n                    \"zip2   v19.2d, v22.2d, v23.2d      \\n\"\n                    \"zip1   v14.2d, v24.2d, v25.2d      \\n\"\n                    \"zip2   v20.2d, v24.2d, v25.2d      \\n\"\n                    \"zip1   v15.2d, v26.2d, v27.2d      \\n\"\n                    \"zip2   v21.2d, v26.2d, v27.2d      \\n\"\n                    \"zip1   v16.2d, 
v28.2d, v29.2d      \\n\"\n                    \"zip2   v22.2d, v28.2d, v29.2d      \\n\"\n                    \"zip1   v17.2d, v30.2d, v31.2d      \\n\"\n                    \"zip2   v23.2d, v30.2d, v31.2d      \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%3], #64 \\n\"\n                    \"st1    {v16.8h, v17.8h}, [%3], #32 \\n\"\n                    \"st1    {v18.8h, v19.8h, v20.8h, v21.8h}, [x4], #64 \\n\"\n                    \"st1    {v22.8h, v23.8h}, [x4]      \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 1\n                    \"9:                                 \\n\"\n                    // transpose8x12\n                    \"zip1   v18.8h, v20.8h, v21.8h      \\n\"\n                    \"zip2   v19.8h, v20.8h, v21.8h      \\n\"\n                    \"zip1   v20.8h, v22.8h, v23.8h      \\n\"\n                    \"zip2   v21.8h, v22.8h, v23.8h      \\n\"\n                    \"zip1   v22.8h, v24.8h, v25.8h      \\n\"\n                    \"zip2   v23.8h, v24.8h, v25.8h      \\n\"\n                    \"zip1   v24.8h, v26.8h, v27.8h      \\n\"\n                    \"zip2   v25.8h, v26.8h, v27.8h      \\n\"\n                    \"zip1   v26.8h, v28.8h, v29.8h      \\n\"\n                    \"zip2   v27.8h, v28.8h, v29.8h      \\n\"\n                    \"zip1   v28.8h, v30.8h, v31.8h      \\n\"\n                    \"zip2   v29.8h, v30.8h, v31.8h      \\n\"\n\n                    \"zip1   v0.4s, v18.4s, v20.4s       \\n\"\n                    \"zip2   v3.4s, v18.4s, v20.4s       \\n\"\n                    \"zip1   v6.4s, v19.4s, v21.4s       \\n\"\n                    \"zip2   v9.4s, v19.4s, v21.4s       \\n\"\n                    \"zip1   v1.4s, v22.4s, v24.4s       \\n\"\n                    \"zip2   v4.4s, v22.4s, v24.4s       
\\n\"\n                    \"zip1   v7.4s, v23.4s, v25.4s       \\n\"\n                    \"zip2   v10.4s, v23.4s, v25.4s      \\n\"\n                    \"zip1   v2.4s, v26.4s, v28.4s       \\n\"\n                    \"zip2   v5.4s, v26.4s, v28.4s       \\n\"\n                    \"zip1   v8.4s, v27.4s, v29.4s       \\n\"\n                    \"zip2   v11.4s, v27.4s, v29.4s      \\n\"\n\n                    \"mov    v12.d[0], v0.d[1]           \\n\"\n                    \"mov    v13.d[0], v1.d[1]           \\n\"\n                    \"mov    v14.d[0], v2.d[1]           \\n\"\n                    \"mov    v15.d[0], v3.d[1]           \\n\"\n                    \"mov    v16.d[0], v4.d[1]           \\n\"\n                    \"mov    v17.d[0], v5.d[1]           \\n\"\n                    \"mov    v18.d[0], v6.d[1]           \\n\"\n                    \"mov    v19.d[0], v7.d[1]           \\n\"\n                    \"mov    v20.d[0], v8.d[1]           \\n\"\n                    \"mov    v21.d[0], v9.d[1]           \\n\"\n                    \"mov    v22.d[0], v10.d[1]          \\n\"\n                    \"mov    v23.d[0], v11.d[1]          \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.4h, v1.4h, v2.4h}, [%3], #24 \\n\"\n                    \"st1    {v12.4h, v13.4h, v14.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v3.4h, v4.4h, v5.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v15.4h, v16.4h, v17.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.4h, v7.4h, v8.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v18.4h, v19.4h, v20.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v9.4h, v10.4h, v11.4h}, 
[x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v21.4h, v22.4h, v23.4h}, [x4] \\n\"\n\n                    \"10:                                \\n\"\n                    \"add    %0, %0, #192                \\n\"\n                    \"b      12f                         \\n\"\n\n                    \"11:                                \\n\"\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    \"12:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, 
v27.8h}, [%0], #64 \\n\"\n                    \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                    \"subs   %0, %0, #128                \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"ld1    {v20.8h}, [%8]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v21.16b, v20.16b            \\n\"\n                    \"mov    v22.16b, v20.16b            \\n\"\n                    \"mov    v23.16b, v20.16b            \\n\"\n                    \"mov    v24.16b, v20.16b            \\n\"\n                    \"mov    v25.16b, v20.16b            \\n\"\n                    \"mov    v26.16b, v20.16b            \\n\"\n                    \"mov    v27.16b, v20.16b            \\n\"\n                    \"mov    v28.16b, v20.16b            \\n\"\n                    \"mov    v29.16b, v20.16b            \\n\"\n                    \"mov    v30.16b, v20.16b            \\n\"\n                    \"mov    v31.16b, v20.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       
\\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%2], #64 \\n\"\n\n                    \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                    \"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                    \"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                    \"fmla   v24.8h, v4.8h, v0.h[4]      \\n\"\n                    \"fmla   v25.8h, v4.8h, v0.h[5]      \\n\"\n                    \"fmla   v26.8h, v4.8h, v0.h[6]      \\n\"\n                    \"fmla   v27.8h, v4.8h, v0.h[7]      \\n\"\n                    \"fmla   v28.8h, v4.8h, v1.h[0]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v1.h[1]      \\n\"\n                    \"fmla   v30.8h, v4.8h, v1.h[2]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v1.h[3]      \\n\"\n\n                    \"fmla   v20.8h, v5.8h, v1.h[4]      \\n\"\n                    \"fmla   v21.8h, v5.8h, v1.h[5]      \\n\"\n                    \"fmla   v22.8h, v5.8h, v1.h[6]      \\n\"\n                    \"fmla   v23.8h, v5.8h, v1.h[7]      \\n\"\n                    \"fmla   v24.8h, v5.8h, v2.h[0]      \\n\"\n                    \"fmla   v25.8h, v5.8h, v2.h[1]      \\n\"\n                    \"fmla   v26.8h, v5.8h, v2.h[2]      \\n\"\n                    \"fmla   v27.8h, v5.8h, v2.h[3]      \\n\"\n                    \"fmla   v28.8h, v5.8h, v2.h[4]      \\n\"\n                    \"fmla   v29.8h, v5.8h, v2.h[5]      \\n\"\n                    \"fmla   v30.8h, v5.8h, v2.h[6]      \\n\"\n                    \"fmla   v31.8h, v5.8h, v2.h[7]      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v8.8h, v9.8h}, [%2], #32   \\n\"\n\n                    \"fmla   v20.8h, v6.8h, v3.h[0]      \\n\"\n                    \"fmla   v21.8h, v6.8h, v3.h[1]      \\n\"\n                    \"fmla   v22.8h, v6.8h, v3.h[2]      \\n\"\n                    \"fmla   v23.8h, v6.8h, v3.h[3]  
    \\n\"\n                    \"fmla   v24.8h, v6.8h, v3.h[4]      \\n\"\n                    \"fmla   v25.8h, v6.8h, v3.h[5]      \\n\"\n                    \"fmla   v26.8h, v6.8h, v3.h[6]      \\n\"\n                    \"fmla   v27.8h, v6.8h, v3.h[7]      \\n\"\n                    \"fmla   v28.8h, v6.8h, v8.h[0]      \\n\"\n                    \"fmla   v29.8h, v6.8h, v8.h[1]      \\n\"\n                    \"fmla   v30.8h, v6.8h, v8.h[2]      \\n\"\n                    \"fmla   v31.8h, v6.8h, v8.h[3]      \\n\"\n\n                    \"subs   w4, w4, #1                  \\n\"\n\n                    \"fmla   v20.8h, v7.8h, v8.h[4]      \\n\"\n                    \"fmla   v21.8h, v7.8h, v8.h[5]      \\n\"\n                    \"fmla   v22.8h, v7.8h, v8.h[6]      \\n\"\n                    \"fmla   v23.8h, v7.8h, v8.h[7]      \\n\"\n                    \"fmla   v24.8h, v7.8h, v9.h[0]      \\n\"\n                    \"fmla   v25.8h, v7.8h, v9.h[1]      \\n\"\n                    \"fmla   v26.8h, v7.8h, v9.h[2]      \\n\"\n                    \"fmla   v27.8h, v7.8h, v9.h[3]      \\n\"\n                    \"fmla   v28.8h, v7.8h, v9.h[4]      \\n\"\n                    \"fmla   v29.8h, v7.8h, v9.h[5]      \\n\"\n                    \"fmla   v30.8h, v7.8h, v9.h[6]      \\n\"\n                    \"fmla   v31.8h, v7.8h, v9.h[7]      \\n\"\n\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%2], #24 \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"fmla   v20.8h, v4.8h, v0.h[0]      \\n\"\n                    
\"fmla   v21.8h, v4.8h, v0.h[1]      \\n\"\n                    \"fmla   v22.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v23.8h, v4.8h, v0.h[3]      \\n\"\n                    \"fmla   v24.8h, v4.8h, v1.h[0]      \\n\"\n                    \"fmla   v25.8h, v4.8h, v1.h[1]      \\n\"\n                    \"fmla   v26.8h, v4.8h, v1.h[2]      \\n\"\n                    \"fmla   v27.8h, v4.8h, v1.h[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.8h, v4.8h, v2.h[0]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v2.h[1]      \\n\"\n                    \"fmla   v30.8h, v4.8h, v2.h[2]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v2.h[3]      \\n\"\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    11f                         \\n\"\n\n                    // if out_elempack == 8\n                    \"cmp    %w12, #8                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%3], #64 \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%3], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%3], #64 \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"8:                                 \\n\"\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    9f                          \\n\"\n\n                    \"zip1   v12.2d, v20.2d, v21.2d      \\n\"\n                    \"zip2   v18.2d, v20.2d, v21.2d      \\n\"\n                    \"zip1   v13.2d, v22.2d, v23.2d      \\n\"\n                    \"zip2   v19.2d, v22.2d, v23.2d      \\n\"\n                
    \"zip1   v14.2d, v24.2d, v25.2d      \\n\"\n                    \"zip2   v20.2d, v24.2d, v25.2d      \\n\"\n                    \"zip1   v15.2d, v26.2d, v27.2d      \\n\"\n                    \"zip2   v21.2d, v26.2d, v27.2d      \\n\"\n                    \"zip1   v16.2d, v28.2d, v29.2d      \\n\"\n                    \"zip2   v22.2d, v28.2d, v29.2d      \\n\"\n                    \"zip1   v17.2d, v30.2d, v31.2d      \\n\"\n                    \"zip2   v23.2d, v30.2d, v31.2d      \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%3], #64 \\n\"\n                    \"st1    {v16.8h, v17.8h}, [%3], #32 \\n\"\n                    \"st1    {v18.8h, v19.8h, v20.8h, v21.8h}, [x4], #64 \\n\"\n                    \"st1    {v22.8h, v23.8h}, [x4]      \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 1\n                    \"9:                                 \\n\"\n                    // transpose8x12\n                    \"zip1   v18.8h, v20.8h, v21.8h      \\n\"\n                    \"zip2   v19.8h, v20.8h, v21.8h      \\n\"\n                    \"zip1   v20.8h, v22.8h, v23.8h      \\n\"\n                    \"zip2   v21.8h, v22.8h, v23.8h      \\n\"\n                    \"zip1   v22.8h, v24.8h, v25.8h      \\n\"\n                    \"zip2   v23.8h, v24.8h, v25.8h      \\n\"\n                    \"zip1   v24.8h, v26.8h, v27.8h      \\n\"\n                    \"zip2   v25.8h, v26.8h, v27.8h      \\n\"\n                    \"zip1   v26.8h, v28.8h, v29.8h      \\n\"\n                    \"zip2   v27.8h, v28.8h, v29.8h      \\n\"\n                    \"zip1   v28.8h, v30.8h, v31.8h      \\n\"\n                    \"zip2   v29.8h, v30.8h, v31.8h      \\n\"\n\n                    \"zip1   v0.4s, v18.4s, v20.4s       \\n\"\n                    \"zip2   v3.4s, 
v18.4s, v20.4s       \\n\"\n                    \"zip1   v6.4s, v19.4s, v21.4s       \\n\"\n                    \"zip2   v9.4s, v19.4s, v21.4s       \\n\"\n                    \"zip1   v1.4s, v22.4s, v24.4s       \\n\"\n                    \"zip2   v4.4s, v22.4s, v24.4s       \\n\"\n                    \"zip1   v7.4s, v23.4s, v25.4s       \\n\"\n                    \"zip2   v10.4s, v23.4s, v25.4s      \\n\"\n                    \"zip1   v2.4s, v26.4s, v28.4s       \\n\"\n                    \"zip2   v5.4s, v26.4s, v28.4s       \\n\"\n                    \"zip1   v8.4s, v27.4s, v29.4s       \\n\"\n                    \"zip2   v11.4s, v27.4s, v29.4s      \\n\"\n\n                    \"mov    v12.d[0], v0.d[1]           \\n\"\n                    \"mov    v13.d[0], v1.d[1]           \\n\"\n                    \"mov    v14.d[0], v2.d[1]           \\n\"\n                    \"mov    v15.d[0], v3.d[1]           \\n\"\n                    \"mov    v16.d[0], v4.d[1]           \\n\"\n                    \"mov    v17.d[0], v5.d[1]           \\n\"\n                    \"mov    v18.d[0], v6.d[1]           \\n\"\n                    \"mov    v19.d[0], v7.d[1]           \\n\"\n                    \"mov    v20.d[0], v8.d[1]           \\n\"\n                    \"mov    v21.d[0], v9.d[1]           \\n\"\n                    \"mov    v22.d[0], v10.d[1]          \\n\"\n                    \"mov    v23.d[0], v11.d[1]          \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v0.4h, v1.4h, v2.4h}, [%3], #24 \\n\"\n                    \"st1    {v12.4h, v13.4h, v14.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v3.4h, v4.4h, v5.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v15.4h, v16.4h, v17.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v6.4h, 
v7.4h, v8.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v18.4h, v19.4h, v20.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v9.4h, v10.4h, v11.4h}, [x4] \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v21.4h, v22.4h, v23.4h}, [x4] \\n\"\n\n                    \"10:                                \\n\"\n                    \"add    %0, %0, #192                \\n\"\n                    \"b      12f                         \\n\"\n\n                    \"11:                                \\n\"\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%0], #64 \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    \"12:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else  // NCNN_GNU_INLINE_ASM\n            float16x8_t 
_sum0;\n            float16x8_t _sum1;\n            float16x8_t _sum2;\n            float16x8_t _sum3;\n            float16x8_t _sum4;\n            float16x8_t _sum5;\n            float16x8_t _sum6;\n            float16x8_t _sum7;\n            float16x8_t _sum8;\n            float16x8_t _sum9;\n            float16x8_t _suma;\n            float16x8_t _sumb;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f16(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                    _sum4 = _sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                    _sum8 = _sum0;\n                    _sum9 = _sum0;\n                    _suma = _sum0;\n                    _sumb = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                    _sum1 = vdupq_n_f16(0.f);\n                    _sum2 = vdupq_n_f16(0.f);\n                    _sum3 = vdupq_n_f16(0.f);\n                    _sum4 = vdupq_n_f16(0.f);\n                    _sum5 = vdupq_n_f16(0.f);\n                    _sum6 = vdupq_n_f16(0.f);\n                    _sum7 = vdupq_n_f16(0.f);\n                    _sum8 = vdupq_n_f16(0.f);\n                    _sum9 = vdupq_n_f16(0.f);\n                    _suma = vdupq_n_f16(0.f);\n                    _sumb = vdupq_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8);\n                _sum2 = vld1q_f16(outptr + 8 * 2);\n                _sum3 = vld1q_f16(outptr + 8 * 3);\n                _sum4 = vld1q_f16(outptr + 8 * 4);\n                _sum5 = vld1q_f16(outptr + 8 * 5);\n                _sum6 = vld1q_f16(outptr + 8 * 6);\n                _sum7 = vld1q_f16(outptr + 8 * 7);\n                
_sum8 = vld1q_f16(outptr + 8 * 8);\n                _sum9 = vld1q_f16(outptr + 8 * 9);\n                _suma = vld1q_f16(outptr + 8 * 10);\n                _sumb = vld1q_f16(outptr + 8 * 11);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_lane_f16(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_lane_f16(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_lane_f16(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_lane_f16(_sum7, _pA, _pB1, 3);\n                _sum8 = vfmaq_lane_f16(_sum8, _pA, _pB2, 0);\n                _sum9 = vfmaq_lane_f16(_sum9, _pA, _pB2, 1);\n                _suma = vfmaq_lane_f16(_suma, _pA, _pB2, 2);\n                _sumb = vfmaq_lane_f16(_sumb, _pA, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8, _sum1);\n                    vst1q_f16(outptr0 + 8 * 2, _sum2);\n                    vst1q_f16(outptr0 + 8 * 3, _sum3);\n                    vst1q_f16(outptr0 + 8 * 4, _sum4);\n                    vst1q_f16(outptr0 + 8 * 5, _sum5);\n                    vst1q_f16(outptr0 + 8 * 6, _sum6);\n                    vst1q_f16(outptr0 + 8 * 7, _sum7);\n                    vst1q_f16(outptr0 + 8 * 8, _sum8);\n                    vst1q_f16(outptr0 + 8 * 9, _sum9);\n                    
vst1q_f16(outptr0 + 8 * 10, _suma);\n                    vst1q_f16(outptr0 + 8 * 11, _sumb);\n                    outptr0 += 96;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + 4 * 2, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + 4 * 3, vget_low_f16(_sum3));\n                    vst1_f16(outptr0 + 4 * 4, vget_low_f16(_sum4));\n                    vst1_f16(outptr0 + 4 * 5, vget_low_f16(_sum5));\n                    vst1_f16(outptr0 + 4 * 6, vget_low_f16(_sum6));\n                    vst1_f16(outptr0 + 4 * 7, vget_low_f16(_sum7));\n                    vst1_f16(outptr0 + 4 * 8, vget_low_f16(_sum8));\n                    vst1_f16(outptr0 + 4 * 9, vget_low_f16(_sum9));\n                    vst1_f16(outptr0 + 4 * 10, vget_low_f16(_suma));\n                    vst1_f16(outptr0 + 4 * 11, vget_low_f16(_sumb));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 2, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 3, vget_high_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 4, vget_high_f16(_sum4));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 5, vget_high_f16(_sum5));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 6, vget_high_f16(_sum6));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 7, vget_high_f16(_sum7));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 8, vget_high_f16(_sum8));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 9, vget_high_f16(_sum9));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 10, vget_high_f16(_suma));\n                    vst1_f16(outptr0 
+ out_hstep * 4 + 4 * 11, vget_high_f16(_sumb));\n\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x12_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7, _sum8, _sum9, _suma, _sumb);\n\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + 8, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep + 4, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep + 8, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 2, vget_low_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 2 + 4, vget_high_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 2 + 8, vget_low_f16(_sum4));\n                    vst1_f16(outptr0 + out_hstep * 3, vget_high_f16(_sum4));\n                    vst1_f16(outptr0 + out_hstep * 3 + 4, vget_low_f16(_sum5));\n                    vst1_f16(outptr0 + out_hstep * 3 + 8, vget_high_f16(_sum5));\n                    vst1_f16(outptr0 + out_hstep * 4, vget_low_f16(_sum6));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum6));\n                    vst1_f16(outptr0 + out_hstep * 4 + 8, vget_low_f16(_sum7));\n                    vst1_f16(outptr0 + out_hstep * 5, vget_high_f16(_sum7));\n                    vst1_f16(outptr0 + out_hstep * 5 + 4, vget_low_f16(_sum8));\n                    vst1_f16(outptr0 + out_hstep * 5 + 8, vget_high_f16(_sum8));\n                    vst1_f16(outptr0 + out_hstep * 6, vget_low_f16(_sum9));\n                    vst1_f16(outptr0 + out_hstep * 6 + 4, vget_high_f16(_sum9));\n                    vst1_f16(outptr0 + out_hstep * 6 + 8, vget_low_f16(_suma));\n                    vst1_f16(outptr0 + out_hstep * 7, vget_high_f16(_suma));\n        
            vst1_f16(outptr0 + out_hstep * 7 + 4, vget_low_f16(_sumb));\n                    vst1_f16(outptr0 + out_hstep * 7 + 8, vget_high_f16(_sumb));\n\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n                vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n                vst1q_f16(outptr + 8 * 4, _sum4);\n                vst1q_f16(outptr + 8 * 5, _sum5);\n                vst1q_f16(outptr + 8 * 6, _sum6);\n                vst1q_f16(outptr + 8 * 7, _sum7);\n                vst1q_f16(outptr + 8 * 8, _sum8);\n                vst1q_f16(outptr + 8 * 9, _sum9);\n                vst1q_f16(outptr + 8 * 10, _suma);\n                vst1q_f16(outptr + 8 * 11, _sumb);\n            }\n\n            outptr += 96;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const __fp16* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            if (use_a53_a55_optimized_kernel)\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                    \"subs   %0, %0, #64                 \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"ld1    {v24.8h}, [%8]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                    \"2:             
                    \\n\"\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n                    \"mov    v28.16b, v24.16b            \\n\"\n                    \"mov    v29.16b, v24.16b            \\n\"\n                    \"mov    v30.16b, v24.16b            \\n\"\n                    \"mov    v31.16b, v24.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v0.8h}, [%2], #16          \\n\"\n\n                    \"ldr    d5, [%1], #8                \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n\n                    \".align 4                           \\n\"\n                    \"4:                                 \\n\"\n                    \"ldr    d1, [%2], #8                \\n\"\n                    \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                    \"ldr    x21, [%2], #8               \\n\"\n                    \"fmla   v25.8h, v4.8h, v0.h[1]      \\n\"\n                    \"ins    v5.d[1], x25                \\n\"\n                    \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                    \"ldr    d6, [%1], #8                \\n\"\n                    \"fmla   v27.8h, v4.8h, v0.h[3]      \\n\"\n                    \"ldr    x26, [%1], #8               \\n\"\n                    \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                    \"ldr    d2, [%2], #8                \\n\"\n                    
\"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                    \"ins    v1.d[1], x21                \\n\"\n                    \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                    \"ldr    x22, [%2], #8               \\n\"\n                    \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n                    \"ldr    d7, [%1], #8                \\n\"\n                    \"fmla   v24.8h, v5.8h, v1.h[0]      \\n\"\n                    \"ldr    x27, [%1], #8               \\n\"\n                    \"fmla   v25.8h, v5.8h, v1.h[1]      \\n\"\n                    \"ins    v6.d[1], x26                \\n\"\n                    \"fmla   v26.8h, v5.8h, v1.h[2]      \\n\"\n                    \"ldr    d3, [%2], #8                \\n\"\n                    \"fmla   v27.8h, v5.8h, v1.h[3]      \\n\"\n                    \"ldr    x23, [%2], #8               \\n\"\n                    \"fmla   v28.8h, v5.8h, v1.h[4]      \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v29.8h, v5.8h, v1.h[5]      \\n\"\n                    \"ins    v2.d[1], x22                \\n\"\n                    \"fmla   v30.8h, v5.8h, v1.h[6]      \\n\"\n                    \"ldr    d4, [%1], #8                \\n\"\n                    \"fmla   v31.8h, v5.8h, v1.h[7]      \\n\"\n                    \"ldr    x24, [%1], #8               \\n\"\n                    \"fmla   v24.8h, v6.8h, v2.h[0]      \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\" // NOTE PRELOAD\n                    \"fmla   v25.8h, v6.8h, v2.h[1]      \\n\"\n                    \"ins    v7.d[1], x27                \\n\"\n                    \"fmla   v26.8h, v6.8h, v2.h[2]      \\n\"\n                    \"ldr    d0, [%2], #8                \\n\"\n                    \"fmla   v27.8h, v6.8h, v2.h[3]      \\n\"\n                    \"ldr    x20, [%2], #8               \\n\"\n                    \"fmla   v28.8h, v6.8h, v2.h[4]      \\n\"\n    
                \"ldr    d5, [%1], #8                \\n\"\n                    \"fmla   v29.8h, v6.8h, v2.h[5]      \\n\"\n                    \"ins    v3.d[1], x23                \\n\"\n                    \"fmla   v30.8h, v6.8h, v2.h[6]      \\n\"\n                    \"ldr    x25, [%1], #8               \\n\"\n                    \"fmla   v31.8h, v6.8h, v2.h[7]      \\n\"\n                    \"fmla   v24.8h, v7.8h, v3.h[0]      \\n\"\n                    \"fmla   v25.8h, v7.8h, v3.h[1]      \\n\"\n                    \"fmla   v26.8h, v7.8h, v3.h[2]      \\n\"\n                    \"ins    v4.d[1], x24                \\n\"\n                    \"fmla   v27.8h, v7.8h, v3.h[3]      \\n\"\n                    \"fmla   v28.8h, v7.8h, v3.h[4]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v29.8h, v7.8h, v3.h[5]      \\n\"\n                    \"fmla   v30.8h, v7.8h, v3.h[6]      \\n\"\n                    \"ins    v0.d[1], x20                \\n\"\n                    \"fmla   v31.8h, v7.8h, v3.h[7]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"sub    %1, %1, #32                 \\n\"\n                    \"sub    %2, %2, #16                 \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.8h}, [%2], #16          \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                    \"fmla   v25.8h, v4.8h, v0.h[1]      \\n\"\n                    \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v27.8h, v4.8h, 
v0.h[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                    \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    11f                         \\n\"\n\n                    // if out_elempack == 8\n                    \"cmp    %w12, #8                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%3], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%3], #64 \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"8:                                 \\n\"\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    9f                          \\n\"\n\n                    \"zip1   v16.2d, v24.2d, v25.2d      \\n\"\n                    \"zip2   v20.2d, v24.2d, v25.2d      \\n\"\n                    \"zip1   v17.2d, v26.2d, v27.2d      \\n\"\n                    \"zip2   v21.2d, v26.2d, v27.2d      \\n\"\n                    \"zip1   v18.2d, v28.2d, v29.2d      \\n\"\n                    \"zip2   v22.2d, v28.2d, v29.2d      \\n\"\n                    \"zip1   v19.2d, v30.2d, v31.2d      \\n\"\n                    \"zip2   v23.2d, v30.2d, v31.2d      \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%3], #64 \\n\"\n                    \"st1    {v20.8h, 
v21.8h, v22.8h, v23.8h}, [x4] \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 1\n                    \"9:                                 \\n\"\n                    // transpose8x8\n                    \"zip1   v22.8h, v24.8h, v25.8h      \\n\"\n                    \"zip2   v23.8h, v24.8h, v25.8h      \\n\"\n                    \"zip1   v24.8h, v26.8h, v27.8h      \\n\"\n                    \"zip2   v25.8h, v26.8h, v27.8h      \\n\"\n                    \"zip1   v26.8h, v28.8h, v29.8h      \\n\"\n                    \"zip2   v27.8h, v28.8h, v29.8h      \\n\"\n                    \"zip1   v28.8h, v30.8h, v31.8h      \\n\"\n                    \"zip2   v29.8h, v30.8h, v31.8h      \\n\"\n\n                    \"zip1   v16.4s, v22.4s, v24.4s      \\n\"\n                    \"zip2   v17.4s, v22.4s, v24.4s      \\n\"\n                    \"zip1   v18.4s, v23.4s, v25.4s      \\n\"\n                    \"zip2   v19.4s, v23.4s, v25.4s      \\n\"\n                    \"zip1   v20.4s, v26.4s, v28.4s      \\n\"\n                    \"zip2   v21.4s, v26.4s, v28.4s      \\n\"\n                    \"zip1   v22.4s, v27.4s, v29.4s      \\n\"\n                    \"zip2   v23.4s, v27.4s, v29.4s      \\n\"\n\n                    \"zip1   v24.2d, v16.2d, v20.2d      \\n\"\n                    \"zip2   v25.2d, v16.2d, v20.2d      \\n\"\n                    \"zip1   v26.2d, v17.2d, v21.2d      \\n\"\n                    \"zip2   v27.2d, v17.2d, v21.2d      \\n\"\n                    \"zip1   v28.2d, v18.2d, v22.2d      \\n\"\n                    \"zip2   v29.2d, v18.2d, v22.2d      \\n\"\n                    \"zip1   v30.2d, v19.2d, v23.2d      \\n\"\n                    \"zip2   v31.2d, v19.2d, v23.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v24.8h}, [%3], #16         \\n\"\n                    \"st1    {v25.8h}, [x4]              \\n\"\n                 
   \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v26.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v27.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v28.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v29.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v30.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v31.8h}, [x4]              \\n\"\n\n                    \"10:                                \\n\"\n                    \"add    %0, %0, #128                \\n\"\n                    \"b      12f                         \\n\"\n\n                    \"11:                                \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    \"12:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"x20\", \"x21\", \"x22\", \"x23\", \"x24\", \"x25\", \"x26\", \"x27\", \"v0\", \"v1\", \"v2\", \"v3\", 
\"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            else\n            {\n                asm volatile(\n                    \"cbz    %w10, 0f                    \\n\"\n\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                    \"subs   %0, %0, #64                 \\n\"\n                    \"b      3f                          \\n\"\n\n                    \"0:                                 \\n\"\n                    // if pC\n                    \"cbz    %8, 1f                      \\n\"\n\n                    \"ld1    {v24.8h}, [%8]              \\n\"\n                    \"b      2f                          \\n\"\n\n                    // else\n                    \"1:                                 \\n\"\n                    \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n\n                    \"2:                                 \\n\"\n                    \"mov    v25.16b, v24.16b            \\n\"\n                    \"mov    v26.16b, v24.16b            \\n\"\n                    \"mov    v27.16b, v24.16b            \\n\"\n                    \"mov    v28.16b, v24.16b            \\n\"\n                    \"mov    v29.16b, v24.16b            \\n\"\n                    \"mov    v30.16b, v24.16b            \\n\"\n                    \"mov    v31.16b, v24.16b            \\n\"\n\n                    \"3:                                 \\n\"\n                    \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    5f                          \\n\"\n\n                    \"4:                                 \\n\"\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                  
  \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%2], #64 \\n\"\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n                    \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                    \"fmla   v25.8h, v4.8h, v0.h[1]      \\n\"\n                    \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v27.8h, v4.8h, v0.h[3]      \\n\"\n                    \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                    \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n                    \"fmla   v24.8h, v5.8h, v1.h[0]      \\n\"\n                    \"fmla   v25.8h, v5.8h, v1.h[1]      \\n\"\n                    \"fmla   v26.8h, v5.8h, v1.h[2]      \\n\"\n                    \"fmla   v27.8h, v5.8h, v1.h[3]      \\n\"\n                    \"fmla   v28.8h, v5.8h, v1.h[4]      \\n\"\n                    \"fmla   v29.8h, v5.8h, v1.h[5]      \\n\"\n                    \"fmla   v30.8h, v5.8h, v1.h[6]      \\n\"\n                    \"fmla   v31.8h, v5.8h, v1.h[7]      \\n\"\n                    \"fmla   v24.8h, v6.8h, v2.h[0]      \\n\"\n                    \"fmla   v25.8h, v6.8h, v2.h[1]      \\n\"\n                    \"fmla   v26.8h, v6.8h, v2.h[2]      \\n\"\n                    \"fmla   v27.8h, v6.8h, v2.h[3]      \\n\"\n                    \"fmla   v28.8h, v6.8h, v2.h[4]      \\n\"\n                    \"fmla   v29.8h, v6.8h, v2.h[5]      \\n\"\n                    \"fmla   v30.8h, v6.8h, v2.h[6]      \\n\"\n                    \"fmla   v31.8h, v6.8h, v2.h[7]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v24.8h, v7.8h, v3.h[0]      \\n\"\n                    \"fmla   v25.8h, v7.8h, v3.h[1]      \\n\"\n                    \"fmla   v26.8h, v7.8h, v3.h[2]      \\n\"\n          
          \"fmla   v27.8h, v7.8h, v3.h[3]      \\n\"\n                    \"fmla   v28.8h, v7.8h, v3.h[4]      \\n\"\n                    \"fmla   v29.8h, v7.8h, v3.h[5]      \\n\"\n                    \"fmla   v30.8h, v7.8h, v3.h[6]      \\n\"\n                    \"fmla   v31.8h, v7.8h, v3.h[7]      \\n\"\n                    \"bne    4b                          \\n\"\n\n                    \"5:                                 \\n\"\n                    \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                    \"cmp    w4, #0                      \\n\"\n                    \"beq    7f                          \\n\"\n\n                    \"6:                                 \\n\"\n                    \"ld1    {v0.8h}, [%2], #16          \\n\"\n                    \"ld1    {v4.8h}, [%1], #16          \\n\"\n                    \"fmla   v24.8h, v4.8h, v0.h[0]      \\n\"\n                    \"fmla   v25.8h, v4.8h, v0.h[1]      \\n\"\n                    \"fmla   v26.8h, v4.8h, v0.h[2]      \\n\"\n                    \"fmla   v27.8h, v4.8h, v0.h[3]      \\n\"\n                    \"subs   w4, w4, #1                  \\n\"\n                    \"fmla   v28.8h, v4.8h, v0.h[4]      \\n\"\n                    \"fmla   v29.8h, v4.8h, v0.h[5]      \\n\"\n                    \"fmla   v30.8h, v4.8h, v0.h[6]      \\n\"\n                    \"fmla   v31.8h, v4.8h, v0.h[7]      \\n\"\n                    \"bne    6b                          \\n\"\n\n                    \"7:                                 \\n\"\n                    \"tst    %w11, #255                  \\n\"\n                    \"beq    11f                         \\n\"\n\n                    // if out_elempack == 8\n                    \"cmp    %w12, #8                    \\n\"\n                    \"bne    8f                          \\n\"\n\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%3], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, 
v31.8h}, [%3], #64 \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 4\n                    \"8:                                 \\n\"\n                    \"cmp    %w12, #4                    \\n\"\n                    \"bne    9f                          \\n\"\n\n                    \"zip1   v16.2d, v24.2d, v25.2d      \\n\"\n                    \"zip2   v20.2d, v24.2d, v25.2d      \\n\"\n                    \"zip1   v17.2d, v26.2d, v27.2d      \\n\"\n                    \"zip2   v21.2d, v26.2d, v27.2d      \\n\"\n                    \"zip1   v18.2d, v28.2d, v29.2d      \\n\"\n                    \"zip2   v22.2d, v28.2d, v29.2d      \\n\"\n                    \"zip1   v19.2d, v30.2d, v31.2d      \\n\"\n                    \"zip2   v23.2d, v30.2d, v31.2d      \\n\"\n\n                    \"lsl    w4, %w13, #2                \\n\"\n                    \"add    x4, %3, w4, sxtw 1          \\n\"\n                    \"st1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%3], #64 \\n\"\n                    \"st1    {v20.8h, v21.8h, v22.8h, v23.8h}, [x4] \\n\"\n                    \"b      10f                         \\n\"\n\n                    // if out_elempack == 1\n                    \"9:                                 \\n\"\n                    // transpose8x8\n                    \"zip1   v22.8h, v24.8h, v25.8h      \\n\"\n                    \"zip2   v23.8h, v24.8h, v25.8h      \\n\"\n                    \"zip1   v24.8h, v26.8h, v27.8h      \\n\"\n                    \"zip2   v25.8h, v26.8h, v27.8h      \\n\"\n                    \"zip1   v26.8h, v28.8h, v29.8h      \\n\"\n                    \"zip2   v27.8h, v28.8h, v29.8h      \\n\"\n                    \"zip1   v28.8h, v30.8h, v31.8h      \\n\"\n                    \"zip2   v29.8h, v30.8h, v31.8h      \\n\"\n\n                    \"zip1   v16.4s, v22.4s, v24.4s      \\n\"\n                    \"zip2   v17.4s, v22.4s, v24.4s      \\n\"\n                  
  \"zip1   v18.4s, v23.4s, v25.4s      \\n\"\n                    \"zip2   v19.4s, v23.4s, v25.4s      \\n\"\n                    \"zip1   v20.4s, v26.4s, v28.4s      \\n\"\n                    \"zip2   v21.4s, v26.4s, v28.4s      \\n\"\n                    \"zip1   v22.4s, v27.4s, v29.4s      \\n\"\n                    \"zip2   v23.4s, v27.4s, v29.4s      \\n\"\n\n                    \"zip1   v24.2d, v16.2d, v20.2d      \\n\"\n                    \"zip2   v25.2d, v16.2d, v20.2d      \\n\"\n                    \"zip1   v26.2d, v17.2d, v21.2d      \\n\"\n                    \"zip2   v27.2d, v17.2d, v21.2d      \\n\"\n                    \"zip1   v28.2d, v18.2d, v22.2d      \\n\"\n                    \"zip2   v29.2d, v18.2d, v22.2d      \\n\"\n                    \"zip1   v30.2d, v19.2d, v23.2d      \\n\"\n                    \"zip2   v31.2d, v19.2d, v23.2d      \\n\"\n\n                    \"add    x4, %3, %w13, sxtw 1        \\n\"\n                    \"st1    {v24.8h}, [%3], #16         \\n\"\n                    \"st1    {v25.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v26.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v27.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v28.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v29.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v30.8h}, [x4]              \\n\"\n                    \"add    x4, x4, %w13, sxtw 1        \\n\"\n                    \"st1    {v31.8h}, [x4]              \\n\"\n\n                    \"10:                                \\n\"\n                    \"add    %0, %0, #128                \\n\"\n                    \"b     
 12f                         \\n\"\n\n                    \"11:                                \\n\"\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    \"12:                                \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(pA),     // %1\n                    \"=r\"(pB),     // %2\n                    \"=r\"(outptr0) // %3\n                    : \"0\"(outptr),\n                    \"1\"(pA),\n                    \"2\"(pB),\n                    \"3\"(outptr0),\n                    \"r\"(pC),           // %8\n                    \"r\"(max_kk),       // %9\n                    \"r\"(k),            // %10\n                    \"r\"(k_end),        // %11\n                    \"r\"(out_elempack), // %12\n                    \"r\"(out_hstep)     // %13\n                    : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else  // NCNN_GNU_INLINE_ASM\n            float16x8_t _sum0;\n            float16x8_t _sum1;\n            float16x8_t _sum2;\n            float16x8_t _sum3;\n            float16x8_t _sum4;\n            float16x8_t _sum5;\n            float16x8_t _sum6;\n            float16x8_t _sum7;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f16(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                    _sum4 = _sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n              
      _sum1 = vdupq_n_f16(0.f);\n                    _sum2 = vdupq_n_f16(0.f);\n                    _sum3 = vdupq_n_f16(0.f);\n                    _sum4 = vdupq_n_f16(0.f);\n                    _sum5 = vdupq_n_f16(0.f);\n                    _sum6 = vdupq_n_f16(0.f);\n                    _sum7 = vdupq_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8);\n                _sum2 = vld1q_f16(outptr + 8 * 2);\n                _sum3 = vld1q_f16(outptr + 8 * 3);\n                _sum4 = vld1q_f16(outptr + 8 * 4);\n                _sum5 = vld1q_f16(outptr + 8 * 5);\n                _sum6 = vld1q_f16(outptr + 8 * 6);\n                _sum7 = vld1q_f16(outptr + 8 * 7);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x8_t _pB = vld1q_f16(pB);\n\n                _sum0 = vfmaq_laneq_f16(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_laneq_f16(_sum1, _pA, _pB, 1);\n                _sum2 = vfmaq_laneq_f16(_sum2, _pA, _pB, 2);\n                _sum3 = vfmaq_laneq_f16(_sum3, _pA, _pB, 3);\n                _sum4 = vfmaq_laneq_f16(_sum4, _pA, _pB, 4);\n                _sum5 = vfmaq_laneq_f16(_sum5, _pA, _pB, 5);\n                _sum6 = vfmaq_laneq_f16(_sum6, _pA, _pB, 6);\n                _sum7 = vfmaq_laneq_f16(_sum7, _pA, _pB, 7);\n\n                pA += 8;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8, _sum1);\n                    vst1q_f16(outptr0 + 8 * 2, _sum2);\n                    vst1q_f16(outptr0 + 8 * 3, _sum3);\n                    vst1q_f16(outptr0 + 8 * 4, _sum4);\n                    vst1q_f16(outptr0 + 8 * 5, _sum5);\n                    
vst1q_f16(outptr0 + 8 * 6, _sum6);\n                    vst1q_f16(outptr0 + 8 * 7, _sum7);\n                    outptr0 += 64;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + 4 * 2, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + 4 * 3, vget_low_f16(_sum3));\n                    vst1_f16(outptr0 + 4 * 4, vget_low_f16(_sum4));\n                    vst1_f16(outptr0 + 4 * 5, vget_low_f16(_sum5));\n                    vst1_f16(outptr0 + 4 * 6, vget_low_f16(_sum6));\n                    vst1_f16(outptr0 + 4 * 7, vget_low_f16(_sum7));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 2, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 3, vget_high_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 4, vget_high_f16(_sum4));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 5, vget_high_f16(_sum5));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 6, vget_high_f16(_sum6));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 7, vget_high_f16(_sum7));\n\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x8_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + out_hstep, _sum1);\n                    vst1q_f16(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_f16(outptr0 + out_hstep * 3, _sum3);\n                    vst1q_f16(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_f16(outptr0 + out_hstep * 5, 
_sum5);\n                    vst1q_f16(outptr0 + out_hstep * 6, _sum6);\n                    vst1q_f16(outptr0 + out_hstep * 7, _sum7);\n\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n                vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n                vst1q_f16(outptr + 8 * 4, _sum4);\n                vst1q_f16(outptr + 8 * 5, _sum5);\n                vst1q_f16(outptr + 8 * 6, _sum6);\n                vst1q_f16(outptr + 8 * 7, _sum7);\n            }\n\n            outptr += 64;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const __fp16* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v28.8h}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v29.16b, v28.16b            \\n\"\n                \"mov    v30.16b, v28.16b            \\n\"\n                \"mov    v31.16b, v28.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n            
    \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #256]       \\n\"\n                \"ld1    {v0.8h, v1.8h}, [%2], #32   \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n                \"fmla   v28.8h, v4.8h, v0.h[0]      \\n\"\n                \"fmla   v29.8h, v4.8h, v0.h[1]      \\n\"\n                \"fmla   v30.8h, v4.8h, v0.h[2]      \\n\"\n                \"fmla   v31.8h, v4.8h, v0.h[3]      \\n\"\n                \"fmla   v28.8h, v5.8h, v0.h[4]      \\n\"\n                \"fmla   v29.8h, v5.8h, v0.h[5]      \\n\"\n                \"fmla   v30.8h, v5.8h, v0.h[6]      \\n\"\n                \"fmla   v31.8h, v5.8h, v0.h[7]      \\n\"\n                \"fmla   v28.8h, v6.8h, v1.h[0]      \\n\"\n                \"fmla   v29.8h, v6.8h, v1.h[1]      \\n\"\n                \"fmla   v30.8h, v6.8h, v1.h[2]      \\n\"\n                \"fmla   v31.8h, v6.8h, v1.h[3]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v28.8h, v7.8h, v1.h[4]      \\n\"\n                \"fmla   v29.8h, v7.8h, v1.h[5]      \\n\"\n                \"fmla   v30.8h, v7.8h, v1.h[6]      \\n\"\n                \"fmla   v31.8h, v7.8h, v1.h[7]      \\n\"\n                \"bne    4b                          \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.4h}, [%2], #8           \\n\"\n                \"ld1    {v4.8h}, [%1], #16          \\n\"\n                \"fmla   v28.8h, v4.8h, v0.h[0]      \\n\"\n                \"fmla   v29.8h, v4.8h, v0.h[1]      \\n\"\n                \"subs   w4, w4, #1             
     \\n\"\n                \"fmla   v30.8h, v4.8h, v0.h[2]      \\n\"\n                \"fmla   v31.8h, v4.8h, v0.h[3]      \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    11f                         \\n\"\n\n                // if out_elempack == 8\n                \"cmp    %w12, #8                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%3], #64 \\n\"\n                \"b      10f                         \\n\"\n\n                // if out_elempack == 4\n                \"8:                                 \\n\"\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    9f                          \\n\"\n\n                \"zip1   v24.2d, v28.2d, v29.2d      \\n\"\n                \"zip2   v26.2d, v28.2d, v29.2d      \\n\"\n                \"zip1   v25.2d, v30.2d, v31.2d      \\n\"\n                \"zip2   v27.2d, v30.2d, v31.2d      \\n\"\n\n                \"lsl    w4, %w13, #2                \\n\"\n                \"add    x4, %3, w4, sxtw 1          \\n\"\n                \"st1    {v24.8h, v25.8h}, [%3], #32 \\n\"\n                \"st1    {v26.8h, v27.8h}, [x4]      \\n\"\n                \"b      10f                         \\n\"\n\n                // if out_elempack == 1\n                \"9:                                 \\n\"\n                // transpose8x4\n                \"zip1   v24.8h, v28.8h, v29.8h      \\n\"\n                \"zip2   v25.8h, v28.8h, v29.8h      \\n\"\n                \"zip1   v26.8h, v30.8h, v31.8h      \\n\"\n                \"zip2   v27.8h, v30.8h, v31.8h      \\n\"\n\n                \"zip1   v20.4s, v24.4s, v26.4s      \\n\"\n                \"zip2   v22.4s, v24.4s, v26.4s      \\n\"\n                \"zip1   v24.4s, v25.4s, v27.4s      
\\n\"\n                \"zip2   v26.4s, v25.4s, v27.4s      \\n\"\n\n                \"mov    v21.d[0], v20.d[1]          \\n\"\n                \"mov    v23.d[0], v22.d[1]          \\n\"\n                \"mov    v25.d[0], v24.d[1]          \\n\"\n                \"mov    v27.d[0], v26.d[1]          \\n\"\n\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v20.4h}, [%3], #8          \\n\"\n                \"st1    {v21.4h}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v22.4h}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v23.4h}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v24.4h}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v25.4h}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v26.4h}, [x4]              \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v27.4h}, [x4]              \\n\"\n\n                \"10:                                \\n\"\n                \"add    %0, %0, #64                 \\n\"\n                \"b      12f                         \\n\"\n\n                \"11:                                \\n\"\n                \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                \"12:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),           
 // %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v4\", \"v5\", \"v6\", \"v7\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float16x8_t _sum0;\n            float16x8_t _sum1;\n            float16x8_t _sum2;\n            float16x8_t _sum3;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f16(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                    _sum1 = vdupq_n_f16(0.f);\n                    _sum2 = vdupq_n_f16(0.f);\n                    _sum3 = vdupq_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8);\n                _sum2 = vld1q_f16(outptr + 8 * 2);\n                _sum3 = vld1q_f16(outptr + 8 * 3);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x4_t _pB = vld1_f16(pB);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _pA, _pB, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _pA, _pB, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _pA, _pB, 3);\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8, _sum1);\n                    
vst1q_f16(outptr0 + 8 * 2, _sum2);\n                    vst1q_f16(outptr0 + 8 * 3, _sum3);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + 4 * 2, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + 4 * 3, vget_low_f16(_sum3));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 2, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 3, vget_high_f16(_sum3));\n\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x4_ph(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 1, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 2, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 3, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 5, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 6, vget_low_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 7, vget_high_f16(_sum3));\n\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n                vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n            }\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < 
max_jj; jj += 2)\n        {\n            const __fp16* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cbz    %w10, 0f                    \\n\"\n\n                \"ld1    {v30.8h, v31.8h}, [%0]      \\n\"\n                \"b      3f                          \\n\"\n\n                \"0:                                 \\n\"\n                // if pC\n                \"cbz    %8, 1f                      \\n\"\n\n                \"ld1    {v30.8h}, [%8]              \\n\"\n                \"b      2f                          \\n\"\n\n                // else\n                \"1:                                 \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"mov    v31.16b, v30.16b            \\n\"\n\n                \"3:                                 \\n\"\n                \"lsr    w4, %w9, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"4:                                 \\n\"\n                \"prfm   pldl1keep, [%2, #128]       \\n\"\n                \"ld1    {v0.8h}, [%2], #16          \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n                \"fmla   v28.8h, v4.8h, v0.h[0]      \\n\"\n                \"fmla   v29.8h, v4.8h, v0.h[1]      \\n\"\n                \"fmla   v30.8h, v5.8h, v0.h[2]      \\n\"\n                \"fmla   v31.8h, v5.8h, v0.h[3]      \\n\"\n                \"fmla   v28.8h, v6.8h, v0.h[4]      \\n\"\n                \"fmla   v29.8h, v6.8h, v0.h[5]      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.8h, v7.8h, 
v0.h[6]      \\n\"\n                \"fmla   v31.8h, v7.8h, v0.h[7]      \\n\"\n                \"bne    4b                          \\n\"\n                \"fadd   v30.8h, v30.8h, v28.8h      \\n\"\n                \"fadd   v31.8h, v31.8h, v29.8h      \\n\"\n\n                \"5:                                 \\n\"\n                \"and    w4, %w9, #3                 \\n\" // w4 = remain = max_kk & 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    7f                          \\n\"\n\n                \"6:                                 \\n\"\n                \"ld1    {v0.s}[0], [%2], #4         \\n\"\n                \"ld1    {v4.8h}, [%1], #16          \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"fmla   v30.8h, v4.8h, v0.h[0]      \\n\"\n                \"fmla   v31.8h, v4.8h, v0.h[1]      \\n\"\n                \"bne    6b                          \\n\"\n\n                \"7:                                 \\n\"\n                \"tst    %w11, #255                  \\n\"\n                \"beq    11f                         \\n\"\n\n                // if out_elempack == 8\n                \"cmp    %w12, #8                    \\n\"\n                \"bne    8f                          \\n\"\n\n                \"st1    {v30.8h, v31.8h}, [%3], #32 \\n\"\n                \"b      10f                         \\n\"\n\n                // if out_elempack == 4\n                \"8:                                 \\n\"\n                \"cmp    %w12, #4                    \\n\"\n                \"bne    9f                          \\n\"\n\n                \"zip1   v28.2d, v30.2d, v31.2d      \\n\"\n                \"zip2   v29.2d, v30.2d, v31.2d      \\n\"\n\n                \"lsl    w4, %w13, #2                \\n\"\n                \"add    x4, %3, w4, sxtw 1          \\n\"\n                \"st1    {v28.8h}, [%3], #16         \\n\"\n                \"st1    {v29.8h}, [x4]              
\\n\"\n                \"b      10f                         \\n\"\n\n                // if out_elempack == 1\n                \"9:                                 \\n\"\n                // transpose8x2\n                \"zip1   v28.8h, v30.8h, v31.8h      \\n\"\n                \"zip2   v29.8h, v30.8h, v31.8h      \\n\"\n\n                \"add    x4, %3, %w13, sxtw 1        \\n\"\n                \"st1    {v28.s}[0], [%3], #4        \\n\"\n                \"st1    {v28.s}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v28.s}[2], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v28.s}[3], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v29.s}[0], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v29.s}[1], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v29.s}[2], [x4]            \\n\"\n                \"add    x4, x4, %w13, sxtw 1        \\n\"\n                \"st1    {v29.s}[3], [x4]            \\n\"\n\n                \"10:                                \\n\"\n                \"add    %0, %0, #32                 \\n\"\n                \"b      12f                         \\n\"\n\n                \"11:                                \\n\"\n                \"st1    {v30.8h, v31.8h}, [%0], #32 \\n\"\n\n                \"12:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(pC),           // %8\n                \"r\"(max_kk),       // %9\n                \"r\"(k),            
// %10\n                \"r\"(k_end),        // %11\n                \"r\"(out_elempack), // %12\n                \"r\"(out_hstep)     // %13\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v4\", \"v5\", \"v6\", \"v7\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // NCNN_GNU_INLINE_ASM\n            float16x8_t _sum0;\n            float16x8_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f16(pC);\n                    _sum1 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                    _sum1 = vdupq_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x4_t _pB = vld1_f16(pB);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _pA, _pB, 1);\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8, _sum1);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[8];\n              
      __fp16 sum1[8];\n                    vst1q_f16(sum0, _sum0);\n                    vst1q_f16(sum1, _sum1);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0[out_hstep * 4 + 1] = sum1[4];\n                    outptr0[out_hstep * 5 + 1] = sum1[5];\n                    outptr0[out_hstep * 6 + 1] = sum1[6];\n                    outptr0[out_hstep * 7 + 1] = sum1[7];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n            }\n\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const __fp16* pA = pAT;\n\n            float16x8_t _sum0;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1q_f16(pC);\n                }\n                else\n                {\n                    _sum0 = vdupq_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n            }\n\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x8_t _pB = vld1q_dup_f16(pB);\n\n                _sum0 = vfmaq_f16(_sum0, _pA, 
_pB);\n\n                pA += 8;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[8];\n                    vst1q_f16(sum0, _sum0);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep * 1] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n            }\n\n            outptr += 8;\n        }\n\n        pAT += max_kk * 8;\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const __fp16*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n            float16x4_t _sum3;\n            float16x4_t _sum4;\n            float16x4_t _sum5;\n            float16x4_t _sum6;\n            float16x4_t _sum7;\n            float16x4_t _sum8;\n            float16x4_t _sum9;\n        
    float16x4_t _suma;\n            float16x4_t _sumb;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1_f16(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                    _sum4 = _sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                    _sum8 = _sum0;\n                    _sum9 = _sum0;\n                    _suma = _sum0;\n                    _sumb = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                    _sum3 = vdup_n_f16(0.f);\n                    _sum4 = vdup_n_f16(0.f);\n                    _sum5 = vdup_n_f16(0.f);\n                    _sum6 = vdup_n_f16(0.f);\n                    _sum7 = vdup_n_f16(0.f);\n                    _sum8 = vdup_n_f16(0.f);\n                    _sum9 = vdup_n_f16(0.f);\n                    _suma = vdup_n_f16(0.f);\n                    _sumb = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4 * 1);\n                _sum2 = vld1_f16(outptr + 4 * 2);\n                _sum3 = vld1_f16(outptr + 4 * 3);\n                _sum4 = vld1_f16(outptr + 4 * 4);\n                _sum5 = vld1_f16(outptr + 4 * 5);\n                _sum6 = vld1_f16(outptr + 4 * 6);\n                _sum7 = vld1_f16(outptr + 4 * 7);\n                _sum8 = vld1_f16(outptr + 4 * 8);\n                _sum9 = vld1_f16(outptr + 4 * 9);\n                _suma = vld1_f16(outptr + 4 * 10);\n                _sumb = vld1_f16(outptr + 4 * 11);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            
{\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                _sum0 = vfma_lane_f16(_sum0, _pA, _pB0, 0);\n                _sum1 = vfma_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfma_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfma_lane_f16(_sum3, _pA, _pB0, 3);\n                _sum4 = vfma_lane_f16(_sum4, _pA, _pB1, 0);\n                _sum5 = vfma_lane_f16(_sum5, _pA, _pB1, 1);\n                _sum6 = vfma_lane_f16(_sum6, _pA, _pB1, 2);\n                _sum7 = vfma_lane_f16(_sum7, _pA, _pB1, 3);\n                _sum8 = vfma_lane_f16(_sum8, _pA, _pB2, 0);\n                _sum9 = vfma_lane_f16(_sum9, _pA, _pB2, 1);\n                _suma = vfma_lane_f16(_suma, _pA, _pB2, 2);\n                _sumb = vfma_lane_f16(_sumb, _pA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 4 * 2, _sum2);\n                    vst1_f16(outptr0 + 4 * 3, _sum3);\n                    vst1_f16(outptr0 + 4 * 4, _sum4);\n                    vst1_f16(outptr0 + 4 * 5, _sum5);\n                    vst1_f16(outptr0 + 4 * 6, _sum6);\n                    vst1_f16(outptr0 + 4 * 7, _sum7);\n                    vst1_f16(outptr0 + 4 * 8, _sum8);\n                    vst1_f16(outptr0 + 4 * 9, _sum9);\n                    vst1_f16(outptr0 + 4 * 10, _suma);\n                    vst1_f16(outptr0 + 4 * 11, _sumb);\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x12_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7, _sum8, _sum9, _suma, 
_sumb);\n\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 8, _sum2);\n                    vst1_f16(outptr0 + out_hstep, _sum3);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum4);\n                    vst1_f16(outptr0 + out_hstep + 8, _sum5);\n                    vst1_f16(outptr0 + out_hstep * 2, _sum6);\n                    vst1_f16(outptr0 + out_hstep * 2 + 4, _sum7);\n                    vst1_f16(outptr0 + out_hstep * 2 + 8, _sum8);\n                    vst1_f16(outptr0 + out_hstep * 3, _sum9);\n                    vst1_f16(outptr0 + out_hstep * 3 + 4, _suma);\n                    vst1_f16(outptr0 + out_hstep * 3 + 8, _sumb);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n                vst1_f16(outptr + 4 * 4, _sum4);\n                vst1_f16(outptr + 4 * 5, _sum5);\n                vst1_f16(outptr + 4 * 6, _sum6);\n                vst1_f16(outptr + 4 * 7, _sum7);\n                vst1_f16(outptr + 4 * 8, _sum8);\n                vst1_f16(outptr + 4 * 9, _sum9);\n                vst1_f16(outptr + 4 * 10, _suma);\n                vst1_f16(outptr + 4 * 11, _sumb);\n            }\n\n            outptr += 48;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n            float16x4_t _sum3;\n            float16x4_t _sum4;\n            float16x4_t _sum5;\n            float16x4_t _sum6;\n            float16x4_t _sum7;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1_f16(pC);\n                    _sum1 = _sum0;\n                    _sum2 = _sum0;\n   
                 _sum3 = _sum0;\n                    _sum4 = _sum0;\n                    _sum5 = _sum0;\n                    _sum6 = _sum0;\n                    _sum7 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                    _sum3 = vdup_n_f16(0.f);\n                    _sum4 = vdup_n_f16(0.f);\n                    _sum5 = vdup_n_f16(0.f);\n                    _sum6 = vdup_n_f16(0.f);\n                    _sum7 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4 * 1);\n                _sum2 = vld1_f16(outptr + 4 * 2);\n                _sum3 = vld1_f16(outptr + 4 * 3);\n                _sum4 = vld1_f16(outptr + 4 * 4);\n                _sum5 = vld1_f16(outptr + 4 * 5);\n                _sum6 = vld1_f16(outptr + 4 * 6);\n                _sum7 = vld1_f16(outptr + 4 * 7);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                _sum0 = vfma_lane_f16(_sum0, _pA, _pB0, 0);\n                _sum1 = vfma_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfma_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfma_lane_f16(_sum3, _pA, _pB0, 3);\n                _sum4 = vfma_lane_f16(_sum4, _pA, _pB1, 0);\n                _sum5 = vfma_lane_f16(_sum5, _pA, _pB1, 1);\n                _sum6 = vfma_lane_f16(_sum6, _pA, _pB1, 2);\n                _sum7 = vfma_lane_f16(_sum7, _pA, _pB1, 3);\n\n                pA += 4;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 
4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 4 * 2, _sum2);\n                    vst1_f16(outptr0 + 4 * 3, _sum3);\n                    vst1_f16(outptr0 + 4 * 4, _sum4);\n                    vst1_f16(outptr0 + 4 * 5, _sum5);\n                    vst1_f16(outptr0 + 4 * 6, _sum6);\n                    vst1_f16(outptr0 + 4 * 7, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x8_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + out_hstep, _sum2);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum3);\n                    vst1_f16(outptr0 + out_hstep * 2, _sum4);\n                    vst1_f16(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1_f16(outptr0 + out_hstep * 3, _sum6);\n                    vst1_f16(outptr0 + out_hstep * 3 + 4, _sum7);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n                vst1_f16(outptr + 4 * 4, _sum4);\n                vst1_f16(outptr + 4 * 5, _sum5);\n                vst1_f16(outptr + 4 * 6, _sum6);\n                vst1_f16(outptr + 4 * 7, _sum7);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n            float16x4_t _sum3;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1_f16(pC);\n    
                _sum1 = _sum0;\n                    _sum2 = _sum0;\n                    _sum3 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                    _sum3 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4);\n                _sum2 = vld1_f16(outptr + 4 * 2);\n                _sum3 = vld1_f16(outptr + 4 * 3);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB = vld1_f16(pB);\n\n                _sum0 = vfma_lane_f16(_sum0, _pA, _pB, 0);\n                _sum1 = vfma_lane_f16(_sum1, _pA, _pB, 1);\n                _sum2 = vfma_lane_f16(_sum2, _pA, _pB, 2);\n                _sum3 = vfma_lane_f16(_sum3, _pA, _pB, 3);\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 4 * 2, _sum2);\n                    vst1_f16(outptr0 + 4 * 3, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x4_ph(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + out_hstep, _sum1);\n                    vst1_f16(outptr0 + out_hstep * 2, _sum2);\n                    vst1_f16(outptr0 + out_hstep * 3, _sum3);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                
vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1_f16(pC);\n                    _sum1 = _sum0;\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n\n                _sum0 = vfma_n_f16(_sum0, _pA, pB[0]);\n                _sum1 = vfma_n_f16(_sum1, _pA, pB[1]);\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[4];\n                    __fp16 sum1[4];\n                    vst1_f16(sum0, _sum0);\n                    vst1_f16(sum1, _sum1);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    
outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float16x4_t _sum0;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vld1_f16(pC);\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB = vdup_n_f16(pB[0]);\n\n                _sum0 = vfma_f16(_sum0, _pA, _pB);\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[4];\n                    vst1_f16(sum0, _sum0);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n            }\n\n            outptr += 4;\n        }\n\n        pAT += max_kk * 4;\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + 
(i + ii) * out_hstep + j;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const __fp16*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float16x4_t _sum00;\n            float16x4_t _sum01;\n            float16x4_t _sum02;\n            float16x4_t _sum10;\n            float16x4_t _sum11;\n            float16x4_t _sum12;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vdup_n_f16(pC[0]);\n                    _sum01 = vdup_n_f16(pC[0]);\n                    _sum02 = vdup_n_f16(pC[0]);\n                    _sum10 = vdup_n_f16(pC[1]);\n                    _sum11 = vdup_n_f16(pC[1]);\n                    _sum12 = vdup_n_f16(pC[1]);\n                }\n                else\n                {\n                    _sum00 = vdup_n_f16(0.f);\n                    _sum01 = vdup_n_f16(0.f);\n                    _sum02 = vdup_n_f16(0.f);\n                    _sum10 = vdup_n_f16(0.f);\n                    _sum11 = vdup_n_f16(0.f);\n                    _sum12 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01 = vld2_f16(outptr);\n                float16x4x2_t _tmp23 = vld2_f16(outptr + 8);\n                float16x4x2_t _tmp45 = vld2_f16(outptr + 16);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum02 = _tmp45.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n                _sum12 = _tmp45.val[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                float16x4_t _pA0 = vld1_dup_f16(pA);\n  
              float16x4_t _pA1 = vld1_dup_f16(pA + 1);\n\n                _sum00 = vfma_f16(_sum00, _pB0, _pA0);\n                _sum01 = vfma_f16(_sum01, _pB1, _pA0);\n                _sum02 = vfma_f16(_sum02, _pB2, _pA0);\n                _sum10 = vfma_f16(_sum10, _pB0, _pA1);\n                _sum11 = vfma_f16(_sum11, _pB1, _pA1);\n                _sum12 = vfma_f16(_sum12, _pB2, _pA1);\n\n                pA += 2;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum00);\n                    vst1_f16(outptr0 + 4, _sum01);\n                    vst1_f16(outptr0 + 8, _sum02);\n                    vst1_f16(outptr0 + out_hstep, _sum10);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum11);\n                    vst1_f16(outptr0 + out_hstep + 8, _sum12);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float16x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float16x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2_f16(outptr, _tmp01);\n                vst2_f16(outptr + 8, _tmp23);\n                vst2_f16(outptr + 16, _tmp45);\n            }\n\n            outptr += 24;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float16x4_t _sum00;\n            float16x4_t _sum01;\n            float16x4_t _sum10;\n            float16x4_t _sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum00 = vdup_n_f16(pC[0]);\n                    _sum01 = vdup_n_f16(pC[0]);\n                    _sum10 = vdup_n_f16(pC[1]);\n                
    _sum11 = vdup_n_f16(pC[1]);\n                }\n                else\n                {\n                    _sum00 = vdup_n_f16(0.f);\n                    _sum01 = vdup_n_f16(0.f);\n                    _sum10 = vdup_n_f16(0.f);\n                    _sum11 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01 = vld2_f16(outptr);\n                float16x4x2_t _tmp23 = vld2_f16(outptr + 8);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                float16x4_t _pA0 = vld1_dup_f16(pA);\n                float16x4_t _pA1 = vld1_dup_f16(pA + 1);\n\n                _sum00 = vfma_f16(_sum00, _pB0, _pA0);\n                _sum01 = vfma_f16(_sum01, _pB1, _pA0);\n                _sum10 = vfma_f16(_sum10, _pB0, _pA1);\n                _sum11 = vfma_f16(_sum11, _pB1, _pA1);\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum00);\n                    vst1_f16(outptr0 + 4, _sum01);\n                    vst1_f16(outptr0 + out_hstep, _sum10);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum11);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float16x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                vst2_f16(outptr, _tmp01);\n     
           vst2_f16(outptr + 8, _tmp23);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vdup_n_f16(pC[0]);\n                    _sum1 = vdup_n_f16(pC[1]);\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01 = vld2_f16(outptr);\n                _sum0 = _tmp01.val[0];\n                _sum1 = _tmp01.val[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB = vld1_f16(pB);\n\n                _sum0 = vfma_n_f16(_sum0, _pB, pA[0]);\n                _sum1 = vfma_n_f16(_sum1, _pB, pA[1]);\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, (_sum0));\n                    vst1_f16(outptr0 + out_hstep, (_sum1));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2_f16(outptr, _tmp01);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            __fp16 sum00;\n            __fp16 sum01;\n            __fp16 sum10;\n            __fp16 sum11;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum00 = pC[0];\n                    sum01 = pC[1];\n                    sum10 = 
pC[0];\n                    sum11 = pC[1];\n                }\n                else\n                {\n                    sum00 = 0.f;\n                    sum01 = 0.f;\n                    sum10 = 0.f;\n                    sum11 = 0.f;\n                }\n            }\n            else\n            {\n                sum00 = outptr[0];\n                sum01 = outptr[1];\n                sum10 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum00 += pA[0] * pB[0];\n                sum01 += pA[1] * pB[0];\n                sum10 += pA[0] * pB[1];\n                sum11 += pA[1] * pB[1];\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum00;\n                    outptr0[1] = sum10;\n                    outptr0[out_hstep] = sum01;\n                    outptr0[out_hstep + 1] = sum11;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n            }\n\n            outptr += 4;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            __fp16 sum0;\n            __fp16 sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum0 = pC[0];\n                    sum1 = pC[1];\n                }\n                else\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk 
= 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[1] * pB[0];\n                pA += 2;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[out_hstep] = sum1;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + (i + ii) * out_hstep + j;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            pC = (const __fp16*)CT_tile + i + ii;\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vdup_n_f16(pC[0]);\n                    _sum1 = vdup_n_f16(pC[0]);\n                    _sum2 = vdup_n_f16(pC[0]);\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                    _sum2 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4);\n                _sum2 = vld1_f16(outptr + 8);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 
8);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n\n                _sum0 = vfma_f16(_sum0, _pA0, _pB0);\n                _sum1 = vfma_f16(_sum1, _pA0, _pB1);\n                _sum2 = vfma_f16(_sum2, _pA0, _pB2);\n\n                pA += 1;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 8, _sum2);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 8, _sum2);\n            }\n\n            outptr += 12;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum0 = vdup_n_f16(pC[0]);\n                    _sum1 = vdup_n_f16(pC[0]);\n                }\n                else\n                {\n                    _sum0 = vdup_n_f16(0.f);\n                    _sum1 = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n\n                _sum0 = vfma_f16(_sum0, _pA0, _pB0);\n                _sum1 = vfma_f16(_sum1, _pA0, _pB1);\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if 
(out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float16x4_t _sum;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    _sum = vdup_n_f16(pC[0]);\n                }\n                else\n                {\n                    _sum = vdup_n_f16(0.f);\n                }\n            }\n            else\n            {\n                _sum = vld1_f16(outptr);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB = vld1_f16(pB);\n                float16x4_t _pA = vdup_n_f16(pA[0]);\n\n                _sum = vfma_f16(_sum, _pA, _pB);\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum);\n            }\n\n            outptr += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            __fp16 sum0;\n            __fp16 sum1;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum0 = pC[0];\n                    sum1 = pC[0];\n                }\n                else\n                {\n                    sum0 = 0.f;\n                    sum1 = 0.f;\n                }\n            }\n            else\n            {\n                sum0 = 
outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[0] * pB[1];\n\n                pA += 1;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[1] = sum1;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            __fp16 sum;\n\n            if (k == 0)\n            {\n                if (pC)\n                {\n                    sum = pC[0];\n                }\n                else\n                {\n                    sum = 0.f;\n                }\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum += pA[0] * pB[0];\n\n                pA += 1;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n\nstatic void convolution_im2col_gemm_get_optimal_tile_mnk_fp16sa(int M, int N, int K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const int l2_cache_size_fp16 = (int)(get_cpu_level2_cache_size() / 
sizeof(unsigned short));\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    // solve K\n    {\n        // try not to split K\n        int tile_size = (l2_cache_size_fp16 - 32) / 12;\n\n        TILE_K = std::max(8, tile_size / 8 * 8);\n\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n    }\n\n    // solve M\n    {\n        int nn_M = (M + 63) / 64;\n\n        TILE_M = std::max(8, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n    }\n\n    {\n        TILE_M *= std::min(nT, get_physical_cpu_count());\n\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n\n        if (nT > 1)\n        {\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n        }\n    }\n\n    if (N > 0)\n    {\n        int tile_size;\n        if (TILE_K >= K)\n        {\n            tile_size = (l2_cache_size_fp16 - TILE_M * TILE_K) / TILE_K;\n        }\n        else\n        {\n            tile_size = (l2_cache_size_fp16 - TILE_M * TILE_K) / (TILE_M + TILE_K);\n        }\n\n        TILE_N = std::max(4, tile_size / 4 * 4);\n\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n    }\n}\n\nstatic void convolution_im2col_gemm_transform_kernel_fp16sa(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt)\n{\n    // NCNN_LOGE(\"convolution_im2col_gemm_transform_kernel_fp16sa %p\", kernel.data);\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = outch;\n    const int K = inch * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk_fp16sa(M, 0, K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    int elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        elempack = inch % 8 == 0 ? 8 : inch % 4 == 0 ? 
4 : 1;\n    }\n\n    // maxk-inch-outch to pa-maxk-inch/pa-outch\n    Mat A_data;\n    if (maxk == 1)\n    {\n        cast_float32_to_float16(kernel, A_data);\n        A_data = A_data.reshape(maxk * inch, outch);\n    }\n    else\n    {\n        Mat weight_data_r2 = kernel.reshape(maxk, inch, outch);\n\n        A_data.create(maxk * inch, outch, (size_t)2u);\n\n        for (int q = 0; q < outch; q += 1)\n        {\n            __fp16* g00 = A_data.row<__fp16>(q);\n\n            for (int p = 0; p + (elempack - 1) < inch; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        const float* k00 = weight_data_r2.channel(q).row(p + i);\n                        g00[0] = (__fp16)k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n\n    AT.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, (size_t)2u);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n            convolution_im2col_pack_A_tile_bf16_fp16(A_data, AT_tile, i, max_ii, k, max_kk);\n        }\n    }\n}\n\nstatic int convolution_im2col_gemm_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, const Mat& bias, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"convolution_im2col_gemm_fp16sa %p %p %p %p\", bottom_blob.data, top_blob.data, AT.data, bias.data);\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = top_blob.c * 
top_blob.elempack;\n    const int N = top_blob.w * top_blob.h;\n    const int K = bottom_blob.c * bottom_blob.elempack * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk_fp16sa(M, N, K, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        // im2col\n        convolution_im2col_input_tile_bf16_fp16(bottom_blob, BT_tile, j, max_jj, k, max_kk, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h);\n    }\n\n    Mat topT_tileX;\n    if (K > TILE_K)\n    {\n        topT_tileX.create(TILE_N * TILE_M, 1, nT, 2u, opt.workspace_allocator);\n        if (topT_tileX.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat topT_tile;\n        if (K > TILE_K)\n            topT_tile = topT_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n    
        {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = k + TILE_K >= K;\n\n                convolution_gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, bias, topT_tile, top_blob, i, max_ii, j, max_jj, k, max_kk, k_end, opt.use_a53_a55_optimized_kernel);\n            }\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_im2col_gemm_int8.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\nvoid convolution_im2col_gemm_transform_kernel_int8_i8mm(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt);\nint convolution_im2col_gemm_int8_i8mm(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\nvoid convolution_im2col_gemm_transform_kernel_int8_asimddp(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt);\nint convolution_im2col_gemm_int8_asimddp(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt);\n#endif\n\nstatic void convolution_im2col_pack_A_tile_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    // A = (pa, maxk, inch/pa), outch\n    const int A_hstep = A.w;\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const signed char* p0 = (const signed char*)A + (i + ii) * A_hstep + k;\n        const signed char* p1 = (const signed char*)A + (i + ii + 1) * A_hstep + k;\n        const signed char* p2 = (const signed char*)A + (i + ii + 2) * A_hstep + k;\n        const signed char* p3 = (const signed char*)A + (i + ii + 3) * A_hstep + k;\n        const signed char* p4 = (const signed char*)A + (i + ii + 4) * A_hstep + k;\n        const signed char* p5 = (const signed char*)A + (i + ii + 5) * A_hstep + k;\n        const signed char* p6 = (const signed char*)A + (i + ii + 6) * A_hstep + k;\n        const signed char* p7 = (const signed char*)A + (i + ii + 7) * A_hstep + 
k;\n\n        int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _r0 = vld1q_s8(p0);\n            int8x16_t _r1 = vld1q_s8(p1);\n            int8x16_t _r2 = vld1q_s8(p2);\n            int8x16_t _r3 = vld1q_s8(p3);\n            int8x16_t _r4 = vld1q_s8(p4);\n            int8x16_t _r5 = vld1q_s8(p5);\n            int8x16_t _r6 = vld1q_s8(p6);\n            int8x16_t _r7 = vld1q_s8(p7);\n            int8x16_t _t0 = vcombine_s8(vget_low_s8(_r0), vget_low_s8(_r1));\n            int8x16_t _t1 = vcombine_s8(vget_low_s8(_r2), vget_low_s8(_r3));\n            int8x16_t _t2 = vcombine_s8(vget_low_s8(_r4), vget_low_s8(_r5));\n            int8x16_t _t3 = vcombine_s8(vget_low_s8(_r6), vget_low_s8(_r7));\n            int8x16_t _t4 = vcombine_s8(vget_high_s8(_r0), vget_high_s8(_r1));\n            int8x16_t _t5 = vcombine_s8(vget_high_s8(_r2), vget_high_s8(_r3));\n            int8x16_t _t6 = vcombine_s8(vget_high_s8(_r4), vget_high_s8(_r5));\n            int8x16_t _t7 = vcombine_s8(vget_high_s8(_r6), vget_high_s8(_r7));\n            vst1q_s8(pp, _t0);\n            vst1q_s8(pp + 16, _t1);\n            vst1q_s8(pp + 32, _t2);\n            vst1q_s8(pp + 48, _t3);\n            vst1q_s8(pp + 64, _t4);\n            vst1q_s8(pp + 80, _t5);\n            vst1q_s8(pp + 96, _t6);\n            vst1q_s8(pp + 112, _t7);\n            pp += 128;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n            p4 += 16;\n            p5 += 16;\n            p6 += 16;\n            p7 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _r0 = vld1_s8(p0);\n            int8x8_t _r1 = vld1_s8(p1);\n            int8x8_t _r2 = vld1_s8(p2);\n            int8x8_t _r3 = vld1_s8(p3);\n            int8x8_t _r4 = vld1_s8(p4);\n            int8x8_t _r5 = vld1_s8(p5);\n            int8x8_t _r6 = vld1_s8(p6);\n            int8x8_t _r7 = 
vld1_s8(p7);\n            vst1_s8(pp, _r0);\n            vst1_s8(pp + 8, _r1);\n            vst1_s8(pp + 16, _r2);\n            vst1_s8(pp + 24, _r3);\n            vst1_s8(pp + 32, _r4);\n            vst1_s8(pp + 40, _r5);\n            vst1_s8(pp + 48, _r6);\n            vst1_s8(pp + 56, _r7);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n#else  // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _r0 = vld1q_s8(p0);\n            int8x16_t _r1 = vld1q_s8(p1);\n            int8x16_t _r2 = vld1q_s8(p2);\n            int8x16_t _r3 = vld1q_s8(p3);\n            int8x16_t _r4 = vld1q_s8(p4);\n            int8x16_t _r5 = vld1q_s8(p5);\n            int8x16_t _r6 = vld1q_s8(p6);\n            int8x16_t _r7 = vld1q_s8(p7);\n            int32x4x2_t _r01 = vzipq_s32(vreinterpretq_s32_s8(_r0), vreinterpretq_s32_s8(_r1));\n            int32x4x2_t _r23 = vzipq_s32(vreinterpretq_s32_s8(_r2), vreinterpretq_s32_s8(_r3));\n            int32x4x2_t _r45 = vzipq_s32(vreinterpretq_s32_s8(_r4), vreinterpretq_s32_s8(_r5));\n            int32x4x2_t _r67 = vzipq_s32(vreinterpretq_s32_s8(_r6), vreinterpretq_s32_s8(_r7));\n            _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_r01.val[0]), vget_low_s32(_r23.val[0])));\n            _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_r45.val[0]), vget_low_s32(_r67.val[0])));\n            _r2 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_r01.val[0]), vget_high_s32(_r23.val[0])));\n            _r3 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_r45.val[0]), vget_high_s32(_r67.val[0])));\n            _r4 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_r01.val[1]), vget_low_s32(_r23.val[1])));\n            _r5 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_r45.val[1]), vget_low_s32(_r67.val[1])));\n            
_r6 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_r01.val[1]), vget_high_s32(_r23.val[1])));\n            _r7 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_r45.val[1]), vget_high_s32(_r67.val[1])));\n            vst1q_s8(pp, _r0);\n            vst1q_s8(pp + 16, _r1);\n            vst1q_s8(pp + 32, _r2);\n            vst1q_s8(pp + 48, _r3);\n            vst1q_s8(pp + 64, _r4);\n            vst1q_s8(pp + 80, _r5);\n            vst1q_s8(pp + 96, _r6);\n            vst1q_s8(pp + 112, _r7);\n            pp += 128;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n            p4 += 16;\n            p5 += 16;\n            p6 += 16;\n            p7 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _r0 = vld1_s8(p0);\n            int8x8_t _r1 = vld1_s8(p1);\n            int8x8_t _r2 = vld1_s8(p2);\n            int8x8_t _r3 = vld1_s8(p3);\n            int8x8_t _r4 = vld1_s8(p4);\n            int8x8_t _r5 = vld1_s8(p5);\n            int8x8_t _r6 = vld1_s8(p6);\n            int8x8_t _r7 = vld1_s8(p7);\n            int32x2x2_t _r01 = vzip_s32(vreinterpret_s32_s8(_r0), vreinterpret_s32_s8(_r1));\n            int32x2x2_t _r23 = vzip_s32(vreinterpret_s32_s8(_r2), vreinterpret_s32_s8(_r3));\n            int32x2x2_t _r45 = vzip_s32(vreinterpret_s32_s8(_r4), vreinterpret_s32_s8(_r5));\n            int32x2x2_t _r67 = vzip_s32(vreinterpret_s32_s8(_r6), vreinterpret_s32_s8(_r7));\n            int8x16_t _t0 = vreinterpretq_s8_s32(vcombine_s32(_r01.val[0], _r23.val[0]));\n            int8x16_t _t1 = vreinterpretq_s8_s32(vcombine_s32(_r45.val[0], _r67.val[0]));\n            int8x16_t _t2 = vreinterpretq_s8_s32(vcombine_s32(_r01.val[1], _r23.val[1]));\n            int8x16_t _t3 = vreinterpretq_s8_s32(vcombine_s32(_r45.val[1], _r67.val[1]));\n            vst1q_s8(pp, _t0);\n            vst1q_s8(pp + 16, _t1);\n            vst1q_s8(pp + 32, _t2);\n            vst1q_s8(pp + 48, _t3);\n            pp 
+= 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n            pp[8] = p2[0];\n            pp[9] = p2[1];\n            pp[10] = p2[2];\n            pp[11] = p2[3];\n            pp[12] = p3[0];\n            pp[13] = p3[1];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n            pp[16] = p4[0];\n            pp[17] = p4[1];\n            pp[18] = p4[2];\n            pp[19] = p4[3];\n            pp[20] = p5[0];\n            pp[21] = p5[1];\n            pp[22] = p5[2];\n            pp[23] = p5[3];\n            pp[24] = p6[0];\n            pp[25] = p6[1];\n            pp[26] = p6[2];\n            pp[27] = p6[3];\n            pp[28] = p7[0];\n            pp[29] = p7[1];\n            pp[30] = p7[2];\n            pp[31] = p7[3];\n            pp += 32;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n            p4 += 4;\n            p5 += 4;\n            p6 += 4;\n            p7 += 4;\n        }\n#else  // __ARM_FEATURE_DOTPROD\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _r0 = vld1q_s8(p0);\n            int8x16_t _r1 = vld1q_s8(p1);\n            int8x16_t _r2 = vld1q_s8(p2);\n            int8x16_t _r3 = vld1q_s8(p3);\n            int8x16_t _r4 = vld1q_s8(p4);\n            int8x16_t _r5 = vld1q_s8(p5);\n            int8x16_t _r6 = vld1q_s8(p6);\n            int8x16_t _r7 = vld1q_s8(p7);\n            int16x8x2_t _r01 = vzipq_s16(vreinterpretq_s16_s8(_r0), vreinterpretq_s16_s8(_r1));\n            int16x8x2_t _r23 = 
vzipq_s16(vreinterpretq_s16_s8(_r2), vreinterpretq_s16_s8(_r3));\n            int16x8x2_t _r45 = vzipq_s16(vreinterpretq_s16_s8(_r4), vreinterpretq_s16_s8(_r5));\n            int16x8x2_t _r67 = vzipq_s16(vreinterpretq_s16_s8(_r6), vreinterpretq_s16_s8(_r7));\n            int32x4x2_t _t0 = vzipq_s32(vreinterpretq_s32_s16(_r01.val[0]), vreinterpretq_s32_s16(_r23.val[0]));\n            int32x4x2_t _t1 = vzipq_s32(vreinterpretq_s32_s16(_r01.val[1]), vreinterpretq_s32_s16(_r23.val[1]));\n            int32x4x2_t _t2 = vzipq_s32(vreinterpretq_s32_s16(_r45.val[0]), vreinterpretq_s32_s16(_r67.val[0]));\n            int32x4x2_t _t3 = vzipq_s32(vreinterpretq_s32_s16(_r45.val[1]), vreinterpretq_s32_s16(_r67.val[1]));\n            _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t2.val[0])));\n            _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t2.val[0])));\n            _r2 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t0.val[1]), vget_low_s32(_t2.val[1])));\n            _r3 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t0.val[1]), vget_high_s32(_t2.val[1])));\n            _r4 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t3.val[0])));\n            _r5 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t3.val[0])));\n            _r6 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t3.val[1])));\n            _r7 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t3.val[1])));\n            vst1q_s8(pp, _r0);\n            vst1q_s8(pp + 16, _r1);\n            vst1q_s8(pp + 32, _r2);\n            vst1q_s8(pp + 48, _r3);\n            vst1q_s8(pp + 64, _r4);\n            vst1q_s8(pp + 80, _r5);\n            vst1q_s8(pp + 96, _r6);\n            vst1q_s8(pp + 112, _r7);\n            pp += 128;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n            
p4 += 16;\n            p5 += 16;\n            p6 += 16;\n            p7 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _r0 = vld1_s8(p0);\n            int8x8_t _r1 = vld1_s8(p1);\n            int8x8_t _r2 = vld1_s8(p2);\n            int8x8_t _r3 = vld1_s8(p3);\n            int8x8_t _r4 = vld1_s8(p4);\n            int8x8_t _r5 = vld1_s8(p5);\n            int8x8_t _r6 = vld1_s8(p6);\n            int8x8_t _r7 = vld1_s8(p7);\n            int16x8_t _r04 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r4));\n            int16x8_t _r15 = vreinterpretq_s16_s8(vcombine_s8(_r1, _r5));\n            int16x8_t _r26 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r6));\n            int16x8_t _r37 = vreinterpretq_s16_s8(vcombine_s8(_r3, _r7));\n            int16x8x2_t _t0 = vzipq_s16(_r04, _r15);\n            int16x8x2_t _t1 = vzipq_s16(_r26, _r37);\n            int32x4x2_t _t2 = vzipq_s32(vreinterpretq_s32_s16(_t0.val[0]), vreinterpretq_s32_s16(_t1.val[0]));\n            int32x4x2_t _t3 = vzipq_s32(vreinterpretq_s32_s16(_t0.val[1]), vreinterpretq_s32_s16(_t1.val[1]));\n            int8x16_t _t4 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0])));\n            int8x16_t _t5 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0])));\n            int8x16_t _t6 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t2.val[1]), vget_low_s32(_t3.val[1])));\n            int8x16_t _t7 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t2.val[1]), vget_high_s32(_t3.val[1])));\n            vst1q_s8(pp, _t4);\n            vst1q_s8(pp + 16, _t5);\n            vst1q_s8(pp + 32, _t6);\n            vst1q_s8(pp + 48, _t7);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 
2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp[8] = p4[0];\n            pp[9] = p4[1];\n            pp[10] = p5[0];\n            pp[11] = p5[1];\n            pp[12] = p6[0];\n            pp[13] = p6[1];\n            pp[14] = p7[0];\n            pp[15] = p7[1];\n            pp += 16;\n            p0 += 2;\n            p1 += 2;\n            p2 += 2;\n            p3 += 2;\n            p4 += 2;\n            p5 += 2;\n            p6 += 2;\n            p7 += 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp[4] = p4[0];\n            pp[5] = p5[0];\n            pp[6] = p6[0];\n            pp[7] = p7[0];\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const signed char* p0 = (const signed char*)A + (i + ii) * A_hstep + k;\n        const signed char* p1 = (const signed char*)A + (i + ii + 1) * A_hstep + k;\n        const signed char* p2 = (const signed char*)A + (i + ii + 2) * A_hstep + k;\n        const signed char* p3 = (const signed char*)A + (i + ii + 3) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int64x2x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s64_s8(vld1q_s8(p0));\n            _r0123.val[1] = vreinterpretq_s64_s8(vld1q_s8(p1));\n            _r0123.val[2] = vreinterpretq_s64_s8(vld1q_s8(p2));\n            _r0123.val[3] = vreinterpretq_s64_s8(vld1q_s8(p3));\n            vst4q_s64((int64_t*)pp, _r0123);\n   
         pp += 64;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _r0 = vld1_s8(p0);\n            int8x8_t _r1 = vld1_s8(p1);\n            int8x8_t _r2 = vld1_s8(p2);\n            int8x8_t _r3 = vld1_s8(p3);\n            vst1_s8(pp, _r0);\n            vst1_s8(pp + 8, _r1);\n            vst1_s8(pp + 16, _r2);\n            vst1_s8(pp + 24, _r3);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n#else  // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int32x4x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s32_s8(vld1q_s8(p0));\n            _r0123.val[1] = vreinterpretq_s32_s8(vld1q_s8(p1));\n            _r0123.val[2] = vreinterpretq_s32_s8(vld1q_s8(p2));\n            _r0123.val[3] = vreinterpretq_s32_s8(vld1q_s8(p3));\n            vst4q_s32((int*)pp, _r0123);\n            pp += 64;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int32x2x4_t _r0123;\n            _r0123.val[0] = vreinterpret_s32_s8(vld1_s8(p0));\n            _r0123.val[1] = vreinterpret_s32_s8(vld1_s8(p1));\n            _r0123.val[2] = vreinterpret_s32_s8(vld1_s8(p2));\n            _r0123.val[3] = vreinterpret_s32_s8(vld1_s8(p3));\n            vst4_s32((int*)pp, _r0123);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n            pp[8] = p2[0];\n    
        pp[9] = p2[1];\n            pp[10] = p2[2];\n            pp[11] = p2[3];\n            pp[12] = p3[0];\n            pp[13] = p3[1];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n#else  // __ARM_FEATURE_DOTPROD\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int16x8x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s16_s8(vld1q_s8(p0));\n            _r0123.val[1] = vreinterpretq_s16_s8(vld1q_s8(p1));\n            _r0123.val[2] = vreinterpretq_s16_s8(vld1q_s8(p2));\n            _r0123.val[3] = vreinterpretq_s16_s8(vld1q_s8(p3));\n            vst4q_s16((short*)pp, _r0123);\n            pp += 64;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int16x4x4_t _r0123;\n            _r0123.val[0] = vreinterpret_s16_s8(vld1_s8(p0));\n            _r0123.val[1] = vreinterpret_s16_s8(vld1_s8(p1));\n            _r0123.val[2] = vreinterpret_s16_s8(vld1_s8(p2));\n            _r0123.val[3] = vreinterpret_s16_s8(vld1_s8(p3));\n            vst4_s16((short*)pp, _r0123);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp += 8;\n            p0 += 2;\n            p1 += 2;\n            p2 += 2;\n            p3 += 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp += 4;\n            
p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const signed char* p0 = (const signed char*)A + (i + ii) * A_hstep + k;\n        const signed char* p1 = (const signed char*)A + (i + ii + 1) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int64x2x2_t _r01;\n            _r01.val[0] = vreinterpretq_s64_s8(vld1q_s8(p0));\n            _r01.val[1] = vreinterpretq_s64_s8(vld1q_s8(p1));\n            vst2q_s64((int64_t*)pp, _r01);\n            pp += 32;\n            p0 += 16;\n            p1 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _r0 = vld1_s8(p0);\n            int8x8_t _r1 = vld1_s8(p1);\n            vst1_s8(pp, _r0);\n            vst1_s8(pp + 8, _r1);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n#else  // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int32x4x2_t _r01;\n            _r01.val[0] = vreinterpretq_s32_s8(vld1q_s8(p0));\n            _r01.val[1] = vreinterpretq_s32_s8(vld1q_s8(p1));\n            vst2q_s32((int*)pp, _r01);\n            pp += 32;\n            p0 += 16;\n            p1 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int32x2x2_t _r01;\n            _r01.val[0] = vreinterpret_s32_s8(vld1_s8(p0));\n            _r01.val[1] = vreinterpret_s32_s8(vld1_s8(p1));\n            vst2_s32((int*)pp, _r01);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = 
p1[2];\n            pp[7] = p1[3];\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n#else  // __ARM_FEATURE_DOTPROD\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int16x8x2_t _r01;\n            _r01.val[0] = vreinterpretq_s16_s8(vld1q_s8(p0));\n            _r01.val[1] = vreinterpretq_s16_s8(vld1q_s8(p1));\n            vst2q_s16((short*)pp, _r01);\n            pp += 32;\n            p0 += 16;\n            p1 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int16x4x2_t _r01;\n            _r01.val[0] = vreinterpret_s16_s8(vld1_s8(p0));\n            _r01.val[1] = vreinterpret_s16_s8(vld1_s8(p1));\n            vst2_s16((short*)pp, _r01);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp += 4;\n            p0 += 2;\n            p1 += 2;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const signed char* p0 = (const signed char*)A + (i + ii) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            vst1q_s8(pp, vld1q_s8(p0));\n            pp += 16;\n            p0 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            vst1_s8(pp, vld1_s8(p0));\n            pp += 8;\n            p0 += 8;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void convolution_gemm_transB_packed_tile_int8(const Mat& AT_tile, const Mat& BT_tile, 
Mat& topT_tile, Mat& top_blob, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end)\n{\n    // NCNN_LOGE(\"convolution_gemm_transB_packed_tile_int8 %d %d %d %d %d %d\", i, max_ii, j, max_jj, k, max_kk);\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.cstep;\n\n    const signed char* pAT = AT_tile;\n    const signed char* pBT = BT_tile;\n\n    int* outptr = topT_tile;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        int* outptr0 = (int*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n#if !__ARM_FEATURE_MATMUL_INT8\n                \"cmp    %w9, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"sub    %0, %0, #192                \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                \"eor    v24.16b, v24.16b, v24.16b   
\\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n\n                \"1:                                 \\n\"\n#endif // !__ARM_FEATURE_MATMUL_INT8\n\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #3                 \\n\" // w4 = max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v0.16b, v0.16b, v0.16b      \\n\"\n                \"eor    v1.16b, v1.16b, v1.16b      \\n\"\n                \"eor    v2.16b, v2.16b, v2.16b      \\n\"\n                \"eor    v3.16b, v3.16b, v3.16b      \\n\"\n                \"eor    v4.16b, v4.16b, v4.16b      \\n\"\n                \"eor    v5.16b, v5.16b, v5.16b      \\n\"\n                \"eor    v6.16b, v6.16b, v6.16b      \\n\"\n                \"eor    v7.16b, v7.16b, v7.16b      \\n\"\n                \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                \"eor    v9.16b, v9.16b, v9.16b      \\n\"\n                \"eor    v10.16b, v10.16b, v10.16b   \\n\"\n                \"eor    v11.16b, v11.16b, v11.16b   \\n\"\n                \"eor    v12.16b, v12.16b, v12.16b   \\n\"\n                \"eor    v13.16b, v13.16b, v13.16b   \\n\"\n                \"eor    v14.16b, v14.16b, v14.16b   \\n\"\n                \"eor    v15.16b, v15.16b, v15.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v16.16b, v17.16b, v18.16b, v19.16b}, [%1], #64 \\n\"\n                \"ld1    {v20.16b, v21.16b, v22.16b, v23.16b}, [%2], #64 \\n\"\n                \"smmla  
v0.4s, v16.16b, v20.16b     \\n\"\n                \"smmla  v1.4s, v17.16b, v20.16b     \\n\"\n                \"smmla  v2.4s, v16.16b, v21.16b     \\n\"\n                \"smmla  v3.4s, v17.16b, v21.16b     \\n\"\n                \"smmla  v4.4s, v18.16b, v20.16b     \\n\"\n                \"smmla  v5.4s, v19.16b, v20.16b     \\n\"\n                \"smmla  v6.4s, v18.16b, v21.16b     \\n\"\n                \"smmla  v7.4s, v19.16b, v21.16b     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v8.4s, v16.16b, v22.16b     \\n\"\n                \"smmla  v9.4s, v17.16b, v22.16b     \\n\"\n                \"smmla  v10.4s, v16.16b, v23.16b    \\n\"\n                \"smmla  v11.4s, v17.16b, v23.16b    \\n\"\n                \"smmla  v12.4s, v18.16b, v22.16b    \\n\"\n                \"smmla  v13.4s, v19.16b, v22.16b    \\n\"\n                \"smmla  v14.4s, v18.16b, v23.16b    \\n\"\n                \"smmla  v15.4s, v19.16b, v23.16b    \\n\"\n                \"bne    2b                          \\n\"\n\n                \"uzp1   v16.4s, v0.4s, v1.4s        \\n\"\n                \"uzp2   v17.4s, v0.4s, v1.4s        \\n\"\n                \"uzp1   v18.4s, v2.4s, v3.4s        \\n\"\n                \"uzp2   v19.4s, v2.4s, v3.4s        \\n\"\n                \"uzp1   v20.4s, v4.4s, v5.4s        \\n\"\n                \"uzp2   v21.4s, v4.4s, v5.4s        \\n\"\n                \"uzp1   v22.4s, v6.4s, v7.4s        \\n\"\n                \"uzp2   v23.4s, v6.4s, v7.4s        \\n\"\n                \"uzp1   v24.4s, v8.4s, v9.4s        \\n\"\n                \"uzp2   v25.4s, v8.4s, v9.4s        \\n\"\n                \"uzp1   v26.4s, v10.4s, v11.4s      \\n\"\n                \"uzp2   v27.4s, v10.4s, v11.4s      \\n\"\n                \"uzp1   v28.4s, v12.4s, v13.4s      \\n\"\n                \"uzp2   v29.4s, v12.4s, v13.4s      \\n\"\n                \"uzp1   v30.4s, v14.4s, v15.4s      \\n\"\n                \"uzp2   v31.4s, v14.4s, 
v15.4s      \\n\"\n\n                \"cmp    %w9, #0                     \\n\"\n                \"beq    1f                          \\n\"\n\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64   \\n\"\n                \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0], #64   \\n\"\n                \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64 \\n\"\n                \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%0]    \\n\"\n                \"sub    %0, %0, #192                \\n\"\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n                \"add    v20.4s, v20.4s, v4.4s       \\n\"\n                \"add    v21.4s, v21.4s, v5.4s       \\n\"\n                \"add    v22.4s, v22.4s, v6.4s       \\n\"\n                \"add    v23.4s, v23.4s, v7.4s       \\n\"\n                \"add    v24.4s, v24.4s, v8.4s       \\n\"\n                \"add    v25.4s, v25.4s, v9.4s       \\n\"\n                \"add    v26.4s, v26.4s, v10.4s      \\n\"\n                \"add    v27.4s, v27.4s, v11.4s      \\n\"\n                \"add    v28.4s, v28.4s, v12.4s      \\n\"\n                \"add    v29.4s, v29.4s, v13.4s      \\n\"\n                \"add    v30.4s, v30.4s, v14.4s      \\n\"\n                \"add    v31.4s, v31.4s, v15.4s      \\n\"\n                \"b      1f                          \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b, v2.16b, v3.16b}, [%1], #64 \\n\"\n                \"ld1    {v4.16b, v5.16b, v6.16b, v7.16b}, [%2], #64 \\n\"\n                \"sdot   v16.4s, v0.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v4.4b[3]   
 \\n\"\n                \"sdot   v20.4s, v1.16b, v4.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v4.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v4.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v4.4b[3]    \\n\"\n                \"sdot   v24.4s, v0.16b, v5.4b[0]    \\n\"\n                \"sdot   v25.4s, v0.16b, v5.4b[1]    \\n\"\n                \"sdot   v26.4s, v0.16b, v5.4b[2]    \\n\"\n                \"sdot   v27.4s, v0.16b, v5.4b[3]    \\n\"\n                \"sdot   v28.4s, v1.16b, v5.4b[0]    \\n\"\n                \"sdot   v29.4s, v1.16b, v5.4b[1]    \\n\"\n                \"sdot   v30.4s, v1.16b, v5.4b[2]    \\n\"\n                \"sdot   v31.4s, v1.16b, v5.4b[3]    \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sdot   v16.4s, v2.16b, v6.4b[0]    \\n\"\n                \"sdot   v17.4s, v2.16b, v6.4b[1]    \\n\"\n                \"sdot   v18.4s, v2.16b, v6.4b[2]    \\n\"\n                \"sdot   v19.4s, v2.16b, v6.4b[3]    \\n\"\n                \"sdot   v20.4s, v3.16b, v6.4b[0]    \\n\"\n                \"sdot   v21.4s, v3.16b, v6.4b[1]    \\n\"\n                \"sdot   v22.4s, v3.16b, v6.4b[2]    \\n\"\n                \"sdot   v23.4s, v3.16b, v6.4b[3]    \\n\"\n                \"sdot   v24.4s, v2.16b, v7.4b[0]    \\n\"\n                \"sdot   v25.4s, v2.16b, v7.4b[1]    \\n\"\n                \"sdot   v26.4s, v2.16b, v7.4b[2]    \\n\"\n                \"sdot   v27.4s, v2.16b, v7.4b[3]    \\n\"\n                \"sdot   v28.4s, v3.16b, v7.4b[0]    \\n\"\n                \"sdot   v29.4s, v3.16b, v7.4b[1]    \\n\"\n                \"sdot   v30.4s, v3.16b, v7.4b[2]    \\n\"\n                \"sdot   v31.4s, v3.16b, v7.4b[3]    \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n#if __ARM_FEATURE_MATMUL_INT8\n                \"cmp    %w9, #0                     \\n\"\n  
              \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"sub    %0, %0, #192                \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n                \"1:                                 \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"and    w4, %w8, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v2.16b, v3.16b}, [%2], #32 \\n\"\n                \"sdot   
v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v2.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v2.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v2.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v2.4b[3]    \\n\"\n                \"sdot   v24.4s, v0.16b, v3.4b[0]    \\n\"\n                \"sdot   v25.4s, v0.16b, v3.4b[1]    \\n\"\n                \"sdot   v26.4s, v0.16b, v3.4b[2]    \\n\"\n                \"sdot   v27.4s, v0.16b, v3.4b[3]    \\n\"\n                \"sdot   v28.4s, v1.16b, v3.4b[0]    \\n\"\n                \"sdot   v29.4s, v1.16b, v3.4b[1]    \\n\"\n                \"sdot   v30.4s, v1.16b, v3.4b[2]    \\n\"\n                \"sdot   v31.4s, v1.16b, v3.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v4.16b       \\n\"\n                \"rev64  v2.4s, v0.4s                \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull2 v11.8h, v2.16b, v4.16b      \\n\"\n                \"rev64  v6.8h, v4.8h                \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v6.16b      \\n\"\n                \"rev64  v3.4s, v1.4s                \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"smull2 v15.8h, v2.16b, v6.16b   
   \\n\"\n                \"rev64  v7.8h, v5.8h                \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal2 v9.8h, v1.16b, v5.16b       \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v11.8h, v3.16b, v5.16b      \\n\"\n                \"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, v7.16b      \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v15.8h, v3.16b, v7.16b      \\n\"\n                \"ext    v0.16b, v0.16b, v0.16b, #8  \\n\"\n                \"ext    v2.16b, v2.16b, v2.16b, #8  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v20.4s, v10.8h              \\n\"\n                \"sadalp v21.4s, v11.8h              \\n\"\n                \"ext    v1.16b, v1.16b, v1.16b, #8  \\n\"\n                \"ext    v3.16b, v3.16b, v3.16b, #8  \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v4.16b       \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull2 v11.8h, v2.16b, v4.16b      \\n\"\n                \"sadalp v24.4s, v12.8h              \\n\"\n                \"sadalp v25.4s, v13.8h              \\n\"\n                \"sadalp v28.4s, v14.8h              \\n\"\n                \"sadalp v29.4s, v15.8h              \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v6.16b      \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"smull2 v15.8h, v2.16b, v6.16b      \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal2 v9.8h, v1.16b, v5.16b       \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v11.8h, v3.16b, v5.16b      \\n\"\n          
      \"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, v7.16b      \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v15.8h, v3.16b, v7.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v18.4s, v8.8h               \\n\"\n                \"sadalp v19.4s, v9.8h               \\n\"\n                \"sadalp v22.4s, v10.8h              \\n\"\n                \"sadalp v23.4s, v11.8h              \\n\"\n                \"sadalp v26.4s, v12.8h              \\n\"\n                \"sadalp v27.4s, v13.8h              \\n\"\n                \"sadalp v30.4s, v14.8h              \\n\"\n                \"sadalp v31.4s, v15.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w8, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v1.16b}, [%2], #16         \\n\"\n                \"dup    v4.8h, v1.h[0]              \\n\"\n                \"dup    v5.8h, v1.h[1]              \\n\"\n                \"dup    v6.8h, v1.h[2]              \\n\"\n                \"dup    v7.8h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"smull2 v12.8h, v0.16b, v4.16b      \\n\"\n                \"smull2 v13.8h, v0.16b, v5.16b      \\n\"\n                \"smull2 v14.8h, v0.16b, v6.16b      \\n\"\n                \"smull2 
v15.8h, v0.16b, v7.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n                \"dup    v4.8h, v1.h[4]              \\n\"\n                \"dup    v5.8h, v1.h[5]              \\n\"\n                \"dup    v6.8h, v1.h[6]              \\n\"\n                \"dup    v7.8h, v1.h[7]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"smull2 v12.8h, v0.16b, v4.16b      \\n\"\n                \"smull2 v13.8h, v0.16b, v5.16b      \\n\"\n                \"smull2 v14.8h, v0.16b, v6.16b      \\n\"\n                \"smull2 v15.8h, v0.16b, v7.16b      \\n\"\n                \"sadalp v24.4s, v8.8h               \\n\"\n                \"sadalp v25.4s, v9.8h               \\n\"\n                \"sadalp v26.4s, v10.8h              \\n\"\n                \"sadalp v27.4s, v11.8h              \\n\"\n                \"sadalp v28.4s, v12.8h              \\n\"\n                \"sadalp v29.4s, v13.8h              \\n\"\n                \"sadalp v30.4s, v14.8h              \\n\"\n                \"sadalp v31.4s, v15.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n                \"rev64  v3.8h, v2.8h                \\n\"\n            
    \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n                \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v20.4s, v10.8h              \\n\"\n                \"sadalp v21.4s, v11.8h              \\n\"\n                \"sadalp v24.4s, v12.8h              \\n\"\n                \"sadalp v25.4s, v13.8h              \\n\"\n                \"sadalp v28.4s, v14.8h              \\n\"\n                \"sadalp v29.4s, v15.8h              \\n\"\n                \"ext    v0.16b, v0.16b, v0.16b, #8  \\n\"\n                \"ext    v1.16b, v1.16b, v1.16b, #8  \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n                \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v18.4s, v8.8h               \\n\"\n                \"sadalp v19.4s, v9.8h               \\n\"\n                \"sadalp v22.4s, v10.8h              \\n\"\n                \"sadalp v23.4s, v11.8h              \\n\"\n                \"sadalp v26.4s, v12.8h              \\n\"\n                \"sadalp v27.4s, v13.8h              \\n\"\n                \"sadalp v30.4s, v14.8h              \\n\"\n                \"sadalp 
v31.4s, v15.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w8, #1                 \\n\" // w4 = remain = max_kk & 1\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"dup    v8.8b, v1.b[0]              \\n\"\n                \"dup    v9.8b, v1.b[1]              \\n\"\n                \"dup    v10.8b, v1.b[2]             \\n\"\n                \"dup    v11.8b, v1.b[3]             \\n\"\n                \"dup    v12.8b, v1.b[4]             \\n\"\n                \"dup    v13.8b, v1.b[5]             \\n\"\n                \"dup    v14.8b, v1.b[6]             \\n\"\n                \"dup    v15.8b, v1.b[7]             \\n\"\n                \"smull  v8.8h, v0.8b, v8.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v9.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v10.8b       \\n\"\n                \"smull  v11.8h, v0.8b, v11.8b       \\n\"\n                \"smull  v12.8h, v0.8b, v12.8b       \\n\"\n                \"smull  v13.8h, v0.8b, v13.8b       \\n\"\n                \"smull  v14.8h, v0.8b, v14.8b       \\n\"\n                \"smull  v15.8h, v0.8b, v15.8b       \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw  v17.4s, v17.4s, v9.4h       \\n\"\n                \"saddw  v18.4s, v18.4s, v10.4h      \\n\"\n                \"saddw  v19.4s, v19.4s, v11.4h      \\n\"\n                \"saddw2 v20.4s, v20.4s, v8.8h       \\n\"\n                \"saddw2 v21.4s, v21.4s, v9.8h       \\n\"\n                \"saddw2 v22.4s, v22.4s, v10.8h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n                \"saddw  v24.4s, 
v24.4s, v12.4h      \\n\"\n                \"saddw  v25.4s, v25.4s, v13.4h      \\n\"\n                \"saddw  v26.4s, v26.4s, v14.4h      \\n\"\n                \"saddw  v27.4s, v27.4s, v15.4h      \\n\"\n                \"saddw2 v28.4s, v28.4s, v12.8h      \\n\"\n                \"saddw2 v29.4s, v29.4s, v13.8h      \\n\"\n                \"saddw2 v30.4s, v30.4s, v14.8h      \\n\"\n                \"saddw2 v31.4s, v31.4s, v15.8h      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v4.8b}, [%2], #8           \\n\"\n                \"ext    v1.8b, v0.8b, v0.8b, #4     \\n\"\n                \"rev32  v2.4h, v0.4h                \\n\"\n                \"rev64  v3.4h, v0.4h                \\n\"\n                \"rev32  v5.8b, v4.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v4.8b         \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull  v11.8h, v3.8b, v4.8b        \\n\"\n                \"smull  v12.8h, v0.8b, v5.8b        \\n\"\n                \"smull  v13.8h, v1.8b, v5.8b        \\n\"\n                \"smull  v14.8h, v2.8b, v5.8b        \\n\"\n                \"smull  v15.8h, v3.8b, v5.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw2 v21.4s, v21.4s, v10.8h      \\n\"\n                \"saddw  v22.4s, v22.4s, v11.4h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n                \"saddw  v24.4s, v24.4s, v12.4h      \\n\"\n                \"saddw2 v25.4s, v25.4s, v12.8h      \\n\"\n                \"saddw  v26.4s, v26.4s, v13.4h      \\n\"\n                
\"saddw2 v27.4s, v27.4s, v13.8h      \\n\"\n                \"saddw  v28.4s, v28.4s, v14.4h      \\n\"\n                \"saddw2 v29.4s, v29.4s, v14.8h      \\n\"\n                \"saddw  v30.4s, v30.4s, v15.4h      \\n\"\n                \"saddw2 v31.4s, v31.4s, v15.8h      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"cmp    %w10, #0                    \\n\"\n                \"beq    10f                         \\n\"\n\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                //      e2 f2 g2 h2\n                //      e3 f3 g3 h3\n                //      a4 b4 c4 d4\n                //      a5 b5 c5 d5\n                //      a6 b6 c6 d6\n                //      a7 b7 c7 d7\n                //      e4 f4 g4 h4\n                //      e5 f5 g5 h5\n                //      e6 f6 g6 h6\n                //      e7 f7 g7 h7\n                // if out_elempack == 4 / 8\n                \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                // if out_elempack == 8\n                \"cmp    %w11, #8                    \\n\"\n                \"bne    7f                          \\n\"\n\n                \"st1    {v16.4s}, [%3], #16         \\n\"\n                \"st1    {v20.4s}, [%3], #16         \\n\"\n                \"st1    {v17.4s}, [%3], #16         \\n\"\n                \"st1    {v21.4s}, [%3], #16         \\n\"\n                \"st1    {v18.4s}, [%3], #16         \\n\"\n                \"st1    {v22.4s}, [%3], #16         \\n\"\n                \"st1    {v19.4s}, [%3], #16         \\n\"\n                \"st1    {v23.4s}, [%3], #16         \\n\"\n                \"st1    {v24.4s}, [%3], #16 
        \\n\"\n                \"st1    {v28.4s}, [%3], #16         \\n\"\n                \"st1    {v25.4s}, [%3], #16         \\n\"\n                \"st1    {v29.4s}, [%3], #16         \\n\"\n                \"st1    {v26.4s}, [%3], #16         \\n\"\n                \"st1    {v30.4s}, [%3], #16         \\n\"\n                \"st1    {v27.4s}, [%3], #16         \\n\"\n                \"st1    {v31.4s}, [%3], #16         \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 4\n                \"7:                                 \\n\"\n                \"add    x4, %3, %12, lsl #4         \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%3], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [x4], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // to\n                //      a0 a1 a2 a3\n                //      a4 a5 a6 a7\n                //      b0 b1 b2 b3\n                //      b4 b5 b6 b7\n                //      c0 c1 c2 c3\n                //      c4 c5 c6 c7\n                //      d0 d1 d2 d3\n                //      d4 d5 d6 d7\n                //      e0 e1 e2 e3\n                //      e4 e5 e6 e7\n                //      f0 f1 f2 f3\n                //      f4 f5 f6 f7\n                //      g0 g1 g2 g3\n                //      g4 g5 g6 g7\n                //      h0 h1 h2 h3\n                //      h4 h5 h6 h7\n                \"zip1   v0.4s, v16.4s, v17.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v17.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v19.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v19.4s       \\n\"\n                \"zip1   
v4.4s, v24.4s, v25.4s       \\n\"\n                \"zip2   v5.4s, v24.4s, v25.4s       \\n\"\n                \"zip1   v6.4s, v26.4s, v27.4s       \\n\"\n                \"zip2   v7.4s, v26.4s, v27.4s       \\n\"\n                \"zip1   v8.4s, v20.4s, v21.4s       \\n\"\n                \"zip2   v9.4s, v20.4s, v21.4s       \\n\"\n                \"zip1   v10.4s, v22.4s, v23.4s      \\n\"\n                \"zip2   v11.4s, v22.4s, v23.4s      \\n\"\n                \"zip1   v12.4s, v28.4s, v29.4s      \\n\"\n                \"zip2   v13.4s, v28.4s, v29.4s      \\n\"\n                \"zip1   v14.4s, v30.4s, v31.4s      \\n\"\n                \"zip2   v15.4s, v30.4s, v31.4s      \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v17.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v18.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v19.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v20.2d, v1.2d, v3.2d        \\n\"\n                \"zip1   v21.2d, v5.2d, v7.2d        \\n\"\n                \"zip2   v22.2d, v1.2d, v3.2d        \\n\"\n                \"zip2   v23.2d, v5.2d, v7.2d        \\n\"\n                \"zip1   v24.2d, v8.2d, v10.2d       \\n\"\n                \"zip1   v25.2d, v12.2d, v14.2d      \\n\"\n                \"zip2   v26.2d, v8.2d, v10.2d       \\n\"\n                \"zip2   v27.2d, v12.2d, v14.2d      \\n\"\n                \"zip1   v28.2d, v9.2d, v11.2d       \\n\"\n                \"zip1   v29.2d, v13.2d, v15.2d      \\n\"\n                \"zip2   v30.2d, v9.2d, v11.2d       \\n\"\n                \"zip2   v31.2d, v13.2d, v15.2d      \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s, v17.4s}, [%3], #32 \\n\"\n                \"st1    {v18.4s, v19.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v20.4s, v21.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl 
#2         \\n\"\n                \"st1    {v22.4s, v23.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v24.4s, v25.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v26.4s, v27.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v28.4s, v29.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v30.4s, v31.4s}, [x4]      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      e4 f5 g6 h7\n                //      e0 f1 g2 h3\n                //      a4 b5 c6 d7\n                //      c0 d1 a2 b3\n                //      g4 h5 e6 f7\n                //      g0 h1 e2 f3\n                //      c4 d5 a6 b7\n                //      a3 b2 c1 d0\n                //      e7 f6 g5 h4\n                //      e3 f2 g1 h0\n                //      a7 b6 c5 d4\n                //      c3 d2 a1 b0\n                //      g7 h6 e5 f4\n                //      g3 h2 e1 f0\n                //      c7 d6 a5 b4\n                // if out_elempack == 4 / 8\n                \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                \"rev64  v24.4s, v24.4s              \\n\"\n                \"rev64  v25.4s, v25.4s              \\n\"\n                \"rev64  v26.4s, v26.4s              \\n\"\n                \"rev64  v27.4s, v27.4s              \\n\"\n                \"rev64  v28.4s, v28.4s              \\n\"\n                \"rev64  v29.4s, v29.4s              \\n\"\n                \"rev64  v30.4s, v30.4s              \\n\"\n                \"rev64  v31.4s, v31.4s              \\n\"\n                \"ext    v24.16b, v24.16b, v24.16b, #8 \\n\"\n                \"ext    v25.16b, v25.16b, v25.16b, #8 \\n\"\n                
\"ext    v26.16b, v26.16b, v26.16b, #8 \\n\"\n                \"ext    v27.16b, v27.16b, v27.16b, #8 \\n\"\n                \"ext    v28.16b, v28.16b, v28.16b, #8 \\n\"\n                \"ext    v29.16b, v29.16b, v29.16b, #8 \\n\"\n                \"ext    v30.16b, v30.16b, v30.16b, #8 \\n\"\n                \"ext    v31.16b, v31.16b, v31.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v28.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v28.4s       \\n\"\n                \"zip1   v2.4s, v20.4s, v24.4s       \\n\"\n                \"zip2   v3.4s, v20.4s, v24.4s       \\n\"\n                \"zip1   v4.4s, v18.4s, v30.4s       \\n\"\n                \"zip2   v5.4s, v18.4s, v30.4s       \\n\"\n                \"zip1   v6.4s, v22.4s, v26.4s       \\n\"\n                \"zip2   v7.4s, v22.4s, v26.4s       \\n\"\n                \"zip1   v8.4s, v19.4s, v31.4s       \\n\"\n                \"zip2   v9.4s, v19.4s, v31.4s       \\n\"\n                \"zip1   v10.4s, v23.4s, v27.4s      \\n\"\n                \"zip2   v11.4s, v23.4s, v27.4s      \\n\"\n                \"zip1   v12.4s, v17.4s, v29.4s      \\n\"\n                \"zip2   v13.4s, v17.4s, v29.4s      \\n\"\n                \"zip1   v14.4s, v21.4s, v25.4s      \\n\"\n                \"zip2   v15.4s, v21.4s, v25.4s      \\n\"\n\n                // if out_elempack == 8\n                \"cmp    %w11, #8                    \\n\"\n                \"bne    7f                          \\n\"\n\n                // to\n                //      a0 b0 c0 d0\n                //      e0 f0 g0 h0\n                //      a1 b1 c1 d1\n                //      e1 f1 g1 h1\n                //      a2 b2 c2 d2\n                //      e2 f2 g2 h2\n                //      a3 b3 c3 d3\n                //      e3 f3 g3 h3\n                //      a4 b4 c4 d4\n                //      e4 f4 g4 h4\n                //      a5 b5 c5 d5\n                //      e5 f5 g5 h5\n                //      a6 b6 c6 d6\n               
 //      e6 f6 g6 h6\n                //      a7 b7 c7 d7\n                //      e7 f7 g7 h7\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v17.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v18.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v19.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v20.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v21.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v22.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v23.2d, v7.2d, v5.2d        \\n\"\n                \"zip1   v24.2d, v8.2d, v10.2d       \\n\"\n                \"zip1   v25.2d, v12.2d, v14.2d      \\n\"\n                \"zip2   v26.2d, v8.2d, v10.2d       \\n\"\n                \"zip2   v27.2d, v12.2d, v14.2d      \\n\"\n                \"zip1   v28.2d, v11.2d, v9.2d       \\n\"\n                \"zip1   v29.2d, v15.2d, v13.2d      \\n\"\n                \"zip2   v30.2d, v11.2d, v9.2d       \\n\"\n                \"zip2   v31.2d, v15.2d, v13.2d      \\n\"\n                \"rev64  v18.4s, v18.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"rev64  v22.4s, v22.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n                \"rev64  v26.4s, v26.4s              \\n\"\n                \"rev64  v27.4s, v27.4s              \\n\"\n                \"rev64  v30.4s, v30.4s              \\n\"\n                \"rev64  v31.4s, v31.4s              \\n\"\n\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%3], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 4\n                \"7:                                 
\\n\"\n                // to\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      a4 b4 c4 d4\n                //      a5 b5 c5 d5\n                //      a6 b6 c6 d6\n                //      a7 b7 c7 d7\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                //      e2 f2 g2 h2\n                //      e3 f3 g3 h3\n                //      e4 f4 g4 h4\n                //      e5 f5 g5 h5\n                //      e6 f6 g6 h6\n                //      e7 f7 g7 h7\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v24.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v25.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v18.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v26.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v19.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v27.2d, v7.2d, v5.2d        \\n\"\n                \"zip1   v20.2d, v8.2d, v10.2d       \\n\"\n                \"zip1   v28.2d, v12.2d, v14.2d      \\n\"\n                \"zip2   v21.2d, v8.2d, v10.2d       \\n\"\n                \"zip2   v29.2d, v12.2d, v14.2d      \\n\"\n                \"zip1   v22.2d, v11.2d, v9.2d       \\n\"\n                \"zip1   v30.2d, v15.2d, v13.2d      \\n\"\n                \"zip2   v23.2d, v11.2d, v9.2d       \\n\"\n                \"zip2   v31.2d, v15.2d, v13.2d      \\n\"\n                \"rev64  v17.4s, v17.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"rev64  v21.4s, v21.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n                \"rev64  v25.4s, v25.4s              \\n\"\n                \"rev64  v27.4s, v27.4s              \\n\"\n                \"rev64  v29.4s, v29.4s              
\\n\"\n                \"rev64  v31.4s, v31.4s              \\n\"\n\n                \"add    x4, %3, %12, lsl #4         \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [x4], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [x4] \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // to\n                //      a0 a1 a2 a3\n                //      a4 a5 a6 a7\n                //      b0 b1 b2 b3\n                //      b4 b5 b6 b7\n                //      c0 c1 c2 c3\n                //      c4 c5 c6 c7\n                //      d0 d1 d2 d3\n                //      d4 d5 d6 d7\n                //      e0 e1 e2 e3\n                //      e4 e5 e6 e7\n                //      f0 f1 f2 f3\n                //      f4 f5 f6 f7\n                //      g0 g1 g2 g3\n                //      g4 g5 g6 g7\n                //      h0 h1 h2 h3\n                //      h4 h5 h6 h7\n                \"ext    v20.16b, v20.16b, v20.16b, #8 \\n\"\n                \"ext    v21.16b, v21.16b, v21.16b, #8 \\n\"\n                \"ext    v22.16b, v22.16b, v22.16b, #8 \\n\"\n                \"ext    v23.16b, v23.16b, v23.16b, #8 \\n\"\n                \"ext    v28.16b, v28.16b, v28.16b, #8 \\n\"\n                \"ext    v29.16b, v29.16b, v29.16b, #8 \\n\"\n                \"ext    v30.16b, v30.16b, v30.16b, #8 \\n\"\n                \"ext    v31.16b, v31.16b, v31.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v28.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v28.4s       \\n\"\n                \"zip1   v2.4s, v20.4s, v24.4s       \\n\"\n                \"zip2   v3.4s, v20.4s, v24.4s       \\n\"\n                \"zip1   v4.4s, v19.4s, v31.4s       
\\n\"\n                \"zip2   v5.4s, v19.4s, v31.4s       \\n\"\n                \"zip1   v6.4s, v23.4s, v27.4s       \\n\"\n                \"zip2   v7.4s, v23.4s, v27.4s       \\n\"\n                \"zip1   v8.4s, v18.4s, v30.4s       \\n\"\n                \"zip2   v9.4s, v18.4s, v30.4s       \\n\"\n                \"zip1   v10.4s, v22.4s, v26.4s      \\n\"\n                \"zip2   v11.4s, v22.4s, v26.4s      \\n\"\n                \"zip1   v12.4s, v17.4s, v29.4s      \\n\"\n                \"zip2   v13.4s, v17.4s, v29.4s      \\n\"\n                \"zip1   v14.4s, v21.4s, v25.4s      \\n\"\n                \"zip2   v15.4s, v21.4s, v25.4s      \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v17.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v18.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v19.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v20.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v21.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v22.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v23.2d, v7.2d, v5.2d        \\n\"\n                \"zip1   v24.2d, v8.2d, v10.2d       \\n\"\n                \"zip1   v25.2d, v12.2d, v14.2d      \\n\"\n                \"zip2   v26.2d, v8.2d, v10.2d       \\n\"\n                \"zip2   v27.2d, v12.2d, v14.2d      \\n\"\n                \"zip1   v28.2d, v11.2d, v9.2d       \\n\"\n                \"zip1   v29.2d, v15.2d, v13.2d      \\n\"\n                \"zip2   v30.2d, v11.2d, v9.2d       \\n\"\n                \"zip2   v31.2d, v15.2d, v13.2d      \\n\"\n                \"rev64  v18.4s, v18.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"rev64  v22.4s, v22.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n                \"rev64  v26.4s, v26.4s              \\n\"\n                \"rev64  v27.4s, v27.4s              \\n\"\n             
   \"rev64  v30.4s, v30.4s              \\n\"\n                \"rev64  v31.4s, v31.4s              \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s, v17.4s}, [%3], #32 \\n\"\n                \"st1    {v18.4s, v19.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v20.4s, v21.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v22.4s, v23.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v24.4s, v25.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v26.4s, v27.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v28.4s, v29.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v30.4s, v31.4s}, [x4]      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #256                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(max_kk),       // %8\n                
\"r\"(k),            // %9\n                \"r\"(k_end),        // %10\n                \"r\"(out_elempack), // %11\n                \"r\"(out_hstep)     // %12\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else // NCNN_GNU_INLINE_ASM\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n            int32x4_t _sum4;\n            int32x4_t _sum5;\n            int32x4_t _sum6;\n            int32x4_t _sum7;\n            int32x4_t _sum8;\n            int32x4_t _sum9;\n            int32x4_t _suma;\n            int32x4_t _sumb;\n            int32x4_t _sumc;\n            int32x4_t _sumd;\n            int32x4_t _sume;\n            int32x4_t _sumf;\n\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n                _sum4 = vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n                _sum8 = vdupq_n_s32(0);\n                _sum9 = vdupq_n_s32(0);\n                _suma = vdupq_n_s32(0);\n                _sumb = vdupq_n_s32(0);\n                _sumc = vdupq_n_s32(0);\n                _sumd = vdupq_n_s32(0);\n                _sume = vdupq_n_s32(0);\n                _sumf = vdupq_n_s32(0);\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n                _sum4 = 
vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n                _sum8 = vdupq_n_s32(0);\n                _sum9 = vdupq_n_s32(0);\n                _suma = vdupq_n_s32(0);\n                _sumb = vdupq_n_s32(0);\n                _sumc = vdupq_n_s32(0);\n                _sumd = vdupq_n_s32(0);\n                _sume = vdupq_n_s32(0);\n                _sumf = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n                _sum4 = vld1q_s32(outptr + 16);\n                _sum5 = vld1q_s32(outptr + 20);\n                _sum6 = vld1q_s32(outptr + 24);\n                _sum7 = vld1q_s32(outptr + 28);\n                _sum8 = vld1q_s32(outptr + 32);\n                _sum9 = vld1q_s32(outptr + 36);\n                _suma = vld1q_s32(outptr + 40);\n                _sumb = vld1q_s32(outptr + 44);\n                _sumc = vld1q_s32(outptr + 48);\n                _sumd = vld1q_s32(outptr + 52);\n                _sume = vld1q_s32(outptr + 56);\n                _sumf = vld1q_s32(outptr + 60);\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                    _sum0 = vmmlaq_s32(_sum0, _pA0, _pB0);\n     
               _sum1 = vmmlaq_s32(_sum1, _pA1, _pB0);\n                    _sum2 = vmmlaq_s32(_sum2, _pA0, _pB1);\n                    _sum3 = vmmlaq_s32(_sum3, _pA1, _pB1);\n                    _sum4 = vmmlaq_s32(_sum4, _pA2, _pB0);\n                    _sum5 = vmmlaq_s32(_sum5, _pA3, _pB0);\n                    _sum6 = vmmlaq_s32(_sum6, _pA2, _pB1);\n                    _sum7 = vmmlaq_s32(_sum7, _pA3, _pB1);\n                    _sum8 = vmmlaq_s32(_sum8, _pA0, _pB2);\n                    _sum9 = vmmlaq_s32(_sum9, _pA1, _pB2);\n                    _suma = vmmlaq_s32(_suma, _pA0, _pB3);\n                    _sumb = vmmlaq_s32(_sumb, _pA1, _pB3);\n                    _sumc = vmmlaq_s32(_sumc, _pA2, _pB2);\n                    _sumd = vmmlaq_s32(_sumd, _pA3, _pB2);\n                    _sume = vmmlaq_s32(_sume, _pA2, _pB3);\n                    _sumf = vmmlaq_s32(_sumf, _pA3, _pB3);\n\n                    pA += 64;\n                    pB += 64;\n                }\n\n                int32x4x2_t _ss0 = vuzpq_s32(_sum0, _sum1);\n                int32x4x2_t _ss1 = vuzpq_s32(_sum2, _sum3);\n                int32x4x2_t _ss2 = vuzpq_s32(_sum4, _sum5);\n                int32x4x2_t _ss3 = vuzpq_s32(_sum6, _sum7);\n                int32x4x2_t _ss4 = vuzpq_s32(_sum8, _sum9);\n                int32x4x2_t _ss5 = vuzpq_s32(_suma, _sumb);\n                int32x4x2_t _ss6 = vuzpq_s32(_sumc, _sumd);\n                int32x4x2_t _ss7 = vuzpq_s32(_sume, _sumf);\n\n                if (k == 0)\n                {\n                    _sum0 = _ss0.val[0];\n                    _sum1 = _ss0.val[1];\n                    _sum2 = _ss1.val[0];\n                    _sum3 = _ss1.val[1];\n                    _sum4 = _ss2.val[0];\n                    _sum5 = _ss2.val[1];\n                    _sum6 = _ss3.val[0];\n                    _sum7 = _ss3.val[1];\n                    _sum8 = _ss4.val[0];\n                    _sum9 = _ss4.val[1];\n                    _suma = _ss5.val[0];\n                    
_sumb = _ss5.val[1];\n                    _sumc = _ss6.val[0];\n                    _sumd = _ss6.val[1];\n                    _sume = _ss7.val[0];\n                    _sumf = _ss7.val[1];\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                    _sum8 = vld1q_s32(outptr + 32);\n                    _sum9 = vld1q_s32(outptr + 36);\n                    _suma = vld1q_s32(outptr + 40);\n                    _sumb = vld1q_s32(outptr + 44);\n                    _sumc = vld1q_s32(outptr + 48);\n                    _sumd = vld1q_s32(outptr + 52);\n                    _sume = vld1q_s32(outptr + 56);\n                    _sumf = vld1q_s32(outptr + 60);\n\n                    _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                    _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                    _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                    _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n                    _sum4 = vaddq_s32(_sum4, _ss2.val[0]);\n                    _sum5 = vaddq_s32(_sum5, _ss2.val[1]);\n                    _sum6 = vaddq_s32(_sum6, _ss3.val[0]);\n                    _sum7 = vaddq_s32(_sum7, _ss3.val[1]);\n                    _sum8 = vaddq_s32(_sum8, _ss4.val[0]);\n                    _sum9 = vaddq_s32(_sum9, _ss4.val[1]);\n                    _suma = vaddq_s32(_suma, _ss5.val[0]);\n                    _sumb = vaddq_s32(_sumb, _ss5.val[1]);\n                    _sumc = vaddq_s32(_sumc, _ss6.val[0]);\n                    _sumd = vaddq_s32(_sumd, _ss6.val[1]);\n                    _sume = vaddq_s32(_sume, _ss7.val[0]);\n       
             _sumf = vaddq_s32(_sumf, _ss7.val[1]);\n                }\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pA2 = vld1q_s8(pA + 32);\n                int8x16_t _pA3 = vld1q_s8(pA + 48);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n                int8x16_t _pB2 = vld1q_s8(pB + 32);\n                int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                // aaaa bbbb cccc dddd    eeee ffff gggg hhhh\n\n                // 0000 1111 2222 3333    4444 5555 6666 7777\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB0, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB0, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB0, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB0, 3);\n                _sum8 = vdotq_laneq_s32(_sum8, _pA0, _pB1, 0);\n                _sum9 = vdotq_laneq_s32(_sum9, _pA0, _pB1, 1);\n                _suma = vdotq_laneq_s32(_suma, _pA0, _pB1, 2);\n                _sumb = vdotq_laneq_s32(_sumb, _pA0, _pB1, 3);\n                _sumc = vdotq_laneq_s32(_sumc, _pA1, _pB1, 0);\n                _sumd = vdotq_laneq_s32(_sumd, _pA1, _pB1, 1);\n                _sume = vdotq_laneq_s32(_sume, _pA1, _pB1, 2);\n                _sumf = vdotq_laneq_s32(_sumf, _pA1, _pB1, 3);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA2, _pB2, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA2, _pB2, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA2, _pB2, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA2, _pB2, 
3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA3, _pB2, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA3, _pB2, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA3, _pB2, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA3, _pB2, 3);\n                _sum8 = vdotq_laneq_s32(_sum8, _pA2, _pB3, 0);\n                _sum9 = vdotq_laneq_s32(_sum9, _pA2, _pB3, 1);\n                _suma = vdotq_laneq_s32(_suma, _pA2, _pB3, 2);\n                _sumb = vdotq_laneq_s32(_sumb, _pA2, _pB3, 3);\n                _sumc = vdotq_laneq_s32(_sumc, _pA3, _pB3, 0);\n                _sumd = vdotq_laneq_s32(_sumd, _pA3, _pB3, 1);\n                _sume = vdotq_laneq_s32(_sume, _pA3, _pB3, 2);\n                _sumf = vdotq_laneq_s32(_sumf, _pA3, _pB3, 3);\n\n                pA += 64;\n                pB += 64;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                // aaaa bbbb cccc dddd    eeee ffff gggg hhhh\n\n                // 0000 1111 2222 3333    4444 5555 6666 7777\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB0, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB0, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB0, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB0, 3);\n                _sum8 = vdotq_laneq_s32(_sum8, _pA0, _pB1, 0);\n                _sum9 = vdotq_laneq_s32(_sum9, _pA0, _pB1, 1);\n                _suma = 
vdotq_laneq_s32(_suma, _pA0, _pB1, 2);\n                _sumb = vdotq_laneq_s32(_sumb, _pA0, _pB1, 3);\n                _sumc = vdotq_laneq_s32(_sumc, _pA1, _pB1, 0);\n                _sumd = vdotq_laneq_s32(_sumd, _pA1, _pB1, 1);\n                _sume = vdotq_laneq_s32(_sume, _pA1, _pB1, 2);\n                _sumf = vdotq_laneq_s32(_sumf, _pA1, _pB1, 3);\n\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB2 = vld1q_s8(pB + 16);\n\n                // aabbccdd eeffgghh\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n\n                // aabbccdd eeffgghh\n                // ccddaabb gghheeff\n\n                int8x16_t _pA3 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA2)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n                int8x16_t _pB3 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s7 = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s8 = 
vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _s9 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sa = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _sb = vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sc = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sd = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB1));\n                int16x8_t _se = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sf = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB1));\n\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), vget_low_s8(_pB2));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), vget_high_s8(_pB2));\n                _s2 = vmlal_s8(_s2, vget_high_s8(_pA2), vget_low_s8(_pB2));\n                _s3 = vmlal_s8(_s3, vget_low_s8(_pA2), vget_high_s8(_pB2));\n                _s4 = vmlal_s8(_s4, vget_low_s8(_pA3), vget_low_s8(_pB2));\n                _s5 = vmlal_s8(_s5, vget_high_s8(_pA3), vget_high_s8(_pB2));\n                _s6 = vmlal_s8(_s6, vget_high_s8(_pA3), vget_low_s8(_pB2));\n                _s7 = vmlal_s8(_s7, vget_low_s8(_pA3), vget_high_s8(_pB2));\n                _s8 = vmlal_s8(_s8, vget_low_s8(_pA2), vget_low_s8(_pB3));\n                _s9 = vmlal_s8(_s9, vget_high_s8(_pA2), vget_high_s8(_pB3));\n                _sa = vmlal_s8(_sa, vget_high_s8(_pA2), vget_low_s8(_pB3));\n                _sb = vmlal_s8(_sb, vget_low_s8(_pA2), vget_high_s8(_pB3));\n                _sc = vmlal_s8(_sc, vget_low_s8(_pA3), vget_low_s8(_pB3));\n                _sd = vmlal_s8(_sd, vget_high_s8(_pA3), vget_high_s8(_pB3));\n                _se = vmlal_s8(_se, vget_high_s8(_pA3), vget_low_s8(_pB3));\n                _sf = vmlal_s8(_sf, vget_low_s8(_pA3), vget_high_s8(_pB3));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = 
vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n                _sum8 = vpadalq_s16(_sum8, _s8);\n                _sum9 = vpadalq_s16(_sum9, _s9);\n                _suma = vpadalq_s16(_suma, _sa);\n                _sumb = vpadalq_s16(_sumb, _sb);\n                _sumc = vpadalq_s16(_sumc, _sc);\n                _sumd = vpadalq_s16(_sumd, _sd);\n                _sume = vpadalq_s16(_sume, _se);\n                _sumf = vpadalq_s16(_sumf, _sf);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 32;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // 00112233 44556677\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 0)));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 1)));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 2)));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 3)));\n                int16x8_t _s4 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 0)));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 1)));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA), 
vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 2)));\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 3)));\n                int16x8_t _s8 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 0)));\n                int16x8_t _s9 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 1)));\n                int16x8_t _sa = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 2)));\n                int16x8_t _sb = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 3)));\n                int16x8_t _sc = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 0)));\n                int16x8_t _sd = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 1)));\n                int16x8_t _se = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 2)));\n                int16x8_t _sf = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 3)));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n                _sum8 = vpadalq_s16(_sum8, _s8);\n                _sum9 = vpadalq_s16(_sum9, _s9);\n                _suma = vpadalq_s16(_suma, _sa);\n                _sumb = vpadalq_s16(_sumb, _sb);\n                _sumc = vpadalq_s16(_sumc, _sc);\n    
            _sumd = vpadalq_s16(_sumd, _sd);\n                _sume = vpadalq_s16(_sume, _se);\n                _sumf = vpadalq_s16(_sumf, _sf);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n\n                // 00112233 44556677\n\n                // 33221100 77665544\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s7 = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s8 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _s9 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sa = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _sb = vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sc = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sd = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB1));\n                int16x8_t _se = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sf = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB1));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n 
               _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n                _sum8 = vpadalq_s16(_sum8, _s8);\n                _sum9 = vpadalq_s16(_sum9, _s9);\n                _suma = vpadalq_s16(_suma, _sa);\n                _sumb = vpadalq_s16(_sumb, _sb);\n                _sumc = vpadalq_s16(_sumc, _sc);\n                _sumd = vpadalq_s16(_sumd, _sd);\n                _sume = vpadalq_s16(_sume, _se);\n                _sumf = vpadalq_s16(_sumf, _sf);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                // int8x8_t _pB0 = vld1_s8(pB);\n\n                // abcd efgh\n                // 0123 4567\n\n                int16x8_t _s01 = vmull_s8(_pA, vdup_n_s8(pB[0]));\n                int16x8_t _s23 = vmull_s8(_pA, vdup_n_s8(pB[1]));\n                int16x8_t _s45 = vmull_s8(_pA, vdup_n_s8(pB[2]));\n                int16x8_t _s67 = vmull_s8(_pA, vdup_n_s8(pB[3]));\n                int16x8_t _s89 = vmull_s8(_pA, vdup_n_s8(pB[4]));\n                int16x8_t _sab = vmull_s8(_pA, vdup_n_s8(pB[5]));\n                int16x8_t _scd = vmull_s8(_pA, vdup_n_s8(pB[6]));\n                int16x8_t _sef = vmull_s8(_pA, vdup_n_s8(pB[7]));\n\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s23));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s45));\n                _sum3 = vaddw_s16(_sum3, vget_low_s16(_s67));\n                _sum4 = vaddw_s16(_sum4, vget_high_s16(_s01));\n                _sum5 = vaddw_s16(_sum5, 
vget_high_s16(_s23));\n                _sum6 = vaddw_s16(_sum6, vget_high_s16(_s45));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n                _sum8 = vaddw_s16(_sum8, vget_low_s16(_s89));\n                _sum9 = vaddw_s16(_sum9, vget_low_s16(_sab));\n                _suma = vaddw_s16(_suma, vget_low_s16(_scd));\n                _sumb = vaddw_s16(_sumb, vget_low_s16(_sef));\n                _sumc = vaddw_s16(_sumc, vget_high_s16(_s89));\n                _sumd = vaddw_s16(_sumd, vget_high_s16(_sab));\n                _sume = vaddw_s16(_sume, vget_high_s16(_scd));\n                _sumf = vaddw_s16(_sumf, vget_high_s16(_sef));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = vld1_s8(pB);\n\n                // abcd efgh\n                // efgh abcd\n                // cdab ghef\n                // ghef cdab\n\n                // 0123 4567\n                // 3210 7654\n\n                // abcdefgh  ->  ghefcdab  ->  cdabghef\n\n                int8x8_t _pA1 = vext_s8(_pA0, _pA0, 4);\n                int8x8_t _pA2 = vreinterpret_s8_s16(vrev32_s16(vreinterpret_s16_s8(_pA0)));\n                int8x8_t _pA3 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pA0)));\n\n                // 01234567  ->  32107654\n\n                int8x8_t _pB1 = vrev32_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s45 = vmull_s8(_pA2, _pB0);\n                int16x8_t _s67 = vmull_s8(_pA3, _pB0);\n                int16x8_t _s89 = vmull_s8(_pA0, _pB1);\n                int16x8_t _sab = vmull_s8(_pA1, _pB1);\n                int16x8_t _scd = vmull_s8(_pA2, _pB1);\n                int16x8_t _sef = vmull_s8(_pA3, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, 
vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s45));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s45));\n                _sum6 = vaddw_s16(_sum6, vget_low_s16(_s67));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n                _sum8 = vaddw_s16(_sum8, vget_low_s16(_s89));\n                _sum9 = vaddw_s16(_sum9, vget_high_s16(_s89));\n                _suma = vaddw_s16(_suma, vget_low_s16(_sab));\n                _sumb = vaddw_s16(_sumb, vget_high_s16(_sab));\n                _sumc = vaddw_s16(_sumc, vget_low_s16(_scd));\n                _sumd = vaddw_s16(_sumd, vget_high_s16(_scd));\n                _sume = vaddw_s16(_sume, vget_low_s16(_sef));\n                _sumf = vaddw_s16(_sumf, vget_high_s16(_sef));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                //      e2 f2 g2 h2\n                //      e3 f3 g3 h3\n                //      a4 b4 c4 d4\n                //      a5 b5 c5 d5\n                //      a6 b6 c6 d6\n                //      a7 b7 c7 d7\n                //      e4 f4 g4 h4\n                //      e5 f5 g5 h5\n                //      e6 f6 g6 h6\n                //      e7 f7 g7 h7\n                if (out_elempack == 8)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum4);\n                    vst1q_s32(outptr0 + 8, _sum1);\n                    vst1q_s32(outptr0 + 12, _sum5);\n                    vst1q_s32(outptr0 + 16, _sum2);\n                    vst1q_s32(outptr0 + 20, 
_sum6);\n                    vst1q_s32(outptr0 + 24, _sum3);\n                    vst1q_s32(outptr0 + 28, _sum7);\n                    vst1q_s32(outptr0 + 32, _sum8);\n                    vst1q_s32(outptr0 + 36, _sumc);\n                    vst1q_s32(outptr0 + 40, _sum9);\n                    vst1q_s32(outptr0 + 44, _sumd);\n                    vst1q_s32(outptr0 + 48, _suma);\n                    vst1q_s32(outptr0 + 52, _sume);\n                    vst1q_s32(outptr0 + 56, _sumb);\n                    vst1q_s32(outptr0 + 60, _sumf);\n                    outptr0 += 64;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + 16, _sum8);\n                    vst1q_s32(outptr0 + 20, _sum9);\n                    vst1q_s32(outptr0 + 24, _suma);\n                    vst1q_s32(outptr0 + 28, _sumb);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 8, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 12, _sum7);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 16, _sumc);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 20, _sumd);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 24, _sume);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 28, _sumf);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      a4 a5 a6 a7\n                    //      b0 b1 b2 b3\n                    //      b4 b5 b6 b7\n                    //      c0 c1 c2 c3\n                    //      c4 c5 c6 
c7\n                    //      d0 d1 d2 d3\n                    //      d4 d5 d6 d7\n                    //      e0 e1 e2 e3\n                    //      e4 e5 e6 e7\n                    //      f0 f1 f2 f3\n                    //      f4 f5 f6 f7\n                    //      g0 g1 g2 g3\n                    //      g4 g5 g6 g7\n                    //      h0 h1 h2 h3\n                    //      h4 h5 h6 h7\n                    {\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum3);\n                        int32x4x2_t _t2 = vzipq_s32(_sum8, _sum9);\n                        int32x4x2_t _t3 = vzipq_s32(_suma, _sumb);\n                        int32x4x2_t _t4 = vzipq_s32(_sum4, _sum5);\n                        int32x4x2_t _t5 = vzipq_s32(_sum6, _sum7);\n                        int32x4x2_t _t6 = vzipq_s32(_sumc, _sumd);\n                        int32x4x2_t _t7 = vzipq_s32(_sume, _sumf);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum2 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t0.val[1]), vget_low_s32(_t1.val[1]));\n                        _sum5 = vcombine_s32(vget_low_s32(_t2.val[1]), vget_low_s32(_t3.val[1]));\n                        _sum6 = vcombine_s32(vget_high_s32(_t0.val[1]), vget_high_s32(_t1.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t2.val[1]), vget_high_s32(_t3.val[1]));\n                        _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                        _sum9 = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                     
   _suma = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                        _sumb = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                        _sumc = vcombine_s32(vget_low_s32(_t4.val[1]), vget_low_s32(_t5.val[1]));\n                        _sumd = vcombine_s32(vget_low_s32(_t6.val[1]), vget_low_s32(_t7.val[1]));\n                        _sume = vcombine_s32(vget_high_s32(_t4.val[1]), vget_high_s32(_t5.val[1]));\n                        _sumf = vcombine_s32(vget_high_s32(_t6.val[1]), vget_high_s32(_t7.val[1]));\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + out_hstep, _sum2);\n                    vst1q_s32(outptr0 + out_hstep + 4, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 3 + 4, _sum7);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum8);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum9);\n                    vst1q_s32(outptr0 + out_hstep * 5, _suma);\n                    vst1q_s32(outptr0 + out_hstep * 5 + 4, _sumb);\n                    vst1q_s32(outptr0 + out_hstep * 6, _sumc);\n                    vst1q_s32(outptr0 + out_hstep * 6 + 4, _sumd);\n                    vst1q_s32(outptr0 + out_hstep * 7, _sume);\n                    vst1q_s32(outptr0 + out_hstep * 7 + 4, _sumf);\n                    outptr0 += 8;\n                }\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      e4 f5 g6 h7\n                //      e0 f1 g2 h3\n                //      a4 b5 c6 d7\n                //      c0 d1 a2 b3\n                //      g4 h5 e6 f7\n                //      g0 h1 e2 f3\n     
           //      c4 d5 a6 b7\n                //      a3 b2 c1 d0\n                //      e7 f6 g5 h4\n                //      e3 f2 g1 h0\n                //      a7 b6 c5 d4\n                //      c3 d2 a1 b0\n                //      g7 h6 e5 f4\n                //      g3 h2 e1 f0\n                //      c7 d6 a5 b4\n                if (out_elempack == 8)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      e0 f0 g0 h0\n                    //      a1 b1 c1 d1\n                    //      e1 f1 g1 h1\n                    //      a2 b2 c2 d2\n                    //      e2 f2 g2 h2\n                    //      a3 b3 c3 d3\n                    //      e3 f3 g3 h3\n                    //      a4 b4 c4 d4\n                    //      e4 f4 g4 h4\n                    //      a5 b5 c5 d5\n                    //      e5 f5 g5 h5\n                    //      a6 b6 c6 d6\n                    //      e6 f6 g6 h6\n                    //      a7 b7 c7 d7\n                    //      e7 f7 g7 h7\n                    {\n                        _sum8 = vrev64q_s32(_sum8);\n                        _sum9 = vrev64q_s32(_sum9);\n                        _suma = vrev64q_s32(_suma);\n                        _sumb = vrev64q_s32(_sumb);\n                        _sumc = vrev64q_s32(_sumc);\n                        _sumd = vrev64q_s32(_sumd);\n                        _sume = vrev64q_s32(_sume);\n                        _sumf = vrev64q_s32(_sumf);\n                        _sum8 = vextq_s32(_sum8, _sum8, 2);\n                        _sum9 = vextq_s32(_sum9, _sum9, 2);\n                        _suma = vextq_s32(_suma, _suma, 2);\n                        _sumb = vextq_s32(_sumb, _sumb, 2);\n                        _sumc = vextq_s32(_sumc, _sumc, 2);\n                        _sumd = vextq_s32(_sumd, _sumd, 2);\n                        _sume = vextq_s32(_sume, _sume, 2);\n                        _sumf = vextq_s32(_sumf, _sumf, 
2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                        int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                        int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                        int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                        int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                        int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                        int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                        int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum2 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum5 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        _sum6 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                        _sum9 = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                        _suma = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                        _sumb = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                        _sumc = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                        _sumd = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                        _sume = 
vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                        _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum6 = vrev64q_s32(_sum6);\n                        _sum7 = vrev64q_s32(_sum7);\n                        _suma = vrev64q_s32(_suma);\n                        _sumb = vrev64q_s32(_sumb);\n                        _sume = vrev64q_s32(_sume);\n                        _sumf = vrev64q_s32(_sumf);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + 16, _sum4);\n                    vst1q_s32(outptr0 + 20, _sum5);\n                    vst1q_s32(outptr0 + 24, _sum6);\n                    vst1q_s32(outptr0 + 28, _sum7);\n                    vst1q_s32(outptr0 + 32, _sum8);\n                    vst1q_s32(outptr0 + 36, _sum9);\n                    vst1q_s32(outptr0 + 40, _suma);\n                    vst1q_s32(outptr0 + 44, _sumb);\n                    vst1q_s32(outptr0 + 48, _sumc);\n                    vst1q_s32(outptr0 + 52, _sumd);\n                    vst1q_s32(outptr0 + 56, _sume);\n                    vst1q_s32(outptr0 + 60, _sumf);\n                    outptr0 += 64;\n                }\n                if (out_elempack == 4)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      a1 b1 c1 d1\n                    //      a2 b2 c2 d2\n                    //      a3 b3 c3 d3\n                    //      a4 b4 c4 d4\n                    //      a5 b5 c5 d5\n                    //      a6 b6 c6 d6\n                    //      a7 b7 c7 d7\n                    //      e0 f0 g0 h0\n                    //      
e1 f1 g1 h1\n                    //      e2 f2 g2 h2\n                    //      e3 f3 g3 h3\n                    //      e4 f4 g4 h4\n                    //      e5 f5 g5 h5\n                    //      e6 f6 g6 h6\n                    //      e7 f7 g7 h7\n                    {\n                        _sum8 = vrev64q_s32(_sum8);\n                        _sum9 = vrev64q_s32(_sum9);\n                        _suma = vrev64q_s32(_suma);\n                        _sumb = vrev64q_s32(_sumb);\n                        _sumc = vrev64q_s32(_sumc);\n                        _sumd = vrev64q_s32(_sumd);\n                        _sume = vrev64q_s32(_sume);\n                        _sumf = vrev64q_s32(_sumf);\n                        _sum8 = vextq_s32(_sum8, _sum8, 2);\n                        _sum9 = vextq_s32(_sum9, _sum9, 2);\n                        _suma = vextq_s32(_suma, _suma, 2);\n                        _sumb = vextq_s32(_sumb, _sumb, 2);\n                        _sumc = vextq_s32(_sumc, _sumc, 2);\n                        _sumd = vextq_s32(_sumd, _sumd, 2);\n                        _sume = vextq_s32(_sume, _sume, 2);\n                        _sumf = vextq_s32(_sumf, _sumf, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                        int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                        int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                        int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                        int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                        int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                        int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                        int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum2 = 
vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                        _sum5 = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                        _sum6 = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                        _sum8 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum9 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _suma = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        _sumb = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sumc = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                        _sumd = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                        _sume = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                        _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum5 = vrev64q_s32(_sum5);\n                        _sum7 = vrev64q_s32(_sum7);\n                        _sum9 = vrev64q_s32(_sum9);\n                        _sumb = vrev64q_s32(_sumb);\n                        _sumd = vrev64q_s32(_sumd);\n                        _sumf = vrev64q_s32(_sumf);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    
vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + 16, _sum4);\n                    vst1q_s32(outptr0 + 20, _sum5);\n                    vst1q_s32(outptr0 + 24, _sum6);\n                    vst1q_s32(outptr0 + 28, _sum7);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum8);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum9);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 8, _suma);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 12, _sumb);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 16, _sumc);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 20, _sumd);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 24, _sume);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 28, _sumf);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      a4 a5 a6 a7\n                    //      b0 b1 b2 b3\n                    //      b4 b5 b6 b7\n                    //      c0 c1 c2 c3\n                    //      c4 c5 c6 c7\n                    //      d0 d1 d2 d3\n                    //      d4 d5 d6 d7\n                    //      e0 e1 e2 e3\n                    //      e4 e5 e6 e7\n                    //      f0 f1 f2 f3\n                    //      f4 f5 f6 f7\n                    //      g0 g1 g2 g3\n                    //      g4 g5 g6 g7\n                    //      h0 h1 h2 h3\n                    //      h4 h5 h6 h7\n                    {\n                        _sum4 = vextq_s32(_sum4, _sum4, 2);\n                        _sum5 = vextq_s32(_sum5, _sum5, 2);\n                        _sum6 = vextq_s32(_sum6, _sum6, 2);\n                        _sum7 = vextq_s32(_sum7, _sum7, 2);\n                        _sumc = vextq_s32(_sumc, _sumc, 2);\n                        _sumd = vextq_s32(_sumd, _sumd, 2);\n                        
_sume = vextq_s32(_sume, _sume, 2);\n                        _sumf = vextq_s32(_sumf, _sumf, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                        int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                        int32x4x2_t _t2 = vzipq_s32(_sum3, _sumf);\n                        int32x4x2_t _t3 = vzipq_s32(_sum7, _sumb);\n                        int32x4x2_t _t4 = vzipq_s32(_sum2, _sume);\n                        int32x4x2_t _t5 = vzipq_s32(_sum6, _suma);\n                        int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                        int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum2 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum5 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        _sum6 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                        _sum9 = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                        _suma = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                        _sumb = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                        _sumc = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                        _sumd = 
vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                        _sume = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                        _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum6 = vrev64q_s32(_sum6);\n                        _sum7 = vrev64q_s32(_sum7);\n                        _suma = vrev64q_s32(_suma);\n                        _sumb = vrev64q_s32(_sumb);\n                        _sume = vrev64q_s32(_sume);\n                        _sumf = vrev64q_s32(_sumf);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + out_hstep, _sum2);\n                    vst1q_s32(outptr0 + out_hstep + 4, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 3 + 4, _sum7);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum8);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum9);\n                    vst1q_s32(outptr0 + out_hstep * 5, _suma);\n                    vst1q_s32(outptr0 + out_hstep * 5 + 4, _sumb);\n                    vst1q_s32(outptr0 + out_hstep * 6, _sumc);\n                    vst1q_s32(outptr0 + out_hstep * 6 + 4, _sumd);\n                    vst1q_s32(outptr0 + out_hstep * 7, _sume);\n                    vst1q_s32(outptr0 + out_hstep * 7 + 4, _sumf);\n                    outptr0 += 8;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, 
_sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n                vst1q_s32(outptr + 32, _sum8);\n                vst1q_s32(outptr + 36, _sum9);\n                vst1q_s32(outptr + 40, _suma);\n                vst1q_s32(outptr + 44, _sumb);\n                vst1q_s32(outptr + 48, _sumc);\n                vst1q_s32(outptr + 52, _sumd);\n                vst1q_s32(outptr + 56, _sume);\n                vst1q_s32(outptr + 60, _sumf);\n            }\n\n            outptr += 64;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cmp    %w9, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\"\n                \"sub    %0, %0, #64                 \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n\n                \"1:                                 \\n\"\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #3                 \\n\" // w4 = 
max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b, v2.16b, v3.16b}, [%1], #64 \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"smmla  v24.4s, v0.16b, v4.16b      \\n\"\n                \"smmla  v25.4s, v1.16b, v4.16b      \\n\"\n                \"smmla  v26.4s, v0.16b, v5.16b      \\n\"\n                \"smmla  v27.4s, v1.16b, v5.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v28.4s, v2.16b, v4.16b      \\n\"\n                \"smmla  v29.4s, v3.16b, v4.16b      \\n\"\n                \"smmla  v30.4s, v2.16b, v5.16b      \\n\"\n                \"smmla  v31.4s, v3.16b, v5.16b      \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"sdot   v16.4s, v0.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v4.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v4.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v4.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v4.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v4.4b[3]    \\n\"\n                \"subs   w4, 
w4, #1                  \\n\"\n                \"sdot   v16.4s, v2.16b, v5.4b[0]    \\n\"\n                \"sdot   v17.4s, v2.16b, v5.4b[1]    \\n\"\n                \"sdot   v18.4s, v2.16b, v5.4b[2]    \\n\"\n                \"sdot   v19.4s, v2.16b, v5.4b[3]    \\n\"\n                \"sdot   v20.4s, v3.16b, v5.4b[0]    \\n\"\n                \"sdot   v21.4s, v3.16b, v5.4b[1]    \\n\"\n                \"sdot   v22.4s, v3.16b, v5.4b[2]    \\n\"\n                \"sdot   v23.4s, v3.16b, v5.4b[3]    \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n                \"bne    2b                          \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"uzp1   v0.4s, v24.4s, v25.4s       \\n\"\n                \"uzp2   v1.4s, v24.4s, v25.4s       \\n\"\n                \"uzp1   v2.4s, v26.4s, v27.4s       \\n\"\n                \"uzp2   v3.4s, v26.4s, v27.4s       \\n\"\n                \"uzp1   v4.4s, v28.4s, v29.4s       \\n\"\n                \"uzp2   v5.4s, v28.4s, v29.4s       \\n\"\n                \"uzp1   v6.4s, v30.4s, v31.4s       \\n\"\n                \"uzp2   v7.4s, v30.4s, v31.4s       \\n\"\n\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n                \"add    v20.4s, v20.4s, v4.4s       \\n\"\n                \"add    v21.4s, v21.4s, v5.4s       \\n\"\n                \"add    v22.4s, v22.4s, v6.4s       \\n\"\n                \"add    v23.4s, v23.4s, v7.4s       \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n                \"and    w4, %w8, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 
\\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v2.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v2.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v2.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v2.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v4.16b}, [%2], #16         \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"rev64  v2.4s, v0.4s                \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"ext    v5.16b, v4.16b, v4.16b, #8  \\n\"\n                \"smull2 v9.8h, v0.16b, v5.16b       \\n\"\n                \"rev64  v6.8h, v4.8h                \\n\"\n                \"smull2 v11.8h, v2.16b, v5.16b      \\n\"\n                \"ext    v7.16b, v6.16b, v6.16b, #8  \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"rev64  v3.4s, v1.4s                \\n\"\n                \"smull2 v13.8h, v0.16b, v7.16b      \\n\"\n                \"smull2 v15.8h, v2.16b, v7.16b      \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v9.8h, v1.16b, v4.16b       \\n\"\n                \"smlal2 v11.8h, v3.16b, v4.16b      \\n\"\n                
\"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, v6.16b      \\n\"\n                \"smlal2 v15.8h, v3.16b, v6.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w8, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"dup    v4.8h, v1.h[0]              \\n\"\n                \"dup    v5.8h, v1.h[1]              \\n\"\n                \"dup    v6.8h, v1.h[2]              \\n\"\n                \"dup    v7.8h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"smull2 v12.8h, v0.16b, v4.16b      \\n\"\n                \"smull2 v13.8h, v0.16b, v5.16b      \\n\"\n                \"smull2 v14.8h, v0.16b, v6.16b      \\n\"\n                \"smull2 
v15.8h, v0.16b, v7.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1r   {v2.2d}, [%2]               \\n\"\n                \"add    %2, %2, #8                  \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n                \"rev64  v3.8h, v2.8h                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n                \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w8, #1                 \\n\" // w4 = remain = max_kk & 1\n         
       \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.8b}, [%2]               \\n\"\n                \"add    %2, %2, #4                  \\n\"\n                \"dup    v8.8b, v1.b[0]              \\n\"\n                \"dup    v9.8b, v1.b[1]              \\n\"\n                \"dup    v10.8b, v1.b[2]             \\n\"\n                \"dup    v11.8b, v1.b[3]             \\n\"\n                \"smull  v8.8h, v0.8b, v8.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v9.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v10.8b       \\n\"\n                \"smull  v11.8h, v0.8b, v11.8b       \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw  v17.4s, v17.4s, v9.4h       \\n\"\n                \"saddw  v18.4s, v18.4s, v10.4h      \\n\"\n                \"saddw  v19.4s, v19.4s, v11.4h      \\n\"\n                \"saddw2 v20.4s, v20.4s, v8.8h       \\n\"\n                \"saddw2 v21.4s, v21.4s, v9.8h       \\n\"\n                \"saddw2 v22.4s, v22.4s, v10.8h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1r   {v4.2s}, [%2]               \\n\"\n                \"add    %2, %2, #4                  \\n\"\n                \"rev32  v1.4h, v0.4h                \\n\"\n                \"rev64  v5.8b, v4.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v4.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v5.8b        \\n\"\n                \"smull  v11.8h, v1.8b, v5.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       
\\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw2 v21.4s, v21.4s, v10.8h      \\n\"\n                \"saddw  v22.4s, v22.4s, v11.4h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"cmp    %w10, #0                    \\n\"\n                \"beq    10f                         \\n\"\n\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                //      e2 f2 g2 h2\n                //      e3 f3 g3 h3\n                // if out_elempack == 4 / 8\n                \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                // if out_elempack == 8\n                \"cmp    %w11, #8                    \\n\"\n                \"bne    7f                          \\n\"\n\n                \"st1    {v16.4s}, [%3], #16         \\n\"\n                \"st1    {v20.4s}, [%3], #16         \\n\"\n                \"st1    {v17.4s}, [%3], #16         \\n\"\n                \"st1    {v21.4s}, [%3], #16         \\n\"\n                \"st1    {v18.4s}, [%3], #16         \\n\"\n                \"st1    {v22.4s}, [%3], #16         \\n\"\n                \"st1    {v19.4s}, [%3], #16         \\n\"\n                \"st1    {v23.4s}, [%3], #16         \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 4\n                \"7:                                 \\n\"\n                \"add    x4, %3, %12, lsl #4         \\n\"\n                \"st1    {v16.4s, 
v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [x4] \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // to\n                //      a0 a1 a2 a3\n                //      b0 b1 b2 b3\n                //      c0 c1 c2 c3\n                //      d0 d1 d2 d3\n                //      e0 e1 e2 e3\n                //      f0 f1 f2 f3\n                //      g0 g1 g2 g3\n                //      h0 h1 h2 h3\n                \"zip1   v0.4s, v16.4s, v17.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v17.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v19.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v19.4s       \\n\"\n                \"zip1   v4.4s, v20.4s, v21.4s       \\n\"\n                \"zip2   v5.4s, v20.4s, v21.4s       \\n\"\n                \"zip1   v6.4s, v22.4s, v23.4s       \\n\"\n                \"zip2   v7.4s, v22.4s, v23.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v18.2d, v1.2d, v3.2d        \\n\"\n                \"zip2   v19.2d, v1.2d, v3.2d        \\n\"\n                \"zip1   v20.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v21.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v22.2d, v5.2d, v7.2d        \\n\"\n                \"zip2   v23.2d, v5.2d, v7.2d        \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s}, [%3], #16         \\n\"\n                \"st1    {v17.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v18.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v19.4s}, [x4]              \\n\"\n                \"add  
  x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v20.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v21.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v22.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v23.4s}, [x4]              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      e0 f1 g2 h3\n                //      c0 d1 a2 b3\n                //      g0 h1 e2 f3\n                //      a3 b2 c1 d0\n                //      e3 f2 g1 h0\n                //      c3 d2 a1 b0\n                //      g3 h2 e1 f0\n                // if out_elempack == 4 / 8\n                \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                \"rev64  v20.4s, v20.4s              \\n\"\n                \"rev64  v21.4s, v21.4s              \\n\"\n                \"rev64  v22.4s, v22.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n                \"ext    v20.16b, v20.16b, v20.16b, #8 \\n\"\n                \"ext    v21.16b, v21.16b, v21.16b, #8 \\n\"\n                \"ext    v22.16b, v22.16b, v22.16b, #8 \\n\"\n                \"ext    v23.16b, v23.16b, v23.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v22.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v22.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v20.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v20.4s       \\n\"\n                \"zip1   v4.4s, v17.4s, v23.4s       \\n\"\n                \"zip2   v5.4s, v17.4s, v23.4s       \\n\"\n                \"zip1   v6.4s, v19.4s, v21.4s       \\n\"\n                \"zip2   v7.4s, v19.4s, v21.4s       \\n\"\n\n                // if out_elempack == 8\n         
       \"cmp    %w11, #8                    \\n\"\n                \"bne    7f                          \\n\"\n\n                // to\n                //      a0 b0 c0 d0\n                //      e0 f0 g0 h0\n                //      a1 b1 c1 d1\n                //      e1 f1 g1 h1\n                //      a2 b2 c2 d2\n                //      e2 f2 g2 h2\n                //      a3 b3 c3 d3\n                //      e3 f3 g3 h3\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v17.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v18.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v19.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v20.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v21.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v22.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v23.2d, v7.2d, v5.2d        \\n\"\n                \"rev64  v18.4s, v18.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"rev64  v22.4s, v22.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 4\n                \"7:                                 \\n\"\n\n                // to\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                //      e2 f2 g2 h2\n                //      e3 f3 g3 h3\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v24.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   
v25.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v18.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v26.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v19.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v27.2d, v7.2d, v5.2d        \\n\"\n                \"rev64  v17.4s, v17.4s              \\n\"\n                \"rev64  v25.4s, v25.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"rev64  v27.4s, v27.4s              \\n\"\n\n                \"add    x4, %3, %12, lsl #4         \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [x4] \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n\n                // to\n                //      a0 a1 a2 a3\n                //      b0 b1 b2 b3\n                //      c0 c1 c2 c3\n                //      d0 d1 d2 d3\n                //      e0 e1 e2 e3\n                //      f0 f1 f2 f3\n                //      g0 g1 g2 g3\n                //      h0 h1 h2 h3\n                \"ext    v18.16b, v18.16b, v18.16b, #8 \\n\"\n                \"ext    v19.16b, v19.16b, v19.16b, #8 \\n\"\n                \"ext    v22.16b, v22.16b, v22.16b, #8 \\n\"\n                \"ext    v23.16b, v23.16b, v23.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v22.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v22.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v20.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v20.4s       \\n\"\n                \"zip1   v4.4s, v17.4s, v23.4s       \\n\"\n                \"zip2   v5.4s, v17.4s, v23.4s       \\n\"\n                \"zip1   v6.4s, v19.4s, v21.4s       \\n\"\n                \"zip2   v7.4s, v19.4s, v21.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n    
            \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v18.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v19.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v20.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v21.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v22.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v23.2d, v7.2d, v5.2d        \\n\"\n                \"rev64  v17.4s, v17.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"rev64  v21.4s, v21.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s}, [%3], #16         \\n\"\n                \"st1    {v17.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v18.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v19.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v20.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v21.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v22.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v23.4s}, [x4]              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #128                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n           
     \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(max_kk),       // %8\n                \"r\"(k),            // %9\n                \"r\"(k_end),        // %10\n                \"r\"(out_elempack), // %11\n                \"r\"(out_hstep)     // %12\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %9, #0              \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0!, {d16-d23}      \\n\"\n                \"vldm       %0, {d24-d31}       \\n\"\n                \"sub        %0, %0, #64         \\n\"\n                \"b          1f                  \\n\"\n\n                \"0:                             \\n\"\n                \"veor       q8, q8              \\n\"\n                \"veor       q9, q9              \\n\"\n                \"veor       q10, q10            \\n\"\n                \"veor       q11, q11            \\n\"\n                \"veor       q12, q12            \\n\"\n                \"veor       q13, q13            \\n\"\n                \"veor       q14, q14            \\n\"\n                \"veor       q15, q15            \\n\"\n\n                \"1:                             \\n\"\n                \"lsr        r4, %8, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        3f    
              \\n\"\n\n                \".align 4                       \\n\"\n                \"2:                             \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.s8    {d0-d3}, [%1 :64]!  \\n\"\n                \"pld        [%2, #128]          \\n\"\n                \"vld1.s8    {d4-d5}, [%2]!      \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vrev64.32  q3, q0              \\n\"\n                \"vmull.s8   q5, d1, d4          \\n\"\n                \"vmull.s8   q6, d6, d4          \\n\"\n                \"vmull.s8   q7, d7, d4          \\n\"\n                \"vrev64.32  q0, q1              \\n\"\n                \"vmlal.s8   q4, d2, d5          \\n\"\n                \"vmlal.s8   q5, d3, d5          \\n\"\n                \"vmlal.s8   q6, d0, d5          \\n\"\n                \"vmlal.s8   q7, d1, d5          \\n\"\n                \"vrev64.16  q2, q2              \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vrev64.32  q1, q3              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vmull.s8   q4, d6, d4          \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vmull.s8   q5, d7, d4          \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n                \"vmull.s8   q6, d2, d4          \\n\"\n                \"vmull.s8   q7, d3, d4          \\n\"\n                \"vrev64.32  q3, q0              \\n\"\n                \"vmlal.s8   q4, d0, d5          \\n\"\n                \"vmlal.s8   q5, d1, d5          \\n\"\n                \"vmlal.s8   q6, d6, d5          \\n\"\n                \"vmlal.s8   q7, d7, d5          \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vpadal.s16 q14, q4             \\n\"\n                \"vpadal.s16 q15, q5             \\n\"\n                \"vpadal.s16 q12, q6             \\n\"\n                
\"vpadal.s16 q13, q7             \\n\"\n                \"bne        2b                  \\n\"\n\n                \"3:                             \\n\"\n                \"and        r4, %8, #2          \\n\" // r4 = remain = max_kk & 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        4f                  \\n\"\n\n                // kk += 2 part\n                \"vld1.s8    {d0-d1}, [%1 :64]!  \\n\"\n                \"vld1.s8    {d4}, [%2]!         \\n\"\n                \"vrev64.32  q1, q0              \\n\"\n                \"vrev64.16  d5, d4              \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vmull.s8   q5, d1, d4          \\n\"\n                \"vmull.s8   q6, d2, d4          \\n\"\n                \"vmull.s8   q7, d3, d4          \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n                \"vmull.s8   q4, d0, d5          \\n\"\n                \"vmull.s8   q5, d1, d5          \\n\"\n                \"vmull.s8   q6, d2, d5          \\n\"\n                \"vmull.s8   q7, d3, d5          \\n\"\n                \"vpadal.s16 q12, q4             \\n\"\n                \"vpadal.s16 q13, q5             \\n\"\n                \"vpadal.s16 q14, q6             \\n\"\n                \"vpadal.s16 q15, q7             \\n\"\n\n                \"4:                             \\n\"\n                \"and        r4, %8, #1          \\n\" // r4 = remain = max_kk & 1\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                // kk += 1 part\n                \"vld1.s8    {d0}, [%1 :64]!     \\n\"\n                \"vld1.s32   {d2[]}, [%2]!       
\\n\"\n                \"vrev64.16  d1, d0              \\n\"\n                \"vrev64.8   d3, d2              \\n\"\n                \"vext.s8    d1, d1, #4          \\n\"\n                \"vmull.s8   q4, d0, d2          \\n\"\n                \"vmull.s8   q5, d1, d2          \\n\"\n                \"vmull.s8   q6, d0, d3          \\n\"\n                \"vmull.s8   q7, d1, d3          \\n\"\n                \"vaddw.s16  q8, d8              \\n\"\n                \"vaddw.s16  q9, d9              \\n\"\n                \"vaddw.s16  q10, d10            \\n\"\n                \"vaddw.s16  q11, d11            \\n\"\n                \"vaddw.s16  q12, d12            \\n\"\n                \"vaddw.s16  q13, d13            \\n\"\n                \"vaddw.s16  q14, d14            \\n\"\n                \"vaddw.s16  q15, d15            \\n\"\n\n                \"5:                             \\n\"\n                \"cmp        %10, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                // from\n                //      a0 b1 c2 d3\n                //      e0 f1 g2 h3\n                //      c0 d1 a2 b3\n                //      g0 h1 e2 f3\n                //      a3 b2 c1 d0\n                //      e3 f2 g1 h0\n                //      c3 d2 a1 b0\n                //      g3 h2 e1 f0\n                // if out_elempack == 4 / 8\n                \"cmp        %11, #1             \\n\"\n                \"beq        8f                  \\n\"\n\n                \"vrev64.32  q12, q12            \\n\"\n                \"vrev64.32  q13, q13            \\n\"\n                \"vrev64.32  q14, q14            \\n\"\n                \"vrev64.32  q15, q15            \\n\"\n                \"vext.32    q12, q12, #2        \\n\"\n                \"vext.32    q13, q13, #2        \\n\"\n                \"vext.32    q14, q14, #2        \\n\"\n                \"vext.32    q15, q15, #2        \\n\"\n                \"vzip.32    q8, q14             
\\n\"\n                \"vzip.32    q10, q12            \\n\"\n                \"vzip.32    q9, q15             \\n\"\n                \"vzip.32    q11, q13            \\n\"\n                \"vswp       d17, d20            \\n\"\n                \"vswp       d19, d22            \\n\"\n                \"vswp       d28, d25            \\n\"\n                \"vswp       d30, d27            \\n\"\n                \"vrev64.32  q10, q10            \\n\"\n                \"vrev64.32  q11, q11            \\n\"\n                \"vrev64.32  q14, q14            \\n\"\n                \"vrev64.32  q15, q15            \\n\"\n\n                // if out_elempack == 8\n                \"cmp        %11, #8             \\n\"\n                \"bne        7f                  \\n\"\n\n                // to\n                //      a0 b0 c0 d0\n                //      e0 f0 g0 h0\n                //      a1 b1 c1 d1\n                //      e1 f1 g1 h1\n                //      a2 b2 c2 d2\n                //      e2 f2 g2 h2\n                //      a3 b3 c3 d3\n                //      e3 f3 g3 h3\n                \"vstm       %3!, {d16-d23}      \\n\"\n                \"vstm       %3!, {d24-d31}      \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 4\n                \"7:                             \\n\"\n                // to\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                //      e2 f2 g2 h2\n                //      e3 f3 g3 h3\n                \"vswp       q9, q10             \\n\"\n                \"vswp       q13, q14            \\n\"\n                \"vswp       q10, q12            \\n\"\n                \"vswp       q11, q13            \\n\"\n\n                \"add        r4, %3, %12, lsl #4 \\n\"\n                \"vstm       
%3!, {d16-d23}      \\n\"\n                \"vstm       r4, {d24-d31}       \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // to\n                //      a0 a1 a2 a3\n                //      b0 b1 b2 b3\n                //      c0 c1 c2 c3\n                //      d0 d1 d2 d3\n                //      e0 e1 e2 e3\n                //      f0 f1 f2 f3\n                //      g0 g1 g2 g3\n                //      h0 h1 h2 h3\n                \"vext.32    q10, q10, #2        \\n\"\n                \"vext.32    q11, q11, #2        \\n\"\n                \"vext.32    q14, q14, #2        \\n\"\n                \"vext.32    q15, q15, #2        \\n\"\n                \"vzip.32    q8, q14             \\n\"\n                \"vzip.32    q10, q12            \\n\"\n                \"vzip.32    q9, q15             \\n\"\n                \"vzip.32    q11, q13            \\n\"\n                \"vswp       d17, d20            \\n\"\n                \"vswp       d19, d22            \\n\"\n                \"vswp       d28, d25            \\n\"\n                \"vswp       d30, d27            \\n\"\n                \"vrev64.32  q10, q10            \\n\"\n                \"vrev64.32  q11, q11            \\n\"\n                \"vrev64.32  q14, q14            \\n\"\n                \"vrev64.32  q15, q15            \\n\"\n\n                \"add        r4, %3, %12, lsl #2 \\n\"\n                \"vst1.s32   {d16-d17}, [%3]!    
\\n\"\n                \"vst1.s32   {d20-d21}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d24-d25}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d28-d29}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d18-d19}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d22-d23}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d26-d27}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d30-d31}, [r4]     \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #128        \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vstm       %0!, {d16-d23}      \\n\"\n                \"vstm       %0!, {d24-d31}      \\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(max_kk),       // %8\n                \"r\"(k),            // %9\n                \"r\"(k_end),        // %10\n                \"r\"(out_elempack), // %11\n                \"r\"(out_hstep)     // %12\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n     
       int32x4_t _sum4;\n            int32x4_t _sum5;\n            int32x4_t _sum6;\n            int32x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n                _sum4 = vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n                _sum4 = vld1q_s32(outptr + 16);\n                _sum5 = vld1q_s32(outptr + 20);\n                _sum6 = vld1q_s32(outptr + 24);\n                _sum7 = vld1q_s32(outptr + 28);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _s0 = vdupq_n_s32(0);\n                int32x4_t _s1 = vdupq_n_s32(0);\n                int32x4_t _s2 = vdupq_n_s32(0);\n                int32x4_t _s3 = vdupq_n_s32(0);\n                int32x4_t _s4 = vdupq_n_s32(0);\n                int32x4_t _s5 = vdupq_n_s32(0);\n                int32x4_t _s6 = vdupq_n_s32(0);\n                int32x4_t _s7 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb ..... 
hhhhhhhh\n                    // 00000000 11111111 22222222 33333333\n\n                    _s0 = vmmlaq_s32(_s0, _pA0, _pB0);\n                    _s1 = vmmlaq_s32(_s1, _pA1, _pB0);\n                    _s2 = vmmlaq_s32(_s2, _pA0, _pB1);\n                    _s3 = vmmlaq_s32(_s3, _pA1, _pB1);\n                    _s4 = vmmlaq_s32(_s4, _pA2, _pB0);\n                    _s5 = vmmlaq_s32(_s5, _pA3, _pB0);\n                    _s6 = vmmlaq_s32(_s6, _pA2, _pB1);\n                    _s7 = vmmlaq_s32(_s7, _pA3, _pB1);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                    _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB0, 0);\n                    _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB0, 1);\n                    _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB0, 2);\n                    _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB0, 3);\n\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA2, _pB1, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA2, _pB1, 1);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA2, _pB1, 2);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA2, _pB1, 3);\n                    _sum4 = vdotq_laneq_s32(_sum4, _pA3, _pB1, 0);\n                    _sum5 = vdotq_laneq_s32(_sum5, _pA3, _pB1, 1);\n                    _sum6 = vdotq_laneq_s32(_sum6, _pA3, _pB1, 2);\n                    _sum7 = vdotq_laneq_s32(_sum7, _pA3, _pB1, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 64;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _ss0 = vuzpq_s32(_s0, _s1);\n                int32x4x2_t _ss1 = vuzpq_s32(_s2, _s3);\n                int32x4x2_t _ss2 = vuzpq_s32(_s4, _s5);\n                int32x4x2_t 
_ss3 = vuzpq_s32(_s6, _s7);\n                _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n                _sum4 = vaddq_s32(_sum4, _ss2.val[0]);\n                _sum5 = vaddq_s32(_sum5, _ss2.val[1]);\n                _sum6 = vaddq_s32(_sum6, _ss3.val[0]);\n                _sum7 = vaddq_s32(_sum7, _ss3.val[1]);\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                // aaaa bbbb cccc dddd   eeee ffff gggg hhhh\n\n                // 0000 1111 2222 3333\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x16_t _pB02 = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n                int8x16_t _pA3 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA2)));\n\n                // 00112233 44556677\n\n                // 33221100 77665544\n\n                int8x16_t _pB13 = 
vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB02));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB02));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB13));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB13));\n                int16x8_t _s6 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB13));\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB13));\n\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), vget_high_s8(_pB02));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_pA3), vget_high_s8(_pB02));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA3), vget_high_s8(_pB02));\n                _s4 = vmlal_s8(_s4, vget_low_s8(_pA2), vget_high_s8(_pB13));\n                _s5 = vmlal_s8(_s5, vget_high_s8(_pA2), vget_high_s8(_pB13));\n                _s6 = vmlal_s8(_s6, vget_low_s8(_pA3), vget_high_s8(_pB13));\n                _s7 = vmlal_s8(_s7, vget_high_s8(_pA3), vget_high_s8(_pB13));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n  
              int8x8_t _pB = vld1_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // 00112233\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2)));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3)));\n                int16x8_t _s4 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2)));\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3)));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x8_t _pB0 = vld1_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n\n                // 00112233\n\n                // 33221100\n\n                int8x8_t _pB1 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pB0)));\n\n       
         int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), _pB0);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA1), _pB0);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA1), _pB0);\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA0), _pB1);\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA0), _pB1);\n                int16x8_t _s6 = vmull_s8(vget_low_s8(_pA1), _pB1);\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA1), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                // int8x8_t _pB0 = vreinterpret_s32_s8(vld1_dup_s32(pB));\n\n                // abcdefgh\n\n                // 0123\n\n                int16x8_t _s01 = vmull_s8(_pA0, vdup_n_s8(pB[0]));\n                int16x8_t _s23 = vmull_s8(_pA0, vdup_n_s8(pB[1]));\n                int16x8_t _s45 = vmull_s8(_pA0, vdup_n_s8(pB[2]));\n                int16x8_t _s67 = vmull_s8(_pA0, vdup_n_s8(pB[3]));\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s23));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s45));\n                _sum3 = vaddw_s16(_sum3, vget_low_s16(_s67));\n                _sum4 = vaddw_s16(_sum4, vget_high_s16(_s01));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s23));\n                _sum6 = 
vaddw_s16(_sum6, vget_high_s16(_s45));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n                // int8x8_t _pB0 = vld1_s8(pB);\n                // _pB0 = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pB0), vreinterpret_s32_s8(_pB0)).val[0]);\n\n                // abcdefgh  ->  cdabghef\n                int8x8_t _pA1 = vreinterpret_s8_s16(vrev32_s16(vreinterpret_s16_s8(_pA0)));\n\n                // 01230123  ->  32103210\n                int8x8_t _pB1 = vrev64_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s45 = vmull_s8(_pA0, _pB1);\n                int16x8_t _s67 = vmull_s8(_pA1, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s45));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s45));\n                _sum6 = vaddw_s16(_sum6, vget_low_s16(_s67));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                //      e2 f2 g2 h2\n                //      e3 f3 g3 h3\n                if (out_elempack == 8)\n                {\n                    vst1q_s32(outptr0, 
_sum0);\n                    vst1q_s32(outptr0 + 4, _sum4);\n                    vst1q_s32(outptr0 + 8, _sum1);\n                    vst1q_s32(outptr0 + 12, _sum5);\n                    vst1q_s32(outptr0 + 16, _sum2);\n                    vst1q_s32(outptr0 + 20, _sum6);\n                    vst1q_s32(outptr0 + 24, _sum3);\n                    vst1q_s32(outptr0 + 28, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 8, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 12, _sum7);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      b0 b1 b2 b3\n                    //      c0 c1 c2 c3\n                    //      d0 d1 d2 d3\n                    //      e0 e1 e2 e3\n                    //      f0 f1 f2 f3\n                    //      g0 g1 g2 g3\n                    //      h0 h1 h2 h3\n                    {\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum3);\n                        int32x4x2_t _t2 = vzipq_s32(_sum4, _sum5);\n                        int32x4x2_t _t3 = vzipq_s32(_sum6, _sum7);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum2 = 
vcombine_s32(vget_low_s32(_t0.val[1]), vget_low_s32(_t1.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t0.val[1]), vget_high_s32(_t1.val[1]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum6 = vcombine_s32(vget_low_s32(_t2.val[1]), vget_low_s32(_t3.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t2.val[1]), vget_high_s32(_t3.val[1]));\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + out_hstep, _sum1);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 5, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 6, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 7, _sum7);\n                    outptr0 += 4;\n                }\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      e0 f1 g2 h3\n                //      c0 d1 a2 b3\n                //      g0 h1 e2 f3\n                //      a3 b2 c1 d0\n                //      e3 f2 g1 h0\n                //      c3 d2 a1 b0\n                //      g3 h2 e1 f0\n                if (out_elempack == 8)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      e0 f0 g0 h0\n                    //      a1 b1 c1 d1\n                    //      e1 f1 g1 h1\n                    //      a2 b2 c2 d2\n                    //      e2 f2 g2 h2\n                    //      a3 b3 c3 d3\n                    //      e3 f3 g3 h3\n                    {\n                        _sum4 = vrev64q_s32(_sum4);\n               
         _sum5 = vrev64q_s32(_sum5);\n                        _sum6 = vrev64q_s32(_sum6);\n                        _sum7 = vrev64q_s32(_sum7);\n                        _sum4 = vextq_s32(_sum4, _sum4, 2);\n                        _sum5 = vextq_s32(_sum5, _sum5, 2);\n                        _sum6 = vextq_s32(_sum6, _sum6, 2);\n                        _sum7 = vextq_s32(_sum7, _sum7, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                        int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                        int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum2 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum5 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        _sum6 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum6 = vrev64q_s32(_sum6);\n                        _sum7 = vrev64q_s32(_sum7);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + 16, _sum4);\n                    vst1q_s32(outptr0 + 20, 
_sum5);\n                    vst1q_s32(outptr0 + 24, _sum6);\n                    vst1q_s32(outptr0 + 28, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 4)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      a1 b1 c1 d1\n                    //      a2 b2 c2 d2\n                    //      a3 b3 c3 d3\n                    //      e0 f0 g0 h0\n                    //      e1 f1 g1 h1\n                    //      e2 f2 g2 h2\n                    //      e3 f3 g3 h3\n                    {\n                        _sum4 = vrev64q_s32(_sum4);\n                        _sum5 = vrev64q_s32(_sum5);\n                        _sum6 = vrev64q_s32(_sum6);\n                        _sum7 = vrev64q_s32(_sum7);\n                        _sum4 = vextq_s32(_sum4, _sum4, 2);\n                        _sum5 = vextq_s32(_sum5, _sum5, 2);\n                        _sum6 = vextq_s32(_sum6, _sum6, 2);\n                        _sum7 = vextq_s32(_sum7, _sum7, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                        int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                        int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum6 = 
vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum5 = vrev64q_s32(_sum5);\n                        _sum7 = vrev64q_s32(_sum7);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 8, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 12, _sum7);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      b0 b1 b2 b3\n                    //      c0 c1 c2 c3\n                    //      d0 d1 d2 d3\n                    //      e0 e1 e2 e3\n                    //      f0 f1 f2 f3\n                    //      g0 g1 g2 g3\n                    //      h0 h1 h2 h3\n                    {\n                        _sum2 = vextq_s32(_sum2, _sum2, 2);\n                        _sum3 = vextq_s32(_sum3, _sum3, 2);\n                        _sum6 = vextq_s32(_sum6, _sum6, 2);\n                        _sum7 = vextq_s32(_sum7, _sum7, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                        int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                        int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n              
          _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum5 = vrev64q_s32(_sum5);\n                        _sum7 = vrev64q_s32(_sum7);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + out_hstep, _sum1);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 5, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 6, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 7, _sum7);\n                    outptr0 += 4;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n            }\n\n            outptr += 32;\n#endif // 
NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const signed char* pA = pAT;\n\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _s0 = vdupq_n_s32(0);\n                int32x4_t _s1 = vdupq_n_s32(0);\n                int32x4_t _s2 = vdupq_n_s32(0);\n                int32x4_t _s3 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n\n                    int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb ..... 
hhhhhhhh\n                    // 00000000 11111111\n\n                    _s0 = vmmlaq_s32(_s0, _pA0, _pB);\n                    _s1 = vmmlaq_s32(_s1, _pA1, _pB);\n                    _s2 = vmmlaq_s32(_s2, _pA2, _pB);\n                    _s3 = vmmlaq_s32(_s3, _pA3, _pB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB, 1);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA1, _pB, 0);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA1, _pB, 1);\n\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA2, _pB, 2);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA2, _pB, 3);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA3, _pB, 2);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA3, _pB, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 64;\n                    pB += 16;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _ss0 = vuzpq_s32(_s0, _s1);\n                int32x4x2_t _ss1 = vuzpq_s32(_s2, _s3);\n                _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x8_t _pB = vld1_s8(pB);\n\n                // aaaa bbbb cccc dddd eeee ffff gggg hhhh\n\n                // 0000 1111\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _pA0, _pB, 1);\n                _sum2 = vdotq_lane_s32(_sum2, _pA1, _pB, 0);\n                _sum3 = vdotq_lane_s32(_sum3, _pA1, _pB, 
1);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x8_t _pB = vld1_s8(pB);\n\n                // aabbccdd eeffgghh   aabbccdd eeffgghh\n\n                // 00112233 -> 00110011 22332233\n\n                // 11001100 33223322\n\n                int32x2x2_t _pBB = vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB));\n                int8x16_t _pB02 = vreinterpretq_s8_s32(vcombine_s32(_pBB.val[0], _pBB.val[1]));\n\n                int8x16_t _pB13 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB13));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB13));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), vget_high_s8(_pB02));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_pA2), vget_high_s8(_pB13));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA2), vget_high_s8(_pB13));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 8;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int16x4_t _pB = vreinterpret_s16_s32(vld1_dup_s32((const int*)pB));\n\n                int16x4x2_t _pB01 = vuzp_s16(_pB, _pB);\n                int8x8_t _pB0 = vreinterpret_s8_s16(_pB01.val[0]);\n                int8x8_t _pB1 = 
vreinterpret_s8_s16(_pB01.val[1]);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), _pB1);\n                int16x8_t _s2 = vmull_s8(vget_high_s8(_pA), _pB0);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // aabbccdd eeffgghh\n\n                // 00110011\n                // 11001100\n\n                int8x8_t _pB1 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA), _pB0);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA), _pB1);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 4;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                int8x8x2_t _pB01 = vuzp_s8(_pB, _pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB01.val[0]);\n                int16x8_t _s1 = vmull_s8(_pA, _pB01.val[1]);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s1));\n                _sum2 = 
vaddw_s16(_sum2, vget_high_s16(_s0));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                // abcdefgh\n\n                // 01010101\n                // 10101010\n                int8x8_t _pB1 = vext_s8(_pB0, _pB0, 1);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB0);\n                int16x8_t _s1 = vmull_s8(_pA, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s1));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      e0 f0 g0 h0\n                //      e1 f1 g1 h1\n                if (out_elempack == 8)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum2);\n                    vst1q_s32(outptr0 + 8, _sum1);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum2);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum3);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 b0 b1\n                    //      c0 c1 d0 d1\n                    //      e0 e1 f0 f1\n          
          //      g0 g1 h0 h1\n                    {\n                        int32x4x2_t _sum02 = vzipq_s32(_sum0, _sum1);\n                        int32x4x2_t _sum13 = vzipq_s32(_sum2, _sum3);\n                        _sum0 = _sum02.val[0];\n                        _sum1 = _sum02.val[1];\n                        _sum2 = _sum13.val[0];\n                        _sum3 = _sum13.val[1];\n                    }\n\n                    vst1_s32(outptr0, vget_low_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep, vget_high_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep * 2, vget_low_s32(_sum1));\n                    vst1_s32(outptr0 + out_hstep * 3, vget_high_s32(_sum1));\n                    vst1_s32(outptr0 + out_hstep * 4, vget_low_s32(_sum2));\n                    vst1_s32(outptr0 + out_hstep * 5, vget_high_s32(_sum2));\n                    vst1_s32(outptr0 + out_hstep * 6, vget_low_s32(_sum3));\n                    vst1_s32(outptr0 + out_hstep * 7, vget_high_s32(_sum3));\n                    outptr0 += 2;\n                }\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c0 d1\n                //      e0 f1 g0 h1\n                //      a1 b0 c1 d0\n                //      e1 f0 g1 h0\n                if (out_elempack == 8)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      e0 f0 g0 h0\n                    //      a1 b1 c1 d1\n                    //      e1 f1 g1 h1\n                    {\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                        int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                        _sum1 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n            
            _sum2 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 4)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      a1 b1 c1 d1\n                    //      e0 f0 g0 h0\n                    //      e1 f1 g1 h1\n                    {\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                        int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                        _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum2);\n                    vst1q_s32(outptr0 + out_hstep * 4 + 4, _sum3);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n 
                   //      a0 a1 c0 c1\n                    //      b0 b1 d0 d1\n                    //      e0 e1 g0 g1\n                    //      f0 f1 h0 h1\n                    {\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                        int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                        _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                    }\n\n                    vst1_s32(outptr0, vget_low_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep, vget_low_s32(_sum1));\n                    vst1_s32(outptr0 + out_hstep * 2, vget_high_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep * 3, vget_high_s32(_sum1));\n                    vst1_s32(outptr0 + out_hstep * 4, vget_low_s32(_sum2));\n                    vst1_s32(outptr0 + out_hstep * 5, vget_low_s32(_sum3));\n                    vst1_s32(outptr0 + out_hstep * 6, vget_high_s32(_sum2));\n                    vst1_s32(outptr0 + out_hstep * 7, vget_high_s32(_sum3));\n                    outptr0 += 2;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n            }\n\n            outptr += 16;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const signed char* pA = pAT;\n\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n\n            if 
(k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _s0 = vdupq_n_s32(0);\n                int32x4_t _s1 = vdupq_n_s32(0);\n                int32x4_t _s2 = vdupq_n_s32(0);\n                int32x4_t _s3 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n\n                    int8x8_t _pB = vld1_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb ..... 
hhhhhhhh\n                    // 00000000\n                    int8x16_t _pBB = vcombine_s8(_pB, _pB);\n\n                    _s0 = vdotq_s32(_s0, _pA0, _pBB);\n                    _s1 = vdotq_s32(_s1, _pA1, _pBB);\n                    _s2 = vdotq_s32(_s2, _pA2, _pBB);\n                    _s3 = vdotq_s32(_s3, _pA3, _pBB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                    _sum1 = vdotq_lane_s32(_sum1, _pA1, _pB, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pA2, _pB, 1);\n                    _sum1 = vdotq_lane_s32(_sum1, _pA3, _pB, 1);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 64;\n                    pB += 8;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_s0, _s1));\n                _sum1 = vaddq_s32(_sum1, vpaddq_s32(_s2, _s3));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n\n                int8x8_t _pB = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // aaaa bbbb cccc dddd eeee ffff gggg hhhh\n\n                // 0000 0000\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _pA1, _pB, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n                int8x8_t _pB1 = vreinterpret_s8_s16(vld1_dup_s16((const short*)(pB + 2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), _pB0);\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), 
_pB1);\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA), _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n\n                pA += 16;\n                pB += 2;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_dup_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                pA += 8;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + out_hstep * 4, _sum1);\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    outptr0[0] = vgetq_lane_s32(_sum0, 0);\n                    outptr0[out_hstep] = vgetq_lane_s32(_sum0, 1);\n                    outptr0[out_hstep * 2] = vgetq_lane_s32(_sum0, 2);\n                    outptr0[out_hstep * 3] = vgetq_lane_s32(_sum0, 3);\n                 
   outptr0[out_hstep * 4] = vgetq_lane_s32(_sum1, 0);\n                    outptr0[out_hstep * 5] = vgetq_lane_s32(_sum1, 1);\n                    outptr0[out_hstep * 6] = vgetq_lane_s32(_sum1, 2);\n                    outptr0[out_hstep * 7] = vgetq_lane_s32(_sum1, 3);\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n\n        pAT += max_kk * 8;\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        int* outptr0 = (int*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cmp    %w9, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\"\n                \"sub    %0, %0, #64                 \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n\n                \"1:                                 \\n\"\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #3                 \\n\" // w4 = 
max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v2.16b, v3.16b, v4.16b, v5.16b}, [%2], #64 \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"smmla  v24.4s, v0.16b, v2.16b      \\n\"\n                \"smmla  v25.4s, v1.16b, v2.16b      \\n\"\n                \"smmla  v26.4s, v0.16b, v3.16b      \\n\"\n                \"smmla  v27.4s, v1.16b, v3.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v28.4s, v0.16b, v4.16b      \\n\"\n                \"smmla  v29.4s, v1.16b, v4.16b      \\n\"\n                \"smmla  v30.4s, v0.16b, v5.16b      \\n\"\n                \"smmla  v31.4s, v1.16b, v5.16b      \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v0.16b, v3.4b[0]    \\n\"\n                \"sdot   v21.4s, v0.16b, v3.4b[1]    \\n\"\n                \"sdot   v22.4s, v0.16b, v3.4b[2]    \\n\"\n                \"sdot   v23.4s, v0.16b, v3.4b[3]    \\n\"\n                \"subs   w4, 
w4, #1                  \\n\"\n                \"sdot   v16.4s, v1.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v1.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v1.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v1.16b, v4.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v5.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v5.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v5.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v5.4b[3]    \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n                \"bne    2b                          \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"uzp1   v0.4s, v24.4s, v25.4s       \\n\"\n                \"uzp2   v1.4s, v24.4s, v25.4s       \\n\"\n                \"uzp1   v2.4s, v26.4s, v27.4s       \\n\"\n                \"uzp2   v3.4s, v26.4s, v27.4s       \\n\"\n                \"uzp1   v4.4s, v28.4s, v29.4s       \\n\"\n                \"uzp2   v5.4s, v28.4s, v29.4s       \\n\"\n                \"uzp1   v6.4s, v30.4s, v31.4s       \\n\"\n                \"uzp2   v7.4s, v30.4s, v31.4s       \\n\"\n\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n                \"add    v20.4s, v20.4s, v4.4s       \\n\"\n                \"add    v21.4s, v21.4s, v5.4s       \\n\"\n                \"add    v22.4s, v22.4s, v6.4s       \\n\"\n                \"add    v23.4s, v23.4s, v7.4s       \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n                \"and    w4, %w8, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b}, [%1], #16         
\\n\"\n                \"ld1    {v2.16b, v3.16b}, [%2], #32 \\n\"\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v0.16b, v3.4b[0]    \\n\"\n                \"sdot   v21.4s, v0.16b, v3.4b[1]    \\n\"\n                \"sdot   v22.4s, v0.16b, v3.4b[2]    \\n\"\n                \"sdot   v23.4s, v0.16b, v3.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v5.16b       \\n\"\n                \"rev64  v2.4s, v0.4s                \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull2 v11.8h, v2.16b, v5.16b      \\n\"\n                \"rev64  v6.8h, v4.8h                \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"rev64  v7.8h, v5.8h                \\n\"\n                \"smull2 v13.8h, v0.16b, v7.16b      \\n\"\n                \"smull2 v15.8h, v2.16b, v7.16b      \\n\"\n                \"ext    v1.16b, v0.16b, v0.16b, #8  \\n\"\n                \"ext    v3.16b, v2.16b, v2.16b, #8  \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal2 v9.8h, v1.16b, v4.16b       \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v11.8h, v3.16b, v4.16b      \\n\"\n                
\"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, v6.16b      \\n\"\n                \"smlal2 v15.8h, v3.16b, v6.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w8, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.16b}, [%2], #16         \\n\"\n                \"dup    v4.8h, v1.h[0]              \\n\"\n                \"dup    v5.8h, v1.h[1]              \\n\"\n                \"dup    v6.8h, v1.h[2]              \\n\"\n                \"dup    v7.8h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"dup    v4.8h, v1.h[4]              \\n\"\n                \"dup    v5.8h, v1.h[5]              \\n\"\n                \"dup    v6.8h, v1.h[6]              \\n\"\n                \"dup    
v7.8h, v1.h[7]              \\n\"\n                \"smull  v12.8h, v0.8b, v4.8b        \\n\"\n                \"smull  v13.8h, v0.8b, v5.8b        \\n\"\n                \"smull  v14.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v15.8h, v0.8b, v7.8b        \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2d}, [%1]               \\n\"\n                \"add    %1, %1, #8                  \\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n                \"rev64  v3.8h, v2.8h                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n                \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n            
    \"sadalp v23.4s, v15.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w8, #1                 \\n\" // w4 = remain = max_kk & 1\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2s}, [%1]               \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"add    %1, %1, #4                  \\n\"\n                \"dup    v8.8h, v1.h[0]              \\n\"\n                \"dup    v9.8h, v1.h[1]              \\n\"\n                \"dup    v10.8h, v1.h[2]             \\n\"\n                \"dup    v11.8h, v1.h[3]             \\n\"\n                \"uzp1   v2.8b, v8.8b, v9.8b         \\n\"\n                \"uzp2   v3.8b, v8.8b, v9.8b         \\n\"\n                \"uzp1   v4.8b, v10.8b, v11.8b       \\n\"\n                \"uzp2   v5.8b, v10.8b, v11.8b       \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v3.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v4.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v5.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw  v17.4s, v17.4s, v9.4h       \\n\"\n                \"saddw2 v18.4s, v18.4s, v8.8h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw  v21.4s, v21.4s, v11.4h      \\n\"\n                \"saddw2 v22.4s, v22.4s, v10.8h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2s}, [%1]               \\n\"\n                \"ld1    {v2.8b}, [%2], #8           \\n\"\n                \"add    %1, %1, #4                  
\\n\"\n                \"ext    v1.8b, v0.8b, v0.8b, #2     \\n\"\n                \"rev32  v3.8b, v2.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v2.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v3.8b        \\n\"\n                \"smull  v11.8h, v1.8b, v3.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw2 v21.4s, v21.4s, v10.8h      \\n\"\n                \"saddw  v22.4s, v22.4s, v11.4h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"cmp    %w10, #0                    \\n\"\n                \"beq    10f                         \\n\"\n\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      a4 b4 c4 d4\n                //      a5 b5 c5 d5\n                //      a6 b6 c6 d6\n                //      a7 b7 c7 d7\n                // if out_elempack == 4\n                \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // to\n                //      a0 a1 a2 a3\n                //      a4 a5 a6 a7\n             
   //      b0 b1 b2 b3\n                //      b4 b5 b6 b7\n                //      c0 c1 c2 c3\n                //      c4 c5 c6 c7\n                //      d0 d1 d2 d3\n                //      d4 d5 d6 d7\n                \"zip1   v0.4s, v16.4s, v17.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v17.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v19.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v19.4s       \\n\"\n                \"zip1   v4.4s, v20.4s, v21.4s       \\n\"\n                \"zip2   v5.4s, v20.4s, v21.4s       \\n\"\n                \"zip1   v6.4s, v22.4s, v23.4s       \\n\"\n                \"zip2   v7.4s, v22.4s, v23.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v17.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v18.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v19.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v20.2d, v1.2d, v3.2d        \\n\"\n                \"zip1   v21.2d, v5.2d, v7.2d        \\n\"\n                \"zip2   v22.2d, v1.2d, v3.2d        \\n\"\n                \"zip2   v23.2d, v5.2d, v7.2d        \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s, v17.4s}, [%3], #32 \\n\"\n                \"st1    {v18.4s, v19.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v20.4s, v21.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v22.4s, v23.4s}, [x4]      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      a4 b5 c6 d7\n                //      c0 d1 a2 b3\n                //      c4 d5 a6 b7\n                //      a3 b2 c1 d0\n                //      a7 b6 c5 d4\n                //      c3 d2 a1 b0\n                //      c7 d6 a5 b4\n                // if out_elempack == 4\n            
    \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                // to\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      a4 b4 c4 d4\n                //      a5 b5 c5 d5\n                //      a6 b6 c6 d6\n                //      a7 b7 c7 d7\n                \"rev64  v20.4s, v20.4s              \\n\"\n                \"rev64  v21.4s, v21.4s              \\n\"\n                \"rev64  v22.4s, v22.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n                \"ext    v20.16b, v20.16b, v20.16b, #8 \\n\"\n                \"ext    v21.16b, v21.16b, v21.16b, #8 \\n\"\n                \"ext    v22.16b, v22.16b, v22.16b, #8 \\n\"\n                \"ext    v23.16b, v23.16b, v23.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v22.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v22.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v20.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v20.4s       \\n\"\n                \"zip1   v4.4s, v17.4s, v23.4s       \\n\"\n                \"zip2   v5.4s, v17.4s, v23.4s       \\n\"\n                \"zip1   v6.4s, v19.4s, v21.4s       \\n\"\n                \"zip2   v7.4s, v19.4s, v21.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v18.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v19.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v20.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v21.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v22.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v23.2d, v7.2d, v5.2d        \\n\"\n                \"rev64  v17.4s, v17.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n   
             \"rev64  v21.4s, v21.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n\n                // to\n                //      a0 a1 a2 a3\n                //      a4 a5 a6 a7\n                //      b0 b1 b2 b3\n                //      b4 b5 b6 b7\n                //      c0 c1 c2 c3\n                //      c4 c5 c6 c7\n                //      d0 d1 d2 d3\n                //      d4 d5 d6 d7\n                \"ext    v18.16b, v18.16b, v18.16b, #8 \\n\"\n                \"ext    v19.16b, v19.16b, v19.16b, #8 \\n\"\n                \"ext    v22.16b, v22.16b, v22.16b, #8 \\n\"\n                \"ext    v23.16b, v23.16b, v23.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v22.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v22.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v20.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v20.4s       \\n\"\n                \"zip1   v4.4s, v17.4s, v23.4s       \\n\"\n                \"zip2   v5.4s, v17.4s, v23.4s       \\n\"\n                \"zip1   v6.4s, v19.4s, v21.4s       \\n\"\n                \"zip2   v7.4s, v19.4s, v21.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v17.2d, v4.2d, v6.2d        \\n\"\n                \"zip2   v18.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v19.2d, v4.2d, v6.2d        \\n\"\n                \"zip1   v20.2d, v3.2d, v1.2d        \\n\"\n                \"zip1   v21.2d, v7.2d, v5.2d        \\n\"\n                \"zip2   v22.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v23.2d, v7.2d, v5.2d        \\n\"\n                \"rev64  v18.4s, 
v18.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"rev64  v22.4s, v22.4s              \\n\"\n                \"rev64  v23.4s, v23.4s              \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s, v17.4s}, [%3], #32 \\n\"\n                \"st1    {v18.4s, v19.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v20.4s, v21.4s}, [x4]      \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v22.4s, v23.4s}, [x4]      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #128                \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(max_kk),       // %8\n                \"r\"(k),            // %9\n                \"r\"(k_end),        // %10\n                \"r\"(out_elempack), // %11\n                \"r\"(out_hstep)     // %12\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else // NCNN_GNU_INLINE_ASM\n            
int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n            int32x4_t _sum4;\n            int32x4_t _sum5;\n            int32x4_t _sum6;\n            int32x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n                _sum4 = vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n                _sum4 = vld1q_s32(outptr + 16);\n                _sum5 = vld1q_s32(outptr + 20);\n                _sum6 = vld1q_s32(outptr + 24);\n                _sum7 = vld1q_s32(outptr + 28);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum10 = vdupq_n_s32(0);\n                int32x4_t _sum11 = vdupq_n_s32(0);\n                int32x4_t _sum20 = vdupq_n_s32(0);\n                int32x4_t _sum21 = vdupq_n_s32(0);\n                int32x4_t _sum30 = vdupq_n_s32(0);\n                int32x4_t _sum31 = vdupq_n_s32(0);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                    // aaaaaaaa bbbbbbbb cccccccc 
dddddddd\n\n                    // 00000000 11111111 22222222 33333333\n                    // 44444444 55555555 66666666 77777777\n\n                    _sum00 = vmmlaq_s32(_sum00, _pA0, _pB0);\n                    _sum01 = vmmlaq_s32(_sum01, _pA1, _pB0);\n                    _sum10 = vmmlaq_s32(_sum10, _pA0, _pB1);\n                    _sum11 = vmmlaq_s32(_sum11, _pA1, _pB1);\n                    _sum20 = vmmlaq_s32(_sum20, _pA0, _pB2);\n                    _sum21 = vmmlaq_s32(_sum21, _pA1, _pB2);\n                    _sum30 = vmmlaq_s32(_sum30, _pA0, _pB3);\n                    _sum31 = vmmlaq_s32(_sum31, _pA1, _pB3);\n\n                    // a0 a1 b0 b1\n                    // c0 c1 d0 d1\n                    // a2 a3 b2 b3\n                    // c2 c3 d2 d3\n                    // a4 a5 b4 b5\n                    // c4 c5 d4 d5\n                    // a6 a7 b6 b7\n                    // c6 c7 d6 d7\n\n                    pA += 32;\n                    pB += 64;\n                }\n                int32x4x2_t _ss0 = vuzpq_s32(_sum00, _sum01);\n                int32x4x2_t _ss1 = vuzpq_s32(_sum10, _sum11);\n                int32x4x2_t _ss2 = vuzpq_s32(_sum20, _sum21);\n                int32x4x2_t _ss3 = vuzpq_s32(_sum30, _sum31);\n                _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n                _sum4 = vaddq_s32(_sum4, _ss2.val[0]);\n                _sum5 = vaddq_s32(_sum5, _ss2.val[1]);\n                _sum6 = vaddq_s32(_sum6, _ss3.val[0]);\n                _sum7 = vaddq_s32(_sum7, _ss3.val[1]);\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = 
vld1q_s8(pB + 16);\n                int8x16_t _pB2 = vld1q_s8(pB + 32);\n                int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA0, _pB1, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA0, _pB1, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA0, _pB1, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA0, _pB1, 3);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA1, _pB2, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA1, _pB2, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA1, _pB2, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA1, _pB2, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB3, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB3, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB3, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB3, 3);\n\n                pA += 32;\n                pB += 64;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA, _pB0, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA, _pB1, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA, _pB1, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA, _pB1, 2);\n                _sum7 = 
vdotq_laneq_s32(_sum7, _pA, _pB1, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA02 = vld1q_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB2 = vld1q_s8(pB + 16);\n\n                int8x16_t _pA13 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA02)));\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n                int8x16_t _pB3 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA02), vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB0));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA13), vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB1));\n                int16x8_t _s5 = vmull_s8(vget_low_s8(_pA02), vget_high_s8(_pB1));\n                int16x8_t _s6 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB1));\n                int16x8_t _s7 = vmull_s8(vget_low_s8(_pA13), vget_high_s8(_pB1));\n\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA02), vget_low_s8(_pB2));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA02), vget_high_s8(_pB2));\n                _s2 = vmlal_s8(_s2, vget_high_s8(_pA13), vget_low_s8(_pB2));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA13), vget_high_s8(_pB2));\n                _s4 = vmlal_s8(_s4, vget_high_s8(_pA02), vget_low_s8(_pB3));\n                _s5 = vmlal_s8(_s5, vget_high_s8(_pA02), vget_high_s8(_pB3));\n                _s6 = vmlal_s8(_s6, vget_high_s8(_pA13), vget_low_s8(_pB3));\n                _s7 = vmlal_s8(_s7, vget_high_s8(_pA13), vget_high_s8(_pB3));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = 
vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 32;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x16_t _pB01 = vld1q_s8(pB);\n\n                // aabbccdd\n\n                // 00112233 44556677\n\n                int16x8_t _s0 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 0)));\n                int16x8_t _s1 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 1)));\n                int16x8_t _s2 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 2)));\n                int16x8_t _s3 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 3)));\n                int16x8_t _s4 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 0)));\n                int16x8_t _s5 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 1)));\n                int16x8_t _s6 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 2)));\n                int16x8_t _s7 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 3)));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = 
vpadalq_s16(_sum7, _s7);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n\n                // aabbccdd\n                // ccddaabb\n\n                int8x8_t _pA1 = vreinterpret_s8_s32(vrev64_s32(vreinterpret_s32_s8(_pA0)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(_pA1, vget_low_s8(_pB0));\n                int16x8_t _s3 = vmull_s8(_pA1, vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(_pA0, vget_low_s8(_pB1));\n                int16x8_t _s5 = vmull_s8(_pA0, vget_high_s8(_pB1));\n                int16x8_t _s6 = vmull_s8(_pA1, vget_low_s8(_pB1));\n                int16x8_t _s7 = vmull_s8(_pA1, vget_high_s8(_pB1));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pAA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vld1_s8(pB);\n\n                // abcdabcd\n                // 01234567  ->  01010101 23232323 45454545 67676767\n                int8x8_t _pB0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0));\n                int8x8_t _pB2 = 
vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1));\n                int8x8_t _pB4 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2));\n                int8x8_t _pB6 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3));\n\n                int8x8x2_t _pB0123 = vuzp_s8(_pB0, _pB2);\n                int8x8x2_t _pB4567 = vuzp_s8(_pB4, _pB6);\n\n                int16x8_t _s02 = vmull_s8(_pAA, _pB0123.val[0]);\n                int16x8_t _s13 = vmull_s8(_pAA, _pB0123.val[1]);\n                int16x8_t _s46 = vmull_s8(_pAA, _pB4567.val[0]);\n                int16x8_t _s57 = vmull_s8(_pAA, _pB4567.val[1]);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s02));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s13));\n                _sum2 = vaddw_s16(_sum2, vget_high_s16(_s02));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s13));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s46));\n                _sum5 = vaddw_s16(_sum5, vget_low_s16(_s57));\n                _sum6 = vaddw_s16(_sum6, vget_high_s16(_s46));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s57));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB0 = vld1_s8(pB);\n\n                // abcd abcd\n                // cdab cdab\n\n                int8x8_t _pA1 = vext_s8(_pA0, _pA0, 2);\n\n                // 0123 4567\n                // 3210 7654\n\n                int8x8_t _pB1 = vrev32_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s45 = vmull_s8(_pA0, _pB1);\n                int16x8_t _s67 = vmull_s8(_pA1, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = 
vaddw_s16(_sum3, vget_high_s16(_s23));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s45));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s45));\n                _sum6 = vaddw_s16(_sum6, vget_low_s16(_s67));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                //      a4 b4 c4 d4\n                //      a5 b5 c5 d5\n                //      a6 b6 c6 d6\n                //      a7 b7 c7 d7\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + 16, _sum4);\n                    vst1q_s32(outptr0 + 20, _sum5);\n                    vst1q_s32(outptr0 + 24, _sum6);\n                    vst1q_s32(outptr0 + 28, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      a4 a5 a6 a7\n                    //      b0 b1 b2 b3\n                    //      b4 b5 b6 b7\n                    //      c0 c1 c2 c3\n                    //      c4 c5 c6 c7\n                    //      d0 d1 d2 d3\n                    //      d4 d5 d6 d7\n                    {\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum3);\n                        int32x4x2_t _t2 = vzipq_s32(_sum4, _sum5);\n                        int32x4x2_t _t3 = 
vzipq_s32(_sum6, _sum7);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum2 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t0.val[1]), vget_low_s32(_t1.val[1]));\n                        _sum5 = vcombine_s32(vget_low_s32(_t2.val[1]), vget_low_s32(_t3.val[1]));\n                        _sum6 = vcombine_s32(vget_high_s32(_t0.val[1]), vget_high_s32(_t1.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t2.val[1]), vget_high_s32(_t3.val[1]));\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + out_hstep, _sum2);\n                    vst1q_s32(outptr0 + out_hstep + 4, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 3 + 4, _sum7);\n                    outptr0 += 8;\n                }\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      a4 b5 c6 d7\n                //      c0 d1 a2 b3\n                //      c4 d5 a6 b7\n                //      a3 b2 c1 d0\n                //      a7 b6 c5 d4\n                //      c3 d2 a1 b0\n                //      c7 d6 a5 b4\n                if (out_elempack == 4)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      a1 b1 c1 d1\n                    //      a2 b2 c2 d2\n                    //      a3 b3 c3 d3\n 
                   //      a4 b4 c4 d4\n                    //      a5 b5 c5 d5\n                    //      a6 b6 c6 d6\n                    //      a7 b7 c7 d7\n                    {\n                        _sum4 = vrev64q_s32(_sum4);\n                        _sum5 = vrev64q_s32(_sum5);\n                        _sum6 = vrev64q_s32(_sum6);\n                        _sum7 = vrev64q_s32(_sum7);\n                        _sum4 = vextq_s32(_sum4, _sum4, 2);\n                        _sum5 = vextq_s32(_sum5, _sum5, 2);\n                        _sum6 = vextq_s32(_sum6, _sum6, 2);\n                        _sum7 = vextq_s32(_sum7, _sum7, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                        int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                        int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum5 = vrev64q_s32(_sum5);\n                        _sum7 = vrev64q_s32(_sum7);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n 
                   vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    vst1q_s32(outptr0 + 16, _sum4);\n                    vst1q_s32(outptr0 + 20, _sum5);\n                    vst1q_s32(outptr0 + 24, _sum6);\n                    vst1q_s32(outptr0 + 28, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      a4 a5 a6 a7\n                    //      b0 b1 b2 b3\n                    //      b4 b5 b6 b7\n                    //      c0 c1 c2 c3\n                    //      c4 c5 c6 c7\n                    //      d0 d1 d2 d3\n                    //      d4 d5 d6 d7\n                    {\n                        _sum2 = vextq_s32(_sum2, _sum2, 2);\n                        _sum3 = vextq_s32(_sum3, _sum3, 2);\n                        _sum6 = vextq_s32(_sum6, _sum6, 2);\n                        _sum7 = vextq_s32(_sum7, _sum7, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                        int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                        int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                        int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                        _sum2 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                        _sum4 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum5 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                        
_sum6 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum6 = vrev64q_s32(_sum6);\n                        _sum7 = vrev64q_s32(_sum7);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + out_hstep, _sum2);\n                    vst1q_s32(outptr0 + out_hstep + 4, _sum3);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum4);\n                    vst1q_s32(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum6);\n                    vst1q_s32(outptr0 + out_hstep * 3 + 4, _sum7);\n                    outptr0 += 8;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                vst1q_s32(outptr + 16, _sum4);\n                vst1q_s32(outptr + 20, _sum5);\n                vst1q_s32(outptr + 24, _sum6);\n                vst1q_s32(outptr + 28, _sum7);\n            }\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cmp    %w9, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0] \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 
\\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n\n                \"1:                                 \\n\"\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #3                 \\n\" // w4 = max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"smmla  v24.4s, v0.16b, v4.16b      \\n\"\n                \"smmla  v25.4s, v1.16b, v4.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v26.4s, v0.16b, v5.16b      \\n\"\n                \"smmla  v27.4s, v1.16b, v5.16b      \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"sdot   v16.4s, v0.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v4.4b[3]    \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sdot   v16.4s, v1.16b, v5.4b[0]    \\n\"\n                \"sdot   v17.4s, v1.16b, v5.4b[1]    \\n\"\n                \"sdot   v18.4s, v1.16b, v5.4b[2]    \\n\"\n                \"sdot   v19.4s, v1.16b, v5.4b[3]    \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n                \"bne    
2b                          \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"uzp1   v0.4s, v24.4s, v25.4s       \\n\"\n                \"uzp2   v1.4s, v24.4s, v25.4s       \\n\"\n                \"uzp1   v2.4s, v26.4s, v27.4s       \\n\"\n                \"uzp2   v3.4s, v26.4s, v27.4s       \\n\"\n\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n                \"and    w4, %w8, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w8, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v4.16b}, [%2], #16         \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n                \"smull  v9.8h, v1.8b, v4.8b         \\n\"\n                \"rev64  v5.8h, v4.8h                \\n\"\n                \"smull  v10.8h, v0.8b, v5.8b        \\n\"\n                \"smull  v11.8h, 
v1.8b, v5.8b        \\n\"\n                \"smlal2 v8.8h, v0.16b, v4.16b       \\n\"\n                \"smlal2 v9.8h, v1.16b, v4.16b       \\n\"\n                \"smlal2 v10.8h, v0.16b, v5.16b      \\n\"\n                \"smlal2 v11.8h, v1.16b, v5.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w8, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"dup    v4.4h, v1.h[0]              \\n\"\n                \"dup    v5.4h, v1.h[1]              \\n\"\n                \"dup    v6.4h, v1.h[2]              \\n\"\n                \"dup    v7.4h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v2.8b}, [%2], #8           \\n\"\n               
 \"ext    v1.8b, v0.8b, v0.8b, #4     \\n\"\n                \"rev64  v3.4h, v2.4h                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v2.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v3.8b        \\n\"\n                \"smull  v11.8h, v1.8b, v3.8b        \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w8, #1                 \\n\" // w4 = remain = max_kk & 1\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2s}, [%1]               \\n\"\n                \"ld1r   {v1.2s}, [%2]               \\n\"\n                \"add    %1, %1, #4                  \\n\"\n                \"add    %2, %2, #4                  \\n\"\n                \"zip1   v1.8b, v1.8b, v1.8b         \\n\"\n                \"zip1   v2.4h, v1.4h, v1.4h         \\n\"\n                \"zip2   v3.4h, v1.4h, v1.4h         \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v3.8b         \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1]               \\n\"\n                \"ld1r   {v4.2s}, [%2]               \\n\"\n                \"add    %1, %1, #4                  \\n\"\n                \"add    %2, %2, #4                  
\\n\"\n                \"rev32  v1.4h, v0.4h                \\n\"\n                \"zip1   v0.2s, v0.2s, v1.2s         \\n\"\n                \"rev32  v5.8b, v4.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"cmp    %w10, #0                    \\n\"\n                \"beq    10f                         \\n\"\n\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                // if out_elempack == 4\n                \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if out_elempack == 1\n                \"8:                                 \\n\"\n                // to\n                //      a0 a1 a2 a3\n                //      b0 b1 b2 b3\n                //      c0 c1 c2 c3\n                //      d0 d1 d2 d3\n                \"zip1   v0.4s, v16.4s, v17.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v17.4s       \\n\"\n                \"zip1   v2.4s, v18.4s, v19.4s       \\n\"\n                \"zip2   v3.4s, v18.4s, v19.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v18.2d, v1.2d, v3.2d        \\n\"\n                \"zip2   v19.2d, v1.2d, 
v3.2d        \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s}, [%3], #16         \\n\"\n                \"st1    {v17.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v18.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v19.4s}, [x4]              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      c0 d1 a2 b3\n                //      a3 b2 c1 d0\n                //      c3 d2 a1 b0\n                // if out_elempack == 4\n                \"cmp    %w11, #1                    \\n\"\n                \"beq    8f                          \\n\"\n\n                // to\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                \"rev64  v18.4s, v18.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n                \"ext    v18.16b, v18.16b, v18.16b, #8 \\n\"\n                \"ext    v19.16b, v19.16b, v19.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v19.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v19.4s       \\n\"\n                \"zip1   v2.4s, v17.4s, v18.4s       \\n\"\n                \"zip2   v3.4s, v17.4s, v18.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v18.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v19.2d, v3.2d, v1.2d        \\n\"\n                \"rev64  v17.4s, v17.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%3], #64 \\n\"\n                \"b      9f                          \\n\"\n\n                // if 
out_elempack == 1\n                \"8:                                 \\n\"\n\n                // to\n                //      a0 a1 a2 a3\n                //      b0 b1 b2 b3\n                //      c0 c1 c2 c3\n                //      d0 d1 d2 d3\n                \"ext    v17.16b, v17.16b, v17.16b, #8 \\n\"\n                \"ext    v19.16b, v19.16b, v19.16b, #8 \\n\"\n                \"zip1   v0.4s, v16.4s, v19.4s       \\n\"\n                \"zip2   v1.4s, v16.4s, v19.4s       \\n\"\n                \"zip1   v2.4s, v17.4s, v18.4s       \\n\"\n                \"zip2   v3.4s, v17.4s, v18.4s       \\n\"\n                \"zip1   v16.2d, v0.2d, v2.2d        \\n\"\n                \"zip2   v17.2d, v0.2d, v2.2d        \\n\"\n                \"zip1   v18.2d, v3.2d, v1.2d        \\n\"\n                \"zip2   v19.2d, v3.2d, v1.2d        \\n\"\n                \"rev64  v17.4s, v17.4s              \\n\"\n                \"rev64  v19.4s, v19.4s              \\n\"\n\n                \"add    x4, %3, %12, lsl #2         \\n\"\n                \"st1    {v16.4s}, [%3], #16         \\n\"\n                \"st1    {v17.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v18.4s}, [x4]              \\n\"\n                \"add    x4, x4, %12, lsl #2         \\n\"\n                \"st1    {v19.4s}, [x4]              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"9:                                 \\n\"\n                \"add    %0, %0, #64                 \\n\"\n                \"b      11f                         \\n\"\n\n                \"10:                                \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n\n                \"11:                                \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : 
\"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(max_kk),       // %8\n                \"r\"(k),            // %9\n                \"r\"(k_end),        // %10\n                \"r\"(out_elempack), // %11\n                \"r\"(out_hstep)     // %12\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %9, #0              \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0, {d16-d23}       \\n\"\n                \"b          1f                  \\n\"\n\n                \"0:                             \\n\"\n                \"veor       q8, q8              \\n\"\n                \"veor       q9, q9              \\n\"\n                \"veor       q10, q10            \\n\"\n                \"veor       q11, q11            \\n\"\n\n                \"1:                             \\n\"\n                \"lsr        r4, %8, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        3f                  \\n\"\n\n                \".align 4                       \\n\"\n                \"2:                             \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.s8    {d0-d1}, [%1 :64]!  \\n\"\n                \"pld        [%2, #128]          \\n\"\n                \"vld1.s8    {d4-d5}, [%2]!      
\\n\"\n                \"vrev64.32  q1, q0              \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vrev64.16  q3, q2              \\n\"\n                \"vmull.s8   q5, d2, d4          \\n\"\n                \"vmull.s8   q6, d0, d6          \\n\"\n                \"vmull.s8   q7, d2, d6          \\n\"\n                \"vmlal.s8   q4, d1, d5          \\n\"\n                \"vmlal.s8   q5, d3, d5          \\n\"\n                \"vmlal.s8   q6, d1, d7          \\n\"\n                \"vmlal.s8   q7, d3, d7          \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n                \"bne        2b                  \\n\"\n\n                \"3:                             \\n\"\n                \"and        r4, %8, #2          \\n\" // r4 = remain = max_kk & 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        4f                  \\n\"\n\n                // kk += 2 part\n                \"vld1.s8    {d0}, [%1 :64]!     \\n\"\n                \"vld1.s8    {d4}, [%2]!         
\\n\"\n                \"vext.8     d1, d0, d0, #4      \\n\"\n                \"vrev64.16  d5, d4              \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vmull.s8   q5, d1, d4          \\n\"\n                \"vmull.s8   q6, d0, d5          \\n\"\n                \"vmull.s8   q7, d1, d5          \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n\n                \"4:                             \\n\"\n                \"and        r4, %8, #1          \\n\" // r4 = remain = max_kk & 1\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                // kk += 1 part\n                \"vld1.s32   {d0[0]}, [%1]!      \\n\"\n                \"vld1.s32   {d2[]}, [%2]!       \\n\"\n                \"vrev32.16  d1, d0              \\n\"\n                \"vrev32.s8  d3, d2              \\n\"\n                \"vzip.32    d0, d1              \\n\"\n                \"vmull.s8   q4, d0, d2          \\n\"\n                \"vmull.s8   q5, d0, d3          \\n\"\n                \"vaddw.s16  q8, d8              \\n\"\n                \"vaddw.s16  q9, d9              \\n\"\n                \"vaddw.s16  q10, d10            \\n\"\n                \"vaddw.s16  q11, d11            \\n\"\n\n                \"5:                             \\n\"\n                \"cmp        %10, #0             \\n\"\n                \"beq        10f                 \\n\"\n\n                // from\n                //      a0 b1 c2 d3\n                //      c0 d1 a2 b3\n                //      a3 b2 c1 d0\n                //      c3 d2 a1 b0\n                // if out_elempack == 4\n                \"cmp        %11, #1             \\n\"\n                \"beq        8f                  \\n\"\n\n                // to\n     
           //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                \"vrev64.32  q10, q10            \\n\"\n                \"vrev64.32  q11, q11            \\n\"\n                \"vext.32    q10, q10, #2        \\n\"\n                \"vext.32    q11, q11, #2        \\n\"\n                \"vzip.32    q8, q11             \\n\"\n                \"vzip.32    q9, q10             \\n\"\n                \"vswp       d17, d18            \\n\"\n                \"vswp       d21, d22            \\n\"\n                \"vrev64.32  q9, q9              \\n\"\n                \"vrev64.32  q11, q11            \\n\"\n\n                \"vstm       %3!, {d16-d23}      \\n\"\n                \"b          9f                  \\n\"\n\n                // if out_elempack == 1\n                \"8:                             \\n\"\n                // to\n                //      a0 a1 a2 a3\n                //      b0 b1 b2 b3\n                //      c0 c1 c2 c3\n                //      d0 d1 d2 d3\n                \"vext.32    q9, q9, #2          \\n\"\n                \"vext.32    q11, q11, #2        \\n\"\n                \"vzip.32    q8, q11             \\n\"\n                \"vzip.32    q9, q10             \\n\"\n                \"vswp       d17, d18            \\n\"\n                \"vswp       d21, d22            \\n\"\n                \"vrev64.32  q9, q9              \\n\"\n                \"vrev64.32  q11, q11            \\n\"\n\n                \"add        r4, %3, %12, lsl #2 \\n\"\n                \"vst1.s32   {d16-d17}, [%3]!    
\\n\"\n                \"vst1.s32   {d18-d19}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d20-d21}, [r4]     \\n\"\n                \"add        r4, r4, %12, lsl #2 \\n\"\n                \"vst1.s32   {d22-d23}, [r4]     \\n\"\n\n                \"9:                             \\n\"\n                \"add        %0, %0, #64         \\n\"\n                \"b          11f                 \\n\"\n\n                \"10:                            \\n\"\n                \"vstm       %0!, {d16-d23}      \\n\"\n\n                \"11:                            \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB),     // %2\n                \"=r\"(outptr0) // %3\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"3\"(outptr0),\n                \"r\"(max_kk),       // %8\n                \"r\"(k),            // %9\n                \"r\"(k_end),        // %10\n                \"r\"(out_elempack), // %11\n                \"r\"(out_hstep)     // %12\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n            }\n\n            int kk = 0;\n#if 
__ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum10 = vdupq_n_s32(0);\n                int32x4_t _sum11 = vdupq_n_s32(0);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                    // aaaaaaaa bbbbbbbb cccccccc dddddddd\n\n                    // 00000000 11111111 22222222 33333333\n\n                    _sum00 = vmmlaq_s32(_sum00, _pA0, _pB0);\n                    _sum01 = vmmlaq_s32(_sum01, _pA1, _pB0);\n                    _sum10 = vmmlaq_s32(_sum10, _pA0, _pB1);\n                    _sum11 = vmmlaq_s32(_sum11, _pA1, _pB1);\n\n                    // a0 a1 b0 b1\n                    // c0 c1 d0 d1\n                    // a2 a3 b2 b3\n                    // c2 c3 d2 d3\n\n                    pA += 32;\n                    pB += 32;\n                }\n                int32x4x2_t _ss0 = vuzpq_s32(_sum00, _sum01);\n                int32x4x2_t _ss1 = vuzpq_s32(_sum10, _sum11);\n                _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n         
       _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA1, _pB1, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA1, _pB1, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA1, _pB1, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA1, _pB1, 3);\n\n                pA += 32;\n                pB += 32;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA, _pB, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA, _pB, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA, _pB, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA, _pB, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA02 = vld1q_s8(pA);\n                int8x16_t _pB02 = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n                // ccddaabb gghheeff\n\n                int8x16_t _pA13 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA02)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n                int8x16_t _pB13 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB02));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB13));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB13));\n\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA02), vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA13), vget_high_s8(_pB02));\n                _s2 = vmlal_s8(_s2, vget_high_s8(_pA02), vget_high_s8(_pB13));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA13), 
vget_high_s8(_pB13));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s1 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                int16x8_t _s2 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2)));\n                int16x8_t _s3 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3)));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = vld1_s8(pB);\n\n                // aabbccdd\n                // ccddaabb\n\n                int8x8_t _pA1 = vext_s8(_pA0, _pA0, 4);\n\n                // 00112233\n                // 33221100\n\n                int8x8_t _pB1 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s1 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s2 = vmull_s8(_pA0, _pB1);\n                int16x8_t _s3 = vmull_s8(_pA1, _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // 
__ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                _pB = vzip_s8(_pB, _pB).val[0];\n                int16x4x2_t _pB0123 = vzip_s16(vreinterpret_s16_s8(_pB), vreinterpret_s16_s8(_pB));\n\n                int16x8_t _s01 = vmull_s8(_pA, vreinterpret_s8_s16(_pB0123.val[0]));\n                int16x8_t _s23 = vmull_s8(_pA, vreinterpret_s8_s16(_pB0123.val[1]));\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n#else  // __ARM_FEATURE_DOTPROD\n\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // abcd.... -> cdab.... 
-> abcdcdab\n                int8x8_t _pA1 = vreinterpret_s8_s16(vrev32_s16(vreinterpret_s16_s8(_pA0)));\n                int8x8_t _pA01 = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pA0), vreinterpret_s32_s8(_pA1)).val[0]);\n\n                // 01230123 -> 32103210\n                int8x8_t _pB1 = vrev32_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA01, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA01, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                //      a2 b2 c2 d2\n                //      a3 b3 c3 d3\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      b0 b1 b2 b3\n                    //      c0 c1 c2 c3\n                    //      d0 d1 d2 d3\n                    {\n                        int32x4x2_t _r01 = vzipq_s32(_sum0, _sum1);\n                        int32x4x2_t _r23 = vzipq_s32(_sum2, _sum3);\n                        _sum0 = vcombine_s32(vget_low_s32(_r01.val[0]), vget_low_s32(_r23.val[0]));\n                        _sum1 = vcombine_s32(vget_high_s32(_r01.val[0]), vget_high_s32(_r23.val[0]));\n                        
_sum2 = vcombine_s32(vget_low_s32(_r01.val[1]), vget_low_s32(_r23.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_r01.val[1]), vget_high_s32(_r23.val[1]));\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + out_hstep, _sum1);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum3);\n                    outptr0 += 4;\n                }\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c2 d3\n                //      c0 d1 a2 b3\n                //      a3 b2 c1 d0\n                //      c3 d2 a1 b0\n                if (out_elempack == 4)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      a1 b1 c1 d1\n                    //      a2 b2 c2 d2\n                    //      a3 b3 c3 d3\n                    {\n                        _sum2 = vrev64q_s32(_sum2);\n                        _sum3 = vrev64q_s32(_sum3);\n                        _sum2 = vextq_s32(_sum2, _sum2, 2);\n                        _sum3 = vextq_s32(_sum3, _sum3, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                        int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    
vst1q_s32(outptr0 + 8, _sum2);\n                    vst1q_s32(outptr0 + 12, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 a2 a3\n                    //      b0 b1 b2 b3\n                    //      c0 c1 c2 c3\n                    //      d0 d1 d2 d3\n                    {\n                        _sum1 = vextq_s32(_sum1, _sum1, 2);\n                        _sum3 = vextq_s32(_sum3, _sum3, 2);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                        int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                        _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                        _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                        _sum3 = vrev64q_s32(_sum3);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + out_hstep, _sum1);\n                    vst1q_s32(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_s32(outptr0 + out_hstep * 3, _sum3);\n                    outptr0 += 4;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n            }\n\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const signed char* pA = pAT;\n\n            int32x4_t _sum0;\n            int32x4_t 
_sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb cccccccc dddddddd\n\n                    // 00000000 11111111\n\n                    _sum00 = vmmlaq_s32(_sum00, _pA0, _pB);\n                    _sum01 = vmmlaq_s32(_sum01, _pA1, _pB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB, 1);\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA1, _pB, 2);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA1, _pB, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 32;\n                    pB += 16;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _ss = vuzpq_s32(_sum00, _sum01);\n                _sum0 = vaddq_s32(_sum0, _ss.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss.val[1]);\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA, _pB, 0);\n                _sum1 = 
vdotq_lane_s32(_sum1, _pA, _pB, 1);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // 00112233 -> 00110011 22332233\n                // 11001100 33223322\n\n                int32x2x2_t _pBB = vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB));\n                int8x16_t _pB02 = vreinterpretq_s8_s32(vcombine_s32(_pBB.val[0], _pBB.val[1]));\n\n                int8x16_t _pB13 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), vget_low_s8(_pB13));\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA), vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA), vget_high_s8(_pB13));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 8;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n                // aabbccdd\n                // 0011....\n                int16x8_t _s0 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s1 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // aabbccdd\n\n                // 00110011\n                // 11001100\n                int8x8_t _pB1 = vext_s8(_pB0, _pB0, 2);\n\n             
   int16x8_t _s0 = vmull_s8(_pA, _pB0);\n                int16x8_t _s1 = vmull_s8(_pA, _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 4;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                // abcdabcd\n\n                // 01010101 -> 00001111\n                _pB = vuzp_s8(_pB, vext_s8(_pB, _pB, 1)).val[0];\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                // abcd abcd\n\n                // 0101 0101 -> 0101 1010\n\n                int8x8_t _pB1 = vext_s8(_pB0, _pB0, 1);\n                int8x8_t _pB = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pB0), vreinterpret_s32_s8(_pB1)).val[0]);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n#if __ARM_FEATURE_DOTPROD\n                // from\n                //      a0 b0 c0 d0\n                //      a1 b1 c1 d1\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n       
         if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 b0 b1\n                    //      c0 c1 d0 d1\n                    {\n                        int32x4x2_t _sum01 = vzipq_s32(_sum0, _sum1);\n                        _sum0 = _sum01.val[0];\n                        _sum1 = _sum01.val[1];\n                    }\n\n                    vst1_s32(outptr0, vget_low_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep, vget_high_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep * 2, vget_low_s32(_sum1));\n                    vst1_s32(outptr0 + out_hstep * 3, vget_high_s32(_sum1));\n                    outptr0 += 2;\n                }\n#else  // __ARM_FEATURE_DOTPROD\n\n                // from\n                //      a0 b1 c0 d1\n                //      a1 b0 c1 d0\n                if (out_elempack == 4)\n                {\n                    // to\n                    //      a0 b0 c0 d0\n                    //      a1 b1 c1 d1\n                    {\n                        _sum1 = vrev64q_s32(_sum1);\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                        _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                    }\n\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    // to\n                    //      a0 a1 c0 c1\n                    //      b0 b1 d0 d1\n                    {\n                        int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                        _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                        _sum1 = 
vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                        _sum1 = vrev64q_s32(_sum1);\n                    }\n\n                    vst1_s32(outptr0, vget_low_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep, vget_low_s32(_sum1));\n                    vst1_s32(outptr0 + out_hstep * 2, vget_high_s32(_sum0));\n                    vst1_s32(outptr0 + out_hstep * 3, vget_high_s32(_sum1));\n                    outptr0 += 2;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const signed char* pA = pAT;\n\n            int32x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum23 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x8_t _pB = vld1_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb cccccccc dddddddd\n\n                    // 00000000\n\n                    int8x16_t _pBB = vcombine_s8(_pB, _pB);\n\n                    _sum01 = vdotq_s32(_sum01, _pA0, _pBB);\n                    _sum23 = vdotq_s32(_sum23, _pA1, _pBB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pA1, _pB, 1);\n#endif // 
__ARM_FEATURE_MATMUL_INT8\n\n                    pA += 32;\n                    pB += 8;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_sum01, _sum23));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA, _pB, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n                int8x8_t _pB1 = vreinterpret_s8_s16(vld1_dup_s16((const short*)(pB + 2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB0);\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n\n                pA += 8;\n                pB += 2;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vld1_dup_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    outptr0 
+= 4;\n                }\n                if (out_elempack == 1)\n                {\n                    outptr0[0] = vgetq_lane_s32(_sum0, 0);\n                    outptr0[out_hstep] = vgetq_lane_s32(_sum0, 1);\n                    outptr0[out_hstep * 2] = vgetq_lane_s32(_sum0, 2);\n                    outptr0[out_hstep * 3] = vgetq_lane_s32(_sum0, 3);\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n            }\n\n            outptr += 4;\n        }\n\n        pAT += max_kk * 4;\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        int* outptr0 = (int*)top_blob + (i + ii) * out_hstep + j;\n\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum23 = vdupq_n_s32(0);\n                int32x4_t _sum45 = vdupq_n_s32(0);\n                int32x4_t _sum67 = vdupq_n_s32(0);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2_t _sum00 = vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n         
       int32x2_t _sum11 = vdup_n_s32(0);\n                int32x2_t _sum20 = vdup_n_s32(0);\n                int32x2_t _sum21 = vdup_n_s32(0);\n                int32x2_t _sum30 = vdup_n_s32(0);\n                int32x2_t _sum31 = vdup_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    _sum01 = vmmlaq_s32(_sum01, _pA, _pB0);\n                    _sum23 = vmmlaq_s32(_sum23, _pA, _pB1);\n                    _sum45 = vmmlaq_s32(_sum45, _pA, _pB2);\n                    _sum67 = vmmlaq_s32(_sum67, _pA, _pB3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum00 = vdot_laneq_s32(_sum00, vget_low_s8(_pA), _pB0, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_low_s8(_pA), _pB0, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_low_s8(_pA), _pB0, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_low_s8(_pA), _pB0, 3);\n                    _sum20 = vdot_laneq_s32(_sum20, vget_low_s8(_pA), _pB1, 0);\n                    _sum21 = vdot_laneq_s32(_sum21, vget_low_s8(_pA), _pB1, 1);\n                    _sum30 = vdot_laneq_s32(_sum30, vget_low_s8(_pA), _pB1, 2);\n                    _sum31 = vdot_laneq_s32(_sum31, vget_low_s8(_pA), _pB1, 3);\n                    _sum00 = vdot_laneq_s32(_sum00, vget_high_s8(_pA), _pB2, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_high_s8(_pA), _pB2, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_high_s8(_pA), _pB2, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_high_s8(_pA), _pB2, 3);\n                    _sum20 = vdot_laneq_s32(_sum20, vget_high_s8(_pA), _pB3, 0);\n  
                  _sum21 = vdot_laneq_s32(_sum21, vget_high_s8(_pA), _pB3, 1);\n                    _sum30 = vdot_laneq_s32(_sum30, vget_high_s8(_pA), _pB3, 2);\n                    _sum31 = vdot_laneq_s32(_sum31, vget_high_s8(_pA), _pB3, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 16;\n                    pB += 64;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(vget_low_s32(_sum01), vget_low_s32(_sum23)));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(vget_low_s32(_sum45), vget_low_s32(_sum67)));\n                _sum2 = vaddq_s32(_sum2, vcombine_s32(vget_high_s32(_sum01), vget_high_s32(_sum23)));\n                _sum3 = vaddq_s32(_sum3, vcombine_s32(vget_high_s32(_sum45), vget_high_s32(_sum67)));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                int32x2x2_t _sum2x = vzip_s32(_sum20, _sum21);\n                int32x2x2_t _sum3x = vzip_s32(_sum30, _sum31);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], _sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum2x.val[0], _sum3x.val[0]));\n                _sum2 = vaddq_s32(_sum2, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n                _sum3 = vaddq_s32(_sum3, vcombine_s32(_sum2x.val[1], _sum3x.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_DOTPROD\n                int32x2_t _sum00 = vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n                int32x2_t _sum11 = vdup_n_s32(0);\n                int32x2_t _sum20 = vdup_n_s32(0);\n                int32x2_t _sum21 = vdup_n_s32(0);\n                int32x2_t _sum30 = vdup_n_s32(0);\n                int32x2_t _sum31 = vdup_n_s32(0);\n#endif // 
__ARM_FEATURE_DOTPROD\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_DOTPROD\n                    _sum00 = vdot_laneq_s32(_sum00, _pA, _pB0, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, _pA, _pB0, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, _pA, _pB0, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, _pA, _pB0, 3);\n                    _sum20 = vdot_laneq_s32(_sum20, _pA, _pB1, 0);\n                    _sum21 = vdot_laneq_s32(_sum21, _pA, _pB1, 1);\n                    _sum30 = vdot_laneq_s32(_sum30, _pA, _pB1, 2);\n                    _sum31 = vdot_laneq_s32(_sum31, _pA, _pB1, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 3));\n\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                    int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                    int16x8_t _s2 = vmull_s8(_pA1, vget_low_s8(_pB0));\n                    int16x8_t _s3 = vmull_s8(_pA1, vget_high_s8(_pB0));\n                    _s0 = vmlal_s8(_s0, _pA2, vget_low_s8(_pB1));\n                    _s1 = vmlal_s8(_s1, _pA2, vget_high_s8(_pB1));\n                    _s2 = vmlal_s8(_s2, _pA3, vget_low_s8(_pB1));\n                    _s3 = vmlal_s8(_s3, _pA3, vget_high_s8(_pB1));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n                    _sum2 = vpadalq_s16(_sum2, _s2);\n            
        _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                    pA += 8;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_DOTPROD\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                int32x2x2_t _sum2x = vzip_s32(_sum20, _sum21);\n                int32x2x2_t _sum3x = vzip_s32(_sum30, _sum31);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], _sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum2x.val[0], _sum3x.val[0]));\n                _sum2 = vaddq_s32(_sum2, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n                _sum3 = vaddq_s32(_sum3, vcombine_s32(_sum2x.val[1], _sum3x.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x4_t _pA = vreinterpret_s16_s32(vld1_dup_s32((const int*)pA));\n                int8x16_t _pB = vld1q_s8(pB);\n\n                int16x4x2_t _pA01 = vuzp_s16(_pA, _pA);\n                int8x8_t _pA0 = vreinterpret_s8_s16(_pA01.val[0]);\n                int8x8_t _pA1 = vreinterpret_s8_s16(_pA01.val[1]);\n\n                int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB));\n                int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB));\n                int16x8_t _s2 = vmull_s8(_pA1, vget_low_s8(_pB));\n                int16x8_t _s3 = vmull_s8(_pA1, vget_high_s8(_pB));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                pA += 4;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x8_t _pB = vld1_s8(pB);\n\n                int8x8x2_t _pA01 = 
vuzp_s8(_pA, _pA);\n\n                int16x8_t _s0 = vmull_s8(_pA01.val[0], _pB);\n                int16x8_t _s1 = vmull_s8(_pA01.val[1], _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s1));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    vst1q_s32(outptr0 + out_hstep, _sum2);\n                    vst1q_s32(outptr0 + out_hstep + 4, _sum3);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n            }\n\n            outptr += 16;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum23 = vdupq_n_s32(0);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2_t _sum00 = vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n               
 int32x2_t _sum11 = vdup_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    _sum01 = vmmlaq_s32(_sum01, _pA, _pB0);\n                    _sum23 = vmmlaq_s32(_sum23, _pA, _pB1);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum00 = vdot_laneq_s32(_sum00, vget_low_s8(_pA), _pB0, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_low_s8(_pA), _pB0, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_low_s8(_pA), _pB0, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_low_s8(_pA), _pB0, 3);\n                    _sum00 = vdot_laneq_s32(_sum00, vget_high_s8(_pA), _pB1, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_high_s8(_pA), _pB1, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_high_s8(_pA), _pB1, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_high_s8(_pA), _pB1, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 16;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(vget_low_s32(_sum01), vget_low_s32(_sum23)));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(vget_high_s32(_sum01), vget_high_s32(_sum23)));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], _sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_DOTPROD\n                int32x2_t _sum00 
= vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n                int32x2_t _sum11 = vdup_n_s32(0);\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                    _sum00 = vdot_laneq_s32(_sum00, _pA, _pB, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, _pA, _pB, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, _pA, _pB, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, _pA, _pB, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 3));\n\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB));\n                    int16x8_t _s1 = vmull_s8(_pA1, vget_low_s8(_pB));\n                    _s0 = vmlal_s8(_s0, _pA2, vget_high_s8(_pB));\n                    _s1 = vmlal_s8(_s1, _pA3, vget_high_s8(_pB));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                    pA += 8;\n                    pB += 16;\n                }\n#if __ARM_FEATURE_DOTPROD\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], _sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n            }\n           
 for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x4_t _pA = vreinterpret_s16_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pA)), 0));\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x4x2_t _pA01 = vuzp_s16(_pA, _pA);\n                int8x8_t _pA0 = vreinterpret_s8_s16(_pA01.val[0]);\n                int8x8_t _pA1 = vreinterpret_s8_s16(_pA01.val[1]);\n\n                int16x8_t _s0 = vmull_s8(_pA0, _pB);\n                int16x8_t _s1 = vmull_s8(_pA1, _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n\n                pA += 4;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x8_t _pB = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pB)), 0));\n\n                _pA = vzip_s8(_pA, _pA).val[0];\n                _pA = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA)).val[0]);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + out_hstep, _sum1);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n#if __ARM_NEON\n            int32x4_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdupq_n_s32(0);\n     
       }\n            else\n            {\n                _sum = vld1q_s32(outptr);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n\n#if __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum = vmmlaq_s32(_sum, _pA, _pB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _pAA = vzipq_s32(vreinterpretq_s32_s8(_pA), vreinterpretq_s32_s8(_pA));\n                int8x16_t _pA01 = vreinterpretq_s8_s32(_pAA.val[0]);\n                int8x16_t _pA23 = vreinterpretq_s8_s32(_pAA.val[1]);\n                int8x16_t _pB01 = vcombine_s8(vget_low_s8(_pB), vget_low_s8(_pB));\n                int8x16_t _pB23 = vcombine_s8(vget_high_s8(_pB), vget_high_s8(_pB));\n\n                _sum = vdotq_s32(_sum, _pA01, _pB01);\n                _sum = vdotq_s32(_sum, _pA23, _pB23);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                pA += 16;\n                pB += 16;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                int32x2x2_t _pAA = vzip_s32(vreinterpret_s32_s8(_pA), vreinterpret_s32_s8(_pA));\n                int8x16_t _pA01 = vreinterpretq_s8_s32(vcombine_s32(_pAA.val[0], _pAA.val[1]));\n\n                int8x16_t _pB01 = vcombine_s8(_pB, _pB);\n\n                _sum = vdotq_s32(_sum, _pA01, _pB01);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4x2_t _pA01 = vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA));\n                int32x2x2_t _pB01 = vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB));\n\n                int16x8_t _s0 = vmull_s8(vreinterpret_s8_s16(_pA01.val[0]), vreinterpret_s8_s32(_pB01.val[0]));\n                _s0 = 
vmlal_s8(_s0, vreinterpret_s8_s16(_pA01.val[1]), vreinterpret_s8_s32(_pB01.val[1]));\n                _sum = vpadalq_s16(_sum, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 8;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _pA = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA)).val[0]);\n                _pB = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB)).val[0]);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum = vpadalq_s16(_sum, _s0);\n\n                pA += 4;\n                pB += 4;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x8_t _pB = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(pB)), 0));\n\n                _pA = vzip_s8(_pA, _pA).val[0];\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum = vaddw_s16(_sum, vget_low_s16(_s0));\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_s32(outptr0, vget_low_s32(_sum));\n                    vst1_s32(outptr0 + out_hstep, vget_high_s32(_sum));\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum);\n            }\n\n            outptr += 4;\n#else // __ARM_NEON\n            int sum00;\n            int sum10;\n            int sum01;\n            int sum11;\n\n            if (k == 0)\n            {\n                sum00 = 0;\n                sum10 = 0;\n                sum01 = 0;\n                sum11 = 0;\n            }\n            else\n    
        {\n                sum00 = outptr[0];\n                sum10 = outptr[1];\n                sum01 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // -fomit-frame-pointer is implied by the optimization flags and spares one register\n                // let us stay away from error: ‘asm’ operand has impossible constraints   --- nihui\n#if __OPTIMIZE__\n                asm volatile(\n                    \"ldr    r2, [%0], #4    \\n\" // int8x4_t _pA = *((int8x4_t*)pA); pA += 4;\n                    \"ldr    r4, [%1], #4    \\n\" // int8x4_t _pB = *((int8x4_t*)pB); pB += 4;\n                    \"ror    r3, r2, #8      \\n\" // int8x4_t _pA_r8 = __ror(_pA, 8);\n                    \"ror    r5, r4, #8      \\n\" // int8x4_t _pB_r8 = __ror(_pB, 8);\n                    \"sxtb16 r2, r2          \\n\" // int16x2_t _pA0 = __sxtb16(_pA);\n                    \"sxtb16 r4, r4          \\n\" // int16x2_t _pB0 = __sxtb16(_pB);\n                    \"sxtb16 r3, r3          \\n\" // int16x2_t _pA1 = __sxtb16(_pA_r8);\n                    \"sxtb16 r5, r5          \\n\" // int16x2_t _pB1 = __sxtb16(_pB_r8);\n                    \"smlad  %2, r2, r4, %2  \\n\" // sum00 = __smlad(_pA0, _pB0, sum00);\n                    \"smlad  %3, r3, r4, %3  \\n\" // sum10 = __smlad(_pA1, _pB0, sum10);\n                    \"smlad  %4, r2, r5, %4  \\n\" // sum01 = __smlad(_pA0, _pB1, sum01);\n                    \"smlad  %5, r3, r5, %5  \\n\" // sum11 = __smlad(_pA1, _pB1, sum11);\n                    : \"=r\"(pA),\n                    \"=r\"(pB),\n                    \"=r\"(sum00),\n                    \"=r\"(sum10),\n                    \"=r\"(sum01),\n                    \"=r\"(sum11)\n                    : \"0\"(pA),\n                    \"1\"(pB),\n                    \"2\"(sum00),\n               
     \"3\"(sum10),\n                    \"4\"(sum01),\n                    \"5\"(sum11)\n                    : \"memory\", \"r2\", \"r3\", \"r4\", \"r5\");\n#else\n                int _pA0 = *((int*)pA);\n                int _pB0 = *((int*)pB);\n                int _pA1;\n                int _pB1;\n                asm volatile(\"ror %0, %1, #8\"\n                             : \"=r\"(_pA1)\n                             : \"r\"(_pA0)\n                             :);\n                asm volatile(\"ror %0, %1, #8\"\n                             : \"=r\"(_pB1)\n                             : \"r\"(_pB0)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pA0)\n                             : \"0\"(_pA0)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pA1)\n                             : \"0\"(_pA1)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pB0)\n                             : \"0\"(_pB0)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pB1)\n                             : \"0\"(_pB1)\n                             :);\n                asm volatile(\"smlad %0, %2, %3, %0\"\n                             : \"=r\"(sum00)\n                             : \"0\"(sum00), \"r\"(_pA0), \"r\"(_pB0)\n                             :);\n                asm volatile(\"smlad %0, %2, %3, %0\"\n                             : \"=r\"(sum10)\n                             : \"0\"(sum10), \"r\"(_pA1), \"r\"(_pB0)\n                             :);\n                asm volatile(\"smlad %0, %2, %3, %0\"\n                             : \"=r\"(sum01)\n                             : \"0\"(sum01), \"r\"(_pA0), \"r\"(_pB1)\n                             :);\n                asm volatile(\"smlad %0, %2, %3, 
%0\"\n                             : \"=r\"(sum11)\n                             : \"0\"(sum11), \"r\"(_pA1), \"r\"(_pB1)\n                             :);\n                pA += 4;\n                pB += 4;\n#endif\n            }\n#endif // __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n            for (; kk < max_kk; kk += 1)\n            {\n                sum00 += pA[0] * pB[0];\n                sum10 += pA[1] * pB[0];\n                sum01 += pA[0] * pB[1];\n                sum11 += pA[1] * pB[1];\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum00;\n                    outptr0[1] = sum01;\n                    outptr0[out_hstep] = sum10;\n                    outptr0[out_hstep + 1] = sum11;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum00;\n                outptr[1] = sum10;\n                outptr[2] = sum01;\n                outptr[3] = sum11;\n            }\n\n            outptr += 4;\n#endif // __ARM_NEON\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n#if __ARM_NEON\n            int32x2_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdup_n_s32(0);\n            }\n            else\n            {\n                _sum = vld1_s32(outptr);\n            }\n#else  // __ARM_NEON\n            int sum0;\n            int sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0;\n                sum1 = 0;\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n#endif // __ARM_NEON\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n      
          for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x8_t _pB = vld1_s8(pB);\n\n                    int8x16_t _pBB = vcombine_s8(_pB, _pB);\n\n                    _sum0 = vdotq_s32(_sum0, _pA, _pBB);\n\n                    pA += 16;\n                    pB += 8;\n                }\n                int32x2_t _ss = vpadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                _sum = vadd_s32(_sum, _ss);\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum = vdot_lane_s32(_sum, vget_low_s8(_pA), _pB, 0);\n                _sum = vdot_lane_s32(_sum, vget_high_s8(_pA), _pB, 1);\n\n                pA += 16;\n                pB += 8;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                _sum = vdot_s32(_sum, _pA, _pB);\n\n                pA += 8;\n                pB += 4;\n            }\n#else  // __ARM_FEATURE_DOTPROD\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x8_t _pB = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pB)), 0));\n\n                    _pB = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pB), vreinterpret_s16_s8(_pB)).val[0]);\n\n                    int16x8_t _s0 = vmull_s8(_pA, _pB);\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n\n                    pA += 8;\n                    pB += 4;\n                }\n                int32x2_t _ss = vadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n        
        _sum = vadd_s32(_sum, _ss);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            int sum0 = vget_lane_s32(_sum, 0);\n            int sum1 = vget_lane_s32(_sum, 1);\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                sum0 += pA[0] * pB[0];\n                sum0 += pA[1] * pB[1];\n                sum1 += pA[2] * pB[0];\n                sum1 += pA[3] * pB[1];\n                pA += 4;\n                pB += 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[1] * pB[0];\n                pA += 2;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[out_hstep] = sum1;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        int* outptr0 = (int*)top_blob + (i + ii) * out_hstep + j;\n\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n            
    int32x4_t _sum10 = vdupq_n_s32(0);\n                int32x4_t _sum11 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x16_t _pAA = vcombine_s8(_pA, _pA);\n                    _sum00 = vdotq_s32(_sum00, _pAA, _pB0);\n                    _sum01 = vdotq_s32(_sum01, _pAA, _pB1);\n                    _sum10 = vdotq_s32(_sum10, _pAA, _pB2);\n                    _sum11 = vdotq_s32(_sum11, _pAA, _pB3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pB0, _pA, 0);\n                    _sum1 = vdotq_lane_s32(_sum1, _pB1, _pA, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pB2, _pA, 1);\n                    _sum1 = vdotq_lane_s32(_sum1, _pB3, _pA, 1);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 8;\n                    pB += 64;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_sum00, _sum01));\n                _sum1 = vaddq_s32(_sum1, vpaddq_s32(_sum10, _sum11));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pA)), 0));\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_DOTPROD\n                _sum0 = vdotq_lane_s32(_sum0, _pB0, _pA, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _pB1, _pA, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = 
vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                _s0 = vmlal_s8(_s0, _pA1, vget_low_s8(_pB1));\n                _s1 = vmlal_s8(_s1, _pA1, vget_high_s8(_pB1));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 32;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x16_t _pB = vld1q_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, vget_low_s8(_pB));\n                int16x8_t _s1 = vmull_s8(_pA, vget_high_s8(_pB));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n\n                pA += 2;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vld1_dup_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    vst1q_s32(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n#endif // 
__aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x16_t _pAA = vcombine_s8(_pA, _pA);\n                    _sum00 = vdotq_s32(_sum00, _pAA, _pB0);\n                    _sum01 = vdotq_s32(_sum01, _pAA, _pB1);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pB0, _pA, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pB1, _pA, 1);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 8;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_sum00, _sum01));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                _sum0 = vdotq_lane_s32(_sum0, _pB, _pA, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                int16x8_t _s0 = 
vmull_s8(_pA0, vget_low_s8(_pB));\n                _s0 = vmlal_s8(_s0, _pA1, vget_high_s8(_pB));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(pA)), 0));\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n\n                pA += 2;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vld1_dup_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pB)), 0));\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_s32(outptr0, _sum0);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_s32(outptr, _sum0);\n            }\n\n            outptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n#if __ARM_NEON\n            int32x2_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdup_n_s32(0);\n            }\n            else\n            {\n                _sum = vld1_s32(outptr);\n            }\n#else  // __ARM_NEON\n            int sum0;\n            int sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0;\n                sum1 = 0;\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n#endif // 
__ARM_NEON\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB = vld1q_s8(pB);\n\n                    int8x16_t _pAA = vcombine_s8(_pA, _pA);\n\n                    _sum0 = vdotq_s32(_sum0, _pAA, _pB);\n\n                    pA += 8;\n                    pB += 16;\n                }\n                int32x2_t _ss = vpadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                _sum = vadd_s32(_sum, _ss);\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                _sum = vdot_lane_s32(_sum, vget_low_s8(_pB), _pA, 0);\n                _sum = vdot_lane_s32(_sum, vget_high_s8(_pB), _pA, 1);\n\n                pA += 8;\n                pB += 16;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum = vdot_s32(_sum, _pA, _pB);\n\n                pA += 4;\n                pB += 8;\n            }\n#else  // __ARM_FEATURE_DOTPROD\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pA)), 0));\n                    int8x8_t _pB = vld1_s8(pB);\n\n                    _pA = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA)).val[0]);\n\n                    int16x8_t _s0 = vmull_s8(_pA, _pB);\n  
                  _sum0 = vpadalq_s16(_sum0, _s0);\n\n                    pA += 4;\n                    pB += 8;\n                }\n                int32x2_t _ss = vadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                _sum = vadd_s32(_sum, _ss);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            int sum0 = vget_lane_s32(_sum, 0);\n            int sum1 = vget_lane_s32(_sum, 1);\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                sum0 += pA[0] * pB[0];\n                sum0 += pA[1] * pB[1];\n                sum1 += pA[0] * pB[2];\n                sum1 += pA[1] * pB[3];\n                pA += 2;\n                pB += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[0] * pB[1];\n                pA += 1;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[1] = sum1;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            int sum;\n\n            if (k == 0)\n            {\n                sum = 0;\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_NEON\n            int32x4_t _sum = vdupq_n_s32(0);\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                _sum = vdotq_s32(_sum, _pA, _pB);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _s0 = 
vmull_s8(vget_low_s8(_pA), vget_low_s8(_pB));\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA), vget_high_s8(_pB));\n                _sum = vpadalq_s16(_sum, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum = vpadalq_s16(_sum, _s0);\n\n                pA += 8;\n                pB += 8;\n            }\n#if __aarch64__\n            sum += vaddvq_s32(_sum);\n#else\n            int32x2_t _ss = vadd_s32(vget_low_s32(_sum), vget_high_s32(_sum));\n            _ss = vpadd_s32(_ss, _ss);\n            sum += vget_lane_s32(_ss, 0);\n#endif\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk += 1)\n            {\n                sum += pA[0] * pB[0];\n                pA += 1;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n\nstatic void convolution_im2col_gemm_get_optimal_tile_mnk_int8(int M, int N, int K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const size_t l2_cache_size_int8 = (int)(get_cpu_level2_cache_size() / sizeof(signed char));\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    // solve K\n    {\n        // try not to split K\n#if __ARM_NEON\n        int tile_size = (l2_cache_size_int8 - 16) / 8;\n#else\n        int tile_size = (l2_cache_size_int8 - 2) / 3;\n#endif\n\n#if __ARM_NEON\n        TILE_K = std::max(8, tile_size / 8 * 
8);\n#else\n        TILE_K = std::max(2, tile_size / 2 * 2);\n#endif\n\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n#if __ARM_NEON\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n#else\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 1) / 2 * 2);\n#endif\n    }\n\n    // solve M\n    {\n#if __ARM_NEON\n        int nn_M = (M + 31) / 32;\n#else\n        int nn_M = (M + 7) / 8;\n#endif\n\n#if __ARM_NEON\n        TILE_M = std::max(8, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#else\n        TILE_M = std::max(2, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n    }\n\n    {\n        TILE_M *= std::min(nT, get_physical_cpu_count());\n\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n#if __ARM_NEON\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#else\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n\n        if (nT > 1)\n        {\n#if __ARM_NEON\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n#else\n            TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 1) / 2 * 2);\n#endif\n        }\n    }\n\n    if (N > 0)\n    {\n        int tile_size;\n        if (TILE_K >= K)\n        {\n            tile_size = (l2_cache_size_int8 - TILE_M * TILE_K) / TILE_K;\n        }\n        else\n        {\n            tile_size = (l2_cache_size_int8 - TILE_M * TILE_K) / (TILE_M * 4 + TILE_K);\n        }\n\n#if __ARM_NEON\n        TILE_N = std::max(4, tile_size / 4 * 4);\n#else\n        TILE_N = std::max(1, tile_size);\n#endif\n\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n#if __ARM_NEON\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#else\n        TILE_N = std::min(TILE_N, (N + nn_N - 1) / nn_N);\n#endif\n    }\n}\n\nstatic void convolution_im2col_input_tile_conv1x1s1d1_int8(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk)\n{\n    const int elempack = bottom_blob.elempack;\n\n    signed char* pp = B;\n\n    
int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        if (elempack == 8)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n            const size_t cstep = bottom_blob.cstep * 8;\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], %4 \\n\"\n                    \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // NCNN_GNU_INLINE_ASM\n                int8x16_t _r01 = vld1q_s8(p0);\n                int8x16_t _r23 = vld1q_s8(p0 + 16);\n                int8x16_t _r45 = vld1q_s8(p0 + 32);\n                int8x16_t _r67 = vld1q_s8(p0 + 48);\n                vst1q_s8(pp, _r01);\n                vst1q_s8(pp + 16, _r23);\n                vst1q_s8(pp + 32, _r45);\n                vst1q_s8(pp + 48, _r67);\n                pp += 64;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], %4 \\n\"\n                    \"uzp1   v4.4s, v0.4s, v1.4s         \\n\"\n                    \"uzp2   v6.4s, v0.4s, v1.4s         \\n\"\n                    \"uzp1   v5.4s, v2.4s, v3.4s         \\n\"\n                    \"uzp2   v7.4s, v2.4s, v3.4s         \\n\"\n                    \"st1    
{v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // NCNN_GNU_INLINE_ASM\n                int32x4x2_t _r0246 = vld2q_s32((const int*)p0);\n                int32x4x2_t _r1357 = vld2q_s32((const int*)(p0 + 32));\n                vst1q_s32((int*)pp, _r0246.val[0]);\n                vst1q_s32((int*)(pp + 16), _r1357.val[0]);\n                vst1q_s32((int*)(pp + 32), _r0246.val[1]);\n                vst1q_s32((int*)(pp + 48), _r1357.val[1]);\n                pp += 64;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]       \\n\"\n                    \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], %4 \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // NCNN_GNU_INLINE_ASM\n                int16x8x4_t _r0 = vld4q_s16((const short*)p0);\n                vst1q_s16((short*)pp, _r0.val[0]);\n                vst1q_s16((short*)(pp + 16), _r0.val[1]);\n                vst1q_s16((short*)(pp + 32), _r0.val[2]);\n                vst1q_s16((short*)(pp + 48), _r0.val[3]);\n                pp += 64;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        }\n\n        if (elempack == 
1)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k) + (j + jj);\n            const size_t cstep = bottom_blob.cstep;\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v0.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v1.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v0.d}[1], [%0], %4         \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v1.d}[1], [%0], %4         \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v2.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v3.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v2.d}[1], [%0], %4         \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v3.d}[1], [%0], %4         \\n\"\n                    \"zip1   v4.16b, v0.16b, v1.16b      \\n\"\n                    \"zip2   v5.16b, v0.16b, v1.16b      \\n\"\n                    \"zip1   v6.16b, v2.16b, v3.16b      \\n\"\n                    \"zip2   v7.16b, v2.16b, v3.16b      \\n\"\n                    \"st4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", 
\"v7\");\n#else  // NCNN_GNU_INLINE_ASM\n                int8x8_t _r0 = vld1_s8(p0);\n                int8x8_t _r1 = vld1_s8(p0 + cstep);\n                int8x8_t _r2 = vld1_s8(p0 + cstep * 2);\n                int8x8_t _r3 = vld1_s8(p0 + cstep * 3);\n                int8x8_t _r4 = vld1_s8(p0 + cstep * 4);\n                int8x8_t _r5 = vld1_s8(p0 + cstep * 5);\n                int8x8_t _r6 = vld1_s8(p0 + cstep * 6);\n                int8x8_t _r7 = vld1_s8(p0 + cstep * 7);\n                // save as transpose8x8\n                int8x8x2_t _r01 = vzip_s8(_r0, _r1);\n                int8x8x2_t _r23 = vzip_s8(_r2, _r3);\n                int8x8x2_t _r45 = vzip_s8(_r4, _r5);\n                int8x8x2_t _r67 = vzip_s8(_r6, _r7);\n                int16x8x4_t _r0246;\n                _r0246.val[0] = vreinterpretq_s16_s8(vcombine_s8(_r01.val[0], _r01.val[1]));\n                _r0246.val[1] = vreinterpretq_s16_s8(vcombine_s8(_r23.val[0], _r23.val[1]));\n                _r0246.val[2] = vreinterpretq_s16_s8(vcombine_s8(_r45.val[0], _r45.val[1]));\n                _r0246.val[3] = vreinterpretq_s16_s8(vcombine_s8(_r67.val[0], _r67.val[1]));\n                vst4q_s16((short*)pp, _r0246);\n                pp += 64;\n                p0 += cstep * 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v0.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v1.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v2.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v3.8b}, [%0], %4           \\n\"\n                    \"st4   
 {v0.8b, v1.8b, v2.8b, v3.8b}, [%1], #32 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // NCNN_GNU_INLINE_ASM\n                int8x8x4_t _r0123;\n                _r0123.val[0] = vld1_s8(p0);\n                _r0123.val[1] = vld1_s8(p0 + cstep);\n                _r0123.val[2] = vld1_s8(p0 + cstep * 2);\n                _r0123.val[3] = vld1_s8(p0 + cstep * 3);\n                vst4_s8(pp, _r0123);\n                pp += 32;\n                p0 += cstep * 4;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v0.8b}, [%0], %4           \\n\"\n                    \"prfm   pldl1keep, [%0, #64]        \\n\"\n                    \"ld1    {v1.8b}, [%0], %4           \\n\"\n                    \"st2    {v0.8b, v1.8b}, [%1], #16   \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\");\n#else  // NCNN_GNU_INLINE_ASM\n                int8x8x2_t _r01;\n                _r01.val[0] = vld1_s8(p0);\n                _r01.val[1] = vld1_s8(p0 + cstep);\n                vst2_s8(pp, _r01);\n                pp += 16;\n                p0 += cstep * 2;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; kk < max_kk; kk++)\n            {\n                vst1_s8(pp, vld1_s8(p0));\n                pp += 8;\n                p0 += cstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        if 
(elempack == 8)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n            const size_t cstep = bottom_blob.cstep * 8;\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]       \\n\"\n                    \"ld1    {v0.16b, v1.16b}, [%0], %4  \\n\"\n                    \"st1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\");\n#else  // NCNN_GNU_INLINE_ASM\n                int8x16_t _r01 = vld1q_s8(p0);\n                int8x16_t _r23 = vld1q_s8(p0 + 16);\n                vst1q_s8(pp, _r01);\n                vst1q_s8(pp + 16, _r23);\n                pp += 32;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]       \\n\"\n                    \"ld1    {v0.8b, v1.8b, v2.8b, v3.8b}, [%0], %4 \\n\"\n                    \"st4    {v0.2s, v1.2s, v2.2s, v3.2s}, [%1], #32 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // NCNN_GNU_INLINE_ASM\n                int32x2x4_t _r0123;\n                _r0123.val[0] = vreinterpret_s32_s8(vld1_s8(p0));\n                _r0123.val[1] = vreinterpret_s32_s8(vld1_s8(p0 + 8));\n                _r0123.val[2] = vreinterpret_s32_s8(vld1_s8(p0 + 16));\n                
_r0123.val[3] = vreinterpret_s32_s8(vld1_s8(p0 + 24));\n                vst4_s32((int*)pp, _r0123);\n                pp += 32;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]       \\n\"\n                    \"ld1    {v0.8b, v1.8b, v2.8b, v3.8b}, [%0], %4 \\n\"\n                    \"st4    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                asm volatile(\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld1.s8    {d0-d3}, [%0], %4   \\n\"\n                    \"vst4.s16   {d0-d3}, [%1 :64]!  
\\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"q0\", \"q1\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int16x4x4_t _r0123;\n                _r0123.val[0] = vreinterpret_s16_s8(vld1_s8(p0));\n                _r0123.val[1] = vreinterpret_s16_s8(vld1_s8(p0 + 8));\n                _r0123.val[2] = vreinterpret_s16_s8(vld1_s8(p0 + 16));\n                _r0123.val[3] = vreinterpret_s16_s8(vld1_s8(p0 + 24));\n                vst4_s16((short*)pp, _r0123);\n                pp += 32;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        }\n\n        if (elempack == 1)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k) + (j + jj);\n            const size_t cstep = bottom_blob.cstep;\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[cstep + 0];\n                pp[2] = p0[cstep * 2 + 0];\n                pp[3] = p0[cstep * 3 + 0];\n                pp[4] = p0[cstep * 4 + 0];\n                pp[5] = p0[cstep * 5 + 0];\n                pp[6] = p0[cstep * 6 + 0];\n                pp[7] = p0[cstep * 7 + 0];\n                pp[8] = p0[1];\n                pp[9] = p0[cstep + 1];\n                pp[10] = p0[cstep * 2 + 1];\n                pp[11] = p0[cstep * 3 + 1];\n                pp[12] = p0[cstep * 4 + 1];\n                pp[13] = p0[cstep * 5 + 1];\n                pp[14] = p0[cstep * 6 + 1];\n                pp[15] = p0[cstep * 7 + 1];\n                pp[16] = p0[2];\n                pp[17] = p0[cstep + 2];\n                pp[18] = p0[cstep * 2 + 2];\n                pp[19] = p0[cstep * 
3 + 2];\n                pp[20] = p0[cstep * 4 + 2];\n                pp[21] = p0[cstep * 5 + 2];\n                pp[22] = p0[cstep * 6 + 2];\n                pp[23] = p0[cstep * 7 + 2];\n                pp[24] = p0[3];\n                pp[25] = p0[cstep + 3];\n                pp[26] = p0[cstep * 2 + 3];\n                pp[27] = p0[cstep * 3 + 3];\n                pp[28] = p0[cstep * 4 + 3];\n                pp[29] = p0[cstep * 5 + 3];\n                pp[30] = p0[cstep * 6 + 3];\n                pp[31] = p0[cstep * 7 + 3];\n                pp += 32;\n                p0 += cstep * 8;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[cstep + 0];\n                pp[2] = p0[cstep * 2 + 0];\n                pp[3] = p0[cstep * 3 + 0];\n                pp[4] = p0[1];\n                pp[5] = p0[cstep + 1];\n                pp[6] = p0[cstep * 2 + 1];\n                pp[7] = p0[cstep * 3 + 1];\n                pp[8] = p0[2];\n                pp[9] = p0[cstep + 2];\n                pp[10] = p0[cstep * 2 + 2];\n                pp[11] = p0[cstep * 3 + 2];\n                pp[12] = p0[3];\n                pp[13] = p0[cstep + 3];\n                pp[14] = p0[cstep * 2 + 3];\n                pp[15] = p0[cstep * 3 + 3];\n                pp += 16;\n                p0 += cstep * 4;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[cstep + 0];\n                pp[2] = p0[1];\n                pp[3] = p0[cstep + 1];\n                pp[4] = p0[2];\n                pp[5] = p0[cstep + 2];\n                pp[6] = p0[3];\n                pp[7] = p0[cstep + 3];\n                pp += 8;\n                p0 += cstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = 
p0[1];\n                pp[2] = p0[2];\n                pp[3] = p0[3];\n                pp += 4;\n                p0 += cstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n#if __ARM_NEON\n        if (elempack == 8)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n            const size_t cstep = bottom_blob.cstep * 8;\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #128]       \\n\"\n                    \"ld1    {v0.16b}, [%0], %4          \\n\"\n                    \"st1    {v0.16b}, [%1], #16         \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\");\n#else  // NCNN_GNU_INLINE_ASM\n                vst1q_s8(pp, vld1q_s8(p0));\n                pp += 16;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #128]       \\n\"\n                    \"ld1    {v0.8b, v1.8b}, [%0], %4    \\n\"\n                    \"st2    {v0.2s, v1.2s}, [%1], #16   \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\");\n#else  // NCNN_GNU_INLINE_ASM\n                int32x2x2_t _r01;\n                _r01.val[0] = vreinterpret_s32_s8(vld1_s8(p0));\n                _r01.val[1] = vreinterpret_s32_s8(vld1_s8(p0 + 8));\n                
vst2_s32((int*)pp, _r01);\n                pp += 16;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk < max_kk / 8; kk++)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #128]       \\n\"\n                    \"ld1    {v0.8b, v1.8b}, [%0], %4    \\n\"\n                    \"st2    {v0.4h, v1.4h}, [%1], #16   \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"v0\", \"v1\");\n#else\n                asm volatile(\n                    \"pld        [%0, #128]          \\n\"\n                    \"vld1.s8    {d0-d1}, [%0], %4   \\n\"\n                    \"vst2.s16   {d0-d1}, [%1 :64]!  \\n\"\n                    : \"=r\"(p0), // %0\n                    \"=r\"(pp)  // %1\n                    : \"0\"(p0),\n                    \"1\"(pp),\n                    \"r\"(cstep)\n                    : \"memory\", \"q0\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                int16x4x2_t _r01;\n                _r01.val[0] = vreinterpret_s16_s8(vld1_s8(p0));\n                _r01.val[1] = vreinterpret_s16_s8(vld1_s8(p0 + 8));\n                vst2_s16((short*)pp, _r01);\n                pp += 16;\n                p0 += cstep;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k) + (j + jj);\n            const size_t cstep = bottom_blob.cstep;\n\n            int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 7 < max_kk; kk += 8)\n            
{\n                pp[0] = p0[0];\n                pp[1] = p0[cstep];\n                pp[2] = p0[cstep * 2];\n                pp[3] = p0[cstep * 3];\n                pp[4] = p0[cstep * 4];\n                pp[5] = p0[cstep * 5];\n                pp[6] = p0[cstep * 6];\n                pp[7] = p0[cstep * 7];\n                pp[8] = p0[1];\n                pp[9] = p0[cstep + 1];\n                pp[10] = p0[cstep * 2 + 1];\n                pp[11] = p0[cstep * 3 + 1];\n                pp[12] = p0[cstep * 4 + 1];\n                pp[13] = p0[cstep * 5 + 1];\n                pp[14] = p0[cstep * 6 + 1];\n                pp[15] = p0[cstep * 7 + 1];\n                pp += 16;\n                p0 += cstep * 8;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[cstep];\n                pp[2] = p0[cstep * 2];\n                pp[3] = p0[cstep * 3];\n                pp[4] = p0[1];\n                pp[5] = p0[cstep + 1];\n                pp[6] = p0[cstep * 2 + 1];\n                pp[7] = p0[cstep * 3 + 1];\n                pp += 8;\n                p0 += cstep * 4;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[cstep];\n                pp[2] = p0[1];\n                pp[3] = p0[cstep + 1];\n                pp += 4;\n                p0 += cstep * 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                pp += 2;\n                p0 += cstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj++)\n    {\n#if __ARM_NEON\n        if (elempack == 8)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k / 8) + (j + jj) * 8;\n            const size_t cstep = bottom_blob.cstep * 
8;\n\n            int kk = 0;\n            for (; kk < max_kk / 8; kk++)\n            {\n                vst1_s8(pp, vld1_s8(p0));\n                pp += 8;\n                p0 += cstep;\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            const signed char* p0 = (const signed char*)bottom_blob.channel(k) + (j + jj);\n            const size_t cstep = bottom_blob.cstep;\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0 += cstep;\n            }\n        }\n    }\n}\n\nstatic inline void convolution_im2col_input_tile_int8_impl(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h)\n{\n    const int w = bottom_blob.w;\n    // const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n\n    // j max_jj     outw*outh    split w and h\n\n    // k max_kk     pa*maxk*(inch/pa)    split inch\n\n    // k/max_kk shall be multiple of maxk\n\n    const int maxk = kernel_w * kernel_h;\n\n    signed char* pp = B;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        int dy0 = (j + jj) / outw * stride_h;\n        int dy1 = (j + jj + 1) / outw * stride_h;\n        int dy2 = (j + jj + 2) / outw * stride_h;\n        int dy3 = (j + jj + 3) / outw * stride_h;\n        int dy4 = (j + jj + 4) / outw * stride_h;\n        int dy5 = (j + jj + 5) / outw * stride_h;\n        int dy6 = (j + jj + 6) / outw * stride_h;\n        int dy7 = (j + jj + 7) / outw * stride_h;\n        int dx0 = (j + jj) % outw * stride_w;\n        int dx1 = (j + jj + 1) % outw * stride_w;\n        int dx2 = (j + jj + 2) % outw * stride_w;\n        int dx3 = (j + 
jj + 3) % outw * stride_w;\n        int dx4 = (j + jj + 4) % outw * stride_w;\n        int dx5 = (j + jj + 5) % outw * stride_w;\n        int dx6 = (j + jj + 6) % outw * stride_w;\n        int dx7 = (j + jj + 7) % outw * stride_w;\n\n        if (dy0 == dy7)\n        {\n            int kk = 0;\n            if (elempack == 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int p4 = (k + kk + 4) / maxk;\n                    int p5 = (k + kk + 5) / maxk;\n                    int p6 = (k + kk + 6) / maxk;\n                    int p7 = (k + kk + 7) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int uv4 = (k + kk + 4) % maxk;\n                    int uv5 = (k + kk + 5) % maxk;\n                    int uv6 = (k + kk + 6) % maxk;\n                    int uv7 = (k + kk + 7) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int u4 = uv4 / kernel_w;\n                    int u5 = uv5 / kernel_w;\n                    int u6 = uv6 / kernel_w;\n                    int u7 = uv7 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n                    int v4 = uv4 % kernel_w;\n                    int v5 = uv5 % kernel_w;\n                    int v6 = uv6 % kernel_w;\n                    int 
v7 = uv7 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n                    const Mat img4 = bottom_blob.channel(p4);\n                    const Mat img5 = bottom_blob.channel(p5);\n                    const Mat img6 = bottom_blob.channel(p6);\n                    const Mat img7 = bottom_blob.channel(p7);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n\n                    int x40 = dx0 + dilation_w * v4;\n                    int y40 = dy0 + dilation_h * u4;\n\n                    int x50 = dx0 + dilation_w * v5;\n                    int y50 = dy0 + dilation_h * u5;\n\n                    int x60 = dx0 + dilation_w * v6;\n                    int y60 = dy0 + dilation_h * u6;\n\n                    int x70 = dx0 + dilation_w * v7;\n                    int y70 = dy0 + dilation_h * u7;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr2 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr3 = img3.row<const signed char>(y30) + x30;\n                    const signed char* sptr4 = img4.row<const signed char>(y40) + x40;\n                    const signed char* sptr5 = img5.row<const signed char>(y50) + x50;\n                    const signed char* sptr6 = img6.row<const 
signed char>(y60) + x60;\n                    const signed char* sptr7 = img7.row<const signed char>(y70) + x70;\n\n                    if (stride_w == 1)\n                    {\n                        int8x8_t _r0 = vld1_s8(sptr0);\n                        int8x8_t _r1 = vld1_s8(sptr1);\n                        int8x8_t _r2 = vld1_s8(sptr2);\n                        int8x8_t _r3 = vld1_s8(sptr3);\n                        int8x8_t _r4 = vld1_s8(sptr4);\n                        int8x8_t _r5 = vld1_s8(sptr5);\n                        int8x8_t _r6 = vld1_s8(sptr6);\n                        int8x8_t _r7 = vld1_s8(sptr7);\n                        // save as transpose8x8\n                        int8x8x2_t _r01 = vzip_s8(_r0, _r1);\n                        int8x8x2_t _r23 = vzip_s8(_r2, _r3);\n                        int8x8x2_t _r45 = vzip_s8(_r4, _r5);\n                        int8x8x2_t _r67 = vzip_s8(_r6, _r7);\n                        int16x8x4_t _r0246;\n                        _r0246.val[0] = vreinterpretq_s16_s8(vcombine_s8(_r01.val[0], _r01.val[1]));\n                        _r0246.val[1] = vreinterpretq_s16_s8(vcombine_s8(_r23.val[0], _r23.val[1]));\n                        _r0246.val[2] = vreinterpretq_s16_s8(vcombine_s8(_r45.val[0], _r45.val[1]));\n                        _r0246.val[3] = vreinterpretq_s16_s8(vcombine_s8(_r67.val[0], _r67.val[1]));\n                        vst4q_s16((short*)pp, _r0246);\n                        pp += 64;\n                    }\n                    else if (stride_w == 2)\n                    {\n                        int8x16_t _r0 = vld1q_s8(sptr0);\n                        int8x16_t _r1 = vld1q_s8(sptr1);\n                        int8x16_t _r2 = vld1q_s8(sptr2);\n                        int8x16_t _r3 = vld1q_s8(sptr3);\n                        int8x16_t _r4 = vld1q_s8(sptr4);\n                        int8x16_t _r5 = vld1q_s8(sptr5);\n                        int8x16_t _r6 = vld1q_s8(sptr6);\n                        int8x16_t 
_r7 = vld1q_s8(sptr7);\n                        int8x16_t _r01 = vtrnq_s8(_r0, _r1).val[0];\n                        int8x16_t _r23 = vtrnq_s8(_r2, _r3).val[0];\n                        int8x16_t _r45 = vtrnq_s8(_r4, _r5).val[0];\n                        int8x16_t _r67 = vtrnq_s8(_r6, _r7).val[0];\n                        int16x8x4_t _r0123;\n                        _r0123.val[0] = vreinterpretq_s16_s8(_r01);\n                        _r0123.val[1] = vreinterpretq_s16_s8(_r23);\n                        _r0123.val[2] = vreinterpretq_s16_s8(_r45);\n                        _r0123.val[3] = vreinterpretq_s16_s8(_r67);\n                        vst4q_s16((short*)pp, _r0123);\n                        pp += 64;\n                    }\n                    else\n                    {\n                        pp[0] = sptr0[0];\n                        pp[1] = sptr1[0];\n                        pp[2] = sptr2[0];\n                        pp[3] = sptr3[0];\n                        pp[4] = sptr4[0];\n                        pp[5] = sptr5[0];\n                        pp[6] = sptr6[0];\n                        pp[7] = sptr7[0];\n                        pp[8] = sptr0[stride_w];\n                        pp[9] = sptr1[stride_w];\n                        pp[10] = sptr2[stride_w];\n                        pp[11] = sptr3[stride_w];\n                        pp[12] = sptr4[stride_w];\n                        pp[13] = sptr5[stride_w];\n                        pp[14] = sptr6[stride_w];\n                        pp[15] = sptr7[stride_w];\n                        pp[16] = sptr0[stride_w * 2];\n                        pp[17] = sptr1[stride_w * 2];\n                        pp[18] = sptr2[stride_w * 2];\n                        pp[19] = sptr3[stride_w * 2];\n                        pp[20] = sptr4[stride_w * 2];\n                        pp[21] = sptr5[stride_w * 2];\n                        pp[22] = sptr6[stride_w * 2];\n                        pp[23] = sptr7[stride_w * 2];\n                        
pp[24] = sptr0[stride_w * 3];\n                        pp[25] = sptr1[stride_w * 3];\n                        pp[26] = sptr2[stride_w * 3];\n                        pp[27] = sptr3[stride_w * 3];\n                        pp[28] = sptr4[stride_w * 3];\n                        pp[29] = sptr5[stride_w * 3];\n                        pp[30] = sptr6[stride_w * 3];\n                        pp[31] = sptr7[stride_w * 3];\n                        pp[32] = sptr0[stride_w * 4];\n                        pp[33] = sptr1[stride_w * 4];\n                        pp[34] = sptr2[stride_w * 4];\n                        pp[35] = sptr3[stride_w * 4];\n                        pp[36] = sptr4[stride_w * 4];\n                        pp[37] = sptr5[stride_w * 4];\n                        pp[38] = sptr6[stride_w * 4];\n                        pp[39] = sptr7[stride_w * 4];\n                        pp[40] = sptr0[stride_w * 5];\n                        pp[41] = sptr1[stride_w * 5];\n                        pp[42] = sptr2[stride_w * 5];\n                        pp[43] = sptr3[stride_w * 5];\n                        pp[44] = sptr4[stride_w * 5];\n                        pp[45] = sptr5[stride_w * 5];\n                        pp[46] = sptr6[stride_w * 5];\n                        pp[47] = sptr7[stride_w * 5];\n                        pp[48] = sptr0[stride_w * 6];\n                        pp[49] = sptr1[stride_w * 6];\n                        pp[50] = sptr2[stride_w * 6];\n                        pp[51] = sptr3[stride_w * 6];\n                        pp[52] = sptr4[stride_w * 6];\n                        pp[53] = sptr5[stride_w * 6];\n                        pp[54] = sptr6[stride_w * 6];\n                        pp[55] = sptr7[stride_w * 6];\n                        pp[56] = sptr0[stride_w * 7];\n                        pp[57] = sptr1[stride_w * 7];\n                        pp[58] = sptr2[stride_w * 7];\n                        pp[59] = sptr3[stride_w * 7];\n                        pp[60] = 
sptr4[stride_w * 7];\n                        pp[61] = sptr5[stride_w * 7];\n                        pp[62] = sptr6[stride_w * 7];\n                        pp[63] = sptr7[stride_w * 7];\n                        pp += 64;\n                    }\n                }\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed 
char>(y10) + x10;\n                    const signed char* sptr2 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr3 = img3.row<const signed char>(y30) + x30;\n\n                    if (stride_w == 1)\n                    {\n                        int8x8x4_t _r01;\n                        _r01.val[0] = vld1_s8(sptr0);\n                        _r01.val[1] = vld1_s8(sptr1);\n                        _r01.val[2] = vld1_s8(sptr2);\n                        _r01.val[3] = vld1_s8(sptr3);\n                        vst4_s8(pp, _r01);\n                        pp += 32;\n                    }\n                    else if (stride_w == 2)\n                    {\n                        int8x16_t _r0 = vld1q_s8(sptr0);\n                        int8x16_t _r1 = vld1q_s8(sptr1);\n                        int8x16_t _r2 = vld1q_s8(sptr2);\n                        int8x16_t _r3 = vld1q_s8(sptr3);\n                        int8x16_t _r01 = vtrnq_s8(_r0, _r1).val[0];\n                        int8x16_t _r23 = vtrnq_s8(_r2, _r3).val[0];\n                        int16x8x2_t _r0123;\n                        _r0123.val[0] = vreinterpretq_s16_s8(_r01);\n                        _r0123.val[1] = vreinterpretq_s16_s8(_r23);\n                        vst2q_s16((short*)pp, _r0123);\n                        pp += 32;\n                    }\n                    else\n                    {\n                        pp[0] = sptr0[0];\n                        pp[1] = sptr1[0];\n                        pp[2] = sptr2[0];\n                        pp[3] = sptr3[0];\n                        pp[4] = sptr0[stride_w];\n                        pp[5] = sptr1[stride_w];\n                        pp[6] = sptr2[stride_w];\n                        pp[7] = sptr3[stride_w];\n                        pp[8] = sptr0[stride_w * 2];\n                        pp[9] = sptr1[stride_w * 2];\n                        pp[10] = sptr2[stride_w * 2];\n                        pp[11] = sptr3[stride_w * 2];\n  
                      pp[12] = sptr0[stride_w * 3];\n                        pp[13] = sptr1[stride_w * 3];\n                        pp[14] = sptr2[stride_w * 3];\n                        pp[15] = sptr3[stride_w * 3];\n                        pp[16] = sptr0[stride_w * 4];\n                        pp[17] = sptr1[stride_w * 4];\n                        pp[18] = sptr2[stride_w * 4];\n                        pp[19] = sptr3[stride_w * 4];\n                        pp[20] = sptr0[stride_w * 5];\n                        pp[21] = sptr1[stride_w * 5];\n                        pp[22] = sptr2[stride_w * 5];\n                        pp[23] = sptr3[stride_w * 5];\n                        pp[24] = sptr0[stride_w * 6];\n                        pp[25] = sptr1[stride_w * 6];\n                        pp[26] = sptr2[stride_w * 6];\n                        pp[27] = sptr3[stride_w * 6];\n                        pp[28] = sptr0[stride_w * 7];\n                        pp[29] = sptr1[stride_w * 7];\n                        pp[30] = sptr2[stride_w * 7];\n                        pp[31] = sptr3[stride_w * 7];\n                        pp += 32;\n                    }\n                }\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h 
* u1;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n\n                    if (stride_w == 1)\n                    {\n                        int8x8x2_t _r01;\n                        _r01.val[0] = vld1_s8(sptr0);\n                        _r01.val[1] = vld1_s8(sptr1);\n                        vst2_s8(pp, _r01);\n                        pp += 16;\n                    }\n                    else if (stride_w == 2)\n                    {\n                        int8x16_t _r0 = vld1q_s8(sptr0);\n                        int8x16_t _r1 = vld1q_s8(sptr1);\n                        int8x16_t _r01 = vtrnq_s8(_r0, _r1).val[0];\n                        vst1q_s8(pp, _r01);\n                        pp += 16;\n                    }\n                    else\n                    {\n                        pp[0] = sptr0[0];\n                        pp[1] = sptr1[0];\n                        pp[2] = sptr0[stride_w];\n                        pp[3] = sptr1[stride_w];\n                        pp[4] = sptr0[stride_w * 2];\n                        pp[5] = sptr1[stride_w * 2];\n                        pp[6] = sptr0[stride_w * 3];\n                        pp[7] = sptr1[stride_w * 3];\n                        pp[8] = sptr0[stride_w * 4];\n                        pp[9] = sptr1[stride_w * 4];\n                        pp[10] = sptr0[stride_w * 5];\n                        pp[11] = sptr1[stride_w * 5];\n                        pp[12] = sptr0[stride_w * 6];\n                        pp[13] = sptr1[stride_w * 6];\n                        pp[14] = sptr0[stride_w * 7];\n                        pp[15] = sptr1[stride_w * 7];\n                        pp += 16;\n                    }\n                }\n            }\n            for (; kk < max_kk / elempack; kk++)\n            {\n                int p = (k / elempack + kk) / maxk;\n                int uv = (k / elempack + 
kk) % maxk;\n                int u = uv / kernel_w;\n                int v = uv % kernel_w;\n\n                const Mat img = bottom_blob.channel(p);\n\n                int x0 = dx0 + dilation_w * v;\n                int y0 = dy0 + dilation_h * u;\n\n                const signed char* sptr = img.row<const signed char>(y0) + x0 * elempack;\n\n                if (elempack == 8)\n                {\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x8_t _r0 = vld1_s8(sptr);\n                    int8x8_t _r1 = vld1_s8(sptr + stride_w * 8);\n                    int8x8_t _r2 = vld1_s8(sptr + stride_w * 16);\n                    int8x8_t _r3 = vld1_s8(sptr + stride_w * 24);\n                    int8x8_t _r4 = vld1_s8(sptr + stride_w * 32);\n                    int8x8_t _r5 = vld1_s8(sptr + stride_w * 40);\n                    int8x8_t _r6 = vld1_s8(sptr + stride_w * 48);\n                    int8x8_t _r7 = vld1_s8(sptr + stride_w * 56);\n                    vst1_s8(pp, _r0);\n                    vst1_s8(pp + 8, _r1);\n                    vst1_s8(pp + 16, _r2);\n                    vst1_s8(pp + 24, _r3);\n                    vst1_s8(pp + 32, _r4);\n                    vst1_s8(pp + 40, _r5);\n                    vst1_s8(pp + 48, _r6);\n                    vst1_s8(pp + 56, _r7);\n                    pp += 64;\n#elif __ARM_FEATURE_DOTPROD\n                    int32x2_t _r0 = vreinterpret_s32_s8(vld1_s8(sptr));\n                    int32x2_t _r1 = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 8));\n                    int32x2_t _r2 = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 16));\n                    int32x2_t _r3 = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 24));\n                    int32x2_t _r4 = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 32));\n                    int32x2_t _r5 = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 40));\n                    int32x2_t _r6 = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 48));\n                    int32x2_t _r7 = 
vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 56));\n                    int32x2x2_t _r01 = vzip_s32(_r0, _r1);\n                    int32x2x2_t _r23 = vzip_s32(_r2, _r3);\n                    int32x2x2_t _r45 = vzip_s32(_r4, _r5);\n                    int32x2x2_t _r67 = vzip_s32(_r6, _r7);\n                    vst1_s32((int*)pp, _r01.val[0]);\n                    vst1_s32((int*)(pp + 8), _r23.val[0]);\n                    vst1_s32((int*)(pp + 16), _r45.val[0]);\n                    vst1_s32((int*)(pp + 24), _r67.val[0]);\n                    vst1_s32((int*)(pp + 32), _r01.val[1]);\n                    vst1_s32((int*)(pp + 40), _r23.val[1]);\n                    vst1_s32((int*)(pp + 48), _r45.val[1]);\n                    vst1_s32((int*)(pp + 56), _r67.val[1]);\n                    pp += 64;\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                    int16x4_t _r0 = vreinterpret_s16_s8(vld1_s8(sptr));\n                    int16x4_t _r1 = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 8));\n                    int16x4_t _r2 = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 16));\n                    int16x4_t _r3 = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 24));\n                    int16x4_t _r4 = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 32));\n                    int16x4_t _r5 = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 40));\n                    int16x4_t _r6 = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 48));\n                    int16x4_t _r7 = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 56));\n                    int16x4x2_t _r01 = vzip_s16(_r0, _r1);\n                    int16x4x2_t _r23 = vzip_s16(_r2, _r3);\n                    int16x4x2_t _r45 = vzip_s16(_r4, _r5);\n                    int16x4x2_t _r67 = vzip_s16(_r6, _r7);\n                    int32x4x4_t _r0123;\n                    _r0123.val[0] = vreinterpretq_s32_s16(vcombine_s16(_r01.val[0], _r01.val[1]));\n                    _r0123.val[1] = 
vreinterpretq_s32_s16(vcombine_s16(_r23.val[0], _r23.val[1]));\n                    _r0123.val[2] = vreinterpretq_s32_s16(vcombine_s16(_r45.val[0], _r45.val[1]));\n                    _r0123.val[3] = vreinterpretq_s32_s16(vcombine_s16(_r67.val[0], _r67.val[1]));\n                    vst4q_s32((int*)pp, _r0123);\n                    pp += 64;\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                }\n                if (elempack == 1)\n                {\n                    pp[0] = sptr[0];\n                    pp[1] = sptr[stride_w];\n                    pp[2] = sptr[stride_w * 2];\n                    pp[3] = sptr[stride_w * 3];\n                    pp[4] = sptr[stride_w * 4];\n                    pp[5] = sptr[stride_w * 5];\n                    pp[6] = sptr[stride_w * 6];\n                    pp[7] = sptr[stride_w * 7];\n                    pp += 8;\n                }\n            }\n        }\n        else\n        {\n            int kk = 0;\n            if (elempack == 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int p4 = (k + kk + 4) / maxk;\n                    int p5 = (k + kk + 5) / maxk;\n                    int p6 = (k + kk + 6) / maxk;\n                    int p7 = (k + kk + 7) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int uv4 = (k + kk + 4) % maxk;\n                    int uv5 = (k + kk + 5) % maxk;\n                    int uv6 = (k + kk + 6) % maxk;\n                    int uv7 = (k + kk + 7) % maxk;\n                    int u0 
= uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int u4 = uv4 / kernel_w;\n                    int u5 = uv5 / kernel_w;\n                    int u6 = uv6 / kernel_w;\n                    int u7 = uv7 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n                    int v4 = uv4 % kernel_w;\n                    int v5 = uv5 % kernel_w;\n                    int v6 = uv6 % kernel_w;\n                    int v7 = uv7 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n                    const Mat img4 = bottom_blob.channel(p4);\n                    const Mat img5 = bottom_blob.channel(p5);\n                    const Mat img6 = bottom_blob.channel(p6);\n                    const Mat img7 = bottom_blob.channel(p7);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int x02 = dx2 + dilation_w * v0;\n                    int x03 = dx3 + dilation_w * v0;\n                    int x04 = dx4 + dilation_w * v0;\n                    int x05 = dx5 + dilation_w * v0;\n                    int x06 = dx6 + dilation_w * v0;\n                    int x07 = dx7 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int y02 = dy2 + dilation_h * u0;\n                    int y03 = dy3 + dilation_h * u0;\n                    int y04 = dy4 + dilation_h * u0;\n                    int y05 = dy5 + dilation_h * u0;\n                    int 
y06 = dy6 + dilation_h * u0;\n                    int y07 = dy7 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int x12 = dx2 + dilation_w * v1;\n                    int x13 = dx3 + dilation_w * v1;\n                    int x14 = dx4 + dilation_w * v1;\n                    int x15 = dx5 + dilation_w * v1;\n                    int x16 = dx6 + dilation_w * v1;\n                    int x17 = dx7 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n                    int y12 = dy2 + dilation_h * u1;\n                    int y13 = dy3 + dilation_h * u1;\n                    int y14 = dy4 + dilation_h * u1;\n                    int y15 = dy5 + dilation_h * u1;\n                    int y16 = dy6 + dilation_h * u1;\n                    int y17 = dy7 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int x21 = dx1 + dilation_w * v2;\n                    int x22 = dx2 + dilation_w * v2;\n                    int x23 = dx3 + dilation_w * v2;\n                    int x24 = dx4 + dilation_w * v2;\n                    int x25 = dx5 + dilation_w * v2;\n                    int x26 = dx6 + dilation_w * v2;\n                    int x27 = dx7 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int y21 = dy1 + dilation_h * u2;\n                    int y22 = dy2 + dilation_h * u2;\n                    int y23 = dy3 + dilation_h * u2;\n                    int y24 = dy4 + dilation_h * u2;\n                    int y25 = dy5 + dilation_h * u2;\n                    int y26 = dy6 + dilation_h * u2;\n                    int y27 = dy7 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int x31 = dx1 + dilation_w * v3;\n                    int x32 = dx2 + dilation_w * v3;\n                    
int x33 = dx3 + dilation_w * v3;\n                    int x34 = dx4 + dilation_w * v3;\n                    int x35 = dx5 + dilation_w * v3;\n                    int x36 = dx6 + dilation_w * v3;\n                    int x37 = dx7 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n                    int y31 = dy1 + dilation_h * u3;\n                    int y32 = dy2 + dilation_h * u3;\n                    int y33 = dy3 + dilation_h * u3;\n                    int y34 = dy4 + dilation_h * u3;\n                    int y35 = dy5 + dilation_h * u3;\n                    int y36 = dy6 + dilation_h * u3;\n                    int y37 = dy7 + dilation_h * u3;\n\n                    int x40 = dx0 + dilation_w * v4;\n                    int x41 = dx1 + dilation_w * v4;\n                    int x42 = dx2 + dilation_w * v4;\n                    int x43 = dx3 + dilation_w * v4;\n                    int x44 = dx4 + dilation_w * v4;\n                    int x45 = dx5 + dilation_w * v4;\n                    int x46 = dx6 + dilation_w * v4;\n                    int x47 = dx7 + dilation_w * v4;\n                    int y40 = dy0 + dilation_h * u4;\n                    int y41 = dy1 + dilation_h * u4;\n                    int y42 = dy2 + dilation_h * u4;\n                    int y43 = dy3 + dilation_h * u4;\n                    int y44 = dy4 + dilation_h * u4;\n                    int y45 = dy5 + dilation_h * u4;\n                    int y46 = dy6 + dilation_h * u4;\n                    int y47 = dy7 + dilation_h * u4;\n\n                    int x50 = dx0 + dilation_w * v5;\n                    int x51 = dx1 + dilation_w * v5;\n                    int x52 = dx2 + dilation_w * v5;\n                    int x53 = dx3 + dilation_w * v5;\n                    int x54 = dx4 + dilation_w * v5;\n                    int x55 = dx5 + dilation_w * v5;\n                    int x56 = dx6 + dilation_w * v5;\n                    int x57 = dx7 + dilation_w * v5;\n                  
  int y50 = dy0 + dilation_h * u5;\n                    int y51 = dy1 + dilation_h * u5;\n                    int y52 = dy2 + dilation_h * u5;\n                    int y53 = dy3 + dilation_h * u5;\n                    int y54 = dy4 + dilation_h * u5;\n                    int y55 = dy5 + dilation_h * u5;\n                    int y56 = dy6 + dilation_h * u5;\n                    int y57 = dy7 + dilation_h * u5;\n\n                    int x60 = dx0 + dilation_w * v6;\n                    int x61 = dx1 + dilation_w * v6;\n                    int x62 = dx2 + dilation_w * v6;\n                    int x63 = dx3 + dilation_w * v6;\n                    int x64 = dx4 + dilation_w * v6;\n                    int x65 = dx5 + dilation_w * v6;\n                    int x66 = dx6 + dilation_w * v6;\n                    int x67 = dx7 + dilation_w * v6;\n                    int y60 = dy0 + dilation_h * u6;\n                    int y61 = dy1 + dilation_h * u6;\n                    int y62 = dy2 + dilation_h * u6;\n                    int y63 = dy3 + dilation_h * u6;\n                    int y64 = dy4 + dilation_h * u6;\n                    int y65 = dy5 + dilation_h * u6;\n                    int y66 = dy6 + dilation_h * u6;\n                    int y67 = dy7 + dilation_h * u6;\n\n                    int x70 = dx0 + dilation_w * v7;\n                    int x71 = dx1 + dilation_w * v7;\n                    int x72 = dx2 + dilation_w * v7;\n                    int x73 = dx3 + dilation_w * v7;\n                    int x74 = dx4 + dilation_w * v7;\n                    int x75 = dx5 + dilation_w * v7;\n                    int x76 = dx6 + dilation_w * v7;\n                    int x77 = dx7 + dilation_w * v7;\n                    int y70 = dy0 + dilation_h * u7;\n                    int y71 = dy1 + dilation_h * u7;\n                    int y72 = dy2 + dilation_h * u7;\n                    int y73 = dy3 + dilation_h * u7;\n                    int y74 = dy4 + dilation_h * u7;\n                
    int y75 = dy5 + dilation_h * u7;\n                    int y76 = dy6 + dilation_h * u7;\n                    int y77 = dy7 + dilation_h * u7;\n\n                    const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr02 = img0.row<const signed char>(y02) + x02;\n                    const signed char* sptr03 = img0.row<const signed char>(y03) + x03;\n                    const signed char* sptr04 = img0.row<const signed char>(y04) + x04;\n                    const signed char* sptr05 = img0.row<const signed char>(y05) + x05;\n                    const signed char* sptr06 = img0.row<const signed char>(y06) + x06;\n                    const signed char* sptr07 = img0.row<const signed char>(y07) + x07;\n\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n                    const signed char* sptr12 = img1.row<const signed char>(y12) + x12;\n                    const signed char* sptr13 = img1.row<const signed char>(y13) + x13;\n                    const signed char* sptr14 = img1.row<const signed char>(y14) + x14;\n                    const signed char* sptr15 = img1.row<const signed char>(y15) + x15;\n                    const signed char* sptr16 = img1.row<const signed char>(y16) + x16;\n                    const signed char* sptr17 = img1.row<const signed char>(y17) + x17;\n\n                    const signed char* sptr20 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr21 = img2.row<const signed char>(y21) + x21;\n                    const signed char* sptr22 = img2.row<const signed char>(y22) + x22;\n                    const signed char* sptr23 = img2.row<const signed char>(y23) + x23;\n                    const signed char* sptr24 = img2.row<const 
signed char>(y24) + x24;\n                    const signed char* sptr25 = img2.row<const signed char>(y25) + x25;\n                    const signed char* sptr26 = img2.row<const signed char>(y26) + x26;\n                    const signed char* sptr27 = img2.row<const signed char>(y27) + x27;\n\n                    const signed char* sptr30 = img3.row<const signed char>(y30) + x30;\n                    const signed char* sptr31 = img3.row<const signed char>(y31) + x31;\n                    const signed char* sptr32 = img3.row<const signed char>(y32) + x32;\n                    const signed char* sptr33 = img3.row<const signed char>(y33) + x33;\n                    const signed char* sptr34 = img3.row<const signed char>(y34) + x34;\n                    const signed char* sptr35 = img3.row<const signed char>(y35) + x35;\n                    const signed char* sptr36 = img3.row<const signed char>(y36) + x36;\n                    const signed char* sptr37 = img3.row<const signed char>(y37) + x37;\n\n                    const signed char* sptr40 = img4.row<const signed char>(y40) + x40;\n                    const signed char* sptr41 = img4.row<const signed char>(y41) + x41;\n                    const signed char* sptr42 = img4.row<const signed char>(y42) + x42;\n                    const signed char* sptr43 = img4.row<const signed char>(y43) + x43;\n                    const signed char* sptr44 = img4.row<const signed char>(y44) + x44;\n                    const signed char* sptr45 = img4.row<const signed char>(y45) + x45;\n                    const signed char* sptr46 = img4.row<const signed char>(y46) + x46;\n                    const signed char* sptr47 = img4.row<const signed char>(y47) + x47;\n\n                    const signed char* sptr50 = img5.row<const signed char>(y50) + x50;\n                    const signed char* sptr51 = img5.row<const signed char>(y51) + x51;\n                    const signed char* sptr52 = img5.row<const signed char>(y52) + x52;\n          
          const signed char* sptr53 = img5.row<const signed char>(y53) + x53;\n                    const signed char* sptr54 = img5.row<const signed char>(y54) + x54;\n                    const signed char* sptr55 = img5.row<const signed char>(y55) + x55;\n                    const signed char* sptr56 = img5.row<const signed char>(y56) + x56;\n                    const signed char* sptr57 = img5.row<const signed char>(y57) + x57;\n\n                    const signed char* sptr60 = img6.row<const signed char>(y60) + x60;\n                    const signed char* sptr61 = img6.row<const signed char>(y61) + x61;\n                    const signed char* sptr62 = img6.row<const signed char>(y62) + x62;\n                    const signed char* sptr63 = img6.row<const signed char>(y63) + x63;\n                    const signed char* sptr64 = img6.row<const signed char>(y64) + x64;\n                    const signed char* sptr65 = img6.row<const signed char>(y65) + x65;\n                    const signed char* sptr66 = img6.row<const signed char>(y66) + x66;\n                    const signed char* sptr67 = img6.row<const signed char>(y67) + x67;\n\n                    const signed char* sptr70 = img7.row<const signed char>(y70) + x70;\n                    const signed char* sptr71 = img7.row<const signed char>(y71) + x71;\n                    const signed char* sptr72 = img7.row<const signed char>(y72) + x72;\n                    const signed char* sptr73 = img7.row<const signed char>(y73) + x73;\n                    const signed char* sptr74 = img7.row<const signed char>(y74) + x74;\n                    const signed char* sptr75 = img7.row<const signed char>(y75) + x75;\n                    const signed char* sptr76 = img7.row<const signed char>(y76) + x76;\n                    const signed char* sptr77 = img7.row<const signed char>(y77) + x77;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr20[0];\n               
     pp[3] = sptr30[0];\n                    pp[4] = sptr40[0];\n                    pp[5] = sptr50[0];\n                    pp[6] = sptr60[0];\n                    pp[7] = sptr70[0];\n                    pp[8] = sptr01[0];\n                    pp[9] = sptr11[0];\n                    pp[10] = sptr21[0];\n                    pp[11] = sptr31[0];\n                    pp[12] = sptr41[0];\n                    pp[13] = sptr51[0];\n                    pp[14] = sptr61[0];\n                    pp[15] = sptr71[0];\n                    pp[16] = sptr02[0];\n                    pp[17] = sptr12[0];\n                    pp[18] = sptr22[0];\n                    pp[19] = sptr32[0];\n                    pp[20] = sptr42[0];\n                    pp[21] = sptr52[0];\n                    pp[22] = sptr62[0];\n                    pp[23] = sptr72[0];\n                    pp[24] = sptr03[0];\n                    pp[25] = sptr13[0];\n                    pp[26] = sptr23[0];\n                    pp[27] = sptr33[0];\n                    pp[28] = sptr43[0];\n                    pp[29] = sptr53[0];\n                    pp[30] = sptr63[0];\n                    pp[31] = sptr73[0];\n                    pp[32] = sptr04[0];\n                    pp[33] = sptr14[0];\n                    pp[34] = sptr24[0];\n                    pp[35] = sptr34[0];\n                    pp[36] = sptr44[0];\n                    pp[37] = sptr54[0];\n                    pp[38] = sptr64[0];\n                    pp[39] = sptr74[0];\n                    pp[40] = sptr05[0];\n                    pp[41] = sptr15[0];\n                    pp[42] = sptr25[0];\n                    pp[43] = sptr35[0];\n                    pp[44] = sptr45[0];\n                    pp[45] = sptr55[0];\n                    pp[46] = sptr65[0];\n                    pp[47] = sptr75[0];\n                    pp[48] = sptr06[0];\n                    pp[49] = sptr16[0];\n                    pp[50] = sptr26[0];\n                    pp[51] = sptr36[0];\n             
       pp[52] = sptr46[0];\n                    pp[53] = sptr56[0];\n                    pp[54] = sptr66[0];\n                    pp[55] = sptr76[0];\n                    pp[56] = sptr07[0];\n                    pp[57] = sptr17[0];\n                    pp[58] = sptr27[0];\n                    pp[59] = sptr37[0];\n                    pp[60] = sptr47[0];\n                    pp[61] = sptr57[0];\n                    pp[62] = sptr67[0];\n                    pp[63] = sptr77[0];\n                    pp += 64;\n                }\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int x02 = dx2 + dilation_w * v0;\n                    int x03 = dx3 + dilation_w * v0;\n                    int x04 = dx4 + dilation_w * v0;\n                    int x05 = dx5 + dilation_w * v0;\n             
       int x06 = dx6 + dilation_w * v0;\n                    int x07 = dx7 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int y02 = dy2 + dilation_h * u0;\n                    int y03 = dy3 + dilation_h * u0;\n                    int y04 = dy4 + dilation_h * u0;\n                    int y05 = dy5 + dilation_h * u0;\n                    int y06 = dy6 + dilation_h * u0;\n                    int y07 = dy7 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int x12 = dx2 + dilation_w * v1;\n                    int x13 = dx3 + dilation_w * v1;\n                    int x14 = dx4 + dilation_w * v1;\n                    int x15 = dx5 + dilation_w * v1;\n                    int x16 = dx6 + dilation_w * v1;\n                    int x17 = dx7 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n                    int y12 = dy2 + dilation_h * u1;\n                    int y13 = dy3 + dilation_h * u1;\n                    int y14 = dy4 + dilation_h * u1;\n                    int y15 = dy5 + dilation_h * u1;\n                    int y16 = dy6 + dilation_h * u1;\n                    int y17 = dy7 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int x21 = dx1 + dilation_w * v2;\n                    int x22 = dx2 + dilation_w * v2;\n                    int x23 = dx3 + dilation_w * v2;\n                    int x24 = dx4 + dilation_w * v2;\n                    int x25 = dx5 + dilation_w * v2;\n                    int x26 = dx6 + dilation_w * v2;\n                    int x27 = dx7 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int y21 = dy1 + dilation_h * u2;\n                    int y22 = dy2 + dilation_h * u2;\n           
         int y23 = dy3 + dilation_h * u2;\n                    int y24 = dy4 + dilation_h * u2;\n                    int y25 = dy5 + dilation_h * u2;\n                    int y26 = dy6 + dilation_h * u2;\n                    int y27 = dy7 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int x31 = dx1 + dilation_w * v3;\n                    int x32 = dx2 + dilation_w * v3;\n                    int x33 = dx3 + dilation_w * v3;\n                    int x34 = dx4 + dilation_w * v3;\n                    int x35 = dx5 + dilation_w * v3;\n                    int x36 = dx6 + dilation_w * v3;\n                    int x37 = dx7 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n                    int y31 = dy1 + dilation_h * u3;\n                    int y32 = dy2 + dilation_h * u3;\n                    int y33 = dy3 + dilation_h * u3;\n                    int y34 = dy4 + dilation_h * u3;\n                    int y35 = dy5 + dilation_h * u3;\n                    int y36 = dy6 + dilation_h * u3;\n                    int y37 = dy7 + dilation_h * u3;\n\n                    const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr02 = img0.row<const signed char>(y02) + x02;\n                    const signed char* sptr03 = img0.row<const signed char>(y03) + x03;\n                    const signed char* sptr04 = img0.row<const signed char>(y04) + x04;\n                    const signed char* sptr05 = img0.row<const signed char>(y05) + x05;\n                    const signed char* sptr06 = img0.row<const signed char>(y06) + x06;\n                    const signed char* sptr07 = img0.row<const signed char>(y07) + x07;\n\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed 
char>(y11) + x11;\n                    const signed char* sptr12 = img1.row<const signed char>(y12) + x12;\n                    const signed char* sptr13 = img1.row<const signed char>(y13) + x13;\n                    const signed char* sptr14 = img1.row<const signed char>(y14) + x14;\n                    const signed char* sptr15 = img1.row<const signed char>(y15) + x15;\n                    const signed char* sptr16 = img1.row<const signed char>(y16) + x16;\n                    const signed char* sptr17 = img1.row<const signed char>(y17) + x17;\n\n                    const signed char* sptr20 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr21 = img2.row<const signed char>(y21) + x21;\n                    const signed char* sptr22 = img2.row<const signed char>(y22) + x22;\n                    const signed char* sptr23 = img2.row<const signed char>(y23) + x23;\n                    const signed char* sptr24 = img2.row<const signed char>(y24) + x24;\n                    const signed char* sptr25 = img2.row<const signed char>(y25) + x25;\n                    const signed char* sptr26 = img2.row<const signed char>(y26) + x26;\n                    const signed char* sptr27 = img2.row<const signed char>(y27) + x27;\n\n                    const signed char* sptr30 = img3.row<const signed char>(y30) + x30;\n                    const signed char* sptr31 = img3.row<const signed char>(y31) + x31;\n                    const signed char* sptr32 = img3.row<const signed char>(y32) + x32;\n                    const signed char* sptr33 = img3.row<const signed char>(y33) + x33;\n                    const signed char* sptr34 = img3.row<const signed char>(y34) + x34;\n                    const signed char* sptr35 = img3.row<const signed char>(y35) + x35;\n                    const signed char* sptr36 = img3.row<const signed char>(y36) + x36;\n                    const signed char* sptr37 = img3.row<const signed char>(y37) + x37;\n\n                 
   pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr20[0];\n                    pp[3] = sptr30[0];\n                    pp[4] = sptr01[0];\n                    pp[5] = sptr11[0];\n                    pp[6] = sptr21[0];\n                    pp[7] = sptr31[0];\n                    pp[8] = sptr02[0];\n                    pp[9] = sptr12[0];\n                    pp[10] = sptr22[0];\n                    pp[11] = sptr32[0];\n                    pp[12] = sptr03[0];\n                    pp[13] = sptr13[0];\n                    pp[14] = sptr23[0];\n                    pp[15] = sptr33[0];\n                    pp[16] = sptr04[0];\n                    pp[17] = sptr14[0];\n                    pp[18] = sptr24[0];\n                    pp[19] = sptr34[0];\n                    pp[20] = sptr05[0];\n                    pp[21] = sptr15[0];\n                    pp[22] = sptr25[0];\n                    pp[23] = sptr35[0];\n                    pp[24] = sptr06[0];\n                    pp[25] = sptr16[0];\n                    pp[26] = sptr26[0];\n                    pp[27] = sptr36[0];\n                    pp[28] = sptr07[0];\n                    pp[29] = sptr17[0];\n                    pp[30] = sptr27[0];\n                    pp[31] = sptr37[0];\n                    pp += 32;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n\n                    int x00 = dx0 + dilation_w * 
v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int x02 = dx2 + dilation_w * v0;\n                    int x03 = dx3 + dilation_w * v0;\n                    int x04 = dx4 + dilation_w * v0;\n                    int x05 = dx5 + dilation_w * v0;\n                    int x06 = dx6 + dilation_w * v0;\n                    int x07 = dx7 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int y02 = dy2 + dilation_h * u0;\n                    int y03 = dy3 + dilation_h * u0;\n                    int y04 = dy4 + dilation_h * u0;\n                    int y05 = dy5 + dilation_h * u0;\n                    int y06 = dy6 + dilation_h * u0;\n                    int y07 = dy7 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int x12 = dx2 + dilation_w * v1;\n                    int x13 = dx3 + dilation_w * v1;\n                    int x14 = dx4 + dilation_w * v1;\n                    int x15 = dx5 + dilation_w * v1;\n                    int x16 = dx6 + dilation_w * v1;\n                    int x17 = dx7 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n                    int y12 = dy2 + dilation_h * u1;\n                    int y13 = dy3 + dilation_h * u1;\n                    int y14 = dy4 + dilation_h * u1;\n                    int y15 = dy5 + dilation_h * u1;\n                    int y16 = dy6 + dilation_h * u1;\n                    int y17 = dy7 + dilation_h * u1;\n\n                    const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr02 = img0.row<const signed char>(y02) + x02;\n                    const signed char* sptr03 = 
img0.row<const signed char>(y03) + x03;\n                    const signed char* sptr04 = img0.row<const signed char>(y04) + x04;\n                    const signed char* sptr05 = img0.row<const signed char>(y05) + x05;\n                    const signed char* sptr06 = img0.row<const signed char>(y06) + x06;\n                    const signed char* sptr07 = img0.row<const signed char>(y07) + x07;\n\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n                    const signed char* sptr12 = img1.row<const signed char>(y12) + x12;\n                    const signed char* sptr13 = img1.row<const signed char>(y13) + x13;\n                    const signed char* sptr14 = img1.row<const signed char>(y14) + x14;\n                    const signed char* sptr15 = img1.row<const signed char>(y15) + x15;\n                    const signed char* sptr16 = img1.row<const signed char>(y16) + x16;\n                    const signed char* sptr17 = img1.row<const signed char>(y17) + x17;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr01[0];\n                    pp[3] = sptr11[0];\n                    pp[4] = sptr02[0];\n                    pp[5] = sptr12[0];\n                    pp[6] = sptr03[0];\n                    pp[7] = sptr13[0];\n                    pp[8] = sptr04[0];\n                    pp[9] = sptr14[0];\n                    pp[10] = sptr05[0];\n                    pp[11] = sptr15[0];\n                    pp[12] = sptr06[0];\n                    pp[13] = sptr16[0];\n                    pp[14] = sptr07[0];\n                    pp[15] = sptr17[0];\n                    pp += 16;\n                }\n            }\n            for (; kk < max_kk / elempack; kk++)\n            {\n                int p = (k / elempack + kk) / maxk;\n                int uv = (k / elempack + kk) % maxk;\n    
            int u = uv / kernel_w;\n                int v = uv % kernel_w;\n\n                const Mat img = bottom_blob.channel(p);\n\n                int x0 = dx0 + dilation_w * v;\n                int x1 = dx1 + dilation_w * v;\n                int x2 = dx2 + dilation_w * v;\n                int x3 = dx3 + dilation_w * v;\n                int x4 = dx4 + dilation_w * v;\n                int x5 = dx5 + dilation_w * v;\n                int x6 = dx6 + dilation_w * v;\n                int x7 = dx7 + dilation_w * v;\n                int y0 = dy0 + dilation_h * u;\n                int y1 = dy1 + dilation_h * u;\n                int y2 = dy2 + dilation_h * u;\n                int y3 = dy3 + dilation_h * u;\n                int y4 = dy4 + dilation_h * u;\n                int y5 = dy5 + dilation_h * u;\n                int y6 = dy6 + dilation_h * u;\n                int y7 = dy7 + dilation_h * u;\n\n                const signed char* sptr0 = img.row<const signed char>(y0) + x0 * elempack;\n                const signed char* sptr1 = img.row<const signed char>(y1) + x1 * elempack;\n                const signed char* sptr2 = img.row<const signed char>(y2) + x2 * elempack;\n                const signed char* sptr3 = img.row<const signed char>(y3) + x3 * elempack;\n                const signed char* sptr4 = img.row<const signed char>(y4) + x4 * elempack;\n                const signed char* sptr5 = img.row<const signed char>(y5) + x5 * elempack;\n                const signed char* sptr6 = img.row<const signed char>(y6) + x6 * elempack;\n                const signed char* sptr7 = img.row<const signed char>(y7) + x7 * elempack;\n\n                if (elempack == 8)\n                {\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x8_t _r0 = vld1_s8(sptr0);\n                    int8x8_t _r1 = vld1_s8(sptr1);\n                    int8x8_t _r2 = vld1_s8(sptr2);\n                    int8x8_t _r3 = vld1_s8(sptr3);\n                    int8x8_t _r4 = vld1_s8(sptr4);\n         
           int8x8_t _r5 = vld1_s8(sptr5);\n                    int8x8_t _r6 = vld1_s8(sptr6);\n                    int8x8_t _r7 = vld1_s8(sptr7);\n                    vst1_s8(pp, _r0);\n                    vst1_s8(pp + 8, _r1);\n                    vst1_s8(pp + 16, _r2);\n                    vst1_s8(pp + 24, _r3);\n                    vst1_s8(pp + 32, _r4);\n                    vst1_s8(pp + 40, _r5);\n                    vst1_s8(pp + 48, _r6);\n                    vst1_s8(pp + 56, _r7);\n                    pp += 64;\n#elif __ARM_FEATURE_DOTPROD\n                    int32x2_t _r0 = vreinterpret_s32_s8(vld1_s8(sptr0));\n                    int32x2_t _r1 = vreinterpret_s32_s8(vld1_s8(sptr1));\n                    int32x2_t _r2 = vreinterpret_s32_s8(vld1_s8(sptr2));\n                    int32x2_t _r3 = vreinterpret_s32_s8(vld1_s8(sptr3));\n                    int32x2_t _r4 = vreinterpret_s32_s8(vld1_s8(sptr4));\n                    int32x2_t _r5 = vreinterpret_s32_s8(vld1_s8(sptr5));\n                    int32x2_t _r6 = vreinterpret_s32_s8(vld1_s8(sptr6));\n                    int32x2_t _r7 = vreinterpret_s32_s8(vld1_s8(sptr7));\n                    int32x2x2_t _r01 = vzip_s32(_r0, _r1);\n                    int32x2x2_t _r23 = vzip_s32(_r2, _r3);\n                    int32x2x2_t _r45 = vzip_s32(_r4, _r5);\n                    int32x2x2_t _r67 = vzip_s32(_r6, _r7);\n                    vst1_s32((int*)pp, _r01.val[0]);\n                    vst1_s32((int*)(pp + 8), _r23.val[0]);\n                    vst1_s32((int*)(pp + 16), _r45.val[0]);\n                    vst1_s32((int*)(pp + 24), _r67.val[0]);\n                    vst1_s32((int*)(pp + 32), _r01.val[1]);\n                    vst1_s32((int*)(pp + 40), _r23.val[1]);\n                    vst1_s32((int*)(pp + 48), _r45.val[1]);\n                    vst1_s32((int*)(pp + 56), _r67.val[1]);\n                    pp += 64;\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                    int16x4_t _r0 = 
vreinterpret_s16_s8(vld1_s8(sptr0));\n                    int16x4_t _r1 = vreinterpret_s16_s8(vld1_s8(sptr1));\n                    int16x4_t _r2 = vreinterpret_s16_s8(vld1_s8(sptr2));\n                    int16x4_t _r3 = vreinterpret_s16_s8(vld1_s8(sptr3));\n                    int16x4_t _r4 = vreinterpret_s16_s8(vld1_s8(sptr4));\n                    int16x4_t _r5 = vreinterpret_s16_s8(vld1_s8(sptr5));\n                    int16x4_t _r6 = vreinterpret_s16_s8(vld1_s8(sptr6));\n                    int16x4_t _r7 = vreinterpret_s16_s8(vld1_s8(sptr7));\n                    int16x4x2_t _r01 = vzip_s16(_r0, _r1);\n                    int16x4x2_t _r23 = vzip_s16(_r2, _r3);\n                    int16x4x2_t _r45 = vzip_s16(_r4, _r5);\n                    int16x4x2_t _r67 = vzip_s16(_r6, _r7);\n                    int32x4x4_t _r0123;\n                    _r0123.val[0] = vreinterpretq_s32_s16(vcombine_s16(_r01.val[0], _r01.val[1]));\n                    _r0123.val[1] = vreinterpretq_s32_s16(vcombine_s16(_r23.val[0], _r23.val[1]));\n                    _r0123.val[2] = vreinterpretq_s32_s16(vcombine_s16(_r45.val[0], _r45.val[1]));\n                    _r0123.val[3] = vreinterpretq_s32_s16(vcombine_s16(_r67.val[0], _r67.val[1]));\n                    vst4q_s32((int*)pp, _r0123);\n                    pp += 64;\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                }\n                if (elempack == 1)\n                {\n                    pp[0] = sptr0[0];\n                    pp[1] = sptr1[0];\n                    pp[2] = sptr2[0];\n                    pp[3] = sptr3[0];\n                    pp[4] = sptr4[0];\n                    pp[5] = sptr5[0];\n                    pp[6] = sptr6[0];\n                    pp[7] = sptr7[0];\n                    pp += 8;\n                }\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        int dy0 = (j + jj) / outw * stride_h;\n        int dy1 = (j + jj + 1) / 
outw * stride_h;\n        int dy2 = (j + jj + 2) / outw * stride_h;\n        int dy3 = (j + jj + 3) / outw * stride_h;\n        int dx0 = (j + jj) % outw * stride_w;\n        int dx1 = (j + jj + 1) % outw * stride_w;\n        int dx2 = (j + jj + 2) % outw * stride_w;\n        int dx3 = (j + jj + 3) % outw * stride_w;\n\n        if (dy0 == dy3)\n        {\n            int kk = 0;\n            if (elempack == 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int p4 = (k + kk + 4) / maxk;\n                    int p5 = (k + kk + 5) / maxk;\n                    int p6 = (k + kk + 6) / maxk;\n                    int p7 = (k + kk + 7) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int uv4 = (k + kk + 4) % maxk;\n                    int uv5 = (k + kk + 5) % maxk;\n                    int uv6 = (k + kk + 6) % maxk;\n                    int uv7 = (k + kk + 7) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int u4 = uv4 / kernel_w;\n                    int u5 = uv5 / kernel_w;\n                    int u6 = uv6 / kernel_w;\n                    int u7 = uv7 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n                    int v4 = uv4 % kernel_w;\n                    int v5 = 
uv5 % kernel_w;\n                    int v6 = uv6 % kernel_w;\n                    int v7 = uv7 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n                    const Mat img4 = bottom_blob.channel(p4);\n                    const Mat img5 = bottom_blob.channel(p5);\n                    const Mat img6 = bottom_blob.channel(p6);\n                    const Mat img7 = bottom_blob.channel(p7);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n\n                    int x40 = dx0 + dilation_w * v4;\n                    int y40 = dy0 + dilation_h * u4;\n\n                    int x50 = dx0 + dilation_w * v5;\n                    int y50 = dy0 + dilation_h * u5;\n\n                    int x60 = dx0 + dilation_w * v6;\n                    int y60 = dy0 + dilation_h * u6;\n\n                    int x70 = dx0 + dilation_w * v7;\n                    int y70 = dy0 + dilation_h * u7;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr2 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr3 = img3.row<const signed char>(y30) + x30;\n                    const signed char* sptr4 = img4.row<const signed char>(y40) + x40;\n                    const signed char* sptr5 = img5.row<const 
signed char>(y50) + x50;\n                    const signed char* sptr6 = img6.row<const signed char>(y60) + x60;\n                    const signed char* sptr7 = img7.row<const signed char>(y70) + x70;\n\n                    if (stride_w == 1)\n                    {\n                        int8x8_t _r0 = vld1_s8(sptr0);\n                        int8x8_t _r1 = vld1_s8(sptr1);\n                        int8x8_t _r2 = vld1_s8(sptr2);\n                        int8x8_t _r3 = vld1_s8(sptr3);\n                        int8x8_t _r4 = vld1_s8(sptr4);\n                        int8x8_t _r5 = vld1_s8(sptr5);\n                        int8x8_t _r6 = vld1_s8(sptr6);\n                        int8x8_t _r7 = vld1_s8(sptr7);\n                        int16x4x4_t _r0123;\n                        _r0123.val[0] = vreinterpret_s16_s8(vzip_s8(_r0, _r1).val[0]);\n                        _r0123.val[1] = vreinterpret_s16_s8(vzip_s8(_r2, _r3).val[0]);\n                        _r0123.val[2] = vreinterpret_s16_s8(vzip_s8(_r4, _r5).val[0]);\n                        _r0123.val[3] = vreinterpret_s16_s8(vzip_s8(_r6, _r7).val[0]);\n                        vst4_s16((short*)pp, _r0123);\n                        pp += 32;\n                    }\n                    else if (stride_w == 2)\n                    {\n                        int8x8_t _r0 = vld1_s8(sptr0);\n                        int8x8_t _r1 = vld1_s8(sptr1);\n                        int8x8_t _r2 = vld1_s8(sptr2);\n                        int8x8_t _r3 = vld1_s8(sptr3);\n                        int8x8_t _r4 = vld1_s8(sptr4);\n                        int8x8_t _r5 = vld1_s8(sptr5);\n                        int8x8_t _r6 = vld1_s8(sptr6);\n                        int8x8_t _r7 = vld1_s8(sptr7);\n                        int8x8_t _r01 = vtrn_s8(_r0, _r1).val[0];\n                        int8x8_t _r23 = vtrn_s8(_r2, _r3).val[0];\n                        int8x8_t _r45 = vtrn_s8(_r4, _r5).val[0];\n                        int8x8_t _r67 = vtrn_s8(_r6, 
_r7).val[0];\n                        int16x4x4_t _r0123;\n                        _r0123.val[0] = vreinterpret_s16_s8(_r01);\n                        _r0123.val[1] = vreinterpret_s16_s8(_r23);\n                        _r0123.val[2] = vreinterpret_s16_s8(_r45);\n                        _r0123.val[3] = vreinterpret_s16_s8(_r67);\n                        vst4_s16((short*)pp, _r0123);\n                        pp += 32;\n                    }\n                    else\n                    {\n                        pp[0] = sptr0[0];\n                        pp[1] = sptr1[0];\n                        pp[2] = sptr2[0];\n                        pp[3] = sptr3[0];\n                        pp[4] = sptr4[0];\n                        pp[5] = sptr5[0];\n                        pp[6] = sptr6[0];\n                        pp[7] = sptr7[0];\n                        pp[8] = sptr0[stride_w];\n                        pp[9] = sptr1[stride_w];\n                        pp[10] = sptr2[stride_w];\n                        pp[11] = sptr3[stride_w];\n                        pp[12] = sptr4[stride_w];\n                        pp[13] = sptr5[stride_w];\n                        pp[14] = sptr6[stride_w];\n                        pp[15] = sptr7[stride_w];\n                        pp[16] = sptr0[stride_w * 2];\n                        pp[17] = sptr1[stride_w * 2];\n                        pp[18] = sptr2[stride_w * 2];\n                        pp[19] = sptr3[stride_w * 2];\n                        pp[20] = sptr4[stride_w * 2];\n                        pp[21] = sptr5[stride_w * 2];\n                        pp[22] = sptr6[stride_w * 2];\n                        pp[23] = sptr7[stride_w * 2];\n                        pp[24] = sptr0[stride_w * 3];\n                        pp[25] = sptr1[stride_w * 3];\n                        pp[26] = sptr2[stride_w * 3];\n                        pp[27] = sptr3[stride_w * 3];\n                        pp[28] = sptr4[stride_w * 3];\n                        pp[29] = 
sptr5[stride_w * 3];\n                        pp[30] = sptr6[stride_w * 3];\n                        pp[31] = sptr7[stride_w * 3];\n                        pp += 32;\n                    }\n                }\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n                    const signed char* 
sptr2 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr3 = img3.row<const signed char>(y30) + x30;\n\n                    if (stride_w == 1)\n                    {\n                        int8x8_t _r0 = vld1_s8(sptr0);\n                        int8x8_t _r1 = vld1_s8(sptr1);\n                        int8x8_t _r2 = vld1_s8(sptr2);\n                        int8x8_t _r3 = vld1_s8(sptr3);\n                        int16x4x2_t _r01;\n                        _r01.val[0] = vreinterpret_s16_s8(vzip_s8(_r0, _r1).val[0]);\n                        _r01.val[1] = vreinterpret_s16_s8(vzip_s8(_r2, _r3).val[0]);\n                        vst2_s16((short*)pp, _r01);\n                        pp += 16;\n                    }\n                    else if (stride_w == 2)\n                    {\n                        int8x8_t _r0 = vld1_s8(sptr0);\n                        int8x8_t _r1 = vld1_s8(sptr1);\n                        int8x8_t _r2 = vld1_s8(sptr2);\n                        int8x8_t _r3 = vld1_s8(sptr3);\n                        int8x8_t _r01 = vtrn_s8(_r0, _r1).val[0];\n                        int8x8_t _r23 = vtrn_s8(_r2, _r3).val[0];\n                        int16x4x2_t _r0123;\n                        _r0123.val[0] = vreinterpret_s16_s8(_r01);\n                        _r0123.val[1] = vreinterpret_s16_s8(_r23);\n                        vst2_s16((short*)pp, _r0123);\n                        pp += 16;\n                    }\n                    else\n                    {\n                        pp[0] = sptr0[0];\n                        pp[1] = sptr1[0];\n                        pp[2] = sptr2[0];\n                        pp[3] = sptr3[0];\n                        pp[4] = sptr0[stride_w];\n                        pp[5] = sptr1[stride_w];\n                        pp[6] = sptr2[stride_w];\n                        pp[7] = sptr3[stride_w];\n                        pp[8] = sptr0[stride_w * 2];\n                        pp[9] = sptr1[stride_w * 
2];\n                        pp[10] = sptr2[stride_w * 2];\n                        pp[11] = sptr3[stride_w * 2];\n                        pp[12] = sptr0[stride_w * 3];\n                        pp[13] = sptr1[stride_w * 3];\n                        pp[14] = sptr2[stride_w * 3];\n                        pp[15] = sptr3[stride_w * 3];\n                        pp += 16;\n                    }\n                }\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n\n                    if (stride_w == 1)\n                    {\n                        int8x8_t _r0 = vld1_s8(sptr0);\n                        int8x8_t _r1 = vld1_s8(sptr1);\n                        int8x8_t _r01 = vzip_s8(_r0, _r1).val[0];\n                        vst1_s8(pp, _r01);\n                        pp += 8;\n                    }\n                    else if (stride_w == 2)\n                    {\n                        int8x8_t _r0 = vld1_s8(sptr0);\n                        int8x8_t _r1 = vld1_s8(sptr1);\n                        int8x8_t _r01 = vtrn_s8(_r0, 
_r1).val[0];\n                        vst1_s8(pp, _r01);\n                        pp += 8;\n                    }\n                    else\n                    {\n                        pp[0] = sptr0[0];\n                        pp[1] = sptr1[0];\n                        pp[2] = sptr0[stride_w];\n                        pp[3] = sptr1[stride_w];\n                        pp[4] = sptr0[stride_w * 2];\n                        pp[5] = sptr1[stride_w * 2];\n                        pp[6] = sptr0[stride_w * 3];\n                        pp[7] = sptr1[stride_w * 3];\n                        pp += 8;\n                    }\n                }\n            }\n            for (; kk < max_kk / elempack; kk++)\n            {\n                int p = (k / elempack + kk) / maxk;\n                int uv = (k / elempack + kk) % maxk;\n                int u = uv / kernel_w;\n                int v = uv % kernel_w;\n\n                const Mat img = bottom_blob.channel(p);\n\n                int x0 = dx0 + dilation_w * v;\n                int y0 = dy0 + dilation_h * u;\n\n                const signed char* sptr = img.row<const signed char>(y0) + x0 * elempack;\n\n                if (elempack == 8)\n                {\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x8_t _r0 = vld1_s8(sptr);\n                    int8x8_t _r1 = vld1_s8(sptr + stride_w * 8);\n                    int8x8_t _r2 = vld1_s8(sptr + stride_w * 16);\n                    int8x8_t _r3 = vld1_s8(sptr + stride_w * 24);\n                    vst1_s8(pp, _r0);\n                    vst1_s8(pp + 8, _r1);\n                    vst1_s8(pp + 16, _r2);\n                    vst1_s8(pp + 24, _r3);\n                    pp += 32;\n#elif __ARM_FEATURE_DOTPROD\n                    int32x2x4_t _r0123;\n                    _r0123.val[0] = vreinterpret_s32_s8(vld1_s8(sptr));\n                    _r0123.val[1] = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 8));\n                    _r0123.val[2] = vreinterpret_s32_s8(vld1_s8(sptr 
+ stride_w * 16));\n                    _r0123.val[3] = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 24));\n                    vst4_s32((int*)pp, _r0123);\n                    pp += 32;\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                    int16x4x4_t _r0123;\n                    _r0123.val[0] = vreinterpret_s16_s8(vld1_s8(sptr));\n                    _r0123.val[1] = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 8));\n                    _r0123.val[2] = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 16));\n                    _r0123.val[3] = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 24));\n                    vst4_s16((short*)pp, _r0123);\n                    pp += 32;\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                }\n                if (elempack == 1)\n                {\n                    pp[0] = sptr[0];\n                    pp[1] = sptr[stride_w];\n                    pp[2] = sptr[stride_w * 2];\n                    pp[3] = sptr[stride_w * 3];\n                    pp += 4;\n                }\n            }\n        }\n        else\n        {\n            int kk = 0;\n            if (elempack == 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int p4 = (k + kk + 4) / maxk;\n                    int p5 = (k + kk + 5) / maxk;\n                    int p6 = (k + kk + 6) / maxk;\n                    int p7 = (k + kk + 7) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int uv4 = (k + kk + 4) % maxk;\n           
         int uv5 = (k + kk + 5) % maxk;\n                    int uv6 = (k + kk + 6) % maxk;\n                    int uv7 = (k + kk + 7) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int u4 = uv4 / kernel_w;\n                    int u5 = uv5 / kernel_w;\n                    int u6 = uv6 / kernel_w;\n                    int u7 = uv7 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n                    int v4 = uv4 % kernel_w;\n                    int v5 = uv5 % kernel_w;\n                    int v6 = uv6 % kernel_w;\n                    int v7 = uv7 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n                    const Mat img4 = bottom_blob.channel(p4);\n                    const Mat img5 = bottom_blob.channel(p5);\n                    const Mat img6 = bottom_blob.channel(p6);\n                    const Mat img7 = bottom_blob.channel(p7);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int x02 = dx2 + dilation_w * v0;\n                    int x03 = dx3 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int y02 = dy2 + dilation_h * u0;\n                    int y03 = dy3 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int x12 = dx2 + dilation_w * v1;\n               
     int x13 = dx3 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n                    int y12 = dy2 + dilation_h * u1;\n                    int y13 = dy3 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int x21 = dx1 + dilation_w * v2;\n                    int x22 = dx2 + dilation_w * v2;\n                    int x23 = dx3 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int y21 = dy1 + dilation_h * u2;\n                    int y22 = dy2 + dilation_h * u2;\n                    int y23 = dy3 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int x31 = dx1 + dilation_w * v3;\n                    int x32 = dx2 + dilation_w * v3;\n                    int x33 = dx3 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n                    int y31 = dy1 + dilation_h * u3;\n                    int y32 = dy2 + dilation_h * u3;\n                    int y33 = dy3 + dilation_h * u3;\n\n                    int x40 = dx0 + dilation_w * v4;\n                    int x41 = dx1 + dilation_w * v4;\n                    int x42 = dx2 + dilation_w * v4;\n                    int x43 = dx3 + dilation_w * v4;\n                    int y40 = dy0 + dilation_h * u4;\n                    int y41 = dy1 + dilation_h * u4;\n                    int y42 = dy2 + dilation_h * u4;\n                    int y43 = dy3 + dilation_h * u4;\n\n                    int x50 = dx0 + dilation_w * v5;\n                    int x51 = dx1 + dilation_w * v5;\n                    int x52 = dx2 + dilation_w * v5;\n                    int x53 = dx3 + dilation_w * v5;\n                    int y50 = dy0 + dilation_h * u5;\n                    int y51 = dy1 + dilation_h * u5;\n                    int y52 = dy2 + dilation_h * u5;\n                    int y53 = dy3 + dilation_h * u5;\n\n       
             int x60 = dx0 + dilation_w * v6;\n                    int x61 = dx1 + dilation_w * v6;\n                    int x62 = dx2 + dilation_w * v6;\n                    int x63 = dx3 + dilation_w * v6;\n                    int y60 = dy0 + dilation_h * u6;\n                    int y61 = dy1 + dilation_h * u6;\n                    int y62 = dy2 + dilation_h * u6;\n                    int y63 = dy3 + dilation_h * u6;\n\n                    int x70 = dx0 + dilation_w * v7;\n                    int x71 = dx1 + dilation_w * v7;\n                    int x72 = dx2 + dilation_w * v7;\n                    int x73 = dx3 + dilation_w * v7;\n                    int y70 = dy0 + dilation_h * u7;\n                    int y71 = dy1 + dilation_h * u7;\n                    int y72 = dy2 + dilation_h * u7;\n                    int y73 = dy3 + dilation_h * u7;\n\n                    const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr02 = img0.row<const signed char>(y02) + x02;\n                    const signed char* sptr03 = img0.row<const signed char>(y03) + x03;\n\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n                    const signed char* sptr12 = img1.row<const signed char>(y12) + x12;\n                    const signed char* sptr13 = img1.row<const signed char>(y13) + x13;\n\n                    const signed char* sptr20 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr21 = img2.row<const signed char>(y21) + x21;\n                    const signed char* sptr22 = img2.row<const signed char>(y22) + x22;\n                    const signed char* sptr23 = img2.row<const signed char>(y23) + x23;\n\n                    const signed char* sptr30 = img3.row<const 
signed char>(y30) + x30;\n                    const signed char* sptr31 = img3.row<const signed char>(y31) + x31;\n                    const signed char* sptr32 = img3.row<const signed char>(y32) + x32;\n                    const signed char* sptr33 = img3.row<const signed char>(y33) + x33;\n\n                    const signed char* sptr40 = img4.row<const signed char>(y40) + x40;\n                    const signed char* sptr41 = img4.row<const signed char>(y41) + x41;\n                    const signed char* sptr42 = img4.row<const signed char>(y42) + x42;\n                    const signed char* sptr43 = img4.row<const signed char>(y43) + x43;\n\n                    const signed char* sptr50 = img5.row<const signed char>(y50) + x50;\n                    const signed char* sptr51 = img5.row<const signed char>(y51) + x51;\n                    const signed char* sptr52 = img5.row<const signed char>(y52) + x52;\n                    const signed char* sptr53 = img5.row<const signed char>(y53) + x53;\n\n                    const signed char* sptr60 = img6.row<const signed char>(y60) + x60;\n                    const signed char* sptr61 = img6.row<const signed char>(y61) + x61;\n                    const signed char* sptr62 = img6.row<const signed char>(y62) + x62;\n                    const signed char* sptr63 = img6.row<const signed char>(y63) + x63;\n\n                    const signed char* sptr70 = img7.row<const signed char>(y70) + x70;\n                    const signed char* sptr71 = img7.row<const signed char>(y71) + x71;\n                    const signed char* sptr72 = img7.row<const signed char>(y72) + x72;\n                    const signed char* sptr73 = img7.row<const signed char>(y73) + x73;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr20[0];\n                    pp[3] = sptr30[0];\n                    pp[4] = sptr40[0];\n                    pp[5] = sptr50[0];\n                    pp[6] = 
sptr60[0];\n                    pp[7] = sptr70[0];\n                    pp[8] = sptr01[0];\n                    pp[9] = sptr11[0];\n                    pp[10] = sptr21[0];\n                    pp[11] = sptr31[0];\n                    pp[12] = sptr41[0];\n                    pp[13] = sptr51[0];\n                    pp[14] = sptr61[0];\n                    pp[15] = sptr71[0];\n                    pp[16] = sptr02[0];\n                    pp[17] = sptr12[0];\n                    pp[18] = sptr22[0];\n                    pp[19] = sptr32[0];\n                    pp[20] = sptr42[0];\n                    pp[21] = sptr52[0];\n                    pp[22] = sptr62[0];\n                    pp[23] = sptr72[0];\n                    pp[24] = sptr03[0];\n                    pp[25] = sptr13[0];\n                    pp[26] = sptr23[0];\n                    pp[27] = sptr33[0];\n                    pp[28] = sptr43[0];\n                    pp[29] = sptr53[0];\n                    pp[30] = sptr63[0];\n                    pp[31] = sptr73[0];\n                    pp += 32;\n                }\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n\n                    const Mat img0 
= bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int x02 = dx2 + dilation_w * v0;\n                    int x03 = dx3 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int y02 = dy2 + dilation_h * u0;\n                    int y03 = dy3 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int x12 = dx2 + dilation_w * v1;\n                    int x13 = dx3 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n                    int y12 = dy2 + dilation_h * u1;\n                    int y13 = dy3 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int x21 = dx1 + dilation_w * v2;\n                    int x22 = dx2 + dilation_w * v2;\n                    int x23 = dx3 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int y21 = dy1 + dilation_h * u2;\n                    int y22 = dy2 + dilation_h * u2;\n                    int y23 = dy3 + dilation_h * u2;\n\n                    int x30 = dx0 + dilation_w * v3;\n                    int x31 = dx1 + dilation_w * v3;\n                    int x32 = dx2 + dilation_w * v3;\n                    int x33 = dx3 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n                    int y31 = dy1 + dilation_h * u3;\n                    int y32 = dy2 + dilation_h * u3;\n                    int y33 = dy3 + dilation_h * u3;\n\n                    const signed char* 
sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr02 = img0.row<const signed char>(y02) + x02;\n                    const signed char* sptr03 = img0.row<const signed char>(y03) + x03;\n\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n                    const signed char* sptr12 = img1.row<const signed char>(y12) + x12;\n                    const signed char* sptr13 = img1.row<const signed char>(y13) + x13;\n\n                    const signed char* sptr20 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr21 = img2.row<const signed char>(y21) + x21;\n                    const signed char* sptr22 = img2.row<const signed char>(y22) + x22;\n                    const signed char* sptr23 = img2.row<const signed char>(y23) + x23;\n\n                    const signed char* sptr30 = img3.row<const signed char>(y30) + x30;\n                    const signed char* sptr31 = img3.row<const signed char>(y31) + x31;\n                    const signed char* sptr32 = img3.row<const signed char>(y32) + x32;\n                    const signed char* sptr33 = img3.row<const signed char>(y33) + x33;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr20[0];\n                    pp[3] = sptr30[0];\n                    pp[4] = sptr01[0];\n                    pp[5] = sptr11[0];\n                    pp[6] = sptr21[0];\n                    pp[7] = sptr31[0];\n                    pp[8] = sptr02[0];\n                    pp[9] = sptr12[0];\n                    pp[10] = sptr22[0];\n                    pp[11] = sptr32[0];\n                    pp[12] = sptr03[0];\n                    pp[13] = sptr13[0];\n                    pp[14] = sptr23[0];\n  
                  pp[15] = sptr33[0];\n                    pp += 16;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int x02 = dx2 + dilation_w * v0;\n                    int x03 = dx3 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int y02 = dy2 + dilation_h * u0;\n                    int y03 = dy3 + dilation_h * u0;\n\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int x12 = dx2 + dilation_w * v1;\n                    int x13 = dx3 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n                    int y12 = dy2 + dilation_h * u1;\n                    int y13 = dy3 + dilation_h * u1;\n\n                    const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr02 = img0.row<const signed char>(y02) + x02;\n                    const signed char* sptr03 = img0.row<const signed char>(y03) + x03;\n\n                    const signed char* sptr10 = img1.row<const signed 
char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n                    const signed char* sptr12 = img1.row<const signed char>(y12) + x12;\n                    const signed char* sptr13 = img1.row<const signed char>(y13) + x13;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr01[0];\n                    pp[3] = sptr11[0];\n                    pp[4] = sptr02[0];\n                    pp[5] = sptr12[0];\n                    pp[6] = sptr03[0];\n                    pp[7] = sptr13[0];\n                    pp += 8;\n                }\n            }\n            for (; kk < max_kk / elempack; kk++)\n            {\n                int p = (k / elempack + kk) / maxk;\n                int uv = (k / elempack + kk) % maxk;\n                int u = uv / kernel_w;\n                int v = uv % kernel_w;\n\n                const Mat img = bottom_blob.channel(p);\n\n                int x0 = dx0 + dilation_w * v;\n                int x1 = dx1 + dilation_w * v;\n                int x2 = dx2 + dilation_w * v;\n                int x3 = dx3 + dilation_w * v;\n                int y0 = dy0 + dilation_h * u;\n                int y1 = dy1 + dilation_h * u;\n                int y2 = dy2 + dilation_h * u;\n                int y3 = dy3 + dilation_h * u;\n\n                const signed char* sptr0 = img.row<const signed char>(y0) + x0 * elempack;\n                const signed char* sptr1 = img.row<const signed char>(y1) + x1 * elempack;\n                const signed char* sptr2 = img.row<const signed char>(y2) + x2 * elempack;\n                const signed char* sptr3 = img.row<const signed char>(y3) + x3 * elempack;\n\n                if (elempack == 8)\n                {\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x8_t _r0 = vld1_s8(sptr0);\n                    int8x8_t _r1 = vld1_s8(sptr1);\n                    int8x8_t _r2 = vld1_s8(sptr2);\n                 
   int8x8_t _r3 = vld1_s8(sptr3);\n                    vst1_s8(pp, _r0);\n                    vst1_s8(pp + 8, _r1);\n                    vst1_s8(pp + 16, _r2);\n                    vst1_s8(pp + 24, _r3);\n                    pp += 32;\n#elif __ARM_FEATURE_DOTPROD\n                    int32x2x4_t _r0123;\n                    _r0123.val[0] = vreinterpret_s32_s8(vld1_s8(sptr0));\n                    _r0123.val[1] = vreinterpret_s32_s8(vld1_s8(sptr1));\n                    _r0123.val[2] = vreinterpret_s32_s8(vld1_s8(sptr2));\n                    _r0123.val[3] = vreinterpret_s32_s8(vld1_s8(sptr3));\n                    vst4_s32((int*)pp, _r0123);\n                    pp += 32;\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                    int16x4x4_t _r0123;\n                    _r0123.val[0] = vreinterpret_s16_s8(vld1_s8(sptr0));\n                    _r0123.val[1] = vreinterpret_s16_s8(vld1_s8(sptr1));\n                    _r0123.val[2] = vreinterpret_s16_s8(vld1_s8(sptr2));\n                    _r0123.val[3] = vreinterpret_s16_s8(vld1_s8(sptr3));\n                    vst4_s16((short*)pp, _r0123);\n                    pp += 32;\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                }\n                if (elempack == 1)\n                {\n                    pp[0] = sptr0[0];\n                    pp[1] = sptr1[0];\n                    pp[2] = sptr2[0];\n                    pp[3] = sptr3[0];\n                    pp += 4;\n                }\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        int dy0 = (j + jj) / outw * stride_h;\n        int dy1 = (j + jj + 1) / outw * stride_h;\n        int dx0 = (j + jj) % outw * stride_w;\n        int dx1 = (j + jj + 1) % outw * stride_w;\n\n        if (dy0 == dy1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            if (elempack == 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                
for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int p4 = (k + kk + 4) / maxk;\n                    int p5 = (k + kk + 5) / maxk;\n                    int p6 = (k + kk + 6) / maxk;\n                    int p7 = (k + kk + 7) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int uv4 = (k + kk + 4) % maxk;\n                    int uv5 = (k + kk + 5) % maxk;\n                    int uv6 = (k + kk + 6) % maxk;\n                    int uv7 = (k + kk + 7) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int u4 = uv4 / kernel_w;\n                    int u5 = uv5 / kernel_w;\n                    int u6 = uv6 / kernel_w;\n                    int u7 = uv7 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n                    int v4 = uv4 % kernel_w;\n                    int v5 = uv5 % kernel_w;\n                    int v6 = uv6 % kernel_w;\n                    int v7 = uv7 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n                    const Mat img4 = bottom_blob.channel(p4);\n                    const Mat img5 = bottom_blob.channel(p5);\n               
     const Mat img6 = bottom_blob.channel(p6);\n                    const Mat img7 = bottom_blob.channel(p7);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int x30 = dx0 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n\n                    int x40 = dx0 + dilation_w * v4;\n                    int y40 = dy0 + dilation_h * u4;\n                    int x50 = dx0 + dilation_w * v5;\n                    int y50 = dy0 + dilation_h * u5;\n\n                    int x60 = dx0 + dilation_w * v6;\n                    int y60 = dy0 + dilation_h * u6;\n                    int x70 = dx0 + dilation_w * v7;\n                    int y70 = dy0 + dilation_h * u7;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr2 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr3 = img3.row<const signed char>(y30) + x30;\n\n                    const signed char* sptr4 = img4.row<const signed char>(y40) + x40;\n                    const signed char* sptr5 = img5.row<const signed char>(y50) + x50;\n                    const signed char* sptr6 = img6.row<const signed char>(y60) + x60;\n                    const signed char* sptr7 = img7.row<const signed char>(y70) + x70;\n\n                    pp[0] = sptr0[0];\n                    pp[1] = sptr1[0];\n                    pp[2] = sptr2[0];\n                    pp[3] = sptr3[0];\n                    pp[4] = sptr4[0];\n                    pp[5] = sptr5[0];\n                    pp[6] = sptr6[0];\n                    pp[7] = 
sptr7[0];\n                    pp[8] = sptr0[stride_w];\n                    pp[9] = sptr1[stride_w];\n                    pp[10] = sptr2[stride_w];\n                    pp[11] = sptr3[stride_w];\n                    pp[12] = sptr4[stride_w];\n                    pp[13] = sptr5[stride_w];\n                    pp[14] = sptr6[stride_w];\n                    pp[15] = sptr7[stride_w];\n                    pp += 16;\n                }\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int x20 = dx0 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int x30 = dx0 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * 
u3;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr2 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr3 = img3.row<const signed char>(y30) + x30;\n\n                    pp[0] = sptr0[0];\n                    pp[1] = sptr1[0];\n                    pp[2] = sptr2[0];\n                    pp[3] = sptr3[0];\n                    pp[4] = sptr0[stride_w];\n                    pp[5] = sptr1[stride_w];\n                    pp[6] = sptr2[stride_w];\n                    pp[7] = sptr3[stride_w];\n                    pp += 8;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int x10 = dx0 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n\n                    const signed char* sptr0 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr1 = img1.row<const signed char>(y10) + x10;\n\n                    pp[0] = sptr0[0];\n                    pp[1] = sptr1[0];\n                    pp[2] = sptr0[stride_w];\n                    pp[3] = sptr1[stride_w];\n                    pp += 4;\n                }\n            
}\n#endif // __ARM_NEON\n            for (; kk < max_kk / elempack; kk++)\n            {\n                int p = (k / elempack + kk) / maxk;\n                int uv = (k / elempack + kk) % maxk;\n                int u = uv / kernel_w;\n                int v = uv % kernel_w;\n\n                const Mat img = bottom_blob.channel(p);\n\n                int x0 = dx0 + dilation_w * v;\n                int y0 = dy0 + dilation_h * u;\n\n                const signed char* sptr = img.row<const signed char>(y0) + x0 * elempack;\n\n#if __ARM_NEON\n                if (elempack == 8)\n                {\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x8_t _r0 = vld1_s8(sptr);\n                    int8x8_t _r1 = vld1_s8(sptr + stride_w * 8);\n                    vst1_s8(pp, _r0);\n                    vst1_s8(pp + 8, _r1);\n                    pp += 16;\n#elif __ARM_FEATURE_DOTPROD\n                    int32x2x2_t _r01;\n                    _r01.val[0] = vreinterpret_s32_s8(vld1_s8(sptr));\n                    _r01.val[1] = vreinterpret_s32_s8(vld1_s8(sptr + stride_w * 8));\n                    vst2_s32((int*)pp, _r01);\n                    pp += 16;\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                    int16x4x2_t _r01;\n                    _r01.val[0] = vreinterpret_s16_s8(vld1_s8(sptr));\n                    _r01.val[1] = vreinterpret_s16_s8(vld1_s8(sptr + stride_w * 8));\n                    vst2_s16((short*)pp, _r01);\n                    pp += 16;\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                }\n#endif // __ARM_NEON\n                if (elempack == 1)\n                {\n                    pp[0] = sptr[0];\n                    pp[1] = sptr[stride_w];\n                    pp += 2;\n                }\n            }\n        }\n        else\n        {\n            int kk = 0;\n#if __ARM_NEON\n            if (elempack == 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n           
     for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int p4 = (k + kk + 4) / maxk;\n                    int p5 = (k + kk + 5) / maxk;\n                    int p6 = (k + kk + 6) / maxk;\n                    int p7 = (k + kk + 7) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int uv4 = (k + kk + 4) % maxk;\n                    int uv5 = (k + kk + 5) % maxk;\n                    int uv6 = (k + kk + 6) % maxk;\n                    int uv7 = (k + kk + 7) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int u4 = uv4 / kernel_w;\n                    int u5 = uv5 / kernel_w;\n                    int u6 = uv6 / kernel_w;\n                    int u7 = uv7 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n                    int v4 = uv4 % kernel_w;\n                    int v5 = uv5 % kernel_w;\n                    int v6 = uv6 % kernel_w;\n                    int v7 = uv7 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n                    const Mat img4 = bottom_blob.channel(p4);\n                    const Mat img5 = bottom_blob.channel(p5);\n          
          const Mat img6 = bottom_blob.channel(p6);\n                    const Mat img7 = bottom_blob.channel(p7);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n\n                    int x20 = dx0 + dilation_w * v2;\n                    int x21 = dx1 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int y21 = dy1 + dilation_h * u2;\n                    int x30 = dx0 + dilation_w * v3;\n                    int x31 = dx1 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n                    int y31 = dy1 + dilation_h * u3;\n\n                    int x40 = dx0 + dilation_w * v4;\n                    int x41 = dx1 + dilation_w * v4;\n                    int y40 = dy0 + dilation_h * u4;\n                    int y41 = dy1 + dilation_h * u4;\n                    int x50 = dx0 + dilation_w * v5;\n                    int x51 = dx1 + dilation_w * v5;\n                    int y50 = dy0 + dilation_h * u5;\n                    int y51 = dy1 + dilation_h * u5;\n\n                    int x60 = dx0 + dilation_w * v6;\n                    int x61 = dx1 + dilation_w * v6;\n                    int y60 = dy0 + dilation_h * u6;\n                    int y61 = dy1 + dilation_h * u6;\n                    int x70 = dx0 + dilation_w * v7;\n                    int x71 = dx1 + dilation_w * v7;\n                    int y70 = dy0 + dilation_h * u7;\n                    int y71 = dy1 + dilation_h * u7;\n\n                    const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = 
img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n                    const signed char* sptr20 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr21 = img2.row<const signed char>(y21) + x21;\n                    const signed char* sptr30 = img3.row<const signed char>(y30) + x30;\n                    const signed char* sptr31 = img3.row<const signed char>(y31) + x31;\n\n                    const signed char* sptr40 = img4.row<const signed char>(y40) + x40;\n                    const signed char* sptr41 = img4.row<const signed char>(y41) + x41;\n                    const signed char* sptr50 = img5.row<const signed char>(y50) + x50;\n                    const signed char* sptr51 = img5.row<const signed char>(y51) + x51;\n                    const signed char* sptr60 = img6.row<const signed char>(y60) + x60;\n                    const signed char* sptr61 = img6.row<const signed char>(y61) + x61;\n                    const signed char* sptr70 = img7.row<const signed char>(y70) + x70;\n                    const signed char* sptr71 = img7.row<const signed char>(y71) + x71;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr20[0];\n                    pp[3] = sptr30[0];\n                    pp[4] = sptr40[0];\n                    pp[5] = sptr50[0];\n                    pp[6] = sptr60[0];\n                    pp[7] = sptr70[0];\n                    pp[8] = sptr01[0];\n                    pp[9] = sptr11[0];\n                    pp[10] = sptr21[0];\n                    pp[11] = sptr31[0];\n                    pp[12] = sptr41[0];\n                    pp[13] = sptr51[0];\n                    pp[14] = sptr61[0];\n                    pp[15] = sptr71[0];\n                    pp += 16;\n                }\n#endif // 
__ARM_FEATURE_MATMUL_INT8\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int p2 = (k + kk + 2) / maxk;\n                    int p3 = (k + kk + 3) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int uv2 = (k + kk + 2) % maxk;\n                    int uv3 = (k + kk + 3) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int u2 = uv2 / kernel_w;\n                    int u3 = uv3 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n                    int v2 = uv2 % kernel_w;\n                    int v3 = uv3 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n                    const Mat img2 = bottom_blob.channel(p2);\n                    const Mat img3 = bottom_blob.channel(p3);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n                    int x20 = dx0 + dilation_w * v2;\n                    int x21 = dx1 + dilation_w * v2;\n                    int y20 = dy0 + dilation_h * u2;\n                    int y21 = dy1 + dilation_h * u2;\n                    int x30 = dx0 + dilation_w * v3;\n                    int x31 = dx1 + dilation_w * v3;\n                    int y30 = dy0 + dilation_h * u3;\n                    int y31 = dy1 + dilation_h * u3;\n\n          
          const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n                    const signed char* sptr20 = img2.row<const signed char>(y20) + x20;\n                    const signed char* sptr21 = img2.row<const signed char>(y21) + x21;\n                    const signed char* sptr30 = img3.row<const signed char>(y30) + x30;\n                    const signed char* sptr31 = img3.row<const signed char>(y31) + x31;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr20[0];\n                    pp[3] = sptr30[0];\n                    pp[4] = sptr01[0];\n                    pp[5] = sptr11[0];\n                    pp[6] = sptr21[0];\n                    pp[7] = sptr31[0];\n                    pp += 8;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 1 < max_kk; kk += 2)\n                {\n                    int p0 = (k + kk) / maxk;\n                    int p1 = (k + kk + 1) / maxk;\n                    int uv0 = (k + kk) % maxk;\n                    int uv1 = (k + kk + 1) % maxk;\n                    int u0 = uv0 / kernel_w;\n                    int u1 = uv1 / kernel_w;\n                    int v0 = uv0 % kernel_w;\n                    int v1 = uv1 % kernel_w;\n\n                    const Mat img0 = bottom_blob.channel(p0);\n                    const Mat img1 = bottom_blob.channel(p1);\n\n                    int x00 = dx0 + dilation_w * v0;\n                    int x01 = dx1 + dilation_w * v0;\n                    int y00 = dy0 + dilation_h * u0;\n                    int y01 = dy1 + dilation_h * u0;\n                    int x10 = dx0 + dilation_w * v1;\n                    int x11 = dx1 + 
dilation_w * v1;\n                    int y10 = dy0 + dilation_h * u1;\n                    int y11 = dy1 + dilation_h * u1;\n\n                    const signed char* sptr00 = img0.row<const signed char>(y00) + x00;\n                    const signed char* sptr01 = img0.row<const signed char>(y01) + x01;\n                    const signed char* sptr10 = img1.row<const signed char>(y10) + x10;\n                    const signed char* sptr11 = img1.row<const signed char>(y11) + x11;\n\n                    pp[0] = sptr00[0];\n                    pp[1] = sptr10[0];\n                    pp[2] = sptr01[0];\n                    pp[3] = sptr11[0];\n                    pp += 4;\n                }\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk / elempack; kk++)\n            {\n                int p = (k / elempack + kk) / maxk;\n                int uv = (k / elempack + kk) % maxk;\n                int u = uv / kernel_w;\n                int v = uv % kernel_w;\n\n                const Mat img = bottom_blob.channel(p);\n\n                int x0 = dx0 + dilation_w * v;\n                int x1 = dx1 + dilation_w * v;\n                int y0 = dy0 + dilation_h * u;\n                int y1 = dy1 + dilation_h * u;\n\n                const signed char* sptr0 = img.row<const signed char>(y0) + x0 * elempack;\n                const signed char* sptr1 = img.row<const signed char>(y1) + x1 * elempack;\n\n#if __ARM_NEON\n                if (elempack == 8)\n                {\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x8_t _r0 = vld1_s8(sptr0);\n                    int8x8_t _r1 = vld1_s8(sptr1);\n                    vst1_s8(pp, _r0);\n                    vst1_s8(pp + 8, _r1);\n                    pp += 16;\n#elif __ARM_FEATURE_DOTPROD\n                    int32x2x2_t _r01;\n                    _r01.val[0] = vreinterpret_s32_s8(vld1_s8(sptr0));\n                    _r01.val[1] = vreinterpret_s32_s8(vld1_s8(sptr1));\n                    vst2_s32((int*)pp, 
_r01);\n                    pp += 16;\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                    int16x4x2_t _r01;\n                    _r01.val[0] = vreinterpret_s16_s8(vld1_s8(sptr0));\n                    _r01.val[1] = vreinterpret_s16_s8(vld1_s8(sptr1));\n                    vst2_s16((short*)pp, _r01);\n                    pp += 16;\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                }\n#endif // __ARM_NEON\n                if (elempack == 1)\n                {\n                    pp[0] = sptr0[0];\n                    pp[1] = sptr1[0];\n                    pp += 2;\n                }\n            }\n        }\n    }\n    for (; jj < max_jj; jj++)\n    {\n        int dy = (j + jj) / outw * stride_h;\n        int dx = (j + jj) % outw * stride_w;\n\n        int kk = 0;\n        for (; kk < max_kk / elempack; kk++)\n        {\n            int p = (k / elempack + kk) / maxk;\n            int uv = (k / elempack + kk) % maxk;\n            int u = uv / kernel_w;\n            int v = uv % kernel_w;\n\n            const Mat img = bottom_blob.channel(p);\n\n            int x = dx + dilation_w * v;\n            int y = dy + dilation_h * u;\n\n            const signed char* sptr = img.row<const signed char>(y) + x * elempack;\n\n#if __ARM_NEON\n            if (elempack == 8)\n            {\n                vst1_s8(pp, vld1_s8(sptr));\n                pp += 8;\n            }\n#endif // __ARM_NEON\n            if (elempack == 1)\n            {\n                pp[0] = sptr[0];\n                pp += 1;\n            }\n        }\n    }\n}\n\ntemplate<int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h>\n#if __ARM_FEATURE_MATMUL_INT8\nvoid convolution_im2col_input_tile_int8_i8mm(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk)\n#elif __ARM_FEATURE_DOTPROD\nvoid convolution_im2col_input_tile_int8_asimddp(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int 
max_kk)\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\nvoid convolution_im2col_input_tile_int8(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk)\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n{\n    convolution_im2col_input_tile_int8_impl(bottom_blob, B, j, max_jj, k, max_kk, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h);\n}\n\n#if __ARM_FEATURE_MATMUL_INT8\ntemplate void convolution_im2col_input_tile_int8_i8mm<1, 1, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_i8mm<3, 3, 1, 1, 1, 1>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_i8mm<3, 3, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_i8mm<5, 5, 1, 1, 1, 1>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_i8mm<5, 5, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_i8mm<7, 7, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\n#elif __ARM_FEATURE_DOTPROD\ntemplate void convolution_im2col_input_tile_int8_asimddp<1, 1, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_asimddp<3, 3, 1, 1, 1, 1>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_asimddp<3, 3, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_asimddp<5, 5, 1, 1, 1, 1>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_asimddp<5, 5, 1, 1, 2, 2>(const Mat& 
bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8_asimddp<7, 7, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\ntemplate void convolution_im2col_input_tile_int8<1, 1, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8<3, 3, 1, 1, 1, 1>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8<3, 3, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8<5, 5, 1, 1, 1, 1>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8<5, 5, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\ntemplate void convolution_im2col_input_tile_int8<7, 7, 1, 1, 2, 2>(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n\nstatic void convolution_im2col_input_tile_int8(const Mat& bottom_blob, Mat& B, int j, int max_jj, int k, int max_kk, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h)\n{\n    if (kernel_w == 1 && kernel_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n        convolution_im2col_input_tile_conv1x1s1d1_int8(bottom_blob, B, j, max_jj, k, max_kk);\n        return;\n    }\n\n    if (kernel_w == 1 && kernel_h == 1 && stride_w == 2 && stride_h == 2)\n    {\n#if __ARM_FEATURE_MATMUL_INT8\n        convolution_im2col_input_tile_int8_i8mm<1, 1, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#elif __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8_asimddp<1, 1, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n 
       convolution_im2col_input_tile_int8<1, 1, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        return;\n    }\n\n    if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n#if __ARM_FEATURE_MATMUL_INT8\n        convolution_im2col_input_tile_int8_i8mm<3, 3, 1, 1, 1, 1>(bottom_blob, B, j, max_jj, k, max_kk);\n#elif __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8_asimddp<3, 3, 1, 1, 1, 1>(bottom_blob, B, j, max_jj, k, max_kk);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8<3, 3, 1, 1, 1, 1>(bottom_blob, B, j, max_jj, k, max_kk);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        return;\n    }\n\n    if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n    {\n#if __ARM_FEATURE_MATMUL_INT8\n        convolution_im2col_input_tile_int8_i8mm<3, 3, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#elif __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8_asimddp<3, 3, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8<3, 3, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        return;\n    }\n\n    if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n    {\n#if __ARM_FEATURE_MATMUL_INT8\n        convolution_im2col_input_tile_int8_i8mm<5, 5, 1, 1, 1, 1>(bottom_blob, B, j, max_jj, k, max_kk);\n#elif __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8_asimddp<5, 5, 1, 1, 1, 1>(bottom_blob, B, j, max_jj, k, max_kk);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8<5, 5, 1, 1, 1, 
1>(bottom_blob, B, j, max_jj, k, max_kk);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        return;\n    }\n\n    if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n    {\n#if __ARM_FEATURE_MATMUL_INT8\n        convolution_im2col_input_tile_int8_i8mm<5, 5, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#elif __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8_asimddp<5, 5, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8<5, 5, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        return;\n    }\n\n    if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n    {\n#if __ARM_FEATURE_MATMUL_INT8\n        convolution_im2col_input_tile_int8_i8mm<7, 7, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#elif __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8_asimddp<7, 7, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        convolution_im2col_input_tile_int8<7, 7, 1, 1, 2, 2>(bottom_blob, B, j, max_jj, k, max_kk);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n        return;\n    }\n\n    convolution_im2col_input_tile_int8_impl(bottom_blob, B, j, max_jj, k, max_kk, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h);\n}\n\nstatic void convolution_im2col_gemm_transform_kernel_int8(const Mat& kernel, Mat& AT, int inch, int outch, int kernel_w, int kernel_h, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        convolution_im2col_gemm_transform_kernel_int8_i8mm(kernel, AT, inch, outch, kernel_w, kernel_h, opt);\n        return;\n    
}\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        convolution_im2col_gemm_transform_kernel_int8_asimddp(kernel, AT, inch, outch, kernel_w, kernel_h, opt);\n        return;\n    }\n#endif\n\n    // NCNN_LOGE(\"convolution_im2col_gemm_transform_kernel\");\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = outch;\n    const int K = inch * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk_int8(M, 0, K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n    int elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        elempack = inch % 8 == 0 ? 8 : 1;\n    }\n#endif // __ARM_NEON\n\n    // maxk-inch-outch to pa-maxk-inch/pa-outch\n    Mat A_data;\n    if (maxk == 1)\n    {\n        A_data = kernel.reshape(maxk * inch, outch);\n    }\n    else\n    {\n        Mat weight_data_r2 = kernel.reshape(maxk, inch, outch);\n\n        A_data.create(maxk * inch, outch, (size_t)1u, 1);\n\n        for (int q = 0; q < outch; q += 1)\n        {\n            signed char* g00 = A_data.row<signed char>(q);\n\n            for (int p = 0; p + (elempack - 1) < inch; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        const signed char* k00 = weight_data_r2.channel(q).row<const signed char>(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n\n    AT.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, (size_t)1u, 1);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        
const int max_ii = std::min((M - i), TILE_M);\n\n        for (int k = 0; k < K; k += TILE_K)\n        {\n            const int max_kk = std::min((K - k), TILE_K);\n\n            Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n            convolution_im2col_pack_A_tile_int8(A_data, AT_tile, i, max_ii, k, max_kk);\n        }\n    }\n}\n\nstatic int convolution_im2col_gemm_int8(const Mat& bottom_blob, Mat& top_blob, const Mat& AT, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int nT, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        return convolution_im2col_gemm_int8_i8mm(bottom_blob, top_blob, AT, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, nT, opt);\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        return convolution_im2col_gemm_int8_asimddp(bottom_blob, top_blob, AT, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, nT, opt);\n    }\n#endif\n\n    const int maxk = kernel_w * kernel_h;\n\n    const int M = top_blob.c * top_blob.elempack;\n    const int N = top_blob.w * top_blob.h;\n    const int K = bottom_blob.c * bottom_blob.elempack * maxk;\n\n    int TILE_M, TILE_N, TILE_K;\n    convolution_im2col_gemm_get_optimal_tile_mnk_int8(M, N, K, TILE_M, TILE_N, TILE_K, nT);\n\n    const int nn_M = (M + TILE_M - 1) / TILE_M;\n    const int nn_N = (N + TILE_N - 1) / TILE_N;\n    const int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d -> %d %d %d\", M, N, K, TILE_M, TILE_N, TILE_K);\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 1u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    #pragma omp parallel for 
num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        // im2col\n        convolution_im2col_input_tile_int8(bottom_blob, BT_tile, j, max_jj, k, max_kk, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h);\n    }\n\n    Mat topT_tileX;\n    if (K > TILE_K)\n    {\n        topT_tileX.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT_tileX.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppj = 0; ppj < nn_M; ppj++)\n    {\n        const int i = ppj * TILE_M;\n\n        Mat topT_tile;\n        if (K > TILE_K)\n            topT_tile = topT_tileX.channel(get_omp_thread_num());\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                const Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                const Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = k + TILE_K >= K;\n\n                convolution_gemm_transB_packed_tile_int8(AT_tile, BT_tile, topT_tile, top_blob, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n        }\n    }\n\n    return 0;\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_packed.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_transform_kernel_packed(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n\n    // clang-format off\n    // *INDENT-OFF*\n#if __ARM_NEON\n#if __aarch64__\n    if (outch >= 8)\n    {\n        if (inch >= 8)\n            kernel_tm.create(8 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2);\n        else if (inch >= 4)\n            kernel_tm.create(8 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2);\n        else if (inch >= 2)\n            kernel_tm.create(8 * 2 * maxk, inch / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2);\n        else\n            kernel_tm.create(8 * maxk, inch, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2);\n    }\n    else\n#endif // __aarch64__\n    if (outch >= 4)\n    {\n#if __aarch64__\n        if (inch >= 8)\n            kernel_tm.create(4 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2);\n        else\n#endif // __aarch64__\n        if (inch >= 4)\n            kernel_tm.create(4 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2);\n        else if (inch >= 2)\n            kernel_tm.create(4 * 2 * maxk, inch / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2);\n        else\n            kernel_tm.create(4 * maxk, inch, outch / 4 + (outch % 4) / 2 + outch % 2);\n    }\n    else\n#endif // __ARM_NEON\n    if (outch >= 2)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inch >= 8)\n            kernel_tm.create(2 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 2 + outch % 2);\n 
       else\n#endif // __aarch64__\n        if (inch >= 4)\n            kernel_tm.create(2 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 2 + outch % 2);\n        else\n#endif // __ARM_NEON\n        if (inch >= 2)\n            kernel_tm.create(2 * 2 * maxk, inch / 2 + inch % 2, outch / 2 + outch % 2);\n        else\n            kernel_tm.create(2 * maxk, inch, outch / 2 + outch % 2);\n    }\n    else\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inch >= 8)\n            kernel_tm.create(8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch);\n        else\n#endif // __aarch64__\n        if (inch >= 4)\n            kernel_tm.create(4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch);\n        else\n#endif // __ARM_NEON\n        if (inch >= 2)\n            kernel_tm.create(2 * maxk, inch / 2 + inch % 2, outch);\n        else\n            kernel_tm.create(maxk, inch, outch);\n    }\n    // *INDENT-ON*\n    // clang-format on\n\n    int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; q + 7 < outch; q += 8)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inch * maxk;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inch * maxk;\n        const float* kptr4 = (const float*)kernel + (q + 4) * inch * maxk;\n        const float* kptr5 = (const float*)kernel + (q + 5) * inch * maxk;\n        const float* kptr6 = (const float*)kernel + (q + 6) * inch * maxk;\n        const float* kptr7 = (const float*)kernel + (q + 7) * inch * maxk;\n\n        float* g00 = kernel_tm.channel(q / 8);\n\n        int p = 0;\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p 
* maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    g00[4] = k4[k];\n                    g00[5] = k5[k];\n                    g00[6] = k6[k];\n                    g00[7] = k7[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    g00[4] = k4[k];\n                    g00[5] = k5[k];\n                    g00[6] = k6[k];\n                    g00[7] = k7[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += 
maxk;\n                    k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    g00[4] = k4[k];\n                    g00[5] = k5[k];\n                    g00[6] = k6[k];\n                    g00[7] = k7[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n            const float* k2 = kptr2 + p * maxk;\n            const float* k3 = kptr3 + p * maxk;\n            const float* k4 = kptr4 + p * maxk;\n            const float* k5 = kptr5 + p * maxk;\n            const float* k6 = kptr6 + p * maxk;\n            const float* k7 = kptr7 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k0[k];\n                g00[1] 
= k1[k];\n                g00[2] = k2[k];\n                g00[3] = k3[k];\n                g00[4] = k4[k];\n                g00[5] = k5[k];\n                g00[6] = k6[k];\n                g00[7] = k7[k];\n                g00 += 8;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; q + 3 < outch; q += 4)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inch * maxk;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inch * maxk;\n\n#if __aarch64__\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n#else\n        float* g00 = kernel_tm.channel(q / 4);\n#endif\n\n        int p = 0;\n#if __aarch64__\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = 
k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    g00[2] = k2[k];\n                    g00[3] = k3[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n            const float* k2 = kptr2 + p * maxk;\n            const float* k3 = kptr3 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k0[k];\n                g00[1] = k1[k];\n                g00[2] = k2[k];\n                g00[3] = k3[k];\n                g00 += 4;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; q + 1 < outch; q += 2)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n\n#if __aarch64__\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2);\n#elif __ARM_NEON\n        float* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2);\n#else\n        float* g00 = kernel_tm.channel(q / 
2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk + k;\n                const float* k1 = kptr1 + p * maxk + k;\n\n                g00[0] = k0[0];\n                g00[1] = k0[maxk];\n                g00[2] = k0[maxk * 2];\n                g00[3] = k0[maxk * 3];\n                g00[4] = k0[maxk * 4];\n                g00[5] = k0[maxk * 5];\n                g00[6] = k0[maxk * 6];\n                g00[7] = k0[maxk * 7];\n                g00[8] = k1[0];\n                g00[9] = k1[maxk];\n                g00[10] = k1[maxk * 2];\n                g00[11] = k1[maxk * 3];\n                g00[12] = k1[maxk * 4];\n                g00[13] = k1[maxk * 5];\n                g00[14] = k1[maxk * 6];\n                g00[15] = k1[maxk * 7];\n                g00 += 16;\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk + k;\n                const float* k1 = kptr1 + p * maxk + k;\n\n                g00[0] = k0[0];\n                g00[1] = k0[maxk];\n                g00[2] = k0[maxk * 2];\n                g00[3] = k0[maxk * 3];\n                g00[4] = k1[0];\n                g00[5] = k1[maxk];\n                g00[6] = k1[maxk * 2];\n                g00[7] = k1[maxk * 3];\n                g00 += 8;\n            }\n        }\n#endif // __ARM_NEON\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    g00[1] = k1[k];\n                    k0 += maxk;\n                    k1 
+= maxk;\n                    g00 += 2;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k0[k];\n                g00[1] = k1[k];\n                g00 += 2;\n            }\n        }\n    }\n    for (; q < outch; q++)\n    {\n        const float* kptr = (const float*)kernel + q * inch * maxk;\n\n#if __aarch64__\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2 + q % 2);\n#elif __ARM_NEON\n        float* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2 + q % 2);\n#else\n        float* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[k];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[k];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n#endif // __ARM_NEON\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[k];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n     
   for (; p < inch; p++)\n        {\n            const float* k0 = kptr + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k0[k];\n                g00++;\n            }\n        }\n    }\n}\n\nstatic void convolution_packed(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int elempack = bottom_blob.elempack;\n    const int inch = bottom_blob.c * elempack;\n\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const int outch = top_blob.c * out_elempack;\n\n    const size_t M = top_blob.cstep * out_elempack;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2 * elempack;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const float* bias_data_ptr = bias_data;\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_outch = (outch - remain_outch_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int 
outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        float* outptr = top_blob.channel(p / out_elempack);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n                float32x4_t _sum4 = vdupq_n_f32(0.f);\n                float32x4_t _sum5 = vdupq_n_f32(0.f);\n                float32x4_t _sum6 = vdupq_n_f32(0.f);\n                float32x4_t _sum7 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f32(bias_data_ptr + p);\n                    _sum1 = vld1q_f32(bias_data_ptr + p + 4);\n                }\n\n                const float* kptr = weight_data_tm.channel(p / 8);\n\n                int q = 0;\n                for (; q + 7 < inch; q += 8)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1q_f32(r0 + sok);\n                            _r1 = vld1q_f32(r0 + sok + N);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r1 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = 
vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 4], _r1, 0);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 5], _r1, 1);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 6], _r1, 2);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 7], _r1, 3);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        float32x4_t _w2 = vld1q_f32(kptr + 4 * 2);\n                        float32x4_t _w3 = vld1q_f32(kptr + 4 * 3);\n                        float32x4_t _w4 = vld1q_f32(kptr + 4 * 4);\n                        float32x4_t _w5 = vld1q_f32(kptr + 4 * 5);\n                        float32x4_t _w6 = vld1q_f32(kptr + 4 * 6);\n                        float32x4_t _w7 = vld1q_f32(kptr + 4 * 7);\n                        float32x4_t _w8 = vld1q_f32(kptr + 4 * 8);\n                        float32x4_t _w9 = vld1q_f32(kptr + 4 * 9);\n                        float32x4_t _wa = vld1q_f32(kptr + 4 * 10);\n                        float32x4_t _wb = vld1q_f32(kptr + 4 * 11);\n                        float32x4_t _wc = vld1q_f32(kptr + 4 * 12);\n                        float32x4_t _wd = vld1q_f32(kptr + 4 * 13);\n                        float32x4_t _we = vld1q_f32(kptr + 4 * 14);\n                        float32x4_t _wf = vld1q_f32(kptr + 4 * 15);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                        _sum6 = 
vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w8, _r1, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w9, _r1, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _wa, _r1, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _wb, _r1, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _wc, _r1, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _wd, _r1, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _we, _r1, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _wf, _r1, 3);\n\n                        kptr += 64;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1q_f32(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        float32x4_t _w2 = vld1q_f32(kptr + 4 * 2);\n                        float32x4_t _w3 = vld1q_f32(kptr + 4 * 3);\n                        
float32x4_t _w4 = vld1q_f32(kptr + 4 * 4);\n                        float32x4_t _w5 = vld1q_f32(kptr + 4 * 5);\n                        float32x4_t _w6 = vld1q_f32(kptr + 4 * 6);\n                        float32x4_t _w7 = vld1q_f32(kptr + 4 * 7);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n\n                        kptr += 32;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        float32x4_t _w2 = vld1q_f32(kptr + 8);\n                        float32x4_t _w3 = vld1q_f32(kptr + 12);\n                        _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vfmaq_n_f32(_sum1, _w1, val0);\n                        _sum2 = vfmaq_n_f32(_sum2, _w2, val1);\n                        _sum3 = vfmaq_n_f32(_sum3, _w3, val1);\n\n                        kptr += 
16;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float32x4_t _val;\n                        // if (elempack == 1)\n                        {\n                            _val = vdupq_n_f32(r0[space_ofs[k]]);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        _sum0 = vfmaq_f32(_sum0, _w0, _val);\n                        _sum1 = vfmaq_f32(_sum1, _w1, _val);\n\n                        kptr += 8;\n                    }\n                }\n\n                _sum0 = vaddq_f32(_sum0, _sum2);\n                _sum1 = vaddq_f32(_sum1, _sum3);\n                _sum4 = vaddq_f32(_sum4, _sum6);\n                _sum5 = vaddq_f32(_sum5, _sum7);\n                _sum0 = vaddq_f32(_sum0, _sum4);\n                _sum1 = vaddq_f32(_sum1, _sum5);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                _sum1 = activation_ps(_sum1, activation_type, activation_params);\n\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr, _sum0);\n                    vst1q_f32(outptr + M, _sum1);\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    outptr[0] = vgetq_lane_f32(_sum0, 0);\n                    outptr[M] = vgetq_lane_f32(_sum0, 1);\n                    outptr[M * 2] = vgetq_lane_f32(_sum0, 2);\n                    outptr[M * 3] = vgetq_lane_f32(_sum0, 3);\n                    outptr[M * 4] = vgetq_lane_f32(_sum1, 0);\n                    outptr[M * 5] = vgetq_lane_f32(_sum1, 1);\n                    outptr[M * 6] = vgetq_lane_f32(_sum1, 
2);\n                    outptr[M * 7] = vgetq_lane_f32(_sum1, 3);\n                    outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 8;\n    nn_outch = (outch - remain_outch_start) / 4;\n#else // __aarch64__\n    nn_outch = (outch - remain_outch_start) / 4;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __aarch64__\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        float* outptr = top_blob.channel(p / out_elempack);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f32(bias_data_ptr + p);\n                }\n\n#if __aarch64__\n                const float* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n#else\n                const float* kptr = weight_data_tm.channel(p / 4);\n#endif\n\n                int q = 0;\n#if __aarch64__\n                for (; q + 7 < inch; q += 8)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n    
                    {\n                            _r0 = vld1q_f32(r0 + sok);\n                            _r1 = vld1q_f32(r0 + sok + N);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r1 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 4], _r1, 0);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 5], _r1, 1);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 6], _r1, 2);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 7], _r1, 3);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        float32x4_t _w2 = vld1q_f32(kptr + 8);\n                        float32x4_t _w3 = vld1q_f32(kptr + 12);\n                        float32x4_t _w4 = vld1q_f32(kptr + 16);\n                        float32x4_t _w5 = vld1q_f32(kptr + 20);\n                        float32x4_t _w6 = vld1q_f32(kptr + 24);\n                        float32x4_t _w7 = vld1q_f32(kptr + 28);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w4, _r1, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w5, _r1, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w6, _r1, 2);\n                        
_sum3 = vfmaq_laneq_f32(_sum3, _w7, _r1, 3);\n\n                        kptr += 32;\n                    }\n                }\n#endif // __aarch64__\n                for (; q + 3 < inch; q += 4)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1q_f32(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        float32x4_t _w2 = vld1q_f32(kptr + 8);\n                        float32x4_t _w3 = vld1q_f32(kptr + 12);\n#if __aarch64__\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n#else\n                        _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_r0), 0);\n                        _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_r0), 1);\n                        _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_r0), 0);\n                        _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_r0), 
1);\n#endif\n\n                        kptr += 16;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n#if __aarch64__\n                        _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vfmaq_n_f32(_sum1, _w1, val1);\n#else\n                        _sum0 = vmlaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vmlaq_n_f32(_sum1, _w1, val1);\n#endif\n\n                        kptr += 8;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float32x4_t _val;\n                        // if (elempack == 1)\n                        {\n                            _val = vdupq_n_f32(r0[space_ofs[k]]);\n                        }\n\n                        float32x4_t _w = vld1q_f32(kptr);\n#if __aarch64__\n                        _sum0 = vfmaq_f32(_sum0, _val, _w);\n#else\n                        _sum0 = vmlaq_f32(_sum0, _val, _w);\n#endif\n\n                        kptr += 4;\n                    }\n                }\n\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n           
     _sum0 = vaddq_f32(_sum0, _sum2);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr, _sum0);\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    outptr[0] = vgetq_lane_f32(_sum0, 0);\n                    outptr[M] = vgetq_lane_f32(_sum0, 1);\n                    outptr[M * 2] = vgetq_lane_f32(_sum0, 2);\n                    outptr[M * 3] = vgetq_lane_f32(_sum0, 3);\n                    outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 4;\n    nn_outch = (outch - remain_outch_start) / 2;\n#else // __ARM_NEON\n    nn_outch = (outch - remain_outch_start) / 2;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __ARM_NEON\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum0 = bias_data_ptr[p];\n                    sum1 = bias_data_ptr[p + 1];\n                }\n\n#if __aarch64__\n                const float* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n#elif __ARM_NEON\n                const float* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2);\n#else\n                const float* kptr = 
weight_data_tm.channel(p / 2);\n#endif\n\n                int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1q_f32(r0 + sok);\n                            _r1 = vld1q_f32(r0 + sok + N);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r1 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 4], _r1, 0);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 5], _r1, 1);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 6], _r1, 2);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 7], _r1, 3);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        float32x4_t _w2 = vld1q_f32(kptr + 8);\n                        float32x4_t _w3 = vld1q_f32(kptr + 12);\n       
                 _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n                        _sum2 = vfmaq_f32(_sum2, _r0, _w2);\n                        _sum3 = vfmaq_f32(_sum3, _r1, _w3);\n\n                        kptr += 16;\n                    }\n                }\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n                sum0 += vaddvq_f32(_sum0);\n                sum1 += vaddvq_f32(_sum2);\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n#else  // __aarch64__\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n#endif // __aarch64__\n                for (; q + 3 < inch; q += 4)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1q_f32(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n#if __aarch64__\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r0, _w1);\n#else\n                     
   _sum0 = vmlaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vmlaq_f32(_sum1, _r0, _w1);\n#endif\n\n                        kptr += 8;\n                    }\n                }\n#if __aarch64__\n                sum0 += vaddvq_f32(_sum0);\n                sum1 += vaddvq_f32(_sum1);\n#else\n                float32x2_t _ss0 = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n                float32x2_t _ss1 = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n                float32x2_t _ss = vpadd_f32(_ss0, _ss1);\n                sum0 += vget_lane_f32(_ss, 0);\n                sum1 += vget_lane_f32(_ss, 1);\n#endif\n#endif // __ARM_NEON\n                for (; q + 1 < inch; q += 2)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        sum0 += val0 * kptr[0];\n                        sum1 += val0 * kptr[1];\n                        sum0 += val1 * kptr[2];\n                        sum1 += val1 * kptr[3];\n\n                        kptr += 4;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val;\n                        // if (elempack == 1)\n                        {\n                            val = r0[space_ofs[k]];\n                        }\n\n                        sum0 += val * kptr[0];\n                        
sum1 += val * kptr[1];\n\n                        kptr += 2;\n                    }\n                }\n\n                sum0 = activation_ss(sum0, activation_type, activation_params);\n                sum1 = activation_ss(sum1, activation_type, activation_params);\n\n                outptr0[0] = sum0;\n                outptr1[0] = sum1;\n                outptr0 += 1;\n                outptr1 += 1;\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 2;\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum = bias_data_ptr[p];\n                }\n\n#if __aarch64__\n                const float* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n#elif __ARM_NEON\n                const float* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2 + p % 2);\n#else\n                const float* kptr = weight_data_tm.channel(p / 2 + p % 2);\n#endif\n\n                int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1q_f32(r0 + sok);\n                            _r1 = vld1q_f32(r0 + sok + N);\n                        }\n                        else // if 
(elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r1 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 4], _r1, 0);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 5], _r1, 1);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 6], _r1, 2);\n                            _r1 = vsetq_lane_f32(r0[sok + N * 7], _r1, 3);\n                        }\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n\n                        kptr += 8;\n                    }\n                }\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                sum += vaddvq_f32(_sum0);\n#endif // __aarch64__\n                float32x4_t _sum = vdupq_n_f32(0.f);\n                for (; q + 3 < inch; q += 4)\n                {\n                    const float* r0 = bottom_blob.channel(q / elempack).row(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1q_f32(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float32x4_t();\n                            _r0 = vsetq_lane_f32(r0[sok], _r0, 0);\n                           
 _r0 = vsetq_lane_f32(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f32(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float32x4_t _w = vld1q_f32(kptr);\n#if __aarch64__\n                        _sum = vfmaq_f32(_sum, _r0, _w);\n#else\n                        _sum = vmlaq_f32(_sum, _r0, _w);\n#endif\n\n                        kptr += 4;\n                    }\n                }\n#if __aarch64__\n                sum += vaddvq_f32(_sum);\n#else\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                _ss = vpadd_f32(_ss, _ss);\n                sum += vget_lane_f32(_ss, 0);\n#endif\n#endif // __ARM_NEON\n                for (; q + 1 < inch; q += 2)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        sum += val0 * kptr[0];\n                        sum += val1 * kptr[1];\n\n                        kptr += 2;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const float* r0 = bottom_blob.channel(q).row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val;\n                        // if (elempack == 1)\n                        {\n                            val = r0[space_ofs[k]];\n                        }\n\n                        sum += val * kptr[0];\n\n  
                      kptr += 1;\n                    }\n                }\n\n                sum = activation_ss(sum, activation_type, activation_params);\n\n                outptr[0] = sum;\n                outptr += 1;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_packed_bf16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_transform_kernel_packed_bf16s(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n\n    // clang-format off\n    // *INDENT-OFF*\n#if __ARM_NEON\n#if __aarch64__\n    if (outch >= 8)\n    {\n        if (inch >= 8)\n            kernel_tm.create(8 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 4)\n            kernel_tm.create(8 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 2)\n            kernel_tm.create(8 * 2 * maxk, inch / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else\n            kernel_tm.create(8 * maxk, inch, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n    }\n    else\n#endif // __aarch64__\n    if (outch >= 4)\n    {\n#if __aarch64__\n        if (inch >= 8)\n            kernel_tm.create(4 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else\n#endif // __aarch64__\n        if (inch >= 4)\n            kernel_tm.create(4 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 2)\n            kernel_tm.create(4 * 2 * maxk, inch / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else\n            kernel_tm.create(4 * maxk, inch, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n    }\n    else\n#endif // __ARM_NEON\n    if (outch >= 2)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inch >= 8)\n            
kernel_tm.create(2 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 2 + outch % 2, (size_t)2u);\n        else\n#endif // __aarch64__\n        if (inch >= 4)\n            kernel_tm.create(2 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 2 + outch % 2, (size_t)2u);\n        else\n#endif // __ARM_NEON\n        if (inch >= 2)\n            kernel_tm.create(2 * 2 * maxk, inch / 2 + inch % 2, outch / 2 + outch % 2, (size_t)2u);\n        else\n            kernel_tm.create(2 * maxk, inch, outch / 2 + outch % 2, (size_t)2u);\n    }\n    else\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (inch >= 8)\n            kernel_tm.create(8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch, (size_t)2u);\n        else\n#endif // __aarch64__\n        if (inch >= 4)\n            kernel_tm.create(4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch, (size_t)2u);\n        else\n#endif // __ARM_NEON\n        if (inch >= 2)\n            kernel_tm.create(2 * maxk, inch / 2 + inch % 2, outch, (size_t)2u);\n        else\n            kernel_tm.create(maxk, inch, outch, (size_t)2u);\n    }\n    // *INDENT-ON*\n    // clang-format on\n\n    int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; q + 7 < outch; q += 8)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inch * maxk;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inch * maxk;\n        const float* kptr4 = (const float*)kernel + (q + 4) * inch * maxk;\n        const float* kptr5 = (const float*)kernel + (q + 5) * inch * maxk;\n        const float* kptr6 = (const float*)kernel + (q + 6) * inch * maxk;\n        const float* kptr7 = (const float*)kernel + (q + 7) * inch * maxk;\n\n        unsigned short* g00 = kernel_tm.channel(q / 8);\n\n        int p = 0;\n        for (; p + 7 < inch; p += 8)\n    
    {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    g00[4] = float32_to_bfloat16(k4[k]);\n                    g00[5] = float32_to_bfloat16(k5[k]);\n                    g00[6] = float32_to_bfloat16(k6[k]);\n                    g00[7] = float32_to_bfloat16(k7[k]);\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = 
float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    g00[4] = float32_to_bfloat16(k4[k]);\n                    g00[5] = float32_to_bfloat16(k5[k]);\n                    g00[6] = float32_to_bfloat16(k6[k]);\n                    g00[7] = float32_to_bfloat16(k7[k]);\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    g00[4] = float32_to_bfloat16(k4[k]);\n                    g00[5] = float32_to_bfloat16(k5[k]);\n                    g00[6] = float32_to_bfloat16(k6[k]);\n                    g00[7] = float32_to_bfloat16(k7[k]);\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    k4 += 
maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n            const float* k2 = kptr2 + p * maxk;\n            const float* k3 = kptr3 + p * maxk;\n            const float* k4 = kptr4 + p * maxk;\n            const float* k5 = kptr5 + p * maxk;\n            const float* k6 = kptr6 + p * maxk;\n            const float* k7 = kptr7 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00[1] = float32_to_bfloat16(k1[k]);\n                g00[2] = float32_to_bfloat16(k2[k]);\n                g00[3] = float32_to_bfloat16(k3[k]);\n                g00[4] = float32_to_bfloat16(k4[k]);\n                g00[5] = float32_to_bfloat16(k5[k]);\n                g00[6] = float32_to_bfloat16(k6[k]);\n                g00[7] = float32_to_bfloat16(k7[k]);\n                g00 += 8;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; q + 3 < outch; q += 4)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inch * maxk;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inch * maxk;\n\n#if __aarch64__\n        unsigned short* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n#else\n        unsigned short* g00 = kernel_tm.channel(q / 4);\n#endif\n\n        int p = 0;\n#if __aarch64__\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * 
maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    g00[3] = float32_to_bfloat16(k3[k]);\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    g00[2] = float32_to_bfloat16(k2[k]);\n                    
g00[3] = float32_to_bfloat16(k3[k]);\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n            const float* k2 = kptr2 + p * maxk;\n            const float* k3 = kptr3 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00[1] = float32_to_bfloat16(k1[k]);\n                g00[2] = float32_to_bfloat16(k2[k]);\n                g00[3] = float32_to_bfloat16(k3[k]);\n                g00 += 4;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; q + 1 < outch; q += 2)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n\n#if __aarch64__\n        unsigned short* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2);\n#elif __ARM_NEON\n        unsigned short* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2);\n#else\n        unsigned short* g00 = kernel_tm.channel(q / 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk + k;\n                const float* k1 = kptr1 + p * maxk + k;\n\n                g00[0] = float32_to_bfloat16(k0[0]);\n                g00[1] = float32_to_bfloat16(k0[maxk]);\n                g00[2] = float32_to_bfloat16(k0[maxk * 2]);\n                g00[3] = float32_to_bfloat16(k0[maxk * 3]);\n                g00[4] = float32_to_bfloat16(k0[maxk * 4]);\n                g00[5] = float32_to_bfloat16(k0[maxk * 5]);\n                g00[6] = float32_to_bfloat16(k0[maxk * 6]);\n         
       g00[7] = float32_to_bfloat16(k0[maxk * 7]);\n                g00[8] = float32_to_bfloat16(k1[0]);\n                g00[9] = float32_to_bfloat16(k1[maxk]);\n                g00[10] = float32_to_bfloat16(k1[maxk * 2]);\n                g00[11] = float32_to_bfloat16(k1[maxk * 3]);\n                g00[12] = float32_to_bfloat16(k1[maxk * 4]);\n                g00[13] = float32_to_bfloat16(k1[maxk * 5]);\n                g00[14] = float32_to_bfloat16(k1[maxk * 6]);\n                g00[15] = float32_to_bfloat16(k1[maxk * 7]);\n                g00 += 16;\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk + k;\n                const float* k1 = kptr1 + p * maxk + k;\n\n                g00[0] = float32_to_bfloat16(k0[0]);\n                g00[1] = float32_to_bfloat16(k0[maxk]);\n                g00[2] = float32_to_bfloat16(k0[maxk * 2]);\n                g00[3] = float32_to_bfloat16(k0[maxk * 3]);\n                g00[4] = float32_to_bfloat16(k1[0]);\n                g00[5] = float32_to_bfloat16(k1[maxk]);\n                g00[6] = float32_to_bfloat16(k1[maxk * 2]);\n                g00[7] = float32_to_bfloat16(k1[maxk * 3]);\n                g00 += 8;\n            }\n        }\n#endif // __ARM_NEON\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    g00[1] = float32_to_bfloat16(k1[k]);\n                    k0 += maxk;\n                    k1 += maxk;\n                    g00 += 2;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * 
maxk;\n            const float* k1 = kptr1 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00[1] = float32_to_bfloat16(k1[k]);\n                g00 += 2;\n            }\n        }\n    }\n    for (; q < outch; q++)\n    {\n        const float* kptr = (const float*)kernel + q * inch * maxk;\n\n#if __aarch64__\n        unsigned short* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2 + q % 2);\n#elif __ARM_NEON\n        unsigned short* g00 = kernel_tm.channel(q / 4 + (q % 4) / 2 + q % 2);\n#else\n        unsigned short* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n#endif // __aarch64__\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n#endif // __ARM_NEON\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = float32_to_bfloat16(k0[k]);\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n 
           const float* k0 = kptr + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = float32_to_bfloat16(k0[k]);\n                g00++;\n            }\n        }\n    }\n}\n\nstatic void convolution_packed_bf16s(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int elempack = bottom_blob.elempack;\n    const int inch = bottom_blob.c * elempack;\n\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const int outch = top_blob.c * out_elempack;\n\n    const size_t M = top_blob.cstep * out_elempack;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2 * elempack;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const float* bias_data_ptr = bias_data;\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n#if __ARM_NEON\n#if __aarch64__\n    nn_outch = (outch - remain_outch_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = 
top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        unsigned short* outptr = top_blob.channel(p / out_elempack);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n                float32x4_t _sum4 = vdupq_n_f32(0.f);\n                float32x4_t _sum5 = vdupq_n_f32(0.f);\n                float32x4_t _sum6 = vdupq_n_f32(0.f);\n                float32x4_t _sum7 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f32(bias_data_ptr + p);\n                    _sum1 = vld1q_f32(bias_data_ptr + p + 4);\n                }\n\n                const unsigned short* kptr = weight_data_tm.channel(p / 8);\n\n                int q = 0;\n                for (; q + 7 < inch; q += 8)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                            _r1 = bfloat2float(vld1_u16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x8_t _r_u16 = uint16x8_t();\n                            _r_u16 = vsetq_lane_u16(r0[sok], _r_u16, 0);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N], _r_u16, 1);\n 
                           _r_u16 = vsetq_lane_u16(r0[sok + N * 2], _r_u16, 2);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 3], _r_u16, 3);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 4], _r_u16, 4);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 5], _r_u16, 5);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 6], _r_u16, 6);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 7], _r_u16, 7);\n                            _r0 = bfloat2float(vget_low_u16(_r_u16));\n                            _r1 = bfloat2float(vget_high_u16(_r_u16));\n                        }\n\n                        uint16x8_t _w01 = vld1q_u16(kptr);\n                        uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                        uint16x8_t _w45 = vld1q_u16(kptr + 16);\n                        uint16x8_t _w67 = vld1q_u16(kptr + 24);\n                        uint16x8_t _w89 = vld1q_u16(kptr + 32);\n                        uint16x8_t _wab = vld1q_u16(kptr + 40);\n                        uint16x8_t _wcd = vld1q_u16(kptr + 48);\n                        uint16x8_t _wef = vld1q_u16(kptr + 56);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                        float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                        float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                        float32x4_t _w4 = bfloat2float(vget_low_u16(_w45));\n                        float32x4_t _w5 = bfloat2float(vget_high_u16(_w45));\n                        float32x4_t _w6 = bfloat2float(vget_low_u16(_w67));\n                        float32x4_t _w7 = bfloat2float(vget_high_u16(_w67));\n                        float32x4_t _w8 = bfloat2float(vget_low_u16(_w89));\n                        float32x4_t _w9 = bfloat2float(vget_high_u16(_w89));\n                        float32x4_t _wa = 
bfloat2float(vget_low_u16(_wab));\n                        float32x4_t _wb = bfloat2float(vget_high_u16(_wab));\n                        float32x4_t _wc = bfloat2float(vget_low_u16(_wcd));\n                        float32x4_t _wd = bfloat2float(vget_high_u16(_wcd));\n                        float32x4_t _we = bfloat2float(vget_low_u16(_wef));\n                        float32x4_t _wf = bfloat2float(vget_high_u16(_wef));\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w8, _r1, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w9, _r1, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _wa, _r1, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _wb, _r1, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _wc, _r1, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _wd, _r1, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _we, _r1, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _wf, _r1, 3);\n\n                        kptr += 64;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        
float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x4_t _r_u16 = uint16x4_t();\n                            _r_u16 = vset_lane_u16(r0[sok], _r_u16, 0);\n                            _r_u16 = vset_lane_u16(r0[sok + N], _r_u16, 1);\n                            _r_u16 = vset_lane_u16(r0[sok + N * 2], _r_u16, 2);\n                            _r_u16 = vset_lane_u16(r0[sok + N * 3], _r_u16, 3);\n                            _r0 = bfloat2float(_r_u16);\n                        }\n\n                        uint16x8_t _w01 = vld1q_u16(kptr);\n                        uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                        uint16x8_t _w45 = vld1q_u16(kptr + 16);\n                        uint16x8_t _w67 = vld1q_u16(kptr + 24);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                        float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                        float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                        float32x4_t _w4 = bfloat2float(vget_low_u16(_w45));\n                        float32x4_t _w5 = bfloat2float(vget_high_u16(_w45));\n                        float32x4_t _w6 = bfloat2float(vget_low_u16(_w67));\n                        float32x4_t _w7 = bfloat2float(vget_high_u16(_w67));\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                        _sum5 = 
vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n\n                        kptr += 32;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = bfloat16_to_float32(r0[sok]);\n                            val1 = bfloat16_to_float32(r0[sok + N]);\n                        }\n\n                        uint16x8_t _w01 = vld1q_u16(kptr);\n                        uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                        float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                        float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                        _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vfmaq_n_f32(_sum1, _w1, val0);\n                        _sum2 = vfmaq_n_f32(_sum2, _w2, val1);\n                        _sum3 = vfmaq_n_f32(_sum3, _w3, val1);\n\n                        kptr += 16;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float32x4_t _val;\n                        // 
if (elempack == 1)\n                        {\n                            _val = bfloat2float(vdup_n_u16(r0[space_ofs[k]]));\n                        }\n\n                        uint16x8_t _w = vld1q_u16(kptr);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n                        _sum0 = vfmaq_f32(_sum0, _w0, _val);\n                        _sum1 = vfmaq_f32(_sum1, _w1, _val);\n\n                        kptr += 8;\n                    }\n                }\n\n                _sum0 = vaddq_f32(_sum0, _sum2);\n                _sum1 = vaddq_f32(_sum1, _sum3);\n                _sum4 = vaddq_f32(_sum4, _sum6);\n                _sum5 = vaddq_f32(_sum5, _sum7);\n                _sum0 = vaddq_f32(_sum0, _sum4);\n                _sum1 = vaddq_f32(_sum1, _sum5);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                _sum1 = activation_ps(_sum1, activation_type, activation_params);\n\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr, float2bfloat(_sum0));\n                    vst1_u16(outptr + M, float2bfloat(_sum1));\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    uint16x4_t _sum0_u16 = float2bfloat(_sum0);\n                    uint16x4_t _sum1_u16 = float2bfloat(_sum1);\n                    outptr[0] = vget_lane_u16(_sum0_u16, 0);\n                    outptr[M] = vget_lane_u16(_sum0_u16, 1);\n                    outptr[M * 2] = vget_lane_u16(_sum0_u16, 2);\n                    outptr[M * 3] = vget_lane_u16(_sum0_u16, 3);\n                    outptr[M * 4] = vget_lane_u16(_sum1_u16, 0);\n                    outptr[M * 5] = vget_lane_u16(_sum1_u16, 1);\n                    outptr[M * 6] = vget_lane_u16(_sum1_u16, 2);\n                    outptr[M * 7] = vget_lane_u16(_sum1_u16, 3);\n      
              outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 8;\n    nn_outch = (outch - remain_outch_start) / 4;\n#else // __aarch64__\n    nn_outch = (outch - remain_outch_start) / 4;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __aarch64__\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        unsigned short* outptr = top_blob.channel(p / out_elempack);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f32(bias_data_ptr + p);\n                }\n\n#if __aarch64__\n                const unsigned short* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n#else\n                const unsigned short* kptr = weight_data_tm.channel(p / 4);\n#endif\n\n                int q = 0;\n#if __aarch64__\n                for (; q + 7 < inch; q += 8)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                    
    {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                            _r1 = bfloat2float(vld1_u16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x8_t _r_u16 = uint16x8_t();\n                            _r_u16 = vsetq_lane_u16(r0[sok], _r_u16, 0);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N], _r_u16, 1);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 2], _r_u16, 2);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 3], _r_u16, 3);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 4], _r_u16, 4);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 5], _r_u16, 5);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 6], _r_u16, 6);\n                            _r_u16 = vsetq_lane_u16(r0[sok + N * 7], _r_u16, 7);\n                            _r0 = bfloat2float(vget_low_u16(_r_u16));\n                            _r1 = bfloat2float(vget_high_u16(_r_u16));\n                        }\n\n                        uint16x8_t _w01 = vld1q_u16(kptr);\n                        uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                        uint16x8_t _w45 = vld1q_u16(kptr + 16);\n                        uint16x8_t _w67 = vld1q_u16(kptr + 24);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                        float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                        float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                        float32x4_t _w4 = bfloat2float(vget_low_u16(_w45));\n                        float32x4_t _w5 = bfloat2float(vget_high_u16(_w45));\n                        float32x4_t _w6 = bfloat2float(vget_low_u16(_w67));\n                        float32x4_t _w7 = 
bfloat2float(vget_high_u16(_w67));\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w4, _r1, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w5, _r1, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w6, _r1, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w7, _r1, 3);\n\n                        kptr += 32;\n                    }\n                }\n#endif // __aarch64__\n                for (; q + 3 < inch; q += 4)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x4_t _r_u16 = uint16x4_t();\n                            _r_u16 = vset_lane_u16(r0[sok], _r_u16, 0);\n                            _r_u16 = vset_lane_u16(r0[sok + N], _r_u16, 1);\n                            _r_u16 = vset_lane_u16(r0[sok + N * 2], _r_u16, 2);\n                            _r_u16 = vset_lane_u16(r0[sok + N * 3], _r_u16, 3);\n                            _r0 = bfloat2float(_r_u16);\n                        }\n\n                        uint16x8_t _w01 = vld1q_u16(kptr);\n                        uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n    
                    float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                        float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                        float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n#if __aarch64__\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n#else\n                        _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_r0), 0);\n                        _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_r0), 1);\n                        _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_r0), 0);\n                        _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_r0), 1);\n#endif\n\n                        kptr += 16;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = bfloat16_to_float32(r0[sok]);\n                            val1 = bfloat16_to_float32(r0[sok + N]);\n                        }\n\n                        uint16x8_t _w = vld1q_u16(kptr);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n#if __aarch64__\n                        _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vfmaq_n_f32(_sum1, _w1, val1);\n#else\n                        _sum0 = 
vmlaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vmlaq_n_f32(_sum1, _w1, val1);\n#endif\n\n                        kptr += 8;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float32x4_t _val;\n                        // if (elempack == 1)\n                        {\n                            _val = bfloat2float(vdup_n_u16(r0[space_ofs[k]]));\n                        }\n\n                        float32x4_t _w = bfloat2float(vld1_u16(kptr));\n#if __aarch64__\n                        _sum0 = vfmaq_f32(_sum0, _val, _w);\n#else\n                        _sum0 = vmlaq_f32(_sum0, _val, _w);\n#endif\n\n                        kptr += 4;\n                    }\n                }\n\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n                _sum0 = vaddq_f32(_sum0, _sum2);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr, float2bfloat(_sum0));\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    uint16x4_t _sum0_u16 = float2bfloat(_sum0);\n                    outptr[0] = vget_lane_u16(_sum0_u16, 0);\n                    outptr[M] = vget_lane_u16(_sum0_u16, 1);\n                    outptr[M * 2] = vget_lane_u16(_sum0_u16, 2);\n                    outptr[M * 3] = vget_lane_u16(_sum0_u16, 3);\n                    outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 4;\n    nn_outch = (outch - remain_outch_start) / 2;\n#else // __ARM_NEON\n    nn_outch = (outch - 
remain_outch_start) / 2;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __ARM_NEON\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n\n        unsigned short* outptr0 = top_blob.channel(p);\n        unsigned short* outptr1 = top_blob.channel(p + 1);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum0 = bias_data_ptr[p];\n                    sum1 = bias_data_ptr[p + 1];\n                }\n\n#if __aarch64__\n                const unsigned short* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n#elif __ARM_NEON\n                const unsigned short* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2);\n#else\n                const unsigned short* kptr = weight_data_tm.channel(p / 2);\n#endif\n\n                int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 
4)\n                        {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                            _r1 = bfloat2float(vld1_u16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x8_t _r01_u16 = uint16x8_t();\n                            _r01_u16 = vsetq_lane_u16(r0[sok], _r01_u16, 0);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N], _r01_u16, 1);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 2], _r01_u16, 2);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 3], _r01_u16, 3);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 4], _r01_u16, 4);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 5], _r01_u16, 5);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 6], _r01_u16, 6);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 7], _r01_u16, 7);\n                            _r0 = bfloat2float(vget_low_u16(_r01_u16));\n                            _r1 = bfloat2float(vget_high_u16(_r01_u16));\n                        }\n\n                        uint16x8_t _w01 = vld1q_u16(kptr);\n                        uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w01));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w01));\n                        float32x4_t _w2 = bfloat2float(vget_low_u16(_w23));\n                        float32x4_t _w3 = bfloat2float(vget_high_u16(_w23));\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n                        _sum2 = vfmaq_f32(_sum2, _r0, _w2);\n                        _sum3 = vfmaq_f32(_sum3, _r1, _w3);\n\n                        kptr += 16;\n                    }\n                }\n                _sum0 = vaddq_f32(_sum0, 
_sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n                sum0 += vaddvq_f32(_sum0);\n                sum1 += vaddvq_f32(_sum2);\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n#else  // __aarch64__\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n#endif // __aarch64__\n                for (; q + 3 < inch; q += 4)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x4_t _r0_u16 = uint16x4_t();\n                            _r0_u16 = vset_lane_u16(r0[sok], _r0_u16, 0);\n                            _r0_u16 = vset_lane_u16(r0[sok + N], _r0_u16, 1);\n                            _r0_u16 = vset_lane_u16(r0[sok + N * 2], _r0_u16, 2);\n                            _r0_u16 = vset_lane_u16(r0[sok + N * 3], _r0_u16, 3);\n                            _r0 = bfloat2float(_r0_u16);\n                        }\n\n                        uint16x8_t _w = vld1q_u16(kptr);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n#if __aarch64__\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r0, _w1);\n#else\n                        _sum0 = vmlaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vmlaq_f32(_sum1, _r0, _w1);\n#endif\n\n                   
     kptr += 8;\n                    }\n                }\n#if __aarch64__\n                sum0 += vaddvq_f32(_sum0);\n                sum1 += vaddvq_f32(_sum1);\n#else\n                float32x2_t _ss0 = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n                float32x2_t _ss1 = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n                float32x2_t _ss = vpadd_f32(_ss0, _ss1);\n                sum0 += vget_lane_f32(_ss, 0);\n                sum1 += vget_lane_f32(_ss, 1);\n#endif\n#endif // __ARM_NEON\n                for (; q + 1 < inch; q += 2)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = bfloat16_to_float32(r0[sok]);\n                            val1 = bfloat16_to_float32(r0[sok + N]);\n                        }\n\n                        sum0 += val0 * bfloat16_to_float32(kptr[0]);\n                        sum1 += val0 * bfloat16_to_float32(kptr[1]);\n                        sum0 += val1 * bfloat16_to_float32(kptr[2]);\n                        sum1 += val1 * bfloat16_to_float32(kptr[3]);\n\n                        kptr += 4;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val;\n                        // if (elempack == 1)\n                        {\n                            val = bfloat16_to_float32(r0[space_ofs[k]]);\n                        
}\n\n                        sum0 += val * bfloat16_to_float32(kptr[0]);\n                        sum1 += val * bfloat16_to_float32(kptr[1]);\n\n                        kptr += 2;\n                    }\n                }\n\n                sum0 = activation_ss(sum0, activation_type, activation_params);\n                sum1 = activation_ss(sum1, activation_type, activation_params);\n\n                outptr0[0] = float32_to_bfloat16(sum0);\n                outptr1[0] = float32_to_bfloat16(sum1);\n                outptr0 += 1;\n                outptr1 += 1;\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 2;\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        unsigned short* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum = bias_data_ptr[p];\n                }\n\n#if __aarch64__\n                const unsigned short* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n#elif __ARM_NEON\n                const unsigned short* kptr = weight_data_tm.channel(p / 4 + (p % 4) / 2 + p % 2);\n#else\n                const unsigned short* kptr = weight_data_tm.channel(p / 2 + p % 2);\n#endif\n\n                int q = 0;\n#if __ARM_NEON\n#if __aarch64__\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        
if (elempack == 4)\n                        {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                            _r1 = bfloat2float(vld1_u16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x8_t _r01_u16 = uint16x8_t();\n                            _r01_u16 = vsetq_lane_u16(r0[sok], _r01_u16, 0);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N], _r01_u16, 1);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 2], _r01_u16, 2);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 3], _r01_u16, 3);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 4], _r01_u16, 4);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 5], _r01_u16, 5);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 6], _r01_u16, 6);\n                            _r01_u16 = vsetq_lane_u16(r0[sok + N * 7], _r01_u16, 7);\n                            _r0 = bfloat2float(vget_low_u16(_r01_u16));\n                            _r1 = bfloat2float(vget_high_u16(_r01_u16));\n                        }\n\n                        uint16x8_t _w = vld1q_u16(kptr);\n                        float32x4_t _w0 = bfloat2float(vget_low_u16(_w));\n                        float32x4_t _w1 = bfloat2float(vget_high_u16(_w));\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n\n                        kptr += 8;\n                    }\n                }\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                sum += vaddvq_f32(_sum0);\n#endif // __aarch64__\n                float32x4_t _sum = vdupq_n_f32(0.f);\n                for (; q + 3 < inch; q += 4)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q / elempack).row<const unsigned short>(i * stride_h) + j * stride_w * 
elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = bfloat2float(vld1_u16(r0 + sok));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            uint16x4_t _r0_u16 = uint16x4_t();\n                            _r0_u16 = vset_lane_u16(r0[sok], _r0_u16, 0);\n                            _r0_u16 = vset_lane_u16(r0[sok + N], _r0_u16, 1);\n                            _r0_u16 = vset_lane_u16(r0[sok + N * 2], _r0_u16, 2);\n                            _r0_u16 = vset_lane_u16(r0[sok + N * 3], _r0_u16, 3);\n                            _r0 = bfloat2float(_r0_u16);\n                        }\n\n                        float32x4_t _w = bfloat2float(vld1_u16(kptr));\n#if __aarch64__\n                        _sum = vfmaq_f32(_sum, _r0, _w);\n#else\n                        _sum = vmlaq_f32(_sum, _r0, _w);\n#endif\n\n                        kptr += 4;\n                    }\n                }\n#if __aarch64__\n                sum += vaddvq_f32(_sum);\n#else\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                _ss = vpadd_f32(_ss, _ss);\n                sum += vget_lane_f32(_ss, 0);\n#endif\n#endif // __ARM_NEON\n                for (; q + 1 < inch; q += 2)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = 
bfloat16_to_float32(r0[sok]);\n                            val1 = bfloat16_to_float32(r0[sok + N]);\n                        }\n\n                        sum += val0 * bfloat16_to_float32(kptr[0]);\n                        sum += val1 * bfloat16_to_float32(kptr[1]);\n\n                        kptr += 2;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const unsigned short* r0 = bottom_blob.channel(q).row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val;\n                        // if (elempack == 1)\n                        {\n                            val = bfloat16_to_float32(r0[space_ofs[k]]);\n                        }\n\n                        sum += val * bfloat16_to_float32(kptr[0]);\n\n                        kptr += 1;\n                    }\n                }\n\n                sum = activation_ss(sum, activation_type, activation_params);\n\n                outptr[0] = float32_to_bfloat16(sum);\n                outptr += 1;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_packed_fp16s.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_transform_kernel_packed_fp16s(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n\n    // clang-format off\n    // *INDENT-OFF*\n    if (outch >= 8)\n    {\n        if (inch >= 8)\n            kernel_tm.create(8 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 4)\n            kernel_tm.create(8 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 2)\n            kernel_tm.create(8 * 2 * maxk, inch / 2 + inch % 2, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else\n            kernel_tm.create(8 * maxk, inch, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n    }\n    else if (outch >= 4)\n    {\n        if (inch >= 8)\n            kernel_tm.create(4 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 4)\n            kernel_tm.create(4 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 2)\n            kernel_tm.create(4 * 2 * maxk, inch / 2 + inch % 2, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n        else\n            kernel_tm.create(4 * maxk, inch, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)2u);\n    }\n    else if (outch >= 2)\n    {\n        if (inch >= 8)\n            kernel_tm.create(2 * 8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 4)\n            
kernel_tm.create(2 * 4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch / 2 + outch % 2, (size_t)2u);\n        else if (inch >= 2)\n            kernel_tm.create(2 * 2 * maxk, inch / 2 + inch % 2, outch / 2 + outch % 2, (size_t)2u);\n        else\n            kernel_tm.create(2 * maxk, inch, outch / 2 + outch % 2, (size_t)2u);\n    }\n    else\n    {\n        if (inch >= 8)\n            kernel_tm.create(8 * maxk, inch / 8 + (inch % 8) / 4 + (inch % 4) / 2 + inch % 2, outch, (size_t)2u);\n        else if (inch >= 4)\n            kernel_tm.create(4 * maxk, inch / 4 + (inch % 4) / 2 + inch % 2, outch, (size_t)2u);\n        else if (inch >= 2)\n            kernel_tm.create(2 * maxk, inch / 2 + inch % 2, outch, (size_t)2u);\n        else\n            kernel_tm.create(maxk, inch, outch, (size_t)2u);\n    }\n    // *INDENT-ON*\n    // clang-format on\n\n    int q = 0;\n    for (; q + 7 < outch; q += 8)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inch * maxk;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inch * maxk;\n        const float* kptr4 = (const float*)kernel + (q + 4) * inch * maxk;\n        const float* kptr5 = (const float*)kernel + (q + 5) * inch * maxk;\n        const float* kptr6 = (const float*)kernel + (q + 6) * inch * maxk;\n        const float* kptr7 = (const float*)kernel + (q + 7) * inch * maxk;\n\n        __fp16* g00 = kernel_tm.channel(q / 8);\n\n        int p = 0;\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 
= kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    g00[4] = (__fp16)k4[k];\n                    g00[5] = (__fp16)k5[k];\n                    g00[6] = (__fp16)k6[k];\n                    g00[7] = (__fp16)k7[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    g00[4] = (__fp16)k4[k];\n                    g00[5] = (__fp16)k5[k];\n                    g00[6] = (__fp16)k6[k];\n                    g00[7] = (__fp16)k7[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                 
   k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n                const float* k4 = kptr4 + p * maxk;\n                const float* k5 = kptr5 + p * maxk;\n                const float* k6 = kptr6 + p * maxk;\n                const float* k7 = kptr7 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    g00[4] = (__fp16)k4[k];\n                    g00[5] = (__fp16)k5[k];\n                    g00[6] = (__fp16)k6[k];\n                    g00[7] = (__fp16)k7[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    k4 += maxk;\n                    k5 += maxk;\n                    k6 += maxk;\n                    k7 += maxk;\n                    g00 += 8;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n            const float* k2 = kptr2 + p * maxk;\n            const float* k3 = kptr3 + p * maxk;\n            const float* k4 = kptr4 + p * maxk;\n            const float* k5 = kptr5 + p * maxk;\n            const float* k6 = kptr6 + p * maxk;\n            const float* k7 = kptr7 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                
g00[0] = (__fp16)k0[k];\n                g00[1] = (__fp16)k1[k];\n                g00[2] = (__fp16)k2[k];\n                g00[3] = (__fp16)k3[k];\n                g00[4] = (__fp16)k4[k];\n                g00[5] = (__fp16)k5[k];\n                g00[6] = (__fp16)k6[k];\n                g00[7] = (__fp16)k7[k];\n                g00 += 8;\n            }\n        }\n    }\n    for (; q + 3 < outch; q += 4)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n        const float* kptr2 = (const float*)kernel + (q + 2) * inch * maxk;\n        const float* kptr3 = (const float*)kernel + (q + 3) * inch * maxk;\n\n        __fp16* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n\n        int p = 0;\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = 
(__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n                const float* k2 = kptr2 + p * maxk;\n                const float* k3 = kptr3 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                    g00[2] = (__fp16)k2[k];\n                    g00[3] = (__fp16)k3[k];\n                    k0 += maxk;\n                    k1 += maxk;\n                    k2 += maxk;\n                    k3 += maxk;\n                    g00 += 4;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n            const float* k2 = kptr2 + p * maxk;\n            const float* k3 = kptr3 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = (__fp16)k0[k];\n                g00[1] = (__fp16)k1[k];\n                g00[2] = (__fp16)k2[k];\n                g00[3] = (__fp16)k3[k];\n                g00 += 4;\n            }\n        }\n    }\n    for (; q + 1 < outch; q += 2)\n    {\n        const float* kptr0 = (const float*)kernel + q * inch * maxk;\n        const float* kptr1 = (const float*)kernel + (q + 1) * inch * maxk;\n\n        __fp16* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2);\n\n        int p = 0;\n        for (; p + 7 < inch; p += 8)\n        {\n      
      for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk + k;\n                const float* k1 = kptr1 + p * maxk + k;\n\n                g00[0] = (__fp16)k0[0];\n                g00[1] = (__fp16)k0[maxk];\n                g00[2] = (__fp16)k0[maxk * 2];\n                g00[3] = (__fp16)k0[maxk * 3];\n                g00[4] = (__fp16)k0[maxk * 4];\n                g00[5] = (__fp16)k0[maxk * 5];\n                g00[6] = (__fp16)k0[maxk * 6];\n                g00[7] = (__fp16)k0[maxk * 7];\n                g00[8] = (__fp16)k1[0];\n                g00[9] = (__fp16)k1[maxk];\n                g00[10] = (__fp16)k1[maxk * 2];\n                g00[11] = (__fp16)k1[maxk * 3];\n                g00[12] = (__fp16)k1[maxk * 4];\n                g00[13] = (__fp16)k1[maxk * 5];\n                g00[14] = (__fp16)k1[maxk * 6];\n                g00[15] = (__fp16)k1[maxk * 7];\n                g00 += 16;\n            }\n        }\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk + k;\n                const float* k1 = kptr1 + p * maxk + k;\n\n                g00[0] = (__fp16)k0[0];\n                g00[1] = (__fp16)k0[maxk];\n                g00[2] = (__fp16)k0[maxk * 2];\n                g00[3] = (__fp16)k0[maxk * 3];\n                g00[4] = (__fp16)k1[0];\n                g00[5] = (__fp16)k1[maxk];\n                g00[6] = (__fp16)k1[maxk * 2];\n                g00[7] = (__fp16)k1[maxk * 3];\n                g00 += 8;\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr0 + p * maxk;\n                const float* k1 = kptr1 + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    g00[1] = (__fp16)k1[k];\n                
    k0 += maxk;\n                    k1 += maxk;\n                    g00 += 2;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr0 + p * maxk;\n            const float* k1 = kptr1 + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = (__fp16)k0[k];\n                g00[1] = (__fp16)k1[k];\n                g00 += 2;\n            }\n        }\n    }\n    for (; q < outch; q++)\n    {\n        const float* kptr = (const float*)kernel + q * inch * maxk;\n\n        __fp16* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2 + q % 2);\n\n        int p = 0;\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p + 1 < inch; p += 2)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const float* k0 = kptr + p * maxk;\n\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = (__fp16)k0[k];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            const float* k0 = kptr + p * maxk;\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = 
(__fp16)k0[k];\n                g00++;\n            }\n        }\n    }\n}\n\nstatic void convolution_packed_fp16s(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int elempack = bottom_blob.elempack;\n    const int inch = bottom_blob.c * elempack;\n\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const int outch = top_blob.c * out_elempack;\n\n    const size_t M = top_blob.cstep * out_elempack;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2 * elempack;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const float* bias_data_ptr = bias_data;\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n    nn_outch = (outch - remain_outch_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.channel(p / out_elempack);\n\n        for (int 
i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n                float32x4_t _sum4 = vdupq_n_f32(0.f);\n                float32x4_t _sum5 = vdupq_n_f32(0.f);\n                float32x4_t _sum6 = vdupq_n_f32(0.f);\n                float32x4_t _sum7 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f32(bias_data_ptr + p);\n                    _sum1 = vld1q_f32(bias_data_ptr + p + 4);\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8);\n\n                int q = 0;\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                            _r1 = vcvt_f32_f16(vld1_f16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            float16x8_t _r_f16 = float16x8_t();\n                            _r_f16 = vsetq_lane_f16(r0[sok], _r_f16, 0);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N], _r_f16, 1);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 2], _r_f16, 2);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 3], _r_f16, 3);\n                            _r_f16 = 
vsetq_lane_f16(r0[sok + N * 4], _r_f16, 4);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 5], _r_f16, 5);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 6], _r_f16, 6);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 7], _r_f16, 7);\n                            _r0 = vcvt_f32_f16(vget_low_f16(_r_f16));\n                            _r1 = vcvt_f32_f16(vget_high_f16(_r_f16));\n                        }\n\n                        float16x8_t _w01 = vld1q_f16(kptr);\n                        float16x8_t _w23 = vld1q_f16(kptr + 8);\n                        float16x8_t _w45 = vld1q_f16(kptr + 16);\n                        float16x8_t _w67 = vld1q_f16(kptr + 24);\n                        float16x8_t _w89 = vld1q_f16(kptr + 32);\n                        float16x8_t _wab = vld1q_f16(kptr + 40);\n                        float16x8_t _wcd = vld1q_f16(kptr + 48);\n                        float16x8_t _wef = vld1q_f16(kptr + 56);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                        float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                        float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                        float32x4_t _w4 = vcvt_f32_f16(vget_low_f16(_w45));\n                        float32x4_t _w5 = vcvt_f32_f16(vget_high_f16(_w45));\n                        float32x4_t _w6 = vcvt_f32_f16(vget_low_f16(_w67));\n                        float32x4_t _w7 = vcvt_f32_f16(vget_high_f16(_w67));\n                        float32x4_t _w8 = vcvt_f32_f16(vget_low_f16(_w89));\n                        float32x4_t _w9 = vcvt_f32_f16(vget_high_f16(_w89));\n                        float32x4_t _wa = vcvt_f32_f16(vget_low_f16(_wab));\n                        float32x4_t _wb = vcvt_f32_f16(vget_high_f16(_wab));\n                        float32x4_t _wc = vcvt_f32_f16(vget_low_f16(_wcd));\n        
                float32x4_t _wd = vcvt_f32_f16(vget_high_f16(_wcd));\n                        float32x4_t _we = vcvt_f32_f16(vget_low_f16(_wef));\n                        float32x4_t _wf = vcvt_f32_f16(vget_high_f16(_wef));\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w8, _r1, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w9, _r1, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _wa, _r1, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _wb, _r1, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _wc, _r1, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _wd, _r1, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _we, _r1, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _wf, _r1, 3);\n\n                        kptr += 64;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                        }\n                        else // if 
(elempack == 1)\n                        {\n                            float16x4_t _r_f16 = float16x4_t();\n                            _r_f16 = vset_lane_f16(r0[sok], _r_f16, 0);\n                            _r_f16 = vset_lane_f16(r0[sok + N], _r_f16, 1);\n                            _r_f16 = vset_lane_f16(r0[sok + N * 2], _r_f16, 2);\n                            _r_f16 = vset_lane_f16(r0[sok + N * 3], _r_f16, 3);\n                            _r0 = vcvt_f32_f16(_r_f16);\n                        }\n\n                        float16x8_t _w01 = vld1q_f16(kptr);\n                        float16x8_t _w23 = vld1q_f16(kptr + 8);\n                        float16x8_t _w45 = vld1q_f16(kptr + 16);\n                        float16x8_t _w67 = vld1q_f16(kptr + 24);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                        float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                        float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                        float32x4_t _w4 = vcvt_f32_f16(vget_low_f16(_w45));\n                        float32x4_t _w5 = vcvt_f32_f16(vget_high_f16(_w45));\n                        float32x4_t _w6 = vcvt_f32_f16(vget_low_f16(_w67));\n                        float32x4_t _w7 = vcvt_f32_f16(vget_high_f16(_w67));\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 0);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 1);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 1);\n                        _sum4 = vfmaq_laneq_f32(_sum4, _w4, _r0, 2);\n                        _sum5 = vfmaq_laneq_f32(_sum5, _w5, _r0, 2);\n                        _sum6 = vfmaq_laneq_f32(_sum6, _w6, _r0, 3);\n                        _sum7 = vfmaq_laneq_f32(_sum7, _w7, _r0, 3);\n\n                        kptr += 32;\n                
    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = (float)(r0[sok]);\n                            val1 = (float)(r0[sok + N]);\n                        }\n\n                        float16x8_t _w01 = vld1q_f16(kptr);\n                        float16x8_t _w23 = vld1q_f16(kptr + 8);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                        float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                        float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                        _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vfmaq_n_f32(_sum1, _w1, val0);\n                        _sum2 = vfmaq_n_f32(_sum2, _w2, val1);\n                        _sum3 = vfmaq_n_f32(_sum3, _w3, val1);\n\n                        kptr += 16;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float32x4_t _val;\n                        // if (elempack == 1)\n                        {\n                            _val = vcvt_f32_f16(vdup_n_f16(r0[space_ofs[k]]));\n                        }\n\n                        float16x8_t _w = vld1q_f16(kptr);\n                        float32x4_t _w0 = 
vcvt_f32_f16(vget_low_f16(_w));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                        _sum0 = vfmaq_f32(_sum0, _w0, _val);\n                        _sum1 = vfmaq_f32(_sum1, _w1, _val);\n\n                        kptr += 8;\n                    }\n                }\n\n                _sum0 = vaddq_f32(_sum0, _sum2);\n                _sum1 = vaddq_f32(_sum1, _sum3);\n                _sum4 = vaddq_f32(_sum4, _sum6);\n                _sum5 = vaddq_f32(_sum5, _sum7);\n                _sum0 = vaddq_f32(_sum0, _sum4);\n                _sum1 = vaddq_f32(_sum1, _sum5);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                _sum1 = activation_ps(_sum1, activation_type, activation_params);\n\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr, vcvt_f16_f32(_sum0));\n                    vst1_f16(outptr + M, vcvt_f16_f32(_sum1));\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    float16x4_t _sum0_f16 = vcvt_f16_f32(_sum0);\n                    float16x4_t _sum1_f16 = vcvt_f16_f32(_sum1);\n                    outptr[0] = vget_lane_f16(_sum0_f16, 0);\n                    outptr[M] = vget_lane_f16(_sum0_f16, 1);\n                    outptr[M * 2] = vget_lane_f16(_sum0_f16, 2);\n                    outptr[M * 3] = vget_lane_f16(_sum0_f16, 3);\n                    outptr[M * 4] = vget_lane_f16(_sum1_f16, 0);\n                    outptr[M * 5] = vget_lane_f16(_sum1_f16, 1);\n                    outptr[M * 6] = vget_lane_f16(_sum1_f16, 2);\n                    outptr[M * 7] = vget_lane_f16(_sum1_f16, 3);\n                    outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 8;\n    nn_outch = (outch - remain_outch_start) / 4;\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = 
remain_outch_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.channel(p / out_elempack);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f32(bias_data_ptr + p);\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n\n                int q = 0;\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                            _r1 = vcvt_f32_f16(vld1_f16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            float16x8_t _r_f16 = float16x8_t();\n                            _r_f16 = vsetq_lane_f16(r0[sok], _r_f16, 0);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N], _r_f16, 1);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 2], _r_f16, 2);\n 
                           _r_f16 = vsetq_lane_f16(r0[sok + N * 3], _r_f16, 3);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 4], _r_f16, 4);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 5], _r_f16, 5);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 6], _r_f16, 6);\n                            _r_f16 = vsetq_lane_f16(r0[sok + N * 7], _r_f16, 7);\n                            _r0 = vcvt_f32_f16(vget_low_f16(_r_f16));\n                            _r1 = vcvt_f32_f16(vget_high_f16(_r_f16));\n                        }\n\n                        float16x8_t _w01 = vld1q_f16(kptr);\n                        float16x8_t _w23 = vld1q_f16(kptr + 8);\n                        float16x8_t _w45 = vld1q_f16(kptr + 16);\n                        float16x8_t _w67 = vld1q_f16(kptr + 24);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                        float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                        float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                        float32x4_t _w4 = vcvt_f32_f16(vget_low_f16(_w45));\n                        float32x4_t _w5 = vcvt_f32_f16(vget_high_f16(_w45));\n                        float32x4_t _w6 = vcvt_f32_f16(vget_low_f16(_w67));\n                        float32x4_t _w7 = vcvt_f32_f16(vget_high_f16(_w67));\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w4, _r1, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w5, _r1, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w6, _r1, 2);\n                        _sum3 = 
vfmaq_laneq_f32(_sum3, _w7, _r1, 3);\n\n                        kptr += 32;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            float16x4_t _r_f16 = float16x4_t();\n                            _r_f16 = vset_lane_f16(r0[sok], _r_f16, 0);\n                            _r_f16 = vset_lane_f16(r0[sok + N], _r_f16, 1);\n                            _r_f16 = vset_lane_f16(r0[sok + N * 2], _r_f16, 2);\n                            _r_f16 = vset_lane_f16(r0[sok + N * 3], _r_f16, 3);\n                            _r0 = vcvt_f32_f16(_r_f16);\n                        }\n\n                        float16x8_t _w01 = vld1q_f16(kptr);\n                        float16x8_t _w23 = vld1q_f16(kptr + 8);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                        float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                        float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _r0, 3);\n\n                        kptr += 16;\n                    }\n                }\n   
             for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = (float)(r0[sok]);\n                            val1 = (float)(r0[sok + N]);\n                        }\n\n                        float16x8_t _w = vld1q_f16(kptr);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                        _sum0 = vfmaq_n_f32(_sum0, _w0, val0);\n                        _sum1 = vfmaq_n_f32(_sum1, _w1, val1);\n\n                        kptr += 8;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float32x4_t _val;\n                        // if (elempack == 1)\n                        {\n                            _val = vcvt_f32_f16(vdup_n_f16(r0[space_ofs[k]]));\n                        }\n\n                        float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n                        _sum0 = vfmaq_f32(_sum0, _val, _w);\n\n                        kptr += 4;\n                    }\n                }\n\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n                _sum0 = vaddq_f32(_sum0, _sum2);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n                if (out_elempack == 4)\n                {\n     
               vst1_f16(outptr, vcvt_f16_f32(_sum0));\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    float16x4_t _sum0_f16 = vcvt_f16_f32(_sum0);\n                    outptr[0] = vget_lane_f16(_sum0_f16, 0);\n                    outptr[M] = vget_lane_f16(_sum0_f16, 1);\n                    outptr[M * 2] = vget_lane_f16(_sum0_f16, 2);\n                    outptr[M * 3] = vget_lane_f16(_sum0_f16, 3);\n                    outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 4;\n    nn_outch = (outch - remain_outch_start) / 2;\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n\n        __fp16* outptr0 = top_blob.channel(p);\n        __fp16* outptr1 = top_blob.channel(p + 1);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum0 = bias_data_ptr[p];\n                    sum1 = bias_data_ptr[p + 1];\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n\n                int q = 0;\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * 
stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                            _r1 = vcvt_f32_f16(vld1_f16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            float16x8_t _r01_f16 = float16x8_t();\n                            _r01_f16 = vsetq_lane_f16(r0[sok], _r01_f16, 0);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N], _r01_f16, 1);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 2], _r01_f16, 2);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 3], _r01_f16, 3);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 4], _r01_f16, 4);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 5], _r01_f16, 5);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 6], _r01_f16, 6);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 7], _r01_f16, 7);\n                            _r0 = vcvt_f32_f16(vget_low_f16(_r01_f16));\n                            _r1 = vcvt_f32_f16(vget_high_f16(_r01_f16));\n                        }\n\n                        float16x8_t _w01 = vld1q_f16(kptr);\n                        float16x8_t _w23 = vld1q_f16(kptr + 8);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                        float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                        float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        
_sum1 = vfmaq_f32(_sum1, _r1, _w1);\n                        _sum2 = vfmaq_f32(_sum2, _r0, _w2);\n                        _sum3 = vfmaq_f32(_sum3, _r1, _w3);\n\n                        kptr += 16;\n                    }\n                }\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n                sum0 += vaddvq_f32(_sum0);\n                sum1 += vaddvq_f32(_sum2);\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            float16x4_t _r0_f16 = float16x4_t();\n                            _r0_f16 = vset_lane_f16(r0[sok], _r0_f16, 0);\n                            _r0_f16 = vset_lane_f16(r0[sok + N], _r0_f16, 1);\n                            _r0_f16 = vset_lane_f16(r0[sok + N * 2], _r0_f16, 2);\n                            _r0_f16 = vset_lane_f16(r0[sok + N * 3], _r0_f16, 3);\n                            _r0 = vcvt_f32_f16(_r0_f16);\n                        }\n\n                        float16x8_t _w = vld1q_f16(kptr);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r0, _w1);\n\n                        kptr += 8;\n                    }\n  
              }\n                sum0 += vaddvq_f32(_sum0);\n                sum1 += vaddvq_f32(_sum1);\n                for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = (float)(r0[sok]);\n                            val1 = (float)(r0[sok + N]);\n                        }\n\n                        sum0 += val0 * (float)(kptr[0]);\n                        sum1 += val0 * (float)(kptr[1]);\n                        sum0 += val1 * (float)(kptr[2]);\n                        sum1 += val1 * (float)(kptr[3]);\n\n                        kptr += 4;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val;\n                        // if (elempack == 1)\n                        {\n                            val = (float)(r0[space_ofs[k]]);\n                        }\n\n                        sum0 += val * (float)(kptr[0]);\n                        sum1 += val * (float)(kptr[1]);\n\n                        kptr += 2;\n                    }\n                }\n\n                sum0 = activation_ss(sum0, activation_type, activation_params);\n                sum1 = activation_ss(sum1, activation_type, activation_params);\n\n                outptr0[0] = (__fp16)(sum0);\n                outptr1[0] = (__fp16)(sum1);\n                outptr0 += 1;\n                outptr1 += 1;\n            }\n        
}\n    }\n    remain_outch_start += nn_outch * 2;\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        __fp16* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum = bias_data_ptr[p];\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n\n                int q = 0;\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        float32x4_t _r1;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                            _r1 = vcvt_f32_f16(vld1_f16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            float16x8_t _r01_f16 = float16x8_t();\n                            _r01_f16 = vsetq_lane_f16(r0[sok], _r01_f16, 0);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N], _r01_f16, 1);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 2], _r01_f16, 2);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 3], _r01_f16, 3);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 4], _r01_f16, 4);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 5], _r01_f16, 5);\n                       
     _r01_f16 = vsetq_lane_f16(r0[sok + N * 6], _r01_f16, 6);\n                            _r01_f16 = vsetq_lane_f16(r0[sok + N * 7], _r01_f16, 7);\n                            _r0 = vcvt_f32_f16(vget_low_f16(_r01_f16));\n                            _r1 = vcvt_f32_f16(vget_high_f16(_r01_f16));\n                        }\n\n                        float16x8_t _w = vld1q_f16(kptr);\n                        float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w));\n                        float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w));\n                        _sum0 = vfmaq_f32(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f32(_sum1, _r1, _w1);\n\n                        kptr += 8;\n                    }\n                }\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                sum += vaddvq_f32(_sum0);\n                float32x4_t _sum = vdupq_n_f32(0.f);\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float32x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vcvt_f32_f16(vld1_f16(r0 + sok));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            float16x4_t _r0_f16 = float16x4_t();\n                            _r0_f16 = vset_lane_f16(r0[sok], _r0_f16, 0);\n                            _r0_f16 = vset_lane_f16(r0[sok + N], _r0_f16, 1);\n                            _r0_f16 = vset_lane_f16(r0[sok + N * 2], _r0_f16, 2);\n                            _r0_f16 = vset_lane_f16(r0[sok + N * 3], _r0_f16, 3);\n                            _r0 = vcvt_f32_f16(_r0_f16);\n                        }\n\n                        
float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n                        _sum = vfmaq_f32(_sum, _r0, _w);\n\n                        kptr += 4;\n                    }\n                }\n                sum += vaddvq_f32(_sum);\n                for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float val0;\n                        float val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = (float)(r0[sok]);\n                            val1 = (float)(r0[sok + N]);\n                        }\n\n                        sum += val0 * (float)(kptr[0]);\n                        sum += val1 * (float)(kptr[1]);\n\n                        kptr += 2;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val;\n                        // if (elempack == 1)\n                        {\n                            val = (float)(r0[space_ofs[k]]);\n                        }\n\n                        sum += val * (float)(kptr[0]);\n\n                        kptr += 1;\n                    }\n                }\n\n                sum = activation_ss(sum, activation_type, activation_params);\n\n                outptr[0] = (__fp16)(sum);\n                outptr += 1;\n            }\n        }\n    }\n}\n\nstatic void convolution_packed_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, 
int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int elempack = bottom_blob.elempack;\n    const int inch = bottom_blob.c * elempack;\n\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const int outch = top_blob.c * out_elempack;\n\n    const size_t M = top_blob.cstep * out_elempack;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2 * elempack;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const __fp16* bias_data_ptr = bias_data;\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n    nn_outch = (outch - remain_outch_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.channel(p / out_elempack);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float16x8_t _sum0 = vdupq_n_f16(0.f);\n                float16x8_t _sum1 = vdupq_n_f16(0.f);\n                float16x8_t _sum2 = vdupq_n_f16(0.f);\n                
float16x8_t _sum3 = vdupq_n_f16(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f16(bias_data_ptr + p);\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8);\n\n                int q = 0;\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x8_t _r0;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1q_f16(r0 + sok);\n                        }\n                        else if (elempack == 4)\n                        {\n                            _r0 = vcombine_f16(vld1_f16(r0 + sok), vld1_f16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float16x8_t();\n                            _r0 = vsetq_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 3], _r0, 3);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 4], _r0, 4);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 5], _r0, 5);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 6], _r0, 6);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 7], _r0, 7);\n                        }\n\n                        float16x8_t _w0 = vld1q_f16(kptr);\n                        float16x8_t _w1 = vld1q_f16(kptr + 8);\n                        float16x8_t _w2 = vld1q_f16(kptr + 8 * 2);\n                        float16x8_t _w3 = vld1q_f16(kptr 
+ 8 * 3);\n                        float16x8_t _w4 = vld1q_f16(kptr + 8 * 4);\n                        float16x8_t _w5 = vld1q_f16(kptr + 8 * 5);\n                        float16x8_t _w6 = vld1q_f16(kptr + 8 * 6);\n                        float16x8_t _w7 = vld1q_f16(kptr + 8 * 7);\n                        _sum0 = vfmaq_laneq_f16(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_laneq_f16(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_laneq_f16(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_laneq_f16(_sum3, _w3, _r0, 3);\n                        _sum0 = vfmaq_laneq_f16(_sum0, _w4, _r0, 4);\n                        _sum1 = vfmaq_laneq_f16(_sum1, _w5, _r0, 5);\n                        _sum2 = vfmaq_laneq_f16(_sum2, _w6, _r0, 6);\n                        _sum3 = vfmaq_laneq_f16(_sum3, _w7, _r0, 7);\n\n                        kptr += 64;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1_f16(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float16x4_t();\n                            _r0 = vset_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vset_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vset_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vset_lane_f16(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float16x8_t _w0 = vld1q_f16(kptr);\n                        float16x8_t _w1 = 
vld1q_f16(kptr + 8);\n                        float16x8_t _w2 = vld1q_f16(kptr + 8 * 2);\n                        float16x8_t _w3 = vld1q_f16(kptr + 8 * 3);\n                        _sum0 = vfmaq_lane_f16(_sum0, _w0, _r0, 0);\n                        _sum1 = vfmaq_lane_f16(_sum1, _w1, _r0, 1);\n                        _sum2 = vfmaq_lane_f16(_sum2, _w2, _r0, 2);\n                        _sum3 = vfmaq_lane_f16(_sum3, _w3, _r0, 3);\n\n                        kptr += 32;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        __fp16 val0;\n                        __fp16 val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        float16x8_t _w0 = vld1q_f16(kptr);\n                        float16x8_t _w1 = vld1q_f16(kptr + 8);\n                        _sum0 = vfmaq_n_f16(_sum0, _w0, val0);\n                        _sum1 = vfmaq_n_f16(_sum1, _w1, val1);\n\n                        kptr += 16;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float16x8_t _val;\n                        // if (elempack == 1)\n                        {\n                            _val = vdupq_n_f16(r0[space_ofs[k]]);\n                        }\n\n                        float16x8_t _w0 = vld1q_f16(kptr);\n                        _sum0 = vfmaq_f16(_sum0, 
_w0, _val);\n\n                        kptr += 8;\n                    }\n                }\n\n                _sum0 = vaddq_f16(_sum0, _sum1);\n                _sum2 = vaddq_f16(_sum2, _sum3);\n                _sum0 = vaddq_f16(_sum0, _sum2);\n\n                _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr, _sum0);\n                    outptr += 8;\n                }\n                else if (out_elempack == 4)\n                {\n                    vst1_f16(outptr, vget_low_f16(_sum0));\n                    vst1_f16(outptr + M, vget_high_f16(_sum0));\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    outptr[0] = vgetq_lane_f16(_sum0, 0);\n                    outptr[M] = vgetq_lane_f16(_sum0, 1);\n                    outptr[M * 2] = vgetq_lane_f16(_sum0, 2);\n                    outptr[M * 3] = vgetq_lane_f16(_sum0, 3);\n                    outptr[M * 4] = vgetq_lane_f16(_sum0, 4);\n                    outptr[M * 5] = vgetq_lane_f16(_sum0, 5);\n                    outptr[M * 6] = vgetq_lane_f16(_sum0, 6);\n                    outptr[M * 7] = vgetq_lane_f16(_sum0, 7);\n                    outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 8;\n    nn_outch = (outch - remain_outch_start) / 4;\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int out_elempack = top_blob.elempack;\n\n        __fp16* outptr = top_blob.channel(p / out_elempack);\n\n        for (int i = 0; i < outh; i++)\n        {\n            
for (int j = 0; j < outw; j++)\n            {\n                float16x4_t _sum0 = vdup_n_f16(0.f);\n                float16x4_t _sum1 = vdup_n_f16(0.f);\n                float16x4_t _sum2 = vdup_n_f16(0.f);\n                float16x4_t _sum3 = vdup_n_f16(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1_f16(bias_data_ptr + p);\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n\n                int q = 0;\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x4_t _r0;\n                        float16x4_t _r1;\n                        if (elempack == 8)\n                        {\n                            float16x8_t _r01 = vld1q_f16(r0 + sok);\n                            _r0 = vget_low_f16(_r01);\n                            _r1 = vget_high_f16(_r01);\n                        }\n                        else if (elempack == 4)\n                        {\n                            _r0 = vld1_f16(r0 + sok);\n                            _r1 = vld1_f16(r0 + sok + N);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float16x4_t();\n                            _r1 = float16x4_t();\n                            _r0 = vset_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vset_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vset_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vset_lane_f16(r0[sok + N * 3], _r0, 3);\n                            _r1 = vset_lane_f16(r0[sok + N * 4], _r1, 0);\n                            _r1 = 
vset_lane_f16(r0[sok + N * 5], _r1, 1);\n                            _r1 = vset_lane_f16(r0[sok + N * 6], _r1, 2);\n                            _r1 = vset_lane_f16(r0[sok + N * 7], _r1, 3);\n                        }\n\n                        float16x4_t _w0 = vld1_f16(kptr);\n                        float16x4_t _w1 = vld1_f16(kptr + 4);\n                        float16x4_t _w2 = vld1_f16(kptr + 8);\n                        float16x4_t _w3 = vld1_f16(kptr + 12);\n                        float16x4_t _w4 = vld1_f16(kptr + 16);\n                        float16x4_t _w5 = vld1_f16(kptr + 20);\n                        float16x4_t _w6 = vld1_f16(kptr + 24);\n                        float16x4_t _w7 = vld1_f16(kptr + 28);\n                        _sum0 = vfma_lane_f16(_sum0, _w0, _r0, 0);\n                        _sum1 = vfma_lane_f16(_sum1, _w1, _r0, 1);\n                        _sum2 = vfma_lane_f16(_sum2, _w2, _r0, 2);\n                        _sum3 = vfma_lane_f16(_sum3, _w3, _r0, 3);\n                        _sum0 = vfma_lane_f16(_sum0, _w4, _r1, 0);\n                        _sum1 = vfma_lane_f16(_sum1, _w5, _r1, 1);\n                        _sum2 = vfma_lane_f16(_sum2, _w6, _r1, 2);\n                        _sum3 = vfma_lane_f16(_sum3, _w7, _r1, 3);\n\n                        kptr += 32;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1_f16(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = 
float16x4_t();\n                            _r0 = vset_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vset_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vset_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vset_lane_f16(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float16x4_t _w0 = vld1_f16(kptr);\n                        float16x4_t _w1 = vld1_f16(kptr + 4);\n                        float16x4_t _w2 = vld1_f16(kptr + 8);\n                        float16x4_t _w3 = vld1_f16(kptr + 12);\n                        _sum0 = vfma_lane_f16(_sum0, _w0, _r0, 0);\n                        _sum1 = vfma_lane_f16(_sum1, _w1, _r0, 1);\n                        _sum2 = vfma_lane_f16(_sum2, _w2, _r0, 2);\n                        _sum3 = vfma_lane_f16(_sum3, _w3, _r0, 3);\n\n                        kptr += 16;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        __fp16 val0;\n                        __fp16 val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        float16x4_t _w0 = vld1_f16(kptr);\n                        float16x4_t _w1 = vld1_f16(kptr + 4);\n                        _sum0 = vfma_n_f16(_sum0, _w0, val0);\n                        _sum1 = vfma_n_f16(_sum1, _w1, val1);\n\n                        kptr += 8;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j 
* stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float16x4_t _val;\n                        // if (elempack == 1)\n                        {\n                            _val = vdup_n_f16(r0[space_ofs[k]]);\n                        }\n\n                        float16x4_t _w = vld1_f16(kptr);\n                        _sum0 = vfma_f16(_sum0, _val, _w);\n\n                        kptr += 4;\n                    }\n                }\n\n                _sum0 = vadd_f16(_sum0, _sum1);\n                _sum2 = vadd_f16(_sum2, _sum3);\n                _sum0 = vadd_f16(_sum0, _sum2);\n\n                _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr, _sum0);\n                    outptr += 4;\n                }\n                else // if (out_elempack == 1)\n                {\n                    outptr[0] = vget_lane_f16(_sum0, 0);\n                    outptr[M] = vget_lane_f16(_sum0, 1);\n                    outptr[M * 2] = vget_lane_f16(_sum0, 2);\n                    outptr[M * 3] = vget_lane_f16(_sum0, 3);\n                    outptr += 1;\n                }\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 4;\n    nn_outch = (outch - remain_outch_start) / 2;\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int elempack = bottom_blob.elempack;\n        const int inch = bottom_blob.c * elempack;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n\n        __fp16* outptr0 = top_blob.channel(p);\n        __fp16* outptr1 = top_blob.channel(p + 1);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __fp16 sum0 = 0.f;\n                __fp16 sum1 
= 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum0 = bias_data_ptr[p];\n                    sum1 = bias_data_ptr[p + 1];\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n\n                int q = 0;\n                float16x8_t _sum0 = vdupq_n_f16(0.f);\n                float16x8_t _sum1 = vdupq_n_f16(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x8_t _r0;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1q_f16(r0 + sok);\n                        }\n                        else if (elempack == 4)\n                        {\n                            _r0 = vcombine_f16(vld1_f16(r0 + sok), vld1_f16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float16x8_t();\n                            _r0 = vsetq_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 3], _r0, 3);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 4], _r0, 4);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 5], _r0, 5);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 6], _r0, 6);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 7], _r0, 7);\n                        }\n\n                        float16x8_t _w0 = vld1q_f16(kptr);\n                        float16x8_t _w1 = 
vld1q_f16(kptr + 8);\n                        _sum0 = vfmaq_f16(_sum0, _r0, _w0);\n                        _sum1 = vfmaq_f16(_sum1, _r0, _w1);\n\n                        kptr += 16;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1_f16(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float16x4_t();\n                            _r0 = vset_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vset_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vset_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vset_lane_f16(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float16x4_t _w0 = vld1_f16(kptr);\n                        float16x4_t _w1 = vld1_f16(kptr + 4);\n                        _sum0 = vcombine_f16(vfma_f16(vget_low_f16(_sum0), _r0, _w0), vget_high_f16(_sum0));\n                        _sum1 = vcombine_f16(vfma_f16(vget_low_f16(_sum1), _r0, _w1), vget_high_f16(_sum1));\n\n                        kptr += 8;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        __fp16 val0;\n                        __fp16 
val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        sum0 += val0 * kptr[0];\n                        sum1 += val0 * kptr[1];\n                        sum0 += val1 * kptr[2];\n                        sum1 += val1 * kptr[3];\n\n                        kptr += 4;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        __fp16 val;\n                        // if (elempack == 1)\n                        {\n                            val = r0[space_ofs[k]];\n                        }\n\n                        sum0 += val * kptr[0];\n                        sum1 += val * kptr[1];\n\n                        kptr += 2;\n                    }\n                }\n\n                float16x4_t _ss0 = vadd_f16(vget_low_f16(_sum0), vget_high_f16(_sum0));\n                float16x4_t _ss1 = vadd_f16(vget_low_f16(_sum1), vget_high_f16(_sum1));\n                float16x4_t _ss = vpadd_f16(_ss0, _ss1);\n                _ss = vpadd_f16(_ss, _ss);\n                sum0 += vget_lane_f16(_ss, 0);\n                sum1 += vget_lane_f16(_ss, 1);\n\n                sum0 = activation_ss_f16(sum0, activation_type, activation_params);\n                sum1 = activation_ss_f16(sum1, activation_type, activation_params);\n\n                outptr0[0] = sum0;\n                outptr1[0] = sum1;\n                outptr0 += 1;\n                outptr1 += 1;\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 2;\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        __fp16* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; 
i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __fp16 sum = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum = bias_data_ptr[p];\n                }\n\n                const __fp16* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n\n                int q = 0;\n                float16x8_t _sum = vdupq_n_f16(0.f);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x8_t _r0;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1q_f16(r0 + sok);\n                        }\n                        else if (elempack == 4)\n                        {\n                            _r0 = vcombine_f16(vld1_f16(r0 + sok), vld1_f16(r0 + sok + N));\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float16x8_t();\n                            _r0 = vsetq_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vsetq_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 3], _r0, 3);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 4], _r0, 4);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 5], _r0, 5);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 6], _r0, 6);\n                            _r0 = vsetq_lane_f16(r0[sok + N * 7], _r0, 7);\n                        }\n\n                        float16x8_t _w0 = vld1q_f16(kptr);\n                        _sum = 
vfmaq_f16(_sum, _r0, _w0);\n\n                        kptr += 8;\n                    }\n                }\n                for (; q + 3 < inch; q += 4)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q / elempack).row<const __fp16>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        float16x4_t _r0;\n                        if (elempack == 4)\n                        {\n                            _r0 = vld1_f16(r0 + sok);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            _r0 = float16x4_t();\n                            _r0 = vset_lane_f16(r0[sok], _r0, 0);\n                            _r0 = vset_lane_f16(r0[sok + N], _r0, 1);\n                            _r0 = vset_lane_f16(r0[sok + N * 2], _r0, 2);\n                            _r0 = vset_lane_f16(r0[sok + N * 3], _r0, 3);\n                        }\n\n                        float16x4_t _w = vld1_f16(kptr);\n                        _sum = vcombine_f16(vfma_f16(vget_low_f16(_sum), _r0, _w), vget_high_f16(_sum));\n\n                        kptr += 4;\n                    }\n                }\n                for (; q + 1 < inch; q += 2)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const int sok = space_ofs[k];\n                        __fp16 val0;\n                        __fp16 val1;\n                        // if (elempack == 1)\n                        {\n                            val0 = r0[sok];\n                            val1 = r0[sok + N];\n                        }\n\n                        sum += val0 * kptr[0];\n                        sum += val1 * 
kptr[1];\n\n                        kptr += 2;\n                    }\n                }\n                for (; q < inch; q++)\n                {\n                    const __fp16* r0 = bottom_blob.channel(q).row<const __fp16>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        __fp16 val;\n                        // if (elempack == 1)\n                        {\n                            val = r0[space_ofs[k]];\n                        }\n\n                        sum += val * kptr[0];\n\n                        kptr += 1;\n                    }\n                }\n\n                float16x4_t _ss = vadd_f16(vget_low_f16(_sum), vget_high_f16(_sum));\n                _ss = vpadd_f16(_ss, _ss);\n                _ss = vpadd_f16(_ss, _ss);\n                sum += vget_lane_f16(_ss, 0);\n\n                sum = activation_ss_f16(sum, activation_type, activation_params);\n\n                outptr[0] = sum;\n                outptr += 1;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolution_packed_int8.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\nvoid convolution_transform_kernel_packed_int8_i8mm(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h);\nvoid convolution_packed_int8_i8mm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\nvoid convolution_transform_kernel_packed_int8_asimddp(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h);\nvoid convolution_packed_int8_asimddp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt);\n#endif\n\nstatic void convolution_transform_kernel_packed_int8(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        convolution_transform_kernel_packed_int8_i8mm(kernel, kernel_tm, inch, outch, kernel_w, kernel_h);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        convolution_transform_kernel_packed_int8_asimddp(kernel, kernel_tm, inch, outch, kernel_w, kernel_h);\n        return;\n    }\n#endif\n\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n\n    // clang-format off\n    // *INDENT-OFF*\n#if __ARM_NEON\n    if (outch >= 8)\n    {\n        if (inch >= 8)\n            kernel_tm.create(maxk, inch / 8 + 
inch % 8, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)64u, 64);\n        else\n            kernel_tm.create(maxk, inch, outch / 8 + (outch % 8) / 4 + (outch % 4) / 2 + outch % 2, (size_t)8u, 8);\n    }\n    else if (outch >= 4)\n    {\n        if (inch >= 8)\n            kernel_tm.create(maxk, inch / 8 + inch % 8, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)32u, 32);\n        else\n            kernel_tm.create(maxk, inch, outch / 4 + (outch % 4) / 2 + outch % 2, (size_t)4u, 4);\n    }\n    else\n#endif // __ARM_NEON\n    if (outch >= 2)\n    {\n#if __ARM_NEON\n        if (inch >= 8)\n            kernel_tm.create(maxk, inch / 8 + inch % 8, outch / 2 + outch % 2, (size_t)16u, 16);\n        else\n#endif // __ARM_NEON\n            kernel_tm.create(maxk, inch, outch / 2 + outch % 2, (size_t)2u, 2);\n    }\n    else\n    {\n#if __ARM_NEON\n        if (inch >= 8)\n            kernel_tm.create(maxk, inch / 8 + inch % 8, outch, (size_t)8u, 8);\n        else\n#endif // __ARM_NEON\n            kernel_tm.create(maxk, inch, outch, (size_t)1u, 1);\n    }\n    // *INDENT-ON*\n    // clang-format on\n\n    int q = 0;\n#if __ARM_NEON\n    for (; q + 7 < outch; q += 8)\n    {\n        const signed char* kptr0 = (const signed char*)kernel + q * inch * maxk;\n        const signed char* kptr1 = (const signed char*)kernel + (q + 1) * inch * maxk;\n        const signed char* kptr2 = (const signed char*)kernel + (q + 2) * inch * maxk;\n        const signed char* kptr3 = (const signed char*)kernel + (q + 3) * inch * maxk;\n        const signed char* kptr4 = (const signed char*)kernel + (q + 4) * inch * maxk;\n        const signed char* kptr5 = (const signed char*)kernel + (q + 5) * inch * maxk;\n        const signed char* kptr6 = (const signed char*)kernel + (q + 6) * inch * maxk;\n        const signed char* kptr7 = (const signed char*)kernel + (q + 7) * inch * maxk;\n\n        signed char* g00 = kernel_tm.channel(q / 8);\n\n        int p = 0;\n        for (; p 
+ 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr0 + k;\n                const signed char* k1 = kptr1 + k;\n                const signed char* k2 = kptr2 + k;\n                const signed char* k3 = kptr3 + k;\n                const signed char* k4 = kptr4 + k;\n                const signed char* k5 = kptr5 + k;\n                const signed char* k6 = kptr6 + k;\n                const signed char* k7 = kptr7 + k;\n\n#if __ARM_FEATURE_MATMUL_INT8\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[0];\n                    g00 += 1;\n                    k0 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k1[0];\n                    g00 += 1;\n                    k1 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k2[0];\n                    g00 += 1;\n                    k2 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k3[0];\n                    g00 += 1;\n                    k3 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k4[0];\n                    g00 += 1;\n                    k4 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k5[0];\n                    g00 += 1;\n                    k5 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k6[0];\n                    g00 += 1;\n                    k6 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k7[0];\n                    g00 += 1;\n                    k7 += maxk;\n       
         }\n#elif __ARM_FEATURE_DOTPROD\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[0];\n                    g00[1] = k0[maxk];\n                    g00[2] = k0[maxk * 2];\n                    g00[3] = k0[maxk * 3];\n                    g00[4] = k1[0];\n                    g00[5] = k1[maxk];\n                    g00[6] = k1[maxk * 2];\n                    g00[7] = k1[maxk * 3];\n                    g00[8] = k2[0];\n                    g00[9] = k2[maxk];\n                    g00[10] = k2[maxk * 2];\n                    g00[11] = k2[maxk * 3];\n                    g00[12] = k3[0];\n                    g00[13] = k3[maxk];\n                    g00[14] = k3[maxk * 2];\n                    g00[15] = k3[maxk * 3];\n                    g00[16] = k4[0];\n                    g00[17] = k4[maxk];\n                    g00[18] = k4[maxk * 2];\n                    g00[19] = k4[maxk * 3];\n                    g00[20] = k5[0];\n                    g00[21] = k5[maxk];\n                    g00[22] = k5[maxk * 2];\n                    g00[23] = k5[maxk * 3];\n                    g00[24] = k6[0];\n                    g00[25] = k6[maxk];\n                    g00[26] = k6[maxk * 2];\n                    g00[27] = k6[maxk * 3];\n                    g00[28] = k7[0];\n                    g00[29] = k7[maxk];\n                    g00[30] = k7[maxk * 2];\n                    g00[31] = k7[maxk * 3];\n                    g00 += 32;\n                    k0 += maxk * 4;\n                    k1 += maxk * 4;\n                    k2 += maxk * 4;\n                    k3 += maxk * 4;\n                    k4 += maxk * 4;\n                    k5 += maxk * 4;\n                    k6 += maxk * 4;\n                    k7 += maxk * 4;\n                }\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[0];\n                    g00[1] = k0[maxk];\n   
                 g00[2] = k1[0];\n                    g00[3] = k1[maxk];\n                    g00[4] = k2[0];\n                    g00[5] = k2[maxk];\n                    g00[6] = k3[0];\n                    g00[7] = k3[maxk];\n                    g00[8] = k4[0];\n                    g00[9] = k4[maxk];\n                    g00[10] = k5[0];\n                    g00[11] = k5[maxk];\n                    g00[12] = k6[0];\n                    g00[13] = k6[maxk];\n                    g00[14] = k7[0];\n                    g00[15] = k7[maxk];\n                    g00 += 16;\n                    k0 += maxk * 2;\n                    k1 += maxk * 2;\n                    k2 += maxk * 2;\n                    k3 += maxk * 2;\n                    k4 += maxk * 2;\n                    k5 += maxk * 2;\n                    k6 += maxk * 2;\n                    k7 += maxk * 2;\n                }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            }\n\n            kptr0 += maxk * 8;\n            kptr1 += maxk * 8;\n            kptr2 += maxk * 8;\n            kptr3 += maxk * 8;\n            kptr4 += maxk * 8;\n            kptr5 += maxk * 8;\n            kptr6 += maxk * 8;\n            kptr7 += maxk * 8;\n        }\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr0 + k;\n                const signed char* k1 = kptr1 + k;\n                const signed char* k2 = kptr2 + k;\n                const signed char* k3 = kptr3 + k;\n                const signed char* k4 = kptr4 + k;\n                const signed char* k5 = kptr5 + k;\n                const signed char* k6 = kptr6 + k;\n                const signed char* k7 = kptr7 + k;\n\n                g00[0] = k0[0];\n                g00[1] = k1[0];\n                g00[2] = k2[0];\n                g00[3] = k3[0];\n                g00[4] = k4[0];\n                g00[5] = k5[0];\n                g00[6] = k6[0];\n               
 g00[7] = k7[0];\n                g00 += 8;\n            }\n\n            kptr0 += maxk;\n            kptr1 += maxk;\n            kptr2 += maxk;\n            kptr3 += maxk;\n            kptr4 += maxk;\n            kptr5 += maxk;\n            kptr6 += maxk;\n            kptr7 += maxk;\n        }\n    }\n    for (; q + 3 < outch; q += 4)\n    {\n        const signed char* kptr0 = (const signed char*)kernel + q * inch * maxk;\n        const signed char* kptr1 = (const signed char*)kernel + (q + 1) * inch * maxk;\n        const signed char* kptr2 = (const signed char*)kernel + (q + 2) * inch * maxk;\n        const signed char* kptr3 = (const signed char*)kernel + (q + 3) * inch * maxk;\n\n        signed char* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n\n        int p = 0;\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr0 + k;\n                const signed char* k1 = kptr1 + k;\n                const signed char* k2 = kptr2 + k;\n                const signed char* k3 = kptr3 + k;\n\n#if __ARM_FEATURE_MATMUL_INT8\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[0];\n                    g00 += 1;\n                    k0 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k1[0];\n                    g00 += 1;\n                    k1 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k2[0];\n                    g00 += 1;\n                    k2 += maxk;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k3[0];\n                    g00 += 1;\n                    k3 += maxk;\n                }\n#elif __ARM_FEATURE_DOTPROD\n                for (int i = 0; i < 2; i++)\n                {\n                    g00[0] = k0[0];\n 
                   g00[1] = k0[maxk];\n                    g00[2] = k0[maxk * 2];\n                    g00[3] = k0[maxk * 3];\n                    g00[4] = k1[0];\n                    g00[5] = k1[maxk];\n                    g00[6] = k1[maxk * 2];\n                    g00[7] = k1[maxk * 3];\n                    g00[8] = k2[0];\n                    g00[9] = k2[maxk];\n                    g00[10] = k2[maxk * 2];\n                    g00[11] = k2[maxk * 3];\n                    g00[12] = k3[0];\n                    g00[13] = k3[maxk];\n                    g00[14] = k3[maxk * 2];\n                    g00[15] = k3[maxk * 3];\n                    g00 += 16;\n                    k0 += maxk * 4;\n                    k1 += maxk * 4;\n                    k2 += maxk * 4;\n                    k3 += maxk * 4;\n                }\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[0];\n                    g00[1] = k0[maxk];\n                    g00[2] = k1[0];\n                    g00[3] = k1[maxk];\n                    g00[4] = k2[0];\n                    g00[5] = k2[maxk];\n                    g00[6] = k3[0];\n                    g00[7] = k3[maxk];\n                    g00 += 8;\n                    k0 += maxk * 2;\n                    k1 += maxk * 2;\n                    k2 += maxk * 2;\n                    k3 += maxk * 2;\n                }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            }\n\n            kptr0 += maxk * 8;\n            kptr1 += maxk * 8;\n            kptr2 += maxk * 8;\n            kptr3 += maxk * 8;\n        }\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr0 + k;\n                const signed char* k1 = kptr1 + k;\n                const signed char* k2 = kptr2 + k;\n                const signed char* k3 = kptr3 + k;\n\n            
    g00[0] = k0[0];\n                g00[1] = k1[0];\n                g00[2] = k2[0];\n                g00[3] = k3[0];\n                g00 += 4;\n            }\n\n            kptr0 += maxk;\n            kptr1 += maxk;\n            kptr2 += maxk;\n            kptr3 += maxk;\n        }\n    }\n#endif // __ARM_NEON\n    for (; q + 1 < outch; q += 2)\n    {\n        const signed char* kptr0 = (const signed char*)kernel + q * inch * maxk;\n        const signed char* kptr1 = (const signed char*)kernel + (q + 1) * inch * maxk;\n\n#if __ARM_NEON\n        signed char* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2);\n#else\n        signed char* g00 = kernel_tm.channel(q / 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr0 + k;\n                const signed char* k1 = kptr1 + k;\n\n#if __ARM_FEATURE_DOTPROD\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[0];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k1[0];\n                    k1 += maxk;\n                    g00 += 1;\n                }\n#else  // __ARM_FEATURE_DOTPROD\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k0[0];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n                for (int i = 0; i < 4; i++)\n                {\n                    g00[0] = k1[0];\n                    k1 += maxk;\n                    g00 += 1;\n                }\n\n                for (int i = 4; i < 8; i++)\n                {\n                    g00[0] = k0[0];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n                for (int i = 4; i < 8; i++)\n                {\n     
               g00[0] = k1[0];\n                    k1 += maxk;\n                    g00 += 1;\n                }\n#endif // __ARM_FEATURE_DOTPROD\n            }\n\n            kptr0 += maxk * 8;\n            kptr1 += maxk * 8;\n        }\n#endif // __ARM_NEON\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr0 + k;\n                const signed char* k1 = kptr1 + k;\n\n                g00[0] = k0[0];\n                g00[1] = k1[0];\n                g00 += 2;\n            }\n\n            kptr0 += maxk;\n            kptr1 += maxk;\n        }\n    }\n    for (; q < outch; q++)\n    {\n        const signed char* kptr = (const signed char*)kernel + q * inch * maxk;\n\n#if __ARM_NEON\n        signed char* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + (q % 4) / 2 + q % 2);\n#else\n        signed char* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        int p = 0;\n#if __ARM_NEON\n        for (; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr + k;\n\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0[0];\n                    k0 += maxk;\n                    g00 += 1;\n                }\n            }\n\n            kptr += maxk * 8;\n        }\n#endif // __ARM_NEON\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k0 = kptr + k;\n                g00[0] = k0[0];\n                g00++;\n            }\n\n            kptr += maxk;\n        }\n    }\n}\n\nstatic void convolution_packed_int8(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_tm, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if 
(ncnn::cpu_support_arm_i8mm())\n    {\n        convolution_packed_int8_i8mm(bottom_blob, top_blob, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        convolution_packed_int8_asimddp(bottom_blob, top_blob, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        return;\n    }\n#endif\n\n    const int w = bottom_blob.w;\n    const int elempack = bottom_blob.elempack;\n    const int inch = bottom_blob.c * elempack;\n\n    const size_t N = bottom_blob.cstep * elempack;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int out_elempack = top_blob.elempack;\n    const int outch = top_blob.c * out_elempack;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2 * elempack;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n#if __ARM_NEON\n    nn_outch = (outch - remain_outch_start) / 8;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 8;\n\n        // shadowed variable for less openmp task args\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const size_t N = bottom_blob.cstep * elempack;\n        const size_t M = top_blob.cstep * out_elempack;\n\n        int* 
outptr = top_blob.channel(p / out_elempack);\n\n        int ij = 0;\n        for (; ij + 1 < outw * outh; ij += 2)\n        {\n            const int i0 = ij / outw;\n            const int i1 = (ij + 1) / outw;\n            const int j0 = ij % outw;\n            const int j1 = (ij + 1) % outw;\n\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            int32x4_t _sum3 = vdupq_n_s32(0);\n\n            const signed char* kptr = weight_data_tm.channel(p / 8);\n\n            int q = 0;\n            {\n                for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed char>(i0 * stride_h) + j0 * stride_w * elempack;\n                    const signed char* r1 = bottom_blob.channel(q / elempack).row<const signed char>(i1 * stride_h) + j1 * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n                        const signed char* r1s = r1 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        int8x8_t _r1;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                            _r1 = vld1_s8(r1s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp0[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            signed char tmp1[8] = {r1s[0], r1s[N], r1s[N * 2], r1s[N * 3], r1s[N * 4], r1s[N * 5], r1s[N * 6], r1s[N * 7]};\n                            _r0 = vld1_s8(tmp0);\n                            _r1 = vld1_s8(tmp1);\n                        }\n\n                        int8x16_t _w0 = vld1q_s8(kptr);\n         
               int8x16_t _w1 = vld1q_s8(kptr + 16);\n                        int8x16_t _w2 = vld1q_s8(kptr + 32);\n                        int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                        int8x16_t _r01 = vcombine_s8(_r0, _r1);\n                        _sum0 = vmmlaq_s32(_sum0, _r01, _w0);\n                        _sum1 = vmmlaq_s32(_sum1, _r01, _w1);\n                        _sum2 = vmmlaq_s32(_sum2, _r01, _w2);\n                        _sum3 = vmmlaq_s32(_sum3, _r01, _w3);\n#elif __ARM_FEATURE_DOTPROD\n                        _sum0 = vdotq_lane_s32(_sum0, _w0, _r0, 0);\n                        _sum1 = vdotq_lane_s32(_sum1, _w1, _r0, 0);\n                        _sum2 = vdotq_lane_s32(_sum2, _w0, _r1, 0);\n                        _sum3 = vdotq_lane_s32(_sum3, _w1, _r1, 0);\n                        _sum0 = vdotq_lane_s32(_sum0, _w2, _r0, 1);\n                        _sum1 = vdotq_lane_s32(_sum1, _w3, _r0, 1);\n                        _sum2 = vdotq_lane_s32(_sum2, _w2, _r1, 1);\n                        _sum3 = vdotq_lane_s32(_sum3, _w3, _r1, 1);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                        int16x4_t _rr0 = vreinterpret_s16_s8(_r0);\n                        int16x4_t _rr1 = vreinterpret_s16_s8(_r1);\n\n                        int8x8_t _r0ll = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 0));\n                        int8x8_t _r1ll = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 0));\n                        int8x8_t _r0hl = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 2));\n                        int8x8_t _r1hl = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 2));\n\n                        int16x8_t _s0l = vmull_s8(_r0ll, vget_low_s8(_w0));\n                        int16x8_t _s1l = vmull_s8(_r0ll, vget_high_s8(_w0));\n                        int16x8_t _s2l = vmull_s8(_r1ll, vget_low_s8(_w0));\n                        int16x8_t _s3l = vmull_s8(_r1ll, vget_high_s8(_w0));\n                        _s0l = 
vmlal_s8(_s0l, _r0hl, vget_low_s8(_w2));\n                        _s1l = vmlal_s8(_s1l, _r0hl, vget_high_s8(_w2));\n                        _s2l = vmlal_s8(_s2l, _r1hl, vget_low_s8(_w2));\n                        _s3l = vmlal_s8(_s3l, _r1hl, vget_high_s8(_w2));\n\n                        _sum0 = vpadalq_s16(_sum0, _s0l);\n                        _sum1 = vpadalq_s16(_sum1, _s1l);\n                        _sum2 = vpadalq_s16(_sum2, _s2l);\n                        _sum3 = vpadalq_s16(_sum3, _s3l);\n\n                        int8x8_t _r0lh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 1));\n                        int8x8_t _r1lh = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 1));\n                        int8x8_t _r0hh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 3));\n                        int8x8_t _r1hh = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 3));\n\n                        int16x8_t _s0h = vmull_s8(_r0lh, vget_low_s8(_w1));\n                        int16x8_t _s1h = vmull_s8(_r0lh, vget_high_s8(_w1));\n                        int16x8_t _s2h = vmull_s8(_r1lh, vget_low_s8(_w1));\n                        int16x8_t _s3h = vmull_s8(_r1lh, vget_high_s8(_w1));\n                        _s0h = vmlal_s8(_s0h, _r0hh, vget_low_s8(_w3));\n                        _s1h = vmlal_s8(_s1h, _r0hh, vget_high_s8(_w3));\n                        _s2h = vmlal_s8(_s2h, _r1hh, vget_low_s8(_w3));\n                        _s3h = vmlal_s8(_s3h, _r1hh, vget_high_s8(_w3));\n\n                        _sum0 = vpadalq_s16(_sum0, _s0h);\n                        _sum1 = vpadalq_s16(_sum1, _s1h);\n                        _sum2 = vpadalq_s16(_sum2, _s2h);\n                        _sum3 = vpadalq_s16(_sum3, _s3h);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n\n                        kptr += 64;\n                    }\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                {\n                    int32x4_t _tmp0 = vcombine_s32(vget_low_s32(_sum0), vget_low_s32(_sum1));\n                 
   int32x4_t _tmp1 = vcombine_s32(vget_low_s32(_sum2), vget_low_s32(_sum3));\n                    int32x4_t _tmp2 = vcombine_s32(vget_high_s32(_sum0), vget_high_s32(_sum1));\n                    int32x4_t _tmp3 = vcombine_s32(vget_high_s32(_sum2), vget_high_s32(_sum3));\n                    _sum0 = _tmp0;\n                    _sum1 = _tmp1;\n                    _sum2 = _tmp2;\n                    _sum3 = _tmp3;\n                }\n#endif\n            }\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i0 * stride_h) + j0 * stride_w;\n                const signed char* r1 = bottom_blob.channel(q).row<const signed char>(i1 * stride_h) + j1 * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n                    const signed char* r1s = r1 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        int8x8_t _r0 = vdup_n_s8(r0s[0]);\n                        int8x8_t _r1 = vdup_n_s8(r1s[0]);\n                        int8x8_t _w = vld1_s8(kptr);\n                        int16x8_t _s0 = vmull_s8(_r0, _w);\n                        int16x8_t _s1 = vmull_s8(_r1, _w);\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                        _sum2 = vaddw_s16(_sum2, vget_low_s16(_s1));\n                        _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n\n                        kptr += 8;\n                    }\n                }\n            }\n\n            if (out_elempack == 8)\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                vst1q_s32(outptr + 8, _sum2);\n                vst1q_s32(outptr + 12, _sum3);\n                outptr += 16;\n            }\n            if (out_elempack == 
4)\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum2);\n                vst1q_s32(outptr + M, _sum1);\n                vst1q_s32(outptr + M + 4, _sum3);\n                outptr += 8;\n            }\n            if (out_elempack == 1)\n            {\n                outptr[0] = vgetq_lane_s32(_sum0, 0);\n                outptr[1] = vgetq_lane_s32(_sum2, 0);\n                outptr[M] = vgetq_lane_s32(_sum0, 1);\n                outptr[M + 1] = vgetq_lane_s32(_sum2, 1);\n                outptr[M * 2] = vgetq_lane_s32(_sum0, 2);\n                outptr[M * 2 + 1] = vgetq_lane_s32(_sum2, 2);\n                outptr[M * 3] = vgetq_lane_s32(_sum0, 3);\n                outptr[M * 3 + 1] = vgetq_lane_s32(_sum2, 3);\n                outptr[M * 4] = vgetq_lane_s32(_sum1, 0);\n                outptr[M * 4 + 1] = vgetq_lane_s32(_sum3, 0);\n                outptr[M * 5] = vgetq_lane_s32(_sum1, 1);\n                outptr[M * 5 + 1] = vgetq_lane_s32(_sum3, 1);\n                outptr[M * 6] = vgetq_lane_s32(_sum1, 2);\n                outptr[M * 6 + 1] = vgetq_lane_s32(_sum3, 2);\n                outptr[M * 7] = vgetq_lane_s32(_sum1, 3);\n                outptr[M * 7 + 1] = vgetq_lane_s32(_sum3, 3);\n                outptr += 2;\n            }\n        }\n        for (; ij < outw * outh; ij++)\n        {\n            const int i = ij / outw;\n            const int j = ij % outw;\n\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            int32x4_t _sum3 = vdupq_n_s32(0);\n\n            const signed char* kptr = weight_data_tm.channel(p / 8);\n\n            int q = 0;\n            {\n                for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed char>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < 
maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            _r0 = vld1_s8(tmp);\n                        }\n\n                        int8x16_t _w0 = vld1q_s8(kptr);\n                        int8x16_t _w1 = vld1q_s8(kptr + 16);\n                        int8x16_t _w2 = vld1q_s8(kptr + 32);\n                        int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                        int8x16_t _r00 = vcombine_s8(_r0, _r0);\n                        _sum0 = vdotq_s32(_sum0, _r00, _w0);\n                        _sum1 = vdotq_s32(_sum1, _r00, _w1);\n                        _sum2 = vdotq_s32(_sum2, _r00, _w2);\n                        _sum3 = vdotq_s32(_sum3, _r00, _w3);\n#elif __ARM_FEATURE_DOTPROD\n                        _sum0 = vdotq_lane_s32(_sum0, _w0, _r0, 0);\n                        _sum1 = vdotq_lane_s32(_sum1, _w1, _r0, 0);\n                        _sum2 = vdotq_lane_s32(_sum2, _w2, _r0, 1);\n                        _sum3 = vdotq_lane_s32(_sum3, _w3, _r0, 1);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                        int16x4_t _rr0 = vreinterpret_s16_s8(_r0);\n                        int8x8_t _r0ll = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 0));\n                        int8x8_t _r0lh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 1));\n                        int8x8_t _r0hl = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 2));\n                        int8x8_t _r0hh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 3));\n\n                        int16x8_t 
_s0l = vmull_s8(_r0ll, vget_low_s8(_w0));\n                        int16x8_t _s1l = vmull_s8(_r0ll, vget_high_s8(_w0));\n                        int16x8_t _s0h = vmull_s8(_r0lh, vget_low_s8(_w1));\n                        int16x8_t _s1h = vmull_s8(_r0lh, vget_high_s8(_w1));\n                        _s0l = vmlal_s8(_s0l, _r0hl, vget_low_s8(_w2));\n                        _s1l = vmlal_s8(_s1l, _r0hl, vget_high_s8(_w2));\n                        _s0h = vmlal_s8(_s0h, _r0hh, vget_low_s8(_w3));\n                        _s1h = vmlal_s8(_s1h, _r0hh, vget_high_s8(_w3));\n\n                        _sum0 = vpadalq_s16(_sum0, _s0l);\n                        _sum1 = vpadalq_s16(_sum1, _s1l);\n                        _sum2 = vpadalq_s16(_sum2, _s0h);\n                        _sum3 = vpadalq_s16(_sum3, _s1h);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n\n                        kptr += 64;\n                    }\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                {\n                    _sum0 = vpaddq_s32(_sum0, _sum1);\n                    _sum1 = vpaddq_s32(_sum2, _sum3);\n                }\n#else\n                {\n                    _sum0 = vaddq_s32(_sum0, _sum2);\n                    _sum1 = vaddq_s32(_sum1, _sum3);\n                }\n#endif\n            }\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i * stride_h) + j * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        int8x8_t _val = vdup_n_s8(r0s[0]);\n                        int8x8_t _w = vld1_s8(kptr);\n                        int16x8_t _s0 = vmull_s8(_val, _w);\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n       
                 kptr += 8;\n                    }\n                }\n            }\n\n            if (out_elempack == 8)\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                outptr += 8;\n            }\n            if (out_elempack == 4)\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + M, _sum1);\n                outptr += 4;\n            }\n            if (out_elempack == 1)\n            {\n                outptr[0] = vgetq_lane_s32(_sum0, 0);\n                outptr[M] = vgetq_lane_s32(_sum0, 1);\n                outptr[M * 2] = vgetq_lane_s32(_sum0, 2);\n                outptr[M * 3] = vgetq_lane_s32(_sum0, 3);\n                outptr[M * 4] = vgetq_lane_s32(_sum1, 0);\n                outptr[M * 5] = vgetq_lane_s32(_sum1, 1);\n                outptr[M * 6] = vgetq_lane_s32(_sum1, 2);\n                outptr[M * 7] = vgetq_lane_s32(_sum1, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 8;\n    nn_outch = (outch - remain_outch_start) / 4;\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 4;\n\n        // shadowed variable for less openmp task args\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const size_t N = bottom_blob.cstep * elempack;\n        const size_t M = top_blob.cstep * out_elempack;\n\n        int* outptr = top_blob.channel(p / out_elempack);\n\n        int ij = 0;\n        for (; ij + 1 < outw * outh; ij += 2)\n        {\n            const int i0 = ij / outw;\n            const int i1 = (ij + 1) / outw;\n            const int j0 = ij % outw;\n            const int j1 = (ij + 1) % outw;\n\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n\n            const signed char* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n\n            int q = 0;\n    
        {\n                for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed char>(i0 * stride_h) + j0 * stride_w * elempack;\n                    const signed char* r1 = bottom_blob.channel(q / elempack).row<const signed char>(i1 * stride_h) + j1 * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n                        const signed char* r1s = r1 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        int8x8_t _r1;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                            _r1 = vld1_s8(r1s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp0[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            signed char tmp1[8] = {r1s[0], r1s[N], r1s[N * 2], r1s[N * 3], r1s[N * 4], r1s[N * 5], r1s[N * 6], r1s[N * 7]};\n                            _r0 = vld1_s8(tmp0);\n                            _r1 = vld1_s8(tmp1);\n                        }\n\n                        int8x16_t _w0 = vld1q_s8(kptr);\n                        int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                        int8x16_t _r01 = vcombine_s8(_r0, _r1);\n                        _sum0 = vmmlaq_s32(_sum0, _r01, _w0);\n                        _sum1 = vmmlaq_s32(_sum1, _r01, _w1);\n#elif __ARM_FEATURE_DOTPROD\n                        _sum0 = vdotq_lane_s32(_sum0, _w0, _r0, 0);\n                        _sum1 = vdotq_lane_s32(_sum1, _w0, _r1, 0);\n                        _sum0 = vdotq_lane_s32(_sum0, _w1, _r0, 1);\n                        _sum1 = vdotq_lane_s32(_sum1, _w1, _r1, 
1);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                        int16x4_t _rr0 = vreinterpret_s16_s8(_r0);\n                        int16x4_t _rr1 = vreinterpret_s16_s8(_r1);\n\n                        int8x8_t _r0ll = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 0));\n                        int8x8_t _r1ll = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 0));\n                        int8x8_t _r0lh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 1));\n                        int8x8_t _r1lh = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 1));\n\n                        int16x8_t _s0l = vmull_s8(_r0ll, vget_low_s8(_w0));\n                        int16x8_t _s1l = vmull_s8(_r1ll, vget_low_s8(_w0));\n                        int16x8_t _s0h = vmull_s8(_r0lh, vget_high_s8(_w0));\n                        int16x8_t _s1h = vmull_s8(_r1lh, vget_high_s8(_w0));\n\n                        int8x8_t _r0hl = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 2));\n                        int8x8_t _r1hl = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 2));\n                        int8x8_t _r0hh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 3));\n                        int8x8_t _r1hh = vreinterpret_s8_s16(vdup_lane_s16(_rr1, 3));\n\n                        _s0l = vmlal_s8(_s0l, _r0hl, vget_low_s8(_w1));\n                        _s1l = vmlal_s8(_s1l, _r1hl, vget_low_s8(_w1));\n                        _s0h = vmlal_s8(_s0h, _r0hh, vget_high_s8(_w1));\n                        _s1h = vmlal_s8(_s1h, _r1hh, vget_high_s8(_w1));\n\n                        _sum0 = vpadalq_s16(_sum0, _s0l);\n                        _sum1 = vpadalq_s16(_sum1, _s1l);\n                        _sum0 = vpadalq_s16(_sum0, _s0h);\n                        _sum1 = vpadalq_s16(_sum1, _s1h);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n\n                        kptr += 32;\n                    }\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                {\n                    int32x4_t _tmp0 = 
vcombine_s32(vget_low_s32(_sum0), vget_low_s32(_sum1));\n                    int32x4_t _tmp1 = vcombine_s32(vget_high_s32(_sum0), vget_high_s32(_sum1));\n                    _sum0 = _tmp0;\n                    _sum1 = _tmp1;\n                }\n#endif\n            }\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i0 * stride_h) + j0 * stride_w;\n                const signed char* r1 = bottom_blob.channel(q).row<const signed char>(i1 * stride_h) + j1 * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n                    const signed char* r1s = r1 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        int8x8_t _r0 = vdup_n_s8(r0s[0]);\n                        int8x8_t _r1 = vdup_n_s8(r1s[0]);\n                        int8x8_t _r01 = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_r0), vreinterpret_s32_s8(_r1)).val[0]);\n                        int8x8_t _w = vld1_s8(kptr);\n                        int8x8_t _ww = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_w), vreinterpret_s32_s8(_w)).val[0]);\n                        int16x8_t _s01 = vmull_s8(_r01, _ww);\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n\n                        kptr += 4;\n                    }\n                }\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_s32(outptr, _sum0);\n                vst1q_s32(outptr + 4, _sum1);\n                outptr += 8;\n            }\n            if (out_elempack == 1)\n            {\n                int32x4x2_t _sum01 = vzipq_s32(_sum0, _sum1);\n                vst1_s32(outptr, vget_low_s32(_sum01.val[0]));\n                vst1_s32(outptr + M, vget_high_s32(_sum01.val[0]));\n          
      vst1_s32(outptr + M * 2, vget_low_s32(_sum01.val[1]));\n                vst1_s32(outptr + M * 3, vget_high_s32(_sum01.val[1]));\n                outptr += 2;\n            }\n        }\n        for (; ij < outw * outh; ij++)\n        {\n            const int i = ij / outw;\n            const int j = ij % outw;\n\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n\n            const signed char* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4);\n\n            int q = 0;\n            {\n                for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed char>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            _r0 = vld1_s8(tmp);\n                        }\n\n                        int8x16_t _w0 = vld1q_s8(kptr);\n                        int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                        int8x16_t _r00 = vcombine_s8(_r0, _r0);\n                        _sum0 = vdotq_s32(_sum0, _r00, _w0);\n                        _sum1 = vdotq_s32(_sum1, _r00, _w1);\n#elif __ARM_FEATURE_DOTPROD\n                        _sum0 = vdotq_lane_s32(_sum0, _w0, _r0, 0);\n                        _sum1 = vdotq_lane_s32(_sum1, _w1, _r0, 1);\n#else  // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n                        int16x4_t _rr0 = 
vreinterpret_s16_s8(_r0);\n                        int8x8_t _r0ll = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 0));\n                        int8x8_t _r0lh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 1));\n                        int8x8_t _r0hl = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 2));\n                        int8x8_t _r0hh = vreinterpret_s8_s16(vdup_lane_s16(_rr0, 3));\n\n                        int16x8_t _sl = vmull_s8(_r0ll, vget_low_s8(_w0));\n                        int16x8_t _sh = vmull_s8(_r0lh, vget_high_s8(_w0));\n                        _sl = vmlal_s8(_sl, _r0hl, vget_low_s8(_w1));\n                        _sh = vmlal_s8(_sh, _r0hh, vget_high_s8(_w1));\n\n                        _sum0 = vpadalq_s16(_sum0, _sl);\n                        _sum1 = vpadalq_s16(_sum1, _sh);\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n\n                        kptr += 32;\n                    }\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                {\n                    _sum0 = vpaddq_s32(_sum0, _sum1);\n                }\n#else\n                {\n                    _sum0 = vaddq_s32(_sum0, _sum1);\n                }\n#endif\n            }\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i * stride_h) + j * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        int8x8_t _val = vdup_n_s8(r0s[0]);\n                        int8x8_t _w = vld1_s8(kptr);\n                        int16x8_t _s0 = vmull_s8(_val, _w);\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n\n                        kptr += 4;\n                    }\n                }\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_s32(outptr, _sum0);\n               
 outptr += 4;\n            }\n            if (out_elempack == 1)\n            {\n                outptr[0] = vgetq_lane_s32(_sum0, 0);\n                outptr[M] = vgetq_lane_s32(_sum0, 1);\n                outptr[M * 2] = vgetq_lane_s32(_sum0, 2);\n                outptr[M * 3] = vgetq_lane_s32(_sum0, 3);\n                outptr += 1;\n            }\n        }\n    }\n    remain_outch_start += nn_outch * 4;\n    nn_outch = (outch - remain_outch_start) / 2;\n#else // __ARM_NEON\n    nn_outch = (outch - remain_outch_start) / 2;\n    #pragma omp parallel for num_threads(opt.num_threads)\n#endif // __ARM_NEON\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        const int p = remain_outch_start + pp * 2;\n\n        // shadowed variable for less openmp task args\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const size_t N = bottom_blob.cstep * elempack;\n\n        int* outptr0 = top_blob.channel(p);\n        int* outptr1 = top_blob.channel(p + 1);\n\n        int ij = 0;\n        for (; ij + 1 < outw * outh; ij += 2)\n        {\n            const int i0 = ij / outw;\n            const int i1 = (ij + 1) / outw;\n            const int j0 = ij % outw;\n            const int j1 = (ij + 1) % outw;\n\n            int sum00 = 0;\n            int sum01 = 0;\n            int sum10 = 0;\n            int sum11 = 0;\n\n#if __ARM_NEON\n            const signed char* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n#else\n            const signed char* kptr = weight_data_tm.channel(p / 2);\n#endif\n\n            int q = 0;\n#if __ARM_NEON\n            {\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum23 = vdupq_n_s32(0);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed char>(i0 * stride_h) + j0 * stride_w * elempack;\n                    const signed char* r1 = 
bottom_blob.channel(q / elempack).row<const signed char>(i1 * stride_h) + j1 * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n                        const signed char* r1s = r1 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        int8x8_t _r1;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                            _r1 = vld1_s8(r1s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp0[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            signed char tmp1[8] = {r1s[0], r1s[N], r1s[N * 2], r1s[N * 3], r1s[N * 4], r1s[N * 5], r1s[N * 6], r1s[N * 7]};\n                            _r0 = vld1_s8(tmp0);\n                            _r1 = vld1_s8(tmp1);\n                        }\n\n                        int8x16_t _w0 = vld1q_s8(kptr);\n\n#if __ARM_FEATURE_DOTPROD\n                        int8x16_t _r00 = vcombine_s8(_r0, _r0);\n                        int8x16_t _r11 = vcombine_s8(_r1, _r1);\n                        _sum01 = vdotq_s32(_sum01, _r00, _w0);\n                        _sum23 = vdotq_s32(_sum23, _r11, _w0);\n#else  // __ARM_FEATURE_DOTPROD\n                        int32x2x2_t _rr0 = vzip_s32(vreinterpret_s32_s8(_r0), vreinterpret_s32_s8(_r0));\n                        int32x2x2_t _rr1 = vzip_s32(vreinterpret_s32_s8(_r1), vreinterpret_s32_s8(_r1));\n                        int8x8_t _r0l = vreinterpret_s8_s32(_rr0.val[0]);\n                        int8x8_t _r0h = vreinterpret_s8_s32(_rr0.val[1]);\n                        int8x8_t _r1l = vreinterpret_s8_s32(_rr1.val[0]);\n                        int8x8_t _r1h = vreinterpret_s8_s32(_rr1.val[1]);\n\n                        
int16x8_t _s01 = vmull_s8(_r0l, vget_low_s8(_w0));\n                        int16x8_t _s23 = vmull_s8(_r1l, vget_low_s8(_w0));\n                        _s01 = vmlal_s8(_s01, _r0h, vget_high_s8(_w0));\n                        _s23 = vmlal_s8(_s23, _r1h, vget_high_s8(_w0));\n\n                        _sum01 = vpadalq_s16(_sum01, _s01);\n                        _sum23 = vpadalq_s16(_sum23, _s23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                        kptr += 16;\n                    }\n                }\n                int32x2_t _s0 = vpadd_s32(vget_low_s32(_sum01), vget_high_s32(_sum01));\n                int32x2_t _s1 = vpadd_s32(vget_low_s32(_sum23), vget_high_s32(_sum23));\n                sum00 += vget_lane_s32(_s0, 0);\n                sum01 += vget_lane_s32(_s1, 0);\n                sum10 += vget_lane_s32(_s0, 1);\n                sum11 += vget_lane_s32(_s1, 1);\n            }\n#endif // __ARM_NEON\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i0 * stride_h) + j0 * stride_w;\n                const signed char* r1 = bottom_blob.channel(q).row<const signed char>(i1 * stride_h) + j1 * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n                    const signed char* r1s = r1 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        sum00 += r0s[0] * kptr[0];\n                        sum01 += r1s[0] * kptr[0];\n                        sum10 += r0s[0] * kptr[1];\n                        sum11 += r1s[0] * kptr[1];\n\n                        kptr += 2;\n                    }\n                }\n            }\n\n            outptr0[0] = sum00;\n            outptr0[1] = sum01;\n            outptr1[0] = sum10;\n            outptr1[1] = sum11;\n            outptr0 += 2;\n            outptr1 += 2;\n        }\n        for (; ij < 
outw * outh; ij++)\n        {\n            const int i = ij / outw;\n            const int j = ij % outw;\n\n            int sum0 = 0;\n            int sum1 = 0;\n\n#if __ARM_NEON\n            const signed char* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2);\n#else\n            const signed char* kptr = weight_data_tm.channel(p / 2);\n#endif\n\n            int q = 0;\n#if __ARM_NEON\n            {\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed char>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            _r0 = vld1_s8(tmp);\n                        }\n\n                        int8x16_t _w0 = vld1q_s8(kptr);\n\n#if __ARM_FEATURE_DOTPROD\n                        int8x16_t _r00 = vcombine_s8(_r0, _r0);\n                        _sum01 = vdotq_s32(_sum01, _r00, _w0);\n#else  // __ARM_FEATURE_DOTPROD\n                        int32x2x2_t _rr0 = vzip_s32(vreinterpret_s32_s8(_r0), vreinterpret_s32_s8(_r0));\n                        int8x8_t _r0l = vreinterpret_s8_s32(_rr0.val[0]);\n                        int8x8_t _r0h = vreinterpret_s8_s32(_rr0.val[1]);\n\n                        int16x8_t _s01 = vmull_s8(_r0l, vget_low_s8(_w0));\n                        _s01 = vmlal_s8(_s01, _r0h, vget_high_s8(_w0));\n\n                        _sum01 
= vpadalq_s16(_sum01, _s01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                        kptr += 16;\n                    }\n                }\n                int32x2_t _s0 = vpadd_s32(vget_low_s32(_sum01), vget_high_s32(_sum01));\n                sum0 += vget_lane_s32(_s0, 0);\n                sum1 += vget_lane_s32(_s0, 1);\n            }\n#endif // __ARM_NEON\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i * stride_h) + j * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        sum0 += r0s[0] * kptr[0];\n                        sum1 += r0s[0] * kptr[1];\n\n                        kptr += 2;\n                    }\n                }\n            }\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n            outptr0 += 1;\n            outptr1 += 1;\n        }\n    }\n    remain_outch_start += nn_outch * 2;\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        int* outptr = top_blob.channel(p);\n\n        int ij = 0;\n        for (; ij + 1 < outw * outh; ij += 2)\n        {\n            const int i0 = ij / outw;\n            const int i1 = (ij + 1) / outw;\n            const int j0 = ij % outw;\n            const int j1 = (ij + 1) % outw;\n\n            int sum0 = 0;\n            int sum1 = 0;\n\n#if __ARM_NEON\n            const signed char* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n#else\n            const signed char* kptr = weight_data_tm.channel(p / 2 + p % 2);\n#endif\n\n            int q = 0;\n#if __ARM_NEON\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                int32x4_t _sum1 = vdupq_n_s32(0);\n                int32x4_t _sum2 = vdupq_n_s32(0);\n                int32x4_t _sum3 = vdupq_n_s32(0);\n 
               for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed char>(i0 * stride_h) + j0 * stride_w * elempack;\n                    const signed char* r1 = bottom_blob.channel(q / elempack).row<const signed char>(i1 * stride_h) + j1 * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n                        const signed char* r1s = r1 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        int8x8_t _r1;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                            _r1 = vld1_s8(r1s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp0[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            signed char tmp1[8] = {r1s[0], r1s[N], r1s[N * 2], r1s[N * 3], r1s[N * 4], r1s[N * 5], r1s[N * 6], r1s[N * 7]};\n                            _r0 = vld1_s8(tmp0);\n                            _r1 = vld1_s8(tmp1);\n                        }\n\n                        int8x8_t _w = vld1_s8(kptr);\n\n                        int16x8_t _s0 = vmull_s8(_r0, _w);\n                        int16x8_t _s1 = vmull_s8(_r1, _w);\n\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                        _sum2 = vaddw_s16(_sum2, vget_low_s16(_s1));\n                        _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n\n                        kptr += 8;\n                    }\n                }\n                _sum0 = vaddq_s32(_sum0, _sum1);\n                _sum2 = vaddq_s32(_sum2, _sum3);\n#if __aarch64__\n        
        sum0 += vaddvq_s32(_sum0);\n                sum1 += vaddvq_s32(_sum2);\n#else\n                int32x2_t _ss0 = vadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                int32x2_t _ss2 = vadd_s32(vget_low_s32(_sum2), vget_high_s32(_sum2));\n                _ss0 = vpadd_s32(_ss0, _ss2);\n                sum0 += vget_lane_s32(_ss0, 0);\n                sum1 += vget_lane_s32(_ss0, 1);\n#endif\n            }\n#endif // __ARM_NEON\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i0 * stride_h) + j0 * stride_w;\n                const signed char* r1 = bottom_blob.channel(q).row<const signed char>(i1 * stride_h) + j1 * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n                    const signed char* r1s = r1 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        sum0 += r0s[0] * kptr[0];\n                        sum1 += r1s[0] * kptr[0];\n\n                        kptr += 1;\n                    }\n                }\n            }\n\n            outptr[0] = sum0;\n            outptr[1] = sum1;\n            outptr += 2;\n        }\n        for (; ij < outw * outh; ij++)\n        {\n            const int i = ij / outw;\n            const int j = ij % outw;\n\n            int sum = 0;\n\n#if __ARM_NEON\n            const signed char* kptr = weight_data_tm.channel(p / 8 + (p % 8) / 4 + (p % 4) / 2 + p % 2);\n#else\n            const signed char* kptr = weight_data_tm.channel(p / 2 + p % 2);\n#endif\n\n            int q = 0;\n#if __ARM_NEON\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                int32x4_t _sum1 = vdupq_n_s32(0);\n                for (; q + 7 < inch; q += 8)\n                {\n                    const signed char* r0 = bottom_blob.channel(q / elempack).row<const signed 
char>(i * stride_h) + j * stride_w * elempack;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        const signed char* r0s = r0 + space_ofs[k];\n\n                        int8x8_t _r0;\n                        if (elempack == 8)\n                        {\n                            _r0 = vld1_s8(r0s);\n                        }\n                        else // if (elempack == 1)\n                        {\n                            signed char tmp[8] = {r0s[0], r0s[N], r0s[N * 2], r0s[N * 3], r0s[N * 4], r0s[N * 5], r0s[N * 6], r0s[N * 7]};\n                            _r0 = vld1_s8(tmp);\n                        }\n\n                        int8x8_t _w = vld1_s8(kptr);\n\n                        int16x8_t _s0 = vmull_s8(_r0, _w);\n\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                        kptr += 8;\n                    }\n                }\n                int32x4_t _sum = vaddq_s32(_sum0, _sum1);\n#if __aarch64__\n                sum += vaddvq_s32(_sum);\n#else\n                int32x2_t _ss = vadd_s32(vget_low_s32(_sum), vget_high_s32(_sum));\n                _ss = vpadd_s32(_ss, _ss);\n                sum += vget_lane_s32(_ss, 0);\n#endif\n            }\n#endif // __ARM_NEON\n            for (; q < inch; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q).row<const signed char>(i * stride_h) + j * stride_w;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    const signed char* r0s = r0 + space_ofs[k];\n\n                    // if (elempack == 1)\n                    {\n                        sum += r0s[0] * kptr[0];\n\n                        kptr += 1;\n                    }\n                }\n            }\n\n            outptr[0] = sum;\n            outptr += 1;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_3x3.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const float bias0 = bias ? bias[g] : 0.f;\n\n        const float* kernel0 = kernel + g * 9;\n\n        float* outptr = out;\n        float* outptr2 = outptr + outw;\n\n        const float* img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n        const float* r2 = img0 + w * 2;\n        const float* r3 = img0 + w * 3;\n\n#if __ARM_NEON\n        float32x4_t _k012x = vld1q_f32(kernel0);\n        float32x4_t _k345x = vld1q_f32(kernel0 + 3);\n        float32x4_t _k678x = vld1q_f32(kernel0 + 6);\n\n        _k012x = vsetq_lane_f32(0.f, _k012x, 3);\n        _k345x = vsetq_lane_f32(0.f, _k345x, 3);\n        _k678x = vsetq_lane_f32(0.f, _k678x, 3);\n\n        float32x4_t _bias0 = vdupq_n_f32(bias0);\n#else\n        const float* k0 = kernel0;\n        const float* k1 = kernel0 + 3;\n        const float* k2 = kernel0 + 6;\n#endif // __ARM_NEON\n\n        int i = 0;\n\n        for (; i + 1 < outh; i += 2)\n        {\n#if __ARM_NEON\n#if __aarch64__\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int nn = outw >> 2;\n            int remain = outw & 3;\n#endif // __aarch64__\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #384]           \\n\"\n                    
\"ld1    {v8.4s, v9.4s, v10.4s}, [%3]    \\n\" // r0\n                    \"add    %3, %3, #32                     \\n\"\n\n                    \"ext    v11.16b, v8.16b, v9.16b, #4     \\n\"\n                    \"ext    v13.16b, v9.16b, v10.16b, #4    \\n\"\n\n                    \"ext    v12.16b, v8.16b, v9.16b, #8     \\n\"\n                    \"ext    v14.16b, v9.16b, v10.16b, #8    \\n\"\n\n                    \"0:                                     \\n\"\n\n                    \"and    v4.16b, %17.16b, %17.16b        \\n\" // v4 = _bias0\n                    \"and    v5.16b, %17.16b, %17.16b        \\n\" // v5 = _bias0\n\n                    \"prfm   pldl1keep, [%6, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%6]  \\n\" // r3\n                    \"add    %6, %6, #32                     \\n\"\n\n                    \"and    v6.16b, %17.16b, %17.16b        \\n\" // v6 = _bias0\n                    \"and    v7.16b, %17.16b, %17.16b        \\n\" // v7 = _bias0\n\n                    \"ext    v15.16b, v16.16b, v17.16b, #4   \\n\"\n\n                    \"fmla   v4.4s, v8.4s, %14.s[0]          \\n\"\n                    \"fmla   v5.4s, v9.4s, %14.s[0]          \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #4   \\n\"\n\n                    \"fmla   v6.4s, v16.4s, %16.s[0]         \\n\"\n                    \"fmla   v7.4s, v17.4s, %16.s[0]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\"\n\n                    \"fmla   v4.4s, v11.4s, %14.s[1]         \\n\"\n                    \"fmla   v5.4s, v13.4s, %14.s[1]         \\n\"\n\n                    \"ext    v21.16b, v17.16b, v18.16b, #8   \\n\"\n\n                    \"fmla   v6.4s, v15.4s, %16.s[1]         \\n\"\n                    \"fmla   v7.4s, v20.4s, %16.s[1]         \\n\"\n\n                    \"prfm   pldl1keep, [%4, #384]           \\n\"\n                    \"ld1    {v22.4s, v23.4s, v24.4s}, [%4]  \\n\" // r1\n\n      
              \"fmla   v4.4s, v12.4s, %14.s[2]         \\n\"\n                    \"fmla   v5.4s, v14.4s, %14.s[2]         \\n\"\n\n                    \"add    %4, %4, #32                     \\n\"\n\n                    \"fmla   v6.4s, v19.4s, %16.s[2]         \\n\"\n                    \"fmla   v7.4s, v21.4s, %16.s[2]         \\n\"\n\n                    \"ext    v25.16b, v22.16b, v23.16b, #4   \\n\"\n\n                    \"fmla   v4.4s, v22.4s, %15.s[0]         \\n\"\n                    \"fmla   v5.4s, v23.4s, %15.s[0]         \\n\"\n\n                    \"ext    v27.16b, v23.16b, v24.16b, #4   \\n\"\n\n                    \"fmla   v6.4s, v22.4s, %14.s[0]         \\n\"\n                    \"fmla   v7.4s, v23.4s, %14.s[0]         \\n\"\n\n                    \"ext    v26.16b, v22.16b, v23.16b, #8   \\n\"\n\n                    \"fmla   v4.4s, v25.4s, %15.s[1]         \\n\"\n                    \"fmla   v5.4s, v27.4s, %15.s[1]         \\n\"\n\n                    \"ext    v28.16b, v23.16b, v24.16b, #8   \\n\"\n\n                    \"fmla   v6.4s, v25.4s, %14.s[1]         \\n\"\n                    \"fmla   v7.4s, v27.4s, %14.s[1]         \\n\"\n\n                    \"prfm   pldl1keep, [%5, #384]           \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s}, [%5]    \\n\" // r2\n\n                    \"fmla   v4.4s, v26.4s, %15.s[2]         \\n\"\n                    \"fmla   v5.4s, v28.4s, %15.s[2]         \\n\"\n\n                    \"add    %5, %5, #32                     \\n\"\n\n                    \"fmla   v6.4s, v26.4s, %14.s[2]         \\n\"\n                    \"fmla   v7.4s, v28.4s, %14.s[2]         \\n\"\n\n                    \"ext    v11.16b, v8.16b, v9.16b, #4     \\n\"\n\n                    \"fmla   v4.4s, v8.4s, %16.s[0]          \\n\"\n                    \"fmla   v5.4s, v9.4s, %16.s[0]          \\n\"\n\n                    \"ext    v13.16b, v9.16b, v10.16b, #4    \\n\"\n\n                    \"fmla   v6.4s, v8.4s, %15.s[0]         
 \\n\"\n                    \"fmla   v7.4s, v9.4s, %15.s[0]          \\n\"\n\n                    \"ext    v12.16b, v8.16b, v9.16b, #8     \\n\"\n\n                    \"fmla   v4.4s, v11.4s, %16.s[1]         \\n\"\n                    \"fmla   v5.4s, v13.4s, %16.s[1]         \\n\"\n\n                    \"ext    v14.16b, v9.16b, v10.16b, #8    \\n\"\n\n                    \"fmla   v6.4s, v11.4s, %15.s[1]         \\n\"\n                    \"fmla   v7.4s, v13.4s, %15.s[1]         \\n\"\n\n                    \"prfm   pldl1keep, [%3, #384]           \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s}, [%3]    \\n\" // r0 next loop\n\n                    \"fmla   v4.4s, v12.4s, %16.s[2]         \\n\"\n                    \"fmla   v5.4s, v14.4s, %16.s[2]         \\n\"\n\n                    \"add    %3, %3, #32                     \\n\"\n                    \"ext    v11.16b, v8.16b, v9.16b, #4     \\n\"\n\n                    \"fmla   v6.4s, v12.4s, %15.s[2]         \\n\"\n                    \"fmla   v7.4s, v14.4s, %15.s[2]         \\n\"\n\n                    \"ext    v13.16b, v9.16b, v10.16b, #4    \\n\"\n                    \"ext    v12.16b, v8.16b, v9.16b, #8     \\n\"\n\n                    \"st1    {v4.4s, v5.4s}, [%1], #32       \\n\"\n\n                    \"ext    v14.16b, v9.16b, v10.16b, #8    \\n\"\n\n                    \"subs   %w0, %w0, #1                    \\n\"\n\n                    \"st1    {v6.4s, v7.4s}, [%2], #32       \\n\"\n\n                    \"bne    0b                              \\n\"\n                    \"sub    %3, %3, #32                     \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr),  // %1\n                    \"=r\"(outptr2), // %2\n                    \"=r\"(r0),      // %3\n                    \"=r\"(r1),      // %4\n                    \"=r\"(r2),      // %5\n                    \"=r\"(r3)       // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n    
                \"2\"(outptr2),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"w\"(_k012x), // %14\n                    \"w\"(_k345x), // %15\n                    \"w\"(_k678x), // %16\n                    \"w\"(_bias0)  // %17\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\");\n            }\n\n            if (remain >= 4)\n            {\n                remain -= 4;\n\n                asm volatile(\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld1    {v8.4s, v9.4s}, [%2]            \\n\" // r0\n                    \"add    %2, %2, #16                     \\n\"\n\n                    \"and    v4.16b, %15.16b, %15.16b        \\n\" // v4 = _bias0\n                    \"and    v6.16b, %15.16b, %15.16b        \\n\" // v6 = _bias0\n\n                    \"prfm   pldl1keep, [%5, #256]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%5]          \\n\" // r3\n                    \"add    %5, %5, #16                     \\n\"\n\n                    \"ext    v11.16b, v8.16b, v9.16b, #4     \\n\"\n                    \"ext    v15.16b, v16.16b, v17.16b, #4   \\n\"\n\n                    \"fmla   v4.4s, v8.4s, %12.s[0]          \\n\"\n                    \"fmla   v6.4s, v16.4s, %14.s[0]         \\n\"\n\n                    \"ext    v12.16b, v8.16b, v9.16b, #8     \\n\"\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\"\n\n                    \"fmla   v4.4s, v11.4s, %12.s[1]         \\n\"\n                    \"fmla   v6.4s, v15.4s, %14.s[1]         \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]           \\n\"\n                    \"ld1    {v22.4s, v23.4s}, [%3]          \\n\" 
// r1\n\n                    \"fmla   v4.4s, v12.4s, %12.s[2]         \\n\"\n\n                    \"add    %3, %3, #16                     \\n\"\n\n                    \"fmla   v6.4s, v19.4s, %14.s[2]         \\n\"\n\n                    \"ext    v25.16b, v22.16b, v23.16b, #4   \\n\"\n\n                    \"fmla   v4.4s, v22.4s, %13.s[0]         \\n\"\n                    \"fmla   v6.4s, v22.4s, %12.s[0]         \\n\"\n\n                    \"ext    v26.16b, v22.16b, v23.16b, #8   \\n\"\n\n                    \"fmla   v4.4s, v25.4s, %13.s[1]         \\n\"\n                    \"fmla   v6.4s, v25.4s, %12.s[1]         \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]           \\n\"\n                    \"ld1    {v8.4s, v9.4s}, [%4]            \\n\" // r2\n\n                    \"fmla   v4.4s, v26.4s, %13.s[2]         \\n\"\n\n                    \"add    %4, %4, #16                     \\n\"\n\n                    \"fmla   v6.4s, v26.4s, %12.s[2]         \\n\"\n\n                    \"ext    v11.16b, v8.16b, v9.16b, #4     \\n\"\n\n                    \"fmla   v4.4s, v8.4s, %14.s[0]          \\n\"\n                    \"fmla   v6.4s, v8.4s, %13.s[0]          \\n\"\n\n                    \"ext    v12.16b, v8.16b, v9.16b, #8     \\n\"\n\n                    \"fmla   v4.4s, v11.4s, %14.s[1]         \\n\"\n                    \"fmla   v6.4s, v11.4s, %13.s[1]         \\n\"\n\n                    \"fmla   v4.4s, v12.4s, %14.s[2]         \\n\"\n                    \"fmla   v6.4s, v12.4s, %13.s[2]         \\n\"\n\n                    \"st1    {v4.4s}, [%0], #16              \\n\"\n                    \"st1    {v6.4s}, [%1], #16              \\n\"\n\n                    : \"=r\"(outptr),  // %0\n                    \"=r\"(outptr2), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr),\n               
     \"1\"(outptr2),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k012x), // %12\n                    \"w\"(_k345x), // %13\n                    \"w\"(_k678x), // %14\n                    \"w\"(_bias0)  // %15\n                    : \"cc\", \"memory\", \"v4\", \"v6\", \"v8\", \"v9\", \"v11\", \"v12\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v22\", \"v23\", \"v25\", \"v26\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%3, #192]          \\n\"\n                    \"vld1.f32   {d18-d20}, [%3 :64] \\n\" // r0\n                    \"add        %3, #16             \\n\"\n\n                    \"vext.32    q11, q9, q10, #1    \\n\"\n                    \"vext.32    q12, q9, q10, #2    \\n\"\n\n                    \"0:                             \\n\"\n\n                    \"vmul.f32   q7, q9, %e14[0]     \\n\"\n\n                    \"vand       q13, %q17, %q17     \\n\" // q13 = _bias0\n                    \"vmul.f32   q6, q11, %e14[1]    \\n\"\n                    \"vmla.f32   q13, q12, %f14[0]   \\n\"\n\n                    \"pld        [%4, #192]          \\n\"\n                    \"vld1.f32   {d18-d20}, [%4]     \\n\" // r1\n                    \"add        %4, #16             \\n\"\n\n                    \"vmla.f32   q7, q9, %e15[0]     \\n\"\n\n                    \"vext.32    q11, q9, q10, #1    \\n\"\n                    \"vext.32    q12, q9, q10, #2    \\n\"\n\n                    \"vmla.f32   q6, q11, %e15[1]    \\n\"\n                    \"vmla.f32   q13, q12, %f15[0]   \\n\"\n\n                    \"vmul.f32   q8, q9, %e14[0]     \\n\"\n\n                    \"vand       q15, %q17, %q17     \\n\" // q15 = _bias0\n                    \"vmul.f32   q14, q11, %e14[1]   \\n\"\n                    \"vmla.f32   q15, q12, %f14[0]   \\n\"\n\n                    
\"pld        [%5, #192]          \\n\"\n                    \"vld1.f32   {d18-d20}, [%5 :64] \\n\" // r2\n                    \"add        %5, #16             \\n\"\n\n                    \"vmla.f32   q7, q9, %e16[0]     \\n\"\n\n                    \"vext.32    q11, q9, q10, #1    \\n\"\n                    \"vext.32    q12, q9, q10, #2    \\n\"\n\n                    \"vmla.f32   q6, q11, %e16[1]    \\n\"\n                    \"vmla.f32   q13, q12, %f16[0]   \\n\"\n\n                    \"vmla.f32   q8, q9, %e15[0]     \\n\"\n                    \"vmla.f32   q14, q11, %e15[1]   \\n\"\n                    \"vmla.f32   q15, q12, %f15[0]   \\n\"\n\n                    \"pld        [%6, #192]          \\n\"\n                    \"vld1.f32   {d18-d20}, [%6]     \\n\" // r3\n                    \"add        %6, #16             \\n\"\n\n                    \"vmla.f32   q8, q9, %e16[0]     \\n\"\n\n                    \"vext.32    q11, q9, q10, #1    \\n\"\n                    \"vext.32    q12, q9, q10, #2    \\n\"\n\n                    \"vmla.f32   q14, q11, %e16[1]   \\n\"\n                    \"vmla.f32   q15, q12, %f16[0]   \\n\"\n\n                    \"vadd.f32   q7, q7, q6          \\n\"\n\n                    \"pld        [%3, #192]          \\n\"\n                    \"vld1.f32   {d18-d20}, [%3 :64] \\n\" // r0\n\n                    \"vadd.f32   q8, q8, q14         \\n\"\n                    \"vadd.f32   q7, q7, q13         \\n\"\n                    \"vadd.f32   q8, q8, q15         \\n\"\n\n                    \"vext.32    q11, q9, q10, #1    \\n\"\n                    \"vext.32    q12, q9, q10, #2    \\n\"\n\n                    \"add        %3, #16             \\n\"\n\n                    \"vst1.f32   {d14-d15}, [%1]!    \\n\"\n                    \"vst1.f32   {d16-d17}, [%2]!    
\\n\"\n\n                    \"subs       %0, #1              \\n\"\n                    \"bne        0b                  \\n\"\n\n                    \"sub        %3, #16             \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr),  // %1\n                    \"=r\"(outptr2), // %2\n                    \"=r\"(r0),      // %3\n                    \"=r\"(r1),      // %4\n                    \"=r\"(r2),      // %5\n                    \"=r\"(r3)       // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(outptr2),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"w\"(_k012x), // %14\n                    \"w\"(_k345x), // %15\n                    \"w\"(_k678x), // %16\n                    \"w\"(_bias0)  // %17\n                    : \"cc\", \"memory\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n#if __ARM_NEON\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r30 = vld1q_f32(r3);\n\n                float32x4_t _sum = vmulq_f32(_r00, _k012x);\n                _sum = vmlaq_f32(_sum, _r10, _k345x);\n                _sum = vmlaq_f32(_sum, _r20, _k678x);\n\n                float32x4_t _sum2 = vmulq_f32(_r10, _k012x);\n                _sum2 = vmlaq_f32(_sum2, _r20, _k345x);\n                _sum2 = vmlaq_f32(_sum2, _r30, _k678x);\n\n                _sum = vsetq_lane_f32(bias0, _sum, 3);\n                _sum2 = vsetq_lane_f32(bias0, _sum2, 3);\n#if __aarch64__\n                *outptr = vaddvq_f32(_sum);\n                *outptr2 = vaddvq_f32(_sum2);\n#else\n                float32x2_t _ss = 
vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                float32x2_t _ss2 = vadd_f32(vget_low_f32(_sum2), vget_high_f32(_sum2));\n\n                float32x2_t _sss2 = vpadd_f32(_ss, _ss2);\n\n                *outptr = vget_lane_f32(_sss2, 0);\n                *outptr2 = vget_lane_f32(_sss2, 1);\n#endif // __aarch64__\n#else\n                float sum = bias0;\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n\n                float sum2 = bias0;\n                sum2 += r1[0] * k0[0];\n                sum2 += r1[1] * k0[1];\n                sum2 += r1[2] * k0[2];\n                sum2 += r2[0] * k1[0];\n                sum2 += r2[1] * k1[1];\n                sum2 += r2[2] * k1[2];\n                sum2 += r3[0] * k2[0];\n                sum2 += r3[1] * k2[1];\n                sum2 += r3[2] * k2[2];\n\n                *outptr = sum;\n                *outptr2 = sum2;\n#endif\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr++;\n                outptr2++;\n            }\n\n            r0 += 2 + w;\n            r1 += 2 + w;\n            r2 += 2 + w;\n            r3 += 2 + w;\n\n            outptr += outw;\n            outptr2 += outw;\n        }\n\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n#if __aarch64__\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int nn = outw >> 2;\n            int remain = outw & 3;\n#endif // __aarch64__\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%2, 
#384]           \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s}, [%2]    \\n\" // r0\n                    \"add    %2, %2, #32                     \\n\"\n\n                    \"ext    v12.16b, v8.16b, v9.16b, #4     \\n\"\n                    \"ext    v14.16b, v9.16b, v10.16b, #4    \\n\"\n\n                    \"0:                                     \\n\"\n\n                    \"fmul   v6.4s, v8.4s, %10.s[0]          \\n\"\n\n                    \"and    v4.16b, %13.16b, %13.16b        \\n\" // v4 = _bias0\n\n                    \"fmul   v7.4s, v9.4s, %10.s[0]          \\n\"\n\n                    \"and    v5.16b, %13.16b, %13.16b        \\n\" // v5 = _bias0\n\n                    \"fmla   v4.4s, v12.4s, %10.s[1]         \\n\"\n\n                    \"ext    v13.16b, v8.16b, v9.16b, #8     \\n\"\n\n                    \"fmla   v5.4s, v14.4s, %10.s[1]         \\n\"\n\n                    \"ext    v15.16b, v9.16b, v10.16b, #8    \\n\"\n\n                    \"fmla   v6.4s, v13.4s, %10.s[2]         \\n\"\n\n                    \"prfm   pldl1keep, [%3, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%3]  \\n\" // r1\n\n                    \"fmla   v7.4s, v15.4s, %10.s[2]         \\n\"\n\n                    \"add    %3, %3, #32                     \\n\"\n\n                    \"fmla   v4.4s, v16.4s, %11.s[0]         \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #4   \\n\"\n\n                    \"fmla   v5.4s, v17.4s, %11.s[0]         \\n\"\n\n                    \"ext    v22.16b, v17.16b, v18.16b, #4   \\n\"\n\n                    \"fmla   v6.4s, v20.4s, %11.s[1]         \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\"\n\n                    \"fmla   v7.4s, v22.4s, %11.s[1]         \\n\"\n\n                    \"ext    v23.16b, v17.16b, v18.16b, #8   \\n\"\n\n                    \"fmla   v4.4s, v21.4s, %11.s[2]         \\n\"\n\n                    \"prfm   pldl1keep, [%4, #384] 
          \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s}, [%4]  \\n\" // r2\n\n                    \"fmla   v5.4s, v23.4s, %11.s[2]         \\n\"\n\n                    \"add    %4, %4, #32                     \\n\"\n\n                    \"fmla   v6.4s, v24.4s, %12.s[0]         \\n\"\n\n                    \"ext    v12.16b, v24.16b, v25.16b, #4   \\n\"\n\n                    \"fmla   v7.4s, v25.4s, %12.s[0]         \\n\"\n\n                    \"ext    v14.16b, v25.16b, v26.16b, #4   \\n\"\n\n                    \"fmla   v4.4s, v12.4s, %12.s[1]         \\n\"\n\n                    \"ext    v13.16b, v24.16b, v25.16b, #8   \\n\"\n\n                    \"fmla   v5.4s, v14.4s, %12.s[1]         \\n\"\n\n                    \"ext    v15.16b, v25.16b, v26.16b, #8   \\n\"\n\n                    \"fmla   v6.4s, v13.4s, %12.s[2]         \\n\"\n                    \"fmla   v7.4s, v15.4s, %12.s[2]         \\n\"\n\n                    \"prfm   pldl1keep, [%2, #384]           \\n\"\n                    \"ld1    {v8.4s, v9.4s, v10.4s}, [%2]    \\n\" // r0 next loop\n\n                    \"fadd   v4.4s, v4.4s, v6.4s             \\n\"\n\n                    \"add    %2, %2, #32                     \\n\"\n\n                    \"fadd   v5.4s, v5.4s, v7.4s             \\n\"\n\n                    \"ext    v12.16b, v8.16b, v9.16b, #4     \\n\"\n                    \"ext    v14.16b, v9.16b, v10.16b, #4    \\n\"\n\n                    \"subs   %w0, %w0, #1                    \\n\"\n\n                    \"st1    {v4.4s, v5.4s}, [%1], #32       \\n\"\n\n                    \"bne    0b                              \\n\"\n                    \"sub    %2, %2, #32                     \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n    
                \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k012x), // %10\n                    \"w\"(_k345x), // %11\n                    \"w\"(_k678x), // %12\n                    \"w\"(_bias0)  // %13\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\");\n            }\n\n            if (remain >= 4)\n            {\n                remain -= 4;\n\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #192]           \\n\"\n                    \"ld1    {v8.4s, v9.4s}, [%1]            \\n\" // r0\n                    \"add    %1, %1, #16                     \\n\"\n\n                    \"and    v4.16b, %11.16b, %11.16b        \\n\" // v4 = _bias0\n\n                    \"ext    v12.16b, v8.16b, v9.16b, #4     \\n\"\n\n                    \"fmul   v6.4s, v8.4s, %8.s[0]           \\n\"\n\n                    \"ext    v13.16b, v8.16b, v9.16b, #8     \\n\"\n\n                    \"fmla   v4.4s, v12.4s, %8.s[1]          \\n\"\n\n                    \"prfm   pldl1keep, [%2, #192]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%2]          \\n\" // r1\n                    \"add    %2, %2, #16                     \\n\"\n\n                    \"fmla   v6.4s, v13.4s, %8.s[2]          \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #4   \\n\"\n\n                    \"fmla   v4.4s, v16.4s, %9.s[0]          \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\"\n\n                    \"fmla   v6.4s, v20.4s, %9.s[1]          \\n\"\n\n                    \"prfm   pldl1keep, [%3, #192]           \\n\"\n                    \"ld1    {v24.4s, v25.4s}, [%3]          \\n\" // r2\n                    \"add    %3, %3, #16                     \\n\"\n\n                    \"fmla   
v4.4s, v21.4s, %9.s[2]          \\n\"\n\n                    \"ext    v12.16b, v24.16b, v25.16b, #4   \\n\"\n\n                    \"fmla   v6.4s, v24.4s, %10.s[0]         \\n\"\n\n                    \"ext    v13.16b, v24.16b, v25.16b, #8   \\n\"\n\n                    \"fmla   v4.4s, v12.4s, %10.s[1]         \\n\"\n\n                    \"fmla   v6.4s, v13.4s, %10.s[2]         \\n\"\n\n                    \"fadd   v4.4s, v4.4s, v6.4s             \\n\"\n\n                    \"st1    {v4.4s}, [%0], #16              \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k012x), // %8\n                    \"w\"(_k345x), // %9\n                    \"w\"(_k678x), // %10\n                    \"w\"(_bias0)  // %11\n                    : \"cc\", \"memory\", \"v4\", \"v6\", \"v8\", \"v9\", \"v12\", \"v13\", \"v16\", \"v17\", \"v20\", \"v21\", \"v24\", \"v25\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%2, #192]          \\n\"\n                    \"vld1.f32   {d16-d18}, [%2]     \\n\" // r0\n                    \"add        %2, #16             \\n\"\n\n                    \"vext.32    q10, q8, q9, #1     \\n\"\n                    \"vext.32    q11, q8, q9, #2     \\n\"\n\n                    \"0:                             \\n\"\n\n                    \"vmul.f32   q7, q8, %e10[0]     \\n\"\n\n                    \"vand       q14, %q13, %q13     \\n\" // q14 = _bias0\n                    \"vmul.f32   q13, q10, %e10[1]   \\n\"\n                    \"vmla.f32   q14, q11, %f10[0]   \\n\"\n\n                    \"pld        [%3, #192]          \\n\"\n                    \"vld1.f32   {d16-d18}, [%3]     
\\n\" // r1\n                    \"add        %3, #16             \\n\"\n\n                    \"vmla.f32   q7, q8, %e11[0]     \\n\"\n\n                    \"vext.32    q10, q8, q9, #1     \\n\"\n                    \"vext.32    q11, q8, q9, #2     \\n\"\n\n                    \"vmla.f32   q13, q10, %e11[1]   \\n\"\n                    \"vmla.f32   q14, q11, %f11[0]   \\n\"\n\n                    \"pld        [%4, #192]          \\n\"\n                    \"vld1.f32   {d16-d18}, [%4]     \\n\" // r2\n                    \"add        %4, #16             \\n\"\n\n                    \"vmla.f32   q7, q8, %e12[0]     \\n\"\n\n                    \"vext.32    q10, q8, q9, #1     \\n\"\n                    \"vext.32    q11, q8, q9, #2     \\n\"\n\n                    \"vmla.f32   q13, q10, %e12[1]   \\n\"\n                    \"vmla.f32   q14, q11, %f12[0]   \\n\"\n\n                    \"pld        [%2, #192]          \\n\"\n                    \"vld1.f32   {d16-d18}, [%2]     \\n\" // r0\n                    \"add        %2, #16             \\n\"\n\n                    \"vadd.f32   q7, q7, q13         \\n\"\n                    \"vadd.f32   q7, q7, q14         \\n\"\n\n                    \"vext.32    q10, q8, q9, #1     \\n\"\n                    \"vext.32    q11, q8, q9, #2     \\n\"\n\n                    \"vst1.f32   {d14-d15}, [%1]!    
\\n\"\n\n                    \"subs       %0, #1              \\n\"\n                    \"bne        0b                  \\n\"\n\n                    \"sub        %2, #16             \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k012x), // %10\n                    \"w\"(_k345x), // %11\n                    \"w\"(_k678x), // %12\n                    \"w\"(_bias0)  // %13\n                    : \"cc\", \"memory\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n#if __ARM_NEON\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r20 = vld1q_f32(r2);\n\n                float32x4_t _sum = vmulq_f32(_r00, _k012x);\n                _sum = vmlaq_f32(_sum, _r10, _k345x);\n                _sum = vmlaq_f32(_sum, _r20, _k678x);\n\n                _sum = vsetq_lane_f32(bias0, _sum, 3);\n#if __aarch64__\n                *outptr = vaddvq_f32(_sum);\n#else\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                _ss = vpadd_f32(_ss, _ss);\n\n                *outptr = vget_lane_f32(_ss, 0);\n#endif // __aarch64__\n#else\n                float sum = bias0;\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r2[0] * k2[0];\n                sum += 
r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n\n                *outptr = sum;\n#endif\n                r0++;\n                r1++;\n                r2++;\n                outptr++;\n            }\n\n            r0 += 2;\n            r1 += 2;\n            r2 += 2;\n        }\n    }\n}\n\nstatic void convdw3x3s2_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const float bias0 = bias ? bias[g] : 0.f;\n\n        const float* kernel0 = kernel + g * 9;\n\n        float* outptr = out;\n\n        const float* img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n        const float* r2 = img0 + w * 2;\n\n#if __ARM_NEON\n        float32x4_t _k012x = vld1q_f32(kernel0);\n        float32x4_t _k345x = vld1q_f32(kernel0 + 3);\n        float32x4_t _k678x = vld1q_f32(kernel0 + 6);\n\n        _k012x = vsetq_lane_f32(0.f, _k012x, 3);\n        _k345x = vsetq_lane_f32(0.f, _k345x, 3);\n        _k678x = vsetq_lane_f32(0.f, _k678x, 3);\n\n        float32x4_t _bias0 = vdupq_n_f32(bias0);\n#else\n        const float* k0 = kernel0;\n        const float* k1 = kernel0 + 3;\n        const float* k2 = kernel0 + 6;\n#endif // __ARM_NEON\n\n        int i = 0;\n\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n            int nn = outw >> 2;\n            int remain = outw & 3;\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm       
pldl1keep, [%2, #256]          \\n\"\n                    \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n\n                    \"and        v11.16b, %13.16b, %13.16b      \\n\" // v11 = _bias0\n\n                    \"0:                                        \\n\"\n                    \"fmul       v0.4s,  v2.4s, %10.s[0]        \\n\"\n                    \"fmul       v10.4s, v3.4s, %10.s[1]        \\n\"\n\n                    \"prfm       pldl1keep, [%2, #256]          \\n\"\n                    \"ld2        {v8.4s, v9.4s}, [%2]           \\n\"\n                    \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                    \"fmla       v11.4s, v1.4s, %10.s[2]        \\n\"\n\n                    \"prfm       pldl1keep, [%3, #256]          \\n\"\n                    \"ld2        {v2.4s, v3.4s}, [%3], #32      \\n\"\n\n                    \"fmla       v0.4s,  v2.4s, %11.s[0]        \\n\"\n                    \"fmla       v10.4s, v3.4s, %11.s[1]        \\n\"\n\n                    \"prfm       pldl1keep, [%3, #256]          \\n\"\n                    \"ld2        {v8.4s, v9.4s}, [%3]           \\n\"\n                    \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                    \"fmla       v11.4s, v1.4s, %11.s[2]        \\n\"\n\n                    \"prfm       pldl1keep, [%4, #256]          \\n\"\n                    \"ld2        {v2.4s, v3.4s}, [%4], #32      \\n\"\n\n                    \"fmla       v0.4s,  v2.4s, %12.s[0]        \\n\"\n                    \"fmla       v10.4s, v3.4s, %12.s[1]        \\n\"\n\n                    \"prfm       pldl1keep, [%4, #256]          \\n\"\n                    \"ld2        {v8.4s, v9.4s}, [%4]           \\n\"\n                    \"ext        v1.16b, v2.16b, v8.16b, #4     \\n\"\n\n                    \"fmla       v11.4s, v1.4s, %12.s[2]        \\n\"\n\n                    \"prfm       pldl1keep, [%2, #256]          \\n\"\n                    \"ld2        {v2.4s, v3.4s}, [%2], #32      \\n\"\n\n 
                   \"fadd       v0.4s, v0.4s, v10.4s           \\n\"\n                    \"fadd       v0.4s, v0.4s, v11.4s           \\n\"\n\n                    \"and        v11.16b, %13.16b, %13.16b      \\n\" // v11 = _bias0\n\n                    \"subs       %w0, %w0, #1                   \\n\"\n                    \"st1        {v0.4s}, [%1], #16             \\n\"\n                    \"bne        0b                             \\n\"\n                    \"sub        %2, %2, #32                    \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k012x), // %10\n                    \"w\"(_k345x), // %11\n                    \"w\"(_k678x), // %12\n                    \"w\"(_bias0)  // %13\n                    : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld2.f32   {d4-d7}, [%2]!      
\\n\"\n\n                    \"vand       q11, %q13, %q13     \\n\"\n\n                    \"0:                             \\n\"\n                    \"vmul.f32   q0, q2, %e10[0]     \\n\"\n                    \"vmul.f32   q10, q3, %e10[1]    \\n\"\n\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld2.f32   {d16-d17}, [%2]     \\n\"\n                    \"vext.32    q1, q2, q8, #1      \\n\"\n\n                    \"vmla.f32   q11, q1, %f10[0]    \\n\"\n\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld2.f32   {d4-d7}, [%3]!      \\n\"\n\n                    \"vmla.f32   q0, q2, %e11[0]     \\n\"\n                    \"vmla.f32   q10, q3, %e11[1]    \\n\"\n\n                    \"pld        [%3, #128]          \\n\"\n                    \"vld2.f32   {d16-d17}, [%3]     \\n\"\n                    \"vext.32    q1, q2, q8, #1      \\n\"\n\n                    \"vmla.f32   q11, q1, %f11[0]    \\n\"\n\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld2.f32   {d4-d7}, [%4]!      \\n\"\n\n                    \"vmla.f32   q0, q2, %e12[0]     \\n\"\n                    \"vmla.f32   q10, q3, %e12[1]    \\n\"\n\n                    \"pld        [%4, #128]          \\n\"\n                    \"vld2.f32   {d16-d17}, [%4]     \\n\"\n                    \"vext.32    q1, q2, q8, #1      \\n\"\n\n                    \"vmla.f32   q11, q1, %f12[0]    \\n\"\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n\n                    \"vadd.f32   q0, q0, q10         \\n\"\n                    \"vadd.f32   q0, q0, q11         \\n\"\n\n                    \"vand       q11, %q13, %q13     \\n\"\n\n                    \"subs       %0, #1              \\n\"\n                    \"vst1.f32   {d0-d1}, [%1]!      
\\n\"\n                    \"bne        0b                  \\n\"\n                    \"sub        %2, #32             \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k012x), // %10\n                    \"w\"(_k345x), // %11\n                    \"w\"(_k678x), // %12\n                    \"w\"(_bias0)  // %13\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n#if __ARM_NEON\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r20 = vld1q_f32(r2);\n\n                float32x4_t _sum = vmulq_f32(_r00, _k012x);\n                _sum = vmlaq_f32(_sum, _r10, _k345x);\n                _sum = vmlaq_f32(_sum, _r20, _k678x);\n\n                _sum = vsetq_lane_f32(bias0, _sum, 3);\n#if __aarch64__\n                *outptr = vaddvq_f32(_sum);\n#else\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                _ss = vpadd_f32(_ss, _ss);\n\n                *outptr = vget_lane_f32(_ss, 0);\n#endif // __aarch64__\n#else\n                float sum = bias0;\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += 
r2[2] * k2[2];\n\n                *outptr = sum;\n#endif // __ARM_NEON\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n                outptr++;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_3x3_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const __fp16* kernel = _kernel;\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const __fp16 bias0 = bias ? bias[g] : 0.f;\n\n        const __fp16* kernel0 = kernel + g * 9;\n\n        __fp16* outptr0 = out;\n        __fp16* outptr1 = outptr0 + outw;\n\n        const __fp16* img0 = bottom_blob.channel(g);\n\n        const __fp16* r0 = img0;\n        const __fp16* r1 = img0 + w;\n        const __fp16* r2 = img0 + w * 2;\n        const __fp16* r3 = img0 + w * 3;\n\n        float16x4_t _k012x = vld1_f16(kernel0);\n        float16x4_t _k345x = vld1_f16(kernel0 + 3);\n        float16x4_t _k678x = vld1_f16(kernel0 + 6);\n\n        _k012x = vset_lane_f16(0.f, _k012x, 3);\n        _k345x = vset_lane_f16(0.f, _k345x, 3);\n        _k678x = vset_lane_f16(0.f, _k678x, 3);\n\n        float16x8_t _bias0 = vdupq_n_f16(bias0);\n\n        int i = 0;\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n            for (; j + 7 < outw; j += 8)\n            {\n                float16x8_t _r00 = vld1q_f16(r0);\n                float16x8_t _r10 = vld1q_f16(r1);\n                float16x8_t _r20 = vld1q_f16(r2);\n                float16x8_t _r30 = vld1q_f16(r3);\n\n                float16x8_t _r0n = vld1q_f16(r0 + 8);\n                float16x8_t _r1n = vld1q_f16(r1 + 8);\n                float16x8_t _r2n = vld1q_f16(r2 + 8);\n                float16x8_t _r3n = vld1q_f16(r3 + 8);\n\n                float16x8_t _r01 = vextq_f16(_r00, _r0n, 1);\n                float16x8_t _r11 = 
vextq_f16(_r10, _r1n, 1);\n                float16x8_t _r21 = vextq_f16(_r20, _r2n, 1);\n                float16x8_t _r31 = vextq_f16(_r30, _r3n, 1);\n\n                float16x8_t _r02 = vextq_f16(_r00, _r0n, 2);\n                float16x8_t _r12 = vextq_f16(_r10, _r1n, 2);\n                float16x8_t _r22 = vextq_f16(_r20, _r2n, 2);\n                float16x8_t _r32 = vextq_f16(_r30, _r3n, 2);\n\n                float16x8_t _sum0 = _bias0;\n                float16x8_t _sum1 = _bias0;\n\n                _sum0 = vfmaq_lane_f16(_sum0, _r00, _k012x, 0);\n                _sum0 = vfmaq_lane_f16(_sum0, _r01, _k012x, 1);\n                _sum0 = vfmaq_lane_f16(_sum0, _r02, _k012x, 2);\n                _sum1 = vfmaq_lane_f16(_sum1, _r10, _k012x, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _r11, _k012x, 1);\n                _sum1 = vfmaq_lane_f16(_sum1, _r12, _k012x, 2);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _r10, _k345x, 0);\n                _sum0 = vfmaq_lane_f16(_sum0, _r11, _k345x, 1);\n                _sum0 = vfmaq_lane_f16(_sum0, _r12, _k345x, 2);\n                _sum1 = vfmaq_lane_f16(_sum1, _r20, _k345x, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _r21, _k345x, 1);\n                _sum1 = vfmaq_lane_f16(_sum1, _r22, _k345x, 2);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _r20, _k678x, 0);\n                _sum0 = vfmaq_lane_f16(_sum0, _r21, _k678x, 1);\n                _sum0 = vfmaq_lane_f16(_sum0, _r22, _k678x, 2);\n                _sum1 = vfmaq_lane_f16(_sum1, _r30, _k678x, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _r31, _k678x, 1);\n                _sum1 = vfmaq_lane_f16(_sum1, _r32, _k678x, 2);\n\n                vst1q_f16(outptr0, _sum0);\n                vst1q_f16(outptr1, _sum1);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                r3 += 8;\n                outptr0 += 8;\n                outptr1 += 8;\n            }\n            for (; j + 3 < outw; j += 4)\n            {\n 
               float16x4_t _r00 = vld1_f16(r0);\n                float16x4_t _r10 = vld1_f16(r1);\n                float16x4_t _r20 = vld1_f16(r2);\n                float16x4_t _r30 = vld1_f16(r3);\n\n                float16x4_t _r0n = vld1_f16(r0 + 4);\n                float16x4_t _r1n = vld1_f16(r1 + 4);\n                float16x4_t _r2n = vld1_f16(r2 + 4);\n                float16x4_t _r3n = vld1_f16(r3 + 4);\n\n                float16x4_t _r01 = vext_f16(_r00, _r0n, 1);\n                float16x4_t _r11 = vext_f16(_r10, _r1n, 1);\n                float16x4_t _r21 = vext_f16(_r20, _r2n, 1);\n                float16x4_t _r31 = vext_f16(_r30, _r3n, 1);\n\n                float16x4_t _r02 = vext_f16(_r00, _r0n, 2);\n                float16x4_t _r12 = vext_f16(_r10, _r1n, 2);\n                float16x4_t _r22 = vext_f16(_r20, _r2n, 2);\n                float16x4_t _r32 = vext_f16(_r30, _r3n, 2);\n\n                float16x4_t _sum0 = vget_low_f16(_bias0);\n                float16x4_t _sum1 = vget_low_f16(_bias0);\n\n                _sum0 = vfma_lane_f16(_sum0, _r00, _k012x, 0);\n                _sum0 = vfma_lane_f16(_sum0, _r01, _k012x, 1);\n                _sum0 = vfma_lane_f16(_sum0, _r02, _k012x, 2);\n                _sum1 = vfma_lane_f16(_sum1, _r10, _k012x, 0);\n                _sum1 = vfma_lane_f16(_sum1, _r11, _k012x, 1);\n                _sum1 = vfma_lane_f16(_sum1, _r12, _k012x, 2);\n\n                _sum0 = vfma_lane_f16(_sum0, _r10, _k345x, 0);\n                _sum0 = vfma_lane_f16(_sum0, _r11, _k345x, 1);\n                _sum0 = vfma_lane_f16(_sum0, _r12, _k345x, 2);\n                _sum1 = vfma_lane_f16(_sum1, _r20, _k345x, 0);\n                _sum1 = vfma_lane_f16(_sum1, _r21, _k345x, 1);\n                _sum1 = vfma_lane_f16(_sum1, _r22, _k345x, 2);\n\n                _sum0 = vfma_lane_f16(_sum0, _r20, _k678x, 0);\n                _sum0 = vfma_lane_f16(_sum0, _r21, _k678x, 1);\n                _sum0 = vfma_lane_f16(_sum0, _r22, _k678x, 2);\n     
           _sum1 = vfma_lane_f16(_sum1, _r30, _k678x, 0);\n                _sum1 = vfma_lane_f16(_sum1, _r31, _k678x, 1);\n                _sum1 = vfma_lane_f16(_sum1, _r32, _k678x, 2);\n\n                vst1_f16(outptr0, _sum0);\n                vst1_f16(outptr1, _sum1);\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                r3 += 4;\n                outptr0 += 4;\n                outptr1 += 4;\n            }\n            for (; j < outw; j++)\n            {\n                float16x4_t _r0 = vld1_f16(r0);\n                float16x4_t _r1 = vld1_f16(r1);\n                float16x4_t _r2 = vld1_f16(r2);\n                float16x4_t _r3 = vld1_f16(r3);\n\n                float16x4_t _sum0 = vmul_f16(_r0, _k012x);\n                _sum0 = vfma_f16(_sum0, _r1, _k345x);\n                _sum0 = vfma_f16(_sum0, _r2, _k678x);\n\n                float16x4_t _sum1 = vmul_f16(_r1, _k012x);\n                _sum1 = vfma_f16(_sum1, _r2, _k345x);\n                _sum1 = vfma_f16(_sum1, _r3, _k678x);\n\n                _sum0 = vset_lane_f16(bias0, _sum0, 3);\n                _sum1 = vset_lane_f16(bias0, _sum1, 3);\n\n                *outptr0 = (__fp16)vaddvq_f32(vcvt_f32_f16(_sum0));\n                *outptr1 = (__fp16)vaddvq_f32(vcvt_f32_f16(_sum1));\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr0++;\n                outptr1++;\n            }\n\n            r0 += 2 + w;\n            r1 += 2 + w;\n            r2 += 2 + w;\n            r3 += 2 + w;\n\n            outptr0 += outw;\n            outptr1 += outw;\n        }\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 7 < outw; j += 8)\n            {\n                float16x8_t _r00 = vld1q_f16(r0);\n                float16x8_t _r10 = vld1q_f16(r1);\n                float16x8_t _r20 = vld1q_f16(r2);\n\n                float16x8_t _r0n = vld1q_f16(r0 + 8);\n                
float16x8_t _r1n = vld1q_f16(r1 + 8);\n                float16x8_t _r2n = vld1q_f16(r2 + 8);\n\n                float16x8_t _r01 = vextq_f16(_r00, _r0n, 1);\n                float16x8_t _r11 = vextq_f16(_r10, _r1n, 1);\n                float16x8_t _r21 = vextq_f16(_r20, _r2n, 1);\n\n                float16x8_t _r02 = vextq_f16(_r00, _r0n, 2);\n                float16x8_t _r12 = vextq_f16(_r10, _r1n, 2);\n                float16x8_t _r22 = vextq_f16(_r20, _r2n, 2);\n\n                float16x8_t _sum0 = _bias0;\n\n                _sum0 = vfmaq_lane_f16(_sum0, _r00, _k012x, 0);\n                _sum0 = vfmaq_lane_f16(_sum0, _r01, _k012x, 1);\n                _sum0 = vfmaq_lane_f16(_sum0, _r02, _k012x, 2);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _r10, _k345x, 0);\n                _sum0 = vfmaq_lane_f16(_sum0, _r11, _k345x, 1);\n                _sum0 = vfmaq_lane_f16(_sum0, _r12, _k345x, 2);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _r20, _k678x, 0);\n                _sum0 = vfmaq_lane_f16(_sum0, _r21, _k678x, 1);\n                _sum0 = vfmaq_lane_f16(_sum0, _r22, _k678x, 2);\n\n                vst1q_f16(outptr0, _sum0);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                outptr0 += 8;\n            }\n            for (; j + 3 < outw; j += 4)\n            {\n                float16x4_t _r00 = vld1_f16(r0);\n                float16x4_t _r10 = vld1_f16(r1);\n                float16x4_t _r20 = vld1_f16(r2);\n\n                float16x4_t _r0n = vld1_f16(r0 + 4);\n                float16x4_t _r1n = vld1_f16(r1 + 4);\n                float16x4_t _r2n = vld1_f16(r2 + 4);\n\n                float16x4_t _r01 = vext_f16(_r00, _r0n, 1);\n                float16x4_t _r11 = vext_f16(_r10, _r1n, 1);\n                float16x4_t _r21 = vext_f16(_r20, _r2n, 1);\n\n                float16x4_t _r02 = vext_f16(_r00, _r0n, 2);\n                float16x4_t _r12 = vext_f16(_r10, _r1n, 2);\n                float16x4_t _r22 = 
vext_f16(_r20, _r2n, 2);\n\n                float16x4_t _sum0 = vget_low_f16(_bias0);\n\n                _sum0 = vfma_lane_f16(_sum0, _r00, _k012x, 0);\n                _sum0 = vfma_lane_f16(_sum0, _r01, _k012x, 1);\n                _sum0 = vfma_lane_f16(_sum0, _r02, _k012x, 2);\n\n                _sum0 = vfma_lane_f16(_sum0, _r10, _k345x, 0);\n                _sum0 = vfma_lane_f16(_sum0, _r11, _k345x, 1);\n                _sum0 = vfma_lane_f16(_sum0, _r12, _k345x, 2);\n\n                _sum0 = vfma_lane_f16(_sum0, _r20, _k678x, 0);\n                _sum0 = vfma_lane_f16(_sum0, _r21, _k678x, 1);\n                _sum0 = vfma_lane_f16(_sum0, _r22, _k678x, 2);\n\n                vst1_f16(outptr0, _sum0);\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                outptr0 += 4;\n            }\n            for (; j < outw; j++)\n            {\n                float16x4_t _r0 = vld1_f16(r0);\n                float16x4_t _r1 = vld1_f16(r1);\n                float16x4_t _r2 = vld1_f16(r2);\n\n                float16x4_t _sum = vmul_f16(_r0, _k012x);\n                _sum = vfma_f16(_sum, _r1, _k345x);\n                _sum = vfma_f16(_sum, _r2, _k678x);\n\n                _sum = vset_lane_f16(bias0, _sum, 3);\n\n                *outptr0 = (__fp16)vaddvq_f32(vcvt_f32_f16(_sum));\n\n                r0++;\n                r1++;\n                r2++;\n                outptr0++;\n            }\n\n            r0 += 2;\n            r1 += 2;\n            r2 += 2;\n        }\n    }\n}\n\nstatic void convdw3x3s2_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const __fp16* kernel = _kernel;\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g 
< group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const __fp16 bias0 = bias ? bias[g] : 0.f;\n\n        const __fp16* kernel0 = kernel + g * 9;\n\n        __fp16* outptr = out;\n\n        const __fp16* img0 = bottom_blob.channel(g);\n\n        const __fp16* r0 = img0;\n        const __fp16* r1 = img0 + w;\n        const __fp16* r2 = img0 + w * 2;\n\n        float16x4_t _k012x = vld1_f16(kernel0);\n        float16x4_t _k345x = vld1_f16(kernel0 + 3);\n        float16x4_t _k678x = vld1_f16(kernel0 + 6);\n\n        _k012x = vset_lane_f16(0.f, _k012x, 3);\n        _k345x = vset_lane_f16(0.f, _k345x, 3);\n        _k678x = vset_lane_f16(0.f, _k678x, 3);\n\n        float16x8_t _bias0 = vdupq_n_f16(bias0);\n\n        int i = 0;\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 7 < outw; j += 8)\n            {\n                float16x8x2_t _r00 = vld2q_f16(r0);\n                float16x8x2_t _r10 = vld2q_f16(r1);\n                float16x8x2_t _r20 = vld2q_f16(r2);\n\n                float16x8x2_t _r0n = vld2q_f16(r0 + 16);\n                float16x8x2_t _r1n = vld2q_f16(r1 + 16);\n                float16x8x2_t _r2n = vld2q_f16(r2 + 16);\n\n                float16x8_t _r02 = vextq_f16(_r00.val[0], _r0n.val[0], 1);\n                float16x8_t _r12 = vextq_f16(_r10.val[0], _r1n.val[0], 1);\n                float16x8_t _r22 = vextq_f16(_r20.val[0], _r2n.val[0], 1);\n\n                float16x8_t _sum = _bias0;\n\n                _sum = vfmaq_lane_f16(_sum, _r00.val[0], _k012x, 0);\n                _sum = vfmaq_lane_f16(_sum, _r00.val[1], _k012x, 1);\n                _sum = vfmaq_lane_f16(_sum, _r02, _k012x, 2);\n\n                _sum = vfmaq_lane_f16(_sum, _r10.val[0], _k345x, 0);\n                _sum = vfmaq_lane_f16(_sum, _r10.val[1], _k345x, 1);\n                _sum = vfmaq_lane_f16(_sum, _r12, _k345x, 2);\n\n                _sum = vfmaq_lane_f16(_sum, _r20.val[0], _k678x, 0);\n                _sum = 
vfmaq_lane_f16(_sum, _r20.val[1], _k678x, 1);\n                _sum = vfmaq_lane_f16(_sum, _r22, _k678x, 2);\n\n                vst1q_f16(outptr, _sum);\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n                outptr += 8;\n            }\n            for (; j + 3 < outw; j += 4)\n            {\n                float16x4x2_t _r00 = vld2_f16(r0);\n                float16x4x2_t _r10 = vld2_f16(r1);\n                float16x4x2_t _r20 = vld2_f16(r2);\n\n                float16x4x2_t _r0n = vld2_f16(r0 + 8);\n                float16x4x2_t _r1n = vld2_f16(r1 + 8);\n                float16x4x2_t _r2n = vld2_f16(r2 + 8);\n\n                float16x4_t _r02 = vext_f16(_r00.val[0], _r0n.val[0], 1);\n                float16x4_t _r12 = vext_f16(_r10.val[0], _r1n.val[0], 1);\n                float16x4_t _r22 = vext_f16(_r20.val[0], _r2n.val[0], 1);\n\n                float16x4_t _sum = vget_low_f16(_bias0);\n\n                _sum = vfma_lane_f16(_sum, _r00.val[0], _k012x, 0);\n                _sum = vfma_lane_f16(_sum, _r00.val[1], _k012x, 1);\n                _sum = vfma_lane_f16(_sum, _r02, _k012x, 2);\n\n                _sum = vfma_lane_f16(_sum, _r10.val[0], _k345x, 0);\n                _sum = vfma_lane_f16(_sum, _r10.val[1], _k345x, 1);\n                _sum = vfma_lane_f16(_sum, _r12, _k345x, 2);\n\n                _sum = vfma_lane_f16(_sum, _r20.val[0], _k678x, 0);\n                _sum = vfma_lane_f16(_sum, _r20.val[1], _k678x, 1);\n                _sum = vfma_lane_f16(_sum, _r22, _k678x, 2);\n\n                vst1_f16(outptr, _sum);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                outptr += 4;\n            }\n            for (; j < outw; j++)\n            {\n                float16x4_t _r0 = vld1_f16(r0);\n                float16x4_t _r1 = vld1_f16(r1);\n                float16x4_t _r2 = vld1_f16(r2);\n\n                float16x4_t _sum = vmul_f16(_r0, _k012x);\n                
_sum = vfma_f16(_sum, _r1, _k345x);\n                _sum = vfma_f16(_sum, _r2, _k678x);\n\n                _sum = vset_lane_f16(bias0, _sum, 3);\n\n                *outptr = (__fp16)vaddvq_f32(vcvt_f32_f16(_sum));\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n                outptr++;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_3x3_int8.h",
    "content": "// Copyright 2019 BUG1989\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_int8_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const signed char* kernel = (const signed char*)_kernel + p * 9;\n\n        int* outptr0 = out;\n        int* outptr0n = outptr0 + outw;\n\n        const signed char* img0 = bottom_blob.channel(p);\n\n        const signed char* r0 = img0;\n        const signed char* r1 = img0 + w;\n        const signed char* r2 = img0 + w * 2;\n        const signed char* r3 = img0 + w * 3;\n\n        int i = 0;\n\n#if __ARM_NEON\n        int8x16_t _k0123456789x = vld1q_s8(kernel);\n        int16x8_t _k_s16 = vmovl_s8(vget_low_s8(_k0123456789x));\n        int16x8_t _kn_s16 = vmovl_s8(vget_high_s8(_k0123456789x));\n\n        int16x4_t _k0123 = vget_low_s16(_k_s16);\n        int16x4_t _k4567 = vget_high_s16(_k_s16);\n        int16x4_t _k8xxx = vget_low_s16(_kn_s16);\n#endif // __ARM_NEON\n\n        for (; i + 1 < outh; i += 2)\n        {\n#if __ARM_NEON\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                                   \\n\"\n                    \"ld1    {v4.8b, v5.8b}, [%3]          \\n\"\n                    \"ld1    {v6.8b, v7.8b}, [%4]          \\n\"\n                    \"ld1    {v8.8b, v9.8b}, [%5]          \\n\"\n                    \"ld1    {v10.8b, v11.8b}, [%6]        \\n\"\n                    \"add    %3, %3, #8                    \\n\"\n                    \"add    %4, %4, #8         
           \\n\"\n                    \"add    %5, %5, #8                    \\n\"\n                    \"add    %6, %6, #8                    \\n\"\n\n                    \"ext    v12.8b, v4.8b, v5.8b, #1      \\n\"\n                    \"ext    v13.8b, v4.8b, v5.8b, #2      \\n\"\n                    \"ext    v14.8b, v6.8b, v7.8b, #1      \\n\"\n                    \"ext    v15.8b, v6.8b, v7.8b, #2      \\n\"\n                    \"ext    v16.8b, v8.8b, v9.8b, #1      \\n\"\n                    \"ext    v17.8b, v8.8b, v9.8b, #2      \\n\"\n                    \"ext    v18.8b, v10.8b, v11.8b, #1    \\n\"\n                    \"ext    v19.8b, v10.8b, v11.8b, #2    \\n\"\n\n                    \"sshll  v4.8h, v4.8b, #0              \\n\" // r00\n                    \"sshll  v12.8h, v12.8b, #0            \\n\" // r01\n                    \"sshll  v13.8h, v13.8b, #0            \\n\" // r02\n                    \"sshll  v6.8h, v6.8b, #0              \\n\" // r10\n                    \"sshll  v14.8h, v14.8b, #0            \\n\" // r11\n                    \"sshll  v15.8h, v15.8b, #0            \\n\" // r12\n                    \"sshll  v8.8h, v8.8b, #0              \\n\" // r20\n                    \"sshll  v16.8h, v16.8b, #0            \\n\" // r21\n                    \"sshll  v17.8h, v17.8b, #0            \\n\" // r22\n                    \"sshll  v10.8h, v10.8b, #0            \\n\" // r30\n                    \"sshll  v18.8h, v18.8b, #0            \\n\" // r31\n                    \"sshll  v19.8h, v19.8b, #0            \\n\" // r32\n\n                    // r0\n                    \"smull  v20.4s, v4.4h, %14.h[0]       \\n\" // (r00 - r07) * k00\n                    \"smull2  v21.4s, v4.8h, %14.h[0]      \\n\"\n                    \"smull  v22.4s, v12.4h, %14.h[1]      \\n\" // (r01 - r08) * k01\n                    \"smull2  v23.4s, v12.8h, %14.h[1]     \\n\"\n                    \"smull  v24.4s, v13.4h, %14.h[2]      \\n\" // (r02 - r09) * k02\n                    
\"smull2  v25.4s, v13.8h, %14.h[2]     \\n\"\n\n                    // r1\n                    \"smull  v26.4s, v6.4h, %14.h[0]       \\n\" // (r10 - r17) * k00\n                    \"smull2  v27.4s, v6.8h, %14.h[0]      \\n\"\n                    \"smull  v28.4s, v14.4h, %14.h[1]      \\n\" // (r11 - r18) * k01\n                    \"smull2  v29.4s, v14.8h, %14.h[1]     \\n\"\n                    \"smull  v30.4s, v15.4h, %14.h[2]      \\n\" // (r12 - r19) * k02\n                    \"smull2  v31.4s, v15.8h, %14.h[2]     \\n\"\n\n                    \"smlal  v20.4s, v6.4h, %14.h[3]       \\n\" // (r10 - r17) * k03\n                    \"smlal2  v21.4s, v6.8h, %14.h[3]      \\n\"\n                    \"smlal  v22.4s, v14.4h, %15.h[0]      \\n\" // (r11 - r18) * k04\n                    \"smlal2  v23.4s, v14.8h, %15.h[0]     \\n\"\n                    \"smlal  v24.4s, v15.4h, %15.h[1]      \\n\" // (r12 - r19) * k05\n                    \"smlal2  v25.4s, v15.8h, %15.h[1]     \\n\"\n\n                    // r2\n                    \"smlal  v26.4s, v8.4h, %14.h[3]       \\n\" // (r20 - r27) * k03\n                    \"smlal2  v27.4s, v8.8h, %14.h[3]      \\n\"\n                    \"smlal  v28.4s, v16.4h, %15.h[0]      \\n\" // (r21 - r28) * k04\n                    \"smlal2  v29.4s, v16.8h, %15.h[0]     \\n\"\n                    \"smlal  v30.4s, v17.4h, %15.h[1]      \\n\" // (r22 - r29) * k05\n                    \"smlal2  v31.4s, v17.8h, %15.h[1]     \\n\"\n\n                    \"smlal  v20.4s, v8.4h, %15.h[2]       \\n\" // (r20 - r27) * k06\n                    \"smlal2  v21.4s, v8.8h, %15.h[2]      \\n\"\n                    \"smlal  v22.4s, v16.4h, %15.h[3]      \\n\" // (r21 - r28) * k07\n                    \"smlal2  v23.4s, v16.8h, %15.h[3]     \\n\"\n                    \"smlal  v24.4s, v17.4h, %16.h[0]      \\n\" // (r22 - r29) * k08\n                    \"smlal2  v25.4s, v17.8h, %16.h[0]     \\n\"\n\n                    // r3\n                    
\"smlal  v26.4s, v10.4h, %15.h[2]      \\n\" // (r30 - r37) * k06\n                    \"smlal2  v27.4s, v10.8h, %15.h[2]     \\n\"\n                    \"smlal  v28.4s, v18.4h, %15.h[3]      \\n\" // (r31 - r38) * k07\n                    \"smlal2  v29.4s, v18.8h, %15.h[3]     \\n\"\n                    \"smlal  v30.4s, v19.4h, %16.h[0]      \\n\" // (r32 - r39) * k08\n                    \"smlal2  v31.4s, v19.8h, %16.h[0]     \\n\"\n\n                    // add and save\n                    \"add    v20.4s, v20.4s, v22.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v23.4s        \\n\"\n                    \"add    v26.4s, v26.4s, v28.4s        \\n\"\n                    \"add    v27.4s, v27.4s, v29.4s        \\n\"\n                    \"add    v20.4s, v20.4s, v24.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v25.4s        \\n\"\n                    \"add    v26.4s, v26.4s, v30.4s        \\n\"\n                    \"add    v27.4s, v27.4s, v31.4s        \\n\"\n\n                    \"st1    {v20.4s, v21.4s}, [%1], #32   \\n\"\n                    \"st1    {v26.4s, v27.4s}, [%2], #32   \\n\"\n\n                    \"subs   %w0, %w0, #1                  \\n\"\n                    \"bne    0b                            \\n\"\n\n                    : \"=r\"(nn),       // %0\n                    \"=r\"(outptr0),  // %1\n                    \"=r\"(outptr0n), // %2\n                    \"=r\"(r0),       // %3\n                    \"=r\"(r1),       // %4\n                    \"=r\"(r2),       // %5\n                    \"=r\"(r3)        // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr0n),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"w\"(_k0123), // %14\n                    \"w\"(_k4567), // %15\n                    \"w\"(_k8xxx)  // %16\n                    : \"cc\", 
\"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                              \\n\"\n                    // r0\n                    \"vld1.s8    {d30-d31}, [%3]      \\n\" // r0\n                    \"add    %3, %3, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r00\n                    \"vmovl.s8    q5, d10             \\n\" // r01\n                    \"vmovl.s8    q6, d12             \\n\" // r02\n                    // sum0\n                    \"vmull.s16  q7, d30, %P14[0]     \\n\" // (r00 - r07) * k00\n                    \"vmull.s16  q8, d31, %P14[0]     \\n\"\n                    \"vmull.s16  q9, d10, %P14[1]     \\n\" // (r01 - r08) * k01\n                    \"vmull.s16  q10, d11, %P14[1]    \\n\"\n                    \"vmlal.s16  q7, d12, %P14[2]     \\n\" // (r02 - r09) * k02\n                    \"vmlal.s16  q8, d13, %P14[2]     \\n\"\n\n                    // r1\n                    \"vld1.s8    {d30-d31}, [%4]      \\n\" // r1\n                    \"add    %4, %4, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r10\n                    \"vmovl.s8    q5, d10             \\n\" // r11\n                    \"vmovl.s8    q6, d12             \\n\" // r12\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P14[3]     \\n\" // (r10 - r17) * k03\n                    \"vmlal.s16  q8, d31, 
%P14[3]     \\n\"\n                    \"vmlal.s16  q9, d10, %P15[0]     \\n\" // (r11 - r18) * k04\n                    \"vmlal.s16  q10, d11, %P15[0]    \\n\"\n                    \"vmlal.s16  q7, d12, %P15[1]     \\n\" // (r12 - r19) * k05\n                    \"vmlal.s16  q8, d13, %P15[1]     \\n\"\n                    // sum1\n                    \"vmull.s16  q11, d30, %P14[0]    \\n\" // (r10 - r17) * k00\n                    \"vmull.s16  q12, d31, %P14[0]    \\n\"\n                    \"vmull.s16  q13, d10, %P14[1]    \\n\" // (r11 - r18) * k01\n                    \"vmull.s16  q14, d11, %P14[1]    \\n\"\n                    \"vmlal.s16  q11, d12, %P14[2]    \\n\" // (r12 - r19) * k02\n                    \"vmlal.s16  q12, d13, %P14[2]    \\n\"\n\n                    // r2\n                    \"vld1.s8    {d30-d31}, [%5]      \\n\" // r2\n                    \"add    %5, %5, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r20\n                    \"vmovl.s8    q5, d10             \\n\" // r21\n                    \"vmovl.s8    q6, d12             \\n\" // r22\n\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P15[2]     \\n\" // (r20 - r27) * k06\n                    \"vmlal.s16  q8, d31, %P15[2]     \\n\"\n                    \"vmlal.s16  q9, d10, %P15[3]     \\n\" // (r21 - r28) * k07\n                    \"vmlal.s16  q10, d11, %P15[3]    \\n\"\n                    \"vmlal.s16  q7, d12, %P16[0]     \\n\" // (r22 - r29) * k08\n                    \"vmlal.s16  q8, d13, %P16[0]     \\n\"\n                    // sum1\n                    \"vmlal.s16  q11, d30, %P14[3]    \\n\" // (r20 - r27) * k03\n                    \"vmlal.s16  q12, d31, %P14[3]    \\n\"\n                    \"vmlal.s16  q13, d10, %P15[0]    \\n\" // (r21 - r28) * k04\n                    \"vmlal.s16  
q14, d11, %P15[0]    \\n\"\n                    \"vmlal.s16  q11, d12, %P15[1]    \\n\" // (r22 - r29) * k05\n                    \"vmlal.s16  q12, d13, %P15[1]    \\n\"\n\n                    // r3\n                    \"vld1.s8    {d30-d31}, [%6]      \\n\" // r3\n                    \"add    %6, %6, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r30\n                    \"vmovl.s8    q5, d10             \\n\" // r31\n                    \"vmovl.s8    q6, d12             \\n\" // r32\n\n                    // sum1\n                    \"vmlal.s16  q11, d30, %P15[2]    \\n\" // (r30 - r37) * k06\n                    \"vmlal.s16  q12, d31, %P15[2]    \\n\"\n                    \"vmlal.s16  q13, d10, %P15[3]    \\n\" // (r31 - r38) * k07\n                    \"vmlal.s16  q14, d11, %P15[3]    \\n\"\n                    \"vmlal.s16  q11, d12, %P16[0]    \\n\" // (r32 - r39) * k08\n                    \"vmlal.s16  q12, d13, %P16[0]    \\n\"\n\n                    \"subs   %0, %0, #1               \\n\"\n\n                    // add and save\n                    \"vadd.s32    q7, q7, q9          \\n\"\n                    \"vadd.s32    q8, q8, q10         \\n\"\n                    \"vadd.s32    q11, q11, q13       \\n\"\n                    \"vadd.s32    q12, q12, q14       \\n\"\n\n                    \"vst1.s32    {d14-d17}, [%1]!    \\n\"\n                    \"vst1.s32    {d22-d25}, [%2]!    
\\n\"\n\n                    \"bne    0b                       \\n\"\n\n                    : \"=r\"(nn),       // %0\n                    \"=r\"(outptr0),  // %1\n                    \"=r\"(outptr0n), // %2\n                    \"=r\"(r0),       // %3\n                    \"=r\"(r1),       // %4\n                    \"=r\"(r2),       // %5\n                    \"=r\"(r3)        // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr0n),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"w\"(_k0123), // %14\n                    \"w\"(_k4567), // %15\n                    \"w\"(_k8xxx)  // %16\n                    : \"cc\", \"memory\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                // TODO NEON\n                int sum0 = 0;\n                int sum0n = 0;\n\n                sum0 += (int)r0[0] * kernel[0];\n                sum0 += (int)r0[1] * kernel[1];\n                sum0 += (int)r0[2] * kernel[2];\n                sum0 += (int)r1[0] * kernel[3];\n                sum0 += (int)r1[1] * kernel[4];\n                sum0 += (int)r1[2] * kernel[5];\n                sum0 += (int)r2[0] * kernel[6];\n                sum0 += (int)r2[1] * kernel[7];\n                sum0 += (int)r2[2] * kernel[8];\n\n                sum0n += (int)r1[0] * kernel[0];\n                sum0n += (int)r1[1] * kernel[1];\n                sum0n += (int)r1[2] * kernel[2];\n                sum0n += (int)r2[0] * kernel[3];\n                sum0n += (int)r2[1] * kernel[4];\n                sum0n += (int)r2[2] * kernel[5];\n                sum0n += (int)r3[0] * kernel[6];\n                sum0n += (int)r3[1] * kernel[7];\n                sum0n += (int)r3[2] * 
kernel[8];\n\n                *outptr0 = sum0;\n                *outptr0n = sum0n;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr0++;\n                outptr0n++;\n            }\n\n            r0 += 2 + w;\n            r1 += 2 + w;\n            r2 += 2 + w;\n            r3 += 2 + w;\n\n            outptr0 += outw;\n            outptr0n += outw;\n        }\n\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                                   \\n\"\n                    \"ld1    {v4.8b, v5.8b}, [%2]          \\n\"\n                    \"ld1    {v6.8b, v7.8b}, [%3]          \\n\"\n                    \"ld1    {v8.8b, v9.8b}, [%4]          \\n\"\n                    \"add    %2, %2, #8                    \\n\"\n                    \"add    %3, %3, #8                    \\n\"\n                    \"add    %4, %4, #8                    \\n\"\n\n                    \"ext    v12.8b, v4.8b, v5.8b, #1      \\n\"\n                    \"ext    v13.8b, v4.8b, v5.8b, #2      \\n\"\n                    \"ext    v14.8b, v6.8b, v7.8b, #1      \\n\"\n                    \"ext    v15.8b, v6.8b, v7.8b, #2      \\n\"\n                    \"ext    v16.8b, v8.8b, v9.8b, #1      \\n\"\n                    \"ext    v17.8b, v8.8b, v9.8b, #2      \\n\"\n\n                    \"sshll  v4.8h, v4.8b, #0              \\n\" // r00\n                    \"sshll  v12.8h, v12.8b, #0            \\n\" // r01\n                    \"sshll  v13.8h, v13.8b, #0            \\n\" // r02\n                    \"sshll  v6.8h, v6.8b, #0              \\n\" // r10\n                    \"sshll  v14.8h, v14.8b, #0            \\n\" // r11\n                    \"sshll  v15.8h, v15.8b, 
#0            \\n\" // r12\n                    \"sshll  v8.8h, v8.8b, #0              \\n\" // r20\n                    \"sshll  v16.8h, v16.8b, #0            \\n\" // r21\n                    \"sshll  v17.8h, v17.8b, #0            \\n\" // r22\n\n                    // r0\n                    \"smull  v20.4s, v4.4h, %10.h[0]       \\n\" // (r00 - r07) * k00\n                    \"smull2  v21.4s, v4.8h, %10.h[0]      \\n\"\n                    \"smull  v22.4s, v12.4h, %10.h[1]      \\n\" // (r01 - r08) * k01\n                    \"smull2  v23.4s, v12.8h, %10.h[1]     \\n\"\n                    \"smull  v24.4s, v13.4h, %10.h[2]      \\n\" // (r02 - r09) * k02\n                    \"smull2  v25.4s, v13.8h, %10.h[2]     \\n\"\n\n                    // r1\n                    \"smlal  v20.4s, v6.4h, %10.h[3]       \\n\" // (r10 - r17) * k03\n                    \"smlal2  v21.4s, v6.8h, %10.h[3]      \\n\"\n                    \"smlal  v22.4s, v14.4h, %11.h[0]      \\n\" // (r11 - r18) * k04\n                    \"smlal2  v23.4s, v14.8h, %11.h[0]     \\n\"\n                    \"smlal  v24.4s, v15.4h, %11.h[1]      \\n\" // (r12 - r19) * k05\n                    \"smlal2  v25.4s, v15.8h, %11.h[1]     \\n\"\n\n                    // r2\n                    \"smlal  v20.4s, v8.4h, %11.h[2]       \\n\" // (r20 - r27) * k06\n                    \"smlal2  v21.4s, v8.8h, %11.h[2]      \\n\"\n                    \"smlal  v22.4s, v16.4h, %11.h[3]      \\n\" // (r21 - r28) * k07\n                    \"smlal2  v23.4s, v16.8h, %11.h[3]     \\n\"\n                    \"smlal  v24.4s, v17.4h, %12.h[0]      \\n\" // (r22 - r29) * k08\n                    \"smlal2  v25.4s, v17.8h, %12.h[0]     \\n\"\n\n                    // add and save\n                    \"add    v20.4s, v20.4s, v22.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v23.4s        \\n\"\n                    \"add    v20.4s, v20.4s, v24.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v25.4s 
       \\n\"\n\n                    \"st1    {v20.4s, v21.4s}, [%1], #32   \\n\"\n\n                    \"subs   %w0, %w0, #1                  \\n\"\n                    \"bne    0b                            \\n\"\n\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2)       // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123), // %10\n                    \"w\"(_k4567), // %11\n                    \"w\"(_k8xxx)  // %12\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                              \\n\"\n                    // r0\n                    \"vld1.s8    {d30-d31}, [%2]        \\n\" // r0\n                    \"add    %2, %2, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1      \\n\"\n                    \"vext.s8    d12, d30, d31, #2      \\n\"\n\n                    \"vmovl.s8    q15, d30              \\n\" // r00\n                    \"vmovl.s8    q5, d10             \\n\"   // r01\n                    \"vmovl.s8    q6, d12             \\n\"   // r02\n                    // sum0\n                    \"vmull.s16  q7, d30, %P10[0]      \\n\" // (r00 - r07) * k00\n                    \"vmull.s16  q8, d31, %P10[0]      \\n\"\n                    \"vmull.s16  q9, d10, %P10[1]     \\n\" // (r01 - r08) * k01\n                    \"vmull.s16  q10, d11, %P10[1]    \\n\"\n      
              \"vmlal.s16  q7, d12, %P10[2]     \\n\" // (r02 - r09) * k02\n                    \"vmlal.s16  q8, d13, %P10[2]     \\n\"\n\n                    // r1\n                    \"vld1.s8    {d30-d31}, [%3]        \\n\" // r1\n                    \"add    %3, %3, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1      \\n\"\n                    \"vext.s8    d12, d30, d31, #2      \\n\"\n\n                    \"vmovl.s8    q15, d30              \\n\" // r10\n                    \"vmovl.s8    q5, d10             \\n\"   // r11\n                    \"vmovl.s8    q6, d12             \\n\"   // r12\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P10[3]      \\n\" // (r10 - r17) * k03\n                    \"vmlal.s16  q8, d31, %P10[3]      \\n\"\n                    \"vmlal.s16  q9, d10, %P11[0]     \\n\" // (r11 - r18) * k04\n                    \"vmlal.s16  q10, d11, %P11[0]    \\n\"\n                    \"vmlal.s16  q7, d12, %P11[1]     \\n\" // (r12 - r19) * k05\n                    \"vmlal.s16  q8, d13, %P11[1]     \\n\"\n\n                    // r2\n                    \"vld1.s8    {d30-d31}, [%4]        \\n\" // r2\n                    \"add    %4, %4, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1      \\n\"\n                    \"vext.s8    d12, d30, d31, #2      \\n\"\n\n                    \"vmovl.s8    q15, d30              \\n\" // r20\n                    \"vmovl.s8    q5, d10             \\n\"   // r21\n                    \"vmovl.s8    q6, d12             \\n\"   // r22\n\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P11[2]      \\n\" // (r20 - r27) * k06\n                    \"vmlal.s16  q8, d31, %P11[2]      \\n\"\n                    \"vmlal.s16  q9, d10, %P11[3]     \\n\" // (r21 - r28) * k07\n                    \"vmlal.s16  q10, d11, %P11[3]    \\n\"\n                    \"vmlal.s16  q7, d12, %P12[0]     \\n\" // (r22 - r29) * k08\n    
                \"vmlal.s16  q8, d13, %P12[0]     \\n\"\n\n                    \"subs   %0, %0, #1               \\n\"\n\n                    // add and save\n                    \"vadd.s32    q7, q7, q9          \\n\"\n                    \"vadd.s32    q8, q8, q10         \\n\"\n\n                    \"vst1.s32    {d14-d17}, [%1]!    \\n\"\n\n                    \"bne    0b                       \\n\"\n\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2)       // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123), // %10\n                    \"w\"(_k4567), // %11\n                    \"w\"(_k8xxx)  // %12\n                    : \"cc\", \"memory\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                int sum = 0;\n\n                sum += (int)r0[0] * kernel[0];\n                sum += (int)r0[1] * kernel[1];\n                sum += (int)r0[2] * kernel[2];\n                sum += (int)r1[0] * kernel[3];\n                sum += (int)r1[1] * kernel[4];\n                sum += (int)r1[2] * kernel[5];\n                sum += (int)r2[0] * kernel[6];\n                sum += (int)r2[1] * kernel[7];\n                sum += (int)r2[2] * kernel[8];\n\n                *outptr0 = sum;\n\n                r0++;\n                r1++;\n                r2++;\n                outptr0++;\n            }\n\n            r0 += 2;\n            r1 += 2;\n            r2 += 2;\n        }\n    }\n}\n\nstatic void convdw3x3s2_int8_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Option& 
opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const signed char* kernel = (const signed char*)_kernel + p * 9;\n\n        int* outptr = out;\n\n        const signed char* img = bottom_blob.channel(p);\n\n        const signed char* r0 = img;\n        const signed char* r1 = img + w;\n        const signed char* r2 = img + w * 2;\n\n        int i = 0;\n#if __ARM_NEON\n        int8x16_t _k0123456789x = vld1q_s8(kernel);\n        int16x8_t _k_s16 = vmovl_s8(vget_low_s8(_k0123456789x));\n        int16x8_t _kn_s16 = vmovl_s8(vget_high_s8(_k0123456789x));\n\n        int16x4_t _k0123 = vget_low_s16(_k_s16);\n        int16x4_t _k4567 = vget_high_s16(_k_s16);\n        int16x4_t _k8xxx = vget_low_s16(_kn_s16);\n#endif // __ARM_NEON\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                                   \\n\"\n                    \"ld2    {v4.8b, v5.8b}, [%2], #16     \\n\"\n                    \"ld2    {v6.8b, v7.8b}, [%2]          \\n\"\n                    \"ld2    {v8.8b, v9.8b}, [%3], #16     \\n\"\n                    \"ld2    {v10.8b, v11.8b}, [%3]        \\n\"\n                    \"ld2    {v12.8b, v13.8b}, [%4], #16   \\n\"\n                    \"ld2    {v14.8b, v15.8b}, [%4]        \\n\"\n\n                    \"ext    v6.8b, v4.8b, v6.8b, #1       \\n\"\n                    \"ext    v10.8b, v8.8b, v10.8b, #1     \\n\"\n                    \"ext    v14.8b, v12.8b, v14.8b, #1    \\n\"\n\n                    \"sshll  
v4.8h, v4.8b, #0              \\n\" // r00\n                    \"sshll  v5.8h, v5.8b, #0              \\n\" // r01\n                    \"sshll  v6.8h, v6.8b, #0              \\n\" // r02\n                    \"sshll  v8.8h, v8.8b, #0              \\n\" // r10\n                    \"sshll  v9.8h, v9.8b, #0              \\n\" // r11\n                    \"sshll  v10.8h, v10.8b, #0            \\n\" // r12\n                    \"sshll  v12.8h, v12.8b, #0            \\n\" // r20\n                    \"sshll  v13.8h, v13.8b, #0            \\n\" // r21\n                    \"sshll  v14.8h, v14.8b, #0            \\n\" // r22\n\n                    // r0\n                    \"smull  v20.4s, v4.4h, %10.h[0]       \\n\" // (r00 - r07) * k00\n                    \"smull2  v21.4s, v4.8h, %10.h[0]      \\n\"\n                    \"smull  v22.4s, v5.4h, %10.h[1]       \\n\" // (r01 - r08) * k01\n                    \"smull2  v23.4s, v5.8h, %10.h[1]      \\n\"\n                    \"smull  v24.4s, v6.4h, %10.h[2]       \\n\" // (r02 - r09) * k02\n                    \"smull2  v25.4s, v6.8h, %10.h[2]      \\n\"\n\n                    // r1\n                    \"smlal  v20.4s, v8.4h, %10.h[3]       \\n\" // (r10 - r17) * k03\n                    \"smlal2  v21.4s, v8.8h, %10.h[3]      \\n\"\n                    \"smlal  v22.4s, v9.4h, %11.h[0]       \\n\" // (r11 - r18) * k04\n                    \"smlal2  v23.4s, v9.8h, %11.h[0]      \\n\"\n                    \"smlal  v24.4s, v10.4h, %11.h[1]      \\n\" // (r12 - r19) * k05\n                    \"smlal2  v25.4s, v10.8h, %11.h[1]     \\n\"\n\n                    // r2\n                    \"smlal  v20.4s, v12.4h, %11.h[2]      \\n\" // (r20 - r27) * k06\n                    \"smlal2  v21.4s, v12.8h, %11.h[2]     \\n\"\n                    \"smlal  v22.4s, v13.4h, %11.h[3]      \\n\" // (r21 - r28) * k07\n                    \"smlal2  v23.4s, v13.8h, %11.h[3]     \\n\"\n                    \"smlal  v24.4s, v14.4h, %12.h[0]      
\\n\" // (r22 - r29) * k08\n                    \"smlal2  v25.4s, v14.8h, %12.h[0]     \\n\"\n\n                    // add and save\n                    \"add    v20.4s, v20.4s, v22.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v23.4s        \\n\"\n                    \"add    v20.4s, v20.4s, v24.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v25.4s        \\n\"\n\n                    \"st1    {v20.4s, v21.4s}, [%1], #32   \\n\"\n\n                    \"subs   %w0, %w0, #1                  \\n\"\n                    \"bne    0b                            \\n\"\n\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123), // %10\n                    \"w\"(_k4567), // %11\n                    \"w\"(_k8xxx)  // %12\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                              \\n\"\n                    // r0\n                    \"vld2.s8    {d30-d31}, [%2]!     
\\n\" // r0\n                    \"vld2.s8    {d10-d11}, [%2]      \\n\"\n                    \"vext.s8    d12, d30, d10, #1    \\n\"\n\n                    \"vmovl.s8    q5, d31             \\n\" // r01\n                    \"vmovl.s8    q15, d30            \\n\" // r00\n                    \"vmovl.s8    q6, d12             \\n\" // r02\n                    // sum0\n                    \"vmull.s16  q7, d30, %P10[0]     \\n\" // (r00 - r07) * k00\n                    \"vmull.s16  q8, d31, %P10[0]     \\n\"\n                    \"vmull.s16  q9, d10, %P10[1]     \\n\" // (r01 - r08) * k01\n                    \"vmull.s16  q10, d11, %P10[1]    \\n\"\n                    \"vmlal.s16  q7, d12, %P10[2]     \\n\" // (r02 - r09) * k02\n                    \"vmlal.s16  q8, d13, %P10[2]     \\n\"\n\n                    // r1\n                    \"vld2.s8    {d30-d31}, [%3]!     \\n\" // r1\n                    \"vld2.s8    {d10-d11}, [%3]      \\n\"\n                    \"vext.s8    d12, d30, d10, #1    \\n\"\n\n                    \"vmovl.s8    q5, d31             \\n\" // r11\n                    \"vmovl.s8    q15, d30            \\n\" // r10\n                    \"vmovl.s8    q6, d12             \\n\" // r12\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P10[3]     \\n\" // (r10 - r17) * k03\n                    \"vmlal.s16  q8, d31, %P10[3]     \\n\"\n                    \"vmlal.s16  q9, d10, %P11[0]     \\n\" // (r11 - r18) * k04\n                    \"vmlal.s16  q10, d11, %P11[0]    \\n\"\n                    \"vmlal.s16  q7, d12, %P11[1]     \\n\" // (r12 - r19) * k05\n                    \"vmlal.s16  q8, d13, %P11[1]     \\n\"\n\n                    // r2\n                    \"vld2.s8    {d30-d31}, [%4]!     
\\n\" // r2\n                    \"vld2.s8    {d10-d11}, [%4]      \\n\"\n                    \"vext.s8    d12, d30, d10, #1    \\n\"\n\n                    \"vmovl.s8    q5, d31             \\n\" // r21\n                    \"vmovl.s8    q15, d30            \\n\" // r20\n                    \"vmovl.s8    q6, d12             \\n\" // r22\n\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P11[2]     \\n\" // (r20 - r27) * k06\n                    \"vmlal.s16  q8, d31, %P11[2]     \\n\"\n                    \"vmlal.s16  q9, d10, %P11[3]     \\n\" // (r21 - r28) * k07\n                    \"vmlal.s16  q10, d11, %P11[3]    \\n\"\n                    \"vmlal.s16  q7, d12, %P12[0]     \\n\" // (r22 - r29) * k08\n                    \"vmlal.s16  q8, d13, %P12[0]     \\n\"\n\n                    \"subs   %0, %0, #1               \\n\"\n\n                    // add and save\n                    \"vadd.s32    q7, q7, q9          \\n\"\n                    \"vadd.s32    q8, q8, q10         \\n\"\n\n                    \"vst1.s32    {d14-d17}, [%1]!    
\\n\"\n\n                    \"bne    0b                       \\n\"\n\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123), // %10\n                    \"w\"(_k4567), // %11\n                    \"w\"(_k8xxx)  // %12\n                    : \"cc\", \"memory\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                int sum = 0;\n\n                sum += (int)r0[0] * kernel[0];\n                sum += (int)r0[1] * kernel[1];\n                sum += (int)r0[2] * kernel[2];\n                sum += (int)r1[0] * kernel[3];\n                sum += (int)r1[1] * kernel[4];\n                sum += (int)r1[2] * kernel[5];\n                sum += (int)r2[0] * kernel[6];\n                sum += (int)r2[1] * kernel[7];\n                sum += (int)r2[2] * kernel[8];\n\n                *outptr = sum;\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n                outptr++;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n\nstatic void convdw3x3s1_int8_requant_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, std::vector<float> scales_requant, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    
{\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float scale_requant_in = scales_requant[2 * p];\n        const float scale_requant_out = scales_requant[2 * p + 1];\n\n        const signed char* kernel = (const signed char*)_kernel + p * 9;\n\n        signed char* outptr0 = out;\n        signed char* outptr0n = outptr0 + outw;\n\n        const signed char* img0 = bottom_blob.channel(p);\n\n        const signed char* r0 = img0;\n        const signed char* r1 = img0 + w;\n        const signed char* r2 = img0 + w * 2;\n        const signed char* r3 = img0 + w * 3;\n\n        int i = 0;\n\n#if __ARM_NEON\n        int8x16_t _k0123456789x = vld1q_s8(kernel);\n        int16x8_t _k_s16 = vmovl_s8(vget_low_s8(_k0123456789x));\n        int16x8_t _kn_s16 = vmovl_s8(vget_high_s8(_k0123456789x));\n\n        int16x4_t _k0123 = vget_low_s16(_k_s16);\n        int16x4_t _k4567 = vget_high_s16(_k_s16);\n        int16x4_t _k8xxx = vget_low_s16(_kn_s16);\n#endif // __ARM_NEON\n\n        for (; i + 1 < outh; i += 2)\n        {\n#if __ARM_NEON\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                                   \\n\"\n                    \"ld1    {v4.8b, v5.8b}, [%3]          \\n\"\n                    \"ld1    {v6.8b, v7.8b}, [%4]          \\n\"\n                    \"ld1    {v8.8b, v9.8b}, [%5]          \\n\"\n                    \"ld1    {v10.8b, v11.8b}, [%6]        \\n\"\n                    \"add    %3, %3, #8                    \\n\"\n                    \"add    %4, %4, #8                    \\n\"\n                    \"add    %5, %5, #8                    \\n\"\n                    \"add    %6, %6, #8                    \\n\"\n\n                    \"ext    v12.8b, v4.8b, v5.8b, #1      \\n\"\n  
                  \"ext    v13.8b, v4.8b, v5.8b, #2      \\n\"\n                    \"ext    v14.8b, v6.8b, v7.8b, #1      \\n\"\n                    \"ext    v15.8b, v6.8b, v7.8b, #2      \\n\"\n                    \"ext    v16.8b, v8.8b, v9.8b, #1      \\n\"\n                    \"ext    v17.8b, v8.8b, v9.8b, #2      \\n\"\n                    \"ext    v18.8b, v10.8b, v11.8b, #1    \\n\"\n                    \"ext    v19.8b, v10.8b, v11.8b, #2    \\n\"\n\n                    \"sshll  v4.8h, v4.8b, #0              \\n\" // r00\n                    \"sshll  v12.8h, v12.8b, #0            \\n\" // r01\n                    \"sshll  v13.8h, v13.8b, #0            \\n\" // r02\n                    \"sshll  v6.8h, v6.8b, #0              \\n\" // r10\n                    \"sshll  v14.8h, v14.8b, #0            \\n\" // r11\n                    \"sshll  v15.8h, v15.8b, #0            \\n\" // r12\n                    \"sshll  v8.8h, v8.8b, #0              \\n\" // r20\n                    \"sshll  v16.8h, v16.8b, #0            \\n\" // r21\n                    \"sshll  v17.8h, v17.8b, #0            \\n\" // r22\n                    \"sshll  v10.8h, v10.8b, #0            \\n\" // r30\n                    \"sshll  v18.8h, v18.8b, #0            \\n\" // r31\n                    \"sshll  v19.8h, v19.8b, #0            \\n\" // r32\n\n                    // r0\n                    \"smull  v20.4s, v4.4h, %14.h[0]       \\n\" // (r00 - r07) * k00\n                    \"smull2  v21.4s, v4.8h, %14.h[0]      \\n\"\n                    \"smull  v22.4s, v12.4h, %14.h[1]      \\n\" // (r01 - r08) * k01\n                    \"smull2  v23.4s, v12.8h, %14.h[1]     \\n\"\n                    \"smull  v24.4s, v13.4h, %14.h[2]      \\n\" // (r02 - r09) * k02\n                    \"smull2  v25.4s, v13.8h, %14.h[2]     \\n\"\n\n                    // r1\n                    \"smull  v26.4s, v6.4h, %14.h[0]       \\n\" // (r10 - r17) * k00\n                    \"smull2  v27.4s, v6.8h, %14.h[0]     
 \\n\"\n                    \"smull  v28.4s, v14.4h, %14.h[1]      \\n\" // (r11 - r18) * k01\n                    \"smull2  v29.4s, v14.8h, %14.h[1]     \\n\"\n                    \"smull  v30.4s, v15.4h, %14.h[2]      \\n\" // (r12 - r19) * k02\n                    \"smull2  v31.4s, v15.8h, %14.h[2]     \\n\"\n\n                    \"smlal  v20.4s, v6.4h, %14.h[3]       \\n\" // (r10 - r17) * k03\n                    \"smlal2  v21.4s, v6.8h, %14.h[3]      \\n\"\n                    \"smlal  v22.4s, v14.4h, %15.h[0]      \\n\" // (r11 - r18) * k04\n                    \"smlal2  v23.4s, v14.8h, %15.h[0]     \\n\"\n                    \"smlal  v24.4s, v15.4h, %15.h[1]      \\n\" // (r12 - r19) * k05\n                    \"smlal2  v25.4s, v15.8h, %15.h[1]     \\n\"\n\n                    // r2\n                    \"smlal  v26.4s, v8.4h, %14.h[3]       \\n\" // (r20 - r27) * k03\n                    \"smlal2  v27.4s, v8.8h, %14.h[3]      \\n\"\n                    \"smlal  v28.4s, v16.4h, %15.h[0]      \\n\" // (r21 - r28) * k04\n                    \"smlal2  v29.4s, v16.8h, %15.h[0]     \\n\"\n                    \"smlal  v30.4s, v17.4h, %15.h[1]      \\n\" // (r22 - r29) * k05\n                    \"smlal2  v31.4s, v17.8h, %15.h[1]     \\n\"\n\n                    \"smlal  v20.4s, v8.4h, %15.h[2]       \\n\" // (r20 - r27) * k06\n                    \"smlal2  v21.4s, v8.8h, %15.h[2]      \\n\"\n                    \"smlal  v22.4s, v16.4h, %15.h[3]      \\n\" // (r21 - r28) * k07\n                    \"smlal2  v23.4s, v16.8h, %15.h[3]     \\n\"\n                    \"smlal  v24.4s, v17.4h, %16.h[0]      \\n\" // (r22 - r29) * k08\n                    \"smlal2  v25.4s, v17.8h, %16.h[0]     \\n\"\n\n                    // r3\n                    \"smlal  v26.4s, v10.4h, %15.h[2]      \\n\" // (r30 - r37) * k06\n                    \"smlal2  v27.4s, v10.8h, %15.h[2]     \\n\"\n                    \"smlal  v28.4s, v18.4h, %15.h[3]      \\n\" // (r31 - r38) * k07\n       
             \"smlal2  v29.4s, v18.8h, %15.h[3]     \\n\"\n                    \"smlal  v30.4s, v19.4h, %16.h[0]      \\n\" // (r32 - r39) * k08\n                    \"smlal2  v31.4s, v19.8h, %16.h[0]     \\n\"\n\n                    // add and save\n                    \"add    v20.4s, v20.4s, v22.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v23.4s        \\n\"\n                    \"add    v26.4s, v26.4s, v28.4s        \\n\"\n                    \"add    v27.4s, v27.4s, v29.4s        \\n\"\n                    \"add    v20.4s, v20.4s, v24.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v25.4s        \\n\"\n                    \"add    v26.4s, v26.4s, v30.4s        \\n\"\n                    \"add    v27.4s, v27.4s, v31.4s        \\n\"\n\n                    \"dup    v4.4s, %w17                   \\n\" // bias\n                    \"dup    v5.4s, %w18                   \\n\" // scale_in\n                    \"dup    v6.4s, %w19                   \\n\" // scale_out\n\n                    // top_s32 -> top_f32\n                    \"scvtf  v20.4s, v20.4s                 \\n\"\n                    \"scvtf  v21.4s, v21.4s                 \\n\"\n                    \"scvtf  v26.4s, v26.4s                 \\n\"\n                    \"scvtf  v27.4s, v27.4s                 \\n\"\n\n                    // top_f32 = top_f32 * scale_in\n                    \"fmul   v20.4s, v20.4s, v5.4s          \\n\"\n                    \"fmul   v21.4s, v21.4s, v5.4s          \\n\"\n                    \"fmul   v26.4s, v26.4s, v5.4s          \\n\"\n                    \"fmul   v27.4s, v27.4s, v5.4s          \\n\"\n                    // top_f32 = top_f32 + bias\n                    \"fadd   v20.4s, v20.4s, v4.4s          \\n\"\n                    \"fadd   v21.4s, v21.4s, v4.4s          \\n\"\n                    \"fadd   v26.4s, v26.4s, v4.4s          \\n\"\n                    \"fadd   v27.4s, v27.4s, v4.4s          \\n\"\n                    // top_f32 = 
top_f32 * scale_out\n                    \"fmul   v20.4s, v20.4s, v6.4s          \\n\"\n                    \"fmul   v21.4s, v21.4s, v6.4s          \\n\"\n                    \"fmul   v26.4s, v26.4s, v6.4s          \\n\"\n                    \"fmul   v27.4s, v27.4s, v6.4s          \\n\"\n                    // top_f32 -> top_s32\n                    \"fcvtas v20.4s, v20.4s                 \\n\"\n                    \"fcvtas v21.4s, v21.4s                 \\n\"\n                    \"fcvtas v26.4s, v26.4s                 \\n\"\n                    \"fcvtas v27.4s, v27.4s                 \\n\"\n                    // top_s32 -> top_s16\n                    \"sqxtn  v7.4h, v20.4s                 \\n\"\n                    \"sqxtn  v9.4h, v26.4s                 \\n\"\n                    \"sqxtn2 v7.8h, v21.4s                 \\n\"\n                    \"sqxtn2 v9.8h, v27.4s                 \\n\"\n                    // top_s16 -> top_s8\n                    \"sqxtn  v8.8b, v7.8h                  \\n\"\n                    \"sqxtn  v10.8b, v9.8h                 \\n\"\n                    // save top_s8\n                    \"st1    {v8.8b}, [%1], #8             \\n\"\n                    \"st1    {v10.8b}, [%2], #8            \\n\"\n\n                    \"subs   %w0, %w0, #1                  \\n\"\n                    \"bne    0b                            \\n\"\n\n                    : \"=r\"(nn),       // %0\n                    \"=r\"(outptr0),  // %1\n                    \"=r\"(outptr0n), // %2\n                    \"=r\"(r0),       // %3\n                    \"=r\"(r1),       // %4\n                    \"=r\"(r2),       // %5\n                    \"=r\"(r3)        // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr0n),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"w\"(_k0123),           // %14\n        
            \"w\"(_k4567),           // %15\n                    \"w\"(_k8xxx),           // %16\n                    \"r\"(bias0),            // %17\n                    \"r\"(scale_requant_in), // %18\n                    \"r\"(scale_requant_out) // %19\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                              \\n\"\n                    // r0\n                    \"vld1.s8    {d30-d31}, [%3]      \\n\" // r0\n                    \"add    %3, %3, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r00\n                    \"vmovl.s8    q5, d10             \\n\" // r01\n                    \"vmovl.s8    q6, d12             \\n\" // r02\n                    // sum0\n                    \"vmull.s16  q7, d30, %P14[0]     \\n\" // (r00 - r07) * k00\n                    \"vmull.s16  q8, d31, %P14[0]     \\n\"\n                    \"vmull.s16  q9, d10, %P14[1]     \\n\" // (r01 - r08) * k01\n                    \"vmull.s16  q10, d11, %P14[1]    \\n\"\n                    \"vmlal.s16  q7, d12, %P14[2]     \\n\" // (r02 - r09) * k02\n                    \"vmlal.s16  q8, d13, %P14[2]     \\n\"\n\n                    // r1\n                    \"vld1.s8    {d30-d31}, [%4]      \\n\" // r1\n                    \"add    %4, %4, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r10\n     
               \"vmovl.s8    q5, d10             \\n\" // r11\n                    \"vmovl.s8    q6, d12             \\n\" // r12\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P14[3]     \\n\" // (r10 - r17) * k03\n                    \"vmlal.s16  q8, d31, %P14[3]     \\n\"\n                    \"vmlal.s16  q9, d10, %P15[0]     \\n\" // (r11 - r18) * k04\n                    \"vmlal.s16  q10, d11, %P15[0]    \\n\"\n                    \"vmlal.s16  q7, d12, %P15[1]     \\n\" // (r12 - r19) * k05\n                    \"vmlal.s16  q8, d13, %P15[1]     \\n\"\n                    // sum1\n                    \"vmull.s16  q11, d30, %P14[0]    \\n\" // (r10 - r17) * k00\n                    \"vmull.s16  q12, d31, %P14[0]    \\n\"\n                    \"vmull.s16  q13, d10, %P14[1]    \\n\" // (r11 - r18) * k01\n                    \"vmull.s16  q14, d11, %P14[1]    \\n\"\n                    \"vmlal.s16  q11, d12, %P14[2]    \\n\" // (r12 - r19) * k02\n                    \"vmlal.s16  q12, d13, %P14[2]    \\n\"\n\n                    // r2\n                    \"vld1.s8    {d30-d31}, [%5]      \\n\" // r2\n                    \"add    %5, %5, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r20\n                    \"vmovl.s8    q5, d10             \\n\" // r21\n                    \"vmovl.s8    q6, d12             \\n\" // r22\n\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P15[2]     \\n\" // (r20 - r27) * k06\n                    \"vmlal.s16  q8, d31, %P15[2]     \\n\"\n                    \"vmlal.s16  q9, d10, %P15[3]     \\n\" // (r21 - r28) * k07\n                    \"vmlal.s16  q10, d11, %P15[3]    \\n\"\n                    \"vmlal.s16  q7, d12, %P16[0]     \\n\" // (r22 - r29) * k08\n                    \"vmlal.s16  q8, d13, %P16[0]     \\n\"\n      
              // sum1\n                    \"vmlal.s16  q11, d30, %P14[3]    \\n\" // (r20 - r27) * k03\n                    \"vmlal.s16  q12, d31, %P14[3]    \\n\"\n                    \"vmlal.s16  q13, d10, %P15[0]    \\n\" // (r21 - r28) * k04\n                    \"vmlal.s16  q14, d11, %P15[0]    \\n\"\n                    \"vmlal.s16  q11, d12, %P15[1]    \\n\" // (r22 - r29) * k05\n                    \"vmlal.s16  q12, d13, %P15[1]    \\n\"\n\n                    // r3\n                    \"vld1.s8    {d30-d31}, [%6]      \\n\" // r3\n                    \"add    %6, %6, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r30\n                    \"vmovl.s8    q5, d10             \\n\" // r31\n                    \"vmovl.s8    q6, d12             \\n\" // r32\n\n                    // sum1\n                    \"vmlal.s16  q11, d30, %P15[2]    \\n\" // (r30 - r37) * k06\n                    \"vmlal.s16  q12, d31, %P15[2]    \\n\"\n                    \"vmlal.s16  q13, d10, %P15[3]    \\n\" // (r31 - r38) * k07\n                    \"vmlal.s16  q14, d11, %P15[3]    \\n\"\n                    \"vmlal.s16  q11, d12, %P16[0]    \\n\" // (r32 - r39) * k08\n                    \"vmlal.s16  q12, d13, %P16[0]    \\n\"\n\n                    \"subs   %0, %0, #1               \\n\"\n\n                    // add and save\n                    \"vadd.s32    q7, q7, q9          \\n\"\n                    \"vadd.s32    q8, q8, q10         \\n\"\n                    \"vadd.s32    q11, q11, q13       \\n\"\n                    \"vadd.s32    q12, q12, q14       \\n\"\n\n                    \"vdup.f32   q13, %17             \\n\" // bias\n                    \"vdup.f32   q14, %18             \\n\" // scale_in\n                    \"vdup.f32   q15, %19             \\n\" // scale_out\n\n                    // top_s32 -> 
top_f32\n                    \"vcvt.f32.s32 q7, q7            \\n\"\n                    \"vcvt.f32.s32 q8, q8            \\n\"\n                    // top_f32 = top_f32 * scale_in\n                    \"vmul.f32   q0, q7, q14         \\n\"\n                    \"vmul.f32   q4, q8, q14         \\n\"\n                    // top_f32 = top_f32 + bias\n                    \"vadd.f32   q0, q0, q13         \\n\"\n                    \"vadd.f32   q4, q4, q13         \\n\"\n                    // top_f32 = top_f32 * scale_out\n                    \"vmul.f32   q0, q0, q15         \\n\"\n                    \"vmul.f32   q4, q4, q15         \\n\"\n                    // top_f32 -> top_s32\n                    \"vcvtr.s32.f32 s0, s0           \\n\"\n                    \"vcvtr.s32.f32 s1, s1           \\n\"\n                    \"vcvtr.s32.f32 s2, s2           \\n\"\n                    \"vcvtr.s32.f32 s3, s3           \\n\"\n                    \"vcvtr.s32.f32 s16, s16           \\n\"\n                    \"vcvtr.s32.f32 s17, s17           \\n\"\n                    \"vcvtr.s32.f32 s18, s18           \\n\"\n                    \"vcvtr.s32.f32 s19, s19           \\n\"\n                    // top_s32 -> top_s16\n                    \"vqmovn.s32 d14, q0             \\n\"\n                    \"vqmovn.s32 d15, q4             \\n\"\n                    // top_s16 -> top_s8\n                    \"vqmovn.s16   d14, q7           \\n\"\n                    // save top_s8\n                    \"vst1.8     {d14}, [%1]!        
\\n\"\n\n                    // top_s32 -> top_f32\n                    \"vcvt.f32.s32 q11, q11          \\n\"\n                    \"vcvt.f32.s32 q12, q12          \\n\"\n                    // top_f32 = top_f32 * scale_in\n                    \"vmul.f32   q0, q11, q14        \\n\"\n                    \"vmul.f32   q4, q12, q14        \\n\"\n                    // top_f32 = top_f32 + bias\n                    \"vadd.f32   q0, q0, q13         \\n\"\n                    \"vadd.f32   q4, q4, q13         \\n\"\n                    // top_f32 = top_f32 * scale_out\n                    \"vmul.f32   q0, q0, q15         \\n\"\n                    \"vmul.f32   q4, q4, q15         \\n\"\n                    // top_f32 -> top_s32\n                    \"vcvtr.s32.f32 s0, s0           \\n\"\n                    \"vcvtr.s32.f32 s1, s1           \\n\"\n                    \"vcvtr.s32.f32 s2, s2           \\n\"\n                    \"vcvtr.s32.f32 s3, s3           \\n\"\n                    \"vcvtr.s32.f32 s16, s16           \\n\"\n                    \"vcvtr.s32.f32 s17, s17           \\n\"\n                    \"vcvtr.s32.f32 s18, s18           \\n\"\n                    \"vcvtr.s32.f32 s19, s19           \\n\"\n                    // top_s32 -> top_s16\n                    \"vqmovn.s32 d14, q0             \\n\"\n                    \"vqmovn.s32 d15, q4             \\n\"\n                    // top_s16 -> top_s8\n                    \"vqmovn.s16   d14, q7           \\n\"\n                    // save top_s8\n                    \"vst1.8     {d14}, [%2]!        
\\n\"\n\n                    \"bne    0b                      \\n\"\n\n                    : \"=r\"(nn),       // %0\n                    \"=r\"(outptr0),  // %1\n                    \"=r\"(outptr0n), // %2\n                    \"=r\"(r0),       // %3\n                    \"=r\"(r1),       // %4\n                    \"=r\"(r2),       // %5\n                    \"=r\"(r3)        // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(outptr0n),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"w\"(_k0123),           // %14\n                    \"w\"(_k4567),           // %15\n                    \"w\"(_k8xxx),           // %16\n                    \"r\"(bias0),            // %17\n                    \"r\"(scale_requant_in), // %18\n                    \"r\"(scale_requant_out) // %19\n                    : \"cc\", \"memory\", \"q0\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                // TODO NEON\n                int sum0 = 0;\n                int sum0n = 0;\n\n                sum0 += (int)r0[0] * kernel[0];\n                sum0 += (int)r0[1] * kernel[1];\n                sum0 += (int)r0[2] * kernel[2];\n                sum0 += (int)r1[0] * kernel[3];\n                sum0 += (int)r1[1] * kernel[4];\n                sum0 += (int)r1[2] * kernel[5];\n                sum0 += (int)r2[0] * kernel[6];\n                sum0 += (int)r2[1] * kernel[7];\n                sum0 += (int)r2[2] * kernel[8];\n\n                sum0n += (int)r1[0] * kernel[0];\n                sum0n += (int)r1[1] * kernel[1];\n                sum0n += (int)r1[2] * kernel[2];\n                sum0n += (int)r2[0] * kernel[3];\n                sum0n += (int)r2[1] * 
kernel[4];\n                sum0n += (int)r2[2] * kernel[5];\n                sum0n += (int)r3[0] * kernel[6];\n                sum0n += (int)r3[1] * kernel[7];\n                sum0n += (int)r3[2] * kernel[8];\n\n                *outptr0 = float2int8(((float)sum0 * scale_requant_in + bias0) * scale_requant_out);\n                *outptr0n = float2int8(((float)sum0n * scale_requant_in + bias0) * scale_requant_out);\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr0++;\n                outptr0n++;\n            }\n\n            r0 += 2 + w;\n            r1 += 2 + w;\n            r2 += 2 + w;\n            r3 += 2 + w;\n\n            outptr0 += outw;\n            outptr0n += outw;\n        }\n\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"dup    v26.4s, %w13                  \\n\"\n                    \"dup    v27.4s, %w14                  \\n\"\n                    \"dup    v28.4s, %w15                  \\n\"\n\n                    \"0:                                   \\n\"\n                    \"ld1    {v4.8b, v5.8b}, [%2]          \\n\"\n                    \"ld1    {v6.8b, v7.8b}, [%3]          \\n\"\n                    \"ld1    {v8.8b, v9.8b}, [%4]          \\n\"\n                    \"add    %2, %2, #8                    \\n\"\n                    \"add    %3, %3, #8                    \\n\"\n                    \"add    %4, %4, #8                    \\n\"\n\n                    \"ext    v12.8b, v4.8b, v5.8b, #1      \\n\"\n                    \"ext    v13.8b, v4.8b, v5.8b, #2      \\n\"\n                    \"ext    v14.8b, v6.8b, v7.8b, #1      \\n\"\n                    \"ext    v15.8b, v6.8b, v7.8b, #2      \\n\"\n          
          \"ext    v16.8b, v8.8b, v9.8b, #1      \\n\"\n                    \"ext    v17.8b, v8.8b, v9.8b, #2      \\n\"\n\n                    \"sshll  v4.8h, v4.8b, #0              \\n\" // r00\n                    \"sshll  v12.8h, v12.8b, #0            \\n\" // r01\n                    \"sshll  v13.8h, v13.8b, #0            \\n\" // r02\n                    \"sshll  v6.8h, v6.8b, #0              \\n\" // r10\n                    \"sshll  v14.8h, v14.8b, #0            \\n\" // r11\n                    \"sshll  v15.8h, v15.8b, #0            \\n\" // r12\n                    \"sshll  v8.8h, v8.8b, #0              \\n\" // r20\n                    \"sshll  v16.8h, v16.8b, #0            \\n\" // r21\n                    \"sshll  v17.8h, v17.8b, #0            \\n\" // r22\n\n                    // r0\n                    \"smull  v20.4s, v4.4h, %10.h[0]       \\n\" // (r00 - r07) * k00\n                    \"smull2  v21.4s, v4.8h, %10.h[0]      \\n\"\n                    \"smull  v22.4s, v12.4h, %10.h[1]      \\n\" // (r01 - r08) * k01\n                    \"smull2  v23.4s, v12.8h, %10.h[1]     \\n\"\n                    \"smull  v24.4s, v13.4h, %10.h[2]      \\n\" // (r02 - r09) * k02\n                    \"smull2  v25.4s, v13.8h, %10.h[2]     \\n\"\n\n                    // r1\n                    \"smlal  v20.4s, v6.4h, %10.h[3]       \\n\" // (r10 - r17) * k03\n                    \"smlal2  v21.4s, v6.8h, %10.h[3]      \\n\"\n                    \"smlal  v22.4s, v14.4h, %11.h[0]      \\n\" // (r11 - r18) * k04\n                    \"smlal2  v23.4s, v14.8h, %11.h[0]     \\n\"\n                    \"smlal  v24.4s, v15.4h, %11.h[1]      \\n\" // (r12 - r19) * k05\n                    \"smlal2  v25.4s, v15.8h, %11.h[1]     \\n\"\n\n                    // r2\n                    \"smlal  v20.4s, v8.4h, %11.h[2]       \\n\" // (r20 - r27) * k06\n                    \"smlal2  v21.4s, v8.8h, %11.h[2]      \\n\"\n                    \"smlal  v22.4s, v16.4h, %11.h[3]      
\\n\" // (r21 - r28) * k07\n                    \"smlal2  v23.4s, v16.8h, %11.h[3]     \\n\"\n                    \"smlal  v24.4s, v17.4h, %12.h[0]      \\n\" // (r22 - r29) * k08\n                    \"smlal2  v25.4s, v17.8h, %12.h[0]     \\n\"\n\n                    // add and save\n                    \"add    v20.4s, v20.4s, v22.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v23.4s        \\n\"\n                    \"add    v20.4s, v20.4s, v24.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v25.4s        \\n\"\n\n                    // top_s32 -> top_f32\n                    \"scvtf  v20.4s, v20.4s                \\n\"\n                    \"scvtf  v21.4s, v21.4s                \\n\"\n                    // top_f32 = top_f32 * scale_in\n                    \"fmul   v20.4s, v20.4s, v27.4s        \\n\"\n                    \"fmul   v21.4s, v21.4s, v27.4s        \\n\"\n                    // top_f32 = top_f32 + bias\n                    \"fadd   v20.4s, v20.4s, v26.4s        \\n\"\n                    \"fadd   v21.4s, v21.4s, v26.4s        \\n\"\n                    // top_f32 = top_f32 * scale_out\n                    \"fmul   v20.4s, v20.4s, v28.4s        \\n\"\n                    \"fmul   v21.4s, v21.4s, v28.4s        \\n\"\n                    // top_f32 -> top_s32\n                    \"fcvtas v20.4s, v20.4s                \\n\"\n                    \"fcvtas v21.4s, v21.4s                \\n\"\n                    // top_s32 -> top_s16\n                    \"sqxtn  v7.4h, v20.4s                 \\n\"\n                    \"sqxtn2 v7.8h, v21.4s                 \\n\"\n                    // top_s16 -> top_s8\n                    \"sqxtn  v8.8b, v7.8h                  \\n\"\n                    // save top_s8\n                    \"st1    {v8.8b}, [%1], #8             \\n\"\n\n                    \"subs   %w0, %w0, #1                  \\n\"\n                    \"bne    0b                            \\n\"\n\n                    : 
\"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2)       // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123),           // %10\n                    \"w\"(_k4567),           // %11\n                    \"w\"(_k8xxx),           // %12\n                    \"r\"(bias0),            // %13\n                    \"r\"(scale_requant_in), // %14\n                    \"r\"(scale_requant_out) // %15\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                              \\n\"\n                    // r0\n                    \"vld1.s8    {d30-d31}, [%2]      \\n\" // r0\n                    \"add    %2, %2, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r00\n                    \"vmovl.s8    q5, d10             \\n\" // r01\n                    \"vmovl.s8    q6, d12             \\n\" // r02\n                    // sum0\n                    \"vmull.s16  q7, d30, %P10[0]     \\n\" // (r00 - r07) * k00\n                    \"vmull.s16  q8, d31, %P10[0]     \\n\"\n                    \"vmull.s16  q9, d10, %P10[1]     \\n\" // (r01 - r08) * k01\n                    \"vmull.s16  q10, d11, %P10[1]    \\n\"\n                    \"vmlal.s16  q7, d12, %P10[2]     \\n\" // (r02 - 
r09) * k02\n                    \"vmlal.s16  q8, d13, %P10[2]     \\n\"\n\n                    // r1\n                    \"vld1.s8    {d30-d31}, [%3]      \\n\" // r1\n                    \"add    %3, %3, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r10\n                    \"vmovl.s8    q5, d10             \\n\" // r11\n                    \"vmovl.s8    q6, d12             \\n\" // r12\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P10[3]     \\n\" // (r10 - r17) * k03\n                    \"vmlal.s16  q8, d31, %P10[3]     \\n\"\n                    \"vmlal.s16  q9, d10, %P11[0]     \\n\" // (r11 - r18) * k04\n                    \"vmlal.s16  q10, d11, %P11[0]    \\n\"\n                    \"vmlal.s16  q7, d12, %P11[1]     \\n\" // (r12 - r19) * k05\n                    \"vmlal.s16  q8, d13, %P11[1]     \\n\"\n\n                    // r2\n                    \"vld1.s8    {d30-d31}, [%4]      \\n\" // r2\n                    \"add    %4, %4, #8               \\n\"\n\n                    \"vext.s8    d10, d30, d31, #1    \\n\"\n                    \"vext.s8    d12, d30, d31, #2    \\n\"\n\n                    \"vmovl.s8    q15, d30            \\n\" // r20\n                    \"vmovl.s8    q5, d10             \\n\" // r21\n                    \"vmovl.s8    q6, d12             \\n\" // r22\n\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P11[2]     \\n\" // (r20 - r27) * k06\n                    \"vmlal.s16  q8, d31, %P11[2]     \\n\"\n                    \"vmlal.s16  q9, d10, %P11[3]     \\n\" // (r21 - r28) * k07\n                    \"vmlal.s16  q10, d11, %P11[3]    \\n\"\n                    \"vmlal.s16  q7, d12, %P12[0]     \\n\" // (r22 - r29) * k08\n                    \"vmlal.s16  q8, d13, %P12[0]     \\n\"\n\n                    \"subs   %0, 
%0, #1               \\n\"\n\n                    // add and save\n                    \"vadd.s32    q7, q7, q9          \\n\"\n                    \"vadd.s32    q8, q8, q10         \\n\"\n\n                    \"vdup.f32   q13, %13             \\n\" // bias\n                    \"vdup.f32   q14, %14             \\n\" // scale_in\n                    \"vdup.f32   q15, %15             \\n\" // scale_out\n\n                    // top_s32 -> top_f32\n                    \"vcvt.f32.s32 q7, q7            \\n\"\n                    \"vcvt.f32.s32 q8, q8            \\n\"\n                    // top_f32 = top_f32 * scale_in\n                    \"vmul.f32   q0, q7, q14         \\n\"\n                    \"vmul.f32   q4, q8, q14         \\n\"\n                    // top_f32 = top_f32 + bias\n                    \"vadd.f32   q0, q0, q13         \\n\"\n                    \"vadd.f32   q4, q4, q13         \\n\"\n                    // top_f32 = top_f32 * scale_out\n                    \"vmul.f32   q0, q0, q15         \\n\"\n                    \"vmul.f32   q4, q4, q15         \\n\"\n                    // top_f32 -> top_s32\n                    \"vcvtr.s32.f32 s0, s0           \\n\"\n                    \"vcvtr.s32.f32 s1, s1           \\n\"\n                    \"vcvtr.s32.f32 s2, s2           \\n\"\n                    \"vcvtr.s32.f32 s3, s3           \\n\"\n                    \"vcvtr.s32.f32 s16, s16           \\n\"\n                    \"vcvtr.s32.f32 s17, s17           \\n\"\n                    \"vcvtr.s32.f32 s18, s18           \\n\"\n                    \"vcvtr.s32.f32 s19, s19           \\n\"\n                    // top_s32 -> top_s16\n                    \"vqmovn.s32 d14, q0             \\n\"\n                    \"vqmovn.s32 d15, q4             \\n\"\n                    // top_s16 -> top_s8\n                    \"vqmovn.s16   d14, q7           \\n\"\n                    // save top_s8\n                    \"vst1.8     {d14}, [%1]!        
\\n\"\n\n                    \"bne    0b                      \\n\"\n\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr0), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2)       // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr0),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123),           // %10\n                    \"w\"(_k4567),           // %11\n                    \"w\"(_k8xxx),           // %12\n                    \"r\"(bias0),            // %13\n                    \"r\"(scale_requant_in), // %14\n                    \"r\"(scale_requant_out) // %15\n                    : \"cc\", \"memory\", \"q0\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                int sum = 0;\n\n                sum += (int)r0[0] * kernel[0];\n                sum += (int)r0[1] * kernel[1];\n                sum += (int)r0[2] * kernel[2];\n                sum += (int)r1[0] * kernel[3];\n                sum += (int)r1[1] * kernel[4];\n                sum += (int)r1[2] * kernel[5];\n                sum += (int)r2[0] * kernel[6];\n                sum += (int)r2[1] * kernel[7];\n                sum += (int)r2[2] * kernel[8];\n\n                *outptr0 = float2int8(((float)sum * scale_requant_in + bias0) * scale_requant_out);\n\n                r0++;\n                r1++;\n                r2++;\n                outptr0++;\n            }\n\n            r0 += 2;\n            r1 += 2;\n            r2 += 2;\n        }\n    }\n}\n\nstatic void convdw3x3s2_int8_requant_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, std::vector<float> scales_requant, const Option& opt)\n{\n  
  int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    const int tailstep = w - 2 * outw + w;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n        const float scale_requant_in = scales_requant[2 * p];\n        const float scale_requant_out = scales_requant[2 * p + 1];\n\n        const signed char* kernel = (const signed char*)_kernel + p * 9;\n\n        signed char* outptr = out;\n\n        const signed char* img = bottom_blob.channel(p);\n\n        const signed char* r0 = img;\n        const signed char* r1 = img + w;\n        const signed char* r2 = img + w * 2;\n\n        int i = 0;\n#if __ARM_NEON\n        int8x16_t _k0123456789x = vld1q_s8(kernel);\n        int16x8_t _k_s16 = vmovl_s8(vget_low_s8(_k0123456789x));\n        int16x8_t _kn_s16 = vmovl_s8(vget_high_s8(_k0123456789x));\n\n        int16x4_t _k0123 = vget_low_s16(_k_s16);\n        int16x4_t _k4567 = vget_high_s16(_k_s16);\n        int16x4_t _k8xxx = vget_low_s16(_kn_s16);\n#endif // __ARM_NEON\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"dup    v26.4s, %w13                  \\n\"\n                    \"dup    v27.4s, %w14                  \\n\"\n                    \"dup    v28.4s, %w15                  \\n\"\n                    \"0:                                   \\n\"\n                    \"ld2    {v4.8b, v5.8b}, [%2], #16     \\n\"\n                    \"ld2    {v6.8b, v7.8b}, [%2]          \\n\"\n                    \"ld2    {v8.8b, v9.8b}, [%3], #16     \\n\"\n                   
 \"ld2    {v10.8b, v11.8b}, [%3]        \\n\"\n                    \"ld2    {v12.8b, v13.8b}, [%4], #16   \\n\"\n                    \"ld2    {v14.8b, v15.8b}, [%4]        \\n\"\n\n                    \"ext    v6.8b, v4.8b, v6.8b, #1       \\n\"\n                    \"ext    v10.8b, v8.8b, v10.8b, #1     \\n\"\n                    \"ext    v14.8b, v12.8b, v14.8b, #1    \\n\"\n\n                    \"sshll  v4.8h, v4.8b, #0              \\n\" // r00\n                    \"sshll  v5.8h, v5.8b, #0              \\n\" // r01\n                    \"sshll  v6.8h, v6.8b, #0              \\n\" // r02\n                    \"sshll  v8.8h, v8.8b, #0              \\n\" // r10\n                    \"sshll  v9.8h, v9.8b, #0              \\n\" // r11\n                    \"sshll  v10.8h, v10.8b, #0            \\n\" // r12\n                    \"sshll  v12.8h, v12.8b, #0            \\n\" // r20\n                    \"sshll  v13.8h, v13.8b, #0            \\n\" // r21\n                    \"sshll  v14.8h, v14.8b, #0            \\n\" // r22\n\n                    // r0\n                    \"smull  v20.4s, v4.4h, %10.h[0]       \\n\" // (r00 - r07) * k00\n                    \"smull2  v21.4s, v4.8h, %10.h[0]      \\n\"\n                    \"smull  v22.4s, v5.4h, %10.h[1]       \\n\" // (r01 - r08) * k01\n                    \"smull2  v23.4s, v5.8h, %10.h[1]      \\n\"\n                    \"smull  v24.4s, v6.4h, %10.h[2]       \\n\" // (r02 - r09) * k02\n                    \"smull2  v25.4s, v6.8h, %10.h[2]      \\n\"\n\n                    // r1\n                    \"smlal  v20.4s, v8.4h, %10.h[3]       \\n\" // (r10 - r17) * k03\n                    \"smlal2  v21.4s, v8.8h, %10.h[3]      \\n\"\n                    \"smlal  v22.4s, v9.4h, %11.h[0]       \\n\" // (r11 - r18) * k04\n                    \"smlal2  v23.4s, v9.8h, %11.h[0]      \\n\"\n                    \"smlal  v24.4s, v10.4h, %11.h[1]      \\n\" // (r12 - r19) * k05\n                    \"smlal2  v25.4s, v10.8h, 
%11.h[1]     \\n\"\n\n                    // r2\n                    \"smlal  v20.4s, v12.4h, %11.h[2]      \\n\" // (r20 - r27) * k06\n                    \"smlal2  v21.4s, v12.8h, %11.h[2]     \\n\"\n                    \"smlal  v22.4s, v13.4h, %11.h[3]      \\n\" // (r21 - r28) * k07\n                    \"smlal2  v23.4s, v13.8h, %11.h[3]     \\n\"\n                    \"smlal  v24.4s, v14.4h, %12.h[0]      \\n\" // (r22 - r29) * k08\n                    \"smlal2  v25.4s, v14.8h, %12.h[0]     \\n\"\n\n                    // add and save\n                    \"add    v20.4s, v20.4s, v22.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v23.4s        \\n\"\n                    \"add    v20.4s, v20.4s, v24.4s        \\n\"\n                    \"add    v21.4s, v21.4s, v25.4s        \\n\"\n\n                    // top_s32 -> top_f32\n                    \"scvtf  v20.4s, v20.4s                \\n\"\n                    \"scvtf  v21.4s, v21.4s                \\n\"\n                    // top_f32 = top_f32 * scale_in\n                    \"fmul   v20.4s, v20.4s, v27.4s        \\n\"\n                    \"fmul   v21.4s, v21.4s, v27.4s        \\n\"\n                    // top_f32 = top_f32 + bias\n                    \"fadd   v20.4s, v20.4s, v26.4s        \\n\"\n                    \"fadd   v21.4s, v21.4s, v26.4s        \\n\"\n                    // top_f32 = top_f32 * scale_out\n                    \"fmul   v20.4s, v20.4s, v28.4s        \\n\"\n                    \"fmul   v21.4s, v21.4s, v28.4s        \\n\"\n                    // top_f32 -> top_s32\n                    \"fcvtas v20.4s, v20.4s                \\n\"\n                    \"fcvtas v21.4s, v21.4s                \\n\"\n                    // top_s32 -> top_s16\n                    \"sqxtn  v7.4h, v20.4s                 \\n\"\n                    \"sqxtn2 v7.8h, v21.4s                 \\n\"\n                    // top_s16 -> top_s8\n                    \"sqxtn  v8.8b, v7.8h                  \\n\"\n  
                  // save top_s8\n                    \"st1    {v8.8b}, [%1], #8             \\n\"\n\n                    \"subs   %w0, %w0, #1                  \\n\"\n                    \"bne    0b                            \\n\"\n\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123),           // %10\n                    \"w\"(_k4567),           // %11\n                    \"w\"(_k8xxx),           // %12\n                    \"r\"(bias0),            // %13\n                    \"r\"(scale_requant_in), // %14\n                    \"r\"(scale_requant_out) // %15\n                    : \"cc\", \"memory\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                              \\n\"\n                    // r0\n                    \"vld2.s8    {d30-d31}, [%2]!     
\\n\" // r0\n                    \"vld2.s8    {d10-d11}, [%2]      \\n\"\n                    \"vext.s8    d12, d30, d10, #1    \\n\"\n\n                    \"vmovl.s8    q5, d31             \\n\" // r01\n                    \"vmovl.s8    q15, d30            \\n\" // r00\n                    \"vmovl.s8    q6, d12             \\n\" // r02\n                    // sum0\n                    \"vmull.s16  q7, d30, %P10[0]     \\n\" // (r00 - r07) * k00\n                    \"vmull.s16  q8, d31, %P10[0]     \\n\"\n                    \"vmull.s16  q9, d10, %P10[1]     \\n\" // (r01 - r08) * k01\n                    \"vmull.s16  q10, d11, %P10[1]    \\n\"\n                    \"vmlal.s16  q7, d12, %P10[2]     \\n\" // (r02 - r09) * k02\n                    \"vmlal.s16  q8, d13, %P10[2]     \\n\"\n\n                    // r1\n                    \"vld2.s8    {d30-d31}, [%3]!     \\n\" // r1\n                    \"vld2.s8    {d10-d11}, [%3]      \\n\"\n                    \"vext.s8    d12, d30, d10, #1    \\n\"\n\n                    \"vmovl.s8    q5, d31             \\n\" // r11\n                    \"vmovl.s8    q15, d30            \\n\" // r10\n                    \"vmovl.s8    q6, d12             \\n\" // r12\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P10[3]     \\n\" // (r10 - r17) * k03\n                    \"vmlal.s16  q8, d31, %P10[3]     \\n\"\n                    \"vmlal.s16  q9, d10, %P11[0]     \\n\" // (r11 - r18) * k04\n                    \"vmlal.s16  q10, d11, %P11[0]    \\n\"\n                    \"vmlal.s16  q7, d12, %P11[1]     \\n\" // (r12 - r19) * k05\n                    \"vmlal.s16  q8, d13, %P11[1]     \\n\"\n\n                    // r2\n                    \"vld2.s8    {d30-d31}, [%4]!     
\\n\" // r2\n                    \"vld2.s8    {d10-d11}, [%4]      \\n\"\n                    \"vext.s8    d12, d30, d10, #1    \\n\"\n\n                    \"vmovl.s8    q5, d31             \\n\" // r21\n                    \"vmovl.s8    q15, d30            \\n\" // r20\n                    \"vmovl.s8    q6, d12             \\n\" // r22\n\n                    // sum0\n                    \"vmlal.s16  q7, d30, %P11[2]     \\n\" // (r20 - r27) * k06\n                    \"vmlal.s16  q8, d31, %P11[2]     \\n\"\n                    \"vmlal.s16  q9, d10, %P11[3]     \\n\" // (r21 - r28) * k07\n                    \"vmlal.s16  q10, d11, %P11[3]    \\n\"\n                    \"vmlal.s16  q7, d12, %P12[0]     \\n\" // (r22 - r29) * k08\n                    \"vmlal.s16  q8, d13, %P12[0]     \\n\"\n\n                    \"subs   %0, %0, #1               \\n\"\n\n                    // add and save\n                    \"vadd.s32    q7, q7, q9          \\n\"\n                    \"vadd.s32    q8, q8, q10         \\n\"\n\n                    \"vdup.f32   q11, %13             \\n\" // bias\n                    \"vdup.f32   q12, %14             \\n\" // scale_in\n                    \"vdup.f32   q13, %15             \\n\" // scale_out\n\n                    // top_s32 -> top_f32\n                    \"vcvt.f32.s32 q7, q7             \\n\"\n                    \"vcvt.f32.s32 q8, q8             \\n\"\n                    // top_f32 = top_f32 * scale_in\n                    \"vmul.f32   q0, q7, q12          \\n\"\n                    \"vmul.f32   q4, q8, q12          \\n\"\n                    // top_f32 = top_f32 + bias\n                    \"vadd.f32   q0, q0, q11          \\n\"\n                    \"vadd.f32   q4, q4, q11          \\n\"\n                    // top_f32 = top_f32 * scale_out\n                    \"vmul.f32   q0, q0, q13          \\n\"\n                    \"vmul.f32   q4, q4, q13          \\n\"\n                    // top_f32 -> top_s32\n                    
\"vcvtr.s32.f32 s0, s0            \\n\"\n                    \"vcvtr.s32.f32 s1, s1            \\n\"\n                    \"vcvtr.s32.f32 s2, s2            \\n\"\n                    \"vcvtr.s32.f32 s3, s3            \\n\"\n                    \"vcvtr.s32.f32 s16, s16            \\n\"\n                    \"vcvtr.s32.f32 s17, s17            \\n\"\n                    \"vcvtr.s32.f32 s18, s18            \\n\"\n                    \"vcvtr.s32.f32 s19, s19            \\n\"\n                    // top_s32 -> top_s16\n                    \"vqmovn.s32 d14, q0              \\n\"\n                    \"vqmovn.s32 d15, q4              \\n\"\n                    // top_s16 -> top_s8\n                    \"vqmovn.s16   d14, q7            \\n\"\n                    // save top_s8\n                    \"vst1.8     {d14}, [%1]!         \\n\"\n\n                    \"bne    0b                       \\n\"\n\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2)      // %4\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"w\"(_k0123),           // %10\n                    \"w\"(_k4567),           // %11\n                    \"w\"(_k8xxx),           // %12\n                    \"r\"(bias0),            // %13\n                    \"r\"(scale_requant_in), // %14\n                    \"r\"(scale_requant_out) // %15\n                    : \"cc\", \"memory\", \"q0\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                int sum = 0;\n\n                sum += (int)r0[0] * kernel[0];\n                sum += (int)r0[1] * 
kernel[1];\n                sum += (int)r0[2] * kernel[2];\n                sum += (int)r1[0] * kernel[3];\n                sum += (int)r1[1] * kernel[4];\n                sum += (int)r1[2] * kernel[5];\n                sum += (int)r2[0] * kernel[6];\n                sum += (int)r2[1] * kernel[7];\n                sum += (int)r2[2] * kernel[8];\n\n                *outptr = float2int8(((float)sum * scale_requant_in + bias0) * scale_requant_out);\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n                outptr++;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_3x3_pack4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n#if __aarch64__\n    const int w = bottom_blob.w;\n#endif\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out.row(0);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n\n        float32x4_t _k00 = vld1q_f32(k0);\n        float32x4_t _k01 = vld1q_f32(k0 + 4);\n        float32x4_t _k02 = vld1q_f32(k0 + 8);\n        float32x4_t _k10 = vld1q_f32(k0 + 12);\n        float32x4_t _k11 = vld1q_f32(k0 + 16);\n        float32x4_t _k12 = vld1q_f32(k0 + 20);\n        float32x4_t _k20 = vld1q_f32(k0 + 24);\n        float32x4_t _k21 = vld1q_f32(k0 + 28);\n        float32x4_t _k22 = vld1q_f32(k0 + 32);\n\n        int i = 0;\n\n#if __aarch64__\n        float* outptr1 = out.row(1);\n        const float* r3 = img0.row(3);\n\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%3], #32 \\n\" // r10 r11\n\n                    \"mov    v16.16b, %21.16b            \\n\" // sum00\n                    \"mov    v17.16b, %21.16b            \\n\" // sum01\n                    \"mov    v18.16b, %21.16b            \\n\" // sum02\n     
               \"mov    v19.16b, %21.16b            \\n\" // sum03\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%3] \\n\" // r12 r13 r14 r15\n\n                    \"mov    v20.16b, %21.16b            \\n\" // sum10\n                    \"mov    v21.16b, %21.16b            \\n\" // sum11\n                    \"mov    v22.16b, %21.16b            \\n\" // sum12\n                    \"mov    v23.16b, %21.16b            \\n\" // sum13\n\n                    \"fmla   v16.4s, %15.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v18.4s, %15.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v13.4s      \\n\"\n                    \"fmla   v20.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v21.4s, %12.4s, v11.4s      \\n\"\n                    \"fmla   v22.4s, %12.4s, v12.4s      \\n\"\n                    \"fmla   v23.4s, %12.4s, v13.4s      \\n\"\n\n                    \"add    %3, %3, #32                 \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %16.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %16.4s, v14.4s      \\n\"\n                    \"fmla   v20.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v21.4s, %13.4s, v12.4s      \\n\"\n                    \"fmla   v22.4s, %13.4s, v13.4s      \\n\"\n                    \"fmla   v23.4s, %13.4s, v14.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%4], #32 \\n\" // r20 r21\n\n                    \"fmla   v16.4s, %17.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %17.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %17.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %17.4s, 
v15.4s      \\n\"\n                    \"fmla   v20.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v21.4s, %14.4s, v13.4s      \\n\"\n                    \"fmla   v22.4s, %14.4s, v14.4s      \\n\"\n                    \"fmla   v23.4s, %14.4s, v15.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%4] \\n\" // r22 r23 r24 r25\n\n                    \"fmla   v16.4s, %18.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %18.4s, v11.4s      \\n\"\n                    \"fmla   v18.4s, %18.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %18.4s, v13.4s      \\n\"\n                    \"fmla   v20.4s, %15.4s, v10.4s      \\n\"\n                    \"fmla   v21.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v22.4s, %15.4s, v12.4s      \\n\"\n                    \"fmla   v23.4s, %15.4s, v13.4s      \\n\"\n\n                    \"add    %4, %4, #32                 \\n\"\n\n                    \"fmla   v16.4s, %19.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %19.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %19.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %19.4s, v14.4s      \\n\"\n                    \"fmla   v20.4s, %16.4s, v11.4s      \\n\"\n                    \"fmla   v21.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v22.4s, %16.4s, v13.4s      \\n\"\n                    \"fmla   v23.4s, %16.4s, v14.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%2], #32 \\n\" // r00 r01\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v24.4s, v25.4s}, [%5], #32 \\n\" // r30 r31\n\n                    \"fmla   v16.4s, %20.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %20.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %20.4s, v14.4s      
\\n\"\n                    \"fmla   v19.4s, %20.4s, v15.4s      \\n\"\n                    \"fmla   v20.4s, %17.4s, v12.4s      \\n\"\n                    \"fmla   v21.4s, %17.4s, v13.4s      \\n\"\n                    \"fmla   v22.4s, %17.4s, v14.4s      \\n\"\n                    \"fmla   v23.4s, %17.4s, v15.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%2] \\n\" // r02 r03 r04 r05\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v26.4s, v27.4s, v28.4s, v29.4s}, [%5] \\n\" // r32 r33 r34 r35\n\n                    \"fmla   v16.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v11.4s      \\n\"\n                    \"fmla   v18.4s, %12.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %12.4s, v13.4s      \\n\"\n                    \"fmla   v20.4s, %18.4s, v24.4s      \\n\"\n                    \"fmla   v21.4s, %18.4s, v25.4s      \\n\"\n                    \"fmla   v22.4s, %18.4s, v26.4s      \\n\"\n                    \"fmla   v23.4s, %18.4s, v27.4s      \\n\"\n\n                    \"add    %2, %2, #32                 \\n\"\n\n                    \"fmla   v16.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %13.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %13.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v14.4s      \\n\"\n                    \"fmla   v20.4s, %19.4s, v25.4s      \\n\"\n                    \"fmla   v21.4s, %19.4s, v26.4s      \\n\"\n                    \"fmla   v22.4s, %19.4s, v27.4s      \\n\"\n                    \"fmla   v23.4s, %19.4s, v28.4s      \\n\"\n\n                    \"add    %5, %5, #32                 \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %14.4s, v14.4s      
\\n\"\n                    \"fmla   v19.4s, %14.4s, v15.4s      \\n\"\n                    \"fmla   v20.4s, %20.4s, v26.4s      \\n\"\n                    \"fmla   v21.4s, %20.4s, v27.4s      \\n\"\n                    \"fmla   v22.4s, %20.4s, v28.4s      \\n\"\n                    \"fmla   v23.4s, %20.4s, v29.4s      \\n\"\n\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%1], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\");\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%3] \\n\" // r10 r11 r12 r13\n\n                    \"mov    v16.16b, %21.16b            \\n\" // sum00\n                    \"mov    v17.16b, %21.16b            \\n\" // 
sum01\n                    \"mov    v18.16b, %21.16b            \\n\" // sum10\n                    \"mov    v19.16b, %21.16b            \\n\" // sum11\n\n                    \"fmla   v16.4s, %15.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v18.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v19.4s, %12.4s, v11.4s      \\n\"\n\n                    \"add    %3, %3, #32                 \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v12.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%4] \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v16.4s, %17.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %17.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %14.4s, v13.4s      \\n\"\n\n                    \"add    %4, %4, #32                 \\n\"\n\n                    \"fmla   v16.4s, %18.4s, v20.4s      \\n\"\n                    \"fmla   v17.4s, %18.4s, v21.4s      \\n\"\n                    \"fmla   v18.4s, %15.4s, v20.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%2] \\n\" // r00 r01 r02 r03\n\n                    \"fmla   v16.4s, %19.4s, v21.4s      \\n\"\n                    \"fmla   v17.4s, %19.4s, v22.4s      \\n\"\n                    \"fmla   v18.4s, %16.4s, v21.4s      \\n\"\n                    \"fmla   v19.4s, %16.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v24.4s, 
v25.4s, v26.4s, v27.4s}, [%5] \\n\" // r30 r31 r32 r33\n\n                    \"fmla   v16.4s, %20.4s, v22.4s      \\n\"\n                    \"fmla   v17.4s, %20.4s, v23.4s      \\n\"\n                    \"fmla   v18.4s, %17.4s, v22.4s      \\n\"\n                    \"fmla   v19.4s, %17.4s, v23.4s      \\n\"\n\n                    \"add    %2, %2, #32                 \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v11.4s      \\n\"\n                    \"fmla   v18.4s, %18.4s, v24.4s      \\n\"\n                    \"fmla   v19.4s, %18.4s, v25.4s      \\n\"\n\n                    \"add    %5, %5, #32                 \\n\"\n\n                    \"fmla   v16.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %13.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %19.4s, v25.4s      \\n\"\n                    \"fmla   v19.4s, %19.4s, v26.4s      \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %20.4s, v26.4s      \\n\"\n                    \"fmla   v19.4s, %20.4s, v27.4s      \\n\"\n\n                    \"st1    {v16.4s, v17.4s}, [%0], #32 \\n\"\n                    \"st1    {v18.4s, v19.4s}, [%1], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n          
          \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n            }\n            for (; j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #384]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s}, [%3] \\n\" // r10 r11 r12\n\n                    \"mov    v16.16b, %21.16b            \\n\" // sum0\n                    \"mov    v17.16b, %21.16b            \\n\" // sum1\n\n                    \"fmla   v16.4s, %15.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v10.4s      \\n\"\n\n                    \"add    %3, %3, #16                 \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %13.4s, v11.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #384]       \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s}, [%4] \\n\" // r20 r21 r22\n\n                    \"fmla   v16.4s, %17.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v12.4s      \\n\"\n\n                    \"add    %4, %4, #16                 \\n\"\n\n                    \"fmla   v16.4s, %18.4s, v20.4s      \\n\"\n                    \"fmla   v17.4s, %15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s}, [%2] \\n\" // r00 r01 r02\n\n                    \"fmla   v16.4s, %19.4s, v21.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #384]       \\n\"\n                    \"ld1    {v24.4s, v25.4s, 
v26.4s}, [%5] \\n\" // r30 r31 r32\n\n                    \"fmla   v16.4s, %20.4s, v22.4s      \\n\"\n                    \"fmla   v17.4s, %17.4s, v22.4s      \\n\"\n\n                    \"add    %2, %2, #16                 \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %18.4s, v24.4s      \\n\"\n\n                    \"add    %5, %5, #16                 \\n\"\n\n                    \"fmla   v16.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %19.4s, v25.4s      \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %20.4s, v26.4s      \\n\"\n\n                    \"st1    {v16.4s}, [%0], #16         \\n\"\n                    \"st1    {v17.4s}, [%1], #16         \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v24\", \"v25\", \"v26\");\n            }\n\n            r0 += 2 * 4 + w * 4;\n            r1 += 2 * 4 + w * 4;\n            r2 += 2 * 4 + w * 4;\n            r3 += 
2 * 4 + w * 4;\n\n            outptr0 += outw * 4;\n            outptr1 += outw * 4;\n        }\n#endif // __aarch64__\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%1], #32 \\n\" // r00 r01\n\n                    \"mov    v16.16b, %17.16b            \\n\" // sum00\n                    \"mov    v17.16b, %17.16b            \\n\" // sum01\n                    \"mov    v18.16b, %17.16b            \\n\" // sum02\n                    \"mov    v19.16b, %17.16b            \\n\" // sum03\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%1] \\n\" // r02 r03 r04 r05\n\n                    \"fmla   v16.4s, %8.4s, v10.4s       \\n\"\n                    \"fmla   v17.4s, %8.4s, v11.4s       \\n\"\n                    \"fmla   v18.4s, %8.4s, v12.4s       \\n\"\n                    \"fmla   v19.4s, %8.4s, v13.4s       \\n\"\n\n                    \"add    %1, %1, #32                 \\n\"\n\n                    \"fmla   v16.4s, %9.4s, v11.4s       \\n\"\n                    \"fmla   v17.4s, %9.4s, v12.4s       \\n\"\n                    \"fmla   v18.4s, %9.4s, v13.4s       \\n\"\n                    \"fmla   v19.4s, %9.4s, v14.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%2], #32 \\n\" // r10 r11\n\n                    \"fmla   v16.4s, %10.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %10.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %10.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %10.4s, v15.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, 
[%2] \\n\" // r12 r13 r14 r15\n\n                    \"fmla   v16.4s, %11.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %11.4s, v11.4s      \\n\"\n                    \"fmla   v18.4s, %11.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %11.4s, v13.4s      \\n\"\n\n                    \"add    %2, %2, #32                 \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %12.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %12.4s, v14.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%3], #32 \\n\" // r20 r21\n\n                    \"fmla   v16.4s, %13.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %13.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %13.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v15.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%3] \\n\" // r22 r23 r24 r25\n\n                    \"fmla   v16.4s, %14.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v11.4s      \\n\"\n                    \"fmla   v18.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %14.4s, v13.4s      \\n\"\n\n                    \"add    %3, %3, #32                 \\n\"\n\n                    \"fmla   v16.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %15.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %15.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v14.4s      \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %16.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %16.4s, v15.4s 
     \\n\"\n\n                    \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\");\n#else\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128]! \\n\" // r00 r01\n\n                    \"vmov       q10, %q17       \\n\" // sum00\n                    \"vmov       q11, %q17       \\n\" // sum01\n\n                    \"vmla.f32   q10, %q8, q14   \\n\"\n                    \"vmla.f32   q11, %q8, q15   \\n\"\n                    \"vmla.f32   q10, %q9, q15   \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128]! 
\\n\" // r02 r03\n\n                    \"vmov       q12, %q17       \\n\" // sum02\n                    \"vmov       q13, %q17       \\n\" // sum03\n\n                    \"vmla.f32   q12, %q8, q14   \\n\"\n                    \"vmla.f32   q11, %q9, q14   \\n\"\n                    \"vmla.f32   q13, %q8, q15   \\n\"\n                    \"vmla.f32   q10, %q10, q14  \\n\"\n                    \"vmla.f32   q12, %q9, q15   \\n\"\n                    \"vmla.f32   q11, %q10, q15  \\n\"\n\n                    //                     \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128] \\n\" // r04 r05\n\n                    \"vmla.f32   q13, %q9, q14   \\n\"\n                    \"vmla.f32   q12, %q10, q14  \\n\"\n                    \"vmla.f32   q13, %q10, q15  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128]! \\n\" // r10 r11\n\n                    \"vmla.f32   q10, %q11, q14  \\n\"\n                    \"vmla.f32   q11, %q11, q15  \\n\"\n                    \"vmla.f32   q10, %q12, q15  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128]! 
\\n\" // r12 r13\n\n                    \"vmla.f32   q12, %q11, q14  \\n\"\n                    \"vmla.f32   q11, %q12, q14  \\n\"\n                    \"vmla.f32   q13, %q11, q15  \\n\"\n                    \"vmla.f32   q10, %q13, q14  \\n\"\n                    \"vmla.f32   q12, %q12, q15  \\n\"\n                    \"vmla.f32   q11, %q13, q15  \\n\"\n\n                    //                     \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128] \\n\" // r14 r15\n\n                    \"vmla.f32   q13, %q12, q14  \\n\"\n                    \"vmla.f32   q12, %q13, q14  \\n\"\n                    \"vmla.f32   q13, %q13, q15  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128]! \\n\" // r20 r21\n\n                    \"vmla.f32   q10, %q14, q14  \\n\"\n                    \"vmla.f32   q11, %q14, q15  \\n\"\n                    \"vmla.f32   q10, %q15, q15  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128]! 
\\n\" // r22 r23\n\n                    \"vmla.f32   q12, %q14, q14  \\n\"\n                    \"vmla.f32   q11, %q15, q14  \\n\"\n                    \"vmla.f32   q13, %q14, q15  \\n\"\n                    \"vmla.f32   q10, %q16, q14  \\n\"\n                    \"vmla.f32   q12, %q15, q15  \\n\"\n                    \"vmla.f32   q11, %q16, q15  \\n\"\n\n                    //                     \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128] \\n\" // r24 r25\n\n                    \"vmla.f32   q13, %q15, q14  \\n\"\n                    \"vmla.f32   q12, %q16, q14  \\n\"\n                    \"vmla.f32   q13, %q16, q15  \\n\"\n\n                    \"vstm       %0!, {d20-d27}  \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%1] \\n\" // r00 r01 r02 r03\n\n                    \"mov    v16.16b, %17.16b            \\n\" // sum00\n                    \"mov    v17.16b, %17.16b            \\n\" // sum01\n\n    
                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                    \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n\n                    \"fmla   v16.4s, %8.4s, v12.4s       \\n\"\n                    \"fmla   v17.4s, %8.4s, v13.4s       \\n\"\n\n                    \"add    %1, %1, #32                 \\n\"\n\n                    \"fmla   v18.4s, %9.4s, v13.4s       \\n\"\n                    \"fmla   v19.4s, %9.4s, v14.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%2] \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v16.4s, %10.4s, v14.4s      \\n\"\n                    \"fmla   v17.4s, %10.4s, v15.4s      \\n\"\n\n                    \"add    %2, %2, #32                 \\n\"\n\n                    \"fmla   v18.4s, %11.4s, v20.4s      \\n\"\n                    \"fmla   v19.4s, %11.4s, v21.4s      \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v21.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s, v15.4s}, [%3] \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v18.4s, %13.4s, v22.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v23.4s      \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v13.4s      \\n\"\n\n                    \"fmla   v18.4s, %15.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v14.4s      \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v14.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v15.4s      \\n\"\n\n                    \"add    %3, %3, #32                 \\n\"\n\n                    \"fadd   v16.4s, v16.4s, v18.4s      \\n\"\n                    \"fadd   v17.4s, v17.4s, v19.4s      \\n\"\n\n                    \"st1    {v16.4s, v17.4s}, [%0], 
#32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d24-d27}, [%1 :128]! \\n\" // r00 r01\n\n                    \"vmov       q10, %q17       \\n\" // sum00\n                    \"vmov       q11, %q17       \\n\" // sum01\n\n                    \"vmla.f32   q10, %q8, q12   \\n\"\n                    \"vmla.f32   q11, %q8, q13   \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128] \\n\" // r02 r03\n\n                    \"vmla.f32   q10, %q9, q13   \\n\"\n\n                    \"vmla.f32   q11, %q9, q14   \\n\"\n                    \"vmla.f32   q10, %q10, q14  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d24-d27}, [%2 :128]! 
\\n\" // r10 r11\n\n                    \"vmla.f32   q11, %q10, q15  \\n\"\n\n                    \"vmla.f32   q10, %q11, q12  \\n\"\n                    \"vmla.f32   q11, %q11, q13  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128] \\n\" // r12 r13\n\n                    \"vmla.f32   q10, %q12, q13  \\n\"\n\n                    \"vmla.f32   q11, %q12, q14  \\n\"\n                    \"vmla.f32   q10, %q13, q14  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d24-d27}, [%3 :128]! \\n\" // r20 r21\n\n                    \"vmla.f32   q11, %q13, q15  \\n\"\n\n                    \"vmla.f32   q10, %q14, q12  \\n\"\n                    \"vmla.f32   q11, %q14, q13  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128] \\n\" // r22 r23\n\n                    \"vmla.f32   q10, %q15, q13  \\n\"\n\n                    \"vmla.f32   q11, %q15, q14  \\n\"\n                    \"vmla.f32   q10, %q16, q14  \\n\"\n                    \"vmla.f32   q11, %q16, q15  \\n\"\n\n                    \"vst1.f32   {d20-d23}, [%0 :128]! 
\\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _sum0 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n\n                vst1q_f32(outptr0, _sum0);\n\n                r0 += 
4;\n                r1 += 4;\n                r2 += 4;\n                outptr0 += 4;\n            }\n\n            r0 += 2 * 4;\n            r1 += 2 * 4;\n            r2 += 2 * 4;\n        }\n    }\n}\n\nstatic void convdw3x3s2_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n\n        float32x4_t _k00 = vld1q_f32(k0);\n        float32x4_t _k01 = vld1q_f32(k0 + 4);\n        float32x4_t _k02 = vld1q_f32(k0 + 8);\n        float32x4_t _k10 = vld1q_f32(k0 + 12);\n        float32x4_t _k11 = vld1q_f32(k0 + 16);\n        float32x4_t _k12 = vld1q_f32(k0 + 20);\n        float32x4_t _k20 = vld1q_f32(k0 + 24);\n        float32x4_t _k21 = vld1q_f32(k0 + 28);\n        float32x4_t _k22 = vld1q_f32(k0 + 32);\n\n        int i = 0;\n\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n                    \"mov    v29.16b, %17.16b            \\n\" // sum01\n                    
\"mov    v30.16b, %17.16b            \\n\" // sum02\n                    \"mov    v31.16b, %17.16b            \\n\" // sum03\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v14.4s, v15.4s, v16.4s, v17.4s}, [%1], #64 \\n\" // r04 r05 r06 r07\n\n                    \"fmla   v28.4s, %8.4s, v10.4s       \\n\"\n                    \"fmla   v29.4s, %8.4s, v12.4s       \\n\"\n                    \"fmla   v30.4s, %8.4s, v14.4s       \\n\"\n                    \"fmla   v31.4s, %8.4s, v16.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v18.4s}, [%1]              \\n\" // r08\n\n                    \"fmla   v28.4s, %9.4s, v11.4s       \\n\"\n                    \"fmla   v29.4s, %9.4s, v13.4s       \\n\"\n                    \"fmla   v30.4s, %9.4s, v15.4s       \\n\"\n                    \"fmla   v31.4s, %9.4s, v17.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v28.4s, %10.4s, v12.4s      \\n\"\n                    \"fmla   v29.4s, %10.4s, v14.4s      \\n\"\n                    \"fmla   v30.4s, %10.4s, v16.4s      \\n\"\n                    \"fmla   v31.4s, %10.4s, v18.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%2], #64 \\n\" // r14 r15 r16 r17\n\n                    \"fmla   v28.4s, %11.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, %11.4s, v22.4s      \\n\"\n                    \"fmla   v30.4s, %11.4s, v24.4s      \\n\"\n                    \"fmla   v31.4s, %11.4s, v26.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v19.4s}, [%2]              \\n\" // r18\n\n                    \"fmla   v28.4s, %12.4s, v21.4s      \\n\"\n       
             \"fmla   v29.4s, %12.4s, v23.4s      \\n\"\n                    \"fmla   v30.4s, %12.4s, v25.4s      \\n\"\n                    \"fmla   v31.4s, %12.4s, v27.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v28.4s, %13.4s, v22.4s      \\n\"\n                    \"fmla   v29.4s, %13.4s, v24.4s      \\n\"\n                    \"fmla   v30.4s, %13.4s, v26.4s      \\n\"\n                    \"fmla   v31.4s, %13.4s, v19.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v14.4s, v15.4s, v16.4s, v17.4s}, [%3], #64 \\n\" // r24 r25 r26 r27\n\n                    \"fmla   v28.4s, %14.4s, v10.4s      \\n\"\n                    \"fmla   v29.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v30.4s, %14.4s, v14.4s      \\n\"\n                    \"fmla   v31.4s, %14.4s, v16.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v18.4s}, [%3]              \\n\" // r28\n\n                    \"fmla   v28.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v29.4s, %15.4s, v13.4s      \\n\"\n                    \"fmla   v30.4s, %15.4s, v15.4s      \\n\"\n                    \"fmla   v31.4s, %15.4s, v17.4s      \\n\"\n\n                    \"fmla   v28.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v29.4s, %16.4s, v14.4s      \\n\"\n                    \"fmla   v30.4s, %16.4s, v16.4s      \\n\"\n                    \"fmla   v31.4s, %16.4s, v18.4s      \\n\"\n\n                    \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n               
     \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128]! \\n\" // r00 r01\n\n                    \"vmov       q10, %q17       \\n\" // sum00\n\n                    \"vmla.f32   q10, %q8, q14   \\n\"\n\n                    \"vmov       q11, %q17       \\n\" // sum01\n\n                    \"vmla.f32   q10, %q9, q15   \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128]! \\n\" // r02 r03\n\n                    \"vmla.f32   q11, %q8, q14   \\n\"\n                    \"vmla.f32   q10, %q10, q14  \\n\"\n\n                    \"vmov       q12, %q17       \\n\" // sum02\n\n                    \"vmla.f32   q11, %q9, q15   \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128]! \\n\" // r04 r05\n\n                    \"vmla.f32   q12, %q8, q14   \\n\"\n                    \"vmla.f32   q11, %q10, q14  \\n\"\n\n                    \"vmla.f32   q12, %q9, q15   \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128]! 
\\n\" // r10 r11\n\n                    \"vmla.f32   q10, %q11, q14  \\n\"\n\n                    \"vmov       q13, %q17       \\n\" // sum03\n\n                    \"vmla.f32   q10, %q12, q15  \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128]! \\n\" // r06 r07\n\n                    \"vmla.f32   q13, %q8, q14   \\n\"\n                    \"vmla.f32   q12, %q10, q14  \\n\"\n\n                    \"vmla.f32   q13, %q9, q15   \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128]! \\n\" // r12 r13\n\n                    \"vmla.f32   q11, %q11, q14  \\n\"\n                    \"vmla.f32   q10, %q13, q14  \\n\"\n\n                    \"vmla.f32   q11, %q12, q15  \\n\"\n\n                    \"vld1.f32   {d28-d29}, [%1 :128] \\n\" // r08\n\n                    \"vmla.f32   q13, %q10, q14  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128]! \\n\" // r14 r15\n\n                    \"vmla.f32   q12, %q11, q14  \\n\"\n                    \"vmla.f32   q11, %q13, q14  \\n\"\n\n                    \"vmla.f32   q12, %q12, q15  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128]! \\n\" // r20 r21\n\n                    \"vmla.f32   q10, %q14, q14  \\n\"\n                    \"vmla.f32   q10, %q15, q15  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128]! \\n\" // r16 r17\n\n                    \"vmla.f32   q13, %q11, q14  \\n\"\n                    \"vmla.f32   q12, %q13, q14  \\n\"\n\n                    \"vmla.f32   q13, %q12, q15  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128]! 
\\n\" // r22 r23\n\n                    \"vmla.f32   q11, %q14, q14  \\n\"\n                    \"vmla.f32   q10, %q16, q14  \\n\"\n\n                    \"vmla.f32   q11, %q15, q15  \\n\"\n\n                    \"vld1.f32   {d28-d29}, [%2 :128] \\n\" // r18\n\n                    \"vmla.f32   q13, %q13, q14  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128]! \\n\" // r24 r25\n\n                    \"vmla.f32   q12, %q14, q14  \\n\"\n                    \"vmla.f32   q11, %q16, q14  \\n\"\n\n                    \"vmla.f32   q12, %q15, q15  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128]! \\n\" // r26 r27\n\n                    \"vmla.f32   q13, %q14, q14  \\n\"\n                    \"vmla.f32   q12, %q16, q14  \\n\"\n\n                    \"vmla.f32   q13, %q15, q15  \\n\"\n\n                    \"vld1.f32   {d28-d29}, [%3 :128] \\n\" // r28\n\n                    \"vmla.f32   q13, %q16, q14  \\n\"\n\n                    \"vstm       %0!, {d20-d27}  \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n            }\n            for (; j + 1 < outw; j += 2)\n  
          {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v20.16b, %17.16b            \\n\" // sum00\n                    \"mov    v21.16b, %17.16b            \\n\" // sum01\n\n                    \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                    \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n\n                    \"fmla   v20.4s, %8.4s, v10.4s       \\n\"\n                    \"fmla   v21.4s, %8.4s, v12.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v14.4s}, [%1]              \\n\" // r04\n\n                    \"fmla   v22.4s, %9.4s, v11.4s       \\n\"\n                    \"fmla   v23.4s, %9.4s, v13.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v20.4s, %10.4s, v12.4s      \\n\"\n                    \"fmla   v21.4s, %10.4s, v14.4s      \\n\"\n\n                    \"fmla   v22.4s, %11.4s, v16.4s      \\n\"\n                    \"fmla   v23.4s, %11.4s, v18.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v15.4s}, [%2]              \\n\" // r14\n\n                    \"fmla   v20.4s, %12.4s, v17.4s      \\n\"\n                    \"fmla   v21.4s, %12.4s, v19.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v10.4s, v11.4s, v12.4s, v13.4s}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v22.4s, %13.4s, v18.4s      \\n\"\n                    \"fmla   v23.4s, %13.4s, v15.4s      \\n\"\n\n                    \"fmla   v20.4s, %14.4s, v10.4s      \\n\"\n                    \"fmla   v21.4s, %14.4s, 
v12.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v14.4s}, [%3]              \\n\" // r24\n\n                    \"fmla   v22.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v23.4s, %15.4s, v13.4s      \\n\"\n\n                    \"fmla   v20.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v21.4s, %16.4s, v14.4s      \\n\"\n\n                    \"fadd   v20.4s, v20.4s, v22.4s      \\n\"\n                    \"fadd   v21.4s, v21.4s, v23.4s      \\n\"\n\n                    \"st1    {v20.4s, v21.4s}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d24-d27}, [%1 :128]! \\n\" // r00 r01\n\n                    \"vmov       q10, %q17       \\n\" // sum00\n                    \"vmov       q11, %q17       \\n\" // sum01\n\n                    \"vmla.f32   q10, %q8, q12   \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%1 :128]! 
\\n\" // r02 r03\n\n                    \"vmla.f32   q10, %q9, q13   \\n\"\n\n                    \"vmla.f32   q11, %q8, q14   \\n\"\n                    \"vmla.f32   q10, %q10, q14  \\n\"\n\n                    \"vld1.f32   {d24-d25}, [%1 :128] \\n\" // r04\n\n                    \"vmla.f32   q11, %q9, q15   \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%2 :128]! \\n\" // r10 r11\n\n                    \"vmla.f32   q11, %q10, q12  \\n\"\n\n                    \"vmla.f32   q10, %q11, q14  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.f32   {d24-d27}, [%2 :128]! \\n\" // r12 r13\n\n                    \"vmla.f32   q10, %q12, q15  \\n\"\n\n                    \"vmla.f32   q11, %q11, q12  \\n\"\n                    \"vmla.f32   q10, %q13, q12  \\n\"\n\n                    \"vld1.f32   {d28-d29}, [%2 :128] \\n\" // r14\n\n                    \"vmla.f32   q11, %q12, q13  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d24-d27}, [%3 :128]! \\n\" // r20 r21\n\n                    \"vmla.f32   q11, %q13, q14  \\n\"\n\n                    \"vmla.f32   q10, %q14, q12  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.f32   {d28-d31}, [%3 :128]! \\n\" // r22 r23\n\n                    \"vmla.f32   q10, %q15, q13  \\n\"\n\n                    \"vmla.f32   q11, %q14, q14  \\n\"\n                    \"vmla.f32   q10, %q16, q14  \\n\"\n\n                    \"vld1.f32   {d24-d25}, [%3 :128] \\n\" // r24\n\n                    \"vmla.f32   q11, %q15, q15  \\n\"\n\n                    \"vmla.f32   q11, %q16, q12  \\n\"\n\n                    \"vst1.f32   {d20-d23}, [%0 :128]! 
\\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _sum0 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n\n                vst1q_f32(outptr0, _sum0);\n\n                r0 += 
2 * 4;\n                r1 += 2 * 4;\n                r2 += 2 * 4;\n                outptr0 += 4;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_3x3_pack4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n#if __aarch64__\n    const int w = bottom_blob.w;\n#endif\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n\n        const unsigned short* k0 = kernel.row<const unsigned short>(g);\n\n        unsigned short* outptr0 = out.row<unsigned short>(0);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const unsigned short* r0 = img0.row<const unsigned short>(0);\n        const unsigned short* r1 = img0.row<const unsigned short>(1);\n        const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n        float32x4_t _k00 = bfloat2float(vld1_u16(k0));\n        float32x4_t _k01 = bfloat2float(vld1_u16(k0 + 4));\n        float32x4_t _k02 = bfloat2float(vld1_u16(k0 + 8));\n        float32x4_t _k10 = bfloat2float(vld1_u16(k0 + 12));\n        float32x4_t _k11 = bfloat2float(vld1_u16(k0 + 16));\n        float32x4_t _k12 = bfloat2float(vld1_u16(k0 + 20));\n        float32x4_t _k20 = bfloat2float(vld1_u16(k0 + 24));\n        float32x4_t _k21 = bfloat2float(vld1_u16(k0 + 28));\n        float32x4_t _k22 = bfloat2float(vld1_u16(k0 + 32));\n\n        int i = 0;\n\n#if __aarch64__\n        unsigned short* outptr1 = out.row<unsigned short>(1);\n        const unsigned short* r3 = img0.row<const unsigned short>(3);\n\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n                asm volatile(\n                    \"prfm   
pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%3], #32 \\n\" // r10 r11 r12 r13\n\n                    \"mov    v16.16b, %21.16b            \\n\" // sum00\n                    \"mov    v17.16b, %21.16b            \\n\" // sum01\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v28.4h, v29.4h}, [%3]      \\n\" // r14 r15\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"mov    v18.16b, %21.16b            \\n\" // sum02\n                    \"mov    v19.16b, %21.16b            \\n\" // sum03\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"mov    v20.16b, %21.16b            \\n\" // sum10\n\n                    \"fmla   v16.4s, %15.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %15.4s, v11.4s      \\n\"\n\n                    \"mov    v21.16b, %21.16b            \\n\" // sum11\n\n                    \"fmla   v18.4s, %15.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v13.4s      \\n\"\n\n                    \"mov    v22.16b, %21.16b            \\n\" // sum12\n\n                    \"fmla   v20.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v21.4s, %12.4s, v11.4s      \\n\"\n\n                    \"mov    v23.16b, %21.16b            \\n\" // sum13\n\n                    \"fmla   v22.4s, %12.4s, v12.4s      \\n\"\n                    \"fmla   v23.4s, %12.4s, v13.4s      \\n\"\n\n                    \"shll   v28.4s, v28.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v12.4s      \\n\"\n\n                    \"shll   v29.4s, v29.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %16.4s, v13.4s      \\n\"\n                    
\"fmla   v19.4s, %16.4s, v28.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%4], #32 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v20.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v21.4s, %13.4s, v12.4s      \\n\"\n                    \"fmla   v22.4s, %13.4s, v13.4s      \\n\"\n                    \"fmla   v23.4s, %13.4s, v28.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v14.4h, v15.4h}, [%4]      \\n\" // r24 r25\n\n                    \"fmla   v16.4s, %17.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %17.4s, v13.4s      \\n\"\n\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %17.4s, v28.4s      \\n\"\n                    \"fmla   v19.4s, %17.4s, v29.4s      \\n\"\n\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v21.4s, %14.4s, v13.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%2], #32 \\n\" // r00 r01 r02 r03\n\n                    \"fmla   v22.4s, %14.4s, v28.4s      \\n\"\n                    \"fmla   v23.4s, %14.4s, v29.4s      \\n\"\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %18.4s, v24.4s      \\n\"\n                    \"fmla   v17.4s, %18.4s, v25.4s      \\n\"\n\n                    \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %18.4s, v26.4s      \\n\"\n                    \"fmla   v19.4s, %18.4s, v27.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%5], #32 \\n\" // r30 r31 r32 r33\n\n                
    \"fmla   v20.4s, %15.4s, v24.4s      \\n\"\n                    \"fmla   v21.4s, %15.4s, v25.4s      \\n\"\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n\n                    \"fmla   v22.4s, %15.4s, v26.4s      \\n\"\n                    \"fmla   v23.4s, %15.4s, v27.4s      \\n\"\n\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %19.4s, v25.4s      \\n\"\n                    \"fmla   v17.4s, %19.4s, v26.4s      \\n\"\n\n                    \"fmla   v18.4s, %19.4s, v27.4s      \\n\"\n                    \"fmla   v19.4s, %19.4s, v14.4s      \\n\"\n\n                    \"fmla   v20.4s, %16.4s, v25.4s      \\n\"\n                    \"fmla   v21.4s, %16.4s, v26.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v24.4h, v25.4h}, [%2]      \\n\" // r04 r05\n\n                    \"fmla   v22.4s, %16.4s, v27.4s      \\n\"\n                    \"fmla   v23.4s, %16.4s, v14.4s      \\n\"\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %20.4s, v26.4s      \\n\"\n                    \"fmla   v17.4s, %20.4s, v27.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %20.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %20.4s, v15.4s      \\n\"\n\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %17.4s, v26.4s      \\n\"\n                    \"fmla   v21.4s, %17.4s, v27.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #128]       \\n\"\n                    \"ld1    {v26.4h, v27.4h}, [%5]      \\n\" // r34 r35\n\n                    \"fmla   v22.4s, %17.4s, v14.4s      \\n\"\n                    \"fmla   v23.4s, %17.4s, v15.4s      \\n\"\n\n                    \"shll   v28.4s, 
v28.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v11.4s      \\n\"\n\n                    \"shll   v29.4s, v29.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %12.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %12.4s, v13.4s      \\n\"\n\n                    \"shll   v30.4s, v30.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %18.4s, v28.4s      \\n\"\n                    \"fmla   v21.4s, %18.4s, v29.4s      \\n\"\n\n                    \"shll   v31.4s, v31.4h, #16         \\n\"\n\n                    \"fmla   v22.4s, %18.4s, v30.4s      \\n\"\n                    \"fmla   v23.4s, %18.4s, v31.4s      \\n\"\n\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %13.4s, v12.4s      \\n\"\n                    \"fmla   v18.4s, %13.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v24.4s      \\n\"\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %19.4s, v29.4s      \\n\"\n                    \"fmla   v21.4s, %19.4s, v30.4s      \\n\"\n                    \"fmla   v22.4s, %19.4s, v31.4s      \\n\"\n                    \"fmla   v23.4s, %19.4s, v26.4s      \\n\"\n\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v13.4s      \\n\"\n                    \"fmla   v18.4s, %14.4s, v24.4s      \\n\"\n                    \"fmla   v19.4s, %14.4s, v25.4s      \\n\"\n\n                    \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %20.4s, v30.4s      \\n\"\n                    \"fmla   v21.4s, %20.4s, v31.4s      \\n\"\n                    \"fmla   v22.4s, %20.4s, v26.4s      \\n\"\n                    
\"fmla   v23.4s, %20.4s, v27.4s      \\n\"\n\n                    \"shrn   v16.4h, v16.4s, #16         \\n\"\n                    \"shrn   v17.4h, v17.4s, #16         \\n\"\n                    \"shrn   v18.4h, v18.4s, #16         \\n\"\n                    \"shrn   v19.4h, v19.4s, #16         \\n\"\n                    \"shrn   v20.4h, v20.4s, #16         \\n\"\n                    \"shrn   v21.4h, v21.4s, #16         \\n\"\n                    \"shrn   v22.4h, v22.4s, #16         \\n\"\n                    \"shrn   v23.4h, v23.4s, #16         \\n\"\n\n                    \"st1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%0], #32 \\n\"\n                    \"st1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%1], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #256] 
      \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%3] \\n\" // r10 r11 r12 r13\n\n                    \"mov    v16.16b, %21.16b            \\n\" // sum00\n                    \"mov    v17.16b, %21.16b            \\n\" // sum01\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"mov    v18.16b, %21.16b            \\n\" // sum10\n                    \"mov    v19.16b, %21.16b            \\n\" // sum11\n\n                    \"fmla   v16.4s, %15.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %15.4s, v11.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v19.4s, %12.4s, v11.4s      \\n\"\n\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v12.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%4] \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v18.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v12.4s      \\n\"\n\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %17.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %17.4s, v13.4s      \\n\"\n\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %14.4s, v13.4s      \\n\"\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %18.4s, v20.4s      \\n\"\n                    \"fmla   v17.4s, %18.4s, v21.4s      \\n\"\n\n                    \"shll   v23.4s, v23.4h, #16         
\\n\"\n\n                    \"fmla   v18.4s, %15.4s, v20.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%2] \\n\" // r00 r01 r02 r03\n\n                    \"fmla   v16.4s, %19.4s, v21.4s      \\n\"\n                    \"fmla   v17.4s, %19.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%5] \\n\" // r30 r31 r32 r33\n\n                    \"fmla   v18.4s, %16.4s, v21.4s      \\n\"\n                    \"fmla   v19.4s, %16.4s, v22.4s      \\n\"\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %20.4s, v22.4s      \\n\"\n                    \"fmla   v17.4s, %20.4s, v23.4s      \\n\"\n\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %17.4s, v22.4s      \\n\"\n                    \"fmla   v19.4s, %17.4s, v23.4s      \\n\"\n\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v11.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %18.4s, v24.4s      \\n\"\n                    \"fmla   v19.4s, %18.4s, v25.4s      \\n\"\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %13.4s, v12.4s      \\n\"\n\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %19.4s, v25.4s      \\n\"\n                    \"fmla   v19.4s, %19.4s, v26.4s      \\n\"\n\n                    \"shll   v27.4s, 
v27.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v13.4s      \\n\"\n\n                    \"add    %3, %3, #16                 \\n\"\n\n                    \"fmla   v18.4s, %20.4s, v26.4s      \\n\"\n                    \"fmla   v19.4s, %20.4s, v27.4s      \\n\"\n\n                    \"add    %4, %4, #16                 \\n\"\n\n                    \"shrn   v16.4h, v16.4s, #16         \\n\"\n                    \"shrn   v17.4h, v17.4s, #16         \\n\"\n\n                    \"add    %2, %2, #16                 \\n\"\n\n                    \"shrn   v18.4h, v18.4s, #16         \\n\"\n                    \"shrn   v19.4h, v19.4s, #16         \\n\"\n\n                    \"add    %5, %5, #16                 \\n\"\n\n                    \"st1    {v16.4h, v17.4h}, [%0], #16 \\n\"\n                    \"st1    {v18.4h, v19.4h}, [%1], #16 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", 
\"v26\", \"v27\");\n            }\n            for (; j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #192]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h}, [%3] \\n\" // r10 r11 r12\n\n                    \"mov    v18.16b, %21.16b            \\n\" // sum0\n                    \"mov    v19.16b, %21.16b            \\n\" // sum1\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"fmul   v16.4s, %15.4s, v10.4s      \\n\"\n                    \"fmul   v17.4s, %12.4s, v10.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %16.4s, v11.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v11.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #192]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h}, [%4] \\n\" // r20 r21 r22\n\n                    \"fmla   v16.4s, %17.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v12.4s      \\n\"\n\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %18.4s, v20.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #192]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h}, [%2] \\n\" // r00 r01 r02\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n\n                    \"prfm   pldl1keep, [%5, #192]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h}, [%5] \\n\" // r30 r31 r32\n\n                    \"fmla   v16.4s, %19.4s, v21.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v21.4s      \\n\"\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v24.4s, 
v24.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %20.4s, v22.4s      \\n\"\n                    \"fmla   v19.4s, %17.4s, v22.4s      \\n\"\n\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %18.4s, v24.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %13.4s, v11.4s      \\n\"\n                    \"fmla   v19.4s, %19.4s, v25.4s      \\n\"\n\n                    \"add    %3, %3, #8                  \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %20.4s, v26.4s      \\n\"\n\n                    \"add    %4, %4, #8                  \\n\"\n\n                    \"fadd   v18.4s, v18.4s, v16.4s      \\n\"\n                    \"fadd   v19.4s, v19.4s, v17.4s      \\n\"\n\n                    \"add    %2, %2, #8                  \\n\"\n\n                    \"shrn   v18.4h, v18.4s, #16         \\n\"\n                    \"shrn   v19.4h, v19.4s, #16         \\n\"\n\n                    \"add    %5, %5, #8                  \\n\"\n\n                    \"st1    {v18.4h}, [%0], #8          \\n\"\n                    \"st1    {v19.4h}, [%1], #8          \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    
\"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v24\", \"v25\", \"v26\");\n            }\n\n            r0 += 2 * 4 + w * 4;\n            r1 += 2 * 4 + w * 4;\n            r2 += 2 * 4 + w * 4;\n            r3 += 2 * 4 + w * 4;\n\n            outptr0 += outw * 4;\n            outptr1 += outw * 4;\n        }\n#endif // __aarch64__\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v16.16b, %17.16b            \\n\" // sum00\n                    \"mov    v17.16b, %17.16b            \\n\" // sum01\n                    \"mov    v18.16b, %17.16b            \\n\" // sum02\n                    \"mov    v19.16b, %17.16b            \\n\" // sum03\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %8.4s, v10.4s       \\n\"\n                    \"fmla   v17.4s, %8.4s, v11.4s       \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %8.4s, v12.4s       \\n\"\n                    \"fmla   v19.4s, %8.4s, v13.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                
    \"ld1    {v14.4h, v15.4h}, [%1]      \\n\" // r04 r05\n\n                    \"fmla   v16.4s, %9.4s, v11.4s       \\n\"\n                    \"fmla   v17.4s, %9.4s, v12.4s       \\n\"\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %9.4s, v13.4s       \\n\"\n                    \"fmla   v19.4s, %9.4s, v14.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v16.4s, %10.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %10.4s, v13.4s      \\n\"\n\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %10.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %10.4s, v15.4s      \\n\"\n\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %11.4s, v20.4s      \\n\"\n                    \"fmla   v17.4s, %11.4s, v21.4s      \\n\"\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %11.4s, v22.4s      \\n\"\n                    \"fmla   v19.4s, %11.4s, v23.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v14.4h, v15.4h}, [%2]      \\n\" // r14 r15\n\n                    \"fmla   v16.4s, %12.4s, v21.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v22.4s      \\n\"\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %12.4s, v23.4s      \\n\"\n                    \"fmla   v19.4s, %12.4s, v14.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%3], 
#32 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v16.4s, %13.4s, v22.4s      \\n\"\n                    \"fmla   v17.4s, %13.4s, v23.4s      \\n\"\n\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %13.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v15.4s      \\n\"\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v10.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v11.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v19.4s, %14.4s, v13.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v14.4h, v15.4h}, [%3]      \\n\" // r24 r25\n\n                    \"fmla   v16.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v17.4s, %15.4s, v12.4s      \\n\"\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %15.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v14.4s      \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v13.4s      \\n\"\n\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %16.4s, v14.4s      \\n\"\n                    \"fmla   v19.4s, %16.4s, v15.4s      \\n\"\n\n                    \"shrn   v16.4h, v16.4s, #16         \\n\"\n                    \"shrn   v17.4h, v17.4s, #16         \\n\"\n                    \"shrn   v18.4h, v18.4s, #16         \\n\"\n                    \"shrn   v19.4h, v19.4s, #16         \\n\"\n\n                    \"st1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%0], 
#32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                asm volatile(\n                    \"pld        [%1, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%1 :64]! \\n\" // r00 r01\n\n                    \"vmov       q10, %q17       \\n\" // sum00\n                    \"vmov       q11, %q17       \\n\" // sum01\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q10, %q8, q14   \\n\"\n                    \"vmla.f32   q11, %q8, q15   \\n\"\n                    \"vmla.f32   q10, %q9, q15   \\n\"\n\n                    \"pld        [%1, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%1 :64]! 
\\n\" // r02 r03\n\n                    \"vmov       q12, %q17       \\n\" // sum02\n                    \"vmov       q13, %q17       \\n\" // sum03\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q12, %q8, q14   \\n\"\n                    \"vmla.f32   q11, %q9, q14   \\n\"\n                    \"vmla.f32   q13, %q8, q15   \\n\"\n                    \"vmla.f32   q10, %q10, q14  \\n\"\n                    \"vmla.f32   q12, %q9, q15   \\n\"\n                    \"vmla.f32   q11, %q10, q15  \\n\"\n\n                    //                     \"pld        [%1, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%1 :64] \\n\" // r04 r05\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q13, %q9, q14   \\n\"\n                    \"vmla.f32   q12, %q10, q14  \\n\"\n                    \"vmla.f32   q13, %q10, q15  \\n\"\n\n                    \"pld        [%2, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%2 :64]! \\n\" // r10 r11\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q10, %q11, q14  \\n\"\n                    \"vmla.f32   q11, %q11, q15  \\n\"\n                    \"vmla.f32   q10, %q12, q15  \\n\"\n\n                    \"pld        [%2, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%2 :64]! 
\\n\" // r12 r13\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q12, %q11, q14  \\n\"\n                    \"vmla.f32   q11, %q12, q14  \\n\"\n                    \"vmla.f32   q13, %q11, q15  \\n\"\n                    \"vmla.f32   q10, %q13, q14  \\n\"\n                    \"vmla.f32   q12, %q12, q15  \\n\"\n                    \"vmla.f32   q11, %q13, q15  \\n\"\n\n                    //                     \"pld        [%2, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%2 :64] \\n\" // r14 r15\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q13, %q12, q14  \\n\"\n                    \"vmla.f32   q12, %q13, q14  \\n\"\n                    \"vmla.f32   q13, %q13, q15  \\n\"\n\n                    \"pld        [%3, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%3 :64]! \\n\" // r20 r21\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q10, %q14, q14  \\n\"\n                    \"vmla.f32   q11, %q14, q15  \\n\"\n                    \"vmla.f32   q10, %q15, q15  \\n\"\n\n                    \"pld        [%3, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%3 :64]! 
\\n\" // r22 r23\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q12, %q14, q14  \\n\"\n                    \"vmla.f32   q11, %q15, q14  \\n\"\n                    \"vmla.f32   q13, %q14, q15  \\n\"\n                    \"vmla.f32   q10, %q16, q14  \\n\"\n                    \"vmla.f32   q12, %q15, q15  \\n\"\n                    \"vmla.f32   q11, %q16, q15  \\n\"\n\n                    //                     \"pld        [%3, #128]      \\n\"\n                    \"vld1.u16   {d30-d31}, [%3 :64] \\n\" // r24 r25\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q13, %q15, q14  \\n\"\n                    \"vmla.f32   q12, %q16, q14  \\n\"\n                    \"vmla.f32   q13, %q16, q15  \\n\"\n\n                    \"vshrn.u32  d20, q10, #16   \\n\"\n                    \"vshrn.u32  d21, q11, #16   \\n\"\n                    \"vshrn.u32  d22, q12, #16   \\n\"\n                    \"vshrn.u32  d23, q13, #16   \\n\"\n\n                    \"vst1.u16   {d20-d23}, [%0 :64]! 
\\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v12.4h, v13.4h, v14.4h, v15.4h}, [%1] \\n\" // r00 r01 r02 r03\n\n                    \"mov    v18.16b, %17.16b            \\n\" // sum00\n                    \"mov    v19.16b, %17.16b            \\n\" // sum01\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmul   v16.4s, %8.4s, v12.4s       \\n\"\n                    \"fmul   v17.4s, %8.4s, v13.4s       \\n\"\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %9.4s, v13.4s       \\n\"\n                    \"fmla   v19.4s, %9.4s, v14.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%2] \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v16.4s, %10.4s, v14.4s   
   \\n\"\n                    \"fmla   v17.4s, %10.4s, v15.4s      \\n\"\n\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %11.4s, v20.4s      \\n\"\n                    \"fmla   v19.4s, %11.4s, v21.4s      \\n\"\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %12.4s, v21.4s      \\n\"\n                    \"fmla   v17.4s, %12.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v12.4h, v13.4h, v14.4h, v15.4h}, [%3] \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v18.4s, %13.4s, v22.4s      \\n\"\n                    \"fmla   v19.4s, %13.4s, v23.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmla   v16.4s, %14.4s, v12.4s      \\n\"\n                    \"fmla   v17.4s, %14.4s, v13.4s      \\n\"\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v18.4s, %15.4s, v13.4s      \\n\"\n                    \"fmla   v19.4s, %15.4s, v14.4s      \\n\"\n\n                    \"add    %1, %1, #16                 \\n\"\n\n                    \"fmla   v16.4s, %16.4s, v14.4s      \\n\"\n                    \"fmla   v17.4s, %16.4s, v15.4s      \\n\"\n\n                    \"add    %2, %2, #16                 \\n\"\n\n                    \"fadd   v18.4s, v18.4s, v16.4s      \\n\"\n                    \"fadd   v19.4s, v19.4s, v17.4s      \\n\"\n\n                    \"add    %3, %3, #16                 \\n\"\n\n                    \"shrn   v18.4h, v18.4s, #16         \\n\"\n                    \"shrn   v19.4h, v19.4s, #16         \\n\"\n\n      
              \"st1    {v18.4h, v19.4h}, [%0], #16 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%1 :64] \\n\" // r00 r01 r02 r03\n\n                    \"vmov       q10, %q17       \\n\" // sum00\n                    \"vmov       q11, %q17       \\n\" // sum01\n\n                    \"vshll.u16  q12, d28, #16   \\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n\n                    \"vmla.f32   q10, %q8, q12   \\n\"\n                    \"vmla.f32   q11, %q8, q13   \\n\"\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n\n                    \"vmla.f32   q10, %q9, q13   \\n\"\n                    \"vmla.f32   q11, %q9, q14   \\n\"\n\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q10, %q10, q14  \\n\"\n                    \"vmla.f32   q11, %q10, q15  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%2 :64] \\n\" // r10 r11 r12 r13\n\n                    \"vshll.u16  q12, d28, #16   
\\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n\n                    \"vmla.f32   q10, %q11, q12  \\n\"\n                    \"vmla.f32   q11, %q11, q13  \\n\"\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n\n                    \"vmla.f32   q10, %q12, q13  \\n\"\n                    \"vmla.f32   q11, %q12, q14  \\n\"\n\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q10, %q13, q14  \\n\"\n                    \"vmla.f32   q11, %q13, q15  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%3 :64] \\n\" // r20 r21 r22 r23\n\n                    \"vshll.u16  q12, d28, #16   \\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n\n                    \"vmla.f32   q10, %q14, q12  \\n\"\n                    \"vmla.f32   q11, %q14, q13  \\n\"\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n\n                    \"vmla.f32   q10, %q15, q13  \\n\"\n                    \"vmla.f32   q11, %q15, q14  \\n\"\n\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q10, %q16, q14  \\n\"\n                    \"vmla.f32   q11, %q16, q15  \\n\"\n\n                    \"add        %1, %1, #16     \\n\"\n                    \"add        %2, %2, #16     \\n\"\n\n                    \"vshrn.u32  d20, q10, #16   \\n\"\n                    \"vshrn.u32  d21, q11, #16   \\n\"\n\n                    \"add        %3, %3, #16     \\n\"\n\n                    \"vst1.u16   {d20-d21}, [%0 :64]! 
\\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _sum0 = _bias0;\n\n                float32x4_t _r00 = bfloat2float(vld1_u16(r0));\n                float32x4_t _r01 = bfloat2float(vld1_u16(r0 + 4));\n                float32x4_t _r02 = bfloat2float(vld1_u16(r0 + 8));\n                float32x4_t _r10 = bfloat2float(vld1_u16(r1));\n                float32x4_t _r11 = bfloat2float(vld1_u16(r1 + 4));\n                float32x4_t _r12 = bfloat2float(vld1_u16(r1 + 8));\n                float32x4_t _r20 = bfloat2float(vld1_u16(r2));\n                float32x4_t _r21 = bfloat2float(vld1_u16(r2 + 4));\n                float32x4_t _r22 = bfloat2float(vld1_u16(r2 + 8));\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n        
        _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n\n                vst1_u16(outptr0, float2bfloat(_sum0));\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                outptr0 += 4;\n            }\n\n            r0 += 2 * 4;\n            r1 += 2 * 4;\n            r2 += 2 * 4;\n        }\n    }\n}\n\nstatic void convdw3x3s2_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n\n        const unsigned short* k0 = kernel.row<const unsigned short>(g);\n\n        unsigned short* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const unsigned short* r0 = img0.row<const unsigned short>(0);\n        const unsigned short* r1 = img0.row<const unsigned short>(1);\n        const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n        float32x4_t _k00 = bfloat2float(vld1_u16(k0));\n        float32x4_t _k01 = bfloat2float(vld1_u16(k0 + 4));\n        float32x4_t _k02 = bfloat2float(vld1_u16(k0 + 8));\n        float32x4_t _k10 = bfloat2float(vld1_u16(k0 + 12));\n        float32x4_t _k11 = bfloat2float(vld1_u16(k0 + 16));\n        float32x4_t _k12 = bfloat2float(vld1_u16(k0 + 20));\n        float32x4_t _k20 = bfloat2float(vld1_u16(k0 + 24));\n        float32x4_t _k21 = bfloat2float(vld1_u16(k0 + 28));\n        float32x4_t _k22 = bfloat2float(vld1_u16(k0 + 32));\n\n        int i = 0;\n\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n#if __aarch64__\n            for (; j + 3 < outw; 
j += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n                    \"mov    v29.16b, %17.16b            \\n\" // sum01\n                    \"mov    v30.16b, %17.16b            \\n\" // sum02\n                    \"mov    v31.16b, %17.16b            \\n\" // sum03\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v14.4h, v15.4h, v16.4h, v17.4h}, [%1], #32 \\n\" // r04 r05 r06 r07\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"prfm   pldl1keep, [%1, #64]        \\n\"\n                    \"ld1    {v18.4h}, [%1]              \\n\" // r08\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, %8.4s, v10.4s       \\n\"\n                    \"fmla   v29.4s, %8.4s, v12.4s       \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, %8.4s, v14.4s       \\n\"\n                    \"fmla   v31.4s, %8.4s, v16.4s       \\n\"\n\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, %9.4s, v11.4s       \\n\"\n                    \"fmla   v29.4s, %9.4s, v13.4s       \\n\"\n                    \"fmla   v30.4s, %9.4s, v15.4s       \\n\"\n                    \"fmla   v31.4s, %9.4s, v17.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%2], #32 \\n\" // 
r10 r11 r12 r13\n\n                    \"fmla   v28.4s, %10.4s, v12.4s      \\n\"\n                    \"fmla   v29.4s, %10.4s, v14.4s      \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, %10.4s, v16.4s      \\n\"\n                    \"fmla   v31.4s, %10.4s, v18.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%2], #32 \\n\" // r14 r15 r16 r17\n\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n\n                    \"prfm   pldl1keep, [%2, #64]        \\n\"\n                    \"ld1    {v19.4h}, [%2]              \\n\" // r18\n\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, %11.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, %11.4s, v22.4s      \\n\"\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, %11.4s, v24.4s      \\n\"\n                    \"fmla   v31.4s, %11.4s, v26.4s      \\n\"\n\n                    \"shll   v27.4s, v27.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, %12.4s, v21.4s      \\n\"\n                    \"fmla   v29.4s, %12.4s, v23.4s      \\n\"\n                    \"fmla   v30.4s, %12.4s, v25.4s      \\n\"\n                    \"fmla   v31.4s, %12.4s, v27.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v28.4s, %13.4s, v22.4s      \\n\"\n                    \"fmla   v29.4s, %13.4s, v24.4s      \\n\"\n\n                    
\"shll   v19.4s, v19.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, %13.4s, v26.4s      \\n\"\n                    \"fmla   v31.4s, %13.4s, v19.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v14.4h, v15.4h, v16.4h, v17.4h}, [%3], #32 \\n\" // r24 r25 r26 r27\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"prfm   pldl1keep, [%3, #64]        \\n\"\n                    \"ld1    {v18.4h}, [%3]              \\n\" // r28\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, %14.4s, v10.4s      \\n\"\n                    \"fmla   v29.4s, %14.4s, v12.4s      \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, %14.4s, v14.4s      \\n\"\n                    \"fmla   v31.4s, %14.4s, v16.4s      \\n\"\n\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v29.4s, %15.4s, v13.4s      \\n\"\n                    \"fmla   v30.4s, %15.4s, v15.4s      \\n\"\n                    \"fmla   v31.4s, %15.4s, v17.4s      \\n\"\n\n                    \"fmla   v28.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v29.4s, %16.4s, v14.4s      \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, %16.4s, v16.4s      \\n\"\n                    \"fmla   v31.4s, %16.4s, v18.4s      \\n\"\n\n                    \"shrn   v28.4h, v28.4s, #16         \\n\"\n                    \"shrn   v29.4h, v29.4s, #16         \\n\"\n                    \"shrn   
v30.4h, v30.4s, #16         \\n\"\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n#endif // __aarch64__\n            for (; j + 1 < outw; j += 2)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v22.16b, %17.16b            \\n\" // sum00\n                    \"mov    v23.16b, %17.16b            \\n\" // sum01\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"fmul   v20.4s, %8.4s, v10.4s       \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmul   v21.4s, %8.4s, v12.4s       \\n\"\n\n    
                \"prfm   pldl1keep, [%1, #64]        \\n\"\n                    \"ld1    {v14.4h}, [%1]              \\n\" // r04\n\n                    \"fmla   v22.4s, %9.4s, v11.4s       \\n\"\n                    \"fmla   v23.4s, %9.4s, v13.4s       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %10.4s, v12.4s      \\n\"\n                    \"fmla   v21.4s, %10.4s, v14.4s      \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"fmla   v22.4s, %11.4s, v16.4s      \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n\n                    \"fmla   v23.4s, %11.4s, v18.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #64]        \\n\"\n                    \"ld1    {v15.4h}, [%2]              \\n\" // r14\n\n                    \"fmla   v20.4s, %12.4s, v17.4s      \\n\"\n                    \"fmla   v21.4s, %12.4s, v19.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v10.4h, v11.4h, v12.4h, v13.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n\n                    \"shll   v15.4s, v15.4h, #16         \\n\"\n\n                    \"fmla   v22.4s, %13.4s, v18.4s      \\n\"\n                    \"fmla   v23.4s, %13.4s, v15.4s      \\n\"\n\n                    \"shll   v10.4s, v10.4h, #16         \\n\"\n                    \"shll   v11.4s, v11.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %14.4s, v10.4s      \\n\"\n\n                    \"shll   v12.4s, v12.4h, #16         \\n\"\n                    \"shll   v13.4s, v13.4h, #16         \\n\"\n\n                    \"fmla   
v21.4s, %14.4s, v12.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #64]        \\n\"\n                    \"ld1    {v14.4h}, [%3]              \\n\" // r24\n\n                    \"fmla   v22.4s, %15.4s, v11.4s      \\n\"\n                    \"fmla   v23.4s, %15.4s, v13.4s      \\n\"\n\n                    \"shll   v14.4s, v14.4h, #16         \\n\"\n\n                    \"fmla   v20.4s, %16.4s, v12.4s      \\n\"\n                    \"fmla   v21.4s, %16.4s, v14.4s      \\n\"\n\n                    \"fadd   v22.4s, v20.4s, v22.4s      \\n\"\n                    \"fadd   v23.4s, v21.4s, v23.4s      \\n\"\n\n                    \"shrn   v22.4h, v22.4s, #16         \\n\"\n                    \"shrn   v23.4h, v23.4s, #16         \\n\"\n\n                    \"st1    {v22.4h, v23.4h}, [%0], #16 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%1 :64]! 
\\n\" // r00 r01 r02 r03\n\n                    \"vmov       q10, %q17       \\n\" // sum00\n                    \"vmov       q11, %q17       \\n\" // sum01\n\n                    \"vshll.u16  q12, d28, #16   \\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n\n                    \"vmla.f32   q10, %q8, q12   \\n\"\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q11, %q8, q14   \\n\"\n\n                    \"vld1.u16   {d25}, [%1]     \\n\" // r04\n\n                    \"vmla.f32   q10, %q9, q13   \\n\"\n                    \"vmla.f32   q11, %q9, q15   \\n\"\n\n                    \"vshll.u16  q12, d25, #16   \\n\"\n\n                    \"vmla.f32   q10, %q10, q14  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%2 :64]! \\n\" // r10 r11 r12 r13\n\n                    \"vmla.f32   q11, %q10, q12  \\n\"\n\n                    \"vshll.u16  q12, d28, #16   \\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n\n                    \"vmla.f32   q10, %q11, q12  \\n\"\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q11, %q11, q14  \\n\"\n\n                    \"vld1.u16   {d25}, [%2]     \\n\" // r14\n\n                    \"vmla.f32   q10, %q12, q13  \\n\"\n                    \"vmla.f32   q11, %q12, q15  \\n\"\n\n                    \"vshll.u16  q12, d25, #16   \\n\"\n\n                    \"vmla.f32   q10, %q13, q14  \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%3 :64]! 
\\n\" // r20 r21 r22 r23\n\n                    \"vmla.f32   q11, %q13, q12  \\n\"\n\n                    \"vshll.u16  q12, d28, #16   \\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n\n                    \"vmla.f32   q10, %q14, q12  \\n\"\n\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmla.f32   q11, %q14, q14  \\n\"\n\n                    \"vld1.u16   {d25}, [%3]     \\n\" // r24\n\n                    \"vmla.f32   q10, %q15, q13  \\n\"\n                    \"vmla.f32   q11, %q15, q15  \\n\"\n\n                    \"vshll.u16  q12, d25, #16   \\n\"\n\n                    \"vmla.f32   q10, %q16, q14  \\n\"\n                    \"vmla.f32   q11, %q16, q12  \\n\"\n\n                    \"vshrn.u32  d20, q10, #16   \\n\"\n                    \"vshrn.u32  d21, q11, #16   \\n\"\n\n                    \"vst1.u16   {d20-d21}, [%0 :64]! \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _sum0 = _bias0;\n\n                float32x4_t _r00 = bfloat2float(vld1_u16(r0));\n                float32x4_t _r01 = 
bfloat2float(vld1_u16(r0 + 4));\n                float32x4_t _r02 = bfloat2float(vld1_u16(r0 + 8));\n                float32x4_t _r10 = bfloat2float(vld1_u16(r1));\n                float32x4_t _r11 = bfloat2float(vld1_u16(r1 + 4));\n                float32x4_t _r12 = bfloat2float(vld1_u16(r1 + 8));\n                float32x4_t _r20 = bfloat2float(vld1_u16(r2));\n                float32x4_t _r21 = bfloat2float(vld1_u16(r2 + 4));\n                float32x4_t _r22 = bfloat2float(vld1_u16(r2 + 8));\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n\n                vst1_u16(outptr0, float2bfloat(_sum0));\n\n                r0 += 2 * 4;\n                r1 += 2 * 4;\n                r2 += 2 * 4;\n                outptr0 += 4;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_3x3_pack8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + g * 8) : vdupq_n_f16((__fp16)0.f);\n\n        const __fp16* k0 = kernel.row<const __fp16>(g);\n\n        __fp16* outptr0 = out.row<__fp16>(0);\n        __fp16* outptr1 = out.row<__fp16>(1);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const __fp16* r0 = img0.row<const __fp16>(0);\n        const __fp16* r1 = img0.row<const __fp16>(1);\n        const __fp16* r2 = img0.row<const __fp16>(2);\n        const __fp16* r3 = img0.row<const __fp16>(3);\n\n        float16x8_t _k00 = vld1q_f16(k0);\n        float16x8_t _k01 = vld1q_f16(k0 + 8);\n        float16x8_t _k02 = vld1q_f16(k0 + 16);\n        float16x8_t _k10 = vld1q_f16(k0 + 24);\n        float16x8_t _k11 = vld1q_f16(k0 + 32);\n        float16x8_t _k12 = vld1q_f16(k0 + 40);\n        float16x8_t _k20 = vld1q_f16(k0 + 48);\n        float16x8_t _k21 = vld1q_f16(k0 + 56);\n        float16x8_t _k22 = vld1q_f16(k0 + 64);\n\n        int i = 0;\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%3], #64 \\n\" // r10 r11 r12 r13\n\n                    \"mov    v24.16b, %21.16b            \\n\" // sum00\n                    \"mov    v25.16b, %21.16b            \\n\" // sum01\n                    
\"mov    v26.16b, %21.16b            \\n\" // sum02\n                    \"mov    v27.16b, %21.16b            \\n\" // sum03\n\n                    \"fmla   v24.8h, %15.8h, v12.8h      \\n\"\n                    \"fmla   v25.8h, %15.8h, v13.8h      \\n\"\n\n                    \"mov    v28.16b, %21.16b            \\n\" // sum10\n                    \"mov    v29.16b, %21.16b            \\n\" // sum11\n                    \"mov    v30.16b, %21.16b            \\n\" // sum12\n                    \"mov    v31.16b, %21.16b            \\n\" // sum13\n\n                    \"fmla   v26.8h, %15.8h, v14.8h      \\n\"\n                    \"fmla   v27.8h, %15.8h, v15.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%3]      \\n\" // r14 r15\n\n                    \"fmla   v28.8h, %12.8h, v12.8h      \\n\"\n                    \"fmla   v29.8h, %12.8h, v13.8h      \\n\"\n                    \"fmla   v30.8h, %12.8h, v14.8h      \\n\"\n                    \"fmla   v31.8h, %12.8h, v15.8h      \\n\"\n\n                    \"fmla   v24.8h, %16.8h, v13.8h      \\n\"\n                    \"fmla   v25.8h, %16.8h, v14.8h      \\n\"\n                    \"fmla   v26.8h, %16.8h, v15.8h      \\n\"\n                    \"fmla   v27.8h, %16.8h, v16.8h      \\n\"\n\n                    \"fmla   v28.8h, %13.8h, v13.8h      \\n\"\n                    \"fmla   v29.8h, %13.8h, v14.8h      \\n\"\n                    \"fmla   v30.8h, %13.8h, v15.8h      \\n\"\n                    \"fmla   v31.8h, %13.8h, v16.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%4], #64 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v24.8h, %17.8h, v14.8h      \\n\"\n                    \"fmla   v25.8h, %17.8h, v15.8h      \\n\"\n                    \"fmla   v26.8h, %17.8h, v16.8h      \\n\"\n                    \"fmla   v27.8h, %17.8h, 
v17.8h      \\n\"\n\n                    \"fmla   v28.8h, %14.8h, v14.8h      \\n\"\n                    \"fmla   v29.8h, %14.8h, v15.8h      \\n\"\n                    \"fmla   v30.8h, %14.8h, v16.8h      \\n\"\n                    \"fmla   v31.8h, %14.8h, v17.8h      \\n\"\n\n                    \"fmla   v24.8h, %18.8h, v18.8h      \\n\"\n                    \"fmla   v25.8h, %18.8h, v19.8h      \\n\"\n                    \"fmla   v26.8h, %18.8h, v20.8h      \\n\"\n                    \"fmla   v27.8h, %18.8h, v21.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%4]      \\n\" // r24 r25\n\n                    \"fmla   v28.8h, %15.8h, v18.8h      \\n\"\n                    \"fmla   v29.8h, %15.8h, v19.8h      \\n\"\n                    \"fmla   v30.8h, %15.8h, v20.8h      \\n\"\n                    \"fmla   v31.8h, %15.8h, v21.8h      \\n\"\n\n                    \"fmla   v24.8h, %19.8h, v19.8h      \\n\"\n                    \"fmla   v25.8h, %19.8h, v20.8h      \\n\"\n                    \"fmla   v26.8h, %19.8h, v21.8h      \\n\"\n                    \"fmla   v27.8h, %19.8h, v22.8h      \\n\"\n\n                    \"fmla   v28.8h, %16.8h, v19.8h      \\n\"\n                    \"fmla   v29.8h, %16.8h, v20.8h      \\n\"\n                    \"fmla   v30.8h, %16.8h, v21.8h      \\n\"\n                    \"fmla   v31.8h, %16.8h, v22.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%2], #64 \\n\" // r00 r01 r02 r03\n\n                    \"fmla   v24.8h, %20.8h, v20.8h      \\n\"\n                    \"fmla   v25.8h, %20.8h, v21.8h      \\n\"\n                    \"fmla   v26.8h, %20.8h, v22.8h      \\n\"\n                    \"fmla   v27.8h, %20.8h, v23.8h      \\n\"\n\n                    \"fmla   v28.8h, %17.8h, v20.8h      \\n\"\n                    \"fmla   v29.8h, %17.8h, v21.8h      
\\n\"\n                    \"fmla   v30.8h, %17.8h, v22.8h      \\n\"\n                    \"fmla   v31.8h, %17.8h, v23.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%5], #64 \\n\" // r30 r31 r32 r33\n\n                    \"fmla   v24.8h, %12.8h, v12.8h      \\n\"\n                    \"fmla   v25.8h, %12.8h, v13.8h      \\n\"\n                    \"fmla   v26.8h, %12.8h, v14.8h      \\n\"\n                    \"fmla   v27.8h, %12.8h, v15.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%2]      \\n\" // r04 r05\n\n                    \"fmla   v28.8h, %18.8h, v18.8h      \\n\"\n                    \"fmla   v29.8h, %18.8h, v19.8h      \\n\"\n                    \"fmla   v30.8h, %18.8h, v20.8h      \\n\"\n                    \"fmla   v31.8h, %18.8h, v21.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%5]      \\n\" // r34 r35\n\n                    \"fmla   v24.8h, %13.8h, v13.8h      \\n\"\n                    \"fmla   v25.8h, %13.8h, v14.8h      \\n\"\n                    \"fmla   v26.8h, %13.8h, v15.8h      \\n\"\n                    \"fmla   v27.8h, %13.8h, v16.8h      \\n\"\n\n                    \"fmla   v28.8h, %19.8h, v19.8h      \\n\"\n                    \"fmla   v29.8h, %19.8h, v20.8h      \\n\"\n                    \"fmla   v30.8h, %19.8h, v21.8h      \\n\"\n                    \"fmla   v31.8h, %19.8h, v22.8h      \\n\"\n\n                    \"fmla   v24.8h, %14.8h, v14.8h      \\n\"\n                    \"fmla   v25.8h, %14.8h, v15.8h      \\n\"\n                    \"fmla   v26.8h, %14.8h, v16.8h      \\n\"\n                    \"fmla   v27.8h, %14.8h, v17.8h      \\n\"\n\n                    \"fmla   v28.8h, %20.8h, v20.8h      \\n\"\n                    \"fmla   v29.8h, %20.8h, v21.8h      
\\n\"\n                    \"fmla   v30.8h, %20.8h, v22.8h      \\n\"\n                    \"fmla   v31.8h, %20.8h, v23.8h      \\n\"\n\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%1], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%3] \\n\" // r10 r11 r12 r13\n\n                    \"mov    v28.16b, %21.16b            \\n\" // sum00\n                    \"mov    v29.16b, %21.16b            \\n\" // sum01\n                    \"mov    v30.16b, %21.16b            \\n\" // sum10\n                    \"mov    v31.16b, %21.16b            \\n\" // sum11\n\n                    \"fmla   v28.8h, 
%15.8h, v16.8h      \\n\"\n                    \"fmla   v30.8h, %12.8h, v16.8h      \\n\"\n                    \"fmla   v29.8h, %15.8h, v17.8h      \\n\"\n                    \"fmla   v31.8h, %12.8h, v17.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v28.8h, %16.8h, v17.8h      \\n\"\n                    \"fmla   v30.8h, %13.8h, v17.8h      \\n\"\n                    \"fmla   v29.8h, %16.8h, v18.8h      \\n\"\n                    \"fmla   v31.8h, %13.8h, v18.8h      \\n\"\n\n                    \"fmla   v28.8h, %17.8h, v18.8h      \\n\"\n                    \"fmla   v30.8h, %14.8h, v18.8h      \\n\"\n                    \"fmla   v29.8h, %17.8h, v19.8h      \\n\"\n                    \"fmla   v31.8h, %14.8h, v19.8h      \\n\"\n\n                    \"fmla   v28.8h, %18.8h, v20.8h      \\n\"\n                    \"fmla   v30.8h, %15.8h, v20.8h      \\n\"\n                    \"fmla   v29.8h, %18.8h, v21.8h      \\n\"\n                    \"fmla   v31.8h, %15.8h, v21.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%2] \\n\" // r00 r01 r02 r03\n\n                    \"fmla   v28.8h, %19.8h, v21.8h      \\n\"\n                    \"fmla   v30.8h, %16.8h, v21.8h      \\n\"\n                    \"fmla   v29.8h, %19.8h, v22.8h      \\n\"\n                    \"fmla   v31.8h, %16.8h, v22.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%5] \\n\" // r30 r31 r32 r33\n\n                    \"fmla   v28.8h, %20.8h, v22.8h      \\n\"\n                    \"fmla   v30.8h, %17.8h, v22.8h      \\n\"\n                    \"fmla   v29.8h, %20.8h, v23.8h      \\n\"\n                    \"fmla   v31.8h, %17.8h, v23.8h      \\n\"\n\n         
           \"fmla   v28.8h, %12.8h, v12.8h      \\n\"\n                    \"fmla   v30.8h, %18.8h, v24.8h      \\n\"\n                    \"fmla   v29.8h, %12.8h, v13.8h      \\n\"\n                    \"fmla   v31.8h, %18.8h, v25.8h      \\n\"\n                    \"fmla   v28.8h, %13.8h, v13.8h      \\n\"\n                    \"fmla   v30.8h, %19.8h, v25.8h      \\n\"\n                    \"fmla   v29.8h, %13.8h, v14.8h      \\n\"\n                    \"fmla   v31.8h, %19.8h, v26.8h      \\n\"\n                    \"fmla   v28.8h, %14.8h, v14.8h      \\n\"\n                    \"fmla   v30.8h, %20.8h, v26.8h      \\n\"\n                    \"fmla   v29.8h, %14.8h, v15.8h      \\n\"\n                    \"fmla   v31.8h, %20.8h, v27.8h      \\n\"\n\n                    \"add    %2, %2, #32                 \\n\"\n                    \"add    %3, %3, #32                 \\n\"\n                    \"add    %4, %4, #32                 \\n\"\n                    \"add    %5, %5, #32                 \\n\"\n\n                    \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n                    \"st1    {v30.8h, v31.8h}, [%1], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n   
                 \"w\"(_bias0) // %21\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #384]       \\n\"\n                    \"ld1    {v15.8h, v16.8h, v17.8h}, [%3] \\n\" // r10 r11 r12\n\n                    \"mov    v28.16b, %21.16b            \\n\" // sum00\n                    \"mov    v30.16b, %21.16b            \\n\" // sum10\n\n                    \"fmul   v29.8h, %15.8h, v15.8h      \\n\"\n                    \"fmul   v31.8h, %12.8h, v15.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #384]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h}, [%4] \\n\" // r20 r21 r22\n\n                    \"fmla   v28.8h, %16.8h, v16.8h      \\n\"\n                    \"fmla   v30.8h, %13.8h, v16.8h      \\n\"\n\n                    \"fmla   v29.8h, %17.8h, v17.8h      \\n\"\n                    \"fmla   v31.8h, %14.8h, v17.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h}, [%2] \\n\" // r00 r01 r02\n\n                    \"fmla   v28.8h, %18.8h, v18.8h      \\n\"\n                    \"fmla   v30.8h, %15.8h, v18.8h      \\n\"\n\n                    \"fmla   v29.8h, %19.8h, v19.8h      \\n\"\n                    \"fmla   v31.8h, %16.8h, v19.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #384]       \\n\"\n                    \"ld1    {v21.8h, v22.8h, v23.8h}, [%5] \\n\" // r30 r31 r32\n\n                    \"fmla   v28.8h, %20.8h, v20.8h      \\n\"\n                    \"fmla   v30.8h, %17.8h, v20.8h      \\n\"\n\n                    \"fmla   v29.8h, %12.8h, v12.8h      \\n\"\n                    \"fmla   v31.8h, %18.8h, v21.8h      \\n\"\n 
                   \"fmla   v28.8h, %13.8h, v13.8h      \\n\"\n                    \"fmla   v30.8h, %19.8h, v22.8h      \\n\"\n                    \"fmla   v29.8h, %14.8h, v14.8h      \\n\"\n                    \"fmla   v31.8h, %20.8h, v23.8h      \\n\"\n\n                    \"add    %2, %2, #16                 \\n\"\n                    \"add    %3, %3, #16                 \\n\"\n\n                    \"fadd   v28.8h, v28.8h, v29.8h      \\n\"\n                    \"fadd   v30.8h, v30.8h, v31.8h      \\n\"\n\n                    \"add    %4, %4, #16                 \\n\"\n                    \"add    %5, %5, #16                 \\n\"\n\n                    \"st1    {v28.8h}, [%0], #16         \\n\"\n                    \"st1    {v30.8h}, [%1], #16         \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"w\"(_k00),  // %12\n                    \"w\"(_k01),  // %13\n                    \"w\"(_k02),  // %14\n                    \"w\"(_k10),  // %15\n                    \"w\"(_k11),  // %16\n                    \"w\"(_k12),  // %17\n                    \"w\"(_k20),  // %18\n                    \"w\"(_k21),  // %19\n                    \"w\"(_k22),  // %20\n                    \"w\"(_bias0) // %21\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n\n            r0 += 2 * 8 + w * 8;\n            r1 += 2 * 8 + w * 8;\n            r2 += 2 * 8 + w * 8;\n            r3 += 2 * 8 + w * 
8;\n\n            outptr0 += outw * 8;\n            outptr1 += outw * 8;\n        }\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n                    \"mov    v29.16b, %17.16b            \\n\" // sum01\n                    \"mov    v30.16b, %17.16b            \\n\" // sum02\n                    \"mov    v31.16b, %17.16b            \\n\" // sum03\n\n                    \"fmla   v28.8h, %8.8h, v12.8h       \\n\"\n                    \"fmla   v29.8h, %8.8h, v13.8h       \\n\"\n                    \"fmla   v30.8h, %8.8h, v14.8h       \\n\"\n                    \"fmla   v31.8h, %8.8h, v15.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%1]      \\n\" // r04 r05\n\n                    \"fmla   v28.8h, %9.8h, v13.8h       \\n\"\n                    \"fmla   v29.8h, %9.8h, v14.8h       \\n\"\n                    \"fmla   v30.8h, %9.8h, v15.8h       \\n\"\n                    \"fmla   v31.8h, %9.8h, v16.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v28.8h, %10.8h, v14.8h      \\n\"\n                    \"fmla   v29.8h, %10.8h, v15.8h      \\n\"\n                    \"fmla   v30.8h, %10.8h, v16.8h      \\n\"\n                    \"fmla   v31.8h, %10.8h, v17.8h      \\n\"\n\n                    \"fmla   v28.8h, %11.8h, v18.8h      \\n\"\n                    \"fmla   v29.8h, %11.8h, v19.8h      \\n\"\n                    \"fmla   v30.8h, %11.8h, v20.8h      \\n\"\n                    \"fmla   
v31.8h, %11.8h, v21.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%2]      \\n\" // r14 r15\n\n                    \"fmla   v28.8h, %12.8h, v19.8h      \\n\"\n                    \"fmla   v29.8h, %12.8h, v20.8h      \\n\"\n                    \"fmla   v30.8h, %12.8h, v21.8h      \\n\"\n                    \"fmla   v31.8h, %12.8h, v22.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v28.8h, %13.8h, v20.8h      \\n\"\n                    \"fmla   v29.8h, %13.8h, v21.8h      \\n\"\n                    \"fmla   v30.8h, %13.8h, v22.8h      \\n\"\n                    \"fmla   v31.8h, %13.8h, v23.8h      \\n\"\n\n                    \"fmla   v28.8h, %14.8h, v12.8h      \\n\"\n                    \"fmla   v29.8h, %14.8h, v13.8h      \\n\"\n                    \"fmla   v30.8h, %14.8h, v14.8h      \\n\"\n                    \"fmla   v31.8h, %14.8h, v15.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%3]      \\n\" // r24 r25\n\n                    \"fmla   v28.8h, %15.8h, v13.8h      \\n\"\n                    \"fmla   v29.8h, %15.8h, v14.8h      \\n\"\n                    \"fmla   v30.8h, %15.8h, v15.8h      \\n\"\n                    \"fmla   v31.8h, %15.8h, v16.8h      \\n\"\n\n                    \"fmla   v28.8h, %16.8h, v14.8h      \\n\"\n                    \"fmla   v29.8h, %16.8h, v15.8h      \\n\"\n                    \"fmla   v30.8h, %16.8h, v16.8h      \\n\"\n                    \"fmla   v31.8h, %16.8h, v17.8h      \\n\"\n\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n          
          \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1] \\n\" // r00 r01 r02 r03\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n                    \"mov    v29.16b, %17.16b            \\n\" // sum01\n\n                    \"fmul   v30.8h, %8.8h, v12.8h       \\n\"\n                    \"fmul   v31.8h, %8.8h, v13.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%2] \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v28.8h, %9.8h, v13.8h       \\n\"\n                    \"fmla   v29.8h, %9.8h, v14.8h       \\n\"\n                    \"fmla   v30.8h, %10.8h, v14.8h      \\n\"\n                    \"fmla   v31.8h, %10.8h, v15.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%3] \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v28.8h, %11.8h, v16.8h      \\n\"\n                    \"fmla   v29.8h, %11.8h, 
v17.8h      \\n\"\n                    \"fmla   v30.8h, %12.8h, v17.8h      \\n\"\n                    \"fmla   v31.8h, %12.8h, v18.8h      \\n\"\n                    \"fmla   v28.8h, %13.8h, v18.8h      \\n\"\n                    \"fmla   v29.8h, %13.8h, v19.8h      \\n\"\n\n                    \"fmla   v30.8h, %14.8h, v20.8h      \\n\"\n                    \"fmla   v31.8h, %14.8h, v21.8h      \\n\"\n                    \"fmla   v28.8h, %15.8h, v21.8h      \\n\"\n                    \"fmla   v29.8h, %15.8h, v22.8h      \\n\"\n                    \"fmla   v30.8h, %16.8h, v22.8h      \\n\"\n                    \"fmla   v31.8h, %16.8h, v23.8h      \\n\"\n\n                    \"add    %1, %1, #32                 \\n\"\n\n                    \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n                    \"fadd   v29.8h, v29.8h, v31.8h      \\n\"\n\n                    \"add    %2, %2, #32                 \\n\"\n                    \"add    %3, %3, #32                 \\n\"\n\n                    \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; 
j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #384]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h}, [%1] \\n\" // r00 r01 r02\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n\n                    \"fmul   v29.8h, %8.8h, v12.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"ld1    {v15.8h, v16.8h, v17.8h}, [%2] \\n\" // r10 r11 r12\n\n                    \"fmul   v30.8h, %9.8h, v13.8h       \\n\"\n                    \"fmla   v28.8h, %10.8h, v14.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #384]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h}, [%3] \\n\" // r20 r21 r22\n\n                    \"fmla   v29.8h, %11.8h, v15.8h      \\n\"\n                    \"fmla   v30.8h, %12.8h, v16.8h      \\n\"\n                    \"fmla   v28.8h, %13.8h, v17.8h      \\n\"\n\n                    \"fmla   v29.8h, %14.8h, v18.8h      \\n\"\n                    \"fmla   v30.8h, %15.8h, v19.8h      \\n\"\n                    \"fmla   v28.8h, %16.8h, v20.8h      \\n\"\n\n                    \"add    %1, %1, #16                 \\n\"\n\n                    \"fadd   v29.8h, v29.8h, v30.8h      \\n\"\n                    \"fadd   v28.8h, v28.8h, v29.8h      \\n\"\n\n                    \"add    %2, %2, #16                 \\n\"\n                    \"add    %3, %3, #16                 \\n\"\n\n                    \"st1    {v28.8h}, [%0], #16         \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n      
              \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v28\", \"v29\", \"v30\");\n            }\n\n            r0 += 2 * 8;\n            r1 += 2 * 8;\n            r2 += 2 * 8;\n        }\n    }\n}\n\nstatic void convdw3x3s2_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 8;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float16x8_t _bias0 = bias ? 
vld1q_f16(bias + g * 8) : vdupq_n_f16((__fp16)0.f);\n\n        const __fp16* k0 = kernel.row<const __fp16>(g);\n\n        __fp16* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const __fp16* r0 = img0.row<const __fp16>(0);\n        const __fp16* r1 = img0.row<const __fp16>(1);\n        const __fp16* r2 = img0.row<const __fp16>(2);\n\n        float16x8_t _k00 = vld1q_f16(k0);\n        float16x8_t _k01 = vld1q_f16(k0 + 8);\n        float16x8_t _k02 = vld1q_f16(k0 + 16);\n        float16x8_t _k10 = vld1q_f16(k0 + 24);\n        float16x8_t _k11 = vld1q_f16(k0 + 32);\n        float16x8_t _k12 = vld1q_f16(k0 + 40);\n        float16x8_t _k20 = vld1q_f16(k0 + 48);\n        float16x8_t _k21 = vld1q_f16(k0 + 56);\n        float16x8_t _k22 = vld1q_f16(k0 + 64);\n\n        int i = 0;\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n                    \"mov    v29.16b, %17.16b            \\n\" // sum01\n                    \"mov    v30.16b, %17.16b            \\n\" // sum02\n                    \"mov    v31.16b, %17.16b            \\n\" // sum03\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%1], #64 \\n\" // r04 r05 r06 r07\n\n                    \"fmla   v28.8h, %8.8h, v0.8h        \\n\"\n                    \"fmla   v29.8h, %8.8h, v2.8h        \\n\"\n                    \"fmla   v30.8h, %8.8h, v4.8h        \\n\"\n                    \"fmla   v31.8h, %8.8h, v6.8h        \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v8.8h}, [%1]               \\n\" // r08\n\n                    
\"fmla   v28.8h, %9.8h, v1.8h        \\n\"\n                    \"fmla   v29.8h, %9.8h, v3.8h        \\n\"\n                    \"fmla   v30.8h, %9.8h, v5.8h        \\n\"\n                    \"fmla   v31.8h, %9.8h, v7.8h        \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v28.8h, %10.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, %10.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, %10.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, %10.8h, v8.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%2], #64 \\n\" // r14 r15 r16 r17\n\n                    \"fmla   v28.8h, %11.8h, v16.8h      \\n\"\n                    \"fmla   v29.8h, %11.8h, v18.8h      \\n\"\n                    \"fmla   v30.8h, %11.8h, v20.8h      \\n\"\n                    \"fmla   v31.8h, %11.8h, v22.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v24.8h}, [%2]              \\n\" // r18\n\n                    \"fmla   v28.8h, %12.8h, v17.8h      \\n\"\n                    \"fmla   v29.8h, %12.8h, v19.8h      \\n\"\n                    \"fmla   v30.8h, %12.8h, v21.8h      \\n\"\n                    \"fmla   v31.8h, %12.8h, v23.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v28.8h, %13.8h, v18.8h      \\n\"\n                    \"fmla   v29.8h, %13.8h, v20.8h      \\n\"\n                    \"fmla   v30.8h, %13.8h, v22.8h      \\n\"\n                    \"fmla   v31.8h, %13.8h, v24.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v4.8h, 
v5.8h, v6.8h, v7.8h}, [%3], #64 \\n\" // r24 r25 r26 r27\n\n                    \"fmla   v28.8h, %14.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, %14.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, %14.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, %14.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v8.8h}, [%3]               \\n\" // r28\n\n                    \"fmla   v28.8h, %15.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, %15.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, %15.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, %15.8h, v7.8h       \\n\"\n\n                    \"fmla   v28.8h, %16.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, %16.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, %16.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, %16.8h, v8.8h       \\n\"\n\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v28\", \"v29\", \"v30\", 
\"v31\");\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\" // r00 r01 r02 r03\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n                    \"mov    v29.16b, %17.16b            \\n\" // sum01\n\n                    \"fmul   v30.8h, %8.8h, v12.8h       \\n\"\n                    \"fmul   v31.8h, %8.8h, v14.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v16.8h}, [%1]              \\n\" // r04\n\n                    \"fmla   v28.8h, %9.8h, v13.8h       \\n\"\n                    \"fmla   v29.8h, %9.8h, v15.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v17.8h, v18.8h, v19.8h, v20.8h}, [%2], #64 \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v30.8h, %10.8h, v14.8h      \\n\"\n                    \"fmla   v31.8h, %10.8h, v16.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v21.8h}, [%1]              \\n\" // r14\n\n                    \"fmla   v28.8h, %11.8h, v17.8h      \\n\"\n                    \"fmla   v29.8h, %11.8h, v19.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v22.8h, v23.8h, v24.8h, v25.8h}, [%3], #64 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v30.8h, %12.8h, v18.8h      \\n\"\n                    \"fmla   v31.8h, %12.8h, v20.8h      \\n\"\n\n                    \"fmla   v28.8h, %13.8h, v19.8h      \\n\"\n                    \"fmla   v29.8h, %13.8h, v21.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v26.8h}, [%1]              \\n\" // r24\n\n                    \"fmla   v30.8h, %14.8h, v22.8h 
     \\n\"\n                    \"fmla   v31.8h, %14.8h, v24.8h      \\n\"\n\n                    \"fmla   v28.8h, %15.8h, v23.8h      \\n\"\n                    \"fmla   v29.8h, %15.8h, v25.8h      \\n\"\n                    \"fmla   v30.8h, %16.8h, v24.8h      \\n\"\n                    \"fmla   v31.8h, %16.8h, v26.8h      \\n\"\n\n                    \"fadd   v28.8h, v28.8h, v30.8h      \\n\"\n                    \"fadd   v29.8h, v29.8h, v31.8h      \\n\"\n\n                    \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #384]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h}, [%1] \\n\" // r00 r01 r02\n\n                    \"mov    v28.16b, %17.16b            \\n\" // sum00\n\n                    \"fmul   v29.8h, %8.8h, v12.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #384]       \\n\"\n                    \"ld1    {v15.8h, v16.8h, v17.8h}, [%2] \\n\" // r10 r11 
r12\n\n                    \"fmul   v30.8h, %9.8h, v13.8h       \\n\"\n                    \"fmla   v28.8h, %10.8h, v14.8h      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #384]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h}, [%3] \\n\" // r20 r21 r22\n\n                    \"fmla   v29.8h, %11.8h, v15.8h      \\n\"\n                    \"fmla   v30.8h, %12.8h, v16.8h      \\n\"\n                    \"fmla   v28.8h, %13.8h, v17.8h      \\n\"\n\n                    \"fmla   v29.8h, %14.8h, v18.8h      \\n\"\n                    \"fmla   v30.8h, %15.8h, v19.8h      \\n\"\n                    \"fmla   v28.8h, %16.8h, v20.8h      \\n\"\n\n                    \"add    %1, %1, #32                 \\n\"\n\n                    \"fadd   v29.8h, v29.8h, v30.8h      \\n\"\n                    \"fadd   v28.8h, v28.8h, v29.8h      \\n\"\n\n                    \"add    %2, %2, #32                 \\n\"\n                    \"add    %3, %3, #32                 \\n\"\n\n                    \"st1    {v28.8h}, [%0], #16         \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2)       // %3\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"w\"(_k00),  // %8\n                    \"w\"(_k01),  // %9\n                    \"w\"(_k02),  // %10\n                    \"w\"(_k10),  // %11\n                    \"w\"(_k11),  // %12\n                    \"w\"(_k12),  // %13\n                    \"w\"(_k20),  // %14\n                    \"w\"(_k21),  // %15\n                    \"w\"(_k22),  // %16\n                    \"w\"(_bias0) // %17\n                    : \"memory\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v28\", \"v29\", \"v30\");\n            }\n\n            r0 += tailstep;\n            
r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_3x3_pack8_int8.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_pack8_int8_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const signed char* k0 = kernel.row<const signed char>(g);\n\n        int* outptr0 = out.row<int>(0);\n        int* outptr1 = out.row<int>(1);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const signed char* r0 = img0.row<const signed char>(0);\n        const signed char* r1 = img0.row<const signed char>(1);\n        const signed char* r2 = img0.row<const signed char>(2);\n        const signed char* r3 = img0.row<const signed char>(3);\n\n        int8x8_t _k00 = vld1_s8(k0);\n        int8x8_t _k01 = vld1_s8(k0 + 8);\n        int8x8_t _k02 = vld1_s8(k0 + 16);\n        int8x8_t _k10 = vld1_s8(k0 + 24);\n        int8x8_t _k11 = vld1_s8(k0 + 32);\n        int8x8_t _k12 = vld1_s8(k0 + 40);\n        int8x8_t _k20 = vld1_s8(k0 + 48);\n        int8x8_t _k21 = vld1_s8(k0 + 56);\n        int8x8_t _k22 = vld1_s8(k0 + 64);\n\n        int i = 0;\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n            for (; j + 1 < outw; j += 2)\n            {\n                int8x16_t _r0001 = vld1q_s8(r0);\n                int8x16_t _r0203 = vld1q_s8(r0 + 16);\n                int8x16_t _r1011 = vld1q_s8(r1);\n                int8x16_t _r1213 = vld1q_s8(r1 + 16);\n                int8x16_t _r2021 = vld1q_s8(r2);\n                int8x16_t _r2223 = vld1q_s8(r2 + 16);\n                int8x16_t _r3031 = vld1q_s8(r3);\n                int8x16_t _r3233 = vld1q_s8(r3 + 16);\n\n                int16x8_t _s00 = vmull_s8(vget_low_s8(_r0001), _k00);\n                int16x8_t 
_s01 = vmull_s8(vget_high_s8(_r0001), _k01);\n                int16x8_t _s02 = vmull_s8(vget_low_s8(_r0203), _k02);\n                int16x8_t _s03 = vmull_s8(vget_low_s8(_r1011), _k10);\n                int16x8_t _s10 = vmull_s8(vget_high_s8(_r0001), _k00);\n                int16x8_t _s11 = vmull_s8(vget_low_s8(_r0203), _k01);\n                int16x8_t _s12 = vmull_s8(vget_high_s8(_r0203), _k02);\n                int16x8_t _s13 = vmull_s8(vget_high_s8(_r1011), _k10);\n\n                int16x8_t _s20 = vmull_s8(vget_low_s8(_r1011), _k00);\n                int16x8_t _s21 = vmull_s8(vget_high_s8(_r1011), _k01);\n                int16x8_t _s22 = vmull_s8(vget_low_s8(_r1213), _k02);\n                int16x8_t _s23 = vmull_s8(vget_low_s8(_r2021), _k10);\n                int16x8_t _s30 = vmull_s8(vget_high_s8(_r1011), _k00);\n                int16x8_t _s31 = vmull_s8(vget_low_s8(_r1213), _k01);\n                int16x8_t _s32 = vmull_s8(vget_high_s8(_r1213), _k02);\n                int16x8_t _s33 = vmull_s8(vget_high_s8(_r2021), _k10);\n\n                _s00 = vmlal_s8(_s00, vget_high_s8(_r1011), _k11);\n                _s01 = vmlal_s8(_s01, vget_low_s8(_r1213), _k12);\n                _s02 = vmlal_s8(_s02, vget_low_s8(_r2021), _k20);\n                _s03 = vmlal_s8(_s03, vget_high_s8(_r2021), _k21);\n                _s10 = vmlal_s8(_s10, vget_low_s8(_r1213), _k11);\n                _s11 = vmlal_s8(_s11, vget_high_s8(_r1213), _k12);\n                _s12 = vmlal_s8(_s12, vget_high_s8(_r2021), _k20);\n                _s13 = vmlal_s8(_s13, vget_low_s8(_r2223), _k21);\n\n                _s20 = vmlal_s8(_s20, vget_high_s8(_r2021), _k11);\n                _s21 = vmlal_s8(_s21, vget_low_s8(_r2223), _k12);\n                _s22 = vmlal_s8(_s22, vget_low_s8(_r3031), _k20);\n                _s23 = vmlal_s8(_s23, vget_high_s8(_r3031), _k21);\n                _s30 = vmlal_s8(_s30, vget_low_s8(_r2223), _k11);\n                _s31 = vmlal_s8(_s31, vget_high_s8(_r2223), _k12);\n  
              _s32 = vmlal_s8(_s32, vget_high_s8(_r3031), _k20);\n                _s33 = vmlal_s8(_s33, vget_low_s8(_r3233), _k21);\n\n                int16x8_t _s08 = vmull_s8(vget_low_s8(_r2223), _k22);\n                int16x8_t _s18 = vmull_s8(vget_high_s8(_r2223), _k22);\n                int16x8_t _s28 = vmull_s8(vget_low_s8(_r3233), _k22);\n                int16x8_t _s38 = vmull_s8(vget_high_s8(_r3233), _k22);\n\n                int32x4_t _sum00 = vaddl_s16(vget_low_s16(_s00), vget_low_s16(_s01));\n                int32x4_t _sum01 = vaddl_s16(vget_high_s16(_s00), vget_high_s16(_s01));\n                int32x4_t _sum02 = vaddl_s16(vget_low_s16(_s02), vget_low_s16(_s03));\n                int32x4_t _sum03 = vaddl_s16(vget_high_s16(_s02), vget_high_s16(_s03));\n                int32x4_t _sum10 = vaddl_s16(vget_low_s16(_s10), vget_low_s16(_s11));\n                int32x4_t _sum11 = vaddl_s16(vget_high_s16(_s10), vget_high_s16(_s11));\n                int32x4_t _sum12 = vaddl_s16(vget_low_s16(_s12), vget_low_s16(_s13));\n                int32x4_t _sum13 = vaddl_s16(vget_high_s16(_s12), vget_high_s16(_s13));\n                int32x4_t _sum20 = vaddl_s16(vget_low_s16(_s20), vget_low_s16(_s21));\n                int32x4_t _sum21 = vaddl_s16(vget_high_s16(_s20), vget_high_s16(_s21));\n                int32x4_t _sum22 = vaddl_s16(vget_low_s16(_s22), vget_low_s16(_s23));\n                int32x4_t _sum23 = vaddl_s16(vget_high_s16(_s22), vget_high_s16(_s23));\n                int32x4_t _sum30 = vaddl_s16(vget_low_s16(_s30), vget_low_s16(_s31));\n                int32x4_t _sum31 = vaddl_s16(vget_high_s16(_s30), vget_high_s16(_s31));\n                int32x4_t _sum32 = vaddl_s16(vget_low_s16(_s32), vget_low_s16(_s33));\n                int32x4_t _sum33 = vaddl_s16(vget_high_s16(_s32), vget_high_s16(_s33));\n                _sum00 = vaddw_s16(_sum00, vget_low_s16(_s08));\n                _sum01 = vaddw_s16(_sum01, vget_high_s16(_s08));\n                _sum10 = 
vaddw_s16(_sum10, vget_low_s16(_s18));\n                _sum11 = vaddw_s16(_sum11, vget_high_s16(_s18));\n                _sum20 = vaddw_s16(_sum20, vget_low_s16(_s28));\n                _sum21 = vaddw_s16(_sum21, vget_high_s16(_s28));\n                _sum30 = vaddw_s16(_sum30, vget_low_s16(_s38));\n                _sum31 = vaddw_s16(_sum31, vget_high_s16(_s38));\n                _sum00 = vaddq_s32(_sum00, _sum02);\n                _sum01 = vaddq_s32(_sum01, _sum03);\n                _sum10 = vaddq_s32(_sum10, _sum12);\n                _sum11 = vaddq_s32(_sum11, _sum13);\n                _sum20 = vaddq_s32(_sum20, _sum22);\n                _sum21 = vaddq_s32(_sum21, _sum23);\n                _sum30 = vaddq_s32(_sum30, _sum32);\n                _sum31 = vaddq_s32(_sum31, _sum33);\n\n                vst1q_s32(outptr0, _sum00);\n                vst1q_s32(outptr0 + 4, _sum01);\n                vst1q_s32(outptr0 + 8, _sum10);\n                vst1q_s32(outptr0 + 12, _sum11);\n                vst1q_s32(outptr1, _sum20);\n                vst1q_s32(outptr1 + 4, _sum21);\n                vst1q_s32(outptr1 + 8, _sum30);\n                vst1q_s32(outptr1 + 12, _sum31);\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n                r3 += 16;\n                outptr0 += 16;\n                outptr1 += 16;\n            }\n            for (; j < outw; j++)\n            {\n                int8x8_t _r00 = vld1_s8(r0);\n                int8x8_t _r01 = vld1_s8(r0 + 8);\n                int8x8_t _r02 = vld1_s8(r0 + 16);\n                int8x8_t _r10 = vld1_s8(r1);\n                int8x8_t _r11 = vld1_s8(r1 + 8);\n                int8x8_t _r12 = vld1_s8(r1 + 16);\n                int8x8_t _r20 = vld1_s8(r2);\n                int8x8_t _r21 = vld1_s8(r2 + 8);\n                int8x8_t _r22 = vld1_s8(r2 + 16);\n                int8x8_t _r30 = vld1_s8(r3);\n                int8x8_t _r31 = vld1_s8(r3 + 8);\n                int8x8_t _r32 = vld1_s8(r3 
+ 16);\n\n                int16x8_t _s00 = vmull_s8(_r00, _k00);\n                int16x8_t _s01 = vmull_s8(_r01, _k01);\n                int16x8_t _s02 = vmull_s8(_r02, _k02);\n                int16x8_t _s03 = vmull_s8(_r10, _k10);\n                int16x8_t _s10 = vmull_s8(_r10, _k00);\n                int16x8_t _s11 = vmull_s8(_r11, _k01);\n                int16x8_t _s12 = vmull_s8(_r12, _k02);\n                int16x8_t _s13 = vmull_s8(_r20, _k10);\n                _s00 = vmlal_s8(_s00, _r11, _k11);\n                _s01 = vmlal_s8(_s01, _r12, _k12);\n                _s02 = vmlal_s8(_s02, _r20, _k20);\n                _s03 = vmlal_s8(_s03, _r21, _k21);\n                _s10 = vmlal_s8(_s10, _r21, _k11);\n                _s11 = vmlal_s8(_s11, _r22, _k12);\n                _s12 = vmlal_s8(_s12, _r30, _k20);\n                _s13 = vmlal_s8(_s13, _r31, _k21);\n                int16x8_t _s08 = vmull_s8(_r22, _k22);\n                int16x8_t _s18 = vmull_s8(_r32, _k22);\n\n                int32x4_t _sum00 = vaddl_s16(vget_low_s16(_s00), vget_low_s16(_s01));\n                int32x4_t _sum01 = vaddl_s16(vget_high_s16(_s00), vget_high_s16(_s01));\n                int32x4_t _sum02 = vaddl_s16(vget_low_s16(_s02), vget_low_s16(_s03));\n                int32x4_t _sum03 = vaddl_s16(vget_high_s16(_s02), vget_high_s16(_s03));\n                int32x4_t _sum10 = vaddl_s16(vget_low_s16(_s10), vget_low_s16(_s11));\n                int32x4_t _sum11 = vaddl_s16(vget_high_s16(_s10), vget_high_s16(_s11));\n                int32x4_t _sum12 = vaddl_s16(vget_low_s16(_s12), vget_low_s16(_s13));\n                int32x4_t _sum13 = vaddl_s16(vget_high_s16(_s12), vget_high_s16(_s13));\n                _sum00 = vaddw_s16(_sum00, vget_low_s16(_s08));\n                _sum01 = vaddw_s16(_sum01, vget_high_s16(_s08));\n                _sum10 = vaddw_s16(_sum10, vget_low_s16(_s18));\n                _sum11 = vaddw_s16(_sum11, vget_high_s16(_s18));\n                _sum00 = vaddq_s32(_sum00, 
_sum02);\n                _sum01 = vaddq_s32(_sum01, _sum03);\n                _sum10 = vaddq_s32(_sum10, _sum12);\n                _sum11 = vaddq_s32(_sum11, _sum13);\n\n                vst1q_s32(outptr0, _sum00);\n                vst1q_s32(outptr0 + 4, _sum01);\n                vst1q_s32(outptr1, _sum10);\n                vst1q_s32(outptr1 + 4, _sum11);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                r3 += 8;\n                outptr0 += 8;\n                outptr1 += 8;\n            }\n\n            r0 += 2 * 8 + w * 8;\n            r1 += 2 * 8 + w * 8;\n            r2 += 2 * 8 + w * 8;\n            r3 += 2 * 8 + w * 8;\n\n            outptr0 += outw * 8;\n            outptr1 += outw * 8;\n        }\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 1 < outw; j += 2)\n            {\n                int8x16_t _r0001 = vld1q_s8(r0);\n                int8x16_t _r0203 = vld1q_s8(r0 + 16);\n                int8x16_t _r1011 = vld1q_s8(r1);\n                int8x16_t _r1213 = vld1q_s8(r1 + 16);\n                int8x16_t _r2021 = vld1q_s8(r2);\n                int8x16_t _r2223 = vld1q_s8(r2 + 16);\n\n                int16x8_t _s00 = vmull_s8(vget_low_s8(_r0001), _k00);\n                int16x8_t _s01 = vmull_s8(vget_high_s8(_r0001), _k01);\n                int16x8_t _s02 = vmull_s8(vget_low_s8(_r0203), _k02);\n                int16x8_t _s03 = vmull_s8(vget_low_s8(_r1011), _k10);\n                int16x8_t _s10 = vmull_s8(vget_high_s8(_r0001), _k00);\n                int16x8_t _s11 = vmull_s8(vget_low_s8(_r0203), _k01);\n                int16x8_t _s12 = vmull_s8(vget_high_s8(_r0203), _k02);\n                int16x8_t _s13 = vmull_s8(vget_high_s8(_r1011), _k10);\n                _s00 = vmlal_s8(_s00, vget_high_s8(_r1011), _k11);\n                _s01 = vmlal_s8(_s01, vget_low_s8(_r1213), _k12);\n                _s02 = vmlal_s8(_s02, vget_low_s8(_r2021), _k20);\n                _s03 = 
vmlal_s8(_s03, vget_high_s8(_r2021), _k21);\n                _s10 = vmlal_s8(_s10, vget_low_s8(_r1213), _k11);\n                _s11 = vmlal_s8(_s11, vget_high_s8(_r1213), _k12);\n                _s12 = vmlal_s8(_s12, vget_high_s8(_r2021), _k20);\n                _s13 = vmlal_s8(_s13, vget_low_s8(_r2223), _k21);\n                int16x8_t _s08 = vmull_s8(vget_low_s8(_r2223), _k22);\n                int16x8_t _s18 = vmull_s8(vget_high_s8(_r2223), _k22);\n\n                int32x4_t _sum00 = vaddl_s16(vget_low_s16(_s00), vget_low_s16(_s01));\n                int32x4_t _sum01 = vaddl_s16(vget_high_s16(_s00), vget_high_s16(_s01));\n                int32x4_t _sum02 = vaddl_s16(vget_low_s16(_s02), vget_low_s16(_s03));\n                int32x4_t _sum03 = vaddl_s16(vget_high_s16(_s02), vget_high_s16(_s03));\n                int32x4_t _sum10 = vaddl_s16(vget_low_s16(_s10), vget_low_s16(_s11));\n                int32x4_t _sum11 = vaddl_s16(vget_high_s16(_s10), vget_high_s16(_s11));\n                int32x4_t _sum12 = vaddl_s16(vget_low_s16(_s12), vget_low_s16(_s13));\n                int32x4_t _sum13 = vaddl_s16(vget_high_s16(_s12), vget_high_s16(_s13));\n                _sum00 = vaddw_s16(_sum00, vget_low_s16(_s08));\n                _sum01 = vaddw_s16(_sum01, vget_high_s16(_s08));\n                _sum10 = vaddw_s16(_sum10, vget_low_s16(_s18));\n                _sum11 = vaddw_s16(_sum11, vget_high_s16(_s18));\n                _sum00 = vaddq_s32(_sum00, _sum02);\n                _sum01 = vaddq_s32(_sum01, _sum03);\n                _sum10 = vaddq_s32(_sum10, _sum12);\n                _sum11 = vaddq_s32(_sum11, _sum13);\n\n                vst1q_s32(outptr0, _sum00);\n                vst1q_s32(outptr0 + 4, _sum01);\n                vst1q_s32(outptr0 + 8, _sum10);\n                vst1q_s32(outptr0 + 12, _sum11);\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n                outptr0 += 16;\n            }\n            for (; j < outw; j++)\n  
          {\n                int8x8_t _r00 = vld1_s8(r0);\n                int8x8_t _r01 = vld1_s8(r0 + 8);\n                int8x8_t _r02 = vld1_s8(r0 + 16);\n                int8x8_t _r10 = vld1_s8(r1);\n                int8x8_t _r11 = vld1_s8(r1 + 8);\n                int8x8_t _r12 = vld1_s8(r1 + 16);\n                int8x8_t _r20 = vld1_s8(r2);\n                int8x8_t _r21 = vld1_s8(r2 + 8);\n                int8x8_t _r22 = vld1_s8(r2 + 16);\n\n                int16x8_t _s0 = vmull_s8(_r00, _k00);\n                int16x8_t _s1 = vmull_s8(_r01, _k01);\n                int16x8_t _s2 = vmull_s8(_r02, _k02);\n                int16x8_t _s3 = vmull_s8(_r10, _k10);\n                _s0 = vmlal_s8(_s0, _r11, _k11);\n                _s1 = vmlal_s8(_s1, _r12, _k12);\n                _s2 = vmlal_s8(_s2, _r20, _k20);\n                _s3 = vmlal_s8(_s3, _r21, _k21);\n                int16x8_t _s4 = vmull_s8(_r22, _k22);\n\n                int32x4_t _sum0 = vaddl_s16(vget_low_s16(_s0), vget_low_s16(_s1));\n                int32x4_t _sum1 = vaddl_s16(vget_high_s16(_s0), vget_high_s16(_s1));\n                int32x4_t _sum2 = vaddl_s16(vget_low_s16(_s2), vget_low_s16(_s3));\n                int32x4_t _sum3 = vaddl_s16(vget_high_s16(_s2), vget_high_s16(_s3));\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s4));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s4));\n                _sum0 = vaddq_s32(_sum0, _sum2);\n                _sum1 = vaddq_s32(_sum1, _sum3);\n\n                vst1q_s32(outptr0, _sum0);\n                vst1q_s32(outptr0 + 4, _sum1);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                outptr0 += 8;\n            }\n\n            r0 += 2 * 8;\n            r1 += 2 * 8;\n            r2 += 2 * 8;\n        }\n    }\n}\n\nstatic void convdw3x3s2_pack8_int8_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = 
top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 8;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const signed char* k0 = kernel.row<const signed char>(g);\n\n        int* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const signed char* r0 = img0.row<const signed char>(0);\n        const signed char* r1 = img0.row<const signed char>(1);\n        const signed char* r2 = img0.row<const signed char>(2);\n\n        int8x8_t _k00 = vld1_s8(k0);\n        int8x8_t _k01 = vld1_s8(k0 + 8);\n        int8x8_t _k02 = vld1_s8(k0 + 16);\n        int8x8_t _k10 = vld1_s8(k0 + 24);\n        int8x8_t _k11 = vld1_s8(k0 + 32);\n        int8x8_t _k12 = vld1_s8(k0 + 40);\n        int8x8_t _k20 = vld1_s8(k0 + 48);\n        int8x8_t _k21 = vld1_s8(k0 + 56);\n        int8x8_t _k22 = vld1_s8(k0 + 64);\n\n        int i = 0;\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 1 < outw; j += 2)\n            {\n                int8x8_t _r00 = vld1_s8(r0);\n                int8x8_t _r01 = vld1_s8(r0 + 8);\n                int8x8_t _r02 = vld1_s8(r0 + 16);\n                int8x8_t _r03 = vld1_s8(r0 + 24);\n                int8x8_t _r04 = vld1_s8(r0 + 32);\n                int8x8_t _r10 = vld1_s8(r1);\n                int8x8_t _r11 = vld1_s8(r1 + 8);\n                int8x8_t _r12 = vld1_s8(r1 + 16);\n                int8x8_t _r13 = vld1_s8(r1 + 24);\n                int8x8_t _r14 = vld1_s8(r1 + 32);\n                int8x8_t _r20 = vld1_s8(r2);\n                int8x8_t _r21 = vld1_s8(r2 + 8);\n                int8x8_t _r22 = vld1_s8(r2 + 16);\n                int8x8_t _r23 = vld1_s8(r2 + 24);\n                int8x8_t _r24 = vld1_s8(r2 + 32);\n\n                int16x8_t _s00 = vmull_s8(_r00, _k00);\n                int16x8_t _s01 = 
vmull_s8(_r01, _k01);\n                int16x8_t _s02 = vmull_s8(_r02, _k02);\n                int16x8_t _s03 = vmull_s8(_r10, _k10);\n                int16x8_t _s10 = vmull_s8(_r02, _k00);\n                int16x8_t _s11 = vmull_s8(_r03, _k01);\n                int16x8_t _s12 = vmull_s8(_r04, _k02);\n                int16x8_t _s13 = vmull_s8(_r12, _k10);\n                _s00 = vmlal_s8(_s00, _r11, _k11);\n                _s01 = vmlal_s8(_s01, _r12, _k12);\n                _s02 = vmlal_s8(_s02, _r20, _k20);\n                _s03 = vmlal_s8(_s03, _r21, _k21);\n                _s10 = vmlal_s8(_s10, _r13, _k11);\n                _s11 = vmlal_s8(_s11, _r14, _k12);\n                _s12 = vmlal_s8(_s12, _r22, _k20);\n                _s13 = vmlal_s8(_s13, _r23, _k21);\n                int16x8_t _s08 = vmull_s8(_r22, _k22);\n                int16x8_t _s18 = vmull_s8(_r24, _k22);\n\n                int32x4_t _sum00 = vaddl_s16(vget_low_s16(_s00), vget_low_s16(_s01));\n                int32x4_t _sum01 = vaddl_s16(vget_high_s16(_s00), vget_high_s16(_s01));\n                int32x4_t _sum02 = vaddl_s16(vget_low_s16(_s02), vget_low_s16(_s03));\n                int32x4_t _sum03 = vaddl_s16(vget_high_s16(_s02), vget_high_s16(_s03));\n                int32x4_t _sum10 = vaddl_s16(vget_low_s16(_s10), vget_low_s16(_s11));\n                int32x4_t _sum11 = vaddl_s16(vget_high_s16(_s10), vget_high_s16(_s11));\n                int32x4_t _sum12 = vaddl_s16(vget_low_s16(_s12), vget_low_s16(_s13));\n                int32x4_t _sum13 = vaddl_s16(vget_high_s16(_s12), vget_high_s16(_s13));\n                _sum00 = vaddw_s16(_sum00, vget_low_s16(_s08));\n                _sum01 = vaddw_s16(_sum01, vget_high_s16(_s08));\n                _sum10 = vaddw_s16(_sum10, vget_low_s16(_s18));\n                _sum11 = vaddw_s16(_sum11, vget_high_s16(_s18));\n                _sum00 = vaddq_s32(_sum00, _sum02);\n                _sum01 = vaddq_s32(_sum01, _sum03);\n                _sum10 = 
vaddq_s32(_sum10, _sum12);\n                _sum11 = vaddq_s32(_sum11, _sum13);\n\n                vst1q_s32(outptr0, _sum00);\n                vst1q_s32(outptr0 + 4, _sum01);\n                vst1q_s32(outptr0 + 8, _sum10);\n                vst1q_s32(outptr0 + 12, _sum11);\n\n                r0 += 32;\n                r1 += 32;\n                r2 += 32;\n                outptr0 += 16;\n            }\n            for (; j < outw; j++)\n            {\n                int8x8_t _r00 = vld1_s8(r0);\n                int8x8_t _r01 = vld1_s8(r0 + 8);\n                int8x8_t _r02 = vld1_s8(r0 + 16);\n                int8x8_t _r10 = vld1_s8(r1);\n                int8x8_t _r11 = vld1_s8(r1 + 8);\n                int8x8_t _r12 = vld1_s8(r1 + 16);\n                int8x8_t _r20 = vld1_s8(r2);\n                int8x8_t _r21 = vld1_s8(r2 + 8);\n                int8x8_t _r22 = vld1_s8(r2 + 16);\n\n                int16x8_t _s0 = vmull_s8(_r00, _k00);\n                int16x8_t _s1 = vmull_s8(_r01, _k01);\n                int16x8_t _s2 = vmull_s8(_r02, _k02);\n                int16x8_t _s3 = vmull_s8(_r10, _k10);\n                _s0 = vmlal_s8(_s0, _r11, _k11);\n                _s1 = vmlal_s8(_s1, _r12, _k12);\n                _s2 = vmlal_s8(_s2, _r20, _k20);\n                _s3 = vmlal_s8(_s3, _r21, _k21);\n                int16x8_t _s4 = vmull_s8(_r22, _k22);\n\n                int32x4_t _sum0 = vaddl_s16(vget_low_s16(_s0), vget_low_s16(_s1));\n                int32x4_t _sum1 = vaddl_s16(vget_high_s16(_s0), vget_high_s16(_s1));\n                int32x4_t _sum2 = vaddl_s16(vget_low_s16(_s2), vget_low_s16(_s3));\n                int32x4_t _sum3 = vaddl_s16(vget_high_s16(_s2), vget_high_s16(_s3));\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s4));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s4));\n                _sum0 = vaddq_s32(_sum0, _sum2);\n                _sum1 = vaddq_s32(_sum1, _sum3);\n\n                vst1q_s32(outptr0, _sum0);\n             
   vst1q_s32(outptr0 + 4, _sum1);\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n                outptr0 += 8;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_5x5.h",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw5x5s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const float bias0 = bias ? bias[g] : 0.f;\n\n        const float* kernel0 = kernel + g * 25;\n\n        float* outptr = out;\n        float* outptr2 = outptr + outw;\n\n        const float* img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n        const float* r2 = img0 + w * 2;\n        const float* r3 = img0 + w * 3;\n        const float* r4 = img0 + w * 4;\n        const float* r5 = img0 + w * 5;\n\n        const float* k0 = kernel0;\n        const float* k1 = kernel0 + 5;\n        const float* k2 = kernel0 + 10;\n        const float* k3 = kernel0 + 15;\n        const float* k4 = kernel0 + 20;\n\n#if __ARM_NEON\n        float32x4_t _k0123 = vld1q_f32(kernel0);\n        float32x4_t _k4567 = vld1q_f32(kernel0 + 4);\n        float32x4_t _k891011 = vld1q_f32(kernel0 + 8);\n        float32x4_t _k12131415 = vld1q_f32(kernel0 + 12);\n        float32x4_t _k16171819 = vld1q_f32(kernel0 + 16);\n        float32x4_t _k20212223 = vld1q_f32(kernel0 + 20);\n        float32x4_t _k24242424 = vdupq_n_f32(kernel0[24]);\n\n        float32x4_t _bias0 = vdupq_n_f32(bias0);\n#endif // __ARM_NEON\n\n        int i = 0;\n\n        for (; i + 1 < outh; i += 2)\n        {\n#if __ARM_NEON\n#if __aarch64__\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int nn = outw >> 2;\n            int remain = outw & 3;\n#endif // __aarch64__\n#else\n    
        int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    // r1\n                    \"prfm   pldl1keep, [%4, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%4]  \\n\" // v16 v17 v18 = r10 r14 r18\n\n                    \"mov    v8.16b, %25.16b                 \\n\" // v8 = _bias0\n                    \"mov    v9.16b, %25.16b                 \\n\" // v9 = _bias0\n\n                    \"0:                                     \\n\"\n\n                    \"mov    v10.16b, %25.16b                \\n\" // v10 = _bias0\n                    \"mov    v11.16b, %25.16b                \\n\" // v11 = _bias0\n\n                    \"fmla   v8.4s, v16.4s, %19.s[1]         \\n\"\n                    \"fmla   v10.4s, v16.4s, %18.s[0]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #4   \\n\" // r11\n\n                    \"fmla   v9.4s, v17.4s, %19.s[1]         \\n\"\n                    \"fmla   v11.4s, v17.4s, %18.s[0]        \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #4   \\n\" // r15\n\n                    \"fmla   v8.4s, v17.4s, %20.s[1]         \\n\"\n                    \"fmla   v10.4s, v17.4s, %19.s[0]        \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\" // r12\n\n                    \"fmla   v9.4s, v18.4s, %20.s[1]         \\n\"\n                    \"fmla   v11.4s, v18.4s, %19.s[0]        \\n\"\n\n                    \"ext    v22.16b, v17.16b, v18.16b, #8   \\n\" // r16\n\n                    \"fmla   v8.4s, v19.4s, %19.s[2]         \\n\"\n                    \"fmla   v10.4s, v19.4s, %18.s[1]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #12  \\n\" // r13\n\n                    \"fmla   v9.4s, v20.4s, %19.s[2]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %18.s[1]        \\n\"\n\n                    \"ext    
v20.16b, v17.16b, v18.16b, #12  \\n\" // r17\n\n                    \"fmla   v8.4s, v21.4s, %19.s[3]         \\n\"\n                    \"fmla   v10.4s, v21.4s, %18.s[2]        \\n\"\n\n                    \"add    %4, %4, #32                     \\n\"\n\n                    \"fmla   v9.4s, v22.4s, %19.s[3]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %18.s[2]        \\n\"\n\n                    // r2\n                    \"prfm   pldl1keep, [%5, #384]           \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s}, [%5]  \\n\" // v12 v13 v14 = r20 r24 r28\n\n                    \"fmla   v8.4s, v19.4s, %20.s[0]         \\n\"\n                    \"fmla   v10.4s, v19.4s, %18.s[3]        \\n\"\n                    \"fmla   v9.4s, v20.4s, %20.s[0]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %18.s[3]        \\n\"\n\n                    \"add    %5, %5, #32                     \\n\"\n\n                    \"fmla   v8.4s, v12.4s, %20.s[2]         \\n\"\n                    \"fmla   v10.4s, v12.4s, %19.s[1]        \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #4   \\n\" // r21\n\n                    \"fmla   v9.4s, v13.4s, %20.s[2]         \\n\"\n                    \"fmla   v11.4s, v13.4s, %19.s[1]        \\n\"\n\n                    \"ext    v22.16b, v13.16b, v14.16b, #4   \\n\" // r25\n\n                    \"fmla   v8.4s, v13.4s, %21.s[2]         \\n\"\n                    \"fmla   v10.4s, v13.4s, %20.s[1]        \\n\"\n\n                    \"ext    v19.16b, v12.16b, v13.16b, #8   \\n\" // r22\n\n                    \"fmla   v9.4s, v14.4s, %21.s[2]         \\n\"\n                    \"fmla   v11.4s, v14.4s, %20.s[1]        \\n\"\n\n                    \"ext    v20.16b, v13.16b, v14.16b, #8   \\n\" // r26\n\n                    \"fmla   v8.4s, v21.4s, %20.s[3]         \\n\"\n                    \"fmla   v10.4s, v21.4s, %19.s[2]        \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #12  \\n\" // 
r23\n\n                    \"fmla   v9.4s, v22.4s, %20.s[3]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %19.s[2]        \\n\"\n\n                    \"ext    v22.16b, v13.16b, v14.16b, #12  \\n\" // r27\n\n                    \"fmla   v8.4s, v19.4s, %21.s[0]         \\n\"\n                    \"fmla   v10.4s, v19.4s, %19.s[3]        \\n\"\n                    \"fmla   v9.4s, v20.4s, %21.s[0]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %19.s[3]        \\n\"\n\n                    // r3\n                    \"prfm   pldl1keep, [%6, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%6]  \\n\" // v16 v17 v18 = r30 r34 r38\n\n                    \"fmla   v8.4s, v21.4s, %21.s[1]         \\n\"\n                    \"fmla   v10.4s, v21.4s, %20.s[0]        \\n\"\n                    \"fmla   v9.4s, v22.4s, %21.s[1]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %20.s[0]        \\n\"\n\n                    \"add    %6, %6, #32                     \\n\"\n\n                    \"fmla   v8.4s, v16.4s, %21.s[3]         \\n\"\n                    \"fmla   v10.4s, v16.4s, %20.s[2]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #4   \\n\" // r31\n\n                    \"fmla   v9.4s, v17.4s, %21.s[3]         \\n\"\n                    \"fmla   v11.4s, v17.4s, %20.s[2]        \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #4   \\n\" // r35\n\n                    \"fmla   v8.4s, v17.4s, %22.s[3]         \\n\"\n                    \"fmla   v10.4s, v17.4s, %21.s[2]        \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\" // r32\n\n                    \"fmla   v9.4s, v18.4s, %22.s[3]         \\n\"\n                    \"fmla   v11.4s, v18.4s, %21.s[2]        \\n\"\n\n                    \"ext    v22.16b, v17.16b, v18.16b, #8   \\n\" // r36\n\n                    \"fmla   v8.4s, v19.4s, %22.s[0]         \\n\"\n                    \"fmla   v10.4s, 
v19.4s, %20.s[3]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #12  \\n\" // r33\n\n                    \"fmla   v9.4s, v20.4s, %22.s[0]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %20.s[3]        \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #12  \\n\" // r37\n\n                    \"fmla   v8.4s, v21.4s, %22.s[1]         \\n\"\n                    \"fmla   v10.4s, v21.4s, %21.s[0]        \\n\"\n                    \"fmla   v9.4s, v22.4s, %22.s[1]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %21.s[0]        \\n\"\n\n                    // r4\n                    \"prfm   pldl1keep, [%7, #384]           \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s}, [%7]  \\n\" // v12 v13 v14 = r40 r44 r48\n\n                    \"fmla   v8.4s, v19.4s, %22.s[2]         \\n\"\n                    \"fmla   v10.4s, v19.4s, %21.s[1]        \\n\"\n\n                    \"add    %7, %7, #32                     \\n\"\n\n                    \"fmla   v9.4s, v20.4s, %22.s[2]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %21.s[1]        \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #4   \\n\" // r41\n\n                    \"fmla   v8.4s, v12.4s, %23.s[0]         \\n\"\n                    \"fmla   v10.4s, v12.4s, %21.s[3]        \\n\"\n\n                    \"ext    v22.16b, v13.16b, v14.16b, #4   \\n\" // r45\n\n                    \"fmla   v9.4s, v13.4s, %23.s[0]         \\n\"\n                    \"fmla   v11.4s, v13.4s, %21.s[3]        \\n\"\n\n                    \"ext    v19.16b, v12.16b, v13.16b, #8   \\n\" // r42\n\n                    \"fmla   v8.4s, v13.4s, %24.s[0]         \\n\"\n                    \"fmla   v10.4s, v13.4s, %22.s[3]        \\n\"\n\n                    \"ext    v20.16b, v13.16b, v14.16b, #8   \\n\" // r46\n\n                    \"fmla   v9.4s, v14.4s, %24.s[0]         \\n\"\n                    \"fmla   v11.4s, v14.4s, %22.s[3]        \\n\"\n\n 
                   // r0 and r5\n                    \"prfm   pldl1keep, [%3, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%3]  \\n\" // v16 v17 v18 = r00 r04 r08\n\n                    \"fmla   v8.4s, v21.4s, %23.s[1]         \\n\"\n                    \"fmla   v10.4s, v21.4s, %22.s[0]        \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #12  \\n\" // r43\n\n                    \"fmla   v9.4s, v22.4s, %23.s[1]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %22.s[0]        \\n\"\n\n                    \"ext    v22.16b, v13.16b, v14.16b, #12  \\n\" // r47\n\n                    \"fmla   v8.4s, v19.4s, %23.s[2]         \\n\"\n                    \"fmla   v10.4s, v19.4s, %22.s[1]        \\n\"\n\n                    \"prfm   pldl1keep, [%8, #384]           \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s}, [%8]  \\n\" // v12 v13 v14 = r50 r54 r58\n\n                    \"fmla   v9.4s, v20.4s, %23.s[2]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %22.s[1]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #4   \\n\" // r01\n\n                    \"fmla   v8.4s, v21.4s, %23.s[3]         \\n\"\n                    \"fmla   v10.4s, v21.4s, %22.s[2]        \\n\"\n\n                    \"ext    v23.16b, v12.16b, v13.16b, #4   \\n\" // r51\n\n                    \"fmla   v9.4s, v22.4s, %23.s[3]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %22.s[2]        \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #4   \\n\" // r05\n\n                    \"fmla   v8.4s, v16.4s, %18.s[0]         \\n\"\n                    \"fmla   v10.4s, v12.4s, %23.s[0]        \\n\"\n\n                    \"ext    v24.16b, v13.16b, v14.16b, #4   \\n\" // r55\n\n                    \"fmla   v9.4s, v17.4s, %18.s[0]         \\n\"\n                    \"fmla   v11.4s, v13.4s, %23.s[0]        \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\" 
// r02\n\n                    \"fmla   v8.4s, v17.4s, %19.s[0]         \\n\"\n                    \"fmla   v10.4s, v13.4s, %24.s[0]        \\n\"\n\n                    \"ext    v25.16b, v12.16b, v13.16b, #8   \\n\" // r52\n\n                    \"fmla   v9.4s, v18.4s, %19.s[0]         \\n\"\n                    \"fmla   v11.4s, v14.4s, %24.s[0]        \\n\"\n\n                    \"ext    v22.16b, v17.16b, v18.16b, #8   \\n\" // r06\n\n                    \"fmla   v8.4s, v19.4s, %18.s[1]         \\n\"\n                    \"fmla   v10.4s, v23.4s, %23.s[1]        \\n\"\n\n                    \"ext    v26.16b, v13.16b, v14.16b, #8   \\n\" // r56\n\n                    \"fmla   v9.4s, v20.4s, %18.s[1]         \\n\"\n                    \"fmla   v11.4s, v24.4s, %23.s[1]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #12  \\n\" // r03\n\n                    \"fmla   v8.4s, v21.4s, %18.s[2]         \\n\"\n                    \"fmla   v10.4s, v25.4s, %23.s[2]        \\n\"\n\n                    \"ext    v23.16b, v12.16b, v13.16b, #12  \\n\" // r53\n\n                    \"fmla   v9.4s, v22.4s, %18.s[2]         \\n\"\n                    \"fmla   v11.4s, v26.4s, %23.s[2]        \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #12  \\n\" // r07\n\n                    \"fmla   v8.4s, v19.4s, %18.s[3]         \\n\"\n                    \"fmla   v10.4s, v23.4s, %23.s[3]        \\n\"\n\n                    \"ext    v24.16b, v13.16b, v14.16b, #12  \\n\" // r57\n\n                    \"fmla   v9.4s, v20.4s, %18.s[3]         \\n\"\n\n                    \"add    %3, %3, #32                     \\n\"\n\n                    \"fmla   v11.4s, v24.4s, %23.s[3]        \\n\"\n\n                    \"add    %8, %8, #32                     \\n\"\n\n                    // r1\n                    \"prfm   pldl1keep, [%4, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%4]  \\n\" // v16 v17 v18 = r10 r14 r18\n\n           
         \"subs   %w0, %w0, #1                    \\n\"\n\n                    \"st1    {v8.4s, v9.4s}, [%1], #32       \\n\"\n\n                    \"mov    v8.16b, %25.16b                 \\n\" // v8 = _bias0\n                    \"mov    v9.16b, %25.16b                 \\n\" // v9 = _bias0\n\n                    \"st1    {v10.4s, v11.4s}, [%2], #32     \\n\"\n\n                    \"bne    0b                              \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr),  // %1\n                    \"=r\"(outptr2), // %2\n                    \"=r\"(r0),      // %3\n                    \"=r\"(r1),      // %4\n                    \"=r\"(r2),      // %5\n                    \"=r\"(r3),      // %6\n                    \"=r\"(r4),      // %7\n                    \"=r\"(r5)       // %8\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(outptr2),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"7\"(r4),\n                    \"8\"(r5),\n                    \"w\"(_k0123),     // %18\n                    \"w\"(_k4567),     // %19\n                    \"w\"(_k891011),   // %20\n                    \"w\"(_k12131415), // %21\n                    \"w\"(_k16171819), // %22\n                    \"w\"(_k20212223), // %23\n                    \"w\"(_k24242424), // %24\n                    \"w\"(_bias0)      // %25\n                    : \"cc\", \"memory\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\");\n            }\n\n            if (remain >= 4)\n            {\n                remain -= 4;\n                asm volatile(\n                    // r1\n                    \"prfm   pldl1keep, [%3, #256]           \\n\"\n                    \"ld1    {v12.4s, v13.4s}, [%3]          \\n\" // 
v12 v13 = r10 r14\n\n                    \"mov    v8.16b, %23.16b                 \\n\" // v8 = _bias0\n                    \"mov    v9.16b, %23.16b                 \\n\" // v9 = _bias0\n\n                    \"fmul   v10.4s, v12.4s, %17.s[1]        \\n\"\n                    \"fmul   v11.4s, v12.4s, %16.s[0]        \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #4   \\n\" // r11\n\n                    \"fmla   v8.4s, v13.4s, %18.s[1]         \\n\"\n                    \"fmla   v9.4s, v13.4s, %17.s[0]         \\n\"\n\n                    \"ext    v22.16b, v12.16b, v13.16b, #8   \\n\" // r12\n\n                    \"fmla   v10.4s, v21.4s, %17.s[2]        \\n\"\n                    \"fmla   v11.4s, v21.4s, %16.s[1]        \\n\"\n\n                    \"ext    v23.16b, v12.16b, v13.16b, #12  \\n\" // r13\n\n                    \"fmla   v8.4s, v22.4s, %17.s[3]         \\n\"\n                    \"fmla   v9.4s, v22.4s, %16.s[2]         \\n\"\n\n                    // r2\n                    \"prfm   pldl1keep, [%4, #256]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%4]          \\n\" // v16 v17 = r20 r24\n\n                    \"fmla   v10.4s, v23.4s, %18.s[0]        \\n\"\n                    \"fmla   v11.4s, v23.4s, %16.s[3]        \\n\"\n\n                    \"add    %4, %4, #16                     \\n\"\n\n                    \"fmla   v8.4s, v16.4s, %18.s[2]         \\n\"\n                    \"fmla   v9.4s, v16.4s, %17.s[1]         \\n\"\n\n                    \"ext    v18.16b, v16.16b, v17.16b, #4   \\n\" // r21\n\n                    \"fmla   v10.4s, v17.4s, %19.s[2]        \\n\"\n                    \"fmla   v11.4s, v17.4s, %18.s[1]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\" // r22\n\n                    \"fmla   v8.4s, v18.4s, %18.s[3]         \\n\"\n                    \"fmla   v9.4s, v18.4s, %17.s[2]         \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #12  
\\n\" // r23\n\n                    \"fmla   v10.4s, v19.4s, %19.s[0]        \\n\"\n                    \"fmla   v11.4s, v19.4s, %17.s[3]        \\n\"\n\n                    // r3\n                    \"prfm   pldl1keep, [%5, #256]           \\n\"\n                    \"ld1    {v12.4s, v13.4s}, [%5]          \\n\" // v12 v13 = r30 r34\n\n                    \"fmla   v8.4s, v20.4s, %19.s[1]         \\n\"\n                    \"fmla   v9.4s, v20.4s, %18.s[0]         \\n\"\n\n                    \"add    %5, %5, #16                     \\n\"\n\n                    \"fmla   v10.4s, v12.4s, %19.s[3]        \\n\"\n                    \"fmla   v11.4s, v12.4s, %18.s[2]        \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #4   \\n\" // r31\n\n                    \"fmla   v8.4s, v13.4s, %20.s[3]         \\n\"\n                    \"fmla   v9.4s, v13.4s, %19.s[2]         \\n\"\n\n                    \"ext    v22.16b, v12.16b, v13.16b, #8   \\n\" // r32\n\n                    \"fmla   v10.4s, v21.4s, %20.s[0]        \\n\"\n                    \"fmla   v11.4s, v21.4s, %18.s[3]        \\n\"\n\n                    \"ext    v23.16b, v12.16b, v13.16b, #12  \\n\" // r33\n\n                    \"fmla   v8.4s, v22.4s, %20.s[1]         \\n\"\n                    \"fmla   v9.4s, v22.4s, %19.s[0]         \\n\"\n\n                    // r4\n                    \"prfm   pldl1keep, [%6, #256]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%6]          \\n\" // v16 v17 = r40 r44\n\n                    \"fmla   v10.4s, v23.4s, %20.s[2]        \\n\"\n                    \"fmla   v11.4s, v23.4s, %19.s[1]        \\n\"\n\n                    \"add    %6, %6, #16                     \\n\"\n\n                    \"fmla   v8.4s, v16.4s, %21.s[0]         \\n\"\n                    \"fmla   v9.4s, v16.4s, %19.s[3]         \\n\"\n\n                    \"ext    v18.16b, v16.16b, v17.16b, #4   \\n\" // r41\n\n                    \"fmla   v10.4s, v17.4s, %22.s[0]       
 \\n\"\n                    \"fmla   v11.4s, v17.4s, %20.s[3]        \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\" // r42\n\n                    \"fmla   v8.4s, v18.4s, %21.s[1]         \\n\"\n                    \"fmla   v9.4s, v18.4s, %20.s[0]         \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #12  \\n\" // r43\n\n                    \"fmla   v10.4s, v19.4s, %21.s[2]        \\n\"\n                    \"fmla   v11.4s, v19.4s, %20.s[1]        \\n\"\n\n                    // r0\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%2]          \\n\" // v16 v17 = r00 r04\n\n                    \"fmla   v8.4s, v20.4s, %21.s[3]         \\n\"\n                    \"fmla   v9.4s, v20.4s, %20.s[2]         \\n\"\n\n                    // r5\n                    \"prfm   pldl1keep, [%7, #256]           \\n\"\n                    \"ld1    {v12.4s, v13.4s}, [%7]          \\n\" // v12 v13 = r50 r54\n\n                    \"fmla   v10.4s, v16.4s, %16.s[0]        \\n\"\n                    \"fmla   v11.4s, v12.4s, %21.s[0]        \\n\"\n\n                    \"ext    v18.16b, v16.16b, v17.16b, #4   \\n\" // r01\n\n                    \"fmla   v8.4s, v17.4s, %17.s[0]         \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #4   \\n\" // r51\n\n                    \"fmla   v9.4s, v13.4s, %22.s[0]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\" // r02\n\n                    \"fmla   v10.4s, v18.4s, %16.s[1]        \\n\"\n\n                    \"ext    v22.16b, v12.16b, v13.16b, #8   \\n\" // r52\n\n                    \"fmla   v11.4s, v21.4s, %21.s[1]        \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #12  \\n\" // r03\n\n                    \"fmla   v8.4s, v19.4s, %16.s[2]         \\n\"\n\n                    \"ext    v23.16b, v12.16b, v13.16b, #12  \\n\" // r53\n\n                    \"fmla   
v9.4s, v22.4s, %21.s[2]         \\n\"\n\n                    \"add    %3, %3, #16                     \\n\"\n\n                    \"fmla   v10.4s, v20.4s, %16.s[3]        \\n\"\n                    \"fmla   v11.4s, v23.4s, %21.s[3]        \\n\"\n\n                    \"add    %2, %2, #16                     \\n\"\n\n                    \"fadd   v8.4s, v8.4s, v10.4s            \\n\"\n                    \"fadd   v9.4s, v9.4s, v11.4s            \\n\"\n\n                    \"add    %7, %7, #16                     \\n\"\n\n                    \"st1    {v8.4s}, [%0], #16              \\n\"\n                    \"st1    {v9.4s}, [%1], #16              \\n\"\n\n                    : \"=r\"(outptr),  // %0\n                    \"=r\"(outptr2), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3),      // %5\n                    \"=r\"(r4),      // %6\n                    \"=r\"(r5)       // %7\n                    : \"0\"(outptr),\n                    \"1\"(outptr2),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(r5),\n                    \"w\"(_k0123),     // %16\n                    \"w\"(_k4567),     // %17\n                    \"w\"(_k891011),   // %18\n                    \"w\"(_k12131415), // %19\n                    \"w\"(_k16171819), // %20\n                    \"w\"(_k20212223), // %21\n                    \"w\"(_k24242424), // %22\n                    \"w\"(_bias0)      // %23\n                    : \"cc\", \"memory\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    // r1\n                    \"pld        [%4, #256]       
   \\n\"\n                    \"vld1.f32   {d28-d31}, [%4]     \\n\" // q14 q15 = r10 r14\n\n                    \"vmov       q8, %q25            \\n\" // q8 = _bias0\n\n                    \"0:                             \\n\"\n\n                    \"vmov       q9, %q25            \\n\" // q9 = _bias0\n\n                    \"vmla.f32   q8, q14, %e19[1]    \\n\"\n                    \"vmla.f32   q9, q14, %e18[0]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #1   \\n\" // r11\n\n                    \"vmla.f32   q8, q15, %e20[1]    \\n\"\n                    \"vmla.f32   q9, q15, %e19[0]    \\n\"\n\n                    \"vext.32    q13, q14, q15, #2   \\n\" // r12\n\n                    \"vmla.f32   q8, q12, %f19[0]    \\n\"\n                    \"vmla.f32   q9, q12, %e18[1]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #3   \\n\" // r13\n\n                    \"vmla.f32   q8, q13, %f19[1]    \\n\"\n                    \"vmla.f32   q9, q13, %f18[0]    \\n\"\n\n                    // r2\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.f32   {d20-d23}, [%5]     \\n\" // q10 q11 = r20 r24\n\n                    \"vmla.f32   q8, q12, %e20[0]    \\n\"\n                    \"vmla.f32   q9, q12, %f18[1]    \\n\"\n\n                    \"add        %5, #16             \\n\"\n\n                    \"vmla.f32   q8, q10, %f20[0]    \\n\"\n                    \"vmla.f32   q9, q10, %e19[1]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #1   \\n\" // r21\n\n                    \"vmla.f32   q8, q11, %f21[0]    \\n\"\n                    \"vmla.f32   q9, q11, %e20[1]    \\n\"\n\n                    \"vext.32    q13, q10, q11, #2   \\n\" // r22\n\n                    \"vmla.f32   q8, q12, %f20[1]    \\n\"\n                    \"vmla.f32   q9, q12, %f19[0]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #3   \\n\" // r23\n\n                    \"vmla.f32   q8, q13, %e21[0]    \\n\"\n         
           \"vmla.f32   q9, q13, %f19[1]    \\n\"\n\n                    // r3\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.f32   {d28-d31}, [%6]     \\n\" // q14 q15 = r30 r34\n\n                    \"vmla.f32   q8, q12, %e21[1]    \\n\"\n                    \"vmla.f32   q9, q12, %e20[0]    \\n\"\n\n                    \"add        %6, #16             \\n\"\n\n                    \"vmla.f32   q8, q14, %f21[1]    \\n\"\n                    \"vmla.f32   q9, q14, %f20[0]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #1   \\n\" // r31\n\n                    \"vmla.f32   q8, q15, %f22[1]    \\n\"\n                    \"vmla.f32   q9, q15, %f21[0]    \\n\"\n\n                    \"vext.32    q13, q14, q15, #2   \\n\" // r32\n\n                    \"vmla.f32   q8, q12, %e22[0]    \\n\"\n                    \"vmla.f32   q9, q12, %f20[1]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #3   \\n\" // r33\n\n                    \"vmla.f32   q8, q13, %e22[1]    \\n\"\n                    \"vmla.f32   q9, q13, %e21[0]    \\n\"\n\n                    // r4\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.f32   {d20-d23}, [%7]     \\n\" // q10 q11 = r40 r44\n\n                    \"vmla.f32   q8, q12, %f22[0]    \\n\"\n                    \"vmla.f32   q9, q12, %e21[1]    \\n\"\n\n                    \"add        %7, #16             \\n\"\n\n                    \"vmla.f32   q8, q10, %e23[0]    \\n\"\n                    \"vmla.f32   q9, q10, %f21[1]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #1   \\n\" // r41\n\n                    \"vmla.f32   q8, q11, %e24[0]    \\n\"\n                    \"vmla.f32   q9, q11, %f22[1]    \\n\"\n\n                    \"vext.32    q13, q10, q11, #2   \\n\" // r42\n\n                    \"vmla.f32   q8, q12, %e23[1]    \\n\"\n                    \"vmla.f32   q9, q12, %e22[0]    \\n\"\n\n                    \"vext.32    q12, 
q10, q11, #3   \\n\" // r43\n\n                    \"vmla.f32   q8, q13, %f23[0]    \\n\"\n                    \"vmla.f32   q9, q13, %e22[1]    \\n\"\n\n                    // r0 and r5\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.f32   {d20-d23}, [%3]     \\n\" // q10 q11 = r00 r04\n\n                    \"vmla.f32   q8, q12, %f23[1]    \\n\"\n                    \"vmla.f32   q9, q12, %f22[0]    \\n\"\n\n                    // r5\n                    \"pld        [%8, #256]          \\n\"\n                    \"vld1.f32   {d28-d31}, [%8]     \\n\" // q14 q15 = r50 r54\n\n                    \"vmla.f32   q8, q10, %e18[0]    \\n\"\n                    \"vmla.f32   q9, q14, %e23[0]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #1   \\n\" // r01\n\n                    \"vmla.f32   q8, q11, %e19[0]    \\n\"\n                    \"vmla.f32   q9, q15, %e24[0]    \\n\"\n\n                    \"vext.32    q13, q14, q15, #1   \\n\" // r51\n\n                    \"vmla.f32   q8, q12, %e18[1]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #2   \\n\" // r02\n\n                    \"vmla.f32   q9, q13, %e23[1]    \\n\"\n\n                    \"vext.32    q13, q14, q15, #2   \\n\" // r52\n\n                    \"vmla.f32   q8, q12, %f18[0]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #3   \\n\" // r03\n\n                    \"vmla.f32   q9, q13, %f23[0]    \\n\"\n\n                    \"vext.32    q13, q14, q15, #3   \\n\" // r53\n\n                    \"vmla.f32   q8, q12, %f18[1]    \\n\"\n\n                    \"add        %3, #16             \\n\"\n\n                    \"vmla.f32   q9, q13, %f23[1]    \\n\"\n\n                    \"add        %4, #16             \\n\"\n\n                    // r1\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.f32   {d28-d31}, [%4]     \\n\" // q14 q15 = r10 r14\n\n                    \"add        %8, #16             
\\n\"\n\n                    \"vst1.f32   {d16-d17}, [%1]!    \\n\"\n\n                    \"vmov       q8, %q25            \\n\" // q8 = _bias0\n\n                    \"subs       %0, #1              \\n\"\n\n                    \"vst1.f32   {d18-d19}, [%2]!    \\n\"\n\n                    \"bne        0b                  \\n\"\n                    : \"=r\"(nn),      // %0\n                    \"=r\"(outptr),  // %1\n                    \"=r\"(outptr2), // %2\n                    \"=r\"(r0),      // %3\n                    \"=r\"(r1),      // %4\n                    \"=r\"(r2),      // %5\n                    \"=r\"(r3),      // %6\n                    \"=r\"(r4),      // %7\n                    \"=r\"(r5)       // %8\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(outptr2),\n                    \"3\"(r0),\n                    \"4\"(r1),\n                    \"5\"(r2),\n                    \"6\"(r3),\n                    \"7\"(r4),\n                    \"8\"(r5),\n                    \"w\"(_k0123),     // %18\n                    \"w\"(_k4567),     // %19\n                    \"w\"(_k891011),   // %20\n                    \"w\"(_k12131415), // %21\n                    \"w\"(_k16171819), // %22\n                    \"w\"(_k20212223), // %23\n                    \"w\"(_k24242424), // %24\n                    \"w\"(_bias0)      // %25\n                    : \"cc\", \"memory\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                float sum = bias0;\n                float sum2 = bias0;\n#if __ARM_NEON\n                // TODO neon assembly optimize\n                float32x4_t _r1 = vld1q_f32(r1);\n                float32x4_t _k1 = vld1q_f32(k1);\n                float32x4_t _sum = vmulq_f32(_r1, _k1);\n                float32x4_t _sum2 = vmulq_f32(_r1, _k0123);\n\n       
         float32x4_t _r2 = vld1q_f32(r2);\n                float32x4_t _k2 = vld1q_f32(k2);\n                _sum = vmlaq_f32(_sum, _r2, _k2);\n                _sum2 = vmlaq_f32(_sum2, _r2, _k1);\n\n                float32x4_t _r3 = vld1q_f32(r3);\n                float32x4_t _k3 = vld1q_f32(k3);\n                _sum = vmlaq_f32(_sum, _r3, _k3);\n                _sum2 = vmlaq_f32(_sum2, _r3, _k2);\n\n                float32x4_t _r4 = vld1q_f32(r4);\n                _sum = vmlaq_f32(_sum, _r4, _k20212223);\n                _sum2 = vmlaq_f32(_sum2, _r4, _k3);\n\n                float32x4_t _r0 = vld1q_f32(r0);\n                _sum = vmlaq_f32(_sum, _r0, _k0123);\n                float32x4_t _r5 = vld1q_f32(r5);\n                _sum2 = vmlaq_f32(_sum2, _r5, _k20212223);\n\n                float32x4_t _k_t4 = {};\n\n                _k_t4 = vsetq_lane_f32(k0[4], _k_t4, 0);\n                _k_t4 = vsetq_lane_f32(k1[4], _k_t4, 1);\n                _k_t4 = vsetq_lane_f32(k2[4], _k_t4, 2);\n                _k_t4 = vsetq_lane_f32(k3[4], _k_t4, 3);\n\n                float32x4_t _r_t4 = {};\n\n                _r_t4 = vsetq_lane_f32(r0[4], _r_t4, 0);\n                _r_t4 = vsetq_lane_f32(r1[4], _r_t4, 1);\n                _r_t4 = vsetq_lane_f32(r2[4], _r_t4, 2);\n                _r_t4 = vsetq_lane_f32(r3[4], _r_t4, 3);\n                _sum = vmlaq_f32(_sum, _r_t4, _k_t4);\n\n                sum += r4[4] * k4[4];\n\n                _r_t4 = vextq_f32(_r_t4, _r_t4, 1);\n                _r_t4 = vsetq_lane_f32(r4[4], _r_t4, 3);\n                _sum2 = vmlaq_f32(_sum2, _r_t4, _k_t4);\n\n                sum2 += r5[4] * k4[4];\n\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                float32x2_t _ss2 = vadd_f32(vget_low_f32(_sum2), vget_high_f32(_sum2));\n                float32x2_t _ss_ss2 = vpadd_f32(_ss, _ss2);\n\n                sum += vget_lane_f32(_ss_ss2, 0);\n                sum2 += vget_lane_f32(_ss_ss2, 1);\n#else\n      
          sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r0[3] * k0[3];\n                sum += r0[4] * k0[4];\n\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r1[3] * k1[3];\n                sum += r1[4] * k1[4];\n\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n                sum += r2[3] * k2[3];\n                sum += r2[4] * k2[4];\n\n                sum += r3[0] * k3[0];\n                sum += r3[1] * k3[1];\n                sum += r3[2] * k3[2];\n                sum += r3[3] * k3[3];\n                sum += r3[4] * k3[4];\n\n                sum += r4[0] * k4[0];\n                sum += r4[1] * k4[1];\n                sum += r4[2] * k4[2];\n                sum += r4[3] * k4[3];\n                sum += r4[4] * k4[4];\n\n                sum2 += r1[0] * k0[0];\n                sum2 += r1[1] * k0[1];\n                sum2 += r1[2] * k0[2];\n                sum2 += r1[3] * k0[3];\n                sum2 += r1[4] * k0[4];\n\n                sum2 += r2[0] * k1[0];\n                sum2 += r2[1] * k1[1];\n                sum2 += r2[2] * k1[2];\n                sum2 += r2[3] * k1[3];\n                sum2 += r2[4] * k1[4];\n\n                sum2 += r3[0] * k2[0];\n                sum2 += r3[1] * k2[1];\n                sum2 += r3[2] * k2[2];\n                sum2 += r3[3] * k2[3];\n                sum2 += r3[4] * k2[4];\n\n                sum2 += r4[0] * k3[0];\n                sum2 += r4[1] * k3[1];\n                sum2 += r4[2] * k3[2];\n                sum2 += r4[3] * k3[3];\n                sum2 += r4[4] * k3[4];\n\n                sum2 += r5[0] * k4[0];\n                sum2 += r5[1] * k4[1];\n                sum2 += r5[2] * k4[2];\n                sum2 += r5[3] * k4[3];\n                sum2 += r5[4] * k4[4];\n#endif // 
__ARM_NEON\n                *outptr = sum;\n                *outptr2 = sum2;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                r4++;\n                r5++;\n                outptr++;\n                outptr2++;\n            }\n\n            r0 += 4 + w;\n            r1 += 4 + w;\n            r2 += 4 + w;\n            r3 += 4 + w;\n            r4 += 4 + w;\n            r5 += 4 + w;\n\n            outptr += outw;\n            outptr2 += outw;\n        }\n\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n#if __aarch64__\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int nn = outw >> 2;\n            int remain = outw & 3;\n#endif // __aarch64__\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    // v10 v11\n                    // r0\n                    \"prfm   pldl1keep, [%2, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%2]  \\n\" // v16 v17 v18 = r00 r04 r08\n\n                    \"mov    v8.16b, %21.16b                 \\n\" // v8 = _bias0\n                    \"mov    v9.16b, %21.16b                 \\n\" // v9 = _bias0\n\n                    \"0:                                     \\n\"\n\n                    \"fmul   v10.4s, v16.4s, %14.s[0]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #4   \\n\" // r01\n\n                    \"fmul   v11.4s, v17.4s, %14.s[0]         \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #4   \\n\" // r05\n\n                    \"fmla   v8.4s, v17.4s, %15.s[0]         \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\" // r02\n\n                    \"fmla   v9.4s, v18.4s, %15.s[0]         \\n\"\n\n                    \"ext    v22.16b, v17.16b, v18.16b, #8   \\n\" // r06\n\n                
    \"fmla   v10.4s, v19.4s, %14.s[1]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #12  \\n\" // r03\n\n                    \"fmla   v11.4s, v20.4s, %14.s[1]         \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #12  \\n\" // r07\n\n                    \"fmla   v8.4s, v21.4s, %14.s[2]         \\n\"\n                    \"fmla   v9.4s, v22.4s, %14.s[2]         \\n\"\n\n                    // r1\n                    \"prfm   pldl1keep, [%3, #384]           \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s}, [%3]  \\n\" // v12 v13 v14 = r10 r14 r18\n\n                    \"fmla   v10.4s, v19.4s, %14.s[3]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %14.s[3]         \\n\"\n\n                    \"fmla   v8.4s, v12.4s, %15.s[1]         \\n\"\n\n                    \"ext    v19.16b, v12.16b, v13.16b, #4   \\n\" // r11\n\n                    \"fmla   v9.4s, v13.4s, %15.s[1]         \\n\"\n\n                    \"ext    v20.16b, v13.16b, v14.16b, #4   \\n\" // r15\n\n                    \"fmla   v10.4s, v13.4s, %16.s[1]         \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #8   \\n\" // r12\n\n                    \"fmla   v11.4s, v14.4s, %16.s[1]         \\n\"\n\n                    \"ext    v22.16b, v13.16b, v14.16b, #8   \\n\" // r16\n\n                    \"fmla   v8.4s, v19.4s, %15.s[2]         \\n\"\n\n                    \"ext    v19.16b, v12.16b, v13.16b, #12  \\n\" // r13\n\n                    \"fmla   v9.4s, v20.4s, %15.s[2]         \\n\"\n\n                    \"ext    v20.16b, v13.16b, v14.16b, #12  \\n\" // r17\n\n                    \"fmla   v10.4s, v21.4s, %15.s[3]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %15.s[3]         \\n\"\n\n                    // r2\n                    \"prfm   pldl1keep, [%4, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%4]  \\n\" // v16 v17 v18 = r20 r24 r28\n\n                    \"fmla   
v8.4s, v19.4s, %16.s[0]         \\n\"\n                    \"fmla   v9.4s, v20.4s, %16.s[0]         \\n\"\n\n                    \"fmla   v10.4s, v16.4s, %16.s[2]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #4   \\n\" // r21\n\n                    \"fmla   v11.4s, v17.4s, %16.s[2]         \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #4   \\n\" // r25\n\n                    \"fmla   v8.4s, v17.4s, %17.s[2]         \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\" // r22\n\n                    \"fmla   v9.4s, v18.4s, %17.s[2]         \\n\"\n\n                    \"ext    v22.16b, v17.16b, v18.16b, #8   \\n\" // r26\n\n                    \"fmla   v10.4s, v19.4s, %16.s[3]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #12  \\n\" // r23\n\n                    \"fmla   v11.4s, v20.4s, %16.s[3]         \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #12  \\n\" // r27\n\n                    \"fmla   v8.4s, v21.4s, %17.s[0]         \\n\"\n                    \"fmla   v9.4s, v22.4s, %17.s[0]         \\n\"\n\n                    // r3\n                    \"prfm   pldl1keep, [%5, #384]           \\n\"\n                    \"ld1    {v12.4s, v13.4s, v14.4s}, [%5]  \\n\" // v12 v13 v14 = r30 r34 r38\n\n                    \"fmla   v10.4s, v19.4s, %17.s[1]         \\n\"\n                    \"fmla   v11.4s, v20.4s, %17.s[1]         \\n\"\n\n                    \"fmla   v8.4s, v12.4s, %17.s[3]         \\n\"\n\n                    \"ext    v19.16b, v12.16b, v13.16b, #4   \\n\" // r31\n\n                    \"fmla   v9.4s, v13.4s, %17.s[3]         \\n\"\n\n                    \"ext    v20.16b, v13.16b, v14.16b, #4   \\n\" // r35\n\n                    \"fmla   v10.4s, v13.4s, %18.s[3]         \\n\"\n\n                    \"ext    v21.16b, v12.16b, v13.16b, #8   \\n\" // r32\n\n                    \"fmla   v11.4s, v14.4s, %18.s[3]         \\n\"\n\n                    
\"ext    v22.16b, v13.16b, v14.16b, #8   \\n\" // r36\n\n                    \"fmla   v8.4s, v19.4s, %18.s[0]         \\n\"\n\n                    \"ext    v19.16b, v12.16b, v13.16b, #12  \\n\" // r33\n\n                    \"fmla   v9.4s, v20.4s, %18.s[0]         \\n\"\n\n                    \"ext    v20.16b, v13.16b, v14.16b, #12  \\n\" // r37\n\n                    \"fmla   v10.4s, v21.4s, %18.s[1]         \\n\"\n                    \"fmla   v11.4s, v22.4s, %18.s[1]         \\n\"\n\n                    // r4\n                    \"prfm   pldl1keep, [%6, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%6]  \\n\" // v16 v17 v18 = r40 r44 r48\n\n                    \"fmla   v8.4s, v19.4s, %18.s[2]         \\n\"\n                    \"fmla   v9.4s, v20.4s, %18.s[2]         \\n\"\n\n                    \"fmla   v10.4s, v16.4s, %19.s[0]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #4   \\n\" // r41\n\n                    \"fmla   v11.4s, v17.4s, %19.s[0]         \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #4   \\n\" // r45\n\n                    \"fmla   v8.4s, v17.4s, %20.s[0]         \\n\"\n\n                    \"ext    v21.16b, v16.16b, v17.16b, #8   \\n\" // r42\n\n                    \"fmla   v9.4s, v18.4s, %20.s[0]         \\n\"\n\n                    \"ext    v22.16b, v17.16b, v18.16b, #8   \\n\" // r46\n\n                    \"fmla   v10.4s, v19.4s, %19.s[1]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #12  \\n\" // r43\n\n                    \"fmla   v11.4s, v20.4s, %19.s[1]         \\n\"\n\n                    \"ext    v20.16b, v17.16b, v18.16b, #12  \\n\" // r47\n\n                    \"fmla   v8.4s, v21.4s, %19.s[2]         \\n\"\n\n                    \"add    %2, %2, #32                     \\n\"\n\n                    \"fmla   v9.4s, v22.4s, %19.s[2]         \\n\"\n\n                    \"add    %3, %3, #32                     \\n\"\n\n           
         \"fmla   v10.4s, v19.4s, %19.s[3]         \\n\"\n\n                    \"add    %4, %4, #32                     \\n\"\n\n                    \"fmla   v11.4s, v20.4s, %19.s[3]         \\n\"\n\n                    // r0\n                    \"prfm   pldl1keep, [%2, #384]           \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s}, [%2]  \\n\" // v16 v17 v18 = r00 r04 r08\n\n                    \"add    %5, %5, #32                     \\n\"\n\n                    \"fadd   v10.4s, v8.4s, v10.4s           \\n\"\n\n                    \"add    %6, %6, #32                     \\n\"\n\n                    \"fadd   v11.4s, v9.4s, v11.4s           \\n\"\n\n                    \"mov    v8.16b, %21.16b                 \\n\" // v8 = _bias0\n                    \"mov    v9.16b, %21.16b                 \\n\" // v9 = _bias0\n\n                    \"subs   %w0, %w0, #1                    \\n\"\n\n                    \"st1    {v10.4s, v11.4s}, [%1], #32     \\n\"\n\n                    \"bne    0b                              \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2),     // %4\n                    \"=r\"(r3),     // %5\n                    \"=r\"(r4)      // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"w\"(_k0123),     // %14\n                    \"w\"(_k4567),     // %15\n                    \"w\"(_k891011),   // %16\n                    \"w\"(_k12131415), // %17\n                    \"w\"(_k16171819), // %18\n                    \"w\"(_k20212223), // %19\n                    \"w\"(_k24242424), // %20\n                    \"w\"(_bias0)      // %21\n                    : \"cc\", 
\"memory\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\");\n            }\n\n            if (remain >= 4)\n            {\n                remain -= 4;\n                asm volatile(\n                    // r0\n                    \"prfm   pldl1keep, [%1, #256]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%1]          \\n\" // v16 v17 = r00 r04\n\n                    \"mov    v8.16b, %19.16b                 \\n\" // v8 = _bias0\n\n                    \"add    %1, %1, #16                     \\n\"\n\n                    \"fmul   v9.4s, v16.4s, %12.s[0]         \\n\"\n\n                    \"ext    v18.16b, v16.16b, v17.16b, #4   \\n\" // r01\n\n                    \"fmla   v8.4s, v17.4s, %13.s[0]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\" // r02\n\n                    \"fmla   v9.4s, v18.4s, %12.s[1]         \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #12  \\n\" // r03\n\n                    \"fmla   v8.4s, v19.4s, %12.s[2]         \\n\"\n\n                    // r1\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%2]          \\n\" // v10 v11 = r10 r14\n\n                    \"fmla   v9.4s, v20.4s, %12.s[3]         \\n\"\n\n                    \"add    %2, %2, #16                     \\n\"\n\n                    \"fmla   v8.4s, v10.4s, %13.s[1]         \\n\"\n\n                    \"ext    v12.16b, v10.16b, v11.16b, #4   \\n\" // r11\n\n                    \"fmla   v9.4s, v11.4s, %14.s[1]         \\n\"\n\n                    \"ext    v13.16b, v10.16b, v11.16b, #8   \\n\" // r12\n\n                    \"fmla   v8.4s, v12.4s, %13.s[2]         \\n\"\n\n                    \"ext    v14.16b, v10.16b, v11.16b, #12  \\n\" // r13\n\n                    \"fmla   v9.4s, v13.4s, %13.s[3]         \\n\"\n\n                    // r2\n                    
\"prfm   pldl1keep, [%3, #256]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%3]          \\n\" // v16 v17 = r20 r24\n\n                    \"fmla   v8.4s, v14.4s, %14.s[0]         \\n\"\n\n                    \"add    %3, %3, #16                     \\n\"\n\n                    \"fmla   v9.4s, v16.4s, %14.s[2]         \\n\"\n\n                    \"ext    v18.16b, v16.16b, v17.16b, #4   \\n\" // r21\n\n                    \"fmla   v8.4s, v17.4s, %15.s[2]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\" // r22\n\n                    \"fmla   v9.4s, v18.4s, %14.s[3]         \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #12  \\n\" // r23\n\n                    \"fmla   v8.4s, v19.4s, %15.s[0]         \\n\"\n\n                    // r3\n                    \"prfm   pldl1keep, [%4, #256]           \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%4]          \\n\" // v10 v11 = r30 r34\n\n                    \"fmla   v9.4s, v20.4s, %15.s[1]         \\n\"\n\n                    \"add    %4, %4, #16                     \\n\"\n\n                    \"fmla   v8.4s, v10.4s, %15.s[3]         \\n\"\n\n                    \"ext    v12.16b, v10.16b, v11.16b, #4   \\n\" // r31\n\n                    \"fmla   v9.4s, v11.4s, %16.s[3]         \\n\"\n\n                    \"ext    v13.16b, v10.16b, v11.16b, #8   \\n\" // r32\n\n                    \"fmla   v8.4s, v12.4s, %16.s[0]         \\n\"\n\n                    \"ext    v14.16b, v10.16b, v11.16b, #12  \\n\" // r33\n\n                    \"fmla   v9.4s, v13.4s, %16.s[1]         \\n\"\n\n                    // r4\n                    \"prfm   pldl1keep, [%5, #256]           \\n\"\n                    \"ld1    {v16.4s, v17.4s}, [%5]          \\n\" // v16 v17 = r40 r44\n\n                    \"fmla   v8.4s, v14.4s, %16.s[2]         \\n\"\n\n                    \"add    %5, %5, #16                     \\n\"\n\n                    \"fmla   v9.4s, v16.4s, 
%17.s[0]         \\n\"\n\n                    \"ext    v18.16b, v16.16b, v17.16b, #4   \\n\" // r41\n\n                    \"fmla   v8.4s, v17.4s, %18.s[0]         \\n\"\n\n                    \"ext    v19.16b, v16.16b, v17.16b, #8   \\n\" // r42\n\n                    \"fmla   v9.4s, v18.4s, %17.s[1]         \\n\"\n\n                    \"ext    v20.16b, v16.16b, v17.16b, #12  \\n\" // r43\n\n                    \"fmla   v8.4s, v19.4s, %17.s[2]         \\n\"\n\n                    \"fmla   v9.4s, v20.4s, %17.s[3]         \\n\"\n\n                    \"fadd   v8.4s, v8.4s, v9.4s             \\n\"\n\n                    \"st1    {v8.4s}, [%0], #16              \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2),     // %3\n                    \"=r\"(r3),     // %4\n                    \"=r\"(r4)      // %5\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k0123),     // %12\n                    \"w\"(_k4567),     // %13\n                    \"w\"(_k891011),   // %14\n                    \"w\"(_k12131415), // %15\n                    \"w\"(_k16171819), // %16\n                    \"w\"(_k20212223), // %17\n                    \"w\"(_k24242424), // %18\n                    \"w\"(_bias0)      // %19\n                    : \"cc\", \"memory\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    // r0\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d20-d23}, [%2]     \\n\" // q10 q11 = r00 r04\n\n                    \"vmov       q8, %q21            \\n\" // q8 = 
_bias0\n\n                    \"0:                             \\n\"\n\n                    \"vmul.f32   q9, q10, %e14[0]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #1   \\n\" // r01\n\n                    \"vmla.f32   q8, q11, %e15[0]    \\n\"\n\n                    \"vext.32    q13, q10, q11, #2   \\n\" // r02\n\n                    \"vmla.f32   q9, q12, %e14[1]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #3   \\n\" // r03\n\n                    \"vmla.f32   q8, q13, %f14[0]    \\n\"\n\n                    // r1\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.f32   {d28-d31}, [%3]     \\n\" // q14 q15 = r10 r14\n\n                    \"vmla.f32   q9, q12, %f14[1]    \\n\"\n\n                    \"add        %3, #16             \\n\"\n\n                    \"vmla.f32   q8, q14, %e15[1]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #1   \\n\" // r11\n\n                    \"vmla.f32   q9, q15, %e16[1]    \\n\"\n\n                    \"vext.32    q13, q14, q15, #2   \\n\" // r12\n\n                    \"vmla.f32   q8, q12, %f15[0]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #3   \\n\" // r13\n\n                    \"vmla.f32   q9, q13, %f15[1]    \\n\"\n\n                    // r2\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.f32   {d20-d23}, [%4]     \\n\" // q10 q11 = r20 r24\n\n                    \"vmla.f32   q8, q12, %e16[0]    \\n\"\n\n                    \"add        %4, #16             \\n\"\n\n                    \"vmla.f32   q9, q10, %f16[0]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #1   \\n\" // r21\n\n                    \"vmla.f32   q8, q11, %f17[0]    \\n\"\n\n                    \"vext.32    q13, q10, q11, #2   \\n\" // r22\n\n                    \"vmla.f32   q9, q12, %f16[1]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #3   \\n\" // r23\n\n                    \"vmla.f32   q8, 
q13, %e17[0]    \\n\"\n\n                    // r3\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.f32   {d28-d31}, [%5]     \\n\" // q14 q15 = r30 r34\n\n                    \"vmla.f32   q9, q12, %e17[1]    \\n\"\n\n                    \"add        %5, #16             \\n\"\n\n                    \"vmla.f32   q8, q14, %f17[1]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #1   \\n\" // r31\n\n                    \"vmla.f32   q9, q15, %f18[1]    \\n\"\n\n                    \"vext.32    q13, q14, q15, #2   \\n\" // r32\n\n                    \"vmla.f32   q8, q12, %e18[0]    \\n\"\n\n                    \"vext.32    q12, q14, q15, #3   \\n\" // r33\n\n                    \"vmla.f32   q9, q13, %e18[1]    \\n\"\n\n                    // r4\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.f32   {d20-d23}, [%6]     \\n\" // q10 q11 = r40 r44\n\n                    \"vmla.f32   q8, q12, %f18[0]    \\n\"\n\n                    \"add        %6, #16             \\n\"\n\n                    \"vmla.f32   q9, q10, %e19[0]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #1   \\n\" // r41\n\n                    \"vmla.f32   q8, q11, %e20[0]    \\n\"\n\n                    \"vext.32    q13, q10, q11, #2   \\n\" // r42\n\n                    \"vmla.f32   q9, q12, %e19[1]    \\n\"\n\n                    \"vext.32    q12, q10, q11, #3   \\n\" // r43\n\n                    \"vmla.f32   q8, q13, %f19[0]    \\n\"\n\n                    \"add        %2, #16             \\n\"\n\n                    \"vmla.f32   q9, q12, %f19[1]    \\n\"\n\n                    // r0\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d20-d23}, [%2]     \\n\" // q10 q11 = r00 r04\n\n                    \"vadd.f32   q9, q9, q8          \\n\"\n\n                    \"vmov       q8, %q21            \\n\" // q8 = _bias0\n\n                    \"subs       %0, #1     
         \\n\"\n\n                    \"vst1.f32   {d18-d19}, [%1]!    \\n\"\n\n                    \"bne        0b                  \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2),     // %4\n                    \"=r\"(r3),     // %5\n                    \"=r\"(r4)      // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"w\"(_k0123),     // %14\n                    \"w\"(_k4567),     // %15\n                    \"w\"(_k891011),   // %16\n                    \"w\"(_k12131415), // %17\n                    \"w\"(_k16171819), // %18\n                    \"w\"(_k20212223), // %19\n                    \"w\"(_k24242424), // %20\n                    \"w\"(_bias0)      // %21\n                    : \"cc\", \"memory\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n#if __ARM_NEON\n#if __aarch64__\n                // TODO neon assembly optimize\n                float sum = bias0;\n\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _sum = vmulq_f32(_r0, _k0123);\n\n                float32x4_t _r1 = vld1q_f32(r1);\n                _sum = vmlaq_f32(_sum, _r1, vld1q_f32(k1));\n\n                float32x4_t _r2 = vld1q_f32(r2);\n                _sum = vmlaq_f32(_sum, _r2, vld1q_f32(k2));\n\n                float32x4_t _r3 = vld1q_f32(r3);\n                _sum = vmlaq_f32(_sum, _r3, vld1q_f32(k3));\n\n                float32x4_t _r4 = vld1q_f32(r4);\n                _sum = vmlaq_f32(_sum, _r4, _k20212223);\n\n                float32x4_t _k_t4 = 
{};\n\n                _k_t4 = vsetq_lane_f32(k0[4], _k_t4, 0);\n                _k_t4 = vsetq_lane_f32(k1[4], _k_t4, 1);\n                _k_t4 = vsetq_lane_f32(k2[4], _k_t4, 2);\n                _k_t4 = vsetq_lane_f32(k3[4], _k_t4, 3);\n\n                float32x4_t _r_t4 = {};\n\n                _r_t4 = vsetq_lane_f32(r0[4], _r_t4, 0);\n                _r_t4 = vsetq_lane_f32(r1[4], _r_t4, 1);\n                _r_t4 = vsetq_lane_f32(r2[4], _r_t4, 2);\n                _r_t4 = vsetq_lane_f32(r3[4], _r_t4, 3);\n                _sum = vmlaq_f32(_sum, _r_t4, _k_t4);\n\n                sum += r4[4] * k4[4];\n\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                _ss = vpadd_f32(_ss, _ss);\n\n                sum += vget_lane_f32(_ss, 0);\n\n                *outptr = sum;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                r4++;\n                outptr++;\n#else\n                // TODO neon assembly optimize\n                asm volatile(\n                    \"veor       q14, q14            \\n\"\n                    \"vext.32    q14, %q19, q14, #3  \\n\" // q14 = bias0 0 0 0\n\n                    \"vld1.f32   {d16-d17}, [%1]     \\n\" // q8 = r00 r01 r02 r03\n\n                    \"vld1.f32   {d18-d19}, [%2]     \\n\" // q9 = r10 r11 r12 r13(X)\n                    \"add        r4, %1, #16         \\n\"\n                    \"vld1.f32   {d19[1]}, [r4]      \\n\"\n                    \"vext.32    q9, q9, q9, #3      \\n\" // q9 = r04 r10 r11 r12\n\n                    \"vmla.f32   q14, q8, %q12       \\n\"\n\n                    \"add        r4, %2, #12         \\n\"\n                    \"vld1.f32   {d20}, [r4]         \\n\" // d20 = r13 r14\n                    \"vld1.f32   {d21}, [%3]         \\n\" // d21 = r20 r21\n\n                    \"vmla.f32   q14, q9, %q13       \\n\"\n\n                    \"add        r4, %3, #8          \\n\"\n                    
\"vld1.f32   {d22-d23}, [r4]     \\n\" // q11 = r22 r23 r24 X\n                    \"vld1.f32   {d23[1]}, [%4]      \\n\" // q11 = r22 r23 r24 r30\n\n                    \"vmla.f32   q14, q10, %q14      \\n\"\n\n                    \"add        r4, %4, #4          \\n\"\n                    \"vld1.f32   {d24-d25}, [r4]     \\n\" // q12 = r31 r32 r33 r34\n\n                    \"vmla.f32   q14, q11, %q15      \\n\"\n\n                    \"vld1.f32   {d26-d27}, [%5]     \\n\" // q13 = r40 r41 r42 r43\n\n                    \"vmla.f32   q14, q12, %q16      \\n\"\n\n                    \"veor       d30, d30            \\n\"\n                    \"add        r4, %5, #16         \\n\"\n                    \"vld1.f32   {d30[0]}, [r4]      \\n\" // d30 = r44 0\n\n                    \"vmla.f32   q14, q13, %q17      \\n\"\n\n                    \"vmla.f32   d28, d30, %e18      \\n\"\n\n                    \"add        %1, #4              \\n\"\n\n                    // h-sum\n                    \"vadd.f32   d28, d28, d29       \\n\"\n\n                    \"add        %2, #4              \\n\"\n                    \"add        %3, #4              \\n\"\n\n                    \"vpadd.f32  d28, d28, d28       \\n\"\n\n                    \"add        %4, #4              \\n\"\n                    \"add        %5, #4              \\n\"\n\n                    \"vst1.f32   {d28[0]}, [%0]!     
\\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2),     // %3\n                    \"=r\"(r3),     // %4\n                    \"=r\"(r4)      // %5\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k0123),     // %12\n                    \"w\"(_k4567),     // %13\n                    \"w\"(_k891011),   // %14\n                    \"w\"(_k12131415), // %15\n                    \"w\"(_k16171819), // %16\n                    \"w\"(_k20212223), // %17\n                    \"w\"(_k24242424), // %18\n                    \"w\"(_bias0)      // %19\n                    : \"cc\", \"memory\", \"r4\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else\n                float sum = bias0;\n\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r0[3] * k0[3];\n                sum += r0[4] * k0[4];\n\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r1[3] * k1[3];\n                sum += r1[4] * k1[4];\n\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n                sum += r2[3] * k2[3];\n                sum += r2[4] * k2[4];\n\n                sum += r3[0] * k3[0];\n                sum += r3[1] * k3[1];\n                sum += r3[2] * k3[2];\n                sum += r3[3] * k3[3];\n                sum += r3[4] * k3[4];\n\n                sum += r4[0] * k4[0];\n                sum += r4[1] * k4[1];\n                sum += r4[2] * k4[2];\n                sum += r4[3] * k4[3];\n                sum += 
r4[4] * k4[4];\n\n                *outptr = sum;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                r4++;\n                outptr++;\n#endif\n            }\n\n            r0 += 4;\n            r1 += 4;\n            r2 += 4;\n            r3 += 4;\n            r4 += 4;\n        }\n    }\n}\n\nstatic void convdw5x5s2_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    //int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    //int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const int group = bottom_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const float bias0 = bias ? bias[g] : 0.f;\n\n        const float* kernel0 = kernel + g * 25;\n\n        float* outptr = out;\n\n        const float* img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n        const float* r2 = img0 + w * 2;\n        const float* r3 = img0 + w * 3;\n        const float* r4 = img0 + w * 4;\n\n        const float* k0 = kernel0;\n        const float* k1 = kernel0 + 5;\n        const float* k2 = kernel0 + 10;\n        const float* k3 = kernel0 + 15;\n        const float* k4 = kernel0 + 20;\n\n#if __ARM_NEON\n        float32x4_t _k0123 = vld1q_f32(kernel0);\n        float32x4_t _k4567 = vld1q_f32(kernel0 + 4);\n        float32x4_t _k891011 = vld1q_f32(kernel0 + 8);\n        float32x4_t _k12131415 = vld1q_f32(kernel0 + 12);\n        float32x4_t _k16171819 = vld1q_f32(kernel0 + 16);\n        float32x4_t _k20212223 = vld1q_f32(kernel0 + 20);\n        float32x4_t _k24242424 = vdupq_n_f32(kernel0[24]);\n\n        float32x4_t _bias0 = vdupq_n_f32(bias0);\n#endif // 
__ARM_NEON\n\n        int i = 0;\n\n        // NOTE unrolling outh by 2 causes a slight slowdown (about 4%),\n        // so it is not implemented here\n\n        for (; i < outh; i++)\n        {\n#if __ARM_NEON\n#if __aarch64__\n            int nn = outw >> 3;\n            int remain = outw & 7;\n#else\n            int nn = outw >> 2;\n            int remain = outw & 3;\n#endif // __aarch64__\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    // r0\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld2    {v16.4s, v17.4s}, [%2], #32     \\n\" // v16 v17 = r00 r01\n\n                    \"mov    v8.16b, %21.16b                 \\n\" // v8 = _bias0\n                    \"mov    v9.16b, %21.16b                 \\n\" // v9 = _bias0\n\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld2    {v18.4s, v19.4s}, [%2], #32     \\n\" // v18 v19 = r08 r09\n\n                    \"0:                                     \\n\"\n\n                    \"fmul   v10.4s, v16.4s, %14.s[0]        \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld2    {v20.4s, v21.4s}, [%2]          \\n\" // v20 v21 = r016 r017\n\n                    \"fmul   v11.4s, v18.4s, %14.s[0]        \\n\"\n\n                    \"ext    v22.16b, v16.16b, v18.16b, #4   \\n\" // v22 = r02\n\n                    \"fmla   v8.4s, v17.4s, %14.s[1]         \\n\"\n\n                    \"ext    v25.16b, v18.16b, v20.16b, #4   \\n\" // v25 = r010\n\n                    \"fmla   v9.4s, v19.4s, %14.s[1]         \\n\"\n\n                    \"ext    v23.16b, v17.16b, v19.16b, #4   \\n\" // v23 = r03\n\n                    \"fmla   v10.4s, v22.4s, %14.s[2]        \\n\"\n\n                    \"ext    v26.16b, v19.16b, v21.16b, #4   \\n\" // v26 = r011\n\n       
             \"fmla   v11.4s, v25.4s, %14.s[2]        \\n\"\n\n                    \"ext    v24.16b, v16.16b, v18.16b, #8   \\n\" // v24 = r04\n\n                    \"fmla   v8.4s, v23.4s, %14.s[3]         \\n\"\n\n                    \"ext    v27.16b, v18.16b, v20.16b, #8   \\n\" // v27 = r012\n\n                    \"fmla   v9.4s, v26.4s, %14.s[3]         \\n\"\n\n                    // r1\n                    \"prfm   pldl1keep, [%3, #256]           \\n\"\n                    \"ld2    {v12.4s, v13.4s}, [%3], #32     \\n\" // v12 v13 = r10 r11\n\n                    \"fmla   v10.4s, v24.4s, %15.s[0]        \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]           \\n\"\n                    \"ld2    {v14.4s, v15.4s}, [%3], #32     \\n\" // v14 v15 = r18 r19\n\n                    \"fmla   v11.4s, v27.4s, %15.s[0]        \\n\"\n\n                    \"fmla   v8.4s, v12.4s, %15.s[1]         \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]           \\n\"\n                    \"ld2    {v20.4s, v21.4s}, [%3]          \\n\" // v20 v21 = r116 r117\n\n                    \"fmla   v9.4s, v14.4s, %15.s[1]         \\n\"\n\n                    \"ext    v22.16b, v12.16b, v14.16b, #4   \\n\" // v22 = r12\n\n                    \"fmla   v10.4s, v13.4s, %15.s[2]        \\n\"\n\n                    \"ext    v25.16b, v14.16b, v20.16b, #4   \\n\" // v25 = r110\n\n                    \"fmla   v11.4s, v15.4s, %15.s[2]        \\n\"\n\n                    \"ext    v23.16b, v13.16b, v15.16b, #4   \\n\" // v23 = r13\n\n                    \"fmla   v8.4s, v22.4s, %15.s[3]         \\n\"\n\n                    \"ext    v26.16b, v15.16b, v21.16b, #4   \\n\" // v26 = r111\n\n                    \"fmla   v9.4s, v25.4s, %15.s[3]         \\n\"\n\n                    \"ext    v24.16b, v12.16b, v14.16b, #8   \\n\" // v24 = r14\n\n                    \"fmla   v10.4s, v23.4s, %16.s[0]        \\n\"\n\n                    \"ext    v27.16b, v14.16b, v20.16b, #8   \\n\" // v27 = 
r112\n\n                    \"fmla   v11.4s, v26.4s, %16.s[0]        \\n\"\n\n                    // r2\n                    \"prfm   pldl1keep, [%4, #256]           \\n\"\n                    \"ld2    {v16.4s, v17.4s}, [%4], #32     \\n\" // v16 v17 = r20 r21\n\n                    \"fmla   v8.4s, v24.4s, %16.s[1]         \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]           \\n\"\n                    \"ld2    {v18.4s, v19.4s}, [%4], #32     \\n\" // v18 v19 = r28 r29\n\n                    \"fmla   v9.4s, v27.4s, %16.s[1]         \\n\"\n\n                    \"fmla   v10.4s, v16.4s, %16.s[2]        \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]           \\n\"\n                    \"ld2    {v20.4s, v21.4s}, [%4]          \\n\" // v20 v21 = r216 r217\n\n                    \"fmla   v11.4s, v18.4s, %16.s[2]        \\n\"\n\n                    \"ext    v22.16b, v16.16b, v18.16b, #4   \\n\" // v22 = r22\n\n                    \"fmla   v8.4s, v17.4s, %16.s[3]         \\n\"\n\n                    \"ext    v25.16b, v18.16b, v20.16b, #4   \\n\" // v25 = r210\n\n                    \"fmla   v9.4s, v19.4s, %16.s[3]         \\n\"\n\n                    \"ext    v23.16b, v17.16b, v19.16b, #4   \\n\" // v23 = r23\n\n                    \"fmla   v10.4s, v22.4s, %17.s[0]        \\n\"\n\n                    \"ext    v26.16b, v19.16b, v21.16b, #4   \\n\" // v26 = r211\n\n                    \"fmla   v11.4s, v25.4s, %17.s[0]        \\n\"\n\n                    \"ext    v24.16b, v16.16b, v18.16b, #8   \\n\" // v24 = r24\n\n                    \"fmla   v8.4s, v23.4s, %17.s[1]         \\n\"\n\n                    \"ext    v27.16b, v18.16b, v20.16b, #8   \\n\" // v27 = r212\n\n                    \"fmla   v9.4s, v26.4s, %17.s[1]         \\n\"\n\n                    // r3\n                    \"prfm   pldl1keep, [%5, #256]           \\n\"\n                    \"ld2    {v12.4s, v13.4s}, [%5], #32     \\n\" // v12 v13 = r30 r31\n\n                    \"fmla   
v10.4s, v24.4s, %17.s[2]        \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]           \\n\"\n                    \"ld2    {v14.4s, v15.4s}, [%5], #32     \\n\" // v14 v15 = r38 r39\n\n                    \"fmla   v11.4s, v27.4s, %17.s[2]        \\n\"\n\n                    \"fmla   v8.4s, v12.4s, %17.s[3]         \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]           \\n\"\n                    \"ld2    {v20.4s, v21.4s}, [%5]          \\n\" // v20 v21 = r316 r317\n\n                    \"fmla   v9.4s, v14.4s, %17.s[3]         \\n\"\n\n                    \"ext    v22.16b, v12.16b, v14.16b, #4   \\n\" // v22 = r32\n\n                    \"fmla   v10.4s, v13.4s, %18.s[0]        \\n\"\n\n                    \"ext    v25.16b, v14.16b, v20.16b, #4   \\n\" // v25 = r310\n\n                    \"fmla   v11.4s, v15.4s, %18.s[0]        \\n\"\n\n                    \"ext    v23.16b, v13.16b, v15.16b, #4   \\n\" // v23 = r33\n\n                    \"fmla   v8.4s, v22.4s, %18.s[1]         \\n\"\n\n                    \"ext    v26.16b, v15.16b, v21.16b, #4   \\n\" // v26 = r311\n\n                    \"fmla   v9.4s, v25.4s, %18.s[1]         \\n\"\n\n                    \"ext    v24.16b, v12.16b, v14.16b, #8   \\n\" // v24 = r34\n\n                    \"fmla   v10.4s, v23.4s, %18.s[2]        \\n\"\n\n                    \"ext    v27.16b, v14.16b, v20.16b, #8   \\n\" // v27 = r312\n\n                    \"fmla   v11.4s, v26.4s, %18.s[2]        \\n\"\n\n                    // r4\n                    \"prfm   pldl1keep, [%6, #256]           \\n\"\n                    \"ld2    {v16.4s, v17.4s}, [%6], #32     \\n\" // v16 v17 = r40 r41\n\n                    \"fmla   v8.4s, v24.4s, %18.s[3]         \\n\"\n\n                    \"prfm   pldl1keep, [%6, #256]           \\n\"\n                    \"ld2    {v18.4s, v19.4s}, [%6], #32     \\n\" // v18 v19 = r48 r49\n\n                    \"fmla   v9.4s, v27.4s, %18.s[3]         \\n\"\n\n                    
\"fmla   v10.4s, v16.4s, %19.s[0]        \\n\"\n\n                    \"prfm   pldl1keep, [%6, #256]           \\n\"\n                    \"ld2    {v20.4s, v21.4s}, [%6]          \\n\" // v20 v21 = r416 r417\n\n                    \"fmla   v11.4s, v18.4s, %19.s[0]        \\n\"\n\n                    \"ext    v22.16b, v16.16b, v18.16b, #4   \\n\" // v22 = r42\n\n                    \"fmla   v8.4s, v17.4s, %19.s[1]         \\n\"\n\n                    \"ext    v25.16b, v18.16b, v20.16b, #4   \\n\" // v25 = r410\n\n                    \"fmla   v9.4s, v19.4s, %19.s[1]         \\n\"\n\n                    \"ext    v23.16b, v17.16b, v19.16b, #4   \\n\" // v23 = r43\n\n                    \"fmla   v10.4s, v22.4s, %19.s[2]        \\n\"\n\n                    \"ext    v26.16b, v19.16b, v21.16b, #4   \\n\" // v26 = r411\n\n                    \"fmla   v11.4s, v25.4s, %19.s[2]        \\n\"\n\n                    \"ext    v24.16b, v16.16b, v18.16b, #8   \\n\" // v24 = r44\n\n                    \"fmla   v8.4s, v23.4s, %19.s[3]         \\n\"\n\n                    \"ext    v27.16b, v18.16b, v20.16b, #8   \\n\" // v27 = r412\n\n                    \"fmla   v9.4s, v26.4s, %19.s[3]         \\n\"\n                    \"fmla   v10.4s, v24.4s, %20.s[0]        \\n\"\n\n                    // r0\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld2    {v16.4s, v17.4s}, [%2], #32     \\n\" // v16 v17 = r00 r01\n\n                    \"fmla   v11.4s, v27.4s, %20.s[0]        \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]           \\n\"\n                    \"ld2    {v18.4s, v19.4s}, [%2], #32     \\n\" // v18 v19 = r08 r09\n\n                    \"fadd   v10.4s, v8.4s, v10.4s           \\n\"\n                    \"fadd   v11.4s, v9.4s, v11.4s           \\n\"\n\n                    \"subs   %w0, %w0, #1                    \\n\"\n\n                    \"mov    v8.16b, %21.16b                 \\n\" // v8 = _bias0\n                    
\"mov    v9.16b, %21.16b                 \\n\" // v9 = _bias0\n\n                    \"st1    {v10.4s, v11.4s}, [%1], #32     \\n\"\n\n                    \"bne    0b                              \\n\"\n                    \"sub    %2, %2, #64                     \\n\"\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2),     // %4\n                    \"=r\"(r3),     // %5\n                    \"=r\"(r4)      // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"w\"(_k0123),     // %14\n                    \"w\"(_k4567),     // %15\n                    \"w\"(_k891011),   // %16\n                    \"w\"(_k12131415), // %17\n                    \"w\"(_k16171819), // %18\n                    \"w\"(_k20212223), // %19\n                    \"w\"(_k24242424), // %20\n                    \"w\"(_bias0)      // %21\n                    : \"cc\", \"memory\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    // r0\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld2.f32   {d20-d23}, [%2]!    
\\n\" // q10 q11 = r00 r01\n\n                    \"vmov       q8, %q21            \\n\"\n\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld2.f32   {d24-d25}, [%2]     \\n\" // q12 = r08 x x\n\n                    \"0:                             \\n\"\n\n                    \"vmul.f32   q9, q10, %e14[0]    \\n\"\n\n                    \"vmov       d26, d25            \\n\" // q13 = r09 x x\n\n                    \"vext.32    q14, q10, q12, #1   \\n\" // q14 = r02\n\n                    \"vmla.f32   q8, q11, %e14[1]    \\n\"\n\n                    \"vext.32    q15, q11, q13, #1   \\n\" // q15 = r03\n\n                    \"vmla.f32   q9, q14, %f14[0]    \\n\"\n\n                    \"vext.32    q14, q10, q12, #2   \\n\" // q14 = r04\n\n                    \"vmla.f32   q8, q15, %f14[1]    \\n\"\n\n                    // r1\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld2.f32   {d20-d23}, [%3]!    \\n\" // q10 q11 = r10 r11\n\n                    \"vmla.f32   q9, q14, %e15[0]    \\n\"\n\n                    \"pld        [%3, #128]          \\n\"\n                    \"vld2.f32   {d24-d25}, [%3]     \\n\" // q12 = r18 x x\n\n                    \"vmla.f32   q8, q10, %e15[1]    \\n\"\n\n                    \"vmov       d26, d25            \\n\" // q13 = r19 x x\n\n                    \"vext.32    q14, q10, q12, #1   \\n\" // q14 = r12\n\n                    \"vmla.f32   q9, q11, %f15[0]    \\n\"\n\n                    \"vext.32    q15, q11, q13, #1   \\n\" // q15 = r13\n\n                    \"vmla.f32   q8, q14, %f15[1]    \\n\"\n\n                    \"vext.32    q14, q10, q12, #2   \\n\" // q14 = r14\n\n                    \"vmla.f32   q9, q15, %e16[0]    \\n\"\n\n                    // r2\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld2.f32   {d20-d23}, [%4]!    
\\n\" // q10 q11 = r20 r21\n\n                    \"vmla.f32   q8, q14, %e16[1]    \\n\"\n\n                    \"pld        [%4, #128]          \\n\"\n                    \"vld2.f32   {d24-d25}, [%4]     \\n\" // q12 = r28 x x\n\n                    \"vmla.f32   q9, q10, %f16[0]    \\n\"\n\n                    \"vmov       d26, d25            \\n\" // q13 = r29 x x\n\n                    \"vext.32    q14, q10, q12, #1   \\n\" // q14 = r22\n\n                    \"vmla.f32   q8, q11, %f16[1]    \\n\"\n\n                    \"vext.32    q15, q11, q13, #1   \\n\" // q15 = r23\n\n                    \"vmla.f32   q9, q14, %e17[0]    \\n\"\n\n                    \"vext.32    q14, q10, q12, #2   \\n\" // q14 = r24\n\n                    \"vmla.f32   q8, q15, %e17[1]    \\n\"\n\n                    // r3\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld2.f32   {d20-d23}, [%5]!    \\n\" // q10 q11 = r30 r31\n\n                    \"vmla.f32   q9, q14, %f17[0]    \\n\"\n\n                    \"pld        [%5, #128]          \\n\"\n                    \"vld2.f32   {d24-d25}, [%5]     \\n\" // q12 = r38 x x\n\n                    \"vmla.f32   q8, q10, %f17[1]    \\n\"\n\n                    \"vmov       d26, d25            \\n\" // q13 = r39 x x\n\n                    \"vext.32    q14, q10, q12, #1   \\n\" // q14 = r32\n\n                    \"vmla.f32   q9, q11, %e18[0]    \\n\"\n\n                    \"vext.32    q15, q11, q13, #1   \\n\" // q15 = r33\n\n                    \"vmla.f32   q8, q14, %e18[1]    \\n\"\n\n                    \"vext.32    q14, q10, q12, #2   \\n\" // q14 = r34\n\n                    \"vmla.f32   q9, q15, %f18[0]    \\n\"\n\n                    // r4\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld2.f32   {d20-d23}, [%6]!    
\\n\" // q10 q11 = r40 r41\n\n                    \"vmla.f32   q8, q14, %f18[1]    \\n\"\n\n                    \"pld        [%6, #128]          \\n\"\n                    \"vld2.f32   {d24-d25}, [%6]     \\n\" // q12 = r48 x x\n\n                    \"vmla.f32   q9, q10, %e19[0]    \\n\"\n\n                    \"vmov       d26, d25            \\n\" // q13 = r49 x x\n\n                    \"vext.32    q14, q10, q12, #1   \\n\" // q14 = r42\n\n                    \"vmla.f32   q8, q11, %e19[1]    \\n\"\n\n                    \"vext.32    q15, q11, q13, #1   \\n\" // q15 = r43\n\n                    \"vmla.f32   q9, q14, %f19[0]    \\n\"\n\n                    \"vext.32    q14, q10, q12, #2   \\n\" // q14 = r44\n\n                    \"vmla.f32   q8, q15, %f19[1]    \\n\"\n\n                    // r0\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld2.f32   {d20-d23}, [%2]!    \\n\" // q10 q11 = r00 r01\n\n                    \"vmla.f32   q9, q14, %e20[0]    \\n\"\n\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld2.f32   {d24-d25}, [%2]     \\n\" // q12 = r08 x x\n\n                    \"vadd.f32   q9, q8, q9          \\n\"\n\n                    \"vmov       q8, %q21            \\n\"\n\n                    \"subs       %0, #1              \\n\"\n\n                    \"vst1.f32   {d18-d19}, [%1]!    
\\n\"\n\n                    \"bne        0b                  \\n\"\n                    \"sub        %2, #32             \\n\"\n\n                    : \"=r\"(nn),     // %0\n                    \"=r\"(outptr), // %1\n                    \"=r\"(r0),     // %2\n                    \"=r\"(r1),     // %3\n                    \"=r\"(r2),     // %4\n                    \"=r\"(r3),     // %5\n                    \"=r\"(r4)      // %6\n                    : \"0\"(nn),\n                    \"1\"(outptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"w\"(_k0123),     // %14\n                    \"w\"(_k4567),     // %15\n                    \"w\"(_k891011),   // %16\n                    \"w\"(_k12131415), // %17\n                    \"w\"(_k16171819), // %18\n                    \"w\"(_k20212223), // %19\n                    \"w\"(_k24242424), // %20\n                    \"w\"(_bias0)      // %21\n                    : \"cc\", \"memory\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                float sum = bias0;\n#if __ARM_NEON\n                // TODO neon assembly optimize\n                float32x4_t _r0 = vld1q_f32(r0);\n                float32x4_t _sum = vmulq_f32(_r0, _k0123);\n\n                float32x4_t _r1 = vld1q_f32(r1);\n                _sum = vmlaq_f32(_sum, _r1, vld1q_f32(k1));\n\n                float32x4_t _r2 = vld1q_f32(r2);\n                _sum = vmlaq_f32(_sum, _r2, vld1q_f32(k2));\n\n                float32x4_t _r3 = vld1q_f32(r3);\n                _sum = vmlaq_f32(_sum, _r3, vld1q_f32(k3));\n\n                float32x4_t _r4 = vld1q_f32(r4);\n                _sum = vmlaq_f32(_sum, _r4, _k20212223);\n\n                sum += r0[4] * k0[4];\n                sum += r1[4] 
* k1[4];\n                sum += r2[4] * k2[4];\n                sum += r3[4] * k3[4];\n                sum += r4[4] * k4[4];\n\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                _ss = vpadd_f32(_ss, _ss);\n\n                sum += vget_lane_f32(_ss, 0);\n#else\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r0[3] * k0[3];\n                sum += r0[4] * k0[4];\n\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r1[3] * k1[3];\n                sum += r1[4] * k1[4];\n\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n                sum += r2[3] * k2[3];\n                sum += r2[4] * k2[4];\n\n                sum += r3[0] * k3[0];\n                sum += r3[1] * k3[1];\n                sum += r3[2] * k3[2];\n                sum += r3[3] * k3[3];\n                sum += r3[4] * k3[4];\n\n                sum += r4[0] * k4[0];\n                sum += r4[1] * k4[1];\n                sum += r4[2] * k4[2];\n                sum += r4[3] * k4[3];\n                sum += r4[4] * k4[4];\n#endif\n                *outptr = sum;\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n                r3 += 2;\n                r4 += 2;\n                outptr++;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n            r3 += tailstep;\n            r4 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_5x5_pack4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw5x5s1_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n#if __aarch64__\n    const int w = bottom_blob.w;\n#endif\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out.row(0);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n        const float* r3 = img0.row(3);\n        const float* r4 = img0.row(4);\n\n        int i = 0;\n\n#if __aarch64__\n        float* outptr1 = out.row(1);\n        const float* r5 = img0.row(5);\n\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n                float32x4_t _sum00 = _bias0;\n                float32x4_t _sum01 = _bias0;\n                float32x4_t _sum02 = _bias0;\n                float32x4_t _sum03 = _bias0;\n                float32x4_t _sum10 = _bias0;\n                float32x4_t _sum11 = _bias0;\n                float32x4_t _sum12 = _bias0;\n                float32x4_t _sum13 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n                float32x4_t _r05 = vld1q_f32(r0 + 20);\n                float32x4_t _r06 = 
vld1q_f32(r0 + 24);\n                float32x4_t _r07 = vld1q_f32(r0 + 28);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum00 = vmlaq_f32(_sum00, _k00, _r00);\n                _sum00 = vmlaq_f32(_sum00, _k01, _r01);\n                _sum00 = vmlaq_f32(_sum00, _k02, _r02);\n                _sum00 = vmlaq_f32(_sum00, _k03, _r03);\n                _sum00 = vmlaq_f32(_sum00, _k04, _r04);\n                _sum01 = vmlaq_f32(_sum01, _k00, _r01);\n                _sum01 = vmlaq_f32(_sum01, _k01, _r02);\n                _sum01 = vmlaq_f32(_sum01, _k02, _r03);\n                _sum01 = vmlaq_f32(_sum01, _k03, _r04);\n                _sum01 = vmlaq_f32(_sum01, _k04, _r05);\n                _sum02 = vmlaq_f32(_sum02, _k00, _r02);\n                _sum02 = vmlaq_f32(_sum02, _k01, _r03);\n                _sum02 = vmlaq_f32(_sum02, _k02, _r04);\n                _sum02 = vmlaq_f32(_sum02, _k03, _r05);\n                _sum02 = vmlaq_f32(_sum02, _k04, _r06);\n                _sum03 = vmlaq_f32(_sum03, _k00, _r03);\n                _sum03 = vmlaq_f32(_sum03, _k01, _r04);\n                _sum03 = vmlaq_f32(_sum03, _k02, _r05);\n                _sum03 = vmlaq_f32(_sum03, _k03, _r06);\n                _sum03 = vmlaq_f32(_sum03, _k04, _r07);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n                float32x4_t _r15 = vld1q_f32(r1 + 20);\n                float32x4_t _r16 = vld1q_f32(r1 + 24);\n                float32x4_t _r17 = vld1q_f32(r1 + 28);\n\n                float32x4_t _k10 
= vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum10 = vmlaq_f32(_sum10, _k00, _r10);\n                _sum10 = vmlaq_f32(_sum10, _k01, _r11);\n                _sum10 = vmlaq_f32(_sum10, _k02, _r12);\n                _sum10 = vmlaq_f32(_sum10, _k03, _r13);\n                _sum10 = vmlaq_f32(_sum10, _k04, _r14);\n                _sum11 = vmlaq_f32(_sum11, _k00, _r11);\n                _sum11 = vmlaq_f32(_sum11, _k01, _r12);\n                _sum11 = vmlaq_f32(_sum11, _k02, _r13);\n                _sum11 = vmlaq_f32(_sum11, _k03, _r14);\n                _sum11 = vmlaq_f32(_sum11, _k04, _r15);\n                _sum12 = vmlaq_f32(_sum12, _k00, _r12);\n                _sum12 = vmlaq_f32(_sum12, _k01, _r13);\n                _sum12 = vmlaq_f32(_sum12, _k02, _r14);\n                _sum12 = vmlaq_f32(_sum12, _k03, _r15);\n                _sum12 = vmlaq_f32(_sum12, _k04, _r16);\n                _sum13 = vmlaq_f32(_sum13, _k00, _r13);\n                _sum13 = vmlaq_f32(_sum13, _k01, _r14);\n                _sum13 = vmlaq_f32(_sum13, _k02, _r15);\n                _sum13 = vmlaq_f32(_sum13, _k03, _r16);\n                _sum13 = vmlaq_f32(_sum13, _k04, _r17);\n\n                _sum00 = vmlaq_f32(_sum00, _k10, _r10);\n                _sum00 = vmlaq_f32(_sum00, _k11, _r11);\n                _sum00 = vmlaq_f32(_sum00, _k12, _r12);\n                _sum00 = vmlaq_f32(_sum00, _k13, _r13);\n                _sum00 = vmlaq_f32(_sum00, _k14, _r14);\n                _sum01 = vmlaq_f32(_sum01, _k10, _r11);\n                _sum01 = vmlaq_f32(_sum01, _k11, _r12);\n                _sum01 = vmlaq_f32(_sum01, _k12, _r13);\n                _sum01 = vmlaq_f32(_sum01, _k13, _r14);\n                _sum01 = vmlaq_f32(_sum01, _k14, _r15);\n                
_sum02 = vmlaq_f32(_sum02, _k10, _r12);\n                _sum02 = vmlaq_f32(_sum02, _k11, _r13);\n                _sum02 = vmlaq_f32(_sum02, _k12, _r14);\n                _sum02 = vmlaq_f32(_sum02, _k13, _r15);\n                _sum02 = vmlaq_f32(_sum02, _k14, _r16);\n                _sum03 = vmlaq_f32(_sum03, _k10, _r13);\n                _sum03 = vmlaq_f32(_sum03, _k11, _r14);\n                _sum03 = vmlaq_f32(_sum03, _k12, _r15);\n                _sum03 = vmlaq_f32(_sum03, _k13, _r16);\n                _sum03 = vmlaq_f32(_sum03, _k14, _r17);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n                float32x4_t _r25 = vld1q_f32(r2 + 20);\n                float32x4_t _r26 = vld1q_f32(r2 + 24);\n                float32x4_t _r27 = vld1q_f32(r2 + 28);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum10 = vmlaq_f32(_sum10, _k10, _r20);\n                _sum10 = vmlaq_f32(_sum10, _k11, _r21);\n                _sum10 = vmlaq_f32(_sum10, _k12, _r22);\n                _sum10 = vmlaq_f32(_sum10, _k13, _r23);\n                _sum10 = vmlaq_f32(_sum10, _k14, _r24);\n                _sum11 = vmlaq_f32(_sum11, _k10, _r21);\n                _sum11 = vmlaq_f32(_sum11, _k11, _r22);\n                _sum11 = vmlaq_f32(_sum11, _k12, _r23);\n                _sum11 = vmlaq_f32(_sum11, _k13, _r24);\n                _sum11 = vmlaq_f32(_sum11, _k14, _r25);\n                _sum12 = vmlaq_f32(_sum12, _k10, _r22);\n                _sum12 = vmlaq_f32(_sum12, _k11, _r23);\n               
 _sum12 = vmlaq_f32(_sum12, _k12, _r24);\n                _sum12 = vmlaq_f32(_sum12, _k13, _r25);\n                _sum12 = vmlaq_f32(_sum12, _k14, _r26);\n                _sum13 = vmlaq_f32(_sum13, _k10, _r23);\n                _sum13 = vmlaq_f32(_sum13, _k11, _r24);\n                _sum13 = vmlaq_f32(_sum13, _k12, _r25);\n                _sum13 = vmlaq_f32(_sum13, _k13, _r26);\n                _sum13 = vmlaq_f32(_sum13, _k14, _r27);\n\n                _sum00 = vmlaq_f32(_sum00, _k20, _r20);\n                _sum00 = vmlaq_f32(_sum00, _k21, _r21);\n                _sum00 = vmlaq_f32(_sum00, _k22, _r22);\n                _sum00 = vmlaq_f32(_sum00, _k23, _r23);\n                _sum00 = vmlaq_f32(_sum00, _k24, _r24);\n                _sum01 = vmlaq_f32(_sum01, _k20, _r21);\n                _sum01 = vmlaq_f32(_sum01, _k21, _r22);\n                _sum01 = vmlaq_f32(_sum01, _k22, _r23);\n                _sum01 = vmlaq_f32(_sum01, _k23, _r24);\n                _sum01 = vmlaq_f32(_sum01, _k24, _r25);\n                _sum02 = vmlaq_f32(_sum02, _k20, _r22);\n                _sum02 = vmlaq_f32(_sum02, _k21, _r23);\n                _sum02 = vmlaq_f32(_sum02, _k22, _r24);\n                _sum02 = vmlaq_f32(_sum02, _k23, _r25);\n                _sum02 = vmlaq_f32(_sum02, _k24, _r26);\n                _sum03 = vmlaq_f32(_sum03, _k20, _r23);\n                _sum03 = vmlaq_f32(_sum03, _k21, _r24);\n                _sum03 = vmlaq_f32(_sum03, _k22, _r25);\n                _sum03 = vmlaq_f32(_sum03, _k23, _r26);\n                _sum03 = vmlaq_f32(_sum03, _k24, _r27);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n                float32x4_t _r35 = vld1q_f32(r3 + 20);\n                float32x4_t _r36 = vld1q_f32(r3 + 24);\n                float32x4_t 
_r37 = vld1q_f32(r3 + 28);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum10 = vmlaq_f32(_sum10, _k20, _r30);\n                _sum10 = vmlaq_f32(_sum10, _k21, _r31);\n                _sum10 = vmlaq_f32(_sum10, _k22, _r32);\n                _sum10 = vmlaq_f32(_sum10, _k23, _r33);\n                _sum10 = vmlaq_f32(_sum10, _k24, _r34);\n                _sum11 = vmlaq_f32(_sum11, _k20, _r31);\n                _sum11 = vmlaq_f32(_sum11, _k21, _r32);\n                _sum11 = vmlaq_f32(_sum11, _k22, _r33);\n                _sum11 = vmlaq_f32(_sum11, _k23, _r34);\n                _sum11 = vmlaq_f32(_sum11, _k24, _r35);\n                _sum12 = vmlaq_f32(_sum12, _k20, _r32);\n                _sum12 = vmlaq_f32(_sum12, _k21, _r33);\n                _sum12 = vmlaq_f32(_sum12, _k22, _r34);\n                _sum12 = vmlaq_f32(_sum12, _k23, _r35);\n                _sum12 = vmlaq_f32(_sum12, _k24, _r36);\n                _sum13 = vmlaq_f32(_sum13, _k20, _r33);\n                _sum13 = vmlaq_f32(_sum13, _k21, _r34);\n                _sum13 = vmlaq_f32(_sum13, _k22, _r35);\n                _sum13 = vmlaq_f32(_sum13, _k23, _r36);\n                _sum13 = vmlaq_f32(_sum13, _k24, _r37);\n\n                _sum00 = vmlaq_f32(_sum00, _k30, _r30);\n                _sum00 = vmlaq_f32(_sum00, _k31, _r31);\n                _sum00 = vmlaq_f32(_sum00, _k32, _r32);\n                _sum00 = vmlaq_f32(_sum00, _k33, _r33);\n                _sum00 = vmlaq_f32(_sum00, _k34, _r34);\n                _sum01 = vmlaq_f32(_sum01, _k30, _r31);\n                _sum01 = vmlaq_f32(_sum01, _k31, _r32);\n                _sum01 = vmlaq_f32(_sum01, _k32, _r33);\n                _sum01 = vmlaq_f32(_sum01, _k33, _r34);\n             
   _sum01 = vmlaq_f32(_sum01, _k34, _r35);\n                _sum02 = vmlaq_f32(_sum02, _k30, _r32);\n                _sum02 = vmlaq_f32(_sum02, _k31, _r33);\n                _sum02 = vmlaq_f32(_sum02, _k32, _r34);\n                _sum02 = vmlaq_f32(_sum02, _k33, _r35);\n                _sum02 = vmlaq_f32(_sum02, _k34, _r36);\n                _sum03 = vmlaq_f32(_sum03, _k30, _r33);\n                _sum03 = vmlaq_f32(_sum03, _k31, _r34);\n                _sum03 = vmlaq_f32(_sum03, _k32, _r35);\n                _sum03 = vmlaq_f32(_sum03, _k33, _r36);\n                _sum03 = vmlaq_f32(_sum03, _k34, _r37);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n                float32x4_t _r45 = vld1q_f32(r4 + 20);\n                float32x4_t _r46 = vld1q_f32(r4 + 24);\n                float32x4_t _r47 = vld1q_f32(r4 + 28);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum10 = vmlaq_f32(_sum10, _k30, _r40);\n                _sum10 = vmlaq_f32(_sum10, _k31, _r41);\n                _sum10 = vmlaq_f32(_sum10, _k32, _r42);\n                _sum10 = vmlaq_f32(_sum10, _k33, _r43);\n                _sum10 = vmlaq_f32(_sum10, _k34, _r44);\n                _sum11 = vmlaq_f32(_sum11, _k30, _r41);\n                _sum11 = vmlaq_f32(_sum11, _k31, _r42);\n                _sum11 = vmlaq_f32(_sum11, _k32, _r43);\n                _sum11 = vmlaq_f32(_sum11, _k33, _r44);\n                _sum11 = vmlaq_f32(_sum11, _k34, _r45);\n                _sum12 = vmlaq_f32(_sum12, _k30, _r42);\n            
    _sum12 = vmlaq_f32(_sum12, _k31, _r43);\n                _sum12 = vmlaq_f32(_sum12, _k32, _r44);\n                _sum12 = vmlaq_f32(_sum12, _k33, _r45);\n                _sum12 = vmlaq_f32(_sum12, _k34, _r46);\n                _sum13 = vmlaq_f32(_sum13, _k30, _r43);\n                _sum13 = vmlaq_f32(_sum13, _k31, _r44);\n                _sum13 = vmlaq_f32(_sum13, _k32, _r45);\n                _sum13 = vmlaq_f32(_sum13, _k33, _r46);\n                _sum13 = vmlaq_f32(_sum13, _k34, _r47);\n\n                _sum00 = vmlaq_f32(_sum00, _k40, _r40);\n                _sum00 = vmlaq_f32(_sum00, _k41, _r41);\n                _sum00 = vmlaq_f32(_sum00, _k42, _r42);\n                _sum00 = vmlaq_f32(_sum00, _k43, _r43);\n                _sum00 = vmlaq_f32(_sum00, _k44, _r44);\n                _sum01 = vmlaq_f32(_sum01, _k40, _r41);\n                _sum01 = vmlaq_f32(_sum01, _k41, _r42);\n                _sum01 = vmlaq_f32(_sum01, _k42, _r43);\n                _sum01 = vmlaq_f32(_sum01, _k43, _r44);\n                _sum01 = vmlaq_f32(_sum01, _k44, _r45);\n                _sum02 = vmlaq_f32(_sum02, _k40, _r42);\n                _sum02 = vmlaq_f32(_sum02, _k41, _r43);\n                _sum02 = vmlaq_f32(_sum02, _k42, _r44);\n                _sum02 = vmlaq_f32(_sum02, _k43, _r45);\n                _sum02 = vmlaq_f32(_sum02, _k44, _r46);\n                _sum03 = vmlaq_f32(_sum03, _k40, _r43);\n                _sum03 = vmlaq_f32(_sum03, _k41, _r44);\n                _sum03 = vmlaq_f32(_sum03, _k42, _r45);\n                _sum03 = vmlaq_f32(_sum03, _k43, _r46);\n                _sum03 = vmlaq_f32(_sum03, _k44, _r47);\n\n                float32x4_t _r50 = vld1q_f32(r5);\n                float32x4_t _r51 = vld1q_f32(r5 + 4);\n                float32x4_t _r52 = vld1q_f32(r5 + 8);\n                float32x4_t _r53 = vld1q_f32(r5 + 12);\n                float32x4_t _r54 = vld1q_f32(r5 + 16);\n                float32x4_t _r55 = vld1q_f32(r5 + 20);\n                
float32x4_t _r56 = vld1q_f32(r5 + 24);\n                float32x4_t _r57 = vld1q_f32(r5 + 28);\n\n                _sum10 = vmlaq_f32(_sum10, _k40, _r50);\n                _sum10 = vmlaq_f32(_sum10, _k41, _r51);\n                _sum10 = vmlaq_f32(_sum10, _k42, _r52);\n                _sum10 = vmlaq_f32(_sum10, _k43, _r53);\n                _sum10 = vmlaq_f32(_sum10, _k44, _r54);\n                _sum11 = vmlaq_f32(_sum11, _k40, _r51);\n                _sum11 = vmlaq_f32(_sum11, _k41, _r52);\n                _sum11 = vmlaq_f32(_sum11, _k42, _r53);\n                _sum11 = vmlaq_f32(_sum11, _k43, _r54);\n                _sum11 = vmlaq_f32(_sum11, _k44, _r55);\n                _sum12 = vmlaq_f32(_sum12, _k40, _r52);\n                _sum12 = vmlaq_f32(_sum12, _k41, _r53);\n                _sum12 = vmlaq_f32(_sum12, _k42, _r54);\n                _sum12 = vmlaq_f32(_sum12, _k43, _r55);\n                _sum12 = vmlaq_f32(_sum12, _k44, _r56);\n                _sum13 = vmlaq_f32(_sum13, _k40, _r53);\n                _sum13 = vmlaq_f32(_sum13, _k41, _r54);\n                _sum13 = vmlaq_f32(_sum13, _k42, _r55);\n                _sum13 = vmlaq_f32(_sum13, _k43, _r56);\n                _sum13 = vmlaq_f32(_sum13, _k44, _r57);\n\n                vst1q_f32(outptr0, _sum00);\n                vst1q_f32(outptr0 + 4, _sum01);\n                vst1q_f32(outptr0 + 8, _sum02);\n                vst1q_f32(outptr0 + 12, _sum03);\n                vst1q_f32(outptr1, _sum10);\n                vst1q_f32(outptr1 + 4, _sum11);\n                vst1q_f32(outptr1 + 8, _sum12);\n                vst1q_f32(outptr1 + 12, _sum13);\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n                r3 += 16;\n                r4 += 16;\n                r5 += 16;\n                outptr0 += 16;\n                outptr1 += 16;\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                float32x4_t _sum00 = _bias0;\n                float32x4_t 
_sum01 = _bias0;\n                float32x4_t _sum10 = _bias0;\n                float32x4_t _sum11 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n                float32x4_t _r05 = vld1q_f32(r0 + 20);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum00 = vmlaq_f32(_sum00, _k00, _r00);\n                _sum00 = vmlaq_f32(_sum00, _k01, _r01);\n                _sum00 = vmlaq_f32(_sum00, _k02, _r02);\n                _sum00 = vmlaq_f32(_sum00, _k03, _r03);\n                _sum00 = vmlaq_f32(_sum00, _k04, _r04);\n                _sum01 = vmlaq_f32(_sum01, _k00, _r01);\n                _sum01 = vmlaq_f32(_sum01, _k01, _r02);\n                _sum01 = vmlaq_f32(_sum01, _k02, _r03);\n                _sum01 = vmlaq_f32(_sum01, _k03, _r04);\n                _sum01 = vmlaq_f32(_sum01, _k04, _r05);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n                float32x4_t _r15 = vld1q_f32(r1 + 20);\n\n                float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum10 = vmlaq_f32(_sum10, _k00, 
_r10);\n                _sum10 = vmlaq_f32(_sum10, _k01, _r11);\n                _sum10 = vmlaq_f32(_sum10, _k02, _r12);\n                _sum10 = vmlaq_f32(_sum10, _k03, _r13);\n                _sum10 = vmlaq_f32(_sum10, _k04, _r14);\n                _sum11 = vmlaq_f32(_sum11, _k00, _r11);\n                _sum11 = vmlaq_f32(_sum11, _k01, _r12);\n                _sum11 = vmlaq_f32(_sum11, _k02, _r13);\n                _sum11 = vmlaq_f32(_sum11, _k03, _r14);\n                _sum11 = vmlaq_f32(_sum11, _k04, _r15);\n\n                _sum00 = vmlaq_f32(_sum00, _k10, _r10);\n                _sum00 = vmlaq_f32(_sum00, _k11, _r11);\n                _sum00 = vmlaq_f32(_sum00, _k12, _r12);\n                _sum00 = vmlaq_f32(_sum00, _k13, _r13);\n                _sum00 = vmlaq_f32(_sum00, _k14, _r14);\n                _sum01 = vmlaq_f32(_sum01, _k10, _r11);\n                _sum01 = vmlaq_f32(_sum01, _k11, _r12);\n                _sum01 = vmlaq_f32(_sum01, _k12, _r13);\n                _sum01 = vmlaq_f32(_sum01, _k13, _r14);\n                _sum01 = vmlaq_f32(_sum01, _k14, _r15);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n                float32x4_t _r25 = vld1q_f32(r2 + 20);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum10 = vmlaq_f32(_sum10, _k10, _r20);\n                _sum10 = vmlaq_f32(_sum10, _k11, _r21);\n                _sum10 = vmlaq_f32(_sum10, _k12, _r22);\n                _sum10 = vmlaq_f32(_sum10, _k13, _r23);\n                _sum10 = vmlaq_f32(_sum10, 
_k14, _r24);\n                _sum11 = vmlaq_f32(_sum11, _k10, _r21);\n                _sum11 = vmlaq_f32(_sum11, _k11, _r22);\n                _sum11 = vmlaq_f32(_sum11, _k12, _r23);\n                _sum11 = vmlaq_f32(_sum11, _k13, _r24);\n                _sum11 = vmlaq_f32(_sum11, _k14, _r25);\n\n                _sum00 = vmlaq_f32(_sum00, _k20, _r20);\n                _sum00 = vmlaq_f32(_sum00, _k21, _r21);\n                _sum00 = vmlaq_f32(_sum00, _k22, _r22);\n                _sum00 = vmlaq_f32(_sum00, _k23, _r23);\n                _sum00 = vmlaq_f32(_sum00, _k24, _r24);\n                _sum01 = vmlaq_f32(_sum01, _k20, _r21);\n                _sum01 = vmlaq_f32(_sum01, _k21, _r22);\n                _sum01 = vmlaq_f32(_sum01, _k22, _r23);\n                _sum01 = vmlaq_f32(_sum01, _k23, _r24);\n                _sum01 = vmlaq_f32(_sum01, _k24, _r25);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n                float32x4_t _r35 = vld1q_f32(r3 + 20);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum10 = vmlaq_f32(_sum10, _k20, _r30);\n                _sum10 = vmlaq_f32(_sum10, _k21, _r31);\n                _sum10 = vmlaq_f32(_sum10, _k22, _r32);\n                _sum10 = vmlaq_f32(_sum10, _k23, _r33);\n                _sum10 = vmlaq_f32(_sum10, _k24, _r34);\n                _sum11 = vmlaq_f32(_sum11, _k20, _r31);\n                _sum11 = vmlaq_f32(_sum11, _k21, _r32);\n                _sum11 = vmlaq_f32(_sum11, _k22, _r33);\n                _sum11 = 
vmlaq_f32(_sum11, _k23, _r34);\n                _sum11 = vmlaq_f32(_sum11, _k24, _r35);\n\n                _sum00 = vmlaq_f32(_sum00, _k30, _r30);\n                _sum00 = vmlaq_f32(_sum00, _k31, _r31);\n                _sum00 = vmlaq_f32(_sum00, _k32, _r32);\n                _sum00 = vmlaq_f32(_sum00, _k33, _r33);\n                _sum00 = vmlaq_f32(_sum00, _k34, _r34);\n                _sum01 = vmlaq_f32(_sum01, _k30, _r31);\n                _sum01 = vmlaq_f32(_sum01, _k31, _r32);\n                _sum01 = vmlaq_f32(_sum01, _k32, _r33);\n                _sum01 = vmlaq_f32(_sum01, _k33, _r34);\n                _sum01 = vmlaq_f32(_sum01, _k34, _r35);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n                float32x4_t _r45 = vld1q_f32(r4 + 20);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum10 = vmlaq_f32(_sum10, _k30, _r40);\n                _sum10 = vmlaq_f32(_sum10, _k31, _r41);\n                _sum10 = vmlaq_f32(_sum10, _k32, _r42);\n                _sum10 = vmlaq_f32(_sum10, _k33, _r43);\n                _sum10 = vmlaq_f32(_sum10, _k34, _r44);\n                _sum11 = vmlaq_f32(_sum11, _k30, _r41);\n                _sum11 = vmlaq_f32(_sum11, _k31, _r42);\n                _sum11 = vmlaq_f32(_sum11, _k32, _r43);\n                _sum11 = vmlaq_f32(_sum11, _k33, _r44);\n                _sum11 = vmlaq_f32(_sum11, _k34, _r45);\n\n                _sum00 = vmlaq_f32(_sum00, _k40, _r40);\n                _sum00 = vmlaq_f32(_sum00, _k41, _r41);\n                
_sum00 = vmlaq_f32(_sum00, _k42, _r42);\n                _sum00 = vmlaq_f32(_sum00, _k43, _r43);\n                _sum00 = vmlaq_f32(_sum00, _k44, _r44);\n                _sum01 = vmlaq_f32(_sum01, _k40, _r41);\n                _sum01 = vmlaq_f32(_sum01, _k41, _r42);\n                _sum01 = vmlaq_f32(_sum01, _k42, _r43);\n                _sum01 = vmlaq_f32(_sum01, _k43, _r44);\n                _sum01 = vmlaq_f32(_sum01, _k44, _r45);\n\n                float32x4_t _r50 = vld1q_f32(r5);\n                float32x4_t _r51 = vld1q_f32(r5 + 4);\n                float32x4_t _r52 = vld1q_f32(r5 + 8);\n                float32x4_t _r53 = vld1q_f32(r5 + 12);\n                float32x4_t _r54 = vld1q_f32(r5 + 16);\n                float32x4_t _r55 = vld1q_f32(r5 + 20);\n\n                _sum10 = vmlaq_f32(_sum10, _k40, _r50);\n                _sum10 = vmlaq_f32(_sum10, _k41, _r51);\n                _sum10 = vmlaq_f32(_sum10, _k42, _r52);\n                _sum10 = vmlaq_f32(_sum10, _k43, _r53);\n                _sum10 = vmlaq_f32(_sum10, _k44, _r54);\n                _sum11 = vmlaq_f32(_sum11, _k40, _r51);\n                _sum11 = vmlaq_f32(_sum11, _k41, _r52);\n                _sum11 = vmlaq_f32(_sum11, _k42, _r53);\n                _sum11 = vmlaq_f32(_sum11, _k43, _r54);\n                _sum11 = vmlaq_f32(_sum11, _k44, _r55);\n\n                vst1q_f32(outptr0, _sum00);\n                vst1q_f32(outptr0 + 4, _sum01);\n                vst1q_f32(outptr1, _sum10);\n                vst1q_f32(outptr1 + 4, _sum11);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                r3 += 8;\n                r4 += 8;\n                r5 += 8;\n                outptr0 += 8;\n                outptr1 += 8;\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _sum0 = _bias0;\n                float32x4_t _sum1 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = 
vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k03, _r03);\n                _sum0 = vmlaq_f32(_sum0, _k04, _r04);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n\n                float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum1 = vmlaq_f32(_sum1, _k00, _r10);\n                _sum1 = vmlaq_f32(_sum1, _k01, _r11);\n                _sum1 = vmlaq_f32(_sum1, _k02, _r12);\n                _sum1 = vmlaq_f32(_sum1, _k03, _r13);\n                _sum1 = vmlaq_f32(_sum1, _k04, _r14);\n\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k13, _r13);\n                _sum0 = vmlaq_f32(_sum0, _k14, _r14);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 
4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum1 = vmlaq_f32(_sum1, _k10, _r20);\n                _sum1 = vmlaq_f32(_sum1, _k11, _r21);\n                _sum1 = vmlaq_f32(_sum1, _k12, _r22);\n                _sum1 = vmlaq_f32(_sum1, _k13, _r23);\n                _sum1 = vmlaq_f32(_sum1, _k14, _r24);\n\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n                _sum0 = vmlaq_f32(_sum0, _k23, _r23);\n                _sum0 = vmlaq_f32(_sum0, _k24, _r24);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum1 = vmlaq_f32(_sum1, _k20, _r30);\n                _sum1 = vmlaq_f32(_sum1, _k21, _r31);\n                _sum1 = vmlaq_f32(_sum1, _k22, _r32);\n                _sum1 = vmlaq_f32(_sum1, _k23, _r33);\n                _sum1 = vmlaq_f32(_sum1, _k24, _r34);\n\n                _sum0 = vmlaq_f32(_sum0, _k30, _r30);\n                _sum0 = vmlaq_f32(_sum0, _k31, _r31);\n      
          _sum0 = vmlaq_f32(_sum0, _k32, _r32);\n                _sum0 = vmlaq_f32(_sum0, _k33, _r33);\n                _sum0 = vmlaq_f32(_sum0, _k34, _r34);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum1 = vmlaq_f32(_sum1, _k30, _r40);\n                _sum1 = vmlaq_f32(_sum1, _k31, _r41);\n                _sum1 = vmlaq_f32(_sum1, _k32, _r42);\n                _sum1 = vmlaq_f32(_sum1, _k33, _r43);\n                _sum1 = vmlaq_f32(_sum1, _k34, _r44);\n\n                _sum0 = vmlaq_f32(_sum0, _k40, _r40);\n                _sum0 = vmlaq_f32(_sum0, _k41, _r41);\n                _sum0 = vmlaq_f32(_sum0, _k42, _r42);\n                _sum0 = vmlaq_f32(_sum0, _k43, _r43);\n                _sum0 = vmlaq_f32(_sum0, _k44, _r44);\n\n                float32x4_t _r50 = vld1q_f32(r5);\n                float32x4_t _r51 = vld1q_f32(r5 + 4);\n                float32x4_t _r52 = vld1q_f32(r5 + 8);\n                float32x4_t _r53 = vld1q_f32(r5 + 12);\n                float32x4_t _r54 = vld1q_f32(r5 + 16);\n\n                _sum1 = vmlaq_f32(_sum1, _k40, _r50);\n                _sum1 = vmlaq_f32(_sum1, _k41, _r51);\n                _sum1 = vmlaq_f32(_sum1, _k42, _r52);\n                _sum1 = vmlaq_f32(_sum1, _k43, _r53);\n                _sum1 = vmlaq_f32(_sum1, _k44, _r54);\n\n                vst1q_f32(outptr0, _sum0);\n                vst1q_f32(outptr1, _sum1);\n\n                r0 += 4;\n                r1 += 4;\n              
  r2 += 4;\n                r3 += 4;\n                r4 += 4;\n                r5 += 4;\n                outptr0 += 4;\n                outptr1 += 4;\n            }\n\n            r0 += 4 * 4 + w * 4;\n            r1 += 4 * 4 + w * 4;\n            r2 += 4 * 4 + w * 4;\n            r3 += 4 * 4 + w * 4;\n            r4 += 4 * 4 + w * 4;\n            r5 += 4 * 4 + w * 4;\n\n            outptr0 += outw * 4;\n            outptr1 += outw * 4;\n        }\n#endif // __aarch64__\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n                float32x4_t _sum0 = _bias0;\n                float32x4_t _sum1 = _bias0;\n                float32x4_t _sum2 = _bias0;\n                float32x4_t _sum3 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n                float32x4_t _r05 = vld1q_f32(r0 + 20);\n                float32x4_t _r06 = vld1q_f32(r0 + 24);\n                float32x4_t _r07 = vld1q_f32(r0 + 28);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k03, _r03);\n                _sum0 = vmlaq_f32(_sum0, _k04, _r04);\n                _sum1 = vmlaq_f32(_sum1, _k00, _r01);\n                _sum1 = vmlaq_f32(_sum1, _k01, _r02);\n                _sum1 = vmlaq_f32(_sum1, _k02, _r03);\n                _sum1 = 
vmlaq_f32(_sum1, _k03, _r04);\n                _sum1 = vmlaq_f32(_sum1, _k04, _r05);\n                _sum2 = vmlaq_f32(_sum2, _k00, _r02);\n                _sum2 = vmlaq_f32(_sum2, _k01, _r03);\n                _sum2 = vmlaq_f32(_sum2, _k02, _r04);\n                _sum2 = vmlaq_f32(_sum2, _k03, _r05);\n                _sum2 = vmlaq_f32(_sum2, _k04, _r06);\n                _sum3 = vmlaq_f32(_sum3, _k00, _r03);\n                _sum3 = vmlaq_f32(_sum3, _k01, _r04);\n                _sum3 = vmlaq_f32(_sum3, _k02, _r05);\n                _sum3 = vmlaq_f32(_sum3, _k03, _r06);\n                _sum3 = vmlaq_f32(_sum3, _k04, _r07);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n                float32x4_t _r15 = vld1q_f32(r1 + 20);\n                float32x4_t _r16 = vld1q_f32(r1 + 24);\n                float32x4_t _r17 = vld1q_f32(r1 + 28);\n\n                float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k13, _r13);\n                _sum0 = vmlaq_f32(_sum0, _k14, _r14);\n                _sum1 = vmlaq_f32(_sum1, _k10, _r11);\n                _sum1 = vmlaq_f32(_sum1, _k11, _r12);\n                _sum1 = vmlaq_f32(_sum1, _k12, _r13);\n                _sum1 = vmlaq_f32(_sum1, _k13, _r14);\n                _sum1 = vmlaq_f32(_sum1, _k14, _r15);\n                _sum2 = vmlaq_f32(_sum2, _k10, _r12);\n            
    _sum2 = vmlaq_f32(_sum2, _k11, _r13);\n                _sum2 = vmlaq_f32(_sum2, _k12, _r14);\n                _sum2 = vmlaq_f32(_sum2, _k13, _r15);\n                _sum2 = vmlaq_f32(_sum2, _k14, _r16);\n                _sum3 = vmlaq_f32(_sum3, _k10, _r13);\n                _sum3 = vmlaq_f32(_sum3, _k11, _r14);\n                _sum3 = vmlaq_f32(_sum3, _k12, _r15);\n                _sum3 = vmlaq_f32(_sum3, _k13, _r16);\n                _sum3 = vmlaq_f32(_sum3, _k14, _r17);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n                float32x4_t _r25 = vld1q_f32(r2 + 20);\n                float32x4_t _r26 = vld1q_f32(r2 + 24);\n                float32x4_t _r27 = vld1q_f32(r2 + 28);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n                _sum0 = vmlaq_f32(_sum0, _k23, _r23);\n                _sum0 = vmlaq_f32(_sum0, _k24, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k20, _r21);\n                _sum1 = vmlaq_f32(_sum1, _k21, _r22);\n                _sum1 = vmlaq_f32(_sum1, _k22, _r23);\n                _sum1 = vmlaq_f32(_sum1, _k23, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k24, _r25);\n                _sum2 = vmlaq_f32(_sum2, _k20, _r22);\n                _sum2 = vmlaq_f32(_sum2, _k21, _r23);\n                _sum2 = vmlaq_f32(_sum2, _k22, _r24);\n                _sum2 = vmlaq_f32(_sum2, _k23, 
_r25);\n                _sum2 = vmlaq_f32(_sum2, _k24, _r26);\n                _sum3 = vmlaq_f32(_sum3, _k20, _r23);\n                _sum3 = vmlaq_f32(_sum3, _k21, _r24);\n                _sum3 = vmlaq_f32(_sum3, _k22, _r25);\n                _sum3 = vmlaq_f32(_sum3, _k23, _r26);\n                _sum3 = vmlaq_f32(_sum3, _k24, _r27);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n                float32x4_t _r35 = vld1q_f32(r3 + 20);\n                float32x4_t _r36 = vld1q_f32(r3 + 24);\n                float32x4_t _r37 = vld1q_f32(r3 + 28);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k30, _r30);\n                _sum0 = vmlaq_f32(_sum0, _k31, _r31);\n                _sum0 = vmlaq_f32(_sum0, _k32, _r32);\n                _sum0 = vmlaq_f32(_sum0, _k33, _r33);\n                _sum0 = vmlaq_f32(_sum0, _k34, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k30, _r31);\n                _sum1 = vmlaq_f32(_sum1, _k31, _r32);\n                _sum1 = vmlaq_f32(_sum1, _k32, _r33);\n                _sum1 = vmlaq_f32(_sum1, _k33, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k34, _r35);\n                _sum2 = vmlaq_f32(_sum2, _k30, _r32);\n                _sum2 = vmlaq_f32(_sum2, _k31, _r33);\n                _sum2 = vmlaq_f32(_sum2, _k32, _r34);\n                _sum2 = vmlaq_f32(_sum2, _k33, _r35);\n                _sum2 = vmlaq_f32(_sum2, _k34, _r36);\n                _sum3 = vmlaq_f32(_sum3, _k30, _r33);\n                _sum3 = 
vmlaq_f32(_sum3, _k31, _r34);\n                _sum3 = vmlaq_f32(_sum3, _k32, _r35);\n                _sum3 = vmlaq_f32(_sum3, _k33, _r36);\n                _sum3 = vmlaq_f32(_sum3, _k34, _r37);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n                float32x4_t _r45 = vld1q_f32(r4 + 20);\n                float32x4_t _r46 = vld1q_f32(r4 + 24);\n                float32x4_t _r47 = vld1q_f32(r4 + 28);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum0 = vmlaq_f32(_sum0, _k40, _r40);\n                _sum0 = vmlaq_f32(_sum0, _k41, _r41);\n                _sum0 = vmlaq_f32(_sum0, _k42, _r42);\n                _sum0 = vmlaq_f32(_sum0, _k43, _r43);\n                _sum0 = vmlaq_f32(_sum0, _k44, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k40, _r41);\n                _sum1 = vmlaq_f32(_sum1, _k41, _r42);\n                _sum1 = vmlaq_f32(_sum1, _k42, _r43);\n                _sum1 = vmlaq_f32(_sum1, _k43, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k44, _r45);\n                _sum2 = vmlaq_f32(_sum2, _k40, _r42);\n                _sum2 = vmlaq_f32(_sum2, _k41, _r43);\n                _sum2 = vmlaq_f32(_sum2, _k42, _r44);\n                _sum2 = vmlaq_f32(_sum2, _k43, _r45);\n                _sum2 = vmlaq_f32(_sum2, _k44, _r46);\n                _sum3 = vmlaq_f32(_sum3, _k40, _r43);\n                _sum3 = vmlaq_f32(_sum3, _k41, _r44);\n                _sum3 = vmlaq_f32(_sum3, _k42, _r45);\n                _sum3 = vmlaq_f32(_sum3, _k43, _r46);\n            
    _sum3 = vmlaq_f32(_sum3, _k44, _r47);\n\n                vst1q_f32(outptr0, _sum0);\n                vst1q_f32(outptr0 + 4, _sum1);\n                vst1q_f32(outptr0 + 8, _sum2);\n                vst1q_f32(outptr0 + 12, _sum3);\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n                r3 += 16;\n                r4 += 16;\n                outptr0 += 16;\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                float32x4_t _sum0 = _bias0;\n                float32x4_t _sum1 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n                float32x4_t _r05 = vld1q_f32(r0 + 20);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k03, _r03);\n                _sum0 = vmlaq_f32(_sum0, _k04, _r04);\n                _sum1 = vmlaq_f32(_sum1, _k00, _r01);\n                _sum1 = vmlaq_f32(_sum1, _k01, _r02);\n                _sum1 = vmlaq_f32(_sum1, _k02, _r03);\n                _sum1 = vmlaq_f32(_sum1, _k03, _r04);\n                _sum1 = vmlaq_f32(_sum1, _k04, _r05);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t 
_r14 = vld1q_f32(r1 + 16);\n                float32x4_t _r15 = vld1q_f32(r1 + 20);\n\n                float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k13, _r13);\n                _sum0 = vmlaq_f32(_sum0, _k14, _r14);\n                _sum1 = vmlaq_f32(_sum1, _k10, _r11);\n                _sum1 = vmlaq_f32(_sum1, _k11, _r12);\n                _sum1 = vmlaq_f32(_sum1, _k12, _r13);\n                _sum1 = vmlaq_f32(_sum1, _k13, _r14);\n                _sum1 = vmlaq_f32(_sum1, _k14, _r15);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n                float32x4_t _r25 = vld1q_f32(r2 + 20);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n                _sum0 = vmlaq_f32(_sum0, _k23, _r23);\n                _sum0 = vmlaq_f32(_sum0, _k24, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k20, _r21);\n                _sum1 = vmlaq_f32(_sum1, _k21, _r22);\n                _sum1 = vmlaq_f32(_sum1, 
_k22, _r23);\n                _sum1 = vmlaq_f32(_sum1, _k23, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k24, _r25);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n                float32x4_t _r35 = vld1q_f32(r3 + 20);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k30, _r30);\n                _sum0 = vmlaq_f32(_sum0, _k31, _r31);\n                _sum0 = vmlaq_f32(_sum0, _k32, _r32);\n                _sum0 = vmlaq_f32(_sum0, _k33, _r33);\n                _sum0 = vmlaq_f32(_sum0, _k34, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k30, _r31);\n                _sum1 = vmlaq_f32(_sum1, _k31, _r32);\n                _sum1 = vmlaq_f32(_sum1, _k32, _r33);\n                _sum1 = vmlaq_f32(_sum1, _k33, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k34, _r35);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n                float32x4_t _r45 = vld1q_f32(r4 + 20);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum0 = vmlaq_f32(_sum0, _k40, _r40);\n 
               _sum0 = vmlaq_f32(_sum0, _k41, _r41);\n                _sum0 = vmlaq_f32(_sum0, _k42, _r42);\n                _sum0 = vmlaq_f32(_sum0, _k43, _r43);\n                _sum0 = vmlaq_f32(_sum0, _k44, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k40, _r41);\n                _sum1 = vmlaq_f32(_sum1, _k41, _r42);\n                _sum1 = vmlaq_f32(_sum1, _k42, _r43);\n                _sum1 = vmlaq_f32(_sum1, _k43, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k44, _r45);\n\n                vst1q_f32(outptr0, _sum0);\n                vst1q_f32(outptr0 + 4, _sum1);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                r3 += 8;\n                r4 += 8;\n                outptr0 += 8;\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _sum0 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k03, _r03);\n                _sum0 = vmlaq_f32(_sum0, _k04, _r04);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n\n        
        float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k13, _r13);\n                _sum0 = vmlaq_f32(_sum0, _k14, _r14);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n                _sum0 = vmlaq_f32(_sum0, _k23, _r23);\n                _sum0 = vmlaq_f32(_sum0, _k24, _r24);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 
20;\n\n                _sum0 = vmlaq_f32(_sum0, _k30, _r30);\n                _sum0 = vmlaq_f32(_sum0, _k31, _r31);\n                _sum0 = vmlaq_f32(_sum0, _k32, _r32);\n                _sum0 = vmlaq_f32(_sum0, _k33, _r33);\n                _sum0 = vmlaq_f32(_sum0, _k34, _r34);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum0 = vmlaq_f32(_sum0, _k40, _r40);\n                _sum0 = vmlaq_f32(_sum0, _k41, _r41);\n                _sum0 = vmlaq_f32(_sum0, _k42, _r42);\n                _sum0 = vmlaq_f32(_sum0, _k43, _r43);\n                _sum0 = vmlaq_f32(_sum0, _k44, _r44);\n\n                vst1q_f32(outptr0, _sum0);\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                r3 += 4;\n                r4 += 4;\n                outptr0 += 4;\n            }\n\n            r0 += 4 * 4;\n            r1 += 4 * 4;\n            r2 += 4 * 4;\n            r3 += 4 * 4;\n            r4 += 4 * 4;\n        }\n    }\n}\n\nstatic void convdw5x5s2_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = 
top_blob.channel(g);\n\n        float32x4_t _bias0 = bias ? vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n        const float* r3 = img0.row(3);\n        const float* r4 = img0.row(4);\n\n        int i = 0;\n\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n                float32x4_t _sum0 = _bias0;\n                float32x4_t _sum1 = _bias0;\n                float32x4_t _sum2 = _bias0;\n                float32x4_t _sum3 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n                float32x4_t _r05 = vld1q_f32(r0 + 20);\n                float32x4_t _r06 = vld1q_f32(r0 + 24);\n                float32x4_t _r07 = vld1q_f32(r0 + 28);\n                float32x4_t _r08 = vld1q_f32(r0 + 32);\n                float32x4_t _r09 = vld1q_f32(r0 + 36);\n                float32x4_t _r010 = vld1q_f32(r0 + 40);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k03, _r03);\n                _sum0 = vmlaq_f32(_sum0, _k04, _r04);\n                _sum1 = 
vmlaq_f32(_sum1, _k00, _r02);\n                _sum1 = vmlaq_f32(_sum1, _k01, _r03);\n                _sum1 = vmlaq_f32(_sum1, _k02, _r04);\n                _sum1 = vmlaq_f32(_sum1, _k03, _r05);\n                _sum1 = vmlaq_f32(_sum1, _k04, _r06);\n                _sum2 = vmlaq_f32(_sum2, _k00, _r04);\n                _sum2 = vmlaq_f32(_sum2, _k01, _r05);\n                _sum2 = vmlaq_f32(_sum2, _k02, _r06);\n                _sum2 = vmlaq_f32(_sum2, _k03, _r07);\n                _sum2 = vmlaq_f32(_sum2, _k04, _r08);\n                _sum3 = vmlaq_f32(_sum3, _k00, _r06);\n                _sum3 = vmlaq_f32(_sum3, _k01, _r07);\n                _sum3 = vmlaq_f32(_sum3, _k02, _r08);\n                _sum3 = vmlaq_f32(_sum3, _k03, _r09);\n                _sum3 = vmlaq_f32(_sum3, _k04, _r010);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n                float32x4_t _r15 = vld1q_f32(r1 + 20);\n                float32x4_t _r16 = vld1q_f32(r1 + 24);\n                float32x4_t _r17 = vld1q_f32(r1 + 28);\n                float32x4_t _r18 = vld1q_f32(r1 + 32);\n                float32x4_t _r19 = vld1q_f32(r1 + 36);\n                float32x4_t _r110 = vld1q_f32(r1 + 40);\n\n                float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k13, _r13);\n                _sum0 = vmlaq_f32(_sum0, _k14, _r14);\n       
         _sum1 = vmlaq_f32(_sum1, _k10, _r12);\n                _sum1 = vmlaq_f32(_sum1, _k11, _r13);\n                _sum1 = vmlaq_f32(_sum1, _k12, _r14);\n                _sum1 = vmlaq_f32(_sum1, _k13, _r15);\n                _sum1 = vmlaq_f32(_sum1, _k14, _r16);\n                _sum2 = vmlaq_f32(_sum2, _k10, _r14);\n                _sum2 = vmlaq_f32(_sum2, _k11, _r15);\n                _sum2 = vmlaq_f32(_sum2, _k12, _r16);\n                _sum2 = vmlaq_f32(_sum2, _k13, _r17);\n                _sum2 = vmlaq_f32(_sum2, _k14, _r18);\n                _sum3 = vmlaq_f32(_sum3, _k10, _r16);\n                _sum3 = vmlaq_f32(_sum3, _k11, _r17);\n                _sum3 = vmlaq_f32(_sum3, _k12, _r18);\n                _sum3 = vmlaq_f32(_sum3, _k13, _r19);\n                _sum3 = vmlaq_f32(_sum3, _k14, _r110);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n                float32x4_t _r25 = vld1q_f32(r2 + 20);\n                float32x4_t _r26 = vld1q_f32(r2 + 24);\n                float32x4_t _r27 = vld1q_f32(r2 + 28);\n                float32x4_t _r28 = vld1q_f32(r2 + 32);\n                float32x4_t _r29 = vld1q_f32(r2 + 36);\n                float32x4_t _r210 = vld1q_f32(r2 + 40);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n                _sum0 = vmlaq_f32(_sum0, _k23, _r23);\n                _sum0 = vmlaq_f32(_sum0, 
_k24, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k20, _r22);\n                _sum1 = vmlaq_f32(_sum1, _k21, _r23);\n                _sum1 = vmlaq_f32(_sum1, _k22, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k23, _r25);\n                _sum1 = vmlaq_f32(_sum1, _k24, _r26);\n                _sum2 = vmlaq_f32(_sum2, _k20, _r24);\n                _sum2 = vmlaq_f32(_sum2, _k21, _r25);\n                _sum2 = vmlaq_f32(_sum2, _k22, _r26);\n                _sum2 = vmlaq_f32(_sum2, _k23, _r27);\n                _sum2 = vmlaq_f32(_sum2, _k24, _r28);\n                _sum3 = vmlaq_f32(_sum3, _k20, _r26);\n                _sum3 = vmlaq_f32(_sum3, _k21, _r27);\n                _sum3 = vmlaq_f32(_sum3, _k22, _r28);\n                _sum3 = vmlaq_f32(_sum3, _k23, _r29);\n                _sum3 = vmlaq_f32(_sum3, _k24, _r210);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n                float32x4_t _r35 = vld1q_f32(r3 + 20);\n                float32x4_t _r36 = vld1q_f32(r3 + 24);\n                float32x4_t _r37 = vld1q_f32(r3 + 28);\n                float32x4_t _r38 = vld1q_f32(r3 + 32);\n                float32x4_t _r39 = vld1q_f32(r3 + 36);\n                float32x4_t _r310 = vld1q_f32(r3 + 40);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k30, _r30);\n                _sum0 = vmlaq_f32(_sum0, _k31, _r31);\n                _sum0 = vmlaq_f32(_sum0, _k32, _r32);\n                _sum0 = vmlaq_f32(_sum0, _k33, _r33);\n                _sum0 = 
vmlaq_f32(_sum0, _k34, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k30, _r32);\n                _sum1 = vmlaq_f32(_sum1, _k31, _r33);\n                _sum1 = vmlaq_f32(_sum1, _k32, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k33, _r35);\n                _sum1 = vmlaq_f32(_sum1, _k34, _r36);\n                _sum2 = vmlaq_f32(_sum2, _k30, _r34);\n                _sum2 = vmlaq_f32(_sum2, _k31, _r35);\n                _sum2 = vmlaq_f32(_sum2, _k32, _r36);\n                _sum2 = vmlaq_f32(_sum2, _k33, _r37);\n                _sum2 = vmlaq_f32(_sum2, _k34, _r38);\n                _sum3 = vmlaq_f32(_sum3, _k30, _r36);\n                _sum3 = vmlaq_f32(_sum3, _k31, _r37);\n                _sum3 = vmlaq_f32(_sum3, _k32, _r38);\n                _sum3 = vmlaq_f32(_sum3, _k33, _r39);\n                _sum3 = vmlaq_f32(_sum3, _k34, _r310);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n                float32x4_t _r45 = vld1q_f32(r4 + 20);\n                float32x4_t _r46 = vld1q_f32(r4 + 24);\n                float32x4_t _r47 = vld1q_f32(r4 + 28);\n                float32x4_t _r48 = vld1q_f32(r4 + 32);\n                float32x4_t _r49 = vld1q_f32(r4 + 36);\n                float32x4_t _r410 = vld1q_f32(r4 + 40);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum0 = vmlaq_f32(_sum0, _k40, _r40);\n                _sum0 = vmlaq_f32(_sum0, _k41, _r41);\n                _sum0 = vmlaq_f32(_sum0, _k42, _r42);\n                _sum0 = vmlaq_f32(_sum0, _k43, _r43);\n       
         _sum0 = vmlaq_f32(_sum0, _k44, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k40, _r42);\n                _sum1 = vmlaq_f32(_sum1, _k41, _r43);\n                _sum1 = vmlaq_f32(_sum1, _k42, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k43, _r45);\n                _sum1 = vmlaq_f32(_sum1, _k44, _r46);\n                _sum2 = vmlaq_f32(_sum2, _k40, _r44);\n                _sum2 = vmlaq_f32(_sum2, _k41, _r45);\n                _sum2 = vmlaq_f32(_sum2, _k42, _r46);\n                _sum2 = vmlaq_f32(_sum2, _k43, _r47);\n                _sum2 = vmlaq_f32(_sum2, _k44, _r48);\n                _sum3 = vmlaq_f32(_sum3, _k40, _r46);\n                _sum3 = vmlaq_f32(_sum3, _k41, _r47);\n                _sum3 = vmlaq_f32(_sum3, _k42, _r48);\n                _sum3 = vmlaq_f32(_sum3, _k43, _r49);\n                _sum3 = vmlaq_f32(_sum3, _k44, _r410);\n\n                vst1q_f32(outptr0, _sum0);\n                vst1q_f32(outptr0 + 4, _sum1);\n                vst1q_f32(outptr0 + 8, _sum2);\n                vst1q_f32(outptr0 + 12, _sum3);\n\n                r0 += 8 * 4;\n                r1 += 8 * 4;\n                r2 += 8 * 4;\n                r3 += 8 * 4;\n                r4 += 8 * 4;\n                outptr0 += 16;\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                float32x4_t _sum0 = _bias0;\n                float32x4_t _sum1 = _bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n                float32x4_t _r05 = vld1q_f32(r0 + 20);\n                float32x4_t _r06 = vld1q_f32(r0 + 24);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = 
vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k03, _r03);\n                _sum0 = vmlaq_f32(_sum0, _k04, _r04);\n                _sum1 = vmlaq_f32(_sum1, _k00, _r02);\n                _sum1 = vmlaq_f32(_sum1, _k01, _r03);\n                _sum1 = vmlaq_f32(_sum1, _k02, _r04);\n                _sum1 = vmlaq_f32(_sum1, _k03, _r05);\n                _sum1 = vmlaq_f32(_sum1, _k04, _r06);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n                float32x4_t _r15 = vld1q_f32(r1 + 20);\n                float32x4_t _r16 = vld1q_f32(r1 + 24);\n\n                float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k13, _r13);\n                _sum0 = vmlaq_f32(_sum0, _k14, _r14);\n                _sum1 = vmlaq_f32(_sum1, _k10, _r12);\n                _sum1 = vmlaq_f32(_sum1, _k11, _r13);\n                _sum1 = vmlaq_f32(_sum1, _k12, _r14);\n                _sum1 = vmlaq_f32(_sum1, _k13, _r15);\n                _sum1 = vmlaq_f32(_sum1, _k14, _r16);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 
4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n                float32x4_t _r25 = vld1q_f32(r2 + 20);\n                float32x4_t _r26 = vld1q_f32(r2 + 24);\n\n                float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n                _sum0 = vmlaq_f32(_sum0, _k23, _r23);\n                _sum0 = vmlaq_f32(_sum0, _k24, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k20, _r22);\n                _sum1 = vmlaq_f32(_sum1, _k21, _r23);\n                _sum1 = vmlaq_f32(_sum1, _k22, _r24);\n                _sum1 = vmlaq_f32(_sum1, _k23, _r25);\n                _sum1 = vmlaq_f32(_sum1, _k24, _r26);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n                float32x4_t _r35 = vld1q_f32(r3 + 20);\n                float32x4_t _r36 = vld1q_f32(r3 + 24);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k30, _r30);\n                _sum0 = vmlaq_f32(_sum0, _k31, _r31);\n                _sum0 = vmlaq_f32(_sum0, _k32, _r32);\n      
          _sum0 = vmlaq_f32(_sum0, _k33, _r33);\n                _sum0 = vmlaq_f32(_sum0, _k34, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k30, _r32);\n                _sum1 = vmlaq_f32(_sum1, _k31, _r33);\n                _sum1 = vmlaq_f32(_sum1, _k32, _r34);\n                _sum1 = vmlaq_f32(_sum1, _k33, _r35);\n                _sum1 = vmlaq_f32(_sum1, _k34, _r36);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n                float32x4_t _r45 = vld1q_f32(r4 + 20);\n                float32x4_t _r46 = vld1q_f32(r4 + 24);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 80;\n\n                _sum0 = vmlaq_f32(_sum0, _k40, _r40);\n                _sum0 = vmlaq_f32(_sum0, _k41, _r41);\n                _sum0 = vmlaq_f32(_sum0, _k42, _r42);\n                _sum0 = vmlaq_f32(_sum0, _k43, _r43);\n                _sum0 = vmlaq_f32(_sum0, _k44, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k40, _r42);\n                _sum1 = vmlaq_f32(_sum1, _k41, _r43);\n                _sum1 = vmlaq_f32(_sum1, _k42, _r44);\n                _sum1 = vmlaq_f32(_sum1, _k43, _r45);\n                _sum1 = vmlaq_f32(_sum1, _k44, _r46);\n\n                vst1q_f32(outptr0, _sum0);\n                vst1q_f32(outptr0 + 4, _sum1);\n\n                r0 += 4 * 4;\n                r1 += 4 * 4;\n                r2 += 4 * 4;\n                r3 += 4 * 4;\n                r4 += 4 * 4;\n                outptr0 += 8;\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _sum0 = 
_bias0;\n\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r03 = vld1q_f32(r0 + 12);\n                float32x4_t _r04 = vld1q_f32(r0 + 16);\n\n                float32x4_t _k00 = vld1q_f32(k0);\n                float32x4_t _k01 = vld1q_f32(k0 + 4);\n                float32x4_t _k02 = vld1q_f32(k0 + 8);\n                float32x4_t _k03 = vld1q_f32(k0 + 12);\n                float32x4_t _k04 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k00, _r00);\n                _sum0 = vmlaq_f32(_sum0, _k01, _r01);\n                _sum0 = vmlaq_f32(_sum0, _k02, _r02);\n                _sum0 = vmlaq_f32(_sum0, _k03, _r03);\n                _sum0 = vmlaq_f32(_sum0, _k04, _r04);\n\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r13 = vld1q_f32(r1 + 12);\n                float32x4_t _r14 = vld1q_f32(r1 + 16);\n\n                float32x4_t _k10 = vld1q_f32(k0);\n                float32x4_t _k11 = vld1q_f32(k0 + 4);\n                float32x4_t _k12 = vld1q_f32(k0 + 8);\n                float32x4_t _k13 = vld1q_f32(k0 + 12);\n                float32x4_t _k14 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k10, _r10);\n                _sum0 = vmlaq_f32(_sum0, _k11, _r11);\n                _sum0 = vmlaq_f32(_sum0, _k12, _r12);\n                _sum0 = vmlaq_f32(_sum0, _k13, _r13);\n                _sum0 = vmlaq_f32(_sum0, _k14, _r14);\n\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n                float32x4_t _r23 = vld1q_f32(r2 + 12);\n                float32x4_t _r24 = vld1q_f32(r2 + 16);\n\n      
          float32x4_t _k20 = vld1q_f32(k0);\n                float32x4_t _k21 = vld1q_f32(k0 + 4);\n                float32x4_t _k22 = vld1q_f32(k0 + 8);\n                float32x4_t _k23 = vld1q_f32(k0 + 12);\n                float32x4_t _k24 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k20, _r20);\n                _sum0 = vmlaq_f32(_sum0, _k21, _r21);\n                _sum0 = vmlaq_f32(_sum0, _k22, _r22);\n                _sum0 = vmlaq_f32(_sum0, _k23, _r23);\n                _sum0 = vmlaq_f32(_sum0, _k24, _r24);\n\n                float32x4_t _r30 = vld1q_f32(r3);\n                float32x4_t _r31 = vld1q_f32(r3 + 4);\n                float32x4_t _r32 = vld1q_f32(r3 + 8);\n                float32x4_t _r33 = vld1q_f32(r3 + 12);\n                float32x4_t _r34 = vld1q_f32(r3 + 16);\n\n                float32x4_t _k30 = vld1q_f32(k0);\n                float32x4_t _k31 = vld1q_f32(k0 + 4);\n                float32x4_t _k32 = vld1q_f32(k0 + 8);\n                float32x4_t _k33 = vld1q_f32(k0 + 12);\n                float32x4_t _k34 = vld1q_f32(k0 + 16);\n                k0 += 20;\n\n                _sum0 = vmlaq_f32(_sum0, _k30, _r30);\n                _sum0 = vmlaq_f32(_sum0, _k31, _r31);\n                _sum0 = vmlaq_f32(_sum0, _k32, _r32);\n                _sum0 = vmlaq_f32(_sum0, _k33, _r33);\n                _sum0 = vmlaq_f32(_sum0, _k34, _r34);\n\n                float32x4_t _r40 = vld1q_f32(r4);\n                float32x4_t _r41 = vld1q_f32(r4 + 4);\n                float32x4_t _r42 = vld1q_f32(r4 + 8);\n                float32x4_t _r43 = vld1q_f32(r4 + 12);\n                float32x4_t _r44 = vld1q_f32(r4 + 16);\n\n                float32x4_t _k40 = vld1q_f32(k0);\n                float32x4_t _k41 = vld1q_f32(k0 + 4);\n                float32x4_t _k42 = vld1q_f32(k0 + 8);\n                float32x4_t _k43 = vld1q_f32(k0 + 12);\n                float32x4_t _k44 = vld1q_f32(k0 + 16);\n                k0 -= 
80;\n\n                _sum0 = vmlaq_f32(_sum0, _k40, _r40);\n                _sum0 = vmlaq_f32(_sum0, _k41, _r41);\n                _sum0 = vmlaq_f32(_sum0, _k42, _r42);\n                _sum0 = vmlaq_f32(_sum0, _k43, _r43);\n                _sum0 = vmlaq_f32(_sum0, _k44, _r44);\n\n                vst1q_f32(outptr0, _sum0);\n\n                r0 += 2 * 4;\n                r1 += 2 * 4;\n                r2 += 2 * 4;\n                r3 += 2 * 4;\n                r4 += 2 * 4;\n                outptr0 += 4;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n            r3 += tailstep;\n            r4 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_5x5_pack4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw5x5s1_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n#if __aarch64__\n    const int w = bottom_blob.w;\n#endif\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const unsigned short* kptr = kernel.row<const unsigned short>(g);\n\n        unsigned short* outptr0 = out.row<unsigned short>(0);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const unsigned short* r0 = img0.row<const unsigned short>(0);\n        const unsigned short* r1 = img0.row<const unsigned short>(1);\n        const unsigned short* r2 = img0.row<const unsigned short>(2);\n        const unsigned short* r3 = img0.row<const unsigned short>(3);\n        const unsigned short* r4 = img0.row<const unsigned short>(4);\n\n#if __aarch64__\n        unsigned short* outptr1 = out.row<unsigned short>(1);\n        const unsigned short* r5 = img0.row<const unsigned short>(5);\n\n        float32x4_t _bias0 = bias ? 
vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n\n        // 4 * 25\n        uint16x8_t _k00_01 = vld1q_u16(kptr);\n        uint16x8_t _k02_03 = vld1q_u16(kptr + 8);\n        uint16x8_t _k04_10 = vld1q_u16(kptr + 16);\n        uint16x8_t _k11_12 = vld1q_u16(kptr + 24);\n        uint16x8_t _k13_14 = vld1q_u16(kptr + 32);\n        uint16x8_t _k20_21 = vld1q_u16(kptr + 40);\n        uint16x8_t _k22_23 = vld1q_u16(kptr + 48);\n        uint16x8_t _k24_30 = vld1q_u16(kptr + 56);\n        uint16x8_t _k31_32 = vld1q_u16(kptr + 64);\n        uint16x8_t _k33_34 = vld1q_u16(kptr + 72);\n        uint16x8_t _k40_41 = vld1q_u16(kptr + 80);\n        uint16x8_t _k42_43 = vld1q_u16(kptr + 88);\n        uint16x4_t _k44 = vld1_u16(kptr + 96);\n#else  // __aarch64__\n        float bias0_data[4];\n        if (bias)\n        {\n            bias0_data[0] = bias[g * 4 + 0];\n            bias0_data[1] = bias[g * 4 + 1];\n            bias0_data[2] = bias[g * 4 + 2];\n            bias0_data[3] = bias[g * 4 + 3];\n        }\n        else\n        {\n            bias0_data[0] = 0.f;\n            bias0_data[1] = 0.f;\n            bias0_data[2] = 0.f;\n            bias0_data[3] = 0.f;\n        }\n        const float* bias0_data_ptr = bias0_data;\n#endif // __aarch64__\n\n        int i = 0;\n#if __aarch64__\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%3], #32 \\n\" // r10 r11 r12 r13\n\n                    \"shll2  v14.4s, %18.8h, #16         \\n\"\n\n                    \"mov    v24.16b, %29.16b            \\n\" // sum00\n                    \"mov    v25.16b, %29.16b            \\n\" // sum01\n                    \"mov    v26.16b, %29.16b            \\n\" // sum02\n                    \"mov    v27.16b, %29.16b            \\n\" // sum03\n\n   
                 \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"mov    v28.16b, %29.16b            \\n\" // sum10\n                    \"mov    v29.16b, %29.16b            \\n\" // sum11\n                    \"mov    v30.16b, %29.16b            \\n\" // sum12\n                    \"mov    v31.16b, %29.16b            \\n\" // sum13\n\n                    \"shll   v15.4s, %16.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v25.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%3]      \\n\" // r14 r15 r16 r17\n                    \"fmla   v27.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll   v14.4s, %19.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v20.4s      \\n\"\n\n                    \"shll2  v14.4s, %19.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      
\\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"shll   v15.4s, %17.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v21.4s      \\n\"\n\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v21.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v22.4s      \\n\"\n\n                    \"shll2  v14.4s, %20.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%4], #32 \\n\" // r20 r21 r22 r23\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"shll   v15.4s, %18.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v22.4s      \\n\"\n                  
  \"fmla   v27.4s, v14.4s, v23.4s      \\n\"\n\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v25.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%4]      \\n\" // r24 r25 r26 r27\n                    \"fmla   v27.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll2  v14.4s, %21.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n\n                    \"shll   v15.4s, %19.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v20.4s      \\n\"\n\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, 
v15.4s, v18.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v21.4s      \\n\"\n\n                    \"shll2  v14.4s, %22.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"shll   v15.4s, %20.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v21.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v22.4s      \\n\"\n\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%5], #32 \\n\" // r30 r31 r32 r33\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         
\\n\"\n                    \"fmla   v26.4s, v14.4s, v22.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v23.4s      \\n\"\n\n                    \"shll2  v14.4s, %23.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n\n                    \"shll   v15.4s, %21.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v25.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%5]      \\n\" // r34 r35 r36 r37\n                    \"fmla   v27.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v20.4s      \\n\"\n\n                    \"shll2  v14.4s, %24.8h, #16         \\n\"\n\n                  
  \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"shll   v15.4s, %22.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v21.4s      \\n\"\n\n                    \"shll   v14.4s, %25.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v21.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v22.4s      \\n\"\n\n                    \"shll2  v14.4s, %25.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%6, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%6], #32 \\n\" // r40 r41 r42 r43\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"shll   v15.4s, %23.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, 
v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v22.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v23.4s      \\n\"\n                    \"shll   v14.4s, %26.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v25.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%6, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%6]      \\n\" // r44 r45 r46 r47\n                    \"fmla   v27.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll2  v14.4s, %26.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v15.4s, %24.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v20.4s      \\n\"\n              
      \"shll   v14.4s, %27.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll2  v15.4s, %24.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v21.4s      \\n\"\n                    \"shll2  v14.4s, %27.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v15.4s, %25.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v21.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v22.4s      \\n\"\n                    \"shll   v14.4s, %28.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%2], #32 \\n\" // r00 r01 r02 r03\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"shll2  v15.4s, %25.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, 
v20.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v22.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v23.4s      \\n\"\n\n                    \"shll   v14.4s, %16.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v25.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%2]      \\n\" // r04 r05 r06 r07\n                    \"fmla   v27.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v25.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v26.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v27.4s, v15.4s, v20.4s      \\n\"\n\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v25.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n          
          \"fmla   v26.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v27.4s, v14.4s, v21.4s      \\n\"\n\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v25.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v26.4s, v15.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%7, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%7], #32 \\n\" // r50 r51 r52 r53\n                    \"fmla   v27.4s, v15.4s, v22.4s      \\n\"\n\n                    \"shll   v15.4s, %26.4h, #16         \\n\"\n\n                    \"fmla   v24.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v25.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v26.4s, v14.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v27.4s, v14.4s, v23.4s      \\n\"\n\n                    \"shll2  v14.4s, %26.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%7, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%7]      \\n\" // r54 r55 r56 r57\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n\n                    \"shll   v15.4s, %27.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n    
                \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v20.4s      \\n\"\n\n                    \"shll2  v14.4s, %27.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"shll   v15.4s, %28.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v21.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v22.4s      \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n\n                    \"shrn   v24.4h, v24.4s, #16         \\n\"\n                    \"shrn   v25.4h, v25.4s, #16         \\n\"\n                    \"shrn   v26.4h, v26.4s, #16         \\n\"\n                    \"shrn   v27.4h, v27.4s, #16         \\n\"\n                    \"shrn   v28.4h, v28.4s, #16         \\n\"\n                    \"shrn   v29.4h, v29.4s, #16         \\n\"\n                    \"shrn   v30.4h, v30.4s, #16         \\n\"\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%0], #32 \\n\"\n                    \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%1], #32 \\n\"\n\n                    : 
\"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3),      // %5\n                    \"=r\"(r4),      // %6\n                    \"=r\"(r5)       // %7\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(r5),\n                    \"w\"(_k00_01), // %16\n                    \"w\"(_k02_03), // %17\n                    \"w\"(_k04_10), // %18\n                    \"w\"(_k11_12), // %19\n                    \"w\"(_k13_14), // %20\n                    \"w\"(_k20_21), // %21\n                    \"w\"(_k22_23), // %22\n                    \"w\"(_k24_30), // %23\n                    \"w\"(_k31_32), // %24\n                    \"w\"(_k33_34), // %25\n                    \"w\"(_k40_41), // %26\n                    \"w\"(_k42_43), // %27\n                    \"w\"(_k44),    // %28\n                    \"w\"(_bias0)   // %29\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%3], #16 \\n\" // r10 r11\n\n                    \"shll2  v14.4s, %18.8h, #16         \\n\"\n\n                    \"mov    v28.16b, %29.16b            \\n\" // sum00\n                    \"mov    v29.16b, %29.16b            \\n\" // sum01\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         
\\n\"\n\n                    \"mov    v30.16b, %29.16b            \\n\" // sum10\n                    \"mov    v31.16b, %29.16b            \\n\" // sum11\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%3] \\n\" // r12 r13 r14 r15\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v15.4s, %16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %19.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll2  v14.4s, %19.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v15.4s, %17.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%4], #16 \\n\" // r20 r21\n\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n      
              \"shll2  v14.4s, %20.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v15.4s, %18.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%4] \\n\" // r22 r23 r24 r25\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll2  v14.4s, %21.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v15.4s, %19.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #128]  
     \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%5], #16 \\n\" // r30 r31\n\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll2  v14.4s, %22.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v15.4s, %20.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll2  v14.4s, %23.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%5] \\n\" // r32 r33 r34 r35\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v15.4s, %21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    
\"shll2  v15.4s, %21.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll2  v14.4s, %24.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v15.4s, %22.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%6, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%6], #16 \\n\" // r40 r41\n\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %25.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll2  v14.4s, %25.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v15.4s, %23.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v14.4s, %26.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    
\"prfm   pldl1keep, [%6, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%6] \\n\" // r42 r43 r44 r45\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll2  v14.4s, %26.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v15.4s, %24.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %27.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %24.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%2], #16 \\n\" // r00 r01\n\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll2  v14.4s, %27.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v15.4s, %25.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%7, #128]       \\n\"\n                    \"ld1    {v22.4h, v23.4h}, [%7], #16 \\n\" // r50 r51\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n\n                    \"fmla   v29.4s, 
v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %28.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll2  v15.4s, %25.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v14.4s, %16.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%2] \\n\" // r02 r03 r04 r05\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v15.4s, %26.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"prfm   pldl1keep, [%7, #256]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h, v27.4h}, [%7] \\n\" // r52 r53 r54 r55\n\n                    \"shll2  v14.4s, %16.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n                    \"shll2  v15.4s, %26.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, 
v23.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v24.4s      \\n\"\n                    \"shll   v15.4s, %27.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll2  v14.4s, %17.8h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v24.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v25.4s      \\n\"\n                    \"shll2  v15.4s, %27.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v25.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v26.4s      \\n\"\n                    \"shll   v15.4s, %28.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v27.4s, v27.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v26.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v27.4s      \\n\"\n\n                    \"shrn   v28.4h, v28.4s, #16         \\n\"\n                    \"shrn   v29.4h, v29.4s, #16         \\n\"\n                    \"shrn   v30.4h, v30.4s, #16         \\n\"\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v28.4h, v29.4h}, [%0], #16 \\n\"\n                    \"st1    {v30.4h, v31.4h}, [%1], #16 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n         
           \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3),      // %5\n                    \"=r\"(r4),      // %6\n                    \"=r\"(r5)       // %7\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(r5),\n                    \"w\"(_k00_01), // %16\n                    \"w\"(_k02_03), // %17\n                    \"w\"(_k04_10), // %18\n                    \"w\"(_k11_12), // %19\n                    \"w\"(_k13_14), // %20\n                    \"w\"(_k20_21), // %21\n                    \"w\"(_k22_23), // %22\n                    \"w\"(_k24_30), // %23\n                    \"w\"(_k31_32), // %24\n                    \"w\"(_k33_34), // %25\n                    \"w\"(_k40_41), // %26\n                    \"w\"(_k42_43), // %27\n                    \"w\"(_k44),    // %28\n                    \"w\"(_bias0)   // %29\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%3, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%3], #8          \\n\" // r10\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%3] \\n\" // r11 r12 r13 r14\n\n                    \"mov    v30.16b, %29.16b            \\n\" // sum00\n                    \"mov    v31.16b, %29.16b            \\n\" // sum10\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n              
      \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"shll2  v14.4s, %18.8h, #16         \\n\"\n                    \"shll   v15.4s, %16.4h, #16         \\n\"\n                    \"fmul   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %19.4h, #16         \\n\"\n                    \"fmul   v29.4s, v15.4s, v16.4s      \\n\"\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v17.4s      \\n\"\n                    \"shll2  v14.4s, %19.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v15.4s, %17.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"shll2  v14.4s, %20.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v15.4s, %18.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%4], #8          \\n\" // r20\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%4] \\n\" // r21 r22 r23 r24\n\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, 
#16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v14.4s, v16.4s      \\n\"\n                    \"shll2  v14.4s, %21.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v15.4s, %19.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"shll2  v14.4s, %22.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v15.4s, %20.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"shll2  v14.4s, %23.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%5], #8          \\n\" // r30\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%5] \\n\" // r31 r32 r33 r34\n\n                    \"shll   v15.4s, %21.4h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n              
      \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v16.4s      \\n\"\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v17.4s      \\n\"\n                    \"shll2  v14.4s, %24.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v15.4s, %22.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %25.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"shll2  v14.4s, %25.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v15.4s, %23.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %26.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%6, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%6], #8          \\n\" // r40\n\n                    \"prfm   pldl1keep, [%6, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%6] \\n\" // r41 r42 r43 r44\n\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, 
#16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v14.4s, v16.4s      \\n\"\n                    \"shll2  v14.4s, %26.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v15.4s, %24.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %27.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %24.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"shll2  v14.4s, %27.8h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v15.4s, %25.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %28.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %25.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%2], #8          \\n\" // r00\n\n                    \"prfm   pldl1keep, [%7, #64]        \\n\"\n                    \"ld1    {v21.4h}, [%7], #8          \\n\" // r50\n\n                    \"prfm   pldl1keep, [%7, #256]       \\n\"\n                    \"ld1    {v22.4h, v23.4h, v24.4h, v25.4h}, [%7] \\n\" // r51 r52 r53 r54\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%2] \\n\" // r01 r02 r03 r04\n\n                    \"shll   v15.4s, %26.4h, #16         \\n\"\n\n                    \"shll   v16.4s, 
v16.4h, #16         \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll2  v14.4s, %16.8h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll2  v15.4s, %26.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v15.4s, %27.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll2  v14.4s, %17.8h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v23.4s      \\n\"\n                    \"shll2  v15.4s, %27.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v24.4s      \\n\"\n                    \"shll   v15.4s, %28.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v25.4s      \\n\"\n\n                    \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"shrn   v30.4h, v30.4s, #16         \\n\"\n                    \"shrn   v31.4h, 
v31.4s, #16         \\n\"\n\n                    \"st1    {v30.4h}, [%0], #8          \\n\"\n                    \"st1    {v31.4h}, [%1], #8          \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3),      // %5\n                    \"=r\"(r4),      // %6\n                    \"=r\"(r5)       // %7\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(r5),\n                    \"w\"(_k00_01), // %16\n                    \"w\"(_k02_03), // %17\n                    \"w\"(_k04_10), // %18\n                    \"w\"(_k11_12), // %19\n                    \"w\"(_k13_14), // %20\n                    \"w\"(_k20_21), // %21\n                    \"w\"(_k22_23), // %22\n                    \"w\"(_k24_30), // %23\n                    \"w\"(_k31_32), // %24\n                    \"w\"(_k33_34), // %25\n                    \"w\"(_k40_41), // %26\n                    \"w\"(_k42_43), // %27\n                    \"w\"(_k44),    // %28\n                    \"w\"(_bias0)   // %29\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n\n            r0 += 4 * 4 + w * 4;\n            r1 += 4 * 4 + w * 4;\n            r2 += 4 * 4 + w * 4;\n            r3 += 4 * 4 + w * 4;\n            r4 += 4 * 4 + w * 4;\n            r5 += 4 * 4 + w * 4;\n\n            outptr0 += outw * 4;\n            outptr1 += outw * 4;\n        }\n#endif // __aarch64__\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n            
for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                    \"shll   v14.4s, %12.4h, #16         \\n\"\n\n                    \"mov    v28.16b, %25.16b            \\n\" // sum00\n                    \"mov    v29.16b, %25.16b            \\n\" // sum01\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"mov    v30.16b, %25.16b            \\n\" // sum02\n                    \"mov    v31.16b, %25.16b            \\n\" // sum03\n\n                    \"shll2  v15.4s, %12.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%1]      \\n\" // r04 r05 r06 r07\n                    \"fmla   v31.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll   v14.4s, %13.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"shll2  v15.4s, %13.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16     
    \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v21.4s      \\n\"\n\n                    \"shll   v14.4s, %14.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"shll2  v15.4s, %14.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v23.4s      \\n\"\n\n                    \"shll   v14.4s, %15.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%2]      \\n\" // r14 r15 r16 r17\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n\n                    \"shll2  v15.4s, %15.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, 
v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v20.4s      \\n\"\n\n                    \"shll   v14.4s, %16.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n                    \"fmla   v31.4s, v14.4s, v22.4s      \\n\"\n\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n          
          \"fmla   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%3]      \\n\" // r24 r25 r26 r27\n                    \"fmla   v31.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v21.4s      \\n\"\n\n                    \"shll   v14.4s, %19.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%4], #32 \\n\" // r30 r31 r32 r33\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v22.4s      \\n\"\n    
                \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v23.4s      \\n\"\n\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%4]      \\n\" // r34 r35 r36 r37\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v20.4s      \\n\"\n\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v21.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%5], #32 \\n\" // r40 r41 r42 
r43\n                    \"fmla   v31.4s, v14.4s, v22.4s      \\n\"\n\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%5]      \\n\" // r44 r45 r46 r47\n                    \"fmla   v31.4s, v14.4s, v19.4s      \\n\"\n\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v21.4s      \\n\"\n\n                    
\"shll   v14.4s, %24.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v22.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v23.4s      \\n\"\n\n                    \"shrn   v28.4h, v28.4s, #16         \\n\"\n                    \"shrn   v29.4h, v29.4s, #16         \\n\"\n                    \"shrn   v30.4h, v30.4s, #16         \\n\"\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k00_01), // %12\n                    \"w\"(_k02_03), // %13\n                    \"w\"(_k04_10), // %14\n                    \"w\"(_k11_12), // %15\n                    \"w\"(_k13_14), // %16\n                    \"w\"(_k20_21), // %17\n                    \"w\"(_k22_23), // %18\n                    \"w\"(_k24_30), // %19\n                    \"w\"(_k31_32), // %20\n                    \"w\"(_k33_34), // %21\n                    \"w\"(_k40_41), // %22\n                    
\"w\"(_k42_43), // %23\n                    \"w\"(_k44),    // %24\n                    \"w\"(_bias0)   // %25\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r00 r01 r02 r03\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k00\n\n                    \"pld        [%1, #128]          \\n\"\n                    \"vld1.f32   {d24-d25}, [%1]     \\n\"\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n\n                    \"vmov       q13, q12            \\n\" // sum0 sum1\n                    \"vmov       q14, q12            \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k01\n\n                    \"vmov       q15, q12            \\n\" // sum2 sum3\n\n                    \"vmla.f32   q12, q8, q0         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vmla.f32   q13, q8, q1         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vmla.f32   q14, q8, q2         \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%2 :64] \\n\" // r04 r05 r06 r07\n\n                    \"vmla.f32   q15, q8, q3         \\n\"\n\n                    \"vshll.u16  q10, d22, #16       \\n\" // k02\n                    \"vmla.f32   q12, q9, q1         \\n\"\n                    \"vmla.f32   q13, q9, q2         \\n\"\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    
\"vmla.f32   q14, q9, q3         \\n\"\n                    \"vmla.f32   q15, q9, q4         \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                    \"vshll.u16  q11, d23, #16       \\n\" // k03\n                    \"vmla.f32   q12, q10, q2        \\n\"\n                    \"vmla.f32   q13, q10, q3        \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n                    \"vmla.f32   q14, q10, q4        \\n\"\n                    \"vmla.f32   q15, q10, q5        \\n\"\n\n                    \"vshll.u16  q10, d16, #16       \\n\" // k04\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vmla.f32   q13, q11, q4        \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q11, q5        \\n\"\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%3 :64]!  
\\n\" // r10 r11 r12 r13\n\n                    \"vmla.f32   q15, q11, q6        \\n\"\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k10\n\n                    \"vmla.f32   q12, q10, q4        \\n\"\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vmla.f32   q13, q10, q5        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q10, q6        \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vmla.f32   q15, q10, q7        \\n\"\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k11\n                    \"vmla.f32   q12, q11, q0        \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vmla.f32   q13, q11, q1        \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vmla.f32   q14, q11, q2        \\n\"\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%3 :64] \\n\" // r14 r15 r16 r17\n\n                    \"vmla.f32   q15, q11, q3        \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                    \"vshll.u16  q9, d19, #16        \\n\" // k12\n                    \"vmla.f32   q12, q8, q1         \\n\"\n                    \"vmla.f32   q13, q8, q2         \\n\"\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vmla.f32   q14, q8, q3         \\n\"\n                    \"vmla.f32   q15, q8, q4         \\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k13\n                    \"vmla.f32   q12, q9, q2         \\n\"\n                    \"vmla.f32   q13, q9, q3         \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n                    \"vmla.f32   q14, q9, q4         \\n\"\n                    \"vmla.f32   q15, q9, q5         \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k14\n\n                    \"vmla.f32   q12, q8, q3         \\n\"\n                    \"vmla.f32   q13, q8, q4         \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q8, q5         \\n\"\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%4 :64]!  \\n\" // r20 r21 r22 r23\n\n                    \"vmla.f32   q15, q8, q6         \\n\"\n\n                    \"vshll.u16  q10, d22, #16       \\n\" // k20\n                    \"vmla.f32   q12, q9, q4         \\n\"\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vmla.f32   q13, q9, q5         \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q9, q6         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vmla.f32   q15, q9, q7         \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                    \"vshll.u16  q11, d23, #16       \\n\" // k21\n                    \"vmla.f32   q12, q10, q0        \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vmla.f32   q13, q10, q1        \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vmla.f32   q14, q10, q2        \\n\"\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%4 :64] \\n\" // r24 r25 r26 r27\n\n                    \"vmla.f32   q15, q10, q3        \\n\"\n\n                    \"vshll.u16  q10, d16, #16       \\n\" // k22\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vmla.f32   q13, q11, q2        \\n\"\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vmla.f32   q14, q11, q3        \\n\"\n                    \"vmla.f32   q15, q11, q4        \\n\"\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k23\n\n                    \"vmla.f32   q12, q10, q2        \\n\"\n                    \"vmla.f32   q13, q10, q3        \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n                    \"vmla.f32   q14, q10, q4        \\n\"\n                    \"vmla.f32   q15, q10, q5        \\n\"\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k24\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vmla.f32   q13, q11, q4        \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q11, q5        \\n\"\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%5 :64]!  \\n\" // r30 r31 r32 r33\n\n                    \"vmla.f32   q15, q11, q6        \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                    \"vshll.u16  q9, d19, #16        \\n\" // k30\n                    \"vmla.f32   q12, q8, q4         \\n\"\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vmla.f32   q13, q8, q5         \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q8, q6         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vmla.f32   q15, q8, q7         \\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k31\n                    \"vmla.f32   q12, q9, q0         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vmla.f32   q13, q9, q1         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vmla.f32   q14, q9, q2         \\n\"\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%5 :64] \\n\" // r34 r35 r36 r37\n\n                    \"vmla.f32   q15, q9, q3         \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k32\n\n                    \"vmla.f32   q12, q8, q1         \\n\"\n                    \"vmla.f32   q13, q8, q2         \\n\"\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vmla.f32   q14, q8, q3         \\n\"\n                    \"vmla.f32   q15, q8, q4         \\n\"\n\n                    \"vshll.u16  q10, d22, #16       \\n\" // k33\n                    \"vmla.f32   q12, q9, q2         \\n\"\n                    \"vmla.f32   q13, q9, q3         \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n                    \"vmla.f32   q14, q9, q4         \\n\"\n                    \"vmla.f32   q15, q9, q5         \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                    \"vmla.f32   q12, q10, q3        \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k34\n                    \"vmla.f32   q13, q10, q4        \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q10, q5        \\n\"\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%6 :64]!  \\n\" // r40 r41 r42 r43\n\n                    \"vmla.f32   q15, q10, q6        \\n\"\n\n                    \"vshll.u16  q10, d16, #16       \\n\" // k40\n                    \"vmla.f32   q12, q11, q4        \\n\"\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vmla.f32   q13, q11, q5        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q11, q6        \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vmla.f32   q15, q11, q7        \\n\"\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k41\n\n                    \"vmla.f32   q12, q10, q0        \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vmla.f32   q13, q10, q1        \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vmla.f32   q14, q10, q2        \\n\"\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%6 :64] \\n\" // r44 r45 r46 r47\n\n                    \"vmla.f32   q15, q10, q3        \\n\"\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k42\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vmla.f32   q13, q11, q2        \\n\"\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vmla.f32   q14, q11, q3        \\n\"\n                    \"vmla.f32   q15, q11, q4        \\n\"\n\n                    \"pld        [%7, #64]         
  \\n\"\n                    \"vld1.u16   {d20}, [%7 :64]     \\n\"\n\n                    \"vmla.f32   q12, q8, q2         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k43\n                    \"vmla.f32   q13, q8, q3         \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n                    \"vmla.f32   q14, q8, q4         \\n\"\n                    \"vmla.f32   q15, q8, q5         \\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k44\n\n                    \"vmla.f32   q12, q9, q3         \\n\"\n                    \"vmla.f32   q13, q9, q4         \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q9, q5         \\n\"\n                    \"vmla.f32   q15, q9, q6         \\n\"\n\n                    \"vmla.f32   q12, q8, q4         \\n\"\n                    \"vmla.f32   q13, q8, q5         \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q8, q6         \\n\"\n                    \"vmla.f32   q15, q8, q7         \\n\"\n\n                    \"sub        %7, %7, #192        \\n\" // kptr -= 24 * 4;\n\n                    \"vshrn.u32  d24, q12, #16       \\n\"\n                    \"vshrn.u32  d25, q13, #16       \\n\"\n                    \"vshrn.u32  d26, q14, #16       \\n\"\n                    \"vshrn.u32  d27, q15, #16       \\n\"\n\n                    \"vst1.u16   {d24-d27}, [%0 :64]! 
\\n\"\n\n                    : \"=r\"(outptr0),        // %0\n                    \"=r\"(bias0_data_ptr), // %1\n                    \"=r\"(r0),             // %2\n                    \"=r\"(r1),             // %3\n                    \"=r\"(r2),             // %4\n                    \"=r\"(r3),             // %5\n                    \"=r\"(r4),             // %6\n                    \"=r\"(kptr)            // %7\n                    : \"0\"(outptr0),\n                    \"1\"(bias0_data_ptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(kptr)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%1], #16 \\n\" // r00 r01\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%1] \\n\" // r02 r03 r04 r05\n\n                    \"shll   v14.4s, %12.4h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"mov    v30.16b, %25.16b            \\n\" // sum00\n                    \"mov    v31.16b, %25.16b            \\n\" // sum01\n\n                    \"shll2  v15.4s, %12.8h, #16         \\n\"\n                    \"fmul   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmul   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %13.4h, #16         \\n\"\n                    \"fmla 
  v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%2], #16 \\n\" // r10 r11\n                    \"shll2  v15.4s, %13.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %14.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll2  v15.4s, %14.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%2] \\n\" // r12 r13 r14 r15\n\n                    \"shll   v14.4s, %15.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %15.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%3], #16 \\n\" // r20 r21\n                    \"shll   v14.4s, %16.4h, #16 
        \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%3] \\n\" // r22 r23 r24 r25\n\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%4], #16 \\n\" // r30 r31\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %19.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                  
  \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%4] \\n\" // r32 r33 r34 r35\n\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%5], #16 \\n\" // r40 r41\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, 
v21.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h, v21.4h}, [%5] \\n\" // r42 r43 r44 r45\n\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n\n                    \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"shrn   v30.4h, v30.4s, #16         \\n\"\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v30.4h, v31.4h}, [%0], #16 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4)       // %5\n       
             : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k00_01), // %12\n                    \"w\"(_k02_03), // %13\n                    \"w\"(_k04_10), // %14\n                    \"w\"(_k11_12), // %15\n                    \"w\"(_k13_14), // %16\n                    \"w\"(_k20_21), // %17\n                    \"w\"(_k22_23), // %18\n                    \"w\"(_k24_30), // %19\n                    \"w\"(_k31_32), // %20\n                    \"w\"(_k33_34), // %21\n                    \"w\"(_k40_41), // %22\n                    \"w\"(_k42_43), // %23\n                    \"w\"(_k44),    // %24\n                    \"w\"(_bias0)   // %25\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%2 :64]!  
\\n\" // r00 r01\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k00\n\n                    \"pld        [%1, #128]          \\n\"\n                    \"vld1.f32   {d24-d25}, [%1]     \\n\"\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d8-d11}, [%2 :64]  \\n\" // r02 r03 r04 r05\n\n                    \"vshll.u16  q0, d2, #16         \\n\"\n\n                    \"vmov       q13, q12            \\n\" // sum0 sum1\n\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k01\n                    \"vmul.f32   q14, q8, q0         \\n\"\n                    \"vshll.u16  q2, d8, #16         \\n\"\n                    \"vmul.f32   q15, q8, q1         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k02\n                    \"vmla.f32   q12, q9, q1         \\n\"\n                    \"pld        [%3, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%3 :64]!  \\n\" // r10 r11\n\n                    \"vshll.u16  q3, d9, #16         \\n\"\n                    \"vmla.f32   q13, q9, q2         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k03\n                    \"vmla.f32   q14, q10, q2        \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q15, q10, q3        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k04\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q13, q11, q4        \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k10\n                    \"vmla.f32   q14, q10, q4        \\n\"\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vmla.f32   q15, q10, q5        \\n\"\n\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d8-d11}, [%3 :64]  \\n\" // r12 r13 r14 r15\n\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k11\n                    \"vmla.f32   q12, q11, q0        \\n\"\n                    \"vshll.u16  q2, d8, #16         \\n\"\n                    \"vmla.f32   q13, q11, q1        \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k12\n                    \"vmla.f32   q14, q8, q1         \\n\"\n                    \"pld        [%4, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%4 :64]!  
\\n\" // r20 r21\n\n                    \"vshll.u16  q3, d9, #16         \\n\"\n                    \"vmla.f32   q15, q8, q2         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k13\n                    \"vmla.f32   q12, q9, q2         \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q13, q9, q3         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k14\n                    \"vmla.f32   q14, q8, q3         \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q15, q8, q4         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k20\n                    \"vmla.f32   q12, q9, q4         \\n\"\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vmla.f32   q13, q9, q5         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d8-d11}, [%4 :64]  \\n\" // r22 r23 r24 r25\n\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k21\n                    \"vmla.f32   q14, q10, q0        \\n\"\n                    \"vshll.u16  q2, d8, #16         \\n\"\n                    \"vmla.f32   q15, q10, q1        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k22\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"pld        [%5, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%5 :64]!  
\\n\" // r30 r31\n\n                    \"vshll.u16  q3, d9, #16         \\n\"\n                    \"vmla.f32   q13, q11, q2        \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k23\n                    \"vmla.f32   q14, q10, q2        \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q15, q10, q3        \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k24\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q13, q11, q4        \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k30\n                    \"vmla.f32   q14, q8, q4         \\n\"\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vmla.f32   q15, q8, q5         \\n\"\n\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d8-d11}, [%5 :64]  \\n\" // r32 r33 r34 r35\n\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k31\n                    \"vmla.f32   q12, q9, q0         \\n\"\n                    \"vshll.u16  q2, d8, #16         \\n\"\n                    \"vmla.f32   q13, q9, q1         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k32\n                    \"vmla.f32   q14, q8, q1         \\n\"\n                    \"pld        [%6, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%6 :64]!  
\\n\" // r40 r41\n\n                    \"vshll.u16  q3, d9, #16         \\n\"\n                    \"vmla.f32   q15, q8, q2         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k33\n                    \"vmla.f32   q12, q9, q2         \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q13, q9, q3         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k34\n                    \"vmla.f32   q14, q10, q3        \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q15, q10, q4        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k40\n                    \"vmla.f32   q12, q11, q4        \\n\"\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vmla.f32   q13, q11, q5        \\n\"\n\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d8-d11}, [%6 :64]  \\n\" // r42 r43 r44 r45\n\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k41\n                    \"vmla.f32   q14, q10, q0        \\n\"\n                    \"vshll.u16  q2, d8, #16         \\n\"\n                    \"vmla.f32   q15, q10, q1        \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k42\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vshll.u16  q3, d9, #16         \\n\"\n                    \"vmla.f32   q13, q11, q2        \\n\"\n                    \"pld        [%7, #64]           \\n\"\n                    \"vld1.u16   {d20}, [%7 :64]     \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k43\n                    \"vmla.f32   q14, q8, q2         \\n\"\n                    \"vshll.u16  q4, d10, #16        
\\n\"\n                    \"vmla.f32   q15, q8, q3         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k44\n                    \"vmla.f32   q12, q9, q3         \\n\"\n                    \"vmla.f32   q13, q9, q4         \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q14, q8, q4         \\n\"\n                    \"vmla.f32   q15, q8, q5         \\n\"\n\n                    \"vadd.f32   q12, q12, q14       \\n\"\n                    \"vadd.f32   q13, q13, q15       \\n\"\n\n                    \"sub        %7, %7, #192        \\n\" // kptr -= 24 * 4;\n\n                    \"vshrn.u32  d24, q12, #16       \\n\"\n                    \"vshrn.u32  d25, q13, #16       \\n\"\n\n                    \"vst1.u16   {d24-d25}, [%0 :64]! \\n\"\n\n                    : \"=r\"(outptr0),        // %0\n                    \"=r\"(bias0_data_ptr), // %1\n                    \"=r\"(r0),             // %2\n                    \"=r\"(r1),             // %3\n                    \"=r\"(r2),             // %4\n                    \"=r\"(r3),             // %5\n                    \"=r\"(r4),             // %6\n                    \"=r\"(kptr)            // %7\n                    : \"0\"(outptr0),\n                    \"1\"(bias0_data_ptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(kptr)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j < outw; j++)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%1], #8          \\n\" // r00\n\n                    \"prfm   pldl1keep, [%1, 
#256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%1] \\n\" // r01 r02 r03 r04\n\n                    \"shll   v14.4s, %12.4h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"mov    v31.16b, %25.16b            \\n\" // sum01\n\n                    \"shll2  v15.4s, %12.8h, #16         \\n\"\n                    \"fmul   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %13.4h, #16         \\n\"\n                    \"fmul   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %13.8h, #16         \\n\"\n                    \"fmul   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %14.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %14.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%2], #8          \\n\" // r10\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%2] \\n\" // r11 r12 r13 r14\n\n                    \"shll   v14.4s, %15.4h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v29.4s, v15.4s, v16.4s      \\n\"\n                    \"shll2  
v15.4s, %15.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%3], #8          \\n\" // r20\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%3] \\n\" // r21 r22 r23 r24\n\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %19.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%4], #8          \\n\" // r30\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, 
v19.4h, v20.4h}, [%4] \\n\" // r31 r32 r33 r34\n\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v31.4s, v15.4s, v16.4s      \\n\"\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #64]        \\n\"\n                    \"ld1    {v16.4h}, [%5], #8          \\n\" // r40\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v17.4h, v18.4h, v19.4h, v20.4h}, [%5] \\n\" // r41 r42 r43 r44\n\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n                    \"fmla   v30.4s, 
v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n\n                    \"fadd   v29.4s, v29.4s, v30.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v31.4h}, [%0], #8          \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k00_01), // %12\n                    \"w\"(_k02_03), // %13\n                    \"w\"(_k04_10), // %14\n                    \"w\"(_k11_12), // %15\n                    \"w\"(_k13_14), // %16\n                    \"w\"(_k20_21), // %17\n                    \"w\"(_k22_23), // %18\n                    \"w\"(_k24_30), // %19\n                    \"w\"(_k31_32), // %20\n                    \"w\"(_k33_34), // %21\n                    \"w\"(_k40_41), // %22\n                    \"w\"(_k42_43), // %23\n                    \"w\"(_k44),    // %24\n                    \"w\"(_bias0)   // %25\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                    \"pld        [%1, #128]          \\n\"\n                    \"vld1.f32   {d24-d25}, [%1]     \\n\" // sum0\n\n                    \"pld        [%2, #64]           \\n\"\n                    \"vld1.u16   {d1}, [%2 :64]!     \\n\" // r00\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k00\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d6-d9}, [%2 :64]   \\n\" // r01 r02 r03 r04\n\n                    \"vshll.u16  q0, d1, #16         \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k01\n                    \"vshll.u16  q1, d6, #16         \\n\"\n                    \"vmul.f32   q13, q8, q0         \\n\"\n                    \"pld        [%3, #64]           \\n\"\n                    \"vld1.u16   {d1}, [%3 :64]!     \\n\" // r10\n\n                    \"vshll.u16  q2, d7, #16         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k02\n                    \"vmul.f32   q14, q9, q1         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n                    \"vshll.u16  q3, d8, #16         \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k03\n                    \"vmul.f32   q15, q10, q2        \\n\"\n                    \"vshll.u16  q4, d9, #16         \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k04\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vshll.u16  q0, d1, #16         \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k10\n                    \"vmla.f32   q13, q10, q4        \\n\"\n\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d6-d9}, [%3 :64]   \\n\" // r11 r12 r13 r14\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k11\n                    \"vshll.u16  q1, d6, #16         \\n\"\n                    \"vmla.f32   q14, q11, q0        \\n\"\n                    \"pld        [%4, #64]           \\n\"\n                    \"vld1.u16   {d1}, [%4 :64]!     \\n\" // r20\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n                    \"vshll.u16  q2, d7, #16         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k12\n                    \"vmla.f32   q15, q8, q1         \\n\"\n                    \"vshll.u16  q3, d8, #16         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k13\n                    \"vmla.f32   q12, q9, q2         \\n\"\n                    \"vshll.u16  q4, d9, #16         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k14\n                    \"vmla.f32   q13, q8, q3         \\n\"\n                    \"vshll.u16  q0, d1, #16         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k20\n                    \"vmla.f32   q14, q9, q4         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d6-d9}, [%4 :64]   \\n\" // r21 r22 r23 r24\n\n                    \"vshll.u16  q11, d23, #16       \\n\" // k21\n                    \"vshll.u16  q1, d6, #16         \\n\"\n                    \"vmla.f32   q15, q10, q0        \\n\"\n                    \"pld        [%5, #64]           \\n\"\n                    \"vld1.u16   {d1}, [%5 :64]!     
\\n\" // r30\n\n                    \"vshll.u16  q2, d7, #16         \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k22\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vshll.u16  q3, d8, #16         \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k23\n                    \"vmla.f32   q13, q10, q2        \\n\"\n                    \"vshll.u16  q4, d9, #16         \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k24\n                    \"vmla.f32   q14, q11, q3        \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n                    \"vshll.u16  q0, d1, #16         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k30\n                    \"vmla.f32   q15, q8, q4         \\n\"\n\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d6-d9}, [%5 :64]   \\n\" // r31 r32 r33 r34\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k31\n                    \"vshll.u16  q1, d6, #16         \\n\"\n                    \"vmla.f32   q12, q9, q0         \\n\"\n                    \"pld        [%6, #64]           \\n\"\n                    \"vld1.u16   {d1}, [%6 :64]!     \\n\" // r40\n\n                    \"vshll.u16  q2, d7, #16         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k32\n                    \"vmla.f32   q13, q8, q1         \\n\"\n                    \"vshll.u16  q3, d8, #16         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k33\n                    \"vmla.f32   q14, q9, q2         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n                    \"vshll.u16  q4, d9, #16         \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k34\n                    \"vmla.f32   q15, q10, q3        \\n\"\n                    \"vshll.u16  q0, d1, #16         \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k40\n                    \"vmla.f32   q12, q11, q4        \\n\"\n\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d6-d9}, [%6 :64]   \\n\" // r41 r42 r43 r44\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k41\n                    \"vshll.u16  q1, d6, #16         \\n\"\n                    \"vmla.f32   q13, q10, q0        \\n\"\n                    \"vshll.u16  q2, d7, #16         \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k42\n                    \"vmla.f32   q14, q11, q1        \\n\"\n                    \"pld        [%7, #64]           \\n\"\n                    \"vld1.u16   {d20}, [%7 :64]     \\n\"\n                    \"vshll.u16  q3, d8, #16         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k43\n                    \"vmla.f32   q15, q8, q2         \\n\"\n                    \"vshll.u16  q4, d9, #16         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k44\n                    \"vmla.f32   q12, q9, q3         \\n\"\n                    \"vmla.f32   q13, q8, q4         \\n\"\n\n                    \"vadd.f32   q14, q14, q15       \\n\"\n                    \"vadd.f32   q12, q12, q13       \\n\"\n                    \"vadd.f32   q12, q12, q14       \\n\"\n\n                    \"sub        %7, %7, #192        \\n\" // kptr -= 24 * 4;\n\n                    \"vshrn.u32  d24, q12, #16       \\n\"\n\n                    \"vst1.u16   {d24}, [%0 :64]!    
\\n\"\n\n                    : \"=r\"(outptr0),        // %0\n                    \"=r\"(bias0_data_ptr), // %1\n                    \"=r\"(r0),             // %2\n                    \"=r\"(r1),             // %3\n                    \"=r\"(r2),             // %4\n                    \"=r\"(r3),             // %5\n                    \"=r\"(r4),             // %6\n                    \"=r\"(kptr)            // %7\n                    : \"0\"(outptr0),\n                    \"1\"(bias0_data_ptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(kptr)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n\n            r0 += 4 * 4;\n            r1 += 4 * 4;\n            r2 += 4 * 4;\n            r3 += 4 * 4;\n            r4 += 4 * 4;\n        }\n    }\n}\n\nstatic void convdw5x5s2_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n#if __aarch64__\n        float32x4_t _bias0 = bias ? 
vld1q_f32((const float*)bias + g * 4) : vdupq_n_f32(0.f);\n#endif // __aarch64__\n\n        const unsigned short* kptr = kernel.row<const unsigned short>(g);\n\n        unsigned short* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const unsigned short* r0 = img0.row<const unsigned short>(0);\n        const unsigned short* r1 = img0.row<const unsigned short>(1);\n        const unsigned short* r2 = img0.row<const unsigned short>(2);\n        const unsigned short* r3 = img0.row<const unsigned short>(3);\n        const unsigned short* r4 = img0.row<const unsigned short>(4);\n\n#if __aarch64__\n        // 4 * 25\n        uint16x8_t _k00_01 = vld1q_u16(kptr);\n        uint16x8_t _k02_03 = vld1q_u16(kptr + 8);\n        uint16x8_t _k04_10 = vld1q_u16(kptr + 16);\n        uint16x8_t _k11_12 = vld1q_u16(kptr + 24);\n        uint16x8_t _k13_14 = vld1q_u16(kptr + 32);\n        uint16x8_t _k20_21 = vld1q_u16(kptr + 40);\n        uint16x8_t _k22_23 = vld1q_u16(kptr + 48);\n        uint16x8_t _k24_30 = vld1q_u16(kptr + 56);\n        uint16x8_t _k31_32 = vld1q_u16(kptr + 64);\n        uint16x8_t _k33_34 = vld1q_u16(kptr + 72);\n        uint16x8_t _k40_41 = vld1q_u16(kptr + 80);\n        uint16x8_t _k42_43 = vld1q_u16(kptr + 88);\n        uint16x4_t _k44 = vld1_u16(kptr + 96);\n#else  // __aarch64__\n        float bias0_data[4];\n        if (bias)\n        {\n            bias0_data[0] = bias[g * 4 + 0];\n            bias0_data[1] = bias[g * 4 + 1];\n            bias0_data[2] = bias[g * 4 + 2];\n            bias0_data[3] = bias[g * 4 + 3];\n        }\n        else\n        {\n            bias0_data[0] = 0.f;\n            bias0_data[1] = 0.f;\n            bias0_data[2] = 0.f;\n            bias0_data[3] = 0.f;\n        }\n        const float* bias0_data_ptr = bias0_data;\n#endif // __aarch64__\n\n        int i = 0;\n\n        for (; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if 
__aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%1], #32 \\n\" // r04 r05 r06 r07\n\n                    \"shll   v14.4s, %12.4h, #16         \\n\"\n\n                    \"mov    v28.16b, %25.16b            \\n\" // sum00\n                    \"mov    v29.16b, %25.16b            \\n\" // sum01\n                    \"mov    v30.16b, %25.16b            \\n\" // sum02\n                    \"mov    v31.16b, %25.16b            \\n\" // sum03\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"prfm   pldl1keep, [%1, #192]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h}, [%1] \\n\" // r08 r09 r010\n\n                    \"shll2  v15.4s, %12.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v22.4s      \\n\"\n                    \"shll   v14.4s, %13.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"fmla 
  v31.4s, v15.4s, v23.4s      \\n\"\n                    \"shll2  v15.4s, %13.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v22.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v24.4s      \\n\"\n                    \"shll   v14.4s, %14.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v23.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v25.4s      \\n\"\n                    \"shll2  v15.4s, %14.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v22.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%2], #32 \\n\" // r14 r15 r16 r17\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v24.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v26.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #192]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h}, [%2] \\n\" // r18 r19 r110\n\n                    \"shll   v14.4s, %15.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n          
          \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n                    \"shll2  v15.4s, %15.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v21.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v23.4s      \\n\"\n                    \"shll   v14.4s, %16.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v24.4s      \\n\"\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v23.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v25.4s      \\n\"\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16     
    \\n\"\n                    \"fmla   v29.4s, v15.4s, v22.4s      \\n\"\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%3], #32 \\n\" // r24 r25 r26 r27\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v24.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v26.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #192]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h}, [%3] \\n\" // r28 r29 r210\n\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v22.4s      \\n\"\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v22.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v24.4s      \\n\"\n         
           \"shll   v14.4s, %19.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%4], #32 \\n\" // r30 r31 r32 r33\n\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v23.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v25.4s      \\n\"\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v22.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%4], #32 \\n\" // r34 r35 r36 r37\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v24.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v26.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #192]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h}, [%4] \\n\" // r38 r39 r310\n\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n                    \"shll2  v15.4s, %20.8h, #16        
 \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v21.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v23.4s      \\n\"\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v22.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v24.4s      \\n\"\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%5], #32 \\n\" // r40 r41 r42 r43\n\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v23.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v25.4s      \\n\"\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v22.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%5], #32 \\n\" // r44 r45 r46 r47\n\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v24.4s      \\n\"\n                    \"shll   v18.4s, v18.4h, #16         
\\n\"\n                    \"fmla   v31.4s, v15.4s, v26.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #192]       \\n\"\n                    \"ld1    {v24.4h, v25.4h, v26.4h}, [%5] \\n\" // r48 r49 r410\n\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v14.4s, v22.4s      \\n\"\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll   v23.4s, v23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v21.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v23.4s      \\n\"\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v24.4s, v24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v22.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v24.4s      \\n\"\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n                    \"fmla   v28.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v29.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v25.4s, v25.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v23.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v25.4s      \\n\"\n                    \"fmla   v28.4s, v14.4s, 
v20.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v22.4s      \\n\"\n                    \"shll   v26.4s, v26.4h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v24.4s      \\n\"\n                    \"fmla   v31.4s, v14.4s, v26.4s      \\n\"\n\n                    \"shrn   v28.4h, v28.4s, #16         \\n\"\n                    \"shrn   v29.4h, v29.4s, #16         \\n\"\n                    \"shrn   v30.4h, v30.4s, #16         \\n\"\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v28.4h, v29.4h, v30.4h, v31.4h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k00_01), // %12\n                    \"w\"(_k02_03), // %13\n                    \"w\"(_k04_10), // %14\n                    \"w\"(_k11_12), // %15\n                    \"w\"(_k13_14), // %16\n                    \"w\"(_k20_21), // %17\n                    \"w\"(_k22_23), // %18\n                    \"w\"(_k24_30), // %19\n                    \"w\"(_k31_32), // %20\n                    \"w\"(_k33_34), // %21\n                    \"w\"(_k40_41), // %22\n                    \"w\"(_k42_43), // %23\n                    \"w\"(_k44),    // %24\n                    \"w\"(_bias0)   // %25\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%7, #256]       
   \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                    \"pld        [%1, #128]          \\n\"\n                    \"vld1.f32   {d24-d25}, [%1]     \\n\"\n                    \"vmov       q13, q12            \\n\" // sum0 sum1\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k00\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%2 :64]!  \\n\" // r00 r01 r02 r03\n\n                    \"vmov       q14, q12            \\n\"\n                    \"vmov       q15, q12            \\n\" // sum2 sum3\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k01\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%2 :64]! \\n\" // r04 r05 r06 r07\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n\n                    \"vmla.f32   q12, q8, q0         \\n\"\n                    \"vmla.f32   q13, q8, q2         \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q8, q4         \\n\"\n                    \"vmla.f32   q15, q8, q6         \\n\"\n\n                    \"vshll.u16  q10, d22, #16       \\n\" // k02\n\n                    \"vmla.f32   q12, q9, q1         \\n\"\n                    \"vmla.f32   q13, q9, q3         \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q9, q5         \\n\"\n                    \"vmla.f32   q15, q9, q7         \\n\"\n\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%2 :64]!  
\\n\" // r08 r09\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                    \"vmla.f32   q12, q10, q2        \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k03\n                    \"vmla.f32   q13, q10, q4        \\n\"\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vmla.f32   q14, q10, q6        \\n\"\n                    \"vmla.f32   q15, q10, q0        \\n\"\n\n                    \"vshll.u16  q10, d16, #16       \\n\" // k04\n\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vmla.f32   q13, q11, q5        \\n\"\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vmla.f32   q14, q11, q7        \\n\"\n                    \"vmla.f32   q15, q11, q1        \\n\"\n\n                    \"pld        [%2, #64]           \\n\"\n                    \"vld1.u16   {d5}, [%2 :64]      \\n\" // r010\n\n                    \"vmla.f32   q12, q10, q4        \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k10\n                    \"vmla.f32   q13, q10, q6        \\n\"\n                    \"vshll.u16  q2, d5, #16         \\n\"\n                    \"vmla.f32   q14, q10, q0        \\n\"\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%3 :64]! \\n\" // r10 r11 r12 r13\n\n                    \"vmla.f32   q15, q10, q2        \\n\"\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k11\n\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%3 :64]!  
\\n\" // r14 r15 r16 r17\n\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n\n                    \"vmla.f32   q12, q11, q4        \\n\"\n                    \"vmla.f32   q13, q11, q6        \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vmla.f32   q14, q11, q0        \\n\"\n                    \"vmla.f32   q15, q11, q2        \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                    \"vmla.f32   q12, q8, q5         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k12\n                    \"vmla.f32   q13, q8, q7         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vmla.f32   q14, q8, q1         \\n\"\n                    \"vmla.f32   q15, q8, q3         \\n\"\n\n                    \"pld        [%3, #128]          \\n\"\n                    \"vld1.u16   {d10-d11}, [%3 :64]! 
\\n\" // r18 r19\n\n                    \"vmla.f32   q12, q9, q6         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k13\n                    \"vmla.f32   q13, q9, q0         \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q14, q9, q2         \\n\"\n                    \"vmla.f32   q15, q9, q4         \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k14\n\n                    \"vmla.f32   q12, q8, q7         \\n\"\n                    \"vmla.f32   q13, q8, q1         \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q14, q8, q3         \\n\"\n                    \"vmla.f32   q15, q8, q5         \\n\"\n\n                    \"pld        [%3, #64]           \\n\"\n                    \"vld1.u16   {d13}, [%3 :64]     \\n\" // r110\n\n                    \"vmla.f32   q12, q9, q0         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k20\n                    \"vmla.f32   q13, q9, q2         \\n\"\n                    \"vshll.u16  q6, d13, #16        \\n\"\n                    \"vmla.f32   q14, q9, q4         \\n\"\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%4 :64]!  \\n\" // r20 r21 r22 r23\n\n                    \"vmla.f32   q15, q9, q6         \\n\"\n\n                    \"vshll.u16  q11, d23, #16       \\n\" // k21\n\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%4 :64]! 
\\n\" // r24 r25 r26 r27\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                    \"vmla.f32   q12, q10, q0        \\n\"\n                    \"vmla.f32   q13, q10, q2        \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q10, q4        \\n\"\n                    \"vmla.f32   q15, q10, q6        \\n\"\n\n                    \"vshll.u16  q10, d16, #16       \\n\" // k22\n\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vmla.f32   q13, q11, q3        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q11, q5        \\n\"\n                    \"vmla.f32   q15, q11, q7        \\n\"\n\n                    \"pld        [%4, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%4 :64]!  
\\n\" // r28 r29\n\n                    \"vmla.f32   q12, q10, q2        \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k23\n                    \"vmla.f32   q13, q10, q4        \\n\"\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vmla.f32   q14, q10, q6        \\n\"\n                    \"vmla.f32   q15, q10, q0        \\n\"\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k24\n\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vmla.f32   q13, q11, q5        \\n\"\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vmla.f32   q14, q11, q7        \\n\"\n                    \"vmla.f32   q15, q11, q1        \\n\"\n\n                    \"pld        [%4, #64]           \\n\"\n                    \"vld1.u16   {d5}, [%4 :64]      \\n\" // r210\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                    \"vmla.f32   q12, q8, q4         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k30\n                    \"vmla.f32   q13, q8, q6         \\n\"\n                    \"vshll.u16  q2, d5, #16         \\n\"\n                    \"vmla.f32   q14, q8, q0         \\n\"\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%5 :64]! \\n\" // r30 r31 r32 r33\n\n                    \"vmla.f32   q15, q8, q2         \\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k31\n\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%5 :64]!  
\\n\" // r34 r35 r36 r37\n\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n\n                    \"vmla.f32   q12, q9, q4         \\n\"\n                    \"vmla.f32   q13, q9, q6         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vmla.f32   q14, q9, q0         \\n\"\n                    \"vmla.f32   q15, q9, q2         \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k32\n\n                    \"vmla.f32   q12, q8, q5         \\n\"\n                    \"vmla.f32   q13, q8, q7         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vmla.f32   q14, q8, q1         \\n\"\n                    \"vmla.f32   q15, q8, q3         \\n\"\n\n                    \"pld        [%5, #128]          \\n\"\n                    \"vld1.u16   {d10-d11}, [%5 :64]! \\n\" // r38 r39\n\n                    \"vmla.f32   q12, q9, q6         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k33\n                    \"vmla.f32   q13, q9, q0         \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q14, q9, q2         \\n\"\n                    \"vmla.f32   q15, q9, q4         \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n\n                    \"vmla.f32   q12, q10, q7        \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k34\n                    \"vmla.f32   q13, q10, q1        \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q14, q10, q3        \\n\"\n                    \"vmla.f32   q15, q10, q5        \\n\"\n\n                    \"pld        [%5, #64]           \\n\"\n                    \"vld1.u16   {d13}, [%5 :64]     \\n\" // r310\n\n                    \"vmla.f32   q12, q11, q0        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k40\n                    \"vmla.f32   q13, q11, q2        \\n\"\n                    \"vshll.u16  q6, d13, #16        \\n\"\n                    \"vmla.f32   q14, q11, q4        \\n\"\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%6 :64]!  \\n\" // r40 r41 r42 r43\n\n                    \"vmla.f32   q15, q11, q6        \\n\"\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k41\n\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d12-d15}, [%6 :64]! 
\\n\" // r44 r45 r46 r47\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vshll.u16  q4, d12, #16        \\n\"\n                    \"vshll.u16  q5, d13, #16        \\n\"\n\n                    \"vmla.f32   q12, q10, q0        \\n\"\n                    \"vmla.f32   q13, q10, q2        \\n\"\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vmla.f32   q14, q10, q4        \\n\"\n                    \"vmla.f32   q15, q10, q6        \\n\"\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k42\n\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vmla.f32   q13, q11, q3        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n                    \"vmla.f32   q14, q11, q5        \\n\"\n                    \"pld        [%7, #64]           \\n\"\n                    \"vld1.u16   {d20}, [%7 :64]     \\n\"\n\n                    \"vmla.f32   q15, q11, q7        \\n\"\n\n                    \"pld        [%6, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%6 :64]!  
\\n\" // r48 r49\n\n                    \"vmla.f32   q12, q8, q2         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k43\n                    \"vmla.f32   q13, q8, q4         \\n\"\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vmla.f32   q14, q8, q6         \\n\"\n                    \"vmla.f32   q15, q8, q0         \\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k44\n\n                    \"vmla.f32   q12, q9, q3         \\n\"\n                    \"vmla.f32   q13, q9, q5         \\n\"\n                    \"vshll.u16  q1, d3, #16         \\n\"\n                    \"vmla.f32   q14, q9, q7         \\n\"\n                    \"vmla.f32   q15, q9, q1         \\n\"\n\n                    \"pld        [%6, #64]           \\n\"\n                    \"vld1.u16   {d5}, [%6 :64]      \\n\" // r410\n\n                    \"vmla.f32   q12, q8, q4         \\n\"\n                    \"vmla.f32   q13, q8, q6         \\n\"\n                    \"vshll.u16  q2, d5, #16         \\n\"\n                    \"vmla.f32   q14, q8, q0         \\n\"\n                    \"vmla.f32   q15, q8, q2         \\n\"\n\n                    \"sub        %7, %7, #192        \\n\" // kptr -= 24 * 4;\n\n                    \"sub        %2, %2, #16         \\n\"\n                    \"sub        %3, %3, #16         \\n\"\n                    \"sub        %4, %4, #16         \\n\"\n                    \"sub        %5, %5, #16         \\n\"\n                    \"sub        %6, %6, #16         \\n\"\n\n                    \"vshrn.u32  d24, q12, #16       \\n\"\n                    \"vshrn.u32  d25, q13, #16       \\n\"\n                    \"vshrn.u32  d26, q14, #16       \\n\"\n                    \"vshrn.u32  d27, q15, #16       \\n\"\n\n                    \"vst1.u16   {d24-d27}, [%0 :64]! 
\\n\"\n\n                    : \"=r\"(outptr0),        // %0\n                    \"=r\"(bias0_data_ptr), // %1\n                    \"=r\"(r0),             // %2\n                    \"=r\"(r1),             // %3\n                    \"=r\"(r2),             // %4\n                    \"=r\"(r3),             // %5\n                    \"=r\"(r4),             // %6\n                    \"=r\"(kptr)            // %7\n                    : \"0\"(outptr0),\n                    \"1\"(bias0_data_ptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(kptr)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%1], #32 \\n\" // r00 r01 r02 r03\n\n                    \"prfm   pldl1keep, [%1, #192]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h}, [%1] \\n\" // r04 r05 r06\n\n                    \"shll   v14.4s, %12.4h, #16         \\n\"\n                    \"shll2  v15.4s, %12.8h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n\n                    \"mov    v30.16b, %25.16b            \\n\" // sum00\n                    \"mov    v31.16b, %25.16b            \\n\" // sum01\n\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"fmul   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmul   v29.4s, v14.4s, v18.4s      \\n\"\n             
       \"shll   v14.4s, %13.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %13.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %14.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%2], #32 \\n\" // r10 r11 r12 r13\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n                    \"shll2  v15.4s, %14.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #192]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h}, [%2] \\n\" // r14 r15 r16\n\n                    \"shll   v14.4s, %15.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %15.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   
v14.4s, %16.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%3], #32 \\n\" // r20 r21 r22 r23\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #192]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h}, [%3] \\n\" // r24 r25 r26\n\n                    \"shll2  v15.4s, %17.8h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %19.4h, 
#16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%4], #32 \\n\" // r30 r31 r32 r33\n\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #192]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h}, [%4] \\n\" // r34 r35 r36\n\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v15.4s, v16.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v18.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v16.4h, v17.4h, v18.4h, v19.4h}, [%5], #32 \\n\" // r40 r41 r42 r43\n\n                    \"shll 
  v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v21.4s      \\n\"\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v20.4s      \\n\"\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v22.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #192]       \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h}, [%5] \\n\" // r44 r45 r46\n\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v17.4s      \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v21.4s, v21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v14.4s, v20.4s      \\n\"\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n                    \"fmla   v30.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v31.4s, v15.4s, v21.4s      \\n\"\n                    \"shll   v22.4s, v22.4h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n                    \"fmla   v29.4s, v14.4s, v22.4s      \\n\"\n\n                    \"fadd   v30.4s, v30.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"shrn   v30.4h, v30.4s, #16         \\n\"\n               
     \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v30.4h, v31.4h}, [%0], #16 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4)       // %5\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k00_01), // %12\n                    \"w\"(_k02_03), // %13\n                    \"w\"(_k04_10), // %14\n                    \"w\"(_k11_12), // %15\n                    \"w\"(_k13_14), // %16\n                    \"w\"(_k20_21), // %17\n                    \"w\"(_k22_23), // %18\n                    \"w\"(_k24_30), // %19\n                    \"w\"(_k31_32), // %20\n                    \"w\"(_k33_34), // %21\n                    \"w\"(_k40_41), // %22\n                    \"w\"(_k42_43), // %23\n                    \"w\"(_k44),    // %24\n                    \"w\"(_bias0)   // %25\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n\n                    \"pld        [%1, #128]          \\n\"\n                    \"vld1.f32   {d24-d25}, [%1]     \\n\"\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%2 :64]!  
\\n\" // r00 r01 r02 r03\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k00\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.u16   {d10-d12}, [%2 :64] \\n\" // r04 r05 r06\n\n                    \"vmov       q13, q12            \\n\" // sum0 sum1\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k01\n                    \"vmul.f32   q14, q8, q0         \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmul.f32   q15, q8, q2         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k02\n                    \"vmla.f32   q12, q9, q1         \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q13, q9, q3         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n                    \"vmla.f32   q14, q10, q2        \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k03\n                    \"vmla.f32   q15, q10, q4        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k04\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vshll.u16  q6, d12, #16        \\n\"\n                    \"vmla.f32   q13, q11, q5        \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k10\n                    \"vmla.f32   q14, q10, q4        \\n\"\n                    \"vmla.f32   q15, q10, q6        \\n\"\n\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%3 :64]!  
\\n\" // r10 r11 r12 r13\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k11\n\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.u16   {d10-d12}, [%3 :64] \\n\" // r14 r15 r16\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vmla.f32   q12, q11, q0        \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q13, q11, q2        \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n                    \"vmla.f32   q14, q8, q1         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k12\n                    \"vmla.f32   q15, q8, q3         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k13\n                    \"vmla.f32   q12, q9, q2         \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q13, q9, q4         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k14\n                    \"vmla.f32   q14, q8, q3         \\n\"\n                    \"vshll.u16  q6, d12, #16        \\n\"\n                    \"vmla.f32   q15, q8, q5         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k20\n                    \"vmla.f32   q12, q9, q4         \\n\"\n                    \"vmla.f32   q13, q9, q6         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%4 :64]!  
\\n\" // r20 r21 r22 r23\n\n                    \"vshll.u16  q11, d23, #16       \\n\" // k21\n\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.u16   {d10-d12}, [%4 :64] \\n\" // r24 r25 r26\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vmla.f32   q14, q10, q0        \\n\"\n                    \"vmla.f32   q15, q10, q2        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k22\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q13, q11, q3        \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k23\n                    \"vmla.f32   q14, q10, q2        \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q15, q10, q4        \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k24\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vshll.u16  q6, d12, #16        \\n\"\n                    \"vmla.f32   q13, q11, q5        \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n                    \"vmla.f32   q14, q8, q4         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k30\n                    \"vmla.f32   q15, q8, q6         \\n\"\n\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%5 :64]!  
\\n\" // r30 r31 r32 r33\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k31\n\n                    \"pld        [%5, #256]          \\n\"\n                    \"vld1.u16   {d10-d12}, [%5 :64] \\n\" // r34 r35 r36\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vmla.f32   q12, q9, q0         \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q13, q9, q2         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k32\n                    \"vmla.f32   q14, q8, q1         \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q15, q8, q3         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k33\n                    \"vmla.f32   q12, q9, q2         \\n\"\n                    \"vshll.u16  q6, d12, #16        \\n\"\n                    \"vmla.f32   q13, q9, q4         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n                    \"vmla.f32   q14, q10, q3        \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k34\n                    \"vmla.f32   q15, q10, q5        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k40\n                    \"vmla.f32   q12, q11, q4        \\n\"\n                    \"vmla.f32   q13, q11, q6        \\n\"\n\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d4-d7}, [%6 :64]!  
\\n\" // r40 r41 r42 r43\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k41\n\n                    \"pld        [%6, #256]          \\n\"\n                    \"vld1.u16   {d10-d12}, [%6 :64] \\n\" // r44 r45 r46\n\n                    \"vshll.u16  q0, d4, #16         \\n\"\n                    \"vshll.u16  q1, d5, #16         \\n\"\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n\n                    \"vmla.f32   q14, q10, q0        \\n\"\n                    \"vshll.u16  q4, d10, #16        \\n\"\n                    \"vmla.f32   q15, q10, q2        \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k42\n                    \"vmla.f32   q12, q11, q1        \\n\"\n                    \"vshll.u16  q5, d11, #16        \\n\"\n                    \"vmla.f32   q13, q11, q3        \\n\"\n                    \"pld        [%7, #64]           \\n\"\n                    \"vld1.u16   {d20}, [%7 :64]     \\n\"\n                    \"vmla.f32   q14, q8, q2         \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k43\n                    \"vmla.f32   q15, q8, q4         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k44\n                    \"vmla.f32   q12, q9, q3         \\n\"\n                    \"vshll.u16  q6, d12, #16        \\n\"\n                    \"vmla.f32   q13, q9, q5         \\n\"\n                    \"vmla.f32   q14, q8, q4         \\n\"\n                    \"vmla.f32   q15, q8, q6         \\n\"\n\n                    \"vadd.f32   q12, q12, q14       \\n\"\n                    \"vadd.f32   q13, q13, q15       \\n\"\n\n                    \"sub        %7, %7, #192        \\n\" // kptr -= 24 * 4;\n\n                    \"vshrn.u32  d24, q12, #16       \\n\"\n                    \"vshrn.u32  d25, q13, #16       \\n\"\n\n                    \"vst1.u16   {d24-d25}, [%0 :64]! 
\\n\"\n\n                    : \"=r\"(outptr0),        // %0\n                    \"=r\"(bias0_data_ptr), // %1\n                    \"=r\"(r0),             // %2\n                    \"=r\"(r1),             // %3\n                    \"=r\"(r2),             // %4\n                    \"=r\"(r3),             // %5\n                    \"=r\"(r4),             // %6\n                    \"=r\"(kptr)            // %7\n                    : \"0\"(outptr0),\n                    \"1\"(bias0_data_ptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(kptr)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j < outw; j++)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%1], #16 \\n\" // r00 r01\n\n                    \"prfm   pldl1keep, [%1, #192]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h}, [%1] \\n\" // r02 r03 r04\n\n                    \"shll   v14.4s, %12.4h, #16         \\n\"\n\n                    \"mov    v31.16b, %25.16b            \\n\" // sum00\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"shll2  v15.4s, %12.8h, #16         \\n\"\n                    \"fmul   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %13.4h, #16         \\n\"\n                    \"fmul   v29.4s, v15.4s, v17.4s  
    \\n\"\n                    \"shll2  v15.4s, %13.8h, #16         \\n\"\n                    \"fmul   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %14.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %14.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%2], #16 \\n\" // r10 r11\n\n                    \"prfm   pldl1keep, [%2, #192]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h}, [%2] \\n\" // r12 r13 r14\n\n                    \"shll   v14.4s, %15.4h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v29.4s, v15.4s, v16.4s      \\n\"\n                    \"shll2  v15.4s, %15.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %16.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %16.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %17.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%3], #16 \\n\" // r20 r21\n\n                    \"prfm   pldl1keep, [%3, #192]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h}, [%3] \\n\" // r22 r23 r24\n\n                    \"shll2  v15.4s, %17.8h, #16         
\\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v30.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %18.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %18.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %19.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v19.4s      \\n\"\n                    \"shll2  v15.4s, %19.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%4], #16 \\n\" // r30 r31\n\n                    \"prfm   pldl1keep, [%4, #192]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h}, [%4] \\n\" // r32 r33 r34\n\n                    \"shll   v14.4s, %20.4h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v31.4s, v15.4s, v16.4s      \\n\"\n                    \"shll2  v15.4s, %20.8h, #16         \\n\"\n                    \"fmla   v28.4s, v14.4s, v17.4s      \\n\"\n                    \"shll   v14.4s, %21.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v18.4s      \\n\"\n                    \"shll2  v15.4s, %21.8h, #16         \\n\"\n                    
\"fmla   v30.4s, v14.4s, v19.4s      \\n\"\n                    \"shll   v14.4s, %22.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v20.4s      \\n\"\n\n                    \"prfm   pldl1keep, [%5, #128]       \\n\"\n                    \"ld1    {v16.4h, v17.4h}, [%5], #16 \\n\" // r40 r41\n\n                    \"prfm   pldl1keep, [%5, #192]       \\n\"\n                    \"ld1    {v18.4h, v19.4h, v20.4h}, [%5] \\n\" // r42 r43 r44\n\n                    \"shll2  v15.4s, %22.8h, #16         \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16         \\n\"\n                    \"shll   v17.4s, v17.4h, #16         \\n\"\n\n                    \"shll   v18.4s, v18.4h, #16         \\n\"\n                    \"shll   v19.4s, v19.4h, #16         \\n\"\n                    \"shll   v20.4s, v20.4h, #16         \\n\"\n\n                    \"fmla   v28.4s, v14.4s, v16.4s      \\n\"\n                    \"shll   v14.4s, %23.4h, #16         \\n\"\n                    \"fmla   v29.4s, v15.4s, v17.4s      \\n\"\n                    \"shll2  v15.4s, %23.8h, #16         \\n\"\n                    \"fmla   v30.4s, v14.4s, v18.4s      \\n\"\n                    \"shll   v14.4s, %24.4h, #16         \\n\"\n                    \"fmla   v31.4s, v15.4s, v19.4s      \\n\"\n                    \"fmla   v28.4s, v14.4s, v20.4s      \\n\"\n\n                    \"fadd   v29.4s, v29.4s, v30.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v28.4s      \\n\"\n                    \"fadd   v31.4s, v31.4s, v29.4s      \\n\"\n\n                    \"shrn   v31.4h, v31.4s, #16         \\n\"\n\n                    \"st1    {v31.4h}, [%0], #8          \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4)       // %5\n                    : \"0\"(outptr0),\n  
                  \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"w\"(_k00_01), // %12\n                    \"w\"(_k02_03), // %13\n                    \"w\"(_k04_10), // %14\n                    \"w\"(_k11_12), // %15\n                    \"w\"(_k13_14), // %16\n                    \"w\"(_k20_21), // %17\n                    \"w\"(_k22_23), // %18\n                    \"w\"(_k24_30), // %19\n                    \"w\"(_k31_32), // %20\n                    \"w\"(_k33_34), // %21\n                    \"w\"(_k40_41), // %22\n                    \"w\"(_k42_43), // %23\n                    \"w\"(_k44),    // %24\n                    \"w\"(_bias0)   // %25\n                    : \"memory\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%2, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%2 :64]!  \\n\" // r00 r01\n\n                    \"pld        [%2, #192]          \\n\"\n                    \"vld1.u16   {d6-d8}, [%2 :64]   \\n\" // r02 r03 r04\n\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vshll.u16  q1, d3, #16         \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k00\n\n                    \"pld        [%1, #128]          \\n\"\n                    \"vld1.f32   {d24-d25}, [%1]     \\n\" // sum0\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k01\n                    \"vmul.f32   q13, q8, q0         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k02\n                    \"vmul.f32   q14, q9, q1         \\n\"\n\n                    \"pld        [%3, #128]          \\n\"\n                    \"vld1.u16   {d14-d15}, [%3 :64]! \\n\" // r10 r11\n\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vshll.u16  q4, d8, #16         \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k03\n                    \"vmul.f32   q15, q10, q2        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k04\n                    \"vmla.f32   q12, q11, q3        \\n\"\n                    \"vshll.u16  q11, d17, #16       \\n\" // k10\n                    \"vmla.f32   q13, q10, q4        \\n\"\n\n                    \"pld        [%3, #192]          \\n\"\n                    \"vld1.u16   {d8-d10}, [%3 :64]  \\n\" // r12 r13 r14\n\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n\n                    \"vshll.u16  q8, d18, #16        \\n\" // k11\n                    \"vmla.f32   q14, q11, q6        \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! 
\\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k12\n                    \"vmla.f32   q15, q8, q7         \\n\"\n\n                    \"pld        [%4, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%4 :64]!  \\n\" // r20 r21\n\n                    \"vshll.u16  q3, d8, #16         \\n\"\n                    \"vshll.u16  q4, d9, #16         \\n\"\n                    \"vshll.u16  q5, d10, #16        \\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k13\n                    \"vmla.f32   q12, q9, q3         \\n\"\n                    \"vshll.u16  q9, d21, #16        \\n\" // k14\n                    \"vmla.f32   q13, q8, q4         \\n\"\n                    \"vshll.u16  q10, d22, #16       \\n\" // k20\n                    \"vmla.f32   q14, q9, q5         \\n\"\n\n                    \"pld        [%4, #192]          \\n\"\n                    \"vld1.u16   {d6-d8}, [%4 :64]   \\n\" // r22 r23 r24\n\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vshll.u16  q1, d3, #16         \\n\"\n\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! \\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k21\n                    \"vmla.f32   q15, q10, q0        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k22\n                    \"vmla.f32   q12, q11, q1        \\n\"\n\n                    \"pld        [%5, #128]          \\n\"\n                    \"vld1.u16   {d14-d15}, [%5 :64]! 
\\n\" // r30 r31\n\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vshll.u16  q4, d8, #16         \\n\"\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k23\n                    \"vmla.f32   q13, q10, q2        \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k24\n                    \"vmla.f32   q14, q11, q3        \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d20-d23}, [%7 :64]! \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k30\n                    \"vmla.f32   q15, q8, q4         \\n\"\n\n                    \"pld        [%5, #192]          \\n\"\n                    \"vld1.u16   {d8-d10}, [%5 :64]  \\n\" // r32 r33 r34\n\n                    \"vshll.u16  q6, d14, #16        \\n\"\n                    \"vshll.u16  q7, d15, #16        \\n\"\n\n                    \"vshll.u16  q8, d20, #16        \\n\" // k31\n                    \"vmla.f32   q12, q9, q6         \\n\"\n\n                    \"vshll.u16  q9, d21, #16        \\n\" // k32\n                    \"vmla.f32   q13, q8, q7         \\n\"\n\n                    \"pld        [%6, #128]          \\n\"\n                    \"vld1.u16   {d2-d3}, [%6 :64]!  \\n\" // r40 r41\n\n                    \"vshll.u16  q3, d8, #16         \\n\"\n                    \"vshll.u16  q4, d9, #16         \\n\"\n                    \"vshll.u16  q5, d10, #16        \\n\"\n\n                    \"vshll.u16  q10, d22, #16       \\n\" // k33\n                    \"vmla.f32   q14, q9, q3         \\n\"\n                    \"pld        [%7, #256]          \\n\"\n                    \"vld1.u16   {d16-d19}, [%7 :64]! 
\\n\"\n                    \"vshll.u16  q11, d23, #16       \\n\" // k34\n                    \"vmla.f32   q15, q10, q4        \\n\"\n                    \"vshll.u16  q10, d16, #16       \\n\" // k40\n                    \"vmla.f32   q12, q11, q5        \\n\"\n\n                    \"pld        [%6, #192]          \\n\"\n                    \"vld1.u16   {d6-d8}, [%6 :64]   \\n\" // r42 r43 r44\n\n                    \"vshll.u16  q0, d2, #16         \\n\"\n                    \"vshll.u16  q1, d3, #16         \\n\"\n\n                    \"vshll.u16  q11, d17, #16       \\n\" // k41\n                    \"vmla.f32   q13, q10, q0        \\n\"\n                    \"vshll.u16  q8, d18, #16        \\n\" // k42\n                    \"vmla.f32   q14, q11, q1        \\n\"\n\n                    \"vshll.u16  q2, d6, #16         \\n\"\n                    \"vshll.u16  q3, d7, #16         \\n\"\n                    \"vshll.u16  q4, d8, #16         \\n\"\n\n                    \"pld        [%7, #64]           \\n\"\n                    \"vld1.u16   {d20}, [%7 :64]     \\n\"\n                    \"vshll.u16  q9, d19, #16        \\n\" // k43\n                    \"vmla.f32   q15, q8, q2         \\n\"\n                    \"vshll.u16  q8, d20, #16        \\n\" // k44\n                    \"vmla.f32   q12, q9, q3         \\n\"\n\n                    \"vmla.f32   q13, q8, q4         \\n\"\n\n                    \"vadd.f32   q14, q14, q15       \\n\"\n                    \"vadd.f32   q12, q12, q13       \\n\"\n\n                    \"sub        %7, %7, #192        \\n\" // kptr -= 24 * 4;\n\n                    \"vadd.f32   q12, q12, q14       \\n\"\n\n                    \"vshrn.u32  d24, q12, #16       \\n\"\n\n                    \"vst1.u16   {d24}, [%0 :64]!    
\\n\"\n\n                    : \"=r\"(outptr0),        // %0\n                    \"=r\"(bias0_data_ptr), // %1\n                    \"=r\"(r0),             // %2\n                    \"=r\"(r1),             // %3\n                    \"=r\"(r2),             // %4\n                    \"=r\"(r3),             // %5\n                    \"=r\"(r4),             // %6\n                    \"=r\"(kptr)            // %7\n                    : \"0\"(outptr0),\n                    \"1\"(bias0_data_ptr),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(kptr)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n            r3 += tailstep;\n            r4 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_5x5_pack8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw5x5s1_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        __fp16 bias0_data[8] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};\n\n        const __fp16* k0 = kernel.row<const __fp16>(g);\n\n        __fp16* outptr0 = out.row<__fp16>(0);\n        __fp16* outptr1 = out.row<__fp16>(1);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const __fp16* r0 = img0.row<const __fp16>(0);\n        const __fp16* r1 = img0.row<const __fp16>(1);\n        const __fp16* r2 = img0.row<const __fp16>(2);\n        const __fp16* r3 = img0.row<const __fp16>(3);\n        const __fp16* r4 = img0.row<const __fp16>(4);\n        const __fp16* r5 = img0.row<const __fp16>(5);\n\n        int i = 0;\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                const __fp16* bias0_data_ptr = bias ? 
bias + g * 8 : bias0_data;\n\n                asm volatile(\n                    \"prfm   pldl1keep, [%18, #512]      \\n\"\n                    \"ld1    {v31.8h}, [%18]             \\n\" // sum13\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%2], #64 \\n\" // r0_0123\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w0_0123\n\n                    \"mov    v24.16b, v31.16b            \\n\" // sum00\n                    \"mov    v25.16b, v31.16b            \\n\" // sum01\n                    \"mov    v26.16b, v31.16b            \\n\" // sum02\n                    \"mov    v27.16b, v31.16b            \\n\" // sum03\n\n                    \"fmla   v24.8h, v16.8h, v0.8h       \\n\"\n                    \"fmla   v25.8h, v17.8h, v0.8h       \\n\"\n                    \"fmla   v26.8h, v18.8h, v0.8h       \\n\"\n                    \"fmla   v27.8h, v19.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%2] \\n\" // r0_4567\n\n                    \"fmla   v24.8h, v17.8h, v1.8h       \\n\"\n                    \"fmla   v25.8h, v18.8h, v1.8h       \\n\"\n                    \"fmla   v26.8h, v19.8h, v1.8h       \\n\"\n                    \"fmla   v27.8h, v20.8h, v1.8h       \\n\"\n\n                    \"mov    v28.16b, v31.16b            \\n\" // sum10\n\n                    \"fmla   v24.8h, v18.8h, v2.8h       \\n\"\n                    \"fmla   v25.8h, v19.8h, v2.8h       \\n\"\n                    \"fmla   v26.8h, v20.8h, v2.8h       \\n\"\n                    \"fmla   v27.8h, v21.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w04 w1_012\n\n                    \"fmla   v24.8h, 
v19.8h, v3.8h       \\n\"\n                    \"fmla   v25.8h, v20.8h, v3.8h       \\n\"\n                    \"fmla   v26.8h, v21.8h, v3.8h       \\n\"\n                    \"fmla   v27.8h, v22.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%3], #64 \\n\" // r1_0123\n\n                    \"fmla   v24.8h, v20.8h, v4.8h       \\n\"\n                    \"fmla   v25.8h, v21.8h, v4.8h       \\n\"\n                    \"fmla   v26.8h, v22.8h, v4.8h       \\n\"\n                    \"fmla   v27.8h, v23.8h, v4.8h       \\n\"\n\n                    \"mov    v29.16b, v31.16b            \\n\" // sum11\n                    \"mov    v30.16b, v31.16b            \\n\" // sum12\n\n                    \"fmla   v28.8h, v8.8h, v0.8h        \\n\"\n                    \"fmla   v29.8h, v9.8h, v0.8h        \\n\"\n                    \"fmla   v30.8h, v10.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v11.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%3] \\n\" // r1_4567\n\n                    \"fmla   v28.8h, v9.8h, v1.8h        \\n\"\n                    \"fmla   v29.8h, v10.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v11.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v12.8h, v1.8h       \\n\"\n\n                    \"fmla   v28.8h, v10.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v11.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v12.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v13.8h, v2.8h       \\n\"\n\n                    \"fmla   v28.8h, v11.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v12.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v13.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v14.8h, v3.8h       \\n\"\n\n                    \"fmla   
v28.8h, v12.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v13.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v14.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v15.8h, v4.8h       \\n\"\n\n                    \"fmla   v24.8h, v8.8h, v5.8h        \\n\"\n                    \"fmla   v25.8h, v9.8h, v5.8h        \\n\"\n                    \"fmla   v26.8h, v10.8h, v5.8h       \\n\"\n                    \"fmla   v27.8h, v11.8h, v5.8h       \\n\"\n\n                    \"fmla   v24.8h, v9.8h, v6.8h        \\n\"\n                    \"fmla   v25.8h, v10.8h, v6.8h       \\n\"\n                    \"fmla   v26.8h, v11.8h, v6.8h       \\n\"\n                    \"fmla   v27.8h, v12.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w1_34 w2_01\n\n                    \"fmla   v24.8h, v10.8h, v7.8h       \\n\"\n                    \"fmla   v25.8h, v11.8h, v7.8h       \\n\"\n                    \"fmla   v26.8h, v12.8h, v7.8h       \\n\"\n                    \"fmla   v27.8h, v13.8h, v7.8h       \\n\"\n\n                    \"fmla   v24.8h, v11.8h, v0.8h       \\n\"\n                    \"fmla   v25.8h, v12.8h, v0.8h       \\n\"\n                    \"fmla   v26.8h, v13.8h, v0.8h       \\n\"\n                    \"fmla   v27.8h, v14.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%4], #64 \\n\" // r2_0123\n\n                    \"fmla   v24.8h, v12.8h, v1.8h       \\n\"\n                    \"fmla   v25.8h, v13.8h, v1.8h       \\n\"\n                    \"fmla   v26.8h, v14.8h, v1.8h       \\n\"\n                    \"fmla   v27.8h, v15.8h, v1.8h       \\n\"\n\n                    \"fmla   v28.8h, v16.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v17.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, 
v18.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4] \\n\" // r2_4567\n\n                    \"fmla   v28.8h, v17.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v18.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v19.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v6.8h       \\n\"\n\n                    \"fmla   v28.8h, v18.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v19.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v21.8h, v7.8h       \\n\"\n\n                    \"fmla   v28.8h, v19.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v20.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v21.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v22.8h, v0.8h       \\n\"\n\n                    \"fmla   v28.8h, v20.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, v21.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v22.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v1.8h       \\n\"\n\n                    \"fmla   v24.8h, v16.8h, v2.8h       \\n\"\n                    \"fmla   v25.8h, v17.8h, v2.8h       \\n\"\n                    \"fmla   v26.8h, v18.8h, v2.8h       \\n\"\n                    \"fmla   v27.8h, v19.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w2_234 w30\n\n                    \"fmla   v24.8h, v17.8h, v3.8h       \\n\"\n                    \"fmla   v25.8h, v18.8h, v3.8h       \\n\"\n                    \"fmla   v26.8h, v19.8h, v3.8h       \\n\"\n                    \"fmla   v27.8h, v20.8h, v3.8h       \\n\"\n\n                    \"fmla   v24.8h, v18.8h, v4.8h 
      \\n\"\n                    \"fmla   v25.8h, v19.8h, v4.8h       \\n\"\n                    \"fmla   v26.8h, v20.8h, v4.8h       \\n\"\n                    \"fmla   v27.8h, v21.8h, v4.8h       \\n\"\n\n                    \"fmla   v24.8h, v19.8h, v5.8h       \\n\"\n                    \"fmla   v25.8h, v20.8h, v5.8h       \\n\"\n                    \"fmla   v26.8h, v21.8h, v5.8h       \\n\"\n                    \"fmla   v27.8h, v22.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%5], #64 \\n\" // r3_0123\n\n                    \"fmla   v24.8h, v20.8h, v6.8h       \\n\"\n                    \"fmla   v25.8h, v21.8h, v6.8h       \\n\"\n                    \"fmla   v26.8h, v22.8h, v6.8h       \\n\"\n                    \"fmla   v27.8h, v23.8h, v6.8h       \\n\"\n\n                    \"fmla   v28.8h, v8.8h, v2.8h        \\n\"\n                    \"fmla   v29.8h, v9.8h, v2.8h        \\n\"\n                    \"fmla   v30.8h, v10.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v11.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%5] \\n\" // r3_4567\n\n                    \"fmla   v28.8h, v9.8h, v3.8h        \\n\"\n                    \"fmla   v29.8h, v10.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v11.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v12.8h, v3.8h       \\n\"\n\n                    \"fmla   v28.8h, v10.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v11.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v12.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v13.8h, v4.8h       \\n\"\n\n                    \"fmla   v28.8h, v11.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v12.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v13.8h, v5.8h       \\n\"\n    
                \"fmla   v31.8h, v14.8h, v5.8h       \\n\"\n\n                    \"fmla   v28.8h, v12.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v13.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v14.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v15.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w3_1234\n\n                    \"fmla   v24.8h, v8.8h, v7.8h        \\n\"\n                    \"fmla   v25.8h, v9.8h, v7.8h        \\n\"\n                    \"fmla   v26.8h, v10.8h, v7.8h       \\n\"\n                    \"fmla   v27.8h, v11.8h, v7.8h       \\n\"\n\n                    \"fmla   v24.8h, v9.8h, v0.8h        \\n\"\n                    \"fmla   v25.8h, v10.8h, v0.8h       \\n\"\n                    \"fmla   v26.8h, v11.8h, v0.8h       \\n\"\n                    \"fmla   v27.8h, v12.8h, v0.8h       \\n\"\n\n                    \"fmla   v24.8h, v10.8h, v1.8h       \\n\"\n                    \"fmla   v25.8h, v11.8h, v1.8h       \\n\"\n                    \"fmla   v26.8h, v12.8h, v1.8h       \\n\"\n                    \"fmla   v27.8h, v13.8h, v1.8h       \\n\"\n\n                    \"fmla   v24.8h, v11.8h, v2.8h       \\n\"\n                    \"fmla   v25.8h, v12.8h, v2.8h       \\n\"\n                    \"fmla   v26.8h, v13.8h, v2.8h       \\n\"\n                    \"fmla   v27.8h, v14.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%6], #64 \\n\" // r4_0123\n\n                    \"fmla   v24.8h, v12.8h, v3.8h       \\n\"\n                    \"fmla   v25.8h, v13.8h, v3.8h       \\n\"\n                    \"fmla   v26.8h, v14.8h, v3.8h       \\n\"\n                    \"fmla   v27.8h, v15.8h, v3.8h       \\n\"\n\n                    \"fmla   v28.8h, v16.8h, v7.8h       \\n\"\n                
    \"fmla   v29.8h, v17.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v7.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%6] \\n\" // r4_4567\n\n                    \"fmla   v28.8h, v17.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v18.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v19.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v0.8h       \\n\"\n\n                    \"fmla   v28.8h, v18.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, v19.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v21.8h, v1.8h       \\n\"\n\n                    \"fmla   v28.8h, v19.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v20.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v21.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v22.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w4_0123\n\n                    \"fmla   v28.8h, v20.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v21.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v22.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v3.8h       \\n\"\n\n                    \"fmla   v24.8h, v16.8h, v4.8h       \\n\"\n                    \"fmla   v25.8h, v17.8h, v4.8h       \\n\"\n                    \"fmla   v26.8h, v18.8h, v4.8h       \\n\"\n                    \"fmla   v27.8h, v19.8h, v4.8h       \\n\"\n\n                    \"fmla   v24.8h, v17.8h, v5.8h       \\n\"\n                    \"fmla   v25.8h, v18.8h, v5.8h       \\n\"\n                    \"fmla   v26.8h, v19.8h, v5.8h       \\n\"\n                    \"fmla   
v27.8h, v20.8h, v5.8h       \\n\"\n\n                    \"fmla   v24.8h, v18.8h, v6.8h       \\n\"\n                    \"fmla   v25.8h, v19.8h, v6.8h       \\n\"\n                    \"fmla   v26.8h, v20.8h, v6.8h       \\n\"\n                    \"fmla   v27.8h, v21.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%8]               \\n\" // w44\n\n                    \"fmla   v24.8h, v19.8h, v7.8h       \\n\"\n                    \"fmla   v25.8h, v20.8h, v7.8h       \\n\"\n                    \"fmla   v26.8h, v21.8h, v7.8h       \\n\"\n                    \"fmla   v27.8h, v22.8h, v7.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%7, #512]       \\n\"\n                    \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%7], #64 \\n\" // r5_0123\n\n                    \"fmla   v24.8h, v20.8h, v0.8h       \\n\"\n                    \"fmla   v25.8h, v21.8h, v0.8h       \\n\"\n                    \"fmla   v26.8h, v22.8h, v0.8h       \\n\"\n                    \"fmla   v27.8h, v23.8h, v0.8h       \\n\"\n\n                    \"fmla   v28.8h, v8.8h, v4.8h        \\n\"\n                    \"fmla   v29.8h, v9.8h, v4.8h        \\n\"\n                    \"fmla   v30.8h, v10.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v11.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%7, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%7] \\n\" // r5_4567\n\n                    \"fmla   v28.8h, v9.8h, v5.8h        \\n\"\n                    \"fmla   v29.8h, v10.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v11.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v12.8h, v5.8h       \\n\"\n\n                    \"fmla   v28.8h, v10.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v11.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v12.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, 
v13.8h, v6.8h       \\n\"\n\n                    \"fmla   v28.8h, v11.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v12.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v13.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v14.8h, v7.8h       \\n\"\n\n                    \"fmla   v28.8h, v12.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v13.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v14.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v15.8h, v0.8h       \\n\"\n\n                    \"sub    %8, %8, #384                \\n\" // k0 -= 24 * 8\n\n                    \"st1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%0], #64 \\n\"\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%1], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3),      // %5\n                    \"=r\"(r4),      // %6\n                    \"=r\"(r5),      // %7\n                    \"=r\"(k0)       // %8\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(r5),\n                    \"8\"(k0),\n                    \"r\"(bias0_data_ptr) // %18\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n\n            float16x8_t _bias0 = bias ? 
vld1q_f16(bias + g * 8) : vdupq_n_f16((__fp16)0.f);\n\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%2], #32 \\n\" // r0_01\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w0_0123\n\n                    \"mov    v28.16b, %18.16b            \\n\" // sum00\n                    \"mov    v29.16b, %18.16b            \\n\" // sum01\n\n                    \"fmla   v28.8h, v16.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v17.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%2] \\n\" // r0_2345\n\n                    \"mov    v30.16b, %18.16b            \\n\" // sum10\n                    \"mov    v31.16b, %18.16b            \\n\" // sum11\n\n                    \"fmla   v28.8h, v17.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, v18.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w04 w1_012\n\n                    \"fmla   v28.8h, v18.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v19.8h, v2.8h       \\n\"\n                    \"fmla   v28.8h, v19.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v20.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%3], #32 \\n\" // r1_01\n\n                    \"fmla   v28.8h, v20.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v21.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%3] \\n\" // r1_2345\n\n                    
\"fmla   v30.8h, v22.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v24.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v26.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v26.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v27.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w1_34 w2_01\n\n                    \"fmla   v28.8h, v22.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v23.8h, v5.8h       \\n\"\n                    \"fmla   v28.8h, v23.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v24.8h, v6.8h       \\n\"\n                    \"fmla   v28.8h, v24.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v25.8h, v7.8h       \\n\"\n                    \"fmla   v28.8h, v25.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v26.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%4], #32 \\n\" // r2_01\n\n                    \"fmla   v28.8h, v26.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, v27.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%4] \\n\" // r2_2345\n\n                    \"fmla   v30.8h, v16.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v6.8h       \\n\"\n                    \"fmla   
v30.8h, v18.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v19.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v21.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w2_234 w30\n\n                    \"fmla   v28.8h, v16.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v17.8h, v2.8h       \\n\"\n                    \"fmla   v28.8h, v17.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v18.8h, v3.8h       \\n\"\n                    \"fmla   v28.8h, v18.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v19.8h, v4.8h       \\n\"\n                    \"fmla   v28.8h, v19.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v20.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%5], #32 \\n\" // r3_01\n\n                    \"fmla   v28.8h, v20.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v21.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%5] \\n\" // r3_2345\n\n                    \"fmla   v30.8h, v22.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v24.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v26.8h, v5.8h       \\n\"\n\n                    \"prfm   
pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w3_1234\n\n                    \"fmla   v30.8h, v26.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v27.8h, v6.8h       \\n\"\n\n                    \"fmla   v28.8h, v22.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v23.8h, v7.8h       \\n\"\n                    \"fmla   v28.8h, v23.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v24.8h, v0.8h       \\n\"\n                    \"fmla   v28.8h, v24.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, v25.8h, v1.8h       \\n\"\n                    \"fmla   v28.8h, v25.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v26.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%6], #32 \\n\" // r4_01\n\n                    \"fmla   v28.8h, v26.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v27.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%6] \\n\" // r4_2345\n\n                    \"fmla   v30.8h, v16.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w4_0123\n\n                    \"fmla   v30.8h, v19.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v21.8h, v3.8h       \\n\"\n\n              
      \"fmla   v28.8h, v16.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v17.8h, v4.8h       \\n\"\n                    \"fmla   v28.8h, v17.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v18.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%8]               \\n\" // w44\n\n                    \"fmla   v28.8h, v18.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v19.8h, v6.8h       \\n\"\n                    \"fmla   v28.8h, v19.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v20.8h, v7.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%7, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%7], #32 \\n\" // r5_01\n\n                    \"fmla   v28.8h, v20.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v21.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%7, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%7] \\n\" // r5_2345\n\n                    \"fmla   v30.8h, v22.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v24.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v26.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v26.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v27.8h, v0.8h       \\n\"\n\n                    \"sub    %8, %8, #384                \\n\" // k0 -= 24 * 8\n\n                    \"st1    {v28.8h, v29.8h}, [%0], #32 \\n\"\n                    \"st1    {v30.8h, v31.8h}, [%1], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // 
%1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3),      // %5\n                    \"=r\"(r4),      // %6\n                    \"=r\"(r5),      // %7\n                    \"=r\"(k0)       // %8\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    \"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(r5),\n                    \"8\"(k0),\n                    \"w\"(_bias0) // %18\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v16.8h}, [%2], #16         \\n\" // r0_0\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v17.8h, v18.8h, v19.8h, v20.8h}, [%2] \\n\" // r0_1234\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w0_0123\n\n                    \"mov    v30.16b, %18.16b            \\n\" // sum00\n                    \"mov    v31.16b, %18.16b            \\n\" // sum10\n\n                    \"fmla   v30.8h, v16.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w04 w1_012\n\n                    \"fmla   v30.8h, v18.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       
\\n\"\n                    \"ld1    {v21.8h}, [%3], #16         \\n\" // r1_0\n\n                    \"fmla   v30.8h, v19.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v22.8h, v23.8h, v24.8h, v25.8h}, [%3] \\n\" // r1_1234\n\n                    \"fmla   v31.8h, v21.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v22.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w1_34 w2_01\n\n                    \"fmla   v30.8h, v21.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v22.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v7.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v16.8h}, [%4], #16         \\n\" // r2_0\n\n                    \"fmla   v30.8h, v24.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v17.8h, v18.8h, v19.8h, v20.8h}, [%4] \\n\" // r2_1234\n\n                    \"fmla   v31.8h, v16.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w2_234 w30\n\n                    \"fmla   v30.8h, v16.8h, v2.8h       \\n\"\n  
                  \"fmla   v30.8h, v17.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #128]       \\n\"\n                    \"ld1    {v21.8h}, [%5], #16         \\n\" // r3_0\n\n                    \"fmla   v30.8h, v19.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v22.8h, v23.8h, v24.8h, v25.8h}, [%5] \\n\" // r3_1234\n\n                    \"fmla   v31.8h, v21.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v22.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%8], #64 \\n\" // w3_1234\n\n                    \"fmla   v31.8h, v24.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v6.8h       \\n\"\n\n                    \"fmla   v30.8h, v21.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v22.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #128]       \\n\"\n                    \"ld1    {v16.8h}, [%6], #16         \\n\" // r4_0\n\n                    \"fmla   v30.8h, v24.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v17.8h, v18.8h, v19.8h, v20.8h}, [%6] \\n\" // r4_1234\n\n                    \"fmla   v31.8h, v16.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%8], #64 \\n\" // w4_0123\n\n              
      \"fmla   v31.8h, v19.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v3.8h       \\n\"\n\n                    \"fmla   v30.8h, v16.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%8, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%8]               \\n\" // w44\n\n                    \"fmla   v30.8h, v18.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%7, #128]       \\n\"\n                    \"ld1    {v21.8h}, [%7], #16         \\n\" // r5_0\n\n                    \"fmla   v30.8h, v19.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%7, #512]       \\n\"\n                    \"ld1    {v22.8h, v23.8h, v24.8h, v25.8h}, [%7] \\n\" // r5_1234\n\n                    \"fmla   v31.8h, v21.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v22.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v0.8h       \\n\"\n\n                    \"sub    %8, %8, #384                \\n\" // k0 -= 24 * 8\n\n                    \"st1    {v30.8h}, [%0], #16         \\n\"\n                    \"st1    {v31.8h}, [%1], #16         \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(outptr1), // %1\n                    \"=r\"(r0),      // %2\n                    \"=r\"(r1),      // %3\n                    \"=r\"(r2),      // %4\n                    \"=r\"(r3),      // %5\n                    \"=r\"(r4),      // %6\n                    \"=r\"(r5),      // %7\n                    \"=r\"(k0)       // %8\n                    : \"0\"(outptr0),\n                    \"1\"(outptr1),\n                    \"2\"(r0),\n                    \"3\"(r1),\n                    \"4\"(r2),\n                    
\"5\"(r3),\n                    \"6\"(r4),\n                    \"7\"(r5),\n                    \"8\"(k0),\n                    \"w\"(_bias0) // %18\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v30\", \"v31\");\n            }\n\n            r0 += 4 * 8 + w * 8;\n            r1 += 4 * 8 + w * 8;\n            r2 += 4 * 8 + w * 8;\n            r3 += 4 * 8 + w * 8;\n            r4 += 4 * 8 + w * 8;\n            r5 += 4 * 8 + w * 8;\n\n            outptr0 += outw * 8;\n            outptr1 += outw * 8;\n        }\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + g * 8) : vdupq_n_f16((__fp16)0.f);\n\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%1], #64 \\n\" // r0_0123\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w0_0123\n\n                    \"mov    v28.16b, %14.16b            \\n\" // sum00\n                    \"mov    v29.16b, %14.16b            \\n\" // sum01\n                    \"mov    v30.16b, %14.16b            \\n\" // sum02\n                    \"mov    v31.16b, %14.16b            \\n\" // sum03\n\n                    \"fmla   v28.8h, v12.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v13.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v14.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v15.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%1] \\n\" // r0_4567\n\n                    \"fmla   v28.8h, v13.8h, v1.8h       \\n\"\n               
     \"fmla   v29.8h, v14.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v15.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v16.8h, v1.8h       \\n\"\n\n                    \"fmla   v28.8h, v14.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v15.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v16.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w04 w1_012\n\n                    \"fmla   v28.8h, v15.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v16.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%2], #64 \\n\" // r1_0123\n\n                    \"fmla   v28.8h, v16.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v17.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v4.8h       \\n\"\n\n                    \"fmla   v28.8h, v20.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v21.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v22.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%2] \\n\" // r1_4567\n\n                    \"fmla   v28.8h, v21.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v22.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n  
                  \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w1_34 w2_01\n\n                    \"fmla   v28.8h, v22.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v23.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v24.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v7.8h       \\n\"\n\n                    \"fmla   v28.8h, v23.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v24.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v26.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%3], #64 \\n\" // r2_0123\n\n                    \"fmla   v28.8h, v24.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, v25.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v26.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v27.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%3] \\n\" // r2_4567\n\n                    \"fmla   v28.8h, v12.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v13.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v14.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v15.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w2_234 w30\n\n                    \"fmla   v28.8h, v13.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v14.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v15.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v16.8h, v3.8h       \\n\"\n\n                    \"fmla   v28.8h, v14.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v15.8h, v4.8h       \\n\"\n                    
\"fmla   v30.8h, v16.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v4.8h       \\n\"\n\n                    \"fmla   v28.8h, v15.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v16.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v20.8h, v21.8h, v22.8h, v23.8h}, [%4], #64 \\n\" // r3_0123\n\n                    \"fmla   v28.8h, v16.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v17.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w3_1234\n\n                    \"fmla   v28.8h, v20.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v21.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v22.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v7.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%4] \\n\" // r3_4567\n\n                    \"fmla   v28.8h, v21.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v22.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v0.8h       \\n\"\n\n                    \"fmla   v28.8h, v22.8h, v1.8h       \\n\"\n                    \"fmla   v29.8h, v23.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v24.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%5], #64 
\\n\" // r4_0123\n\n                    \"fmla   v28.8h, v23.8h, v2.8h       \\n\"\n                    \"fmla   v29.8h, v24.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v26.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w4_0123\n\n                    \"fmla   v28.8h, v24.8h, v3.8h       \\n\"\n                    \"fmla   v29.8h, v25.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v26.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v27.8h, v3.8h       \\n\"\n\n                    \"fmla   v28.8h, v12.8h, v4.8h       \\n\"\n                    \"fmla   v29.8h, v13.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v14.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v15.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%5] \\n\" // r4_4567\n\n                    \"fmla   v28.8h, v13.8h, v5.8h       \\n\"\n                    \"fmla   v29.8h, v14.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v15.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v16.8h, v5.8h       \\n\"\n\n                    \"fmla   v28.8h, v14.8h, v6.8h       \\n\"\n                    \"fmla   v29.8h, v15.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v16.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%6]               \\n\" // w44\n\n                    \"fmla   v28.8h, v15.8h, v7.8h       \\n\"\n                    \"fmla   v29.8h, v16.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v7.8h      
 \\n\"\n\n                    \"fmla   v28.8h, v16.8h, v0.8h       \\n\"\n                    \"fmla   v29.8h, v17.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v0.8h       \\n\"\n\n                    \"sub    %6, %6, #384                \\n\" // k0 -= 24 * 8\n\n                    \"st1    {v28.8h, v29.8h, v30.8h, v31.8h}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4),      // %5\n                    \"=r\"(k0)       // %6\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"6\"(k0),\n                    \"w\"(_bias0) // %14\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%1], #32 \\n\" // r0_01\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w0_0123\n\n                    \"mov    v30.16b, %14.16b            \\n\" // sum00\n                    \"mov    v31.16b, %14.16b            \\n\" // sum01\n\n                    \"fmla   v30.8h, v16.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v0.8h       \\n\"\n\n                    \"prfm   
pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%1] \\n\" // r0_2345\n\n                    \"fmla   v30.8h, v17.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v1.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w04 w1_012\n\n                    \"fmla   v30.8h, v19.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%2], #32 \\n\" // r1_01\n\n                    \"fmla   v30.8h, v20.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v21.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%2] \\n\" // r1_2345\n\n                    \"fmla   v30.8h, v22.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v5.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w1_34 w2_01\n\n                    \"fmla   v30.8h, v24.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v25.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v26.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%3], #32 \\n\" // r2_01\n\n                    \"fmla   v30.8h, v26.8h, v1.8h       \\n\"\n                    \"fmla   
v31.8h, v27.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%3] \\n\" // r2_2345\n\n                    \"fmla   v30.8h, v16.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w2_234 w30\n\n                    \"fmla   v30.8h, v17.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v3.8h       \\n\"\n                    \"fmla   v30.8h, v18.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #256]       \\n\"\n                    \"ld1    {v22.8h, v23.8h}, [%4], #32 \\n\" // r3_01\n\n                    \"fmla   v30.8h, v19.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w3_1234\n\n                    \"fmla   v30.8h, v20.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v21.8h, v6.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v24.8h, v25.8h, v26.8h, v27.8h}, [%4] \\n\" // r3_2345\n\n                    \"fmla   v30.8h, v22.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v23.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v23.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v24.8h, v0.8h       \\n\"\n                    \"fmla   v30.8h, v24.8h, v1.8h       \\n\"\n                    \"fmla   v31.8h, v25.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v16.8h, v17.8h}, [%5], #32 \\n\" // r4_01\n\n                    \"fmla   v30.8h, 
v25.8h, v2.8h       \\n\"\n                    \"fmla   v31.8h, v26.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w4_0123\n\n                    \"fmla   v30.8h, v26.8h, v3.8h       \\n\"\n                    \"fmla   v31.8h, v27.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v18.8h, v19.8h, v20.8h, v21.8h}, [%5] \\n\" // r4_2345\n\n                    \"fmla   v30.8h, v16.8h, v4.8h       \\n\"\n                    \"fmla   v31.8h, v17.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v5.8h       \\n\"\n                    \"fmla   v31.8h, v18.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%6]               \\n\" // w44\n\n                    \"fmla   v30.8h, v18.8h, v6.8h       \\n\"\n                    \"fmla   v31.8h, v19.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v19.8h, v7.8h       \\n\"\n                    \"fmla   v31.8h, v20.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v0.8h       \\n\"\n                    \"fmla   v31.8h, v21.8h, v0.8h       \\n\"\n\n                    \"sub    %6, %6, #384                \\n\" // k0 -= 24 * 8\n\n                    \"st1    {v30.8h, v31.8h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4),      // %5\n                    \"=r\"(k0)       // %6\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"6\"(k0),\n  
                  \"w\"(_bias0) // %14\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v30\", \"v31\");\n            }\n            for (; j < outw; j++)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #128]       \\n\"\n                    \"ld1    {v16.8h}, [%1], #16         \\n\" // r0_0\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w0_0123\n\n                    \"mov    v30.16b, %14.16b            \\n\" // sum00\n\n                    \"prfm   pldl1keep, [%1, #512]       \\n\"\n                    \"ld1    {v17.8h, v18.8h, v19.8h, v20.8h}, [%1] \\n\" // r0_1234\n\n                    \"fmla   v30.8h, v16.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w04 w1_012\n\n                    \"fmla   v30.8h, v17.8h, v1.8h       \\n\"\n\n                    \"fmla   v30.8h, v18.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #128]       \\n\"\n                    \"ld1    {v21.8h}, [%2], #16         \\n\" // r1_0\n\n                    \"fmla   v30.8h, v19.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]       \\n\"\n                    \"ld1    {v22.8h, v23.8h, v24.8h, v25.8h}, [%2] \\n\" // r1_1234\n\n                    \"fmla   v30.8h, v20.8h, v4.8h       \\n\"\n\n                    \"fmla   v30.8h, v21.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w1_34 w2_01\n\n                    \"fmla   v30.8h, v22.8h, v6.8h       \\n\"\n\n                    \"fmla   v30.8h, v23.8h, v7.8h     
  \\n\"\n\n                    \"prfm   pldl1keep, [%3, #128]       \\n\"\n                    \"ld1    {v16.8h}, [%3], #16         \\n\" // r2_0\n\n                    \"fmla   v30.8h, v24.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]       \\n\"\n                    \"ld1    {v17.8h, v18.8h, v19.8h, v20.8h}, [%3] \\n\" // r2_1234\n\n                    \"fmla   v30.8h, v25.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w2_234 w30\n\n                    \"fmla   v30.8h, v16.8h, v2.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v3.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v21.8h}, [%4], #16         \\n\" // r3_0\n\n                    \"fmla   v30.8h, v18.8h, v4.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%4, #512]       \\n\"\n                    \"ld1    {v22.8h, v23.8h, v24.8h, v25.8h}, [%4] \\n\" // r3_1234\n\n                    \"fmla   v30.8h, v19.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%6], #64 \\n\" // w3_1234\n\n                    \"fmla   v30.8h, v20.8h, v6.8h       \\n\"\n\n                    \"fmla   v30.8h, v21.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v22.8h, v0.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #128]       \\n\"\n                    \"ld1    {v16.8h}, [%5], #16         \\n\" // r4_0\n\n                    \"fmla   v30.8h, v23.8h, v1.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #512]       \\n\"\n                    \"ld1    {v4.8h, v5.8h, v6.8h, v7.8h}, [%6], #64 \\n\" // w4_0123\n\n                    \"fmla   v30.8h, v24.8h, v2.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    
\"ld1    {v17.8h, v18.8h, v19.8h, v20.8h}, [%5] \\n\" // r4_1234\n\n                    \"fmla   v30.8h, v25.8h, v3.8h       \\n\"\n\n                    \"fmla   v30.8h, v16.8h, v4.8h       \\n\"\n                    \"fmla   v30.8h, v17.8h, v5.8h       \\n\"\n\n                    \"prfm   pldl1keep, [%6, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%6]               \\n\" // w44\n\n                    \"fmla   v30.8h, v18.8h, v6.8h       \\n\"\n                    \"fmla   v30.8h, v19.8h, v7.8h       \\n\"\n                    \"fmla   v30.8h, v20.8h, v0.8h       \\n\"\n\n                    \"sub    %6, %6, #384                \\n\" // k0 -= 24 * 8\n\n                    \"st1    {v30.8h}, [%0], #16         \\n\"\n\n                    : \"=r\"(outptr0), // %0\n                    \"=r\"(r0),      // %1\n                    \"=r\"(r1),      // %2\n                    \"=r\"(r2),      // %3\n                    \"=r\"(r3),      // %4\n                    \"=r\"(r4),      // %5\n                    \"=r\"(k0)       // %6\n                    : \"0\"(outptr0),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(r3),\n                    \"5\"(r4),\n                    \"6\"(k0),\n                    \"w\"(_bias0) // %14\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v30\");\n            }\n\n            r0 += 4 * 8;\n            r1 += 4 * 8;\n            r2 += 4 * 8;\n            r3 += 4 * 8;\n            r4 += 4 * 8;\n        }\n    }\n}\n\nstatic void convdw5x5s2_pack8_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * 
outw + w) * 8;\n\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        float16x8_t _bias0 = bias ? vld1q_f16(bias + g * 8) : vdupq_n_f16((__fp16)0.f);\n\n        const __fp16* k0 = kernel.row<const __fp16>(g);\n\n        __fp16* outptr0 = out.row<__fp16>(0);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const __fp16* r0 = img0.row<const __fp16>(0);\n        const __fp16* r1 = img0.row<const __fp16>(1);\n        const __fp16* r2 = img0.row<const __fp16>(2);\n        const __fp16* r3 = img0.row<const __fp16>(3);\n        const __fp16* r4 = img0.row<const __fp16>(4);\n\n        int i = 0;\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                float16x8_t _sum0 = _bias0;\n\n                float16x8_t _r00 = vld1q_f16(r0);\n                float16x8_t _r01 = vld1q_f16(r0 + 8);\n                float16x8_t _r02 = vld1q_f16(r0 + 16);\n                float16x8_t _r03 = vld1q_f16(r0 + 24);\n                float16x8_t _r04 = vld1q_f16(r0 + 32);\n\n                float16x8_t _k00 = vld1q_f16(k0);\n                float16x8_t _k01 = vld1q_f16(k0 + 8);\n                float16x8_t _k02 = vld1q_f16(k0 + 16);\n                float16x8_t _k03 = vld1q_f16(k0 + 24);\n                float16x8_t _k04 = vld1q_f16(k0 + 32);\n                k0 += 40;\n\n                _sum0 = vfmaq_f16(_sum0, _k00, _r00);\n                _sum0 = vfmaq_f16(_sum0, _k01, _r01);\n                _sum0 = vfmaq_f16(_sum0, _k02, _r02);\n                _sum0 = vfmaq_f16(_sum0, _k03, _r03);\n                _sum0 = vfmaq_f16(_sum0, _k04, _r04);\n\n                float16x8_t _r10 = vld1q_f16(r1);\n                float16x8_t _r11 = vld1q_f16(r1 + 8);\n                float16x8_t _r12 = vld1q_f16(r1 + 16);\n                float16x8_t _r13 = vld1q_f16(r1 + 24);\n                
float16x8_t _r14 = vld1q_f16(r1 + 32);\n\n                float16x8_t _k10 = vld1q_f16(k0);\n                float16x8_t _k11 = vld1q_f16(k0 + 8);\n                float16x8_t _k12 = vld1q_f16(k0 + 16);\n                float16x8_t _k13 = vld1q_f16(k0 + 24);\n                float16x8_t _k14 = vld1q_f16(k0 + 32);\n                k0 += 40;\n\n                _sum0 = vfmaq_f16(_sum0, _k10, _r10);\n                _sum0 = vfmaq_f16(_sum0, _k11, _r11);\n                _sum0 = vfmaq_f16(_sum0, _k12, _r12);\n                _sum0 = vfmaq_f16(_sum0, _k13, _r13);\n                _sum0 = vfmaq_f16(_sum0, _k14, _r14);\n\n                float16x8_t _r20 = vld1q_f16(r2);\n                float16x8_t _r21 = vld1q_f16(r2 + 8);\n                float16x8_t _r22 = vld1q_f16(r2 + 16);\n                float16x8_t _r23 = vld1q_f16(r2 + 24);\n                float16x8_t _r24 = vld1q_f16(r2 + 32);\n\n                float16x8_t _k20 = vld1q_f16(k0);\n                float16x8_t _k21 = vld1q_f16(k0 + 8);\n                float16x8_t _k22 = vld1q_f16(k0 + 16);\n                float16x8_t _k23 = vld1q_f16(k0 + 24);\n                float16x8_t _k24 = vld1q_f16(k0 + 32);\n                k0 += 40;\n\n                _sum0 = vfmaq_f16(_sum0, _k20, _r20);\n                _sum0 = vfmaq_f16(_sum0, _k21, _r21);\n                _sum0 = vfmaq_f16(_sum0, _k22, _r22);\n                _sum0 = vfmaq_f16(_sum0, _k23, _r23);\n                _sum0 = vfmaq_f16(_sum0, _k24, _r24);\n\n                float16x8_t _r30 = vld1q_f16(r3);\n                float16x8_t _r31 = vld1q_f16(r3 + 8);\n                float16x8_t _r32 = vld1q_f16(r3 + 16);\n                float16x8_t _r33 = vld1q_f16(r3 + 24);\n                float16x8_t _r34 = vld1q_f16(r3 + 32);\n\n                float16x8_t _k30 = vld1q_f16(k0);\n                float16x8_t _k31 = vld1q_f16(k0 + 8);\n                float16x8_t _k32 = vld1q_f16(k0 + 16);\n                float16x8_t _k33 = vld1q_f16(k0 + 24);\n                float16x8_t 
_k34 = vld1q_f16(k0 + 32);\n                k0 += 40;\n\n                _sum0 = vfmaq_f16(_sum0, _k30, _r30);\n                _sum0 = vfmaq_f16(_sum0, _k31, _r31);\n                _sum0 = vfmaq_f16(_sum0, _k32, _r32);\n                _sum0 = vfmaq_f16(_sum0, _k33, _r33);\n                _sum0 = vfmaq_f16(_sum0, _k34, _r34);\n\n                float16x8_t _r40 = vld1q_f16(r4);\n                float16x8_t _r41 = vld1q_f16(r4 + 8);\n                float16x8_t _r42 = vld1q_f16(r4 + 16);\n                float16x8_t _r43 = vld1q_f16(r4 + 24);\n                float16x8_t _r44 = vld1q_f16(r4 + 32);\n\n                float16x8_t _k40 = vld1q_f16(k0);\n                float16x8_t _k41 = vld1q_f16(k0 + 8);\n                float16x8_t _k42 = vld1q_f16(k0 + 16);\n                float16x8_t _k43 = vld1q_f16(k0 + 24);\n                float16x8_t _k44 = vld1q_f16(k0 + 32);\n                k0 -= 160;\n\n                _sum0 = vfmaq_f16(_sum0, _k40, _r40);\n                _sum0 = vfmaq_f16(_sum0, _k41, _r41);\n                _sum0 = vfmaq_f16(_sum0, _k42, _r42);\n                _sum0 = vfmaq_f16(_sum0, _k43, _r43);\n                _sum0 = vfmaq_f16(_sum0, _k44, _r44);\n\n                vst1q_f16(outptr0, _sum0);\n\n                outptr0 += 8;\n\n                r0 += 16;\n                r1 += 16;\n                r2 += 16;\n                r3 += 16;\n                r4 += 16;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n            r3 += tailstep;\n            r4 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolutiondepthwise_arm.h\"\n\n#include \"cpu.h\"\n#include \"layer_type.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#if NCNN_GNU_INLINE_ASM\n#include \"convolutiondepthwise_3x3.h\"\n#include \"convolutiondepthwise_5x5.h\"\n\n#if NCNN_INT8\n#include \"convolutiondepthwise_3x3_int8.h\"\n#endif // NCNN_INT8\n\n#if __ARM_NEON\n#include \"convolutiondepthwise_3x3_pack4.h\"\n#include \"convolutiondepthwise_5x5_pack4.h\"\n\n#if NCNN_BF16\n#include \"convolutiondepthwise_3x3_pack4_bf16s.h\"\n#include \"convolutiondepthwise_5x5_pack4_bf16s.h\"\n#endif // NCNN_BF16\n\n#if NCNN_INT8\n#include \"convolutiondepthwise_3x3_pack8_int8.h\"\n#endif // NCNN_INT8\n#endif // __ARM_NEON\n#endif // NCNN_GNU_INLINE_ASM\n\nConvolutionDepthWise_arm::ConvolutionDepthWise_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n\n    activation = 0;\n}\n\nint ConvolutionDepthWise_arm::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    activation = create_activation_layer(activation_type, activation_params, opt);\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return create_pipeline_int8_arm(opt);\n    }\n#endif\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        int elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        
{\n            elempack = channels % 4 == 0 ? 4 : 1;\n        }\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n        if (opt.use_bf16_storage)\n        {\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                Mat weight_data_r2 = weight_data.reshape(maxk, group);\n                Mat weight_data_r2_packed;\n                convert_packing(weight_data_r2, weight_data_r2_packed, 4, opt);\n\n                ncnn::cast_float32_to_bfloat16(weight_data_r2_packed, weight_data_tm, opt);\n            }\n#endif // __ARM_NEON\n\n            if (elempack == 1)\n            {\n                ncnn::cast_float32_to_bfloat16(weight_data, weight_data_tm, opt);\n            }\n\n            if (opt.lightmode)\n                weight_data.release();\n\n            return 0;\n        }\n#endif // NCNN_BF16\n\n#if __ARM_NEON\n        // pack4\n        if (elempack == 4)\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, group);\n            convert_packing(weight_data_r2, weight_data_tm, 4, opt);\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                weight_data_tm = weight_data;\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                weight_data_tm = weight_data;\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                weight_data_tm = weight_data;\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                weight_data_tm = weight_data;\n            }\n            else\n#endif // NCNN_GNU_INLINE_ASM\n            
{\n                // group convolution\n                create_group_ops(opt);\n            }\n        }\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    // group convolution\n    create_group_ops(opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint ConvolutionDepthWise_arm::create_group_ops(const Option& opt)\n{\n    // create Convolution op for each group\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    for (int i = 0; i < (int)group_ops.size(); i++)\n        delete group_ops[i];\n\n    group_ops.clear();\n\n    const int channels_g = channels / group;\n    const int num_output_g = num_output / group;\n\n    group_ops.resize(group);\n\n    for (int g = 0; g < group; g++)\n    {\n        Mat weight_data_g = weight_data.range(maxk * channels_g * num_output_g * g, maxk * channels_g * num_output_g).clone();\n        Mat bias_data_g;\n        if (bias_term)\n            bias_data_g = bias_data.range(num_output_g * g, num_output_g);\n\n        ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution);\n\n        // set param\n        ncnn::ParamDict pd;\n        pd.set(0, num_output_g); // num_output\n        pd.set(1, kernel_w);\n        pd.set(11, kernel_h);\n        pd.set(2, dilation_w);\n        pd.set(12, dilation_h);\n        pd.set(3, stride_w);\n        pd.set(13, stride_h);\n        pd.set(4, 0);  // pad_w\n        pd.set(14, 0); // pad_h\n        pd.set(5, bias_term);\n        pd.set(6, maxk * channels_g * num_output_g); // weight_data_size\n        pd.set(8, int8_scale_term);\n        pd.set(9, activation_type);\n        pd.set(10, activation_params);\n\n        op->load_param(pd);\n\n        // set weights\n        if (bias_term)\n        {\n            ncnn::Mat weights[5];\n            weights[0] = weight_data_g;\n            weights[1] = bias_data_g;\n\n#if 
NCNN_INT8\n            if (int8_scale_term)\n            {\n                Mat weight_data_int8_scales_g(num_output_g);\n                weight_data_int8_scales_g.fill(weight_data_int8_scales[g]);\n                weights[2] = weight_data_int8_scales_g;\n                weights[3] = bottom_blob_int8_scales.range(g, 1);\n            }\n            if (int8_scale_term > 100)\n            {\n                weights[4] = top_blob_int8_scales.range(g, 1);\n            }\n#endif\n\n            op->load_model(ModelBinFromMatArray(weights));\n        }\n        else\n        {\n            ncnn::Mat weights[4];\n            weights[0] = weight_data_g;\n\n#if NCNN_INT8\n            if (int8_scale_term)\n            {\n                Mat weight_data_int8_scales_g(num_output_g);\n                weight_data_int8_scales_g.fill(weight_data_int8_scales[g]);\n                weights[1] = weight_data_int8_scales_g;\n                weights[2] = bottom_blob_int8_scales.range(g, 1);\n            }\n            if (int8_scale_term > 100)\n            {\n                weights[3] = top_blob_int8_scales.range(g, 1);\n            }\n#endif\n\n            op->load_model(ModelBinFromMatArray(weights));\n        }\n\n        op->create_pipeline(opt);\n\n        group_ops[g] = op;\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise_arm::destroy_pipeline(const Option& opt)\n{\n    if (activation)\n    {\n        activation->destroy_pipeline(opt);\n        delete activation;\n        activation = 0;\n    }\n\n    for (int i = 0; i < (int)group_ops.size(); i++)\n    {\n        group_ops[i]->destroy_pipeline(opt);\n        delete group_ops[i];\n    }\n    group_ops.clear();\n\n    return 0;\n}\n\nint ConvolutionDepthWise_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && int8_scale_term)\n    {\n        return forward_int8_arm(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    int elembits = 
bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw3x3s1_pack4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw3x3s2_pack4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw5x5s1_pack4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw5x5s2_pack4_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n          
      {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n            else\n#endif // NCNN_GNU_INLINE_ASM\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    float* outptr = top_blob.channel(g);\n                    const float* kptr = (const float*)weight_data_tm + maxk * g * 4;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n\n                            if (bias_term)\n                            {\n                                _sum = vld1q_f32(((const float*)bias_data) + g * 4);\n                            }\n\n                            const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float32x4_t _val = vld1q_f32(sptr + space_ofs[k] * 
4);\n                                float32x4_t _w = vld1q_f32(kptr + k * 4);\n                                _sum = vmlaq_f32(_sum, _val, _w);\n                            }\n\n                            _sum = activation_ps(_sum, activation_type, activation_params);\n\n                            vst1q_f32(outptr + j * 4, _sum);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n\n                return 0;\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw3x3s1_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw3x3s2_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw5x5s1_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n  
              convdw5x5s2_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n\n                return 0;\n            }\n#endif // NCNN_GNU_INLINE_ASM\n        }\n    }\n\n    // group convolution\n    const int channels_g = channels * elempack / group;\n    const int num_output_g = num_output / group;\n\n    int g_elempack = 1;\n    int out_g_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        g_elempack = channels_g % 4 == 0 ? 4 : 1;\n        out_g_elempack = num_output_g % 4 == 0 ? 4 : 1;\n    }\n#endif\n\n    // unpacking\n    Mat bottom_blob_bordered_unpacked = bottom_blob_bordered;\n    if (elempack == 4 && g_elempack == 1)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_bordered, bottom_blob_bordered_unpacked, 1, opt_p);\n        if (bottom_blob_bordered_unpacked.empty())\n            return -100;\n    }\n\n    Mat top_blob_unpacked = top_blob;\n    if (out_g_elempack == 1 && out_elempack == 4)\n    {\n        top_blob_unpacked.create(outw, outh, num_output, out_elemsize / out_elempack, 1, opt.workspace_allocator);\n        if (top_blob_unpacked.empty())\n            return -100;\n    }\n\n    for (int g = 0; g < group; g++)\n    {\n        const Mat bottom_blob_bordered_g = bottom_blob_bordered_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n        Mat top_blob_g = top_blob_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n        const ncnn::Layer* op = group_ops[g];\n\n        Option opt_g = opt;\n        opt_g.blob_allocator = top_blob_unpacked.allocator;\n\n        // forward\n        int ret = op->forward(bottom_blob_bordered_g, top_blob_g, opt_g);\n        if (ret != 0)\n            return ret;\n    }\n\n    // 
packing\n    if (out_g_elempack == 1 && out_elempack == 4)\n    {\n        convert_packing(top_blob_unpacked, top_blob, 4, opt);\n        if (top_blob.empty())\n            return -100;\n    }\n    else\n    {\n        top_blob = top_blob_unpacked;\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n#if NCNN_ARM82\n    if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_float16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n    if (opt.use_bf16_storage && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_bfloat16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_BF16\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n#if NCNN_ARM82\n        if 
(opt.use_fp16_storage && cpu_support_arm_asimdhp() && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_float16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n        if (opt.use_bf16_storage && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_bfloat16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_BF16\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::ConvolutionDepthWise);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(7, group);\n    pd.set(8, int8_scale_term);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n#if NCNN_BF16\nint ConvolutionDepthWise_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) 
const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __ARM_NEON\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw3x3s1_pack4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw3x3s2_pack4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                
}\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw5x5s1_pack4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw5x5s2_pack4_bf16s_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else\n#endif // NCNN_GNU_INLINE_ASM\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    unsigned short* outptr = top_blob.channel(g);\n                    const unsigned short* kptr = (const unsigned short*)weight_data_tm + maxk * g * 4;\n                    const Mat m = 
bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n\n                            if (bias_term)\n                            {\n                                _sum = vld1q_f32(((const float*)bias_data) + g * 4);\n                            }\n\n                            const unsigned short* sptr = m.row<const unsigned short>(i * stride_h) + j * stride_w * 4;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float32x4_t _val = bfloat2float(vld1_u16(sptr + space_ofs[k] * 4));\n                                float32x4_t _w = bfloat2float(vld1_u16(kptr + k * 4));\n                                _sum = vmlaq_f32(_sum, _val, _w);\n                            }\n\n                            _sum = activation_ps(_sum, activation_type, activation_params);\n\n                            vst1_u16(outptr + j * 4, float2bfloat(_sum));\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n\n            return 0;\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n#if NCNN_GNU_INLINE_ASM\n            //             if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            //             {\n            //                 convdw3x3s1_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n            //\n            //                 if (activation)\n            //                 {\n            //                     activation->forward_inplace(top_blob, opt);\n            //                 }\n            //\n            //                 return 0;\n            //             }\n            //             else if 
(kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            //             {\n            //                 convdw3x3s2_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n            //\n            //                 if (activation)\n            //                 {\n            //                     activation->forward_inplace(top_blob, opt);\n            //                 }\n            //\n            //                 return 0;\n            //             }\n            //             else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            //             {\n            //                 convdw5x5s1_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n            //\n            //                 if (activation)\n            //                 {\n            //                     activation->forward_inplace(top_blob, opt);\n            //                 }\n            //\n            //                 return 0;\n            //             }\n            //             else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            //             {\n            //                 convdw5x5s2_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n            //\n            //                 if (activation)\n            //                 {\n            //                     activation->forward_inplace(top_blob, opt);\n            //                 }\n            //\n            //                 return 0;\n            //             }\n            //             else\n#endif // NCNN_GNU_INLINE_ASM\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n       
         {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < group; g++)\n                {\n                    unsigned short* outptr = top_blob.channel(g);\n                    const unsigned short* kptr = (const unsigned short*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                                sum = bias_data[g];\n\n                            const unsigned short* sptr = m.row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float val = bfloat16_to_float32(sptr[space_ofs[k]]);\n                                float w = bfloat16_to_float32(kptr[k]);\n                                sum += val * w;\n                            }\n\n                            sum = activation_ss(sum, activation_type, activation_params);\n\n                            outptr[j] = float32_to_bfloat16(sum);\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n\n        return 
0;\n    }\n\n    // group convolution\n    const int channels_g = channels * elempack / group;\n    const int num_output_g = num_output / group;\n\n    int g_elempack = 1;\n    int out_g_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        g_elempack = channels_g % 4 == 0 ? 4 : 1;\n        out_g_elempack = num_output_g % 4 == 0 ? 4 : 1;\n    }\n#endif\n\n    // unpacking\n    Mat bottom_blob_bordered_unpacked = bottom_blob_bordered;\n    if (elempack == 4 && g_elempack == 1)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_bordered, bottom_blob_bordered_unpacked, 1, opt_p);\n        if (bottom_blob_bordered_unpacked.empty())\n            return -100;\n    }\n\n    Mat top_blob_unpacked = top_blob;\n    if (out_g_elempack == 1 && out_elempack == 4)\n    {\n        top_blob_unpacked.create(outw, outh, num_output, out_elemsize / out_elempack, 1, opt.workspace_allocator);\n        if (top_blob_unpacked.empty())\n            return -100;\n    }\n\n    for (int g = 0; g < group; g++)\n    {\n        const Mat bottom_blob_bordered_g = bottom_blob_bordered_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n        Mat top_blob_g = top_blob_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n        const ncnn::Layer* op = group_ops[g];\n\n        Option opt_g = opt;\n        opt_g.blob_allocator = top_blob_unpacked.allocator;\n\n        // forward\n        int ret = op->forward(bottom_blob_bordered_g, top_blob_g, opt_g);\n        if (ret != 0)\n            return ret;\n    }\n\n    // packing\n    if (out_g_elempack == 1 && out_elempack == 4)\n    {\n        convert_packing(top_blob_unpacked, top_blob, 4, opt);\n        if (top_blob.empty())\n            return -100;\n    }\n    else\n    {\n        top_blob = top_blob_unpacked;\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n#if NCNN_INT8\nint 
ConvolutionDepthWise_arm::create_pipeline_int8_arm(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        int elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            elempack = channels % 8 == 0 ? 8 : 1;\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 8)\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, group);\n            convert_packing(weight_data_r2, weight_data_tm, 8, opt);\n        }\n\n        if (elempack == 1)\n        {\n            weight_data_tm = weight_data;\n        }\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    // group convolution\n    create_group_ops(opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint ConvolutionDepthWise_arm::forward_int8_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n\n    int elembits = bottom_blob.elembits();\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elembits != 8)\n    {\n        const int channels_g = channels * elempack / group;\n\n        Mat scales(channels * elempack);\n        {\n            float* ps = scales;\n            for (int g = 0; g < group; g++)\n            {\n                float scale = bottom_blob_int8_scales[g];\n                for (int q = 0; q < channels_g; q++)\n                {\n                    *ps++ = scale;\n                }\n            }\n        }\n\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        
quantize_to_int8(bottom_blob, bottom_blob_int8, scales, opt_q);\n        if (bottom_blob_int8.empty())\n            return -100;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob_int8, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n    channels = bottom_blob_bordered.c;\n    elempack = bottom_blob_bordered.elempack;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = num_output % 8 == 0 ? 8 : 1;\n        }\n#endif // __ARM_NEON\n        bool use_int8_requantize = int8_scale_term > 100;\n        size_t out_elemsize = use_int8_requantize ? 1u * out_elempack : 4u * out_elempack;\n#if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage)\n        {\n            out_elemsize = use_int8_requantize ? 1u * out_elempack : 2u * out_elempack;\n        }\n#endif\n        if (opt.use_bf16_storage)\n            out_elemsize = use_int8_requantize ? 1u * out_elempack : 2u * out_elempack;\n\n        top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        // TODO use fp16 / bf16\n        out_elemsize = use_int8_requantize ? 
1u * out_elempack : 4u * out_elempack;\n        top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 8)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1 && (activation_type == 0 || activation_type == 1))\n            {\n                Mat top_blob_int32;\n                top_blob_int32.create(outw, outh, num_output / out_elempack, (size_t)4u * out_elempack, out_elempack, opt.workspace_allocator);\n                if (top_blob_int32.empty())\n                    return -100;\n\n                convdw3x3s1_pack8_int8_neon(bottom_blob_bordered, top_blob_int32, weight_data_tm, opt);\n\n                Mat scale_in_data(group);\n                for (int g = 0; g < group; g++)\n                {\n                    // dequantize\n                    float scale_in;\n                    if (weight_data_int8_scales[g] == 0)\n                        scale_in = 0;\n                    else\n                        scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                    scale_in_data[g] = scale_in;\n                }\n\n                if (use_int8_requantize)\n                {\n                    requantize_from_int32_to_int8(top_blob_int32, top_blob, scale_in_data, top_blob_int8_scales, bias_data, activation_type, activation_params, opt);\n                }\n                else\n                {\n                    dequantize_from_int32(top_blob_int32, top_blob, scale_in_data, bias_data, opt);\n\n                    if (activation)\n                    {\n                        activation->forward_inplace(top_blob, opt);\n                    }\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && 
stride_w == 2 && stride_h == 2 && (activation_type == 0 || activation_type == 1))\n            {\n                Mat top_blob_int32;\n                top_blob_int32.create(outw, outh, num_output / out_elempack, (size_t)4u * out_elempack, out_elempack, opt.workspace_allocator);\n                if (top_blob_int32.empty())\n                    return -100;\n\n                convdw3x3s2_pack8_int8_neon(bottom_blob_bordered, top_blob_int32, weight_data_tm, opt);\n\n                Mat scale_in_data(group);\n                for (int g = 0; g < group; g++)\n                {\n                    // dequantize\n                    float scale_in;\n                    if (weight_data_int8_scales[g] == 0)\n                        scale_in = 0;\n                    else\n                        scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                    scale_in_data[g] = scale_in;\n                }\n\n                if (use_int8_requantize)\n                {\n                    requantize_from_int32_to_int8(top_blob_int32, top_blob, scale_in_data, top_blob_int8_scales, bias_data, activation_type, activation_params, opt);\n                }\n                else\n                {\n                    dequantize_from_int32(top_blob_int32, top_blob, scale_in_data, bias_data, opt);\n\n                    if (activation)\n                    {\n                        activation->forward_inplace(top_blob, opt);\n                    }\n                }\n            }\n            else\n#endif // NCNN_GNU_INLINE_ASM\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n     
               {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    signed char* outptr_s8 = top_blob.channel(g);\n                    float* outptr_f32 = top_blob.channel(g);\n                    const signed char* kptr = (const signed char*)weight_data_tm + maxk * g * 8;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int32x4_t _sum0 = vdupq_n_s32(0);\n                            int32x4_t _sum1 = vdupq_n_s32(0);\n\n                            const signed char* sptr = m.row<const signed char>(i * stride_h) + j * stride_w * 8;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                int8x8_t _val = vld1_s8(sptr + space_ofs[k] * 8);\n                                int8x8_t _w = vld1_s8(kptr + k * 8);\n                                int16x8_t _s0 = vmull_s8(_val, _w);\n                                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                            }\n\n                            float32x4_t _scale_in0;\n                            float32x4_t _scale_in1;\n                            {\n                                float32x4_t _bottom_blob_int8_scales0 = vld1q_f32((const float*)bottom_blob_int8_scales + g * 8);\n                                float32x4_t 
_bottom_blob_int8_scales1 = vld1q_f32((const float*)bottom_blob_int8_scales + g * 8 + 4);\n                                float32x4_t _weight_data_int8_scales0 = vld1q_f32((const float*)weight_data_int8_scales + g * 8);\n                                float32x4_t _weight_data_int8_scales1 = vld1q_f32((const float*)weight_data_int8_scales + g * 8 + 4);\n                                _scale_in0 = div_ps(vdupq_n_f32(1.f), vmulq_f32(_bottom_blob_int8_scales0, _weight_data_int8_scales0));\n                                _scale_in1 = div_ps(vdupq_n_f32(1.f), vmulq_f32(_bottom_blob_int8_scales1, _weight_data_int8_scales1));\n\n                                uint32x4_t _m0 = vmvnq_u32(vceqq_f32(_weight_data_int8_scales0, vdupq_n_f32(0.f)));\n                                uint32x4_t _m1 = vmvnq_u32(vceqq_f32(_weight_data_int8_scales1, vdupq_n_f32(0.f)));\n                                _scale_in0 = vreinterpretq_f32_u32(vandq_u32(vreinterpretq_u32_f32(_scale_in0), _m0));\n                                _scale_in1 = vreinterpretq_f32_u32(vandq_u32(vreinterpretq_u32_f32(_scale_in1), _m1));\n                            }\n\n                            float32x4_t _sumfp32_0 = vmulq_f32(vcvtq_f32_s32(_sum0), _scale_in0);\n                            float32x4_t _sumfp32_1 = vmulq_f32(vcvtq_f32_s32(_sum1), _scale_in1);\n\n                            if (bias_term)\n                            {\n                                float32x4_t _bias0 = vld1q_f32((const float*)bias_data + g * 8);\n                                float32x4_t _bias1 = vld1q_f32((const float*)bias_data + g * 8 + 4);\n                                _sumfp32_0 = vaddq_f32(_sumfp32_0, _bias0);\n                                _sumfp32_1 = vaddq_f32(_sumfp32_1, _bias1);\n                            }\n\n                            _sumfp32_0 = activation_ps(_sumfp32_0, activation_type, activation_params);\n                            _sumfp32_1 = activation_ps(_sumfp32_1, activation_type, 
activation_params);\n\n                            if (use_int8_requantize)\n                            {\n                                // requantize\n                                float32x4_t _scale_out0 = vld1q_f32((const float*)top_blob_int8_scales + g * 8);\n                                float32x4_t _scale_out1 = vld1q_f32((const float*)top_blob_int8_scales + g * 8 + 4);\n                                int8x8_t _sum8 = float2int8(vmulq_f32(_sumfp32_0, _scale_out0), vmulq_f32(_sumfp32_1, _scale_out1));\n                                vst1_s8(outptr_s8, _sum8);\n                                outptr_s8 += 8;\n                            }\n                            else\n                            {\n                                // dequantize\n                                vst1q_f32(outptr_f32, _sumfp32_0);\n                                vst1q_f32(outptr_f32 + 4, _sumfp32_1);\n                                outptr_f32 += 8;\n                            }\n                        }\n                    }\n                }\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1 && (activation_type == 0 || activation_type == 1))\n            {\n                if (use_int8_requantize)\n                {\n                    std::vector<float> requantize_scales;\n                    for (int g = 0; g < group; g++)\n                    {\n                        float scale_in;\n                        if (weight_data_int8_scales[g] == 0)\n                            scale_in = 0;\n                        else\n                            scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                        float scale_out = top_blob_int8_scales[g];\n\n                        requantize_scales.push_back(scale_in);\n                        
requantize_scales.push_back(scale_out);\n                    }\n\n                    convdw3x3s1_int8_requant_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, requantize_scales, opt);\n                }\n                else\n                {\n                    Mat top_blob_int32;\n                    top_blob_int32.create(outw, outh, num_output, (size_t)4u, opt.workspace_allocator);\n                    if (top_blob_int32.empty())\n                        return -100;\n\n                    convdw3x3s1_int8_neon(bottom_blob_bordered, top_blob_int32, weight_data_tm, opt);\n                    //                 convdw3x3s1_int8_dequant_neon(bottom_blob_bordered, top_blob_int32, weight_data_tm, bias_data, dequantize_scales, opt);\n\n                    Mat scale_data(group);\n                    for (int g = 0; g < group; g++)\n                    {\n                        // dequantize\n                        float scale_in;\n                        if (weight_data_int8_scales[g] == 0)\n                            scale_in = 0;\n                        else\n                            scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                        scale_data[g] = scale_in;\n                    }\n\n                    dequantize_from_int32(top_blob_int32, top_blob, scale_data, bias_data, opt);\n                }\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2 && (activation_type == 0 || activation_type == 1))\n            {\n                if (use_int8_requantize)\n                {\n                    std::vector<float> requantize_scales;\n                    for (int g = 0; g < group; g++)\n                    {\n                        float scale_in;\n                       
 if (weight_data_int8_scales[g] == 0)\n                            scale_in = 0;\n                        else\n                            scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                        float scale_out = top_blob_int8_scales[g];\n\n                        requantize_scales.push_back(scale_in);\n                        requantize_scales.push_back(scale_out);\n                    }\n\n                    convdw3x3s2_int8_requant_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, requantize_scales, opt);\n                }\n                else\n                {\n                    Mat top_blob_int32;\n                    top_blob_int32.create(outw, outh, num_output, (size_t)4u, opt.workspace_allocator);\n                    if (top_blob_int32.empty())\n                        return -100;\n\n                    convdw3x3s2_int8_neon(bottom_blob_bordered, top_blob_int32, weight_data_tm, opt);\n                    //                 convdw3x3s2_int8_dequant_neon(bottom_blob_bordered, top_blob_int32, weight_data_tm, bias_data, dequantize_scales, opt);\n\n                    Mat scale_data(group);\n                    for (int g = 0; g < group; g++)\n                    {\n                        // dequantize\n                        float scale_in;\n                        if (weight_data_int8_scales[g] == 0)\n                            scale_in = 0;\n                        else\n                            scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                        scale_data[g] = scale_in;\n                    }\n\n                    dequantize_from_int32(top_blob_int32, top_blob, scale_data, bias_data, opt);\n                }\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else\n#endif // NCNN_GNU_INLINE_ASM\n            {\n          
      const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < group; g++)\n                {\n                    signed char* outptr_s8 = top_blob.channel(g);\n                    float* outptr_f32 = top_blob.channel(g);\n                    const signed char* kptr = (const signed char*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sum = 0;\n\n                            const signed char* sptr = m.row<const signed char>(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                signed char val = sptr[space_ofs[k]];\n                                signed char w = kptr[k];\n                                sum += val * w;\n                            }\n\n                            float scale_in;\n                            if (weight_data_int8_scales[g] == 0)\n                                scale_in = 0;\n                            else\n                
                scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                            float sumfp32 = sum * scale_in;\n\n                            if (bias_term)\n                                sumfp32 += bias_data[g];\n\n                            sumfp32 = activation_ss(sumfp32, activation_type, activation_params);\n\n                            if (use_int8_requantize)\n                            {\n                                // requantize\n                                float scale_out = top_blob_int8_scales[g];\n                                signed char sums8 = float2int8(sumfp32 * scale_out);\n                                outptr_s8[0] = sums8;\n                                outptr_s8 += 1;\n                            }\n                            else\n                            {\n                                // dequantize\n                                outptr_f32[0] = sumfp32;\n                                outptr_f32 += 1;\n                            }\n                        }\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    bool use_int8_requantize = int8_scale_term > 100;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        if (use_int8_requantize)\n            out_elempack = num_output % 8 == 0 ? 8 : 1;\n        else\n            out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __ARM_NEON\n    size_t out_elemsize = use_int8_requantize ? 1u * out_elempack : 4u * out_elempack;\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        out_elemsize = use_int8_requantize ? 1u * out_elempack : 2u * out_elempack;\n    }\n#endif\n    if (opt.use_bf16_storage)\n        out_elemsize = use_int8_requantize ? 
1u * out_elempack : 2u * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // group convolution\n    const int channels_g = channels * elempack / group;\n    const int num_output_g = num_output / group;\n\n    int g_elempack = 1;\n    int out_g_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        g_elempack = channels_g % 8 == 0 ? 8 : 1;\n        if (use_int8_requantize)\n            out_g_elempack = num_output_g % 8 == 0 ? 8 : 1;\n        else\n            out_g_elempack = num_output_g % 4 == 0 ? 4 : 1;\n    }\n#endif // __ARM_NEON\n\n    // unpacking\n    Mat bottom_blob_bordered_unpacked = bottom_blob_bordered;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_bordered, bottom_blob_bordered_unpacked, g_elempack, opt_p);\n        if (bottom_blob_bordered_unpacked.empty())\n            return -100;\n    }\n\n    Mat top_blob_unpacked = top_blob;\n    if (out_g_elempack < out_elempack)\n    {\n        top_blob_unpacked.create(outw, outh, num_output / out_g_elempack, out_elemsize / out_elempack * out_g_elempack, out_g_elempack, opt.workspace_allocator);\n        if (top_blob_unpacked.empty())\n            return -100;\n    }\n\n    for (int g = 0; g < group; g++)\n    {\n        const Mat bottom_blob_bordered_g = bottom_blob_bordered_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n        Mat top_blob_g = top_blob_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n        const ncnn::Layer* op = group_ops[g];\n\n        Option opt_g = opt;\n        opt_g.blob_allocator = top_blob_unpacked.allocator;\n\n        // forward\n        int ret = op->forward(bottom_blob_bordered_g, top_blob_g, opt_g);\n        if (ret != 0)\n            
return ret;\n    }\n\n    // packing\n    if (out_g_elempack < out_elempack)\n    {\n        convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        if (top_blob.empty())\n            return -100;\n    }\n    else\n    {\n        top_blob = top_blob_unpacked;\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTIONDEPTHWISE_ARM_H\n#define LAYER_CONVOLUTIONDEPTHWISE_ARM_H\n\n#include \"convolutiondepthwise.h\"\n\nnamespace ncnn {\n\nclass ConvolutionDepthWise_arm : public ConvolutionDepthWise\n{\npublic:\n    ConvolutionDepthWise_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    int create_group_ops(const Option& opt);\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8_arm(const Option& opt);\n    int forward_int8_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Layer* activation;\n    std::vector<ncnn::Layer*> group_ops;\n\n    Mat weight_data_tm;\n\n    // fp16\n    Mat bias_data_fp16;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTIONDEPTHWISE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/convolutiondepthwise_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolutiondepthwise_arm.h\"\n\n#include \"cpu.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#if NCNN_GNU_INLINE_ASM\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"convolutiondepthwise_3x3_fp16s.h\"\n#include \"convolutiondepthwise_3x3_pack8_fp16s.h\"\n#include \"convolutiondepthwise_5x5_pack8_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n#endif // NCNN_GNU_INLINE_ASM\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint ConvolutionDepthWise_arm::create_pipeline_fp16s(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        int elempack = 1;\n\n        if (opt.use_packing_layout)\n        {\n            elempack = opt.use_fp16_arithmetic && channels % 8 == 0 ? 8 : channels % 4 == 0 ? 
4 : 1;\n        }\n\n        if (elempack == 8)\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, group);\n            Mat weight_data_r2_packed;\n            convert_packing(weight_data_r2, weight_data_r2_packed, 8, opt);\n\n            ncnn::cast_float32_to_float16(weight_data_r2_packed, weight_data_tm, opt);\n        }\n\n        if (elempack == 4)\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, group);\n            Mat weight_data_r2_packed;\n            convert_packing(weight_data_r2, weight_data_r2_packed, 4, opt);\n\n            ncnn::cast_float32_to_float16(weight_data_r2_packed, weight_data_tm, opt);\n        }\n\n        if (elempack == 1)\n        {\n            ncnn::cast_float32_to_float16(weight_data, weight_data_tm, opt);\n        }\n\n        ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    // group convolution\n    create_group_ops(opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint ConvolutionDepthWise_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n        if (elempack == 4)\n        {\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g * 4;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n\n                            if (bias_term)\n                            {\n                                _sum = vld1q_f32(((const float*)bias_data) + g * 4);\n                            }\n\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + 
j * stride_w * 4;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr + space_ofs[k] * 4));\n                                float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr + k * 4));\n                                _sum = vfmaq_f32(_sum, _val, _w);\n                            }\n\n                            _sum = activation_ps(_sum, activation_type, activation_params);\n\n                            vst1_f16(outptr + j * 4, vcvt_f16_f32(_sum));\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n\n        if (elempack == 1)\n        {\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < group; g++)\n                {\n                    __fp16* outptr = top_blob.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; 
j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                                sum = bias_data[g];\n\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float val = (float)sptr[space_ofs[k]];\n                                float w = (float)kptr[k];\n                                sum += val * w;\n                            }\n\n                            sum = activation_ss(sum, activation_type, activation_params);\n\n                            outptr[j] = (__fp16)sum;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // group convolution\n    const int channels_g = channels * elempack / group;\n    const int num_output_g = num_output / group;\n\n    int g_elempack = (opt.use_packing_layout && channels_g % 4 == 0) ? 4 : 1;\n    int out_g_elempack = (opt.use_packing_layout && num_output_g % 4 == 0) ? 
4 : 1;\n\n    // unpacking\n    Mat bottom_blob_bordered_unpacked = bottom_blob_bordered;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_bordered, bottom_blob_bordered_unpacked, g_elempack, opt_p);\n        if (bottom_blob_bordered_unpacked.empty())\n            return -100;\n    }\n\n    Mat top_blob_unpacked = top_blob;\n    if (out_g_elempack < out_elempack)\n    {\n        top_blob_unpacked.create(outw, outh, num_output / out_g_elempack, out_elemsize / out_elempack * out_g_elempack, out_g_elempack, opt.workspace_allocator);\n        if (top_blob_unpacked.empty())\n            return -100;\n    }\n\n    for (int g = 0; g < group; g++)\n    {\n        const Mat bottom_blob_bordered_g = bottom_blob_bordered_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n        Mat top_blob_g = top_blob_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n        const ncnn::Layer* op = group_ops[g];\n\n        Option opt_g = opt;\n        opt_g.blob_allocator = top_blob_unpacked.allocator;\n\n        // forward\n        int ret = op->forward(bottom_blob_bordered_g, top_blob_g, opt_g);\n        if (ret != 0)\n            return ret;\n    }\n\n    // packing\n    if (out_g_elempack < out_elempack)\n    {\n        convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        if (top_blob.empty())\n            return -100;\n    }\n    else\n    {\n        top_blob = top_blob_unpacked;\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int 
kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n        if (elempack == 8)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw3x3s1_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw3x3s2_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw5x5s1_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, 
bias_data_fp16, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw5x5s2_pack8_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else\n#endif // NCNN_GNU_INLINE_ASM\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g * 8;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n\n  
                          if (bias_term)\n                            {\n                                _sum = vld1q_f16(((const __fp16*)bias_data_fp16) + g * 8);\n                            }\n\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w * 8;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float16x8_t _val = vld1q_f16(sptr + space_ofs[k] * 8);\n                                float16x8_t _w = vld1q_f16(kptr + k * 8);\n                                _sum = vfmaq_f16(_sum, _val, _w);\n                            }\n\n                            _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                            vst1q_f16(outptr + j * 8, _sum);\n                        }\n\n                        outptr += outw * 8;\n                    }\n                }\n            }\n        }\n\n        if (elempack == 4)\n        {\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob.channel(g);\n                    const 
__fp16* kptr = (const __fp16*)weight_data_tm + maxk * g * 4;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float16x4_t _sum = vdup_n_f16((__fp16)0.f);\n\n                            if (bias_term)\n                            {\n                                _sum = vld1_f16(((const __fp16*)bias_data_fp16) + g * 4);\n                            }\n\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w * 4;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float16x4_t _val = vld1_f16(sptr + space_ofs[k] * 4);\n                                float16x4_t _w = vld1_f16(kptr + k * 4);\n                                _sum = vfma_f16(_sum, _val, _w);\n                            }\n\n                            _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                            vst1_f16(outptr + j * 4, _sum);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n\n        if (elempack == 1)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw3x3s1_fp16sa_neon(bottom_blob_bordered, top_blob, weight_data_tm, bias_data_fp16, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw3x3s2_fp16sa_neon(bottom_blob_bordered, 
top_blob, weight_data_tm, bias_data_fp16, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else\n#endif // NCNN_GNU_INLINE_ASM\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < group; g++)\n                {\n                    __fp16* outptr = top_blob.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                                sum = bias_data[g];\n\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                __fp16 val = sptr[space_ofs[k]];\n                                __fp16 w = kptr[k];\n           
                     sum += val * w;\n                            }\n\n                            sum = activation_ss_f16(sum, activation_type, activation_params);\n\n                            outptr[j] = (__fp16)sum;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // group convolution\n    const int channels_g = channels * elempack / group;\n    const int num_output_g = num_output / group;\n\n    int g_elempack = 1;\n    int out_g_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        g_elempack = opt.use_fp16_arithmetic && channels_g % 8 == 0 ? 8 : channels_g % 4 == 0 ? 4 : 1;\n        out_g_elempack = opt.use_fp16_arithmetic && num_output_g % 8 == 0 ? 8 : num_output_g % 4 == 0 ? 4 : 1;\n    }\n\n    // unpacking\n    Mat bottom_blob_bordered_unpacked = bottom_blob_bordered;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_bordered, bottom_blob_bordered_unpacked, g_elempack, opt_p);\n        if (bottom_blob_bordered_unpacked.empty())\n            return -100;\n    }\n\n    Mat top_blob_unpacked = top_blob;\n    if (out_g_elempack < out_elempack)\n    {\n        top_blob_unpacked.create(outw, outh, num_output / out_g_elempack, out_elemsize / out_elempack * out_g_elempack, out_g_elempack, opt.workspace_allocator);\n        if (top_blob_unpacked.empty())\n            return -100;\n    }\n\n    for (int g = 0; g < group; g++)\n    {\n        const Mat bottom_blob_bordered_g = bottom_blob_bordered_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n        Mat top_blob_g = top_blob_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n        const ncnn::Layer* op = group_ops[g];\n\n        Option opt_g = opt;\n        opt_g.blob_allocator = 
top_blob_unpacked.allocator;\n\n        // forward\n        int ret = op->forward(bottom_blob_bordered_g, top_blob_g, opt_g);\n        if (ret != 0)\n            return ret;\n    }\n\n    // packing\n    if (out_g_elempack < out_elempack)\n    {\n        convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        if (top_blob.empty())\n            return -100;\n    }\n    else\n    {\n        top_blob = top_blob_unpacked;\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/crop_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"crop_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nCrop_arm::Crop_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\n#if __ARM_NEON\nstatic void crop_pack8_neon(const Mat& src, Mat& dst, int top, int left)\n{\n    int w = dst.w;\n    int h = dst.h;\n    int right = src.w - dst.w - left;\n\n    const float* ptr = src.row(top) + left * 8;\n    float* outptr = dst;\n\n    for (int y = 0; y < h; y++)\n    {\n        for (int x = 0; x < w; x++)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            vst1q_f32(outptr, _p0);\n            vst1q_f32(outptr + 4, _p1);\n            ptr += 8;\n            outptr += 8;\n        }\n\n        ptr += (left + right) * 8;\n    }\n}\n\nstatic void crop_pack8_bf16_fp16s_neon(const Mat& src, Mat& dst, int top, int left)\n{\n    int w = dst.w;\n    int h = dst.h;\n    int right = src.w - dst.w - left;\n\n    const unsigned short* ptr = src.row<unsigned short>(top) + left * 8;\n    unsigned short* outptr = dst;\n\n    for (int y = 0; y < h; y++)\n    {\n        for (int x = 0; x < w; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            vst1q_u16(outptr, _p);\n            ptr += 8;\n            outptr += 8;\n        }\n\n        ptr += (left + right) * 8;\n    }\n}\n\nstatic void crop_pack4_neon(const Mat& src, Mat& dst, int top, int left)\n{\n    int w = dst.w;\n    int h = dst.h;\n    int right = src.w - dst.w - left;\n\n    const float* ptr = src.row(top) + left * 4;\n    float* outptr = dst;\n\n    for (int y = 0; y < h; y++)\n    {\n        for (int x = 0; x < w; x++)\n        {\n            float32x4_t _p = 
vld1q_f32(ptr);\n            vst1q_f32(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n\n        ptr += (left + right) * 4;\n    }\n}\n\nstatic void crop_pack4_bf16_fp16s_neon(const Mat& src, Mat& dst, int top, int left)\n{\n    int w = dst.w;\n    int h = dst.h;\n    int right = src.w - dst.w - left;\n\n    const unsigned short* ptr = src.row<unsigned short>(top) + left * 4;\n    unsigned short* outptr = dst;\n\n    for (int y = 0; y < h; y++)\n    {\n        for (int x = 0; x < w; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr);\n            vst1_u16(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n\n        ptr += (left + right) * 4;\n    }\n}\n#endif // __ARM_NEON\n\nint Crop_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __ARM_NEON\n    int _woffset, _hoffset, _doffset, _coffset;\n    int _outw, _outh, _outd, _outc;\n    if (!starts_expr.empty() && !ends_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(1);\n        bottom_blob_shapes[0] = bottom_blob.shape();\n        eval_crop_expr(bottom_blob_shapes, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else\n    {\n        resolve_crop_roi(bottom_blob.shape(), _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n\n    if (elempack == 8)\n    {\n        if (dims == 1)\n        {\n            int out_elempack = _outw % 8 == 0 ? 8 : _outw % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw / out_elempack == w && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_woffset % 8 == 0 && out_elempack == 8)\n            {\n                top_blob.create(_outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 16u)\n                    crop_pack8_bf16_fp16s_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n                else\n                    crop_pack8_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int out_elempack = _outh % 8 == 0 ? 8 : _outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh / out_elempack == h && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_hoffset % 8 == 0 && out_elempack == 8)\n            {\n                top_blob.create(_outw, _outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 16u)\n                    crop_pack8_bf16_fp16s_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n                else\n                    crop_pack8_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int out_elempack = _outc % 8 == 0 ? 8 : _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outc / out_elempack == channels && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 8 == 0 && out_elempack == 8)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const Mat m = bottom_blob_sliced.channel(q);\n                    Mat borderm = top_blob.channel(q);\n\n                    if (elemsize == 16u)\n                        crop_pack8_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                    else\n                        crop_pack8_neon(m, borderm, _hoffset, _woffset);\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int out_elempack = _outc % 8 == 0 ? 8 : _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outd == d && _outc / out_elempack == channels && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 8 == 0 && out_elempack == 8)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h && _outd == d)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outd, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    for (int z = 0; z < _outd; z++)\n                    {\n                        const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        if (elemsize == 16u)\n                            crop_pack8_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                        else\n                            crop_pack8_neon(m, borderm, _hoffset, _woffset);\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int out_elempack = _outw % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw / out_elempack == w && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_woffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 8u)\n                    crop_pack4_bf16_fp16s_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n                else\n                    crop_pack4_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int out_elempack = _outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh / out_elempack == h && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_hoffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw, _outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 8u)\n                    crop_pack4_bf16_fp16s_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n                else\n                    crop_pack4_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const Mat m = bottom_blob_sliced.channel(q);\n                    Mat borderm = top_blob.channel(q);\n\n                    if (elemsize == 8u)\n                        crop_pack4_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                    else\n                        crop_pack4_neon(m, borderm, _hoffset, _woffset);\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outd == d && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h && _outd == d)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outd, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    for (int z = 0; z < _outd; z++)\n                    {\n                        const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        if (elemsize == 8u)\n                            crop_pack4_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                        else\n                            crop_pack4_neon(m, borderm, _hoffset, _woffset);\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __ARM_NEON\n\n    Mat bottom_blob_unpacked = bottom_blob;\n    if (elempack != 1)\n    {\n        Option opt_pack1 = opt;\n        opt_pack1.blob_allocator = opt.workspace_allocator;\n\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack1);\n        if (bottom_blob_unpacked.empty())\n        
    return -100;\n    }\n\n    return Crop::forward(bottom_blob_unpacked, top_blob, opt);\n}\n\nint Crop_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int ref_elempack = reference_blob.elempack;\n\n    Mat& top_blob = top_blobs[0];\n\n#if __ARM_NEON\n    int _woffset, _hoffset, _doffset, _coffset;\n    int _outw, _outh, _outd, _outc;\n    if (!starts_expr.empty() && !ends_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(bottom_blobs.size());\n        for (size_t i = 0; i < bottom_blobs.size(); i++)\n        {\n            bottom_blob_shapes[i] = bottom_blobs[i].shape();\n        }\n        eval_crop_expr(bottom_blob_shapes, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else if (woffset == -233)\n    {\n        resolve_crop_roi(bottom_blob.shape(), (const int*)reference_blob, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else\n    {\n        resolve_crop_roi(bottom_blob.shape(), reference_blob.shape(), _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n\n    if (elempack == 8)\n    {\n        if (dims == 1)\n        {\n            int out_elempack = _outw % 8 == 0 ? 8 : _outw % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw / out_elempack == w && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_woffset % 8 == 0 && out_elempack == 8)\n            {\n                top_blob.create(_outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 16u)\n                    crop_pack8_bf16_fp16s_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n                else\n                    crop_pack8_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int out_elempack = _outh % 8 == 0 ? 8 : _outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh / out_elempack == h && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_hoffset % 8 == 0 && out_elempack == 8)\n            {\n                top_blob.create(_outw, _outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 16u)\n                    crop_pack8_bf16_fp16s_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n                else\n                    crop_pack8_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int out_elempack = _outc % 8 == 0 ? 8 : _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outc / out_elempack == channels && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 8 == 0 && out_elempack == 8)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const Mat m = bottom_blob_sliced.channel(q);\n                    Mat borderm = top_blob.channel(q);\n\n                    if (elemsize == 16u)\n                        crop_pack8_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                    else\n                        crop_pack8_neon(m, borderm, _hoffset, _woffset);\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int out_elempack = _outc % 8 == 0 ? 8 : _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outd == d && _outc / out_elempack == channels && out_elempack == 8)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 8 == 0 && out_elempack == 8)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h && _outd == d)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outd, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    for (int z = 0; z < _outd; z++)\n                    {\n                        const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        if (elemsize == 16u)\n                            crop_pack8_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                        else\n                            crop_pack8_neon(m, borderm, _hoffset, _woffset);\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int out_elempack = _outw % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw / out_elempack == w && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_woffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 8u)\n                    crop_pack4_bf16_fp16s_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n                else\n                    crop_pack4_neon(bottom_blob, top_blob, 0, _woffset / elempack);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int out_elempack = _outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh / out_elempack == h && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_hoffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw, _outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                if (elemsize == 8u)\n                    crop_pack4_bf16_fp16s_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n                else\n                    crop_pack4_neon(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const Mat m = bottom_blob_sliced.channel(q);\n                    Mat borderm = top_blob.channel(q);\n\n                    if (elemsize == 8u)\n                        crop_pack4_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                    else\n                        crop_pack4_neon(m, borderm, _hoffset, _woffset);\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outd == d && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h && _outd == d)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outd, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    for (int z = 0; z < _outd; z++)\n                    {\n                        const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        if (elemsize == 8u)\n                            crop_pack4_bf16_fp16s_neon(m, borderm, _hoffset, _woffset);\n                        else\n                            crop_pack4_neon(m, borderm, _hoffset, _woffset);\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __ARM_NEON\n\n    std::vector<Mat> bottom_blobs_unpacked(bottom_blobs.size());\n    for (size_t i = 0; i < bottom_blobs.size(); i++)\n    {\n        Mat bottom_blob_unpacked = bottom_blobs[i];\n        if (elempack != 1)\n        {\n            Option opt_pack1 = opt;\n            opt_pack1.blob_allocator = 
opt.workspace_allocator;\n\n            convert_packing(bottom_blobs[i], bottom_blob_unpacked, 1, opt_pack1);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        bottom_blobs_unpacked[i] = bottom_blob_unpacked;\n    }\n\n    return Crop::forward(bottom_blobs_unpacked, top_blobs, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/crop_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CROP_ARM_H\n#define LAYER_CROP_ARM_H\n\n#include \"crop.h\"\n\nnamespace ncnn {\n\nclass Crop_arm : public Crop\n{\npublic:\n    Crop_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CROP_ARM_H\n"
  },
  {
    "path": "src/layer/arm/deconvolution_3x3.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void deconv3x3s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 9 + q * 9;\n\n            const float* r0 = img0;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n#if __ARM_NEON\n            float32x4_t _k0 = vld1q_f32(k0);\n            float32x4_t _k1 = vld1q_f32(k1);\n            float32x4_t _k2 = vld1q_f32(k2);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < h; i++)\n            {\n                float* outptr = out.row(i);\n\n                float* outptr0 = outptr;\n                float* outptr1 = outptr + outw;\n                float* outptr2 = outptr + outw * 2;\n\n                int j = 0;\n\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    float32x4_t _v = vld1q_f32(r0);\n\n                    //\n                    float32x4_t _out00 = vld1q_f32(outptr0 + 0);\n                    _out00 = vmlaq_lane_f32(_out00, _v, vget_low_f32(_k0), 0);\n                    vst1q_f32(outptr0 + 0, _out00);\n\n                    float32x4_t _out01 = vld1q_f32(outptr0 + 1);\n                    _out01 = vmlaq_lane_f32(_out01, _v, vget_low_f32(_k0), 1);\n                    
vst1q_f32(outptr0 + 1, _out01);\n\n                    float32x4_t _out02 = vld1q_f32(outptr0 + 2);\n                    _out02 = vmlaq_lane_f32(_out02, _v, vget_high_f32(_k0), 0);\n                    vst1q_f32(outptr0 + 2, _out02);\n\n                    //\n                    float32x4_t _out10 = vld1q_f32(outptr1 + 0);\n                    _out10 = vmlaq_lane_f32(_out10, _v, vget_low_f32(_k1), 0);\n                    vst1q_f32(outptr1 + 0, _out10);\n\n                    float32x4_t _out11 = vld1q_f32(outptr1 + 1);\n                    _out11 = vmlaq_lane_f32(_out11, _v, vget_low_f32(_k1), 1);\n                    vst1q_f32(outptr1 + 1, _out11);\n\n                    float32x4_t _out12 = vld1q_f32(outptr1 + 2);\n                    _out12 = vmlaq_lane_f32(_out12, _v, vget_high_f32(_k1), 0);\n                    vst1q_f32(outptr1 + 2, _out12);\n\n                    //\n                    float32x4_t _out20 = vld1q_f32(outptr2 + 0);\n                    _out20 = vmlaq_lane_f32(_out20, _v, vget_low_f32(_k2), 0);\n                    vst1q_f32(outptr2 + 0, _out20);\n\n                    float32x4_t _out21 = vld1q_f32(outptr2 + 1);\n                    _out21 = vmlaq_lane_f32(_out21, _v, vget_low_f32(_k2), 1);\n                    vst1q_f32(outptr2 + 1, _out21);\n\n                    float32x4_t _out22 = vld1q_f32(outptr2 + 2);\n                    _out22 = vmlaq_lane_f32(_out22, _v, vget_high_f32(_k2), 0);\n                    vst1q_f32(outptr2 + 2, _out22);\n\n                    r0 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                }\n#endif // __ARM_NEON\n\n                for (; j < w; j++)\n                {\n                    float val = r0[0];\n\n                    outptr0[0] += val * k0[0];\n                    outptr0[1] += val * k0[1];\n                    outptr0[2] += val * k0[2];\n\n                    outptr1[0] += val * k1[0];\n                    outptr1[1] += val 
* k1[1];\n                    outptr1[2] += val * k1[2];\n\n                    outptr2[0] += val * k2[0];\n                    outptr2[1] += val * k2[1];\n                    outptr2[2] += val * k2[2];\n\n                    r0++;\n                    outptr0++;\n                    outptr1++;\n                    outptr2++;\n                }\n            }\n        }\n    }\n}\n\nstatic void deconv3x3s2_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 9 + q * 9;\n\n            const float* r0 = img0;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n#if __ARM_NEON\n            float32x4_t _k0 = vld1q_f32(k0);\n            float32x4_t _k1 = vld1q_f32(k1);\n            float32x4_t _k2 = vld1q_f32(k2);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < h; i++)\n            {\n                float* outptr = out.row(i * 2);\n\n                float* outptr0 = outptr;\n                float* outptr1 = outptr0 + outw;\n                float* outptr2 = outptr1 + outw;\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    float32x4_t _v = vld1q_f32(r0);\n\n                    // out row 0\n                    float32x4_t _out00 = vmulq_lane_f32(_v, 
vget_low_f32(_k0), 0);  // 0,2,4,6\n                    float32x4_t _out01 = vmulq_lane_f32(_v, vget_low_f32(_k0), 1);  // 1,3,5,7\n                    float32x4_t _out02 = vmulq_lane_f32(_v, vget_high_f32(_k0), 0); // 2,4,6,8\n\n                    float32x4x2_t _out0 = vld2q_f32(outptr0);\n                    _out0.val[0] = vaddq_f32(_out0.val[0], _out00); // 0,2,4,6\n                    _out0.val[1] = vaddq_f32(_out0.val[1], _out01); // 1,3,5,7\n                    vst2q_f32(outptr0, _out0);\n\n                    _out0 = vld2q_f32(outptr0 + 2);\n                    _out0.val[0] = vaddq_f32(_out0.val[0], _out02); // 2,4,6,8\n                    vst2q_f32(outptr0 + 2, _out0);\n\n                    // out row 1\n                    float32x4_t _out10 = vmulq_lane_f32(_v, vget_low_f32(_k1), 0);  // 0,2,4,6\n                    float32x4_t _out11 = vmulq_lane_f32(_v, vget_low_f32(_k1), 1);  // 1,3,5,7\n                    float32x4_t _out12 = vmulq_lane_f32(_v, vget_high_f32(_k1), 0); // 2,4,6,8\n\n                    float32x4x2_t _out1 = vld2q_f32(outptr1);\n                    _out1.val[0] = vaddq_f32(_out1.val[0], _out10); // 0,2,4,6\n                    _out1.val[1] = vaddq_f32(_out1.val[1], _out11); // 1,3,5,7\n                    vst2q_f32(outptr1, _out1);\n\n                    _out1 = vld2q_f32(outptr1 + 2);\n                    _out1.val[0] = vaddq_f32(_out1.val[0], _out12); // 2,4,6,8\n                    vst2q_f32(outptr1 + 2, _out1);\n\n                    // out row 2\n                    float32x4_t _out20 = vmulq_lane_f32(_v, vget_low_f32(_k2), 0);  // 0,2,4,6\n                    float32x4_t _out21 = vmulq_lane_f32(_v, vget_low_f32(_k2), 1);  // 1,3,5,7\n                    float32x4_t _out22 = vmulq_lane_f32(_v, vget_high_f32(_k2), 0); // 2,4,6,8\n\n                    float32x4x2_t _out2 = vld2q_f32(outptr2);\n                    _out2.val[0] = vaddq_f32(_out2.val[0], _out20); // 0,2,4,6\n                    _out2.val[1] = vaddq_f32(_out2.val[1], 
_out21); // 1,3,5,7\n                    vst2q_f32(outptr2, _out2);\n\n                    _out2 = vld2q_f32(outptr2 + 2);\n                    _out2.val[0] = vaddq_f32(_out2.val[0], _out22); // 2,4,6,8\n                    vst2q_f32(outptr2 + 2, _out2);\n\n                    r0 += 4;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                    outptr2 += 8;\n                }\n#endif // __ARM_NEON\n\n                for (; j < w; j++)\n                {\n                    float val = r0[0];\n\n                    outptr0[0] += val * k0[0];\n                    outptr0[1] += val * k0[1];\n                    outptr0[2] += val * k0[2];\n\n                    outptr1[0] += val * k1[0];\n                    outptr1[1] += val * k1[1];\n                    outptr1[2] += val * k1[2];\n\n                    outptr2[0] += val * k2[0];\n                    outptr2[1] += val * k2[1];\n                    outptr2[2] += val * k2[2];\n\n                    r0++;\n                    outptr0 += 2;\n                    outptr1 += 2;\n                    outptr2 += 2;\n                }\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/deconvolution_4x4.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void deconv4x4s1_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 16 + q * 16;\n\n            const float* r0 = img0;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 4;\n            const float* k2 = kernel0 + 8;\n            const float* k3 = kernel0 + 12;\n\n#if __ARM_NEON\n            float32x4_t _k0 = vld1q_f32(k0);\n            float32x4_t _k1 = vld1q_f32(k1);\n            float32x4_t _k2 = vld1q_f32(k2);\n            float32x4_t _k3 = vld1q_f32(k3);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < h; i++)\n            {\n                float* outptr = out.row(i);\n\n                float* outptr0 = outptr;\n                float* outptr1 = outptr0 + outw;\n                float* outptr2 = outptr1 + outw;\n                float* outptr3 = outptr2 + outw;\n\n                int j = 0;\n\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    float32x4_t _v = vld1q_f32(r0);\n\n                    //\n                    float32x4_t _out00 = vld1q_f32(outptr0 + 0);\n                    _out00 = vmlaq_lane_f32(_out00, _v, vget_low_f32(_k0), 0);\n                    vst1q_f32(outptr0 + 0, _out00);\n\n                    float32x4_t 
_out01 = vld1q_f32(outptr0 + 1);\n                    _out01 = vmlaq_lane_f32(_out01, _v, vget_low_f32(_k0), 1);\n                    vst1q_f32(outptr0 + 1, _out01);\n\n                    float32x4_t _out02 = vld1q_f32(outptr0 + 2);\n                    _out02 = vmlaq_lane_f32(_out02, _v, vget_high_f32(_k0), 0);\n                    vst1q_f32(outptr0 + 2, _out02);\n\n                    float32x4_t _out03 = vld1q_f32(outptr0 + 3);\n                    _out03 = vmlaq_lane_f32(_out03, _v, vget_high_f32(_k0), 1);\n                    vst1q_f32(outptr0 + 3, _out03);\n\n                    //\n                    float32x4_t _out10 = vld1q_f32(outptr1 + 0);\n                    _out10 = vmlaq_lane_f32(_out10, _v, vget_low_f32(_k1), 0);\n                    vst1q_f32(outptr1 + 0, _out10);\n\n                    float32x4_t _out11 = vld1q_f32(outptr1 + 1);\n                    _out11 = vmlaq_lane_f32(_out11, _v, vget_low_f32(_k1), 1);\n                    vst1q_f32(outptr1 + 1, _out11);\n\n                    float32x4_t _out12 = vld1q_f32(outptr1 + 2);\n                    _out12 = vmlaq_lane_f32(_out12, _v, vget_high_f32(_k1), 0);\n                    vst1q_f32(outptr1 + 2, _out12);\n\n                    float32x4_t _out13 = vld1q_f32(outptr1 + 3);\n                    _out13 = vmlaq_lane_f32(_out13, _v, vget_high_f32(_k1), 1);\n                    vst1q_f32(outptr1 + 3, _out13);\n\n                    //\n                    float32x4_t _out20 = vld1q_f32(outptr2 + 0);\n                    _out20 = vmlaq_lane_f32(_out20, _v, vget_low_f32(_k2), 0);\n                    vst1q_f32(outptr2 + 0, _out20);\n\n                    float32x4_t _out21 = vld1q_f32(outptr2 + 1);\n                    _out21 = vmlaq_lane_f32(_out21, _v, vget_low_f32(_k2), 1);\n                    vst1q_f32(outptr2 + 1, _out21);\n\n                    float32x4_t _out22 = vld1q_f32(outptr2 + 2);\n                    _out22 = vmlaq_lane_f32(_out22, _v, vget_high_f32(_k2), 0);\n                    
vst1q_f32(outptr2 + 2, _out22);\n\n                    float32x4_t _out23 = vld1q_f32(outptr2 + 3);\n                    _out23 = vmlaq_lane_f32(_out23, _v, vget_high_f32(_k2), 1);\n                    vst1q_f32(outptr2 + 3, _out23);\n\n                    //\n                    float32x4_t _out30 = vld1q_f32(outptr3 + 0);\n                    _out30 = vmlaq_lane_f32(_out30, _v, vget_low_f32(_k3), 0);\n                    vst1q_f32(outptr3 + 0, _out30);\n\n                    float32x4_t _out31 = vld1q_f32(outptr3 + 1);\n                    _out31 = vmlaq_lane_f32(_out31, _v, vget_low_f32(_k3), 1);\n                    vst1q_f32(outptr3 + 1, _out31);\n\n                    float32x4_t _out32 = vld1q_f32(outptr3 + 2);\n                    _out32 = vmlaq_lane_f32(_out32, _v, vget_high_f32(_k3), 0);\n                    vst1q_f32(outptr3 + 2, _out32);\n\n                    float32x4_t _out33 = vld1q_f32(outptr3 + 3);\n                    _out33 = vmlaq_lane_f32(_out33, _v, vget_high_f32(_k3), 1);\n                    vst1q_f32(outptr3 + 3, _out33);\n\n                    r0 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n\n#endif // __ARM_NEON\n\n                for (; j < w; j++)\n                {\n                    float val = r0[0];\n\n                    outptr0[0] += val * k0[0];\n                    outptr0[1] += val * k0[1];\n                    outptr0[2] += val * k0[2];\n                    outptr0[3] += val * k0[3];\n\n                    outptr1[0] += val * k1[0];\n                    outptr1[1] += val * k1[1];\n                    outptr1[2] += val * k1[2];\n                    outptr1[3] += val * k1[3];\n\n                    outptr2[0] += val * k2[0];\n                    outptr2[1] += val * k2[1];\n                    outptr2[2] += val * k2[2];\n                    outptr2[3] += val * k2[3];\n\n                    outptr3[0] += val 
* k3[0];\n                    outptr3[1] += val * k3[1];\n                    outptr3[2] += val * k3[2];\n                    outptr3[3] += val * k3[3];\n\n                    r0++;\n                    outptr0++;\n                    outptr1++;\n                    outptr2++;\n                    outptr3++;\n                }\n            }\n        }\n    }\n}\n\nstatic void deconv4x4s2_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outch = top_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            const float* img0 = bottom_blob.channel(q);\n\n            const float* kernel0 = kernel + p * inch * 16 + q * 16;\n\n            const float* r0 = img0;\n\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 4;\n            const float* k2 = kernel0 + 8;\n            const float* k3 = kernel0 + 12;\n\n#if __ARM_NEON\n            float32x4_t _k0 = vld1q_f32(k0);\n            float32x4_t _k1 = vld1q_f32(k1);\n            float32x4_t _k2 = vld1q_f32(k2);\n            float32x4_t _k3 = vld1q_f32(k3);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < h; i++)\n            {\n                float* outptr = out.row(i * 2);\n\n                float* outptr0 = outptr;\n                float* outptr1 = outptr0 + outw;\n                float* outptr2 = outptr1 + outw;\n                float* outptr3 = outptr2 + outw;\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    float32x4_t 
_v = vld1q_f32(r0);\n\n                    // row 0\n                    float32x4x2_t _out0 = vld2q_f32(outptr0);\n                    // 0,2,4,6\n                    _out0.val[0] = vmlaq_lane_f32(_out0.val[0], _v, vget_low_f32(_k0), 0);\n                    // 1,3,5,7\n                    _out0.val[1] = vmlaq_lane_f32(_out0.val[1], _v, vget_low_f32(_k0), 1);\n                    vst2q_f32(outptr0, _out0);\n\n                    _out0 = vld2q_f32(outptr0 + 2);\n                    // 2,4,6,8\n                    _out0.val[0] = vmlaq_lane_f32(_out0.val[0], _v, vget_high_f32(_k0), 0);\n                    // 3,5,7,9\n                    _out0.val[1] = vmlaq_lane_f32(_out0.val[1], _v, vget_high_f32(_k0), 1);\n                    vst2q_f32(outptr0 + 2, _out0);\n\n                    // row 1\n                    float32x4x2_t _out1 = vld2q_f32(outptr1);\n                    // 0,2,4,6\n                    _out1.val[0] = vmlaq_lane_f32(_out1.val[0], _v, vget_low_f32(_k1), 0);\n                    // 1,3,5,7\n                    _out1.val[1] = vmlaq_lane_f32(_out1.val[1], _v, vget_low_f32(_k1), 1);\n                    vst2q_f32(outptr1, _out1);\n\n                    _out1 = vld2q_f32(outptr1 + 2);\n                    // 2,4,6,8\n                    _out1.val[0] = vmlaq_lane_f32(_out1.val[0], _v, vget_high_f32(_k1), 0);\n                    // 3,5,7,9\n                    _out1.val[1] = vmlaq_lane_f32(_out1.val[1], _v, vget_high_f32(_k1), 1);\n                    vst2q_f32(outptr1 + 2, _out1);\n\n                    // row 2\n                    float32x4x2_t _out2 = vld2q_f32(outptr2);\n                    _out2.val[0] = vmlaq_lane_f32(_out2.val[0], _v, vget_low_f32(_k2), 0);\n                    _out2.val[1] = vmlaq_lane_f32(_out2.val[1], _v, vget_low_f32(_k2), 1);\n                    vst2q_f32(outptr2, _out2);\n\n                    _out2 = vld2q_f32(outptr2 + 2);\n                    _out2.val[0] = vmlaq_lane_f32(_out2.val[0], _v, vget_high_f32(_k2), 0);\n        
            _out2.val[1] = vmlaq_lane_f32(_out2.val[1], _v, vget_high_f32(_k2), 1);\n                    vst2q_f32(outptr2 + 2, _out2);\n\n                    // row 3\n                    float32x4x2_t _out3 = vld2q_f32(outptr3);\n                    _out3.val[0] = vmlaq_lane_f32(_out3.val[0], _v, vget_low_f32(_k3), 0);\n                    _out3.val[1] = vmlaq_lane_f32(_out3.val[1], _v, vget_low_f32(_k3), 1);\n                    vst2q_f32(outptr3, _out3);\n\n                    _out3 = vld2q_f32(outptr3 + 2);\n                    _out3.val[0] = vmlaq_lane_f32(_out3.val[0], _v, vget_high_f32(_k3), 0);\n                    _out3.val[1] = vmlaq_lane_f32(_out3.val[1], _v, vget_high_f32(_k3), 1);\n                    vst2q_f32(outptr3 + 2, _out3);\n\n                    r0 += 4;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                    outptr2 += 8;\n                    outptr3 += 8;\n                }\n\n#endif // __ARM_NEON\n\n                for (; j < w; j++)\n                {\n                    float val = r0[0];\n\n                    outptr0[0] += val * k0[0];\n                    outptr0[1] += val * k0[1];\n                    outptr0[2] += val * k0[2];\n                    outptr0[3] += val * k0[3];\n\n                    outptr1[0] += val * k1[0];\n                    outptr1[1] += val * k1[1];\n                    outptr1[2] += val * k1[2];\n                    outptr1[3] += val * k1[3];\n\n                    outptr2[0] += val * k2[0];\n                    outptr2[1] += val * k2[1];\n                    outptr2[2] += val * k2[2];\n                    outptr2[3] += val * k2[3];\n\n                    outptr3[0] += val * k3[0];\n                    outptr3[1] += val * k3[1];\n                    outptr3[2] += val * k3[2];\n                    outptr3[3] += val * k3[3];\n\n                    r0++;\n                    outptr0 += 2;\n                    outptr1 += 2;\n                    outptr2 += 2;\n                    
outptr3 += 2;\n                }\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/deconvolution_4x4_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void deconv4x4s2_fp16sa_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outch = top_blob.c;\n\n    const __fp16* kernel = _kernel;\n    const __fp16* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const __fp16 bias0 = bias ? bias[p] : 0.f;\n\n        out.fill(bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            const __fp16* img0 = bottom_blob.channel(q);\n\n            const __fp16* kernel0 = kernel + p * inch * 16 + q * 16;\n\n            const __fp16* r0 = img0;\n\n            const __fp16* k0 = kernel0;\n            const __fp16* k1 = kernel0 + 4;\n            const __fp16* k2 = kernel0 + 8;\n            const __fp16* k3 = kernel0 + 12;\n\n            float16x4_t _k0 = vld1_f16(k0);\n            float16x4_t _k1 = vld1_f16(k1);\n            float16x4_t _k2 = vld1_f16(k2);\n            float16x4_t _k3 = vld1_f16(k3);\n\n            for (int i = 0; i < h; i++)\n            {\n                __fp16* outptr = out.row<__fp16>(i * 2);\n\n                __fp16* outptr0 = outptr;\n                __fp16* outptr1 = outptr0 + outw;\n                __fp16* outptr2 = outptr1 + outw;\n                __fp16* outptr3 = outptr2 + outw;\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    float16x4_t _v = vld1_f16(r0);\n\n                    // row 0\n                    float16x4x2_t _out0 = vld2_f16(outptr0);\n                    // 0,2,4,6\n                    _out0.val[0] = vfma_lane_f16(_out0.val[0], _v, _k0, 0);\n                    // 1,3,5,7\n                    _out0.val[1] = 
vfma_lane_f16(_out0.val[1], _v, _k0, 1);\n                    vst2_f16(outptr0, _out0);\n\n                    _out0 = vld2_f16(outptr0 + 2);\n                    // 2,4,6,8\n                    _out0.val[0] = vfma_lane_f16(_out0.val[0], _v, _k0, 2);\n                    // 3,5,7,9\n                    _out0.val[1] = vfma_lane_f16(_out0.val[1], _v, _k0, 3);\n                    vst2_f16(outptr0 + 2, _out0);\n\n                    // row 1\n                    float16x4x2_t _out1 = vld2_f16(outptr1);\n                    // 0,2,4,6\n                    _out1.val[0] = vfma_lane_f16(_out1.val[0], _v, _k1, 0);\n                    // 1,3,5,7\n                    _out1.val[1] = vfma_lane_f16(_out1.val[1], _v, _k1, 1);\n                    vst2_f16(outptr1, _out1);\n\n                    _out1 = vld2_f16(outptr1 + 2);\n                    // 2,4,6,8\n                    _out1.val[0] = vfma_lane_f16(_out1.val[0], _v, _k1, 2);\n                    // 3,5,7,9\n                    _out1.val[1] = vfma_lane_f16(_out1.val[1], _v, _k1, 3);\n                    vst2_f16(outptr1 + 2, _out1);\n\n                    // row 2\n                    float16x4x2_t _out2 = vld2_f16(outptr2);\n                    _out2.val[0] = vfma_lane_f16(_out2.val[0], _v, _k2, 0);\n                    _out2.val[1] = vfma_lane_f16(_out2.val[1], _v, _k2, 1);\n                    vst2_f16(outptr2, _out2);\n\n                    _out2 = vld2_f16(outptr2 + 2);\n                    _out2.val[0] = vfma_lane_f16(_out2.val[0], _v, _k2, 2);\n                    _out2.val[1] = vfma_lane_f16(_out2.val[1], _v, _k2, 3);\n                    vst2_f16(outptr2 + 2, _out2);\n\n                    // row 3\n                    float16x4x2_t _out3 = vld2_f16(outptr3);\n                    _out3.val[0] = vfma_lane_f16(_out3.val[0], _v, _k3, 0);\n                    _out3.val[1] = vfma_lane_f16(_out3.val[1], _v, _k3, 1);\n                    vst2_f16(outptr3, _out3);\n\n                    _out3 = vld2_f16(outptr3 + 2);\n   
                 _out3.val[0] = vfma_lane_f16(_out3.val[0], _v, _k3, 2);\n                    _out3.val[1] = vfma_lane_f16(_out3.val[1], _v, _k3, 3);\n                    vst2_f16(outptr3 + 2, _out3);\n\n                    r0 += 4;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                    outptr2 += 8;\n                    outptr3 += 8;\n                }\n                for (; j < w; j++)\n                {\n                    __fp16 val = r0[0];\n\n                    outptr0[0] += val * k0[0];\n                    outptr0[1] += val * k0[1];\n                    outptr0[2] += val * k0[2];\n                    outptr0[3] += val * k0[3];\n\n                    outptr1[0] += val * k1[0];\n                    outptr1[1] += val * k1[1];\n                    outptr1[2] += val * k1[2];\n                    outptr1[3] += val * k1[3];\n\n                    outptr2[0] += val * k2[0];\n                    outptr2[1] += val * k2[1];\n                    outptr2[2] += val * k2[2];\n                    outptr2[3] += val * k2[3];\n\n                    outptr3[0] += val * k3[0];\n                    outptr3[1] += val * k3[1];\n                    outptr3[2] += val * k3[2];\n                    outptr3[3] += val * k3[3];\n\n                    r0++;\n                    outptr0 += 2;\n                    outptr1 += 2;\n                    outptr2 += 2;\n                    outptr3 += 2;\n                }\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/deconvolution_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolution_arm.h\"\n\n#include \"layer_type.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"deconvolution_3x3.h\"\n#include \"deconvolution_4x4.h\"\n\nDeconvolution_arm::Deconvolution_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n\n    activation = 0;\n    gemm = 0;\n}\n\nint Deconvolution_arm::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    activation = create_activation_layer(activation_type, activation_params, opt);\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n    const int maxk = kernel_w * kernel_h;\n    int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n\n    if (opt.use_sgemm_convolution)\n    {\n        const int maxk = kernel_w * kernel_h;\n\n        gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n\n        ncnn::ParamDict pd;\n        pd.set(2, 1);                 // transA\n        pd.set(3, 0);                 // transB\n        pd.set(4, 1);                 // constantA\n        pd.set(5, 0);                 // constantB\n        pd.set(6, 1);                 // constantC\n        pd.set(7, maxk * num_output); // M = maxk*num_output\n        pd.set(8, 0);                 // N = size\n        pd.set(9, num_input);         // K = inch\n        pd.set(10, -1);               // constant_broadcast_type_C = null\n        pd.set(11, 0);                // output_N1M\n        pd.set(12, out_elempack);\n\n        gemm->load_param(pd);\n\n        // maxk-inch-outch to pa-maxk-outch/pa-inch\n        Mat tmp;\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n            tmp.create(maxk * num_output, num_input);\n\n            for (int p = 0; p < num_input; p += 1)\n            {\n                float* g00 = tmp.row(p);\n\n                for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        for (int i = 0; i < out_elempack; i++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + i).row(p);\n                            g00[0] = k00[k];\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n\n        ncnn::Mat weights[1];\n        weights[0] = tmp;\n\n        gemm->load_model(ModelBinFromMatArray(weights));\n\n        Option opt1 = opt;\n        opt1.use_fp16_storage = false;\n        gemm->create_pipeline(opt1);\n    }\n    else\n    {\n        Mat 
weight_data_transposed(weight_data.w);\n        {\n            float* pt = weight_data_transposed;\n            const float* p = weight_data;\n\n            for (int i = 0; i < num_input * num_output; i++)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    pt[maxk - 1 - k] = p[k];\n                }\n\n                p += maxk;\n                pt += maxk;\n            }\n        }\n\n        // src = kw-kh-inch-outch\n        // dst = pb-pa-kw-kh-inch/pa-outch/pb\n        Mat weight_data_r2 = weight_data_transposed.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)4u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            float* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n\n        // pack1\n        if (elempack == 1 && out_elempack == 1)\n        {\n            if (kernel_w == 3 && kernel_h == 3 && stride_w == 1 && stride_h == 1 && dilation_w == 1 && dilation_h == 1)\n            {\n                weight_data_tm = weight_data;\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && stride_w == 2 && stride_h == 2 && dilation_w == 1 && dilation_h == 1)\n            {\n                weight_data_tm = weight_data;\n     
       }\n            else if (kernel_w == 4 && kernel_h == 4 && stride_w == 1 && stride_h == 1 && dilation_w == 1 && dilation_h == 1)\n            {\n                weight_data_tm = weight_data;\n            }\n            else if (kernel_w == 4 && kernel_h == 4 && stride_w == 2 && stride_h == 2 && dilation_w == 1 && dilation_h == 1)\n            {\n                weight_data_tm = weight_data;\n            }\n            else\n            {\n                weight_data_tm = weight_data_transposed;\n            }\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Deconvolution_arm::destroy_pipeline(const Option& opt)\n{\n    if (activation)\n    {\n        activation->destroy_pipeline(opt);\n        delete activation;\n        activation = 0;\n    }\n\n    if (gemm)\n    {\n        gemm->destroy_pipeline(opt);\n        delete gemm;\n        gemm = 0;\n    }\n\n    return 0;\n}\n\nint Deconvolution_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Deconvolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int 
outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    int out_channels = num_output / out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, out_channels, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, out_channels, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    if (opt.use_sgemm_convolution)\n    {\n        // sgemm\n        Mat bottom_blob_2 = bottom_blob;\n        {\n            bottom_blob_2.w = bottom_blob.w * bottom_blob.h;\n            bottom_blob_2.h = 1;\n        }\n        Mat top_col2im;\n        Option opt_b = opt;\n        opt_b.blob_allocator = top_blob_bordered.allocator;\n        int ret = gemm->forward(bottom_blob_2, top_col2im, opt_b);\n        if (ret != 0)\n            return ret;\n\n        {\n            // col2im\n            const int gap = (outw * stride_h - w * stride_w) * out_elempack;\n\n#if __ARM_NEON\n            if (out_elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int p = 0; p < out_channels; p++)\n                {\n                    const float* sptr = top_col2im.row(p * maxk);\n                    Mat outm = top_blob_bordered.channel(p);\n\n                    if (bias_data.empty())\n                    {\n                        outm.fill(vdupq_n_f32(0.f));\n                   
 }\n                    else\n                    {\n                        outm.fill(vld1q_f32((const float*)bias_data + p * 4));\n                    }\n\n                    for (int u = 0; u < kernel_h; u++)\n                    {\n                        for (int v = 0; v < kernel_w; v++)\n                        {\n                            float* ptr = outm.row(dilation_h * u) + dilation_w * v * 4;\n\n                            for (int i = 0; i < h; i++)\n                            {\n                                for (int j = 0; j < w; j++)\n                                {\n                                    float32x4_t _val = vld1q_f32(ptr);\n                                    float32x4_t _s = vld1q_f32(sptr);\n                                    _val = vaddq_f32(_val, _s);\n                                    vst1q_f32(ptr, _val);\n\n                                    ptr += stride_w * 4;\n                                    sptr += 4;\n                                }\n\n                                ptr += gap;\n                            }\n                        }\n                    }\n                }\n            }\n#endif // __ARM_NEON\n\n            if (out_elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int p = 0; p < out_channels; p++)\n                {\n                    const float* sptr = top_col2im.row(p * maxk);\n                    Mat outm = top_blob_bordered.channel(p);\n\n                    const float bias = bias_data.empty() ? 
0.f : bias_data[p];\n                    outm.fill(bias);\n\n                    for (int u = 0; u < kernel_h; u++)\n                    {\n                        for (int v = 0; v < kernel_w; v++)\n                        {\n                            float* ptr = outm.row(dilation_h * u) + dilation_w * v;\n\n                            for (int i = 0; i < h; i++)\n                            {\n                                for (int j = 0; j < w; j++)\n                                {\n                                    ptr[0] += sptr[0];\n\n                                    ptr += stride_w;\n                                    sptr += 1;\n                                }\n\n                                ptr += gap;\n                            }\n                        }\n                    }\n                }\n            }\n        }\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob_bordered, opt);\n        }\n    }\n    else\n    {\n#if __ARM_NEON\n        if (elempack == 4 && out_elempack == 4)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                float* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32(((const float*)bias_data) + p * 4);\n                        }\n\n                        const float* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 
0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const float* sptr = m.row(sy) + sx * 4;\n\n                                    float32x4_t _val = vld1q_f32(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w0 = vld1q_f32(kptr + k * 16);\n                                    float32x4_t _w1 = vld1q_f32(kptr + k * 16 + 4);\n                                    float32x4_t _w2 = vld1q_f32(kptr + k * 16 + 8);\n                                    float32x4_t _w3 = vld1q_f32(kptr + k * 16 + 12);\n\n#if __aarch64__\n                                    _sum = vmlaq_laneq_f32(_sum, _w0, _val, 0);\n                                    _sum = vmlaq_laneq_f32(_sum, _w1, _val, 1);\n                                    _sum = vmlaq_laneq_f32(_sum, _w2, _val, 2);\n                                    _sum = vmlaq_laneq_f32(_sum, _w3, _val, 3);\n#else\n                                    _sum = vmlaq_lane_f32(_sum, _w0, vget_low_f32(_val), 0);\n                                    _sum = vmlaq_lane_f32(_sum, _w1, vget_low_f32(_val), 1);\n     
                               _sum = vmlaq_lane_f32(_sum, _w2, vget_high_f32(_val), 0);\n                                    _sum = vmlaq_lane_f32(_sum, _w3, vget_high_f32(_val), 1);\n#endif\n                                }\n                            }\n\n                            kptr += maxk * 16;\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1q_f32(outptr + j * 4, _sum);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n\n        if (elempack == 1 && out_elempack == 4)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                float* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32(((const float*)bias_data) + p * 4);\n                        }\n\n                        const float* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n        
                        const float* sptr = m.row(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float32x4_t _val = vdupq_n_f32(sptr[sx]);\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w = vld1q_f32(kptr + k * 4);\n\n                                    _sum = vmlaq_f32(_sum, _val, _w);\n                                }\n                            }\n\n                            kptr += maxk * 4;\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1q_f32(outptr + j * 4, _sum);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n\n        if (elempack == 4 && out_elempack == 1)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output; p++)\n            {\n                float* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const float* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        
for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const float* sptr = m.row(sy) + sx * 4;\n\n                                    float32x4_t _val = vld1q_f32(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w = vld1q_f32(kptr + k * 4);\n\n                                    float32x4_t _s4 = vmulq_f32(_val, _w);\n#if __aarch64__\n                                    sum += vaddvq_f32(_s4); // dot\n#else\n                                    float32x2_t _ss = vadd_f32(vget_low_f32(_s4), vget_high_f32(_s4));\n                                    _ss = vpadd_f32(_ss, _ss);\n                                    sum += vget_lane_f32(_ss, 0);\n#endif\n                                }\n                            }\n\n                            kptr += maxk * 4;\n                        }\n\n                        sum = activation_ss(sum, activation_type, 
activation_params);\n\n                        outptr[j] = sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1 && out_elempack == 1)\n        {\n            if (kernel_w == 3 && kernel_h == 3 && stride_w == 1 && stride_h == 1 && dilation_w == 1 && dilation_h == 1)\n            {\n                deconv3x3s1_neon(bottom_blob, top_blob_bordered, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob_bordered, opt);\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && stride_w == 2 && stride_h == 2 && dilation_w == 1 && dilation_h == 1)\n            {\n                deconv3x3s2_neon(bottom_blob, top_blob_bordered, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob_bordered, opt);\n                }\n            }\n            else if (kernel_w == 4 && kernel_h == 4 && stride_w == 1 && stride_h == 1 && dilation_w == 1 && dilation_h == 1)\n            {\n                deconv4x4s1_neon(bottom_blob, top_blob_bordered, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob_bordered, opt);\n                }\n            }\n            else if (kernel_w == 4 && kernel_h == 4 && stride_w == 2 && stride_h == 2 && dilation_w == 1 && dilation_h == 1)\n            {\n                deconv4x4s2_neon(bottom_blob, top_blob_bordered, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob_bordered, opt);\n                }\n            }\n            else\n            {\n                // num_output\n                #pragma omp parallel for num_threads(opt.num_threads)\n      
          for (int p = 0; p < num_output; p++)\n                {\n                    float* outptr = top_blob_bordered.channel(p);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                            {\n                                sum = bias_data[p];\n                            }\n\n                            const float* kptr = (const float*)weight_data_tm + maxk * channels * p;\n\n                            // channels\n                            for (int q = 0; q < channels; q++)\n                            {\n                                const Mat m = bottom_blob.channel(q);\n\n                                for (int y = 0; y < kernel_h; y++)\n                                {\n                                    int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                    if (sys < 0 || sys % stride_h != 0)\n                                        continue;\n\n                                    int sy = sys / stride_h;\n                                    if (sy >= h)\n                                        continue;\n\n                                    const float* sptr = m.row(sy);\n\n                                    for (int x = 0; x < kernel_w; x++)\n                                    {\n                                        int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                        if (sxs < 0 || sxs % stride_w != 0)\n                                            continue;\n\n                                        int sx = sxs / stride_w;\n                                        if (sx >= w)\n                                            continue;\n\n                                        float val = sptr[sx];\n\n                                        int 
k = y * kernel_w + x;\n\n                                        float w = kptr[k];\n\n                                        sum += val * w;\n                                    }\n                                }\n\n                                kptr += maxk;\n                            }\n\n                            sum = activation_ss(sum, activation_type, activation_params);\n\n                            outptr[j] = sum;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint Deconvolution_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.c * bottom_blob.elempack;\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.d * 1;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n#if NCNN_ARM82\n    if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_float16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n    if (opt.use_bf16_storage && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_bfloat16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_BF16\n\n    // 
weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    // transpose group-inch/group-outch/group-kh-kw to group-outch/group-inch/group-kh-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _kernel_h * _num_output * _num_input / 1, 4u, opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / 1;\n        const int inch_g = _num_input / 1;\n        const int maxk = _kernel_h * _kernel_w;\n\n        for (int g = 0; g < 1; g++)\n        {\n            // reorder weight from inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n#if NCNN_ARM82\n        if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_float16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n        if (opt.use_bf16_storage && 
bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_bfloat16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_BF16\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Deconvolution);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, output_pad_right);\n    pd.set(19, output_pad_bottom);\n    pd.set(20, output_w);\n    pd.set(21, output_h);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_transposed.w);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_transposed;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Deconvolution_arm::create_pipeline_bf16s(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n\n    Mat weight_data_transposed(weight_data.w);\n    {\n        float* pt = weight_data_transposed;\n        const float* p = weight_data;\n\n        for (int i = 0; i < num_input * num_output; i++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                pt[maxk - 1 - k] = p[k];\n            }\n\n            p += maxk;\n            pt += maxk;\n        }\n    }\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data_transposed.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)2u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            unsigned short* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = float32_to_bfloat16(k00[k]);\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Deconvolution_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // deconvolv with NxN kernel\n    // value = value + bias\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     
NCNN_LOGE(\"Deconvolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n#if __ARM_NEON\n    if (elempack == 4 && out_elempack == 4)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output / out_elempack; p++)\n            {\n                unsigned short* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32(((const float*)bias_data) + p * 4);\n                        }\n\n                        const unsigned short* kptr = 
weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const unsigned short* sptr = m.row<const unsigned short>(sy) + sx * 4;\n\n                                    float32x4_t _val = bfloat2float(vld1_u16(sptr));\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w0 = bfloat2float(vld1_u16(kptr + k * 16));\n                                    float32x4_t _w1 = bfloat2float(vld1_u16(kptr + k * 16 + 4));\n                                    float32x4_t _w2 = bfloat2float(vld1_u16(kptr + k * 16 + 8));\n                                    float32x4_t _w3 = bfloat2float(vld1_u16(kptr + k * 16 + 12));\n\n#if __aarch64__\n                                    _sum = vmlaq_laneq_f32(_sum, _w0, _val, 0);\n                                    _sum = vmlaq_laneq_f32(_sum, _w1, _val, 1);\n  
                                  _sum = vmlaq_laneq_f32(_sum, _w2, _val, 2);\n                                    _sum = vmlaq_laneq_f32(_sum, _w3, _val, 3);\n#else\n                                    _sum = vmlaq_lane_f32(_sum, _w0, vget_low_f32(_val), 0);\n                                    _sum = vmlaq_lane_f32(_sum, _w1, vget_low_f32(_val), 1);\n                                    _sum = vmlaq_lane_f32(_sum, _w2, vget_high_f32(_val), 0);\n                                    _sum = vmlaq_lane_f32(_sum, _w3, vget_high_f32(_val), 1);\n#endif\n                                }\n                            }\n\n                            kptr += maxk * 16;\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1_u16(outptr + j * 4, float2bfloat(_sum));\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output / out_elempack; p++)\n            {\n                unsigned short* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32(((const float*)bias_data) + p * 4);\n                        }\n\n                        const unsigned short* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < 
kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const unsigned short* sptr = m.row<const unsigned short>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float32x4_t _val = vdupq_n_f32(bfloat16_to_float32(sptr[sx]));\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w = bfloat2float(vld1_u16(kptr + k * 4));\n\n                                    _sum = vmlaq_f32(_sum, _val, _w);\n                                }\n                            }\n\n                            kptr += maxk * 4;\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1_u16(outptr + j * 4, float2bfloat(_sum));\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output / out_elempack; p++)\n            {\n              
  unsigned short* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const unsigned short* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const unsigned short* sptr = m.row<const unsigned short>(sy) + sx * 4;\n\n                                    float32x4_t _val = bfloat2float(vld1_u16(sptr));\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w = bfloat2float(vld1_u16(kptr + k * 4));\n\n                                  
  float32x4_t _s4 = vmulq_f32(_val, _w);\n#if __aarch64__\n                                    sum += vaddvq_f32(_s4); // dot\n#else\n                                    float32x2_t _ss = vadd_f32(vget_low_f32(_s4), vget_high_f32(_s4));\n                                    _ss = vpadd_f32(_ss, _ss);\n                                    sum += vget_lane_f32(_ss, 0);\n#endif\n                                }\n                            }\n\n                            kptr += maxk * 4;\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = float32_to_bfloat16(sum);\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n#endif // __ARM_NEON\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output; p++)\n            {\n                unsigned short* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const unsigned short* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                 
   continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const unsigned short* sptr = m.row<const unsigned short>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float val = bfloat16_to_float32(sptr[sx]);\n\n                                    int k = y * kernel_w + x;\n\n                                    float w = bfloat16_to_float32(kptr[k]);\n\n                                    sum += val * w;\n                                }\n                            }\n\n                            kptr += maxk;\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = float32_to_bfloat16(sum);\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/deconvolution_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTION_ARM_H\n#define LAYER_DECONVOLUTION_ARM_H\n\n#include \"deconvolution.h\"\n\nnamespace ncnn {\n\nclass Deconvolution_arm : public Deconvolution\n{\npublic:\n    Deconvolution_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Layer* activation;\n    Layer* gemm;\n\n    Mat weight_data_tm;\n\n    // fp16\n    Mat bias_data_fp16;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTION_ARM_H\n"
  },
  {
    "path": "src/layer/arm/deconvolution_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolution_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"deconvolution_4x4_fp16s.h\"\n#endif\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Deconvolution_arm::create_pipeline_fp16s(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n\n    if (opt.use_packing_layout)\n    {\n        elempack = opt.use_fp16_arithmetic && num_input % 8 == 0 ? 8 : num_input % 4 == 0 ? 4 : 1;\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n    }\n\n    if (opt.use_fp16_arithmetic && opt.use_sgemm_convolution)\n    {\n        const int maxk = kernel_w * kernel_h;\n\n        gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n\n        ncnn::ParamDict pd;\n        pd.set(2, 1);                 // transA\n        pd.set(3, 0);                 // transB\n        pd.set(4, 1);                 // constantA\n        pd.set(5, 0);                 // constantB\n        pd.set(6, 1);                 // constantC\n        pd.set(7, maxk * num_output); // M = maxk*num_output\n        pd.set(8, 0);                 // N = size\n        pd.set(9, num_input);         // K = inch\n        pd.set(10, -1);               // constant_broadcast_type_C = null\n        pd.set(11, 0);                // output_N1M\n        pd.set(12, out_elempack);\n\n        gemm->load_param(pd);\n\n        // maxk-inch-outch to pa-maxk-outch/pa-inch\n        Mat tmp;\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n            tmp.create(maxk * num_output, num_input);\n\n            for (int p = 0; p < num_input; p += 1)\n            {\n                
float* g00 = tmp.row(p);\n\n                for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        for (int i = 0; i < out_elempack; i++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + i).row(p);\n                            g00[0] = k00[k];\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n\n        ncnn::Mat weights[1];\n        weights[0] = tmp;\n\n        gemm->load_model(ModelBinFromMatArray(weights));\n\n        gemm->create_pipeline(opt);\n    }\n    else\n    {\n        Mat weight_data_transposed(weight_data.w);\n        {\n            float* pt = weight_data_transposed;\n            const float* p = weight_data;\n\n            for (int i = 0; i < num_input * num_output; i++)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    pt[maxk - 1 - k] = p[k];\n                }\n\n                p += maxk;\n                pt += maxk;\n            }\n        }\n\n        // src = kw-kh-inch-outch\n        // dst = pb-pa-kw-kh-inch/pa-outch/pb\n        Mat weight_data_r2 = weight_data_transposed.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)2u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            __fp16* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n        
                {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = (__fp16)k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 1 && opt.use_fp16_arithmetic)\n    {\n        if (kernel_w == 4 && kernel_h == 4 && stride_w == 2 && stride_h == 2 && dilation_w == 1 && dilation_h == 1)\n        {\n            ncnn::cast_float32_to_float16(weight_data, weight_data_tm, opt);\n        }\n    }\n\n    ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Deconvolution_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // deconvolv with NxN kernel\n    // value = value + bias\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Deconvolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = opt.use_packing_layout && num_output % 4 == 0 ? 
4 : 1;\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    if (elempack == 4 && out_elempack == 4)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output / out_elempack; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32(((const float*)bias_data) + p * 4);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= 
h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 4;\n\n                                    float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr));\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w0 = vcvt_f32_f16(vld1_f16(kptr + k * 16));\n                                    float32x4_t _w1 = vcvt_f32_f16(vld1_f16(kptr + k * 16 + 4));\n                                    float32x4_t _w2 = vcvt_f32_f16(vld1_f16(kptr + k * 16 + 8));\n                                    float32x4_t _w3 = vcvt_f32_f16(vld1_f16(kptr + k * 16 + 12));\n\n                                    _sum = vfmaq_laneq_f32(_sum, _w0, _val, 0);\n                                    _sum = vfmaq_laneq_f32(_sum, _w1, _val, 1);\n                                    _sum = vfmaq_laneq_f32(_sum, _w2, _val, 2);\n                                    _sum = vfmaq_laneq_f32(_sum, _w3, _val, 3);\n                                }\n                            }\n\n                            kptr += maxk * 16;\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1_f16(outptr + j * 4, vcvt_f16_f32(_sum));\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        {\n        
    // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output / out_elempack; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32(((const float*)bias_data) + p * 4);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const __fp16* sptr = m.row<const __fp16>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float32x4_t _val = vdupq_n_f32((float)sptr[sx]);\n\n     
                               int k = y * kernel_w + x;\n\n                                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr + k * 4));\n\n                                    _sum = vfmaq_f32(_sum, _val, _w);\n                                }\n                            }\n\n                            kptr += maxk * 4;\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1_f16(outptr + j * 4, vcvt_f16_f32(_sum));\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output / out_elempack; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    
continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 4;\n\n                                    float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr));\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr + k * 4));\n\n                                    float32x4_t _s4 = vmulq_f32(_val, _w);\n\n                                    sum += vaddvq_f32(_s4); // dot\n                                }\n                            }\n\n                            kptr += maxk * 4;\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = (__fp16)sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        
}\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const __fp16* sptr = m.row<const __fp16>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float val = (float)sptr[sx];\n\n                                    int k = y * kernel_w + x;\n\n                                    float w = (float)kptr[k];\n\n                                    sum += val * w;\n                                }\n                            }\n\n                            kptr += maxk;\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = (__fp16)sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    
if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint Deconvolution_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // deconvolve with NxN kernel\n    // value = value + bias\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Deconvolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 
4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    int out_channels = num_output / out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, out_channels, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, out_channels, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    if (opt.use_sgemm_convolution)\n    {\n        // sgemm\n        Mat bottom_blob_2 = bottom_blob;\n        {\n            bottom_blob_2.w = bottom_blob.w * bottom_blob.h;\n            bottom_blob_2.h = 1;\n        }\n        Mat top_col2im;\n        Option opt_b = opt;\n        opt_b.blob_allocator = top_blob_bordered.allocator;\n        int ret = gemm->forward(bottom_blob_2, top_col2im, opt_b);\n        if (ret != 0)\n            return ret;\n\n        {\n            // col2im\n            const int gap = (outw * stride_h - w * stride_w) * out_elempack;\n\n            if (out_elempack == 8)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int p = 0; p < out_channels; p++)\n                {\n                    const __fp16* sptr = top_col2im.row<const __fp16>(p * maxk);\n                    Mat outm = top_blob_bordered.channel(p);\n\n                    if (bias_data.empty())\n                    {\n                        outm.fill(vdupq_n_f16(0.f));\n                    }\n                    else\n                    {\n                        outm.fill(vld1q_f16((const __fp16*)bias_data_fp16 + p * 8));\n                    }\n\n                    for (int u = 0; u < kernel_h; u++)\n                    {\n                        for (int 
v = 0; v < kernel_w; v++)\n                        {\n                            __fp16* ptr = outm.row<__fp16>(dilation_h * u) + dilation_w * v * 8;\n\n                            for (int i = 0; i < h; i++)\n                            {\n                                for (int j = 0; j < w; j++)\n                                {\n                                    float16x8_t _val = vld1q_f16(ptr);\n                                    float16x8_t _s = vld1q_f16(sptr);\n                                    _val = vaddq_f16(_val, _s);\n                                    vst1q_f16(ptr, _val);\n\n                                    ptr += stride_w * 8;\n                                    sptr += 8;\n                                }\n\n                                ptr += gap;\n                            }\n                        }\n                    }\n                }\n            }\n\n            if (out_elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int p = 0; p < out_channels; p++)\n                {\n                    const __fp16* sptr = top_col2im.row<const __fp16>(p * maxk);\n                    Mat outm = top_blob_bordered.channel(p);\n\n                    if (bias_data.empty())\n                    {\n                        outm.fill(vdup_n_f16(0.f));\n                    }\n                    else\n                    {\n                        outm.fill(vld1_f16((const __fp16*)bias_data_fp16 + p * 4));\n                    }\n\n                    for (int u = 0; u < kernel_h; u++)\n                    {\n                        for (int v = 0; v < kernel_w; v++)\n                        {\n                            __fp16* ptr = outm.row<__fp16>(dilation_h * u) + dilation_w * v * 4;\n\n                            for (int i = 0; i < h; i++)\n                            {\n                                for (int j = 0; j < w; j++)\n                            
    {\n                                    float16x4_t _val = vld1_f16(ptr);\n                                    float16x4_t _s = vld1_f16(sptr);\n                                    _val = vadd_f16(_val, _s);\n                                    vst1_f16(ptr, _val);\n\n                                    ptr += stride_w * 4;\n                                    sptr += 4;\n                                }\n\n                                ptr += gap;\n                            }\n                        }\n                    }\n                }\n            }\n\n            if (out_elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int p = 0; p < out_channels; p++)\n                {\n                    const __fp16* sptr = top_col2im.row<const __fp16>(p * maxk);\n                    Mat outm = top_blob_bordered.channel(p);\n\n                    const __fp16 bias = bias_data_fp16.empty() ? 0.f : ((const __fp16*)bias_data_fp16)[p];\n                    outm.fill(bias);\n\n                    for (int u = 0; u < kernel_h; u++)\n                    {\n                        for (int v = 0; v < kernel_w; v++)\n                        {\n                            __fp16* ptr = outm.row<__fp16>(dilation_h * u) + dilation_w * v;\n\n                            for (int i = 0; i < h; i++)\n                            {\n                                for (int j = 0; j < w; j++)\n                                {\n                                    ptr[0] += sptr[0];\n\n                                    ptr += stride_w;\n                                    sptr += 1;\n                                }\n\n                                ptr += gap;\n                            }\n                        }\n                    }\n                }\n            }\n        }\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob_bordered, opt);\n        }\n   
 }\n    else\n    {\n        if (elempack == 8 && out_elempack == 8)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f16((const __fp16*)bias_data_fp16 + p * 8);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx 
* 8;\n\n                                    float16x8_t _val = vld1q_f16(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x8_t _w0 = vld1q_f16(kptr + k * 64);\n                                    float16x8_t _w1 = vld1q_f16(kptr + k * 64 + 8);\n                                    float16x8_t _w2 = vld1q_f16(kptr + k * 64 + 16);\n                                    float16x8_t _w3 = vld1q_f16(kptr + k * 64 + 24);\n                                    float16x8_t _w4 = vld1q_f16(kptr + k * 64 + 32);\n                                    float16x8_t _w5 = vld1q_f16(kptr + k * 64 + 40);\n                                    float16x8_t _w6 = vld1q_f16(kptr + k * 64 + 48);\n                                    float16x8_t _w7 = vld1q_f16(kptr + k * 64 + 56);\n\n                                    _sum = vfmaq_laneq_f16(_sum, _w0, _val, 0);\n                                    _sum = vfmaq_laneq_f16(_sum, _w1, _val, 1);\n                                    _sum = vfmaq_laneq_f16(_sum, _w2, _val, 2);\n                                    _sum = vfmaq_laneq_f16(_sum, _w3, _val, 3);\n                                    _sum = vfmaq_laneq_f16(_sum, _w4, _val, 4);\n                                    _sum = vfmaq_laneq_f16(_sum, _w5, _val, 5);\n                                    _sum = vfmaq_laneq_f16(_sum, _w6, _val, 6);\n                                    _sum = vfmaq_laneq_f16(_sum, _w7, _val, 7);\n                                }\n                            }\n\n                            kptr += maxk * 64;\n                        }\n\n                        _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                        vst1q_f16(outptr + j * 8, _sum);\n                    }\n\n                    outptr += outw * 8;\n                }\n            }\n        }\n\n        if (elempack == 1 && out_elempack == 8)\n        {\n            // num_output\n            #pragma 
omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f16((const __fp16*)bias_data_fp16 + p * 8);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const __fp16* sptr = m.row<const __fp16>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float16x8_t _val = vdupq_n_f16(sptr[sx]);\n\n                                    int k = y * 
kernel_w + x;\n\n                                    float16x8_t _w = vld1q_f16(kptr + k * 8);\n\n                                    _sum = vfmaq_f16(_sum, _val, _w);\n                                }\n                            }\n\n                            kptr += maxk * 8;\n                        }\n\n                        _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                        vst1q_f16(outptr + j * 8, _sum);\n                    }\n\n                    outptr += outw * 8;\n                }\n            }\n        }\n\n        if (elempack == 4 && out_elempack == 8)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f16((const __fp16*)bias_data_fp16 + p * 8);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n               
                 for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 4;\n\n                                    float16x4_t _val = vld1_f16(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x8_t _w0 = vld1q_f16(kptr + k * 32);\n                                    float16x8_t _w1 = vld1q_f16(kptr + k * 32 + 8);\n                                    float16x8_t _w2 = vld1q_f16(kptr + k * 32 + 16);\n                                    float16x8_t _w3 = vld1q_f16(kptr + k * 32 + 24);\n\n                                    _sum = vfmaq_lane_f16(_sum, _w0, _val, 0);\n                                    _sum = vfmaq_lane_f16(_sum, _w1, _val, 1);\n                                    _sum = vfmaq_lane_f16(_sum, _w2, _val, 2);\n                                    _sum = vfmaq_lane_f16(_sum, _w3, _val, 3);\n                                }\n                            }\n\n                            kptr += maxk * 32;\n                        }\n\n                        _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                        vst1q_f16(outptr + j * 8, _sum);\n                    }\n\n                    outptr += outw * 8;\n                }\n            }\n        }\n\n        if (elempack == 8 && out_elempack == 1)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n   
             __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 8;\n\n                                    float16x8_t _val = vld1q_f16(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x8_t _w = vld1q_f16(kptr + k * 8);\n\n                                    float16x8_t _s8 = vmulq_f16(_val, _w);\n\n   
                                 float16x4_t _s4 = vadd_f16(vget_low_f16(_s8), vget_high_f16(_s8));\n                                    sum += vaddvq_f32(vcvt_f32_f16(_s4)); // dot\n                                }\n                            }\n\n                            kptr += maxk * 8;\n                        }\n\n                        sum = activation_ss_f16(sum, activation_type, activation_params);\n\n                        outptr[j] = (__fp16)sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n\n        if (elempack == 8 && out_elempack == 4)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float16x4_t _sum = vdup_n_f16((__fp16)0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1_f16((const __fp16*)bias_data_fp16 + p * 4);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                 
               for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 8;\n\n                                    float16x8_t _val = vld1q_f16(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x4_t _w0 = vld1_f16(kptr + k * 32);\n                                    float16x4_t _w1 = vld1_f16(kptr + k * 32 + 4);\n                                    float16x4_t _w2 = vld1_f16(kptr + k * 32 + 8);\n                                    float16x4_t _w3 = vld1_f16(kptr + k * 32 + 12);\n                                    float16x4_t _w4 = vld1_f16(kptr + k * 32 + 16);\n                                    float16x4_t _w5 = vld1_f16(kptr + k * 32 + 20);\n                                    float16x4_t _w6 = vld1_f16(kptr + k * 32 + 24);\n                                    float16x4_t _w7 = vld1_f16(kptr + k * 32 + 28);\n\n                                    _sum = vfma_laneq_f16(_sum, _w0, _val, 0);\n                                    _sum = vfma_laneq_f16(_sum, _w1, _val, 1);\n                                    _sum = vfma_laneq_f16(_sum, _w2, _val, 2);\n                                    _sum = vfma_laneq_f16(_sum, _w3, _val, 3);\n                                    _sum = vfma_laneq_f16(_sum, _w4, _val, 4);\n                                    _sum = vfma_laneq_f16(_sum, _w5, _val, 5);\n                                    _sum = vfma_laneq_f16(_sum, _w6, _val, 6);\n                                    _sum = 
vfma_laneq_f16(_sum, _w7, _val, 7);\n                                }\n                            }\n\n                            kptr += maxk * 32;\n                        }\n\n                        _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                        vst1_f16(outptr + j * 4, _sum);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n\n        if (elempack == 4 && out_elempack == 4)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float16x4_t _sum = vdup_n_f16((__fp16)0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1_f16((const __fp16*)bias_data_fp16 + p * 4);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = 
(j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 4;\n\n                                    float16x4_t _val = vld1_f16(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x4_t _w0 = vld1_f16(kptr + k * 16);\n                                    float16x4_t _w1 = vld1_f16(kptr + k * 16 + 4);\n                                    float16x4_t _w2 = vld1_f16(kptr + k * 16 + 8);\n                                    float16x4_t _w3 = vld1_f16(kptr + k * 16 + 12);\n\n                                    _sum = vfma_lane_f16(_sum, _w0, _val, 0);\n                                    _sum = vfma_lane_f16(_sum, _w1, _val, 1);\n                                    _sum = vfma_lane_f16(_sum, _w2, _val, 2);\n                                    _sum = vfma_lane_f16(_sum, _w3, _val, 3);\n                                }\n                            }\n\n                            kptr += maxk * 16;\n                        }\n\n                        _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                        vst1_f16(outptr + j * 4, _sum);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n\n        if (elempack == 1 && out_elempack == 4)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n              
      for (int j = 0; j < outw; j++)\n                    {\n                        float16x4_t _sum = vdup_n_f16((__fp16)0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1_f16((const __fp16*)bias_data_fp16 + p * 4);\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const __fp16* sptr = m.row<const __fp16>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float16x4_t _val = vdup_n_f16(sptr[sx]);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x4_t _w = vld1_f16(kptr + k * 4);\n\n                                    _sum = vfma_f16(_sum, _val, _w);\n                                }\n                            }\n\n                            kptr 
+= maxk * 4;\n                        }\n\n                        _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                        vst1_f16(outptr + j * 4, _sum);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n\n        if (elempack == 4 && out_elempack == 1)\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < out_channels; p++)\n            {\n                __fp16* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const __fp16* kptr = weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                             
       int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 4;\n\n                                    float16x4_t _val = vld1_f16(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x4_t _w = vld1_f16(kptr + k * 4);\n\n                                    float16x4_t _s4 = vmul_f16(_val, _w);\n\n                                    sum += vaddvq_f32(vcvt_f32_f16(_s4)); // dot\n                                }\n                            }\n\n                            kptr += maxk * 4;\n                        }\n\n                        sum = activation_ss_f16(sum, activation_type, activation_params);\n\n                        outptr[j] = (__fp16)sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n\n        if (elempack == 1 && out_elempack == 1)\n        {\n            if (kernel_w == 4 && kernel_h == 4 && stride_w == 2 && stride_h == 2 && dilation_w == 1 && dilation_h == 1)\n            {\n                deconv4x4s2_fp16sa_neon(bottom_blob, top_blob_bordered, weight_data_tm, bias_data_fp16, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob_bordered, opt);\n                }\n            }\n            else\n            {\n                // num_output\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int p = 0; p < num_output; p++)\n                {\n                    __fp16* outptr = top_blob_bordered.channel(p);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if 
(bias_term)\n                            {\n                                sum = bias_data[p];\n                            }\n\n                            const __fp16* kptr = weight_data_tm.channel(p);\n\n                            // channels\n                            for (int q = 0; q < channels; q++)\n                            {\n                                const Mat m = bottom_blob.channel(q);\n\n                                for (int y = 0; y < kernel_h; y++)\n                                {\n                                    int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                    if (sys < 0 || sys % stride_h != 0)\n                                        continue;\n\n                                    int sy = sys / stride_h;\n                                    if (sy >= h)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy);\n\n                                    for (int x = 0; x < kernel_w; x++)\n                                    {\n                                        int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                        if (sxs < 0 || sxs % stride_w != 0)\n                                            continue;\n\n                                        int sx = sxs / stride_w;\n                                        if (sx >= w)\n                                            continue;\n\n                                        __fp16 val = sptr[sx];\n\n                                        int k = y * kernel_w + x;\n\n                                        __fp16 w = kptr[k];\n\n                                        sum += val * w;\n                                    }\n                                }\n\n                                kptr += maxk;\n                            }\n\n                            sum = activation_ss_f16(sum, activation_type, 
activation_params);\n\n                            outptr[j] = (__fp16)sum;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/deconvolutiondepthwise_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolutiondepthwise_arm.h\"\n\n#include \"layer_type.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nDeconvolutionDepthWise_arm::DeconvolutionDepthWise_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint DeconvolutionDepthWise_arm::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n    // create Deconvolution op for each group\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        int elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            elempack = channels % 4 == 0 ? 
4 : 1;\n        }\n#endif\n\n        Mat weight_data_transposed(weight_data.w);\n        {\n            float* pt = weight_data_transposed;\n            const float* p = weight_data;\n\n            for (int i = 0; i < (channels / group) * (num_output / group) * group; i++)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    pt[maxk - 1 - k] = p[k];\n                }\n\n                p += maxk;\n                pt += maxk;\n            }\n        }\n\n#if NCNN_BF16\n        if (opt.use_bf16_storage)\n        {\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                Mat weight_data_r2 = weight_data_transposed.reshape(maxk, group);\n                Mat weight_data_r2_packed;\n                convert_packing(weight_data_r2, weight_data_r2_packed, 4, opt);\n\n                ncnn::cast_float32_to_bfloat16(weight_data_r2_packed, weight_data_tm, opt);\n            }\n#endif // __ARM_NEON\n\n            if (elempack == 1)\n            {\n                ncnn::cast_float32_to_bfloat16(weight_data_transposed, weight_data_tm, opt);\n            }\n\n            if (opt.lightmode)\n                weight_data.release();\n\n            return 0;\n        }\n#endif // NCNN_BF16\n\n#if __ARM_NEON\n        // pack4\n        if (elempack == 4)\n        {\n            Mat weight_data_r2 = weight_data_transposed.reshape(maxk, group);\n            convert_packing(weight_data_r2, weight_data_tm, 4, opt);\n        }\n#endif // __ARM_NEON\n\n        // pack1\n        if (elempack == 1)\n        {\n            weight_data_tm = weight_data_transposed;\n        }\n    }\n    else\n    {\n        // group deconvolution\n        for (int i = 0; i < (int)group_ops.size(); i++)\n            delete group_ops[i];\n\n        group_ops.clear();\n\n        const int channels_g = channels / group;\n        const int num_output_g = num_output / group;\n\n        group_ops.resize(group);\n\n        for (int g = 0; g < group; 
g++)\n        {\n            Mat weight_data_g = weight_data.range(maxk * channels_g * num_output_g * g, maxk * channels_g * num_output_g).clone();\n            Mat bias_data_g;\n            if (bias_term)\n                bias_data_g = bias_data.range(num_output_g * g, num_output_g);\n\n            ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Deconvolution);\n\n            // set param\n            ncnn::ParamDict pd;\n            pd.set(0, num_output_g); // num_output\n            pd.set(1, kernel_w);\n            pd.set(11, kernel_h);\n            pd.set(2, dilation_w);\n            pd.set(12, dilation_h);\n            pd.set(3, stride_w);\n            pd.set(13, stride_h);\n            pd.set(4, 0);  // pad_w\n            pd.set(14, 0); // pad_h\n            pd.set(18, output_pad_right);\n            pd.set(19, output_pad_bottom);\n            pd.set(5, bias_term);\n            pd.set(6, maxk * channels_g * num_output_g); // weight_data_size\n            pd.set(9, activation_type);\n            pd.set(10, activation_params);\n\n            op->load_param(pd);\n\n            // set weights\n            if (bias_term)\n            {\n                ncnn::Mat weights[2];\n                weights[0] = weight_data_g;\n                weights[1] = bias_data_g;\n\n                op->load_model(ModelBinFromMatArray(weights));\n            }\n            else\n            {\n                ncnn::Mat weights[1];\n                weights[0] = weight_data_g;\n\n                op->load_model(ModelBinFromMatArray(weights));\n            }\n\n            op->create_pipeline(opt);\n\n            group_ops[g] = op;\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint DeconvolutionDepthWise_arm::destroy_pipeline(const Option& opt)\n{\n    for (int i = 0; i < (int)group_ops.size(); i++)\n    {\n        group_ops[i]->destroy_pipeline(opt);\n        delete group_ops[i];\n    }\n    group_ops.clear();\n\n    return 
0;\n}\n\nint DeconvolutionDepthWise_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    // convolve with NxN kernel\n    // value = value + bias\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int g = 0; g < channels; g++)\n            {\n                float* outptr = top_blob_bordered.channel(g);\n                const float* kptr = (const float*)weight_data_tm + maxk * g * 4;\n                const Mat m = bottom_blob.channel(g);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32((const float*)bias_data + g * 4);\n                        }\n\n                        for (int y = 0; y < kernel_h; y++)\n                        {\n                            int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                            if (sys < 0 || sys % stride_h != 0)\n                                continue;\n\n                            int sy = sys / stride_h;\n                            if (sy >= h)\n                                continue;\n\n                            for (int x = 0; x < kernel_w; x++)\n  
                          {\n                                int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                if (sxs < 0 || sxs % stride_w != 0)\n                                    continue;\n\n                                int sx = sxs / stride_w;\n                                if (sx >= w)\n                                    continue;\n\n                                const float* sptr = m.row(sy) + sx * 4;\n\n                                float32x4_t _val = vld1q_f32(sptr);\n\n                                int k = y * kernel_w + x;\n\n                                float32x4_t _w = vld1q_f32(kptr + k * 4);\n\n                                _sum = vmlaq_f32(_sum, _val, _w);\n                            }\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1q_f32(outptr + j * 4, _sum);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int g = 0; g < channels; g++)\n            {\n                float* outptr = top_blob_bordered.channel(g);\n                const float* kptr = (const float*)weight_data_tm + maxk * g;\n                const Mat m = bottom_blob.channel(g);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[g];\n                        }\n\n                        for (int y = 0; y < kernel_h; y++)\n                        {\n                            int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                            if (sys < 0 || 
sys % stride_h != 0)\n                                continue;\n\n                            int sy = sys / stride_h;\n                            if (sy >= h)\n                                continue;\n\n                            const float* sptr = m.row(sy);\n\n                            for (int x = 0; x < kernel_w; x++)\n                            {\n                                int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                if (sxs < 0 || sxs % stride_w != 0)\n                                    continue;\n\n                                int sx = sxs / stride_w;\n                                if (sx >= w)\n                                    continue;\n\n                                float val = sptr[sx];\n\n                                int k = y * kernel_w + x;\n\n                                float w = kptr[k];\n\n                                sum += val * w;\n                            }\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n    else\n    {\n        // group deconvolution\n        const int channels_g = channels * elempack / group;\n        const int num_output_g = num_output / group;\n\n        int g_elempack = 1;\n        int out_g_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            g_elempack = channels_g % 4 == 0 ? 4 : 1;\n            out_g_elempack = num_output_g % 4 == 0 ? 
4 : 1;\n        }\n#endif\n\n        // unpacking\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack == 4 && g_elempack == 1)\n        {\n            Option opt_p = opt;\n            opt_p.blob_allocator = opt.workspace_allocator;\n            convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_p);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        Mat top_blob_bordered_unpacked = top_blob_bordered;\n        if (out_g_elempack == 1 && out_elempack == 4)\n        {\n            top_blob_bordered_unpacked.create(outw, outh, num_output, out_elemsize / out_elempack, 1, opt.workspace_allocator);\n            if (top_blob_bordered_unpacked.empty())\n                return -100;\n        }\n\n        for (int g = 0; g < group; g++)\n        {\n            const Mat bottom_blob_g = bottom_blob_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n            Mat top_blob_bordered_g = top_blob_bordered_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n            const ncnn::Layer* op = group_ops[g];\n\n            Option opt_g = opt;\n            opt_g.blob_allocator = top_blob_bordered_unpacked.allocator;\n\n            // forward\n            int ret = op->forward(bottom_blob_g, top_blob_bordered_g, opt_g);\n            if (ret != 0)\n                return ret;\n        }\n\n        // packing\n        if (out_g_elempack == 1 && out_elempack == 4)\n        {\n            convert_packing(top_blob_bordered_unpacked, top_blob_bordered, 4, opt);\n            if (top_blob_bordered.empty())\n                return -100;\n        }\n        else\n        {\n            top_blob_bordered = top_blob_bordered_unpacked;\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint DeconvolutionDepthWise_arm::forward(const std::vector<Mat>& bottom_blobs, 
std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.c * bottom_blob.elempack;\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.d * group;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n#if NCNN_ARM82\n    if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_float16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n    if (opt.use_bf16_storage && weight_data_flattened.elembits() == 16)\n    {\n        Mat weight_data_flattened_fp32;\n        cast_bfloat16_to_float32(weight_data_flattened, weight_data_flattened_fp32, opt);\n        weight_data_flattened = weight_data_flattened_fp32;\n    }\n#endif // NCNN_BF16\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    // transpose group-inch/group-outch/group-kh-kw to group-outch/group-inch/group-kh-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _kernel_h * _num_output * _num_input / group, 4u, opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / group;\n        const int inch_g = _num_input / group;\n        const int maxk = _kernel_h * _kernel_w;\n\n        for (int g = 0; g < group; g++)\n        {\n            // reorder weight from 
inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n#if NCNN_ARM82\n        if (opt.use_fp16_storage && cpu_support_arm_asimdhp() && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_float16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_ARM82\n#if NCNN_BF16\n        if (opt.use_bf16_storage && bias_data_flattened.elembits() == 16)\n        {\n            Mat bias_data_flattened_fp32;\n            cast_bfloat16_to_float32(bias_data_flattened, bias_data_flattened_fp32, opt);\n            bias_data_flattened = bias_data_flattened_fp32;\n        }\n#endif // NCNN_BF16\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::DeconvolutionDepthWise);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    
pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, output_pad_right);\n    pd.set(19, output_pad_bottom);\n    pd.set(20, output_w);\n    pd.set(21, output_h);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_transposed.w);\n    pd.set(7, group);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_transposed;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    // keep the status so a failed forward is not reported as success\n    int ret = op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return ret;\n}\n\n#if NCNN_BF16\nint DeconvolutionDepthWise_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int g = 0; g < channels; g++)\n            {\n                unsigned short* outptr = top_blob_bordered.channel(g);\n                const unsigned short* kptr = (const unsigned short*)weight_data_tm + maxk * g * 4;\n                const Mat m = bottom_blob.channel(g);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float32x4_t _sum = vdupq_n_f32(0.f);\n\n                        if (bias_term)\n                        {\n                            _sum = vld1q_f32((const float*)bias_data + g * 4);\n                        }\n\n                        for (int y = 0; y < kernel_h; y++)\n                        {\n                            int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                            if (sys < 0 || sys % stride_h != 0)\n                                continue;\n\n                            int sy = sys / stride_h;\n                            if (sy >= h)\n                                continue;\n\n                            for (int x 
= 0; x < kernel_w; x++)\n                            {\n                                int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                if (sxs < 0 || sxs % stride_w != 0)\n                                    continue;\n\n                                int sx = sxs / stride_w;\n                                if (sx >= w)\n                                    continue;\n\n                                const unsigned short* sptr = m.row<const unsigned short>(sy) + sx * 4;\n\n                                float32x4_t _val = bfloat2float(vld1_u16(sptr));\n\n                                int k = y * kernel_w + x;\n\n                                float32x4_t _w = bfloat2float(vld1_u16(kptr + k * 4));\n\n                                _sum = vmlaq_f32(_sum, _val, _w);\n                            }\n                        }\n\n                        _sum = activation_ps(_sum, activation_type, activation_params);\n\n                        vst1_u16(outptr + j * 4, float2bfloat(_sum));\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int g = 0; g < channels; g++)\n            {\n                unsigned short* outptr = top_blob_bordered.channel(g);\n                const unsigned short* kptr = (const unsigned short*)weight_data_tm + maxk * g;\n                const Mat m = bottom_blob.channel(g);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[g];\n                        }\n\n                        for (int y = 0; y < kernel_h; y++)\n                        {\n       
                     int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                            if (sys < 0 || sys % stride_h != 0)\n                                continue;\n\n                            int sy = sys / stride_h;\n                            if (sy >= h)\n                                continue;\n\n                            const unsigned short* sptr = m.row<const unsigned short>(sy);\n\n                            for (int x = 0; x < kernel_w; x++)\n                            {\n                                int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                if (sxs < 0 || sxs % stride_w != 0)\n                                    continue;\n\n                                int sx = sxs / stride_w;\n                                if (sx >= w)\n                                    continue;\n\n                                float val = bfloat16_to_float32(sptr[sx]);\n\n                                int k = y * kernel_w + x;\n\n                                float w = bfloat16_to_float32(kptr[k]);\n\n                                sum += val * w;\n                            }\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = float32_to_bfloat16(sum);\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n    else\n    {\n        // group deconvolution\n        const int channels_g = channels * elempack / group;\n        const int num_output_g = num_output / group;\n\n        int g_elempack = 1;\n        int out_g_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            g_elempack = channels_g % 4 == 0 ? 4 : 1;\n            out_g_elempack = num_output_g % 4 == 0 ? 
4 : 1;\n        }\n#endif\n\n        // unpacking\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack == 4 && g_elempack == 1)\n        {\n            Option opt_p = opt;\n            opt_p.blob_allocator = opt.workspace_allocator;\n            convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_p);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        Mat top_blob_bordered_unpacked = top_blob_bordered;\n        if (out_g_elempack == 1 && out_elempack == 4)\n        {\n            top_blob_bordered_unpacked.create(outw, outh, num_output, out_elemsize / out_elempack, 1, opt.workspace_allocator);\n            if (top_blob_bordered_unpacked.empty())\n                return -100;\n        }\n\n        for (int g = 0; g < group; g++)\n        {\n            const Mat bottom_blob_g = bottom_blob_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n            Mat top_blob_bordered_g = top_blob_bordered_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n            const ncnn::Layer* op = group_ops[g];\n\n            Option opt_g = opt;\n            opt_g.blob_allocator = top_blob_bordered_unpacked.allocator;\n\n            // forward\n            int ret = op->forward(bottom_blob_g, top_blob_bordered_g, opt_g);\n            if (ret != 0)\n                return ret;\n        }\n\n        // packing\n        if (out_g_elempack == 1 && out_elempack == 4)\n        {\n            convert_packing(top_blob_bordered_unpacked, top_blob_bordered, 4, opt);\n            if (top_blob_bordered.empty())\n                return -100;\n        }\n        else\n        {\n            top_blob_bordered = top_blob_bordered_unpacked;\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/deconvolutiondepthwise_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTIONDEPTHWISE_ARM_H\n#define LAYER_DECONVOLUTIONDEPTHWISE_ARM_H\n\n#include \"deconvolutiondepthwise.h\"\n\nnamespace ncnn {\n\nclass DeconvolutionDepthWise_arm : public DeconvolutionDepthWise\n{\npublic:\n    DeconvolutionDepthWise_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    std::vector<ncnn::Layer*> group_ops;\n\n    Mat weight_data_tm;\n\n    // fp16\n    Mat bias_data_fp16;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTIONDEPTHWISE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/deconvolutiondepthwise_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolutiondepthwise_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint DeconvolutionDepthWise_arm::create_pipeline_fp16s(const Option& opt)\n{\n    // create Deconvolution op for each group\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        Mat weight_data_transposed(weight_data.w);\n        {\n            float* pt = weight_data_transposed;\n            const float* p = weight_data;\n\n            for (int i = 0; i < (channels / group) * (num_output / group) * group; i++)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    pt[maxk - 1 - k] = p[k];\n                }\n\n                p += maxk;\n                pt += maxk;\n            }\n        }\n\n        int elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            elempack = opt.use_fp16_arithmetic && channels % 8 == 0 ? 8 : channels % 4 == 0 ? 
4 : 1;\n        }\n\n        if (elempack == 8)\n        {\n            Mat weight_data_r2 = weight_data_transposed.reshape(maxk, group);\n            Mat weight_data_r2_packed;\n            convert_packing(weight_data_r2, weight_data_r2_packed, 8, opt);\n\n            ncnn::cast_float32_to_float16(weight_data_r2_packed, weight_data_tm, opt);\n        }\n\n        if (elempack == 4)\n        {\n            Mat weight_data_r2 = weight_data_transposed.reshape(maxk, group);\n            Mat weight_data_r2_packed;\n            convert_packing(weight_data_r2, weight_data_r2_packed, 4, opt);\n\n            ncnn::cast_float32_to_float16(weight_data_r2_packed, weight_data_tm, opt);\n        }\n\n        if (elempack == 1)\n        {\n            ncnn::cast_float32_to_float16(weight_data_transposed, weight_data_tm, opt);\n        }\n\n        ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n    }\n    else\n    {\n        // group deconvolution\n        for (int i = 0; i < (int)group_ops.size(); i++)\n            delete group_ops[i];\n\n        group_ops.clear();\n\n        const int channels_g = channels / group;\n        const int num_output_g = num_output / group;\n\n        group_ops.resize(group);\n\n        for (int g = 0; g < group; g++)\n        {\n            Mat weight_data_g = weight_data.range(maxk * channels_g * num_output_g * g, maxk * channels_g * num_output_g).clone();\n            Mat bias_data_g;\n            if (bias_term)\n                bias_data_g = bias_data.range(num_output_g * g, num_output_g);\n\n            ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Deconvolution);\n\n            // set param\n            ncnn::ParamDict pd;\n            pd.set(0, num_output_g); // num_output\n            pd.set(1, kernel_w);\n            pd.set(11, kernel_h);\n            pd.set(2, dilation_w);\n            pd.set(12, dilation_h);\n            pd.set(3, stride_w);\n            pd.set(13, stride_h);\n            pd.set(4, 0);  // 
pad_w\n            pd.set(14, 0); // pad_h\n            pd.set(18, output_pad_right);\n            pd.set(19, output_pad_bottom);\n            pd.set(5, bias_term);\n            pd.set(6, maxk * channels_g * num_output_g); // weight_data_size\n            pd.set(9, activation_type);\n            pd.set(10, activation_params);\n\n            op->load_param(pd);\n\n            // set weights\n            if (bias_term)\n            {\n                ncnn::Mat weights[2];\n                weights[0] = weight_data_g;\n                weights[1] = bias_data_g;\n\n                op->load_model(ModelBinFromMatArray(weights));\n            }\n            else\n            {\n                ncnn::Mat weights[1];\n                weights[0] = weight_data_g;\n\n                op->load_model(ModelBinFromMatArray(weights));\n            }\n\n            op->create_pipeline(opt);\n\n            group_ops[g] = op;\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint DeconvolutionDepthWise_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = (opt.use_packing_layout && num_output % 4 == 0) ? 
4 : 1;\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n        if (elempack == 4)\n        {\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob_bordered.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g * 4;\n                    const Mat m = bottom_blob.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n\n                            if (bias_term)\n                            {\n                                _sum = vld1q_f32((const float*)bias_data + g * 4);\n                            }\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    
continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 4;\n\n                                    float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr));\n\n                                    int k = y * kernel_w + x;\n\n                                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr + k * 4));\n\n                                    _sum = vfmaq_f32(_sum, _val, _w);\n                                }\n                            }\n\n                            _sum = activation_ps(_sum, activation_type, activation_params);\n\n                            vst1_f16(outptr + j * 4, vcvt_f16_f32(_sum));\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n\n        if (elempack == 1)\n        {\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob_bordered.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                            {\n                
                sum = bias_data[g];\n                            }\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const __fp16* sptr = m.row<const __fp16>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    float val = (float)sptr[sx];\n\n                                    int k = y * kernel_w + x;\n\n                                    float w = (float)kptr[k];\n\n                                    sum += val * w;\n                                }\n                            }\n\n                            sum = activation_ss(sum, activation_type, activation_params);\n\n                            outptr[j] = (__fp16)sum;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n    else\n    {\n        // group deconvolution\n        const int channels_g = channels * elempack / group;\n        const int num_output_g = num_output / group;\n\n        int g_elempack = (opt.use_packing_layout && channels_g % 4 == 0) ? 
4 : 1;\n        int out_g_elempack = (opt.use_packing_layout && num_output_g % 4 == 0) ? 4 : 1;\n\n        // unpacking\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack == 4 && g_elempack == 1)\n        {\n            Option opt_p = opt;\n            opt_p.blob_allocator = opt.workspace_allocator;\n            convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_p);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        Mat top_blob_bordered_unpacked = top_blob_bordered;\n        if (out_g_elempack == 1 && out_elempack == 4)\n        {\n            top_blob_bordered_unpacked.create(outw, outh, num_output, out_elemsize / out_elempack, 1, opt.workspace_allocator);\n            if (top_blob_bordered_unpacked.empty())\n                return -100;\n        }\n\n        for (int g = 0; g < group; g++)\n        {\n            const Mat bottom_blob_g = bottom_blob_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n            Mat top_blob_bordered_g = top_blob_bordered_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n            const ncnn::Layer* op = group_ops[g];\n\n            Option opt_g = opt;\n            opt_g.blob_allocator = top_blob_bordered_unpacked.allocator;\n\n            // forward\n            int ret = op->forward(bottom_blob_g, top_blob_bordered_g, opt_g);\n            if (ret != 0)\n                return ret;\n        }\n\n        // packing\n        if (out_g_elempack == 1 && out_elempack == 4)\n        {\n            convert_packing(top_blob_bordered_unpacked, top_blob_bordered, 4, opt);\n            if (top_blob_bordered.empty())\n                return -100;\n        }\n        else\n        {\n            top_blob_bordered = top_blob_bordered_unpacked;\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint 
DeconvolutionDepthWise_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n        if (elempack == 8)\n        {\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob_bordered.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g * 8;\n                    const Mat m = bottom_blob.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; 
j < outw; j++)\n                        {\n                            float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n\n                            if (bias_term)\n                            {\n                                _sum = vld1q_f16((const __fp16*)bias_data_fp16 + g * 8);\n                            }\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 8;\n\n                                    float16x8_t _val = vld1q_f16(sptr);\n\n                                    int k = y * kernel_w + x;\n\n                                    float16x8_t _w = vld1q_f16(kptr + k * 8);\n\n                                    _sum = vfmaq_f16(_sum, _val, _w);\n                                }\n                            }\n\n                            _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                            vst1q_f16(outptr + j * 8, _sum);\n                        }\n\n                        outptr += outw * 8;\n                    }\n                }\n          
  }\n        }\n\n        if (elempack == 4)\n        {\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob_bordered.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g * 4;\n                    const Mat m = bottom_blob.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float16x4_t _sum = vdup_n_f16((__fp16)0.f);\n\n                            if (bias_term)\n                            {\n                                _sum = vld1_f16((const __fp16*)bias_data_fp16 + g * 4);\n                            }\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const __fp16* sptr = m.row<const __fp16>(sy) + sx * 4;\n\n                                    float16x4_t _val = vld1_f16(sptr);\n\n                             
       int k = y * kernel_w + x;\n\n                                    float16x4_t _w = vld1_f16(kptr + k * 4);\n\n                                    _sum = vfma_f16(_sum, _val, _w);\n                                }\n                            }\n\n                            _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                            vst1_f16(outptr + j * 4, _sum);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n\n        if (elempack == 1)\n        {\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    __fp16* outptr = top_blob_bordered.channel(g);\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                            {\n                                sum = bias_data[g];\n                            }\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const __fp16* sptr = m.row<const __fp16>(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                           
         int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    __fp16 val = sptr[sx];\n\n                                    int k = y * kernel_w + x;\n\n                                    __fp16 w = kptr[k];\n\n                                    sum += val * w;\n                                }\n                            }\n\n                            sum = activation_ss_f16(sum, activation_type, activation_params);\n\n                            outptr[j] = (__fp16)sum;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n    else\n    {\n        // group deconvolution\n        const int channels_g = channels * elempack / group;\n        const int num_output_g = num_output / group;\n\n        int g_elempack = 1;\n        int out_g_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            g_elempack = opt.use_fp16_arithmetic && channels_g % 8 == 0 ? 8 : channels_g % 4 == 0 ? 4 : 1;\n            out_g_elempack = opt.use_fp16_arithmetic && num_output_g % 8 == 0 ? 8 : num_output_g % 4 == 0 ? 
4 : 1;\n        }\n\n        // unpacking\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > g_elempack)\n        {\n            Option opt_p = opt;\n            opt_p.blob_allocator = opt.workspace_allocator;\n            convert_packing(bottom_blob, bottom_blob_unpacked, g_elempack, opt_p);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        Mat top_blob_bordered_unpacked = top_blob_bordered;\n        if (out_g_elempack < out_elempack)\n        {\n            top_blob_bordered_unpacked.create(outw, outh, num_output / out_g_elempack, out_elemsize / out_elempack * out_g_elempack, out_g_elempack, opt.workspace_allocator);\n            if (top_blob_bordered_unpacked.empty())\n                return -100;\n        }\n\n        for (int g = 0; g < group; g++)\n        {\n            const Mat bottom_blob_g = bottom_blob_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n            Mat top_blob_bordered_g = top_blob_bordered_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n            const ncnn::Layer* op = group_ops[g];\n\n            Option opt_g = opt;\n            opt_g.blob_allocator = top_blob_bordered_unpacked.allocator;\n\n            // forward\n            int ret = op->forward(bottom_blob_g, top_blob_bordered_g, opt_g);\n            if (ret != 0)\n                return ret;\n        }\n\n        // packing\n        if (out_g_elempack < out_elempack)\n        {\n            convert_packing(top_blob_bordered_unpacked, top_blob_bordered, out_elempack, opt);\n            if (top_blob_bordered.empty())\n                return -100;\n        }\n        else\n        {\n            top_blob_bordered = top_blob_bordered_unpacked;\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // 
namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/dequantize_arm.cpp",
    "content": "// Copyright 2018 Tencent\n// Copyright 2019 BUG1989\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"dequantize_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nDequantize_arm::Dequantize_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nstatic void dequantize(const int* intptr, float* ptr, const Mat& scale_data, const Mat& bias_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int bias_data_size = bias_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"dequantize %d %d   %d %d\", scale_data_size, bias_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n#if __ARM_NEON\n    float32x4_t _scale0 = vdupq_n_f32(scale);\n    float32x4_t _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        if (elempack == 4)\n        {\n            _scale0 = vld1q_f32((const float*)scale_data);\n            _scale1 = _scale0;\n        }\n        if (elempack == 8)\n        {\n            _scale0 = vld1q_f32((const float*)scale_data);\n            _scale1 = vld1q_f32((const float*)scale_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n            _v0 = vmulq_f32(_v0, _scale0);\n            _v1 = vmulq_f32(_v1, _scale1);\n            vst1q_f32(ptr, _v0);\n            vst1q_f32(ptr + 4, _v1);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = 
vcvtq_f32_s32(vld1q_s32(intptr));\n            _v = vmulq_f32(_v, _scale0);\n            vst1q_f32(ptr, _v);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = *intptr * scale;\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __ARM_NEON\n        float32x4_t _bias0 = vdupq_n_f32(bias);\n        float32x4_t _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 4)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = _bias0;\n            }\n            if (elempack == 8)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = vld1q_f32((const float*)bias_data + 4);\n            }\n        }\n#endif // __ARM_NEON\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n#if __aarch64__\n            _v0 = vfmaq_f32(_bias0, _v0, _scale0);\n            _v1 = vfmaq_f32(_bias1, _v1, _scale1);\n#else\n            _v0 = vmlaq_f32(_bias0, _v0, _scale0);\n            _v1 = vmlaq_f32(_bias1, _v1, _scale1);\n#endif\n            vst1q_f32(ptr, _v0);\n            vst1q_f32(ptr + 4, _v1);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n#if __aarch64__\n            _v = vfmaq_f32(_bias0, _v, _scale0);\n#else\n            _v = vmlaq_f32(_bias0, _v, _scale0);\n#endif\n            vst1q_f32(ptr, _v);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = *intptr * scale + bias;\n            intptr++;\n            ptr++;\n     
   }\n    }\n}\n\nint Dequantize_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // assert bottom_blob.elembits() == 32\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 1)\n    {\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const int* intptr = (const int*)bottom_blob + i * elempack;\n            float* ptr = (float*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n            // assert bias_data_size == 0 || bias_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            dequantize(intptr, ptr, scale_data, bias_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            const int* intptr = bottom_blob.row<const int>(i);\n            float* ptr = top_blob.row(i);\n\n            const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n            const Mat bias_data_i = bias_data_size > 1 ? 
bias_data.range(i * elempack, elempack) : bias_data;\n\n            dequantize(intptr, ptr, scale_data_i, bias_data_i, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int* intptr = bottom_blob.channel(q);\n            float* ptr = top_blob.channel(q);\n\n            const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n            const Mat bias_data_q = bias_data_size > 1 ? bias_data.range(q * elempack, elempack) : bias_data;\n\n            dequantize(intptr, ptr, scale_data_q, bias_data_q, w * h, elempack);\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic void dequantize_bf16s(const int* intptr, unsigned short* ptr, const Mat& scale_data, const Mat& bias_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int bias_data_size = bias_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"dequantize_bf16s %d %d   %d %d\", scale_data_size, bias_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n#if __ARM_NEON\n    float32x4_t _scale0 = vdupq_n_f32(scale);\n    float32x4_t _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        if (elempack == 4)\n        {\n            _scale0 = vld1q_f32((const float*)scale_data);\n            _scale1 = _scale0;\n        }\n        if (elempack == 8)\n        {\n            _scale0 = vld1q_f32((const float*)scale_data);\n            _scale1 = vld1q_f32((const float*)scale_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n            _v0 = vmulq_f32(_v0, _scale0);\n            _v1 
= vmulq_f32(_v1, _scale1);\n            vst1q_u16(ptr, vcombine_u16(float2bfloat(_v0), float2bfloat(_v1)));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n            _v = vmulq_f32(_v, _scale0);\n            vst1_u16(ptr, float2bfloat(_v));\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = float32_to_bfloat16(*intptr * scale);\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __ARM_NEON\n        float32x4_t _bias0 = vdupq_n_f32(bias);\n        float32x4_t _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 4)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = _bias0;\n            }\n            if (elempack == 8)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = vld1q_f32((const float*)bias_data + 4);\n            }\n        }\n#endif // __ARM_NEON\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n#if __aarch64__\n            _v0 = vfmaq_f32(_bias0, _v0, _scale0);\n            _v1 = vfmaq_f32(_bias1, _v1, _scale1);\n#else\n            _v0 = vmlaq_f32(_bias0, _v0, _scale0);\n            _v1 = vmlaq_f32(_bias1, _v1, _scale1);\n#endif\n            vst1q_u16(ptr, vcombine_u16(float2bfloat(_v0), float2bfloat(_v1)));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n#if __aarch64__\n            _v = vfmaq_f32(_bias0, _v, _scale0);\n#else\n            _v = 
vmlaq_f32(_bias0, _v, _scale0);\n#endif\n            vst1_u16(ptr, float2bfloat(_v));\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = float32_to_bfloat16(*intptr * scale + bias);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nint Dequantize_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n    const size_t out_elemsize = elempack * 2u;\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const int* intptr = (const int*)bottom_blob + i * elempack;\n            unsigned short* ptr = (unsigned short*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n            // assert bias_data_size == 0 || bias_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            dequantize_bf16s(intptr, ptr, scale_data, bias_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            const int* intptr = bottom_blob.row<const int>(i);\n            unsigned short* ptr = top_blob.row<unsigned short>(i);\n\n            const Mat scale_data_i = scale_data_size > 1 ? 
scale_data.range(i * elempack, elempack) : scale_data;\n            const Mat bias_data_i = bias_data_size > 1 ? bias_data.range(i * elempack, elempack) : bias_data;\n\n            dequantize_bf16s(intptr, ptr, scale_data_i, bias_data_i, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int* intptr = bottom_blob.channel(q);\n            unsigned short* ptr = top_blob.channel(q);\n\n            const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n            const Mat bias_data_q = bias_data_size > 1 ? bias_data.range(q * elempack, elempack) : bias_data;\n\n            dequantize_bf16s(intptr, ptr, scale_data_q, bias_data_q, w * h, elempack);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/dequantize_arm.h",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DEQUANTIZE_ARM_H\n#define LAYER_DEQUANTIZE_ARM_H\n\n#include \"dequantize.h\"\n\nnamespace ncnn {\n\nclass Dequantize_arm : public Dequantize\n{\npublic:\n    Dequantize_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DEQUANTIZE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/dequantize_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"dequantize_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic void dequantize_fp16s(const int* intptr, __fp16* ptr, const Mat& scale_data, const Mat& bias_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int bias_data_size = bias_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"dequantize_fp16s %d %d   %d %d\", scale_data_size, bias_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n    float32x4_t _scale0 = vdupq_n_f32(scale);\n    float32x4_t _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale0 = vld1q_f32((const float*)scale_data);\n            _scale1 = vld1q_f32((const float*)scale_data + 4);\n        }\n        if (elempack == 4)\n        {\n            _scale0 = vld1q_f32((const float*)scale_data);\n            _scale1 = _scale0;\n        }\n    }\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n            _v0 = vmulq_f32(_v0, _scale0);\n            _v1 = vmulq_f32(_v1, _scale1);\n            vst1q_f16(ptr, vcombine_f16(vcvt_f16_f32(_v0), vcvt_f16_f32(_v1)));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n            _v = vmulq_f32(_v, _scale0);\n            vst1_f16(ptr, vcvt_f16_f32(_v));\n            intptr += 4;\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            *ptr = (__fp16)(*intptr * scale);\n            intptr++;\n            
ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n        float32x4_t _bias0 = vdupq_n_f32(bias);\n        float32x4_t _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 8)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = vld1q_f32((const float*)bias_data + 4);\n            }\n            if (elempack == 4)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = _bias0;\n            }\n        }\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n            _v0 = vfmaq_f32(_bias0, _v0, _scale0);\n            _v1 = vfmaq_f32(_bias1, _v1, _scale1);\n            vst1q_f16(ptr, vcombine_f16(vcvt_f16_f32(_v0), vcvt_f16_f32(_v1)));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n            _v = vfmaq_f32(_bias0, _v, _scale0);\n            vst1_f16(ptr, vcvt_f16_f32(_v));\n            intptr += 4;\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            *ptr = (__fp16)(*intptr * scale + bias);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nint Dequantize_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n    const size_t out_elemsize = elempack * 2u;\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / 
opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const int* intptr = (const int*)bottom_blob + i * elempack;\n            __fp16* ptr = (__fp16*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n            // assert bias_data_size == 0 || bias_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            dequantize_fp16s(intptr, ptr, scale_data, bias_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            const int* intptr = bottom_blob.row<const int>(i);\n            __fp16* ptr = top_blob.row<__fp16>(i);\n\n            const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n            const Mat bias_data_i = bias_data_size > 1 ? bias_data.range(i * elempack, elempack) : bias_data;\n\n            dequantize_fp16s(intptr, ptr, scale_data_i, bias_data_i, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int* intptr = bottom_blob.channel(q);\n            __fp16* ptr = top_blob.channel(q);\n\n            const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n            const Mat bias_data_q = bias_data_size > 1 ? 
bias_data.range(q * elempack, elempack) : bias_data;\n\n            dequantize_fp16s(intptr, ptr, scale_data_q, bias_data_q, w * h, elempack);\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/dropout_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"dropout_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\nDropout_arm::Dropout_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#endif // __ARM_NEON\n}\n\nint Dropout_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    if (scale == 1.f)\n    {\n        return 0;\n    }\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _scale = vdupq_n_f32(scale);\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                \"fmul   v0.4s, v0.4s, %2.4s     \\n\"\n                \"fmul   v1.4s, v1.4s, %2.4s     \\n\"\n                \"fmul   v2.4s, v2.4s, %2.4s     \\n\"\n                \"fmul   v3.4s, v3.4s, %2.4s     \\n\"\n                \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_scale) // %2\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #512]      \\n\"\n                \"vldm       %0, {d0-d7}     \\n\"\n                \"vmul.f32   q0, q0, %q2     \\n\"\n                \"vmul.f32   q1, q1, %q2     \\n\"\n                \"vmul.f32   q2, q2, %q2     \\n\"\n                \"vmul.f32   q3, q3, %q2     \\n\"\n                
\"vstm       %0!, {d0-d7}    \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_scale) // %2\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _p2 = vld1q_f32(ptr + 8);\n            float32x4_t _p3 = vld1q_f32(ptr + 12);\n            _p0 = vmulq_f32(_p0, _scale);\n            _p1 = vmulq_f32(_p1, _scale);\n            _p2 = vmulq_f32(_p2, _scale);\n            _p3 = vmulq_f32(_p3, _scale);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            vst1q_f32(ptr + 8, _p2);\n            vst1q_f32(ptr + 12, _p3);\n            ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            _p0 = vmulq_f32(_p0, _scale);\n            _p1 = vmulq_f32(_p1, _scale);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vmulq_f32(_p, _scale);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = *ptr * scale;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/dropout_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DROPOUT_ARM_H\n#define LAYER_DROPOUT_ARM_H\n\n#include \"dropout.h\"\n\nnamespace ncnn {\n\nclass Dropout_arm : public Dropout\n{\npublic:\n    Dropout_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DROPOUT_ARM_H\n"
  },
  {
    "path": "src/layer/arm/eltwise_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"eltwise_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nEltwise_arm::Eltwise_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Eltwise_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int elembits = bottom_blobs[0].elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blobs, top_blobs, opt);\n        else\n            return forward_fp16s(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    const Mat& bottom_blob = bottom_blobs[0];\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d * elempack;\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (op_type == Operation_PROD)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            const float* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 7 < size; i += 8)\n            {\n          
      float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                float32x4_t _q0 = vld1q_f32(ptr1);\n                float32x4_t _q1 = vld1q_f32(ptr1 + 4);\n                _p0 = vmulq_f32(_p0, _q0);\n                _p1 = vmulq_f32(_p1, _q1);\n                vst1q_f32(outptr, _p0);\n                vst1q_f32(outptr + 4, _p1);\n\n                ptr += 8;\n                ptr1 += 8;\n                outptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                float32x4_t _q = vld1q_f32(ptr1);\n                _p = vmulq_f32(_p, _q);\n                vst1q_f32(outptr, _p);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *outptr = *ptr * *ptr1;\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        for (size_t b = 2; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(outptr);\n                    float32x4_t _p1 = vld1q_f32(outptr + 4);\n                    float32x4_t _q0 = vld1q_f32(ptr);\n                    float32x4_t _q1 = vld1q_f32(ptr + 4);\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    
outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(outptr);\n                    float32x4_t _q = vld1q_f32(ptr);\n                    _p = vmulq_f32(_p, _q);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr *= *ptr;\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n    }\n    if (op_type == Operation_SUM)\n    {\n        if (coeffs.w == 0)\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr);\n                    float32x4_t _p1 = vld1q_f32(ptr + 4);\n                    float32x4_t _q0 = vld1q_f32(ptr1);\n                    float32x4_t _q1 = vld1q_f32(ptr1 + 4);\n                    _p0 = vaddq_f32(_p0, _q0);\n                    _p1 = vaddq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(ptr);\n                    float32x4_t _q = vld1q_f32(ptr1);\n                    _p = vaddq_f32(_p, _q);\n                    vst1q_f32(outptr, _p);\n\n 
                   ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = *ptr + *ptr1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            for (size_t b = 2; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    int i = 0;\n#if __ARM_NEON\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        float32x4_t _q0 = vld1q_f32(ptr);\n                        float32x4_t _q1 = vld1q_f32(ptr + 4);\n                        _p0 = vaddq_f32(_p0, _q0);\n                        _p1 = vaddq_f32(_p1, _q1);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n\n                        ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(outptr);\n                        float32x4_t _q = vld1q_f32(ptr);\n                        _p = vaddq_f32(_p, _q);\n                        vst1q_f32(outptr, _p);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr += *ptr;\n\n                        ptr++;\n    
                    outptr++;\n                    }\n                }\n            }\n        }\n        else\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                const float coeff0 = coeffs[0];\n                const float coeff1 = coeffs[1];\n\n                int i = 0;\n#if __ARM_NEON\n                float32x4_t _coeff0 = vdupq_n_f32(coeff0);\n                float32x4_t _coeff1 = vdupq_n_f32(coeff1);\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr);\n                    float32x4_t _p1 = vld1q_f32(ptr + 4);\n                    float32x4_t _q0 = vld1q_f32(ptr1);\n                    float32x4_t _q1 = vld1q_f32(ptr1 + 4);\n                    _p0 = vmulq_f32(_p0, _coeff0);\n                    _p1 = vmulq_f32(_p1, _coeff0);\n                    _p0 = vmlaq_f32(_p0, _q0, _coeff1);\n                    _p1 = vmlaq_f32(_p1, _q1, _coeff1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(ptr);\n                    float32x4_t _q = vld1q_f32(ptr1);\n                    _p = vmulq_f32(_p, _coeff0);\n                    _p = vmlaq_f32(_p, _q, _coeff1);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; 
i++)\n                {\n                    *outptr = *ptr * coeff0 + *ptr1 * coeff1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            for (size_t b = 2; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    const float coeff = coeffs[b];\n\n                    int i = 0;\n#if __ARM_NEON\n                    float32x4_t _coeff = vdupq_n_f32(coeff);\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        float32x4_t _q0 = vld1q_f32(ptr);\n                        float32x4_t _q1 = vld1q_f32(ptr + 4);\n                        _p0 = vmlaq_f32(_p0, _q0, _coeff);\n                        _p1 = vmlaq_f32(_p1, _q1, _coeff);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n\n                        ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(outptr);\n                        float32x4_t _q = vld1q_f32(ptr);\n                        _p = vmlaq_f32(_p, _q, _coeff);\n                        vst1q_f32(outptr, _p);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr += *ptr * coeff;\n\n                        ptr++;\n        
                outptr++;\n                    }\n                }\n            }\n        }\n    }\n    if (op_type == Operation_MAX)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            const float* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 7 < size; i += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                float32x4_t _q0 = vld1q_f32(ptr1);\n                float32x4_t _q1 = vld1q_f32(ptr1 + 4);\n                _p0 = vmaxq_f32(_p0, _q0);\n                _p1 = vmaxq_f32(_p1, _q1);\n                vst1q_f32(outptr, _p0);\n                vst1q_f32(outptr + 4, _p1);\n\n                ptr += 8;\n                ptr1 += 8;\n                outptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                float32x4_t _q = vld1q_f32(ptr1);\n                _p = vmaxq_f32(_p, _q);\n                vst1q_f32(outptr, _p);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *outptr = std::max(*ptr, *ptr1);\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        for (size_t b = 2; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob1.channel(q);\n                float* 
outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(outptr);\n                    float32x4_t _p1 = vld1q_f32(outptr + 4);\n                    float32x4_t _q0 = vld1q_f32(ptr);\n                    float32x4_t _q1 = vld1q_f32(ptr + 4);\n                    _p0 = vmaxq_f32(_p0, _q0);\n                    _p1 = vmaxq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(outptr);\n                    float32x4_t _q = vld1q_f32(ptr);\n                    _p = vmaxq_f32(_p, _q);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = std::max(*ptr, *outptr);\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Eltwise_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d * elempack;\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (bottom_blobs.size() == 2)\n    {\n        // fast path without fp32 accumulator\n        if (op_type == Operation_PROD)\n        {\n            const Mat& bottom_blob1 = 
bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob1.channel(q);\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    uint16x8_t _p01 = vld1q_u16(ptr);\n                    uint16x8_t _q01 = vld1q_u16(ptr1);\n                    float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                    float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    vst1q_u16(outptr, vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                    _p = vmulq_f32(_p, _q);\n                    vst1_u16(outptr, float2bfloat(_p));\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = float32_to_bfloat16(bfloat16_to_float32(*ptr) * bfloat16_to_float32(*ptr1));\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n        }\n        if (op_type == Operation_SUM)\n        {\n            if (coeffs.w == 0)\n  
          {\n                const Mat& bottom_blob1 = bottom_blobs[1];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob.channel(q);\n                    const unsigned short* ptr1 = bottom_blob1.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    int i = 0;\n#if __ARM_NEON\n                    for (; i + 7 < size; i += 8)\n                    {\n                        uint16x8_t _p01 = vld1q_u16(ptr);\n                        uint16x8_t _q01 = vld1q_u16(ptr1);\n                        float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                        float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                        float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                        float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                        _p0 = vaddq_f32(_p0, _q0);\n                        _p1 = vaddq_f32(_p1, _q1);\n                        vst1q_u16(outptr, vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                        ptr += 8;\n                        ptr1 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                        float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                        _p = vaddq_f32(_p, _q);\n                        vst1_u16(outptr, float2bfloat(_p));\n\n                        ptr += 4;\n                        ptr1 += 4;\n                        outptr += 4;\n                    }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr = float32_to_bfloat16(bfloat16_to_float32(*ptr) + bfloat16_to_float32(*ptr1));\n\n                      
  ptr++;\n                        ptr1++;\n                        outptr++;\n                    }\n                }\n            }\n            else\n            {\n                const Mat& bottom_blob1 = bottom_blobs[1];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob.channel(q);\n                    const unsigned short* ptr1 = bottom_blob1.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    const float coeff0 = coeffs[0];\n                    const float coeff1 = coeffs[1];\n\n                    int i = 0;\n#if __ARM_NEON\n                    float32x4_t _coeff0 = vdupq_n_f32(coeff0);\n                    float32x4_t _coeff1 = vdupq_n_f32(coeff1);\n                    for (; i + 7 < size; i += 8)\n                    {\n                        uint16x8_t _p01 = vld1q_u16(ptr);\n                        uint16x8_t _q01 = vld1q_u16(ptr1);\n                        float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                        float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                        float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                        float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                        _p0 = vmulq_f32(_p0, _coeff0);\n                        _p1 = vmulq_f32(_p1, _coeff0);\n                        _p0 = vmlaq_f32(_p0, _q0, _coeff1);\n                        _p1 = vmlaq_f32(_p1, _q1, _coeff1);\n                        vst1q_u16(outptr, vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                        ptr += 8;\n                        ptr1 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                   
     float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                        _p = vmulq_f32(_p, _coeff0);\n                        _p = vmlaq_f32(_p, _q, _coeff1);\n                        vst1_u16(outptr, float2bfloat(_p));\n\n                        ptr += 4;\n                        ptr1 += 4;\n                        outptr += 4;\n                    }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr = float32_to_bfloat16(bfloat16_to_float32(*ptr) * coeff0 + bfloat16_to_float32(*ptr1) * coeff1);\n\n                        ptr++;\n                        ptr1++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n        if (op_type == Operation_MAX)\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob1.channel(q);\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    uint16x8_t _p01 = vld1q_u16(ptr);\n                    uint16x8_t _q01 = vld1q_u16(ptr1);\n                    float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                    float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vmaxq_f32(_p0, _q0);\n                    _p1 = vmaxq_f32(_p1, _q1);\n                    vst1q_u16(outptr, vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                    ptr += 8;\n                    ptr1 += 8;\n                 
   outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                    _p = vmaxq_f32(_p, _q);\n                    vst1_u16(outptr, float2bfloat(_p));\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = float32_to_bfloat16(std::max(bfloat16_to_float32(*ptr), bfloat16_to_float32(*ptr1)));\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    Mat top_blob_fp32(w, h, d, channels, (size_t)4u * elempack, elempack, opt.workspace_allocator);\n    if (top_blob_fp32.empty())\n        return -100;\n\n    if (op_type == Operation_PROD)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(q);\n            const unsigned short* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob_fp32.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 7 < size; i += 8)\n            {\n                uint16x8_t _p01 = vld1q_u16(ptr);\n                uint16x8_t _q01 = vld1q_u16(ptr1);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                _p0 = vmulq_f32(_p0, _q0);\n                _p1 = vmulq_f32(_p1, _q1);\n                vst1q_f32(outptr, 
_p0);\n                vst1q_f32(outptr + 4, _p1);\n\n                ptr += 8;\n                ptr1 += 8;\n                outptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                _p = vmulq_f32(_p, _q);\n                vst1q_f32(outptr, _p);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *outptr = bfloat16_to_float32(*ptr) * bfloat16_to_float32(*ptr1);\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        size_t b = 2;\n        for (; b < bottom_blobs.size() - 1; b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob_fp32.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(outptr);\n                    float32x4_t _p1 = vld1q_f32(outptr + 4);\n                    uint16x8_t _q01 = vld1q_u16(ptr);\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = 
vld1q_f32(outptr);\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                    _p = vmulq_f32(_p, _q);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr *= bfloat16_to_float32(*ptr);\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n        for (; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob1.channel(q);\n                const float* ptr0 = top_blob_fp32.channel(q);\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                    uint16x8_t _q01 = vld1q_u16(ptr);\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    vst1q_u16(outptr, vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                    ptr += 8;\n                    ptr0 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(ptr0);\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                    _p = vmulq_f32(_p, _q);\n                    vst1_u16(outptr, float2bfloat(_p));\n\n                    ptr += 
4;\n                    ptr0 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = float32_to_bfloat16(*ptr0 * bfloat16_to_float32(*ptr));\n\n                    ptr++;\n                    ptr0++;\n                    outptr++;\n                }\n            }\n        }\n    }\n    if (op_type == Operation_SUM)\n    {\n        if (coeffs.w == 0)\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob_fp32.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    uint16x8_t _p01 = vld1q_u16(ptr);\n                    uint16x8_t _q01 = vld1q_u16(ptr1);\n                    float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                    float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vaddq_f32(_p0, _q0);\n                    _p1 = vaddq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                    _p = vaddq_f32(_p, _q);\n                    vst1q_f32(outptr, 
_p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = bfloat16_to_float32(*ptr) + bfloat16_to_float32(*ptr1);\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            size_t b = 2;\n            for (; b < bottom_blobs.size() - 1; b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob_fp32.channel(q);\n\n                    int i = 0;\n#if __ARM_NEON\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        uint16x8_t _q01 = vld1q_u16(ptr);\n                        float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                        float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                        _p0 = vaddq_f32(_p0, _q0);\n                        _p1 = vaddq_f32(_p1, _q1);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n\n                        ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(outptr);\n                        float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                        _p = vaddq_f32(_p, _q);\n                        vst1q_f32(outptr, _p);\n\n                        ptr += 4;\n                        outptr += 4;\n  
                  }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr += bfloat16_to_float32(*ptr);\n\n                        ptr++;\n                        outptr++;\n                    }\n                }\n            }\n            for (; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob1.channel(q);\n                    const float* ptr0 = top_blob_fp32.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    int i = 0;\n#if __ARM_NEON\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(ptr0);\n                        float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                        uint16x8_t _q01 = vld1q_u16(ptr);\n                        float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                        float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                        _p0 = vaddq_f32(_p0, _q0);\n                        _p1 = vaddq_f32(_p1, _q1);\n                        vst1q_u16(outptr, vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                        ptr += 8;\n                        ptr0 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(ptr0);\n                        float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                        _p = vaddq_f32(_p, _q);\n                        vst1_u16(outptr, float2bfloat(_p));\n\n                        ptr += 4;\n                        ptr0 += 4;\n                        outptr += 4;\n        
            }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr = float32_to_bfloat16(*ptr0 + bfloat16_to_float32(*ptr));\n\n                        ptr++;\n                        ptr0++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n        else\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob_fp32.channel(q);\n\n                const float coeff0 = coeffs[0];\n                const float coeff1 = coeffs[1];\n\n                int i = 0;\n#if __ARM_NEON\n                float32x4_t _coeff0 = vdupq_n_f32(coeff0);\n                float32x4_t _coeff1 = vdupq_n_f32(coeff1);\n                for (; i + 7 < size; i += 8)\n                {\n                    uint16x8_t _p01 = vld1q_u16(ptr);\n                    uint16x8_t _q01 = vld1q_u16(ptr1);\n                    float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                    float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vmulq_f32(_p0, _coeff0);\n                    _p1 = vmulq_f32(_p1, _coeff0);\n                    _p0 = vmlaq_f32(_p0, _q0, _coeff1);\n                    _p1 = vmlaq_f32(_p1, _q1, _coeff1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < 
size; i += 4)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                    _p = vmulq_f32(_p, _coeff0);\n                    _p = vmlaq_f32(_p, _q, _coeff1);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = bfloat16_to_float32(*ptr) * coeff0 + bfloat16_to_float32(*ptr1) * coeff1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            size_t b = 2;\n            for (; b < bottom_blobs.size() - 1; b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob_fp32.channel(q);\n\n                    const float coeff = coeffs[b];\n\n                    int i = 0;\n#if __ARM_NEON\n                    float32x4_t _coeff = vdupq_n_f32(coeff);\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        uint16x8_t _q01 = vld1q_u16(ptr);\n                        float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                        float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                        _p0 = vmlaq_f32(_p0, _q0, _coeff);\n                        _p1 = vmlaq_f32(_p1, _q1, _coeff);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n\n              
          ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(outptr);\n                        float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                        _p = vmlaq_f32(_p, _q, _coeff);\n                        vst1q_f32(outptr, _p);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr += bfloat16_to_float32(*ptr) * coeff;\n\n                        ptr++;\n                        outptr++;\n                    }\n                }\n            }\n            for (; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob1.channel(q);\n                    const float* ptr0 = top_blob_fp32.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    const float coeff = coeffs[b];\n\n                    int i = 0;\n#if __ARM_NEON\n                    float32x4_t _coeff = vdupq_n_f32(coeff);\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(ptr0);\n                        float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                        uint16x8_t _q01 = vld1q_u16(ptr);\n                        float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                        float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                        _p0 = vmlaq_f32(_p0, _q0, _coeff);\n                        _p1 = vmlaq_f32(_p1, _q1, _coeff);\n                        vst1q_u16(outptr, 
vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                        ptr += 8;\n                        ptr0 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(ptr0);\n                        float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                        _p = vmlaq_f32(_p, _q, _coeff);\n                        vst1_u16(outptr, float2bfloat(_p));\n\n                        ptr += 4;\n                        ptr0 += 4;\n                        outptr += 4;\n                    }\n#endif // __ARM_NEON\n                    for (; i < size; i++)\n                    {\n                        *outptr = float32_to_bfloat16(*ptr0 + bfloat16_to_float32(*ptr) * coeff);\n\n                        ptr++;\n                        ptr0++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n    }\n    if (op_type == Operation_MAX)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(q);\n            const unsigned short* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob_fp32.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 7 < size; i += 8)\n            {\n                uint16x8_t _p01 = vld1q_u16(ptr);\n                uint16x8_t _q01 = vld1q_u16(ptr1);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n                float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                _p0 = vmaxq_f32(_p0, _q0);\n                _p1 = vmaxq_f32(_p1, _q1);\n   
             vst1q_f32(outptr, _p0);\n                vst1q_f32(outptr + 4, _p1);\n\n                ptr += 8;\n                ptr1 += 8;\n                outptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                float32x4_t _q = bfloat2float(vld1_u16(ptr1));\n                _p = vmaxq_f32(_p, _q);\n                vst1q_f32(outptr, _p);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *outptr = std::max(bfloat16_to_float32(*ptr), bfloat16_to_float32(*ptr1));\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        size_t b = 2;\n        for (; b < bottom_blobs.size() - 1; b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob_fp32.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(outptr);\n                    float32x4_t _p1 = vld1q_f32(outptr + 4);\n                    uint16x8_t _q01 = vld1q_u16(ptr);\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vmaxq_f32(_p0, _q0);\n                    _p1 = vmaxq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n     
               float32x4_t _p = vld1q_f32(outptr);\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                    _p = vmaxq_f32(_p, _q);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = std::max(bfloat16_to_float32(*ptr), *outptr);\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n        for (; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob1.channel(q);\n                const float* ptr0 = top_blob_fp32.channel(q);\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                    uint16x8_t _q01 = vld1q_u16(ptr);\n                    float32x4_t _q0 = bfloat2float(vget_low_u16(_q01));\n                    float32x4_t _q1 = bfloat2float(vget_high_u16(_q01));\n                    _p0 = vmaxq_f32(_p0, _q0);\n                    _p1 = vmaxq_f32(_p1, _q1);\n                    vst1q_u16(outptr, vcombine_u16(float2bfloat(_p0), float2bfloat(_p1)));\n\n                    ptr += 8;\n                    ptr0 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(ptr0);\n                    float32x4_t _q = bfloat2float(vld1_u16(ptr));\n                    _p = vmaxq_f32(_p, _q);\n                    vst1_u16(outptr, 
float2bfloat(_p));\n\n                    ptr += 4;\n                    ptr0 += 4;\n                    outptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; i < size; i++)\n                {\n                    *outptr = float32_to_bfloat16(std::max(bfloat16_to_float32(*ptr), *ptr0));\n\n                    ptr++;\n                    ptr0++;\n                    outptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/eltwise_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ELTWISE_ARM_H\n#define LAYER_ELTWISE_ARM_H\n\n#include \"eltwise.h\"\n\nnamespace ncnn {\n\nclass Eltwise_arm : public Eltwise\n{\npublic:\n    Eltwise_arm();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n    int forward_fp16sa(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ELTWISE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/eltwise_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"eltwise_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Eltwise_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d * elempack;\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (bottom_blobs.size() == 2)\n    {\n        // fast path without fp32 accumulator\n        if (op_type == Operation_PROD)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                const __fp16* ptr1 = bottom_blob1.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p0 = vld1q_f16(ptr);\n                    float16x8_t _p1 = vld1q_f16(ptr + 8);\n                    float16x8_t _q0 = vld1q_f16(ptr1);\n                    float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                    _p0 = vmulq_f16(_p0, _q0);\n                    _p1 = vmulq_f16(_p1, _q1);\n                    vst1q_f16(outptr, _p0);\n                    vst1q_f16(outptr + 8, _p1);\n\n                    ptr += 16;\n                    ptr1 += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                
{\n                    float16x8_t _p = vld1q_f16(ptr);\n                    float16x8_t _q = vld1q_f16(ptr1);\n                    _p = vmulq_f16(_p, _q);\n                    vst1q_f16(outptr, _p);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    float16x4_t _q = vld1_f16(ptr1);\n                    _p = vmul_f16(_p, _q);\n                    vst1_f16(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr = *ptr * *ptr1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n        }\n        if (op_type == Operation_SUM)\n        {\n            if (coeffs.w == 0)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[1];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob.channel(q);\n                    const __fp16* ptr1 = bottom_blob1.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n                    {\n                        float16x8_t _p0 = vld1q_f16(ptr);\n                        float16x8_t _p1 = vld1q_f16(ptr + 8);\n                        float16x8_t _q0 = vld1q_f16(ptr1);\n                        float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                        _p0 = vaddq_f16(_p0, _q0);\n                        _p1 = vaddq_f16(_p1, _q1);\n                        vst1q_f16(outptr, _p0);\n                        vst1q_f16(outptr + 8, _p1);\n\n             
           ptr += 16;\n                        ptr1 += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float16x8_t _p = vld1q_f16(ptr);\n                        float16x8_t _q = vld1q_f16(ptr1);\n                        _p = vaddq_f16(_p, _q);\n                        vst1q_f16(outptr, _p);\n\n                        ptr += 8;\n                        ptr1 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float16x4_t _p = vld1_f16(ptr);\n                        float16x4_t _q = vld1_f16(ptr1);\n                        _p = vadd_f16(_p, _q);\n                        vst1_f16(outptr, _p);\n\n                        ptr += 4;\n                        ptr1 += 4;\n                        outptr += 4;\n                    }\n                    for (; i < size; i++)\n                    {\n                        *outptr = *ptr + *ptr1;\n\n                        ptr++;\n                        ptr1++;\n                        outptr++;\n                    }\n                }\n            }\n            else\n            {\n                const Mat& bottom_blob1 = bottom_blobs[1];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob.channel(q);\n                    const __fp16* ptr1 = bottom_blob1.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    const float coeff0 = coeffs[0];\n                    const float coeff1 = coeffs[1];\n                    float16x8_t _coeff0 = vdupq_n_f16((__fp16)coeff0);\n                    float16x8_t _coeff1 = vdupq_n_f16((__fp16)coeff1);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n  
                  {\n                        float16x8_t _p0 = vld1q_f16(ptr);\n                        float16x8_t _p1 = vld1q_f16(ptr + 8);\n                        float16x8_t _q0 = vld1q_f16(ptr1);\n                        float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                        _p0 = vmulq_f16(_p0, _coeff0);\n                        _p1 = vmulq_f16(_p1, _coeff0);\n                        _p0 = vfmaq_f16(_p0, _q0, _coeff1);\n                        _p1 = vfmaq_f16(_p1, _q1, _coeff1);\n                        vst1q_f16(outptr, _p0);\n                        vst1q_f16(outptr + 8, _p1);\n\n                        ptr += 16;\n                        ptr1 += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float16x8_t _p = vld1q_f16(ptr);\n                        float16x8_t _q = vld1q_f16(ptr1);\n                        _p = vmulq_f16(_p, _coeff0);\n                        _p = vfmaq_f16(_p, _q, _coeff1);\n                        vst1q_f16(outptr, _p);\n\n                        ptr += 8;\n                        ptr1 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float16x4_t _p = vld1_f16(ptr);\n                        float16x4_t _q = vld1_f16(ptr1);\n                        _p = vmul_f16(_p, vget_low_f16(_coeff0));\n                        _p = vfma_f16(_p, _q, vget_low_f16(_coeff1));\n                        vst1_f16(outptr, _p);\n\n                        ptr += 4;\n                        ptr1 += 4;\n                        outptr += 4;\n                    }\n                    for (; i < size; i++)\n                    {\n                        *outptr = (__fp16)((float)(*ptr) * coeff0 + (float)(*ptr1) * coeff1);\n\n                        ptr++;\n                        ptr1++;\n                        outptr++;\n      
              }\n                }\n            }\n        }\n        if (op_type == Operation_MAX)\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                const __fp16* ptr1 = bottom_blob1.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p0 = vld1q_f16(ptr);\n                    float16x8_t _p1 = vld1q_f16(ptr + 8);\n                    float16x8_t _q0 = vld1q_f16(ptr1);\n                    float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                    _p0 = vmaxq_f16(_p0, _q0);\n                    _p1 = vmaxq_f16(_p1, _q1);\n                    vst1q_f16(outptr, _p0);\n                    vst1q_f16(outptr + 8, _p1);\n\n                    ptr += 16;\n                    ptr1 += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float16x8_t _p = vld1q_f16(ptr);\n                    float16x8_t _q = vld1q_f16(ptr1);\n                    _p = vmaxq_f16(_p, _q);\n                    vst1q_f16(outptr, _p);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    float16x4_t _q = vld1_f16(ptr1);\n                    _p = vmax_f16(_p, _q);\n                    vst1_f16(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr = 
std::max(*ptr, *ptr1);\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    Mat top_blob_fp32(w, h, d, channels, (size_t)4u * elempack, elempack, opt.workspace_allocator);\n    if (top_blob_fp32.empty())\n        return -100;\n\n    if (op_type == Operation_PROD)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const __fp16* ptr = bottom_blob.channel(q);\n            const __fp16* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob_fp32.channel(q);\n\n            int i = 0;\n            for (; i + 15 < size; i += 16)\n            {\n                float16x8_t _p01 = vld1q_f16(ptr);\n                float16x8_t _p23 = vld1q_f16(ptr + 8);\n                float16x8_t _q01 = vld1q_f16(ptr1);\n                float16x8_t _q23 = vld1q_f16(ptr1 + 8);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n                float32x4_t _p2 = vcvt_f32_f16(vget_low_f16(_p23));\n                float32x4_t _p3 = vcvt_f32_f16(vget_high_f16(_p23));\n                float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                _p0 = vmulq_f32(_p0, _q0);\n                _p1 = vmulq_f32(_p1, _q1);\n                _p2 = vmulq_f32(_p2, _q2);\n                _p3 = vmulq_f32(_p3, _q3);\n                vst1q_f32(outptr, _p0);\n                vst1q_f32(outptr + 4, _p1);\n                vst1q_f32(outptr + 8, _p2);\n                vst1q_f32(outptr + 12, _p3);\n\n                ptr += 16;\n     
           ptr1 += 16;\n                outptr += 16;\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p01 = vld1q_f16(ptr);\n                float16x8_t _q01 = vld1q_f16(ptr1);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n                float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                _p0 = vmulq_f32(_p0, _q0);\n                _p1 = vmulq_f32(_p1, _q1);\n                vst1q_f32(outptr, _p0);\n                vst1q_f32(outptr + 4, _p1);\n\n                ptr += 8;\n                ptr1 += 8;\n                outptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr1));\n                _p = vmulq_f32(_p, _q);\n                vst1q_f32(outptr, _p);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n            for (; i < size; i++)\n            {\n                *outptr = (float)(*ptr) * (float)(*ptr1);\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        size_t b = 2;\n        for (; b < bottom_blobs.size() - 1; b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob_fp32.channel(q);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float32x4_t _p0 = vld1q_f32(outptr);\n                    float32x4_t _p1 = vld1q_f32(outptr + 4);\n                    
float32x4_t _p2 = vld1q_f32(outptr + 8);\n                    float32x4_t _p3 = vld1q_f32(outptr + 12);\n                    float16x8_t _q01 = vld1q_f16(ptr);\n                    float16x8_t _q23 = vld1q_f16(ptr + 8);\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                    float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    _p2 = vmulq_f32(_p2, _q2);\n                    _p3 = vmulq_f32(_p3, _q3);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n                    vst1q_f32(outptr + 8, _p2);\n                    vst1q_f32(outptr + 12, _p3);\n\n                    ptr += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = vld1q_f32(outptr);\n                    float32x4_t _p1 = vld1q_f32(outptr + 4);\n                    float16x8_t _q01 = vld1q_f16(ptr);\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(outptr);\n                    float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr));\n                    _p = vmulq_f32(_p, _q);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    outptr += 4;\n   
             }\n                for (; i < size; i++)\n                {\n                    *outptr *= (float)(*ptr);\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n        for (; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob1.channel(q);\n                const float* ptr0 = top_blob_fp32.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                    float32x4_t _p2 = vld1q_f32(ptr0 + 8);\n                    float32x4_t _p3 = vld1q_f32(ptr0 + 12);\n                    float16x8_t _q01 = vld1q_f16(ptr);\n                    float16x8_t _q23 = vld1q_f16(ptr + 8);\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                    float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    _p2 = vmulq_f32(_p2, _q2);\n                    _p3 = vmulq_f32(_p3, _q3);\n                    vst1q_f16(outptr, vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1)));\n                    vst1q_f16(outptr + 8, vcombine_f16(vcvt_f16_f32(_p2), vcvt_f16_f32(_p3)));\n\n                    ptr += 16;\n                    ptr0 += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float32x4_t _p0 = 
vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                    float16x8_t _q01 = vld1q_f16(ptr);\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    _p0 = vmulq_f32(_p0, _q0);\n                    _p1 = vmulq_f32(_p1, _q1);\n                    vst1q_f16(outptr, vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1)));\n\n                    ptr += 8;\n                    ptr0 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vld1q_f32(ptr0);\n                    float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr));\n                    _p = vmulq_f32(_p, _q);\n                    vst1_f16(outptr, vcvt_f16_f32(_p));\n\n                    ptr += 4;\n                    ptr0 += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr = (__fp16)(*ptr0 * (float)(*ptr));\n\n                    ptr++;\n                    ptr0++;\n                    outptr++;\n                }\n            }\n        }\n    }\n    if (op_type == Operation_SUM)\n    {\n        if (coeffs.w == 0)\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                const __fp16* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob_fp32.channel(q);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p01 = vld1q_f16(ptr);\n                    float16x8_t _p23 = vld1q_f16(ptr + 8);\n                    float16x8_t _q01 = 
vld1q_f16(ptr1);\n                    float16x8_t _q23 = vld1q_f16(ptr1 + 8);\n                    float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n                    float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n                    float32x4_t _p2 = vcvt_f32_f16(vget_low_f16(_p23));\n                    float32x4_t _p3 = vcvt_f32_f16(vget_high_f16(_p23));\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                    float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                    _p0 = vaddq_f32(_p0, _q0);\n                    _p1 = vaddq_f32(_p1, _q1);\n                    _p2 = vaddq_f32(_p2, _q2);\n                    _p3 = vaddq_f32(_p3, _q3);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n                    vst1q_f32(outptr + 8, _p2);\n                    vst1q_f32(outptr + 12, _p3);\n\n                    ptr += 16;\n                    ptr1 += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float16x8_t _p01 = vld1q_f16(ptr);\n                    float16x8_t _q01 = vld1q_f16(ptr1);\n                    float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n                    float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    _p0 = vaddq_f32(_p0, _q0);\n                    _p1 = vaddq_f32(_p1, _q1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n 
               {\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr1));\n                    _p = vaddq_f32(_p, _q);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr = (float)(*ptr) + (float)(*ptr1);\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            size_t b = 2;\n            for (; b < bottom_blobs.size() - 1; b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob_fp32.channel(q);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        float32x4_t _p2 = vld1q_f32(outptr + 8);\n                        float32x4_t _p3 = vld1q_f32(outptr + 12);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float16x8_t _q23 = vld1q_f16(ptr + 8);\n                        float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                        float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                        _p0 = vaddq_f32(_p0, _q0);\n                        _p1 = vaddq_f32(_p1, _q1);\n                        _p2 = vaddq_f32(_p2, _q2);\n      
                  _p3 = vaddq_f32(_p3, _q3);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n                        vst1q_f32(outptr + 8, _p2);\n                        vst1q_f32(outptr + 12, _p3);\n\n                        ptr += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        _p0 = vaddq_f32(_p0, _q0);\n                        _p1 = vaddq_f32(_p1, _q1);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n\n                        ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(outptr);\n                        float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr));\n                        _p = vaddq_f32(_p, _q);\n                        vst1q_f32(outptr, _p);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n                    for (; i < size; i++)\n                    {\n                        *outptr += (float)(*ptr);\n\n                        ptr++;\n                        outptr++;\n                    }\n                }\n            }\n            for (; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n               
     const __fp16* ptr = bottom_blob1.channel(q);\n                    const float* ptr0 = top_blob_fp32.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n                    {\n                        float32x4_t _p0 = vld1q_f32(ptr0);\n                        float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                        float32x4_t _p2 = vld1q_f32(ptr0 + 8);\n                        float32x4_t _p3 = vld1q_f32(ptr0 + 12);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float16x8_t _q23 = vld1q_f16(ptr + 8);\n                        float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                        float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                        _p0 = vaddq_f32(_p0, _q0);\n                        _p1 = vaddq_f32(_p1, _q1);\n                        _p2 = vaddq_f32(_p2, _q2);\n                        _p3 = vaddq_f32(_p3, _q3);\n                        vst1q_f16(outptr, vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1)));\n                        vst1q_f16(outptr + 8, vcombine_f16(vcvt_f16_f32(_p2), vcvt_f16_f32(_p3)));\n\n                        ptr += 16;\n                        ptr0 += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(ptr0);\n                        float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        _p0 = vaddq_f32(_p0, _q0);\n                      
  _p1 = vaddq_f32(_p1, _q1);\n                        vst1q_f16(outptr, vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1)));\n\n                        ptr += 8;\n                        ptr0 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(ptr0);\n                        float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr));\n                        _p = vaddq_f32(_p, _q);\n                        vst1_f16(outptr, vcvt_f16_f32(_p));\n\n                        ptr += 4;\n                        ptr0 += 4;\n                        outptr += 4;\n                    }\n                    for (; i < size; i++)\n                    {\n                        *outptr = (__fp16)(*ptr0 + (float)(*ptr));\n\n                        ptr++;\n                        ptr0++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n        else\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                const __fp16* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob_fp32.channel(q);\n\n                const float coeff0 = coeffs[0];\n                const float coeff1 = coeffs[1];\n                float32x4_t _coeff0 = vdupq_n_f32(coeff0);\n                float32x4_t _coeff1 = vdupq_n_f32(coeff1);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p01 = vld1q_f16(ptr);\n                    float16x8_t _p23 = vld1q_f16(ptr + 8);\n                    float16x8_t _q01 = vld1q_f16(ptr1);\n                    float16x8_t _q23 = vld1q_f16(ptr1 + 8);\n                 
   float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n                    float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n                    float32x4_t _p2 = vcvt_f32_f16(vget_low_f16(_p23));\n                    float32x4_t _p3 = vcvt_f32_f16(vget_high_f16(_p23));\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                    float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                    _p0 = vmulq_f32(_p0, _coeff0);\n                    _p1 = vmulq_f32(_p1, _coeff0);\n                    _p2 = vmulq_f32(_p2, _coeff0);\n                    _p3 = vmulq_f32(_p3, _coeff0);\n                    _p0 = vfmaq_f32(_p0, _q0, _coeff1);\n                    _p1 = vfmaq_f32(_p1, _q1, _coeff1);\n                    _p2 = vfmaq_f32(_p2, _q2, _coeff1);\n                    _p3 = vfmaq_f32(_p3, _q3, _coeff1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n                    vst1q_f32(outptr + 8, _p2);\n                    vst1q_f32(outptr + 12, _p3);\n\n                    ptr += 16;\n                    ptr1 += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float16x8_t _p01 = vld1q_f16(ptr);\n                    float16x8_t _q01 = vld1q_f16(ptr1);\n                    float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n                    float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n                    float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                    float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                    _p0 = vmulq_f32(_p0, _coeff0);\n                    _p1 = vmulq_f32(_p1, _coeff0);\n                    _p0 = vfmaq_f32(_p0, _q0, _coeff1);\n                    _p1 = vfmaq_f32(_p1, _q1, 
_coeff1);\n                    vst1q_f32(outptr, _p0);\n                    vst1q_f32(outptr + 4, _p1);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr1));\n                    _p = vmulq_f32(_p, _coeff0);\n                    _p = vfmaq_f32(_p, _q, _coeff1);\n                    vst1q_f32(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr = (float)(*ptr) * coeff0 + (float)(*ptr1) * coeff1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            size_t b = 2;\n            for (; b < bottom_blobs.size() - 1; b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob_fp32.channel(q);\n\n                    const float coeff = coeffs[b];\n                    float32x4_t _coeff = vdupq_n_f32(coeff);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        float32x4_t _p2 = vld1q_f32(outptr + 8);\n                        float32x4_t _p3 = vld1q_f32(outptr + 12);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float16x8_t _q23 = vld1q_f16(ptr + 8);\n         
               float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                        float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                        _p0 = vfmaq_f32(_p0, _q0, _coeff);\n                        _p1 = vfmaq_f32(_p1, _q1, _coeff);\n                        _p2 = vfmaq_f32(_p2, _q2, _coeff);\n                        _p3 = vfmaq_f32(_p3, _q3, _coeff);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n                        vst1q_f32(outptr + 8, _p2);\n                        vst1q_f32(outptr + 12, _p3);\n\n                        ptr += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(outptr);\n                        float32x4_t _p1 = vld1q_f32(outptr + 4);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        _p0 = vfmaq_f32(_p0, _q0, _coeff);\n                        _p1 = vfmaq_f32(_p1, _q1, _coeff);\n                        vst1q_f32(outptr, _p0);\n                        vst1q_f32(outptr + 4, _p1);\n\n                        ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(outptr);\n                        float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr));\n                        _p = vfmaq_f32(_p, _q, _coeff);\n                        vst1q_f32(outptr, _p);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n      
              for (; i < size; i++)\n                    {\n                        *outptr += (float)(*ptr) * coeff;\n\n                        ptr++;\n                        outptr++;\n                    }\n                }\n            }\n            for (; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob1.channel(q);\n                    const float* ptr0 = top_blob_fp32.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    const float coeff = coeffs[b];\n                    float32x4_t _coeff = vdupq_n_f32(coeff);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n                    {\n                        float32x4_t _p0 = vld1q_f32(ptr0);\n                        float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                        float32x4_t _p2 = vld1q_f32(ptr0 + 8);\n                        float32x4_t _p3 = vld1q_f32(ptr0 + 12);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float16x8_t _q23 = vld1q_f16(ptr + 8);\n                        float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        float32x4_t _q2 = vcvt_f32_f16(vget_low_f16(_q23));\n                        float32x4_t _q3 = vcvt_f32_f16(vget_high_f16(_q23));\n                        _p0 = vfmaq_f32(_p0, _q0, _coeff);\n                        _p1 = vfmaq_f32(_p1, _q1, _coeff);\n                        _p2 = vfmaq_f32(_p2, _q2, _coeff);\n                        _p3 = vfmaq_f32(_p3, _q3, _coeff);\n                        vst1q_f16(outptr, vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1)));\n                        vst1q_f16(outptr + 8, 
vcombine_f16(vcvt_f16_f32(_p2), vcvt_f16_f32(_p3)));\n\n                        ptr += 16;\n                        ptr0 += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float32x4_t _p0 = vld1q_f32(ptr0);\n                        float32x4_t _p1 = vld1q_f32(ptr0 + 4);\n                        float16x8_t _q01 = vld1q_f16(ptr);\n                        float32x4_t _q0 = vcvt_f32_f16(vget_low_f16(_q01));\n                        float32x4_t _q1 = vcvt_f32_f16(vget_high_f16(_q01));\n                        _p0 = vfmaq_f32(_p0, _q0, _coeff);\n                        _p1 = vfmaq_f32(_p1, _q1, _coeff);\n                        vst1q_f16(outptr, vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1)));\n\n                        ptr += 8;\n                        ptr0 += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float32x4_t _p = vld1q_f32(ptr0);\n                        float32x4_t _q = vcvt_f32_f16(vld1_f16(ptr));\n                        _p = vfmaq_f32(_p, _q, _coeff);\n                        vst1_f16(outptr, vcvt_f16_f32(_p));\n\n                        ptr += 4;\n                        ptr0 += 4;\n                        outptr += 4;\n                    }\n                    for (; i < size; i++)\n                    {\n                        *outptr = (__fp16)(*ptr0 + (float)(*ptr) * coeff);\n\n                        ptr++;\n                        ptr0++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n    }\n    if (op_type == Operation_MAX)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const __fp16* ptr = 
bottom_blob.channel(q);\n            const __fp16* ptr1 = bottom_blob1.channel(q);\n            __fp16* outptr = top_blob.channel(q);\n\n            int i = 0;\n            for (; i + 15 < size; i += 16)\n            {\n                float16x8_t _p0 = vld1q_f16(ptr);\n                float16x8_t _p1 = vld1q_f16(ptr + 8);\n                float16x8_t _q0 = vld1q_f16(ptr1);\n                float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                _p0 = vmaxq_f16(_p0, _q0);\n                _p1 = vmaxq_f16(_p1, _q1);\n                vst1q_f16(outptr, _p0);\n                vst1q_f16(outptr + 8, _p1);\n\n                ptr += 16;\n                ptr1 += 16;\n                outptr += 16;\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float16x8_t _q = vld1q_f16(ptr1);\n                _p = vmaxq_f16(_p, _q);\n                vst1q_f16(outptr, _p);\n\n                ptr += 8;\n                ptr1 += 8;\n                outptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                float16x4_t _q = vld1_f16(ptr1);\n                _p = vmax_f16(_p, _q);\n                vst1_f16(outptr, _p);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n            for (; i < size; i++)\n            {\n                *outptr = std::max(*ptr, *ptr1);\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        size_t b = 2;\n        for (; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob1.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                
int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p0 = vld1q_f16(outptr);\n                    float16x8_t _p1 = vld1q_f16(outptr + 8);\n                    float16x8_t _q0 = vld1q_f16(ptr);\n                    float16x8_t _q1 = vld1q_f16(ptr + 8);\n                    _p0 = vmaxq_f16(_p0, _q0);\n                    _p1 = vmaxq_f16(_p1, _q1);\n                    vst1q_f16(outptr, _p0);\n                    vst1q_f16(outptr + 8, _p1);\n\n                    ptr += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float16x8_t _p = vld1q_f16(outptr);\n                    float16x8_t _q = vld1q_f16(ptr);\n                    _p = vmaxq_f16(_p, _q);\n                    vst1q_f16(outptr, _p);\n\n                    ptr += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float16x4_t _p = vld1_f16(outptr);\n                    float16x4_t _q = vld1_f16(ptr);\n                    _p = vmax_f16(_p, _q);\n                    vst1_f16(outptr, _p);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr = std::max(*ptr, *outptr);\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Eltwise_arm::forward_fp16sa(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    if (bottom_blobs.size() == 2)\n    {\n        // fast path without fp32 accumulator\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n    }\n\n    if (op_type == Operation_MAX)\n    {\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n    }\n\n    const Mat& bottom_blob = bottom_blobs[0];\n    int w = 
bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d * elempack;\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (op_type == Operation_PROD)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const __fp16* ptr = bottom_blob.channel(q);\n            const __fp16* ptr1 = bottom_blob1.channel(q);\n            __fp16* outptr = top_blob.channel(q);\n\n            int i = 0;\n            for (; i + 15 < size; i += 16)\n            {\n                float16x8_t _p0 = vld1q_f16(ptr);\n                float16x8_t _p1 = vld1q_f16(ptr + 8);\n                float16x8_t _q0 = vld1q_f16(ptr1);\n                float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                _p0 = vmulq_f16(_p0, _q0);\n                _p1 = vmulq_f16(_p1, _q1);\n                vst1q_f16(outptr, _p0);\n                vst1q_f16(outptr + 8, _p1);\n\n                ptr += 16;\n                ptr1 += 16;\n                outptr += 16;\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float16x8_t _q = vld1q_f16(ptr1);\n                _p = vmulq_f16(_p, _q);\n                vst1q_f16(outptr, _p);\n\n                ptr += 8;\n                ptr1 += 8;\n                outptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                float16x4_t _q = vld1_f16(ptr1);\n                _p = vmul_f16(_p, _q);\n                vst1_f16(outptr, _p);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n      
      for (; i < size; i++)\n            {\n                *outptr = *ptr * *ptr1;\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        size_t b = 2;\n        for (; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob1.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p0 = vld1q_f16(outptr);\n                    float16x8_t _p1 = vld1q_f16(outptr + 8);\n                    float16x8_t _q0 = vld1q_f16(ptr);\n                    float16x8_t _q1 = vld1q_f16(ptr + 8);\n                    _p0 = vmulq_f16(_p0, _q0);\n                    _p1 = vmulq_f16(_p1, _q1);\n                    vst1q_f16(outptr, _p0);\n                    vst1q_f16(outptr + 8, _p1);\n\n                    ptr += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float16x8_t _p = vld1q_f16(outptr);\n                    float16x8_t _q = vld1q_f16(ptr);\n                    _p = vmulq_f16(_p, _q);\n                    vst1q_f16(outptr, _p);\n\n                    ptr += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float16x4_t _p = vld1_f16(outptr);\n                    float16x4_t _q = vld1_f16(ptr);\n                    _p = vmul_f16(_p, _q);\n                    vst1_f16(outptr, _p);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr *= *ptr;\n\n                    
ptr++;\n                    outptr++;\n                }\n            }\n        }\n    }\n    if (op_type == Operation_SUM)\n    {\n        if (coeffs.w == 0)\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                const __fp16* ptr1 = bottom_blob1.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p0 = vld1q_f16(ptr);\n                    float16x8_t _p1 = vld1q_f16(ptr + 8);\n                    float16x8_t _q0 = vld1q_f16(ptr1);\n                    float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                    _p0 = vaddq_f16(_p0, _q0);\n                    _p1 = vaddq_f16(_p1, _q1);\n                    vst1q_f16(outptr, _p0);\n                    vst1q_f16(outptr + 8, _p1);\n\n                    ptr += 16;\n                    ptr1 += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float16x8_t _p = vld1q_f16(ptr);\n                    float16x8_t _q = vld1q_f16(ptr1);\n                    _p = vaddq_f16(_p, _q);\n                    vst1q_f16(outptr, _p);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    float16x4_t _q = vld1_f16(ptr1);\n                    _p = vadd_f16(_p, _q);\n                    vst1_f16(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n 
               {\n                    *outptr = *ptr + *ptr1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            size_t b = 2;\n            for (; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob1.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n                    {\n                        float16x8_t _p0 = vld1q_f16(outptr);\n                        float16x8_t _p1 = vld1q_f16(outptr + 8);\n                        float16x8_t _q0 = vld1q_f16(ptr);\n                        float16x8_t _q1 = vld1q_f16(ptr + 8);\n                        _p0 = vaddq_f16(_p0, _q0);\n                        _p1 = vaddq_f16(_p1, _q1);\n                        vst1q_f16(outptr, _p0);\n                        vst1q_f16(outptr + 8, _p1);\n\n                        ptr += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float16x8_t _p = vld1q_f16(outptr);\n                        float16x8_t _q = vld1q_f16(ptr);\n                        _p = vaddq_f16(_p, _q);\n                        vst1q_f16(outptr, _p);\n\n                        ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float16x4_t _p = vld1_f16(outptr);\n                        float16x4_t _q = vld1_f16(ptr);\n                        _p = vadd_f16(_p, _q);\n                        vst1_f16(outptr, _p);\n\n                        ptr += 4;\n                  
      outptr += 4;\n                    }\n                    for (; i < size; i++)\n                    {\n                        *outptr += *ptr;\n\n                        ptr++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n        else\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                const __fp16* ptr1 = bottom_blob1.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                const __fp16 coeff0 = (__fp16)coeffs[0];\n                const __fp16 coeff1 = (__fp16)coeffs[1];\n                float16x8_t _coeff0 = vdupq_n_f16(coeff0);\n                float16x8_t _coeff1 = vdupq_n_f16(coeff1);\n\n                int i = 0;\n                for (; i + 15 < size; i += 16)\n                {\n                    float16x8_t _p0 = vld1q_f16(ptr);\n                    float16x8_t _p1 = vld1q_f16(ptr + 8);\n                    float16x8_t _q0 = vld1q_f16(ptr1);\n                    float16x8_t _q1 = vld1q_f16(ptr1 + 8);\n                    _p0 = vmulq_f16(_p0, _coeff0);\n                    _p1 = vmulq_f16(_p1, _coeff0);\n                    _p0 = vfmaq_f16(_p0, _q0, _coeff1);\n                    _p1 = vfmaq_f16(_p1, _q1, _coeff1);\n                    vst1q_f16(outptr, _p0);\n                    vst1q_f16(outptr + 8, _p1);\n\n                    ptr += 16;\n                    ptr1 += 16;\n                    outptr += 16;\n                }\n                for (; i + 7 < size; i += 8)\n                {\n                    float16x8_t _p = vld1q_f16(ptr);\n                    float16x8_t _q = vld1q_f16(ptr1);\n                    _p = vmulq_f16(_p, _coeff0);\n                    _p = vfmaq_f16(_p, _q, _coeff1);\n                   
 vst1q_f16(outptr, _p);\n\n                    ptr += 8;\n                    ptr1 += 8;\n                    outptr += 8;\n                }\n                for (; i + 3 < size; i += 4)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    float16x4_t _q = vld1_f16(ptr1);\n                    _p = vmul_f16(_p, vget_low_f16(_coeff0));\n                    _p = vfma_f16(_p, _q, vget_low_f16(_coeff1));\n                    vst1_f16(outptr, _p);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr = *ptr * coeff0 + *ptr1 * coeff1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            size_t b = 2;\n            for (; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob1.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    const __fp16 coeff = (__fp16)coeffs[b];\n                    float16x8_t _coeff = vdupq_n_f16(coeff);\n\n                    int i = 0;\n                    for (; i + 15 < size; i += 16)\n                    {\n                        float16x8_t _p0 = vld1q_f16(outptr);\n                        float16x8_t _p1 = vld1q_f16(outptr + 8);\n                        float16x8_t _q0 = vld1q_f16(ptr);\n                        float16x8_t _q1 = vld1q_f16(ptr + 8);\n                        _p0 = vfmaq_f16(_p0, _q0, _coeff);\n                        _p1 = vfmaq_f16(_p1, _q1, _coeff);\n                        vst1q_f16(outptr, _p0);\n                        vst1q_f16(outptr + 8, _p1);\n\n                    
    ptr += 16;\n                        outptr += 16;\n                    }\n                    for (; i + 7 < size; i += 8)\n                    {\n                        float16x8_t _p = vld1q_f16(outptr);\n                        float16x8_t _q = vld1q_f16(ptr);\n                        _p = vfmaq_f16(_p, _q, _coeff);\n                        vst1q_f16(outptr, _p);\n\n                        ptr += 8;\n                        outptr += 8;\n                    }\n                    for (; i + 3 < size; i += 4)\n                    {\n                        float16x4_t _p = vld1_f16(outptr);\n                        float16x4_t _q = vld1_f16(ptr);\n                        _p = vfma_f16(_p, _q, vget_low_f16(_coeff));\n                        vst1_f16(outptr, _p);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n                    for (; i < size; i++)\n                    {\n                        *outptr += *ptr * coeff;\n\n                        ptr++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/flatten_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"flatten_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nFlatten_arm::Flatten_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif // NCNN_BF16\n}\n\nint Flatten_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n    if (elembits == 8)\n        return forward_int8(bottom_blob, top_blob, opt);\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n    int dims = bottom_blob.dims;\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d;\n\n    int total = size * channels * elempack;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = total % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    if (out_elempack == 1)\n    {\n        return Flatten::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (dims == 2 && elempack == 1) // out_elempack == 4\n    {\n        top_blob = bottom_blob;\n        top_blob.dims = 1;\n        top_blob.w = total / out_elempack;\n        top_blob.h = 1;\n        top_blob.cstep = bottom_blob.cstep / out_elempack;\n        top_blob.elemsize = out_elemsize;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    top_blob.create(total / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 2)\n    {\n        if (elempack == 4) // out_elempack == 4\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* ptr = bottom_blob.row(i);\n                float* outptr0 = (float*)top_blob + w * i * 4;\n                float* outptr1 = (float*)top_blob + w * (i * 4 + 1);\n                float* outptr2 = (float*)top_blob + w * (i * 4 + 2);\n                float* outptr3 = (float*)top_blob + w * (i * 4 + 3);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    float32x4x4_t _v4 = vld4q_f32(ptr);\n                    vst1q_f32(outptr0, _v4.val[0]);\n                    vst1q_f32(outptr1, _v4.val[1]);\n                    vst1q_f32(outptr2, _v4.val[2]);\n                    vst1q_f32(outptr3, _v4.val[3]);\n\n                    ptr += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = 
ptr[2];\n                    *outptr3++ = ptr[3];\n\n                    ptr += 4;\n                }\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        if (elempack == 4) // out_elempack == 4\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                float* outptr0 = (float*)top_blob + size * q * 4;\n                float* outptr1 = (float*)top_blob + size * (q * 4 + 1);\n                float* outptr2 = (float*)top_blob + size * (q * 4 + 2);\n                float* outptr3 = (float*)top_blob + size * (q * 4 + 3);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4x4_t _v4 = vld4q_f32(ptr);\n                    vst1q_f32(outptr0, _v4.val[0]);\n                    vst1q_f32(outptr1, _v4.val[1]);\n                    vst1q_f32(outptr2, _v4.val[2]);\n                    vst1q_f32(outptr3, _v4.val[3]);\n\n                    ptr += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (elempack == 1) // out_elempack == 4\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                float* outptr = (float*)top_blob + size * q;\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i 
+= 4)\n                {\n                    float32x4_t _v = vld1q_f32(ptr);\n                    vst1q_f32(outptr, _v);\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr++ = *ptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Flatten_arm::forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int dims = bottom_blob.dims;\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d;\n\n    int total = size * channels * elempack;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n#if NCNN_ARM82\n        out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && total % 8 == 0 ? 8 : total % 4 == 0 ? 4 : 1;\n#else\n        out_elempack = total % 4 == 0 ? 
4 : 1;\n#endif\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    if (out_elempack == 1)\n    {\n        return Flatten::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (dims == 2 && elempack == 1) // out_elempack == 4 || out_elempack == 8\n    {\n        top_blob = bottom_blob;\n        top_blob.dims = 1;\n        top_blob.w = total / out_elempack;\n        top_blob.h = 1;\n        top_blob.cstep = bottom_blob.cstep / out_elempack;\n        top_blob.elemsize = out_elemsize;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    top_blob.create(total / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 2)\n    {\n#if NCNN_ARM82\n        if (elempack == 8) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const unsigned short* ptr = bottom_blob.row<const unsigned short>(i);\n                unsigned short* outptr0 = (unsigned short*)top_blob + w * i * 8;\n                unsigned short* outptr1 = (unsigned short*)top_blob + w * (i * 8 + 1);\n                unsigned short* outptr2 = (unsigned short*)top_blob + w * (i * 8 + 2);\n                unsigned short* outptr3 = (unsigned short*)top_blob + w * (i * 8 + 3);\n                unsigned short* outptr4 = (unsigned short*)top_blob + w * (i * 8 + 4);\n                unsigned short* outptr5 = (unsigned short*)top_blob + w * (i * 8 + 5);\n                unsigned short* outptr6 = (unsigned short*)top_blob + w * (i * 8 + 6);\n                unsigned short* outptr7 = (unsigned short*)top_blob + w * (i * 8 + 7);\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x8x4_t _v4 = vld4q_u16(ptr);\n                    uint16x8_t _v_01 = vuzp1q_u16(_v4.val[0], _v4.val[1]);\n                    
uint16x8_t _v_23 = vuzp1q_u16(_v4.val[2], _v4.val[3]);\n                    uint16x8_t _v_45 = vuzp2q_u16(_v4.val[0], _v4.val[1]);\n                    uint16x8_t _v_67 = vuzp2q_u16(_v4.val[2], _v4.val[3]);\n                    vst1_u16(outptr0, vget_low_u16(_v_01));\n                    vst1_u16(outptr1, vget_high_u16(_v_01));\n                    vst1_u16(outptr2, vget_low_u16(_v_23));\n                    vst1_u16(outptr3, vget_high_u16(_v_23));\n                    vst1_u16(outptr4, vget_low_u16(_v_45));\n                    vst1_u16(outptr5, vget_high_u16(_v_45));\n                    vst1_u16(outptr6, vget_low_u16(_v_67));\n                    vst1_u16(outptr7, vget_high_u16(_v_67));\n\n                    ptr += 32;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                    outptr4 += 4;\n                    outptr5 += 4;\n                    outptr6 += 4;\n                    outptr7 += 4;\n                }\n                for (; j < w; j++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n                    *outptr4++ = ptr[4];\n                    *outptr5++ = ptr[5];\n                    *outptr6++ = ptr[6];\n                    *outptr7++ = ptr[7];\n\n                    ptr += 8;\n                }\n            }\n        }\n#endif // NCNN_ARM82\n\n        if (elempack == 4) // out_elempack == 4 || out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const unsigned short* ptr = bottom_blob.row<const unsigned short>(i);\n                unsigned short* outptr0 = (unsigned short*)top_blob + w * i * 4;\n                unsigned short* outptr1 = (unsigned short*)top_blob + w * (i * 4 + 1);\n             
   unsigned short* outptr2 = (unsigned short*)top_blob + w * (i * 4 + 2);\n                unsigned short* outptr3 = (unsigned short*)top_blob + w * (i * 4 + 3);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x4x4_t _v4 = vld4_u16(ptr);\n                    vst1_u16(outptr0, _v4.val[0]);\n                    vst1_u16(outptr1, _v4.val[1]);\n                    vst1_u16(outptr2, _v4.val[2]);\n                    vst1_u16(outptr3, _v4.val[3]);\n\n                    ptr += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n\n                    ptr += 4;\n                }\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n#if NCNN_ARM82\n        if (elempack == 8) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                unsigned short* outptr0 = (unsigned short*)top_blob + size * q * 8;\n                unsigned short* outptr1 = (unsigned short*)top_blob + size * (q * 8 + 1);\n                unsigned short* outptr2 = (unsigned short*)top_blob + size * (q * 8 + 2);\n                unsigned short* outptr3 = (unsigned short*)top_blob + size * (q * 8 + 3);\n                unsigned short* outptr4 = (unsigned short*)top_blob + size * (q * 8 + 4);\n                unsigned short* outptr5 = (unsigned short*)top_blob + size * (q * 8 + 5);\n                unsigned short* outptr6 = (unsigned short*)top_blob + size * (q * 8 + 6);\n         
       unsigned short* outptr7 = (unsigned short*)top_blob + size * (q * 8 + 7);\n\n                int i = 0;\n                for (; i + 3 < size; i += 4)\n                {\n                    uint16x8x4_t _v4 = vld4q_u16(ptr);\n                    uint16x8_t _v_01 = vuzp1q_u16(_v4.val[0], _v4.val[1]);\n                    uint16x8_t _v_23 = vuzp1q_u16(_v4.val[2], _v4.val[3]);\n                    uint16x8_t _v_45 = vuzp2q_u16(_v4.val[0], _v4.val[1]);\n                    uint16x8_t _v_67 = vuzp2q_u16(_v4.val[2], _v4.val[3]);\n                    vst1_u16(outptr0, vget_low_u16(_v_01));\n                    vst1_u16(outptr1, vget_high_u16(_v_01));\n                    vst1_u16(outptr2, vget_low_u16(_v_23));\n                    vst1_u16(outptr3, vget_high_u16(_v_23));\n                    vst1_u16(outptr4, vget_low_u16(_v_45));\n                    vst1_u16(outptr5, vget_high_u16(_v_45));\n                    vst1_u16(outptr6, vget_low_u16(_v_67));\n                    vst1_u16(outptr7, vget_high_u16(_v_67));\n\n                    ptr += 32;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                    outptr4 += 4;\n                    outptr5 += 4;\n                    outptr6 += 4;\n                    outptr7 += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n                    *outptr4++ = ptr[4];\n                    *outptr5++ = ptr[5];\n                    *outptr6++ = ptr[6];\n                    *outptr7++ = ptr[7];\n\n                    ptr += 8;\n                }\n            }\n        }\n#endif // NCNN_ARM82\n\n        if (elempack == 4) // out_elempack == 4 || out_elempack == 8\n        {\n            #pragma omp parallel for 
num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                unsigned short* outptr0 = (unsigned short*)top_blob + size * q * 4;\n                unsigned short* outptr1 = (unsigned short*)top_blob + size * (q * 4 + 1);\n                unsigned short* outptr2 = (unsigned short*)top_blob + size * (q * 4 + 2);\n                unsigned short* outptr3 = (unsigned short*)top_blob + size * (q * 4 + 3);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    uint16x4x4_t _v4 = vld4_u16(ptr);\n                    vst1_u16(outptr0, _v4.val[0]);\n                    vst1_u16(outptr1, _v4.val[1]);\n                    vst1_u16(outptr2, _v4.val[2]);\n                    vst1_u16(outptr3, _v4.val[3]);\n\n                    ptr += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (elempack == 1) // out_elempack == 4 || out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                unsigned short* outptr = (unsigned short*)top_blob + size * q;\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    uint16x4_t _v = vld1_u16(ptr);\n                    vst1_u16(outptr, _v);\n                    ptr += 4;\n  
                  outptr += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr++ = *ptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Flatten_arm::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int dims = bottom_blob.dims;\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d;\n\n    int total = size * channels * elempack;\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = total % 8 == 0 ? 8 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    if (out_elempack == 1)\n    {\n        return Flatten::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (dims == 2 && elempack == 1) // out_elempack == 8\n    {\n        top_blob = bottom_blob;\n        top_blob.dims = 1;\n        top_blob.w = total / out_elempack;\n        top_blob.h = 1;\n        top_blob.cstep = bottom_blob.cstep / out_elempack;\n        top_blob.elemsize = out_elemsize;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    top_blob.create(total / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 2)\n    {\n        if (elempack == 8) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const signed char* ptr = bottom_blob.row<const signed char>(i);\n                signed char* outptr0 = (signed char*)top_blob + w * i * 8;\n                signed char* outptr1 = (signed char*)top_blob + w * (i * 8 + 1);\n          
      signed char* outptr2 = (signed char*)top_blob + w * (i * 8 + 2);\n                signed char* outptr3 = (signed char*)top_blob + w * (i * 8 + 3);\n                signed char* outptr4 = (signed char*)top_blob + w * (i * 8 + 4);\n                signed char* outptr5 = (signed char*)top_blob + w * (i * 8 + 5);\n                signed char* outptr6 = (signed char*)top_blob + w * (i * 8 + 6);\n                signed char* outptr7 = (signed char*)top_blob + w * (i * 8 + 7);\n\n                int j = 0;\n                for (; j < w; j++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n                    *outptr4++ = ptr[4];\n                    *outptr5++ = ptr[5];\n                    *outptr6++ = ptr[6];\n                    *outptr7++ = ptr[7];\n\n                    ptr += 8;\n                }\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        if (elempack == 8) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const signed char* ptr = bottom_blob.channel(q);\n                signed char* outptr0 = (signed char*)top_blob + size * q * 8;\n                signed char* outptr1 = (signed char*)top_blob + size * (q * 8 + 1);\n                signed char* outptr2 = (signed char*)top_blob + size * (q * 8 + 2);\n                signed char* outptr3 = (signed char*)top_blob + size * (q * 8 + 3);\n                signed char* outptr4 = (signed char*)top_blob + size * (q * 8 + 4);\n                signed char* outptr5 = (signed char*)top_blob + size * (q * 8 + 5);\n                signed char* outptr6 = (signed char*)top_blob + size * (q * 8 + 6);\n                signed char* outptr7 = (signed char*)top_blob + size * (q * 8 + 7);\n\n                int i = 0;\n          
      for (; i < size; i++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n                    *outptr4++ = ptr[4];\n                    *outptr5++ = ptr[5];\n                    *outptr6++ = ptr[6];\n                    *outptr7++ = ptr[7];\n\n                    ptr += 8;\n                }\n            }\n        }\n\n        if (elempack == 1) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const signed char* ptr = bottom_blob.channel(q);\n                signed char* outptr = (signed char*)top_blob + size * q;\n\n                int i = 0;\n                for (; i < size; i++)\n                {\n                    *outptr++ = *ptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/flatten_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_FLATTEN_ARM_H\n#define LAYER_FLATTEN_ARM_H\n\n#include \"flatten.h\"\n\nnamespace ncnn {\n\nclass Flatten_arm : public Flatten\n{\npublic:\n    Flatten_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_FLATTEN_ARM_H\n"
  },
  {
    "path": "src/layer/arm/gelu_arm.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gelu_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nGELU_arm::GELU_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint GELU_arm::create_pipeline(const Option& /*opt*/)\n{\n    if (!fast_gelu)\n    {\n        support_packing = false;\n        support_fp16_storage = false;\n        support_bf16_storage = false;\n    }\n    return 0;\n}\n\nint GELU_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    if (!fast_gelu)\n    {\n        return GELU::forward_inplace(bottom_top_blob, opt);\n    }\n\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int elempack = bottom_top_blob.elempack;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _pLoad = vld1q_f32(ptr);\n\n            float32x4_t _blob = vmulq_f32(_pLoad, _pLoad);\n            _blob = 
vmulq_f32(_pLoad, _blob);\n            _blob = vmulq_f32(vdupq_n_f32(0.044715f * 0.79788452f), _blob);\n            _blob = vmlaq_f32(_blob, vdupq_n_f32(0.79788452f), _pLoad);\n            _blob = tanh_ps(_blob);\n            _blob = vaddq_f32(vdupq_n_f32(1.f), _blob);\n            _blob = vmulq_f32(vdupq_n_f32(0.5f), vmulq_f32(_blob, _pLoad));\n            vst1q_f32(ptr, _blob);\n            ptr += 4;\n        }\n#endif\n        for (; i < size; i++)\n        {\n            // y = 0.5x * (1 + tanh(sqrt(2/Pi) * (x + 0.044715x^3)))\n            *ptr = 0.5f * *ptr * (1.0f + tanhf(0.79788452f * (*ptr + 0.044715f * *ptr * *ptr * *ptr)));\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint GELU_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int elempack = bottom_top_blob.elempack;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _pLoad = bfloat2float(vld1_u16(ptr));\n\n            float32x4_t _blob = vmulq_f32(_pLoad, _pLoad);\n            _blob = vmulq_f32(_pLoad, _blob);\n            _blob = vmulq_f32(vdupq_n_f32(0.044715f * 0.79788452f), _blob);\n            _blob = vmlaq_f32(_blob, vdupq_n_f32(0.79788452f), _pLoad);\n            _blob = tanh_ps(_blob);\n            _blob = vaddq_f32(vdupq_n_f32(1.f), _blob);\n            _blob = vmulq_f32(vdupq_n_f32(0.5f), vmulq_f32(_blob, _pLoad));\n            vst1_u16(ptr, float2bfloat(_blob));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            v = 0.5f * v 
* (1.0f + tanhf(0.79788452f * (v + 0.044715f * v * v * v)));\n            *ptr = float32_to_bfloat16(v);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gelu_arm.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GELU_ARM_H\n#define LAYER_GELU_ARM_H\n\n#include \"gelu.h\"\n\nnamespace ncnn {\n\nclass GELU_arm : public GELU\n{\npublic:\n    GELU_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GELU_ARM_H\n"
  },
  {
    "path": "src/layer/arm/gelu_arm_asimdhp.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gelu_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#include \"neon_mathfun.h\"\n#if NCNN_ARM82\n#include \"neon_mathfun_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint GELU_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int elempack = bottom_top_blob.elempack;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = (__fp16*)bottom_top_blob.channel(q);\n\n        int i = 0;\n\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _pLoad = vcvt_f32_f16(vld1_f16(ptr));\n\n            float32x4_t _blob = vmulq_f32(_pLoad, _pLoad);\n            _blob = vmulq_f32(_pLoad, _blob);\n            _blob = vmulq_f32(vdupq_n_f32(0.044715f * 0.79788452f), _blob);\n            _blob = vmlaq_f32(_blob, vdupq_n_f32(0.79788452f), _pLoad);\n            _blob = tanh_ps(_blob);\n            _blob = vaddq_f32(vdupq_n_f32(1.f), _blob);\n            _blob = vmulq_f32(vdupq_n_f32(0.5f), vmulq_f32(_blob, _pLoad));\n            vst1_f16(ptr, vcvt_f16_f32(_blob));\n            ptr += 4;\n        }\n\n        for (; i < size; i++)\n        {\n            float v = (float)*ptr;\n            v = 0.5f * v * (1.0f + tanhf(0.79788452f * (v + 0.044715f * v * v * v)));\n            *ptr = (__fp16)v;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint GELU_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int elempack = bottom_top_blob.elempack;\n    int channels = 
bottom_top_blob.c;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = (__fp16*)bottom_top_blob.channel(q);\n\n        int i = 0;\n\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _pLoad = vld1q_f16(ptr);\n\n            float16x8_t _blob = vmulq_f16(_pLoad, _pLoad);\n            _blob = vmulq_f16(_pLoad, _blob);\n            _blob = vmulq_f16(vdupq_n_f16(0.044715f * 0.79788452f), _blob);\n            _blob = vfmaq_f16(_blob, vdupq_n_f16(0.79788452f), _pLoad);\n            _blob = tanh_ps_f16(_blob);\n            _blob = vaddq_f16(vdupq_n_f16(1.f), _blob);\n            _blob = vmulq_f16(vdupq_n_f16(0.5f), vmulq_f16(_blob, _pLoad));\n            vst1q_f16(ptr, _blob);\n            ptr += 8;\n        }\n\n        for (; i < size; i++)\n        {\n            *ptr = (__fp16)0.5f * *ptr * (__fp16)(1.0f + tanhf((__fp16)0.79788452f * (*ptr + (__fp16)0.044715f * *ptr * *ptr * *ptr)));\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gemm_arm.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gemm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#if NCNN_BF16\n#include \"gemm_bf16s_fp16s.h\"\n#include \"gemm_bf16s.h\"\n#endif\n\n#if NCNN_INT8\n#include \"gemm_int8.h\"\n#if NCNN_BF16\n#include \"gemm_int8_bf16s.h\"\n#endif\n#endif\n\nGemm_arm::Gemm_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_VFPV4\n    support_fp16_storage = cpu_support_arm_vfpv4();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n\n    nT = 0;\n}\n\nvoid pack_A_tile(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    float* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep + k * 4;\n            const float* p1 = (const float*)A + (i + ii + 4) * A_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                vst1q_f32(pp + 4, vld1q_f32(p1));\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n            const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n            const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n            const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n            const float* p4 = (const float*)A + (i + ii + 4) * A_hstep + k;\n            const float* p5 = (const float*)A + (i + ii + 5) * A_hstep + k;\n            
const float* p6 = (const float*)A + (i + ii + 6) * A_hstep + k;\n            const float* p7 = (const float*)A + (i + ii + 7) * A_hstep + k;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _r0l = vld1q_f32(p0);\n                float32x4_t _r0h = vld1q_f32(p0 + 4);\n                float32x4_t _r1l = vld1q_f32(p1);\n                float32x4_t _r1h = vld1q_f32(p1 + 4);\n                float32x4_t _r2l = vld1q_f32(p2);\n                float32x4_t _r2h = vld1q_f32(p2 + 4);\n                float32x4_t _r3l = vld1q_f32(p3);\n                float32x4_t _r3h = vld1q_f32(p3 + 4);\n                float32x4_t _r4l = vld1q_f32(p4);\n                float32x4_t _r4h = vld1q_f32(p4 + 4);\n                float32x4_t _r5l = vld1q_f32(p5);\n                float32x4_t _r5h = vld1q_f32(p5 + 4);\n                float32x4_t _r6l = vld1q_f32(p6);\n                float32x4_t _r6h = vld1q_f32(p6 + 4);\n                float32x4_t _r7l = vld1q_f32(p7);\n                float32x4_t _r7h = vld1q_f32(p7 + 4);\n                transpose8x8_ps(_r0l, _r0h, _r1l, _r1h, _r2l, _r2h, _r3l, _r3h, _r4l, _r4h, _r5l, _r5h, _r6l, _r6h, _r7l, _r7h);\n                vst1q_f32(pp, _r0l);\n                vst1q_f32(pp + 4, _r0h);\n                vst1q_f32(pp + 8, _r1l);\n                vst1q_f32(pp + 12, _r1h);\n                vst1q_f32(pp + 8 * 2, _r2l);\n                vst1q_f32(pp + 8 * 2 + 4, _r2h);\n                vst1q_f32(pp + 8 * 3, _r3l);\n                vst1q_f32(pp + 8 * 3 + 4, _r3h);\n                vst1q_f32(pp + 8 * 4, _r4l);\n                vst1q_f32(pp + 8 * 4 + 4, _r4h);\n                vst1q_f32(pp + 8 * 5, _r5l);\n                vst1q_f32(pp + 8 * 5 + 4, _r5h);\n                vst1q_f32(pp + 8 * 6, _r6l);\n                vst1q_f32(pp + 8 * 6 + 4, _r6h);\n                vst1q_f32(pp + 8 * 7, _r7l);\n                vst1q_f32(pp + 8 * 7 + 4, _r7h);\n                pp += 64;\n                p0 += 
8;\n                p1 += 8;\n                p2 += 8;\n                p3 += 8;\n                p4 += 8;\n                p5 += 8;\n                p6 += 8;\n                p7 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp[4] = p4[0];\n                pp[5] = p5[0];\n                pp[6] = p6[0];\n                pp[7] = p7[0];\n                pp += 8;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n                p4++;\n                p5++;\n                p6++;\n                p7++;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n            const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n            const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n            const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x4_t _r0123;\n                _r0123.val[0] = vld1q_f32(p0);\n                _r0123.val[1] = vld1q_f32(p1);\n                _r0123.val[2] = vld1q_f32(p2);\n                _r0123.val[3] = vld1q_f32(p3);\n                vst4q_f32(pp, _r0123);\n                pp += 16;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n            }\n            for (; kk 
< max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp += 4;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        // if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n            const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x2_t _r01;\n                _r01.val[0] = vld1q_f32(p0);\n                _r01.val[1] = vld1q_f32(p1);\n                vst2q_f32(pp, _r01);\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp += 2;\n                p0++;\n                p1++;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        // if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n\n    float* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x4_t _r0123 = vld4q_f32(p0);\n                float32x4x4_t _r4567 = vld4q_f32(p0 + 16);\n                vst1q_f32(pp, _r0123.val[0]);\n                vst1q_f32(pp + 4, _r4567.val[0]);\n                vst1q_f32(pp + 4 * 2, _r0123.val[1]);\n                vst1q_f32(pp + 4 * 3, _r4567.val[1]);\n                vst1q_f32(pp + 4 * 4, _r0123.val[2]);\n                vst1q_f32(pp + 4 * 5, _r4567.val[2]);\n                vst1q_f32(pp + 4 * 6, _r0123.val[3]);\n                vst1q_f32(pp + 4 * 7, _r4567.val[3]);\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                vst1q_f32(pp + 4, vld1q_f32(p0 + 4));\n                pp += 8;\n                p0 += A_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x4_t _r0123 = vld4q_f32(p0);\n                vst1q_f32(pp, _r0123.val[0]);\n                vst1q_f32(pp + 4, _r0123.val[1]);\n                vst1q_f32(pp + 4 * 2, _r0123.val[2]);\n                vst1q_f32(pp + 4 * 3, _r0123.val[3]);\n                pp += 16;\n                p0 += A_hstep * 4;\n            }\n        }\n        
if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += A_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x2_t _r01;\n                _r01.val[0] = vld1q_f32(p0);\n                _r01.val[1] = vld1q_f32(p0 + 4);\n                vst2q_f32(pp, _r01);\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                pp += 2;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0 += A_hstep;\n            }\n        }\n    }\n}\n\nstatic void 
pack_B_tile(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    float* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k * 4;\n            const float* p1 = (const float*)B + (j + jj + 4) * B_hstep + k * 4;\n            const float* p2 = (const float*)B + (j + jj + 8) * B_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                vst1q_f32(pp + 4, vld1q_f32(p1));\n                vst1q_f32(pp + 8, vld1q_f32(p2));\n                pp += 12;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n            const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n            const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n            const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + k;\n            const float* p4 = (const float*)B + (j + jj + 4) * B_hstep + k;\n            const float* p5 = (const float*)B + (j + jj + 5) * B_hstep + k;\n            const float* p6 = (const float*)B + (j + jj + 6) * B_hstep + k;\n            const float* p7 = (const float*)B + (j + jj + 7) * B_hstep + k;\n            const float* p8 = (const float*)B + (j + jj + 8) * B_hstep + k;\n            const float* p9 = (const float*)B + (j + jj + 9) * B_hstep + k;\n            const float* pa = (const float*)B + (j + jj + 10) * B_hstep + k;\n            const float* pb = (const float*)B + (j + jj + 11) * B_hstep + k;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                
float32x4_t _r0 = vld1q_f32(p0);\n                float32x4_t _r1 = vld1q_f32(p1);\n                float32x4_t _r2 = vld1q_f32(p2);\n                float32x4_t _r3 = vld1q_f32(p3);\n                float32x4_t _r4 = vld1q_f32(p4);\n                float32x4_t _r5 = vld1q_f32(p5);\n                float32x4_t _r6 = vld1q_f32(p6);\n                float32x4_t _r7 = vld1q_f32(p7);\n                float32x4_t _r8 = vld1q_f32(p8);\n                float32x4_t _r9 = vld1q_f32(p9);\n                float32x4_t _ra = vld1q_f32(pa);\n                float32x4_t _rb = vld1q_f32(pb);\n\n                transpose4x4_ps(_r0, _r1, _r2, _r3);\n                transpose4x4_ps(_r4, _r5, _r6, _r7);\n                transpose4x4_ps(_r8, _r9, _ra, _rb);\n\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r4);\n                vst1q_f32(pp + 4 * 2, _r8);\n                vst1q_f32(pp + 4 * 3, _r1);\n                vst1q_f32(pp + 4 * 4, _r5);\n                vst1q_f32(pp + 4 * 5, _r9);\n                vst1q_f32(pp + 4 * 6, _r2);\n                vst1q_f32(pp + 4 * 7, _r6);\n                vst1q_f32(pp + 4 * 8, _ra);\n                vst1q_f32(pp + 4 * 9, _r3);\n                vst1q_f32(pp + 4 * 10, _r7);\n                vst1q_f32(pp + 4 * 11, _rb);\n                pp += 48;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n                p4 += 4;\n                p5 += 4;\n                p6 += 4;\n                p7 += 4;\n                p8 += 4;\n                p9 += 4;\n                pa += 4;\n                pb += 4;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp[4] = p4[0];\n                pp[5] = p5[0];\n                pp[6] = p6[0];\n                pp[7] = p7[0];\n                pp[8] = p8[0];\n                pp[9] = 
p9[0];\n                pp[10] = pa[0];\n                pp[11] = pb[0];\n                pp += 12;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n                p4++;\n                p5++;\n                p6++;\n                p7++;\n                p8++;\n                p9++;\n                pa++;\n                pb++;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k * 4;\n            const float* p1 = (const float*)B + (j + jj + 4) * B_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                vst1q_f32(pp + 4, vld1q_f32(p1));\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n            const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n            const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n            const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + k;\n            const float* p4 = (const float*)B + (j + jj + 4) * B_hstep + k;\n            const float* p5 = (const float*)B + (j + jj + 5) * B_hstep + k;\n            const float* p6 = (const float*)B + (j + jj + 6) * B_hstep + k;\n            const float* p7 = (const float*)B + (j + jj + 7) * B_hstep + k;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _r0 = vld1q_f32(p0);\n                float32x4_t _r1 = vld1q_f32(p1);\n                float32x4_t _r2 = vld1q_f32(p2);\n                float32x4_t _r3 = vld1q_f32(p3);\n                float32x4_t _r4 = vld1q_f32(p4);\n                float32x4_t _r5 = vld1q_f32(p5);\n                
float32x4_t _r6 = vld1q_f32(p6);\n                float32x4_t _r7 = vld1q_f32(p7);\n\n                transpose4x4_ps(_r0, _r1, _r2, _r3);\n                transpose4x4_ps(_r4, _r5, _r6, _r7);\n\n                vst1q_f32(pp, _r0);\n                vst1q_f32(pp + 4, _r4);\n                vst1q_f32(pp + 4 * 2, _r1);\n                vst1q_f32(pp + 4 * 3, _r5);\n                vst1q_f32(pp + 4 * 4, _r2);\n                vst1q_f32(pp + 4 * 5, _r6);\n                vst1q_f32(pp + 4 * 6, _r3);\n                vst1q_f32(pp + 4 * 7, _r7);\n                pp += 32;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n                p4 += 4;\n                p5 += 4;\n                p6 += 4;\n                p7 += 4;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp[4] = p4[0];\n                pp[5] = p5[0];\n                pp[6] = p6[0];\n                pp[7] = p7[0];\n                pp += 8;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n                p4++;\n                p5++;\n                p6++;\n                p7++;\n            }\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n            const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n            const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n            const float* p3 = 
(const float*)B + (j + jj + 3) * B_hstep + k;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x4_t _r0123;\n                _r0123.val[0] = vld1q_f32(p0);\n                _r0123.val[1] = vld1q_f32(p1);\n                _r0123.val[2] = vld1q_f32(p2);\n                _r0123.val[3] = vld1q_f32(p3);\n                vst4q_f32(pp, _r0123);\n                pp += 16;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp += 4;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        // if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n            const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x2_t _r01;\n                _r01.val[0] = vld1q_f32(p0);\n                _r01.val[1] = vld1q_f32(p1);\n                vst2q_f32(pp, _r01);\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp += 2;\n                p0++;\n                p1++;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        // if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 3 < max_kk; kk += 4)\n  
          {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    float* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x4_t _r0123 = vld4q_f32(p0);\n                float32x4x4_t _r4567 = vld4q_f32(p0 + 16);\n                float32x4x4_t _r89ab = vld4q_f32(p0 + 32);\n                vst1q_f32(pp, _r0123.val[0]);\n                vst1q_f32(pp + 4, _r4567.val[0]);\n                vst1q_f32(pp + 4 * 2, _r89ab.val[0]);\n                vst1q_f32(pp + 4 * 3, _r0123.val[1]);\n                vst1q_f32(pp + 4 * 4, _r4567.val[1]);\n                vst1q_f32(pp + 4 * 5, _r89ab.val[1]);\n                vst1q_f32(pp + 4 * 6, _r0123.val[2]);\n                vst1q_f32(pp + 4 * 7, _r4567.val[2]);\n                vst1q_f32(pp + 4 * 8, _r89ab.val[2]);\n                vst1q_f32(pp + 4 * 9, _r0123.val[3]);\n                vst1q_f32(pp + 4 * 10, _r4567.val[3]);\n                vst1q_f32(pp + 4 * 11, _r89ab.val[3]);\n                pp += 48;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, 
vld1q_f32(p0));\n                vst1q_f32(pp + 4, vld1q_f32(p0 + 4));\n                vst1q_f32(pp + 8, vld1q_f32(p0 + 8));\n                pp += 12;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x4_t _r0123 = vld4q_f32(p0);\n                float32x4x4_t _r4567 = vld4q_f32(p0 + 16);\n                vst1q_f32(pp, _r0123.val[0]);\n                vst1q_f32(pp + 4, _r4567.val[0]);\n                vst1q_f32(pp + 4 * 2, _r0123.val[1]);\n                vst1q_f32(pp + 4 * 3, _r4567.val[1]);\n                vst1q_f32(pp + 4 * 4, _r0123.val[2]);\n                vst1q_f32(pp + 4 * 5, _r4567.val[2]);\n                vst1q_f32(pp + 4 * 6, _r0123.val[3]);\n                vst1q_f32(pp + 4 * 7, _r4567.val[3]);\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                vst1q_f32(pp + 4, vld1q_f32(p0 + 4));\n                pp += 8;\n                p0 += B_hstep;\n            }\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x4_t _r0123 = vld4q_f32(p0);\n                vst1q_f32(pp, _r0123.val[0]);\n                vst1q_f32(pp + 4, _r0123.val[1]);\n                vst1q_f32(pp + 4 * 2, _r0123.val[2]);\n                vst1q_f32(pp + 4 
* 3, _r0123.val[3]);\n                pp += 16;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4x2_t _r01;\n                _r01.val[0] = vld1q_f32(p0);\n                _r01.val[1] = vld1q_f32(p0 + 4);\n                vst2q_f32(pp, _r01);\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                pp += 2;\n                p0 += B_hstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                vst1q_f32(pp, vld1q_f32(p0));\n                pp += 4;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = 
p0[0];\n                pp += 1;\n                p0 += B_hstep;\n            }\n        }\n    }\n}\n\nstatic void transpose_unpack_output_tile(const Mat& topT, Mat& top_blob, int i, int max_ii, int j, int max_jj)\n{\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const float* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (out_elempack == 4)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                float32x4x4_t _r0;\n                float32x4x4_t _r1;\n                _r0.val[0] = vld1q_f32(pp);\n                _r1.val[0] = vld1q_f32(pp + 4);\n                _r0.val[1] = vld1q_f32(pp + 8);\n                _r1.val[1] = vld1q_f32(pp + 12);\n                _r0.val[2] = vld1q_f32(pp + 16);\n                _r1.val[2] = vld1q_f32(pp + 20);\n                _r0.val[3] = vld1q_f32(pp + 24);\n                _r1.val[3] = vld1q_f32(pp + 28);\n                vst4q_f32(p0, _r0);\n                vst4q_f32(p0 + 16, _r1);\n                pp += 32;\n                p0 += out_hstep * 4;\n            }\n        }\n        if (out_elempack == 1)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                float32x4_t _r0 = vld1q_f32(pp);\n                float32x4_t _r1 = vld1q_f32(pp + 4);\n                vst1q_f32(p0, _r0);\n                vst1q_f32(p0 + 4, _r1);\n                pp += 8;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        if (out_elempack == 4)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; 
jj += 4)\n            {\n                float32x4x4_t _r0123;\n                _r0123.val[0] = vld1q_f32(pp);\n                _r0123.val[1] = vld1q_f32(pp + 4);\n                _r0123.val[2] = vld1q_f32(pp + 8);\n                _r0123.val[3] = vld1q_f32(pp + 12);\n                vst4q_f32(p0, _r0123);\n                pp += 16;\n                p0 += out_hstep * 4;\n            }\n        }\n        if (out_elempack == 1)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                float32x4_t _r0 = vld1q_f32(pp);\n                vst1q_f32(p0, _r0);\n                pp += 4;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        if (out_elempack == 4)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                p0[0] = pp[0];\n                p0[1] = pp[2];\n                p0[2] = pp[4];\n                p0[3] = pp[6];\n                p0[4] = pp[1];\n                p0[5] = pp[3];\n                p0[6] = pp[5];\n                p0[7] = pp[7];\n                pp += 8;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = pp[0];\n                p0[1] = pp[1];\n                pp += 2;\n                p0 += out_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n#if __ARM_NEON\n        if (out_elempack == 4)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n           
     float32x4_t _r0 = vld1q_f32(pp);\n                vst1q_f32(p0, _r0);\n                pp += 4;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            float* p0 = (float*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = pp[0];\n                pp += 1;\n                p0 += out_hstep;\n            }\n        }\n    }\n}\n\nstatic void gemm_transB_packed_tile(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end)\n{\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const float* pAT = AT_tile;\n    const float* pBT = BT_tile;\n    const float* pC = CT_tile;\n\n    float* outptr = topT_tile;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n       
     float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n            float32x4_t _sum80;\n            float32x4_t _sum81;\n            float32x4_t _sum90;\n            float32x4_t _sum91;\n            float32x4_t _suma0;\n            float32x4_t _suma1;\n            float32x4_t _sumb0;\n            float32x4_t _sumb1;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n                _sum40 = vdupq_n_f32(0.f);\n                _sum41 = vdupq_n_f32(0.f);\n                _sum50 = vdupq_n_f32(0.f);\n                _sum51 = vdupq_n_f32(0.f);\n                _sum60 = vdupq_n_f32(0.f);\n                _sum61 = vdupq_n_f32(0.f);\n                _sum70 = vdupq_n_f32(0.f);\n                _sum71 = vdupq_n_f32(0.f);\n                _sum80 = vdupq_n_f32(0.f);\n                _sum81 = vdupq_n_f32(0.f);\n                _sum90 = vdupq_n_f32(0.f);\n                _sum91 = vdupq_n_f32(0.f);\n                _suma0 = vdupq_n_f32(0.f);\n                _suma1 = vdupq_n_f32(0.f);\n                _sumb0 = vdupq_n_f32(0.f);\n                _sumb1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[0]);\n                        _sum11 = vdupq_n_f32(pC[0]);\n                        _sum20 = vdupq_n_f32(pC[0]);\n                        _sum21 = vdupq_n_f32(pC[0]);\n                        _sum30 = vdupq_n_f32(pC[0]);\n                  
      _sum31 = vdupq_n_f32(pC[0]);\n                        _sum40 = vdupq_n_f32(pC[0]);\n                        _sum41 = vdupq_n_f32(pC[0]);\n                        _sum50 = vdupq_n_f32(pC[0]);\n                        _sum51 = vdupq_n_f32(pC[0]);\n                        _sum60 = vdupq_n_f32(pC[0]);\n                        _sum61 = vdupq_n_f32(pC[0]);\n                        _sum70 = vdupq_n_f32(pC[0]);\n                        _sum71 = vdupq_n_f32(pC[0]);\n                        _sum80 = vdupq_n_f32(pC[0]);\n                        _sum81 = vdupq_n_f32(pC[0]);\n                        _sum90 = vdupq_n_f32(pC[0]);\n                        _sum91 = vdupq_n_f32(pC[0]);\n                        _suma0 = vdupq_n_f32(pC[0]);\n                        _suma1 = vdupq_n_f32(pC[0]);\n                        _sumb0 = vdupq_n_f32(pC[0]);\n                        _sumb1 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                        _sum40 = _sum00;\n                        _sum41 = _sum01;\n                        _sum50 = _sum00;\n                        _sum51 = _sum01;\n                        _sum60 = _sum00;\n                        _sum61 = _sum01;\n                        _sum70 = _sum00;\n                        _sum71 = _sum01;\n                        _sum80 = _sum00;\n                        _sum81 = _sum01;\n                        _sum90 = _sum00;\n                        _sum91 = _sum01;\n                        _suma0 = _sum00;\n                        _suma1 = _sum01;\n                        _sumb0 = 
_sum00;\n                        _sumb1 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        _sum40 = vld1q_f32(pC + 4 * 8);\n                        _sum41 = vld1q_f32(pC + 4 * 9);\n                        _sum50 = vld1q_f32(pC + 4 * 10);\n                        _sum51 = vld1q_f32(pC + 4 * 11);\n                        _sum60 = vld1q_f32(pC + 4 * 12);\n                        _sum61 = vld1q_f32(pC + 4 * 13);\n                        _sum70 = vld1q_f32(pC + 4 * 14);\n                        _sum71 = vld1q_f32(pC + 4 * 15);\n                        _sum80 = vld1q_f32(pC + 4 * 16);\n                        _sum81 = vld1q_f32(pC + 4 * 17);\n                        _sum90 = vld1q_f32(pC + 4 * 18);\n                        _sum91 = vld1q_f32(pC + 4 * 19);\n                        _suma0 = vld1q_f32(pC + 4 * 20);\n                        _suma1 = vld1q_f32(pC + 4 * 21);\n                        _sumb0 = vld1q_f32(pC + 4 * 22);\n                        _sumb1 = vld1q_f32(pC + 4 * 23);\n                        pC += 96;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum40 = vdupq_n_f32(pC[4]);\n                        _sum50 = vdupq_n_f32(pC[5]);\n                        
_sum60 = vdupq_n_f32(pC[6]);\n                        _sum70 = vdupq_n_f32(pC[7]);\n                        _sum80 = vdupq_n_f32(pC[8]);\n                        _sum90 = vdupq_n_f32(pC[9]);\n                        _suma0 = vdupq_n_f32(pC[10]);\n                        _sumb0 = vdupq_n_f32(pC[11]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        _sum41 = _sum40;\n                        _sum51 = _sum50;\n                        _sum61 = _sum60;\n                        _sum71 = _sum70;\n                        _sum81 = _sum80;\n                        _sum91 = _sum90;\n                        _suma1 = _suma0;\n                        _sumb1 = _sumb0;\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n                _sum80 = vld1q_f32(outptr + 4 * 16);\n                _sum81 = vld1q_f32(outptr + 4 * 17);\n                _sum90 = vld1q_f32(outptr + 4 * 18);\n                _sum91 = vld1q_f32(outptr + 4 * 19);\n       
         _suma0 = vld1q_f32(outptr + 4 * 20);\n                _suma1 = vld1q_f32(outptr + 4 * 21);\n                _sumb0 = vld1q_f32(outptr + 4 * 22);\n                _sumb1 = vld1q_f32(outptr + 4 * 23);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 
2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n\n                _pA0 = vld1q_f32(pA);\n                _pA1 = vld1q_f32(pA + 4);\n\n                _pB0 = vld1q_f32(pB);\n                _pB1 = vld1q_f32(pB + 4);\n                _pB2 = vld1q_f32(pB + 8);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n           
     _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n\n                _pA0 = vld1q_f32(pA);\n                _pA1 = vld1q_f32(pA + 4);\n\n                _pB0 = vld1q_f32(pB);\n                _pB1 = vld1q_f32(pB + 4);\n                _pB2 = vld1q_f32(pB + 8);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n\n                _pA0 = 
vld1q_f32(pA);\n                _pA1 = vld1q_f32(pA + 4);\n\n                _pB0 = vld1q_f32(pB);\n                _pB1 = vld1q_f32(pB + 4);\n                _pB2 = vld1q_f32(pB + 8);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 
4);\n\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    
vst1q_f32(outptr0 + 4, _sum10);\n                    vst1q_f32(outptr0 + 4 * 2, _sum20);\n                    vst1q_f32(outptr0 + 4 * 3, _sum30);\n                    vst1q_f32(outptr0 + 4 * 4, _sum40);\n                    vst1q_f32(outptr0 + 4 * 5, _sum50);\n                    vst1q_f32(outptr0 + 4 * 6, _sum60);\n                    vst1q_f32(outptr0 + 4 * 7, _sum70);\n                    vst1q_f32(outptr0 + 4 * 8, _sum80);\n                    vst1q_f32(outptr0 + 4 * 9, _sum90);\n                    vst1q_f32(outptr0 + 4 * 10, _suma0);\n                    vst1q_f32(outptr0 + 4 * 11, _sumb0);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 2, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 3, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 4, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 5, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 6, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 7, _sum71);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 8, _sum81);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 9, _sum91);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 10, _suma1);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 11, _sumb1);\n\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x12_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71, _sum80, _sum81, _sum90, _sum91, _suma0, _suma1, _sumb0, _sumb1);\n\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + 8, _sum10);\n                    
vst1q_f32(outptr0 + out_hstep, _sum11);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum20);\n                    vst1q_f32(outptr0 + out_hstep + 8, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum30);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 8, _sum40);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _sum50);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 8, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum60);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 8, _sum70);\n                    vst1q_f32(outptr0 + out_hstep * 5, _sum71);\n                    vst1q_f32(outptr0 + out_hstep * 5 + 4, _sum80);\n                    vst1q_f32(outptr0 + out_hstep * 5 + 8, _sum81);\n                    vst1q_f32(outptr0 + out_hstep * 6, _sum90);\n                    vst1q_f32(outptr0 + out_hstep * 6 + 4, _sum91);\n                    vst1q_f32(outptr0 + out_hstep * 6 + 8, _suma0);\n                    vst1q_f32(outptr0 + out_hstep * 7, _suma1);\n                    vst1q_f32(outptr0 + out_hstep * 7 + 4, _sumb0);\n                    vst1q_f32(outptr0 + out_hstep * 7 + 8, _sumb1);\n\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, 
_sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n                vst1q_f32(outptr + 4 * 16, _sum80);\n                vst1q_f32(outptr + 4 * 17, _sum81);\n                vst1q_f32(outptr + 4 * 18, _sum90);\n                vst1q_f32(outptr + 4 * 19, _sum91);\n                vst1q_f32(outptr + 4 * 20, _suma0);\n                vst1q_f32(outptr + 4 * 21, _suma1);\n                vst1q_f32(outptr + 4 * 22, _sumb0);\n                vst1q_f32(outptr + 4 * 23, _sumb1);\n            }\n\n            outptr += 96;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n                _sum40 = vdupq_n_f32(0.f);\n                _sum41 = vdupq_n_f32(0.f);\n                _sum50 = vdupq_n_f32(0.f);\n                _sum51 = vdupq_n_f32(0.f);\n                _sum60 = 
vdupq_n_f32(0.f);\n                _sum61 = vdupq_n_f32(0.f);\n                _sum70 = vdupq_n_f32(0.f);\n                _sum71 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[0]);\n                        _sum11 = vdupq_n_f32(pC[0]);\n                        _sum20 = vdupq_n_f32(pC[0]);\n                        _sum21 = vdupq_n_f32(pC[0]);\n                        _sum30 = vdupq_n_f32(pC[0]);\n                        _sum31 = vdupq_n_f32(pC[0]);\n                        _sum40 = vdupq_n_f32(pC[0]);\n                        _sum41 = vdupq_n_f32(pC[0]);\n                        _sum50 = vdupq_n_f32(pC[0]);\n                        _sum51 = vdupq_n_f32(pC[0]);\n                        _sum60 = vdupq_n_f32(pC[0]);\n                        _sum61 = vdupq_n_f32(pC[0]);\n                        _sum70 = vdupq_n_f32(pC[0]);\n                        _sum71 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                        _sum40 = _sum00;\n                        _sum41 = _sum01;\n                        _sum50 = _sum00;\n                        _sum51 = _sum01;\n                        _sum60 = _sum00;\n                        _sum61 = _sum01;\n                        _sum70 = _sum00;\n                        _sum71 = _sum01;\n                    }\n                    if 
(broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        _sum40 = vld1q_f32(pC + 4 * 8);\n                        _sum41 = vld1q_f32(pC + 4 * 9);\n                        _sum50 = vld1q_f32(pC + 4 * 10);\n                        _sum51 = vld1q_f32(pC + 4 * 11);\n                        _sum60 = vld1q_f32(pC + 4 * 12);\n                        _sum61 = vld1q_f32(pC + 4 * 13);\n                        _sum70 = vld1q_f32(pC + 4 * 14);\n                        _sum71 = vld1q_f32(pC + 4 * 15);\n                        pC += 64;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum40 = vdupq_n_f32(pC[4]);\n                        _sum50 = vdupq_n_f32(pC[5]);\n                        _sum60 = vdupq_n_f32(pC[6]);\n                        _sum70 = vdupq_n_f32(pC[7]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        _sum41 = _sum40;\n                        _sum51 = _sum50;\n                        _sum61 = _sum60;\n                        _sum71 = _sum70;\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                
_sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = 
vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n\n                pA += 8;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n                    vst1q_f32(outptr0 + 4 * 2, _sum20);\n                    vst1q_f32(outptr0 + 4 * 3, _sum30);\n                    vst1q_f32(outptr0 + 4 * 4, _sum40);\n                    vst1q_f32(outptr0 + 4 * 5, _sum50);\n                    vst1q_f32(outptr0 + 4 * 6, _sum60);\n                    vst1q_f32(outptr0 + 4 * 7, _sum70);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 2, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 3, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 4, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 5, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 6, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 7, _sum71);\n\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x8_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71);\n\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep, _sum10);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 2, 
_sum20);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum30);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _sum31);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum40);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum41);\n                    vst1q_f32(outptr0 + out_hstep * 5, _sum50);\n                    vst1q_f32(outptr0 + out_hstep * 5 + 4, _sum51);\n                    vst1q_f32(outptr0 + out_hstep * 6, _sum60);\n                    vst1q_f32(outptr0 + out_hstep * 6 + 4, _sum61);\n                    vst1q_f32(outptr0 + out_hstep * 7, _sum70);\n                    vst1q_f32(outptr0 + out_hstep * 7 + 4, _sum71);\n\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n            }\n\n            outptr += 64;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            
float32x4_t _sum30;\n            float32x4_t _sum31;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[0]);\n                        _sum11 = vdupq_n_f32(pC[0]);\n                        _sum20 = vdupq_n_f32(pC[0]);\n                        _sum21 = vdupq_n_f32(pC[0]);\n                        _sum30 = vdupq_n_f32(pC[0]);\n                        _sum31 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = 
vld1q_f32(pC + 4 * 7);\n                        pC += 32;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x4_t _pB0 = vld1q_f32(pB);\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (k_end)\n    
        {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n                    vst1q_f32(outptr0 + 4 * 2, _sum20);\n                    vst1q_f32(outptr0 + 4 * 3, _sum30);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 2, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4 * 3, _sum31);\n\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x4_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31);\n\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + out_hstep * 1, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum10);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum11);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum20);\n                    vst1q_f32(outptr0 + out_hstep * 5, _sum21);\n                    vst1q_f32(outptr0 + out_hstep * 6, _sum30);\n                    vst1q_f32(outptr0 + out_hstep * 7, _sum31);\n\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t 
_sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[0]);\n                        _sum11 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n    
            float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA + 4);\n\n                float32x2_t _pB0 = vld1_f32(pB);\n\n                _sum00 = vfmaq_lane_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pA1, _pB0, 1);\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum10);\n\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep * 4 + 4, _sum11);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[8];\n                    float sum1[8];\n                    vst1q_f32(sum0, _sum00);\n                    vst1q_f32(sum0 + 4, _sum01);\n                    vst1q_f32(sum1, _sum10);\n                    vst1q_f32(sum1 + 4, _sum11);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0[out_hstep * 4 + 1] = sum1[4];\n                    outptr0[out_hstep * 5 + 1] = sum1[5];\n                    outptr0[out_hstep 
* 6 + 1] = sum1[6];\n                    outptr0[out_hstep * 7 + 1] = sum1[7];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n            }\n\n            outptr += 16;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA0 = vld1q_f32(pA);\n                float32x4_t _pA1 = vld1q_f32(pA 
+ 4);\n\n                float32x4_t _pB = vld1q_dup_f32(pB);\n\n                _sum00 = vfmaq_f32(_sum00, _pA0, _pB);\n                _sum01 = vfmaq_f32(_sum01, _pA1, _pB);\n\n                pA += 8;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + out_hstep * 4, _sum01);\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[8];\n                    vst1q_f32(sum0, _sum00);\n                    vst1q_f32(sum0 + 4, _sum01);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep * 1] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n            }\n\n            outptr += 8;\n        }\n\n        pAT += max_kk * 8;\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n  
      {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n            float32x4_t _sum8;\n            float32x4_t _sum9;\n            float32x4_t _suma;\n            float32x4_t _sumb;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n                _sum4 = vdupq_n_f32(0.f);\n                _sum5 = vdupq_n_f32(0.f);\n                _sum6 = vdupq_n_f32(0.f);\n                _sum7 = vdupq_n_f32(0.f);\n                _sum8 = vdupq_n_f32(0.f);\n                _sum9 = vdupq_n_f32(0.f);\n                _suma = vdupq_n_f32(0.f);\n                _sumb = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                        _sum2 = vdupq_n_f32(pC[0]);\n                        _sum3 = vdupq_n_f32(pC[0]);\n                        _sum4 = vdupq_n_f32(pC[0]);\n                        _sum5 = vdupq_n_f32(pC[0]);\n                        _sum6 = vdupq_n_f32(pC[0]);\n                        _sum7 = vdupq_n_f32(pC[0]);\n                        _sum8 = vdupq_n_f32(pC[0]);\n                        _sum9 = vdupq_n_f32(pC[0]);\n                        _suma = vdupq_n_f32(pC[0]);\n                        _sumb = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n         
               _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                        _sum8 = _sum0;\n                        _sum9 = _sum0;\n                        _suma = _sum0;\n                        _sumb = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        _sum4 = vld1q_f32(pC + 16);\n                        _sum5 = vld1q_f32(pC + 20);\n                        _sum6 = vld1q_f32(pC + 24);\n                        _sum7 = vld1q_f32(pC + 28);\n                        _sum8 = vld1q_f32(pC + 32);\n                        _sum9 = vld1q_f32(pC + 36);\n                        _suma = vld1q_f32(pC + 40);\n                        _sumb = vld1q_f32(pC + 44);\n                        pC += 48;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        _sum4 = vdupq_n_f32(pC[4]);\n                        _sum5 = vdupq_n_f32(pC[5]);\n                        _sum6 = vdupq_n_f32(pC[6]);\n                        _sum7 = vdupq_n_f32(pC[7]);\n                        _sum8 = vdupq_n_f32(pC[8]);\n                        _sum9 = vdupq_n_f32(pC[9]);\n                        _suma = vdupq_n_f32(pC[10]);\n                        _sumb = vdupq_n_f32(pC[11]);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = 
vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n                _sum8 = vld1q_f32(outptr + 4 * 8);\n                _sum9 = vld1q_f32(outptr + 4 * 9);\n                _suma = vld1q_f32(outptr + 4 * 10);\n                _sumb = vld1q_f32(outptr + 4 * 11);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n                _sum8 = vfmaq_laneq_f32(_sum8, _pA, _pB2, 0);\n                _sum9 = vfmaq_laneq_f32(_sum9, _pA, _pB2, 1);\n                _suma = vfmaq_laneq_f32(_suma, _pA, _pB2, 2);\n                _sumb = vfmaq_laneq_f32(_sumb, _pA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 4 * 2, _sum2);\n                    
vst1q_f32(outptr0 + 4 * 3, _sum3);\n                    vst1q_f32(outptr0 + 4 * 4, _sum4);\n                    vst1q_f32(outptr0 + 4 * 5, _sum5);\n                    vst1q_f32(outptr0 + 4 * 6, _sum6);\n                    vst1q_f32(outptr0 + 4 * 7, _sum7);\n                    vst1q_f32(outptr0 + 4 * 8, _sum8);\n                    vst1q_f32(outptr0 + 4 * 9, _sum9);\n                    vst1q_f32(outptr0 + 4 * 10, _suma);\n                    vst1q_f32(outptr0 + 4 * 11, _sumb);\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x12_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7, _sum8, _sum9, _suma, _sumb);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 8, _sum2);\n                    vst1q_f32(outptr0 + out_hstep, _sum3);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum4);\n                    vst1q_f32(outptr0 + out_hstep + 8, _sum5);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum6);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum7);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 8, _sum8);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum9);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _suma);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 8, _sumb);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n                vst1q_f32(outptr + 
4 * 8, _sum8);\n                vst1q_f32(outptr + 4 * 9, _sum9);\n                vst1q_f32(outptr + 4 * 10, _suma);\n                vst1q_f32(outptr + 4 * 11, _sumb);\n            }\n\n            outptr += 48;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n                _sum4 = vdupq_n_f32(0.f);\n                _sum5 = vdupq_n_f32(0.f);\n                _sum6 = vdupq_n_f32(0.f);\n                _sum7 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                        _sum2 = vdupq_n_f32(pC[0]);\n                        _sum3 = vdupq_n_f32(pC[0]);\n                        _sum4 = vdupq_n_f32(pC[0]);\n                        _sum5 = vdupq_n_f32(pC[0]);\n                        _sum6 = vdupq_n_f32(pC[0]);\n                        _sum7 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                    }\n                    if (broadcast_type_C == 
3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        _sum4 = vld1q_f32(pC + 16);\n                        _sum5 = vld1q_f32(pC + 20);\n                        _sum6 = vld1q_f32(pC + 24);\n                        _sum7 = vld1q_f32(pC + 28);\n                        pC += 32;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        _sum4 = vdupq_n_f32(pC[4]);\n                        _sum5 = vdupq_n_f32(pC[5]);\n                        _sum6 = vdupq_n_f32(pC[6]);\n                        _sum7 = vdupq_n_f32(pC[7]);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #128]   \\n\"\n                    \"ld1    {v2.4s}, [%0], #16      \\n\"\n                    \"prfm   pldl1keep, [%1, #256]   \\n\"\n                    \"ld1    {v0.4s, v1.4s}, [%1], #32 \\n\"\n                    
\"fmla   %2.4s, v2.4s, v0.s[0]   \\n\"\n                    \"fmla   %3.4s, v2.4s, v0.s[1]   \\n\"\n                    \"fmla   %4.4s, v2.4s, v0.s[2]   \\n\"\n                    \"fmla   %5.4s, v2.4s, v0.s[3]   \\n\"\n                    \"fmla   %6.4s, v2.4s, v1.s[0]   \\n\"\n                    \"fmla   %7.4s, v2.4s, v1.s[1]   \\n\"\n                    \"fmla   %8.4s, v2.4s, v1.s[2]   \\n\"\n                    \"fmla   %9.4s, v2.4s, v1.s[3]   \\n\"\n                    : \"=r\"(pA),\n                    \"=r\"(pB),\n                    \"=w\"(_sum0),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3),\n                    \"=w\"(_sum4),\n                    \"=w\"(_sum5),\n                    \"=w\"(_sum6),\n                    \"=w\"(_sum7)\n                    : \"0\"(pA),\n                    \"1\"(pB),\n                    \"2\"(_sum0),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3),\n                    \"6\"(_sum4),\n                    \"7\"(_sum5),\n                    \"8\"(_sum6),\n                    \"9\"(_sum7)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                asm volatile(\n                    \"pld        [%0, #128]          \\n\"\n                    \"vld1.f32   {d4-d5}, [%0]!      \\n\"\n                    \"pld        [%1, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%1]!      
\\n\"\n                    \"vmla.f32   %q2, q2, d0[0]      \\n\"\n                    \"vmla.f32   %q3, q2, d0[1]      \\n\"\n                    \"vmla.f32   %q4, q2, d1[0]      \\n\"\n                    \"vmla.f32   %q5, q2, d1[1]      \\n\"\n                    \"vmla.f32   %q6, q2, d2[0]      \\n\"\n                    \"vmla.f32   %q7, q2, d2[1]      \\n\"\n                    \"vmla.f32   %q8, q2, d3[0]      \\n\"\n                    \"vmla.f32   %q9, q2, d3[1]      \\n\"\n                    : \"=r\"(pA),\n                    \"=r\"(pB),\n                    \"=w\"(_sum0),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3),\n                    \"=w\"(_sum4),\n                    \"=w\"(_sum5),\n                    \"=w\"(_sum6),\n                    \"=w\"(_sum7)\n                    : \"0\"(pA),\n                    \"1\"(pB),\n                    \"2\"(_sum0),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3),\n                    \"6\"(_sum4),\n                    \"7\"(_sum5),\n                    \"8\"(_sum6),\n                    \"9\"(_sum7)\n                    : \"memory\", \"q0\", \"q1\", \"q2\");\n#endif\n#else // NCNN_GNU_INLINE_ASM\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n#else\n                _sum0 = 
vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB0), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB0), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB0), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB0), 1);\n                _sum4 = vmlaq_lane_f32(_sum4, _pA, vget_low_f32(_pB1), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _pA, vget_low_f32(_pB1), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _pA, vget_high_f32(_pB1), 0);\n                _sum7 = vmlaq_lane_f32(_sum7, _pA, vget_high_f32(_pB1), 1);\n#endif\n\n                pA += 4;\n                pB += 8;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 4 * 2, _sum2);\n                    vst1q_f32(outptr0 + 4 * 3, _sum3);\n                    vst1q_f32(outptr0 + 4 * 4, _sum4);\n                    vst1q_f32(outptr0 + 4 * 5, _sum5);\n                    vst1q_f32(outptr0 + 4 * 6, _sum6);\n                    vst1q_f32(outptr0 + 4 * 7, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x8_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + out_hstep, _sum2);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum3);\n                    vst1q_f32(outptr0 + out_hstep * 2, _sum4);\n                    vst1q_f32(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum6);\n                    vst1q_f32(outptr0 + out_hstep * 3 + 4, _sum7);\n                    outptr0 += 8;\n                }\n            }\n 
           else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                        _sum2 = vdupq_n_f32(pC[0]);\n                        _sum3 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        
_sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB = vld1q_f32(pB);\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB), 1);\n#endif\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 4 * 2, _sum2);\n                    vst1q_f32(outptr0 + 4 * 3, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x4_ps(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + out_hstep * 1, _sum1);\n                    vst1q_f32(outptr0 + 
out_hstep * 2, _sum2);\n                    vst1q_f32(outptr0 + out_hstep * 3, _sum3);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x2_t _pB = vld1_f32(pB);\n\n#if 
__aarch64__\n                _sum0 = vfmaq_lane_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_lane_f32(_sum1, _pA, _pB, 1);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, _pB, 1);\n#endif\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[4];\n                    float sum1[4];\n                    vst1q_f32(sum0, _sum0);\n                    vst1q_f32(sum1, _sum1);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n             
       }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = vld1q_f32(pA);\n                float32x4_t _pB = vdupq_n_f32(pB[0]);\n\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA, _pB);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA, _pB);\n#endif\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    float sum0[4];\n                    vst1q_f32(sum0, _sum0);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n            }\n\n            outptr += 4;\n        }\n\n        pAT += max_kk * 4;\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n          
      pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum02;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum12;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum02 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum12 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum02 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[0]);\n                        _sum11 = vdupq_n_f32(pC[0]);\n                        _sum12 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum02 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum11 = vdupq_n_f32(pC[1]);\n                        _sum12 = vdupq_n_f32(pC[1]);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        float32x4x2_t _tmp23 = vld2q_f32(pC + 8);\n                        float32x4x2_t _tmp45 = vld2q_f32(pC + 16);\n                    
    _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum02 = _tmp45.val[0];\n                        _sum10 = _tmp01.val[1];\n                        _sum11 = _tmp23.val[1];\n                        _sum12 = _tmp45.val[1];\n                        pC += 24;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum02 = vld1q_f32(pC + 8);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum12 = _sum02;\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                float32x4x2_t _tmp45 = vld2q_f32(outptr + 16);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum02 = _tmp45.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n                _sum12 = _tmp45.val[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                float32x2_t _pA = vld1_f32(pA);\n\n                _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum02 = vfmaq_lane_f32(_sum02, _pB2, _pA, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n                _sum12 = vfmaq_lane_f32(_sum12, _pB2, _pA, 1);\n\n                pA += 
2;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + 8, _sum02);\n                    vst1q_f32(outptr0 + out_hstep, _sum10);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum11);\n                    vst1q_f32(outptr0 + out_hstep + 8, _sum12);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float32x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n                vst2q_f32(outptr + 16, _tmp45);\n            }\n\n            outptr += 24;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[0]);\n                        _sum11 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || 
broadcast_type_C == 2)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum11 = vdupq_n_f32(pC[1]);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        float32x4x2_t _tmp23 = vld2q_f32(pC + 8);\n                        _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum10 = _tmp01.val[1];\n                        _sum11 = _tmp23.val[1];\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                float32x2_t _pA = vld1_f32(pA);\n#if __aarch64__\n                _sum00 = vfmaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vfmaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum10 = vfmaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vfmaq_lane_f32(_sum11, _pB1, _pA, 1);\n#else\n                
_sum00 = vmlaq_lane_f32(_sum00, _pB0, _pA, 0);\n                _sum01 = vmlaq_lane_f32(_sum01, _pB1, _pA, 0);\n                _sum10 = vmlaq_lane_f32(_sum10, _pB0, _pA, 1);\n                _sum11 = vmlaq_lane_f32(_sum11, _pB1, _pA, 1);\n#endif\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum00);\n                    vst1q_f32(outptr0 + 4, _sum01);\n                    vst1q_f32(outptr0 + out_hstep, _sum10);\n                    vst1q_f32(outptr0 + out_hstep + 4, _sum11);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n           
             _sum0 = _tmp01.val[0];\n                        _sum1 = _tmp01.val[1];\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                _sum0 = _tmp01.val[0];\n                _sum1 = _tmp01.val[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB = vld1q_f32(pB);\n\n                float32x2_t _pA = vld1_f32(pA);\n#if __aarch64__\n                _sum0 = vfmaq_lane_f32(_sum0, _pB, _pA, 0);\n                _sum1 = vfmaq_lane_f32(_sum1, _pB, _pA, 1);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pB, _pA, 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pB, _pA, 1);\n#endif\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + out_hstep, _sum1);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2q_f32(outptr, _tmp01);\n            }\n\n            outptr += 8;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum00;\n            float sum01;\n            float sum10;\n            float sum11;\n\n            if (k == 0)\n            {\n                sum00 = 0.f;\n                sum01 = 0.f;\n                sum10 = 0.f;\n              
  sum11 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[0];\n                        sum10 = pC[0];\n                        sum11 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[0];\n                        sum11 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[2];\n                        sum11 = pC[3];\n                        pC += 4;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[0];\n                        sum10 = pC[1];\n                        sum11 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum00 = outptr[0];\n                sum01 = outptr[1];\n                sum10 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum00 += pA[0] * pB[0];\n                sum01 += pA[1] * pB[0];\n                sum10 += pA[0] * pB[1];\n                sum11 += pA[1] * pB[1];\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum00;\n                    outptr0[1] = sum10;\n                    outptr0[out_hstep] = sum01;\n                
    outptr0[out_hstep + 1] = sum11;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n            }\n\n            outptr += 4;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[1] * pB[0];\n                pA += 2;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[out_hstep] = sum1;\n                    
outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        float* outptr0 = (float*)top_blob + (i + ii) * out_hstep + j;\n\n        const float* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                        _sum2 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n                _sum2 = vld1q_f32(outptr + 8);\n            }\n\n            const float* pA = pAT;\n            
int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n                float32x4_t _pB2 = vld1q_f32(pB + 8);\n\n                float32x4_t _pA0 = vdupq_n_f32(pA[0]);\n\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n                _sum2 = vfmaq_f32(_sum2, _pA0, _pB2);\n\n                pA += 1;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    vst1q_f32(outptr0 + 8, _sum2);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 8, _sum2);\n            }\n\n            outptr += 12;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = 
vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB0 = vld1q_f32(pB);\n                float32x4_t _pB1 = vld1q_f32(pB + 4);\n\n                float32x4_t _pA0 = vdupq_n_f32(pA[0]);\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n#endif\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum0);\n                    vst1q_f32(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum = vld1q_f32(outptr);\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n        
        float32x4_t _pB = vld1q_f32(pB);\n                float32x4_t _pA = vdupq_n_f32(pA[0]);\n\n#if __aarch64__\n                _sum = vfmaq_f32(_sum, _pA, _pB);\n#else\n                _sum = vmlaq_f32(_sum, _pA, _pB);\n#endif\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1q_f32(outptr0, _sum);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum);\n            }\n\n            outptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[0] * pB[1];\n\n                pA += 1;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[1] = sum1;\n                    
outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum;\n\n            if (k == 0)\n            {\n                sum = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const float* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum += pA[0] * pB[0];\n                pA += 1;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n\nstatic void get_optimal_tile_mnk(int M, int N, int K, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const size_t l2_cache_size = get_cpu_level2_cache_size();\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    int tile_size = (int)sqrtf((float)l2_cache_size / 3 / sizeof(float));\n\n#if __aarch64__\n    TILE_M = std::max(8, tile_size / 8 * 8);\n    TILE_N = std::max(4, tile_size 
/ 4 * 4);\n    TILE_K = std::max(8, tile_size / 8 * 8);\n#elif __ARM_NEON\n    TILE_M = std::max(4, tile_size / 4 * 4);\n    TILE_N = std::max(4, tile_size / 4 * 4);\n    TILE_K = std::max(4, tile_size / 4 * 4);\n#else\n    TILE_M = std::max(2, tile_size / 2 * 2);\n    TILE_N = std::max(1, tile_size);\n    TILE_K = std::max(2, tile_size / 2 * 2);\n#endif\n\n    if (K > 0)\n    {\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n#if __aarch64__\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 3) / 4 * 4);\n#else\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 1) / 2 * 2);\n#endif\n\n        if (nn_K == 1)\n        {\n            tile_size = (int)((float)l2_cache_size / 2 / sizeof(float) / TILE_K);\n\n#if __aarch64__\n            TILE_M = std::max(8, tile_size / 8 * 8);\n            TILE_N = std::max(4, tile_size / 4 * 4);\n#elif __ARM_NEON\n            TILE_M = std::max(4, tile_size / 4 * 4);\n            TILE_N = std::max(4, tile_size / 4 * 4);\n#else\n            TILE_M = std::max(2, tile_size / 2 * 2);\n            TILE_N = std::max(1, tile_size);\n#endif\n        }\n    }\n\n    TILE_M *= std::min(nT, get_physical_cpu_count());\n\n    if (M > 0)\n    {\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n#if __aarch64__\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 3) / 4 * 4);\n#else\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 1) / 2 * 2);\n#endif\n    }\n\n    if (N > 0)\n    {\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n#if __aarch64__\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#elif __ARM_NEON\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#else\n        TILE_N = std::min(TILE_N, (N + nn_N - 1) / nn_N);\n#endif\n    }\n\n    if (nT > 1)\n    {\n#if __aarch64__\n        
TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n#elif __ARM_NEON\n        TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 3) / 4 * 4);\n#else\n        TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 1) / 2 * 2);\n#endif\n    }\n\n    // always take constant TILE_M/N/K value when provided\n    if (constant_TILE_M > 0)\n    {\n#if __aarch64__\n        TILE_M = (constant_TILE_M + 7) / 8 * 8;\n#elif __ARM_NEON\n        TILE_M = (constant_TILE_M + 3) / 4 * 4;\n#else\n        TILE_M = (constant_TILE_M + 1) / 2 * 2;\n#endif\n    }\n\n    if (constant_TILE_N > 0)\n    {\n#if __aarch64__\n        TILE_N = (constant_TILE_N + 3) / 4 * 4;\n#elif __ARM_NEON\n        TILE_N = (constant_TILE_N + 3) / 4 * 4;\n#else\n        TILE_N = constant_TILE_N;\n#endif\n    }\n\n    if (constant_TILE_K > 0)\n    {\n#if __aarch64__\n        TILE_K = (constant_TILE_K + 7) / 8 * 8;\n#elif __ARM_NEON\n        TILE_K = (constant_TILE_K + 3) / 4 * 4;\n#else\n        TILE_K = (constant_TILE_K + 1) / 2 * 2;\n#endif\n    }\n}\n\nstatic int gemm_arm(const Mat& A, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int transA, int transB, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n    const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n    const int N = transB ? (B.dims == 3 ? 
B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 4u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            pack_B_tile(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? 
A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_arm(const Mat& AT, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int K, int transB, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const 
Option& opt)\n{\n    const int N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            pack_B_tile(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n     
       const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_BT_arm(const Mat& A, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int N, int K, int transA, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? 
A.c : A.h) * A.elempack;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 4u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? 
topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_BT_arm(const Mat& AT, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int N, int K, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if 
(topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Gemm_arm::create_pipeline(const Option& opt)\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return create_pipeline_int8(opt);\n    }\n#endif\n\n#if NCNN_ARM82\n    if (cpu_support_arm_asimdhp() && opt.use_fp16_storage)\n    {\n        if (opt.use_fp16_arithmetic)\n            return create_pipeline_fp16sa(opt);\n        else\n            return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    
{\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n#if NCNN_VFPV4\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n    if (constantA)\n    {\n        const int M = constantM;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk(M, 0, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n        AT_data.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, 4u, (Allocator*)0);\n        if (AT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_M; ppj++)\n        {\n            const int i = ppj * TILE_M;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_ii = std::min((M - i), TILE_M);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat AT_tile = AT_data.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                if (transA)\n                {\n                    transpose_pack_A_tile(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n                else\n                {\n                    pack_A_tile(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            A_data.release();\n    }\n\n    if (constantB)\n    {\n        const int N = constantN;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk(0, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_N = (N + TILE_N - 1) / TILE_N;\n\n        BT_data.create(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 4u, (Allocator*)0);\n        if (BT_data.empty())\n            
return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_N; ppj++)\n        {\n            const int j = ppj * TILE_N;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_jj = std::min((N - j), TILE_N);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat BT_tile = BT_data.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (transB)\n                {\n                    pack_B_tile(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n                else\n                {\n                    transpose_pack_B_tile(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            B_data.release();\n    }\n\n    if (constantC && constant_broadcast_type_C != -1)\n    {\n        CT_data = C_data;\n\n#if __ARM_NEON\n        if (constant_broadcast_type_C == 3 && opt.use_packing_layout)\n        {\n            int C_elempack = constantM % 4 == 0 ? 
4 : 1;\n            convert_packing(C_data, CT_data, C_elempack, opt);\n            if (CT_data.empty())\n                return -100;\n        }\n#endif // __ARM_NEON\n\n        // pre-multiply C with beta\n        if (beta != 1.f)\n        {\n            Mat C2;\n            C2.create_like(CT_data);\n            if (C2.empty())\n                return -100;\n\n            const int size = CT_data.total() * CT_data.elempack;\n            for (int i = 0; i < size; i++)\n            {\n                C2[i] = CT_data[i] * beta;\n            }\n\n            CT_data = C2;\n        }\n\n        if (opt.lightmode)\n            C_data.release();\n    }\n\n    if (constantA || constantB || constantC)\n    {\n        nT = opt.num_threads;\n    }\n\n    return 0;\n}\n\nint Gemm_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return forward_int8(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n    const Mat& bottom_blob = constantA ? AT_data : bottom_blobs[0];\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (cpu_support_arm_asimdhp() && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blobs, top_blobs, opt);\n        else\n            return forward_fp16s(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_VFPV4\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n    int M;\n    int N;\n    if (constantA && constantB)\n    {\n        M = constantM;\n        N = constantN;\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        M = constantM;\n        N = transB ? (B.dims == 3 ? 
B.c : B.h) * B.elempack : B.w;\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = constantN;\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n\n    Mat C;\n    int broadcast_type_C = 0;\n    if (constantC)\n    {\n        C = CT_data;\n        broadcast_type_C = constant_broadcast_type_C;\n    }\n    else\n    {\n        if (constantA && constantB)\n        {\n            C = bottom_blobs.size() == 1 ? bottom_blobs[0] : Mat();\n        }\n        else if (constantA)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else if (constantB)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else\n        {\n            C = bottom_blobs.size() == 3 ? 
bottom_blobs[2] : Mat();\n        }\n\n        if (!C.empty())\n        {\n            if (C.dims == 1 && C.w == 1)\n            {\n                // scalar\n                broadcast_type_C = 0;\n            }\n            if (C.dims == 1 && C.w * C.elempack == M)\n            {\n                // M\n                // auto broadcast from h to w is the ncnn-style convention\n                broadcast_type_C = 1;\n            }\n            if (C.dims == 1 && C.w * C.elempack == N)\n            {\n                // N\n                broadcast_type_C = 4;\n            }\n            if (C.dims == 2 && C.w == 1 && C.h * C.elempack == M)\n            {\n                // Mx1\n                broadcast_type_C = 2;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == M)\n            {\n                // MxN\n                broadcast_type_C = 3;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == 1)\n            {\n                // 1xN\n                broadcast_type_C = 4;\n            }\n\n            // pre-multiply C with beta\n            if (beta != 1.f)\n            {\n                Mat CT_data;\n                CT_data.create_like(C, opt.workspace_allocator);\n                if (CT_data.empty())\n                    return -100;\n\n                const int size = C.total() * C.elempack;\n                for (int i = 0; i < size; i++)\n                {\n                    CT_data[i] = C[i] * beta;\n                }\n\n                C = CT_data;\n            }\n        }\n    }\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        int outh = output_transpose ? N : M;\n        out_elempack = outh % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n    if (output_elempack)\n        out_elempack = output_elempack;\n    size_t out_elemsize = 4u * out_elempack;\n\n    Mat& top_blob = top_blobs[0];\n    if (output_transpose)\n    {\n        if (output_N1M)\n            top_blob.create(M, 1, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(M, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    else\n    {\n        if (output_N1M)\n            top_blob.create(N, 1, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(N, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int _nT = nT ? nT : opt.num_threads;\n    if (nT != 0 && opt.num_threads != nT)\n    {\n        // force num_threads the same as in create_pipeline\n        // so we could use pre-packed A/B from the same tile config\n        NCNN_LOGE(\"opt.num_threads %d changed, gemm will use load-time value %d\", opt.num_threads, nT);\n    }\n\n    int ret = 0;\n    if (constantA && constantB)\n    {\n        ret = gemm_AT_BT_arm(AT_data, BT_data, C, top_blob, broadcast_type_C, constantM, constantN, constantK, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        ret = gemm_AT_arm(AT_data, B, C, top_blob, broadcast_type_C, constantM, constantK, transB, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        ret = gemm_BT_arm(A, BT_data, C, top_blob, broadcast_type_C, constantN, constantK, transA, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n   
     ret = gemm_arm(A, B, C, top_blob, broadcast_type_C, transA, transB, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    if (ret != 0)\n        return ret;\n\n    // multiply top_blob with alpha\n    if (alpha != 1.f)\n    {\n        const int size = top_blob.total() * out_elempack;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < size; i++)\n        {\n            top_blob[i] *= alpha;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic int gemm_arm_bf16s(const Mat& A, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int transA, int transB, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n    const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n    const int N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 2u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n      
  const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? 
topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? alpha : 1.f;\n\n                gemm_transB_packed_tile_bf16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_bf16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_arm_bf16s(const Mat& AT, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int K, int transB, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int N = transB ? (B.dims == 3 ? 
B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = 
std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? alpha : 1.f;\n\n                gemm_transB_packed_tile_bf16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_bf16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_BT_arm_bf16s(const Mat& A, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int N, int K, int transA, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? 
A.c : A.h) * A.elempack;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 2u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? 
topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? alpha : 1.f;\n\n                gemm_transB_packed_tile_bf16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_bf16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_BT_arm_bf16s(const Mat& AT, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int N, int K, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat topT;\n    if (K > TILE_K || 
broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? 
alpha : 1.f;\n\n                gemm_transB_packed_tile_bf16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_bf16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Gemm_arm::create_pipeline_bf16s(const Option& opt)\n{\n    if (constantA)\n    {\n        const int M = constantM;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_bf16s_fp16s(M, 0, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n        AT_data.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, 2u, (Allocator*)0);\n        if (AT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_M; ppj++)\n        {\n            const int i = ppj * TILE_M;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_ii = std::min((M - i), TILE_M);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat AT_tile = AT_data.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                if (transA)\n                {\n                    transpose_pack_A_tile_fp32_to_bf16(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n                else\n                {\n                    pack_A_tile_fp32_to_bf16(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            A_data.release();\n    }\n\n    if (constantB)\n    {\n        const int N = constantN;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_bf16s_fp16s(0, N, 
K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_N = (N + TILE_N - 1) / TILE_N;\n\n        BT_data.create(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, (Allocator*)0);\n        if (BT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_N; ppj++)\n        {\n            const int j = ppj * TILE_N;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_jj = std::min((N - j), TILE_N);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat BT_tile = BT_data.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (transB)\n                {\n                    pack_B_tile_fp32_to_bf16(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n                else\n                {\n                    transpose_pack_B_tile_fp32_to_bf16(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            B_data.release();\n    }\n\n    if (constantC && constant_broadcast_type_C != -1)\n    {\n        CT_data = C_data;\n\n#if __ARM_NEON\n        if (constant_broadcast_type_C == 3 && opt.use_packing_layout)\n        {\n            int C_elempack = constantM % 4 == 0 ? 
4 : 1;\n            convert_packing(C_data, CT_data, C_elempack, opt);\n            if (CT_data.empty())\n                return -100;\n        }\n#endif // __ARM_NEON\n\n        // pre-multiply C with beta\n        if (beta != 1.f)\n        {\n            Mat C2;\n            C2.create_like(CT_data);\n            if (C2.empty())\n                return -100;\n\n            const int size = CT_data.total() * CT_data.elempack;\n            for (int i = 0; i < size; i++)\n            {\n                C2[i] = CT_data[i] * beta;\n            }\n\n            CT_data = C2;\n        }\n\n        if (opt.lightmode)\n            C_data.release();\n    }\n\n    if (constantA || constantB || constantC)\n    {\n        nT = opt.num_threads;\n    }\n\n    return 0;\n}\n\nint Gemm_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int M;\n    int N;\n    if (constantA && constantB)\n    {\n        M = constantM;\n        N = constantN;\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        M = constantM;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = constantN;\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n\n    Mat C;\n    int broadcast_type_C = 0;\n    if (constantC)\n    {\n        C = CT_data;\n        broadcast_type_C = constant_broadcast_type_C;\n    }\n    else\n    {\n        if (constantA && constantB)\n        {\n            C = bottom_blobs.size() == 1 ? bottom_blobs[0] : Mat();\n        }\n        else if (constantA)\n        {\n            C = bottom_blobs.size() == 2 ? 
bottom_blobs[1] : Mat();\n        }\n        else if (constantB)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else\n        {\n            C = bottom_blobs.size() == 3 ? bottom_blobs[2] : Mat();\n        }\n\n        if (!C.empty())\n        {\n            if (C.dims == 1 && C.w == 1)\n            {\n                // scalar\n                broadcast_type_C = 0;\n            }\n            if (C.dims == 1 && C.w * C.elempack == M)\n            {\n                // M\n                // auto broadcast from h to w is the ncnn-style convention\n                broadcast_type_C = 1;\n            }\n            if (C.dims == 1 && C.w * C.elempack == N)\n            {\n                // N\n                broadcast_type_C = 4;\n            }\n            if (C.dims == 2 && C.w == 1 && C.h * C.elempack == M)\n            {\n                // Mx1\n                broadcast_type_C = 2;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == M)\n            {\n                // MxN\n                broadcast_type_C = 3;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == 1)\n            {\n                // 1xN\n                broadcast_type_C = 4;\n            }\n\n            // cast to fp32\n            {\n                Option opt_cast = opt;\n                opt_cast.blob_allocator = opt.workspace_allocator;\n\n                Mat C_fp32;\n                cast_bfloat16_to_float32(C, C_fp32, opt_cast);\n                if (C_fp32.empty())\n                    return -100;\n\n                C = C_fp32;\n            }\n\n            // pre-multiply C with beta\n            if (beta != 1.f)\n            {\n                Mat CT_data;\n                CT_data.create_like(C, opt.workspace_allocator);\n                if (CT_data.empty())\n                    return -100;\n\n                const int size = C.total() * C.elempack;\n                for (int i = 0; 
i < size; i++)\n                {\n                    CT_data[i] = C[i] * beta;\n                }\n\n                C = CT_data;\n            }\n        }\n    }\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        int outh = output_transpose ? N : M;\n        out_elempack = outh % 4 == 0 ? 4 : 1;\n    }\n    if (output_elempack)\n        out_elempack = output_elempack;\n    size_t out_elemsize = 2u * out_elempack;\n\n    Mat& top_blob = top_blobs[0];\n    if (output_transpose)\n    {\n        if (output_N1M)\n            top_blob.create(M, 1, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(M, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    else\n    {\n        if (output_N1M)\n            top_blob.create(N, 1, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(N, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int _nT = nT ? 
nT : opt.num_threads;\n    if (nT != 0 && opt.num_threads != nT)\n    {\n        // force num_threads the same as in create_pipeline\n        // so we could use pre-packed A/B from the same tile config\n        NCNN_LOGE(\"opt.num_threads %d changed, gemm will use load-time value %d\", opt.num_threads, nT);\n    }\n\n    int ret = 0;\n    if (constantA && constantB)\n    {\n        ret = gemm_AT_BT_arm_bf16s(AT_data, BT_data, C, top_blob, broadcast_type_C, constantM, constantN, constantK, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        ret = gemm_AT_arm_bf16s(AT_data, B, C, top_blob, broadcast_type_C, constantM, constantK, transB, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        ret = gemm_BT_arm_bf16s(A, BT_data, C, top_blob, broadcast_type_C, constantN, constantK, transA, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        ret = gemm_arm_bf16s(A, B, C, top_blob, broadcast_type_C, transA, transB, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n\n    return ret;\n}\n#endif // NCNN_BF16\n\n#if NCNN_INT8\n\n#if NCNN_VFPV4\nextern void compute_A_tile_fp16_int8_scales_vfpv4(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii);\nextern void pack_A_tile_fp16_to_int8_vfpv4(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nextern void transpose_compute_A_tile_fp16_int8_scales_vfpv4(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii);\nextern void transpose_pack_A_tile_fp16_to_int8_vfpv4(const Mat& A, Mat& AT, int i, int max_ii, int k, int 
max_kk, const Mat& scales);\nextern void compute_B_fp16_int8_scale_vfpv4(const Mat& B, float& scale);\nextern void pack_B_tile_fp16_to_int8_vfpv4(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nextern void transpose_pack_B_tile_fp16_to_int8_vfpv4(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nextern void unpack_output_tile_int32_to_fp16_vfpv4(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\nextern void transpose_unpack_output_tile_int32_to_fp16_vfpv4(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\n#endif\n\nstatic void compute_A_tile_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii, int input_elemtype)\n{\n#if NCNN_VFPV4\n    if (A.elembits() == 16 && input_elemtype == 2)\n    {\n        compute_A_tile_fp16_int8_scales_vfpv4(A, scales, B_scale, out_descales, i, max_ii);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (A.elembits() == 16 && input_elemtype == 3)\n    {\n        compute_A_tile_bf16_int8_scales(A, scales, B_scale, out_descales, i, max_ii);\n        return;\n    }\n#endif\n\n    compute_A_tile_fp32_int8_scales(A, scales, B_scale, out_descales, i, max_ii);\n}\n\nstatic void transpose_compute_A_tile_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii, int input_elemtype)\n{\n#if NCNN_VFPV4\n    if (A.elembits() == 16 && input_elemtype == 2)\n    {\n        transpose_compute_A_tile_fp16_int8_scales_vfpv4(A, scales, B_scale, out_descales, i, max_ii);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (A.elembits() == 16 && input_elemtype == 3)\n    {\n        transpose_compute_A_tile_bf16_int8_scales(A, scales, B_scale, out_descales, i, max_ii);\n        return;\n    }\n#endif\n\n    
transpose_compute_A_tile_fp32_int8_scales(A, scales, B_scale, out_descales, i, max_ii);\n}\n\nstatic void pack_A_tile_quantize(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales, int input_elemtype)\n{\n#if NCNN_VFPV4\n    if (A.elembits() == 16 && input_elemtype == 2)\n    {\n        pack_A_tile_fp16_to_int8_vfpv4(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (A.elembits() == 16 && input_elemtype == 3)\n    {\n        pack_A_tile_bf16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    pack_A_tile_fp32_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nstatic void transpose_pack_A_tile_quantize(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales, int input_elemtype)\n{\n#if NCNN_VFPV4\n    if (A.elembits() == 16 && input_elemtype == 2)\n    {\n        transpose_pack_A_tile_fp16_to_int8_vfpv4(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (A.elembits() == 16 && input_elemtype == 3)\n    {\n        transpose_pack_A_tile_bf16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    transpose_pack_A_tile_fp32_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nstatic void compute_B_int8_scale(const Mat& B, float& scale, int input_elemtype)\n{\n#if NCNN_VFPV4\n    if (B.elembits() == 16 && input_elemtype == 2)\n    {\n        compute_B_fp16_int8_scale_vfpv4(B, scale);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (B.elembits() == 16 && input_elemtype == 3)\n    {\n        compute_B_bf16_int8_scale(B, scale);\n        return;\n    }\n#endif\n\n    compute_B_fp32_int8_scale(B, scale);\n}\n\nstatic void pack_B_tile_quantize(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale, int input_elemtype)\n{\n#if NCNN_VFPV4\n    if (B.elembits() == 16 && input_elemtype == 2)\n    {\n        pack_B_tile_fp16_to_int8_vfpv4(B, BT, j, max_jj, 
k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (B.elembits() == 16 && input_elemtype == 3)\n    {\n        pack_B_tile_bf16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    pack_B_tile_fp32_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nstatic void transpose_pack_B_tile_quantize(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale, int input_elemtype)\n{\n#if NCNN_VFPV4\n    if (B.elembits() == 16 && input_elemtype == 2)\n    {\n        transpose_pack_B_tile_fp16_to_int8_vfpv4(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (B.elembits() == 16 && input_elemtype == 3)\n    {\n        transpose_pack_B_tile_bf16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    transpose_pack_B_tile_fp32_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nstatic void unpack_output_tile_dequantize(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta, int output_elemtype)\n{\n#if NCNN_VFPV4\n    if (top_blob.elembits() == 16 && output_elemtype == 2)\n    {\n        unpack_output_tile_int32_to_fp16_vfpv4(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (top_blob.elembits() == 16 && output_elemtype == 3)\n    {\n        unpack_output_tile_int32_to_bf16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    unpack_output_tile_int32_to_fp32(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\nstatic void transpose_unpack_output_tile_dequantize(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta, int output_elemtype)\n{\n#if NCNN_VFPV4\n    if 
(top_blob.elembits() == 16 && output_elemtype == 2)\n    {\n        transpose_unpack_output_tile_int32_to_fp16_vfpv4(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n#if NCNN_BF16\n    if (top_blob.elembits() == 16 && output_elemtype == 3)\n    {\n        transpose_unpack_output_tile_int32_to_bf16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    transpose_unpack_output_tile_int32_to_fp32(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\nstruct gemm_arm_int8_omp_args\n{\n    int TILE_M;\n    int TILE_N;\n    int TILE_K;\n    int broadcast_type_C;\n    int transA;\n    int output_transpose;\n    float alpha;\n    float beta;\n    int input_elemtype;\n    int output_elemtype;\n};\n\nstatic int gemm_arm_int8(const Mat& A, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int transA, int transB, int output_transpose, float alpha, float beta, int input_elemtype, int output_elemtype, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"gemm_arm_int8\");\n\n    const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n    const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n    const int N = transB ? (B.dims == 3 ? 
B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_int8(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 1u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 1u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    Mat A_int8_scales(M, 4u, opt.workspace_allocator);\n    if (A_int8_scales.empty())\n        return -100;\n\n    // dynamic quantize B\n    float B_int8_scale;\n    compute_B_int8_scale(B, B_int8_scale, input_elemtype);\n\n    // const float output_descale = 1.f / (A_int8_scale * B_int8_scale);\n    Mat output_descales(M, 4u, opt.workspace_allocator);\n    if (output_descales.empty())\n        return -100;\n\n    // NCNN_LOGE(\"arm ds %f %f\", 1/A_int8_scale, 1/B_int8_scale);\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n            pack_B_tile_quantize(B, BT_tile, j, max_jj, k, max_kk, B_int8_scale, input_elemtype);\n        else\n            transpose_pack_B_tile_quantize(B, BT_tile, j, max_jj, k, max_kk, B_int8_scale, input_elemtype);\n    }\n\n    Mat topT(TILE_N * TILE_M, 
1, nT, 4u, opt.workspace_allocator);\n    if (topT.empty())\n        return -100;\n\n    const struct gemm_arm_int8_omp_args args = {TILE_M, TILE_N, TILE_K, broadcast_type_C, transA, output_transpose, alpha, beta, input_elemtype, output_elemtype};\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        // shadowed variable for less openmp task args\n        const int TILE_M = args.TILE_M;\n        const int TILE_N = args.TILE_N;\n        const int TILE_K = args.TILE_K;\n        const int broadcast_type_C = args.broadcast_type_C;\n        const int transA = args.transA;\n        const int output_transpose = args.output_transpose;\n        const float alpha = args.alpha;\n        const float beta = args.beta;\n        const int input_elemtype = args.input_elemtype;\n        const int output_elemtype = args.output_elemtype;\n\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? 
A.c : A.h) * A.elempack : A.w;\n\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (k == 0)\n                    {\n                        if (transA)\n                            transpose_compute_A_tile_int8_scales(A, A_int8_scales, B_int8_scale, output_descales, i, max_ii, input_elemtype);\n                        else\n                            compute_A_tile_int8_scales(A, A_int8_scales, B_int8_scale, output_descales, i, max_ii, input_elemtype);\n\n                        // NCNN_LOGE(\"A_int8_scales %f  B_int8_scale %f\", A_int8_scales[0], B_int8_scale);\n                    }\n\n                    if (transA)\n                        transpose_pack_A_tile_quantize(A, AT_tile, i, max_ii, k, max_kk, A_int8_scales, input_elemtype);\n                    else\n                        pack_A_tile_quantize(A, AT_tile, i, max_ii, k, max_kk, A_int8_scales, input_elemtype);\n                }\n\n                gemm_transB_packed_tile_int8(AT_tile, BT_tile, topT_tile, i, max_ii, j, max_jj, k, max_kk);\n            }\n\n            if (output_transpose)\n                transpose_unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n            else\n                
unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_arm_int8(const Mat& AT, const Mat& A_int8_scales, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int K, int transB, int output_transpose, float alpha, float beta, int input_elemtype, int output_elemtype, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"gemm_AT_arm_int8\");\n\n    const int N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_int8(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 1u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // dynamic quantize B\n    float B_int8_scale;\n    compute_B_int8_scale(B, B_int8_scale, input_elemtype);\n\n    // NCNN_LOGE(\"%.4f %.4f\", A_int8_scale, B_int8_scale);\n\n    // const float output_descale = 1.f / (A_int8_scale * B_int8_scale);\n    Mat output_descales(M, 4u, opt.workspace_allocator);\n    if (output_descales.empty())\n        return -100;\n\n    for (int i = 0; i < M; i++)\n    {\n        output_descales[i] = 1.f / (A_int8_scales[i] * B_int8_scale);\n    }\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * 
TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n            pack_B_tile_quantize(B, BT_tile, j, max_jj, k, max_kk, B_int8_scale, input_elemtype);\n        else\n            transpose_pack_B_tile_quantize(B, BT_tile, j, max_jj, k, max_kk, B_int8_scale, input_elemtype);\n    }\n\n    Mat topT(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (topT.empty())\n        return -100;\n\n    const struct gemm_arm_int8_omp_args args = {TILE_M, TILE_N, TILE_K, broadcast_type_C, 0, output_transpose, alpha, beta, input_elemtype, output_elemtype};\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        // shadowed variable for less openmp task args\n        const int TILE_M = args.TILE_M;\n        const int TILE_N = args.TILE_N;\n        const int TILE_K = args.TILE_K;\n        const int broadcast_type_C = args.broadcast_type_C;\n        const int output_transpose = args.output_transpose;\n        const float alpha = args.alpha;\n        const float beta = args.beta;\n        const int output_elemtype = args.output_elemtype;\n\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                gemm_transB_packed_tile_int8(AT_tile, BT_tile, topT_tile, i, 
max_ii, j, max_jj, k, max_kk);\n            }\n\n            if (output_transpose)\n                transpose_unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n            else\n                unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_BT_arm_int8(const Mat& A, const Mat& BT, float B_int8_scale, const Mat& C, Mat& top_blob, int broadcast_type_C, int N, int K, int transA, int output_transpose, float alpha, float beta, int input_elemtype, int output_elemtype, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"gemm_BT_arm_int8\");\n\n    const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_int8(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat A_int8_scales(M, 4u, opt.workspace_allocator);\n    if (A_int8_scales.empty())\n        return -100;\n\n    // const float output_descale = 1.f / (A_int8_scale * B_int8_scale);\n    Mat output_descales(M, 4u, opt.workspace_allocator);\n    if (output_descales.empty())\n        return -100;\n\n    // NCNN_LOGE(\"scale %.4f  %.4f\", A_int8_scale, B_int8_scale);\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 1u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n\n    Mat topT(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (topT.empty())\n        return -100;\n\n    const struct gemm_arm_int8_omp_args args = {TILE_M, TILE_N, TILE_K, 
broadcast_type_C, transA, output_transpose, alpha, beta, input_elemtype, output_elemtype};\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        // shadowed variable for less openmp task args\n        const int TILE_M = args.TILE_M;\n        const int TILE_N = args.TILE_N;\n        const int TILE_K = args.TILE_K;\n        const int broadcast_type_C = args.broadcast_type_C;\n        const int transA = args.transA;\n        const int output_transpose = args.output_transpose;\n        const float alpha = args.alpha;\n        const float beta = args.beta;\n        const int input_elemtype = args.input_elemtype;\n        const int output_elemtype = args.output_elemtype;\n\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (k == 0)\n                    {\n                        if (transA)\n                            transpose_compute_A_tile_int8_scales(A, A_int8_scales, B_int8_scale, output_descales, i, max_ii, input_elemtype);\n                        else\n                            compute_A_tile_int8_scales(A, A_int8_scales, 
B_int8_scale, output_descales, i, max_ii, input_elemtype);\n\n                        // NCNN_LOGE(\"A_int8_scales %f  B_int8_scale %f\", A_int8_scales[0], B_int8_scale);\n                    }\n\n                    if (transA)\n                        transpose_pack_A_tile_quantize(A, AT_tile, i, max_ii, k, max_kk, A_int8_scales, input_elemtype);\n                    else\n                        pack_A_tile_quantize(A, AT_tile, i, max_ii, k, max_kk, A_int8_scales, input_elemtype);\n                }\n\n                gemm_transB_packed_tile_int8(AT_tile, BT_tile, topT_tile, i, max_ii, j, max_jj, k, max_kk);\n            }\n\n            if (output_transpose)\n                transpose_unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n            else\n                unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_BT_arm_int8(const Mat& AT, const Mat& A_int8_scales, const Mat& BT, float B_int8_scale, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int N, int K, int output_transpose, float alpha, float beta, int input_elemtype, int output_elemtype, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"gemm_AT_BT_arm_int8\");\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_int8(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    // const float output_descale = 1.f / (A_int8_scale * B_int8_scale);\n    Mat output_descales(M, 4u, opt.workspace_allocator);\n    if 
(output_descales.empty())\n        return -100;\n\n    for (int i = 0; i < M; i++)\n    {\n        output_descales[i] = 1.f / (A_int8_scales[i] * B_int8_scale);\n    }\n\n    Mat topT(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n    if (topT.empty())\n        return -100;\n\n    const struct gemm_arm_int8_omp_args args = {TILE_M, TILE_N, TILE_K, broadcast_type_C, 0, output_transpose, alpha, beta, input_elemtype, output_elemtype};\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        // shadowed variable for less openmp task args\n        const int TILE_M = args.TILE_M;\n        const int TILE_N = args.TILE_N;\n        const int TILE_K = args.TILE_K;\n        const int broadcast_type_C = args.broadcast_type_C;\n        const int output_transpose = args.output_transpose;\n        const float alpha = args.alpha;\n        const float beta = args.beta;\n        const int output_elemtype = args.output_elemtype;\n\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                gemm_transB_packed_tile_int8(AT_tile, BT_tile, topT_tile, i, max_ii, j, max_jj, k, max_kk);\n            }\n\n            if (output_transpose)\n                transpose_unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n            else\n             
   unpack_output_tile_dequantize(topT_tile, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, output_descales, alpha, beta, output_elemtype);\n        }\n    }\n\n    return 0;\n}\n\nint Gemm_arm::create_pipeline_int8(const Option& opt)\n{\n    // finalize input_elemtype from cpu capability and opt\n    {\n        // armv8.2                  + use-fp16              = fp16\n        // armv8.2                  + no-fp16 + use-bf16    = bf16\n        // armv8.2                  + no-fp16 + no-bf16     = fp32\n        // armv8.0/armv7-vfpv4      + use-bf16              = bf16\n        // armv8.0/armv7-vfpv4      + no-bf16 + use-fp16    = fp16\n        // armv8.0/armv7-vfpv4      + no-fp16 + no-bf16     = fp32\n        // armv7                    + use-bf16              = bf16\n        // armv7                    + no-bf16               = fp32\n\n        bool use_fp16 = false;\n        bool use_bf16 = false;\n\n#if NCNN_ARM82\n        if (ncnn::cpu_support_arm_asimdhp())\n        {\n            use_fp16 = opt.use_fp16_storage;\n            use_bf16 = opt.use_bf16_storage && !opt.use_fp16_storage;\n        }\n        else\n#endif\n#if NCNN_VFPV4\n            if (ncnn::cpu_support_arm_vfpv4())\n            {\n                use_bf16 = opt.use_bf16_storage;\n                use_fp16 = opt.use_fp16_storage && !opt.use_bf16_storage;\n            }\n            else\n#endif\n            {\n                use_bf16 = opt.use_bf16_storage;\n            }\n\n        input_elemtype = 1; // fp32\n        if (use_fp16) input_elemtype = 2;\n        if (use_bf16) input_elemtype = 3;\n\n        // NCNN_LOGE(\"input_elemtype = %d\", input_elemtype);\n    }\n\n    if (constantA)\n    {\n        const int M = constantM;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_int8(M, 0, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n 
       AT_data.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, 1u, (Allocator*)0);\n        if (AT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_M; ppj++)\n        {\n            const int i = ppj * TILE_M;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_ii = std::min((M - i), TILE_M);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat AT_tile = AT_data.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                if (transA)\n                {\n                    transpose_pack_A_tile_int8(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n                else\n                {\n                    pack_A_tile_int8(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            A_data.release();\n    }\n\n    if (constantB)\n    {\n        const int N = constantN;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_int8(0, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_N = (N + TILE_N - 1) / TILE_N;\n\n        BT_data.create(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 1u, (Allocator*)0);\n        if (BT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_N; ppj++)\n        {\n            const int j = ppj * TILE_N;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_jj = std::min((N - j), TILE_N);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat BT_tile = BT_data.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (transB)\n                {\n      
              pack_B_tile_int8(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n                else\n                {\n                    transpose_pack_B_tile_int8(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            B_data.release();\n    }\n\n    if (constantC && constant_broadcast_type_C != -1)\n    {\n        CT_data = C_data;\n\n#if NCNN_VFPV4\n        if (input_elemtype == 2)\n        {\n            Mat C2;\n            ncnn::cast_float32_to_float16(CT_data, C2);\n            CT_data = C2;\n        }\n#endif\n#if NCNN_BF16\n        if (input_elemtype == 3)\n        {\n            Mat C2;\n            ncnn::cast_float32_to_bfloat16(CT_data, C2);\n            CT_data = C2;\n        }\n#endif\n\n        if (opt.lightmode)\n            C_data.release();\n    }\n\n    if (constantA || constantB || constantC)\n    {\n        nT = opt.num_threads;\n    }\n\n    return 0;\n}\n\nint Gemm_arm::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int M;\n    int N;\n    if (constantA && constantB)\n    {\n        M = constantM;\n        N = constantN;\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        M = constantM;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = constantN;\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = transB ? (B.dims == 3 ? 
B.c : B.h) * B.elempack : B.w;\n    }\n\n    Mat C;\n    int broadcast_type_C = 0;\n    if (constantC)\n    {\n        C = CT_data;\n        broadcast_type_C = constant_broadcast_type_C;\n    }\n    else\n    {\n        if (constantA && constantB)\n        {\n            C = bottom_blobs.size() == 1 ? bottom_blobs[0] : Mat();\n        }\n        else if (constantA)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else if (constantB)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else\n        {\n            C = bottom_blobs.size() == 3 ? bottom_blobs[2] : Mat();\n        }\n\n        if (!C.empty())\n        {\n            if (C.dims == 1 && C.w == 1)\n            {\n                // scalar\n                broadcast_type_C = 0;\n            }\n            if (C.dims == 1 && C.w * C.elempack == M)\n            {\n                // M\n                // auto broadcast from h to w is the ncnn-style convention\n                broadcast_type_C = 1;\n            }\n            if (C.dims == 1 && C.w * C.elempack == N)\n            {\n                // N\n                broadcast_type_C = 4;\n            }\n            if (C.dims == 2 && C.w == 1 && C.h * C.elempack == M)\n            {\n                // Mx1\n                broadcast_type_C = 2;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == M)\n            {\n                // MxN\n                broadcast_type_C = 3;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == 1)\n            {\n                // 1xN\n                broadcast_type_C = 4;\n            }\n        }\n    }\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        int outh = output_transpose ? N : M;\n        out_elempack = outh % 4 == 0 ? 
4 : 1;\n#if NCNN_ARM82\n        if (cpu_support_arm_asimdhp() && opt.use_fp16_storage && opt.use_fp16_arithmetic)\n        {\n            // TODO use output_elemtype\n            out_elempack = outh % 8 == 0 ? 8 : outh % 4 == 0 ? 4 : 1;\n        }\n#endif\n    }\n#endif // __ARM_NEON\n\n    // FIXME use output_elempack\n    // int output_elempack = out_elempack > 4 ? 4 : out_elempack;\n\n    if (output_elempack)\n        out_elempack = output_elempack;\n    size_t out_elemsize = 4u * out_elempack;\n\n    // FIXME use output_elemtype instead of input_elemtype\n    int output_elemtype = input_elemtype;\n\n    // TODO use output_elemtype\n    if (opt.use_bf16_storage)\n    {\n        out_elemsize = 2u * out_elempack;\n    }\n#if NCNN_VFPV4\n    else if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        out_elemsize = 2u * out_elempack;\n    }\n#endif\n\n    Mat& top_blob = top_blobs[0];\n    if (output_transpose)\n    {\n        if (output_N1M)\n            top_blob.create(M, 1, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(M, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    else\n    {\n        if (output_N1M)\n            top_blob.create(N, 1, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(N, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int _nT = nT ? 
nT : opt.num_threads;\n    if (nT != 0 && opt.num_threads != nT)\n    {\n        // force num_threads the same as in create_pipeline\n        // so we could use pre-packed A/B from the same tile config\n        NCNN_LOGE(\"opt.num_threads %d changed, gemm will use load-time value %d\", opt.num_threads, nT);\n    }\n\n    int ret = 0;\n    if (constantA && constantB)\n    {\n        ret = gemm_AT_BT_arm_int8(AT_data, A_data_int8_scales, BT_data, B_data_int8_scale, C, top_blob, broadcast_type_C, constantM, constantN, constantK, output_transpose, alpha, beta, input_elemtype, output_elemtype, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        ret = gemm_AT_arm_int8(AT_data, A_data_int8_scales, B, C, top_blob, broadcast_type_C, constantM, constantK, transB, output_transpose, alpha, beta, input_elemtype, output_elemtype, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        ret = gemm_BT_arm_int8(A, BT_data, B_data_int8_scale, C, top_blob, broadcast_type_C, constantN, constantK, transA, output_transpose, alpha, beta, input_elemtype, output_elemtype, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        ret = gemm_arm_int8(A, B, C, top_blob, broadcast_type_C, transA, transB, output_transpose, alpha, beta, input_elemtype, output_elemtype, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n\n    return ret;\n}\n#endif\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gemm_arm.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GEMM_ARM_H\n#define LAYER_GEMM_ARM_H\n\n#include \"gemm.h\"\n\nnamespace ncnn {\n\nclass Gemm_arm : public Gemm\n{\npublic:\n    Gemm_arm();\n\n    virtual int create_pipeline(const Option& opt);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_VFPV4\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#if NCNN_ARM82\n    int create_pipeline_fp16sa(const Option& opt);\n    int forward_fp16sa(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8(const Option& opt);\n    int forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n\npublic:\n    int nT;\n    Mat AT_data;\n    Mat BT_data;\n    Mat CT_data;\n\n    int input_elemtype; // 0=auto 1=fp32 2=fp16 3=bf16\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GEMM_ARM_H\n"
  },
  {
    "path": "src/layer/arm/gemm_arm_asimddp.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"gemm_int8.h\"\n#include \"gemm_int8_fp16s.h\"\n\n#if NCNN_BF16\n#include \"gemm_int8_bf16s.h\"\n#endif\n\nvoid pack_A_tile_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    pack_A_tile_int8(A, AT, i, max_ii, k, max_kk);\n}\n\nvoid transpose_pack_A_tile_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    transpose_pack_A_tile_int8(A, AT, i, max_ii, k, max_kk);\n}\n\nvoid pack_B_tile_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    pack_B_tile_int8(B, BT, j, max_jj, k, max_kk);\n}\n\nvoid transpose_pack_B_tile_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    transpose_pack_B_tile_int8(B, BT, j, max_jj, k, max_kk);\n}\n\nvoid pack_A_tile_fp32_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    pack_A_tile_fp32_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid transpose_pack_A_tile_fp32_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    transpose_pack_A_tile_fp32_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid pack_B_tile_fp32_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    pack_B_tile_fp32_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid transpose_pack_B_tile_fp32_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    transpose_pack_B_tile_fp32_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid unpack_output_tile_int32_to_fp32_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n    unpack_output_tile_int32_to_fp32(topT, C, 
top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\nvoid transpose_unpack_output_tile_int32_to_fp32_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n    transpose_unpack_output_tile_int32_to_fp32(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\nvoid gemm_transB_packed_tile_int8_asimddp(const Mat& AT_tile, const Mat& BT_tile, Mat& topT_tile, int i, int max_ii, int j, int max_jj, int k, int max_kk)\n{\n    gemm_transB_packed_tile_int8(AT_tile, BT_tile, topT_tile, i, max_ii, j, max_jj, k, max_kk);\n}\n\nvoid pack_A_tile_fp16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    pack_A_tile_fp16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid transpose_pack_A_tile_fp16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    transpose_pack_A_tile_fp16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid pack_B_tile_fp16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    pack_B_tile_fp16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid transpose_pack_B_tile_fp16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    transpose_pack_B_tile_fp16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid unpack_output_tile_int32_to_fp16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n    unpack_output_tile_int32_to_fp16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\nvoid transpose_unpack_output_tile_int32_to_fp16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& 
descales, float alpha, float beta)\n{\n    transpose_unpack_output_tile_int32_to_fp16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\n#if NCNN_BF16\nvoid pack_A_tile_bf16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    pack_A_tile_bf16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid transpose_pack_A_tile_bf16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    transpose_pack_A_tile_bf16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid pack_B_tile_bf16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    pack_B_tile_bf16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid transpose_pack_B_tile_bf16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    transpose_pack_B_tile_bf16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid unpack_output_tile_int32_to_bf16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n    unpack_output_tile_int32_to_bf16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\nvoid transpose_unpack_output_tile_int32_to_bf16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n    transpose_unpack_output_tile_int32_to_bf16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gemm_arm_asimdfhm.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gemm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"gemm_fp16s.h\"\n\nvoid gemm_transB_packed_tile_fp16s_asimdfhm(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int broadcast_type_C, float alpha, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end)\n{\n    gemm_transB_packed_tile_fp16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gemm_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gemm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"gemm_bf16s_fp16s.h\"\n#include \"gemm_fp16s.h\"\n\n#if NCNN_INT8\n#include \"gemm_int8_fp16s.h\"\n#endif\n\nstatic void gemm_transB_packed_tile_fp16sa(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end)\n{\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const __fp16* pAT = AT_tile;\n    const __fp16* pBT = BT_tile;\n    const __fp16* pC = CT_tile;\n\n    __fp16* outptr = topT_tile;\n\n    int ii = 0;\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const __fp16*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const __fp16*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float16x8_t _sum0;\n            float16x8_t _sum1;\n            float16x8_t _sum2;\n            float16x8_t _sum3;\n            float16x8_t _sum4;\n            float16x8_t _sum5;\n            float16x8_t _sum6;\n            float16x8_t _sum7;\n            float16x8_t _sum8;\n            float16x8_t _sum9;\n            float16x8_t _suma;\n            float16x8_t _sumb;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f16(0.f);\n                _sum1 = vdupq_n_f16(0.f);\n                _sum2 = 
vdupq_n_f16(0.f);\n                _sum3 = vdupq_n_f16(0.f);\n                _sum4 = vdupq_n_f16(0.f);\n                _sum5 = vdupq_n_f16(0.f);\n                _sum6 = vdupq_n_f16(0.f);\n                _sum7 = vdupq_n_f16(0.f);\n                _sum8 = vdupq_n_f16(0.f);\n                _sum9 = vdupq_n_f16(0.f);\n                _suma = vdupq_n_f16(0.f);\n                _sumb = vdupq_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = vdupq_n_f16(pC[0]);\n                        _sum2 = vdupq_n_f16(pC[0]);\n                        _sum3 = vdupq_n_f16(pC[0]);\n                        _sum4 = vdupq_n_f16(pC[0]);\n                        _sum5 = vdupq_n_f16(pC[0]);\n                        _sum6 = vdupq_n_f16(pC[0]);\n                        _sum7 = vdupq_n_f16(pC[0]);\n                        _sum8 = vdupq_n_f16(pC[0]);\n                        _sum9 = vdupq_n_f16(pC[0]);\n                        _suma = vdupq_n_f16(pC[0]);\n                        _sumb = vdupq_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                        _sum8 = _sum0;\n                        _sum9 = _sum0;\n                        _suma = _sum0;\n                        _sumb = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = vld1q_f16(pC + 8);\n                        _sum2 = 
vld1q_f16(pC + 8 * 2);\n                        _sum3 = vld1q_f16(pC + 8 * 3);\n                        _sum4 = vld1q_f16(pC + 8 * 4);\n                        _sum5 = vld1q_f16(pC + 8 * 5);\n                        _sum6 = vld1q_f16(pC + 8 * 6);\n                        _sum7 = vld1q_f16(pC + 8 * 7);\n                        _sum8 = vld1q_f16(pC + 8 * 8);\n                        _sum9 = vld1q_f16(pC + 8 * 9);\n                        _suma = vld1q_f16(pC + 8 * 10);\n                        _sumb = vld1q_f16(pC + 8 * 11);\n                        pC += 96;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = vdupq_n_f16(pC[1]);\n                        _sum2 = vdupq_n_f16(pC[2]);\n                        _sum3 = vdupq_n_f16(pC[3]);\n                        _sum4 = vdupq_n_f16(pC[4]);\n                        _sum5 = vdupq_n_f16(pC[5]);\n                        _sum6 = vdupq_n_f16(pC[6]);\n                        _sum7 = vdupq_n_f16(pC[7]);\n                        _sum8 = vdupq_n_f16(pC[8]);\n                        _sum9 = vdupq_n_f16(pC[9]);\n                        _suma = vdupq_n_f16(pC[10]);\n                        _sumb = vdupq_n_f16(pC[11]);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8 * 1);\n                _sum2 = vld1q_f16(outptr + 8 * 2);\n                _sum3 = vld1q_f16(outptr + 8 * 3);\n                _sum4 = vld1q_f16(outptr + 8 * 4);\n                _sum5 = vld1q_f16(outptr + 8 * 5);\n                _sum6 = vld1q_f16(outptr + 8 * 6);\n                _sum7 = vld1q_f16(outptr + 8 * 7);\n                _sum8 = vld1q_f16(outptr + 8 * 8);\n                _sum9 = vld1q_f16(outptr + 8 * 9);\n                _suma = vld1q_f16(outptr + 8 * 10);\n             
   _sumb = vld1q_f16(outptr + 8 * 11);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v3.8h}, [%0], #16      \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h}, [%1], #24 \\n\"\n                    \"fmla   %2.8h, v3.8h, v0.h[0]   \\n\"\n                    \"fmla   %3.8h, v3.8h, v0.h[1]   \\n\"\n                    \"fmla   %4.8h, v3.8h, v0.h[2]   \\n\"\n                    \"fmla   %5.8h, v3.8h, v0.h[3]   \\n\"\n                    \"fmla   %6.8h, v3.8h, v1.h[0]   \\n\"\n                    \"fmla   %7.8h, v3.8h, v1.h[1]   \\n\"\n                    \"fmla   %8.8h, v3.8h, v1.h[2]   \\n\"\n                    \"fmla   %9.8h, v3.8h, v1.h[3]   \\n\"\n                    \"fmla   %10.8h, v3.8h, v2.h[0]  \\n\"\n                    \"fmla   %11.8h, v3.8h, v2.h[1]  \\n\"\n                    \"fmla   %12.8h, v3.8h, v2.h[2]  \\n\"\n                    \"fmla   %13.8h, v3.8h, v2.h[3]  \\n\"\n                    : \"=r\"(pA),\n                    \"=r\"(pB),\n                    \"=w\"(_sum0),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3),\n                    \"=w\"(_sum4),\n                    \"=w\"(_sum5),\n                    \"=w\"(_sum6),\n                    \"=w\"(_sum7),\n                    \"=w\"(_sum8),\n                    \"=w\"(_sum9),\n                    \"=w\"(_suma),\n                    \"=w\"(_sumb)\n                    : \"0\"(pA),\n                    \"1\"(pB),\n                    \"2\"(_sum0),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3),\n                    \"6\"(_sum4),\n                    \"7\"(_sum5),\n                    \"8\"(_sum6),\n                    \"9\"(_sum7),\n                    \"10\"(_sum8),\n                    \"11\"(_sum9),\n  
                  \"12\"(_suma),\n                    \"13\"(_sumb)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_lane_f16(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_lane_f16(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_lane_f16(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_lane_f16(_sum7, _pA, _pB1, 3);\n                _sum8 = vfmaq_lane_f16(_sum8, _pA, _pB2, 0);\n                _sum9 = vfmaq_lane_f16(_sum9, _pA, _pB2, 1);\n                _suma = vfmaq_lane_f16(_suma, _pA, _pB2, 2);\n                _sumb = vfmaq_lane_f16(_sumb, _pA, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n#endif\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8 * 1, _sum1);\n                    vst1q_f16(outptr0 + 8 * 2, _sum2);\n                    vst1q_f16(outptr0 + 8 * 3, _sum3);\n                    vst1q_f16(outptr0 + 8 * 4, _sum4);\n                    vst1q_f16(outptr0 + 8 * 5, _sum5);\n                    vst1q_f16(outptr0 + 8 * 6, _sum6);\n                    vst1q_f16(outptr0 + 8 * 7, _sum7);\n                    vst1q_f16(outptr0 + 8 * 8, _sum8);\n                    vst1q_f16(outptr0 + 8 * 9, _sum9);\n                    vst1q_f16(outptr0 + 8 * 10, _suma);\n                    vst1q_f16(outptr0 + 8 * 11, _sumb);\n                    outptr0 += 96;\n                }\n 
               if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + 4 * 2, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + 4 * 3, vget_low_f16(_sum3));\n                    vst1_f16(outptr0 + 4 * 4, vget_low_f16(_sum4));\n                    vst1_f16(outptr0 + 4 * 5, vget_low_f16(_sum5));\n                    vst1_f16(outptr0 + 4 * 6, vget_low_f16(_sum6));\n                    vst1_f16(outptr0 + 4 * 7, vget_low_f16(_sum7));\n                    vst1_f16(outptr0 + 4 * 8, vget_low_f16(_sum8));\n                    vst1_f16(outptr0 + 4 * 9, vget_low_f16(_sum9));\n                    vst1_f16(outptr0 + 4 * 10, vget_low_f16(_suma));\n                    vst1_f16(outptr0 + 4 * 11, vget_low_f16(_sumb));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 2, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 3, vget_high_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 4, vget_high_f16(_sum4));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 5, vget_high_f16(_sum5));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 6, vget_high_f16(_sum6));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 7, vget_high_f16(_sum7));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 8, vget_high_f16(_sum8));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 9, vget_high_f16(_sum9));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 10, vget_high_f16(_suma));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 11, vget_high_f16(_sumb));\n\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n   
             {\n                    transpose8x8_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + out_hstep * 1, _sum1);\n                    vst1q_f16(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_f16(outptr0 + out_hstep * 3, _sum3);\n                    vst1q_f16(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_f16(outptr0 + out_hstep * 5, _sum5);\n                    vst1q_f16(outptr0 + out_hstep * 6, _sum6);\n                    vst1q_f16(outptr0 + out_hstep * 7, _sum7);\n\n                    transpose8x4_ph(_sum8, _sum9, _suma, _sumb);\n\n                    vst1_f16(outptr0 + 8, vget_low_f16(_sum8));\n                    vst1_f16(outptr0 + out_hstep * 1 + 8, vget_high_f16(_sum8));\n                    vst1_f16(outptr0 + out_hstep * 2 + 8, vget_low_f16(_sum9));\n                    vst1_f16(outptr0 + out_hstep * 3 + 8, vget_high_f16(_sum9));\n                    vst1_f16(outptr0 + out_hstep * 4 + 8, vget_low_f16(_suma));\n                    vst1_f16(outptr0 + out_hstep * 5 + 8, vget_high_f16(_suma));\n                    vst1_f16(outptr0 + out_hstep * 6 + 8, vget_low_f16(_sumb));\n                    vst1_f16(outptr0 + out_hstep * 7 + 8, vget_high_f16(_sumb));\n\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8 * 1, _sum1);\n                vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n                vst1q_f16(outptr + 8 * 4, _sum4);\n                vst1q_f16(outptr + 8 * 5, _sum5);\n                vst1q_f16(outptr + 8 * 6, _sum6);\n                vst1q_f16(outptr + 8 * 7, _sum7);\n                vst1q_f16(outptr + 8 * 8, _sum8);\n                vst1q_f16(outptr + 8 * 9, _sum9);\n                vst1q_f16(outptr + 8 * 10, _suma);\n                
vst1q_f16(outptr + 8 * 11, _sumb);\n            }\n\n            outptr += 96;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float16x8_t _sum0;\n            float16x8_t _sum1;\n            float16x8_t _sum2;\n            float16x8_t _sum3;\n            float16x8_t _sum4;\n            float16x8_t _sum5;\n            float16x8_t _sum6;\n            float16x8_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f16(0.f);\n                _sum1 = vdupq_n_f16(0.f);\n                _sum2 = vdupq_n_f16(0.f);\n                _sum3 = vdupq_n_f16(0.f);\n                _sum4 = vdupq_n_f16(0.f);\n                _sum5 = vdupq_n_f16(0.f);\n                _sum6 = vdupq_n_f16(0.f);\n                _sum7 = vdupq_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = vdupq_n_f16(pC[0]);\n                        _sum2 = vdupq_n_f16(pC[0]);\n                        _sum3 = vdupq_n_f16(pC[0]);\n                        _sum4 = vdupq_n_f16(pC[0]);\n                        _sum5 = vdupq_n_f16(pC[0]);\n                        _sum6 = vdupq_n_f16(pC[0]);\n                        _sum7 = vdupq_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = vld1q_f16(pC + 8);\n                        _sum2 = 
vld1q_f16(pC + 8 * 2);\n                        _sum3 = vld1q_f16(pC + 8 * 3);\n                        _sum4 = vld1q_f16(pC + 8 * 4);\n                        _sum5 = vld1q_f16(pC + 8 * 5);\n                        _sum6 = vld1q_f16(pC + 8 * 6);\n                        _sum7 = vld1q_f16(pC + 8 * 7);\n                        pC += 64;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = vdupq_n_f16(pC[1]);\n                        _sum2 = vdupq_n_f16(pC[2]);\n                        _sum3 = vdupq_n_f16(pC[3]);\n                        _sum4 = vdupq_n_f16(pC[4]);\n                        _sum5 = vdupq_n_f16(pC[5]);\n                        _sum6 = vdupq_n_f16(pC[6]);\n                        _sum7 = vdupq_n_f16(pC[7]);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8 * 1);\n                _sum2 = vld1q_f16(outptr + 8 * 2);\n                _sum3 = vld1q_f16(outptr + 8 * 3);\n                _sum4 = vld1q_f16(outptr + 8 * 4);\n                _sum5 = vld1q_f16(outptr + 8 * 5);\n                _sum6 = vld1q_f16(outptr + 8 * 6);\n                _sum7 = vld1q_f16(outptr + 8 * 7);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_lane_f16(_sum4, _pA, _pB1, 0);\n       
         _sum5 = vfmaq_lane_f16(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_lane_f16(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_lane_f16(_sum7, _pA, _pB1, 3);\n\n                pA += 8;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8 * 1, _sum1);\n                    vst1q_f16(outptr0 + 8 * 2, _sum2);\n                    vst1q_f16(outptr0 + 8 * 3, _sum3);\n                    vst1q_f16(outptr0 + 8 * 4, _sum4);\n                    vst1q_f16(outptr0 + 8 * 5, _sum5);\n                    vst1q_f16(outptr0 + 8 * 6, _sum6);\n                    vst1q_f16(outptr0 + 8 * 7, _sum7);\n                    outptr0 += 64;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + 4 * 2, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + 4 * 3, vget_low_f16(_sum3));\n                    vst1_f16(outptr0 + 4 * 4, vget_low_f16(_sum4));\n                    vst1_f16(outptr0 + 4 * 5, vget_low_f16(_sum5));\n                    vst1_f16(outptr0 + 4 * 6, vget_low_f16(_sum6));\n                    vst1_f16(outptr0 + 4 * 7, vget_low_f16(_sum7));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 2, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 3, vget_high_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 4, vget_high_f16(_sum4));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 5, vget_high_f16(_sum5));\n                    vst1_f16(outptr0 + out_hstep * 
4 + 4 * 6, vget_high_f16(_sum6));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 7, vget_high_f16(_sum7));\n\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x8_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + out_hstep * 1, _sum1);\n                    vst1q_f16(outptr0 + out_hstep * 2, _sum2);\n                    vst1q_f16(outptr0 + out_hstep * 3, _sum3);\n                    vst1q_f16(outptr0 + out_hstep * 4, _sum4);\n                    vst1q_f16(outptr0 + out_hstep * 5, _sum5);\n                    vst1q_f16(outptr0 + out_hstep * 6, _sum6);\n                    vst1q_f16(outptr0 + out_hstep * 7, _sum7);\n\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8 * 1, _sum1);\n                vst1q_f16(outptr + 8 * 2, _sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n                vst1q_f16(outptr + 8 * 4, _sum4);\n                vst1q_f16(outptr + 8 * 5, _sum5);\n                vst1q_f16(outptr + 8 * 6, _sum6);\n                vst1q_f16(outptr + 8 * 7, _sum7);\n            }\n\n            outptr += 64;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float16x8_t _sum0;\n            float16x8_t _sum1;\n            float16x8_t _sum2;\n            float16x8_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f16(0.f);\n                _sum1 = vdupq_n_f16(0.f);\n                _sum2 = vdupq_n_f16(0.f);\n                _sum3 = vdupq_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = 
vdupq_n_f16(pC[0]);\n                        _sum2 = vdupq_n_f16(pC[0]);\n                        _sum3 = vdupq_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = vld1q_f16(pC + 8);\n                        _sum2 = vld1q_f16(pC + 8 * 2);\n                        _sum3 = vld1q_f16(pC + 8 * 3);\n                        pC += 32;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = vdupq_n_f16(pC[1]);\n                        _sum2 = vdupq_n_f16(pC[2]);\n                        _sum3 = vdupq_n_f16(pC[3]);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8 * 1);\n                _sum2 = vld1q_f16(outptr + 8 * 2);\n                _sum3 = vld1q_f16(outptr + 8 * 3);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n\n                _sum0 = vfmaq_lane_f16(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _pA, _pB0, 3);\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                
if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8 * 1, _sum1);\n                    vst1q_f16(outptr0 + 8 * 2, _sum2);\n                    vst1q_f16(outptr0 + 8 * 3, _sum3);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + 4 * 2, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + 4 * 3, vget_low_f16(_sum3));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 2, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4 * 3, vget_high_f16(_sum3));\n\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x4_ph(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 1, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 2, vget_low_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 3, vget_high_f16(_sum1));\n                    vst1_f16(outptr0 + out_hstep * 4, vget_low_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 5, vget_high_f16(_sum2));\n                    vst1_f16(outptr0 + out_hstep * 6, vget_low_f16(_sum3));\n                    vst1_f16(outptr0 + out_hstep * 7, vget_high_f16(_sum3));\n\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8 * 1, _sum1);\n                vst1q_f16(outptr + 8 * 2, 
_sum2);\n                vst1q_f16(outptr + 8 * 3, _sum3);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float16x8_t _sum0;\n            float16x8_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f16(0.f);\n                _sum1 = vdupq_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = vdupq_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        _sum1 = vld1q_f16(pC + 8);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        _sum1 = vdupq_n_f16(pC[1]);\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n                _sum1 = vld1q_f16(outptr + 8);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x8_t _pB0 = vdupq_n_f16(pB[0]);\n                float16x8_t _pB1 = vdupq_n_f16(pB[1]);\n\n                _sum0 = vfmaq_f16(_sum0, _pA, _pB0);\n                _sum1 = vfmaq_f16(_sum1, _pA, _pB1);\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n     
           {\n                    vst1q_f16(outptr0, _sum0);\n                    vst1q_f16(outptr0 + 8, _sum1);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + 4, vget_low_f16(_sum1));\n\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4 + 4, vget_high_f16(_sum1));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[8];\n                    __fp16 sum1[8];\n                    vst1q_f16(sum0, _sum0);\n                    vst1q_f16(sum1, _sum1);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0[out_hstep * 4 + 1] = sum1[4];\n                    outptr0[out_hstep * 5 + 1] = sum1[5];\n                    outptr0[out_hstep * 6 + 1] = sum1[6];\n                    outptr0[out_hstep * 7 + 1] = sum1[7];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n                vst1q_f16(outptr + 8, _sum1);\n            }\n\n            outptr += 16;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float16x8_t 
_sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f16(pC);\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f16(pC[0]);\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f16(outptr);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x8_t _pB = vdupq_n_f16(pB[0]);\n\n                _sum0 = vfmaq_f16(_sum0, _pA, _pB);\n\n                pA += 8;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 8)\n                {\n                    vst1q_f16(outptr0, _sum0);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, vget_low_f16(_sum0));\n                    vst1_f16(outptr0 + out_hstep * 4, vget_high_f16(_sum0));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[8];\n                    vst1q_f16(sum0, _sum0);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep * 1] = sum0[1];\n                 
   outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f16(outptr, _sum0);\n            }\n\n            outptr += 8;\n        }\n\n        pAT += max_kk * 8;\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const __fp16*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const __fp16*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n            float16x4_t _sum3;\n            float16x4_t _sum4;\n            float16x4_t _sum5;\n            float16x4_t _sum6;\n            float16x4_t _sum7;\n            float16x4_t _sum8;\n            float16x4_t _sum9;\n            float16x4_t _suma;\n            float16x4_t _sumb;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n                _sum1 = vdup_n_f16(0.f);\n                _sum2 = vdup_n_f16(0.f);\n                _sum3 = vdup_n_f16(0.f);\n                _sum4 = vdup_n_f16(0.f);\n                _sum5 = vdup_n_f16(0.f);\n                _sum6 = vdup_n_f16(0.f);\n                _sum7 = vdup_n_f16(0.f);\n                _sum8 = vdup_n_f16(0.f);\n                _sum9 = vdup_n_f16(0.f);\n                _suma = vdup_n_f16(0.f);\n         
       _sumb = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[0]);\n                        _sum2 = vdup_n_f16(pC[0]);\n                        _sum3 = vdup_n_f16(pC[0]);\n                        _sum4 = vdup_n_f16(pC[0]);\n                        _sum5 = vdup_n_f16(pC[0]);\n                        _sum6 = vdup_n_f16(pC[0]);\n                        _sum7 = vdup_n_f16(pC[0]);\n                        _sum8 = vdup_n_f16(pC[0]);\n                        _sum9 = vdup_n_f16(pC[0]);\n                        _suma = vdup_n_f16(pC[0]);\n                        _sumb = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                        _sum8 = _sum0;\n                        _sum9 = _sum0;\n                        _suma = _sum0;\n                        _sumb = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = vld1_f16(pC + 4);\n                        _sum2 = vld1_f16(pC + 8);\n                        _sum3 = vld1_f16(pC + 12);\n                        _sum4 = vld1_f16(pC + 16);\n                        _sum5 = vld1_f16(pC + 20);\n                        _sum6 = vld1_f16(pC + 24);\n                        _sum7 = vld1_f16(pC + 28);\n                        _sum8 = vld1_f16(pC + 32);\n                        _sum9 = vld1_f16(pC + 36);\n             
           _suma = vld1_f16(pC + 40);\n                        _sumb = vld1_f16(pC + 44);\n                        pC += 48;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[1]);\n                        _sum2 = vdup_n_f16(pC[2]);\n                        _sum3 = vdup_n_f16(pC[3]);\n                        _sum4 = vdup_n_f16(pC[4]);\n                        _sum5 = vdup_n_f16(pC[5]);\n                        _sum6 = vdup_n_f16(pC[6]);\n                        _sum7 = vdup_n_f16(pC[7]);\n                        _sum8 = vdup_n_f16(pC[8]);\n                        _sum9 = vdup_n_f16(pC[9]);\n                        _suma = vdup_n_f16(pC[10]);\n                        _sumb = vdup_n_f16(pC[11]);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4 * 1);\n                _sum2 = vld1_f16(outptr + 4 * 2);\n                _sum3 = vld1_f16(outptr + 4 * 3);\n                _sum4 = vld1_f16(outptr + 4 * 4);\n                _sum5 = vld1_f16(outptr + 4 * 5);\n                _sum6 = vld1_f16(outptr + 4 * 6);\n                _sum7 = vld1_f16(outptr + 4 * 7);\n                _sum8 = vld1_f16(outptr + 4 * 8);\n                _sum9 = vld1_f16(outptr + 4 * 9);\n                _suma = vld1_f16(outptr + 4 * 10);\n                _sumb = vld1_f16(outptr + 4 * 11);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                _sum0 = vfma_lane_f16(_sum0, _pA, _pB0, 0);\n                
_sum1 = vfma_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfma_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfma_lane_f16(_sum3, _pA, _pB0, 3);\n                _sum4 = vfma_lane_f16(_sum4, _pA, _pB1, 0);\n                _sum5 = vfma_lane_f16(_sum5, _pA, _pB1, 1);\n                _sum6 = vfma_lane_f16(_sum6, _pA, _pB1, 2);\n                _sum7 = vfma_lane_f16(_sum7, _pA, _pB1, 3);\n                _sum8 = vfma_lane_f16(_sum8, _pA, _pB2, 0);\n                _sum9 = vfma_lane_f16(_sum9, _pA, _pB2, 1);\n                _suma = vfma_lane_f16(_suma, _pA, _pB2, 2);\n                _sumb = vfma_lane_f16(_sumb, _pA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 4 * 2, _sum2);\n                    vst1_f16(outptr0 + 4 * 3, _sum3);\n                    vst1_f16(outptr0 + 4 * 4, _sum4);\n                    vst1_f16(outptr0 + 4 * 5, _sum5);\n                    vst1_f16(outptr0 + 4 * 6, _sum6);\n                    vst1_f16(outptr0 + 4 * 7, _sum7);\n                    vst1_f16(outptr0 + 4 * 8, _sum8);\n                    vst1_f16(outptr0 + 4 * 9, _sum9);\n                    vst1_f16(outptr0 + 4 * 10, _suma);\n                    vst1_f16(outptr0 + 4 * 11, _sumb);\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x12_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7, _sum8, _sum9, _suma, _sumb);\n\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 8, _sum2);\n                    vst1_f16(outptr0 + out_hstep, _sum3);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum4);\n               
     vst1_f16(outptr0 + out_hstep + 8, _sum5);\n                    vst1_f16(outptr0 + out_hstep * 2, _sum6);\n                    vst1_f16(outptr0 + out_hstep * 2 + 4, _sum7);\n                    vst1_f16(outptr0 + out_hstep * 2 + 8, _sum8);\n                    vst1_f16(outptr0 + out_hstep * 3, _sum9);\n                    vst1_f16(outptr0 + out_hstep * 3 + 4, _suma);\n                    vst1_f16(outptr0 + out_hstep * 3 + 8, _sumb);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n                vst1_f16(outptr + 4 * 4, _sum4);\n                vst1_f16(outptr + 4 * 5, _sum5);\n                vst1_f16(outptr + 4 * 6, _sum6);\n                vst1_f16(outptr + 4 * 7, _sum7);\n                vst1_f16(outptr + 4 * 8, _sum8);\n                vst1_f16(outptr + 4 * 9, _sum9);\n                vst1_f16(outptr + 4 * 10, _suma);\n                vst1_f16(outptr + 4 * 11, _sumb);\n            }\n\n            outptr += 48;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n            float16x4_t _sum3;\n            float16x4_t _sum4;\n            float16x4_t _sum5;\n            float16x4_t _sum6;\n            float16x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n                _sum1 = vdup_n_f16(0.f);\n                _sum2 = vdup_n_f16(0.f);\n                _sum3 = vdup_n_f16(0.f);\n                _sum4 = vdup_n_f16(0.f);\n                _sum5 = vdup_n_f16(0.f);\n                _sum6 = vdup_n_f16(0.f);\n                _sum7 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n      
                  _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[0]);\n                        _sum2 = vdup_n_f16(pC[0]);\n                        _sum3 = vdup_n_f16(pC[0]);\n                        _sum4 = vdup_n_f16(pC[0]);\n                        _sum5 = vdup_n_f16(pC[0]);\n                        _sum6 = vdup_n_f16(pC[0]);\n                        _sum7 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = vld1_f16(pC + 4);\n                        _sum2 = vld1_f16(pC + 8);\n                        _sum3 = vld1_f16(pC + 12);\n                        _sum4 = vld1_f16(pC + 16);\n                        _sum5 = vld1_f16(pC + 20);\n                        _sum6 = vld1_f16(pC + 24);\n                        _sum7 = vld1_f16(pC + 28);\n                        pC += 32;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[1]);\n                        _sum2 = vdup_n_f16(pC[2]);\n                        _sum3 = vdup_n_f16(pC[3]);\n                        _sum4 = vdup_n_f16(pC[4]);\n                        _sum5 = vdup_n_f16(pC[5]);\n                        _sum6 = vdup_n_f16(pC[6]);\n                        _sum7 = vdup_n_f16(pC[7]);\n                        pC += 8;\n                    }\n                }\n            }\n     
       else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4 * 1);\n                _sum2 = vld1_f16(outptr + 4 * 2);\n                _sum3 = vld1_f16(outptr + 4 * 3);\n                _sum4 = vld1_f16(outptr + 4 * 4);\n                _sum5 = vld1_f16(outptr + 4 * 5);\n                _sum6 = vld1_f16(outptr + 4 * 6);\n                _sum7 = vld1_f16(outptr + 4 * 7);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                _sum0 = vfma_lane_f16(_sum0, _pA, _pB0, 0);\n                _sum1 = vfma_lane_f16(_sum1, _pA, _pB0, 1);\n                _sum2 = vfma_lane_f16(_sum2, _pA, _pB0, 2);\n                _sum3 = vfma_lane_f16(_sum3, _pA, _pB0, 3);\n                _sum4 = vfma_lane_f16(_sum4, _pA, _pB1, 0);\n                _sum5 = vfma_lane_f16(_sum5, _pA, _pB1, 1);\n                _sum6 = vfma_lane_f16(_sum6, _pA, _pB1, 2);\n                _sum7 = vfma_lane_f16(_sum7, _pA, _pB1, 3);\n\n                pA += 4;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 4 * 2, _sum2);\n                    vst1_f16(outptr0 + 4 * 3, _sum3);\n                    vst1_f16(outptr0 + 4 * 4, _sum4);\n                    vst1_f16(outptr0 + 4 * 5, _sum5);\n                    vst1_f16(outptr0 + 4 * 6, _sum6);\n                    vst1_f16(outptr0 + 4 * 7, _sum7);\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x8_ph(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, 
_sum6, _sum7);\n\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + out_hstep, _sum2);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum3);\n                    vst1_f16(outptr0 + out_hstep * 2, _sum4);\n                    vst1_f16(outptr0 + out_hstep * 2 + 4, _sum5);\n                    vst1_f16(outptr0 + out_hstep * 3, _sum6);\n                    vst1_f16(outptr0 + out_hstep * 3 + 4, _sum7);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n                vst1_f16(outptr + 4 * 4, _sum4);\n                vst1_f16(outptr + 4 * 5, _sum5);\n                vst1_f16(outptr + 4 * 6, _sum6);\n                vst1_f16(outptr + 4 * 7, _sum7);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n            float16x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n                _sum1 = vdup_n_f16(0.f);\n                _sum2 = vdup_n_f16(0.f);\n                _sum3 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[0]);\n                        _sum2 = vdup_n_f16(pC[0]);\n                        _sum3 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = _sum0;\n                        _sum2 = 
_sum0;\n                        _sum3 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = vld1_f16(pC + 4);\n                        _sum2 = vld1_f16(pC + 8);\n                        _sum3 = vld1_f16(pC + 12);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[1]);\n                        _sum2 = vdup_n_f16(pC[2]);\n                        _sum3 = vdup_n_f16(pC[3]);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4 * 1);\n                _sum2 = vld1_f16(outptr + 4 * 2);\n                _sum3 = vld1_f16(outptr + 4 * 3);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB = vld1_f16(pB);\n\n                _sum0 = vfma_lane_f16(_sum0, _pA, _pB, 0);\n                _sum1 = vfma_lane_f16(_sum1, _pA, _pB, 1);\n                _sum2 = vfma_lane_f16(_sum2, _pA, _pB, 2);\n                _sum3 = vfma_lane_f16(_sum3, _pA, _pB, 3);\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 4 * 2, _sum2);\n                    vst1_f16(outptr0 + 4 * 3, _sum3);\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    
transpose4x4_ph(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + out_hstep * 1, _sum1);\n                    vst1_f16(outptr0 + out_hstep * 2, _sum2);\n                    vst1_f16(outptr0 + out_hstep * 3, _sum3);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 4 * 2, _sum2);\n                vst1_f16(outptr + 4 * 3, _sum3);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n                _sum1 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = vld1_f16(pC + 4);\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[1]);\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4);\n            }\n\n            const __fp16* pA = pAT;\n         
   int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB0 = vdup_n_f16(pB[0]);\n                float16x4_t _pB1 = vdup_n_f16(pB[1]);\n\n                _sum0 = vfma_f16(_sum0, _pA, _pB0);\n                _sum1 = vfma_f16(_sum1, _pA, _pB1);\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[4];\n                    __fp16 sum1[4];\n                    vst1_f16(sum0, _sum0);\n                    vst1_f16(sum1, _sum1);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float16x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    
{\n                        _sum0 = vld1_f16(pC);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        pC += 4;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pA = vld1_f16(pA);\n                float16x4_t _pB = vdup_n_f16(pB[0]);\n\n                _sum0 = vfma_f16(_sum0, _pA, _pB);\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    __fp16 sum0[4];\n                    vst1_f16(sum0, _sum0);\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n            }\n\n            outptr += 4;\n        }\n\n        pAT += max_kk * 4;\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + (i + ii) * out_hstep + j;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const __fp16*)CT_tile + i + ii;\n   
         }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const __fp16*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float16x4_t _sum00;\n            float16x4_t _sum01;\n            float16x4_t _sum02;\n            float16x4_t _sum10;\n            float16x4_t _sum11;\n            float16x4_t _sum12;\n\n            if (k == 0)\n            {\n                _sum00 = vdup_n_f16(0.f);\n                _sum01 = vdup_n_f16(0.f);\n                _sum02 = vdup_n_f16(0.f);\n                _sum10 = vdup_n_f16(0.f);\n                _sum11 = vdup_n_f16(0.f);\n                _sum12 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdup_n_f16(pC[0]);\n                        _sum01 = _sum00;\n                        _sum02 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        _sum12 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vdup_n_f16(pC[0]);\n                        _sum01 = _sum00;\n                        _sum02 = _sum00;\n                        _sum10 = vdup_n_f16(pC[1]);\n                        _sum11 = _sum10;\n                        _sum12 = _sum10;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float16x4x2_t _tmp01 = vld2_f16(pC);\n                        float16x4x2_t _tmp23 = vld2_f16(pC + 8);\n                        float16x4x2_t _tmp45 = vld2_f16(pC + 16);\n                        _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum02 = _tmp45.val[0];\n                        _sum10 = _tmp01.val[1];\n                       
 _sum11 = _tmp23.val[1];\n                        _sum12 = _tmp45.val[1];\n                        pC += 24;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vld1_f16(pC);\n                        _sum01 = vld1_f16(pC + 4);\n                        _sum02 = vld1_f16(pC + 8);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum12 = _sum02;\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01 = vld2_f16(outptr);\n                float16x4x2_t _tmp23 = vld2_f16(outptr + 8);\n                float16x4x2_t _tmp45 = vld2_f16(outptr + 16);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum02 = _tmp45.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n                _sum12 = _tmp45.val[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n                float16x4_t _pA1 = vdup_n_f16(pA[1]);\n\n                _sum00 = vfma_f16(_sum00, _pB0, _pA0);\n                _sum01 = vfma_f16(_sum01, _pB1, _pA0);\n                _sum02 = vfma_f16(_sum02, _pB2, _pA0);\n                _sum10 = vfma_f16(_sum10, _pB0, _pA1);\n                _sum11 = vfma_f16(_sum11, _pB1, _pA1);\n                _sum12 = vfma_f16(_sum12, _pB2, _pA1);\n\n                pA += 2;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, 
_sum00);\n                    vst1_f16(outptr0 + 4, _sum01);\n                    vst1_f16(outptr0 + 8, _sum02);\n                    vst1_f16(outptr0 + out_hstep, _sum10);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum11);\n                    vst1_f16(outptr0 + out_hstep + 8, _sum12);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float16x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float16x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2_f16(outptr, _tmp01);\n                vst2_f16(outptr + 8, _tmp23);\n                vst2_f16(outptr + 16, _tmp45);\n            }\n\n            outptr += 24;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float16x4_t _sum00;\n            float16x4_t _sum01;\n            float16x4_t _sum10;\n            float16x4_t _sum11;\n\n            if (k == 0)\n            {\n                _sum00 = vdup_n_f16(0.f);\n                _sum01 = vdup_n_f16(0.f);\n                _sum10 = vdup_n_f16(0.f);\n                _sum11 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdup_n_f16(pC[0]);\n                        _sum01 = vdup_n_f16(pC[0]);\n                        _sum10 = vdup_n_f16(pC[0]);\n                        _sum11 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vdup_n_f16(pC[0]);\n                        _sum01 = vdup_n_f16(pC[0]);\n                        _sum10 = vdup_n_f16(pC[1]);\n                        _sum11 = 
vdup_n_f16(pC[1]);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float16x4x2_t _tmp01 = vld2_f16(pC);\n                        float16x4x2_t _tmp23 = vld2_f16(pC + 8);\n                        _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum10 = _tmp01.val[1];\n                        _sum11 = _tmp23.val[1];\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vld1_f16(pC);\n                        _sum01 = vld1_f16(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01 = vld2_f16(outptr);\n                float16x4x2_t _tmp23 = vld2_f16(outptr + 8);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n                float16x4_t _pA1 = vdup_n_f16(pA[1]);\n\n                _sum00 = vfma_f16(_sum00, _pB0, _pA0);\n                _sum01 = vfma_f16(_sum01, _pB1, _pA0);\n                _sum10 = vfma_f16(_sum10, _pB0, _pA1);\n                _sum11 = vfma_f16(_sum11, _pB1, _pA1);\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum00);\n                    vst1_f16(outptr0 + 4, 
_sum01);\n                    vst1_f16(outptr0 + out_hstep, _sum10);\n                    vst1_f16(outptr0 + out_hstep + 4, _sum11);\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float16x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                vst2_f16(outptr, _tmp01);\n                vst2_f16(outptr + 8, _tmp23);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n                _sum1 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[1]);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float16x4x2_t _tmp01 = vld2_f16(pC);\n                        _sum0 = _tmp01.val[0];\n                        _sum1 = _tmp01.val[1];\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = _sum0;\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01 = vld2_f16(outptr);\n                _sum0 = _tmp01.val[0];\n 
               _sum1 = _tmp01.val[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB = vld1_f16(pB);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n                float16x4_t _pA1 = vdup_n_f16(pA[1]);\n\n                _sum0 = vfma_f16(_sum0, _pB, _pA0);\n                _sum1 = vfma_f16(_sum1, _pB, _pA1);\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + out_hstep, _sum1);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                float16x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2_f16(outptr, _tmp01);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            __fp16 sum00;\n            __fp16 sum01;\n            __fp16 sum10;\n            __fp16 sum11;\n\n            if (k == 0)\n            {\n                sum00 = 0.f;\n                sum01 = 0.f;\n                sum10 = 0.f;\n                sum11 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[0];\n                        sum10 = pC[0];\n                        sum11 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[0];\n                        sum11 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n          
          {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[2];\n                        sum11 = pC[3];\n                        pC += 4;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[0];\n                        sum10 = pC[1];\n                        sum11 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum00 = outptr[0];\n                sum01 = outptr[1];\n                sum10 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum00 += pA[0] * pB[0];\n                sum01 += pA[1] * pB[0];\n                sum10 += pA[0] * pB[1];\n                sum11 += pA[1] * pB[1];\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum00;\n                    outptr0[1] = sum10;\n                    outptr0[out_hstep] = sum01;\n                    outptr0[out_hstep + 1] = sum11;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n            }\n\n            outptr += 4;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            __fp16 sum0;\n            __fp16 sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    
{\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[1] * pB[0];\n                pA += 2;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[out_hstep] = sum1;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        __fp16* outptr0 = (__fp16*)top_blob + (i + ii) * out_hstep + j;\n\n        const __fp16* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const __fp16*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const 
__fp16*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n            float16x4_t _sum2;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n                _sum1 = vdup_n_f16(0.f);\n                _sum2 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[0]);\n                        _sum2 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = vld1_f16(pC + 4);\n                        _sum2 = vld1_f16(pC + 8);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4);\n                _sum2 = vld1_f16(outptr + 8);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n\n                _sum0 = vfma_f16(_sum0, _pA0, _pB0);\n                _sum1 = vfma_f16(_sum1, _pA0, _pB1);\n                _sum2 = vfma_f16(_sum2, _pA0, _pB2);\n\n                pA += 1;\n                pB += 12;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    
vst1_f16(outptr0 + 4, _sum1);\n                    vst1_f16(outptr0 + 8, _sum2);\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n                vst1_f16(outptr + 8, _sum2);\n            }\n\n            outptr += 12;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float16x4_t _sum0;\n            float16x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdup_n_f16(0.f);\n                _sum1 = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdup_n_f16(pC[0]);\n                        _sum1 = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1_f16(pC);\n                        _sum1 = vld1_f16(pC + 4);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1_f16(outptr);\n                _sum1 = vld1_f16(outptr + 4);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n\n                _sum0 = vfma_f16(_sum0, _pA0, _pB0);\n                _sum1 = vfma_f16(_sum1, _pA0, _pB1);\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum0);\n                    vst1_f16(outptr0 + 4, _sum1);\n    
                outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum0);\n                vst1_f16(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float16x4_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdup_n_f16(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum = vdup_n_f16(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum = vld1_f16(pC);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum = vld1_f16(outptr);\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float16x4_t _pB = vld1_f16(pB);\n                float16x4_t _pA = vdup_n_f16(pA[0]);\n\n                _sum = vfma_f16(_sum, _pA, _pB);\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_f16(outptr0, _sum);\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1_f16(outptr, _sum);\n            }\n\n            outptr += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            __fp16 sum0;\n            __fp16 sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || 
broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[0] * pB[1];\n\n                pA += 1;\n                pB += 2;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum0;\n                    outptr0[1] = sum1;\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            __fp16 sum;\n\n            if (k == 0)\n            {\n                sum = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const __fp16* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n     
       {\n                sum += pA[0] * pB[0];\n                pA += 1;\n                pB += 1;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = sum;\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n\nstatic void get_optimal_tile_mnk_fp16sa(int M, int N, int K, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const size_t l2_cache_size = get_cpu_level2_cache_size();\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    int tile_size = (int)sqrtf((float)l2_cache_size / 3 / sizeof(__fp16));\n\n    TILE_M = std::max(8, tile_size / 8 * 8);\n    TILE_N = std::max(4, tile_size / 4 * 4);\n    TILE_K = std::max(8, tile_size / 8 * 8);\n\n    if (K > 0)\n    {\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n\n        if (nn_K == 1)\n        {\n            tile_size = (int)((float)l2_cache_size / 2 / sizeof(__fp16) / TILE_K);\n\n            TILE_M = std::max(8, tile_size / 8 * 8);\n            TILE_N = std::max(4, tile_size / 4 * 4);\n        }\n    }\n\n    TILE_M *= std::min(nT, get_physical_cpu_count());\n\n    if (M > 0)\n    {\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n    }\n\n    if (N > 0)\n    {\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n    }\n\n    if (nT > 1)\n    {\n        TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n    }\n\n    // always take constant TILE_M/N/K value when provided\n    if 
(constant_TILE_M > 0)\n    {\n        TILE_M = (constant_TILE_M + 7) / 8 * 8;\n    }\n\n    if (constant_TILE_N > 0)\n    {\n        TILE_N = (constant_TILE_N + 3) / 4 * 4;\n    }\n\n    if (constant_TILE_K > 0)\n    {\n        TILE_K = (constant_TILE_K + 7) / 8 * 8;\n    }\n}\n\nstatic int gemm_arm_fp16sa(const Mat& A, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int transA, int transB, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n    const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n    const int N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_fp16sa(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 2u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            
pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 2u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile_bf16_fp16(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? 
topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_bf16_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_arm_fp16sa(const Mat& AT, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int K, int transB, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int N = transB ? (B.dims == 3 ? 
B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_fp16sa(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 2u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = 
std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile_bf16_fp16(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_bf16_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_BT_arm_fp16sa(const Mat& A, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int N, int K, int transA, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? 
A.c : A.h) * A.elempack;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_fp16sa(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 2u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 2u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile_bf16_fp16(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? 
topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_bf16_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_BT_arm_fp16sa(const Mat& AT, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int N, int K, int output_transpose, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_fp16sa(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, 
nT, 2u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile_bf16_fp16(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n\n                gemm_transB_packed_tile_fp16sa(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_bf16_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Gemm_arm::create_pipeline_fp16sa(const Option& opt)\n{\n    if (constantA)\n    {\n        const int M = constantM;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_fp16sa(M, 0, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_M = (M + TILE_M - 1) / 
TILE_M;\n\n        AT_data.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, 2u, (Allocator*)0);\n        if (AT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_M; ppj++)\n        {\n            const int i = ppj * TILE_M;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_ii = std::min((M - i), TILE_M);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat AT_tile = AT_data.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                if (transA)\n                {\n                    transpose_pack_A_tile_fp32_to_fp16(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n                else\n                {\n                    pack_A_tile_fp32_to_fp16(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            A_data.release();\n    }\n\n    if (constantB)\n    {\n        const int N = constantN;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_fp16sa(0, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_N = (N + TILE_N - 1) / TILE_N;\n\n        BT_data.create(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, (Allocator*)0);\n        if (BT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_N; ppj++)\n        {\n            const int j = ppj * TILE_N;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_jj = std::min((N - j), TILE_N);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat BT_tile = BT_data.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if 
(transB)\n                {\n                    pack_B_tile_fp32_to_fp16(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n                else\n                {\n                    transpose_pack_B_tile_fp32_to_fp16(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            B_data.release();\n    }\n\n    if (constantC && constant_broadcast_type_C != -1)\n    {\n        cast_float32_to_float16(C_data, CT_data, opt);\n        if (CT_data.empty())\n            return -100;\n\n        if (constant_broadcast_type_C == 3 && opt.use_packing_layout)\n        {\n            int C_elempack = constantM % 8 == 0 ? 8 : constantM % 4 == 0 ? 4 : 1;\n            Mat tmp;\n            convert_packing(CT_data, tmp, C_elempack, opt);\n            CT_data = tmp;\n            if (CT_data.empty())\n                return -100;\n        }\n\n        // pre-multiply C with beta\n        if (beta != 1.f)\n        {\n            const int size = CT_data.total() * CT_data.elempack;\n            __fp16* ptr = CT_data;\n            for (int i = 0; i < size; i++)\n            {\n                ptr[i] *= beta;\n            }\n        }\n\n        if (opt.lightmode)\n            C_data.release();\n    }\n\n    if (constantA || constantB || constantC)\n    {\n        nT = opt.num_threads;\n    }\n\n    return 0;\n}\n\nint Gemm_arm::forward_fp16sa(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int M;\n    int N;\n    if (constantA && constantB)\n    {\n        M = constantM;\n        N = constantN;\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        M = constantM;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        M = transA ? A.w : (A.dims == 3 ? 
A.c : A.h) * A.elempack;\n        N = constantN;\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n\n    Mat C;\n    int broadcast_type_C = 0;\n    if (constantC)\n    {\n        C = CT_data;\n        broadcast_type_C = constant_broadcast_type_C;\n    }\n    else\n    {\n        if (constantA && constantB)\n        {\n            C = bottom_blobs.size() == 1 ? bottom_blobs[0] : Mat();\n        }\n        else if (constantA)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else if (constantB)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else\n        {\n            C = bottom_blobs.size() == 3 ? bottom_blobs[2] : Mat();\n        }\n\n        if (!C.empty())\n        {\n            if (C.dims == 1 && C.w == 1)\n            {\n                // scalar\n                broadcast_type_C = 0;\n            }\n            if (C.dims == 1 && C.w * C.elempack == M)\n            {\n                // M\n                // auto broadcast from h to w is the ncnn-style convention\n                broadcast_type_C = 1;\n            }\n            if (C.dims == 1 && C.w * C.elempack == N)\n            {\n                // N\n                broadcast_type_C = 4;\n            }\n            if (C.dims == 2 && C.w == 1 && C.h * C.elempack == M)\n            {\n                // Mx1\n                broadcast_type_C = 2;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == M)\n            {\n                // MxN\n                broadcast_type_C = 3;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == 1)\n            {\n                // 1xN\n                broadcast_type_C = 4;\n            }\n\n            // pre-multiply 
C with beta\n            if (beta != 1.f)\n            {\n                Mat CT_data;\n                CT_data.create_like(C, opt.workspace_allocator);\n                if (CT_data.empty())\n                    return -100;\n\n                const int size = C.total() * C.elempack;\n                const __fp16* ptr = C;\n                __fp16* outptr = CT_data;\n                for (int i = 0; i < size; i++)\n                {\n                    outptr[i] = ptr[i] * (__fp16)beta;\n                }\n\n                C = CT_data;\n            }\n        }\n    }\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        int outh = output_transpose ? N : M;\n        out_elempack = outh % 8 == 0 ? 8 : outh % 4 == 0 ? 4 : 1;\n    }\n    if (output_elempack)\n        out_elempack = output_elempack;\n    size_t out_elemsize = 2u * out_elempack;\n\n    Mat& top_blob = top_blobs[0];\n    if (output_transpose)\n    {\n        if (output_N1M)\n            top_blob.create(M, 1, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(M, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    else\n    {\n        if (output_N1M)\n            top_blob.create(N, 1, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(N, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int _nT = nT ? 
nT : opt.num_threads;\n    if (nT != 0 && opt.num_threads != nT)\n    {\n        // force num_threads the same as in create_pipeline\n        // so we could use pre-packed A/B from the same tile config\n        NCNN_LOGE(\"opt.num_threads %d changed, gemm will use load-time value %d\", opt.num_threads, nT);\n    }\n\n    int ret = 0;\n    if (constantA && constantB)\n    {\n        ret = gemm_AT_BT_arm_fp16sa(AT_data, BT_data, C, top_blob, broadcast_type_C, constantM, constantN, constantK, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        ret = gemm_AT_arm_fp16sa(AT_data, B, C, top_blob, broadcast_type_C, constantM, constantK, transB, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        ret = gemm_BT_arm_fp16sa(A, BT_data, C, top_blob, broadcast_type_C, constantN, constantK, transA, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        ret = gemm_arm_fp16sa(A, B, C, top_blob, broadcast_type_C, transA, transB, output_transpose, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    if (ret != 0)\n        return ret;\n\n    // multiply top_blob with alpha\n    if (alpha != 1.f)\n    {\n        const int size = top_blob.total() * out_elempack;\n        __fp16* ptr = top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] *= alpha;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_INT8\nvoid compute_A_tile_fp16_int8_scales_asimdhp(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    compute_A_tile_fp16_int8_scales(A, scales, B_scale, out_descales, i, 
max_ii);\n}\n\nvoid transpose_compute_A_tile_fp16_int8_scales_asimdhp(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    transpose_compute_A_tile_fp16_int8_scales(A, scales, B_scale, out_descales, i, max_ii);\n}\n\nvoid compute_B_fp16_int8_scale_asimdhp(const Mat& B, float& scale)\n{\n    compute_B_fp16_int8_scale(B, scale);\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gemm_arm_i8mm.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"gemm_int8.h\"\n#include \"gemm_int8_fp16s.h\"\n\n#if NCNN_BF16\n#include \"gemm_int8_bf16s.h\"\n#endif\n\nvoid pack_A_tile_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    pack_A_tile_int8(A, AT, i, max_ii, k, max_kk);\n}\n\nvoid transpose_pack_A_tile_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    transpose_pack_A_tile_int8(A, AT, i, max_ii, k, max_kk);\n}\n\nvoid pack_B_tile_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    pack_B_tile_int8(B, BT, j, max_jj, k, max_kk);\n}\n\nvoid transpose_pack_B_tile_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    transpose_pack_B_tile_int8(B, BT, j, max_jj, k, max_kk);\n}\n\nvoid pack_A_tile_fp32_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    pack_A_tile_fp32_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid transpose_pack_A_tile_fp32_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    transpose_pack_A_tile_fp32_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid pack_B_tile_fp32_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    pack_B_tile_fp32_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid transpose_pack_B_tile_fp32_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    transpose_pack_B_tile_fp32_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid gemm_transB_packed_tile_int8_i8mm(const Mat& AT_tile, const Mat& BT_tile, Mat& topT_tile, int i, int max_ii, int j, int max_jj, int k, int max_kk)\n{\n    gemm_transB_packed_tile_int8(AT_tile, BT_tile, topT_tile, i, max_ii, j, max_jj, k, max_kk);\n}\n\nvoid 
pack_A_tile_fp16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    pack_A_tile_fp16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid transpose_pack_A_tile_fp16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    transpose_pack_A_tile_fp16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid pack_B_tile_fp16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    pack_B_tile_fp16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid transpose_pack_B_tile_fp16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    transpose_pack_B_tile_fp16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\n#if NCNN_BF16\nvoid pack_A_tile_bf16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    pack_A_tile_bf16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid transpose_pack_A_tile_bf16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    transpose_pack_A_tile_bf16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid pack_B_tile_bf16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    pack_B_tile_bf16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid transpose_pack_B_tile_bf16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    transpose_pack_B_tile_bf16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gemm_arm_vfpv4.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gemm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"gemm_bf16s_fp16s.h\"\n#include \"gemm_fp16s.h\"\n\n#if NCNN_INT8\n#include \"gemm_int8_fp16s.h\"\n#endif\n\nextern void pack_A_tile(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk);\n\nstatic int gemm_arm_fp16s(const Mat& A, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int transA, int transB, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n    const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n    const int N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 2u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - 
j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? 
topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? alpha : 1.f;\n\n                gemm_transB_packed_tile_fp16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_arm_fp16s(const Mat& AT, const Mat& B, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int K, int transB, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int N = transB ? (B.dims == 3 ? 
B.c : B.h) * B.elempack : B.w;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    int nn_N = (N + TILE_N - 1) / TILE_N;\n    int nn_K = (K + TILE_K - 1) / TILE_K;\n\n    Mat BT(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, opt.workspace_allocator);\n    if (BT.empty())\n        return -100;\n\n    const int nn_NK = nn_N * nn_K;\n\n    // pack B\n    #pragma omp parallel for num_threads(nT)\n    for (int ppjk = 0; ppjk < nn_NK; ppjk++)\n    {\n        const int ppj = ppjk / nn_K;\n        const int ppk = ppjk % nn_K;\n\n        const int j = ppj * TILE_N;\n        const int k = ppk * TILE_K;\n\n        const int max_jj = std::min((N - j), TILE_N);\n        const int max_kk = std::min((K - k), TILE_K);\n\n        Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n        if (transB)\n        {\n            pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n        else\n        {\n            transpose_pack_B_tile_bf16_fp16(B, BT_tile, j, max_jj, k, max_kk);\n        }\n    }\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = 
std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? alpha : 1.f;\n\n                gemm_transB_packed_tile_fp16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_BT_arm_fp16s(const Mat& A, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int N, int K, int transA, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    const int M = transA ? A.w : (A.dims == 3 ? 
A.c : A.h) * A.elempack;\n\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat ATX(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, nT, 2u, opt.workspace_allocator);\n    if (ATX.empty())\n        return -100;\n\n    Mat topT;\n    if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        // shadowed variable for less openmp task args\n        const int M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        const int K = transA ? (A.dims == 3 ? A.c : A.h) * A.elempack : A.w;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? 
topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = ATX.channel(get_omp_thread_num()).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (j == 0)\n                {\n                    if (transA)\n                    {\n                        transpose_pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                    else\n                    {\n                        pack_A_tile_bf16_fp16(A, AT_tile, i, max_ii, k, max_kk);\n                    }\n                }\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? alpha : 1.f;\n\n                gemm_transB_packed_tile_fp16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int gemm_AT_BT_arm_fp16s(const Mat& AT, const Mat& BT, const Mat& C, Mat& top_blob, int broadcast_type_C, int M, int N, int K, int output_transpose, float alpha, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int nT, const Option& opt)\n{\n    // NCNN_LOGE(\"M/N/K = %d %d %d\", M, N, K);\n\n    int TILE_M, TILE_N, TILE_K;\n    get_optimal_tile_mnk_bf16s_fp16s(M, N, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, nT);\n\n    // NCNN_LOGE(\"TILE M/N/K = %d %d %d\", TILE_M, TILE_N, TILE_K);\n\n    int nn_M = (M + TILE_M - 1) / TILE_M;\n    // int nn_N = (N + TILE_N - 1) / TILE_N;\n\n    Mat topT;\n    if (K > TILE_K || 
broadcast_type_C == 3 || output_transpose)\n    {\n        topT.create(TILE_N * TILE_M, 1, nT, 4u, opt.workspace_allocator);\n        if (topT.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(nT)\n    for (int ppi = 0; ppi < nn_M; ppi++)\n    {\n        const int i = ppi * TILE_M;\n\n        const int max_ii = std::min((M - i), TILE_M);\n\n        Mat topT_tile;\n        if (K > TILE_K || broadcast_type_C == 3 || output_transpose)\n            topT_tile = topT.channel(get_omp_thread_num());\n\n        for (int j = 0; j < N; j += TILE_N)\n        {\n            const int max_jj = std::min((N - j), TILE_N);\n\n            if (broadcast_type_C == 3)\n            {\n                pack_A_tile(C, topT_tile, i, max_ii, j, max_jj);\n            }\n\n            const Mat& CT_tile = broadcast_type_C == 3 ? topT_tile : C;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_kk = std::min((K - k), TILE_K);\n\n                // NCNN_LOGE(\"max_ii/jj/kk = %d %d %d\", max_ii, max_jj, max_kk);\n\n                Mat AT_tile = AT.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                Mat BT_tile = BT.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                bool k_end = !output_transpose && k + TILE_K >= K;\n                float _alpha = k + TILE_K >= K ? 
alpha : 1.f;\n\n                gemm_transB_packed_tile_fp16s(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, _alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n            }\n\n            if (output_transpose)\n            {\n                transpose_unpack_output_tile_fp32_to_fp16(topT_tile, top_blob, i, max_ii, j, max_jj);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Gemm_arm::create_pipeline_fp16s(const Option& opt)\n{\n    if (constantA)\n    {\n        const int M = constantM;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_bf16s_fp16s(M, 0, K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_M = (M + TILE_M - 1) / TILE_M;\n\n        AT_data.create(TILE_K * TILE_M, (K + TILE_K - 1) / TILE_K, (M + TILE_M - 1) / TILE_M, 2u, (Allocator*)0);\n        if (AT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_M; ppj++)\n        {\n            const int i = ppj * TILE_M;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_ii = std::min((M - i), TILE_M);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat AT_tile = AT_data.channel(i / TILE_M).row_range(k / TILE_K, 1);\n\n                if (transA)\n                {\n                    transpose_pack_A_tile_fp32_to_fp16(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n                else\n                {\n                    pack_A_tile_fp32_to_fp16(A_data, AT_tile, i, max_ii, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            A_data.release();\n    }\n\n    if (constantB)\n    {\n        const int N = constantN;\n        const int K = constantK;\n\n        int TILE_M, TILE_N, TILE_K;\n        get_optimal_tile_mnk_bf16s_fp16s(0, N, 
K, constant_TILE_M, constant_TILE_N, constant_TILE_K, TILE_M, TILE_N, TILE_K, opt.num_threads);\n\n        const int nn_N = (N + TILE_N - 1) / TILE_N;\n\n        BT_data.create(TILE_K * TILE_N, (K + TILE_K - 1) / TILE_K, (N + TILE_N - 1) / TILE_N, 2u, (Allocator*)0);\n        if (BT_data.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ppj = 0; ppj < nn_N; ppj++)\n        {\n            const int j = ppj * TILE_N;\n\n            for (int k = 0; k < K; k += TILE_K)\n            {\n                const int max_jj = std::min((N - j), TILE_N);\n                const int max_kk = std::min((K - k), TILE_K);\n\n                Mat BT_tile = BT_data.channel(j / TILE_N).row_range(k / TILE_K, 1);\n\n                if (transB)\n                {\n                    pack_B_tile_fp32_to_fp16(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n                else\n                {\n                    transpose_pack_B_tile_fp32_to_fp16(B_data, BT_tile, j, max_jj, k, max_kk);\n                }\n            }\n        }\n\n        if (opt.lightmode)\n            B_data.release();\n    }\n\n    if (constantC && constant_broadcast_type_C != -1)\n    {\n        CT_data = C_data;\n\n#if __ARM_NEON\n        if (constant_broadcast_type_C == 3 && opt.use_packing_layout)\n        {\n            int C_elempack = constantM % 4 == 0 ? 
4 : 1;\n            convert_packing(C_data, CT_data, C_elempack, opt);\n            if (CT_data.empty())\n                return -100;\n        }\n#endif // __ARM_NEON\n\n        // pre-multiply C with beta\n        if (beta != 1.f)\n        {\n            Mat C2;\n            C2.create_like(CT_data);\n            if (C2.empty())\n                return -100;\n\n            const int size = CT_data.total() * CT_data.elempack;\n            for (int i = 0; i < size; i++)\n            {\n                C2[i] = CT_data[i] * beta;\n            }\n\n            CT_data = C2;\n        }\n\n        if (opt.lightmode)\n            C_data.release();\n    }\n\n    if (constantA || constantB || constantC)\n    {\n        nT = opt.num_threads;\n    }\n\n    return 0;\n}\n\nint Gemm_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int M;\n    int N;\n    if (constantA && constantB)\n    {\n        M = constantM;\n        N = constantN;\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        M = constantM;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = constantN;\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        M = transA ? A.w : (A.dims == 3 ? A.c : A.h) * A.elempack;\n        N = transB ? (B.dims == 3 ? B.c : B.h) * B.elempack : B.w;\n    }\n\n    Mat C;\n    int broadcast_type_C = 0;\n    if (constantC)\n    {\n        C = CT_data;\n        broadcast_type_C = constant_broadcast_type_C;\n    }\n    else\n    {\n        if (constantA && constantB)\n        {\n            C = bottom_blobs.size() == 1 ? bottom_blobs[0] : Mat();\n        }\n        else if (constantA)\n        {\n            C = bottom_blobs.size() == 2 ? 
bottom_blobs[1] : Mat();\n        }\n        else if (constantB)\n        {\n            C = bottom_blobs.size() == 2 ? bottom_blobs[1] : Mat();\n        }\n        else\n        {\n            C = bottom_blobs.size() == 3 ? bottom_blobs[2] : Mat();\n        }\n\n        if (!C.empty())\n        {\n            if (C.dims == 1 && C.w == 1)\n            {\n                // scalar\n                broadcast_type_C = 0;\n            }\n            if (C.dims == 1 && C.w * C.elempack == M)\n            {\n                // M\n                // auto broadcast from h to w is the ncnn-style convention\n                broadcast_type_C = 1;\n            }\n            if (C.dims == 1 && C.w * C.elempack == N)\n            {\n                // N\n                broadcast_type_C = 4;\n            }\n            if (C.dims == 2 && C.w == 1 && C.h * C.elempack == M)\n            {\n                // Mx1\n                broadcast_type_C = 2;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == M)\n            {\n                // MxN\n                broadcast_type_C = 3;\n            }\n            if (C.dims == 2 && C.w == N && C.h * C.elempack == 1)\n            {\n                // 1xN\n                broadcast_type_C = 4;\n            }\n\n            // cast to fp32\n            {\n                Mat CT_data;\n                cast_float16_to_float32(C, CT_data);\n                C = CT_data;\n                if (C.empty())\n                    return -100;\n            }\n\n            // pre-multiply C with beta\n            if (beta != 1.f)\n            {\n                Mat CT_data;\n                CT_data.create_like(C, opt.workspace_allocator);\n                if (CT_data.empty())\n                    return -100;\n\n                const int size = C.total() * C.elempack;\n                for (int i = 0; i < size; i++)\n                {\n                    CT_data[i] = C[i] * beta;\n                }\n\n                C = 
CT_data;\n            }\n        }\n    }\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        int outh = output_transpose ? N : M;\n        out_elempack = outh % 4 == 0 ? 4 : 1;\n    }\n    if (output_elempack)\n        out_elempack = output_elempack;\n    size_t out_elemsize = 2u * out_elempack;\n\n    Mat& top_blob = top_blobs[0];\n    if (output_transpose)\n    {\n        if (output_N1M)\n            top_blob.create(M, 1, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(M, N / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    else\n    {\n        if (output_N1M)\n            top_blob.create(N, 1, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        else\n            top_blob.create(N, M / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int _nT = nT ? nT : opt.num_threads;\n    if (nT != 0 && opt.num_threads != nT)\n    {\n        // force num_threads the same as in create_pipeline\n        // so we could use pre-packed A/B from the same tile config\n        NCNN_LOGE(\"opt.num_threads %d changed, gemm will use load-time value %d\", opt.num_threads, nT);\n    }\n\n    int ret = 0;\n    if (constantA && constantB)\n    {\n        ret = gemm_AT_BT_arm_fp16s(AT_data, BT_data, C, top_blob, broadcast_type_C, constantM, constantN, constantK, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantA)\n    {\n        const Mat& B = bottom_blobs[0];\n        ret = gemm_AT_arm_fp16s(AT_data, B, C, top_blob, broadcast_type_C, constantM, constantK, transB, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else if (constantB)\n    {\n        const Mat& A = bottom_blobs[0];\n        ret = gemm_BT_arm_fp16s(A, BT_data, C, top_blob, broadcast_type_C, constantN, 
constantK, transA, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n    else\n    {\n        const Mat& A = bottom_blobs[0];\n        const Mat& B = bottom_blobs[1];\n        ret = gemm_arm_fp16s(A, B, C, top_blob, broadcast_type_C, transA, transB, output_transpose, alpha, constant_TILE_M, constant_TILE_N, constant_TILE_K, _nT, opt);\n    }\n\n    return ret;\n}\n\n#if NCNN_INT8\nvoid compute_A_tile_fp16_int8_scales_vfpv4(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    compute_A_tile_fp16_int8_scales(A, scales, B_scale, out_descales, i, max_ii);\n}\n\nvoid transpose_compute_A_tile_fp16_int8_scales_vfpv4(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    transpose_compute_A_tile_fp16_int8_scales(A, scales, B_scale, out_descales, i, max_ii);\n}\n\nvoid pack_A_tile_fp16_to_int8_vfpv4(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    pack_A_tile_fp16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid transpose_pack_A_tile_fp16_to_int8_vfpv4(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n    transpose_pack_A_tile_fp16_to_int8(A, AT, i, max_ii, k, max_kk, scales);\n}\n\nvoid compute_B_fp16_int8_scale_vfpv4(const Mat& B, float& scale)\n{\n    compute_B_fp16_int8_scale(B, scale);\n}\n\nvoid pack_B_tile_fp16_to_int8_vfpv4(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    pack_B_tile_fp16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid transpose_pack_B_tile_fp16_to_int8_vfpv4(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n    transpose_pack_B_tile_fp16_to_int8(B, BT, j, max_jj, k, max_kk, scale);\n}\n\nvoid unpack_output_tile_int32_to_fp16_vfpv4(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n    
unpack_output_tile_int32_to_fp16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n\nvoid transpose_unpack_output_tile_int32_to_fp16_vfpv4(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n    transpose_unpack_output_tile_int32_to_fp16(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gemm_bf16s.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pack_A_tile_fp32_to_bf16(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    unsigned short* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n        const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n        const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n        const float* p4 = (const float*)A + (i + ii + 4) * A_hstep + k;\n        const float* p5 = (const float*)A + (i + ii + 5) * A_hstep + k;\n        const float* p6 = (const float*)A + (i + ii + 6) * A_hstep + k;\n        const float* p7 = (const float*)A + (i + ii + 7) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            uint16x8_t _r1 = vcombine_u16(float2bfloat(vld1q_f32(p1)), float2bfloat(vld1q_f32(p1 + 4)));\n            uint16x8_t _r2 = vcombine_u16(float2bfloat(vld1q_f32(p2)), float2bfloat(vld1q_f32(p2 + 4)));\n            uint16x8_t _r3 = vcombine_u16(float2bfloat(vld1q_f32(p3)), float2bfloat(vld1q_f32(p3 + 4)));\n            uint16x8_t _r4 = vcombine_u16(float2bfloat(vld1q_f32(p4)), float2bfloat(vld1q_f32(p4 + 4)));\n            uint16x8_t _r5 = vcombine_u16(float2bfloat(vld1q_f32(p5)), float2bfloat(vld1q_f32(p5 + 4)));\n            uint16x8_t _r6 = vcombine_u16(float2bfloat(vld1q_f32(p6)), float2bfloat(vld1q_f32(p6 + 4)));\n            uint16x8_t _r7 = vcombine_u16(float2bfloat(vld1q_f32(p7)), float2bfloat(vld1q_f32(p7 + 4)));\n            transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n            vst1q_u16(pp, _r0);\n         
   vst1q_u16(pp + 8, _r1);\n            vst1q_u16(pp + 8 * 2, _r2);\n            vst1q_u16(pp + 8 * 3, _r3);\n            vst1q_u16(pp + 8 * 4, _r4);\n            vst1q_u16(pp + 8 * 5, _r5);\n            vst1q_u16(pp + 8 * 6, _r6);\n            vst1q_u16(pp + 8 * 7, _r7);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p1[0]);\n            pp[2] = float32_to_bfloat16(p2[0]);\n            pp[3] = float32_to_bfloat16(p3[0]);\n            pp[4] = float32_to_bfloat16(p4[0]);\n            pp[5] = float32_to_bfloat16(p5[0]);\n            pp[6] = float32_to_bfloat16(p6[0]);\n            pp[7] = float32_to_bfloat16(p7[0]);\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n        const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n        const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x4_t _r0123;\n            _r0123.val[0] = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            _r0123.val[1] = vcombine_u16(float2bfloat(vld1q_f32(p1)), float2bfloat(vld1q_f32(p1 + 4)));\n            _r0123.val[2] = vcombine_u16(float2bfloat(vld1q_f32(p2)), float2bfloat(vld1q_f32(p2 + 4)));\n            _r0123.val[3] = vcombine_u16(float2bfloat(vld1q_f32(p3)), float2bfloat(vld1q_f32(p3 + 
4)));\n            vst4q_u16(pp, _r0123);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x4_t _r0123;\n            _r0123.val[0] = float2bfloat(vld1q_f32(p0));\n            _r0123.val[1] = float2bfloat(vld1q_f32(p1));\n            _r0123.val[2] = float2bfloat(vld1q_f32(p2));\n            _r0123.val[3] = float2bfloat(vld1q_f32(p3));\n            vst4_u16(pp, _r0123);\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p1[0]);\n            pp[2] = float32_to_bfloat16(p2[0]);\n            pp[3] = float32_to_bfloat16(p3[0]);\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x2_t _r01;\n            _r01.val[0] = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            _r01.val[1] = vcombine_u16(float2bfloat(vld1q_f32(p1)), float2bfloat(vld1q_f32(p1 + 4)));\n            vst2q_u16(pp, _r01);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x2_t _r01;\n            _r01.val[0] = float2bfloat(vld1q_f32(p0));\n            _r01.val[1] = float2bfloat(vld1q_f32(p1));\n            vst2_u16(pp, _r01);\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < 
max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p1[0]);\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = float2bfloat(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile_fp32_to_bf16(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n\n    unsigned short* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x8_t _r0 = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += A_hstep;\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x4_t _r0 = float2bfloat(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += A_hstep;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p0[1]);\n            pp += 2;\n            p0 += A_hstep;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp += 1;\n            p0 += A_hstep;\n        }\n    }\n}\n\nstatic void pack_B_tile_fp32_to_bf16(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const size_t B_hstep = B.dims == 3 ? 
B.cstep : (size_t)B.w;\n\n    unsigned short* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n        const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n        const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + k;\n        const float* p4 = (const float*)B + (j + jj + 4) * B_hstep + k;\n        const float* p5 = (const float*)B + (j + jj + 5) * B_hstep + k;\n        const float* p6 = (const float*)B + (j + jj + 6) * B_hstep + k;\n        const float* p7 = (const float*)B + (j + jj + 7) * B_hstep + k;\n        const float* p8 = (const float*)B + (j + jj + 8) * B_hstep + k;\n        const float* p9 = (const float*)B + (j + jj + 9) * B_hstep + k;\n        const float* pa = (const float*)B + (j + jj + 10) * B_hstep + k;\n        const float* pb = (const float*)B + (j + jj + 11) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = float2bfloat(vld1q_f32(p0));\n            uint16x4_t _r1 = float2bfloat(vld1q_f32(p1));\n            uint16x4_t _r2 = float2bfloat(vld1q_f32(p2));\n            uint16x4_t _r3 = float2bfloat(vld1q_f32(p3));\n            uint16x4_t _r4 = float2bfloat(vld1q_f32(p4));\n            uint16x4_t _r5 = float2bfloat(vld1q_f32(p5));\n            uint16x4_t _r6 = float2bfloat(vld1q_f32(p6));\n            uint16x4_t _r7 = float2bfloat(vld1q_f32(p7));\n            uint16x4_t _r8 = float2bfloat(vld1q_f32(p8));\n            uint16x4_t _r9 = float2bfloat(vld1q_f32(p9));\n            uint16x4_t _ra = float2bfloat(vld1q_f32(pa));\n            uint16x4_t _rb = float2bfloat(vld1q_f32(pb));\n\n            transpose4x4_u16(_r0, _r1, _r2, _r3);\n            transpose4x4_u16(_r4, _r5, _r6, _r7);\n            transpose4x4_u16(_r8, _r9, _ra, _rb);\n\n            vst1_u16(pp, 
_r0);\n            vst1_u16(pp + 4, _r4);\n            vst1_u16(pp + 4 * 2, _r8);\n            vst1_u16(pp + 4 * 3, _r1);\n            vst1_u16(pp + 4 * 4, _r5);\n            vst1_u16(pp + 4 * 5, _r9);\n            vst1_u16(pp + 4 * 6, _r2);\n            vst1_u16(pp + 4 * 7, _r6);\n            vst1_u16(pp + 4 * 8, _ra);\n            vst1_u16(pp + 4 * 9, _r3);\n            vst1_u16(pp + 4 * 10, _r7);\n            vst1_u16(pp + 4 * 11, _rb);\n            pp += 48;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n            p4 += 4;\n            p5 += 4;\n            p6 += 4;\n            p7 += 4;\n            p8 += 4;\n            p9 += 4;\n            pa += 4;\n            pb += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p1[0]);\n            pp[2] = float32_to_bfloat16(p2[0]);\n            pp[3] = float32_to_bfloat16(p3[0]);\n            pp[4] = float32_to_bfloat16(p4[0]);\n            pp[5] = float32_to_bfloat16(p5[0]);\n            pp[6] = float32_to_bfloat16(p6[0]);\n            pp[7] = float32_to_bfloat16(p7[0]);\n            pp[8] = float32_to_bfloat16(p8[0]);\n            pp[9] = float32_to_bfloat16(p9[0]);\n            pp[10] = float32_to_bfloat16(pa[0]);\n            pp[11] = float32_to_bfloat16(pb[0]);\n            pp += 12;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n            p8++;\n            p9++;\n            pa++;\n            pb++;\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n        const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n        const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + 
k;\n        const float* p4 = (const float*)B + (j + jj + 4) * B_hstep + k;\n        const float* p5 = (const float*)B + (j + jj + 5) * B_hstep + k;\n        const float* p6 = (const float*)B + (j + jj + 6) * B_hstep + k;\n        const float* p7 = (const float*)B + (j + jj + 7) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            uint16x8_t _r1 = vcombine_u16(float2bfloat(vld1q_f32(p1)), float2bfloat(vld1q_f32(p1 + 4)));\n            uint16x8_t _r2 = vcombine_u16(float2bfloat(vld1q_f32(p2)), float2bfloat(vld1q_f32(p2 + 4)));\n            uint16x8_t _r3 = vcombine_u16(float2bfloat(vld1q_f32(p3)), float2bfloat(vld1q_f32(p3 + 4)));\n            uint16x8_t _r4 = vcombine_u16(float2bfloat(vld1q_f32(p4)), float2bfloat(vld1q_f32(p4 + 4)));\n            uint16x8_t _r5 = vcombine_u16(float2bfloat(vld1q_f32(p5)), float2bfloat(vld1q_f32(p5 + 4)));\n            uint16x8_t _r6 = vcombine_u16(float2bfloat(vld1q_f32(p6)), float2bfloat(vld1q_f32(p6 + 4)));\n            uint16x8_t _r7 = vcombine_u16(float2bfloat(vld1q_f32(p7)), float2bfloat(vld1q_f32(p7 + 4)));\n            transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n            vst1q_u16(pp, _r0);\n            vst1q_u16(pp + 8, _r1);\n            vst1q_u16(pp + 8 * 2, _r2);\n            vst1q_u16(pp + 8 * 3, _r3);\n            vst1q_u16(pp + 8 * 4, _r4);\n            vst1q_u16(pp + 8 * 5, _r5);\n            vst1q_u16(pp + 8 * 6, _r6);\n            vst1q_u16(pp + 8 * 7, _r7);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = float2bfloat(vld1q_f32(p0));\n            uint16x4_t _r1 = float2bfloat(vld1q_f32(p1));\n            
uint16x4_t _r2 = float2bfloat(vld1q_f32(p2));\n            uint16x4_t _r3 = float2bfloat(vld1q_f32(p3));\n            uint16x4_t _r4 = float2bfloat(vld1q_f32(p4));\n            uint16x4_t _r5 = float2bfloat(vld1q_f32(p5));\n            uint16x4_t _r6 = float2bfloat(vld1q_f32(p6));\n            uint16x4_t _r7 = float2bfloat(vld1q_f32(p7));\n\n            transpose4x4_u16(_r0, _r1, _r2, _r3);\n            transpose4x4_u16(_r4, _r5, _r6, _r7);\n\n            vst1_u16(pp, _r0);\n            vst1_u16(pp + 4, _r4);\n            vst1_u16(pp + 4 * 2, _r1);\n            vst1_u16(pp + 4 * 3, _r5);\n            vst1_u16(pp + 4 * 4, _r2);\n            vst1_u16(pp + 4 * 5, _r6);\n            vst1_u16(pp + 4 * 6, _r3);\n            vst1_u16(pp + 4 * 7, _r7);\n            pp += 32;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n            p4 += 4;\n            p5 += 4;\n            p6 += 4;\n            p7 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p1[0]);\n            pp[2] = float32_to_bfloat16(p2[0]);\n            pp[3] = float32_to_bfloat16(p3[0]);\n            pp[4] = float32_to_bfloat16(p4[0]);\n            pp[5] = float32_to_bfloat16(p5[0]);\n            pp[6] = float32_to_bfloat16(p6[0]);\n            pp[7] = float32_to_bfloat16(p7[0]);\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n        const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n        const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n      
  {\n            uint16x8x4_t _r0123;\n            _r0123.val[0] = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            _r0123.val[1] = vcombine_u16(float2bfloat(vld1q_f32(p1)), float2bfloat(vld1q_f32(p1 + 4)));\n            _r0123.val[2] = vcombine_u16(float2bfloat(vld1q_f32(p2)), float2bfloat(vld1q_f32(p2 + 4)));\n            _r0123.val[3] = vcombine_u16(float2bfloat(vld1q_f32(p3)), float2bfloat(vld1q_f32(p3 + 4)));\n            vst4q_u16(pp, _r0123);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x4_t _r0123;\n            _r0123.val[0] = float2bfloat(vld1q_f32(p0));\n            _r0123.val[1] = float2bfloat(vld1q_f32(p1));\n            _r0123.val[2] = float2bfloat(vld1q_f32(p2));\n            _r0123.val[3] = float2bfloat(vld1q_f32(p3));\n            vst4_u16(pp, _r0123);\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p1[0]);\n            pp[2] = float32_to_bfloat16(p2[0]);\n            pp[3] = float32_to_bfloat16(p3[0]);\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x2_t _r01;\n            _r01.val[0] = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            _r01.val[1] = vcombine_u16(float2bfloat(vld1q_f32(p1)), float2bfloat(vld1q_f32(p1 + 4)));\n          
  vst2q_u16(pp, _r01);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x2_t _r01;\n            _r01.val[0] = float2bfloat(vld1q_f32(p0));\n            _r01.val[1] = float2bfloat(vld1q_f32(p1));\n            vst2_u16(pp, _r01);\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p1[0]);\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = float2bfloat(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile_fp32_to_bf16(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const size_t B_hstep = B.dims == 3 ? 
B.cstep : (size_t)B.w;\n\n    unsigned short* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            vst1_u16(pp, float2bfloat(vld1q_f32(p0)));\n            vst1_u16(pp + 4, float2bfloat(vld1q_f32(p0 + 4)));\n            vst1_u16(pp + 8, float2bfloat(vld1q_f32(p0 + 8)));\n            pp += 12;\n            p0 += B_hstep;\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x8_t _r0 = vcombine_u16(float2bfloat(vld1q_f32(p0)), float2bfloat(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += B_hstep;\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x4_t _r0 = float2bfloat(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += B_hstep;\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp[1] = float32_to_bfloat16(p0[1]);\n            pp += 2;\n            p0 += B_hstep;\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_bfloat16(p0[0]);\n            pp += 1;\n            p0 += B_hstep;\n        }\n    }\n}\n\nstatic void 
transpose_unpack_output_tile_fp32_to_bf16(const Mat& topT, Mat& top_blob, int i, int max_ii, int j, int max_jj)\n{\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const float* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x8x4_t _r0;\n                _r0.val[0] = vcombine_u16(float2bfloat(vld1q_f32(pp)), float2bfloat(vld1q_f32(pp + 4)));\n                _r0.val[1] = vcombine_u16(float2bfloat(vld1q_f32(pp + 8)), float2bfloat(vld1q_f32(pp + 12)));\n                _r0.val[2] = vcombine_u16(float2bfloat(vld1q_f32(pp + 16)), float2bfloat(vld1q_f32(pp + 20)));\n                _r0.val[3] = vcombine_u16(float2bfloat(vld1q_f32(pp + 24)), float2bfloat(vld1q_f32(pp + 28)));\n                vst4q_u16(p0, _r0);\n                pp += 32;\n                p0 += out_hstep * 4;\n            }\n        }\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                uint16x8_t _r0 = vcombine_u16(float2bfloat(vld1q_f32(pp)), float2bfloat(vld1q_f32(pp + 4)));\n                vst1q_u16(p0, _r0);\n                pp += 8;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4x4_t _r0123;\n                _r0123.val[0] = 
float2bfloat(vld1q_f32(pp));\n                _r0123.val[1] = float2bfloat(vld1q_f32(pp + 4));\n                _r0123.val[2] = float2bfloat(vld1q_f32(pp + 8));\n                _r0123.val[3] = float2bfloat(vld1q_f32(pp + 12));\n                vst4_u16(p0, _r0123);\n                pp += 16;\n                p0 += out_hstep * 4;\n            }\n        }\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                uint16x4_t _r0 = float2bfloat(vld1q_f32(pp));\n                vst1_u16(p0, _r0);\n                pp += 4;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                p0[0] = float32_to_bfloat16(pp[0]);\n                p0[1] = float32_to_bfloat16(pp[2]);\n                p0[2] = float32_to_bfloat16(pp[4]);\n                p0[3] = float32_to_bfloat16(pp[6]);\n                p0[4] = float32_to_bfloat16(pp[1]);\n                p0[5] = float32_to_bfloat16(pp[3]);\n                p0[6] = float32_to_bfloat16(pp[5]);\n                p0[7] = float32_to_bfloat16(pp[7]);\n                pp += 8;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = float32_to_bfloat16(pp[0]);\n                p0[1] = float32_to_bfloat16(pp[1]);\n                pp += 2;\n                p0 += out_hstep;\n            }\n        }\n    }\n    for (; ii < 
max_ii; ii += 1)\n    {\n#if __ARM_NEON\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4_t _r0 = float2bfloat(vld1q_f32(pp));\n                vst1_u16(p0, _r0);\n                pp += 4;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = float32_to_bfloat16(pp[0]);\n                pp += 1;\n                p0 += out_hstep;\n            }\n        }\n    }\n}\n\nstatic void gemm_transB_packed_tile_bf16s(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int broadcast_type_C, float alpha, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end)\n{\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? 
top_blob.cstep : (size_t)top_blob.w;\n\n    const unsigned short* pAT = AT_tile;\n    const unsigned short* pBT = BT_tile;\n    const float* pC = CT_tile;\n\n    float* outptr = topT_tile;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n            float32x4_t _sum80;\n            float32x4_t _sum81;\n            float32x4_t _sum90;\n            float32x4_t _sum91;\n            float32x4_t _suma0;\n            float32x4_t _suma1;\n            float32x4_t _sumb0;\n            float32x4_t _sumb1;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = 
vdupq_n_f32(0.f);\n                _sum40 = vdupq_n_f32(0.f);\n                _sum41 = vdupq_n_f32(0.f);\n                _sum50 = vdupq_n_f32(0.f);\n                _sum51 = vdupq_n_f32(0.f);\n                _sum60 = vdupq_n_f32(0.f);\n                _sum61 = vdupq_n_f32(0.f);\n                _sum70 = vdupq_n_f32(0.f);\n                _sum71 = vdupq_n_f32(0.f);\n                _sum80 = vdupq_n_f32(0.f);\n                _sum81 = vdupq_n_f32(0.f);\n                _sum90 = vdupq_n_f32(0.f);\n                _sum91 = vdupq_n_f32(0.f);\n                _suma0 = vdupq_n_f32(0.f);\n                _suma1 = vdupq_n_f32(0.f);\n                _sumb0 = vdupq_n_f32(0.f);\n                _sumb1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        _sum20 = _sum00;\n                        _sum21 = _sum00;\n                        _sum30 = _sum00;\n                        _sum31 = _sum00;\n                        _sum40 = _sum00;\n                        _sum41 = _sum00;\n                        _sum50 = _sum00;\n                        _sum51 = _sum00;\n                        _sum60 = _sum00;\n                        _sum61 = _sum00;\n                        _sum70 = _sum00;\n                        _sum71 = _sum00;\n                        _sum80 = _sum00;\n                        _sum81 = _sum00;\n                        _sum90 = _sum00;\n                        _sum91 = _sum00;\n                        _suma0 = _sum00;\n                        _suma1 = _sum00;\n                        _sumb0 = _sum00;\n                        _sumb1 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                     
   _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                        _sum40 = _sum00;\n                        _sum41 = _sum01;\n                        _sum50 = _sum00;\n                        _sum51 = _sum01;\n                        _sum60 = _sum00;\n                        _sum61 = _sum01;\n                        _sum70 = _sum00;\n                        _sum71 = _sum01;\n                        _sum80 = _sum00;\n                        _sum81 = _sum01;\n                        _sum90 = _sum00;\n                        _sum91 = _sum01;\n                        _suma0 = _sum00;\n                        _suma1 = _sum01;\n                        _sumb0 = _sum00;\n                        _sumb1 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        _sum40 = vld1q_f32(pC + 4 * 8);\n                        _sum41 = vld1q_f32(pC + 4 * 9);\n                        _sum50 = vld1q_f32(pC + 4 * 10);\n                        _sum51 = vld1q_f32(pC + 4 * 11);\n                        _sum60 = vld1q_f32(pC + 4 * 12);\n                        _sum61 = vld1q_f32(pC + 4 * 13);\n                        _sum70 = vld1q_f32(pC + 4 * 14);\n                        _sum71 = vld1q_f32(pC + 
4 * 15);\n                        _sum80 = vld1q_f32(pC + 4 * 16);\n                        _sum81 = vld1q_f32(pC + 4 * 17);\n                        _sum90 = vld1q_f32(pC + 4 * 18);\n                        _sum91 = vld1q_f32(pC + 4 * 19);\n                        _suma0 = vld1q_f32(pC + 4 * 20);\n                        _suma1 = vld1q_f32(pC + 4 * 21);\n                        _sumb0 = vld1q_f32(pC + 4 * 22);\n                        _sumb1 = vld1q_f32(pC + 4 * 23);\n                        pC += 96;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum40 = vdupq_n_f32(pC[4]);\n                        _sum50 = vdupq_n_f32(pC[5]);\n                        _sum60 = vdupq_n_f32(pC[6]);\n                        _sum70 = vdupq_n_f32(pC[7]);\n                        _sum80 = vdupq_n_f32(pC[8]);\n                        _sum90 = vdupq_n_f32(pC[9]);\n                        _suma0 = vdupq_n_f32(pC[10]);\n                        _sumb0 = vdupq_n_f32(pC[11]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        _sum41 = _sum40;\n                        _sum51 = _sum50;\n                        _sum61 = _sum60;\n                        _sum71 = _sum70;\n                        _sum81 = _sum80;\n                        _sum91 = _sum90;\n                        _suma1 = _suma0;\n                        _sumb1 = _sumb0;\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                
_sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n                _sum80 = vld1q_f32(outptr + 4 * 16);\n                _sum81 = vld1q_f32(outptr + 4 * 17);\n                _sum90 = vld1q_f32(outptr + 4 * 18);\n                _sum91 = vld1q_f32(outptr + 4 * 19);\n                _suma0 = vld1q_f32(outptr + 4 * 20);\n                _suma1 = vld1q_f32(outptr + 4 * 21);\n                _sumb0 = vld1q_f32(outptr + 4 * 22);\n                _sumb1 = vld1q_f32(outptr + 4 * 23);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = bfloat2float(vget_low_u16(_pA));\n                float32x4_t _pA1 = bfloat2float(vget_high_u16(_pA));\n\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = 
vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n\n                pA += 8;\n                pB += 12;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum20 = vmulq_f32(_sum20, _alpha);\n                _sum21 = vmulq_f32(_sum21, _alpha);\n                _sum30 = vmulq_f32(_sum30, _alpha);\n                _sum31 = vmulq_f32(_sum31, _alpha);\n                _sum40 = vmulq_f32(_sum40, _alpha);\n                _sum41 = vmulq_f32(_sum41, 
_alpha);\n                _sum50 = vmulq_f32(_sum50, _alpha);\n                _sum51 = vmulq_f32(_sum51, _alpha);\n                _sum60 = vmulq_f32(_sum60, _alpha);\n                _sum61 = vmulq_f32(_sum61, _alpha);\n                _sum70 = vmulq_f32(_sum70, _alpha);\n                _sum71 = vmulq_f32(_sum71, _alpha);\n                _sum80 = vmulq_f32(_sum80, _alpha);\n                _sum81 = vmulq_f32(_sum81, _alpha);\n                _sum90 = vmulq_f32(_sum90, _alpha);\n                _sum91 = vmulq_f32(_sum91, _alpha);\n                _suma0 = vmulq_f32(_suma0, _alpha);\n                _suma1 = vmulq_f32(_suma1, _alpha);\n                _sumb0 = vmulq_f32(_sumb0, _alpha);\n                _sumb1 = vmulq_f32(_sumb1, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum30));\n                    vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum40));\n                    vst1_u16(outptr0 + 4 * 5, float2bfloat(_sum50));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum60));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum70));\n                    vst1_u16(outptr0 + 4 * 8, float2bfloat(_sum80));\n                    vst1_u16(outptr0 + 4 * 9, float2bfloat(_sum90));\n                    vst1_u16(outptr0 + 4 * 10, float2bfloat(_suma0));\n                    vst1_u16(outptr0 + 4 * 11, float2bfloat(_sumb0));\n\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 
4 * 3, float2bfloat(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 4, float2bfloat(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 5, float2bfloat(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 6, float2bfloat(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 7, float2bfloat(_sum71));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 8, float2bfloat(_sum81));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 9, float2bfloat(_sum91));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 10, float2bfloat(_suma1));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 11, float2bfloat(_sumb1));\n\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x12_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71, _sum80, _sum81, _sum90, _sum91, _suma0, _suma1, _sumb0, _sumb1);\n\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + 8, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + out_hstep + 8, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 2, float2bfloat(_sum30));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, float2bfloat(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 2 + 8, float2bfloat(_sum40));\n                    vst1_u16(outptr0 + out_hstep * 3, float2bfloat(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, float2bfloat(_sum50));\n                    vst1_u16(outptr0 + out_hstep * 3 + 8, float2bfloat(_sum51));\n                    
vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum60));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 4 + 8, float2bfloat(_sum70));\n                    vst1_u16(outptr0 + out_hstep * 5, float2bfloat(_sum71));\n                    vst1_u16(outptr0 + out_hstep * 5 + 4, float2bfloat(_sum80));\n                    vst1_u16(outptr0 + out_hstep * 5 + 8, float2bfloat(_sum81));\n                    vst1_u16(outptr0 + out_hstep * 6, float2bfloat(_sum90));\n                    vst1_u16(outptr0 + out_hstep * 6 + 4, float2bfloat(_sum91));\n                    vst1_u16(outptr0 + out_hstep * 6 + 8, float2bfloat(_suma0));\n                    vst1_u16(outptr0 + out_hstep * 7, float2bfloat(_suma1));\n                    vst1_u16(outptr0 + out_hstep * 7 + 4, float2bfloat(_sumb0));\n                    vst1_u16(outptr0 + out_hstep * 7 + 8, float2bfloat(_sumb1));\n\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n                vst1q_f32(outptr + 4 * 16, _sum80);\n                vst1q_f32(outptr + 4 * 17, _sum81);\n               
 vst1q_f32(outptr + 4 * 18, _sum90);\n                vst1q_f32(outptr + 4 * 19, _sum91);\n                vst1q_f32(outptr + 4 * 20, _suma0);\n                vst1q_f32(outptr + 4 * 21, _suma1);\n                vst1q_f32(outptr + 4 * 22, _sumb0);\n                vst1q_f32(outptr + 4 * 23, _sumb1);\n            }\n\n            outptr += 96;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n                _sum40 = vdupq_n_f32(0.f);\n                _sum41 = vdupq_n_f32(0.f);\n                _sum50 = vdupq_n_f32(0.f);\n                _sum51 = vdupq_n_f32(0.f);\n                _sum60 = vdupq_n_f32(0.f);\n                _sum61 = vdupq_n_f32(0.f);\n                _sum70 = vdupq_n_f32(0.f);\n                _sum71 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                  
      _sum20 = _sum00;\n                        _sum21 = _sum00;\n                        _sum30 = _sum00;\n                        _sum31 = _sum00;\n                        _sum40 = _sum00;\n                        _sum41 = _sum00;\n                        _sum50 = _sum00;\n                        _sum51 = _sum00;\n                        _sum60 = _sum00;\n                        _sum61 = _sum00;\n                        _sum70 = _sum00;\n                        _sum71 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                        _sum40 = _sum00;\n                        _sum41 = _sum01;\n                        _sum50 = _sum00;\n                        _sum51 = _sum01;\n                        _sum60 = _sum00;\n                        _sum61 = _sum01;\n                        _sum70 = _sum00;\n                        _sum71 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        _sum40 = vld1q_f32(pC + 4 * 8);\n                        _sum41 = vld1q_f32(pC + 4 * 9);\n                        _sum50 = vld1q_f32(pC + 
4 * 10);\n                        _sum51 = vld1q_f32(pC + 4 * 11);\n                        _sum60 = vld1q_f32(pC + 4 * 12);\n                        _sum61 = vld1q_f32(pC + 4 * 13);\n                        _sum70 = vld1q_f32(pC + 4 * 14);\n                        _sum71 = vld1q_f32(pC + 4 * 15);\n                        pC += 64;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum40 = vdupq_n_f32(pC[4]);\n                        _sum50 = vdupq_n_f32(pC[5]);\n                        _sum60 = vdupq_n_f32(pC[6]);\n                        _sum70 = vdupq_n_f32(pC[7]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        _sum41 = _sum40;\n                        _sum51 = _sum50;\n                        _sum61 = _sum60;\n                        _sum71 = _sum70;\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 
12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = bfloat2float(vget_low_u16(_pA));\n                float32x4_t _pA1 = bfloat2float(vget_high_u16(_pA));\n\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n\n                pA += 8;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = 
vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum20 = vmulq_f32(_sum20, _alpha);\n                _sum21 = vmulq_f32(_sum21, _alpha);\n                _sum30 = vmulq_f32(_sum30, _alpha);\n                _sum31 = vmulq_f32(_sum31, _alpha);\n                _sum40 = vmulq_f32(_sum40, _alpha);\n                _sum41 = vmulq_f32(_sum41, _alpha);\n                _sum50 = vmulq_f32(_sum50, _alpha);\n                _sum51 = vmulq_f32(_sum51, _alpha);\n                _sum60 = vmulq_f32(_sum60, _alpha);\n                _sum61 = vmulq_f32(_sum61, _alpha);\n                _sum70 = vmulq_f32(_sum70, _alpha);\n                _sum71 = vmulq_f32(_sum71, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum30));\n                    vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum40));\n                    vst1_u16(outptr0 + 4 * 5, float2bfloat(_sum50));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum60));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum70));\n\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, float2bfloat(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 4, float2bfloat(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 5, float2bfloat(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 6, float2bfloat(_sum61));\n                    
vst1_u16(outptr0 + out_hstep * 4 + 4 * 7, float2bfloat(_sum71));\n\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x8_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71);\n\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 2, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 3, float2bfloat(_sum30));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, float2bfloat(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum40));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 5, float2bfloat(_sum50));\n                    vst1_u16(outptr0 + out_hstep * 5 + 4, float2bfloat(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 6, float2bfloat(_sum60));\n                    vst1_u16(outptr0 + out_hstep * 6 + 4, float2bfloat(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 7, float2bfloat(_sum70));\n                    vst1_u16(outptr0 + out_hstep * 7 + 4, float2bfloat(_sum71));\n\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n      
          vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n            }\n\n            outptr += 64;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        _sum20 = _sum00;\n                        _sum21 = _sum00;\n                        _sum30 = _sum00;\n                        _sum31 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        
_sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        pC += 32;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pA = 
vld1q_u16(pA);\n                float32x4_t _pA0 = bfloat2float(vget_low_u16(_pA));\n                float32x4_t _pA1 = bfloat2float(vget_high_u16(_pA));\n\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum20 = vmulq_f32(_sum20, _alpha);\n                _sum21 = vmulq_f32(_sum21, _alpha);\n                _sum30 = vmulq_f32(_sum30, _alpha);\n                _sum31 = vmulq_f32(_sum31, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum30));\n\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, float2bfloat(_sum21));\n                   
 vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, float2bfloat(_sum31));\n\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x4_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31);\n\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + out_hstep * 1, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 2, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + out_hstep * 3, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum20));\n                    vst1_u16(outptr0 + out_hstep * 5, float2bfloat(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 6, float2bfloat(_sum30));\n                    vst1_u16(outptr0 + out_hstep * 7, float2bfloat(_sum31));\n\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                      
  _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = bfloat2float(vget_low_u16(_pA));\n                float32x4_t _pA1 = bfloat2float(vget_high_u16(_pA));\n\n                float32x4_t _pB0 = bfloat2float(vdup_n_u16(pB[0]));\n                float32x4_t _pB1 = bfloat2float(vdup_n_u16(pB[1]));\n\n                _sum00 = vfmaq_f32(_sum00, _pA0, _pB0);\n                _sum01 = vfmaq_f32(_sum01, 
_pA1, _pB0);\n                _sum10 = vfmaq_f32(_sum10, _pA0, _pB1);\n                _sum11 = vfmaq_f32(_sum11, _pA1, _pB1);\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum10));\n\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, float2bfloat(_sum11));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[8];\n                    unsigned short sum1[8];\n                    vst1_u16(sum0, float2bfloat(_sum00));\n                    vst1_u16(sum0 + 4, float2bfloat(_sum01));\n                    vst1_u16(sum1, float2bfloat(_sum10));\n                    vst1_u16(sum1 + 4, float2bfloat(_sum11));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] 
= sum1[3];\n                    outptr0[out_hstep * 4 + 1] = sum1[4];\n                    outptr0[out_hstep * 5 + 1] = sum1[5];\n                    outptr0[out_hstep * 6 + 1] = sum1[6];\n                    outptr0[out_hstep * 7 + 1] = sum1[7];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n            }\n\n            outptr += 16;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk 
= 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = bfloat2float(vget_low_u16(_pA));\n                float32x4_t _pA1 = bfloat2float(vget_high_u16(_pA));\n\n                float32x4_t _pB = bfloat2float(vld1_dup_u16(pB));\n\n                _sum00 = vfmaq_f32(_sum00, _pA0, _pB);\n                _sum01 = vfmaq_f32(_sum01, _pA1, _pB);\n\n                pA += 8;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + out_hstep * 4, float2bfloat(_sum01));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[8];\n                    vst1_u16(sum0, float2bfloat(_sum00));\n                    vst1_u16(sum0 + 4, float2bfloat(_sum01));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep * 1] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n            }\n\n            outptr += 8;\n        }\n\n        pAT += max_kk * 8;\n    }\n#endif // __aarch64__\n    for (; ii 
+ 3 < max_ii; ii += 4)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n            float32x4_t _sum8;\n            float32x4_t _sum9;\n            float32x4_t _suma;\n            float32x4_t _sumb;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n                _sum4 = vdupq_n_f32(0.f);\n                _sum5 = vdupq_n_f32(0.f);\n                _sum6 = vdupq_n_f32(0.f);\n                _sum7 = vdupq_n_f32(0.f);\n                _sum8 = vdupq_n_f32(0.f);\n                _sum9 = vdupq_n_f32(0.f);\n                _suma = vdupq_n_f32(0.f);\n                _sumb = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                        
_sum8 = _sum0;\n                        _sum9 = _sum0;\n                        _suma = _sum0;\n                        _sumb = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                        _sum8 = _sum0;\n                        _sum9 = _sum0;\n                        _suma = _sum0;\n                        _sumb = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        _sum4 = vld1q_f32(pC + 16);\n                        _sum5 = vld1q_f32(pC + 20);\n                        _sum6 = vld1q_f32(pC + 24);\n                        _sum7 = vld1q_f32(pC + 28);\n                        _sum8 = vld1q_f32(pC + 32);\n                        _sum9 = vld1q_f32(pC + 36);\n                        _suma = vld1q_f32(pC + 40);\n                        _sumb = vld1q_f32(pC + 44);\n                        pC += 48;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        _sum4 = vdupq_n_f32(pC[4]);\n                        _sum5 = vdupq_n_f32(pC[5]);\n                        _sum6 = vdupq_n_f32(pC[6]);\n                        _sum7 = 
vdupq_n_f32(pC[7]);\n                        _sum8 = vdupq_n_f32(pC[8]);\n                        _sum9 = vdupq_n_f32(pC[9]);\n                        _suma = vdupq_n_f32(pC[10]);\n                        _sumb = vdupq_n_f32(pC[11]);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n                _sum8 = vld1q_f32(outptr + 4 * 8);\n                _sum9 = vld1q_f32(outptr + 4 * 9);\n                _suma = vld1q_f32(outptr + 4 * 10);\n                _sumb = vld1q_f32(outptr + 4 * 11);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n                _sum8 = vfmaq_laneq_f32(_sum8, _pA, _pB2, 0);\n                _sum9 = 
vfmaq_laneq_f32(_sum9, _pA, _pB2, 1);\n                _suma = vfmaq_laneq_f32(_suma, _pA, _pB2, 2);\n                _sumb = vfmaq_laneq_f32(_sumb, _pA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n                _sum2 = vmulq_f32(_sum2, _alpha);\n                _sum3 = vmulq_f32(_sum3, _alpha);\n                _sum4 = vmulq_f32(_sum4, _alpha);\n                _sum5 = vmulq_f32(_sum5, _alpha);\n                _sum6 = vmulq_f32(_sum6, _alpha);\n                _sum7 = vmulq_f32(_sum7, _alpha);\n                _sum8 = vmulq_f32(_sum8, _alpha);\n                _sum9 = vmulq_f32(_sum9, _alpha);\n                _suma = vmulq_f32(_suma, _alpha);\n                _sumb = vmulq_f32(_sumb, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum3));\n                    vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum4));\n                    vst1_u16(outptr0 + 4 * 5, float2bfloat(_sum5));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum6));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum7));\n                    vst1_u16(outptr0 + 4 * 8, float2bfloat(_sum8));\n                    vst1_u16(outptr0 + 4 * 9, float2bfloat(_sum9));\n                    vst1_u16(outptr0 + 4 * 10, float2bfloat(_suma));\n                    vst1_u16(outptr0 + 4 * 11, float2bfloat(_sumb));\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                
{\n                    transpose4x12_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7, _sum8, _sum9, _suma, _sumb);\n\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 8, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum3));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum4));\n                    vst1_u16(outptr0 + out_hstep + 8, float2bfloat(_sum5));\n                    vst1_u16(outptr0 + out_hstep * 2, float2bfloat(_sum6));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, float2bfloat(_sum7));\n                    vst1_u16(outptr0 + out_hstep * 2 + 8, float2bfloat(_sum8));\n                    vst1_u16(outptr0 + out_hstep * 3, float2bfloat(_sum9));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, float2bfloat(_suma));\n                    vst1_u16(outptr0 + out_hstep * 3 + 8, float2bfloat(_sumb));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n                vst1q_f32(outptr + 4 * 8, _sum8);\n                vst1q_f32(outptr + 4 * 9, _sum9);\n                vst1q_f32(outptr + 4 * 10, _suma);\n                vst1q_f32(outptr + 4 * 11, _sumb);\n            }\n\n            outptr += 48;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            
float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n                _sum4 = vdupq_n_f32(0.f);\n                _sum5 = vdupq_n_f32(0.f);\n                _sum6 = vdupq_n_f32(0.f);\n                _sum7 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        _sum4 = vld1q_f32(pC + 16);\n                        _sum5 = vld1q_f32(pC + 20);\n                        _sum6 = vld1q_f32(pC + 24);\n                        _sum7 = vld1q_f32(pC + 28);\n                        pC += 32;\n                    }\n                    if 
(broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        _sum4 = vdupq_n_f32(pC[4]);\n                        _sum5 = vdupq_n_f32(pC[5]);\n                        _sum6 = vdupq_n_f32(pC[6]);\n                        _sum7 = vdupq_n_f32(pC[7]);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB0 = bfloat2float(vld1_u16(pB));\n                float32x4_t _pB1 = bfloat2float(vld1_u16(pB + 4));\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB0), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, 
vget_low_f32(_pB0), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB0), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB0), 1);\n                _sum4 = vmlaq_lane_f32(_sum4, _pA, vget_low_f32(_pB1), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _pA, vget_low_f32(_pB1), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _pA, vget_high_f32(_pB1), 0);\n                _sum7 = vmlaq_lane_f32(_sum7, _pA, vget_high_f32(_pB1), 1);\n#endif\n\n                pA += 4;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n                _sum2 = vmulq_f32(_sum2, _alpha);\n                _sum3 = vmulq_f32(_sum3, _alpha);\n                _sum4 = vmulq_f32(_sum4, _alpha);\n                _sum5 = vmulq_f32(_sum5, _alpha);\n                _sum6 = vmulq_f32(_sum6, _alpha);\n                _sum7 = vmulq_f32(_sum7, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum3));\n                    vst1_u16(outptr0 + 4 * 4, float2bfloat(_sum4));\n                    vst1_u16(outptr0 + 4 * 5, float2bfloat(_sum5));\n                    vst1_u16(outptr0 + 4 * 6, float2bfloat(_sum6));\n                    vst1_u16(outptr0 + 4 * 7, float2bfloat(_sum7));\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x8_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1_u16(outptr0, 
float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum3));\n                    vst1_u16(outptr0 + out_hstep * 2, float2bfloat(_sum4));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, float2bfloat(_sum5));\n                    vst1_u16(outptr0 + out_hstep * 3, float2bfloat(_sum6));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, float2bfloat(_sum7));\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n 
                       _sum2 = _sum0;\n                        _sum3 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB = bfloat2float(vld1_u16(pB));\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB), 1);\n#endif\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (alpha != 1.f)\n 
           {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n                _sum2 = vmulq_f32(_sum2, _alpha);\n                _sum3 = vmulq_f32(_sum3, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, float2bfloat(_sum3));\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x4_ps(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + out_hstep * 2, float2bfloat(_sum2));\n                    vst1_u16(outptr0 + out_hstep * 3, float2bfloat(_sum3));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 
|| broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB0 = bfloat2float(vdup_n_u16(pB[0]));\n                float32x4_t _pB1 = bfloat2float(vdup_n_u16(pB[1]));\n\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA, _pB1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA, _pB0);\n                _sum1 = vmlaq_f32(_sum1, _pA, _pB1);\n#endif\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n           
     {\n                    unsigned short sum0[4];\n                    unsigned short sum1[4];\n                    vst1_u16(sum0, float2bfloat(_sum0));\n                    vst1_u16(sum1, float2bfloat(_sum1));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk 
+= 1)\n            {\n                float32x4_t _pA = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB = bfloat2float(vdup_n_u16(pB[0]));\n\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA, _pB);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA, _pB);\n#endif\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[4];\n                    vst1_u16(sum0, float2bfloat(_sum0));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n            }\n\n            outptr += 4;\n        }\n\n        pAT += max_kk * 4;\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t 
_sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum02;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum12;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum02 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum12 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum02 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        _sum12 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum02 = _sum00;\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum11 = _sum10;\n                        _sum12 = _sum10;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        float32x4x2_t _tmp23 = vld2q_f32(pC + 8);\n                        float32x4x2_t _tmp45 = vld2q_f32(pC + 16);\n                        _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum02 = _tmp45.val[0];\n                        _sum10 = _tmp01.val[1];\n                        _sum11 = _tmp23.val[1];\n                        _sum12 = _tmp45.val[1];\n                        pC += 24;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        
_sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum02 = vld1q_f32(pC + 8);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum12 = _sum02;\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                float32x4x2_t _tmp45 = vld2q_f32(outptr + 16);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum02 = _tmp45.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n                _sum12 = _tmp45.val[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = bfloat2float(vget_low_u16(_pB));\n                float32x4_t _pB1 = bfloat2float(vget_high_u16(_pB));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                float32x4_t _pA0 = bfloat2float(vdup_n_u16(pA[0]));\n                float32x4_t _pA1 = bfloat2float(vdup_n_u16(pA[1]));\n\n                _sum00 = vfmaq_f32(_sum00, _pB0, _pA0);\n                _sum01 = vfmaq_f32(_sum01, _pB1, _pA0);\n                _sum02 = vfmaq_f32(_sum02, _pB2, _pA0);\n                _sum10 = vfmaq_f32(_sum10, _pB0, _pA1);\n                _sum11 = vfmaq_f32(_sum11, _pB1, _pA1);\n                _sum12 = vfmaq_f32(_sum12, _pB2, _pA1);\n\n                pA += 2;\n                pB += 12;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                
_sum02 = vmulq_f32(_sum02, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum12 = vmulq_f32(_sum12, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + 8, float2bfloat(_sum02));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum11));\n                    vst1_u16(outptr0 + out_hstep + 8, float2bfloat(_sum12));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float32x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n                vst2q_f32(outptr + 16, _tmp45);\n            }\n\n            outptr += 24;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n   
                     _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum11 = _sum10;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        float32x4x2_t _tmp23 = vld2q_f32(pC + 8);\n                        _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum10 = _tmp01.val[1];\n                        _sum11 = _tmp23.val[1];\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = bfloat2float(vget_low_u16(_pB));\n                float32x4_t _pB1 = bfloat2float(vget_high_u16(_pB));\n\n                float32x4_t _pA0 = bfloat2float(vdup_n_u16(pA[0]));\n                
float32x4_t _pA1 = bfloat2float(vdup_n_u16(pA[1]));\n#if __aarch64__\n                _sum00 = vfmaq_f32(_sum00, _pB0, _pA0);\n                _sum01 = vfmaq_f32(_sum01, _pB1, _pA0);\n                _sum10 = vfmaq_f32(_sum10, _pB0, _pA1);\n                _sum11 = vfmaq_f32(_sum11, _pB1, _pA1);\n#else\n                _sum00 = vmlaq_f32(_sum00, _pB0, _pA0);\n                _sum01 = vmlaq_f32(_sum01, _pB1, _pA0);\n                _sum10 = vmlaq_f32(_sum10, _pB0, _pA1);\n                _sum11 = vmlaq_f32(_sum11, _pB1, _pA1);\n#endif\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum00));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum01));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, float2bfloat(_sum11));\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = 
vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        _sum0 = _tmp01.val[0];\n                        _sum1 = _tmp01.val[1];\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                _sum0 = _tmp01.val[0];\n                _sum1 = _tmp01.val[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB = bfloat2float(vld1_u16(pB));\n\n                float32x4_t _pA0 = bfloat2float(vdup_n_u16(pA[0]));\n                float32x4_t _pA1 = bfloat2float(vdup_n_u16(pA[1]));\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pB, _pA0);\n                _sum1 = vfmaq_f32(_sum1, _pB, _pA1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pB, _pA0);\n                _sum1 = vmlaq_f32(_sum1, _pB, _pA1);\n#endif\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n 
               _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + out_hstep, float2bfloat(_sum1));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2q_f32(outptr, _tmp01);\n            }\n\n            outptr += 8;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum00;\n            float sum01;\n            float sum10;\n            float sum11;\n\n            if (k == 0)\n            {\n                sum00 = 0.f;\n                sum01 = 0.f;\n                sum10 = 0.f;\n                sum11 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[0];\n                        sum10 = pC[0];\n                        sum11 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[0];\n                        sum11 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[2];\n                        sum11 = pC[3];\n                        pC += 4;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        sum00 = pC[0];\n         
               sum01 = pC[0];\n                        sum10 = pC[1];\n                        sum11 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum00 = outptr[0];\n                sum01 = outptr[1];\n                sum10 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n#if __ARM_NEON\n            // clang 15.0.1 on aarch64 auto vectorization produces wrong result on this loop\n            // we have to teach it a bit  :$   --- nihui\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _pA0123 = bfloat2float(vld1_u16(pA));\n                float32x4_t _pB0123 = bfloat2float(vld1_u16(pB));\n\n                float32x4x2_t _pB0213 = vtrnq_f32(_pB0123, _pB0123);\n\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA0123, _pB0213.val[0]);\n                _sum1 = vfmaq_f32(_sum1, _pA0123, _pB0213.val[1]);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA0123, _pB0213.val[0]);\n                _sum1 = vmlaq_f32(_sum1, _pA0123, _pB0213.val[1]);\n#endif\n\n                pA += 4;\n                pB += 4;\n            }\n            sum00 += vgetq_lane_f32(_sum0, 0) + vgetq_lane_f32(_sum0, 2);\n            sum01 += vgetq_lane_f32(_sum0, 1) + vgetq_lane_f32(_sum0, 3);\n            sum10 += vgetq_lane_f32(_sum1, 0) + vgetq_lane_f32(_sum1, 2);\n            sum11 += vgetq_lane_f32(_sum1, 1) + vgetq_lane_f32(_sum1, 3);\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk += 1)\n            {\n                float pA0 = bfloat16_to_float32(pA[0]);\n                float pA1 = bfloat16_to_float32(pA[1]);\n                float pB0 = bfloat16_to_float32(pB[0]);\n                float pB1 = bfloat16_to_float32(pB[1]);\n\n             
   sum00 += pA0 * pB0;\n                sum01 += pA1 * pB0;\n                sum10 += pA0 * pB1;\n                sum11 += pA1 * pB1;\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum00 *= alpha;\n                sum01 *= alpha;\n                sum10 *= alpha;\n                sum11 *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum00);\n                    outptr0[1] = float32_to_bfloat16(sum10);\n                    outptr0[out_hstep] = float32_to_bfloat16(sum01);\n                    outptr0[out_hstep + 1] = float32_to_bfloat16(sum11);\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n            }\n\n            outptr += 4;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        sum0 = 
pC[0];\n                        sum1 = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float pA0 = bfloat16_to_float32(pA[0]);\n                float pA1 = bfloat16_to_float32(pA[1]);\n                float pB0 = bfloat16_to_float32(pB[0]);\n\n                sum0 += pA0 * pB0;\n                sum1 += pA1 * pB0;\n                pA += 2;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum0 *= alpha;\n                sum1 *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum0);\n                    outptr0[out_hstep] = float32_to_bfloat16(sum1);\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const unsigned short* pB = pBT;\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t 
_sum2;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                        _sum2 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n                _sum2 = vld1q_f32(outptr + 8);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = bfloat2float(vget_low_u16(_pB));\n                float32x4_t _pB1 = bfloat2float(vget_high_u16(_pB));\n                float32x4_t _pB2 = bfloat2float(vld1_u16(pB + 8));\n\n                float32x4_t _pA0 = bfloat2float(vdup_n_u16(pA[0]));\n\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n                _sum2 = vfmaq_f32(_sum2, _pA0, _pB2);\n\n                pA += 1;\n                pB += 12;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n                _sum2 = vmulq_f32(_sum2, 
_alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr0 + 8, float2bfloat(_sum2));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 8, _sum2);\n            }\n\n            outptr += 12;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = bfloat2float(vget_low_u16(_pB));\n                float32x4_t _pB1 = bfloat2float(vget_high_u16(_pB));\n\n                float32x4_t _pA0 = bfloat2float(vdup_n_u16(pA[0]));\n#if 
__aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n#endif\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum0));\n                    vst1_u16(outptr0 + 4, float2bfloat(_sum1));\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum = vld1q_f32(outptr);\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float32x4_t _pB = bfloat2float(vld1_u16(pB));\n                float32x4_t _pA = 
bfloat2float(vdup_n_u16(pA[0]));\n#if __aarch64__\n                _sum = vfmaq_f32(_sum, _pA, _pB);\n#else\n                _sum = vmlaq_f32(_sum, _pA, _pB);\n#endif\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum = vmulq_f32(_sum, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, float2bfloat(_sum));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum);\n            }\n\n            outptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float pA0 = bfloat16_to_float32(pA[0]);\n                float pB0 = bfloat16_to_float32(pB[0]);\n                float pB1 = bfloat16_to_float32(pB[1]);\n\n                sum0 += pA0 * pB0;\n                
sum1 += pA0 * pB1;\n\n                pA += 1;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum0 *= alpha;\n                sum1 *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum0);\n                    outptr0[1] = float32_to_bfloat16(sum1);\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum;\n\n            if (k == 0)\n            {\n                sum = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const unsigned short* pA = pAT;\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n                float pA0 = bfloat16_to_float32(pA[0]);\n                float pB0 = bfloat16_to_float32(pB[0]);\n\n                sum += pA0 * pB0;\n                pA += 1;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_bfloat16(sum);\n                    outptr0++;\n                
}\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/gemm_bf16s_fp16s.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pack_A_tile_bf16_fp16(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    unsigned short* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k * 8;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k * 4;\n            const unsigned short* p1 = (const unsigned short*)A + (i + ii + 4) * A_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                uint16x8_t _r0 = vcombine_u16(vld1_u16(p0), vld1_u16(p1));\n                vst1q_u16(pp, _r0);\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n            const unsigned short* p1 = (const unsigned short*)A + (i + ii + 1) * A_hstep + k;\n            const unsigned short* p2 = (const unsigned short*)A + (i + ii + 2) * A_hstep + k;\n            const unsigned short* p3 = (const unsigned short*)A + (i + ii + 3) * A_hstep + k;\n            const unsigned short* p4 = (const unsigned short*)A + (i + ii + 4) * A_hstep + k;\n            const unsigned short* p5 = (const unsigned short*)A + (i + ii + 5) * A_hstep + k;\n            const unsigned short* p6 = (const unsigned short*)A + (i + ii + 6) * A_hstep + k;\n            const unsigned short* p7 = 
(const unsigned short*)A + (i + ii + 7) * A_hstep + k;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _r0 = vld1q_u16(p0);\n                uint16x8_t _r1 = vld1q_u16(p1);\n                uint16x8_t _r2 = vld1q_u16(p2);\n                uint16x8_t _r3 = vld1q_u16(p3);\n                uint16x8_t _r4 = vld1q_u16(p4);\n                uint16x8_t _r5 = vld1q_u16(p5);\n                uint16x8_t _r6 = vld1q_u16(p6);\n                uint16x8_t _r7 = vld1q_u16(p7);\n                transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n                vst1q_u16(pp, _r0);\n                vst1q_u16(pp + 8, _r1);\n                vst1q_u16(pp + 8 * 2, _r2);\n                vst1q_u16(pp + 8 * 3, _r3);\n                vst1q_u16(pp + 8 * 4, _r4);\n                vst1q_u16(pp + 8 * 5, _r5);\n                vst1q_u16(pp + 8 * 6, _r6);\n                vst1q_u16(pp + 8 * 7, _r7);\n                pp += 64;\n                p0 += 8;\n                p1 += 8;\n                p2 += 8;\n                p3 += 8;\n                p4 += 8;\n                p5 += 8;\n                p6 += 8;\n                p7 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp[4] = p4[0];\n                pp[5] = p5[0];\n                pp[6] = p6[0];\n                pp[7] = p7[0];\n                pp += 8;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n                p4++;\n                p5++;\n                p6++;\n                p7++;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k * 4;\n\n            int kk = 0;\n         
   for (; kk + 1 < max_kk; kk += 2)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n            const unsigned short* p1 = (const unsigned short*)A + (i + ii + 1) * A_hstep + k;\n            const unsigned short* p2 = (const unsigned short*)A + (i + ii + 2) * A_hstep + k;\n            const unsigned short* p3 = (const unsigned short*)A + (i + ii + 3) * A_hstep + k;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x4_t _r0123;\n                _r0123.val[0] = vld1q_u16(p0);\n                _r0123.val[1] = vld1q_u16(p1);\n                _r0123.val[2] = vld1q_u16(p2);\n                _r0123.val[3] = vld1q_u16(p3);\n                vst4q_u16(pp, _r0123);\n                pp += 32;\n                p0 += 8;\n                p1 += 8;\n                p2 += 8;\n                p3 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x4_t _r0123;\n                _r0123.val[0] = vld1_u16(p0);\n                _r0123.val[1] = vld1_u16(p1);\n                _r0123.val[2] = vld1_u16(p2);\n                _r0123.val[3] = vld1_u16(p3);\n                vst4_u16(pp, _r0123);\n                pp += 16;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp += 4;\n                p0++;\n                p1++;\n            
    p2++;\n                p3++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        // if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n            const unsigned short* p1 = (const unsigned short*)A + (i + ii + 1) * A_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x2_t _r01;\n                _r01.val[0] = vld1q_u16(p0);\n                _r01.val[1] = vld1q_u16(p1);\n                vst2q_u16(pp, _r01);\n                pp += 16;\n                p0 += 8;\n                p1 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x2_t _r01;\n                _r01.val[0] = vld1_u16(p0);\n                _r01.val[1] = vld1_u16(p1);\n                vst2_u16(pp, _r01);\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp += 2;\n                p0++;\n                p1++;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        // if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = (unsigned short)p0[0];\n                pp += 
1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile_bf16_fp16(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    unsigned short* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x4_t _r0123 = vld4q_u16(p0);\n                uint16x8x4_t _r4567 = vld4q_u16(p0 + 32);\n                uint16x8x2_t _r04 = vuzpq_u16(_r0123.val[0], _r4567.val[0]);\n                uint16x8x2_t _r15 = vuzpq_u16(_r0123.val[1], _r4567.val[1]);\n                uint16x8x2_t _r26 = vuzpq_u16(_r0123.val[2], _r4567.val[2]);\n                uint16x8x2_t _r37 = vuzpq_u16(_r0123.val[3], _r4567.val[3]);\n                vst1q_u16(pp, _r04.val[0]);\n                vst1q_u16(pp + 8, _r15.val[0]);\n                vst1q_u16(pp + 16, _r26.val[0]);\n                vst1q_u16(pp + 24, _r37.val[0]);\n                vst1q_u16(pp + 32, _r04.val[1]);\n                vst1q_u16(pp + 40, _r15.val[1]);\n                vst1q_u16(pp + 48, _r26.val[1]);\n                vst1q_u16(pp + 56, _r37.val[1]);\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8x4_t _r0123 = vld4q_u16(p0);\n                vst1q_u16(pp, _r0123.val[0]);\n                vst1q_u16(pp + 8, _r0123.val[1]);\n                vst1q_u16(pp + 16, _r0123.val[2]);\n                vst1q_u16(pp + 24, 
_r0123.val[3]);\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += A_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x4_t _r0123;\n                _r0123.val[0] = vld1q_u16(p0);\n                _r0123.val[1] = vld1q_u16(p0 + 8);\n                _r0123.val[2] = vld1q_u16(p0 + 16);\n                _r0123.val[3] = vld1q_u16(p0 + 24);\n                vst4q_u16(pp, _r0123);\n                pp += 32;\n                p0 += A_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x4_t _r0123 = vld4_u16(p0);\n                vst1q_u16(pp, vcombine_u16(_r0123.val[0], _r0123.val[1]));\n                vst1q_u16(pp + 8, vcombine_u16(_r0123.val[2], _r0123.val[3]));\n                pp += 16;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += A_hstep;\n            }\n        }\n    }\n#endif // 
__ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x2_t _r01;\n                _r01.val[0] = vld1q_u16(p0);\n                _r01.val[1] = vld1q_u16(p0 + 8);\n                vst2q_u16(pp, _r01);\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x2_t _r01;\n                _r01.val[0] = vld1_u16(p0);\n                _r01.val[1] = vld1_u16(p0 + 4);\n                vst2_u16(pp, _r01);\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                pp += 2;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n#if __ARM_NEON\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * 4;\n\n            int kk = 0;\n      
      for (; kk + 3 < max_kk; kk += 4)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0 += A_hstep;\n            }\n        }\n    }\n}\n\nstatic void pack_B_tile_bf16_fp16(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    unsigned short* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) / 8 * 8 * B_hstep + k * 8;\n            const unsigned short* p1 = (const unsigned short*)B + (j + jj + 8) / 8 * 8 * B_hstep + k * 8;\n\n            if ((j + jj) % 8 == 0)\n            {\n                for (int kk = 0; kk < max_kk; kk++)\n                {\n                    vst1q_u16(pp, vld1q_u16(p0));\n                    vst1_u16(pp + 8, vld1_u16(p1));\n                    pp += 12;\n                    p0 += 8;\n                    p1 += 8;\n                }\n            }\n            if ((j + jj) % 8 == 4)\n            {\n                for (int kk = 0; kk < max_kk; kk++)\n                {\n                    vst1_u16(pp, vld1_u16(p0 + 4));\n                    vst1q_u16(pp + 4, vld1q_u16(p1));\n                    pp += 12;\n                    p0 += 8;\n                    p1 += 8;\n                }\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k * 4;\n            
const unsigned short* p1 = (const unsigned short*)B + (j + jj + 4) * B_hstep + k * 4;\n            const unsigned short* p2 = (const unsigned short*)B + (j + jj + 8) * B_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                vst1_u16(pp + 4, vld1_u16(p1));\n                vst1_u16(pp + 8, vld1_u16(p2));\n                pp += 12;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n            const unsigned short* p1 = (const unsigned short*)B + (j + jj + 1) * B_hstep + k;\n            const unsigned short* p2 = (const unsigned short*)B + (j + jj + 2) * B_hstep + k;\n            const unsigned short* p3 = (const unsigned short*)B + (j + jj + 3) * B_hstep + k;\n            const unsigned short* p4 = (const unsigned short*)B + (j + jj + 4) * B_hstep + k;\n            const unsigned short* p5 = (const unsigned short*)B + (j + jj + 5) * B_hstep + k;\n            const unsigned short* p6 = (const unsigned short*)B + (j + jj + 6) * B_hstep + k;\n            const unsigned short* p7 = (const unsigned short*)B + (j + jj + 7) * B_hstep + k;\n            const unsigned short* p8 = (const unsigned short*)B + (j + jj + 8) * B_hstep + k;\n            const unsigned short* p9 = (const unsigned short*)B + (j + jj + 9) * B_hstep + k;\n            const unsigned short* pa = (const unsigned short*)B + (j + jj + 10) * B_hstep + k;\n            const unsigned short* pb = (const unsigned short*)B + (j + jj + 11) * B_hstep + k;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4_t _r0 = vld1_u16(p0);\n                uint16x4_t _r1 = vld1_u16(p1);\n                uint16x4_t _r2 = vld1_u16(p2);\n                uint16x4_t _r3 = vld1_u16(p3);\n                
uint16x4_t _r4 = vld1_u16(p4);\n                uint16x4_t _r5 = vld1_u16(p5);\n                uint16x4_t _r6 = vld1_u16(p6);\n                uint16x4_t _r7 = vld1_u16(p7);\n                uint16x4_t _r8 = vld1_u16(p8);\n                uint16x4_t _r9 = vld1_u16(p9);\n                uint16x4_t _ra = vld1_u16(pa);\n                uint16x4_t _rb = vld1_u16(pb);\n\n                transpose4x4_u16(_r0, _r1, _r2, _r3);\n                transpose4x4_u16(_r4, _r5, _r6, _r7);\n                transpose4x4_u16(_r8, _r9, _ra, _rb);\n\n                vst1_u16(pp, _r0);\n                vst1_u16(pp + 4, _r4);\n                vst1_u16(pp + 4 * 2, _r8);\n                vst1_u16(pp + 4 * 3, _r1);\n                vst1_u16(pp + 4 * 4, _r5);\n                vst1_u16(pp + 4 * 5, _r9);\n                vst1_u16(pp + 4 * 6, _r2);\n                vst1_u16(pp + 4 * 7, _r6);\n                vst1_u16(pp + 4 * 8, _ra);\n                vst1_u16(pp + 4 * 9, _r3);\n                vst1_u16(pp + 4 * 10, _r7);\n                vst1_u16(pp + 4 * 11, _rb);\n                pp += 48;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n                p4 += 4;\n                p5 += 4;\n                p6 += 4;\n                p7 += 4;\n                p8 += 4;\n                p9 += 4;\n                pa += 4;\n                pb += 4;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp[4] = p4[0];\n                pp[5] = p5[0];\n                pp[6] = p6[0];\n                pp[7] = p7[0];\n                pp[8] = p8[0];\n                pp[9] = p9[0];\n                pp[10] = pa[0];\n                pp[11] = pb[0];\n                pp += 12;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n                p4++;\n         
       p5++;\n                p6++;\n                p7++;\n                p8++;\n                p9++;\n                pa++;\n                pb++;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) / 8 * 8 * B_hstep + k * 8;\n            const unsigned short* p1 = (const unsigned short*)B + (j + jj + 8) / 8 * 8 * B_hstep + k * 8;\n\n            if ((j + jj) % 8 == 0)\n            {\n                for (int kk = 0; kk < max_kk; kk++)\n                {\n                    vst1q_u16(pp, vld1q_u16(p0));\n                    pp += 8;\n                    p0 += 8;\n                }\n            }\n            if ((j + jj) % 8 == 4)\n            {\n                for (int kk = 0; kk < max_kk; kk++)\n                {\n                    vst1q_u16(pp, vcombine_u16(vld1_u16(p0 + 4), vld1_u16(p1)));\n                    pp += 8;\n                    p0 += 8;\n                    p1 += 8;\n                }\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k * 4;\n            const unsigned short* p1 = (const unsigned short*)B + (j + jj + 4) * B_hstep + k * 4;\n\n            for (int kk = 0; kk < max_kk; kk++)\n            {\n                uint16x8_t _r0 = vcombine_u16(vld1_u16(p0), vld1_u16(p1));\n                vst1q_u16(pp, _r0);\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n            const unsigned short* p1 = (const unsigned short*)B + (j + jj + 1) * B_hstep + k;\n            const unsigned short* p2 = (const unsigned short*)B + (j + jj + 2) * B_hstep + k;\n            const unsigned short* p3 = (const 
unsigned short*)B + (j + jj + 3) * B_hstep + k;\n            const unsigned short* p4 = (const unsigned short*)B + (j + jj + 4) * B_hstep + k;\n            const unsigned short* p5 = (const unsigned short*)B + (j + jj + 5) * B_hstep + k;\n            const unsigned short* p6 = (const unsigned short*)B + (j + jj + 6) * B_hstep + k;\n            const unsigned short* p7 = (const unsigned short*)B + (j + jj + 7) * B_hstep + k;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _r0 = vld1q_u16(p0);\n                uint16x8_t _r1 = vld1q_u16(p1);\n                uint16x8_t _r2 = vld1q_u16(p2);\n                uint16x8_t _r3 = vld1q_u16(p3);\n                uint16x8_t _r4 = vld1q_u16(p4);\n                uint16x8_t _r5 = vld1q_u16(p5);\n                uint16x8_t _r6 = vld1q_u16(p6);\n                uint16x8_t _r7 = vld1q_u16(p7);\n                transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n                vst1q_u16(pp, _r0);\n                vst1q_u16(pp + 8, _r1);\n                vst1q_u16(pp + 8 * 2, _r2);\n                vst1q_u16(pp + 8 * 3, _r3);\n                vst1q_u16(pp + 8 * 4, _r4);\n                vst1q_u16(pp + 8 * 5, _r5);\n                vst1q_u16(pp + 8 * 6, _r6);\n                vst1q_u16(pp + 8 * 7, _r7);\n                pp += 64;\n                p0 += 8;\n                p1 += 8;\n                p2 += 8;\n                p3 += 8;\n                p4 += 8;\n                p5 += 8;\n                p6 += 8;\n                p7 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4_t _r0 = vld1_u16(p0);\n                uint16x4_t _r1 = vld1_u16(p1);\n                uint16x4_t _r2 = vld1_u16(p2);\n                uint16x4_t _r3 = vld1_u16(p3);\n                uint16x4_t _r4 = vld1_u16(p4);\n                uint16x4_t _r5 = vld1_u16(p5);\n                uint16x4_t _r6 = vld1_u16(p6);\n                
uint16x4_t _r7 = vld1_u16(p7);\n\n                transpose4x4_u16(_r0, _r1, _r2, _r3);\n                transpose4x4_u16(_r4, _r5, _r6, _r7);\n\n                vst1_u16(pp, _r0);\n                vst1_u16(pp + 4, _r4);\n                vst1_u16(pp + 4 * 2, _r1);\n                vst1_u16(pp + 4 * 3, _r5);\n                vst1_u16(pp + 4 * 4, _r2);\n                vst1_u16(pp + 4 * 5, _r6);\n                vst1_u16(pp + 4 * 6, _r3);\n                vst1_u16(pp + 4 * 7, _r7);\n                pp += 32;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n                p4 += 4;\n                p5 += 4;\n                p6 += 4;\n                p7 += 4;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp[4] = p4[0];\n                pp[5] = p5[0];\n                pp[6] = p6[0];\n                pp[7] = p7[0];\n                pp += 8;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n                p4++;\n                p5++;\n                p6++;\n                p7++;\n            }\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) / 8 * 8 * B_hstep + k * 8;\n\n            if ((j + jj) % 8 == 0)\n            {\n                for (int kk = 0; kk < max_kk; kk++)\n                {\n                    vst1_u16(pp, vld1_u16(p0));\n                    pp += 4;\n                    p0 += 8;\n                }\n            }\n            if ((j + jj) % 8 == 4)\n            {\n                for (int kk = 0; kk < max_kk; kk++)\n                {\n                    vst1_u16(pp, vld1_u16(p0 + 4));\n                    pp += 4;\n                    p0 += 8;\n                }\n 
           }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k * 4;\n\n            int kk = 0;\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n            const unsigned short* p1 = (const unsigned short*)B + (j + jj + 1) * B_hstep + k;\n            const unsigned short* p2 = (const unsigned short*)B + (j + jj + 2) * B_hstep + k;\n            const unsigned short* p3 = (const unsigned short*)B + (j + jj + 3) * B_hstep + k;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x4_t _r0123;\n                _r0123.val[0] = vld1q_u16(p0);\n                _r0123.val[1] = vld1q_u16(p1);\n                _r0123.val[2] = vld1q_u16(p2);\n                _r0123.val[3] = vld1q_u16(p3);\n                vst4q_u16(pp, _r0123);\n                pp += 32;\n                p0 += 8;\n                p1 += 8;\n                p2 += 8;\n                p3 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x4_t _r0123;\n                _r0123.val[0] = vld1_u16(p0);\n                _r0123.val[1] = vld1_u16(p1);\n                _r0123.val[2] = vld1_u16(p2);\n                _r0123.val[3] = vld1_u16(p3);\n                vst4_u16(pp, _r0123);\n                pp += 16;\n                p0 += 4;\n                p1 += 4;\n                p2 += 4;\n                p3 += 4;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                
pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp[2] = p2[0];\n                pp[3] = p3[0];\n                pp += 4;\n                p0++;\n                p1++;\n                p2++;\n                p3++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        // if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n            const unsigned short* p1 = (const unsigned short*)B + (j + jj + 1) * B_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x2_t _r01;\n                _r01.val[0] = vld1q_u16(p0);\n                _r01.val[1] = vld1q_u16(p1);\n                vst2q_u16(pp, _r01);\n                pp += 16;\n                p0 += 8;\n                p1 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x2_t _r01;\n                _r01.val[0] = vld1_u16(p0);\n                _r01.val[1] = vld1_u16(p1);\n                vst2_u16(pp, _r01);\n                pp += 8;\n                p0 += 4;\n                p1 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p1[0];\n                pp += 2;\n                p0++;\n                p1++;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        // if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                
pp += 4;\n                p0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile_bf16_fp16(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    unsigned short* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x4_t _r0123 = vld4q_u16(p0);\n                uint16x8x4_t _r4567 = vld4q_u16(p0 + 32);\n                uint16x8x4_t _r89ab = vld4q_u16(p0 + 64);\n                uint16x8x2_t _r04 = vuzpq_u16(_r0123.val[0], _r4567.val[0]);\n                uint16x8x2_t _r15 = vuzpq_u16(_r0123.val[1], _r4567.val[1]);\n                uint16x8x2_t _r26 = vuzpq_u16(_r0123.val[2], _r4567.val[2]);\n                uint16x8x2_t _r37 = vuzpq_u16(_r0123.val[3], _r4567.val[3]);\n                uint16x4x2_t _r04_1 = vuzp_u16(vget_low_u16(_r89ab.val[0]), vget_high_u16(_r89ab.val[0]));\n                uint16x4x2_t _r15_1 = vuzp_u16(vget_low_u16(_r89ab.val[1]), vget_high_u16(_r89ab.val[1]));\n                uint16x4x2_t _r26_1 = vuzp_u16(vget_low_u16(_r89ab.val[2]), vget_high_u16(_r89ab.val[2]));\n                uint16x4x2_t _r37_1 = vuzp_u16(vget_low_u16(_r89ab.val[3]), vget_high_u16(_r89ab.val[3]));\n                vst1q_u16(pp, _r04.val[0]);\n                vst1_u16(pp + 8, _r04_1.val[0]);\n                vst1q_u16(pp + 12, _r15.val[0]);\n                vst1_u16(pp + 20, _r15_1.val[0]);\n                vst1q_u16(pp + 24, _r26.val[0]);\n                
vst1_u16(pp + 32, _r26_1.val[0]);\n                vst1q_u16(pp + 36, _r37.val[0]);\n                vst1_u16(pp + 44, _r37_1.val[0]);\n                vst1q_u16(pp + 48, _r04.val[1]);\n                vst1_u16(pp + 56, _r04_1.val[1]);\n                vst1q_u16(pp + 60, _r15.val[1]);\n                vst1_u16(pp + 68, _r15_1.val[1]);\n                vst1q_u16(pp + 72, _r26.val[1]);\n                vst1_u16(pp + 80, _r26_1.val[1]);\n                vst1q_u16(pp + 84, _r37.val[1]);\n                vst1_u16(pp + 92, _r37_1.val[1]);\n                pp += 96;\n                p0 += B_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8x4_t _r0123 = vld4q_u16(p0);\n                uint16x4x4_t _r89ab = vld4_u16(p0 + 32);\n                vst1q_u16(pp, _r0123.val[0]);\n                vst1_u16(pp + 8, _r89ab.val[0]);\n                vst1q_u16(pp + 12, _r0123.val[1]);\n                vst1_u16(pp + 20, _r89ab.val[1]);\n                vst1q_u16(pp + 24, _r0123.val[2]);\n                vst1_u16(pp + 32, _r89ab.val[2]);\n                vst1q_u16(pp + 36, _r0123.val[3]);\n                vst1_u16(pp + 44, _r89ab.val[3]);\n                pp += 48;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                vst1_u16(pp + 8, vld1_u16(p0 + 8));\n                pp += 12;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        if (elempack == 8)\n        {\n            const unsigned 
short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x4_t _r0123 = vld4q_u16(p0);\n                uint16x8x4_t _r4567 = vld4q_u16(p0 + 32);\n                uint16x8x2_t _r04 = vuzpq_u16(_r0123.val[0], _r4567.val[0]);\n                uint16x8x2_t _r15 = vuzpq_u16(_r0123.val[1], _r4567.val[1]);\n                uint16x8x2_t _r26 = vuzpq_u16(_r0123.val[2], _r4567.val[2]);\n                uint16x8x2_t _r37 = vuzpq_u16(_r0123.val[3], _r4567.val[3]);\n                vst1q_u16(pp, _r04.val[0]);\n                vst1q_u16(pp + 8, _r15.val[0]);\n                vst1q_u16(pp + 16, _r26.val[0]);\n                vst1q_u16(pp + 24, _r37.val[0]);\n                vst1q_u16(pp + 32, _r04.val[1]);\n                vst1q_u16(pp + 40, _r15.val[1]);\n                vst1q_u16(pp + 48, _r26.val[1]);\n                vst1q_u16(pp + 56, _r37.val[1]);\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8x4_t _r0123 = vld4q_u16(p0);\n                vst1q_u16(pp, _r0123.val[0]);\n                vst1q_u16(pp + 8, _r0123.val[1]);\n                vst1q_u16(pp + 16, _r0123.val[2]);\n                vst1q_u16(pp + 24, _r0123.val[3]);\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += B_hstep;\n            }\n        }\n    }\n    for (; jj + 3 < 
max_jj; jj += 4)\n    {\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x4_t _r0123;\n                _r0123.val[0] = vld1q_u16(p0);\n                _r0123.val[1] = vld1q_u16(p0 + 8);\n                _r0123.val[2] = vld1q_u16(p0 + 16);\n                _r0123.val[3] = vld1q_u16(p0 + 24);\n                vst4q_u16(pp, _r0123);\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x4_t _r0123 = vld4_u16(p0);\n                vst1q_u16(pp, vcombine_u16(_r0123.val[0], _r0123.val[1]));\n                vst1q_u16(pp + 8, vcombine_u16(_r0123.val[2], _r0123.val[3]));\n                pp += 16;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n#if __ARM_NEON\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8x2_t _r01;\n                _r01.val[0] = vld1q_u16(p0);\n                _r01.val[1] = vld1q_u16(p0 + 8);\n                vst2q_u16(pp, _r01);\n                pp += 16;\n           
     p0 += B_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x4x2_t _r01;\n                _r01.val[0] = vld1_u16(p0);\n                _r01.val[1] = vld1_u16(p0 + 4);\n                vst2_u16(pp, _r01);\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp[1] = p0[1];\n                pp += 2;\n                p0 += B_hstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n#if __ARM_NEON\n        if (elempack == 8)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 8;\n\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                vst1q_u16(pp, vld1q_u16(p0));\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n        }\n        if (elempack == 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * 4;\n\n            int kk = 0;\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                vst1_u16(pp, vld1_u16(p0));\n                pp += 4;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj);\n\n            int kk = 0;\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = p0[0];\n                pp += 1;\n                p0 += 
B_hstep;\n            }\n        }\n    }\n}\n\nstatic void transpose_unpack_output_tile_bf16_fp16(const Mat& topT, Mat& top_blob, int i, int max_ii, int j, int max_jj)\n{\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const unsigned short* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (out_elempack == 8)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + (j / 8 * 8) * out_hstep + (i + ii) * 8;\n\n            int jj = 0;\n            if (j % 8 == 4)\n            {\n                uint16x8_t _r0 = vld1q_u16(pp);\n                uint16x8_t _r1 = vld1q_u16(pp + 8);\n                uint16x8_t _r2 = vld1q_u16(pp + 8 * 2);\n                uint16x8_t _r3 = vld1q_u16(pp + 8 * 3);\n                transpose8x4_u16(_r0, _r1, _r2, _r3);\n                vst1_u16(p0 + 4, vget_low_u16(_r0));\n                vst1_u16(p0 + 8 + 4, vget_high_u16(_r0));\n                vst1_u16(p0 + 8 * 2 + 4, vget_low_u16(_r1));\n                vst1_u16(p0 + 8 * 3 + 4, vget_high_u16(_r1));\n                vst1_u16(p0 + 8 * 4 + 4, vget_low_u16(_r2));\n                vst1_u16(p0 + 8 * 5 + 4, vget_high_u16(_r2));\n                vst1_u16(p0 + 8 * 6 + 4, vget_low_u16(_r3));\n                vst1_u16(p0 + 8 * 7 + 4, vget_high_u16(_r3));\n                pp += 32;\n                p0 += out_hstep * 8;\n                jj += 4;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                uint16x8_t _r0 = vld1q_u16(pp);\n                uint16x8_t _r1 = vld1q_u16(pp + 8);\n                uint16x8_t _r2 = vld1q_u16(pp + 8 * 2);\n                uint16x8_t _r3 = vld1q_u16(pp + 8 * 3);\n                uint16x8_t _r4 = vld1q_u16(pp + 8 * 4);\n                uint16x8_t _r5 = vld1q_u16(pp + 8 * 5);\n                uint16x8_t _r6 = vld1q_u16(pp + 8 * 6);\n                
uint16x8_t _r7 = vld1q_u16(pp + 8 * 7);\n                transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n                vst1q_u16(p0, _r0);\n                vst1q_u16(p0 + 8, _r1);\n                vst1q_u16(p0 + 8 * 2, _r2);\n                vst1q_u16(p0 + 8 * 3, _r3);\n                vst1q_u16(p0 + 8 * 4, _r4);\n                vst1q_u16(p0 + 8 * 5, _r5);\n                vst1q_u16(p0 + 8 * 6, _r6);\n                vst1q_u16(p0 + 8 * 7, _r7);\n                pp += 64;\n                p0 += out_hstep * 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x8_t _r0 = vld1q_u16(pp);\n                uint16x8_t _r1 = vld1q_u16(pp + 8);\n                uint16x8_t _r2 = vld1q_u16(pp + 8 * 2);\n                uint16x8_t _r3 = vld1q_u16(pp + 8 * 3);\n                transpose8x4_u16(_r0, _r1, _r2, _r3);\n                vst1_u16(p0, vget_low_u16(_r0));\n                vst1_u16(p0 + 8, vget_high_u16(_r0));\n                vst1_u16(p0 + 8 * 2, vget_low_u16(_r1));\n                vst1_u16(p0 + 8 * 3, vget_high_u16(_r1));\n                vst1_u16(p0 + 8 * 4, vget_low_u16(_r2));\n                vst1_u16(p0 + 8 * 5, vget_high_u16(_r2));\n                vst1_u16(p0 + 8 * 6, vget_low_u16(_r3));\n                vst1_u16(p0 + 8 * 7, vget_high_u16(_r3));\n                pp += 32;\n                p0 += out_hstep * 8;\n            }\n        }\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x8x4_t _r0123;\n                _r0123.val[0] = vld1q_u16(pp);\n                _r0123.val[1] = vld1q_u16(pp + 8);\n                _r0123.val[2] = vld1q_u16(pp + 8 * 2);\n                _r0123.val[3] = vld1q_u16(pp + 8 * 3);\n                vst4q_u16(p0, _r0123);\n                pp += 32;\n                p0 += out_hstep * 4;\n            
}\n        }\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                uint16x8_t _r0 = vld1q_u16(pp);\n                vst1q_u16(p0, _r0);\n                pp += 8;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n#if __aarch64__\n        if (out_elempack == 8)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + (j / 8 * 8) * out_hstep + (i + ii) * 8;\n\n            int jj = 0;\n            if (j % 8 == 4)\n            {\n                uint16x4x4_t _r0123 = vld4_u16(pp);\n                vst1_u16(p0 + 4, _r0123.val[0]);\n                vst1_u16(p0 + 8 + 4, _r0123.val[1]);\n                vst1_u16(p0 + 16 + 4, _r0123.val[2]);\n                vst1_u16(p0 + 24 + 4, _r0123.val[3]);\n                pp += 16;\n                p0 += out_hstep * 8;\n                jj += 4;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                uint16x8x4_t _r0123 = vld4q_u16(pp);\n                vst1q_u16(p0, _r0123.val[0]);\n                vst1q_u16(p0 + 8, _r0123.val[1]);\n                vst1q_u16(p0 + 16, _r0123.val[2]);\n                vst1q_u16(p0 + 24, _r0123.val[3]);\n                pp += 32;\n                p0 += out_hstep * 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4x4_t _r0123 = vld4_u16(pp);\n                vst1_u16(p0, _r0123.val[0]);\n                vst1_u16(p0 + 8, _r0123.val[1]);\n                vst1_u16(p0 + 16, _r0123.val[2]);\n                vst1_u16(p0 + 24, _r0123.val[3]);\n                pp += 16;\n                p0 += out_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * 
out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4x4_t _r0123;\n                _r0123.val[0] = vld1_u16(pp);\n                _r0123.val[1] = vld1_u16(pp + 4);\n                _r0123.val[2] = vld1_u16(pp + 8);\n                _r0123.val[3] = vld1_u16(pp + 12);\n                vst4_u16(p0, _r0123);\n                pp += 16;\n                p0 += out_hstep * 4;\n            }\n        }\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                uint16x4_t _r0 = vld1_u16(pp);\n                vst1_u16(p0, _r0);\n                pp += 4;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (out_elempack == 8)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + (j / 8 * 8) * out_hstep + (i + ii) * 8;\n\n            int jj = 0;\n            if (j % 8 == 4)\n            {\n                p0[0 + 4] = pp[0];\n                p0[1 + 4] = pp[2];\n                p0[2 + 4] = pp[4];\n                p0[3 + 4] = pp[6];\n                p0[8 + 4] = pp[1];\n                p0[9 + 4] = pp[3];\n                p0[10 + 4] = pp[5];\n                p0[11 + 4] = pp[7];\n                pp += 8;\n                p0 += out_hstep * 8;\n                jj += 4;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                p0[0] = pp[0];\n                p0[1] = pp[2];\n                p0[2] = pp[4];\n                p0[3] = pp[6];\n                p0[4] = pp[8];\n                p0[5] = pp[10];\n                p0[6] = pp[12];\n                p0[7] = pp[14];\n                p0[8] = pp[1];\n                p0[9] = pp[3];\n                p0[10] = pp[5];\n                p0[11] = 
pp[7];\n                p0[12] = pp[9];\n                p0[13] = pp[11];\n                p0[14] = pp[13];\n                p0[15] = pp[15];\n                pp += 16;\n                p0 += out_hstep * 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                p0[0] = pp[0];\n                p0[1] = pp[2];\n                p0[2] = pp[4];\n                p0[3] = pp[6];\n                p0[8] = pp[1];\n                p0[9] = pp[3];\n                p0[10] = pp[5];\n                p0[11] = pp[7];\n                pp += 8;\n                p0 += out_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                p0[0] = pp[0];\n                p0[1] = pp[2];\n                p0[2] = pp[4];\n                p0[3] = pp[6];\n                p0[4] = pp[1];\n                p0[5] = pp[3];\n                p0[6] = pp[5];\n                p0[7] = pp[7];\n                pp += 8;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = pp[0];\n                p0[1] = pp[1];\n                pp += 2;\n                p0 += out_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        if (out_elempack == 8)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + (j / 8 * 8) * out_hstep + (i + ii) * 8;\n\n            int jj = 0;\n            if (j % 8 == 4)\n            {\n                uint16x4_t _r0 = vld1_u16(pp);\n                vst1_u16(p0 + 4, _r0);\n                pp += 4;\n        
        p0 += out_hstep * 8;\n                jj += 4;\n            }\n            for (; jj + 7 < max_jj; jj += 8)\n            {\n                uint16x8_t _r0 = vld1q_u16(pp);\n                vst1q_u16(p0, _r0);\n                pp += 8;\n                p0 += out_hstep * 8;\n            }\n            for (; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4_t _r0 = vld1_u16(pp);\n                vst1_u16(p0, _r0);\n                pp += 4;\n                p0 += out_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4_t _r0 = vld1_u16(pp);\n                vst1_u16(p0, _r0);\n                pp += 4;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = pp[0];\n                pp += 1;\n                p0 += out_hstep;\n            }\n        }\n    }\n}\n\nstatic void get_optimal_tile_mnk_bf16s_fp16s(int M, int N, int K, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const size_t l2_cache_size = get_cpu_level2_cache_size();\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    int tile_size = (int)sqrtf((float)l2_cache_size / (2 * sizeof(unsigned short) + sizeof(float)));\n\n    TILE_M = std::max(8, tile_size / 8 * 8);\n    TILE_N = std::max(4, tile_size / 4 * 4);\n    TILE_K = std::max(8, tile_size / 8 * 8);\n\n    if (K > 0)\n    {\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n        TILE_K = std::min(TILE_K, ((K 
+ nn_K - 1) / nn_K + 7) / 8 * 8);\n\n        if (nn_K == 1)\n        {\n            tile_size = (int)((float)l2_cache_size / 2 / sizeof(unsigned short) / TILE_K);\n\n            TILE_M = std::max(8, tile_size / 8 * 8);\n            TILE_N = std::max(4, tile_size / 4 * 4);\n        }\n    }\n\n    TILE_M *= std::min(nT, get_physical_cpu_count());\n\n    if (M > 0)\n    {\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n    }\n\n    if (N > 0)\n    {\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n    }\n\n    if (nT > 1)\n    {\n        TILE_M = std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n    }\n\n    // always take constant TILE_M/N/K value when provided\n    if (constant_TILE_M > 0)\n    {\n        TILE_M = (constant_TILE_M + 7) / 8 * 8;\n    }\n\n    if (constant_TILE_N > 0)\n    {\n        TILE_N = (constant_TILE_N + 3) / 4 * 4;\n    }\n\n    if (constant_TILE_K > 0)\n    {\n        TILE_K = (constant_TILE_K + 7) / 8 * 8;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/gemm_fp16s.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\nvoid gemm_transB_packed_tile_fp16s_asimdfhm(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int broadcast_type_C, float alpha, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end);\n#endif\n\nstatic void pack_A_tile_fp32_to_fp16(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    unsigned short* pp = AT;\n\n    int ii = 0;\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n        const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n        const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n        const float* p4 = (const float*)A + (i + ii + 4) * A_hstep + k;\n        const float* p5 = (const float*)A + (i + ii + 5) * A_hstep + k;\n        const float* p6 = (const float*)A + (i + ii + 6) * A_hstep + k;\n        const float* p7 = (const float*)A + (i + ii + 7) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            uint16x8_t _r1 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p1)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1 + 4)));\n            uint16x8_t _r2 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p2)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2 + 4)));\n            uint16x8_t _r3 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p3)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3 + 4)));\n            uint16x8_t _r4 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p4)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p4 
+ 4)));\n            uint16x8_t _r5 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p5)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p5 + 4)));\n            uint16x8_t _r6 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p6)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p6 + 4)));\n            uint16x8_t _r7 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p7)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p7 + 4)));\n            transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n            vst1q_u16(pp, _r0);\n            vst1q_u16(pp + 8, _r1);\n            vst1q_u16(pp + 8 * 2, _r2);\n            vst1q_u16(pp + 8 * 3, _r3);\n            vst1q_u16(pp + 8 * 4, _r4);\n            vst1q_u16(pp + 8 * 5, _r5);\n            vst1q_u16(pp + 8 * 6, _r6);\n            vst1q_u16(pp + 8 * 7, _r7);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p1[0]);\n            pp[2] = float32_to_float16(p2[0]);\n            pp[3] = float32_to_float16(p3[0]);\n            pp[4] = float32_to_float16(p4[0]);\n            pp[5] = float32_to_float16(p5[0]);\n            pp[6] = float32_to_float16(p6[0]);\n            pp[7] = float32_to_float16(p7[0]);\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n        const float* p2 = (const float*)A + (i + ii + 2) * A_hstep + k;\n        const float* p3 = (const float*)A + (i + ii + 3) * A_hstep + k;\n\n        int kk = 0;\n   
     for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x4_t _r0123;\n            _r0123.val[0] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            _r0123.val[1] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p1)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1 + 4)));\n            _r0123.val[2] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p2)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2 + 4)));\n            _r0123.val[3] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p3)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3 + 4)));\n            vst4q_u16(pp, _r0123);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x4_t _r0123;\n            _r0123.val[0] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            _r0123.val[1] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1));\n            _r0123.val[2] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2));\n            _r0123.val[3] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3));\n            vst4_u16(pp, _r0123);\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p1[0]);\n            pp[2] = float32_to_float16(p2[0]);\n            pp[3] = float32_to_float16(p3[0]);\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n        const float* p1 = (const float*)A + (i + ii + 1) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x2_t _r01;\n            _r01.val[0] = 
vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            _r01.val[1] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p1)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1 + 4)));\n            vst2q_u16(pp, _r01);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x2_t _r01;\n            _r01.val[0] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            _r01.val[1] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1));\n            vst2_u16(pp, _r01);\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p1[0]);\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile_fp32_to_fp16(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n    const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n\n    unsigned short* pp = AT;\n\n    int ii = 0;\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x8_t _r0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += A_hstep;\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += A_hstep;\n        }\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p0[1]);\n            pp += 2;\n            p0 += A_hstep;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp += 1;\n            p0 += A_hstep;\n        }\n    }\n}\n\nstatic void pack_B_tile_fp32_to_fp16(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const size_t B_hstep = B.dims == 3 ? 
B.cstep : (size_t)B.w;\n\n    unsigned short* pp = BT;\n\n    int jj = 0;\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n        const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n        const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + k;\n        const float* p4 = (const float*)B + (j + jj + 4) * B_hstep + k;\n        const float* p5 = (const float*)B + (j + jj + 5) * B_hstep + k;\n        const float* p6 = (const float*)B + (j + jj + 6) * B_hstep + k;\n        const float* p7 = (const float*)B + (j + jj + 7) * B_hstep + k;\n        const float* p8 = (const float*)B + (j + jj + 8) * B_hstep + k;\n        const float* p9 = (const float*)B + (j + jj + 9) * B_hstep + k;\n        const float* pa = (const float*)B + (j + jj + 10) * B_hstep + k;\n        const float* pb = (const float*)B + (j + jj + 11) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            uint16x4_t _r1 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1));\n            uint16x4_t _r2 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2));\n            uint16x4_t _r3 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3));\n            uint16x4_t _r4 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p4));\n            uint16x4_t _r5 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p5));\n            uint16x4_t _r6 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p6));\n            uint16x4_t _r7 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p7));\n            uint16x4_t _r8 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p8));\n            uint16x4_t _r9 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p9));\n            uint16x4_t _ra = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pa));\n            uint16x4_t _rb = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pb));\n\n            transpose4x4_u16(_r0, _r1, _r2, _r3);\n      
      transpose4x4_u16(_r4, _r5, _r6, _r7);\n            transpose4x4_u16(_r8, _r9, _ra, _rb);\n\n            vst1_u16(pp, _r0);\n            vst1_u16(pp + 4, _r4);\n            vst1_u16(pp + 4 * 2, _r8);\n            vst1_u16(pp + 4 * 3, _r1);\n            vst1_u16(pp + 4 * 4, _r5);\n            vst1_u16(pp + 4 * 5, _r9);\n            vst1_u16(pp + 4 * 6, _r2);\n            vst1_u16(pp + 4 * 7, _r6);\n            vst1_u16(pp + 4 * 8, _ra);\n            vst1_u16(pp + 4 * 9, _r3);\n            vst1_u16(pp + 4 * 10, _r7);\n            vst1_u16(pp + 4 * 11, _rb);\n            pp += 48;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n            p4 += 4;\n            p5 += 4;\n            p6 += 4;\n            p7 += 4;\n            p8 += 4;\n            p9 += 4;\n            pa += 4;\n            pb += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p1[0]);\n            pp[2] = float32_to_float16(p2[0]);\n            pp[3] = float32_to_float16(p3[0]);\n            pp[4] = float32_to_float16(p4[0]);\n            pp[5] = float32_to_float16(p5[0]);\n            pp[6] = float32_to_float16(p6[0]);\n            pp[7] = float32_to_float16(p7[0]);\n            pp[8] = float32_to_float16(p8[0]);\n            pp[9] = float32_to_float16(p9[0]);\n            pp[10] = float32_to_float16(pa[0]);\n            pp[11] = float32_to_float16(pb[0]);\n            pp += 12;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n            p8++;\n            p9++;\n            pa++;\n            pb++;\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n        const float* p2 = 
(const float*)B + (j + jj + 2) * B_hstep + k;\n        const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + k;\n        const float* p4 = (const float*)B + (j + jj + 4) * B_hstep + k;\n        const float* p5 = (const float*)B + (j + jj + 5) * B_hstep + k;\n        const float* p6 = (const float*)B + (j + jj + 6) * B_hstep + k;\n        const float* p7 = (const float*)B + (j + jj + 7) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            uint16x8_t _r1 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p1)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1 + 4)));\n            uint16x8_t _r2 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p2)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2 + 4)));\n            uint16x8_t _r3 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p3)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3 + 4)));\n            uint16x8_t _r4 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p4)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p4 + 4)));\n            uint16x8_t _r5 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p5)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p5 + 4)));\n            uint16x8_t _r6 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p6)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p6 + 4)));\n            uint16x8_t _r7 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p7)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p7 + 4)));\n            transpose8x8_u16(_r0, _r1, _r2, _r3, _r4, _r5, _r6, _r7);\n            vst1q_u16(pp, _r0);\n            vst1q_u16(pp + 8, _r1);\n            vst1q_u16(pp + 8 * 2, _r2);\n            vst1q_u16(pp + 8 * 3, _r3);\n            vst1q_u16(pp + 8 * 4, _r4);\n            vst1q_u16(pp + 8 * 5, _r5);\n            vst1q_u16(pp + 8 * 6, _r6);\n            vst1q_u16(pp + 8 * 7, _r7);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 
8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            uint16x4_t _r1 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1));\n            uint16x4_t _r2 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2));\n            uint16x4_t _r3 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3));\n            uint16x4_t _r4 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p4));\n            uint16x4_t _r5 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p5));\n            uint16x4_t _r6 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p6));\n            uint16x4_t _r7 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p7));\n\n            transpose4x4_u16(_r0, _r1, _r2, _r3);\n            transpose4x4_u16(_r4, _r5, _r6, _r7);\n\n            vst1_u16(pp, _r0);\n            vst1_u16(pp + 4, _r4);\n            vst1_u16(pp + 4 * 2, _r1);\n            vst1_u16(pp + 4 * 3, _r5);\n            vst1_u16(pp + 4 * 4, _r2);\n            vst1_u16(pp + 4 * 5, _r6);\n            vst1_u16(pp + 4 * 6, _r3);\n            vst1_u16(pp + 4 * 7, _r7);\n            pp += 32;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n            p4 += 4;\n            p5 += 4;\n            p6 += 4;\n            p7 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p1[0]);\n            pp[2] = float32_to_float16(p2[0]);\n            pp[3] = float32_to_float16(p3[0]);\n            pp[4] = float32_to_float16(p4[0]);\n            pp[5] = float32_to_float16(p5[0]);\n            pp[6] = float32_to_float16(p6[0]);\n            pp[7] = float32_to_float16(p7[0]);\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n    for 
(; jj + 3 < max_jj; jj += 4)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n        const float* p2 = (const float*)B + (j + jj + 2) * B_hstep + k;\n        const float* p3 = (const float*)B + (j + jj + 3) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x4_t _r0123;\n            _r0123.val[0] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            _r0123.val[1] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p1)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1 + 4)));\n            _r0123.val[2] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p2)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2 + 4)));\n            _r0123.val[3] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p3)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3 + 4)));\n            vst4q_u16(pp, _r0123);\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x4_t _r0123;\n            _r0123.val[0] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            _r0123.val[1] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1));\n            _r0123.val[2] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p2));\n            _r0123.val[3] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p3));\n            vst4_u16(pp, _r0123);\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p1[0]);\n            pp[2] = float32_to_float16(p2[0]);\n            pp[3] = float32_to_float16(p3[0]);\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n    for (; 
jj + 1 < max_jj; jj += 2)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n        const float* p1 = (const float*)B + (j + jj + 1) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8x2_t _r01;\n            _r01.val[0] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            _r01.val[1] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p1)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1 + 4)));\n            vst2q_u16(pp, _r01);\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4x2_t _r01;\n            _r01.val[0] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            _r01.val[1] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p1));\n            vst2_u16(pp, _r01);\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p1[0]);\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n\n        int kk = 0;\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            uint16x8_t _r0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += 4;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void 
transpose_pack_B_tile_fp32_to_fp16(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    unsigned short* pp = BT;\n\n    int jj = 0;\n#if __aarch64__\n    for (; jj + 11 < max_jj; jj += 12)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            vst1_u16(pp, (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)));\n            vst1_u16(pp + 4, (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            vst1_u16(pp + 8, (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 8)));\n            pp += 12;\n            p0 += B_hstep;\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x8_t _r0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(p0)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0 + 4)));\n            vst1q_u16(pp, _r0);\n            pp += 8;\n            p0 += B_hstep;\n        }\n    }\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(p0));\n            vst1_u16(pp, _r0);\n            pp += 4;\n            p0 += B_hstep;\n        }\n    }\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp[1] = float32_to_float16(p0[1]);\n            pp += 2;\n            p0 += B_hstep;\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj);\n\n        int kk = 0;\n        for (; kk < 
max_kk; kk++)\n        {\n            pp[0] = float32_to_float16(p0[0]);\n            pp += 1;\n            p0 += B_hstep;\n        }\n    }\n}\n\nstatic void transpose_unpack_output_tile_fp32_to_fp16(const Mat& topT, Mat& top_blob, int i, int max_ii, int j, int max_jj)\n{\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const float* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x8x4_t _r0;\n                _r0.val[0] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(pp)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 4)));\n                _r0.val[1] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 8)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 12)));\n                _r0.val[2] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 16)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 20)));\n                _r0.val[3] = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 24)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 28)));\n                vst4q_u16(p0, _r0);\n                pp += 32;\n                p0 += out_hstep * 4;\n            }\n        }\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                uint16x8_t _r0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(vld1q_f32(pp)), (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 4)));\n                vst1q_u16(p0, _r0);\n                pp += 8;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        if 
(out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4x4_t _r0123;\n                _r0123.val[0] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp));\n                _r0123.val[1] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 4));\n                _r0123.val[2] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 8));\n                _r0123.val[3] = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp + 12));\n                vst4_u16(p0, _r0123);\n                pp += 16;\n                p0 += out_hstep * 4;\n            }\n        }\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp));\n                vst1_u16(p0, _r0);\n                pp += 4;\n                p0 += out_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n#if __ARM_NEON\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                p0[0] = float32_to_float16(pp[0]);\n                p0[1] = float32_to_float16(pp[2]);\n                p0[2] = float32_to_float16(pp[4]);\n                p0[3] = float32_to_float16(pp[6]);\n                p0[4] = float32_to_float16(pp[1]);\n                p0[5] = float32_to_float16(pp[3]);\n                p0[6] = float32_to_float16(pp[5]);\n                p0[7] = float32_to_float16(pp[7]);\n                pp += 8;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob 
+ j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = float32_to_float16(pp[0]);\n                p0[1] = float32_to_float16(pp[1]);\n                pp += 2;\n                p0 += out_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n#if __ARM_NEON\n        if (out_elempack == 4)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * 4;\n\n            for (int jj = 0; jj + 3 < max_jj; jj += 4)\n            {\n                uint16x4_t _r0 = (uint16x4_t)vcvt_f16_f32(vld1q_f32(pp));\n                vst1_u16(p0, _r0);\n                pp += 4;\n                p0 += out_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (out_elempack == 1)\n        {\n            unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii);\n\n            for (int jj = 0; jj < max_jj; jj += 1)\n            {\n                p0[0] = float32_to_float16(pp[0]);\n                pp += 1;\n                p0 += out_hstep;\n            }\n        }\n    }\n}\n\nstatic void gemm_transB_packed_tile_fp16s(const Mat& AT_tile, const Mat& BT_tile, const Mat& CT_tile, Mat& topT_tile, Mat& top_blob, int broadcast_type_C, float alpha, int i, int max_ii, int j, int max_jj, int k, int max_kk, bool k_end)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdfhm())\n    {\n        gemm_transB_packed_tile_fp16s_asimdfhm(AT_tile, BT_tile, CT_tile, topT_tile, top_blob, broadcast_type_C, alpha, i, max_ii, j, max_jj, k, max_kk, k_end);\n        return;\n    }\n#endif\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? 
top_blob.cstep : (size_t)top_blob.w;\n\n#if __ARM_FEATURE_FP16_FML\n    const __fp16* pAT = AT_tile;\n    const __fp16* pBT = BT_tile;\n#else\n    const unsigned short* pAT = AT_tile;\n    const unsigned short* pBT = BT_tile;\n#endif\n    const float* pC = CT_tile;\n\n    float* outptr = topT_tile;\n\n    int ii = 0;\n#if __aarch64__\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n#if __ARM_FEATURE_FP16_FML\n        const __fp16* pB = pBT;\n#else\n        const unsigned short* pB = pBT;\n#endif\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n            float32x4_t _sum80;\n            float32x4_t _sum81;\n            float32x4_t _sum90;\n            float32x4_t _sum91;\n            float32x4_t _suma0;\n            float32x4_t _suma1;\n            float32x4_t _sumb0;\n            float32x4_t _sumb1;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = 
vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n                _sum40 = vdupq_n_f32(0.f);\n                _sum41 = vdupq_n_f32(0.f);\n                _sum50 = vdupq_n_f32(0.f);\n                _sum51 = vdupq_n_f32(0.f);\n                _sum60 = vdupq_n_f32(0.f);\n                _sum61 = vdupq_n_f32(0.f);\n                _sum70 = vdupq_n_f32(0.f);\n                _sum71 = vdupq_n_f32(0.f);\n                _sum80 = vdupq_n_f32(0.f);\n                _sum81 = vdupq_n_f32(0.f);\n                _sum90 = vdupq_n_f32(0.f);\n                _sum91 = vdupq_n_f32(0.f);\n                _suma0 = vdupq_n_f32(0.f);\n                _suma1 = vdupq_n_f32(0.f);\n                _sumb0 = vdupq_n_f32(0.f);\n                _sumb1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        _sum20 = _sum00;\n                        _sum21 = _sum00;\n                        _sum30 = _sum00;\n                        _sum31 = _sum00;\n                        _sum40 = _sum00;\n                        _sum41 = _sum00;\n                        _sum50 = _sum00;\n                        _sum51 = _sum00;\n                        _sum60 = _sum00;\n                        _sum61 = _sum00;\n                        _sum70 = _sum00;\n                        _sum71 = _sum00;\n                        _sum80 = _sum00;\n                        _sum81 = _sum00;\n                        _sum90 = _sum00;\n                        _sum91 = _sum00;\n                        _suma0 = _sum00;\n                        _suma1 = _sum00;\n                        _sumb0 = _sum00;\n      
                  _sumb1 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                        _sum40 = _sum00;\n                        _sum41 = _sum01;\n                        _sum50 = _sum00;\n                        _sum51 = _sum01;\n                        _sum60 = _sum00;\n                        _sum61 = _sum01;\n                        _sum70 = _sum00;\n                        _sum71 = _sum01;\n                        _sum80 = _sum00;\n                        _sum81 = _sum01;\n                        _sum90 = _sum00;\n                        _sum91 = _sum01;\n                        _suma0 = _sum00;\n                        _suma1 = _sum01;\n                        _sumb0 = _sum00;\n                        _sumb1 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        _sum40 = vld1q_f32(pC + 4 * 8);\n                        _sum41 = vld1q_f32(pC + 4 * 9);\n                        _sum50 = vld1q_f32(pC + 4 * 10);\n                        _sum51 = vld1q_f32(pC + 4 * 11);\n                        _sum60 = vld1q_f32(pC 
+ 4 * 12);\n                        _sum61 = vld1q_f32(pC + 4 * 13);\n                        _sum70 = vld1q_f32(pC + 4 * 14);\n                        _sum71 = vld1q_f32(pC + 4 * 15);\n                        _sum80 = vld1q_f32(pC + 4 * 16);\n                        _sum81 = vld1q_f32(pC + 4 * 17);\n                        _sum90 = vld1q_f32(pC + 4 * 18);\n                        _sum91 = vld1q_f32(pC + 4 * 19);\n                        _suma0 = vld1q_f32(pC + 4 * 20);\n                        _suma1 = vld1q_f32(pC + 4 * 21);\n                        _sumb0 = vld1q_f32(pC + 4 * 22);\n                        _sumb1 = vld1q_f32(pC + 4 * 23);\n                        pC += 96;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum40 = vdupq_n_f32(pC[4]);\n                        _sum50 = vdupq_n_f32(pC[5]);\n                        _sum60 = vdupq_n_f32(pC[6]);\n                        _sum70 = vdupq_n_f32(pC[7]);\n                        _sum80 = vdupq_n_f32(pC[8]);\n                        _sum90 = vdupq_n_f32(pC[9]);\n                        _suma0 = vdupq_n_f32(pC[10]);\n                        _sumb0 = vdupq_n_f32(pC[11]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        _sum41 = _sum40;\n                        _sum51 = _sum50;\n                        _sum61 = _sum60;\n                        _sum71 = _sum70;\n                        _sum81 = _sum80;\n                        _sum91 = _sum90;\n                        _suma1 = _suma0;\n                        _sumb1 = _sumb0;\n                        pC += 12;\n                    }\n         
       }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n                _sum80 = vld1q_f32(outptr + 4 * 16);\n                _sum81 = vld1q_f32(outptr + 4 * 17);\n                _sum90 = vld1q_f32(outptr + 4 * 18);\n                _sum91 = vld1q_f32(outptr + 4 * 19);\n                _suma0 = vld1q_f32(outptr + 4 * 20);\n                _suma1 = vld1q_f32(outptr + 4 * 21);\n                _sumb0 = vld1q_f32(outptr + 4 * 22);\n                _sumb1 = vld1q_f32(outptr + 4 * 23);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pA = vld1q_f16(pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                _sum00 = vfmlalq_lane_low_f16(_sum00, _pA, _pB0, 0);\n                _sum01 = vfmlalq_lane_high_f16(_sum01, _pA, _pB0, 0);\n                _sum10 = vfmlalq_lane_low_f16(_sum10, _pA, 
_pB0, 1);\n                _sum11 = vfmlalq_lane_high_f16(_sum11, _pA, _pB0, 1);\n                _sum20 = vfmlalq_lane_low_f16(_sum20, _pA, _pB0, 2);\n                _sum21 = vfmlalq_lane_high_f16(_sum21, _pA, _pB0, 2);\n                _sum30 = vfmlalq_lane_low_f16(_sum30, _pA, _pB0, 3);\n                _sum31 = vfmlalq_lane_high_f16(_sum31, _pA, _pB0, 3);\n                _sum40 = vfmlalq_lane_low_f16(_sum40, _pA, _pB1, 0);\n                _sum41 = vfmlalq_lane_high_f16(_sum41, _pA, _pB1, 0);\n                _sum50 = vfmlalq_lane_low_f16(_sum50, _pA, _pB1, 1);\n                _sum51 = vfmlalq_lane_high_f16(_sum51, _pA, _pB1, 1);\n                _sum60 = vfmlalq_lane_low_f16(_sum60, _pA, _pB1, 2);\n                _sum61 = vfmlalq_lane_high_f16(_sum61, _pA, _pB1, 2);\n                _sum70 = vfmlalq_lane_low_f16(_sum70, _pA, _pB1, 3);\n                _sum71 = vfmlalq_lane_high_f16(_sum71, _pA, _pB1, 3);\n                _sum80 = vfmlalq_lane_low_f16(_sum80, _pA, _pB2, 0);\n                _sum81 = vfmlalq_lane_high_f16(_sum81, _pA, _pB2, 0);\n                _sum90 = vfmlalq_lane_low_f16(_sum90, _pA, _pB2, 1);\n                _sum91 = vfmlalq_lane_high_f16(_sum91, _pA, _pB2, 1);\n                _suma0 = vfmlalq_lane_low_f16(_suma0, _pA, _pB2, 2);\n                _suma1 = vfmlalq_lane_high_f16(_suma1, _pA, _pB2, 2);\n                _sumb0 = vfmlalq_lane_low_f16(_sumb0, _pA, _pB2, 3);\n                _sumb1 = vfmlalq_lane_high_f16(_sumb1, _pA, _pB2, 3);\n#else\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pA));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pA));\n\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 4));\n                float32x4_t _pB2 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 8));\n\n                _sum00 = 
vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n                _sum80 = vfmaq_laneq_f32(_sum80, _pA0, _pB2, 0);\n                _sum81 = vfmaq_laneq_f32(_sum81, _pA1, _pB2, 0);\n                _sum90 = vfmaq_laneq_f32(_sum90, _pA0, _pB2, 1);\n                _sum91 = vfmaq_laneq_f32(_sum91, _pA1, _pB2, 1);\n                _suma0 = vfmaq_laneq_f32(_suma0, _pA0, _pB2, 2);\n                _suma1 = vfmaq_laneq_f32(_suma1, _pA1, _pB2, 2);\n                _sumb0 = vfmaq_laneq_f32(_sumb0, _pA0, _pB2, 3);\n                _sumb1 = vfmaq_laneq_f32(_sumb1, _pA1, _pB2, 3);\n#endif\n\n                pA += 8;\n                pB += 12;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum20 = vmulq_f32(_sum20, 
_alpha);\n                _sum21 = vmulq_f32(_sum21, _alpha);\n                _sum30 = vmulq_f32(_sum30, _alpha);\n                _sum31 = vmulq_f32(_sum31, _alpha);\n                _sum40 = vmulq_f32(_sum40, _alpha);\n                _sum41 = vmulq_f32(_sum41, _alpha);\n                _sum50 = vmulq_f32(_sum50, _alpha);\n                _sum51 = vmulq_f32(_sum51, _alpha);\n                _sum60 = vmulq_f32(_sum60, _alpha);\n                _sum61 = vmulq_f32(_sum61, _alpha);\n                _sum70 = vmulq_f32(_sum70, _alpha);\n                _sum71 = vmulq_f32(_sum71, _alpha);\n                _sum80 = vmulq_f32(_sum80, _alpha);\n                _sum81 = vmulq_f32(_sum81, _alpha);\n                _sum90 = vmulq_f32(_sum90, _alpha);\n                _sum91 = vmulq_f32(_sum91, _alpha);\n                _suma0 = vmulq_f32(_suma0, _alpha);\n                _suma1 = vmulq_f32(_suma1, _alpha);\n                _sumb0 = vmulq_f32(_sumb0, _alpha);\n                _sumb1 = vmulq_f32(_sumb1, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum30));\n                    vst1_u16(outptr0 + 4 * 4, (uint16x4_t)vcvt_f16_f32(_sum40));\n                    vst1_u16(outptr0 + 4 * 5, (uint16x4_t)vcvt_f16_f32(_sum50));\n                    vst1_u16(outptr0 + 4 * 6, (uint16x4_t)vcvt_f16_f32(_sum60));\n                    vst1_u16(outptr0 + 4 * 7, (uint16x4_t)vcvt_f16_f32(_sum70));\n                    vst1_u16(outptr0 + 4 * 8, (uint16x4_t)vcvt_f16_f32(_sum80));\n                    vst1_u16(outptr0 + 4 * 9, (uint16x4_t)vcvt_f16_f32(_sum90));\n                    vst1_u16(outptr0 + 4 * 10, 
(uint16x4_t)vcvt_f16_f32(_suma0));\n                    vst1_u16(outptr0 + 4 * 11, (uint16x4_t)vcvt_f16_f32(_sumb0));\n\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 4, (uint16x4_t)vcvt_f16_f32(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 5, (uint16x4_t)vcvt_f16_f32(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 6, (uint16x4_t)vcvt_f16_f32(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 7, (uint16x4_t)vcvt_f16_f32(_sum71));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 8, (uint16x4_t)vcvt_f16_f32(_sum81));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 9, (uint16x4_t)vcvt_f16_f32(_sum91));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 10, (uint16x4_t)vcvt_f16_f32(_suma1));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 11, (uint16x4_t)vcvt_f16_f32(_sumb1));\n\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x12_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71, _sum80, _sum81, _sum90, _sum91, _suma0, _suma1, _sumb0, _sumb1);\n\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + 8, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    vst1_u16(outptr0 + 
out_hstep + 4, (uint16x4_t)vcvt_f16_f32(_sum20));\n                    vst1_u16(outptr0 + out_hstep + 8, (uint16x4_t)vcvt_f16_f32(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 2, (uint16x4_t)vcvt_f16_f32(_sum30));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, (uint16x4_t)vcvt_f16_f32(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 2 + 8, (uint16x4_t)vcvt_f16_f32(_sum40));\n                    vst1_u16(outptr0 + out_hstep * 3, (uint16x4_t)vcvt_f16_f32(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, (uint16x4_t)vcvt_f16_f32(_sum50));\n                    vst1_u16(outptr0 + out_hstep * 3 + 8, (uint16x4_t)vcvt_f16_f32(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum60));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, (uint16x4_t)vcvt_f16_f32(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 4 + 8, (uint16x4_t)vcvt_f16_f32(_sum70));\n                    vst1_u16(outptr0 + out_hstep * 5, (uint16x4_t)vcvt_f16_f32(_sum71));\n                    vst1_u16(outptr0 + out_hstep * 5 + 4, (uint16x4_t)vcvt_f16_f32(_sum80));\n                    vst1_u16(outptr0 + out_hstep * 5 + 8, (uint16x4_t)vcvt_f16_f32(_sum81));\n                    vst1_u16(outptr0 + out_hstep * 6, (uint16x4_t)vcvt_f16_f32(_sum90));\n                    vst1_u16(outptr0 + out_hstep * 6 + 4, (uint16x4_t)vcvt_f16_f32(_sum91));\n                    vst1_u16(outptr0 + out_hstep * 6 + 8, (uint16x4_t)vcvt_f16_f32(_suma0));\n                    vst1_u16(outptr0 + out_hstep * 7, (uint16x4_t)vcvt_f16_f32(_suma1));\n                    vst1_u16(outptr0 + out_hstep * 7 + 4, (uint16x4_t)vcvt_f16_f32(_sumb0));\n                    vst1_u16(outptr0 + out_hstep * 7 + 8, (uint16x4_t)vcvt_f16_f32(_sumb1));\n\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, 
_sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n                vst1q_f32(outptr + 4 * 16, _sum80);\n                vst1q_f32(outptr + 4 * 17, _sum81);\n                vst1q_f32(outptr + 4 * 18, _sum90);\n                vst1q_f32(outptr + 4 * 19, _sum91);\n                vst1q_f32(outptr + 4 * 20, _suma0);\n                vst1q_f32(outptr + 4 * 21, _suma1);\n                vst1q_f32(outptr + 4 * 22, _sumb0);\n                vst1q_f32(outptr + 4 * 23, _sumb1);\n            }\n\n            outptr += 96;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n            float32x4_t _sum40;\n            float32x4_t _sum41;\n            float32x4_t _sum50;\n            float32x4_t _sum51;\n            float32x4_t _sum60;\n            float32x4_t _sum61;\n            float32x4_t _sum70;\n            float32x4_t _sum71;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                
_sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n                _sum40 = vdupq_n_f32(0.f);\n                _sum41 = vdupq_n_f32(0.f);\n                _sum50 = vdupq_n_f32(0.f);\n                _sum51 = vdupq_n_f32(0.f);\n                _sum60 = vdupq_n_f32(0.f);\n                _sum61 = vdupq_n_f32(0.f);\n                _sum70 = vdupq_n_f32(0.f);\n                _sum71 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        _sum20 = _sum00;\n                        _sum21 = _sum00;\n                        _sum30 = _sum00;\n                        _sum31 = _sum00;\n                        _sum40 = _sum00;\n                        _sum41 = _sum00;\n                        _sum50 = _sum00;\n                        _sum51 = _sum00;\n                        _sum60 = _sum00;\n                        _sum61 = _sum00;\n                        _sum70 = _sum00;\n                        _sum71 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                        _sum40 = _sum00;\n                        _sum41 = _sum01;\n                        _sum50 = _sum00;\n                        _sum51 = _sum01;\n  
                      _sum60 = _sum00;\n                        _sum61 = _sum01;\n                        _sum70 = _sum00;\n                        _sum71 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        _sum40 = vld1q_f32(pC + 4 * 8);\n                        _sum41 = vld1q_f32(pC + 4 * 9);\n                        _sum50 = vld1q_f32(pC + 4 * 10);\n                        _sum51 = vld1q_f32(pC + 4 * 11);\n                        _sum60 = vld1q_f32(pC + 4 * 12);\n                        _sum61 = vld1q_f32(pC + 4 * 13);\n                        _sum70 = vld1q_f32(pC + 4 * 14);\n                        _sum71 = vld1q_f32(pC + 4 * 15);\n                        pC += 64;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum40 = vdupq_n_f32(pC[4]);\n                        _sum50 = vdupq_n_f32(pC[5]);\n                        _sum60 = vdupq_n_f32(pC[6]);\n                        _sum70 = vdupq_n_f32(pC[7]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        _sum41 = _sum40;\n                        _sum51 = _sum50;\n                 
       _sum61 = _sum60;\n                        _sum71 = _sum70;\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n                _sum40 = vld1q_f32(outptr + 4 * 8);\n                _sum41 = vld1q_f32(outptr + 4 * 9);\n                _sum50 = vld1q_f32(outptr + 4 * 10);\n                _sum51 = vld1q_f32(outptr + 4 * 11);\n                _sum60 = vld1q_f32(outptr + 4 * 12);\n                _sum61 = vld1q_f32(outptr + 4 * 13);\n                _sum70 = vld1q_f32(outptr + 4 * 14);\n                _sum71 = vld1q_f32(outptr + 4 * 15);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pA = vld1q_f16(pA);\n                float16x8_t _pB = vld1q_f16(pB);\n\n                _sum00 = vfmlalq_laneq_low_f16(_sum00, _pA, _pB, 0);\n                _sum01 = vfmlalq_laneq_high_f16(_sum01, _pA, _pB, 0);\n                _sum10 = vfmlalq_laneq_low_f16(_sum10, _pA, _pB, 1);\n                _sum11 = vfmlalq_laneq_high_f16(_sum11, _pA, _pB, 1);\n                _sum20 = vfmlalq_laneq_low_f16(_sum20, _pA, _pB, 2);\n                _sum21 = vfmlalq_laneq_high_f16(_sum21, _pA, _pB, 2);\n                _sum30 = vfmlalq_laneq_low_f16(_sum30, _pA, _pB, 3);\n                _sum31 = vfmlalq_laneq_high_f16(_sum31, _pA, _pB, 3);\n                _sum40 = 
vfmlalq_laneq_low_f16(_sum40, _pA, _pB, 4);\n                _sum41 = vfmlalq_laneq_high_f16(_sum41, _pA, _pB, 4);\n                _sum50 = vfmlalq_laneq_low_f16(_sum50, _pA, _pB, 5);\n                _sum51 = vfmlalq_laneq_high_f16(_sum51, _pA, _pB, 5);\n                _sum60 = vfmlalq_laneq_low_f16(_sum60, _pA, _pB, 6);\n                _sum61 = vfmlalq_laneq_high_f16(_sum61, _pA, _pB, 6);\n                _sum70 = vfmlalq_laneq_low_f16(_sum70, _pA, _pB, 7);\n                _sum71 = vfmlalq_laneq_high_f16(_sum71, _pA, _pB, 7);\n#else\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pA));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pA));\n\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 4));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n                _sum40 = vfmaq_laneq_f32(_sum40, _pA0, _pB1, 0);\n                _sum41 = vfmaq_laneq_f32(_sum41, _pA1, _pB1, 0);\n                _sum50 = vfmaq_laneq_f32(_sum50, _pA0, _pB1, 1);\n                _sum51 = vfmaq_laneq_f32(_sum51, _pA1, _pB1, 1);\n                _sum60 = vfmaq_laneq_f32(_sum60, _pA0, _pB1, 2);\n                _sum61 = vfmaq_laneq_f32(_sum61, _pA1, _pB1, 2);\n                _sum70 = vfmaq_laneq_f32(_sum70, _pA0, _pB1, 3);\n                _sum71 = vfmaq_laneq_f32(_sum71, _pA1, _pB1, 3);\n#endif\n\n            
    pA += 8;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum20 = vmulq_f32(_sum20, _alpha);\n                _sum21 = vmulq_f32(_sum21, _alpha);\n                _sum30 = vmulq_f32(_sum30, _alpha);\n                _sum31 = vmulq_f32(_sum31, _alpha);\n                _sum40 = vmulq_f32(_sum40, _alpha);\n                _sum41 = vmulq_f32(_sum41, _alpha);\n                _sum50 = vmulq_f32(_sum50, _alpha);\n                _sum51 = vmulq_f32(_sum51, _alpha);\n                _sum60 = vmulq_f32(_sum60, _alpha);\n                _sum61 = vmulq_f32(_sum61, _alpha);\n                _sum70 = vmulq_f32(_sum70, _alpha);\n                _sum71 = vmulq_f32(_sum71, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum30));\n                    vst1_u16(outptr0 + 4 * 4, (uint16x4_t)vcvt_f16_f32(_sum40));\n                    vst1_u16(outptr0 + 4 * 5, (uint16x4_t)vcvt_f16_f32(_sum50));\n                    vst1_u16(outptr0 + 4 * 6, (uint16x4_t)vcvt_f16_f32(_sum60));\n                    vst1_u16(outptr0 + 4 * 7, (uint16x4_t)vcvt_f16_f32(_sum70));\n\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    vst1_u16(outptr0 + 
out_hstep * 4 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 4, (uint16x4_t)vcvt_f16_f32(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 5, (uint16x4_t)vcvt_f16_f32(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 6, (uint16x4_t)vcvt_f16_f32(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 7, (uint16x4_t)vcvt_f16_f32(_sum71));\n\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x8_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31, _sum40, _sum41, _sum50, _sum51, _sum60, _sum61, _sum70, _sum71);\n\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 2, (uint16x4_t)vcvt_f16_f32(_sum20));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, (uint16x4_t)vcvt_f16_f32(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 3, (uint16x4_t)vcvt_f16_f32(_sum30));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, (uint16x4_t)vcvt_f16_f32(_sum31));\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum40));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, (uint16x4_t)vcvt_f16_f32(_sum41));\n                    vst1_u16(outptr0 + out_hstep * 5, (uint16x4_t)vcvt_f16_f32(_sum50));\n                    vst1_u16(outptr0 + out_hstep * 5 + 4, (uint16x4_t)vcvt_f16_f32(_sum51));\n                    vst1_u16(outptr0 + out_hstep * 6, (uint16x4_t)vcvt_f16_f32(_sum60));\n             
       vst1_u16(outptr0 + out_hstep * 6 + 4, (uint16x4_t)vcvt_f16_f32(_sum61));\n                    vst1_u16(outptr0 + out_hstep * 7, (uint16x4_t)vcvt_f16_f32(_sum70));\n                    vst1_u16(outptr0 + out_hstep * 7 + 4, (uint16x4_t)vcvt_f16_f32(_sum71));\n\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n                vst1q_f32(outptr + 4 * 8, _sum40);\n                vst1q_f32(outptr + 4 * 9, _sum41);\n                vst1q_f32(outptr + 4 * 10, _sum50);\n                vst1q_f32(outptr + 4 * 11, _sum51);\n                vst1q_f32(outptr + 4 * 12, _sum60);\n                vst1q_f32(outptr + 4 * 13, _sum61);\n                vst1q_f32(outptr + 4 * 14, _sum70);\n                vst1q_f32(outptr + 4 * 15, _sum71);\n            }\n\n            outptr += 64;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum20;\n            float32x4_t _sum21;\n            float32x4_t _sum30;\n            float32x4_t _sum31;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum20 = vdupq_n_f32(0.f);\n                _sum21 = vdupq_n_f32(0.f);\n                _sum30 = vdupq_n_f32(0.f);\n                _sum31 = vdupq_n_f32(0.f);\n\n                if (pC)\n            
    {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        _sum20 = _sum00;\n                        _sum21 = _sum00;\n                        _sum30 = _sum00;\n                        _sum31 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum20 = _sum00;\n                        _sum21 = _sum01;\n                        _sum30 = _sum00;\n                        _sum31 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        _sum20 = vld1q_f32(pC + 4 * 4);\n                        _sum21 = vld1q_f32(pC + 4 * 5);\n                        _sum30 = vld1q_f32(pC + 4 * 6);\n                        _sum31 = vld1q_f32(pC + 4 * 7);\n                        pC += 32;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum20 = vdupq_n_f32(pC[2]);\n                        _sum30 = vdupq_n_f32(pC[3]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        _sum21 = _sum20;\n                        _sum31 = _sum30;\n                        pC += 4;\n            
        }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n                _sum20 = vld1q_f32(outptr + 4 * 4);\n                _sum21 = vld1q_f32(outptr + 4 * 5);\n                _sum30 = vld1q_f32(outptr + 4 * 6);\n                _sum31 = vld1q_f32(outptr + 4 * 7);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pA = vld1q_f16(pA);\n                float16x4_t _pB = vld1_f16(pB);\n\n                _sum00 = vfmlalq_lane_low_f16(_sum00, _pA, _pB, 0);\n                _sum01 = vfmlalq_lane_high_f16(_sum01, _pA, _pB, 0);\n                _sum10 = vfmlalq_lane_low_f16(_sum10, _pA, _pB, 1);\n                _sum11 = vfmlalq_lane_high_f16(_sum11, _pA, _pB, 1);\n                _sum20 = vfmlalq_lane_low_f16(_sum20, _pA, _pB, 2);\n                _sum21 = vfmlalq_lane_high_f16(_sum21, _pA, _pB, 2);\n                _sum30 = vfmlalq_lane_low_f16(_sum30, _pA, _pB, 3);\n                _sum31 = vfmlalq_lane_high_f16(_sum31, _pA, _pB, 3);\n#else\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pA));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pA));\n\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n\n                _sum00 = vfmaq_laneq_f32(_sum00, _pA0, _pB0, 0);\n                _sum01 = vfmaq_laneq_f32(_sum01, _pA1, _pB0, 0);\n                _sum10 = vfmaq_laneq_f32(_sum10, _pA0, _pB0, 1);\n                _sum11 = vfmaq_laneq_f32(_sum11, _pA1, _pB0, 1);\n                _sum20 = 
vfmaq_laneq_f32(_sum20, _pA0, _pB0, 2);\n                _sum21 = vfmaq_laneq_f32(_sum21, _pA1, _pB0, 2);\n                _sum30 = vfmaq_laneq_f32(_sum30, _pA0, _pB0, 3);\n                _sum31 = vfmaq_laneq_f32(_sum31, _pA1, _pB0, 3);\n#endif\n\n                pA += 8;\n                pB += 4;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum20 = vmulq_f32(_sum20, _alpha);\n                _sum21 = vmulq_f32(_sum21, _alpha);\n                _sum30 = vmulq_f32(_sum30, _alpha);\n                _sum31 = vmulq_f32(_sum31, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum20));\n                    vst1_u16(outptr0 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum30));\n\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum31));\n\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose8x4_ps(_sum00, _sum01, _sum10, _sum11, _sum20, _sum21, _sum30, _sum31);\n\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 
out_hstep * 1, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 2, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + out_hstep * 3, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum20));\n                    vst1_u16(outptr0 + out_hstep * 5, (uint16x4_t)vcvt_f16_f32(_sum21));\n                    vst1_u16(outptr0 + out_hstep * 6, (uint16x4_t)vcvt_f16_f32(_sum30));\n                    vst1_u16(outptr0 + out_hstep * 7, (uint16x4_t)vcvt_f16_f32(_sum31));\n\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n                vst1q_f32(outptr + 4 * 4, _sum20);\n                vst1q_f32(outptr + 4 * 5, _sum21);\n                vst1q_f32(outptr + 4 * 6, _sum30);\n                vst1q_f32(outptr + 4 * 7, _sum31);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        
_sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4 * 1);\n                        _sum10 = vld1q_f32(pC + 4 * 2);\n                        _sum11 = vld1q_f32(pC + 4 * 3);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum01 = _sum00;\n                        _sum11 = _sum10;\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4 * 1);\n                _sum10 = vld1q_f32(outptr + 4 * 2);\n                _sum11 = vld1q_f32(outptr + 4 * 3);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pA = vld1q_f16(pA);\n                float16x4_t _pB0 = vdup_n_f16(pB[0]);\n                float16x4_t _pB1 = vdup_n_f16(pB[1]);\n                float16x8_t _pB01 = vcombine_f16(_pB0, _pB1);\n                float16x8_t _pB10 = vcombine_f16(_pB1, _pB0);\n\n                _sum00 = vfmlalq_low_f16(_sum00, _pA, _pB01);\n                _sum01 = vfmlalq_high_f16(_sum01, _pA, _pB10);\n                _sum10 = vfmlalq_low_f16(_sum10, _pA, _pB10);\n                _sum11 = vfmlalq_high_f16(_sum11, _pA, _pB01);\n#else\n                uint16x8_t _pA = vld1q_u16(pA);\n                
float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pA));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pA));\n\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pB[0]));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pB[1]));\n\n                _sum00 = vfmaq_f32(_sum00, _pA0, _pB0);\n                _sum01 = vfmaq_f32(_sum01, _pA1, _pB0);\n                _sum10 = vfmaq_f32(_sum10, _pA0, _pB1);\n                _sum11 = vfmaq_f32(_sum11, _pA1, _pB1);\n#endif\n\n                pA += 8;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum10));\n\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + out_hstep * 4 + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[8];\n                    unsigned short sum1[8];\n                    vst1_u16(sum0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(sum0 + 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(sum1, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(sum1 + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                
    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0[out_hstep * 4 + 1] = sum1[4];\n                    outptr0[out_hstep * 5 + 1] = sum1[5];\n                    outptr0[out_hstep * 6 + 1] = sum1[6];\n                    outptr0[out_hstep * 7 + 1] = sum1[7];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n                vst1q_f32(outptr + 4 * 2, _sum10);\n                vst1q_f32(outptr + 4 * 3, _sum11);\n            }\n\n            outptr += 16;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n    
                    pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum00 = vld1q_f32(outptr);\n                _sum01 = vld1q_f32(outptr + 4);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pA = vld1q_f16(pA);\n                float16x8_t _pB = vdupq_n_f16(pB[0]);\n\n                _sum00 = vfmlalq_low_f16(_sum00, _pA, _pB);\n                _sum01 = vfmlalq_high_f16(_sum01, _pA, _pB);\n#else\n                uint16x8_t _pA = vld1q_u16(pA);\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pA));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pA));\n\n                float32x4_t _pB = vcvt_f32_f16((float16x4_t)vld1_dup_u16(pB));\n\n                _sum00 = vfmaq_f32(_sum00, _pA0, _pB);\n                _sum01 = vfmaq_f32(_sum01, _pA1, _pB);\n#endif\n\n                pA += 8;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + out_hstep * 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n 
               {\n                    unsigned short sum0[8];\n                    vst1_u16(sum0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(sum0 + 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep * 1] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[out_hstep * 4] = sum0[4];\n                    outptr0[out_hstep * 5] = sum0[5];\n                    outptr0[out_hstep * 6] = sum0[6];\n                    outptr0[out_hstep * 7] = sum0[7];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum00);\n                vst1q_f32(outptr + 4, _sum01);\n            }\n\n            outptr += 8;\n        }\n\n        pAT += max_kk * 8;\n    }\n#endif // __aarch64__\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n#if __ARM_FEATURE_FP16_FML\n        const __fp16* pB = pBT;\n#else\n        const unsigned short* pB = pBT;\n#endif\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n            float32x4_t _sum8;\n            float32x4_t _sum9;\n            float32x4_t _suma;\n            float32x4_t 
_sumb;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n                _sum4 = vdupq_n_f32(0.f);\n                _sum5 = vdupq_n_f32(0.f);\n                _sum6 = vdupq_n_f32(0.f);\n                _sum7 = vdupq_n_f32(0.f);\n                _sum8 = vdupq_n_f32(0.f);\n                _sum9 = vdupq_n_f32(0.f);\n                _suma = vdupq_n_f32(0.f);\n                _sumb = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                        _sum8 = _sum0;\n                        _sum9 = _sum0;\n                        _suma = _sum0;\n                        _sumb = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                        _sum8 = _sum0;\n                        _sum9 = _sum0;\n                        _suma = _sum0;\n                        _sumb = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                       
 _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        _sum4 = vld1q_f32(pC + 16);\n                        _sum5 = vld1q_f32(pC + 20);\n                        _sum6 = vld1q_f32(pC + 24);\n                        _sum7 = vld1q_f32(pC + 28);\n                        _sum8 = vld1q_f32(pC + 32);\n                        _sum9 = vld1q_f32(pC + 36);\n                        _suma = vld1q_f32(pC + 40);\n                        _sumb = vld1q_f32(pC + 44);\n                        pC += 48;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        _sum4 = vdupq_n_f32(pC[4]);\n                        _sum5 = vdupq_n_f32(pC[5]);\n                        _sum6 = vdupq_n_f32(pC[6]);\n                        _sum7 = vdupq_n_f32(pC[7]);\n                        _sum8 = vdupq_n_f32(pC[8]);\n                        _sum9 = vdupq_n_f32(pC[9]);\n                        _suma = vdupq_n_f32(pC[10]);\n                        _sumb = vdupq_n_f32(pC[11]);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n                _sum8 = vld1q_f32(outptr + 4 * 8);\n                _sum9 = vld1q_f32(outptr + 4 * 9);\n                _suma = vld1q_f32(outptr + 4 * 10);\n                _sumb = 
vld1q_f32(outptr + 4 * 11);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x4_t _pA = vld1_f16(pA);\n                float16x8_t _pAA = vcombine_f16(_pA, _pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n\n                _sum0 = vfmlalq_lane_low_f16(_sum0, _pAA, _pB0, 0);\n                _sum1 = vfmlalq_lane_low_f16(_sum1, _pAA, _pB0, 1);\n                _sum2 = vfmlalq_lane_low_f16(_sum2, _pAA, _pB0, 2);\n                _sum3 = vfmlalq_lane_low_f16(_sum3, _pAA, _pB0, 3);\n                _sum4 = vfmlalq_lane_low_f16(_sum4, _pAA, _pB1, 0);\n                _sum5 = vfmlalq_lane_low_f16(_sum5, _pAA, _pB1, 1);\n                _sum6 = vfmlalq_lane_low_f16(_sum6, _pAA, _pB1, 2);\n                _sum7 = vfmlalq_lane_low_f16(_sum7, _pAA, _pB1, 3);\n                _sum8 = vfmlalq_lane_low_f16(_sum8, _pAA, _pB2, 0);\n                _sum9 = vfmlalq_lane_low_f16(_sum9, _pAA, _pB2, 1);\n                _suma = vfmlalq_lane_low_f16(_suma, _pAA, _pB2, 2);\n                _sumb = vfmlalq_lane_low_f16(_sumb, _pAA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n#else\n#if __aarch64__\n                float32x4_t _pA = vcvt_f32_f16((float16x4_t)vld1_u16(pA));\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 4));\n                float32x4_t _pB2 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 8));\n\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = 
vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n                _sum8 = vfmaq_laneq_f32(_sum8, _pA, _pB2, 0);\n                _sum9 = vfmaq_laneq_f32(_sum9, _pA, _pB2, 1);\n                _suma = vfmaq_laneq_f32(_suma, _pA, _pB2, 2);\n                _sumb = vfmaq_laneq_f32(_sumb, _pA, _pB2, 3);\n\n                pA += 4;\n                pB += 12;\n#else // __aarch64__\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"pld        [%0, #64]       \\n\"\n                    \"pld        [%1, #192]      \\n\"\n                    \"vld1.u16   {d6}, [%0 :64]! \\n\"\n                    \"vld1.u16   {d2-d4}, [%1 :64]! \\n\"\n                    \"vcvt.f32.f16 q3, d6        \\n\"\n                    \"vcvt.f32.f16 q0, d2        \\n\"\n                    \"vcvt.f32.f16 q1, d3        \\n\"\n                    \"vcvt.f32.f16 q2, d4        \\n\"\n                    \"vmla.f32   %q2, q3, d0[0]  \\n\"\n                    \"vmla.f32   %q3, q3, d0[1]  \\n\"\n                    \"vmla.f32   %q4, q3, d1[0]  \\n\"\n                    \"vmla.f32   %q5, q3, d1[1]  \\n\"\n                    \"vmla.f32   %q6, q3, d2[0]  \\n\"\n                    \"vmla.f32   %q7, q3, d2[1]  \\n\"\n                    \"vmla.f32   %q8, q3, d3[0]  \\n\"\n                    \"vmla.f32   %q9, q3, d3[1]  \\n\"\n                    \"vmla.f32   %q10, q3, d4[0] \\n\"\n                    \"vmla.f32   %q11, q3, d4[1] \\n\"\n                    \"vmla.f32   %q12, q3, d5[0] \\n\"\n                    \"vmla.f32   %q13, q3, d5[1] \\n\"\n                    : \"=r\"(pA),\n                    \"=r\"(pB),\n                    \"=w\"(_sum0),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    
\"=w\"(_sum3),\n                    \"=w\"(_sum4),\n                    \"=w\"(_sum5),\n                    \"=w\"(_sum6),\n                    \"=w\"(_sum7),\n                    \"=w\"(_sum8),\n                    \"=w\"(_sum9),\n                    \"=w\"(_suma),\n                    \"=w\"(_sumb)\n                    : \"0\"(pA),\n                    \"1\"(pB),\n                    \"2\"(_sum0),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3),\n                    \"6\"(_sum4),\n                    \"7\"(_sum5),\n                    \"8\"(_sum6),\n                    \"9\"(_sum7),\n                    \"10\"(_sum8),\n                    \"11\"(_sum9),\n                    \"12\"(_suma),\n                    \"13\"(_sumb)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#else\n                float32x4_t _pA = vcvt_f32_f16((float16x4_t)vld1_u16(pA));\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 4));\n                float32x4_t _pB2 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 8));\n\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB0), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB0), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB0), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB0), 1);\n                _sum4 = vmlaq_lane_f32(_sum4, _pA, vget_low_f32(_pB1), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _pA, vget_low_f32(_pB1), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _pA, vget_high_f32(_pB1), 0);\n                _sum7 = vmlaq_lane_f32(_sum7, _pA, vget_high_f32(_pB1), 1);\n                _sum8 = vmlaq_lane_f32(_sum8, _pA, vget_low_f32(_pB2), 0);\n                _sum9 = vmlaq_lane_f32(_sum9, _pA, vget_low_f32(_pB2), 1);\n                _suma = vmlaq_lane_f32(_suma, _pA, 
vget_high_f32(_pB2), 0);\n                _sumb = vmlaq_lane_f32(_sumb, _pA, vget_high_f32(_pB2), 1);\n\n                pA += 4;\n                pB += 12;\n#endif\n#endif // __aarch64__\n#endif\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n                _sum2 = vmulq_f32(_sum2, _alpha);\n                _sum3 = vmulq_f32(_sum3, _alpha);\n                _sum4 = vmulq_f32(_sum4, _alpha);\n                _sum5 = vmulq_f32(_sum5, _alpha);\n                _sum6 = vmulq_f32(_sum6, _alpha);\n                _sum7 = vmulq_f32(_sum7, _alpha);\n                _sum8 = vmulq_f32(_sum8, _alpha);\n                _sum9 = vmulq_f32(_sum9, _alpha);\n                _suma = vmulq_f32(_suma, _alpha);\n                _sumb = vmulq_f32(_sumb, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum3));\n                    vst1_u16(outptr0 + 4 * 4, (uint16x4_t)vcvt_f16_f32(_sum4));\n                    vst1_u16(outptr0 + 4 * 5, (uint16x4_t)vcvt_f16_f32(_sum5));\n                    vst1_u16(outptr0 + 4 * 6, (uint16x4_t)vcvt_f16_f32(_sum6));\n                    vst1_u16(outptr0 + 4 * 7, (uint16x4_t)vcvt_f16_f32(_sum7));\n                    vst1_u16(outptr0 + 4 * 8, (uint16x4_t)vcvt_f16_f32(_sum8));\n                    vst1_u16(outptr0 + 4 * 9, (uint16x4_t)vcvt_f16_f32(_sum9));\n                    vst1_u16(outptr0 + 4 * 10, (uint16x4_t)vcvt_f16_f32(_suma));\n                    vst1_u16(outptr0 + 4 * 11, 
(uint16x4_t)vcvt_f16_f32(_sumb));\n                    outptr0 += 48;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x12_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7, _sum8, _sum9, _suma, _sumb);\n\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    vst1_u16(outptr0 + 8, (uint16x4_t)vcvt_f16_f32(_sum2));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum3));\n                    vst1_u16(outptr0 + out_hstep + 4, (uint16x4_t)vcvt_f16_f32(_sum4));\n                    vst1_u16(outptr0 + out_hstep + 8, (uint16x4_t)vcvt_f16_f32(_sum5));\n                    vst1_u16(outptr0 + out_hstep * 2, (uint16x4_t)vcvt_f16_f32(_sum6));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, (uint16x4_t)vcvt_f16_f32(_sum7));\n                    vst1_u16(outptr0 + out_hstep * 2 + 8, (uint16x4_t)vcvt_f16_f32(_sum8));\n                    vst1_u16(outptr0 + out_hstep * 3, (uint16x4_t)vcvt_f16_f32(_sum9));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, (uint16x4_t)vcvt_f16_f32(_suma));\n                    vst1_u16(outptr0 + out_hstep * 3 + 8, (uint16x4_t)vcvt_f16_f32(_sumb));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n                vst1q_f32(outptr + 4 * 8, _sum8);\n                vst1q_f32(outptr + 4 * 9, _sum9);\n                vst1q_f32(outptr + 4 * 10, _suma);\n                vst1q_f32(outptr 
+ 4 * 11, _sumb);\n            }\n\n            outptr += 48;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n            float32x4_t _sum4;\n            float32x4_t _sum5;\n            float32x4_t _sum6;\n            float32x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n                _sum4 = vdupq_n_f32(0.f);\n                _sum5 = vdupq_n_f32(0.f);\n                _sum6 = vdupq_n_f32(0.f);\n                _sum7 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                        _sum4 = _sum0;\n                        _sum5 = _sum0;\n                        _sum6 = _sum0;\n                        _sum7 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n            
            _sum4 = vld1q_f32(pC + 16);\n                        _sum5 = vld1q_f32(pC + 20);\n                        _sum6 = vld1q_f32(pC + 24);\n                        _sum7 = vld1q_f32(pC + 28);\n                        pC += 32;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        _sum4 = vdupq_n_f32(pC[4]);\n                        _sum5 = vdupq_n_f32(pC[5]);\n                        _sum6 = vdupq_n_f32(pC[6]);\n                        _sum7 = vdupq_n_f32(pC[7]);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n                _sum4 = vld1q_f32(outptr + 4 * 4);\n                _sum5 = vld1q_f32(outptr + 4 * 5);\n                _sum6 = vld1q_f32(outptr + 4 * 6);\n                _sum7 = vld1q_f32(outptr + 4 * 7);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x4_t _pA = vld1_f16(pA);\n                float16x8_t _pAA = vcombine_f16(_pA, _pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n                float16x4_t _pB1 = vld1_f16(pB + 4);\n\n                _sum0 = vfmlalq_lane_low_f16(_sum0, _pAA, _pB0, 0);\n                _sum1 = vfmlalq_lane_low_f16(_sum1, _pAA, _pB0, 1);\n                _sum2 = vfmlalq_lane_low_f16(_sum2, _pAA, _pB0, 2);\n                _sum3 = 
vfmlalq_lane_low_f16(_sum3, _pAA, _pB0, 3);\n                _sum4 = vfmlalq_lane_low_f16(_sum4, _pAA, _pB1, 0);\n                _sum5 = vfmlalq_lane_low_f16(_sum5, _pAA, _pB1, 1);\n                _sum6 = vfmlalq_lane_low_f16(_sum6, _pAA, _pB1, 2);\n                _sum7 = vfmlalq_lane_low_f16(_sum7, _pAA, _pB1, 3);\n#else\n                float32x4_t _pA = vcvt_f32_f16((float16x4_t)vld1_u16(pA));\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 4));\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB0, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB0, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB0, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB0, 3);\n                _sum4 = vfmaq_laneq_f32(_sum4, _pA, _pB1, 0);\n                _sum5 = vfmaq_laneq_f32(_sum5, _pA, _pB1, 1);\n                _sum6 = vfmaq_laneq_f32(_sum6, _pA, _pB1, 2);\n                _sum7 = vfmaq_laneq_f32(_sum7, _pA, _pB1, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB0), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB0), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB0), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB0), 1);\n                _sum4 = vmlaq_lane_f32(_sum4, _pA, vget_low_f32(_pB1), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _pA, vget_low_f32(_pB1), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _pA, vget_high_f32(_pB1), 0);\n                _sum7 = vmlaq_lane_f32(_sum7, _pA, vget_high_f32(_pB1), 1);\n#endif\n#endif\n\n                pA += 4;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n   
             _sum2 = vmulq_f32(_sum2, _alpha);\n                _sum3 = vmulq_f32(_sum3, _alpha);\n                _sum4 = vmulq_f32(_sum4, _alpha);\n                _sum5 = vmulq_f32(_sum5, _alpha);\n                _sum6 = vmulq_f32(_sum6, _alpha);\n                _sum7 = vmulq_f32(_sum7, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum3));\n                    vst1_u16(outptr0 + 4 * 4, (uint16x4_t)vcvt_f16_f32(_sum4));\n                    vst1_u16(outptr0 + 4 * 5, (uint16x4_t)vcvt_f16_f32(_sum5));\n                    vst1_u16(outptr0 + 4 * 6, (uint16x4_t)vcvt_f16_f32(_sum6));\n                    vst1_u16(outptr0 + 4 * 7, (uint16x4_t)vcvt_f16_f32(_sum7));\n                    outptr0 += 32;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x8_ps(_sum0, _sum1, _sum2, _sum3, _sum4, _sum5, _sum6, _sum7);\n\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum2));\n                    vst1_u16(outptr0 + out_hstep + 4, (uint16x4_t)vcvt_f16_f32(_sum3));\n                    vst1_u16(outptr0 + out_hstep * 2, (uint16x4_t)vcvt_f16_f32(_sum4));\n                    vst1_u16(outptr0 + out_hstep * 2 + 4, (uint16x4_t)vcvt_f16_f32(_sum5));\n                    vst1_u16(outptr0 + out_hstep * 3, (uint16x4_t)vcvt_f16_f32(_sum6));\n                    vst1_u16(outptr0 + out_hstep * 3 + 4, (uint16x4_t)vcvt_f16_f32(_sum7));\n                    outptr0 += 8;\n             
   }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n                vst1q_f32(outptr + 4 * 4, _sum4);\n                vst1q_f32(outptr + 4 * 5, _sum5);\n                vst1q_f32(outptr + 4 * 6, _sum6);\n                vst1q_f32(outptr + 4 * 7, _sum7);\n            }\n\n            outptr += 32;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n            float32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n                _sum3 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        _sum2 = _sum0;\n                        _sum3 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        _sum3 = vld1q_f32(pC + 12);\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = 
vdupq_n_f32(pC[1]);\n                        _sum2 = vdupq_n_f32(pC[2]);\n                        _sum3 = vdupq_n_f32(pC[3]);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4 * 1);\n                _sum2 = vld1q_f32(outptr + 4 * 2);\n                _sum3 = vld1q_f32(outptr + 4 * 3);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x4_t _pA = vld1_f16(pA);\n                float16x8_t _pAA = vcombine_f16(_pA, _pA);\n\n                float16x4_t _pB0 = vld1_f16(pB);\n\n                _sum0 = vfmlalq_lane_low_f16(_sum0, _pAA, _pB0, 0);\n                _sum1 = vfmlalq_lane_low_f16(_sum1, _pAA, _pB0, 1);\n                _sum2 = vfmlalq_lane_low_f16(_sum2, _pAA, _pB0, 2);\n                _sum3 = vfmlalq_lane_low_f16(_sum3, _pAA, _pB0, 3);\n#else\n                float32x4_t _pA = vcvt_f32_f16((float16x4_t)vld1_u16(pA));\n                float32x4_t _pB = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _pA, _pB, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _pA, _pB, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _pA, _pB, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _pA, _pB, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _pA, vget_low_f32(_pB), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _pA, vget_low_f32(_pB), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _pA, vget_high_f32(_pB), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _pA, vget_high_f32(_pB), 1);\n#endif\n#endif\n\n                pA += 4;\n                pB += 4;\n            }\n\n            if (alpha != 1.f)\n     
       {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n                _sum2 = vmulq_f32(_sum2, _alpha);\n                _sum3 = vmulq_f32(_sum3, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    vst1_u16(outptr0 + 4 * 2, (uint16x4_t)vcvt_f16_f32(_sum2));\n                    vst1_u16(outptr0 + 4 * 3, (uint16x4_t)vcvt_f16_f32(_sum3));\n                    outptr0 += 16;\n                }\n                if (out_elempack == 1)\n                {\n                    transpose4x4_ps(_sum0, _sum1, _sum2, _sum3);\n\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    vst1_u16(outptr0 + out_hstep * 2, (uint16x4_t)vcvt_f16_f32(_sum2));\n                    vst1_u16(outptr0 + out_hstep * 3, (uint16x4_t)vcvt_f16_f32(_sum3));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 4 * 2, _sum2);\n                vst1q_f32(outptr + 4 * 3, _sum3);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                   
     _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x4_t _pA = vld1_f16(pA);\n                float16x8_t _pAA = vcombine_f16(_pA, _pA);\n\n                float16x4_t _pB0 = vdup_n_f16(pB[0]);\n                float16x4_t _pB1 = vdup_n_f16(pB[1]);\n                float16x8_t _pB01 = vcombine_f16(_pB0, _pB1);\n\n                _sum0 = vfmlalq_low_f16(_sum0, _pAA, _pB01);\n                _sum1 = vfmlalq_high_f16(_sum1, _pAA, _pB01);\n#else\n                float32x4_t _pA = vcvt_f32_f16((float16x4_t)vld1_u16(pA));\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pB[0]));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pB[1]));\n\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA, _pB1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA, _pB0);\n                _sum1 = 
vmlaq_f32(_sum1, _pA, _pB1);\n#endif\n#endif\n\n                pA += 4;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    outptr0 += 8;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[4];\n                    unsigned short sum1[4];\n                    vst1_u16(sum0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(sum1, (uint16x4_t)vcvt_f16_f32(_sum1));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0[1] = sum1[0];\n                    outptr0[out_hstep + 1] = sum1[1];\n                    outptr0[out_hstep * 2 + 1] = sum1[2];\n                    outptr0[out_hstep * 3 + 1] = sum1[3];\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C 
== 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x4_t _pA = vld1_f16(pA);\n                float16x8_t _pAA = vcombine_f16(_pA, _pA);\n\n                float16x8_t _pB = vdupq_n_f16(pB[0]);\n\n                _sum0 = vfmlalq_low_f16(_sum0, _pAA, _pB);\n#else\n                float32x4_t _pA = vcvt_f32_f16((float16x4_t)vld1_u16(pA));\n                float32x4_t _pB = vcvt_f32_f16((float16x4_t)vdup_n_u16(pB[0]));\n\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA, _pB);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA, _pB);\n#endif\n#endif\n\n                pA += 4;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n            }\n\n            if (k_end)\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    outptr0 += 4;\n                }\n                if (out_elempack == 1)\n                {\n                    unsigned short sum0[4];\n                    vst1_u16(sum0, 
(uint16x4_t)vcvt_f16_f32(_sum0));\n\n                    outptr0[0] = sum0[0];\n                    outptr0[out_hstep] = sum0[1];\n                    outptr0[out_hstep * 2] = sum0[2];\n                    outptr0[out_hstep * 3] = sum0[3];\n                    outptr0++;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n            }\n\n            outptr += 4;\n        }\n\n        pAT += max_kk * 4;\n    }\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n#if __ARM_FEATURE_FP16_FML\n        const __fp16* pB = pBT;\n#else\n        const unsigned short* pB = pBT;\n#endif\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum02;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n            float32x4_t _sum12;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum02 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n                _sum12 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum02 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                        
_sum12 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum02 = _sum00;\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum11 = _sum10;\n                        _sum12 = _sum10;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        float32x4x2_t _tmp23 = vld2q_f32(pC + 8);\n                        float32x4x2_t _tmp45 = vld2q_f32(pC + 16);\n                        _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum02 = _tmp45.val[0];\n                        _sum10 = _tmp01.val[1];\n                        _sum11 = _tmp23.val[1];\n                        _sum12 = _tmp45.val[1];\n                        pC += 24;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum02 = vld1q_f32(pC + 8);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        _sum12 = _sum02;\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = vld2q_f32(outptr + 8);\n                float32x4x2_t _tmp45 = vld2q_f32(outptr + 16);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum02 = _tmp45.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n                _sum12 = _tmp45.val[1];\n            }\n\n#if 
__ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pB01 = vld1q_f16(pB);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n                float16x8_t _pB22 = vcombine_f16(_pB2, _pB2);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n                float16x4_t _pA1 = vdup_n_f16(pA[1]);\n                float16x8_t _pA01 = vcombine_f16(_pA0, _pA1);\n                float16x8_t _pA10 = vcombine_f16(_pA1, _pA0);\n\n                _sum00 = vfmlalq_low_f16(_sum00, _pB01, _pA01);\n                _sum01 = vfmlalq_high_f16(_sum01, _pB01, _pA10);\n                _sum02 = vfmlalq_low_f16(_sum02, _pB22, _pA01);\n                _sum10 = vfmlalq_low_f16(_sum10, _pB01, _pA10);\n                _sum11 = vfmlalq_high_f16(_sum11, _pB01, _pA01);\n                _sum12 = vfmlalq_low_f16(_sum12, _pB22, _pA10);\n#else\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pB));\n                float32x4_t _pB2 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 8));\n\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[0]));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[1]));\n#if __aarch64__\n                _sum00 = vfmaq_f32(_sum00, _pB0, _pA0);\n                _sum01 = vfmaq_f32(_sum01, _pB1, _pA0);\n                _sum02 = vfmaq_f32(_sum02, _pB2, _pA0);\n                _sum10 = vfmaq_f32(_sum10, _pB0, _pA1);\n                _sum11 = vfmaq_f32(_sum11, _pB1, _pA1);\n                _sum12 = vfmaq_f32(_sum12, _pB2, _pA1);\n#else\n                _sum00 = vmlaq_f32(_sum00, _pB0, _pA0);\n                _sum01 = vmlaq_f32(_sum01, _pB1, _pA0);\n                
_sum02 = vmlaq_f32(_sum02, _pB2, _pA0);\n                _sum10 = vmlaq_f32(_sum10, _pB0, _pA1);\n                _sum11 = vmlaq_f32(_sum11, _pB1, _pA1);\n                _sum12 = vmlaq_f32(_sum12, _pB2, _pA1);\n#endif\n#endif\n\n                pA += 2;\n                pB += 12;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum02 = vmulq_f32(_sum02, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n                _sum12 = vmulq_f32(_sum12, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + 8, (uint16x4_t)vcvt_f16_f32(_sum02));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    vst1_u16(outptr0 + out_hstep + 8, (uint16x4_t)vcvt_f16_f32(_sum12));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                float32x4x2_t _tmp45;\n                _tmp45.val[0] = _sum02;\n                _tmp45.val[1] = _sum12;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n                vst2q_f32(outptr + 16, _tmp45);\n            }\n\n            outptr += 24;\n        
}\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum00;\n            float32x4_t _sum01;\n            float32x4_t _sum10;\n            float32x4_t _sum11;\n\n            if (k == 0)\n            {\n                _sum00 = vdupq_n_f32(0.f);\n                _sum01 = vdupq_n_f32(0.f);\n                _sum10 = vdupq_n_f32(0.f);\n                _sum11 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = _sum00;\n                        _sum11 = _sum00;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum00 = vdupq_n_f32(pC[0]);\n                        _sum01 = _sum00;\n                        _sum10 = vdupq_n_f32(pC[1]);\n                        _sum11 = _sum10;\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        float32x4x2_t _tmp23 = vld2q_f32(pC + 8);\n                        _sum00 = _tmp01.val[0];\n                        _sum01 = _tmp23.val[0];\n                        _sum10 = _tmp01.val[1];\n                        _sum11 = _tmp23.val[1];\n                        pC += 16;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum00 = vld1q_f32(pC);\n                        _sum01 = vld1q_f32(pC + 4);\n                        _sum10 = _sum00;\n                        _sum11 = _sum01;\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01 = vld2q_f32(outptr);\n                float32x4x2_t _tmp23 = 
vld2q_f32(outptr + 8);\n                _sum00 = _tmp01.val[0];\n                _sum01 = _tmp23.val[0];\n                _sum10 = _tmp01.val[1];\n                _sum11 = _tmp23.val[1];\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pB01 = vld1q_f16(pB);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n                float16x4_t _pA1 = vdup_n_f16(pA[1]);\n                float16x8_t _pA01 = vcombine_f16(_pA0, _pA1);\n                float16x8_t _pA10 = vcombine_f16(_pA1, _pA0);\n\n                _sum00 = vfmlalq_low_f16(_sum00, _pB01, _pA01);\n                _sum01 = vfmlalq_high_f16(_sum01, _pB01, _pA10);\n                _sum10 = vfmlalq_low_f16(_sum10, _pB01, _pA10);\n                _sum11 = vfmlalq_high_f16(_sum11, _pB01, _pA01);\n#else\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pB));\n\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[0]));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[1]));\n#if __aarch64__\n                _sum00 = vfmaq_f32(_sum00, _pB0, _pA0);\n                _sum01 = vfmaq_f32(_sum01, _pB1, _pA0);\n                _sum10 = vfmaq_f32(_sum10, _pB0, _pA1);\n                _sum11 = vfmaq_f32(_sum11, _pB1, _pA1);\n#else\n                _sum00 = vmlaq_f32(_sum00, _pB0, _pA0);\n                _sum01 = vmlaq_f32(_sum01, _pB1, _pA0);\n                _sum10 = vmlaq_f32(_sum10, _pB0, _pA1);\n                _sum11 = vmlaq_f32(_sum11, _pB1, _pA1);\n#endif\n#endif\n\n                pA += 2;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n      
          float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum00 = vmulq_f32(_sum00, _alpha);\n                _sum01 = vmulq_f32(_sum01, _alpha);\n                _sum10 = vmulq_f32(_sum10, _alpha);\n                _sum11 = vmulq_f32(_sum11, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum00));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum01));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum10));\n                    vst1_u16(outptr0 + out_hstep + 4, (uint16x4_t)vcvt_f16_f32(_sum11));\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum00;\n                _tmp01.val[1] = _sum10;\n                float32x4x2_t _tmp23;\n                _tmp23.val[0] = _sum01;\n                _tmp23.val[1] = _sum11;\n                vst2q_f32(outptr, _tmp01);\n                vst2q_f32(outptr + 8, _tmp23);\n            }\n\n            outptr += 16;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = _sum0;\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[1]);\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        
float32x4x2_t _tmp01 = vld2q_f32(pC);\n                        _sum0 = _tmp01.val[0];\n                        _sum1 = _tmp01.val[1];\n                        pC += 8;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = _sum0;\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                float32x4_t _tmp0 = vld1q_f32(outptr);\n                float32x4_t _tmp1 = vld1q_f32(outptr + 4);\n                float32x4x2_t _tmp01 = vuzpq_f32(_tmp0, _tmp1);\n                _sum0 = _tmp01.val[0];\n                _sum1 = _tmp01.val[1];\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x4_t _pB = vld1_f16(pB);\n                float16x8_t _pBB = vcombine_f16(_pB, _pB);\n\n                float16x4_t _pA0 = vdup_n_f16(pA[0]);\n                float16x4_t _pA1 = vdup_n_f16(pA[1]);\n                float16x8_t _pA01 = vcombine_f16(_pA0, _pA1);\n\n                _sum0 = vfmlalq_low_f16(_sum0, _pBB, _pA01);\n                _sum1 = vfmlalq_high_f16(_sum1, _pBB, _pA01);\n#else\n                float32x4_t _pB = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[0]));\n                float32x4_t _pA1 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[1]));\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pB, _pA0);\n                _sum1 = vfmaq_f32(_sum1, _pB, _pA1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pB, _pA0);\n                _sum1 = vmlaq_f32(_sum1, _pB, _pA1);\n#endif\n#endif\n\n                pA += 2;\n                pB += 4;\n            }\n\n            if (alpha != 
1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                float32x4x2_t _tmp01;\n                _tmp01.val[0] = _sum0;\n                _tmp01.val[1] = _sum1;\n                vst2q_f32(outptr, _tmp01);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum00;\n            float sum01;\n            float sum10;\n            float sum11;\n\n            if (k == 0)\n            {\n                sum00 = 0.f;\n                sum01 = 0.f;\n                sum10 = 0.f;\n                sum11 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[0];\n                        sum10 = pC[0];\n                        sum11 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[0];\n                        sum11 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[1];\n                        sum10 = pC[2];\n                        sum11 = pC[3];\n                        pC += 4;\n                    }\n                    if 
(broadcast_type_C == 4)\n                    {\n                        sum00 = pC[0];\n                        sum01 = pC[0];\n                        sum10 = pC[1];\n                        sum11 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum00 = outptr[0];\n                sum01 = outptr[1];\n                sum10 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                __fp16 pA0 = pA[0];\n                __fp16 pA1 = pA[1];\n                __fp16 pB0 = pB[0];\n                __fp16 pB1 = pB[1];\n#else\n                float pA0 = float16_to_float32(pA[0]);\n                float pA1 = float16_to_float32(pA[1]);\n                float pB0 = float16_to_float32(pB[0]);\n                float pB1 = float16_to_float32(pB[1]);\n#endif\n\n                sum00 += pA0 * pB0;\n                sum01 += pA1 * pB0;\n                sum10 += pA0 * pB1;\n                sum11 += pA1 * pB1;\n\n                pA += 2;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum00 *= alpha;\n                sum01 *= alpha;\n                sum10 *= alpha;\n                sum11 *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_float16(sum00);\n                    outptr0[1] = float32_to_float16(sum10);\n                    outptr0[out_hstep] = float32_to_float16(sum01);\n                    outptr0[out_hstep + 1] = float32_to_float16(sum11);\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n  
              outptr[0] = sum00;\n                outptr[1] = sum01;\n                outptr[2] = sum10;\n                outptr[3] = sum11;\n            }\n\n            outptr += 4;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                    }\n                    if (broadcast_type_C == 3)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                    if (broadcast_type_C == 4)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                __fp16 pA0 = pA[0];\n                __fp16 pA1 = pA[1];\n                __fp16 pB0 = pB[0];\n#else\n                float pA0 = float16_to_float32(pA[0]);\n                float pA1 = float16_to_float32(pA[1]);\n                float pB0 = float16_to_float32(pB[0]);\n#endif\n\n                sum0 += pA0 * pB0;\n                sum1 += pA1 * pB0;\n                pA 
+= 2;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum0 *= alpha;\n                sum1 *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_float16(sum0);\n                    outptr0[out_hstep] = float32_to_float16(sum1);\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        unsigned short* outptr0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n#if __ARM_FEATURE_FP16_FML\n        const __fp16* pB = pBT;\n#else\n        const unsigned short* pB = pBT;\n#endif\n\n        if (pC)\n        {\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)CT_tile + i + ii;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)CT_tile + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 11 < max_jj; jj += 12)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n            float32x4_t _sum2;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n                _sum2 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                        _sum2 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || 
broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        _sum2 = vld1q_f32(pC + 8);\n                        pC += 12;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n                _sum2 = vld1q_f32(outptr + 8);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pB01 = vld1q_f16(pB);\n                float16x4_t _pB2 = vld1_f16(pB + 8);\n                float16x8_t _pB22 = vcombine_f16(_pB2, _pB2);\n\n                float16x8_t _pA = vdupq_n_f16(pA[0]);\n\n                _sum0 = vfmlalq_low_f16(_sum0, _pA, _pB01);\n                _sum1 = vfmlalq_high_f16(_sum1, _pA, _pB01);\n                _sum2 = vfmlalq_low_f16(_sum2, _pA, _pB22);\n#else\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pB));\n                float32x4_t _pB2 = vcvt_f32_f16((float16x4_t)vld1_u16(pB + 8));\n\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[0]));\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n                _sum2 = vfmaq_f32(_sum2, _pA0, _pB2);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n                _sum2 = vmlaq_f32(_sum2, _pA0, _pB2);\n#endif\n#endif\n\n                pA += 1;\n                pB += 12;\n            }\n\n            if (alpha != 1.f)\n            
{\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n                _sum2 = vmulq_f32(_sum2, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    vst1_u16(outptr0 + 8, (uint16x4_t)vcvt_f16_f32(_sum2));\n                    outptr0 += 12;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 8, _sum2);\n            }\n\n            outptr += 12;\n        }\n#endif // __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            float32x4_t _sum0;\n            float32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_f32(0.f);\n                _sum1 = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        _sum0 = vdupq_n_f32(pC[0]);\n                        _sum1 = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum0 = vld1q_f32(pC);\n                        _sum1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                }\n            }\n            else\n            {\n                _sum0 = vld1q_f32(outptr);\n                _sum1 = vld1q_f32(outptr + 4);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n   
         for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x8_t _pB01 = vld1q_f16(pB);\n                float16x8_t _pA = vdupq_n_f16(pA[0]);\n\n                _sum0 = vfmlalq_low_f16(_sum0, _pA, _pB01);\n                _sum1 = vfmlalq_high_f16(_sum1, _pA, _pB01);\n#else\n                uint16x8_t _pB = vld1q_u16(pB);\n                float32x4_t _pB0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_pB));\n                float32x4_t _pB1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_pB));\n\n                float32x4_t _pA0 = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[0]));\n#if __aarch64__\n                _sum0 = vfmaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vfmaq_f32(_sum1, _pA0, _pB1);\n#else\n                _sum0 = vmlaq_f32(_sum0, _pA0, _pB0);\n                _sum1 = vmlaq_f32(_sum1, _pA0, _pB1);\n#endif\n#endif\n\n                pA += 1;\n                pB += 8;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum0 = vmulq_f32(_sum0, _alpha);\n                _sum1 = vmulq_f32(_sum1, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum0));\n                    vst1_u16(outptr0 + 4, (uint16x4_t)vcvt_f16_f32(_sum1));\n                    outptr0 += 8;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n            }\n\n            outptr += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdupq_n_f32(0.f);\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                  
  {\n                        _sum = vdupq_n_f32(pC[0]);\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        _sum = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                }\n            }\n            else\n            {\n                _sum = vld1q_f32(outptr);\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                float16x4_t _pB = vld1_f16(pB);\n                float16x8_t _pBB = vcombine_f16(_pB, _pB);\n                float16x8_t _pA = vdupq_n_f16(pA[0]);\n\n                _sum = vfmlalq_low_f16(_sum, _pA, _pBB);\n#else\n                float32x4_t _pB = vcvt_f32_f16((float16x4_t)vld1_u16(pB));\n                float32x4_t _pA = vcvt_f32_f16((float16x4_t)vdup_n_u16(pA[0]));\n#if __aarch64__\n                _sum = vfmaq_f32(_sum, _pA, _pB);\n#else\n                _sum = vmlaq_f32(_sum, _pA, _pB);\n#endif\n#endif\n\n                pA += 1;\n                pB += 4;\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _sum = vmulq_f32(_sum, _alpha);\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    vst1_u16(outptr0, (uint16x4_t)vcvt_f16_f32(_sum));\n                    outptr0 += 4;\n                }\n            }\n            else\n            {\n                vst1q_f32(outptr, _sum);\n            }\n\n            outptr += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float sum0;\n            float sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0.f;\n                sum1 = 0.f;\n\n           
     if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum0 = pC[0];\n                        sum1 = pC[1];\n                        pC += 2;\n                    }\n                }\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                __fp16 pA0 = pA[0];\n                __fp16 pB0 = pB[0];\n                __fp16 pB1 = pB[1];\n#else\n                float pA0 = float16_to_float32(pA[0]);\n                float pB0 = float16_to_float32(pB[0]);\n                float pB1 = float16_to_float32(pB[1]);\n#endif\n\n                sum0 += pA0 * pB0;\n                sum1 += pA0 * pB1;\n\n                pA += 1;\n                pB += 2;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum0 *= alpha;\n                sum1 *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_float16(sum0);\n                    outptr0[1] = float32_to_float16(sum1);\n                    outptr0 += 2;\n                }\n            }\n            else\n            {\n                outptr[0] = sum0;\n                outptr[1] = sum1;\n            }\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float sum;\n\n            if (k == 0)\n            
{\n                sum = 0.f;\n\n                if (pC)\n                {\n                    if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                    {\n                        sum = pC[0];\n                    }\n                    if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                    {\n                        sum = pC[0];\n                        pC += 1;\n                    }\n                }\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n#if __ARM_FEATURE_FP16_FML\n            const __fp16* pA = pAT;\n#else\n            const unsigned short* pA = pAT;\n#endif\n            int kk = 0;\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_FP16_FML\n                __fp16 pA0 = pA[0];\n                __fp16 pB0 = pB[0];\n#else\n                float pA0 = float16_to_float32(pA[0]);\n                float pB0 = float16_to_float32(pB[0]);\n#endif\n\n                sum += pA0 * pB0;\n                pA += 1;\n                pB += 1;\n            }\n\n            if (alpha != 1.f)\n            {\n                sum *= alpha;\n            }\n\n            if (k_end)\n            {\n                // if (out_elempack == 1)\n                {\n                    outptr0[0] = float32_to_float16(sum);\n                    outptr0++;\n                }\n            }\n            else\n            {\n                outptr[0] = sum;\n            }\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/gemm_int8.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\nvoid pack_A_tile_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk);\nvoid transpose_pack_A_tile_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk);\nvoid pack_B_tile_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk);\nvoid transpose_pack_B_tile_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk);\nvoid pack_A_tile_fp32_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid transpose_pack_A_tile_fp32_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid pack_B_tile_fp32_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid transpose_pack_B_tile_fp32_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid gemm_transB_packed_tile_int8_i8mm(const Mat& AT_tile, const Mat& BT_tile, Mat& topT_tile, int i, int max_ii, int j, int max_jj, int k, int max_kk);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\nvoid pack_A_tile_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk);\nvoid transpose_pack_A_tile_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk);\nvoid pack_B_tile_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk);\nvoid transpose_pack_B_tile_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk);\nvoid pack_A_tile_fp32_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid transpose_pack_A_tile_fp32_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid pack_B_tile_fp32_to_int8_asimddp(const Mat& B, 
Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid transpose_pack_B_tile_fp32_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid unpack_output_tile_int32_to_fp32_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\nvoid transpose_unpack_output_tile_int32_to_fp32_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\nvoid gemm_transB_packed_tile_int8_asimddp(const Mat& AT_tile, const Mat& BT_tile, Mat& topT_tile, int i, int max_ii, int j, int max_jj, int k, int max_kk);\n#endif\n\nstatic void pack_A_tile_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_A_tile_int8_i8mm(A, AT, i, max_ii, k, max_kk);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_A_tile_int8_asimddp(A, AT, i, max_ii, k, max_kk);\n        return;\n    }\n#endif\n\n    // NCNN_LOGE(\"pack_A_tile_int8\");\n    // assert A.elempack == 1\n    // assert A.dims == 2\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const signed char* p0 = A.row<const signed char>(i + ii) + k;\n        const signed char* p1 = A.row<const signed char>(i + ii + 1) + k;\n        const signed char* p2 = A.row<const signed char>(i + ii + 2) + k;\n        const signed char* p3 = A.row<const signed char>(i + ii + 3) + k;\n        const signed char* p4 = A.row<const signed char>(i + ii + 4) + k;\n        const signed char* p5 = A.row<const signed char>(i + ii + 5) + k;\n      
  const signed char* p6 = A.row<const signed char>(i + ii + 6) + k;\n        const signed char* p7 = A.row<const signed char>(i + ii + 7) + k;\n\n        int kk = 0;\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _p0 = vld1q_s8(p0);\n            int8x16_t _p1 = vld1q_s8(p1);\n            int8x16_t _p2 = vld1q_s8(p2);\n            int8x16_t _p3 = vld1q_s8(p3);\n            int8x16_t _p4 = vld1q_s8(p4);\n            int8x16_t _p5 = vld1q_s8(p5);\n            int8x16_t _p6 = vld1q_s8(p6);\n            int8x16_t _p7 = vld1q_s8(p7);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int8x16_t _r0 = vcombine_s8(vget_low_s8(_p0), vget_low_s8(_p1));\n            int8x16_t _r1 = vcombine_s8(vget_low_s8(_p2), vget_low_s8(_p3));\n            int8x16_t _r2 = vcombine_s8(vget_low_s8(_p4), vget_low_s8(_p5));\n            int8x16_t _r3 = vcombine_s8(vget_low_s8(_p6), vget_low_s8(_p7));\n            int8x16_t _r4 = vcombine_s8(vget_high_s8(_p0), vget_high_s8(_p1));\n            int8x16_t _r5 = vcombine_s8(vget_high_s8(_p2), vget_high_s8(_p3));\n            int8x16_t _r6 = vcombine_s8(vget_high_s8(_p4), vget_high_s8(_p5));\n            int8x16_t _r7 = vcombine_s8(vget_high_s8(_p6), vget_high_s8(_p7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x4x2_t _p01 = vzipq_s32(vreinterpretq_s32_s8(_p0), vreinterpretq_s32_s8(_p1));\n            int32x4x2_t _p23 = vzipq_s32(vreinterpretq_s32_s8(_p2), vreinterpretq_s32_s8(_p3));\n            int32x4x2_t _p45 = vzipq_s32(vreinterpretq_s32_s8(_p4), vreinterpretq_s32_s8(_p5));\n            int32x4x2_t _p67 = vzipq_s32(vreinterpretq_s32_s8(_p6), vreinterpretq_s32_s8(_p7));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p01.val[0]), vget_low_s32(_p23.val[0])));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p45.val[0]), vget_low_s32(_p67.val[0])));\n            int8x16_t _r2 = 
vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p01.val[0]), vget_high_s32(_p23.val[0])));\n            int8x16_t _r3 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p45.val[0]), vget_high_s32(_p67.val[0])));\n            int8x16_t _r4 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p01.val[1]), vget_low_s32(_p23.val[1])));\n            int8x16_t _r5 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p45.val[1]), vget_low_s32(_p67.val[1])));\n            int8x16_t _r6 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p01.val[1]), vget_high_s32(_p23.val[1])));\n            int8x16_t _r7 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p45.val[1]), vget_high_s32(_p67.val[1])));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8x2_t _p01 = vzipq_s16(vreinterpretq_s16_s8(_p0), vreinterpretq_s16_s8(_p1));\n            int16x8x2_t _p23 = vzipq_s16(vreinterpretq_s16_s8(_p2), vreinterpretq_s16_s8(_p3));\n            int16x8x2_t _p45 = vzipq_s16(vreinterpretq_s16_s8(_p4), vreinterpretq_s16_s8(_p5));\n            int16x8x2_t _p67 = vzipq_s16(vreinterpretq_s16_s8(_p6), vreinterpretq_s16_s8(_p7));\n            int32x4x2_t _t0 = vzipq_s32(vreinterpretq_s32_s16(_p01.val[0]), vreinterpretq_s32_s16(_p23.val[0]));\n            int32x4x2_t _t1 = vzipq_s32(vreinterpretq_s32_s16(_p01.val[1]), vreinterpretq_s32_s16(_p23.val[1]));\n            int32x4x2_t _t2 = vzipq_s32(vreinterpretq_s32_s16(_p45.val[0]), vreinterpretq_s32_s16(_p67.val[0]));\n            int32x4x2_t _t3 = vzipq_s32(vreinterpretq_s32_s16(_p45.val[1]), vreinterpretq_s32_s16(_p67.val[1]));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t2.val[0])));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t2.val[0])));\n            int8x16_t _r2 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t0.val[1]), vget_low_s32(_t2.val[1])));\n            int8x16_t _r3 = 
vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t0.val[1]), vget_high_s32(_t2.val[1])));\n            int8x16_t _r4 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t3.val[0])));\n            int8x16_t _r5 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t3.val[0])));\n            int8x16_t _r6 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t3.val[1])));\n            int8x16_t _r7 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t3.val[1])));\n#endif // __ARM_FEATURE_DOTPROD\n            vst1q_s8(pp, _r0);\n            vst1q_s8(pp + 16, _r1);\n            vst1q_s8(pp + 32, _r2);\n            vst1q_s8(pp + 48, _r3);\n            vst1q_s8(pp + 64, _r4);\n            vst1q_s8(pp + 80, _r5);\n            vst1q_s8(pp + 96, _r6);\n            vst1q_s8(pp + 112, _r7);\n            pp += 128;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n            p4 += 16;\n            p5 += 16;\n            p6 += 16;\n            p7 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _p0 = vld1_s8(p0);\n            int8x8_t _p1 = vld1_s8(p1);\n            int8x8_t _p2 = vld1_s8(p2);\n            int8x8_t _p3 = vld1_s8(p3);\n            int8x8_t _p4 = vld1_s8(p4);\n            int8x8_t _p5 = vld1_s8(p5);\n            int8x8_t _p6 = vld1_s8(p6);\n            int8x8_t _p7 = vld1_s8(p7);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int8x16_t _r0 = vcombine_s8(_p0, _p1);\n            int8x16_t _r1 = vcombine_s8(_p2, _p3);\n            int8x16_t _r2 = vcombine_s8(_p4, _p5);\n            int8x16_t _r3 = vcombine_s8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x2x2_t _p01 = vzip_s32(vreinterpret_s32_s8(_p0), vreinterpret_s32_s8(_p1));\n            int32x2x2_t _p23 = vzip_s32(vreinterpret_s32_s8(_p2), vreinterpret_s32_s8(_p3));\n            
int32x2x2_t _p45 = vzip_s32(vreinterpret_s32_s8(_p4), vreinterpret_s32_s8(_p5));\n            int32x2x2_t _p67 = vzip_s32(vreinterpret_s32_s8(_p6), vreinterpret_s32_s8(_p7));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(_p01.val[0], _p23.val[0]));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(_p45.val[0], _p67.val[0]));\n            int8x16_t _r2 = vreinterpretq_s8_s32(vcombine_s32(_p01.val[1], _p23.val[1]));\n            int8x16_t _r3 = vreinterpretq_s8_s32(vcombine_s32(_p45.val[1], _p67.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8_t _p04 = vreinterpretq_s16_s8(vcombine_s8(_p0, _p4));\n            int16x8_t _p15 = vreinterpretq_s16_s8(vcombine_s8(_p1, _p5));\n            int16x8_t _p26 = vreinterpretq_s16_s8(vcombine_s8(_p2, _p6));\n            int16x8_t _p37 = vreinterpretq_s16_s8(vcombine_s8(_p3, _p7));\n            int16x8x2_t _t0 = vzipq_s16(_p04, _p15);\n            int16x8x2_t _t1 = vzipq_s16(_p26, _p37);\n            int32x4x2_t _t2 = vzipq_s32(vreinterpretq_s32_s16(_t0.val[0]), vreinterpretq_s32_s16(_t1.val[0]));\n            int32x4x2_t _t3 = vzipq_s32(vreinterpretq_s32_s16(_t0.val[1]), vreinterpretq_s32_s16(_t1.val[1]));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0])));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0])));\n            int8x16_t _r2 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t2.val[1]), vget_low_s32(_t3.val[1])));\n            int8x16_t _r3 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t2.val[1]), vget_high_s32(_t3.val[1])));\n#endif // __ARM_FEATURE_DOTPROD\n            vst1q_s8(pp, _r0);\n            vst1q_s8(pp + 16, _r1);\n            vst1q_s8(pp + 32, _r2);\n            vst1q_s8(pp + 48, _r3);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 
8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n#if __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n            pp[8] = p2[0];\n            pp[9] = p2[1];\n            pp[10] = p2[2];\n            pp[11] = p2[3];\n            pp[12] = p3[0];\n            pp[13] = p3[1];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n            pp[16] = p4[0];\n            pp[17] = p4[1];\n            pp[18] = p4[2];\n            pp[19] = p4[3];\n            pp[20] = p5[0];\n            pp[21] = p5[1];\n            pp[22] = p5[2];\n            pp[23] = p5[3];\n            pp[24] = p6[0];\n            pp[25] = p6[1];\n            pp[26] = p6[2];\n            pp[27] = p6[3];\n            pp[28] = p7[0];\n            pp[29] = p7[1];\n            pp[30] = p7[2];\n            pp[31] = p7[3];\n#else  // __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp[8] = p4[0];\n            pp[9] = p4[1];\n            pp[10] = p5[0];\n            pp[11] = p5[1];\n            pp[12] = p6[0];\n            pp[13] = p6[1];\n            pp[14] = p7[0];\n            pp[15] = p7[1];\n            pp[16] = p0[2];\n            pp[17] = p0[3];\n            pp[18] = p1[2];\n            pp[19] = p1[3];\n            pp[20] = p2[2];\n            pp[21] = p2[3];\n            pp[22] = p3[2];\n            pp[23] = p3[3];\n            pp[24] = p4[2];\n            pp[25] = p4[3];\n            pp[26] = p5[2];\n            pp[27] = p5[3];\n            pp[28] = p6[2];\n            pp[29] = p6[3];\n      
      pp[30] = p7[2];\n            pp[31] = p7[3];\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 32;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n            p4 += 4;\n            p5 += 4;\n            p6 += 4;\n            p7 += 4;\n        }\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp[8] = p4[0];\n            pp[9] = p4[1];\n            pp[10] = p5[0];\n            pp[11] = p5[1];\n            pp[12] = p6[0];\n            pp[13] = p6[1];\n            pp[14] = p7[0];\n            pp[15] = p7[1];\n            pp += 16;\n            p0 += 2;\n            p1 += 2;\n            p2 += 2;\n            p3 += 2;\n            p4 += 2;\n            p5 += 2;\n            p6 += 2;\n            p7 += 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp[4] = p4[0];\n            pp[5] = p5[0];\n            pp[6] = p6[0];\n            pp[7] = p7[0];\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const signed char* p0 = A.row<const signed char>(i + ii) + k;\n        const signed char* p1 = A.row<const signed char>(i + ii + 1) + k;\n        const signed char* p2 = A.row<const signed char>(i + ii + 2) + k;\n        const signed char* p3 = A.row<const signed char>(i + ii + 3) + k;\n\n        int kk = 0;\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _p0 = vld1q_s8(p0);\n            int8x16_t _p1 = vld1q_s8(p1);\n            
int8x16_t _p2 = vld1q_s8(p2);\n            int8x16_t _p3 = vld1q_s8(p3);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int64x2x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s64_s8(_p0);\n            _r0123.val[1] = vreinterpretq_s64_s8(_p1);\n            _r0123.val[2] = vreinterpretq_s64_s8(_p2);\n            _r0123.val[3] = vreinterpretq_s64_s8(_p3);\n            vst4q_s64((int64_t*)pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x4x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s32_s8(_p0);\n            _r0123.val[1] = vreinterpretq_s32_s8(_p1);\n            _r0123.val[2] = vreinterpretq_s32_s8(_p2);\n            _r0123.val[3] = vreinterpretq_s32_s8(_p3);\n            vst4q_s32((int*)pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s16_s8(_p0);\n            _r0123.val[1] = vreinterpretq_s16_s8(_p1);\n            _r0123.val[2] = vreinterpretq_s16_s8(_p2);\n            _r0123.val[3] = vreinterpretq_s16_s8(_p3);\n            vst4q_s16((short*)pp, _r0123);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 64;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _p0 = vld1_s8(p0);\n            int8x8_t _p1 = vld1_s8(p1);\n            int8x8_t _p2 = vld1_s8(p2);\n            int8x8_t _p3 = vld1_s8(p3);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            vst1q_s8(pp, vcombine_s8(_p0, _p1));\n            vst1q_s8(pp + 16, vcombine_s8(_p2, _p3));\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x2x4_t _r0123;\n            _r0123.val[0] = vreinterpret_s32_s8(_p0);\n            _r0123.val[1] = vreinterpret_s32_s8(_p1);\n            _r0123.val[2] = vreinterpret_s32_s8(_p2);\n            _r0123.val[3] = vreinterpret_s32_s8(_p3);\n            
vst4_s32((int*)pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x4x4_t _r0123;\n            _r0123.val[0] = vreinterpret_s16_s8(_p0);\n            _r0123.val[1] = vreinterpret_s16_s8(_p1);\n            _r0123.val[2] = vreinterpret_s16_s8(_p2);\n            _r0123.val[3] = vreinterpret_s16_s8(_p3);\n            vst4_s16((short*)pp, _r0123);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n#if __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n            pp[8] = p2[0];\n            pp[9] = p2[1];\n            pp[10] = p2[2];\n            pp[11] = p2[3];\n            pp[12] = p3[0];\n            pp[13] = p3[1];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n#else  // __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp[8] = p0[2];\n            pp[9] = p0[3];\n            pp[10] = p1[2];\n            pp[11] = p1[3];\n            pp[12] = p2[2];\n            pp[13] = p2[3];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = 
p3[1];\n            pp += 8;\n            p0 += 2;\n            p1 += 2;\n            p2 += 2;\n            p3 += 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const signed char* p0 = A.row<const signed char>(i + ii) + k;\n        const signed char* p1 = A.row<const signed char>(i + ii + 1) + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _p0 = vld1q_s8(p0);\n            int8x16_t _p1 = vld1q_s8(p1);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int64x2x2_t _r01;\n            _r01.val[0] = vreinterpretq_s64_s8(_p0);\n            _r01.val[1] = vreinterpretq_s64_s8(_p1);\n            vst2q_s64((int64_t*)pp, _r01);\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x4x2_t _r01;\n            _r01.val[0] = vreinterpretq_s32_s8(_p0);\n            _r01.val[1] = vreinterpretq_s32_s8(_p1);\n            vst2q_s32((int*)pp, _r01);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8x2_t _r01;\n            _r01.val[0] = vreinterpretq_s16_s8(_p0);\n            _r01.val[1] = vreinterpretq_s16_s8(_p1);\n            vst2q_s16((short*)pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 32;\n            p0 += 16;\n            p1 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _p0 = vld1_s8(p0);\n            int8x8_t _p1 = vld1_s8(p1);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            vst1q_s8(pp, vcombine_s8(_p0, _p1));\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x2x2_t _r01;\n            _r01.val[0] = vreinterpret_s32_s8(_p0);\n            
_r01.val[1] = vreinterpret_s32_s8(_p1);\n            vst2_s32((int*)pp, _r01);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x4x2_t _r01;\n            _r01.val[0] = vreinterpret_s16_s8(_p0);\n            _r01.val[1] = vreinterpret_s16_s8(_p1);\n            vst2_s16((short*)pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n#if __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n#else  // __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p0[2];\n            pp[5] = p0[3];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp += 4;\n            p0 += 2;\n            p1 += 2;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const signed char* p0 = A.row<const signed char>(i + ii) + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            vst1q_s8(pp, vld1q_s8(p0));\n            pp += 16;\n            p0 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            vst1_s8(pp, vld1_s8(p0));\n            pp += 8;\n            p0 += 8;\n      
  }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        transpose_pack_A_tile_int8_i8mm(A, AT, i, max_ii, k, max_kk);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_A_tile_int8_asimddp(A, AT, i, max_ii, k, max_kk);\n        return;\n    }\n#endif\n\n    // NCNN_LOGE(\"transpose_pack_A_tile_int8\");\n    // assert A.elempack == 1\n    // assert A.dims == 2\n\n    const int A_hstep = A.w;\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const signed char* p0 = A.row<const signed char>(k) + (i + ii);\n\n        int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _r0 = vld1_s8(p0);\n            int8x8_t _r1 = vld1_s8(p0 + A_hstep);\n            int8x8_t _r2 = vld1_s8(p0 + A_hstep * 2);\n            int8x8_t _r3 = vld1_s8(p0 + A_hstep * 3);\n            int8x8_t _r4 = vld1_s8(p0 + A_hstep * 4);\n            int8x8_t _r5 = vld1_s8(p0 + A_hstep * 5);\n            int8x8_t _r6 = vld1_s8(p0 + A_hstep * 6);\n            int8x8_t _r7 = vld1_s8(p0 + A_hstep * 7);\n            // transpose8x8\n            int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n            int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n            int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n            int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n            int8x8x4_t _r0123;\n            _r0123.val[0] = _r04.val[0];\n            _r0123.val[1] = _r15.val[0];\n    
        _r0123.val[2] = _r26.val[0];\n            _r0123.val[3] = _r37.val[0];\n            int8x8x4_t _r4567;\n            _r4567.val[0] = _r04.val[1];\n            _r4567.val[1] = _r15.val[1];\n            _r4567.val[2] = _r26.val[1];\n            _r4567.val[3] = _r37.val[1];\n            vst4_s8(pp, _r0123);\n            vst4_s8(pp + 32, _r4567);\n            pp += 64;\n            p0 += A_hstep * 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            int8x8x4_t _r0123;\n            _r0123.val[0] = vld1_s8(p0);\n            _r0123.val[1] = vld1_s8(p0 + A_hstep);\n            _r0123.val[2] = vld1_s8(p0 + A_hstep * 2);\n            _r0123.val[3] = vld1_s8(p0 + A_hstep * 3);\n            vst4_s8(pp, _r0123);\n            pp += 32;\n            p0 += A_hstep * 4;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            int8x8x2_t _r01;\n            _r01.val[0] = vld1_s8(p0);\n            _r01.val[1] = vld1_s8(p0 + A_hstep);\n            vst2_s8(pp, _r01);\n            pp += 16;\n            p0 += A_hstep * 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            vst1_s8(pp, vld1_s8(p0));\n            pp += 8;\n            p0 += A_hstep;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const signed char* p0 = A.row<const signed char>(k) + (i + ii);\n\n        int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[A_hstep];\n            pp[2] = p0[A_hstep * 2];\n            pp[3] = p0[A_hstep * 3];\n            pp[4] = p0[A_hstep * 4];\n            pp[5] = p0[A_hstep * 5];\n            pp[6] = p0[A_hstep * 6];\n            pp[7] = p0[A_hstep * 7];\n            pp[8] = p0[1];\n            pp[9] = p0[A_hstep + 1];\n            pp[10] = p0[A_hstep * 2 + 1];\n            pp[11] = p0[A_hstep * 3 + 1];\n  
          pp[12] = p0[A_hstep * 4 + 1];\n            pp[13] = p0[A_hstep * 5 + 1];\n            pp[14] = p0[A_hstep * 6 + 1];\n            pp[15] = p0[A_hstep * 7 + 1];\n            pp[16] = p0[2];\n            pp[17] = p0[A_hstep + 2];\n            pp[18] = p0[A_hstep * 2 + 2];\n            pp[19] = p0[A_hstep * 3 + 2];\n            pp[20] = p0[A_hstep * 4 + 2];\n            pp[21] = p0[A_hstep * 5 + 2];\n            pp[22] = p0[A_hstep * 6 + 2];\n            pp[23] = p0[A_hstep * 7 + 2];\n            pp[24] = p0[3];\n            pp[25] = p0[A_hstep + 3];\n            pp[26] = p0[A_hstep * 2 + 3];\n            pp[27] = p0[A_hstep * 3 + 3];\n            pp[28] = p0[A_hstep * 4 + 3];\n            pp[29] = p0[A_hstep * 5 + 3];\n            pp[30] = p0[A_hstep * 6 + 3];\n            pp[31] = p0[A_hstep * 7 + 3];\n            pp += 32;\n            p0 += A_hstep * 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[A_hstep];\n            pp[2] = p0[A_hstep * 2];\n            pp[3] = p0[A_hstep * 3];\n            pp[4] = p0[1];\n            pp[5] = p0[A_hstep + 1];\n            pp[6] = p0[A_hstep * 2 + 1];\n            pp[7] = p0[A_hstep * 3 + 1];\n            pp[8] = p0[2];\n            pp[9] = p0[A_hstep + 2];\n            pp[10] = p0[A_hstep * 2 + 2];\n            pp[11] = p0[A_hstep * 3 + 2];\n            pp[12] = p0[3];\n            pp[13] = p0[A_hstep + 3];\n            pp[14] = p0[A_hstep * 2 + 3];\n            pp[15] = p0[A_hstep * 3 + 3];\n            pp += 16;\n            p0 += A_hstep * 4;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[A_hstep];\n            pp[2] = p0[1];\n            pp[3] = p0[A_hstep + 1];\n            pp[4] = p0[2];\n            pp[5] = p0[A_hstep + 2];\n            pp[6] = p0[3];\n            pp[7] = p0[A_hstep + 3];\n            pp 
+= 8;\n            p0 += A_hstep * 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp += 4;\n            p0 += A_hstep;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const signed char* p0 = A.row<const signed char>(k) + (i + ii);\n\n        int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[A_hstep];\n            pp[2] = p0[A_hstep * 2];\n            pp[3] = p0[A_hstep * 3];\n            pp[4] = p0[A_hstep * 4];\n            pp[5] = p0[A_hstep * 5];\n            pp[6] = p0[A_hstep * 6];\n            pp[7] = p0[A_hstep * 7];\n            pp[8] = p0[1];\n            pp[9] = p0[A_hstep + 1];\n            pp[10] = p0[A_hstep * 2 + 1];\n            pp[11] = p0[A_hstep * 3 + 1];\n            pp[12] = p0[A_hstep * 4 + 1];\n            pp[13] = p0[A_hstep * 5 + 1];\n            pp[14] = p0[A_hstep * 6 + 1];\n            pp[15] = p0[A_hstep * 7 + 1];\n            pp += 16;\n            p0 += A_hstep * 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[A_hstep];\n            pp[2] = p0[A_hstep * 2];\n            pp[3] = p0[A_hstep * 3];\n            pp[4] = p0[1];\n            pp[5] = p0[A_hstep + 1];\n            pp[6] = p0[A_hstep * 2 + 1];\n            pp[7] = p0[A_hstep * 3 + 1];\n            pp += 8;\n            p0 += A_hstep * 4;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[A_hstep];\n            pp[2] = p0[1];\n            pp[3] = p0[A_hstep + 1];\n            pp += 4;\n            p0 += A_hstep * 2;\n        }\n#endif // __ARM_NEON\n        
for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp += 2;\n            p0 += A_hstep;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const signed char* p0 = A.row<const signed char>(k) + (i + ii);\n\n        int kk = 0;\n        // for (; kk + 1 < max_kk; kk += 2)\n        // {\n        //     pp[0] = p0[0];\n        //     pp[1] = p0[A_hstep];\n        //     pp += 2;\n        //     p0 += A_hstep * 2;\n        // }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp += 1;\n            p0 += A_hstep;\n        }\n    }\n}\n\nstatic void pack_B_tile_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_B_tile_int8_i8mm(B, BT, j, max_jj, k, max_kk);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_B_tile_int8_asimddp(B, BT, j, max_jj, k, max_kk);\n        return;\n    }\n#endif\n\n    // NCNN_LOGE(\"pack_B_tile_int8\");\n    // assert B.elempack == 1\n    // assert B.dims == 2\n\n    signed char* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const signed char* p0 = B.row<const signed char>(j + jj) + k;\n        const signed char* p1 = B.row<const signed char>(j + jj + 1) + k;\n        const signed char* p2 = B.row<const signed char>(j + jj + 2) + k;\n        const signed char* p3 = B.row<const signed char>(j + jj + 3) + k;\n        const signed char* p4 = B.row<const signed char>(j + jj + 4) + k;\n        const signed char* p5 = B.row<const signed char>(j + jj + 5) + k;\n        const signed char* p6 = B.row<const signed char>(j + jj + 6) + k;\n        const signed char* p7 
= B.row<const signed char>(j + jj + 7) + k;\n\n        int kk = 0;\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _p0 = vld1q_s8(p0);\n            int8x16_t _p1 = vld1q_s8(p1);\n            int8x16_t _p2 = vld1q_s8(p2);\n            int8x16_t _p3 = vld1q_s8(p3);\n            int8x16_t _p4 = vld1q_s8(p4);\n            int8x16_t _p5 = vld1q_s8(p5);\n            int8x16_t _p6 = vld1q_s8(p6);\n            int8x16_t _p7 = vld1q_s8(p7);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int8x16_t _r0 = vcombine_s8(vget_low_s8(_p0), vget_low_s8(_p1));\n            int8x16_t _r1 = vcombine_s8(vget_low_s8(_p2), vget_low_s8(_p3));\n            int8x16_t _r2 = vcombine_s8(vget_low_s8(_p4), vget_low_s8(_p5));\n            int8x16_t _r3 = vcombine_s8(vget_low_s8(_p6), vget_low_s8(_p7));\n            int8x16_t _r4 = vcombine_s8(vget_high_s8(_p0), vget_high_s8(_p1));\n            int8x16_t _r5 = vcombine_s8(vget_high_s8(_p2), vget_high_s8(_p3));\n            int8x16_t _r6 = vcombine_s8(vget_high_s8(_p4), vget_high_s8(_p5));\n            int8x16_t _r7 = vcombine_s8(vget_high_s8(_p6), vget_high_s8(_p7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x4x2_t _p01 = vzipq_s32(vreinterpretq_s32_s8(_p0), vreinterpretq_s32_s8(_p1));\n            int32x4x2_t _p23 = vzipq_s32(vreinterpretq_s32_s8(_p2), vreinterpretq_s32_s8(_p3));\n            int32x4x2_t _p45 = vzipq_s32(vreinterpretq_s32_s8(_p4), vreinterpretq_s32_s8(_p5));\n            int32x4x2_t _p67 = vzipq_s32(vreinterpretq_s32_s8(_p6), vreinterpretq_s32_s8(_p7));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p01.val[0]), vget_low_s32(_p23.val[0])));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p45.val[0]), vget_low_s32(_p67.val[0])));\n            int8x16_t _r2 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p01.val[0]), vget_high_s32(_p23.val[0])));\n            int8x16_t _r3 = 
vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p45.val[0]), vget_high_s32(_p67.val[0])));\n            int8x16_t _r4 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p01.val[1]), vget_low_s32(_p23.val[1])));\n            int8x16_t _r5 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_p45.val[1]), vget_low_s32(_p67.val[1])));\n            int8x16_t _r6 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p01.val[1]), vget_high_s32(_p23.val[1])));\n            int8x16_t _r7 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_p45.val[1]), vget_high_s32(_p67.val[1])));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8x2_t _p01 = vzipq_s16(vreinterpretq_s16_s8(_p0), vreinterpretq_s16_s8(_p1));\n            int16x8x2_t _p23 = vzipq_s16(vreinterpretq_s16_s8(_p2), vreinterpretq_s16_s8(_p3));\n            int16x8x2_t _p45 = vzipq_s16(vreinterpretq_s16_s8(_p4), vreinterpretq_s16_s8(_p5));\n            int16x8x2_t _p67 = vzipq_s16(vreinterpretq_s16_s8(_p6), vreinterpretq_s16_s8(_p7));\n            int32x4x2_t _t0 = vzipq_s32(vreinterpretq_s32_s16(_p01.val[0]), vreinterpretq_s32_s16(_p23.val[0]));\n            int32x4x2_t _t1 = vzipq_s32(vreinterpretq_s32_s16(_p01.val[1]), vreinterpretq_s32_s16(_p23.val[1]));\n            int32x4x2_t _t2 = vzipq_s32(vreinterpretq_s32_s16(_p45.val[0]), vreinterpretq_s32_s16(_p67.val[0]));\n            int32x4x2_t _t3 = vzipq_s32(vreinterpretq_s32_s16(_p45.val[1]), vreinterpretq_s32_s16(_p67.val[1]));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t2.val[0])));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t2.val[0])));\n            int8x16_t _r2 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t0.val[1]), vget_low_s32(_t2.val[1])));\n            int8x16_t _r3 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t0.val[1]), vget_high_s32(_t2.val[1])));\n            int8x16_t _r4 = 
vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t3.val[0])));\n            int8x16_t _r5 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t3.val[0])));\n            int8x16_t _r6 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t3.val[1])));\n            int8x16_t _r7 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t3.val[1])));\n#endif // __ARM_FEATURE_DOTPROD\n            vst1q_s8(pp, _r0);\n            vst1q_s8(pp + 16, _r1);\n            vst1q_s8(pp + 32, _r2);\n            vst1q_s8(pp + 48, _r3);\n            vst1q_s8(pp + 64, _r4);\n            vst1q_s8(pp + 80, _r5);\n            vst1q_s8(pp + 96, _r6);\n            vst1q_s8(pp + 112, _r7);\n            pp += 128;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n            p4 += 16;\n            p5 += 16;\n            p6 += 16;\n            p7 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _p0 = vld1_s8(p0);\n            int8x8_t _p1 = vld1_s8(p1);\n            int8x8_t _p2 = vld1_s8(p2);\n            int8x8_t _p3 = vld1_s8(p3);\n            int8x8_t _p4 = vld1_s8(p4);\n            int8x8_t _p5 = vld1_s8(p5);\n            int8x8_t _p6 = vld1_s8(p6);\n            int8x8_t _p7 = vld1_s8(p7);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int8x16_t _r0 = vcombine_s8(_p0, _p1);\n            int8x16_t _r1 = vcombine_s8(_p2, _p3);\n            int8x16_t _r2 = vcombine_s8(_p4, _p5);\n            int8x16_t _r3 = vcombine_s8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x2x2_t _p01 = vzip_s32(vreinterpret_s32_s8(_p0), vreinterpret_s32_s8(_p1));\n            int32x2x2_t _p23 = vzip_s32(vreinterpret_s32_s8(_p2), vreinterpret_s32_s8(_p3));\n            int32x2x2_t _p45 = vzip_s32(vreinterpret_s32_s8(_p4), vreinterpret_s32_s8(_p5));\n            int32x2x2_t _p67 = 
vzip_s32(vreinterpret_s32_s8(_p6), vreinterpret_s32_s8(_p7));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(_p01.val[0], _p23.val[0]));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(_p45.val[0], _p67.val[0]));\n            int8x16_t _r2 = vreinterpretq_s8_s32(vcombine_s32(_p01.val[1], _p23.val[1]));\n            int8x16_t _r3 = vreinterpretq_s8_s32(vcombine_s32(_p45.val[1], _p67.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8_t _p04 = vreinterpretq_s16_s8(vcombine_s8(_p0, _p4));\n            int16x8_t _p15 = vreinterpretq_s16_s8(vcombine_s8(_p1, _p5));\n            int16x8_t _p26 = vreinterpretq_s16_s8(vcombine_s8(_p2, _p6));\n            int16x8_t _p37 = vreinterpretq_s16_s8(vcombine_s8(_p3, _p7));\n            int16x8x2_t _t0 = vzipq_s16(_p04, _p15);\n            int16x8x2_t _t1 = vzipq_s16(_p26, _p37);\n            int32x4x2_t _t2 = vzipq_s32(vreinterpretq_s32_s16(_t0.val[0]), vreinterpretq_s32_s16(_t1.val[0]));\n            int32x4x2_t _t3 = vzipq_s32(vreinterpretq_s32_s16(_t0.val[1]), vreinterpretq_s32_s16(_t1.val[1]));\n            int8x16_t _r0 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0])));\n            int8x16_t _r1 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0])));\n            int8x16_t _r2 = vreinterpretq_s8_s32(vcombine_s32(vget_low_s32(_t2.val[1]), vget_low_s32(_t3.val[1])));\n            int8x16_t _r3 = vreinterpretq_s8_s32(vcombine_s32(vget_high_s32(_t2.val[1]), vget_high_s32(_t3.val[1])));\n#endif // __ARM_FEATURE_DOTPROD\n            vst1q_s8(pp, _r0);\n            vst1q_s8(pp + 16, _r1);\n            vst1q_s8(pp + 32, _r2);\n            vst1q_s8(pp + 48, _r3);\n            pp += 64;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n            p4 += 8;\n            p5 += 8;\n            p6 += 8;\n            p7 += 8;\n        }\n        for 
(; kk + 3 < max_kk; kk += 4)\n        {\n#if __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n            pp[8] = p2[0];\n            pp[9] = p2[1];\n            pp[10] = p2[2];\n            pp[11] = p2[3];\n            pp[12] = p3[0];\n            pp[13] = p3[1];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n            pp[16] = p4[0];\n            pp[17] = p4[1];\n            pp[18] = p4[2];\n            pp[19] = p4[3];\n            pp[20] = p5[0];\n            pp[21] = p5[1];\n            pp[22] = p5[2];\n            pp[23] = p5[3];\n            pp[24] = p6[0];\n            pp[25] = p6[1];\n            pp[26] = p6[2];\n            pp[27] = p6[3];\n            pp[28] = p7[0];\n            pp[29] = p7[1];\n            pp[30] = p7[2];\n            pp[31] = p7[3];\n#else  // __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp[8] = p4[0];\n            pp[9] = p4[1];\n            pp[10] = p5[0];\n            pp[11] = p5[1];\n            pp[12] = p6[0];\n            pp[13] = p6[1];\n            pp[14] = p7[0];\n            pp[15] = p7[1];\n            pp[16] = p0[2];\n            pp[17] = p0[3];\n            pp[18] = p1[2];\n            pp[19] = p1[3];\n            pp[20] = p2[2];\n            pp[21] = p2[3];\n            pp[22] = p3[2];\n            pp[23] = p3[3];\n            pp[24] = p4[2];\n            pp[25] = p4[3];\n            pp[26] = p5[2];\n            pp[27] = p5[3];\n            pp[28] = p6[2];\n            pp[29] = p6[3];\n            pp[30] = p7[2];\n            pp[31] = p7[3];\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 32;\n       
     p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n            p4 += 4;\n            p5 += 4;\n            p6 += 4;\n            p7 += 4;\n        }\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp[8] = p4[0];\n            pp[9] = p4[1];\n            pp[10] = p5[0];\n            pp[11] = p5[1];\n            pp[12] = p6[0];\n            pp[13] = p6[1];\n            pp[14] = p7[0];\n            pp[15] = p7[1];\n            pp += 16;\n            p0 += 2;\n            p1 += 2;\n            p2 += 2;\n            p3 += 2;\n            p4 += 2;\n            p5 += 2;\n            p6 += 2;\n            p7 += 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp[4] = p4[0];\n            pp[5] = p5[0];\n            pp[6] = p6[0];\n            pp[7] = p7[0];\n            pp += 8;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n            p4++;\n            p5++;\n            p6++;\n            p7++;\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const signed char* p0 = B.row<const signed char>(j + jj) + k;\n        const signed char* p1 = B.row<const signed char>(j + jj + 1) + k;\n        const signed char* p2 = B.row<const signed char>(j + jj + 2) + k;\n        const signed char* p3 = B.row<const signed char>(j + jj + 3) + k;\n\n        int kk = 0;\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _p0 = vld1q_s8(p0);\n            int8x16_t _p1 = vld1q_s8(p1);\n            int8x16_t _p2 = vld1q_s8(p2);\n            int8x16_t _p3 = vld1q_s8(p3);\n#if 
__ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int64x2x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s64_s8(_p0);\n            _r0123.val[1] = vreinterpretq_s64_s8(_p1);\n            _r0123.val[2] = vreinterpretq_s64_s8(_p2);\n            _r0123.val[3] = vreinterpretq_s64_s8(_p3);\n            vst4q_s64((int64_t*)pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x4x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s32_s8(_p0);\n            _r0123.val[1] = vreinterpretq_s32_s8(_p1);\n            _r0123.val[2] = vreinterpretq_s32_s8(_p2);\n            _r0123.val[3] = vreinterpretq_s32_s8(_p3);\n            vst4q_s32((int*)pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8x4_t _r0123;\n            _r0123.val[0] = vreinterpretq_s16_s8(_p0);\n            _r0123.val[1] = vreinterpretq_s16_s8(_p1);\n            _r0123.val[2] = vreinterpretq_s16_s8(_p2);\n            _r0123.val[3] = vreinterpretq_s16_s8(_p3);\n            vst4q_s16((short*)pp, _r0123);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 64;\n            p0 += 16;\n            p1 += 16;\n            p2 += 16;\n            p3 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _p0 = vld1_s8(p0);\n            int8x8_t _p1 = vld1_s8(p1);\n            int8x8_t _p2 = vld1_s8(p2);\n            int8x8_t _p3 = vld1_s8(p3);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            vst1q_s8(pp, vcombine_s8(_p0, _p1));\n            vst1q_s8(pp + 16, vcombine_s8(_p2, _p3));\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x2x4_t _r0123;\n            _r0123.val[0] = vreinterpret_s32_s8(_p0);\n            _r0123.val[1] = vreinterpret_s32_s8(_p1);\n            _r0123.val[2] = vreinterpret_s32_s8(_p2);\n            _r0123.val[3] = vreinterpret_s32_s8(_p3);\n            vst4_s32((int*)pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // 
__ARM_FEATURE_DOTPROD\n            int16x4x4_t _r0123;\n            _r0123.val[0] = vreinterpret_s16_s8(_p0);\n            _r0123.val[1] = vreinterpret_s16_s8(_p1);\n            _r0123.val[2] = vreinterpret_s16_s8(_p2);\n            _r0123.val[3] = vreinterpret_s16_s8(_p3);\n            vst4_s16((short*)pp, _r0123);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 32;\n            p0 += 8;\n            p1 += 8;\n            p2 += 8;\n            p3 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n#if __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n            pp[8] = p2[0];\n            pp[9] = p2[1];\n            pp[10] = p2[2];\n            pp[11] = p2[3];\n            pp[12] = p3[0];\n            pp[13] = p3[1];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n#else  // __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp[8] = p0[2];\n            pp[9] = p0[3];\n            pp[10] = p1[2];\n            pp[11] = p1[3];\n            pp[12] = p2[2];\n            pp[13] = p2[3];\n            pp[14] = p3[2];\n            pp[15] = p3[3];\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 16;\n            p0 += 4;\n            p1 += 4;\n            p2 += 4;\n            p3 += 4;\n        }\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p2[0];\n            pp[5] = p2[1];\n            pp[6] = p3[0];\n            pp[7] = p3[1];\n            pp += 8;\n            p0 += 2;\n            p1 += 2;\n       
     p2 += 2;\n            p3 += 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp[2] = p2[0];\n            pp[3] = p3[0];\n            pp += 4;\n            p0++;\n            p1++;\n            p2++;\n            p3++;\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const signed char* p0 = B.row<const signed char>(j + jj) + k;\n        const signed char* p1 = B.row<const signed char>(j + jj + 1) + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            int8x16_t _p0 = vld1q_s8(p0);\n            int8x16_t _p1 = vld1q_s8(p1);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            int64x2x2_t _r01;\n            _r01.val[0] = vreinterpretq_s64_s8(_p0);\n            _r01.val[1] = vreinterpretq_s64_s8(_p1);\n            vst2q_s64((int64_t*)pp, _r01);\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x4x2_t _r01;\n            _r01.val[0] = vreinterpretq_s32_s8(_p0);\n            _r01.val[1] = vreinterpretq_s32_s8(_p1);\n            vst2q_s32((int*)pp, _r01);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x8x2_t _r01;\n            _r01.val[0] = vreinterpretq_s16_s8(_p0);\n            _r01.val[1] = vreinterpretq_s16_s8(_p1);\n            vst2q_s16((short*)pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 32;\n            p0 += 16;\n            p1 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _p0 = vld1_s8(p0);\n            int8x8_t _p1 = vld1_s8(p1);\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            vst1q_s8(pp, vcombine_s8(_p0, _p1));\n#else  // __ARM_FEATURE_MATMUL_INT8\n            int32x2x2_t _r01;\n            _r01.val[0] = vreinterpret_s32_s8(_p0);\n            _r01.val[1] = vreinterpret_s32_s8(_p1);\n            vst2_s32((int*)pp, _r01);\n#endif 
// __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n            int16x4x2_t _r01;\n            _r01.val[0] = vreinterpret_s16_s8(_p0);\n            _r01.val[1] = vreinterpret_s16_s8(_p1);\n            vst2_s16((short*)pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 16;\n            p0 += 8;\n            p1 += 8;\n        }\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n#if __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp[4] = p1[0];\n            pp[5] = p1[1];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n#else  // __ARM_FEATURE_DOTPROD\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp[4] = p0[2];\n            pp[5] = p0[3];\n            pp[6] = p1[2];\n            pp[7] = p1[3];\n#endif // __ARM_FEATURE_DOTPROD\n            pp += 8;\n            p0 += 4;\n            p1 += 4;\n        }\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p1[0];\n            pp[3] = p1[1];\n            pp += 4;\n            p0 += 2;\n            p1 += 2;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p1[0];\n            pp += 2;\n            p0++;\n            p1++;\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const signed char* p0 = B.row<const signed char>(j + jj) + k;\n\n        int kk = 0;\n#if __ARM_NEON\n        for (; kk + 15 < max_kk; kk += 16)\n        {\n            vst1q_s8(pp, vld1q_s8(p0));\n            pp += 16;\n            p0 += 16;\n        }\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            vst1_s8(pp, vld1_s8(p0));\n            pp += 8;\n            p0 += 8;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n            
pp[0] = p0[0];\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        transpose_pack_B_tile_int8_i8mm(B, BT, j, max_jj, k, max_kk);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_B_tile_int8_asimddp(B, BT, j, max_jj, k, max_kk);\n        return;\n    }\n#endif\n\n    // NCNN_LOGE(\"transpose_pack_B_tile_int8\");\n    // assert B.elempack == 1\n    // assert B.dims == 2\n\n    const int B_hstep = B.w;\n\n    signed char* pp = BT;\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const signed char* p0 = B.row<const signed char>(k) + (j + jj);\n\n        int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            int8x8_t _r0 = vld1_s8(p0);\n            int8x8_t _r1 = vld1_s8(p0 + B_hstep);\n            int8x8_t _r2 = vld1_s8(p0 + B_hstep * 2);\n            int8x8_t _r3 = vld1_s8(p0 + B_hstep * 3);\n            int8x8_t _r4 = vld1_s8(p0 + B_hstep * 4);\n            int8x8_t _r5 = vld1_s8(p0 + B_hstep * 5);\n            int8x8_t _r6 = vld1_s8(p0 + B_hstep * 6);\n            int8x8_t _r7 = vld1_s8(p0 + B_hstep * 7);\n            // transpose8x8\n            int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n            int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n            int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n            int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n            int8x8x4_t _r0123;\n            _r0123.val[0] = _r04.val[0];\n            _r0123.val[1] = _r15.val[0];\n            _r0123.val[2] = _r26.val[0];\n            _r0123.val[3] = 
_r37.val[0];\n            int8x8x4_t _r4567;\n            _r4567.val[0] = _r04.val[1];\n            _r4567.val[1] = _r15.val[1];\n            _r4567.val[2] = _r26.val[1];\n            _r4567.val[3] = _r37.val[1];\n            vst4_s8(pp, _r0123);\n            vst4_s8(pp + 32, _r4567);\n            pp += 64;\n            p0 += B_hstep * 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            int8x8x4_t _r0123;\n            _r0123.val[0] = vld1_s8(p0);\n            _r0123.val[1] = vld1_s8(p0 + B_hstep);\n            _r0123.val[2] = vld1_s8(p0 + B_hstep * 2);\n            _r0123.val[3] = vld1_s8(p0 + B_hstep * 3);\n            vst4_s8(pp, _r0123);\n            pp += 32;\n            p0 += B_hstep * 4;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            int8x8x2_t _r01;\n            _r01.val[0] = vld1_s8(p0);\n            _r01.val[1] = vld1_s8(p0 + B_hstep);\n            vst2_s8(pp, _r01);\n            pp += 16;\n            p0 += B_hstep * 2;\n        }\n        for (; kk < max_kk; kk++)\n        {\n            vst1_s8(pp, vld1_s8(p0));\n            pp += 8;\n            p0 += B_hstep;\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const signed char* p0 = B.row<const signed char>(k) + (j + jj);\n\n        int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[B_hstep];\n            pp[2] = p0[B_hstep * 2];\n            pp[3] = p0[B_hstep * 3];\n            pp[4] = p0[B_hstep * 4];\n            pp[5] = p0[B_hstep * 5];\n            pp[6] = p0[B_hstep * 6];\n            pp[7] = p0[B_hstep * 7];\n            pp[8] = p0[1];\n            pp[9] = p0[B_hstep + 1];\n            pp[10] = p0[B_hstep * 2 + 1];\n            pp[11] = p0[B_hstep * 3 + 1];\n            pp[12] = p0[B_hstep * 4 + 1];\n  
          pp[13] = p0[B_hstep * 5 + 1];\n            pp[14] = p0[B_hstep * 6 + 1];\n            pp[15] = p0[B_hstep * 7 + 1];\n            pp[16] = p0[2];\n            pp[17] = p0[B_hstep + 2];\n            pp[18] = p0[B_hstep * 2 + 2];\n            pp[19] = p0[B_hstep * 3 + 2];\n            pp[20] = p0[B_hstep * 4 + 2];\n            pp[21] = p0[B_hstep * 5 + 2];\n            pp[22] = p0[B_hstep * 6 + 2];\n            pp[23] = p0[B_hstep * 7 + 2];\n            pp[24] = p0[3];\n            pp[25] = p0[B_hstep + 3];\n            pp[26] = p0[B_hstep * 2 + 3];\n            pp[27] = p0[B_hstep * 3 + 3];\n            pp[28] = p0[B_hstep * 4 + 3];\n            pp[29] = p0[B_hstep * 5 + 3];\n            pp[30] = p0[B_hstep * 6 + 3];\n            pp[31] = p0[B_hstep * 7 + 3];\n            pp += 32;\n            p0 += B_hstep * 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[B_hstep];\n            pp[2] = p0[B_hstep * 2];\n            pp[3] = p0[B_hstep * 3];\n            pp[4] = p0[1];\n            pp[5] = p0[B_hstep + 1];\n            pp[6] = p0[B_hstep * 2 + 1];\n            pp[7] = p0[B_hstep * 3 + 1];\n            pp[8] = p0[2];\n            pp[9] = p0[B_hstep + 2];\n            pp[10] = p0[B_hstep * 2 + 2];\n            pp[11] = p0[B_hstep * 3 + 2];\n            pp[12] = p0[3];\n            pp[13] = p0[B_hstep + 3];\n            pp[14] = p0[B_hstep * 2 + 3];\n            pp[15] = p0[B_hstep * 3 + 3];\n            pp += 16;\n            p0 += B_hstep * 4;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[B_hstep];\n            pp[2] = p0[1];\n            pp[3] = p0[B_hstep + 1];\n            pp[4] = p0[2];\n            pp[5] = p0[B_hstep + 2];\n            pp[6] = p0[3];\n            pp[7] = p0[B_hstep + 3];\n            pp += 8;\n            p0 += B_hstep * 2;\n     
   }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp[2] = p0[2];\n            pp[3] = p0[3];\n            pp += 4;\n            p0 += B_hstep;\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const signed char* p0 = B.row<const signed char>(k) + (j + jj);\n\n        int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 7 < max_kk; kk += 8)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[B_hstep];\n            pp[2] = p0[B_hstep * 2];\n            pp[3] = p0[B_hstep * 3];\n            pp[4] = p0[B_hstep * 4];\n            pp[5] = p0[B_hstep * 5];\n            pp[6] = p0[B_hstep * 6];\n            pp[7] = p0[B_hstep * 7];\n            pp[8] = p0[1];\n            pp[9] = p0[B_hstep + 1];\n            pp[10] = p0[B_hstep * 2 + 1];\n            pp[11] = p0[B_hstep * 3 + 1];\n            pp[12] = p0[B_hstep * 4 + 1];\n            pp[13] = p0[B_hstep * 5 + 1];\n            pp[14] = p0[B_hstep * 6 + 1];\n            pp[15] = p0[B_hstep * 7 + 1];\n            pp += 16;\n            p0 += B_hstep * 8;\n        }\n#endif // __ARM_FEATURE_MATMUL_INT8\n        for (; kk + 3 < max_kk; kk += 4)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[B_hstep];\n            pp[2] = p0[B_hstep * 2];\n            pp[3] = p0[B_hstep * 3];\n            pp[4] = p0[1];\n            pp[5] = p0[B_hstep + 1];\n            pp[6] = p0[B_hstep * 2 + 1];\n            pp[7] = p0[B_hstep * 3 + 1];\n            pp += 8;\n            p0 += B_hstep * 4;\n        }\n#endif // __ARM_FEATURE_DOTPROD\n        for (; kk + 1 < max_kk; kk += 2)\n        {\n            pp[0] = p0[0];\n            pp[1] = p0[B_hstep];\n            pp[2] = p0[1];\n            pp[3] = p0[B_hstep + 1];\n            pp += 4;\n            p0 += B_hstep * 2;\n        }\n#endif // __ARM_NEON\n        for (; kk < max_kk; kk++)\n        {\n        
    pp[0] = p0[0];\n            pp[1] = p0[1];\n            pp += 2;\n            p0 += B_hstep;\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const signed char* p0 = B.row<const signed char>(k) + (j + jj);\n\n        int kk = 0;\n        // for (; kk + 1 < max_kk; kk += 2)\n        // {\n        //     pp[0] = p0[0];\n        //     pp[1] = p0[B_hstep];\n        //     pp += 2;\n        //     p0 += B_hstep * 2;\n        // }\n        for (; kk < max_kk; kk++)\n        {\n            pp[0] = p0[0];\n            pp += 1;\n            p0 += B_hstep;\n        }\n    }\n}\n\nstatic void compute_A_tile_fp32_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n    const int K = A.w;\n\n    // NCNN_LOGE(\"compute_A_tile_int8_scales %d %d\", max_ii, elempack);\n\n    const float v127_B_scale = 127.f * B_scale;\n\n    float* ps = (float*)scales + i;\n    float* pods = (float*)out_descales + i;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n#if __aarch64__\n        float32x4_t _v127 = vdupq_n_f32(127.f);\n        float32x4_t _v127_B_scale = vdupq_n_f32(v127_B_scale);\n#endif\n\n        for (int ii = 0; ii + 3 < max_ii; ii += 4)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = 
vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = vld1q_f32(p0);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += 4;\n            }\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax0);\n            float32x4_t _out_descale = vdivq_f32(_absmax0, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax0);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax0, _recp_v127_B_scale);\n\n            float tmp[4];\n            vst1q_f32(tmp, _absmax0);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / 
v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n#endif\n            ps += 4;\n            pods += 4;\n        }\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        for (int ii = 0; ii < max_ii; ii++)\n        {\n            const float* p0 = (const float*)A + (i + ii) * A_hstep;\n\n            float absmax = 0.f;\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            for (; kk + 15 < K; kk += 16)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 7 < K; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p = vld1q_f32(p0);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += 4;\n            }\n            float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            absmax = 
std::max(absmax, std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1)));\n#endif // __ARM_NEON\n            for (; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf(p0[0]));\n                p0++;\n            }\n\n            ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n}\n\nstatic void pack_A_tile_fp32_to_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_A_tile_fp32_to_int8_i8mm(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_A_tile_fp32_to_int8_asimddp(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n\n    // NCNN_LOGE(\"pack_A_tile_fp32_to_int8 %d %d\", max_ii, elempack);\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k * elempack;\n\n        float32x4_t _scale0 = vld1q_f32((const float*)scales + i + ii);\n        float32x4_t _scale1 = vld1q_f32((const float*)scales + i + ii + 4);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n                float32x4x4_t _q = vld4q_f32(p0 + 16);\n                float32x4x4_t _r = vld4q_f32(p0 + A_hstep * 4);\n                float32x4x4_t _s = vld4q_f32(p0 + A_hstep * 4 + 16);\n\n                float32x4_t _p0 = vmulq_laneq_f32(_p.val[0], _scale0, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(_p.val[1], _scale0, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(_p.val[2], _scale0, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(_p.val[3], _scale0, 3);\n                float32x4_t _p4 = vmulq_laneq_f32(_q.val[0], _scale0, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(_q.val[1], _scale0, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(_q.val[2], _scale0, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(_q.val[3], _scale0, 3);\n                float32x4_t _p8 = vmulq_laneq_f32(_r.val[0], _scale1, 0);\n                float32x4_t _p9 = vmulq_laneq_f32(_r.val[1], _scale1, 1);\n                float32x4_t _pa = vmulq_laneq_f32(_r.val[2], _scale1, 2);\n                float32x4_t _pb = vmulq_laneq_f32(_r.val[3], _scale1, 3);\n                float32x4_t _pc = vmulq_laneq_f32(_s.val[0], _scale1, 0);\n                float32x4_t _pd = vmulq_laneq_f32(_s.val[1], _scale1, 1);\n                float32x4_t _pe = vmulq_laneq_f32(_s.val[2], _scale1, 2);\n                float32x4_t _pf = 
vmulq_laneq_f32(_s.val[3], _scale1, 3);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n                int8x8_t _r4 = float2int8(_p8, _pc);\n                int8x8_t _r5 = float2int8(_p9, _pd);\n                int8x8_t _r6 = float2int8(_pa, _pe);\n                int8x8_t _r7 = float2int8(_pb, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p8, _p9);\n                int8x8_t _r3 = float2int8(_pa, _pb);\n                int8x8_t _r4 = float2int8(_p4, _p5);\n                int8x8_t _r5 = float2int8(_p6, _p7);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n                float32x4_t _p8 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + A_hstep * 4 + 4);\n                float32x4_t _pa = vld1q_f32(p0 + A_hstep * 4 + 8);\n                float32x4_t _pb = vld1q_f32(p0 + A_hstep * 4 + 12);\n                
float32x4_t _pc = vld1q_f32(p0 + A_hstep * 4 + 16);\n                float32x4_t _pd = vld1q_f32(p0 + A_hstep * 4 + 20);\n                float32x4_t _pe = vld1q_f32(p0 + A_hstep * 4 + 24);\n                float32x4_t _pf = vld1q_f32(p0 + A_hstep * 4 + 28);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale0);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale0);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale0);\n                _p8 = vmulq_f32(_p8, _scale1);\n                _p9 = vmulq_f32(_p9, _scale1);\n                _pa = vmulq_f32(_pa, _scale1);\n                _pb = vmulq_f32(_pb, _scale1);\n                _pc = vmulq_f32(_pc, _scale1);\n                _pd = vmulq_f32(_pd, _scale1);\n                _pe = vmulq_f32(_pe, _scale1);\n                _pf = vmulq_f32(_pf, _scale1);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p8), float2int8(_p2, _pa));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p9), float2int8(_p3, _pb));\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(float2int8(_p4, _pc), float2int8(_p6, _pe));\n                _r23.val[1] = vcombine_s8(float2int8(_p5, _pd), float2int8(_p7, _pf));\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n                float32x4x4_t _q = vld4q_f32(p0 + A_hstep * 4);\n\n                float32x4_t _p0 = vmulq_laneq_f32(_p.val[0], _scale0, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(_p.val[1], _scale0, 1);\n     
           float32x4_t _p2 = vmulq_laneq_f32(_p.val[2], _scale0, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(_p.val[3], _scale0, 3);\n                float32x4_t _p4 = vmulq_laneq_f32(_q.val[0], _scale1, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(_q.val[1], _scale1, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(_q.val[2], _scale1, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(_q.val[3], _scale1, 3);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + A_hstep * 4 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 4 + 8);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 4 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale0);\n                _p4 = vmulq_f32(_p4, _scale1);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale1);\n                _p7 = vmulq_f32(_p7, _scale1);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p4), float2int8(_p2, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p5), float2int8(_p3, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n       
         p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p0n = vld1q_f32(p0 + 4);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p1n = vld1q_f32(p0 + A_hstep * 4 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p0n = vmulq_f32(_p0n, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p1n = vmulq_f32(_p1n, _scale1);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p0n, _p1n);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep * 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep + 4);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p5 = vld1q_f32(p0 + A_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 3 + 4);\n                float32x4_t _p8 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + A_hstep * 4 + 4);\n                
float32x4_t _pa = vld1q_f32(p0 + A_hstep * 5);\n                float32x4_t _pb = vld1q_f32(p0 + A_hstep * 5 + 4);\n                float32x4_t _pc = vld1q_f32(p0 + A_hstep * 6);\n                float32x4_t _pd = vld1q_f32(p0 + A_hstep * 6 + 4);\n                float32x4_t _pe = vld1q_f32(p0 + A_hstep * 7);\n                float32x4_t _pf = vld1q_f32(p0 + A_hstep * 7 + 4);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale0, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale0, 2);\n                _p6 = vmulq_laneq_f32(_p6, _scale0, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale0, 3);\n                _p8 = vmulq_laneq_f32(_p8, _scale1, 0);\n                _p9 = vmulq_laneq_f32(_p9, _scale1, 0);\n                _pa = vmulq_laneq_f32(_pa, _scale1, 1);\n                _pb = vmulq_laneq_f32(_pb, _scale1, 1);\n                _pc = vmulq_laneq_f32(_pc, _scale1, 2);\n                _pd = vmulq_laneq_f32(_pd, _scale1, 2);\n                _pe = vmulq_laneq_f32(_pe, _scale1, 3);\n                _pf = vmulq_laneq_f32(_pf, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 0);\n                _p2 = vmulq_lane_f32(_p2, vget_low_f32(_scale0), 1);\n                _p3 = vmulq_lane_f32(_p3, vget_low_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_high_f32(_scale0), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_high_f32(_scale0), 0);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale0), 1);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale0), 1);\n                _p8 = vmulq_lane_f32(_p8, vget_low_f32(_scale1), 0);\n                _p9 = vmulq_lane_f32(_p9, 
vget_low_f32(_scale1), 0);\n                _pa = vmulq_lane_f32(_pa, vget_low_f32(_scale1), 1);\n                _pb = vmulq_lane_f32(_pb, vget_low_f32(_scale1), 1);\n                _pc = vmulq_lane_f32(_pc, vget_high_f32(_scale1), 0);\n                _pd = vmulq_lane_f32(_pd, vget_high_f32(_scale1), 0);\n                _pe = vmulq_lane_f32(_pe, vget_high_f32(_scale1), 1);\n                _pf = vmulq_lane_f32(_pf, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p8, _pa));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_pc, _pe));\n                int16x4_t _t4 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t5 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4_t _t6 = 
vreinterpret_s16_s8(float2int8(_p9, _pb));\n                int16x4_t _t7 = vreinterpret_s16_s8(float2int8(_pd, _pf));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int16x4x2_t _t45 = vuzp_s16(_t4, _t5);\n                int16x4x2_t _t67 = vuzp_s16(_t6, _t7);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n                int8x8_t _r4 = vreinterpret_s8_s16(_t45.val[0]);\n                int8x8_t _r5 = vreinterpret_s8_s16(_t67.val[0]);\n                int8x8_t _r6 = vreinterpret_s8_s16(_t45.val[1]);\n                int8x8_t _r7 = vreinterpret_s8_s16(_t67.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n\n                pp += 64;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 3);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + A_hstep * 5);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 6);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 7);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, 
_scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 4;\n            }\n   
         for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + A_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + A_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + A_hstep * 3);\n                float32x2_t _p4 = vld1_f32(p0 + A_hstep * 4);\n                float32x2_t _p5 = vld1_f32(p0 + A_hstep * 5);\n                float32x2_t _p6 = vld1_f32(p0 + A_hstep * 6);\n                float32x2_t _p7 = vld1_f32(p0 + A_hstep * 7);\n\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n                float32x4_t _p45 = vcombine_f32(_p4, _p5);\n                float32x4_t _p67 = vcombine_f32(_p6, _p7);\n\n                float32x4x2_t _scale01 = vzipq_f32(_scale0, _scale0);\n                float32x4x2_t _scale23 = vzipq_f32(_scale1, _scale1);\n\n                _p01 = vmulq_f32(_p01, _scale01.val[0]);\n                _p23 = vmulq_f32(_p23, _scale01.val[1]);\n                _p45 = vmulq_f32(_p45, _scale23.val[0]);\n                _p67 = vmulq_f32(_p67, _scale23.val[1]);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = float32x4_t();\n                float32x4_t _p1 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n                _p0 = vsetq_lane_f32(p0[A_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 2], _p0, 2);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 3], _p0, 3);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 4], _p1, 0);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 5], _p1, 1);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 6], _p1, 2);\n        
        _p1 = vsetq_lane_f32(p0[A_hstep * 7], _p1, 3);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0++;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k * elempack;\n\n        float32x4_t _scale = vld1q_f32((const float*)scales + i + ii);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n                float32x4x4_t _q = vld4q_f32(p0 + 16);\n\n                float32x4_t _p0 = vmulq_laneq_f32(_p.val[0], _scale, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(_p.val[1], _scale, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(_p.val[2], _scale, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(_p.val[3], _scale, 3);\n                float32x4_t _p4 = vmulq_laneq_f32(_q.val[0], _scale, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(_q.val[1], _scale, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(_q.val[2], _scale, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(_q.val[3], _scale, 3);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                
vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n\n                float32x4_t _p0 = vmulq_laneq_f32(_p.val[0], _scale, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(_p.val[1], _scale, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(_p.val[2], _scale, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(_p.val[3], _scale, 3);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                
float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep + 4);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p5 = 
vld1q_f32(p0 + A_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 3 + 4);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale, 2);\n                _p6 = vmulq_laneq_f32(_p6, _scale, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 0);\n                _p2 = vmulq_lane_f32(_p2, vget_low_f32(_scale), 1);\n                _p3 = vmulq_lane_f32(_p3, vget_low_f32(_scale), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_high_f32(_scale), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_high_f32(_scale), 0);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale), 1);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = 
vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 3);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n       
         int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + A_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + A_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + A_hstep * 3);\n\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n\n                float32x4x2_t _scale01 = vzipq_f32(_scale, _scale);\n\n                _p01 = vmulq_f32(_p01, _scale01.val[0]);\n                _p23 = vmulq_f32(_p23, _scale01.val[1]);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n                _p0 = vsetq_lane_f32(p0[A_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 2], _p0, 2);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 3], _p0, 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n\n        const float scale0 = scales[i + ii];\n        const float scale1 = scales[i + ii + 1];\n\n        // if (elempack == 1)\n        {\n           
 int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale0 = vdupq_n_f32(scale0);\n            float32x4_t _scale1 = vdupq_n_f32(scale1);\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep + 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale1);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p2));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p2));\n                float32x4_t _t2 = vcombine_f32(vget_low_f32(_p1), vget_low_f32(_p3));\n                float32x4_t _t3 = vcombine_f32(vget_high_f32(_p1), vget_high_f32(_p3));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n                int8x8_t _r1 = float2int8(_t2, _t3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n                vst1_s8(pp + 8, _r1);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, 
_p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(p0[0] * scale0);\n                pp[1] = float2int8(p0[1] * scale0);\n                pp[2] = float2int8(p0[A_hstep] * scale1);\n                pp[3] = float2int8(p0[A_hstep + 1] * scale1);\n                pp += 4;\n                p0 += 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale0);\n                pp[1] = float2int8(p0[A_hstep] * scale1);\n                pp += 2;\n                p0++;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const float* p0 = (const float*)A + (i + ii) * A_hstep + k;\n\n        const float scale = scales[i + ii];\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale = vdupq_n_f32(scale);\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n        
        p0 += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_compute_A_tile_fp32_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n    const int K = A.dims == 3 ? A.c : A.h;\n\n    // NCNN_LOGE(\"transpose_compute_A_tile_int8_scales %d %d\", max_ii, elempack);\n\n    const float v127_B_scale = 127.f * B_scale;\n\n#if __ARM_NEON\n#if __aarch64__\n    float32x4_t _v127 = vdupq_n_f32(127.f);\n    float32x4_t _v127_B_scale = vdupq_n_f32(v127_B_scale);\n#endif\n#endif\n\n    float* ps = (float*)scales + i;\n    float* pods = (float*)out_descales + i;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        int ii = 0;\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n            const float* p0 = (const float*)A + (i + ii) * 4;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            for (int kk = 0; kk < K; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = 
vld1q_f32(p0 + 12);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 4;\n            }\n            float32x2_t _aa0 = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            float32x2_t _aa1 = vmax_f32(vget_low_f32(_absmax1), vget_high_f32(_absmax1));\n            float32x2_t _aa2 = vmax_f32(vget_low_f32(_absmax2), vget_high_f32(_absmax2));\n            float32x2_t _aa3 = vmax_f32(vget_low_f32(_absmax3), vget_high_f32(_absmax3));\n            float32x2_t _aa01 = vpmax_f32(_aa0, _aa1);\n            float32x2_t _aa23 = vpmax_f32(_aa2, _aa3);\n            float32x4_t _absmax = vcombine_f32(_aa01, _aa23);\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax);\n            float32x4_t _out_descale = vdivq_f32(_absmax, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            float tmp[4];\n            vst1q_f32(tmp, _absmax);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax, 
_recp_v127_B_scale);\n#endif\n\n            ps += 4;\n            pods += 4;\n        }\n        for (; ii < max_ii; ii++)\n        {\n            const float* p0 = (const float*)A + (i + ii) * 4;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 8);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 12);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep * 4);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += A_hstep * 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = vld1q_f32(p0);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += A_hstep * 4;\n            }\n            float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            float absmax = std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1));\n\n            ps[0] = 127.f / absmax;\n      
      pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        int ii = 0;\n#if __ARM_NEON\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n            const float* p0 = (const float*)A + (i + ii);\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 3);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 4;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += A_hstep * 2;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = vld1q_f32(p0);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += A_hstep;\n            }\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax0);\n            float32x4_t _out_descale = 
vdivq_f32(_absmax0, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            float tmp[4];\n            vst1q_f32(tmp, _absmax0);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax0);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax0, _recp_v127_B_scale);\n#endif\n\n            ps += 4;\n            pods += 4;\n        }\n        for (; ii + 1 < max_ii; ii += 2)\n        {\n            const float* p0 = (const float*)A + (i + ii);\n\n            float32x2_t _absmax0 = vdup_n_f32(0.f);\n            float32x2_t _absmax1 = vdup_n_f32(0.f);\n            float32x2_t _absmax2 = vdup_n_f32(0.f);\n            float32x2_t _absmax3 = vdup_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + A_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + A_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + A_hstep * 3);\n                _absmax0 = vmax_f32(_absmax0, vabs_f32(_p0));\n                _absmax1 = vmax_f32(_absmax1, vabs_f32(_p1));\n                _absmax2 = vmax_f32(_absmax2, vabs_f32(_p2));\n                _absmax3 = vmax_f32(_absmax3, vabs_f32(_p3));\n            
    p0 += A_hstep * 4;\n            }\n            _absmax0 = vmax_f32(_absmax0, _absmax2);\n            _absmax1 = vmax_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + A_hstep);\n                _absmax0 = vmax_f32(_absmax0, vabs_f32(_p0));\n                _absmax1 = vmax_f32(_absmax1, vabs_f32(_p1));\n                p0 += A_hstep * 2;\n            }\n            _absmax0 = vmax_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x2_t _p = vld1_f32(p0);\n                _absmax0 = vmax_f32(_absmax0, vabs_f32(_p));\n                p0 += A_hstep;\n            }\n\n#if __aarch64__\n            float32x2_t _scale = vdiv_f32(vget_low_f32(_v127), _absmax0);\n            float32x2_t _out_descale = vdiv_f32(_absmax0, vget_low_f32(_v127_B_scale));\n\n            vst1_f32(ps, _scale);\n            vst1_f32(pods, _out_descale);\n#else\n            float tmp[2];\n            vst1_f32(tmp, _absmax0);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n\n            // float32x2_t _recp_absmax = vrecpe_f32(_absmax0);\n            // _recp_absmax = vmul_f32(vrecps_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmul_f32(vrecps_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmul_f32(vrecps_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // float32x2_t _scale = vmul_f32(vget_low_f32(_v127), _recp_absmax);\n            // float32x2_t _out_descale = vmul_f32(_absmax0, vget_low_f32(_recp_v127_B_scale));\n#endif\n\n            ps += 2;\n            pods += 2;\n        }\n#endif // __ARM_NEON\n        for (; ii < max_ii; ii++)\n        {\n            const float* p0 = (const float*)A + (i + ii);\n\n            float absmax = 
0.f;\n            for (int kk = 0; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf(p0[0]));\n                p0 += A_hstep;\n            }\n\n            ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile_fp32_to_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        transpose_pack_A_tile_fp32_to_int8_i8mm(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_A_tile_fp32_to_int8_asimddp(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n\n    // NCNN_LOGE(\"transpose_pack_A_tile_fp32_to_int8 %d %d\", max_ii, elempack);\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii) * elempack;\n\n        float32x4_t _scale0 = vld1q_f32((const float*)scales + i + ii);\n        float32x4_t _scale1 = vld1q_f32((const float*)scales + i + ii + 4);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n                float32x4_t _p8 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + A_hstep * 4 + 4);\n                float32x4_t _pa = vld1q_f32(p0 + A_hstep * 4 + 8);\n                float32x4_t _pb = vld1q_f32(p0 + A_hstep * 4 + 12);\n                float32x4_t _pc = vld1q_f32(p0 + A_hstep * 4 + 16);\n                float32x4_t _pd = vld1q_f32(p0 + A_hstep * 4 + 20);\n                float32x4_t _pe = vld1q_f32(p0 + A_hstep * 4 + 24);\n                float32x4_t _pf = vld1q_f32(p0 + A_hstep * 4 + 28);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = 
vmulq_laneq_f32(_p7, _scale1, 3);\n                _p8 = vmulq_laneq_f32(_p8, _scale0, 0);\n                _p9 = vmulq_laneq_f32(_p9, _scale0, 1);\n                _pa = vmulq_laneq_f32(_pa, _scale0, 2);\n                _pb = vmulq_laneq_f32(_pb, _scale0, 3);\n                _pc = vmulq_laneq_f32(_pc, _scale1, 0);\n                _pd = vmulq_laneq_f32(_pd, _scale1, 1);\n                _pe = vmulq_laneq_f32(_pe, _scale1, 2);\n                _pf = vmulq_laneq_f32(_pf, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n                _p8 = vmulq_lane_f32(_p8, vget_low_f32(_scale0), 0);\n                _p9 = vmulq_lane_f32(_p9, vget_low_f32(_scale0), 1);\n                _pa = vmulq_lane_f32(_pa, vget_high_f32(_scale0), 0);\n                _pb = vmulq_lane_f32(_pb, vget_high_f32(_scale0), 1);\n                _pc = vmulq_lane_f32(_pc, vget_low_f32(_scale1), 0);\n                _pd = vmulq_lane_f32(_pd, vget_low_f32(_scale1), 1);\n                _pe = vmulq_lane_f32(_pe, vget_high_f32(_scale1), 0);\n                _pf = vmulq_lane_f32(_pf, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p8);\n                int8x8_t _r1 = float2int8(_p1, _p9);\n                int8x8_t _r2 = float2int8(_p2, _pa);\n                int8x8_t _r3 = float2int8(_p3, _pb);\n                int8x8_t _r4 = float2int8(_p4, _pc);\n                int8x8_t _r5 = 
float2int8(_p5, _pd);\n                int8x8_t _r6 = float2int8(_p6, _pe);\n                int8x8_t _r7 = float2int8(_p7, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, 
_r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n#endif\n\n                int8x8_t _r0 = float2int8(_p0, 
_p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n#if __ARM_FEATURE_DOTPROD\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8x2_t _rr = vuzpq_s16(_r01, _r23);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep + 4);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p5 = vld1q_f32(p0 + A_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 3 + 4);\n                float32x4_t _p8 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + A_hstep * 4 + 4);\n                float32x4_t _pa = vld1q_f32(p0 + A_hstep * 5);\n                float32x4_t _pb = vld1q_f32(p0 + A_hstep * 5 + 4);\n                float32x4_t _pc = vld1q_f32(p0 + A_hstep * 6);\n                float32x4_t _pd = vld1q_f32(p0 + A_hstep * 6 + 4);\n                float32x4_t _pe = vld1q_f32(p0 + A_hstep * 7);\n                float32x4_t _pf = vld1q_f32(p0 + A_hstep * 7 + 4);\n\n                _p0 = 
vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale1);\n                _p8 = vmulq_f32(_p8, _scale0);\n                _p9 = vmulq_f32(_p9, _scale1);\n                _pa = vmulq_f32(_pa, _scale0);\n                _pb = vmulq_f32(_pb, _scale1);\n                _pc = vmulq_f32(_pc, _scale0);\n                _pd = vmulq_f32(_pd, _scale1);\n                _pe = vmulq_f32(_pe, _scale0);\n                _pf = vmulq_f32(_pf, _scale1);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n                int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n                int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n                int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n                int8x16x4_t _r0123;\n                _r0123.val[0] = vcombine_s8(_r04.val[0], _r04.val[1]);\n                _r0123.val[1] = vcombine_s8(_r15.val[0], _r15.val[1]);\n                _r0123.val[2] = vcombine_s8(_r26.val[0], _r26.val[1]);\n                _r0123.val[3] = vcombine_s8(_r37.val[0], _r37.val[1]);\n\n                vst4q_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = _r0;\n                _r0123.val[1] = _r1;\n                
_r0123.val[2] = _r2;\n                _r0123.val[3] = _r3;\n                int8x8x4_t _r4567;\n                _r4567.val[0] = _r4;\n                _r4567.val[1] = _r5;\n                _r4567.val[2] = _r6;\n                _r4567.val[3] = _r7;\n\n                vst4_s8(pp, _r0123);\n                vst4_s8(pp + 32, _r4567);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(_r0, _r2);\n                _r01.val[1] = vcombine_s8(_r1, _r3);\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(_r4, _r6);\n                _r23.val[1] = vcombine_s8(_r5, _r7);\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep + 4);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p5 = vld1q_f32(p0 + A_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 3 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n     
           _r0123.val[2] = float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep + 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += A_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii) * elempack;\n\n        float32x4_t _scale = vld1q_f32((const float*)scales + i + ii);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n       
     for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + A_hstep * 4 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 4 + 8);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 4 + 12);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n     
           int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t 
_r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 3);\n                float32x4_t _p4 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + A_hstep * 5);\n                float32x4_t _p6 = vld1q_f32(p0 + A_hstep * 6);\n                float32x4_t _p7 = vld1q_f32(p0 + A_hstep * 7);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                float32x4x2_t _p04 = vzipq_f32(_p0, _p4);\n                float32x4x2_t _p15 = vzipq_f32(_p1, _p5);\n                float32x4x2_t _p26 = vzipq_f32(_p2, _p6);\n                float32x4x2_t _p37 = vzipq_f32(_p3, _p7);\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p04.val[0], _p04.val[1]);\n                
_r0123.val[1] = float2int8(_p15.val[0], _p15.val[1]);\n                _r0123.val[2] = float2int8(_p26.val[0], _p26.val[1]);\n                _r0123.val[3] = float2int8(_p37.val[0], _p37.val[1]);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p4);\n                _r0123.val[1] = float2int8(_p1, _p5);\n                _r0123.val[2] = float2int8(_p2, _p6);\n                _r0123.val[3] = float2int8(_p3, _p7);\n\n                vst4_s8(pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                transpose4x4_ps(_p0, _p1, _p2, _p3);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n         
       pp += 16;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n                pp += 4;\n                p0 += A_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii) * elempack;\n\n        const float scale0 = scales[i + ii];\n        const float scale1 = scales[i + ii + 1];\n\n#if __ARM_NEON\n        float32x4_t _scale0 = vdupq_n_f32(scale0);\n        float32x4_t _scale1 = vdupq_n_f32(scale1);\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 4 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, 
_scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r01 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r01 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            const float* p0 = (const float*)A + k * A_hstep + (i + ii);\n\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale = vzipq_f32(_scale0, _scale1).val[0];\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                
float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + A_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + A_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + A_hstep * 3);\n                float32x2_t _p4 = vld1_f32(p0 + A_hstep * 4);\n                float32x2_t _p5 = vld1_f32(p0 + A_hstep * 5);\n                float32x2_t _p6 = vld1_f32(p0 + A_hstep * 6);\n                float32x2_t _p7 = vld1_f32(p0 + A_hstep * 7);\n\n#if __ARM_FEATURE_DOTPROD\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n                float32x4_t _p45 = vcombine_f32(_p4, _p5);\n                float32x4_t _p67 = vcombine_f32(_p6, _p7);\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vuzp_s8(_r0, _r1);\n\n                vst1q_s8(pp, vcombine_s8(_r01.val[0], _r01.val[1]));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vtrn_s8(_r0, _r1);\n                int8x8x2_t _rr01 = vuzp_s8(_r01.val[0], _r01.val[1]);\n\n                vst1q_s8(pp, vcombine_s8(_rr01.val[0], _rr01.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p02 = vcombine_f32(_p0, _p2);\n                float32x4_t _p46 = vcombine_f32(_p4, _p6);\n                float32x4_t _p13 = vcombine_f32(_p1, _p3);\n                float32x4_t _p57 = vcombine_f32(_p5, _p7);\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p46 = vmulq_f32(_p46, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n                _p57 = vmulq_f32(_p57, _scale);\n\n                int8x8x2_t _r01;\n          
      _r01.val[0] = float2int8(_p02, _p46);\n                _r01.val[1] = float2int8(_p13, _p57);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + A_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + A_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + A_hstep * 3);\n\n#if __ARM_FEATURE_DOTPROD\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                float32x4x2_t _pp = vuzpq_f32(_p01, _p23);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p02 = vcombine_f32(_p0, _p2);\n                float32x4_t _p13 = vcombine_f32(_p1, _p3);\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n\n                float32x4x2_t _pp = vzipq_f32(_p02, _p13);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(p0[0] * scale0);\n                pp[1] = float2int8(p0[A_hstep + 0] * scale0);\n                pp[2] = float2int8(p0[1] * scale1);\n                pp[3] = float2int8(p0[A_hstep + 1] * scale1);\n                pp += 4;\n                p0 += A_hstep * 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale0);\n                pp[1] = float2int8(p0[1] 
* scale1);\n                pp += 2;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const float* p0 = (const float*)A + k * A_hstep + (i + ii) * elempack;\n\n        const float scale = scales[i + ii];\n\n#if __ARM_NEON\n        float32x4_t _scale = vdupq_n_f32(scale);\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep * 4);\n                float32x4_t _p2 = vld1q_f32(p0 + A_hstep * 8);\n                float32x4_t _p3 = vld1q_f32(p0 + A_hstep * 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += A_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + A_hstep * 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp[1] = float2int8(p0[1] * scale);\n                pp[2] = float2int8(p0[2] * scale);\n                pp[3] = float2int8(p0[3] * scale);\n                pp += 4;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 
1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = float32x4_t();\n                float32x4_t _p1 = float32x4_t();\n                float32x4_t _p2 = float32x4_t();\n                float32x4_t _p3 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n                _p0 = vsetq_lane_f32(p0[A_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 2], _p0, 2);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 3], _p0, 3);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 4], _p1, 0);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 5], _p1, 1);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 6], _p1, 2);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 7], _p1, 3);\n                _p2 = vsetq_lane_f32(p0[A_hstep * 8], _p2, 0);\n                _p2 = vsetq_lane_f32(p0[A_hstep * 9], _p2, 1);\n                _p2 = vsetq_lane_f32(p0[A_hstep * 10], _p2, 2);\n                _p2 = vsetq_lane_f32(p0[A_hstep * 11], _p2, 3);\n                _p3 = vsetq_lane_f32(p0[A_hstep * 12], _p3, 0);\n                _p3 = vsetq_lane_f32(p0[A_hstep * 13], _p3, 1);\n                _p3 = vsetq_lane_f32(p0[A_hstep * 14], _p3, 2);\n                _p3 = vsetq_lane_f32(p0[A_hstep * 15], _p3, 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += A_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = float32x4_t();\n                float32x4_t _p1 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n         
       _p0 = vsetq_lane_f32(p0[A_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 2], _p0, 2);\n                _p0 = vsetq_lane_f32(p0[A_hstep * 3], _p0, 3);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 4], _p1, 0);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 5], _p1, 1);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 6], _p1, 2);\n                _p1 = vsetq_lane_f32(p0[A_hstep * 7], _p1, 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp += 1;\n                p0 += A_hstep;\n            }\n        }\n    }\n}\n\nstatic void compute_B_fp32_int8_scale(const Mat& B, float& scale)\n{\n    float absmax = 0.f;\n#if __ARM_NEON\n    float32x4_t _absmax = vdupq_n_f32(0.f);\n#endif\n    for (int i = 0; i < (B.dims == 3 ? B.c : B.h); i++)\n    {\n        const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n        const float* ptr = (const float*)B + i * B_hstep * B.elempack;\n\n        const int size = B.w * B.elempack;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 3 < size; j += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _absmax = vmaxq_f32(_absmax, vabsq_f32(_p));\n            ptr += 4;\n        }\n#endif\n        for (; j < size; j++)\n        {\n            absmax = std::max(absmax, (float)fabsf(ptr[0]));\n            ptr++;\n        }\n    }\n#if __ARM_NEON\n    float32x2_t _aa = vmax_f32(vget_low_f32(_absmax), vget_high_f32(_absmax));\n    absmax = std::max(absmax, std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1)));\n#endif\n\n    scale = absmax == 0.f ? 
1.f : 127.f / absmax;\n}\n\nstatic void pack_B_tile_fp32_to_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_B_tile_fp32_to_int8_i8mm(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_B_tile_fp32_to_int8_asimddp(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    // NCNN_LOGE(\"pack_B_tile_fp32_to_int8 %d %d %d\", max_jj, max_kk, elempack);\n\n    signed char* pp = BT;\n\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n#endif\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n                float32x4x4_t _q = vld4q_f32(p0 + 16);\n                float32x4x4_t _r = vld4q_f32(p0 + B_hstep * 4);\n                float32x4x4_t _s = vld4q_f32(p0 + B_hstep * 4 + 16);\n\n                float32x4_t _p0 = vmulq_f32(_p.val[0], _scale);\n                float32x4_t _p1 = vmulq_f32(_p.val[1], _scale);\n                float32x4_t _p2 = vmulq_f32(_p.val[2], _scale);\n                float32x4_t _p3 = vmulq_f32(_p.val[3], _scale);\n                float32x4_t _p4 = vmulq_f32(_q.val[0], _scale);\n                float32x4_t _p5 = vmulq_f32(_q.val[1], _scale);\n                float32x4_t _p6 = vmulq_f32(_q.val[2], _scale);\n                
float32x4_t _p7 = vmulq_f32(_q.val[3], _scale);\n                float32x4_t _p8 = vmulq_f32(_r.val[0], _scale);\n                float32x4_t _p9 = vmulq_f32(_r.val[1], _scale);\n                float32x4_t _pa = vmulq_f32(_r.val[2], _scale);\n                float32x4_t _pb = vmulq_f32(_r.val[3], _scale);\n                float32x4_t _pc = vmulq_f32(_s.val[0], _scale);\n                float32x4_t _pd = vmulq_f32(_s.val[1], _scale);\n                float32x4_t _pe = vmulq_f32(_s.val[2], _scale);\n                float32x4_t _pf = vmulq_f32(_s.val[3], _scale);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n                int8x8_t _r4 = float2int8(_p8, _pc);\n                int8x8_t _r5 = float2int8(_p9, _pd);\n                int8x8_t _r6 = float2int8(_pa, _pe);\n                int8x8_t _r7 = float2int8(_pb, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p8, _p9);\n                int8x8_t _r3 = float2int8(_pa, _pb);\n                int8x8_t _r4 = float2int8(_p4, _p5);\n                int8x8_t _r5 = float2int8(_p6, _p7);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = 
vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n                float32x4_t _p8 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + B_hstep * 4 + 4);\n                float32x4_t _pa = vld1q_f32(p0 + B_hstep * 4 + 8);\n                float32x4_t _pb = vld1q_f32(p0 + B_hstep * 4 + 12);\n                float32x4_t _pc = vld1q_f32(p0 + B_hstep * 4 + 16);\n                float32x4_t _pd = vld1q_f32(p0 + B_hstep * 4 + 20);\n                float32x4_t _pe = vld1q_f32(p0 + B_hstep * 4 + 24);\n                float32x4_t _pf = vld1q_f32(p0 + B_hstep * 4 + 28);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p8), float2int8(_p2, _pa));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p9), float2int8(_p3, _pb));\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(float2int8(_p4, _pc), float2int8(_p6, _pe));\n                _r23.val[1] = vcombine_s8(float2int8(_p5, _pd), float2int8(_p7, _pf));\n\n                vst2q_s8(pp, 
_r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n                float32x4x4_t _q = vld4q_f32(p0 + B_hstep * 4);\n\n                float32x4_t _p0 = vmulq_f32(_p.val[0], _scale);\n                float32x4_t _p1 = vmulq_f32(_p.val[1], _scale);\n                float32x4_t _p2 = vmulq_f32(_p.val[2], _scale);\n                float32x4_t _p3 = vmulq_f32(_p.val[3], _scale);\n                float32x4_t _p4 = vmulq_f32(_q.val[0], _scale);\n                float32x4_t _p5 = vmulq_f32(_q.val[1], _scale);\n                float32x4_t _p6 = vmulq_f32(_q.val[2], _scale);\n                float32x4_t _p7 = vmulq_f32(_q.val[3], _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 4 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 4 + 8);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 4 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = 
vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p4), float2int8(_p2, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p5), float2int8(_p3, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep * 4 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep * 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep + 
4);\n                float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 3 + 4);\n                float32x4_t _p8 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + B_hstep * 4 + 4);\n                float32x4_t _pa = vld1q_f32(p0 + B_hstep * 5);\n                float32x4_t _pb = vld1q_f32(p0 + B_hstep * 5 + 4);\n                float32x4_t _pc = vld1q_f32(p0 + B_hstep * 6);\n                float32x4_t _pd = vld1q_f32(p0 + B_hstep * 6 + 4);\n                float32x4_t _pe = vld1q_f32(p0 + B_hstep * 7);\n                float32x4_t _pf = vld1q_f32(p0 + B_hstep * 7 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n           
     int8x8_t _r7 = float2int8(_pe, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p8, _pa));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_pc, _pe));\n                int16x4_t _t4 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t5 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4_t _t6 = vreinterpret_s16_s8(float2int8(_p9, _pb));\n                int16x4_t _t7 = vreinterpret_s16_s8(float2int8(_pd, _pf));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int16x4x2_t _t45 = vuzp_s16(_t4, _t5);\n                int16x4x2_t _t67 = vuzp_s16(_t6, _t7);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n                int8x8_t _r4 = vreinterpret_s8_s16(_t45.val[0]);\n                int8x8_t _r5 = vreinterpret_s8_s16(_t67.val[0]);\n                int8x8_t _r6 = vreinterpret_s8_s16(_t45.val[1]);\n                int8x8_t _r7 = vreinterpret_s8_s16(_t67.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                
vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n\n                pp += 64;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep * 3);\n                float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 5);\n                float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 6);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 7);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                
int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + B_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + B_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + B_hstep * 3);\n                float32x2_t _p4 = vld1_f32(p0 + B_hstep * 4);\n                float32x2_t _p5 = vld1_f32(p0 + B_hstep * 5);\n                float32x2_t _p6 = vld1_f32(p0 + B_hstep * 6);\n                float32x2_t _p7 = vld1_f32(p0 + B_hstep * 7);\n\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n                float32x4_t _p45 = vcombine_f32(_p4, _p5);\n                float32x4_t _p67 = vcombine_f32(_p6, _p7);\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = float32x4_t();\n                float32x4_t _p1 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n                _p0 = vsetq_lane_f32(p0[B_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[B_hstep * 2], _p0, 2);\n                _p0 = 
vsetq_lane_f32(p0[B_hstep * 3], _p0, 3);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 4], _p1, 0);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 5], _p1, 1);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 6], _p1, 2);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 7], _p1, 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0++;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n                float32x4x4_t _q = vld4q_f32(p0 + 16);\n\n                float32x4_t _p0 = vmulq_f32(_p.val[0], _scale);\n                float32x4_t _p1 = vmulq_f32(_p.val[1], _scale);\n                float32x4_t _p2 = vmulq_f32(_p.val[2], _scale);\n                float32x4_t _p3 = vmulq_f32(_p.val[3], _scale);\n                float32x4_t _p4 = vmulq_f32(_q.val[0], _scale);\n                float32x4_t _p5 = vmulq_f32(_q.val[1], _scale);\n                float32x4_t _p6 = vmulq_f32(_q.val[2], _scale);\n                float32x4_t _p7 = vmulq_f32(_q.val[3], _scale);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, 
_p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                float32x4x4_t _p = vld4q_f32(p0);\n\n                float32x4_t _p0 = vmulq_f32(_p.val[0], _scale);\n                float32x4_t _p1 = vmulq_f32(_p.val[1], _scale);\n                float32x4_t _p2 = vmulq_f32(_p.val[2], _scale);\n                float32x4_t _p3 = vmulq_f32(_p.val[3], _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p0 = 
vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep + 4);\n                
float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 3 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n      
          vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep * 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + B_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + B_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + B_hstep * 3);\n\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 2;\n            }\n            for (; kk < max_kk; 
kk++)\n            {\n                float32x4_t _p0 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n                _p0 = vsetq_lane_f32(p0[B_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[B_hstep * 2], _p0, 2);\n                _p0 = vsetq_lane_f32(p0[B_hstep * 3], _p0, 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p2));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p2));\n                float32x4_t _t2 = 
vcombine_f32(vget_low_f32(_p1), vget_low_f32(_p3));\n                float32x4_t _t3 = vcombine_f32(vget_high_f32(_p1), vget_high_f32(_p3));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n                int8x8_t _r1 = float2int8(_t2, _t3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n                vst1_s8(pp + 8, _r1);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp[1] = float2int8(p0[1] * scale);\n                pp[2] = float2int8(p0[B_hstep] * scale);\n                pp[3] = float2int8(p0[B_hstep + 1] * scale);\n                pp += 4;\n                p0 += 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp[1] = float2int8(p0[B_hstep] * scale);\n                pp += 2;\n                p0++;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const float* p0 = (const float*)B + (j + jj) * B_hstep + k;\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n    
        for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile_fp32_to_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        transpose_pack_B_tile_fp32_to_int8_i8mm(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_B_tile_fp32_to_int8_asimddp(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    
const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    // NCNN_LOGE(\"transpose_pack_B_tile_fp32_to_int8 %d %d\", max_jj, elempack);\n\n    signed char* pp = BT;\n\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n#endif\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj) * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n                float32x4_t _p8 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + B_hstep * 4 + 4);\n                float32x4_t _pa = vld1q_f32(p0 + B_hstep * 4 + 8);\n                float32x4_t _pb = vld1q_f32(p0 + B_hstep * 4 + 12);\n                float32x4_t _pc = vld1q_f32(p0 + B_hstep * 4 + 16);\n                float32x4_t _pd = vld1q_f32(p0 + B_hstep * 4 + 20);\n                float32x4_t _pe = vld1q_f32(p0 + B_hstep * 4 + 24);\n                float32x4_t _pf = vld1q_f32(p0 + B_hstep * 4 + 28);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n        
        _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p8);\n                int8x8_t _r1 = float2int8(_p1, _p9);\n                int8x8_t _r2 = float2int8(_p2, _pa);\n                int8x8_t _r3 = float2int8(_p3, _pb);\n                int8x8_t _r4 = float2int8(_p4, _pc);\n                int8x8_t _r5 = float2int8(_p5, _pd);\n                int8x8_t _r6 = float2int8(_p6, _pe);\n                int8x8_t _r7 = float2int8(_p7, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                
int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, _r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + 16);\n                float32x4_t _p5 = vld1q_f32(p0 + 20);\n                float32x4_t _p6 = vld1q_f32(p0 + 24);\n                float32x4_t _p7 = vld1q_f32(p0 + 28);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x8_t _r0 = 
float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n#if __ARM_FEATURE_DOTPROD\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8x2_t _rr = vuzpq_s16(_r01, _r23);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep + 4);\n                float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 3 + 4);\n                float32x4_t _p8 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p9 = vld1q_f32(p0 + B_hstep * 4 + 4);\n                float32x4_t _pa = vld1q_f32(p0 + B_hstep * 5);\n                float32x4_t _pb = vld1q_f32(p0 + B_hstep * 5 + 4);\n                float32x4_t _pc = vld1q_f32(p0 + B_hstep * 6);\n                float32x4_t _pd = vld1q_f32(p0 + B_hstep * 6 + 4);\n                float32x4_t _pe = vld1q_f32(p0 + B_hstep * 7);\n                float32x4_t _pf = vld1q_f32(p0 + B_hstep * 7 + 4);\n\n                
_p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n                int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n                int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n                int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n                int8x16x4_t _r0123;\n                _r0123.val[0] = vcombine_s8(_r04.val[0], _r04.val[1]);\n                _r0123.val[1] = vcombine_s8(_r15.val[0], _r15.val[1]);\n                _r0123.val[2] = vcombine_s8(_r26.val[0], _r26.val[1]);\n                _r0123.val[3] = vcombine_s8(_r37.val[0], _r37.val[1]);\n\n                vst4q_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = _r0;\n                _r0123.val[1] = _r1;\n                _r0123.val[2] = _r2;\n 
               _r0123.val[3] = _r3;\n                int8x8x4_t _r4567;\n                _r4567.val[0] = _r4;\n                _r4567.val[1] = _r5;\n                _r4567.val[2] = _r6;\n                _r4567.val[3] = _r7;\n\n                vst4_s8(pp, _r0123);\n                vst4_s8(pp + 32, _r4567);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(_r0, _r2);\n                _r01.val[1] = vcombine_s8(_r1, _r3);\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(_r4, _r6);\n                _r23.val[1] = vcombine_s8(_r5, _r7);\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep + 4);\n                float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 2 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 3);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 3 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n                _r0123.val[2] = 
float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += B_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj) * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 
= vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n                float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 4 + 4);\n                float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 4 + 8);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 4 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t 
_r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + 8);\n                float32x4_t _p3 = vld1q_f32(p0 + 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep * 3);\n                float32x4_t _p4 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p5 = vld1q_f32(p0 + B_hstep * 5);\n                
float32x4_t _p6 = vld1q_f32(p0 + B_hstep * 6);\n                float32x4_t _p7 = vld1q_f32(p0 + B_hstep * 7);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                float32x4x2_t _p04 = vzipq_f32(_p0, _p4);\n                float32x4x2_t _p15 = vzipq_f32(_p1, _p5);\n                float32x4x2_t _p26 = vzipq_f32(_p2, _p6);\n                float32x4x2_t _p37 = vzipq_f32(_p3, _p7);\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p04.val[0], _p04.val[1]);\n                _r0123.val[1] = float2int8(_p15.val[0], _p15.val[1]);\n                _r0123.val[2] = float2int8(_p26.val[0], _p26.val[1]);\n                _r0123.val[3] = float2int8(_p37.val[0], _p37.val[1]);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p4);\n                _r0123.val[1] = float2int8(_p1, _p5);\n                _r0123.val[2] = float2int8(_p2, _p6);\n                _r0123.val[3] = float2int8(_p3, _p7);\n\n                vst4_s8(pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                
float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep * 2);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep * 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                transpose4x4_ps(_p0, _p1, _p2, _p3);\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp[1] = float2int8(p0[1] * scale);\n                pp[2] = float2int8(p0[2] * scale);\n                pp[3] = float2int8(p0[3] * scale);\n                pp += 4;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj) * 
elempack;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep * 4 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r01 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = 
vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r01 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + B_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + B_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + B_hstep * 3);\n                float32x2_t _p4 = vld1_f32(p0 + B_hstep * 4);\n                float32x2_t _p5 = vld1_f32(p0 + B_hstep * 5);\n                float32x2_t _p6 = vld1_f32(p0 + B_hstep * 6);\n                float32x2_t _p7 = vld1_f32(p0 + B_hstep * 7);\n\n#if __ARM_FEATURE_DOTPROD\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n                float32x4_t _p45 = vcombine_f32(_p4, _p5);\n                float32x4_t _p67 = vcombine_f32(_p6, _p7);\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vuzp_s8(_r0, _r1);\n\n                vst1q_s8(pp, vcombine_s8(_r01.val[0], _r01.val[1]));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vtrn_s8(_r0, _r1);\n                int8x8x2_t _rr01 = vuzp_s8(_r01.val[0], _r01.val[1]);\n\n                vst1q_s8(pp, vcombine_s8(_rr01.val[0], _rr01.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p02 
= vcombine_f32(_p0, _p2);\n                float32x4_t _p46 = vcombine_f32(_p4, _p6);\n                float32x4_t _p13 = vcombine_f32(_p1, _p3);\n                float32x4_t _p57 = vcombine_f32(_p5, _p7);\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p46 = vmulq_f32(_p46, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n                _p57 = vmulq_f32(_p57, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p02, _p46);\n                _r01.val[1] = float2int8(_p13, _p57);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x2_t _p0 = vld1_f32(p0);\n                float32x2_t _p1 = vld1_f32(p0 + B_hstep);\n                float32x2_t _p2 = vld1_f32(p0 + B_hstep * 2);\n                float32x2_t _p3 = vld1_f32(p0 + B_hstep * 3);\n\n#if __ARM_FEATURE_DOTPROD\n                float32x4_t _p01 = vcombine_f32(_p0, _p1);\n                float32x4_t _p23 = vcombine_f32(_p2, _p3);\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                float32x4x2_t _pp = vuzpq_f32(_p01, _p23);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _p02 = vcombine_f32(_p0, _p2);\n                float32x4_t _p13 = vcombine_f32(_p1, _p3);\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n\n                float32x4x2_t _pp = vzipq_f32(_p02, _p13);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = 
float2int8(p0[0] * scale);\n                pp[1] = float2int8(p0[B_hstep + 0] * scale);\n                pp[2] = float2int8(p0[1] * scale);\n                pp[3] = float2int8(p0[B_hstep + 1] * scale);\n                pp += 4;\n                p0 += B_hstep * 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp[1] = float2int8(p0[1] * scale);\n                pp += 2;\n                p0 += B_hstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const float* p0 = (const float*)B + k * B_hstep + (j + jj) * elempack;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep * 4);\n                float32x4_t _p2 = vld1q_f32(p0 + B_hstep * 8);\n                float32x4_t _p3 = vld1q_f32(p0 + B_hstep * 12);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += B_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(p0);\n                float32x4_t _p1 = vld1q_f32(p0 + B_hstep * 4);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n       
     {\n                pp[0] = float2int8(p0[0] * scale);\n                pp[1] = float2int8(p0[1] * scale);\n                pp[2] = float2int8(p0[2] * scale);\n                pp[3] = float2int8(p0[3] * scale);\n                pp += 4;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = float32x4_t();\n                float32x4_t _p1 = float32x4_t();\n                float32x4_t _p2 = float32x4_t();\n                float32x4_t _p3 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n                _p0 = vsetq_lane_f32(p0[B_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[B_hstep * 2], _p0, 2);\n                _p0 = vsetq_lane_f32(p0[B_hstep * 3], _p0, 3);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 4], _p1, 0);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 5], _p1, 1);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 6], _p1, 2);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 7], _p1, 3);\n                _p2 = vsetq_lane_f32(p0[B_hstep * 8], _p2, 0);\n                _p2 = vsetq_lane_f32(p0[B_hstep * 9], _p2, 1);\n                _p2 = vsetq_lane_f32(p0[B_hstep * 10], _p2, 2);\n                _p2 = vsetq_lane_f32(p0[B_hstep * 11], _p2, 3);\n                _p3 = vsetq_lane_f32(p0[B_hstep * 12], _p3, 0);\n                _p3 = vsetq_lane_f32(p0[B_hstep * 13], _p3, 1);\n                _p3 = vsetq_lane_f32(p0[B_hstep * 14], _p3, 2);\n                _p3 = vsetq_lane_f32(p0[B_hstep * 15], _p3, 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n           
     vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += B_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = float32x4_t();\n                float32x4_t _p1 = float32x4_t();\n                _p0 = vsetq_lane_f32(p0[0], _p0, 0);\n                _p0 = vsetq_lane_f32(p0[B_hstep], _p0, 1);\n                _p0 = vsetq_lane_f32(p0[B_hstep * 2], _p0, 2);\n                _p0 = vsetq_lane_f32(p0[B_hstep * 3], _p0, 3);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 4], _p1, 0);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 5], _p1, 1);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 6], _p1, 2);\n                _p1 = vsetq_lane_f32(p0[B_hstep * 7], _p1, 3);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(p0[0] * scale);\n                pp += 1;\n                p0 += B_hstep;\n            }\n        }\n    }\n}\n\nstatic void unpack_output_tile_int32_to_fp32(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        unpack_output_tile_int32_to_fp32_asimddp(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const size_t c_hstep = C.dims == 3 ? 
C.cstep : (size_t)C.w;\n    const int c_elempack = C.elempack;\n    const float* pC = C;\n\n    // NCNN_LOGE(\"unpack_output_tile_int32_to_fp32  %d %d %d %d  %d  %d  %d\", i, max_ii, j, max_jj, out_elempack, broadcast_type_C, c_elempack);\n\n    const int* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float* p0 = (float*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        float32x4_t _descale0 = vld1q_f32((const float*)descales + i + ii);\n        float32x4_t _descale1 = vld1q_f32((const float*)descales + i + ii + 4);\n\n        float32x4_t _c0;\n        float32x4_t _c1;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(pC[0] * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n                _c0 = vld1q_f32(pC);\n                _c1 = vld1q_f32(pC + 4);\n                _c0 = vmulq_n_f32(_c0, beta);\n                _c1 = vmulq_n_f32(_c1, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const float*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n            int32x4_t _sum8 = vld1q_s32(pp + 32);\n            int32x4_t _sum9 = vld1q_s32(pp + 36);\n            int32x4_t 
_suma = vld1q_s32(pp + 40);\n            int32x4_t _sumb = vld1q_s32(pp + 44);\n            int32x4_t _sumc = vld1q_s32(pp + 48);\n            int32x4_t _sumd = vld1q_s32(pp + 52);\n            int32x4_t _sume = vld1q_s32(pp + 56);\n            int32x4_t _sumf = vld1q_s32(pp + 60);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e4 f5 g6 h7\n            //      e0 f1 g2 h3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      g4 h5 e6 f7\n            //      g0 h1 e2 f3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      e7 f6 g5 h4\n            //      e3 f2 g1 h0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      g7 h6 e5 f4\n            //      g3 h2 e1 f0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n            {\n               
 _sum8 = vrev64q_s32(_sum8);\n                _sum9 = vrev64q_s32(_sum9);\n                _suma = vrev64q_s32(_suma);\n                _sumb = vrev64q_s32(_sumb);\n                _sumc = vrev64q_s32(_sumc);\n                _sumd = vrev64q_s32(_sumd);\n                _sume = vrev64q_s32(_sume);\n                _sumf = vrev64q_s32(_sumf);\n                _sum8 = vextq_s32(_sum8, _sum8, 2);\n                _sum9 = vextq_s32(_sum9, _sum9, 2);\n                _suma = vextq_s32(_suma, _suma, 2);\n                _sumb = vextq_s32(_sumb, _sumb, 2);\n                _sumc = vextq_s32(_sumc, _sumc, 2);\n                _sumd = vextq_s32(_sumd, _sumd, 2);\n                _sume = vextq_s32(_sume, _sume, 2);\n                _sumf = vextq_s32(_sumf, _sumf, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n               
 _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                _sum9 = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                _suma = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                _sumb = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                _sumc = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                _sumd = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                _sume = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum9 = vrev64q_s32(_sum9);\n                _sumb = vrev64q_s32(_sumb);\n                _sumd = vrev64q_s32(_sumd);\n                _sumf = vrev64q_s32(_sumf);\n            }\n#endif\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum8), _descale0);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum9), _descale0);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_suma), _descale0);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sumb), _descale0);\n            float32x4_t _f8 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f9 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _fa = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _fb = vmulq_f32(vcvtq_f32_s32(_sum7), 
_descale1);\n            float32x4_t _fc = vmulq_f32(vcvtq_f32_s32(_sumc), _descale1);\n            float32x4_t _fd = vmulq_f32(vcvtq_f32_s32(_sumd), _descale1);\n            float32x4_t _fe = vmulq_f32(vcvtq_f32_s32(_sume), _descale1);\n            float32x4_t _ff = vmulq_f32(vcvtq_f32_s32(_sumf), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c0);\n                    _fa = vaddq_f32(_fa, _c0);\n                    _fb = vaddq_f32(_fb, _c0);\n                    _fc = vaddq_f32(_fc, _c0);\n                    _fd = vaddq_f32(_fd, _c0);\n                    _fe = vaddq_f32(_fe, _c0);\n                    _ff = vaddq_f32(_ff, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c1);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c1);\n                    _fb = vaddq_f32(_fb, _c1);\n                    _fc = vaddq_f32(_fc, _c1);\n                    _fd = vaddq_f32(_fd, _c1);\n                    _fe = 
vaddq_f32(_fe, _c1);\n                    _ff = vaddq_f32(_ff, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        float32x4_t _c2 = vld1q_f32(pC + 4 * 2);\n                        float32x4_t _c3 = vld1q_f32(pC + 4 * 3);\n                        float32x4_t _c4 = vld1q_f32(pC + 4 * 4);\n                        float32x4_t _c5 = vld1q_f32(pC + 4 * 5);\n                        float32x4_t _c6 = vld1q_f32(pC + 4 * 6);\n                        float32x4_t _c7 = vld1q_f32(pC + 4 * 7);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = 
vld1q_f32(pC + c_hstep * 4 + 4);\n                        _c2 = vld1q_f32(pC + c_hstep * 4 + 4 * 2);\n                        _c3 = vld1q_f32(pC + c_hstep * 4 + 4 * 3);\n                        _c4 = vld1q_f32(pC + c_hstep * 4 + 4 * 4);\n                        _c5 = vld1q_f32(pC + c_hstep * 4 + 4 * 5);\n                        _c6 = vld1q_f32(pC + c_hstep * 4 + 4 * 6);\n                        _c7 = vld1q_f32(pC + c_hstep * 4 + 4 * 7);\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        float32x4_t _c2 = vld1q_f32(pC + c_hstep);\n                        float32x4_t _c3 = vld1q_f32(pC + c_hstep + 
4);\n                        float32x4_t _c4 = vld1q_f32(pC + c_hstep * 2);\n                        float32x4_t _c5 = vld1q_f32(pC + c_hstep * 2 + 4);\n                        float32x4_t _c6 = vld1q_f32(pC + c_hstep * 3);\n                        float32x4_t _c7 = vld1q_f32(pC + c_hstep * 3 + 4);\n                        transpose8x4_ps(_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = vld1q_f32(pC + c_hstep * 4 + 4);\n                        _c2 = vld1q_f32(pC + c_hstep * 5);\n                        _c3 = vld1q_f32(pC + c_hstep * 5 + 4);\n                        _c4 = vld1q_f32(pC + c_hstep * 6);\n                        _c5 = vld1q_f32(pC + c_hstep * 6 + 4);\n                        _c6 = vld1q_f32(pC + 
c_hstep * 7);\n                        _c7 = vld1q_f32(pC + c_hstep * 7 + 4);\n                        transpose8x4_ps(_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7);\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 8;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _cc0 = vld1q_f32(pC);\n                    float32x4_t _cc1 = vld1q_f32(pC + 4);\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    _c1 = vdupq_laneq_f32(_cc0, 1);\n                    
float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c2);\n                    _fb = vaddq_f32(_fb, _c3);\n                    _fc = vaddq_f32(_fc, _c4);\n                    _fd = vaddq_f32(_fd, _c5);\n                    _fe = vaddq_f32(_fe, _c6);\n                    _ff = vaddq_f32(_ff, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n                _f8 = vmulq_f32(_f8, _alpha);\n                _f9 = vmulq_f32(_f9, _alpha);\n                _fa = vmulq_f32(_fa, _alpha);\n                _fb = vmulq_f32(_fb, _alpha);\n                _fc = vmulq_f32(_fc, _alpha);\n                _fd = vmulq_f32(_fd, _alpha);\n                _fe = 
vmulq_f32(_fe, _alpha);\n                _ff = vmulq_f32(_ff, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n                vst1q_f32(p0 + 8, _f2);\n                vst1q_f32(p0 + 12, _f3);\n                vst1q_f32(p0 + 16, _f4);\n                vst1q_f32(p0 + 20, _f5);\n                vst1q_f32(p0 + 24, _f6);\n                vst1q_f32(p0 + 28, _f7);\n                vst1q_f32(p0 + out_hstep * 4, _f8);\n                vst1q_f32(p0 + out_hstep * 4 + 4, _f9);\n                vst1q_f32(p0 + out_hstep * 4 + 8, _fa);\n                vst1q_f32(p0 + out_hstep * 4 + 12, _fb);\n                vst1q_f32(p0 + out_hstep * 4 + 16, _fc);\n                vst1q_f32(p0 + out_hstep * 4 + 20, _fd);\n                vst1q_f32(p0 + out_hstep * 4 + 24, _fe);\n                vst1q_f32(p0 + out_hstep * 4 + 28, _ff);\n                p0 += 32;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_ps(_f0, _f1, _f2, _f3);\n                transpose4x4_ps(_f4, _f5, _f6, _f7);\n                transpose4x4_ps(_f8, _f9, _fa, _fb);\n                transpose4x4_ps(_fc, _fd, _fe, _ff);\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f4);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep + 4, _f5);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 2 + 4, _f6);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n                vst1q_f32(p0 + out_hstep * 3 + 4, _f7);\n                vst1q_f32(p0 + out_hstep * 4, _f8);\n                vst1q_f32(p0 + out_hstep * 4 + 4, _fc);\n                vst1q_f32(p0 + out_hstep * 5, _f9);\n                vst1q_f32(p0 + out_hstep * 5 + 4, _fd);\n                vst1q_f32(p0 + out_hstep * 6, _fa);\n                vst1q_f32(p0 + out_hstep * 6 + 4, _fe);\n                vst1q_f32(p0 + 
out_hstep * 7, _fb);\n                vst1q_f32(p0 + out_hstep * 7 + 4, _ff);\n                p0 += 8;\n            }\n\n            pp += 64;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e0 f1 g2 h3\n            //      c0 d1 a2 b3\n            //      g0 h1 e2 f3\n            //      a3 b2 c1 d0\n            //      e3 f2 g1 h0\n            //      c3 d2 a1 b0\n            //      g3 h2 e1 f0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n      
          int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n               
     _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c1);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c1);\n                    _f7 = vaddq_f32(_f7, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + 8);\n                        _c3 = vld1q_f32(pC + 12);\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + c_hstep);\n                        _c2 = vld1q_f32(pC + c_hstep * 2);\n                        _c3 = vld1q_f32(pC + c_hstep * 3);\n                        transpose4x4_ps(_c0, _c1, _c2, _c3);\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n     
                   _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = vld1q_f32(pC + c_hstep * 4 + 4);\n                        _c2 = vld1q_f32(pC + c_hstep * 4 + 8);\n                        _c3 = vld1q_f32(pC + c_hstep * 4 + 12);\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = vld1q_f32(pC + c_hstep * 5);\n                        _c2 = vld1q_f32(pC + c_hstep * 6);\n                        _c3 = vld1q_f32(pC + c_hstep * 7);\n                        transpose4x4_ps(_c0, _c1, _c2, _c3);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f4 = vaddq_f32(_f4, _c0);\n                        _f5 = vaddq_f32(_f5, _c1);\n                        _f6 = vaddq_f32(_f6, _c2);\n                        _f7 = vaddq_f32(_f7, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f4 = vmlaq_f32(_f4, _c0, _beta);\n                        _f5 = vmlaq_f32(_f5, _c1, _beta);\n                        _f6 = vmlaq_f32(_f6, _c2, _beta);\n                        _f7 = vmlaq_f32(_f7, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = vld1q_f32(pC);\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = 
vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c2);\n                    _f7 = vaddq_f32(_f7, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n                vst1q_f32(p0 + 8, _f2);\n                vst1q_f32(p0 + 12, _f3);\n                vst1q_f32(p0 + out_hstep * 4, _f4);\n                vst1q_f32(p0 + out_hstep * 4 + 4, _f5);\n                vst1q_f32(p0 + out_hstep * 4 + 8, _f6);\n                vst1q_f32(p0 + out_hstep * 4 + 12, _f7);\n                p0 += 16;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_ps(_f0, _f1, _f2, _f3);\n                transpose4x4_ps(_f4, _f5, _f6, _f7);\n                vst1q_f32(p0, 
_f0);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n                vst1q_f32(p0 + out_hstep * 4, _f4);\n                vst1q_f32(p0 + out_hstep * 5, _f5);\n                vst1q_f32(p0 + out_hstep * 6, _f6);\n                vst1q_f32(p0 + out_hstep * 7, _f7);\n                p0 += 4;\n            }\n\n            pp += 32;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      e0 f1 g0 h1\n            //      a1 b0 c1 d0\n            //      e1 f0 g1 h0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), 
_descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale1);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + c_hstep * 4);\n                        _c3 = vld1q_f32(pC + c_hstep * 4 + 4);\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        float32x2_t _cc0 = vld1_f32(pC);\n                        float32x2_t _cc1 = vld1_f32(pC + c_hstep);\n                        float32x2_t _cc2 = vld1_f32(pC + c_hstep * 2);\n                        float32x2_t _cc3 = vld1_f32(pC + c_hstep * 3);\n                        float32x4_t _c01 = vcombine_f32(_cc0, _cc1);\n                        float32x4_t _c23 = vcombine_f32(_cc2, _cc3);\n                        float32x4x2_t _ccc0 = vuzpq_f32(_c01, _c23);\n                        _c0 = _ccc0.val[0];\n                        _c1 = _ccc0.val[1];\n                        
float32x2_t _cc4 = vld1_f32(pC + c_hstep * 4);\n                        float32x2_t _cc5 = vld1_f32(pC + c_hstep * 5);\n                        float32x2_t _cc6 = vld1_f32(pC + c_hstep * 6);\n                        float32x2_t _cc7 = vld1_f32(pC + c_hstep * 7);\n                        float32x4_t _c45 = vcombine_f32(_cc4, _cc5);\n                        float32x4_t _c67 = vcombine_f32(_cc6, _cc7);\n                        float32x4x2_t _ccc1 = vuzpq_f32(_c45, _c67);\n                        _c2 = _ccc1.val[0];\n                        _c3 = _ccc1.val[1];\n                        pC += 2;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x2_t _c = vld1_f32(pC);\n                    _c = vmul_n_f32(_c, beta);\n                    _c0 = vdupq_lane_f32(_c, 0);\n                    _c1 = vdupq_lane_f32(_c, 1);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = 
vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n                vst1q_f32(p0 + out_hstep * 4, _f2);\n                vst1q_f32(p0 + out_hstep * 4 + 4, _f3);\n                p0 += 8;\n            }\n            if (out_elempack == 1)\n            {\n                float32x4x2_t _f01 = vzipq_f32(_f0, _f1);\n                float32x4x2_t _f23 = vzipq_f32(_f2, _f3);\n                vst1_f32(p0, vget_low_f32(_f01.val[0]));\n                vst1_f32(p0 + out_hstep, vget_high_f32(_f01.val[0]));\n                vst1_f32(p0 + out_hstep * 2, vget_low_f32(_f01.val[1]));\n                vst1_f32(p0 + out_hstep * 3, vget_high_f32(_f01.val[1]));\n                vst1_f32(p0 + out_hstep * 4, vget_low_f32(_f23.val[0]));\n                vst1_f32(p0 + out_hstep * 5, vget_high_f32(_f23.val[0]));\n                vst1_f32(p0 + out_hstep * 6, vget_low_f32(_f23.val[1]));\n                vst1_f32(p0 + out_hstep * 7, vget_high_f32(_f23.val[1]));\n                p0 += 2;\n            }\n\n            pp += 16;\n        }\n        for (; jj < max_jj; jj++)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n     
           {\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + c_hstep * 4);\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vsetq_lane_f32(pC[0], _c0, 0);\n                        _c0 = vsetq_lane_f32(pC[c_hstep], _c0, 1);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 2], _c0, 2);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 3], _c0, 3);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 4], _c1, 0);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 5], _c1, 1);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 6], _c1, 2);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 7], _c1, 3);\n                        pC += 1;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(pC[0] * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 
out_hstep * 4, _f1);\n                p0 += 4;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vgetq_lane_f32(_f0, 0);\n                p0[out_hstep] = vgetq_lane_f32(_f0, 1);\n                p0[out_hstep * 2] = vgetq_lane_f32(_f0, 2);\n                p0[out_hstep * 3] = vgetq_lane_f32(_f0, 3);\n                p0[out_hstep * 4] = vgetq_lane_f32(_f1, 0);\n                p0[out_hstep * 5] = vgetq_lane_f32(_f1, 1);\n                p0[out_hstep * 6] = vgetq_lane_f32(_f1, 2);\n                p0[out_hstep * 7] = vgetq_lane_f32(_f1, 3);\n                p0++;\n            }\n\n            pp += 8;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float* p0 = (float*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        float32x4_t _descale = vld1q_f32((const float*)descales + i + ii);\n\n        float32x4_t _c0;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(pC[0] * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n                _c0 = vld1q_f32(pC);\n                _c0 = vmulq_n_f32(_c0, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const float*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            
int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = 
vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n  
                  _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    float32x4_t _c4;\n                    float32x4_t _c5;\n                    float32x4_t _c6;\n                    float32x4_t _c7;\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + 8);\n                        _c3 = vld1q_f32(pC + 12);\n                        _c4 = vld1q_f32(pC + 16);\n                        _c5 = vld1q_f32(pC + 20);\n                        _c6 = vld1q_f32(pC + 24);\n                        _c7 = vld1q_f32(pC + 28);\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + c_hstep);\n                        _c3 = vld1q_f32(pC + c_hstep + 4);\n                        _c4 = vld1q_f32(pC + c_hstep * 2);\n                        _c5 = vld1q_f32(pC + c_hstep * 2 + 4);\n                        _c6 = vld1q_f32(pC + c_hstep * 3);\n                        _c7 = vld1q_f32(pC + c_hstep * 3 + 4);\n                        transpose8x4_ps(_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7);\n                        pC += 8;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                        _f4 = vaddq_f32(_f4, _c4);\n 
                       _f5 = vaddq_f32(_f5, _c5);\n                        _f6 = vaddq_f32(_f6, _c6);\n                        _f7 = vaddq_f32(_f7, _c7);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        _f4 = vmlaq_f32(_f4, _c4, _beta);\n                        _f5 = vmlaq_f32(_f5, _c5, _beta);\n                        _f6 = vmlaq_f32(_f6, _c6, _beta);\n                        _f7 = vmlaq_f32(_f7, _c7, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _cc0 = vld1q_f32(pC);\n                    float32x4_t _cc1 = vld1q_f32(pC + 4);\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n          
          _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n                vst1q_f32(p0 + 8, _f2);\n                vst1q_f32(p0 + 12, _f3);\n                vst1q_f32(p0 + 16, _f4);\n                vst1q_f32(p0 + 20, _f5);\n                vst1q_f32(p0 + 24, _f6);\n                vst1q_f32(p0 + 28, _f7);\n                p0 += 32;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_ps(_f0, _f1, _f2, _f3);\n                transpose4x4_ps(_f4, _f5, _f6, _f7);\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f4);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep + 4, _f5);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 2 + 4, _f6);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n                vst1q_f32(p0 + out_hstep * 3 + 4, _f7);\n                p0 += 8;\n            }\n\n            pp += 32;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t 
_sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      c0 d1 a2 b3\n            //      a3 b2 c1 d0\n            //      c3 d2 a1 b0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum2 = vextq_s32(_sum2, _sum2, 2);\n                _sum3 = vextq_s32(_sum3, _sum3, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if 
(broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + 8);\n                        _c3 = vld1q_f32(pC + 12);\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + c_hstep * 1);\n                        _c2 = vld1q_f32(pC + c_hstep * 2);\n                        _c3 = vld1q_f32(pC + c_hstep * 3);\n                        transpose4x4_ps(_c0, _c1, _c2, _c3);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = vld1q_f32(pC);\n              
      _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    float32x4_t _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n                vst1q_f32(p0 + 8, _f2);\n                vst1q_f32(p0 + 12, _f3);\n                p0 += 16;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_ps(_f0, _f1, _f2, _f3);\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n                p0 += 4;\n            }\n\n            pp += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n  
          //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      a1 b0 c1 d0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            {\n                _sum1 = vrev64q_s32(_sum1);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        float32x2_t _cc0 = vld1_f32(pC);\n                        float32x2_t _cc1 = vld1_f32(pC + c_hstep);\n                        float32x2_t _cc2 = vld1_f32(pC + c_hstep * 2);\n                        float32x2_t _cc3 = vld1_f32(pC + c_hstep * 3);\n                        float32x4_t _c01 = vcombine_f32(_cc0, _cc1);\n                        float32x4_t _c23 = vcombine_f32(_cc2, 
_cc3);\n                        float32x4x2_t _cc = vuzpq_f32(_c01, _c23);\n                        _c0 = _cc.val[0];\n                        _c1 = _cc.val[1];\n                        pC += 2;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x2_t _c = vld1_f32(pC);\n                    _c = vmul_n_f32(_c, beta);\n                    _c0 = vdupq_lane_f32(_c, 0);\n                    float32x4_t _c1 = vdupq_lane_f32(_c, 1);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n                p0 += 8;\n            }\n            if (out_elempack == 1)\n            {\n                float32x4x2_t _f01 = vzipq_f32(_f0, _f1);\n                vst1_f32(p0, vget_low_f32(_f01.val[0]));\n                vst1_f32(p0 + out_hstep, vget_high_f32(_f01.val[0]));\n                vst1_f32(p0 + out_hstep * 2, vget_low_f32(_f01.val[1]));\n                vst1_f32(p0 + out_hstep * 3, vget_high_f32(_f01.val[1]));\n                p0 += 2;\n            }\n\n            pp += 8;\n        }\n        for (; jj < max_jj; 
jj++)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vsetq_lane_f32(pC[0], _c0, 0);\n                        _c0 = vsetq_lane_f32(pC[c_hstep], _c0, 1);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 2], _c0, 2);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 3], _c0, 3);\n                        pC += 1;\n                    }\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(pC[0] * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    pC += 1;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                p0 += 4;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vgetq_lane_f32(_f0, 0);\n                p0[out_hstep] = vgetq_lane_f32(_f0, 1);\n                p0[out_hstep * 2] = vgetq_lane_f32(_f0, 2);\n                p0[out_hstep * 3] = vgetq_lane_f32(_f0, 3);\n                p0++;\n            }\n\n            pp += 4;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        // out_elempack == 1\n        
float* p0 = (float*)top_blob + (i + ii) * out_hstep + j;\n\n        const float descale0 = descales[i + ii];\n        const float descale1 = descales[i + ii + 1];\n#if __ARM_NEON\n        float32x2_t _descale = vld1_f32((const float*)descales + i + ii);\n#endif\n\n        float c0;\n        float c1;\n#if __ARM_NEON\n        float32x4_t _c0;\n        float32x4_t _c1;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = pC[0] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n                c0 = pC[0] * beta;\n                c1 = pC[1] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n                _c1 = vdupq_n_f32(c1);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const float*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale, 0);\n            float32x4_t _f2 = vmulq_lane_f32(vcvtq_f32_s32(_sum2), _descale, 1);\n            float32x4_t _f3 = vmulq_lane_f32(vcvtq_f32_s32(_sum3), _descale, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 
= vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    _c1 = vld1q_f32(pC + 4);\n                    float32x4_t _c2 = vld1q_f32(pC + c_hstep);\n                    float32x4_t _c3 = vld1q_f32(pC + c_hstep + 4);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 8;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vld1q_f32(pC);\n                    _c1 = vld1q_f32(pC + 4);\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _c0 = vmulq_f32(_c0, _beta);\n                        _c1 = vmulq_f32(_c1, _beta);\n                    }\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    
_f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_f32(p0, _f0);\n            vst1q_f32(p0 + 4, _f1);\n            vst1q_f32(p0 + out_hstep, _f2);\n            vst1q_f32(p0 + out_hstep + 4, _f3);\n\n            pp += 16;\n            p0 += 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    _c1 = vld1q_f32(pC + c_hstep);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n             
           _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 4;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vld1q_f32(pC);\n                    _c0 = vmulq_n_f32(_c0, beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_f32(p0, _f0);\n            vst1q_f32(p0 + out_hstep, _f1);\n\n            pp += 8;\n            p0 += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n\n            float32x2x2_t _descale01 = vzip_f32(_descale, _descale);\n            float32x4_t _descale0011 = vcombine_f32(_descale01.val[0], _descale01.val[1]);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0011);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    float32x4_t _c0011 = vcombine_f32(vget_low_f32(_c0), vget_high_f32(_c1));\n                    _f0 = vaddq_f32(_f0, _c0011);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = vcombine_f32(vld1_f32(pC), vld1_f32(pC + c_hstep));\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x2_t _c = vld1_f32(pC);\n                    _c0 = vcombine_f32(_c, 
_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1_f32(p0, vget_low_f32(_f0));\n            vst1_f32(p0 + out_hstep, vget_high_f32(_f0));\n\n            pp += 4;\n            p0 += 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj++)\n        {\n            float f0 = pp[0] * descale0;\n            float f1 = pp[1] * descale1;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    f0 += c0;\n                    f1 += c0;\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                    f1 += c1;\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    f0 += pC[0] * beta;\n                    f1 += pC[c_hstep] * beta;\n                    pC += 1;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    f0 += pC[0] * beta;\n                    f1 += pC[0] * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n            f1 *= alpha;\n\n            p0[0] = f0;\n            p0[out_hstep] = f1;\n\n            pp += 2;\n            p0++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        // out_elempack == 1\n        float* p0 = (float*)top_blob + (i + ii) * out_hstep + j;\n\n        const float descale = descales[i + ii];\n#if __ARM_NEON\n        float32x4_t _descale = vdupq_n_f32(descale);\n#endif\n\n        float c0;\n#if __ARM_NEON\n        float32x4_t _c0;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = pC[0] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n  
          if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n                c0 = pC[0] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const float*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n        for (; jj + 15 < max_jj; jj += 16)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // out_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    float32x4_t _c1 = vld1q_f32(pC + 4);\n                    float32x4_t _c2 = vld1q_f32(pC + 8);\n                    float32x4_t _c3 = vld1q_f32(pC + 12);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n  
                      _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 16;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_f32(p0, _f0);\n            vst1q_f32(p0 + 4, _f1);\n            vst1q_f32(p0 + 8, _f2);\n            vst1q_f32(p0 + 12, _f3);\n\n            pp += 16;\n            p0 += 16;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // out_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    float32x4_t _c1 = vld1q_f32(pC + 4);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n               
         _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_f32(p0, _f0);\n            vst1q_f32(p0 + 4, _f1);\n\n            pp += 8;\n            p0 += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // out_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 4;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1q_f32(p0, _f0);\n\n            pp += 4;\n            p0 += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x2_t _f0 = vmul_f32(vcvt_f32_s32(vld1_s32(pp)), vget_low_f32(_descale));\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vadd_f32(_f0, vget_low_f32(_c0));\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // 
out_elempack == 1\n                    float32x2_t _c = vld1_f32(pC);\n                    _f0 = vmla_n_f32(_f0, _c, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmul_n_f32(_f0, alpha);\n\n            vst1_f32(p0, _f0);\n\n            pp += 2;\n            p0 += 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj++)\n        {\n            float f0 = pp[0] * descale;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // out_elempack == 1\n                    f0 += pC[0] * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n\n            p0[0] = f0;\n\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void transpose_unpack_output_tile_int32_to_fp32(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_unpack_output_tile_int32_to_fp32_asimddp(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const size_t c_hstep = C.dims == 3 ? 
C.cstep : (size_t)C.w;\n    const int c_elempack = C.elempack;\n    const float* pC = C;\n\n    // NCNN_LOGE(\"transpose_unpack_output_tile_int32_to_fp32  %d %d %d %d  %d  %d  %d\", i, max_ii, j, max_jj, out_elempack, broadcast_type_C, c_elempack);\n\n    const int* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        float32x4_t _descale0 = vld1q_f32((const float*)descales + i + ii);\n        float32x4_t _descale1 = vld1q_f32((const float*)descales + i + ii + 4);\n\n        float32x4_t _c0;\n        float32x4_t _c1;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(pC[0] * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n                _c0 = vld1q_f32(pC);\n                _c1 = vld1q_f32(pC + 4);\n                _c0 = vmulq_n_f32(_c0, beta);\n                _c1 = vmulq_n_f32(_c1, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const float*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n            int32x4_t _sum8 = vld1q_s32(pp + 32);\n            int32x4_t _sum9 = vld1q_s32(pp + 36);\n            
int32x4_t _suma = vld1q_s32(pp + 40);\n            int32x4_t _sumb = vld1q_s32(pp + 44);\n            int32x4_t _sumc = vld1q_s32(pp + 48);\n            int32x4_t _sumd = vld1q_s32(pp + 52);\n            int32x4_t _sume = vld1q_s32(pp + 56);\n            int32x4_t _sumf = vld1q_s32(pp + 60);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e4 f5 g6 h7\n            //      e0 f1 g2 h3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      g4 h5 e6 f7\n            //      g0 h1 e2 f3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      e7 f6 g5 h4\n            //      e3 f2 g1 h0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      g7 h6 e5 f4\n            //      g3 h2 e1 f0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n            {\n     
           _sum8 = vrev64q_s32(_sum8);\n                _sum9 = vrev64q_s32(_sum9);\n                _suma = vrev64q_s32(_suma);\n                _sumb = vrev64q_s32(_sumb);\n                _sumc = vrev64q_s32(_sumc);\n                _sumd = vrev64q_s32(_sumd);\n                _sume = vrev64q_s32(_sume);\n                _sumf = vrev64q_s32(_sumf);\n                _sum8 = vextq_s32(_sum8, _sum8, 2);\n                _sum9 = vextq_s32(_sum9, _sum9, 2);\n                _suma = vextq_s32(_suma, _suma, 2);\n                _sumb = vextq_s32(_sumb, _sumb, 2);\n                _sumc = vextq_s32(_sumc, _sumc, 2);\n                _sumd = vextq_s32(_sumd, _sumd, 2);\n                _sume = vextq_s32(_sume, _sume, 2);\n                _sumf = vextq_s32(_sumf, _sumf, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n     
           _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                _sum9 = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                _suma = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                _sumb = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                _sumc = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                _sumd = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                _sume = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum9 = vrev64q_s32(_sum9);\n                _sumb = vrev64q_s32(_sumb);\n                _sumd = vrev64q_s32(_sumd);\n                _sumf = vrev64q_s32(_sumf);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum8), _descale0);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum9), _descale0);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_suma), _descale0);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sumb), _descale0);\n            float32x4_t _f8 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f9 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _fa = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _fb = 
vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n            float32x4_t _fc = vmulq_f32(vcvtq_f32_s32(_sumc), _descale1);\n            float32x4_t _fd = vmulq_f32(vcvtq_f32_s32(_sumd), _descale1);\n            float32x4_t _fe = vmulq_f32(vcvtq_f32_s32(_sume), _descale1);\n            float32x4_t _ff = vmulq_f32(vcvtq_f32_s32(_sumf), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c0);\n                    _fa = vaddq_f32(_fa, _c0);\n                    _fb = vaddq_f32(_fb, _c0);\n                    _fc = vaddq_f32(_fc, _c0);\n                    _fd = vaddq_f32(_fd, _c0);\n                    _fe = vaddq_f32(_fe, _c0);\n                    _ff = vaddq_f32(_ff, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c1);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c1);\n                    _fb = vaddq_f32(_fb, _c1);\n                    _fc = vaddq_f32(_fc, _c1);\n                    _fd = vaddq_f32(_fd, _c1);\n       
             _fe = vaddq_f32(_fe, _c1);\n                    _ff = vaddq_f32(_ff, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        float32x4_t _c2 = vld1q_f32(pC + 8);\n                        float32x4_t _c3 = vld1q_f32(pC + 12);\n                        float32x4_t _c4 = vld1q_f32(pC + 16);\n                        float32x4_t _c5 = vld1q_f32(pC + 20);\n                        float32x4_t _c6 = vld1q_f32(pC + 24);\n                        float32x4_t _c7 = vld1q_f32(pC + 28);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = 
vld1q_f32(pC + c_hstep * 4 + 4);\n                        _c2 = vld1q_f32(pC + c_hstep * 4 + 8);\n                        _c3 = vld1q_f32(pC + c_hstep * 4 + 12);\n                        _c4 = vld1q_f32(pC + c_hstep * 4 + 16);\n                        _c5 = vld1q_f32(pC + c_hstep * 4 + 20);\n                        _c6 = vld1q_f32(pC + c_hstep * 4 + 24);\n                        _c7 = vld1q_f32(pC + c_hstep * 4 + 28);\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        float32x4_t _c2 = vld1q_f32(pC + c_hstep);\n                        float32x4_t _c3 = vld1q_f32(pC + c_hstep + 4);\n                  
      float32x4_t _c4 = vld1q_f32(pC + c_hstep * 2);\n                        float32x4_t _c5 = vld1q_f32(pC + c_hstep * 2 + 4);\n                        float32x4_t _c6 = vld1q_f32(pC + c_hstep * 3);\n                        float32x4_t _c7 = vld1q_f32(pC + c_hstep * 3 + 4);\n                        transpose8x4_ps(_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = vld1q_f32(pC + c_hstep * 4 + 4);\n                        _c2 = vld1q_f32(pC + c_hstep * 5);\n                        _c3 = vld1q_f32(pC + c_hstep * 5 + 4);\n                        _c4 = vld1q_f32(pC + c_hstep * 6);\n                        _c5 = vld1q_f32(pC + c_hstep * 6 + 4);\n                        _c6 = vld1q_f32(pC + c_hstep * 7);\n        
                _c7 = vld1q_f32(pC + c_hstep * 7 + 4);\n                        transpose8x4_ps(_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7);\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 8;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _cc0 = vld1q_f32(pC);\n                    float32x4_t _cc1 = vld1q_f32(pC + 4);\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = 
vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c2);\n                    _fb = vaddq_f32(_fb, _c3);\n                    _fc = vaddq_f32(_fc, _c4);\n                    _fd = vaddq_f32(_fd, _c5);\n                    _fe = vaddq_f32(_fe, _c6);\n                    _ff = vaddq_f32(_ff, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n                _f8 = vmulq_f32(_f8, _alpha);\n                _f9 = vmulq_f32(_f9, _alpha);\n                _fa = vmulq_f32(_fa, _alpha);\n                _fb = vmulq_f32(_fb, _alpha);\n                _fc = vmulq_f32(_fc, _alpha);\n                _fd = vmulq_f32(_fd, _alpha);\n                _fe = vmulq_f32(_fe, _alpha);\n     
           _ff = vmulq_f32(_ff, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                float32x4x4_t _ffa;\n                float32x4x4_t _ffb;\n                float32x4x4_t _ffc;\n                float32x4x4_t _ffd;\n                _ffa.val[0] = _f0;\n                _ffa.val[1] = _f1;\n                _ffa.val[2] = _f2;\n                _ffa.val[3] = _f3;\n                _ffb.val[0] = _f4;\n                _ffb.val[1] = _f5;\n                _ffb.val[2] = _f6;\n                _ffb.val[3] = _f7;\n                _ffc.val[0] = _f8;\n                _ffc.val[1] = _f9;\n                _ffc.val[2] = _fa;\n                _ffc.val[3] = _fb;\n                _ffd.val[0] = _fc;\n                _ffd.val[1] = _fd;\n                _ffd.val[2] = _fe;\n                _ffd.val[3] = _ff;\n                vst4q_f32(p0, _ffa);\n                vst4q_f32(p0 + 16, _ffc);\n                vst4q_f32(p0 + out_hstep * 4, _ffb);\n                vst4q_f32(p0 + out_hstep * 4 + 16, _ffd);\n            }\n            if (out_elempack == 1)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f8);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep + 4, _f9);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 2 + 4, _fa);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n                vst1q_f32(p0 + out_hstep * 3 + 4, _fb);\n                vst1q_f32(p0 + out_hstep * 4, _f4);\n                vst1q_f32(p0 + out_hstep * 4 + 4, _fc);\n                vst1q_f32(p0 + out_hstep * 5, _f5);\n                vst1q_f32(p0 + out_hstep * 5 + 4, _fd);\n                vst1q_f32(p0 + out_hstep * 6, _f6);\n                vst1q_f32(p0 + out_hstep * 6 + 4, _fe);\n                vst1q_f32(p0 + out_hstep * 7, _f7);\n                vst1q_f32(p0 + out_hstep * 7 + 4, _ff);\n            }\n\n            pp += 64;\n            p0 += 
out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e0 f1 g2 h3\n            //      c0 d1 a2 b3\n            //      g0 h1 e2 f3\n            //      a3 b2 c1 d0\n            //      e3 f2 g1 h0\n            //      c3 d2 a1 b0\n            //      g3 h2 e1 f0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n             
   int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, 
_c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c1);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c1);\n                    _f7 = vaddq_f32(_f7, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        float32x4_t _c2 = vld1q_f32(pC + 8);\n                        float32x4_t _c3 = vld1q_f32(pC + 12);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = vld1q_f32(pC + c_hstep * 4 + 4);\n                        _c2 = vld1q_f32(pC + c_hstep * 4 + 8);\n                        _c3 = vld1q_f32(pC + c_hstep * 4 + 12);\n                        if (beta == 1.f)\n                   
     {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + c_hstep);\n                        float32x4_t _c2 = vld1q_f32(pC + c_hstep * 2);\n                        float32x4_t _c3 = vld1q_f32(pC + c_hstep * 3);\n                        transpose4x4_ps(_c0, _c1, _c2, _c3);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _c0 = vld1q_f32(pC + c_hstep * 4);\n                        _c1 = vld1q_f32(pC + c_hstep * 5);\n                        _c2 = vld1q_f32(pC + c_hstep * 6);\n  
                      _c3 = vld1q_f32(pC + c_hstep * 7);\n                        transpose4x4_ps(_c0, _c1, _c2, _c3);\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 4;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _cc = vld1q_f32(pC);\n                    _cc = vmulq_n_f32(_cc, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_cc, 0);\n                    _c1 = vdupq_laneq_f32(_cc, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_cc), 0);\n                    _c1 = vdupq_lane_f32(vget_low_f32(_cc), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_cc), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_cc), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c2);\n                    _f7 = 
vaddq_f32(_f7, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                float32x4x4_t _fa;\n                float32x4x4_t _fb;\n                _fa.val[0] = _f0;\n                _fa.val[1] = _f1;\n                _fa.val[2] = _f2;\n                _fa.val[3] = _f3;\n                _fb.val[0] = _f4;\n                _fb.val[1] = _f5;\n                _fb.val[2] = _f6;\n                _fb.val[3] = _f7;\n                vst4q_f32(p0, _fa);\n                vst4q_f32(p0 + 16, _fb);\n            }\n            if (out_elempack == 1)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f4);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep + 4, _f5);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 2 + 4, _f6);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n                vst1q_f32(p0 + out_hstep * 3 + 4, _f7);\n            }\n\n            pp += 32;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            
//      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      e0 f1 g0 h1\n            //      a1 b0 c1 d0\n            //      e1 f0 g1 h0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale1);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n        
        }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 1)\n                    {\n                        float32x2_t _cc0 = vld1_f32(pC);\n                        float32x2_t _cc1 = vld1_f32(pC + c_hstep);\n                        float32x2_t _cc2 = vld1_f32(pC + c_hstep * 2);\n                        float32x2_t _cc3 = vld1_f32(pC + c_hstep * 3);\n                        float32x2_t _cc4 = vld1_f32(pC + c_hstep * 4);\n                        float32x2_t _cc5 = vld1_f32(pC + c_hstep * 5);\n                        float32x2_t _cc6 = vld1_f32(pC + c_hstep * 6);\n                        float32x2_t _cc7 = vld1_f32(pC + c_hstep * 7);\n                        float32x4_t _cc01 = vcombine_f32(_cc0, _cc1);\n                        float32x4_t _cc23 = vcombine_f32(_cc2, _cc3);\n                        float32x4_t _cc45 = vcombine_f32(_cc4, _cc5);\n                        float32x4_t _cc67 = vcombine_f32(_cc6, _cc7);\n                        float32x4x2_t _ccc0 = vuzpq_f32(_cc01, _cc23);\n                        float32x4x2_t _ccc1 = vuzpq_f32(_cc45, _cc67);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _ccc0.val[0]);\n                            _f1 = vaddq_f32(_f1, _ccc0.val[1]);\n                            _f2 = vaddq_f32(_f2, _ccc1.val[0]);\n                            _f3 = vaddq_f32(_f3, _ccc1.val[1]);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _ccc0.val[0], _beta);\n                            _f1 = vmlaq_f32(_f1, _ccc0.val[1], _beta);\n                            _f2 = vmlaq_f32(_f2, _ccc1.val[0], _beta);\n                            _f3 = vmlaq_f32(_f3, _ccc1.val[1], _beta);\n                        }\n                        pC += 2;\n                    }\n                    
else // if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        float32x4_t _c2 = vld1q_f32(pC + c_hstep * 4);\n                        float32x4_t _c3 = vld1q_f32(pC + c_hstep * 4 + 4);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        pC += 8;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x2_t _cc = vld1_f32(pC);\n                    _cc = vmul_n_f32(_cc, beta);\n                    _c0 = vdupq_lane_f32(_cc, 0);\n                    _c1 = vdupq_lane_f32(_cc, 1);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_f32(p0, _f0);\n            vst1q_f32(p0 + 4, 
_f2);\n            vst1q_f32(p0 + out_hstep, _f1);\n            vst1q_f32(p0 + out_hstep + 4, _f3);\n\n            pp += 16;\n            p0 += out_hstep * 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp + 4)), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vsetq_lane_f32(pC[0], _c0, 0);\n                        _c0 = vsetq_lane_f32(pC[c_hstep], _c0, 1);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 2], _c0, 2);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 3], _c0, 3);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 4], _c1, 0);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 5], _c1, 1);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 6], _c1, 2);\n                        _c1 = vsetq_lane_f32(pC[c_hstep * 7], _c1, 3);\n                        pC += 1;\n                    }\n                    else // if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + c_hstep * 4);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    
else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(pC[0] * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_f32(p0, _f0);\n            vst1q_f32(p0 + 4, _f1);\n            pp += 8;\n            p0 += out_hstep;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        float32x4_t _descale = vld1q_f32((const float*)descales + i + ii);\n\n        float32x4_t _c0;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(pC[0] * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n                _c0 = vld1q_f32(pC);\n                _c0 = vmulq_n_f32(_c0, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const float*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 
8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n               
 _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n 
                   _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    float32x4_t _c4;\n                    float32x4_t _c5;\n                    float32x4_t _c6;\n                    float32x4_t _c7;\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + 8);\n                        _c3 = vld1q_f32(pC + 12);\n                        _c4 = vld1q_f32(pC + 16);\n                        _c5 = vld1q_f32(pC + 20);\n                        _c6 = vld1q_f32(pC + 24);\n                        _c7 = vld1q_f32(pC + 28);\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + c_hstep);\n                        _c3 = vld1q_f32(pC + c_hstep + 4);\n                        _c4 = vld1q_f32(pC + c_hstep * 2);\n                        _c5 = vld1q_f32(pC + c_hstep * 2 + 4);\n                        _c6 = vld1q_f32(pC + c_hstep * 3);\n                        _c7 = vld1q_f32(pC + c_hstep * 3 + 4);\n                        transpose8x4_ps(_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7);\n                        pC += 8;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                
        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                        _f4 = vaddq_f32(_f4, _c4);\n                        _f5 = vaddq_f32(_f5, _c5);\n                        _f6 = vaddq_f32(_f6, _c6);\n                        _f7 = vaddq_f32(_f7, _c7);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        _f4 = vmlaq_f32(_f4, _c4, _beta);\n                        _f5 = vmlaq_f32(_f5, _c5, _beta);\n                        _f6 = vmlaq_f32(_f6, _c6, _beta);\n                        _f7 = vmlaq_f32(_f7, _c7, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _cc0 = vld1q_f32(pC);\n                    float32x4_t _cc1 = vld1q_f32(pC + 4);\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n         
           _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                float32x4x4_t _fa;\n                float32x4x4_t _fb;\n                _fa.val[0] = _f0;\n                _fa.val[1] = _f1;\n                _fa.val[2] = _f2;\n                _fa.val[3] = _f3;\n                _fb.val[0] = _f4;\n                _fb.val[1] = _f5;\n                _fb.val[2] = _f6;\n                _fb.val[3] = _f7;\n                vst4q_f32(p0, _fa);\n                vst4q_f32(p0 + out_hstep * 4, _fb);\n            }\n            if (out_elempack == 1)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n                vst1q_f32(p0 + out_hstep * 4, _f4);\n                vst1q_f32(p0 + out_hstep * 5, _f5);\n                vst1q_f32(p0 + out_hstep * 6, _f6);\n                vst1q_f32(p0 + out_hstep * 7, _f7);\n            }\n\n            pp += 32;\n            p0 += out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 
4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      c0 d1 a2 b3\n            //      a3 b2 c1 d0\n            //      c3 d2 a1 b0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum2 = vextq_s32(_sum2, _sum2, 2);\n                _sum3 = vextq_s32(_sum3, _sum3, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n      
              _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        _c2 = vld1q_f32(pC + 8);\n                        _c3 = vld1q_f32(pC + 12);\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + c_hstep);\n                        _c2 = vld1q_f32(pC + c_hstep * 2);\n                        _c3 = vld1q_f32(pC + c_hstep * 3);\n                        transpose4x4_ps(_c0, _c1, _c2, _c3);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, 
_beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _cc = vld1q_f32(pC);\n                    _cc = vmulq_n_f32(_cc, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_cc, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_cc, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_cc), 0);\n                    float32x4_t _c1 = vdupq_lane_f32(vget_low_f32(_cc), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_cc), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_cc), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                float32x4x4_t _f;\n                _f.val[0] = _f0;\n                _f.val[1] = _f1;\n                _f.val[2] = _f2;\n                _f.val[3] = _f3;\n                vst4q_f32(p0, _f);\n            }\n            if (out_elempack == 1)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + out_hstep, _f1);\n                vst1q_f32(p0 + out_hstep * 2, _f2);\n                vst1q_f32(p0 + out_hstep * 3, _f3);\n            }\n\n            pp += 16;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 
2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      a1 b0 c1 d0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            {\n                _sum1 = vrev64q_s32(_sum1);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    if (c_elempack == 1)\n                    {\n                        float32x2_t _cc0 = vld1_f32(pC);\n                        float32x2_t _cc1 = vld1_f32(pC + c_hstep);\n                        float32x2_t _cc2 = vld1_f32(pC + c_hstep * 2);\n                        float32x2_t _cc3 = vld1_f32(pC + c_hstep * 3);\n                        float32x4_t _cc01 = vcombine_f32(_cc0, _cc1);\n                        float32x4_t _cc23 = vcombine_f32(_cc2, _cc3);\n                        float32x4x2_t _cc = 
vuzpq_f32(_cc01, _cc23);\n                        _c0 = _cc.val[0];\n                        _c1 = _cc.val[1];\n                        pC += 2;\n                    }\n                    else // if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        _c1 = vld1q_f32(pC + 4);\n                        pC += 8;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x2_t _c = vld1_f32(pC);\n                    _c = vmul_n_f32(_c, beta);\n                    _c0 = vdupq_lane_f32(_c, 0);\n                    float32x4_t _c1 = vdupq_lane_f32(_c, 1);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_f32(p0, _f0);\n            vst1q_f32(p0 + out_hstep, _f1);\n\n            pp += 8;\n            p0 += out_hstep * 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || 
broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 1)\n                    {\n                        _c0 = vsetq_lane_f32(pC[0], _c0, 0);\n                        _c0 = vsetq_lane_f32(pC[c_hstep], _c0, 1);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 2], _c0, 2);\n                        _c0 = vsetq_lane_f32(pC[c_hstep * 3], _c0, 3);\n                        pC += 1;\n                    }\n                    else // if (c_elempack == 4)\n                    {\n                        _c0 = vld1q_f32(pC);\n                        pC += 4;\n                    }\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(pC[0] * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    pC += 1;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1q_f32(p0, _f0);\n            pp += 4;\n            p0 += out_hstep;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        const float descale0 = descales[i + ii];\n        const float descale1 = descales[i + ii + 1];\n#if __ARM_NEON\n        float32x2_t _descale01 = vld1_f32((const float*)descales + i + ii);\n#endif\n\n        float c0;\n        float c1;\n#if __ARM_NEON\n        float32x4_t _c0;\n        float32x4_t _c1;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = pC[0] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n   
             c0 = pC[0] * beta;\n                c1 = pC[1] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n                _c1 = vdupq_n_f32(c1);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const float*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale01, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale01, 0);\n            float32x4_t _f2 = vmulq_lane_f32(vcvtq_f32_s32(_sum2), _descale01, 1);\n            float32x4_t _f3 = vmulq_lane_f32(vcvtq_f32_s32(_sum3), _descale01, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    _c1 = vld1q_f32(pC + 4);\n                    float32x4_t _c2 = vld1q_f32(pC + c_hstep);\n        
            float32x4_t _c3 = vld1q_f32(pC + c_hstep + 4);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 8;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vld1q_f32(pC);\n                    _c1 = vld1q_f32(pC + 4);\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _c0 = vmulq_f32(_c0, _beta);\n                        _c1 = vmulq_f32(_c1, _beta);\n                    }\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f2);\n                vst1q_f32(p0 + out_hstep * 4, _f1);\n                vst1q_f32(p0 + out_hstep * 4 + 4, _f3);\n            
}\n            if (out_elempack == 1)\n            {\n                float32x4x2_t _f02 = vzipq_f32(_f0, _f2);\n                float32x4x2_t _f13 = vzipq_f32(_f1, _f3);\n                vst1_f32(p0, vget_low_f32(_f02.val[0]));\n                vst1_f32(p0 + out_hstep, vget_high_f32(_f02.val[0]));\n                vst1_f32(p0 + out_hstep * 2, vget_low_f32(_f02.val[1]));\n                vst1_f32(p0 + out_hstep * 3, vget_high_f32(_f02.val[1]));\n                vst1_f32(p0 + out_hstep * 4, vget_low_f32(_f13.val[0]));\n                vst1_f32(p0 + out_hstep * 5, vget_high_f32(_f13.val[0]));\n                vst1_f32(p0 + out_hstep * 6, vget_low_f32(_f13.val[1]));\n                vst1_f32(p0 + out_hstep * 7, vget_high_f32(_f13.val[1]));\n            }\n\n            pp += 16;\n            p0 += out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            // a0 a1 a2 a3\n            // b0 b1 b2 b3\n\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale01, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale01, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    _c1 = vld1q_f32(pC + c_hstep);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = 
vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 4;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vld1q_f32(pC);\n                    _c0 = vmulq_n_f32(_c0, beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            if (out_elempack == 4)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n            }\n            if (out_elempack == 1)\n            {\n                float32x4x2_t _f01 = vzipq_f32(_f0, _f1);\n                vst1_f32(p0, vget_low_f32(_f01.val[0]));\n                vst1_f32(p0 + out_hstep, vget_high_f32(_f01.val[0]));\n                vst1_f32(p0 + out_hstep * 2, vget_low_f32(_f01.val[1]));\n                vst1_f32(p0 + out_hstep * 3, vget_high_f32(_f01.val[1]));\n            }\n\n            pp += 8;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            // a0 a1 b0 b1\n            int32x2x2_t _sum0 = vld2_s32(pp);\n\n            float32x4_t _descale = vcombine_f32(_descale01, _descale01);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vcombine_s32(_sum0.val[0], _sum0.val[1])), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if 
(broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    float32x4_t _cc = vzipq_f32(_c0, _c1).val[0];\n                    _f0 = vaddq_f32(_f0, _cc);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    float32x2_t _cc0 = vld1_f32(pC);\n                    float32x2_t _cc1 = vld1_f32(pC + c_hstep);\n                    float32x2x2_t _c01 = vzip_f32(_cc0, _cc1);\n                    _c0 = vcombine_f32(_c01.val[0], _c01.val[1]);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x2_t _cc = vld1_f32(pC);\n                    float32x2x2_t _c01 = vzip_f32(_cc, _cc);\n                    _c0 = vcombine_f32(_c01.val[0], _c01.val[1]);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1_f32(p0, vget_low_f32(_f0));\n            vst1_f32(p0 + out_hstep, vget_high_f32(_f0));\n\n            pp += 4;\n            p0 += out_hstep * 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj += 1)\n        {\n            float f0 = pp[0] * descale0;\n            float f1 = pp[1] * descale1;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    f0 += c0;\n                    f1 += c0;\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                    f1 += c1;\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    f0 += pC[0] * beta;\n                    f1 += pC[c_hstep] * beta;\n                    pC += 1;\n                }\n       
         if (broadcast_type_C == 4)\n                {\n                    f0 += pC[0] * beta;\n                    f1 += pC[0] * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n            f1 *= alpha;\n\n            p0[0] = f0;\n            p0[1] = f1;\n\n            pp += 2;\n            p0 += out_hstep;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        float* p0 = (float*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        const float descale = descales[i + ii];\n#if __ARM_NEON\n        float32x4_t _descale = vdupq_n_f32(descale);\n#endif\n\n        float c0;\n#if __ARM_NEON\n        float32x4_t _c0;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = pC[0] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const float*)C + i + ii;\n                c0 = pC[0] * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const float*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const float*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n        for (; jj + 15 < max_jj; jj += 16)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = 
vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    float32x4_t _c1 = vld1q_f32(pC + 4);\n                    float32x4_t _c2 = vld1q_f32(pC + 8);\n                    float32x4_t _c3 = vld1q_f32(pC + 12);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 16;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            if (out_hstep == 1)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n                vst1q_f32(p0 + 8, _f2);\n                vst1q_f32(p0 + 12, 
_f3);\n            }\n            else\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(p0, _f0);\n                    vst1q_f32(p0 + out_hstep * 4, _f1);\n                    vst1q_f32(p0 + out_hstep * 8, _f2);\n                    vst1q_f32(p0 + out_hstep * 12, _f3);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vgetq_lane_f32(_f0, 0);\n                    p0[out_hstep] = vgetq_lane_f32(_f0, 1);\n                    p0[out_hstep * 2] = vgetq_lane_f32(_f0, 2);\n                    p0[out_hstep * 3] = vgetq_lane_f32(_f0, 3);\n                    p0[out_hstep * 4] = vgetq_lane_f32(_f1, 0);\n                    p0[out_hstep * 5] = vgetq_lane_f32(_f1, 1);\n                    p0[out_hstep * 6] = vgetq_lane_f32(_f1, 2);\n                    p0[out_hstep * 7] = vgetq_lane_f32(_f1, 3);\n                    p0[out_hstep * 8] = vgetq_lane_f32(_f2, 0);\n                    p0[out_hstep * 9] = vgetq_lane_f32(_f2, 1);\n                    p0[out_hstep * 10] = vgetq_lane_f32(_f2, 2);\n                    p0[out_hstep * 11] = vgetq_lane_f32(_f2, 3);\n                    p0[out_hstep * 12] = vgetq_lane_f32(_f3, 0);\n                    p0[out_hstep * 13] = vgetq_lane_f32(_f3, 1);\n                    p0[out_hstep * 14] = vgetq_lane_f32(_f3, 2);\n                    p0[out_hstep * 15] = vgetq_lane_f32(_f3, 3);\n                }\n            }\n\n            pp += 16;\n            p0 += out_hstep * 16;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n          
          _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    float32x4_t _c1 = vld1q_f32(pC + 4);\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            if (out_hstep == 1)\n            {\n                vst1q_f32(p0, _f0);\n                vst1q_f32(p0 + 4, _f1);\n            }\n            else\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(p0, _f0);\n                    vst1q_f32(p0 + out_hstep * 4, _f1);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vgetq_lane_f32(_f0, 0);\n                    p0[out_hstep] = vgetq_lane_f32(_f0, 1);\n                    p0[out_hstep * 2] = vgetq_lane_f32(_f0, 2);\n                    p0[out_hstep * 3] = vgetq_lane_f32(_f0, 3);\n                    p0[out_hstep * 4] = vgetq_lane_f32(_f1, 0);\n                    p0[out_hstep * 5] = vgetq_lane_f32(_f1, 1);\n                    p0[out_hstep * 6] = vgetq_lane_f32(_f1, 2);\n                    p0[out_hstep * 7] = vgetq_lane_f32(_f1, 3);\n                }\n            
}\n\n            pp += 8;\n            p0 += out_hstep * 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    _c0 = vld1q_f32(pC);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 4;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            if (out_hstep == 1)\n            {\n                vst1q_f32(p0, _f0);\n            }\n            else\n            {\n                if (out_elempack == 4)\n                {\n                    vst1q_f32(p0, _f0);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vgetq_lane_f32(_f0, 0);\n                    p0[out_hstep] = vgetq_lane_f32(_f0, 1);\n                    p0[out_hstep * 2] = vgetq_lane_f32(_f0, 2);\n                    p0[out_hstep * 3] = vgetq_lane_f32(_f0, 3);\n                }\n            }\n\n            pp += 4;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x2_t _f0 = vmul_f32(vcvt_f32_s32(vld1_s32(pp)), vget_low_f32(_descale));\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vadd_f32(_f0, vget_low_f32(_c0));\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    float32x2_t _c = vld1_f32(pC);\n                    
_f0 = vmla_n_f32(_f0, _c, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmul_n_f32(_f0, alpha);\n\n            if (out_hstep == 1)\n            {\n                vst1_f32(p0, _f0);\n            }\n            else\n            {\n                p0[0] = vget_lane_f32(_f0, 0);\n                p0[out_hstep] = vget_lane_f32(_f0, 1);\n            }\n\n            pp += 2;\n            p0 += out_hstep * 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj += 1)\n        {\n            float f0 = pp[0] * descale;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    f0 += pC[0] * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n\n            p0[0] = f0;\n\n            pp += 1;\n            p0 += out_hstep;\n        }\n    }\n}\n\nstatic void gemm_transB_packed_tile_int8(const Mat& AT_tile, const Mat& BT_tile, Mat& topT_tile, int i, int max_ii, int j, int max_jj, int k, int max_kk)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        gemm_transB_packed_tile_int8_i8mm(AT_tile, BT_tile, topT_tile, i, max_ii, j, max_jj, k, max_kk);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        gemm_transB_packed_tile_int8_asimddp(AT_tile, BT_tile, topT_tile, i, max_ii, j, max_jj, k, max_kk);\n        return;\n    }\n#endif\n\n    // NCNN_LOGE(\"gemm_transB_packed_tile_int8 %d %d %d %d %d %d\", i, max_ii, j, max_jj, k, max_kk);\n\n    const signed 
char* pAT = AT_tile;\n    const signed char* pBT = BT_tile;\n\n    int* outptr = topT_tile;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n#if !__ARM_FEATURE_MATMUL_INT8\n                \"cmp    %w7, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"sub    %0, %0, #192                \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   
\\n\"\n\n                \"1:                                 \\n\"\n#endif // !__ARM_FEATURE_MATMUL_INT8\n\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v0.16b, v0.16b, v0.16b      \\n\"\n                \"eor    v1.16b, v1.16b, v1.16b      \\n\"\n                \"eor    v2.16b, v2.16b, v2.16b      \\n\"\n                \"eor    v3.16b, v3.16b, v3.16b      \\n\"\n                \"eor    v4.16b, v4.16b, v4.16b      \\n\"\n                \"eor    v5.16b, v5.16b, v5.16b      \\n\"\n                \"eor    v6.16b, v6.16b, v6.16b      \\n\"\n                \"eor    v7.16b, v7.16b, v7.16b      \\n\"\n                \"eor    v8.16b, v8.16b, v8.16b      \\n\"\n                \"eor    v9.16b, v9.16b, v9.16b      \\n\"\n                \"eor    v10.16b, v10.16b, v10.16b   \\n\"\n                \"eor    v11.16b, v11.16b, v11.16b   \\n\"\n                \"eor    v12.16b, v12.16b, v12.16b   \\n\"\n                \"eor    v13.16b, v13.16b, v13.16b   \\n\"\n                \"eor    v14.16b, v14.16b, v14.16b   \\n\"\n                \"eor    v15.16b, v15.16b, v15.16b   \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v16.16b, v17.16b, v18.16b, v19.16b}, [%1], #64 \\n\"\n                \"ld1    {v20.16b, v21.16b, v22.16b, v23.16b}, [%2], #64 \\n\"\n                \"smmla  v0.4s, v16.16b, v20.16b     \\n\"\n                \"smmla  v1.4s, v17.16b, v20.16b     \\n\"\n                \"smmla  v2.4s, v16.16b, v21.16b     \\n\"\n                \"smmla  v3.4s, v17.16b, v21.16b     \\n\"\n                \"smmla  v4.4s, v18.16b, v20.16b     \\n\"\n                \"smmla  v5.4s, v19.16b, v20.16b     \\n\"\n                \"smmla  v6.4s, v18.16b, v21.16b     \\n\"\n                \"smmla  
v7.4s, v19.16b, v21.16b     \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v8.4s, v16.16b, v22.16b     \\n\"\n                \"smmla  v9.4s, v17.16b, v22.16b     \\n\"\n                \"smmla  v10.4s, v16.16b, v23.16b    \\n\"\n                \"smmla  v11.4s, v17.16b, v23.16b    \\n\"\n                \"smmla  v12.4s, v18.16b, v22.16b    \\n\"\n                \"smmla  v13.4s, v19.16b, v22.16b    \\n\"\n                \"smmla  v14.4s, v18.16b, v23.16b    \\n\"\n                \"smmla  v15.4s, v19.16b, v23.16b    \\n\"\n                \"bne    2b                          \\n\"\n\n                \"uzp1   v16.4s, v0.4s, v1.4s        \\n\"\n                \"uzp2   v17.4s, v0.4s, v1.4s        \\n\"\n                \"uzp1   v18.4s, v2.4s, v3.4s        \\n\"\n                \"uzp2   v19.4s, v2.4s, v3.4s        \\n\"\n                \"uzp1   v20.4s, v4.4s, v5.4s        \\n\"\n                \"uzp2   v21.4s, v4.4s, v5.4s        \\n\"\n                \"uzp1   v22.4s, v6.4s, v7.4s        \\n\"\n                \"uzp2   v23.4s, v6.4s, v7.4s        \\n\"\n                \"uzp1   v24.4s, v8.4s, v9.4s        \\n\"\n                \"uzp2   v25.4s, v8.4s, v9.4s        \\n\"\n                \"uzp1   v26.4s, v10.4s, v11.4s      \\n\"\n                \"uzp2   v27.4s, v10.4s, v11.4s      \\n\"\n                \"uzp1   v28.4s, v12.4s, v13.4s      \\n\"\n                \"uzp2   v29.4s, v12.4s, v13.4s      \\n\"\n                \"uzp1   v30.4s, v14.4s, v15.4s      \\n\"\n                \"uzp2   v31.4s, v14.4s, v15.4s      \\n\"\n\n                \"cmp    %w7, #0                     \\n\"\n                \"beq    1f                          \\n\"\n\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64   \\n\"\n                \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%0], #64   \\n\"\n                \"ld1    {v8.4s, v9.4s, v10.4s, v11.4s}, [%0], #64 \\n\"\n                \"ld1    {v12.4s, v13.4s, v14.4s, 
v15.4s}, [%0]    \\n\"\n                \"sub    %0, %0, #192                \\n\"\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n                \"add    v20.4s, v20.4s, v4.4s       \\n\"\n                \"add    v21.4s, v21.4s, v5.4s       \\n\"\n                \"add    v22.4s, v22.4s, v6.4s       \\n\"\n                \"add    v23.4s, v23.4s, v7.4s       \\n\"\n                \"add    v24.4s, v24.4s, v8.4s       \\n\"\n                \"add    v25.4s, v25.4s, v9.4s       \\n\"\n                \"add    v26.4s, v26.4s, v10.4s      \\n\"\n                \"add    v27.4s, v27.4s, v11.4s      \\n\"\n                \"add    v28.4s, v28.4s, v12.4s      \\n\"\n                \"add    v29.4s, v29.4s, v13.4s      \\n\"\n                \"add    v30.4s, v30.4s, v14.4s      \\n\"\n                \"add    v31.4s, v31.4s, v15.4s      \\n\"\n                \"b      1f                          \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b, v2.16b, v3.16b}, [%1], #64 \\n\"\n                \"ld1    {v4.16b, v5.16b, v6.16b, v7.16b}, [%2], #64 \\n\"\n                \"sdot   v16.4s, v0.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v4.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v4.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v4.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v4.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v4.4b[3]    \\n\"\n                \"sdot   v24.4s, v0.16b, v5.4b[0]    \\n\"\n                \"sdot   v25.4s, v0.16b, v5.4b[1]    \\n\"\n                \"sdot   v26.4s, v0.16b, 
v5.4b[2]    \\n\"\n                \"sdot   v27.4s, v0.16b, v5.4b[3]    \\n\"\n                \"sdot   v28.4s, v1.16b, v5.4b[0]    \\n\"\n                \"sdot   v29.4s, v1.16b, v5.4b[1]    \\n\"\n                \"sdot   v30.4s, v1.16b, v5.4b[2]    \\n\"\n                \"sdot   v31.4s, v1.16b, v5.4b[3]    \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sdot   v16.4s, v2.16b, v6.4b[0]    \\n\"\n                \"sdot   v17.4s, v2.16b, v6.4b[1]    \\n\"\n                \"sdot   v18.4s, v2.16b, v6.4b[2]    \\n\"\n                \"sdot   v19.4s, v2.16b, v6.4b[3]    \\n\"\n                \"sdot   v20.4s, v3.16b, v6.4b[0]    \\n\"\n                \"sdot   v21.4s, v3.16b, v6.4b[1]    \\n\"\n                \"sdot   v22.4s, v3.16b, v6.4b[2]    \\n\"\n                \"sdot   v23.4s, v3.16b, v6.4b[3]    \\n\"\n                \"sdot   v24.4s, v2.16b, v7.4b[0]    \\n\"\n                \"sdot   v25.4s, v2.16b, v7.4b[1]    \\n\"\n                \"sdot   v26.4s, v2.16b, v7.4b[2]    \\n\"\n                \"sdot   v27.4s, v2.16b, v7.4b[3]    \\n\"\n                \"sdot   v28.4s, v3.16b, v7.4b[0]    \\n\"\n                \"sdot   v29.4s, v3.16b, v7.4b[1]    \\n\"\n                \"sdot   v30.4s, v3.16b, v7.4b[2]    \\n\"\n                \"sdot   v31.4s, v3.16b, v7.4b[3]    \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n#if __ARM_FEATURE_MATMUL_INT8\n                \"cmp    %w7, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"ld1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"ld1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0]      \\n\"\n                \"sub    %0, %0, #192        
        \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n                \"1:                                 \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"and    w4, %w6, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v2.16b, v3.16b}, [%2], #32 \\n\"\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v2.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v2.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v2.4b[2]    \\n\"\n                
\"sdot   v23.4s, v1.16b, v2.4b[3]    \\n\"\n                \"sdot   v24.4s, v0.16b, v3.4b[0]    \\n\"\n                \"sdot   v25.4s, v0.16b, v3.4b[1]    \\n\"\n                \"sdot   v26.4s, v0.16b, v3.4b[2]    \\n\"\n                \"sdot   v27.4s, v0.16b, v3.4b[3]    \\n\"\n                \"sdot   v28.4s, v1.16b, v3.4b[0]    \\n\"\n                \"sdot   v29.4s, v1.16b, v3.4b[1]    \\n\"\n                \"sdot   v30.4s, v1.16b, v3.4b[2]    \\n\"\n                \"sdot   v31.4s, v1.16b, v3.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v4.16b       \\n\"\n                \"rev64  v2.4s, v0.4s                \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull2 v11.8h, v2.16b, v4.16b      \\n\"\n                \"rev64  v6.8h, v4.8h                \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v6.16b      \\n\"\n                \"rev64  v3.4s, v1.4s                \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"smull2 v15.8h, v2.16b, v6.16b      \\n\"\n                \"rev64  v7.8h, v5.8h                \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal2 v9.8h, v1.16b, v5.16b       \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v11.8h, v3.16b, v5.16b      \\n\"\n                \"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, 
v7.16b      \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v15.8h, v3.16b, v7.16b      \\n\"\n                \"ext    v0.16b, v0.16b, v0.16b, #8  \\n\"\n                \"ext    v2.16b, v2.16b, v2.16b, #8  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v20.4s, v10.8h              \\n\"\n                \"sadalp v21.4s, v11.8h              \\n\"\n                \"ext    v1.16b, v1.16b, v1.16b, #8  \\n\"\n                \"ext    v3.16b, v3.16b, v3.16b, #8  \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v4.16b       \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull2 v11.8h, v2.16b, v4.16b      \\n\"\n                \"sadalp v24.4s, v12.8h              \\n\"\n                \"sadalp v25.4s, v13.8h              \\n\"\n                \"sadalp v28.4s, v14.8h              \\n\"\n                \"sadalp v29.4s, v15.8h              \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v6.16b      \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"smull2 v15.8h, v2.16b, v6.16b      \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal2 v9.8h, v1.16b, v5.16b       \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v11.8h, v3.16b, v5.16b      \\n\"\n                \"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, v7.16b      \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v15.8h, v3.16b, v7.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v18.4s, v8.8h               \\n\"\n                \"sadalp v19.4s, v9.8h               \\n\"\n 
               \"sadalp v22.4s, v10.8h              \\n\"\n                \"sadalp v23.4s, v11.8h              \\n\"\n                \"sadalp v26.4s, v12.8h              \\n\"\n                \"sadalp v27.4s, v13.8h              \\n\"\n                \"sadalp v30.4s, v14.8h              \\n\"\n                \"sadalp v31.4s, v15.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w6, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v1.16b}, [%2], #16         \\n\"\n                \"dup    v4.8h, v1.h[0]              \\n\"\n                \"dup    v5.8h, v1.h[1]              \\n\"\n                \"dup    v6.8h, v1.h[2]              \\n\"\n                \"dup    v7.8h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"smull2 v12.8h, v0.16b, v4.16b      \\n\"\n                \"smull2 v13.8h, v0.16b, v5.16b      \\n\"\n                \"smull2 v14.8h, v0.16b, v6.16b      \\n\"\n                \"smull2 v15.8h, v0.16b, v7.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                
\"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n                \"dup    v4.8h, v1.h[4]              \\n\"\n                \"dup    v5.8h, v1.h[5]              \\n\"\n                \"dup    v6.8h, v1.h[6]              \\n\"\n                \"dup    v7.8h, v1.h[7]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"smull2 v12.8h, v0.16b, v4.16b      \\n\"\n                \"smull2 v13.8h, v0.16b, v5.16b      \\n\"\n                \"smull2 v14.8h, v0.16b, v6.16b      \\n\"\n                \"smull2 v15.8h, v0.16b, v7.16b      \\n\"\n                \"sadalp v24.4s, v8.8h               \\n\"\n                \"sadalp v25.4s, v9.8h               \\n\"\n                \"sadalp v26.4s, v10.8h              \\n\"\n                \"sadalp v27.4s, v11.8h              \\n\"\n                \"sadalp v28.4s, v12.8h              \\n\"\n                \"sadalp v29.4s, v13.8h              \\n\"\n                \"sadalp v30.4s, v14.8h              \\n\"\n                \"sadalp v31.4s, v15.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n                \"rev64  v3.8h, v2.8h                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n   
             \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v20.4s, v10.8h              \\n\"\n                \"sadalp v21.4s, v11.8h              \\n\"\n                \"sadalp v24.4s, v12.8h              \\n\"\n                \"sadalp v25.4s, v13.8h              \\n\"\n                \"sadalp v28.4s, v14.8h              \\n\"\n                \"sadalp v29.4s, v15.8h              \\n\"\n                \"ext    v0.16b, v0.16b, v0.16b, #8  \\n\"\n                \"ext    v1.16b, v1.16b, v1.16b, #8  \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n                \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v18.4s, v8.8h               \\n\"\n                \"sadalp v19.4s, v9.8h               \\n\"\n                \"sadalp v22.4s, v10.8h              \\n\"\n                \"sadalp v23.4s, v11.8h              \\n\"\n                \"sadalp v26.4s, v12.8h              \\n\"\n                \"sadalp v27.4s, v13.8h              \\n\"\n                \"sadalp v30.4s, v14.8h              \\n\"\n                \"sadalp v31.4s, v15.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w6, #1                 \\n\" // w4 = remain = max_kk & 1\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                
\"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"dup    v8.8b, v1.b[0]              \\n\"\n                \"dup    v9.8b, v1.b[1]              \\n\"\n                \"dup    v10.8b, v1.b[2]             \\n\"\n                \"dup    v11.8b, v1.b[3]             \\n\"\n                \"dup    v12.8b, v1.b[4]             \\n\"\n                \"dup    v13.8b, v1.b[5]             \\n\"\n                \"dup    v14.8b, v1.b[6]             \\n\"\n                \"dup    v15.8b, v1.b[7]             \\n\"\n                \"smull  v8.8h, v0.8b, v8.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v9.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v10.8b       \\n\"\n                \"smull  v11.8h, v0.8b, v11.8b       \\n\"\n                \"smull  v12.8h, v0.8b, v12.8b       \\n\"\n                \"smull  v13.8h, v0.8b, v13.8b       \\n\"\n                \"smull  v14.8h, v0.8b, v14.8b       \\n\"\n                \"smull  v15.8h, v0.8b, v15.8b       \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw  v17.4s, v17.4s, v9.4h       \\n\"\n                \"saddw  v18.4s, v18.4s, v10.4h      \\n\"\n                \"saddw  v19.4s, v19.4s, v11.4h      \\n\"\n                \"saddw2 v20.4s, v20.4s, v8.8h       \\n\"\n                \"saddw2 v21.4s, v21.4s, v9.8h       \\n\"\n                \"saddw2 v22.4s, v22.4s, v10.8h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n                \"saddw  v24.4s, v24.4s, v12.4h      \\n\"\n                \"saddw  v25.4s, v25.4s, v13.4h      \\n\"\n                \"saddw  v26.4s, v26.4s, v14.4h      \\n\"\n                \"saddw  v27.4s, v27.4s, v15.4h      \\n\"\n                \"saddw2 v28.4s, v28.4s, v12.8h      \\n\"\n                \"saddw2 v29.4s, v29.4s, v13.8h      \\n\"\n                \"saddw2 v30.4s, v30.4s, v14.8h      \\n\"\n                \"saddw2 v31.4s, 
v31.4s, v15.8h      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v4.8b}, [%2], #8           \\n\"\n                \"ext    v1.8b, v0.8b, v0.8b, #4     \\n\"\n                \"rev32  v2.4h, v0.4h                \\n\"\n                \"rev64  v3.4h, v0.4h                \\n\"\n                \"rev32  v5.8b, v4.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v4.8b         \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull  v11.8h, v3.8b, v4.8b        \\n\"\n                \"smull  v12.8h, v0.8b, v5.8b        \\n\"\n                \"smull  v13.8h, v1.8b, v5.8b        \\n\"\n                \"smull  v14.8h, v2.8b, v5.8b        \\n\"\n                \"smull  v15.8h, v3.8b, v5.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw2 v21.4s, v21.4s, v10.8h      \\n\"\n                \"saddw  v22.4s, v22.4s, v11.4h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n                \"saddw  v24.4s, v24.4s, v12.4h      \\n\"\n                \"saddw2 v25.4s, v25.4s, v12.8h      \\n\"\n                \"saddw  v26.4s, v26.4s, v13.4h      \\n\"\n                \"saddw2 v27.4s, v27.4s, v13.8h      \\n\"\n                \"saddw  v28.4s, v28.4s, v14.4h      \\n\"\n                \"saddw2 v29.4s, v29.4s, v14.8h      \\n\"\n                \"saddw  v30.4s, v30.4s, v15.4h      \\n\"\n                \"saddw2 v31.4s, v31.4s, v15.8h      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"st1    {v16.4s, v17.4s, 
v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n                \"st1    {v24.4s, v25.4s, v26.4s, v27.4s}, [%0], #64 \\n\"\n                \"st1    {v28.4s, v29.4s, v30.4s, v31.4s}, [%0], #64 \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB)      // %2\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"r\"(max_kk), // %6\n                \"r\"(k)       // %7\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else // NCNN_GNU_INLINE_ASM\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n            int32x4_t _sum4;\n            int32x4_t _sum5;\n            int32x4_t _sum6;\n            int32x4_t _sum7;\n            int32x4_t _sum8;\n            int32x4_t _sum9;\n            int32x4_t _suma;\n            int32x4_t _sumb;\n            int32x4_t _sumc;\n            int32x4_t _sumd;\n            int32x4_t _sume;\n            int32x4_t _sumf;\n\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n                _sum4 = vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n                _sum8 = vdupq_n_s32(0);\n                _sum9 = vdupq_n_s32(0);\n                _suma = vdupq_n_s32(0);\n                _sumb = vdupq_n_s32(0);\n                _sumc = vdupq_n_s32(0);\n                _sumd = 
vdupq_n_s32(0);\n                _sume = vdupq_n_s32(0);\n                _sumf = vdupq_n_s32(0);\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n                _sum4 = vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n                _sum8 = vdupq_n_s32(0);\n                _sum9 = vdupq_n_s32(0);\n                _suma = vdupq_n_s32(0);\n                _sumb = vdupq_n_s32(0);\n                _sumc = vdupq_n_s32(0);\n                _sumd = vdupq_n_s32(0);\n                _sume = vdupq_n_s32(0);\n                _sumf = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n                _sum4 = vld1q_s32(outptr + 16);\n                _sum5 = vld1q_s32(outptr + 20);\n                _sum6 = vld1q_s32(outptr + 24);\n                _sum7 = vld1q_s32(outptr + 28);\n                _sum8 = vld1q_s32(outptr + 32);\n                _sum9 = vld1q_s32(outptr + 36);\n                _suma = vld1q_s32(outptr + 40);\n                _sumb = vld1q_s32(outptr + 44);\n                _sumc = vld1q_s32(outptr + 48);\n                _sumd = vld1q_s32(outptr + 52);\n                _sume = vld1q_s32(outptr + 56);\n                _sumf = vld1q_s32(outptr + 60);\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    
int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                    _sum0 = vmmlaq_s32(_sum0, _pA0, _pB0);\n                    _sum1 = vmmlaq_s32(_sum1, _pA1, _pB0);\n                    _sum2 = vmmlaq_s32(_sum2, _pA0, _pB1);\n                    _sum3 = vmmlaq_s32(_sum3, _pA1, _pB1);\n                    _sum4 = vmmlaq_s32(_sum4, _pA2, _pB0);\n                    _sum5 = vmmlaq_s32(_sum5, _pA3, _pB0);\n                    _sum6 = vmmlaq_s32(_sum6, _pA2, _pB1);\n                    _sum7 = vmmlaq_s32(_sum7, _pA3, _pB1);\n                    _sum8 = vmmlaq_s32(_sum8, _pA0, _pB2);\n                    _sum9 = vmmlaq_s32(_sum9, _pA1, _pB2);\n                    _suma = vmmlaq_s32(_suma, _pA0, _pB3);\n                    _sumb = vmmlaq_s32(_sumb, _pA1, _pB3);\n                    _sumc = vmmlaq_s32(_sumc, _pA2, _pB2);\n                    _sumd = vmmlaq_s32(_sumd, _pA3, _pB2);\n                    _sume = vmmlaq_s32(_sume, _pA2, _pB3);\n                    _sumf = vmmlaq_s32(_sumf, _pA3, _pB3);\n\n                    pA += 64;\n                    pB += 64;\n                }\n\n                int32x4x2_t _ss0 = vuzpq_s32(_sum0, _sum1);\n                int32x4x2_t _ss1 = vuzpq_s32(_sum2, _sum3);\n                int32x4x2_t _ss2 = vuzpq_s32(_sum4, _sum5);\n                int32x4x2_t _ss3 = vuzpq_s32(_sum6, _sum7);\n                int32x4x2_t _ss4 = vuzpq_s32(_sum8, _sum9);\n                int32x4x2_t _ss5 = vuzpq_s32(_suma, _sumb);\n                int32x4x2_t _ss6 = vuzpq_s32(_sumc, _sumd);\n                int32x4x2_t _ss7 = vuzpq_s32(_sume, _sumf);\n\n                if (k == 0)\n                {\n                    _sum0 = _ss0.val[0];\n                    _sum1 = _ss0.val[1];\n              
      _sum2 = _ss1.val[0];\n                    _sum3 = _ss1.val[1];\n                    _sum4 = _ss2.val[0];\n                    _sum5 = _ss2.val[1];\n                    _sum6 = _ss3.val[0];\n                    _sum7 = _ss3.val[1];\n                    _sum8 = _ss4.val[0];\n                    _sum9 = _ss4.val[1];\n                    _suma = _ss5.val[0];\n                    _sumb = _ss5.val[1];\n                    _sumc = _ss6.val[0];\n                    _sumd = _ss6.val[1];\n                    _sume = _ss7.val[0];\n                    _sumf = _ss7.val[1];\n                }\n                else\n                {\n                    _sum0 = vld1q_s32(outptr);\n                    _sum1 = vld1q_s32(outptr + 4);\n                    _sum2 = vld1q_s32(outptr + 8);\n                    _sum3 = vld1q_s32(outptr + 12);\n                    _sum4 = vld1q_s32(outptr + 16);\n                    _sum5 = vld1q_s32(outptr + 20);\n                    _sum6 = vld1q_s32(outptr + 24);\n                    _sum7 = vld1q_s32(outptr + 28);\n                    _sum8 = vld1q_s32(outptr + 32);\n                    _sum9 = vld1q_s32(outptr + 36);\n                    _suma = vld1q_s32(outptr + 40);\n                    _sumb = vld1q_s32(outptr + 44);\n                    _sumc = vld1q_s32(outptr + 48);\n                    _sumd = vld1q_s32(outptr + 52);\n                    _sume = vld1q_s32(outptr + 56);\n                    _sumf = vld1q_s32(outptr + 60);\n\n                    _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                    _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                    _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                    _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n                    _sum4 = vaddq_s32(_sum4, _ss2.val[0]);\n                    _sum5 = vaddq_s32(_sum5, _ss2.val[1]);\n                    _sum6 = vaddq_s32(_sum6, _ss3.val[0]);\n                    _sum7 = vaddq_s32(_sum7, _ss3.val[1]);\n                    _sum8 = 
vaddq_s32(_sum8, _ss4.val[0]);\n                    _sum9 = vaddq_s32(_sum9, _ss4.val[1]);\n                    _suma = vaddq_s32(_suma, _ss5.val[0]);\n                    _sumb = vaddq_s32(_sumb, _ss5.val[1]);\n                    _sumc = vaddq_s32(_sumc, _ss6.val[0]);\n                    _sumd = vaddq_s32(_sumd, _ss6.val[1]);\n                    _sume = vaddq_s32(_sume, _ss7.val[0]);\n                    _sumf = vaddq_s32(_sumf, _ss7.val[1]);\n                }\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pA2 = vld1q_s8(pA + 32);\n                int8x16_t _pA3 = vld1q_s8(pA + 48);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n                int8x16_t _pB2 = vld1q_s8(pB + 32);\n                int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                // aaaa bbbb cccc dddd    eeee ffff gggg hhhh\n\n                // 0000 1111 2222 3333    4444 5555 6666 7777\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB0, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB0, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB0, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB0, 3);\n                _sum8 = vdotq_laneq_s32(_sum8, _pA0, _pB1, 0);\n                _sum9 = vdotq_laneq_s32(_sum9, _pA0, _pB1, 1);\n                _suma = vdotq_laneq_s32(_suma, _pA0, _pB1, 2);\n                _sumb = vdotq_laneq_s32(_sumb, _pA0, _pB1, 3);\n                _sumc = vdotq_laneq_s32(_sumc, _pA1, _pB1, 0);\n                _sumd = vdotq_laneq_s32(_sumd, 
_pA1, _pB1, 1);\n                _sume = vdotq_laneq_s32(_sume, _pA1, _pB1, 2);\n                _sumf = vdotq_laneq_s32(_sumf, _pA1, _pB1, 3);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA2, _pB2, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA2, _pB2, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA2, _pB2, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA2, _pB2, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA3, _pB2, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA3, _pB2, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA3, _pB2, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA3, _pB2, 3);\n                _sum8 = vdotq_laneq_s32(_sum8, _pA2, _pB3, 0);\n                _sum9 = vdotq_laneq_s32(_sum9, _pA2, _pB3, 1);\n                _suma = vdotq_laneq_s32(_suma, _pA2, _pB3, 2);\n                _sumb = vdotq_laneq_s32(_sumb, _pA2, _pB3, 3);\n                _sumc = vdotq_laneq_s32(_sumc, _pA3, _pB3, 0);\n                _sumd = vdotq_laneq_s32(_sumd, _pA3, _pB3, 1);\n                _sume = vdotq_laneq_s32(_sume, _pA3, _pB3, 2);\n                _sumf = vdotq_laneq_s32(_sumf, _pA3, _pB3, 3);\n\n                pA += 64;\n                pB += 64;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                // aaaa bbbb cccc dddd    eeee ffff gggg hhhh\n\n                // 0000 1111 2222 3333    4444 5555 6666 7777\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                
_sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB0, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB0, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB0, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB0, 3);\n                _sum8 = vdotq_laneq_s32(_sum8, _pA0, _pB1, 0);\n                _sum9 = vdotq_laneq_s32(_sum9, _pA0, _pB1, 1);\n                _suma = vdotq_laneq_s32(_suma, _pA0, _pB1, 2);\n                _sumb = vdotq_laneq_s32(_sumb, _pA0, _pB1, 3);\n                _sumc = vdotq_laneq_s32(_sumc, _pA1, _pB1, 0);\n                _sumd = vdotq_laneq_s32(_sumd, _pA1, _pB1, 1);\n                _sume = vdotq_laneq_s32(_sume, _pA1, _pB1, 2);\n                _sumf = vdotq_laneq_s32(_sumf, _pA1, _pB1, 3);\n\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB2 = vld1q_s8(pB + 16);\n\n                // aabbccdd eeffgghh\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n\n                // aabbccdd eeffgghh\n                // ccddaabb gghheeff\n\n                int8x16_t _pA3 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA2)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n                int8x16_t _pB3 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s3 = 
vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s7 = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s8 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _s9 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sa = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _sb = vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sc = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sd = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB1));\n                int16x8_t _se = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sf = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB1));\n\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), vget_low_s8(_pB2));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), vget_high_s8(_pB2));\n                _s2 = vmlal_s8(_s2, vget_high_s8(_pA2), vget_low_s8(_pB2));\n                _s3 = vmlal_s8(_s3, vget_low_s8(_pA2), vget_high_s8(_pB2));\n                _s4 = vmlal_s8(_s4, vget_low_s8(_pA3), vget_low_s8(_pB2));\n                _s5 = vmlal_s8(_s5, vget_high_s8(_pA3), vget_high_s8(_pB2));\n                _s6 = vmlal_s8(_s6, vget_high_s8(_pA3), vget_low_s8(_pB2));\n                _s7 = vmlal_s8(_s7, vget_low_s8(_pA3), vget_high_s8(_pB2));\n                _s8 = vmlal_s8(_s8, vget_low_s8(_pA2), vget_low_s8(_pB3));\n                _s9 = vmlal_s8(_s9, vget_high_s8(_pA2), vget_high_s8(_pB3));\n                _sa = vmlal_s8(_sa, vget_high_s8(_pA2), vget_low_s8(_pB3));\n                _sb = vmlal_s8(_sb, vget_low_s8(_pA2), vget_high_s8(_pB3));\n                _sc = vmlal_s8(_sc, 
vget_low_s8(_pA3), vget_low_s8(_pB3));\n                _sd = vmlal_s8(_sd, vget_high_s8(_pA3), vget_high_s8(_pB3));\n                _se = vmlal_s8(_se, vget_high_s8(_pA3), vget_low_s8(_pB3));\n                _sf = vmlal_s8(_sf, vget_low_s8(_pA3), vget_high_s8(_pB3));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n                _sum8 = vpadalq_s16(_sum8, _s8);\n                _sum9 = vpadalq_s16(_sum9, _s9);\n                _suma = vpadalq_s16(_suma, _sa);\n                _sumb = vpadalq_s16(_sumb, _sb);\n                _sumc = vpadalq_s16(_sumc, _sc);\n                _sumd = vpadalq_s16(_sumd, _sd);\n                _sume = vpadalq_s16(_sume, _se);\n                _sumf = vpadalq_s16(_sumf, _sf);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 32;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // 00112233 44556677\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 0)));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 1)));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 2)));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 3)));\n  
              int16x8_t _s4 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 0)));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 1)));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 2)));\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB)), 3)));\n                int16x8_t _s8 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 0)));\n                int16x8_t _s9 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 1)));\n                int16x8_t _sa = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 2)));\n                int16x8_t _sb = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 3)));\n                int16x8_t _sc = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 0)));\n                int16x8_t _sd = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 1)));\n                int16x8_t _se = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 2)));\n                int16x8_t _sf = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB)), 3)));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                
_sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n                _sum8 = vpadalq_s16(_sum8, _s8);\n                _sum9 = vpadalq_s16(_sum9, _s9);\n                _suma = vpadalq_s16(_suma, _sa);\n                _sumb = vpadalq_s16(_sumb, _sb);\n                _sumc = vpadalq_s16(_sumc, _sc);\n                _sumd = vpadalq_s16(_sumd, _sd);\n                _sume = vpadalq_s16(_sume, _se);\n                _sumf = vpadalq_s16(_sumf, _sf);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n\n                // 00112233 44556677\n\n                // 33221100 77665544\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB0));\n                int16x8_t _s7 = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB0));\n                int16x8_t _s8 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _s9 = vmull_s8(vget_high_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sa = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB1));\n                int16x8_t _sb = vmull_s8(vget_low_s8(_pA0), vget_high_s8(_pB1));\n                int16x8_t _sc = 
vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sd = vmull_s8(vget_high_s8(_pA1), vget_high_s8(_pB1));\n                int16x8_t _se = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB1));\n                int16x8_t _sf = vmull_s8(vget_low_s8(_pA1), vget_high_s8(_pB1));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n                _sum8 = vpadalq_s16(_sum8, _s8);\n                _sum9 = vpadalq_s16(_sum9, _s9);\n                _suma = vpadalq_s16(_suma, _sa);\n                _sumb = vpadalq_s16(_sumb, _sb);\n                _sumc = vpadalq_s16(_sumc, _sc);\n                _sumd = vpadalq_s16(_sumd, _sd);\n                _sume = vpadalq_s16(_sume, _se);\n                _sumf = vpadalq_s16(_sumf, _sf);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                // int8x8_t _pB0 = vld1_s8(pB);\n\n                // abcd efgh\n                // 0123 4567\n\n                int16x8_t _s01 = vmull_s8(_pA, vdup_n_s8(pB[0]));\n                int16x8_t _s23 = vmull_s8(_pA, vdup_n_s8(pB[1]));\n                int16x8_t _s45 = vmull_s8(_pA, vdup_n_s8(pB[2]));\n                int16x8_t _s67 = vmull_s8(_pA, vdup_n_s8(pB[3]));\n                int16x8_t _s89 = vmull_s8(_pA, vdup_n_s8(pB[4]));\n                int16x8_t _sab = vmull_s8(_pA, vdup_n_s8(pB[5]));\n                int16x8_t _scd = vmull_s8(_pA, vdup_n_s8(pB[6]));\n                int16x8_t _sef = vmull_s8(_pA, vdup_n_s8(pB[7]));\n\n                
_sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s23));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s45));\n                _sum3 = vaddw_s16(_sum3, vget_low_s16(_s67));\n                _sum4 = vaddw_s16(_sum4, vget_high_s16(_s01));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s23));\n                _sum6 = vaddw_s16(_sum6, vget_high_s16(_s45));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n                _sum8 = vaddw_s16(_sum8, vget_low_s16(_s89));\n                _sum9 = vaddw_s16(_sum9, vget_low_s16(_sab));\n                _suma = vaddw_s16(_suma, vget_low_s16(_scd));\n                _sumb = vaddw_s16(_sumb, vget_low_s16(_sef));\n                _sumc = vaddw_s16(_sumc, vget_high_s16(_s89));\n                _sumd = vaddw_s16(_sumd, vget_high_s16(_sab));\n                _sume = vaddw_s16(_sume, vget_high_s16(_scd));\n                _sumf = vaddw_s16(_sumf, vget_high_s16(_sef));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = vld1_s8(pB);\n\n                // abcd efgh\n                // efgh abcd\n                // cdab ghef\n                // ghef cdab\n\n                // 0123 4567\n                // 3210 7654\n\n                // abcdefgh  ->  ghefcdab  ->  cdabghef\n\n                int8x8_t _pA1 = vext_s8(_pA0, _pA0, 4);\n                int8x8_t _pA2 = vreinterpret_s8_s16(vrev32_s16(vreinterpret_s16_s8(_pA0)));\n                int8x8_t _pA3 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pA0)));\n\n                // 01234567  ->  32107654\n\n                int8x8_t _pB1 = vrev32_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s45 = vmull_s8(_pA2, _pB0);\n                int16x8_t _s67 = vmull_s8(_pA3, _pB0);\n                int16x8_t _s89 = vmull_s8(_pA0, _pB1);\n        
        int16x8_t _sab = vmull_s8(_pA1, _pB1);\n                int16x8_t _scd = vmull_s8(_pA2, _pB1);\n                int16x8_t _sef = vmull_s8(_pA3, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s45));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s45));\n                _sum6 = vaddw_s16(_sum6, vget_low_s16(_s67));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n                _sum8 = vaddw_s16(_sum8, vget_low_s16(_s89));\n                _sum9 = vaddw_s16(_sum9, vget_high_s16(_s89));\n                _suma = vaddw_s16(_suma, vget_low_s16(_sab));\n                _sumb = vaddw_s16(_sumb, vget_high_s16(_sab));\n                _sumc = vaddw_s16(_sumc, vget_low_s16(_scd));\n                _sumd = vaddw_s16(_sumd, vget_high_s16(_scd));\n                _sume = vaddw_s16(_sume, vget_low_s16(_sef));\n                _sumf = vaddw_s16(_sumf, vget_high_s16(_sef));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 8;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n            vst1q_s32(outptr + 8, _sum2);\n            vst1q_s32(outptr + 12, _sum3);\n            vst1q_s32(outptr + 16, _sum4);\n            vst1q_s32(outptr + 20, _sum5);\n            vst1q_s32(outptr + 24, _sum6);\n            vst1q_s32(outptr + 28, _sum7);\n            vst1q_s32(outptr + 32, _sum8);\n            vst1q_s32(outptr + 36, _sum9);\n            vst1q_s32(outptr + 40, _suma);\n            vst1q_s32(outptr + 44, _sumb);\n            vst1q_s32(outptr + 48, _sumc);\n            vst1q_s32(outptr + 52, _sumd);\n            vst1q_s32(outptr + 56, _sume);\n            vst1q_s32(outptr + 60, _sumf);\n\n            outptr 
+= 64;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cmp    %w7, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\"\n                \"sub    %0, %0, #64                 \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n\n                \"1:                                 \\n\"\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n#endif // 
__ARM_FEATURE_MATMUL_INT8\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b, v2.16b, v3.16b}, [%1], #64 \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"smmla  v24.4s, v0.16b, v4.16b      \\n\"\n                \"smmla  v25.4s, v1.16b, v4.16b      \\n\"\n                \"smmla  v26.4s, v0.16b, v5.16b      \\n\"\n                \"smmla  v27.4s, v1.16b, v5.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v28.4s, v2.16b, v4.16b      \\n\"\n                \"smmla  v29.4s, v3.16b, v4.16b      \\n\"\n                \"smmla  v30.4s, v2.16b, v5.16b      \\n\"\n                \"smmla  v31.4s, v3.16b, v5.16b      \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"sdot   v16.4s, v0.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v4.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v4.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v4.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v4.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v4.4b[3]    \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sdot   v16.4s, v2.16b, v5.4b[0]    \\n\"\n                \"sdot   v17.4s, v2.16b, v5.4b[1]    \\n\"\n                \"sdot   v18.4s, v2.16b, v5.4b[2]    \\n\"\n                \"sdot   v19.4s, v2.16b, v5.4b[3]    \\n\"\n                \"sdot   v20.4s, v3.16b, v5.4b[0]    \\n\"\n                \"sdot   v21.4s, v3.16b, v5.4b[1]    \\n\"\n                \"sdot   v22.4s, v3.16b, v5.4b[2]    \\n\"\n                \"sdot   v23.4s, v3.16b, v5.4b[3]    \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n                \"bne    2b                          \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n               
 \"uzp1   v0.4s, v24.4s, v25.4s       \\n\"\n                \"uzp2   v1.4s, v24.4s, v25.4s       \\n\"\n                \"uzp1   v2.4s, v26.4s, v27.4s       \\n\"\n                \"uzp2   v3.4s, v26.4s, v27.4s       \\n\"\n                \"uzp1   v4.4s, v28.4s, v29.4s       \\n\"\n                \"uzp2   v5.4s, v28.4s, v29.4s       \\n\"\n                \"uzp1   v6.4s, v30.4s, v31.4s       \\n\"\n                \"uzp2   v7.4s, v30.4s, v31.4s       \\n\"\n\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n                \"add    v20.4s, v20.4s, v4.4s       \\n\"\n                \"add    v21.4s, v21.4s, v5.4s       \\n\"\n                \"add    v22.4s, v22.4s, v6.4s       \\n\"\n                \"add    v23.4s, v23.4s, v7.4s       \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n                \"and    w4, %w6, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v2.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v2.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v2.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v2.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 
2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v4.16b}, [%2], #16         \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"rev64  v2.4s, v0.4s                \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"ext    v5.16b, v4.16b, v4.16b, #8  \\n\"\n                \"smull2 v9.8h, v0.16b, v5.16b       \\n\"\n                \"rev64  v6.8h, v4.8h                \\n\"\n                \"smull2 v11.8h, v2.16b, v5.16b      \\n\"\n                \"ext    v7.16b, v6.16b, v6.16b, #8  \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"rev64  v3.4s, v1.4s                \\n\"\n                \"smull2 v13.8h, v0.16b, v7.16b      \\n\"\n                \"smull2 v15.8h, v2.16b, v7.16b      \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v9.8h, v1.16b, v4.16b       \\n\"\n                \"smlal2 v11.8h, v3.16b, v4.16b      \\n\"\n                \"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, v6.16b      \\n\"\n                \"smlal2 v15.8h, v3.16b, v6.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n               
 \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w6, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"dup    v4.8h, v1.h[0]              \\n\"\n                \"dup    v5.8h, v1.h[1]              \\n\"\n                \"dup    v6.8h, v1.h[2]              \\n\"\n                \"dup    v7.8h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"smull2 v12.8h, v0.16b, v4.16b      \\n\"\n                \"smull2 v13.8h, v0.16b, v5.16b      \\n\"\n                \"smull2 v14.8h, v0.16b, v6.16b      \\n\"\n                \"smull2 v15.8h, v0.16b, v7.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1r   {v2.2d}, [%2]               
\\n\"\n                \"add    %2, %2, #8                  \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n                \"rev64  v3.8h, v2.8h                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n                \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w6, #1                 \\n\" // w4 = remain = max_kk & 1\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.8b}, [%2]               \\n\"\n                \"add    %2, %2, #4                  \\n\"\n                \"dup    v8.8b, v1.b[0]              \\n\"\n                \"dup    v9.8b, v1.b[1]              \\n\"\n                \"dup    v10.8b, v1.b[2]             \\n\"\n                \"dup    v11.8b, v1.b[3]             \\n\"\n                \"smull  v8.8h, v0.8b, v8.8b         \\n\"\n        
        \"smull  v9.8h, v0.8b, v9.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v10.8b       \\n\"\n                \"smull  v11.8h, v0.8b, v11.8b       \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw  v17.4s, v17.4s, v9.4h       \\n\"\n                \"saddw  v18.4s, v18.4s, v10.4h      \\n\"\n                \"saddw  v19.4s, v19.4s, v11.4h      \\n\"\n                \"saddw2 v20.4s, v20.4s, v8.8h       \\n\"\n                \"saddw2 v21.4s, v21.4s, v9.8h       \\n\"\n                \"saddw2 v22.4s, v22.4s, v10.8h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1r   {v4.2s}, [%2]               \\n\"\n                \"add    %2, %2, #4                  \\n\"\n                \"rev32  v1.4h, v0.4h                \\n\"\n                \"rev64  v5.8b, v4.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v4.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v5.8b        \\n\"\n                \"smull  v11.8h, v1.8b, v5.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw2 v21.4s, v21.4s, v10.8h      \\n\"\n                \"saddw  v22.4s, v22.4s, v11.4h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                : \"=r\"(outptr), // %0\n     
           \"=r\"(pA),     // %1\n                \"=r\"(pB)      // %2\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"r\"(max_kk), // %6\n                \"r\"(k)       // %7\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %7, #0              \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0!, {d16-d23}      \\n\"\n                \"vldm       %0, {d24-d31}       \\n\"\n                \"sub        %0, %0, #64         \\n\"\n                \"b          1f                  \\n\"\n\n                \"0:                             \\n\"\n                \"veor       q8, q8              \\n\"\n                \"veor       q9, q9              \\n\"\n                \"veor       q10, q10            \\n\"\n                \"veor       q11, q11            \\n\"\n                \"veor       q12, q12            \\n\"\n                \"veor       q13, q13            \\n\"\n                \"veor       q14, q14            \\n\"\n                \"veor       q15, q15            \\n\"\n\n                \"1:                             \\n\"\n                \"lsr        r4, %6, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        3f                  \\n\"\n\n                \".align 4                       \\n\"\n                \"2:                             \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.s8    {d0-d3}, [%1 :64]!  
\\n\"\n                \"pld        [%2, #128]          \\n\"\n                \"vld1.s8    {d4-d5}, [%2]!      \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vrev64.32  q3, q0              \\n\"\n                \"vmull.s8   q5, d1, d4          \\n\"\n                \"vmull.s8   q6, d6, d4          \\n\"\n                \"vmull.s8   q7, d7, d4          \\n\"\n                \"vrev64.32  q0, q1              \\n\"\n                \"vmlal.s8   q4, d2, d5          \\n\"\n                \"vmlal.s8   q5, d3, d5          \\n\"\n                \"vmlal.s8   q6, d0, d5          \\n\"\n                \"vmlal.s8   q7, d1, d5          \\n\"\n                \"vrev64.16  q2, q2              \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vrev64.32  q1, q3              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vmull.s8   q4, d6, d4          \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vmull.s8   q5, d7, d4          \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n                \"vmull.s8   q6, d2, d4          \\n\"\n                \"vmull.s8   q7, d3, d4          \\n\"\n                \"vrev64.32  q3, q0              \\n\"\n                \"vmlal.s8   q4, d0, d5          \\n\"\n                \"vmlal.s8   q5, d1, d5          \\n\"\n                \"vmlal.s8   q6, d6, d5          \\n\"\n                \"vmlal.s8   q7, d7, d5          \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vpadal.s16 q14, q4             \\n\"\n                \"vpadal.s16 q15, q5             \\n\"\n                \"vpadal.s16 q12, q6             \\n\"\n                \"vpadal.s16 q13, q7             \\n\"\n                \"bne        2b                  \\n\"\n\n                \"3:                             \\n\"\n                \"and        r4, %6, #2          \\n\" // r4 = remain = max_kk & 2\n   
             \"cmp        r4, #0              \\n\"\n                \"beq        4f                  \\n\"\n\n                // kk += 2 part\n                \"vld1.s8    {d0-d1}, [%1 :64]!  \\n\"\n                \"vld1.s8    {d4}, [%2]!         \\n\"\n                \"vrev64.32  q1, q0              \\n\"\n                \"vrev64.16  d5, d4              \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vmull.s8   q5, d1, d4          \\n\"\n                \"vmull.s8   q6, d2, d4          \\n\"\n                \"vmull.s8   q7, d3, d4          \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n                \"vmull.s8   q4, d0, d5          \\n\"\n                \"vmull.s8   q5, d1, d5          \\n\"\n                \"vmull.s8   q6, d2, d5          \\n\"\n                \"vmull.s8   q7, d3, d5          \\n\"\n                \"vpadal.s16 q12, q4             \\n\"\n                \"vpadal.s16 q13, q5             \\n\"\n                \"vpadal.s16 q14, q6             \\n\"\n                \"vpadal.s16 q15, q7             \\n\"\n\n                \"4:                             \\n\"\n                \"and        r4, %6, #1          \\n\" // r4 = remain = max_kk & 1\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                // kk += 1 part\n                \"vld1.s8    {d0}, [%1 :64]!     \\n\"\n                \"vld1.s32   {d2[]}, [%2]!       
\\n\"\n                \"vrev64.16  d1, d0              \\n\"\n                \"vrev64.8   d3, d2              \\n\"\n                \"vext.s8    d1, d1, #4          \\n\"\n                \"vmull.s8   q4, d0, d2          \\n\"\n                \"vmull.s8   q5, d1, d2          \\n\"\n                \"vmull.s8   q6, d0, d3          \\n\"\n                \"vmull.s8   q7, d1, d3          \\n\"\n                \"vaddw.s16  q8, d8              \\n\"\n                \"vaddw.s16  q9, d9              \\n\"\n                \"vaddw.s16  q10, d10            \\n\"\n                \"vaddw.s16  q11, d11            \\n\"\n                \"vaddw.s16  q12, d12            \\n\"\n                \"vaddw.s16  q13, d13            \\n\"\n                \"vaddw.s16  q14, d14            \\n\"\n                \"vaddw.s16  q15, d15            \\n\"\n\n                \"5:                             \\n\"\n                \"vstm       %0!, {d16-d23}      \\n\"\n                \"vstm       %0!, {d24-d31}      \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB)      // %2\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"r\"(max_kk), // %6\n                \"r\"(k)       // %7\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n            int32x4_t _sum4;\n            int32x4_t _sum5;\n            int32x4_t _sum6;\n            int32x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n              
  _sum4 = vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n                _sum4 = vld1q_s32(outptr + 16);\n                _sum5 = vld1q_s32(outptr + 20);\n                _sum6 = vld1q_s32(outptr + 24);\n                _sum7 = vld1q_s32(outptr + 28);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _s0 = vdupq_n_s32(0);\n                int32x4_t _s1 = vdupq_n_s32(0);\n                int32x4_t _s2 = vdupq_n_s32(0);\n                int32x4_t _s3 = vdupq_n_s32(0);\n                int32x4_t _s4 = vdupq_n_s32(0);\n                int32x4_t _s5 = vdupq_n_s32(0);\n                int32x4_t _s6 = vdupq_n_s32(0);\n                int32x4_t _s7 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb ..... 
hhhhhhhh\n                    // 00000000 11111111 22222222 33333333\n\n                    _s0 = vmmlaq_s32(_s0, _pA0, _pB0);\n                    _s1 = vmmlaq_s32(_s1, _pA1, _pB0);\n                    _s2 = vmmlaq_s32(_s2, _pA0, _pB1);\n                    _s3 = vmmlaq_s32(_s3, _pA1, _pB1);\n                    _s4 = vmmlaq_s32(_s4, _pA2, _pB0);\n                    _s5 = vmmlaq_s32(_s5, _pA3, _pB0);\n                    _s6 = vmmlaq_s32(_s6, _pA2, _pB1);\n                    _s7 = vmmlaq_s32(_s7, _pA3, _pB1);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                    _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB0, 0);\n                    _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB0, 1);\n                    _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB0, 2);\n                    _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB0, 3);\n\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA2, _pB1, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA2, _pB1, 1);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA2, _pB1, 2);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA2, _pB1, 3);\n                    _sum4 = vdotq_laneq_s32(_sum4, _pA3, _pB1, 0);\n                    _sum5 = vdotq_laneq_s32(_sum5, _pA3, _pB1, 1);\n                    _sum6 = vdotq_laneq_s32(_sum6, _pA3, _pB1, 2);\n                    _sum7 = vdotq_laneq_s32(_sum7, _pA3, _pB1, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 64;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _ss0 = vuzpq_s32(_s0, _s1);\n                int32x4x2_t _ss1 = vuzpq_s32(_s2, _s3);\n                int32x4x2_t _ss2 = vuzpq_s32(_s4, _s5);\n                int32x4x2_t 
_ss3 = vuzpq_s32(_s6, _s7);\n                _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n                _sum4 = vaddq_s32(_sum4, _ss2.val[0]);\n                _sum5 = vaddq_s32(_sum5, _ss2.val[1]);\n                _sum6 = vaddq_s32(_sum6, _ss3.val[0]);\n                _sum7 = vaddq_s32(_sum7, _ss3.val[1]);\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                // aaaa bbbb cccc dddd   eeee ffff gggg hhhh\n\n                // 0000 1111 2222 3333\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x16_t _pB02 = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n                int8x16_t _pA3 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA2)));\n\n                // 00112233 44556677\n\n                // 33221100 77665544\n\n                int8x16_t _pB13 = 
vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB02));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB02));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB13));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB13));\n                int16x8_t _s6 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB13));\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA1), vget_low_s8(_pB13));\n\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), vget_high_s8(_pB02));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_pA3), vget_high_s8(_pB02));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA3), vget_high_s8(_pB02));\n                _s4 = vmlal_s8(_s4, vget_low_s8(_pA2), vget_high_s8(_pB13));\n                _s5 = vmlal_s8(_s5, vget_high_s8(_pA2), vget_high_s8(_pB13));\n                _s6 = vmlal_s8(_s6, vget_low_s8(_pA3), vget_high_s8(_pB13));\n                _s7 = vmlal_s8(_s7, vget_high_s8(_pA3), vget_high_s8(_pB13));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n  
              int8x8_t _pB = vld1_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // 00112233\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2)));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3)));\n                int16x8_t _s4 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                int16x8_t _s6 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2)));\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA), vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3)));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x8_t _pB0 = vld1_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // ccddaabb gghheeff\n\n                int8x16_t _pA1 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA0)));\n\n                // 00112233\n\n                // 33221100\n\n                int8x8_t _pB1 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pB0)));\n\n       
         int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), _pB0);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA1), _pB0);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA1), _pB0);\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA0), _pB1);\n                int16x8_t _s5 = vmull_s8(vget_high_s8(_pA0), _pB1);\n                int16x8_t _s6 = vmull_s8(vget_low_s8(_pA1), _pB1);\n                int16x8_t _s7 = vmull_s8(vget_high_s8(_pA1), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                // int8x8_t _pB0 = vreinterpret_s32_s8(vld1_dup_s32(pB));\n\n                // abcdefgh\n\n                // 0123\n\n                int16x8_t _s01 = vmull_s8(_pA0, vdup_n_s8(pB[0]));\n                int16x8_t _s23 = vmull_s8(_pA0, vdup_n_s8(pB[1]));\n                int16x8_t _s45 = vmull_s8(_pA0, vdup_n_s8(pB[2]));\n                int16x8_t _s67 = vmull_s8(_pA0, vdup_n_s8(pB[3]));\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s23));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s45));\n                _sum3 = vaddw_s16(_sum3, vget_low_s16(_s67));\n                _sum4 = vaddw_s16(_sum4, vget_high_s16(_s01));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s23));\n                _sum6 = 
vaddw_s16(_sum6, vget_high_s16(_s45));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n                // int8x8_t _pB0 = vld1_s8(pB);\n                // _pB0 = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pB0), vreinterpret_s32_s8(_pB0)).val[0]);\n\n                // abcdefgh  ->  cdabghef\n                int8x8_t _pA1 = vreinterpret_s8_s16(vrev32_s16(vreinterpret_s16_s8(_pA0)));\n\n                // 01230123  ->  32103210\n                int8x8_t _pB1 = vrev64_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s45 = vmull_s8(_pA0, _pB1);\n                int16x8_t _s67 = vmull_s8(_pA1, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s45));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s45));\n                _sum6 = vaddw_s16(_sum6, vget_low_s16(_s67));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 4;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n            vst1q_s32(outptr + 8, _sum2);\n            vst1q_s32(outptr + 12, _sum3);\n            vst1q_s32(outptr + 16, _sum4);\n            vst1q_s32(outptr + 20, _sum5);\n            vst1q_s32(outptr + 24, _sum6);\n            vst1q_s32(outptr + 28, _sum7);\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const signed 
char* pA = pAT;\n\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _s0 = vdupq_n_s32(0);\n                int32x4_t _s1 = vdupq_n_s32(0);\n                int32x4_t _s2 = vdupq_n_s32(0);\n                int32x4_t _s3 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n\n                    int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb ..... 
hhhhhhhh\n                    // 00000000 11111111\n\n                    _s0 = vmmlaq_s32(_s0, _pA0, _pB);\n                    _s1 = vmmlaq_s32(_s1, _pA1, _pB);\n                    _s2 = vmmlaq_s32(_s2, _pA2, _pB);\n                    _s3 = vmmlaq_s32(_s3, _pA3, _pB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB, 1);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA1, _pB, 0);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA1, _pB, 1);\n\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA2, _pB, 2);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA2, _pB, 3);\n                    _sum2 = vdotq_laneq_s32(_sum2, _pA3, _pB, 2);\n                    _sum3 = vdotq_laneq_s32(_sum3, _pA3, _pB, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 64;\n                    pB += 16;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _ss0 = vuzpq_s32(_s0, _s1);\n                int32x4x2_t _ss1 = vuzpq_s32(_s2, _s3);\n                _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x8_t _pB = vld1_s8(pB);\n\n                // aaaa bbbb cccc dddd eeee ffff gggg hhhh\n\n                // 0000 1111\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _pA0, _pB, 1);\n                _sum2 = vdotq_lane_s32(_sum2, _pA1, _pB, 0);\n                _sum3 = vdotq_lane_s32(_sum3, _pA1, _pB, 
1);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x8_t _pB = vld1_s8(pB);\n\n                // aabbccdd eeffgghh   aabbccdd eeffgghh\n\n                // 00112233 -> 00110011 22332233\n\n                // 11001100 33223322\n\n                int32x2x2_t _pBB = vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB));\n                int8x16_t _pB02 = vreinterpretq_s8_s32(vcombine_s32(_pBB.val[0], _pBB.val[1]));\n\n                int8x16_t _pB13 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB02));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB13));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA0), vget_low_s8(_pB13));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), vget_high_s8(_pB02));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_pA2), vget_high_s8(_pB13));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA2), vget_high_s8(_pB13));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 8;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int16x4_t _pB = vreinterpret_s16_s32(vld1_dup_s32((const int*)pB));\n\n                int16x4x2_t _pB01 = vuzp_s16(_pB, _pB);\n                int8x8_t _pB0 = vreinterpret_s8_s16(_pB01.val[0]);\n                int8x8_t _pB1 = 
vreinterpret_s8_s16(_pB01.val[1]);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), _pB1);\n                int16x8_t _s2 = vmull_s8(vget_high_s8(_pA), _pB0);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // aabbccdd eeffgghh\n\n                // 00110011\n                // 11001100\n\n                int8x8_t _pB1 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA), _pB0);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA), _pB1);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_pA), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 4;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                int8x8x2_t _pB01 = vuzp_s8(_pB, _pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB01.val[0]);\n                int16x8_t _s1 = vmull_s8(_pA, _pB01.val[1]);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s1));\n                _sum2 = 
vaddw_s16(_sum2, vget_high_s16(_s0));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                // abcdefgh\n\n                // 01010101\n                // 10101010\n                int8x8_t _pB1 = vext_s8(_pB0, _pB0, 1);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB0);\n                int16x8_t _s1 = vmull_s8(_pA, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s1));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 2;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n            vst1q_s32(outptr + 8, _sum2);\n            vst1q_s32(outptr + 12, _sum3);\n\n            outptr += 16;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const signed char* pA = pAT;\n\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _s0 = vdupq_n_s32(0);\n                int32x4_t _s1 = vdupq_n_s32(0);\n                int32x4_t _s2 = vdupq_n_s32(0);\n                int32x4_t _s3 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    
int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pA2 = vld1q_s8(pA + 32);\n                    int8x16_t _pA3 = vld1q_s8(pA + 48);\n\n                    int8x8_t _pB = vld1_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb ..... hhhhhhhh\n                    // 00000000\n                    int8x16_t _pBB = vcombine_s8(_pB, _pB);\n\n                    _s0 = vdotq_s32(_s0, _pA0, _pBB);\n                    _s1 = vdotq_s32(_s1, _pA1, _pBB);\n                    _s2 = vdotq_s32(_s2, _pA2, _pBB);\n                    _s3 = vdotq_s32(_s3, _pA3, _pBB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                    _sum1 = vdotq_lane_s32(_sum1, _pA1, _pB, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pA2, _pB, 1);\n                    _sum1 = vdotq_lane_s32(_sum1, _pA3, _pB, 1);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 64;\n                    pB += 8;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_s0, _s1));\n                _sum1 = vaddq_s32(_sum1, vpaddq_s32(_s2, _s3));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n\n                int8x8_t _pB = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // aaaa bbbb cccc dddd eeee ffff gggg hhhh\n\n                // 0000 0000\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _pA1, _pB, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA2 = vld1q_s8(pA + 16);\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n            
    int8x8_t _pB1 = vreinterpret_s8_s16(vld1_dup_s16((const short*)(pB + 2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), _pB0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA0), _pB0);\n                _s0 = vmlal_s8(_s0, vget_low_s8(_pA2), _pB1);\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA2), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_pA), _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n\n                pA += 16;\n                pB += 2;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_dup_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                pA += 8;\n                pB += 1;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n\n            outptr += 8;\n        }\n\n        pAT += max_kk * 8;\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"cmp    %w7, #0                     \\n\"\n                \"beq    0f                          
\\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0] \\n\"\n                \"sub    %0, %0, #64                 \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                                 \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n                \"eor    v20.16b, v20.16b, v20.16b   \\n\"\n                \"eor    v21.16b, v21.16b, v21.16b   \\n\"\n                \"eor    v22.16b, v22.16b, v22.16b   \\n\"\n                \"eor    v23.16b, v23.16b, v23.16b   \\n\"\n\n                \"1:                                 \\n\"\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n                \"eor    v28.16b, v28.16b, v28.16b   \\n\"\n                \"eor    v29.16b, v29.16b, v29.16b   \\n\"\n                \"eor    v30.16b, v30.16b, v30.16b   \\n\"\n                \"eor    v31.16b, v31.16b, v31.16b   \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v2.16b, v3.16b, v4.16b, v5.16b}, [%2], #64 \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"smmla  v24.4s, v0.16b, v2.16b      \\n\"\n                \"smmla  v25.4s, v1.16b, v2.16b     
 \\n\"\n                \"smmla  v26.4s, v0.16b, v3.16b      \\n\"\n                \"smmla  v27.4s, v1.16b, v3.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v28.4s, v0.16b, v4.16b      \\n\"\n                \"smmla  v29.4s, v1.16b, v4.16b      \\n\"\n                \"smmla  v30.4s, v0.16b, v5.16b      \\n\"\n                \"smmla  v31.4s, v1.16b, v5.16b      \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v0.16b, v3.4b[0]    \\n\"\n                \"sdot   v21.4s, v0.16b, v3.4b[1]    \\n\"\n                \"sdot   v22.4s, v0.16b, v3.4b[2]    \\n\"\n                \"sdot   v23.4s, v0.16b, v3.4b[3]    \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sdot   v16.4s, v1.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v1.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v1.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v1.16b, v4.4b[3]    \\n\"\n                \"sdot   v20.4s, v1.16b, v5.4b[0]    \\n\"\n                \"sdot   v21.4s, v1.16b, v5.4b[1]    \\n\"\n                \"sdot   v22.4s, v1.16b, v5.4b[2]    \\n\"\n                \"sdot   v23.4s, v1.16b, v5.4b[3]    \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n                \"bne    2b                          \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"uzp1   v0.4s, v24.4s, v25.4s       \\n\"\n                \"uzp2   v1.4s, v24.4s, v25.4s       \\n\"\n                \"uzp1   v2.4s, v26.4s, v27.4s       \\n\"\n                \"uzp2   v3.4s, v26.4s, v27.4s       \\n\"\n                \"uzp1   v4.4s, v28.4s, v29.4s       \\n\"\n                \"uzp2   v5.4s, v28.4s, v29.4s       \\n\"\n                \"uzp1   
v6.4s, v30.4s, v31.4s       \\n\"\n                \"uzp2   v7.4s, v30.4s, v31.4s       \\n\"\n\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n                \"add    v20.4s, v20.4s, v4.4s       \\n\"\n                \"add    v21.4s, v21.4s, v5.4s       \\n\"\n                \"add    v22.4s, v22.4s, v6.4s       \\n\"\n                \"add    v23.4s, v23.4s, v7.4s       \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n                \"and    w4, %w6, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v2.16b, v3.16b}, [%2], #32 \\n\"\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n                \"sdot   v20.4s, v0.16b, v3.4b[0]    \\n\"\n                \"sdot   v21.4s, v0.16b, v3.4b[1]    \\n\"\n                \"sdot   v22.4s, v0.16b, v3.4b[2]    \\n\"\n                \"sdot   v23.4s, v0.16b, v3.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n      
          \"smull2 v9.8h, v0.16b, v5.16b       \\n\"\n                \"rev64  v2.4s, v0.4s                \\n\"\n                \"smull  v10.8h, v2.8b, v4.8b        \\n\"\n                \"smull2 v11.8h, v2.16b, v5.16b      \\n\"\n                \"rev64  v6.8h, v4.8h                \\n\"\n                \"smull  v12.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v14.8h, v2.8b, v6.8b        \\n\"\n                \"rev64  v7.8h, v5.8h                \\n\"\n                \"smull2 v13.8h, v0.16b, v7.16b      \\n\"\n                \"smull2 v15.8h, v2.16b, v7.16b      \\n\"\n                \"ext    v1.16b, v0.16b, v0.16b, #8  \\n\"\n                \"ext    v3.16b, v2.16b, v2.16b, #8  \\n\"\n                \"smlal  v8.8h, v1.8b, v5.8b         \\n\"\n                \"smlal2 v9.8h, v1.16b, v4.16b       \\n\"\n                \"smlal  v10.8h, v3.8b, v5.8b        \\n\"\n                \"smlal2 v11.8h, v3.16b, v4.16b      \\n\"\n                \"smlal  v12.8h, v1.8b, v7.8b        \\n\"\n                \"smlal  v14.8h, v3.8b, v7.8b        \\n\"\n                \"smlal2 v13.8h, v1.16b, v6.16b      \\n\"\n                \"smlal2 v15.8h, v3.16b, v6.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w6, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp  
  w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.16b}, [%2], #16         \\n\"\n                \"dup    v4.8h, v1.h[0]              \\n\"\n                \"dup    v5.8h, v1.h[1]              \\n\"\n                \"dup    v6.8h, v1.h[2]              \\n\"\n                \"dup    v7.8h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"dup    v4.8h, v1.h[4]              \\n\"\n                \"dup    v5.8h, v1.h[5]              \\n\"\n                \"dup    v6.8h, v1.h[6]              \\n\"\n                \"dup    v7.8h, v1.h[7]              \\n\"\n                \"smull  v12.8h, v0.8b, v4.8b        \\n\"\n                \"smull  v13.8h, v0.8b, v5.8b        \\n\"\n                \"smull  v14.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v15.8h, v0.8b, v7.8b        \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2d}, [%1]               \\n\"\n                \"add    %1, %1, #8                  \\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n        
        \"rev64  v3.8h, v2.8h                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull2 v9.8h, v0.16b, v2.16b       \\n\"\n                \"smull  v10.8h, v1.8b, v2.8b        \\n\"\n                \"smull2 v11.8h, v1.16b, v2.16b      \\n\"\n                \"smull  v12.8h, v0.8b, v3.8b        \\n\"\n                \"smull2 v13.8h, v0.16b, v3.16b      \\n\"\n                \"smull  v14.8h, v1.8b, v3.8b        \\n\"\n                \"smull2 v15.8h, v1.16b, v3.16b      \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"sadalp v20.4s, v12.8h              \\n\"\n                \"sadalp v21.4s, v13.8h              \\n\"\n                \"sadalp v22.4s, v14.8h              \\n\"\n                \"sadalp v23.4s, v15.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w6, #1                 \\n\" // w4 = remain = max_kk & 1\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2s}, [%1]               \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"add    %1, %1, #4                  \\n\"\n                \"dup    v8.8h, v1.h[0]              \\n\"\n                \"dup    v9.8h, v1.h[1]              \\n\"\n                \"dup    v10.8h, v1.h[2]             \\n\"\n                \"dup    v11.8h, v1.h[3]             \\n\"\n                \"uzp1   v2.8b, v8.8b, v9.8b         \\n\"\n                \"uzp2   v3.8b, v8.8b, v9.8b         \\n\"\n                \"uzp1   v4.8b, v10.8b, v11.8b       \\n\"\n                \"uzp2 
  v5.8b, v10.8b, v11.8b       \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v3.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v4.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v5.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw  v17.4s, v17.4s, v9.4h       \\n\"\n                \"saddw2 v18.4s, v18.4s, v8.8h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw  v21.4s, v21.4s, v11.4h      \\n\"\n                \"saddw2 v22.4s, v22.4s, v10.8h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2s}, [%1]               \\n\"\n                \"ld1    {v2.8b}, [%2], #8           \\n\"\n                \"add    %1, %1, #4                  \\n\"\n                \"ext    v1.8b, v0.8b, v0.8b, #2     \\n\"\n                \"rev32  v3.8b, v2.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v2.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v3.8b        \\n\"\n                \"smull  v11.8h, v1.8b, v3.8b        \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n                \"saddw  v20.4s, v20.4s, v10.4h      \\n\"\n                \"saddw2 v21.4s, v21.4s, v10.8h      \\n\"\n                \"saddw  v22.4s, v22.4s, v11.4h      \\n\"\n                \"saddw2 v23.4s, v23.4s, v11.8h      \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n                
\"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB)      // %2\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"r\"(max_kk), // %6\n                \"r\"(k)       // %7\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else // NCNN_GNU_INLINE_ASM\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n            int32x4_t _sum4;\n            int32x4_t _sum5;\n            int32x4_t _sum6;\n            int32x4_t _sum7;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n                _sum4 = vdupq_n_s32(0);\n                _sum5 = vdupq_n_s32(0);\n                _sum6 = vdupq_n_s32(0);\n                _sum7 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n                _sum4 = vld1q_s32(outptr + 16);\n                _sum5 = vld1q_s32(outptr + 20);\n                _sum6 = vld1q_s32(outptr + 24);\n                _sum7 = vld1q_s32(outptr + 28);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum10 = vdupq_n_s32(0);\n          
      int32x4_t _sum11 = vdupq_n_s32(0);\n                int32x4_t _sum20 = vdupq_n_s32(0);\n                int32x4_t _sum21 = vdupq_n_s32(0);\n                int32x4_t _sum30 = vdupq_n_s32(0);\n                int32x4_t _sum31 = vdupq_n_s32(0);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                    // aaaaaaaa bbbbbbbb cccccccc dddddddd\n\n                    // 00000000 11111111 22222222 33333333\n                    // 44444444 55555555 66666666 77777777\n\n                    _sum00 = vmmlaq_s32(_sum00, _pA0, _pB0);\n                    _sum01 = vmmlaq_s32(_sum01, _pA1, _pB0);\n                    _sum10 = vmmlaq_s32(_sum10, _pA0, _pB1);\n                    _sum11 = vmmlaq_s32(_sum11, _pA1, _pB1);\n                    _sum20 = vmmlaq_s32(_sum20, _pA0, _pB2);\n                    _sum21 = vmmlaq_s32(_sum21, _pA1, _pB2);\n                    _sum30 = vmmlaq_s32(_sum30, _pA0, _pB3);\n                    _sum31 = vmmlaq_s32(_sum31, _pA1, _pB3);\n\n                    // a0 a1 b0 b1\n                    // c0 c1 d0 d1\n                    // a2 a3 b2 b3\n                    // c2 c3 d2 d3\n                    // a4 a5 b4 b5\n                    // c4 c5 d4 d5\n                    // a6 a7 b6 b7\n                    // c6 c7 d6 d7\n\n                    pA += 32;\n                    pB += 64;\n                }\n                int32x4x2_t _ss0 = vuzpq_s32(_sum00, _sum01);\n                int32x4x2_t _ss1 = vuzpq_s32(_sum10, _sum11);\n                int32x4x2_t _ss2 = vuzpq_s32(_sum20, _sum21);\n                int32x4x2_t _ss3 = vuzpq_s32(_sum30, _sum31);\n                _sum0 = vaddq_s32(_sum0, 
_ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n                _sum4 = vaddq_s32(_sum4, _ss2.val[0]);\n                _sum5 = vaddq_s32(_sum5, _ss2.val[1]);\n                _sum6 = vaddq_s32(_sum6, _ss3.val[0]);\n                _sum7 = vaddq_s32(_sum7, _ss3.val[1]);\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n                int8x16_t _pB2 = vld1q_s8(pB + 32);\n                int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA0, _pB1, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA0, _pB1, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA0, _pB1, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA0, _pB1, 3);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA1, _pB2, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA1, _pB2, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA1, _pB2, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA1, _pB2, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA1, _pB3, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA1, _pB3, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA1, _pB3, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA1, _pB3, 3);\n\n                pA += 32;\n                pB += 64;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; 
kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA, _pB0, 3);\n                _sum4 = vdotq_laneq_s32(_sum4, _pA, _pB1, 0);\n                _sum5 = vdotq_laneq_s32(_sum5, _pA, _pB1, 1);\n                _sum6 = vdotq_laneq_s32(_sum6, _pA, _pB1, 2);\n                _sum7 = vdotq_laneq_s32(_sum7, _pA, _pB1, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA02 = vld1q_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB2 = vld1q_s8(pB + 16);\n\n                int8x16_t _pA13 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA02)));\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n                int8x16_t _pB3 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA02), vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB0));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA13), vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB1));\n                int16x8_t _s5 = vmull_s8(vget_low_s8(_pA02), vget_high_s8(_pB1));\n                int16x8_t _s6 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB1));\n                int16x8_t _s7 = vmull_s8(vget_low_s8(_pA13), vget_high_s8(_pB1));\n\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA02), vget_low_s8(_pB2));\n                _s1 = vmlal_s8(_s1, 
vget_high_s8(_pA02), vget_high_s8(_pB2));\n                _s2 = vmlal_s8(_s2, vget_high_s8(_pA13), vget_low_s8(_pB2));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA13), vget_high_s8(_pB2));\n                _s4 = vmlal_s8(_s4, vget_high_s8(_pA02), vget_low_s8(_pB3));\n                _s5 = vmlal_s8(_s5, vget_high_s8(_pA02), vget_high_s8(_pB3));\n                _s6 = vmlal_s8(_s6, vget_high_s8(_pA13), vget_low_s8(_pB3));\n                _s7 = vmlal_s8(_s7, vget_high_s8(_pA13), vget_high_s8(_pB3));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 32;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x16_t _pB01 = vld1q_s8(pB);\n\n                // aabbccdd\n\n                // 00112233 44556677\n\n                int16x8_t _s0 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 0)));\n                int16x8_t _s1 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 1)));\n                int16x8_t _s2 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 2)));\n                int16x8_t _s3 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pB01)), 3)));\n                int16x8_t _s4 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 0)));\n                int16x8_t _s5 = vmull_s8(_pA0, 
vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 1)));\n                int16x8_t _s6 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 2)));\n                int16x8_t _s7 = vmull_s8(_pA0, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pB01)), 3)));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, _s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x16_t _pB0 = vld1q_s8(pB);\n\n                // aabbccdd\n                // ccddaabb\n\n                int8x8_t _pA1 = vreinterpret_s8_s32(vrev64_s32(vreinterpret_s32_s8(_pA0)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n                int8x16_t _pB1 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                int16x8_t _s2 = vmull_s8(_pA1, vget_low_s8(_pB0));\n                int16x8_t _s3 = vmull_s8(_pA1, vget_high_s8(_pB0));\n                int16x8_t _s4 = vmull_s8(_pA0, vget_low_s8(_pB1));\n                int16x8_t _s5 = vmull_s8(_pA0, vget_high_s8(_pB1));\n                int16x8_t _s6 = vmull_s8(_pA1, vget_low_s8(_pB1));\n                int16x8_t _s7 = vmull_s8(_pA1, vget_high_s8(_pB1));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n                _sum4 = vpadalq_s16(_sum4, 
_s4);\n                _sum5 = vpadalq_s16(_sum5, _s5);\n                _sum6 = vpadalq_s16(_sum6, _s6);\n                _sum7 = vpadalq_s16(_sum7, _s7);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pAA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vld1_s8(pB);\n\n                // abcdabcd\n                // 01234567  ->  01010101 23232323 45454545 67676767\n                int8x8_t _pB0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0));\n                int8x8_t _pB2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1));\n                int8x8_t _pB4 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2));\n                int8x8_t _pB6 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3));\n\n                int8x8x2_t _pB0123 = vuzp_s8(_pB0, _pB2);\n                int8x8x2_t _pB4567 = vuzp_s8(_pB4, _pB6);\n\n                int16x8_t _s02 = vmull_s8(_pAA, _pB0123.val[0]);\n                int16x8_t _s13 = vmull_s8(_pAA, _pB0123.val[1]);\n                int16x8_t _s46 = vmull_s8(_pAA, _pB4567.val[0]);\n                int16x8_t _s57 = vmull_s8(_pAA, _pB4567.val[1]);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s02));\n                _sum1 = vaddw_s16(_sum1, vget_low_s16(_s13));\n                _sum2 = vaddw_s16(_sum2, vget_high_s16(_s02));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s13));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s46));\n                _sum5 = vaddw_s16(_sum5, vget_low_s16(_s57));\n                _sum6 = vaddw_s16(_sum6, vget_high_s16(_s46));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s57));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                
int8x8_t _pB0 = vld1_s8(pB);\n\n                // abcd abcd\n                // cdab cdab\n\n                int8x8_t _pA1 = vext_s8(_pA0, _pA0, 2);\n\n                // 0123 4567\n                // 3210 7654\n\n                int8x8_t _pB1 = vrev32_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s45 = vmull_s8(_pA0, _pB1);\n                int16x8_t _s67 = vmull_s8(_pA1, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n                _sum4 = vaddw_s16(_sum4, vget_low_s16(_s45));\n                _sum5 = vaddw_s16(_sum5, vget_high_s16(_s45));\n                _sum6 = vaddw_s16(_sum6, vget_low_s16(_s67));\n                _sum7 = vaddw_s16(_sum7, vget_high_s16(_s67));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 8;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n            vst1q_s32(outptr + 8, _sum2);\n            vst1q_s32(outptr + 12, _sum3);\n            vst1q_s32(outptr + 16, _sum4);\n            vst1q_s32(outptr + 20, _sum5);\n            vst1q_s32(outptr + 24, _sum6);\n            vst1q_s32(outptr + 28, _sum7);\n\n            outptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            const signed char* pA = pAT;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"cmp    %w7, #0                     \\n\"\n                \"beq    0f                          \\n\"\n\n                \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0] \\n\"\n                \"b      1f                          \\n\"\n\n                \"0:                 
                \\n\"\n                \"eor    v16.16b, v16.16b, v16.16b   \\n\"\n                \"eor    v17.16b, v17.16b, v17.16b   \\n\"\n                \"eor    v18.16b, v18.16b, v18.16b   \\n\"\n                \"eor    v19.16b, v19.16b, v19.16b   \\n\"\n\n                \"1:                                 \\n\"\n#if __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #3                 \\n\" // w4 = max_kk >> 3\n                \"cmp    w4, #0                      \\n\"\n                \"beq    101f                        \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"eor    v24.16b, v24.16b, v24.16b   \\n\"\n                \"eor    v25.16b, v25.16b, v25.16b   \\n\"\n                \"eor    v26.16b, v26.16b, v26.16b   \\n\"\n                \"eor    v27.16b, v27.16b, v27.16b   \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b, v1.16b}, [%1], #32 \\n\"\n                \"ld1    {v4.16b, v5.16b}, [%2], #32 \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"smmla  v24.4s, v0.16b, v4.16b      \\n\"\n                \"smmla  v25.4s, v1.16b, v4.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"smmla  v26.4s, v0.16b, v5.16b      \\n\"\n                \"smmla  v27.4s, v1.16b, v5.16b      \\n\"\n#else  // __ARM_FEATURE_MATMUL_INT8\n                \"sdot   v16.4s, v0.16b, v4.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v4.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v4.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v4.4b[3]    \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sdot   v16.4s, v1.16b, v5.4b[0]    \\n\"\n                \"sdot   v17.4s, v1.16b, v5.4b[1]    \\n\"\n                \"sdot   v18.4s, v1.16b, v5.4b[2]    \\n\"\n                \"sdot   v19.4s, v1.16b, v5.4b[3]    \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n         
       \"bne    2b                          \\n\"\n\n#if __ARM_FEATURE_MATMUL_INT8\n                \"uzp1   v0.4s, v24.4s, v25.4s       \\n\"\n                \"uzp2   v1.4s, v24.4s, v25.4s       \\n\"\n                \"uzp1   v2.4s, v26.4s, v27.4s       \\n\"\n                \"uzp2   v3.4s, v26.4s, v27.4s       \\n\"\n\n                \"add    v16.4s, v16.4s, v0.4s       \\n\"\n                \"add    v17.4s, v17.4s, v1.4s       \\n\"\n                \"add    v18.4s, v18.4s, v2.4s       \\n\"\n                \"add    v19.4s, v19.4s, v3.4s       \\n\"\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                \"101:                               \\n\"\n                \"and    w4, %w6, #4                 \\n\" // w4 = remain = max_kk & 4\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                // kk += 4 part\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v2.16b}, [%2], #16         \\n\"\n                \"sdot   v16.4s, v0.16b, v2.4b[0]    \\n\"\n                \"sdot   v17.4s, v0.16b, v2.4b[1]    \\n\"\n                \"sdot   v18.4s, v0.16b, v2.4b[2]    \\n\"\n                \"sdot   v19.4s, v0.16b, v2.4b[3]    \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"lsr    w4, %w6, #2                 \\n\" // w4 = max_kk >> 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    3f                          \\n\"\n\n                \"2:                                 \\n\"\n                \"ld1    {v0.16b}, [%1], #16         \\n\"\n                \"ld1    {v4.16b}, [%2], #16         \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"rev64  v1.4s, v0.4s                \\n\"\n                \"smull  v9.8h, v1.8b, v4.8b         \\n\"\n                \"rev64  v5.8h, v4.8h                \\n\"\n                \"smull  v10.8h, v0.8b, v5.8b        \\n\"\n                
\"smull  v11.8h, v1.8b, v5.8b        \\n\"\n                \"smlal2 v8.8h, v0.16b, v4.16b       \\n\"\n                \"smlal2 v9.8h, v1.16b, v4.16b       \\n\"\n                \"smlal2 v10.8h, v0.16b, v5.16b      \\n\"\n                \"smlal2 v11.8h, v1.16b, v5.16b      \\n\"\n                \"subs   w4, w4, #1                  \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n                \"bne    2b                          \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"3:                                 \\n\"\n                \"and    w4, %w6, #2                 \\n\" // w4 = remain = max_kk & 2\n                \"cmp    w4, #0                      \\n\"\n                \"beq    4f                          \\n\"\n\n                // kk += 2 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v1.8b}, [%2], #8           \\n\"\n                \"dup    v4.4h, v1.h[0]              \\n\"\n                \"dup    v5.4h, v1.h[1]              \\n\"\n                \"dup    v6.4h, v1.h[2]              \\n\"\n                \"dup    v7.4h, v1.h[3]              \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v6.8b        \\n\"\n                \"smull  v11.8h, v0.8b, v7.8b        \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1], #8           \\n\"\n                \"ld1    {v2.8b}, [%2], #8           
\\n\"\n                \"ext    v1.8b, v0.8b, v0.8b, #4     \\n\"\n                \"rev64  v3.4h, v2.4h                \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v1.8b, v2.8b         \\n\"\n                \"smull  v10.8h, v0.8b, v3.8b        \\n\"\n                \"smull  v11.8h, v1.8b, v3.8b        \\n\"\n                \"sadalp v16.4s, v8.8h               \\n\"\n                \"sadalp v17.4s, v9.8h               \\n\"\n                \"sadalp v18.4s, v10.8h              \\n\"\n                \"sadalp v19.4s, v11.8h              \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"4:                                 \\n\"\n                \"and    w4, %w6, #1                 \\n\" // w4 = remain = max_kk & 1\n                \"cmp    w4, #0                      \\n\"\n                \"beq    5f                          \\n\"\n\n                // kk += 1 part\n#if __ARM_FEATURE_DOTPROD\n                \"ld1r   {v0.2s}, [%1]               \\n\"\n                \"ld1r   {v1.2s}, [%2]               \\n\"\n                \"add    %1, %1, #4                  \\n\"\n                \"add    %2, %2, #4                  \\n\"\n                \"zip1   v1.8b, v1.8b, v1.8b         \\n\"\n                \"zip1   v2.4h, v1.4h, v1.4h         \\n\"\n                \"zip2   v3.4h, v1.4h, v1.4h         \\n\"\n                \"smull  v8.8h, v0.8b, v2.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v3.8b         \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n#else  // __ARM_FEATURE_DOTPROD\n                \"ld1    {v0.8b}, [%1]               \\n\"\n                \"ld1r   {v4.2s}, [%2]               \\n\"\n                \"add    %1, %1, #4                  \\n\"\n                \"add    %2, %2, 
#4                  \\n\"\n                \"rev32  v1.4h, v0.4h                \\n\"\n                \"zip1   v0.2s, v0.2s, v1.2s         \\n\"\n                \"rev32  v5.8b, v4.8b                \\n\"\n                \"smull  v8.8h, v0.8b, v4.8b         \\n\"\n                \"smull  v9.8h, v0.8b, v5.8b         \\n\"\n                \"saddw  v16.4s, v16.4s, v8.4h       \\n\"\n                \"saddw2 v17.4s, v17.4s, v8.8h       \\n\"\n                \"saddw  v18.4s, v18.4s, v9.4h       \\n\"\n                \"saddw2 v19.4s, v19.4s, v9.8h       \\n\"\n#endif // __ARM_FEATURE_DOTPROD\n\n                \"5:                                 \\n\"\n                \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB)      // %2\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"r\"(max_kk), // %6\n                \"r\"(k)       // %7\n                : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n            asm volatile(\n                \"cmp        %7, #0              \\n\"\n                \"beq        0f                  \\n\"\n\n                \"vldm       %0, {d16-d23}       \\n\"\n                \"b          1f                  \\n\"\n\n                \"0:                             \\n\"\n                \"veor       q8, q8              \\n\"\n                \"veor       q9, q9              \\n\"\n                \"veor       q10, q10            \\n\"\n                \"veor       q11, q11            \\n\"\n\n                \"1:                             \\n\"\n                \"lsr        r4, 
%6, #2          \\n\" // r4 = max_kk >> 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        3f                  \\n\"\n\n                \".align 4                       \\n\"\n                \"2:                             \\n\"\n                \"pld        [%1, #256]          \\n\"\n                \"vld1.s8    {d0-d1}, [%1 :64]!  \\n\"\n                \"pld        [%2, #128]          \\n\"\n                \"vld1.s8    {d4-d5}, [%2]!      \\n\"\n                \"vrev64.32  q1, q0              \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vrev64.16  q3, q2              \\n\"\n                \"vmull.s8   q5, d2, d4          \\n\"\n                \"vmull.s8   q6, d0, d6          \\n\"\n                \"vmull.s8   q7, d2, d6          \\n\"\n                \"vmlal.s8   q4, d1, d5          \\n\"\n                \"vmlal.s8   q5, d3, d5          \\n\"\n                \"vmlal.s8   q6, d1, d7          \\n\"\n                \"vmlal.s8   q7, d3, d7          \\n\"\n                \"subs       r4, r4, #1          \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n                \"bne        2b                  \\n\"\n\n                \"3:                             \\n\"\n                \"and        r4, %6, #2          \\n\" // r4 = remain = max_kk & 2\n                \"cmp        r4, #0              \\n\"\n                \"beq        4f                  \\n\"\n\n                // kk += 2 part\n                \"vld1.s8    {d0}, [%1 :64]!     \\n\"\n                \"vld1.s8    {d4}, [%2]!         
\\n\"\n                \"vext.8     d1, d0, d0, #4      \\n\"\n                \"vrev64.16  d5, d4              \\n\"\n                \"vmull.s8   q4, d0, d4          \\n\"\n                \"vmull.s8   q5, d1, d4          \\n\"\n                \"vmull.s8   q6, d0, d5          \\n\"\n                \"vmull.s8   q7, d1, d5          \\n\"\n                \"vpadal.s16 q8, q4              \\n\"\n                \"vpadal.s16 q9, q5              \\n\"\n                \"vpadal.s16 q10, q6             \\n\"\n                \"vpadal.s16 q11, q7             \\n\"\n\n                \"4:                             \\n\"\n                \"and        r4, %6, #1          \\n\" // r4 = remain = max_kk & 1\n                \"cmp        r4, #0              \\n\"\n                \"beq        5f                  \\n\"\n\n                // kk += 1 part\n                \"vld1.s32   {d0[0]}, [%1]!      \\n\"\n                \"vld1.s32   {d2[]}, [%2]!       \\n\"\n                \"vrev32.16  d1, d0              \\n\"\n                \"vrev32.s8  d3, d2              \\n\"\n                \"vzip.32    d0, d1              \\n\"\n                \"vmull.s8   q4, d0, d2          \\n\"\n                \"vmull.s8   q5, d0, d3          \\n\"\n                \"vaddw.s16  q8, d8              \\n\"\n                \"vaddw.s16  q9, d9              \\n\"\n                \"vaddw.s16  q10, d10            \\n\"\n                \"vaddw.s16  q11, d11            \\n\"\n\n                \"5:                             \\n\"\n                \"vstm       %0!, {d16-d23}      \\n\"\n\n                : \"=r\"(outptr), // %0\n                \"=r\"(pA),     // %1\n                \"=r\"(pB)      // %2\n                : \"0\"(outptr),\n                \"1\"(pA),\n                \"2\"(pB),\n                \"r\"(max_kk), // %6\n                \"r\"(k)       // %7\n                : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", 
\"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum10 = vdupq_n_s32(0);\n                int32x4_t _sum11 = vdupq_n_s32(0);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                    // aaaaaaaa bbbbbbbb cccccccc dddddddd\n\n                    // 00000000 11111111 22222222 33333333\n\n                    _sum00 = vmmlaq_s32(_sum00, _pA0, _pB0);\n                    _sum01 = vmmlaq_s32(_sum01, _pA1, _pB0);\n                    _sum10 = vmmlaq_s32(_sum10, _pA0, _pB1);\n                    _sum11 = vmmlaq_s32(_sum11, _pA1, _pB1);\n\n                    // a0 a1 b0 b1\n                    // c0 c1 d0 d1\n                    // a2 a3 b2 b3\n                    // c2 c3 d2 d3\n\n                    pA += 32;\n                    pB += 32;\n                }\n                int32x4x2_t _ss0 = vuzpq_s32(_sum00, _sum01);\n                int32x4x2_t _ss1 = vuzpq_s32(_sum10, _sum11);\n               
 _sum0 = vaddq_s32(_sum0, _ss0.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss0.val[1]);\n                _sum2 = vaddq_s32(_sum2, _ss1.val[0]);\n                _sum3 = vaddq_s32(_sum3, _ss1.val[1]);\n            }\n#elif __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA0 = vld1q_s8(pA);\n                int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB0, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB0, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA0, _pB0, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA0, _pB0, 3);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA1, _pB1, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA1, _pB1, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA1, _pB1, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA1, _pB1, 3);\n\n                pA += 32;\n                pB += 32;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8 || __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                _sum0 = vdotq_laneq_s32(_sum0, _pA, _pB, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _pA, _pB, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _pA, _pB, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _pA, _pB, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA02 = vld1q_s8(pA);\n                int8x16_t _pB02 = vld1q_s8(pB);\n\n                // aabbccdd eeffgghh\n                // ccddaabb gghheeff\n\n                int8x16_t _pA13 = vreinterpretq_s8_s32(vrev64q_s32(vreinterpretq_s32_s8(_pA02)));\n\n                // 00112233 44556677\n                // 33221100 77665544\n\n             
   int8x16_t _pB13 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB02));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_pA02), vget_low_s8(_pB13));\n                int16x8_t _s3 = vmull_s8(vget_low_s8(_pA13), vget_low_s8(_pB13));\n\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA02), vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA13), vget_high_s8(_pB02));\n                _s2 = vmlal_s8(_s2, vget_high_s8(_pA02), vget_high_s8(_pB13));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_pA13), vget_high_s8(_pB13));\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s1 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                int16x8_t _s2 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 2)));\n                int16x8_t _s3 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 3)));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = 
vld1_s8(pB);\n\n                // aabbccdd\n                // ccddaabb\n\n                int8x8_t _pA1 = vext_s8(_pA0, _pA0, 4);\n\n                // 00112233\n                // 33221100\n\n                int8x8_t _pB1 = vreinterpret_s8_s16(vrev64_s16(vreinterpret_s16_s8(_pB0)));\n\n                int16x8_t _s0 = vmull_s8(_pA0, _pB0);\n                int16x8_t _s1 = vmull_s8(_pA1, _pB0);\n                int16x8_t _s2 = vmull_s8(_pA0, _pB1);\n                int16x8_t _s3 = vmull_s8(_pA1, _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                _pB = vzip_s8(_pB, _pB).val[0];\n                int16x4x2_t _pB0123 = vzip_s16(vreinterpret_s16_s8(_pB), vreinterpret_s16_s8(_pB));\n\n                int16x8_t _s01 = vmull_s8(_pA, vreinterpret_s8_s16(_pB0123.val[0]));\n                int16x8_t _s23 = vmull_s8(_pA, vreinterpret_s8_s16(_pB0123.val[1]));\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n#else  // __ARM_FEATURE_DOTPROD\n\n                int8x8_t _pA0 = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // abcd.... -> cdab.... 
-> abcdcdab\n                int8x8_t _pA1 = vreinterpret_s8_s16(vrev32_s16(vreinterpret_s16_s8(_pA0)));\n                int8x8_t _pA01 = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pA0), vreinterpret_s32_s8(_pA1)).val[0]);\n\n                // 01230123 -> 32103210\n                int8x8_t _pB1 = vrev32_s8(_pB0);\n\n                int16x8_t _s01 = vmull_s8(_pA01, _pB0);\n                int16x8_t _s23 = vmull_s8(_pA01, _pB1);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s01));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s01));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s23));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s23));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 4;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n            vst1q_s32(outptr + 8, _sum2);\n            vst1q_s32(outptr + 12, _sum3);\n\n            outptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            const signed char* pA = pAT;\n\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    
// aaaaaaaa bbbbbbbb cccccccc dddddddd\n\n                    // 00000000 11111111\n\n                    _sum00 = vmmlaq_s32(_sum00, _pA0, _pB);\n                    _sum01 = vmmlaq_s32(_sum01, _pA1, _pB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA0, _pB, 0);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA0, _pB, 1);\n                    _sum0 = vdotq_laneq_s32(_sum0, _pA1, _pB, 2);\n                    _sum1 = vdotq_laneq_s32(_sum1, _pA1, _pB, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 32;\n                    pB += 16;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _ss = vuzpq_s32(_sum00, _sum01);\n                _sum0 = vaddq_s32(_sum0, _ss.val[0]);\n                _sum1 = vaddq_s32(_sum1, _ss.val[1]);\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA, _pB, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _pA, _pB, 1);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                // aabbccdd eeffgghh\n\n                // 00112233 -> 00110011 22332233\n                // 11001100 33223322\n\n                int32x2x2_t _pBB = vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB));\n                int8x16_t _pB02 = vreinterpretq_s8_s32(vcombine_s32(_pBB.val[0], _pBB.val[1]));\n\n                int8x16_t _pB13 = vreinterpretq_s8_s16(vrev64q_s16(vreinterpretq_s16_s8(_pB02)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), vget_low_s8(_pB02));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA), vget_low_s8(_pB13));\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA), 
vget_high_s8(_pB02));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA), vget_high_s8(_pB13));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 8;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n                // aabbccdd\n                // 0011....\n                int16x8_t _s0 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 0)));\n                int16x8_t _s1 = vmull_s8(_pA, vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pB), 1)));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                // aabbccdd\n\n                // 00110011\n                // 11001100\n                int8x8_t _pB1 = vext_s8(_pB0, _pB0, 2);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB0);\n                int16x8_t _s1 = vmull_s8(_pA, _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 4;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                // abcdabcd\n\n                // 01010101 -> 00001111\n                _pB = vuzp_s8(_pB, vext_s8(_pB, _pB, 1)).val[0];\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n   
             _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                // abcd abcd\n\n                // 0101 0101 -> 0101 1010\n\n                int8x8_t _pB1 = vext_s8(_pB0, _pB0, 1);\n                int8x8_t _pB = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pB0), vreinterpret_s32_s8(_pB1)).val[0]);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 2;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n\n            outptr += 8;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            const signed char* pA = pAT;\n\n            int32x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n            }\n\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum23 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA0 = vld1q_s8(pA);\n                    int8x16_t _pA1 = vld1q_s8(pA + 16);\n                    int8x8_t _pB = vld1_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    // aaaaaaaa bbbbbbbb cccccccc dddddddd\n\n                    // 00000000\n\n                    int8x16_t _pBB = vcombine_s8(_pB, _pB);\n\n                    _sum01 = vdotq_s32(_sum01, _pA0, _pBB);\n                    _sum23 = 
vdotq_s32(_sum23, _pA1, _pBB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pA0, _pB, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pA1, _pB, 1);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 32;\n                    pB += 8;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_sum01, _sum23));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum0 = vdotq_lane_s32(_sum0, _pA, _pB, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB0 = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n                int8x8_t _pB1 = vreinterpret_s8_s16(vld1_dup_s16((const short*)(pB + 2)));\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), _pB0);\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA), _pB1);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s16(vld1_dup_s16((const short*)pB));\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n\n                pA += 8;\n                pB += 2;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vld1_dup_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n\n                pA += 4;\n        
        pB += 1;\n            }\n\n            vst1q_s32(outptr, _sum0);\n\n            outptr += 4;\n        }\n\n        pAT += max_kk * 4;\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n            int32x4_t _sum2;\n            int32x4_t _sum3;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n                _sum2 = vdupq_n_s32(0);\n                _sum3 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n                _sum2 = vld1q_s32(outptr + 8);\n                _sum3 = vld1q_s32(outptr + 12);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum23 = vdupq_n_s32(0);\n                int32x4_t _sum45 = vdupq_n_s32(0);\n                int32x4_t _sum67 = vdupq_n_s32(0);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2_t _sum00 = vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n                int32x2_t _sum11 = vdup_n_s32(0);\n                int32x2_t _sum20 = vdup_n_s32(0);\n                int32x2_t _sum21 = vdup_n_s32(0);\n                int32x2_t _sum30 = vdup_n_s32(0);\n                int32x2_t _sum31 = vdup_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 
16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    _sum01 = vmmlaq_s32(_sum01, _pA, _pB0);\n                    _sum23 = vmmlaq_s32(_sum23, _pA, _pB1);\n                    _sum45 = vmmlaq_s32(_sum45, _pA, _pB2);\n                    _sum67 = vmmlaq_s32(_sum67, _pA, _pB3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum00 = vdot_laneq_s32(_sum00, vget_low_s8(_pA), _pB0, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_low_s8(_pA), _pB0, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_low_s8(_pA), _pB0, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_low_s8(_pA), _pB0, 3);\n                    _sum20 = vdot_laneq_s32(_sum20, vget_low_s8(_pA), _pB1, 0);\n                    _sum21 = vdot_laneq_s32(_sum21, vget_low_s8(_pA), _pB1, 1);\n                    _sum30 = vdot_laneq_s32(_sum30, vget_low_s8(_pA), _pB1, 2);\n                    _sum31 = vdot_laneq_s32(_sum31, vget_low_s8(_pA), _pB1, 3);\n                    _sum00 = vdot_laneq_s32(_sum00, vget_high_s8(_pA), _pB2, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_high_s8(_pA), _pB2, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_high_s8(_pA), _pB2, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_high_s8(_pA), _pB2, 3);\n                    _sum20 = vdot_laneq_s32(_sum20, vget_high_s8(_pA), _pB3, 0);\n                    _sum21 = vdot_laneq_s32(_sum21, vget_high_s8(_pA), _pB3, 1);\n                    _sum30 = vdot_laneq_s32(_sum30, vget_high_s8(_pA), _pB3, 2);\n                    _sum31 = vdot_laneq_s32(_sum31, vget_high_s8(_pA), _pB3, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 16;\n                    pB += 64;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(vget_low_s32(_sum01), vget_low_s32(_sum23)));\n          
      _sum1 = vaddq_s32(_sum1, vcombine_s32(vget_low_s32(_sum45), vget_low_s32(_sum67)));\n                _sum2 = vaddq_s32(_sum2, vcombine_s32(vget_high_s32(_sum01), vget_high_s32(_sum23)));\n                _sum3 = vaddq_s32(_sum3, vcombine_s32(vget_high_s32(_sum45), vget_high_s32(_sum67)));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                int32x2x2_t _sum2x = vzip_s32(_sum20, _sum21);\n                int32x2x2_t _sum3x = vzip_s32(_sum30, _sum31);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], _sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum2x.val[0], _sum3x.val[0]));\n                _sum2 = vaddq_s32(_sum2, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n                _sum3 = vaddq_s32(_sum3, vcombine_s32(_sum2x.val[1], _sum3x.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_DOTPROD\n                int32x2_t _sum00 = vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n                int32x2_t _sum11 = vdup_n_s32(0);\n                int32x2_t _sum20 = vdup_n_s32(0);\n                int32x2_t _sum21 = vdup_n_s32(0);\n                int32x2_t _sum30 = vdup_n_s32(0);\n                int32x2_t _sum31 = vdup_n_s32(0);\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_DOTPROD\n                    _sum00 = vdot_laneq_s32(_sum00, _pA, _pB0, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, _pA, _pB0, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, _pA, _pB0, 2);\n                    _sum11 = 
vdot_laneq_s32(_sum11, _pA, _pB0, 3);\n                    _sum20 = vdot_laneq_s32(_sum20, _pA, _pB1, 0);\n                    _sum21 = vdot_laneq_s32(_sum21, _pA, _pB1, 1);\n                    _sum30 = vdot_laneq_s32(_sum30, _pA, _pB1, 2);\n                    _sum31 = vdot_laneq_s32(_sum31, _pA, _pB1, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 3));\n\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                    int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                    int16x8_t _s2 = vmull_s8(_pA1, vget_low_s8(_pB0));\n                    int16x8_t _s3 = vmull_s8(_pA1, vget_high_s8(_pB0));\n                    _s0 = vmlal_s8(_s0, _pA2, vget_low_s8(_pB1));\n                    _s1 = vmlal_s8(_s1, _pA2, vget_high_s8(_pB1));\n                    _s2 = vmlal_s8(_s2, _pA3, vget_low_s8(_pB1));\n                    _s3 = vmlal_s8(_s3, _pA3, vget_high_s8(_pB1));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n                    _sum2 = vpadalq_s16(_sum2, _s2);\n                    _sum3 = vpadalq_s16(_sum3, _s3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                    pA += 8;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_DOTPROD\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                int32x2x2_t _sum2x = vzip_s32(_sum20, _sum21);\n                int32x2x2_t _sum3x = vzip_s32(_sum30, _sum31);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], 
_sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum2x.val[0], _sum3x.val[0]));\n                _sum2 = vaddq_s32(_sum2, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n                _sum3 = vaddq_s32(_sum3, vcombine_s32(_sum2x.val[1], _sum3x.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x4_t _pA = vreinterpret_s16_s32(vld1_dup_s32((const int*)pA));\n                int8x16_t _pB = vld1q_s8(pB);\n\n                int16x4x2_t _pA01 = vuzp_s16(_pA, _pA);\n                int8x8_t _pA0 = vreinterpret_s8_s16(_pA01.val[0]);\n                int8x8_t _pA1 = vreinterpret_s8_s16(_pA01.val[1]);\n\n                int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB));\n                int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB));\n                int16x8_t _s2 = vmull_s8(_pA1, vget_low_s8(_pB));\n                int16x8_t _s3 = vmull_s8(_pA1, vget_high_s8(_pB));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                pA += 4;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x8_t _pB = vld1_s8(pB);\n\n                int8x8x2_t _pA01 = vuzp_s8(_pA, _pA);\n\n                int16x8_t _s0 = vmull_s8(_pA01.val[0], _pB);\n                int16x8_t _s1 = vmull_s8(_pA01.val[1], _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                _sum2 = vaddw_s16(_sum2, vget_low_s16(_s1));\n                _sum3 = vaddw_s16(_sum3, vget_high_s16(_s1));\n\n                pA += 2;\n                pB += 8;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            
vst1q_s32(outptr + 4, _sum1);\n            vst1q_s32(outptr + 8, _sum2);\n            vst1q_s32(outptr + 12, _sum3);\n\n            outptr += 16;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum23 = vdupq_n_s32(0);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2_t _sum00 = vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n                int32x2_t _sum11 = vdup_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    _sum01 = vmmlaq_s32(_sum01, _pA, _pB0);\n                    _sum23 = vmmlaq_s32(_sum23, _pA, _pB1);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum00 = vdot_laneq_s32(_sum00, vget_low_s8(_pA), _pB0, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_low_s8(_pA), _pB0, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_low_s8(_pA), _pB0, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_low_s8(_pA), _pB0, 3);\n                    _sum00 = vdot_laneq_s32(_sum00, vget_high_s8(_pA), _pB1, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, vget_high_s8(_pA), 
_pB1, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, vget_high_s8(_pA), _pB1, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, vget_high_s8(_pA), _pB1, 3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 16;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(vget_low_s32(_sum01), vget_low_s32(_sum23)));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(vget_high_s32(_sum01), vget_high_s32(_sum23)));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], _sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_DOTPROD\n                int32x2_t _sum00 = vdup_n_s32(0);\n                int32x2_t _sum01 = vdup_n_s32(0);\n                int32x2_t _sum10 = vdup_n_s32(0);\n                int32x2_t _sum11 = vdup_n_s32(0);\n#endif // __ARM_FEATURE_DOTPROD\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                    _sum00 = vdot_laneq_s32(_sum00, _pA, _pB, 0);\n                    _sum01 = vdot_laneq_s32(_sum01, _pA, _pB, 1);\n                    _sum10 = vdot_laneq_s32(_sum10, _pA, _pB, 2);\n                    _sum11 = vdot_laneq_s32(_sum11, _pA, _pB, 3);\n#else  // __ARM_FEATURE_DOTPROD\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                    int8x8_t _pA2 = 
vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 3));\n\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB));\n                    int16x8_t _s1 = vmull_s8(_pA1, vget_low_s8(_pB));\n                    _s0 = vmlal_s8(_s0, _pA2, vget_high_s8(_pB));\n                    _s1 = vmlal_s8(_s1, _pA3, vget_high_s8(_pB));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                    pA += 8;\n                    pB += 16;\n                }\n#if __ARM_FEATURE_DOTPROD\n                int32x2x2_t _sum0x = vzip_s32(_sum00, _sum01);\n                int32x2x2_t _sum1x = vzip_s32(_sum10, _sum11);\n                _sum0 = vaddq_s32(_sum0, vcombine_s32(_sum0x.val[0], _sum1x.val[0]));\n                _sum1 = vaddq_s32(_sum1, vcombine_s32(_sum0x.val[1], _sum1x.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int16x4_t _pA = vreinterpret_s16_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pA)), 0));\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x4x2_t _pA01 = vuzp_s16(_pA, _pA);\n                int8x8_t _pA0 = vreinterpret_s8_s16(_pA01.val[0]);\n                int8x8_t _pA1 = vreinterpret_s8_s16(_pA01.val[1]);\n\n                int16x8_t _s0 = vmull_s8(_pA0, _pB);\n                int16x8_t _s1 = vmull_s8(_pA1, _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n\n                pA += 4;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x8_t _pB = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pB)), 0));\n\n                
_pA = vzip_s8(_pA, _pA).val[0];\n                _pA = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA)).val[0]);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                pA += 2;\n                pB += 4;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n\n            outptr += 8;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n#if __ARM_NEON\n            int32x4_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum = vld1q_s32(outptr);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n\n#if __ARM_FEATURE_DOTPROD\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum = vmmlaq_s32(_sum, _pA, _pB);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int32x4x2_t _pAA = vzipq_s32(vreinterpretq_s32_s8(_pA), vreinterpretq_s32_s8(_pA));\n                int8x16_t _pA01 = vreinterpretq_s8_s32(_pAA.val[0]);\n                int8x16_t _pA23 = vreinterpretq_s8_s32(_pAA.val[1]);\n                int8x16_t _pB01 = vcombine_s8(vget_low_s8(_pB), vget_low_s8(_pB));\n                int8x16_t _pB23 = vcombine_s8(vget_high_s8(_pB), vget_high_s8(_pB));\n\n                _sum = vdotq_s32(_sum, _pA01, _pB01);\n                _sum = vdotq_s32(_sum, _pA23, _pB23);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                pA += 16;\n                pB += 16;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB 
= vld1_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                int32x2x2_t _pAA = vzip_s32(vreinterpret_s32_s8(_pA), vreinterpret_s32_s8(_pA));\n                int8x16_t _pA01 = vreinterpretq_s8_s32(vcombine_s32(_pAA.val[0], _pAA.val[1]));\n\n                int8x16_t _pB01 = vcombine_s8(_pB, _pB);\n\n                _sum = vdotq_s32(_sum, _pA01, _pB01);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4x2_t _pA01 = vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA));\n                int32x2x2_t _pB01 = vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB));\n\n                int16x8_t _s0 = vmull_s8(vreinterpret_s8_s16(_pA01.val[0]), vreinterpret_s8_s32(_pB01.val[0]));\n                _s0 = vmlal_s8(_s0, vreinterpret_s8_s16(_pA01.val[1]), vreinterpret_s8_s32(_pB01.val[1]));\n                _sum = vpadalq_s16(_sum, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 8;\n                pB += 8;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _pA = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA)).val[0]);\n                _pB = vreinterpret_s8_s32(vzip_s32(vreinterpret_s32_s8(_pB), vreinterpret_s32_s8(_pB)).val[0]);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum = vpadalq_s16(_sum, _s0);\n\n                // A0 A1 A2 A3\n                // B0 B1 B2 B3\n\n                // A0 A1 A0 A1 A2 A3 A2 A3\n                // B0 B1 B2 B3 B0 B1 B2 B3\n\n                pA += 4;\n                pB += 4;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x8_t _pB = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(pB)), 0));\n\n                _pA = vzip_s8(_pA, _pA).val[0];\n\n                int16x8_t _s0 = 
vmull_s8(_pA, _pB);\n                _sum = vaddw_s16(_sum, vget_low_s16(_s0));\n\n                // A0 A1 A0 A1\n                // B0 B1 B0 B1\n\n                // A0 A0 A1 A1\n\n                pA += 2;\n                pB += 2;\n            }\n\n            vst1q_s32(outptr, _sum);\n\n            outptr += 4;\n#else // __ARM_NEON\n            int sum00;\n            int sum10;\n            int sum01;\n            int sum11;\n\n            if (k == 0)\n            {\n                sum00 = 0;\n                sum10 = 0;\n                sum01 = 0;\n                sum11 = 0;\n            }\n            else\n            {\n                sum00 = outptr[0];\n                sum10 = outptr[1];\n                sum01 = outptr[2];\n                sum11 = outptr[3];\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                // fomit-frame-pointer implied in optimized flag spare one register\n                // let us stay away from error: ‘asm’ operand has impossible constraints   --- nihui\n#if __OPTIMIZE__\n                asm volatile(\n                    \"ldr    r2, [%0], #4    \\n\" // int8x4_t _pA = *((int8x4_t*)pA); pA += 4;\n                    \"ldr    r4, [%1], #4    \\n\" // int8x4_t _pB = *((int8x4_t*)pB); pB += 4;\n                    \"ror    r3, r2, #8      \\n\" // int8x4_t _pA_r8 = __ror(_pA, 8);\n                    \"ror    r5, r4, #8      \\n\" // int8x4_t _pB_r8 = __ror(_pB, 8);\n                    \"sxtb16 r2, r2          \\n\" // int16x2_t _pA0 = __sxtb16(_pA);\n                    \"sxtb16 r4, r4          \\n\" // int16x2_t _pB0 = __sxtb16(_pB);\n                    \"sxtb16 r3, r3          \\n\" // int16x2_t _pA1 = __sxtb16(_pA_r8);\n                    \"sxtb16 r5, r5          \\n\" // int16x2_t _pB1 = __sxtb16(_pB_r8);\n                    \"smlad  %2, r2, r4, %2  \\n\" // sum00 = 
__smlad(_pA0, _pB0, sum00);\n                    \"smlad  %3, r3, r4, %3  \\n\" // sum10 = __smlad(_pA1, _pB0, sum10);\n                    \"smlad  %4, r2, r5, %4  \\n\" // sum01 = __smlad(_pA0, _pB1, sum01);\n                    \"smlad  %5, r3, r5, %5  \\n\" // sum11 = __smlad(_pA1, _pB1, sum11);\n                    : \"=r\"(pA),\n                    \"=r\"(pB),\n                    \"=r\"(sum00),\n                    \"=r\"(sum10),\n                    \"=r\"(sum01),\n                    \"=r\"(sum11)\n                    : \"0\"(pA),\n                    \"1\"(pB),\n                    \"2\"(sum00),\n                    \"3\"(sum10),\n                    \"4\"(sum01),\n                    \"5\"(sum11)\n                    : \"memory\", \"r2\", \"r3\", \"r4\", \"r5\");\n#else\n                int _pA0 = *((int*)pA);\n                int _pB0 = *((int*)pB);\n                int _pA1;\n                int _pB1;\n                asm volatile(\"ror %0, %1, #8\"\n                             : \"=r\"(_pA1)\n                             : \"r\"(_pA0)\n                             :);\n                asm volatile(\"ror %0, %1, #8\"\n                             : \"=r\"(_pB1)\n                             : \"r\"(_pB0)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pA0)\n                             : \"0\"(_pA0)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pA1)\n                             : \"0\"(_pA1)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pB0)\n                             : \"0\"(_pB0)\n                             :);\n                asm volatile(\"sxtb16 %0, %0\"\n                             : \"=r\"(_pB1)\n                             : \"0\"(_pB1)\n                             :);\n                asm volatile(\"smlad %0, 
%2, %3, %0\"\n                             : \"=r\"(sum00)\n                             : \"0\"(sum00), \"r\"(_pA0), \"r\"(_pB0)\n                             :);\n                asm volatile(\"smlad %0, %2, %3, %0\"\n                             : \"=r\"(sum10)\n                             : \"0\"(sum10), \"r\"(_pA1), \"r\"(_pB0)\n                             :);\n                asm volatile(\"smlad %0, %2, %3, %0\"\n                             : \"=r\"(sum01)\n                             : \"0\"(sum01), \"r\"(_pA0), \"r\"(_pB1)\n                             :);\n                asm volatile(\"smlad %0, %2, %3, %0\"\n                             : \"=r\"(sum11)\n                             : \"0\"(sum11), \"r\"(_pA1), \"r\"(_pB1)\n                             :);\n                pA += 4;\n                pB += 4;\n#endif\n            }\n#endif // __ARM_FEATURE_SIMD32 && NCNN_GNU_INLINE_ASM\n            for (; kk < max_kk; kk += 1)\n            {\n                sum00 += pA[0] * pB[0];\n                sum10 += pA[1] * pB[0];\n                sum01 += pA[0] * pB[1];\n                sum11 += pA[1] * pB[1];\n\n                pA += 2;\n                pB += 2;\n            }\n\n            outptr[0] = sum00;\n            outptr[1] = sum10;\n            outptr[2] = sum01;\n            outptr[3] = sum11;\n\n            outptr += 4;\n#endif // __ARM_NEON\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n#if __ARM_NEON\n            int32x2_t _sum;\n\n            if (k == 0)\n            {\n                _sum = vdup_n_s32(0);\n            }\n            else\n            {\n                _sum = vld1_s32(outptr);\n            }\n#else  // __ARM_NEON\n            int sum0;\n            int sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0;\n                sum1 = 0;\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n#endif // __ARM_NEON\n\n     
       const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x8_t _pB = vld1_s8(pB);\n\n                    int8x16_t _pBB = vcombine_s8(_pB, _pB);\n\n                    _sum0 = vdotq_s32(_sum0, _pA, _pBB);\n\n                    pA += 16;\n                    pB += 8;\n                }\n                int32x2_t _ss = vpadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                _sum = vadd_s32(_sum, _ss);\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum = vdot_lane_s32(_sum, vget_low_s8(_pA), _pB, 0);\n                _sum = vdot_lane_s32(_sum, vget_high_s8(_pA), _pB, 1);\n\n                pA += 16;\n                pB += 8;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s32(vld1_dup_s32((const int*)pB));\n\n                _sum = vdot_s32(_sum, _pA, _pB);\n\n                pA += 8;\n                pB += 4;\n            }\n#else  // __ARM_FEATURE_DOTPROD\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x8_t _pB = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pB)), 0));\n\n                    _pB = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pB), vreinterpret_s16_s8(_pB)).val[0]);\n\n                    int16x8_t _s0 = vmull_s8(_pA, _pB);\n                    
_sum0 = vpadalq_s16(_sum0, _s0);\n\n                    pA += 8;\n                    pB += 4;\n                }\n                int32x2_t _ss = vadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                _sum = vadd_s32(_sum, _ss);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            int sum0 = vget_lane_s32(_sum, 0);\n            int sum1 = vget_lane_s32(_sum, 1);\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                sum0 += pA[0] * pB[0];\n                sum0 += pA[1] * pB[1];\n                sum1 += pA[2] * pB[0];\n                sum1 += pA[3] * pB[1];\n                pA += 4;\n                pB += 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[1] * pB[0];\n                pA += 2;\n                pB += 1;\n            }\n\n            outptr[0] = sum0;\n            outptr[1] = sum1;\n\n            outptr += 2;\n        }\n\n        pAT += max_kk * 2;\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const signed char* pB = pBT;\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0;\n            int32x4_t _sum1;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n                _sum1 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n                _sum1 = vld1q_s32(outptr + 4);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n                int32x4_t _sum10 = vdupq_n_s32(0);\n                int32x4_t _sum11 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; 
kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x16_t _pAA = vcombine_s8(_pA, _pA);\n                    _sum00 = vdotq_s32(_sum00, _pAA, _pB0);\n                    _sum01 = vdotq_s32(_sum01, _pAA, _pB1);\n                    _sum10 = vdotq_s32(_sum10, _pAA, _pB2);\n                    _sum11 = vdotq_s32(_sum11, _pAA, _pB3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pB0, _pA, 0);\n                    _sum1 = vdotq_lane_s32(_sum1, _pB1, _pA, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pB2, _pA, 1);\n                    _sum1 = vdotq_lane_s32(_sum1, _pB3, _pA, 1);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 8;\n                    pB += 64;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_sum00, _sum01));\n                _sum1 = vaddq_s32(_sum1, vpaddq_s32(_sum10, _sum11));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#else  // __ARM_FEATURE_DOTPROD\n            {\n                int32x4_t _sum2 = vdupq_n_s32(0);\n                int32x4_t _sum3 = vdupq_n_s32(0);\n                int32x4_t _sum4 = vdupq_n_s32(0);\n                int32x4_t _sum5 = vdupq_n_s32(0);\n                int32x4_t _sum6 = vdupq_n_s32(0);\n                int32x4_t _sum7 = vdupq_n_s32(0);\n                for (; kk + 15 < max_kk; kk += 16)\n                {\n                    // TODO\n                    // __builtin_prefetch(pA + 16);\n                    // __builtin_prefetch(pB + 128);\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 
16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n                    int8x16_t _pB4 = vld1q_s8(pB + 64);\n                    int8x16_t _pB5 = vld1q_s8(pB + 80);\n                    int8x16_t _pB6 = vld1q_s8(pB + 96);\n                    int8x16_t _pB7 = vld1q_s8(pB + 112);\n\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 1));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 3));\n                    int8x8_t _pA4 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 0));\n                    int8x8_t _pA5 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 1));\n                    int8x8_t _pA6 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 2));\n                    int8x8_t _pA7 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 3));\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                    int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                    int16x8_t _s2 = vmull_s8(_pA2, vget_low_s8(_pB2));\n                    int16x8_t _s3 = vmull_s8(_pA2, vget_high_s8(_pB2));\n                    int16x8_t _s4 = vmull_s8(_pA4, vget_low_s8(_pB4));\n                    int16x8_t _s5 = vmull_s8(_pA4, vget_high_s8(_pB4));\n                    int16x8_t _s6 = vmull_s8(_pA6, vget_low_s8(_pB6));\n                    int16x8_t _s7 = vmull_s8(_pA6, vget_high_s8(_pB6));\n                    _s0 = vmlal_s8(_s0, _pA1, vget_low_s8(_pB1));\n                    _s1 = vmlal_s8(_s1, _pA1, vget_high_s8(_pB1));\n          
          _s2 = vmlal_s8(_s2, _pA3, vget_low_s8(_pB3));\n                    _s3 = vmlal_s8(_s3, _pA3, vget_high_s8(_pB3));\n                    _s4 = vmlal_s8(_s4, _pA5, vget_low_s8(_pB5));\n                    _s5 = vmlal_s8(_s5, _pA5, vget_high_s8(_pB5));\n                    _s6 = vmlal_s8(_s6, _pA7, vget_low_s8(_pB7));\n                    _s7 = vmlal_s8(_s7, _pA7, vget_high_s8(_pB7));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n                    _sum2 = vpadalq_s16(_sum2, _s2);\n                    _sum3 = vpadalq_s16(_sum3, _s3);\n                    _sum4 = vpadalq_s16(_sum4, _s4);\n                    _sum5 = vpadalq_s16(_sum5, _s5);\n                    _sum6 = vpadalq_s16(_sum6, _s6);\n                    _sum7 = vpadalq_s16(_sum7, _s7);\n\n                    pA += 16;\n                    pB += 128;\n                }\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 3));\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                    int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                    int16x8_t _s2 = vmull_s8(_pA2, vget_low_s8(_pB2));\n                    int16x8_t _s3 = vmull_s8(_pA2, vget_high_s8(_pB2));\n                    _s0 = vmlal_s8(_s0, _pA1, 
vget_low_s8(_pB1));\n                    _s1 = vmlal_s8(_s1, _pA1, vget_high_s8(_pB1));\n                    _s2 = vmlal_s8(_s2, _pA3, vget_low_s8(_pB3));\n                    _s3 = vmlal_s8(_s3, _pA3, vget_high_s8(_pB3));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n                    _sum2 = vpadalq_s16(_sum2, _s2);\n                    _sum3 = vpadalq_s16(_sum3, _s3);\n\n                    pA += 8;\n                    pB += 64;\n                }\n                _sum0 = vaddq_s32(_sum0, _sum2);\n                _sum1 = vaddq_s32(_sum1, _sum3);\n                _sum0 = vaddq_s32(_sum0, _sum4);\n                _sum1 = vaddq_s32(_sum1, _sum5);\n                _sum0 = vaddq_s32(_sum0, _sum6);\n                _sum1 = vaddq_s32(_sum1, _sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pA)), 0));\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_DOTPROD\n                _sum0 = vdotq_lane_s32(_sum0, _pB0, _pA, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _pB1, _pA, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(_pA0, vget_high_s8(_pB0));\n                _s0 = vmlal_s8(_s0, _pA1, vget_low_s8(_pB1));\n                _s1 = vmlal_s8(_s1, _pA1, vget_high_s8(_pB1));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 32;\n            }\n            
for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vld1_dup_s16((const short*)pA));\n                int8x16_t _pB = vld1q_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, vget_low_s8(_pB));\n                int16x8_t _s1 = vmull_s8(_pA, vget_high_s8(_pB));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n\n                pA += 2;\n                pB += 16;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vld1_dup_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                pA += 1;\n                pB += 8;\n            }\n\n            vst1q_s32(outptr, _sum0);\n            vst1q_s32(outptr + 4, _sum1);\n\n            outptr += 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0;\n\n            if (k == 0)\n            {\n                _sum0 = vdupq_n_s32(0);\n            }\n            else\n            {\n                _sum0 = vld1q_s32(outptr);\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_FEATURE_DOTPROD\n            {\n#if __ARM_FEATURE_MATMUL_INT8\n                int32x4_t _sum00 = vdupq_n_s32(0);\n                int32x4_t _sum01 = vdupq_n_s32(0);\n#endif // __ARM_FEATURE_MATMUL_INT8\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                    int8x16_t _pAA = vcombine_s8(_pA, _pA);\n                    _sum00 = vdotq_s32(_sum00, _pAA, _pB0);\n                    _sum01 = 
vdotq_s32(_sum01, _pAA, _pB1);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                    _sum0 = vdotq_lane_s32(_sum0, _pB0, _pA, 0);\n                    _sum0 = vdotq_lane_s32(_sum0, _pB1, _pA, 1);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                    pA += 8;\n                    pB += 32;\n                }\n#if __ARM_FEATURE_MATMUL_INT8\n                _sum0 = vaddq_s32(_sum0, vpaddq_s32(_sum00, _sum01));\n#endif // __ARM_FEATURE_MATMUL_INT8\n            }\n#else  // __ARM_FEATURE_DOTPROD\n            {\n                int32x4_t _sum1 = vdupq_n_s32(0);\n                int32x4_t _sum2 = vdupq_n_s32(0);\n                int32x4_t _sum3 = vdupq_n_s32(0);\n                for (; kk + 15 < max_kk; kk += 16)\n                {\n                    // TODO\n                    // __builtin_prefetch(pA + 16);\n                    // __builtin_prefetch(pB + 64);\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n                    int8x16_t _pB2 = vld1q_s8(pB + 32);\n                    int8x16_t _pB3 = vld1q_s8(pB + 48);\n\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 1));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_low_s8(_pA)), 3));\n                    int8x8_t _pA4 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 0));\n                    int8x8_t _pA5 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 1));\n                    int8x8_t _pA6 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 2));\n                    int8x8_t _pA7 
= vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vget_high_s8(_pA)), 3));\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                    int16x8_t _s1 = vmull_s8(_pA2, vget_low_s8(_pB1));\n                    int16x8_t _s2 = vmull_s8(_pA4, vget_low_s8(_pB2));\n                    int16x8_t _s3 = vmull_s8(_pA6, vget_low_s8(_pB3));\n                    _s0 = vmlal_s8(_s0, _pA1, vget_high_s8(_pB0));\n                    _s1 = vmlal_s8(_s1, _pA3, vget_high_s8(_pB1));\n                    _s2 = vmlal_s8(_s2, _pA5, vget_high_s8(_pB2));\n                    _s3 = vmlal_s8(_s3, _pA7, vget_high_s8(_pB3));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n                    _sum2 = vpadalq_s16(_sum2, _s2);\n                    _sum3 = vpadalq_s16(_sum3, _s3);\n\n                    pA += 16;\n                    pB += 64;\n                }\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 2));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 3));\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                    int16x8_t _s1 = vmull_s8(_pA2, vget_low_s8(_pB1));\n                    _s0 = vmlal_s8(_s0, _pA1, vget_high_s8(_pB0));\n                    _s1 = vmlal_s8(_s1, _pA3, vget_high_s8(_pB1));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n\n                    pA += 8;\n  
                  pB += 32;\n                }\n                _sum0 = vaddq_s32(_sum0, _sum1);\n                _sum0 = vaddq_s32(_sum0, _sum2);\n                _sum0 = vaddq_s32(_sum0, _sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                _sum0 = vdotq_lane_s32(_sum0, _pB, _pA, 0);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _pA0 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 0));\n                int8x8_t _pA1 = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(_pA), 1));\n                int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB));\n                _s0 = vmlal_s8(_s0, _pA1, vget_high_s8(_pB));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 4;\n                pB += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                int8x8_t _pA = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(pA)), 0));\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n\n                pA += 2;\n                pB += 8;\n            }\n            for (; kk < max_kk; kk += 1)\n            {\n                int8x8_t _pA = vld1_dup_s8(pA);\n                int8x8_t _pB = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pB)), 0));\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n\n                pA += 1;\n                pB += 4;\n            }\n\n            vst1q_s32(outptr, _sum0);\n\n            outptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n#if __ARM_NEON\n            int32x2_t _sum;\n\n            if 
(k == 0)\n            {\n                _sum = vdup_n_s32(0);\n            }\n            else\n            {\n                _sum = vld1_s32(outptr);\n            }\n#else  // __ARM_NEON\n            int sum0;\n            int sum1;\n\n            if (k == 0)\n            {\n                sum0 = 0;\n                sum1 = 0;\n            }\n            else\n            {\n                sum0 = outptr[0];\n                sum1 = outptr[1];\n            }\n#endif // __ARM_NEON\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB = vld1q_s8(pB);\n\n                    int8x16_t _pAA = vcombine_s8(_pA, _pA);\n\n                    _sum0 = vdotq_s32(_sum0, _pAA, _pB);\n\n                    pA += 8;\n                    pB += 16;\n                }\n                int32x2_t _ss = vpadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                _sum = vadd_s32(_sum, _ss);\n            }\n#else  // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n                _sum = vdot_lane_s32(_sum, vget_low_s8(_pB), _pA, 0);\n                _sum = vdot_lane_s32(_sum, vget_high_s8(_pB), _pA, 1);\n\n                pA += 8;\n                pB += 16;\n            }\n#endif // __ARM_FEATURE_MATMUL_INT8\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                int8x8_t _pA = vreinterpret_s8_s32(vld1_dup_s32((const int*)pA));\n                int8x8_t _pB = vld1_s8(pB);\n\n                _sum = vdot_s32(_sum, _pA, _pB);\n\n                pA += 4;\n                pB += 8;\n            }\n#else  // 
__ARM_FEATURE_DOTPROD\n            {\n                int32x4_t _sum0 = vdupq_n_s32(0);\n                int32x4_t _sum1 = vdupq_n_s32(0);\n                for (; kk + 15 < max_kk; kk += 16)\n                {\n                    int8x16_t _pA = vld1q_s8(pA);\n                    int8x16_t _pB0 = vld1q_s8(pB);\n                    int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n                    int16x8x2_t _pAA = vzipq_s16(vreinterpretq_s16_s8(_pA), vreinterpretq_s16_s8(_pA));\n\n                    int8x8_t _pA0 = vreinterpret_s8_s16(vget_low_s16(_pAA.val[0]));\n                    int8x8_t _pA1 = vreinterpret_s8_s16(vget_high_s16(_pAA.val[0]));\n                    int8x8_t _pA2 = vreinterpret_s8_s16(vget_low_s16(_pAA.val[1]));\n                    int8x8_t _pA3 = vreinterpret_s8_s16(vget_high_s16(_pAA.val[1]));\n\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB0));\n                    int16x8_t _s1 = vmull_s8(_pA2, vget_low_s8(_pB1));\n                    _s0 = vmlal_s8(_s0, _pA1, vget_high_s8(_pB0));\n                    _s1 = vmlal_s8(_s1, _pA3, vget_high_s8(_pB1));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n                    _sum1 = vpadalq_s16(_sum1, _s1);\n\n                    pA += 16;\n                    pB += 32;\n                }\n                _sum0 = vaddq_s32(_sum0, _sum1);\n                for (; kk + 7 < max_kk; kk += 8)\n                {\n                    int8x8_t _pA = vld1_s8(pA);\n                    int8x16_t _pB = vld1q_s8(pB);\n\n                    int16x4x2_t _pAA = vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA));\n\n                    int8x8_t _pA0 = vreinterpret_s8_s16(_pAA.val[0]);\n                    int8x8_t _pA1 = vreinterpret_s8_s16(_pAA.val[1]);\n\n                    int16x8_t _s0 = vmull_s8(_pA0, vget_low_s8(_pB));\n                    _s0 = vmlal_s8(_s0, _pA1, vget_high_s8(_pB));\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n\n                    pA += 8;\n            
        pB += 16;\n                }\n                for (; kk + 3 < max_kk; kk += 4)\n                {\n                    int8x8_t _pA = vreinterpret_s8_s32(vdup_lane_s32(vreinterpret_s32_s8(vld1_s8(pA)), 0));\n                    int8x8_t _pB = vld1_s8(pB);\n\n                    _pA = vreinterpret_s8_s16(vzip_s16(vreinterpret_s16_s8(_pA), vreinterpret_s16_s8(_pA)).val[0]);\n\n                    int16x8_t _s0 = vmull_s8(_pA, _pB);\n                    _sum0 = vpadalq_s16(_sum0, _s0);\n\n                    pA += 4;\n                    pB += 8;\n                }\n                int32x2_t _ss = vadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                _sum = vadd_s32(_sum, _ss);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            int sum0 = vget_lane_s32(_sum, 0);\n            int sum1 = vget_lane_s32(_sum, 1);\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                sum0 += pA[0] * pB[0];\n                sum0 += pA[1] * pB[1];\n                sum1 += pA[0] * pB[2];\n                sum1 += pA[1] * pB[3];\n                pA += 2;\n                pB += 4;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk += 1)\n            {\n                sum0 += pA[0] * pB[0];\n                sum1 += pA[0] * pB[1];\n                pA += 1;\n                pB += 2;\n            }\n\n            outptr[0] = sum0;\n            outptr[1] = sum1;\n\n            outptr += 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            int sum;\n\n            if (k == 0)\n            {\n                sum = 0;\n            }\n            else\n            {\n                sum = outptr[0];\n            }\n\n            const signed char* pA = pAT;\n            int kk = 0;\n#if __ARM_NEON\n            int32x4_t _sum = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            for (; kk + 31 < max_kk; kk += 32)\n            {\n                int8x16_t _pA0 = vld1q_s8(pA);\n     
           int8x16_t _pA1 = vld1q_s8(pA + 16);\n                int8x16_t _pB0 = vld1q_s8(pB);\n                int8x16_t _pB1 = vld1q_s8(pB + 16);\n\n#if __ARM_FEATURE_DOTPROD\n                _sum = vdotq_s32(_sum, _pA0, _pB0);\n                _sum1 = vdotq_s32(_sum1, _pA1, _pB1);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA0), vget_low_s8(_pB0));\n                int16x8_t _s1 = vmull_s8(vget_low_s8(_pA1), vget_low_s8(_pB1));\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA0), vget_high_s8(_pB0));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_pA1), vget_high_s8(_pB1));\n                _sum = vpadalq_s16(_sum, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 32;\n                pB += 32;\n            }\n            _sum = vaddq_s32(_sum, _sum1);\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                int8x16_t _pA = vld1q_s8(pA);\n                int8x16_t _pB = vld1q_s8(pB);\n\n#if __ARM_FEATURE_DOTPROD\n                _sum = vdotq_s32(_sum, _pA, _pB);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_pA), vget_low_s8(_pB));\n                _s0 = vmlal_s8(_s0, vget_high_s8(_pA), vget_high_s8(_pB));\n                _sum = vpadalq_s16(_sum, _s0);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pA += 16;\n                pB += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                int8x8_t _pA = vld1_s8(pA);\n                int8x8_t _pB = vld1_s8(pB);\n\n                int16x8_t _s0 = vmull_s8(_pA, _pB);\n                _sum = vpadalq_s16(_sum, _s0);\n\n                pA += 8;\n                pB += 8;\n            }\n#if __aarch64__\n            sum += vaddvq_s32(_sum);\n#else\n            int32x2_t _ss = vadd_s32(vget_low_s32(_sum), vget_high_s32(_sum));\n            _ss = vpadd_s32(_ss, _ss);\n            sum += 
vget_lane_s32(_ss, 0);\n#endif\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk += 1)\n            {\n                sum += pA[0] * pB[0];\n                pA += 1;\n                pB += 1;\n            }\n\n            outptr[0] = sum;\n\n            outptr += 1;\n        }\n\n        pAT += max_kk;\n    }\n}\n\nstatic void get_optimal_tile_mnk_int8(int M, int N, int K, int constant_TILE_M, int constant_TILE_N, int constant_TILE_K, int& TILE_M, int& TILE_N, int& TILE_K, int nT)\n{\n    // resolve optimal tile size from cache size\n    const size_t l2_cache_size = get_cpu_level2_cache_size();\n\n    if (nT == 0)\n        nT = get_physical_big_cpu_count();\n\n    int tile_size = (int)sqrtf((float)l2_cache_size / (2 * sizeof(signed char) + sizeof(int)));\n\n    TILE_M = std::max(8, tile_size / 8 * 8);\n#if __aarch64__\n    TILE_N = std::max(8, tile_size / 8 * 8);\n#else\n    TILE_N = std::max(4, tile_size / 4 * 4);\n#endif\n    TILE_K = std::max(8, tile_size / 8 * 8);\n\n    if (K > 0)\n    {\n        int nn_K = (K + TILE_K - 1) / TILE_K;\n        TILE_K = std::min(TILE_K, ((K + nn_K - 1) / nn_K + 7) / 8 * 8);\n\n        if (nn_K == 1)\n        {\n            tile_size = (int)((float)l2_cache_size / 2 / sizeof(signed char) / TILE_K);\n\n            TILE_M = std::max(8, tile_size / 8 * 8);\n#if __aarch64__\n            TILE_N = std::max(8, tile_size / 8 * 8);\n#else\n            TILE_N = std::max(4, tile_size / 4 * 4);\n#endif\n        }\n    }\n\n    TILE_M *= std::min(nT, get_physical_cpu_count());\n\n    if (M > 0)\n    {\n        int nn_M = (M + TILE_M - 1) / TILE_M;\n        TILE_M = std::min(TILE_M, ((M + nn_M - 1) / nn_M + 7) / 8 * 8);\n    }\n\n    if (N > 0)\n    {\n        int nn_N = (N + TILE_N - 1) / TILE_N;\n#if __aarch64__\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 7) / 8 * 8);\n#else\n        TILE_N = std::min(TILE_N, ((N + nn_N - 1) / nn_N + 3) / 4 * 4);\n#endif\n    }\n\n    if (nT > 1)\n    {\n        TILE_M = 
std::min(TILE_M, (std::max(1, TILE_M / nT) + 7) / 8 * 8);\n    }\n\n    // always take constant TILE_M/N/K value when provided\n    if (constant_TILE_M > 0)\n    {\n        TILE_M = (constant_TILE_M + 7) / 8 * 8;\n    }\n\n    if (constant_TILE_N > 0)\n    {\n#if __aarch64__\n        TILE_N = (constant_TILE_N + 7) / 8 * 8;\n#else\n        TILE_N = (constant_TILE_N + 3) / 4 * 4;\n#endif\n    }\n\n    if (constant_TILE_K > 0)\n    {\n        TILE_K = (constant_TILE_K + 7) / 8 * 8;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/gemm_int8_bf16s.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\nvoid pack_A_tile_bf16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid transpose_pack_A_tile_bf16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid pack_B_tile_bf16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid transpose_pack_B_tile_bf16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\nvoid pack_A_tile_bf16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid transpose_pack_A_tile_bf16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid pack_B_tile_bf16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid transpose_pack_B_tile_bf16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid unpack_output_tile_int32_to_bf16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\nvoid transpose_unpack_output_tile_int32_to_bf16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\n#endif\n\nstatic void compute_A_tile_bf16_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n    const int K = A.w;\n\n    // NCNN_LOGE(\"compute_A_tile_bf16_int8_scales %d %d\", max_ii, elempack);\n\n    const float v127_B_scale = 127.f * B_scale;\n\n    float* ps = (float*)scales + i;\n    float* pods = (float*)out_descales + i;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n#if __aarch64__\n        float32x4_t _v127 = vdupq_n_f32(127.f);\n        float32x4_t _v127_B_scale = vdupq_n_f32(v127_B_scale);\n#endif\n\n        for (int ii = 0; ii + 3 < max_ii; ii += 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n           
     _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += 4;\n            }\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax0);\n            float32x4_t _out_descale = vdivq_f32(_absmax0, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax0);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax0, _recp_v127_B_scale);\n\n            float tmp[4];\n            vst1q_f32(tmp, _absmax0);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n#endif\n            ps += 4;\n            pods += 4;\n        }\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        for (int ii = 0; ii < max_ii; ii++)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep;\n\n            float absmax = 0.f;\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t 
_absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            for (; kk + 15 < K; kk += 16)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 7 < K; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += 4;\n            }\n            float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            absmax = std::max(absmax, std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1)));\n#endif // __ARM_NEON\n            for (; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf(bfloat16_to_float32(p0[0])));\n                p0++;\n            }\n\n     
       ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n}\n\nstatic void pack_A_tile_bf16_to_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_A_tile_bf16_to_int8_i8mm(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_A_tile_bf16_to_int8_asimddp(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    // NCNN_LOGE(\"pack_A_tile_bf16_to_int8 %d %d\", max_ii, elempack);\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k * elempack;\n\n        float32x4_t _scale0 = vld1q_f32((const float*)scales + i + ii);\n        float32x4_t _scale1 = vld1q_f32((const float*)scales + i + ii + 4);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n                uint16x8x4_t _q = vld4q_u16(p0 + A_hstep * 4);\n\n                float32x4_t _p0 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[0])), _scale0, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[1])), _scale0, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[2])), _scale0, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[3])), _scale0, 3);\n    
            float32x4_t _p4 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[0])), _scale0, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[1])), _scale0, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[2])), _scale0, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[3])), _scale0, 3);\n                float32x4_t _p8 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_q.val[0])), _scale1, 0);\n                float32x4_t _p9 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_q.val[1])), _scale1, 1);\n                float32x4_t _pa = vmulq_laneq_f32(bfloat2float(vget_low_u16(_q.val[2])), _scale1, 2);\n                float32x4_t _pb = vmulq_laneq_f32(bfloat2float(vget_low_u16(_q.val[3])), _scale1, 3);\n                float32x4_t _pc = vmulq_laneq_f32(bfloat2float(vget_high_u16(_q.val[0])), _scale1, 0);\n                float32x4_t _pd = vmulq_laneq_f32(bfloat2float(vget_high_u16(_q.val[1])), _scale1, 1);\n                float32x4_t _pe = vmulq_laneq_f32(bfloat2float(vget_high_u16(_q.val[2])), _scale1, 2);\n                float32x4_t _pf = vmulq_laneq_f32(bfloat2float(vget_high_u16(_q.val[3])), _scale1, 3);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n                int8x8_t _r4 = float2int8(_p8, _pc);\n                int8x8_t _r5 = float2int8(_p9, _pd);\n                int8x8_t _r6 = float2int8(_pa, _pe);\n                int8x8_t _r7 = float2int8(_pb, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p8, _p9);\n                int8x8_t _r3 = float2int8(_pa, _pb);\n                int8x8_t _r4 = 
float2int8(_p4, _p5);\n                int8x8_t _r5 = float2int8(_p6, _p7);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + A_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 4 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n                float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf 
= bfloat2float(vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale0);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale0);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale0);\n                _p8 = vmulq_f32(_p8, _scale1);\n                _p9 = vmulq_f32(_p9, _scale1);\n                _pa = vmulq_f32(_pa, _scale1);\n                _pb = vmulq_f32(_pb, _scale1);\n                _pc = vmulq_f32(_pc, _scale1);\n                _pd = vmulq_f32(_pd, _scale1);\n                _pe = vmulq_f32(_pe, _scale1);\n                _pf = vmulq_f32(_pf, _scale1);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p8), float2int8(_p2, _pa));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p9), float2int8(_p3, _pb));\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(float2int8(_p4, _pc), float2int8(_p6, _pe));\n                _r23.val[1] = vcombine_s8(float2int8(_p5, _pd), float2int8(_p7, _pf));\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n                uint16x4x4_t _q = vld4_u16(p0 + A_hstep * 4);\n\n                float32x4_t _p0 = vmulq_laneq_f32(bfloat2float(_p.val[0]), _scale0, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(bfloat2float(_p.val[1]), _scale0, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(bfloat2float(_p.val[2]), _scale0, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(bfloat2float(_p.val[3]), _scale0, 3);\n                
float32x4_t _p4 = vmulq_laneq_f32(bfloat2float(_q.val[0]), _scale1, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(bfloat2float(_q.val[1]), _scale1, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(bfloat2float(_q.val[2]), _scale1, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(bfloat2float(_q.val[3]), _scale1, 3);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 4 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale0);\n                _p4 = vmulq_f32(_p4, _scale1);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale1);\n                _p7 = vmulq_f32(_p7, _scale1);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p4), float2int8(_p2, _p6));\n               
 _r01.val[1] = vcombine_s8(float2int8(_p1, _p5), float2int8(_p3, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep * 4);\n\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p0n = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p1n = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p0n = vmulq_f32(_p0n, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p1n = vmulq_f32(_p1n, _scale1);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p0n, _p1n);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + A_hstep * 4);\n     
           uint16x8_t _u = vld1q_u16(p0 + A_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n                float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf = bfloat2float(vget_high_u16(_w));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale0, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale0, 2);\n                _p6 = vmulq_laneq_f32(_p6, _scale0, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale0, 3);\n                _p8 = vmulq_laneq_f32(_p8, _scale1, 0);\n                _p9 = vmulq_laneq_f32(_p9, _scale1, 0);\n                _pa = vmulq_laneq_f32(_pa, _scale1, 1);\n                _pb = vmulq_laneq_f32(_pb, _scale1, 1);\n                _pc = 
vmulq_laneq_f32(_pc, _scale1, 2);\n                _pd = vmulq_laneq_f32(_pd, _scale1, 2);\n                _pe = vmulq_laneq_f32(_pe, _scale1, 3);\n                _pf = vmulq_laneq_f32(_pf, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 0);\n                _p2 = vmulq_lane_f32(_p2, vget_low_f32(_scale0), 1);\n                _p3 = vmulq_lane_f32(_p3, vget_low_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_high_f32(_scale0), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_high_f32(_scale0), 0);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale0), 1);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale0), 1);\n                _p8 = vmulq_lane_f32(_p8, vget_low_f32(_scale1), 0);\n                _p9 = vmulq_lane_f32(_p9, vget_low_f32(_scale1), 0);\n                _pa = vmulq_lane_f32(_pa, vget_low_f32(_scale1), 1);\n                _pb = vmulq_lane_f32(_pb, vget_low_f32(_scale1), 1);\n                _pc = vmulq_lane_f32(_pc, vget_high_f32(_scale1), 0);\n                _pd = vmulq_lane_f32(_pd, vget_high_f32(_scale1), 0);\n                _pe = vmulq_lane_f32(_pe, vget_high_f32(_scale1), 1);\n                _pf = vmulq_lane_f32(_pf, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n              
  int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p8, _pa));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_pc, _pe));\n                int16x4_t _t4 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t5 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4_t _t6 = vreinterpret_s16_s8(float2int8(_p9, _pb));\n                int16x4_t _t7 = vreinterpret_s16_s8(float2int8(_pd, _pf));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int16x4x2_t _t45 = vuzp_s16(_t4, _t5);\n                int16x4x2_t _t67 = vuzp_s16(_t6, _t7);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n                int8x8_t _r4 = vreinterpret_s8_s16(_t45.val[0]);\n                int8x8_t _r5 = vreinterpret_s8_s16(_t67.val[0]);\n                int8x8_t _r6 = vreinterpret_s8_s16(_t45.val[1]);\n                int8x8_t _r7 = vreinterpret_s8_s16(_t67.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, 
_r7));\n\n                pp += 64;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + A_hstep * 3));\n                float32x4_t _p4 = bfloat2float(vld1_u16(p0 + A_hstep * 4));\n                float32x4_t _p5 = bfloat2float(vld1_u16(p0 + A_hstep * 5));\n                float32x4_t _p6 = bfloat2float(vld1_u16(p0 + A_hstep * 6));\n                float32x4_t _p7 = bfloat2float(vld1_u16(p0 + A_hstep * 7));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                
int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep * 4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[A_hstep * 6 + 1], _q, 5);\n            
    _q = vsetq_lane_u16(p0[A_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p45 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p67 = bfloat2float(vget_high_u16(_q));\n\n                float32x4x2_t _scale01 = vzipq_f32(_scale0, _scale0);\n                float32x4x2_t _scale23 = vzipq_f32(_scale1, _scale1);\n\n                _p01 = vmulq_f32(_p01, _scale01.val[0]);\n                _p23 = vmulq_f32(_p23, _scale01.val[1]);\n                _p45 = vmulq_f32(_p45, _scale23.val[0]);\n                _p67 = vmulq_f32(_p67, _scale23.val[1]);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 7], _p, 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0++;\n            }\n        }\n    
}\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k * elempack;\n\n        float32x4_t _scale = vld1q_f32((const float*)scales + i + ii);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n\n                float32x4_t _p0 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[0])), _scale, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[1])), _scale, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[2])), _scale, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(bfloat2float(vget_low_u16(_p.val[3])), _scale, 3);\n                float32x4_t _p4 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[0])), _scale, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[1])), _scale, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[2])), _scale, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(bfloat2float(vget_high_u16(_p.val[3])), _scale, 3);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                
uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n\n                float32x4_t _p0 = vmulq_laneq_f32(bfloat2float(_p.val[0]), _scale, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(bfloat2float(_p.val[1]), _scale, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(bfloat2float(_p.val[2]), _scale, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(bfloat2float(_p.val[3]), _scale, 3);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = 
float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0 += 4;\n 
           }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale, 2);\n                _p6 = vmulq_laneq_f32(_p6, _scale, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 0);\n                _p2 = vmulq_lane_f32(_p2, vget_low_f32(_scale), 1);\n                _p3 = vmulq_lane_f32(_p3, vget_low_f32(_scale), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_high_f32(_scale), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_high_f32(_scale), 0);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale), 1);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if 
__ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + A_hstep * 3));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 
2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n\n                float32x4x2_t _scale01 = vzipq_f32(_scale, _scale);\n\n                _p01 = vmulq_f32(_p01, _scale01.val[0]);\n                _p23 = vmulq_f32(_p23, _scale01.val[1]);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n\n   
             vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x4_t _p = uint16x4_t();\n                _p = vset_lane_u16(p0[0], _p, 0);\n                _p = vset_lane_u16(p0[A_hstep], _p, 1);\n                _p = vset_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vset_lane_u16(p0[A_hstep * 3], _p, 3);\n                float32x4_t _p0 = bfloat2float(_p);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n\n        const float scale0 = scales[i + ii];\n        const float scale1 = scales[i + ii + 1];\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale0 = vdupq_n_f32(scale0);\n            float32x4_t _scale1 = vdupq_n_f32(scale1);\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale1);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if 
__ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p2));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p2));\n                float32x4_t _t2 = vcombine_f32(vget_low_f32(_p1), vget_low_f32(_p3));\n                float32x4_t _t3 = vcombine_f32(vget_high_f32(_p1), vget_high_f32(_p3));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n                int8x8_t _r1 = float2int8(_t2, _t3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n                vst1_s8(pp + 8, _r1);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale0);\n                pp[1] = float2int8(bfloat16_to_float32(p0[1]) * scale0);\n                pp[2] 
= float2int8(bfloat16_to_float32(p0[A_hstep]) * scale1);\n                pp[3] = float2int8(bfloat16_to_float32(p0[A_hstep + 1]) * scale1);\n                pp += 4;\n                p0 += 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale0);\n                pp[1] = float2int8(bfloat16_to_float32(p0[A_hstep]) * scale1);\n                pp += 2;\n                p0++;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n\n        const float scale = scales[i + ii];\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale = vdupq_n_f32(scale);\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n            
    _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_compute_A_tile_bf16_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n    const int K = A.dims == 3 ? A.c : A.h;\n\n    // NCNN_LOGE(\"transpose_compute_A_tile_bf16_int8_scales %d %d\", max_ii, elempack);\n\n    const float v127_B_scale = 127.f * B_scale;\n\n#if __ARM_NEON\n#if __aarch64__\n    float32x4_t _v127 = vdupq_n_f32(127.f);\n    float32x4_t _v127_B_scale = vdupq_n_f32(v127_B_scale);\n#endif\n#endif\n\n    float* ps = (float*)scales + i;\n    float* pods = (float*)out_descales + i;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        int ii = 0;\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * 4;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            for (int kk = 0; kk < K; kk++)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                _absmax0 = vmaxq_f32(_absmax0, 
vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 4;\n            }\n            float32x2_t _aa0 = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            float32x2_t _aa1 = vmax_f32(vget_low_f32(_absmax1), vget_high_f32(_absmax1));\n            float32x2_t _aa2 = vmax_f32(vget_low_f32(_absmax2), vget_high_f32(_absmax2));\n            float32x2_t _aa3 = vmax_f32(vget_low_f32(_absmax3), vget_high_f32(_absmax3));\n            float32x2_t _aa01 = vpmax_f32(_aa0, _aa1);\n            float32x2_t _aa23 = vpmax_f32(_aa2, _aa3);\n            float32x4_t _absmax = vcombine_f32(_aa01, _aa23);\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax);\n            float32x4_t _out_descale = vdivq_f32(_absmax, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            float tmp[4];\n            vst1q_f32(tmp, _absmax);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax, _recp_v127_B_scale);\n#endif\n\n            ps += 4;\n            pods += 4;\n 
       }\n        for (; ii < max_ii; ii++)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * 4;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep * 4));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + A_hstep * 8));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + A_hstep * 12));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep * 4));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += A_hstep * 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += A_hstep * 4;\n            }\n            float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            float absmax = std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1));\n\n            
ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        int ii = 0;\n#if __ARM_NEON\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii);\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + A_hstep * 3));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 4;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += A_hstep * 2;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += A_hstep;\n      
      }\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax0);\n            float32x4_t _out_descale = vdivq_f32(_absmax0, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            float tmp[4];\n            vst1q_f32(tmp, _absmax0);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax0);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax0, _recp_v127_B_scale);\n#endif\n\n            ps += 4;\n            pods += 4;\n        }\n#endif // __ARM_NEON\n        for (; ii < max_ii; ii++)\n        {\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii);\n\n            float absmax = 0.f;\n            for (int kk = 0; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf(bfloat16_to_float32(p0[0])));\n                p0 += A_hstep;\n            }\n\n            ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile_bf16_to_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n 
   {\n        transpose_pack_A_tile_bf16_to_int8_i8mm(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_A_tile_bf16_to_int8_asimddp(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    // NCNN_LOGE(\"transpose_pack_A_tile_bf16_to_int8 %d %d\", max_ii, elempack);\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        float32x4_t _scale0 = vld1q_f32((const float*)scales + i + ii);\n        float32x4_t _scale1 = vld1q_f32((const float*)scales + i + ii + 4);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + A_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 4 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = 
bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n                float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf = bfloat2float(vget_high_u16(_w));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n                _p8 = vmulq_laneq_f32(_p8, _scale0, 0);\n                _p9 = vmulq_laneq_f32(_p9, _scale0, 1);\n                _pa = vmulq_laneq_f32(_pa, _scale0, 2);\n                _pb = vmulq_laneq_f32(_pb, _scale0, 3);\n                _pc = vmulq_laneq_f32(_pc, _scale1, 0);\n                _pd = vmulq_laneq_f32(_pd, _scale1, 1);\n                _pe = vmulq_laneq_f32(_pe, _scale1, 2);\n                _pf = vmulq_laneq_f32(_pf, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 
1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n                _p8 = vmulq_lane_f32(_p8, vget_low_f32(_scale0), 0);\n                _p9 = vmulq_lane_f32(_p9, vget_low_f32(_scale0), 1);\n                _pa = vmulq_lane_f32(_pa, vget_high_f32(_scale0), 0);\n                _pb = vmulq_lane_f32(_pb, vget_high_f32(_scale0), 1);\n                _pc = vmulq_lane_f32(_pc, vget_low_f32(_scale1), 0);\n                _pd = vmulq_lane_f32(_pd, vget_low_f32(_scale1), 1);\n                _pe = vmulq_lane_f32(_pe, vget_high_f32(_scale1), 0);\n                _pf = vmulq_lane_f32(_pf, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p8);\n                int8x8_t _r1 = float2int8(_p1, _p9);\n                int8x8_t _r2 = float2int8(_p2, _pa);\n                int8x8_t _r3 = float2int8(_p3, _pb);\n                int8x8_t _r4 = float2int8(_p4, _pc);\n                int8x8_t _r5 = float2int8(_p5, _pd);\n                int8x8_t _r6 = float2int8(_p6, _pe);\n                int8x8_t _r7 = float2int8(_p7, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 
16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, _r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = 
bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n#endif\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n#if __ARM_FEATURE_DOTPROD\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8x2_t _rr = vuzpq_s16(_r01, _r23);\n\n                vst1q_s8(pp, 
vreinterpretq_s8_s16(_rr.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + A_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n                float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf = bfloat2float(vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, 
_scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale1);\n                _p8 = vmulq_f32(_p8, _scale0);\n                _p9 = vmulq_f32(_p9, _scale1);\n                _pa = vmulq_f32(_pa, _scale0);\n                _pb = vmulq_f32(_pb, _scale1);\n                _pc = vmulq_f32(_pc, _scale0);\n                _pd = vmulq_f32(_pd, _scale1);\n                _pe = vmulq_f32(_pe, _scale0);\n                _pf = vmulq_f32(_pf, _scale1);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n                int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n                int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n                int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n                int8x16x4_t _r0123;\n                _r0123.val[0] = vcombine_s8(_r04.val[0], _r04.val[1]);\n                _r0123.val[1] = vcombine_s8(_r15.val[0], _r15.val[1]);\n                _r0123.val[2] = vcombine_s8(_r26.val[0], _r26.val[1]);\n                _r0123.val[3] = vcombine_s8(_r37.val[0], _r37.val[1]);\n\n                vst4q_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = _r0;\n                _r0123.val[1] = _r1;\n                _r0123.val[2] = _r2;\n                _r0123.val[3] = _r3;\n                
int8x8x4_t _r4567;\n                _r4567.val[0] = _r4;\n                _r4567.val[1] = _r5;\n                _r4567.val[2] = _r6;\n                _r4567.val[3] = _r7;\n\n                vst4_s8(pp, _r0123);\n                vst4_s8(pp + 32, _r4567);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(_r0, _r2);\n                _r01.val[1] = vcombine_s8(_r1, _r3);\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(_r4, _r6);\n                _r23.val[1] = vcombine_s8(_r5, _r7);\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = 
vmulq_f32(_p7, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n                _r0123.val[2] = float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += A_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                
int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        float32x4_t _scale = vld1q_f32((const float*)scales + i + ii);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 4 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = 
vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 
8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + A_hstep * 3));\n                float32x4_t _p4 = bfloat2float(vld1_u16(p0 + A_hstep * 4));\n       
         float32x4_t _p5 = bfloat2float(vld1_u16(p0 + A_hstep * 5));\n                float32x4_t _p6 = bfloat2float(vld1_u16(p0 + A_hstep * 6));\n                float32x4_t _p7 = bfloat2float(vld1_u16(p0 + A_hstep * 7));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                float32x4x2_t _p04 = vzipq_f32(_p0, _p4);\n                float32x4x2_t _p15 = vzipq_f32(_p1, _p5);\n                float32x4x2_t _p26 = vzipq_f32(_p2, _p6);\n                float32x4x2_t _p37 = vzipq_f32(_p3, _p7);\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p04.val[0], _p04.val[1]);\n                _r0123.val[1] = float2int8(_p15.val[0], _p15.val[1]);\n                _r0123.val[2] = float2int8(_p26.val[0], _p26.val[1]);\n                _r0123.val[3] = float2int8(_p37.val[0], _p37.val[1]);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p4);\n                _r0123.val[1] = float2int8(_p1, _p5);\n                _r0123.val[2] = float2int8(_p2, _p6);\n                _r0123.val[3] = float2int8(_p3, _p7);\n\n                vst4_s8(pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 
A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + A_hstep * 3));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                transpose4x4_ps(_p0, _p1, _p2, _p3);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 
1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n                pp += 4;\n                p0 += A_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        const float scale0 = scales[i + ii];\n        const float scale1 = scales[i + ii + 1];\n\n#if __ARM_NEON\n        float32x4_t _scale0 = vdupq_n_f32(scale0);\n        float32x4_t _scale1 = vdupq_n_f32(scale1);\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep * 4);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = 
vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r01 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r01 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale = vzipq_f32(_scale0, _scale1).val[0];\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep * 
4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[A_hstep * 6 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[A_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p45 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p67 = bfloat2float(vget_high_u16(_q));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vuzp_s8(_r0, _r1);\n\n                vst1q_s8(pp, vcombine_s8(_r01.val[0], _r01.val[1]));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vtrn_s8(_r0, _r1);\n                int8x8x2_t _rr01 = vuzp_s8(_r01.val[0], _r01.val[1]);\n\n                vst1q_s8(pp, vcombine_s8(_rr01.val[0], _rr01.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 4 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 6 + 1], _p, 7);\n             
   uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 3], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 3 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 5], _q, 4);\n                _q = vsetq_lane_u16(p0[A_hstep * 5 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[A_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 7 + 1], _q, 7);\n                float32x4_t _p02 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p46 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p13 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p57 = bfloat2float(vget_high_u16(_q));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p46 = vmulq_f32(_p46, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n                _p57 = vmulq_f32(_p57, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p02, _p46);\n                _r01.val[1] = float2int8(_p13, _p57);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n          
      float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                float32x4x2_t _pp = vuzpq_f32(_p01, _p23);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                float32x4_t _p02 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p13 = bfloat2float(vget_high_u16(_p));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n\n                float32x4x2_t _pp = vzipq_f32(_p02, _p13);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale0);\n                pp[1] = float2int8(bfloat16_to_float32(p0[A_hstep + 0]) * scale0);\n                pp[2] = float2int8(bfloat16_to_float32(p0[1]) * scale1);\n                pp[3] = float2int8(bfloat16_to_float32(p0[A_hstep + 1]) * scale1);\n                pp += 4;\n                p0 += A_hstep * 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale0);\n                pp[1] = 
float2int8(bfloat16_to_float32(p0[1]) * scale1);\n                pp += 2;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        const float scale = scales[i + ii];\n\n#if __ARM_NEON\n        float32x4_t _scale = vdupq_n_f32(scale);\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep * 4));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + A_hstep * 8));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + A_hstep * 12));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += A_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + A_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(bfloat16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(bfloat16_to_float32(p0[2]) * 
scale);\n                pp[3] = float2int8(bfloat16_to_float32(p0[3]) * scale);\n                pp += 4;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 7], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep * 8], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep * 9], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 10], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 11], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 12], _q, 4);\n                _q = vsetq_lane_u16(p0[A_hstep * 13], _q, 5);\n                _q = vsetq_lane_u16(p0[A_hstep * 14], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 15], _q, 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n             
   vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += A_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 7], _p, 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0 += A_hstep;\n            }\n        }\n    }\n}\n\nstatic void compute_B_bf16_int8_scale(const Mat& B, float& scale)\n{\n    float absmax = 0.f;\n#if __ARM_NEON\n    float32x4_t _absmax = vdupq_n_f32(0.f);\n#endif\n    for (int i = 0; i < (B.dims == 3 ? B.c : B.h); i++)\n    {\n        const size_t B_hstep = B.dims == 3 ? 
B.cstep : (size_t)B.w;\n        const unsigned short* ptr = (const unsigned short*)B + i * B_hstep * B.elempack;\n\n        const int size = B.w * B.elempack;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 7 < size; j += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _absmax = vmaxq_f32(_absmax, vabsq_f32(_p0));\n            _absmax = vmaxq_f32(_absmax, vabsq_f32(_p1));\n            ptr += 8;\n        }\n        for (; j + 3 < size; j += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _absmax = vmaxq_f32(_absmax, vabsq_f32(_p));\n            ptr += 4;\n        }\n#endif\n        for (; j < size; j++)\n        {\n            absmax = std::max(absmax, (float)fabsf(bfloat16_to_float32(ptr[0])));\n            ptr++;\n        }\n    }\n#if __ARM_NEON\n    float32x2_t _aa = vmax_f32(vget_low_f32(_absmax), vget_high_f32(_absmax));\n    absmax = std::max(absmax, std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1)));\n#endif\n\n    scale = absmax == 0.f ? 1.f : 127.f / absmax;\n}\n\nstatic void pack_B_tile_bf16_to_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_B_tile_bf16_to_int8_i8mm(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_B_tile_bf16_to_int8_asimddp(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? 
B.cstep : (size_t)B.w;\n\n    // NCNN_LOGE(\"pack_B_tile_bf16_to_int8 %d %d\", max_jj, elempack);\n\n    signed char* pp = BT;\n\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n#endif\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n                uint16x8x4_t _q = vld4q_u16(p0 + B_hstep * 4);\n\n                float32x4_t _p0 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[0])), _scale);\n                float32x4_t _p1 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[1])), _scale);\n                float32x4_t _p2 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[2])), _scale);\n                float32x4_t _p3 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[3])), _scale);\n                float32x4_t _p4 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[0])), _scale);\n                float32x4_t _p5 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[1])), _scale);\n                float32x4_t _p6 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[2])), _scale);\n                float32x4_t _p7 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[3])), _scale);\n                float32x4_t _p8 = vmulq_f32(bfloat2float(vget_low_u16(_q.val[0])), _scale);\n                float32x4_t _p9 = vmulq_f32(bfloat2float(vget_low_u16(_q.val[1])), _scale);\n                float32x4_t _pa = vmulq_f32(bfloat2float(vget_low_u16(_q.val[2])), _scale);\n                float32x4_t _pb = vmulq_f32(bfloat2float(vget_low_u16(_q.val[3])), _scale);\n                float32x4_t _pc = vmulq_f32(bfloat2float(vget_high_u16(_q.val[0])), _scale);\n                float32x4_t _pd = vmulq_f32(bfloat2float(vget_high_u16(_q.val[1])), _scale);\n                
float32x4_t _pe = vmulq_f32(bfloat2float(vget_high_u16(_q.val[2])), _scale);\n                float32x4_t _pf = vmulq_f32(bfloat2float(vget_high_u16(_q.val[3])), _scale);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n                int8x8_t _r4 = float2int8(_p8, _pc);\n                int8x8_t _r5 = float2int8(_p9, _pd);\n                int8x8_t _r6 = float2int8(_pa, _pe);\n                int8x8_t _r7 = float2int8(_pb, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p8, _p9);\n                int8x8_t _r3 = float2int8(_pa, _pb);\n                int8x8_t _r4 = float2int8(_p4, _p5);\n                int8x8_t _r5 = float2int8(_p6, _p7);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 4 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = 
bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n                float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf = bfloat2float(vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p8), float2int8(_p2, _pa));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p9), float2int8(_p3, _pb));\n                int8x16x2_t _r23;\n                
_r23.val[0] = vcombine_s8(float2int8(_p4, _pc), float2int8(_p6, _pe));\n                _r23.val[1] = vcombine_s8(float2int8(_p5, _pd), float2int8(_p7, _pf));\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n                uint16x4x4_t _q = vld4_u16(p0 + B_hstep * 4);\n\n                float32x4_t _p0 = vmulq_f32(bfloat2float(_p.val[0]), _scale);\n                float32x4_t _p1 = vmulq_f32(bfloat2float(_p.val[1]), _scale);\n                float32x4_t _p2 = vmulq_f32(bfloat2float(_p.val[2]), _scale);\n                float32x4_t _p3 = vmulq_f32(bfloat2float(_p.val[3]), _scale);\n                float32x4_t _p4 = vmulq_f32(bfloat2float(_q.val[0]), _scale);\n                float32x4_t _p5 = vmulq_f32(bfloat2float(_q.val[1]), _scale);\n                float32x4_t _p6 = vmulq_f32(bfloat2float(_q.val[2]), _scale);\n                float32x4_t _p7 = vmulq_f32(bfloat2float(_q.val[3]), _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 4 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                
float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p4), float2int8(_p2, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p5), float2int8(_p3, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep * 4);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n         
       float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n                float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = 
bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf = bfloat2float(vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = 
vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p8, _pa));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_pc, _pe));\n                int16x4_t _t4 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t5 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4_t _t6 = vreinterpret_s16_s8(float2int8(_p9, _pb));\n                int16x4_t _t7 = vreinterpret_s16_s8(float2int8(_pd, _pf));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int16x4x2_t _t45 = vuzp_s16(_t4, _t5);\n                int16x4x2_t _t67 = vuzp_s16(_t6, _t7);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n                int8x8_t _r4 = vreinterpret_s8_s16(_t45.val[0]);\n                int8x8_t _r5 = vreinterpret_s8_s16(_t67.val[0]);\n                int8x8_t _r6 = vreinterpret_s8_s16(_t45.val[1]);\n                int8x8_t _r7 = vreinterpret_s8_s16(_t67.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n\n                pp += 64;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + B_hstep * 3));\n                float32x4_t _p4 = bfloat2float(vld1_u16(p0 + 
B_hstep * 4));\n                float32x4_t _p5 = bfloat2float(vld1_u16(p0 + B_hstep * 5));\n                float32x4_t _p6 = bfloat2float(vld1_u16(p0 + B_hstep * 6));\n                float32x4_t _p7 = bfloat2float(vld1_u16(p0 + B_hstep * 7));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = 
vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep * 4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 6 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p45 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p67 = bfloat2float(vget_high_u16(_q));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 
3], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 7], _p, 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0++;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n\n                float32x4_t _p0 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[0])), _scale);\n                float32x4_t _p1 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[1])), _scale);\n                float32x4_t _p2 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[2])), _scale);\n                float32x4_t _p3 = vmulq_f32(bfloat2float(vget_low_u16(_p.val[3])), _scale);\n                float32x4_t _p4 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[0])), _scale);\n                float32x4_t _p5 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[1])), _scale);\n                float32x4_t _p6 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[2])), _scale);\n                float32x4_t _p7 = vmulq_f32(bfloat2float(vget_high_u16(_p.val[3])), _scale);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                
int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 
4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n\n                float32x4_t _p0 = vmulq_f32(bfloat2float(_p.val[0]), _scale);\n                float32x4_t _p1 = vmulq_f32(bfloat2float(_p.val[1]), _scale);\n                float32x4_t _p2 = vmulq_f32(bfloat2float(_p.val[2]), _scale);\n                float32x4_t _p3 = vmulq_f32(bfloat2float(_p.val[3]), _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n    
            p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = 
float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + B_hstep * 3));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // 
__ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x4_t _p = uint16x4_t();\n                _p = vset_lane_u16(p0[0], _p, 0);\n                _p = vset_lane_u16(p0[B_hstep], _p, 1);\n                _p = vset_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vset_lane_u16(p0[B_hstep * 3], _p, 3);\n                float32x4_t _p0 = bfloat2float(_p);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, 
_p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p2));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p2));\n                float32x4_t _t2 = vcombine_f32(vget_low_f32(_p1), vget_low_f32(_p3));\n                float32x4_t _t3 = vcombine_f32(vget_high_f32(_p1), vget_high_f32(_p3));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n                int8x8_t _r1 = float2int8(_t2, _t3);\n#endif // 
__ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n                vst1_s8(pp + 8, _r1);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(bfloat16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(bfloat16_to_float32(p0[B_hstep]) * scale);\n                pp[3] = float2int8(bfloat16_to_float32(p0[B_hstep + 1]) * scale);\n                pp += 4;\n                p0 += 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(bfloat16_to_float32(p0[B_hstep]) * scale);\n                pp += 2;\n                p0++;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p 
= vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile_bf16_to_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        transpose_pack_B_tile_bf16_to_int8_i8mm(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if 
(ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_B_tile_bf16_to_int8_asimddp(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    // NCNN_LOGE(\"transpose_pack_B_tile_bf16_to_int8 %d %d\", max_jj, elempack);\n\n    signed char* pp = BT;\n\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n#endif\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 4 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n              
  float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf = bfloat2float(vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p8);\n                int8x8_t _r1 = float2int8(_p1, _p9);\n                int8x8_t _r2 = float2int8(_p2, _pa);\n                int8x8_t _r3 = float2int8(_p3, _pb);\n                int8x8_t _r4 = float2int8(_p4, _pc);\n                int8x8_t _r5 = float2int8(_p5, _pd);\n                int8x8_t _r6 = float2int8(_p6, _pe);\n                int8x8_t _r7 = float2int8(_p7, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, 
_p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, _r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 
8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n#if __ARM_FEATURE_DOTPROD\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8x2_t _rr = vuzpq_s16(_r01, _r23);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; 
kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n                float32x4_t _p8 = bfloat2float(vget_low_u16(_t));\n                float32x4_t _p9 = bfloat2float(vget_high_u16(_t));\n                float32x4_t _pa = bfloat2float(vget_low_u16(_u));\n                float32x4_t _pb = bfloat2float(vget_high_u16(_u));\n                float32x4_t _pc = bfloat2float(vget_low_u16(_v));\n                float32x4_t _pd = bfloat2float(vget_high_u16(_v));\n                float32x4_t _pe = bfloat2float(vget_low_u16(_w));\n                float32x4_t _pf = bfloat2float(vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, 
_scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n                int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n                int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n                int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n                int8x16x4_t _r0123;\n                _r0123.val[0] = vcombine_s8(_r04.val[0], _r04.val[1]);\n                _r0123.val[1] = vcombine_s8(_r15.val[0], _r15.val[1]);\n                _r0123.val[2] = vcombine_s8(_r26.val[0], _r26.val[1]);\n                _r0123.val[3] = vcombine_s8(_r37.val[0], _r37.val[1]);\n\n                vst4q_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = _r0;\n                _r0123.val[1] = _r1;\n                _r0123.val[2] = _r2;\n                _r0123.val[3] = _r3;\n                int8x8x4_t _r4567;\n                _r4567.val[0] = _r4;\n                _r4567.val[1] = _r5;\n                _r4567.val[2] = _r6;\n                _r4567.val[3] = _r7;\n\n                vst4_s8(pp, _r0123);\n                vst4_s8(pp + 32, _r4567);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                
int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(_r0, _r2);\n                _r01.val[1] = vcombine_s8(_r1, _r3);\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(_r4, _r6);\n                _r23.val[1] = vcombine_s8(_r5, _r7);\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n                _r0123.val[2] = float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n                vst4_s8(pp, 
_r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += B_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n        if 
(elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 4 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                float32x4_t _p4 = bfloat2float(vget_low_u16(_r));\n                float32x4_t _p5 = bfloat2float(vget_high_u16(_r));\n                float32x4_t _p6 = bfloat2float(vget_low_u16(_s));\n                float32x4_t _p7 = bfloat2float(vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = 
vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, 
vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + B_hstep * 3));\n                float32x4_t _p4 = bfloat2float(vld1_u16(p0 + B_hstep * 4));\n                float32x4_t _p5 = bfloat2float(vld1_u16(p0 + B_hstep * 5));\n                float32x4_t _p6 = bfloat2float(vld1_u16(p0 + B_hstep * 6));\n                float32x4_t _p7 = bfloat2float(vld1_u16(p0 + B_hstep * 7));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                float32x4x2_t _p04 = vzipq_f32(_p0, _p4);\n                float32x4x2_t _p15 = vzipq_f32(_p1, _p5);\n                float32x4x2_t _p26 = vzipq_f32(_p2, _p6);\n                float32x4x2_t _p37 = vzipq_f32(_p3, _p7);\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p04.val[0], _p04.val[1]);\n                _r0123.val[1] = float2int8(_p15.val[0], _p15.val[1]);\n                _r0123.val[2] = float2int8(_p26.val[0], _p26.val[1]);\n                _r0123.val[3] = float2int8(_p37.val[0], _p37.val[1]);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, 
_p4);\n                _r0123.val[1] = float2int8(_p1, _p5);\n                _r0123.val[2] = float2int8(_p2, _p6);\n                _r0123.val[3] = float2int8(_p3, _p7);\n\n                vst4_s8(pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + B_hstep * 3));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                transpose4x4_ps(_p0, _p1, _p2, _p3);\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep));\n\n                _p0 = vmulq_f32(_p0, 
_scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(bfloat16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(bfloat16_to_float32(p0[2]) * scale);\n                pp[3] = float2int8(bfloat16_to_float32(p0[3]) * scale);\n                pp += 4;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep * 4);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // 
__ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r01 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r01 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = 
vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep * 4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 6 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p45 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p67 = bfloat2float(vget_high_u16(_q));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vuzp_s8(_r0, _r1);\n\n                vst1q_s8(pp, vcombine_s8(_r01.val[0], _r01.val[1]));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vtrn_s8(_r0, _r1);\n                int8x8x2_t _rr01 = vuzp_s8(_r01.val[0], _r01.val[1]);\n\n                vst1q_s8(pp, vcombine_s8(_rr01.val[0], _rr01.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 3);\n    
            _p = vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 4 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 6 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 3], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 3 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 5], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 5 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 7 + 1], _q, 7);\n                float32x4_t _p02 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p46 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p13 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p57 = bfloat2float(vget_high_u16(_q));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p46 = vmulq_f32(_p46, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n                _p57 = vmulq_f32(_p57, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p02, _p46);\n                _r01.val[1] = float2int8(_p13, _p57);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n       
         _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p23 = bfloat2float(vget_high_u16(_p));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                float32x4x2_t _pp = vuzpq_f32(_p01, _p23);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                float32x4_t _p02 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p13 = bfloat2float(vget_high_u16(_p));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n\n                float32x4x2_t _pp = vzipq_f32(_p02, _p13);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(bfloat16_to_float32(p0[B_hstep + 0]) * scale);\n                pp[2] = float2int8(bfloat16_to_float32(p0[1]) * scale);\n                pp[3] = float2int8(bfloat16_to_float32(p0[B_hstep + 1]) * 
scale);\n                pp += 4;\n                p0 += B_hstep * 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(bfloat16_to_float32(p0[1]) * scale);\n                pp += 2;\n                p0 += B_hstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep * 4));\n                float32x4_t _p2 = bfloat2float(vld1_u16(p0 + B_hstep * 8));\n                float32x4_t _p3 = bfloat2float(vld1_u16(p0 + B_hstep * 12));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += B_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = bfloat2float(vld1_u16(p0));\n                float32x4_t _p1 = bfloat2float(vld1_u16(p0 + B_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                pp[0] = 
float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(bfloat16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(bfloat16_to_float32(p0[2]) * scale);\n                pp[3] = float2int8(bfloat16_to_float32(p0[3]) * scale);\n                pp += 4;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 7], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep * 8], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep * 9], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 10], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 11], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 12], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 13], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 14], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 15], _q, 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = 
vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += B_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 7], _p, 7);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(bfloat16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0 += B_hstep;\n            }\n        }\n    }\n}\n\nstatic void unpack_output_tile_int32_to_bf16(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        unpack_output_tile_int32_to_bf16_asimddp(topT, C, top_blob, 
broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const size_t c_hstep = C.dims == 3 ? C.cstep : (size_t)C.w;\n    const int c_elempack = C.elempack;\n    const unsigned short* pC = C;\n\n    // NCNN_LOGE(\"unpack_output_tile_int32_to_bf16  %d %d %d %d  %d  %d  %d\", i, max_ii, j, max_jj, out_elempack, broadcast_type_C, c_elempack);\n\n    const int* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        float32x4_t _descale0 = vld1q_f32((const float*)descales + i + ii);\n        float32x4_t _descale1 = vld1q_f32((const float*)descales + i + ii + 4);\n\n        float32x4_t _c0;\n        float32x4_t _c1;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                uint16x8_t _c = vld1q_u16(pC);\n                _c0 = bfloat2float(vget_low_u16(_c));\n                _c1 = bfloat2float(vget_high_u16(_c));\n                _c0 = vmulq_n_f32(_c0, beta);\n                _c1 = vmulq_n_f32(_c1, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 
4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n            int32x4_t _sum8 = vld1q_s32(pp + 32);\n            int32x4_t _sum9 = vld1q_s32(pp + 36);\n            int32x4_t _suma = vld1q_s32(pp + 40);\n            int32x4_t _sumb = vld1q_s32(pp + 44);\n            int32x4_t _sumc = vld1q_s32(pp + 48);\n            int32x4_t _sumd = vld1q_s32(pp + 52);\n            int32x4_t _sume = vld1q_s32(pp + 56);\n            int32x4_t _sumf = vld1q_s32(pp + 60);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e4 f5 g6 h7\n            //      e0 f1 g2 h3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      g4 h5 e6 f7\n            //      g0 h1 e2 f3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      e7 f6 g5 h4\n            //      e3 f2 g1 h0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      g7 h6 e5 f4\n            //      g3 h2 e1 f0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 
b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n            {\n                _sum8 = vrev64q_s32(_sum8);\n                _sum9 = vrev64q_s32(_sum9);\n                _suma = vrev64q_s32(_suma);\n                _sumb = vrev64q_s32(_sumb);\n                _sumc = vrev64q_s32(_sumc);\n                _sumd = vrev64q_s32(_sumd);\n                _sume = vrev64q_s32(_sume);\n                _sumf = vrev64q_s32(_sumf);\n                _sum8 = vextq_s32(_sum8, _sum8, 2);\n                _sum9 = vextq_s32(_sum9, _sum9, 2);\n                _suma = vextq_s32(_suma, _suma, 2);\n                _sumb = vextq_s32(_sumb, _sumb, 2);\n                _sumc = vextq_s32(_sumc, _sumc, 2);\n                _sumd = vextq_s32(_sumd, _sumd, 2);\n                _sume = vextq_s32(_sume, _sume, 2);\n                _sumf = vextq_s32(_sumf, _sumf, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = 
vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                _sum9 = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                _suma = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                _sumb = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                _sumc = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                _sumd = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                _sume = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum9 = vrev64q_s32(_sum9);\n                _sumb = vrev64q_s32(_sumb);\n                _sumd = vrev64q_s32(_sumd);\n                _sumf = vrev64q_s32(_sumf);\n            }\n#endif\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum8), _descale0);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum9), 
_descale0);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_suma), _descale0);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sumb), _descale0);\n            float32x4_t _f8 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f9 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _fa = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _fb = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n            float32x4_t _fc = vmulq_f32(vcvtq_f32_s32(_sumc), _descale1);\n            float32x4_t _fd = vmulq_f32(vcvtq_f32_s32(_sumd), _descale1);\n            float32x4_t _fe = vmulq_f32(vcvtq_f32_s32(_sume), _descale1);\n            float32x4_t _ff = vmulq_f32(vcvtq_f32_s32(_sumf), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c0);\n                    _fa = vaddq_f32(_fa, _c0);\n                    _fb = vaddq_f32(_fb, _c0);\n                    _fc = vaddq_f32(_fc, _c0);\n                    _fd = vaddq_f32(_fd, _c0);\n                    _fe = vaddq_f32(_fe, _c0);\n                    _ff = vaddq_f32(_ff, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    
_f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c1);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c1);\n                    _fb = vaddq_f32(_fb, _c1);\n                    _fc = vaddq_f32(_fc, _c1);\n                    _fd = vaddq_f32(_fd, _c1);\n                    _fe = vaddq_f32(_fe, _c1);\n                    _ff = vaddq_f32(_ff, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        uint16x8_t _c45 = vld1q_u16(pC + 16);\n                        uint16x8_t _c67 = vld1q_u16(pC + 24);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                        float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                        float32x4_t _c4 = bfloat2float(vget_low_u16(_c45));\n                        float32x4_t _c5 = bfloat2float(vget_high_u16(_c45));\n                        float32x4_t _c6 = bfloat2float(vget_low_u16(_c67));\n                        float32x4_t _c7 = bfloat2float(vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = 
vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c45 = vld1q_u16(pC + c_hstep * 4 + 16);\n                        _c67 = vld1q_u16(pC + c_hstep * 4 + 24);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        _c4 = bfloat2float(vget_low_u16(_c45));\n                        _c5 = bfloat2float(vget_high_u16(_c45));\n                        _c6 = bfloat2float(vget_low_u16(_c67));\n                        _c7 = bfloat2float(vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, 
_c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                        uint16x8_t _c45 = vld1q_u16(pC + c_hstep * 2);\n                        uint16x8_t _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                        float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                        float32x4_t _c4 = bfloat2float(vget_low_u16(_c45));\n                        float32x4_t _c5 = bfloat2float(vget_high_u16(_c45));\n                        float32x4_t _c6 = bfloat2float(vget_low_u16(_c67));\n                        float32x4_t _c7 = bfloat2float(vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                 
           _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 5);\n                        _c45 = vld1q_u16(pC + c_hstep * 6);\n                        _c67 = vld1q_u16(pC + c_hstep * 7);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        _c4 = bfloat2float(vget_low_u16(_c45));\n                        _c5 = bfloat2float(vget_high_u16(_c45));\n                        _c6 = bfloat2float(vget_low_u16(_c67));\n                        _c7 = bfloat2float(vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = 
vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 8;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _cc = vld1q_u16(pC);\n                    float32x4_t _cc0 = bfloat2float(vget_low_u16(_cc));\n                    float32x4_t _cc1 = bfloat2float(vget_high_u16(_cc));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = 
vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c2);\n                    _fb = vaddq_f32(_fb, _c3);\n                    _fc = vaddq_f32(_fc, _c4);\n                    _fd = vaddq_f32(_fd, _c5);\n                    _fe = vaddq_f32(_fe, _c6);\n                    _ff = vaddq_f32(_ff, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n                _f8 = vmulq_f32(_f8, _alpha);\n                _f9 = vmulq_f32(_f9, _alpha);\n                _fa = vmulq_f32(_fa, _alpha);\n                _fb = vmulq_f32(_fb, _alpha);\n                _fc = vmulq_f32(_fc, _alpha);\n                _fd = vmulq_f32(_fd, _alpha);\n                _fe = vmulq_f32(_fe, _alpha);\n                _ff = vmulq_f32(_ff, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n 
           uint16x4_t _bf4 = float2bfloat(_f4);\n            uint16x4_t _bf5 = float2bfloat(_f5);\n            uint16x4_t _bf6 = float2bfloat(_f6);\n            uint16x4_t _bf7 = float2bfloat(_f7);\n            uint16x4_t _bf8 = float2bfloat(_f8);\n            uint16x4_t _bf9 = float2bfloat(_f9);\n            uint16x4_t _bfa = float2bfloat(_fa);\n            uint16x4_t _bfb = float2bfloat(_fb);\n            uint16x4_t _bfc = float2bfloat(_fc);\n            uint16x4_t _bfd = float2bfloat(_fd);\n            uint16x4_t _bfe = float2bfloat(_fe);\n            uint16x4_t _bff = float2bfloat(_ff);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_bf2, _bf3));\n                vst1q_u16(p0 + 16, vcombine_u16(_bf4, _bf5));\n                vst1q_u16(p0 + 24, vcombine_u16(_bf6, _bf7));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_bf8, _bf9));\n                vst1q_u16(p0 + out_hstep * 4 + 8, vcombine_u16(_bfa, _bfb));\n                vst1q_u16(p0 + out_hstep * 4 + 16, vcombine_u16(_bfc, _bfd));\n                vst1q_u16(p0 + out_hstep * 4 + 24, vcombine_u16(_bfe, _bff));\n                p0 += 32;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_bf0, _bf1, _bf2, _bf3);\n                transpose4x4_u16(_bf4, _bf5, _bf6, _bf7);\n                transpose4x4_u16(_bf8, _bf9, _bfa, _bfb);\n                transpose4x4_u16(_bfc, _bfd, _bfe, _bff);\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf4));\n                vst1q_u16(p0 + out_hstep, vcombine_u16(_bf1, _bf5));\n                vst1q_u16(p0 + out_hstep * 2, vcombine_u16(_bf2, _bf6));\n                vst1q_u16(p0 + out_hstep * 3, vcombine_u16(_bf3, _bf7));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_bf8, _bfc));\n                vst1q_u16(p0 + out_hstep * 5, vcombine_u16(_bf9, _bfd));\n                vst1q_u16(p0 + 
out_hstep * 6, vcombine_u16(_bfa, _bfe));\n                vst1q_u16(p0 + out_hstep * 7, vcombine_u16(_bfb, _bff));\n                p0 += 8;\n            }\n\n            pp += 64;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e0 f1 g2 h3\n            //      c0 d1 a2 b3\n            //      g0 h1 e2 f3\n            //      a3 b2 c1 d0\n            //      e3 f2 g1 h0\n            //      c3 d2 a1 b0\n            //      g3 h2 e1 f0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t 
_t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 
= vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c1);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c1);\n                    _f7 = vaddq_f32(_f7, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                        float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, 
_beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = bfloat2float(_cc0);\n                        _c1 = bfloat2float(_cc1);\n                        float32x4_t _c2 = bfloat2float(_cc2);\n                        float32x4_t _c3 = bfloat2float(_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = 
vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _cc0 = vld1_u16(pC + c_hstep * 4);\n                        _cc1 = vld1_u16(pC + c_hstep * 5);\n                        _cc2 = vld1_u16(pC + c_hstep * 6);\n                        _cc3 = vld1_u16(pC + c_hstep * 7);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = bfloat2float(_cc0);\n                        _c1 = bfloat2float(_cc1);\n                        _c2 = bfloat2float(_cc2);\n                        _c3 = bfloat2float(_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 4;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = 
bfloat2float(vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c2);\n                    _f7 = vaddq_f32(_f7, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n            uint16x4_t _bf4 = float2bfloat(_f4);\n            uint16x4_t _bf5 = float2bfloat(_f5);\n            uint16x4_t _bf6 = float2bfloat(_f6);\n            uint16x4_t _bf7 = float2bfloat(_f7);\n\n            if (out_elempack 
== 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_bf2, _bf3));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_bf4, _bf5));\n                vst1q_u16(p0 + out_hstep * 4 + 8, vcombine_u16(_bf6, _bf7));\n                p0 += 16;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_bf0, _bf1, _bf2, _bf3);\n                transpose4x4_u16(_bf4, _bf5, _bf6, _bf7);\n                vst1_u16(p0, _bf0);\n                vst1_u16(p0 + out_hstep, _bf1);\n                vst1_u16(p0 + out_hstep * 2, _bf2);\n                vst1_u16(p0 + out_hstep * 3, _bf3);\n                vst1_u16(p0 + out_hstep * 4, _bf4);\n                vst1_u16(p0 + out_hstep * 5, _bf5);\n                vst1_u16(p0 + out_hstep * 6, _bf6);\n                vst1_u16(p0 + out_hstep * 7, _bf7);\n                p0 += 4;\n            }\n\n            pp += 32;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      e0 f1 g0 h1\n            //      a1 b0 c1 d0\n            //      e1 f0 g1 h0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                _sum0 = 
vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale1);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c01;\n                    uint16x8_t _c23;\n                    if (c_elempack == 4)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + c_hstep * 4);\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], 
_c01, 1);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[1], _c01, 4);\n                        _c01 = vsetq_lane_u16(pC[c_hstep + 1], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c01, 6);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c01, 7);\n                        _c23 = uint16x8_t();\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 4], _c23, 0);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 5], _c23, 1);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 6], _c23, 2);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 7], _c23, 3);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 4 + 1], _c23, 4);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 5 + 1], _c23, 5);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 6 + 1], _c23, 6);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 7 + 1], _c23, 7);\n                        pC += 2;\n                    }\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    _c1 = bfloat2float(vget_high_u16(_c01));\n                    float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                    float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = 
vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    _c1 = vdupq_n_f32(bfloat16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_bf2, _bf3));\n                p0 += 8;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_bf0, 0);\n                p0[1] = vget_lane_u16(_bf1, 0);\n                p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_bf1, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_bf1, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_bf1, 3);\n                p0[out_hstep * 4] = vget_lane_u16(_bf2, 0);\n                p0[out_hstep * 4 + 1] = vget_lane_u16(_bf3, 0);\n                
p0[out_hstep * 5] = vget_lane_u16(_bf2, 1);\n                p0[out_hstep * 5 + 1] = vget_lane_u16(_bf3, 1);\n                p0[out_hstep * 6] = vget_lane_u16(_bf2, 2);\n                p0[out_hstep * 6 + 1] = vget_lane_u16(_bf3, 2);\n                p0[out_hstep * 7] = vget_lane_u16(_bf2, 3);\n                p0[out_hstep * 7 + 1] = vget_lane_u16(_bf3, 3);\n                p0 += 2;\n            }\n\n            pp += 16;\n        }\n        for (; jj < max_jj; jj++)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        _c0 = bfloat2float(vld1_u16(pC));\n                        _c1 = bfloat2float(vld1_u16(pC + c_hstep * 4));\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], _c01, 1);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 4], _c01, 4);\n                        _c01 = 
vsetq_lane_u16(pC[c_hstep * 5], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 6], _c01, 6);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 7], _c01, 7);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        pC += 1;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n\n            if (out_elempack == 4)\n            {\n                vst1_u16(p0, _bf0);\n                vst1_u16(p0 + out_hstep * 4, _bf1);\n                p0 += 4;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_bf0, 0);\n                p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                p0[out_hstep * 4] = vget_lane_u16(_bf1, 0);\n                
p0[out_hstep * 5] = vget_lane_u16(_bf1, 1);\n                p0[out_hstep * 6] = vget_lane_u16(_bf1, 2);\n                p0[out_hstep * 7] = vget_lane_u16(_bf1, 3);\n                p0++;\n            }\n\n            pp += 8;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        float32x4_t _descale = vld1q_f32((const float*)descales + i + ii);\n\n        float32x4_t _c0;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                _c0 = bfloat2float(vld1_u16(pC));\n                _c0 = vmulq_n_f32(_c0, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //   
   a7 b7 c7 d7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = 
vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c01;\n                    uint16x8_t _c23;\n                    uint16x8_t _c45;\n                    
uint16x8_t _c67;\n                    if (c_elempack == 4)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + 8);\n                        _c45 = vld1q_u16(pC + 16);\n                        _c67 = vld1q_u16(pC + 24);\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + c_hstep);\n                        _c45 = vld1q_u16(pC + c_hstep * 2);\n                        _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        pC += 8;\n                    }\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c01));\n                    float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                    float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                    float32x4_t _c4 = bfloat2float(vget_low_u16(_c45));\n                    float32x4_t _c5 = bfloat2float(vget_high_u16(_c45));\n                    float32x4_t _c6 = bfloat2float(vget_low_u16(_c67));\n                    float32x4_t _c7 = bfloat2float(vget_high_u16(_c67));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                        _f4 = vaddq_f32(_f4, _c4);\n                        _f5 = vaddq_f32(_f5, _c5);\n                        _f6 = vaddq_f32(_f6, _c6);\n                        _f7 = vaddq_f32(_f7, _c7);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, 
_beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        _f4 = vmlaq_f32(_f4, _c4, _beta);\n                        _f5 = vmlaq_f32(_f5, _c5, _beta);\n                        _f6 = vmlaq_f32(_f6, _c6, _beta);\n                        _f7 = vmlaq_f32(_f7, _c7, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    float32x4_t _cc0 = bfloat2float(vget_low_u16(_c));\n                    float32x4_t _cc1 = bfloat2float(vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n            
    float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n            uint16x4_t _bf4 = float2bfloat(_f4);\n            uint16x4_t _bf5 = float2bfloat(_f5);\n            uint16x4_t _bf6 = float2bfloat(_f6);\n            uint16x4_t _bf7 = float2bfloat(_f7);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_bf2, _bf3));\n                vst1q_u16(p0 + 16, vcombine_u16(_bf4, _bf5));\n                vst1q_u16(p0 + 24, vcombine_u16(_bf6, _bf7));\n                p0 += 32;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_bf0, _bf1, _bf2, _bf3);\n                transpose4x4_u16(_bf4, _bf5, _bf6, _bf7);\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf4));\n                vst1q_u16(p0 + out_hstep, vcombine_u16(_bf1, _bf5));\n                vst1q_u16(p0 + out_hstep * 2, vcombine_u16(_bf2, _bf6));\n                vst1q_u16(p0 + out_hstep * 3, vcombine_u16(_bf3, _bf7));\n                p0 += 8;\n            }\n\n            pp += 32;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if 
__ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      c0 d1 a2 b3\n            //      a3 b2 c1 d0\n            //      c3 d2 a1 b0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum2 = vextq_s32(_sum2, _sum2, 2);\n                _sum3 = vextq_s32(_sum3, _sum3, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || 
broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep * 1);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = bfloat2float(_cc0);\n                        _c1 = bfloat2float(_cc1);\n                        _c2 = bfloat2float(_cc2);\n                        _c3 = bfloat2float(_cc3);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = 
vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = bfloat2float(vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    float32x4_t _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_bf2, _bf3));\n                
p0 += 16;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_bf0, _bf1, _bf2, _bf3);\n                vst1_u16(p0, _bf0);\n                vst1_u16(p0 + out_hstep, _bf1);\n                vst1_u16(p0 + out_hstep * 2, _bf2);\n                vst1_u16(p0 + out_hstep * 3, _bf3);\n                p0 += 4;\n            }\n\n            pp += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      a1 b0 c1 d0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            {\n                _sum1 = vrev64q_s32(_sum1);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c;\n                    if (c_elempack == 4)\n                    {\n                
        _c = vld1q_u16(pC);\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x8_t();\n                        _c = vsetq_lane_u16(pC[0], _c, 0);\n                        _c = vsetq_lane_u16(pC[c_hstep], _c, 1);\n                        _c = vsetq_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3], _c, 3);\n                        _c = vsetq_lane_u16(pC[1], _c, 4);\n                        _c = vsetq_lane_u16(pC[c_hstep + 1], _c, 5);\n                        _c = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c, 6);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c, 7);\n                        pC += 2;\n                    }\n                    _c0 = bfloat2float(vget_low_u16(_c));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    float32x4_t _c1 = vdupq_n_f32(bfloat16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n          
  uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n                p0 += 8;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_bf0, 0);\n                p0[1] = vget_lane_u16(_bf1, 0);\n                p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_bf1, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_bf1, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_bf1, 3);\n                p0 += 2;\n            }\n\n            pp += 8;\n        }\n        for (; jj < max_jj; jj++)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x4_t _c;\n                    if (c_elempack == 4)\n                    {\n                        _c = vld1_u16(pC);\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x4_t();\n                        _c = vset_lane_u16(pC[0], _c, 0);\n                        _c = vset_lane_u16(pC[c_hstep], _c, 1);\n                        _c = vset_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vset_lane_u16(pC[c_hstep * 3], _c, 3);\n                        pC += 1;\n                    }\n                   
 _c0 = bfloat2float(_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    pC += 1;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n\n            if (out_elempack == 4)\n            {\n                vst1_u16(p0, _bf0);\n                p0 += 4;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_bf0, 0);\n                p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                p0++;\n            }\n\n            pp += 4;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        // out_elempack == 1\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const float descale0 = descales[i + ii];\n        const float descale1 = descales[i + ii + 1];\n#if __ARM_NEON\n        float32x2_t _descale = vld1_f32((const float*)descales + i + ii);\n#endif\n\n        float c0;\n        float c1;\n#if __ARM_NEON\n        float32x4_t _c0;\n        float32x4_t _c1;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n                c1 = bfloat16_to_float32(pC[1]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n                _c1 = 
vdupq_n_f32(c1);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale, 0);\n            float32x4_t _f2 = vmulq_lane_f32(vcvtq_f32_s32(_sum2), _descale, 1);\n            float32x4_t _f3 = vmulq_lane_f32(vcvtq_f32_s32(_sum3), _descale, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c01));\n                    float32x4_t _c2 = 
bfloat2float(vget_low_u16(_c23));\n                    float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 8;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    _c0 = bfloat2float(vget_low_u16(_c));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _c0 = vmulq_f32(_c0, _beta);\n                        _c1 = vmulq_f32(_c1, _beta);\n                    }\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16(float2bfloat(_f0), float2bfloat(_f1)));\n            vst1q_u16(p0 + 
out_hstep, vcombine_u16(float2bfloat(_f2), float2bfloat(_f3)));\n\n            pp += 16;\n            p0 += 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = bfloat2float(vld1_u16(pC));\n                    float32x4_t _c1 = bfloat2float(vld1_u16(pC + c_hstep));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 4;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = bfloat2float(vld1_u16(pC));\n                    _c0 = vmulq_n_f32(_c0, beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                
float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1_u16(p0, float2bfloat(_f0));\n            vst1_u16(p0 + out_hstep, float2bfloat(_f1));\n\n            pp += 8;\n            p0 += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n\n            float32x2x2_t _descale01 = vzip_f32(_descale, _descale);\n            float32x4_t _descale0011 = vcombine_f32(_descale01.val[0], _descale01.val[1]);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0011);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    float32x4_t _c0011 = vcombine_f32(vget_low_f32(_c0), vget_high_f32(_c1));\n                    _f0 = vaddq_f32(_f0, _c0011);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[1], _c, 1);\n                    _c = vset_lane_u16(pC[c_hstep], _c, 2);\n                    _c = vset_lane_u16(pC[c_hstep + 1], _c, 3);\n                    _c0 = bfloat2float(_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[1], _c, 1);\n                    _c = vset_lane_u16(pC[0], _c, 2);\n                    _c = vset_lane_u16(pC[1], _c, 3);\n                    _c0 = bfloat2float(_c);\n   
                 _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n\n            p0[0] = vget_lane_u16(_bf0, 0);\n            p0[1] = vget_lane_u16(_bf0, 1);\n            p0[out_hstep] = vget_lane_u16(_bf0, 2);\n            p0[out_hstep + 1] = vget_lane_u16(_bf0, 3);\n\n            pp += 4;\n            p0 += 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj++)\n        {\n            float f0 = pp[0] * descale0;\n            float f1 = pp[1] * descale1;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    f0 += c0;\n                    f1 += c0;\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                    f1 += c1;\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    f0 += bfloat16_to_float32(pC[0]) * beta;\n                    f1 += bfloat16_to_float32(pC[c_hstep]) * beta;\n                    pC += 1;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    f0 += bfloat16_to_float32(pC[0]) * beta;\n                    f1 += bfloat16_to_float32(pC[0]) * beta;\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                f0 *= alpha;\n                f1 *= alpha;\n            }\n\n            p0[0] = float32_to_bfloat16(f0);\n            p0[out_hstep] = float32_to_bfloat16(f1);\n\n            pp += 2;\n            p0++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        // out_elempack == 1\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const float descale = descales[i + 
ii];\n#if __ARM_NEON\n        float32x4_t _descale = vdupq_n_f32(descale);\n#endif\n\n        float c0;\n#if __ARM_NEON\n        float32x4_t _c0;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n        for (; jj + 15 < max_jj; jj += 16)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n  
                  // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + 8);\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c01));\n                    float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                    float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 16;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16(float2bfloat(_f0), float2bfloat(_f1)));\n            vst1q_u16(p0 + 8, vcombine_u16(float2bfloat(_f2), float2bfloat(_f3)));\n\n            pp += 16;\n            p0 += 16;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n    
        if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c01));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16(float2bfloat(_f0), float2bfloat(_f1)));\n\n            pp += 8;\n            p0 += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    _c0 = bfloat2float(vld1_u16(pC));\n                    _f0 = 
vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 4;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1_u16(p0, float2bfloat(_f0));\n\n            pp += 4;\n            p0 += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x2_t _f0 = vmul_f32(vcvt_f32_s32(vld1_s32(pp)), vget_low_f32(_descale));\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vadd_f32(_f0, vget_low_f32(_c0));\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    float32x2_t _cc = float32x2_t();\n                    _cc = vset_lane_f32(bfloat16_to_float32(pC[0]), _cc, 0);\n                    _cc = vset_lane_f32(bfloat16_to_float32(pC[1]), _cc, 1);\n                    _f0 = vmla_n_f32(_f0, _cc, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmul_n_f32(_f0, alpha);\n\n            p0[0] = float32_to_bfloat16(vget_lane_f32(_f0, 0));\n            p0[1] = float32_to_bfloat16(vget_lane_f32(_f0, 1));\n\n            pp += 2;\n            p0 += 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj++)\n        {\n            float f0 = pp[0] * descale;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    f0 += bfloat16_to_float32(pC[0]) * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n\n            p0[0] = float32_to_bfloat16(f0);\n\n            pp += 1;\n            p0++;\n    
    }\n    }\n}\n\nstatic void transpose_unpack_output_tile_int32_to_bf16(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_unpack_output_tile_int32_to_bf16_asimddp(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const size_t c_hstep = C.dims == 3 ? C.cstep : (size_t)C.w;\n    const int c_elempack = C.elempack;\n    const unsigned short* pC = C;\n\n    // NCNN_LOGE(\"transpose_unpack_output_tile_int32_to_bf16  %d %d %d %d  %d  %d  %d\", i, max_ii, j, max_jj, out_elempack, broadcast_type_C, c_elempack);\n\n    const int* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        float32x4_t _descale0 = vld1q_f32((const float*)descales + i + ii);\n        float32x4_t _descale1 = vld1q_f32((const float*)descales + i + ii + 4);\n\n        float32x4_t _c0;\n        float32x4_t _c1;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                uint16x8_t _c = vld1q_u16(pC);\n                _c0 = bfloat2float(vget_low_u16(_c));\n                _c1 = bfloat2float(vget_high_u16(_c));\n                _c0 = vmulq_n_f32(_c0, beta);\n                _c1 = vmulq_n_f32(_c1, beta);\n            
}\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n            int32x4_t _sum8 = vld1q_s32(pp + 32);\n            int32x4_t _sum9 = vld1q_s32(pp + 36);\n            int32x4_t _suma = vld1q_s32(pp + 40);\n            int32x4_t _sumb = vld1q_s32(pp + 44);\n            int32x4_t _sumc = vld1q_s32(pp + 48);\n            int32x4_t _sumd = vld1q_s32(pp + 52);\n            int32x4_t _sume = vld1q_s32(pp + 56);\n            int32x4_t _sumf = vld1q_s32(pp + 60);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e4 f5 g6 h7\n            //      e0 f1 g2 h3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      g4 h5 e6 
f7\n            //      g0 h1 e2 f3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      e7 f6 g5 h4\n            //      e3 f2 g1 h0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      g7 h6 e5 f4\n            //      g3 h2 e1 f0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n            {\n                _sum8 = vrev64q_s32(_sum8);\n                _sum9 = vrev64q_s32(_sum9);\n                _suma = vrev64q_s32(_suma);\n                _sumb = vrev64q_s32(_sumb);\n                _sumc = vrev64q_s32(_sumc);\n                _sumd = vrev64q_s32(_sumd);\n                _sume = vrev64q_s32(_sume);\n                _sumf = vrev64q_s32(_sumf);\n                _sum8 = vextq_s32(_sum8, _sum8, 2);\n                _sum9 = vextq_s32(_sum9, _sum9, 2);\n                _suma = vextq_s32(_suma, _suma, 2);\n                _sumb = vextq_s32(_sumb, _sumb, 2);\n                _sumc = vextq_s32(_sumc, _sumc, 2);\n                _sumd = vextq_s32(_sumd, _sumd, 2);\n                _sume = vextq_s32(_sume, _sume, 2);\n                _sumf = vextq_s32(_sumf, _sumf, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                
int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                _sum9 = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                _suma = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                _sumb = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                _sumc = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                _sumd = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                _sume = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum9 = vrev64q_s32(_sum9);\n                _sumb = vrev64q_s32(_sumb);\n                _sumd = vrev64q_s32(_sumd);\n                _sumf = vrev64q_s32(_sumf);\n    
        }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum8), _descale0);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum9), _descale0);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_suma), _descale0);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sumb), _descale0);\n            float32x4_t _f8 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f9 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _fa = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _fb = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n            float32x4_t _fc = vmulq_f32(vcvtq_f32_s32(_sumc), _descale1);\n            float32x4_t _fd = vmulq_f32(vcvtq_f32_s32(_sumd), _descale1);\n            float32x4_t _fe = vmulq_f32(vcvtq_f32_s32(_sume), _descale1);\n            float32x4_t _ff = vmulq_f32(vcvtq_f32_s32(_sumf), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c0);\n                    _fa = vaddq_f32(_fa, _c0);\n                    _fb = vaddq_f32(_fb, _c0);\n                    _fc = vaddq_f32(_fc, _c0);\n                    _fd = 
vaddq_f32(_fd, _c0);\n                    _fe = vaddq_f32(_fe, _c0);\n                    _ff = vaddq_f32(_ff, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c1);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c1);\n                    _fb = vaddq_f32(_fb, _c1);\n                    _fc = vaddq_f32(_fc, _c1);\n                    _fd = vaddq_f32(_fd, _c1);\n                    _fe = vaddq_f32(_fe, _c1);\n                    _ff = vaddq_f32(_ff, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        uint16x8_t _c45 = vld1q_u16(pC + 16);\n                        uint16x8_t _c67 = vld1q_u16(pC + 24);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                        float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                        float32x4_t _c4 = bfloat2float(vget_low_u16(_c45));\n                        float32x4_t _c5 = bfloat2float(vget_high_u16(_c45));\n                        float32x4_t _c6 = bfloat2float(vget_low_u16(_c67));\n                        float32x4_t _c7 = bfloat2float(vget_high_u16(_c67));\n                  
      if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c45 = vld1q_u16(pC + c_hstep * 4 + 16);\n                        _c67 = vld1q_u16(pC + c_hstep * 4 + 24);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        _c4 = bfloat2float(vget_low_u16(_c45));\n                        _c5 = bfloat2float(vget_high_u16(_c45));\n                        _c6 = bfloat2float(vget_low_u16(_c67));\n                        _c7 = bfloat2float(vget_high_u16(_c67));\n                        if (beta == 
1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                        uint16x8_t _c45 = vld1q_u16(pC + c_hstep * 2);\n                        uint16x8_t _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                        float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                        float32x4_t _c4 = bfloat2float(vget_low_u16(_c45));\n             
           float32x4_t _c5 = bfloat2float(vget_high_u16(_c45));\n                        float32x4_t _c6 = bfloat2float(vget_low_u16(_c67));\n                        float32x4_t _c7 = bfloat2float(vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 5);\n                        _c45 = vld1q_u16(pC + c_hstep * 6);\n                        _c67 = vld1q_u16(pC + c_hstep * 7);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n       
                 _c4 = bfloat2float(vget_low_u16(_c45));\n                        _c5 = bfloat2float(vget_high_u16(_c45));\n                        _c6 = bfloat2float(vget_low_u16(_c67));\n                        _c7 = bfloat2float(vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 8;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    float32x4_t _cc0 = bfloat2float(vget_low_u16(_c));\n                    float32x4_t _cc1 = bfloat2float(vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        
_cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c2);\n                    _fb = vaddq_f32(_fb, _c3);\n                    _fc = vaddq_f32(_fc, _c4);\n                    _fd = vaddq_f32(_fd, _c5);\n                    _fe = vaddq_f32(_fe, _c6);\n                    _ff = vaddq_f32(_ff, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n                _f8 = vmulq_f32(_f8, _alpha);\n                _f9 = vmulq_f32(_f9, _alpha);\n                _fa = vmulq_f32(_fa, 
_alpha);\n                _fb = vmulq_f32(_fb, _alpha);\n                _fc = vmulq_f32(_fc, _alpha);\n                _fd = vmulq_f32(_fd, _alpha);\n                _fe = vmulq_f32(_fe, _alpha);\n                _ff = vmulq_f32(_ff, _alpha);\n            }\n\n            uint16x8_t _bf0 = vcombine_u16(float2bfloat(_f0), float2bfloat(_f8));\n            uint16x8_t _bf1 = vcombine_u16(float2bfloat(_f1), float2bfloat(_f9));\n            uint16x8_t _bf2 = vcombine_u16(float2bfloat(_f2), float2bfloat(_fa));\n            uint16x8_t _bf3 = vcombine_u16(float2bfloat(_f3), float2bfloat(_fb));\n            uint16x8_t _bf4 = vcombine_u16(float2bfloat(_f4), float2bfloat(_fc));\n            uint16x8_t _bf5 = vcombine_u16(float2bfloat(_f5), float2bfloat(_fd));\n            uint16x8_t _bf6 = vcombine_u16(float2bfloat(_f6), float2bfloat(_fe));\n            uint16x8_t _bf7 = vcombine_u16(float2bfloat(_f7), float2bfloat(_ff));\n\n            if (out_elempack == 4)\n            {\n                uint16x8x4_t _bfa;\n                uint16x8x4_t _bfb;\n                _bfa.val[0] = _bf0;\n                _bfa.val[1] = _bf1;\n                _bfa.val[2] = _bf2;\n                _bfa.val[3] = _bf3;\n                _bfb.val[0] = _bf4;\n                _bfb.val[1] = _bf5;\n                _bfb.val[2] = _bf6;\n                _bfb.val[3] = _bf7;\n                vst4q_u16(p0, _bfa);\n                vst4q_u16(p0 + out_hstep * 4, _bfb);\n            }\n            if (out_elempack == 1)\n            {\n                vst1q_u16(p0, _bf0);\n                vst1q_u16(p0 + out_hstep, _bf1);\n                vst1q_u16(p0 + out_hstep * 2, _bf2);\n                vst1q_u16(p0 + out_hstep * 3, _bf3);\n                vst1q_u16(p0 + out_hstep * 4, _bf4);\n                vst1q_u16(p0 + out_hstep * 5, _bf5);\n                vst1q_u16(p0 + out_hstep * 6, _bf6);\n                vst1q_u16(p0 + out_hstep * 7, _bf7);\n            }\n\n            pp += 64;\n            p0 += out_hstep * 8;\n        
}\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e0 f1 g2 h3\n            //      c0 d1 a2 b3\n            //      g0 h1 e2 f3\n            //      a3 b2 c1 d0\n            //      e3 f2 g1 h0\n            //      c3 d2 a1 b0\n            //      g3 h2 e1 f0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = 
vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                   
 _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c1);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c1);\n                    _f7 = vaddq_f32(_f7, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                        float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c0 
= bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = bfloat2float(_cc0);\n                        _c1 = bfloat2float(_cc1);\n                        float32x4_t _c2 = bfloat2float(_cc2);\n                        float32x4_t _c3 = bfloat2float(_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n        
                {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _cc0 = vld1_u16(pC + c_hstep * 4);\n                        _cc1 = vld1_u16(pC + c_hstep * 5);\n                        _cc2 = vld1_u16(pC + c_hstep * 6);\n                        _cc3 = vld1_u16(pC + c_hstep * 7);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = bfloat2float(_cc0);\n                        _c1 = bfloat2float(_cc1);\n                        _c2 = bfloat2float(_cc2);\n                        _c3 = bfloat2float(_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 4;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = bfloat2float(vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    _c1 = vdupq_laneq_f32(_c, 1);\n                    
float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c2);\n                    _f7 = vaddq_f32(_f7, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x8_t _bf0 = vcombine_u16(float2bfloat(_f0), float2bfloat(_f4));\n            uint16x8_t _bf1 = vcombine_u16(float2bfloat(_f1), float2bfloat(_f5));\n            uint16x8_t _bf2 = vcombine_u16(float2bfloat(_f2), float2bfloat(_f6));\n            uint16x8_t _bf3 = vcombine_u16(float2bfloat(_f3), float2bfloat(_f7));\n\n            if (out_elempack == 4)\n            {\n                uint16x8x4_t _bf;\n                _bf.val[0] = _bf0;\n                _bf.val[1] = _bf1;\n                _bf.val[2] = _bf2;\n                _bf.val[3] = _bf3;\n                vst4q_u16(p0, _bf);\n            }\n            if (out_elempack == 
1)\n            {\n                vst1q_u16(p0, _bf0);\n                vst1q_u16(p0 + out_hstep, _bf1);\n                vst1q_u16(p0 + out_hstep * 2, _bf2);\n                vst1q_u16(p0 + out_hstep * 3, _bf3);\n            }\n\n            pp += 32;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      e0 f1 g0 h1\n            //      a1 b0 c1 d0\n            //      e1 f0 g1 h0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), 
_descale1);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + c_hstep * 4);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], _c01, 1);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 4], _c01, 4);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 5], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 6], _c01, 6);\n                      
  _c01 = vsetq_lane_u16(pC[c_hstep * 7], _c01, 7);\n\n                        uint16x8_t _c23 = uint16x8_t();\n                        _c23 = vsetq_lane_u16(pC[1], _c23, 0);\n                        _c23 = vsetq_lane_u16(pC[c_hstep + 1], _c23, 1);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c23, 2);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c23, 3);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 4 + 1], _c23, 4);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 5 + 1], _c23, 5);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 6 + 1], _c23, 6);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 7 + 1], _c23, 7);\n\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_low_u16(_c23));\n                        _c2 = bfloat2float(vget_high_u16(_c01));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        pC += 2;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    _c1 = vdupq_n_f32(bfloat16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                  
  _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16(float2bfloat(_f0), float2bfloat(_f2)));\n            vst1q_u16(p0 + out_hstep, vcombine_u16(float2bfloat(_f1), float2bfloat(_f3)));\n\n            pp += 16;\n            p0 += out_hstep * 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp + 4)), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    if (c_elempack == 4)\n                    {\n                        _c0 = bfloat2float(vld1_u16(pC));\n                        _c1 = bfloat2float(vld1_u16(pC + c_hstep * 4));\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], _c01, 1);\n                        _c01 = 
vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 4], _c01, 4);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 5], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 6], _c01, 6);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 7], _c01, 7);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        pC += 1;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16(float2bfloat(_f0), float2bfloat(_f1)));\n            pp += 8;\n            p0 += out_hstep;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        float32x4_t _descale = vld1q_f32((const float*)descales + i + ii);\n\n        float32x4_t _c0;\n        if (pC)\n      
  {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                _c0 = bfloat2float(vld1_u16(pC));\n                _c0 = vmulq_n_f32(_c0, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 
c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n            
float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c01;\n                    uint16x8_t _c23;\n                    uint16x8_t _c45;\n                    uint16x8_t _c67;\n                    if (c_elempack == 4)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + 8);\n                        _c45 = vld1q_u16(pC + 16);\n                        _c67 = vld1q_u16(pC + 24);\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + c_hstep);\n            
            _c45 = vld1q_u16(pC + c_hstep * 2);\n                        _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        pC += 8;\n                    }\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c01));\n                    float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                    float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                    float32x4_t _c4 = bfloat2float(vget_low_u16(_c45));\n                    float32x4_t _c5 = bfloat2float(vget_high_u16(_c45));\n                    float32x4_t _c6 = bfloat2float(vget_low_u16(_c67));\n                    float32x4_t _c7 = bfloat2float(vget_high_u16(_c67));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                        _f4 = vaddq_f32(_f4, _c4);\n                        _f5 = vaddq_f32(_f5, _c5);\n                        _f6 = vaddq_f32(_f6, _c6);\n                        _f7 = vaddq_f32(_f7, _c7);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        _f4 = vmlaq_f32(_f4, _c4, _beta);\n                        _f5 = vmlaq_f32(_f5, _c5, _beta);\n                        _f6 = vmlaq_f32(_f6, _c6, _beta);\n                        _f7 = vmlaq_f32(_f7, _c7, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                
{\n                    uint16x8_t _c = vld1q_u16(pC);\n                    float32x4_t _cc0 = bfloat2float(vget_low_u16(_c));\n                    float32x4_t _cc1 = bfloat2float(vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = 
float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n            uint16x4_t _bf4 = float2bfloat(_f4);\n            uint16x4_t _bf5 = float2bfloat(_f5);\n            uint16x4_t _bf6 = float2bfloat(_f6);\n            uint16x4_t _bf7 = float2bfloat(_f7);\n\n            if (out_elempack == 4)\n            {\n                uint16x4x4_t _bfa;\n                uint16x4x4_t _bfb;\n                _bfa.val[0] = _bf0;\n                _bfa.val[1] = _bf1;\n                _bfa.val[2] = _bf2;\n                _bfa.val[3] = _bf3;\n                _bfb.val[0] = _bf4;\n                _bfb.val[1] = _bf5;\n                _bfb.val[2] = _bf6;\n                _bfb.val[3] = _bf7;\n                vst4_u16(p0, _bfa);\n                vst4_u16(p0 + out_hstep * 4, _bfb);\n            }\n            if (out_elempack == 1)\n            {\n                vst1_u16(p0, _bf0);\n                vst1_u16(p0 + out_hstep, _bf1);\n                vst1_u16(p0 + out_hstep * 2, _bf2);\n                vst1_u16(p0 + out_hstep * 3, _bf3);\n                vst1_u16(p0 + out_hstep * 4, _bf4);\n                vst1_u16(p0 + out_hstep * 5, _bf5);\n                vst1_u16(p0 + out_hstep * 6, _bf6);\n                vst1_u16(p0 + out_hstep * 7, _bf7);\n            }\n\n            pp += 32;\n            p0 += out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      c0 d1 a2 b3\n            //      a3 b2 c1 d0\n            //      c3 d2 a1 
b0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum2 = vextq_s32(_sum2, _sum2, 2);\n                _sum3 = vextq_s32(_sum3, _sum3, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t 
_c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = bfloat2float(vget_low_u16(_c01));\n                        _c1 = bfloat2float(vget_high_u16(_c01));\n                        _c2 = bfloat2float(vget_low_u16(_c23));\n                        _c3 = bfloat2float(vget_high_u16(_c23));\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = bfloat2float(_cc0);\n                        _c1 = bfloat2float(_cc1);\n                        _c2 = bfloat2float(_cc2);\n                        _c3 = bfloat2float(_cc3);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n        
            float32x4_t _c = bfloat2float(vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    float32x4_t _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n\n            if (out_elempack == 4)\n            {\n                uint16x4x4_t _bf;\n                _bf.val[0] = _bf0;\n                _bf.val[1] = _bf1;\n                _bf.val[2] = _bf2;\n                _bf.val[3] = _bf3;\n                vst4_u16(p0, _bf);\n            }\n            if (out_elempack == 1)\n            {\n                vst1_u16(p0, _bf0);\n                vst1_u16(p0 + out_hstep, _bf1);\n                vst1_u16(p0 + out_hstep * 2, _bf2);\n                vst1_u16(p0 + out_hstep * 3, _bf3);\n            }\n\n            pp 
+= 16;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      a1 b0 c1 d0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            {\n                _sum1 = vrev64q_s32(_sum1);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c;\n                    if (c_elempack == 4)\n                    {\n                        _c = vld1q_u16(pC);\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x8_t();\n                        _c = vsetq_lane_u16(pC[0], _c, 0);\n                        _c = vsetq_lane_u16(pC[c_hstep], _c, 1);\n                        _c = 
vsetq_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3], _c, 3);\n                        _c = vsetq_lane_u16(pC[1], _c, 4);\n                        _c = vsetq_lane_u16(pC[c_hstep + 1], _c, 5);\n                        _c = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c, 6);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c, 7);\n                        pC += 2;\n                    }\n                    _c0 = bfloat2float(vget_low_u16(_c));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    float32x4_t _c1 = vdupq_n_f32(bfloat16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1_u16(p0, float2bfloat(_f0));\n            vst1_u16(p0 + out_hstep, float2bfloat(_f1));\n\n            pp += 8;\n            p0 += out_hstep * 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                
if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x4_t _c;\n                    if (c_elempack == 4)\n                    {\n                        _c = vld1_u16(pC);\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x4_t();\n                        _c = vset_lane_u16(pC[0], _c, 0);\n                        _c = vset_lane_u16(pC[c_hstep], _c, 1);\n                        _c = vset_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vset_lane_u16(pC[c_hstep * 3], _c, 3);\n                        pC += 1;\n                    }\n                    _c0 = bfloat2float(_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(bfloat16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    pC += 1;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1_u16(p0, float2bfloat(_f0));\n            pp += 4;\n            p0 += out_hstep;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        const float descale0 = descales[i + ii];\n        const float descale1 = descales[i + ii + 1];\n#if __ARM_NEON\n        float32x2_t _descale01 = vld1_f32((const float*)descales + i + ii);\n#endif\n\n        float c0;\n        float c1;\n#if __ARM_NEON\n        float32x4_t _c0;\n        float32x4_t _c1;\n#endif\n        if 
(pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n                c1 = bfloat16_to_float32(pC[1]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n                _c1 = vdupq_n_f32(c1);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale01, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale01, 0);\n            float32x4_t _f2 = vmulq_lane_f32(vcvtq_f32_s32(_sum2), _descale01, 1);\n            float32x4_t _f3 = vmulq_lane_f32(vcvtq_f32_s32(_sum3), _descale01, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                  
  _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    _c1 = bfloat2float(vget_high_u16(_c01));\n                    float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                    float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 8;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    _c0 = bfloat2float(vget_low_u16(_c));\n                    _c1 = bfloat2float(vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _c0 = vmulq_f32(_c0, _beta);\n                        _c1 = vmulq_f32(_c1, _beta);\n                    }\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                 
   _f3 = vaddq_f32(_f3, _c1);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf2));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_bf1, _bf3));\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_bf0, 0);\n                p0[1] = vget_lane_u16(_bf2, 0);\n                p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_bf2, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_bf2, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_bf2, 3);\n                p0[out_hstep * 4] = vget_lane_u16(_bf1, 0);\n                p0[out_hstep * 4 + 1] = vget_lane_u16(_bf3, 0);\n                p0[out_hstep * 5] = vget_lane_u16(_bf1, 1);\n                p0[out_hstep * 5 + 1] = vget_lane_u16(_bf3, 1);\n                p0[out_hstep * 6] = vget_lane_u16(_bf1, 2);\n                p0[out_hstep * 6 + 1] = vget_lane_u16(_bf3, 2);\n                p0[out_hstep * 7] = vget_lane_u16(_bf1, 3);\n                p0[out_hstep * 7 + 1] = vget_lane_u16(_bf3, 3);\n            }\n\n            pp += 16;\n            p0 += out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n          
  // a0 a1 a2 a3\n            // b0 b1 b2 b3\n\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale01, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale01, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = bfloat2float(vld1_u16(pC));\n                    _c1 = bfloat2float(vld1_u16(pC + c_hstep));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 4;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = bfloat2float(vld1_u16(pC));\n                    _c0 = vmulq_n_f32(_c0, beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n       
     uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_bf0, 0);\n                p0[1] = vget_lane_u16(_bf1, 0);\n                p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_bf1, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_bf1, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_bf1, 3);\n            }\n\n            pp += 8;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            // a0 a1 b0 b1\n            int32x2x2_t _sum0 = vld2_s32(pp);\n\n            float32x4_t _descale = vcombine_f32(_descale01, _descale01);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vcombine_s32(_sum0.val[0], _sum0.val[1])), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    float32x4_t _cc = vzipq_f32(_c0, _c1).val[0];\n                    _f0 = vaddq_f32(_f0, _cc);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[c_hstep], _c, 1);\n                    _c = vset_lane_u16(pC[1], _c, 2);\n                    _c = vset_lane_u16(pC[c_hstep + 1], _c, 3);\n                    _c0 = bfloat2float(_c);\n                    _f0 = 
vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[0], _c, 1);\n                    _c = vset_lane_u16(pC[1], _c, 2);\n                    _c = vset_lane_u16(pC[1], _c, 3);\n                    _c0 = bfloat2float(_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n\n            p0[0] = vget_lane_u16(_bf0, 0);\n            p0[1] = vget_lane_u16(_bf0, 1);\n            p0[out_hstep] = vget_lane_u16(_bf0, 2);\n            p0[out_hstep + 1] = vget_lane_u16(_bf0, 3);\n\n            pp += 4;\n            p0 += out_hstep * 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj += 1)\n        {\n            float f0 = pp[0] * descale0;\n            float f1 = pp[1] * descale1;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    f0 += c0;\n                    f1 += c0;\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                    f1 += c1;\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    f0 += bfloat16_to_float32(pC[0]) * beta;\n                    f1 += bfloat16_to_float32(pC[c_hstep]) * beta;\n                    pC += 1;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    c0 = bfloat16_to_float32(pC[0]) * beta;\n                    f0 += c0;\n                    f1 += c0;\n                    pC += 1;\n                }\n            }\n\n            
if (alpha != 1.f)\n            {\n                f0 *= alpha;\n                f1 *= alpha;\n            }\n\n            p0[0] = float32_to_bfloat16(f0);\n            p0[1] = float32_to_bfloat16(f1);\n            pp += 2;\n            p0 += out_hstep;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        const float descale = descales[i + ii];\n#if __ARM_NEON\n        float32x4_t _descale = vdupq_n_f32(descale);\n#endif\n\n        float c0;\n#if __ARM_NEON\n        float32x4_t _c0;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                c0 = bfloat16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n        for (; jj + 15 < max_jj; jj += 16)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), 
_descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + 8);\n                    _c0 = bfloat2float(vget_low_u16(_c01));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c01));\n                    float32x4_t _c2 = bfloat2float(vget_low_u16(_c23));\n                    float32x4_t _c3 = bfloat2float(vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 16;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            
uint16x4_t _bf1 = float2bfloat(_f1);\n            uint16x4_t _bf2 = float2bfloat(_f2);\n            uint16x4_t _bf3 = float2bfloat(_f3);\n\n            if (out_hstep == 1)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_bf2, _bf3));\n            }\n            else\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(p0, _bf0);\n                    vst1_u16(p0 + out_hstep * 4, _bf1);\n                    vst1_u16(p0 + out_hstep * 8, _bf2);\n                    vst1_u16(p0 + out_hstep * 12, _bf3);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vget_lane_u16(_bf0, 0);\n                    p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                    p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                    p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                    p0[out_hstep * 4] = vget_lane_u16(_bf1, 0);\n                    p0[out_hstep * 5] = vget_lane_u16(_bf1, 1);\n                    p0[out_hstep * 6] = vget_lane_u16(_bf1, 2);\n                    p0[out_hstep * 7] = vget_lane_u16(_bf1, 3);\n                    p0[out_hstep * 8] = vget_lane_u16(_bf2, 0);\n                    p0[out_hstep * 9] = vget_lane_u16(_bf2, 1);\n                    p0[out_hstep * 10] = vget_lane_u16(_bf2, 2);\n                    p0[out_hstep * 11] = vget_lane_u16(_bf2, 3);\n                    p0[out_hstep * 12] = vget_lane_u16(_bf3, 0);\n                    p0[out_hstep * 13] = vget_lane_u16(_bf3, 1);\n                    p0[out_hstep * 14] = vget_lane_u16(_bf3, 2);\n                    p0[out_hstep * 15] = vget_lane_u16(_bf3, 3);\n                }\n            }\n\n            pp += 16;\n            p0 += out_hstep * 16;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n        
    float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c = vld1q_u16(pC);\n                    _c0 = bfloat2float(vget_low_u16(_c));\n                    float32x4_t _c1 = bfloat2float(vget_high_u16(_c));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n            uint16x4_t _bf1 = float2bfloat(_f1);\n\n            if (out_hstep == 1)\n            {\n                vst1q_u16(p0, vcombine_u16(_bf0, _bf1));\n            }\n            else\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(p0, _bf0);\n                    vst1_u16(p0 + out_hstep * 4, _bf1);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vget_lane_u16(_bf0, 0);\n                
    p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                    p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                    p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                    p0[out_hstep * 4] = vget_lane_u16(_bf1, 0);\n                    p0[out_hstep * 5] = vget_lane_u16(_bf1, 1);\n                    p0[out_hstep * 6] = vget_lane_u16(_bf1, 2);\n                    p0[out_hstep * 7] = vget_lane_u16(_bf1, 3);\n                }\n            }\n\n            pp += 8;\n            p0 += out_hstep * 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    _c0 = bfloat2float(vld1_u16(pC));\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 4;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _bf0 = float2bfloat(_f0);\n\n            if (out_hstep == 1)\n            {\n                vst1_u16(p0, _bf0);\n            }\n            else\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(p0, _bf0);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vget_lane_u16(_bf0, 0);\n                    p0[out_hstep] = vget_lane_u16(_bf0, 1);\n                    p0[out_hstep * 2] = vget_lane_u16(_bf0, 2);\n                    p0[out_hstep * 3] = vget_lane_u16(_bf0, 3);\n                }\n            }\n\n            pp += 4;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x2_t _f0 = 
vmul_f32(vcvt_f32_s32(vld1_s32(pp)), vget_low_f32(_descale));\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vadd_f32(_f0, vget_low_f32(_c0));\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    float32x2_t _c = float32x2_t();\n                    _c = vset_lane_f32(bfloat16_to_float32(pC[0]), _c, 0);\n                    _c = vset_lane_f32(bfloat16_to_float32(pC[1]), _c, 1);\n                    _f0 = vmla_n_f32(_f0, _c, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmul_n_f32(_f0, alpha);\n\n            p0[0] = float32_to_bfloat16(vget_lane_f32(_f0, 0));\n            p0[out_hstep] = float32_to_bfloat16(vget_lane_f32(_f0, 1));\n\n            pp += 2;\n            p0 += out_hstep * 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj += 1)\n        {\n            float f0 = pp[0] * descale;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    f0 += bfloat16_to_float32(pC[0]) * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n\n            p0[0] = float32_to_bfloat16(f0);\n\n            pp += 1;\n            p0 += out_hstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/gemm_int8_fp16s.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\nvoid pack_A_tile_fp16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid transpose_pack_A_tile_fp16_to_int8_i8mm(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid pack_B_tile_fp16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid transpose_pack_B_tile_fp16_to_int8_i8mm(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\nvoid pack_A_tile_fp16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid transpose_pack_A_tile_fp16_to_int8_asimddp(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales);\nvoid pack_B_tile_fp16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid transpose_pack_B_tile_fp16_to_int8_asimddp(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale);\nvoid unpack_output_tile_int32_to_fp16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\nvoid transpose_unpack_output_tile_int32_to_fp16_asimddp(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nvoid compute_A_tile_fp16_int8_scales_asimdhp(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii);\nvoid transpose_compute_A_tile_fp16_int8_scales_asimdhp(const Mat& A, Mat& scales, float B_scale, Mat& 
out_descales, int i, int max_ii);\nvoid compute_B_fp16_int8_scale_asimdhp(const Mat& B, float& scale);\n#endif\n\nstatic void compute_A_tile_fp16_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (ncnn::cpu_support_arm_asimdhp())\n    {\n        compute_A_tile_fp16_int8_scales_asimdhp(A, scales, B_scale, out_descales, i, max_ii);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n    const int K = A.w;\n\n    // NCNN_LOGE(\"compute_A_tile_fp16_int8_scales %d %d\", max_ii, elempack);\n\n    const float v127_B_scale = 127.f * B_scale;\n\n    float* ps = (float*)scales + i;\n    float* pods = (float*)out_descales + i;\n\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (elempack == 8)\n    {\n        float32x4_t _v127 = vdupq_n_f32(127.f);\n        float32x4_t _v127_B_scale = vdupq_n_f32(v127_B_scale);\n\n        for (int ii = 0; ii + 7 < max_ii; ii += 8)\n        {\n            const __fp16* p0 = (const __fp16*)A + (i + ii) * A_hstep;\n\n            float16x8_t _amax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax3 = vdupq_n_f16((__fp16)0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                float16x8_t _p2 = vld1q_f16(p0 + 16);\n                float16x8_t _p3 = vld1q_f16(p0 + 24);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p0));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(_p1));\n                _amax2 = vmaxq_f16(_amax2, vabsq_f16(_p2));\n                _amax3 = vmaxq_f16(_amax3, vabsq_f16(_p3));\n                p0 += 
32;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax2);\n            _amax1 = vmaxq_f16(_amax1, _amax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p0));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(_p1));\n                p0 += 16;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax1);\n            for (; kk < K; kk++)\n            {\n                float16x8_t _p = vld1q_f16(p0);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p));\n                p0 += 8;\n            }\n            float32x4_t _absmax0 = vcvt_f32_f16(vget_low_f16(_amax0));\n            float32x4_t _absmax1 = vcvt_f32_f16(vget_high_f16(_amax0));\n\n            float32x4_t _scale0 = vdivq_f32(_v127, _absmax0);\n            float32x4_t _scale1 = vdivq_f32(_v127, _absmax1);\n            float32x4_t _out_descale0 = vdivq_f32(_absmax0, _v127_B_scale);\n            float32x4_t _out_descale1 = vdivq_f32(_absmax1, _v127_B_scale);\n\n            vst1q_f32(ps, _scale0);\n            vst1q_f32(ps + 4, _scale1);\n            vst1q_f32(pods, _out_descale0);\n            vst1q_f32(pods + 4, _out_descale1);\n\n            ps += 8;\n            pods += 8;\n        }\n    }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (elempack == 4)\n    {\n#if __aarch64__\n        float32x4_t _v127 = vdupq_n_f32(127.f);\n        float32x4_t _v127_B_scale = vdupq_n_f32(v127_B_scale);\n#endif\n\n        for (int ii = 0; ii + 3 < max_ii; ii += 4)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const __fp16* p0 = (const __fp16*)A + (i + ii) * A_hstep;\n\n            float16x8_t _amax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax3 = vdupq_n_f16((__fp16)0.f);\n      
      int kk = 0;\n            for (; kk + 7 < K; kk += 8)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                float16x8_t _p2 = vld1q_f16(p0 + 16);\n                float16x8_t _p3 = vld1q_f16(p0 + 24);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p0));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(_p1));\n                _amax2 = vmaxq_f16(_amax2, vabsq_f16(_p2));\n                _amax3 = vmaxq_f16(_amax3, vabsq_f16(_p3));\n                p0 += 32;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax2);\n            _amax1 = vmaxq_f16(_amax1, _amax3);\n            for (; kk + 3 < K; kk += 4)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p0));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(_p1));\n                p0 += 16;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax1);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float16x8_t _p = vld1q_f16(p0);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p));\n                p0 += 8;\n            }\n            float16x4_t _amax = vmax_f16(vget_low_f16(_amax0), vget_high_f16(_amax0));\n            for (; kk < K; kk++)\n            {\n                float16x4_t _p = vld1_f16(p0);\n                _amax = vmax_f16(_amax, vabs_f16(_p));\n                p0 += 4;\n            }\n            float32x4_t _absmax0 = vcvt_f32_f16(_amax);\n#else  // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; 
kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += 4;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax0);\n            float32x4_t _out_descale = vdivq_f32(_absmax0, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax0);\n            // _recp_absmax = 
vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax0, _recp_v127_B_scale);\n\n            float tmp[4];\n            vst1q_f32(tmp, _absmax0);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n#endif\n            ps += 4;\n            pods += 4;\n        }\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        for (int ii = 0; ii < max_ii; ii++)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const __fp16* p0 = (const __fp16*)A + (i + ii) * A_hstep;\n\n            float absmax = 0.f;\n            float16x8_t _amax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax3 = vdupq_n_f16((__fp16)0.f);\n            int kk = 0;\n            for (; kk + 31 < K; kk += 32)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                float16x8_t _p2 = vld1q_f16(p0 + 16);\n                float16x8_t _p3 = vld1q_f16(p0 + 24);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p0));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(_p1));\n                _amax2 = vmaxq_f16(_amax2, vabsq_f16(_p2));\n                _amax3 = vmaxq_f16(_amax3, vabsq_f16(_p3));\n                p0 += 32;\n            }\n            _amax0 = vmaxq_f16(_amax0, 
_amax2);\n            _amax1 = vmaxq_f16(_amax1, _amax3);\n            for (; kk + 15 < K; kk += 16)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p0));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(_p1));\n                p0 += 16;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax1);\n            for (; kk + 7 < K; kk += 8)\n            {\n                float16x8_t _p = vld1q_f16(p0);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(_p));\n                p0 += 8;\n            }\n            absmax = (float)vmaxvq_f16(_amax0);\n            for (; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf(p0[0]));\n                p0++;\n            }\n#else // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep;\n\n            float absmax = 0.f;\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            for (; kk + 15 < K; kk += 16)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, 
vabsq_f32(_p3));\n                p0 += 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 7 < K; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += 4;\n            }\n#if __aarch64__\n            absmax = vmaxvq_f32(_absmax0);\n#else\n            float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            absmax = std::max(absmax, std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1)));\n#endif\n#endif // __ARM_NEON\n            for (; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf(float16_to_float32(p0[0])));\n                p0++;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n            ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n}\n\nstatic void pack_A_tile_fp16_to_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_A_tile_fp16_to_int8_i8mm(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && 
!__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_A_tile_fp16_to_int8_asimddp(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n\n    // NCNN_LOGE(\"pack_A_tile_fp16_to_int8 %d %d\", max_ii, elempack);\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k * elempack;\n\n        float32x4_t _scale0 = vld1q_f32((const float*)scales + i + ii);\n        float32x4_t _scale1 = vld1q_f32((const float*)scales + i + ii + 4);\n\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + 32);\n                uint16x8_t _u = vld1q_u16(p0 + 40);\n                uint16x8_t _v = vld1q_u16(p0 + 48);\n                uint16x8_t _w = vld1q_u16(p0 + 56);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale1);\n                _p8 = vmulq_f32(_p8, _scale0);\n                _p9 = vmulq_f32(_p9, _scale1);\n                _pa = vmulq_f32(_pa, _scale0);\n                _pb = vmulq_f32(_pb, _scale1);\n                _pc = vmulq_f32(_pc, _scale0);\n                _pd = vmulq_f32(_pd, _scale1);\n                _pe = vmulq_f32(_pe, _scale0);\n                _pf = vmulq_f32(_pf, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _p04 = vzip_s8(float2int8(_p0, _p1), float2int8(_p8, _p9));\n                int8x8x2_t _p15 = vzip_s8(float2int8(_p2, _p3), float2int8(_pa, _pb));\n                int8x8x2_t _p26 = vzip_s8(float2int8(_p4, _p5), float2int8(_pc, _pd));\n                int8x8x2_t _p37 = vzip_s8(float2int8(_p6, _p7), float2int8(_pe, _pf));\n\n                int8x16x4_t _rr;\n                _rr.val[0] = vcombine_s8(_p04.val[0], _p04.val[1]);\n                _rr.val[1] = vcombine_s8(_p15.val[0], _p15.val[1]);\n                _rr.val[2] = 
vcombine_s8(_p26.val[0], _p26.val[1]);\n                _rr.val[3] = vcombine_s8(_p37.val[0], _p37.val[1]);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x16x4_t _rr;\n                _rr.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p8, _p9));\n                _rr.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_pa, _pb));\n                _rr.val[2] = vcombine_s8(float2int8(_p4, _p5), float2int8(_pc, _pd));\n                _rr.val[3] = vcombine_s8(float2int8(_p6, _p7), float2int8(_pe, _pf));\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst4q_s8(pp, _rr);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(float2int8(_p8, _p9), float2int8(_pc, _pd));\n                _r23.val[1] = vcombine_s8(float2int8(_pa, _pb), float2int8(_pe, _pf));\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 64;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n       
         float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n                _r0123.val[2] = float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                uint16x8_t _p23 = vld1q_u16(p0 + 8);\n\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p23));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p23));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n                int8x8x2_t _r01;\n                
_r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n                uint16x8x4_t _q = vld4q_u16(p0 + A_hstep * 4);\n\n                float32x4_t _p0 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[0])), _scale0, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[1])), _scale0, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[2])), _scale0, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[3])), _scale0, 3);\n                float32x4_t _p4 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[0])), _scale0, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[1])), _scale0, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[2])), _scale0, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[3])), _scale0, 3);\n                float32x4_t _p8 = 
vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[0])), _scale1, 0);\n                float32x4_t _p9 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[1])), _scale1, 1);\n                float32x4_t _pa = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[2])), _scale1, 2);\n                float32x4_t _pb = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[3])), _scale1, 3);\n                float32x4_t _pc = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[0])), _scale1, 0);\n                float32x4_t _pd = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[1])), _scale1, 1);\n                float32x4_t _pe = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[2])), _scale1, 2);\n                float32x4_t _pf = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[3])), _scale1, 3);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n                int8x8_t _r4 = float2int8(_p8, _pc);\n                int8x8_t _r5 = float2int8(_p9, _pd);\n                int8x8_t _r6 = float2int8(_pa, _pe);\n                int8x8_t _r7 = float2int8(_pb, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p8, _p9);\n                int8x8_t _r3 = float2int8(_pa, _pb);\n                int8x8_t _r4 = float2int8(_p4, _p5);\n                int8x8_t _r5 = float2int8(_p6, _p7);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                
vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + A_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 4 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n           
     _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale0);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale0);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale0);\n                _p8 = vmulq_f32(_p8, _scale1);\n                _p9 = vmulq_f32(_p9, _scale1);\n                _pa = vmulq_f32(_pa, _scale1);\n                _pb = vmulq_f32(_pb, _scale1);\n                _pc = vmulq_f32(_pc, _scale1);\n                _pd = vmulq_f32(_pd, _scale1);\n                _pe = vmulq_f32(_pe, _scale1);\n                _pf = vmulq_f32(_pf, _scale1);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p8), float2int8(_p2, _pa));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p9), float2int8(_p3, _pb));\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(float2int8(_p4, _pc), float2int8(_p6, _pe));\n                _r23.val[1] = vcombine_s8(float2int8(_p5, _pd), float2int8(_p7, _pf));\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n                uint16x4x4_t _q = vld4_u16(p0 + A_hstep * 4);\n\n                float32x4_t _p0 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[0]), _scale0, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[1]), _scale0, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[2]), _scale0, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[3]), _scale0, 3);\n                float32x4_t _p4 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_q.val[0]), _scale1, 0);\n                
float32x4_t _p5 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_q.val[1]), _scale1, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_q.val[2]), _scale1, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_q.val[3]), _scale1, 3);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 4 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale0);\n                _p4 = vmulq_f32(_p4, _scale1);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale1);\n                _p7 = vmulq_f32(_p7, _scale1);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = 
vcombine_s8(float2int8(_p0, _p4), float2int8(_p2, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p5), float2int8(_p3, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep * 4);\n\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p0n = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p1n = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p0n = vmulq_f32(_p0n, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p1n = vmulq_f32(_p1n, _scale1);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p0n, _p1n);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = 
vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + A_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale0, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale0, 2);\n           
     _p6 = vmulq_laneq_f32(_p6, _scale0, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale0, 3);\n                _p8 = vmulq_laneq_f32(_p8, _scale1, 0);\n                _p9 = vmulq_laneq_f32(_p9, _scale1, 0);\n                _pa = vmulq_laneq_f32(_pa, _scale1, 1);\n                _pb = vmulq_laneq_f32(_pb, _scale1, 1);\n                _pc = vmulq_laneq_f32(_pc, _scale1, 2);\n                _pd = vmulq_laneq_f32(_pd, _scale1, 2);\n                _pe = vmulq_laneq_f32(_pe, _scale1, 3);\n                _pf = vmulq_laneq_f32(_pf, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 0);\n                _p2 = vmulq_lane_f32(_p2, vget_low_f32(_scale0), 1);\n                _p3 = vmulq_lane_f32(_p3, vget_low_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_high_f32(_scale0), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_high_f32(_scale0), 0);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale0), 1);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale0), 1);\n                _p8 = vmulq_lane_f32(_p8, vget_low_f32(_scale1), 0);\n                _p9 = vmulq_lane_f32(_p9, vget_low_f32(_scale1), 0);\n                _pa = vmulq_lane_f32(_pa, vget_low_f32(_scale1), 1);\n                _pb = vmulq_lane_f32(_pb, vget_low_f32(_scale1), 1);\n                _pc = vmulq_lane_f32(_pc, vget_high_f32(_scale1), 0);\n                _pd = vmulq_lane_f32(_pd, vget_high_f32(_scale1), 0);\n                _pe = vmulq_lane_f32(_pe, vget_high_f32(_scale1), 1);\n                _pf = vmulq_lane_f32(_pf, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                
int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p8, _pa));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_pc, _pe));\n                int16x4_t _t4 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t5 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4_t _t6 = vreinterpret_s16_s8(float2int8(_p9, _pb));\n                int16x4_t _t7 = vreinterpret_s16_s8(float2int8(_pd, _pf));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int16x4x2_t _t45 = vuzp_s16(_t4, _t5);\n                int16x4x2_t _t67 = vuzp_s16(_t6, _t7);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n                int8x8_t _r4 = vreinterpret_s8_s16(_t45.val[0]);\n                int8x8_t _r5 = vreinterpret_s8_s16(_t67.val[0]);\n                int8x8_t _r6 = 
vreinterpret_s8_s16(_t45.val[1]);\n                int8x8_t _r7 = vreinterpret_s8_s16(_t67.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n\n                pp += 64;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 3));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 4));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 5));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 6));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 7));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = 
vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = 
vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep * 4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[A_hstep * 6 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[A_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p45 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p67 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                float32x4x2_t _scale01 = vzipq_f32(_scale0, _scale0);\n                float32x4x2_t _scale23 = vzipq_f32(_scale1, _scale1);\n\n                _p01 = vmulq_f32(_p01, _scale01.val[0]);\n                _p23 = vmulq_f32(_p23, _scale01.val[1]);\n                _p45 = vmulq_f32(_p45, _scale23.val[0]);\n                _p67 = vmulq_f32(_p67, _scale23.val[1]);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 5], 
_p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 7], _p, 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0++;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k * elempack;\n\n        float32x4_t _scale = vld1q_f32((const float*)scales + i + ii);\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n\n                float32x4_t _p0 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[0])), _scale, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[1])), _scale, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[2])), _scale, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[3])), _scale, 3);\n                float32x4_t _p4 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[0])), _scale, 0);\n                float32x4_t _p5 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[1])), _scale, 1);\n                float32x4_t _p6 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[2])), _scale, 2);\n                float32x4_t _p7 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[3])), _scale, 3);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, 
_p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = 
vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n\n                float32x4_t _p0 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[0]), _scale, 0);\n                float32x4_t _p1 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[1]), _scale, 1);\n                float32x4_t _p2 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[2]), _scale, 2);\n                float32x4_t _p3 = vmulq_laneq_f32(vcvt_f32_f16((float16x4_t)_p.val[3]), _scale, 3);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n#if __aarch64__\n                
_p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale, 2);\n                _p6 = vmulq_laneq_f32(_p6, _scale, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 0);\n                _p2 = vmulq_lane_f32(_p2, vget_low_f32(_scale), 1);\n                _p3 = vmulq_lane_f32(_p3, vget_low_f32(_scale), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_high_f32(_scale), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_high_f32(_scale), 0);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale), 1);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = 
vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 3));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n             
   vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                float32x4x2_t _scale01 = vzipq_f32(_scale, _scale);\n\n                _p01 = vmulq_f32(_p01, _scale01.val[0]);\n                _p23 = vmulq_f32(_p23, _scale01.val[1]);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x4_t _p = uint16x4_t();\n                _p = vset_lane_u16(p0[0], _p, 0);\n                _p = vset_lane_u16(p0[A_hstep], _p, 1);\n                _p = vset_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vset_lane_u16(p0[A_hstep * 3], _p, 3);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)_p);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0++;\n            }\n        }\n    }\n#endif // 
__ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n\n        const float scale0 = scales[i + ii];\n        const float scale1 = scales[i + ii + 1];\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale0 = vdupq_n_f32(scale0);\n            float32x4_t _scale1 = vdupq_n_f32(scale1);\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale1);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p2));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p2));\n                float32x4_t _t2 = vcombine_f32(vget_low_f32(_p1), vget_low_f32(_p3));\n                float32x4_t _t3 = vcombine_f32(vget_high_f32(_p1), vget_high_f32(_p3));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n                int8x8_t _r1 = float2int8(_t2, _t3);\n#endif // 
__ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n                vst1_s8(pp + 8, _r1);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale0);\n                pp[1] = float2int8(float16_to_float32(p0[1]) * scale0);\n                pp[2] = float2int8(float16_to_float32(p0[A_hstep]) * scale1);\n                pp[3] = float2int8(float16_to_float32(p0[A_hstep + 1]) * scale1);\n                pp += 4;\n                p0 += 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale0);\n                pp[1] = float2int8(float16_to_float32(p0[A_hstep]) * scale1);\n                pp += 2;\n                p0++;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + (i + ii) * A_hstep + k;\n\n        const float scale = scales[i + ii];\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            
float32x4_t _scale = vdupq_n_f32(scale);\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_compute_A_tile_fp16_int8_scales(const Mat& A, Mat& scales, float B_scale, Mat& out_descales, int i, int max_ii)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (ncnn::cpu_support_arm_asimdhp())\n    {\n    
    transpose_compute_A_tile_fp16_int8_scales_asimdhp(A, scales, B_scale, out_descales, i, max_ii);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n    const int K = A.dims == 3 ? A.c : A.h;\n\n    // NCNN_LOGE(\"transpose_compute_A_tile_fp16_int8_scales %d %d\", max_ii, elempack);\n\n    const float v127_B_scale = 127.f * B_scale;\n\n#if __ARM_NEON\n#if __aarch64__\n    float32x4_t _v127 = vdupq_n_f32(127.f);\n    float32x4_t _v127_B_scale = vdupq_n_f32(v127_B_scale);\n#endif\n#endif\n\n    float* ps = (float*)scales + i;\n    float* pods = (float*)out_descales + i;\n\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (elempack == 8)\n    {\n        int ii = 0;\n        for (; ii + 1 < max_ii; ii += 2)\n        {\n            const __fp16* p0 = (const __fp16*)A + (i + ii) * 8;\n\n            float16x8_t _absmax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax3 = vdupq_n_f16((__fp16)0.f);\n            int kk = 0;\n            for (; kk + 1 < K; kk += 2)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                float16x8_t _p2 = vld1q_f16(p0 + A_hstep * 8);\n                float16x8_t _p3 = vld1q_f16(p0 + A_hstep * 8 + 8);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n                _absmax1 = vmaxq_f16(_absmax1, vabsq_f16(_p1));\n                _absmax2 = vmaxq_f16(_absmax2, vabsq_f16(_p2));\n                _absmax3 = vmaxq_f16(_absmax3, vabsq_f16(_p3));\n                p0 += A_hstep * 16;\n            }\n            _absmax0 = vmaxq_f16(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f16(_absmax1, _absmax3);\n            for (; kk < K; kk++)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                
float16x8_t _p1 = vld1q_f16(p0 + 8);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n                _absmax1 = vmaxq_f16(_absmax1, vabsq_f16(_p1));\n                p0 += A_hstep * 8;\n            }\n            float absmax0 = (float)vmaxvq_f16(_absmax0);\n            float absmax1 = (float)vmaxvq_f16(_absmax1);\n\n            ps[0] = 127.f / absmax0;\n            ps[1] = 127.f / absmax1;\n            pods[0] = absmax0 / v127_B_scale;\n            pods[1] = absmax1 / v127_B_scale;\n            ps += 2;\n            pods += 2;\n        }\n        for (; ii < max_ii; ii++)\n        {\n            const __fp16* p0 = (const __fp16*)A + (i + ii) * 8;\n\n            float16x8_t _absmax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax3 = vdupq_n_f16((__fp16)0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + A_hstep * 8);\n                float16x8_t _p2 = vld1q_f16(p0 + A_hstep * 16);\n                float16x8_t _p3 = vld1q_f16(p0 + A_hstep * 24);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n                _absmax1 = vmaxq_f16(_absmax1, vabsq_f16(_p1));\n                _absmax2 = vmaxq_f16(_absmax2, vabsq_f16(_p2));\n                _absmax3 = vmaxq_f16(_absmax3, vabsq_f16(_p3));\n                p0 += A_hstep * 32;\n            }\n            _absmax0 = vmaxq_f16(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f16(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + A_hstep * 8);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n                _absmax1 = vmaxq_f16(_absmax1, vabsq_f16(_p1));\n                p0 += 
A_hstep * 16;\n            }\n            _absmax0 = vmaxq_f16(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float16x8_t _p = vld1q_f16(p0);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p));\n                p0 += A_hstep * 8;\n            }\n            float absmax = (float)vmaxvq_f16(_absmax0);\n\n            ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (elempack == 4)\n    {\n        int ii = 0;\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const __fp16* p0 = (const __fp16*)A + (i + ii) * 4;\n\n            float16x8_t _absmax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _absmax3 = vdupq_n_f16((__fp16)0.f);\n            int kk = 0;\n            for (; kk + 1 < K; kk += 2)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                float16x8_t _p2 = vld1q_f16(p0 + A_hstep * 4);\n                float16x8_t _p3 = vld1q_f16(p0 + A_hstep * 4 + 8);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n                _absmax1 = vmaxq_f16(_absmax1, vabsq_f16(_p1));\n                _absmax2 = vmaxq_f16(_absmax2, vabsq_f16(_p2));\n                _absmax3 = vmaxq_f16(_absmax3, vabsq_f16(_p3));\n                p0 += A_hstep * 8;\n            }\n            _absmax0 = vmaxq_f16(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f16(_absmax1, _absmax3);\n            for (; kk < K; kk++)\n            {\n                float16x8_t _p0 = vld1q_f16(p0);\n                float16x8_t _p1 = vld1q_f16(p0 + 8);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n                _absmax1 = 
vmaxq_f16(_absmax1, vabsq_f16(_p1));\n                p0 += A_hstep * 4;\n            }\n            float16x8_t _aa0123 = vpmaxq_f16(_absmax0, _absmax1);\n            float32x4_t _absmax = vcvt_f32_f16(vpmax_f16(vget_low_f16(_aa0123), vget_high_f16(_aa0123)));\n#else // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * 4;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            for (int kk = 0; kk < K; kk++)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 4;\n            }\n#if __aarch64__\n            float32x4_t _aa01 = vpmaxq_f32(_absmax0, _absmax1);\n            float32x4_t _aa23 = vpmaxq_f32(_absmax2, _absmax3);\n            float32x4_t _absmax = vpmaxq_f32(_aa01, _aa23);\n#else\n            float32x2_t _aa0 = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            float32x2_t _aa1 = vmax_f32(vget_low_f32(_absmax1), vget_high_f32(_absmax1));\n            float32x2_t _aa2 = vmax_f32(vget_low_f32(_absmax2), vget_high_f32(_absmax2));\n            float32x2_t _aa3 = vmax_f32(vget_low_f32(_absmax3), vget_high_f32(_absmax3));\n            
float32x2_t _aa01 = vpmax_f32(_aa0, _aa1);\n            float32x2_t _aa23 = vpmax_f32(_aa2, _aa3);\n            float32x4_t _absmax = vcombine_f32(_aa01, _aa23);\n#endif\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax);\n            float32x4_t _out_descale = vdivq_f32(_absmax, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            float tmp[4];\n            vst1q_f32(tmp, _absmax);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax, _recp_v127_B_scale);\n#endif\n\n            ps += 4;\n            pods += 4;\n        }\n        for (; ii < max_ii; ii++)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const __fp16* p0 = (const __fp16*)A + (i + ii) * 4;\n\n            float16x8_t _amax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax3 = vdupq_n_f16((__fp16)0.f);\n            int kk = 0;\n            for (; kk + 7 < K; kk += 8)\n            {\n                float16x4_t _p0 = vld1_f16(p0);\n                float16x4_t _p1 = vld1_f16(p0 + 
A_hstep * 4);\n                float16x4_t _p2 = vld1_f16(p0 + A_hstep * 8);\n                float16x4_t _p3 = vld1_f16(p0 + A_hstep * 12);\n                float16x4_t _p4 = vld1_f16(p0 + A_hstep * 16);\n                float16x4_t _p5 = vld1_f16(p0 + A_hstep * 20);\n                float16x4_t _p6 = vld1_f16(p0 + A_hstep * 24);\n                float16x4_t _p7 = vld1_f16(p0 + A_hstep * 28);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(vcombine_f16(_p0, _p1)));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(vcombine_f16(_p2, _p3)));\n                _amax2 = vmaxq_f16(_amax2, vabsq_f16(vcombine_f16(_p4, _p5)));\n                _amax3 = vmaxq_f16(_amax3, vabsq_f16(vcombine_f16(_p6, _p7)));\n                p0 += A_hstep * 32;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax2);\n            _amax1 = vmaxq_f16(_amax1, _amax3);\n            for (; kk + 3 < K; kk += 4)\n            {\n                float16x4_t _p0 = vld1_f16(p0);\n                float16x4_t _p1 = vld1_f16(p0 + A_hstep * 4);\n                float16x4_t _p2 = vld1_f16(p0 + A_hstep * 8);\n                float16x4_t _p3 = vld1_f16(p0 + A_hstep * 12);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(vcombine_f16(_p0, _p1)));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(vcombine_f16(_p2, _p3)));\n                p0 += A_hstep * 16;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax1);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float16x4_t _p0 = vld1_f16(p0);\n                float16x4_t _p1 = vld1_f16(p0 + A_hstep * 4);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(vcombine_f16(_p0, _p1)));\n                p0 += A_hstep * 8;\n            }\n            float16x4_t _amax01 = vmax_f16(vget_low_f16(_amax0), vget_high_f16(_amax0));\n            for (; kk < K; kk++)\n            {\n                float16x4_t _p = vld1_f16(p0);\n                _amax01 = vmax_f16(_amax01, vabs_f16(_p));\n                p0 += A_hstep 
* 4;\n            }\n            float absmax = (float)vmaxv_f16(_amax01);\n#else // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii) * 4;\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 4));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 8));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 12));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 16;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 4));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += A_hstep * 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += A_hstep * 4;\n            }\n#if __aarch64__\n            
float absmax = vmaxvq_f32(_absmax0);\n#else\n            float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n            float absmax = std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1));\n#endif\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n            ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        int ii = 0;\n#if __ARM_NEON\n        for (; ii + 3 < max_ii; ii += 4)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const __fp16* p0 = (const __fp16*)A + (i + ii);\n\n            float16x8_t _amax0 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _amax3 = vdupq_n_f16((__fp16)0.f);\n            int kk = 0;\n            for (; kk + 7 < K; kk += 8)\n            {\n                float16x4_t _p0 = vld1_f16(p0);\n                float16x4_t _p1 = vld1_f16(p0 + A_hstep);\n                float16x4_t _p2 = vld1_f16(p0 + A_hstep * 2);\n                float16x4_t _p3 = vld1_f16(p0 + A_hstep * 3);\n                float16x4_t _p4 = vld1_f16(p0 + A_hstep * 4);\n                float16x4_t _p5 = vld1_f16(p0 + A_hstep * 5);\n                float16x4_t _p6 = vld1_f16(p0 + A_hstep * 6);\n                float16x4_t _p7 = vld1_f16(p0 + A_hstep * 7);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(vcombine_f16(_p0, _p1)));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(vcombine_f16(_p2, _p3)));\n                _amax2 = vmaxq_f16(_amax2, vabsq_f16(vcombine_f16(_p4, _p5)));\n                _amax3 = vmaxq_f16(_amax3, vabsq_f16(vcombine_f16(_p6, _p7)));\n                p0 += A_hstep * 8;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax2);\n            _amax1 = vmaxq_f16(_amax1, _amax3);\n            for (; kk + 3 < K; kk += 4)\n            {\n  
              float16x4_t _p0 = vld1_f16(p0);\n                float16x4_t _p1 = vld1_f16(p0 + A_hstep);\n                float16x4_t _p2 = vld1_f16(p0 + A_hstep * 2);\n                float16x4_t _p3 = vld1_f16(p0 + A_hstep * 3);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(vcombine_f16(_p0, _p1)));\n                _amax1 = vmaxq_f16(_amax1, vabsq_f16(vcombine_f16(_p2, _p3)));\n                p0 += A_hstep * 4;\n            }\n            _amax0 = vmaxq_f16(_amax0, _amax1);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float16x4_t _p0 = vld1_f16(p0);\n                float16x4_t _p1 = vld1_f16(p0 + A_hstep);\n                _amax0 = vmaxq_f16(_amax0, vabsq_f16(vcombine_f16(_p0, _p1)));\n                p0 += A_hstep * 2;\n            }\n            float16x4_t _amax = vmax_f16(vget_low_f16(_amax0), vget_high_f16(_amax0));\n            for (; kk < K; kk++)\n            {\n                float16x4_t _p = vld1_f16(p0);\n                _amax = vmax_f16(_amax, vabs_f16(_p));\n                p0 += A_hstep;\n            }\n            float32x4_t _absmax0 = vcvt_f32_f16(_amax);\n#else  // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii);\n\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            float32x4_t _absmax2 = vdupq_n_f32(0.f);\n            float32x4_t _absmax3 = vdupq_n_f32(0.f);\n            int kk = 0;\n            for (; kk + 3 < K; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 3));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = 
vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n                _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n                p0 += A_hstep * 4;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n            _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n            for (; kk + 1 < K; kk += 2)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += A_hstep * 2;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk < K; kk++)\n            {\n                float32x4_t _p = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n                p0 += A_hstep;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n#if __aarch64__\n            float32x4_t _scale = vdivq_f32(_v127, _absmax0);\n            float32x4_t _out_descale = vdivq_f32(_absmax0, _v127_B_scale);\n\n            vst1q_f32(ps, _scale);\n            vst1q_f32(pods, _out_descale);\n#else\n            float tmp[4];\n            vst1q_f32(tmp, _absmax0);\n\n            ps[0] = 127.f / tmp[0];\n            ps[1] = 127.f / tmp[1];\n            ps[2] = 127.f / tmp[2];\n            ps[3] = 127.f / tmp[3];\n\n            pods[0] = tmp[0] / v127_B_scale;\n            pods[1] = tmp[1] / v127_B_scale;\n            pods[2] = tmp[2] / v127_B_scale;\n            pods[3] = tmp[3] / v127_B_scale;\n\n            // float32x4_t _recp_absmax = vrecpeq_f32(_absmax0);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // _recp_absmax = 
vmulq_f32(vrecpsq_f32(_absmax0, _recp_absmax), _recp_absmax);\n            // float32x4_t _scale = vmulq_f32(_v127, _recp_absmax);\n            // float32x4_t _out_descale = vmulq_f32(_absmax0, _recp_v127_B_scale);\n#endif\n\n            ps += 4;\n            pods += 4;\n        }\n#endif // __ARM_NEON\n        for (; ii < max_ii; ii++)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const __fp16* p0 = (const __fp16*)A + (i + ii);\n\n            float absmax = 0.f;\n            int kk = 0;\n            float16x8_t _absmax0 = vdupq_n_f16((__fp16)0.f);\n            for (; kk + 7 < K; kk += 8)\n            {\n                float16x8_t _p = float16x8_t();\n                _p = vsetq_lane_f16(p0[0], _p, 0);\n                _p = vsetq_lane_f16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_f16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_f16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_f16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_f16(p0[A_hstep * 5], _p, 5);\n                _p = vsetq_lane_f16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_f16(p0[A_hstep * 7], _p, 7);\n                _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p));\n                p0 += A_hstep * 8;\n            }\n            float16x4_t _amax0 = vmax_f16(vget_low_f16(_absmax0), vget_high_f16(_absmax0));\n            for (; kk + 3 < K; kk += 4)\n            {\n                float16x4_t _p = float16x4_t();\n                _p = vset_lane_f16(p0[0], _p, 0);\n                _p = vset_lane_f16(p0[A_hstep], _p, 1);\n                _p = vset_lane_f16(p0[A_hstep * 2], _p, 2);\n                _p = vset_lane_f16(p0[A_hstep * 3], _p, 3);\n                _amax0 = vmax_f16(_amax0, vabs_f16(_p));\n                p0 += A_hstep * 4;\n            }\n            absmax = (float)vmaxv_f16(_amax0);\n            for (; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf((float)p0[0]));\n                
p0 += A_hstep;\n            }\n#else // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            const unsigned short* p0 = (const unsigned short*)A + (i + ii);\n\n            float absmax = 0.f;\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _absmax0 = vdupq_n_f32(0.f);\n            float32x4_t _absmax1 = vdupq_n_f32(0.f);\n            for (; kk + 7 < K; kk += 8)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 7], _p, 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n                p0 += A_hstep * 8;\n            }\n            _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n            for (; kk + 3 < K; kk += 4)\n            {\n                uint16x4_t _p = uint16x4_t();\n                _p = vset_lane_u16(p0[0], _p, 0);\n                _p = vset_lane_u16(p0[A_hstep], _p, 1);\n                _p = vset_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vset_lane_u16(p0[A_hstep * 3], _p, 3);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)_p);\n                _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n                p0 += A_hstep * 4;\n            }\n#if __aarch64__\n            absmax = vmaxvq_f32(_absmax0);\n#else\n            float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n      
      absmax = std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1));\n#endif\n#endif // __ARM_NEON\n            for (; kk < K; kk++)\n            {\n                absmax = std::max(absmax, (float)fabsf(float16_to_float32(p0[0])));\n                p0 += A_hstep;\n            }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n            ps[0] = 127.f / absmax;\n            pods[0] = absmax / v127_B_scale;\n            ps++;\n            pods++;\n        }\n    }\n}\n\nstatic void transpose_pack_A_tile_fp16_to_int8(const Mat& A, Mat& AT, int i, int max_ii, int k, int max_kk, const Mat& scales)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        transpose_pack_A_tile_fp16_to_int8_i8mm(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_A_tile_fp16_to_int8_asimddp(A, AT, i, max_ii, k, max_kk, scales);\n        return;\n    }\n#endif\n\n    const int elempack = A.elempack;\n    const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n\n    // NCNN_LOGE(\"transpose_pack_A_tile_fp16_to_int8 %d %d\", max_ii, elempack);\n\n    signed char* pp = AT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        float32x4_t _scale0 = vld1q_f32((const float*)scales + i + ii);\n        float32x4_t _scale1 = vld1q_f32((const float*)scales + i + ii + 4);\n\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + 32);\n                uint16x8_t _u = vld1q_u16(p0 + 40);\n                uint16x8_t _v = vld1q_u16(p0 + 48);\n                uint16x8_t _w = vld1q_u16(p0 + 56);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n    
            float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale0, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale0, 2);\n                _p6 = vmulq_laneq_f32(_p6, _scale0, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale0, 3);\n                _p8 = vmulq_laneq_f32(_p8, _scale1, 0);\n                _p9 = vmulq_laneq_f32(_p9, _scale1, 0);\n                _pa = vmulq_laneq_f32(_pa, _scale1, 1);\n                _pb = vmulq_laneq_f32(_pb, _scale1, 1);\n                _pc = vmulq_laneq_f32(_pc, _scale1, 2);\n                _pd = vmulq_laneq_f32(_pd, _scale1, 2);\n                _pe = vmulq_laneq_f32(_pe, _scale1, 3);\n                _pf = vmulq_laneq_f32(_pf, _scale1, 3);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n      
          int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, _r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n        
}\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + A_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 4 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                
_p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n                _p8 = vmulq_laneq_f32(_p8, _scale0, 0);\n                _p9 = vmulq_laneq_f32(_p9, _scale0, 1);\n                _pa = vmulq_laneq_f32(_pa, _scale0, 2);\n                _pb = vmulq_laneq_f32(_pb, _scale0, 3);\n                _pc = vmulq_laneq_f32(_pc, _scale1, 0);\n                _pd = vmulq_laneq_f32(_pd, _scale1, 1);\n                _pe = vmulq_laneq_f32(_pe, _scale1, 2);\n                _pf = vmulq_laneq_f32(_pf, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n                _p8 = vmulq_lane_f32(_p8, vget_low_f32(_scale0), 0);\n                _p9 = vmulq_lane_f32(_p9, vget_low_f32(_scale0), 1);\n                _pa = vmulq_lane_f32(_pa, vget_high_f32(_scale0), 0);\n                _pb = vmulq_lane_f32(_pb, vget_high_f32(_scale0), 1);\n                _pc = vmulq_lane_f32(_pc, vget_low_f32(_scale1), 0);\n                _pd = vmulq_lane_f32(_pd, vget_low_f32(_scale1), 1);\n                _pe = vmulq_lane_f32(_pe, vget_high_f32(_scale1), 0);\n                _pf = vmulq_lane_f32(_pf, vget_high_f32(_scale1), 1);\n#endif\n\n#if 
__ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p8);\n                int8x8_t _r1 = float2int8(_p1, _p9);\n                int8x8_t _r2 = float2int8(_p2, _pa);\n                int8x8_t _r3 = float2int8(_p3, _pb);\n                int8x8_t _r4 = float2int8(_p4, _pc);\n                int8x8_t _r5 = float2int8(_p5, _pd);\n                int8x8_t _r6 = float2int8(_p6, _pe);\n                int8x8_t _r7 = float2int8(_p7, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, 
_r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, _r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale0, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale0, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale0, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale0, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale1, 0);\n                _p5 = 
vmulq_laneq_f32(_p5, _scale1, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale1, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale1, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale0), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale0), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale0), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale0), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale1), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale1), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale1), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale1), 1);\n#endif\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n#if __ARM_FEATURE_DOTPROD\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8x2_t _rr = vuzpq_s16(_r01, _r23);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + 
A_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + A_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + A_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + A_hstep * 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale1);\n                _p8 = vmulq_f32(_p8, _scale0);\n                _p9 = vmulq_f32(_p9, _scale1);\n                _pa = 
vmulq_f32(_pa, _scale0);\n                _pb = vmulq_f32(_pb, _scale1);\n                _pc = vmulq_f32(_pc, _scale0);\n                _pd = vmulq_f32(_pd, _scale1);\n                _pe = vmulq_f32(_pe, _scale0);\n                _pf = vmulq_f32(_pf, _scale1);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n                int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n                int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n                int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n                int8x16x4_t _r0123;\n                _r0123.val[0] = vcombine_s8(_r04.val[0], _r04.val[1]);\n                _r0123.val[1] = vcombine_s8(_r15.val[0], _r15.val[1]);\n                _r0123.val[2] = vcombine_s8(_r26.val[0], _r26.val[1]);\n                _r0123.val[3] = vcombine_s8(_r37.val[0], _r37.val[1]);\n\n                vst4q_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = _r0;\n                _r0123.val[1] = _r1;\n                _r0123.val[2] = _r2;\n                _r0123.val[3] = _r3;\n                int8x8x4_t _r4567;\n                _r4567.val[0] = _r4;\n                _r4567.val[1] = _r5;\n                _r4567.val[2] = _r6;\n                _r4567.val[3] = _r7;\n\n                vst4_s8(pp, _r0123);\n                vst4_s8(pp + 32, _r4567);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(_r0, _r2);\n   
             _r01.val[1] = vcombine_s8(_r1, _r3);\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(_r4, _r6);\n                _r23.val[1] = vcombine_s8(_r5, _r7);\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 3);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n                _p4 = vmulq_f32(_p4, _scale0);\n                _p5 = vmulq_f32(_p5, _scale1);\n                _p6 = vmulq_f32(_p6, _scale0);\n                _p7 = vmulq_f32(_p7, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n                _r0123.val[2] = float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n    
            vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep);\n\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += A_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        const unsigned short* 
p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        float32x4_t _scale = vld1q_f32((const float*)scales + i + ii);\n\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 0);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 1);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 1);\n                _p4 = vmulq_laneq_f32(_p4, _scale, 2);\n                _p5 = vmulq_laneq_f32(_p5, _scale, 2);\n                _p6 = vmulq_laneq_f32(_p6, _scale, 3);\n                _p7 = vmulq_laneq_f32(_p7, _scale, 3);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, 
_p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += A_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + A_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + A_hstep * 4 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n                _p4 = vmulq_laneq_f32(_p4, _scale, 0);\n                _p5 = vmulq_laneq_f32(_p5, _scale, 1);\n                _p6 = vmulq_laneq_f32(_p6, _scale, 2);\n                _p7 = vmulq_laneq_f32(_p7, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n                _p4 = vmulq_lane_f32(_p4, vget_low_f32(_scale), 0);\n                _p5 = vmulq_lane_f32(_p5, vget_low_f32(_scale), 1);\n                _p6 = vmulq_lane_f32(_p6, vget_high_f32(_scale), 0);\n                _p7 = vmulq_lane_f32(_p7, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n        
        int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n#if __aarch64__\n                _p0 = vmulq_laneq_f32(_p0, _scale, 0);\n                _p1 = vmulq_laneq_f32(_p1, _scale, 1);\n                _p2 = vmulq_laneq_f32(_p2, _scale, 2);\n                _p3 = vmulq_laneq_f32(_p3, _scale, 3);\n#else\n                _p0 = vmulq_lane_f32(_p0, vget_low_f32(_scale), 0);\n                _p1 = vmulq_lane_f32(_p1, vget_low_f32(_scale), 1);\n                _p2 = vmulq_lane_f32(_p2, vget_high_f32(_scale), 0);\n                _p3 = vmulq_lane_f32(_p3, vget_high_f32(_scale), 1);\n#endif\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = 
vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += A_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 3));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 4));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 5));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 6));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 7));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                float32x4x2_t _p04 = vzipq_f32(_p0, _p4);\n                float32x4x2_t _p15 = vzipq_f32(_p1, _p5);\n                float32x4x2_t _p26 = vzipq_f32(_p2, _p6);\n                float32x4x2_t _p37 = vzipq_f32(_p3, _p7);\n                int8x8x4_t _r0123;\n                _r0123.val[0] = 
float2int8(_p04.val[0], _p04.val[1]);\n                _r0123.val[1] = float2int8(_p15.val[0], _p15.val[1]);\n                _r0123.val[2] = float2int8(_p26.val[0], _p26.val[1]);\n                _r0123.val[3] = float2int8(_p37.val[0], _p37.val[1]);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p4);\n                _r0123.val[1] = float2int8(_p1, _p5);\n                _r0123.val[2] = float2int8(_p2, _p6);\n                _r0123.val[3] = float2int8(_p3, _p7);\n\n                vst4_s8(pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 3));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                transpose4x4_ps(_p0, _p1, _p2, _p3);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8x2_t _r01;\n                _r01.val[0] 
= float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n                pp += 4;\n                p0 += A_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        const float scale0 = scales[i + ii];\n        const float scale1 = scales[i + ii + 1];\n\n#if __ARM_NEON\n        float32x4_t _scale0 = vdupq_n_f32(scale0);\n        float32x4_t _scale1 = vdupq_n_f32(scale1);\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale0);\n                _p2 = vmulq_f32(_p2, _scale1);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + A_hstep * 4);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = 
vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n                _p2 = vmulq_f32(_p2, _scale0);\n                _p3 = vmulq_f32(_p3, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale0);\n                _p1 = vmulq_f32(_p1, _scale1);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r01 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r01 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        
{\n            int kk = 0;\n#if __ARM_NEON\n            float32x4_t _scale = vzipq_f32(_scale0, _scale1).val[0];\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep * 4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[A_hstep * 6 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[A_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p45 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p67 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = 
vuzp_s8(_r0, _r1);\n\n                vst1q_s8(pp, vcombine_s8(_r01.val[0], _r01.val[1]));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vtrn_s8(_r0, _r1);\n                int8x8x2_t _rr01 = vuzp_s8(_r01.val[0], _r01.val[1]);\n\n                vst1q_s8(pp, vcombine_s8(_rr01.val[0], _rr01.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 4 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 6 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 3], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 3 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 5], _q, 4);\n                _q = vsetq_lane_u16(p0[A_hstep * 5 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[A_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 7 + 1], _q, 7);\n                float32x4_t _p02 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p46 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p13 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p57 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p46 = vmulq_f32(_p46, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n                _p57 = vmulq_f32(_p57, 
_scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p02, _p46);\n                _r01.val[1] = float2int8(_p13, _p57);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                float32x4x2_t _pp = vuzpq_f32(_p01, _p23);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 2 + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 3 + 1], _p, 7);\n                float32x4_t _p02 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p13 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n\n                float32x4x2_t _pp = vzipq_f32(_p02, _p13);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale0);\n                pp[1] = float2int8(float16_to_float32(p0[A_hstep + 0]) * scale0);\n                pp[2] = float2int8(float16_to_float32(p0[1]) * scale1);\n                pp[3] = float2int8(float16_to_float32(p0[A_hstep + 1]) * scale1);\n                pp += 4;\n                p0 += A_hstep * 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale0);\n                pp[1] = float2int8(float16_to_float32(p0[1]) * scale1);\n                pp += 2;\n                p0 += A_hstep;\n            }\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)A + k * A_hstep + (i + ii) * elempack;\n\n        const float scale = scales[i + ii];\n\n#if __ARM_NEON\n        float32x4_t _scale = vdupq_n_f32(scale);\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                uint16x8_t _p23 = vld1q_u16(p0 + A_hstep * 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p23));\n                float32x4_t _p3 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_p23));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += A_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 4));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 8));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 12));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += A_hstep * 16;\n            }\n            for 
(; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + A_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(float16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(float16_to_float32(p0[2]) * scale);\n                pp[3] = float2int8(float16_to_float32(p0[3]) * scale);\n                pp += 4;\n                p0 += A_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 7], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[A_hstep * 8], _q, 0);\n                _q = vsetq_lane_u16(p0[A_hstep * 9], _q, 1);\n                _q = vsetq_lane_u16(p0[A_hstep * 10], _q, 2);\n                _q = vsetq_lane_u16(p0[A_hstep * 11], _q, 3);\n                _q = vsetq_lane_u16(p0[A_hstep * 12], _q, 4);\n                _q = 
vsetq_lane_u16(p0[A_hstep * 13], _q, 5);\n                _q = vsetq_lane_u16(p0[A_hstep * 14], _q, 6);\n                _q = vsetq_lane_u16(p0[A_hstep * 15], _q, 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += A_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[A_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[A_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[A_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[A_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[A_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[A_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[A_hstep * 7], _p, 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += A_hstep * 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < 
max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0 += A_hstep;\n            }\n        }\n    }\n}\n\n// Compute a single quantization scale for the whole fp16 matrix B:\n// scale = 127 / max(|B|), or 1 when B is all zeros.\nstatic void compute_B_fp16_int8_scale(const Mat& B, float& scale)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (ncnn::cpu_support_arm_asimdhp())\n    {\n        compute_B_fp16_int8_scale_asimdhp(B, scale);\n        return;\n    }\n#endif\n\n    // scalar running maximum; the vector accumulators below are folded into it after the loop\n    float absmax = 0.f;\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    float16x8_t _absmax0 = vdupq_n_f16((__fp16)0.f);\n    float16x8_t _absmax1 = vdupq_n_f16((__fp16)0.f);\n    float16x8_t _absmax2 = vdupq_n_f16((__fp16)0.f);\n    float16x8_t _absmax3 = vdupq_n_f16((__fp16)0.f);\n    float16x4_t _amax = vdup_n_f16((__fp16)0.f);\n#else  // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    float32x4_t _absmax0 = vdupq_n_f32(0.f);\n    float32x4_t _absmax1 = vdupq_n_f32(0.f);\n    float32x4_t _absmax2 = vdupq_n_f32(0.f);\n    float32x4_t _absmax3 = vdupq_n_f32(0.f);\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#endif // __ARM_NEON\n    for (int i = 0; i < (B.dims == 3 ? B.c : B.h); i++)\n    {\n        const size_t B_hstep = B.dims == 3 ? 
B.cstep : (size_t)B.w;\n\n        const int size = B.w * B.elempack;\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        const __fp16* ptr = (const __fp16*)B + i * B_hstep * B.elempack;\n\n        int j = 0;\n        for (; j + 31 < size; j += 32)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n            _absmax1 = vmaxq_f16(_absmax1, vabsq_f16(_p1));\n            _absmax2 = vmaxq_f16(_absmax2, vabsq_f16(_p2));\n            _absmax3 = vmaxq_f16(_absmax3, vabsq_f16(_p3));\n            ptr += 32;\n        }\n        for (; j + 15 < size; j += 16)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p0));\n            _absmax1 = vmaxq_f16(_absmax1, vabsq_f16(_p1));\n            ptr += 16;\n        }\n        for (; j + 7 < size; j += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _absmax0 = vmaxq_f16(_absmax0, vabsq_f16(_p));\n            ptr += 8;\n        }\n        for (; j + 3 < size; j += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _amax = vmax_f16(_amax, vabs_f16(_p));\n            ptr += 4;\n        }\n        for (; j < size; j++)\n        {\n            absmax = std::max(absmax, (float)fabsf((float)ptr[0]));\n            ptr++;\n        }\n#else // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        const unsigned short* ptr = (const unsigned short*)B + i * B_hstep * B.elempack;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 15 < size; j += 16)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            uint16x8_t _q = vld1q_u16(ptr + 8);\n            float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n            float32x4_t _p1 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n            float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n            float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n            _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n            _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n            _absmax2 = vmaxq_f32(_absmax2, vabsq_f32(_p2));\n            _absmax3 = vmaxq_f32(_absmax3, vabsq_f32(_p3));\n            ptr += 16;\n        }\n        for (; j + 7 < size; j += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n            float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n            _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p0));\n            _absmax1 = vmaxq_f32(_absmax1, vabsq_f32(_p1));\n            ptr += 8;\n        }\n        for (; j + 3 < size; j += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16((float16x4_t)vld1_u16(ptr));\n            _absmax0 = vmaxq_f32(_absmax0, vabsq_f32(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size; j++)\n        {\n            absmax = std::max(absmax, (float)fabsf(float16_to_float32(ptr[0])));\n            ptr++;\n        }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    }\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    _absmax0 = vmaxq_f16(_absmax0, _absmax2);\n    _absmax1 = vmaxq_f16(_absmax1, _absmax3);\n    _absmax0 = vmaxq_f16(_absmax0, _absmax1);\n    absmax = std::max(absmax, (float)vmaxvq_f16(_absmax0));\n    absmax = std::max(absmax, (float)vmaxv_f16(_amax));\n#else // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    _absmax0 = vmaxq_f32(_absmax0, _absmax2);\n    _absmax1 = vmaxq_f32(_absmax1, _absmax3);\n    _absmax0 = vmaxq_f32(_absmax0, _absmax1);\n#if __aarch64__\n    absmax = std::max(absmax, vmaxvq_f32(_absmax0));\n#else\n    float32x2_t _aa = vmax_f32(vget_low_f32(_absmax0), vget_high_f32(_absmax0));\n    
absmax = std::max(absmax, std::max(vget_lane_f32(_aa, 0), vget_lane_f32(_aa, 1)));\n#endif\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#endif // __ARM_NEON\n\n    scale = absmax == 0.f ? 1.f : 127.f / absmax;\n}\n\nstatic void pack_B_tile_fp16_to_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        pack_B_tile_fp16_to_int8_i8mm(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        pack_B_tile_fp16_to_int8_asimddp(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    // NCNN_LOGE(\"pack_B_tile_fp16_to_int8 %d %d\", max_jj, elempack);\n\n    signed char* pp = BT;\n\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n#endif\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k * elempack;\n\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + 32);\n                uint16x8_t _u = vld1q_u16(p0 + 40);\n                uint16x8_t _v = vld1q_u16(p0 + 48);\n                uint16x8_t _w = vld1q_u16(p0 + 56);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n            
    int8x8x2_t _p04 = vzip_s8(float2int8(_p0, _p1), float2int8(_p8, _p9));\n                int8x8x2_t _p15 = vzip_s8(float2int8(_p2, _p3), float2int8(_pa, _pb));\n                int8x8x2_t _p26 = vzip_s8(float2int8(_p4, _p5), float2int8(_pc, _pd));\n                int8x8x2_t _p37 = vzip_s8(float2int8(_p6, _p7), float2int8(_pe, _pf));\n\n                int8x16x4_t _rr;\n                _rr.val[0] = vcombine_s8(_p04.val[0], _p04.val[1]);\n                _rr.val[1] = vcombine_s8(_p15.val[0], _p15.val[1]);\n                _rr.val[2] = vcombine_s8(_p26.val[0], _p26.val[1]);\n                _rr.val[3] = vcombine_s8(_p37.val[0], _p37.val[1]);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x16x4_t _rr;\n                _rr.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p8, _p9));\n                _rr.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_pa, _pb));\n                _rr.val[2] = vcombine_s8(float2int8(_p4, _p5), float2int8(_pc, _pd));\n                _rr.val[3] = vcombine_s8(float2int8(_p6, _p7), float2int8(_pe, _pf));\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst4q_s8(pp, _rr);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(float2int8(_p8, _p9), float2int8(_pc, _pd));\n                _r23.val[1] = vcombine_s8(float2int8(_pa, _pb), float2int8(_pe, _pf));\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 64;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 
16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n                _r0123.val[2] = float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                uint16x8_t _p23 = vld1q_u16(p0 + 8);\n\n                float32x4_t _p0 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p23));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p23));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n                _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n                uint16x8x4_t _q = vld4q_u16(p0 + B_hstep * 4);\n\n                float32x4_t _p0 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[0])), _scale);\n                float32x4_t _p1 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[1])), _scale);\n                float32x4_t _p2 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[2])), _scale);\n                float32x4_t _p3 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[3])), _scale);\n        
        float32x4_t _p4 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[0])), _scale);\n                float32x4_t _p5 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[1])), _scale);\n                float32x4_t _p6 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[2])), _scale);\n                float32x4_t _p7 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[3])), _scale);\n                float32x4_t _p8 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[0])), _scale);\n                float32x4_t _p9 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[1])), _scale);\n                float32x4_t _pa = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[2])), _scale);\n                float32x4_t _pb = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_q.val[3])), _scale);\n                float32x4_t _pc = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[0])), _scale);\n                float32x4_t _pd = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[1])), _scale);\n                float32x4_t _pe = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[2])), _scale);\n                float32x4_t _pf = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_q.val[3])), _scale);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n                int8x8_t _r4 = float2int8(_p8, _pc);\n                int8x8_t _r5 = float2int8(_p9, _pd);\n                int8x8_t _r6 = float2int8(_pa, _pe);\n                int8x8_t _r7 = float2int8(_pb, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p8, _p9);\n                int8x8_t _r3 = float2int8(_pa, _pb);\n          
      int8x8_t _r4 = float2int8(_p4, _p5);\n                int8x8_t _r5 = float2int8(_p6, _p7);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 4 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p8), float2int8(_p2, _pa));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p9), float2int8(_p3, _pb));\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(float2int8(_p4, _pc), float2int8(_p6, _pe));\n                _r23.val[1] = vcombine_s8(float2int8(_p5, _pd), float2int8(_p7, _pf));\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n                uint16x4x4_t _q = vld4_u16(p0 + B_hstep * 4);\n\n                float32x4_t _p0 = vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[0]), _scale);\n                float32x4_t _p1 = 
vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[1]), _scale);\n                float32x4_t _p2 = vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[2]), _scale);\n                float32x4_t _p3 = vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[3]), _scale);\n                float32x4_t _p4 = vmulq_f32(vcvt_f32_f16((float16x4_t)_q.val[0]), _scale);\n                float32x4_t _p5 = vmulq_f32(vcvt_f32_f16((float16x4_t)_q.val[1]), _scale);\n                float32x4_t _p6 = vmulq_f32(vcvt_f32_f16((float16x4_t)_q.val[2]), _scale);\n                float32x4_t _p7 = vmulq_f32(vcvt_f32_f16((float16x4_t)_q.val[3]), _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 4 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n         
       _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p4), float2int8(_p2, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p5), float2int8(_p3, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep * 4);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n       
 }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n         
       _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p8, _pa));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_pc, _pe));\n                
int16x4_t _t4 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t5 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4_t _t6 = vreinterpret_s16_s8(float2int8(_p9, _pb));\n                int16x4_t _t7 = vreinterpret_s16_s8(float2int8(_pd, _pf));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int16x4x2_t _t45 = vuzp_s16(_t4, _t5);\n                int16x4x2_t _t67 = vuzp_s16(_t6, _t7);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n                int8x8_t _r4 = vreinterpret_s8_s16(_t45.val[0]);\n                int8x8_t _r5 = vreinterpret_s8_s16(_t67.val[0]);\n                int8x8_t _r6 = vreinterpret_s8_s16(_t45.val[1]);\n                int8x8_t _r7 = vreinterpret_s8_s16(_t67.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n\n                pp += 64;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 3));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 4));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 5));\n                float32x4_t _p6 = 
vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 6));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 7));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = 
vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep * 4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 6 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p45 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p67 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 3);\n                _p = 
vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 7], _p, 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0++;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k * elempack;\n\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8x4_t _p = vld4q_u16(p0);\n\n                float32x4_t _p0 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[0])), _scale);\n                float32x4_t _p1 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[1])), _scale);\n                float32x4_t _p2 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[2])), _scale);\n                float32x4_t _p3 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_low_u16(_p.val[3])), _scale);\n                float32x4_t _p4 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[0])), _scale);\n                float32x4_t _p5 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[1])), _scale);\n                float32x4_t _p6 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[2])), _scale);\n                float32x4_t _p7 = vmulq_f32(vcvt_f32_f16((float16x4_t)vget_high_u16(_p.val[3])), _scale);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t 
_r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), 
float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += 32;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x4x4_t _p = vld4_u16(p0);\n\n                float32x4_t _p0 = vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[0]), _scale);\n                float32x4_t _p1 = vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[1]), _scale);\n                float32x4_t _p2 = vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[2]), _scale);\n                float32x4_t _p3 = vmulq_f32(vcvt_f32_f16((float16x4_t)_p.val[3]), _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += 8;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0 += 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 
= vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep));\n                
float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 3));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n   
             p0 += 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x4_t _p = uint16x4_t();\n                _p = vset_lane_u16(p0[0], _p, 0);\n                _p = vset_lane_u16(p0[B_hstep], _p, 1);\n                _p = vset_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vset_lane_u16(p0[B_hstep * 3], _p, 3);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)_p);\n\n                _p0 = vmulq_f32(_p0, _scale);\n                int8x8_t _r0 = float2int8(_p0, _p0);\n\n                pp[0] = vget_lane_s8(_r0, 0);\n                pp[1] = vget_lane_s8(_r0, 1);\n                pp[2] = vget_lane_s8(_r0, 2);\n                pp[3] = vget_lane_s8(_r0, 3);\n\n                pp += 4;\n                p0++;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n           
     int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p2));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p2));\n                float32x4_t _t2 = vcombine_f32(vget_low_f32(_p1), vget_low_f32(_p3));\n                float32x4_t _t3 = vcombine_f32(vget_high_f32(_p1), vget_high_f32(_p3));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n                int8x8_t _r1 = float2int8(_t2, _t3);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n                vst1_s8(pp + 8, _r1);\n\n                pp += 16;\n                p0 += 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r0 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(float16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(float16_to_float32(p0[B_hstep]) * scale);\n                pp[3] = float2int8(float16_to_float32(p0[B_hstep + 1]) * scale);\n                pp += 4;\n                p0 += 2;\n            }\n#endif // 
__ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(float16_to_float32(p0[B_hstep]) * scale);\n                pp += 2;\n                p0++;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + (j + jj) * B_hstep + k;\n\n        // if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < 
max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0++;\n            }\n        }\n    }\n}\n\nstatic void transpose_pack_B_tile_fp16_to_int8(const Mat& B, Mat& BT, int j, int max_jj, int k, int max_kk, float scale)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM84I8MM && __aarch64__ && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_i8mm())\n    {\n        transpose_pack_B_tile_fp16_to_int8_i8mm(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_pack_B_tile_fp16_to_int8_asimddp(B, BT, j, max_jj, k, max_kk, scale);\n        return;\n    }\n#endif\n\n    const int elempack = B.elempack;\n    const size_t B_hstep = B.dims == 3 ? B.cstep : (size_t)B.w;\n\n    // NCNN_LOGE(\"transpose_pack_B_tile_fp16_to_int8 %d %d\", max_jj, elempack);\n\n    signed char* pp = BT;\n\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n#endif\n\n    int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n    for (; jj + 7 < max_jj; jj += 8)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + 32);\n                uint16x8_t _u = vld1q_u16(p0 + 40);\n                uint16x8_t _v = vld1q_u16(p0 + 48);\n                uint16x8_t _w = vld1q_u16(p0 + 56);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                
float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if 
__ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p8, _pa);\n                int8x8_t _r3 = float2int8(_pc, _pe);\n                int8x8_t _r4 = float2int8(_p1, _p3);\n                int8x8_t _r5 = float2int8(_p5, _p7);\n                int8x8_t _r6 = float2int8(_p9, _pb);\n                int8x8_t _r7 = float2int8(_pd, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                
int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, _r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 4 + 8);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 4 + 16);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 4 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p8);\n                int8x8_t _r1 = float2int8(_p1, _p9);\n                int8x8_t _r2 = float2int8(_p2, _pa);\n                int8x8_t _r3 = float2int8(_p3, _pb);\n                int8x8_t _r4 = float2int8(_p4, _pc);\n                int8x8_t _r5 = float2int8(_p5, _pd);\n                int8x8_t _r6 = float2int8(_p6, _pe);\n                int8x8_t _r7 = float2int8(_p7, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n       
         vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n                vst1q_s8(pp + 32, vcombine_s8(_r4, _r5));\n                vst1q_s8(pp + 48, vcombine_s8(_r6, _r7));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8_t _r45 = vreinterpretq_s16_s8(vcombine_s8(_r4, _r5));\n                int16x8_t _r67 = vreinterpretq_s16_s8(vcombine_s8(_r6, _r7));\n                int16x8x2_t _rr0 = vuzpq_s16(_r01, _r23);\n                int16x8x2_t _rr1 = vuzpq_s16(_r45, _r67);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr0.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr0.val[1]));\n                vst1q_s8(pp + 32, 
vreinterpretq_s8_s16(_rr1.val[0]));\n                vst1q_s8(pp + 48, vreinterpretq_s8_s16(_rr1.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n\n#if __ARM_FEATURE_DOTPROD\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n#else  // __ARM_FEATURE_DOTPROD\n                int16x8_t _r01 = vreinterpretq_s16_s8(vcombine_s8(_r0, _r1));\n                
int16x8_t _r23 = vreinterpretq_s16_s8(vcombine_s8(_r2, _r3));\n                int16x8x2_t _rr = vuzpq_s16(_r01, _r23);\n\n                vst1q_s8(pp, vreinterpretq_s8_s16(_rr.val[0]));\n                vst1q_s8(pp + 16, vreinterpretq_s8_s16(_rr.val[1]));\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                uint16x8_t _t = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _u = vld1q_u16(p0 + B_hstep * 5);\n                uint16x8_t _v = vld1q_u16(p0 + B_hstep * 6);\n                uint16x8_t _w = vld1q_u16(p0 + B_hstep * 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n                float32x4_t _p8 = vcvt_f32_f16((float16x4_t)vget_low_u16(_t));\n                float32x4_t _p9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_t));\n                float32x4_t _pa = vcvt_f32_f16((float16x4_t)vget_low_u16(_u));\n                float32x4_t _pb = vcvt_f32_f16((float16x4_t)vget_high_u16(_u));\n                float32x4_t _pc = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_v));\n                float32x4_t _pd = vcvt_f32_f16((float16x4_t)vget_high_u16(_v));\n                float32x4_t _pe = vcvt_f32_f16((float16x4_t)vget_low_u16(_w));\n                float32x4_t _pf = vcvt_f32_f16((float16x4_t)vget_high_u16(_w));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n                _p8 = vmulq_f32(_p8, _scale);\n                _p9 = vmulq_f32(_p9, _scale);\n                _pa = vmulq_f32(_pa, _scale);\n                _pb = vmulq_f32(_pb, _scale);\n                _pc = vmulq_f32(_pc, _scale);\n                _pd = vmulq_f32(_pd, _scale);\n                _pe = vmulq_f32(_pe, _scale);\n                _pf = vmulq_f32(_pf, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n                int8x8_t _r4 = float2int8(_p8, _p9);\n                int8x8_t _r5 = float2int8(_pa, _pb);\n                int8x8_t _r6 = float2int8(_pc, _pd);\n                int8x8_t _r7 = float2int8(_pe, _pf);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r04 = vzip_s8(_r0, _r4);\n                int8x8x2_t _r15 = vzip_s8(_r1, _r5);\n                int8x8x2_t _r26 = vzip_s8(_r2, _r6);\n                int8x8x2_t _r37 = vzip_s8(_r3, _r7);\n                int8x16x4_t _r0123;\n                _r0123.val[0] = vcombine_s8(_r04.val[0], _r04.val[1]);\n                _r0123.val[1] = vcombine_s8(_r15.val[0], _r15.val[1]);\n                _r0123.val[2] = vcombine_s8(_r26.val[0], 
_r26.val[1]);\n                _r0123.val[3] = vcombine_s8(_r37.val[0], _r37.val[1]);\n\n                vst4q_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x4_t _r0123;\n                _r0123.val[0] = _r0;\n                _r0123.val[1] = _r1;\n                _r0123.val[2] = _r2;\n                _r0123.val[3] = _r3;\n                int8x8x4_t _r4567;\n                _r4567.val[0] = _r4;\n                _r4567.val[1] = _r5;\n                _r4567.val[2] = _r6;\n                _r4567.val[3] = _r7;\n\n                vst4_s8(pp, _r0123);\n                vst4_s8(pp + 32, _r4567);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(_r0, _r2);\n                _r01.val[1] = vcombine_s8(_r1, _r3);\n                int8x16x2_t _r23;\n                _r23.val[0] = vcombine_s8(_r4, _r6);\n                _r23.val[1] = vcombine_s8(_r5, _r7);\n\n                vst2q_s8(pp, _r01);\n                vst2q_s8(pp + 32, _r23);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 64;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 2);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 3);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p1);\n                _r0123.val[1] = float2int8(_p2, _p3);\n                _r0123.val[2] = float2int8(_p4, _p5);\n                _r0123.val[3] = float2int8(_p6, _p7);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p1), float2int8(_p4, _p5));\n                _r01.val[1] = vcombine_s8(float2int8(_p2, _p3), float2int8(_p6, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p1);\n      
          _r01.val[1] = float2int8(_p2, _p3);\n\n                vst2_s8(pp, _r01);\n\n                pp += 16;\n                p0 += B_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r0 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r0);\n\n                pp += 8;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __aarch64__\n    for (; jj + 3 < max_jj; jj += 4)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + 16);\n                uint16x8_t _s = vld1q_u16(p0 + 24);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = 
vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p4, _p6);\n                int8x8_t _r2 = float2int8(_p1, _p3);\n                int8x8_t _r3 = float2int8(_p5, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p4, _p6));\n                int16x4_t _t2 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p5, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                
uint16x8_t _q = vld1q_u16(p0 + 8);\n                uint16x8_t _r = vld1q_u16(p0 + B_hstep * 4);\n                uint16x8_t _s = vld1q_u16(p0 + B_hstep * 4 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_r));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_r));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_s));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_s));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p4);\n                int8x8_t _r1 = float2int8(_p1, _p5);\n                int8x8_t _r2 = float2int8(_p2, _p6);\n                int8x8_t _r3 = float2int8(_p3, _p7);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n                int8x8_t _r2 = float2int8(_p4, _p5);\n                int8x8_t _r3 = float2int8(_p6, _p7);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4_t _t2 = 
vreinterpret_s16_s8(float2int8(_p4, _p5));\n                int16x4_t _t3 = vreinterpret_s16_s8(float2int8(_p6, _p7));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int16x4x2_t _t23 = vuzp_s16(_t2, _t3);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n                int8x8_t _r2 = vreinterpret_s8_s16(_t23.val[0]);\n                int8x8_t _r3 = vreinterpret_s8_s16(_t23.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n                vst1q_s8(pp + 16, vcombine_s8(_r2, _r3));\n\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vuzp_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n  
              pp += 16;\n                p0 += B_hstep * 4;\n            }\n        }\n        if (elempack == 1)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 3));\n                float32x4_t _p4 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 4));\n                float32x4_t _p5 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 5));\n                float32x4_t _p6 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 6));\n                float32x4_t _p7 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 7));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n                _p4 = vmulq_f32(_p4, _scale);\n                _p5 = vmulq_f32(_p5, _scale);\n                _p6 = vmulq_f32(_p6, _scale);\n                _p7 = vmulq_f32(_p7, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                float32x4x2_t _p04 = vzipq_f32(_p0, _p4);\n                float32x4x2_t _p15 = vzipq_f32(_p1, _p5);\n                float32x4x2_t _p26 = vzipq_f32(_p2, _p6);\n                float32x4x2_t _p37 = vzipq_f32(_p3, _p7);\n                int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p04.val[0], _p04.val[1]);\n                _r0123.val[1] = float2int8(_p15.val[0], _p15.val[1]);\n                _r0123.val[2] = float2int8(_p26.val[0], _p26.val[1]);\n                _r0123.val[3] = float2int8(_p37.val[0], _p37.val[1]);\n\n                vst4_s8(pp, _r0123);\n#else  // __ARM_FEATURE_MATMUL_INT8\n               
 int8x8x4_t _r0123;\n                _r0123.val[0] = float2int8(_p0, _p4);\n                _r0123.val[1] = float2int8(_p1, _p5);\n                _r0123.val[2] = float2int8(_p2, _p6);\n                _r0123.val[3] = float2int8(_p3, _p7);\n\n                vst4_s8(pp, _r0123);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int8x16x2_t _r01;\n                _r01.val[0] = vcombine_s8(float2int8(_p0, _p2), float2int8(_p4, _p6));\n                _r01.val[1] = vcombine_s8(float2int8(_p1, _p3), float2int8(_p5, _p7));\n\n                vst2q_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 32;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 2));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 3));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                transpose4x4_ps(_p0, _p1, _p2, _p3);\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n#else  // __ARM_FEATURE_DOTPROD\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p0, _p2);\n                _r01.val[1] = float2int8(_p1, _p3);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                float32x4_t _p0 = 
vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n                int8x8_t _r01 = float2int8(_p01.val[0], _p01.val[1]);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 2;\n            }\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(float16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(float16_to_float32(p0[2]) * scale);\n                pp[3] = float2int8(float16_to_float32(p0[3]) * scale);\n                pp += 4;\n                p0 += B_hstep;\n            }\n        }\n    }\n#endif // __ARM_NEON\n    for (; jj + 1 < max_jj; jj += 2)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n#if __ARM_NEON\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = 
float2int8(_p2, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p1));\n                int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p2, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += B_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                uint16x8_t _q = vld1q_u16(p0 + B_hstep * 4);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p2);\n                int8x8_t _r1 = float2int8(_p1, _p3);\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8_t _r0 = float2int8(_p0, _p1);\n                int8x8_t _r1 = float2int8(_p2, _p3);\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                int16x4_t _t0 = vreinterpret_s16_s8(float2int8(_p0, _p2));\n             
   int16x4_t _t1 = vreinterpret_s16_s8(float2int8(_p1, _p3));\n                int16x4x2_t _t01 = vzip_s16(_t0, _t1);\n                int8x8_t _r0 = vreinterpret_s8_s16(_t01.val[0]);\n                int8x8_t _r1 = vreinterpret_s8_s16(_t01.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1q_s8(pp, vcombine_s8(_r0, _r1));\n\n                pp += 16;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                uint16x8_t _p = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _r01 = float2int8(_p0, _p1);\n#else  // __ARM_FEATURE_DOTPROD\n                float32x4_t _t0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                float32x4_t _t1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n                int8x8_t _r01 = float2int8(_t0, _t1);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 
1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep * 4], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep * 4 + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 5], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 5 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 6], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 6 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 7 + 1], _q, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p45 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p67 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n                _p45 = vmulq_f32(_p45, _scale);\n                _p67 = vmulq_f32(_p67, _scale);\n\n                int8x8_t _r0 = float2int8(_p01, _p23);\n                int8x8_t _r1 = float2int8(_p45, _p67);\n\n#if __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vuzp_s8(_r0, _r1);\n\n                vst1q_s8(pp, vcombine_s8(_r01.val[0], _r01.val[1]));\n#else  // __ARM_FEATURE_MATMUL_INT8\n                int8x8x2_t _r01 = vtrn_s8(_r0, _r1);\n                int8x8x2_t _rr01 = vuzp_s8(_r01.val[0], _r01.val[1]);\n\n                vst1q_s8(pp, vcombine_s8(_rr01.val[0], _rr01.val[1]));\n#endif // __ARM_FEATURE_MATMUL_INT8\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 3);\n                _p = 
vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 4 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 6 + 1], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep + 1], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 3], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 3 + 1], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 5], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 5 + 1], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 7], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 7 + 1], _q, 7);\n                float32x4_t _p02 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p46 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p13 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p57 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p46 = vmulq_f32(_p46, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n                _p57 = vmulq_f32(_p57, _scale);\n\n                int8x8x2_t _r01;\n                _r01.val[0] = float2int8(_p02, _p46);\n                _r01.val[1] = float2int8(_p13, _p57);\n\n                vst2_s8(pp, _r01);\n#endif // __ARM_FEATURE_DOTPROD\n\n                pp += 16;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 3);\n                _p = 
vsetq_lane_u16(p0[B_hstep * 2], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                float32x4_t _p01 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p23 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p01 = vmulq_f32(_p01, _scale);\n                _p23 = vmulq_f32(_p23, _scale);\n\n                float32x4x2_t _pp = vuzpq_f32(_p01, _p23);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#else  // __ARM_FEATURE_DOTPROD\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[1], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 2 + 1], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep + 1], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 3 + 1], _p, 7);\n                float32x4_t _p02 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p13 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p02 = vmulq_f32(_p02, _scale);\n                _p13 = vmulq_f32(_p13, _scale);\n\n                float32x4x2_t _pp = vzipq_f32(_p02, _p13);\n                int8x8_t _r01 = float2int8(_pp.val[0], _pp.val[1]);\n#endif // __ARM_FEATURE_DOTPROD\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 4;\n            }\n            for (; kk + 1 < max_kk; kk += 2)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(float16_to_float32(p0[B_hstep + 0]) * scale);\n                pp[2] = 
float2int8(float16_to_float32(p0[1]) * scale);\n                pp[3] = float2int8(float16_to_float32(p0[B_hstep + 1]) * scale);\n                pp += 4;\n                p0 += B_hstep * 2;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(float16_to_float32(p0[1]) * scale);\n                pp += 2;\n                p0 += B_hstep;\n            }\n        }\n    }\n    for (; jj < max_jj; jj += 1)\n    {\n        const unsigned short* p0 = (const unsigned short*)B + k * B_hstep + (j + jj) * elempack;\n\n#if __ARM_NEON\n#if __aarch64__\n        if (elempack == 8)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                uint16x8_t _p23 = vld1q_u16(p0 + B_hstep * 8);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p23));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p23));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += B_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p01 = vld1q_u16(p0);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p01));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p01));\n\n                
_p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n        }\n#endif // __aarch64__\n        if (elempack == 4)\n        {\n            int kk = 0;\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 4));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 8));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 12));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, _p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += B_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vld1_u16(p0));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vld1_u16(p0 + B_hstep * 4));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n            for (; kk + 3 < max_kk; kk += 4)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp[1] = float2int8(float16_to_float32(p0[1]) * scale);\n                pp[2] = float2int8(float16_to_float32(p0[2]) * scale);\n                pp[3] = 
float2int8(float16_to_float32(p0[3]) * scale);\n                pp += 4;\n                p0 += B_hstep * 4;\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == 1)\n        {\n            int kk = 0;\n#if __ARM_NEON\n            for (; kk + 15 < max_kk; kk += 16)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 7], _p, 7);\n                uint16x8_t _q = uint16x8_t();\n                _q = vsetq_lane_u16(p0[B_hstep * 8], _q, 0);\n                _q = vsetq_lane_u16(p0[B_hstep * 9], _q, 1);\n                _q = vsetq_lane_u16(p0[B_hstep * 10], _q, 2);\n                _q = vsetq_lane_u16(p0[B_hstep * 11], _q, 3);\n                _q = vsetq_lane_u16(p0[B_hstep * 12], _q, 4);\n                _q = vsetq_lane_u16(p0[B_hstep * 13], _q, 5);\n                _q = vsetq_lane_u16(p0[B_hstep * 14], _q, 6);\n                _q = vsetq_lane_u16(p0[B_hstep * 15], _q, 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n                float32x4_t _p2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_q));\n                float32x4_t _p3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_q));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n                _p2 = vmulq_f32(_p2, _scale);\n                _p3 = vmulq_f32(_p3, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n                int8x8_t _r23 = float2int8(_p2, 
_p3);\n\n                vst1q_s8(pp, vcombine_s8(_r01, _r23));\n\n                pp += 16;\n                p0 += B_hstep * 16;\n            }\n            for (; kk + 7 < max_kk; kk += 8)\n            {\n                uint16x8_t _p = uint16x8_t();\n                _p = vsetq_lane_u16(p0[0], _p, 0);\n                _p = vsetq_lane_u16(p0[B_hstep], _p, 1);\n                _p = vsetq_lane_u16(p0[B_hstep * 2], _p, 2);\n                _p = vsetq_lane_u16(p0[B_hstep * 3], _p, 3);\n                _p = vsetq_lane_u16(p0[B_hstep * 4], _p, 4);\n                _p = vsetq_lane_u16(p0[B_hstep * 5], _p, 5);\n                _p = vsetq_lane_u16(p0[B_hstep * 6], _p, 6);\n                _p = vsetq_lane_u16(p0[B_hstep * 7], _p, 7);\n                float32x4_t _p0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_p));\n                float32x4_t _p1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_p));\n\n                _p0 = vmulq_f32(_p0, _scale);\n                _p1 = vmulq_f32(_p1, _scale);\n\n                int8x8_t _r01 = float2int8(_p0, _p1);\n\n                vst1_s8(pp, _r01);\n\n                pp += 8;\n                p0 += B_hstep * 8;\n            }\n#endif // __ARM_NEON\n            for (; kk < max_kk; kk++)\n            {\n                pp[0] = float2int8(float16_to_float32(p0[0]) * scale);\n                pp += 1;\n                p0 += B_hstep;\n            }\n        }\n    }\n}\n\nstatic void unpack_output_tile_int32_to_fp16(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && !__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        unpack_output_tile_int32_to_fp16_asimddp(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    const int out_elempack = top_blob.elempack;\n    const 
size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const size_t c_hstep = C.dims == 3 ? C.cstep : (size_t)C.w;\n    const int c_elempack = C.elempack;\n    const unsigned short* pC = C;\n\n    // NCNN_LOGE(\"unpack_output_tile_int32_to_fp16  %d %d %d %d  %d  %d  %d\", i, max_ii, j, max_jj, out_elempack, broadcast_type_C, c_elempack);\n\n    const int* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        float32x4_t _descale0 = vld1q_f32((const float*)descales + i + ii);\n        float32x4_t _descale1 = vld1q_f32((const float*)descales + i + ii + 4);\n\n        float32x4_t _c0;\n        float32x4_t _c1;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                uint16x8_t _c = vld1q_u16(pC);\n                _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                _c0 = vmulq_n_f32(_c0, beta);\n                _c1 = vmulq_n_f32(_c1, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = 
vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n            int32x4_t _sum8 = vld1q_s32(pp + 32);\n            int32x4_t _sum9 = vld1q_s32(pp + 36);\n            int32x4_t _suma = vld1q_s32(pp + 40);\n            int32x4_t _sumb = vld1q_s32(pp + 44);\n            int32x4_t _sumc = vld1q_s32(pp + 48);\n            int32x4_t _sumd = vld1q_s32(pp + 52);\n            int32x4_t _sume = vld1q_s32(pp + 56);\n            int32x4_t _sumf = vld1q_s32(pp + 60);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e4 f5 g6 h7\n            //      e0 f1 g2 h3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      g4 h5 e6 f7\n            //      g0 h1 e2 f3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      e7 f6 g5 h4\n            //      e3 f2 g1 h0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      g7 h6 e5 f4\n            //      g3 h2 e1 f0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 
h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n            {\n                _sum8 = vrev64q_s32(_sum8);\n                _sum9 = vrev64q_s32(_sum9);\n                _suma = vrev64q_s32(_suma);\n                _sumb = vrev64q_s32(_sumb);\n                _sumc = vrev64q_s32(_sumc);\n                _sumd = vrev64q_s32(_sumd);\n                _sume = vrev64q_s32(_sume);\n                _sumf = vrev64q_s32(_sumf);\n                _sum8 = vextq_s32(_sum8, _sum8, 2);\n                _sum9 = vextq_s32(_sum9, _sum9, 2);\n                _suma = vextq_s32(_suma, _suma, 2);\n                _sumb = vextq_s32(_sumb, _sumb, 2);\n                _sumc = vextq_s32(_sumc, _sumc, 2);\n                _sumd = vextq_s32(_sumd, _sumd, 2);\n                _sume = vextq_s32(_sume, _sume, 2);\n                _sumf = vextq_s32(_sumf, _sumf, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), 
vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                _sum9 = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                _suma = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                _sumb = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                _sumc = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                _sumd = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                _sume = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum9 = vrev64q_s32(_sum9);\n                _sumb = vrev64q_s32(_sumb);\n                _sumd = vrev64q_s32(_sumd);\n                _sumf = vrev64q_s32(_sumf);\n            }\n#endif\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum8), _descale0);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum9), _descale0);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_suma), _descale0);\n            float32x4_t _f7 = 
vmulq_f32(vcvtq_f32_s32(_sumb), _descale0);\n            float32x4_t _f8 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f9 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _fa = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _fb = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n            float32x4_t _fc = vmulq_f32(vcvtq_f32_s32(_sumc), _descale1);\n            float32x4_t _fd = vmulq_f32(vcvtq_f32_s32(_sumd), _descale1);\n            float32x4_t _fe = vmulq_f32(vcvtq_f32_s32(_sume), _descale1);\n            float32x4_t _ff = vmulq_f32(vcvtq_f32_s32(_sumf), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c0);\n                    _fa = vaddq_f32(_fa, _c0);\n                    _fb = vaddq_f32(_fb, _c0);\n                    _fc = vaddq_f32(_fc, _c0);\n                    _fd = vaddq_f32(_fd, _c0);\n                    _fe = vaddq_f32(_fe, _c0);\n                    _ff = vaddq_f32(_ff, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, 
_c0);\n                    _f8 = vaddq_f32(_f8, _c1);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c1);\n                    _fb = vaddq_f32(_fb, _c1);\n                    _fc = vaddq_f32(_fc, _c1);\n                    _fd = vaddq_f32(_fd, _c1);\n                    _fe = vaddq_f32(_fe, _c1);\n                    _ff = vaddq_f32(_ff, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n#if __aarch64__\n                    if (c_elempack == 8)\n                    {\n                        uint16x8_t _c08 = vld1q_u16(pC);\n                        uint16x8_t _c19 = vld1q_u16(pC + 8);\n                        uint16x8_t _c2a = vld1q_u16(pC + 16);\n                        uint16x8_t _c3b = vld1q_u16(pC + 24);\n                        uint16x8_t _c4c = vld1q_u16(pC + 32);\n                        uint16x8_t _c5d = vld1q_u16(pC + 40);\n                        uint16x8_t _c6e = vld1q_u16(pC + 48);\n                        uint16x8_t _c7f = vld1q_u16(pC + 56);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c08));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c19));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c2a));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c3b));\n                        float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c4c));\n                        float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c5d));\n                        float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c6e));\n                        float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c7f));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n              
              _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c08));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c19));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c2a));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c3b));\n                        _c4 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c4c));\n                        _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c5d));\n                        _c6 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c6e));\n                        _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c7f));\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n              
              _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 64;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        uint16x8_t _c45 = vld1q_u16(pC + 16);\n                        uint16x8_t _c67 = vld1q_u16(pC + 24);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                        float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n  
                          _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c45 = vld1q_u16(pC + c_hstep * 4 + 16);\n                        _c67 = vld1q_u16(pC + c_hstep * 4 + 24);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                        _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 
1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                        uint16x8_t _c45 = vld1q_u16(pC + c_hstep * 2);\n                        uint16x8_t _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        float32x4_t 
_c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                        float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 5);\n                        _c45 = vld1q_u16(pC + c_hstep * 6);\n                        _c67 = vld1q_u16(pC + c_hstep * 7);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n      
                  _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                        _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 8;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _cc = vld1q_u16(pC);\n                    float32x4_t _cc0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_cc));\n                    float32x4_t _cc1 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_cc));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c2);\n                    _fb = vaddq_f32(_fb, _c3);\n                    _fc = vaddq_f32(_fc, _c4);\n                    _fd = vaddq_f32(_fd, _c5);\n                    _fe = vaddq_f32(_fe, _c6);\n                    _ff = vaddq_f32(_ff, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = 
vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n                _f8 = vmulq_f32(_f8, _alpha);\n                _f9 = vmulq_f32(_f9, _alpha);\n                _fa = vmulq_f32(_fa, _alpha);\n                _fb = vmulq_f32(_fb, _alpha);\n                _fc = vmulq_f32(_fc, _alpha);\n                _fd = vmulq_f32(_fd, _alpha);\n                _fe = vmulq_f32(_fe, _alpha);\n                _ff = vmulq_f32(_ff, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n            uint16x4_t _hf4 = (uint16x4_t)vcvt_f16_f32(_f4);\n            uint16x4_t _hf5 = (uint16x4_t)vcvt_f16_f32(_f5);\n            uint16x4_t _hf6 = (uint16x4_t)vcvt_f16_f32(_f6);\n            uint16x4_t _hf7 = (uint16x4_t)vcvt_f16_f32(_f7);\n            uint16x4_t _hf8 = (uint16x4_t)vcvt_f16_f32(_f8);\n            uint16x4_t _hf9 = (uint16x4_t)vcvt_f16_f32(_f9);\n            uint16x4_t _hfa = (uint16x4_t)vcvt_f16_f32(_fa);\n            uint16x4_t _hfb = (uint16x4_t)vcvt_f16_f32(_fb);\n            uint16x4_t _hfc = (uint16x4_t)vcvt_f16_f32(_fc);\n            uint16x4_t _hfd = (uint16x4_t)vcvt_f16_f32(_fd);\n            uint16x4_t _hfe = (uint16x4_t)vcvt_f16_f32(_fe);\n            uint16x4_t _hff = (uint16x4_t)vcvt_f16_f32(_ff);\n\n#if __aarch64__\n            if (out_elempack == 8)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf8));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf1, _hf9));\n                vst1q_u16(p0 + 16, vcombine_u16(_hf2, _hfa));\n                vst1q_u16(p0 + 24, vcombine_u16(_hf3, _hfb));\n                vst1q_u16(p0 + 32, vcombine_u16(_hf4, _hfc));\n                vst1q_u16(p0 + 40, vcombine_u16(_hf5, _hfd));\n                vst1q_u16(p0 + 48, vcombine_u16(_hf6, 
_hfe));\n                vst1q_u16(p0 + 56, vcombine_u16(_hf7, _hff));\n                p0 += 64;\n            }\n#endif // __aarch64__\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf2, _hf3));\n                vst1q_u16(p0 + 16, vcombine_u16(_hf4, _hf5));\n                vst1q_u16(p0 + 24, vcombine_u16(_hf6, _hf7));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_hf8, _hf9));\n                vst1q_u16(p0 + out_hstep * 4 + 8, vcombine_u16(_hfa, _hfb));\n                vst1q_u16(p0 + out_hstep * 4 + 16, vcombine_u16(_hfc, _hfd));\n                vst1q_u16(p0 + out_hstep * 4 + 24, vcombine_u16(_hfe, _hff));\n                p0 += 32;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_hf0, _hf1, _hf2, _hf3);\n                transpose4x4_u16(_hf4, _hf5, _hf6, _hf7);\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf4));\n                vst1q_u16(p0 + out_hstep, vcombine_u16(_hf1, _hf5));\n                vst1q_u16(p0 + out_hstep * 2, vcombine_u16(_hf2, _hf6));\n                vst1q_u16(p0 + out_hstep * 3, vcombine_u16(_hf3, _hf7));\n                transpose4x4_u16(_hf8, _hf9, _hfa, _hfb);\n                transpose4x4_u16(_hfc, _hfd, _hfe, _hff);\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_hf8, _hfc));\n                vst1q_u16(p0 + out_hstep * 5, vcombine_u16(_hf9, _hfd));\n                vst1q_u16(p0 + out_hstep * 6, vcombine_u16(_hfa, _hfe));\n                vst1q_u16(p0 + out_hstep * 7, vcombine_u16(_hfb, _hff));\n                p0 += 8;\n            }\n\n            pp += 64;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n     
       int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e0 f1 g2 h3\n            //      c0 d1 a2 b3\n            //      g0 h1 e2 f3\n            //      a3 b2 c1 d0\n            //      e3 f2 g1 h0\n            //      c3 d2 a1 b0\n            //      g3 h2 e1 f0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), 
vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, 
_c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c1);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c1);\n                    _f7 = vaddq_f32(_f7, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n#if __aarch64__\n                    if (c_elempack == 8)\n                    {\n                        uint16x8_t _c04 = vld1q_u16(pC);\n                        uint16x8_t _c15 = vld1q_u16(pC + 8);\n                        uint16x8_t _c26 = vld1q_u16(pC + 16);\n                        uint16x8_t _c37 = vld1q_u16(pC + 24);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c04));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c15));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c26));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c37));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c04));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c15));\n                        _c2 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_c26));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c37));\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 32;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = 
vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = vcvt_f32_f16((float16x4_t)_cc0);\n                        _c1 = vcvt_f32_f16((float16x4_t)_cc1);\n                        float32x4_t _c2 = 
vcvt_f32_f16((float16x4_t)_cc2);\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _cc0 = vld1_u16(pC + c_hstep * 4);\n                        _cc1 = vld1_u16(pC + c_hstep * 5);\n                        _cc2 = vld1_u16(pC + c_hstep * 6);\n                        _cc3 = vld1_u16(pC + c_hstep * 7);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = vcvt_f32_f16((float16x4_t)_cc0);\n                        _c1 = vcvt_f32_f16((float16x4_t)_cc1);\n                        _c2 = vcvt_f32_f16((float16x4_t)_cc2);\n                        _c3 = vcvt_f32_f16((float16x4_t)_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 
= vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 4;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c2);\n                    _f7 = vaddq_f32(_f7, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = 
(uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n            uint16x4_t _hf4 = (uint16x4_t)vcvt_f16_f32(_f4);\n            uint16x4_t _hf5 = (uint16x4_t)vcvt_f16_f32(_f5);\n            uint16x4_t _hf6 = (uint16x4_t)vcvt_f16_f32(_f6);\n            uint16x4_t _hf7 = (uint16x4_t)vcvt_f16_f32(_f7);\n\n#if __aarch64__\n            if (out_elempack == 8)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf4));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf1, _hf5));\n                vst1q_u16(p0 + 16, vcombine_u16(_hf2, _hf6));\n                vst1q_u16(p0 + 24, vcombine_u16(_hf3, _hf7));\n                p0 += 32;\n            }\n#endif // __aarch64__\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf2, _hf3));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_hf4, _hf5));\n                vst1q_u16(p0 + out_hstep * 4 + 8, vcombine_u16(_hf6, _hf7));\n                p0 += 16;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_hf0, _hf1, _hf2, _hf3);\n                transpose4x4_u16(_hf4, _hf5, _hf6, _hf7);\n                vst1_u16(p0, _hf0);\n                vst1_u16(p0 + out_hstep, _hf1);\n                vst1_u16(p0 + out_hstep * 2, _hf2);\n                vst1_u16(p0 + out_hstep * 3, _hf3);\n                vst1_u16(p0 + out_hstep * 4, _hf4);\n                vst1_u16(p0 + out_hstep * 5, _hf5);\n                vst1_u16(p0 + out_hstep * 6, _hf6);\n                vst1_u16(p0 + out_hstep * 7, _hf7);\n                p0 += 4;\n            }\n\n            pp += 32;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n 
           int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      e0 f1 g0 h1\n            //      a1 b0 c1 d0\n            //      e1 f0 g1 h0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale1);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                   
 _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n#if __aarch64__\n                    if (c_elempack == 8)\n                    {\n                        uint16x8_t _cc0 = vld1q_u16(pC);\n                        uint16x8_t _cc1 = vld1q_u16(pC + 8);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_cc0));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_cc1));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_high_u16(_cc0));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_cc1));\n                        pC += 16;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + c_hstep * 4);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], _c01, 1);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[1], _c01, 4);\n                        _c01 = 
vsetq_lane_u16(pC[c_hstep + 1], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c01, 6);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c01, 7);\n                        uint16x8_t _c23 = uint16x8_t();\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 4], _c23, 0);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 5], _c23, 1);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 6], _c23, 2);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 7], _c23, 3);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 4 + 1], _c23, 4);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 5 + 1], _c23, 5);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 6 + 1], _c23, 6);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 7 + 1], _c23, 7);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        pC += 2;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                   
 _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    _c1 = vdupq_n_f32(float16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n\n#if __aarch64__\n            if (out_elempack == 8)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf2));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf1, _hf3));\n                p0 += 16;\n            }\n#endif // __aarch64__\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_hf2, _hf3));\n                p0 += 8;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_hf0, 0);\n                p0[1] = vget_lane_u16(_hf1, 0);\n                p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_hf1, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_hf1, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_hf1, 3);\n                p0[out_hstep * 4] = vget_lane_u16(_hf2, 
0);\n                p0[out_hstep * 4 + 1] = vget_lane_u16(_hf3, 0);\n                p0[out_hstep * 5] = vget_lane_u16(_hf2, 1);\n                p0[out_hstep * 5 + 1] = vget_lane_u16(_hf3, 1);\n                p0[out_hstep * 6] = vget_lane_u16(_hf2, 2);\n                p0[out_hstep * 6 + 1] = vget_lane_u16(_hf3, 2);\n                p0[out_hstep * 7] = vget_lane_u16(_hf2, 3);\n                p0[out_hstep * 7 + 1] = vget_lane_u16(_hf3, 3);\n                p0 += 2;\n            }\n\n            pp += 16;\n        }\n        for (; jj < max_jj; jj++)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n#if __aarch64__\n                    if (c_elempack == 8)\n                    {\n                        uint16x8_t _c = vld1q_u16(pC);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                        pC += 8;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                        _c1 = vcvt_f32_f16((float16x4_t)vld1_u16(pC + c_hstep * 4));\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n      
              {\n                        uint16x8_t _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], _c01, 1);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 4], _c01, 4);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 5], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 6], _c01, 6);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 7], _c01, 7);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        pC += 1;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n\n#if __aarch64__\n            if 
(out_elempack == 8)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                p0 += 8;\n            }\n#endif // __aarch64__\n            if (out_elempack == 4)\n            {\n                vst1_u16(p0, _hf0);\n                vst1_u16(p0 + out_hstep * 4, _hf1);\n                p0 += 4;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_hf0, 0);\n                p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                p0[out_hstep * 4] = vget_lane_u16(_hf1, 0);\n                p0[out_hstep * 5] = vget_lane_u16(_hf1, 1);\n                p0[out_hstep * 6] = vget_lane_u16(_hf1, 2);\n                p0[out_hstep * 7] = vget_lane_u16(_hf1, 3);\n                p0++;\n            }\n\n            pp += 8;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j * out_elempack;\n\n        float32x4_t _descale = vld1q_f32((const float*)descales + i + ii);\n\n        float32x4_t _c0;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                _c0 = vmulq_n_f32(_c0, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n     
   {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = 
vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = 
vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c01;\n                    uint16x8_t _c23;\n                    uint16x8_t _c45;\n                    uint16x8_t _c67;\n                    if (c_elempack == 4)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + 8);\n                        _c45 = vld1q_u16(pC + 16);\n                        _c67 = vld1q_u16(pC + 24);\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + c_hstep);\n                        _c45 = vld1q_u16(pC + c_hstep * 2);\n                        _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        pC += 8;\n                    }\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                    float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                    float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                    float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                    float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n      
              float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                    float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                        _f4 = vaddq_f32(_f4, _c4);\n                        _f5 = vaddq_f32(_f5, _c5);\n                        _f6 = vaddq_f32(_f6, _c6);\n                        _f7 = vaddq_f32(_f7, _c7);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        _f4 = vmlaq_f32(_f4, _c4, _beta);\n                        _f5 = vmlaq_f32(_f5, _c5, _beta);\n                        _f6 = vmlaq_f32(_f6, _c6, _beta);\n                        _f7 = vmlaq_f32(_f7, _c7, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    float32x4_t _cc0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                    float32x4_t _cc1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = 
vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n            uint16x4_t _hf4 = (uint16x4_t)vcvt_f16_f32(_f4);\n            uint16x4_t _hf5 = (uint16x4_t)vcvt_f16_f32(_f5);\n            uint16x4_t _hf6 = (uint16x4_t)vcvt_f16_f32(_f6);\n            uint16x4_t _hf7 = (uint16x4_t)vcvt_f16_f32(_f7);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf2, _hf3));\n                vst1q_u16(p0 + 
16, vcombine_u16(_hf4, _hf5));\n                vst1q_u16(p0 + 24, vcombine_u16(_hf6, _hf7));\n                p0 += 32;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_hf0, _hf1, _hf2, _hf3);\n                transpose4x4_u16(_hf4, _hf5, _hf6, _hf7);\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf4));\n                vst1q_u16(p0 + out_hstep, vcombine_u16(_hf1, _hf5));\n                vst1q_u16(p0 + out_hstep * 2, vcombine_u16(_hf2, _hf6));\n                vst1q_u16(p0 + out_hstep * 3, vcombine_u16(_hf3, _hf7));\n                p0 += 8;\n            }\n\n            pp += 32;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      c0 d1 a2 b3\n            //      a3 b2 c1 d0\n            //      c3 d2 a1 b0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum2 = vextq_s32(_sum2, _sum2, 2);\n                _sum3 = vextq_s32(_sum3, _sum3, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = 
vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                 
   {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep * 1);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = vcvt_f32_f16((float16x4_t)_cc0);\n                        _c1 = vcvt_f32_f16((float16x4_t)_cc1);\n                        _c2 = vcvt_f32_f16((float16x4_t)_cc2);\n                        _c3 = vcvt_f32_f16((float16x4_t)_cc3);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    float32x4_t _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = 
vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf2, _hf3));\n                p0 += 16;\n            }\n            if (out_elempack == 1)\n            {\n                transpose4x4_u16(_hf0, _hf1, _hf2, _hf3);\n                vst1_u16(p0, _hf0);\n                vst1_u16(p0 + out_hstep, _hf1);\n                vst1_u16(p0 + out_hstep * 2, _hf2);\n                vst1_u16(p0 + out_hstep * 3, _hf3);\n                p0 += 4;\n            }\n\n            pp += 16;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      a1 b0 c1 d0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            {\n   
             _sum1 = vrev64q_s32(_sum1);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c;\n                    if (c_elempack == 4)\n                    {\n                        _c = vld1q_u16(pC);\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x8_t();\n                        _c = vsetq_lane_u16(pC[0], _c, 0);\n                        _c = vsetq_lane_u16(pC[c_hstep], _c, 1);\n                        _c = vsetq_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3], _c, 3);\n                        _c = vsetq_lane_u16(pC[1], _c, 4);\n                        _c = vsetq_lane_u16(pC[c_hstep + 1], _c, 5);\n                        _c = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c, 6);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c, 7);\n                        pC += 2;\n                    }\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n    
                float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    float32x4_t _c1 = vdupq_n_f32(float16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                p0 += 8;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_hf0, 0);\n                p0[1] = vget_lane_u16(_hf1, 0);\n                p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_hf1, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_hf1, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_hf1, 3);\n                p0 += 2;\n            }\n\n            pp += 
8;\n        }\n        for (; jj < max_jj; jj++)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x4_t _c;\n                    if (c_elempack == 4)\n                    {\n                        _c = vld1_u16(pC);\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x4_t();\n                        _c = vset_lane_u16(pC[0], _c, 0);\n                        _c = vset_lane_u16(pC[c_hstep], _c, 1);\n                        _c = vset_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vset_lane_u16(pC[c_hstep * 3], _c, 3);\n                        pC += 1;\n                    }\n                    _c0 = vcvt_f32_f16((float16x4_t)_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    pC += 1;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n\n            if (out_elempack == 4)\n            {\n                vst1_u16(p0, _hf0);\n                p0 += 4;\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_hf0, 0);\n                p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_hf0, 
2);\n                p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                p0++;\n            }\n\n            pp += 4;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        // out_elempack == 1\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const float descale0 = descales[i + ii];\n        const float descale1 = descales[i + ii + 1];\n#if __ARM_NEON\n        float32x2_t _descale = vld1_f32((const float*)descales + i + ii);\n#endif\n\n        float c0;\n        float c1;\n#if __ARM_NEON\n        float32x4_t _c0;\n        float32x4_t _c1;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = float16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                c0 = float16_to_float32(pC[0]) * beta;\n                c1 = float16_to_float32(pC[1]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n                _c1 = vdupq_n_f32(c1);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale, 0);\n            float32x4_t _f1 = 
vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale, 0);\n            float32x4_t _f2 = vmulq_lane_f32(vcvtq_f32_s32(_sum2), _descale, 1);\n            float32x4_t _f3 = vmulq_lane_f32(vcvtq_f32_s32(_sum3), _descale, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                    float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                    float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = 
vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 8;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _c0 = vmulq_f32(_c0, _beta);\n                        _c1 = vmulq_f32(_c1, _beta);\n                    }\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f0), (uint16x4_t)vcvt_f16_f32(_f1)));\n            vst1q_u16(p0 + out_hstep, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f2), (uint16x4_t)vcvt_f16_f32(_f3)));\n\n            pp += 16;\n            p0 += 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = 
vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vld1_u16(pC + c_hstep));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 4;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _c0 = vmulq_n_f32(_c0, beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1_u16(p0, (uint16x4_t)vcvt_f16_f32(_f0));\n            vst1_u16(p0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_f1));\n\n            pp += 8;\n            p0 += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n\n            float32x2x2_t _descale01 = vzip_f32(_descale, _descale);\n            float32x4_t _descale0011 = vcombine_f32(_descale01.val[0], 
_descale01.val[1]);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0011);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    float32x4_t _c0011 = vcombine_f32(vget_low_f32(_c0), vget_high_f32(_c1));\n                    _f0 = vaddq_f32(_f0, _c0011);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[1], _c, 1);\n                    _c = vset_lane_u16(pC[c_hstep], _c, 2);\n                    _c = vset_lane_u16(pC[c_hstep + 1], _c, 3);\n                    _c0 = vcvt_f32_f16((float16x4_t)_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[1], _c, 1);\n                    _c = vset_lane_u16(pC[0], _c, 2);\n                    _c = vset_lane_u16(pC[1], _c, 3);\n                    _c0 = vcvt_f32_f16((float16x4_t)_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n\n            p0[0] = vget_lane_u16(_hf0, 0);\n            p0[1] = vget_lane_u16(_hf0, 1);\n            p0[out_hstep] = vget_lane_u16(_hf0, 2);\n            p0[out_hstep + 1] = vget_lane_u16(_hf0, 3);\n\n            pp += 4;\n            p0 += 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < 
max_jj; jj++)\n        {\n            float f0 = pp[0] * descale0;\n            float f1 = pp[1] * descale1;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    f0 += c0;\n                    f1 += c0;\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                    f1 += c1;\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    f0 += float16_to_float32(pC[0]) * beta;\n                    f1 += float16_to_float32(pC[c_hstep]) * beta;\n                    pC += 1;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    f0 += float16_to_float32(pC[0]) * beta;\n                    f1 += float16_to_float32(pC[0]) * beta;\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                f0 *= alpha;\n                f1 *= alpha;\n            }\n\n            p0[0] = float32_to_float16(f0);\n            p0[out_hstep] = float32_to_float16(f1);\n\n            pp += 2;\n            p0++;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        // out_elempack == 1\n        unsigned short* p0 = (unsigned short*)top_blob + (i + ii) * out_hstep + j;\n\n        const float descale = descales[i + ii];\n#if __ARM_NEON\n        float32x4_t _descale = vdupq_n_f32(descale);\n#endif\n\n        float c0;\n#if __ARM_NEON\n        float32x4_t _c0;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = float16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                
c0 = float16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n        for (; jj + 15 < max_jj; jj += 16)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + 8);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                    float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                    float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                    if (beta == 1.f)\n      
              {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 16;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f0), (uint16x4_t)vcvt_f16_f32(_f1)));\n            vst1q_u16(p0 + 8, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f2), (uint16x4_t)vcvt_f16_f32(_f3)));\n\n            pp += 16;\n            p0 += 16;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = 
vld1q_u16(pC);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f0), (uint16x4_t)vcvt_f16_f32(_f1)));\n\n            pp += 8;\n            p0 += 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 4;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1_u16(p0, (uint16x4_t)vcvt_f16_f32(_f0));\n\n            pp += 4;\n            p0 += 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x2_t _f0 = 
vmul_f32(vcvt_f32_s32(vld1_s32(pp)), vget_low_f32(_descale));\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vadd_f32(_f0, vget_low_f32(_c0));\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    float32x2_t _cc = float32x2_t();\n                    _cc = vset_lane_f32(float16_to_float32(pC[0]), _cc, 0);\n                    _cc = vset_lane_f32(float16_to_float32(pC[1]), _cc, 1);\n                    _f0 = vmla_n_f32(_f0, _cc, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmul_n_f32(_f0, alpha);\n\n            p0[0] = float32_to_float16(vget_lane_f32(_f0, 0));\n            p0[1] = float32_to_float16(vget_lane_f32(_f0, 1));\n\n            pp += 2;\n            p0 += 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj++)\n        {\n            float f0 = pp[0] * descale;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    f0 += float16_to_float32(pC[0]) * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n\n            p0[0] = float32_to_float16(f0);\n\n            pp += 1;\n            p0++;\n        }\n    }\n}\n\nstatic void transpose_unpack_output_tile_int32_to_fp16(const Mat& topT, const Mat& C, Mat& top_blob, int broadcast_type_C, int i, int max_ii, int j, int max_jj, const Mat& descales, float alpha, float beta)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD && 
!__ARM_FEATURE_MATMUL_INT8\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        transpose_unpack_output_tile_int32_to_fp16_asimddp(topT, C, top_blob, broadcast_type_C, i, max_ii, j, max_jj, descales, alpha, beta);\n        return;\n    }\n#endif\n\n    const int out_elempack = top_blob.elempack;\n    const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n    const size_t c_hstep = C.dims == 3 ? C.cstep : (size_t)C.w;\n    const int c_elempack = C.elempack;\n    const unsigned short* pC = C;\n\n    // NCNN_LOGE(\"transpose_unpack_output_tile_int32_to_fp16  %d %d %d %d  %d  %d  %d\", i, max_ii, j, max_jj, out_elempack, broadcast_type_C, c_elempack);\n\n    const int* pp = topT;\n\n    int ii = 0;\n#if __ARM_NEON\n    for (; ii + 7 < max_ii; ii += 8)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        float32x4_t _descale0 = vld1q_f32((const float*)descales + i + ii);\n        float32x4_t _descale1 = vld1q_f32((const float*)descales + i + ii + 4);\n\n        float32x4_t _c0;\n        float32x4_t _c1;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                uint16x8_t _c = vld1q_u16(pC);\n                _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                _c0 = vmulq_n_f32(_c0, beta);\n                _c1 = vmulq_n_f32(_c1, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        
}\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n            int32x4_t _sum8 = vld1q_s32(pp + 32);\n            int32x4_t _sum9 = vld1q_s32(pp + 36);\n            int32x4_t _suma = vld1q_s32(pp + 40);\n            int32x4_t _sumb = vld1q_s32(pp + 44);\n            int32x4_t _sumc = vld1q_s32(pp + 48);\n            int32x4_t _sumd = vld1q_s32(pp + 52);\n            int32x4_t _sume = vld1q_s32(pp + 56);\n            int32x4_t _sumf = vld1q_s32(pp + 60);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      e4 f5 g6 h7\n            //      e0 f1 g2 h3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      g4 h5 e6 f7\n            //      g0 h1 e2 f3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      e7 f6 g5 h4\n            //      e3 f2 g1 h0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      g7 h6 e5 f4\n            //      g3 
h2 e1 f0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            //      e4 f4 g4 h4\n            //      e5 f5 g5 h5\n            //      e6 f6 g6 h6\n            //      e7 f7 g7 h7\n            {\n                _sum8 = vrev64q_s32(_sum8);\n                _sum9 = vrev64q_s32(_sum9);\n                _suma = vrev64q_s32(_suma);\n                _sumb = vrev64q_s32(_sumb);\n                _sumc = vrev64q_s32(_sumc);\n                _sumd = vrev64q_s32(_sumd);\n                _sume = vrev64q_s32(_sume);\n                _sumf = vrev64q_s32(_sumf);\n                _sum8 = vextq_s32(_sum8, _sum8, 2);\n                _sum9 = vextq_s32(_sum9, _sum9, 2);\n                _suma = vextq_s32(_suma, _suma, 2);\n                _sumb = vextq_s32(_sumb, _sumb, 2);\n                _sumc = vextq_s32(_sumc, _sumc, 2);\n                _sumd = vextq_s32(_sumd, _sumd, 2);\n                _sume = vextq_s32(_sume, _sume, 2);\n                _sumf = vextq_s32(_sumf, _sumf, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sumc);\n                int32x4x2_t _t1 = vzipq_s32(_sum4, _sum8);\n                int32x4x2_t _t2 = vzipq_s32(_sum2, _sume);\n                int32x4x2_t _t3 = vzipq_s32(_sum6, _suma);\n                int32x4x2_t _t4 = vzipq_s32(_sum3, _sumf);\n                int32x4x2_t _t5 = vzipq_s32(_sum7, _sumb);\n                int32x4x2_t _t6 = vzipq_s32(_sum1, _sumd);\n                int32x4x2_t _t7 = vzipq_s32(_sum5, _sum9);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = 
vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum8 = vcombine_s32(vget_low_s32(_t4.val[0]), vget_low_s32(_t5.val[0]));\n                _sum9 = vcombine_s32(vget_high_s32(_t4.val[0]), vget_high_s32(_t5.val[0]));\n                _suma = vcombine_s32(vget_low_s32(_t5.val[1]), vget_low_s32(_t4.val[1]));\n                _sumb = vcombine_s32(vget_high_s32(_t5.val[1]), vget_high_s32(_t4.val[1]));\n                _sumc = vcombine_s32(vget_low_s32(_t6.val[0]), vget_low_s32(_t7.val[0]));\n                _sumd = vcombine_s32(vget_high_s32(_t6.val[0]), vget_high_s32(_t7.val[0]));\n                _sume = vcombine_s32(vget_low_s32(_t7.val[1]), vget_low_s32(_t6.val[1]));\n                _sumf = vcombine_s32(vget_high_s32(_t7.val[1]), vget_high_s32(_t6.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum9 = vrev64q_s32(_sum9);\n                _sumb = vrev64q_s32(_sumb);\n                _sumd = vrev64q_s32(_sumd);\n                _sumf = vrev64q_s32(_sumf);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n        
    float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum8), _descale0);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum9), _descale0);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_suma), _descale0);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sumb), _descale0);\n            float32x4_t _f8 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f9 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _fa = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _fb = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n            float32x4_t _fc = vmulq_f32(vcvtq_f32_s32(_sumc), _descale1);\n            float32x4_t _fd = vmulq_f32(vcvtq_f32_s32(_sumd), _descale1);\n            float32x4_t _fe = vmulq_f32(vcvtq_f32_s32(_sume), _descale1);\n            float32x4_t _ff = vmulq_f32(vcvtq_f32_s32(_sumf), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c0);\n                    _fa = vaddq_f32(_fa, _c0);\n                    _fb = vaddq_f32(_fb, _c0);\n                    _fc = vaddq_f32(_fc, _c0);\n                    _fd = vaddq_f32(_fd, _c0);\n                    _fe = vaddq_f32(_fe, _c0);\n                    _ff = vaddq_f32(_ff, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n         
           _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                    _f8 = vaddq_f32(_f8, _c1);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c1);\n                    _fb = vaddq_f32(_fb, _c1);\n                    _fc = vaddq_f32(_fc, _c1);\n                    _fd = vaddq_f32(_fd, _c1);\n                    _fe = vaddq_f32(_fe, _c1);\n                    _ff = vaddq_f32(_ff, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n#if __aarch64__\n                    if (c_elempack == 8)\n                    {\n                        uint16x8_t _c08 = vld1q_u16(pC);\n                        uint16x8_t _c19 = vld1q_u16(pC + 8);\n                        uint16x8_t _c2a = vld1q_u16(pC + 16);\n                        uint16x8_t _c3b = vld1q_u16(pC + 24);\n                        uint16x8_t _c4c = vld1q_u16(pC + 32);\n                        uint16x8_t _c5d = vld1q_u16(pC + 40);\n                        uint16x8_t _c6e = vld1q_u16(pC + 48);\n                        uint16x8_t _c7f = vld1q_u16(pC + 56);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c08));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c19));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c2a));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c3b));\n                        float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c4c));\n                        float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c5d));\n                        float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c6e));\n                     
   float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c7f));\n                        float32x4_t _c8 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c08));\n                        float32x4_t _c9 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c19));\n                        float32x4_t _ca = vcvt_f32_f16((float16x4_t)vget_high_u16(_c2a));\n                        float32x4_t _cb = vcvt_f32_f16((float16x4_t)vget_high_u16(_c3b));\n                        float32x4_t _cc = vcvt_f32_f16((float16x4_t)vget_high_u16(_c4c));\n                        float32x4_t _cd = vcvt_f32_f16((float16x4_t)vget_high_u16(_c5d));\n                        float32x4_t _ce = vcvt_f32_f16((float16x4_t)vget_high_u16(_c6e));\n                        float32x4_t _cf = vcvt_f32_f16((float16x4_t)vget_high_u16(_c7f));\n\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                            _f8 = vaddq_f32(_f8, _c8);\n                            _f9 = vaddq_f32(_f9, _c9);\n                            _fa = vaddq_f32(_fa, _ca);\n                            _fb = vaddq_f32(_fb, _cb);\n                            _fc = vaddq_f32(_fc, _cc);\n                            _fd = vaddq_f32(_fd, _cd);\n                            _fe = vaddq_f32(_fe, _ce);\n                            _ff = vaddq_f32(_ff, _cf);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                     
       _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                            _f8 = vmlaq_f32(_f8, _c8, _beta);\n                            _f9 = vmlaq_f32(_f9, _c9, _beta);\n                            _fa = vmlaq_f32(_fa, _ca, _beta);\n                            _fb = vmlaq_f32(_fb, _cb, _beta);\n                            _fc = vmlaq_f32(_fc, _cc, _beta);\n                            _fd = vmlaq_f32(_fd, _cd, _beta);\n                            _fe = vmlaq_f32(_fe, _ce, _beta);\n                            _ff = vmlaq_f32(_ff, _cf, _beta);\n                        }\n                        pC += 64;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        uint16x8_t _c45 = vld1q_u16(pC + 16);\n                        uint16x8_t _c67 = vld1q_u16(pC + 24);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                        float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        float32x4_t _c7 = 
vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c45 = vld1q_u16(pC + c_hstep * 4 + 16);\n                        _c67 = vld1q_u16(pC + c_hstep * 4 + 24);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                      
  _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                        uint16x8_t _c45 = vld1q_u16(pC + c_hstep * 2);\n                        uint16x8_t _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        float32x4_t _c2 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                        float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 5);\n                        _c45 = vld1q_u16(pC + c_hstep * 6);\n                        _c67 = vld1q_u16(pC + c_hstep * 7);\n                        transpose8x4_u16(_c01, _c23, 
_c45, _c67);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                        _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                        _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                        _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                        if (beta == 1.f)\n                        {\n                            _f8 = vaddq_f32(_f8, _c0);\n                            _f9 = vaddq_f32(_f9, _c1);\n                            _fa = vaddq_f32(_fa, _c2);\n                            _fb = vaddq_f32(_fb, _c3);\n                            _fc = vaddq_f32(_fc, _c4);\n                            _fd = vaddq_f32(_fd, _c5);\n                            _fe = vaddq_f32(_fe, _c6);\n                            _ff = vaddq_f32(_ff, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f8 = vmlaq_f32(_f8, _c0, _beta);\n                            _f9 = vmlaq_f32(_f9, _c1, _beta);\n                            _fa = vmlaq_f32(_fa, _c2, _beta);\n                            _fb = vmlaq_f32(_fb, _c3, _beta);\n                            _fc = vmlaq_f32(_fc, _c4, _beta);\n                            _fd = vmlaq_f32(_fd, _c5, _beta);\n                            _fe = vmlaq_f32(_fe, _c6, _beta);\n                            _ff = vmlaq_f32(_ff, _c7, _beta);\n                        }\n                        pC += 8;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                   
 uint16x8_t _c = vld1q_u16(pC);\n                    float32x4_t _cc0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                    float32x4_t _cc1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    _f8 = vaddq_f32(_f8, _c0);\n                    _f9 = vaddq_f32(_f9, _c1);\n                    _fa = vaddq_f32(_fa, _c2);\n                    _fb = vaddq_f32(_fb, _c3);\n                    _fc = vaddq_f32(_fc, _c4);\n                    _fd = vaddq_f32(_fd, _c5);\n                    _fe = vaddq_f32(_fe, _c6);\n                    _ff = vaddq_f32(_ff, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n        
        _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n                _f8 = vmulq_f32(_f8, _alpha);\n                _f9 = vmulq_f32(_f9, _alpha);\n                _fa = vmulq_f32(_fa, _alpha);\n                _fb = vmulq_f32(_fb, _alpha);\n                _fc = vmulq_f32(_fc, _alpha);\n                _fd = vmulq_f32(_fd, _alpha);\n                _fe = vmulq_f32(_fe, _alpha);\n                _ff = vmulq_f32(_ff, _alpha);\n            }\n\n            uint16x8_t _hf0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f0), (uint16x4_t)vcvt_f16_f32(_f8));\n            uint16x8_t _hf1 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f1), (uint16x4_t)vcvt_f16_f32(_f9));\n            uint16x8_t _hf2 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f2), (uint16x4_t)vcvt_f16_f32(_fa));\n            uint16x8_t _hf3 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f3), (uint16x4_t)vcvt_f16_f32(_fb));\n            uint16x8_t _hf4 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f4), (uint16x4_t)vcvt_f16_f32(_fc));\n            uint16x8_t _hf5 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f5), (uint16x4_t)vcvt_f16_f32(_fd));\n            uint16x8_t _hf6 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f6), (uint16x4_t)vcvt_f16_f32(_fe));\n            uint16x8_t _hf7 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f7), (uint16x4_t)vcvt_f16_f32(_ff));\n\n#if __aarch64__\n            if (out_elempack == 8)\n            {\n                transpose8x8_u16(_hf0, _hf1, _hf2, _hf3, _hf4, _hf5, _hf6, _hf7);\n                vst1q_u16(p0, _hf0);\n                vst1q_u16(p0 + 8, _hf1);\n                vst1q_u16(p0 + 16, _hf2);\n                vst1q_u16(p0 + 24, _hf3);\n                vst1q_u16(p0 + 32, _hf4);\n                vst1q_u16(p0 + 40, _hf5);\n                vst1q_u16(p0 + 48, _hf6);\n                vst1q_u16(p0 
+ 56, _hf7);\n            }\n#endif // __aarch64__\n            if (out_elempack == 4)\n            {\n                uint16x8x4_t _hfa;\n                uint16x8x4_t _hfb;\n                _hfa.val[0] = _hf0;\n                _hfa.val[1] = _hf1;\n                _hfa.val[2] = _hf2;\n                _hfa.val[3] = _hf3;\n                _hfb.val[0] = _hf4;\n                _hfb.val[1] = _hf5;\n                _hfb.val[2] = _hf6;\n                _hfb.val[3] = _hf7;\n                vst4q_u16(p0, _hfa);\n                vst4q_u16(p0 + out_hstep * 4, _hfb);\n            }\n            if (out_elempack == 1)\n            {\n                vst1q_u16(p0, _hf0);\n                vst1q_u16(p0 + out_hstep, _hf1);\n                vst1q_u16(p0 + out_hstep * 2, _hf2);\n                vst1q_u16(p0 + out_hstep * 3, _hf3);\n                vst1q_u16(p0 + out_hstep * 4, _hf4);\n                vst1q_u16(p0 + out_hstep * 5, _hf5);\n                vst1q_u16(p0 + out_hstep * 6, _hf6);\n                vst1q_u16(p0 + out_hstep * 7, _hf7);\n            }\n\n            pp += 64;\n            p0 += out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n\n#else\n            // from\n            //      a0 b1 c2 d3\n            //   
   e0 f1 g2 h3\n            //      c0 d1 a2 b3\n            //      g0 h1 e2 f3\n            //      a3 b2 c1 d0\n            //      e3 f2 g1 h0\n            //      c3 d2 a1 b0\n            //      g3 h2 e1 f0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            //      e2 f2 g2 h2\n            //      e3 f3 g3 h3\n\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = 
vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale0);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale0);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale1);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale1);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale1);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c1);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c1);\n                    _f7 = vaddq_f32(_f7, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n#if __aarch64__\n                    if (c_elempack == 8)\n                    {\n                        uint16x8_t _c04 = vld1q_u16(pC);\n                        uint16x8_t _c15 = vld1q_u16(pC + 8);\n                        
uint16x8_t _c26 = vld1q_u16(pC + 16);\n                        uint16x8_t _c37 = vld1q_u16(pC + 24);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c04));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c15));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c26));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c37));\n                        float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c04));\n                        float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c15));\n                        float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c26));\n                        float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c37));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                            _f4 = vaddq_f32(_f4, _c4);\n                            _f5 = vaddq_f32(_f5, _c5);\n                            _f6 = vaddq_f32(_f6, _c6);\n                            _f7 = vaddq_f32(_f7, _c7);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                            _f4 = vmlaq_f32(_f4, _c4, _beta);\n                            _f5 = vmlaq_f32(_f5, _c5, _beta);\n                            _f6 = vmlaq_f32(_f6, _c6, _beta);\n                            _f7 = vmlaq_f32(_f7, _c7, _beta);\n                        
}\n                        pC += 32;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _c01 = vld1q_u16(pC + c_hstep * 4);\n                        _c23 = vld1q_u16(pC + c_hstep * 4 + 8);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = 
vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = vcvt_f32_f16((float16x4_t)_cc0);\n                        _c1 = vcvt_f32_f16((float16x4_t)_cc1);\n                        float32x4_t _c2 = vcvt_f32_f16((float16x4_t)_cc2);\n                        float32x4_t _c3 = vcvt_f32_f16((float16x4_t)_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f0 = vaddq_f32(_f0, _c0);\n                            _f1 = vaddq_f32(_f1, _c1);\n                            _f2 = vaddq_f32(_f2, _c2);\n                            _f3 = vaddq_f32(_f3, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f0 = vmlaq_f32(_f0, _c0, _beta);\n                            _f1 = vmlaq_f32(_f1, _c1, _beta);\n                            _f2 = vmlaq_f32(_f2, _c2, _beta);\n                            _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        }\n                        _cc0 = 
vld1_u16(pC + c_hstep * 4);\n                        _cc1 = vld1_u16(pC + c_hstep * 5);\n                        _cc2 = vld1_u16(pC + c_hstep * 6);\n                        _cc3 = vld1_u16(pC + c_hstep * 7);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = vcvt_f32_f16((float16x4_t)_cc0);\n                        _c1 = vcvt_f32_f16((float16x4_t)_cc1);\n                        _c2 = vcvt_f32_f16((float16x4_t)_cc2);\n                        _c3 = vcvt_f32_f16((float16x4_t)_cc3);\n                        if (beta == 1.f)\n                        {\n                            _f4 = vaddq_f32(_f4, _c0);\n                            _f5 = vaddq_f32(_f5, _c1);\n                            _f6 = vaddq_f32(_f6, _c2);\n                            _f7 = vaddq_f32(_f7, _c3);\n                        }\n                        else\n                        {\n                            float32x4_t _beta = vdupq_n_f32(beta);\n                            _f4 = vmlaq_f32(_f4, _c0, _beta);\n                            _f5 = vmlaq_f32(_f5, _c1, _beta);\n                            _f6 = vmlaq_f32(_f6, _c2, _beta);\n                            _f7 = vmlaq_f32(_f7, _c3, _beta);\n                        }\n                        pC += 4;\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n                    _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    
float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c1);\n                    _f6 = vaddq_f32(_f6, _c2);\n                    _f7 = vaddq_f32(_f7, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x8_t _hf0 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f0), (uint16x4_t)vcvt_f16_f32(_f4));\n            uint16x8_t _hf1 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f1), (uint16x4_t)vcvt_f16_f32(_f5));\n            uint16x8_t _hf2 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f2), (uint16x4_t)vcvt_f16_f32(_f6));\n            uint16x8_t _hf3 = vcombine_u16((uint16x4_t)vcvt_f16_f32(_f3), (uint16x4_t)vcvt_f16_f32(_f7));\n\n            if (out_elempack == 4)\n            {\n                uint16x8x4_t _hf;\n                _hf.val[0] = _hf0;\n                _hf.val[1] = _hf1;\n                _hf.val[2] = _hf2;\n                _hf.val[3] = _hf3;\n                vst4q_u16(p0, _hf);\n            }\n            if (out_elempack == 1)\n            {\n                vst1q_u16(p0, _hf0);\n                vst1q_u16(p0 + out_hstep, _hf1);\n                vst1q_u16(p0 + out_hstep * 2, _hf2);\n                vst1q_u16(p0 + out_hstep * 3, _hf3);\n            }\n\n            
pp += 32;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      e0 f1 g0 h1\n            //      a1 b0 c1 d0\n            //      e1 f0 g1 h0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      e0 f0 g0 h0\n            //      e1 f1 g1 h1\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum2);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[0]), vget_low_s32(_t1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[0]), vget_high_s32(_t1.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale0);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale1);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n              
      _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n#if __aarch64__\n                    if (c_elempack == 8)\n                    {\n                        uint16x8_t _c02 = vld1q_u16(pC);\n                        uint16x8_t _c13 = vld1q_u16(pC + 8);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c02));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c13));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c02));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c13));\n                        pC += 16;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + c_hstep * 4);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], 
_c01, 1);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 4], _c01, 4);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 5], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 6], _c01, 6);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 7], _c01, 7);\n\n                        uint16x8_t _c23 = uint16x8_t();\n                        _c23 = vsetq_lane_u16(pC[1], _c23, 0);\n                        _c23 = vsetq_lane_u16(pC[c_hstep + 1], _c23, 1);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c23, 2);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c23, 3);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 4 + 1], _c23, 4);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 5 + 1], _c23, 5);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 6 + 1], _c23, 6);\n                        _c23 = vsetq_lane_u16(pC[c_hstep * 7 + 1], _c23, 7);\n\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        pC += 2;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = 
vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    _c1 = vdupq_n_f32(float16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f0), (uint16x4_t)vcvt_f16_f32(_f2)));\n            vst1q_u16(p0 + out_hstep, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f1), (uint16x4_t)vcvt_f16_f32(_f3)));\n\n            pp += 16;\n            p0 += out_hstep * 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale0);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp + 4)), _descale1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n#if __aarch64__\n                    if 
(c_elempack == 8)\n                    {\n                        uint16x8_t _c = vld1q_u16(pC);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                        pC += 8;\n                    }\n#endif // __aarch64__\n                    if (c_elempack == 4)\n                    {\n                        _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                        _c1 = vcvt_f32_f16((float16x4_t)vld1_u16(pC + c_hstep * 4));\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x8_t _c01 = uint16x8_t();\n                        _c01 = vsetq_lane_u16(pC[0], _c01, 0);\n                        _c01 = vsetq_lane_u16(pC[c_hstep], _c01, 1);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 2], _c01, 2);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 3], _c01, 3);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 4], _c01, 4);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 5], _c01, 5);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 6], _c01, 6);\n                        _c01 = vsetq_lane_u16(pC[c_hstep * 7], _c01, 7);\n                        _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        pC += 1;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n              
  if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1q_u16(p0, vcombine_u16((uint16x4_t)vcvt_f16_f32(_f0), (uint16x4_t)vcvt_f16_f32(_f1)));\n            pp += 8;\n            p0 += out_hstep;\n        }\n    }\n    for (; ii + 3 < max_ii; ii += 4)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        float32x4_t _descale = vld1q_f32((const float*)descales + i + ii);\n\n        float32x4_t _c0;\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                _c0 = vmulq_n_f32(_c0, beta);\n            }\n            if (broadcast_type_C == 3)\n            {\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j * c_elempack;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n            int32x4_t _sum4 = vld1q_s32(pp + 16);\n            
int32x4_t _sum5 = vld1q_s32(pp + 20);\n            int32x4_t _sum6 = vld1q_s32(pp + 24);\n            int32x4_t _sum7 = vld1q_s32(pp + 28);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      a4 b5 c6 d7\n            //      c0 d1 a2 b3\n            //      c4 d5 a6 b7\n            //      a3 b2 c1 d0\n            //      a7 b6 c5 d4\n            //      c3 d2 a1 b0\n            //      c7 d6 a5 b4\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            //      a4 b4 c4 d4\n            //      a5 b5 c5 d5\n            //      a6 b6 c6 d6\n            //      a7 b7 c7 d7\n            {\n                _sum4 = vrev64q_s32(_sum4);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum6 = vrev64q_s32(_sum6);\n                _sum7 = vrev64q_s32(_sum7);\n                _sum4 = vextq_s32(_sum4, _sum4, 2);\n                _sum5 = vextq_s32(_sum5, _sum5, 2);\n                _sum6 = vextq_s32(_sum6, _sum6, 2);\n                _sum7 = vextq_s32(_sum7, _sum7, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum6);\n                int32x4x2_t _t1 = vzipq_s32(_sum2, _sum4);\n                int32x4x2_t _t2 = vzipq_s32(_sum1, _sum7);\n                int32x4x2_t _t3 = vzipq_s32(_sum3, _sum5);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = 
vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum4 = vcombine_s32(vget_low_s32(_t2.val[0]), vget_low_s32(_t3.val[0]));\n                _sum5 = vcombine_s32(vget_high_s32(_t2.val[0]), vget_high_s32(_t3.val[0]));\n                _sum6 = vcombine_s32(vget_low_s32(_t3.val[1]), vget_low_s32(_t2.val[1]));\n                _sum7 = vcombine_s32(vget_high_s32(_t3.val[1]), vget_high_s32(_t2.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum5 = vrev64q_s32(_sum5);\n                _sum7 = vrev64q_s32(_sum7);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n            float32x4_t _f4 = vmulq_f32(vcvtq_f32_s32(_sum4), _descale);\n            float32x4_t _f5 = vmulq_f32(vcvtq_f32_s32(_sum5), _descale);\n            float32x4_t _f6 = vmulq_f32(vcvtq_f32_s32(_sum6), _descale);\n            float32x4_t _f7 = vmulq_f32(vcvtq_f32_s32(_sum7), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n     
               _f3 = vaddq_f32(_f3, _c0);\n                    _f4 = vaddq_f32(_f4, _c0);\n                    _f5 = vaddq_f32(_f5, _c0);\n                    _f6 = vaddq_f32(_f6, _c0);\n                    _f7 = vaddq_f32(_f7, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c01;\n                    uint16x8_t _c23;\n                    uint16x8_t _c45;\n                    uint16x8_t _c67;\n                    if (c_elempack == 4)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + 8);\n                        _c45 = vld1q_u16(pC + 16);\n                        _c67 = vld1q_u16(pC + 24);\n                        pC += 32;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c01 = vld1q_u16(pC);\n                        _c23 = vld1q_u16(pC + c_hstep);\n                        _c45 = vld1q_u16(pC + c_hstep * 2);\n                        _c67 = vld1q_u16(pC + c_hstep * 3);\n                        transpose8x4_u16(_c01, _c23, _c45, _c67);\n                        pC += 8;\n                    }\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                    float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                    float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                    float32x4_t _c4 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c45));\n                    float32x4_t _c5 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c45));\n                    float32x4_t _c6 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c67));\n                    float32x4_t _c7 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c67));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, 
_c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                        _f4 = vaddq_f32(_f4, _c4);\n                        _f5 = vaddq_f32(_f5, _c5);\n                        _f6 = vaddq_f32(_f6, _c6);\n                        _f7 = vaddq_f32(_f7, _c7);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                        _f4 = vmlaq_f32(_f4, _c4, _beta);\n                        _f5 = vmlaq_f32(_f5, _c5, _beta);\n                        _f6 = vmlaq_f32(_f6, _c6, _beta);\n                        _f7 = vmlaq_f32(_f7, _c7, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    float32x4_t _cc0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                    float32x4_t _cc1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _cc0 = vmulq_f32(_cc0, _beta);\n                        _cc1 = vmulq_f32(_cc1, _beta);\n                    }\n                    _c0 = vdupq_laneq_f32(_cc0, 0);\n                    float32x4_t _c1 = vdupq_laneq_f32(_cc0, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_cc0, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_cc0, 3);\n                    float32x4_t _c4 = vdupq_laneq_f32(_cc1, 0);\n                    float32x4_t _c5 = vdupq_laneq_f32(_cc1, 1);\n                    float32x4_t _c6 = 
vdupq_laneq_f32(_cc1, 2);\n                    float32x4_t _c7 = vdupq_laneq_f32(_cc1, 3);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    _f4 = vaddq_f32(_f4, _c4);\n                    _f5 = vaddq_f32(_f5, _c5);\n                    _f6 = vaddq_f32(_f6, _c6);\n                    _f7 = vaddq_f32(_f7, _c7);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n                _f4 = vmulq_f32(_f4, _alpha);\n                _f5 = vmulq_f32(_f5, _alpha);\n                _f6 = vmulq_f32(_f6, _alpha);\n                _f7 = vmulq_f32(_f7, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n            uint16x4_t _hf4 = (uint16x4_t)vcvt_f16_f32(_f4);\n            uint16x4_t _hf5 = (uint16x4_t)vcvt_f16_f32(_f5);\n            uint16x4_t _hf6 = (uint16x4_t)vcvt_f16_f32(_f6);\n            uint16x4_t _hf7 = (uint16x4_t)vcvt_f16_f32(_f7);\n\n#if __aarch64__\n            if (out_elempack == 8)\n            {\n                transpose4x4_u16(_hf0, _hf1, _hf2, _hf3);\n                transpose4x4_u16(_hf4, _hf5, _hf6, _hf7);\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf4));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf1, _hf5));\n                vst1q_u16(p0 + 16, vcombine_u16(_hf2, _hf6));\n                vst1q_u16(p0 + 24, vcombine_u16(_hf3, _hf7));\n            }\n#endif // 
__aarch64__\n            if (out_elempack == 4)\n            {\n                uint16x4x4_t _hfa;\n                uint16x4x4_t _hfb;\n                _hfa.val[0] = _hf0;\n                _hfa.val[1] = _hf1;\n                _hfa.val[2] = _hf2;\n                _hfa.val[3] = _hf3;\n                _hfb.val[0] = _hf4;\n                _hfb.val[1] = _hf5;\n                _hfb.val[2] = _hf6;\n                _hfb.val[3] = _hf7;\n                vst4_u16(p0, _hfa);\n                vst4_u16(p0 + out_hstep * 4, _hfb);\n            }\n            if (out_elempack == 1)\n            {\n                vst1_u16(p0, _hf0);\n                vst1_u16(p0 + out_hstep, _hf1);\n                vst1_u16(p0 + out_hstep * 2, _hf2);\n                vst1_u16(p0 + out_hstep * 3, _hf3);\n                vst1_u16(p0 + out_hstep * 4, _hf4);\n                vst1_u16(p0 + out_hstep * 5, _hf5);\n                vst1_u16(p0 + out_hstep * 6, _hf6);\n                vst1_u16(p0 + out_hstep * 7, _hf7);\n            }\n\n            pp += 32;\n            p0 += out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n#else\n            // from\n            //      a0 b1 c2 d3\n            //      c0 d1 a2 b3\n            //      a3 b2 c1 d0\n            //      c3 d2 a1 b0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            //      a2 b2 c2 d2\n            //      a3 b3 c3 d3\n            {\n                _sum2 = vrev64q_s32(_sum2);\n                _sum3 = vrev64q_s32(_sum3);\n                _sum2 = vextq_s32(_sum2, 
_sum2, 2);\n                _sum3 = vextq_s32(_sum3, _sum3, 2);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum3);\n                int32x4x2_t _t1 = vzipq_s32(_sum1, _sum2);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_t1.val[1]), vget_low_s32(_t0.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_t1.val[1]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n                _sum3 = vrev64q_s32(_sum3);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    float32x4_t _c1;\n                    float32x4_t _c2;\n                    float32x4_t _c3;\n                    if (c_elempack == 4)\n                    {\n                        uint16x8_t _c01 = vld1q_u16(pC);\n                        uint16x8_t _c23 = vld1q_u16(pC + 8);\n                        _c0 = 
vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                        _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                        _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                        _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                        pC += 16;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        uint16x4_t _cc0 = vld1_u16(pC);\n                        uint16x4_t _cc1 = vld1_u16(pC + c_hstep);\n                        uint16x4_t _cc2 = vld1_u16(pC + c_hstep * 2);\n                        uint16x4_t _cc3 = vld1_u16(pC + c_hstep * 3);\n                        transpose4x4_u16(_cc0, _cc1, _cc2, _cc3);\n                        _c0 = vcvt_f32_f16((float16x4_t)_cc0);\n                        _c1 = vcvt_f32_f16((float16x4_t)_cc1);\n                        _c2 = vcvt_f32_f16((float16x4_t)_cc2);\n                        _c3 = vcvt_f32_f16((float16x4_t)_cc3);\n                        pC += 4;\n                    }\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    float32x4_t _c = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _c = vmulq_n_f32(_c, beta);\n#if __aarch64__\n                    _c0 = vdupq_laneq_f32(_c, 0);\n     
               float32x4_t _c1 = vdupq_laneq_f32(_c, 1);\n                    float32x4_t _c2 = vdupq_laneq_f32(_c, 2);\n                    float32x4_t _c3 = vdupq_laneq_f32(_c, 3);\n#else\n                    _c0 = vdupq_lane_f32(vget_low_f32(_c), 0);\n                    float32x4_t _c1 = vdupq_lane_f32(vget_low_f32(_c), 1);\n                    float32x4_t _c2 = vdupq_lane_f32(vget_high_f32(_c), 0);\n                    float32x4_t _c3 = vdupq_lane_f32(vget_high_f32(_c), 1);\n#endif\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c2);\n                    _f3 = vaddq_f32(_f3, _c3);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n\n            if (out_elempack == 4)\n            {\n                uint16x4x4_t _hf;\n                _hf.val[0] = _hf0;\n                _hf.val[1] = _hf1;\n                _hf.val[2] = _hf2;\n                _hf.val[3] = _hf3;\n                vst4_u16(p0, _hf);\n            }\n            if (out_elempack == 1)\n            {\n                vst1_u16(p0, _hf0);\n                vst1_u16(p0 + out_hstep, _hf1);\n                vst1_u16(p0 + out_hstep * 2, _hf2);\n                vst1_u16(p0 + out_hstep * 3, _hf3);\n            }\n\n            pp += 16;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            int32x4_t _sum0 
= vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n#if __ARM_FEATURE_DOTPROD\n            // from/to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n#else\n            // from\n            //      a0 b1 c0 d1\n            //      a1 b0 c1 d0\n\n            // to\n            //      a0 b0 c0 d0\n            //      a1 b1 c1 d1\n            {\n                _sum1 = vrev64q_s32(_sum1);\n                int32x4x2_t _t0 = vzipq_s32(_sum0, _sum1);\n                _sum0 = vcombine_s32(vget_low_s32(_t0.val[0]), vget_low_s32(_t0.val[1]));\n                _sum1 = vcombine_s32(vget_high_s32(_t0.val[0]), vget_high_s32(_t0.val[1]));\n                _sum1 = vrev64q_s32(_sum1);\n            }\n#endif // __ARM_FEATURE_DOTPROD\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x8_t _c;\n                    if (c_elempack == 4)\n                    {\n                        _c = vld1q_u16(pC);\n                        pC += 8;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x8_t();\n                        _c = vsetq_lane_u16(pC[0], _c, 0);\n                        _c = vsetq_lane_u16(pC[c_hstep], _c, 1);\n                        _c = vsetq_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3], _c, 3);\n                        
_c = vsetq_lane_u16(pC[1], _c, 4);\n                        _c = vsetq_lane_u16(pC[c_hstep + 1], _c, 5);\n                        _c = vsetq_lane_u16(pC[c_hstep * 2 + 1], _c, 6);\n                        _c = vsetq_lane_u16(pC[c_hstep * 3 + 1], _c, 7);\n                        pC += 2;\n                    }\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    float32x4_t _c1 = vdupq_n_f32(float16_to_float32(pC[1]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    pC += 2;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            vst1_u16(p0, (uint16x4_t)vcvt_f16_f32(_f0));\n            vst1_u16(p0 + out_hstep, (uint16x4_t)vcvt_f16_f32(_f1));\n\n            pp += 8;\n            p0 += out_hstep * 2;\n        }\n        for (; jj < max_jj; jj += 1)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, 
_c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    uint16x4_t _c;\n                    if (c_elempack == 4)\n                    {\n                        _c = vld1_u16(pC);\n                        pC += 4;\n                    }\n                    if (c_elempack == 1)\n                    {\n                        _c = uint16x4_t();\n                        _c = vset_lane_u16(pC[0], _c, 0);\n                        _c = vset_lane_u16(pC[c_hstep], _c, 1);\n                        _c = vset_lane_u16(pC[c_hstep * 2], _c, 2);\n                        _c = vset_lane_u16(pC[c_hstep * 3], _c, 3);\n                        pC += 1;\n                    }\n                    _c0 = vcvt_f32_f16((float16x4_t)_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vdupq_n_f32(float16_to_float32(pC[0]) * beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                    pC += 1;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            vst1_u16(p0, (uint16x4_t)vcvt_f16_f32(_f0));\n            pp += 4;\n            p0 += out_hstep;\n        }\n    }\n#endif // __ARM_NEON\n    for (; ii + 1 < max_ii; ii += 2)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        const float descale0 = descales[i + ii];\n        const float descale1 = descales[i + ii + 1];\n#if __ARM_NEON\n        float32x2_t _descale01 = vld1_f32((const float*)descales + i + ii);\n#endif\n\n        float c0;\n        float c1;\n#if __ARM_NEON\n        float32x4_t _c0;\n        float32x4_t _c1;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n           
 {\n                c0 = float16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                c0 = float16_to_float32(pC[0]) * beta;\n                c1 = float16_to_float32(pC[1]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n                _c1 = vdupq_n_f32(c1);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t _sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale01, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale01, 0);\n            float32x4_t _f2 = vmulq_lane_f32(vcvtq_f32_s32(_sum2), _descale01, 1);\n            float32x4_t _f3 = vmulq_lane_f32(vcvtq_f32_s32(_sum3), _descale01, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, 
_c1);\n                    _f3 = vaddq_f32(_f3, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + c_hstep);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                    _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                    float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                    float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC += 8;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x8_t _c = vld1q_u16(pC);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                    _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta != 1.f)\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _c0 = vmulq_f32(_c0, _beta);\n                        _c1 = vmulq_f32(_c1, _beta);\n                    }\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                    _f2 = vaddq_f32(_f2, _c0);\n          
          _f3 = vaddq_f32(_f3, _c1);\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n\n#if __aarch64__\n            if (out_elempack == 8)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf2, _hf3));\n            }\n#endif // __aarch64__\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf2));\n                vst1q_u16(p0 + out_hstep * 4, vcombine_u16(_hf1, _hf3));\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_hf0, 0);\n                p0[1] = vget_lane_u16(_hf2, 0);\n                p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_hf2, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_hf2, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_hf2, 3);\n                p0[out_hstep * 4] = vget_lane_u16(_hf1, 0);\n                p0[out_hstep * 4 + 1] = vget_lane_u16(_hf3, 0);\n                p0[out_hstep * 5] = vget_lane_u16(_hf1, 1);\n                p0[out_hstep * 5 + 1] = vget_lane_u16(_hf3, 1);\n                p0[out_hstep * 6] = vget_lane_u16(_hf1, 2);\n                p0[out_hstep * 6 + 1] = vget_lane_u16(_hf3, 2);\n                
p0[out_hstep * 7] = vget_lane_u16(_hf1, 3);\n                p0[out_hstep * 7 + 1] = vget_lane_u16(_hf3, 3);\n            }\n\n            pp += 16;\n            p0 += out_hstep * 8;\n        }\n#endif // __aarch64__\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            // a0 a1 a2 a3\n            // b0 b1 b2 b3\n\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_lane_f32(vcvtq_f32_s32(_sum0), _descale01, 0);\n            float32x4_t _f1 = vmulq_lane_f32(vcvtq_f32_s32(_sum1), _descale01, 1);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c1);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _c1 = vcvt_f32_f16((float16x4_t)vld1_u16(pC + c_hstep));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 4;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    _c0 = vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _c0 = vmulq_n_f32(_c0, beta);\n                    _f0 = vaddq_f32(_f0, _c0);\n                
    _f1 = vaddq_f32(_f1, _c0);\n                    pC += 4;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n\n            if (out_elempack == 4)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n            }\n            if (out_elempack == 1)\n            {\n                p0[0] = vget_lane_u16(_hf0, 0);\n                p0[1] = vget_lane_u16(_hf1, 0);\n                p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                p0[out_hstep + 1] = vget_lane_u16(_hf1, 1);\n                p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                p0[out_hstep * 2 + 1] = vget_lane_u16(_hf1, 2);\n                p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                p0[out_hstep * 3 + 1] = vget_lane_u16(_hf1, 3);\n            }\n\n            pp += 8;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            // a0 a1 b0 b1\n            int32x2x2_t _sum0 = vld2_s32(pp);\n\n            float32x4_t _descale = vcombine_f32(_descale01, _descale01);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vcombine_s32(_sum0.val[0], _sum0.val[1])), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    float32x4_t _cc = vzipq_f32(_c0, _c1).val[0];\n                    _f0 = vaddq_f32(_f0, _cc);\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    
uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[c_hstep], _c, 1);\n                    _c = vset_lane_u16(pC[1], _c, 2);\n                    _c = vset_lane_u16(pC[c_hstep + 1], _c, 3);\n                    _c0 = vcvt_f32_f16((float16x4_t)_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    uint16x4_t _c = uint16x4_t();\n                    _c = vset_lane_u16(pC[0], _c, 0);\n                    _c = vset_lane_u16(pC[0], _c, 1);\n                    _c = vset_lane_u16(pC[1], _c, 2);\n                    _c = vset_lane_u16(pC[1], _c, 3);\n                    _c0 = vcvt_f32_f16((float16x4_t)_c);\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n\n            p0[0] = vget_lane_u16(_hf0, 0);\n            p0[1] = vget_lane_u16(_hf0, 1);\n            p0[out_hstep] = vget_lane_u16(_hf0, 2);\n            p0[out_hstep + 1] = vget_lane_u16(_hf0, 3);\n\n            pp += 4;\n            p0 += out_hstep * 2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj += 1)\n        {\n            float f0 = pp[0] * descale0;\n            float f1 = pp[1] * descale1;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    f0 += c0;\n                    f1 += c0;\n                }\n                if (broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                    f1 += c1;\n                }\n                if (broadcast_type_C == 3)\n                {\n                    // c_elempack == 1\n                    f0 += float16_to_float32(pC[0]) * beta;\n       
             f1 += float16_to_float32(pC[c_hstep]) * beta;\n                    pC += 1;\n                }\n                if (broadcast_type_C == 4)\n                {\n                    c0 = float16_to_float32(pC[0]) * beta;\n                    f0 += c0;\n                    f1 += c0;\n                    pC += 1;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                f0 *= alpha;\n                f1 *= alpha;\n            }\n\n            p0[0] = float32_to_float16(f0);\n            p0[1] = float32_to_float16(f1);\n            pp += 2;\n            p0 += out_hstep;\n        }\n    }\n    for (; ii < max_ii; ii += 1)\n    {\n        unsigned short* p0 = (unsigned short*)top_blob + j * out_hstep + (i + ii) * out_elempack;\n\n        const float descale = descales[i + ii];\n#if __ARM_NEON\n        float32x4_t _descale = vdupq_n_f32(descale);\n#endif\n\n        float c0;\n#if __ARM_NEON\n        float32x4_t _c0;\n#endif\n        if (pC)\n        {\n            if (broadcast_type_C == 0)\n            {\n                c0 = float16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 1 || broadcast_type_C == 2)\n            {\n                pC = (const unsigned short*)C + i + ii;\n                c0 = float16_to_float32(pC[0]) * beta;\n#if __ARM_NEON\n                _c0 = vdupq_n_f32(c0);\n#endif\n            }\n            if (broadcast_type_C == 3)\n            {\n                // c_elempack == 1\n                pC = (const unsigned short*)C + (i + ii) * c_hstep + j;\n            }\n            if (broadcast_type_C == 4)\n            {\n                pC = (const unsigned short*)C + j;\n            }\n        }\n\n        int jj = 0;\n#if __ARM_NEON\n        for (; jj + 15 < max_jj; jj += 16)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n            int32x4_t 
_sum2 = vld1q_s32(pp + 8);\n            int32x4_t _sum3 = vld1q_s32(pp + 12);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n            float32x4_t _f2 = vmulq_f32(vcvtq_f32_s32(_sum2), _descale);\n            float32x4_t _f3 = vmulq_f32(vcvtq_f32_s32(_sum3), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                    _f2 = vaddq_f32(_f2, _c0);\n                    _f3 = vaddq_f32(_f3, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    uint16x8_t _c01 = vld1q_u16(pC);\n                    uint16x8_t _c23 = vld1q_u16(pC + 8);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c01));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c01));\n                    float32x4_t _c2 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c23));\n                    float32x4_t _c3 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c23));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                        _f2 = vaddq_f32(_f2, _c2);\n                        _f3 = vaddq_f32(_f3, _c3);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                        _f2 = vmlaq_f32(_f2, _c2, _beta);\n                        _f3 = vmlaq_f32(_f3, _c3, _beta);\n                    }\n                    pC 
+= 16;\n                }\n            }\n\n            if (alpha != 1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n                _f2 = vmulq_f32(_f2, _alpha);\n                _f3 = vmulq_f32(_f3, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n            uint16x4_t _hf2 = (uint16x4_t)vcvt_f16_f32(_f2);\n            uint16x4_t _hf3 = (uint16x4_t)vcvt_f16_f32(_f3);\n\n            if (out_hstep == 1)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                vst1q_u16(p0 + 8, vcombine_u16(_hf2, _hf3));\n            }\n            else\n            {\n#if __aarch64__\n                if (out_elempack == 8)\n                {\n                    vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                    vst1q_u16(p0 + out_hstep * 8, vcombine_u16(_hf2, _hf3));\n                }\n#endif // __aarch64__\n                if (out_elempack == 4)\n                {\n                    vst1_u16(p0, _hf0);\n                    vst1_u16(p0 + out_hstep * 4, _hf1);\n                    vst1_u16(p0 + out_hstep * 8, _hf2);\n                    vst1_u16(p0 + out_hstep * 12, _hf3);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vget_lane_u16(_hf0, 0);\n                    p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                    p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                    p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                    p0[out_hstep * 4] = vget_lane_u16(_hf1, 0);\n                    p0[out_hstep * 5] = vget_lane_u16(_hf1, 1);\n                    p0[out_hstep * 6] = vget_lane_u16(_hf1, 2);\n                    p0[out_hstep * 7] = vget_lane_u16(_hf1, 3);\n                    p0[out_hstep * 8] = vget_lane_u16(_hf2, 0);\n    
                p0[out_hstep * 9] = vget_lane_u16(_hf2, 1);\n                    p0[out_hstep * 10] = vget_lane_u16(_hf2, 2);\n                    p0[out_hstep * 11] = vget_lane_u16(_hf2, 3);\n                    p0[out_hstep * 12] = vget_lane_u16(_hf3, 0);\n                    p0[out_hstep * 13] = vget_lane_u16(_hf3, 1);\n                    p0[out_hstep * 14] = vget_lane_u16(_hf3, 2);\n                    p0[out_hstep * 15] = vget_lane_u16(_hf3, 3);\n                }\n            }\n\n            pp += 16;\n            p0 += out_hstep * 16;\n        }\n        for (; jj + 7 < max_jj; jj += 8)\n        {\n            int32x4_t _sum0 = vld1q_s32(pp);\n            int32x4_t _sum1 = vld1q_s32(pp + 4);\n\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(_sum0), _descale);\n            float32x4_t _f1 = vmulq_f32(vcvtq_f32_s32(_sum1), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                    _f1 = vaddq_f32(_f1, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // out_elempack == 1\n                    uint16x8_t _c = vld1q_u16(pC);\n                    _c0 = vcvt_f32_f16((float16x4_t)vget_low_u16(_c));\n                    float32x4_t _c1 = vcvt_f32_f16((float16x4_t)vget_high_u16(_c));\n                    if (beta == 1.f)\n                    {\n                        _f0 = vaddq_f32(_f0, _c0);\n                        _f1 = vaddq_f32(_f1, _c1);\n                    }\n                    else\n                    {\n                        float32x4_t _beta = vdupq_n_f32(beta);\n                        _f0 = vmlaq_f32(_f0, _c0, _beta);\n                        _f1 = vmlaq_f32(_f1, _c1, _beta);\n                    }\n                    pC += 8;\n                }\n            }\n\n            if (alpha != 
1.f)\n            {\n                float32x4_t _alpha = vdupq_n_f32(alpha);\n                _f0 = vmulq_f32(_f0, _alpha);\n                _f1 = vmulq_f32(_f1, _alpha);\n            }\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n            uint16x4_t _hf1 = (uint16x4_t)vcvt_f16_f32(_f1);\n\n            if (out_hstep == 1)\n            {\n                vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n            }\n            else\n            {\n#if __aarch64__\n                if (out_elempack == 8)\n                {\n                    vst1q_u16(p0, vcombine_u16(_hf0, _hf1));\n                }\n#endif // __aarch64__\n                if (out_elempack == 4)\n                {\n                    vst1_u16(p0, _hf0);\n                    vst1_u16(p0 + out_hstep * 4, _hf1);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vget_lane_u16(_hf0, 0);\n                    p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                    p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                    p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                    p0[out_hstep * 4] = vget_lane_u16(_hf1, 0);\n                    p0[out_hstep * 5] = vget_lane_u16(_hf1, 1);\n                    p0[out_hstep * 6] = vget_lane_u16(_hf1, 2);\n                    p0[out_hstep * 7] = vget_lane_u16(_hf1, 3);\n                }\n            }\n\n            pp += 8;\n            p0 += out_hstep * 8;\n        }\n        for (; jj + 3 < max_jj; jj += 4)\n        {\n            float32x4_t _f0 = vmulq_f32(vcvtq_f32_s32(vld1q_s32(pp)), _descale);\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vaddq_f32(_f0, _c0);\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // out_elempack == 1\n                    _c0 
= vcvt_f32_f16((float16x4_t)vld1_u16(pC));\n                    _f0 = vmlaq_n_f32(_f0, _c0, beta);\n                    pC += 4;\n                }\n            }\n\n            _f0 = vmulq_n_f32(_f0, alpha);\n\n            uint16x4_t _hf0 = (uint16x4_t)vcvt_f16_f32(_f0);\n\n            if (out_hstep == 1)\n            {\n                vst1_u16(p0, _hf0);\n            }\n            else\n            {\n                if (out_elempack == 4)\n                {\n                    vst1_u16(p0, _hf0);\n                }\n                if (out_elempack == 1)\n                {\n                    p0[0] = vget_lane_u16(_hf0, 0);\n                    p0[out_hstep] = vget_lane_u16(_hf0, 1);\n                    p0[out_hstep * 2] = vget_lane_u16(_hf0, 2);\n                    p0[out_hstep * 3] = vget_lane_u16(_hf0, 3);\n                }\n            }\n\n            pp += 4;\n            p0 += out_hstep * 4;\n        }\n        for (; jj + 1 < max_jj; jj += 2)\n        {\n            float32x2_t _f0 = vmul_f32(vcvt_f32_s32(vld1_s32(pp)), vget_low_f32(_descale));\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    _f0 = vadd_f32(_f0, vget_low_f32(_c0));\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    float32x2_t _c = float32x2_t();\n                    _c = vset_lane_f32(float16_to_float32(pC[0]), _c, 0);\n                    _c = vset_lane_f32(float16_to_float32(pC[1]), _c, 1);\n                    _f0 = vmla_n_f32(_f0, _c, beta);\n                    pC += 2;\n                }\n            }\n\n            _f0 = vmul_n_f32(_f0, alpha);\n\n            p0[0] = float32_to_float16(vget_lane_f32(_f0, 0));\n            p0[out_hstep] = float32_to_float16(vget_lane_f32(_f0, 1));\n\n            pp += 2;\n            p0 += out_hstep * 
2;\n        }\n#endif // __ARM_NEON\n        for (; jj < max_jj; jj += 1)\n        {\n            float f0 = pp[0] * descale;\n\n            if (pC)\n            {\n                if (broadcast_type_C == 0 || broadcast_type_C == 1 || broadcast_type_C == 2)\n                {\n                    f0 += c0;\n                }\n                if (broadcast_type_C == 3 || broadcast_type_C == 4)\n                {\n                    // c_elempack == 1\n                    f0 += float16_to_float32(pC[0]) * beta;\n                    pC += 1;\n                }\n            }\n\n            f0 *= alpha;\n\n            p0[0] = float32_to_float16(f0);\n\n            pp += 1;\n            p0 += out_hstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/groupnorm_arm.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"groupnorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nGroupNorm_arm::GroupNorm_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nstatic void groupnorm(float* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int channels, int size, int elempack, size_t cstep)\n{\n#if __ARM_NEON\n    float32x4_t _mean = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float mean = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr0 = ptr + cstep * q * elempack;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            _mean = vaddq_f32(_mean, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            mean += ptr0[0];\n            ptr0++;\n        }\n    }\n\n    {\n#if __ARM_NEON\n#if __aarch64__\n        mean += vaddvq_f32(_mean);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_mean), vget_high_f32(_mean));\n        _s2 = vpadd_f32(_s2, _s2);\n        mean += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        mean = mean / (channels * size);\n#if __ARM_NEON\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n#if __ARM_NEON\n    float32x4_t _var = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float var = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr0 = ptr + cstep * q * elempack;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            _p = vsubq_f32(_p, _mean);\n            _var = 
vmlaq_f32(_var, _p, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = ptr0[0] - mean;\n            var += v * v;\n            ptr0++;\n        }\n    }\n\n    {\n#if __ARM_NEON\n#if __aarch64__\n        var += vaddvq_f32(_var);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_var), vget_high_f32(_var));\n        _s2 = vpadd_f32(_s2, _s2);\n        var += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        var = 1.f / sqrtf(var / (channels * size) + eps);\n        mean = -mean * var;\n#if __ARM_NEON\n        _var = vdupq_n_f32(var);\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n    if (gamma_ptr && beta_ptr)\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr0 = ptr + cstep * q * elempack;\n\n#if __ARM_NEON\n            float32x4_t _a = vdupq_n_f32(0.f);\n            float32x4_t _b = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n            float a = 0.f;\n            float b = 0.f;\n\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                float32x4_t _gamma = vld1q_f32(gamma_ptr + q * elempack);\n                float32x4_t _beta = vld1q_f32(beta_ptr + q * elempack);\n\n                _a = vmulq_f32(_var, _gamma);\n                _b = vmlaq_f32(_beta, _mean, _gamma);\n            }\n#endif // __ARM_NEON\n            if (elempack == 1)\n            {\n                const float gamma = gamma_ptr[q];\n                const float beta = beta_ptr[q];\n\n                a = var * gamma;\n                b = mean * gamma + beta;\n#if __ARM_NEON\n                _a = vdupq_n_f32(a);\n                _b = vdupq_n_f32(b);\n#endif // __ARM_NEON\n            }\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr0);\n                _p = vmlaq_f32(_b, _p, _a);\n                vst1q_f32(ptr0, _p);\n               
 ptr0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *ptr0 = *ptr0 * a + b;\n                ptr0++;\n            }\n        }\n    }\n    else\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr0 = ptr + cstep * q * elempack;\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr0);\n                _p = vmlaq_f32(_mean, _p, _var);\n                vst1q_f32(ptr0, _p);\n                ptr0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *ptr0 = *ptr0 * var + mean;\n                ptr0++;\n            }\n        }\n    }\n}\n\nint GroupNorm_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    const int dims = bottom_top_blob.dims;\n    const int elempack = bottom_top_blob.elempack;\n    const int channels_g = channels / group;\n\n    int g_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        g_elempack = channels_g % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n\n    Mat bottom_top_blob_unpacked = bottom_top_blob;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_top_blob, bottom_top_blob_unpacked, g_elempack, opt_p);\n        if (bottom_top_blob_unpacked.empty())\n            return -100;\n    }\n\n    if (dims == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, 1 * g_elempack, g_elempack, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        const int w = bottom_top_blob_unpacked.w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.row_range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? 
(const float*)beta_data + g * channels_g : 0;\n            groupnorm(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, w * g_elempack, g_elempack, w);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int size = bottom_top_blob_unpacked.w * bottom_top_blob_unpacked.h * bottom_top_blob_unpacked.d;\n        const size_t cstep = bottom_top_blob_unpacked.cstep;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.channel_range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, size * g_elempack, g_elempack, cstep);\n        }\n    }\n\n    if (g_elempack != elempack)\n    {\n        convert_packing(bottom_top_blob_unpacked, bottom_top_blob, elempack, opt);\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic void groupnorm_bf16s(unsigned short* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int channels, int size, int elempack, size_t cstep)\n{\n#if __ARM_NEON\n    float32x4_t _mean = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float mean = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const unsigned short* ptr0 = ptr + cstep * q * elempack;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n            _mean = vaddq_f32(_mean, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            mean += bfloat16_to_float32(ptr0[0]);\n            ptr0++;\n        }\n    }\n\n    {\n#if __ARM_NEON\n#if __aarch64__\n        mean += 
vaddvq_f32(_mean);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_mean), vget_high_f32(_mean));\n        _s2 = vpadd_f32(_s2, _s2);\n        mean += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        mean = mean / (channels * size);\n#if __ARM_NEON\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n#if __ARM_NEON\n    float32x4_t _var = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float var = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const unsigned short* ptr0 = ptr + cstep * q * elempack;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n            _p = vsubq_f32(_p, _mean);\n            _var = vmlaq_f32(_var, _p, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(ptr0[0]) - mean;\n            var += v * v;\n            ptr0++;\n        }\n    }\n\n    {\n#if __ARM_NEON\n#if __aarch64__\n        var += vaddvq_f32(_var);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_var), vget_high_f32(_var));\n        _s2 = vpadd_f32(_s2, _s2);\n        var += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        var = 1.f / sqrtf(var / (channels * size) + eps);\n        mean = -mean * var;\n#if __ARM_NEON\n        _var = vdupq_n_f32(var);\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n    if (gamma_ptr && beta_ptr)\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr0 = ptr + cstep * q * elempack;\n\n#if __ARM_NEON\n            float32x4_t _a = vdupq_n_f32(0.f);\n            float32x4_t _b = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n            float a = 0.f;\n            float b = 0.f;\n\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                float32x4_t _gamma = vld1q_f32(gamma_ptr + q * elempack);\n                
float32x4_t _beta = vld1q_f32(beta_ptr + q * elempack);\n\n                _a = vmulq_f32(_var, _gamma);\n                _b = vmlaq_f32(_beta, _mean, _gamma);\n            }\n#endif // __ARM_NEON\n            if (elempack == 1)\n            {\n                const float gamma = gamma_ptr[q];\n                const float beta = beta_ptr[q];\n\n                a = var * gamma;\n                b = mean * gamma + beta;\n#if __ARM_NEON\n                _a = vdupq_n_f32(a);\n                _b = vdupq_n_f32(b);\n#endif // __ARM_NEON\n            }\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n                _p = vmlaq_f32(_b, _p, _a);\n                vst1_u16(ptr0, float2bfloat(_p));\n                ptr0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *ptr0 = float32_to_bfloat16(bfloat16_to_float32(*ptr0) * a + b);\n                ptr0++;\n            }\n        }\n    }\n    else\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr0 = ptr + cstep * q * elempack;\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n                _p = vmlaq_f32(_mean, _p, _var);\n                vst1_u16(ptr0, float2bfloat(_p));\n                ptr0 += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *ptr0 = float32_to_bfloat16(bfloat16_to_float32(*ptr0) * var + mean);\n                ptr0++;\n            }\n        }\n    }\n}\n\nint GroupNorm_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int elempack = bottom_top_blob.elempack;\n    const int channels_g = channels / group;\n\n    int g_elempack = 1;\n#if 
__ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        g_elempack = channels_g % 4 == 0 ? 4 : 1;\n    }\n#endif // __ARM_NEON\n\n    Mat bottom_top_blob_unpacked = bottom_top_blob;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_top_blob, bottom_top_blob_unpacked, g_elempack, opt_p);\n        if (bottom_top_blob_unpacked.empty())\n            return -100;\n    }\n\n    if (dims == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm_bf16s(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, 1 * g_elempack, g_elempack, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        const int w = bottom_top_blob_unpacked.w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.row_range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? 
(const float*)beta_data + g * channels_g : 0;\n            groupnorm_bf16s(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, w * g_elempack, g_elempack, w);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int size = bottom_top_blob_unpacked.w * bottom_top_blob_unpacked.h * bottom_top_blob_unpacked.d;\n        const size_t cstep = bottom_top_blob_unpacked.cstep;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.channel_range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm_bf16s(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, size * g_elempack, g_elempack, cstep);\n        }\n    }\n\n    if (g_elempack != elempack)\n    {\n        convert_packing(bottom_top_blob_unpacked, bottom_top_blob, elempack, opt);\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/groupnorm_arm.h",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GROUPNORM_ARM_H\n#define LAYER_GROUPNORM_ARM_H\n\n#include \"groupnorm.h\"\n\nnamespace ncnn {\n\nclass GroupNorm_arm : public GroupNorm\n{\npublic:\n    GroupNorm_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GROUPNORM_ARM_H\n"
  },
  {
    "path": "src/layer/arm/groupnorm_arm_asimdhp.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"groupnorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic void groupnorm_fp16s(__fp16* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int channels, int size, int elempack, size_t cstep)\n{\n    float32x4_t _mean0 = vdupq_n_f32(0.f);\n    float32x4_t _mean1 = vdupq_n_f32(0.f);\n    float mean = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const __fp16* ptr0 = ptr + cstep * q * elempack;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr0);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _mean0 = vaddq_f32(_mean0, _p0);\n            _mean1 = vaddq_f32(_mean1, _p1);\n            ptr0 += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr0));\n            _mean0 = vaddq_f32(_mean0, _p);\n            ptr0 += 4;\n        }\n        for (; i < size; i++)\n        {\n            mean += (float)ptr0[0];\n            ptr0++;\n        }\n    }\n\n    {\n        _mean0 = vaddq_f32(_mean0, _mean1);\n        mean += vaddvq_f32(_mean0);\n\n        mean = mean / (channels * size);\n        _mean0 = vdupq_n_f32(mean);\n        _mean1 = _mean0;\n    }\n\n    float32x4_t _var0 = vdupq_n_f32(0.f);\n    float32x4_t _var1 = vdupq_n_f32(0.f);\n    float var = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const __fp16* ptr0 = ptr + cstep * q * elempack;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr0);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n           
 _p0 = vsubq_f32(_p0, _mean0);\n            _p1 = vsubq_f32(_p1, _mean1);\n            _var0 = vfmaq_f32(_var0, _p0, _p0);\n            _var1 = vfmaq_f32(_var1, _p1, _p1);\n            ptr0 += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr0));\n            _p = vsubq_f32(_p, _mean0);\n            _var0 = vfmaq_f32(_var0, _p, _p);\n            ptr0 += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)ptr0[0] - mean;\n            var += v * v;\n            ptr0++;\n        }\n    }\n\n    {\n        _var0 = vaddq_f32(_var0, _var1);\n        var += vaddvq_f32(_var0);\n\n        var = 1.f / sqrtf(var / (channels * size) + eps);\n        mean = -mean * var;\n        _var0 = vdupq_n_f32(var);\n        _mean0 = vdupq_n_f32(mean);\n        _var1 = _var0;\n        _mean1 = _mean0;\n    }\n\n    if (gamma_ptr && beta_ptr)\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr0 = ptr + cstep * q * elempack;\n\n            float32x4_t _a0 = vdupq_n_f32(0.f);\n            float32x4_t _b0 = vdupq_n_f32(0.f);\n            float32x4_t _a1 = vdupq_n_f32(0.f);\n            float32x4_t _b1 = vdupq_n_f32(0.f);\n            float a = 0.f;\n            float b = 0.f;\n\n            if (elempack == 8)\n            {\n                float32x4_t _gamma0 = vld1q_f32(gamma_ptr + q * elempack);\n                float32x4_t _gamma1 = vld1q_f32(gamma_ptr + q * elempack + 4);\n                float32x4_t _beta0 = vld1q_f32(beta_ptr + q * elempack);\n                float32x4_t _beta1 = vld1q_f32(beta_ptr + q * elempack + 4);\n\n                _a0 = vmulq_f32(_var0, _gamma0);\n                _a1 = vmulq_f32(_var1, _gamma1);\n                _b0 = vfmaq_f32(_beta0, _mean0, _gamma0);\n                _b1 = vfmaq_f32(_beta1, _mean1, _gamma1);\n            }\n            if (elempack == 4)\n            {\n                float32x4_t _gamma = 
vld1q_f32(gamma_ptr + q * elempack);\n                float32x4_t _beta = vld1q_f32(beta_ptr + q * elempack);\n\n                _a0 = vmulq_f32(_var0, _gamma);\n                _b0 = vfmaq_f32(_beta, _mean0, _gamma);\n                _a1 = _a0;\n                _b1 = _b0;\n            }\n            if (elempack == 1)\n            {\n                const float gamma = gamma_ptr[q];\n                const float beta = beta_ptr[q];\n\n                a = var * gamma;\n                b = mean * gamma + beta;\n                _a0 = vdupq_n_f32(a);\n                _b0 = vdupq_n_f32(b);\n                _a1 = _a0;\n                _b1 = _b0;\n            }\n\n            int i = 0;\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr0);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                _p0 = vfmaq_f32(_b0, _p0, _a0);\n                _p1 = vfmaq_f32(_b1, _p1, _a1);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr0, _p);\n                ptr0 += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr0));\n                _p = vfmaq_f32(_b0, _p, _a0);\n                vst1_f16(ptr0, vcvt_f16_f32(_p));\n                ptr0 += 4;\n            }\n            for (; i < size; i++)\n            {\n                *ptr0 = (__fp16)((float)*ptr0 * a + b);\n                ptr0++;\n            }\n        }\n    }\n    else\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr0 = ptr + cstep * q * elempack;\n\n            int i = 0;\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr0);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = 
vcvt_f32_f16(vget_high_f16(_p));\n                _p0 = vfmaq_f32(_mean0, _p0, _var0);\n                _p1 = vfmaq_f32(_mean1, _p1, _var1);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr0, _p);\n                ptr0 += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr0));\n                _p = vfmaq_f32(_mean0, _p, _var0);\n                vst1_f16(ptr0, vcvt_f16_f32(_p));\n                ptr0 += 4;\n            }\n            for (; i < size; i++)\n            {\n                *ptr0 = (__fp16)((float)*ptr0 * var + mean);\n                ptr0++;\n            }\n        }\n    }\n}\n\nint GroupNorm_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int elempack = bottom_top_blob.elempack;\n    const int channels_g = channels / group;\n\n    int g_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        if (opt.use_fp16_arithmetic)\n            g_elempack = channels_g % 8 == 0 ? 8 : channels_g % 4 == 0 ? 4 : 1;\n        else\n            g_elempack = channels_g % 4 == 0 ? 4 : 1;\n    }\n\n    Mat bottom_top_blob_unpacked = bottom_top_blob;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_top_blob, bottom_top_blob_unpacked, g_elempack, opt_p);\n        if (bottom_top_blob_unpacked.empty())\n            return -100;\n    }\n\n    if (dims == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? 
(const float*)beta_data + g * channels_g : 0;\n            groupnorm_fp16s(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, 1 * g_elempack, g_elempack, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        const int w = bottom_top_blob_unpacked.w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.row_range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm_fp16s(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, w * g_elempack, g_elempack, w);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int size = bottom_top_blob_unpacked.w * bottom_top_blob_unpacked.h * bottom_top_blob_unpacked.d;\n        const size_t cstep = bottom_top_blob_unpacked.cstep;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob_unpacked.channel_range(g * channels_g / g_elempack, channels_g / g_elempack);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm_fp16s(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g / g_elempack, size * g_elempack, g_elempack, cstep);\n        }\n    }\n\n    if (g_elempack != elempack)\n    {\n        convert_packing(bottom_top_blob_unpacked, bottom_top_blob, elempack, opt);\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gru_arm.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gru_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#if NCNN_INT8\n#include \"gru_int8.h\"\n#endif\n\nGRU_arm::GRU_arm()\n{\n#if __ARM_NEON\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint GRU_arm::create_pipeline(const Option& opt)\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return create_pipeline_int8(opt);\n    }\n#endif\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n    // pack RUN\n    const int num_directions = direction == 2 ? 
2 : 1;\n    const int size = weight_data_size / num_directions / num_output / 3;\n\n#if __ARM_NEON\n    weight_xc_data_packed.create(size * 12, num_output / 4 + num_output % 4, num_directions);\n    bias_c_data_packed.create(num_output, 1, num_directions, 16u, 4);\n    weight_hc_data_packed.create(num_output * 12, num_output / 4 + num_output % 4, num_directions);\n#else\n    weight_xc_data_packed.create(size * 3, num_output, num_directions);\n    bias_c_data_packed.create(num_output, 1, num_directions, 16u, 4);\n    weight_hc_data_packed.create(num_output * 3, num_output, num_directions);\n#endif\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat bias_c = bias_c_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat bias_c_data_packed_dr = bias_c_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        const float* bias_c_R = bias_c.row(0);\n        const float* bias_c_U = bias_c.row(1);\n        const float* bias_c_WN = bias_c.row(2);\n        const float* bias_c_BN = bias_c.row(3);\n\n        float* bias_c_RUBNWN = bias_c_data_packed_dr.row(0);\n\n        int q = 0;\n#if __ARM_NEON\n        for (; q + 3 < num_output; q += 4)\n        {\n            vst1q_f32(bias_c_RUBNWN, vld1q_f32(bias_c_R + q));\n            vst1q_f32(bias_c_RUBNWN + 4, vld1q_f32(bias_c_U + q));\n            vst1q_f32(bias_c_RUBNWN + 8, vld1q_f32(bias_c_BN + q));\n            vst1q_f32(bias_c_RUBNWN + 12, vld1q_f32(bias_c_WN + q));\n\n            bias_c_RUBNWN += 16;\n\n            const float* weight_xc_R = weight_xc.row(num_output * 0 + q);\n            const float* weight_xc_U = weight_xc.row(num_output * 1 + q);\n            const float* weight_xc_N = weight_xc.row(num_output * 2 + 
q);\n\n            const float* weight_xc_R_1 = weight_xc.row(num_output * 0 + q + 1);\n            const float* weight_xc_U_1 = weight_xc.row(num_output * 1 + q + 1);\n            const float* weight_xc_N_1 = weight_xc.row(num_output * 2 + q + 1);\n\n            const float* weight_xc_R_2 = weight_xc.row(num_output * 0 + q + 2);\n            const float* weight_xc_U_2 = weight_xc.row(num_output * 1 + q + 2);\n            const float* weight_xc_N_2 = weight_xc.row(num_output * 2 + q + 2);\n\n            const float* weight_xc_R_3 = weight_xc.row(num_output * 0 + q + 3);\n            const float* weight_xc_U_3 = weight_xc.row(num_output * 1 + q + 3);\n            const float* weight_xc_N_3 = weight_xc.row(num_output * 2 + q + 3);\n\n            const float* weight_hc_R = weight_hc.row(num_output * 0 + q);\n            const float* weight_hc_U = weight_hc.row(num_output * 1 + q);\n            const float* weight_hc_N = weight_hc.row(num_output * 2 + q);\n\n            const float* weight_hc_R_1 = weight_hc.row(num_output * 0 + q + 1);\n            const float* weight_hc_U_1 = weight_hc.row(num_output * 1 + q + 1);\n            const float* weight_hc_N_1 = weight_hc.row(num_output * 2 + q + 1);\n\n            const float* weight_hc_R_2 = weight_hc.row(num_output * 0 + q + 2);\n            const float* weight_hc_U_2 = weight_hc.row(num_output * 1 + q + 2);\n            const float* weight_hc_N_2 = weight_hc.row(num_output * 2 + q + 2);\n\n            const float* weight_hc_R_3 = weight_hc.row(num_output * 0 + q + 3);\n            const float* weight_hc_U_3 = weight_hc.row(num_output * 1 + q + 3);\n            const float* weight_hc_N_3 = weight_hc.row(num_output * 2 + q + 3);\n\n            float* weight_xc_RUN = weight_xc_data_packed_dr.row(q / 4);\n            float* weight_hc_RUN = weight_hc_data_packed_dr.row(q / 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = weight_xc_R[i];\n                weight_xc_RUN[1] = 
weight_xc_R_1[i];\n                weight_xc_RUN[2] = weight_xc_R_2[i];\n                weight_xc_RUN[3] = weight_xc_R_3[i];\n                weight_xc_RUN[4] = weight_xc_U[i];\n                weight_xc_RUN[5] = weight_xc_U_1[i];\n                weight_xc_RUN[6] = weight_xc_U_2[i];\n                weight_xc_RUN[7] = weight_xc_U_3[i];\n\n                weight_xc_RUN += 8;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = weight_hc_R[i];\n                weight_hc_RUN[1] = weight_hc_R_1[i];\n                weight_hc_RUN[2] = weight_hc_R_2[i];\n                weight_hc_RUN[3] = weight_hc_R_3[i];\n                weight_hc_RUN[4] = weight_hc_U[i];\n                weight_hc_RUN[5] = weight_hc_U_1[i];\n                weight_hc_RUN[6] = weight_hc_U_2[i];\n                weight_hc_RUN[7] = weight_hc_U_3[i];\n\n                weight_hc_RUN += 8;\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = weight_xc_N[i];\n                weight_xc_RUN[1] = weight_xc_N_1[i];\n                weight_xc_RUN[2] = weight_xc_N_2[i];\n                weight_xc_RUN[3] = weight_xc_N_3[i];\n\n                weight_xc_RUN += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = weight_hc_N[i];\n                weight_hc_RUN[1] = weight_hc_N_1[i];\n                weight_hc_RUN[2] = weight_hc_N_2[i];\n                weight_hc_RUN[3] = weight_hc_N_3[i];\n\n                weight_hc_RUN += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; q < num_output; q++)\n        {\n            bias_c_RUBNWN[0] = bias_c_R[q];\n            bias_c_RUBNWN[1] = bias_c_U[q];\n            bias_c_RUBNWN[2] = bias_c_BN[q];\n            bias_c_RUBNWN[3] = bias_c_WN[q];\n\n            bias_c_RUBNWN += 4;\n\n            const float* weight_xc_R = weight_xc.row(num_output * 0 + q);\n            
const float* weight_xc_U = weight_xc.row(num_output * 1 + q);\n            const float* weight_xc_N = weight_xc.row(num_output * 2 + q);\n\n            const float* weight_hc_R = weight_hc.row(num_output * 0 + q);\n            const float* weight_hc_U = weight_hc.row(num_output * 1 + q);\n            const float* weight_hc_N = weight_hc.row(num_output * 2 + q);\n\n#if __ARM_NEON\n            float* weight_xc_RUN = weight_xc_data_packed_dr.row(q / 4 + q % 4);\n            float* weight_hc_RUN = weight_hc_data_packed_dr.row(q / 4 + q % 4);\n#else\n            float* weight_xc_RUN = weight_xc_data_packed_dr.row(q);\n            float* weight_hc_RUN = weight_hc_data_packed_dr.row(q);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = weight_xc_R[i];\n                weight_xc_RUN[1] = weight_xc_U[i];\n\n                weight_xc_RUN += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = weight_hc_R[i];\n                weight_hc_RUN[1] = weight_hc_U[i];\n\n                weight_hc_RUN += 2;\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = weight_xc_N[i];\n\n                weight_xc_RUN += 1;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = weight_hc_N[i];\n\n                weight_hc_RUN += 1;\n            }\n        }\n    }\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nstatic int gru(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // 2 x num_output\n#if __ARM_NEON\n    Mat 
gates(4 * 2, num_output / 4 + num_output % 4, 4u, opt.workspace_allocator);\n#else\n    Mat gates(2, num_output, 4u, opt.workspace_allocator);\n#endif\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? T - 1 - t : t;\n\n        int remain_num_output_start = 0;\n#if __ARM_NEON\n        int nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const float* x = bottom_blob.row(ti);\n\n            // gate reset update\n            const float* bias_c_RUBNWN = (const float*)bias_c + q * 4;\n\n            const float* weight_xc_RUN = weight_xc.row(q / 4);\n            const float* weight_hc_RUN = weight_hc.row(q / 4);\n\n            float32x4_t _gru_R = vld1q_f32(bias_c_RUBNWN);\n            float32x4_t _gru_U = vld1q_f32(bias_c_RUBNWN + 4);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            float32x4_t _sum4 = vdupq_n_f32(0.f);\n            float32x4_t _sum5 = vdupq_n_f32(0.f);\n            float32x4_t _sum6 = vdupq_n_f32(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = vld1q_f32(x + i);\n                float32x4_t _weight_xc_R = vld1q_f32(weight_xc_RUN);\n                float32x4_t _weight_xc_U = vld1q_f32(weight_xc_RUN + 4);\n                float32x4_t _weight_xc_R_1 = vld1q_f32(weight_xc_RUN + 8);\n                float32x4_t _weight_xc_U_1 = vld1q_f32(weight_xc_RUN + 12);\n                float32x4_t _weight_xc_R_2 = vld1q_f32(weight_xc_RUN + 16);\n                float32x4_t _weight_xc_U_2 = vld1q_f32(weight_xc_RUN + 20);\n                float32x4_t _weight_xc_R_3 = vld1q_f32(weight_xc_RUN + 
24);\n                float32x4_t _weight_xc_U_3 = vld1q_f32(weight_xc_RUN + 28);\n#if __aarch64__\n                _gru_R = vfmaq_laneq_f32(_gru_R, _weight_xc_R, _xi, 0);\n                _gru_U = vfmaq_laneq_f32(_gru_U, _weight_xc_U, _xi, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_R_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_U_1, _xi, 1);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_R_2, _xi, 2);\n                _sum4 = vfmaq_laneq_f32(_sum4, _weight_xc_U_2, _xi, 2);\n                _sum5 = vfmaq_laneq_f32(_sum5, _weight_xc_R_3, _xi, 3);\n                _sum6 = vfmaq_laneq_f32(_sum6, _weight_xc_U_3, _xi, 3);\n#else\n                _gru_R = vmlaq_lane_f32(_gru_R, _weight_xc_R, vget_low_f32(_xi), 0);\n                _gru_U = vmlaq_lane_f32(_gru_U, _weight_xc_U, vget_low_f32(_xi), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_R_1, vget_low_f32(_xi), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_U_1, vget_low_f32(_xi), 1);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_R_2, vget_high_f32(_xi), 0);\n                _sum4 = vmlaq_lane_f32(_sum4, _weight_xc_U_2, vget_high_f32(_xi), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _weight_xc_R_3, vget_high_f32(_xi), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _weight_xc_U_3, vget_high_f32(_xi), 1);\n#endif\n\n                weight_xc_RUN += 32;\n            }\n            for (; i < size; i++)\n            {\n                float xi = x[i];\n\n                float32x4_t _xi = vdupq_n_f32(xi);\n                float32x4_t _weight_xc_R = vld1q_f32(weight_xc_RUN);\n                float32x4_t _weight_xc_U = vld1q_f32(weight_xc_RUN + 4);\n                _gru_R = vmlaq_f32(_gru_R, _weight_xc_R, _xi);\n                _gru_U = vmlaq_f32(_gru_U, _weight_xc_U, _xi);\n\n                weight_xc_RUN += 8;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n     
           float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc_R = vld1q_f32(weight_hc_RUN);\n                float32x4_t _weight_hc_U = vld1q_f32(weight_hc_RUN + 4);\n                float32x4_t _weight_hc_R_1 = vld1q_f32(weight_hc_RUN + 8);\n                float32x4_t _weight_hc_U_1 = vld1q_f32(weight_hc_RUN + 12);\n                float32x4_t _weight_hc_R_2 = vld1q_f32(weight_hc_RUN + 16);\n                float32x4_t _weight_hc_U_2 = vld1q_f32(weight_hc_RUN + 20);\n                float32x4_t _weight_hc_R_3 = vld1q_f32(weight_hc_RUN + 24);\n                float32x4_t _weight_hc_U_3 = vld1q_f32(weight_hc_RUN + 28);\n#if __aarch64__\n                _gru_R = vfmaq_laneq_f32(_gru_R, _weight_hc_R, _h_cont, 0);\n                _gru_U = vfmaq_laneq_f32(_gru_U, _weight_hc_U, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_R_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_U_1, _h_cont, 1);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_R_2, _h_cont, 2);\n                _sum4 = vfmaq_laneq_f32(_sum4, _weight_hc_U_2, _h_cont, 2);\n                _sum5 = vfmaq_laneq_f32(_sum5, _weight_hc_R_3, _h_cont, 3);\n                _sum6 = vfmaq_laneq_f32(_sum6, _weight_hc_U_3, _h_cont, 3);\n#else\n                _gru_R = vmlaq_lane_f32(_gru_R, _weight_hc_R, vget_low_f32(_h_cont), 0);\n                _gru_U = vmlaq_lane_f32(_gru_U, _weight_hc_U, vget_low_f32(_h_cont), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_R_1, vget_low_f32(_h_cont), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_U_1, vget_low_f32(_h_cont), 1);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_R_2, vget_high_f32(_h_cont), 0);\n                _sum4 = vmlaq_lane_f32(_sum4, _weight_hc_U_2, vget_high_f32(_h_cont), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _weight_hc_R_3, vget_high_f32(_h_cont), 1);\n                _sum6 = 
vmlaq_lane_f32(_sum6, _weight_hc_U_3, vget_high_f32(_h_cont), 1);\n#endif\n\n                weight_hc_RUN += 32;\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_R = vld1q_f32(weight_hc_RUN);\n                float32x4_t _weight_hc_U = vld1q_f32(weight_hc_RUN + 4);\n                _gru_R = vmlaq_f32(_gru_R, _weight_hc_R, _h_cont);\n                _gru_U = vmlaq_f32(_gru_U, _weight_hc_U, _h_cont);\n\n                weight_hc_RUN += 8;\n            }\n\n            _gru_R = vaddq_f32(_gru_R, _sum1);\n            _gru_U = vaddq_f32(_gru_U, _sum2);\n            _sum3 = vaddq_f32(_sum3, _sum5);\n            _sum4 = vaddq_f32(_sum4, _sum6);\n            _gru_R = vaddq_f32(_gru_R, _sum3);\n            _gru_U = vaddq_f32(_gru_U, _sum4);\n\n            // sigmoid(R)\n            // sigmoid(U)\n            _gru_R = sigmoid_ps(_gru_R);\n            _gru_U = sigmoid_ps(_gru_U);\n\n            // gate new\n            float32x4_t _gru_N = vld1q_f32(bias_c_RUBNWN + 8);\n            _sum1 = vdupq_n_f32(0.f);\n            _sum2 = vdupq_n_f32(0.f);\n            _sum3 = vdupq_n_f32(0.f);\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc_N = vld1q_f32(weight_hc_RUN);\n                float32x4_t _weight_hc_N_1 = vld1q_f32(weight_hc_RUN + 4);\n                float32x4_t _weight_hc_N_2 = vld1q_f32(weight_hc_RUN + 8);\n                float32x4_t _weight_hc_N_3 = vld1q_f32(weight_hc_RUN + 12);\n#if __aarch64__\n                _gru_N = vfmaq_laneq_f32(_gru_N, _weight_hc_N, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_N_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_N_2, _h_cont, 2);\n                
_sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_N_3, _h_cont, 3);\n#else\n                _gru_N = vmlaq_lane_f32(_gru_N, _weight_hc_N, vget_low_f32(_h_cont), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_N_1, vget_low_f32(_h_cont), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_N_2, vget_high_f32(_h_cont), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_N_3, vget_high_f32(_h_cont), 1);\n#endif\n\n                weight_hc_RUN += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_N = vld1q_f32(weight_hc_RUN);\n                _gru_N = vmlaq_f32(_gru_N, _weight_hc_N, _h_cont);\n\n                weight_hc_RUN += 4;\n            }\n\n            _gru_N = vaddq_f32(_gru_N, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _gru_N = vaddq_f32(_gru_N, _sum2);\n\n            _gru_N = vmlaq_f32(vld1q_f32(bias_c_RUBNWN + 12), _gru_R, _gru_N);\n            _sum1 = vdupq_n_f32(0.f);\n            _sum2 = vdupq_n_f32(0.f);\n            _sum3 = vdupq_n_f32(0.f);\n\n            i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = vld1q_f32(x + i);\n                float32x4_t _weight_xc_N = vld1q_f32(weight_xc_RUN);\n                float32x4_t _weight_xc_N_1 = vld1q_f32(weight_xc_RUN + 4);\n                float32x4_t _weight_xc_N_2 = vld1q_f32(weight_xc_RUN + 8);\n                float32x4_t _weight_xc_N_3 = vld1q_f32(weight_xc_RUN + 12);\n#if __aarch64__\n                _gru_N = vfmaq_laneq_f32(_gru_N, _weight_xc_N, _xi, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_N_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_N_2, _xi, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_N_3, _xi, 3);\n#else\n                _gru_N = vmlaq_lane_f32(_gru_N, 
_weight_xc_N, vget_low_f32(_xi), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_N_1, vget_low_f32(_xi), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_N_2, vget_high_f32(_xi), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_N_3, vget_high_f32(_xi), 1);\n#endif\n\n                weight_xc_RUN += 16;\n            }\n            for (; i < size; i++)\n            {\n                float xi = x[i];\n\n                float32x4_t _xi = vdupq_n_f32(xi);\n                float32x4_t _weight_xc_N = vld1q_f32(weight_xc_RUN);\n                _gru_N = vmlaq_f32(_gru_N, _weight_xc_N, _xi);\n\n                weight_xc_RUN += 4;\n            }\n\n            _gru_N = vaddq_f32(_gru_N, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _gru_N = vaddq_f32(_gru_N, _sum2);\n\n            // tanh(N)\n            _gru_N = tanh_ps(_gru_N);\n\n            float* gates_data = gates.row(q / 4);\n\n            vst1q_f32(gates_data, _gru_U);\n            vst1q_f32(gates_data + 4, _gru_N);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const float* x = bottom_blob.row(ti);\n\n            // gate reset update\n            const float* bias_c_RUBNWN = (const float*)bias_c + q * 4;\n\n#if __ARM_NEON\n            const float* weight_xc_RUN = weight_xc.row(q / 4 + q % 4);\n            const float* weight_hc_RUN = weight_hc.row(q / 4 + q % 4);\n#else\n            const float* weight_xc_RUN = weight_xc.row(q);\n            const float* weight_hc_RUN = weight_hc.row(q);\n#endif\n\n            float R = bias_c_RUBNWN[0];\n            float U = bias_c_RUBNWN[1];\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = x[i];\n\n                R += weight_xc_RUN[0] * xi;\n                U += weight_xc_RUN[1] * xi;\n\n                weight_xc_RUN += 2;\n            
}\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                R += weight_hc_RUN[0] * h_cont;\n                U += weight_hc_RUN[1] * h_cont;\n\n                weight_hc_RUN += 2;\n            }\n\n            // sigmoid(R)\n            // sigmoid(U)\n            R = 1.f / (1.f + expf(-R));\n            U = 1.f / (1.f + expf(-U));\n\n            // gate new\n            float N = bias_c_RUBNWN[2];\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                N += weight_hc_RUN[0] * h_cont;\n\n                weight_hc_RUN += 1;\n            }\n\n            N = bias_c_RUBNWN[3] + R * N;\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = x[i];\n\n                N += weight_xc_RUN[0] * xi;\n\n                weight_xc_RUN += 1;\n            }\n\n            // tanh(N)\n            N = tanhf(N);\n\n#if __ARM_NEON\n            float* gates_data = gates.row(q / 4 + q % 4);\n#else\n            float* gates_data = gates.row(q);\n#endif\n\n            gates_data[0] = U;\n            gates_data[1] = N;\n        }\n\n        // h_t := (1 - update) .* new + update .* h_{t-1}\n        float* output_data = top_blob.row(ti);\n\n        float* hidden_ptr = hidden_state;\n\n#if __ARM_NEON\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const float* gates_data = gates.row(q / 4);\n\n            float32x4_t _gru_U = vld1q_f32(gates_data);\n            float32x4_t _gru_N = vld1q_f32(gates_data + 4);\n\n            float32x4_t _gru_H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _gru_U), _gru_N), vmulq_f32(_gru_U, vld1q_f32(hidden_ptr + q)));\n\n            vst1q_f32(hidden_ptr 
+ q, _gru_H);\n            vst1q_f32(output_data + q, _gru_H);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n#if __ARM_NEON\n            const float* gates_data = gates.row(q / 4 + q % 4);\n#else\n            const float* gates_data = gates.row(q);\n#endif\n\n            float U = gates_data[0];\n            float N = gates_data[1];\n\n            float H = (1 - U) * N + U * hidden_ptr[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = H;\n        }\n    }\n\n    return 0;\n}\n\nint GRU_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return forward_int8(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 
2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = gru(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = gru(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.0f);\n\n        {\n            int ret = gru(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    return 0;\n}\n\nint GRU_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n#if NCNN_INT8\n    if 
(int8_scale_term)\n    {\n        return forward_int8(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n    const Mat& bottom_blob = bottom_blobs[0];\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        hidden = bottom_blobs[1].clone(hidden_allocator);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = gru(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        {\n            int ret = gru(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden0, opt);\n   
         if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            int ret = gru(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        top_blobs[1] = hidden;\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic int gru_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // 2 x num_output\n#if __ARM_NEON\n    Mat gates(4 * 2, num_output / 4 + num_output % 4, 4u, opt.workspace_allocator);\n#else\n    Mat gates(2, num_output, 4u, opt.workspace_allocator);\n#endif\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? 
T - 1 - t : t;\n\n        int remain_num_output_start = 0;\n#if __ARM_NEON\n        int nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const unsigned short* x = bottom_blob.row<const unsigned short>(ti);\n\n            // gate reset update\n            const unsigned short* bias_c_RUBNWN = (const unsigned short*)bias_c + q * 4;\n\n            const unsigned short* weight_xc_RUN = weight_xc.row<const unsigned short>(q / 4);\n            const unsigned short* weight_hc_RUN = weight_hc.row<const unsigned short>(q / 4);\n\n            float32x4_t _gru_R = bfloat2float(vld1_u16(bias_c_RUBNWN));\n            float32x4_t _gru_U = bfloat2float(vld1_u16(bias_c_RUBNWN + 4));\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            float32x4_t _sum4 = vdupq_n_f32(0.f);\n            float32x4_t _sum5 = vdupq_n_f32(0.f);\n            float32x4_t _sum6 = vdupq_n_f32(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = bfloat2float(vld1_u16(x + i));\n                float32x4_t _weight_xc_R = bfloat2float(vld1_u16(weight_xc_RUN));\n                float32x4_t _weight_xc_U = bfloat2float(vld1_u16(weight_xc_RUN + 4));\n                float32x4_t _weight_xc_R_1 = bfloat2float(vld1_u16(weight_xc_RUN + 8));\n                float32x4_t _weight_xc_U_1 = bfloat2float(vld1_u16(weight_xc_RUN + 12));\n                float32x4_t _weight_xc_R_2 = bfloat2float(vld1_u16(weight_xc_RUN + 16));\n                float32x4_t _weight_xc_U_2 = bfloat2float(vld1_u16(weight_xc_RUN + 20));\n                float32x4_t _weight_xc_R_3 = bfloat2float(vld1_u16(weight_xc_RUN + 24));\n                float32x4_t 
_weight_xc_U_3 = bfloat2float(vld1_u16(weight_xc_RUN + 28));\n#if __aarch64__\n                _gru_R = vfmaq_laneq_f32(_gru_R, _weight_xc_R, _xi, 0);\n                _gru_U = vfmaq_laneq_f32(_gru_U, _weight_xc_U, _xi, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_R_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_U_1, _xi, 1);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_R_2, _xi, 2);\n                _sum4 = vfmaq_laneq_f32(_sum4, _weight_xc_U_2, _xi, 2);\n                _sum5 = vfmaq_laneq_f32(_sum5, _weight_xc_R_3, _xi, 3);\n                _sum6 = vfmaq_laneq_f32(_sum6, _weight_xc_U_3, _xi, 3);\n#else\n                _gru_R = vmlaq_lane_f32(_gru_R, _weight_xc_R, vget_low_f32(_xi), 0);\n                _gru_U = vmlaq_lane_f32(_gru_U, _weight_xc_U, vget_low_f32(_xi), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_R_1, vget_low_f32(_xi), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_U_1, vget_low_f32(_xi), 1);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_R_2, vget_high_f32(_xi), 0);\n                _sum4 = vmlaq_lane_f32(_sum4, _weight_xc_U_2, vget_high_f32(_xi), 0);\n                _sum5 = vmlaq_lane_f32(_sum5, _weight_xc_R_3, vget_high_f32(_xi), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _weight_xc_U_3, vget_high_f32(_xi), 1);\n#endif\n\n                weight_xc_RUN += 32;\n            }\n            for (; i < size; i++)\n            {\n                unsigned short xi = x[i];\n\n                float32x4_t _xi = bfloat2float(vdup_n_u16(xi));\n                float32x4_t _weight_xc_R = bfloat2float(vld1_u16(weight_xc_RUN));\n                float32x4_t _weight_xc_U = bfloat2float(vld1_u16(weight_xc_RUN + 4));\n                _gru_R = vmlaq_f32(_gru_R, _weight_xc_R, _xi);\n                _gru_U = vmlaq_f32(_gru_U, _weight_xc_U, _xi);\n\n                weight_xc_RUN += 8;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i 
+= 4)\n            {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc_R = bfloat2float(vld1_u16(weight_hc_RUN));\n                float32x4_t _weight_hc_U = bfloat2float(vld1_u16(weight_hc_RUN + 4));\n                float32x4_t _weight_hc_R_1 = bfloat2float(vld1_u16(weight_hc_RUN + 8));\n                float32x4_t _weight_hc_U_1 = bfloat2float(vld1_u16(weight_hc_RUN + 12));\n                float32x4_t _weight_hc_R_2 = bfloat2float(vld1_u16(weight_hc_RUN + 16));\n                float32x4_t _weight_hc_U_2 = bfloat2float(vld1_u16(weight_hc_RUN + 20));\n                float32x4_t _weight_hc_R_3 = bfloat2float(vld1_u16(weight_hc_RUN + 24));\n                float32x4_t _weight_hc_U_3 = bfloat2float(vld1_u16(weight_hc_RUN + 28));\n#if __aarch64__\n                _gru_R = vfmaq_laneq_f32(_gru_R, _weight_hc_R, _h_cont, 0);\n                _gru_U = vfmaq_laneq_f32(_gru_U, _weight_hc_U, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_R_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_U_1, _h_cont, 1);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_R_2, _h_cont, 2);\n                _sum4 = vfmaq_laneq_f32(_sum4, _weight_hc_U_2, _h_cont, 2);\n                _sum5 = vfmaq_laneq_f32(_sum5, _weight_hc_R_3, _h_cont, 3);\n                _sum6 = vfmaq_laneq_f32(_sum6, _weight_hc_U_3, _h_cont, 3);\n#else\n                _gru_R = vmlaq_lane_f32(_gru_R, _weight_hc_R, vget_low_f32(_h_cont), 0);\n                _gru_U = vmlaq_lane_f32(_gru_U, _weight_hc_U, vget_low_f32(_h_cont), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_R_1, vget_low_f32(_h_cont), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_U_1, vget_low_f32(_h_cont), 1);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_R_2, vget_high_f32(_h_cont), 0);\n                _sum4 = vmlaq_lane_f32(_sum4, _weight_hc_U_2, vget_high_f32(_h_cont), 0);\n     
           _sum5 = vmlaq_lane_f32(_sum5, _weight_hc_R_3, vget_high_f32(_h_cont), 1);\n                _sum6 = vmlaq_lane_f32(_sum6, _weight_hc_U_3, vget_high_f32(_h_cont), 1);\n#endif\n\n                weight_hc_RUN += 32;\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_R = bfloat2float(vld1_u16(weight_hc_RUN));\n                float32x4_t _weight_hc_U = bfloat2float(vld1_u16(weight_hc_RUN + 4));\n                _gru_R = vmlaq_f32(_gru_R, _weight_hc_R, _h_cont);\n                _gru_U = vmlaq_f32(_gru_U, _weight_hc_U, _h_cont);\n\n                weight_hc_RUN += 8;\n            }\n\n            _gru_R = vaddq_f32(_gru_R, _sum1);\n            _gru_U = vaddq_f32(_gru_U, _sum2);\n            _sum3 = vaddq_f32(_sum3, _sum5);\n            _sum4 = vaddq_f32(_sum4, _sum6);\n            _gru_R = vaddq_f32(_gru_R, _sum3);\n            _gru_U = vaddq_f32(_gru_U, _sum4);\n\n            // sigmoid(R)\n            // sigmoid(U)\n            _gru_R = sigmoid_ps(_gru_R);\n            _gru_U = sigmoid_ps(_gru_U);\n\n            // gate new\n            float32x4_t _gru_N = bfloat2float(vld1_u16(bias_c_RUBNWN + 8));\n            _sum1 = vdupq_n_f32(0.f);\n            _sum2 = vdupq_n_f32(0.f);\n            _sum3 = vdupq_n_f32(0.f);\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc_N = bfloat2float(vld1_u16(weight_hc_RUN));\n                float32x4_t _weight_hc_N_1 = bfloat2float(vld1_u16(weight_hc_RUN + 4));\n                float32x4_t _weight_hc_N_2 = bfloat2float(vld1_u16(weight_hc_RUN + 8));\n                float32x4_t _weight_hc_N_3 = bfloat2float(vld1_u16(weight_hc_RUN + 12));\n#if __aarch64__\n                _gru_N = vfmaq_laneq_f32(_gru_N, 
_weight_hc_N, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_N_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_N_2, _h_cont, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_N_3, _h_cont, 3);\n#else\n                _gru_N = vmlaq_lane_f32(_gru_N, _weight_hc_N, vget_low_f32(_h_cont), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_N_1, vget_low_f32(_h_cont), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_N_2, vget_high_f32(_h_cont), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_N_3, vget_high_f32(_h_cont), 1);\n#endif\n\n                weight_hc_RUN += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_N = bfloat2float(vld1_u16(weight_hc_RUN));\n                _gru_N = vmlaq_f32(_gru_N, _weight_hc_N, _h_cont);\n\n                weight_hc_RUN += 4;\n            }\n\n            _gru_N = vaddq_f32(_gru_N, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _gru_N = vaddq_f32(_gru_N, _sum2);\n\n            _gru_N = vmlaq_f32(bfloat2float(vld1_u16(bias_c_RUBNWN + 12)), _gru_R, _gru_N);\n            _sum1 = vdupq_n_f32(0.f);\n            _sum2 = vdupq_n_f32(0.f);\n            _sum3 = vdupq_n_f32(0.f);\n\n            i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = bfloat2float(vld1_u16(x + i));\n                float32x4_t _weight_xc_N = bfloat2float(vld1_u16(weight_xc_RUN));\n                float32x4_t _weight_xc_N_1 = bfloat2float(vld1_u16(weight_xc_RUN + 4));\n                float32x4_t _weight_xc_N_2 = bfloat2float(vld1_u16(weight_xc_RUN + 8));\n                float32x4_t _weight_xc_N_3 = bfloat2float(vld1_u16(weight_xc_RUN + 12));\n#if __aarch64__\n                _gru_N = vfmaq_laneq_f32(_gru_N, _weight_xc_N, _xi, 
0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_N_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_N_2, _xi, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_N_3, _xi, 3);\n#else\n                _gru_N = vmlaq_lane_f32(_gru_N, _weight_xc_N, vget_low_f32(_xi), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_N_1, vget_low_f32(_xi), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_N_2, vget_high_f32(_xi), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_N_3, vget_high_f32(_xi), 1);\n#endif\n\n                weight_xc_RUN += 16;\n            }\n            for (; i < size; i++)\n            {\n                unsigned short xi = x[i];\n\n                float32x4_t _xi = bfloat2float(vdup_n_u16(xi));\n                float32x4_t _weight_xc_N = bfloat2float(vld1_u16(weight_xc_RUN));\n                _gru_N = vmlaq_f32(_gru_N, _weight_xc_N, _xi);\n\n                weight_xc_RUN += 4;\n            }\n\n            _gru_N = vaddq_f32(_gru_N, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _gru_N = vaddq_f32(_gru_N, _sum2);\n\n            // tanh(N)\n            _gru_N = tanh_ps(_gru_N);\n\n            float* gates_data = gates.row(q / 4);\n\n            vst1q_f32(gates_data, _gru_U);\n            vst1q_f32(gates_data + 4, _gru_N);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const unsigned short* x = bottom_blob.row<const unsigned short>(ti);\n\n            // gate reset update\n            const unsigned short* bias_c_RUBNWN = (const unsigned short*)bias_c + q * 4;\n\n#if __ARM_NEON\n            const unsigned short* weight_xc_RUN = weight_xc.row<const unsigned short>(q / 4 + q % 4);\n            const unsigned short* weight_hc_RUN = weight_hc.row<const unsigned short>(q / 4 + q % 4);\n#else\n            const unsigned 
short* weight_xc_RUN = weight_xc.row<const unsigned short>(q);\n            const unsigned short* weight_hc_RUN = weight_hc.row<const unsigned short>(q);\n#endif\n\n            float R = bfloat16_to_float32(bias_c_RUBNWN[0]);\n            float U = bfloat16_to_float32(bias_c_RUBNWN[1]);\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = bfloat16_to_float32(x[i]);\n\n                R += bfloat16_to_float32(weight_xc_RUN[0]) * xi;\n                U += bfloat16_to_float32(weight_xc_RUN[1]) * xi;\n\n                weight_xc_RUN += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                R += bfloat16_to_float32(weight_hc_RUN[0]) * h_cont;\n                U += bfloat16_to_float32(weight_hc_RUN[1]) * h_cont;\n\n                weight_hc_RUN += 2;\n            }\n\n            // sigmoid(R)\n            // sigmoid(U)\n            R = 1.f / (1.f + expf(-R));\n            U = 1.f / (1.f + expf(-U));\n\n            // gate new\n            float N = bfloat16_to_float32(bias_c_RUBNWN[2]);\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                N += bfloat16_to_float32(weight_hc_RUN[0]) * h_cont;\n\n                weight_hc_RUN += 1;\n            }\n\n            N = bfloat16_to_float32(bias_c_RUBNWN[3]) + R * N;\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = bfloat16_to_float32(x[i]);\n\n                N += bfloat16_to_float32(weight_xc_RUN[0]) * xi;\n\n                weight_xc_RUN += 1;\n            }\n\n            // tanh(N)\n            N = tanhf(N);\n\n#if __ARM_NEON\n            float* gates_data = gates.row(q / 4 + q % 4);\n#else\n            float* gates_data = gates.row(q);\n#endif\n\n            gates_data[0] = U;\n            gates_data[1] = N;\n        }\n\n        // h_t := (1 - update) .* new + update .* 
h_{t-1}\n        unsigned short* output_data = top_blob.row<unsigned short>(ti);\n\n        float* hidden_ptr = hidden_state;\n\n#if __ARM_NEON\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const float* gates_data = gates.row(q / 4);\n\n            float32x4_t _gru_U = vld1q_f32(gates_data);\n            float32x4_t _gru_N = vld1q_f32(gates_data + 4);\n\n            float32x4_t _gru_H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _gru_U), _gru_N), vmulq_f32(_gru_U, vld1q_f32(hidden_ptr + q)));\n\n            vst1q_f32(hidden_ptr + q, _gru_H);\n            vst1_u16(output_data + q, float2bfloat(_gru_H));\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n#if __ARM_NEON\n            const float* gates_data = gates.row(q / 4 + q % 4);\n#else\n            const float* gates_data = gates.row(q);\n#endif\n\n            float U = gates_data[0];\n            float N = gates_data[1];\n\n            float H = (1 - U) * N + U * hidden_ptr[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = float32_to_bfloat16(H);\n        }\n    }\n\n    return 0;\n}\n\nint GRU_arm::create_pipeline_bf16s(const Option& opt)\n{\n    // pack RUN\n    int num_directions = direction == 2 ? 
2 : 1;\n    int size = weight_data_size / num_directions / num_output / 3;\n\n#if __ARM_NEON\n    weight_xc_data_packed.create(size * 12, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n    bias_c_data_packed.create(num_output, 1, num_directions, 8u, 4);\n    weight_hc_data_packed.create(num_output * 12, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n#else\n    weight_xc_data_packed.create(size * 3, num_output, num_directions, 2u, 1);\n    bias_c_data_packed.create(num_output, 1, num_directions, 8u, 4);\n    weight_hc_data_packed.create(num_output * 3, num_output, num_directions, 2u, 1);\n#endif\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat bias_c = bias_c_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat bias_c_data_packed_dr = bias_c_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        const float* bias_c_R = bias_c.row(0);\n        const float* bias_c_U = bias_c.row(1);\n        const float* bias_c_WN = bias_c.row(2);\n        const float* bias_c_BN = bias_c.row(3);\n\n        unsigned short* bias_c_RUBNWN = bias_c_data_packed_dr.row<unsigned short>(0);\n\n        int q = 0;\n#if __ARM_NEON\n        for (; q + 3 < num_output; q += 4)\n        {\n            vst1_u16(bias_c_RUBNWN, float2bfloat(vld1q_f32(bias_c_R + q)));\n            vst1_u16(bias_c_RUBNWN + 4, float2bfloat(vld1q_f32(bias_c_U + q)));\n            vst1_u16(bias_c_RUBNWN + 8, float2bfloat(vld1q_f32(bias_c_BN + q)));\n            vst1_u16(bias_c_RUBNWN + 12, float2bfloat(vld1q_f32(bias_c_WN + q)));\n\n            bias_c_RUBNWN += 16;\n\n            const float* weight_xc_R = weight_xc.row(num_output * 0 + q);\n            const float* weight_xc_U = 
weight_xc.row(num_output * 1 + q);\n            const float* weight_xc_N = weight_xc.row(num_output * 2 + q);\n\n            const float* weight_xc_R_1 = weight_xc.row(num_output * 0 + q + 1);\n            const float* weight_xc_U_1 = weight_xc.row(num_output * 1 + q + 1);\n            const float* weight_xc_N_1 = weight_xc.row(num_output * 2 + q + 1);\n\n            const float* weight_xc_R_2 = weight_xc.row(num_output * 0 + q + 2);\n            const float* weight_xc_U_2 = weight_xc.row(num_output * 1 + q + 2);\n            const float* weight_xc_N_2 = weight_xc.row(num_output * 2 + q + 2);\n\n            const float* weight_xc_R_3 = weight_xc.row(num_output * 0 + q + 3);\n            const float* weight_xc_U_3 = weight_xc.row(num_output * 1 + q + 3);\n            const float* weight_xc_N_3 = weight_xc.row(num_output * 2 + q + 3);\n\n            const float* weight_hc_R = weight_hc.row(num_output * 0 + q);\n            const float* weight_hc_U = weight_hc.row(num_output * 1 + q);\n            const float* weight_hc_N = weight_hc.row(num_output * 2 + q);\n\n            const float* weight_hc_R_1 = weight_hc.row(num_output * 0 + q + 1);\n            const float* weight_hc_U_1 = weight_hc.row(num_output * 1 + q + 1);\n            const float* weight_hc_N_1 = weight_hc.row(num_output * 2 + q + 1);\n\n            const float* weight_hc_R_2 = weight_hc.row(num_output * 0 + q + 2);\n            const float* weight_hc_U_2 = weight_hc.row(num_output * 1 + q + 2);\n            const float* weight_hc_N_2 = weight_hc.row(num_output * 2 + q + 2);\n\n            const float* weight_hc_R_3 = weight_hc.row(num_output * 0 + q + 3);\n            const float* weight_hc_U_3 = weight_hc.row(num_output * 1 + q + 3);\n            const float* weight_hc_N_3 = weight_hc.row(num_output * 2 + q + 3);\n\n            unsigned short* weight_xc_RUN = weight_xc_data_packed_dr.row<unsigned short>(q / 4);\n            unsigned short* weight_hc_RUN = weight_hc_data_packed_dr.row<unsigned short>(q 
/ 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = float32_to_bfloat16(weight_xc_R[i]);\n                weight_xc_RUN[1] = float32_to_bfloat16(weight_xc_R_1[i]);\n                weight_xc_RUN[2] = float32_to_bfloat16(weight_xc_R_2[i]);\n                weight_xc_RUN[3] = float32_to_bfloat16(weight_xc_R_3[i]);\n                weight_xc_RUN[4] = float32_to_bfloat16(weight_xc_U[i]);\n                weight_xc_RUN[5] = float32_to_bfloat16(weight_xc_U_1[i]);\n                weight_xc_RUN[6] = float32_to_bfloat16(weight_xc_U_2[i]);\n                weight_xc_RUN[7] = float32_to_bfloat16(weight_xc_U_3[i]);\n\n                weight_xc_RUN += 8;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = float32_to_bfloat16(weight_hc_R[i]);\n                weight_hc_RUN[1] = float32_to_bfloat16(weight_hc_R_1[i]);\n                weight_hc_RUN[2] = float32_to_bfloat16(weight_hc_R_2[i]);\n                weight_hc_RUN[3] = float32_to_bfloat16(weight_hc_R_3[i]);\n                weight_hc_RUN[4] = float32_to_bfloat16(weight_hc_U[i]);\n                weight_hc_RUN[5] = float32_to_bfloat16(weight_hc_U_1[i]);\n                weight_hc_RUN[6] = float32_to_bfloat16(weight_hc_U_2[i]);\n                weight_hc_RUN[7] = float32_to_bfloat16(weight_hc_U_3[i]);\n\n                weight_hc_RUN += 8;\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = float32_to_bfloat16(weight_xc_N[i]);\n                weight_xc_RUN[1] = float32_to_bfloat16(weight_xc_N_1[i]);\n                weight_xc_RUN[2] = float32_to_bfloat16(weight_xc_N_2[i]);\n                weight_xc_RUN[3] = float32_to_bfloat16(weight_xc_N_3[i]);\n\n                weight_xc_RUN += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = float32_to_bfloat16(weight_hc_N[i]);\n            
    weight_hc_RUN[1] = float32_to_bfloat16(weight_hc_N_1[i]);\n                weight_hc_RUN[2] = float32_to_bfloat16(weight_hc_N_2[i]);\n                weight_hc_RUN[3] = float32_to_bfloat16(weight_hc_N_3[i]);\n\n                weight_hc_RUN += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; q < num_output; q++)\n        {\n            bias_c_RUBNWN[0] = float32_to_bfloat16(bias_c_R[q]);\n            bias_c_RUBNWN[1] = float32_to_bfloat16(bias_c_U[q]);\n            bias_c_RUBNWN[2] = float32_to_bfloat16(bias_c_BN[q]);\n            bias_c_RUBNWN[3] = float32_to_bfloat16(bias_c_WN[q]);\n\n            bias_c_RUBNWN += 4;\n\n            const float* weight_xc_R = weight_xc.row(num_output * 0 + q);\n            const float* weight_xc_U = weight_xc.row(num_output * 1 + q);\n            const float* weight_xc_N = weight_xc.row(num_output * 2 + q);\n\n            const float* weight_hc_R = weight_hc.row(num_output * 0 + q);\n            const float* weight_hc_U = weight_hc.row(num_output * 1 + q);\n            const float* weight_hc_N = weight_hc.row(num_output * 2 + q);\n\n#if __ARM_NEON\n            unsigned short* weight_xc_RUN = weight_xc_data_packed_dr.row<unsigned short>(q / 4 + q % 4);\n            unsigned short* weight_hc_RUN = weight_hc_data_packed_dr.row<unsigned short>(q / 4 + q % 4);\n#else\n            unsigned short* weight_xc_RUN = weight_xc_data_packed_dr.row<unsigned short>(q);\n            unsigned short* weight_hc_RUN = weight_hc_data_packed_dr.row<unsigned short>(q);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = float32_to_bfloat16(weight_xc_R[i]);\n                weight_xc_RUN[1] = float32_to_bfloat16(weight_xc_U[i]);\n\n                weight_xc_RUN += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = float32_to_bfloat16(weight_hc_R[i]);\n                weight_hc_RUN[1] = 
float32_to_bfloat16(weight_hc_U[i]);\n\n                weight_hc_RUN += 2;\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = float32_to_bfloat16(weight_xc_N[i]);\n\n                weight_xc_RUN += 1;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = float32_to_bfloat16(weight_hc_N[i]);\n\n                weight_hc_RUN += 1;\n            }\n        }\n    }\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nint GRU_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = gru_bf16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = gru_bf16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        
}\n\n        hidden.fill(0.f);\n\n        {\n            int ret = gru_bf16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned short* pf = top_blob_forward.row<const unsigned short>(i);\n            const unsigned short* pr = top_blob_reverse.row<const unsigned short>(i);\n            unsigned short* ptr = top_blob.row<unsigned short>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(unsigned short));\n            memcpy(ptr + num_output, pr, num_output * sizeof(unsigned short));\n        }\n    }\n\n    return 0;\n}\n\nint GRU_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        Option opt_cast = opt;\n        opt_cast.blob_allocator = hidden_allocator;\n        cast_bfloat16_to_float32(bottom_blobs[1], hidden, opt_cast);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = gru_bf16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        {\n            int ret = gru_bf16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            int ret = gru_bf16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned short* pf = top_blob_forward.row<const unsigned short>(i);\n        
    const unsigned short* pr = top_blob_reverse.row<const unsigned short>(i);\n            unsigned short* ptr = top_blob.row<unsigned short>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(unsigned short));\n            memcpy(ptr + num_output, pr, num_output * sizeof(unsigned short));\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        cast_float32_to_bfloat16(hidden, top_blobs[1], opt);\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n#if NCNN_INT8\nint GRU_arm::create_pipeline_int8(const Option& opt)\n{\n    const int num_directions = direction == 2 ? 2 : 1;\n    const int size = weight_data_size / num_directions / num_output / 3;\n\n    gru_transform_weight_int8(weight_xc_data, weight_xc_data_int8_scales, weight_hc_data, weight_hc_data_int8_scales, bias_c_data, weight_data_tm, weight_data_tm_int8_descales, bias_c_data_packed, size, num_output, num_directions, opt);\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        weight_hc_data.release();\n        bias_c_data.release();\n        weight_xc_data_int8_scales.release();\n        weight_hc_data_int8_scales.release();\n    }\n\n    return 0;\n}\n\nvoid GRU_arm::dynamic_quantize(const Mat& bottom_blob, int elemtype, Mat& bottom_blob_int8, Mat& bottom_blob_int8_descales, const Option& opt) const\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    // dynamic quantize bottom_blob\n    bottom_blob_int8_descales.create(T, (size_t)4u, 1, opt.blob_allocator);\n\n    Mat bottom_blob_int8_scales(T, (size_t)4u, 1, opt.blob_allocator);\n\n    if (elemtype == 1)\n    {\n        // fp32\n        for (int t = 0; t < T; t++)\n        {\n            const float* x = bottom_blob.row(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(x[i]));\n            }\n\n            bottom_blob_int8_scales[t] = 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax 
/ 127.f;\n        }\n    }\n    if (elemtype == 2)\n    {\n        // fp16\n        for (int t = 0; t < T; t++)\n        {\n            const unsigned short* x = bottom_blob.row<const unsigned short>(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(float16_to_float32(x[i])));\n            }\n\n            bottom_blob_int8_scales[t] = 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n    if (elemtype == 4)\n    {\n        // bf16\n        for (int t = 0; t < T; t++)\n        {\n            const unsigned short* x = bottom_blob.row<const unsigned short>(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(bfloat16_to_float32(x[i])));\n            }\n\n            bottom_blob_int8_scales[t] = 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n\n    quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt);\n}\n\nint GRU_arm::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elemtype = 1; // fp32\n    {\n        int elembits = bottom_blob.elembits();\n\n        // clang-format off\n        // *INDENT-OFF*\n\n#if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        {\n            elemtype = 2; // fp16\n        }\n        else\n#endif\n#if NCNN_BF16\n        if (opt.use_bf16_storage && elembits == 16)\n        {\n            elemtype = 4; // bf16\n        }\n        else\n#endif\n        {\n            // fp32\n        }\n\n        // *INDENT-ON*\n        // clang-format on\n    }\n\n    int T = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n\n    int num_directions = direction == 2 ? 
2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8;\n    Mat bottom_blob_int8_descales;\n    {\n        Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        dynamic_quantize(bottom_blob, elemtype, bottom_blob_int8, bottom_blob_int8_descales, opt_quant);\n    }\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        gru_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, direction, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden, opt);\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            gru_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_forward, elemtype, 0, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden, opt);\n        }\n\n        hidden.fill(0.f);\n\n        {\n            gru_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_reverse, elemtype, 1, weight_data_tm.channel(1), weight_data_tm_int8_descales.channel(1), bias_c_data_packed.channel(1), hidden, opt);\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned char* pf = top_blob_forward.row<const unsigned char>(i);\n            const unsigned char* pr = 
top_blob_reverse.row<const unsigned char>(i);\n            unsigned char* ptr = top_blob.row<unsigned char>(i);\n\n            memcpy(ptr, pf, num_output * elemsize);\n            memcpy(ptr + num_output * elemsize, pr, num_output * elemsize);\n        }\n    }\n\n    return 0;\n}\n\nint GRU_arm::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n\n    int elemtype = 1; // fp32\n    {\n        int elembits = bottom_blob.elembits();\n\n        // clang-format off\n        // *INDENT-OFF*\n\n#if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        {\n            elemtype = 2; // fp16\n        }\n        else\n#endif\n#if NCNN_BF16\n        if (opt.use_bf16_storage && elembits == 16)\n        {\n            elemtype = 4; // bf16\n        }\n        else\n#endif\n        {\n            // fp32\n        }\n\n        // *INDENT-ON*\n        // clang-format on\n    }\n\n    int T = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        if (elemtype == 1)\n        {\n            hidden = bottom_blobs[1].clone(hidden_allocator);\n        }\n        if (elemtype == 2)\n        {\n            Option opt_cast = opt;\n            opt_cast.blob_allocator = hidden_allocator;\n            cast_float16_to_float32(bottom_blobs[1], hidden, opt_cast);\n        }\n        if (elemtype == 4)\n        {\n            Option opt_cast = opt;\n            opt_cast.blob_allocator = hidden_allocator;\n            cast_bfloat16_to_float32(bottom_blobs[1], hidden, opt_cast);\n        }\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8;\n    Mat bottom_blob_int8_descales;\n    {\n        Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        dynamic_quantize(bottom_blob, elemtype, bottom_blob_int8, bottom_blob_int8_descales, opt_quant);\n    }\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        gru_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, direction, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden, opt);\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = 
hidden.row_range(0, 1);\n        {\n            gru_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_forward, elemtype, 0, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden0, opt);\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            gru_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_reverse, elemtype, 1, weight_data_tm.channel(1), weight_data_tm_int8_descales.channel(1), bias_c_data_packed.channel(1), hidden1, opt);\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned char* pf = top_blob_forward.row<const unsigned char>(i);\n            const unsigned char* pr = top_blob_reverse.row<const unsigned char>(i);\n            unsigned char* ptr = top_blob.row<unsigned char>(i);\n\n            memcpy(ptr, pf, num_output * elemsize);\n            memcpy(ptr + num_output * elemsize, pr, num_output * elemsize);\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        if (elemtype == 1)\n        {\n            top_blobs[1] = hidden;\n        }\n        if (elemtype == 2)\n        {\n            cast_float32_to_float16(hidden, top_blobs[1], opt);\n        }\n        if (elemtype == 4)\n        {\n            cast_float32_to_bfloat16(hidden, top_blobs[1], opt);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gru_arm.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GRU_ARM_H\n#define LAYER_GRU_ARM_H\n\n#include \"gru.h\"\n\nnamespace ncnn {\n\nclass GRU_arm : public GRU\n{\npublic:\n    GRU_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8(const Option& opt);\n    void dynamic_quantize(const Mat& bottom_blob, int elemtype, Mat& bottom_blob_int8, Mat& bottom_blob_int8_descales, const Option& opt) const;\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n\npublic:\n    Mat weight_xc_data_packed;\n    Mat bias_c_data_packed;\n    Mat weight_hc_data_packed;\n\n    Mat weight_data_tm;\n\n#if NCNN_INT8\n    Mat weight_data_tm_int8_descales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GRU_ARM_H\n"
  },
  {
    "path": "src/layer/arm/gru_arm_asimddp.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"layer.h\"\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"gru_int8.h\"\n\nvoid gru_transform_weight_int8_asimddp(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, const Option& opt)\n{\n    gru_transform_weight_int8(weight_xc, weight_xc_int8_scales, weight_hc, weight_hc_int8_scales, bias_c, weight_data_tm, weight_data_tm_int8_descales, bias_c_tm, size, num_output, num_directions, opt);\n}\n\nvoid gru_int8_asimddp(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, Mat& hidden_state, const Option& opt)\n{\n    gru_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, reverse, weight_data_tm, weight_data_tm_int8_descales, bias_c, hidden_state, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gru_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gru_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic int gru_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // 2 x num_output\n    Mat gates(4 * 2, num_output / 4 + num_output % 4, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? T - 1 - t : t;\n\n        int nn_num_output = num_output >> 2;\n        int remain_num_output_start = nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n            // gate reset update\n            const __fp16* bias_c_RUBNWN = (const __fp16*)bias_c + q * 4;\n\n            const __fp16* weight_xc_RUN = weight_xc.row<const __fp16>(q / 4);\n            const __fp16* weight_hc_RUN = weight_hc.row<const __fp16>(q / 4);\n\n            float16x8_t _RU = vld1q_f16(bias_c_RUBNWN);\n            float16x8_t _sum1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _sum2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _sum3 = vdupq_n_f16((__fp16)0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4h}, [%0], #8       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    \"fmla   %2.8h, v0.8h, v4.h[0]   \\n\"\n                    \"fmla  
 %3.8h, v1.8h, v4.h[1]   \\n\"\n                    \"fmla   %4.8h, v2.8h, v4.h[2]   \\n\"\n                    \"fmla   %5.8h, v3.8h, v4.h[3]   \\n\"\n                    : \"=r\"(x),\n                    \"=r\"(weight_xc_RUN),\n                    \"=w\"(_RU),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3)\n                    : \"0\"(x),\n                    \"1\"(weight_xc_RUN),\n                    \"2\"(_RU),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _x = vld1_f16(x);\n                float16x8_t _w0 = vld1q_f16(weight_xc_RUN);\n                float16x8_t _w1 = vld1q_f16(weight_xc_RUN + 8);\n                float16x8_t _w2 = vld1q_f16(weight_xc_RUN + 16);\n                float16x8_t _w3 = vld1q_f16(weight_xc_RUN + 24);\n                _RU = vfmaq_lane_f16(_RU, _w0, _x, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _w1, _x, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _w2, _x, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _w3, _x, 3);\n\n                x += 4;\n                weight_xc_RUN += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < size; i++)\n            {\n                __fp16 xi = *x++;\n\n                float16x8_t _xi = vdupq_n_f16(xi);\n                float16x8_t _weight_xc_RU = vld1q_f16(weight_xc_RUN);\n                _RU = vfmaq_f16(_RU, _weight_xc_RU, _xi);\n\n                weight_xc_RUN += 8;\n            }\n\n            const float* hidden_ptr = hidden_state;\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4s}, [%0], #16      \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n  
                  \"fcvtn  v4.4h, v4.4s            \\n\"\n                    \"fmla   %2.8h, v0.8h, v4.h[0]   \\n\"\n                    \"fmla   %3.8h, v1.8h, v4.h[1]   \\n\"\n                    \"fmla   %4.8h, v2.8h, v4.h[2]   \\n\"\n                    \"fmla   %5.8h, v3.8h, v4.h[3]   \\n\"\n                    : \"=r\"(hidden_ptr),\n                    \"=r\"(weight_hc_RUN),\n                    \"=w\"(_RU),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3)\n                    : \"0\"(hidden_ptr),\n                    \"1\"(weight_hc_RUN),\n                    \"2\"(_RU),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _h_cont = vcvt_f16_f32(vld1q_f32(hidden_ptr));\n                float16x8_t _w0 = vld1q_f16(weight_hc_RUN);\n                float16x8_t _w1 = vld1q_f16(weight_hc_RUN + 8);\n                float16x8_t _w2 = vld1q_f16(weight_hc_RUN + 16);\n                float16x8_t _w3 = vld1q_f16(weight_hc_RUN + 24);\n                _RU = vfmaq_lane_f16(_RU, _w0, _h_cont, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _w1, _h_cont, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _w2, _h_cont, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _w3, _h_cont, 3);\n\n                hidden_ptr += 4;\n                weight_hc_RUN += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = *hidden_ptr++;\n\n                float16x8_t _h_cont = vdupq_n_f16((__fp16)h_cont);\n                float16x8_t _weight_hc_RU = vld1q_f16(weight_hc_RUN);\n                _RU = vfmaq_f16(_RU, _weight_hc_RU, _h_cont);\n\n                weight_hc_RUN += 8;\n            }\n\n            _RU = vaddq_f16(_RU, _sum1);\n            _sum2 = 
vaddq_f16(_sum2, _sum3);\n            _RU = vaddq_f16(_RU, _sum2);\n\n            // sigmoid(R)\n            // sigmoid(U)\n            float32x4_t _R32 = sigmoid_ps(vcvt_f32_f16(vget_low_f16(_RU)));\n            float32x4_t _U32 = sigmoid_ps(vcvt_f32_f16(vget_high_f16(_RU)));\n\n            x -= size;\n            hidden_ptr = hidden_state;\n\n            // gate new\n            float16x4_t _gru_N = vld1_f16(bias_c_RUBNWN + 8);\n            float16x4_t _sum4 = vdup_n_f16((__fp16)0.f);\n            float16x4_t _sum5 = vdup_n_f16((__fp16)0.f);\n            float16x4_t _sum6 = vdup_n_f16((__fp16)0.f);\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4s}, [%0], #16      \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                    \"fcvtn  v4.4h, v4.4s            \\n\"\n                    \"fmla   %2.4h, v0.4h, v4.h[0]   \\n\"\n                    \"fmla   %3.4h, v1.4h, v4.h[1]   \\n\"\n                    \"fmla   %4.4h, v2.4h, v4.h[2]   \\n\"\n                    \"fmla   %5.4h, v3.4h, v4.h[3]   \\n\"\n                    : \"=r\"(hidden_ptr),\n                    \"=r\"(weight_hc_RUN),\n                    \"=w\"(_gru_N),\n                    \"=w\"(_sum4),\n                    \"=w\"(_sum5),\n                    \"=w\"(_sum6)\n                    : \"0\"(hidden_ptr),\n                    \"1\"(weight_hc_RUN),\n                    \"2\"(_gru_N),\n                    \"3\"(_sum4),\n                    \"4\"(_sum5),\n                    \"5\"(_sum6)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _h_cont = vcvt_f16_f32(vld1q_f32(hidden_ptr));\n                float16x4_t _w0 = vld1_f16(weight_hc_RUN);\n                float16x4_t _w1 = vld1_f16(weight_hc_RUN + 4);\n                float16x4_t _w2 = 
vld1_f16(weight_hc_RUN + 8);\n                float16x4_t _w3 = vld1_f16(weight_hc_RUN + 12);\n                _gru_N = vfma_lane_f16(_gru_N, _w0, _h_cont, 0);\n                _sum4 = vfma_lane_f16(_sum4, _w1, _h_cont, 1);\n                _sum5 = vfma_lane_f16(_sum5, _w2, _h_cont, 2);\n                _sum6 = vfma_lane_f16(_sum6, _w3, _h_cont, 3);\n\n                hidden_ptr += 4;\n                weight_hc_RUN += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = *hidden_ptr++;\n\n                float16x4_t _h_cont = vdup_n_f16((__fp16)h_cont);\n                float16x4_t _weight_hc_N = vld1_f16(weight_hc_RUN);\n                _gru_N = vfma_f16(_gru_N, _weight_hc_N, _h_cont);\n\n                weight_hc_RUN += 4;\n            }\n\n            _gru_N = vadd_f16(_gru_N, _sum4);\n            _sum5 = vadd_f16(_sum5, _sum6);\n            _gru_N = vadd_f16(_gru_N, _sum5);\n\n            _gru_N = vfma_f16(vld1_f16(bias_c_RUBNWN + 12), vcvt_f16_f32(_R32), _gru_N);\n            _sum4 = vdup_n_f16((__fp16)0.f);\n            _sum5 = vdup_n_f16((__fp16)0.f);\n            _sum6 = vdup_n_f16((__fp16)0.f);\n\n            i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4h}, [%0], #8       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                    \"fmla   %2.4h, v0.4h, v4.h[0]   \\n\"\n                    \"fmla   %3.4h, v1.4h, v4.h[1]   \\n\"\n                    \"fmla   %4.4h, v2.4h, v4.h[2]   \\n\"\n                    \"fmla   %5.4h, v3.4h, v4.h[3]   \\n\"\n                    : \"=r\"(x),\n                    \"=r\"(weight_xc_RUN),\n                    \"=w\"(_gru_N),\n                    \"=w\"(_sum4),\n                    \"=w\"(_sum5),\n                    \"=w\"(_sum6)\n                    : \"0\"(x),\n                    
\"1\"(weight_xc_RUN),\n                    \"2\"(_gru_N),\n                    \"3\"(_sum4),\n                    \"4\"(_sum5),\n                    \"5\"(_sum6)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _x = vld1_f16(x);\n                float16x4_t _w0 = vld1_f16(weight_xc_RUN);\n                float16x4_t _w1 = vld1_f16(weight_xc_RUN + 4);\n                float16x4_t _w2 = vld1_f16(weight_xc_RUN + 8);\n                float16x4_t _w3 = vld1_f16(weight_xc_RUN + 12);\n                _gru_N = vfma_lane_f16(_gru_N, _w0, _x, 0);\n                _sum4 = vfma_lane_f16(_sum4, _w1, _x, 1);\n                _sum5 = vfma_lane_f16(_sum5, _w2, _x, 2);\n                _sum6 = vfma_lane_f16(_sum6, _w3, _x, 3);\n\n                x += 4;\n                weight_xc_RUN += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < size; i++)\n            {\n                __fp16 xi = *x++;\n\n                float16x4_t _xi = vdup_n_f16(xi);\n                float16x4_t _weight_xc_N = vld1_f16(weight_xc_RUN);\n                _gru_N = vfma_f16(_gru_N, _weight_xc_N, _xi);\n\n                weight_xc_RUN += 4;\n            }\n\n            _gru_N = vadd_f16(_gru_N, _sum4);\n            _sum5 = vadd_f16(_sum5, _sum6);\n            _gru_N = vadd_f16(_gru_N, _sum5);\n\n            // tanh(N)\n            float32x4_t _N32 = tanh_ps(vcvt_f32_f16(_gru_N));\n\n            float* gates_data = gates.row(q / 4);\n\n            vst1q_f32(gates_data, _U32);\n            vst1q_f32(gates_data + 4, _N32);\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n            // gate reset update\n            const __fp16* bias_c_RUBNWN = (const __fp16*)bias_c + q * 4;\n\n            const __fp16* weight_xc_RUN = 
weight_xc.row<const __fp16>(q / 4 + q % 4);\n            const __fp16* weight_hc_RUN = weight_hc.row<const __fp16>(q / 4 + q % 4);\n\n            __fp16 R = bias_c_RUBNWN[0];\n            __fp16 U = bias_c_RUBNWN[1];\n\n            for (int i = 0; i < size; i++)\n            {\n                __fp16 xi = x[i];\n\n                R += weight_xc_RUN[0] * xi;\n                U += weight_xc_RUN[1] * xi;\n\n                weight_xc_RUN += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                __fp16 h_cont = (__fp16)hidden_state[i];\n\n                R += weight_hc_RUN[0] * h_cont;\n                U += weight_hc_RUN[1] * h_cont;\n\n                weight_hc_RUN += 2;\n            }\n\n            // sigmoid(R)\n            // sigmoid(U)\n            float R32 = 1.f / (1.f + expf((float)-R));\n            float U32 = 1.f / (1.f + expf((float)-U));\n\n            // gate new\n            __fp16 N = bias_c_RUBNWN[2];\n\n            for (int i = 0; i < num_output; i++)\n            {\n                __fp16 h_cont = (__fp16)hidden_state[i];\n\n                N += weight_hc_RUN[0] * h_cont;\n\n                weight_hc_RUN += 1;\n            }\n\n            N = bias_c_RUBNWN[3] + (__fp16)R32 * N;\n\n            for (int i = 0; i < size; i++)\n            {\n                __fp16 xi = x[i];\n\n                N += weight_xc_RUN[0] * xi;\n\n                weight_xc_RUN += 1;\n            }\n\n            // tanh(N)\n            float N32 = tanhf((float)N);\n\n            float* gates_data = gates.row(q / 4 + q % 4);\n\n            gates_data[0] = U32;\n            gates_data[1] = N32;\n        }\n\n        // h_t := (1 - update) .* new + update .* h_{t-1}\n        __fp16* output_data = top_blob.row<__fp16>(ti);\n\n        float* hidden_ptr = hidden_state;\n\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n       
 for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const float* gates_data = gates.row(q / 4);\n\n            float32x4_t _gru_U = vld1q_f32(gates_data);\n            float32x4_t _gru_N = vld1q_f32(gates_data + 4);\n\n            float32x4_t _gru_H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _gru_U), _gru_N), vmulq_f32(_gru_U, vld1q_f32(hidden_ptr + q)));\n\n            vst1q_f32(hidden_ptr + q, _gru_H);\n            vst1_f16(output_data + q, vcvt_f16_f32(_gru_H));\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const float* gates_data = gates.row(q / 4 + q % 4);\n\n            float U = gates_data[0];\n            float N = gates_data[1];\n\n            float H = (1 - U) * N + U * hidden_ptr[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = (__fp16)H;\n        }\n    }\n\n    return 0;\n}\n\nstatic int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    if (opt.use_fp16_arithmetic)\n        return gru_fp16sa(bottom_blob, top_blob, reverse, weight_xc, bias_c, weight_hc, hidden_state, opt);\n\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // 2 x num_output\n    Mat gates(4 * 2, num_output / 4 + num_output % 4, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? 
T - 1 - t : t;\n\n        int nn_num_output = num_output >> 2;\n        int remain_num_output_start = nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n            // gate reset update\n            const __fp16* bias_c_RUBNWN = (const __fp16*)bias_c + q * 4;\n\n            const __fp16* weight_xc_RUN = weight_xc.row<const __fp16>(q / 4);\n            const __fp16* weight_hc_RUN = weight_hc.row<const __fp16>(q / 4);\n\n            float32x4_t _gru_R = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN));\n            float32x4_t _gru_U = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 4));\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n            float32x4_t _sum4 = vdupq_n_f32(0.f);\n            float32x4_t _sum5 = vdupq_n_f32(0.f);\n            float32x4_t _sum6 = vdupq_n_f32(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = vcvt_f32_f16(vld1_f16(x + i));\n                float32x4_t _weight_xc_R = vcvt_f32_f16(vld1_f16(weight_xc_RUN));\n                float32x4_t _weight_xc_U = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 4));\n                float32x4_t _weight_xc_R_1 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 8));\n                float32x4_t _weight_xc_U_1 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 12));\n                float32x4_t _weight_xc_R_2 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 16));\n                float32x4_t _weight_xc_U_2 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 20));\n                float32x4_t _weight_xc_R_3 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 24));\n                float32x4_t _weight_xc_U_3 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 28));\n                _gru_R = vfmaq_laneq_f32(_gru_R, _weight_xc_R, _xi, 
0);\n                _gru_U = vfmaq_laneq_f32(_gru_U, _weight_xc_U, _xi, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_R_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_U_1, _xi, 1);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_R_2, _xi, 2);\n                _sum4 = vfmaq_laneq_f32(_sum4, _weight_xc_U_2, _xi, 2);\n                _sum5 = vfmaq_laneq_f32(_sum5, _weight_xc_R_3, _xi, 3);\n                _sum6 = vfmaq_laneq_f32(_sum6, _weight_xc_U_3, _xi, 3);\n\n                weight_xc_RUN += 32;\n            }\n            for (; i < size; i++)\n            {\n                __fp16 xi = x[i];\n\n                float32x4_t _xi = vcvt_f32_f16(vdup_n_f16(xi));\n                float32x4_t _weight_xc_R = vcvt_f32_f16(vld1_f16(weight_xc_RUN));\n                float32x4_t _weight_xc_U = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 4));\n                _gru_R = vmlaq_f32(_gru_R, _weight_xc_R, _xi);\n                _gru_U = vmlaq_f32(_gru_U, _weight_xc_U, _xi);\n\n                weight_xc_RUN += 8;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc_R = vcvt_f32_f16(vld1_f16(weight_hc_RUN));\n                float32x4_t _weight_hc_U = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 4));\n                float32x4_t _weight_hc_R_1 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 8));\n                float32x4_t _weight_hc_U_1 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 12));\n                float32x4_t _weight_hc_R_2 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 16));\n                float32x4_t _weight_hc_U_2 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 20));\n                float32x4_t _weight_hc_R_3 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 24));\n                float32x4_t _weight_hc_U_3 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 28));\n                _gru_R = vfmaq_laneq_f32(_gru_R, 
_weight_hc_R, _h_cont, 0);\n                _gru_U = vfmaq_laneq_f32(_gru_U, _weight_hc_U, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_R_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_U_1, _h_cont, 1);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_R_2, _h_cont, 2);\n                _sum4 = vfmaq_laneq_f32(_sum4, _weight_hc_U_2, _h_cont, 2);\n                _sum5 = vfmaq_laneq_f32(_sum5, _weight_hc_R_3, _h_cont, 3);\n                _sum6 = vfmaq_laneq_f32(_sum6, _weight_hc_U_3, _h_cont, 3);\n\n                weight_hc_RUN += 32;\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_R = vcvt_f32_f16(vld1_f16(weight_hc_RUN));\n                float32x4_t _weight_hc_U = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 4));\n                _gru_R = vmlaq_f32(_gru_R, _weight_hc_R, _h_cont);\n                _gru_U = vmlaq_f32(_gru_U, _weight_hc_U, _h_cont);\n\n                weight_hc_RUN += 8;\n            }\n\n            _gru_R = vaddq_f32(_gru_R, _sum1);\n            _gru_U = vaddq_f32(_gru_U, _sum2);\n            _sum3 = vaddq_f32(_sum3, _sum5);\n            _sum4 = vaddq_f32(_sum4, _sum6);\n            _gru_R = vaddq_f32(_gru_R, _sum3);\n            _gru_U = vaddq_f32(_gru_U, _sum4);\n\n            // sigmoid(R)\n            // sigmoid(U)\n            _gru_R = sigmoid_ps(_gru_R);\n            _gru_U = sigmoid_ps(_gru_U);\n\n            // gate new\n            float32x4_t _gru_N = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 8));\n            _sum1 = vdupq_n_f32(0.f);\n            _sum2 = vdupq_n_f32(0.f);\n            _sum3 = vdupq_n_f32(0.f);\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t 
_weight_hc_N = vcvt_f32_f16(vld1_f16(weight_hc_RUN));\n                float32x4_t _weight_hc_N_1 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 4));\n                float32x4_t _weight_hc_N_2 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 8));\n                float32x4_t _weight_hc_N_3 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 12));\n                _gru_N = vfmaq_laneq_f32(_gru_N, _weight_hc_N, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_N_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_N_2, _h_cont, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_N_3, _h_cont, 3);\n\n                weight_hc_RUN += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_N = vcvt_f32_f16(vld1_f16(weight_hc_RUN));\n                _gru_N = vmlaq_f32(_gru_N, _weight_hc_N, _h_cont);\n\n                weight_hc_RUN += 4;\n            }\n\n            _gru_N = vaddq_f32(_gru_N, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _gru_N = vaddq_f32(_gru_N, _sum2);\n\n            _gru_N = vmlaq_f32(vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 12)), _gru_R, _gru_N);\n            _sum1 = vdupq_n_f32(0.f);\n            _sum2 = vdupq_n_f32(0.f);\n            _sum3 = vdupq_n_f32(0.f);\n\n            i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = vcvt_f32_f16(vld1_f16(x + i));\n                float32x4_t _weight_xc_N = vcvt_f32_f16(vld1_f16(weight_xc_RUN));\n                float32x4_t _weight_xc_N_1 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 4));\n                float32x4_t _weight_xc_N_2 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 8));\n                float32x4_t _weight_xc_N_3 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 12));\n                _gru_N = vfmaq_laneq_f32(_gru_N, _weight_xc_N, _xi, 0);\n                
_sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_N_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_N_2, _xi, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_N_3, _xi, 3);\n\n                weight_xc_RUN += 16;\n            }\n            for (; i < size; i++)\n            {\n                __fp16 xi = x[i];\n\n                float32x4_t _xi = vcvt_f32_f16(vdup_n_f16(xi));\n                float32x4_t _weight_xc_N = vcvt_f32_f16(vld1_f16(weight_xc_RUN));\n                _gru_N = vmlaq_f32(_gru_N, _weight_xc_N, _xi);\n\n                weight_xc_RUN += 4;\n            }\n\n            _gru_N = vaddq_f32(_gru_N, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _gru_N = vaddq_f32(_gru_N, _sum2);\n\n            // tanh(N)\n            _gru_N = tanh_ps(_gru_N);\n\n            float* gates_data = gates.row(q / 4);\n\n            vst1q_f32(gates_data, _gru_U);\n            vst1q_f32(gates_data + 4, _gru_N);\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n            // gate reset update\n            const __fp16* bias_c_RUBNWN = (const __fp16*)bias_c + q * 4;\n\n            const __fp16* weight_xc_RUN = weight_xc.row<const __fp16>(q / 4 + q % 4);\n            const __fp16* weight_hc_RUN = weight_hc.row<const __fp16>(q / 4 + q % 4);\n\n            float R = (float)bias_c_RUBNWN[0];\n            float U = (float)bias_c_RUBNWN[1];\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = (float)x[i];\n\n                R += (float)weight_xc_RUN[0] * xi;\n                U += (float)weight_xc_RUN[1] * xi;\n\n                weight_xc_RUN += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                R += (float)weight_hc_RUN[0] * 
h_cont;\n                U += (float)weight_hc_RUN[1] * h_cont;\n\n                weight_hc_RUN += 2;\n            }\n\n            // sigmoid(R)\n            // sigmoid(U)\n            R = 1.f / (1.f + expf(-R));\n            U = 1.f / (1.f + expf(-U));\n\n            // gate new\n            float N = (float)bias_c_RUBNWN[2];\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                N += (float)weight_hc_RUN[0] * h_cont;\n\n                weight_hc_RUN += 1;\n            }\n\n            N = (float)bias_c_RUBNWN[3] + R * N;\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = (float)x[i];\n\n                N += (float)weight_xc_RUN[0] * xi;\n\n                weight_xc_RUN += 1;\n            }\n\n            // tanh(N)\n            N = tanhf(N);\n\n            float* gates_data = gates.row(q / 4 + q % 4);\n\n            gates_data[0] = U;\n            gates_data[1] = N;\n        }\n\n        // h_t := (1 - update) .* new + update .* h_{t-1}\n        __fp16* output_data = top_blob.row<__fp16>(ti);\n\n        float* hidden_ptr = hidden_state;\n\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const float* gates_data = gates.row(q / 4);\n\n            float32x4_t _gru_U = vld1q_f32(gates_data);\n            float32x4_t _gru_N = vld1q_f32(gates_data + 4);\n\n            float32x4_t _gru_H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _gru_U), _gru_N), vmulq_f32(_gru_U, vld1q_f32(hidden_ptr + q)));\n\n            vst1q_f32(hidden_ptr + q, _gru_H);\n            vst1_f16(output_data + q, vcvt_f16_f32(_gru_H));\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < 
num_output; q++)\n        {\n            const float* gates_data = gates.row(q / 4 + q % 4);\n\n            float U = gates_data[0];\n            float N = gates_data[1];\n\n            float H = (1 - U) * N + U * hidden_ptr[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = (__fp16)H;\n        }\n    }\n\n    return 0;\n}\n\nint GRU_arm::create_pipeline_fp16s(const Option& opt)\n{\n    // pack RUN\n    int num_directions = direction == 2 ? 2 : 1;\n    int size = weight_data_size / num_directions / num_output / 3;\n\n    weight_xc_data_packed.create(size * 12, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n    bias_c_data_packed.create(num_output, 1, num_directions, 8u, 4);\n    weight_hc_data_packed.create(num_output * 12, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat bias_c = bias_c_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat bias_c_data_packed_dr = bias_c_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        const float* bias_c_R = bias_c.row(0);\n        const float* bias_c_U = bias_c.row(1);\n        const float* bias_c_WN = bias_c.row(2);\n        const float* bias_c_BN = bias_c.row(3);\n\n        __fp16* bias_c_RUBNWN = bias_c_data_packed_dr.row<__fp16>(0);\n\n        int q = 0;\n        for (; q + 3 < num_output; q += 4)\n        {\n            bias_c_RUBNWN[0] = (__fp16)bias_c_R[q];\n            bias_c_RUBNWN[1] = (__fp16)bias_c_R[q + 1];\n            bias_c_RUBNWN[2] = (__fp16)bias_c_R[q + 2];\n            bias_c_RUBNWN[3] = (__fp16)bias_c_R[q + 3];\n            bias_c_RUBNWN[4] = (__fp16)bias_c_U[q];\n            bias_c_RUBNWN[5] = 
(__fp16)bias_c_U[q + 1];\n            bias_c_RUBNWN[6] = (__fp16)bias_c_U[q + 2];\n            bias_c_RUBNWN[7] = (__fp16)bias_c_U[q + 3];\n            bias_c_RUBNWN[8] = (__fp16)bias_c_BN[q];\n            bias_c_RUBNWN[9] = (__fp16)bias_c_BN[q + 1];\n            bias_c_RUBNWN[10] = (__fp16)bias_c_BN[q + 2];\n            bias_c_RUBNWN[11] = (__fp16)bias_c_BN[q + 3];\n            bias_c_RUBNWN[12] = (__fp16)bias_c_WN[q];\n            bias_c_RUBNWN[13] = (__fp16)bias_c_WN[q + 1];\n            bias_c_RUBNWN[14] = (__fp16)bias_c_WN[q + 2];\n            bias_c_RUBNWN[15] = (__fp16)bias_c_WN[q + 3];\n\n            bias_c_RUBNWN += 16;\n\n            const float* weight_xc_R = weight_xc.row(num_output * 0 + q);\n            const float* weight_xc_U = weight_xc.row(num_output * 1 + q);\n            const float* weight_xc_N = weight_xc.row(num_output * 2 + q);\n\n            const float* weight_xc_R_1 = weight_xc.row(num_output * 0 + q + 1);\n            const float* weight_xc_U_1 = weight_xc.row(num_output * 1 + q + 1);\n            const float* weight_xc_N_1 = weight_xc.row(num_output * 2 + q + 1);\n\n            const float* weight_xc_R_2 = weight_xc.row(num_output * 0 + q + 2);\n            const float* weight_xc_U_2 = weight_xc.row(num_output * 1 + q + 2);\n            const float* weight_xc_N_2 = weight_xc.row(num_output * 2 + q + 2);\n\n            const float* weight_xc_R_3 = weight_xc.row(num_output * 0 + q + 3);\n            const float* weight_xc_U_3 = weight_xc.row(num_output * 1 + q + 3);\n            const float* weight_xc_N_3 = weight_xc.row(num_output * 2 + q + 3);\n\n            const float* weight_hc_R = weight_hc.row(num_output * 0 + q);\n            const float* weight_hc_U = weight_hc.row(num_output * 1 + q);\n            const float* weight_hc_N = weight_hc.row(num_output * 2 + q);\n\n            const float* weight_hc_R_1 = weight_hc.row(num_output * 0 + q + 1);\n            const float* weight_hc_U_1 = weight_hc.row(num_output * 1 + q + 1);\n         
   const float* weight_hc_N_1 = weight_hc.row(num_output * 2 + q + 1);\n\n            const float* weight_hc_R_2 = weight_hc.row(num_output * 0 + q + 2);\n            const float* weight_hc_U_2 = weight_hc.row(num_output * 1 + q + 2);\n            const float* weight_hc_N_2 = weight_hc.row(num_output * 2 + q + 2);\n\n            const float* weight_hc_R_3 = weight_hc.row(num_output * 0 + q + 3);\n            const float* weight_hc_U_3 = weight_hc.row(num_output * 1 + q + 3);\n            const float* weight_hc_N_3 = weight_hc.row(num_output * 2 + q + 3);\n\n            __fp16* weight_xc_RUN = weight_xc_data_packed_dr.row<__fp16>(q / 4);\n            __fp16* weight_hc_RUN = weight_hc_data_packed_dr.row<__fp16>(q / 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = (__fp16)weight_xc_R[i];\n                weight_xc_RUN[1] = (__fp16)weight_xc_R_1[i];\n                weight_xc_RUN[2] = (__fp16)weight_xc_R_2[i];\n                weight_xc_RUN[3] = (__fp16)weight_xc_R_3[i];\n                weight_xc_RUN[4] = (__fp16)weight_xc_U[i];\n                weight_xc_RUN[5] = (__fp16)weight_xc_U_1[i];\n                weight_xc_RUN[6] = (__fp16)weight_xc_U_2[i];\n                weight_xc_RUN[7] = (__fp16)weight_xc_U_3[i];\n\n                weight_xc_RUN += 8;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = (__fp16)weight_hc_R[i];\n                weight_hc_RUN[1] = (__fp16)weight_hc_R_1[i];\n                weight_hc_RUN[2] = (__fp16)weight_hc_R_2[i];\n                weight_hc_RUN[3] = (__fp16)weight_hc_R_3[i];\n                weight_hc_RUN[4] = (__fp16)weight_hc_U[i];\n                weight_hc_RUN[5] = (__fp16)weight_hc_U_1[i];\n                weight_hc_RUN[6] = (__fp16)weight_hc_U_2[i];\n                weight_hc_RUN[7] = (__fp16)weight_hc_U_3[i];\n\n                weight_hc_RUN += 8;\n            }\n\n            for (int i = 0; i < size; i++)\n      
      {\n                weight_xc_RUN[0] = (__fp16)weight_xc_N[i];\n                weight_xc_RUN[1] = (__fp16)weight_xc_N_1[i];\n                weight_xc_RUN[2] = (__fp16)weight_xc_N_2[i];\n                weight_xc_RUN[3] = (__fp16)weight_xc_N_3[i];\n\n                weight_xc_RUN += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = (__fp16)weight_hc_N[i];\n                weight_hc_RUN[1] = (__fp16)weight_hc_N_1[i];\n                weight_hc_RUN[2] = (__fp16)weight_hc_N_2[i];\n                weight_hc_RUN[3] = (__fp16)weight_hc_N_3[i];\n\n                weight_hc_RUN += 4;\n            }\n        }\n        for (; q < num_output; q++)\n        {\n            bias_c_RUBNWN[0] = (__fp16)bias_c_R[q];\n            bias_c_RUBNWN[1] = (__fp16)bias_c_U[q];\n            bias_c_RUBNWN[2] = (__fp16)bias_c_BN[q];\n            bias_c_RUBNWN[3] = (__fp16)bias_c_WN[q];\n\n            bias_c_RUBNWN += 4;\n\n            const float* weight_xc_R = weight_xc.row(num_output * 0 + q);\n            const float* weight_xc_U = weight_xc.row(num_output * 1 + q);\n            const float* weight_xc_N = weight_xc.row(num_output * 2 + q);\n\n            const float* weight_hc_R = weight_hc.row(num_output * 0 + q);\n            const float* weight_hc_U = weight_hc.row(num_output * 1 + q);\n            const float* weight_hc_N = weight_hc.row(num_output * 2 + q);\n\n            __fp16* weight_xc_RUN = weight_xc_data_packed_dr.row<__fp16>(q / 4 + q % 4);\n            __fp16* weight_hc_RUN = weight_hc_data_packed_dr.row<__fp16>(q / 4 + q % 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = (__fp16)weight_xc_R[i];\n                weight_xc_RUN[1] = (__fp16)weight_xc_U[i];\n\n                weight_xc_RUN += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = (__fp16)weight_hc_R[i];\n                
weight_hc_RUN[1] = (__fp16)weight_hc_U[i];\n\n                weight_hc_RUN += 2;\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_RUN[0] = (__fp16)weight_xc_N[i];\n\n                weight_xc_RUN += 1;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_RUN[0] = (__fp16)weight_hc_N[i];\n\n                weight_hc_RUN += 1;\n            }\n        }\n    }\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nint GRU_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = gru_fp16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = gru_fp16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        
hidden.fill(0.f);\n\n        {\n            int ret = gru_fp16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const __fp16* pf = top_blob_forward.row<const __fp16>(i);\n            const __fp16* pr = top_blob_reverse.row<const __fp16>(i);\n            __fp16* ptr = top_blob.row<__fp16>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(__fp16));\n            memcpy(ptr + num_output, pr, num_output * sizeof(__fp16));\n        }\n    }\n\n    return 0;\n}\n\nint GRU_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        Option opt_cast = opt;\n        opt_cast.blob_allocator = hidden_allocator;\n        cast_float16_to_float32(bottom_blobs[1], hidden, opt_cast);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = gru_fp16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        {\n            int ret = gru_fp16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            int ret = gru_fp16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const __fp16* pf = top_blob_forward.row<const __fp16>(i);\n            const 
__fp16* pr = top_blob_reverse.row<const __fp16>(i);\n            __fp16* ptr = top_blob.row<__fp16>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(__fp16));\n            memcpy(ptr + num_output, pr, num_output * sizeof(__fp16));\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        cast_float32_to_float16(hidden, top_blobs[1], opt);\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gru_arm_vfpv4.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"layer.h\"\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"gru_int8.h\"\n\nvoid gru_int8_gate_output_vfpv4(const Mat& gates, Mat& hidden_state, Mat& top_blob, int ti, int elemtype, const Option& opt)\n{\n    gru_int8_gate_output(gates, hidden_state, top_blob, ti, elemtype, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/gru_int8.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\nvoid gru_transform_weight_int8_asimddp(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, const Option& opt);\nvoid gru_int8_asimddp(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, Mat& hidden_state, const Option& opt);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\nvoid gru_int8_gate_output_vfpv4(const Mat& gates, Mat& hidden_state, Mat& top_blob, int ti, int elemtype, const Option& opt);\n#endif\n\nstatic void gru_transform_weight_int8(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        gru_transform_weight_int8_asimddp(weight_xc, weight_xc_int8_scales, weight_hc, weight_hc_int8_scales, bias_c, weight_data_tm, weight_data_tm_int8_descales, bias_c_tm, size, num_output, num_directions, opt);\n        return;\n    }\n#endif\n\n#if __ARM_NEON\n    weight_data_tm.create(size * 12 + num_output * 12, num_output / 4 + num_output % 4, num_directions, 1u, 1);\n    weight_data_tm_int8_descales.create(12 + 12, num_output / 4 + num_output % 4, num_directions);\n#else\n    weight_data_tm.create(size * 3 + num_output * 3, num_output, num_directions, 1u, 1);\n    
weight_data_tm_int8_descales.create(3 + 3, num_output, num_directions);\n#endif\n    bias_c_tm.create(num_output, 1, num_directions, 16u, 4);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc_dr = weight_xc.channel(dr);\n        const Mat weight_hc_dr = weight_hc.channel(dr);\n        const Mat bias_c_dr = bias_c.channel(dr);\n        const float* weight_xc_int8_scales_ptr = weight_xc_int8_scales.row(dr);\n        const float* weight_hc_int8_scales_ptr = weight_hc_int8_scales.row(dr);\n\n        Mat weight_data_tm_dr = weight_data_tm.channel(dr);\n        Mat bias_c_tm_dr = bias_c_tm.channel(dr);\n        Mat weight_data_tm_int8_descales_dr = weight_data_tm_int8_descales.channel(dr);\n\n        const float* bias_c_R = bias_c_dr.row(0);\n        const float* bias_c_U = bias_c_dr.row(1);\n        const float* bias_c_WN = bias_c_dr.row(2);\n        const float* bias_c_BN = bias_c_dr.row(3);\n\n        float* bias_c_RUBNWN = bias_c_tm_dr.row(0);\n\n        int q = 0;\n#if __ARM_NEON\n        for (; q + 3 < num_output; q += 4)\n        {\n            vst1q_f32(bias_c_RUBNWN, vld1q_f32(bias_c_R + q));\n            vst1q_f32(bias_c_RUBNWN + 4, vld1q_f32(bias_c_U + q));\n            vst1q_f32(bias_c_RUBNWN + 8, vld1q_f32(bias_c_BN + q));\n            vst1q_f32(bias_c_RUBNWN + 12, vld1q_f32(bias_c_WN + q));\n\n            bias_c_RUBNWN += 16;\n\n            const signed char* weight_xc_R_0 = weight_xc_dr.row<const signed char>(num_output * 0 + q);\n            const signed char* weight_xc_U_0 = weight_xc_dr.row<const signed char>(num_output * 1 + q);\n            const signed char* weight_xc_N_0 = weight_xc_dr.row<const signed char>(num_output * 2 + q);\n\n            const signed char* weight_xc_R_1 = weight_xc_dr.row<const signed char>(num_output * 0 + q + 1);\n            const signed char* weight_xc_U_1 = weight_xc_dr.row<const signed char>(num_output * 1 + q + 1);\n          
  const signed char* weight_xc_N_1 = weight_xc_dr.row<const signed char>(num_output * 2 + q + 1);\n\n            const signed char* weight_xc_R_2 = weight_xc_dr.row<const signed char>(num_output * 0 + q + 2);\n            const signed char* weight_xc_U_2 = weight_xc_dr.row<const signed char>(num_output * 1 + q + 2);\n            const signed char* weight_xc_N_2 = weight_xc_dr.row<const signed char>(num_output * 2 + q + 2);\n\n            const signed char* weight_xc_R_3 = weight_xc_dr.row<const signed char>(num_output * 0 + q + 3);\n            const signed char* weight_xc_U_3 = weight_xc_dr.row<const signed char>(num_output * 1 + q + 3);\n            const signed char* weight_xc_N_3 = weight_xc_dr.row<const signed char>(num_output * 2 + q + 3);\n\n            const signed char* weight_hc_R_0 = weight_hc_dr.row<const signed char>(num_output * 0 + q);\n            const signed char* weight_hc_U_0 = weight_hc_dr.row<const signed char>(num_output * 1 + q);\n            const signed char* weight_hc_N_0 = weight_hc_dr.row<const signed char>(num_output * 2 + q);\n\n            const signed char* weight_hc_R_1 = weight_hc_dr.row<const signed char>(num_output * 0 + q + 1);\n            const signed char* weight_hc_U_1 = weight_hc_dr.row<const signed char>(num_output * 1 + q + 1);\n            const signed char* weight_hc_N_1 = weight_hc_dr.row<const signed char>(num_output * 2 + q + 1);\n\n            const signed char* weight_hc_R_2 = weight_hc_dr.row<const signed char>(num_output * 0 + q + 2);\n            const signed char* weight_hc_U_2 = weight_hc_dr.row<const signed char>(num_output * 1 + q + 2);\n            const signed char* weight_hc_N_2 = weight_hc_dr.row<const signed char>(num_output * 2 + q + 2);\n\n            const signed char* weight_hc_R_3 = weight_hc_dr.row<const signed char>(num_output * 0 + q + 3);\n            const signed char* weight_hc_U_3 = weight_hc_dr.row<const signed char>(num_output * 1 + q + 3);\n            const signed char* weight_hc_N_3 = 
weight_hc_dr.row<const signed char>(num_output * 2 + q + 3);\n\n            signed char* kptr = weight_data_tm_dr.row<signed char>(q / 4);\n            float* descales_ptr = weight_data_tm_int8_descales_dr.row(q / 4);\n\n            int i = 0;\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n                kptr[0] = weight_xc_R_0[i];\n                kptr[1] = weight_xc_R_0[i + 1];\n                kptr[2] = weight_xc_R_0[i + 2];\n                kptr[3] = weight_xc_R_0[i + 3];\n                kptr[4] = weight_xc_R_1[i];\n                kptr[5] = weight_xc_R_1[i + 1];\n                kptr[6] = weight_xc_R_1[i + 2];\n                kptr[7] = weight_xc_R_1[i + 3];\n                kptr[8 + 0] = weight_xc_R_2[i];\n                kptr[8 + 1] = weight_xc_R_2[i + 1];\n                kptr[8 + 2] = weight_xc_R_2[i + 2];\n                kptr[8 + 3] = weight_xc_R_2[i + 3];\n                kptr[8 + 4] = weight_xc_R_3[i];\n                kptr[8 + 5] = weight_xc_R_3[i + 1];\n                kptr[8 + 6] = weight_xc_R_3[i + 2];\n                kptr[8 + 7] = weight_xc_R_3[i + 3];\n                kptr[16 + 0] = weight_xc_U_0[i];\n                kptr[16 + 1] = weight_xc_U_0[i + 1];\n                kptr[16 + 2] = weight_xc_U_0[i + 2];\n                kptr[16 + 3] = weight_xc_U_0[i + 3];\n                kptr[16 + 4] = weight_xc_U_1[i];\n                kptr[16 + 5] = weight_xc_U_1[i + 1];\n                kptr[16 + 6] = weight_xc_U_1[i + 2];\n                kptr[16 + 7] = weight_xc_U_1[i + 3];\n                kptr[24 + 0] = weight_xc_U_2[i];\n                kptr[24 + 1] = weight_xc_U_2[i + 1];\n                kptr[24 + 2] = weight_xc_U_2[i + 2];\n                kptr[24 + 3] = weight_xc_U_2[i + 3];\n                kptr[24 + 4] = weight_xc_U_3[i];\n                kptr[24 + 5] = weight_xc_U_3[i + 1];\n                kptr[24 + 6] = weight_xc_U_3[i + 2];\n                kptr[24 + 7] = weight_xc_U_3[i + 3];\n\n                kptr 
+= 32;\n            }\n#else\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _w0 = vld1_s8(weight_xc_R_0 + i);\n                int8x8_t _w1 = vld1_s8(weight_xc_R_1 + i);\n                int8x8_t _w2 = vld1_s8(weight_xc_R_2 + i);\n                int8x8_t _w3 = vld1_s8(weight_xc_R_3 + i);\n                int8x8_t _w4 = vld1_s8(weight_xc_U_0 + i);\n                int8x8_t _w5 = vld1_s8(weight_xc_U_1 + i);\n                int8x8_t _w6 = vld1_s8(weight_xc_U_2 + i);\n                int8x8_t _w7 = vld1_s8(weight_xc_U_3 + i);\n\n                int32x2x2_t _t0 = vtrn_s32(vreinterpret_s32_s8(_w0), vreinterpret_s32_s8(_w4));\n                int32x2x2_t _t1 = vtrn_s32(vreinterpret_s32_s8(_w1), vreinterpret_s32_s8(_w5));\n                int32x2x2_t _t2 = vtrn_s32(vreinterpret_s32_s8(_w2), vreinterpret_s32_s8(_w6));\n                int32x2x2_t _t3 = vtrn_s32(vreinterpret_s32_s8(_w3), vreinterpret_s32_s8(_w7));\n\n                int32x4x4_t _w;\n                _w.val[0] = vcombine_s32(_t0.val[0], _t0.val[1]);\n                _w.val[1] = vcombine_s32(_t1.val[0], _t1.val[1]);\n                _w.val[2] = vcombine_s32(_t2.val[0], _t2.val[1]);\n                _w.val[3] = vcombine_s32(_t3.val[0], _t3.val[1]);\n\n                vst4q_s32((int*)kptr, _w);\n\n                kptr += 64;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < size; i += 2)\n            {\n                kptr[0] = weight_xc_R_0[i];\n                kptr[1] = weight_xc_R_0[i + 1];\n                kptr[2] = weight_xc_R_1[i];\n                kptr[3] = weight_xc_R_1[i + 1];\n                kptr[4] = weight_xc_R_2[i];\n                kptr[5] = weight_xc_R_2[i + 1];\n                kptr[6] = weight_xc_R_3[i];\n                kptr[7] = weight_xc_R_3[i + 1];\n                kptr[8 + 0] = weight_xc_U_0[i];\n                kptr[8 + 1] = weight_xc_U_0[i + 1];\n                kptr[8 + 2] = weight_xc_U_1[i];\n                kptr[8 + 3] = 
weight_xc_U_1[i + 1];\n                kptr[8 + 4] = weight_xc_U_2[i];\n                kptr[8 + 5] = weight_xc_U_2[i + 1];\n                kptr[8 + 6] = weight_xc_U_3[i];\n                kptr[8 + 7] = weight_xc_U_3[i + 1];\n\n                kptr += 16;\n            }\n            for (; i < size; i++)\n            {\n                kptr[0] = weight_xc_R_0[i];\n                kptr[1] = weight_xc_R_1[i];\n                kptr[2] = weight_xc_R_2[i];\n                kptr[3] = weight_xc_R_3[i];\n                kptr[4] = weight_xc_U_0[i];\n                kptr[5] = weight_xc_U_1[i];\n                kptr[6] = weight_xc_U_2[i];\n                kptr[7] = weight_xc_U_3[i];\n\n                kptr += 8;\n            }\n\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n                kptr[0] = weight_hc_R_0[i];\n                kptr[1] = weight_hc_R_0[i + 1];\n                kptr[2] = weight_hc_R_0[i + 2];\n                kptr[3] = weight_hc_R_0[i + 3];\n                kptr[4] = weight_hc_R_1[i];\n                kptr[5] = weight_hc_R_1[i + 1];\n                kptr[6] = weight_hc_R_1[i + 2];\n                kptr[7] = weight_hc_R_1[i + 3];\n                kptr[8 + 0] = weight_hc_R_2[i];\n                kptr[8 + 1] = weight_hc_R_2[i + 1];\n                kptr[8 + 2] = weight_hc_R_2[i + 2];\n                kptr[8 + 3] = weight_hc_R_2[i + 3];\n                kptr[8 + 4] = weight_hc_R_3[i];\n                kptr[8 + 5] = weight_hc_R_3[i + 1];\n                kptr[8 + 6] = weight_hc_R_3[i + 2];\n                kptr[8 + 7] = weight_hc_R_3[i + 3];\n                kptr[16 + 0] = weight_hc_U_0[i];\n                kptr[16 + 1] = weight_hc_U_0[i + 1];\n                kptr[16 + 2] = weight_hc_U_0[i + 2];\n                kptr[16 + 3] = weight_hc_U_0[i + 3];\n                kptr[16 + 4] = weight_hc_U_1[i];\n                kptr[16 + 5] = weight_hc_U_1[i + 1];\n                kptr[16 + 6] = 
weight_hc_U_1[i + 2];\n                kptr[16 + 7] = weight_hc_U_1[i + 3];\n                kptr[24 + 0] = weight_hc_U_2[i];\n                kptr[24 + 1] = weight_hc_U_2[i + 1];\n                kptr[24 + 2] = weight_hc_U_2[i + 2];\n                kptr[24 + 3] = weight_hc_U_2[i + 3];\n                kptr[24 + 4] = weight_hc_U_3[i];\n                kptr[24 + 5] = weight_hc_U_3[i + 1];\n                kptr[24 + 6] = weight_hc_U_3[i + 2];\n                kptr[24 + 7] = weight_hc_U_3[i + 3];\n\n                kptr += 32;\n            }\n#else\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _w0 = vld1_s8(weight_hc_R_0 + i);\n                int8x8_t _w1 = vld1_s8(weight_hc_R_1 + i);\n                int8x8_t _w2 = vld1_s8(weight_hc_R_2 + i);\n                int8x8_t _w3 = vld1_s8(weight_hc_R_3 + i);\n                int8x8_t _w4 = vld1_s8(weight_hc_U_0 + i);\n                int8x8_t _w5 = vld1_s8(weight_hc_U_1 + i);\n                int8x8_t _w6 = vld1_s8(weight_hc_U_2 + i);\n                int8x8_t _w7 = vld1_s8(weight_hc_U_3 + i);\n\n                int32x2x2_t _t0 = vtrn_s32(vreinterpret_s32_s8(_w0), vreinterpret_s32_s8(_w4));\n                int32x2x2_t _t1 = vtrn_s32(vreinterpret_s32_s8(_w1), vreinterpret_s32_s8(_w5));\n                int32x2x2_t _t2 = vtrn_s32(vreinterpret_s32_s8(_w2), vreinterpret_s32_s8(_w6));\n                int32x2x2_t _t3 = vtrn_s32(vreinterpret_s32_s8(_w3), vreinterpret_s32_s8(_w7));\n\n                int32x4x4_t _w;\n                _w.val[0] = vcombine_s32(_t0.val[0], _t0.val[1]);\n                _w.val[1] = vcombine_s32(_t1.val[0], _t1.val[1]);\n                _w.val[2] = vcombine_s32(_t2.val[0], _t2.val[1]);\n                _w.val[3] = vcombine_s32(_t3.val[0], _t3.val[1]);\n\n                vst4q_s32((int*)kptr, _w);\n\n                kptr += 64;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < num_output; i += 2)\n            {\n                
kptr[0] = weight_hc_R_0[i];\n                kptr[1] = weight_hc_R_0[i + 1];\n                kptr[2] = weight_hc_R_1[i];\n                kptr[3] = weight_hc_R_1[i + 1];\n                kptr[4] = weight_hc_R_2[i];\n                kptr[5] = weight_hc_R_2[i + 1];\n                kptr[6] = weight_hc_R_3[i];\n                kptr[7] = weight_hc_R_3[i + 1];\n                kptr[8 + 0] = weight_hc_U_0[i];\n                kptr[8 + 1] = weight_hc_U_0[i + 1];\n                kptr[8 + 2] = weight_hc_U_1[i];\n                kptr[8 + 3] = weight_hc_U_1[i + 1];\n                kptr[8 + 4] = weight_hc_U_2[i];\n                kptr[8 + 5] = weight_hc_U_2[i + 1];\n                kptr[8 + 6] = weight_hc_U_3[i];\n                kptr[8 + 7] = weight_hc_U_3[i + 1];\n\n                kptr += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                kptr[0] = weight_hc_R_0[i];\n                kptr[1] = weight_hc_R_1[i];\n                kptr[2] = weight_hc_R_2[i];\n                kptr[3] = weight_hc_R_3[i];\n                kptr[4] = weight_hc_U_0[i];\n                kptr[5] = weight_hc_U_1[i];\n                kptr[6] = weight_hc_U_2[i];\n                kptr[7] = weight_hc_U_3[i];\n\n                kptr += 8;\n            }\n\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n                kptr[0] = weight_hc_N_0[i];\n                kptr[1] = weight_hc_N_0[i + 1];\n                kptr[2] = weight_hc_N_0[i + 2];\n                kptr[3] = weight_hc_N_0[i + 3];\n                kptr[4] = weight_hc_N_1[i];\n                kptr[5] = weight_hc_N_1[i + 1];\n                kptr[6] = weight_hc_N_1[i + 2];\n                kptr[7] = weight_hc_N_1[i + 3];\n                kptr[8 + 0] = weight_hc_N_2[i];\n                kptr[8 + 1] = weight_hc_N_2[i + 1];\n                kptr[8 + 2] = weight_hc_N_2[i + 2];\n                kptr[8 + 3] = weight_hc_N_2[i + 3];\n                
kptr[8 + 4] = weight_hc_N_3[i];\n                kptr[8 + 5] = weight_hc_N_3[i + 1];\n                kptr[8 + 6] = weight_hc_N_3[i + 2];\n                kptr[8 + 7] = weight_hc_N_3[i + 3];\n\n                kptr += 16;\n            }\n#else\n            for (; i + 7 < num_output; i += 8)\n            {\n                vst1_s8(kptr, vld1_s8(weight_hc_N_0 + i));\n                vst1_s8(kptr + 8, vld1_s8(weight_hc_N_1 + i));\n                vst1_s8(kptr + 16, vld1_s8(weight_hc_N_2 + i));\n                vst1_s8(kptr + 24, vld1_s8(weight_hc_N_3 + i));\n                kptr += 32;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < num_output; i += 2)\n            {\n                kptr[0] = weight_hc_N_0[i];\n                kptr[1] = weight_hc_N_0[i + 1];\n                kptr[2] = weight_hc_N_1[i];\n                kptr[3] = weight_hc_N_1[i + 1];\n                kptr[4] = weight_hc_N_2[i];\n                kptr[5] = weight_hc_N_2[i + 1];\n                kptr[6] = weight_hc_N_3[i];\n                kptr[7] = weight_hc_N_3[i + 1];\n\n                kptr += 8;\n            }\n            for (; i < num_output; i++)\n            {\n                kptr[0] = weight_hc_N_0[i];\n                kptr[1] = weight_hc_N_1[i];\n                kptr[2] = weight_hc_N_2[i];\n                kptr[3] = weight_hc_N_3[i];\n\n                kptr += 4;\n            }\n\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n                kptr[0] = weight_xc_N_0[i];\n                kptr[1] = weight_xc_N_0[i + 1];\n                kptr[2] = weight_xc_N_0[i + 2];\n                kptr[3] = weight_xc_N_0[i + 3];\n                kptr[4] = weight_xc_N_1[i];\n                kptr[5] = weight_xc_N_1[i + 1];\n                kptr[6] = weight_xc_N_1[i + 2];\n                kptr[7] = weight_xc_N_1[i + 3];\n                kptr[8 + 0] = weight_xc_N_2[i];\n                kptr[8 + 1] = weight_xc_N_2[i + 1];\n    
            kptr[8 + 2] = weight_xc_N_2[i + 2];\n                kptr[8 + 3] = weight_xc_N_2[i + 3];\n                kptr[8 + 4] = weight_xc_N_3[i];\n                kptr[8 + 5] = weight_xc_N_3[i + 1];\n                kptr[8 + 6] = weight_xc_N_3[i + 2];\n                kptr[8 + 7] = weight_xc_N_3[i + 3];\n\n                kptr += 16;\n            }\n#else\n            for (; i + 7 < size; i += 8)\n            {\n                vst1_s8(kptr, vld1_s8(weight_xc_N_0 + i));\n                vst1_s8(kptr + 8, vld1_s8(weight_xc_N_1 + i));\n                vst1_s8(kptr + 16, vld1_s8(weight_xc_N_2 + i));\n                vst1_s8(kptr + 24, vld1_s8(weight_xc_N_3 + i));\n                kptr += 32;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < size; i += 2)\n            {\n                kptr[0] = weight_xc_N_0[i];\n                kptr[1] = weight_xc_N_0[i + 1];\n                kptr[2] = weight_xc_N_1[i];\n                kptr[3] = weight_xc_N_1[i + 1];\n                kptr[4] = weight_xc_N_2[i];\n                kptr[5] = weight_xc_N_2[i + 1];\n                kptr[6] = weight_xc_N_3[i];\n                kptr[7] = weight_xc_N_3[i + 1];\n\n                kptr += 8;\n            }\n            for (; i < size; i++)\n            {\n                kptr[0] = weight_xc_N_0[i];\n                kptr[1] = weight_xc_N_1[i];\n                kptr[2] = weight_xc_N_2[i];\n                kptr[3] = weight_xc_N_3[i];\n\n                kptr += 4;\n            }\n\n            float32x4_t _xc_R0 = vld1q_f32(weight_xc_int8_scales_ptr + q);\n            float32x4_t _xc_U0 = vld1q_f32(weight_xc_int8_scales_ptr + num_output + q);\n            float32x4_t _xc_N0 = vld1q_f32(weight_xc_int8_scales_ptr + num_output * 2 + q);\n            float32x4_t _hc_R0 = vld1q_f32(weight_hc_int8_scales_ptr + q);\n            float32x4_t _hc_U0 = vld1q_f32(weight_hc_int8_scales_ptr + num_output + q);\n            float32x4_t _hc_N0 = vld1q_f32(weight_hc_int8_scales_ptr + 
num_output * 2 + q);\n\n#if __aarch64__\n            float32x4_t _one = vdupq_n_f32(1.f);\n            float32x4_t _reciprocal_xc_R0 = vdivq_f32(_one, _xc_R0);\n            float32x4_t _reciprocal_xc_U0 = vdivq_f32(_one, _xc_U0);\n            float32x4_t _reciprocal_xc_N0 = vdivq_f32(_one, _xc_N0);\n            float32x4_t _reciprocal_hc_R0 = vdivq_f32(_one, _hc_R0);\n            float32x4_t _reciprocal_hc_U0 = vdivq_f32(_one, _hc_U0);\n            float32x4_t _reciprocal_hc_N0 = vdivq_f32(_one, _hc_N0);\n#else\n            float32x4_t _reciprocal_xc_R0 = vrecpeq_f32(_xc_R0);\n            float32x4_t _reciprocal_xc_U0 = vrecpeq_f32(_xc_U0);\n            float32x4_t _reciprocal_xc_N0 = vrecpeq_f32(_xc_N0);\n            _reciprocal_xc_R0 = vmulq_f32(vrecpsq_f32(_xc_R0, _reciprocal_xc_R0), _reciprocal_xc_R0);\n            _reciprocal_xc_U0 = vmulq_f32(vrecpsq_f32(_xc_U0, _reciprocal_xc_U0), _reciprocal_xc_U0);\n            _reciprocal_xc_N0 = vmulq_f32(vrecpsq_f32(_xc_N0, _reciprocal_xc_N0), _reciprocal_xc_N0);\n            _reciprocal_xc_R0 = vmulq_f32(vrecpsq_f32(_xc_R0, _reciprocal_xc_R0), _reciprocal_xc_R0);\n            _reciprocal_xc_U0 = vmulq_f32(vrecpsq_f32(_xc_U0, _reciprocal_xc_U0), _reciprocal_xc_U0);\n            _reciprocal_xc_N0 = vmulq_f32(vrecpsq_f32(_xc_N0, _reciprocal_xc_N0), _reciprocal_xc_N0);\n            float32x4_t _reciprocal_hc_R0 = vrecpeq_f32(_hc_R0);\n            float32x4_t _reciprocal_hc_U0 = vrecpeq_f32(_hc_U0);\n            float32x4_t _reciprocal_hc_N0 = vrecpeq_f32(_hc_N0);\n            _reciprocal_hc_R0 = vmulq_f32(vrecpsq_f32(_hc_R0, _reciprocal_hc_R0), _reciprocal_hc_R0);\n            _reciprocal_hc_U0 = vmulq_f32(vrecpsq_f32(_hc_U0, _reciprocal_hc_U0), _reciprocal_hc_U0);\n            _reciprocal_hc_N0 = vmulq_f32(vrecpsq_f32(_hc_N0, _reciprocal_hc_N0), _reciprocal_hc_N0);\n            _reciprocal_hc_R0 = vmulq_f32(vrecpsq_f32(_hc_R0, _reciprocal_hc_R0), _reciprocal_hc_R0);\n            _reciprocal_hc_U0 = 
vmulq_f32(vrecpsq_f32(_hc_U0, _reciprocal_hc_U0), _reciprocal_hc_U0);\n            _reciprocal_hc_N0 = vmulq_f32(vrecpsq_f32(_hc_N0, _reciprocal_hc_N0), _reciprocal_hc_N0);\n#endif\n\n            vst1q_f32(descales_ptr, _reciprocal_xc_R0);\n            vst1q_f32(descales_ptr + 4, _reciprocal_xc_U0);\n            vst1q_f32(descales_ptr + 8, _reciprocal_hc_R0);\n            vst1q_f32(descales_ptr + 12, _reciprocal_hc_U0);\n            vst1q_f32(descales_ptr + 16, _reciprocal_hc_N0);\n            vst1q_f32(descales_ptr + 20, _reciprocal_xc_N0);\n        }\n#endif // __ARM_NEON\n        for (; q < num_output; q++)\n        {\n            bias_c_RUBNWN[0] = bias_c_R[q];\n            bias_c_RUBNWN[1] = bias_c_U[q];\n            bias_c_RUBNWN[2] = bias_c_BN[q];\n            bias_c_RUBNWN[3] = bias_c_WN[q];\n\n            bias_c_RUBNWN += 4;\n\n            const signed char* weight_xc_R = weight_xc_dr.row<const signed char>(num_output * 0 + q);\n            const signed char* weight_xc_U = weight_xc_dr.row<const signed char>(num_output * 1 + q);\n            const signed char* weight_xc_N = weight_xc_dr.row<const signed char>(num_output * 2 + q);\n\n            const signed char* weight_hc_R = weight_hc_dr.row<const signed char>(num_output * 0 + q);\n            const signed char* weight_hc_U = weight_hc_dr.row<const signed char>(num_output * 1 + q);\n            const signed char* weight_hc_N = weight_hc_dr.row<const signed char>(num_output * 2 + q);\n\n#if __ARM_NEON\n            signed char* kptr = weight_data_tm_dr.row<signed char>(q / 4 + q % 4);\n            float* descales_ptr = weight_data_tm_int8_descales_dr.row(q / 4 + q % 4);\n#else\n            signed char* kptr = weight_data_tm_dr.row<signed char>(q);\n            float* descales_ptr = weight_data_tm_int8_descales_dr.row(q);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < size; i++)\n            {\n                kptr[0] = weight_xc_R[i];\n                kptr[1] = weight_xc_U[i];\n                
kptr += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                kptr[0] = weight_hc_R[i];\n                kptr[1] = weight_hc_U[i];\n                kptr += 2;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                kptr[0] = weight_hc_N[i];\n                kptr += 1;\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                kptr[0] = weight_xc_N[i];\n                kptr += 1;\n            }\n\n            descales_ptr[0] = 1.f / weight_xc_int8_scales_ptr[num_output * 0 + q];\n            descales_ptr[1] = 1.f / weight_xc_int8_scales_ptr[num_output * 1 + q];\n            descales_ptr[2] = 1.f / weight_hc_int8_scales_ptr[num_output * 0 + q];\n            descales_ptr[3] = 1.f / weight_hc_int8_scales_ptr[num_output * 1 + q];\n            descales_ptr[4] = 1.f / weight_hc_int8_scales_ptr[num_output * 2 + q];\n            descales_ptr[5] = 1.f / weight_xc_int8_scales_ptr[num_output * 2 + q];\n        }\n    }\n}\n\nstatic void gru_int8_gate_output(const Mat& gates, Mat& hidden_state, Mat& top_blob, int ti, int elemtype, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\n    if (ncnn::cpu_support_arm_vfpv4())\n    {\n        gru_int8_gate_output_vfpv4(gates, hidden_state, top_blob, ti, elemtype, opt);\n        return;\n    }\n#endif\n\n    const int num_output = top_blob.w;\n\n    // h_t := (1 - update) .* new + update .* h_{t-1}\n    float* output_data = top_blob.row(ti);\n\n    float* hidden_ptr = hidden_state;\n\n    int remain_num_output_start = 0;\n#if __ARM_NEON\n    int nn_num_output = num_output >> 2;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int qq = 0; qq < nn_num_output; qq++)\n    {\n        int q = qq * 4;\n\n        const float* gates_data = gates.row(q / 4);\n\n        float32x4_t _gru_U0 = vld1q_f32(gates_data);\n        float32x4_t _gru_N0 = vld1q_f32(gates_data 
+ 4);\n\n        float32x4_t _gru_H0 = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _gru_U0), _gru_N0), vmulq_f32(_gru_U0, vld1q_f32(hidden_ptr + q)));\n\n        vst1q_f32(hidden_ptr + q, _gru_H0);\n\n        if (elemtype == 1)\n        {\n            // fp32\n            vst1q_f32(output_data + q, _gru_H0);\n        }\n        if (elemtype == 2)\n        {\n            // fp16\n            unsigned short* outptr = (unsigned short*)output_data + q;\n#if (__ARM_FP & 2)\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"fcvtn  v0.4h, %2.4s        \\n\"\n                \"st1    {v0.4h}, [%0]       \\n\"\n                : \"=r\"(outptr) // %0\n                : \"0\"(outptr),\n                \"w\"(_gru_H0)\n                : \"memory\", \"v0\");\n#else  // __aarch64__\n            asm volatile(\n                \"vcvt.f16.f32 d0, %q2       \\n\"\n                \"vst1.u16   {d0}, [%0]      \\n\"\n                : \"=r\"(outptr) // %0\n                : \"0\"(outptr),\n                \"w\"(_gru_H0)\n                : \"memory\", \"q0\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            vst1_u16(outptr, (uint16x4_t)vcvt_f16_f32(_gru_H0));\n#endif // NCNN_GNU_INLINE_ASM\n#else\n            // outptr already points at element q, so index it from 0\n            outptr[0] = float32_to_float16(hidden_ptr[q]);\n            outptr[1] = float32_to_float16(hidden_ptr[q + 1]);\n            outptr[2] = float32_to_float16(hidden_ptr[q + 2]);\n            outptr[3] = float32_to_float16(hidden_ptr[q + 3]);\n#endif // (__ARM_FP & 2)\n        }\n        if (elemtype == 4)\n        {\n            // bf16\n            vst1_u16((unsigned short*)output_data + q, float2bfloat(_gru_H0));\n        }\n    }\n    remain_num_output_start += nn_num_output << 2;\n#endif // __ARM_NEON\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = remain_num_output_start; q < num_output; q++)\n    {\n#if __ARM_NEON\n        const float* gates_data = gates.row(q / 4 + q % 
4);\n#else\n        const float* gates_data = gates.row(q);\n#endif\n\n        float U = gates_data[0];\n        float N = gates_data[1];\n\n        float H = (1 - U) * N + U * hidden_ptr[q];\n\n        hidden_ptr[q] = H;\n\n        if (elemtype == 1)\n        {\n            output_data[q] = H;\n        }\n        if (elemtype == 2)\n        {\n            ((unsigned short*)output_data)[q] = float32_to_float16(H);\n        }\n        if (elemtype == 4)\n        {\n            ((unsigned short*)output_data)[q] = float32_to_bfloat16(H);\n        }\n    }\n}\n\nstatic void gru_int8(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, Mat& hidden_state, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        gru_int8_asimddp(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, reverse, weight_data_tm, weight_data_tm_int8_descales, bias_c, hidden_state, opt);\n        return;\n    }\n#endif\n\n    int size = bottom_blob_int8.w;\n    int T = bottom_blob_int8.h;\n\n    int num_output = top_blob.w;\n\n    // 2 x num_output\n#if __ARM_NEON\n    Mat gates(4 * 2, num_output / 4 + num_output % 4, 4u, opt.workspace_allocator);\n#else\n    Mat gates(2, num_output, 4u, opt.workspace_allocator);\n#endif\n\n    Mat hidden_state_int8(num_output, (size_t)1u, 1, opt.workspace_allocator);\n    float hidden_state_int8_scale = 1.f;\n    float hidden_state_int8_descale = 1.f;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? 
T - 1 - t : t;\n\n        // dynamic quantize hidden_state\n        {\n            float absmax = 0.f;\n            for (int i = 0; i < num_output; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(hidden_state[i]));\n            }\n\n            if (absmax == 0.f)\n            {\n                hidden_state_int8.fill<signed char>(0);\n            }\n            else\n            {\n                hidden_state_int8_scale = 127.f / absmax;\n                hidden_state_int8_descale = absmax / 127.f;\n\n                signed char* hs = hidden_state_int8;\n                for (int i = 0; i < num_output; i++)\n                {\n                    hs[i] = float2int8(hidden_state[i] * hidden_state_int8_scale);\n                }\n            }\n        }\n\n        int remain_num_output_start = 0;\n#if __ARM_NEON\n        int nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const signed char* x = bottom_blob_int8.row<const signed char>(ti);\n            const signed char* hs = hidden_state_int8;\n            const float descale_x = bottom_blob_int8_descales[ti];\n            const float descale_h = hidden_state_int8_descale;\n\n            // gate reset update\n            const float* bias_c_RUBNWN = (const float*)bias_c + q * 4;\n\n            const signed char* kptr = weight_data_tm.row<const signed char>(q / 4);\n\n            const float* descales_ptr = weight_data_tm_int8_descales.row(q / 4);\n\n            int32x4_t _gru_Rx0 = vdupq_n_s32(0);\n            int32x4_t _gru_Ux0 = vdupq_n_s32(0);\n            int i = 0;\n#if __ARM_FEATURE_DOTPROD\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _xi = 
vld1_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n                _gru_Rx0 = vdotq_lane_s32(_gru_Rx0, _w0, _xi, 0);\n                _gru_Ux0 = vdotq_lane_s32(_gru_Ux0, _w1, _xi, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w2, _xi, 1);\n                _sum2 = vdotq_lane_s32(_sum2, _w3, _xi, 1);\n\n                kptr += 64;\n            }\n            _gru_Rx0 = vaddq_s32(_gru_Rx0, _sum1);\n            _gru_Ux0 = vaddq_s32(_gru_Ux0, _sum2);\n#else\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            int32x4_t _sum3 = vdupq_n_s32(0);\n            for (; i + 7 < size; i += 8)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* xptr = x + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16}, [%0]         \\n\"\n                    \"vdup.32    d17, d16[0]         \\n\"\n                    \"vdup.32    d16, d16[1]         \\n\"\n                    \"vmull.s8   q4, d0, d17         \\n\"\n                    \"vmull.s8   q5, d1, d17         \\n\"\n                    \"vmull.s8   q6, d2, d17         \\n\"\n                    \"vmull.s8   q7, d3, d17         \\n\"\n                    \"vmlal.s8   q4, d4, d16         \\n\"\n                    \"vmlal.s8   q5, d5, d16         \\n\"\n                    \"vmlal.s8   q6, d6, d16         \\n\"\n                    \"vmlal.s8   q7, d7, d16         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n                    \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                    \"vpadal.s16 %q5, q7             \\n\"\n                    : 
\"=r\"(xptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : \"0\"(xptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int32x2_t _xi01 = vreinterpret_s32_s8(vld1_s8(x + i));\n                int8x8_t _xi0 = vreinterpret_s8_s32(vdup_lane_s32(_xi01, 0));\n                int8x8_t _xi1 = vreinterpret_s8_s32(vdup_lane_s32(_xi01, 1));\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _xi0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _xi0);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _xi0);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _xi0);\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), _xi1);\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), _xi1);\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), _xi1);\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), _xi1);\n\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            {\n                int32x2_t _s0 = vpadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                int32x2_t _s1 = vpadd_s32(vget_low_s32(_sum1), vget_high_s32(_sum1));\n                int32x2_t _s2 = vpadd_s32(vget_low_s32(_sum2), vget_high_s32(_sum2));\n                int32x2_t _s3 = vpadd_s32(vget_low_s32(_sum3), vget_high_s32(_sum3));\n                _gru_Rx0 = vaddq_s32(_gru_Rx0, vcombine_s32(_s0, _s1));\n                _gru_Ux0 = 
vaddq_s32(_gru_Ux0, vcombine_s32(_s2, _s3));\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _gru_Rx0 = vdotq_lane_s32(_gru_Rx0, _w0, _xi, 0);\n                _gru_Ux0 = vdotq_lane_s32(_gru_Ux0, _w1, _xi, 0);\n#else\n                int16x4_t _xi01 = vreinterpret_s16_s8(vld1_s8(x + i));\n                int8x8_t _xi0 = vreinterpret_s8_s16(vdup_lane_s16(_xi01, 0));\n                int8x8_t _xi1 = vreinterpret_s8_s16(vdup_lane_s16(_xi01, 1));\n                int8x16_t _weight_xc_RU0 = vld1q_s8(kptr);\n                int8x16_t _weight_xc_RU1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _gru_Rx = vmull_s8(vget_low_s8(_weight_xc_RU0), _xi0);\n                int16x8_t _gru_Ux = vmull_s8(vget_high_s8(_weight_xc_RU0), _xi0);\n                _gru_Rx = vmlal_s8(_gru_Rx, vget_low_s8(_weight_xc_RU1), _xi1);\n                _gru_Ux = vmlal_s8(_gru_Ux, vget_high_s8(_weight_xc_RU1), _xi1);\n\n                _gru_Rx0 = vpadalq_s16(_gru_Rx0, _gru_Rx);\n                _gru_Ux0 = vpadalq_s16(_gru_Ux0, _gru_Ux);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 32;\n            }\n            for (; i + 1 < size; i += 2)\n            {\n                int8x8_t _xi = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(x + i)), 0));\n                int8x16_t _weight_xc_RU = vld1q_s8(kptr);\n\n                int16x8_t _gru_Rx = vmull_s8(vget_low_s8(_weight_xc_RU), _xi);\n                int16x8_t _gru_Ux = vmull_s8(vget_high_s8(_weight_xc_RU), _xi);\n\n                _gru_Rx0 = vpadalq_s16(_gru_Rx0, _gru_Rx);\n                _gru_Ux0 = vpadalq_s16(_gru_Ux0, _gru_Ux);\n\n                kptr += 16;\n            }\n            for (; i < size; i++)\n            {\n                int8x8_t _xi = 
vdup_n_s8(x[i]);\n                int8x8_t _weight_xc_RU = vld1_s8(kptr);\n\n                int16x8_t _gru_RxUx = vmull_s8(_weight_xc_RU, _xi);\n                _gru_Rx0 = vaddw_s16(_gru_Rx0, vget_low_s16(_gru_RxUx));\n                _gru_Ux0 = vaddw_s16(_gru_Ux0, vget_high_s16(_gru_RxUx));\n\n                kptr += 8;\n            }\n\n            int32x4_t _gru_Rh0 = vdupq_n_s32(0);\n            int32x4_t _gru_Uh0 = vdupq_n_s32(0);\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n                _gru_Rh0 = vdotq_lane_s32(_gru_Rh0, _w0, _h_cont, 0);\n                _gru_Uh0 = vdotq_lane_s32(_gru_Uh0, _w1, _h_cont, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w2, _h_cont, 1);\n                _sum2 = vdotq_lane_s32(_sum2, _w3, _h_cont, 1);\n\n                kptr += 64;\n            }\n            _gru_Rh0 = vaddq_s32(_gru_Rh0, _sum1);\n            _gru_Uh0 = vaddq_s32(_gru_Uh0, _sum2);\n#else\n            _sum0 = vdupq_n_s32(0);\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            _sum3 = vdupq_n_s32(0);\n            for (; i + 7 < num_output; i += 8)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* hsptr = hs + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16}, [%0]         \\n\"\n                    \"vdup.32    d17, d16[0]         \\n\"\n                    \"vdup.32    d16, d16[1]         \\n\"\n                    \"vmull.s8   q4, d0, d17         \\n\"\n                    \"vmull.s8   q5, d1, 
d17         \\n\"\n                    \"vmull.s8   q6, d2, d17         \\n\"\n                    \"vmull.s8   q7, d3, d17         \\n\"\n                    \"vmlal.s8   q4, d4, d16         \\n\"\n                    \"vmlal.s8   q5, d5, d16         \\n\"\n                    \"vmlal.s8   q6, d6, d16         \\n\"\n                    \"vmlal.s8   q7, d7, d16         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n                    \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                    \"vpadal.s16 %q5, q7             \\n\"\n                    : \"=r\"(hsptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : \"0\"(hsptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int32x2_t _h_cont01 = vreinterpret_s32_s8(vld1_s8(hs + i));\n                int8x8_t _h_cont0 = vreinterpret_s8_s32(vdup_lane_s32(_h_cont01, 0));\n                int8x8_t _h_cont1 = vreinterpret_s8_s32(vdup_lane_s32(_h_cont01, 1));\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _h_cont0);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _h_cont0);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _h_cont0);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _h_cont0);\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), _h_cont1);\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), _h_cont1);\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), _h_cont1);\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), _h_cont1);\n\n                _sum0 = 
vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            {\n                int32x2_t _s0 = vpadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                int32x2_t _s1 = vpadd_s32(vget_low_s32(_sum1), vget_high_s32(_sum1));\n                int32x2_t _s2 = vpadd_s32(vget_low_s32(_sum2), vget_high_s32(_sum2));\n                int32x2_t _s3 = vpadd_s32(vget_low_s32(_sum3), vget_high_s32(_sum3));\n                _gru_Rh0 = vaddq_s32(_gru_Rh0, vcombine_s32(_s0, _s1));\n                _gru_Uh0 = vaddq_s32(_gru_Uh0, vcombine_s32(_s2, _s3));\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _gru_Rh0 = vdotq_lane_s32(_gru_Rh0, _w0, _h_cont, 0);\n                _gru_Uh0 = vdotq_lane_s32(_gru_Uh0, _w1, _h_cont, 0);\n#else\n                int16x4_t _h_cont01 = vreinterpret_s16_s8(vld1_s8(hs + i));\n                int8x8_t _h_cont0 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 0));\n                int8x8_t _h_cont1 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 1));\n                int8x16_t _weight_hc_RU0 = vld1q_s8(kptr);\n                int8x16_t _weight_hc_RU1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _gru_Rh = vmull_s8(vget_low_s8(_weight_hc_RU0), _h_cont0);\n                int16x8_t _gru_Uh = vmull_s8(vget_high_s8(_weight_hc_RU0), _h_cont0);\n                _gru_Rh = vmlal_s8(_gru_Rh, vget_low_s8(_weight_hc_RU1), _h_cont1);\n                _gru_Uh = vmlal_s8(_gru_Uh, vget_high_s8(_weight_hc_RU1), _h_cont1);\n\n                _gru_Rh0 = vpadalq_s16(_gru_Rh0, _gru_Rh);\n                _gru_Uh0 = 
vpadalq_s16(_gru_Uh0, _gru_Uh);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 32;\n            }\n            for (; i + 1 < num_output; i += 2)\n            {\n                int8x8_t _h_cont = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(hs + i)), 0));\n                int8x16_t _weight_hc_RU = vld1q_s8(kptr);\n\n                int16x8_t _gru_Rh = vmull_s8(vget_low_s8(_weight_hc_RU), _h_cont);\n                int16x8_t _gru_Uh = vmull_s8(vget_high_s8(_weight_hc_RU), _h_cont);\n\n                _gru_Rh0 = vpadalq_s16(_gru_Rh0, _gru_Rh);\n                _gru_Uh0 = vpadalq_s16(_gru_Uh0, _gru_Uh);\n\n                kptr += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                int8x8_t _h_cont = vdup_n_s8(hs[i]);\n                int8x8_t _weight_hc_RU = vld1_s8(kptr);\n\n                int16x8_t _gru_RhUh = vmull_s8(_weight_hc_RU, _h_cont);\n                _gru_Rh0 = vaddw_s16(_gru_Rh0, vget_low_s16(_gru_RhUh));\n                _gru_Uh0 = vaddw_s16(_gru_Uh0, vget_high_s16(_gru_RhUh));\n\n                kptr += 8;\n            }\n\n            float32x4_t _descale_x = vdupq_n_f32(descale_x);\n            float32x4_t _descale_h = vdupq_n_f32(descale_h);\n\n            float32x4_t _gru_R0 = vld1q_f32(bias_c_RUBNWN);\n            float32x4_t _gru_U0 = vld1q_f32(bias_c_RUBNWN + 4);\n\n            float32x4_t _descale_xc_R0 = vld1q_f32(descales_ptr);\n            float32x4_t _descale_xc_U0 = vld1q_f32(descales_ptr + 4);\n\n            _gru_R0 = vmlaq_f32(_gru_R0, vcvtq_f32_s32(_gru_Rx0), vmulq_f32(_descale_x, _descale_xc_R0));\n            _gru_U0 = vmlaq_f32(_gru_U0, vcvtq_f32_s32(_gru_Ux0), vmulq_f32(_descale_x, _descale_xc_U0));\n\n            float32x4_t _descale_hc_R0 = vld1q_f32(descales_ptr + 8);\n            float32x4_t _descale_hc_U0 = vld1q_f32(descales_ptr + 12);\n\n            _gru_R0 = vmlaq_f32(_gru_R0, vcvtq_f32_s32(_gru_Rh0), vmulq_f32(_descale_h, _descale_hc_R0));\n            
_gru_U0 = vmlaq_f32(_gru_U0, vcvtq_f32_s32(_gru_Uh0), vmulq_f32(_descale_h, _descale_hc_U0));\n\n            // sigmoid(R)\n            // sigmoid(U)\n            _gru_R0 = sigmoid_ps(_gru_R0);\n            _gru_U0 = sigmoid_ps(_gru_U0);\n\n            // gate new\n\n            int32x4_t _gru_Nh0 = vdupq_n_s32(0);\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            _sum1 = vdupq_n_s32(0);\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _gru_Nh0 = vdotq_lane_s32(_gru_Nh0, _w0, _h_cont, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w1, _h_cont, 1);\n\n                kptr += 32;\n            }\n            _gru_Nh0 = vaddq_s32(_gru_Nh0, _sum1);\n#else\n            _sum0 = vdupq_n_s32(0);\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < num_output; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* hsptr = hs + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16-d17}, [%0]     \\n\"\n                    \"vmull.s8   q4, d0, d16         \\n\"\n                    \"vmull.s8   q5, d1, d16         \\n\"\n                    \"vmull.s8   q6, d2, d16         \\n\"\n                    \"vmull.s8   q7, d3, d16         \\n\"\n                    \"vmlal.s8   q4, d4, d17         \\n\"\n                    \"vmlal.s8   q5, d5, d17         \\n\"\n                    \"vmlal.s8   q6, d6, d17         \\n\"\n                    \"vmlal.s8   q7, d7, d17         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n                    \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                   
 \"vpadal.s16 %q5, q7             \\n\"\n                    : \"=r\"(hsptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : \"0\"(hsptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int8x16_t _h_cont = vld1q_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), vget_low_s8(_h_cont));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), vget_low_s8(_h_cont));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), vget_low_s8(_h_cont));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), vget_low_s8(_h_cont));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), vget_high_s8(_h_cont));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), vget_high_s8(_h_cont));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), vget_high_s8(_h_cont));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), vget_high_s8(_h_cont));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _h_cont);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _h_cont);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _h_cont);\n                
int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _h_cont);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 32;\n            }\n            {\n                int32x4x2_t _tmp0 = vzipq_s32(_sum0, _sum1);\n                int32x4x2_t _tmp1 = vzipq_s32(_sum2, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_tmp0.val[0]), vget_low_s32(_tmp1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_tmp0.val[0]), vget_high_s32(_tmp1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_tmp0.val[1]), vget_low_s32(_tmp1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_tmp0.val[1]), vget_high_s32(_tmp1.val[1]));\n            }\n            _gru_Nh0 = vaddq_s32(_gru_Nh0, _sum0);\n            _gru_Nh0 = vaddq_s32(_gru_Nh0, _sum1);\n            _gru_Nh0 = vaddq_s32(_gru_Nh0, _sum2);\n            _gru_Nh0 = vaddq_s32(_gru_Nh0, _sum3);\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w = vld1q_s8(kptr);\n                _gru_Nh0 = vdotq_lane_s32(_gru_Nh0, _w, _h_cont, 0);\n#else\n                int16x4_t _h_cont01 = vreinterpret_s16_s8(vld1_s8(hs + i));\n                int8x8_t _h_cont0 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 0));\n                int8x8_t _h_cont1 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 1));\n                int8x16_t _w01 = vld1q_s8(kptr);\n\n                int16x8_t _gru_Nh = vmull_s8(vget_low_s8(_w01), _h_cont0);\n                _gru_Nh = vmlal_s8(_gru_Nh, vget_high_s8(_w01), _h_cont1);\n                _gru_Nh0 = vpadalq_s16(_gru_Nh0, _gru_Nh);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 16;\n            }\n            for (; i + 1 < num_output; i += 2)\n          
  {\n                int8x8_t _h_cont = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(hs + i)), 0));\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _gru_Nh = vmull_s8(_w, _h_cont);\n                _gru_Nh0 = vpadalq_s16(_gru_Nh0, _gru_Nh);\n\n                kptr += 8;\n            }\n            for (; i < num_output; i++)\n            {\n                int8x8_t _h_cont = vdup_n_s8(hs[i]);\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _gru_Nh = vmull_s8(_w, _h_cont);\n                _gru_Nh0 = vaddw_s16(_gru_Nh0, vget_low_s16(_gru_Nh));\n\n                kptr += 4;\n            }\n\n            int32x4_t _gru_Nx0 = vdupq_n_s32(0);\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            _sum1 = vdupq_n_s32(0);\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _gru_Nx0 = vdotq_lane_s32(_gru_Nx0, _w0, _xi, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w1, _xi, 1);\n\n                kptr += 32;\n            }\n            _gru_Nx0 = vaddq_s32(_gru_Nx0, _sum1);\n#else\n            _sum0 = vdupq_n_s32(0);\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < size; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* xptr = x + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16-d17}, [%0]     \\n\"\n                    \"vmull.s8   q4, d0, d16         \\n\"\n                    \"vmull.s8   q5, d1, d16         \\n\"\n                    \"vmull.s8   q6, d2, d16         \\n\"\n                    \"vmull.s8   q7, d3, d16         \\n\"\n                    \"vmlal.s8   q4, d4, d17         \\n\"\n      
              \"vmlal.s8   q5, d5, d17         \\n\"\n                    \"vmlal.s8   q6, d6, d17         \\n\"\n                    \"vmlal.s8   q7, d7, d17         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n                    \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                    \"vpadal.s16 %q5, q7             \\n\"\n                    : \"=r\"(xptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : \"0\"(xptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int8x16_t _xi = vld1q_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), vget_low_s8(_xi));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), vget_low_s8(_xi));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), vget_low_s8(_xi));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), vget_low_s8(_xi));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), vget_high_s8(_xi));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), vget_high_s8(_xi));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), vget_high_s8(_xi));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), vget_high_s8(_xi));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _xi = vld1_s8(x + i);\n               
 int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _xi);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _xi);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _xi);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _xi);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 32;\n            }\n            {\n                int32x4x2_t _tmp0 = vzipq_s32(_sum0, _sum1);\n                int32x4x2_t _tmp1 = vzipq_s32(_sum2, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_tmp0.val[0]), vget_low_s32(_tmp1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_tmp0.val[0]), vget_high_s32(_tmp1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_tmp0.val[1]), vget_low_s32(_tmp1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_tmp0.val[1]), vget_high_s32(_tmp1.val[1]));\n            }\n            _gru_Nx0 = vaddq_s32(_gru_Nx0, _sum0);\n            _gru_Nx0 = vaddq_s32(_gru_Nx0, _sum1);\n            _gru_Nx0 = vaddq_s32(_gru_Nx0, _sum2);\n            _gru_Nx0 = vaddq_s32(_gru_Nx0, _sum3);\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w = vld1q_s8(kptr);\n                _gru_Nx0 = vdotq_lane_s32(_gru_Nx0, _w, _xi, 0);\n#else\n                int16x4_t _xi01 = vreinterpret_s16_s8(vld1_s8(x + i));\n                int8x8_t _xi0 = vreinterpret_s8_s16(vdup_lane_s16(_xi01, 0));\n                int8x8_t _xi1 = vreinterpret_s8_s16(vdup_lane_s16(_xi01, 1));\n                int8x16_t _w01 = vld1q_s8(kptr);\n\n                int16x8_t _gru_Nx = vmull_s8(vget_low_s8(_w01), _xi0);\n            
    _gru_Nx = vmlal_s8(_gru_Nx, vget_high_s8(_w01), _xi1);\n                _gru_Nx0 = vpadalq_s16(_gru_Nx0, _gru_Nx);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 16;\n            }\n            for (; i + 1 < size; i += 2)\n            {\n                int8x8_t _xi = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(x + i)), 0));\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _gru_Nx = vmull_s8(_w, _xi);\n                _gru_Nx0 = vpadalq_s16(_gru_Nx0, _gru_Nx);\n\n                kptr += 8;\n            }\n            for (; i < size; i++)\n            {\n                int8x8_t _xi = vdup_n_s8(x[i]);\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _gru_Nx = vmull_s8(_w, _xi);\n                _gru_Nx0 = vaddw_s16(_gru_Nx0, vget_low_s16(_gru_Nx));\n\n                kptr += 4;\n            }\n\n            float32x4_t _gru_N0 = vld1q_f32(bias_c_RUBNWN + 8);\n\n            float32x4_t _descale_hc_N0 = vld1q_f32(descales_ptr + 16);\n\n            _gru_N0 = vmlaq_f32(_gru_N0, vcvtq_f32_s32(_gru_Nh0), vmulq_f32(_descale_h, _descale_hc_N0));\n\n            _gru_N0 = vmlaq_f32(vld1q_f32(bias_c_RUBNWN + 12), _gru_R0, _gru_N0);\n\n            float32x4_t _descale_xc_N0 = vld1q_f32(descales_ptr + 20);\n\n            _gru_N0 = vmlaq_f32(_gru_N0, vcvtq_f32_s32(_gru_Nx0), vmulq_f32(_descale_x, _descale_xc_N0));\n\n            // tanh(N)\n            _gru_N0 = tanh_ps(_gru_N0);\n\n            float* gates_data = gates.row(q / 4);\n\n            vst1q_f32(gates_data, _gru_U0);\n            vst1q_f32(gates_data + 4, _gru_N0);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const signed char* x = bottom_blob_int8.row<const signed char>(ti);\n            const signed char* hs = hidden_state_int8;\n            const float descale_x = 
bottom_blob_int8_descales[ti];\n            const float descale_h = hidden_state_int8_descale;\n\n            // gate reset update\n            const float* bias_c_RUBNWN = (const float*)bias_c + q * 4;\n\n#if __ARM_NEON\n            const signed char* kptr = weight_data_tm.row<const signed char>(q / 4 + q % 4);\n            const float* descales_ptr = weight_data_tm_int8_descales.row(q / 4 + q % 4);\n#else\n            const signed char* kptr = weight_data_tm.row<const signed char>(q);\n            const float* descales_ptr = weight_data_tm_int8_descales.row(q);\n#endif\n\n            const float descale_xc_R = descales_ptr[0];\n            const float descale_xc_U = descales_ptr[1];\n            const float descale_hc_R = descales_ptr[2];\n            const float descale_hc_U = descales_ptr[3];\n            const float descale_hc_N = descales_ptr[4];\n            const float descale_xc_N = descales_ptr[5];\n\n            int Rx = 0;\n            int Ux = 0;\n            for (int i = 0; i < size; i++)\n            {\n                signed char xi = x[i];\n\n                Rx += kptr[0] * xi;\n                Ux += kptr[1] * xi;\n\n                kptr += 2;\n            }\n\n            int Rh = 0;\n            int Uh = 0;\n            for (int i = 0; i < num_output; i++)\n            {\n                signed char h_cont = hs[i];\n\n                Rh += kptr[0] * h_cont;\n                Uh += kptr[1] * h_cont;\n\n                kptr += 2;\n            }\n\n            float R = bias_c_RUBNWN[0] + Rx * (descale_x * descale_xc_R) + Rh * (descale_h * descale_hc_R);\n            float U = bias_c_RUBNWN[1] + Ux * (descale_x * descale_xc_U) + Uh * (descale_h * descale_hc_U);\n\n            // sigmoid(R)\n            // sigmoid(U)\n            R = 1.f / (1.f + expf(-R));\n            U = 1.f / (1.f + expf(-U));\n\n            // gate new\n\n            int Nh = 0;\n            for (int i = 0; i < num_output; i++)\n            {\n                Nh += kptr[0] * 
hs[i];\n                kptr += 1;\n            }\n\n            int Nx = 0;\n            for (int i = 0; i < size; i++)\n            {\n                Nx += kptr[0] * x[i];\n                kptr += 1;\n            }\n\n            float N = bias_c_RUBNWN[2] + Nh * (descale_h * descale_hc_N);\n            N = bias_c_RUBNWN[3] + R * N + Nx * (descale_x * descale_xc_N);\n\n            // tanh(N)\n            N = tanhf(N);\n\n#if __ARM_NEON\n            float* gates_data = gates.row(q / 4 + q % 4);\n#else\n            float* gates_data = gates.row(q);\n#endif\n\n            gates_data[0] = U;\n            gates_data[1] = N;\n        }\n\n        gru_int8_gate_output(gates, hidden_state, top_blob, ti, elemtype, opt);\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/hardsigmoid_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardsigmoid_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nHardSigmoid_arm::HardSigmoid_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint HardSigmoid_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _alpha = vdupq_n_f32(alpha);\n        float32x4_t _beta = vdupq_n_f32(beta);\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                \"mov    v4.16b, %5.16b          \\n\"\n                
\"mov    v5.16b, %5.16b          \\n\"\n                \"mov    v6.16b, %5.16b          \\n\"\n                \"mov    v7.16b, %5.16b          \\n\"\n                \"fmla   v4.4s, v0.4s, %4.4s     \\n\"\n                \"fmla   v5.4s, v1.4s, %4.4s     \\n\"\n                \"fmla   v6.4s, v2.4s, %4.4s     \\n\"\n                \"fmla   v7.4s, v3.4s, %4.4s     \\n\"\n                \"fmax   v0.4s, v4.4s, %2.4s     \\n\"\n                \"fmax   v1.4s, v5.4s, %2.4s     \\n\"\n                \"fmax   v2.4s, v6.4s, %2.4s     \\n\"\n                \"fmax   v3.4s, v7.4s, %2.4s     \\n\"\n                \"fmin   v0.4s, v0.4s, %3.4s     \\n\"\n                \"fmin   v1.4s, v1.4s, %3.4s     \\n\"\n                \"fmin   v2.4s, v2.4s, %3.4s     \\n\"\n                \"fmin   v3.4s, v3.4s, %3.4s     \\n\"\n                \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #512]      \\n\"\n                \"vldm       %0, {d0-d7}     \\n\"\n                \"vmov       q4, %q5         \\n\"\n                \"vmov       q5, %q5         \\n\"\n                \"vmov       q6, %q5         \\n\"\n                \"vmov       q7, %q5         \\n\"\n                \"vmla.f32   q4, q0, %q4     \\n\"\n                \"vmla.f32   q5, q1, %q4     \\n\"\n                \"vmla.f32   q6, q2, %q4     \\n\"\n                \"vmla.f32   q7, q3, %q4     \\n\"\n                \"vmax.f32   q0, q4, %q2     \\n\"\n                \"vmax.f32   q1, q5, %q2     \\n\"\n                \"vmax.f32   q2, q6, %q2     \\n\"\n                \"vmax.f32   q3, q7, %q2     \\n\"\n          
      \"vmin.f32   q0, q0, %q3     \\n\"\n                \"vmin.f32   q1, q1, %q3     \\n\"\n                \"vmin.f32   q2, q2, %q3     \\n\"\n                \"vmin.f32   q3, q3, %q3     \\n\"\n                \"vstm       %0!, {d0-d7}    \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _p2 = vld1q_f32(ptr + 8);\n            float32x4_t _p3 = vld1q_f32(ptr + 12);\n            _p0 = vmlaq_f32(_beta, _p0, _alpha);\n            _p1 = vmlaq_f32(_beta, _p1, _alpha);\n            _p2 = vmlaq_f32(_beta, _p2, _alpha);\n            _p3 = vmlaq_f32(_beta, _p3, _alpha);\n            _p0 = vmaxq_f32(_p0, _zero);\n            _p1 = vmaxq_f32(_p1, _zero);\n            _p2 = vmaxq_f32(_p2, _zero);\n            _p3 = vmaxq_f32(_p3, _zero);\n            _p0 = vminq_f32(_p0, _one);\n            _p1 = vminq_f32(_p1, _one);\n            _p2 = vminq_f32(_p2, _one);\n            _p3 = vminq_f32(_p3, _one);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            vst1q_f32(ptr + 8, _p2);\n            vst1q_f32(ptr + 12, _p3);\n            ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            _p0 = vmlaq_f32(_beta, _p0, _alpha);\n            _p1 = vmlaq_f32(_beta, _p1, _alpha);\n            _p0 = vmaxq_f32(_p0, _zero);\n            _p1 = vmaxq_f32(_p1, _zero);\n            _p0 = vminq_f32(_p0, _one);\n            _p1 = vminq_f32(_p1, _one);\n            
vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vmlaq_f32(_beta, _p, _alpha);\n            _p = vmaxq_f32(_p, _zero);\n            _p = vminq_f32(_p, _one);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            if (*ptr < lower)\n                *ptr = 0.f;\n            else if (*ptr > upper)\n                *ptr = 1.f;\n            else\n                *ptr = *ptr * alpha + beta;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint HardSigmoid_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _alpha = vdupq_n_f32(alpha);\n        float32x4_t _beta = vdupq_n_f32(beta);\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #256]   \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0] \\n\"\n                \"shll   v0.4s, v0.4h, #16       \\n\"\n                \"shll   v1.4s, v1.4h, #16       \\n\"\n                \"shll   v2.4s, v2.4h, #16       \\n\"\n                \"shll   v3.4s, v3.4h, #16       \\n\"\n                \"mov    v4.16b, %5.16b          \\n\"\n                \"mov    v5.16b, %5.16b          
\\n\"\n                \"mov    v6.16b, %5.16b          \\n\"\n                \"mov    v7.16b, %5.16b          \\n\"\n                \"fmla   v4.4s, v0.4s, %4.4s     \\n\"\n                \"fmla   v5.4s, v1.4s, %4.4s     \\n\"\n                \"fmla   v6.4s, v2.4s, %4.4s     \\n\"\n                \"fmla   v7.4s, v3.4s, %4.4s     \\n\"\n                \"fmax   v0.4s, v4.4s, %2.4s     \\n\"\n                \"fmax   v1.4s, v5.4s, %2.4s     \\n\"\n                \"fmax   v2.4s, v6.4s, %2.4s     \\n\"\n                \"fmax   v3.4s, v7.4s, %2.4s     \\n\"\n                \"fmin   v0.4s, v0.4s, %3.4s     \\n\"\n                \"fmin   v1.4s, v1.4s, %3.4s     \\n\"\n                \"fmin   v2.4s, v2.4s, %3.4s     \\n\"\n                \"fmin   v3.4s, v3.4s, %3.4s     \\n\"\n                \"shrn   v0.4h, v0.4s, #16       \\n\"\n                \"shrn   v1.4h, v1.4s, #16       \\n\"\n                \"shrn   v2.4h, v2.4s, #16       \\n\"\n                \"shrn   v3.4h, v3.4s, #16       \\n\"\n                \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #256]      \\n\"\n                \"vld1.u16   {d4-d7}, [%0]   \\n\"\n                \"vshll.u16  q0, d4, #16     \\n\"\n                \"vshll.u16  q1, d5, #16     \\n\"\n                \"vshll.u16  q2, d6, #16     \\n\"\n                \"vshll.u16  q3, d7, #16     \\n\"\n                \"vmov       q4, %q5         \\n\"\n                \"vmov       q5, %q5         \\n\"\n                \"vmov       q6, %q5         \\n\"\n                \"vmov       q7, %q5         \\n\"\n                \"vmla.f32 
  q4, q0, %q4     \\n\"\n                \"vmla.f32   q5, q1, %q4     \\n\"\n                \"vmla.f32   q6, q2, %q4     \\n\"\n                \"vmla.f32   q7, q3, %q4     \\n\"\n                \"vmax.f32   q0, q4, %q2     \\n\"\n                \"vmax.f32   q1, q5, %q2     \\n\"\n                \"vmax.f32   q2, q6, %q2     \\n\"\n                \"vmax.f32   q3, q7, %q2     \\n\"\n                \"vmin.f32   q0, q0, %q3     \\n\"\n                \"vmin.f32   q1, q1, %q3     \\n\"\n                \"vmin.f32   q2, q2, %q3     \\n\"\n                \"vmin.f32   q3, q3, %q3     \\n\"\n                \"vshrn.u32  d0, q0, #16     \\n\"\n                \"vshrn.u32  d1, q1, #16     \\n\"\n                \"vshrn.u32  d2, q2, #16     \\n\"\n                \"vshrn.u32  d3, q3, #16     \\n\"\n                \"vst1.u16   {d0-d3}, [%0]!  \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            uint16x8_t _p = vld1q_u16(ptr);\n            uint16x8_t _q = vld1q_u16(ptr + 8);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n            float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n            _p0 = vmlaq_f32(_beta, _p0, _alpha);\n            _p1 = vmlaq_f32(_beta, _p1, _alpha);\n            _p2 = vmlaq_f32(_beta, _p2, _alpha);\n            _p3 = vmlaq_f32(_beta, _p3, _alpha);\n            _p0 = vmaxq_f32(_p0, _zero);\n            _p1 = vmaxq_f32(_p1, _zero);\n            _p2 = vmaxq_f32(_p2, _zero);\n            _p3 = vmaxq_f32(_p3, _zero);\n            _p0 = vminq_f32(_p0, _one);\n            _p1 
= vminq_f32(_p1, _one);\n            _p2 = vminq_f32(_p2, _one);\n            _p3 = vminq_f32(_p3, _one);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            _q = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n            vst1q_u16(ptr, _p);\n            vst1q_u16(ptr + 8, _q);\n            ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _p0 = vmlaq_f32(_beta, _p0, _alpha);\n            _p1 = vmlaq_f32(_beta, _p1, _alpha);\n            _p0 = vmaxq_f32(_p0, _zero);\n            _p1 = vmaxq_f32(_p1, _zero);\n            _p0 = vminq_f32(_p0, _one);\n            _p1 = vminq_f32(_p1, _one);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = vmlaq_f32(_beta, _p, _alpha);\n            _p = vmaxq_f32(_p, _zero);\n            _p = vminq_f32(_p, _one);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            if (v < lower)\n                v = 0.f;\n            else if (v > upper)\n                v = 1.f;\n            else\n                v = v * alpha + beta;\n            *ptr = float32_to_bfloat16(v);\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/hardsigmoid_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_HARDSIGMOID_ARM_H\n#define LAYER_HARDSIGMOID_ARM_H\n\n#include \"hardsigmoid.h\"\n\nnamespace ncnn {\n\nclass HardSigmoid_arm : public HardSigmoid\n{\npublic:\n    HardSigmoid_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_HARDSIGMOID_ARM_H\n"
  },
  {
    "path": "src/layer/arm/hardsigmoid_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardsigmoid_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint HardSigmoid_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _alpha = vdupq_n_f32(alpha);\n        float32x4_t _beta = vdupq_n_f32(beta);\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _p0 = vfmaq_f32(_beta, _p0, _alpha);\n            _p1 = vfmaq_f32(_beta, _p1, _alpha);\n            _p0 = vmaxq_f32(_p0, _zero);\n            _p1 = vmaxq_f32(_p1, _zero);\n            _p0 = vminq_f32(_p0, _one);\n            _p1 = vminq_f32(_p1, _one);\n            _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _p = vfmaq_f32(_beta, _p, _alpha);\n            _p = vmaxq_f32(_p, _zero);\n            _p = vminq_f32(_p, _one);\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)*ptr;\n            if (v 
< lower)\n                v = 0.f;\n            else if (v > upper)\n                v = 1.f;\n            else\n                v = v * alpha + beta;\n            *ptr = (__fp16)v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint HardSigmoid_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        __fp16 alpha_fp16 = (__fp16)alpha;\n        __fp16 beta_fp16 = (__fp16)beta;\n\n        float16x8_t _zero = vdupq_n_f16((__fp16)0.f);\n        float16x8_t _one = vdupq_n_f16((__fp16)1.f);\n        float16x8_t _alpha = vdupq_n_f16(alpha_fp16);\n        float16x8_t _beta = vdupq_n_f16(beta_fp16);\n\n        int i = 0;\n        for (; i + 31 < size; i += 32)\n        {\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                \"mov    v4.16b, %5.16b          \\n\"\n                \"mov    v5.16b, %5.16b          \\n\"\n                \"mov    v6.16b, %5.16b          \\n\"\n                \"mov    v7.16b, %5.16b          \\n\"\n                \"fmla   v4.8h, v0.8h, %4.8h     \\n\"\n                \"fmla   v5.8h, v1.8h, %4.8h     \\n\"\n                \"fmla   v6.8h, v2.8h, %4.8h     \\n\"\n                \"fmla   v7.8h, v3.8h, %4.8h     \\n\"\n                \"fmax   v0.8h, v4.8h, %2.8h     \\n\"\n                \"fmax   v1.8h, v5.8h, %2.8h     \\n\"\n                \"fmax   v2.8h, v6.8h, %2.8h     \\n\"\n                \"fmax   v3.8h, v7.8h, %2.8h     \\n\"\n                \"fmin   v0.8h, v0.8h, %3.8h     
\\n\"\n                \"fmin   v1.8h, v1.8h, %3.8h     \\n\"\n                \"fmin   v2.8h, v2.8h, %3.8h     \\n\"\n                \"fmin   v3.8h, v3.8h, %3.8h     \\n\"\n                \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // NCNN_GNU_INLINE_ASM\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            _p0 = vfmaq_f16(_beta, _p0, _alpha);\n            _p1 = vfmaq_f16(_beta, _p1, _alpha);\n            _p2 = vfmaq_f16(_beta, _p2, _alpha);\n            _p3 = vfmaq_f16(_beta, _p3, _alpha);\n            _p0 = vmaxq_f16(_p0, _zero);\n            _p1 = vmaxq_f16(_p1, _zero);\n            _p2 = vmaxq_f16(_p2, _zero);\n            _p3 = vmaxq_f16(_p3, _zero);\n            _p0 = vminq_f16(_p0, _one);\n            _p1 = vminq_f16(_p1, _one);\n            _p2 = vminq_f16(_p2, _one);\n            _p3 = vminq_f16(_p3, _one);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            ptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            _p0 = vfmaq_f16(_beta, _p0, _alpha);\n            _p1 = vfmaq_f16(_beta, _p1, _alpha);\n            _p0 = vmaxq_f16(_p0, _zero);\n            _p1 = vmaxq_f16(_p1, _zero);\n            _p0 = vminq_f16(_p0, _one);\n            _p1 = vminq_f16(_p1, _one);\n            vst1q_f16(ptr, _p0);\n       
     vst1q_f16(ptr + 8, _p1);\n            ptr += 16;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _p = vfmaq_f16(_beta, _p, _alpha);\n            _p = vmaxq_f16(_p, _zero);\n            _p = vminq_f16(_p, _one);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = vfma_f16(vget_low_f16(_beta), _p, vget_low_f16(_alpha));\n            _p = vmax_f16(_p, vget_low_f16(_zero));\n            _p = vmin_f16(_p, vget_low_f16(_one));\n            vst1_f16(ptr, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = *ptr;\n            if (v < (__fp16)lower)\n                v = (__fp16)0.f;\n            else if (v > (__fp16)upper)\n                v = (__fp16)1.f;\n            else\n                v = v * alpha_fp16 + beta_fp16;\n            *ptr = v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/hardswish_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardswish_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nHardSwish_arm::HardSwish_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint HardSwish_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _alpha = vdupq_n_f32(alpha);\n        float32x4_t _beta = vdupq_n_f32(beta);\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                \"mov    v4.16b, %5.16b          \\n\"\n                \"mov    
v5.16b, %5.16b          \\n\"\n                \"mov    v6.16b, %5.16b          \\n\"\n                \"mov    v7.16b, %5.16b          \\n\"\n                \"fmla   v4.4s, v0.4s, %4.4s     \\n\"\n                \"fmla   v5.4s, v1.4s, %4.4s     \\n\"\n                \"fmla   v6.4s, v2.4s, %4.4s     \\n\"\n                \"fmla   v7.4s, v3.4s, %4.4s     \\n\"\n                \"fmax   v4.4s, v4.4s, %2.4s     \\n\"\n                \"fmax   v5.4s, v5.4s, %2.4s     \\n\"\n                \"fmax   v6.4s, v6.4s, %2.4s     \\n\"\n                \"fmax   v7.4s, v7.4s, %2.4s     \\n\"\n                \"fmin   v4.4s, v4.4s, %3.4s     \\n\"\n                \"fmin   v5.4s, v5.4s, %3.4s     \\n\"\n                \"fmin   v6.4s, v6.4s, %3.4s     \\n\"\n                \"fmin   v7.4s, v7.4s, %3.4s     \\n\"\n                \"fmul   v0.4s, v4.4s, v0.4s     \\n\"\n                \"fmul   v1.4s, v5.4s, v1.4s     \\n\"\n                \"fmul   v2.4s, v6.4s, v2.4s     \\n\"\n                \"fmul   v3.4s, v7.4s, v3.4s     \\n\"\n                \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #512]      \\n\"\n                \"vldm       %0, {d0-d7}     \\n\"\n                \"vmov       q4, %q5         \\n\"\n                \"vmov       q5, %q5         \\n\"\n                \"vmov       q6, %q5         \\n\"\n                \"vmov       q7, %q5         \\n\"\n                \"vmla.f32   q4, q0, %q4     \\n\"\n                \"vmla.f32   q5, q1, %q4     \\n\"\n                \"vmla.f32   q6, q2, %q4     \\n\"\n                \"vmla.f32   q7, q3, %q4     \\n\"\n   
             \"vmax.f32   q4, q4, %q2     \\n\"\n                \"vmax.f32   q5, q5, %q2     \\n\"\n                \"vmax.f32   q6, q6, %q2     \\n\"\n                \"vmax.f32   q7, q7, %q2     \\n\"\n                \"vmin.f32   q4, q4, %q3     \\n\"\n                \"vmin.f32   q5, q5, %q3     \\n\"\n                \"vmin.f32   q6, q6, %q3     \\n\"\n                \"vmin.f32   q7, q7, %q3     \\n\"\n                \"vmul.f32   q0, q4, q0      \\n\"\n                \"vmul.f32   q1, q5, q1      \\n\"\n                \"vmul.f32   q2, q6, q2      \\n\"\n                \"vmul.f32   q3, q7, q3      \\n\"\n                \"vstm       %0!, {d0-d7}    \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _p2 = vld1q_f32(ptr + 8);\n            float32x4_t _p3 = vld1q_f32(ptr + 12);\n            float32x4_t _ans0 = vmlaq_f32(_beta, _p0, _alpha);\n            float32x4_t _ans1 = vmlaq_f32(_beta, _p1, _alpha);\n            float32x4_t _ans2 = vmlaq_f32(_beta, _p2, _alpha);\n            float32x4_t _ans3 = vmlaq_f32(_beta, _p3, _alpha);\n            _ans0 = vmaxq_f32(_ans0, _zero);\n            _ans1 = vmaxq_f32(_ans1, _zero);\n            _ans2 = vmaxq_f32(_ans2, _zero);\n            _ans3 = vmaxq_f32(_ans3, _zero);\n            _ans0 = vminq_f32(_ans0, _one);\n            _ans1 = vminq_f32(_ans1, _one);\n            _ans2 = vminq_f32(_ans2, _one);\n            _ans3 = vminq_f32(_ans3, _one);\n            _p0 = vmulq_f32(_ans0, _p0);\n            _p1 = vmulq_f32(_ans1, _p1);\n            _p2 = vmulq_f32(_ans2, _p2);\n          
  _p3 = vmulq_f32(_ans3, _p3);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            vst1q_f32(ptr + 8, _p2);\n            vst1q_f32(ptr + 12, _p3);\n            ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _ans0 = vmlaq_f32(_beta, _p0, _alpha);\n            float32x4_t _ans1 = vmlaq_f32(_beta, _p1, _alpha);\n            _ans0 = vmaxq_f32(_ans0, _zero);\n            _ans1 = vmaxq_f32(_ans1, _zero);\n            _ans0 = vminq_f32(_ans0, _one);\n            _ans1 = vminq_f32(_ans1, _one);\n            _p0 = vmulq_f32(_ans0, _p0);\n            _p1 = vmulq_f32(_ans1, _p1);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _ans = vmlaq_f32(_beta, _p, _alpha);\n            _ans = vmaxq_f32(_ans, _zero);\n            _ans = vminq_f32(_ans, _one);\n            _p = vmulq_f32(_ans, _p);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            if (*ptr < lower)\n                *ptr = 0.f;\n            else if (*ptr > upper)\n                ;\n            else\n                *ptr = *ptr * (*ptr * alpha + beta);\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint HardSwish_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        
unsigned short* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _alpha = vdupq_n_f32(alpha);\n        float32x4_t _beta = vdupq_n_f32(beta);\n        for (; i + 15 < size; i += 16)\n        {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #256]   \\n\"\n                \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0] \\n\"\n                \"shll   v0.4s, v0.4h, #16       \\n\"\n                \"shll   v1.4s, v1.4h, #16       \\n\"\n                \"shll   v2.4s, v2.4h, #16       \\n\"\n                \"shll   v3.4s, v3.4h, #16       \\n\"\n                \"mov    v4.16b, %5.16b          \\n\"\n                \"mov    v5.16b, %5.16b          \\n\"\n                \"mov    v6.16b, %5.16b          \\n\"\n                \"mov    v7.16b, %5.16b          \\n\"\n                \"fmla   v4.4s, v0.4s, %4.4s     \\n\"\n                \"fmla   v5.4s, v1.4s, %4.4s     \\n\"\n                \"fmla   v6.4s, v2.4s, %4.4s     \\n\"\n                \"fmla   v7.4s, v3.4s, %4.4s     \\n\"\n                \"fmax   v4.4s, v4.4s, %2.4s     \\n\"\n                \"fmax   v5.4s, v5.4s, %2.4s     \\n\"\n                \"fmax   v6.4s, v6.4s, %2.4s     \\n\"\n                \"fmax   v7.4s, v7.4s, %2.4s     \\n\"\n                \"fmin   v4.4s, v4.4s, %3.4s     \\n\"\n                \"fmin   v5.4s, v5.4s, %3.4s     \\n\"\n                \"fmin   v6.4s, v6.4s, %3.4s     \\n\"\n                \"fmin   v7.4s, v7.4s, %3.4s     \\n\"\n                \"fmul   v0.4s, v4.4s, v0.4s     \\n\"\n                \"fmul   v1.4s, v5.4s, v1.4s     \\n\"\n                \"fmul   v2.4s, v6.4s, v2.4s     \\n\"\n                \"fmul   v3.4s, v7.4s, v3.4s     \\n\"\n                \"shrn   v0.4h, v0.4s, #16       \\n\"\n                \"shrn   v1.4h, v1.4s, #16       \\n\"\n                
\"shrn   v2.4h, v2.4s, #16       \\n\"\n                \"shrn   v3.4h, v3.4s, #16       \\n\"\n                \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #256]      \\n\"\n                \"vld1.u16   {d4-d7}, [%0]   \\n\"\n                \"vshll.u16  q0, d4, #16     \\n\"\n                \"vshll.u16  q1, d5, #16     \\n\"\n                \"vshll.u16  q2, d6, #16     \\n\"\n                \"vshll.u16  q3, d7, #16     \\n\"\n                \"vmov       q4, %q5         \\n\"\n                \"vmov       q5, %q5         \\n\"\n                \"vmov       q6, %q5         \\n\"\n                \"vmov       q7, %q5         \\n\"\n                \"vmla.f32   q4, q0, %q4     \\n\"\n                \"vmla.f32   q5, q1, %q4     \\n\"\n                \"vmla.f32   q6, q2, %q4     \\n\"\n                \"vmla.f32   q7, q3, %q4     \\n\"\n                \"vmax.f32   q4, q4, %q2     \\n\"\n                \"vmax.f32   q5, q5, %q2     \\n\"\n                \"vmax.f32   q6, q6, %q2     \\n\"\n                \"vmax.f32   q7, q7, %q2     \\n\"\n                \"vmin.f32   q4, q4, %q3     \\n\"\n                \"vmin.f32   q5, q5, %q3     \\n\"\n                \"vmin.f32   q6, q6, %q3     \\n\"\n                \"vmin.f32   q7, q7, %q3     \\n\"\n                \"vmul.f32   q0, q4, q0      \\n\"\n                \"vmul.f32   q1, q5, q1      \\n\"\n                \"vmul.f32   q2, q6, q2      \\n\"\n                \"vmul.f32   q3, q7, q3      \\n\"\n                \"vshrn.u32  d0, q0, #16     \\n\"\n                \"vshrn.u32  d1, q1, #16     \\n\"\n          
      \"vshrn.u32  d2, q2, #16     \\n\"\n                \"vshrn.u32  d3, q3, #16     \\n\"\n                \"vst1.u16   {d0-d3}, [%0]!  \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            uint16x8_t _p = vld1q_u16(ptr);\n            uint16x8_t _q = vld1q_u16(ptr + 8);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n            float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n            float32x4_t _ans0 = vmlaq_f32(_beta, _p0, _alpha);\n            float32x4_t _ans1 = vmlaq_f32(_beta, _p1, _alpha);\n            float32x4_t _ans2 = vmlaq_f32(_beta, _p2, _alpha);\n            float32x4_t _ans3 = vmlaq_f32(_beta, _p3, _alpha);\n            _ans0 = vmaxq_f32(_ans0, _zero);\n            _ans1 = vmaxq_f32(_ans1, _zero);\n            _ans2 = vmaxq_f32(_ans2, _zero);\n            _ans3 = vmaxq_f32(_ans3, _zero);\n            _ans0 = vminq_f32(_ans0, _one);\n            _ans1 = vminq_f32(_ans1, _one);\n            _ans2 = vminq_f32(_ans2, _one);\n            _ans3 = vminq_f32(_ans3, _one);\n            _p0 = vmulq_f32(_ans0, _p0);\n            _p1 = vmulq_f32(_ans1, _p1);\n            _p2 = vmulq_f32(_ans2, _p2);\n            _p3 = vmulq_f32(_ans3, _p3);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            _q = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n            vst1q_u16(ptr, _p);\n            vst1q_u16(ptr + 8, _q);\n            ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t 
_p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            float32x4_t _ans0 = vmlaq_f32(_beta, _p0, _alpha);\n            float32x4_t _ans1 = vmlaq_f32(_beta, _p1, _alpha);\n            _ans0 = vmaxq_f32(_ans0, _zero);\n            _ans1 = vmaxq_f32(_ans1, _zero);\n            _ans0 = vminq_f32(_ans0, _one);\n            _ans1 = vminq_f32(_ans1, _one);\n            _p0 = vmulq_f32(_ans0, _p0);\n            _p1 = vmulq_f32(_ans1, _p1);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _ans = vmlaq_f32(_beta, _p, _alpha);\n            _ans = vmaxq_f32(_ans, _zero);\n            _ans = vminq_f32(_ans, _one);\n            _p = vmulq_f32(_ans, _p);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            if (v < lower)\n                v = 0.f;\n            else if (v > upper)\n                ;\n            else\n                v = v * (v * alpha + beta);\n            *ptr = float32_to_bfloat16(v);\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/hardswish_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_HARDSWISH_ARM_H\n#define LAYER_HARDSWISH_ARM_H\n\n#include \"hardswish.h\"\n\nnamespace ncnn {\n\nclass HardSwish_arm : public HardSwish\n{\npublic:\n    HardSwish_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_HARDSWISH_ARM_H\n"
  },
  {
    "path": "src/layer/arm/hardswish_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardswish_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint HardSwish_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _alpha = vdupq_n_f32(alpha);\n        float32x4_t _beta = vdupq_n_f32(beta);\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            float32x4_t _ans0 = vfmaq_f32(_beta, _p0, _alpha);\n            float32x4_t _ans1 = vfmaq_f32(_beta, _p1, _alpha);\n            _ans0 = vmaxq_f32(_ans0, _zero);\n            _ans1 = vmaxq_f32(_ans1, _zero);\n            _ans0 = vminq_f32(_ans0, _one);\n            _ans1 = vminq_f32(_ans1, _one);\n            _p0 = vmulq_f32(_ans0, _p0);\n            _p1 = vmulq_f32(_ans1, _p1);\n            _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            float32x4_t _ans = vfmaq_f32(_beta, _p, _alpha);\n            _ans = vmaxq_f32(_ans, _zero);\n            _ans = vminq_f32(_ans, _one);\n            _p = 
vmulq_f32(_ans, _p);\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)*ptr;\n            if (v < lower)\n                v = 0.f;\n            else if (v > upper)\n                ;\n            else\n                v = v * (v * alpha + beta);\n            *ptr = (__fp16)v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint HardSwish_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        __fp16 alpha_fp16 = (__fp16)alpha;\n        __fp16 beta_fp16 = (__fp16)beta;\n\n        float16x8_t _zero = vdupq_n_f16((__fp16)0.f);\n        float16x8_t _one = vdupq_n_f16((__fp16)1.f);\n        float16x8_t _alpha = vdupq_n_f16(alpha_fp16);\n        float16x8_t _beta = vdupq_n_f16(beta_fp16);\n\n        int i = 0;\n        for (; i + 31 < size; i += 32)\n        {\n#if NCNN_GNU_INLINE_ASM\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #512]   \\n\"\n                \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                \"mov    v4.16b, %5.16b          \\n\"\n                \"mov    v5.16b, %5.16b          \\n\"\n                \"mov    v6.16b, %5.16b          \\n\"\n                \"mov    v7.16b, %5.16b          \\n\"\n                \"fmla   v4.8h, v0.8h, %4.8h     \\n\"\n                \"fmla   v5.8h, v1.8h, %4.8h     \\n\"\n                \"fmla   v6.8h, v2.8h, %4.8h     \\n\"\n                \"fmla   v7.8h, v3.8h, %4.8h     \\n\"\n                \"fmax   v4.8h, v4.8h, %2.8h     \\n\"\n                \"fmax   
v5.8h, v5.8h, %2.8h     \\n\"\n                \"fmax   v6.8h, v6.8h, %2.8h     \\n\"\n                \"fmax   v7.8h, v7.8h, %2.8h     \\n\"\n                \"fmin   v4.8h, v4.8h, %3.8h     \\n\"\n                \"fmin   v5.8h, v5.8h, %3.8h     \\n\"\n                \"fmin   v6.8h, v6.8h, %3.8h     \\n\"\n                \"fmin   v7.8h, v7.8h, %3.8h     \\n\"\n                \"fmul   v0.8h, v4.8h, v0.8h     \\n\"\n                \"fmul   v1.8h, v5.8h, v1.8h     \\n\"\n                \"fmul   v2.8h, v6.8h, v2.8h     \\n\"\n                \"fmul   v3.8h, v7.8h, v3.8h     \\n\"\n                \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                : \"=r\"(ptr) // %0\n                : \"0\"(ptr),\n                \"w\"(_zero),  // %2\n                \"w\"(_one),   // %3\n                \"w\"(_alpha), // %4\n                \"w\"(_beta)   // %5\n                : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\");\n#else  // NCNN_GNU_INLINE_ASM\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x8_t _ans0 = vfmaq_f16(_beta, _p0, _alpha);\n            float16x8_t _ans1 = vfmaq_f16(_beta, _p1, _alpha);\n            float16x8_t _ans2 = vfmaq_f16(_beta, _p2, _alpha);\n            float16x8_t _ans3 = vfmaq_f16(_beta, _p3, _alpha);\n            _ans0 = vmaxq_f16(_ans0, _zero);\n            _ans1 = vmaxq_f16(_ans1, _zero);\n            _ans2 = vmaxq_f16(_ans2, _zero);\n            _ans3 = vmaxq_f16(_ans3, _zero);\n            _ans0 = vminq_f16(_ans0, _one);\n            _ans1 = vminq_f16(_ans1, _one);\n            _ans2 = vminq_f16(_ans2, _one);\n            _ans3 = vminq_f16(_ans3, _one);\n            _p0 = vmulq_f16(_ans0, _p0);\n            _p1 = vmulq_f16(_ans1, _p1);\n            _p2 = vmulq_f16(_ans2, _p2);\n            _p3 = vmulq_f16(_ans3, 
_p3);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            ptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n        }\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _ans0 = vfmaq_f16(_beta, _p0, _alpha);\n            float16x8_t _ans1 = vfmaq_f16(_beta, _p1, _alpha);\n            _ans0 = vmaxq_f16(_ans0, _zero);\n            _ans1 = vmaxq_f16(_ans1, _zero);\n            _ans0 = vminq_f16(_ans0, _one);\n            _ans1 = vminq_f16(_ans1, _one);\n            _p0 = vmulq_f16(_ans0, _p0);\n            _p1 = vmulq_f16(_ans1, _p1);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            ptr += 16;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x8_t _ans = vfmaq_f16(_beta, _p, _alpha);\n            _ans = vmaxq_f16(_ans, _zero);\n            _ans = vminq_f16(_ans, _one);\n            _p = vmulq_f16(_ans, _p);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            float16x4_t _ans = vfma_f16(vget_low_f16(_beta), _p, vget_low_f16(_alpha));\n            _ans = vmax_f16(_ans, vget_low_f16(_zero));\n            _ans = vmin_f16(_ans, vget_low_f16(_one));\n            _p = vmul_f16(_ans, _p);\n            vst1_f16(ptr, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = *ptr;\n            if (v < (__fp16)lower)\n                v = (__fp16)0.f;\n            else if (v > (__fp16)upper)\n                ;\n            else\n                v = v * (v * alpha_fp16 + beta_fp16);\n            *ptr = v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // 
__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/innerproduct_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"innerproduct_arm.h\"\n\n#include \"layer_type.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nInnerProduct_arm::InnerProduct_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n\n    flatten = 0;\n}\n\nint InnerProduct_arm::create_pipeline(const Option& opt)\n{\n    {\n        flatten = ncnn::create_layer_cpu(ncnn::LayerType::Flatten);\n\n        ncnn::ParamDict pd;\n\n        flatten->load_param(pd);\n\n        flatten->create_pipeline(opt);\n    }\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return create_pipeline_int8_arm(opt);\n    }\n#endif\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n#if NCNN_VFPV4\n    if (cpu_support_arm_vfpv4() && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n    const int num_input = weight_data_size / num_output;\n\n    int out_elempack = 1;\n\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n\n    if (out_elempack == 4)\n    {\n        // src = inch-outch\n        // dst = pb-inch-outch/pb\n        {\n            Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n            weight_data_tm.create(num_input, num_output / out_elempack, (size_t)4u * out_elempack, out_elempack);\n\n            for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n            {\n                float* g0 = weight_data_tm.row(q / out_elempack);\n\n                for (int p = 0; p < num_input; p++)\n                {\n                    for (int j = 0; j < out_elempack; j++)\n                    {\n                        *g0++ = weight_data_r2.row(q + j)[p];\n                    }\n                }\n            }\n        }\n    }\n    else\n    {\n        weight_data_tm = weight_data;\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint InnerProduct_arm::destroy_pipeline(const Option& opt)\n{\n    if (flatten)\n    {\n        flatten->destroy_pipeline(opt);\n        delete flatten;\n        flatten = 0;\n    }\n\n    return 0;\n}\n\nint InnerProduct_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && int8_scale_term)\n    {\n        return forward_int8_arm(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_VFPV4\n    if (cpu_support_arm_vfpv4() && opt.use_fp16_storage)\n    {\n        return forward_fp16s(bottom_blob, 
top_blob, opt);\n    }\n#endif\n\n    const int num_input = weight_data_size / num_output;\n\n    if (bottom_blob.dims == 2 && bottom_blob.w == num_input)\n    {\n        // gemm\n        int h = bottom_blob.h;\n        size_t elemsize = bottom_blob.elemsize;\n        int elempack = bottom_blob.elempack;\n\n        top_blob.create(num_output, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int num_output_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            num_output_elempack = num_output % 4 == 0 ? 4 : 1;\n        }\n#endif\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n#if __ARM_NEON\n            if (elempack == 4 && num_output_elempack == 4)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const float* kptr = weight_data_tm.row(p);\n                    const float* m = bottom_blob.row(j);\n\n                    float32x4_t _sum0 = vdupq_n_f32(0.f);\n                    float32x4_t _sum1 = vdupq_n_f32(0.f);\n                    float32x4_t _sum2 = vdupq_n_f32(0.f);\n                    float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum0 = vdupq_n_f32(bias_data[p * 4 + 0]);\n                        _sum1 = vdupq_n_f32(bias_data[p * 4 + 1]);\n                        _sum2 = vdupq_n_f32(bias_data[p * 4 + 2]);\n                        _sum3 = vdupq_n_f32(bias_data[p * 4 + 3]);\n                    }\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        float32x4_t _val = vld1q_f32(m);\n                        float32x4_t _w = vld1q_f32(kptr);\n#if __aarch64__\n                        _sum0 = vfmaq_laneq_f32(_sum0, 
_val, _w, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _val, _w, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _val, _w, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _val, _w, 3);\n#else\n                        _sum0 = vmlaq_lane_f32(_sum0, _val, vget_low_f32(_w), 0);\n                        _sum1 = vmlaq_lane_f32(_sum1, _val, vget_low_f32(_w), 1);\n                        _sum2 = vmlaq_lane_f32(_sum2, _val, vget_high_f32(_w), 0);\n                        _sum3 = vmlaq_lane_f32(_sum3, _val, vget_high_f32(_w), 1);\n#endif\n                        m += 4;\n                        kptr += 4;\n                    }\n\n                    _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps(_sum3, activation_type, activation_params);\n\n                    vst1q_f32(outptr, _sum0);\n                    vst1q_f32(outptr + 4, _sum1);\n                    vst1q_f32(outptr + 8, _sum2);\n                    vst1q_f32(outptr + 12, _sum3);\n                    outptr += 16;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 4)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const float* kptr = weight_data_tm.row(p);\n                    const float* m = bottom_blob.row(j);\n\n                    float32x4_t _sum0 = vdupq_n_f32(0.f);\n                    float32x4_t _sum1 = vdupq_n_f32(0.f);\n                    float32x4_t _sum2 = vdupq_n_f32(0.f);\n                    float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum0 = vld1q_f32((const float*)bias_data 
+ p * 4);\n                    }\n\n                    int i = 0;\n                    for (; i + 3 < num_input; i += 4)\n                    {\n                        float32x4_t _val = vld1q_f32(m);\n\n                        float32x4_t _w0 = vld1q_f32(kptr);\n                        float32x4_t _w1 = vld1q_f32(kptr + 4);\n                        float32x4_t _w2 = vld1q_f32(kptr + 8);\n                        float32x4_t _w3 = vld1q_f32(kptr + 12);\n\n#if __aarch64__\n                        _sum0 = vfmaq_laneq_f32(_sum0, _w0, _val, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _w1, _val, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _w2, _val, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _w3, _val, 3);\n#else\n                        _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_val), 0);\n                        _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_val), 1);\n                        _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_val), 0);\n                        _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_val), 1);\n#endif\n\n                        m += 4;\n                        kptr += 16;\n                    }\n                    for (; i < num_input; i++)\n                    {\n                        float32x4_t _val = vld1q_dup_f32(m);\n                        float32x4_t _k = vld1q_f32(kptr);\n                        _sum0 = vmlaq_f32(_sum0, _val, _k);\n\n                        m += 1;\n                        kptr += 4;\n                    }\n\n                    _sum0 = vaddq_f32(_sum0, _sum1);\n                    _sum2 = vaddq_f32(_sum2, _sum3);\n                    _sum0 = vaddq_f32(_sum0, _sum2);\n\n                    _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n                    vst1q_f32(outptr, _sum0);\n                    outptr += 4;\n                }\n            }\n\n            if (elempack == 4 && num_output_elempack == 1)\n           
 {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const float* kptr = (const float*)weight_data_tm + num_input * p;\n                    const float* m = bottom_blob.row(j);\n\n                    float32x4_t _sum = vdupq_n_f32(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum = vdupq_n_f32(bias_data[p]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float32x4_t _val = vld1q_f32(m);\n                        float32x4_t _k = vdupq_n_f32(kptr[0]);\n                        _sum = vmlaq_f32(_sum, _val, _k);\n\n                        m += 4;\n                        kptr += 1;\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    vst1q_f32(outptr, _sum);\n                    outptr += 4;\n                }\n            }\n#endif // __ARM_NEON\n\n            if (elempack == 1 && num_output_elempack == 1)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const float* kptr = (const float*)weight_data_tm + num_input * p;\n                    const float* m = bottom_blob.row(j);\n\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    int i = 0;\n#if __ARM_NEON\n                    float32x4_t _sum = vdupq_n_f32(0.f);\n                    for (; i + 3 < num_input; i += 4)\n                    {\n                        float32x4_t _val = vld1q_f32(m);\n                        float32x4_t _k = vld1q_f32(kptr);\n                        _sum = vmlaq_f32(_sum, _val, _k);\n\n                        m += 4;\n                        
kptr += 4;\n                    }\n#if __aarch64__\n                    sum += vaddvq_f32(_sum);\n#else\n                    float32x2_t _ss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n                    _ss = vpadd_f32(_ss, _ss);\n                    sum += vget_lane_f32(_ss, 0);\n#endif\n#endif // __ARM_NEON\n                    for (; i < num_input; i++)\n                    {\n                        sum += *m * *kptr;\n\n                        m += 1;\n                        kptr += 1;\n                    }\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n                    outptr[0] = sum;\n                    outptr += 1;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // flatten\n    Mat bottom_blob_flattened = bottom_blob;\n    if (bottom_blob.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n\n        flatten->forward(bottom_blob, bottom_blob_flattened, opt_flatten);\n        if (bottom_blob_flattened.empty())\n            return -100;\n    }\n\n    size_t elemsize = bottom_blob_flattened.elemsize;\n    int elempack = bottom_blob_flattened.elempack;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __ARM_NEON\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __ARM_NEON\n    if (out_elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            float32x4_t _sum0 = bias_term ? 
vld1q_f32((const float*)bias_data + p * 4) : vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            const float* kptr = weight_data_tm.row(p);\n\n            const float* sptr = bottom_blob_flattened;\n\n            int i = 0;\n#if NCNN_GNU_INLINE_ASM\n            for (; i + 7 < num_input; i += 8)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm       pldl1keep, [%0, #256]     \\n\"\n                    \"ld1        {v0.4s, v1.4s}, [%0], #32 \\n\"\n                    \"prfm       pldl1keep, [%1, #512]     \\n\"\n                    \"ld1        {v2.4s, v3.4s, v4.4s, v5.4s}, [%1], #64 \\n\"\n                    \"prfm       pldl1keep, [%1, #512]     \\n\"\n                    \"ld1        {v6.4s, v7.4s, v8.4s, v9.4s}, [%1], #64 \\n\"\n                    \"fmla       %2.4s, v2.4s, v0.s[0]     \\n\"\n                    \"fmla       %3.4s, v3.4s, v0.s[1]     \\n\"\n                    \"fmla       %4.4s, v4.4s, v0.s[2]     \\n\"\n                    \"fmla       %5.4s, v5.4s, v0.s[3]     \\n\"\n                    \"fmla       %2.4s, v6.4s, v1.s[0]     \\n\"\n                    \"fmla       %3.4s, v7.4s, v1.s[1]     \\n\"\n                    \"fmla       %4.4s, v8.4s, v1.s[2]     \\n\"\n                    \"fmla       %5.4s, v9.4s, v1.s[3]     \\n\"\n                    : \"=r\"(sptr),  // %0\n                    \"=r\"(kptr),  // %1\n                    \"=w\"(_sum0), // %2\n                    \"=w\"(_sum1), // %3\n                    \"=w\"(_sum2), // %4\n                    \"=w\"(_sum3)  // %5\n                    : \"0\"(sptr),\n                    \"1\"(kptr),\n                    \"2\"(_sum0),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3)\n                    : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", 
\"v6\", \"v7\", \"v8\", \"v9\");\n#else\n                asm volatile(\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%0 :128]! \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vldm       %1!, {d4-d11}       \\n\"\n                    \"pld        [%1, #512]          \\n\"\n                    \"vldm       %1!, {d12-d19}      \\n\"\n                    \"vmla.f32   %q2, q2, d0[0]      \\n\"\n                    \"vmla.f32   %q3, q3, d0[1]      \\n\"\n                    \"vmla.f32   %q4, q4, d1[0]      \\n\"\n                    \"vmla.f32   %q5, q5, d1[1]      \\n\"\n                    \"vmla.f32   %q2, q6, d2[0]      \\n\"\n                    \"vmla.f32   %q3, q7, d2[1]      \\n\"\n                    \"vmla.f32   %q4, q8, d3[0]      \\n\"\n                    \"vmla.f32   %q5, q9, d3[1]      \\n\"\n                    : \"=r\"(sptr),  // %0\n                    \"=r\"(kptr),  // %1\n                    \"=w\"(_sum0), // %2\n                    \"=w\"(_sum1), // %3\n                    \"=w\"(_sum2), // %4\n                    \"=w\"(_sum3)  // %5\n                    : \"0\"(sptr),\n                    \"1\"(kptr),\n                    \"2\"(_sum0),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3)\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\");\n#endif\n            }\n#endif // NCNN_GNU_INLINE_ASM\n            for (; i + 3 < num_input; i += 4)\n            {\n                float32x4_t _val = vld1q_f32(sptr);\n\n                float32x4_t _w0 = vld1q_f32(kptr);\n                float32x4_t _w1 = vld1q_f32(kptr + 4);\n                float32x4_t _w2 = vld1q_f32(kptr + 8);\n                float32x4_t _w3 = vld1q_f32(kptr + 12);\n\n#if __aarch64__\n                _sum0 = vfmaq_laneq_f32(_sum0, _w0, _val, 0);\n                _sum1 = 
vfmaq_laneq_f32(_sum1, _w1, _val, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _w2, _val, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _w3, _val, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_val), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_val), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_val), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_val), 1);\n#endif\n\n                sptr += 4;\n                kptr += 16;\n            }\n            for (; i < num_input; i++)\n            {\n                float32x4_t _val = vld1q_dup_f32(sptr);\n                float32x4_t _w = vld1q_f32(kptr);\n                _sum0 = vmlaq_f32(_sum0, _val, _w);\n\n                sptr += 1;\n                kptr += 4;\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _sum0 = vaddq_f32(_sum0, _sum2);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n            float* outptr = top_blob;\n            vst1q_f32(outptr + p * 4, _sum0);\n        }\n    }\n#endif // __ARM_NEON\n\n    if (out_elempack == 1)\n    {\n        const float* weight_data_ptr = weight_data_tm;\n\n        int nn_num_output = num_output >> 2;\n        int remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int pp = 0; pp < nn_num_output; pp++)\n        {\n            int p = pp * 4;\n\n            float sum0 = 0.f;\n            float sum1 = 0.f;\n            float sum2 = 0.f;\n            float sum3 = 0.f;\n\n            if (bias_term)\n            {\n                sum0 = bias_data[p];\n                sum1 = bias_data[p + 1];\n                sum2 = bias_data[p + 2];\n                sum3 = bias_data[p + 3];\n            }\n\n            const float* w0 = weight_data_ptr + num_input * p;\n            const float* w1 
= weight_data_ptr + num_input * (p + 1);\n            const float* w2 = weight_data_ptr + num_input * (p + 2);\n            const float* w3 = weight_data_ptr + num_input * (p + 3);\n\n            const float* m = bottom_blob_flattened;\n\n            int i = 0;\n#if __ARM_NEON\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n#if NCNN_GNU_INLINE_ASM\n            for (; i + 7 < num_input; i += 8)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm       pldl1keep, [%0, #256]     \\n\"\n                    \"ld1        {v0.4s, v1.4s}, [%0], #32 \\n\"\n                    \"prfm       pldl1keep, [%1, #256]     \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%1], #32 \\n\"\n                    \"prfm       pldl1keep, [%2, #256]     \\n\"\n                    \"ld1        {v4.4s, v5.4s}, [%2], #32 \\n\"\n                    \"prfm       pldl1keep, [%3, #256]     \\n\"\n                    \"ld1        {v6.4s, v7.4s}, [%3], #32 \\n\"\n                    \"prfm       pldl1keep, [%4, #256]     \\n\"\n                    \"ld1        {v8.4s, v9.4s}, [%4], #32 \\n\"\n                    \"fmla       %5.4s, v0.4s, v2.4s       \\n\"\n                    \"fmla       %6.4s, v0.4s, v4.4s       \\n\"\n                    \"fmla       %7.4s, v0.4s, v6.4s       \\n\"\n                    \"fmla       %8.4s, v0.4s, v8.4s       \\n\"\n                    \"fmla       %5.4s, v1.4s, v3.4s       \\n\"\n                    \"fmla       %6.4s, v1.4s, v5.4s       \\n\"\n                    \"fmla       %7.4s, v1.4s, v7.4s       \\n\"\n                    \"fmla       %8.4s, v1.4s, v9.4s       \\n\"\n                    : \"=r\"(m),     // %0\n                    \"=r\"(w0),    // %1\n                    \"=r\"(w1),    // %2\n                    \"=r\"(w2),    // %3\n                    
\"=r\"(w3),    // %4\n                    \"=w\"(_sum0), // %5\n                    \"=w\"(_sum1), // %6\n                    \"=w\"(_sum2), // %7\n                    \"=w\"(_sum3)  // %8\n                    : \"0\"(m),\n                    \"1\"(w0),\n                    \"2\"(w1),\n                    \"3\"(w2),\n                    \"4\"(w3),\n                    \"5\"(_sum0),\n                    \"6\"(_sum1),\n                    \"7\"(_sum2),\n                    \"8\"(_sum3)\n                    : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n#else\n                asm volatile(\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%0 :128]! \\n\"\n                    \"pld        [%1, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%1]!      \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d8-d11}, [%2]!     \\n\"\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld1.f32   {d12-d15}, [%3]!    \\n\"\n                    \"pld        [%4, #256]          \\n\"\n                    \"vld1.f32   {d16-d19}, [%4]!    
\\n\"\n                    \"vmla.f32   %q5, q0, q2         \\n\"\n                    \"vmla.f32   %q6, q0, q4         \\n\"\n                    \"vmla.f32   %q7, q0, q6         \\n\"\n                    \"vmla.f32   %q8, q0, q8         \\n\"\n                    \"vmla.f32   %q5, q1, q3         \\n\"\n                    \"vmla.f32   %q6, q1, q5         \\n\"\n                    \"vmla.f32   %q7, q1, q7         \\n\"\n                    \"vmla.f32   %q8, q1, q9         \\n\"\n                    : \"=r\"(m),     // %0\n                    \"=r\"(w0),    // %1\n                    \"=r\"(w1),    // %2\n                    \"=r\"(w2),    // %3\n                    \"=r\"(w3),    // %4\n                    \"=w\"(_sum0), // %5\n                    \"=w\"(_sum1), // %6\n                    \"=w\"(_sum2), // %7\n                    \"=w\"(_sum3)  // %8\n                    : \"0\"(m),\n                    \"1\"(w0),\n                    \"2\"(w1),\n                    \"3\"(w2),\n                    \"4\"(w3),\n                    \"5\"(_sum0),\n                    \"6\"(_sum1),\n                    \"7\"(_sum2),\n                    \"8\"(_sum3)\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\");\n#endif // __aarch64__\n            }\n#endif // NCNN_GNU_INLINE_ASM\n            for (; i + 3 < num_input; i += 4)\n            {\n                float32x4_t _val = vld1q_f32(m);\n\n                float32x4_t _w0 = vld1q_f32(w0);\n                float32x4_t _w1 = vld1q_f32(w1);\n                float32x4_t _w2 = vld1q_f32(w2);\n                float32x4_t _w3 = vld1q_f32(w3);\n\n                _sum0 = vmlaq_f32(_sum0, _val, _w0);\n                _sum1 = vmlaq_f32(_sum1, _val, _w1);\n                _sum2 = vmlaq_f32(_sum2, _val, _w2);\n                _sum3 = vmlaq_f32(_sum3, _val, _w3);\n\n                m += 4;\n                w0 += 4;\n                w1 += 4;\n                w2 += 4;\n   
             w3 += 4;\n            }\n\n            float32x2_t _sum0ss = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n            float32x2_t _sum1ss = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n            float32x2_t _sum2ss = vadd_f32(vget_low_f32(_sum2), vget_high_f32(_sum2));\n            float32x2_t _sum3ss = vadd_f32(vget_low_f32(_sum3), vget_high_f32(_sum3));\n\n            float32x2_t _sum01ss = vpadd_f32(_sum0ss, _sum1ss);\n            float32x2_t _sum23ss = vpadd_f32(_sum2ss, _sum3ss);\n\n            sum0 += vget_lane_f32(_sum01ss, 0);\n            sum1 += vget_lane_f32(_sum01ss, 1);\n            sum2 += vget_lane_f32(_sum23ss, 0);\n            sum3 += vget_lane_f32(_sum23ss, 1);\n#endif // __ARM_NEON\n            for (; i < num_input; i++)\n            {\n                sum0 += *m * *w0;\n                sum1 += *m * *w1;\n                sum2 += *m * *w2;\n                sum3 += *m * *w3;\n\n                m++;\n                w0++;\n                w1++;\n                w2++;\n                w3++;\n            }\n\n            sum0 = activation_ss(sum0, activation_type, activation_params);\n            sum1 = activation_ss(sum1, activation_type, activation_params);\n            sum2 = activation_ss(sum2, activation_type, activation_params);\n            sum3 = activation_ss(sum3, activation_type, activation_params);\n\n            top_blob[p] = sum0;\n            top_blob[p + 1] = sum1;\n            top_blob[p + 2] = sum2;\n            top_blob[p + 3] = sum3;\n        }\n\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = remain_num_output_start; p < num_output; p++)\n        {\n            float sum = 0.f;\n\n            if (bias_term)\n                sum = bias_data[p];\n\n            const float* w = weight_data_ptr + num_input * p;\n\n            const float* m = bottom_blob_flattened;\n\n            int i = 0;\n#if __ARM_NEON\n            float32x4_t _sum = 
vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n#if NCNN_GNU_INLINE_ASM\n            for (; i + 7 < num_input; i += 8)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm       pldl1keep, [%0, #256]     \\n\"\n                    \"ld1        {v0.4s, v1.4s}, [%0], #32 \\n\"\n                    \"prfm       pldl1keep, [%1, #256]     \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%1], #32 \\n\"\n                    \"fmla       %2.4s, v0.4s, v2.4s       \\n\"\n                    \"fmla       %3.4s, v1.4s, v3.4s       \\n\"\n                    : \"=r\"(m),    // %0\n                    \"=r\"(w),    // %1\n                    \"=w\"(_sum), // %2\n                    \"=w\"(_sum2) // %3\n                    : \"0\"(m),\n                    \"1\"(w),\n                    \"2\"(_sum),\n                    \"3\"(_sum2)\n                    : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                asm volatile(\n                    \"pld        [%0, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%0 :128]! \\n\"\n                    \"pld        [%1, #256]          \\n\"\n                    \"vld1.f32   {d4-d7}, [%1]!      
\\n\"\n                    \"vmla.f32   %q2, q0, q2         \\n\"\n                    \"vmla.f32   %q3, q1, q3         \\n\"\n                    : \"=r\"(m),    // %0\n                    \"=r\"(w),    // %1\n                    \"=w\"(_sum), // %2\n                    \"=w\"(_sum2) // %3\n                    : \"0\"(m),\n                    \"1\"(w),\n                    \"2\"(_sum),\n                    \"3\"(_sum2)\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n            }\n#endif // NCNN_GNU_INLINE_ASM\n            for (; i + 3 < num_input; i += 4)\n            {\n                float32x4_t _val = vld1q_f32(m);\n                float32x4_t _w = vld1q_f32(w);\n                _sum = vmlaq_f32(_sum, _val, _w);\n                m += 4;\n                w += 4;\n            }\n\n            _sum = vaddq_f32(_sum, _sum2);\n#if __aarch64__\n            sum += vaddvq_f32(_sum);\n#else\n            float32x2_t _sumss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n            _sumss = vpadd_f32(_sumss, _sumss);\n            sum += vget_lane_f32(_sumss, 0);\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; i < num_input; i++)\n            {\n                sum += *m * *w;\n\n                m++;\n                w++;\n            }\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            top_blob[p] = sum;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint InnerProduct_arm::create_pipeline_bf16s(const Option& opt)\n{\n    const int num_input = weight_data_size / num_output;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n\n    // src = inch-outch\n    // dst = pb-inch-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n        weight_data_tm.create(num_input, num_output / out_elempack, (size_t)2u * out_elempack, out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            unsigned short* g0 = weight_data_tm.row<unsigned short>(q / out_elempack);\n\n            for (int p = 0; p < num_input; p++)\n            {\n                for (int j = 0; j < out_elempack; j++)\n                {\n                    *g0++ = float32_to_bfloat16(weight_data_r2.row(q + j)[p]);\n                }\n            }\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint InnerProduct_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int num_input = weight_data_size / num_output;\n\n    if (bottom_blob.dims == 2 && bottom_blob.w == num_input)\n    {\n        // gemm\n        int h = bottom_blob.h;\n        size_t elemsize = bottom_blob.elemsize;\n        int elempack = bottom_blob.elempack;\n\n        top_blob.create(num_output, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int num_output_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            num_output_elempack = num_output % 4 == 0 ? 
4 : 1;\n        }\n#endif // __ARM_NEON\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n#if __ARM_NEON\n            if (elempack == 4 && num_output_elempack == 4)\n            {\n                unsigned short* outptr = top_blob.row<unsigned short>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const unsigned short* kptr = (const unsigned short*)weight_data_tm + num_input * p * 4;\n                    const unsigned short* m = bottom_blob.row<const unsigned short>(j);\n\n                    float32x4_t _sum0 = vdupq_n_f32(0.f);\n                    float32x4_t _sum1 = vdupq_n_f32(0.f);\n                    float32x4_t _sum2 = vdupq_n_f32(0.f);\n                    float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum0 = vdupq_n_f32(bias_data[p * 4 + 0]);\n                        _sum1 = vdupq_n_f32(bias_data[p * 4 + 1]);\n                        _sum2 = vdupq_n_f32(bias_data[p * 4 + 2]);\n                        _sum3 = vdupq_n_f32(bias_data[p * 4 + 3]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float32x4_t _val = bfloat2float(vld1_u16(m));\n                        float32x4_t _k = bfloat2float(vld1_u16(kptr));\n#if __aarch64__\n                        _sum0 = vfmaq_laneq_f32(_sum0, _val, _k, 0);\n                        _sum1 = vfmaq_laneq_f32(_sum1, _val, _k, 1);\n                        _sum2 = vfmaq_laneq_f32(_sum2, _val, _k, 2);\n                        _sum3 = vfmaq_laneq_f32(_sum3, _val, _k, 3);\n#else\n                        _sum0 = vmlaq_lane_f32(_sum0, _val, vget_low_f32(_k), 0);\n                        _sum1 = vmlaq_lane_f32(_sum1, _val, vget_low_f32(_k), 1);\n                        _sum2 = vmlaq_lane_f32(_sum2, _val, vget_high_f32(_k), 0);\n            
            _sum3 = vmlaq_lane_f32(_sum3, _val, vget_high_f32(_k), 1);\n#endif\n\n                        m += 4;\n                        kptr += 4;\n                    }\n\n                    _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps(_sum3, activation_type, activation_params);\n\n                    vst1_u16(outptr, float2bfloat(_sum0));\n                    vst1_u16(outptr + 4, float2bfloat(_sum1));\n                    vst1_u16(outptr + 8, float2bfloat(_sum2));\n                    vst1_u16(outptr + 12, float2bfloat(_sum3));\n                    outptr += 16;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 4)\n            {\n                unsigned short* outptr = top_blob.row<unsigned short>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const unsigned short* kptr = (const unsigned short*)weight_data_tm + num_input * p * 4;\n                    const unsigned short* m = bottom_blob.row<const unsigned short>(j);\n\n                    float32x4_t _sum = vdupq_n_f32(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum = vld1q_f32((const float*)bias_data + p * 4);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float32x4_t _val = vdupq_n_f32(bfloat16_to_float32(m[0]));\n                        float32x4_t _k = bfloat2float(vld1_u16(kptr));\n                        _sum = vmlaq_f32(_sum, _val, _k);\n\n                        m += 1;\n                        kptr += 4;\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n     
               vst1_u16(outptr, float2bfloat(_sum));\n                    outptr += 4;\n                }\n            }\n\n            if (elempack == 4 && num_output_elempack == 1)\n            {\n                unsigned short* outptr = top_blob.row<unsigned short>(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const unsigned short* kptr = (const unsigned short*)weight_data_tm + num_input * p;\n                    const unsigned short* m = bottom_blob.row<const unsigned short>(j);\n\n                    float32x4_t _sum = vdupq_n_f32(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum = vdupq_n_f32(bias_data[p]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float32x4_t _val = bfloat2float(vld1_u16(m));\n                        float32x4_t _k = vdupq_n_f32(bfloat16_to_float32(kptr[0]));\n                        _sum = vmlaq_f32(_sum, _val, _k);\n\n                        m += 4;\n                        kptr += 1;\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    vst1_u16(outptr, float2bfloat(_sum));\n                    outptr += 4;\n                }\n            }\n#endif // __ARM_NEON\n\n            if (elempack == 1 && num_output_elempack == 1)\n            {\n                unsigned short* outptr = top_blob.row<unsigned short>(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const unsigned short* kptr = (const unsigned short*)weight_data_tm + num_input * p;\n                    const unsigned short* m = bottom_blob.row<const unsigned short>(j);\n\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    for (int i = 
0; i < num_input; i++)\n                    {\n                        sum += bfloat16_to_float32(*m) * bfloat16_to_float32(*kptr);\n\n                        m += 1;\n                        kptr += 1;\n                    }\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n                    outptr[0] = float32_to_bfloat16(sum);\n                    outptr += 1;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // flatten\n    Mat bottom_blob_flattened = bottom_blob;\n    if (bottom_blob.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n\n        flatten->forward(bottom_blob, bottom_blob_flattened, opt_flatten);\n        if (bottom_blob_flattened.empty())\n            return -100;\n    }\n\n    size_t elemsize = bottom_blob_flattened.elemsize;\n    int elempack = bottom_blob_flattened.elempack;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif // __ARM_NEON\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __ARM_NEON\n    if (out_elempack == 4)\n    {\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            if (bias_term)\n            {\n                _sum0 = vld1q_f32(((const float*)bias_data) + p * 4);\n            }\n\n            const unsigned short* kptr = weight_data_tm.row<const unsigned short>(p);\n\n            const unsigned short* sptr = bottom_blob_flattened;\n\n            int i = 0;\n            for (; i + 3 < num_input; i += 4)\n            {\n                float32x4_t _val = bfloat2float(vld1_u16(sptr));\n\n                float32x4_t _w0 = bfloat2float(vld1_u16(kptr));\n                float32x4_t _w1 = bfloat2float(vld1_u16(kptr + 4));\n                float32x4_t _w2 = bfloat2float(vld1_u16(kptr + 8));\n                float32x4_t _w3 = bfloat2float(vld1_u16(kptr + 12));\n\n#if __aarch64__\n                _sum0 = vmlaq_laneq_f32(_sum0, _w0, _val, 0);\n                _sum1 = vmlaq_laneq_f32(_sum1, _w1, _val, 1);\n                _sum2 = vmlaq_laneq_f32(_sum2, _w2, _val, 2);\n                _sum3 = vmlaq_laneq_f32(_sum3, _w3, _val, 3);\n#else\n                _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_val), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_val), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_val), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_val), 1);\n#endif\n\n                sptr += 
4;\n                kptr += 16;\n            }\n            for (; i < num_input; i++)\n            {\n                float32x4_t _val = vdupq_n_f32(bfloat16_to_float32(sptr[0]));\n\n                float32x4_t _w = bfloat2float(vld1_u16(kptr));\n\n                _sum0 = vmlaq_f32(_sum0, _val, _w);\n\n                sptr += 1;\n                kptr += 4;\n            }\n\n            _sum0 = vaddq_f32(_sum0, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _sum0 = vaddq_f32(_sum0, _sum2);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n            unsigned short* outptr = (unsigned short*)top_blob;\n            vst1_u16(outptr + p * 4, float2bfloat(_sum0));\n        }\n    }\n#endif // __ARM_NEON\n\n    if (out_elempack == 1)\n    {\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output; p++)\n        {\n            float sum = 0.f;\n\n            if (bias_term)\n                sum = bias_data[p];\n\n            const unsigned short* kptr = weight_data_tm.row<unsigned short>(p);\n\n            const unsigned short* sptr = bottom_blob_flattened;\n\n            int i = 0;\n#if __ARM_NEON\n            float32x4_t _sum = vdupq_n_f32(0.f);\n            for (; i + 3 < num_input; i += 4)\n            {\n                float32x4_t _m = bfloat2float(vld1_u16(sptr));\n                float32x4_t _w = bfloat2float(vld1_u16(kptr));\n\n                _sum = vmlaq_f32(_sum, _m, _w);\n\n                sptr += 4;\n                kptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < num_input; i++)\n            {\n                float v = bfloat16_to_float32(*sptr);\n                float k = bfloat16_to_float32(*kptr);\n\n                sum += v * k;\n\n                sptr++;\n                kptr++;\n            }\n\n#if __ARM_NEON\n#if __aarch64__\n            sum += vaddvq_f32(_sum);\n#else\n            float32x2_t _sumss = 
vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n            _sumss = vpadd_f32(_sumss, _sumss);\n            sum += vget_lane_f32(_sumss, 0);\n#endif // __aarch64__\n#endif // __ARM_NEON\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            unsigned short* outptr = (unsigned short*)top_blob;\n            outptr[p] = float32_to_bfloat16(sum);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n#if NCNN_INT8\nint InnerProduct_arm::create_pipeline_int8_arm(const Option& opt)\n{\n    const int num_input = weight_data_size / num_output;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 8 == 0 ? 8 : 1;\n    }\n#endif\n\n    // src = inch-outch\n    // dst = pb-inch-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n        weight_data_tm.create(num_input, num_output / out_elempack, (size_t)out_elempack, out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            signed char* g0 = weight_data_tm.row<signed char>(q / out_elempack);\n\n            for (int p = 0; p < num_input; p++)\n            {\n                for (int j = 0; j < out_elempack; j++)\n                {\n                    *g0++ = weight_data_r2.row<signed char>(q + j)[p];\n                }\n            }\n        }\n    }\n\n    scale_in_data.create(num_output);\n    for (int p = 0; p < num_output; p++)\n    {\n        // dequantize\n        float scale_in;\n        if (weight_data_int8_scales[p] == 0)\n            scale_in = 0;\n        else\n            scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n        scale_in_data[p] = scale_in;\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint InnerProduct_arm::forward_int8_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int 
num_input = weight_data_size / num_output;\n\n    int elembits = bottom_blob.elembits();\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elembits != 8)\n    {\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_q);\n        if (bottom_blob_int8.empty())\n            return -100;\n    }\n\n    if (bottom_blob_int8.dims == 2 && bottom_blob_int8.w == num_input)\n    {\n        // gemm\n        Mat bottom_blob_int8_unpacked;\n        Option opt_unpack = opt;\n        opt_unpack.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_int8, bottom_blob_int8_unpacked, 1, opt_unpack);\n        if (bottom_blob_int8_unpacked.empty())\n            return -100;\n\n        int h = bottom_blob_int8_unpacked.h;\n\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = h % 4 == 0 ? 4 : 1;\n        }\n#endif\n\n        int outh = h / out_elempack;\n\n        top_blob.create(num_output, outh, (size_t)(4u * out_elempack), out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int num_output_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            num_output_elempack = num_output % 8 == 0 ? 
8 : 1;\n        }\n#endif\n\n#if __ARM_NEON\n        if (num_output_elempack == 8 && out_elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m0 = bottom_blob_int8_unpacked.row<const signed char>(j * 4);\n                    const signed char* m1 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 1);\n                    const signed char* m2 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 2);\n                    const signed char* m3 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 3);\n\n                    int32x4_t _sum00 = vdupq_n_s32(0);\n                    int32x4_t _sum01 = vdupq_n_s32(0);\n                    int32x4_t _sum10 = vdupq_n_s32(0);\n                    int32x4_t _sum11 = vdupq_n_s32(0);\n                    int32x4_t _sum20 = vdupq_n_s32(0);\n                    int32x4_t _sum21 = vdupq_n_s32(0);\n                    int32x4_t _sum30 = vdupq_n_s32(0);\n                    int32x4_t _sum31 = vdupq_n_s32(0);\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        int8x8_t _val0 = vld1_dup_s8(m0);\n                        int8x8_t _val1 = vld1_dup_s8(m1);\n                        int8x8_t _val2 = vld1_dup_s8(m2);\n                        int8x8_t _val3 = vld1_dup_s8(m3);\n\n                        int8x8_t _w = vld1_s8(kptr);\n\n                        int16x8_t _s0 = vmull_s8(_val0, _w);\n                        int16x8_t _s1 = vmull_s8(_val1, _w);\n                        int16x8_t _s2 = vmull_s8(_val2, _w);\n                        int16x8_t _s3 = vmull_s8(_val3, _w);\n     
                   _sum00 = vaddw_s16(_sum00, vget_low_s16(_s0));\n                        _sum01 = vaddw_s16(_sum01, vget_high_s16(_s0));\n                        _sum10 = vaddw_s16(_sum10, vget_low_s16(_s1));\n                        _sum11 = vaddw_s16(_sum11, vget_high_s16(_s1));\n                        _sum20 = vaddw_s16(_sum20, vget_low_s16(_s2));\n                        _sum21 = vaddw_s16(_sum21, vget_high_s16(_s2));\n                        _sum30 = vaddw_s16(_sum30, vget_low_s16(_s3));\n                        _sum31 = vaddw_s16(_sum31, vget_high_s16(_s3));\n\n                        m0++;\n                        m1++;\n                        m2++;\n                        m3++;\n                        kptr += 8;\n                    }\n\n                    // dequantize and relu\n                    float32x4_t _scale_in0 = vld1q_f32((const float*)scale_in_data + p * 8);\n                    float32x4_t _scale_in1 = vld1q_f32((const float*)scale_in_data + p * 8 + 4);\n\n                    float32x4_t _sumfp32_00 = vcvtq_f32_s32(_sum00);\n                    float32x4_t _sumfp32_01 = vcvtq_f32_s32(_sum01);\n                    float32x4_t _sumfp32_10 = vcvtq_f32_s32(_sum10);\n                    float32x4_t _sumfp32_11 = vcvtq_f32_s32(_sum11);\n                    float32x4_t _sumfp32_20 = vcvtq_f32_s32(_sum20);\n                    float32x4_t _sumfp32_21 = vcvtq_f32_s32(_sum21);\n                    float32x4_t _sumfp32_30 = vcvtq_f32_s32(_sum30);\n                    float32x4_t _sumfp32_31 = vcvtq_f32_s32(_sum31);\n                    if (bias_term)\n                    {\n                        float32x4_t _bias0 = vld1q_f32((const float*)bias_data + p * 8);\n                        float32x4_t _bias1 = vld1q_f32((const float*)bias_data + p * 8 + 4);\n                        _sumfp32_00 = vmlaq_f32(_bias0, _sumfp32_00, _scale_in0);\n                        _sumfp32_01 = vmlaq_f32(_bias1, _sumfp32_01, _scale_in1);\n                        
_sumfp32_10 = vmlaq_f32(_bias0, _sumfp32_10, _scale_in0);\n                        _sumfp32_11 = vmlaq_f32(_bias1, _sumfp32_11, _scale_in1);\n                        _sumfp32_20 = vmlaq_f32(_bias0, _sumfp32_20, _scale_in0);\n                        _sumfp32_21 = vmlaq_f32(_bias1, _sumfp32_21, _scale_in1);\n                        _sumfp32_30 = vmlaq_f32(_bias0, _sumfp32_30, _scale_in0);\n                        _sumfp32_31 = vmlaq_f32(_bias1, _sumfp32_31, _scale_in1);\n                    }\n                    else\n                    {\n                        _sumfp32_00 = vmulq_f32(_sumfp32_00, _scale_in0);\n                        _sumfp32_01 = vmulq_f32(_sumfp32_01, _scale_in1);\n                        _sumfp32_10 = vmulq_f32(_sumfp32_10, _scale_in0);\n                        _sumfp32_11 = vmulq_f32(_sumfp32_11, _scale_in1);\n                        _sumfp32_20 = vmulq_f32(_sumfp32_20, _scale_in0);\n                        _sumfp32_21 = vmulq_f32(_sumfp32_21, _scale_in1);\n                        _sumfp32_30 = vmulq_f32(_sumfp32_30, _scale_in0);\n                        _sumfp32_31 = vmulq_f32(_sumfp32_31, _scale_in1);\n                    }\n\n                    _sumfp32_00 = activation_ps(_sumfp32_00, activation_type, activation_params);\n                    _sumfp32_01 = activation_ps(_sumfp32_01, activation_type, activation_params);\n                    _sumfp32_10 = activation_ps(_sumfp32_10, activation_type, activation_params);\n                    _sumfp32_11 = activation_ps(_sumfp32_11, activation_type, activation_params);\n                    _sumfp32_20 = activation_ps(_sumfp32_20, activation_type, activation_params);\n                    _sumfp32_21 = activation_ps(_sumfp32_21, activation_type, activation_params);\n                    _sumfp32_30 = activation_ps(_sumfp32_30, activation_type, activation_params);\n                    _sumfp32_31 = activation_ps(_sumfp32_31, activation_type, activation_params);\n\n                    // transpose 
4x8\n                    float32x4x4_t _sumfp32_0;\n                    _sumfp32_0.val[0] = _sumfp32_00;\n                    _sumfp32_0.val[1] = _sumfp32_10;\n                    _sumfp32_0.val[2] = _sumfp32_20;\n                    _sumfp32_0.val[3] = _sumfp32_30;\n                    float32x4x4_t _sumfp32_1;\n                    _sumfp32_1.val[0] = _sumfp32_01;\n                    _sumfp32_1.val[1] = _sumfp32_11;\n                    _sumfp32_1.val[2] = _sumfp32_21;\n                    _sumfp32_1.val[3] = _sumfp32_31;\n\n                    vst4q_f32(outptr, _sumfp32_0);\n                    vst4q_f32(outptr + 16, _sumfp32_1);\n\n                    outptr += 32;\n                }\n            }\n        }\n\n        if (num_output_elempack == 1 && out_elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m0 = bottom_blob_int8_unpacked.row<const signed char>(j * 4);\n                    const signed char* m1 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 1);\n                    const signed char* m2 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 2);\n                    const signed char* m3 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 3);\n\n                    int sum0 = 0;\n                    int sum1 = 0;\n                    int sum2 = 0;\n                    int sum3 = 0;\n\n                    int i = 0;\n\n                    int32x4_t _sum0 = vdupq_n_s32(0);\n                    int32x4_t _sum1 = vdupq_n_s32(0);\n                    int32x4_t _sum2 = vdupq_n_s32(0);\n                    int32x4_t _sum3 = vdupq_n_s32(0);\n                    for (; i + 7 < 
num_input; i += 8)\n                    {\n                        int8x8_t _val0 = vld1_s8(m0);\n                        int8x8_t _val1 = vld1_s8(m1);\n                        int8x8_t _val2 = vld1_s8(m2);\n                        int8x8_t _val3 = vld1_s8(m3);\n                        int8x8_t _w = vld1_s8(kptr);\n\n                        int16x8_t _s0 = vmull_s8(_val0, _w);\n                        int16x8_t _s1 = vmull_s8(_val1, _w);\n                        int16x8_t _s2 = vmull_s8(_val2, _w);\n                        int16x8_t _s3 = vmull_s8(_val3, _w);\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_low_s16(_s1));\n                        _sum2 = vaddw_s16(_sum2, vget_low_s16(_s2));\n                        _sum3 = vaddw_s16(_sum3, vget_low_s16(_s3));\n                        _sum0 = vaddw_s16(_sum0, vget_high_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s1));\n                        _sum2 = vaddw_s16(_sum2, vget_high_s16(_s2));\n                        _sum3 = vaddw_s16(_sum3, vget_high_s16(_s3));\n\n                        m0 += 8;\n                        m1 += 8;\n                        m2 += 8;\n                        m3 += 8;\n                        kptr += 8;\n                    }\n#if __aarch64__\n                    sum0 = vaddvq_s32(_sum0);\n                    sum1 = vaddvq_s32(_sum1);\n                    sum2 = vaddvq_s32(_sum2);\n                    sum3 = vaddvq_s32(_sum3);\n#else\n                    int32x2_t _s20 = vadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                    int32x2_t _s21 = vadd_s32(vget_low_s32(_sum1), vget_high_s32(_sum1));\n                    int32x2_t _s22 = vadd_s32(vget_low_s32(_sum2), vget_high_s32(_sum2));\n                    int32x2_t _s23 = vadd_s32(vget_low_s32(_sum3), vget_high_s32(_sum3));\n                    int32x2_t _s201 = vpadd_s32(_s20, _s21);\n                    int32x2_t 
_s223 = vpadd_s32(_s22, _s23);\n                    sum0 = vget_lane_s32(_s201, 0);\n                    sum1 = vget_lane_s32(_s201, 1);\n                    sum2 = vget_lane_s32(_s223, 0);\n                    sum3 = vget_lane_s32(_s223, 1);\n#endif\n                    for (; i < num_input; i++)\n                    {\n                        sum0 += *m0++ * kptr[0];\n                        sum1 += *m1++ * kptr[0];\n                        sum2 += *m2++ * kptr[0];\n                        sum3 += *m3++ * kptr[0];\n                        kptr += 1;\n                    }\n\n                    // dequantize and relu\n                    float sumfp32_0 = sum0 * scale_in_data[p];\n                    float sumfp32_1 = sum1 * scale_in_data[p];\n                    float sumfp32_2 = sum2 * scale_in_data[p];\n                    float sumfp32_3 = sum3 * scale_in_data[p];\n\n                    if (bias_term)\n                    {\n                        sumfp32_0 += bias_data[p];\n                        sumfp32_1 += bias_data[p];\n                        sumfp32_2 += bias_data[p];\n                        sumfp32_3 += bias_data[p];\n                    }\n\n                    outptr[0] = activation_ss(sumfp32_0, activation_type, activation_params);\n                    outptr[1] = activation_ss(sumfp32_1, activation_type, activation_params);\n                    outptr[2] = activation_ss(sumfp32_2, activation_type, activation_params);\n                    outptr[3] = activation_ss(sumfp32_3, activation_type, activation_params);\n                    outptr += 4;\n                }\n            }\n        }\n\n        if (num_output_elempack == 8 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                 
   const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m = bottom_blob_int8_unpacked.row<const signed char>(j);\n\n                    int32x4_t _sum0 = vdupq_n_s32(0);\n                    int32x4_t _sum1 = vdupq_n_s32(0);\n\n                    int i = 0;\n                    for (; i + 3 < num_input; i += 4)\n                    {\n                        int8x8_t _val0 = vdup_n_s8(m[0]);\n                        int8x8_t _val1 = vdup_n_s8(m[1]);\n                        int8x8_t _val2 = vdup_n_s8(m[2]);\n                        int8x8_t _val3 = vdup_n_s8(m[3]);\n\n                        int8x16_t _w0 = vld1q_s8(kptr);\n                        int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n                        int16x8_t _s0 = vmull_s8(_val0, vget_low_s8(_w0));\n                        int16x8_t _s1 = vmull_s8(_val2, vget_low_s8(_w1));\n                        _s0 = vmlal_s8(_s0, _val1, vget_high_s8(_w0));\n                        _s1 = vmlal_s8(_s1, _val3, vget_high_s8(_w1));\n\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s1));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s1));\n\n                        m += 4;\n                        kptr += 32;\n                    }\n                    for (; i < num_input; i++)\n                    {\n                        int8x8_t _val = vld1_dup_s8(m);\n                        int8x8_t _w = vld1_s8(kptr);\n\n                        int16x8_t _s0 = vmull_s8(_val, _w);\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                        m++;\n                        kptr += 8;\n                    }\n\n                    // dequantize and relu\n                    float32x4_t _scale_in0 = 
vld1q_f32((const float*)scale_in_data + p * 8);\n                    float32x4_t _scale_in1 = vld1q_f32((const float*)scale_in_data + p * 8 + 4);\n\n                    float32x4_t _sumfp32_0 = vcvtq_f32_s32(_sum0);\n                    float32x4_t _sumfp32_1 = vcvtq_f32_s32(_sum1);\n\n                    if (bias_term)\n                    {\n                        float32x4_t _bias0 = vld1q_f32((const float*)bias_data + p * 8);\n                        float32x4_t _bias1 = vld1q_f32((const float*)bias_data + p * 8 + 4);\n                        _sumfp32_0 = vmlaq_f32(_bias0, _sumfp32_0, _scale_in0);\n                        _sumfp32_1 = vmlaq_f32(_bias1, _sumfp32_1, _scale_in1);\n                    }\n                    else\n                    {\n                        _sumfp32_0 = vmulq_f32(_sumfp32_0, _scale_in0);\n                        _sumfp32_1 = vmulq_f32(_sumfp32_1, _scale_in1);\n                    }\n\n                    _sumfp32_0 = activation_ps(_sumfp32_0, activation_type, activation_params);\n                    _sumfp32_1 = activation_ps(_sumfp32_1, activation_type, activation_params);\n\n                    vst1q_f32(outptr, _sumfp32_0);\n                    vst1q_f32(outptr + 4, _sumfp32_1);\n                    outptr += 8;\n                }\n            }\n        }\n#endif // __ARM_NEON\n\n        if (num_output_elempack == 1 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m = bottom_blob_int8_unpacked.row<const signed char>(j);\n\n                    int sum = 0;\n\n                    int i = 0;\n#if __ARM_NEON\n                    int32x4_t _sum0 = vdupq_n_s32(0);\n         
           int32x4_t _sum1 = vdupq_n_s32(0);\n                    for (; i + 7 < num_input; i += 8)\n                    {\n                        int8x8_t _val = vld1_s8(m);\n                        int8x8_t _w = vld1_s8(kptr);\n\n                        int16x8_t _s0 = vmull_s8(_val, _w);\n                        _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                        _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                        m += 8;\n                        kptr += 8;\n                    }\n\n                    _sum0 = vaddq_s32(_sum0, _sum1);\n#if __aarch64__\n                    sum = vaddvq_s32(_sum0);\n#else\n                    int32x2_t _s2 = vadd_s32(vget_low_s32(_sum0), vget_high_s32(_sum0));\n                    _s2 = vpadd_s32(_s2, _s2);\n                    sum = vget_lane_s32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n                    for (; i < num_input; i++)\n                    {\n                        sum += *m++ * *kptr++;\n                    }\n\n                    // dequantize and relu\n                    float sumfp32 = sum * scale_in_data[p];\n\n                    if (bias_term)\n                        sumfp32 += bias_data[p];\n\n                    outptr[0] = activation_ss(sumfp32, activation_type, activation_params);\n                    outptr += 1;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    Mat bottom_blob_int8_flattened = bottom_blob_int8;\n    if (bottom_blob_int8.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n        flatten->forward(bottom_blob_int8, bottom_blob_int8_flattened, opt_flatten);\n        if (bottom_blob_int8_flattened.empty())\n            return -100;\n    }\n\n    //     int elempack = bottom_blob_int8_flattened.elempack;\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 8 == 0 ? 
8 : 1;\n    }\n#endif\n\n    top_blob.create(num_output / out_elempack, (size_t)(4u * out_elempack), out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __ARM_NEON\n    if (out_elempack == 8)\n    {\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            const signed char* kptr = weight_data_tm.row<const signed char>(p);\n            const signed char* sptr = bottom_blob_int8_flattened;\n\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n\n            int i = 0;\n            for (; i + 1 < num_input; i += 2)\n            {\n                int8x8_t _val0 = vdup_n_s8(sptr[0]);\n                int8x8_t _val1 = vdup_n_s8(sptr[1]);\n\n                int8x8_t _w0 = vld1_s8(kptr);\n                int8x8_t _w1 = vld1_s8(kptr + 8);\n\n                int16x8_t _s0 = vmull_s8(_val0, _w0);\n                _s0 = vmlal_s8(_s0, _val1, _w1);\n\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                sptr += 2;\n                kptr += 16;\n            }\n            for (; i < num_input; i++)\n            {\n                int8x8_t _val = vdup_n_s8(sptr[0]);\n\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _s0 = vmull_s8(_val, _w);\n                _sum0 = vaddw_s16(_sum0, vget_low_s16(_s0));\n                _sum1 = vaddw_s16(_sum1, vget_high_s16(_s0));\n\n                sptr += 1;\n                kptr += 8;\n            }\n\n            // dequantize and relu\n            float32x4_t _scale_in0 = vld1q_f32((const float*)scale_in_data + p * 8);\n            float32x4_t _scale_in1 = vld1q_f32((const float*)scale_in_data + p * 8 + 4);\n\n            float32x4_t _sumfp32_0 = vcvtq_f32_s32(_sum0);\n            float32x4_t _sumfp32_1 = vcvtq_f32_s32(_sum1);\n\n        
    if (bias_term)\n            {\n                float32x4_t _bias0 = vld1q_f32((const float*)bias_data + p * 8);\n                float32x4_t _bias1 = vld1q_f32((const float*)bias_data + p * 8 + 4);\n                _sumfp32_0 = vmlaq_f32(_bias0, _sumfp32_0, _scale_in0);\n                _sumfp32_1 = vmlaq_f32(_bias1, _sumfp32_1, _scale_in1);\n            }\n            else\n            {\n                _sumfp32_0 = vmulq_f32(_sumfp32_0, _scale_in0);\n                _sumfp32_1 = vmulq_f32(_sumfp32_1, _scale_in1);\n            }\n\n            _sumfp32_0 = activation_ps(_sumfp32_0, activation_type, activation_params);\n            _sumfp32_1 = activation_ps(_sumfp32_1, activation_type, activation_params);\n\n            float* outptr = (float*)top_blob + p * 8;\n            vst1q_f32(outptr, _sumfp32_0);\n            vst1q_f32(outptr + 4, _sumfp32_1);\n        }\n    }\n#endif // __ARM_NEON\n\n    if (out_elempack == 1)\n    {\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            const signed char* kptr = weight_data_tm.row<const signed char>(p);\n            const signed char* sptr = bottom_blob_int8_flattened;\n\n            int sum = 0;\n\n            int i = 0;\n            for (; i < num_input; i++)\n            {\n                signed char val = sptr[0];\n\n                signed char w = kptr[0];\n\n                sum += val * w;\n\n                sptr += 1;\n                kptr += 1;\n            }\n\n            // dequantize and relu\n            float sumfp32 = sum * scale_in_data[p];\n\n            if (bias_term)\n                sumfp32 += bias_data[p];\n\n            sumfp32 = activation_ss(sumfp32, activation_type, activation_params);\n\n            top_blob[p] = sumfp32;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/innerproduct_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INNERPRODUCT_ARM_H\n#define LAYER_INNERPRODUCT_ARM_H\n\n#include \"innerproduct.h\"\n\nnamespace ncnn {\n\nclass InnerProduct_arm : public InnerProduct\n{\npublic:\n    InnerProduct_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_VFPV4\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_ARM82\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8_arm(const Option& opt);\n    int forward_int8_arm(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Layer* flatten;\n\n    Mat weight_data_tm;\n\n    // fp16\n    Mat bias_data_fp16;\n\n#if NCNN_INT8\n    Mat scale_in_data;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INNERPRODUCT_ARM_H\n"
  },
  {
    "path": "src/layer/arm/innerproduct_arm_asimdfhm.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"innerproduct_arm.h\"\n\n#include \"cpu.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"innerproduct_fp16s.h\"\n#include \"innerproduct_gemm_fp16s.h\"\n\nvoid innerproduct_pack4_fp16s_neon_asimdfhm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    innerproduct_pack4_fp16s_neon(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n}\n\nvoid innerproduct_fp16s_neon_asimdfhm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    innerproduct_fp16s_neon(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n}\n\nvoid innerproduct_gemm_fp16s_neon_asimdfhm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    innerproduct_gemm_fp16s_neon(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n}\n\nvoid innerproduct_transform_kernel_fp16s_neon_asimdfhm(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, const Option& opt)\n{\n    innerproduct_transform_kernel_fp16s_neon(weight_data, weight_data_tm, num_input, num_output, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/innerproduct_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"innerproduct_arm.h\"\n\n#include \"cpu.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"innerproduct_fp16s.h\"\n#include \"innerproduct_gemm_fp16s.h\"\n\nvoid innerproduct_pack4_fp16s_neon_asimdhp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    innerproduct_pack4_fp16s_neon(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n}\n\nvoid innerproduct_fp16s_neon_asimdhp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    innerproduct_fp16s_neon(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n}\n\nvoid innerproduct_gemm_fp16s_neon_asimdhp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    innerproduct_gemm_fp16s_neon(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n}\n\nvoid innerproduct_transform_kernel_fp16s_neon_asimdhp(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, const Option& opt)\n{\n    innerproduct_transform_kernel_fp16s_neon(weight_data, weight_data_tm, num_input, num_output, opt);\n}\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint InnerProduct_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int num_input = weight_data_size / num_output;\n\n    if (bottom_blob.dims == 2 && bottom_blob.w == num_input)\n    {\n        // gemm\n        int h = bottom_blob.h;\n        size_t elemsize = 
bottom_blob.elemsize;\n        int elempack = bottom_blob.elempack;\n\n        top_blob.create(num_output, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int num_output_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            num_output_elempack = num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n            if (elempack == 8 && num_output_elempack == 8)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p * 8;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x8_t _sum0 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum1 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum2 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum3 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum4 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum5 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum6 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum7 = vdupq_n_f16((__fp16)0.f);\n\n                    if (bias_term)\n                    {\n                        _sum0 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 0]);\n                        _sum1 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 1]);\n                        _sum2 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 2]);\n                        _sum3 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 3]);\n                        _sum4 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 4]);\n                        _sum5 = 
vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 5]);\n                        _sum6 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 6]);\n                        _sum7 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 7]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x8_t _val = vld1q_f16(m);\n                        float16x8_t _k = vld1q_f16(kptr);\n                        _sum0 = vfmaq_laneq_f16(_sum0, _val, _k, 0);\n                        _sum1 = vfmaq_laneq_f16(_sum1, _val, _k, 1);\n                        _sum2 = vfmaq_laneq_f16(_sum2, _val, _k, 2);\n                        _sum3 = vfmaq_laneq_f16(_sum3, _val, _k, 3);\n                        _sum4 = vfmaq_laneq_f16(_sum4, _val, _k, 4);\n                        _sum5 = vfmaq_laneq_f16(_sum5, _val, _k, 5);\n                        _sum6 = vfmaq_laneq_f16(_sum6, _val, _k, 6);\n                        _sum7 = vfmaq_laneq_f16(_sum7, _val, _k, 7);\n\n                        m += 8;\n                        kptr += 8;\n                    }\n\n                    _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps_f16(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps_f16(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps_f16(_sum3, activation_type, activation_params);\n                    _sum4 = activation_ps_f16(_sum4, activation_type, activation_params);\n                    _sum5 = activation_ps_f16(_sum5, activation_type, activation_params);\n                    _sum6 = activation_ps_f16(_sum6, activation_type, activation_params);\n                    _sum7 = activation_ps_f16(_sum7, activation_type, activation_params);\n\n                    vst1q_f16(outptr, _sum0);\n                    vst1q_f16(outptr + 8, _sum1);\n                    vst1q_f16(outptr + 16, _sum2);\n    
                vst1q_f16(outptr + 24, _sum3);\n                    vst1q_f16(outptr + 32, _sum4);\n                    vst1q_f16(outptr + 40, _sum5);\n                    vst1q_f16(outptr + 48, _sum6);\n                    vst1q_f16(outptr + 56, _sum7);\n                    outptr += 64;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 8)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p * 8;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x8_t _sum = vdupq_n_f16(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum = vld1q_f16((const __fp16*)bias_data_fp16 + p * 8);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x8_t _val = vdupq_n_f16(m[0]);\n                        float16x8_t _k = vld1q_f16(kptr);\n                        _sum = vfmaq_f16(_sum, _val, _k);\n\n                        m += 1;\n                        kptr += 8;\n                    }\n\n                    _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                    vst1q_f16(outptr, _sum);\n                    outptr += 8;\n                }\n            }\n\n            if (elempack == 4 && num_output_elempack == 8)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p * 8;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x4_t _sum0 = vdup_n_f16(0.f);\n                    
float16x4_t _sum1 = vdup_n_f16(0.f);\n                    float16x4_t _sum2 = vdup_n_f16(0.f);\n                    float16x4_t _sum3 = vdup_n_f16(0.f);\n                    float16x4_t _sum4 = vdup_n_f16(0.f);\n                    float16x4_t _sum5 = vdup_n_f16(0.f);\n                    float16x4_t _sum6 = vdup_n_f16(0.f);\n                    float16x4_t _sum7 = vdup_n_f16(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum0 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 0]);\n                        _sum1 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 1]);\n                        _sum2 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 2]);\n                        _sum3 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 3]);\n                        _sum4 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 4]);\n                        _sum5 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 5]);\n                        _sum6 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 6]);\n                        _sum7 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 8 + 7]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x4_t _val = vld1_f16(m);\n                        float16x8_t _k = vld1q_f16(kptr);\n                        _sum0 = vfma_laneq_f16(_sum0, _val, _k, 0);\n                        _sum1 = vfma_laneq_f16(_sum1, _val, _k, 1);\n                        _sum2 = vfma_laneq_f16(_sum2, _val, _k, 2);\n                        _sum3 = vfma_laneq_f16(_sum3, _val, _k, 3);\n                        _sum4 = vfma_laneq_f16(_sum4, _val, _k, 4);\n                        _sum5 = vfma_laneq_f16(_sum5, _val, _k, 5);\n                        _sum6 = vfma_laneq_f16(_sum6, _val, _k, 6);\n                        _sum7 = vfma_laneq_f16(_sum7, _val, _k, 7);\n\n                        m += 4;\n                        kptr += 
8;\n                    }\n\n                    _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps_f16(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps_f16(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps_f16(_sum3, activation_type, activation_params);\n                    _sum4 = activation_ps_f16(_sum4, activation_type, activation_params);\n                    _sum5 = activation_ps_f16(_sum5, activation_type, activation_params);\n                    _sum6 = activation_ps_f16(_sum6, activation_type, activation_params);\n                    _sum7 = activation_ps_f16(_sum7, activation_type, activation_params);\n\n                    vst1_f16(outptr, _sum0);\n                    vst1_f16(outptr + 4, _sum1);\n                    vst1_f16(outptr + 8, _sum2);\n                    vst1_f16(outptr + 12, _sum3);\n                    vst1_f16(outptr + 16, _sum4);\n                    vst1_f16(outptr + 20, _sum5);\n                    vst1_f16(outptr + 24, _sum6);\n                    vst1_f16(outptr + 28, _sum7);\n                    outptr += 32;\n                }\n            }\n\n            if (elempack == 8 && num_output_elempack == 1)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n\n                    if (bias_term)\n                    {\n                        _sum = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x8_t _val = vld1q_f16(m);\n                        
float16x8_t _k = vdupq_n_f16(kptr[0]);\n                        _sum = vfmaq_f16(_sum, _val, _k);\n\n                        m += 8;\n                        kptr += 1;\n                    }\n\n                    _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                    vst1q_f16(outptr, _sum);\n                    outptr += 8;\n                }\n            }\n\n            if (elempack == 8 && num_output_elempack == 4)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p * 4;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x8_t _sum0 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum1 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum2 = vdupq_n_f16((__fp16)0.f);\n                    float16x8_t _sum3 = vdupq_n_f16((__fp16)0.f);\n\n                    if (bias_term)\n                    {\n                        _sum0 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 0]);\n                        _sum1 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 1]);\n                        _sum2 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 2]);\n                        _sum3 = vdupq_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 3]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x8_t _val = vld1q_f16(m);\n                        float16x4_t _k = vld1_f16(kptr);\n                        _sum0 = vfmaq_lane_f16(_sum0, _val, _k, 0);\n                        _sum1 = vfmaq_lane_f16(_sum1, _val, _k, 1);\n                        _sum2 = vfmaq_lane_f16(_sum2, _val, _k, 2);\n                        _sum3 = vfmaq_lane_f16(_sum3, _val, _k, 3);\n\n           
             m += 8;\n                        kptr += 4;\n                    }\n\n                    _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps_f16(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps_f16(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps_f16(_sum3, activation_type, activation_params);\n\n                    vst1q_f16(outptr, _sum0);\n                    vst1q_f16(outptr + 8, _sum1);\n                    vst1q_f16(outptr + 16, _sum2);\n                    vst1q_f16(outptr + 24, _sum3);\n                    outptr += 32;\n                }\n            }\n\n            if (elempack == 4 && num_output_elempack == 4)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p * 4;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x4_t _sum0 = vdup_n_f16(0.f);\n                    float16x4_t _sum1 = vdup_n_f16(0.f);\n                    float16x4_t _sum2 = vdup_n_f16(0.f);\n                    float16x4_t _sum3 = vdup_n_f16(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum0 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 0]);\n                        _sum1 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 1]);\n                        _sum2 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 2]);\n                        _sum3 = vdup_n_f16(((const __fp16*)bias_data_fp16)[p * 4 + 3]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x4_t _val = vld1_f16(m);\n                        float16x4_t _k = vld1_f16(kptr);\n          
              _sum0 = vfma_lane_f16(_sum0, _val, _k, 0);\n                        _sum1 = vfma_lane_f16(_sum1, _val, _k, 1);\n                        _sum2 = vfma_lane_f16(_sum2, _val, _k, 2);\n                        _sum3 = vfma_lane_f16(_sum3, _val, _k, 3);\n\n                        m += 4;\n                        kptr += 4;\n                    }\n\n                    _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps_f16(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps_f16(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps_f16(_sum3, activation_type, activation_params);\n\n                    vst1_f16(outptr, _sum0);\n                    vst1_f16(outptr + 4, _sum1);\n                    vst1_f16(outptr + 8, _sum2);\n                    vst1_f16(outptr + 12, _sum3);\n                    outptr += 16;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 4)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p * 4;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x4_t _sum = vdup_n_f16(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum = vld1_f16((const __fp16*)bias_data_fp16 + p * 4);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x4_t _val = vdup_n_f16(m[0]);\n                        float16x4_t _k = vld1_f16(kptr);\n                        _sum = vfma_f16(_sum, _val, _k);\n\n                        m += 1;\n                        kptr += 4;\n                    }\n\n                    _sum = 
activation_ps_f16(_sum, activation_type, activation_params);\n\n                    vst1_f16(outptr, _sum);\n                    outptr += 4;\n                }\n            }\n\n            if (elempack == 4 && num_output_elempack == 1)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float16x4_t _sum = vdup_n_f16(0.f);\n\n                    if (bias_term)\n                    {\n                        _sum = vdup_n_f16(((const __fp16*)bias_data_fp16)[p]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        float16x4_t _val = vld1_f16(m);\n                        float16x4_t _k = vdup_n_f16(kptr[0]);\n                        _sum = vfma_f16(_sum, _val, _k);\n\n                        m += 4;\n                        kptr += 1;\n                    }\n\n                    _sum = activation_ps_f16(_sum, activation_type, activation_params);\n\n                    vst1_f16(outptr, _sum);\n                    outptr += 4;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 1)\n            {\n                __fp16* outptr = top_blob.row<__fp16>(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const __fp16* kptr = (const __fp16*)weight_data_tm + num_input * p;\n                    const __fp16* m = bottom_blob.row<const __fp16>(j);\n\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        sum += (float)(*m 
* *kptr);\n\n                        m += 1;\n                        kptr += 1;\n                    }\n\n                    sum = activation_ss_f16(sum, activation_type, activation_params);\n\n                    outptr[0] = (__fp16)sum;\n                    outptr += 1;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // flatten\n    Mat bottom_blob_flattened = bottom_blob;\n    if (bottom_blob.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n\n        flatten->forward(bottom_blob, bottom_blob_flattened, opt_flatten);\n        if (bottom_blob_flattened.empty())\n            return -100;\n    }\n\n    size_t elemsize = bottom_blob_flattened.elemsize;\n    int elempack = bottom_blob_flattened.elempack;\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (out_elempack == 8)\n    {\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            float16x8_t _sum0 = vdupq_n_f16(0.f);\n            float16x8_t _sum1 = vdupq_n_f16(0.f);\n            float16x8_t _sum2 = vdupq_n_f16(0.f);\n            float16x8_t _sum3 = vdupq_n_f16(0.f);\n            float16x8_t _sum4 = vdupq_n_f16(0.f);\n            float16x8_t _sum5 = vdupq_n_f16(0.f);\n            float16x8_t _sum6 = vdupq_n_f16(0.f);\n            float16x8_t _sum7 = vdupq_n_f16(0.f);\n\n            if (bias_term)\n            {\n                _sum0 = vld1q_f16((const __fp16*)bias_data_fp16 + p * 8);\n            }\n\n            const __fp16* kptr = 
weight_data_tm.row<const __fp16>(p);\n\n            const __fp16* sptr = bottom_blob_flattened;\n\n            int i = 0;\n#if NCNN_GNU_INLINE_ASM\n            for (; i + 7 < num_input; i += 8)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%8, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%8], #16          \\n\" // _val\n\n                    \"prfm   pldl1keep, [%9, #512]       \\n\"\n                    \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%9], #64 \\n\" // w0123\n\n                    \"prfm   pldl1keep, [%9, #512]       \\n\"\n                    \"ld1    {v12.8h, v13.8h, v14.8h, v15.8h}, [%9], #64 \\n\" // w4567\n\n                    \"fmla   %0.8h, v8.8h, v0.h[0]       \\n\"\n                    \"fmla   %1.8h, v9.8h, v0.h[1]       \\n\"\n                    \"fmla   %2.8h, v10.8h, v0.h[2]      \\n\"\n                    \"fmla   %3.8h, v11.8h, v0.h[3]      \\n\"\n                    \"fmla   %4.8h, v12.8h, v0.h[4]      \\n\"\n                    \"fmla   %5.8h, v13.8h, v0.h[5]      \\n\"\n                    \"fmla   %6.8h, v14.8h, v0.h[6]      \\n\"\n                    \"fmla   %7.8h, v15.8h, v0.h[7]      \\n\"\n\n                    : \"=w\"(_sum0), // %0\n                    \"=w\"(_sum1), // %1\n                    \"=w\"(_sum2), // %2\n                    \"=w\"(_sum3), // %3\n                    \"=w\"(_sum4), // %4\n                    \"=w\"(_sum5), // %5\n                    \"=w\"(_sum6), // %6\n                    \"=w\"(_sum7), // %7\n                    \"=r\"(sptr),  // %8\n                    \"=r\"(kptr)   // %9\n                    : \"0\"(_sum0),\n                    \"1\"(_sum1),\n                    \"2\"(_sum2),\n                    \"3\"(_sum3),\n                    \"4\"(_sum4),\n                    \"5\"(_sum5),\n                    \"6\"(_sum6),\n                    \"7\"(_sum7),\n                    \"8\"(sptr),\n                    \"9\"(kptr)\n                    : 
\"cc\", \"memory\", \"v0\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n            }\n            for (; i + 3 < num_input; i += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v0.4h}, [%4], #8           \\n\" // _val\n\n                    \"prfm   pldl1keep, [%5, #512]       \\n\"\n                    \"ld1    {v8.8h, v9.8h, v10.8h, v11.8h}, [%5], #64 \\n\" // w0123\n\n                    \"fmla   %0.8h, v8.8h, v0.h[0]       \\n\"\n                    \"fmla   %1.8h, v9.8h, v0.h[1]       \\n\"\n                    \"fmla   %2.8h, v10.8h, v0.h[2]      \\n\"\n                    \"fmla   %3.8h, v11.8h, v0.h[3]      \\n\"\n\n                    : \"=w\"(_sum0), // %0\n                    \"=w\"(_sum1), // %1\n                    \"=w\"(_sum2), // %2\n                    \"=w\"(_sum3), // %3\n                    \"=r\"(sptr),  // %4\n                    \"=r\"(kptr)   // %5\n                    : \"0\"(_sum0),\n                    \"1\"(_sum1),\n                    \"2\"(_sum2),\n                    \"3\"(_sum3),\n                    \"4\"(sptr),\n                    \"5\"(kptr)\n                    : \"cc\", \"memory\", \"v0\", \"v8\", \"v9\", \"v10\", \"v11\");\n            }\n#endif // NCNN_GNU_INLINE_ASM\n            for (; i < num_input; i++)\n            {\n                float16x8_t _val = vdupq_n_f16(sptr[0]);\n\n                float16x8_t _w = vld1q_f16(kptr);\n\n                _sum0 = vfmaq_f16(_sum0, _val, _w);\n\n                sptr += 1;\n                kptr += 8;\n            }\n\n            _sum0 = vaddq_f16(_sum0, _sum1);\n            _sum2 = vaddq_f16(_sum2, _sum3);\n            _sum4 = vaddq_f16(_sum4, _sum5);\n            _sum6 = vaddq_f16(_sum6, _sum7);\n            _sum0 = vaddq_f16(_sum0, _sum2);\n            _sum4 = vaddq_f16(_sum4, _sum6);\n            _sum0 = vaddq_f16(_sum0, _sum4);\n\n            _sum0 = 
activation_ps_f16(_sum0, activation_type, activation_params);\n\n            __fp16* outptr = (__fp16*)top_blob;\n            vst1q_f16(outptr + p * 8, _sum0);\n        }\n    }\n\n    if (out_elempack == 4)\n    {\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            float16x4_t _sum0 = vdup_n_f16(0.f);\n            float16x4_t _sum1 = vdup_n_f16(0.f);\n            float16x4_t _sum2 = vdup_n_f16(0.f);\n            float16x4_t _sum3 = vdup_n_f16(0.f);\n            float16x4_t _sum4 = vdup_n_f16(0.f);\n            float16x4_t _sum5 = vdup_n_f16(0.f);\n            float16x4_t _sum6 = vdup_n_f16(0.f);\n            float16x4_t _sum7 = vdup_n_f16(0.f);\n\n            if (bias_term)\n            {\n                _sum0 = vld1_f16((const __fp16*)bias_data_fp16 + p * 4);\n            }\n\n            const __fp16* kptr = weight_data_tm.row<const __fp16>(p);\n\n            const __fp16* sptr = bottom_blob_flattened;\n\n            int i = 0;\n#if NCNN_GNU_INLINE_ASM\n            for (; i + 7 < num_input; i += 8)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%8, #128]       \\n\"\n                    \"ld1    {v0.8h}, [%8], #16          \\n\" // _val\n\n                    \"prfm   pldl1keep, [%9, #256]       \\n\"\n                    \"ld1    {v8.4h, v9.4h, v10.4h, v11.4h}, [%9], #32 \\n\" // w0123\n\n                    \"prfm   pldl1keep, [%9, #256]       \\n\"\n                    \"ld1    {v12.4h, v13.4h, v14.4h, v15.4h}, [%9], #32 \\n\" // w4567\n\n                    \"fmla   %0.4h, v8.4h, v0.h[0]       \\n\"\n                    \"fmla   %1.4h, v9.4h, v0.h[1]       \\n\"\n                    \"fmla   %2.4h, v10.4h, v0.h[2]      \\n\"\n                    \"fmla   %3.4h, v11.4h, v0.h[3]      \\n\"\n                    \"fmla   %4.4h, v12.4h, v0.h[4]      \\n\"\n                    \"fmla   %5.4h, v13.4h, 
v0.h[5]      \\n\"\n                    \"fmla   %6.4h, v14.4h, v0.h[6]      \\n\"\n                    \"fmla   %7.4h, v15.4h, v0.h[7]      \\n\"\n\n                    : \"=w\"(_sum0), // %0\n                    \"=w\"(_sum1), // %1\n                    \"=w\"(_sum2), // %2\n                    \"=w\"(_sum3), // %3\n                    \"=w\"(_sum4), // %4\n                    \"=w\"(_sum5), // %5\n                    \"=w\"(_sum6), // %6\n                    \"=w\"(_sum7), // %7\n                    \"=r\"(sptr),  // %8\n                    \"=r\"(kptr)   // %9\n                    : \"0\"(_sum0),\n                    \"1\"(_sum1),\n                    \"2\"(_sum2),\n                    \"3\"(_sum3),\n                    \"4\"(_sum4),\n                    \"5\"(_sum5),\n                    \"6\"(_sum6),\n                    \"7\"(_sum7),\n                    \"8\"(sptr),\n                    \"9\"(kptr)\n                    : \"cc\", \"memory\", \"v0\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\");\n            }\n            for (; i + 3 < num_input; i += 4)\n            {\n                asm volatile(\n                    \"prfm   pldl1keep, [%4, #128]       \\n\"\n                    \"ld1    {v0.4h}, [%4], #8           \\n\" // _val\n\n                    \"prfm   pldl1keep, [%5, #256]       \\n\"\n                    \"ld1    {v8.4h, v9.4h, v10.4h, v11.4h}, [%5], #32 \\n\" // w0123\n\n                    \"fmla   %0.4h, v8.4h, v0.h[0]       \\n\"\n                    \"fmla   %1.4h, v9.4h, v0.h[1]       \\n\"\n                    \"fmla   %2.4h, v10.4h, v0.h[2]      \\n\"\n                    \"fmla   %3.4h, v11.4h, v0.h[3]      \\n\"\n\n                    : \"=w\"(_sum0), // %0\n                    \"=w\"(_sum1), // %1\n                    \"=w\"(_sum2), // %2\n                    \"=w\"(_sum3), // %3\n                    \"=r\"(sptr),  // %4\n                    \"=r\"(kptr)   // %5\n                    : \"0\"(_sum0),\n         
           \"1\"(_sum1),\n                    \"2\"(_sum2),\n                    \"3\"(_sum3),\n                    \"4\"(sptr),\n                    \"5\"(kptr)\n                    : \"cc\", \"memory\", \"v0\", \"v8\", \"v9\", \"v10\", \"v11\");\n            }\n#endif // NCNN_GNU_INLINE_ASM\n            for (; i < num_input; i++)\n            {\n                float16x4_t _val = vdup_n_f16(sptr[0]);\n\n                float16x4_t _w = vld1_f16(kptr);\n\n                _sum0 = vfma_f16(_sum0, _val, _w);\n\n                sptr += 1;\n                kptr += 4;\n            }\n\n            _sum0 = vadd_f16(_sum0, _sum1);\n            _sum2 = vadd_f16(_sum2, _sum3);\n            _sum4 = vadd_f16(_sum4, _sum5);\n            _sum6 = vadd_f16(_sum6, _sum7);\n            _sum0 = vadd_f16(_sum0, _sum2);\n            _sum4 = vadd_f16(_sum4, _sum6);\n            _sum0 = vadd_f16(_sum0, _sum4);\n\n            _sum0 = activation_ps_f16(_sum0, activation_type, activation_params);\n\n            __fp16* outptr = (__fp16*)top_blob;\n            vst1_f16(outptr + p * 4, _sum0);\n        }\n    }\n\n    if (out_elempack == 1)\n    {\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output; p++)\n        {\n            float sum = 0.f;\n\n            if (bias_term)\n                sum = bias_data[p];\n\n            const __fp16* kptr = weight_data_tm.row<__fp16>(p);\n\n            const __fp16* sptr = bottom_blob_flattened;\n\n            float16x8_t _sum = vdupq_n_f16(0.f);\n            int i = 0;\n            for (; i + 7 < num_input; i += 8)\n            {\n                float16x8_t _m = vld1q_f16(sptr);\n                float16x8_t _w = vld1q_f16(kptr);\n\n                _sum = vfmaq_f16(_sum, _m, _w);\n\n                sptr += 8;\n                kptr += 8;\n            }\n            for (; i < num_input; i++)\n            {\n                __fp16 v = *sptr;\n                __fp16 k = *kptr;\n\n   
             sum += (float)(v * k);\n\n                sptr++;\n                kptr++;\n            }\n\n            float16x4_t _s4 = vadd_f16(vget_low_f16(_sum), vget_high_f16(_sum));\n            sum += vaddvq_f32(vcvt_f32_f16(_s4)); // dot\n\n            sum = activation_ss_f16(sum, activation_type, activation_params);\n\n            __fp16* outptr = (__fp16*)top_blob;\n            outptr[p] = (__fp16)sum;\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/innerproduct_arm_vfpv4.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"innerproduct_arm.h\"\n\n#include \"cpu.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"innerproduct_fp16s.h\"\n#include \"innerproduct_gemm_fp16s.h\"\n\nint InnerProduct_arm::create_pipeline_fp16s(const Option& opt)\n{\n    const int num_input = weight_data_size / num_output;\n\n    innerproduct_transform_kernel_fp16s_neon(weight_data, weight_data_tm, num_input, num_output, opt);\n\n#if NCNN_ARM82\n    if (ncnn::cpu_support_arm_asimdhp() && opt.use_fp16_arithmetic)\n    {\n        ncnn::cast_float32_to_float16(bias_data, bias_data_fp16, opt);\n    }\n#endif\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint InnerProduct_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int num_input = weight_data_size / num_output;\n\n    if (bottom_blob.dims == 2 && bottom_blob.w == num_input)\n    {\n        // gemm\n        int h = bottom_blob.h;\n        size_t elemsize = bottom_blob.elemsize;\n        int elempack = bottom_blob.elempack;\n\n        top_blob.create(num_output, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        innerproduct_gemm_fp16s_neon(bottom_blob, top_blob, weight_data_tm, bias_data, activation_type, activation_params, opt);\n\n        return 0;\n    }\n\n    // flatten\n    Mat bottom_blob_flattened = bottom_blob;\n    if (bottom_blob.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n\n        flatten->forward(bottom_blob, bottom_blob_flattened, opt_flatten);\n        if (bottom_blob_flattened.empty())\n            return -100;\n    }\n\n    size_t elemsize = bottom_blob_flattened.elemsize;\n    int elempack = 
bottom_blob_flattened.elempack;\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (out_elempack == 4)\n    {\n        innerproduct_pack4_fp16s_neon(bottom_blob_flattened, top_blob, weight_data_tm, bias_data, activation_type, activation_params, opt);\n    }\n\n    if (out_elempack == 1)\n    {\n        innerproduct_fp16s_neon(bottom_blob_flattened, top_blob, weight_data_tm, bias_data, activation_type, activation_params, opt);\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/innerproduct_fp16s.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\nvoid innerproduct_pack4_fp16s_neon_asimdfhm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt);\nvoid innerproduct_fp16s_neon_asimdfhm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt);\nvoid innerproduct_transform_kernel_fp16s_neon_asimdfhm(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, const Option& opt);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC && !__ARM_FEATURE_FP16_FML\nvoid innerproduct_pack4_fp16s_neon_asimdhp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt);\nvoid innerproduct_fp16s_neon_asimdhp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt);\nvoid innerproduct_transform_kernel_fp16s_neon_asimdhp(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, const Option& opt);\n#endif\n\nstatic void innerproduct_pack4_fp16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdfhm())\n    {\n        innerproduct_pack4_fp16s_neon_asimdfhm(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && 
!__ARM_FEATURE_FP16_VECTOR_ARITHMETIC && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdhp())\n    {\n        innerproduct_pack4_fp16s_neon_asimdhp(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n        return;\n    }\n#endif\n\n    const int num_input = bottom_blob.w * bottom_blob.elempack;\n    const int num_output = top_blob.w;\n\n    const float* bias_data_ptr = bias_data;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < num_output; p++)\n    {\n        float32x4_t _sum0 = vdupq_n_f32(0.f);\n\n        if (bias_data_ptr)\n        {\n            _sum0 = vld1q_f32(bias_data_ptr + p * 4);\n        }\n\n        float32x4_t _sum1 = vdupq_n_f32(0.f);\n        float32x4_t _sum2 = vdupq_n_f32(0.f);\n        float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        const __fp16* sptr = bottom_blob;\n        const __fp16* kptr = weight_data_fp16.row<const __fp16>(p);\n#else\n        const float* sptr = bottom_blob;\n        const unsigned short* kptr = weight_data_fp16.row<const unsigned short>(p);\n#endif\n\n        int i = 0;\n#if NCNN_GNU_INLINE_ASM\n        for (; i + 7 < num_input; i += 8)\n        {\n#if __aarch64__\n#if __ARM_FEATURE_FP16_FML\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #128]       \\n\"\n                \"ld1    {v0.8h}, [%0], #16          \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v2.8h, v3.8h, v4.8h, v5.8h}, [%1], #64 \\n\"\n                \"fmlal  %2.4s, v2.4h, v0.h[0]       \\n\"\n                \"fmlal2 %3.4s, v2.4h, v0.h[1]       \\n\"\n                \"fmlal  %4.4s, v3.4h, v0.h[2]       \\n\"\n                \"fmlal2 %5.4s, v3.4h, v0.h[3]       \\n\"\n                \"fmlal  %2.4s, v4.4h, v0.h[4]       \\n\"\n                \"fmlal2 %3.4s, v4.4h, v0.h[5]       \\n\"\n                \"fmlal  %4.4s, v5.4h, v0.h[6]       \\n\"\n        
        \"fmlal2 %5.4s, v5.4h, v0.h[7]       \\n\"\n                : \"=r\"(sptr),  // %0\n                \"=r\"(kptr),  // %1\n                \"=w\"(_sum0), // %2\n                \"=w\"(_sum1), // %3\n                \"=w\"(_sum2), // %4\n                \"=w\"(_sum3)  // %5\n                : \"0\"(sptr),\n                \"1\"(kptr),\n                \"2\"(_sum0),\n                \"3\"(_sum1),\n                \"4\"(_sum2),\n                \"5\"(_sum3)\n                : \"cc\", \"memory\", \"v0\", \"v2\", \"v3\", \"v4\", \"v5\");\n#elif __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #128]       \\n\"\n                \"ld1    {v1.8h}, [%0], #16          \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v6.8h, v7.8h, v8.8h, v9.8h}, [%1], #64 \\n\"\n                \"fcvtl  v0.4s, v1.4h                \\n\"\n                \"fcvtl2 v1.4s, v1.8h                \\n\"\n                \"fcvtl  v2.4s, v6.4h                \\n\"\n                \"fcvtl2 v3.4s, v6.8h                \\n\"\n                \"fcvtl  v4.4s, v7.4h                \\n\"\n                \"fcvtl2 v5.4s, v7.8h                \\n\"\n                \"fcvtl  v6.4s, v8.4h                \\n\"\n                \"fcvtl2 v7.4s, v8.8h                \\n\"\n                \"fcvtl  v8.4s, v9.4h                \\n\"\n                \"fcvtl2 v9.4s, v9.8h                \\n\"\n                \"fmla   %2.4s, v2.4s, v0.s[0]       \\n\"\n                \"fmla   %3.4s, v3.4s, v0.s[1]       \\n\"\n                \"fmla   %4.4s, v4.4s, v0.s[2]       \\n\"\n                \"fmla   %5.4s, v5.4s, v0.s[3]       \\n\"\n                \"fmla   %2.4s, v6.4s, v1.s[0]       \\n\"\n                \"fmla   %3.4s, v7.4s, v1.s[1]       \\n\"\n                \"fmla   %4.4s, v8.4s, v1.s[2]       \\n\"\n                \"fmla   %5.4s, v9.4s, v1.s[3]       \\n\"\n                : \"=r\"(sptr),  // %0\n          
      \"=r\"(kptr),  // %1\n                \"=w\"(_sum0), // %2\n                \"=w\"(_sum1), // %3\n                \"=w\"(_sum2), // %4\n                \"=w\"(_sum3)  // %5\n                : \"0\"(sptr),\n                \"1\"(kptr),\n                \"2\"(_sum0),\n                \"3\"(_sum1),\n                \"4\"(_sum2),\n                \"5\"(_sum3)\n                : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n#else  // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            asm volatile(\n                \"prfm   pldl1keep, [%0, #256]       \\n\"\n                \"ld1    {v0.4s, v1.4s}, [%0], #32   \\n\"\n                \"prfm   pldl1keep, [%1, #512]       \\n\"\n                \"ld1    {v6.8h, v7.8h, v8.8h, v9.8h}, [%1], #64 \\n\"\n                \"fcvtl  v2.4s, v6.4h                \\n\"\n                \"fcvtl2 v3.4s, v6.8h                \\n\"\n                \"fcvtl  v4.4s, v7.4h                \\n\"\n                \"fcvtl2 v5.4s, v7.8h                \\n\"\n                \"fcvtl  v6.4s, v8.4h                \\n\"\n                \"fcvtl2 v7.4s, v8.8h                \\n\"\n                \"fcvtl  v8.4s, v9.4h                \\n\"\n                \"fcvtl2 v9.4s, v9.8h                \\n\"\n                \"fmla   %2.4s, v2.4s, v0.s[0]       \\n\"\n                \"fmla   %3.4s, v3.4s, v0.s[1]       \\n\"\n                \"fmla   %4.4s, v4.4s, v0.s[2]       \\n\"\n                \"fmla   %5.4s, v5.4s, v0.s[3]       \\n\"\n                \"fmla   %2.4s, v6.4s, v1.s[0]       \\n\"\n                \"fmla   %3.4s, v7.4s, v1.s[1]       \\n\"\n                \"fmla   %4.4s, v8.4s, v1.s[2]       \\n\"\n                \"fmla   %5.4s, v9.4s, v1.s[3]       \\n\"\n                : \"=r\"(sptr),  // %0\n                \"=r\"(kptr),  // %1\n                \"=w\"(_sum0), // %2\n                \"=w\"(_sum1), // %3\n                \"=w\"(_sum2), // %4\n                \"=w\"(_sum3)  // 
%5\n                : \"0\"(sptr),\n                \"1\"(kptr),\n                \"2\"(_sum0),\n                \"3\"(_sum1),\n                \"4\"(_sum2),\n                \"5\"(_sum3)\n                : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\");\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#else  // __aarch64__\n            asm volatile(\n                \"pld        [%0, #256]          \\n\"\n                \"vld1.f32   {d0-d3}, [%0 :128]! \\n\"\n                \"pld        [%1, #512]          \\n\"\n                \"vldm       %1!, {d12-d19}      \\n\"\n                \"vcvt.f32.f16 q2, d12           \\n\"\n                \"vcvt.f32.f16 q3, d13           \\n\"\n                \"vcvt.f32.f16 q4, d14           \\n\"\n                \"vcvt.f32.f16 q5, d15           \\n\"\n                \"vcvt.f32.f16 q6, d16           \\n\"\n                \"vcvt.f32.f16 q7, d17           \\n\"\n                \"vcvt.f32.f16 q8, d18           \\n\"\n                \"vcvt.f32.f16 q9, d19           \\n\"\n                \"vmla.f32   %q2, q2, d0[0]      \\n\"\n                \"vmla.f32   %q3, q3, d0[1]      \\n\"\n                \"vmla.f32   %q4, q4, d1[0]      \\n\"\n                \"vmla.f32   %q5, q5, d1[1]      \\n\"\n                \"vmla.f32   %q2, q6, d2[0]      \\n\"\n                \"vmla.f32   %q3, q7, d2[1]      \\n\"\n                \"vmla.f32   %q4, q8, d3[0]      \\n\"\n                \"vmla.f32   %q5, q9, d3[1]      \\n\"\n                : \"=r\"(sptr),  // %0\n                \"=r\"(kptr),  // %1\n                \"=w\"(_sum0), // %2\n                \"=w\"(_sum1), // %3\n                \"=w\"(_sum2), // %4\n                \"=w\"(_sum3)  // %5\n                : \"0\"(sptr),\n                \"1\"(kptr),\n                \"2\"(_sum0),\n                \"3\"(_sum1),\n                \"4\"(_sum2),\n                \"5\"(_sum3)\n                : \"cc\", \"memory\", \"q0\", \"q1\", 
\"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\");\n#endif // __aarch64__\n        }\n#endif // NCNN_GNU_INLINE_ASM\n        for (; i + 3 < num_input; i += 4)\n        {\n#if __ARM_FEATURE_FP16_FML\n            float16x4_t _val = vld1_f16(sptr);\n            float16x8_t _w01 = vld1q_f16(kptr);\n            float16x8_t _w23 = vld1q_f16(kptr + 8);\n\n            _sum0 = vfmlalq_lane_low_f16(_sum0, _w01, _val, 0);\n            _sum1 = vfmlalq_lane_high_f16(_sum1, _w01, _val, 1);\n            _sum2 = vfmlalq_lane_low_f16(_sum2, _w23, _val, 2);\n            _sum3 = vfmlalq_lane_high_f16(_sum3, _w23, _val, 3);\n#else // __ARM_FEATURE_FP16_FML\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr));\n            float16x8_t _w01 = vld1q_f16(kptr);\n            float16x8_t _w23 = vld1q_f16(kptr + 8);\n            float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n            float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n            float32x4_t _w2 = vcvt_f32_f16(vget_low_f16(_w23));\n            float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n#else\n            float32x4_t _val = vld1q_f32(sptr);\n            uint16x8_t _w01 = vld1q_u16(kptr);\n            uint16x8_t _w23 = vld1q_u16(kptr + 8);\n            float32x4_t _w0 = vcvt_f32_f16((float16x4_t)(vget_low_u16(_w01)));\n            float32x4_t _w1 = vcvt_f32_f16((float16x4_t)(vget_high_u16(_w01)));\n            float32x4_t _w2 = vcvt_f32_f16((float16x4_t)(vget_low_u16(_w23)));\n            float32x4_t _w3 = vcvt_f32_f16((float16x4_t)(vget_high_u16(_w23)));\n#endif\n\n#if __aarch64__\n            _sum0 = vfmaq_laneq_f32(_sum0, _w0, _val, 0);\n            _sum1 = vfmaq_laneq_f32(_sum1, _w1, _val, 1);\n            _sum2 = vfmaq_laneq_f32(_sum2, _w2, _val, 2);\n            _sum3 = vfmaq_laneq_f32(_sum3, _w3, _val, 3);\n#else\n            _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_val), 0);\n            _sum1 = vmlaq_lane_f32(_sum1, _w1, 
vget_low_f32(_val), 1);\n            _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_val), 0);\n            _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_val), 1);\n#endif\n#endif // __ARM_FEATURE_FP16_FML\n\n            sptr += 4;\n            kptr += 16;\n        }\n        for (; i < num_input; i++)\n        {\n            float32x4_t _val = vdupq_n_f32((float)sptr[0]);\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n#else\n            float32x4_t _w = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr)));\n#endif\n            _sum0 = vfmaq_f32(_sum0, _val, _w);\n\n            sptr += 1;\n            kptr += 4;\n        }\n\n        _sum0 = vaddq_f32(_sum0, _sum1);\n        _sum2 = vaddq_f32(_sum2, _sum3);\n        _sum0 = vaddq_f32(_sum0, _sum2);\n\n        _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        __fp16* outptr = (__fp16*)top_blob;\n        vst1_f16(outptr + p * 4, vcvt_f16_f32(_sum0));\n#else\n        float* outptr = top_blob;\n        vst1q_f32(outptr + p * 4, _sum0);\n#endif\n    }\n}\n\nstatic void innerproduct_fp16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdfhm())\n    {\n        innerproduct_fp16s_neon_asimdfhm(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdhp())\n    {\n        innerproduct_fp16s_neon_asimdhp(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n        return;\n    }\n#endif\n\n    const int 
num_input = bottom_blob.w * bottom_blob.elempack;\n    const int num_output = top_blob.w;\n\n    const float* bias_data_ptr = bias_data;\n\n    int nn_num_output = num_output >> 2;\n    int remain_num_output_start = nn_num_output << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_num_output; pp++)\n    {\n        int p = pp * 4;\n\n        float sums[4] = {0.0f};\n        if (bias_data_ptr)\n        {\n            sums[0] = bias_data_ptr[p];\n            sums[1] = bias_data_ptr[p + 1];\n            sums[2] = bias_data_ptr[p + 2];\n            sums[3] = bias_data_ptr[p + 3];\n        }\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        const __fp16* sptr = bottom_blob;\n        const __fp16* kptr0 = weight_data_fp16.row<const __fp16>(p);\n        const __fp16* kptr1 = weight_data_fp16.row<const __fp16>(p + 1);\n        const __fp16* kptr2 = weight_data_fp16.row<const __fp16>(p + 2);\n        const __fp16* kptr3 = weight_data_fp16.row<const __fp16>(p + 3);\n#else\n        const float* sptr = bottom_blob;\n        const unsigned short* kptr0 = weight_data_fp16.row<const unsigned short>(p);\n        const unsigned short* kptr1 = weight_data_fp16.row<const unsigned short>(p + 1);\n        const unsigned short* kptr2 = weight_data_fp16.row<const unsigned short>(p + 2);\n        const unsigned short* kptr3 = weight_data_fp16.row<const unsigned short>(p + 3);\n#endif\n\n        int i = 0;\n\n        float32x4_t _sum0 = vdupq_n_f32(0.f);\n        float32x4_t _sum1 = vdupq_n_f32(0.f);\n        float32x4_t _sum2 = vdupq_n_f32(0.f);\n        float32x4_t _sum3 = vdupq_n_f32(0.f);\n        for (; i + 3 < num_input; i += 4)\n        {\n#if __ARM_FEATURE_FP16_FML\n            float16x4_t _val = vld1_f16(sptr);\n            float16x4_t _w0 = vld1_f16(kptr0);\n            float16x4_t _w1 = vld1_f16(kptr1);\n            float16x4_t _w2 = vld1_f16(kptr2);\n            float16x4_t _w3 = vld1_f16(kptr3);\n            float16x8_t _w01 = 
vcombine_f16(_w0, _w1);\n            float16x8_t _w23 = vcombine_f16(_w2, _w3);\n            float16x8_t _valval = vcombine_f16(_val, _val);\n\n            _sum0 = vfmlalq_low_f16(_sum0, _w01, _valval);\n            _sum1 = vfmlalq_high_f16(_sum1, _w01, _valval);\n            _sum2 = vfmlalq_low_f16(_sum2, _w23, _valval);\n            _sum3 = vfmlalq_high_f16(_sum3, _w23, _valval);\n#else // __ARM_FEATURE_FP16_FML\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr));\n            float32x4_t _w0 = vcvt_f32_f16(vld1_f16(kptr0));\n            float32x4_t _w1 = vcvt_f32_f16(vld1_f16(kptr1));\n            float32x4_t _w2 = vcvt_f32_f16(vld1_f16(kptr2));\n            float32x4_t _w3 = vcvt_f32_f16(vld1_f16(kptr3));\n#else\n            float32x4_t _val = vld1q_f32(sptr);\n            float32x4_t _w0 = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr0)));\n            float32x4_t _w1 = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr1)));\n            float32x4_t _w2 = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr2)));\n            float32x4_t _w3 = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr3)));\n#endif\n\n            _sum0 = vfmaq_f32(_sum0, _val, _w0);\n            _sum1 = vfmaq_f32(_sum1, _val, _w1);\n            _sum2 = vfmaq_f32(_sum2, _val, _w2);\n            _sum3 = vfmaq_f32(_sum3, _val, _w3);\n#endif // __ARM_FEATURE_FP16_FML\n\n            sptr += 4;\n            kptr0 += 4;\n            kptr1 += 4;\n            kptr2 += 4;\n            kptr3 += 4;\n        }\n\n#if __aarch64__\n        sums[0] += vaddvq_f32(_sum0);\n        sums[1] += vaddvq_f32(_sum1);\n        sums[2] += vaddvq_f32(_sum2);\n        sums[3] += vaddvq_f32(_sum3);\n#else\n        float32x2_t _sum0ss = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n        float32x2_t _sum1ss = vadd_f32(vget_low_f32(_sum1), vget_high_f32(_sum1));\n        float32x2_t _sum2ss = vadd_f32(vget_low_f32(_sum2), vget_high_f32(_sum2));\n        float32x2_t _sum3ss = 
vadd_f32(vget_low_f32(_sum3), vget_high_f32(_sum3));\n        float32x2_t _sum01ss = vpadd_f32(_sum0ss, _sum1ss);\n        float32x2_t _sum23ss = vpadd_f32(_sum2ss, _sum3ss);\n        sums[0] += vget_lane_f32(_sum01ss, 0);\n        sums[1] += vget_lane_f32(_sum01ss, 1);\n        sums[2] += vget_lane_f32(_sum23ss, 0);\n        sums[3] += vget_lane_f32(_sum23ss, 1);\n#endif\n\n        for (; i < num_input; i++)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            sums[0] += (float)(*sptr) * (float)(*kptr0);\n            sums[1] += (float)(*sptr) * (float)(*kptr1);\n            sums[2] += (float)(*sptr) * (float)(*kptr2);\n            sums[3] += (float)(*sptr) * (float)(*kptr3);\n#else\n            sums[0] += *sptr * float16_to_float32(*kptr0);\n            sums[1] += *sptr * float16_to_float32(*kptr1);\n            sums[2] += *sptr * float16_to_float32(*kptr2);\n            sums[3] += *sptr * float16_to_float32(*kptr3);\n#endif\n\n            sptr++;\n            kptr0++;\n            kptr1++;\n            kptr2++;\n            kptr3++;\n        }\n\n        float32x4_t _sum = vld1q_f32(sums);\n\n        _sum = activation_ps(_sum, activation_type, activation_params);\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        __fp16* outptr = (__fp16*)top_blob;\n        vst1_f16(outptr + p, vcvt_f16_f32(_sum));\n#else\n        float* outptr = top_blob;\n        vst1q_f32(outptr + p, _sum);\n#endif\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_num_output_start; p < num_output; p++)\n    {\n        float sum = 0.f;\n\n        if (bias_data_ptr)\n            sum = bias_data_ptr[p];\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        const __fp16* sptr = bottom_blob;\n        const __fp16* kptr = weight_data_fp16.row<const __fp16>(p);\n#else\n        const float* sptr = bottom_blob;\n        const unsigned short* kptr = weight_data_fp16.row<const unsigned short>(p);\n#endif\n\n        int i = 0;\n\n        float32x4_t 
_sum = vdupq_n_f32(0.f);\n        for (; i + 3 < num_input; i += 4)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr));\n            float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n#else\n            float32x4_t _val = vld1q_f32(sptr);\n            float32x4_t _w = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr)));\n#endif\n            _sum = vfmaq_f32(_sum, _val, _w);\n\n            sptr += 4;\n            kptr += 4;\n        }\n        for (; i < num_input; i++)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            sum += (float)(*sptr) * (float)(*kptr);\n#else\n            sum += *sptr * float16_to_float32(*kptr);\n#endif\n            sptr++;\n            kptr++;\n        }\n\n#if __aarch64__\n        sum += vaddvq_f32(_sum);\n#else\n        float32x2_t _sumss = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n        _sumss = vpadd_f32(_sumss, _sumss);\n        sum += vget_lane_f32(_sumss, 0);\n#endif // __aarch64__\n\n        sum = activation_ss(sum, activation_type, activation_params);\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        __fp16* outptr = (__fp16*)top_blob;\n        outptr[p] = (__fp16)sum;\n#else\n        float* outptr = top_blob;\n        outptr[p] = sum;\n#endif\n    }\n}\n\nstatic void innerproduct_transform_kernel_fp16s_neon(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdfhm())\n    {\n        innerproduct_transform_kernel_fp16s_neon_asimdfhm(weight_data, weight_data_tm, num_input, num_output, opt);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdhp())\n    {\n        innerproduct_transform_kernel_fp16s_neon_asimdhp(weight_data, weight_data_tm, num_input, 
num_output, opt);\n        return;\n    }\n#endif\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n        out_elempack = opt.use_fp16_arithmetic && num_output % 8 == 0 ? 8 : num_output % 4 == 0 ? 4 : 1;\n#else\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n#endif\n    }\n\n    // src = inch-outch\n    // dst = pb-inch-outch/pb\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    if (out_elempack == 8)\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n        weight_data_tm.create(num_input, num_output / 8, (size_t)16u, 8);\n\n        for (int q = 0; q + 7 < num_output; q += 8)\n        {\n            unsigned short* g0 = weight_data_tm.row<unsigned short>(q / 8);\n\n            const float* k0 = weight_data_r2.row(q);\n            const float* k1 = weight_data_r2.row(q + 1);\n            const float* k2 = weight_data_r2.row(q + 2);\n            const float* k3 = weight_data_r2.row(q + 3);\n            const float* k4 = weight_data_r2.row(q + 4);\n            const float* k5 = weight_data_r2.row(q + 5);\n            const float* k6 = weight_data_r2.row(q + 6);\n            const float* k7 = weight_data_r2.row(q + 7);\n\n            int p = 0;\n#if NCNN_GNU_INLINE_ASM\n            for (; p + 7 < num_input; p += 8)\n            {\n                // transpose 8x8\n                asm volatile(\n                    \"ld1    {v0.4s, v1.4s}, [%0], #32   \\n\"\n                    \"ld1    {v2.4s, v3.4s}, [%1], #32   \\n\"\n                    \"ld1    {v4.4s, v5.4s}, [%2], #32   \\n\"\n                    \"ld1    {v6.4s, v7.4s}, [%3], #32   \\n\"\n                    \"ld1    {v8.4s, v9.4s}, [%4], #32   \\n\"\n                    \"ld1    {v10.4s, v11.4s}, [%5], #32 \\n\"\n                    \"ld1    {v12.4s, v13.4s}, [%6], #32 \\n\"\n                    \"ld1    {v14.4s, v15.4s}, [%7], #32 \\n\"\n\n                    \"fcvtn  v0.4h, v0.4s            \\n\"\n              
      \"fcvtn2 v0.8h, v1.4s            \\n\"\n                    \"fcvtn  v1.4h, v2.4s            \\n\"\n                    \"fcvtn2 v1.8h, v3.4s            \\n\"\n                    \"fcvtn  v2.4h, v4.4s            \\n\"\n                    \"fcvtn2 v2.8h, v5.4s            \\n\"\n                    \"fcvtn  v3.4h, v6.4s            \\n\"\n                    \"fcvtn2 v3.8h, v7.4s            \\n\"\n                    \"fcvtn  v4.4h, v8.4s            \\n\"\n                    \"fcvtn2 v4.8h, v9.4s            \\n\"\n                    \"fcvtn  v5.4h, v10.4s           \\n\"\n                    \"fcvtn2 v5.8h, v11.4s           \\n\"\n                    \"fcvtn  v6.4h, v12.4s           \\n\"\n                    \"fcvtn2 v6.8h, v13.4s           \\n\"\n                    \"fcvtn  v7.4h, v14.4s           \\n\"\n                    \"fcvtn2 v7.8h, v15.4s           \\n\"\n\n                    \"zip1   v16.8h, v0.8h, v4.8h    \\n\"\n                    \"zip2   v20.8h, v0.8h, v4.8h    \\n\"\n                    \"zip1   v17.8h, v1.8h, v5.8h    \\n\"\n                    \"zip2   v21.8h, v1.8h, v5.8h    \\n\"\n                    \"zip1   v18.8h, v2.8h, v6.8h    \\n\"\n                    \"zip2   v22.8h, v2.8h, v6.8h    \\n\"\n                    \"zip1   v19.8h, v3.8h, v7.8h    \\n\"\n                    \"zip2   v23.8h, v3.8h, v7.8h    \\n\"\n\n                    \"st4    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n                    \"st4    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n                    : \"=r\"(k0), // %0\n                    \"=r\"(k1), // %1\n                    \"=r\"(k2), // %2\n                    \"=r\"(k3), // %3\n                    \"=r\"(k4), // %4\n                    \"=r\"(k5), // %5\n                    \"=r\"(k6), // %6\n                    \"=r\"(k7), // %7\n                    \"=r\"(g0)  // %8\n                    : \"0\"(k0),\n                    \"1\"(k1),\n                    \"2\"(k2),\n                   
 \"3\"(k3),\n                    \"4\"(k4),\n                    \"5\"(k5),\n                    \"6\"(k6),\n                    \"7\"(k7),\n                    \"8\"(g0)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\", \"v15\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n            }\n#endif // NCNN_GNU_INLINE_ASM\n            for (; p < num_input; p++)\n            {\n                g0[0] = float32_to_float16(*k0++);\n                g0[1] = float32_to_float16(*k1++);\n                g0[2] = float32_to_float16(*k2++);\n                g0[3] = float32_to_float16(*k3++);\n                g0[4] = float32_to_float16(*k4++);\n                g0[5] = float32_to_float16(*k5++);\n                g0[6] = float32_to_float16(*k6++);\n                g0[7] = float32_to_float16(*k7++);\n                g0 += 8;\n            }\n        }\n    }\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n    if (out_elempack == 4)\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n        weight_data_tm.create(num_input, num_output / 4, (size_t)8u, 4);\n\n        for (int q = 0; q + 3 < num_output; q += 4)\n        {\n            unsigned short* g0 = weight_data_tm.row<unsigned short>(q / 4);\n\n            const float* k0 = weight_data_r2.row(q);\n            const float* k1 = weight_data_r2.row(q + 1);\n            const float* k2 = weight_data_r2.row(q + 2);\n            const float* k3 = weight_data_r2.row(q + 3);\n\n            int p = 0;\n            for (; p + 3 < num_input; p += 4)\n            {\n                // transpose 4x4\n                uint16x4x4_t _p;\n                _p.val[0] = (uint16x4_t)(vcvt_f16_f32(vld1q_f32(k0)));\n                _p.val[1] = (uint16x4_t)(vcvt_f16_f32(vld1q_f32(k1)));\n                _p.val[2] = (uint16x4_t)(vcvt_f16_f32(vld1q_f32(k2)));\n                _p.val[3] = 
(uint16x4_t)(vcvt_f16_f32(vld1q_f32(k3)));\n                vst4_u16(g0, _p);\n\n                k0 += 4;\n                k1 += 4;\n                k2 += 4;\n                k3 += 4;\n                g0 += 16;\n            }\n            for (; p < num_input; p++)\n            {\n                g0[0] = float32_to_float16(*k0++);\n                g0[1] = float32_to_float16(*k1++);\n                g0[2] = float32_to_float16(*k2++);\n                g0[3] = float32_to_float16(*k3++);\n                g0 += 4;\n            }\n        }\n    }\n\n    if (out_elempack == 1)\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n        ncnn::cast_float32_to_float16(weight_data_r2, weight_data_tm, opt);\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/innerproduct_gemm_fp16s.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\nvoid innerproduct_gemm_fp16s_neon_asimdfhm(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC && !__ARM_FEATURE_FP16_FML\nvoid innerproduct_gemm_fp16s_neon_asimdhp(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt);\n#endif\n\nstatic void innerproduct_gemm_fp16s_neon(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_fp16, const Mat& bias_data, int activation_type, const Mat& activation_params, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82FP16FML && __aarch64__ && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdfhm())\n    {\n        innerproduct_gemm_fp16s_neon_asimdfhm(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n        return;\n    }\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82 && __aarch64__ && !__ARM_FEATURE_FP16_VECTOR_ARITHMETIC && !__ARM_FEATURE_FP16_FML\n    if (ncnn::cpu_support_arm_asimdhp())\n    {\n        innerproduct_gemm_fp16s_neon_asimdhp(bottom_blob, top_blob, weight_data_fp16, bias_data, activation_type, activation_params, opt);\n        return;\n    }\n#endif\n\n    const int num_input = bottom_blob.w;\n    const int elempack = bottom_blob.elempack;\n    const int num_output = top_blob.w;\n    const int h = bottom_blob.h;\n\n    const float* bias_data_ptr = bias_data;\n\n    int num_output_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        num_output_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int j = 0; j < h; j++)\n    {\n        if (elempack == 4 && num_output_elempack == 4)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            __fp16* outptr = top_blob.row<__fp16>(j);\n#else\n            float* outptr = top_blob.row(j);\n#endif\n\n            for (int p = 0; p < num_output / num_output_elempack; p++)\n            {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                const __fp16* m = bottom_blob.row<const __fp16>(j);\n                const __fp16* kptr = weight_data_fp16.row<const __fp16>(p);\n#else\n                const float* m = bottom_blob.row(j);\n                const unsigned short* kptr = weight_data_fp16.row<const unsigned short>(p);\n#endif\n\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vdupq_n_f32(bias_data_ptr[p * 4 + 0]);\n                    _sum1 = vdupq_n_f32(bias_data_ptr[p * 4 + 1]);\n                    _sum2 = vdupq_n_f32(bias_data_ptr[p * 4 + 2]);\n                    _sum3 = vdupq_n_f32(bias_data_ptr[p * 4 + 3]);\n                }\n\n                int i = 0;\n                for (; i < num_input; i++)\n                {\n#if __ARM_FEATURE_FP16_FML\n                    float16x4_t _val = vld1_f16(m);\n                    float16x4_t _w = vld1_f16(kptr);\n                    float16x8_t _valval = vcombine_f16(_val, _val);\n\n                    _sum0 = vfmlalq_lane_low_f16(_sum0, _valval, _w, 0);\n                    _sum1 = vfmlalq_lane_low_f16(_sum1, _valval, _w, 1);\n                    _sum2 = vfmlalq_lane_low_f16(_sum2, _valval, _w, 2);\n                    _sum3 = vfmlalq_lane_low_f16(_sum3, _valval, _w, 3);\n#else // __ARM_FEATURE_FP16_FML\n#if 
__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                    float32x4_t _val = vcvt_f32_f16(vld1_f16(m));\n                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n#else\n                    float32x4_t _val = vld1q_f32(m);\n                    float32x4_t _w = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr)));\n#endif\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _val, _w, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _val, _w, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _val, _w, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _val, _w, 3);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _val, vget_low_f32(_w), 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _val, vget_low_f32(_w), 1);\n                    _sum2 = vmlaq_lane_f32(_sum2, _val, vget_high_f32(_w), 0);\n                    _sum3 = vmlaq_lane_f32(_sum3, _val, vget_high_f32(_w), 1);\n#endif\n#endif // __ARM_FEATURE_FP16_FML\n\n                    m += 4;\n                    kptr += 4;\n                }\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                _sum1 = activation_ps(_sum1, activation_type, activation_params);\n                _sum2 = activation_ps(_sum2, activation_type, activation_params);\n                _sum3 = activation_ps(_sum3, activation_type, activation_params);\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                vst1_f16(outptr, vcvt_f16_f32(_sum0));\n                vst1_f16(outptr + 4, vcvt_f16_f32(_sum1));\n                vst1_f16(outptr + 8, vcvt_f16_f32(_sum2));\n                vst1_f16(outptr + 12, vcvt_f16_f32(_sum3));\n#else\n                vst1q_f32(outptr, _sum0);\n                vst1q_f32(outptr + 4, _sum1);\n                vst1q_f32(outptr + 8, _sum2);\n                vst1q_f32(outptr + 12, _sum3);\n#endif\n                outptr += 16;\n            }\n        }\n\n        if (elempack == 1 && num_output_elempack == 4)\n        
{\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            __fp16* outptr = top_blob.row<__fp16>(j);\n#else\n            float* outptr = top_blob.row(j);\n#endif\n\n            for (int p = 0; p < num_output / num_output_elempack; p++)\n            {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                const __fp16* m = bottom_blob.row<const __fp16>(j);\n                const __fp16* kptr = weight_data_fp16.row<const __fp16>(p);\n#else\n                const float* m = bottom_blob.row(j);\n                const unsigned short* kptr = weight_data_fp16.row<const unsigned short>(p);\n#endif\n\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vld1q_f32(bias_data_ptr + p * 4);\n                }\n\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                int i = 0;\n                for (; i + 3 < num_input; i += 4)\n                {\n#if __ARM_FEATURE_FP16_FML\n                    float16x4_t _val = vld1_f16(m);\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float16x8_t _w23 = vld1q_f16(kptr + 8);\n\n                    _sum0 = vfmlalq_lane_low_f16(_sum0, _w01, _val, 0);\n                    _sum1 = vfmlalq_lane_high_f16(_sum1, _w01, _val, 1);\n                    _sum2 = vfmlalq_lane_low_f16(_sum2, _w23, _val, 2);\n                    _sum3 = vfmlalq_lane_high_f16(_sum3, _w23, _val, 3);\n#else // __ARM_FEATURE_FP16_FML\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                    float32x4_t _val = vcvt_f32_f16(vld1_f16(m));\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float16x8_t _w23 = vld1q_f16(kptr + 8);\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n                    float32x4_t 
_w2 = vcvt_f32_f16(vget_low_f16(_w23));\n                    float32x4_t _w3 = vcvt_f32_f16(vget_high_f16(_w23));\n#else\n                    float32x4_t _val = vld1q_f32(m);\n                    uint16x8_t _w01 = vld1q_u16(kptr);\n                    uint16x8_t _w23 = vld1q_u16(kptr + 8);\n                    float32x4_t _w0 = vcvt_f32_f16((float16x4_t)(vget_low_u16(_w01)));\n                    float32x4_t _w1 = vcvt_f32_f16((float16x4_t)(vget_high_u16(_w01)));\n                    float32x4_t _w2 = vcvt_f32_f16((float16x4_t)(vget_low_u16(_w23)));\n                    float32x4_t _w3 = vcvt_f32_f16((float16x4_t)(vget_high_u16(_w23)));\n#endif\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _w0, _val, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _w1, _val, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _w2, _val, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _w3, _val, 3);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _w0, vget_low_f32(_val), 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _w1, vget_low_f32(_val), 1);\n                    _sum2 = vmlaq_lane_f32(_sum2, _w2, vget_high_f32(_val), 0);\n                    _sum3 = vmlaq_lane_f32(_sum3, _w3, vget_high_f32(_val), 1);\n#endif\n#endif // __ARM_FEATURE_FP16_FML\n\n                    m += 4;\n                    kptr += 16;\n                }\n                for (; i < num_input; i++)\n                {\n                    float32x4_t _val = vdupq_n_f32((float)m[0]);\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n#else\n                    float32x4_t _w = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr)));\n#endif\n                    _sum0 = vfmaq_f32(_sum0, _val, _w);\n\n                    m += 1;\n                    kptr += 4;\n                }\n\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n                _sum0 = 
vaddq_f32(_sum0, _sum2);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                vst1_f16(outptr, vcvt_f16_f32(_sum0));\n#else\n                vst1q_f32(outptr, _sum0);\n#endif\n                outptr += 4;\n            }\n        }\n\n        if (elempack == 4 && num_output_elempack == 1)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            __fp16* outptr = top_blob.row<__fp16>(j);\n#else\n            float* outptr = top_blob.row(j);\n#endif\n\n            for (int p = 0; p < num_output; p++)\n            {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                const __fp16* m = bottom_blob.row<const __fp16>(j);\n                const __fp16* kptr = weight_data_fp16.row<const __fp16>(p);\n#else\n                const float* m = bottom_blob.row(j);\n                const unsigned short* kptr = weight_data_fp16.row<const unsigned short>(p);\n#endif\n\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                float32x4_t _sum2 = vdupq_n_f32(0.f);\n                float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n                if (bias_data_ptr)\n                {\n                    _sum0 = vdupq_n_f32(bias_data_ptr[p]);\n                }\n\n                int i = 0;\n                for (; i + 3 < num_input; i += 4)\n                {\n#if __ARM_FEATURE_FP16_FML\n                    float16x8_t _val01 = vld1q_f16(m);\n                    float16x8_t _val23 = vld1q_f16(m + 8);\n                    float16x4_t _w = vld1_f16(kptr);\n\n                    _sum0 = vfmlalq_lane_low_f16(_sum0, _val01, _w, 0);\n                    _sum1 = vfmlalq_lane_high_f16(_sum1, _val01, _w, 1);\n                    _sum2 = vfmlalq_lane_low_f16(_sum2, _val23, _w, 2);\n                    _sum3 = vfmlalq_lane_high_f16(_sum3, _val23, _w, 3);\n#else // __ARM_FEATURE_FP16_FML\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n    
                float32x4_t _val0 = vcvt_f32_f16(vld1_f16(m));\n                    float32x4_t _val1 = vcvt_f32_f16(vld1_f16(m + 4));\n                    float32x4_t _val2 = vcvt_f32_f16(vld1_f16(m + 8));\n                    float32x4_t _val3 = vcvt_f32_f16(vld1_f16(m + 12));\n                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n#else\n                    float32x4_t _val0 = vld1q_f32(m);\n                    float32x4_t _val1 = vld1q_f32(m + 4);\n                    float32x4_t _val2 = vld1q_f32(m + 8);\n                    float32x4_t _val3 = vld1q_f32(m + 12);\n                    float32x4_t _w = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr)));\n#endif\n\n#if __aarch64__\n                    _sum0 = vfmaq_laneq_f32(_sum0, _val0, _w, 0);\n                    _sum1 = vfmaq_laneq_f32(_sum1, _val1, _w, 1);\n                    _sum2 = vfmaq_laneq_f32(_sum2, _val2, _w, 2);\n                    _sum3 = vfmaq_laneq_f32(_sum3, _val3, _w, 3);\n#else\n                    _sum0 = vmlaq_lane_f32(_sum0, _val0, vget_low_f32(_w), 0);\n                    _sum1 = vmlaq_lane_f32(_sum1, _val1, vget_low_f32(_w), 1);\n                    _sum2 = vmlaq_lane_f32(_sum2, _val2, vget_high_f32(_w), 0);\n                    _sum3 = vmlaq_lane_f32(_sum3, _val3, vget_high_f32(_w), 1);\n#endif\n#endif // __ARM_FEATURE_FP16_FML\n\n                    m += 16;\n                    kptr += 4;\n                }\n                for (; i < num_input; i++)\n                {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                    float32x4_t _val = vcvt_f32_f16(vld1_f16(m));\n                    float32x4_t _k = vdupq_n_f32((float)(kptr[0]));\n#else\n                    float32x4_t _val = vld1q_f32(m);\n                    float32x4_t _k = vdupq_n_f32(float16_to_float32(kptr[0]));\n#endif\n                    _sum0 = vfmaq_f32(_sum0, _val, _k);\n\n                    m += 4;\n                    kptr += 1;\n                }\n\n                _sum0 = vaddq_f32(_sum0, 
_sum1);\n                _sum2 = vaddq_f32(_sum2, _sum3);\n                _sum0 = vaddq_f32(_sum0, _sum2);\n\n                _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                vst1_f16(outptr, vcvt_f16_f32(_sum0));\n#else\n                vst1q_f32(outptr, _sum0);\n#endif\n                outptr += 4;\n            }\n        }\n\n        if (elempack == 1 && num_output_elempack == 1)\n        {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n            __fp16* outptr = top_blob.row<__fp16>(j);\n#else\n            float* outptr = top_blob.row(j);\n#endif\n\n            for (int p = 0; p < num_output; p++)\n            {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                const __fp16* m = bottom_blob.row<const __fp16>(j);\n                const __fp16* kptr = weight_data_fp16.row<const __fp16>(p);\n#else\n                const float* m = bottom_blob.row(j);\n                const unsigned short* kptr = weight_data_fp16.row<const unsigned short>(p);\n#endif\n\n                float sum = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum = bias_data_ptr[p];\n                }\n\n                int i = 0;\n                float32x4_t _sum0 = vdupq_n_f32(0.f);\n                float32x4_t _sum1 = vdupq_n_f32(0.f);\n                for (; i + 7 < num_input; i += 8)\n                {\n#if __ARM_FEATURE_FP16_FML\n                    float16x8_t _val01 = vld1q_f16(m);\n                    float16x8_t _w01 = vld1q_f16(kptr);\n\n                    _sum0 = vfmlalq_low_f16(_sum0, _val01, _w01);\n                    _sum1 = vfmlalq_high_f16(_sum1, _val01, _w01);\n#else // __ARM_FEATURE_FP16_FML\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                    float16x8_t _val01 = vld1q_f16(m);\n                    float16x8_t _w01 = vld1q_f16(kptr);\n                    float32x4_t _val0 = vcvt_f32_f16(vget_low_f16(_val01));\n                    float32x4_t _val1 
= vcvt_f32_f16(vget_high_f16(_val01));\n                    float32x4_t _w0 = vcvt_f32_f16(vget_low_f16(_w01));\n                    float32x4_t _w1 = vcvt_f32_f16(vget_high_f16(_w01));\n#else\n                    float32x4_t _val0 = vld1q_f32(m);\n                    float32x4_t _val1 = vld1q_f32(m + 4);\n                    uint16x8_t _w01 = vld1q_u16(kptr);\n                    float32x4_t _w0 = vcvt_f32_f16((float16x4_t)(vget_low_u16(_w01)));\n                    float32x4_t _w1 = vcvt_f32_f16((float16x4_t)(vget_high_u16(_w01)));\n#endif\n\n                    _sum0 = vfmaq_f32(_sum0, _val0, _w0);\n                    _sum1 = vfmaq_f32(_sum1, _val1, _w1);\n#endif // __ARM_FEATURE_FP16_FML\n\n                    m += 8;\n                    kptr += 8;\n                }\n                _sum0 = vaddq_f32(_sum0, _sum1);\n                for (; i + 3 < num_input; i += 4)\n                {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                    float32x4_t _val = vcvt_f32_f16(vld1_f16(m));\n                    float32x4_t _w = vcvt_f32_f16(vld1_f16(kptr));\n#else\n                    float32x4_t _val = vld1q_f32(m);\n                    float32x4_t _w = vcvt_f32_f16((float16x4_t)(vld1_u16(kptr)));\n#endif\n\n                    _sum0 = vfmaq_f32(_sum0, _val, _w);\n\n                    m += 4;\n                    kptr += 4;\n                }\n#if __aarch64__\n                sum += vaddvq_f32(_sum0);\n#else\n                float32x2_t _ss = vadd_f32(vget_low_f32(_sum0), vget_high_f32(_sum0));\n                _ss = vpadd_f32(_ss, _ss);\n                sum += vget_lane_f32(_ss, 0);\n#endif\n                for (; i < num_input; i++)\n                {\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                    sum += (float)(*m++) * (float)(*kptr++);\n#else\n                    sum += *m++ * float16_to_float32(*kptr++);\n#endif\n                }\n\n                sum = activation_ss(sum, activation_type, activation_params);\n\n#if 
__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n                outptr[0] = (__fp16)sum;\n#else\n                outptr[0] = sum;\n#endif\n                outptr += 1;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/instancenorm_arm.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"instancenorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nInstanceNorm_arm::InstanceNorm_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint InstanceNorm_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr0 = bottom_top_blob.channel(q);\n\n            float32x4_t _div_size = vdupq_n_f32(1.f / size);\n\n            // mean and var\n            float32x4_t _sum = vdupq_n_f32(0.f);\n            float32x4_t _sqsum = vdupq_n_f32(0.f);\n            const float* ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                _sum = vaddq_f32(_sum, _p);\n                ptr += 4;\n                //sqsum += ptr[i] * ptr[i];\n            }\n            float32x4_t _mean = vmulq_f32(_sum, _div_size);\n            ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = 
vld1q_f32(ptr);\n                float32x4_t _tmp = vsubq_f32(_p, _mean);\n                _sqsum = vmlaq_f32(_sqsum, _tmp, _tmp);\n                ptr += 4;\n            }\n            float32x4_t _var_eps = vmlaq_f32(vdupq_n_f32(eps), _sqsum, _div_size);\n            // the var maybe minus due to accuracy\n            //float var = sqsum / size - mean * mean;\n\n            float32x4_t _reciprocal = vrsqrteq_f32(_var_eps);\n            _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps, _reciprocal), _reciprocal), _reciprocal);\n            // _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps, _reciprocal), _reciprocal), _reciprocal);\n\n            float32x4_t _a;\n            float32x4_t _b;\n            if (affine)\n            {\n                float32x4_t _gamma = vld1q_f32((const float*)gamma_data + q * 4);\n                float32x4_t _beta = vld1q_f32((const float*)beta_data + q * 4);\n\n                _a = vmulq_f32(_gamma, _reciprocal);\n                _b = vmlsq_f32(_beta, _mean, _a);\n            }\n            else\n            {\n                _a = _reciprocal;\n                _b = vnegq_f32(vmulq_f32(_mean, _a));\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vld1q_f32(ptr0);\n                _p = vmlaq_f32(_b, _p, _a);\n                vst1q_f32(ptr0, _p);\n                ptr0 += 4;\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr0 = bottom_top_blob.channel(q);\n\n        // mean and var\n        float sum = 0.f;\n        float sqsum = 0.f;\n        const float* ptr = ptr0;\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _sum = vdupq_n_f32(0.f);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _sum = vaddq_f32(_sum, _p);\n            ptr += 4;\n     
   }\n#if __aarch64__\n        sum = vaddvq_f32(_sum);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n        _s2 = vpadd_f32(_s2, _s2);\n        sum = vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            sum += *ptr++;\n            //sqsum += ptr[i] * ptr[i];\n        }\n        float mean = sum / size;\n        ptr = ptr0;\n        i = 0;\n#if __ARM_NEON\n        float32x4_t _sqsum = vdupq_n_f32(0.f);\n        float32x4_t _mean = vdupq_n_f32(mean);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _tmp = vsubq_f32(_p, _mean);\n            _sqsum = vmlaq_f32(_sqsum, _tmp, _tmp);\n            ptr += 4;\n        }\n#if __aarch64__\n        sqsum = vaddvq_f32(_sqsum);\n#else\n        float32x2_t _sq2 = vadd_f32(vget_low_f32(_sqsum), vget_high_f32(_sqsum));\n        _sq2 = vpadd_f32(_sq2, _sq2);\n        sqsum = vget_lane_f32(_sq2, 0);\n#endif\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float tmp = *ptr++ - mean;\n            sqsum += tmp * tmp;\n        }\n        float var = sqsum / size;\n        // the var maybe minus due to accuracy\n        //float var = sqsum / size - mean * mean;\n\n        float a;\n        float b;\n        if (affine)\n        {\n            float gamma = gamma_data[q];\n            float beta = beta_data[q];\n\n            a = (float)(gamma / (sqrtf(var + eps)));\n            b = (float)(-mean * a + beta);\n        }\n        else\n        {\n            a = (float)(1.f / (sqrtf(var + eps)));\n            b = (float)(-mean * a);\n        }\n\n        i = 0;\n#if __ARM_NEON\n        float32x4_t _a = vdupq_n_f32(a);\n        float32x4_t _b = vdupq_n_f32(b);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            _p = vmlaq_f32(_b, _p, _a);\n            vst1q_f32(ptr0, _p);\n            ptr0 += 
4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr0 = *ptr0 * a + b;\n            ptr0++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint InstanceNorm_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr0 = bottom_top_blob.channel(q);\n\n            float32x4_t _div_size = vdupq_n_f32(1.f / size);\n\n            // mean and var\n            float32x4_t _sum = vdupq_n_f32(0.f);\n            float32x4_t _sqsum = vdupq_n_f32(0.f);\n            const unsigned short* ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                _sum = vaddq_f32(_sum, _p);\n                ptr += 4;\n                //sqsum += ptr[i] * ptr[i];\n            }\n            float32x4_t _mean = vmulq_f32(_sum, _div_size);\n            ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                float32x4_t _tmp = vsubq_f32(_p, _mean);\n                _sqsum = vmlaq_f32(_sqsum, _tmp, _tmp);\n                ptr += 4;\n            }\n            float32x4_t _var_eps = vmlaq_f32(vdupq_n_f32(eps), _sqsum, _div_size);\n            // the var maybe minus due to accuracy\n            //float var = sqsum / size - mean * mean;\n\n            float32x4_t _reciprocal = vrsqrteq_f32(_var_eps);\n            _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps, _reciprocal), _reciprocal), _reciprocal);\n            // _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps, 
_reciprocal), _reciprocal), _reciprocal);\n\n            float32x4_t _a;\n            float32x4_t _b;\n            if (affine)\n            {\n                float32x4_t _gamma = vld1q_f32((const float*)gamma_data + q * 4);\n                float32x4_t _beta = vld1q_f32((const float*)beta_data + q * 4);\n\n                _a = vmulq_f32(_gamma, _reciprocal);\n                _b = vmlsq_f32(_beta, _mean, _a);\n            }\n            else\n            {\n                _a = _reciprocal;\n                _b = vnegq_f32(vmulq_f32(_mean, _a));\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n                _p = vmlaq_f32(_b, _p, _a);\n                vst1_u16(ptr0, float2bfloat(_p));\n                ptr0 += 4;\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr0 = bottom_top_blob.channel(q);\n\n        // mean and var\n        float sum = 0.f;\n        float sqsum = 0.f;\n        const unsigned short* ptr = ptr0;\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _sum = vdupq_n_f32(0.f);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _sum = vaddq_f32(_sum, _p);\n            ptr += 4;\n        }\n#if __aarch64__\n        sum = vaddvq_f32(_sum);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n        _s2 = vpadd_f32(_s2, _s2);\n        sum = vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            sum += bfloat16_to_float32(*ptr++);\n            //sqsum += ptr[i] * ptr[i];\n        }\n        float mean = sum / size;\n        ptr = ptr0;\n        i = 0;\n#if __ARM_NEON\n        float32x4_t _sqsum = vdupq_n_f32(0.f);\n        float32x4_t _mean = 
vdupq_n_f32(mean);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _tmp = vsubq_f32(_p, _mean);\n            _sqsum = vmlaq_f32(_sqsum, _tmp, _tmp);\n            ptr += 4;\n        }\n#if __aarch64__\n        sqsum = vaddvq_f32(_sqsum);\n#else\n        float32x2_t _sq2 = vadd_f32(vget_low_f32(_sqsum), vget_high_f32(_sqsum));\n        _sq2 = vpadd_f32(_sq2, _sq2);\n        sqsum = vget_lane_f32(_sq2, 0);\n#endif\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float tmp = bfloat16_to_float32(*ptr++) - mean;\n            sqsum += tmp * tmp;\n        }\n        float var = sqsum / size;\n        // the var maybe minus due to accuracy\n        //float var = sqsum / size - mean * mean;\n\n        float a;\n        float b;\n        if (affine)\n        {\n            float gamma = gamma_data[q];\n            float beta = beta_data[q];\n\n            a = (float)(gamma / (sqrtf(var + eps)));\n            b = (float)(-mean * a + beta);\n        }\n        else\n        {\n            a = (float)(1.f / (sqrtf(var + eps)));\n            b = (float)(-mean * a);\n        }\n\n        i = 0;\n#if __ARM_NEON\n        float32x4_t _a = vdupq_n_f32(a);\n        float32x4_t _b = vdupq_n_f32(b);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n            _p = vmlaq_f32(_b, _p, _a);\n            vst1_u16(ptr0, float2bfloat(_p));\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr0 = float32_to_bfloat16(bfloat16_to_float32(*ptr0) * a + b);\n            ptr0++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/instancenorm_arm.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INSTANCENORM_ARM_H\n#define LAYER_INSTANCENORM_ARM_H\n\n#include \"instancenorm.h\"\n\nnamespace ncnn {\n\nclass InstanceNorm_arm : public InstanceNorm\n{\npublic:\n    InstanceNorm_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INSTANCENORM_ARM_H\n"
  },
  {
    "path": "src/layer/arm/instancenorm_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"instancenorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint InstanceNorm_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 8)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr0 = bottom_top_blob.channel(q);\n\n            float32x4_t _div_size = vdupq_n_f32(1.f / size);\n            float32x4_t _eps = vdupq_n_f32(eps);\n\n            // mean and var\n            float32x4_t _sum0 = vdupq_n_f32(0.f);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sqsum0 = vdupq_n_f32(0.f);\n            float32x4_t _sqsum1 = vdupq_n_f32(0.f);\n            const __fp16* ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                _sum0 = vaddq_f32(_sum0, _p0);\n                _sum1 = vaddq_f32(_sum1, _p1);\n                ptr += 8;\n                //sqsum += ptr[i] * ptr[i];\n            }\n            float32x4_t _mean0 = vmulq_f32(_sum0, _div_size);\n            float32x4_t _mean1 = vmulq_f32(_sum1, _div_size);\n            ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                float32x4_t _tmp0 
= vsubq_f32(_p0, _mean0);\n                float32x4_t _tmp1 = vsubq_f32(_p1, _mean1);\n                _sqsum0 = vfmaq_f32(_sqsum0, _tmp0, _tmp0);\n                _sqsum1 = vfmaq_f32(_sqsum1, _tmp1, _tmp1);\n                ptr += 8;\n            }\n            float32x4_t _var_eps0 = vfmaq_f32(_eps, _sqsum0, _div_size);\n            float32x4_t _var_eps1 = vfmaq_f32(_eps, _sqsum1, _div_size);\n            // the var maybe minus due to accuracy\n            //float var = sqsum / size - mean * mean;\n\n            float32x4_t _reciprocal0 = vrsqrteq_f32(_var_eps0);\n            float32x4_t _reciprocal1 = vrsqrteq_f32(_var_eps1);\n            _reciprocal0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps0, _reciprocal0), _reciprocal0), _reciprocal0);\n            _reciprocal1 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps1, _reciprocal1), _reciprocal1), _reciprocal1);\n            // _reciprocal0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps0, _reciprocal0), _reciprocal0), _reciprocal0);\n            // _reciprocal1 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps1, _reciprocal1), _reciprocal1), _reciprocal1);\n\n            float16x8_t _a;\n            float16x8_t _b;\n            if (affine)\n            {\n                float32x4_t _gamma0 = vld1q_f32((const float*)gamma_data + q * 8);\n                float32x4_t _gamma1 = vld1q_f32((const float*)gamma_data + q * 8 + 4);\n                float32x4_t _beta0 = vld1q_f32((const float*)beta_data + q * 8);\n                float32x4_t _beta1 = vld1q_f32((const float*)beta_data + q * 8 + 4);\n\n                float32x4_t _a320 = vmulq_f32(_gamma0, _reciprocal0);\n                float32x4_t _a321 = vmulq_f32(_gamma1, _reciprocal1);\n                float16x4_t _a0 = vcvt_f16_f32(_a320);\n                float16x4_t _a1 = vcvt_f16_f32(_a321);\n                float16x4_t _b0 = vcvt_f16_f32(vmlsq_f32(_beta0, _mean0, _a320));\n                float16x4_t _b1 = vcvt_f16_f32(vmlsq_f32(_beta1, _mean1, _a321));\n\n                _a = 
vcombine_f16(_a0, _a1);\n                _b = vcombine_f16(_b0, _b1);\n            }\n            else\n            {\n                float16x4_t _a0 = vcvt_f16_f32(_reciprocal0);\n                float16x4_t _a1 = vcvt_f16_f32(_reciprocal1);\n                float16x4_t _b0 = vcvt_f16_f32(vnegq_f32(vmulq_f32(_mean0, _reciprocal0)));\n                float16x4_t _b1 = vcvt_f16_f32(vnegq_f32(vmulq_f32(_mean1, _reciprocal1)));\n\n                _a = vcombine_f16(_a0, _a1);\n                _b = vcombine_f16(_b0, _b1);\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                float16x8_t _p = vld1q_f16(ptr0);\n                _p = vfmaq_f16(_b, _p, _a);\n                vst1q_f16(ptr0, _p);\n                ptr0 += 8;\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr0 = bottom_top_blob.channel(q);\n\n            float32x4_t _div_size = vdupq_n_f32(1.f / size);\n\n            // mean and var\n            float32x4_t _sum = vdupq_n_f32(0.f);\n            float32x4_t _sqsum = vdupq_n_f32(0.f);\n            const __fp16* ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                _sum = vaddq_f32(_sum, _p);\n                ptr += 4;\n                //sqsum += ptr[i] * ptr[i];\n            }\n            float32x4_t _mean = vmulq_f32(_sum, _div_size);\n            ptr = ptr0;\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                float32x4_t _tmp = vsubq_f32(_p, _mean);\n                _sqsum = vfmaq_f32(_sqsum, _tmp, _tmp);\n                ptr += 4;\n            }\n            float32x4_t _var_eps = vfmaq_f32(vdupq_n_f32(eps), _sqsum, _div_size);\n            // the var maybe minus 
due to accuracy\n            //float var = sqsum / size - mean * mean;\n\n            float32x4_t _reciprocal = vrsqrteq_f32(_var_eps);\n            _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps, _reciprocal), _reciprocal), _reciprocal);\n            // _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var_eps, _reciprocal), _reciprocal), _reciprocal);\n\n            float16x4_t _a;\n            float16x4_t _b;\n            if (affine)\n            {\n                float32x4_t _gamma = vld1q_f32((const float*)gamma_data + q * 4);\n                float32x4_t _beta = vld1q_f32((const float*)beta_data + q * 4);\n\n                float32x4_t _a32 = vmulq_f32(_gamma, _reciprocal);\n                _a = vcvt_f16_f32(_a32);\n                _b = vcvt_f16_f32(vmlsq_f32(_beta, _mean, _a32));\n            }\n            else\n            {\n                _a = vcvt_f16_f32(_reciprocal);\n                _b = vcvt_f16_f32(vnegq_f32(vmulq_f32(_mean, _reciprocal)));\n            }\n\n            for (int i = 0; i < size; i++)\n            {\n                float16x4_t _p = vld1_f16(ptr0);\n                _p = vfma_f16(_b, _p, _a);\n                vst1_f16(ptr0, _p);\n                ptr0 += 4;\n            }\n        }\n\n        return 0;\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr0 = bottom_top_blob.channel(q);\n\n        // mean and var\n        float sum = 0.f;\n        float sqsum = 0.f;\n        const __fp16* ptr = ptr0;\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _sum = vdupq_n_f32(0.f);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _sum = vaddq_f32(_sum, _p);\n            ptr += 4;\n        }\n        sum = vaddvq_f32(_sum);\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            sum += *ptr++;\n            //sqsum += ptr[i] * ptr[i];\n        }\n        
float mean = sum / size;\n        ptr = ptr0;\n        i = 0;\n#if __ARM_NEON\n        float32x4_t _sqsum = vdupq_n_f32(0.f);\n        float32x4_t _mean = vdupq_n_f32(mean);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            float32x4_t _tmp = vsubq_f32(_p, _mean);\n            _sqsum = vmlaq_f32(_sqsum, _tmp, _tmp);\n            ptr += 4;\n        }\n        sqsum = vaddvq_f32(_sqsum);\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float tmp = (float)*ptr - mean;\n            sqsum += tmp * tmp;\n            ptr++;\n        }\n        float var = sqsum / size;\n        // the var maybe minus due to accuracy\n        //float var = sqsum / size - mean * mean;\n\n        __fp16 a;\n        __fp16 b;\n        if (affine)\n        {\n            float gamma = gamma_data[q];\n            float beta = beta_data[q];\n\n            float a_fp32 = gamma / (sqrtf(var + eps));\n            a = (__fp16)(a_fp32);\n            b = (__fp16)(-mean * a_fp32 + beta);\n        }\n        else\n        {\n            float a_fp32 = 1.f / (sqrtf(var + eps));\n            a = (__fp16)(a_fp32);\n            b = (__fp16)(-mean * a_fp32);\n        }\n\n        i = 0;\n#if __ARM_NEON\n        float16x8_t _a = vdupq_n_f16(a);\n        float16x8_t _b = vdupq_n_f16(b);\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr0);\n            _p = vfmaq_f16(_b, _p, _a);\n            vst1q_f16(ptr0, _p);\n            ptr0 += 8;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr0 = *ptr0 * a + b;\n            ptr0++;\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/interp_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"interp_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"interp_bicubic.h\"\n#include \"interp_bilinear.h\"\n\n#if NCNN_BF16\n#include \"interp_bicubic_bf16s.h\"\n#include \"interp_bilinear_bf16s.h\"\n#endif\n\n#if __ARM_NEON\n#include \"interp_bicubic_pack4.h\"\n#include \"interp_bilinear_pack4.h\"\n#if NCNN_BF16\n#include \"interp_bicubic_pack4_bf16s.h\"\n#include \"interp_bilinear_pack4_bf16s.h\"\n#endif\n#endif\n\nInterp_arm::Interp_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Interp_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blobs, top_blobs, opt);\n        else\n            return forward_fp16s(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    int h = bottom_blob.h;\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = reference_blob.w;\n    int outh = reference_blob.h;\n\n    if (!size_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(bottom_blobs.size());\n        for 
(size_t i = 0; i < bottom_blobs.size(); i++)\n        {\n            bottom_blob_shapes[i] = bottom_blobs[i].shape();\n        }\n        eval_size_expr(bottom_blob_shapes, outw, outh);\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(outw, outh, w, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < w; q++)\n            {\n                Mat top_blob_c = top_blob.channel(q);\n                float32x4_t _v = vld1q_f32((const float*)bottom_blob + q * 4);\n                top_blob_c.fill(_v);\n            }\n\n            return 0;\n        }\n#endif // __ARM_NEON\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < w; q++)\n        {\n            Mat top_blob_c = top_blob.channel(q);\n            const float v = bottom_blob[q];\n            top_blob_c.fill(v);\n        }\n\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        if (outw == w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(outw, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            if (resize_type == 1) // nearest\n            {\n                const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const float* ptr = bottom_blob.row(y);\n                    float* outptr = top_blob.row(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        float32x4_t _p = vld1q_f32(ptr + in_x * 4);\n                        vst1q_f32(outptr, _p);\n\n                        outptr += 4;\n                    }\n                }\n            }\n\n            if (resize_type == 2) // bilinear\n            {\n                int* buf = new int[outw + outw * 2];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const float* ptr = bottom_blob.row(y);\n                    float* outptr = top_blob.row(y);\n                    const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const float* Sp = ptr + sx;\n\n                        float32x2_t _a01 = vld1_f32(alphap);\n\n                        float32x4_t _S0 = vld1q_f32(Sp);\n                        float32x4_t _S1 = vld1q_f32(Sp + 4);\n                        float32x4_t _p = vmulq_lane_f32(_S0, _a01, 0);\n                        _p = vmlaq_lane_f32(_p, _S1, _a01, 1);\n                        vst1q_f32(outptr, _p);\n\n                        alphap += 2;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            if (resize_type == 3) // bicubic\n        
    {\n                int* buf = new int[outw + outw * 4];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const float* ptr = bottom_blob.row(y);\n                    float* outptr = top_blob.row(y);\n                    const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const float* Sp = ptr + sx;\n\n                        float32x4_t _a0123 = vld1q_f32(alphap);\n\n                        float32x4_t _S0 = vld1q_f32(Sp - 4);\n                        float32x4_t _S1 = vld1q_f32(Sp + 0);\n                        float32x4_t _S2 = vld1q_f32(Sp + 4);\n                        float32x4_t _S3 = vld1q_f32(Sp + 8);\n                        float32x4_t _p = vmulq_lane_f32(_S0, vget_low_f32(_a0123), 0);\n                        _p = vmlaq_lane_f32(_p, _S1, vget_low_f32(_a0123), 1);\n                        _p = vmlaq_lane_f32(_p, _S2, vget_high_f32(_a0123), 0);\n                        _p = vmlaq_lane_f32(_p, _S3, vget_high_f32(_a0123), 1);\n                        vst1q_f32(outptr, _p);\n\n                        alphap += 4;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            return 0;\n        }\n#endif // __ARM_NEON\n\n        if (resize_type == 1) // nearest\n        {\n            const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outw * 2];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const float* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    *outptr++ = Sp[0] * a0 + Sp[1] * a1;\n                    alphap += 2;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outw * 4];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < 
outw; x++)\n                {\n                    int sx = xofs[x];\n                    const float* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    float a2 = alphap[2];\n                    float a3 = alphap[3];\n                    *outptr++ = Sp[-1] * a0 + Sp[0] * a1 + Sp[1] * a2 + Sp[2] * a3;\n                    alphap += 4;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (outw == w && outh == h)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (resize_type == 1) // nearest\n        {\n            const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n            const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    int in_y = std::min((int)(y * hs), (h - 1));\n\n                    const float* ptr = src.row(in_y);\n                    float* outptr = dst.row(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        float32x4_t _p = vld1q_f32(ptr + in_x * 4);\n                        vst1q_f32(outptr, _p);\n\n                        outptr += 4;\n                    }\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n            int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n            float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n            linear_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bilinear_image_pack4(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n            int* xofs = buf;        //new int[outw];\n            
int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n            float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n            cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bicubic_image_pack4(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (resize_type == 1) // nearest\n    {\n        const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n        const float ws = (output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            for (int y = 0; y < outh; y++)\n            {\n                int in_y = std::min((int)(y * hs), (h - 1));\n\n                const float* ptr = src.row(in_y);\n                float* outptr = dst.row(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n    }\n\n    if (resize_type == 2) // bilinear\n    {\n        int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n   
     float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n        linear_coeffs(w, outw, xofs, alpha, align_corner);\n        linear_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bilinear_image(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    if (resize_type == 3) // bicubic\n    {\n        int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n        float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n        cubic_coeffs(w, outw, xofs, alpha, align_corner);\n        cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bicubic_image(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Interp_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int h = bottom_blob.h;\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = reference_blob.w;\n    int outh = reference_blob.h;\n\n    if (!size_expr.empty())\n    {\n        
std::vector<Mat> bottom_blob_shapes(bottom_blobs.size());\n        for (size_t i = 0; i < bottom_blobs.size(); i++)\n        {\n            bottom_blob_shapes[i] = bottom_blobs[i].shape();\n        }\n        eval_size_expr(bottom_blob_shapes, outw, outh);\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(outw, outh, w, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < w; q++)\n            {\n                Mat top_blob_c = top_blob.channel(q);\n                uint16x4_t _v = vld1_u16((const unsigned short*)bottom_blob + q * 4);\n                top_blob_c.fill(_v);\n            }\n\n            return 0;\n        }\n#endif // __ARM_NEON\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < w; q++)\n        {\n            Mat top_blob_c = top_blob.channel(q);\n            const unsigned short* ptr = bottom_blob;\n            top_blob_c.fill(ptr[q]);\n        }\n\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        if (outw == w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(outw, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            if (resize_type == 1) // nearest\n            {\n                const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const unsigned short* ptr = bottom_blob.row<const unsigned short>(y);\n                    unsigned short* outptr = top_blob.row<unsigned short>(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        uint16x4_t _p = vld1_u16(ptr + in_x * 4);\n                        vst1_u16(outptr, _p);\n\n                        outptr += 4;\n                    }\n                }\n            }\n\n            if (resize_type == 2) // bilinear\n            {\n                int* buf = new int[outw + outw * 2];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const unsigned short* ptr = bottom_blob.row<const unsigned short>(y);\n                    unsigned short* outptr = top_blob.row<unsigned short>(y);\n                    const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const unsigned short* Sp = ptr + sx;\n\n                        float32x2_t _a01 = vld1_f32(alphap);\n\n                        float32x4_t _S0 = bfloat2float(vld1_u16(Sp));\n                        float32x4_t _S1 = bfloat2float(vld1_u16(Sp + 4));\n                        float32x4_t _p = vmulq_lane_f32(_S0, _a01, 0);\n                        _p = vmlaq_lane_f32(_p, _S1, _a01, 1);\n                        vst1_u16(outptr, float2bfloat(_p));\n\n                        alphap += 2;\n                        
outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            if (resize_type == 3) // bicubic\n            {\n                int* buf = new int[outw + outw * 4];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const unsigned short* ptr = bottom_blob.row<const unsigned short>(y);\n                    unsigned short* outptr = top_blob.row<unsigned short>(y);\n                    const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const unsigned short* Sp = ptr + sx;\n\n                        float32x4_t _a0123 = vld1q_f32(alphap);\n\n                        float32x4_t _S0 = bfloat2float(vld1_u16(Sp - 4));\n                        float32x4_t _S1 = bfloat2float(vld1_u16(Sp + 0));\n                        float32x4_t _S2 = bfloat2float(vld1_u16(Sp + 4));\n                        float32x4_t _S3 = bfloat2float(vld1_u16(Sp + 8));\n                        float32x4_t _p = vmulq_lane_f32(_S0, vget_low_f32(_a0123), 0);\n                        _p = vmlaq_lane_f32(_p, _S1, vget_low_f32(_a0123), 1);\n                        _p = vmlaq_lane_f32(_p, _S2, vget_high_f32(_a0123), 0);\n                        _p = vmlaq_lane_f32(_p, _S3, vget_high_f32(_a0123), 1);\n                        vst1_u16(outptr, float2bfloat(_p));\n\n                        alphap += 4;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            return 0;\n        }\n#endif // __ARM_NEON\n\n        if (resize_type == 1) // nearest\n        {\n            const float ws = 
(output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const unsigned short* ptr = bottom_blob.row<const unsigned short>(y);\n                unsigned short* outptr = top_blob.row<unsigned short>(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outw * 2];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const unsigned short* ptr = bottom_blob.row<const unsigned short>(y);\n                unsigned short* outptr = top_blob.row<unsigned short>(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const unsigned short* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    *outptr++ = float32_to_bfloat16(bfloat16_to_float32(Sp[0]) * a0 + bfloat16_to_float32(Sp[1]) * a1);\n                    alphap += 2;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outw * 4];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 
0; y < h; y++)\n            {\n                const unsigned short* ptr = bottom_blob.row<const unsigned short>(y);\n                unsigned short* outptr = top_blob.row<unsigned short>(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const unsigned short* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    float a2 = alphap[2];\n                    float a3 = alphap[3];\n                    *outptr++ = float32_to_bfloat16(bfloat16_to_float32(Sp[-1]) * a0 + bfloat16_to_float32(Sp[0]) * a1 + bfloat16_to_float32(Sp[1]) * a2 + bfloat16_to_float32(Sp[2]) * a3);\n                    alphap += 4;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (outw == w && outh == h)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (resize_type == 1) // nearest\n        {\n            const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n            const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    int in_y = std::min((int)(y * hs), (h - 1));\n\n                    const unsigned short* ptr = src.row<const unsigned short>(in_y);\n                    unsigned short* outptr = dst.row<unsigned short>(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        uint16x4_t _p = vld1_u16(ptr + in_x * 4);\n                        vst1_u16(outptr, _p);\n\n                        outptr += 4;\n                    }\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n            int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n            float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n            linear_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bilinear_image_pack4_bf16s(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n        
    int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n            float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n            cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bicubic_image_pack4_bf16s(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (resize_type == 1) // nearest\n    {\n        const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n        const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            for (int y = 0; y < outh; y++)\n            {\n                int in_y = std::min((int)(y * hs), (h - 1));\n\n                const unsigned short* ptr = src.row<const unsigned short>(in_y);\n                unsigned short* outptr = dst.row<unsigned short>(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n    }\n\n    if (resize_type == 2) // bilinear\n    {\n        int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n        float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n        linear_coeffs(w, outw, xofs, alpha, align_corner);\n        linear_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bilinear_image_bf16s(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    if (resize_type == 3) // bicubic\n    {\n        int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n        float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n  
      cubic_coeffs(w, outw, xofs, alpha, align_corner);\n        cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bicubic_image_bf16s(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/interp_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INTERP_ARM_H\n#define LAYER_INTERP_ARM_H\n\n#include \"interp.h\"\n\nnamespace ncnn {\n\nclass Interp_arm : public Interp\n{\npublic:\n    Interp_arm();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n    int forward_fp16sa(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INTERP_ARM_H\n"
  },
  {
    "path": "src/layer/arm/interp_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"interp_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#include \"interp_bicubic.h\"\n#include \"interp_bilinear.h\"\n\n#if __ARM_NEON\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"interp_bicubic_fp16s.h\"\n#include \"interp_bicubic_pack4_fp16s.h\"\n#include \"interp_bicubic_pack8_fp16s.h\"\n#include \"interp_bilinear_fp16s.h\"\n#include \"interp_bilinear_pack4_fp16s.h\"\n#include \"interp_bilinear_pack8_fp16s.h\"\n#endif\n#endif\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Interp_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int h = bottom_blob.h;\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = reference_blob.w;\n    int outh = reference_blob.h;\n\n    if (!size_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(bottom_blobs.size());\n        for (size_t i = 0; i < bottom_blobs.size(); i++)\n        {\n            bottom_blob_shapes[i] = bottom_blobs[i].shape();\n        }\n        eval_size_expr(bottom_blob_shapes, outw, outh);\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(outw, outh, w, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < w; q++)\n            {\n                Mat top_blob_c = top_blob.channel(q);\n                float16x4_t _v = vld1_f16((const __fp16*)bottom_blob + q * 4);\n                
top_blob_c.fill(_v);\n            }\n\n            return 0;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < w; q++)\n        {\n            Mat top_blob_c = top_blob.channel(q);\n            const __fp16* ptr = bottom_blob;\n            top_blob_c.fill(ptr[q]);\n        }\n\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        if (outw == w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(outw, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 4)\n        {\n            if (resize_type == 1) // nearest\n            {\n                const float ws = (output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                    __fp16* outptr = top_blob.row<__fp16>(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        float16x4_t _p = vld1_f16(ptr + in_x * 4);\n                        vst1_f16(outptr, _p);\n\n                        outptr += 4;\n                    }\n                }\n            }\n\n            if (resize_type == 2) // bilinear\n            {\n                int* buf = new int[outw + outw * 2];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                    __fp16* outptr 
= top_blob.row<__fp16>(y);\n                    const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const __fp16* Sp = ptr + sx;\n\n                        float32x2_t _a01 = vld1_f32(alphap);\n\n                        float32x4_t _S0 = vcvt_f32_f16(vld1_f16(Sp));\n                        float32x4_t _S1 = vcvt_f32_f16(vld1_f16(Sp + 4));\n                        float32x4_t _p = vmulq_lane_f32(_S0, _a01, 0);\n                        _p = vmlaq_lane_f32(_p, _S1, _a01, 1);\n                        vst1_f16(outptr, vcvt_f16_f32(_p));\n\n                        alphap += 2;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            if (resize_type == 3) // bicubic\n            {\n                int* buf = new int[outw + outw * 4];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                    __fp16* outptr = top_blob.row<__fp16>(y);\n                    const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const __fp16* Sp = ptr + sx;\n\n                        float32x4_t _a0123 = vld1q_f32(alphap);\n\n                        float32x4_t _S0 = vcvt_f32_f16(vld1_f16(Sp - 4));\n                        float32x4_t _S1 = vcvt_f32_f16(vld1_f16(Sp + 0));\n                        float32x4_t _S2 = vcvt_f32_f16(vld1_f16(Sp + 4));\n                        float32x4_t _S3 = vcvt_f32_f16(vld1_f16(Sp + 8));\n                   
     float32x4_t _p = vmulq_laneq_f32(_S0, _a0123, 0);\n                        _p = vfmaq_laneq_f32(_p, _S1, _a0123, 1);\n                        _p = vfmaq_laneq_f32(_p, _S2, _a0123, 2);\n                        _p = vfmaq_laneq_f32(_p, _S3, _a0123, 3);\n                        vst1_f16(outptr, vcvt_f16_f32(_p));\n\n                        alphap += 4;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            return 0;\n        }\n\n        if (resize_type == 1) // nearest\n        {\n            const float ws = (output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                __fp16* outptr = top_blob.row<__fp16>(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outw * 2];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                __fp16* outptr = top_blob.row<__fp16>(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const __fp16* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    *outptr++ = 
(__fp16)((float)Sp[0] * a0 + (float)Sp[1] * a1);\n                    alphap += 2;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outw * 4];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                __fp16* outptr = top_blob.row<__fp16>(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const __fp16* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    float a2 = alphap[2];\n                    float a3 = alphap[3];\n                    *outptr++ = (__fp16)((float)Sp[-1] * a0 + (float)Sp[0] * a1 + (float)Sp[1] * a2 + (float)Sp[2] * a3);\n                    alphap += 4;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (outw == w && outh == h)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (elempack == 4)\n    {\n        if (resize_type == 1) // nearest\n        {\n            const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n            const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    int in_y = std::min((int)(y * hs), (h - 1));\n\n                    const __fp16* ptr = src.row<const __fp16>(in_y);\n                    __fp16* outptr = dst.row<__fp16>(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        float16x4_t _p = vld1_f16(ptr + in_x * 4);\n                        vst1_f16(outptr, _p);\n\n                        outptr += 4;\n                    }\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n            int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n            float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n            linear_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bilinear_image_pack4_fp16s(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n            int* xofs = buf;        
//new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n            float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n            cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bicubic_image_pack4_fp16s(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (resize_type == 1) // nearest\n    {\n        const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n        const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            for (int y = 0; y < outh; y++)\n            {\n                int in_y = std::min((int)(y * hs), (h - 1));\n\n                const __fp16* ptr = src.row<const __fp16>(in_y);\n                __fp16* outptr = dst.row<__fp16>(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n    }\n\n    if (resize_type == 2) // bilinear\n    {\n        int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n        float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n        linear_coeffs(w, outw, xofs, alpha, align_corner);\n        linear_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bilinear_image_fp16s(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    if (resize_type == 3) // bicubic\n    {\n        int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n        float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n        cubic_coeffs(w, outw, 
xofs, alpha, align_corner);\n        cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bicubic_image_fp16s(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    return 0;\n}\n\nint Interp_arm::forward_fp16sa(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int h = bottom_blob.h;\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = reference_blob.w;\n    int outh = reference_blob.h;\n\n    if (!size_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(bottom_blobs.size());\n        for (size_t i = 0; i < bottom_blobs.size(); i++)\n        {\n            bottom_blob_shapes[i] = bottom_blobs[i].shape();\n        }\n        eval_size_expr(bottom_blob_shapes, outw, outh);\n    }\n\n    if ((elempack == 1 || elempack == 4) && (dims == 1 || resize_type == 1)) // nearest\n    {\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(outw, outh, w, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < w; q++)\n            {\n                Mat top_blob_c = top_blob.channel(q);\n                float16x8_t _v = vld1q_f16((const __fp16*)bottom_blob + q * 8);\n                top_blob_c.fill(_v);\n            }\n\n           
 return 0;\n        }\n\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        if (outw == w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(outw, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 8)\n        {\n            if (resize_type == 1) // nearest\n            {\n                const float ws = (output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                    __fp16* outptr = top_blob.row<__fp16>(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        float16x8_t _p = vld1q_f16(ptr + in_x * 8);\n                        vst1q_f16(outptr, _p);\n\n                        outptr += 8;\n                    }\n                }\n            }\n\n            if (resize_type == 2) // bilinear\n            {\n                int* buf = new int[outw + outw * 2];\n\n                int* xofs = buf;\n                __fp16* alpha = (__fp16*)(buf + outw);\n\n                linear_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                    __fp16* outptr = top_blob.row<__fp16>(y);\n                    const __fp16* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 8;\n                        const __fp16* Sp = ptr + sx;\n\n                        float16x4_t 
_a01 = vld1_f16(alphap);\n\n                        float16x8_t _S0 = vld1q_f16(Sp);\n                        float16x8_t _S1 = vld1q_f16(Sp + 8);\n                        float16x8_t _p = vmulq_lane_f16(_S0, _a01, 0);\n                        _p = vfmaq_lane_f16(_p, _S1, _a01, 1);\n                        vst1q_f16(outptr, _p);\n\n                        alphap += 2;\n                        outptr += 8;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            if (resize_type == 3) // bicubic\n            {\n                int* buf = new int[outw + outw * 4];\n\n                int* xofs = buf;\n                __fp16* alpha = (__fp16*)(buf + outw);\n\n                cubic_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                    __fp16* outptr = top_blob.row<__fp16>(y);\n                    const __fp16* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 8;\n                        const __fp16* Sp = ptr + sx;\n\n                        float16x4_t _a0123 = vld1_f16(alphap);\n\n                        float16x8_t _S0 = vld1q_f16(Sp - 8);\n                        float16x8_t _S1 = vld1q_f16(Sp + 0);\n                        float16x8_t _S2 = vld1q_f16(Sp + 8);\n                        float16x8_t _S3 = vld1q_f16(Sp + 16);\n                        float16x8_t _p = vmulq_lane_f16(_S0, _a0123, 0);\n                        _p = vfmaq_lane_f16(_p, _S1, _a0123, 1);\n                        _p = vfmaq_lane_f16(_p, _S2, _a0123, 2);\n                        _p = vfmaq_lane_f16(_p, _S3, _a0123, 3);\n                        vst1q_f16(outptr, _p);\n\n                        alphap += 4;\n                        outptr += 
8;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            return 0;\n        }\n\n        if (elempack == 4)\n        {\n            if (resize_type == 2) // bilinear\n            {\n                int* buf = new int[outw + outw * 2];\n\n                int* xofs = buf;\n                __fp16* alpha = (__fp16*)(buf + outw);\n\n                linear_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                    __fp16* outptr = top_blob.row<__fp16>(y);\n                    const __fp16* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const __fp16* Sp = ptr + sx;\n\n                        float16x4_t _a01 = vld1_f16(alphap);\n\n                        float16x4_t _S0 = vld1_f16(Sp);\n                        float16x4_t _S1 = vld1_f16(Sp + 4);\n                        float16x4_t _p = vmul_lane_f16(_S0, _a01, 0);\n                        _p = vfma_lane_f16(_p, _S1, _a01, 1);\n                        vst1_f16(outptr, _p);\n\n                        alphap += 2;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            if (resize_type == 3) // bicubic\n            {\n                int* buf = new int[outw + outw * 4];\n\n                int* xofs = buf;\n                __fp16* alpha = (__fp16*)(buf + outw);\n\n                cubic_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n          
          __fp16* outptr = top_blob.row<__fp16>(y);\n                    const __fp16* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const __fp16* Sp = ptr + sx;\n\n                        float16x4_t _a0123 = vld1_f16(alphap);\n\n                        float16x4_t _S0 = vld1_f16(Sp - 4);\n                        float16x4_t _S1 = vld1_f16(Sp + 0);\n                        float16x4_t _S2 = vld1_f16(Sp + 4);\n                        float16x4_t _S3 = vld1_f16(Sp + 8);\n                        float16x4_t _p = vmul_lane_f16(_S0, _a0123, 0);\n                        _p = vfma_lane_f16(_p, _S1, _a0123, 1);\n                        _p = vfma_lane_f16(_p, _S2, _a0123, 2);\n                        _p = vfma_lane_f16(_p, _S3, _a0123, 3);\n                        vst1_f16(outptr, _p);\n\n                        alphap += 4;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            return 0;\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outw * 2];\n\n            int* xofs = buf;\n            __fp16* alpha = (__fp16*)(buf + outw);\n\n            linear_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                __fp16* outptr = top_blob.row<__fp16>(y);\n                const __fp16* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const __fp16* Sp = ptr + sx;\n                    __fp16 a0 = alphap[0];\n                    __fp16 a1 = alphap[1];\n                    *outptr++ = Sp[0] * a0 + Sp[1] * a1;\n                  
  alphap += 2;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outw * 4];\n\n            int* xofs = buf;\n            __fp16* alpha = (__fp16*)(buf + outw);\n\n            cubic_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(y);\n                __fp16* outptr = top_blob.row<__fp16>(y);\n                const __fp16* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const __fp16* Sp = ptr + sx;\n                    __fp16 a0 = alphap[0];\n                    __fp16 a1 = alphap[1];\n                    __fp16 a2 = alphap[2];\n                    __fp16 a3 = alphap[3];\n                    *outptr++ = Sp[-1] * a0 + Sp[0] * a1 + Sp[1] * a2 + Sp[2] * a3;\n                    alphap += 4;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (outw == w && outh == h)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (elempack == 8)\n    {\n        if (resize_type == 1) // nearest\n        {\n            const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n            const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    int in_y = std::min((int)(y * hs), (h - 1));\n\n                    const __fp16* ptr = src.row<const __fp16>(in_y);\n                    __fp16* outptr = dst.row<__fp16>(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        float16x8_t _p = vld1q_f16(ptr + in_x * 8);\n                        vst1q_f16(outptr, _p);\n\n                        outptr += 8;\n                    }\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n            int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            __fp16* alpha = (__fp16*)(buf + outw + outh);           //new __fp16[outw * 2];\n            __fp16* beta = (__fp16*)(buf + outw + outh + outw * 2); //new __fp16[outh * 2];\n\n            linear_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n            linear_coeffs_fp16sa(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bilinear_image_pack8_fp16sa(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n            
int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            __fp16* alpha = (__fp16*)(buf + outw + outh);           //new __fp16[outw * 4];\n            __fp16* beta = (__fp16*)(buf + outw + outh + outw * 4); //new __fp16[outh * 4];\n\n            cubic_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n            cubic_coeffs_fp16sa(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bicubic_image_pack8_fp16sa(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4)\n    {\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n            int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            __fp16* alpha = (__fp16*)(buf + outw + outh);           //new __fp16[outw * 2];\n            __fp16* beta = (__fp16*)(buf + outw + outh + outw * 2); //new __fp16[outh * 2];\n\n            linear_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n            linear_coeffs_fp16sa(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bilinear_image_pack4_fp16sa(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n            int* xofs = buf;        //new int[outw];\n         
   int* yofs = buf + outw; //new int[outh];\n\n            __fp16* alpha = (__fp16*)(buf + outw + outh);           //new __fp16[outw * 4];\n            __fp16* beta = (__fp16*)(buf + outw + outh + outw * 4); //new __fp16[outh * 4];\n\n            cubic_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n            cubic_coeffs_fp16sa(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bicubic_image_pack4_fp16sa(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (resize_type == 2) // bilinear\n    {\n        int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        __fp16* alpha = (__fp16*)(buf + outw + outh);           //new __fp16[outw * 2];\n        __fp16* beta = (__fp16*)(buf + outw + outh + outw * 2); //new __fp16[outh * 2];\n\n        linear_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n        linear_coeffs_fp16sa(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bilinear_image_fp16sa(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    if (resize_type == 3) // bicubic\n    {\n        int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        __fp16* alpha = (__fp16*)(buf + outw + outh);           //new __fp16[outw * 4];\n        __fp16* beta = (__fp16*)(buf + outw 
+ outh + outw * 4); //new __fp16[outh * 4];\n\n        cubic_coeffs_fp16sa(w, outw, xofs, alpha, align_corner);\n        cubic_coeffs_fp16sa(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bicubic_image_fp16sa(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/interp_bicubic.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic inline void interpolate_cubic(float fx, float* coeffs)\n{\n    const float A = -0.75f;\n\n    float fx0 = fx + 1;\n    float fx1 = fx;\n    float fx2 = 1 - fx;\n    // float fx3 = 2 - fx;\n\n    coeffs[0] = A * fx0 * fx0 * fx0 - 5 * A * fx0 * fx0 + 8 * A * fx0 - 4 * A;\n    coeffs[1] = (A + 2) * fx1 * fx1 * fx1 - (A + 3) * fx1 * fx1 + 1;\n    coeffs[2] = (A + 2) * fx2 * fx2 * fx2 - (A + 3) * fx2 * fx2 + 1;\n    coeffs[3] = 1.f - coeffs[0] - coeffs[1] - coeffs[2];\n}\n\nstatic void cubic_coeffs(int w, int outw, int* xofs, float* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = (float)(dx * scale);\n        }\n\n        int sx = static_cast<int>(floor(fx));\n        fx -= sx;\n\n        interpolate_cubic(fx, alpha + dx * 4);\n\n        if (sx <= -1)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = 1.f - alpha[dx * 4 + 3];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 3];\n            alpha[dx * 4 + 2] = 0.f;\n            alpha[dx * 4 + 3] = 0.f;\n        }\n        if (sx == 0)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = alpha[dx * 4 + 0] + alpha[dx * 4 + 1];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 2];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 3];\n            alpha[dx * 4 + 3] = 0.f;\n        }\n        if (sx == w - 2)\n        {\n            sx = w - 3;\n            alpha[dx * 4 + 3] = alpha[dx * 4 + 2] + alpha[dx * 4 + 3];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 1];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 0];\n            alpha[dx * 4 + 0] = 0.f;\n        }\n        if (sx >= w - 1)\n        {\n            sx = w - 3;\n            alpha[dx * 4 + 3] 
= 1.f - alpha[dx * 4 + 0];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 0];\n            alpha[dx * 4 + 1] = 0.f;\n            alpha[dx * 4 + 0] = 0.f;\n        }\n\n        xofs[dx] = sx;\n    }\n}\n\nstatic void resize_bicubic_image(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    Mat rowsbuf2(w);\n    Mat rowsbuf3(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n      
      float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // 
hresize four rows\n            const float* S0 = src.row(sy - 1);\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows0p[dx] = S0p[-1] * a0 + S0p[0] * a1 + S0p[1] * a2 + S0p[2] * a3;\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n        float b2 = beta[2];\n        float b3 = beta[3];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        float* Dp = dst.row(dy);\n        for (int dx = 0; dx < w; dx++)\n        {\n            //             D[x] = rows0[x]*b0 + rows1[x]*b1 + rows2[x]*b2 + rows3[x]*b3;\n            *Dp++ = *rows0p++ * b0 + *rows1p++ * b1 + *rows2p++ * b2 + *rows3p++ * b3;\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bicubic_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bicubic_image_bf16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    Mat rowsbuf2(w);\n    Mat rowsbuf3(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const unsigned short* S3 = src.row<const unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const unsigned short* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows3p[dx] = bfloat16_to_float32(S3p[-1]) * a0 + bfloat16_to_float32(S3p[0]) * a1 + bfloat16_to_float32(S3p[1]) * a2 + bfloat16_to_float32(S3p[2]) * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const unsigned short* S2 = src.row<const unsigned short>(sy + 1);\n            const unsigned short* S3 = src.row<const 
unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const unsigned short* S2p = S2 + sx;\n                const unsigned short* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows2p[dx] = bfloat16_to_float32(S2p[-1]) * a0 + bfloat16_to_float32(S2p[0]) * a1 + bfloat16_to_float32(S2p[1]) * a2 + bfloat16_to_float32(S2p[2]) * a3;\n                rows3p[dx] = bfloat16_to_float32(S3p[-1]) * a0 + bfloat16_to_float32(S3p[0]) * a1 + bfloat16_to_float32(S3p[1]) * a2 + bfloat16_to_float32(S3p[2]) * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const unsigned short* S1 = src.row<const unsigned short>(sy);\n            const unsigned short* S2 = src.row<const unsigned short>(sy + 1);\n            const unsigned short* S3 = src.row<const unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const unsigned short* S1p = S1 + sx;\n                const unsigned short* S2p = S2 + sx;\n                const unsigned short* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n             
   float a3 = alphap[3];\n                rows1p[dx] = bfloat16_to_float32(S1p[-1]) * a0 + bfloat16_to_float32(S1p[0]) * a1 + bfloat16_to_float32(S1p[1]) * a2 + bfloat16_to_float32(S1p[2]) * a3;\n                rows2p[dx] = bfloat16_to_float32(S2p[-1]) * a0 + bfloat16_to_float32(S2p[0]) * a1 + bfloat16_to_float32(S2p[1]) * a2 + bfloat16_to_float32(S2p[2]) * a3;\n                rows3p[dx] = bfloat16_to_float32(S3p[-1]) * a0 + bfloat16_to_float32(S3p[0]) * a1 + bfloat16_to_float32(S3p[1]) * a2 + bfloat16_to_float32(S3p[2]) * a3;\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const unsigned short* S0 = src.row<const unsigned short>(sy - 1);\n            const unsigned short* S1 = src.row<const unsigned short>(sy);\n            const unsigned short* S2 = src.row<const unsigned short>(sy + 1);\n            const unsigned short* S3 = src.row<const unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const unsigned short* S0p = S0 + sx;\n                const unsigned short* S1p = S1 + sx;\n                const unsigned short* S2p = S2 + sx;\n                const unsigned short* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows0p[dx] = bfloat16_to_float32(S0p[-1]) * a0 + bfloat16_to_float32(S0p[0]) * a1 + bfloat16_to_float32(S0p[1]) * a2 + bfloat16_to_float32(S0p[2]) * a3;\n                rows1p[dx] = bfloat16_to_float32(S1p[-1]) * a0 + bfloat16_to_float32(S1p[0]) * a1 + bfloat16_to_float32(S1p[1]) * a2 + bfloat16_to_float32(S1p[2]) * a3;\n                rows2p[dx] = 
bfloat16_to_float32(S2p[-1]) * a0 + bfloat16_to_float32(S2p[0]) * a1 + bfloat16_to_float32(S2p[1]) * a2 + bfloat16_to_float32(S2p[2]) * a3;\n                rows3p[dx] = bfloat16_to_float32(S3p[-1]) * a0 + bfloat16_to_float32(S3p[0]) * a1 + bfloat16_to_float32(S3p[1]) * a2 + bfloat16_to_float32(S3p[2]) * a3;\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n        float b2 = beta[2];\n        float b3 = beta[3];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        unsigned short* Dp = dst.row<unsigned short>(dy);\n        for (int dx = 0; dx < w; dx++)\n        {\n            // D[x] = rows0[x]*b0 + rows1[x]*b1 + rows2[x]*b2 + rows3[x]*b3;\n            *Dp++ = float32_to_bfloat16(*rows0p++ * b0 + *rows1p++ * b1 + *rows2p++ * b2 + *rows3p++ * b3);\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bicubic_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic inline void interpolate_cubic_fp16sa(float fx, __fp16* coeffs)\n{\n    const float A = -0.75f;\n\n    float fx0 = fx + 1;\n    float fx1 = fx;\n    float fx2 = 1 - fx;\n    // float fx3 = 2 - fx;\n\n    coeffs[0] = (__fp16)(A * fx0 * fx0 * fx0 - 5 * A * fx0 * fx0 + 8 * A * fx0 - 4 * A);\n    coeffs[1] = (__fp16)((A + 2) * fx1 * fx1 * fx1 - (A + 3) * fx1 * fx1 + 1);\n    coeffs[2] = (__fp16)((A + 2) * fx2 * fx2 * fx2 - (A + 3) * fx2 * fx2 + 1);\n    coeffs[3] = (__fp16)((__fp16)1.f - coeffs[0] - coeffs[1] - coeffs[2]);\n}\n\nstatic void cubic_coeffs_fp16sa(int w, int outw, int* xofs, __fp16* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = static_cast<float>(dx * scale);\n        }\n\n        int sx = static_cast<int>(floor(fx));\n        fx -= sx;\n\n        interpolate_cubic_fp16sa(fx, alpha + dx * 4);\n\n        if (sx <= -1)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = (__fp16)((__fp16)1.f - alpha[dx * 4 + 3]);\n            alpha[dx * 4 + 1] = (__fp16)alpha[dx * 4 + 3];\n            alpha[dx * 4 + 2] = (__fp16)0.f;\n            alpha[dx * 4 + 3] = (__fp16)0.f;\n        }\n        if (sx == 0)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = (__fp16)(alpha[dx * 4 + 0] + alpha[dx * 4 + 1]);\n            alpha[dx * 4 + 1] = (__fp16)alpha[dx * 4 + 2];\n            alpha[dx * 4 + 2] = (__fp16)alpha[dx * 4 + 3];\n            alpha[dx * 4 + 3] = (__fp16)0.f;\n        }\n        if (sx == w - 2)\n        {\n            sx = w - 3;\n            alpha[dx * 4 + 3] = (__fp16)(alpha[dx * 4 + 2] + alpha[dx * 4 + 3]);\n            alpha[dx * 4 + 2] = (__fp16)alpha[dx * 4 + 1];\n            
alpha[dx * 4 + 1] = (__fp16)alpha[dx * 4 + 0];\n            alpha[dx * 4 + 0] = (__fp16)0.f;\n        }\n        if (sx >= w - 1)\n        {\n            sx = w - 3;\n            alpha[dx * 4 + 3] = (__fp16)((__fp16)1.f - alpha[dx * 4 + 0]);\n            alpha[dx * 4 + 2] = (__fp16)(alpha[dx * 4 + 0]);\n            alpha[dx * 4 + 1] = (__fp16)0.f;\n            alpha[dx * 4 + 0] = (__fp16)0.f;\n        }\n\n        xofs[dx] = sx;\n    }\n}\n\nstatic void resize_bicubic_image_fp16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    Mat rowsbuf2(w);\n    Mat rowsbuf3(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows3p[dx] = (float)S3p[-1] * a0 + (float)S3p[0] * a1 + (float)S3p[1] * a2 + (float)S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            
float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows2p[dx] = (float)S2p[-1] * a0 + (float)S2p[0] * a1 + (float)S2p[1] * a2 + (float)S2p[2] * a3;\n                rows3p[dx] = (float)S3p[-1] * a0 + (float)S3p[0] * a1 + (float)S3p[1] * a2 + (float)S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n     
           float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows1p[dx] = (float)S1p[-1] * a0 + (float)S1p[0] * a1 + (float)S1p[1] * a2 + (float)S1p[2] * a3;\n                rows2p[dx] = (float)S2p[-1] * a0 + (float)S2p[0] * a1 + (float)S2p[1] * a2 + (float)S2p[2] * a3;\n                rows3p[dx] = (float)S3p[-1] * a0 + (float)S3p[0] * a1 + (float)S3p[1] * a2 + (float)S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const __fp16* S0 = src.row<const __fp16>(sy - 1);\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows0p[dx] = (float)S0p[-1] * a0 + (float)S0p[0] * a1 + (float)S0p[1] * a2 + (float)S0p[2] * a3;\n                rows1p[dx] = (float)S1p[-1] * a0 + (float)S1p[0] * a1 + (float)S1p[1] * a2 + (float)S1p[2] * a3;\n                rows2p[dx] = (float)S2p[-1] * a0 + (float)S2p[0] * a1 + (float)S2p[1] * a2 + (float)S2p[2] * a3;\n                rows3p[dx] = (float)S3p[-1] * a0 + (float)S3p[0] * a1 + (float)S3p[1] * a2 + (float)S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n    
    float b1 = beta[1];\n        float b2 = beta[2];\n        float b3 = beta[3];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        __fp16* Dp = dst.row<__fp16>(dy);\n        for (int dx = 0; dx < w; dx++)\n        {\n            // D[x] = rows0[x]*b0 + rows1[x]*b1 + rows2[x]*b2 + rows3[x]*b3;\n            *Dp++ = (__fp16)(*rows0p++ * b0 + *rows1p++ * b1 + *rows2p++ * b2 + *rows3p++ * b3);\n        }\n\n        beta += 4;\n    }\n}\n\nstatic void resize_bicubic_image_fp16sa(const Mat& src, Mat& dst, __fp16* alpha, int* xofs, __fp16* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)2u);\n    Mat rowsbuf1(w, (size_t)2u);\n    Mat rowsbuf2(w, (size_t)2u);\n    Mat rowsbuf3(w, (size_t)2u);\n    __fp16* rows0 = rowsbuf0;\n    __fp16* rows1 = rowsbuf1;\n    __fp16* rows2 = rowsbuf2;\n    __fp16* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            __fp16* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S3p = S3 + sx;\n\n                __fp16 a0 = alphap[0];\n                __fp16 a1 = alphap[1];\n                __fp16 a2 = alphap[2];\n                __fp16 a3 = alphap[3];\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if 
(sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            __fp16* rows0_old = rows0;\n            __fp16* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                __fp16 a0 = alphap[0];\n                __fp16 a1 = alphap[1];\n                __fp16 a2 = alphap[2];\n                __fp16 a3 = alphap[3];\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            __fp16* rows0_old = rows0;\n            __fp16* rows1_old = rows1;\n            __fp16* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows1p = rows1;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n      
          __fp16 a0 = alphap[0];\n                __fp16 a1 = alphap[1];\n                __fp16 a2 = alphap[2];\n                __fp16 a3 = alphap[3];\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const __fp16* S0 = src.row<const __fp16>(sy - 1);\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows0p = rows0;\n            __fp16* rows1p = rows1;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                __fp16 a0 = alphap[0];\n                __fp16 a1 = alphap[1];\n                __fp16 a2 = alphap[2];\n                __fp16 a3 = alphap[3];\n                rows0p[dx] = S0p[-1] * a0 + S0p[0] * a1 + S0p[1] * a2 + S0p[2] * a3;\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        __fp16 b0 = beta[0];\n        __fp16 b1 = beta[1];\n        __fp16 b2 = beta[2];\n        __fp16 b3 = beta[3];\n\n        __fp16* 
rows0p = rows0;\n        __fp16* rows1p = rows1;\n        __fp16* rows2p = rows2;\n        __fp16* rows3p = rows3;\n        __fp16* Dp = dst.row<__fp16>(dy);\n        for (int dx = 0; dx < w; dx++)\n        {\n            // D[x] = rows0[x]*b0 + rows1[x]*b1 + rows2[x]*b2 + rows3[x]*b3;\n            *Dp++ = (*rows0p++ * b0 + *rows1p++ * b1 + *rows2p++ * b2 + *rows3p++ * b3);\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bicubic_pack4.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bicubic_image_pack4(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf2(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf3(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S30 = vld1q_f32(S3p - 4);\n                float32x4_t _S31 = vld1q_f32(S3p + 0);\n                float32x4_t _S32 = vld1q_f32(S3p + 4);\n                float32x4_t _S33 = vld1q_f32(S3p + 8);\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 
2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S20 = vld1q_f32(S2p - 4);\n                float32x4_t _S21 = vld1q_f32(S2p + 0);\n                float32x4_t _S22 = vld1q_f32(S2p + 4);\n                float32x4_t _S23 = vld1q_f32(S2p + 8);\n                float32x4_t _S30 = vld1q_f32(S3p - 4);\n                float32x4_t _S31 = vld1q_f32(S3p + 0);\n                float32x4_t _S32 = vld1q_f32(S3p + 4);\n                float32x4_t _S33 = vld1q_f32(S3p + 8);\n                float32x4_t _rows2 = vmulq_lane_f32(_S20, vget_low_f32(_a0123), 0);\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S21, vget_low_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S22, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S23, vget_high_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy 
== prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S10 = vld1q_f32(S1p - 4);\n                float32x4_t _S11 = vld1q_f32(S1p + 0);\n                float32x4_t _S12 = vld1q_f32(S1p + 4);\n                float32x4_t _S13 = vld1q_f32(S1p + 8);\n                float32x4_t _S20 = vld1q_f32(S2p - 4);\n                float32x4_t _S21 = vld1q_f32(S2p + 0);\n                float32x4_t _S22 = vld1q_f32(S2p + 4);\n                float32x4_t _S23 = vld1q_f32(S2p + 8);\n                float32x4_t _S30 = vld1q_f32(S3p - 4);\n                float32x4_t _S31 = vld1q_f32(S3p + 0);\n                float32x4_t _S32 = vld1q_f32(S3p + 4);\n                float32x4_t _S33 = vld1q_f32(S3p + 8);\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, vget_low_f32(_a0123), 0);\n                float32x4_t _rows2 = vmulq_lane_f32(_S20, vget_low_f32(_a0123), 0);\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, vget_low_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S21, vget_low_f32(_a0123), 1);\n                
_rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S12, vget_high_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S22, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S13, vget_high_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S23, vget_high_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const float* S0 = src.row(sy - 1);\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                // TODO check the generated assembly on armv7\n                float32x4_t _S00 = vld1q_f32(S0p - 4);\n                float32x4_t _S01 = vld1q_f32(S0p + 0);\n                float32x4_t _S02 = vld1q_f32(S0p + 4);\n                float32x4_t _S03 = vld1q_f32(S0p + 8);\n                float32x4_t _S10 = vld1q_f32(S1p - 4);\n                float32x4_t _S11 = vld1q_f32(S1p + 0);\n                float32x4_t _S12 = vld1q_f32(S1p + 
4);\n                float32x4_t _S13 = vld1q_f32(S1p + 8);\n                float32x4_t _S20 = vld1q_f32(S2p - 4);\n                float32x4_t _S21 = vld1q_f32(S2p + 0);\n                float32x4_t _S22 = vld1q_f32(S2p + 4);\n                float32x4_t _S23 = vld1q_f32(S2p + 8);\n                float32x4_t _S30 = vld1q_f32(S3p - 4);\n                float32x4_t _S31 = vld1q_f32(S3p + 0);\n                float32x4_t _S32 = vld1q_f32(S3p + 4);\n                float32x4_t _S33 = vld1q_f32(S3p + 8);\n                float32x4_t _rows0 = vmulq_lane_f32(_S00, vget_low_f32(_a0123), 0);\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, vget_low_f32(_a0123), 0);\n                float32x4_t _rows2 = vmulq_lane_f32(_S20, vget_low_f32(_a0123), 0);\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows0 = vmlaq_lane_f32(_rows0, _S01, vget_low_f32(_a0123), 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, vget_low_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S21, vget_low_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows0 = vmlaq_lane_f32(_rows0, _S02, vget_high_f32(_a0123), 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S12, vget_high_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S22, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows0 = vmlaq_lane_f32(_rows0, _S03, vget_high_f32(_a0123), 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S13, vget_high_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S23, vget_high_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows0p + dx * 4, _rows0);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n            
    vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float32x4_t _b0123 = vld1q_f32(beta);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        float* Dp = dst.row(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n            float32x4_t _rows2 = vld1q_f32(rows2p);\n            float32x4_t _rows3 = vld1q_f32(rows3p);\n            float32x4_t _Dp = vmulq_lane_f32(_rows0, vget_low_f32(_b0123), 0);\n            _Dp = vmlaq_lane_f32(_Dp, _rows1, vget_low_f32(_b0123), 1);\n            _Dp = vmlaq_lane_f32(_Dp, _rows2, vget_high_f32(_b0123), 0);\n            _Dp = vmlaq_lane_f32(_Dp, _rows3, vget_high_f32(_b0123), 1);\n            vst1q_f32(Dp, _Dp);\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n            rows2p += 4;\n            rows3p += 4;\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bicubic_pack4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bicubic_image_pack4_bf16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf2(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf3(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const unsigned short* S3 = src.row<const unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const unsigned short* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S30 = bfloat2float(vld1_u16(S3p - 4));\n                float32x4_t _S31 = bfloat2float(vld1_u16(S3p + 0));\n                float32x4_t _S32 = bfloat2float(vld1_u16(S3p + 4));\n                float32x4_t _S33 = bfloat2float(vld1_u16(S3p + 8));\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows3p + dx * 4, 
_rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const unsigned short* S2 = src.row<const unsigned short>(sy + 1);\n            const unsigned short* S3 = src.row<const unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const unsigned short* S2p = S2 + sx;\n                const unsigned short* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S20 = bfloat2float(vld1_u16(S2p - 4));\n                float32x4_t _S21 = bfloat2float(vld1_u16(S2p + 0));\n                float32x4_t _S22 = bfloat2float(vld1_u16(S2p + 4));\n                float32x4_t _S23 = bfloat2float(vld1_u16(S2p + 8));\n                float32x4_t _S30 = bfloat2float(vld1_u16(S3p - 4));\n                float32x4_t _S31 = bfloat2float(vld1_u16(S3p + 0));\n                float32x4_t _S32 = bfloat2float(vld1_u16(S3p + 4));\n                float32x4_t _S33 = bfloat2float(vld1_u16(S3p + 8));\n                float32x4_t _rows2 = vmulq_lane_f32(_S20, vget_low_f32(_a0123), 0);\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S21, vget_low_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S22, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S23, 
vget_high_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const unsigned short* S1 = src.row<const unsigned short>(sy);\n            const unsigned short* S2 = src.row<const unsigned short>(sy + 1);\n            const unsigned short* S3 = src.row<const unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const unsigned short* S1p = S1 + sx;\n                const unsigned short* S2p = S2 + sx;\n                const unsigned short* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S10 = bfloat2float(vld1_u16(S1p - 4));\n                float32x4_t _S11 = bfloat2float(vld1_u16(S1p + 0));\n                float32x4_t _S12 = bfloat2float(vld1_u16(S1p + 4));\n                float32x4_t _S13 = bfloat2float(vld1_u16(S1p + 8));\n                float32x4_t _S20 = bfloat2float(vld1_u16(S2p - 4));\n                float32x4_t _S21 = bfloat2float(vld1_u16(S2p + 0));\n                float32x4_t _S22 = bfloat2float(vld1_u16(S2p + 4));\n                float32x4_t _S23 = bfloat2float(vld1_u16(S2p + 8));\n                float32x4_t _S30 = bfloat2float(vld1_u16(S3p - 4));\n                float32x4_t _S31 = bfloat2float(vld1_u16(S3p + 0));\n    
            float32x4_t _S32 = bfloat2float(vld1_u16(S3p + 4));\n                float32x4_t _S33 = bfloat2float(vld1_u16(S3p + 8));\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, vget_low_f32(_a0123), 0);\n                float32x4_t _rows2 = vmulq_lane_f32(_S20, vget_low_f32(_a0123), 0);\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, vget_low_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S21, vget_low_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S12, vget_high_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S22, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S13, vget_high_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S23, vget_high_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const unsigned short* S0 = src.row<const unsigned short>(sy - 1);\n            const unsigned short* S1 = src.row<const unsigned short>(sy);\n            const unsigned short* S2 = src.row<const unsigned short>(sy + 1);\n            const unsigned short* S3 = src.row<const unsigned short>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n 
               const unsigned short* S0p = S0 + sx;\n                const unsigned short* S1p = S1 + sx;\n                const unsigned short* S2p = S2 + sx;\n                const unsigned short* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                // TODO check the generated assembly on armv7\n                float32x4_t _S00 = bfloat2float(vld1_u16(S0p - 4));\n                float32x4_t _S01 = bfloat2float(vld1_u16(S0p + 0));\n                float32x4_t _S02 = bfloat2float(vld1_u16(S0p + 4));\n                float32x4_t _S03 = bfloat2float(vld1_u16(S0p + 8));\n                float32x4_t _S10 = bfloat2float(vld1_u16(S1p - 4));\n                float32x4_t _S11 = bfloat2float(vld1_u16(S1p + 0));\n                float32x4_t _S12 = bfloat2float(vld1_u16(S1p + 4));\n                float32x4_t _S13 = bfloat2float(vld1_u16(S1p + 8));\n                float32x4_t _S20 = bfloat2float(vld1_u16(S2p - 4));\n                float32x4_t _S21 = bfloat2float(vld1_u16(S2p + 0));\n                float32x4_t _S22 = bfloat2float(vld1_u16(S2p + 4));\n                float32x4_t _S23 = bfloat2float(vld1_u16(S2p + 8));\n                float32x4_t _S30 = bfloat2float(vld1_u16(S3p - 4));\n                float32x4_t _S31 = bfloat2float(vld1_u16(S3p + 0));\n                float32x4_t _S32 = bfloat2float(vld1_u16(S3p + 4));\n                float32x4_t _S33 = bfloat2float(vld1_u16(S3p + 8));\n                float32x4_t _rows0 = vmulq_lane_f32(_S00, vget_low_f32(_a0123), 0);\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, vget_low_f32(_a0123), 0);\n                float32x4_t _rows2 = vmulq_lane_f32(_S20, vget_low_f32(_a0123), 0);\n                float32x4_t _rows3 = vmulq_lane_f32(_S30, vget_low_f32(_a0123), 0);\n                _rows0 = vmlaq_lane_f32(_rows0, _S01, vget_low_f32(_a0123), 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, vget_low_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S21, 
vget_low_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S31, vget_low_f32(_a0123), 1);\n                _rows0 = vmlaq_lane_f32(_rows0, _S02, vget_high_f32(_a0123), 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S12, vget_high_f32(_a0123), 0);\n                _rows2 = vmlaq_lane_f32(_rows2, _S22, vget_high_f32(_a0123), 0);\n                _rows3 = vmlaq_lane_f32(_rows3, _S32, vget_high_f32(_a0123), 0);\n                _rows0 = vmlaq_lane_f32(_rows0, _S03, vget_high_f32(_a0123), 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S13, vget_high_f32(_a0123), 1);\n                _rows2 = vmlaq_lane_f32(_rows2, _S23, vget_high_f32(_a0123), 1);\n                _rows3 = vmlaq_lane_f32(_rows3, _S33, vget_high_f32(_a0123), 1);\n                vst1q_f32(rows0p + dx * 4, _rows0);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float32x4_t _b0123 = vld1q_f32(beta);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        unsigned short* Dp = dst.row<unsigned short>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n            float32x4_t _rows2 = vld1q_f32(rows2p);\n            float32x4_t _rows3 = vld1q_f32(rows3p);\n            float32x4_t _Dp = vmulq_lane_f32(_rows0, vget_low_f32(_b0123), 0);\n            _Dp = vmlaq_lane_f32(_Dp, _rows1, vget_low_f32(_b0123), 1);\n            _Dp = vmlaq_lane_f32(_Dp, _rows2, vget_high_f32(_b0123), 0);\n            _Dp = vmlaq_lane_f32(_Dp, _rows3, vget_high_f32(_b0123), 1);\n            vst1_u16(Dp, float2bfloat(_Dp));\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p 
+= 4;\n            rows2p += 4;\n            rows3p += 4;\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bicubic_pack4_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bicubic_image_pack4_fp16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf2(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf3(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S30 = vcvt_f32_f16(vld1_f16(S3p - 4));\n                float32x4_t _S31 = vcvt_f32_f16(vld1_f16(S3p + 0));\n                float32x4_t _S32 = vcvt_f32_f16(vld1_f16(S3p + 4));\n                float32x4_t _S33 = vcvt_f32_f16(vld1_f16(S3p + 8));\n                float32x4_t _rows3 = vmulq_laneq_f32(_S30, _a0123, 0);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S31, _a0123, 1);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S32, _a0123, 2);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S33, _a0123, 3);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else 
if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S20 = vcvt_f32_f16(vld1_f16(S2p - 4));\n                float32x4_t _S21 = vcvt_f32_f16(vld1_f16(S2p + 0));\n                float32x4_t _S22 = vcvt_f32_f16(vld1_f16(S2p + 4));\n                float32x4_t _S23 = vcvt_f32_f16(vld1_f16(S2p + 8));\n                float32x4_t _S30 = vcvt_f32_f16(vld1_f16(S3p - 4));\n                float32x4_t _S31 = vcvt_f32_f16(vld1_f16(S3p + 0));\n                float32x4_t _S32 = vcvt_f32_f16(vld1_f16(S3p + 4));\n                float32x4_t _S33 = vcvt_f32_f16(vld1_f16(S3p + 8));\n                float32x4_t _rows2 = vmulq_laneq_f32(_S20, _a0123, 0);\n                float32x4_t _rows3 = vmulq_laneq_f32(_S30, _a0123, 0);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S21, _a0123, 1);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S31, _a0123, 1);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S22, _a0123, 2);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S32, _a0123, 2);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S23, _a0123, 3);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S33, _a0123, 3);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 
4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S10 = vcvt_f32_f16(vld1_f16(S1p - 4));\n                float32x4_t _S11 = vcvt_f32_f16(vld1_f16(S1p + 0));\n                float32x4_t _S12 = vcvt_f32_f16(vld1_f16(S1p + 4));\n                float32x4_t _S13 = vcvt_f32_f16(vld1_f16(S1p + 8));\n                float32x4_t _S20 = vcvt_f32_f16(vld1_f16(S2p - 4));\n                float32x4_t _S21 = vcvt_f32_f16(vld1_f16(S2p + 0));\n                float32x4_t _S22 = vcvt_f32_f16(vld1_f16(S2p + 4));\n                float32x4_t _S23 = vcvt_f32_f16(vld1_f16(S2p + 8));\n                float32x4_t _S30 = vcvt_f32_f16(vld1_f16(S3p - 4));\n                float32x4_t _S31 = vcvt_f32_f16(vld1_f16(S3p + 0));\n                float32x4_t _S32 = vcvt_f32_f16(vld1_f16(S3p + 4));\n                float32x4_t _S33 = vcvt_f32_f16(vld1_f16(S3p + 8));\n                float32x4_t _rows1 = vmulq_laneq_f32(_S10, _a0123, 0);\n                float32x4_t _rows2 = vmulq_laneq_f32(_S20, _a0123, 0);\n                float32x4_t _rows3 = 
vmulq_laneq_f32(_S30, _a0123, 0);\n                _rows1 = vfmaq_laneq_f32(_rows1, _S11, _a0123, 1);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S21, _a0123, 1);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S31, _a0123, 1);\n                _rows1 = vfmaq_laneq_f32(_rows1, _S12, _a0123, 2);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S22, _a0123, 2);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S32, _a0123, 2);\n                _rows1 = vfmaq_laneq_f32(_rows1, _S13, _a0123, 3);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S23, _a0123, 3);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S33, _a0123, 3);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const __fp16* S0 = src.row<const __fp16>(sy - 1);\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float32x4_t _a0123 = vld1q_f32(alphap);\n\n                float32x4_t _S00 = vcvt_f32_f16(vld1_f16(S0p - 4));\n                float32x4_t _S01 = vcvt_f32_f16(vld1_f16(S0p + 0));\n                float32x4_t _S02 = vcvt_f32_f16(vld1_f16(S0p + 4));\n                float32x4_t _S03 = vcvt_f32_f16(vld1_f16(S0p + 8));\n                
float32x4_t _S10 = vcvt_f32_f16(vld1_f16(S1p - 4));\n                float32x4_t _S11 = vcvt_f32_f16(vld1_f16(S1p + 0));\n                float32x4_t _S12 = vcvt_f32_f16(vld1_f16(S1p + 4));\n                float32x4_t _S13 = vcvt_f32_f16(vld1_f16(S1p + 8));\n                float32x4_t _S20 = vcvt_f32_f16(vld1_f16(S2p - 4));\n                float32x4_t _S21 = vcvt_f32_f16(vld1_f16(S2p + 0));\n                float32x4_t _S22 = vcvt_f32_f16(vld1_f16(S2p + 4));\n                float32x4_t _S23 = vcvt_f32_f16(vld1_f16(S2p + 8));\n                float32x4_t _S30 = vcvt_f32_f16(vld1_f16(S3p - 4));\n                float32x4_t _S31 = vcvt_f32_f16(vld1_f16(S3p + 0));\n                float32x4_t _S32 = vcvt_f32_f16(vld1_f16(S3p + 4));\n                float32x4_t _S33 = vcvt_f32_f16(vld1_f16(S3p + 8));\n                float32x4_t _rows0 = vmulq_laneq_f32(_S00, _a0123, 0);\n                float32x4_t _rows1 = vmulq_laneq_f32(_S10, _a0123, 0);\n                float32x4_t _rows2 = vmulq_laneq_f32(_S20, _a0123, 0);\n                float32x4_t _rows3 = vmulq_laneq_f32(_S30, _a0123, 0);\n                _rows0 = vfmaq_laneq_f32(_rows0, _S01, _a0123, 1);\n                _rows1 = vfmaq_laneq_f32(_rows1, _S11, _a0123, 1);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S21, _a0123, 1);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S31, _a0123, 1);\n                _rows0 = vfmaq_laneq_f32(_rows0, _S02, _a0123, 2);\n                _rows1 = vfmaq_laneq_f32(_rows1, _S12, _a0123, 2);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S22, _a0123, 2);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S32, _a0123, 2);\n                _rows0 = vfmaq_laneq_f32(_rows0, _S03, _a0123, 3);\n                _rows1 = vfmaq_laneq_f32(_rows1, _S13, _a0123, 3);\n                _rows2 = vfmaq_laneq_f32(_rows2, _S23, _a0123, 3);\n                _rows3 = vfmaq_laneq_f32(_rows3, _S33, _a0123, 3);\n                vst1q_f32(rows0p + dx * 4, _rows0);\n                
vst1q_f32(rows1p + dx * 4, _rows1);\n                vst1q_f32(rows2p + dx * 4, _rows2);\n                vst1q_f32(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float32x4_t _b0123 = vld1q_f32(beta);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n            float32x4_t _rows2 = vld1q_f32(rows2p);\n            float32x4_t _rows3 = vld1q_f32(rows3p);\n            float32x4_t _Dp = vmulq_laneq_f32(_rows0, _b0123, 0);\n            _Dp = vfmaq_laneq_f32(_Dp, _rows1, _b0123, 1);\n            _Dp = vfmaq_laneq_f32(_Dp, _rows2, _b0123, 2);\n            _Dp = vfmaq_laneq_f32(_Dp, _rows3, _b0123, 3);\n            vst1_f16(Dp, vcvt_f16_f32(_Dp));\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n            rows2p += 4;\n            rows3p += 4;\n        }\n\n        beta += 4;\n    }\n}\n\nstatic void resize_bicubic_image_pack4_fp16sa(const Mat& src, Mat& dst, __fp16* alpha, int* xofs, __fp16* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 2u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 2u, 4);\n    Mat rowsbuf2(w, (size_t)4 * 2u, 4);\n    Mat rowsbuf3(w, (size_t)4 * 2u, 4);\n    __fp16* rows0 = rowsbuf0;\n    __fp16* rows1 = rowsbuf1;\n    __fp16* rows2 = rowsbuf2;\n    __fp16* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            __fp16* rows0_old = rows0;\n            rows0 = 
rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x4_t _S30 = vld1_f16(S3p - 4);\n                float16x4_t _S31 = vld1_f16(S3p + 0);\n                float16x4_t _S32 = vld1_f16(S3p + 4);\n                float16x4_t _S33 = vld1_f16(S3p + 8);\n                float16x4_t _rows3 = vmul_lane_f16(_S30, _a0123, 0);\n                _rows3 = vfma_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows3 = vfma_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows3 = vfma_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1_f16(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            __fp16* rows0_old = rows0;\n            __fp16* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x4_t _S20 = vld1_f16(S2p - 4);\n                float16x4_t _S21 = vld1_f16(S2p + 0);\n                float16x4_t _S22 = vld1_f16(S2p + 4);\n             
   float16x4_t _S23 = vld1_f16(S2p + 8);\n                float16x4_t _S30 = vld1_f16(S3p - 4);\n                float16x4_t _S31 = vld1_f16(S3p + 0);\n                float16x4_t _S32 = vld1_f16(S3p + 4);\n                float16x4_t _S33 = vld1_f16(S3p + 8);\n                float16x4_t _rows2 = vmul_lane_f16(_S20, _a0123, 0);\n                float16x4_t _rows3 = vmul_lane_f16(_S30, _a0123, 0);\n                _rows2 = vfma_lane_f16(_rows2, _S21, _a0123, 1);\n                _rows3 = vfma_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows2 = vfma_lane_f16(_rows2, _S22, _a0123, 2);\n                _rows3 = vfma_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows2 = vfma_lane_f16(_rows2, _S23, _a0123, 3);\n                _rows3 = vfma_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1_f16(rows2p + dx * 4, _rows2);\n                vst1_f16(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            __fp16* rows0_old = rows0;\n            __fp16* rows1_old = rows1;\n            __fp16* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows1p = rows1;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x4_t _S10 = vld1_f16(S1p - 4);\n                
float16x4_t _S11 = vld1_f16(S1p + 0);\n                float16x4_t _S12 = vld1_f16(S1p + 4);\n                float16x4_t _S13 = vld1_f16(S1p + 8);\n                float16x4_t _S20 = vld1_f16(S2p - 4);\n                float16x4_t _S21 = vld1_f16(S2p + 0);\n                float16x4_t _S22 = vld1_f16(S2p + 4);\n                float16x4_t _S23 = vld1_f16(S2p + 8);\n                float16x4_t _S30 = vld1_f16(S3p - 4);\n                float16x4_t _S31 = vld1_f16(S3p + 0);\n                float16x4_t _S32 = vld1_f16(S3p + 4);\n                float16x4_t _S33 = vld1_f16(S3p + 8);\n                float16x4_t _rows1 = vmul_lane_f16(_S10, _a0123, 0);\n                float16x4_t _rows2 = vmul_lane_f16(_S20, _a0123, 0);\n                float16x4_t _rows3 = vmul_lane_f16(_S30, _a0123, 0);\n                _rows1 = vfma_lane_f16(_rows1, _S11, _a0123, 1);\n                _rows2 = vfma_lane_f16(_rows2, _S21, _a0123, 1);\n                _rows3 = vfma_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows1 = vfma_lane_f16(_rows1, _S12, _a0123, 2);\n                _rows2 = vfma_lane_f16(_rows2, _S22, _a0123, 2);\n                _rows3 = vfma_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows1 = vfma_lane_f16(_rows1, _S13, _a0123, 3);\n                _rows2 = vfma_lane_f16(_rows2, _S23, _a0123, 3);\n                _rows3 = vfma_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1_f16(rows1p + dx * 4, _rows1);\n                vst1_f16(rows2p + dx * 4, _rows2);\n                vst1_f16(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const __fp16* S0 = src.row<const __fp16>(sy - 1);\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows0p = rows0;\n    
        __fp16* rows1p = rows1;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x4_t _S00 = vld1_f16(S0p - 4);\n                float16x4_t _S01 = vld1_f16(S0p + 0);\n                float16x4_t _S02 = vld1_f16(S0p + 4);\n                float16x4_t _S03 = vld1_f16(S0p + 8);\n                float16x4_t _S10 = vld1_f16(S1p - 4);\n                float16x4_t _S11 = vld1_f16(S1p + 0);\n                float16x4_t _S12 = vld1_f16(S1p + 4);\n                float16x4_t _S13 = vld1_f16(S1p + 8);\n                float16x4_t _S20 = vld1_f16(S2p - 4);\n                float16x4_t _S21 = vld1_f16(S2p + 0);\n                float16x4_t _S22 = vld1_f16(S2p + 4);\n                float16x4_t _S23 = vld1_f16(S2p + 8);\n                float16x4_t _S30 = vld1_f16(S3p - 4);\n                float16x4_t _S31 = vld1_f16(S3p + 0);\n                float16x4_t _S32 = vld1_f16(S3p + 4);\n                float16x4_t _S33 = vld1_f16(S3p + 8);\n                float16x4_t _rows0 = vmul_lane_f16(_S00, _a0123, 0);\n                float16x4_t _rows1 = vmul_lane_f16(_S10, _a0123, 0);\n                float16x4_t _rows2 = vmul_lane_f16(_S20, _a0123, 0);\n                float16x4_t _rows3 = vmul_lane_f16(_S30, _a0123, 0);\n                _rows0 = vfma_lane_f16(_rows0, _S01, _a0123, 1);\n                _rows1 = vfma_lane_f16(_rows1, _S11, _a0123, 1);\n                _rows2 = vfma_lane_f16(_rows2, _S21, _a0123, 1);\n                _rows3 = vfma_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows0 = vfma_lane_f16(_rows0, _S02, _a0123, 2);\n                _rows1 = vfma_lane_f16(_rows1, _S12, 
_a0123, 2);\n                _rows2 = vfma_lane_f16(_rows2, _S22, _a0123, 2);\n                _rows3 = vfma_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows0 = vfma_lane_f16(_rows0, _S03, _a0123, 3);\n                _rows1 = vfma_lane_f16(_rows1, _S13, _a0123, 3);\n                _rows2 = vfma_lane_f16(_rows2, _S23, _a0123, 3);\n                _rows3 = vfma_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1_f16(rows0p + dx * 4, _rows0);\n                vst1_f16(rows1p + dx * 4, _rows1);\n                vst1_f16(rows2p + dx * 4, _rows2);\n                vst1_f16(rows3p + dx * 4, _rows3);\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float16x4_t _b0123 = vld1_f16(beta);\n\n        __fp16* rows0p = rows0;\n        __fp16* rows1p = rows1;\n        __fp16* rows2p = rows2;\n        __fp16* rows3p = rows3;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float16x4_t _rows0 = vld1_f16(rows0p);\n            float16x4_t _rows1 = vld1_f16(rows1p);\n            float16x4_t _rows2 = vld1_f16(rows2p);\n            float16x4_t _rows3 = vld1_f16(rows3p);\n            float16x4_t _Dp = vmul_lane_f16(_rows0, _b0123, 0);\n            _Dp = vfma_lane_f16(_Dp, _rows1, _b0123, 1);\n            _Dp = vfma_lane_f16(_Dp, _rows2, _b0123, 2);\n            _Dp = vfma_lane_f16(_Dp, _rows3, _b0123, 3);\n            vst1_f16(Dp, _Dp);\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n            rows2p += 4;\n            rows3p += 4;\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bicubic_pack8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bicubic_image_pack8_fp16sa(const Mat& src, Mat& dst, __fp16* alpha, int* xofs, __fp16* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)8 * 2u, 8);\n    Mat rowsbuf1(w, (size_t)8 * 2u, 8);\n    Mat rowsbuf2(w, (size_t)8 * 2u, 8);\n    Mat rowsbuf3(w, (size_t)8 * 2u, 8);\n    __fp16* rows0 = rowsbuf0;\n    __fp16* rows1 = rowsbuf1;\n    __fp16* rows2 = rowsbuf2;\n    __fp16* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            __fp16* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 8;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x8_t _S30 = vld1q_f16(S3p - 8);\n                float16x8_t _S31 = vld1q_f16(S3p + 0);\n                float16x8_t _S32 = vld1q_f16(S3p + 8);\n                float16x8_t _S33 = vld1q_f16(S3p + 16);\n                float16x8_t _rows3 = vmulq_lane_f16(_S30, _a0123, 0);\n                _rows3 = vfmaq_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows3 = vfmaq_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows3 = vfmaq_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1q_f16(rows3p + dx * 8, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n           
 // hresize two rows\n            __fp16* rows0_old = rows0;\n            __fp16* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 8;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x8_t _S20 = vld1q_f16(S2p - 8);\n                float16x8_t _S21 = vld1q_f16(S2p + 0);\n                float16x8_t _S22 = vld1q_f16(S2p + 8);\n                float16x8_t _S23 = vld1q_f16(S2p + 16);\n                float16x8_t _S30 = vld1q_f16(S3p - 8);\n                float16x8_t _S31 = vld1q_f16(S3p + 0);\n                float16x8_t _S32 = vld1q_f16(S3p + 8);\n                float16x8_t _S33 = vld1q_f16(S3p + 16);\n                float16x8_t _rows2 = vmulq_lane_f16(_S20, _a0123, 0);\n                float16x8_t _rows3 = vmulq_lane_f16(_S30, _a0123, 0);\n                _rows2 = vfmaq_lane_f16(_rows2, _S21, _a0123, 1);\n                _rows3 = vfmaq_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows2 = vfmaq_lane_f16(_rows2, _S22, _a0123, 2);\n                _rows3 = vfmaq_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows2 = vfmaq_lane_f16(_rows2, _S23, _a0123, 3);\n                _rows3 = vfmaq_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1q_f16(rows2p + dx * 8, _rows2);\n                vst1q_f16(rows3p + dx * 8, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            __fp16* rows0_old = 
rows0;\n            __fp16* rows1_old = rows1;\n            __fp16* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows1p = rows1;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 8;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x8_t _S10 = vld1q_f16(S1p - 8);\n                float16x8_t _S11 = vld1q_f16(S1p + 0);\n                float16x8_t _S12 = vld1q_f16(S1p + 8);\n                float16x8_t _S13 = vld1q_f16(S1p + 16);\n                float16x8_t _S20 = vld1q_f16(S2p - 8);\n                float16x8_t _S21 = vld1q_f16(S2p + 0);\n                float16x8_t _S22 = vld1q_f16(S2p + 8);\n                float16x8_t _S23 = vld1q_f16(S2p + 16);\n                float16x8_t _S30 = vld1q_f16(S3p - 8);\n                float16x8_t _S31 = vld1q_f16(S3p + 0);\n                float16x8_t _S32 = vld1q_f16(S3p + 8);\n                float16x8_t _S33 = vld1q_f16(S3p + 16);\n                float16x8_t _rows1 = vmulq_lane_f16(_S10, _a0123, 0);\n                float16x8_t _rows2 = vmulq_lane_f16(_S20, _a0123, 0);\n                float16x8_t _rows3 = vmulq_lane_f16(_S30, _a0123, 0);\n                _rows1 = vfmaq_lane_f16(_rows1, _S11, _a0123, 1);\n                _rows2 = vfmaq_lane_f16(_rows2, _S21, _a0123, 1);\n                _rows3 = vfmaq_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows1 = vfmaq_lane_f16(_rows1, _S12, 
_a0123, 2);\n                _rows2 = vfmaq_lane_f16(_rows2, _S22, _a0123, 2);\n                _rows3 = vfmaq_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows1 = vfmaq_lane_f16(_rows1, _S13, _a0123, 3);\n                _rows2 = vfmaq_lane_f16(_rows2, _S23, _a0123, 3);\n                _rows3 = vfmaq_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1q_f16(rows1p + dx * 8, _rows1);\n                vst1q_f16(rows2p + dx * 8, _rows2);\n                vst1q_f16(rows3p + dx * 8, _rows3);\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const __fp16* S0 = src.row<const __fp16>(sy - 1);\n            const __fp16* S1 = src.row<const __fp16>(sy);\n            const __fp16* S2 = src.row<const __fp16>(sy + 1);\n            const __fp16* S3 = src.row<const __fp16>(sy + 2);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows0p = rows0;\n            __fp16* rows1p = rows1;\n            __fp16* rows2p = rows2;\n            __fp16* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 8;\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n                const __fp16* S2p = S2 + sx;\n                const __fp16* S3p = S3 + sx;\n\n                float16x4_t _a0123 = vld1_f16(alphap);\n\n                float16x8_t _S00 = vld1q_f16(S0p - 8);\n                float16x8_t _S01 = vld1q_f16(S0p + 0);\n                float16x8_t _S02 = vld1q_f16(S0p + 8);\n                float16x8_t _S03 = vld1q_f16(S0p + 16);\n                float16x8_t _S10 = vld1q_f16(S1p - 8);\n                float16x8_t _S11 = vld1q_f16(S1p + 0);\n                float16x8_t _S12 = vld1q_f16(S1p + 8);\n                float16x8_t _S13 = vld1q_f16(S1p + 16);\n                float16x8_t _S20 = vld1q_f16(S2p - 8);\n                float16x8_t _S21 = vld1q_f16(S2p + 0);\n                float16x8_t 
_S22 = vld1q_f16(S2p + 8);\n                float16x8_t _S23 = vld1q_f16(S2p + 16);\n                float16x8_t _S30 = vld1q_f16(S3p - 8);\n                float16x8_t _S31 = vld1q_f16(S3p + 0);\n                float16x8_t _S32 = vld1q_f16(S3p + 8);\n                float16x8_t _S33 = vld1q_f16(S3p + 16);\n                float16x8_t _rows0 = vmulq_lane_f16(_S00, _a0123, 0);\n                float16x8_t _rows1 = vmulq_lane_f16(_S10, _a0123, 0);\n                float16x8_t _rows2 = vmulq_lane_f16(_S20, _a0123, 0);\n                float16x8_t _rows3 = vmulq_lane_f16(_S30, _a0123, 0);\n                _rows0 = vfmaq_lane_f16(_rows0, _S01, _a0123, 1);\n                _rows1 = vfmaq_lane_f16(_rows1, _S11, _a0123, 1);\n                _rows2 = vfmaq_lane_f16(_rows2, _S21, _a0123, 1);\n                _rows3 = vfmaq_lane_f16(_rows3, _S31, _a0123, 1);\n                _rows0 = vfmaq_lane_f16(_rows0, _S02, _a0123, 2);\n                _rows1 = vfmaq_lane_f16(_rows1, _S12, _a0123, 2);\n                _rows2 = vfmaq_lane_f16(_rows2, _S22, _a0123, 2);\n                _rows3 = vfmaq_lane_f16(_rows3, _S32, _a0123, 2);\n                _rows0 = vfmaq_lane_f16(_rows0, _S03, _a0123, 3);\n                _rows1 = vfmaq_lane_f16(_rows1, _S13, _a0123, 3);\n                _rows2 = vfmaq_lane_f16(_rows2, _S23, _a0123, 3);\n                _rows3 = vfmaq_lane_f16(_rows3, _S33, _a0123, 3);\n                vst1q_f16(rows0p + dx * 8, _rows0);\n                vst1q_f16(rows1p + dx * 8, _rows1);\n                vst1q_f16(rows2p + dx * 8, _rows2);\n                vst1q_f16(rows3p + dx * 8, _rows3);\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float16x4_t _b0123 = vld1_f16(beta);\n\n        __fp16* rows0p = rows0;\n        __fp16* rows1p = rows1;\n        __fp16* rows2p = rows2;\n        __fp16* rows3p = rows3;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n        
    float16x8_t _rows0 = vld1q_f16(rows0p);\n            float16x8_t _rows1 = vld1q_f16(rows1p);\n            float16x8_t _rows2 = vld1q_f16(rows2p);\n            float16x8_t _rows3 = vld1q_f16(rows3p);\n            float16x8_t _Dp = vmulq_lane_f16(_rows0, _b0123, 0);\n            _Dp = vfmaq_lane_f16(_Dp, _rows1, _b0123, 1);\n            _Dp = vfmaq_lane_f16(_Dp, _rows2, _b0123, 2);\n            _Dp = vfmaq_lane_f16(_Dp, _rows3, _b0123, 3);\n            vst1q_f16(Dp, _Dp);\n\n            Dp += 8;\n            rows0p += 8;\n            rows1p += 8;\n            rows2p += 8;\n            rows3p += 8;\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bilinear.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void linear_coeffs(int w, int outw, int* xofs, float* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = (float)(dx * scale);\n        }\n\n        int sx = floor(fx);\n        fx -= sx;\n\n        if (sx < 0)\n        {\n            sx = 0;\n            fx = 0.f;\n        }\n        if (sx >= w - 1)\n        {\n            sx = w - 2;\n            fx = 1.f;\n        }\n\n        xofs[dx] = sx;\n\n        alpha[dx * 2] = 1.f - fx;\n        alpha[dx * 2 + 1] = fx;\n    }\n}\n\nstatic void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n#if __ARM_NEON\n            for (; dx + 1 < w; dx += 2)\n            {\n                int sx = xofs[dx];\n                int sxn = xofs[dx + 1];\n                const float* S1p = S1 + sx;\n                const float* S1np = S1 + sxn;\n\n                float32x4_t _a = vld1q_f32(alphap);\n                float32x2_t _S1 = vld1_f32(S1p);\n                float32x2_t 
_S1n = vld1_f32(S1np);\n\n                float32x4_t _S1S1n = vcombine_f32(_S1, _S1n);\n                float32x4_t _ms1 = vmulq_f32(_S1S1n, _a);\n                float32x2_t _rows1 = vpadd_f32(vget_low_f32(_ms1), vget_high_f32(_ms1));\n\n                vst1_f32(rows1p + dx, _rows1);\n\n                alphap += 4;\n            }\n#endif // __ARM_NEON\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const float* S0 = src.row(sy);\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n#if __ARM_NEON\n            for (; dx + 1 < w; dx += 2)\n            {\n                int sx = xofs[dx];\n                int sxn = xofs[dx + 1];\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n                const float* S0np = S0 + sxn;\n                const float* S1np = S1 + sxn;\n\n                float32x4_t _a = vld1q_f32(alphap);\n                float32x2_t _S0 = vld1_f32(S0p);\n                float32x2_t _S1 = vld1_f32(S1p);\n                float32x2_t _S0n = vld1_f32(S0np);\n                float32x2_t _S1n = vld1_f32(S1np);\n\n                float32x4_t _S0S0n = vcombine_f32(_S0, _S0n);\n                float32x4_t _S1S1n = vcombine_f32(_S1, _S1n);\n                float32x4_t _ms0 = vmulq_f32(_S0S0n, _a);\n                float32x4_t _ms1 = vmulq_f32(_S1S1n, _a);\n                float32x2_t _rows0 = vpadd_f32(vget_low_f32(_ms0), vget_high_f32(_ms0));\n                float32x2_t _rows1 = vpadd_f32(vget_low_f32(_ms1), vget_high_f32(_ms1));\n\n       
         vst1_f32(rows0p + dx, _rows0);\n                vst1_f32(rows1p + dx, _rows1);\n\n                alphap += 4;\n            }\n#endif // __ARM_NEON\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows0p[dx] = S0p[0] * a0 + S0p[1] * a1;\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* Dp = dst.row(dy);\n\n#if __ARM_NEON\n        int nn = w >> 3;\n#else\n        int nn = 0;\n#endif\n        int remain = w - (nn << 3);\n\n#if __ARM_NEON\n        float32x4_t _b0 = vdupq_n_f32(b0);\n        float32x4_t _b1 = vdupq_n_f32(b1);\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n\n            float32x4_t _Dp = vmulq_f32(_rows0, _b0);\n            _Dp = vmlaq_f32(_Dp, _rows1, _b1);\n\n            vst1q_f32(Dp, _Dp);\n\n            float32x4_t _rows0n = vld1q_f32(rows0p + 4);\n            float32x4_t _rows1n = vld1q_f32(rows1p + 4);\n\n            float32x4_t _Dpn = vmulq_f32(_rows0n, _b0);\n            _Dpn = vmlaq_f32(_Dpn, _rows1n, _b1);\n\n            vst1q_f32(Dp + 4, _Dpn);\n\n            Dp += 8;\n            rows0p += 8;\n            rows1p += 8;\n        }\n#endif // __ARM_NEON\n        for (; remain; --remain)\n        {\n            //             D[x] = rows0[x]*b0 + rows1[x]*b1;\n            *Dp++ = *rows0p++ * b0 + *rows1p++ * b1;\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bilinear_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bilinear_image_bf16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const unsigned short* S1 = src.row<const unsigned short>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const unsigned short* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows1p[dx] = bfloat16_to_float32(S1p[0]) * a0 + bfloat16_to_float32(S1p[1]) * a1;\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const unsigned short* S0 = src.row<const unsigned short>(sy);\n            const unsigned short* S1 = src.row<const unsigned short>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const unsigned short* S0p = S0 + sx;\n                const unsigned short* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows0p[dx] = bfloat16_to_float32(S0p[0]) * a0 + 
bfloat16_to_float32(S0p[1]) * a1;\n                rows1p[dx] = bfloat16_to_float32(S1p[0]) * a0 + bfloat16_to_float32(S1p[1]) * a1;\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        unsigned short* Dp = dst.row<unsigned short>(dy);\n\n#if __ARM_NEON\n        int nn = w >> 3;\n#else\n        int nn = 0;\n#endif\n        int remain = w - (nn << 3);\n\n#if __ARM_NEON\n        float32x4_t _b0 = vdupq_n_f32(b0);\n        float32x4_t _b1 = vdupq_n_f32(b1);\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n\n            float32x4_t _Dp = vmulq_f32(_rows0, _b0);\n            _Dp = vmlaq_f32(_Dp, _rows1, _b1);\n\n            vst1_u16(Dp, float2bfloat(_Dp));\n\n            float32x4_t _rows0n = vld1q_f32(rows0p + 4);\n            float32x4_t _rows1n = vld1q_f32(rows1p + 4);\n\n            float32x4_t _Dpn = vmulq_f32(_rows0n, _b0);\n            _Dpn = vmlaq_f32(_Dpn, _rows1n, _b1);\n\n            vst1_u16(Dp + 4, float2bfloat(_Dpn));\n\n            Dp += 8;\n            rows0p += 8;\n            rows1p += 8;\n        }\n#endif // __ARM_NEON\n        for (; remain; --remain)\n        {\n            //             D[x] = rows0[x]*b0 + rows1[x]*b1;\n            *Dp++ = float32_to_bfloat16(*rows0p++ * b0 + *rows1p++ * b1);\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bilinear_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void linear_coeffs_fp16sa(int w, int outw, int* xofs, __fp16* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = static_cast<float>(dx * scale);\n        }\n\n        int sx = floor(fx);\n        fx -= sx;\n\n        if (sx < 0)\n        {\n            sx = 0;\n            fx = 0.f;\n        }\n        if (sx >= w - 1)\n        {\n            sx = w - 2;\n            fx = 1.f;\n        }\n\n        xofs[dx] = sx;\n\n        alpha[dx * 2] = (__fp16)(1.f - fx);\n        alpha[dx * 2 + 1] = (__fp16)fx;\n    }\n}\n\nstatic void resize_bilinear_image_fp16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows1p[dx] = (float)S1p[0] * a0 + (float)S1p[1] * a1;\n\n                alphap += 2;\n  
          }\n        }\n        else\n        {\n            // hresize two rows\n            const __fp16* S0 = src.row<const __fp16>(sy);\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows0p[dx] = (float)S0p[0] * a0 + (float)S0p[1] * a1;\n                rows1p[dx] = (float)S1p[0] * a0 + (float)S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        int nn = w >> 3;\n        int remain = w - (nn << 3);\n\n        float32x4_t _b0 = vdupq_n_f32(b0);\n        float32x4_t _b1 = vdupq_n_f32(b1);\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n\n            float32x4_t _Dp = vmulq_f32(_rows0, _b0);\n            _Dp = vfmaq_f32(_Dp, _rows1, _b1);\n\n            vst1_f16(Dp, vcvt_f16_f32(_Dp));\n\n            float32x4_t _rows0n = vld1q_f32(rows0p + 4);\n            float32x4_t _rows1n = vld1q_f32(rows1p + 4);\n\n            float32x4_t _Dn = vmulq_f32(_rows0n, _b0);\n            _Dn = vfmaq_f32(_Dn, _rows1n, _b1);\n\n            vst1_f16(Dp + 4, vcvt_f16_f32(_Dn));\n\n            Dp += 8;\n            rows0p += 8;\n            rows1p += 8;\n        }\n        for (; remain; --remain)\n        {\n            // D[x] = rows0[x]*b0 + rows1[x]*b1;\n            *Dp++ = (__fp16)(*rows0p++ * b0 + *rows1p++ * b1);\n     
   }\n\n        beta += 2;\n    }\n}\n\nstatic void resize_bilinear_image_fp16sa(const Mat& src, Mat& dst, __fp16* alpha, int* xofs, __fp16* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)2u);\n    Mat rowsbuf1(w, (size_t)2u);\n    __fp16* rows0 = rowsbuf0;\n    __fp16* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            __fp16* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S1p = S1 + sx;\n\n                __fp16 a0 = alphap[0];\n                __fp16 a1 = alphap[1];\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const __fp16* S0 = src.row<const __fp16>(sy);\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows0p = rows0;\n            __fp16* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n\n                __fp16 a0 = alphap[0];\n                __fp16 a1 = alphap[1];\n                rows0p[dx] = S0p[0] * a0 + S0p[1] * a1;\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n   
     // vresize\n        __fp16 b0 = beta[0];\n        __fp16 b1 = beta[1];\n\n        __fp16* rows0p = rows0;\n        __fp16* rows1p = rows1;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        int nn = w >> 3;\n        int remain = w - (nn << 3);\n\n        float16x8_t _b0 = vdupq_n_f16(b0);\n        float16x8_t _b1 = vdupq_n_f16(b1);\n        for (; nn > 0; nn--)\n        {\n            float16x8_t _rows0 = vld1q_f16(rows0p);\n            float16x8_t _rows1 = vld1q_f16(rows1p);\n\n            float16x8_t _Dp = vmulq_f16(_rows0, _b0);\n            _Dp = vfmaq_f16(_Dp, _rows1, _b1);\n\n            vst1q_f16(Dp, _Dp);\n\n            Dp += 8;\n            rows0p += 8;\n            rows1p += 8;\n        }\n        for (; remain; --remain)\n        {\n            // D[x] = rows0[x]*b0 + rows1[x]*b1;\n            *Dp++ = (__fp16)(*rows0p++ * b0 + *rows1p++ * b1);\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bilinear_pack4.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bilinear_image_pack4(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S1p = S1 + sx;\n\n                float32x2_t _a01 = vld1_f32(alphap);\n\n                float32x4_t _S10 = vld1q_f32(S1p);\n                float32x4_t _S11 = vld1q_f32(S1p + 4);\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, _a01, 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, _a01, 1);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const float* S0 = src.row(sy);\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n\n                float32x2_t _a01 = vld1_f32(alphap);\n\n   
             float32x4_t _S00 = vld1q_f32(S0p);\n                float32x4_t _S01 = vld1q_f32(S0p + 4);\n                float32x4_t _S10 = vld1q_f32(S1p);\n                float32x4_t _S11 = vld1q_f32(S1p + 4);\n                float32x4_t _rows0 = vmulq_lane_f32(_S00, _a01, 0);\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, _a01, 0);\n                _rows0 = vmlaq_lane_f32(_rows0, _S01, _a01, 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, _a01, 1);\n                vst1q_f32(rows0p + dx * 4, _rows0);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float32x2_t _b01 = vld1_f32(beta);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* Dp = dst.row(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n            float32x4_t _Dp = vmulq_lane_f32(_rows0, _b01, 0);\n            _Dp = vmlaq_lane_f32(_Dp, _rows1, _b01, 1);\n            vst1q_f32(Dp, _Dp);\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bilinear_pack4_bf16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bilinear_image_pack4_bf16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const unsigned short* S1 = src.row<const unsigned short>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const unsigned short* S1p = S1 + sx;\n\n                float32x2_t _a01 = vld1_f32(alphap);\n\n                float32x4_t _S10 = bfloat2float(vld1_u16(S1p));\n                float32x4_t _S11 = bfloat2float(vld1_u16(S1p + 4));\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, _a01, 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, _a01, 1);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const unsigned short* S0 = src.row<const unsigned short>(sy);\n            const unsigned short* S1 = src.row<const unsigned short>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                
const unsigned short* S0p = S0 + sx;\n                const unsigned short* S1p = S1 + sx;\n\n                float32x2_t _a01 = vld1_f32(alphap);\n\n                float32x4_t _S00 = bfloat2float(vld1_u16(S0p));\n                float32x4_t _S01 = bfloat2float(vld1_u16(S0p + 4));\n                float32x4_t _S10 = bfloat2float(vld1_u16(S1p));\n                float32x4_t _S11 = bfloat2float(vld1_u16(S1p + 4));\n                float32x4_t _rows0 = vmulq_lane_f32(_S00, _a01, 0);\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, _a01, 0);\n                _rows0 = vmlaq_lane_f32(_rows0, _S01, _a01, 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, _a01, 1);\n                vst1q_f32(rows0p + dx * 4, _rows0);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float32x2_t _b01 = vld1_f32(beta);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        unsigned short* Dp = dst.row<unsigned short>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n            float32x4_t _Dp = vmulq_lane_f32(_rows0, _b01, 0);\n            _Dp = vmlaq_lane_f32(_Dp, _rows1, _b01, 1);\n            vst1_u16(Dp, float2bfloat(_Dp));\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bilinear_pack4_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bilinear_image_pack4_fp16s(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S1p = S1 + sx;\n\n                float32x2_t _a01 = vld1_f32(alphap);\n\n                float32x4_t _S10 = vcvt_f32_f16(vld1_f16(S1p));\n                float32x4_t _S11 = vcvt_f32_f16(vld1_f16(S1p + 4));\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, _a01, 0);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, _a01, 1);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const __fp16* S0 = src.row<const __fp16>(sy);\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S0p = S0 + sx;\n                const 
__fp16* S1p = S1 + sx;\n\n                float32x2_t _a01 = vld1_f32(alphap);\n\n                float32x4_t _S00 = vcvt_f32_f16(vld1_f16(S0p));\n                float32x4_t _S01 = vcvt_f32_f16(vld1_f16(S0p + 4));\n                float32x4_t _S10 = vcvt_f32_f16(vld1_f16(S1p));\n                float32x4_t _S11 = vcvt_f32_f16(vld1_f16(S1p + 4));\n                float32x4_t _rows0 = vmulq_lane_f32(_S00, _a01, 0);\n                float32x4_t _rows1 = vmulq_lane_f32(_S10, _a01, 0);\n                _rows0 = vmlaq_lane_f32(_rows0, _S01, _a01, 1);\n                _rows1 = vmlaq_lane_f32(_rows1, _S11, _a01, 1);\n                vst1q_f32(rows0p + dx * 4, _rows0);\n                vst1q_f32(rows1p + dx * 4, _rows1);\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float32x2_t _b01 = vld1_f32(beta);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float32x4_t _rows0 = vld1q_f32(rows0p);\n            float32x4_t _rows1 = vld1q_f32(rows1p);\n            float32x4_t _Dp = vmulq_lane_f32(_rows0, _b01, 0);\n            _Dp = vmlaq_lane_f32(_Dp, _rows1, _b01, 1);\n            vst1_f16(Dp, vcvt_f16_f32(_Dp));\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n        }\n\n        beta += 2;\n    }\n}\n\nstatic void resize_bilinear_image_pack4_fp16sa(const Mat& src, Mat& dst, __fp16* alpha, int* xofs, __fp16* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 2u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 2u, 4);\n    __fp16* rows0 = rowsbuf0;\n    __fp16* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n      
      // hresize one row\n            __fp16* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S1p = S1 + sx;\n\n                float16x4_t _a01 = vld1_f16(alphap);\n\n                float16x4_t _S10 = vld1_f16(S1p);\n                float16x4_t _S11 = vld1_f16(S1p + 4);\n                float16x4_t _rows1 = vmul_lane_f16(_S10, _a01, 0);\n                _rows1 = vfma_lane_f16(_rows1, _S11, _a01, 1);\n                vst1_f16(rows1p + dx * 4, _rows1);\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const __fp16* S0 = src.row<const __fp16>(sy);\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows0p = rows0;\n            __fp16* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + sx;\n\n                float16x4_t _a01 = vld1_f16(alphap);\n\n                float16x4_t _S00 = vld1_f16(S0p);\n                float16x4_t _S01 = vld1_f16(S0p + 4);\n                float16x4_t _S10 = vld1_f16(S1p);\n                float16x4_t _S11 = vld1_f16(S1p + 4);\n                float16x4_t _rows0 = vmul_lane_f16(_S00, _a01, 0);\n                float16x4_t _rows1 = vmul_lane_f16(_S10, _a01, 0);\n                _rows0 = vfma_lane_f16(_rows0, _S01, _a01, 1);\n                _rows1 = vfma_lane_f16(_rows1, _S11, _a01, 1);\n                vst1_f16(rows0p + dx * 4, _rows0);\n                vst1_f16(rows1p + dx * 4, _rows1);\n\n                
alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float16x4_t _b01 = vld1_f16(beta);\n\n        __fp16* rows0p = rows0;\n        __fp16* rows1p = rows1;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float16x4_t _rows0 = vld1_f16(rows0p);\n            float16x4_t _rows1 = vld1_f16(rows1p);\n            float16x4_t _Dp = vmul_lane_f16(_rows0, _b01, 0);\n            _Dp = vfma_lane_f16(_Dp, _rows1, _b01, 1);\n            vst1_f16(Dp, _Dp);\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/interp_bilinear_pack8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bilinear_image_pack8_fp16sa(const Mat& src, Mat& dst, __fp16* alpha, int* xofs, __fp16* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)8 * 2u, 8);\n    Mat rowsbuf1(w, (size_t)8 * 2u, 8);\n    __fp16* rows0 = rowsbuf0;\n    __fp16* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            __fp16* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 8;\n                const __fp16* S1p = S1 + sx;\n\n                float16x4_t _a01 = vld1_f16(alphap);\n\n                float16x8_t _S10 = vld1q_f16(S1p);\n                float16x8_t _S11 = vld1q_f16(S1p + 8);\n                float16x8_t _rows1 = vmulq_lane_f16(_S10, _a01, 0);\n                _rows1 = vfmaq_lane_f16(_rows1, _S11, _a01, 1);\n                vst1q_f16(rows1p + dx * 8, _rows1);\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const __fp16* S0 = src.row<const __fp16>(sy);\n            const __fp16* S1 = src.row<const __fp16>(sy + 1);\n\n            const __fp16* alphap = alpha;\n            __fp16* rows0p = rows0;\n            __fp16* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 8;\n                const __fp16* S0p = S0 + sx;\n                const __fp16* S1p = S1 + 
sx;\n\n                float16x4_t _a01 = vld1_f16(alphap);\n\n                float16x8_t _S00 = vld1q_f16(S0p);\n                float16x8_t _S01 = vld1q_f16(S0p + 8);\n                float16x8_t _S10 = vld1q_f16(S1p);\n                float16x8_t _S11 = vld1q_f16(S1p + 8);\n                float16x8_t _rows0 = vmulq_lane_f16(_S00, _a01, 0);\n                float16x8_t _rows1 = vmulq_lane_f16(_S10, _a01, 0);\n                _rows0 = vfmaq_lane_f16(_rows0, _S01, _a01, 1);\n                _rows1 = vfmaq_lane_f16(_rows1, _S11, _a01, 1);\n                vst1q_f16(rows0p + dx * 8, _rows0);\n                vst1q_f16(rows1p + dx * 8, _rows1);\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float16x4_t _b01 = vld1_f16(beta);\n\n        __fp16* rows0p = rows0;\n        __fp16* rows1p = rows1;\n        __fp16* Dp = dst.row<__fp16>(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            float16x8_t _rows0 = vld1q_f16(rows0p);\n            float16x8_t _rows1 = vld1q_f16(rows1p);\n            float16x8_t _Dp = vmulq_lane_f16(_rows0, _b01, 0);\n            _Dp = vfmaq_lane_f16(_Dp, _rows1, _b01, 1);\n            vst1q_f16(Dp, _Dp);\n\n            Dp += 8;\n            rows0p += 8;\n            rows1p += 8;\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/layernorm_arm.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layernorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nLayerNorm_arm::LayerNorm_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nstatic void layernorm(float* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n#if __ARM_NEON\n    float32x4_t _mean = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float mean = 0.f;\n    {\n        const float* ptr0 = ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            _mean = vaddq_f32(_mean, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            mean += ptr0[0];\n            ptr0++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        _mean = div_ps(_mean, _elemcount);\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        mean += vaddvq_f32(_mean);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_mean), vget_high_f32(_mean));\n        _s2 = vpadd_f32(_s2, _s2);\n        mean += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        mean = mean / elemcount;\n#if __ARM_NEON\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n#if __ARM_NEON\n    float32x4_t _var = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float var = 0.f;\n    {\n        const float* ptr0 = ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 
4)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            _p = vsubq_f32(_p, _mean);\n            _var = vmlaq_f32(_var, _p, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = ptr0[0] - mean;\n            var += v * v;\n            ptr0++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n        _var = div_ps(_var, _elemcount);\n        _var = vaddq_f32(_var, _eps);\n        float32x4_t _rsqrt_var = vrsqrteq_f32(_var);\n        _rsqrt_var = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var, _rsqrt_var), _rsqrt_var), _rsqrt_var);\n        _var = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var, _rsqrt_var), _rsqrt_var), _rsqrt_var);\n        _mean = vmulq_f32(_mean, _var);\n        _mean = vnegq_f32(_mean);\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        var += vaddvq_f32(_var);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_var), vget_high_f32(_var));\n        _s2 = vpadd_f32(_s2, _s2);\n        var += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        var = 1.f / sqrtf(var / elemcount + eps);\n        mean = -mean * var;\n#if __ARM_NEON\n        _var = vdupq_n_f32(var);\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n    if (gamma_ptr && beta_ptr)\n    {\n        int i = 0;\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                float32x4_t _beta = vdupq_n_f32(beta_ptr[0]);\n                _p = vmlaq_f32(_mean, _p, _var);\n                _p = vmlaq_f32(_beta, _p, _gamma);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n                gamma_ptr += 1;\n                beta_ptr 
+= 1;\n            }\n        }\n        if (elempack == 1)\n        {\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                float32x4_t _gamma = vld1q_f32(gamma_ptr);\n                float32x4_t _beta = vld1q_f32(beta_ptr);\n                _p = vmlaq_f32(_mean, _p, _var);\n                _p = vmlaq_f32(_beta, _p, _gamma);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n                gamma_ptr += 4;\n                beta_ptr += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            ptr[0] = (ptr[0] * var + mean) * gamma_ptr[0] + beta_ptr[0];\n            ptr++;\n            gamma_ptr++;\n            beta_ptr++;\n        }\n    }\n    else\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vmlaq_f32(_mean, _p, _var);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            ptr[0] = ptr[0] * var + mean;\n            ptr++;\n        }\n    }\n}\n\nint LayerNorm_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        // assert affine_size == w\n\n        float* ptr = bottom_top_blob;\n        layernorm(ptr, gamma_data, beta_data, eps, w 
* elempack, 1);\n    }\n\n    if (dims == 2)\n    {\n        // assert affine_size == w\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n            layernorm(ptr, gamma_data, beta_data, eps, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        if (affine_size == w)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    float* ptr = bottom_top_blob.channel(q).row(i);\n                    layernorm(ptr, gamma_data, beta_data, eps, w, elempack);\n                }\n            }\n        }\n        else // if (affine_size == w * h)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                float* ptr = bottom_top_blob.channel(q);\n                layernorm(ptr, gamma_data, beta_data, eps, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic void layernorm_bf16s(unsigned short* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n#if __ARM_NEON\n    float32x4_t _mean = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float mean = 0.f;\n    {\n        const unsigned short* ptr0 = ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n            _mean = vaddq_f32(_mean, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            mean += bfloat16_to_float32(ptr0[0]);\n            ptr0++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _elemcount = 
vdupq_n_f32(elemcount);\n        _mean = div_ps(_mean, _elemcount);\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        mean += vaddvq_f32(_mean);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_mean), vget_high_f32(_mean));\n        _s2 = vpadd_f32(_s2, _s2);\n        mean += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        mean = mean / elemcount;\n#if __ARM_NEON\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n#if __ARM_NEON\n    float32x4_t _var = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float var = 0.f;\n    {\n        const unsigned short* ptr0 = ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n            _p = vsubq_f32(_p, _mean);\n            _var = vmlaq_f32(_var, _p, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(ptr0[0]) - mean;\n            var += v * v;\n            ptr0++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n        _var = div_ps(_var, _elemcount);\n        _var = vaddq_f32(_var, _eps);\n        float32x4_t _rsqrt_var = vrsqrteq_f32(_var);\n        _rsqrt_var = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var, _rsqrt_var), _rsqrt_var), _rsqrt_var);\n        _var = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var, _rsqrt_var), _rsqrt_var), _rsqrt_var);\n        _mean = vmulq_f32(_mean, _var);\n        _mean = vnegq_f32(_mean);\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        var += vaddvq_f32(_var);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_var), vget_high_f32(_var));\n        _s2 = vpadd_f32(_s2, _s2);\n        var += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n    
    var = 1.f / sqrtf(var / elemcount + eps);\n        mean = -mean * var;\n#if __ARM_NEON\n        _var = vdupq_n_f32(var);\n        _mean = vdupq_n_f32(mean);\n#endif // __ARM_NEON\n    }\n\n    if (gamma_ptr && beta_ptr)\n    {\n        int i = 0;\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                float32x4_t _beta = vdupq_n_f32(beta_ptr[0]);\n                _p = vmlaq_f32(_mean, _p, _var);\n                _p = vmlaq_f32(_beta, _p, _gamma);\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n                gamma_ptr += 1;\n                beta_ptr += 1;\n            }\n        }\n        if (elempack == 1)\n        {\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                float32x4_t _gamma = vld1q_f32(gamma_ptr);\n                float32x4_t _beta = vld1q_f32(beta_ptr);\n                _p = vmlaq_f32(_mean, _p, _var);\n                _p = vmlaq_f32(_beta, _p, _gamma);\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n                gamma_ptr += 4;\n                beta_ptr += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(ptr[0]);\n            ptr[0] = float32_to_bfloat16((v * var + mean) * gamma_ptr[0] + beta_ptr[0]);\n            ptr++;\n            gamma_ptr++;\n            beta_ptr++;\n        }\n    }\n    else\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = vmlaq_f32(_mean, _p, _var);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < 
size; i++)\n        {\n            float v = bfloat16_to_float32(ptr[0]);\n            ptr[0] = float32_to_bfloat16(v * var + mean);\n            ptr++;\n        }\n    }\n}\n\nint LayerNorm_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        // assert affine_size == w\n\n        unsigned short* ptr = bottom_top_blob;\n        layernorm_bf16s(ptr, gamma_data, beta_data, eps, w * elempack, 1);\n    }\n\n    if (dims == 2)\n    {\n        // assert affine_size == w\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            unsigned short* ptr = bottom_top_blob.row<unsigned short>(i);\n            layernorm_bf16s(ptr, gamma_data, beta_data, eps, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        if (affine_size == w)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    unsigned short* ptr = bottom_top_blob.channel(q).row<unsigned short>(i);\n                    layernorm_bf16s(ptr, gamma_data, beta_data, eps, w, elempack);\n                }\n            }\n        }\n        else // if (affine_size == w * h)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                unsigned short* ptr = bottom_top_blob.channel(q);\n                layernorm_bf16s(ptr, gamma_data, beta_data, eps, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/layernorm_arm.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_LAYERNORM_ARM_H\n#define LAYER_LAYERNORM_ARM_H\n\n#include \"layernorm.h\"\n\nnamespace ncnn {\n\nclass LayerNorm_arm : public LayerNorm\n{\npublic:\n    LayerNorm_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_LAYERNORM_ARM_H\n"
  },
  {
    "path": "src/layer/arm/layernorm_arm_asimdhp.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layernorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic void layernorm_fp16s(__fp16* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n    float32x4_t _mean0 = vdupq_n_f32(0.f);\n    float32x4_t _mean1 = vdupq_n_f32(0.f);\n    float mean = 0.f;\n    {\n        const __fp16* ptr0 = ptr;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr0);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _mean0 = vaddq_f32(_mean0, _p0);\n            _mean1 = vaddq_f32(_mean1, _p1);\n            ptr0 += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr0));\n            _mean0 = vaddq_f32(_mean0, _p);\n            ptr0 += 4;\n        }\n        for (; i < size; i++)\n        {\n            mean += (float)ptr0[0];\n            ptr0++;\n        }\n    }\n\n    if (elempack == 8)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        _mean0 = vdivq_f32(_mean0, _elemcount);\n        _mean1 = vdivq_f32(_mean1, _elemcount);\n    }\n    if (elempack == 4)\n    {\n        _mean0 = vaddq_f32(_mean0, _mean1);\n\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        _mean0 = vdivq_f32(_mean0, _elemcount);\n        _mean1 = _mean0;\n    }\n    if (elempack == 1)\n    {\n        _mean0 = vaddq_f32(_mean0, _mean1);\n        mean += vaddvq_f32(_mean0);\n\n        mean = mean / elemcount;\n        _mean0 = vdupq_n_f32(mean);\n        _mean1 = _mean0;\n    }\n\n    float32x4_t _var0 = vdupq_n_f32(0.f);\n    float32x4_t _var1 = 
vdupq_n_f32(0.f);\n    float var = 0.f;\n    {\n        const __fp16* ptr0 = ptr;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr0);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _p0 = vsubq_f32(_p0, _mean0);\n            _p1 = vsubq_f32(_p1, _mean1);\n            _var0 = vmlaq_f32(_var0, _p0, _p0);\n            _var1 = vmlaq_f32(_var1, _p1, _p1);\n            ptr0 += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr0));\n            _p = vsubq_f32(_p, _mean0);\n            _var0 = vmlaq_f32(_var0, _p, _p);\n            ptr0 += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)ptr0[0] - mean;\n            var += v * v;\n            ptr0++;\n        }\n    }\n\n    if (elempack == 8)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n        _var0 = vdivq_f32(_var0, _elemcount);\n        _var1 = vdivq_f32(_var1, _elemcount);\n        _var0 = vaddq_f32(_var0, _eps);\n        _var1 = vaddq_f32(_var1, _eps);\n        float32x4_t _rsqrt_var0 = vrsqrteq_f32(_var0);\n        float32x4_t _rsqrt_var1 = vrsqrteq_f32(_var1);\n        _rsqrt_var0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var0, _rsqrt_var0), _rsqrt_var0), _rsqrt_var0);\n        _rsqrt_var1 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var1, _rsqrt_var1), _rsqrt_var1), _rsqrt_var1);\n        _var0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var0, _rsqrt_var0), _rsqrt_var0), _rsqrt_var0);\n        _var1 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var1, _rsqrt_var1), _rsqrt_var1), _rsqrt_var1);\n        _mean0 = vmulq_f32(_mean0, _var0);\n        _mean1 = vmulq_f32(_mean1, _var1);\n        _mean0 = vnegq_f32(_mean0);\n        _mean1 = vnegq_f32(_mean1);\n    }\n    if (elempack == 4)\n    {\n        _var0 = vaddq_f32(_var0, 
_var1);\n\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n        _var0 = vdivq_f32(_var0, _elemcount);\n        _var0 = vaddq_f32(_var0, _eps);\n        float32x4_t _rsqrt_var = vrsqrteq_f32(_var0);\n        _rsqrt_var = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var0, _rsqrt_var), _rsqrt_var), _rsqrt_var);\n        _var0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_var0, _rsqrt_var), _rsqrt_var), _rsqrt_var);\n        _var1 = _var0;\n        _mean0 = vmulq_f32(_mean0, _var0);\n        _mean0 = vnegq_f32(_mean0);\n        _mean1 = _mean0;\n    }\n    if (elempack == 1)\n    {\n        _var0 = vaddq_f32(_var0, _var1);\n        var += vaddvq_f32(_var0);\n\n        var = 1.f / sqrtf(var / elemcount + eps);\n        mean = -mean * var;\n        _var0 = vdupq_n_f32(var);\n        _var1 = _var0;\n        _mean0 = vdupq_n_f32(mean);\n        _mean1 = _mean0;\n    }\n\n    if (gamma_ptr && beta_ptr)\n    {\n        int i = 0;\n        if (elempack == 8)\n        {\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                float32x4_t _beta = vdupq_n_f32(beta_ptr[0]);\n                _p0 = vmlaq_f32(_mean0, _p0, _var0);\n                _p1 = vmlaq_f32(_mean1, _p1, _var1);\n                _p0 = vmlaq_f32(_beta, _p0, _gamma);\n                _p1 = vmlaq_f32(_beta, _p1, _gamma);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n                gamma_ptr += 1;\n                beta_ptr += 1;\n            }\n        }\n        if (elempack == 4)\n        {\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                
float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                float32x4_t _gamma0 = vdupq_n_f32(gamma_ptr[0]);\n                float32x4_t _gamma1 = vdupq_n_f32(gamma_ptr[1]);\n                float32x4_t _beta0 = vdupq_n_f32(beta_ptr[0]);\n                float32x4_t _beta1 = vdupq_n_f32(beta_ptr[1]);\n                _p0 = vmlaq_f32(_mean0, _p0, _var0);\n                _p1 = vmlaq_f32(_mean1, _p1, _var1);\n                _p0 = vmlaq_f32(_beta0, _p0, _gamma0);\n                _p1 = vmlaq_f32(_beta1, _p1, _gamma1);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n                gamma_ptr += 2;\n                beta_ptr += 2;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                float32x4_t _beta = vdupq_n_f32(beta_ptr[0]);\n                _p = vmlaq_f32(_mean0, _p, _var0);\n                _p = vmlaq_f32(_beta, _p, _gamma);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n                ptr += 4;\n                gamma_ptr += 1;\n                beta_ptr += 1;\n            }\n        }\n        if (elempack == 1)\n        {\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                float32x4_t _gamma0 = vld1q_f32(gamma_ptr);\n                float32x4_t _gamma1 = vld1q_f32(gamma_ptr + 4);\n                float32x4_t _beta0 = vld1q_f32(beta_ptr);\n                float32x4_t _beta1 = vld1q_f32(beta_ptr + 4);\n                _p0 = vmlaq_f32(_mean0, _p0, _var0);\n                _p1 = vmlaq_f32(_mean1, _p1, _var1);\n                _p0 = 
vmlaq_f32(_beta0, _p0, _gamma0);\n                _p1 = vmlaq_f32(_beta1, _p1, _gamma1);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n                gamma_ptr += 8;\n                beta_ptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                float32x4_t _gamma = vld1q_f32(gamma_ptr);\n                float32x4_t _beta = vld1q_f32(beta_ptr);\n                _p = vmlaq_f32(_mean0, _p, _var0);\n                _p = vmlaq_f32(_beta, _p, _gamma);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n                ptr += 4;\n                gamma_ptr += 4;\n                beta_ptr += 4;\n            }\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)ptr[0];\n            ptr[0] = (__fp16)((v * var + mean) * gamma_ptr[0] + beta_ptr[0]);\n            ptr++;\n            gamma_ptr++;\n            beta_ptr++;\n        }\n    }\n    else\n    {\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _p0 = vmlaq_f32(_mean0, _p0, _var0);\n            _p1 = vmlaq_f32(_mean1, _p1, _var1);\n            _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _p = vmlaq_f32(_mean0, _p, _var0);\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)ptr[0];\n            ptr[0] = (__fp16)(v * var + mean);\n            ptr++;\n        }\n    }\n}\n\nint 
LayerNorm_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        // assert affine_size == w\n\n        __fp16* ptr = bottom_top_blob;\n        layernorm_fp16s(ptr, gamma_data, beta_data, eps, w * elempack, 1);\n    }\n\n    if (dims == 2)\n    {\n        // assert affine_size == w\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n            layernorm_fp16s(ptr, gamma_data, beta_data, eps, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        if (affine_size == w)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    __fp16* ptr = bottom_top_blob.channel(q).row<__fp16>(i);\n                    layernorm_fp16s(ptr, gamma_data, beta_data, eps, w, elempack);\n                }\n            }\n        }\n        else // if (affine_size == w * h)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                __fp16* ptr = bottom_top_blob.channel(q);\n                layernorm_fp16s(ptr, gamma_data, beta_data, eps, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/lrn_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"lrn_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\nint LRN_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    size_t elemsize = bottom_top_blob.elemsize;\n    int size = w * h;\n\n    // squared values with local_size padding\n    Mat square_blob;\n    square_blob.create(w, h, channels, elemsize, opt.workspace_allocator);\n    if (square_blob.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = bottom_top_blob.channel(q);\n        float* outptr = square_blob.channel(q);\n\n#if __ARM_NEON\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _outp = vmulq_f32(_p, _p);\n            vst1q_f32(outptr, _outp);\n\n            ptr += 4;\n            outptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; remain > 0; remain--)\n        {\n            *outptr = *ptr * *ptr;\n\n            ptr++;\n            outptr++;\n        }\n    }\n\n    if (region_type == NormRegion_ACROSS_CHANNELS)\n    {\n        Mat square_sum;\n        square_sum.create(w, h, channels, elemsize, opt.workspace_allocator);\n        if (square_sum.empty())\n            return -100;\n        square_sum.fill(0.f);\n\n        const float alpha_div_size = alpha / local_size;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            // square sum\n            for (int p = q - local_size / 2; p <= q + local_size / 2; 
p++)\n            {\n                if (p < 0 || p >= channels)\n                    continue;\n\n                const float* sptr = square_blob.channel(p);\n                float* ssptr = square_sum.channel(q);\n\n#if __ARM_NEON\n                int nn = size >> 2;\n                int remain = size - (nn << 2);\n#else\n                int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n                for (; nn > 0; nn--)\n                {\n                    float32x4_t _sp = vld1q_f32(sptr);\n                    float32x4_t _ssp = vld1q_f32(ssptr);\n                    _ssp = vaddq_f32(_ssp, _sp);\n                    vst1q_f32(ssptr, _ssp);\n\n                    sptr += 4;\n                    ssptr += 4;\n                }\n#endif // __ARM_NEON\n                for (; remain > 0; remain--)\n                {\n                    *ssptr += *sptr;\n                    sptr++;\n                    ssptr++;\n                }\n            }\n\n            float* ptr = bottom_top_blob.channel(q);\n            float* ssptr = square_sum.channel(q);\n\n#if __ARM_NEON\n            int nn = size >> 2;\n            int remain = size - (nn << 2);\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _bias = vdupq_n_f32(bias);\n            float32x4_t _ads = vdupq_n_f32(alpha_div_size);\n            float32x4_t _mb = vdupq_n_f32(-beta);\n            for (; nn > 0; nn--)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                float32x4_t _ssp = vld1q_f32(ssptr);\n                _ssp = vmulq_f32(_ssp, _ads);\n                _ssp = vaddq_f32(_ssp, _bias);\n                _ssp = pow_ps(_ssp, _mb);\n                _p = vmulq_f32(_p, _ssp);\n                vst1q_f32(ptr, _p);\n\n                ssptr += 4;\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                *ptr = *ptr * powf(bias + alpha_div_size * 
*ssptr, -beta);\n\n                ssptr++;\n                ptr++;\n            }\n        }\n    }\n    else if (region_type == NormRegion_WITHIN_CHANNEL)\n    {\n        int outw = w;\n        int outh = h;\n\n        Mat square_blob_bordered = square_blob;\n        int pad = local_size / 2;\n        if (pad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(square_blob, square_blob_bordered, pad, local_size - pad - 1, pad, local_size - pad - 1, BORDER_CONSTANT, 0.f, opt_b);\n            if (square_blob_bordered.empty())\n                return -100;\n\n            w = square_blob_bordered.w;\n            h = square_blob_bordered.h;\n        }\n\n        const int maxk = local_size * local_size;\n\n        const float alpha_div_size = alpha / maxk;\n\n        // norm window offsets\n        std::vector<int> _space_ofs(maxk);\n        int* space_ofs = &_space_ofs[0];\n        {\n            int p1 = 0;\n            int p2 = 0;\n            int gap = w - local_size;\n            for (int i = 0; i < local_size; i++)\n            {\n                for (int j = 0; j < local_size; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2++;\n                }\n                p2 += gap;\n            }\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n            const Mat m = square_blob_bordered.channel(q);\n\n            for (int i = 0; i < outh; i++)\n            {\n                for (int j = 0; j < outw; j++)\n                {\n                    const float* sptr = m.row(i) + j;\n\n                    float ss = 0.f;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val = sptr[space_ofs[k]];\n                        ss += 
val;\n                    }\n\n                    ptr[j] = ptr[j] * powf(bias + alpha_div_size * ss, -beta);\n                }\n\n                ptr += outw;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/lrn_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_LRN_ARM_H\n#define LAYER_LRN_ARM_H\n\n#include \"lrn.h\"\n\nnamespace ncnn {\n\nclass LRN_arm : public LRN\n{\npublic:\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_LRN_ARM_H\n"
  },
  {
    "path": "src/layer/arm/lstm_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"lstm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"lstm_int8.h\"\n\nLSTM_arm::LSTM_arm()\n{\n#if __ARM_NEON\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint LSTM_arm::create_pipeline(const Option& opt)\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return create_pipeline_int8(opt);\n    }\n#endif\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n    // pack IFOG\n    int num_directions = direction == 2 ? 2 : 1;\n    int size = weight_data_size / num_directions / hidden_size / 4;\n\n    weight_xc_data_packed.create(size, hidden_size, num_directions, 16u, 4);\n    bias_c_data_packed.create(hidden_size, 1, num_directions, 16u, 4);\n    weight_hc_data_packed.create(num_output, hidden_size, num_directions, 16u, 4);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat bias_c = bias_c_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat bias_c_data_packed_dr = bias_c_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        const float* bias_c_I = bias_c.row(0);\n        const float* bias_c_F = bias_c.row(1);\n        const float* bias_c_O = bias_c.row(2);\n        const float* bias_c_G = 
bias_c.row(3);\n\n        float* bias_c_IFOG = bias_c_data_packed_dr.row(0);\n\n        for (int q = 0; q < hidden_size; q++)\n        {\n            bias_c_IFOG[0] = bias_c_I[q];\n            bias_c_IFOG[1] = bias_c_F[q];\n            bias_c_IFOG[2] = bias_c_O[q];\n            bias_c_IFOG[3] = bias_c_G[q];\n\n            bias_c_IFOG += 4;\n\n            const float* weight_xc_I = weight_xc.row(hidden_size * 0 + q);\n            const float* weight_xc_F = weight_xc.row(hidden_size * 1 + q);\n            const float* weight_xc_O = weight_xc.row(hidden_size * 2 + q);\n            const float* weight_xc_G = weight_xc.row(hidden_size * 3 + q);\n\n            const float* weight_hc_I = weight_hc.row(hidden_size * 0 + q);\n            const float* weight_hc_F = weight_hc.row(hidden_size * 1 + q);\n            const float* weight_hc_O = weight_hc.row(hidden_size * 2 + q);\n            const float* weight_hc_G = weight_hc.row(hidden_size * 3 + q);\n\n            float* weight_xc_IFOG = weight_xc_data_packed_dr.row(q);\n            float* weight_hc_IFOG = weight_hc_data_packed_dr.row(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_IFOG[0] = weight_xc_I[i];\n                weight_xc_IFOG[1] = weight_xc_F[i];\n                weight_xc_IFOG[2] = weight_xc_O[i];\n                weight_xc_IFOG[3] = weight_xc_G[i];\n\n                weight_xc_IFOG += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_IFOG[0] = weight_hc_I[i];\n                weight_hc_IFOG[1] = weight_hc_F[i];\n                weight_hc_IFOG[2] = weight_hc_O[i];\n                weight_hc_IFOG[3] = weight_hc_G[i];\n\n                weight_hc_IFOG += 4;\n            }\n        }\n    }\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nstatic int lstm(const Mat& bottom_blob, Mat& top_blob, int 
reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n    int hidden_size = cell_state.w;\n\n    // 4 x hidden_size\n    Mat gates(4, hidden_size, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    Mat tmp_hidden_state;\n    if (num_output != hidden_size)\n    {\n        tmp_hidden_state.create(hidden_size, 4u, opt.workspace_allocator);\n        if (tmp_hidden_state.empty())\n            return -100;\n    }\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        // clip hidden by continuation indicator\n        // h_cont_{t-1} = cont_t * h_{t-1}\n        // h_cont_{t-1} = h_{t-1} if cont_t == 1\n        //                0       otherwise\n        // calculate hidden\n        // gate_input_t := W_hc * h_conted_{t-1} + W_xc * x_t + b_c\n\n        int ti = reverse ? 
T - 1 - t : t;\n\n        const float* x = bottom_blob.row(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const float* bias_c_IFOG = (const float*)bias_c + q * 4;\n\n            // gate I F O G\n            const float* weight_xc_IFOG = weight_xc.row(q);\n\n            const float* weight_hc_IFOG = weight_hc.row(q);\n\n#if __ARM_NEON\n            float32x4_t _IFOG = vld1q_f32(bias_c_IFOG);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n#else\n            float I = bias_c_IFOG[0];\n            float F = bias_c_IFOG[1];\n            float O = bias_c_IFOG[2];\n            float G = bias_c_IFOG[3];\n#endif // __ARM_NEON\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = vld1q_f32(x + i);\n\n                float32x4_t _weight_xc_IFOG_0 = vld1q_f32(weight_xc_IFOG);\n                float32x4_t _weight_xc_IFOG_1 = vld1q_f32(weight_xc_IFOG + 4);\n                float32x4_t _weight_xc_IFOG_2 = vld1q_f32(weight_xc_IFOG + 8);\n                float32x4_t _weight_xc_IFOG_3 = vld1q_f32(weight_xc_IFOG + 12);\n\n#if __aarch64__\n                _IFOG = vfmaq_laneq_f32(_IFOG, _weight_xc_IFOG_0, _xi, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_IFOG_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_IFOG_2, _xi, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_IFOG_3, _xi, 3);\n#else\n                _IFOG = vmlaq_lane_f32(_IFOG, _weight_xc_IFOG_0, vget_low_f32(_xi), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_IFOG_1, vget_low_f32(_xi), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_IFOG_2, vget_high_f32(_xi), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_IFOG_3, vget_high_f32(_xi), 1);\n#endif\n\n       
         weight_xc_IFOG += 16;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                float xi = x[i];\n\n#if __ARM_NEON\n                float32x4_t _xi = vdupq_n_f32(xi);\n                float32x4_t _weight_xc_IFOG = vld1q_f32(weight_xc_IFOG);\n                _IFOG = vmlaq_f32(_IFOG, _weight_xc_IFOG, _xi);\n#else\n                I += weight_xc_IFOG[0] * xi;\n                F += weight_xc_IFOG[1] * xi;\n                O += weight_xc_IFOG[2] * xi;\n                G += weight_xc_IFOG[3] * xi;\n#endif // __ARM_NEON\n\n                weight_xc_IFOG += 4;\n            }\n\n            i = 0;\n#if __ARM_NEON\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n\n                float32x4_t _weight_hc_IFOG_0 = vld1q_f32(weight_hc_IFOG);\n                float32x4_t _weight_hc_IFOG_1 = vld1q_f32(weight_hc_IFOG + 4);\n                float32x4_t _weight_hc_IFOG_2 = vld1q_f32(weight_hc_IFOG + 8);\n                float32x4_t _weight_hc_IFOG_3 = vld1q_f32(weight_hc_IFOG + 12);\n\n#if __aarch64__\n                _IFOG = vfmaq_laneq_f32(_IFOG, _weight_hc_IFOG_0, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_IFOG_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_IFOG_2, _h_cont, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_IFOG_3, _h_cont, 3);\n#else\n                _IFOG = vmlaq_lane_f32(_IFOG, _weight_hc_IFOG_0, vget_low_f32(_h_cont), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_IFOG_1, vget_low_f32(_h_cont), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_IFOG_2, vget_high_f32(_h_cont), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_IFOG_3, vget_high_f32(_h_cont), 1);\n#endif\n\n                weight_hc_IFOG += 16;\n            }\n#endif // __ARM_NEON\n            for (; i < num_output; i++)\n            {\n      
          float h_cont = hidden_state[i];\n\n#if __ARM_NEON\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_IFOG = vld1q_f32(weight_hc_IFOG);\n                _IFOG = vmlaq_f32(_IFOG, _weight_hc_IFOG, _h_cont);\n#else\n                I += weight_hc_IFOG[0] * h_cont;\n                F += weight_hc_IFOG[1] * h_cont;\n                O += weight_hc_IFOG[2] * h_cont;\n                G += weight_hc_IFOG[3] * h_cont;\n#endif // __ARM_NEON\n\n                weight_hc_IFOG += 4;\n            }\n\n            float* gates_data = gates.row(q);\n\n#if __ARM_NEON\n            _IFOG = vaddq_f32(_IFOG, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _IFOG = vaddq_f32(_IFOG, _sum2);\n\n            vst1q_f32(gates_data, _IFOG);\n#else\n            gates_data[0] = I;\n            gates_data[1] = F;\n            gates_data[2] = O;\n            gates_data[3] = G;\n#endif // __ARM_NEON\n        }\n\n        // lstm unit\n        // sigmoid(I)\n        // sigmoid(F)\n        // sigmoid(O)\n        // tanh(G)\n        // c_t := f_t .* c_{t-1} + i_t .* g_t\n        // h_t := o_t .* tanh[c_t]\n        float* output_data = top_blob.row(ti);\n\n        float* cell_ptr = cell_state;\n        float* hidden_ptr = hidden_state;\n        float* tmp_hidden_ptr = tmp_hidden_state;\n\n        int remain_hidden_size_start = 0;\n#if __ARM_NEON\n        int nn_hidden_size = hidden_size >> 2;\n        remain_hidden_size_start = nn_hidden_size << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_hidden_size; qq++)\n        {\n            int q = qq * 4;\n\n            const float* gates_data = gates.row(q);\n\n            float32x4x4_t _IFOG_4x4 = vld4q_f32(gates_data);\n\n            float32x4_t _lstm_I = sigmoid_ps(_IFOG_4x4.val[0]);\n            float32x4_t _lstm_F = sigmoid_ps(_IFOG_4x4.val[1]);\n            float32x4_t _lstm_O = sigmoid_ps(_IFOG_4x4.val[2]);\n            
float32x4_t _lstm_G = tanh_ps(_IFOG_4x4.val[3]);\n\n            float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));\n            float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));\n\n            vst1q_f32(cell_ptr + q, _cell2);\n\n            if (num_output == hidden_size)\n            {\n                vst1q_f32(hidden_ptr + q, _lstm_H);\n                vst1q_f32(output_data + q, _lstm_H);\n            }\n            else\n            {\n                vst1q_f32(tmp_hidden_ptr + q, _lstm_H);\n            }\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_hidden_size_start; q < hidden_size; q++)\n        {\n            const float* gates_data = gates.row(q);\n\n            float I = gates_data[0];\n            float F = gates_data[1];\n            float O = gates_data[2];\n            float G = gates_data[3];\n\n            I = 1.f / (1.f + expf(-I));\n            F = 1.f / (1.f + expf(-F));\n            O = 1.f / (1.f + expf(-O));\n            G = tanhf(G);\n\n            float cell2 = F * cell_ptr[q] + I * G;\n            float H = O * tanhf(cell2);\n\n            cell_ptr[q] = cell2;\n            if (num_output == hidden_size)\n            {\n                hidden_ptr[q] = H;\n                output_data[q] = H;\n            }\n            else\n            {\n                tmp_hidden_ptr[q] = H;\n            }\n        }\n\n        if (num_output != hidden_size)\n        {\n            // int nn_num_output = num_output >> 2;\n            // int remain_num_output_start = nn_num_output << 2;\n            // #pragma omp parallel for num_threads(opt.num_threads)\n            // for (int qq = 0; qq < nn_num_output; qq++)\n            // {\n            //     int q = qq * 4;\n            //\n            // }\n            int remain_num_output_start = 0;\n            #pragma omp parallel for num_threads(opt.num_threads)\n      
      for (int q = remain_num_output_start; q < num_output; q++)\n            {\n                const float* hr = weight_hr.row(q);\n                const float* tmp_hidden_ptr = tmp_hidden_state;\n\n                float H = 0;\n                for (int i = 0; i < hidden_size; i++)\n                {\n                    H += tmp_hidden_ptr[i] * hr[i];\n                }\n\n                hidden_ptr[q] = H;\n                output_data[q] = H;\n            }\n        }\n    }\n\n    return 0;\n}\n\nint LSTM_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return forward_int8(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    Mat cell(hidden_size, 4u, opt.workspace_allocator);\n    if (cell.empty())\n        return -100;\n    cell.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = lstm(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = lstm(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.0f);\n        cell.fill(0.0f);\n\n        {\n            int ret = lstm(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(1), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    return 0;\n}\n\nint LSTM_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return forward_int8(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n    const Mat& bottom_blob = bottom_blobs[0];\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Mat cell;\n    Allocator* hidden_cell_allocator = top_blobs.size() == 3 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 3)\n    {\n        hidden = bottom_blobs[1].clone(hidden_cell_allocator);\n        cell = bottom_blobs[2].clone(hidden_cell_allocator);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_cell_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n\n        cell.create(hidden_size, num_directions, 4u, hidden_cell_allocator);\n        if (cell.empty())\n            return -100;\n        cell.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = lstm(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        Mat cell0 = cell.row_range(0, 1);\n        {\n            int ret = lstm(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(0), hidden0, cell0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        Mat cell1 = cell.row_range(1, 1);\n        {\n            int ret = lstm(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), num_output == hidden_size ? Mat() : weight_hr_data.channel(1), hidden1, cell1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    if (top_blobs.size() == 3)\n    {\n        top_blobs[1] = hidden;\n        top_blobs[2] = cell;\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic int lstm_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n    int hidden_size = cell_state.w;\n\n    // 4 x hidden_size\n    Mat gates(4, hidden_size, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    Mat tmp_hidden_state;\n    if (num_output != hidden_size)\n    {\n        tmp_hidden_state.create(hidden_size, 4u, opt.workspace_allocator);\n        if (tmp_hidden_state.empty())\n            return -100;\n    }\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        // clip hidden by continuation indicator\n        // h_cont_{t-1} = cont_t * h_{t-1}\n        // h_cont_{t-1} = h_{t-1} if cont_t == 1\n        //                0      
 otherwise\n        // calculate hidden\n        // gate_input_t := W_hc * h_conted_{t-1} + W_xc * x_t + b_c\n\n        int ti = reverse ? T - 1 - t : t;\n\n        const unsigned short* x = bottom_blob.row<const unsigned short>(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const unsigned short* bias_c_IFOG = (const unsigned short*)bias_c + q * 4;\n\n            // gate I F O G\n            const unsigned short* weight_xc_IFOG = weight_xc.row<const unsigned short>(q);\n\n            const unsigned short* weight_hc_IFOG = weight_hc.row<const unsigned short>(q);\n\n#if __ARM_NEON\n            float32x4_t _IFOG = bfloat2float(vld1_u16(bias_c_IFOG));\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n#else\n            float I = bfloat16_to_float32(bias_c_IFOG[0]);\n            float F = bfloat16_to_float32(bias_c_IFOG[1]);\n            float O = bfloat16_to_float32(bias_c_IFOG[2]);\n            float G = bfloat16_to_float32(bias_c_IFOG[3]);\n#endif // __ARM_NEON\n\n            int i = 0;\n#if __ARM_NEON\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = bfloat2float(vld1_u16(x + i));\n\n                float32x4_t _weight_xc_IFOG_0 = bfloat2float(vld1_u16(weight_xc_IFOG));\n                float32x4_t _weight_xc_IFOG_1 = bfloat2float(vld1_u16(weight_xc_IFOG + 4));\n                float32x4_t _weight_xc_IFOG_2 = bfloat2float(vld1_u16(weight_xc_IFOG + 8));\n                float32x4_t _weight_xc_IFOG_3 = bfloat2float(vld1_u16(weight_xc_IFOG + 12));\n\n#if __aarch64__\n                _IFOG = vfmaq_laneq_f32(_IFOG, _weight_xc_IFOG_0, _xi, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_IFOG_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_IFOG_2, _xi, 2);\n                _sum3 = 
vfmaq_laneq_f32(_sum3, _weight_xc_IFOG_3, _xi, 3);\n#else\n                _IFOG = vmlaq_lane_f32(_IFOG, _weight_xc_IFOG_0, vget_low_f32(_xi), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_IFOG_1, vget_low_f32(_xi), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_IFOG_2, vget_high_f32(_xi), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_IFOG_3, vget_high_f32(_xi), 1);\n#endif\n\n                weight_xc_IFOG += 16;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n#if __ARM_NEON\n                unsigned short xi = x[i];\n\n                float32x4_t _xi = bfloat2float(vdup_n_u16(xi));\n                float32x4_t _weight_xc_IFOG = bfloat2float(vld1_u16(weight_xc_IFOG));\n                _IFOG = vmlaq_f32(_IFOG, _weight_xc_IFOG, _xi);\n#else\n                float xi = bfloat16_to_float32(x[i]);\n\n                I += bfloat16_to_float32(weight_xc_IFOG[0]) * xi;\n                F += bfloat16_to_float32(weight_xc_IFOG[1]) * xi;\n                O += bfloat16_to_float32(weight_xc_IFOG[2]) * xi;\n                G += bfloat16_to_float32(weight_xc_IFOG[3]) * xi;\n#endif // __ARM_NEON\n\n                weight_xc_IFOG += 4;\n            }\n\n            i = 0;\n#if __ARM_NEON\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n\n                float32x4_t _weight_hc_IFOG_0 = bfloat2float(vld1_u16(weight_hc_IFOG));\n                float32x4_t _weight_hc_IFOG_1 = bfloat2float(vld1_u16(weight_hc_IFOG + 4));\n                float32x4_t _weight_hc_IFOG_2 = bfloat2float(vld1_u16(weight_hc_IFOG + 8));\n                float32x4_t _weight_hc_IFOG_3 = bfloat2float(vld1_u16(weight_hc_IFOG + 12));\n\n#if __aarch64__\n                _IFOG = vfmaq_laneq_f32(_IFOG, _weight_hc_IFOG_0, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_IFOG_1, _h_cont, 1);\n                _sum2 = 
vfmaq_laneq_f32(_sum2, _weight_hc_IFOG_2, _h_cont, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_IFOG_3, _h_cont, 3);\n#else\n                _IFOG = vmlaq_lane_f32(_IFOG, _weight_hc_IFOG_0, vget_low_f32(_h_cont), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_IFOG_1, vget_low_f32(_h_cont), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_IFOG_2, vget_high_f32(_h_cont), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_IFOG_3, vget_high_f32(_h_cont), 1);\n#endif\n\n                weight_hc_IFOG += 16;\n            }\n#endif // __ARM_NEON\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n#if __ARM_NEON\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_IFOG = bfloat2float(vld1_u16(weight_hc_IFOG));\n                _IFOG = vmlaq_f32(_IFOG, _weight_hc_IFOG, _h_cont);\n#else\n                I += bfloat16_to_float32(weight_hc_IFOG[0]) * h_cont;\n                F += bfloat16_to_float32(weight_hc_IFOG[1]) * h_cont;\n                O += bfloat16_to_float32(weight_hc_IFOG[2]) * h_cont;\n                G += bfloat16_to_float32(weight_hc_IFOG[3]) * h_cont;\n#endif // __ARM_NEON\n\n                weight_hc_IFOG += 4;\n            }\n\n            float* gates_data = gates.row(q);\n\n#if __ARM_NEON\n            _IFOG = vaddq_f32(_IFOG, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _IFOG = vaddq_f32(_IFOG, _sum2);\n\n            vst1q_f32(gates_data, _IFOG);\n#else\n            gates_data[0] = I;\n            gates_data[1] = F;\n            gates_data[2] = O;\n            gates_data[3] = G;\n#endif // __ARM_NEON\n        }\n\n        // lstm unit\n        // sigmoid(I)\n        // sigmoid(F)\n        // sigmoid(O)\n        // tanh(G)\n        // c_t := f_t .* c_{t-1} + i_t .* g_t\n        // h_t := o_t .* tanh[c_t]\n        unsigned short* output_data = top_blob.row<unsigned short>(ti);\n\n 
       float* cell_ptr = cell_state;\n        float* hidden_ptr = hidden_state;\n        float* tmp_hidden_ptr = tmp_hidden_state;\n\n        int remain_hidden_size_start = 0;\n#if __ARM_NEON\n        int nn_hidden_size = hidden_size >> 2;\n        remain_hidden_size_start = nn_hidden_size << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_hidden_size; qq++)\n        {\n            int q = qq * 4;\n\n            const float* gates_data = gates.row(q);\n\n            float32x4x4_t _IFOG_4x4 = vld4q_f32(gates_data);\n\n            float32x4_t _lstm_I = sigmoid_ps(_IFOG_4x4.val[0]);\n            float32x4_t _lstm_F = sigmoid_ps(_IFOG_4x4.val[1]);\n            float32x4_t _lstm_O = sigmoid_ps(_IFOG_4x4.val[2]);\n            float32x4_t _lstm_G = tanh_ps(_IFOG_4x4.val[3]);\n\n            float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));\n            float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));\n\n            vst1q_f32(cell_ptr + q, _cell2);\n\n            if (num_output == hidden_size)\n            {\n                vst1q_f32(hidden_ptr + q, _lstm_H);\n                vst1_u16(output_data + q, float2bfloat(_lstm_H));\n            }\n            else\n            {\n                vst1q_f32(tmp_hidden_ptr + q, _lstm_H);\n            }\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_hidden_size_start; q < hidden_size; q++)\n        {\n            const float* gates_data = gates.row(q);\n\n            float I = gates_data[0];\n            float F = gates_data[1];\n            float O = gates_data[2];\n            float G = gates_data[3];\n\n            I = 1.f / (1.f + expf(-I));\n            F = 1.f / (1.f + expf(-F));\n            O = 1.f / (1.f + expf(-O));\n            G = tanhf(G);\n\n            float cell2 = F * cell_ptr[q] + I * G;\n            float H = O * 
tanhf(cell2);\n\n            cell_ptr[q] = cell2;\n            if (num_output == hidden_size)\n            {\n                hidden_ptr[q] = H;\n                output_data[q] = float32_to_bfloat16(H);\n            }\n            else\n            {\n                tmp_hidden_ptr[q] = H;\n            }\n        }\n\n        if (num_output != hidden_size)\n        {\n            // int nn_num_output = num_output >> 2;\n            // int remain_num_output_start = nn_num_output << 2;\n            // #pragma omp parallel for num_threads(opt.num_threads)\n            // for (int qq = 0; qq < nn_num_output; qq++)\n            // {\n            //     int q = qq * 4;\n            //\n            // }\n            int remain_num_output_start = 0;\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = remain_num_output_start; q < num_output; q++)\n            {\n                const float* hr = weight_hr.row(q);\n                const float* tmp_hidden_ptr = tmp_hidden_state;\n\n                float H = 0;\n                for (int i = 0; i < hidden_size; i++)\n                {\n                    H += tmp_hidden_ptr[i] * hr[i];\n                }\n\n                hidden_ptr[q] = H;\n                output_data[q] = float32_to_bfloat16(H);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint LSTM_arm::create_pipeline_bf16s(const Option& opt)\n{\n    // pack IFOG\n    int num_directions = direction == 2 ? 
2 : 1;\n    int size = weight_data_size / num_directions / hidden_size / 4;\n\n    weight_xc_data_packed.create(size, hidden_size, num_directions, 8u, 4);\n    bias_c_data_packed.create(hidden_size, 1, num_directions, 8u, 4);\n    weight_hc_data_packed.create(num_output, hidden_size, num_directions, 8u, 4);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat bias_c = bias_c_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat bias_c_data_packed_dr = bias_c_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        const float* bias_c_I = bias_c.row(0);\n        const float* bias_c_F = bias_c.row(1);\n        const float* bias_c_O = bias_c.row(2);\n        const float* bias_c_G = bias_c.row(3);\n\n        unsigned short* bias_c_IFOG = bias_c_data_packed_dr.row<unsigned short>(0);\n\n        for (int q = 0; q < hidden_size; q++)\n        {\n            bias_c_IFOG[0] = float32_to_bfloat16(bias_c_I[q]);\n            bias_c_IFOG[1] = float32_to_bfloat16(bias_c_F[q]);\n            bias_c_IFOG[2] = float32_to_bfloat16(bias_c_O[q]);\n            bias_c_IFOG[3] = float32_to_bfloat16(bias_c_G[q]);\n\n            bias_c_IFOG += 4;\n\n            const float* weight_xc_I = weight_xc.row(hidden_size * 0 + q);\n            const float* weight_xc_F = weight_xc.row(hidden_size * 1 + q);\n            const float* weight_xc_O = weight_xc.row(hidden_size * 2 + q);\n            const float* weight_xc_G = weight_xc.row(hidden_size * 3 + q);\n\n            const float* weight_hc_I = weight_hc.row(hidden_size * 0 + q);\n            const float* weight_hc_F = weight_hc.row(hidden_size * 1 + q);\n            const float* weight_hc_O = weight_hc.row(hidden_size * 2 + q);\n            
const float* weight_hc_G = weight_hc.row(hidden_size * 3 + q);\n\n            unsigned short* weight_xc_IFOG = weight_xc_data_packed_dr.row<unsigned short>(q);\n            unsigned short* weight_hc_IFOG = weight_hc_data_packed_dr.row<unsigned short>(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_IFOG[0] = float32_to_bfloat16(weight_xc_I[i]);\n                weight_xc_IFOG[1] = float32_to_bfloat16(weight_xc_F[i]);\n                weight_xc_IFOG[2] = float32_to_bfloat16(weight_xc_O[i]);\n                weight_xc_IFOG[3] = float32_to_bfloat16(weight_xc_G[i]);\n\n                weight_xc_IFOG += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_IFOG[0] = float32_to_bfloat16(weight_hc_I[i]);\n                weight_hc_IFOG[1] = float32_to_bfloat16(weight_hc_F[i]);\n                weight_hc_IFOG[2] = float32_to_bfloat16(weight_hc_O[i]);\n                weight_hc_IFOG[3] = float32_to_bfloat16(weight_hc_G[i]);\n\n                weight_hc_IFOG += 4;\n            }\n        }\n    }\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nint LSTM_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 
2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    Mat cell(hidden_size, 4u, opt.workspace_allocator);\n    if (cell.empty())\n        return -100;\n    cell.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = lstm_bf16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = lstm_bf16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.f);\n        cell.fill(0.f);\n\n        {\n            int ret = lstm_bf16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(1), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned short* pf = top_blob_forward.row<const unsigned short>(i);\n            const unsigned short* pr = top_blob_reverse.row<const unsigned short>(i);\n            unsigned short* ptr = top_blob.row<unsigned short>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(unsigned short));\n            memcpy(ptr + num_output, pr, num_output * sizeof(unsigned short));\n        }\n    }\n\n    return 0;\n}\n\nint LSTM_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Mat cell;\n    Allocator* hidden_cell_allocator = top_blobs.size() == 3 ? opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 3)\n    {\n        Option opt_cast = opt;\n        opt_cast.blob_allocator = hidden_cell_allocator;\n        cast_bfloat16_to_float32(bottom_blobs[1], hidden, opt_cast);\n        cast_bfloat16_to_float32(bottom_blobs[2], cell, opt_cast);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_cell_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n\n        cell.create(hidden_size, num_directions, 4u, hidden_cell_allocator);\n        if (cell.empty())\n            return -100;\n        cell.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = lstm_bf16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), 
bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        Mat cell0 = cell.row_range(0, 1);\n        {\n            int ret = lstm_bf16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden0, cell0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        Mat cell1 = cell.row_range(1, 1);\n        {\n            int ret = lstm_bf16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(1), hidden1, cell1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned short* pf = top_blob_forward.row<const unsigned short>(i);\n            const unsigned short* pr = top_blob_reverse.row<const unsigned short>(i);\n            unsigned short* ptr = top_blob.row<unsigned short>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(unsigned short));\n            memcpy(ptr + num_output, pr, num_output * sizeof(unsigned short));\n        }\n    }\n\n    if (top_blobs.size() == 3)\n    {\n        cast_float32_to_bfloat16(hidden, top_blobs[1], opt);\n        cast_float32_to_bfloat16(cell, top_blobs[2], opt);\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n#if NCNN_INT8\nint LSTM_arm::create_pipeline_int8(const Option& opt)\n{\n    // pack IFOG\n    const int num_directions = direction == 2 ? 2 : 1;\n    const int size = weight_data_size / num_directions / hidden_size / 4;\n\n    lstm_transform_weight_int8(weight_xc_data, weight_xc_data_int8_scales, weight_hc_data, weight_hc_data_int8_scales, bias_c_data, weight_data_tm, weight_data_tm_int8_descales, bias_c_data_packed, size, num_output, num_directions, hidden_size, opt);\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n        weight_xc_data_int8_scales.release();\n        weight_hc_data_int8_scales.release();\n    }\n\n    return 0;\n}\n\nvoid LSTM_arm::dynamic_quantize(const Mat& bottom_blob, int elemtype, Mat& bottom_blob_int8, Mat& bottom_blob_int8_descales, const Option& opt) const\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    // dynamic quantize bottom_blob\n    bottom_blob_int8_descales.create(T, (size_t)4u, 1, opt.blob_allocator);\n\n    Mat bottom_blob_int8_scales(T, (size_t)4u, 1, opt.blob_allocator);\n\n    if (elemtype == 1)\n    {\n        // 
fp32\n        for (int t = 0; t < T; t++)\n        {\n            const float* x = bottom_blob.row(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(x[i]));\n            }\n\n            // avoid dividing by zero when the whole row is zero\n            bottom_blob_int8_scales[t] = absmax == 0.f ? 1.f : 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n    if (elemtype == 2)\n    {\n        // fp16\n        for (int t = 0; t < T; t++)\n        {\n            const unsigned short* x = bottom_blob.row<const unsigned short>(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(float16_to_float32(x[i])));\n            }\n\n            bottom_blob_int8_scales[t] = absmax == 0.f ? 1.f : 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n    if (elemtype == 4)\n    {\n        // bf16\n        for (int t = 0; t < T; t++)\n        {\n            const unsigned short* x = bottom_blob.row<const unsigned short>(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(bfloat16_to_float32(x[i])));\n            }\n\n            bottom_blob_int8_scales[t] = absmax == 0.f ? 1.f : 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n\n    quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt);\n}\n\nint LSTM_arm::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elemtype = 1; // fp32\n    {\n        int elembits = bottom_blob.elembits();\n\n        // clang-format off\n        // *INDENT-OFF*\n\n#if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        {\n            elemtype = 2; // fp16\n        }\n        else\n#endif\n#if NCNN_BF16\n        if (opt.use_bf16_storage && elembits == 16)\n 
       {\n            elemtype = 4; // bf16\n        }\n        else\n#endif\n        {\n            // fp32\n        }\n\n        // *INDENT-ON*\n        // clang-format on\n    }\n\n    int T = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    Mat cell(hidden_size, 4u, opt.workspace_allocator);\n    if (cell.empty())\n        return -100;\n    cell.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8;\n    Mat bottom_blob_int8_descales;\n    {\n        Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        dynamic_quantize(bottom_blob, elemtype, bottom_blob_int8, bottom_blob_int8_descales, opt_quant);\n    }\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        lstm_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, direction, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            lstm_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_forward, elemtype, 0, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n        }\n\n        hidden.fill(0.f);\n        cell.fill(0.0f);\n\n        {\n            lstm_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_reverse, elemtype, 1, weight_data_tm.channel(1), weight_data_tm_int8_descales.channel(1), bias_c_data_packed.channel(1), num_output == hidden_size ? Mat() : weight_hr_data.channel(1), hidden, cell, opt);\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned char* pf = top_blob_forward.row<const unsigned char>(i);\n            const unsigned char* pr = top_blob_reverse.row<const unsigned char>(i);\n            unsigned char* ptr = top_blob.row<unsigned char>(i);\n\n            memcpy(ptr, pf, num_output * elemsize);\n            memcpy(ptr + num_output * elemsize, pr, num_output * elemsize);\n        }\n    }\n\n    return 0;\n}\n\nint LSTM_arm::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n\n    int elemtype = 1; // fp32\n    {\n        int elembits = bottom_blob.elembits();\n\n        // clang-format off\n        // *INDENT-OFF*\n\n#if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        {\n            elemtype = 2; // fp16\n        
}\n        else\n#endif\n#if NCNN_BF16\n        if (opt.use_bf16_storage && elembits == 16)\n        {\n            elemtype = 4; // bf16\n        }\n        else\n#endif\n        {\n            // fp32\n        }\n\n        // *INDENT-ON*\n        // clang-format on\n    }\n\n    int T = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Mat cell;\n    Allocator* hidden_cell_allocator = top_blobs.size() == 3 ? opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 3)\n    {\n        if (elemtype == 1)\n        {\n            hidden = bottom_blobs[1].clone(hidden_cell_allocator);\n            cell = bottom_blobs[2].clone(hidden_cell_allocator);\n        }\n        if (elemtype == 2)\n        {\n            Option opt_cast = opt;\n            opt_cast.blob_allocator = hidden_cell_allocator;\n            cast_float16_to_float32(bottom_blobs[1], hidden, opt_cast);\n            cast_float16_to_float32(bottom_blobs[2], cell, opt_cast);\n        }\n        if (elemtype == 4)\n        {\n            Option opt_cast = opt;\n            opt_cast.blob_allocator = hidden_cell_allocator;\n            cast_bfloat16_to_float32(bottom_blobs[1], hidden, opt_cast);\n            cast_bfloat16_to_float32(bottom_blobs[2], cell, opt_cast);\n        }\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_cell_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n\n        cell.create(hidden_size, num_directions, 4u, hidden_cell_allocator);\n        if (cell.empty())\n            return -100;\n        cell.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8;\n    Mat bottom_blob_int8_descales;\n    {\n        
Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        dynamic_quantize(bottom_blob, elemtype, bottom_blob_int8, bottom_blob_int8_descales, opt_quant);\n    }\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        lstm_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, direction, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        Mat cell0 = cell.row_range(0, 1);\n        {\n            lstm_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_forward, elemtype, 0, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden0, cell0, opt);\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        Mat cell1 = cell.row_range(1, 1);\n        {\n            lstm_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_reverse, elemtype, 1, weight_data_tm.channel(1), weight_data_tm_int8_descales.channel(1), bias_c_data_packed.channel(1), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(1), hidden1, cell1, opt);\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned char* pf = top_blob_forward.row<const unsigned char>(i);\n            const unsigned char* pr = top_blob_reverse.row<const unsigned char>(i);\n            unsigned char* ptr = top_blob.row<unsigned char>(i);\n\n            memcpy(ptr, pf, num_output * elemsize);\n            memcpy(ptr + num_output * elemsize, pr, num_output * elemsize);\n        }\n    }\n\n    if (top_blobs.size() == 3)\n    {\n        if (elemtype == 1)\n        {\n            top_blobs[1] = hidden;\n            top_blobs[2] = cell;\n        }\n        if (elemtype == 2)\n        {\n            cast_float32_to_float16(hidden, top_blobs[1], opt);\n            cast_float32_to_float16(cell, top_blobs[2], opt);\n        }\n        if (elemtype == 4)\n        {\n            cast_float32_to_bfloat16(hidden, top_blobs[1], opt);\n            cast_float32_to_bfloat16(cell, top_blobs[2], opt);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/lstm_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_LSTM_ARM_H\n#define LAYER_LSTM_ARM_H\n\n#include \"lstm.h\"\n\nnamespace ncnn {\n\nclass LSTM_arm : public LSTM\n{\npublic:\n    LSTM_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8(const Option& opt);\n    void dynamic_quantize(const Mat& bottom_blob, int elemtype, Mat& bottom_blob_int8, Mat& bottom_blob_int8_descales, const Option& opt) const;\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n\npublic:\n    Mat weight_xc_data_packed;\n    Mat bias_c_data_packed;\n    Mat weight_hc_data_packed;\n\n    Mat weight_data_tm;\n\n#if NCNN_INT8\n    Mat weight_data_tm_int8_descales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_LSTM_ARM_H\n"
  },
  {
    "path": "src/layer/arm/lstm_arm_asimddp.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"layer.h\"\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"lstm_int8.h\"\n\nvoid lstm_transform_weight_int8_asimddp(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, int hidden_size, const Option& opt)\n{\n    lstm_transform_weight_int8(weight_xc, weight_xc_int8_scales, weight_hc, weight_hc_int8_scales, bias_c, weight_data_tm, weight_data_tm_int8_descales, bias_c_tm, size, num_output, num_directions, hidden_size, opt);\n}\n\nvoid lstm_int8_asimddp(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n    lstm_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, reverse, weight_data_tm, weight_data_tm_int8_descales, bias_c, weight_hr, hidden_state, cell_state, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/lstm_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"lstm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic int lstm_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n    int hidden_size = cell_state.w;\n\n    // 4 x hidden_size\n    Mat gates(4, hidden_size, 2u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    Mat tmp_hidden_state;\n    if (num_output != hidden_size)\n    {\n        tmp_hidden_state.create(hidden_size, 4u, opt.workspace_allocator);\n        if (tmp_hidden_state.empty())\n            return -100;\n    }\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        // clip hidden by continuation indicator\n        // h_cont_{t-1} = cont_t * h_{t-1}\n        // h_cont_{t-1} = h_{t-1} if cont_t == 1\n        //                0       otherwise\n        // calculate hidden\n        // gate_input_t := W_hc * h_conted_{t-1} + W_xc * x_t + b_c\n\n        int ti = reverse ? 
T - 1 - t : t;\n\n        int nn_hidden_size = hidden_size >> 1;\n        int remain_hidden_size_start = nn_hidden_size << 1;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_hidden_size; qq++)\n        {\n            int q = qq * 2;\n\n            const __fp16* bias_c_IFOG = (const __fp16*)bias_c + q * 4;\n\n            // gate I F O G\n            const __fp16* weight_xc_IFOG = weight_xc.row<const __fp16>(q / 2);\n\n            const __fp16* weight_hc_IFOG = weight_hc.row<const __fp16>(q / 2);\n\n            float16x8_t _IFOG = vld1q_f16(bias_c_IFOG);\n            float16x8_t _sum1 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _sum2 = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _sum3 = vdupq_n_f16((__fp16)0.f);\n\n            const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4h}, [%0], #8       \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    \"fmla   %2.8h, v0.8h, v4.h[0]   \\n\"\n                    \"fmla   %3.8h, v1.8h, v4.h[1]   \\n\"\n                    \"fmla   %4.8h, v2.8h, v4.h[2]   \\n\"\n                    \"fmla   %5.8h, v3.8h, v4.h[3]   \\n\"\n                    : \"=r\"(x),\n                    \"=r\"(weight_xc_IFOG),\n                    \"=w\"(_IFOG),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3)\n                    : \"0\"(x),\n                    \"1\"(weight_xc_IFOG),\n                    \"2\"(_IFOG),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _x = vld1_f16(x);\n                float16x8_t _w0 = 
vld1q_f16(weight_xc_IFOG);\n                float16x8_t _w1 = vld1q_f16(weight_xc_IFOG + 8);\n                float16x8_t _w2 = vld1q_f16(weight_xc_IFOG + 16);\n                float16x8_t _w3 = vld1q_f16(weight_xc_IFOG + 24);\n                _IFOG = vfmaq_lane_f16(_IFOG, _w0, _x, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _w1, _x, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _w2, _x, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _w3, _x, 3);\n\n                x += 4;\n                weight_xc_IFOG += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < size; i++)\n            {\n                __fp16 xi = *x++;\n\n                float16x8_t _xi = vdupq_n_f16(xi);\n                float16x8_t _weight_xc_IFOG = vld1q_f16(weight_xc_IFOG);\n                _IFOG = vfmaq_f16(_IFOG, _weight_xc_IFOG, _xi);\n\n                weight_xc_IFOG += 8;\n            }\n\n            const float* hidden_ptr = hidden_state;\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4s}, [%0], #16      \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%1], #64 \\n\"\n                    \"fcvtn  v4.4h, v4.4s            \\n\"\n                    \"fmla   %2.8h, v0.8h, v4.h[0]   \\n\"\n                    \"fmla   %3.8h, v1.8h, v4.h[1]   \\n\"\n                    \"fmla   %4.8h, v2.8h, v4.h[2]   \\n\"\n                    \"fmla   %5.8h, v3.8h, v4.h[3]   \\n\"\n                    : \"=r\"(hidden_ptr),\n                    \"=r\"(weight_hc_IFOG),\n                    \"=w\"(_IFOG),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3)\n                    : \"0\"(hidden_ptr),\n                    \"1\"(weight_hc_IFOG),\n                    \"2\"(_IFOG),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    
\"5\"(_sum3)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _h_cont = vcvt_f16_f32(vld1q_f32(hidden_ptr));\n                float16x8_t _w0 = vld1q_f16(weight_hc_IFOG);\n                float16x8_t _w1 = vld1q_f16(weight_hc_IFOG + 8);\n                float16x8_t _w2 = vld1q_f16(weight_hc_IFOG + 16);\n                float16x8_t _w3 = vld1q_f16(weight_hc_IFOG + 24);\n                _IFOG = vfmaq_lane_f16(_IFOG, _w0, _h_cont, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _w1, _h_cont, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _w2, _h_cont, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _w3, _h_cont, 3);\n\n                hidden_ptr += 4;\n                weight_hc_IFOG += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = *hidden_ptr++;\n\n                float16x8_t _h_cont = vdupq_n_f16((__fp16)h_cont);\n                float16x8_t _weight_hc_IFOG = vld1q_f16(weight_hc_IFOG);\n                _IFOG = vfmaq_f16(_IFOG, _weight_hc_IFOG, _h_cont);\n\n                weight_hc_IFOG += 8;\n            }\n\n            __fp16* gates_data = gates.row<__fp16>(q);\n\n            _IFOG = vaddq_f16(_IFOG, _sum1);\n            _sum2 = vaddq_f16(_sum2, _sum3);\n            _IFOG = vaddq_f16(_IFOG, _sum2);\n\n            vst1q_f16(gates_data, _IFOG);\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_hidden_size_start; q < hidden_size; q++)\n        {\n            const __fp16* bias_c_IFOG = (const __fp16*)bias_c + q * 4;\n\n            // gate I F O G\n            const __fp16* weight_xc_IFOG = weight_xc.row<const __fp16>(q / 2 + q % 2);\n\n            const __fp16* weight_hc_IFOG = weight_hc.row<const __fp16>(q / 2 + q % 2);\n\n            float16x4_t _IFOG = vld1_f16(bias_c_IFOG);\n            float16x4_t _sum1 = 
vdup_n_f16((__fp16)0.f);\n            float16x4_t _sum2 = vdup_n_f16((__fp16)0.f);\n            float16x4_t _sum3 = vdup_n_f16((__fp16)0.f);\n\n            const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4h}, [%0], #8       \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                    \"fmla   %2.4h, v0.4h, v4.h[0]   \\n\"\n                    \"fmla   %3.4h, v1.4h, v4.h[1]   \\n\"\n                    \"fmla   %4.4h, v2.4h, v4.h[2]   \\n\"\n                    \"fmla   %5.4h, v3.4h, v4.h[3]   \\n\"\n                    : \"=r\"(x),\n                    \"=r\"(weight_xc_IFOG),\n                    \"=w\"(_IFOG),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3)\n                    : \"0\"(x),\n                    \"1\"(weight_xc_IFOG),\n                    \"2\"(_IFOG),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _x = vld1_f16(x);\n                float16x4_t _w0 = vld1_f16(weight_xc_IFOG);\n                float16x4_t _w1 = vld1_f16(weight_xc_IFOG + 4);\n                float16x4_t _w2 = vld1_f16(weight_xc_IFOG + 8);\n                float16x4_t _w3 = vld1_f16(weight_xc_IFOG + 12);\n                _IFOG = vfma_lane_f16(_IFOG, _w0, _x, 0);\n                _sum1 = vfma_lane_f16(_sum1, _w1, _x, 1);\n                _sum2 = vfma_lane_f16(_sum2, _w2, _x, 2);\n                _sum3 = vfma_lane_f16(_sum3, _w3, _x, 3);\n\n                x += 4;\n                weight_xc_IFOG += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < size; i++)\n            {\n                __fp16 xi = 
*x++;\n\n                float16x4_t _xi = vdup_n_f16(xi);\n                float16x4_t _weight_xc_IFOG = vld1_f16(weight_xc_IFOG);\n                _IFOG = vfma_f16(_IFOG, _weight_xc_IFOG, _xi);\n\n                weight_xc_IFOG += 4;\n            }\n\n            const float* hidden_ptr = hidden_state;\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"ld1    {v4.4s}, [%0], #16      \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n                    \"fcvtn  v4.4h, v4.4s            \\n\"\n                    \"fmla   %2.4h, v0.4h, v4.h[0]   \\n\"\n                    \"fmla   %3.4h, v1.4h, v4.h[1]   \\n\"\n                    \"fmla   %4.4h, v2.4h, v4.h[2]   \\n\"\n                    \"fmla   %5.4h, v3.4h, v4.h[3]   \\n\"\n                    : \"=r\"(hidden_ptr),\n                    \"=r\"(weight_hc_IFOG),\n                    \"=w\"(_IFOG),\n                    \"=w\"(_sum1),\n                    \"=w\"(_sum2),\n                    \"=w\"(_sum3)\n                    : \"0\"(hidden_ptr),\n                    \"1\"(weight_hc_IFOG),\n                    \"2\"(_IFOG),\n                    \"3\"(_sum1),\n                    \"4\"(_sum2),\n                    \"5\"(_sum3)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x4_t _h_cont = vcvt_f16_f32(vld1q_f32(hidden_ptr));\n                float16x4_t _w0 = vld1_f16(weight_hc_IFOG);\n                float16x4_t _w1 = vld1_f16(weight_hc_IFOG + 4);\n                float16x4_t _w2 = vld1_f16(weight_hc_IFOG + 8);\n                float16x4_t _w3 = vld1_f16(weight_hc_IFOG + 12);\n                _IFOG = vfma_lane_f16(_IFOG, _w0, _h_cont, 0);\n                _sum1 = vfma_lane_f16(_sum1, _w1, _h_cont, 1);\n                _sum2 = vfma_lane_f16(_sum2, _w2, _h_cont, 2);\n                _sum3 = 
vfma_lane_f16(_sum3, _w3, _h_cont, 3);\n\n                hidden_ptr += 4;\n                weight_hc_IFOG += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = *hidden_ptr++;\n\n                float16x4_t _h_cont = vdup_n_f16((__fp16)h_cont);\n                float16x4_t _weight_hc_IFOG = vld1_f16(weight_hc_IFOG);\n                _IFOG = vfma_f16(_IFOG, _weight_hc_IFOG, _h_cont);\n\n                weight_hc_IFOG += 4;\n            }\n\n            __fp16* gates_data = gates.row<__fp16>(q);\n\n            _IFOG = vadd_f16(_IFOG, _sum1);\n            _sum2 = vadd_f16(_sum2, _sum3);\n            _IFOG = vadd_f16(_IFOG, _sum2);\n\n            vst1_f16(gates_data, _IFOG);\n        }\n\n        // lstm unit\n        // sigmoid(I)\n        // sigmoid(F)\n        // sigmoid(O)\n        // tanh(G)\n        // c_t := f_t .* c_{t-1} + i_t .* g_t\n        // h_t := o_t .* tanh[c_t]\n        __fp16* output_data = top_blob.row<__fp16>(ti);\n\n        float* cell_ptr = cell_state;\n        float* hidden_ptr = hidden_state;\n        float* tmp_hidden_ptr = tmp_hidden_state;\n\n        nn_hidden_size = hidden_size >> 2;\n        remain_hidden_size_start = nn_hidden_size << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_hidden_size; qq++)\n        {\n            int q = qq * 4;\n\n            const __fp16* gates_data = gates.row<const __fp16>(q);\n\n            float16x4x4_t _IFOG_4x4 = vld4_f16(gates_data);\n\n            float32x4_t _lstm_I = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[0]));\n            float32x4_t _lstm_F = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[1]));\n            float32x4_t _lstm_O = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[2]));\n            float32x4_t _lstm_G = tanh_ps(vcvt_f32_f16(_IFOG_4x4.val[3]));\n\n            float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));\n     
       float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));\n\n            vst1q_f32(cell_ptr + q, _cell2);\n\n            if (num_output == hidden_size)\n            {\n                vst1q_f32(hidden_ptr + q, _lstm_H);\n                vst1_f16(output_data + q, vcvt_f16_f32(_lstm_H));\n            }\n            else\n            {\n                vst1q_f32(tmp_hidden_ptr + q, _lstm_H);\n            }\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_hidden_size_start; q < hidden_size; q++)\n        {\n            const __fp16* gates_data = gates.row<const __fp16>(q);\n\n            float I = (float)gates_data[0];\n            float F = (float)gates_data[1];\n            float O = (float)gates_data[2];\n            float G = (float)gates_data[3];\n\n            I = 1.f / (1.f + expf(-I));\n            F = 1.f / (1.f + expf(-F));\n            O = 1.f / (1.f + expf(-O));\n            G = tanhf(G);\n\n            float cell2 = F * cell_ptr[q] + I * G;\n            float H = O * tanhf(cell2);\n\n            cell_ptr[q] = cell2;\n            if (num_output == hidden_size)\n            {\n                hidden_ptr[q] = H;\n                output_data[q] = (__fp16)H;\n            }\n            else\n            {\n                tmp_hidden_ptr[q] = H;\n            }\n        }\n\n        if (num_output != hidden_size)\n        {\n            // int nn_num_output = num_output >> 2;\n            // int remain_num_output_start = nn_num_output << 2;\n            // #pragma omp parallel for num_threads(opt.num_threads)\n            // for (int qq = 0; qq < nn_num_output; qq++)\n            // {\n            //     int q = qq * 4;\n            //\n            // }\n            int remain_num_output_start = 0;\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = remain_num_output_start; q < num_output; q++)\n            {\n                const float* hr = 
weight_hr.row(q);\n                const float* tmp_hidden_ptr = tmp_hidden_state;\n\n                float H = 0;\n                for (int i = 0; i < hidden_size; i++)\n                {\n                    H += tmp_hidden_ptr[i] * hr[i];\n                }\n\n                hidden_ptr[q] = H;\n                output_data[q] = (__fp16)H;\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic int lstm_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n    if (opt.use_fp16_arithmetic)\n        return lstm_fp16sa(bottom_blob, top_blob, reverse, weight_xc, bias_c, weight_hc, weight_hr, hidden_state, cell_state, opt);\n\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n    int hidden_size = cell_state.w;\n\n    // 4 x hidden_size\n    Mat gates(4, hidden_size, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    Mat tmp_hidden_state;\n    if (num_output != hidden_size)\n    {\n        tmp_hidden_state.create(hidden_size, 4u, opt.workspace_allocator);\n        if (tmp_hidden_state.empty())\n            return -100;\n    }\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        // clip hidden by continuation indicator\n        // h_cont_{t-1} = cont_t * h_{t-1}\n        // h_cont_{t-1} = h_{t-1} if cont_t == 1\n        //                0       otherwise\n        // calculate hidden\n        // gate_input_t := W_hc * h_conted_{t-1} + W_xc * x_t + b_c\n\n        int ti = reverse ? 
T - 1 - t : t;\n\n        const __fp16* x = bottom_blob.row<const __fp16>(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const __fp16* bias_c_IFOG = (const __fp16*)bias_c + q * 4;\n\n            // gate I F O G\n            const __fp16* weight_xc_IFOG = weight_xc.row<const __fp16>(q);\n\n            const __fp16* weight_hc_IFOG = weight_hc.row<const __fp16>(q);\n\n            float32x4_t _IFOG = vcvt_f32_f16(vld1_f16(bias_c_IFOG));\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _xi = vcvt_f32_f16(vld1_f16(x + i));\n\n                float32x4_t _weight_xc_IFOG_0 = vcvt_f32_f16(vld1_f16(weight_xc_IFOG));\n                float32x4_t _weight_xc_IFOG_1 = vcvt_f32_f16(vld1_f16(weight_xc_IFOG + 4));\n                float32x4_t _weight_xc_IFOG_2 = vcvt_f32_f16(vld1_f16(weight_xc_IFOG + 8));\n                float32x4_t _weight_xc_IFOG_3 = vcvt_f32_f16(vld1_f16(weight_xc_IFOG + 12));\n\n                _IFOG = vfmaq_laneq_f32(_IFOG, _weight_xc_IFOG_0, _xi, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_IFOG_1, _xi, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_IFOG_2, _xi, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_IFOG_3, _xi, 3);\n\n                weight_xc_IFOG += 16;\n            }\n            for (; i < size; i++)\n            {\n                __fp16 xi = x[i];\n\n                float32x4_t _xi = vcvt_f32_f16(vdup_n_f16(xi));\n                float32x4_t _weight_xc_IFOG = vcvt_f32_f16(vld1_f16(weight_xc_IFOG));\n                _IFOG = vfmaq_f32(_IFOG, _weight_xc_IFOG, _xi);\n\n                weight_xc_IFOG += 4;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n      
      {\n                float32x4_t _h_cont = vld1q_f32((const float*)hidden_state + i);\n\n                float32x4_t _weight_hc_IFOG_0 = vcvt_f32_f16(vld1_f16(weight_hc_IFOG));\n                float32x4_t _weight_hc_IFOG_1 = vcvt_f32_f16(vld1_f16(weight_hc_IFOG + 4));\n                float32x4_t _weight_hc_IFOG_2 = vcvt_f32_f16(vld1_f16(weight_hc_IFOG + 8));\n                float32x4_t _weight_hc_IFOG_3 = vcvt_f32_f16(vld1_f16(weight_hc_IFOG + 12));\n\n                _IFOG = vfmaq_laneq_f32(_IFOG, _weight_hc_IFOG_0, _h_cont, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_IFOG_1, _h_cont, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_IFOG_2, _h_cont, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_IFOG_3, _h_cont, 3);\n\n                weight_hc_IFOG += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                float32x4_t _h_cont = vdupq_n_f32(h_cont);\n                float32x4_t _weight_hc_IFOG = vcvt_f32_f16(vld1_f16(weight_hc_IFOG));\n                _IFOG = vfmaq_f32(_IFOG, _weight_hc_IFOG, _h_cont);\n\n                weight_hc_IFOG += 4;\n            }\n\n            float* gates_data = gates.row(q);\n\n            _IFOG = vaddq_f32(_IFOG, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _IFOG = vaddq_f32(_IFOG, _sum2);\n\n            vst1q_f32(gates_data, _IFOG);\n        }\n\n        // lstm unit\n        // sigmoid(I)\n        // sigmoid(F)\n        // sigmoid(O)\n        // tanh(G)\n        // c_t := f_t .* c_{t-1} + i_t .* g_t\n        // h_t := o_t .* tanh[c_t]\n        __fp16* output_data = top_blob.row<__fp16>(ti);\n\n        float* cell_ptr = cell_state;\n        float* hidden_ptr = hidden_state;\n        float* tmp_hidden_ptr = tmp_hidden_state;\n\n        int nn_hidden_size = hidden_size >> 2;\n        int remain_hidden_size_start = nn_hidden_size << 2;\n        #pragma omp parallel for 
num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_hidden_size; qq++)\n        {\n            int q = qq * 4;\n\n            const float* gates_data = gates.row(q);\n\n            float32x4x4_t _IFOG_4x4 = vld4q_f32(gates_data);\n\n            float32x4_t _lstm_I = sigmoid_ps(_IFOG_4x4.val[0]);\n            float32x4_t _lstm_F = sigmoid_ps(_IFOG_4x4.val[1]);\n            float32x4_t _lstm_O = sigmoid_ps(_IFOG_4x4.val[2]);\n            float32x4_t _lstm_G = tanh_ps(_IFOG_4x4.val[3]);\n\n            float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));\n            float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));\n\n            vst1q_f32(cell_ptr + q, _cell2);\n\n            if (num_output == hidden_size)\n            {\n                vst1q_f32(hidden_ptr + q, _lstm_H);\n                vst1_f16(output_data + q, vcvt_f16_f32(_lstm_H));\n            }\n            else\n            {\n                vst1q_f32(tmp_hidden_ptr + q, _lstm_H);\n            }\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_hidden_size_start; q < hidden_size; q++)\n        {\n            const float* gates_data = gates.row(q);\n\n            float I = gates_data[0];\n            float F = gates_data[1];\n            float O = gates_data[2];\n            float G = gates_data[3];\n\n            I = 1.f / (1.f + expf(-I));\n            F = 1.f / (1.f + expf(-F));\n            O = 1.f / (1.f + expf(-O));\n            G = tanhf(G);\n\n            float cell2 = F * cell_ptr[q] + I * G;\n            float H = O * tanhf(cell2);\n\n            cell_ptr[q] = cell2;\n            if (num_output == hidden_size)\n            {\n                hidden_ptr[q] = H;\n                output_data[q] = (__fp16)H;\n            }\n            else\n            {\n                tmp_hidden_ptr[q] = H;\n            }\n        }\n\n        if (num_output != hidden_size)\n        {\n     
       // int nn_num_output = num_output >> 2;\n            // int remain_num_output_start = nn_num_output << 2;\n            // #pragma omp parallel for num_threads(opt.num_threads)\n            // for (int qq = 0; qq < nn_num_output; qq++)\n            // {\n            //     int q = qq * 4;\n            //\n            // }\n            int remain_num_output_start = 0;\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = remain_num_output_start; q < num_output; q++)\n            {\n                const float* hr = weight_hr.row(q);\n                const float* tmp_hidden_ptr = tmp_hidden_state;\n\n                float H = 0;\n                for (int i = 0; i < hidden_size; i++)\n                {\n                    H += tmp_hidden_ptr[i] * hr[i];\n                }\n\n                hidden_ptr[q] = H;\n                output_data[q] = (__fp16)H;\n            }\n        }\n    }\n\n    return 0;\n}\n\nint LSTM_arm::create_pipeline_fp16s(const Option& opt)\n{\n    // pack IFOG\n    const int num_directions = direction == 2 ? 
2 : 1;\n    const int size = weight_data_size / num_directions / hidden_size / 4;\n\n    if (opt.use_fp16_arithmetic)\n    {\n        weight_xc_data_packed.create(size, hidden_size / 2 + hidden_size % 2, num_directions, 16u, 8);\n        bias_c_data_packed.create(hidden_size, 1, num_directions, 8u, 4);\n        weight_hc_data_packed.create(num_output, hidden_size / 2 + hidden_size % 2, num_directions, 16u, 8);\n    }\n    else\n    {\n        weight_xc_data_packed.create(size, hidden_size, num_directions, 8u, 4);\n        bias_c_data_packed.create(hidden_size, 1, num_directions, 8u, 4);\n        weight_hc_data_packed.create(num_output, hidden_size, num_directions, 8u, 4);\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat bias_c = bias_c_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat bias_c_data_packed_dr = bias_c_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        const float* bias_c_I = bias_c.row(0);\n        const float* bias_c_F = bias_c.row(1);\n        const float* bias_c_O = bias_c.row(2);\n        const float* bias_c_G = bias_c.row(3);\n\n        __fp16* bias_c_IFOG = bias_c_data_packed_dr.row<__fp16>(0);\n\n        int q = 0;\n        if (opt.use_fp16_arithmetic)\n        {\n            for (; q + 1 < hidden_size; q += 2)\n            {\n                bias_c_IFOG[0] = (__fp16)bias_c_I[q];\n                bias_c_IFOG[1] = (__fp16)bias_c_F[q];\n                bias_c_IFOG[2] = (__fp16)bias_c_O[q];\n                bias_c_IFOG[3] = (__fp16)bias_c_G[q];\n                bias_c_IFOG[4] = (__fp16)bias_c_I[q + 1];\n                bias_c_IFOG[5] = (__fp16)bias_c_F[q + 1];\n                bias_c_IFOG[6] = (__fp16)bias_c_O[q + 1];\n     
           bias_c_IFOG[7] = (__fp16)bias_c_G[q + 1];\n\n                bias_c_IFOG += 8;\n\n                const float* weight_xc_I = weight_xc.row(hidden_size * 0 + q);\n                const float* weight_xc_F = weight_xc.row(hidden_size * 1 + q);\n                const float* weight_xc_O = weight_xc.row(hidden_size * 2 + q);\n                const float* weight_xc_G = weight_xc.row(hidden_size * 3 + q);\n                const float* weight_xc_I_1 = weight_xc.row(hidden_size * 0 + q + 1);\n                const float* weight_xc_F_1 = weight_xc.row(hidden_size * 1 + q + 1);\n                const float* weight_xc_O_1 = weight_xc.row(hidden_size * 2 + q + 1);\n                const float* weight_xc_G_1 = weight_xc.row(hidden_size * 3 + q + 1);\n\n                const float* weight_hc_I = weight_hc.row(hidden_size * 0 + q);\n                const float* weight_hc_F = weight_hc.row(hidden_size * 1 + q);\n                const float* weight_hc_O = weight_hc.row(hidden_size * 2 + q);\n                const float* weight_hc_G = weight_hc.row(hidden_size * 3 + q);\n                const float* weight_hc_I_1 = weight_hc.row(hidden_size * 0 + q + 1);\n                const float* weight_hc_F_1 = weight_hc.row(hidden_size * 1 + q + 1);\n                const float* weight_hc_O_1 = weight_hc.row(hidden_size * 2 + q + 1);\n                const float* weight_hc_G_1 = weight_hc.row(hidden_size * 3 + q + 1);\n\n                __fp16* weight_xc_IFOG = weight_xc_data_packed_dr.row<__fp16>(q / 2);\n                __fp16* weight_hc_IFOG = weight_hc_data_packed_dr.row<__fp16>(q / 2);\n\n                for (int i = 0; i < size; i++)\n                {\n                    weight_xc_IFOG[0] = (__fp16)weight_xc_I[i];\n                    weight_xc_IFOG[1] = (__fp16)weight_xc_F[i];\n                    weight_xc_IFOG[2] = (__fp16)weight_xc_O[i];\n                    weight_xc_IFOG[3] = (__fp16)weight_xc_G[i];\n                    weight_xc_IFOG[4] = (__fp16)weight_xc_I_1[i];\n     
               weight_xc_IFOG[5] = (__fp16)weight_xc_F_1[i];\n                    weight_xc_IFOG[6] = (__fp16)weight_xc_O_1[i];\n                    weight_xc_IFOG[7] = (__fp16)weight_xc_G_1[i];\n\n                    weight_xc_IFOG += 8;\n                }\n\n                for (int i = 0; i < num_output; i++)\n                {\n                    weight_hc_IFOG[0] = (__fp16)weight_hc_I[i];\n                    weight_hc_IFOG[1] = (__fp16)weight_hc_F[i];\n                    weight_hc_IFOG[2] = (__fp16)weight_hc_O[i];\n                    weight_hc_IFOG[3] = (__fp16)weight_hc_G[i];\n                    weight_hc_IFOG[4] = (__fp16)weight_hc_I_1[i];\n                    weight_hc_IFOG[5] = (__fp16)weight_hc_F_1[i];\n                    weight_hc_IFOG[6] = (__fp16)weight_hc_O_1[i];\n                    weight_hc_IFOG[7] = (__fp16)weight_hc_G_1[i];\n\n                    weight_hc_IFOG += 8;\n                }\n            }\n        }\n        for (; q < hidden_size; q++)\n        {\n            bias_c_IFOG[0] = (__fp16)bias_c_I[q];\n            bias_c_IFOG[1] = (__fp16)bias_c_F[q];\n            bias_c_IFOG[2] = (__fp16)bias_c_O[q];\n            bias_c_IFOG[3] = (__fp16)bias_c_G[q];\n\n            bias_c_IFOG += 4;\n\n            const float* weight_xc_I = weight_xc.row(hidden_size * 0 + q);\n            const float* weight_xc_F = weight_xc.row(hidden_size * 1 + q);\n            const float* weight_xc_O = weight_xc.row(hidden_size * 2 + q);\n            const float* weight_xc_G = weight_xc.row(hidden_size * 3 + q);\n\n            const float* weight_hc_I = weight_hc.row(hidden_size * 0 + q);\n            const float* weight_hc_F = weight_hc.row(hidden_size * 1 + q);\n            const float* weight_hc_O = weight_hc.row(hidden_size * 2 + q);\n            const float* weight_hc_G = weight_hc.row(hidden_size * 3 + q);\n\n            const int qq = opt.use_fp16_arithmetic ? 
q / 2 + q % 2 : q;\n            __fp16* weight_xc_IFOG = weight_xc_data_packed_dr.row<__fp16>(qq);\n            __fp16* weight_hc_IFOG = weight_hc_data_packed_dr.row<__fp16>(qq);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc_IFOG[0] = (__fp16)weight_xc_I[i];\n                weight_xc_IFOG[1] = (__fp16)weight_xc_F[i];\n                weight_xc_IFOG[2] = (__fp16)weight_xc_O[i];\n                weight_xc_IFOG[3] = (__fp16)weight_xc_G[i];\n\n                weight_xc_IFOG += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc_IFOG[0] = (__fp16)weight_hc_I[i];\n                weight_hc_IFOG[1] = (__fp16)weight_hc_F[i];\n                weight_hc_IFOG[2] = (__fp16)weight_hc_O[i];\n                weight_hc_IFOG[3] = (__fp16)weight_hc_G[i];\n\n                weight_hc_IFOG += 4;\n            }\n        }\n    }\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nint LSTM_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    Mat cell(hidden_size, 4u, opt.workspace_allocator);\n    if (cell.empty())\n        return -100;\n    cell.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = lstm_fp16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = lstm_fp16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.f);\n        cell.fill(0.f);\n\n        {\n            int ret = lstm_fp16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), num_output == hidden_size ? Mat() : weight_hr_data.channel(1), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const __fp16* pf = top_blob_forward.row<const __fp16>(i);\n            const __fp16* pr = top_blob_reverse.row<const __fp16>(i);\n            __fp16* ptr = top_blob.row<__fp16>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(__fp16));\n            memcpy(ptr + num_output, pr, num_output * sizeof(__fp16));\n        }\n    }\n\n    return 0;\n}\n\nint LSTM_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Mat cell;\n    Allocator* hidden_cell_allocator = top_blobs.size() == 3 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 3)\n    {\n        Option opt_cast = opt;\n        opt_cast.blob_allocator = hidden_cell_allocator;\n        cast_float16_to_float32(bottom_blobs[1], hidden, opt_cast);\n        cast_float16_to_float32(bottom_blobs[2], cell, opt_cast);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_cell_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n\n        cell.create(hidden_size, num_directions, 4u, hidden_cell_allocator);\n        if (cell.empty())\n            return -100;\n        cell.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = lstm_fp16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        Mat cell0 = cell.row_range(0, 1);\n        {\n            int ret = lstm_fp16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(0), hidden0, cell0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        Mat cell1 = cell.row_range(1, 1);\n        {\n            int ret = lstm_fp16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), num_output == hidden_size ? Mat() : weight_hr_data.channel(1), hidden1, cell1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const __fp16* pf = top_blob_forward.row<const __fp16>(i);\n            const __fp16* pr = top_blob_reverse.row<const __fp16>(i);\n            __fp16* ptr = top_blob.row<__fp16>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(__fp16));\n            memcpy(ptr + num_output, pr, num_output * sizeof(__fp16));\n        }\n    }\n\n    if (top_blobs.size() == 3)\n    {\n        cast_float32_to_float16(hidden, top_blobs[1], opt);\n        cast_float32_to_float16(cell, top_blobs[2], opt);\n    }\n\n    return 0;\n}\n#endif\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/lstm_arm_vfpv4.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"layer.h\"\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"lstm_int8.h\"\n\nvoid lstm_int8_gate_output_vfpv4(const Mat& gates, const Mat& weight_hr, Mat& hidden_state, Mat& tmp_hidden_state, Mat& cell_state, Mat& top_blob, int ti, int elemtype, const Option& opt)\n{\n    lstm_int8_gate_output(gates, weight_hr, hidden_state, tmp_hidden_state, cell_state, top_blob, ti, elemtype, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/lstm_int8.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\nvoid lstm_transform_weight_int8_asimddp(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, int hidden_size, const Option& opt);\nvoid lstm_int8_asimddp(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\nvoid lstm_int8_gate_output_vfpv4(const Mat& gates, const Mat& weight_hr, Mat& hidden_state, Mat& tmp_hidden_state, Mat& cell_state, Mat& top_blob, int ti, int elemtype, const Option& opt);\n#endif\n\nstatic void lstm_transform_weight_int8(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, int hidden_size, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        lstm_transform_weight_int8_asimddp(weight_xc, weight_xc_int8_scales, weight_hc, weight_hc_int8_scales, bias_c, weight_data_tm, weight_data_tm_int8_descales, bias_c_tm, size, num_output, num_directions, hidden_size, opt);\n        return;\n    }\n#endif\n\n    weight_data_tm.create(size + num_output, hidden_size, num_directions, 4u, 4);\n    weight_data_tm_int8_descales.create(4 + 4, hidden_size, num_directions);\n    bias_c_tm.create(hidden_size, 1, 
num_directions, 16u, 4);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc_dr = weight_xc.channel(dr);\n        const Mat weight_hc_dr = weight_hc.channel(dr);\n        const Mat bias_c_dr = bias_c.channel(dr);\n        const float* weight_xc_int8_scales_ptr = weight_xc_int8_scales.row(dr);\n        const float* weight_hc_int8_scales_ptr = weight_hc_int8_scales.row(dr);\n\n        Mat weight_data_tm_dr = weight_data_tm.channel(dr);\n        Mat bias_c_tm_dr = bias_c_tm.channel(dr);\n        Mat weight_data_tm_int8_descales_dr = weight_data_tm_int8_descales.channel(dr);\n\n        const float* bias_c_I = bias_c_dr.row(0);\n        const float* bias_c_F = bias_c_dr.row(1);\n        const float* bias_c_O = bias_c_dr.row(2);\n        const float* bias_c_G = bias_c_dr.row(3);\n\n        float* bias_c_IFOG = bias_c_tm_dr.row(0);\n\n        int q = 0;\n        for (; q < hidden_size; q++)\n        {\n            bias_c_IFOG[0] = bias_c_I[q];\n            bias_c_IFOG[1] = bias_c_F[q];\n            bias_c_IFOG[2] = bias_c_O[q];\n            bias_c_IFOG[3] = bias_c_G[q];\n\n            bias_c_IFOG += 4;\n\n            const signed char* weight_xc_I = weight_xc_dr.row<const signed char>(hidden_size * 0 + q);\n            const signed char* weight_xc_F = weight_xc_dr.row<const signed char>(hidden_size * 1 + q);\n            const signed char* weight_xc_O = weight_xc_dr.row<const signed char>(hidden_size * 2 + q);\n            const signed char* weight_xc_G = weight_xc_dr.row<const signed char>(hidden_size * 3 + q);\n\n            const signed char* weight_hc_I = weight_hc_dr.row<const signed char>(hidden_size * 0 + q);\n            const signed char* weight_hc_F = weight_hc_dr.row<const signed char>(hidden_size * 1 + q);\n            const signed char* weight_hc_O = weight_hc_dr.row<const signed char>(hidden_size * 2 + q);\n            const signed char* weight_hc_G = 
weight_hc_dr.row<const signed char>(hidden_size * 3 + q);\n\n            signed char* kptr = weight_data_tm_dr.row<signed char>(q);\n            float* descales_ptr = weight_data_tm_int8_descales_dr.row(q);\n\n            int i = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n                kptr[0] = weight_xc_I[i];\n                kptr[1] = weight_xc_I[i + 1];\n                kptr[2] = weight_xc_I[i + 2];\n                kptr[3] = weight_xc_I[i + 3];\n                kptr[4] = weight_xc_F[i];\n                kptr[5] = weight_xc_F[i + 1];\n                kptr[6] = weight_xc_F[i + 2];\n                kptr[7] = weight_xc_F[i + 3];\n                kptr[8 + 0] = weight_xc_O[i];\n                kptr[8 + 1] = weight_xc_O[i + 1];\n                kptr[8 + 2] = weight_xc_O[i + 2];\n                kptr[8 + 3] = weight_xc_O[i + 3];\n                kptr[8 + 4] = weight_xc_G[i];\n                kptr[8 + 5] = weight_xc_G[i + 1];\n                kptr[8 + 6] = weight_xc_G[i + 2];\n                kptr[8 + 7] = weight_xc_G[i + 3];\n                kptr += 16;\n            }\n#else\n            for (; i + 7 < size; i += 8)\n            {\n                vst1_s8(kptr, vld1_s8(weight_xc_I + i));\n                vst1_s8(kptr + 8, vld1_s8(weight_xc_F + i));\n                vst1_s8(kptr + 16, vld1_s8(weight_xc_O + i));\n                vst1_s8(kptr + 24, vld1_s8(weight_xc_G + i));\n                kptr += 32;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < size; i += 2)\n            {\n                kptr[0] = weight_xc_I[i];\n                kptr[1] = weight_xc_I[i + 1];\n                kptr[2] = weight_xc_F[i];\n                kptr[3] = weight_xc_F[i + 1];\n                kptr[4] = weight_xc_O[i];\n                kptr[5] = weight_xc_O[i + 1];\n                kptr[6] = weight_xc_G[i];\n                kptr[7] = weight_xc_G[i + 1];\n                kptr += 8;\n            }\n#endif 
// __ARM_NEON\n            for (; i < size; i++)\n            {\n                kptr[0] = weight_xc_I[i];\n                kptr[1] = weight_xc_F[i];\n                kptr[2] = weight_xc_O[i];\n                kptr[3] = weight_xc_G[i];\n                kptr += 4;\n            }\n\n            i = 0;\n#if __ARM_NEON\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n                kptr[0] = weight_hc_I[i];\n                kptr[1] = weight_hc_I[i + 1];\n                kptr[2] = weight_hc_I[i + 2];\n                kptr[3] = weight_hc_I[i + 3];\n                kptr[4] = weight_hc_F[i];\n                kptr[5] = weight_hc_F[i + 1];\n                kptr[6] = weight_hc_F[i + 2];\n                kptr[7] = weight_hc_F[i + 3];\n                kptr[8 + 0] = weight_hc_O[i];\n                kptr[8 + 1] = weight_hc_O[i + 1];\n                kptr[8 + 2] = weight_hc_O[i + 2];\n                kptr[8 + 3] = weight_hc_O[i + 3];\n                kptr[8 + 4] = weight_hc_G[i];\n                kptr[8 + 5] = weight_hc_G[i + 1];\n                kptr[8 + 6] = weight_hc_G[i + 2];\n                kptr[8 + 7] = weight_hc_G[i + 3];\n                kptr += 16;\n            }\n#else\n            for (; i + 7 < num_output; i += 8)\n            {\n                vst1_s8(kptr, vld1_s8(weight_hc_I + i));\n                vst1_s8(kptr + 8, vld1_s8(weight_hc_F + i));\n                vst1_s8(kptr + 16, vld1_s8(weight_hc_O + i));\n                vst1_s8(kptr + 24, vld1_s8(weight_hc_G + i));\n                kptr += 32;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < num_output; i += 2)\n            {\n                kptr[0] = weight_hc_I[i];\n                kptr[1] = weight_hc_I[i + 1];\n                kptr[2] = weight_hc_F[i];\n                kptr[3] = weight_hc_F[i + 1];\n                kptr[4] = weight_hc_O[i];\n                kptr[5] = weight_hc_O[i + 1];\n                kptr[6] = weight_hc_G[i];\n           
     kptr[7] = weight_hc_G[i + 1];\n                kptr += 8;\n            }\n#endif // __ARM_NEON\n            for (; i < num_output; i++)\n            {\n                kptr[0] = weight_hc_I[i];\n                kptr[1] = weight_hc_F[i];\n                kptr[2] = weight_hc_O[i];\n                kptr[3] = weight_hc_G[i];\n                kptr += 4;\n            }\n\n            descales_ptr[0] = 1.f / weight_xc_int8_scales_ptr[hidden_size * 0 + q];\n            descales_ptr[1] = 1.f / weight_xc_int8_scales_ptr[hidden_size * 1 + q];\n            descales_ptr[2] = 1.f / weight_xc_int8_scales_ptr[hidden_size * 2 + q];\n            descales_ptr[3] = 1.f / weight_xc_int8_scales_ptr[hidden_size * 3 + q];\n            descales_ptr[4] = 1.f / weight_hc_int8_scales_ptr[hidden_size * 0 + q];\n            descales_ptr[5] = 1.f / weight_hc_int8_scales_ptr[hidden_size * 1 + q];\n            descales_ptr[6] = 1.f / weight_hc_int8_scales_ptr[hidden_size * 2 + q];\n            descales_ptr[7] = 1.f / weight_hc_int8_scales_ptr[hidden_size * 3 + q];\n        }\n    }\n}\n\nstatic void lstm_int8_gate_output(const Mat& gates, const Mat& weight_hr, Mat& hidden_state, Mat& tmp_hidden_state, Mat& cell_state, Mat& top_blob, int ti, int elemtype, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\n    if (ncnn::cpu_support_arm_vfpv4())\n    {\n        lstm_int8_gate_output_vfpv4(gates, weight_hr, hidden_state, tmp_hidden_state, cell_state, top_blob, ti, elemtype, opt);\n        return;\n    }\n#endif\n\n    const int num_output = top_blob.w;\n    const int hidden_size = cell_state.w;\n\n    // lstm unit\n    // sigmoid(I)\n    // sigmoid(F)\n    // sigmoid(O)\n    // tanh(G)\n    // c_t := f_t .* c_{t-1} + i_t .* g_t\n    // h_t := o_t .* tanh[c_t]\n    float* output_data = top_blob.row(ti);\n\n    float* cell_ptr = cell_state;\n    float* hidden_ptr = hidden_state;\n    float* tmp_hidden_ptr = tmp_hidden_state;\n\n    int 
remain_hidden_size_start = 0;\n#if __ARM_NEON\n    int nn_hidden_size = hidden_size >> 2;\n    remain_hidden_size_start = nn_hidden_size << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int qq = 0; qq < nn_hidden_size; qq++)\n    {\n        int q = qq * 4;\n\n        const float* gates_data = gates.row(q);\n\n        float32x4x4_t _IFOG_4x4 = vld4q_f32(gates_data);\n\n        float32x4_t _lstm_I = sigmoid_ps(_IFOG_4x4.val[0]);\n        float32x4_t _lstm_F = sigmoid_ps(_IFOG_4x4.val[1]);\n        float32x4_t _lstm_O = sigmoid_ps(_IFOG_4x4.val[2]);\n        float32x4_t _lstm_G = tanh_ps(_IFOG_4x4.val[3]);\n\n        float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));\n        float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));\n\n        vst1q_f32(cell_ptr + q, _cell2);\n\n        if (num_output == hidden_size)\n        {\n            vst1q_f32(hidden_ptr + q, _lstm_H);\n\n            if (elemtype == 1)\n            {\n                // fp32\n                vst1q_f32(output_data + q, _lstm_H);\n            }\n            if (elemtype == 2)\n            {\n                // fp16\n                unsigned short* outptr = (unsigned short*)output_data + q;\n#if (__ARM_FP & 2)\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"fcvtn  v0.4h, %2.4s        \\n\"\n                    \"st1    {v0.4h}, [%0]       \\n\"\n                    : \"=r\"(outptr) // %0\n                    : \"0\"(outptr),\n                    \"w\"(_lstm_H)\n                    : \"memory\", \"v0\");\n#else  // __aarch64__\n                asm volatile(\n                    \"vcvt.f16.f32 d0, %q2       \\n\"\n                    \"vst1.u16   {d0}, [%0]      \\n\"\n                    : \"=r\"(outptr) // %0\n                    : \"0\"(outptr),\n                    \"w\"(_lstm_H)\n                    : \"memory\", \"q0\");\n#endif // __aarch64__\n#else  // 
NCNN_GNU_INLINE_ASM\n                vst1_u16(outptr, (uint16x4_t)vcvt_f16_f32(_lstm_H));\n#endif // NCNN_GNU_INLINE_ASM\n#else\n                // outptr already points at element q, so index from 0\n                outptr[0] = float32_to_float16(hidden_ptr[q]);\n                outptr[1] = float32_to_float16(hidden_ptr[q + 1]);\n                outptr[2] = float32_to_float16(hidden_ptr[q + 2]);\n                outptr[3] = float32_to_float16(hidden_ptr[q + 3]);\n#endif // (__ARM_FP & 2)\n            }\n            if (elemtype == 4)\n            {\n                // bf16\n                vst1_u16((unsigned short*)output_data + q, float2bfloat(_lstm_H));\n            }\n        }\n        else\n        {\n            vst1q_f32(tmp_hidden_ptr + q, _lstm_H);\n        }\n    }\n#endif // __ARM_NEON\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = remain_hidden_size_start; q < hidden_size; q++)\n    {\n        const float* gates_data = gates.row(q);\n\n        float I = gates_data[0];\n        float F = gates_data[1];\n        float O = gates_data[2];\n        float G = gates_data[3];\n\n        I = 1.f / (1.f + expf(-I));\n        F = 1.f / (1.f + expf(-F));\n        O = 1.f / (1.f + expf(-O));\n        G = tanhf(G);\n\n        float cell2 = F * cell_ptr[q] + I * G;\n        float H = O * tanhf(cell2);\n\n        cell_ptr[q] = cell2;\n        if (num_output == hidden_size)\n        {\n            hidden_ptr[q] = H;\n\n            if (elemtype == 1)\n            {\n                output_data[q] = H;\n            }\n            if (elemtype == 2)\n            {\n                ((unsigned short*)output_data)[q] = float32_to_float16(H);\n            }\n            if (elemtype == 4)\n            {\n                ((unsigned short*)output_data)[q] = float32_to_bfloat16(H);\n            }\n        }\n        else\n        {\n            tmp_hidden_ptr[q] = H;\n        }\n    }\n\n    if (num_output != hidden_size)\n    {\n        // int nn_num_output = num_output >> 2;\n        // int 
remain_num_output_start = nn_num_output << 2;\n        // #pragma omp parallel for num_threads(opt.num_threads)\n        // for (int qq = 0; qq < nn_num_output; qq++)\n        // {\n        //     int q = qq * 4;\n        //\n        // }\n        int remain_num_output_start = 0;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const float* hr = weight_hr.row(q);\n            const float* tmp_hidden_ptr = tmp_hidden_state;\n\n            float H = 0;\n            for (int i = 0; i < hidden_size; i++)\n            {\n                H += tmp_hidden_ptr[i] * hr[i];\n            }\n\n            hidden_ptr[q] = H;\n\n            if (elemtype == 1)\n            {\n                output_data[q] = H;\n            }\n            if (elemtype == 2)\n            {\n                ((unsigned short*)output_data)[q] = float32_to_float16(H);\n            }\n            if (elemtype == 4)\n            {\n                ((unsigned short*)output_data)[q] = float32_to_bfloat16(H);\n            }\n        }\n    }\n}\n\nstatic void lstm_int8(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        lstm_int8_asimddp(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, reverse, weight_data_tm, weight_data_tm_int8_descales, bias_c, weight_hr, hidden_state, cell_state, opt);\n        return;\n    }\n#endif\n\n    int size = bottom_blob_int8.w;\n    int T = bottom_blob_int8.h;\n\n    int num_output = top_blob.w;\n    int hidden_size = cell_state.w;\n\n    // 4 x hidden_size\n    Mat gates(4, hidden_size, 4u, 
opt.workspace_allocator);\n\n    Mat tmp_hidden_state;\n    if (num_output != hidden_size)\n    {\n        tmp_hidden_state.create(hidden_size, 4u, opt.workspace_allocator);\n    }\n\n    Mat hidden_state_int8(num_output, (size_t)1u, 1, opt.workspace_allocator);\n    float hidden_state_int8_scale = 1.f;\n    float hidden_state_int8_descale = 1.f;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? T - 1 - t : t;\n\n        // dynamic quantize hidden_state\n        {\n            float absmax = 0.f;\n            for (int i = 0; i < num_output; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(hidden_state[i]));\n            }\n\n            if (absmax == 0.f)\n            {\n                hidden_state_int8.fill<signed char>(0);\n            }\n            else\n            {\n                hidden_state_int8_scale = 127.f / absmax;\n                hidden_state_int8_descale = absmax / 127.f;\n\n                signed char* hs = hidden_state_int8;\n                for (int i = 0; i < num_output; i++)\n                {\n                    hs[i] = float2int8(hidden_state[i] * hidden_state_int8_scale);\n                }\n            }\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const signed char* x = bottom_blob_int8.row<const signed char>(ti);\n            const signed char* hs = hidden_state_int8;\n            const float descale_x = bottom_blob_int8_descales[ti];\n            const float descale_h = hidden_state_int8_descale;\n\n            // gate reset update\n            const float* bias_c_IFOG = (const float*)bias_c + q * 4;\n\n            const signed char* kptr = weight_data_tm.row<const signed char>(q);\n            const float* descales_ptr = weight_data_tm_int8_descales.row(q);\n\n            float* gates_data = gates.row(q);\n\n#if __ARM_NEON\n            int32x4_t _lstm_IFOGx0 = 
vdupq_n_s32(0);\n            int i = 0;\n#if __ARM_FEATURE_DOTPROD\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            int32x4_t _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < size; i += 16)\n            {\n                int8x16_t _xi = vld1q_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n                _lstm_IFOGx0 = vdotq_laneq_s32(_lstm_IFOGx0, _w0, _xi, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _w1, _xi, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _w2, _xi, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _w3, _xi, 3);\n\n                kptr += 64;\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _lstm_IFOGx0 = vdotq_lane_s32(_lstm_IFOGx0, _w0, _xi, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w1, _xi, 1);\n\n                kptr += 32;\n            }\n            _lstm_IFOGx0 = vaddq_s32(_lstm_IFOGx0, _sum1);\n            _lstm_IFOGx0 = vaddq_s32(_lstm_IFOGx0, _sum2);\n            _lstm_IFOGx0 = vaddq_s32(_lstm_IFOGx0, _sum3);\n#else\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            int32x4_t _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < size; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* xptr = x + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16-d17}, [%0]     \\n\"\n                    \"vmull.s8   q4, d0, d16         \\n\"\n                    \"vmull.s8   
q5, d1, d16         \\n\"\n                    \"vmull.s8   q6, d2, d16         \\n\"\n                    \"vmull.s8   q7, d3, d16         \\n\"\n                    \"vmlal.s8   q4, d4, d17         \\n\"\n                    \"vmlal.s8   q5, d5, d17         \\n\"\n                    \"vmlal.s8   q6, d6, d17         \\n\"\n                    \"vmlal.s8   q7, d7, d17         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n                    \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                    \"vpadal.s16 %q5, q7             \\n\"\n                    : \"=r\"(xptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : \"0\"(xptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int8x16_t _xi = vld1q_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), vget_low_s8(_xi));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), vget_low_s8(_xi));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), vget_low_s8(_xi));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), vget_low_s8(_xi));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), vget_high_s8(_xi));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), vget_high_s8(_xi));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), vget_high_s8(_xi));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), vget_high_s8(_xi));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n           
     _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _xi);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _xi);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _xi);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _xi);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 32;\n            }\n            {\n                int32x4x2_t _tmp0 = vzipq_s32(_sum0, _sum1);\n                int32x4x2_t _tmp1 = vzipq_s32(_sum2, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_tmp0.val[0]), vget_low_s32(_tmp1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_tmp0.val[0]), vget_high_s32(_tmp1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_tmp0.val[1]), vget_low_s32(_tmp1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_tmp0.val[1]), vget_high_s32(_tmp1.val[1]));\n            }\n            _lstm_IFOGx0 = vaddq_s32(_lstm_IFOGx0, _sum0);\n            _lstm_IFOGx0 = vaddq_s32(_lstm_IFOGx0, _sum1);\n            _lstm_IFOGx0 = vaddq_s32(_lstm_IFOGx0, _sum2);\n            _lstm_IFOGx0 = vaddq_s32(_lstm_IFOGx0, _sum3);\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w = vld1q_s8(kptr);\n                _lstm_IFOGx0 = vdotq_lane_s32(_lstm_IFOGx0, _w, _xi, 0);\n#else\n                int16x4_t _xi01 = vreinterpret_s16_s8(vld1_s8(x + i));\n                int8x8_t _xi0 = 
vreinterpret_s8_s16(vdup_lane_s16(_xi01, 0));\n                int8x8_t _xi1 = vreinterpret_s8_s16(vdup_lane_s16(_xi01, 1));\n                int8x16_t _w01 = vld1q_s8(kptr);\n\n                int16x8_t _lstm_IFOGx = vmull_s8(vget_low_s8(_w01), _xi0);\n                _lstm_IFOGx = vmlal_s8(_lstm_IFOGx, vget_high_s8(_w01), _xi1);\n                _lstm_IFOGx0 = vpadalq_s16(_lstm_IFOGx0, _lstm_IFOGx);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 16;\n            }\n            for (; i + 1 < size; i += 2)\n            {\n                int8x8_t _xi = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(x + i)), 0));\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _lstm_IFOGx = vmull_s8(_w, _xi);\n                _lstm_IFOGx0 = vpadalq_s16(_lstm_IFOGx0, _lstm_IFOGx);\n\n                kptr += 8;\n            }\n            for (; i < size; i++)\n            {\n                int8x8_t _xi = vdup_n_s8(x[i]);\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _lstm_IFOGx = vmull_s8(_w, _xi);\n                _lstm_IFOGx0 = vaddw_s16(_lstm_IFOGx0, vget_low_s16(_lstm_IFOGx));\n\n                kptr += 4;\n            }\n\n            int32x4_t _lstm_IFOGh0 = vdupq_n_s32(0);\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < num_output; i += 16)\n            {\n                int8x16_t _h_cont = vld1q_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n                _lstm_IFOGh0 = vdotq_laneq_s32(_lstm_IFOGh0, _w0, _h_cont, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _w1, _h_cont, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _w2, _h_cont, 2);\n                _sum3 = 
vdotq_laneq_s32(_sum3, _w3, _h_cont, 3);\n\n                kptr += 64;\n            }\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _lstm_IFOGh0 = vdotq_lane_s32(_lstm_IFOGh0, _w0, _h_cont, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w1, _h_cont, 1);\n\n                kptr += 32;\n            }\n            _lstm_IFOGh0 = vaddq_s32(_lstm_IFOGh0, _sum1);\n            _lstm_IFOGh0 = vaddq_s32(_lstm_IFOGh0, _sum2);\n            _lstm_IFOGh0 = vaddq_s32(_lstm_IFOGh0, _sum3);\n#else\n            _sum0 = vdupq_n_s32(0);\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < num_output; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* hsptr = hs + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16-d17}, [%0]     \\n\"\n                    \"vmull.s8   q4, d0, d16         \\n\"\n                    \"vmull.s8   q5, d1, d16         \\n\"\n                    \"vmull.s8   q6, d2, d16         \\n\"\n                    \"vmull.s8   q7, d3, d16         \\n\"\n                    \"vmlal.s8   q4, d4, d17         \\n\"\n                    \"vmlal.s8   q5, d5, d17         \\n\"\n                    \"vmlal.s8   q6, d6, d17         \\n\"\n                    \"vmlal.s8   q7, d7, d17         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n                    \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                    \"vpadal.s16 %q5, q7             \\n\"\n                    : \"=r\"(hsptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : 
\"0\"(hsptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int8x16_t _h_cont = vld1q_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), vget_low_s8(_h_cont));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), vget_low_s8(_h_cont));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), vget_low_s8(_h_cont));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), vget_low_s8(_h_cont));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), vget_high_s8(_h_cont));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), vget_high_s8(_h_cont));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), vget_high_s8(_h_cont));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), vget_high_s8(_h_cont));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _h_cont);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _h_cont);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _h_cont);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _h_cont);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = 
vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 32;\n            }\n            {\n                int32x4x2_t _tmp0 = vzipq_s32(_sum0, _sum1);\n                int32x4x2_t _tmp1 = vzipq_s32(_sum2, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_tmp0.val[0]), vget_low_s32(_tmp1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_tmp0.val[0]), vget_high_s32(_tmp1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_tmp0.val[1]), vget_low_s32(_tmp1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_tmp0.val[1]), vget_high_s32(_tmp1.val[1]));\n            }\n            _lstm_IFOGh0 = vaddq_s32(_lstm_IFOGh0, _sum0);\n            _lstm_IFOGh0 = vaddq_s32(_lstm_IFOGh0, _sum1);\n            _lstm_IFOGh0 = vaddq_s32(_lstm_IFOGh0, _sum2);\n            _lstm_IFOGh0 = vaddq_s32(_lstm_IFOGh0, _sum3);\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w = vld1q_s8(kptr);\n                _lstm_IFOGh0 = vdotq_lane_s32(_lstm_IFOGh0, _w, _h_cont, 0);\n#else\n                int16x4_t _h_cont01 = vreinterpret_s16_s8(vld1_s8(hs + i));\n                int8x8_t _h_cont0 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 0));\n                int8x8_t _h_cont1 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 1));\n                int8x16_t _w01 = vld1q_s8(kptr);\n\n                int16x8_t _lstm_IFOGh = vmull_s8(vget_low_s8(_w01), _h_cont0);\n                _lstm_IFOGh = vmlal_s8(_lstm_IFOGh, vget_high_s8(_w01), _h_cont1);\n                _lstm_IFOGh0 = vpadalq_s16(_lstm_IFOGh0, _lstm_IFOGh);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 16;\n            }\n            for (; i + 1 < num_output; i += 2)\n            {\n                int8x8_t _h_cont = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(hs + i)), 
0));\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _lstm_IFOGh = vmull_s8(_w, _h_cont);\n                _lstm_IFOGh0 = vpadalq_s16(_lstm_IFOGh0, _lstm_IFOGh);\n\n                kptr += 8;\n            }\n            for (; i < num_output; i++)\n            {\n                int8x8_t _h_cont = vdup_n_s8(hs[i]);\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _lstm_IFOGh = vmull_s8(_w, _h_cont);\n                _lstm_IFOGh0 = vaddw_s16(_lstm_IFOGh0, vget_low_s16(_lstm_IFOGh));\n\n                kptr += 4;\n            }\n\n            float32x4_t _descale_x = vdupq_n_f32(descale_x);\n            float32x4_t _descale_h = vdupq_n_f32(descale_h);\n\n            float32x4_t _lstm_IFOG0 = vld1q_f32(bias_c_IFOG);\n\n            float32x4_t _descale_xc_IFOG = vld1q_f32(descales_ptr);\n\n            _lstm_IFOG0 = vmlaq_f32(_lstm_IFOG0, vcvtq_f32_s32(_lstm_IFOGx0), vmulq_f32(_descale_x, _descale_xc_IFOG));\n\n            float32x4_t _descale_hc_IFOG = vld1q_f32(descales_ptr + 4);\n\n            _lstm_IFOG0 = vmlaq_f32(_lstm_IFOG0, vcvtq_f32_s32(_lstm_IFOGh0), vmulq_f32(_descale_h, _descale_hc_IFOG));\n\n            vst1q_f32(gates_data, _lstm_IFOG0);\n#else\n            int Ix = 0;\n            int Fx = 0;\n            int Ox = 0;\n            int Gx = 0;\n            for (int i = 0; i < size; i++)\n            {\n                signed char xi = x[i];\n\n                Ix += kptr[0] * xi;\n                Fx += kptr[1] * xi;\n                Ox += kptr[2] * xi;\n                Gx += kptr[3] * xi;\n\n                kptr += 4;\n            }\n\n            int Ih = 0;\n            int Fh = 0;\n            int Oh = 0;\n            int Gh = 0;\n            for (int i = 0; i < num_output; i++)\n            {\n                signed char h_cont = hs[i];\n\n                Ih += kptr[0] * h_cont;\n                Fh += kptr[1] * h_cont;\n                Oh += kptr[2] * h_cont;\n                Gh += kptr[3] * 
h_cont;\n\n                kptr += 4;\n            }\n\n            const float descale_xc_I = descales_ptr[0];\n            const float descale_xc_F = descales_ptr[1];\n            const float descale_xc_O = descales_ptr[2];\n            const float descale_xc_G = descales_ptr[3];\n            const float descale_hc_I = descales_ptr[4];\n            const float descale_hc_F = descales_ptr[5];\n            const float descale_hc_O = descales_ptr[6];\n            const float descale_hc_G = descales_ptr[7];\n\n            float I = bias_c_IFOG[0] + Ix * (descale_x * descale_xc_I) + Ih * (descale_h * descale_hc_I);\n            float F = bias_c_IFOG[1] + Fx * (descale_x * descale_xc_F) + Fh * (descale_h * descale_hc_F);\n            float O = bias_c_IFOG[2] + Ox * (descale_x * descale_xc_O) + Oh * (descale_h * descale_hc_O);\n            float G = bias_c_IFOG[3] + Gx * (descale_x * descale_xc_G) + Gh * (descale_h * descale_hc_G);\n\n            gates_data[0] = I;\n            gates_data[1] = F;\n            gates_data[2] = O;\n            gates_data[3] = G;\n#endif // __ARM_NEON\n        }\n\n        lstm_int8_gate_output(gates, weight_hr, hidden_state, tmp_hidden_state, cell_state, top_blob, ti, elemtype, opt);\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/matmul_arm.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"matmul_arm.h\"\n\n#include \"layer_type.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nMatMul_arm::MatMul_arm()\n{\n#if __ARM_NEON\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n\n    gemm = 0;\n}\n\nint MatMul_arm::create_pipeline(const Option& opt)\n{\n    gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n\n    ncnn::ParamDict pd;\n    pd.set(2, 0);      // transA\n    pd.set(3, transB); // transB\n    pd.set(4, 0);      // constantA\n    pd.set(5, 0);      // constantB\n    pd.set(6, 1);      // constantC\n    pd.set(7, 0);      // M = outch\n    pd.set(8, 0);      // N = size\n    pd.set(9, 0);      // K = maxk*inch\n    pd.set(10, -1);    // constant_broadcast_type_C = null\n    pd.set(11, 0);     // output_N1M\n    pd.set(12, 1);     // output_elempack\n\n    gemm->load_param(pd);\n\n    gemm->load_model(ModelBinFromMatArray(0));\n\n    gemm->create_pipeline(opt);\n\n    return 0;\n}\n\nint MatMul_arm::destroy_pipeline(const Option& opt)\n{\n    if (gemm)\n    {\n        gemm->destroy_pipeline(opt);\n        delete gemm;\n        gemm = 0;\n    }\n\n    return 0;\n}\n\nint MatMul_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int Adims = A.dims;\n    const int Bdims = B.dims;\n    const int max_ABdims = std::max(Adims, Bdims);\n    const size_t elemsize = A.elemsize;\n\n    if (Adims == 1 && Bdims == 1)\n    {\n        // dot product\n        std::vector<Mat> _bottom_blobs(2);\n        _bottom_blobs[0] = A.reshape(A.w, 1);\n        _bottom_blobs[1] = transB ? 
B.reshape(B.w, 1) : B.reshape(1, B.w);\n        gemm->forward(_bottom_blobs, top_blobs, opt);\n\n        top_blob = top_blob.reshape(1, opt.blob_allocator);\n    }\n    else if (Adims == 2 && Bdims == 2)\n    {\n        // matrix multiply\n        gemm->forward(bottom_blobs, top_blobs, opt);\n    }\n    else if (Adims == 1 && Bdims == 2)\n    {\n        // matrix multiply\n        std::vector<Mat> _bottom_blobs(2);\n        _bottom_blobs[0] = A.reshape(A.w, 1);\n        _bottom_blobs[1] = B;\n        gemm->forward(_bottom_blobs, top_blobs, opt);\n\n        top_blob = top_blob.reshape(top_blob.w, opt.blob_allocator);\n    }\n    else if (Adims == 2 && Bdims == 1)\n    {\n        // matrix multiply\n        std::vector<Mat> _bottom_blobs(2);\n        _bottom_blobs[0] = A;\n        _bottom_blobs[1] = transB ? B.reshape(B.w, 1) : B.reshape(1, B.w);\n        gemm->forward(_bottom_blobs, top_blobs, opt);\n\n        top_blob = top_blob.reshape(top_blob.h, opt.blob_allocator);\n    }\n    else if (Adims == 1 && Bdims > 2)\n    {\n        // batched matrix multiply\n        const int N = transB == 0 ? 
B.w : B.h;\n        const int batch_size = B.d * B.c;\n\n        Mat top_blob1(N, 1, batch_size, elemsize, opt.blob_allocator);\n        if (top_blob1.empty())\n            return -100;\n\n        Mat A1 = A.reshape(A.w, 1);\n        Mat B1 = B.reshape(B.w, B.h, batch_size);\n\n        for (int p = 0; p < batch_size; p++)\n        {\n            std::vector<Mat> _bottom_blobs(2);\n            _bottom_blobs[0] = A1;\n            _bottom_blobs[1] = B1.channel(p);\n            std::vector<Mat> _top_blobs(1);\n            _top_blobs[0] = top_blob1.channel(p);\n            gemm->forward(_bottom_blobs, _top_blobs, opt);\n        }\n\n        if (Bdims == 3)\n            top_blob = top_blob1.reshape(N, B.d * B.c, opt.blob_allocator);\n        else\n            top_blob = top_blob1.reshape(N, B.d, B.c, opt.blob_allocator);\n    }\n    else if (Adims > 2 && Bdims == 1)\n    {\n        // batched matrix multiply\n        const int M = A.h;\n        const int batch_size = A.d * A.c;\n\n        Mat top_blob1(1, M, batch_size, elemsize, opt.blob_allocator);\n        if (top_blob1.empty())\n            return -100;\n\n        Mat A1 = A.reshape(A.w, A.h, batch_size);\n        Mat BT = transB ? B.reshape(B.w, 1) : B.reshape(1, B.w);\n\n        for (int p = 0; p < batch_size; p++)\n        {\n            std::vector<Mat> _bottom_blobs(2);\n            _bottom_blobs[0] = A1.channel(p);\n            _bottom_blobs[1] = BT;\n            std::vector<Mat> _top_blobs(1);\n            _top_blobs[0] = top_blob1.channel(p);\n            gemm->forward(_bottom_blobs, _top_blobs, opt);\n        }\n\n        if (Adims == 3)\n            top_blob = top_blob1.reshape(M, A.d * A.c, opt.blob_allocator);\n        else\n            top_blob = top_blob1.reshape(M, A.d, A.c, opt.blob_allocator);\n    }\n    else if (max_ABdims == 3)\n    {\n        Mat A1 = Adims == 2 ? A.reshape(A.w, A.h, 1) : A;\n        Mat B1 = Bdims == 2 ? 
B.reshape(B.w, B.h, 1) : B;\n\n        const int M = A1.h;\n        const int N = transB == 0 ? B1.w : B1.h;\n        const int batch_size = std::max(A1.c, B1.c);\n\n        top_blob.create(N, M, batch_size, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        for (int p = 0; p < batch_size; p++)\n        {\n            int Ap = A1.c == 1 ? 0 : p;\n            int Bp = B1.c == 1 ? 0 : p;\n\n            std::vector<Mat> _bottom_blobs(2);\n            _bottom_blobs[0] = A1.channel(Ap);\n            _bottom_blobs[1] = B1.channel(Bp);\n            std::vector<Mat> _top_blobs(1);\n            _top_blobs[0] = top_blob.channel(p);\n            gemm->forward(_bottom_blobs, _top_blobs, opt);\n        }\n    }\n    else if (max_ABdims == 4)\n    {\n        Mat A1 = Adims == 3 ? A.reshape(A.w, A.h, A.c, 1) : A;\n        Mat B1 = Bdims == 3 ? B.reshape(B.w, B.h, B.c, 1) : B;\n\n        const int M = A1.h;\n        const int N = transB == 0 ? B1.w : B1.h;\n        const int batch_size_d = std::max(A1.d, B1.d);\n        const int batch_size_c = std::max(A1.c, B1.c);\n\n        top_blob.create(N, M, batch_size_d, batch_size_c, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        for (int p = 0; p < batch_size_c; p++)\n        {\n            int Ap = A1.c == 1 ? 0 : p;\n            int Bp = B1.c == 1 ? 0 : p;\n\n            for (int q = 0; q < batch_size_d; q++)\n            {\n                int Ad = A1.d == 1 ? 0 : q;\n                int Bd = B1.d == 1 ? 
0 : q;\n\n                std::vector<Mat> _bottom_blobs(2);\n                _bottom_blobs[0] = A1.channel(Ap).depth(Ad);\n                _bottom_blobs[1] = B1.channel(Bp).depth(Bd);\n                std::vector<Mat> _top_blobs(1);\n                _top_blobs[0] = top_blob.channel(p).depth(q);\n                gemm->forward(_bottom_blobs, _top_blobs, opt);\n            }\n        }\n    }\n    else\n    {\n        NCNN_LOGE(\"impossible matmul %d %d\", Adims, Bdims);\n        return -1;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/matmul_arm.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_MATMUL_ARM_H\n#define LAYER_MATMUL_ARM_H\n\n#include \"matmul.h\"\n\nnamespace ncnn {\n\nclass MatMul_arm : public MatMul\n{\npublic:\n    MatMul_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    Layer* gemm;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_MATMUL_ARM_H\n"
  },
  {
    "path": "src/layer/arm/mish_arm.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"mish_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nMish_arm::Mish_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Mish_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                _p = vmulq_f32(_p, tanh_ps(log_ps(vaddq_f32(exp_ps(_p), vdupq_n_f32(1.f)))));\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = 
bottom_top_blob.channel(q);\n\n#if __ARM_NEON\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vmulq_f32(_p, tanh_ps(log_ps(vaddq_f32(exp_ps(_p), vdupq_n_f32(1.f)))));\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; remain > 0; remain--)\n        {\n            *ptr = *ptr * tanhf(logf(expf(*ptr) + 1.f));\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Mish_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                _p = vmulq_f32(_p, tanh_ps(log_ps(vaddq_f32(exp_ps(_p), vdupq_n_f32(1.f)))));\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = bottom_top_blob.channel(q);\n\n#if __ARM_NEON\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = vmulq_f32(_p, 
tanh_ps(log_ps(vaddq_f32(exp_ps(_p), vdupq_n_f32(1.f)))));\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; remain > 0; remain--)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            v = v * tanhf(logf(expf(v) + 1.f));\n            *ptr = float32_to_bfloat16(v);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/mish_arm.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_MISH_ARM_H\n#define LAYER_MISH_ARM_H\n\n#include \"mish.h\"\n\nnamespace ncnn {\n\nclass Mish_arm : public Mish\n{\npublic:\n    Mish_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_MISH_ARM_H\n"
  },
  {
    "path": "src/layer/arm/mish_arm_asimdhp.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"mish_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#include \"neon_mathfun.h\"\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"neon_mathfun_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Mish_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                _p = vmulq_f32(_p, tanh_ps(log_ps(vaddq_f32(exp_ps(_p), vdupq_n_f32(1.f)))));\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _p = vmulq_f32(_p, tanh_ps(log_ps(vaddq_f32(exp_ps(_p), vdupq_n_f32(1.f)))));\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)*ptr;\n            v = v * tanhf(logf(expf(v) + 1.f));\n            *ptr = (__fp16)v;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint Mish_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) 
const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 8)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                _p = vmulq_f16(_p, tanh_ps_f16(log_ps_f16(vaddq_f16(exp_ps_f16(_p), vdupq_n_f16(1.f)))));\n                vst1q_f16(ptr, _p);\n\n                ptr += 8;\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                _p = vmul_f16(_p, tanh_ps_f16(log_ps_f16(vadd_f16(exp_ps_f16(_p), vdup_n_f16(1.f)))));\n                vst1_f16(ptr, _p);\n\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = vmul_f16(_p, tanh_ps_f16(log_ps_f16(vadd_f16(exp_ps_f16(_p), vdup_n_f16(1.f)))));\n            vst1_f16(ptr, _p);\n\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = *ptr;\n            v = v * (__fp16)tanhf(logf(expf(v) + 1.f));\n            *ptr = v;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // 
__ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/multiheadattention_arm.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"multiheadattention_arm.h\"\n\n#include \"cpu.h\"\n#include \"layer_type.h\"\n\nnamespace ncnn {\n\nMultiHeadAttention_arm::MultiHeadAttention_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n    support_bf16_storage = false; // TODO enable bf16 when gemm has proper out_elemtype support\n\n    q_gemm = 0;\n    k_gemm = 0;\n    v_gemm = 0;\n    o_gemm = 0;\n\n    qk_gemm = 0;\n    qkv_gemm = 0;\n\n    qk_softmax = 0;\n}\n\nint MultiHeadAttention_arm::create_pipeline(const Option& _opt)\n{\n    Option opt = _opt;\n    opt.use_fp16_storage &= support_fp16_storage;\n    opt.use_bf16_storage &= support_bf16_storage;\n\n    {\n        qk_softmax = ncnn::create_layer_cpu(ncnn::LayerType::Softmax);\n        ncnn::ParamDict pd;\n        pd.set(0, -1);\n        pd.set(1, 1);\n        qk_softmax->load_param(pd);\n        qk_softmax->load_model(ModelBinFromMatArray(0));\n        qk_softmax->create_pipeline(opt);\n    }\n\n    const int qdim = weight_data_size / embed_dim;\n\n    {\n        q_gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n        ncnn::ParamDict pd;\n        pd.set(0, scale);\n        pd.set(1, 1.f);\n        pd.set(2, 0);         // transA\n        pd.set(3, 1);         // transB\n        pd.set(4, 1);         // constantA\n        pd.set(5, 0);         // constantB\n        pd.set(6, 1);         // constantC\n        pd.set(7, embed_dim); // M\n        pd.set(8, 0);         // N\n        pd.set(9, qdim);      // K\n        pd.set(10, 1);        // constant_broadcast_type_C\n        pd.set(11, 0);        // output_N1M\n        pd.set(12, 1);        // output_elempack\n        pd.set(14, 0);        // output_transpose\n#if NCNN_INT8\n        pd.set(18, int8_scale_term);\n#endif\n        q_gemm->load_param(pd);\n        Mat weights[3];\n        
weights[0] = q_weight_data;\n        weights[1] = q_bias_data;\n#if NCNN_INT8\n        weights[2] = q_weight_data_int8_scales;\n#endif\n        q_gemm->load_model(ModelBinFromMatArray(weights));\n        q_gemm->create_pipeline(opt);\n\n        if (opt.lightmode)\n        {\n            q_weight_data.release();\n            q_bias_data.release();\n        }\n    }\n\n    {\n        k_gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n        ncnn::ParamDict pd;\n        pd.set(2, 0);         // transA\n        pd.set(3, 1);         // transB\n        pd.set(4, 1);         // constantA\n        pd.set(5, 0);         // constantB\n        pd.set(6, 1);         // constantC\n        pd.set(7, embed_dim); // M\n        pd.set(8, 0);         // N\n        pd.set(9, kdim);      // K\n        pd.set(10, 1);        // constant_broadcast_type_C\n        pd.set(11, 0);        // output_N1M\n        pd.set(12, 1);        // output_elempack\n        pd.set(14, 0);        // output_transpose\n#if NCNN_INT8\n        pd.set(18, int8_scale_term);\n#endif\n        k_gemm->load_param(pd);\n        Mat weights[3];\n        weights[0] = k_weight_data;\n        weights[1] = k_bias_data;\n#if NCNN_INT8\n        weights[2] = k_weight_data_int8_scales;\n#endif\n        k_gemm->load_model(ModelBinFromMatArray(weights));\n        k_gemm->create_pipeline(opt);\n\n        if (opt.lightmode)\n        {\n            k_weight_data.release();\n            k_bias_data.release();\n        }\n    }\n\n    {\n        v_gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n        ncnn::ParamDict pd;\n        pd.set(2, 0);         // transA\n        pd.set(3, 1);         // transB\n        pd.set(4, 1);         // constantA\n        pd.set(5, 0);         // constantB\n        pd.set(6, 1);         // constantC\n        pd.set(7, embed_dim); // M\n        pd.set(8, 0);         // N\n        pd.set(9, vdim);      // K\n        pd.set(10, 1);        // constant_broadcast_type_C\n        pd.set(11, 
0);        // output_N1M\n        pd.set(12, 1);        // output_elempack\n        pd.set(14, 0);        // output_transpose\n#if NCNN_INT8\n        pd.set(18, int8_scale_term);\n#endif\n        v_gemm->load_param(pd);\n        Mat weights[3];\n        weights[0] = v_weight_data;\n        weights[1] = v_bias_data;\n#if NCNN_INT8\n        weights[2] = v_weight_data_int8_scales;\n#endif\n        v_gemm->load_model(ModelBinFromMatArray(weights));\n        v_gemm->create_pipeline(opt);\n\n        if (opt.lightmode)\n        {\n            v_weight_data.release();\n            v_bias_data.release();\n        }\n    }\n\n    {\n        o_gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n        ncnn::ParamDict pd;\n        pd.set(2, 1);         // transA\n        pd.set(3, 1);         // transB\n        pd.set(4, 0);         // constantA\n        pd.set(5, 1);         // constantB\n        pd.set(6, 1);         // constantC\n        pd.set(7, 0);         // M\n        pd.set(8, qdim);      // N\n        pd.set(9, embed_dim); // K\n        pd.set(10, 4);        // constant_broadcast_type_C\n        pd.set(11, 0);        // output_N1M\n#if NCNN_INT8\n        pd.set(18, int8_scale_term);\n#endif\n        o_gemm->load_param(pd);\n        Mat weights[3];\n        weights[0] = out_weight_data;\n        weights[1] = out_bias_data;\n#if NCNN_INT8\n        Mat out_weight_data_int8_scales(1);\n        out_weight_data_int8_scales[0] = out_weight_data_int8_scale;\n        weights[2] = out_weight_data_int8_scales;\n#endif\n        o_gemm->load_model(ModelBinFromMatArray(weights));\n        o_gemm->create_pipeline(opt);\n\n        if (opt.lightmode)\n        {\n            out_weight_data.release();\n            out_bias_data.release();\n        }\n    }\n\n    {\n        qk_gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n        ncnn::ParamDict pd;\n        pd.set(2, 1);                   // transA\n        pd.set(3, 0);                  
 // transB\n        pd.set(4, 0);                   // constantA\n        pd.set(5, 0);                   // constantB\n        pd.set(6, attn_mask ? 0 : 1);   // constantC\n        pd.set(7, 0);                   // M\n        pd.set(8, 0);                   // N\n        pd.set(9, 0);                   // K\n        pd.set(10, attn_mask ? 3 : -1); // constant_broadcast_type_C\n        pd.set(11, 0);                  // output_N1M\n        pd.set(12, 1);                  // output_elempack\n#if NCNN_INT8\n        pd.set(18, int8_scale_term);\n#endif\n        qk_gemm->load_param(pd);\n        qk_gemm->load_model(ModelBinFromMatArray(0));\n        Option opt1 = opt;\n        opt1.num_threads = 1;\n        qk_gemm->create_pipeline(opt1);\n    }\n\n    {\n        qkv_gemm = ncnn::create_layer_cpu(ncnn::LayerType::Gemm);\n        ncnn::ParamDict pd;\n        pd.set(2, 0);   // transA\n        pd.set(3, 1);   // transB\n        pd.set(4, 0);   // constantA\n        pd.set(5, 0);   // constantB\n        pd.set(6, 1);   // constantC\n        pd.set(7, 0);   // M\n        pd.set(8, 0);   // N\n        pd.set(9, 0);   // K\n        pd.set(10, -1); // constant_broadcast_type_C\n        pd.set(11, 0);  // output_N1M\n        pd.set(12, 1);  // output_elempack\n        pd.set(14, 1);  // output_transpose\n#if NCNN_INT8\n        pd.set(18, int8_scale_term);\n#endif\n        qkv_gemm->load_param(pd);\n        qkv_gemm->load_model(ModelBinFromMatArray(0));\n        Option opt1 = opt;\n        opt1.num_threads = 1;\n        qkv_gemm->create_pipeline(opt1);\n    }\n\n    return 0;\n}\n\nint MultiHeadAttention_arm::destroy_pipeline(const Option& _opt)\n{\n    Option opt = _opt;\n    opt.use_fp16_storage &= support_fp16_storage;\n    opt.use_bf16_storage &= support_bf16_storage;\n\n    if (qk_softmax)\n    {\n        qk_softmax->destroy_pipeline(opt);\n        delete qk_softmax;\n        qk_softmax = 0;\n    }\n\n    if (q_gemm)\n    {\n        q_gemm->destroy_pipeline(opt);\n        
delete q_gemm;\n        q_gemm = 0;\n    }\n\n    if (k_gemm)\n    {\n        k_gemm->destroy_pipeline(opt);\n        delete k_gemm;\n        k_gemm = 0;\n    }\n\n    if (v_gemm)\n    {\n        v_gemm->destroy_pipeline(opt);\n        delete v_gemm;\n        v_gemm = 0;\n    }\n\n    if (o_gemm)\n    {\n        o_gemm->destroy_pipeline(opt);\n        delete o_gemm;\n        o_gemm = 0;\n    }\n\n    if (qk_gemm)\n    {\n        qk_gemm->destroy_pipeline(opt);\n        delete qk_gemm;\n        qk_gemm = 0;\n    }\n\n    if (qkv_gemm)\n    {\n        qkv_gemm->destroy_pipeline(opt);\n        delete qkv_gemm;\n        qkv_gemm = 0;\n    }\n\n    return 0;\n}\n\nint MultiHeadAttention_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& _opt) const\n{\n    int q_blob_i = 0;\n    int k_blob_i = 0;\n    int v_blob_i = 0;\n    int attn_mask_i = 0;\n    int cached_xk_i = 0;\n    int cached_xv_i = 0;\n    resolve_bottom_blob_index((int)bottom_blobs.size(), q_blob_i, k_blob_i, v_blob_i, attn_mask_i, cached_xk_i, cached_xv_i);\n\n    const Mat& q_blob = bottom_blobs[q_blob_i];\n    const Mat& k_blob = bottom_blobs[k_blob_i];\n    const Mat& v_blob = bottom_blobs[v_blob_i];\n    const Mat& attn_mask_blob = attn_mask ? bottom_blobs[attn_mask_i] : Mat();\n    const Mat& cached_xk_blob = kv_cache ? bottom_blobs[cached_xk_i] : Mat();\n    const Mat& cached_xv_blob = kv_cache ? 
bottom_blobs[cached_xv_i] : Mat();\n\n    Option opt = _opt;\n    opt.use_fp16_storage &= support_fp16_storage;\n    opt.use_bf16_storage &= support_bf16_storage;\n\n    Mat attn_mask_blob_unpacked;\n    if (attn_mask && attn_mask_blob.elempack != 1)\n    {\n        convert_packing(attn_mask_blob, attn_mask_blob_unpacked, 1, opt);\n        if (attn_mask_blob_unpacked.empty())\n            return -100;\n    }\n    else\n    {\n        attn_mask_blob_unpacked = attn_mask_blob;\n    }\n\n    Mat cached_xk_blob_unpacked;\n    if (kv_cache && !cached_xk_blob.empty() && cached_xk_blob.elempack != 1)\n    {\n        convert_packing(cached_xk_blob, cached_xk_blob_unpacked, 1, opt);\n        if (cached_xk_blob_unpacked.empty())\n            return -100;\n    }\n    else\n    {\n        cached_xk_blob_unpacked = cached_xk_blob;\n    }\n\n    Mat cached_xv_blob_unpacked;\n    if (kv_cache && !cached_xv_blob.empty() && cached_xv_blob.elempack != 1)\n    {\n        convert_packing(cached_xv_blob, cached_xv_blob_unpacked, 1, opt);\n        if (cached_xv_blob_unpacked.empty())\n            return -100;\n    }\n    else\n    {\n        cached_xv_blob_unpacked = cached_xv_blob;\n    }\n\n    const int embed_dim_per_head = embed_dim / num_heads;\n    const int src_seqlen = q_blob.h * q_blob.elempack;\n    const int cur_seqlen = k_blob.h * k_blob.elempack;\n    const int past_seqlen = kv_cache && !cached_xk_blob_unpacked.empty() ? cached_xk_blob_unpacked.w : 0;\n    const int dst_seqlen = past_seqlen > 0 ? (q_blob_i == k_blob_i ? 
(past_seqlen + cur_seqlen) : past_seqlen) : cur_seqlen;\n\n    // const int elembits = q_blob.elembits();\n\n    size_t elemsize = q_blob.elemsize / q_blob.elempack;\n\n    Mat q_affine;\n    int retq = q_gemm->forward(q_blob, q_affine, opt);\n    if (retq != 0)\n        return retq;\n\n    Mat k_affine;\n    if (past_seqlen > 0)\n    {\n        if (q_blob_i == k_blob_i)\n        {\n            Mat k_affine_q;\n            int retk = k_gemm->forward(q_blob, k_affine_q, opt);\n            if (retk != 0)\n                return retk;\n\n            // assert dst_seqlen == cached_xk_blob_unpacked.w + k_affine_q.w\n\n            // merge cached_xk_blob_unpacked and k_affine_q\n            k_affine.create(dst_seqlen, embed_dim, k_affine_q.elemsize);\n            if (k_affine.empty())\n                return -100;\n\n            for (int i = 0; i < embed_dim; i++)\n            {\n                const unsigned char* ptr = cached_xk_blob_unpacked.row<const unsigned char>(i);\n                const unsigned char* ptrq = k_affine_q.row<const unsigned char>(i);\n                unsigned char* outptr = k_affine.row<unsigned char>(i);\n\n                memcpy(outptr, ptr, past_seqlen * k_affine.elemsize);\n                memcpy(outptr + past_seqlen * k_affine.elemsize, ptrq, cur_seqlen * k_affine.elemsize);\n            }\n        }\n        else\n        {\n            k_affine = cached_xk_blob_unpacked;\n        }\n    }\n    else\n    {\n        int retk = k_gemm->forward(k_blob, k_affine, opt);\n        if (retk != 0)\n            return retk;\n    }\n\n    Mat qk_cross(dst_seqlen, src_seqlen * num_heads, elemsize, opt.blob_allocator);\n    if (qk_cross.empty())\n        return -100;\n\n    std::vector<int> retqks;\n    retqks.resize(num_heads);\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int i = 0; i < num_heads; i++)\n    {\n        std::vector<Mat> qk_bottom_blobs(2);\n        qk_bottom_blobs[0] = q_affine.row_range(i * embed_dim_per_head, 
embed_dim_per_head);\n        qk_bottom_blobs[1] = k_affine.row_range(i * embed_dim_per_head, embed_dim_per_head);\n        if (attn_mask)\n        {\n            const Mat& maskm = attn_mask_blob_unpacked.dims == 3 ? attn_mask_blob_unpacked.channel(i) : attn_mask_blob_unpacked;\n            qk_bottom_blobs.push_back(maskm);\n        }\n        std::vector<Mat> qk_top_blobs(1);\n        qk_top_blobs[0] = qk_cross.row_range(i * src_seqlen, src_seqlen);\n        Option opt1 = opt;\n        opt1.num_threads = 1;\n        retqks[i] = qk_gemm->forward(qk_bottom_blobs, qk_top_blobs, opt1);\n    }\n    for (int i = 0; i < num_heads; i++)\n    {\n        if (retqks[i] != 0)\n            return retqks[i];\n    }\n\n    q_affine.release();\n\n    if (!kv_cache)\n    {\n        k_affine.release();\n    }\n\n    int retqk = qk_softmax->forward_inplace(qk_cross, opt);\n    if (retqk != 0)\n        return retqk;\n\n    Mat v_affine;\n    if (past_seqlen > 0)\n    {\n        if (q_blob_i == v_blob_i)\n        {\n            Mat v_affine_q;\n            int retk = v_gemm->forward(v_blob, v_affine_q, opt);\n            if (retk != 0)\n                return retk;\n\n            // assert dst_seqlen == cached_xv_blob_unpacked.w + v_affine_q.w\n\n            // merge cached_xv_blob_unpacked and v_affine_q\n            v_affine.create(dst_seqlen, embed_dim, v_affine_q.elemsize);\n            if (v_affine.empty())\n                return -100;\n\n            for (int i = 0; i < embed_dim; i++)\n            {\n                const unsigned char* ptr = cached_xv_blob_unpacked.row<const unsigned char>(i);\n                const unsigned char* ptrq = v_affine_q.row<const unsigned char>(i);\n                unsigned char* outptr = v_affine.row<unsigned char>(i);\n\n                memcpy(outptr, ptr, past_seqlen * v_affine.elemsize);\n                memcpy(outptr + past_seqlen * v_affine.elemsize, ptrq, cur_seqlen * v_affine.elemsize);\n            }\n        }\n        else\n        {\n  
          v_affine = cached_xv_blob_unpacked;\n        }\n    }\n    else\n    {\n        int retv = v_gemm->forward(v_blob, v_affine, opt);\n        if (retv != 0)\n            return retv;\n    }\n\n    Mat qkv_cross(src_seqlen, embed_dim_per_head * num_heads, elemsize, opt.blob_allocator);\n    if (qkv_cross.empty())\n        return -100;\n\n    std::vector<int> retqkvs;\n    retqkvs.resize(num_heads);\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int i = 0; i < num_heads; i++)\n    {\n        std::vector<Mat> qkv_bottom_blobs(2);\n        qkv_bottom_blobs[0] = qk_cross.row_range(i * src_seqlen, src_seqlen);\n        qkv_bottom_blobs[1] = v_affine.row_range(i * embed_dim_per_head, embed_dim_per_head);\n        std::vector<Mat> qkv_top_blobs(1);\n        qkv_top_blobs[0] = qkv_cross.row_range(i * embed_dim_per_head, embed_dim_per_head);\n        Option opt1 = opt;\n        opt1.num_threads = 1;\n        retqkvs[i] = qkv_gemm->forward(qkv_bottom_blobs, qkv_top_blobs, opt1);\n    }\n    for (int i = 0; i < num_heads; i++)\n    {\n        if (retqkvs[i] != 0)\n            return retqkvs[i];\n    }\n\n    if (!kv_cache)\n    {\n        v_affine.release();\n    }\n\n    int reto = o_gemm->forward(qkv_cross, top_blobs[0], opt);\n    if (reto != 0)\n        return reto;\n\n    if (kv_cache)\n    {\n        // assert top_blobs.size() == 3\n        top_blobs[1] = k_affine;\n        top_blobs[2] = v_affine;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/multiheadattention_arm.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_MULTIHEADATTENTION_ARM_H\n#define LAYER_MULTIHEADATTENTION_ARM_H\n\n#include \"multiheadattention.h\"\n\nnamespace ncnn {\n\nclass MultiHeadAttention_arm : public MultiHeadAttention\n{\npublic:\n    MultiHeadAttention_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    Layer* q_gemm;\n    Layer* k_gemm;\n    Layer* v_gemm;\n    Layer* o_gemm;\n\n    Layer* qk_gemm;\n    Layer* qkv_gemm;\n\n    Layer* qk_softmax;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_MULTIHEADATTENTION_ARM_H\n"
  },
  {
    "path": "src/layer/arm/neon_mathfun.h",
    "content": "/* NEON implementation of sin, cos, exp and log\n *\n *   Inspired by Intel Approximate Math library, and based on the\n *   corresponding algorithms of the cephes math library\n */\n\n/* Copyright (C) 2011  Julien Pommier\n *\n *  This software is provided 'as-is', without any express or implied\n *  warranty.  In no event will the authors be held liable for any damages\n *  arising from the use of this software.\n *\n *  Permission is granted to anyone to use this software for any purpose,\n *  including commercial applications, and to alter it and redistribute it\n *  freely, subject to the following restrictions:\n *\n *  1. The origin of this software must not be misrepresented; you must not\n *     claim that you wrote the original software. If you use this software\n *     in a product, an acknowledgment in the product documentation would be\n *     appreciated but is not required.\n *  2. Altered source versions must be plainly marked as such, and must not be\n *     misrepresented as being the original software.\n *  3. 
This notice may not be removed or altered from any source distribution.\n *\n *  (this is the zlib license)\n */\n\n#ifndef NEON_MATHFUN_H\n#define NEON_MATHFUN_H\n\n#include <arm_neon.h>\n\n// Portable FMA macros: use hardware FMA on AArch64, fall back to MLA on AArch32\n#if defined(__aarch64__)\n#define VFMAQ_F32(a, b, c) vfmaq_f32(a, b, c)\n#define VFMSQ_F32(a, b, c) vfmsq_f32(a, b, c)\n#else\n#define VFMAQ_F32(a, b, c) vmlaq_f32(a, b, c)\n#define VFMSQ_F32(a, b, c) vmlsq_f32(a, b, c)\n#endif\n\n#define c_inv_mant_mask ~0x7f800000u\n#define c_cephes_SQRTHF 0.707106781186547524\n#define c_cephes_log_p0 7.0376836292E-2\n#define c_cephes_log_p1 -1.1514610310E-1\n#define c_cephes_log_p2 1.1676998740E-1\n#define c_cephes_log_p3 -1.2420140846E-1\n#define c_cephes_log_p4 +1.4249322787E-1\n#define c_cephes_log_p5 -1.6668057665E-1\n#define c_cephes_log_p6 +2.0000714765E-1\n#define c_cephes_log_p7 -2.4999993993E-1\n#define c_cephes_log_p8 +3.3333331174E-1\n#define c_cephes_log_q1 -2.12194440e-4\n#define c_cephes_log_q2 0.693359375\n\n/* natural logarithm computed for 4 simultaneous float\n *   return NaN for x <= 0\n */\nstatic inline float32x4_t log_ps(float32x4_t x)\n{\n    float32x4_t one = vdupq_n_f32(1);\n\n    x = vmaxq_f32(x, vdupq_n_f32(0)); /* force flush to zero on denormal values */\n    uint32x4_t invalid_mask = vcleq_f32(x, vdupq_n_f32(0));\n\n    int32x4_t ux = vreinterpretq_s32_f32(x);\n\n    int32x4_t emm0 = vshrq_n_s32(ux, 23);\n\n    /* keep only the fractional part */\n    ux = vandq_s32(ux, vdupq_n_s32(c_inv_mant_mask));\n    ux = vorrq_s32(ux, vreinterpretq_s32_f32(vdupq_n_f32(0.5f)));\n    x = vreinterpretq_f32_s32(ux);\n\n    emm0 = vsubq_s32(emm0, vdupq_n_s32(0x7f));\n    float32x4_t e = vcvtq_f32_s32(emm0);\n\n    e = vaddq_f32(e, one);\n\n    /* part2:\n     *     if( x < SQRTHF ) {\n     *       e -= 1;\n     *       x = x + x - 1.0;\n     *     } else { x = x - 1.0; }\n     */\n    uint32x4_t mask = vcltq_f32(x, vdupq_n_f32(c_cephes_SQRTHF));\n 
   float32x4_t tmp = vreinterpretq_f32_u32(vandq_u32(vreinterpretq_u32_f32(x), mask));\n    x = vsubq_f32(x, one);\n    e = vsubq_f32(e, vreinterpretq_f32_u32(vandq_u32(vreinterpretq_u32_f32(one), mask)));\n    x = vaddq_f32(x, tmp);\n\n    float32x4_t z = vmulq_f32(x, x);\n\n    float32x4_t y = vdupq_n_f32(c_cephes_log_p0);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p1), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p2), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p3), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p4), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p5), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p6), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p7), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_log_p8), y, x);\n    y = vmulq_f32(y, x);\n\n    y = vmulq_f32(y, z);\n\n    y = VFMAQ_F32(y, e, vdupq_n_f32(c_cephes_log_q1));\n\n    y = VFMSQ_F32(y, z, vdupq_n_f32(0.5f));\n\n    x = vaddq_f32(x, y);\n    x = VFMAQ_F32(x, e, vdupq_n_f32(c_cephes_log_q2));\n    x = vreinterpretq_f32_u32(vorrq_u32(vreinterpretq_u32_f32(x), invalid_mask)); // negative arg will be NAN\n    return x;\n}\n\n#define c_exp_hi 88.3762626647949f\n#define c_exp_lo -88.3762626647949f\n\n#define c_cephes_LOG2EF 1.44269504088896341\n#define c_cephes_exp_C1 0.693359375\n#define c_cephes_exp_C2 -2.12194440e-4\n\n#define c_cephes_exp_p0 1.9875691500E-4\n#define c_cephes_exp_p1 1.3981999507E-3\n#define c_cephes_exp_p2 8.3334519073E-3\n#define c_cephes_exp_p3 4.1665795894E-2\n#define c_cephes_exp_p4 1.6666665459E-1\n#define c_cephes_exp_p5 5.0000001201E-1\n\n/* exp() computed for 4 float at once */\nstatic inline float32x4_t exp_ps(float32x4_t x)\n{\n    float32x4_t tmp, fx;\n\n    float32x4_t one = vdupq_n_f32(1);\n    x = vminq_f32(x, vdupq_n_f32(c_exp_hi));\n    x = vmaxq_f32(x, vdupq_n_f32(c_exp_lo));\n\n    /* express exp(x) as exp(g + n*log(2)) */\n    fx = VFMAQ_F32(vdupq_n_f32(0.5f), x, vdupq_n_f32(c_cephes_LOG2EF));\n\n    /* perform a floorf 
*/\n    tmp = vcvtq_f32_s32(vcvtq_s32_f32(fx));\n\n    /* if greater, subtract 1 */\n    uint32x4_t mask = vcgtq_f32(tmp, fx);\n    mask = vandq_u32(mask, vreinterpretq_u32_f32(one));\n\n    fx = vsubq_f32(tmp, vreinterpretq_f32_u32(mask));\n\n    tmp = vmulq_f32(fx, vdupq_n_f32(c_cephes_exp_C1));\n    float32x4_t z = vmulq_f32(fx, vdupq_n_f32(c_cephes_exp_C2));\n    x = vsubq_f32(x, tmp);\n    x = vsubq_f32(x, z);\n\n    z = vmulq_f32(x, x);\n\n    float32x4_t y = vdupq_n_f32(c_cephes_exp_p0);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_exp_p1), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_exp_p2), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_exp_p3), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_exp_p4), y, x);\n    y = VFMAQ_F32(vdupq_n_f32(c_cephes_exp_p5), y, x);\n\n    y = VFMAQ_F32(x, y, z);\n    y = vaddq_f32(y, one);\n\n    /* build 2^n */\n    int32x4_t mm;\n    mm = vcvtq_s32_f32(fx);\n    mm = vaddq_s32(mm, vdupq_n_s32(0x7f));\n    mm = vshlq_n_s32(mm, 23);\n    float32x4_t pow2n = vreinterpretq_f32_s32(mm);\n\n    y = vmulq_f32(y, pow2n);\n    return y;\n}\n\n#define c_minus_cephes_DP1 -0.78515625\n#define c_minus_cephes_DP2 -2.4187564849853515625e-4\n#define c_minus_cephes_DP3 -3.77489497744594108e-8\n#define c_sincof_p0        -1.9515295891E-4\n#define c_sincof_p1        8.3321608736E-3\n#define c_sincof_p2        -1.6666654611E-1\n#define c_coscof_p0        2.443315711809948E-005\n#define c_coscof_p1        -1.388731625493765E-003\n#define c_coscof_p2        4.166664568298827E-002\n#define c_cephes_FOPI      1.27323954473516 // 4 / M_PI\n\n/* evaluation of 4 sines & cosines at once.\n *\n *   The code is the exact rewriting of the cephes sinf function.\n *   Precision is excellent as long as x < 8192 (I did not bother to\n *   take into account the special handling they have for greater values\n *   -- it does not return garbage for arguments over 8192, though, but\n *   the extra precision is missing).\n *\n *   Note that it is such that 
sinf((float)M_PI) = 8.74e-8, which is the\n *   surprising but correct result.\n *\n *   Note also that when you compute sin(x), cos(x) is available at\n *   almost no extra price so both sin_ps and cos_ps make use of\n *   sincos_ps..\n */\nstatic inline void sincos_ps(float32x4_t x, float32x4_t* ysin, float32x4_t* ycos)\n{\n    // any x\n    float32x4_t y;\n\n    uint32x4_t emm2;\n\n    uint32x4_t sign_mask_sin, sign_mask_cos;\n    sign_mask_sin = vcltq_f32(x, vdupq_n_f32(0));\n    x = vabsq_f32(x);\n\n    /* scale by 4/Pi */\n    y = vmulq_f32(x, vdupq_n_f32(c_cephes_FOPI));\n\n    /* store the integer part of y in mm0 */\n    emm2 = vcvtq_u32_f32(y);\n    /* j=(j+1) & (~1) (see the cephes sources) */\n    emm2 = vaddq_u32(emm2, vdupq_n_u32(1));\n    emm2 = vandq_u32(emm2, vdupq_n_u32(~1));\n    y = vcvtq_f32_u32(emm2);\n\n    /* get the polynom selection mask\n     *     there is one polynom for 0 <= x <= Pi/4\n     *     and another one for Pi/4<x<=Pi/2\n     *\n     *     Both branches will be computed.\n     */\n    uint32x4_t poly_mask = vtstq_u32(emm2, vdupq_n_u32(2));\n\n    /* The magic pass: \"Extended precision modular arithmetic\"\n     *     x = ((x - y * DP1) - y * DP2) - y * DP3; */\n    x = VFMAQ_F32(x, y, vdupq_n_f32(c_minus_cephes_DP1));\n    x = VFMAQ_F32(x, y, vdupq_n_f32(c_minus_cephes_DP2));\n    x = VFMAQ_F32(x, y, vdupq_n_f32(c_minus_cephes_DP3));\n\n    sign_mask_sin = veorq_u32(sign_mask_sin, vtstq_u32(emm2, vdupq_n_u32(4)));\n    sign_mask_cos = vtstq_u32(vsubq_u32(emm2, vdupq_n_u32(2)), vdupq_n_u32(4));\n\n    /* Evaluate the first polynom  (0 <= x <= Pi/4) in y1,\n     *     and the second polynom      (Pi/4 <= x <= 0) in y2 */\n    float32x4_t z = vmulq_f32(x, x);\n    float32x4_t y1, y2;\n\n    y1 = VFMAQ_F32(vdupq_n_f32(c_coscof_p1), z, vdupq_n_f32(c_coscof_p0));\n    y2 = VFMAQ_F32(vdupq_n_f32(c_sincof_p1), z, vdupq_n_f32(c_sincof_p0));\n    y1 = VFMAQ_F32(vdupq_n_f32(c_coscof_p2), y1, z);\n    y2 = 
VFMAQ_F32(vdupq_n_f32(c_sincof_p2), y2, z);\n    y1 = vmulq_f32(y1, z);\n    y2 = vmulq_f32(y2, z);\n    y1 = vmulq_f32(y1, z);\n    y1 = VFMSQ_F32(y1, z, vdupq_n_f32(0.5f));\n    y2 = VFMAQ_F32(x, y2, x);\n    y1 = vaddq_f32(y1, vdupq_n_f32(1));\n\n    /* select the correct result from the two polynoms */\n    float32x4_t ys = vbslq_f32(poly_mask, y1, y2);\n    float32x4_t yc = vbslq_f32(poly_mask, y2, y1);\n    *ysin = vbslq_f32(sign_mask_sin, vnegq_f32(ys), ys);\n    *ycos = vbslq_f32(sign_mask_cos, yc, vnegq_f32(yc));\n}\n\nstatic inline float32x4_t sin_ps(float32x4_t x)\n{\n    float32x4_t ysin, ycos;\n    sincos_ps(x, &ysin, &ycos);\n    return ysin;\n}\n\nstatic inline float32x4_t cos_ps(float32x4_t x)\n{\n    float32x4_t ysin, ycos;\n    sincos_ps(x, &ysin, &ycos);\n    return ycos;\n}\n\nstatic inline float32x4_t div_ps(float32x4_t a, float32x4_t b)\n{\n#if __aarch64__\n    return vdivq_f32(a, b);\n#else\n    float32x4_t reciprocal = vrecpeq_f32(b);\n    reciprocal = vmulq_f32(vrecpsq_f32(b, reciprocal), reciprocal);\n    reciprocal = vmulq_f32(vrecpsq_f32(b, reciprocal), reciprocal);\n    return vmulq_f32(a, reciprocal);\n#endif\n}\n\nstatic inline float32x4_t tan_ps(float32x4_t x)\n{\n    float32x4_t ysin, ycos;\n    sincos_ps(x, &ysin, &ycos);\n    float32x4_t ytan = div_ps(ysin, ycos);\n    return ytan;\n}\n\nstatic inline float32x4_t pow_ps(float32x4_t a, float32x4_t b)\n{\n    // pow(x, m) = exp(m * log(x))\n    return exp_ps(vmulq_f32(b, log_ps(a)));\n}\n\nstatic inline float32x4_t sigmoid_ps(float32x4_t _v)\n{\n    float32x4_t _one = vdupq_n_f32(1.f);\n    _v = vnegq_f32(_v);\n    _v = exp_ps(_v);\n    _v = vaddq_f32(_v, _one);\n    float32x4_t _outp = vrecpeq_f32(_v);\n    _outp = vmulq_f32(vrecpsq_f32(_v, _outp), _outp);\n    return vmulq_f32(vrecpsq_f32(_v, _outp), _outp);\n}\n\nstatic const float asinf_lut[7] = {\n    1.5707961728,\n    -0.2145852647,\n    0.0887556286,\n    -0.0488025043,\n    0.0268999482,\n    -0.0111462294,\n    
0.0022959648\n};\n\nstatic inline void asincos_ps(float32x4_t x, float32x4_t* yasin, float32x4_t* yacos)\n{\n    int i = 0;\n    float32x4_t one = vdupq_n_f32(1);\n    float32x4_t negone = vdupq_n_f32(-1);\n    float32x4_t lut[7];\n    float32x4_t xv[5];\n    float32x4_t a0, a1, a2, a3;\n    float32x4_t phx;\n    float32x4_t arcsinx, arcnsinx;\n    float32x4_t sat = vdupq_n_f32(0.9999999f);\n    float32x4_t m_pi_2 = vdupq_n_f32(1.570796326);\n    for (i = 0; i <= 6; i++)\n    {\n        lut[i] = vdupq_n_f32(asinf_lut[i]);\n    }\n\n    uint32x4_t sign_mask_asin, saturate;\n    sign_mask_asin = vcltq_f32(x, vdupq_n_f32(0));\n    x = vabsq_f32(x);\n    saturate = vcgeq_f32(x, one);\n    x = vbslq_f32(saturate, sat, x);\n    float32x4_t y = vsubq_f32(one, x);\n\n#if __aarch64__\n    y = vsqrtq_f32(y);\n#else\n    float32x4_t _reciprocal = vrsqrteq_f32(y);\n    _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(y, _reciprocal), _reciprocal), _reciprocal);\n    y = vmulq_f32(y, _reciprocal);\n#endif\n\n    xv[0] = vmulq_f32(x, x);\n    for (i = 1; i < 5; i++)\n    {\n        xv[i] = vmulq_f32(xv[i - 1], x);\n    }\n\n    a0 = vaddq_f32(lut[0], vmulq_f32(lut[1], x));\n    a1 = vaddq_f32(vmulq_f32(lut[2], xv[0]), vmulq_f32(lut[3], xv[1]));\n    a2 = vaddq_f32(vmulq_f32(lut[4], xv[2]), vmulq_f32(lut[5], xv[3]));\n    a3 = vmulq_f32(lut[6], xv[4]);\n    phx = vaddq_f32(vaddq_f32(a0, vaddq_f32(a1, a2)), a3);\n\n    arcsinx = vmulq_f32(y, phx);\n    arcsinx = vsubq_f32(m_pi_2, arcsinx);\n    arcnsinx = vmulq_f32(negone, arcsinx);\n    arcsinx = vbslq_f32(sign_mask_asin, arcnsinx, arcsinx);\n\n    *yasin = arcsinx;\n    *yacos = vsubq_f32(m_pi_2, arcsinx);\n}\n\nstatic inline float32x4_t asin_ps(float32x4_t x)\n{\n    float32x4_t yasin, yacos;\n    asincos_ps(x, &yasin, &yacos);\n    return yasin;\n}\n\nstatic inline float32x4_t acos_ps(float32x4_t x)\n{\n    float32x4_t yasin, yacos;\n    asincos_ps(x, &yasin, &yacos);\n    return yacos;\n}\n\nstatic inline float32x4_t 
atan2_ps(float32x4_t a, float32x4_t b)\n{\n    //TODO neon optimize\n    float tmpx[4];\n    float tmpy[4];\n    vst1q_f32(tmpx, a);\n    vst1q_f32(tmpy, b);\n    for (int i = 0; i < 4; i++)\n        tmpx[i] = atan2f(tmpx[i], tmpy[i]);\n    return vld1q_f32(tmpx);\n}\n\nstatic inline float32x4_t trunc_ps(const float32x4_t& x)\n{\n    // truncate toward zero\n#if __aarch64__\n    return vrndq_f32(x);\n#else\n    int32x4_t xi = vcvtq_s32_f32(x);\n    return vcvtq_f32_s32(xi);\n#endif\n}\n\nstatic inline float32x4_t fmod_ps(const float32x4_t& x, const float32x4_t& y)\n{\n    // fmod(x,y) = x - trunc(x/y) * y\n#if __aarch64__\n    float32x4_t q = vdivq_f32(x, y);\n#else\n    float32x4_t q = div_ps(x, y);\n#endif\n    float32x4_t tq = trunc_ps(q);\n    return vsubq_f32(x, vmulq_f32(tq, y));\n}\n\nstatic inline float32x4_t round_ps(const float32x4_t& x)\n{\n#if __aarch64__\n    return vrndnq_f32(x);\n#else\n    float32x4_t half = vdupq_n_f32(0.5f);\n    float32x4_t one = vdupq_n_f32(1.0f);\n    uint32x4_t sign_mask = vcltq_f32(x, vdupq_n_f32(0));\n    float32x4_t abs_x = vabsq_f32(x);\n    int32x4_t xi = vcvtq_s32_f32(abs_x);\n    float32x4_t truncated = vcvtq_f32_s32(xi);\n    float32x4_t diff = vsubq_f32(abs_x, truncated);\n    uint32x4_t diff_gt_half = vcgtq_f32(diff, half);\n    uint32x4_t diff_eq_half = vceqq_f32(diff, half);\n    int32x4_t xi_and_1 = vandq_s32(xi, vdupq_n_s32(1));\n    uint32x4_t is_odd = vcgtq_s32(xi_and_1, vdupq_n_s32(0));\n    uint32x4_t round_up = vorrq_u32(diff_gt_half, vandq_u32(diff_eq_half, is_odd));\n    float32x4_t rounded = vaddq_f32(truncated, vreinterpretq_f32_u32(vandq_u32(round_up, vreinterpretq_u32_f32(one))));\n    return vbslq_f32(sign_mask, vnegq_f32(rounded), rounded);\n#endif\n}\n\nstatic inline float32x4_t logaddexp_ps(const float32x4_t& x, const float32x4_t& y)\n{\n    float32x4_t max_xy = vmaxq_f32(x, y);\n    float32x4_t min_xy = vminq_f32(x, y);\n    float32x4_t diff = vsubq_f32(min_xy, max_xy);\n    float32x4_t exp_diff = 
exp_ps(diff);\n    float32x4_t one_plus_exp = vaddq_f32(vdupq_n_f32(1.0f), exp_diff);\n    float32x4_t log_result = log_ps(one_plus_exp);\n    return vaddq_f32(max_xy, log_result);\n}\n\nstatic inline float32x4_t floor_ps(const float32x4_t& x)\n{\n#if __aarch64__\n    return vrndmq_f32(x);\n#else\n    float32x4_t truncated = vcvtq_f32_s32(vcvtq_s32_f32(x));\n    uint32x4_t need_adjust = vcltq_f32(x, truncated);\n    float32x4_t adjusted = vsubq_f32(truncated, vdupq_n_f32(1.0f));\n    return vbslq_f32(need_adjust, adjusted, truncated);\n#endif\n}\n\nstatic inline float32x4_t floor_divide_ps(const float32x4_t& x, const float32x4_t& y)\n{\n#if __aarch64__\n    float32x4_t q = vdivq_f32(x, y);\n#else\n    float32x4_t q = div_ps(x, y);\n#endif\n    return floor_ps(q);\n}\n\nstatic inline float32x4_t remainder_ps(const float32x4_t& x, const float32x4_t& y)\n{\n#if __aarch64__\n    float32x4_t q = vdivq_f32(x, y);\n#else\n    float32x4_t q = div_ps(x, y);\n#endif\n    float32x4_t rq = round_ps(q);\n    return vsubq_f32(x, vmulq_f32(rq, y));\n}\n\n#include \"neon_mathfun_tanh.h\"\n\n// Clean up macros\n#undef VFMAQ_F32\n#undef VFMSQ_F32\n\n#endif // NEON_MATHFUN_H\n"
  },
  {
    "path": "src/layer/arm/neon_mathfun_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n/* NEON implementation of sin, cos, exp and log\n *\n *   Inspired by Intel Approximate Math library, and based on the\n *   corresponding algorithms of the cephes math library\n */\n\n/* Copyright (C) 2011  Julien Pommier\n *\n *  This software is provided 'as-is', without any express or implied\n *  warranty.  In no event will the authors be held liable for any damages\n *  arising from the use of this software.\n *\n *  Permission is granted to anyone to use this software for any purpose,\n *  including commercial applications, and to alter it and redistribute it\n *  freely, subject to the following restrictions:\n *\n *  1. The origin of this software must not be misrepresented; you must not\n *     claim that you wrote the original software. If you use this software\n *     in a product, an acknowledgment in the product documentation would be\n *     appreciated but is not required.\n *  2. Altered source versions must be plainly marked as such, and must not be\n *     misrepresented as being the original software.\n *  3. 
This notice may not be removed or altered from any source distribution.\n *\n *  (this is the zlib license)\n */\n\n#ifndef NEON_MATHFUN_FP16S_H\n#define NEON_MATHFUN_FP16S_H\n\n#include <arm_neon.h>\n\n#define c_inv_mant_mask_f16 -31745 // ~0x7c00u\n#define c_cephes_SQRTHF     0.707106781186547524\n#define c_cephes_log_p0     7.0376836292E-2\n#define c_cephes_log_p1     -1.1514610310E-1\n#define c_cephes_log_p2     1.1676998740E-1\n#define c_cephes_log_p3     -1.2420140846E-1\n#define c_cephes_log_p4     +1.4249322787E-1\n#define c_cephes_log_p5     -1.6668057665E-1\n#define c_cephes_log_p6     +2.0000714765E-1\n#define c_cephes_log_p7     -2.4999993993E-1\n#define c_cephes_log_p8     +3.3333331174E-1\n#define c_cephes_log_q1     -2.12194440e-4\n#define c_cephes_log_q2     0.693359375\n\n/* natural logarithm computed for 4 simultaneous float\n *   return NaN for x <= 0\n */\nstatic inline float16x4_t log_ps_f16(float16x4_t x)\n{\n    float16x4_t one = vdup_n_f16(1);\n\n    x = vmax_f16(x, vdup_n_f16(0)); /* force flush to zero on denormal values */\n    uint16x4_t invalid_mask = vcle_f16(x, vdup_n_f16(0));\n\n    int16x4_t ux = vreinterpret_s16_f16(x);\n\n    int16x4_t emm0 = vshr_n_s16(ux, 10);\n\n    /* keep only the fractional part */\n    ux = vand_s16(ux, vdup_n_s16(c_inv_mant_mask_f16));\n    ux = vorr_s16(ux, vreinterpret_s16_f16(vdup_n_f16(0.5f)));\n    x = vreinterpret_f16_s16(ux);\n\n    emm0 = vsub_s16(emm0, vdup_n_s16(0xf));\n    float16x4_t e = vcvt_f16_s16(emm0);\n\n    e = vadd_f16(e, one);\n\n    /* part2:\n     *     if( x < SQRTHF ) {\n     *       e -= 1;\n     *       x = x + x - 1.0;\n     *     } else { x = x - 1.0; }\n     */\n    uint16x4_t mask = vclt_f16(x, vdup_n_f16(c_cephes_SQRTHF));\n    float16x4_t tmp = (float16x4_t)(vand_u16((uint16x4_t)(x), mask));\n    x = vsub_f16(x, one);\n    e = vsub_f16(e, (float16x4_t)(vand_u16((uint16x4_t)(one), mask)));\n    x = vadd_f16(x, tmp);\n\n    float16x4_t z = vmul_f16(x, x);\n\n    float16x4_t y 
= vdup_n_f16(c_cephes_log_p0);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p1), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p2), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p3), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p4), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p5), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p6), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p7), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_log_p8), y, x);\n    y = vmul_f16(y, x);\n\n    y = vmul_f16(y, z);\n\n    y = vfma_f16(y, e, vdup_n_f16(c_cephes_log_q1));\n\n    y = vfms_f16(y, z, vdup_n_f16(0.5f));\n\n    x = vadd_f16(x, y);\n    x = vfma_f16(x, e, vdup_n_f16(c_cephes_log_q2));\n    x = (float16x4_t)(vorr_u16((uint16x4_t)(x), invalid_mask)); // negative arg will be NAN\n    return x;\n}\n\nstatic inline float16x8_t log_ps_f16(float16x8_t x)\n{\n    float16x8_t one = vdupq_n_f16(1);\n\n    x = vmaxq_f16(x, vdupq_n_f16(0)); /* force flush to zero on denormal values */\n    uint16x8_t invalid_mask = vcleq_f16(x, vdupq_n_f16(0));\n\n    int16x8_t ux = vreinterpretq_s16_f16(x);\n\n    int16x8_t emm0 = vshrq_n_s16(ux, 10);\n\n    /* keep only the fractional part */\n    ux = vandq_s16(ux, vdupq_n_s16(c_inv_mant_mask_f16));\n    ux = vorrq_s16(ux, vreinterpretq_s16_f16(vdupq_n_f16(0.5f)));\n    x = vreinterpretq_f16_s16(ux);\n\n    emm0 = vsubq_s16(emm0, vdupq_n_s16(0xf));\n    float16x8_t e = vcvtq_f16_s16(emm0);\n\n    e = vaddq_f16(e, one);\n\n    /* part2:\n     *     if( x < SQRTHF ) {\n     *       e -= 1;\n     *       x = x + x - 1.0;\n     *     } else { x = x - 1.0; }\n     */\n    uint16x8_t mask = vcltq_f16(x, vdupq_n_f16(c_cephes_SQRTHF));\n    float16x8_t tmp = vreinterpretq_f16_u16(vandq_u16(vreinterpretq_u16_f16(x), mask));\n    x = vsubq_f16(x, one);\n    e = vsubq_f16(e, vreinterpretq_f16_u16(vandq_u16(vreinterpretq_u16_f16(one), mask)));\n    x = vaddq_f16(x, tmp);\n\n    float16x8_t z = vmulq_f16(x, x);\n\n    float16x8_t y = 
vdupq_n_f16(c_cephes_log_p0);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p1), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p2), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p3), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p4), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p5), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p6), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p7), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_log_p8), y, x);\n    y = vmulq_f16(y, x);\n\n    y = vmulq_f16(y, z);\n\n    y = vfmaq_f16(y, e, vdupq_n_f16(c_cephes_log_q1));\n\n    y = vfmsq_f16(y, z, vdupq_n_f16(0.5f));\n\n    x = vaddq_f16(x, y);\n    x = vfmaq_f16(x, e, vdupq_n_f16(c_cephes_log_q2));\n    x = vreinterpretq_f16_u16(vorrq_u16(vreinterpretq_u16_f16(x), invalid_mask)); // negative arg will be NAN\n    return x;\n}\n\n#define c_exp_hi_f16 10.7421875f\n#define c_exp_lo_f16 -10.7421875f\n\n#define c_cephes_LOG2EF 1.44269504088896341\n#define c_cephes_exp_C1 0.693359375\n#define c_cephes_exp_C2 -2.12194440e-4\n\n#define c_cephes_exp_p0 1.9875691500E-4\n#define c_cephes_exp_p1 1.3981999507E-3\n#define c_cephes_exp_p2 8.3334519073E-3\n#define c_cephes_exp_p3 4.1665795894E-2\n#define c_cephes_exp_p4 1.6666665459E-1\n#define c_cephes_exp_p5 5.0000001201E-1\n\n/* exp() computed for 4 float at once */\nstatic inline float16x4_t exp_ps_f16(float16x4_t x)\n{\n    float16x4_t tmp, fx;\n\n    float16x4_t one = vdup_n_f16(1);\n    x = vmin_f16(x, vdup_n_f16(c_exp_hi_f16));\n    x = vmax_f16(x, vdup_n_f16(c_exp_lo_f16));\n\n    /* express exp(x) as exp(g + n*log(2)) */\n#if defined(_MSC_VER) && !defined(__clang__)\n    fx = vfma_f16(vdup_n_f16(0.5f), x, vcvt_f16_f32(vdupq_n_f32(c_cephes_LOG2EF)));\n#else\n    fx = vfma_f16(vdup_n_f16(0.5f), x, vdup_n_f16(c_cephes_LOG2EF));\n#endif\n\n    /* perform a floorf */\n    tmp = vcvt_f16_s16(vcvt_s16_f16(fx));\n\n    /* if greater, subtract 1 */\n    uint16x4_t mask = vcgt_f16(tmp, fx);\n    mask = vand_u16(mask, 
(uint16x4_t)(one));\n\n    fx = vsub_f16(tmp, (float16x4_t)(mask));\n\n#if defined(_MSC_VER) && !defined(__clang__)\n    tmp = vmul_f16(fx, vcvt_f16_f32(vdupq_n_f32(c_cephes_exp_C1)));\n    float16x4_t z = vmul_f16(fx, vcvt_f16_f32(vdupq_n_f32(c_cephes_exp_C2)));\n#else\n    tmp = vmul_f16(fx, vdup_n_f16(c_cephes_exp_C1));\n    float16x4_t z = vmul_f16(fx, vdup_n_f16(c_cephes_exp_C2));\n#endif\n    x = vsub_f16(x, tmp);\n    x = vsub_f16(x, z);\n\n    z = vmul_f16(x, x);\n\n    float16x4_t y = vdup_n_f16(c_cephes_exp_p0);\n    y = vfma_f16(vdup_n_f16(c_cephes_exp_p1), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_exp_p2), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_exp_p3), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_exp_p4), y, x);\n    y = vfma_f16(vdup_n_f16(c_cephes_exp_p5), y, x);\n\n    y = vfma_f16(x, y, z);\n    y = vadd_f16(y, one);\n\n    /* build 2^n */\n    int16x4_t mm;\n    mm = vcvt_s16_f16(fx);\n    mm = vadd_s16(mm, vdup_n_s16(0xf));\n    mm = vshl_n_s16(mm, 10);\n    float16x4_t pow2n = vreinterpret_f16_s16(mm);\n\n    y = vmul_f16(y, pow2n);\n    return y;\n}\n\nstatic inline float16x8_t exp_ps_f16(float16x8_t x)\n{\n    float16x8_t tmp, fx;\n\n    float16x8_t one = vdupq_n_f16(1);\n    x = vminq_f16(x, vdupq_n_f16(c_exp_hi_f16));\n    x = vmaxq_f16(x, vdupq_n_f16(c_exp_lo_f16));\n\n    /* express exp(x) as exp(g + n*log(2)) */\n#if defined(_MSC_VER) && !defined(__clang__)\n    float16x4_t _c_cephes_LOG2EF = vcvt_f16_f32(vdupq_n_f32(c_cephes_LOG2EF));\n    fx = vfmaq_f16(vdupq_n_f16(0.5f), x, vcombine_f16(_c_cephes_LOG2EF, _c_cephes_LOG2EF));\n#else\n    fx = vfmaq_f16(vdupq_n_f16(0.5f), x, vdupq_n_f16(c_cephes_LOG2EF));\n#endif\n\n    /* perform a floorf */\n    tmp = vcvtq_f16_s16(vcvtq_s16_f16(fx));\n\n    /* if greater, subtract 1 */\n    uint16x8_t mask = vcgtq_f16(tmp, fx);\n    mask = vandq_u16(mask, vreinterpretq_u16_f16(one));\n\n    fx = vsubq_f16(tmp, vreinterpretq_f16_u16(mask));\n\n#if defined(_MSC_VER) && !defined(__clang__)\n    
float16x4_t _c_cephes_exp_C1 = vcvt_f16_f32(vdupq_n_f32(c_cephes_exp_C1));\n    tmp = vmulq_f16(fx, vcombine_f16(_c_cephes_exp_C1, _c_cephes_exp_C1));\n    float16x4_t _c_cephes_exp_C2 = vcvt_f16_f32(vdupq_n_f32(c_cephes_exp_C2));\n    float16x8_t z = vmulq_f16(fx, vcombine_f16(_c_cephes_exp_C2, _c_cephes_exp_C2));\n#else\n    tmp = vmulq_f16(fx, vdupq_n_f16(c_cephes_exp_C1));\n    float16x8_t z = vmulq_f16(fx, vdupq_n_f16(c_cephes_exp_C2));\n#endif\n    x = vsubq_f16(x, tmp);\n    x = vsubq_f16(x, z);\n\n    z = vmulq_f16(x, x);\n\n    float16x8_t y = vdupq_n_f16(c_cephes_exp_p0);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_exp_p1), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_exp_p2), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_exp_p3), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_exp_p4), y, x);\n    y = vfmaq_f16(vdupq_n_f16(c_cephes_exp_p5), y, x);\n\n    y = vfmaq_f16(x, y, z);\n    y = vaddq_f16(y, one);\n\n    /* build 2^n */\n    int16x8_t mm;\n    mm = vcvtq_s16_f16(fx);\n    mm = vaddq_s16(mm, vdupq_n_s16(0xf));\n    mm = vshlq_n_s16(mm, 10);\n    float16x8_t pow2n = vreinterpretq_f16_s16(mm);\n\n    y = vmulq_f16(y, pow2n);\n    return y;\n}\n\n#define c_minus_cephes_DP1 -0.78515625\n#define c_minus_cephes_DP2 -2.4187564849853515625e-4\n#define c_minus_cephes_DP3 -3.77489497744594108e-8\n#define c_sincof_p0        -1.9515295891E-4\n#define c_sincof_p1        8.3321608736E-3\n#define c_sincof_p2        -1.6666654611E-1\n#define c_coscof_p0        2.443315711809948E-005\n#define c_coscof_p1        -1.388731625493765E-003\n#define c_coscof_p2        4.166664568298827E-002\n#define c_cephes_FOPI      1.27323954473516 // 4 / M_PI\n\n/* evaluation of 4 sines & cosines at once.\n *\n *   The code is the exact rewriting of the cephes sinf function.\n *   Precision is excellent as long as x < 8192 (I did not bother to\n *   take into account the special handling they have for greater values\n *   -- it does not return garbage for arguments over 8192, though, 
but\n *   the extra precision is missing).\n *\n *   Note that it is such that sinf((float)M_PI) = 8.74e-8, which is the\n *   surprising but correct result.\n *\n *   Note also that when you compute sin(x), cos(x) is available at\n *   almost no extra price so both sin_ps and cos_ps make use of\n *   sincos_ps..\n */\nstatic inline void sincos_ps_f16(float16x4_t x, float16x4_t* ysin, float16x4_t* ycos)\n{\n    // any x\n    float16x4_t y;\n\n    uint16x4_t emm2;\n\n    uint16x4_t sign_mask_sin, sign_mask_cos;\n    sign_mask_sin = vclt_f16(x, vdup_n_f16(0));\n    x = vabs_f16(x);\n\n    /* scale by 4/Pi */\n#if defined(_MSC_VER) && !defined(__clang__)\n    float16x4_t _c_cephes_FOPI = vcvt_f16_f32(vdupq_n_f32(c_cephes_FOPI));\n    y = vmul_f16(x, _c_cephes_FOPI);\n#else\n    y = vmul_f16(x, vdup_n_f16(c_cephes_FOPI));\n#endif\n\n    /* store the integer part of y in mm0 */\n    emm2 = vcvt_u16_f16(y);\n    /* j=(j+1) & (~1) (see the cephes sources) */\n    emm2 = vadd_u16(emm2, vdup_n_u16(1));\n    emm2 = vand_u16(emm2, vdup_n_u16(~1));\n    y = vcvt_f16_u16(emm2);\n\n    /* get the polynom selection mask\n     *     there is one polynom for 0 <= x <= Pi/4\n     *     and another one for Pi/4<x<=Pi/2\n     *\n     *     Both branches will be computed.\n     */\n    uint16x4_t poly_mask = vtst_u16(emm2, vdup_n_u16(2));\n\n    /* The magic pass: \"Extended precision modular arithmetic\"\n     *     x = ((x - y * DP1) - y * DP2) - y * DP3; */\n#if defined(_MSC_VER) && !defined(__clang__)\n    float16x4_t _c_minus_cephes_DP1 = vcvt_f16_f32(vdupq_n_f32(c_minus_cephes_DP1));\n    float16x4_t _c_minus_cephes_DP2 = vcvt_f16_f32(vdupq_n_f32(c_minus_cephes_DP2));\n    float16x4_t _c_minus_cephes_DP3 = vcvt_f16_f32(vdupq_n_f32(c_minus_cephes_DP3));\n    x = vfma_f16(x, y, _c_minus_cephes_DP1);\n    x = vfma_f16(x, y, _c_minus_cephes_DP2);\n    x = vfma_f16(x, y, _c_minus_cephes_DP3);\n#else\n    x = vfma_f16(x, y, vdup_n_f16(c_minus_cephes_DP1));\n    x = vfma_f16(x, y, 
vdup_n_f16(c_minus_cephes_DP2));\n    x = vfma_f16(x, y, vdup_n_f16(c_minus_cephes_DP3));\n#endif\n\n    sign_mask_sin = veor_u16(sign_mask_sin, vtst_u16(emm2, vdup_n_u16(4)));\n    sign_mask_cos = vtst_u16(vsub_u16(emm2, vdup_n_u16(2)), vdup_n_u16(4));\n\n    /* Evaluate the first polynom  (0 <= x <= Pi/4) in y1,\n     *     and the second polynom      (Pi/4 <= x <= 0) in y2 */\n    float16x4_t z = vmul_f16(x, x);\n    float16x4_t y1, y2;\n\n    y1 = vfma_f16(vdup_n_f16(c_coscof_p1), z, vdup_n_f16(c_coscof_p0));\n    y2 = vfma_f16(vdup_n_f16(c_sincof_p1), z, vdup_n_f16(c_sincof_p0));\n    y1 = vfma_f16(vdup_n_f16(c_coscof_p2), y1, z);\n    y2 = vfma_f16(vdup_n_f16(c_sincof_p2), y2, z);\n    y1 = vmul_f16(y1, z);\n    y2 = vmul_f16(y2, z);\n    y1 = vmul_f16(y1, z);\n    y1 = vfms_f16(y1, z, vdup_n_f16(0.5f));\n    y2 = vfma_f16(x, y2, x);\n    y1 = vadd_f16(y1, vdup_n_f16(1));\n\n    /* select the correct result from the two polynoms */\n    float16x4_t ys = vbsl_f16(poly_mask, y1, y2);\n    float16x4_t yc = vbsl_f16(poly_mask, y2, y1);\n    *ysin = vbsl_f16(sign_mask_sin, vneg_f16(ys), ys);\n    *ycos = vbsl_f16(sign_mask_cos, yc, vneg_f16(yc));\n}\n\nstatic inline void sincos_ps_f16(float16x8_t x, float16x8_t* ysin, float16x8_t* ycos)\n{\n    // any x\n    float16x8_t y;\n\n    uint16x8_t emm2;\n\n    uint16x8_t sign_mask_sin, sign_mask_cos;\n    sign_mask_sin = vcltq_f16(x, vdupq_n_f16(0));\n    x = vabsq_f16(x);\n\n    /* scale by 4/Pi */\n#if defined(_MSC_VER) && !defined(__clang__)\n    float16x4_t _c_cephes_FOPI = vcvt_f16_f32(vdupq_n_f32(c_cephes_FOPI));\n    y = vmulq_f16(x, vcombine_f16(_c_cephes_FOPI, _c_cephes_FOPI));\n#else\n    y = vmulq_f16(x, vdupq_n_f16(c_cephes_FOPI));\n#endif\n\n    /* store the integer part of y in mm0 */\n    emm2 = vcvtq_u16_f16(y);\n    /* j=(j+1) & (~1) (see the cephes sources) */\n    emm2 = vaddq_u16(emm2, vdupq_n_u16(1));\n    emm2 = vandq_u16(emm2, vdupq_n_u16(~1));\n    y = vcvtq_f16_u16(emm2);\n\n    /* get the 
polynom selection mask\n     *     there is one polynom for 0 <= x <= Pi/4\n     *     and another one for Pi/4<x<=Pi/2\n     *\n     *     Both branches will be computed.\n     */\n    uint16x8_t poly_mask = vtstq_u16(emm2, vdupq_n_u16(2));\n\n    /* The magic pass: \"Extended precision modular arithmetic\"\n     *     x = ((x - y * DP1) - y * DP2) - y * DP3; */\n#if defined(_MSC_VER) && !defined(__clang__)\n    float16x4_t _c_minus_cephes_DP1 = vcvt_f16_f32(vdupq_n_f32(c_minus_cephes_DP1));\n    float16x4_t _c_minus_cephes_DP2 = vcvt_f16_f32(vdupq_n_f32(c_minus_cephes_DP2));\n    float16x4_t _c_minus_cephes_DP3 = vcvt_f16_f32(vdupq_n_f32(c_minus_cephes_DP3));\n    x = vfmaq_f16(x, y, vcombine_f16(_c_minus_cephes_DP1, _c_minus_cephes_DP1));\n    x = vfmaq_f16(x, y, vcombine_f16(_c_minus_cephes_DP2, _c_minus_cephes_DP2));\n    x = vfmaq_f16(x, y, vcombine_f16(_c_minus_cephes_DP3, _c_minus_cephes_DP3));\n#else\n    x = vfmaq_f16(x, y, vdupq_n_f16(c_minus_cephes_DP1));\n    x = vfmaq_f16(x, y, vdupq_n_f16(c_minus_cephes_DP2));\n    x = vfmaq_f16(x, y, vdupq_n_f16(c_minus_cephes_DP3));\n#endif\n\n    sign_mask_sin = veorq_u16(sign_mask_sin, vtstq_u16(emm2, vdupq_n_u16(4)));\n    sign_mask_cos = vtstq_u16(vsubq_u16(emm2, vdupq_n_u16(2)), vdupq_n_u16(4));\n\n    /* Evaluate the first polynom  (0 <= x <= Pi/4) in y1,\n     *     and the second polynom      (Pi/4 <= x <= 0) in y2 */\n    float16x8_t z = vmulq_f16(x, x);\n    float16x8_t y1, y2;\n\n    y1 = vfmaq_f16(vdupq_n_f16(c_coscof_p1), z, vdupq_n_f16(c_coscof_p0));\n    y2 = vfmaq_f16(vdupq_n_f16(c_sincof_p1), z, vdupq_n_f16(c_sincof_p0));\n    y1 = vfmaq_f16(vdupq_n_f16(c_coscof_p2), y1, z);\n    y2 = vfmaq_f16(vdupq_n_f16(c_sincof_p2), y2, z);\n    y1 = vmulq_f16(y1, z);\n    y2 = vmulq_f16(y2, z);\n    y1 = vmulq_f16(y1, z);\n    y1 = vfmsq_f16(y1, z, vdupq_n_f16(0.5f));\n    y2 = vfmaq_f16(x, y2, x);\n    y1 = vaddq_f16(y1, vdupq_n_f16(1));\n\n    /* select the correct result from the two polynoms */\n    
float16x8_t ys = vbslq_f16(poly_mask, y1, y2);\n    float16x8_t yc = vbslq_f16(poly_mask, y2, y1);\n    *ysin = vbslq_f16(sign_mask_sin, vnegq_f16(ys), ys);\n    *ycos = vbslq_f16(sign_mask_cos, yc, vnegq_f16(yc));\n}\n\nstatic inline float16x4_t sin_ps_f16(float16x4_t x)\n{\n    float16x4_t ysin, ycos;\n    sincos_ps_f16(x, &ysin, &ycos);\n    return ysin;\n}\n\nstatic inline float16x8_t sin_ps_f16(float16x8_t x)\n{\n    float16x8_t ysin, ycos;\n    sincos_ps_f16(x, &ysin, &ycos);\n    return ysin;\n}\n\nstatic inline float16x4_t cos_ps_f16(float16x4_t x)\n{\n    float16x4_t ysin, ycos;\n    sincos_ps_f16(x, &ysin, &ycos);\n    return ycos;\n}\n\nstatic inline float16x8_t cos_ps_f16(float16x8_t x)\n{\n    float16x8_t ysin, ycos;\n    sincos_ps_f16(x, &ysin, &ycos);\n    return ycos;\n}\n\n#define c_tanh_tiny 1e-4f\n#define c_tanh_hi   9.0f\n// The monomial coefficients of the numerator polynomial (odd).\n#define c_tanh_alpha_1  4.89352455891786e-3f\n#define c_tanh_alpha_3  6.37261928875436e-4f\n#define c_tanh_alpha_5  1.48572235717979e-5f\n#define c_tanh_alpha_7  5.12229709037114e-8f\n#define c_tanh_alpha_9  -8.60467152213735e-11f\n#define c_tanh_alpha_11 2.00018790482477e-13f\n#define c_tanh_alpha_13 -2.76076847742355e-16f\n// The monomial coefficients of the denominator polynomial (even).\n#define c_tanh_beta_0 4.89352518554385e-3f\n#define c_tanh_beta_2 2.26843463243900e-3f\n#define c_tanh_beta_4 1.18534705686654e-4f\n#define c_tanh_beta_6 1.19825839466702e-6f\n\n/* Half precision hyperbolic tangent computed for 4 simultaneous values */\nstatic inline float16x4_t tanh_ps_f16(float16x4_t x)\n{\n    float16x4_t x2 = vabs_f16(x);\n\n    uint16x4_t tiny_mask = vcge_f16(x2, vdup_n_f16(c_tanh_tiny));\n\n    // clamp the inputs to the range [-9, 9] since anything outside\n    // this range is -/+1.0f in single-precision.\n    x2 = (float16x4_t)(vbsl_u16(vcge_f16(vdup_n_f16(c_tanh_hi), x2), (uint16x4_t)(x2), (uint16x4_t)(vdup_n_f16(c_tanh_hi))));\n\n    // since the 
polynomials are odd/even, we need x**2.\n    float16x4_t z = vmul_f16(x2, x2);\n\n    // evaluate the numerator polynomial y.\n    float16x4_t y = vdup_n_f16(c_tanh_alpha_13);\n    y = vfma_f16(vdup_n_f16(c_tanh_alpha_11), y, z);\n    y = vfma_f16(vdup_n_f16(c_tanh_alpha_9), y, z);\n    y = vfma_f16(vdup_n_f16(c_tanh_alpha_7), y, z);\n    y = vfma_f16(vdup_n_f16(c_tanh_alpha_5), y, z);\n    y = vfma_f16(vdup_n_f16(c_tanh_alpha_3), y, z);\n    y = vfma_f16(vdup_n_f16(c_tanh_alpha_1), y, z);\n    y = vmul_f16(y, x2);\n\n    // evaluate the denominator polynomial w.\n    float16x4_t w = vdup_n_f16(c_tanh_beta_6);\n    w = vfma_f16(vdup_n_f16(c_tanh_beta_4), w, z);\n    w = vfma_f16(vdup_n_f16(c_tanh_beta_2), w, z);\n    w = vfma_f16(vdup_n_f16(c_tanh_beta_0), w, z);\n\n    // divide the numerator by the denominator.\n    y = vdiv_f16(y, w);\n\n    // reinstate the sign.\n    y = (float16x4_t)(vbsl_u16(vdup_n_u16(1u << 15), (uint16x4_t)(x), (uint16x4_t)(y)));\n\n    // when the argument is very small in magnitude it's more accurate to just return it.\n    y = (float16x4_t)(vbsl_u16(tiny_mask, (uint16x4_t)(y), (uint16x4_t)(x)));\n\n    return y;\n}\n\nstatic inline float16x8_t tanh_ps_f16(float16x8_t x)\n{\n    float16x8_t x2 = vabsq_f16(x);\n\n    uint16x8_t tiny_mask = vcgeq_f16(x2, vdupq_n_f16(c_tanh_tiny));\n\n    // clamp the inputs to the range [-9, 9] since anything outside\n    // this range is -/+1.0f in single-precision.\n    x2 = vreinterpretq_f16_u16(vbslq_u16(vcgeq_f16(vdupq_n_f16(c_tanh_hi), x2), vreinterpretq_u16_f16(x2), vreinterpretq_u16_f16(vdupq_n_f16(c_tanh_hi))));\n\n    // since the polynomials are odd/even, we need x**2.\n    float16x8_t z = vmulq_f16(x2, x2);\n\n    // evaluate the numerator polynomial y.\n    float16x8_t y = vdupq_n_f16(c_tanh_alpha_13);\n    y = vfmaq_f16(vdupq_n_f16(c_tanh_alpha_11), y, z);\n    y = vfmaq_f16(vdupq_n_f16(c_tanh_alpha_9), y, z);\n    y = vfmaq_f16(vdupq_n_f16(c_tanh_alpha_7), y, z);\n    y = 
vfmaq_f16(vdupq_n_f16(c_tanh_alpha_5), y, z);\n    y = vfmaq_f16(vdupq_n_f16(c_tanh_alpha_3), y, z);\n    y = vfmaq_f16(vdupq_n_f16(c_tanh_alpha_1), y, z);\n    y = vmulq_f16(y, x2);\n\n    // evaluate the denominator polynomial w.\n    float16x8_t w = vdupq_n_f16(c_tanh_beta_6);\n    w = vfmaq_f16(vdupq_n_f16(c_tanh_beta_4), w, z);\n    w = vfmaq_f16(vdupq_n_f16(c_tanh_beta_2), w, z);\n    w = vfmaq_f16(vdupq_n_f16(c_tanh_beta_0), w, z);\n\n    // divide the numerator by the denominator.\n    y = vdivq_f16(y, w);\n\n    // reinstate the sign.\n    y = vreinterpretq_f16_u16(vbslq_u16(vdupq_n_u16(1u << 15), vreinterpretq_u16_f16(x), vreinterpretq_u16_f16(y)));\n\n    // when the argument is very small in magnitude it's more accurate to just return it.\n    y = vreinterpretq_f16_u16(vbslq_u16(tiny_mask, vreinterpretq_u16_f16(y), vreinterpretq_u16_f16(x)));\n\n    return y;\n}\n\nstatic inline float16x4_t sigmoid_ps_f16(float16x4_t _v)\n{\n    float16x4_t _one = vdup_n_f16(1.f);\n    _v = vneg_f16(_v);\n    _v = exp_ps_f16(_v);\n    _v = vadd_f16(_v, _one);\n    return vdiv_f16(_one, _v);\n}\n\nstatic inline float16x8_t sigmoid_ps_f16(float16x8_t _v)\n{\n    float16x8_t _one = vdupq_n_f16(1.f);\n    _v = vnegq_f16(_v);\n    _v = exp_ps_f16(_v);\n    _v = vaddq_f16(_v, _one);\n    return vdivq_f16(_one, _v);\n}\n\n#endif // NEON_MATHFUN_FP16S_H\n"
  },
  {
    "path": "src/layer/arm/neon_mathfun_tanh.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef NEON_MATHFUN_TANH_H\n#define NEON_MATHFUN_TANH_H\n\n#include <arm_neon.h>\n\n#define c_tanh_tiny 1e-4f\n#define c_tanh_hi   9.0f\n// The monomial coefficients of the numerator polynomial (odd).\n#define c_tanh_alpha_1  4.89352455891786e-3f\n#define c_tanh_alpha_3  6.37261928875436e-4f\n#define c_tanh_alpha_5  1.48572235717979e-5f\n#define c_tanh_alpha_7  5.12229709037114e-8f\n#define c_tanh_alpha_9  -8.60467152213735e-11f\n#define c_tanh_alpha_11 2.00018790482477e-13f\n#define c_tanh_alpha_13 -2.76076847742355e-16f\n// The monomial coefficients of the denominator polynomial (even).\n#define c_tanh_beta_0 4.89352518554385e-3f\n#define c_tanh_beta_2 2.26843463243900e-3f\n#define c_tanh_beta_4 1.18534705686654e-4f\n#define c_tanh_beta_6 1.19825839466702e-6f\n\n/* Single precision hyperbolic tangent computed for 4 simultaneous float */\nstatic inline float32x4_t tanh_ps(float32x4_t x)\n{\n    float32x4_t x2 = vabsq_f32(x);\n\n    uint32x4_t tiny_mask = vcgeq_f32(x2, vdupq_n_f32(c_tanh_tiny));\n\n    // clamp the inputs to the range [-9, 9] since anything outside\n    // this range is -/+1.0f in single-precision.\n    x2 = vreinterpretq_f32_u32(vbslq_u32(vcgeq_f32(vdupq_n_f32(c_tanh_hi), x2), vreinterpretq_u32_f32(x2), vreinterpretq_u32_f32(vdupq_n_f32(c_tanh_hi))));\n\n    // since the polynomials are odd/even, we need x**2.\n    float32x4_t z = vmulq_f32(x2, x2);\n\n    // evaluate the numerator polynomial y.\n    float32x4_t y = vdupq_n_f32(c_tanh_alpha_13);\n    y = vmlaq_f32(vdupq_n_f32(c_tanh_alpha_11), y, z);\n    y = vmlaq_f32(vdupq_n_f32(c_tanh_alpha_9), y, z);\n    y = vmlaq_f32(vdupq_n_f32(c_tanh_alpha_7), y, z);\n    y = vmlaq_f32(vdupq_n_f32(c_tanh_alpha_5), y, z);\n    y = vmlaq_f32(vdupq_n_f32(c_tanh_alpha_3), y, z);\n    y = vmlaq_f32(vdupq_n_f32(c_tanh_alpha_1), y, z);\n    y = vmulq_f32(y, x2);\n\n    // evaluate the denominator polynomial w.\n    float32x4_t 
w = vdupq_n_f32(c_tanh_beta_6);\n    w = vmlaq_f32(vdupq_n_f32(c_tanh_beta_4), w, z);\n    w = vmlaq_f32(vdupq_n_f32(c_tanh_beta_2), w, z);\n    w = vmlaq_f32(vdupq_n_f32(c_tanh_beta_0), w, z);\n\n    // divide the numerator by the denominator.\n#if __aarch64__\n    y = vdivq_f32(y, w);\n#else\n    y = div_ps(y, w);\n#endif\n\n    // reinstate the sign.\n    y = vreinterpretq_f32_u32(vbslq_u32(vdupq_n_u32(1u << 31), vreinterpretq_u32_f32(x), vreinterpretq_u32_f32(y)));\n\n    // when the argument is very small in magnitude it's more accurate to just return it.\n    y = vreinterpretq_f32_u32(vbslq_u32(tiny_mask, vreinterpretq_u32_f32(y), vreinterpretq_u32_f32(x)));\n\n    return y;\n}\n\n#endif // NEON_MATHFUN_TANH_H\n"
  },
  {
    "path": "src/layer/arm/packing_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"packing_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nPacking_arm::Packing_arm()\n{\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n\n    support_bf16_storage = true;\n}\n\nint Packing_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n    if (elembits == 8)\n        return forward_int8(bottom_blob, top_blob, opt);\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n\n    if (use_padding)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (elembits != 32)\n    {\n        // non-fp32 type\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    if (elempack == out_elempack)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    bool pack1to4 = elempack == 1 && out_elempack == 4;\n    bool pack4to1 = elempack == 4 && out_elempack == 1;\n\n    if (!pack1to4 && !pack4to1)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n\n    if (!use_padding)\n    {\n        // identity if use_padding not allowed\n        if (dims == 1 && w * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if (dims == 2 && h * elempack % out_elempack != 0)\n        {\n 
           top_blob = bottom_blob;\n            return 0;\n        }\n        if ((dims == 3 || dims == 4) && channels * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n    }\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        top_blob.w = w * elempack / out_elempack;\n        top_blob.cstep = bottom_blob.cstep * elempack / out_elempack;\n        top_blob.elemsize = elemsize / elempack * out_elempack;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        int outh = h * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const float* r0 = bottom_blob.row(i * 4);\n                const float* r1 = bottom_blob.row(i * 4 + 1);\n                const float* r2 = bottom_blob.row(i * 4 + 2);\n                const float* r3 = bottom_blob.row(i * 4 + 3);\n\n                float* outptr = top_blob.row(i);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    float32x4x4_t _p;\n                    _p.val[0] = vld1q_f32(r0);\n                    _p.val[1] = vld1q_f32(r1);\n                    _p.val[2] = vld1q_f32(r2);\n                    _p.val[3] = vld1q_f32(r3);\n                    vst4q_f32(outptr, _p);\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                    r3 += 4;\n                    outptr += 16;\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] 
= *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n        if (pack4to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* r0 = bottom_blob.row(i);\n\n                float* outptr0 = top_blob.row(i * 4);\n                float* outptr1 = top_blob.row(i * 4 + 1);\n                float* outptr2 = top_blob.row(i * 4 + 2);\n                float* outptr3 = top_blob.row(i * 4 + 3);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    float32x4x4_t _p = vld4q_f32(r0);\n                    vst1q_f32(outptr0, _p.val[0]);\n                    vst1q_f32(outptr1, _p.val[1]);\n                    vst1q_f32(outptr2, _p.val[2]);\n                    vst1q_f32(outptr3, _p.val[3]);\n\n                    r0 += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n\n                    r0 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int size = w * h * d;\n        int outc = channels * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if (dims == 3)\n            top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        else // if (dims == 4)\n            top_blob.create(w, h, d, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return 
-100;\n\n        if (pack1to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const float* r0 = bottom_blob.channel(q * 4);\n                const float* r1 = bottom_blob.channel(q * 4 + 1);\n                const float* r2 = bottom_blob.channel(q * 4 + 2);\n                const float* r3 = bottom_blob.channel(q * 4 + 3);\n\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4x4_t _p;\n                    _p.val[0] = vld1q_f32(r0);\n                    _p.val[1] = vld1q_f32(r1);\n                    _p.val[2] = vld1q_f32(r2);\n                    _p.val[3] = vld1q_f32(r3);\n                    vst4q_f32(outptr, _p);\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                    r3 += 4;\n                    outptr += 16;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n        if (pack4to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* r0 = bottom_blob.channel(q);\n\n                float* outptr0 = top_blob.channel(q * 4);\n                float* outptr1 = top_blob.channel(q * 4 + 1);\n                float* outptr2 = top_blob.channel(q * 4 + 2);\n                float* outptr3 = top_blob.channel(q * 4 + 3);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4x4_t _p = vld4q_f32(r0);\n   
                 vst1q_f32(outptr0, _p.val[0]);\n                    vst1q_f32(outptr1, _p.val[1]);\n                    vst1q_f32(outptr2, _p.val[2]);\n                    vst1q_f32(outptr3, _p.val[3]);\n\n                    r0 += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n\n                    r0 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    return 0;\n}\n\nint Packing_arm::forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (use_padding)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    if (elempack == out_elempack)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    bool pack1to4 = elempack == 1 && out_elempack == 4;\n    bool pack4to1 = elempack == 4 && out_elempack == 1;\n    bool pack1to8 = elempack == 1 && out_elempack == 8;\n    bool pack8to1 = elempack == 8 && out_elempack == 1;\n    bool pack4to8 = elempack == 4 && out_elempack == 8;\n    bool pack8to4 = elempack == 8 && out_elempack == 4;\n\n    if (!pack1to4 && !pack4to1 && !pack1to8 && !pack8to1 && !pack4to8 && !pack8to4)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n\n    if (!use_padding)\n    {\n        // identity if use_padding not allowed\n        if (dims == 1 && w * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n 
           return 0;\n        }\n        if (dims == 2 && h * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if ((dims == 3 || dims == 4) && channels * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n    }\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        top_blob.w = w * elempack / out_elempack;\n        top_blob.cstep = bottom_blob.cstep * elempack / out_elempack;\n        top_blob.elemsize = elemsize / elempack * out_elempack;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        int outh = h * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(i * 4);\n                const unsigned short* r1 = bottom_blob.row<const unsigned short>(i * 4 + 1);\n                const unsigned short* r2 = bottom_blob.row<const unsigned short>(i * 4 + 2);\n                const unsigned short* r3 = bottom_blob.row<const unsigned short>(i * 4 + 3);\n\n                unsigned short* outptr = top_blob.row<unsigned short>(i);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x4x4_t _p;\n                    _p.val[0] = vld1_u16(r0);\n                    _p.val[1] = vld1_u16(r1);\n                    _p.val[2] = vld1_u16(r2);\n                    _p.val[3] = vld1_u16(r3);\n                    vst4_u16(outptr, _p);\n\n                    r0 += 4;\n                    r1 += 
4;\n                    r2 += 4;\n                    r3 += 4;\n                    outptr += 16;\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n        if (pack4to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(i);\n\n                unsigned short* outptr0 = top_blob.row<unsigned short>(i * 4);\n                unsigned short* outptr1 = top_blob.row<unsigned short>(i * 4 + 1);\n                unsigned short* outptr2 = top_blob.row<unsigned short>(i * 4 + 2);\n                unsigned short* outptr3 = top_blob.row<unsigned short>(i * 4 + 3);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x4x4_t _p = vld4_u16(r0);\n                    vst1_u16(outptr0, _p.val[0]);\n                    vst1_u16(outptr1, _p.val[1]);\n                    vst1_u16(outptr2, _p.val[2]);\n                    vst1_u16(outptr3, _p.val[3]);\n\n                    r0 += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n\n                    r0 += 4;\n                }\n            }\n        }\n        if (pack1to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; 
i++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(i * 8);\n                const unsigned short* r1 = bottom_blob.row<const unsigned short>(i * 8 + 1);\n                const unsigned short* r2 = bottom_blob.row<const unsigned short>(i * 8 + 2);\n                const unsigned short* r3 = bottom_blob.row<const unsigned short>(i * 8 + 3);\n                const unsigned short* r4 = bottom_blob.row<const unsigned short>(i * 8 + 4);\n                const unsigned short* r5 = bottom_blob.row<const unsigned short>(i * 8 + 5);\n                const unsigned short* r6 = bottom_blob.row<const unsigned short>(i * 8 + 6);\n                const unsigned short* r7 = bottom_blob.row<const unsigned short>(i * 8 + 7);\n\n                unsigned short* outptr = top_blob.row<unsigned short>(i);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 7 < w; j += 8)\n                {\n                    // transpose 8x8\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                    asm volatile(\n                        \"ld1    {v0.8h}, [%0], #16      \\n\"\n                        \"ld1    {v1.8h}, [%1], #16      \\n\"\n                        \"ld1    {v2.8h}, [%2], #16      \\n\"\n                        \"ld1    {v3.8h}, [%3], #16      \\n\"\n                        \"ld1    {v4.8h}, [%4], #16      \\n\"\n                        \"ld1    {v5.8h}, [%5], #16      \\n\"\n                        \"ld1    {v6.8h}, [%6], #16      \\n\"\n                        \"ld1    {v7.8h}, [%7], #16      \\n\"\n\n                        \"zip1   v16.8h, v0.8h, v4.8h    \\n\"\n                        \"zip2   v20.8h, v0.8h, v4.8h    \\n\"\n                        \"zip1   v17.8h, v1.8h, v5.8h    \\n\"\n                        \"zip2   v21.8h, v1.8h, v5.8h    \\n\"\n                        \"zip1   v18.8h, v2.8h, v6.8h    \\n\"\n                        \"zip2   v22.8h, v2.8h, v6.8h    \\n\"\n                        \"zip1   
v19.8h, v3.8h, v7.8h    \\n\"\n                        \"zip2   v23.8h, v3.8h, v7.8h    \\n\"\n\n                        \"st4    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n                        \"st4    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(r2),    // %2\n                        \"=r\"(r3),    // %3\n                        \"=r\"(r4),    // %4\n                        \"=r\"(r5),    // %5\n                        \"=r\"(r6),    // %6\n                        \"=r\"(r7),    // %7\n                        \"=r\"(outptr) // %8\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(r2),\n                        \"3\"(r3),\n                        \"4\"(r4),\n                        \"5\"(r5),\n                        \"6\"(r6),\n                        \"7\"(r7),\n                        \"8\"(outptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                    asm volatile(\n                        \"vld1.u16   {d16-d17}, [%0]!    \\n\"\n                        \"vld1.u16   {d18-d19}, [%1]!    \\n\"\n                        \"vld1.u16   {d20-d21}, [%2]!    \\n\"\n                        \"vld1.u16   {d22-d23}, [%3]!    \\n\"\n                        \"vld1.u16   {d24-d25}, [%4]!    \\n\"\n                        \"vld1.u16   {d26-d27}, [%5]!    \\n\"\n                        \"vld1.u16   {d28-d29}, [%6]!    \\n\"\n                        \"vld1.u16   {d30-d31}, [%7]!    
\\n\"\n\n                        \"vtrn.u16   q8, q9              \\n\"\n                        \"vtrn.u16   q10, q11            \\n\"\n                        \"vtrn.u16   q12, q13            \\n\"\n                        \"vtrn.u16   q14, q15            \\n\"\n\n                        \"vtrn.u32   q8, q10             \\n\"\n                        \"vtrn.u32   q9, q11             \\n\"\n                        \"vtrn.u32   q12, q14            \\n\"\n                        \"vtrn.u32   q13, q15            \\n\"\n\n                        \"vswp       d17, d24            \\n\"\n                        \"vswp       d19, d26            \\n\"\n                        \"vswp       d21, d28            \\n\"\n                        \"vswp       d23, d30            \\n\"\n\n                        \"vstm       %8!, {d16-d23}      \\n\"\n                        \"vstm       %8!, {d24-d31}      \\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(r2),    // %2\n                        \"=r\"(r3),    // %3\n                        \"=r\"(r4),    // %4\n                        \"=r\"(r5),    // %5\n                        \"=r\"(r6),    // %6\n                        \"=r\"(r7),    // %7\n                        \"=r\"(outptr) // %8\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(r2),\n                        \"3\"(r3),\n                        \"4\"(r4),\n                        \"5\"(r5),\n                        \"6\"(r6),\n                        \"7\"(r7),\n                        \"8\"(outptr)\n                        : \"memory\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n#else  // NCNN_GNU_INLINE_ASM\n                    uint16x8_t _r0 = vld1q_u16(r0);\n                    uint16x8_t _r1 = vld1q_u16(r1);\n                    uint16x8_t _r2 = vld1q_u16(r2);\n                    uint16x8_t _r3 
= vld1q_u16(r3);\n                    uint16x8_t _r4 = vld1q_u16(r4);\n                    uint16x8_t _r5 = vld1q_u16(r5);\n                    uint16x8_t _r6 = vld1q_u16(r6);\n                    uint16x8_t _r7 = vld1q_u16(r7);\n                    uint16x8x2_t _r04 = vzipq_u16(_r0, _r4);\n                    uint16x8x2_t _r15 = vzipq_u16(_r1, _r5);\n                    uint16x8x2_t _r26 = vzipq_u16(_r2, _r6);\n                    uint16x8x2_t _r37 = vzipq_u16(_r3, _r7);\n                    uint16x8x4_t _r0123;\n                    _r0123.val[0] = _r04.val[0];\n                    _r0123.val[1] = _r15.val[0];\n                    _r0123.val[2] = _r26.val[0];\n                    _r0123.val[3] = _r37.val[0];\n                    uint16x8x4_t _r4567;\n                    _r4567.val[0] = _r04.val[1];\n                    _r4567.val[1] = _r15.val[1];\n                    _r4567.val[2] = _r26.val[1];\n                    _r4567.val[3] = _r37.val[1];\n                    vst4q_u16(outptr, _r0123);\n                    vst4q_u16(outptr + 32, _r4567);\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                    r3 += 8;\n                    r4 += 8;\n                    r5 += 8;\n                    r6 += 8;\n                    r7 += 8;\n                    outptr += 64;\n#endif // NCNN_GNU_INLINE_ASM\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n                    outptr[4] = *r4++;\n                    outptr[5] = *r5++;\n                    outptr[6] = *r6++;\n                    outptr[7] = *r7++;\n\n                    outptr += 8;\n                }\n            }\n        }\n        if (pack8to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n    
        {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(i);\n\n                unsigned short* outptr0 = top_blob.row<unsigned short>(i * 8);\n                unsigned short* outptr1 = top_blob.row<unsigned short>(i * 8 + 1);\n                unsigned short* outptr2 = top_blob.row<unsigned short>(i * 8 + 2);\n                unsigned short* outptr3 = top_blob.row<unsigned short>(i * 8 + 3);\n                unsigned short* outptr4 = top_blob.row<unsigned short>(i * 8 + 4);\n                unsigned short* outptr5 = top_blob.row<unsigned short>(i * 8 + 5);\n                unsigned short* outptr6 = top_blob.row<unsigned short>(i * 8 + 6);\n                unsigned short* outptr7 = top_blob.row<unsigned short>(i * 8 + 7);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 7 < w; j += 8)\n                {\n                    // transpose 8x8\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                    asm volatile(\n                        \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                        \"ld4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0], #64 \\n\"\n\n                        \"uzp1   v16.8h, v0.8h, v4.8h    \\n\"\n                        \"uzp2   v20.8h, v0.8h, v4.8h    \\n\"\n                        \"uzp1   v17.8h, v1.8h, v5.8h    \\n\"\n                        \"uzp2   v21.8h, v1.8h, v5.8h    \\n\"\n                        \"uzp1   v18.8h, v2.8h, v6.8h    \\n\"\n                        \"uzp2   v22.8h, v2.8h, v6.8h    \\n\"\n                        \"uzp1   v19.8h, v3.8h, v7.8h    \\n\"\n                        \"uzp2   v23.8h, v3.8h, v7.8h    \\n\"\n\n                        \"st1    {v16.8h}, [%1], #16      \\n\"\n                        \"st1    {v17.8h}, [%2], #16      \\n\"\n                        \"st1    {v18.8h}, [%3], #16      \\n\"\n                        \"st1    {v19.8h}, [%4], #16      \\n\"\n                        \"st1    {v20.8h}, [%5], #16      \\n\"\n         
               \"st1    {v21.8h}, [%6], #16      \\n\"\n                        \"st1    {v22.8h}, [%7], #16      \\n\"\n                        \"st1    {v23.8h}, [%8], #16      \\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(outptr4), // %5\n                        \"=r\"(outptr5), // %6\n                        \"=r\"(outptr6), // %7\n                        \"=r\"(outptr7)  // %8\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(outptr4),\n                        \"6\"(outptr5),\n                        \"7\"(outptr6),\n                        \"8\"(outptr7)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                    asm volatile(\n                        \"vldm       %0!, {d16-d23}      \\n\"\n                        \"vldm       %0!, {d24-d31}      \\n\"\n\n                        \"vtrn.u16   q8, q9              \\n\"\n                        \"vtrn.u16   q10, q11            \\n\"\n                        \"vtrn.u16   q12, q13            \\n\"\n                        \"vtrn.u16   q14, q15            \\n\"\n\n                        \"vtrn.u32   q8, q10             \\n\"\n                        \"vtrn.u32   q9, q11             \\n\"\n                        \"vtrn.u32   q12, q14            \\n\"\n                        \"vtrn.u32   q13, q15            \\n\"\n\n                        \"vswp       d17, d24            \\n\"\n                        \"vswp       d19, d26            \\n\"\n                       
 \"vswp       d21, d28            \\n\"\n                        \"vswp       d23, d30            \\n\"\n\n                        \"vst1.u16   {d16-d17}, [%1]!    \\n\"\n                        \"vst1.u16   {d18-d19}, [%2]!    \\n\"\n                        \"vst1.u16   {d20-d21}, [%3]!    \\n\"\n                        \"vst1.u16   {d22-d23}, [%4]!    \\n\"\n                        \"vst1.u16   {d24-d25}, [%5]!    \\n\"\n                        \"vst1.u16   {d26-d27}, [%6]!    \\n\"\n                        \"vst1.u16   {d28-d29}, [%7]!    \\n\"\n                        \"vst1.u16   {d30-d31}, [%8]!    \\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(outptr4), // %5\n                        \"=r\"(outptr5), // %6\n                        \"=r\"(outptr6), // %7\n                        \"=r\"(outptr7)  // %8\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(outptr4),\n                        \"6\"(outptr5),\n                        \"7\"(outptr6),\n                        \"8\"(outptr7)\n                        : \"memory\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n#else  // NCNN_GNU_INLINE_ASM\n                    uint16x8x4_t _r0246 = vld4q_u16(r0);\n                    uint16x8x4_t _r1357 = vld4q_u16(r0 + 32);\n                    uint16x8x2_t _r04 = vuzpq_u16(_r0246.val[0], _r1357.val[0]);\n                    uint16x8x2_t _r15 = vuzpq_u16(_r0246.val[1], _r1357.val[1]);\n                    uint16x8x2_t _r26 = vuzpq_u16(_r0246.val[2], _r1357.val[2]);\n                    uint16x8x2_t _r37 = vuzpq_u16(_r0246.val[3], 
_r1357.val[3]);\n                    vst1q_u16(outptr0, _r04.val[0]);\n                    vst1q_u16(outptr1, _r15.val[0]);\n                    vst1q_u16(outptr2, _r26.val[0]);\n                    vst1q_u16(outptr3, _r37.val[0]);\n                    vst1q_u16(outptr4, _r04.val[1]);\n                    vst1q_u16(outptr5, _r15.val[1]);\n                    vst1q_u16(outptr6, _r26.val[1]);\n                    vst1q_u16(outptr7, _r37.val[1]);\n\n                    r0 += 64;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                    outptr2 += 8;\n                    outptr3 += 8;\n                    outptr4 += 8;\n                    outptr5 += 8;\n                    outptr6 += 8;\n                    outptr7 += 8;\n#endif // NCNN_GNU_INLINE_ASM\n                }\n#endif\n                for (; j < w; j++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n                    *outptr4++ = r0[4];\n                    *outptr5++ = r0[5];\n                    *outptr6++ = r0[6];\n                    *outptr7++ = r0[7];\n\n                    r0 += 8;\n                }\n            }\n        }\n        if (pack4to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(i * 2);\n                const unsigned short* r1 = bottom_blob.row<const unsigned short>(i * 2 + 1);\n\n                unsigned short* outptr = top_blob.row<unsigned short>(i);\n\n                int j = 0;\n#if NCNN_GNU_INLINE_ASM\n#if __ARM_NEON\n                for (; j + 1 < w; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"ld1    {v0.8h}, [%0], #16      \\n\"\n                        \"ld1    {v1.8h}, [%1], 
#16      \\n\"\n\n                        \"zip1   v2.2d, v0.2d, v1.2d     \\n\"\n                        \"zip2   v3.2d, v0.2d, v1.2d     \\n\"\n\n                        \"st1    {v2.8h, v3.8h}, [%2], #32\\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(outptr) // %2\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(outptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                    asm volatile(\n                        \"vld1.u16   {d0-d1}, [%0 :64]!  \\n\"\n                        \"vld1.u16   {d2-d3}, [%1 :64]!  \\n\"\n\n                        \"vswp       d1, d2              \\n\"\n\n                        \"vst1.u16   {d0-d3}, [%2 :128]! \\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(outptr) // %2\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(outptr)\n                        : \"memory\", \"q0\", \"q1\");\n#endif\n                }\n#endif\n#endif // NCNN_GNU_INLINE_ASM\n                for (; j < w; j++)\n                {\n                    outptr[0] = r0[0];\n                    outptr[1] = r0[1];\n                    outptr[2] = r0[2];\n                    outptr[3] = r0[3];\n                    outptr[4] = r1[0];\n                    outptr[5] = r1[1];\n                    outptr[6] = r1[2];\n                    outptr[7] = r1[3];\n\n                    r0 += 4;\n                    r1 += 4;\n                    outptr += 8;\n                }\n            }\n        }\n        if (pack8to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const unsigned short* r0 = bottom_blob.row<const unsigned short>(i);\n\n                
unsigned short* outptr0 = top_blob.row<unsigned short>(i * 2);\n                unsigned short* outptr1 = top_blob.row<unsigned short>(i * 2 + 1);\n\n                int j = 0;\n#if NCNN_GNU_INLINE_ASM\n#if __ARM_NEON\n                for (; j + 1 < w; j += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"ld1    {v0.8h, v1.8h}, [%0], #32 \\n\"\n\n                        \"uzp1   v2.2d, v0.2d, v1.2d     \\n\"\n                        \"uzp2   v3.2d, v0.2d, v1.2d     \\n\"\n\n                        \"st1    {v2.8h}, [%1], #16      \\n\"\n                        \"st1    {v3.8h}, [%2], #16      \\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1)  // %2\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                    asm volatile(\n                        \"vld1.u16   {d0-d3}, [%0 :128]! \\n\"\n\n                        \"vswp       d1, d2              \\n\"\n\n                        \"vst1.u16   {d0-d1}, [%1 :64]!  \\n\"\n                        \"vst1.u16   {d2-d3}, [%2 :64]!  
\\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1)  // %2\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1)\n                        : \"memory\", \"q0\", \"q1\");\n#endif\n                }\n#endif\n#endif // NCNN_GNU_INLINE_ASM\n                for (; j < w; j++)\n                {\n                    outptr0[0] = r0[0];\n                    outptr0[1] = r0[1];\n                    outptr0[2] = r0[2];\n                    outptr0[3] = r0[3];\n                    outptr1[0] = r0[4];\n                    outptr1[1] = r0[5];\n                    outptr1[2] = r0[6];\n                    outptr1[3] = r0[7];\n\n                    r0 += 8;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int size = w * h * d;\n        int outc = channels * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if (dims == 3)\n            top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        else // if (dims == 4)\n            top_blob.create(w, h, d, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const unsigned short* r0 = bottom_blob.channel(q * 4);\n                const unsigned short* r1 = bottom_blob.channel(q * 4 + 1);\n                const unsigned short* r2 = bottom_blob.channel(q * 4 + 2);\n                const unsigned short* r3 = bottom_blob.channel(q * 4 + 3);\n\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 
0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    uint16x4x4_t _p;\n                    _p.val[0] = vld1_u16(r0);\n                    _p.val[1] = vld1_u16(r1);\n                    _p.val[2] = vld1_u16(r2);\n                    _p.val[3] = vld1_u16(r3);\n                    vst4_u16(outptr, _p);\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                    r3 += 4;\n                    outptr += 16;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n        if (pack4to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* r0 = bottom_blob.channel(q);\n\n                unsigned short* outptr0 = top_blob.channel(q * 4);\n                unsigned short* outptr1 = top_blob.channel(q * 4 + 1);\n                unsigned short* outptr2 = top_blob.channel(q * 4 + 2);\n                unsigned short* outptr3 = top_blob.channel(q * 4 + 3);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    uint16x4x4_t _p = vld4_u16(r0);\n                    vst1_u16(outptr0, _p.val[0]);\n                    vst1_u16(outptr1, _p.val[1]);\n                    vst1_u16(outptr2, _p.val[2]);\n                    vst1_u16(outptr3, _p.val[3]);\n\n                    r0 += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                   
 *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n\n                    r0 += 4;\n                }\n            }\n        }\n        if (pack1to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const unsigned short* r0 = bottom_blob.channel(q * 8);\n                const unsigned short* r1 = bottom_blob.channel(q * 8 + 1);\n                const unsigned short* r2 = bottom_blob.channel(q * 8 + 2);\n                const unsigned short* r3 = bottom_blob.channel(q * 8 + 3);\n                const unsigned short* r4 = bottom_blob.channel(q * 8 + 4);\n                const unsigned short* r5 = bottom_blob.channel(q * 8 + 5);\n                const unsigned short* r6 = bottom_blob.channel(q * 8 + 6);\n                const unsigned short* r7 = bottom_blob.channel(q * 8 + 7);\n\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    // transpose 8x8\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                    asm volatile(\n                        \"ld1    {v0.8h}, [%0], #16      \\n\"\n                        \"ld1    {v1.8h}, [%1], #16      \\n\"\n                        \"ld1    {v2.8h}, [%2], #16      \\n\"\n                        \"ld1    {v3.8h}, [%3], #16      \\n\"\n                        \"ld1    {v4.8h}, [%4], #16      \\n\"\n                        \"ld1    {v5.8h}, [%5], #16      \\n\"\n                        \"ld1    {v6.8h}, [%6], #16      \\n\"\n                        \"ld1    {v7.8h}, [%7], #16      \\n\"\n\n                        \"zip1   v16.8h, v0.8h, v4.8h    \\n\"\n                        \"zip2   v20.8h, v0.8h, v4.8h    \\n\"\n                        \"zip1   v17.8h, v1.8h, v5.8h    \\n\"\n                 
       \"zip2   v21.8h, v1.8h, v5.8h    \\n\"\n                        \"zip1   v18.8h, v2.8h, v6.8h    \\n\"\n                        \"zip2   v22.8h, v2.8h, v6.8h    \\n\"\n                        \"zip1   v19.8h, v3.8h, v7.8h    \\n\"\n                        \"zip2   v23.8h, v3.8h, v7.8h    \\n\"\n\n                        \"st4    {v16.8h, v17.8h, v18.8h, v19.8h}, [%8], #64 \\n\"\n                        \"st4    {v20.8h, v21.8h, v22.8h, v23.8h}, [%8], #64 \\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(r2),    // %2\n                        \"=r\"(r3),    // %3\n                        \"=r\"(r4),    // %4\n                        \"=r\"(r5),    // %5\n                        \"=r\"(r6),    // %6\n                        \"=r\"(r7),    // %7\n                        \"=r\"(outptr) // %8\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(r2),\n                        \"3\"(r3),\n                        \"4\"(r4),\n                        \"5\"(r5),\n                        \"6\"(r6),\n                        \"7\"(r7),\n                        \"8\"(outptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                    asm volatile(\n                        \"vld1.u16   {d16-d17}, [%0 : 128]! \\n\"\n                        \"vld1.u16   {d18-d19}, [%1 : 128]! \\n\"\n                        \"vld1.u16   {d20-d21}, [%2 : 128]! \\n\"\n                        \"vld1.u16   {d22-d23}, [%3 : 128]! \\n\"\n                        \"vld1.u16   {d24-d25}, [%4 : 128]! \\n\"\n                        \"vld1.u16   {d26-d27}, [%5 : 128]! \\n\"\n                        \"vld1.u16   {d28-d29}, [%6 : 128]! \\n\"\n                        \"vld1.u16   {d30-d31}, [%7 : 128]! 
\\n\"\n\n                        \"vtrn.u16   q8, q9              \\n\"\n                        \"vtrn.u16   q10, q11            \\n\"\n                        \"vtrn.u16   q12, q13            \\n\"\n                        \"vtrn.u16   q14, q15            \\n\"\n\n                        \"vtrn.u32   q8, q10             \\n\"\n                        \"vtrn.u32   q9, q11             \\n\"\n                        \"vtrn.u32   q12, q14            \\n\"\n                        \"vtrn.u32   q13, q15            \\n\"\n\n                        \"vswp       d17, d24            \\n\"\n                        \"vswp       d19, d26            \\n\"\n                        \"vswp       d21, d28            \\n\"\n                        \"vswp       d23, d30            \\n\"\n\n                        \"vstm       %8!, {d16-d23}      \\n\"\n                        \"vstm       %8!, {d24-d31}      \\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(r2),    // %2\n                        \"=r\"(r3),    // %3\n                        \"=r\"(r4),    // %4\n                        \"=r\"(r5),    // %5\n                        \"=r\"(r6),    // %6\n                        \"=r\"(r7),    // %7\n                        \"=r\"(outptr) // %8\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(r2),\n                        \"3\"(r3),\n                        \"4\"(r4),\n                        \"5\"(r5),\n                        \"6\"(r6),\n                        \"7\"(r7),\n                        \"8\"(outptr)\n                        : \"memory\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n#else  // NCNN_GNU_INLINE_ASM\n                    uint16x8_t _r0 = vld1q_u16(r0);\n                    uint16x8_t _r1 = vld1q_u16(r1);\n                    uint16x8_t _r2 = vld1q_u16(r2);\n                    uint16x8_t _r3 
= vld1q_u16(r3);\n                    uint16x8_t _r4 = vld1q_u16(r4);\n                    uint16x8_t _r5 = vld1q_u16(r5);\n                    uint16x8_t _r6 = vld1q_u16(r6);\n                    uint16x8_t _r7 = vld1q_u16(r7);\n                    uint16x8x2_t _r04 = vzipq_u16(_r0, _r4);\n                    uint16x8x2_t _r15 = vzipq_u16(_r1, _r5);\n                    uint16x8x2_t _r26 = vzipq_u16(_r2, _r6);\n                    uint16x8x2_t _r37 = vzipq_u16(_r3, _r7);\n                    uint16x8x4_t _r0123;\n                    _r0123.val[0] = _r04.val[0];\n                    _r0123.val[1] = _r15.val[0];\n                    _r0123.val[2] = _r26.val[0];\n                    _r0123.val[3] = _r37.val[0];\n                    uint16x8x4_t _r4567;\n                    _r4567.val[0] = _r04.val[1];\n                    _r4567.val[1] = _r15.val[1];\n                    _r4567.val[2] = _r26.val[1];\n                    _r4567.val[3] = _r37.val[1];\n                    vst4q_u16(outptr, _r0123);\n                    vst4q_u16(outptr + 32, _r4567);\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                    r3 += 8;\n                    r4 += 8;\n                    r5 += 8;\n                    r6 += 8;\n                    r7 += 8;\n                    outptr += 64;\n#endif // NCNN_GNU_INLINE_ASM\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n                    outptr[4] = *r4++;\n                    outptr[5] = *r5++;\n                    outptr[6] = *r6++;\n                    outptr[7] = *r7++;\n\n                    outptr += 8;\n                }\n            }\n        }\n        if (pack8to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; 
q++)\n            {\n                const unsigned short* r0 = bottom_blob.channel(q);\n\n                unsigned short* outptr0 = top_blob.channel(q * 8);\n                unsigned short* outptr1 = top_blob.channel(q * 8 + 1);\n                unsigned short* outptr2 = top_blob.channel(q * 8 + 2);\n                unsigned short* outptr3 = top_blob.channel(q * 8 + 3);\n                unsigned short* outptr4 = top_blob.channel(q * 8 + 4);\n                unsigned short* outptr5 = top_blob.channel(q * 8 + 5);\n                unsigned short* outptr6 = top_blob.channel(q * 8 + 6);\n                unsigned short* outptr7 = top_blob.channel(q * 8 + 7);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 7 < size; i += 8)\n                {\n                    // transpose 8x8\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                    asm volatile(\n                        \"ld4    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                        \"ld4    {v4.8h, v5.8h, v6.8h, v7.8h}, [%0], #64 \\n\"\n\n                        \"uzp1   v16.8h, v0.8h, v4.8h    \\n\"\n                        \"uzp2   v20.8h, v0.8h, v4.8h    \\n\"\n                        \"uzp1   v17.8h, v1.8h, v5.8h    \\n\"\n                        \"uzp2   v21.8h, v1.8h, v5.8h    \\n\"\n                        \"uzp1   v18.8h, v2.8h, v6.8h    \\n\"\n                        \"uzp2   v22.8h, v2.8h, v6.8h    \\n\"\n                        \"uzp1   v19.8h, v3.8h, v7.8h    \\n\"\n                        \"uzp2   v23.8h, v3.8h, v7.8h    \\n\"\n\n                        \"st1    {v16.8h}, [%1], #16      \\n\"\n                        \"st1    {v17.8h}, [%2], #16      \\n\"\n                        \"st1    {v18.8h}, [%3], #16      \\n\"\n                        \"st1    {v19.8h}, [%4], #16      \\n\"\n                        \"st1    {v20.8h}, [%5], #16      \\n\"\n                        \"st1    {v21.8h}, [%6], #16      \\n\"\n                        \"st1    {v22.8h}, 
[%7], #16      \\n\"\n                        \"st1    {v23.8h}, [%8], #16      \\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(outptr4), // %5\n                        \"=r\"(outptr5), // %6\n                        \"=r\"(outptr6), // %7\n                        \"=r\"(outptr7)  // %8\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(outptr4),\n                        \"6\"(outptr5),\n                        \"7\"(outptr6),\n                        \"8\"(outptr7)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else\n                    asm volatile(\n                        \"vldm       %0!, {d16-d23}      \\n\"\n                        \"vldm       %0!, {d24-d31}      \\n\"\n\n                        \"vtrn.u16   q8, q9              \\n\"\n                        \"vtrn.u16   q10, q11            \\n\"\n                        \"vtrn.u16   q12, q13            \\n\"\n                        \"vtrn.u16   q14, q15            \\n\"\n\n                        \"vtrn.u32   q8, q10             \\n\"\n                        \"vtrn.u32   q9, q11             \\n\"\n                        \"vtrn.u32   q12, q14            \\n\"\n                        \"vtrn.u32   q13, q15            \\n\"\n\n                        \"vswp       d17, d24            \\n\"\n                        \"vswp       d19, d26            \\n\"\n                        \"vswp       d21, d28            \\n\"\n                        \"vswp       d23, d30            
\\n\"\n\n                        \"vst1.u16   {d16-d17}, [%1 : 128]! \\n\"\n                        \"vst1.u16   {d18-d19}, [%2 : 128]! \\n\"\n                        \"vst1.u16   {d20-d21}, [%3 : 128]! \\n\"\n                        \"vst1.u16   {d22-d23}, [%4 : 128]! \\n\"\n                        \"vst1.u16   {d24-d25}, [%5 : 128]! \\n\"\n                        \"vst1.u16   {d26-d27}, [%6 : 128]! \\n\"\n                        \"vst1.u16   {d28-d29}, [%7 : 128]! \\n\"\n                        \"vst1.u16   {d30-d31}, [%8 : 128]! \\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1), // %2\n                        \"=r\"(outptr2), // %3\n                        \"=r\"(outptr3), // %4\n                        \"=r\"(outptr4), // %5\n                        \"=r\"(outptr5), // %6\n                        \"=r\"(outptr6), // %7\n                        \"=r\"(outptr7)  // %8\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1),\n                        \"3\"(outptr2),\n                        \"4\"(outptr3),\n                        \"5\"(outptr4),\n                        \"6\"(outptr5),\n                        \"7\"(outptr6),\n                        \"8\"(outptr7)\n                        : \"memory\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif\n#else  // NCNN_GNU_INLINE_ASM\n                    uint16x8x4_t _r0246 = vld4q_u16(r0);\n                    uint16x8x4_t _r1357 = vld4q_u16(r0 + 32);\n                    uint16x8x2_t _r04 = vuzpq_u16(_r0246.val[0], _r1357.val[0]);\n                    uint16x8x2_t _r15 = vuzpq_u16(_r0246.val[1], _r1357.val[1]);\n                    uint16x8x2_t _r26 = vuzpq_u16(_r0246.val[2], _r1357.val[2]);\n                    uint16x8x2_t _r37 = vuzpq_u16(_r0246.val[3], _r1357.val[3]);\n                    vst1q_u16(outptr0, _r04.val[0]);\n    
                vst1q_u16(outptr1, _r15.val[0]);\n                    vst1q_u16(outptr2, _r26.val[0]);\n                    vst1q_u16(outptr3, _r37.val[0]);\n                    vst1q_u16(outptr4, _r04.val[1]);\n                    vst1q_u16(outptr5, _r15.val[1]);\n                    vst1q_u16(outptr6, _r26.val[1]);\n                    vst1q_u16(outptr7, _r37.val[1]);\n\n                    r0 += 64;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                    outptr2 += 8;\n                    outptr3 += 8;\n                    outptr4 += 8;\n                    outptr5 += 8;\n                    outptr6 += 8;\n                    outptr7 += 8;\n#endif // NCNN_GNU_INLINE_ASM\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n                    *outptr4++ = r0[4];\n                    *outptr5++ = r0[5];\n                    *outptr6++ = r0[6];\n                    *outptr7++ = r0[7];\n\n                    r0 += 8;\n                }\n            }\n        }\n        if (pack4to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const unsigned short* r0 = bottom_blob.channel(q * 2);\n                const unsigned short* r1 = bottom_blob.channel(q * 2 + 1);\n\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if NCNN_GNU_INLINE_ASM\n#if __ARM_NEON\n                for (; i + 1 < size; i += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"ld1    {v0.8h}, [%0], #16      \\n\"\n                        \"ld1    {v1.8h}, [%1], #16      \\n\"\n\n                        \"zip1   v2.2d, v0.2d, v1.2d     \\n\"\n                        \"zip2   
v3.2d, v0.2d, v1.2d     \\n\"\n\n                        \"st1    {v2.8h, v3.8h}, [%2], #32\\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(outptr) // %2\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(outptr)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                    asm volatile(\n                        \"vld1.u16   {d0-d1}, [%0 :128]! \\n\"\n                        \"vld1.u16   {d2-d3}, [%1 :128]! \\n\"\n\n                        \"vswp       d1, d2              \\n\"\n\n                        \"vst1.u16   {d0-d3}, [%2 :128]! \\n\"\n                        : \"=r\"(r0),    // %0\n                        \"=r\"(r1),    // %1\n                        \"=r\"(outptr) // %2\n                        : \"0\"(r0),\n                        \"1\"(r1),\n                        \"2\"(outptr)\n                        : \"memory\", \"q0\", \"q1\");\n#endif\n                }\n#endif\n#endif // NCNN_GNU_INLINE_ASM\n                for (; i < size; i++)\n                {\n                    outptr[0] = r0[0];\n                    outptr[1] = r0[1];\n                    outptr[2] = r0[2];\n                    outptr[3] = r0[3];\n                    outptr[4] = r1[0];\n                    outptr[5] = r1[1];\n                    outptr[6] = r1[2];\n                    outptr[7] = r1[3];\n\n                    r0 += 4;\n                    r1 += 4;\n                    outptr += 8;\n                }\n            }\n        }\n        if (pack8to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* r0 = bottom_blob.channel(q);\n\n                unsigned short* outptr0 = top_blob.channel(q * 2);\n                unsigned short* outptr1 = top_blob.channel(q * 2 + 1);\n\n     
           int i = 0;\n#if NCNN_GNU_INLINE_ASM\n#if __ARM_NEON\n                for (; i + 1 < size; i += 2)\n                {\n#if __aarch64__\n                    asm volatile(\n                        \"ld1    {v0.8h, v1.8h}, [%0], #32 \\n\"\n\n                        \"uzp1   v2.2d, v0.2d, v1.2d     \\n\"\n                        \"uzp2   v3.2d, v0.2d, v1.2d     \\n\"\n\n                        \"st1    {v2.8h}, [%1], #16      \\n\"\n                        \"st1    {v3.8h}, [%2], #16      \\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1)  // %2\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1)\n                        : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else\n                    asm volatile(\n                        \"vld1.u16   {d0-d3}, [%0 :128]! \\n\"\n\n                        \"vswp       d1, d2              \\n\"\n\n                        \"vst1.u16   {d0-d1}, [%1 :128]! \\n\"\n                        \"vst1.u16   {d2-d3}, [%2 :128]! 
\\n\"\n                        : \"=r\"(r0),      // %0\n                        \"=r\"(outptr0), // %1\n                        \"=r\"(outptr1)  // %2\n                        : \"0\"(r0),\n                        \"1\"(outptr0),\n                        \"2\"(outptr1)\n                        : \"memory\", \"q0\", \"q1\");\n#endif\n                }\n#endif\n#endif // NCNN_GNU_INLINE_ASM\n                for (; i < size; i++)\n                {\n                    outptr0[0] = r0[0];\n                    outptr0[1] = r0[1];\n                    outptr0[2] = r0[2];\n                    outptr0[3] = r0[3];\n                    outptr1[0] = r0[4];\n                    outptr1[1] = r0[5];\n                    outptr1[2] = r0[6];\n                    outptr1[3] = r0[7];\n\n                    r0 += 8;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    return 0;\n}\n\nint Packing_arm::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (use_padding)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    if (elempack == out_elempack)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    bool pack1to8 = elempack == 1 && out_elempack == 8;\n    bool pack8to1 = elempack == 8 && out_elempack == 1;\n\n    if (!pack1to8 && !pack8to1)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n\n    if (!use_padding)\n    {\n        // identity if use_padding not allowed\n        if (dims == 1 && w * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if (dims == 2 && h * 
elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if ((dims == 3 || dims == 4) && channels * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n    }\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        top_blob.w = w * elempack / out_elempack;\n        top_blob.cstep = bottom_blob.cstep * elempack / out_elempack;\n        top_blob.elemsize = elemsize / elempack * out_elempack;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        int outh = h * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const signed char* r0 = bottom_blob.row<const signed char>(i * 8);\n                const signed char* r1 = bottom_blob.row<const signed char>(i * 8 + 1);\n                const signed char* r2 = bottom_blob.row<const signed char>(i * 8 + 2);\n                const signed char* r3 = bottom_blob.row<const signed char>(i * 8 + 3);\n                const signed char* r4 = bottom_blob.row<const signed char>(i * 8 + 4);\n                const signed char* r5 = bottom_blob.row<const signed char>(i * 8 + 5);\n                const signed char* r6 = bottom_blob.row<const signed char>(i * 8 + 6);\n                const signed char* r7 = bottom_blob.row<const signed char>(i * 8 + 7);\n\n                signed char* outptr = top_blob.row<signed char>(i);\n\n                int j = 0;\n                for (; j < w; j++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n             
       outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n                    outptr[4] = *r4++;\n                    outptr[5] = *r5++;\n                    outptr[6] = *r6++;\n                    outptr[7] = *r7++;\n\n                    outptr += 8;\n                }\n            }\n        }\n        if (pack8to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const signed char* r0 = bottom_blob.row<const signed char>(i);\n\n                signed char* outptr0 = top_blob.row<signed char>(i * 8);\n                signed char* outptr1 = top_blob.row<signed char>(i * 8 + 1);\n                signed char* outptr2 = top_blob.row<signed char>(i * 8 + 2);\n                signed char* outptr3 = top_blob.row<signed char>(i * 8 + 3);\n                signed char* outptr4 = top_blob.row<signed char>(i * 8 + 4);\n                signed char* outptr5 = top_blob.row<signed char>(i * 8 + 5);\n                signed char* outptr6 = top_blob.row<signed char>(i * 8 + 6);\n                signed char* outptr7 = top_blob.row<signed char>(i * 8 + 7);\n\n                int j = 0;\n                for (; j < w; j++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n                    *outptr4++ = r0[4];\n                    *outptr5++ = r0[5];\n                    *outptr6++ = r0[6];\n                    *outptr7++ = r0[7];\n\n                    r0 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int size = w * h * d;\n        int outc = channels * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if (dims == 3)\n            top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n   
     else // if (dims == 4)\n            top_blob.create(w, h, d, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q * 8);\n                const signed char* r1 = bottom_blob.channel(q * 8 + 1);\n                const signed char* r2 = bottom_blob.channel(q * 8 + 2);\n                const signed char* r3 = bottom_blob.channel(q * 8 + 3);\n                const signed char* r4 = bottom_blob.channel(q * 8 + 4);\n                const signed char* r5 = bottom_blob.channel(q * 8 + 5);\n                const signed char* r6 = bottom_blob.channel(q * 8 + 6);\n                const signed char* r7 = bottom_blob.channel(q * 8 + 7);\n\n                signed char* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i < size; i++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n                    outptr[4] = *r4++;\n                    outptr[5] = *r5++;\n                    outptr[6] = *r6++;\n                    outptr[7] = *r7++;\n\n                    outptr += 8;\n                }\n            }\n        }\n        if (pack8to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q);\n\n                signed char* outptr0 = top_blob.channel(q * 8);\n                signed char* outptr1 = top_blob.channel(q * 8 + 1);\n                signed char* outptr2 = top_blob.channel(q * 8 + 2);\n                signed char* outptr3 = top_blob.channel(q * 8 + 3);\n                signed 
char* outptr4 = top_blob.channel(q * 8 + 4);\n                signed char* outptr5 = top_blob.channel(q * 8 + 5);\n                signed char* outptr6 = top_blob.channel(q * 8 + 6);\n                signed char* outptr7 = top_blob.channel(q * 8 + 7);\n\n                int i = 0;\n                for (; i < size; i++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n                    *outptr4++ = r0[4];\n                    *outptr5++ = r0[5];\n                    *outptr6++ = r0[6];\n                    *outptr7++ = r0[7];\n\n                    r0 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/packing_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_PACKING_ARM_H\n#define LAYER_PACKING_ARM_H\n\n#include \"packing.h\"\n\nnamespace ncnn {\n\nclass Packing_arm : public Packing\n{\npublic:\n    Packing_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_PACKING_ARM_H\n"
  },
  {
    "path": "src/layer/arm/padding_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"padding_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#if __ARM_NEON\n#include \"padding_pack4.h\"\n#include \"padding_pack4_bf16s_fp16s.h\"\n#include \"padding_pack8_int8.h\"\n#if NCNN_ARM82\n#include \"padding_pack8_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n\nPadding_arm::Padding_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Padding_arm::create_pipeline(const Option& opt)\n{\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        value_fp16 = float32_to_float16(value);\n\n        ncnn::cast_float32_to_float16(per_channel_pad_data, per_channel_pad_data_fp16, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        value_bf16 = float32_to_bfloat16(value);\n\n        ncnn::cast_float32_to_bfloat16(per_channel_pad_data, per_channel_pad_data_bf16, opt);\n    }\n#endif\n\n    return 0;\n}\n\nint Padding_arm::destroy_pipeline(const Option& /*opt*/)\n{\n    return 0;\n}\n\nint Padding_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (top == 0 && bottom == 0 && left == 0 && right == 0 && front == 0 && behind == 0)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int elembits = bottom_blob.elembits();\n\n    if (elembits == 8)\n        return forward_int8(bottom_blob, top_blob, opt);\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n    int w = bottom_blob.w;\n    
int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int outw = w * elempack + left + right;\n\n            int out_elempack = outw % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (left % 4 == 0 && out_elempack == 4 && type == 0)\n            {\n                top_blob.create(outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                float32x4_t pad_value = vdupq_n_f32(value);\n                padding_constant_pack4_neon(bottom_blob, top_blob, 0, 0, left / 4, right / 4, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int outw = w + left + right;\n            int outh = h * elempack + top + bottom;\n\n            int out_elempack = outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (top % 4 == 0 && out_elempack == 4 && type == 0)\n            {\n                top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                float32x4_t pad_value = vdupq_n_f32(value);\n                padding_constant_pack4_neon(bottom_blob, top_blob, top / 4, bottom / 4, left, right, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outc = channels * elempack + front + behind;\n\n            int out_elempack = outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (front % 4 == 0 && out_elempack == 4 && !(outc != channels * elempack && type != 0))\n            {\n                top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int front_ = front / elempack;\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < outc / out_elempack; q++)\n                {\n                    Mat borderm = top_blob.channel(q);\n\n                    float32x4_t pad_value = per_channel_pad_data_size ? vld1q_f32((const float*)per_channel_pad_data + q * 4) : vdupq_n_f32(value);\n                    //Channel padding\n                    if ((q - front_) < 0 || (q - front_) >= channels)\n                    {\n                        borderm.fill(pad_value);\n                    }\n                    else\n                    {\n                        const Mat m = bottom_blob.channel(q - front_);\n                        if (type == 0)\n                            padding_constant_pack4_neon(m, borderm, top, bottom, left, right, pad_value);\n                        if (type == 1)\n                            padding_replicate_pack4_neon(m, borderm, top, bottom, left, right);\n                        if (type == 2)\n                            padding_reflect_pack4_neon(m, borderm, top, bottom, left, right);\n                    }\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outd = d + front + behind;\n\n            if (type == 0)\n            {\n                top_blob.create(outw, outh, outd, channels, elemsize, elempack, opt.blob_allocator);\n                if (top_blob.empty())\n           
         return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    float32x4_t pad_value = per_channel_pad_data_size ? vld1q_f32((const float*)per_channel_pad_data + q * 4) : vdupq_n_f32(value);\n\n                    for (int z = 0; z < outd; z++)\n                    {\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        // depth padding\n                        if ((z - front) < 0 || (z - front) >= d)\n                        {\n                            borderm.fill(pad_value);\n                        }\n                        else\n                        {\n                            const Mat m = bottom_blob.channel(q).depth(z - front);\n                            padding_constant_pack4_neon(m, borderm, top, bottom, left, right, pad_value);\n                        }\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __ARM_NEON\n\n    Mat bottom_blob_unpacked = bottom_blob;\n    if (elempack != 1)\n    {\n        Option opt_pack1 = opt;\n        opt_pack1.blob_allocator = opt.workspace_allocator;\n\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack1);\n        if (bottom_blob_unpacked.empty())\n            return -100;\n    }\n\n    return Padding::forward(bottom_blob_unpacked, top_blob, opt);\n}\n\nint Padding_arm::forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __ARM_NEON\n#if NCNN_ARM82\n    if (elempack == 8)\n    {\n        if (dims == 1)\n        {\n            int outw = w * elempack + left + right;\n\n            int 
out_elempack = outw % 8 == 0 ? 8 : outw % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (left % 8 == 0 && out_elempack == 8 && type == 0)\n            {\n                top_blob.create(outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                uint16x8_t pad_value = vdupq_n_u16(value_fp16);\n                padding_constant_pack8_fp16s_neon(bottom_blob, top_blob, 0, 0, left / 8, right / 8, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int outw = w + left + right;\n            int outh = h * elempack + top + bottom;\n\n            int out_elempack = outh % 8 == 0 ? 8 : outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (top % 8 == 0 && out_elempack == 8 && type == 0)\n            {\n                top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                uint16x8_t pad_value = vdupq_n_u16(value_fp16);\n                padding_constant_pack8_fp16s_neon(bottom_blob, top_blob, top / 8, bottom / 8, left, right, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outc = channels * elempack + front + behind;\n\n            int out_elempack = outc % 8 == 0 ? 8 : outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (front % 8 == 0 && out_elempack == 8 && !(outc != channels * elempack && type != 0))\n            {\n                top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int front_ = front / elempack;\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < outc / out_elempack; q++)\n                {\n                    Mat borderm = top_blob.channel(q);\n\n                    uint16x8_t pad_value = per_channel_pad_data_size ? vld1q_u16((const unsigned short*)per_channel_pad_data_fp16 + q * 8) : vdupq_n_u16(value_fp16);\n\n                    //Channel padding\n                    if ((q - front_) < 0 || (q - front_) >= channels)\n                    {\n                        borderm.fill(pad_value);\n                    }\n                    else\n                    {\n                        const Mat m = bottom_blob.channel(q - front_);\n                        if (type == 0)\n                            padding_constant_pack8_fp16s_neon(m, borderm, top, bottom, left, right, pad_value);\n                        if (type == 1)\n                            padding_replicate_pack8_fp16s_neon(m, borderm, top, bottom, left, right);\n                        if (type == 2)\n                            padding_reflect_pack8_fp16s_neon(m, borderm, top, bottom, left, right);\n                    }\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outd = d + front + behind;\n\n            if (type == 0)\n            {\n                top_blob.create(outw, outh, outd, channels, elemsize, elempack, opt.blob_allocator);\n            
    if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    uint16x8_t pad_value = per_channel_pad_data_size ? vld1q_u16((const unsigned short*)per_channel_pad_data_fp16 + q * 8) : vdupq_n_u16(value_fp16);\n\n                    for (int z = 0; z < outd; z++)\n                    {\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        // depth padding\n                        if ((z - front) < 0 || (z - front) >= d)\n                        {\n                            borderm.fill(pad_value);\n                        }\n                        else\n                        {\n                            const Mat m = bottom_blob.channel(q).depth(z - front);\n                            padding_constant_pack8_fp16s_neon(m, borderm, top, bottom, left, right, pad_value);\n                        }\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int outw = w * elempack + left + right;\n\n#if NCNN_ARM82\n            int out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && outw % 8 == 0 ? 8 : outw % 4 == 0 ? 4 : 1;\n#else\n            int out_elempack = outw % 4 == 0 ? 
4 : 1;\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (left % 4 == 0 && out_elempack == 4 && type == 0)\n            {\n                top_blob.create(outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                // clang-format off\n                // *INDENT-OFF*\n                uint16x4_t pad_value;\n#if NCNN_ARM82\n                if (support_fp16_storage && opt.use_fp16_storage)\n                {\n                    pad_value = vdup_n_u16(value_fp16);\n                }\n                else\n#endif\n#if NCNN_BF16\n                if (opt.use_bf16_storage)\n                {\n                    pad_value = vdup_n_u16(value_bf16);\n                }\n                else\n#endif\n                {\n                    // shall never reach here\n                    pad_value = vdup_n_u16(0);\n                }\n                // *INDENT-ON*\n                // clang-format on\n                padding_constant_pack4_bf16_fp16s_neon(bottom_blob, top_blob, 0, 0, left / 4, right / 4, vcombine_u16(pad_value, pad_value));\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int outw = w + left + right;\n            int outh = h * elempack + top + bottom;\n\n#if NCNN_ARM82\n            int out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && outh % 8 == 0 ? 8 : outh % 4 == 0 ? 4 : 1;\n#else\n            int out_elempack = outh % 4 == 0 ? 
4 : 1;\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (top % 4 == 0 && out_elempack == 4 && type == 0)\n            {\n                top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                // clang-format off\n                // *INDENT-OFF*\n                uint16x4_t pad_value;\n#if NCNN_ARM82\n                if (support_fp16_storage && opt.use_fp16_storage)\n                {\n                    pad_value = vdup_n_u16(value_fp16);\n                }\n                else\n#endif\n#if NCNN_BF16\n                if (opt.use_bf16_storage)\n                {\n                    pad_value = vdup_n_u16(value_bf16);\n                }\n                else\n#endif\n                {\n                    // shall never reach here\n                    pad_value = vdup_n_u16(0);\n                }\n                // *INDENT-ON*\n                // clang-format on\n                padding_constant_pack4_bf16_fp16s_neon(bottom_blob, top_blob, top / 4, bottom / 4, left, right, vcombine_u16(pad_value, pad_value));\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outc = channels * elempack + front + behind;\n\n#if NCNN_ARM82\n            int out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && outc % 8 == 0 ? 8 : outc % 4 == 0 ? 4 : 1;\n#else\n            int out_elempack = outc % 4 == 0 ? 
4 : 1;\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (front % 4 == 0 && out_elempack == 4 && !(outc != channels * elempack && type != 0))\n            {\n                top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int front_ = front / elempack;\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < outc / out_elempack; q++)\n                {\n                    Mat borderm = top_blob.channel(q);\n\n                    // clang-format off\n                    // *INDENT-OFF*\n                    uint16x4_t pad_value;\n#if NCNN_ARM82\n                    if (support_fp16_storage && opt.use_fp16_storage)\n                    {\n                        pad_value = per_channel_pad_data_size ? vld1_u16((const unsigned short*)per_channel_pad_data_fp16 + q * 4) : vdup_n_u16(value_fp16);\n                    }\n                    else\n#endif\n#if NCNN_BF16\n                    if (opt.use_bf16_storage)\n                    {\n                        pad_value = per_channel_pad_data_size ? 
vld1_u16((const unsigned short*)per_channel_pad_data_bf16 + q * 4) : vdup_n_u16(value_bf16);\n                    }\n                    else\n#endif\n                    {\n                        // shall never reach here\n                        pad_value = vdup_n_u16(0);\n                    }\n                    // *INDENT-ON*\n                    // clang-format on\n\n                    //Channel padding\n                    if ((q - front_) < 0 || (q - front_) >= channels)\n                    {\n                        borderm.fill(pad_value);\n                    }\n                    else\n                    {\n                        const Mat m = bottom_blob.channel(q - front_);\n                        if (type == 0)\n                            padding_constant_pack4_bf16_fp16s_neon(m, borderm, top, bottom, left, right, vcombine_u16(pad_value, pad_value));\n                        if (type == 1)\n                            padding_replicate_pack4_bf16_fp16s_neon(m, borderm, top, bottom, left, right);\n                        if (type == 2)\n                            padding_reflect_pack4_bf16_fp16s_neon(m, borderm, top, bottom, left, right);\n                    }\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outd = d + front + behind;\n\n            if (type == 0)\n            {\n                top_blob.create(outw, outh, outd, channels, elemsize, elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    // clang-format off\n                    // *INDENT-OFF*\n                    uint16x4_t pad_value;\n#if NCNN_ARM82\n                    if (support_fp16_storage && 
opt.use_fp16_storage)\n                    {\n                        pad_value = per_channel_pad_data_size ? vld1_u16((const unsigned short*)per_channel_pad_data_fp16 + q * 4) : vdup_n_u16(value_fp16);\n                    }\n                    else\n#endif\n#if NCNN_BF16\n                    if (opt.use_bf16_storage)\n                    {\n                        pad_value = per_channel_pad_data_size ? vld1_u16((const unsigned short*)per_channel_pad_data_bf16 + q * 4) : vdup_n_u16(value_bf16);\n                    }\n                    else\n#endif\n                    {\n                        // shall never reach here\n                        pad_value = vdup_n_u16(0);\n                    }\n                    // *INDENT-ON*\n                    // clang-format on\n\n                    for (int z = 0; z < outd; z++)\n                    {\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        // depth padding\n                        if ((z - front) < 0 || (z - front) >= d)\n                        {\n                            borderm.fill(pad_value);\n                        }\n                        else\n                        {\n                            const Mat m = bottom_blob.channel(q).depth(z - front);\n                            padding_constant_pack4_bf16_fp16s_neon(m, borderm, top, bottom, left, right, vcombine_u16(pad_value, pad_value));\n                        }\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __ARM_NEON\n\n    Mat bottom_blob_unpacked = bottom_blob;\n    if (elempack != 1)\n    {\n        Option opt_pack1 = opt;\n        opt_pack1.blob_allocator = opt.workspace_allocator;\n\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack1);\n        if (bottom_blob_unpacked.empty())\n            return -100;\n    }\n\n    return Padding::forward(bottom_blob_unpacked, top_blob, opt);\n}\n\nint 
Padding_arm::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 8)\n    {\n        if (dims == 1)\n        {\n            int outw = w * elempack + left + right;\n\n            int out_elempack = outw % 8 == 0 ? 8 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (left % 8 == 0 && out_elempack == 8 && type == 0)\n            {\n                top_blob.create(outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int8x8_t pad_value = vdup_n_s8((signed char)value);\n                padding_constant_pack8_int8_neon(bottom_blob, top_blob, 0, 0, left / 8, right / 8, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int outw = w + left + right;\n            int outh = h * elempack + top + bottom;\n\n            int out_elempack = outh % 8 == 0 ? 
8 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (top % 8 == 0 && out_elempack == 8 && type == 0)\n            {\n                top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int8x8_t pad_value = vdup_n_s8((signed char)value);\n                padding_constant_pack8_int8_neon(bottom_blob, top_blob, top / 8, bottom / 8, left, right, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outc = channels * elempack + front + behind;\n\n            int out_elempack = outc % 8 == 0 ? 8 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            if (front % 8 == 0 && out_elempack == 8 && !(outc != channels * elempack && type != 0))\n            {\n                int front_ = front / elempack;\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < outc / out_elempack; q++)\n                {\n                    Mat borderm = top_blob.channel(q);\n\n                    // TODO perchannel\n                    //                     int8x8_t pad_value = per_channel_pad_data_size ? 
vld1_s8(per_channel_pad_data + q * 8) : vdup_n_s8((signed char)value);\n                    int8x8_t pad_value = vdup_n_s8((signed char)value);\n\n                    //Channel padding\n                    if ((q - front_) < 0 || (q - front_) >= channels)\n                    {\n                        borderm.fill<int8x8_t>(pad_value);\n                    }\n                    else\n                    {\n                        const Mat m = bottom_blob.channel(q - front_);\n                        if (type == 0)\n                            padding_constant_pack8_int8_neon(m, borderm, top, bottom, left, right, pad_value);\n                        if (type == 1)\n                            padding_replicate_pack8_int8_neon(m, borderm, top, bottom, left, right);\n                        if (type == 2)\n                            padding_reflect_pack8_int8_neon(m, borderm, top, bottom, left, right);\n                    }\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outd = d + front + behind;\n\n            top_blob.create(outw, outh, outd, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            if (type == 0)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    // TODO perchannel\n                    //                     int8x8_t pad_value = per_channel_pad_data_size ? 
vld1_s8(per_channel_pad_data + q * 8) : vdup_n_s8((signed char)value);\n                    int8x8_t pad_value = vdup_n_s8((signed char)value);\n\n                    for (int z = 0; z < outd; z++)\n                    {\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        // depth padding\n                        if ((z - front) < 0 || (z - front) >= d)\n                        {\n                            borderm.fill<int8x8_t>(pad_value);\n                        }\n                        else\n                        {\n                            const Mat m = bottom_blob.channel(q).depth(z - front);\n                            padding_constant_pack8_int8_neon(m, borderm, top, bottom, left, right, pad_value);\n                        }\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __ARM_NEON\n\n    Mat bottom_blob_unpacked = bottom_blob;\n    if (elempack != 1)\n    {\n        Option opt_pack1 = opt;\n        opt_pack1.blob_allocator = opt.workspace_allocator;\n\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack1);\n        if (bottom_blob_unpacked.empty())\n            return -100;\n    }\n\n    return Padding::forward(bottom_blob_unpacked, top_blob, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/padding_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_PADDING_ARM_H\n#define LAYER_PADDING_ARM_H\n\n#include \"padding.h\"\n\nnamespace ncnn {\n\nclass Padding_arm : public Padding\n{\npublic:\n    Padding_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n#if NCNN_BF16\n    // bf16\n    unsigned short value_bf16;\n    Mat per_channel_pad_data_bf16;\n#endif\n\n    // fp16\n    unsigned short value_fp16;\n    Mat per_channel_pad_data_fp16;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_PADDING_ARM_H\n"
  },
  {
    "path": "src/layer/arm/padding_pack4.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void padding_constant_pack4_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right, float32x4_t v)\n{\n    const float* ptr = src;\n    float* outptr = dst;\n\n    int w = src.w;\n    int h = src.h;\n\n    int top_size = top * dst.w;\n    int bottom_size = bottom * dst.w;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n    asm volatile(\n        \"mov    v0.16b, %10.16b         \\n\"\n        \"mov    v1.16b, %10.16b         \\n\"\n        \"mov    v2.16b, %10.16b         \\n\"\n        \"mov    v3.16b, %10.16b         \\n\"\n\n        // fill top\n        \"lsr    w4, %w8, #3             \\n\" // w4 = nn = top_size >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    1f                      \\n\"\n\n        \"0:                             \\n\"\n        \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n        \"bne    0b                      \\n\"\n\n        \"1:                             \\n\"\n\n        // fill top remain\n        \"and    w4, %w8, #7             \\n\" // w4 = remain = top_size & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    2f                      \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n        \"2:                             \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    3f                      \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.4s, v1.4s}, [%0], #32 \\n\"\n        \"3:                             \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    4f                      \\n\"\n        \"st1    {v0.4s}, [%0], #16      \\n\"\n        \"4:                             \\n\"\n\n    
    // fill center h loop\n        \"cmp    %w5, #0                 \\n\"\n        \"beq    15f                     \\n\"\n        \"5:                             \\n\"\n\n        // fill left\n        \"mov    w4, %w6                 \\n\" // w4 = left\n        \"cmp    w4, #0                  \\n\"\n        \"beq    7f                      \\n\"\n\n        \"6:                             \\n\"\n        \"st1    {v0.4s}, [%0], #16      \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    6b                      \\n\"\n\n        \"7:                             \\n\"\n\n        // fill middle\n        \"lsr    w4, %w4, #3             \\n\" // w4 = nn = w >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    9f                      \\n\"\n\n        \"8:                             \\n\"\n        \"prfm   pldl1keep, [%1, #512]   \\n\"\n        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n        \"prfm   pldl1keep, [%1, #512]   \\n\"\n        \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%1], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n        \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n        \"bne    8b                      \\n\"\n\n        \"9:                             \\n\"\n\n        \"and    w4, %w4, #7             \\n\" // w4 = remain = w & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    10f                     \\n\"\n        \"prfm   pldl1keep, [%1, #512]   \\n\"\n        \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%1], #64 \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%0], #64 \\n\"\n        \"10:                            \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    11f                     \\n\"\n        \"prfm   pldl1keep, [%1, #256]   \\n\"\n        \"ld1    {v16.4s, 
v17.4s}, [%1], #32 \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v16.4s, v17.4s}, [%0], #32 \\n\"\n        \"11:                            \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    12f                     \\n\"\n        \"prfm   pldl1keep, [%1, #128]   \\n\"\n        \"ld1    {v16.4s}, [%1], #16     \\n\"\n        \"st1    {v16.4s}, [%0], #16     \\n\"\n        \"12:                            \\n\"\n\n        // fill right\n        \"mov    w4, %w7                 \\n\" // w4 = right\n        \"cmp    w4, #0                  \\n\"\n        \"beq    14f                     \\n\"\n\n        \"13:                            \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v0.4s}, [%0], #16      \\n\"\n        \"bne    13b                     \\n\"\n        \"14:                            \\n\"\n\n        \"subs   %w5, %w5, #1            \\n\"\n        \"bne    5b                      \\n\"\n\n        \"15:                            \\n\"\n\n        // fill bottom\n        \"lsr    w4, %w9, #3             \\n\" // w4 = nn = bottom_size >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    17f                     \\n\"\n\n        \"16:                            \\n\"\n        \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n        \"bne    16b                     \\n\"\n        \"17:                            \\n\"\n\n        // fill bottom remain\n        \"and    w4, %w9, #7             \\n\" // w4 = remain = bottom_size & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    18f                     \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n        \"18:                            \\n\"\n\n        \"cmp    w4, #2                  
\\n\" // w4 >= 2\n        \"blt    19f                     \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.4s, v1.4s}, [%0], #32 \\n\"\n        \"19:                            \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    20f                     \\n\"\n        \"st1    {v0.4s}, [%0], #16      \\n\"\n        \"20:                            \\n\"\n\n        : \"=r\"(outptr), // %0\n        \"=r\"(ptr)     // %1\n        : \"0\"(outptr),\n        \"1\"(ptr),\n        \"r\"(w),           // %4\n        \"r\"(h),           // %5\n        \"r\"(left),        // %6\n        \"r\"(right),       // %7\n        \"r\"(top_size),    // %8\n        \"r\"(bottom_size), // %9\n        \"w\"(v)            // %10\n        : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else  // __aarch64__\n    asm volatile(\n        \"vmov       q0, %q10            \\n\"\n        \"vmov       q1, %q10            \\n\"\n        \"vmov       q2, %q10            \\n\"\n        \"vmov       q3, %q10            \\n\"\n\n        // fill top\n        \"lsr        r4, %8, #3          \\n\" // r4 = nn = top_size >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        1f                  \\n\"\n\n        \"0:                             \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"bne        0b                  \\n\"\n\n        \"1:                             \\n\"\n\n        // fill top remain\n        \"and        r4, %8, #7          \\n\" // r4 = remain = top_size & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        2f                  \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"2:                             
\\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        3f                  \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.f32   {d0-d3}, [%0 :128]! \\n\"\n        \"3:                             \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        4f                  \\n\"\n        \"vst1.f32   {d0-d1}, [%0 :128]! \\n\"\n        \"4:                             \\n\"\n\n        // fill center h loop\n        \"cmp        %5, #0              \\n\"\n        \"beq        15f                 \\n\"\n        \"5:                             \\n\"\n\n        // fill left\n        \"mov        r4, %6              \\n\" // r4 = left\n        \"cmp        r4, #0              \\n\"\n        \"beq        7f                  \\n\"\n\n        \"6:                             \\n\"\n        \"vst1.f32   {d0-d1}, [%0 :128]! \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"bne        6b                  \\n\"\n\n        \"7:                             \\n\"\n\n        // fill middle\n        \"lsr        r4, %4, #3          \\n\" // r4 = nn = w >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        9f                  \\n\"\n\n        \"8:                             \\n\"\n        \"pld        [%1, #512]          \\n\"\n        \"vldm       %1!, {d16-d23}      \\n\"\n        \"pld        [%1, #512]          \\n\"\n        \"vldm       %1!, {d24-d31}      \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vstm       %0!, {d16-d23}      \\n\"\n        \"vstm       %0!, {d24-d31}      \\n\"\n        \"bne        8b                  \\n\"\n\n        \"9:                             \\n\"\n\n        \"and        r4, %4, #7          \\n\" // r4 = remain = w & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        10f                 \\n\"\n        \"pld        [%1, #512]          \\n\"\n        \"vldm       %1!, 
{d16-d23}      \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vstm       %0!, {d16-d23}      \\n\"\n        \"10:                            \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        11f                 \\n\"\n        \"pld        [%1, #256]          \\n\"\n        \"vld1.f32   {d16-d19}, [%1 :128]! \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.f32   {d16-d19}, [%0 :128]! \\n\"\n        \"11:                            \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        12f                 \\n\"\n        \"pld        [%1, #128]          \\n\"\n        \"vld1.f32   {d16-d17}, [%1 :128]! \\n\"\n        \"vst1.f32   {d16-d17}, [%0 :128]! \\n\"\n        \"12:                            \\n\"\n\n        // fill right\n        \"mov        r4, %7              \\n\" // r4 = right\n        \"cmp        r4, #0              \\n\"\n        \"beq        14f                 \\n\"\n\n        \"13:                            \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vst1.f32   {d0-d1}, [%0 :128]! 
\\n\"\n        \"bne        13b                 \\n\"\n        \"14:                            \\n\"\n\n        \"subs       %5, %5, #1          \\n\"\n        \"bne        5b                  \\n\"\n\n        \"15:                            \\n\"\n\n        // fill bottom\n        \"lsr        r4, %9, #3          \\n\" // r4 = nn = bottom_size >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        17f                 \\n\"\n\n        \"16:                            \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"bne        16b                 \\n\"\n        \"17:                            \\n\"\n\n        // fill bottom remain\n        \"and        r4, %9, #7          \\n\" // r4 = remain = bottom_size & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        18f                 \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"18:                            \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        19f                 \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.f32   {d0-d3}, [%0 :128]! \\n\"\n        \"19:                            \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        20f                 \\n\"\n        \"vst1.f32   {d0-d1}, [%0 :128]! 
\\n\"\n        \"20:                            \\n\"\n\n        : \"=r\"(outptr), // %0\n        \"=r\"(ptr)     // %1\n        : \"0\"(outptr),\n        \"1\"(ptr),\n        \"r\"(w),           // %4\n        \"r\"(h),           // %5\n        \"r\"(left),        // %6\n        \"r\"(right),       // %7\n        \"r\"(top_size),    // %8\n        \"r\"(bottom_size), // %9\n        \"w\"(v)            // %10\n        : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n\n    // fill top\n    {\n        int x = 0;\n        for (; x + 3 < top_size; x += 4)\n        {\n            vst1q_f32(outptr, v);\n            vst1q_f32(outptr + 4, v);\n            vst1q_f32(outptr + 8, v);\n            vst1q_f32(outptr + 12, v);\n            outptr += 16;\n        }\n        for (; x < top_size; x++)\n        {\n            vst1q_f32(outptr, v);\n            outptr += 4;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            vst1q_f32(outptr, v);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            vst1q_f32(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_f32(outptr, v);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    {\n        int x = 0;\n        for (; x + 3 < bottom_size; x += 4)\n        {\n            vst1q_f32(outptr, v);\n            vst1q_f32(outptr + 4, v);\n            vst1q_f32(outptr + 8, v);\n            vst1q_f32(outptr + 12, v);\n            outptr += 16;\n        }\n        for (; x < bottom_size; x++)\n        {\n            vst1q_f32(outptr, v);\n            outptr += 4;\n        }\n    }\n#endif // 
NCNN_GNU_INLINE_ASM\n}\n\nstatic void padding_replicate_pack4_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const float* ptr = src;\n    float* outptr = dst;\n\n    // fill top\n    for (int y = 0; y < top; y++)\n    {\n        const float* ptr0 = ptr;\n        float32x4_t _p = vld1q_f32(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1q_f32(ptr0);\n            vst1q_f32(outptr, _p);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        float32x4_t _p = vld1q_f32(ptr);\n        for (int x = 0; x < left; x++)\n        {\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1q_f32(ptr);\n            vst1q_f32(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    ptr -= src.w * 4;\n    for (int y = 0; y < bottom; y++)\n    {\n        const float* ptr0 = ptr;\n        float32x4_t _p = vld1q_f32(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1q_f32(ptr0);\n            vst1q_f32(outptr, _p);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n    }\n}\n\nstatic void padding_reflect_pack4_neon(const Mat& src, Mat& dst, int top, 
int bottom, int left, int right)\n{\n    const float* ptr = src;\n    float* outptr = dst;\n\n    // fill top\n    ptr += top * src.w * 4;\n    for (int y = 0; y < top; y++)\n    {\n        const float* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr0 + (left - x) * 4);\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            vst1q_f32(outptr, _p);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr0 - 8 - x * 4);\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        ptr -= src.w * 4;\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr + (left - x) * 4);\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            vst1q_f32(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr - 8 - x * 4);\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    ptr -= 2 * src.w * 4;\n    for (int y = 0; y < bottom; y++)\n    {\n        const float* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr0 + (left - x) * 4);\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            vst1q_f32(outptr, _p);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; 
x++)\n        {\n            float32x4_t _p = vld1q_f32(ptr0 - 8 - x * 4);\n            vst1q_f32(outptr, _p);\n            outptr += 4;\n        }\n        ptr -= src.w * 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/padding_pack4_bf16s_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void padding_constant_pack4_bf16_fp16s_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right, uint16x8_t v)\n{\n    const unsigned short* ptr = src;\n    unsigned short* outptr = dst;\n\n    int w = src.w;\n    int h = src.h;\n\n    int top_size = top * dst.w;\n    int bottom_size = bottom * dst.w;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n    asm volatile(\n        \"mov    v0.16b, %10.16b         \\n\"\n        \"mov    v1.16b, %10.16b         \\n\"\n        \"mov    v2.16b, %10.16b         \\n\"\n        \"mov    v3.16b, %10.16b         \\n\"\n\n        // fill top\n        \"lsr    w4, %w8, #3             \\n\" // w4 = nn = top_size >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    1f                      \\n\"\n\n        \"0:                             \\n\"\n        \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    0b                      \\n\"\n\n        \"1:                             \\n\"\n\n        // fill top remain\n        \"and    w4, %w8, #7             \\n\" // w4 = remain = top_size & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    2f                      \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v0.8h, v1.8h}, [%0], #32 \\n\"\n        \"2:                             \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    3f                      \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.8h}, [%0], #16      \\n\"\n        \"3:                             \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    4f                      \\n\"\n        \"st1    {v0.4h}, [%0], #8       \\n\"\n        \"4:                             \\n\"\n\n        // fill center h loop\n        \"cmp    %w5, #0 
                \\n\"\n        \"beq    15f                     \\n\"\n        \"5:                             \\n\"\n\n        // fill left\n        \"mov    w4, %w6                 \\n\" // w4 = left\n        \"cmp    w4, #0                  \\n\"\n        \"beq    7f                      \\n\"\n\n        \"6:                             \\n\"\n        \"st1    {v0.4h}, [%0], #8       \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    6b                      \\n\"\n\n        \"7:                             \\n\"\n\n        // fill middle\n        \"lsr    w4, %w4, #3             \\n\" // w4 = nn = w >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    9f                      \\n\"\n\n        \"8:                             \\n\"\n        \"prfm   pldl1keep, [%1, #256]   \\n\"\n        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%1], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%0], #64 \\n\"\n        \"bne    8b                      \\n\"\n\n        \"9:                             \\n\"\n\n        \"and    w4, %w4, #7             \\n\" // w4 = remain = w & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    10f                     \\n\"\n        \"prfm   pldl1keep, [%1, #256]   \\n\"\n        \"ld1    {v16.8h, v17.8h}, [%1], #32 \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v16.8h, v17.8h}, [%0], #32 \\n\"\n        \"10:                            \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    11f                     \\n\"\n        \"prfm   pldl1keep, [%1, #128]   \\n\"\n        \"ld1    {v16.8h}, [%1], #16     \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v16.8h}, [%0], #16     \\n\"\n        \"11:                            \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    12f                     
\\n\"\n        \"prfm   pldl1keep, [%1, #64]    \\n\"\n        \"ld1    {v16.4h}, [%1], #8      \\n\"\n        \"st1    {v16.4h}, [%0], #8      \\n\"\n        \"12:                            \\n\"\n\n        // fill right\n        \"mov    w4, %w7                 \\n\" // w4 = right\n        \"cmp    w4, #0                  \\n\"\n        \"beq    14f                     \\n\"\n\n        \"13:                            \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v0.4h}, [%0], #8       \\n\"\n        \"bne    13b                     \\n\"\n        \"14:                            \\n\"\n\n        \"subs   %w5, %w5, #1            \\n\"\n        \"bne    5b                      \\n\"\n\n        \"15:                            \\n\"\n\n        // fill bottom\n        \"lsr    w4, %w9, #3             \\n\" // w4 = nn = bottom_size >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    17f                     \\n\"\n\n        \"16:                            \\n\"\n        \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    16b                     \\n\"\n        \"17:                            \\n\"\n\n        // fill bottom remain\n        \"and    w4, %w9, #7             \\n\" // w4 = remain = bottom_size & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    18f                     \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v0.8h, v1.8h}, [%0], #32 \\n\"\n        \"18:                            \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    19f                     \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.8h}, [%0], #16      \\n\"\n        \"19:                            \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    20f                     \\n\"\n        \"st1    {v0.4h}, [%0], #8      
 \\n\"\n        \"20:                            \\n\"\n\n        : \"=r\"(outptr), // %0\n        \"=r\"(ptr)     // %1\n        : \"0\"(outptr),\n        \"1\"(ptr),\n        \"r\"(w),           // %4\n        \"r\"(h),           // %5\n        \"r\"(left),        // %6\n        \"r\"(right),       // %7\n        \"r\"(top_size),    // %8\n        \"r\"(bottom_size), // %9\n        \"w\"(v)            // %10\n        : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\");\n#else  // __aarch64__\n    asm volatile(\n        \"vmov       q0, %q10            \\n\"\n        \"vmov       q1, %q10            \\n\"\n        \"vmov       q2, %q10            \\n\"\n        \"vmov       q3, %q10            \\n\"\n\n        // fill top\n        \"lsr        r4, %8, #3          \\n\" // r4 = nn = top_size >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        1f                  \\n\"\n\n        \"0:                             \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"bne        0b                  \\n\"\n\n        \"1:                             \\n\"\n\n        // fill top remain\n        \"and        r4, %8, #7          \\n\" // r4 = remain = top_size & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        2f                  \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vst1.u16   {d0-d3}, [%0 :64]!  \\n\"\n        \"2:                             \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        3f                  \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.u16   {d0-d1}, [%0 :64]!  \\n\"\n        \"3:                             \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        4f                  \\n\"\n        \"vst1.u16   {d0}, [%0 :64]!     
\\n\"\n        \"4:                             \\n\"\n\n        // fill center h loop\n        \"cmp        %5, #0              \\n\"\n        \"beq        15f                 \\n\"\n        \"5:                             \\n\"\n\n        // fill left\n        \"mov        r4, %6              \\n\" // r4 = left\n        \"cmp        r4, #0              \\n\"\n        \"beq        7f                  \\n\"\n\n        \"6:                             \\n\"\n        \"vst1.u16   {d0}, [%0 :64]!     \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"bne        6b                  \\n\"\n\n        \"7:                             \\n\"\n\n        // fill middle\n        \"lsr        r4, %4, #3          \\n\" // r4 = nn = w >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        9f                  \\n\"\n\n        \"8:                             \\n\"\n        \"pld        [%1, #512]          \\n\"\n        \"vldm       %1!, {d16-d23}      \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vstm       %0!, {d16-d23}      \\n\"\n        \"bne        8b                  \\n\"\n\n        \"9:                             \\n\"\n\n        \"and        r4, %4, #7          \\n\" // r4 = remain = w & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        10f                 \\n\"\n        \"pld        [%1, #256]          \\n\"\n        \"vld1.u16   {d16-d19}, [%1 :64]! \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vst1.u16   {d16-d19}, [%0 :64]! \\n\"\n        \"10:                            \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        11f                 \\n\"\n        \"pld        [%1, #128]          \\n\"\n        \"vld1.u16   {d16-d17}, [%1 :64]! \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.u16   {d16-d17}, [%0 :64]! 
\\n\"\n        \"11:                            \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        12f                 \\n\"\n        \"pld        [%1, #64]           \\n\"\n        \"vld1.u16   {d16}, [%1 :64]!    \\n\"\n        \"vst1.u16   {d16}, [%0 :64]!    \\n\"\n        \"12:                            \\n\"\n\n        // fill right\n        \"mov        r4, %7              \\n\" // r4 = right\n        \"cmp        r4, #0              \\n\"\n        \"beq        14f                 \\n\"\n\n        \"13:                            \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vst1.u16   {d0}, [%0 :64]!     \\n\"\n        \"bne        13b                 \\n\"\n        \"14:                            \\n\"\n\n        \"subs       %5, %5, #1          \\n\"\n        \"bne        5b                  \\n\"\n\n        \"15:                            \\n\"\n\n        // fill bottom\n        \"lsr        r4, %9, #3          \\n\" // r4 = nn = bottom_size >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        17f                 \\n\"\n\n        \"16:                            \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"bne        16b                 \\n\"\n        \"17:                            \\n\"\n\n        // fill bottom remain\n        \"and        r4, %9, #7          \\n\" // r4 = remain = bottom_size & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        18f                 \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vst1.u16   {d0-d3}, [%0 :64]!  \\n\"\n        \"18:                            \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        19f                 \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.u16   {d0-d1}, [%0 :64]!  
\\n\"\n        \"19:                            \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        20f                 \\n\"\n        \"vst1.u16   {d0}, [%0 :64]!     \\n\"\n        \"20:                            \\n\"\n\n        : \"=r\"(outptr), // %0\n        \"=r\"(ptr)     // %1\n        : \"0\"(outptr),\n        \"1\"(ptr),\n        \"r\"(w),           // %4\n        \"r\"(h),           // %5\n        \"r\"(left),        // %6\n        \"r\"(right),       // %7\n        \"r\"(top_size),    // %8\n        \"r\"(bottom_size), // %9\n        \"w\"(v)            // %10\n        : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n\n    // fill top\n    {\n        int x = 0;\n        for (; x + 3 < top_size; x += 4)\n        {\n            vst1q_u16(outptr, v);\n            vst1q_u16(outptr + 8, v);\n            outptr += 16;\n        }\n        for (; x < top_size; x++)\n        {\n            vst1_u16(outptr, vget_low_u16(v));\n            outptr += 4;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            vst1_u16(outptr, vget_low_u16(v));\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr);\n            vst1_u16(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_u16(outptr, vget_low_u16(v));\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    {\n        int x = 0;\n        for (; x + 3 < bottom_size; x += 4)\n        {\n            vst1q_u16(outptr, v);\n            vst1q_u16(outptr + 8, v);\n            outptr += 16;\n        }\n        for (; x < bottom_size; x++)\n        {\n            vst1_u16(outptr, vget_low_u16(v));\n            outptr += 
4;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n}\n\nstatic void padding_replicate_pack4_bf16_fp16s_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const unsigned short* ptr = src;\n    unsigned short* outptr = dst;\n\n    // fill top\n    for (int y = 0; y < top; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        uint16x4_t _p = vld1_u16(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1_u16(ptr0);\n            vst1_u16(outptr, _p);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        uint16x4_t _p = vld1_u16(ptr);\n        for (int x = 0; x < left; x++)\n        {\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1_u16(ptr);\n            vst1_u16(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    ptr -= src.w * 4;\n    for (int y = 0; y < bottom; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        uint16x4_t _p = vld1_u16(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1_u16(ptr0);\n            vst1_u16(outptr, _p);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n    }\n}\n\nstatic void 
padding_reflect_pack4_bf16_fp16s_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const unsigned short* ptr = src;\n    unsigned short* outptr = dst;\n\n    // fill top\n    ptr += top * src.w * 4;\n    for (int y = 0; y < top; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr0 + (left - x) * 4);\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr0);\n            vst1_u16(outptr, _p);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr0 - 8 - x * 4);\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        ptr -= src.w * 4;\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr + (left - x) * 4);\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr);\n            vst1_u16(outptr, _p);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr - 8 - x * 4);\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    ptr -= 2 * src.w * 4;\n    for (int y = 0; y < bottom; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr0 + (left - x) * 4);\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr0);\n            vst1_u16(outptr, _p);\n            
ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            uint16x4_t _p = vld1_u16(ptr0 - 8 - x * 4);\n            vst1_u16(outptr, _p);\n            outptr += 4;\n        }\n        ptr -= src.w * 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/padding_pack8_fp16s.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void padding_constant_pack8_fp16s_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right, uint16x8_t v)\n{\n    const unsigned short* ptr = src;\n    unsigned short* outptr = dst;\n\n    int w = src.w;\n    int h = src.h;\n\n    int top_size = top * dst.w;\n    int bottom_size = bottom * dst.w;\n\n#if NCNN_GNU_INLINE_ASM\n    asm volatile(\n        \"mov    v0.16b, %10.16b         \\n\"\n        \"mov    v1.16b, %10.16b         \\n\"\n        \"mov    v2.16b, %10.16b         \\n\"\n        \"mov    v3.16b, %10.16b         \\n\"\n\n        // fill top\n        \"lsr    w4, %w8, #2             \\n\" // w4 = nn = top_size >> 2\n        \"cmp    w4, #0                  \\n\"\n        \"beq    1f                      \\n\"\n\n        \"0:                             \\n\"\n        \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    0b                      \\n\"\n\n        \"1:                             \\n\"\n\n        // fill top remain\n        \"and    w4, %w8, #3             \\n\" // w4 = remain = top_size & 3\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    2f                      \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.8h, v1.8h}, [%0], #32 \\n\"\n        \"2:                             \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    3f                      \\n\"\n        \"st1    {v0.8h}, [%0], #16      \\n\"\n        \"3:                             \\n\"\n\n        // fill center h loop\n        \"cmp    %w5, #0                 \\n\"\n        \"beq    13f                     \\n\"\n        \"4:                             \\n\"\n\n        // fill left\n        \"mov    w4, %w6                 \\n\" // w4 = left\n        \"cmp    w4, #0                  \\n\"\n        \"beq    6f    
                  \\n\"\n\n        \"5:                             \\n\"\n        \"st1    {v0.8h}, [%0], #16      \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    5b                      \\n\"\n\n        \"6:                             \\n\"\n\n        // fill middle\n        \"lsr    w4, %w4, #2             \\n\" // w4 = nn = w >> 2\n        \"cmp    w4, #0                  \\n\"\n        \"beq    8f                      \\n\"\n\n        \"7:                             \\n\"\n        \"prfm   pldl1keep, [%1, #512]   \\n\"\n        \"ld1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%1], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v16.8h, v17.8h, v18.8h, v19.8h}, [%0], #64 \\n\"\n        \"bne    7b                      \\n\"\n\n        \"8:                             \\n\"\n\n        \"and    w4, %w4, #3             \\n\" // w4 = remain = w & 3\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    9f                      \\n\"\n        \"prfm   pldl1keep, [%1, #256]   \\n\"\n        \"ld1    {v16.8h, v17.8h}, [%1], #32 \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v16.8h, v17.8h}, [%0], #32 \\n\"\n        \"9:                             \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    10f                     \\n\"\n        \"prfm   pldl1keep, [%1, #128]   \\n\"\n        \"ld1    {v16.8h}, [%1], #16     \\n\"\n        \"st1    {v16.8h}, [%0], #16     \\n\"\n        \"10:                            \\n\"\n\n        // fill right\n        \"mov    w4, %w7                 \\n\" // w4 = right\n        \"cmp    w4, #0                  \\n\"\n        \"beq    12f                     \\n\"\n\n        \"11:                            \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v0.8h}, [%0], #16      \\n\"\n        \"bne    11b                     \\n\"\n        \"12:                            \\n\"\n\n    
    \"subs   %w5, %w5, #1            \\n\"\n        \"bne    4b                      \\n\"\n\n        \"13:                            \\n\"\n\n        // fill bottom\n        \"lsr    w4, %w9, #2             \\n\" // w4 = nn = bottom_size >> 2\n        \"cmp    w4, #0                  \\n\"\n        \"beq    15f                     \\n\"\n\n        \"14:                            \\n\"\n        \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    14b                     \\n\"\n        \"15:                            \\n\"\n\n        // fill bottom remain\n        \"and    w4, %w9, #3             \\n\" // w4 = remain = bottom_size & 3\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    16f                     \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.8h, v1.8h}, [%0], #32 \\n\"\n        \"16:                            \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    17f                     \\n\"\n        \"st1    {v0.8h}, [%0], #16      \\n\"\n        \"17:                            \\n\"\n\n        : \"=r\"(outptr), // %0\n        \"=r\"(ptr)     // %1\n        : \"0\"(outptr),\n        \"1\"(ptr),\n        \"r\"(w),           // %4\n        \"r\"(h),           // %5\n        \"r\"(left),        // %6\n        \"r\"(right),       // %7\n        \"r\"(top_size),    // %8\n        \"r\"(bottom_size), // %9\n        \"w\"(v)            // %10\n        : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\");\n#else  // NCNN_GNU_INLINE_ASM\n\n    // fill top\n    {\n        int x = 0;\n        for (; x + 3 < top_size; x += 4)\n        {\n            vst1q_u16(outptr, v);\n            vst1q_u16(outptr + 8, v);\n            vst1q_u16(outptr + 16, v);\n            vst1q_u16(outptr + 24, v);\n            outptr += 32;\n        }\n        for (; x < top_size; x++)\n   
     {\n            vst1q_u16(outptr, v);\n            outptr += 8;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            vst1q_u16(outptr, v);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            vst1q_u16(outptr, _p);\n            ptr += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_u16(outptr, v);\n            outptr += 8;\n        }\n    }\n    // fill bottom\n    {\n        int x = 0;\n        for (; x + 3 < bottom_size; x += 4)\n        {\n            vst1q_u16(outptr, v);\n            vst1q_u16(outptr + 8, v);\n            vst1q_u16(outptr + 16, v);\n            vst1q_u16(outptr + 24, v);\n            outptr += 32;\n        }\n        for (; x < bottom_size; x++)\n        {\n            vst1q_u16(outptr, v);\n            outptr += 8;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n}\n\nstatic void padding_replicate_pack8_fp16s_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const unsigned short* ptr = src;\n    unsigned short* outptr = dst;\n\n    // fill top\n    for (int y = 0; y < top; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        uint16x8_t _p = vld1q_u16(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1q_u16(ptr0);\n            vst1q_u16(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        uint16x8_t _p = vld1q_u16(ptr);\n        for (int x = 0; x < left; x++)\n        {\n         
   vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1q_u16(ptr);\n            vst1q_u16(outptr, _p);\n            ptr += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n    }\n    // fill bottom\n    ptr -= src.w * 8;\n    for (int y = 0; y < bottom; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        uint16x8_t _p = vld1q_u16(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1q_u16(ptr0);\n            vst1q_u16(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n    }\n}\n\nstatic void padding_reflect_pack8_fp16s_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const unsigned short* ptr = src;\n    unsigned short* outptr = dst;\n\n    // fill top\n    ptr += top * src.w * 8;\n    for (int y = 0; y < top; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr0 + (left - x) * 8);\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr0);\n            vst1q_u16(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr0 - 16 - x * 8);\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        ptr -= src.w * 8;\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int 
x = 0; x < left; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr + (left - x) * 8);\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            vst1q_u16(outptr, _p);\n            ptr += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr - 16 - x * 8);\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n    }\n    // fill bottom\n    ptr -= 2 * src.w * 8;\n    for (int y = 0; y < bottom; y++)\n    {\n        const unsigned short* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr0 + (left - x) * 8);\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr0);\n            vst1q_u16(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            uint16x8_t _p = vld1q_u16(ptr0 - 16 - x * 8);\n            vst1q_u16(outptr, _p);\n            outptr += 8;\n        }\n        ptr -= src.w * 8;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/padding_pack8_int8.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void padding_constant_pack8_int8_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right, int8x8_t v)\n{\n    const signed char* ptr = src;\n    signed char* outptr = dst;\n\n    int w = src.w;\n    int h = src.h;\n\n    int top_size = top * dst.w;\n    int bottom_size = bottom * dst.w;\n\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n    asm volatile(\n        \"mov    v0.8b, %10.8b           \\n\"\n        \"mov    v0.d[1], v0.d[0]        \\n\"\n        \"mov    v1.16b, v0.16b          \\n\"\n        \"mov    v2.16b, v0.16b          \\n\"\n        \"mov    v3.16b, v0.16b          \\n\"\n\n        // fill top\n        \"lsr    w4, %w8, #3             \\n\" // w4 = nn = top_size >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    1f                      \\n\"\n\n        \"0:                             \\n\"\n        \"st1    {v0.16b, v1.16b, v2.16b, v3.16b}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    0b                      \\n\"\n\n        \"1:                             \\n\"\n\n        // fill top remain\n        \"and    w4, %w8, #7             \\n\" // w4 = remain = top_size & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    2f                      \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v0.16b, v1.16b}, [%0], #32 \\n\"\n        \"2:                             \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    3f                      \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.16b}, [%0], #16     \\n\"\n        \"3:                             \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    4f                      \\n\"\n        \"st1    {v0.8b}, [%0], #8       \\n\"\n        \"4:                             \\n\"\n\n        // fill 
center h loop\n        \"cmp    %w5, #0                 \\n\"\n        \"beq    15f                     \\n\"\n        \"5:                             \\n\"\n\n        // fill left\n        \"mov    w4, %w6                 \\n\" // w4 = left\n        \"cmp    w4, #0                  \\n\"\n        \"beq    7f                      \\n\"\n\n        \"6:                             \\n\"\n        \"st1    {v0.8b}, [%0], #8       \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    6b                      \\n\"\n\n        \"7:                             \\n\"\n\n        // fill middle\n        \"lsr    w4, %w4, #3             \\n\" // w4 = nn = w >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    9f                      \\n\"\n\n        \"8:                             \\n\"\n        \"prfm   pldl1keep, [%1, #512]   \\n\"\n        \"ld1    {v16.16b, v17.16b, v18.16b, v19.16b}, [%1], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v16.16b, v17.16b, v18.16b, v19.16b}, [%0], #64 \\n\"\n        \"bne    8b                      \\n\"\n\n        \"9:                             \\n\"\n\n        \"and    w4, %w4, #7             \\n\" // w4 = remain = w & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    10f                     \\n\"\n        \"prfm   pldl1keep, [%1, #256]   \\n\"\n        \"ld1    {v16.16b, v17.16b}, [%1], #32 \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v16.16b, v17.16b}, [%0], #32 \\n\"\n        \"10:                            \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    11f                     \\n\"\n        \"prfm   pldl1keep, [%1, #128]   \\n\"\n        \"ld1    {v16.16b}, [%1], #16    \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v16.16b}, [%0], #16    \\n\"\n        \"11:                            \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 
> 0\n        \"beq    12f                     \\n\"\n        \"prfm   pldl1keep, [%1, #64]    \\n\"\n        \"ld1    {v16.8b}, [%1], #8      \\n\"\n        \"st1    {v16.8b}, [%0], #8      \\n\"\n        \"12:                            \\n\"\n\n        // fill right\n        \"mov    w4, %w7                 \\n\" // w4 = right\n        \"cmp    w4, #0                  \\n\"\n        \"beq    14f                     \\n\"\n\n        \"13:                            \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"st1    {v0.8b}, [%0], #8       \\n\"\n        \"bne    13b                     \\n\"\n        \"14:                            \\n\"\n\n        \"subs   %w5, %w5, #1            \\n\"\n        \"bne    5b                      \\n\"\n\n        \"15:                            \\n\"\n\n        // fill bottom\n        \"lsr    w4, %w9, #3             \\n\" // w4 = nn = bottom_size >> 3\n        \"cmp    w4, #0                  \\n\"\n        \"beq    17f                     \\n\"\n\n        \"16:                            \\n\"\n        \"st1    {v0.16b, v1.16b, v2.16b, v3.16b}, [%0], #64 \\n\"\n        \"subs   w4, w4, #1              \\n\"\n        \"bne    16b                     \\n\"\n        \"17:                            \\n\"\n\n        // fill bottom remain\n        \"and    w4, %w9, #7             \\n\" // w4 = remain = bottom_size & 7\n\n        \"cmp    w4, #4                  \\n\" // w4 >= 4\n        \"blt    18f                     \\n\"\n        \"sub    w4, w4, #4              \\n\"\n        \"st1    {v0.16b, v1.16b}, [%0], #32 \\n\"\n        \"18:                            \\n\"\n\n        \"cmp    w4, #2                  \\n\" // w4 >= 2\n        \"blt    19f                     \\n\"\n        \"sub    w4, w4, #2              \\n\"\n        \"st1    {v0.16b}, [%0], #16     \\n\"\n        \"19:                            \\n\"\n\n        \"cmp    w4, #0                  \\n\" // w4 > 0\n        \"beq    20f                
     \\n\"\n        \"st1    {v0.8b}, [%0], #8       \\n\"\n        \"20:                            \\n\"\n\n        : \"=r\"(outptr), // %0\n        \"=r\"(ptr)     // %1\n        : \"0\"(outptr),\n        \"1\"(ptr),\n        \"r\"(w),           // %4\n        \"r\"(h),           // %5\n        \"r\"(left),        // %6\n        \"r\"(right),       // %7\n        \"r\"(top_size),    // %8\n        \"r\"(bottom_size), // %9\n        \"w\"(v)            // %10\n        : \"cc\", \"memory\", \"x4\", \"v0\", \"v1\", \"v2\", \"v3\", \"v16\", \"v17\", \"v18\", \"v19\");\n#else  // __aarch64__\n    asm volatile(\n        \"vmov       d0, %P10            \\n\"\n        \"vmov       d1, d0              \\n\"\n        \"vmov       q1, q0              \\n\"\n        \"vmov       q2, q0              \\n\"\n        \"vmov       q3, q0              \\n\"\n\n        // fill top\n        \"lsr        r4, %8, #3          \\n\" // r4 = nn = top_size >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        1f                  \\n\"\n\n        \"0:                             \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"bne        0b                  \\n\"\n\n        \"1:                             \\n\"\n\n        // fill top remain\n        \"and        r4, %8, #7          \\n\" // r4 = remain = top_size & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        2f                  \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vst1.s8    {d0-d3}, [%0 :128]! \\n\"\n        \"2:                             \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        3f                  \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.s8    {d0-d1}, [%0 :128]! 
\\n\"\n        \"3:                             \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        4f                  \\n\"\n        \"vst1.s8    {d0}, [%0 :64]!     \\n\"\n        \"4:                             \\n\"\n\n        // fill center h loop\n        \"cmp        %5, #0              \\n\"\n        \"beq        15f                 \\n\"\n        \"5:                             \\n\"\n\n        // fill left\n        \"mov        r4, %6              \\n\" // r4 = left\n        \"cmp        r4, #0              \\n\"\n        \"beq        7f                  \\n\"\n\n        \"6:                             \\n\"\n        \"vst1.s8    {d0}, [%0 :64]!     \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"bne        6b                  \\n\"\n\n        \"7:                             \\n\"\n\n        // fill middle\n        \"lsr        r4, %4, #3          \\n\" // r4 = nn = w >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        9f                  \\n\"\n\n        \"8:                             \\n\"\n        \"pld        [%1, #512]          \\n\"\n        \"vldm       %1!, {d16-d23}      \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vstm       %0!, {d16-d23}      \\n\"\n        \"bne        8b                  \\n\"\n\n        \"9:                             \\n\"\n\n        \"and        r4, %4, #7          \\n\" // r4 = remain = w & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        10f                 \\n\"\n        \"pld        [%1, #256]          \\n\"\n        \"vld1.s8    {d16-d19}, [%1 :64]! \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vst1.s8    {d16-d19}, [%0 :64]! 
\\n\"\n        \"10:                            \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        11f                 \\n\"\n        \"pld        [%1, #128]          \\n\"\n        \"vld1.s8    {d16-d17}, [%1 :64]! \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.s8    {d16-d17}, [%0 :64]! \\n\"\n        \"11:                            \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        12f                 \\n\"\n        \"pld        [%1, #64]           \\n\"\n        \"vld1.s8    {d16}, [%1 :64]!    \\n\"\n        \"vst1.s8    {d16}, [%0 :64]!    \\n\"\n        \"12:                            \\n\"\n\n        // fill right\n        \"mov        r4, %7              \\n\" // r4 = right\n        \"cmp        r4, #0              \\n\"\n        \"beq        14f                 \\n\"\n\n        \"13:                            \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"vst1.s8    {d0}, [%0 :64]!     
\\n\"\n        \"bne        13b                 \\n\"\n        \"14:                            \\n\"\n\n        \"subs       %5, %5, #1          \\n\"\n        \"bne        5b                  \\n\"\n\n        \"15:                            \\n\"\n\n        // fill bottom\n        \"lsr        r4, %9, #3          \\n\" // r4 = nn = bottom_size >> 3\n        \"cmp        r4, #0              \\n\"\n        \"beq        17f                 \\n\"\n\n        \"16:                            \\n\"\n        \"vstm       %0!, {d0-d7}        \\n\"\n        \"subs       r4, r4, #1          \\n\"\n        \"bne        16b                 \\n\"\n        \"17:                            \\n\"\n\n        // fill bottom remain\n        \"and        r4, %9, #7          \\n\" // r4 = remain = bottom_size & 7\n\n        \"cmp        r4, #4              \\n\" // r4 >= 4\n        \"blt        18f                 \\n\"\n        \"sub        r4, r4, #4          \\n\"\n        \"vst1.s8    {d0-d3}, [%0 :64]!  \\n\"\n        \"18:                            \\n\"\n\n        \"cmp        r4, #2              \\n\" // r4 >= 2\n        \"blt        19f                 \\n\"\n        \"sub        r4, r4, #2          \\n\"\n        \"vst1.s8    {d0-d1}, [%0 :64]!  \\n\"\n        \"19:                            \\n\"\n\n        \"cmp        r4, #0              \\n\" // r4 > 0\n        \"beq        20f                 \\n\"\n        \"vst1.s8    {d0}, [%0 :64]!     
\\n\"\n        \"20:                            \\n\"\n\n        : \"=r\"(outptr), // %0\n        \"=r\"(ptr)     // %1\n        : \"0\"(outptr),\n        \"1\"(ptr),\n        \"r\"(w),           // %4\n        \"r\"(h),           // %5\n        \"r\"(left),        // %6\n        \"r\"(right),       // %7\n        \"r\"(top_size),    // %8\n        \"r\"(bottom_size), // %9\n        \"w\"(v)            // %10\n        : \"cc\", \"memory\", \"r4\", \"q0\", \"q1\", \"q2\", \"q3\", \"q8\", \"q9\", \"q10\", \"q11\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n\n    // fill top\n    {\n        int x = 0;\n        for (; x + 3 < top_size; x += 4)\n        {\n            vst1_s8(outptr, v);\n            vst1_s8(outptr + 8, v);\n            vst1_s8(outptr + 16, v);\n            vst1_s8(outptr + 24, v);\n            outptr += 32;\n        }\n        for (; x < top_size; x++)\n        {\n            vst1_s8(outptr, v);\n            outptr += 8;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            vst1_s8(outptr, v);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr);\n            vst1_s8(outptr, _p);\n            ptr += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_s8(outptr, v);\n            outptr += 8;\n        }\n    }\n    // fill bottom\n    {\n        int x = 0;\n        for (; x + 3 < bottom_size; x += 4)\n        {\n            vst1_s8(outptr, v);\n            vst1_s8(outptr + 8, v);\n            vst1_s8(outptr + 16, v);\n            vst1_s8(outptr + 24, v);\n            outptr += 32;\n        }\n        for (; x < bottom_size; x++)\n        {\n            vst1_s8(outptr, v);\n            outptr += 8;\n        }\n    }\n#endif // NCNN_GNU_INLINE_ASM\n}\n\nstatic void padding_replicate_pack8_int8_neon(const Mat& src, Mat& 
dst, int top, int bottom, int left, int right)\n{\n    const signed char* ptr = src;\n    signed char* outptr = dst;\n\n    // fill top\n    for (int y = 0; y < top; y++)\n    {\n        const signed char* ptr0 = ptr;\n        int8x8_t _p = vld1_s8(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1_s8(ptr0);\n            vst1_s8(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        int8x8_t _p = vld1_s8(ptr);\n        for (int x = 0; x < left; x++)\n        {\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1_s8(ptr);\n            vst1_s8(outptr, _p);\n            ptr += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n    }\n    // fill bottom\n    ptr -= src.w * 8;\n    for (int y = 0; y < bottom; y++)\n    {\n        const signed char* ptr0 = ptr;\n        int8x8_t _p = vld1_s8(ptr0);\n        for (int x = 0; x < left; x++)\n        {\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = vld1_s8(ptr0);\n            vst1_s8(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n    }\n}\n\nstatic void padding_reflect_pack8_int8_neon(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const signed char* ptr = src;\n    signed char* outptr = 
dst;\n\n    // fill top\n    ptr += top * src.w * 8;\n    for (int y = 0; y < top; y++)\n    {\n        const signed char* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr0 + (left - x) * 8);\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr0);\n            vst1_s8(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr0 - 16 - x * 8);\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n        ptr -= src.w * 8;\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr + (left - x) * 8);\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr);\n            vst1_s8(outptr, _p);\n            ptr += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr - 16 - x * 8);\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n    }\n    // fill bottom\n    ptr -= 2 * src.w * 8;\n    for (int y = 0; y < bottom; y++)\n    {\n        const signed char* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr0 + (left - x) * 8);\n            vst1_s8(outptr, _p);\n            outptr += 8;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr0);\n            vst1_s8(outptr, _p);\n            ptr0 += 8;\n            outptr += 8;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            int8x8_t _p = vld1_s8(ptr0 - 16 - x * 8);\n            vst1_s8(outptr, _p);\n            outptr += 
8;\n        }\n        ptr -= src.w * 8;\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/pixelshuffle_arm.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"pixelshuffle_arm.h\"\n\n#include \"layer_type.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nPixelShuffle_arm::PixelShuffle_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint PixelShuffle_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = w * upscale_factor;\n    int outh = h * upscale_factor;\n    int outc = channels * elempack / (upscale_factor * upscale_factor);\n\n    int out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n        out_elempack = outc % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    if (upscale_factor != 2 || mode != 0)\n    {\n        Option opt_pack = opt;\n        opt_pack.blob_allocator = opt.workspace_allocator;\n\n        Mat bottom_blob_unpacked;\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack);\n\n        return PixelShuffle::forward(bottom_blob_unpacked, top_blob, opt);\n    }\n\n    top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __ARM_NEON\n    if (elempack == 4 && out_elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc / out_elempack; p++)\n        {\n            Mat m = top_blob.channel(p);\n\n            const float* sptr0 = bottom_blob.channel(p * 4);\n            const float* sptr1 = bottom_blob.channel(p * 4 + 1);\n            const float* sptr2 = bottom_blob.channel(p * 4 + 2);\n            const float* sptr3 = bottom_blob.channel(p * 4 + 3);\n\n            for (int i = 0; i < h; i++)\n            {\n                float* outptr0 = m.row(i * 2);\n                float* outptr1 = m.row(i * 2 + 1);\n\n                int j = 0;\n                for (; j + 1 < w; j += 2)\n                {\n                    float32x4_t _p00 = vld1q_f32(sptr0);\n                    float32x4_t _p10 = vld1q_f32(sptr1);\n                    float32x4_t _p20 = vld1q_f32(sptr2);\n                    float32x4_t _p30 = vld1q_f32(sptr3);\n\n                    float32x4_t _p01 = vld1q_f32(sptr0 + 4);\n                    float32x4_t _p11 = vld1q_f32(sptr1 + 4);\n                    float32x4_t _p21 = vld1q_f32(sptr2 + 4);\n                    float32x4_t _p31 = vld1q_f32(sptr3 + 4);\n\n                    float32x4x4_t _s0;\n                    _s0.val[0] = vcombine_f32(vget_low_f32(_p00), vget_low_f32(_p01));\n                    _s0.val[1] = 
vcombine_f32(vget_low_f32(_p10), vget_low_f32(_p11));\n                    _s0.val[2] = vcombine_f32(vget_low_f32(_p20), vget_low_f32(_p21));\n                    _s0.val[3] = vcombine_f32(vget_low_f32(_p30), vget_low_f32(_p31));\n\n                    float32x4x4_t _s1;\n                    _s1.val[0] = vcombine_f32(vget_high_f32(_p00), vget_high_f32(_p01));\n                    _s1.val[1] = vcombine_f32(vget_high_f32(_p10), vget_high_f32(_p11));\n                    _s1.val[2] = vcombine_f32(vget_high_f32(_p20), vget_high_f32(_p21));\n                    _s1.val[3] = vcombine_f32(vget_high_f32(_p30), vget_high_f32(_p31));\n\n                    vst4q_f32(outptr0, _s0);\n                    vst4q_f32(outptr1, _s1);\n\n                    sptr0 += 8;\n                    sptr1 += 8;\n                    sptr2 += 8;\n                    sptr3 += 8;\n                    outptr0 += 16;\n                    outptr1 += 16;\n                }\n                for (; j < w; j++)\n                {\n                    outptr0[0] = sptr0[0];\n                    outptr0[1] = sptr1[0];\n                    outptr0[2] = sptr2[0];\n                    outptr0[3] = sptr3[0];\n\n                    outptr0[4] = sptr0[1];\n                    outptr0[5] = sptr1[1];\n                    outptr0[6] = sptr2[1];\n                    outptr0[7] = sptr3[1];\n\n                    outptr1[0] = sptr0[2];\n                    outptr1[1] = sptr1[2];\n                    outptr1[2] = sptr2[2];\n                    outptr1[3] = sptr3[2];\n\n                    outptr1[4] = sptr0[3];\n                    outptr1[5] = sptr1[3];\n                    outptr1[6] = sptr2[3];\n                    outptr1[7] = sptr3[3];\n\n                    sptr0 += 4;\n                    sptr1 += 4;\n                    sptr2 += 4;\n                    sptr3 += 4;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    
if (elempack == 4 && out_elempack == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc / out_elempack; p++)\n        {\n            Mat m = top_blob.channel(p);\n\n            const float* sptr = bottom_blob.channel(p);\n\n            for (int i = 0; i < h; i++)\n            {\n                float* outptr0 = m.row(i * 2);\n                float* outptr1 = m.row(i * 2 + 1);\n\n                int j = 0;\n                for (; j + 1 < w; j += 2)\n                {\n                    float32x4_t _p0 = vld1q_f32(sptr);\n                    float32x4_t _p1 = vld1q_f32(sptr + 4);\n\n                    float32x4_t _s0 = vcombine_f32(vget_low_f32(_p0), vget_low_f32(_p1));\n                    float32x4_t _s1 = vcombine_f32(vget_high_f32(_p0), vget_high_f32(_p1));\n\n                    vst1q_f32(outptr0, _s0);\n                    vst1q_f32(outptr1, _s1);\n\n                    sptr += 8;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n                for (; j < w; j++)\n                {\n                    outptr0[0] = sptr[0];\n                    outptr0[1] = sptr[1];\n                    outptr1[0] = sptr[2];\n                    outptr1[1] = sptr[3];\n\n                    sptr += 4;\n                    outptr0 += 2;\n                    outptr1 += 2;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    return PixelShuffle::forward(bottom_blob, top_blob, opt);\n}\n\nint PixelShuffle_arm::forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = w * upscale_factor;\n    int outh = h * upscale_factor;\n    int outc = channels * elempack / (upscale_factor * upscale_factor);\n\n    int 
out_elempack = 1;\n#if __ARM_NEON\n    if (opt.use_packing_layout)\n    {\n#if NCNN_ARM82\n        out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && outc % 8 == 0 ? 8 : outc % 4 == 0 ? 4 : 1;\n#else\n        out_elempack = outc % 4 == 0 ? 4 : 1;\n#endif\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    if (upscale_factor != 2 || mode != 0)\n    {\n        Option opt_pack = opt;\n        opt_pack.blob_allocator = opt.workspace_allocator;\n\n        Mat bottom_blob_unpacked;\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack);\n\n        top_blob.create(outw, outh, outc, 2u, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc; p++)\n        {\n            Mat m = top_blob.channel(p);\n\n            for (int sh = 0; sh < upscale_factor; sh++)\n            {\n                for (int sw = 0; sw < upscale_factor; sw++)\n                {\n                    int q;\n                    if (mode == 0)\n                        q = p * upscale_factor * upscale_factor + sh * upscale_factor + sw;\n                    else // if (mode == 1)\n                        q = (sh * upscale_factor + sw) * outc + p;\n\n                    const unsigned short* sptr = bottom_blob_unpacked.channel(q);\n\n                    for (int i = 0; i < h; i++)\n                    {\n                        unsigned short* outptr = m.row<unsigned short>(i * upscale_factor + sh) + sw;\n                        for (int j = 0; j < w; j++)\n                        {\n                            outptr[0] = sptr[0];\n\n                            sptr++;\n                            outptr += upscale_factor;\n                        }\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    top_blob.create(outw, outh, outc / out_elempack, out_elemsize, 
out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __ARM_NEON\n    if (elempack == 8 && out_elempack == 8)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc / out_elempack; p++)\n        {\n            Mat m = top_blob.channel(p);\n\n            const unsigned short* sptr0 = bottom_blob.channel(p * 4);\n            const unsigned short* sptr1 = bottom_blob.channel(p * 4 + 1);\n            const unsigned short* sptr2 = bottom_blob.channel(p * 4 + 2);\n            const unsigned short* sptr3 = bottom_blob.channel(p * 4 + 3);\n\n            for (int i = 0; i < h; i++)\n            {\n                unsigned short* outptr0 = m.row<unsigned short>(i * 2);\n                unsigned short* outptr1 = m.row<unsigned short>(i * 2 + 1);\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x8x4_t _p0 = vld4q_u16(sptr0);\n                    uint16x8x4_t _p1 = vld4q_u16(sptr1);\n                    uint16x8x4_t _p2 = vld4q_u16(sptr2);\n                    uint16x8x4_t _p3 = vld4q_u16(sptr3);\n\n                    uint32x4x2_t _s04_0 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[0]), vreinterpretq_u32_u16(_p1.val[0]));\n                    uint32x4x2_t _s15_0 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[1]), vreinterpretq_u32_u16(_p1.val[1]));\n                    uint32x4x2_t _s26_0 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[2]), vreinterpretq_u32_u16(_p1.val[2]));\n                    uint32x4x2_t _s37_0 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[3]), vreinterpretq_u32_u16(_p1.val[3]));\n                    uint32x4x2_t _s04_1 = vzipq_u32(vreinterpretq_u32_u16(_p2.val[0]), vreinterpretq_u32_u16(_p3.val[0]));\n                    uint32x4x2_t _s15_1 = vzipq_u32(vreinterpretq_u32_u16(_p2.val[1]), vreinterpretq_u32_u16(_p3.val[1]));\n                    uint32x4x2_t _s26_1 = vzipq_u32(vreinterpretq_u32_u16(_p2.val[2]), 
vreinterpretq_u32_u16(_p3.val[2]));\n                    uint32x4x2_t _s37_1 = vzipq_u32(vreinterpretq_u32_u16(_p2.val[3]), vreinterpretq_u32_u16(_p3.val[3]));\n\n                    uint16x8_t _s0_0 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s04_0.val[0]), vget_low_u32(_s04_1.val[0])));\n                    uint16x8_t _s0_1 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s15_0.val[0]), vget_low_u32(_s15_1.val[0])));\n                    uint16x8_t _s0_2 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s04_0.val[0]), vget_high_u32(_s04_1.val[0])));\n                    uint16x8_t _s0_3 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s15_0.val[0]), vget_high_u32(_s15_1.val[0])));\n                    uint16x8_t _s0_4 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s04_0.val[1]), vget_low_u32(_s04_1.val[1])));\n                    uint16x8_t _s0_5 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s15_0.val[1]), vget_low_u32(_s15_1.val[1])));\n                    uint16x8_t _s0_6 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s04_0.val[1]), vget_high_u32(_s04_1.val[1])));\n                    uint16x8_t _s0_7 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s15_0.val[1]), vget_high_u32(_s15_1.val[1])));\n                    uint16x8_t _s1_0 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s26_0.val[0]), vget_low_u32(_s26_1.val[0])));\n                    uint16x8_t _s1_1 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s37_0.val[0]), vget_low_u32(_s37_1.val[0])));\n                    uint16x8_t _s1_2 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s26_0.val[0]), vget_high_u32(_s26_1.val[0])));\n                    uint16x8_t _s1_3 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s37_0.val[0]), vget_high_u32(_s37_1.val[0])));\n                    uint16x8_t _s1_4 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s26_0.val[1]), vget_low_u32(_s26_1.val[1])));\n                    uint16x8_t _s1_5 = 
vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s37_0.val[1]), vget_low_u32(_s37_1.val[1])));\n                    uint16x8_t _s1_6 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s26_0.val[1]), vget_high_u32(_s26_1.val[1])));\n                    uint16x8_t _s1_7 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s37_0.val[1]), vget_high_u32(_s37_1.val[1])));\n\n                    vst1q_u16(outptr0, _s0_0);\n                    vst1q_u16(outptr0 + 8, _s0_1);\n                    vst1q_u16(outptr0 + 16, _s0_2);\n                    vst1q_u16(outptr0 + 24, _s0_3);\n                    vst1q_u16(outptr0 + 32, _s0_4);\n                    vst1q_u16(outptr0 + 40, _s0_5);\n                    vst1q_u16(outptr0 + 48, _s0_6);\n                    vst1q_u16(outptr0 + 56, _s0_7);\n                    vst1q_u16(outptr1, _s1_0);\n                    vst1q_u16(outptr1 + 8, _s1_1);\n                    vst1q_u16(outptr1 + 16, _s1_2);\n                    vst1q_u16(outptr1 + 24, _s1_3);\n                    vst1q_u16(outptr1 + 32, _s1_4);\n                    vst1q_u16(outptr1 + 40, _s1_5);\n                    vst1q_u16(outptr1 + 48, _s1_6);\n                    vst1q_u16(outptr1 + 56, _s1_7);\n\n                    sptr0 += 32;\n                    sptr1 += 32;\n                    sptr2 += 32;\n                    sptr3 += 32;\n                    outptr0 += 64;\n                    outptr1 += 64;\n                }\n                for (; j < w; j++)\n                {\n                    outptr0[0] = sptr0[0];\n                    outptr0[1] = sptr0[4];\n                    outptr0[2] = sptr1[0];\n                    outptr0[3] = sptr1[4];\n                    outptr0[4] = sptr2[0];\n                    outptr0[5] = sptr2[4];\n                    outptr0[6] = sptr3[0];\n                    outptr0[7] = sptr3[4];\n\n                    outptr0[8] = sptr0[1];\n                    outptr0[9] = sptr0[5];\n                    outptr0[10] = sptr1[1];\n                    
outptr0[11] = sptr1[5];\n                    outptr0[12] = sptr2[1];\n                    outptr0[13] = sptr2[5];\n                    outptr0[14] = sptr3[1];\n                    outptr0[15] = sptr3[5];\n\n                    outptr1[0] = sptr0[2];\n                    outptr1[1] = sptr0[6];\n                    outptr1[2] = sptr1[2];\n                    outptr1[3] = sptr1[6];\n                    outptr1[4] = sptr2[2];\n                    outptr1[5] = sptr2[6];\n                    outptr1[6] = sptr3[2];\n                    outptr1[7] = sptr3[6];\n\n                    outptr1[8] = sptr0[3];\n                    outptr1[9] = sptr0[7];\n                    outptr1[10] = sptr1[3];\n                    outptr1[11] = sptr1[7];\n                    outptr1[12] = sptr2[3];\n                    outptr1[13] = sptr2[7];\n                    outptr1[14] = sptr3[3];\n                    outptr1[15] = sptr3[7];\n\n                    sptr0 += 8;\n                    sptr1 += 8;\n                    sptr2 += 8;\n                    sptr3 += 8;\n                    outptr0 += 16;\n                    outptr1 += 16;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 8 && out_elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc / out_elempack; p++)\n        {\n            Mat m = top_blob.channel(p);\n\n            const unsigned short* sptr0 = bottom_blob.channel(p * 2);\n            const unsigned short* sptr1 = bottom_blob.channel(p * 2 + 1);\n\n            for (int i = 0; i < h; i++)\n            {\n                unsigned short* outptr0 = m.row<unsigned short>(i * 2);\n                unsigned short* outptr1 = m.row<unsigned short>(i * 2 + 1);\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x8x4_t _p0 = vld4q_u16(sptr0);\n                    uint16x8x4_t _p1 = vld4q_u16(sptr1);\n\n           
         uint32x4x2_t _s04 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[0]), vreinterpretq_u32_u16(_p1.val[0]));\n                    uint32x4x2_t _s15 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[1]), vreinterpretq_u32_u16(_p1.val[1]));\n                    uint32x4x2_t _s26 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[2]), vreinterpretq_u32_u16(_p1.val[2]));\n                    uint32x4x2_t _s37 = vzipq_u32(vreinterpretq_u32_u16(_p0.val[3]), vreinterpretq_u32_u16(_p1.val[3]));\n\n                    uint16x8_t _s0_0 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s04.val[0]), vget_low_u32(_s15.val[0])));\n                    uint16x8_t _s0_1 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s04.val[0]), vget_high_u32(_s15.val[0])));\n                    uint16x8_t _s0_2 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s04.val[1]), vget_low_u32(_s15.val[1])));\n                    uint16x8_t _s0_3 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s04.val[1]), vget_high_u32(_s15.val[1])));\n                    uint16x8_t _s1_0 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s26.val[0]), vget_low_u32(_s37.val[0])));\n                    uint16x8_t _s1_1 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s26.val[0]), vget_high_u32(_s37.val[0])));\n                    uint16x8_t _s1_2 = vreinterpretq_u16_u32(vcombine_u32(vget_low_u32(_s26.val[1]), vget_low_u32(_s37.val[1])));\n                    uint16x8_t _s1_3 = vreinterpretq_u16_u32(vcombine_u32(vget_high_u32(_s26.val[1]), vget_high_u32(_s37.val[1])));\n\n                    vst1q_u16(outptr0, _s0_0);\n                    vst1q_u16(outptr0 + 8, _s0_1);\n                    vst1q_u16(outptr0 + 16, _s0_2);\n                    vst1q_u16(outptr0 + 24, _s0_3);\n                    vst1q_u16(outptr1, _s1_0);\n                    vst1q_u16(outptr1 + 8, _s1_1);\n                    vst1q_u16(outptr1 + 16, _s1_2);\n                    vst1q_u16(outptr1 + 24, _s1_3);\n\n                    sptr0 += 32;\n               
     sptr1 += 32;\n                    outptr0 += 32;\n                    outptr1 += 32;\n                }\n                for (; j < w; j++)\n                {\n                    outptr0[0] = sptr0[0];\n                    outptr0[1] = sptr0[4];\n                    outptr0[2] = sptr1[0];\n                    outptr0[3] = sptr1[4];\n\n                    outptr0[4] = sptr0[1];\n                    outptr0[5] = sptr0[5];\n                    outptr0[6] = sptr1[1];\n                    outptr0[7] = sptr1[5];\n\n                    outptr1[0] = sptr0[2];\n                    outptr1[1] = sptr0[6];\n                    outptr1[2] = sptr1[2];\n                    outptr1[3] = sptr1[6];\n\n                    outptr1[4] = sptr0[3];\n                    outptr1[5] = sptr0[7];\n                    outptr1[6] = sptr1[3];\n                    outptr1[7] = sptr1[7];\n\n                    sptr0 += 8;\n                    sptr1 += 8;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 8 && out_elempack == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc / out_elempack / 2; p++)\n        {\n            Mat m0 = top_blob.channel(p * 2);\n            Mat m1 = top_blob.channel(p * 2 + 1);\n\n            const unsigned short* sptr = bottom_blob.channel(p);\n\n            for (int i = 0; i < h; i++)\n            {\n                unsigned short* outptr00 = m0.row<unsigned short>(i * 2);\n                unsigned short* outptr01 = m0.row<unsigned short>(i * 2 + 1);\n                unsigned short* outptr10 = m1.row<unsigned short>(i * 2);\n                unsigned short* outptr11 = m1.row<unsigned short>(i * 2 + 1);\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    uint32x4x4_t _p = vld4q_u32((unsigned int*)sptr);\n\n                    
uint16x8_t _s0 = vreinterpretq_u16_u32(_p.val[0]);\n                    uint16x8_t _s1 = vreinterpretq_u16_u32(_p.val[1]);\n                    uint16x8_t _s2 = vreinterpretq_u16_u32(_p.val[2]);\n                    uint16x8_t _s3 = vreinterpretq_u16_u32(_p.val[3]);\n\n                    vst1q_u16(outptr00, _s0);\n                    vst1q_u16(outptr01, _s1);\n                    vst1q_u16(outptr10, _s2);\n                    vst1q_u16(outptr11, _s3);\n\n                    sptr += 32;\n                    outptr00 += 8;\n                    outptr01 += 8;\n                    outptr10 += 8;\n                    outptr11 += 8;\n                }\n                for (; j < w; j++)\n                {\n                    outptr00[0] = sptr[0];\n                    outptr00[1] = sptr[1];\n                    outptr01[0] = sptr[2];\n                    outptr01[1] = sptr[3];\n\n                    outptr10[0] = sptr[4];\n                    outptr10[1] = sptr[5];\n                    outptr11[0] = sptr[6];\n                    outptr11[1] = sptr[7];\n\n                    sptr += 8;\n                    outptr00 += 2;\n                    outptr01 += 2;\n                    outptr10 += 2;\n                    outptr11 += 2;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4 && out_elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc / out_elempack; p++)\n        {\n            Mat m = top_blob.channel(p);\n\n            const unsigned short* sptr0 = bottom_blob.channel(p * 4);\n            const unsigned short* sptr1 = bottom_blob.channel(p * 4 + 1);\n            const unsigned short* sptr2 = bottom_blob.channel(p * 4 + 2);\n            const unsigned short* sptr3 = bottom_blob.channel(p * 4 + 3);\n\n            for (int i = 0; i < h; i++)\n            {\n                unsigned short* outptr0 = m.row<unsigned short>(i * 2);\n                unsigned 
short* outptr1 = m.row<unsigned short>(i * 2 + 1);\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x8_t _p00 = vld1q_u16(sptr0);\n                    uint16x8_t _p10 = vld1q_u16(sptr1);\n                    uint16x8_t _p20 = vld1q_u16(sptr2);\n                    uint16x8_t _p30 = vld1q_u16(sptr3);\n                    uint16x8_t _p01 = vld1q_u16(sptr0 + 8);\n                    uint16x8_t _p11 = vld1q_u16(sptr1 + 8);\n                    uint16x8_t _p21 = vld1q_u16(sptr2 + 8);\n                    uint16x8_t _p31 = vld1q_u16(sptr3 + 8);\n\n                    uint32x4x2_t _p0 = vuzpq_u32(vreinterpretq_u32_u16(_p00), vreinterpretq_u32_u16(_p01));\n                    uint32x4x2_t _p1 = vuzpq_u32(vreinterpretq_u32_u16(_p10), vreinterpretq_u32_u16(_p11));\n                    uint32x4x2_t _p2 = vuzpq_u32(vreinterpretq_u32_u16(_p20), vreinterpretq_u32_u16(_p21));\n                    uint32x4x2_t _p3 = vuzpq_u32(vreinterpretq_u32_u16(_p30), vreinterpretq_u32_u16(_p31));\n\n                    uint16x8x4_t _s0;\n                    _s0.val[0] = vreinterpretq_u16_u32(_p0.val[0]);\n                    _s0.val[1] = vreinterpretq_u16_u32(_p1.val[0]);\n                    _s0.val[2] = vreinterpretq_u16_u32(_p2.val[0]);\n                    _s0.val[3] = vreinterpretq_u16_u32(_p3.val[0]);\n\n                    uint16x8x4_t _s1;\n                    _s1.val[0] = vreinterpretq_u16_u32(_p0.val[1]);\n                    _s1.val[1] = vreinterpretq_u16_u32(_p1.val[1]);\n                    _s1.val[2] = vreinterpretq_u16_u32(_p2.val[1]);\n                    _s1.val[3] = vreinterpretq_u16_u32(_p3.val[1]);\n\n                    vst4q_u16(outptr0, _s0);\n                    vst4q_u16(outptr1, _s1);\n\n                    sptr0 += 16;\n                    sptr1 += 16;\n                    sptr2 += 16;\n                    sptr3 += 16;\n                    outptr0 += 32;\n                    outptr1 += 32;\n        
        }\n                for (; j < w; j++)\n                {\n                    outptr0[0] = sptr0[0];\n                    outptr0[1] = sptr1[0];\n                    outptr0[2] = sptr2[0];\n                    outptr0[3] = sptr3[0];\n\n                    outptr0[4] = sptr0[1];\n                    outptr0[5] = sptr1[1];\n                    outptr0[6] = sptr2[1];\n                    outptr0[7] = sptr3[1];\n\n                    outptr1[0] = sptr0[2];\n                    outptr1[1] = sptr1[2];\n                    outptr1[2] = sptr2[2];\n                    outptr1[3] = sptr3[2];\n\n                    outptr1[4] = sptr0[3];\n                    outptr1[5] = sptr1[3];\n                    outptr1[6] = sptr2[3];\n                    outptr1[7] = sptr3[3];\n\n                    sptr0 += 4;\n                    sptr1 += 4;\n                    sptr2 += 4;\n                    sptr3 += 4;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < outc / out_elempack; p++)\n        {\n            Mat m = top_blob.channel(p);\n\n            const unsigned short* sptr = bottom_blob.channel(p);\n\n            for (int i = 0; i < h; i++)\n            {\n                unsigned short* outptr0 = m.row<unsigned short>(i * 2);\n                unsigned short* outptr1 = m.row<unsigned short>(i * 2 + 1);\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    uint16x8_t _p0 = vld1q_u16(sptr);\n                    uint16x8_t _p1 = vld1q_u16(sptr + 8);\n\n                    uint32x4x2_t _s01 = vuzpq_u32(vreinterpretq_u32_u16(_p0), vreinterpretq_u32_u16(_p1));\n\n                    uint16x8_t _s0 = vreinterpretq_u16_u32(_s01.val[0]);\n                    uint16x8_t _s1 = 
vreinterpretq_u16_u32(_s01.val[1]);\n\n                    vst1q_u16(outptr0, _s0);\n                    vst1q_u16(outptr1, _s1);\n\n                    sptr += 16;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                }\n                for (; j < w; j++)\n                {\n                    outptr0[0] = sptr[0];\n                    outptr0[1] = sptr[1];\n                    outptr1[0] = sptr[2];\n                    outptr1[1] = sptr[3];\n\n                    sptr += 4;\n                    outptr0 += 2;\n                    outptr1 += 2;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outc; p++)\n    {\n        Mat m = top_blob.channel(p);\n\n        for (int sh = 0; sh < upscale_factor; sh++)\n        {\n            for (int sw = 0; sw < upscale_factor; sw++)\n            {\n                int q = p * upscale_factor * upscale_factor + sh * upscale_factor + sw;\n\n                const unsigned short* sptr = bottom_blob.channel(q);\n\n                for (int i = 0; i < h; i++)\n                {\n                    unsigned short* outptr = m.row<unsigned short>(i * upscale_factor + sh) + sw;\n                    for (int j = 0; j < w; j++)\n                    {\n                        outptr[0] = sptr[0];\n\n                        sptr++;\n                        outptr += upscale_factor;\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/pixelshuffle_arm.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_PIXELSHUFFLE_ARM_H\n#define LAYER_PIXELSHUFFLE_ARM_H\n\n#include \"pixelshuffle.h\"\n\nnamespace ncnn {\n\nclass PixelShuffle_arm : public PixelShuffle\n{\npublic:\n    PixelShuffle_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_PIXELSHUFFLE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/pooling_2x2.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pooling2x2s2_max_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const float* img0 = bottom_blob.channel(q);\n        float* outptr = top_blob.channel(q);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n\n        for (int i = 0; i < outh; i++)\n        {\n#if __ARM_NEON\n            int nn = outw >> 2;\n            int remain = outw - (nn << 2);\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                                   \\n\"\n                    \"prfm       pldl1keep, [%1, #256]     \\n\"\n                    \"prfm       pldl1keep, [%2, #256]     \\n\"\n                    \"ld1        {v0.4s, v1.4s}, [%1], #32 \\n\"\n                    \"ld1        {v2.4s, v3.4s}, [%2], #32 \\n\"\n                    \"fmax       v0.4s, v0.4s, v2.4s       \\n\"\n                    \"fmax       v1.4s, v1.4s, v3.4s       \\n\"\n                    \"fmaxp      v2.4s, v0.4s, v1.4s       \\n\"\n                    \"subs       %w0, %w0, #1              \\n\"\n                    \"st1        {v2.4s}, [%3], #16        \\n\"\n                    \"bne        0b                        \\n\"\n                    : \"=r\"(nn),    // %0\n                    \"=r\"(r0),    // %1\n                    \"=r\"(r1),    // %2\n                    \"=r\"(outptr) // %3\n                    : \"0\"(nn),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(outptr)\n                    : \"cc\", 
\"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n            }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"0:                             \\n\"\n                    \"pld        [%1, #256]          \\n\"\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld1.f32   {d0-d3}, [%1]!      \\n\"\n                    \"vld1.f32   {d4-d7}, [%2]!      \\n\"\n                    \"vmax.f32   q0, q0, q2          \\n\"\n                    \"vmax.f32   q1, q1, q3          \\n\"\n                    \"vpmax.f32  d4, d0, d1          \\n\"\n                    \"vpmax.f32  d5, d2, d3          \\n\"\n                    \"subs       %0, #1              \\n\"\n                    \"vst1.f32   {d4-d5}, [%3]!      \\n\"\n                    \"bne        0b                  \\n\"\n                    : \"=r\"(nn),    // %0\n                    \"=r\"(r0),    // %1\n                    \"=r\"(r1),    // %2\n                    \"=r\"(outptr) // %3\n                    : \"0\"(nn),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(outptr)\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                float max0 = std::max(r0[0], r0[1]);\n                float max1 = std::max(r1[0], r1[1]);\n\n                *outptr = std::max(max0, max1);\n\n                r0 += 2;\n                r1 += 2;\n                outptr++;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/pooling_2x2_pack4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pooling2x2s2_max_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        float* outptr = top_blob.channel(q);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]   \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n\n                    \"fmax   v0.4s, v0.4s, v1.4s     \\n\"\n                    \"fmax   v2.4s, v2.4s, v3.4s     \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]   \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n\n                    \"fmax   v4.4s, v4.4s, v5.4s     \\n\"\n                    \"fmax   v6.4s, v6.4s, v7.4s     \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]   \\n\"\n                    \"ld1    {v16.4s, v17.4s, v18.4s, v19.4s}, [%2], #64 \\n\"\n\n                    \"fmax   v16.4s, v16.4s, v17.4s  \\n\"\n                    \"fmax   v18.4s, v18.4s, v19.4s  \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]   \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%2], #64 \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v21.4s  \\n\"\n                    \"fmax   v22.4s, v22.4s, v23.4s  \\n\"\n\n                    \"fmax   v0.4s, v0.4s, v16.4s    \\n\"\n                    \"fmax   v1.4s, v2.4s, v18.4s    \\n\"\n        
            \"fmax   v2.4s, v4.4s, v20.4s    \\n\"\n                    \"fmax   v3.4s, v6.4s, v22.4s    \\n\"\n\n                    \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #512]      \\n\"\n                    \"vldm       %1!, {d0-d7}    \\n\"\n\n                    \"vmax.f32   q0, q0, q1      \\n\"\n                    \"vmax.f32   q2, q2, q3      \\n\"\n\n                    \"pld        [%1, #512]      \\n\"\n                    \"vldm       %1!, {d8-d15}   \\n\"\n\n                    \"vmax.f32   q4, q4, q5      \\n\"\n                    \"vmax.f32   q6, q6, q7      \\n\"\n\n                    \"pld        [%2, #512]      \\n\"\n                    \"vldm       %2!, {d16-d23}  \\n\"\n\n                    \"vmax.f32   q8, q8, q9      \\n\"\n                    \"vmax.f32   q10, q10, q11   \\n\"\n\n                    \"pld        [%2, #512]      \\n\"\n                    \"vldm       %2!, {d24-d31}  \\n\"\n\n                    \"vmax.f32   q12, q12, q13   \\n\"\n                    \"vmax.f32   q14, q14, q15   \\n\"\n\n                    \"vmax.f32   q0, q0, q8      \\n\"\n                    \"vmax.f32   q1, q2, q10     \\n\"\n                    \"vmax.f32   q2, q4, q12     \\n\"\n                    \"vmax.f32   q3, q6, q14     \\n\"\n\n                    \"vstm       %0!, {d0-d7}    \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1)      // %2\n                    : 
\"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n\n                float32x4_t _max0 = vmaxq_f32(_r00, _r01);\n                float32x4_t _max1 = vmaxq_f32(_r10, _r11);\n                float32x4_t _max = vmaxq_f32(_max0, _max1);\n\n                vst1q_f32(outptr, _max);\n\n                r0 += 8;\n                r1 += 8;\n                outptr += 4;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/pooling_2x2_pack4_bf16s.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pooling2x2s2_max_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        unsigned short* outptr = top_blob.channel(q);\n\n        const unsigned short* r0 = img0.row<const unsigned short>(0);\n        const unsigned short* r1 = img0.row<const unsigned short>(1);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]   \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n                    \"shll   v3.4s, v3.4h, #16       \\n\"\n\n                    \"fmax   v0.4s, v0.4s, v1.4s     \\n\"\n                    \"fmax   v2.4s, v2.4s, v3.4s     \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]   \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%1], #32 \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16       \\n\"\n                    \"shll   v5.4s, v5.4h, #16       \\n\"\n                    \"shll   v6.4s, v6.4h, #16       \\n\"\n                    \"shll   v7.4s, v7.4h, #16       \\n\"\n\n                    \"fmax   v4.4s, v4.4s, v5.4s     \\n\"\n                    \"fmax   v6.4s, v6.4s, v7.4s     \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]   \\n\"\n                    \"ld1    
{v16.4h, v17.4h, v18.4h, v19.4h}, [%2], #32 \\n\"\n\n                    \"shll   v16.4s, v16.4h, #16     \\n\"\n                    \"shll   v17.4s, v17.4h, #16     \\n\"\n                    \"shll   v18.4s, v18.4h, #16     \\n\"\n                    \"shll   v19.4s, v19.4h, #16     \\n\"\n\n                    \"fmax   v16.4s, v16.4s, v17.4s  \\n\"\n                    \"fmax   v18.4s, v18.4s, v19.4s  \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]   \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%2], #32 \\n\"\n\n                    \"shll   v20.4s, v20.4h, #16     \\n\"\n                    \"shll   v21.4s, v21.4h, #16     \\n\"\n                    \"shll   v22.4s, v22.4h, #16     \\n\"\n                    \"shll   v23.4s, v23.4h, #16     \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v21.4s  \\n\"\n                    \"fmax   v22.4s, v22.4s, v23.4s  \\n\"\n\n                    \"fmax   v0.4s, v0.4s, v16.4s    \\n\"\n                    \"fmax   v1.4s, v2.4s, v18.4s    \\n\"\n                    \"fmax   v2.4s, v4.4s, v20.4s    \\n\"\n                    \"fmax   v3.4s, v6.4s, v22.4s    \\n\"\n\n                    \"shrn   v0.4h, v0.4s, #16       \\n\"\n                    \"shrn   v1.4h, v1.4s, #16       \\n\"\n                    \"shrn   v2.4h, v2.4s, #16       \\n\"\n                    \"shrn   v3.4h, v3.4s, #16       \\n\"\n\n                    \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n 
                   \"vld1.u16   {d4-d7}, [%1]!  \\n\"\n\n                    \"vshll.u16  q0, d4, #16     \\n\"\n                    \"vshll.u16  q1, d5, #16     \\n\"\n                    \"vshll.u16  q2, d6, #16     \\n\"\n                    \"vshll.u16  q3, d7, #16     \\n\"\n\n                    \"vmax.f32   q0, q0, q1      \\n\"\n                    \"vmax.f32   q2, q2, q3      \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.u16   {d12-d15}, [%1]! \\n\"\n\n                    \"vshll.u16  q4, d12, #16    \\n\"\n                    \"vshll.u16  q5, d13, #16    \\n\"\n                    \"vshll.u16  q6, d14, #16    \\n\"\n                    \"vshll.u16  q7, d15, #16    \\n\"\n\n                    \"vmax.f32   q4, q4, q5      \\n\"\n                    \"vmax.f32   q6, q6, q7      \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.u16   {d20-d23}, [%2]! \\n\"\n\n                    \"vshll.u16  q8, d20, #16    \\n\"\n                    \"vshll.u16  q9, d21, #16    \\n\"\n                    \"vshll.u16  q10, d22, #16   \\n\"\n                    \"vshll.u16  q11, d23, #16   \\n\"\n\n                    \"vmax.f32   q8, q8, q9      \\n\"\n                    \"vmax.f32   q10, q10, q11   \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%2]! 
\\n\"\n\n                    \"vshll.u16  q12, d28, #16   \\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmax.f32   q12, q12, q13   \\n\"\n                    \"vmax.f32   q14, q14, q15   \\n\"\n\n                    \"vmax.f32   q0, q0, q8      \\n\"\n                    \"vmax.f32   q1, q2, q10     \\n\"\n                    \"vmax.f32   q2, q4, q12     \\n\"\n                    \"vmax.f32   q3, q6, q14     \\n\"\n\n                    \"vshrn.u32  d0, q0, #16     \\n\"\n                    \"vshrn.u32  d1, q1, #16     \\n\"\n                    \"vshrn.u32  d2, q2, #16     \\n\"\n                    \"vshrn.u32  d3, q3, #16     \\n\"\n\n                    \"vst1.u16   {d0-d3}, [%0]!  \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1)      // %2\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _r00 = bfloat2float(vld1_u16(r0));\n                float32x4_t _r01 = bfloat2float(vld1_u16(r0 + 4));\n                float32x4_t _r10 = bfloat2float(vld1_u16(r1));\n                float32x4_t _r11 = bfloat2float(vld1_u16(r1 + 4));\n\n                float32x4_t _max0 = vmaxq_f32(_r00, _r01);\n                float32x4_t _max1 = vmaxq_f32(_r10, _r11);\n                float32x4_t _max = vmaxq_f32(_max0, _max1);\n\n                vst1_u16(outptr, float2bfloat(_max));\n\n                r0 += 8;\n                r1 += 8;\n                outptr += 4;\n            }\n\n            r0 += tailstep;\n            r1 
+= tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/pooling_3x3.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pooling3x3s2_max_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const float* img0 = bottom_blob.channel(q);\n        float* outptr = top_blob.channel(q);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n        const float* r2 = img0 + w * 2;\n\n        for (int i = 0; i < outh; i++)\n        {\n#if __ARM_NEON\n            int nn = outw >> 2;\n            int remain = outw - (nn << 2);\n#else\n            int remain = outw;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n#if __aarch64__\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"prfm       pldl1keep, [%1, #256]       \\n\"\n                    \"ld2        {v0.4s, v1.4s}, [%1], #32   \\n\"\n                    \"prfm       pldl1keep, [%2, #256]       \\n\"\n                    \"ld2        {v2.4s, v3.4s}, [%2], #32   \\n\"\n                    \"prfm       pldl1keep, [%3, #256]       \\n\"\n                    \"ld2        {v4.4s, v5.4s}, [%3], #32   \\n\"\n                    \"0:                                     \\n\"\n\n                    \"prfm       pldl1keep, [%1, #256]       \\n\"\n                    \"ld2        {v6.4s, v7.4s}, [%1], #32   \\n\"\n\n                    \"fmax       v12.4s, v0.4s, v1.4s        \\n\"\n                    \"fmax       v13.4s, v2.4s, v3.4s        \\n\"\n\n                    \"prfm       pldl1keep, [%2, #256]       \\n\"\n                    \"ld2        {v8.4s, v9.4s}, [%2], #32   \\n\"\n\n                    \"fmax       v14.4s, v4.4s, v5.4s        \\n\"\n                    \"ext        v0.16b, v0.16b, v6.16b, #4  
\\n\"\n\n                    \"prfm       pldl1keep, [%3, #256]       \\n\"\n                    \"ld2        {v10.4s, v11.4s}, [%3], #32 \\n\"\n\n                    \"ext        v2.16b, v2.16b, v8.16b, #4  \\n\"\n\n                    \"fmax       v12.4s, v12.4s, v0.4s       \\n\"\n                    \"ext        v4.16b, v4.16b, v10.16b, #4 \\n\"\n\n                    \"fmax       v13.4s, v13.4s, v2.4s       \\n\"\n                    \"fmax       v14.4s, v14.4s, v4.4s       \\n\"\n                    \"fmax       v12.4s, v12.4s, v13.4s      \\n\"\n\n                    \"orr        v0.16b, v6.16b, v6.16b      \\n\"\n                    \"orr        v1.16b, v7.16b, v7.16b      \\n\"\n                    \"fmax       v12.4s, v12.4s, v14.4s      \\n\"\n\n                    \"orr        v2.16b, v8.16b, v8.16b      \\n\"\n                    \"orr        v3.16b, v9.16b, v9.16b      \\n\"\n                    \"orr        v4.16b, v10.16b, v10.16b    \\n\"\n                    \"orr        v5.16b, v11.16b, v11.16b    \\n\"\n\n                    \"subs       %w0, %w0, #1                \\n\"\n                    \"st1        {v12.4s}, [%4], #16         \\n\"\n                    \"bne        0b                          \\n\"\n                    \"sub        %1, %1, #32                 \\n\"\n                    \"sub        %2, %2, #32                 \\n\"\n                    \"sub        %3, %3, #32                 \\n\"\n                    : \"=r\"(nn),    // %0\n                    \"=r\"(r0),    // %1\n                    \"=r\"(r1),    // %2\n                    \"=r\"(r2),    // %3\n                    \"=r\"(outptr) // %4\n                    : \"0\"(nn),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(outptr)\n                    : \"cc\", \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\", \"v12\", \"v13\", \"v14\");\n     
       }\n#else\n            if (nn > 0)\n            {\n                asm volatile(\n                    \"pld        [%1, #256]          \\n\"\n                    \"vld2.f32   {d0-d3}, [%1]!      \\n\" // q0 = 0 2 4 6  q1 = 1 3 5 7\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld2.f32   {d4-d7}, [%2]!      \\n\"\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld2.f32   {d8-d11}, [%3]!     \\n\"\n                    \"0:                             \\n\"\n                    \"pld        [%1, #256]          \\n\"\n                    \"vld2.f32   {d12-d15}, [%1]!    \\n\" // q6 = 8 10 12 14  q7 = 9 11 13 15\n\n                    \"vmax.f32   q12, q0, q1         \\n\"\n                    \"vmax.f32   q13, q2, q3         \\n\"\n\n                    \"pld        [%2, #256]          \\n\"\n                    \"vld2.f32   {d16-d19}, [%2]!    \\n\"\n\n                    \"vmax.f32   q14, q4, q5         \\n\"\n                    \"vext.32    q0, q0, q6, #1      \\n\"\n\n                    \"pld        [%3, #256]          \\n\"\n                    \"vld2.f32   {d20-d23}, [%3]!    
\\n\"\n\n                    \"vext.32    q2, q2, q8, #1      \\n\"\n\n                    \"vmax.f32   q12, q12, q0        \\n\"\n                    \"vext.32    q4, q4, q10, #1     \\n\"\n\n                    \"vmax.f32   q13, q13, q2        \\n\"\n                    \"vmax.f32   q14, q14, q4        \\n\"\n                    \"vmax.f32   q12, q12, q13       \\n\"\n\n                    \"vorr       q0, q6, q6          \\n\"\n                    \"vorr       q1, q7, q7          \\n\"\n                    \"vmax.f32   q12, q12, q14       \\n\"\n\n                    \"vorr       q2, q8, q8          \\n\"\n                    \"vorr       q3, q9, q9          \\n\"\n                    \"vorr       q4, q10, q10        \\n\"\n                    \"vorr       q5, q11, q11        \\n\"\n\n                    \"subs       %0, #1              \\n\"\n                    \"vst1.f32   {d24-d25}, [%4]!    \\n\"\n                    \"bne        0b                  \\n\"\n                    \"sub        %1, #32             \\n\"\n                    \"sub        %2, #32             \\n\"\n                    \"sub        %3, #32             \\n\"\n                    : \"=r\"(nn),    // %0\n                    \"=r\"(r0),    // %1\n                    \"=r\"(r1),    // %2\n                    \"=r\"(r2),    // %3\n                    \"=r\"(outptr) // %4\n                    : \"0\"(nn),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2),\n                    \"4\"(outptr)\n                    : \"cc\", \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\");\n            }\n#endif // __aarch64__\n#endif // __ARM_NEON\n            for (; remain > 0; remain--)\n            {\n                float max0 = std::max(std::max(r0[0], r0[1]), r0[2]);\n                float max1 = std::max(std::max(r1[0], r1[1]), r1[2]);\n                float max2 = 
std::max(std::max(r2[0], r2[1]), r2[2]);\n\n                *outptr = std::max(std::max(max0, max1), max2);\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n                outptr++;\n            }\n\n            // stride 2: skip the rest of this row plus the whole next row\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/pooling_3x3_pack4.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pooling3x3s2_max_pack4_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        float* outptr = top_blob.channel(q);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]   \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n\n                    \"fmax   v16.4s, v0.4s, v1.4s    \\n\"\n                    \"fmax   v17.4s, v2.4s, v3.4s    \\n\"\n\n                    \"prfm   pldl1keep, [%1, #512]   \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%1], #64 \\n\"\n\n                    \"fmax   v18.4s, v4.4s, v5.4s    \\n\"\n                    \"fmax   v19.4s, v6.4s, v7.4s    \\n\"\n\n                    \"ld1    {v8.4s}, [%1]           \\n\"\n\n                    \"fmax   v20.4s, v16.4s, v2.4s   \\n\"\n                    \"fmax   v21.4s, v17.4s, v4.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]   \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%2], #64 \\n\"\n\n                    \"fmax   v22.4s, v18.4s, v6.4s   \\n\"\n                    \"fmax   v23.4s, v19.4s, v8.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]   \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n\n                    \"fmax   
v16.4s, v0.4s, v1.4s    \\n\"\n                    \"fmax   v17.4s, v2.4s, v3.4s    \\n\"\n\n                    \"fmax   v18.4s, v4.4s, v5.4s    \\n\"\n                    \"fmax   v19.4s, v6.4s, v7.4s    \\n\"\n\n                    \"ld1    {v8.4s}, [%2]           \\n\"\n\n                    \"fmax   v24.4s, v16.4s, v2.4s   \\n\"\n                    \"fmax   v25.4s, v17.4s, v4.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]   \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%3], #64 \\n\"\n\n                    \"fmax   v26.4s, v18.4s, v6.4s   \\n\"\n                    \"fmax   v27.4s, v19.4s, v8.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%3, #512]   \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%3], #64 \\n\"\n\n                    \"fmax   v16.4s, v0.4s, v1.4s    \\n\"\n                    \"fmax   v17.4s, v2.4s, v3.4s    \\n\"\n\n                    \"fmax   v18.4s, v4.4s, v5.4s    \\n\"\n                    \"fmax   v19.4s, v6.4s, v7.4s    \\n\"\n\n                    \"ld1    {v8.4s}, [%3]           \\n\"\n\n                    \"fmax   v28.4s, v16.4s, v2.4s   \\n\"\n                    \"fmax   v29.4s, v17.4s, v4.4s   \\n\"\n                    \"fmax   v30.4s, v18.4s, v6.4s   \\n\"\n                    \"fmax   v31.4s, v19.4s, v8.4s   \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v24.4s  \\n\"\n                    \"fmax   v21.4s, v21.4s, v25.4s  \\n\"\n                    \"fmax   v22.4s, v22.4s, v26.4s  \\n\"\n                    \"fmax   v23.4s, v23.4s, v27.4s  \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v28.4s  \\n\"\n                    \"fmax   v21.4s, v21.4s, v29.4s  \\n\"\n                    \"fmax   v22.4s, v22.4s, v30.4s  \\n\"\n                    \"fmax   v23.4s, v23.4s, v31.4s  \\n\"\n\n                    \"st1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%0], #64 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n 
                   \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #512]      \\n\"\n                    \"vldm       %1!, {d0-d7}    \\n\"\n\n                    \"pld        [%2, #512]      \\n\"\n                    \"vldm       %2!, {d8-d15}   \\n\"\n\n                    \"vmax.f32   q0, q0, q4      \\n\"\n                    \"vmax.f32   q1, q1, q5      \\n\"\n\n                    \"pld        [%3, #512]      \\n\"\n                    \"vldm       %3!, {d16-d23}  \\n\"\n\n                    \"vmax.f32   q2, q2, q6      \\n\"\n                    \"vmax.f32   q3, q3, q7      \\n\"\n\n                    \"vmax.f32   q0, q0, q8      \\n\"\n                    \"vmax.f32   q1, q1, q9      \\n\"\n\n                    \"pld        [%1, #512]      \\n\"\n                    \"vldm       %1!, {d8-d15}   \\n\"\n\n                    \"vmax.f32   q2, q2, q10     \\n\"\n                    \"vmax.f32   q3, q3, q11     \\n\"\n\n                    \"pld        [%2, #512]      \\n\"\n                    \"vldm       %2!, {d16-d23}  \\n\"\n\n                    \"vmax.f32   q4, q4, q8      \\n\"\n                    \"vmax.f32   q5, q5, q9      \\n\"\n\n                    \"pld        [%3, #512]      \\n\"\n                    \"vldm       %3!, {d24-d31}  \\n\"\n\n                    \"vmax.f32   q6, q6, q10     \\n\"\n                    \"vmax.f32   q7, q7, q11     \\n\"\n\n                    \"vmax.f32   q4, q4, q12     \\n\"\n                    \"vmax.f32   q5, q5, q13     
\\n\"\n\n                    \"vld1.f32   {d24-d25}, [%1 :128] \\n\"\n                    \"vld1.f32   {d26-d27}, [%2 :128] \\n\"\n\n                    \"vmax.f32   q6, q6, q14     \\n\"\n                    \"vmax.f32   q7, q7, q15     \\n\"\n\n                    \"vld1.f32   {d28-d29}, [%3 :128] \\n\"\n\n                    \"vmax.f32   q8, q12, q13    \\n\"\n                    \"vmax.f32   q8, q8, q14     \\n\"\n\n                    \"vmax.f32   q12, q0, q1     \\n\"\n                    \"vmax.f32   q13, q2, q3     \\n\"\n                    \"vmax.f32   q14, q4, q5     \\n\"\n                    \"vmax.f32   q15, q6, q7     \\n\"\n\n                    \"vmax.f32   q12, q12, q2    \\n\"\n                    \"vmax.f32   q13, q13, q4    \\n\"\n                    \"vmax.f32   q14, q14, q6    \\n\"\n                    \"vmax.f32   q15, q15, q8    \\n\"\n\n                    \"vstm       %0!, {d24-d31}  \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #512]   \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%1], #64 \\n\"\n\n                    \"prfm   pldl1keep, [%2, #512]   \\n\"\n                    \"ld1    {v4.4s, v5.4s, v6.4s, v7.4s}, [%2], #64 \\n\"\n\n                    \"fmax   v16.4s, v0.4s, v4.4s    \\n\"\n                    \"fmax   v17.4s, v1.4s, v5.4s    \\n\"\n\n                    \"prfm   pldl1keep, [%3, 
#512]   \\n\"\n                    \"ld1    {v20.4s, v21.4s, v22.4s, v23.4s}, [%3], #64 \\n\"\n\n                    \"fmax   v18.4s, v2.4s, v6.4s    \\n\"\n                    \"fmax   v19.4s, v3.4s, v7.4s    \\n\"\n\n                    \"ld1    {v0.4s}, [%1]           \\n\"\n\n                    \"fmax   v16.4s, v16.4s, v20.4s  \\n\"\n                    \"fmax   v17.4s, v17.4s, v21.4s  \\n\"\n\n                    \"ld1    {v1.4s}, [%2]           \\n\"\n\n                    \"fmax   v18.4s, v18.4s, v22.4s  \\n\"\n                    \"fmax   v19.4s, v19.4s, v23.4s  \\n\"\n\n                    \"ld1    {v2.4s}, [%3]           \\n\"\n\n                    \"fmax   v3.4s, v0.4s, v1.4s     \\n\"\n\n                    \"fmax   v20.4s, v16.4s, v17.4s  \\n\"\n                    \"fmax   v21.4s, v18.4s, v19.4s  \\n\"\n\n                    \"fmax   v3.4s, v3.4s, v2.4s     \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v18.4s  \\n\"\n                    \"fmax   v21.4s, v21.4s, v3.4s   \\n\"\n\n                    \"st1    {v20.4s, v21.4s}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #512]      \\n\"\n                    \"vldm       %1!, {d0-d7}    \\n\"\n\n                    \"pld        [%2, #512]      \\n\"\n                    \"vldm       %2!, {d8-d15}   \\n\"\n\n                    \"vmax.f32   q12, q0, q4     \\n\"\n                    \"vmax.f32   q13, q1, q5     \\n\"\n\n                    \"pld        [%3, 
#512]      \\n\"\n                    \"vldm       %3!, {d16-d23}  \\n\"\n\n                    \"vmax.f32   q14, q2, q6     \\n\"\n                    \"vmax.f32   q15, q3, q7     \\n\"\n\n                    \"vld1.f32   {d0-d1}, [%1 :128] \\n\"\n\n                    \"vmax.f32   q12, q12, q8    \\n\"\n                    \"vmax.f32   q13, q13, q9    \\n\"\n\n                    \"vld1.f32   {d2-d3}, [%2 :128] \\n\"\n\n                    \"vmax.f32   q14, q14, q10   \\n\"\n                    \"vmax.f32   q15, q15, q11   \\n\"\n\n                    \"vld1.f32   {d4-d5}, [%3 :128] \\n\"\n\n                    \"vmax.f32   q3, q0, q1      \\n\"\n\n                    \"vmax.f32   q4, q12, q13    \\n\"\n                    \"vmax.f32   q5, q14, q15    \\n\"\n\n                    \"vmax.f32   q3, q3, q2      \\n\"\n\n                    \"vmax.f32   q4, q4, q14     \\n\"\n                    \"vmax.f32   q5, q5, q3      \\n\"\n\n                    \"vst1.f32   {d8-d11}, [%0 :128]! \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _r00 = vld1q_f32(r0);\n                float32x4_t _r01 = vld1q_f32(r0 + 4);\n                float32x4_t _r02 = vld1q_f32(r0 + 8);\n                float32x4_t _r10 = vld1q_f32(r1);\n                float32x4_t _r11 = vld1q_f32(r1 + 4);\n                float32x4_t _r12 = vld1q_f32(r1 + 8);\n                float32x4_t _r20 = vld1q_f32(r2);\n                float32x4_t _r21 = vld1q_f32(r2 + 
4);\n                float32x4_t _r22 = vld1q_f32(r2 + 8);\n\n                float32x4_t _max0 = vmaxq_f32(vmaxq_f32(_r00, _r01), _r02);\n                float32x4_t _max1 = vmaxq_f32(vmaxq_f32(_r10, _r11), _r12);\n                float32x4_t _max2 = vmaxq_f32(vmaxq_f32(_r20, _r21), _r22);\n\n                float32x4_t _max = vmaxq_f32(vmaxq_f32(_max0, _max1), _max2);\n\n                vst1q_f32(outptr, _max);\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                outptr += 4;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/pooling_3x3_pack4_bf16s.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void pooling3x3s2_max_pack4_bf16s_neon(const Mat& bottom_blob, Mat& top_blob, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        unsigned short* outptr = top_blob.channel(q);\n\n        const unsigned short* r0 = img0.row<const unsigned short>(0);\n        const unsigned short* r1 = img0.row<const unsigned short>(1);\n        const unsigned short* r2 = img0.row<const unsigned short>(2);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n\n            for (; j + 3 < outw; j += 4)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]   \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n                    \"shll   v3.4s, v3.4h, #16       \\n\"\n\n                    \"fmax   v16.4s, v0.4s, v1.4s    \\n\"\n                    \"fmax   v17.4s, v2.4s, v3.4s    \\n\"\n\n                    \"prfm   pldl1keep, [%1, #256]   \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%1], #32 \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16       \\n\"\n                    \"shll   v5.4s, v5.4h, #16       \\n\"\n                    \"shll   v6.4s, v6.4h, #16       \\n\"\n                    \"shll   v7.4s, v7.4h, #16       \\n\"\n\n                    \"fmax   v18.4s, v4.4s, v5.4s    \\n\"\n                    \"fmax   v19.4s, v6.4s, v7.4s    \\n\"\n\n                    
\"ld1    {v8.4h}, [%1]           \\n\"\n                    \"shll   v8.4s, v8.4h, #16       \\n\"\n\n                    \"fmax   v20.4s, v16.4s, v2.4s   \\n\"\n                    \"fmax   v21.4s, v17.4s, v4.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]   \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%2], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n                    \"shll   v3.4s, v3.4h, #16       \\n\"\n\n                    \"fmax   v22.4s, v18.4s, v6.4s   \\n\"\n                    \"fmax   v23.4s, v19.4s, v8.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]   \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%2], #32 \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16       \\n\"\n                    \"shll   v5.4s, v5.4h, #16       \\n\"\n                    \"shll   v6.4s, v6.4h, #16       \\n\"\n                    \"shll   v7.4s, v7.4h, #16       \\n\"\n\n                    \"fmax   v16.4s, v0.4s, v1.4s    \\n\"\n                    \"fmax   v17.4s, v2.4s, v3.4s    \\n\"\n\n                    \"fmax   v18.4s, v4.4s, v5.4s    \\n\"\n                    \"fmax   v19.4s, v6.4s, v7.4s    \\n\"\n\n                    \"ld1    {v8.4h}, [%2]           \\n\"\n                    \"shll   v8.4s, v8.4h, #16       \\n\"\n\n                    \"fmax   v24.4s, v16.4s, v2.4s   \\n\"\n                    \"fmax   v25.4s, v17.4s, v4.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]   \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%3], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n                    \"shll   v3.4s, v3.4h, #16       \\n\"\n\n                    
\"fmax   v26.4s, v18.4s, v6.4s   \\n\"\n                    \"fmax   v27.4s, v19.4s, v8.4s   \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]   \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%3], #32 \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16       \\n\"\n                    \"shll   v5.4s, v5.4h, #16       \\n\"\n                    \"shll   v6.4s, v6.4h, #16       \\n\"\n                    \"shll   v7.4s, v7.4h, #16       \\n\"\n\n                    \"fmax   v16.4s, v0.4s, v1.4s    \\n\"\n                    \"fmax   v17.4s, v2.4s, v3.4s    \\n\"\n\n                    \"fmax   v18.4s, v4.4s, v5.4s    \\n\"\n                    \"fmax   v19.4s, v6.4s, v7.4s    \\n\"\n\n                    \"ld1    {v8.4h}, [%3]           \\n\"\n                    \"shll   v8.4s, v8.4h, #16       \\n\"\n\n                    \"fmax   v28.4s, v16.4s, v2.4s   \\n\"\n                    \"fmax   v29.4s, v17.4s, v4.4s   \\n\"\n                    \"fmax   v30.4s, v18.4s, v6.4s   \\n\"\n                    \"fmax   v31.4s, v19.4s, v8.4s   \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v24.4s  \\n\"\n                    \"fmax   v21.4s, v21.4s, v25.4s  \\n\"\n                    \"fmax   v22.4s, v22.4s, v26.4s  \\n\"\n                    \"fmax   v23.4s, v23.4s, v27.4s  \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v28.4s  \\n\"\n                    \"fmax   v21.4s, v21.4s, v29.4s  \\n\"\n                    \"fmax   v22.4s, v22.4s, v30.4s  \\n\"\n                    \"fmax   v23.4s, v23.4s, v31.4s  \\n\"\n\n                    \"shrn   v20.4h, v20.4s, #16     \\n\"\n                    \"shrn   v21.4h, v21.4s, #16     \\n\"\n                    \"shrn   v22.4h, v22.4s, #16     \\n\"\n                    \"shrn   v23.4h, v23.4s, #16     \\n\"\n\n                    \"st1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%0], #32 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n              
      \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\", \"v24\", \"v25\", \"v26\", \"v27\", \"v28\", \"v29\", \"v30\", \"v31\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.u16   {d4-d7}, [%1]!  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.u16   {d12-d15}, [%2]! \\n\"\n\n                    \"vshll.u16  q0, d4, #16     \\n\"\n                    \"vshll.u16  q1, d5, #16     \\n\"\n                    \"vshll.u16  q2, d6, #16     \\n\"\n                    \"vshll.u16  q3, d7, #16     \\n\"\n\n                    \"vshll.u16  q4, d12, #16    \\n\"\n                    \"vshll.u16  q5, d13, #16    \\n\"\n                    \"vshll.u16  q6, d14, #16    \\n\"\n                    \"vshll.u16  q7, d15, #16    \\n\"\n\n                    \"vmax.f32   q0, q0, q4      \\n\"\n                    \"vmax.f32   q1, q1, q5      \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.u16   {d20-d23}, [%3]! \\n\"\n\n                    \"vshll.u16  q8, d20, #16    \\n\"\n                    \"vshll.u16  q9, d21, #16    \\n\"\n                    \"vshll.u16  q10, d22, #16   \\n\"\n                    \"vshll.u16  q11, d23, #16   \\n\"\n\n                    \"vmax.f32   q2, q2, q6      \\n\"\n                    \"vmax.f32   q3, q3, q7      \\n\"\n\n                    \"vmax.f32   q0, q0, q8      \\n\"\n                    \"vmax.f32   q1, q1, q9      \\n\"\n\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.u16   {d12-d15}, [%1]! 
\\n\"\n\n                    \"vshll.u16  q4, d12, #16    \\n\"\n                    \"vshll.u16  q5, d13, #16    \\n\"\n                    \"vshll.u16  q6, d14, #16    \\n\"\n                    \"vshll.u16  q7, d15, #16    \\n\"\n\n                    \"vmax.f32   q2, q2, q10     \\n\"\n                    \"vmax.f32   q3, q3, q11     \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.u16   {d20-d23}, [%2]! \\n\"\n\n                    \"vshll.u16  q8, d20, #16    \\n\"\n                    \"vshll.u16  q9, d21, #16    \\n\"\n                    \"vshll.u16  q10, d22, #16   \\n\"\n                    \"vshll.u16  q11, d23, #16   \\n\"\n\n                    \"vmax.f32   q4, q4, q8      \\n\"\n                    \"vmax.f32   q5, q5, q9      \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.u16   {d28-d31}, [%3]! \\n\"\n\n                    \"vshll.u16  q12, d28, #16   \\n\"\n                    \"vshll.u16  q13, d29, #16   \\n\"\n                    \"vshll.u16  q14, d30, #16   \\n\"\n                    \"vshll.u16  q15, d31, #16   \\n\"\n\n                    \"vmax.f32   q6, q6, q10     \\n\"\n                    \"vmax.f32   q7, q7, q11     \\n\"\n\n                    \"vmax.f32   q4, q4, q12     \\n\"\n                    \"vmax.f32   q5, q5, q13     \\n\"\n\n                    \"vld1.u16   {d25}, [%1]     \\n\"\n                    \"vld1.u16   {d27}, [%2]     \\n\"\n                    \"vshll.u16  q12, d25, #16   \\n\"\n                    \"vshll.u16  q13, d27, #16   \\n\"\n\n                    \"vmax.f32   q6, q6, q14     \\n\"\n                    \"vmax.f32   q7, q7, q15     \\n\"\n\n                    \"vld1.u16   {d29}, [%3]     \\n\"\n                    \"vshll.u16  q14, d29, #16   \\n\"\n\n                    \"vmax.f32   q8, q12, q13    \\n\"\n                    \"vmax.f32   q8, q8, q14     \\n\"\n\n                    \"vmax.f32   q12, q0, q1     \\n\"\n   
                 \"vmax.f32   q13, q2, q3     \\n\"\n                    \"vmax.f32   q14, q4, q5     \\n\"\n                    \"vmax.f32   q15, q6, q7     \\n\"\n\n                    \"vmax.f32   q12, q12, q2    \\n\"\n                    \"vmax.f32   q13, q13, q4    \\n\"\n                    \"vmax.f32   q14, q14, q6    \\n\"\n                    \"vmax.f32   q15, q15, q8    \\n\"\n\n                    \"vshrn.u32  d24, q12, #16   \\n\"\n                    \"vshrn.u32  d25, q13, #16   \\n\"\n                    \"vshrn.u32  d26, q14, #16   \\n\"\n                    \"vshrn.u32  d27, q15, #16   \\n\"\n\n                    \"vst1.u16   {d24-d27}, [%0]! \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%1, #256]   \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%1], #32 \\n\"\n\n                    \"prfm   pldl1keep, [%2, #256]   \\n\"\n                    \"ld1    {v4.4h, v5.4h, v6.4h, v7.4h}, [%2], #32 \\n\"\n\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n                    \"shll   v3.4s, v3.4h, #16       \\n\"\n\n                    \"shll   v4.4s, v4.4h, #16       \\n\"\n                    \"shll   v5.4s, v5.4h, #16       \\n\"\n                    \"shll   v6.4s, v6.4h, #16       \\n\"\n    
                \"shll   v7.4s, v7.4h, #16       \\n\"\n\n                    \"fmax   v16.4s, v0.4s, v4.4s    \\n\"\n                    \"fmax   v17.4s, v1.4s, v5.4s    \\n\"\n\n                    \"prfm   pldl1keep, [%3, #256]   \\n\"\n                    \"ld1    {v20.4h, v21.4h, v22.4h, v23.4h}, [%3], #32 \\n\"\n\n                    \"shll   v20.4s, v20.4h, #16     \\n\"\n                    \"shll   v21.4s, v21.4h, #16     \\n\"\n                    \"shll   v22.4s, v22.4h, #16     \\n\"\n                    \"shll   v23.4s, v23.4h, #16     \\n\"\n\n                    \"fmax   v18.4s, v2.4s, v6.4s    \\n\"\n                    \"fmax   v19.4s, v3.4s, v7.4s    \\n\"\n\n                    \"ld1    {v0.4h}, [%1]           \\n\"\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n\n                    \"fmax   v16.4s, v16.4s, v20.4s  \\n\"\n                    \"fmax   v17.4s, v17.4s, v21.4s  \\n\"\n\n                    \"ld1    {v1.4h}, [%2]           \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n\n                    \"fmax   v18.4s, v18.4s, v22.4s  \\n\"\n                    \"fmax   v19.4s, v19.4s, v23.4s  \\n\"\n\n                    \"ld1    {v2.4h}, [%3]           \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n\n                    \"fmax   v3.4s, v0.4s, v1.4s     \\n\"\n\n                    \"fmax   v20.4s, v16.4s, v17.4s  \\n\"\n                    \"fmax   v21.4s, v18.4s, v19.4s  \\n\"\n\n                    \"fmax   v3.4s, v3.4s, v2.4s     \\n\"\n\n                    \"fmax   v20.4s, v20.4s, v18.4s  \\n\"\n                    \"fmax   v21.4s, v21.4s, v3.4s   \\n\"\n\n                    \"shrn   v20.4h, v20.4s, #16     \\n\"\n                    \"shrn   v21.4h, v21.4s, #16     \\n\"\n\n                    \"st1    {v20.4h, v21.4h}, [%0], #16 \\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    
: \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v16\", \"v17\", \"v18\", \"v19\", \"v20\", \"v21\", \"v22\", \"v23\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%1, #256]      \\n\"\n                    \"vld1.u16   {d4-d7}, [%1]!  \\n\"\n\n                    \"pld        [%2, #256]      \\n\"\n                    \"vld1.u16   {d12-d15}, [%2]! \\n\"\n\n                    \"vshll.u16  q0, d4, #16     \\n\"\n                    \"vshll.u16  q1, d5, #16     \\n\"\n                    \"vshll.u16  q2, d6, #16     \\n\"\n                    \"vshll.u16  q3, d7, #16     \\n\"\n\n                    \"vshll.u16  q4, d12, #16    \\n\"\n                    \"vshll.u16  q5, d13, #16    \\n\"\n                    \"vshll.u16  q6, d14, #16    \\n\"\n                    \"vshll.u16  q7, d15, #16    \\n\"\n\n                    \"vmax.f32   q12, q0, q4     \\n\"\n                    \"vmax.f32   q13, q1, q5     \\n\"\n\n                    \"pld        [%3, #256]      \\n\"\n                    \"vld1.u16   {d20-d23}, [%3]! 
\\n\"\n\n                    \"vshll.u16  q8, d20, #16    \\n\"\n                    \"vshll.u16  q9, d21, #16    \\n\"\n                    \"vshll.u16  q10, d22, #16   \\n\"\n                    \"vshll.u16  q11, d23, #16   \\n\"\n\n                    \"vmax.f32   q14, q2, q6     \\n\"\n                    \"vmax.f32   q15, q3, q7     \\n\"\n\n                    \"vld1.u16   {d1}, [%1]      \\n\"\n                    \"vshll.u16  q0, d1, #16     \\n\"\n\n                    \"vmax.f32   q12, q12, q8    \\n\"\n                    \"vmax.f32   q13, q13, q9    \\n\"\n\n                    \"vld1.u16   {d3}, [%2]      \\n\"\n                    \"vshll.u16  q1, d3, #16     \\n\"\n\n                    \"vmax.f32   q14, q14, q10   \\n\"\n                    \"vmax.f32   q15, q15, q11   \\n\"\n\n                    \"vld1.u16   {d5}, [%3]      \\n\"\n                    \"vshll.u16  q2, d5, #16     \\n\"\n\n                    \"vmax.f32   q3, q0, q1      \\n\"\n\n                    \"vmax.f32   q4, q12, q13    \\n\"\n                    \"vmax.f32   q5, q14, q15    \\n\"\n\n                    \"vmax.f32   q3, q3, q2      \\n\"\n\n                    \"vmax.f32   q4, q4, q14     \\n\"\n                    \"vmax.f32   q5, q5, q3      \\n\"\n\n                    \"vshrn.u32  d8, q4, #16     \\n\"\n                    \"vshrn.u32  d9, q5, #16     \\n\"\n\n                    \"vst1.u16   {d8-d9}, [%0]!  
\\n\"\n\n                    : \"=r\"(outptr), // %0\n                    \"=r\"(r0),     // %1\n                    \"=r\"(r1),     // %2\n                    \"=r\"(r2)      // %3\n                    : \"0\"(outptr),\n                    \"1\"(r0),\n                    \"2\"(r1),\n                    \"3\"(r2)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\", \"q12\", \"q13\", \"q14\", \"q15\");\n#endif // __aarch64__\n            }\n            for (; j < outw; j++)\n            {\n                float32x4_t _r00 = bfloat2float(vld1_u16(r0));\n                float32x4_t _r01 = bfloat2float(vld1_u16(r0 + 4));\n                float32x4_t _r02 = bfloat2float(vld1_u16(r0 + 8));\n                float32x4_t _r10 = bfloat2float(vld1_u16(r1));\n                float32x4_t _r11 = bfloat2float(vld1_u16(r1 + 4));\n                float32x4_t _r12 = bfloat2float(vld1_u16(r1 + 8));\n                float32x4_t _r20 = bfloat2float(vld1_u16(r2));\n                float32x4_t _r21 = bfloat2float(vld1_u16(r2 + 4));\n                float32x4_t _r22 = bfloat2float(vld1_u16(r2 + 8));\n\n                float32x4_t _max0 = vmaxq_f32(vmaxq_f32(_r00, _r01), _r02);\n                float32x4_t _max1 = vmaxq_f32(vmaxq_f32(_r10, _r11), _r12);\n                float32x4_t _max2 = vmaxq_f32(vmaxq_f32(_r20, _r21), _r22);\n\n                float32x4_t _max = vmaxq_f32(vmaxq_f32(_max0, _max1), _max2);\n\n                vst1_u16(outptr, float2bfloat(_max));\n\n                r0 += 8;\n                r1 += 8;\n                r2 += 8;\n                outptr += 4;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/pooling_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"pooling_arm.h\"\n\n#include <float.h>\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#if NCNN_GNU_INLINE_ASM\n#include \"pooling_2x2.h\"\n#include \"pooling_3x3.h\"\n\n#if __ARM_NEON\n#include \"pooling_2x2_pack4.h\"\n#include \"pooling_3x3_pack4.h\"\n#include \"pooling_2x2_pack4_bf16s.h\"\n#include \"pooling_3x3_pack4_bf16s.h\"\n#endif\n#endif // NCNN_GNU_INLINE_ASM\n\nPooling_arm::Pooling_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Pooling_arm::create_pipeline(const Option& /*opt*/)\n{\n    if (adaptive_pooling)\n    {\n        support_packing = false;\n\n        support_bf16_storage = false;\n        support_fp16_storage = false;\n        support_int8_storage = false;\n        support_tensor_storage = false;\n    }\n    return 0;\n}\n\nint Pooling_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (adaptive_pooling)\n    {\n        return Pooling::forward(bottom_blob, top_blob, opt);\n    }\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    // max value in NxN window\n    // avg value in NxN window\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int 
elempack = bottom_blob.elempack;\n\n#if __ARM_NEON\n    //     NCNN_LOGE(\"Pooling     input %d x %d  pad = %d %d %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_left, pad_right, pad_top, pad_bottom, kernel_w, kernel_h, stride_w, stride_h);\n\n    if (elempack == 4)\n    {\n        if (global_pooling)\n        {\n            top_blob.create(channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            int size = w * h;\n\n            if (pooling_type == PoolMethod_MAX)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob.channel(q);\n\n                    float32x4_t _max = vld1q_f32(ptr);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float32x4_t _val = vld1q_f32(ptr);\n                        _max = vmaxq_f32(_max, _val);\n                        ptr += 4;\n                    }\n\n                    float* outptr = top_blob;\n                    vst1q_f32(outptr + q * 4, _max);\n                }\n            }\n            else if (pooling_type == PoolMethod_AVE)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob.channel(q);\n\n                    float32x4_t _sum = vdupq_n_f32(0.f);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float32x4_t _val = vld1q_f32(ptr);\n                        _sum = vaddq_f32(_sum, _val);\n                        ptr += 4;\n                    }\n\n                    float32x4_t _inv_size = vdupq_n_f32(1.f / size);\n                    float32x4_t _avg = vmulq_f32(_sum, _inv_size);\n\n                    float* outptr = top_blob;\n                    
vst1q_f32(outptr + q * 4, _avg);\n                }\n            }\n\n            return 0;\n        }\n\n        Mat bottom_blob_bordered;\n        make_padding(bottom_blob, bottom_blob_bordered, opt);\n        if (bottom_blob_bordered.empty())\n            return -100;\n\n        w = bottom_blob_bordered.w;\n        h = bottom_blob_bordered.h;\n\n        int outw = (w - kernel_w) / stride_w + 1;\n        int outh = (h - kernel_h) / stride_h + 1;\n\n        top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int maxk = kernel_w * kernel_h;\n\n        // kernel offsets\n        std::vector<int> _space_ofs(maxk);\n        int* space_ofs = &_space_ofs[0];\n        {\n            int p1 = 0;\n            int p2 = 0;\n            int gap = w - kernel_w;\n            for (int i = 0; i < kernel_h; i++)\n            {\n                for (int j = 0; j < kernel_w; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2++;\n                }\n                p2 += gap;\n            }\n        }\n\n        if (pooling_type == PoolMethod_MAX)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 2 && kernel_h == 2 && stride_w == 2 && stride_h == 2)\n            {\n                pooling2x2s2_max_pack4_neon(bottom_blob_bordered, top_blob, opt);\n\n                return 0;\n            }\n\n            if (kernel_w == 3 && kernel_h == 3 && stride_w == 2 && stride_h == 2)\n            {\n                pooling3x3s2_max_pack4_neon(bottom_blob_bordered, top_blob, opt);\n\n                return 0;\n            }\n#endif // NCNN_GNU_INLINE_ASM\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat m = bottom_blob_bordered.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                
for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                        float32x4_t _max = vld1q_f32(sptr);\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float32x4_t _val = vld1q_f32(sptr + space_ofs[k] * 4);\n                            _max = vmaxq_f32(_max, _val);\n                        }\n\n                        vst1q_f32(outptr + j * 4, _max);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n        else if (pooling_type == PoolMethod_AVE)\n        {\n            if (avgpool_count_include_pad == 0)\n            {\n                int wtailpad = 0;\n                int htailpad = 0;\n\n                if (pad_mode == 0) // full padding\n                {\n                    wtailpad = bottom_blob_bordered.w - bottom_blob.w - pad_left - pad_right;\n                    htailpad = bottom_blob_bordered.h - bottom_blob.h - pad_top - pad_bottom;\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                        
        if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    float32x4_t _val = vld1q_f32(m.row(sy) + sx * 4);\n                                    _sum = vaddq_f32(_sum, _val);\n                                    area += 1;\n                                }\n                            }\n\n                            float32x4_t _inv_area = vdupq_n_f32(1.f / area);\n                            float32x4_t _avg = vmulq_f32(_sum, _inv_area);\n                            vst1q_f32(outptr + j * 4, _avg);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n            else // if (avgpool_count_include_pad == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    float32x4_t _inv_maxk = vdupq_n_f32(1.f / maxk);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n\n                            for (int k = 0; k < maxk; k++)\n    
                        {\n                                float32x4_t _val = vld1q_f32(sptr + space_ofs[k] * 4);\n                                _sum = vaddq_f32(_sum, _val);\n                            }\n\n                            float32x4_t _avg = vmulq_f32(_sum, _inv_maxk);\n                            vst1q_f32(outptr + j * 4, _avg);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n#if NCNN_GNU_INLINE_ASM\n    if (kernel_w != kernel_h || stride_w != stride_h)\n    {\n        return Pooling::forward(bottom_blob, top_blob, opt);\n    }\n\n    const int kernel_size = kernel_w;\n    const int stride = stride_w;\n\n    if (pooling_type != PoolMethod_MAX || stride != 2 || global_pooling == 1)\n    {\n        return Pooling::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (kernel_size != 2 && kernel_size != 3)\n    {\n        return Pooling::forward(bottom_blob, top_blob, opt);\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_w) / stride_w + 1;\n    int outh = (h - kernel_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, channels, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (kernel_size == 2)\n        pooling2x2s2_max_neon(bottom_blob_bordered, top_blob, opt);\n    if (kernel_size == 3)\n        pooling3x3s2_max_neon(bottom_blob_bordered, top_blob, opt);\n\n    return 0;\n#else  // NCNN_GNU_INLINE_ASM\n    return Pooling::forward(bottom_blob, top_blob, opt);\n#endif // NCNN_GNU_INLINE_ASM\n}\n\n#if NCNN_BF16\nint Pooling_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // max value in NxN window\n    // avg value in 
NxN window\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Pooling     input %d x %d  pad = %d %d %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_left, pad_right, pad_top, pad_bottom, kernel_w, kernel_h, stride_w, stride_h);\n\n    if (global_pooling)\n    {\n        top_blob.create(channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int size = w * h;\n\n        if (pooling_type == PoolMethod_MAX)\n        {\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob.channel(q);\n\n                    float32x4_t _max = vdupq_n_f32(-FLT_MAX);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float32x4_t _val = bfloat2float(vld1_u16(ptr));\n                        _max = vmaxq_f32(_max, _val);\n                        ptr += 4;\n                    }\n\n                    unsigned short* outptr = top_blob;\n                    vst1_u16(outptr + q * 4, float2bfloat(_max));\n                }\n            }\n#endif // __ARM_NEON\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob.channel(q);\n\n                    float max = -FLT_MAX;\n                    for (int i = 0; i < size; i++)\n                    {\n                        max = std::max(max, bfloat16_to_float32(ptr[i]));\n                    }\n\n                    unsigned short* outptr = top_blob;\n                    outptr[q] = 
float32_to_bfloat16(max);\n                }\n            }\n        }\n\n        if (pooling_type == PoolMethod_AVE)\n        {\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob.channel(q);\n\n                    float32x4_t _sum = vdupq_n_f32(0.f);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float32x4_t _val = bfloat2float(vld1_u16(ptr));\n                        _sum = vaddq_f32(_sum, _val);\n                        ptr += 4;\n                    }\n\n                    float32x4_t _inv_size = vdupq_n_f32(1.f / size);\n                    float32x4_t _avg = vmulq_f32(_sum, _inv_size);\n\n                    unsigned short* outptr = top_blob;\n                    vst1_u16(outptr + q * 4, float2bfloat(_avg));\n                }\n            }\n#endif // __ARM_NEON\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const unsigned short* ptr = bottom_blob.channel(q);\n\n                    float sum = 0.f;\n                    for (int i = 0; i < size; i++)\n                    {\n                        sum += bfloat16_to_float32(ptr[i]);\n                    }\n\n                    unsigned short* outptr = top_blob;\n                    outptr[q] = float32_to_bfloat16(sum / size);\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_w) / stride_w + 1;\n    int outh = (h - 
kernel_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w - kernel_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2++;\n            }\n            p2 += gap;\n        }\n    }\n\n    if (pooling_type == PoolMethod_MAX)\n    {\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n#if NCNN_GNU_INLINE_ASM\n            if (kernel_w == 2 && kernel_h == 2 && stride_w == 2 && stride_h == 2)\n            {\n                pooling2x2s2_max_pack4_bf16s_neon(bottom_blob_bordered, top_blob, opt);\n\n                return 0;\n            }\n\n            if (kernel_w == 3 && kernel_h == 3 && stride_w == 2 && stride_h == 2)\n            {\n                pooling3x3s2_max_pack4_bf16s_neon(bottom_blob_bordered, top_blob, opt);\n\n                return 0;\n            }\n#endif // NCNN_GNU_INLINE_ASM\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat m = bottom_blob_bordered.channel(q);\n                unsigned short* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        const unsigned short* sptr = m.row<const unsigned short>(i * stride_h) + j * stride_w * 4;\n\n                        float32x4_t _max = vdupq_n_f32(-FLT_MAX);\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float32x4_t _val = 
bfloat2float(vld1_u16(sptr + space_ofs[k] * 4));\n                            _max = vmaxq_f32(_max, _val);\n                        }\n\n                        vst1_u16(outptr + j * 4, float2bfloat(_max));\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n#endif // __ARM_NEON\n\n        if (elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat m = bottom_blob_bordered.channel(q);\n                unsigned short* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        const unsigned short* sptr = m.row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                        float max = -FLT_MAX;\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float val = bfloat16_to_float32(sptr[space_ofs[k]]);\n                            max = std::max(max, val);\n                        }\n\n                        outptr[j] = float32_to_bfloat16(max);\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    if (pooling_type == PoolMethod_AVE)\n    {\n        if (avgpool_count_include_pad == 0)\n        {\n            int wtailpad = 0;\n            int htailpad = 0;\n\n            if (pad_mode == 0) // full padding\n            {\n                wtailpad = bottom_blob_bordered.w - bottom_blob.w - pad_left - pad_right;\n                htailpad = bottom_blob_bordered.h - bottom_blob.h - pad_top - pad_bottom;\n            }\n\n#if __ARM_NEON\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n         
       {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                                if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    float32x4_t _val = bfloat2float(vld1_u16(m.row<const unsigned short>(sy) + sx * 4));\n                                    _sum = vaddq_f32(_sum, _val);\n                                    area += 1;\n                                }\n                            }\n\n                            float32x4_t _inv_area = vdupq_n_f32(1.f / area);\n                            float32x4_t _avg = vmulq_f32(_sum, _inv_area);\n                            vst1_u16(outptr + j * 4, float2bfloat(_avg));\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n#endif // __ARM_NEON\n\n            if (elempack 
== 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            float sum = 0;\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                                if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    float val = bfloat16_to_float32(m.row<const unsigned short>(sy)[sx]);\n                                    sum += val;\n                                    area += 1;\n                                }\n                            }\n\n                            outptr[j] = float32_to_bfloat16(sum / area);\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n\n        if (avgpool_count_include_pad == 1)\n        {\n#if __ARM_NEON\n            if 
(elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    float32x4_t _inv_maxk = vdupq_n_f32(1.f / maxk);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const unsigned short* sptr = m.row<const unsigned short>(i * stride_h) + j * stride_w * 4;\n\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float32x4_t _val = bfloat2float(vld1_u16(sptr + space_ofs[k] * 4));\n                                _sum = vaddq_f32(_sum, _val);\n                            }\n\n                            float32x4_t _avg = vmulq_f32(_sum, _inv_maxk);\n                            vst1_u16(outptr + j * 4, float2bfloat(_avg));\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n#endif // __ARM_NEON\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    unsigned short* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const unsigned short* sptr = m.row<const unsigned short>(i * stride_h) + j * stride_w;\n\n                            float sum = 0.f;\n\n                         
   for (int k = 0; k < maxk; k++)\n                            {\n                                float val = bfloat16_to_float32(sptr[space_ofs[k]]);\n                                sum += val;\n                            }\n\n                            outptr[j] = float32_to_bfloat16(sum / maxk);\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/pooling_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_POOLING_ARM_H\n#define LAYER_POOLING_ARM_H\n\n#include \"pooling.h\"\n\nnamespace ncnn {\n\n// ARM NEON pooling layer; forward() falls back to Pooling::forward for configurations without a NEON kernel.\nclass Pooling_arm : public Pooling\n{\npublic:\n    Pooling_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    // fp16 paths, compiled only when NCNN_ARM82 is enabled\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    // bf16 storage path: values are widened to fp32 for the max/sum arithmetic\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_POOLING_ARM_H\n"
  },
  {
    "path": "src/layer/arm/pooling_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"pooling_arm.h\"\n\n#include <float.h>\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Pooling_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // max value in NxN window\n    // avg value in NxN window\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Pooling     input %d x %d  pad = %d %d %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_left, pad_right, pad_top, pad_bottom, kernel_w, kernel_h, stride_w, stride_h);\n\n    if (global_pooling)\n    {\n        top_blob.create(channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int size = w * h;\n\n        if (pooling_type == PoolMethod_MAX)\n        {\n            if (elempack == 8)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob.channel(q);\n\n                    float16x8_t _max = vdupq_n_f16((__fp16)-FLT_MAX);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float16x8_t _val = vld1q_f16(ptr);\n                        _max = vmaxq_f16(_max, _val);\n                        ptr += 8;\n                    }\n\n                    __fp16* outptr = top_blob;\n                    vst1q_f16(outptr + q * 8, _max);\n                }\n            }\n\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n               
     const __fp16* ptr = bottom_blob.channel(q);\n\n                    float16x4_t _max = vdup_n_f16((__fp16)-FLT_MAX);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float16x4_t _val = vld1_f16(ptr);\n                        _max = vmax_f16(_max, _val);\n                        ptr += 4;\n                    }\n\n                    __fp16* outptr = top_blob;\n                    vst1_f16(outptr + q * 4, _max);\n                }\n            }\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob.channel(q);\n\n                    __fp16 max = (__fp16)-FLT_MAX;\n                    for (int i = 0; i < size; i++)\n                    {\n                        max = std::max(max, ptr[i]);\n                    }\n\n                    __fp16* outptr = top_blob;\n                    outptr[q] = max;\n                }\n            }\n        }\n\n        if (pooling_type == PoolMethod_AVE)\n        {\n            if (elempack == 8)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob.channel(q);\n\n                    float32x4_t _sum0 = vdupq_n_f32(0.f);\n                    float32x4_t _sum1 = vdupq_n_f32(0.f);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float16x8_t _val = vld1q_f16(ptr);\n                        _sum0 = vaddq_f32(_sum0, vcvt_f32_f16(vget_low_f16(_val)));\n                        _sum1 = vaddq_f32(_sum1, vcvt_f32_f16(vget_high_f16(_val)));\n                        ptr += 8;\n                    }\n\n                    float32x4_t _inv_size = vdupq_n_f32(1.f / size);\n                    float32x4_t 
_avg0 = vmulq_f32(_sum0, _inv_size);\n                    float32x4_t _avg1 = vmulq_f32(_sum1, _inv_size);\n\n                    __fp16* outptr = top_blob;\n                    vst1q_f16(outptr + q * 8, vcombine_f16(vcvt_f16_f32(_avg0), vcvt_f16_f32(_avg1)));\n                }\n            }\n\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob.channel(q);\n\n                    float32x4_t _sum = vdupq_n_f32(0.f);\n                    for (int i = 0; i < size; i++)\n                    {\n                        float32x4_t _val = vcvt_f32_f16(vld1_f16(ptr));\n                        _sum = vaddq_f32(_sum, _val);\n                        ptr += 4;\n                    }\n\n                    float32x4_t _inv_size = vdupq_n_f32(1.f / size);\n                    float32x4_t _avg = vmulq_f32(_sum, _inv_size);\n\n                    __fp16* outptr = top_blob;\n                    vst1_f16(outptr + q * 4, vcvt_f16_f32(_avg));\n                }\n            }\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const __fp16* ptr = bottom_blob.channel(q);\n\n                    float sum = 0.f;\n                    for (int i = 0; i < size; i++)\n                    {\n                        sum += (float)ptr[i];\n                    }\n\n                    __fp16* outptr = top_blob;\n                    outptr[q] = (__fp16)(sum / size);\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = 
bottom_blob_bordered.h;\n\n    int outw = (w - kernel_w) / stride_w + 1;\n    int outh = (h - kernel_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w - kernel_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2++;\n            }\n            p2 += gap;\n        }\n    }\n\n    if (pooling_type == PoolMethod_MAX)\n    {\n        if (elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat m = bottom_blob_bordered.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w * 8;\n\n                        float16x8_t _max = vdupq_n_f16((__fp16)-FLT_MAX);\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float16x8_t _val = vld1q_f16(sptr + space_ofs[k] * 8);\n                            _max = vmaxq_f16(_max, _val);\n                        }\n\n                        vst1q_f16(outptr + j * 8, _max);\n                    }\n\n                    outptr += outw * 8;\n                }\n            }\n        }\n\n        if (elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            
{\n                const Mat m = bottom_blob_bordered.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w * 4;\n\n                        float16x4_t _max = vdup_n_f16((__fp16)-FLT_MAX);\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float16x4_t _val = vld1_f16(sptr + space_ofs[k] * 4);\n                            _max = vmax_f16(_max, _val);\n                        }\n\n                        vst1_f16(outptr + j * 4, _max);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n\n        if (elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat m = bottom_blob_bordered.channel(q);\n                __fp16* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w;\n\n                        __fp16 max = (__fp16)-FLT_MAX;\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            __fp16 val = sptr[space_ofs[k]];\n                            max = std::max(max, val);\n                        }\n\n                        outptr[j] = max;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    if (pooling_type == PoolMethod_AVE)\n    {\n        if (avgpool_count_include_pad == 0)\n        {\n            int wtailpad = 0;\n            int 
htailpad = 0;\n\n            if (pad_mode == 0) // full padding\n            {\n                wtailpad = bottom_blob_bordered.w - bottom_blob.w - pad_left - pad_right;\n                htailpad = bottom_blob_bordered.h - bottom_blob.h - pad_top - pad_bottom;\n            }\n\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                                if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    float32x4_t _val = vcvt_f32_f16(vld1_f16(m.row<const __fp16>(sy) + sx * 4));\n                                    _sum = vaddq_f32(_sum, _val);\n                                    area += 1;\n                                
}\n                            }\n\n                            float32x4_t _inv_area = vdupq_n_f32(1.f / area);\n                            float32x4_t _avg = vmulq_f32(_sum, _inv_area);\n                            vst1_f16(outptr + j * 4, vcvt_f16_f32(_avg));\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            float sum = 0.f;\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                                if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    float val = (float)(m.row<const __fp16>(sy)[sx]);\n                                    sum += val;\n                             
       area += 1;\n                                }\n                            }\n\n                            outptr[j] = (__fp16)(sum / area);\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n\n        if (avgpool_count_include_pad == 1)\n        {\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    float32x4_t _inv_maxk = vdupq_n_f32(1.f / maxk);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w * 4;\n\n                            float32x4_t _sum = vdupq_n_f32(0.f);\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float32x4_t _val = vcvt_f32_f16(vld1_f16(sptr + space_ofs[k] * 4));\n                                _sum = vaddq_f32(_sum, _val);\n                            }\n\n                            float32x4_t _avg = vmulq_f32(_sum, _inv_maxk);\n                            vst1_f16(outptr + j * 4, vcvt_f16_f32(_avg));\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < 
outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w;\n\n                            float sum = 0.f;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float val = (float)(sptr[space_ofs[k]]);\n                                sum += val;\n                            }\n\n                            outptr[j] = (__fp16)(sum / maxk);\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Pooling_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // max value in NxN window\n    // avg value in NxN window\n\n    if (pooling_type == PoolMethod_MAX || global_pooling)\n    {\n        return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Pooling     input %d x %d  pad = %d %d %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_left, pad_right, pad_top, pad_bottom, kernel_w, kernel_h, stride_w, stride_h);\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_w) / stride_w + 1;\n    int outh = (h - kernel_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = 
&_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w - kernel_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2++;\n            }\n            p2 += gap;\n        }\n    }\n\n    if (pooling_type == PoolMethod_AVE)\n    {\n        if (avgpool_count_include_pad == 0)\n        {\n            int wtailpad = 0;\n            int htailpad = 0;\n\n            if (pad_mode == 0) // full padding\n            {\n                wtailpad = bottom_blob_bordered.w - bottom_blob.w - pad_left - pad_right;\n                htailpad = bottom_blob_bordered.h - bottom_blob.h - pad_top - pad_bottom;\n            }\n\n            if (elempack == 8)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                                if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    
int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    float16x8_t _val = vld1q_f16(m.row<const __fp16>(sy) + sx * 8);\n                                    _sum = vaddq_f16(_sum, _val);\n                                    area += 1;\n                                }\n                            }\n\n#if defined(_MSC_VER) && !defined(__clang__)\n                            float16x4_t _inv_area0 = vcvt_f16_f32(vdupq_n_f32(1.f / area));\n                            float16x8_t _inv_area = vcombine_f16(_inv_area0, _inv_area0);\n#else\n                            float16x8_t _inv_area = vdupq_n_f16((__fp16)(1.f / area));\n#endif\n                            float16x8_t _avg = vmulq_f16(_sum, _inv_area);\n                            vst1q_f16(outptr + j * 8, _avg);\n                        }\n\n                        outptr += outw * 8;\n                    }\n                }\n            }\n\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            float16x4_t _sum = vdup_n_f16((__fp16)0.f);\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n      
                          if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    float16x4_t _val = vld1_f16(m.row<const __fp16>(sy) + sx * 4);\n                                    _sum = vadd_f16(_sum, _val);\n                                    area += 1;\n                                }\n                            }\n\n#if defined(_MSC_VER) && !defined(__clang__)\n                            float16x4_t _inv_area = vcvt_f16_f32(vdupq_n_f32(1.f / area));\n#else\n                            float16x4_t _inv_area = vdup_n_f16((__fp16)(1.f / area));\n#endif\n                            float16x4_t _avg = vmul_f16(_sum, _inv_area);\n                            vst1_f16(outptr + j * 4, _avg);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                     
       __fp16 sum = (__fp16)0.f;\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                                if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    __fp16 val = m.row<const __fp16>(sy)[sx];\n                                    sum += val;\n                                    area += 1;\n                                }\n                            }\n\n                            outptr[j] = sum / (__fp16)area;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n\n        if (avgpool_count_include_pad == 1)\n        {\n            if (elempack == 8)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n#if defined(_MSC_VER) && !defined(__clang__)\n                    float16x4_t _inv_maxk0 = vcvt_f16_f32(vdupq_n_f32(1.f / maxk));\n                    float16x8_t _inv_maxk = vcombine_f16(_inv_maxk0, _inv_maxk0);\n#else\n                    float16x8_t _inv_maxk = vdupq_n_f16((__fp16)(1.f / maxk));\n#endif\n\n                    for 
(int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w * 8;\n\n                            float16x8_t _sum = vdupq_n_f16((__fp16)0.f);\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float16x8_t _val = vld1q_f16(sptr + space_ofs[k] * 8);\n                                _sum = vaddq_f16(_sum, _val);\n                            }\n\n                            float16x8_t _avg = vmulq_f16(_sum, _inv_maxk);\n                            vst1q_f16(outptr + j * 8, _avg);\n                        }\n\n                        outptr += outw * 8;\n                    }\n                }\n            }\n\n            if (elempack == 4)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n#if defined(_MSC_VER) && !defined(__clang__)\n                    float16x4_t _inv_maxk = vcvt_f16_f32(vdupq_n_f32(1.f / maxk));\n#else\n                    float16x4_t _inv_maxk = vdup_n_f16((__fp16)(1.f / maxk));\n#endif\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w * 4;\n\n                            float16x4_t _sum = vdup_n_f16((__fp16)0.f);\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float16x4_t _val = vld1_f16(sptr + space_ofs[k] * 4);\n                                _sum = vadd_f16(_sum, _val);\n               
             }\n\n                            float16x4_t _avg = vmul_f16(_sum, _inv_maxk);\n                            vst1_f16(outptr + j * 4, _avg);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n\n            if (elempack == 1)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    __fp16* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const __fp16* sptr = m.row<const __fp16>(i * stride_h) + j * stride_w;\n\n                            __fp16 sum = (__fp16)0.f;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                __fp16 val = sptr[space_ofs[k]];\n                                sum += val;\n                            }\n\n                            outptr[j] = sum / (__fp16)maxk;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/prelu_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"prelu_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nPReLU_arm::PReLU_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint PReLU_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _zero = vdupq_n_f32(0.f);\n\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            if (num_slope > 1)\n            {\n                const float* slope = slope_data;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    float* ptr = (float*)bottom_top_blob + i * 4;\n\n                    float32x4_t _p = vld1q_f32(ptr);\n                    float32x4_t _slope = vld1q_f32(slope + i * 4);\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1q_f32(ptr, _p);\n                }\n            }\n          
  else\n            {\n                float32x4_t _slope = vdupq_n_f32(slope_data[0]);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    float* ptr = (float*)bottom_top_blob + i * 4;\n\n                    float32x4_t _p = vld1q_f32(ptr);\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1q_f32(ptr, _p);\n                }\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                float* ptr = bottom_top_blob.row(i);\n                float32x4_t _slope = num_slope > 1 ? vld1q_f32((const float*)slope_data + i * 4) : vdupq_n_f32(slope_data[0]);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float32x4_t _p = vld1q_f32(ptr);\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1q_f32(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int channels = bottom_top_blob.c;\n            int size = w * h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                float* ptr = bottom_top_blob.channel(q);\n                float32x4_t _slope = num_slope > 1 ? 
vld1q_f32((const float*)slope_data + q * 4) : vdupq_n_f32(slope_data[0]);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p = vld1q_f32(ptr);\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1q_f32(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        float* ptr = bottom_top_blob;\n\n        if (num_slope > 1)\n        {\n            const float* slope = slope_data;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = ptr[i];\n                if (v < 0.f)\n                    ptr[i] = v * slope[i];\n            }\n        }\n        else\n        {\n            const float slope = slope_data[0];\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = ptr[i];\n                if (v < 0.f)\n                    ptr[i] = v * slope;\n            }\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n\n            const float slope = num_slope > 1 ? 
slope_data[i] : slope_data[0];\n\n            int j = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n\n            for (; j + 3 < w; j += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1q_f32(ptr, _p);\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < w; j++)\n            {\n                float v = *ptr;\n                if (v < 0.f)\n                    *ptr = v * slope;\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int channels = bottom_top_blob.c;\n        int size = w * h;\n\n        const float* slope_data_ptr = slope_data;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n            float slope = num_slope > 1 ? 
slope_data_ptr[q] : slope_data_ptr[0];\n\n            int j = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n\n            for (; j + 15 < size; j += 16)\n            {\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                float32x4_t _p2 = vld1q_f32(ptr + 8);\n                float32x4_t _p3 = vld1q_f32(ptr + 12);\n                uint32x4_t _lemask0 = vcleq_f32(_p0, _zero);\n                uint32x4_t _lemask1 = vcleq_f32(_p1, _zero);\n                uint32x4_t _lemask2 = vcleq_f32(_p2, _zero);\n                uint32x4_t _lemask3 = vcleq_f32(_p3, _zero);\n                float32x4_t _ps0 = vmulq_f32(_p0, _slope);\n                float32x4_t _ps1 = vmulq_f32(_p1, _slope);\n                float32x4_t _ps2 = vmulq_f32(_p2, _slope);\n                float32x4_t _ps3 = vmulq_f32(_p3, _slope);\n                _p0 = vbslq_f32(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f32(_lemask1, _ps1, _p1);\n                _p2 = vbslq_f32(_lemask2, _ps2, _p2);\n                _p3 = vbslq_f32(_lemask3, _ps3, _p3);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                vst1q_f32(ptr + 8, _p2);\n                vst1q_f32(ptr + 12, _p3);\n                ptr += 16;\n            }\n            for (; j + 7 < size; j += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                uint32x4_t _lemask0 = vcleq_f32(_p0, _zero);\n                uint32x4_t _lemask1 = vcleq_f32(_p1, _zero);\n                float32x4_t _ps0 = vmulq_f32(_p0, _slope);\n                float32x4_t _ps1 = vmulq_f32(_p1, _slope);\n                _p0 = vbslq_f32(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f32(_lemask1, _ps1, _p1);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                ptr 
+= 8;\n            }\n            for (; j + 3 < size; j += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < size; j++)\n            {\n                if (*ptr < 0)\n                    *ptr *= slope;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint PReLU_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _zero = vdupq_n_f32(0.f);\n\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            if (num_slope > 1)\n            {\n                const float* slope = slope_data;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    unsigned short* ptr = (unsigned short*)bottom_top_blob + i * 4;\n\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    float32x4_t _slope = vld1q_f32(slope + i * 4);\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_u16(ptr, float2bfloat(_p));\n                }\n            }\n            else\n            {\n                float32x4_t _slope = vdupq_n_f32(slope_data[0]);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    unsigned short* ptr = (unsigned short*)bottom_top_blob + i * 
4;\n\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_u16(ptr, float2bfloat(_p));\n                }\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                unsigned short* ptr = bottom_top_blob.row<unsigned short>(i);\n                float32x4_t _slope = num_slope > 1 ? vld1q_f32((const float*)slope_data + i * 4) : vdupq_n_f32(slope_data[0]);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_u16(ptr, float2bfloat(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int channels = bottom_top_blob.c;\n            int size = w * h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                unsigned short* ptr = bottom_top_blob.channel(q);\n                float32x4_t _slope = num_slope > 1 ? 
vld1q_f32((const float*)slope_data + q * 4) : vdupq_n_f32(slope_data[0]);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_u16(ptr, float2bfloat(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        unsigned short* ptr = bottom_top_blob;\n\n        if (num_slope > 1)\n        {\n            const float* slope = slope_data;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = bfloat16_to_float32(ptr[i]);\n                if (v < 0.f)\n                    ptr[i] = float32_to_bfloat16(v * slope[i]);\n            }\n        }\n        else\n        {\n            const float slope = slope_data[0];\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = bfloat16_to_float32(ptr[i]);\n                if (v < 0.f)\n                    ptr[i] = float32_to_bfloat16(v * slope);\n            }\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            unsigned short* ptr = bottom_top_blob.row<unsigned short>(i);\n\n            const float slope = num_slope > 1 ? 
slope_data[i] : slope_data[0];\n\n            int j = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n\n            for (; j + 3 < w; j += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1_u16(ptr, float2bfloat(_p));\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < w; j++)\n            {\n                float v = bfloat16_to_float32(*ptr);\n                if (v < 0.f)\n                    *ptr = float32_to_bfloat16(v * slope);\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int channels = bottom_top_blob.c;\n        int size = w * h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            const float slope = num_slope > 1 ? 
slope_data[q] : slope_data[0];\n\n            int j = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n\n            for (; j + 3 < size; j += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1_u16(ptr, float2bfloat(_p));\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; j < size; j++)\n            {\n                float v = bfloat16_to_float32(*ptr);\n                if (v < 0.f)\n                    *ptr = float32_to_bfloat16(v * slope);\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/prelu_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_PRELU_ARM_H\n#define LAYER_PRELU_ARM_H\n\n#include \"prelu.h\"\n\nnamespace ncnn {\n\n// ARM-optimized PReLU: multiplies negative inputs by a slope, in place\nclass PReLU_arm : public PReLU\n{\npublic:\n    PReLU_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    // fp16 storage, fp32 arithmetic\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    // fp16 storage, fp16 arithmetic\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    // bf16 storage, fp32 arithmetic\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_PRELU_ARM_H\n"
  },
  {
    "path": "src/layer/arm/prelu_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"prelu_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint PReLU_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 4)\n    {\n        float32x4_t _zero = vdupq_n_f32(0.f);\n\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            if (num_slope > 1)\n            {\n                const float* slope = slope_data;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    __fp16* ptr = (__fp16*)bottom_top_blob + i * 4;\n\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    float32x4_t _slope = vld1q_f32(slope + i * 4);\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_f16(ptr, vcvt_f16_f32(_p));\n                }\n            }\n            else\n            {\n                float32x4_t _slope = vdupq_n_f32(slope_data[0]);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    __fp16* ptr = (__fp16*)bottom_top_blob + i * 4;\n\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_f16(ptr, vcvt_f16_f32(_p));\n                }\n            }\n        }\n\n        if (dims == 2)\n        
{\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n                float32x4_t _slope = num_slope > 1 ? vld1q_f32((const float*)slope_data + i * 4) : vdupq_n_f32(slope_data[0]);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int channels = bottom_top_blob.c;\n            int size = w * h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                __fp16* ptr = bottom_top_blob.channel(q);\n                float32x4_t _slope = num_slope > 1 ? 
vld1q_f32((const float*)slope_data + q * 4) : vdupq_n_f32(slope_data[0]);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                    uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                    float32x4_t _ps = vmulq_f32(_p, _slope);\n                    _p = vbslq_f32(_lemask, _ps, _p);\n                    vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        __fp16* ptr = bottom_top_blob;\n\n        if (num_slope > 1)\n        {\n            const float* slope = slope_data;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = (float)ptr[i];\n                if (v < 0.f)\n                    ptr[i] = (__fp16)(v * slope[i]);\n            }\n        }\n        else\n        {\n            const float slope = slope_data[0];\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = (float)ptr[i];\n                if (v < 0.f)\n                    ptr[i] = (__fp16)(v * slope);\n            }\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n            const float slope = num_slope > 1 ? 
slope_data[i] : slope_data[0];\n\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n\n            int j = 0;\n            for (; j + 3 < w; j += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                ptr += 4;\n            }\n            for (; j < w; j++)\n            {\n                float v = (float)*ptr;\n                if (v < 0.f)\n                    *ptr = (__fp16)(v * slope);\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int channels = bottom_top_blob.c;\n        int size = w * h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            const float slope = num_slope > 1 ? 
slope_data[q] : slope_data[0];\n\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n\n            int j = 0;\n            for (; j + 3 < size; j += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                ptr += 4;\n            }\n            for (; j < size; j++)\n            {\n                float v = (float)*ptr;\n                if (v < 0.f)\n                    *ptr = (__fp16)(v * slope);\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\nint PReLU_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 8)\n    {\n        float16x8_t _zero = vdupq_n_f16(0.f);\n\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            if (num_slope > 1)\n            {\n                const float* slope = slope_data;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    __fp16* ptr = (__fp16*)bottom_top_blob + i * 8;\n\n                    float16x8_t _p = vld1q_f16(ptr);\n                    float16x8_t _slope = vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)slope + i * 8)), vcvt_f16_f32(vld1q_f32((const float*)slope + i * 8 + 4)));\n                    uint16x8_t _lemask = vcleq_f16(_p, _zero);\n                    float16x8_t _ps = vmulq_f16(_p, _slope);\n                    _p = vbslq_f16(_lemask, _ps, _p);\n                    vst1q_f16(ptr, _p);\n                }\n            }\n            else\n            {\n#if defined(_MSC_VER) && !defined(__clang__)\n        
        float16x4_t _slope0 = vcvt_f16_f32(vdupq_n_f32(slope_data[0]));\n                float16x8_t _slope = vcombine_f16(_slope0, _slope0);\n#else\n                float16x8_t _slope = vdupq_n_f16((__fp16)slope_data[0]);\n#endif\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    __fp16* ptr = (__fp16*)bottom_top_blob + i * 8;\n\n                    float16x8_t _p = vld1q_f16(ptr);\n                    uint16x8_t _lemask = vcleq_f16(_p, _zero);\n                    float16x8_t _ps = vmulq_f16(_p, _slope);\n                    _p = vbslq_f16(_lemask, _ps, _p);\n                    vst1q_f16(ptr, _p);\n                }\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n                float16x8_t _slope = num_slope > 1 ? 
vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)slope_data + i * 8)), vcvt_f16_f32(vld1q_f32((const float*)slope_data + i * 8 + 4))) : vdupq_n_f16((__fp16)slope_data[0]);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float16x8_t _p = vld1q_f16(ptr);\n                    uint16x8_t _lemask = vcleq_f16(_p, _zero);\n                    float16x8_t _ps = vmulq_f16(_p, _slope);\n                    _p = vbslq_f16(_lemask, _ps, _p);\n                    vst1q_f16(ptr, _p);\n\n                    ptr += 8;\n                }\n            }\n        }\n\n        if (dims == 3)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int channels = bottom_top_blob.c;\n            int size = w * h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                __fp16* ptr = bottom_top_blob.channel(q);\n                float16x8_t _slope = num_slope > 1 ? 
vcombine_f16(vcvt_f16_f32(vld1q_f32((const float*)slope_data + q * 8)), vcvt_f16_f32(vld1q_f32((const float*)slope_data + q * 8 + 4))) : vdupq_n_f16((__fp16)slope_data[0]);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float16x8_t _p = vld1q_f16(ptr);\n                    uint16x8_t _lemask = vcleq_f16(_p, _zero);\n                    float16x8_t _ps = vmulq_f16(_p, _slope);\n                    _p = vbslq_f16(_lemask, _ps, _p);\n                    vst1q_f16(ptr, _p);\n\n                    ptr += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n#if defined(_MSC_VER) && !defined(__clang__)\n            float16x8_t _zero = vdupq_n_f16(0.f);\n#else\n            float16x4_t _zero = vdup_n_f16(0.f);\n#endif\n\n            int w = bottom_top_blob.w;\n\n            if (num_slope > 1)\n            {\n                const float* slope = slope_data;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    __fp16* ptr = (__fp16*)bottom_top_blob + i * 4;\n\n                    float16x4_t _p = vld1_f16(ptr);\n                    float16x4_t _slope = vcvt_f16_f32(vld1q_f32(slope + i * 4));\n#if defined(_MSC_VER) && !defined(__clang__)\n                    uint16x4_t _lemask = vcle_f16(_p, vget_low_f16(_zero));\n#else\n                    uint16x4_t _lemask = vcle_f16(_p, _zero);\n#endif\n                    float16x4_t _ps = vmul_f16(_p, _slope);\n                    _p = vbsl_f16(_lemask, _ps, _p);\n                    vst1_f16(ptr, _p);\n                }\n            }\n            else\n            {\n#if defined(_MSC_VER) && !defined(__clang__)\n                float16x8_t _slope = vdupq_n_f16((__fp16)slope_data[0]);\n#else\n                float16x4_t _slope = vdup_n_f16((__fp16)slope_data[0]);\n#endif\n\n                #pragma omp 
parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    __fp16* ptr = (__fp16*)bottom_top_blob + i * 4;\n\n                    float16x4_t _p = vld1_f16(ptr);\n#if defined(_MSC_VER) && !defined(__clang__)\n                    uint16x4_t _lemask = vcle_f16(_p, vget_low_f16(_zero));\n                    float16x4_t _ps = vmul_f16(_p, vget_low_f16(_slope));\n#else\n                    uint16x4_t _lemask = vcle_f16(_p, _zero);\n                    float16x4_t _ps = vmul_f16(_p, _slope);\n#endif\n                    _p = vbsl_f16(_lemask, _ps, _p);\n                    vst1_f16(ptr, _p);\n                }\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n                float16x4_t _zero = vdup_n_f16(0.f);\n                float16x4_t _slope = num_slope > 1 ? 
vcvt_f16_f32(vld1q_f32((const float*)slope_data + i * 4)) : vdup_n_f16((__fp16)slope_data[0]);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    uint16x4_t _lemask = vcle_f16(_p, _zero);\n                    float16x4_t _ps = vmul_f16(_p, _slope);\n                    _p = vbsl_f16(_lemask, _ps, _p);\n                    vst1_f16(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        if (dims == 3)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int channels = bottom_top_blob.c;\n            int size = w * h;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                __fp16* ptr = bottom_top_blob.channel(q);\n                float16x4_t _zero = vdup_n_f16(0.f);\n                float16x4_t _slope = num_slope > 1 ? 
vcvt_f16_f32(vld1q_f32((const float*)slope_data + q * 4)) : vdup_n_f16((__fp16)slope_data[0]);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float16x4_t _p = vld1_f16(ptr);\n                    uint16x4_t _lemask = vcle_f16(_p, _zero);\n                    float16x4_t _ps = vmul_f16(_p, _slope);\n                    _p = vbsl_f16(_lemask, _ps, _p);\n                    vst1_f16(ptr, _p);\n\n                    ptr += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        __fp16* ptr = bottom_top_blob;\n\n        if (num_slope > 1)\n        {\n            const float* slope = slope_data;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = (float)ptr[i];\n                if (v < (__fp16)0.f)\n                    ptr[i] = (__fp16)(v * slope[i]);\n            }\n        }\n        else\n        {\n            const float slope = slope_data[0];\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < w; i++)\n            {\n                float v = (float)ptr[i];\n                if (v < (__fp16)0.f)\n                    ptr[i] = (__fp16)(v * slope);\n            }\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n            const float slope = num_slope > 1 ? 
slope_data[i] : slope_data[0];\n\n            float16x4_t _zero = vdup_n_f16(0.f);\n#if defined(_MSC_VER) && !defined(__clang__)\n            float16x4_t _slope = vcvt_f16_f32(vdupq_n_f32(slope));\n#else\n            float16x4_t _slope = vdup_n_f16((__fp16)slope);\n#endif\n\n            int j = 0;\n            for (; j + 3 < w; j += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                uint16x4_t _lemask = vcle_f16(_p, _zero);\n                float16x4_t _ps = vmul_f16(_p, _slope);\n                _p = vbsl_f16(_lemask, _ps, _p);\n                vst1_f16(ptr, _p);\n\n                ptr += 4;\n            }\n            for (; j < w; j++)\n            {\n                float v = (float)*ptr;\n                if (v < (__fp16)0.f)\n                    *ptr = (__fp16)(v * slope);\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int channels = bottom_top_blob.c;\n        int size = w * h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            const float slope = num_slope > 1 ? 
slope_data[q] : slope_data[0];\n\n            float16x4_t _zero = vdup_n_f16(0.f);\n#if defined(_MSC_VER) && !defined(__clang__)\n            float16x4_t _slope = vcvt_f16_f32(vdupq_n_f32(slope));\n#else\n            float16x4_t _slope = vdup_n_f16((__fp16)slope);\n#endif\n\n            int j = 0;\n            for (; j + 3 < size; j += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                uint16x4_t _lemask = vcle_f16(_p, _zero);\n                float16x4_t _ps = vmul_f16(_p, _slope);\n                _p = vbsl_f16(_lemask, _ps, _p);\n                vst1_f16(ptr, _p);\n\n                ptr += 4;\n            }\n            for (; j < size; j++)\n            {\n                float v = (float)*ptr;\n                if (v < (__fp16)0.f)\n                    *ptr = (__fp16)(v * slope);\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/quantize_arm.cpp",
    "content": "// Copyright 2018 Tencent\n// Copyright 2019 BUG1989\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"quantize_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nQuantize_arm::Quantize_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nstatic void quantize(const float* ptr, signed char* s8ptr, const Mat& scale_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"quantize %d   %d %d\", scale_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n    if (scale_data_size > 1)\n    {\n        if (elempack == 4)\n        {\n            _scale = vld1q_f32((const float*)scale_data);\n        }\n    }\n#endif // __ARM_NEON\n\n    int i = 0;\n#if __ARM_NEON\n    for (; i + 15 < size; i += 16)\n    {\n        float32x4_t _v0 = vld1q_f32(ptr);\n        float32x4_t _v1 = vld1q_f32(ptr + 4);\n        float32x4_t _v2 = vld1q_f32(ptr + 8);\n        float32x4_t _v3 = vld1q_f32(ptr + 12);\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        _v2 = vmulq_f32(_v2, _scale);\n        _v3 = vmulq_f32(_v3, _scale);\n        vst1q_s8(s8ptr, vcombine_s8(float2int8(_v0, _v1), float2int8(_v2, _v3)));\n        ptr += 16;\n        s8ptr += 16;\n    }\n    for (; i + 7 < size; i += 8)\n    {\n        float32x4_t _v0 = vld1q_f32(ptr);\n        float32x4_t _v1 = vld1q_f32(ptr + 4);\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        vst1_s8(s8ptr, float2int8(_v0, _v1));\n        ptr += 8;\n        s8ptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        
float32x4_t _v = vld1q_f32(ptr);\n        _v = vmulq_f32(_v, _scale);\n        int8x8_t v = float2int8(_v, _v);\n        s8ptr[0] = vget_lane_s8(v, 0);\n        s8ptr[1] = vget_lane_s8(v, 1);\n        s8ptr[2] = vget_lane_s8(v, 2);\n        s8ptr[3] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        float v = *ptr * scale;\n        *s8ptr = float2int8(v);\n        ptr++;\n        s8ptr++;\n    }\n}\n\n#if __ARM_NEON\nstatic void quantize_pack4to8(const float* ptr0, const float* ptr1, signed char* s8ptr, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to8 %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    float32x4_t _scale0 = vdupq_n_f32(scale);\n    float32x4_t _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        _scale0 = vld1q_f32((const float*)scale_data);\n        _scale1 = vld1q_f32((const float*)scale_data + 4);\n    }\n\n    int i = 0;\n    for (; i + 1 < elemcount; i += 2)\n    {\n        float32x4_t _v0 = vld1q_f32(ptr0);\n        float32x4_t _v1 = vld1q_f32(ptr1);\n        float32x4_t _v2 = vld1q_f32(ptr0 + 4);\n        float32x4_t _v3 = vld1q_f32(ptr1 + 4);\n        _v0 = vmulq_f32(_v0, _scale0);\n        _v1 = vmulq_f32(_v1, _scale1);\n        _v2 = vmulq_f32(_v2, _scale0);\n        _v3 = vmulq_f32(_v3, _scale1);\n        vst1q_s8(s8ptr, vcombine_s8(float2int8(_v0, _v1), float2int8(_v2, _v3)));\n        ptr0 += 8;\n        ptr1 += 8;\n        s8ptr += 16;\n    }\n    for (; i < elemcount; i++)\n    {\n        float32x4_t _v0 = vld1q_f32(ptr0);\n        float32x4_t _v1 = vld1q_f32(ptr1);\n        _v0 = vmulq_f32(_v0, _scale0);\n        _v1 = vmulq_f32(_v1, _scale1);\n        vst1_s8(s8ptr, float2int8(_v0, _v1));\n        ptr0 += 4;\n        ptr1 += 4;\n        s8ptr += 8;\n    }\n}\n\nstatic void quantize_pack4to1(const float* ptr, signed char* s8ptr0, 
signed char* s8ptr1, signed char* s8ptr2, signed char* s8ptr3, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to1 %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    float32x4_t _scale = vdupq_n_f32(scale);\n    if (scale_data_size > 1)\n    {\n        _scale = vld1q_f32((const float*)scale_data);\n    }\n\n    int i = 0;\n    for (; i + 7 < elemcount; i += 8)\n    {\n        float32x4_t _v0 = vld1q_f32(ptr);\n        float32x4_t _v1 = vld1q_f32(ptr + 4);\n        float32x4_t _v2 = vld1q_f32(ptr + 8);\n        float32x4_t _v3 = vld1q_f32(ptr + 12);\n        float32x4_t _v4 = vld1q_f32(ptr + 16);\n        float32x4_t _v5 = vld1q_f32(ptr + 20);\n        float32x4_t _v6 = vld1q_f32(ptr + 24);\n        float32x4_t _v7 = vld1q_f32(ptr + 28);\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        _v2 = vmulq_f32(_v2, _scale);\n        _v3 = vmulq_f32(_v3, _scale);\n        _v4 = vmulq_f32(_v4, _scale);\n        _v5 = vmulq_f32(_v5, _scale);\n        _v6 = vmulq_f32(_v6, _scale);\n        _v7 = vmulq_f32(_v7, _scale);\n        int8x8_t v0 = float2int8(_v0, _v1);\n        int8x8_t v1 = float2int8(_v2, _v3);\n        int8x8_t v2 = float2int8(_v4, _v5);\n        int8x8_t v3 = float2int8(_v6, _v7);\n        int8x16_t v01 = vcombine_s8(v0, v1);\n        int8x16_t v23 = vcombine_s8(v2, v3);\n        int8x16x2_t v0213 = vuzpq_s8(v01, v23);\n        int8x16x2_t v0123 = vuzpq_s8(v0213.val[0], v0213.val[1]);\n        vst1_s8(s8ptr0, vget_low_s8(v0123.val[0]));\n        vst1_s8(s8ptr1, vget_high_s8(v0123.val[0]));\n        vst1_s8(s8ptr2, vget_low_s8(v0123.val[1]));\n        vst1_s8(s8ptr3, vget_high_s8(v0123.val[1]));\n        ptr += 32;\n        s8ptr0 += 8;\n        s8ptr1 += 8;\n        s8ptr2 += 8;\n        s8ptr3 += 8;\n    }\n    for (; i < elemcount; i++)\n    {\n        float32x4_t _v = vld1q_f32(ptr);\n        _v = vmulq_f32(_v, 
_scale);\n        int8x8_t v = float2int8(_v, _v);\n        s8ptr0[0] = vget_lane_s8(v, 0);\n        s8ptr1[0] = vget_lane_s8(v, 1);\n        s8ptr2[0] = vget_lane_s8(v, 2);\n        s8ptr3[0] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr0 += 1;\n        s8ptr1 += 1;\n        s8ptr2 += 1;\n        s8ptr3 += 1;\n    }\n}\n#endif // __ARM_NEON\n\nint Quantize_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_fp16sa(bottom_blob, top_blob, opt);\n        else\n            return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    if (dims == 1)\n    {\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = w * elempack % 8 == 0 ? 
8 : 1;\n        }\n#endif\n        const int outw = w * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const float* ptr = (const float*)bottom_blob + i * elempack;\n            signed char* s8ptr = (signed char*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            quantize(ptr, s8ptr, scale_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = h * elempack % 8 == 0 ? 8 : 1;\n        }\n#endif\n        const int outh = h * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const float* ptr0 = bottom_blob.row(i * 2);\n                const float* ptr1 = bottom_blob.row(i * 2 + 1);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? 
scale_data.range(i * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8(ptr0, ptr1, s8ptr, scale_data_i, w);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* ptr = bottom_blob.row(i);\n                signed char* s8ptr0 = top_blob.row<signed char>(i * 4);\n                signed char* s8ptr1 = top_blob.row<signed char>(i * 4 + 1);\n                signed char* s8ptr2 = top_blob.row<signed char>(i * 4 + 2);\n                signed char* s8ptr3 = top_blob.row<signed char>(i * 4 + 3);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_pack4to1(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_i, w);\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* ptr = bottom_blob.row(i);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize(ptr, s8ptr, scale_data_i, w, elempack);\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = channels * elempack % 8 == 0 ? 
8 : 1;\n        }\n#endif\n        const int outc = channels * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const float* ptr0 = bottom_blob.channel(q * 2);\n                const float* ptr1 = bottom_blob.channel(q * 2 + 1);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8(ptr0, ptr1, s8ptr, scale_data_q, w * h);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                signed char* s8ptr0 = top_blob.channel(q * 4);\n                signed char* s8ptr1 = top_blob.channel(q * 4 + 1);\n                signed char* s8ptr2 = top_blob.channel(q * 4 + 2);\n                signed char* s8ptr3 = top_blob.channel(q * 4 + 3);\n\n                const Mat scale_data_q = scale_data_size > 1 ? 
scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_pack4to1(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_q, w * h);\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize(ptr, s8ptr, scale_data_q, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic void quantize_bf16s(const unsigned short* ptr, signed char* s8ptr, const Mat& scale_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"quantize_bf16s %d   %d %d\", scale_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n#if __ARM_NEON\n    float32x4_t _scale = vdupq_n_f32(scale);\n    if (scale_data_size > 1)\n    {\n        if (elempack == 4)\n        {\n            _scale = vld1q_f32((const float*)scale_data);\n        }\n    }\n#endif // __ARM_NEON\n\n    int i = 0;\n#if __ARM_NEON\n    for (; i + 15 < size; i += 16)\n    {\n        uint16x8_t _v01 = vld1q_u16(ptr);\n        uint16x8_t _v23 = vld1q_u16(ptr + 8);\n        float32x4_t _v0 = bfloat2float(vget_low_u16(_v01));\n        float32x4_t _v1 = bfloat2float(vget_high_u16(_v01));\n        float32x4_t _v2 = bfloat2float(vget_low_u16(_v23));\n        float32x4_t _v3 = bfloat2float(vget_high_u16(_v23));\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        _v2 = vmulq_f32(_v2, _scale);\n        _v3 = vmulq_f32(_v3, _scale);\n        vst1q_s8(s8ptr, vcombine_s8(float2int8(_v0, _v1), float2int8(_v2, _v3)));\n  
      ptr += 16;\n        s8ptr += 16;\n    }\n    for (; i + 7 < size; i += 8)\n    {\n        uint16x8_t _v01 = vld1q_u16(ptr);\n        float32x4_t _v0 = bfloat2float(vget_low_u16(_v01));\n        float32x4_t _v1 = bfloat2float(vget_high_u16(_v01));\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        vst1_s8(s8ptr, float2int8(_v0, _v1));\n        ptr += 8;\n        s8ptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _v = bfloat2float(vld1_u16(ptr));\n        _v = vmulq_f32(_v, _scale);\n        int8x8_t v = float2int8(_v, _v);\n        s8ptr[0] = vget_lane_s8(v, 0);\n        s8ptr[1] = vget_lane_s8(v, 1);\n        s8ptr[2] = vget_lane_s8(v, 2);\n        s8ptr[3] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr += 4;\n    }\n#endif // __ARM_NEON\n    for (; i < size; i++)\n    {\n        float v = bfloat16_to_float32(*ptr) * scale;\n        *s8ptr = float2int8(v);\n        ptr++;\n        s8ptr++;\n    }\n}\n\n#if __ARM_NEON\nstatic void quantize_pack4to8_bf16s(const unsigned short* ptr0, const unsigned short* ptr1, signed char* s8ptr, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to8_bf16s %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    float32x4_t _scale0 = vdupq_n_f32(scale);\n    float32x4_t _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        _scale0 = vld1q_f32((const float*)scale_data);\n        _scale1 = vld1q_f32((const float*)scale_data + 4);\n    }\n\n    int i = 0;\n    for (; i + 1 < elemcount; i += 2)\n    {\n        uint16x8_t _v02 = vld1q_u16(ptr0);\n        uint16x8_t _v13 = vld1q_u16(ptr1);\n        float32x4_t _v0 = bfloat2float(vget_low_u16(_v02));\n        float32x4_t _v1 = bfloat2float(vget_low_u16(_v13));\n        float32x4_t _v2 = bfloat2float(vget_high_u16(_v02));\n        float32x4_t _v3 = bfloat2float(vget_high_u16(_v13));\n        _v0 = 
vmulq_f32(_v0, _scale0);\n        _v1 = vmulq_f32(_v1, _scale1);\n        _v2 = vmulq_f32(_v2, _scale0);\n        _v3 = vmulq_f32(_v3, _scale1);\n        vst1q_s8(s8ptr, vcombine_s8(float2int8(_v0, _v1), float2int8(_v2, _v3)));\n        ptr0 += 8;\n        ptr1 += 8;\n        s8ptr += 16;\n    }\n    for (; i < elemcount; i++)\n    {\n        float32x4_t _v0 = bfloat2float(vld1_u16(ptr0));\n        float32x4_t _v1 = bfloat2float(vld1_u16(ptr1));\n        _v0 = vmulq_f32(_v0, _scale0);\n        _v1 = vmulq_f32(_v1, _scale1);\n        vst1_s8(s8ptr, float2int8(_v0, _v1));\n        ptr0 += 4;\n        ptr1 += 4;\n        s8ptr += 8;\n    }\n}\n\nstatic void quantize_pack4to1_bf16s(const unsigned short* ptr, signed char* s8ptr0, signed char* s8ptr1, signed char* s8ptr2, signed char* s8ptr3, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to1_bf16s %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    float32x4_t _scale = vdupq_n_f32(scale);\n    if (scale_data_size > 1)\n    {\n        _scale = vld1q_f32((const float*)scale_data);\n    }\n\n    int i = 0;\n    for (; i + 7 < elemcount; i += 8)\n    {\n        uint16x8_t _v01 = vld1q_u16(ptr);\n        uint16x8_t _v23 = vld1q_u16(ptr + 8);\n        uint16x8_t _v45 = vld1q_u16(ptr + 16);\n        uint16x8_t _v67 = vld1q_u16(ptr + 24);\n        float32x4_t _v0 = bfloat2float(vget_low_u16(_v01));\n        float32x4_t _v1 = bfloat2float(vget_high_u16(_v01));\n        float32x4_t _v2 = bfloat2float(vget_low_u16(_v23));\n        float32x4_t _v3 = bfloat2float(vget_high_u16(_v23));\n        float32x4_t _v4 = bfloat2float(vget_low_u16(_v45));\n        float32x4_t _v5 = bfloat2float(vget_high_u16(_v45));\n        float32x4_t _v6 = bfloat2float(vget_low_u16(_v67));\n        float32x4_t _v7 = bfloat2float(vget_high_u16(_v67));\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        _v2 = 
vmulq_f32(_v2, _scale);\n        _v3 = vmulq_f32(_v3, _scale);\n        _v4 = vmulq_f32(_v4, _scale);\n        _v5 = vmulq_f32(_v5, _scale);\n        _v6 = vmulq_f32(_v6, _scale);\n        _v7 = vmulq_f32(_v7, _scale);\n        int8x8_t v0 = float2int8(_v0, _v1);\n        int8x8_t v1 = float2int8(_v2, _v3);\n        int8x8_t v2 = float2int8(_v4, _v5);\n        int8x8_t v3 = float2int8(_v6, _v7);\n        int8x16_t v01 = vcombine_s8(v0, v1);\n        int8x16_t v23 = vcombine_s8(v2, v3);\n        int8x16x2_t v0213 = vuzpq_s8(v01, v23);\n        int8x16x2_t v0123 = vuzpq_s8(v0213.val[0], v0213.val[1]);\n        vst1_s8(s8ptr0, vget_low_s8(v0123.val[0]));\n        vst1_s8(s8ptr1, vget_high_s8(v0123.val[0]));\n        vst1_s8(s8ptr2, vget_low_s8(v0123.val[1]));\n        vst1_s8(s8ptr3, vget_high_s8(v0123.val[1]));\n        ptr += 32;\n        s8ptr0 += 8;\n        s8ptr1 += 8;\n        s8ptr2 += 8;\n        s8ptr3 += 8;\n    }\n    for (; i < elemcount; i++)\n    {\n        float32x4_t _v = bfloat2float(vld1_u16(ptr));\n        _v = vmulq_f32(_v, _scale);\n        int8x8_t v = float2int8(_v, _v);\n        s8ptr0[0] = vget_lane_s8(v, 0);\n        s8ptr1[0] = vget_lane_s8(v, 1);\n        s8ptr2[0] = vget_lane_s8(v, 2);\n        s8ptr3[0] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr0 += 1;\n        s8ptr1 += 1;\n        s8ptr2 += 1;\n        s8ptr3 += 1;\n    }\n}\n#endif // __ARM_NEON\n\nint Quantize_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    if (dims == 1)\n    {\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = w * elempack % 8 == 0 ? 
8 : 1;\n        }\n#endif\n        const int outw = w * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const unsigned short* ptr = (const unsigned short*)bottom_blob + i * elempack;\n            signed char* s8ptr = (signed char*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            quantize_bf16s(ptr, s8ptr, scale_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = h * elempack % 8 == 0 ? 8 : 1;\n        }\n#endif\n        const int outh = h * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const unsigned short* ptr0 = bottom_blob.row<const unsigned short>(i * 2);\n                const unsigned short* ptr1 = bottom_blob.row<const unsigned short>(i * 2 + 1);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? 
scale_data.range(i * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8_bf16s(ptr0, ptr1, s8ptr, scale_data_i, w);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const unsigned short* ptr = bottom_blob.row<const unsigned short>(i);\n                signed char* s8ptr0 = top_blob.row<signed char>(i * 4);\n                signed char* s8ptr1 = top_blob.row<signed char>(i * 4 + 1);\n                signed char* s8ptr2 = top_blob.row<signed char>(i * 4 + 2);\n                signed char* s8ptr3 = top_blob.row<signed char>(i * 4 + 3);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_pack4to1_bf16s(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_i, w);\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const unsigned short* ptr = bottom_blob.row<const unsigned short>(i);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_bf16s(ptr, s8ptr, scale_data_i, w, elempack);\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int out_elempack = 1;\n#if __ARM_NEON\n        if (opt.use_packing_layout)\n        {\n            out_elempack = channels * elempack % 8 == 0 ? 
8 : 1;\n        }\n#endif\n        const int outc = channels * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __ARM_NEON\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(q * 2);\n                const unsigned short* ptr1 = bottom_blob.channel(q * 2 + 1);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8_bf16s(ptr0, ptr1, s8ptr, scale_data_q, w * h);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                signed char* s8ptr0 = top_blob.channel(q * 4);\n                signed char* s8ptr1 = top_blob.channel(q * 4 + 1);\n                signed char* s8ptr2 = top_blob.channel(q * 4 + 2);\n                signed char* s8ptr3 = top_blob.channel(q * 4 + 3);\n\n                const Mat scale_data_q = scale_data_size > 1 ? 
scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_pack4to1_bf16s(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_q, w * h);\n            }\n        }\n#endif // __ARM_NEON\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const unsigned short* ptr = bottom_blob.channel(q);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_bf16s(ptr, s8ptr, scale_data_q, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/quantize_arm.h",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_QUANTIZE_ARM_H\n#define LAYER_QUANTIZE_ARM_H\n\n#include \"quantize.h\"\n\nnamespace ncnn {\n\n// ARM-specific Quantize layer: scales the input and converts it to int8.\nclass Quantize_arm : public Quantize\n{\npublic:\n    Quantize_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    // fp16 input, scaling done in fp32 (fp16s) or in fp16 arithmetic (fp16sa)\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    // bf16 input, widened to fp32 before scaling\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_QUANTIZE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/quantize_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"quantize_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic void quantize_fp16s(const __fp16* ptr, signed char* s8ptr, const Mat& scale_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"quantize_fp16s %d   %d %d\", scale_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n    float32x4_t _scale = vdupq_n_f32(scale);\n    if (scale_data_size > 1)\n    {\n        if (elempack == 4)\n        {\n            _scale = vld1q_f32((const float*)scale_data);\n        }\n    }\n\n    int i = 0;\n    for (; i + 15 < size; i += 16)\n    {\n        float16x8_t _v01 = vld1q_f16(ptr);\n        float16x8_t _v23 = vld1q_f16(ptr + 8);\n        float32x4_t _v0 = vcvt_f32_f16(vget_low_f16(_v01));\n        float32x4_t _v1 = vcvt_f32_f16(vget_high_f16(_v01));\n        float32x4_t _v2 = vcvt_f32_f16(vget_low_f16(_v23));\n        float32x4_t _v3 = vcvt_f32_f16(vget_high_f16(_v23));\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        _v2 = vmulq_f32(_v2, _scale);\n        _v3 = vmulq_f32(_v3, _scale);\n        vst1q_s8(s8ptr, vcombine_s8(float2int8(_v0, _v1), float2int8(_v2, _v3)));\n        ptr += 16;\n        s8ptr += 16;\n    }\n    for (; i + 7 < size; i += 8)\n    {\n        float16x8_t _v01 = vld1q_f16(ptr);\n        float32x4_t _v0 = vcvt_f32_f16(vget_low_f16(_v01));\n        float32x4_t _v1 = vcvt_f32_f16(vget_high_f16(_v01));\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        vst1_s8(s8ptr, float2int8(_v0, _v1));\n        ptr += 8;\n        s8ptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        float32x4_t _v = vcvt_f32_f16(vld1_f16(ptr));\n        _v = 
vmulq_f32(_v, _scale);\n        int8x8_t v = float2int8(_v, _v);\n        s8ptr[0] = vget_lane_s8(v, 0);\n        s8ptr[1] = vget_lane_s8(v, 1);\n        s8ptr[2] = vget_lane_s8(v, 2);\n        s8ptr[3] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr += 4;\n    }\n    for (; i < size; i++)\n    {\n        float v = (float)(*ptr) * scale;\n        *s8ptr = float2int8(v);\n        ptr++;\n        s8ptr++;\n    }\n}\n\nstatic void quantize_pack4to8_fp16s(const __fp16* ptr0, const __fp16* ptr1, signed char* s8ptr, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to8_fp16s %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    float32x4_t _scale0 = vdupq_n_f32(scale);\n    float32x4_t _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        _scale0 = vld1q_f32((const float*)scale_data);\n        _scale1 = vld1q_f32((const float*)scale_data + 4);\n    }\n\n    int i = 0;\n    for (; i + 1 < elemcount; i += 2)\n    {\n        float16x8_t _v02 = vld1q_f16(ptr0);\n        float16x8_t _v13 = vld1q_f16(ptr1);\n        float32x4_t _v0 = vcvt_f32_f16(vget_low_f16(_v02));\n        float32x4_t _v1 = vcvt_f32_f16(vget_low_f16(_v13));\n        float32x4_t _v2 = vcvt_f32_f16(vget_high_f16(_v02));\n        float32x4_t _v3 = vcvt_f32_f16(vget_high_f16(_v13));\n        _v0 = vmulq_f32(_v0, _scale0);\n        _v1 = vmulq_f32(_v1, _scale1);\n        _v2 = vmulq_f32(_v2, _scale0);\n        _v3 = vmulq_f32(_v3, _scale1);\n        vst1q_s8(s8ptr, vcombine_s8(float2int8(_v0, _v1), float2int8(_v2, _v3)));\n        ptr0 += 8;\n        ptr1 += 8;\n        s8ptr += 16;\n    }\n    for (; i < elemcount; i++)\n    {\n        float32x4_t _v0 = vcvt_f32_f16(vld1_f16(ptr0));\n        float32x4_t _v1 = vcvt_f32_f16(vld1_f16(ptr1));\n        _v0 = vmulq_f32(_v0, _scale0);\n        _v1 = vmulq_f32(_v1, _scale1);\n        vst1_s8(s8ptr, float2int8(_v0, _v1));\n        ptr0 += 4;\n        
ptr1 += 4;\n        s8ptr += 8;\n    }\n}\n\nstatic void quantize_pack4to1_fp16s(const __fp16* ptr, signed char* s8ptr0, signed char* s8ptr1, signed char* s8ptr2, signed char* s8ptr3, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to1_fp16s %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    float32x4_t _scale = vdupq_n_f32(scale);\n    if (scale_data_size > 1)\n    {\n        _scale = vld1q_f32((const float*)scale_data);\n    }\n\n    int i = 0;\n    for (; i + 7 < elemcount; i += 8)\n    {\n        float16x8_t _v01 = vld1q_f16(ptr);\n        float16x8_t _v23 = vld1q_f16(ptr + 8);\n        float16x8_t _v45 = vld1q_f16(ptr + 16);\n        float16x8_t _v67 = vld1q_f16(ptr + 24);\n        float32x4_t _v0 = vcvt_f32_f16(vget_low_f16(_v01));\n        float32x4_t _v1 = vcvt_f32_f16(vget_high_f16(_v01));\n        float32x4_t _v2 = vcvt_f32_f16(vget_low_f16(_v23));\n        float32x4_t _v3 = vcvt_f32_f16(vget_high_f16(_v23));\n        float32x4_t _v4 = vcvt_f32_f16(vget_low_f16(_v45));\n        float32x4_t _v5 = vcvt_f32_f16(vget_high_f16(_v45));\n        float32x4_t _v6 = vcvt_f32_f16(vget_low_f16(_v67));\n        float32x4_t _v7 = vcvt_f32_f16(vget_high_f16(_v67));\n        _v0 = vmulq_f32(_v0, _scale);\n        _v1 = vmulq_f32(_v1, _scale);\n        _v2 = vmulq_f32(_v2, _scale);\n        _v3 = vmulq_f32(_v3, _scale);\n        _v4 = vmulq_f32(_v4, _scale);\n        _v5 = vmulq_f32(_v5, _scale);\n        _v6 = vmulq_f32(_v6, _scale);\n        _v7 = vmulq_f32(_v7, _scale);\n        int8x8_t v0 = float2int8(_v0, _v1);\n        int8x8_t v1 = float2int8(_v2, _v3);\n        int8x8_t v2 = float2int8(_v4, _v5);\n        int8x8_t v3 = float2int8(_v6, _v7);\n        int8x16_t v01 = vcombine_s8(v0, v1);\n        int8x16_t v23 = vcombine_s8(v2, v3);\n        int8x16x2_t v0213 = vuzpq_s8(v01, v23);\n        int8x16x2_t v0123 = vuzpq_s8(v0213.val[0], v0213.val[1]);\n        
vst1_s8(s8ptr0, vget_low_s8(v0123.val[0]));\n        vst1_s8(s8ptr1, vget_high_s8(v0123.val[0]));\n        vst1_s8(s8ptr2, vget_low_s8(v0123.val[1]));\n        vst1_s8(s8ptr3, vget_high_s8(v0123.val[1]));\n        ptr += 32;\n        s8ptr0 += 8;\n        s8ptr1 += 8;\n        s8ptr2 += 8;\n        s8ptr3 += 8;\n    }\n    for (; i < elemcount; i++)\n    {\n        float32x4_t _v = vcvt_f32_f16(vld1_f16(ptr));\n        _v = vmulq_f32(_v, _scale);\n        int8x8_t v = float2int8(_v, _v);\n        s8ptr0[0] = vget_lane_s8(v, 0);\n        s8ptr1[0] = vget_lane_s8(v, 1);\n        s8ptr2[0] = vget_lane_s8(v, 2);\n        s8ptr3[0] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr0 += 1;\n        s8ptr1 += 1;\n        s8ptr2 += 1;\n        s8ptr3 += 1;\n    }\n}\n\nint Quantize_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    if (dims == 1)\n    {\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            out_elempack = w * elempack % 8 == 0 ? 
8 : 1;\n        }\n        const int outw = w * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const __fp16* ptr = (const __fp16*)bottom_blob + i * elempack;\n            signed char* s8ptr = (signed char*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            quantize_fp16s(ptr, s8ptr, scale_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            out_elempack = h * elempack % 8 == 0 ? 8 : 1;\n        }\n        const int outh = h * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const __fp16* ptr0 = bottom_blob.row<const __fp16>(i * 2);\n                const __fp16* ptr1 = bottom_blob.row<const __fp16>(i * 2 + 1);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? 
scale_data.range(i * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8_fp16s(ptr0, ptr1, s8ptr, scale_data_i, w);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(i);\n                signed char* s8ptr0 = top_blob.row<signed char>(i * 4);\n                signed char* s8ptr1 = top_blob.row<signed char>(i * 4 + 1);\n                signed char* s8ptr2 = top_blob.row<signed char>(i * 4 + 2);\n                signed char* s8ptr3 = top_blob.row<signed char>(i * 4 + 3);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_pack4to1_fp16s(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_i, w);\n            }\n        }\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(i);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_fp16s(ptr, s8ptr, scale_data_i, w, elempack);\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            out_elempack = channels * elempack % 8 == 0 ? 
8 : 1;\n        }\n        const int outc = channels * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const __fp16* ptr0 = bottom_blob.channel(q * 2);\n                const __fp16* ptr1 = bottom_blob.channel(q * 2 + 1);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8_fp16s(ptr0, ptr1, s8ptr, scale_data_q, w * h);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                signed char* s8ptr0 = top_blob.channel(q * 4);\n                signed char* s8ptr1 = top_blob.channel(q * 4 + 1);\n                signed char* s8ptr2 = top_blob.channel(q * 4 + 2);\n                signed char* s8ptr3 = top_blob.channel(q * 4 + 3);\n\n                const Mat scale_data_q = scale_data_size > 1 ? 
scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_pack4to1_fp16s(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_q, w * h);\n            }\n        }\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_fp16s(ptr, s8ptr, scale_data_q, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic void quantize_fp16sa(const __fp16* ptr, signed char* s8ptr, const Mat& scale_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"quantize_fp16sa %d   %d %d\", scale_data_size, elemcount, elempack);\n\n    __fp16 scale = (__fp16)scale_data[0];\n    float16x4_t _scale0 = vdup_n_f16(scale);\n    float16x4_t _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale0 = vcvt_f16_f32(vld1q_f32((const float*)scale_data));\n            _scale1 = vcvt_f16_f32(vld1q_f32((const float*)scale_data + 4));\n        }\n        if (elempack == 4)\n        {\n            _scale0 = vcvt_f16_f32(vld1q_f32((const float*)scale_data));\n            _scale1 = _scale0;\n        }\n    }\n    float16x8_t _scale = vcombine_f16(_scale0, _scale1);\n\n    int i = 0;\n    for (; i + 7 < size; i += 8)\n    {\n        float16x8_t _v = vld1q_f16(ptr);\n        _v = vmulq_f16(_v, _scale);\n        vst1_s8(s8ptr, float2int8(_v));\n        ptr += 8;\n        s8ptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        float16x4_t _v = vld1_f16(ptr);\n        _v = vmul_f16(_v, _scale0);\n        int8x8_t v = 
float2int8(vcombine_f16(_v, _v));\n        s8ptr[0] = vget_lane_s8(v, 0);\n        s8ptr[1] = vget_lane_s8(v, 1);\n        s8ptr[2] = vget_lane_s8(v, 2);\n        s8ptr[3] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr += 4;\n    }\n    for (; i < size; i++)\n    {\n        __fp16 v = *ptr * scale;\n        *s8ptr = float2int8(v);\n        ptr++;\n        s8ptr++;\n    }\n}\n\nstatic void quantize_pack4to1_fp16sa(const __fp16* ptr, signed char* s8ptr0, signed char* s8ptr1, signed char* s8ptr2, signed char* s8ptr3, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to1_fp16sa %d   %d\", scale_data_size, elemcount);\n\n    __fp16 scale = (__fp16)scale_data[0];\n    float16x4_t _scale = vdup_n_f16(scale);\n    if (scale_data_size > 1)\n    {\n        _scale = vcvt_f16_f32(vld1q_f32((const float*)scale_data));\n    }\n    float16x8_t _scale01 = vcombine_f16(_scale, _scale);\n\n    int i = 0;\n    for (; i + 7 < elemcount; i += 8)\n    {\n        float16x8_t _v01 = vld1q_f16(ptr);\n        float16x8_t _v23 = vld1q_f16(ptr + 8);\n        float16x8_t _v45 = vld1q_f16(ptr + 16);\n        float16x8_t _v67 = vld1q_f16(ptr + 24);\n        _v01 = vmulq_f16(_v01, _scale01);\n        _v23 = vmulq_f16(_v23, _scale01);\n        _v45 = vmulq_f16(_v45, _scale01);\n        _v67 = vmulq_f16(_v67, _scale01);\n        int8x8_t v0 = float2int8(_v01);\n        int8x8_t v1 = float2int8(_v23);\n        int8x8_t v2 = float2int8(_v45);\n        int8x8_t v3 = float2int8(_v67);\n        int8x16_t v01 = vcombine_s8(v0, v1);\n        int8x16_t v23 = vcombine_s8(v2, v3);\n        int8x16x2_t v0213 = vuzpq_s8(v01, v23);\n        int8x16x2_t v0123 = vuzpq_s8(v0213.val[0], v0213.val[1]);\n        vst1_s8(s8ptr0, vget_low_s8(v0123.val[0]));\n        vst1_s8(s8ptr1, vget_high_s8(v0123.val[0]));\n        vst1_s8(s8ptr2, vget_low_s8(v0123.val[1]));\n        vst1_s8(s8ptr3, vget_high_s8(v0123.val[1]));\n        ptr += 32;\n  
      s8ptr0 += 8;\n        s8ptr1 += 8;\n        s8ptr2 += 8;\n        s8ptr3 += 8;\n    }\n    for (; i < elemcount; i++)\n    {\n        float16x4_t _v = vld1_f16(ptr);\n        _v = vmul_f16(_v, _scale);\n        int8x8_t v = float2int8(vcombine_f16(_v, _v));\n        s8ptr0[0] = vget_lane_s8(v, 0);\n        s8ptr1[0] = vget_lane_s8(v, 1);\n        s8ptr2[0] = vget_lane_s8(v, 2);\n        s8ptr3[0] = vget_lane_s8(v, 3);\n        ptr += 4;\n        s8ptr0 += 1;\n        s8ptr1 += 1;\n        s8ptr2 += 1;\n        s8ptr3 += 1;\n    }\n}\n\nint Quantize_arm::forward_fp16sa(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    if (dims == 1)\n    {\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            out_elempack = w * elempack % 8 == 0 ? 
8 : 1;\n        }\n        const int outw = w * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const __fp16* ptr = (const __fp16*)bottom_blob + i * elempack;\n            signed char* s8ptr = (signed char*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            quantize_fp16sa(ptr, s8ptr, scale_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            out_elempack = h * elempack % 8 == 0 ? 8 : 1;\n        }\n        const int outh = h * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(i);\n                signed char* s8ptr0 = top_blob.row<signed char>(i * 4);\n                signed char* s8ptr1 = top_blob.row<signed char>(i * 4 + 1);\n                signed char* s8ptr2 = top_blob.row<signed char>(i * 4 + 2);\n                signed char* s8ptr3 = top_blob.row<signed char>(i * 4 + 3);\n\n                const Mat scale_data_i = scale_data_size > 1 ? 
scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_pack4to1_fp16sa(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_i, w);\n            }\n        }\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const __fp16* ptr = bottom_blob.row<const __fp16>(i);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_fp16sa(ptr, s8ptr, scale_data_i, w, elempack);\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            out_elempack = channels * elempack % 8 == 0 ? 8 : 1;\n        }\n        const int outc = channels * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                signed char* s8ptr0 = top_blob.channel(q * 4);\n                signed char* s8ptr1 = top_blob.channel(q * 4 + 1);\n                signed char* s8ptr2 = top_blob.channel(q * 4 + 2);\n                signed char* s8ptr3 = top_blob.channel(q * 4 + 3);\n\n                const Mat scale_data_q = scale_data_size > 1 ? 
scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_pack4to1_fp16sa(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_q, w * h);\n            }\n        }\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const __fp16* ptr = bottom_blob.channel(q);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_fp16sa(ptr, s8ptr, scale_data_q, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/relu_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"relu_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nReLU_arm::ReLU_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint ReLU_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n    if (elembits == 8)\n        return forward_inplace_int8(bottom_top_blob, opt);\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    if (slope == 0.f)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            for (; i + 15 < size; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                    \"fmax   v0.4s, v0.4s, %2.4s     \\n\"\n                    \"fmax   v1.4s, v1.4s, %2.4s     \\n\"\n                    \"fmax   v2.4s, v2.4s, %2.4s     \\n\"\n                    \"fmax   v3.4s, v3.4s, 
%2.4s     \\n\"\n                    \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_zero) // %2\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #512]      \\n\"\n                    \"vldm       %0, {d0-d7}     \\n\"\n                    \"vmax.f32   q0, q0, %q2     \\n\"\n                    \"vmax.f32   q1, q1, %q2     \\n\"\n                    \"vmax.f32   q2, q2, %q2     \\n\"\n                    \"vmax.f32   q3, q3, %q2     \\n\"\n                    \"vstm       %0!, {d0-d7}    \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_zero) // %2\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                float32x4_t _p2 = vld1q_f32(ptr + 8);\n                float32x4_t _p3 = vld1q_f32(ptr + 12);\n                _p0 = vmaxq_f32(_p0, _zero);\n                _p1 = vmaxq_f32(_p1, _zero);\n                _p2 = vmaxq_f32(_p2, _zero);\n                _p3 = vmaxq_f32(_p3, _zero);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                vst1q_f32(ptr + 8, _p2);\n                vst1q_f32(ptr + 12, _p3);\n                ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                _p0 = vmaxq_f32(_p0, _zero);\n                _p1 = vmaxq_f32(_p1, _zero);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                ptr += 8;\n            }\n            for (; i 
+ 3 < size; i += 4)\n            {\n                float32x4_t _ptr = vld1q_f32(ptr);\n                _ptr = vmaxq_f32(_ptr, _zero);\n                vst1q_f32(ptr, _ptr);\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                *ptr = std::max(*ptr, 0.f);\n                ptr++;\n            }\n        }\n    }\n    else\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n            for (; i + 15 < size; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0] \\n\"\n                    \"fcmle  v4.4s, v0.4s, #0        \\n\"\n                    \"fcmle  v5.4s, v1.4s, #0        \\n\"\n                    \"fcmle  v6.4s, v2.4s, #0        \\n\"\n                    \"fcmle  v7.4s, v3.4s, #0        \\n\"\n                    \"fmul   v8.4s, v0.4s, %2.4s     \\n\"\n                    \"fmul   v9.4s, v1.4s, %2.4s     \\n\"\n                    \"fmul   v10.4s, v2.4s, %2.4s    \\n\"\n                    \"fmul   v11.4s, v3.4s, %2.4s    \\n\"\n                    \"bit    v0.16b, v8.16b, v4.16b  \\n\"\n                    \"bit    v1.16b, v9.16b, v5.16b  \\n\"\n                    \"bit    v2.16b, v10.16b, v6.16b \\n\"\n                    \"bit    v3.16b, v11.16b, v7.16b \\n\"\n                    \"st1    {v0.4s, v1.4s, v2.4s, v3.4s}, [%0], #64 \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_slope) // %2\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", 
\"v7\", \"v8\", \"v9\", \"v10\", \"v11\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #512]      \\n\"\n                    \"vldm       %0, {d0-d7}     \\n\"\n                    \"vcle.f32   q4, q0, %q2     \\n\"\n                    \"vcle.f32   q5, q1, %q2     \\n\"\n                    \"vcle.f32   q6, q2, %q2     \\n\"\n                    \"vcle.f32   q7, q3, %q2     \\n\"\n                    \"vmul.f32   q8, q0, %q3     \\n\"\n                    \"vmul.f32   q9, q1, %q3     \\n\"\n                    \"vmul.f32   q10, q2, %q3    \\n\"\n                    \"vmul.f32   q11, q3, %q3    \\n\"\n                    \"vbit.32    q0, q8, q4      \\n\"\n                    \"vbit.32    q1, q9, q5      \\n\"\n                    \"vbit.32    q2, q10, q6     \\n\"\n                    \"vbit.32    q3, q11, q7     \\n\"\n                    \"vstm       %0!, {d0-d7}    \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_zero), // %2\n                    \"w\"(_slope) // %3\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                float32x4_t _p2 = vld1q_f32(ptr + 8);\n                float32x4_t _p3 = vld1q_f32(ptr + 12);\n                uint32x4_t _lemask0 = vcleq_f32(_p0, _zero);\n                uint32x4_t _lemask1 = vcleq_f32(_p1, _zero);\n                uint32x4_t _lemask2 = vcleq_f32(_p2, _zero);\n                uint32x4_t _lemask3 = vcleq_f32(_p3, _zero);\n                float32x4_t _ps0 = vmulq_f32(_p0, _slope);\n                float32x4_t _ps1 = vmulq_f32(_p1, _slope);\n                float32x4_t _ps2 = vmulq_f32(_p2, _slope);\n                float32x4_t _ps3 = vmulq_f32(_p3, _slope);\n           
     _p0 = vbslq_f32(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f32(_lemask1, _ps1, _p1);\n                _p2 = vbslq_f32(_lemask2, _ps2, _p2);\n                _p3 = vbslq_f32(_lemask3, _ps3, _p3);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                vst1q_f32(ptr + 8, _p2);\n                vst1q_f32(ptr + 12, _p3);\n                ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                float32x4_t _p0 = vld1q_f32(ptr);\n                float32x4_t _p1 = vld1q_f32(ptr + 4);\n                uint32x4_t _lemask0 = vcleq_f32(_p0, _zero);\n                uint32x4_t _lemask1 = vcleq_f32(_p1, _zero);\n                float32x4_t _ps0 = vmulq_f32(_p0, _slope);\n                float32x4_t _ps1 = vmulq_f32(_p1, _slope);\n                _p0 = vbslq_f32(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f32(_lemask1, _ps1, _p1);\n                vst1q_f32(ptr, _p0);\n                vst1q_f32(ptr + 4, _p1);\n                ptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                if (*ptr < 0)\n                    *ptr *= slope;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint ReLU_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    if (slope == 
0.f)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            for (; i + 15 < size; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]   \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0] \\n\"\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n                    \"shll   v3.4s, v3.4h, #16       \\n\"\n                    \"fmax   v0.4s, v0.4s, %2.4s     \\n\"\n                    \"fmax   v1.4s, v1.4s, %2.4s     \\n\"\n                    \"fmax   v2.4s, v2.4s, %2.4s     \\n\"\n                    \"fmax   v3.4s, v3.4s, %2.4s     \\n\"\n                    \"shrn   v0.4h, v0.4s, #16       \\n\"\n                    \"shrn   v1.4h, v1.4s, #16       \\n\"\n                    \"shrn   v2.4h, v2.4s, #16       \\n\"\n                    \"shrn   v3.4h, v3.4s, #16       \\n\"\n                    \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_zero) // %2\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #256]      \\n\"\n                    \"vld1.u16   {d4-d7}, [%0]   \\n\"\n                    \"vshll.u16  q0, d4, #16     \\n\"\n                    \"vshll.u16  q1, d5, #16     \\n\"\n                    \"vshll.u16  q2, d6, #16     \\n\"\n                    \"vshll.u16  q3, d7, #16     \\n\"\n                    \"vmax.f32   q0, q0, %q2     \\n\"\n                    
\"vmax.f32   q1, q1, %q2     \\n\"\n                    \"vmax.f32   q2, q2, %q2     \\n\"\n                    \"vmax.f32   q3, q3, %q2     \\n\"\n                    \"vshrn.u32  d0, q0, #16     \\n\"\n                    \"vshrn.u32  d1, q1, #16     \\n\"\n                    \"vshrn.u32  d2, q2, #16     \\n\"\n                    \"vshrn.u32  d3, q3, #16     \\n\"\n                    \"vst1.u16   {d0-d3}, [%0]!  \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_zero) // %2\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8_t _p = vld1q_u16(ptr);\n                uint16x8_t _q = vld1q_u16(ptr + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                _p0 = vmaxq_f32(_p0, _zero);\n                _p1 = vmaxq_f32(_p1, _zero);\n                _p2 = vmaxq_f32(_p2, _zero);\n                _p3 = vmaxq_f32(_p3, _zero);\n                _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n                _q = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n                vst1q_u16(ptr, _p);\n                vst1q_u16(ptr + 8, _q);\n                ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                uint16x8_t _p = vld1q_u16(ptr);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                _p0 = vmaxq_f32(_p0, _zero);\n                _p1 = vmaxq_f32(_p1, _zero);\n                _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n                vst1q_u16(ptr, _p);\n                ptr += 8;\n       
     }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                _p = vmaxq_f32(_p, _zero);\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                float v = bfloat16_to_float32(ptr[0]);\n                if (v < 0.f)\n                    ptr[0] = float32_to_bfloat16(0.f);\n                ptr += 1;\n            }\n        }\n    }\n    else\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            float32x4_t _zero = vdupq_n_f32(0.f);\n            float32x4_t _slope = vdupq_n_f32(slope);\n            for (; i + 15 < size; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #256]   \\n\"\n                    \"ld1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0] \\n\"\n                    \"shll   v0.4s, v0.4h, #16       \\n\"\n                    \"shll   v1.4s, v1.4h, #16       \\n\"\n                    \"shll   v2.4s, v2.4h, #16       \\n\"\n                    \"shll   v3.4s, v3.4h, #16       \\n\"\n                    \"fcmle  v4.4s, v0.4s, #0        \\n\"\n                    \"fcmle  v5.4s, v1.4s, #0        \\n\"\n                    \"fcmle  v6.4s, v2.4s, #0        \\n\"\n                    \"fcmle  v7.4s, v3.4s, #0        \\n\"\n                    \"fmul   v8.4s, v0.4s, %2.4s     \\n\"\n                    \"fmul   v9.4s, v1.4s, %2.4s     \\n\"\n                    \"fmul   v10.4s, v2.4s, %2.4s    \\n\"\n                    \"fmul   v11.4s, v3.4s, %2.4s    \\n\"\n                    \"bit    v0.16b, v8.16b, v4.16b  \\n\"\n                    \"bit    v1.16b, v9.16b, v5.16b  \\n\"\n              
      \"bit    v2.16b, v10.16b, v6.16b \\n\"\n                    \"bit    v3.16b, v11.16b, v7.16b \\n\"\n                    \"shrn   v0.4h, v0.4s, #16       \\n\"\n                    \"shrn   v1.4h, v1.4s, #16       \\n\"\n                    \"shrn   v2.4h, v2.4s, #16       \\n\"\n                    \"shrn   v3.4h, v3.4s, #16       \\n\"\n                    \"st1    {v0.4h, v1.4h, v2.4h, v3.4h}, [%0], #32 \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_slope) // %2\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\");\n#else  // __aarch64__\n                asm volatile(\n                    \"pld        [%0, #256]      \\n\"\n                    \"vld1.u16   {d4-d7}, [%0]   \\n\"\n                    \"vshll.u16  q0, d4, #16     \\n\"\n                    \"vshll.u16  q1, d5, #16     \\n\"\n                    \"vshll.u16  q2, d6, #16     \\n\"\n                    \"vshll.u16  q3, d7, #16     \\n\"\n                    \"vcle.f32   q4, q0, %q2     \\n\"\n                    \"vcle.f32   q5, q1, %q2     \\n\"\n                    \"vcle.f32   q6, q2, %q2     \\n\"\n                    \"vcle.f32   q7, q3, %q2     \\n\"\n                    \"vmul.f32   q8, q0, %q3     \\n\"\n                    \"vmul.f32   q9, q1, %q3     \\n\"\n                    \"vmul.f32   q10, q2, %q3    \\n\"\n                    \"vmul.f32   q11, q3, %q3    \\n\"\n                    \"vbit.32    q0, q8, q4      \\n\"\n                    \"vbit.32    q1, q9, q5      \\n\"\n                    \"vbit.32    q2, q10, q6     \\n\"\n                    \"vbit.32    q3, q11, q7     \\n\"\n                    \"vshrn.u32  d0, q0, #16     \\n\"\n                    \"vshrn.u32  d1, q1, #16     \\n\"\n                    \"vshrn.u32  d2, q2, #16     \\n\"\n                    \"vshrn.u32  d3, q3, #16     \\n\"\n                    \"vst1.u16   
{d0-d3}, [%0]!  \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_zero), // %2\n                    \"w\"(_slope) // %3\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\", \"q9\", \"q10\", \"q11\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n                uint16x8_t _p = vld1q_u16(ptr);\n                uint16x8_t _q = vld1q_u16(ptr + 8);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n                float32x4_t _p2 = bfloat2float(vget_low_u16(_q));\n                float32x4_t _p3 = bfloat2float(vget_high_u16(_q));\n                uint32x4_t _lemask0 = vcleq_f32(_p0, _zero);\n                uint32x4_t _lemask1 = vcleq_f32(_p1, _zero);\n                uint32x4_t _lemask2 = vcleq_f32(_p2, _zero);\n                uint32x4_t _lemask3 = vcleq_f32(_p3, _zero);\n                float32x4_t _ps0 = vmulq_f32(_p0, _slope);\n                float32x4_t _ps1 = vmulq_f32(_p1, _slope);\n                float32x4_t _ps2 = vmulq_f32(_p2, _slope);\n                float32x4_t _ps3 = vmulq_f32(_p3, _slope);\n                _p0 = vbslq_f32(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f32(_lemask1, _ps1, _p1);\n                _p2 = vbslq_f32(_lemask2, _ps2, _p2);\n                _p3 = vbslq_f32(_lemask3, _ps3, _p3);\n                _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n                _q = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n                vst1q_u16(ptr, _p);\n                vst1q_u16(ptr + 8, _q);\n                ptr += 16;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                uint16x8_t _p = vld1q_u16(ptr);\n                float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n                float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n          
      uint32x4_t _lemask0 = vcleq_f32(_p0, _zero);\n                uint32x4_t _lemask1 = vcleq_f32(_p1, _zero);\n                float32x4_t _ps0 = vmulq_f32(_p0, _slope);\n                float32x4_t _ps1 = vmulq_f32(_p1, _slope);\n                _p0 = vbslq_f32(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f32(_lemask1, _ps1, _p1);\n                _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n                vst1q_u16(ptr, _p);\n                ptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                uint32x4_t _lemask = vcleq_f32(_p, _zero);\n                float32x4_t _ps = vmulq_f32(_p, _slope);\n                _p = vbslq_f32(_lemask, _ps, _p);\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                float v = bfloat16_to_float32(ptr[0]);\n                if (v < 0.f)\n                    ptr[0] = float32_to_bfloat16(v * slope);\n                ptr += 1;\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\nint ReLU_arm::forward_inplace_int8(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 8)\n    {\n        if (slope == 0.f)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                signed char* ptr = bottom_top_blob.channel(q);\n\n                int i = 0;\n                int8x16_t _zero = vdupq_n_s8(0);\n                for (; i + 1 < size; i += 2)\n                {\n                    int8x16_t _p = vld1q_s8(ptr);\n                    _p = 
vmaxq_s8(_p, _zero);\n                    vst1q_s8(ptr, _p);\n\n                    ptr += 16;\n                }\n                for (; i < size; i++)\n                {\n                    int8x8_t _p = vld1_s8(ptr);\n                    _p = vmax_s8(_p, vget_low_s8(_zero));\n                    vst1_s8(ptr, _p);\n\n                    ptr += 8;\n                }\n            }\n        }\n        else\n        {\n            // TODO leakyrelu\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (slope == 0.f)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            signed char* ptr = bottom_top_blob.channel(q);\n\n            int i = 0;\n#if __ARM_NEON\n            int8x16_t _zero = vdupq_n_s8(0);\n            for (; i + 15 < size; i += 16)\n            {\n                int8x16_t _p = vld1q_s8(ptr);\n                _p = vmaxq_s8(_p, _zero);\n                vst1q_s8(ptr, _p);\n\n                ptr += 16;\n            }\n#endif // __ARM_NEON\n            for (; i < size; i++)\n            {\n                if (*ptr < 0)\n                    *ptr = 0;\n\n                ptr++;\n            }\n        }\n    }\n    else\n    {\n        // TODO leakyrelu\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/relu_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_RELU_ARM_H\n#define LAYER_RELU_ARM_H\n\n#include \"relu.h\"\n\nnamespace ncnn {\n\nclass ReLU_arm : public ReLU\n{\npublic:\n    ReLU_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n    int forward_inplace_int8(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_RELU_ARM_H\n"
  },
  {
    "path": "src/layer/arm/relu_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"relu_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint ReLU_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    if (slope == 0.f)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            float16x8_t _zero = vdupq_n_f16((__fp16)0.f);\n\n            int i = 0;\n            for (; i + 31 < size; i += 32)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                    \"fmax   v0.8h, v0.8h, %2.8h     \\n\"\n                    \"fmax   v1.8h, v1.8h, %2.8h     \\n\"\n                    \"fmax   v2.8h, v2.8h, %2.8h     \\n\"\n                    \"fmax   v3.8h, v3.8h, %2.8h     \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_zero) // %2\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x8_t _p0 = vld1q_f16(ptr);\n                float16x8_t _p1 = vld1q_f16(ptr + 8);\n                float16x8_t _p2 = vld1q_f16(ptr + 16);\n                float16x8_t _p3 = vld1q_f16(ptr + 24);\n                _p0 = vmaxq_f16(_p0, _zero);\n                _p1 = vmaxq_f16(_p1, _zero);\n                _p2 = vmaxq_f16(_p2, 
_zero);\n                _p3 = vmaxq_f16(_p3, _zero);\n                vst1q_f16(ptr, _p0);\n                vst1q_f16(ptr + 8, _p1);\n                vst1q_f16(ptr + 16, _p2);\n                vst1q_f16(ptr + 24, _p3);\n                ptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i + 15 < size; i += 16)\n            {\n                float16x8_t _p0 = vld1q_f16(ptr);\n                float16x8_t _p1 = vld1q_f16(ptr + 8);\n                _p0 = vmaxq_f16(_p0, _zero);\n                _p1 = vmaxq_f16(_p1, _zero);\n                vst1q_f16(ptr, _p0);\n                vst1q_f16(ptr + 8, _p1);\n                ptr += 16;\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                _p = vmaxq_f16(_p, _zero);\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                _p = vmax_f16(_p, vget_low_f16(_zero));\n                vst1_f16(ptr, _p);\n                ptr += 4;\n            }\n            for (; i < size; i++)\n            {\n                __fp16 v = ptr[0];\n                if (v < (__fp16)0.f)\n                    ptr[0] = (__fp16)0.f;\n\n                ptr += 1;\n            }\n        }\n    }\n    else\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            float16x8_t _zero = vdupq_n_f16((__fp16)0.f);\n            float16x8_t _slope = vdupq_n_f16((__fp16)slope);\n\n            int i = 0;\n            for (; i + 31 < size; i += 32)\n            {\n#if NCNN_GNU_INLINE_ASM\n                asm volatile(\n                    \"prfm   pldl1keep, [%0, #512]   \\n\"\n                    \"ld1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0] \\n\"\n                    \"fcmle  v4.8h, 
v0.8h, #0        \\n\"\n                    \"fcmle  v5.8h, v1.8h, #0        \\n\"\n                    \"fcmle  v6.8h, v2.8h, #0        \\n\"\n                    \"fcmle  v7.8h, v3.8h, #0        \\n\"\n                    \"fmul   v8.8h, v0.8h, %2.8h     \\n\"\n                    \"fmul   v9.8h, v1.8h, %2.8h     \\n\"\n                    \"fmul   v10.8h, v2.8h, %2.8h    \\n\"\n                    \"fmul   v11.8h, v3.8h, %2.8h    \\n\"\n                    \"bit    v0.16b, v8.16b, v4.16b  \\n\"\n                    \"bit    v1.16b, v9.16b, v5.16b  \\n\"\n                    \"bit    v2.16b, v10.16b, v6.16b \\n\"\n                    \"bit    v3.16b, v11.16b, v7.16b \\n\"\n                    \"st1    {v0.8h, v1.8h, v2.8h, v3.8h}, [%0], #64 \\n\"\n                    : \"=r\"(ptr) // %0\n                    : \"0\"(ptr),\n                    \"w\"(_slope) // %2\n                    : \"memory\", \"v0\", \"v1\", \"v2\", \"v3\", \"v4\", \"v5\", \"v6\", \"v7\", \"v8\", \"v9\", \"v10\", \"v11\");\n#else  // NCNN_GNU_INLINE_ASM\n                float16x8_t _p0 = vld1q_f16(ptr);\n                float16x8_t _p1 = vld1q_f16(ptr + 8);\n                float16x8_t _p2 = vld1q_f16(ptr + 16);\n                float16x8_t _p3 = vld1q_f16(ptr + 24);\n                uint16x8_t _lemask0 = vcleq_f16(_p0, _zero);\n                uint16x8_t _lemask1 = vcleq_f16(_p1, _zero);\n                uint16x8_t _lemask2 = vcleq_f16(_p2, _zero);\n                uint16x8_t _lemask3 = vcleq_f16(_p3, _zero);\n                float16x8_t _ps0 = vmulq_f16(_p0, _slope);\n                float16x8_t _ps1 = vmulq_f16(_p1, _slope);\n                float16x8_t _ps2 = vmulq_f16(_p2, _slope);\n                float16x8_t _ps3 = vmulq_f16(_p3, _slope);\n                _p0 = vbslq_f16(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f16(_lemask1, _ps1, _p1);\n                _p2 = vbslq_f16(_lemask2, _ps2, _p2);\n                _p3 = vbslq_f16(_lemask3, _ps3, _p3);\n                vst1q_f16(ptr, 
_p0);\n                vst1q_f16(ptr + 8, _p1);\n                vst1q_f16(ptr + 16, _p2);\n                vst1q_f16(ptr + 24, _p3);\n                ptr += 32;\n#endif // NCNN_GNU_INLINE_ASM\n            }\n            for (; i + 15 < size; i += 16)\n            {\n                float16x8_t _p0 = vld1q_f16(ptr);\n                float16x8_t _p1 = vld1q_f16(ptr + 8);\n                uint16x8_t _lemask0 = vcleq_f16(_p0, _zero);\n                uint16x8_t _lemask1 = vcleq_f16(_p1, _zero);\n                float16x8_t _ps0 = vmulq_f16(_p0, _slope);\n                float16x8_t _ps1 = vmulq_f16(_p1, _slope);\n                _p0 = vbslq_f16(_lemask0, _ps0, _p0);\n                _p1 = vbslq_f16(_lemask1, _ps1, _p1);\n                vst1q_f16(ptr, _p0);\n                vst1q_f16(ptr + 8, _p1);\n                ptr += 16;\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                uint16x8_t _lemask = vcleq_f16(_p, _zero);\n                float16x8_t _ps = vmulq_f16(_p, _slope);\n                _p = vbslq_f16(_lemask, _ps, _p);\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                uint16x4_t _lemask = vcle_f16(_p, vget_low_f16(_zero));\n                float16x4_t _ps = vmul_f16(_p, vget_low_f16(_slope));\n                _p = vbsl_f16(_lemask, _ps, _p);\n                vst1_f16(ptr, _p);\n                ptr += 4;\n            }\n            for (; i < size; i++)\n            {\n                __fp16 v = ptr[0];\n                if (v < (__fp16)0.f)\n                    ptr[0] = v * (__fp16)slope;\n\n                ptr += 1;\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/requantize_arm.cpp",
    "content": "// Copyright 2019 BUG1989\n// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"requantize_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\nRequantize_arm::Requantize_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#endif // __ARM_NEON\n}\n\nstatic void requantize_relu(const int* intptr, signed char* ptr, const Mat& scale_in_data, const Mat& bias_data, const Mat& scale_out_data, int elemcount, int elempack)\n{\n    const int scale_in_data_size = scale_in_data.w;\n    const int bias_data_size = bias_data.w;\n    const int scale_out_data_size = scale_out_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"requantize_relu %d %d %d   %d %d\", scale_in_data_size, bias_data_size, scale_out_data_size, elemcount, elempack);\n\n    // int8(relu(v * scale_in) * scale_out)\n    // int8_relu(v * (scale_in * scale_out))\n\n    // int8(relu(v * scale_in + bias) * scale_out)\n    // int8_relu(v * (scale_in * scale_out) + (bias * scale_out))\n\n    float scale_in = scale_in_data[0];\n#if __ARM_NEON\n    float32x4_t _scale_in0 = vdupq_n_f32(scale_in);\n    float32x4_t _scale_in1 = _scale_in0;\n    if (scale_in_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_in0 = vld1q_f32((const float*)scale_in_data);\n            _scale_in1 = vld1q_f32((const float*)scale_in_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    float scale_out = scale_out_data[0];\n#if __ARM_NEON\n    float32x4_t _scale_out0 = vdupq_n_f32(scale_out);\n    float32x4_t _scale_out1 = _scale_out0;\n    if (scale_out_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_out0 = vld1q_f32((const float*)scale_out_data);\n            _scale_out1 = vld1q_f32((const float*)scale_out_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    float scale = scale_in * scale_out;\n#if 
__ARM_NEON\n    float32x4_t _scale0 = vmulq_f32(_scale_in0, _scale_out0);\n    float32x4_t _scale1 = vmulq_f32(_scale_in1, _scale_out1);\n#endif // __ARM_NEON\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n            _v0 = vmulq_f32(_v0, _scale0);\n            _v1 = vmulq_f32(_v1, _scale1);\n            vst1_s8(ptr, float2int8relu(_v0, _v1));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n            _v = vmulq_f32(_v, _scale0);\n            int8x8_t v = float2int8relu(_v, _v);\n            ptr[0] = vget_lane_s8(v, 0);\n            ptr[1] = vget_lane_s8(v, 1);\n            ptr[2] = vget_lane_s8(v, 2);\n            ptr[3] = vget_lane_s8(v, 3);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale;\n            if (v < 0) v = 0;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __ARM_NEON\n        float32x4_t _bias0 = vdupq_n_f32(bias);\n        float32x4_t _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 8)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = vld1q_f32((const float*)bias_data + 4);\n            }\n        }\n#endif // __ARM_NEON\n\n        bias = bias * scale_out;\n#if __ARM_NEON\n        _bias0 = vmulq_f32(_bias0, _scale_out0);\n        _bias1 = vmulq_f32(_bias1, _scale_out1);\n#endif // __ARM_NEON\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = 
vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n#if __aarch64__\n            _v0 = vfmaq_f32(_bias0, _v0, _scale0);\n            _v1 = vfmaq_f32(_bias1, _v1, _scale1);\n#else  // __aarch64__\n            _v0 = vmlaq_f32(_bias0, _v0, _scale0);\n            _v1 = vmlaq_f32(_bias1, _v1, _scale1);\n#endif // __aarch64__\n            vst1_s8(ptr, float2int8relu(_v0, _v1));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n#if __aarch64__\n            _v = vfmaq_f32(_bias0, _v, _scale0);\n#else  // __aarch64__\n            _v = vmlaq_f32(_bias0, _v, _scale0);\n#endif // __aarch64__\n            int8x8_t v = float2int8relu(_v, _v);\n            ptr[0] = vget_lane_s8(v, 0);\n            ptr[1] = vget_lane_s8(v, 1);\n            ptr[2] = vget_lane_s8(v, 2);\n            ptr[3] = vget_lane_s8(v, 3);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale + bias;\n            if (v < 0) v = 0;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nstatic void requantize_leakyrelu(const int* intptr, signed char* ptr, const Mat& scale_in_data, const Mat& bias_data, const Mat& scale_out_data, float slope, int elemcount, int elempack)\n{\n    const int scale_in_data_size = scale_in_data.w;\n    const int bias_data_size = bias_data.w;\n    const int scale_out_data_size = scale_out_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"requantize_leakyrelu %d %d %d   %d %d\", scale_in_data_size, bias_data_size, scale_out_data_size, elemcount, elempack);\n\n    // int8(leakyrelu(v * scale_in, slope) * scale_out)\n    // int8_leakyrelu(v * (scale_in * scale_out), slope)\n\n    // int8(leakyrelu(v * scale_in + bias, slope) * scale_out)\n  
  // int8_leakyrelu(v * (scale_in * scale_out) + (bias * scale_out), slope)\n\n    float scale_in = scale_in_data[0];\n#if __ARM_NEON\n    float32x4_t _scale_in0 = vdupq_n_f32(scale_in);\n    float32x4_t _scale_in1 = _scale_in0;\n    if (scale_in_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_in0 = vld1q_f32((const float*)scale_in_data);\n            _scale_in1 = vld1q_f32((const float*)scale_in_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    float scale_out = scale_out_data[0];\n#if __ARM_NEON\n    float32x4_t _scale_out0 = vdupq_n_f32(scale_out);\n    float32x4_t _scale_out1 = _scale_out0;\n    if (scale_out_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_out0 = vld1q_f32((const float*)scale_out_data);\n            _scale_out1 = vld1q_f32((const float*)scale_out_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    float scale = scale_in * scale_out;\n#if __ARM_NEON\n    float32x4_t _scale0 = vmulq_f32(_scale_in0, _scale_out0);\n    float32x4_t _scale1 = vmulq_f32(_scale_in1, _scale_out1);\n    float32x4_t _slope = vdupq_n_f32(slope);\n#endif // __ARM_NEON\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n            _v0 = vmulq_f32(_v0, _scale0);\n            _v1 = vmulq_f32(_v1, _scale1);\n            vst1_s8(ptr, float2int8leakyrelu(_v0, _v1, _slope));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n            _v = vmulq_f32(_v, _scale0);\n            int8x8_t v = float2int8leakyrelu(_v, _v, _slope);\n            ptr[0] = vget_lane_s8(v, 0);\n            ptr[1] = vget_lane_s8(v, 1);\n            ptr[2] = vget_lane_s8(v, 2);\n            ptr[3] = 
vget_lane_s8(v, 3);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale;\n            if (v < 0) v *= slope;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __ARM_NEON\n        float32x4_t _bias0 = vdupq_n_f32(bias);\n        float32x4_t _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 8)\n            {\n                _bias0 = vld1q_f32((const float*)bias_data);\n                _bias1 = vld1q_f32((const float*)bias_data + 4);\n            }\n        }\n#endif // __ARM_NEON\n\n        bias = bias * scale_out;\n#if __ARM_NEON\n        _bias0 = vmulq_f32(_bias0, _scale_out0);\n        _bias1 = vmulq_f32(_bias1, _scale_out1);\n#endif // __ARM_NEON\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n#if __aarch64__\n            _v0 = vfmaq_f32(_bias0, _v0, _scale0);\n            _v1 = vfmaq_f32(_bias1, _v1, _scale1);\n#else  // __aarch64__\n            _v0 = vmlaq_f32(_bias0, _v0, _scale0);\n            _v1 = vmlaq_f32(_bias1, _v1, _scale1);\n#endif // __aarch64__\n            vst1_s8(ptr, float2int8leakyrelu(_v0, _v1, _slope));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n#if __aarch64__\n            _v = vfmaq_f32(_bias0, _v, _scale0);\n#else  // __aarch64__\n            _v = vmlaq_f32(_bias0, _v, _scale0);\n#endif // __aarch64__\n            int8x8_t v = float2int8leakyrelu(_v, _v, _slope);\n            ptr[0] = vget_lane_s8(v, 0);\n            ptr[1] = vget_lane_s8(v, 1);\n            ptr[2] = vget_lane_s8(v, 2);\n   
         ptr[3] = vget_lane_s8(v, 3);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale + bias;\n            if (v < 0) v *= slope;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nstatic void requantize(const int* intptr, signed char* ptr, const Mat& scale_in_data, const Mat& bias_data, const Mat& scale_out_data, int activation_type, const Mat& activation_params, int elemcount, int elempack)\n{\n    if (activation_type == 1)\n    {\n        requantize_relu(intptr, ptr, scale_in_data, bias_data, scale_out_data, elemcount, elempack);\n        return;\n    }\n\n    if (activation_type == 2 && activation_params[0] > 0.f)\n    {\n        const float slope = activation_params[0];\n        requantize_leakyrelu(intptr, ptr, scale_in_data, bias_data, scale_out_data, slope, elemcount, elempack);\n        return;\n    }\n\n    const int scale_in_data_size = scale_in_data.w;\n    const int bias_data_size = bias_data.w;\n    const int scale_out_data_size = scale_out_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"requantize %d %d %d   %d %d\", scale_in_data_size, bias_data_size, scale_out_data_size, elemcount, elempack);\n\n    float scale_in = scale_in_data[0];\n#if __ARM_NEON\n    float32x4_t _scale_in0 = vdupq_n_f32(scale_in);\n    float32x4_t _scale_in1 = _scale_in0;\n    if (scale_in_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_in0 = vld1q_f32((const float*)scale_in_data);\n            _scale_in1 = vld1q_f32((const float*)scale_in_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    float scale_out = scale_out_data[0];\n#if __ARM_NEON\n    float32x4_t _scale_out0 = vdupq_n_f32(scale_out);\n    float32x4_t _scale_out1 = _scale_out0;\n    if (scale_out_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_out0 = 
vld1q_f32((const float*)scale_out_data);\n            _scale_out1 = vld1q_f32((const float*)scale_out_data + 4);\n        }\n    }\n#endif // __ARM_NEON\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n            _v0 = vmulq_f32(_v0, _scale_in0);\n            _v1 = vmulq_f32(_v1, _scale_in1);\n            _v0 = activation_ps(_v0, activation_type, activation_params);\n            _v1 = activation_ps(_v1, activation_type, activation_params);\n            _v0 = vmulq_f32(_v0, _scale_out0);\n            _v1 = vmulq_f32(_v1, _scale_out1);\n            vst1_s8(ptr, float2int8(_v0, _v1));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n            _v = vmulq_f32(_v, _scale_in0);\n            _v = activation_ps(_v, activation_type, activation_params);\n            _v = vmulq_f32(_v, _scale_out0);\n            int8x8_t v = float2int8(_v, _v);\n            ptr[0] = vget_lane_s8(v, 0);\n            ptr[1] = vget_lane_s8(v, 1);\n            ptr[2] = vget_lane_s8(v, 2);\n            ptr[3] = vget_lane_s8(v, 3);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale_in;\n            v = activation_ss(v, activation_type, activation_params);\n            *ptr = float2int8(v * scale_out);\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __ARM_NEON\n        float32x4_t _bias0 = vdupq_n_f32(bias);\n        float32x4_t _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 8)\n            {\n                _bias0 = vld1q_f32((const 
float*)bias_data);\n                _bias1 = vld1q_f32((const float*)bias_data + 4);\n            }\n        }\n#endif // __ARM_NEON\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _v0 = vcvtq_f32_s32(vld1q_s32(intptr));\n            float32x4_t _v1 = vcvtq_f32_s32(vld1q_s32(intptr + 4));\n#if __aarch64__\n            _v0 = vfmaq_f32(_bias0, _v0, _scale_in0);\n            _v1 = vfmaq_f32(_bias1, _v1, _scale_in1);\n#else  // __aarch64__\n            _v0 = vmlaq_f32(_bias0, _v0, _scale_in0);\n            _v1 = vmlaq_f32(_bias1, _v1, _scale_in1);\n#endif // __aarch64__\n            _v0 = activation_ps(_v0, activation_type, activation_params);\n            _v1 = activation_ps(_v1, activation_type, activation_params);\n            _v0 = vmulq_f32(_v0, _scale_out0);\n            _v1 = vmulq_f32(_v1, _scale_out1);\n            vst1_s8(ptr, float2int8(_v0, _v1));\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _v = vcvtq_f32_s32(vld1q_s32(intptr));\n#if __aarch64__\n            _v = vfmaq_f32(_bias0, _v, _scale_in0);\n#else  // __aarch64__\n            _v = vmlaq_f32(_bias0, _v, _scale_in0);\n#endif // __aarch64__\n            _v = activation_ps(_v, activation_type, activation_params);\n            _v = vmulq_f32(_v, _scale_out0);\n            int8x8_t v = float2int8(_v, _v);\n            ptr[0] = vget_lane_s8(v, 0);\n            ptr[1] = vget_lane_s8(v, 1);\n            ptr[2] = vget_lane_s8(v, 2);\n            ptr[3] = vget_lane_s8(v, 3);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale_in + bias;\n            v = activation_ss(v, activation_type, activation_params);\n            *ptr = float2int8(v * scale_out);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nint Requantize_arm::forward(const 
Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n    const size_t out_elemsize = elempack * 1u;\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const int* intptr = (const int*)bottom_blob + i * elempack;\n            signed char* ptr = (signed char*)top_blob + i * elempack;\n\n            // assert scale_in_data_size == 1\n            // assert bias_data_size == 0 || bias_data_size == 1\n            // assert scale_out_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            requantize(intptr, ptr, scale_in_data, bias_data, scale_out_data, activation_type, activation_params, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            const int* intptr = bottom_blob.row<const int>(i);\n            signed char* ptr = top_blob.row<signed char>(i);\n\n            const Mat scale_in_data_i = scale_in_data_size > 1 ? scale_in_data.range(i * elempack, elempack) : scale_in_data;\n            const Mat bias_data_i = bias_data_size > 1 ? bias_data.range(i * elempack, elempack) : bias_data;\n            const Mat scale_out_data_i = scale_out_data_size > 1 ? 
scale_out_data.range(i * elempack, elempack) : scale_out_data;\n\n            requantize(intptr, ptr, scale_in_data_i, bias_data_i, scale_out_data_i, activation_type, activation_params, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int* intptr = bottom_blob.channel(q);\n            signed char* ptr = top_blob.channel(q);\n\n            const Mat scale_in_data_q = scale_in_data_size > 1 ? scale_in_data.range(q * elempack, elempack) : scale_in_data;\n            const Mat bias_data_q = bias_data_size > 1 ? bias_data.range(q * elempack, elempack) : bias_data;\n            const Mat scale_out_data_q = scale_out_data_size > 1 ? scale_out_data.range(q * elempack, elempack) : scale_out_data;\n\n            requantize(intptr, ptr, scale_in_data_q, bias_data_q, scale_out_data_q, activation_type, activation_params, w * h, elempack);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/requantize_arm.h",
    "content": "// Copyright 2019 BUG1989\n// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_REQUANTIZE_ARM_H\n#define LAYER_REQUANTIZE_ARM_H\n\n#include \"requantize.h\"\n\nnamespace ncnn {\n\nclass Requantize_arm : public Requantize\n{\npublic:\n    Requantize_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_REQUANTIZE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/reshape_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"reshape_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nReshape_arm::Reshape_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Reshape_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    Mat& top_blob = top_blobs[0];\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    // resolve out shape\n    int outw = w;\n    int outh = h;\n    int outd = d;\n    int outc = c;\n\n    if (!shape_expr.empty())\n    {\n        int er = eval_shape_expr(bottom_blobs, outw, outh, outd, outc);\n        if (er != 0)\n            return -1;\n    }\n\n    if (ndim == 1)\n    {\n        // flatten\n        flatten(bottom_blob, top_blob, opt);\n        if (top_blob.empty())\n            return -100;\n\n        return 0;\n    }\n\n    const int dims = bottom_blob.dims;\n    const int elempack = bottom_blob.elempack;\n    const size_t elemsize = bottom_blob.elemsize;\n\n    const int total = bottom_blob.w * bottom_blob.h * bottom_blob.d * bottom_blob.c * elempack;\n\n    if (ndim == 2)\n    {\n        if (outw == 0)\n            outw = dims == 1 ? bottom_blob.w * elempack : bottom_blob.w;\n        if (outh == 0)\n            outh = dims == 2 ? 
bottom_blob.h * elempack : bottom_blob.h;\n\n        if (outw == -1)\n            outw = total / outh;\n        if (outh == -1)\n            outh = total / outw;\n\n        int out_elempack = opt.use_packing_layout && outh % 4 == 0 ? 4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if (dims == 2 && bottom_blob.h * elempack == outh && elempack == out_elempack)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        if (out_elempack == 1)\n        {\n            // flatten\n            flatten(bottom_blob, top_blob, opt);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = 2;\n            top_blob.w = outw;\n            top_blob.h = outh;\n            top_blob.cstep = top_blob.cstep * top_blob.elempack;\n            top_blob.elemsize = out_elemsize;\n            top_blob.elempack = out_elempack;\n\n            return 0;\n        }\n\n        // flatten\n        Mat bottom_blob_flattened = bottom_blob;\n        {\n            Option opt_flatten = opt;\n            opt_flatten.blob_allocator = opt.workspace_allocator;\n\n            flatten(bottom_blob, bottom_blob_flattened, opt_flatten);\n            if (bottom_blob_flattened.empty())\n                return -100;\n        }\n\n        top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        // assert out_elempack == 4\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < top_blob.h; i++)\n        {\n            const float* ptr0 = (const float*)bottom_blob_flattened + outw * i * 4;\n            const float* ptr1 = (const float*)bottom_blob_flattened + outw * (i * 4 + 1);\n            const float* ptr2 = (const float*)bottom_blob_flattened + outw * (i * 4 + 2);\n            const float* ptr3 = (const float*)bottom_blob_flattened + outw * (i * 4 + 3);\n           
 float* outptr = top_blob.row(i);\n\n            int j = 0;\n#if __ARM_NEON\n            for (; j + 3 < outw; j += 4)\n            {\n                float32x4x4_t _v4;\n                _v4.val[0] = vld1q_f32(ptr0);\n                _v4.val[1] = vld1q_f32(ptr1);\n                _v4.val[2] = vld1q_f32(ptr2);\n                _v4.val[3] = vld1q_f32(ptr3);\n\n                vst4q_f32(outptr, _v4);\n\n                ptr0 += 4;\n                ptr1 += 4;\n                ptr2 += 4;\n                ptr3 += 4;\n                outptr += 16;\n            }\n#endif\n            for (; j < outw; j++)\n            {\n                outptr[0] = *ptr0++;\n                outptr[1] = *ptr1++;\n                outptr[2] = *ptr2++;\n                outptr[3] = *ptr3++;\n\n                outptr += 4;\n            }\n        }\n    }\n\n    if (ndim == 3 || ndim == 4)\n    {\n        if (ndim == 3)\n        {\n            if (outw == 0)\n                outw = dims == 1 ? bottom_blob.w * elempack : bottom_blob.w;\n            if (outh == 0)\n                outh = dims == 2 ? bottom_blob.h * elempack : bottom_blob.h;\n            if (outc == 0)\n                outc = dims == 3 ? bottom_blob.c * elempack : bottom_blob.c;\n\n            if (outw == -1)\n                outw = total / outc / outh;\n            if (outh == -1)\n                outh = total / outc / outw;\n            if (outc == -1)\n                outc = total / outh / outw;\n\n            outd = 1;\n        }\n        else // if (ndim == 4)\n        {\n            if (outw == 0)\n                outw = dims == 1 ? bottom_blob.w * elempack : bottom_blob.w;\n            if (outh == 0)\n                outh = dims == 2 ? bottom_blob.h * elempack : bottom_blob.h;\n            if (outd == 0)\n                outd = bottom_blob.d;\n            if (outc == 0)\n                outc = (dims == 3 || dims == 4) ? 
bottom_blob.c * elempack : bottom_blob.c;\n\n            if (outw == -1)\n                outw = total / outc / outd / outh;\n            if (outh == -1)\n                outh = total / outc / outd / outw;\n            if (outd == -1)\n                outd = total / outc / outh / outw;\n            if (outc == -1)\n                outc = total / outd / outh / outw;\n        }\n\n        int out_elempack = opt.use_packing_layout && outc % 4 == 0 ? 4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if ((dims == 3 || dims == 4) && bottom_blob.c * elempack == outc && elempack == out_elempack)\n        {\n            top_blob = bottom_blob;\n            top_blob.dims = ndim;\n            top_blob.w = outw;\n            top_blob.h = outh;\n            top_blob.d = outd;\n            return 0;\n        }\n\n        // flatten\n        Mat bottom_blob_flattened = bottom_blob;\n        {\n            Option opt_flatten = opt;\n            opt_flatten.blob_allocator = opt.workspace_allocator;\n\n            flatten(bottom_blob, bottom_blob_flattened, opt_flatten);\n            if (bottom_blob_flattened.empty())\n                return -100;\n        }\n\n        if (ndim == 3)\n        {\n            top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        }\n        else // if (ndim == 4)\n        {\n            top_blob.create(outw, outh, outd, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        }\n        if (top_blob.empty())\n            return -100;\n\n        int size = top_blob.w * top_blob.h * top_blob.d;\n\n        if (out_elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < top_blob.c; q++)\n            {\n                const float* ptr0 = (const float*)bottom_blob_flattened + size * q * 4;\n                const float* ptr1 = (const float*)bottom_blob_flattened + size * (q * 4 + 
1);\n                const float* ptr2 = (const float*)bottom_blob_flattened + size * (q * 4 + 2);\n                const float* ptr3 = (const float*)bottom_blob_flattened + size * (q * 4 + 3);\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4x4_t _v4;\n                    _v4.val[0] = vld1q_f32(ptr0);\n                    _v4.val[1] = vld1q_f32(ptr1);\n                    _v4.val[2] = vld1q_f32(ptr2);\n                    _v4.val[3] = vld1q_f32(ptr3);\n\n                    vst4q_f32(outptr, _v4);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    ptr3 += 4;\n                    outptr += 16;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    outptr[0] = *ptr0++;\n                    outptr[1] = *ptr1++;\n                    outptr[2] = *ptr2++;\n                    outptr[3] = *ptr3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n\n        if (out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < top_blob.c; q++)\n            {\n                const float* ptr = (const float*)bottom_blob_flattened + size * q;\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    float32x4_t _v = vld1q_f32(ptr);\n                    vst1q_f32(outptr, _v);\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr++ = *ptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Reshape_arm::forward_bf16s_fp16s(const std::vector<Mat>& 
bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    Mat& top_blob = top_blobs[0];\n\n    // resolve out shape\n    int outw = w;\n    int outh = h;\n    int outd = d;\n    int outc = c;\n\n    if (!shape_expr.empty())\n    {\n        int er = eval_shape_expr(bottom_blobs, outw, outh, outd, outc);\n        if (er != 0)\n            return -1;\n    }\n\n    if (ndim == 1)\n    {\n        // flatten\n        flatten(bottom_blob, top_blob, opt);\n        if (top_blob.empty())\n            return -100;\n\n        return 0;\n    }\n\n    const int dims = bottom_blob.dims;\n    const int elempack = bottom_blob.elempack;\n    const size_t elemsize = bottom_blob.elemsize;\n\n    const int total = bottom_blob.w * bottom_blob.h * bottom_blob.d * bottom_blob.c * elempack;\n\n    if (ndim == 2)\n    {\n        if (outw == 0)\n            outw = dims == 1 ? bottom_blob.w * elempack : bottom_blob.w;\n        if (outh == 0)\n            outh = dims == 2 ? bottom_blob.h * elempack : bottom_blob.h;\n\n        if (outw == -1)\n            outw = total / outh;\n        if (outh == -1)\n            outh = total / outw;\n\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n#if NCNN_ARM82\n            out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && outh % 8 == 0 ? 8 : outh % 4 == 0 ? 4 : 1;\n#else\n            out_elempack = outh % 4 == 0 ? 
4 : 1;\n#endif\n        }\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if (dims == 2 && bottom_blob.h * elempack == outh && elempack == out_elempack)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        if (out_elempack == 1)\n        {\n            // flatten\n            flatten(bottom_blob, top_blob, opt);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = 2;\n            top_blob.w = outw;\n            top_blob.h = outh;\n            top_blob.cstep = top_blob.cstep * top_blob.elempack;\n            top_blob.elemsize = out_elemsize;\n            top_blob.elempack = out_elempack;\n\n            return 0;\n        }\n\n        // flatten\n        Mat bottom_blob_flattened = bottom_blob;\n        {\n            Option opt_flatten = opt;\n            opt_flatten.blob_allocator = opt.workspace_allocator;\n\n            flatten(bottom_blob, bottom_blob_flattened, opt_flatten);\n            if (bottom_blob_flattened.empty())\n                return -100;\n        }\n\n        top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if NCNN_ARM82\n        if (out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < top_blob.h; i++)\n            {\n                const unsigned short* ptr0 = (const unsigned short*)bottom_blob_flattened + outw * i * 8;\n                const unsigned short* ptr1 = (const unsigned short*)bottom_blob_flattened + outw * (i * 8 + 1);\n                const unsigned short* ptr2 = (const unsigned short*)bottom_blob_flattened + outw * (i * 8 + 2);\n                const unsigned short* ptr3 = (const unsigned short*)bottom_blob_flattened + outw * (i * 8 + 3);\n                const unsigned short* ptr4 = (const unsigned short*)bottom_blob_flattened + outw * (i * 8 
+ 4);\n                const unsigned short* ptr5 = (const unsigned short*)bottom_blob_flattened + outw * (i * 8 + 5);\n                const unsigned short* ptr6 = (const unsigned short*)bottom_blob_flattened + outw * (i * 8 + 6);\n                const unsigned short* ptr7 = (const unsigned short*)bottom_blob_flattened + outw * (i * 8 + 7);\n                unsigned short* outptr = top_blob.row<unsigned short>(i);\n\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    uint16x8_t _p01 = vcombine_u16(vld1_u16(ptr0), vld1_u16(ptr1));\n                    uint16x8_t _p23 = vcombine_u16(vld1_u16(ptr2), vld1_u16(ptr3));\n                    uint16x8_t _p45 = vcombine_u16(vld1_u16(ptr4), vld1_u16(ptr5));\n                    uint16x8_t _p67 = vcombine_u16(vld1_u16(ptr6), vld1_u16(ptr7));\n\n                    uint16x8x2_t _p0415 = vzipq_u16(_p01, _p45);\n                    uint16x8x2_t _p2637 = vzipq_u16(_p23, _p67);\n\n                    uint16x8x4_t _v4;\n                    _v4.val[0] = _p0415.val[0];\n                    _v4.val[1] = _p0415.val[1];\n                    _v4.val[2] = _p2637.val[0];\n                    _v4.val[3] = _p2637.val[1];\n\n                    vst4q_u16(outptr, _v4);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    ptr3 += 4;\n                    ptr4 += 4;\n                    ptr5 += 4;\n                    ptr6 += 4;\n                    ptr7 += 4;\n                    outptr += 32;\n                }\n                for (; j < outw; j++)\n                {\n                    outptr[0] = *ptr0++;\n                    outptr[1] = *ptr1++;\n                    outptr[2] = *ptr2++;\n                    outptr[3] = *ptr3++;\n                    outptr[4] = *ptr4++;\n                    outptr[5] = *ptr5++;\n                    outptr[6] = *ptr6++;\n                    outptr[7] = *ptr7++;\n\n                  
  outptr += 8;\n                }\n            }\n        }\n#endif // NCNN_ARM82\n\n        if (out_elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < top_blob.h; i++)\n            {\n                const unsigned short* ptr0 = (const unsigned short*)bottom_blob_flattened + outw * i * 4;\n                const unsigned short* ptr1 = (const unsigned short*)bottom_blob_flattened + outw * (i * 4 + 1);\n                const unsigned short* ptr2 = (const unsigned short*)bottom_blob_flattened + outw * (i * 4 + 2);\n                const unsigned short* ptr3 = (const unsigned short*)bottom_blob_flattened + outw * (i * 4 + 3);\n                unsigned short* outptr = top_blob.row<unsigned short>(i);\n\n                int j = 0;\n#if __ARM_NEON\n                for (; j + 3 < outw; j += 4)\n                {\n                    uint16x4x4_t _v4;\n                    _v4.val[0] = vld1_u16(ptr0);\n                    _v4.val[1] = vld1_u16(ptr1);\n                    _v4.val[2] = vld1_u16(ptr2);\n                    _v4.val[3] = vld1_u16(ptr3);\n\n                    vst4_u16(outptr, _v4);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    ptr3 += 4;\n                    outptr += 16;\n                }\n#endif\n                for (; j < outw; j++)\n                {\n                    outptr[0] = *ptr0++;\n                    outptr[1] = *ptr1++;\n                    outptr[2] = *ptr2++;\n                    outptr[3] = *ptr3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n    }\n\n    if (ndim == 3 || ndim == 4)\n    {\n        if (ndim == 3)\n        {\n            if (outw == 0)\n                outw = dims == 1 ? bottom_blob.w * elempack : bottom_blob.w;\n            if (outh == 0)\n                outh = dims == 2 ? 
bottom_blob.h * elempack : bottom_blob.h;\n            if (outc == 0)\n                outc = dims == 3 ? bottom_blob.c * elempack : bottom_blob.c;\n\n            if (outw == -1)\n                outw = total / outc / outh;\n            if (outh == -1)\n                outh = total / outc / outw;\n            if (outc == -1)\n                outc = total / outh / outw;\n\n            outd = 1;\n        }\n        else // if (ndim == 4)\n        {\n            if (outw == 0)\n                outw = dims == 1 ? bottom_blob.w * elempack : bottom_blob.w;\n            if (outh == 0)\n                outh = dims == 2 ? bottom_blob.h * elempack : bottom_blob.h;\n            if (outd == 0)\n                outd = bottom_blob.d;\n            if (outc == 0)\n                outc = (dims == 3 || dims == 4) ? bottom_blob.c * elempack : bottom_blob.c;\n\n            if (outw == -1)\n                outw = total / outc / outd / outh;\n            if (outh == -1)\n                outh = total / outc / outd / outw;\n            if (outd == -1)\n                outd = total / outc / outh / outw;\n            if (outc == -1)\n                outc = total / outd / outh / outw;\n        }\n\n        int out_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n#if NCNN_ARM82\n            out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && outc % 8 == 0 ? 8 : outc % 4 == 0 ? 4 : 1;\n#else\n            out_elempack = outc % 4 == 0 ? 
4 : 1;\n#endif\n        }\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if ((dims == 3 || dims == 4) && bottom_blob.c * elempack == outc && elempack == out_elempack)\n        {\n            top_blob = bottom_blob;\n            top_blob.dims = ndim;\n            top_blob.w = outw;\n            top_blob.h = outh;\n            top_blob.d = outd;\n            return 0;\n        }\n\n        // flatten\n        Mat bottom_blob_flattened = bottom_blob;\n        {\n            Option opt_flatten = opt;\n            opt_flatten.blob_allocator = opt.workspace_allocator;\n\n            flatten(bottom_blob, bottom_blob_flattened, opt_flatten);\n            if (bottom_blob_flattened.empty())\n                return -100;\n        }\n\n        if (ndim == 3)\n        {\n            top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        }\n        else // if (ndim == 4)\n        {\n            top_blob.create(outw, outh, outd, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        }\n        if (top_blob.empty())\n            return -100;\n\n        int size = top_blob.w * top_blob.h * top_blob.d;\n\n#if NCNN_ARM82\n        if (out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < top_blob.c; q++)\n            {\n                const unsigned short* ptr0 = (const unsigned short*)bottom_blob_flattened + size * q * 8;\n                const unsigned short* ptr1 = (const unsigned short*)bottom_blob_flattened + size * (q * 8 + 1);\n                const unsigned short* ptr2 = (const unsigned short*)bottom_blob_flattened + size * (q * 8 + 2);\n                const unsigned short* ptr3 = (const unsigned short*)bottom_blob_flattened + size * (q * 8 + 3);\n                const unsigned short* ptr4 = (const unsigned short*)bottom_blob_flattened + size * (q * 8 + 4);\n                const unsigned short* 
ptr5 = (const unsigned short*)bottom_blob_flattened + size * (q * 8 + 5);\n                const unsigned short* ptr6 = (const unsigned short*)bottom_blob_flattened + size * (q * 8 + 6);\n                const unsigned short* ptr7 = (const unsigned short*)bottom_blob_flattened + size * (q * 8 + 7);\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i + 3 < size; i += 4)\n                {\n                    uint16x8_t _p01 = vcombine_u16(vld1_u16(ptr0), vld1_u16(ptr1));\n                    uint16x8_t _p23 = vcombine_u16(vld1_u16(ptr2), vld1_u16(ptr3));\n                    uint16x8_t _p45 = vcombine_u16(vld1_u16(ptr4), vld1_u16(ptr5));\n                    uint16x8_t _p67 = vcombine_u16(vld1_u16(ptr6), vld1_u16(ptr7));\n\n                    uint16x8x2_t _p0415 = vzipq_u16(_p01, _p45);\n                    uint16x8x2_t _p2637 = vzipq_u16(_p23, _p67);\n\n                    uint16x8x4_t _v4;\n                    _v4.val[0] = _p0415.val[0];\n                    _v4.val[1] = _p0415.val[1];\n                    _v4.val[2] = _p2637.val[0];\n                    _v4.val[3] = _p2637.val[1];\n\n                    vst4q_u16(outptr, _v4);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    ptr3 += 4;\n                    ptr4 += 4;\n                    ptr5 += 4;\n                    ptr6 += 4;\n                    ptr7 += 4;\n                    outptr += 32;\n                }\n                for (; i < size; i++)\n                {\n                    outptr[0] = *ptr0++;\n                    outptr[1] = *ptr1++;\n                    outptr[2] = *ptr2++;\n                    outptr[3] = *ptr3++;\n                    outptr[4] = *ptr4++;\n                    outptr[5] = *ptr5++;\n                    outptr[6] = *ptr6++;\n                    outptr[7] = *ptr7++;\n\n                    outptr += 8;\n                }\n            }\n       
 }\n#endif // NCNN_ARM82\n\n        if (out_elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < top_blob.c; q++)\n            {\n                const unsigned short* ptr0 = (const unsigned short*)bottom_blob_flattened + size * q * 4;\n                const unsigned short* ptr1 = (const unsigned short*)bottom_blob_flattened + size * (q * 4 + 1);\n                const unsigned short* ptr2 = (const unsigned short*)bottom_blob_flattened + size * (q * 4 + 2);\n                const unsigned short* ptr3 = (const unsigned short*)bottom_blob_flattened + size * (q * 4 + 3);\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n                for (; i + 3 < size; i += 4)\n                {\n                    uint16x4x4_t _v4;\n                    _v4.val[0] = vld1_u16(ptr0);\n                    _v4.val[1] = vld1_u16(ptr1);\n                    _v4.val[2] = vld1_u16(ptr2);\n                    _v4.val[3] = vld1_u16(ptr3);\n\n                    vst4_u16(outptr, _v4);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    ptr3 += 4;\n                    outptr += 16;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    outptr[0] = *ptr0++;\n                    outptr[1] = *ptr1++;\n                    outptr[2] = *ptr2++;\n                    outptr[3] = *ptr3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n\n        if (out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < top_blob.c; q++)\n            {\n                const unsigned short* ptr = (const unsigned short*)bottom_blob_flattened + size * q;\n                unsigned short* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __ARM_NEON\n     
           for (; i + 3 < size; i += 4)\n                {\n                    uint16x4_t _v = vld1_u16(ptr);\n                    vst1_u16(outptr, _v);\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif\n                for (; i < size; i++)\n                {\n                    *outptr++ = *ptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/reshape_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_RESHAPE_ARM_H\n#define LAYER_RESHAPE_ARM_H\n\n#include \"reshape.h\"\n\nnamespace ncnn {\n\nclass Reshape_arm : public Reshape\n{\npublic:\n    Reshape_arm();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_RESHAPE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/rmsnorm_arm.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"rmsnorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nRMSNorm_arm::RMSNorm_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nstatic void rmsnorm(float* ptr, const float* gamma_ptr, float eps, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n#if __ARM_NEON\n    float32x4_t _rms = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float rms = 0.f;\n    {\n        const float* ptr0 = ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr0);\n            _rms = vmlaq_f32(_rms, _p, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            rms += ptr0[0] * ptr0[0];\n            ptr0++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n\n#if __aarch64__\n        _rms = vdivq_f32(_rms, _elemcount);\n        _rms = vaddq_f32(_rms, _eps);\n#else\n        float32x4_t _inv_elemcount = vrecpeq_f32(_elemcount);\n        _inv_elemcount = vmulq_f32(vrecpsq_f32(_elemcount, _inv_elemcount), _inv_elemcount);\n        _inv_elemcount = vmulq_f32(vrecpsq_f32(_elemcount, _inv_elemcount), _inv_elemcount);\n        _rms = vmlaq_f32(_eps, _rms, _inv_elemcount);\n#endif\n\n        float32x4_t _rsqrt_rms = vrsqrteq_f32(_rms);\n        _rsqrt_rms = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms, _rsqrt_rms), _rsqrt_rms), _rsqrt_rms);\n        _rms = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms, _rsqrt_rms), _rsqrt_rms), _rsqrt_rms);\n    }\n#endif // __ARM_NEON\n    
if (elempack == 1)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        rms += vaddvq_f32(_rms);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_rms), vget_high_f32(_rms));\n        _s2 = vpadd_f32(_s2, _s2);\n        rms += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        rms = 1.f / sqrtf(rms / elemcount + eps);\n#if __ARM_NEON\n        _rms = vdupq_n_f32(rms);\n#endif // __ARM_NEON\n    }\n\n    if (gamma_ptr)\n    {\n        int i = 0;\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                _p = vmulq_f32(_p, _rms);\n                _p = vmulq_f32(_p, _gamma);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n                gamma_ptr += 1;\n            }\n        }\n        if (elempack == 1)\n        {\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                float32x4_t _gamma = vld1q_f32(gamma_ptr);\n                _p = vmulq_f32(_p, _rms);\n                _p = vmulq_f32(_p, _gamma);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n                gamma_ptr += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            ptr[0] = (ptr[0] * rms) * gamma_ptr[0];\n            ptr++;\n            gamma_ptr++;\n        }\n    }\n    else\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vmulq_f32(_p, _rms);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            ptr[0] = ptr[0] * rms;\n            ptr++;\n        }\n    }\n}\n\nint RMSNorm_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits 
= bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        // assert affine_size == w\n\n        float* ptr = bottom_top_blob;\n        rmsnorm(ptr, gamma_data, eps, w * elempack, 1);\n    }\n\n    if (dims == 2)\n    {\n        // assert affine_size == w\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n            rmsnorm(ptr, gamma_data, eps, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        if (affine_size == w)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    float* ptr = bottom_top_blob.channel(q).row(i);\n                    rmsnorm(ptr, gamma_data, eps, w, elempack);\n                }\n            }\n        }\n        else // if (affine_size == w * h)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                float* ptr = bottom_top_blob.channel(q);\n                rmsnorm(ptr, gamma_data, eps, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic void rmsnorm_bf16s(unsigned short* ptr, const float* gamma_ptr, float eps, int elemcount, int elempack)\n{\n    const int size = elemcount * 
elempack;\n\n#if __ARM_NEON\n    float32x4_t _rms = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float rms = 0.f;\n    {\n        const unsigned short* ptr0 = ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr0));\n            _rms = vmlaq_f32(_rms, _p, _p);\n            ptr0 += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(ptr0[0]);\n            rms += v * v;\n            ptr0++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n\n#if __aarch64__\n        _rms = vdivq_f32(_rms, _elemcount);\n        _rms = vaddq_f32(_rms, _eps);\n#else\n        float32x4_t _inv_elemcount = vrecpeq_f32(_elemcount);\n        _inv_elemcount = vmulq_f32(vrecpsq_f32(_elemcount, _inv_elemcount), _inv_elemcount);\n        _inv_elemcount = vmulq_f32(vrecpsq_f32(_elemcount, _inv_elemcount), _inv_elemcount);\n        _rms = vmlaq_f32(_eps, _rms, _inv_elemcount);\n#endif\n\n        float32x4_t _rsqrt_rms = vrsqrteq_f32(_rms);\n        _rsqrt_rms = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms, _rsqrt_rms), _rsqrt_rms), _rsqrt_rms);\n        _rms = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms, _rsqrt_rms), _rsqrt_rms), _rsqrt_rms);\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n#if __ARM_NEON\n#if __aarch64__\n        rms += vaddvq_f32(_rms);\n#else\n        float32x2_t _s2 = vadd_f32(vget_low_f32(_rms), vget_high_f32(_rms));\n        _s2 = vpadd_f32(_s2, _s2);\n        rms += vget_lane_f32(_s2, 0);\n#endif\n#endif // __ARM_NEON\n\n        rms = 1.f / sqrtf(rms / elemcount + eps);\n#if __ARM_NEON\n        _rms = vdupq_n_f32(rms);\n#endif // __ARM_NEON\n    }\n\n    if (gamma_ptr)\n    {\n        int i = 0;\n#if __ARM_NEON\n        if (elempack == 4)\n        {\n            for (; i + 3 < size; i += 4)\n   
         {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                _p = vmulq_f32(_p, _rms);\n                _p = vmulq_f32(_p, _gamma);\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n                gamma_ptr += 1;\n            }\n        }\n        if (elempack == 1)\n        {\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                float32x4_t _gamma = vld1q_f32(gamma_ptr);\n                _p = vmulq_f32(_p, _rms);\n                _p = vmulq_f32(_p, _gamma);\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n                gamma_ptr += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(ptr[0]);\n            ptr[0] = float32_to_bfloat16((v * rms) * gamma_ptr[0]);\n            ptr++;\n            gamma_ptr++;\n        }\n    }\n    else\n    {\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = vmulq_f32(_p, _rms);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(ptr[0]);\n            ptr[0] = float32_to_bfloat16(v * rms);\n            ptr++;\n        }\n    }\n}\n\nint RMSNorm_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        // assert affine_size == w\n\n        unsigned short* ptr = bottom_top_blob;\n        rmsnorm_bf16s(ptr, 
gamma_data, eps, w * elempack, 1);\n    }\n\n    if (dims == 2)\n    {\n        // assert affine_size == w\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            unsigned short* ptr = bottom_top_blob.row<unsigned short>(i);\n            rmsnorm_bf16s(ptr, gamma_data, eps, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        if (affine_size == w)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    unsigned short* ptr = bottom_top_blob.channel(q).row<unsigned short>(i);\n                    rmsnorm_bf16s(ptr, gamma_data, eps, w, elempack);\n                }\n            }\n        }\n        else // if (affine_size == w * h)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                unsigned short* ptr = bottom_top_blob.channel(q);\n                rmsnorm_bf16s(ptr, gamma_data, eps, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/rmsnorm_arm.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_RMSNORM_ARM_H\n#define LAYER_RMSNORM_ARM_H\n\n#include \"rmsnorm.h\"\n\nnamespace ncnn {\n\nclass RMSNorm_arm : public RMSNorm\n{\npublic:\n    RMSNorm_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_RMSNORM_ARM_H\n"
  },
  {
    "path": "src/layer/arm/rmsnorm_arm_asimdhp.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"rmsnorm_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic void rmsnorm_fp16s(__fp16* ptr, const float* gamma_ptr, float eps, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n    float32x4_t _rms0 = vdupq_n_f32(0.f);\n    float32x4_t _rms1 = vdupq_n_f32(0.f);\n    float rms = 0.f;\n    {\n        const __fp16* ptr0 = ptr;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr0);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _rms0 = vmlaq_f32(_rms0, _p0, _p0);\n            _rms1 = vmlaq_f32(_rms1, _p1, _p1);\n            ptr0 += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr0));\n            _rms0 = vmlaq_f32(_rms0, _p, _p);\n            ptr0 += 4;\n        }\n        for (; i < size; i++)\n        {\n            rms += (float)ptr0[0] * (float)ptr0[0];\n            ptr0++;\n        }\n    }\n\n    if (elempack == 8)\n    {\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n\n        _rms0 = vdivq_f32(_rms0, _elemcount);\n        _rms1 = vdivq_f32(_rms1, _elemcount);\n        _rms0 = vaddq_f32(_rms0, _eps);\n        _rms1 = vaddq_f32(_rms1, _eps);\n\n        float32x4_t _rsqrt_rms0 = vrsqrteq_f32(_rms0);\n        float32x4_t _rsqrt_rms1 = vrsqrteq_f32(_rms1);\n        _rsqrt_rms0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms0, _rsqrt_rms0), _rsqrt_rms0), _rsqrt_rms0);\n        _rsqrt_rms1 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms1, _rsqrt_rms1), _rsqrt_rms1), _rsqrt_rms1);\n        _rms0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms0, _rsqrt_rms0), 
_rsqrt_rms0), _rsqrt_rms0);\n        _rms1 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms1, _rsqrt_rms1), _rsqrt_rms1), _rsqrt_rms1);\n    }\n    if (elempack == 4)\n    {\n        _rms0 = vaddq_f32(_rms0, _rms1);\n\n        float32x4_t _elemcount = vdupq_n_f32(elemcount);\n        float32x4_t _eps = vdupq_n_f32(eps);\n\n        _rms0 = vdivq_f32(_rms0, _elemcount);\n        _rms0 = vaddq_f32(_rms0, _eps);\n\n        float32x4_t _rsqrt_rms0 = vrsqrteq_f32(_rms0);\n        _rsqrt_rms0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms0, _rsqrt_rms0), _rsqrt_rms0), _rsqrt_rms0);\n        _rms0 = vmulq_f32(vrsqrtsq_f32(vmulq_f32(_rms0, _rsqrt_rms0), _rsqrt_rms0), _rsqrt_rms0);\n        _rms1 = _rms0;\n    }\n    if (elempack == 1)\n    {\n        _rms0 = vaddq_f32(_rms0, _rms1);\n        rms += vaddvq_f32(_rms0);\n\n        rms = 1.f / sqrtf(rms / elemcount + eps);\n        _rms0 = vdupq_n_f32(rms);\n        _rms1 = _rms0;\n    }\n\n    if (gamma_ptr)\n    {\n        int i = 0;\n        if (elempack == 8)\n        {\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                _p0 = vmulq_f32(_p0, _rms0);\n                _p1 = vmulq_f32(_p1, _rms1);\n                _p0 = vmulq_f32(_p0, _gamma);\n                _p1 = vmulq_f32(_p1, _gamma);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n                gamma_ptr += 1;\n            }\n        }\n        if (elempack == 4)\n        {\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                
float32x4_t _gamma0 = vdupq_n_f32(gamma_ptr[0]);\n                float32x4_t _gamma1 = vdupq_n_f32(gamma_ptr[1]);\n                _p0 = vmulq_f32(_p0, _rms0);\n                _p1 = vmulq_f32(_p1, _rms1);\n                _p0 = vmulq_f32(_p0, _gamma0);\n                _p1 = vmulq_f32(_p1, _gamma1);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n                gamma_ptr += 2;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                float32x4_t _gamma = vdupq_n_f32(gamma_ptr[0]);\n                _p = vmulq_f32(_p, _rms0);\n                _p = vmulq_f32(_p, _gamma);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n                ptr += 4;\n                gamma_ptr += 1;\n            }\n        }\n        if (elempack == 1)\n        {\n            for (; i + 7 < size; i += 8)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n                float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n                float32x4_t _gamma0 = vld1q_f32(gamma_ptr);\n                float32x4_t _gamma1 = vld1q_f32(gamma_ptr + 4);\n                _p0 = vmulq_f32(_p0, _rms0);\n                _p1 = vmulq_f32(_p1, _rms1);\n                _p0 = vmulq_f32(_p0, _gamma0);\n                _p1 = vmulq_f32(_p1, _gamma1);\n                _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n                vst1q_f16(ptr, _p);\n                ptr += 8;\n                gamma_ptr += 8;\n            }\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                float32x4_t _gamma = vld1q_f32(gamma_ptr);\n                _p = vmulq_f32(_p, _rms0);\n                _p = vmulq_f32(_p, _gamma);\n                vst1_f16(ptr, 
vcvt_f16_f32(_p));\n                ptr += 4;\n                gamma_ptr += 4;\n            }\n        }\n        for (; i < size; i++)\n        {\n            ptr[0] = (__fp16)(((float)ptr[0] * rms) * gamma_ptr[0]);\n            ptr++;\n            gamma_ptr++;\n        }\n    }\n    else\n    {\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _p0 = vmulq_f32(_p0, _rms0);\n            _p1 = vmulq_f32(_p1, _rms1);\n            _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _p = vmulq_f32(_p, _rms0);\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            ptr[0] = (__fp16)((float)ptr[0] * rms);\n            ptr++;\n        }\n    }\n}\n\nint RMSNorm_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        // assert affine_size == w\n\n        __fp16* ptr = bottom_top_blob;\n        rmsnorm_fp16s(ptr, gamma_data, eps, w * elempack, 1);\n    }\n\n    if (dims == 2)\n    {\n        // assert affine_size == w\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n            rmsnorm_fp16s(ptr, gamma_data, eps, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        if (affine_size == w)\n        
{\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    __fp16* ptr = bottom_top_blob.channel(q).row<__fp16>(i);\n                    rmsnorm_fp16s(ptr, gamma_data, eps, w, elempack);\n                }\n            }\n        }\n        else // if (affine_size == w * h)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                __fp16* ptr = bottom_top_blob.channel(q);\n                rmsnorm_fp16s(ptr, gamma_data, eps, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/rnn_arm.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"rnn_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#if NCNN_INT8\n#include \"rnn_int8.h\"\n#endif\n\nRNN_arm::RNN_arm()\n{\n#if __ARM_NEON\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint RNN_arm::create_pipeline(const Option& opt)\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return create_pipeline_int8(opt);\n    }\n#endif\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage)\n    {\n        return create_pipeline_bf16s(opt);\n    }\n#endif\n\n    const int num_directions = direction == 2 ? 2 : 1;\n    const int size = weight_data_size / num_directions / num_output;\n\n#if __ARM_NEON\n    weight_xc_data_packed.create(size * 4, num_output / 4 + num_output % 4, num_directions);\n    weight_hc_data_packed.create(num_output * 4, num_output / 4 + num_output % 4, num_directions);\n#else\n    weight_xc_data_packed.create(size, num_output, num_directions);\n    weight_hc_data_packed.create(num_output, num_output, num_directions);\n#endif\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        int q = 0;\n#if __ARM_NEON\n        for (; q + 3 < num_output; q += 4)\n        {\n            const float* weight_xc_0 = weight_xc.row(q);\n            
const float* weight_xc_1 = weight_xc.row(q + 1);\n            const float* weight_xc_2 = weight_xc.row(q + 2);\n            const float* weight_xc_3 = weight_xc.row(q + 3);\n\n            const float* weight_hc_0 = weight_hc.row(q);\n            const float* weight_hc_1 = weight_hc.row(q + 1);\n            const float* weight_hc_2 = weight_hc.row(q + 2);\n            const float* weight_hc_3 = weight_hc.row(q + 3);\n\n            float* weight_xc = weight_xc_data_packed_dr.row(q / 4);\n            float* weight_hc = weight_hc_data_packed_dr.row(q / 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc[0] = weight_xc_0[i];\n                weight_xc[1] = weight_xc_1[i];\n                weight_xc[2] = weight_xc_2[i];\n                weight_xc[3] = weight_xc_3[i];\n\n                weight_xc += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc[0] = weight_hc_0[i];\n                weight_hc[1] = weight_hc_1[i];\n                weight_hc[2] = weight_hc_2[i];\n                weight_hc[3] = weight_hc_3[i];\n\n                weight_hc += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; q < num_output; q++)\n        {\n            const float* weight_xc_0 = weight_xc.row(q);\n            const float* weight_hc_0 = weight_hc.row(q);\n\n#if __ARM_NEON\n            float* weight_xc = weight_xc_data_packed_dr.row(q / 4 + q % 4);\n            float* weight_hc = weight_hc_data_packed_dr.row(q / 4 + q % 4);\n#else\n            float* weight_xc = weight_xc_data_packed_dr.row(q);\n            float* weight_hc = weight_hc_data_packed_dr.row(q);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc[i] = weight_xc_0[i];\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc[i] = weight_hc_0[i];\n            }\n        }\n    }\n\n    bias_c_data_packed = 
bias_c_data;\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nstatic int rnn(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // num_output\n    Mat gates(num_output, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? T - 1 - t : t;\n\n        const float* x = bottom_blob.row(ti);\n\n        int remain_num_output_start = 0;\n#if __ARM_NEON\n        int nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const float* weight_xc_ptr = weight_xc.row(q / 4);\n            const float* weight_hc_ptr = weight_hc.row(q / 4);\n\n            float32x4_t _rnn_H = vld1q_f32((const float*)bias_c + q);\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _x = vld1q_f32(x + i);\n                float32x4_t _weight_xc = vld1q_f32(weight_xc_ptr);\n                float32x4_t _weight_xc_1 = vld1q_f32(weight_xc_ptr + 4);\n                float32x4_t _weight_xc_2 = vld1q_f32(weight_xc_ptr + 8);\n                float32x4_t _weight_xc_3 = vld1q_f32(weight_xc_ptr + 12);\n#if __aarch64__\n                _rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_xc, _x, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_1, _x, 1);\n                _sum2 = 
vfmaq_laneq_f32(_sum2, _weight_xc_2, _x, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_3, _x, 3);\n#else\n                _rnn_H = vmlaq_lane_f32(_rnn_H, _weight_xc, vget_low_f32(_x), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_1, vget_low_f32(_x), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_2, vget_high_f32(_x), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_3, vget_high_f32(_x), 1);\n#endif\n\n                weight_xc_ptr += 16;\n            }\n            for (; i < size; i++)\n            {\n                float32x4_t _x = vdupq_n_f32(x[i]);\n                float32x4_t _weight_xc = vld1q_f32(weight_xc_ptr);\n                _rnn_H = vmlaq_f32(_rnn_H, _weight_xc, _x);\n\n                weight_xc_ptr += 4;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _hidden_state = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc = vld1q_f32(weight_hc_ptr);\n                float32x4_t _weight_hc_1 = vld1q_f32(weight_hc_ptr + 4);\n                float32x4_t _weight_hc_2 = vld1q_f32(weight_hc_ptr + 8);\n                float32x4_t _weight_hc_3 = vld1q_f32(weight_hc_ptr + 12);\n#if __aarch64__\n                _rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_hc, _hidden_state, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_1, _hidden_state, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_2, _hidden_state, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_3, _hidden_state, 3);\n#else\n                _rnn_H = vmlaq_lane_f32(_rnn_H, _weight_hc, vget_low_f32(_hidden_state), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_1, vget_low_f32(_hidden_state), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_2, vget_high_f32(_hidden_state), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_3, vget_high_f32(_hidden_state), 
1);\n#endif\n\n                weight_hc_ptr += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float32x4_t _hidden_state = vdupq_n_f32(hidden_state[i]);\n                float32x4_t _weight_hc = vld1q_f32(weight_hc_ptr);\n                _rnn_H = vmlaq_f32(_rnn_H, _weight_hc, _hidden_state);\n\n                weight_hc_ptr += 4;\n            }\n\n            _rnn_H = vaddq_f32(_rnn_H, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _rnn_H = vaddq_f32(_rnn_H, _sum2);\n\n            _rnn_H = tanh_ps(_rnn_H);\n\n            vst1q_f32((float*)gates + q, _rnn_H);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n#if __ARM_NEON\n            const float* weight_xc_ptr = weight_xc.row(q / 4 + q % 4);\n            const float* weight_hc_ptr = weight_hc.row(q / 4 + q % 4);\n#else\n            const float* weight_xc_ptr = weight_xc.row(q);\n            const float* weight_hc_ptr = weight_hc.row(q);\n#endif // __ARM_NEON\n\n            float H = bias_c[q];\n\n            for (int i = 0; i < size; i++)\n            {\n                H += weight_xc_ptr[i] * x[i];\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                H += weight_hc_ptr[i] * hidden_state[i];\n            }\n\n            H = tanhf(H);\n\n            gates[q] = H;\n        }\n\n        float* output_data = top_blob.row(ti);\n\n        float* hidden_ptr = hidden_state;\n\n#if __ARM_NEON\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            float32x4_t _rnn_H = vld1q_f32((float*)gates + q);\n\n            vst1q_f32(hidden_ptr + q, _rnn_H);\n            vst1q_f32(output_data 
+ q, _rnn_H);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            float H = gates[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = H;\n        }\n    }\n\n    return 0;\n}\n\nint RNN_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return forward_int8(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blob, top_blob, opt);\n#endif\n\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = rnn(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = rnn(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), 
bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.0f);\n\n        {\n            int ret = rnn(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    return 0;\n}\n\nint RNN_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return forward_int8(bottom_blobs, top_blobs, opt);\n    }\n#endif\n\n    const Mat& bottom_blob = bottom_blobs[0];\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        hidden = bottom_blobs[1].clone(hidden_allocator);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = rnn(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        {\n            int ret = rnn(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            int ret = rnn(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * 
sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        top_blobs[1] = hidden;\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic int rnn_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // num_output\n    Mat gates(num_output, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? T - 1 - t : t;\n\n        const unsigned short* x = bottom_blob.row<const unsigned short>(ti);\n\n        int remain_num_output_start = 0;\n#if __ARM_NEON\n        int nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const unsigned short* weight_xc_ptr = weight_xc.row<const unsigned short>(q / 4);\n            const unsigned short* weight_hc_ptr = weight_hc.row<const unsigned short>(q / 4);\n\n            float32x4_t _rnn_H = bfloat2float(vld1_u16((const unsigned short*)bias_c + q));\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _x = bfloat2float(vld1_u16(x + i));\n                float32x4_t _weight_xc = bfloat2float(vld1_u16(weight_xc_ptr));\n                float32x4_t _weight_xc_1 = bfloat2float(vld1_u16(weight_xc_ptr + 4));\n                float32x4_t _weight_xc_2 = bfloat2float(vld1_u16(weight_xc_ptr + 8));\n                float32x4_t 
_weight_xc_3 = bfloat2float(vld1_u16(weight_xc_ptr + 12));\n#if __aarch64__\n                _rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_xc, _x, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_1, _x, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_2, _x, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_3, _x, 3);\n#else\n                _rnn_H = vmlaq_lane_f32(_rnn_H, _weight_xc, vget_low_f32(_x), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_xc_1, vget_low_f32(_x), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_xc_2, vget_high_f32(_x), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_xc_3, vget_high_f32(_x), 1);\n#endif\n\n                weight_xc_ptr += 16;\n            }\n            for (; i < size; i++)\n            {\n                float32x4_t _x = bfloat2float(vdup_n_u16(x[i]));\n                float32x4_t _weight_xc = bfloat2float(vld1_u16(weight_xc_ptr));\n                _rnn_H = vmlaq_f32(_rnn_H, _weight_xc, _x);\n\n                weight_xc_ptr += 4;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t _hidden_state = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc = bfloat2float(vld1_u16(weight_hc_ptr));\n                float32x4_t _weight_hc_1 = bfloat2float(vld1_u16(weight_hc_ptr + 4));\n                float32x4_t _weight_hc_2 = bfloat2float(vld1_u16(weight_hc_ptr + 8));\n                float32x4_t _weight_hc_3 = bfloat2float(vld1_u16(weight_hc_ptr + 12));\n#if __aarch64__\n                _rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_hc, _hidden_state, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_1, _hidden_state, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_2, _hidden_state, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_3, _hidden_state, 3);\n#else\n                _rnn_H = vmlaq_lane_f32(_rnn_H, _weight_hc, 
vget_low_f32(_hidden_state), 0);\n                _sum1 = vmlaq_lane_f32(_sum1, _weight_hc_1, vget_low_f32(_hidden_state), 1);\n                _sum2 = vmlaq_lane_f32(_sum2, _weight_hc_2, vget_high_f32(_hidden_state), 0);\n                _sum3 = vmlaq_lane_f32(_sum3, _weight_hc_3, vget_high_f32(_hidden_state), 1);\n#endif\n\n                weight_hc_ptr += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float32x4_t _hidden_state = vdupq_n_f32(hidden_state[i]);\n                float32x4_t _weight_hc = bfloat2float(vld1_u16(weight_hc_ptr));\n                _rnn_H = vmlaq_f32(_rnn_H, _weight_hc, _hidden_state);\n\n                weight_hc_ptr += 4;\n            }\n\n            _rnn_H = vaddq_f32(_rnn_H, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _rnn_H = vaddq_f32(_rnn_H, _sum2);\n\n            _rnn_H = tanh_ps(_rnn_H);\n\n            vst1q_f32((float*)gates + q, _rnn_H);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n#if __ARM_NEON\n            const unsigned short* weight_xc_ptr = weight_xc.row<const unsigned short>(q / 4 + q % 4);\n            const unsigned short* weight_hc_ptr = weight_hc.row<const unsigned short>(q / 4 + q % 4);\n#else\n            const unsigned short* weight_xc_ptr = weight_xc.row<const unsigned short>(q);\n            const unsigned short* weight_hc_ptr = weight_hc.row<const unsigned short>(q);\n#endif // __ARM_NEON\n\n            float H = bfloat16_to_float32(((const unsigned short*)bias_c)[q]);\n\n            for (int i = 0; i < size; i++)\n            {\n                H += bfloat16_to_float32(weight_xc_ptr[i]) * bfloat16_to_float32(x[i]);\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                H += bfloat16_to_float32(weight_hc_ptr[i]) * hidden_state[i];\n            }\n\n            H = 
tanhf(H);\n\n            gates[q] = H;\n        }\n\n        unsigned short* output_data = top_blob.row<unsigned short>(ti);\n\n        float* hidden_ptr = hidden_state;\n\n#if __ARM_NEON\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            float32x4_t _rnn_H = vld1q_f32((float*)gates + q);\n\n            vst1q_f32(hidden_ptr + q, _rnn_H);\n            vst1_u16(output_data + q, float2bfloat(_rnn_H));\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            float H = gates[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = float32_to_bfloat16(H);\n        }\n    }\n\n    return 0;\n}\n\nint RNN_arm::create_pipeline_bf16s(const Option& opt)\n{\n    int num_directions = direction == 2 ? 
2 : 1;\n    int size = weight_data_size / num_directions / num_output;\n\n#if __ARM_NEON\n    weight_xc_data_packed.create(size * 4, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n    weight_hc_data_packed.create(num_output * 4, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        int q = 0;\n#if __ARM_NEON\n        for (; q + 3 < num_output; q += 4)\n        {\n            const float* weight_xc_0 = weight_xc.row(q);\n            const float* weight_xc_1 = weight_xc.row(q + 1);\n            const float* weight_xc_2 = weight_xc.row(q + 2);\n            const float* weight_xc_3 = weight_xc.row(q + 3);\n\n            const float* weight_hc_0 = weight_hc.row(q);\n            const float* weight_hc_1 = weight_hc.row(q + 1);\n            const float* weight_hc_2 = weight_hc.row(q + 2);\n            const float* weight_hc_3 = weight_hc.row(q + 3);\n\n            unsigned short* weight_xc = weight_xc_data_packed_dr.row<unsigned short>(q / 4);\n            unsigned short* weight_hc = weight_hc_data_packed_dr.row<unsigned short>(q / 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc[0] = float32_to_bfloat16(weight_xc_0[i]);\n                weight_xc[1] = float32_to_bfloat16(weight_xc_1[i]);\n                weight_xc[2] = float32_to_bfloat16(weight_xc_2[i]);\n                weight_xc[3] = float32_to_bfloat16(weight_xc_3[i]);\n\n                weight_xc += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc[0] = float32_to_bfloat16(weight_hc_0[i]);\n                
weight_hc[1] = float32_to_bfloat16(weight_hc_1[i]);\n                weight_hc[2] = float32_to_bfloat16(weight_hc_2[i]);\n                weight_hc[3] = float32_to_bfloat16(weight_hc_3[i]);\n\n                weight_hc += 4;\n            }\n        }\n#endif // __ARM_NEON\n        for (; q < num_output; q++)\n        {\n            const float* weight_xc_0 = weight_xc.row(q);\n            const float* weight_hc_0 = weight_hc.row(q);\n\n#if __ARM_NEON\n            unsigned short* weight_xc = weight_xc_data_packed_dr.row<unsigned short>(q / 4 + q % 4);\n            unsigned short* weight_hc = weight_hc_data_packed_dr.row<unsigned short>(q / 4 + q % 4);\n#else\n            unsigned short* weight_xc = weight_xc_data_packed_dr.row<unsigned short>(q);\n            unsigned short* weight_hc = weight_hc_data_packed_dr.row<unsigned short>(q);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc[i] = float32_to_bfloat16(weight_xc_0[i]);\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc[i] = float32_to_bfloat16(weight_hc_0[i]);\n            }\n        }\n    }\n#else\n    cast_float32_to_bfloat16(weight_xc_data, weight_xc_data_packed, opt);\n    cast_float32_to_bfloat16(weight_hc_data, weight_hc_data_packed, opt);\n#endif\n\n    cast_float32_to_bfloat16(bias_c_data, bias_c_data_packed, opt);\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nint RNN_arm::forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 
2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = rnn_bf16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = rnn_bf16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.f);\n\n        {\n            int ret = rnn_bf16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned short* pf = top_blob_forward.row<const unsigned short>(i);\n            const unsigned short* pr = top_blob_reverse.row<const unsigned short>(i);\n            unsigned short* ptr = top_blob.row<unsigned short>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(unsigned short));\n            memcpy(ptr + num_output, pr, num_output * sizeof(unsigned short));\n        }\n    }\n\n    return 0;\n}\n\nint 
RNN_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        Option opt_cast = opt;\n        opt_cast.blob_allocator = hidden_allocator;\n        cast_bfloat16_to_float32(bottom_blobs[1], hidden, opt_cast);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = rnn_bf16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        {\n            int ret = rnn_bf16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            int ret = rnn_bf16s(bottom_blob, top_blob_reverse, 1, 
weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned short* pf = top_blob_forward.row<const unsigned short>(i);\n            const unsigned short* pr = top_blob_reverse.row<const unsigned short>(i);\n            unsigned short* ptr = top_blob.row<unsigned short>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(unsigned short));\n            memcpy(ptr + num_output, pr, num_output * sizeof(unsigned short));\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        cast_float32_to_bfloat16(hidden, top_blobs[1], opt);\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n#if NCNN_INT8\nint RNN_arm::create_pipeline_int8(const Option& opt)\n{\n    const int num_directions = direction == 2 ? 2 : 1;\n    const int size = weight_data_size / num_directions / num_output;\n\n    rnn_transform_weight_int8(weight_xc_data, weight_xc_data_int8_scales, weight_hc_data, weight_hc_data_int8_scales, bias_c_data, weight_data_tm, weight_data_tm_int8_descales, bias_c_data_packed, size, num_output, num_directions, opt);\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        weight_hc_data.release();\n        bias_c_data.release();\n        weight_xc_data_int8_scales.release();\n        weight_hc_data_int8_scales.release();\n    }\n\n    return 0;\n}\n\nvoid RNN_arm::dynamic_quantize(const Mat& bottom_blob, int elemtype, Mat& bottom_blob_int8, Mat& bottom_blob_int8_descales, const Option& opt) const\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    // dynamic quantize bottom_blob\n    bottom_blob_int8_descales.create(T, (size_t)4u, 1, opt.blob_allocator);\n\n    Mat bottom_blob_int8_scales(T, (size_t)4u, 1, opt.blob_allocator);\n\n    if (elemtype == 1)\n    {\n        // fp32\n        for (int t = 0; t < T; 
t++)\n        {\n            const float* x = bottom_blob.row(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(x[i]));\n            }\n\n            bottom_blob_int8_scales[t] = 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n    if (elemtype == 2)\n    {\n        // fp16\n        for (int t = 0; t < T; t++)\n        {\n            const unsigned short* x = bottom_blob.row<const unsigned short>(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(float16_to_float32(x[i])));\n            }\n\n            bottom_blob_int8_scales[t] = 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n    if (elemtype == 4)\n    {\n        // bf16\n        for (int t = 0; t < T; t++)\n        {\n            const unsigned short* x = bottom_blob.row<const unsigned short>(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(bfloat16_to_float32(x[i])));\n            }\n\n            bottom_blob_int8_scales[t] = 127.f / absmax;\n            bottom_blob_int8_descales[t] = absmax / 127.f;\n        }\n    }\n\n    quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt);\n}\n\nint RNN_arm::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elemtype = 1; // fp32\n    {\n        int elembits = bottom_blob.elembits();\n\n        // clang-format off\n        // *INDENT-OFF*\n\n#if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        {\n            elemtype = 2; // fp16\n        }\n        else\n#endif\n#if NCNN_BF16\n        if (opt.use_bf16_storage && elembits == 16)\n        {\n            elemtype = 4; 
// bf16\n        }\n        else\n#endif\n        {\n            // fp32\n        }\n\n        // *INDENT-ON*\n        // clang-format on\n    }\n\n    int T = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8;\n    Mat bottom_blob_int8_descales;\n    {\n        Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        dynamic_quantize(bottom_blob, elemtype, bottom_blob_int8, bottom_blob_int8_descales, opt_quant);\n    }\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        rnn_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, direction, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden, opt);\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            rnn_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_forward, elemtype, 0, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden, opt);\n        }\n\n        hidden.fill(0.0f);\n\n        {\n            rnn_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_reverse, elemtype, 1, weight_data_tm.channel(1), weight_data_tm_int8_descales.channel(1), 
bias_c_data_packed.channel(1), hidden, opt);\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned char* pf = top_blob_forward.row<const unsigned char>(i);\n            const unsigned char* pr = top_blob_reverse.row<const unsigned char>(i);\n            unsigned char* ptr = top_blob.row<unsigned char>(i);\n\n            memcpy(ptr, pf, num_output * elemsize);\n            memcpy(ptr + num_output * elemsize, pr, num_output * elemsize);\n        }\n    }\n\n    return 0;\n}\n\nint RNN_arm::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n\n    int elemtype = 1; // fp32\n    {\n        int elembits = bottom_blob.elembits();\n\n        // clang-format off\n        // *INDENT-OFF*\n\n#if NCNN_ARM82\n        if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        {\n            elemtype = 2; // fp16\n        }\n        else\n#endif\n#if NCNN_BF16\n        if (opt.use_bf16_storage && elembits == 16)\n        {\n            elemtype = 4; // bf16\n        }\n        else\n#endif\n        {\n            // fp32\n        }\n\n        // *INDENT-ON*\n        // clang-format on\n    }\n\n    int T = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        if (elemtype == 1)\n        {\n            hidden = bottom_blobs[1].clone(hidden_allocator);\n        }\n        if (elemtype == 2)\n        {\n            Option opt_cast = opt;\n            opt_cast.blob_allocator = hidden_allocator;\n            cast_float16_to_float32(bottom_blobs[1], hidden, opt_cast);\n        }\n        if (elemtype == 4)\n        {\n            Option opt_cast = opt;\n            opt_cast.blob_allocator = hidden_allocator;\n            cast_bfloat16_to_float32(bottom_blobs[1], hidden, opt_cast);\n        }\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8;\n    Mat bottom_blob_int8_descales;\n    {\n        Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        dynamic_quantize(bottom_blob, elemtype, bottom_blob_int8, bottom_blob_int8_descales, opt_quant);\n    }\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        rnn_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, direction, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden, opt);\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, elemsize, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = 
hidden.row_range(0, 1);\n        {\n            rnn_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_forward, elemtype, 0, weight_data_tm.channel(0), weight_data_tm_int8_descales.channel(0), bias_c_data_packed.channel(0), hidden0, opt);\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            rnn_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob_reverse, elemtype, 1, weight_data_tm.channel(1), weight_data_tm_int8_descales.channel(1), bias_c_data_packed.channel(1), hidden1, opt);\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const unsigned char* pf = top_blob_forward.row<const unsigned char>(i);\n            const unsigned char* pr = top_blob_reverse.row<const unsigned char>(i);\n            unsigned char* ptr = top_blob.row<unsigned char>(i);\n\n            memcpy(ptr, pf, num_output * elemsize);\n            memcpy(ptr + num_output * elemsize, pr, num_output * elemsize);\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        if (elemtype == 1)\n        {\n            top_blobs[1] = hidden;\n        }\n        if (elemtype == 2)\n        {\n            cast_float32_to_float16(hidden, top_blobs[1], opt);\n        }\n        if (elemtype == 4)\n        {\n            cast_float32_to_bfloat16(hidden, top_blobs[1], opt);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/rnn_arm.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_RNN_ARM_H\n#define LAYER_RNN_ARM_H\n\n#include \"rnn.h\"\n\nnamespace ncnn {\n\nclass RNN_arm : public RNN\n{\npublic:\n    RNN_arm();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int create_pipeline_bf16s(const Option& opt);\n    int forward_bf16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8(const Option& opt);\n    void dynamic_quantize(const Mat& bottom_blob, int elemtype, Mat& bottom_blob_int8, Mat& bottom_blob_int8_descales, const Option& opt) const;\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n    int forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n\npublic:\n    Mat weight_xc_data_packed;\n    Mat bias_c_data_packed;\n    Mat weight_hc_data_packed;\n\n    Mat weight_data_tm;\n\n#if NCNN_INT8\n    Mat weight_data_tm_int8_descales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_RNN_ARM_H\n"
  },
  {
    "path": "src/layer/arm/rnn_arm_asimddp.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"layer.h\"\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"rnn_int8.h\"\n\nvoid rnn_transform_weight_int8_asimddp(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, const Option& opt)\n{\n    rnn_transform_weight_int8(weight_xc, weight_xc_int8_scales, weight_hc, weight_hc_int8_scales, bias_c, weight_data_tm, weight_data_tm_int8_descales, bias_c_tm, size, num_output, num_directions, opt);\n}\n\nvoid rnn_int8_asimddp(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, Mat& hidden_state, const Option& opt)\n{\n    rnn_int8(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, reverse, weight_data_tm, weight_data_tm_int8_descales, bias_c, hidden_state, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/rnn_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"rnn_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"arm_activation.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // num_output\n    Mat gates(num_output, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? T - 1 - t : t;\n\n        const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n        int nn_num_output = num_output >> 3;\n        int remain_num_output_start = nn_num_output << 3;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 8;\n\n            const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 8);\n            const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 8);\n\n            float16x8_t _rnn_H = vld1q_f16((const __fp16*)bias_c + q);\n            float16x8_t _sum1 = vdupq_n_f16(0.f);\n            float16x8_t _sum2 = vdupq_n_f16(0.f);\n            float16x8_t _sum3 = vdupq_n_f16(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float16x4_t _x = vld1_f16(x + i);\n                float16x8_t _weight_xc = vld1q_f16(weight_xc_ptr);\n                float16x8_t _weight_xc_1 = vld1q_f16(weight_xc_ptr + 8);\n                float16x8_t _weight_xc_2 = vld1q_f16(weight_xc_ptr + 16);\n                float16x8_t _weight_xc_3 = vld1q_f16(weight_xc_ptr + 24);\n                _rnn_H = vfmaq_lane_f16(_rnn_H, _weight_xc, _x, 0);\n                _sum1 
= vfmaq_lane_f16(_sum1, _weight_xc_1, _x, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _weight_xc_2, _x, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _weight_xc_3, _x, 3);\n\n                weight_xc_ptr += 32;\n            }\n            for (; i < size; i++)\n            {\n                float16x8_t _x = vdupq_n_f16(x[i]);\n                float16x8_t _weight_xc = vld1q_f16(weight_xc_ptr);\n                _rnn_H = vfmaq_f16(_rnn_H, _weight_xc, _x);\n\n                weight_xc_ptr += 8;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float16x4_t _hidden_state = vcvt_f16_f32(vld1q_f32((const float*)hidden_state + i));\n                float16x8_t _weight_hc = vld1q_f16(weight_hc_ptr);\n                float16x8_t _weight_hc_1 = vld1q_f16(weight_hc_ptr + 8);\n                float16x8_t _weight_hc_2 = vld1q_f16(weight_hc_ptr + 16);\n                float16x8_t _weight_hc_3 = vld1q_f16(weight_hc_ptr + 24);\n                _rnn_H = vfmaq_lane_f16(_rnn_H, _weight_hc, _hidden_state, 0);\n                _sum1 = vfmaq_lane_f16(_sum1, _weight_hc_1, _hidden_state, 1);\n                _sum2 = vfmaq_lane_f16(_sum2, _weight_hc_2, _hidden_state, 2);\n                _sum3 = vfmaq_lane_f16(_sum3, _weight_hc_3, _hidden_state, 3);\n\n                weight_hc_ptr += 32;\n            }\n            for (; i < num_output; i++)\n            {\n                float16x8_t _hidden_state = vdupq_n_f16((__fp16)hidden_state[i]);\n                float16x8_t _weight_hc = vld1q_f16(weight_hc_ptr);\n                _rnn_H = vfmaq_f16(_rnn_H, _weight_hc, _hidden_state);\n\n                weight_hc_ptr += 8;\n            }\n\n            _rnn_H = vaddq_f16(_rnn_H, _sum1);\n            _sum2 = vaddq_f16(_sum2, _sum3);\n            _rnn_H = vaddq_f16(_rnn_H, _sum2);\n\n            float32x4_t _H32low = tanh_ps(vcvt_f32_f16(vget_low_f16(_rnn_H)));\n            float32x4_t _H32high = 
tanh_ps(vcvt_f32_f16(vget_high_f16(_rnn_H)));\n\n            vst1q_f32((float*)gates + q, _H32low);\n            vst1q_f32((float*)gates + q + 4, _H32high);\n        }\n        nn_num_output = (num_output - remain_num_output_start) >> 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = remain_num_output_start + qq * 4;\n\n            const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 8 + (q % 8) / 4);\n            const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 8 + (q % 8) / 4);\n\n            float16x4_t _rnn_H = vld1_f16((const __fp16*)bias_c + q);\n            float16x4_t _sum1 = vdup_n_f16(0.f);\n            float16x4_t _sum2 = vdup_n_f16(0.f);\n            float16x4_t _sum3 = vdup_n_f16(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float16x4_t _x = vld1_f16(x + i);\n                float16x4_t _weight_xc = vld1_f16(weight_xc_ptr);\n                float16x4_t _weight_xc_1 = vld1_f16(weight_xc_ptr + 4);\n                float16x4_t _weight_xc_2 = vld1_f16(weight_xc_ptr + 8);\n                float16x4_t _weight_xc_3 = vld1_f16(weight_xc_ptr + 12);\n                _rnn_H = vfma_lane_f16(_rnn_H, _weight_xc, _x, 0);\n                _sum1 = vfma_lane_f16(_sum1, _weight_xc_1, _x, 1);\n                _sum2 = vfma_lane_f16(_sum2, _weight_xc_2, _x, 2);\n                _sum3 = vfma_lane_f16(_sum3, _weight_xc_3, _x, 3);\n\n                weight_xc_ptr += 16;\n            }\n            for (; i < size; i++)\n            {\n                float16x4_t _x = vdup_n_f16(x[i]);\n                float16x4_t _weight_xc = vld1_f16(weight_xc_ptr);\n                _rnn_H = vfma_f16(_rnn_H, _weight_xc, _x);\n\n                weight_xc_ptr += 4;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float16x4_t _hidden_state = 
vcvt_f16_f32(vld1q_f32((const float*)hidden_state + i));\n                float16x4_t _weight_hc = vld1_f16(weight_hc_ptr);\n                float16x4_t _weight_hc_1 = vld1_f16(weight_hc_ptr + 4);\n                float16x4_t _weight_hc_2 = vld1_f16(weight_hc_ptr + 8);\n                float16x4_t _weight_hc_3 = vld1_f16(weight_hc_ptr + 12);\n                _rnn_H = vfma_lane_f16(_rnn_H, _weight_hc, _hidden_state, 0);\n                _sum1 = vfma_lane_f16(_sum1, _weight_hc_1, _hidden_state, 1);\n                _sum2 = vfma_lane_f16(_sum2, _weight_hc_2, _hidden_state, 2);\n                _sum3 = vfma_lane_f16(_sum3, _weight_hc_3, _hidden_state, 3);\n\n                weight_hc_ptr += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float16x4_t _hidden_state = vdup_n_f16((__fp16)hidden_state[i]);\n                float16x4_t _weight_hc = vld1_f16(weight_hc_ptr);\n                _rnn_H = vfma_f16(_rnn_H, _weight_hc, _hidden_state);\n\n                weight_hc_ptr += 4;\n            }\n\n            _rnn_H = vadd_f16(_rnn_H, _sum1);\n            _sum2 = vadd_f16(_sum2, _sum3);\n            _rnn_H = vadd_f16(_rnn_H, _sum2);\n\n            float32x4_t _H32 = tanh_ps(vcvt_f32_f16(_rnn_H));\n\n            vst1q_f32((float*)gates + q, _H32);\n        }\n        remain_num_output_start += nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 8 + (q % 8) / 4 + q % 4);\n            const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 8 + (q % 8) / 4 + q % 4);\n\n            __fp16 H = ((const __fp16*)bias_c)[q];\n\n            for (int i = 0; i < size; i++)\n            {\n                H += weight_xc_ptr[i] * x[i];\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                H += weight_hc_ptr[i] * 
(__fp16)hidden_state[i];\n            }\n\n            float H32 = tanhf((float)H);\n\n            gates[q] = H32;\n        }\n\n        __fp16* output_data = top_blob.row<__fp16>(ti);\n\n        float* hidden_ptr = hidden_state;\n\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            float32x4_t _rnn_H = vld1q_f32((float*)gates + q);\n\n            vst1q_f32(hidden_ptr + q, _rnn_H);\n            vst1_f16(output_data + q, vcvt_f16_f32(_rnn_H));\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            float H = gates[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = (__fp16)H;\n        }\n    }\n\n    return 0;\n}\n\nstatic int rnn_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    if (opt.use_fp16_arithmetic)\n        return rnn_fp16sa(bottom_blob, top_blob, reverse, weight_xc, bias_c, weight_hc, hidden_state, opt);\n\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // num_output\n    Mat gates(num_output, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? 
T - 1 - t : t;\n\n        const __fp16* x = bottom_blob.row<const __fp16>(ti);\n\n        int nn_num_output = num_output >> 2;\n        int remain_num_output_start = nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 4);\n            const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 4);\n\n            float32x4_t _rnn_H = vcvt_f32_f16(vld1_f16((const __fp16*)bias_c + q));\n            float32x4_t _sum1 = vdupq_n_f32(0.f);\n            float32x4_t _sum2 = vdupq_n_f32(0.f);\n            float32x4_t _sum3 = vdupq_n_f32(0.f);\n\n            int i = 0;\n            for (; i + 3 < size; i += 4)\n            {\n                float32x4_t _x = vcvt_f32_f16(vld1_f16(x + i));\n                float32x4_t _weight_xc = vcvt_f32_f16(vld1_f16(weight_xc_ptr));\n                float32x4_t _weight_xc_1 = vcvt_f32_f16(vld1_f16(weight_xc_ptr + 4));\n                float32x4_t _weight_xc_2 = vcvt_f32_f16(vld1_f16(weight_xc_ptr + 8));\n                float32x4_t _weight_xc_3 = vcvt_f32_f16(vld1_f16(weight_xc_ptr + 12));\n                _rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_xc, _x, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_1, _x, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_2, _x, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_3, _x, 3);\n\n                weight_xc_ptr += 16;\n            }\n            for (; i < size; i++)\n            {\n                float32x4_t _x = vcvt_f32_f16(vdup_n_f16(x[i]));\n                float32x4_t _weight_xc = vcvt_f32_f16(vld1_f16(weight_xc_ptr));\n                _rnn_H = vfmaq_f32(_rnn_H, _weight_xc, _x);\n\n                weight_xc_ptr += 4;\n            }\n\n            i = 0;\n            for (; i + 3 < num_output; i += 4)\n            {\n                float32x4_t 
_hidden_state = vld1q_f32((const float*)hidden_state + i);\n                float32x4_t _weight_hc = vcvt_f32_f16(vld1_f16(weight_hc_ptr));\n                float32x4_t _weight_hc_1 = vcvt_f32_f16(vld1_f16(weight_hc_ptr + 4));\n                float32x4_t _weight_hc_2 = vcvt_f32_f16(vld1_f16(weight_hc_ptr + 8));\n                float32x4_t _weight_hc_3 = vcvt_f32_f16(vld1_f16(weight_hc_ptr + 12));\n                _rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_hc, _hidden_state, 0);\n                _sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_1, _hidden_state, 1);\n                _sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_2, _hidden_state, 2);\n                _sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_3, _hidden_state, 3);\n\n                weight_hc_ptr += 16;\n            }\n            for (; i < num_output; i++)\n            {\n                float32x4_t _hidden_state = vdupq_n_f32(hidden_state[i]);\n                float32x4_t _weight_hc = vcvt_f32_f16(vld1_f16(weight_hc_ptr));\n                _rnn_H = vfmaq_f32(_rnn_H, _weight_hc, _hidden_state);\n\n                weight_hc_ptr += 4;\n            }\n\n            _rnn_H = vaddq_f32(_rnn_H, _sum1);\n            _sum2 = vaddq_f32(_sum2, _sum3);\n            _rnn_H = vaddq_f32(_rnn_H, _sum2);\n\n            _rnn_H = tanh_ps(_rnn_H);\n\n            vst1q_f32((float*)gates + q, _rnn_H);\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 4 + q % 4);\n            const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 4 + q % 4);\n\n            float H = (float)(((const __fp16*)bias_c)[q]);\n\n            for (int i = 0; i < size; i++)\n            {\n                H += (float)weight_xc_ptr[i] * (float)x[i];\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                H += (float)weight_hc_ptr[i] * 
hidden_state[i];\n            }\n\n            H = tanhf(H);\n\n            gates[q] = H;\n        }\n\n        __fp16* output_data = top_blob.row<__fp16>(ti);\n\n        float* hidden_ptr = hidden_state;\n\n        nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            float32x4_t _rnn_H = vld1q_f32((float*)gates + q);\n\n            vst1q_f32(hidden_ptr + q, _rnn_H);\n            vst1_f16(output_data + q, vcvt_f16_f32(_rnn_H));\n        }\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            float H = gates[q];\n\n            hidden_ptr[q] = H;\n            output_data[q] = (__fp16)H;\n        }\n    }\n\n    return 0;\n}\n\nint RNN_arm::create_pipeline_fp16s(const Option& opt)\n{\n    int num_directions = direction == 2 ? 
2 : 1;\n    int size = weight_data_size / num_directions / num_output;\n\n    if (opt.use_fp16_arithmetic)\n    {\n        weight_xc_data_packed.create(size * 8, num_output / 8 + (num_output % 8) / 4 + num_output % 4, num_directions, 2u, 1);\n        weight_hc_data_packed.create(num_output * 8, num_output / 8 + (num_output % 8) / 4 + num_output % 4, num_directions, 2u, 1);\n    }\n    else\n    {\n        weight_xc_data_packed.create(size * 4, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n        weight_hc_data_packed.create(num_output * 4, num_output / 4 + num_output % 4, num_directions, 2u, 1);\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc = weight_xc_data.channel(dr);\n        const Mat weight_hc = weight_hc_data.channel(dr);\n\n        Mat weight_xc_data_packed_dr = weight_xc_data_packed.channel(dr);\n        Mat weight_hc_data_packed_dr = weight_hc_data_packed.channel(dr);\n\n        int q = 0;\n        if (opt.use_fp16_arithmetic)\n        {\n            for (; q + 7 < num_output; q += 8)\n            {\n                const float* weight_xc_0 = weight_xc.row(q);\n                const float* weight_xc_1 = weight_xc.row(q + 1);\n                const float* weight_xc_2 = weight_xc.row(q + 2);\n                const float* weight_xc_3 = weight_xc.row(q + 3);\n                const float* weight_xc_4 = weight_xc.row(q + 4);\n                const float* weight_xc_5 = weight_xc.row(q + 5);\n                const float* weight_xc_6 = weight_xc.row(q + 6);\n                const float* weight_xc_7 = weight_xc.row(q + 7);\n\n                const float* weight_hc_0 = weight_hc.row(q);\n                const float* weight_hc_1 = weight_hc.row(q + 1);\n                const float* weight_hc_2 = weight_hc.row(q + 2);\n                const float* weight_hc_3 = weight_hc.row(q + 3);\n                const float* weight_hc_4 = weight_hc.row(q + 4);\n   
             const float* weight_hc_5 = weight_hc.row(q + 5);\n                const float* weight_hc_6 = weight_hc.row(q + 6);\n                const float* weight_hc_7 = weight_hc.row(q + 7);\n\n                __fp16* weight_xc = weight_xc_data_packed_dr.row<__fp16>(q / 8);\n                __fp16* weight_hc = weight_hc_data_packed_dr.row<__fp16>(q / 8);\n\n                for (int i = 0; i < size; i++)\n                {\n                    weight_xc[0] = (__fp16)weight_xc_0[i];\n                    weight_xc[1] = (__fp16)weight_xc_1[i];\n                    weight_xc[2] = (__fp16)weight_xc_2[i];\n                    weight_xc[3] = (__fp16)weight_xc_3[i];\n                    weight_xc[4] = (__fp16)weight_xc_4[i];\n                    weight_xc[5] = (__fp16)weight_xc_5[i];\n                    weight_xc[6] = (__fp16)weight_xc_6[i];\n                    weight_xc[7] = (__fp16)weight_xc_7[i];\n\n                    weight_xc += 8;\n                }\n\n                for (int i = 0; i < num_output; i++)\n                {\n                    weight_hc[0] = (__fp16)weight_hc_0[i];\n                    weight_hc[1] = (__fp16)weight_hc_1[i];\n                    weight_hc[2] = (__fp16)weight_hc_2[i];\n                    weight_hc[3] = (__fp16)weight_hc_3[i];\n                    weight_hc[4] = (__fp16)weight_hc_4[i];\n                    weight_hc[5] = (__fp16)weight_hc_5[i];\n                    weight_hc[6] = (__fp16)weight_hc_6[i];\n                    weight_hc[7] = (__fp16)weight_hc_7[i];\n\n                    weight_hc += 8;\n                }\n            }\n        }\n        for (; q + 3 < num_output; q += 4)\n        {\n            const float* weight_xc_0 = weight_xc.row(q);\n            const float* weight_xc_1 = weight_xc.row(q + 1);\n            const float* weight_xc_2 = weight_xc.row(q + 2);\n            const float* weight_xc_3 = weight_xc.row(q + 3);\n\n            const float* weight_hc_0 = weight_hc.row(q);\n            const float* 
weight_hc_1 = weight_hc.row(q + 1);\n            const float* weight_hc_2 = weight_hc.row(q + 2);\n            const float* weight_hc_3 = weight_hc.row(q + 3);\n\n            __fp16* weight_xc = opt.use_fp16_arithmetic ? weight_xc_data_packed_dr.row<__fp16>(q / 8 + (q % 8) / 4) : weight_xc_data_packed_dr.row<__fp16>(q / 4);\n            __fp16* weight_hc = opt.use_fp16_arithmetic ? weight_hc_data_packed_dr.row<__fp16>(q / 8 + (q % 8) / 4) : weight_hc_data_packed_dr.row<__fp16>(q / 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc[0] = (__fp16)weight_xc_0[i];\n                weight_xc[1] = (__fp16)weight_xc_1[i];\n                weight_xc[2] = (__fp16)weight_xc_2[i];\n                weight_xc[3] = (__fp16)weight_xc_3[i];\n\n                weight_xc += 4;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc[0] = (__fp16)weight_hc_0[i];\n                weight_hc[1] = (__fp16)weight_hc_1[i];\n                weight_hc[2] = (__fp16)weight_hc_2[i];\n                weight_hc[3] = (__fp16)weight_hc_3[i];\n\n                weight_hc += 4;\n            }\n        }\n        for (; q < num_output; q++)\n        {\n            const float* weight_xc_0 = weight_xc.row(q);\n            const float* weight_hc_0 = weight_hc.row(q);\n\n            __fp16* weight_xc = opt.use_fp16_arithmetic ? weight_xc_data_packed_dr.row<__fp16>(q / 8 + (q % 8) / 4 + q % 4) : weight_xc_data_packed_dr.row<__fp16>(q / 4 + q % 4);\n            __fp16* weight_hc = opt.use_fp16_arithmetic ? 
weight_hc_data_packed_dr.row<__fp16>(q / 8 + (q % 8) / 4 + q % 4) : weight_hc_data_packed_dr.row<__fp16>(q / 4 + q % 4);\n\n            for (int i = 0; i < size; i++)\n            {\n                weight_xc[i] = (__fp16)weight_xc_0[i];\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                weight_hc[i] = (__fp16)weight_hc_0[i];\n            }\n        }\n    }\n\n    cast_float32_to_float16(bias_c_data, bias_c_data_packed, opt);\n\n    if (opt.lightmode)\n    {\n        weight_xc_data.release();\n        bias_c_data.release();\n        weight_hc_data.release();\n    }\n\n    return 0;\n}\n\nint RNN_arm::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = rnn_fp16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        {\n            int ret = rnn_fp16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        
hidden.fill(0.f);\n\n        {\n            int ret = rnn_fp16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const __fp16* pf = top_blob_forward.row<const __fp16>(i);\n            const __fp16* pr = top_blob_reverse.row<const __fp16>(i);\n            __fp16* ptr = top_blob.row<__fp16>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(__fp16));\n            memcpy(ptr + num_output, pr, num_output * sizeof(__fp16));\n        }\n    }\n\n    return 0;\n}\n\nint RNN_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        Option opt_cast = opt;\n        opt_cast.blob_allocator = hidden_allocator;\n        cast_float16_to_float32(bottom_blobs[1], hidden, opt_cast);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 2u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n        int ret = rnn_fp16s(bottom_blob, top_blob, direction, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden, opt);\n        if (ret != 0)\n            return ret;\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 2u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        {\n            int ret = rnn_fp16s(bottom_blob, top_blob_forward, 0, weight_xc_data_packed.channel(0), bias_c_data_packed.channel(0), weight_hc_data_packed.channel(0), hidden0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        {\n            int ret = rnn_fp16s(bottom_blob, top_blob_reverse, 1, weight_xc_data_packed.channel(1), bias_c_data_packed.channel(1), weight_hc_data_packed.channel(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const __fp16* pf = top_blob_forward.row<const __fp16>(i);\n            const 
__fp16* pr = top_blob_reverse.row<const __fp16>(i);\n            __fp16* ptr = top_blob.row<__fp16>(i);\n\n            memcpy(ptr, pf, num_output * sizeof(__fp16));\n            memcpy(ptr + num_output, pr, num_output * sizeof(__fp16));\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        cast_float32_to_float16(hidden, top_blobs[1], opt);\n    }\n\n    return 0;\n}\n#endif\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/rnn_arm_vfpv4.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#include \"layer.h\"\n#include \"arm_activation.h\"\n#include \"arm_usability.h\"\n\nnamespace ncnn {\n\n#include \"rnn_int8.h\"\n\nvoid rnn_int8_gate_output_vfpv4(const Mat& gates, Mat& hidden_state, Mat& top_blob, int ti, int elemtype, const Option& opt)\n{\n    rnn_int8_gate_output(gates, hidden_state, top_blob, ti, elemtype, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/rnn_int8.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\nvoid rnn_transform_weight_int8_asimddp(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, const Option& opt);\nvoid rnn_int8_asimddp(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, Mat& hidden_state, const Option& opt);\n#endif\n\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\nvoid rnn_int8_gate_output_vfpv4(const Mat& gates, Mat& hidden_state, Mat& top_blob, int ti, int elemtype, const Option& opt);\n#endif\n\nstatic void rnn_transform_weight_int8(const Mat& weight_xc, const Mat& weight_xc_int8_scales, const Mat& weight_hc, const Mat& weight_hc_int8_scales, const Mat& bias_c, Mat& weight_data_tm, Mat& weight_data_tm_int8_descales, Mat& bias_c_tm, int size, int num_output, int num_directions, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        rnn_transform_weight_int8_asimddp(weight_xc, weight_xc_int8_scales, weight_hc, weight_hc_int8_scales, bias_c, weight_data_tm, weight_data_tm_int8_descales, bias_c_tm, size, num_output, num_directions, opt);\n        return;\n    }\n#endif\n\n#if __ARM_NEON\n    weight_data_tm.create(size * 4 + num_output * 4, num_output / 4 + num_output % 4, num_directions, 1u, 1);\n    weight_data_tm_int8_descales.create(4 + 4, num_output / 4 + num_output % 4, num_directions);\n#else\n    weight_data_tm.create(size + num_output, num_output, num_directions, 1u, 1);\n    weight_data_tm_int8_descales.create(1 + 1, 
num_output, num_directions);\n#endif\n    bias_c_tm = bias_c;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int dr = 0; dr < num_directions; dr++)\n    {\n        const Mat weight_xc_dr = weight_xc.channel(dr);\n        const Mat weight_hc_dr = weight_hc.channel(dr);\n        const float* weight_xc_int8_scales_ptr = weight_xc_int8_scales.row(dr);\n        const float* weight_hc_int8_scales_ptr = weight_hc_int8_scales.row(dr);\n\n        Mat weight_data_tm_dr = weight_data_tm.channel(dr);\n        Mat weight_data_tm_int8_descales_dr = weight_data_tm_int8_descales.channel(dr);\n\n        int q = 0;\n#if __ARM_NEON\n        for (; q + 3 < num_output; q += 4)\n        {\n            const signed char* weight_xc_0 = weight_xc_dr.row<const signed char>(q);\n            const signed char* weight_xc_1 = weight_xc_dr.row<const signed char>(q + 1);\n            const signed char* weight_xc_2 = weight_xc_dr.row<const signed char>(q + 2);\n            const signed char* weight_xc_3 = weight_xc_dr.row<const signed char>(q + 3);\n\n            const signed char* weight_hc_0 = weight_hc_dr.row<const signed char>(q);\n            const signed char* weight_hc_1 = weight_hc_dr.row<const signed char>(q + 1);\n            const signed char* weight_hc_2 = weight_hc_dr.row<const signed char>(q + 2);\n            const signed char* weight_hc_3 = weight_hc_dr.row<const signed char>(q + 3);\n\n            signed char* kptr = weight_data_tm_dr.row<signed char>(q / 4);\n            float* descales_ptr = weight_data_tm_int8_descales_dr.row(q / 4);\n\n            int i = 0;\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n                kptr[0] = weight_xc_0[i];\n                kptr[1] = weight_xc_0[i + 1];\n                kptr[2] = weight_xc_0[i + 2];\n                kptr[3] = weight_xc_0[i + 3];\n                kptr[4] = weight_xc_1[i];\n                kptr[5] = weight_xc_1[i + 1];\n                kptr[6] = weight_xc_1[i + 
2];\n                kptr[7] = weight_xc_1[i + 3];\n                kptr[8 + 0] = weight_xc_2[i];\n                kptr[8 + 1] = weight_xc_2[i + 1];\n                kptr[8 + 2] = weight_xc_2[i + 2];\n                kptr[8 + 3] = weight_xc_2[i + 3];\n                kptr[8 + 4] = weight_xc_3[i];\n                kptr[8 + 5] = weight_xc_3[i + 1];\n                kptr[8 + 6] = weight_xc_3[i + 2];\n                kptr[8 + 7] = weight_xc_3[i + 3];\n\n                kptr += 16;\n            }\n#else\n            for (; i + 7 < size; i += 8)\n            {\n                vst1_s8(kptr, vld1_s8(weight_xc_0 + i));\n                vst1_s8(kptr + 8, vld1_s8(weight_xc_1 + i));\n                vst1_s8(kptr + 16, vld1_s8(weight_xc_2 + i));\n                vst1_s8(kptr + 24, vld1_s8(weight_xc_3 + i));\n                kptr += 32;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < size; i += 2)\n            {\n                kptr[0] = weight_xc_0[i];\n                kptr[1] = weight_xc_0[i + 1];\n                kptr[2] = weight_xc_1[i];\n                kptr[3] = weight_xc_1[i + 1];\n                kptr[4] = weight_xc_2[i];\n                kptr[5] = weight_xc_2[i + 1];\n                kptr[6] = weight_xc_3[i];\n                kptr[7] = weight_xc_3[i + 1];\n\n                kptr += 8;\n            }\n            for (; i < size; i++)\n            {\n                kptr[0] = weight_xc_0[i];\n                kptr[1] = weight_xc_1[i];\n                kptr[2] = weight_xc_2[i];\n                kptr[3] = weight_xc_3[i];\n\n                kptr += 4;\n            }\n\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n                kptr[0] = weight_hc_0[i];\n                kptr[1] = weight_hc_0[i + 1];\n                kptr[2] = weight_hc_0[i + 2];\n                kptr[3] = weight_hc_0[i + 3];\n                kptr[4] = weight_hc_1[i];\n                kptr[5] = weight_hc_1[i + 
1];\n                kptr[6] = weight_hc_1[i + 2];\n                kptr[7] = weight_hc_1[i + 3];\n                kptr[8 + 0] = weight_hc_2[i];\n                kptr[8 + 1] = weight_hc_2[i + 1];\n                kptr[8 + 2] = weight_hc_2[i + 2];\n                kptr[8 + 3] = weight_hc_2[i + 3];\n                kptr[8 + 4] = weight_hc_3[i];\n                kptr[8 + 5] = weight_hc_3[i + 1];\n                kptr[8 + 6] = weight_hc_3[i + 2];\n                kptr[8 + 7] = weight_hc_3[i + 3];\n\n                kptr += 16;\n            }\n#else\n            for (; i + 7 < num_output; i += 8)\n            {\n                vst1_s8(kptr, vld1_s8(weight_hc_0 + i));\n                vst1_s8(kptr + 8, vld1_s8(weight_hc_1 + i));\n                vst1_s8(kptr + 16, vld1_s8(weight_hc_2 + i));\n                vst1_s8(kptr + 24, vld1_s8(weight_hc_3 + i));\n                kptr += 32;\n            }\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 1 < num_output; i += 2)\n            {\n                kptr[0] = weight_hc_0[i];\n                kptr[1] = weight_hc_0[i + 1];\n                kptr[2] = weight_hc_1[i];\n                kptr[3] = weight_hc_1[i + 1];\n                kptr[4] = weight_hc_2[i];\n                kptr[5] = weight_hc_2[i + 1];\n                kptr[6] = weight_hc_3[i];\n                kptr[7] = weight_hc_3[i + 1];\n\n                kptr += 8;\n            }\n            for (; i < num_output; i++)\n            {\n                kptr[0] = weight_hc_0[i];\n                kptr[1] = weight_hc_1[i];\n                kptr[2] = weight_hc_2[i];\n                kptr[3] = weight_hc_3[i];\n\n                kptr += 4;\n            }\n\n            float32x4_t _xc = vld1q_f32(weight_xc_int8_scales_ptr + q);\n            float32x4_t _hc = vld1q_f32(weight_hc_int8_scales_ptr + q);\n\n#if __aarch64__\n            float32x4_t _one = vdupq_n_f32(1.f);\n            float32x4_t _reciprocal_xc = vdivq_f32(_one, _xc);\n            float32x4_t _reciprocal_hc 
= vdivq_f32(_one, _hc);\n#else\n            float32x4_t _reciprocal_xc = vrecpeq_f32(_xc);\n            _reciprocal_xc = vmulq_f32(vrecpsq_f32(_xc, _reciprocal_xc), _reciprocal_xc);\n            _reciprocal_xc = vmulq_f32(vrecpsq_f32(_xc, _reciprocal_xc), _reciprocal_xc);\n            float32x4_t _reciprocal_hc = vrecpeq_f32(_hc);\n            _reciprocal_hc = vmulq_f32(vrecpsq_f32(_hc, _reciprocal_hc), _reciprocal_hc);\n            _reciprocal_hc = vmulq_f32(vrecpsq_f32(_hc, _reciprocal_hc), _reciprocal_hc);\n#endif\n\n            vst1q_f32(descales_ptr, _reciprocal_xc);\n            vst1q_f32(descales_ptr + 4, _reciprocal_hc);\n        }\n#endif // __ARM_NEON\n        for (; q < num_output; q++)\n        {\n            const signed char* weight_xc_0 = weight_xc_dr.row<const signed char>(q);\n            const signed char* weight_hc_0 = weight_hc_dr.row<const signed char>(q);\n\n#if __ARM_NEON\n            signed char* kptr = weight_data_tm_dr.row<signed char>(q / 4 + q % 4);\n            float* descales_ptr = weight_data_tm_int8_descales_dr.row(q / 4 + q % 4);\n#else\n            signed char* kptr = weight_data_tm_dr.row<signed char>(q);\n            float* descales_ptr = weight_data_tm_int8_descales_dr.row(q);\n#endif // __ARM_NEON\n\n            for (int i = 0; i < size; i++)\n            {\n                kptr[0] = weight_xc_0[i];\n                kptr += 1;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                kptr[0] = weight_hc_0[i];\n                kptr += 1;\n            }\n\n            descales_ptr[0] = 1.f / weight_xc_int8_scales_ptr[q];\n            descales_ptr[1] = 1.f / weight_hc_int8_scales_ptr[q];\n        }\n    }\n}\n\nstatic void rnn_int8_gate_output(const Mat& gates, Mat& hidden_state, Mat& top_blob, int ti, int elemtype, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_VFPV4 && __ARM_NEON && !(__ARM_FP & 2)\n    if (ncnn::cpu_support_arm_vfpv4())\n    {\n        
rnn_int8_gate_output_vfpv4(gates, hidden_state, top_blob, ti, elemtype, opt);\n        return;\n    }\n#endif\n\n    const int num_output = top_blob.w;\n\n    float* output_data = top_blob.row(ti);\n\n    float* hidden_ptr = hidden_state;\n\n    int remain_num_output_start = 0;\n#if __ARM_NEON\n    int nn_num_output = num_output >> 2;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int qq = 0; qq < nn_num_output; qq++)\n    {\n        int q = qq * 4;\n\n        float32x4_t _rnn_H = vld1q_f32((const float*)gates + q);\n\n        vst1q_f32(hidden_ptr + q, _rnn_H);\n\n        if (elemtype == 1)\n        {\n            // fp32\n            vst1q_f32(output_data + q, _rnn_H);\n        }\n        if (elemtype == 2)\n        {\n            // fp16\n            unsigned short* outptr = (unsigned short*)output_data + q;\n#if (__ARM_FP & 2)\n#if NCNN_GNU_INLINE_ASM\n#if __aarch64__\n            asm volatile(\n                \"fcvtn  v0.4h, %2.4s        \\n\"\n                \"st1    {v0.4h}, [%0]       \\n\"\n                : \"=r\"(outptr) // %0\n                : \"0\"(outptr),\n                \"w\"(_rnn_H)\n                : \"memory\", \"v0\");\n#else  // __aarch64__\n            asm volatile(\n                \"vcvt.f16.f32 d0, %q2       \\n\"\n                \"vst1.u16   {d0}, [%0]      \\n\"\n                : \"=r\"(outptr) // %0\n                : \"0\"(outptr),\n                \"w\"(_rnn_H)\n                : \"memory\", \"q0\");\n#endif // __aarch64__\n#else  // NCNN_GNU_INLINE_ASM\n            vst1_u16(outptr, (uint16x4_t)vcvt_f16_f32(_rnn_H));\n#endif // NCNN_GNU_INLINE_ASM\n#else\n            // outptr already points at element q, so index from 0\n            outptr[0] = float32_to_float16(hidden_ptr[q]);\n            outptr[1] = float32_to_float16(hidden_ptr[q + 1]);\n            outptr[2] = float32_to_float16(hidden_ptr[q + 2]);\n            outptr[3] = float32_to_float16(hidden_ptr[q + 3]);\n#endif // (__ARM_FP & 2)\n        }\n        if (elemtype == 4)\n        {\n            // 
bf16\n            vst1_u16((unsigned short*)output_data + q, float2bfloat(_rnn_H));\n        }\n    }\n    remain_num_output_start += nn_num_output << 2;\n#endif // __ARM_NEON\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = remain_num_output_start; q < num_output; q++)\n    {\n        float H = gates[q];\n\n        hidden_ptr[q] = H;\n\n        if (elemtype == 1)\n        {\n            output_data[q] = H;\n        }\n        if (elemtype == 2)\n        {\n            ((unsigned short*)output_data)[q] = float32_to_float16(H);\n        }\n        if (elemtype == 4)\n        {\n            ((unsigned short*)output_data)[q] = float32_to_bfloat16(H);\n        }\n    }\n}\n\nstatic void rnn_int8(const Mat& bottom_blob_int8, const Mat& bottom_blob_int8_descales, Mat& top_blob, int elemtype, int reverse, const Mat& weight_data_tm, const Mat& weight_data_tm_int8_descales, const Mat& bias_c, Mat& hidden_state, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_ARM82DOT && __aarch64__ && !__ARM_FEATURE_DOTPROD\n    if (ncnn::cpu_support_arm_asimddp())\n    {\n        rnn_int8_asimddp(bottom_blob_int8, bottom_blob_int8_descales, top_blob, elemtype, reverse, weight_data_tm, weight_data_tm_int8_descales, bias_c, hidden_state, opt);\n        return;\n    }\n#endif\n\n    int size = bottom_blob_int8.w;\n    int T = bottom_blob_int8.h;\n\n    int num_output = top_blob.w;\n\n    // num_output\n    Mat gates(num_output, 4u, opt.workspace_allocator);\n\n    Mat hidden_state_int8(num_output, (size_t)1u, 1, opt.workspace_allocator);\n    float hidden_state_int8_scale = 1.f;\n    float hidden_state_int8_descale = 1.f;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? 
T - 1 - t : t;\n\n        // dynamic quantize hidden_state\n        {\n            float absmax = 0.f;\n            for (int i = 0; i < num_output; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(hidden_state[i]));\n            }\n\n            if (absmax == 0.f)\n            {\n                hidden_state_int8.fill<signed char>(0);\n            }\n            else\n            {\n                hidden_state_int8_scale = 127.f / absmax;\n                hidden_state_int8_descale = absmax / 127.f;\n\n                signed char* hs = hidden_state_int8;\n                for (int i = 0; i < num_output; i++)\n                {\n                    hs[i] = float2int8(hidden_state[i] * hidden_state_int8_scale);\n                }\n            }\n        }\n\n        int remain_num_output_start = 0;\n#if __ARM_NEON\n        int nn_num_output = num_output >> 2;\n        remain_num_output_start = nn_num_output << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int qq = 0; qq < nn_num_output; qq++)\n        {\n            int q = qq * 4;\n\n            const signed char* x = bottom_blob_int8.row<const signed char>(ti);\n            const signed char* hs = hidden_state_int8;\n            const float descale_x = bottom_blob_int8_descales[ti];\n            const float descale_h = hidden_state_int8_descale;\n\n            const signed char* kptr = weight_data_tm.row<const signed char>(q / 4);\n\n            const float* descales_ptr = weight_data_tm_int8_descales.row(q / 4);\n\n            int32x4_t _rnn_Hx0 = vdupq_n_s32(0);\n            int i = 0;\n#if __ARM_FEATURE_DOTPROD\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            int32x4_t _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < size; i += 16)\n            {\n                int8x16_t _xi = vld1q_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 
16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n                _rnn_Hx0 = vdotq_laneq_s32(_rnn_Hx0, _w0, _xi, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _w1, _xi, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _w2, _xi, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _w3, _xi, 3);\n\n                kptr += 64;\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _rnn_Hx0 = vdotq_lane_s32(_rnn_Hx0, _w0, _xi, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w1, _xi, 1);\n\n                kptr += 32;\n            }\n            _rnn_Hx0 = vaddq_s32(_rnn_Hx0, _sum1);\n            _rnn_Hx0 = vaddq_s32(_rnn_Hx0, _sum2);\n            _rnn_Hx0 = vaddq_s32(_rnn_Hx0, _sum3);\n#else\n            int32x4_t _sum0 = vdupq_n_s32(0);\n            int32x4_t _sum1 = vdupq_n_s32(0);\n            int32x4_t _sum2 = vdupq_n_s32(0);\n            int32x4_t _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < size; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* xptr = x + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16-d17}, [%0]     \\n\"\n                    \"vmull.s8   q4, d0, d16         \\n\"\n                    \"vmull.s8   q5, d1, d16         \\n\"\n                    \"vmull.s8   q6, d2, d16         \\n\"\n                    \"vmull.s8   q7, d3, d16         \\n\"\n                    \"vmlal.s8   q4, d4, d17         \\n\"\n                    \"vmlal.s8   q5, d5, d17         \\n\"\n                    \"vmlal.s8   q6, d6, d17         \\n\"\n                    \"vmlal.s8   q7, d7, d17         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n          
          \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                    \"vpadal.s16 %q5, q7             \\n\"\n                    : \"=r\"(xptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : \"0\"(xptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int8x16_t _xi = vld1q_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), vget_low_s8(_xi));\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), vget_low_s8(_xi));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), vget_low_s8(_xi));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), vget_low_s8(_xi));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), vget_high_s8(_xi));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), vget_high_s8(_xi));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), vget_high_s8(_xi));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), vget_high_s8(_xi));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            for (; i + 7 < size; i += 8)\n            {\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _xi);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _xi);\n                
int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _xi);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _xi);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 32;\n            }\n            {\n                int32x4x2_t _tmp0 = vzipq_s32(_sum0, _sum1);\n                int32x4x2_t _tmp1 = vzipq_s32(_sum2, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_tmp0.val[0]), vget_low_s32(_tmp1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_tmp0.val[0]), vget_high_s32(_tmp1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_tmp0.val[1]), vget_low_s32(_tmp1.val[1]));\n                _sum3 = vcombine_s32(vget_high_s32(_tmp0.val[1]), vget_high_s32(_tmp1.val[1]));\n            }\n            _rnn_Hx0 = vaddq_s32(_rnn_Hx0, _sum0);\n            _rnn_Hx0 = vaddq_s32(_rnn_Hx0, _sum1);\n            _rnn_Hx0 = vaddq_s32(_rnn_Hx0, _sum2);\n            _rnn_Hx0 = vaddq_s32(_rnn_Hx0, _sum3);\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < size; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _xi = vld1_s8(x + i);\n                int8x16_t _w = vld1q_s8(kptr);\n                _rnn_Hx0 = vdotq_lane_s32(_rnn_Hx0, _w, _xi, 0);\n#else\n                int16x4_t _xi01 = vreinterpret_s16_s8(vld1_s8(x + i));\n                int8x8_t _xi0 = vreinterpret_s8_s16(vdup_lane_s16(_xi01, 0));\n                int8x8_t _xi1 = vreinterpret_s8_s16(vdup_lane_s16(_xi01, 1));\n                int8x16_t _w01 = vld1q_s8(kptr);\n\n                int16x8_t _rnn_Hx = vmull_s8(vget_low_s8(_w01), _xi0);\n                _rnn_Hx = vmlal_s8(_rnn_Hx, vget_high_s8(_w01), _xi1);\n                _rnn_Hx0 = vpadalq_s16(_rnn_Hx0, _rnn_Hx);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 16;\n            }\n            for (; i + 1 < size; i += 
2)\n            {\n                int8x8_t _xi = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(x + i)), 0));\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _rnn_Hx = vmull_s8(_w, _xi);\n                _rnn_Hx0 = vpadalq_s16(_rnn_Hx0, _rnn_Hx);\n\n                kptr += 8;\n            }\n            for (; i < size; i++)\n            {\n                int8x8_t _xi = vdup_n_s8(x[i]);\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _rnn_Hx = vmull_s8(_w, _xi);\n                _rnn_Hx0 = vaddw_s16(_rnn_Hx0, vget_low_s16(_rnn_Hx));\n\n                kptr += 4;\n            }\n\n            int32x4_t _rnn_Hh0 = vdupq_n_s32(0);\n            i = 0;\n#if __ARM_FEATURE_DOTPROD\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < num_output; i += 16)\n            {\n                int8x16_t _h_cont = vld1q_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n                _rnn_Hh0 = vdotq_laneq_s32(_rnn_Hh0, _w0, _h_cont, 0);\n                _sum1 = vdotq_laneq_s32(_sum1, _w1, _h_cont, 1);\n                _sum2 = vdotq_laneq_s32(_sum2, _w2, _h_cont, 2);\n                _sum3 = vdotq_laneq_s32(_sum3, _w3, _h_cont, 3);\n\n                kptr += 64;\n            }\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                _rnn_Hh0 = vdotq_lane_s32(_rnn_Hh0, _w0, _h_cont, 0);\n                _sum1 = vdotq_lane_s32(_sum1, _w1, _h_cont, 1);\n\n                kptr += 32;\n            }\n            _rnn_Hh0 = vaddq_s32(_rnn_Hh0, _sum1);\n            _rnn_Hh0 = 
vaddq_s32(_rnn_Hh0, _sum2);\n            _rnn_Hh0 = vaddq_s32(_rnn_Hh0, _sum3);\n#else\n            _sum0 = vdupq_n_s32(0);\n            _sum1 = vdupq_n_s32(0);\n            _sum2 = vdupq_n_s32(0);\n            _sum3 = vdupq_n_s32(0);\n            for (; i + 15 < num_output; i += 16)\n            {\n#if NCNN_GNU_INLINE_ASM && !__aarch64__\n                const signed char* hsptr = hs + i;\n\n                asm volatile(\n                    \"vldm       %1!, {d0-d7}        \\n\"\n                    \"vld1.s8    {d16-d17}, [%0]     \\n\"\n                    \"vmull.s8   q4, d0, d16         \\n\"\n                    \"vmull.s8   q5, d1, d16         \\n\"\n                    \"vmull.s8   q6, d2, d16         \\n\"\n                    \"vmull.s8   q7, d3, d16         \\n\"\n                    \"vmlal.s8   q4, d4, d17         \\n\"\n                    \"vmlal.s8   q5, d5, d17         \\n\"\n                    \"vmlal.s8   q6, d6, d17         \\n\"\n                    \"vmlal.s8   q7, d7, d17         \\n\"\n                    \"vpadal.s16 %q2, q4             \\n\"\n                    \"vpadal.s16 %q3, q5             \\n\"\n                    \"vpadal.s16 %q4, q6             \\n\"\n                    \"vpadal.s16 %q5, q7             \\n\"\n                    : \"=r\"(hsptr), \"=r\"(kptr), \"=w\"(_sum0), \"=w\"(_sum1), \"=w\"(_sum2), \"=w\"(_sum3)\n                    : \"0\"(hsptr), \"1\"(kptr), \"2\"(_sum0), \"3\"(_sum1), \"4\"(_sum2), \"5\"(_sum3)\n                    : \"memory\", \"q0\", \"q1\", \"q2\", \"q3\", \"q4\", \"q5\", \"q6\", \"q7\", \"q8\");\n#else\n                int8x16_t _h_cont = vld1q_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n                int8x16_t _w2 = vld1q_s8(kptr + 32);\n                int8x16_t _w3 = vld1q_s8(kptr + 48);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), vget_low_s8(_h_cont));\n                int16x8_t _s1 = 
vmull_s8(vget_high_s8(_w0), vget_low_s8(_h_cont));\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), vget_low_s8(_h_cont));\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), vget_low_s8(_h_cont));\n                _s0 = vmlal_s8(_s0, vget_low_s8(_w2), vget_high_s8(_h_cont));\n                _s1 = vmlal_s8(_s1, vget_high_s8(_w2), vget_high_s8(_h_cont));\n                _s2 = vmlal_s8(_s2, vget_low_s8(_w3), vget_high_s8(_h_cont));\n                _s3 = vmlal_s8(_s3, vget_high_s8(_w3), vget_high_s8(_h_cont));\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 64;\n#endif\n            }\n            for (; i + 7 < num_output; i += 8)\n            {\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w0 = vld1q_s8(kptr);\n                int8x16_t _w1 = vld1q_s8(kptr + 16);\n\n                int16x8_t _s0 = vmull_s8(vget_low_s8(_w0), _h_cont);\n                int16x8_t _s1 = vmull_s8(vget_high_s8(_w0), _h_cont);\n                int16x8_t _s2 = vmull_s8(vget_low_s8(_w1), _h_cont);\n                int16x8_t _s3 = vmull_s8(vget_high_s8(_w1), _h_cont);\n                _sum0 = vpadalq_s16(_sum0, _s0);\n                _sum1 = vpadalq_s16(_sum1, _s1);\n                _sum2 = vpadalq_s16(_sum2, _s2);\n                _sum3 = vpadalq_s16(_sum3, _s3);\n\n                kptr += 32;\n            }\n            {\n                int32x4x2_t _tmp0 = vzipq_s32(_sum0, _sum1);\n                int32x4x2_t _tmp1 = vzipq_s32(_sum2, _sum3);\n                _sum0 = vcombine_s32(vget_low_s32(_tmp0.val[0]), vget_low_s32(_tmp1.val[0]));\n                _sum1 = vcombine_s32(vget_high_s32(_tmp0.val[0]), vget_high_s32(_tmp1.val[0]));\n                _sum2 = vcombine_s32(vget_low_s32(_tmp0.val[1]), vget_low_s32(_tmp1.val[1]));\n                _sum3 = 
vcombine_s32(vget_high_s32(_tmp0.val[1]), vget_high_s32(_tmp1.val[1]));\n            }\n            _rnn_Hh0 = vaddq_s32(_rnn_Hh0, _sum0);\n            _rnn_Hh0 = vaddq_s32(_rnn_Hh0, _sum1);\n            _rnn_Hh0 = vaddq_s32(_rnn_Hh0, _sum2);\n            _rnn_Hh0 = vaddq_s32(_rnn_Hh0, _sum3);\n#endif // __ARM_FEATURE_DOTPROD\n            for (; i + 3 < num_output; i += 4)\n            {\n#if __ARM_FEATURE_DOTPROD\n                int8x8_t _h_cont = vld1_s8(hs + i);\n                int8x16_t _w = vld1q_s8(kptr);\n                _rnn_Hh0 = vdotq_lane_s32(_rnn_Hh0, _w, _h_cont, 0);\n#else\n                int16x4_t _h_cont01 = vreinterpret_s16_s8(vld1_s8(hs + i));\n                int8x8_t _h_cont0 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 0));\n                int8x8_t _h_cont1 = vreinterpret_s8_s16(vdup_lane_s16(_h_cont01, 1));\n                int8x16_t _w01 = vld1q_s8(kptr);\n\n                int16x8_t _rnn_Hh = vmull_s8(vget_low_s8(_w01), _h_cont0);\n                _rnn_Hh = vmlal_s8(_rnn_Hh, vget_high_s8(_w01), _h_cont1);\n                _rnn_Hh0 = vpadalq_s16(_rnn_Hh0, _rnn_Hh);\n#endif // __ARM_FEATURE_DOTPROD\n\n                kptr += 16;\n            }\n            for (; i + 1 < num_output; i += 2)\n            {\n                int8x8_t _h_cont = vreinterpret_s8_s16(vdup_lane_s16(vreinterpret_s16_s8(vld1_s8(hs + i)), 0));\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _rnn_Hh = vmull_s8(_w, _h_cont);\n                _rnn_Hh0 = vpadalq_s16(_rnn_Hh0, _rnn_Hh);\n\n                kptr += 8;\n            }\n            for (; i < num_output; i++)\n            {\n                int8x8_t _h_cont = vdup_n_s8(hs[i]);\n                int8x8_t _w = vld1_s8(kptr);\n\n                int16x8_t _rnn_Hh = vmull_s8(_w, _h_cont);\n                _rnn_Hh0 = vaddw_s16(_rnn_Hh0, vget_low_s16(_rnn_Hh));\n\n                kptr += 4;\n            }\n\n            float32x4_t _descale_x = vdupq_n_f32(descale_x);\n            
float32x4_t _descale_h = vdupq_n_f32(descale_h);\n\n            float32x4_t _rnn_H = vld1q_f32((const float*)bias_c + q);\n\n            float32x4_t _descale_xc = vld1q_f32(descales_ptr);\n\n            _rnn_H = vmlaq_f32(_rnn_H, vcvtq_f32_s32(_rnn_Hx0), vmulq_f32(_descale_x, _descale_xc));\n\n            float32x4_t _descale_hc = vld1q_f32(descales_ptr + 4);\n\n            _rnn_H = vmlaq_f32(_rnn_H, vcvtq_f32_s32(_rnn_Hh0), vmulq_f32(_descale_h, _descale_hc));\n\n            _rnn_H = tanh_ps(_rnn_H);\n\n            vst1q_f32((float*)gates + q, _rnn_H);\n        }\n#endif // __ARM_NEON\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = remain_num_output_start; q < num_output; q++)\n        {\n            const signed char* x = bottom_blob_int8.row<const signed char>(ti);\n            const signed char* hs = hidden_state_int8;\n            const float descale_x = bottom_blob_int8_descales[ti];\n            const float descale_h = hidden_state_int8_descale;\n\n#if __ARM_NEON\n            const signed char* kptr = weight_data_tm.row<const signed char>(q / 4 + q % 4);\n            const float* descales_ptr = weight_data_tm_int8_descales.row(q / 4 + q % 4);\n#else\n            const signed char* kptr = weight_data_tm.row<const signed char>(q);\n            const float* descales_ptr = weight_data_tm_int8_descales.row(q);\n#endif // __ARM_NEON\n\n            const float descale_xc = descales_ptr[0];\n            const float descale_hc = descales_ptr[1];\n\n            int Hx = 0;\n            for (int i = 0; i < size; i++)\n            {\n                Hx += kptr[0] * x[i];\n                kptr += 1;\n            }\n\n            int Hh = 0;\n            for (int i = 0; i < num_output; i++)\n            {\n                Hh += kptr[0] * hs[i];\n                kptr += 1;\n            }\n\n            float H = bias_c[q] + Hx * (descale_x * descale_xc) + Hh * (descale_h * descale_hc);\n\n            H = tanhf(H);\n\n            
gates[q] = H;\n        }\n\n        rnn_int8_gate_output(gates, hidden_state, top_blob, ti, elemtype, opt);\n    }\n}\n"
  },
  {
    "path": "src/layer/arm/scale_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"scale_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\nScale_arm::Scale_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#endif // __ARM_NEON\n}\n\nint Scale_arm::forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) const\n{\n    Mat& bottom_top_blob = bottom_top_blobs[0];\n    const Mat& scale_blob = bottom_top_blobs[1];\n\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int w = bottom_top_blob.w;\n\n            const float* scale = scale_blob;\n            if (bias_term)\n            {\n                const float* bias = bias_data;\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    float* ptr = (float*)bottom_top_blob + i * 4;\n\n                    float32x4_t _p = vld1q_f32(ptr);\n                    float32x4_t _s = vld1q_f32(scale + i * 4);\n                    float32x4_t _bias = vld1q_f32(bias + i * 4);\n                    _p = vmlaq_f32(_bias, _p, _s);\n                    vst1q_f32(ptr, _p);\n                }\n            }\n            else\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < w; i++)\n                {\n                    float* ptr = (float*)bottom_top_blob + i * 4;\n\n                    float32x4_t _p = vld1q_f32(ptr);\n                    float32x4_t _s = vld1q_f32(scale + i * 4);\n                    _p = vmulq_f32(_p, _s);\n                    vst1q_f32(ptr, _p);\n                }\n            }\n        }\n\n        if (dims == 2)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n\n            if (bias_term)\n           
 {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < h; i++)\n                {\n                    float* ptr = bottom_top_blob.row(i);\n                    float32x4_t _s = vld1q_f32((const float*)scale_blob + i * 4);\n                    float32x4_t _bias = vld1q_f32((const float*)bias_data + i * 4);\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        float32x4_t _p = vld1q_f32(ptr);\n                        _p = vmlaq_f32(_bias, _p, _s);\n                        vst1q_f32(ptr, _p);\n\n                        ptr += 4;\n                    }\n                }\n            }\n            else\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int i = 0; i < h; i++)\n                {\n                    float* ptr = bottom_top_blob.row(i);\n                    float32x4_t _s = vld1q_f32((const float*)scale_blob + i * 4);\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        float32x4_t _p = vld1q_f32(ptr);\n                        _p = vmulq_f32(_p, _s);\n                        vst1q_f32(ptr, _p);\n\n                        ptr += 4;\n                    }\n                }\n            }\n        }\n\n        if (dims == 3)\n        {\n            int w = bottom_top_blob.w;\n            int h = bottom_top_blob.h;\n            int channels = bottom_top_blob.c;\n            int size = w * h;\n\n            if (bias_term)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    float* ptr = bottom_top_blob.channel(q);\n                    float32x4_t _s = vld1q_f32((const float*)scale_blob + q * 4);\n                    float32x4_t _bias = vld1q_f32((const float*)bias_data + q * 4);\n\n                    for (int i = 0; i < size; i++)\n  
                  {\n                        float32x4_t _p = vld1q_f32(ptr);\n                        _p = vmlaq_f32(_bias, _p, _s);\n                        vst1q_f32(ptr, _p);\n\n                        ptr += 4;\n                    }\n                }\n            }\n            else\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    float* ptr = bottom_top_blob.channel(q);\n                    float32x4_t _s = vld1q_f32((const float*)scale_blob + q * 4);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        float32x4_t _p = vld1q_f32(ptr);\n                        _p = vmulq_f32(_p, _s);\n                        vst1q_f32(ptr, _p);\n\n                        ptr += 4;\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    if (dims != 3)\n        return Scale::forward_inplace(bottom_top_blobs, opt);\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    if (bias_term)\n    {\n        const float* scale_ptr = scale_blob;\n        const float* bias_ptr = bias_data;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            float s = scale_ptr[q];\n            float bias = bias_ptr[q];\n\n#if __ARM_NEON\n            int nn = size >> 2;\n            int remain = size - (nn << 2);\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _s = vdupq_n_f32(s);\n            float32x4_t _bias = vdupq_n_f32(bias);\n            for (; nn > 0; nn--)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                _p = vmlaq_f32(_bias, _p, _s);\n                
vst1q_f32(ptr, _p);\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n\n            for (; remain > 0; remain--)\n            {\n                *ptr = *ptr * s + bias;\n\n                ptr++;\n            }\n        }\n    }\n    else\n    {\n        const float* scale_ptr = scale_blob;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            float s = scale_ptr[q];\n\n#if __ARM_NEON\n            int nn = size >> 2;\n            int remain = size - (nn << 2);\n#else\n            int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n            float32x4_t _s = vdupq_n_f32(s);\n            for (; nn > 0; nn--)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                _p = vmulq_f32(_p, _s);\n                vst1q_f32(ptr, _p);\n\n                ptr += 4;\n            }\n#endif // __ARM_NEON\n\n            for (; remain > 0; remain--)\n            {\n                *ptr *= s;\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/scale_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SCALE_ARM_H\n#define LAYER_SCALE_ARM_H\n\n#include \"scale.h\"\n\nnamespace ncnn {\n\nclass Scale_arm : public Scale\n{\npublic:\n    Scale_arm();\n\n    virtual int forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SCALE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/selu_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"selu_arm.h\"\n\n#if __ARM_NEON\n#include \"neon_mathfun.h\"\n\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\nint SELU_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    float alphaxlambda = alpha * lambda;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n#if __ARM_NEON\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        float32x4_t _alphaxlambda = vdupq_n_f32(alphaxlambda);\n        float32x4_t _lambda = vdupq_n_f32(lambda);\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            uint32x4_t _lemask = vcleq_f32(_p, _zero);\n\n            float32x4_t _nps = exp_ps(_p);\n            _nps = vsubq_f32(_nps, _one);\n            _nps = vmulq_f32(_nps, _alphaxlambda);\n\n            _p = vmulq_f32(_p, _lambda);\n\n            _p = vbslq_f32(_lemask, _nps, _p);\n            vst1q_f32(ptr, _p);\n\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; remain > 0; remain--)\n        {\n            if (*ptr < 0.f)\n                *ptr = (expf(*ptr) - 1.f) * alphaxlambda;\n            else\n                *ptr *= lambda;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/selu_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SELU_ARM_H\n#define LAYER_SELU_ARM_H\n\n#include \"selu.h\"\n\nnamespace ncnn {\n\nclass SELU_arm : public SELU\n{\npublic:\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SELU_ARM_H\n"
  },
  {
    "path": "src/layer/arm/shufflechannel_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"shufflechannel_arm.h\"\n\n#include \"layer_type.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nShuffleChannel_arm::ShuffleChannel_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint ShuffleChannel_arm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blob, top_blob, opt);\n#endif\n\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n\n    int _group = reverse ? 
channels * elempack / group : group;\n\n    if (_group == 1)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (_group == 2 && channels % _group != 0)\n        {\n            int w = bottom_blob.w;\n            int h = bottom_blob.h;\n            int size = w * h;\n            size_t elemsize = bottom_blob.elemsize;\n\n            top_blob.create(w, h, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            int channels_per_group = channels / _group;\n\n            // TODO unroll me\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const float* ptr0 = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const float* ptr2 = bottom_blob.channel(channels_per_group + q + 1);\n                float* outptr0 = top_blob.channel(q * 2);\n                float* outptr1 = top_blob.channel(q * 2 + 1);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr1);\n                    float32x4_t _p2 = vld1q_f32(ptr2);\n\n                    float32x4_t _p12 = vextq_f32(_p1, _p2, 2);\n\n                    float32x4x2_t _p01 = vzipq_f32(_p0, _p12);\n\n                    vst1q_f32(outptr0, _p01.val[0]);\n                    vst1q_f32(outptr1, _p01.val[1]);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n            }\n\n            // handle the last channel\n            {\n                const float* ptr0 = bottom_blob.channel(channels_per_group);\n                const float* ptr1 = bottom_blob.channel(channels_per_group + channels_per_group);\n                
float* outptr0 = top_blob.channel(channels_per_group * 2);\n\n                ptr1 += 2;\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr1);\n\n                    float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                    vst1q_f32(outptr0, _p01.val[0]);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    outptr0 += 4;\n                }\n            }\n\n            return 0;\n        }\n\n        if (_group > 4 || channels % _group != 0)\n        {\n            // slow path for too large group or shuffle inside elempack\n            Option opt_pack = opt;\n            opt_pack.blob_allocator = opt.workspace_allocator;\n\n            Mat bottom_blob_unpacked;\n            convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n\n            Mat top_blob_unpacked;\n            int ret = ShuffleChannel::forward(bottom_blob_unpacked, top_blob_unpacked, opt_pack);\n            if (ret != 0)\n                return ret;\n\n            convert_packing(top_blob_unpacked, top_blob, elempack, opt);\n            if (top_blob.empty())\n                return -100;\n\n            return 0;\n        }\n\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int size = w * h;\n        size_t elemsize = bottom_blob.elemsize;\n\n        top_blob.create(w, h, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int channels_per_group = channels / _group;\n\n        if (_group == 2)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const float* ptr0 = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob.channel(channels_per_group + q);\n                float* outptr0 = top_blob.channel(q * 2);\n                float* outptr1 = 
top_blob.channel(q * 2 + 1);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr1);\n\n                    float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n\n                    vst1q_f32(outptr0, _p01.val[0]);\n                    vst1q_f32(outptr1, _p01.val[1]);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n            }\n        }\n\n        if (_group == 3)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const float* ptr0 = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const float* ptr2 = bottom_blob.channel(channels_per_group * 2 + q);\n                float* outptr0 = top_blob.channel(q * 3);\n                float* outptr1 = top_blob.channel(q * 3 + 1);\n                float* outptr2 = top_blob.channel(q * 3 + 2);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr1);\n                    float32x4_t _p2 = vld1q_f32(ptr2);\n\n                    float32x4x2_t _p01 = vzipq_f32(_p0, _p1);\n                    float32x4x2_t _p12 = vzipq_f32(_p1, _p2);\n\n                    float32x4_t _0415 = _p01.val[0];\n                    float32x4_t _2637 = _p01.val[1];\n                    float32x4_t _4859 = _p12.val[0];\n                    float32x4_t _6x7y = _p12.val[1];\n\n                    float32x2_t _15 = vget_high_f32(_0415);\n                    float32x2_t _37 = vget_high_f32(_2637);\n                    float32x2_t _48 = vget_low_f32(_4859);\n                    float32x2_t _6x = vget_low_f32(_6x7y);\n\n                    float32x2_t _81 = vext_f32(_48, _15, 1);\n                
    float32x2_t _x3 = vext_f32(_6x, _37, 1);\n\n                    float32x4_t _0481 = vcombine_f32(vget_low_f32(_0415), _81);\n                    float32x4_t _5926 = vextq_f32(_4859, _2637, 2);\n                    float32x4_t _x37y = vcombine_f32(_x3, vget_high_f32(_6x7y));\n\n                    vst1q_f32(outptr0, _0481);\n                    vst1q_f32(outptr1, _5926);\n                    vst1q_f32(outptr2, _x37y);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                }\n            }\n        }\n\n        if (_group == 4)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const float* ptr0 = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const float* ptr2 = bottom_blob.channel(channels_per_group * 2 + q);\n                const float* ptr3 = bottom_blob.channel(channels_per_group * 3 + q);\n                float* outptr0 = top_blob.channel(q * 4);\n                float* outptr1 = top_blob.channel(q * 4 + 1);\n                float* outptr2 = top_blob.channel(q * 4 + 2);\n                float* outptr3 = top_blob.channel(q * 4 + 3);\n\n                for (int i = 0; i < size; i++)\n                {\n                    float32x4_t _p0 = vld1q_f32(ptr0);\n                    float32x4_t _p1 = vld1q_f32(ptr1);\n                    float32x4_t _p2 = vld1q_f32(ptr2);\n                    float32x4_t _p3 = vld1q_f32(ptr3);\n\n                    // transpose 4x4\n                    float32x4x2_t _p01 = vtrnq_f32(_p0, _p1);\n                    float32x4x2_t _p23 = vtrnq_f32(_p2, _p3);\n                    _p0 = vcombine_f32(vget_low_f32(_p01.val[0]), vget_low_f32(_p23.val[0]));\n                    _p1 = vcombine_f32(vget_low_f32(_p01.val[1]), vget_low_f32(_p23.val[1]));\n  
                  _p2 = vcombine_f32(vget_high_f32(_p01.val[0]), vget_high_f32(_p23.val[0]));\n                    _p3 = vcombine_f32(vget_high_f32(_p01.val[1]), vget_high_f32(_p23.val[1]));\n\n                    vst1q_f32(outptr0, _p0);\n                    vst1q_f32(outptr1, _p1);\n                    vst1q_f32(outptr2, _p2);\n                    vst1q_f32(outptr3, _p3);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    ptr3 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    return ShuffleChannel::forward(bottom_blob, top_blob, opt);\n}\n\nint ShuffleChannel_arm::forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n\n    int _group = reverse ? 
channels * elempack / group : group;\n\n    if (_group == 1)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n#if NCNN_ARM82\n    if (elempack == 8)\n    {\n        if (_group == 2 && channels % _group != 0)\n        {\n            int w = bottom_blob.w;\n            int h = bottom_blob.h;\n            int size = w * h;\n            size_t elemsize = bottom_blob.elemsize;\n\n            top_blob.create(w, h, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            int channels_per_group = channels / _group;\n\n            // TODO unroll me\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const unsigned short* ptr2 = bottom_blob.channel(channels_per_group + q + 1);\n                unsigned short* outptr0 = top_blob.channel(q * 2);\n                unsigned short* outptr1 = top_blob.channel(q * 2 + 1);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x8_t _p0 = vld1q_u16(ptr0);\n                    uint16x8_t _p1 = vld1q_u16(ptr1);\n                    uint16x8_t _p2 = vld1q_u16(ptr2);\n\n                    uint16x8_t _p12 = vextq_u16(_p1, _p2, 4);\n\n                    uint16x8x2_t _p01 = vzipq_u16(_p0, _p12);\n\n                    vst1q_u16(outptr0, _p01.val[0]);\n                    vst1q_u16(outptr1, _p01.val[1]);\n\n                    ptr0 += 8;\n                    ptr1 += 8;\n                    ptr2 += 8;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                }\n            }\n\n            // handle the last channel\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(channels_per_group);\n                const unsigned short* ptr1 = 
bottom_blob.channel(channels_per_group + channels_per_group);\n                unsigned short* outptr0 = top_blob.channel(channels_per_group * 2);\n\n                ptr1 += 4;\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x4_t _p0 = vld1_u16(ptr0);\n                    uint16x4_t _p1 = vld1_u16(ptr1);\n\n                    uint16x4x2_t _p01 = vzip_u16(_p0, _p1);\n\n                    vst1_u16(outptr0, _p01.val[0]);\n                    vst1_u16(outptr0 + 4, _p01.val[1]);\n\n                    ptr0 += 8;\n                    ptr1 += 8;\n                    outptr0 += 8;\n                }\n            }\n\n            return 0;\n        }\n\n        if (_group > 4 || channels % _group != 0)\n        {\n            // slow path for too large group or shuffle inside elempack\n            Option opt_pack = opt;\n            opt_pack.blob_allocator = opt.workspace_allocator;\n\n            Mat bottom_blob_unpacked;\n            convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n\n            Mat top_blob_unpacked;\n            int ret = ShuffleChannel::forward(bottom_blob_unpacked, top_blob_unpacked, opt_pack);\n            if (ret != 0)\n                return ret;\n\n            convert_packing(top_blob_unpacked, top_blob, elempack, opt);\n            if (top_blob.empty())\n                return -100;\n\n            return 0;\n        }\n\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int size = w * h;\n        size_t elemsize = bottom_blob.elemsize;\n\n        top_blob.create(w, h, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int channels_per_group = channels / _group;\n\n        if (_group == 2)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const unsigned short* ptr0 = 
bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                unsigned short* outptr0 = top_blob.channel(q * 2);\n                unsigned short* outptr1 = top_blob.channel(q * 2 + 1);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x8_t _p0 = vld1q_u16(ptr0);\n                    uint16x8_t _p1 = vld1q_u16(ptr1);\n\n                    uint16x8x2_t _p01 = vzipq_u16(_p0, _p1);\n\n                    vst1q_u16(outptr0, _p01.val[0]);\n                    vst1q_u16(outptr1, _p01.val[1]);\n\n                    ptr0 += 8;\n                    ptr1 += 8;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                }\n            }\n        }\n\n        if (_group == 3)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const unsigned short* ptr2 = bottom_blob.channel(channels_per_group * 2 + q);\n                unsigned short* outptr0 = top_blob.channel(q * 3);\n                unsigned short* outptr1 = top_blob.channel(q * 3 + 1);\n                unsigned short* outptr2 = top_blob.channel(q * 3 + 2);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x8_t _p0 = vld1q_u16(ptr0);\n                    uint16x8_t _p1 = vld1q_u16(ptr1);\n                    uint16x8_t _p2 = vld1q_u16(ptr2);\n\n                    // TODO figure out a faster way\n\n                    // 01234567        08g19h2a\n                    // 89abcdef   ->   i3bj4ck5\n                    // ghijklmn        dl6em7fn\n\n                    uint16x8x3_t _p012;\n                    _p012.val[0] = _p0;\n                    _p012.val[1] = _p1;\n                    _p012.val[2] = _p2;\n\n                    
unsigned short tmp[24];\n                    vst3q_u16(&tmp[0], _p012);\n\n                    _p0 = vld1q_u16(&tmp[0]);\n                    _p1 = vld1q_u16(&tmp[8]);\n                    _p2 = vld1q_u16(&tmp[16]);\n\n                    vst1q_u16(outptr0, _p0);\n                    vst1q_u16(outptr1, _p1);\n                    vst1q_u16(outptr2, _p2);\n\n                    ptr0 += 8;\n                    ptr1 += 8;\n                    ptr2 += 8;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                    outptr2 += 8;\n                }\n            }\n        }\n\n        if (_group == 4)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const unsigned short* ptr2 = bottom_blob.channel(channels_per_group * 2 + q);\n                const unsigned short* ptr3 = bottom_blob.channel(channels_per_group * 3 + q);\n                unsigned short* outptr0 = top_blob.channel(q * 4);\n                unsigned short* outptr1 = top_blob.channel(q * 4 + 1);\n                unsigned short* outptr2 = top_blob.channel(q * 4 + 2);\n                unsigned short* outptr3 = top_blob.channel(q * 4 + 3);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x8_t _p0 = vld1q_u16(ptr0);\n                    uint16x8_t _p1 = vld1q_u16(ptr1);\n                    uint16x8_t _p2 = vld1q_u16(ptr2);\n                    uint16x8_t _p3 = vld1q_u16(ptr3);\n\n                    // transpose 4x4\n                    uint16x8x2_t _p01 = vtrnq_u16(_p0, _p1);\n                    uint16x8x2_t _p23 = vtrnq_u16(_p2, _p3);\n                    uint32x4x2_t _p02 = vtrnq_u32(vreinterpretq_u32_u16(_p01.val[0]), vreinterpretq_u32_u16(_p23.val[0]));\n                    uint32x4x2_t _p13 = 
vtrnq_u32(vreinterpretq_u32_u16(_p01.val[1]), vreinterpretq_u32_u16(_p23.val[1]));\n                    _p0 = vreinterpretq_u16_u32(_p02.val[0]);\n                    _p1 = vreinterpretq_u16_u32(_p13.val[0]);\n                    _p2 = vreinterpretq_u16_u32(_p02.val[1]);\n                    _p3 = vreinterpretq_u16_u32(_p13.val[1]);\n\n                    vst1q_u16(outptr0, vcombine_u16(vget_low_u16(_p0), vget_low_u16(_p1)));\n                    vst1q_u16(outptr1, vcombine_u16(vget_low_u16(_p2), vget_low_u16(_p3)));\n                    vst1q_u16(outptr2, vcombine_u16(vget_high_u16(_p0), vget_high_u16(_p1)));\n                    vst1q_u16(outptr3, vcombine_u16(vget_high_u16(_p2), vget_high_u16(_p3)));\n\n                    ptr0 += 8;\n                    ptr1 += 8;\n                    ptr2 += 8;\n                    ptr3 += 8;\n                    outptr0 += 8;\n                    outptr1 += 8;\n                    outptr2 += 8;\n                    outptr3 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // NCNN_ARM82\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        if (_group == 2 && channels % _group != 0)\n        {\n            int w = bottom_blob.w;\n            int h = bottom_blob.h;\n            int size = w * h;\n            size_t elemsize = bottom_blob.elemsize;\n\n            top_blob.create(w, h, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            int channels_per_group = channels / _group;\n\n            // TODO unroll me\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const unsigned short* ptr2 = bottom_blob.channel(channels_per_group + q + 1);\n                unsigned short* outptr0 = top_blob.channel(q * 2);\n           
     unsigned short* outptr1 = top_blob.channel(q * 2 + 1);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x4_t _p0 = vld1_u16(ptr0);\n                    uint16x4_t _p1 = vld1_u16(ptr1);\n                    uint16x4_t _p2 = vld1_u16(ptr2);\n\n                    uint16x4_t _p12 = vext_u16(_p1, _p2, 2);\n\n                    uint16x4x2_t _p01 = vzip_u16(_p0, _p12);\n\n                    vst1_u16(outptr0, _p01.val[0]);\n                    vst1_u16(outptr1, _p01.val[1]);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n            }\n\n            // handle the last channel\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(channels_per_group);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + channels_per_group);\n                unsigned short* outptr0 = top_blob.channel(channels_per_group * 2);\n\n                ptr1 += 2;\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x4_t _p0 = vld1_u16(ptr0);\n                    uint16x4_t _p1 = vld1_u16(ptr1);\n\n                    uint16x4x2_t _p01 = vzip_u16(_p0, _p1);\n\n                    vst1_u16(outptr0, _p01.val[0]);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    outptr0 += 4;\n                }\n            }\n\n            return 0;\n        }\n\n        if (_group > 4 || channels % _group != 0)\n        {\n            // slow path for too large group or shuffle inside elempack\n            Option opt_pack = opt;\n            opt_pack.blob_allocator = opt.workspace_allocator;\n\n            Mat bottom_blob_unpacked;\n            convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n\n  
          Mat top_blob_unpacked;\n            int ret = ShuffleChannel::forward(bottom_blob_unpacked, top_blob_unpacked, opt_pack);\n            if (ret != 0)\n                return ret;\n\n            convert_packing(top_blob_unpacked, top_blob, elempack, opt);\n            if (top_blob.empty())\n                return -100;\n\n            return 0;\n        }\n\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int size = w * h;\n        size_t elemsize = bottom_blob.elemsize;\n\n        top_blob.create(w, h, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int channels_per_group = channels / _group;\n\n        if (_group == 2)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                unsigned short* outptr0 = top_blob.channel(q * 2);\n                unsigned short* outptr1 = top_blob.channel(q * 2 + 1);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x4_t _p0 = vld1_u16(ptr0);\n                    uint16x4_t _p1 = vld1_u16(ptr1);\n\n                    uint16x4x2_t _p01 = vzip_u16(_p0, _p1);\n\n                    vst1_u16(outptr0, _p01.val[0]);\n                    vst1_u16(outptr1, _p01.val[1]);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                }\n            }\n        }\n\n        if (_group == 3)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n            {\n                const unsigned short* ptr0 = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const unsigned short* ptr2 = 
bottom_blob.channel(channels_per_group * 2 + q);\n                unsigned short* outptr0 = top_blob.channel(q * 3);\n                unsigned short* outptr1 = top_blob.channel(q * 3 + 1);\n                unsigned short* outptr2 = top_blob.channel(q * 3 + 2);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x4_t _p0 = vld1_u16(ptr0);\n                    uint16x4_t _p1 = vld1_u16(ptr1);\n                    uint16x4_t _p2 = vld1_u16(ptr2);\n\n                    // TODO figure out a faster way\n                    uint16x4x2_t _p01 = vzip_u16(_p0, _p1);\n                    uint16x4x2_t _p12 = vzip_u16(_p1, _p2);\n\n                    uint32x2_t _0415 = vreinterpret_u32_u16(_p01.val[0]);\n                    uint16x4_t _2637 = _p01.val[1];\n                    uint16x4_t _4859 = _p12.val[0];\n                    uint32x2_t _6x7y = vreinterpret_u32_u16(_p12.val[1]);\n\n                    uint16x4_t _98yx = vrev32_u16(_p2);\n                    uint16x4x2_t _90y281x3 = vtrn_u16(_98yx, _p0);\n\n                    uint32x2_t _81x3 = vreinterpret_u32_u16(_90y281x3.val[1]);\n\n                    uint32x2x2_t _048115x3 = vtrn_u32(_0415, _81x3);\n                    uint32x2x2_t _816xx37y = vtrn_u32(_81x3, _6x7y);\n\n                    uint16x4_t _0481 = vreinterpret_u16_u32(_048115x3.val[0]);\n                    uint16x4_t _5926 = vext_u16(_4859, _2637, 2);\n                    uint16x4_t _x37y = vreinterpret_u16_u32(_816xx37y.val[1]);\n\n                    vst1_u16(outptr0, _0481);\n                    vst1_u16(outptr1, _5926);\n                    vst1_u16(outptr2, _x37y);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                }\n            }\n        }\n\n        if (_group == 4)\n        {\n            for (int q = 0; q < channels_per_group; q++)\n    
        {\n                const unsigned short* ptr0 = bottom_blob.channel(q);\n                const unsigned short* ptr1 = bottom_blob.channel(channels_per_group + q);\n                const unsigned short* ptr2 = bottom_blob.channel(channels_per_group * 2 + q);\n                const unsigned short* ptr3 = bottom_blob.channel(channels_per_group * 3 + q);\n                unsigned short* outptr0 = top_blob.channel(q * 4);\n                unsigned short* outptr1 = top_blob.channel(q * 4 + 1);\n                unsigned short* outptr2 = top_blob.channel(q * 4 + 2);\n                unsigned short* outptr3 = top_blob.channel(q * 4 + 3);\n\n                for (int i = 0; i < size; i++)\n                {\n                    uint16x4_t _p0 = vld1_u16(ptr0);\n                    uint16x4_t _p1 = vld1_u16(ptr1);\n                    uint16x4_t _p2 = vld1_u16(ptr2);\n                    uint16x4_t _p3 = vld1_u16(ptr3);\n\n                    // transpose 4x4\n                    uint16x4x2_t _p01 = vtrn_u16(_p0, _p1);\n                    uint16x4x2_t _p23 = vtrn_u16(_p2, _p3);\n                    uint32x2x2_t _p02 = vtrn_u32(vreinterpret_u32_u16(_p01.val[0]), vreinterpret_u32_u16(_p23.val[0]));\n                    uint32x2x2_t _p13 = vtrn_u32(vreinterpret_u32_u16(_p01.val[1]), vreinterpret_u32_u16(_p23.val[1]));\n                    _p0 = vreinterpret_u16_u32(_p02.val[0]);\n                    _p1 = vreinterpret_u16_u32(_p13.val[0]);\n                    _p2 = vreinterpret_u16_u32(_p02.val[1]);\n                    _p3 = vreinterpret_u16_u32(_p13.val[1]);\n\n                    vst1_u16(outptr0, _p0);\n                    vst1_u16(outptr1, _p1);\n                    vst1_u16(outptr2, _p2);\n                    vst1_u16(outptr3, _p3);\n\n                    ptr0 += 4;\n                    ptr1 += 4;\n                    ptr2 += 4;\n                    ptr3 += 4;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 
4;\n                    outptr3 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    return ShuffleChannel::forward(bottom_blob, top_blob, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/shufflechannel_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SHUFFLECHANNEL_ARM_H\n#define LAYER_SHUFFLECHANNEL_ARM_H\n\n#include \"shufflechannel.h\"\n\nnamespace ncnn {\n\nclass ShuffleChannel_arm : public ShuffleChannel\n{\npublic:\n    ShuffleChannel_arm();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SHUFFLECHANNEL_ARM_H\n"
  },
  {
    "path": "src/layer/arm/sigmoid_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"sigmoid_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nSigmoid_arm::Sigmoid_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Sigmoid_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; i + 15 < size; i += 16)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _p2 = vld1q_f32(ptr + 8);\n            float32x4_t _p3 = vld1q_f32(ptr + 12);\n            _p0 = sigmoid_ps(_p0);\n            _p1 = sigmoid_ps(_p1);\n            _p2 = sigmoid_ps(_p2);\n            _p3 = sigmoid_ps(_p3);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n           
 vst1q_f32(ptr + 8, _p2);\n            vst1q_f32(ptr + 12, _p3);\n            ptr += 16;\n        }\n#endif // __aarch64__\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            _p0 = sigmoid_ps(_p0);\n            _p1 = sigmoid_ps(_p1);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = sigmoid_ps(_p);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = 1.f / (1.f + expf(-*ptr));\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Sigmoid_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; i + 15 < size; i += 16)\n        {\n            uint16x8_t _p01 = vld1q_u16(ptr);\n            uint16x8_t _p23 = vld1q_u16(ptr + 8);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n            float32x4_t _p2 = bfloat2float(vget_low_u16(_p23));\n            float32x4_t _p3 = bfloat2float(vget_high_u16(_p23));\n            _p0 = sigmoid_ps(_p0);\n            _p1 = sigmoid_ps(_p1);\n            _p2 = sigmoid_ps(_p2);\n            _p3 = sigmoid_ps(_p3);\n            _p01 = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n         
   _p23 = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n            vst1q_u16(ptr, _p01);\n            vst1q_u16(ptr + 8, _p23);\n            ptr += 16;\n        }\n#endif // __aarch64__\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _p0 = sigmoid_ps(_p0);\n            _p1 = sigmoid_ps(_p1);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = sigmoid_ps(_p);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            v = 1.f / (1.f + expf(-v));\n            *ptr = float32_to_bfloat16(v);\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/sigmoid_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SIGMOID_ARM_H\n#define LAYER_SIGMOID_ARM_H\n\n#include \"sigmoid.h\"\n\nnamespace ncnn {\n\nclass Sigmoid_arm : public Sigmoid\n{\npublic:\n    Sigmoid_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SIGMOID_ARM_H\n"
  },
  {
    "path": "src/layer/arm/sigmoid_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"sigmoid_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#include \"neon_mathfun.h\"\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"neon_mathfun_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Sigmoid_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p01 = vld1q_f16(ptr);\n            float16x8_t _p23 = vld1q_f16(ptr + 8);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n            float32x4_t _p2 = vcvt_f32_f16(vget_low_f16(_p23));\n            float32x4_t _p3 = vcvt_f32_f16(vget_high_f16(_p23));\n            _p0 = sigmoid_ps(_p0);\n            _p1 = sigmoid_ps(_p1);\n            _p2 = sigmoid_ps(_p2);\n            _p3 = sigmoid_ps(_p3);\n            _p01 = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            _p23 = vcombine_f16(vcvt_f16_f32(_p2), vcvt_f16_f32(_p3));\n            vst1q_f16(ptr, _p01);\n            vst1q_f16(ptr + 8, _p23);\n            ptr += 16;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _p0 = sigmoid_ps(_p0);\n            _p1 = sigmoid_ps(_p1);\n            _p = 
vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _p = sigmoid_ps(_p);\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)*ptr;\n            v = 1.f / (1.f + expf(-v));\n            *ptr = (__fp16)v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint Sigmoid_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n        for (; i + 31 < size; i += 32)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            _p0 = sigmoid_ps_f16(_p0);\n            _p1 = sigmoid_ps_f16(_p1);\n            _p2 = sigmoid_ps_f16(_p2);\n            _p3 = sigmoid_ps_f16(_p3);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            ptr += 32;\n        }\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            _p0 = sigmoid_ps_f16(_p0);\n            _p1 = sigmoid_ps_f16(_p1);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            ptr += 16;\n        }\n        for (; i + 7 
< size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _p = sigmoid_ps_f16(_p);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = sigmoid_ps_f16(_p);\n            vst1_f16(ptr, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = *ptr;\n            v = 1.f / (1.f + expf(-v));\n            *ptr = v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/slice_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"slice_arm.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nSlice_arm::Slice_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Slice_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int elembits = bottom_blobs[0].elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_bf16s_fp16s(bottom_blobs, top_blobs, opt);\n#endif\n\n    const Mat& bottom_blob = bottom_blobs[0];\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    const int* slices_ptr = slices;\n    const int* indices_ptr = indices;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // slice vector\n        int w = bottom_blob.w * elempack;\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __ARM_NEON\n            if (opt.use_packing_layout)\n            {\n                out_elempack = slice % 4 == 0 ? 4 : 1;\n            }\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            const float* ptr = (const float*)bottom_blob + q;\n            float* outptr = top_blob;\n            memcpy(outptr, ptr, top_blob.w * top_blob.elemsize);\n\n            q += slice;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // slice image height\n        int w = bottom_blob.w;\n        int h = bottom_blob.h * elempack;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = h - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
h + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((h - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __ARM_NEON\n            if (opt.use_packing_layout)\n            {\n                out_elempack = slice % 4 == 0 ? 4 : 1;\n            }\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        size_t out_elemsize = top_blobs[0].elemsize;\n        int out_elempack = top_blobs[0].elempack;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            out_elemsize = std::min(out_elemsize, top_blobs[i].elemsize);\n            out_elempack = std::min(out_elempack, top_blobs[i].elempack);\n        }\n\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > out_elempack)\n        {\n            convert_packing(bottom_blob, bottom_blob_unpacked, out_elempack, opt);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        const float* ptr = bottom_blob_unpacked;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            Mat& top_blob = top_blobs[i];\n\n            if (out_elempack == 1 && top_blob.elempack == 4)\n            {\n                for (int j = 0; j < top_blob.h; j++)\n                {\n                    const float* r0 = ptr;\n                    const float* r1 = ptr + w;\n                    const float* r2 = ptr + w * 2;\n                    const float* r3 = ptr + w * 3;\n\n                    
float* outptr0 = top_blob.row(j);\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        outptr0[0] = *r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n\n                        outptr0 += 4;\n                    }\n\n                    ptr += w * 4;\n                }\n            }\n            else // if (out_elempack == 1 && top_blob.elempack == 1) if (out_elempack == 4 && top_blob.elempack == 4)\n            {\n                int size = w * top_blob.h;\n\n                float* outptr = top_blob;\n                memcpy(outptr, ptr, size * top_blob.elemsize);\n\n                ptr += size * top_blob.elempack;\n            }\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // slice image width\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice, h, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n            const float* ptr = bottom_blob.row(j);\n            for (size_t i = 0; i < top_blobs.size(); i++)\n            {\n                Mat& top_blob = top_blobs[i];\n\n                float* outptr = top_blob.row(j);\n                memcpy(outptr, ptr, top_blob.w * elemsize);\n\n                ptr += top_blob.w * elempack;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // slice dim channel\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c * elempack;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = channels - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
channels + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((channels - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __ARM_NEON\n            if (opt.use_packing_layout)\n            {\n                out_elempack = slice % 4 == 0 ? 4 : 1;\n            }\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, h, d, slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        size_t out_elemsize = top_blobs[0].elemsize;\n        int out_elempack = top_blobs[0].elempack;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            out_elemsize = std::min(out_elemsize, top_blobs[i].elemsize);\n            out_elempack = std::min(out_elempack, top_blobs[i].elempack);\n        }\n\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > out_elempack)\n        {\n            convert_packing(bottom_blob, bottom_blob_unpacked, out_elempack, opt);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        int p = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            Mat& top_blob = top_blobs[i];\n\n            if (out_elempack == 1 && top_blob.elempack == 4)\n            {\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const float* r0 = bottom_blob_unpacked.channel(p);\n                    const float* r1 = 
bottom_blob_unpacked.channel(p + 1);\n                    const float* r2 = bottom_blob_unpacked.channel(p + 2);\n                    const float* r3 = bottom_blob_unpacked.channel(p + 3);\n\n                    float* outptr0 = top_blob.channel(q);\n\n                    for (int j = 0; j < size; j++)\n                    {\n                        outptr0[0] = *r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n\n                        outptr0 += 4;\n                    }\n\n                    p += 4;\n                }\n            }\n            else // if (out_elempack == 1 && top_blob.elempack == 1) if (out_elempack == 4 && top_blob.elempack == 4)\n            {\n                int size = top_blob.total();\n\n                const float* ptr = bottom_blob_unpacked.channel(p);\n                float* outptr = top_blob;\n                memcpy(outptr, ptr, size * top_blob.elemsize);\n\n                p += top_blob.c;\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // slice dim height\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = h - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
h + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((h - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, slice, d, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const float* ptr = bottom_blob.channel(p);\n\n            for (int j = 0; j < d; j++)\n            {\n                for (size_t i = 0; i < top_blobs.size(); i++)\n                {\n                    Mat& top_blob = top_blobs[i];\n\n                    int size = top_blob.w * top_blob.h;\n\n                    float* outptr = top_blob.channel(p).depth(j);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    ptr += size * elempack;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // slice dim width\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice, h, d, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const float* ptr = bottom_blob.channel(p);\n\n            for (int j = 0; j < d; j++)\n            {\n                for (int k = 0; k < h; k++)\n                {\n                    for (size_t i = 0; i < top_blobs.size(); i++)\n                    {\n                        Mat& top_blob = top_blobs[i];\n\n                        float* outptr = top_blob.channel(p).depth(j).row(k);\n                        memcpy(outptr, ptr, top_blob.w * elemsize);\n\n                        ptr += top_blob.w * elempack;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = d - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
d + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((d - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, h, slice, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const float* ptr = bottom_blob.channel(p);\n\n            for (size_t i = 0; i < top_blobs.size(); i++)\n            {\n                Mat& top_blob = top_blobs[i];\n\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                float* outptr = top_blob.channel(p);\n                memcpy(outptr, ptr, size * elemsize);\n\n                ptr += size * elempack;\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Slice_arm::forward_bf16s_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    const int* slices_ptr = slices;\n    const int* indices_ptr = indices;\n    int positive_axis = axis < 0 ? 
dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // slice vector\n        int w = bottom_blob.w * elempack;\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __ARM_NEON\n            if (opt.use_packing_layout)\n            {\n#if NCNN_ARM82\n                out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && slice % 8 == 0 ? 8 : slice % 4 == 0 ? 4 : 1;\n#else\n                out_elempack = slice % 4 == 0 ? 
4 : 1;\n#endif\n            }\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            const unsigned short* ptr = (const unsigned short*)bottom_blob + q;\n            unsigned short* outptr = top_blob;\n            memcpy(outptr, ptr, top_blob.w * top_blob.elemsize);\n\n            q += slice;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // slice image height\n        int w = bottom_blob.w;\n        int h = bottom_blob.h * elempack;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = h - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? h + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((h - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __ARM_NEON\n            if (opt.use_packing_layout)\n            {\n#if NCNN_ARM82\n                out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && slice % 8 == 0 ? 8 : slice % 4 == 0 ? 4 : 1;\n#else\n                out_elempack = slice % 4 == 0 ? 
4 : 1;\n#endif\n            }\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        size_t out_elemsize = top_blobs[0].elemsize;\n        int out_elempack = top_blobs[0].elempack;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            out_elemsize = std::min(out_elemsize, top_blobs[i].elemsize);\n            out_elempack = std::min(out_elempack, top_blobs[i].elempack);\n        }\n\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > out_elempack)\n        {\n            convert_packing(bottom_blob, bottom_blob_unpacked, out_elempack, opt);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        const unsigned short* ptr = bottom_blob_unpacked;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            Mat& top_blob = top_blobs[i];\n\n#if NCNN_ARM82\n            if (out_elempack == 4 && top_blob.elempack == 8)\n            {\n                for (int j = 0; j < top_blob.h; j++)\n                {\n                    const unsigned short* r0 = ptr;\n                    const unsigned short* r1 = ptr + w * 4;\n\n                    unsigned short* outptr0 = top_blob.row<unsigned short>(j);\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        outptr0[0] = r0[0];\n                        outptr0[1] = r0[1];\n                        outptr0[2] = r0[2];\n                        outptr0[3] = r0[3];\n                        outptr0[4] = r1[0];\n                        outptr0[5] = r1[1];\n                        outptr0[6] = r1[2];\n                        outptr0[7] = r1[3];\n\n                        r0 += 4;\n                        r1 
+= 4;\n                        outptr0 += 8;\n                    }\n\n                    ptr += w * 8;\n                }\n            }\n            if (out_elempack == 1 && top_blob.elempack == 8)\n            {\n                for (int j = 0; j < top_blob.h; j++)\n                {\n                    const unsigned short* r0 = ptr;\n                    const unsigned short* r1 = ptr + w;\n                    const unsigned short* r2 = ptr + w * 2;\n                    const unsigned short* r3 = ptr + w * 3;\n                    const unsigned short* r4 = ptr + w * 4;\n                    const unsigned short* r5 = ptr + w * 5;\n                    const unsigned short* r6 = ptr + w * 6;\n                    const unsigned short* r7 = ptr + w * 7;\n\n                    unsigned short* outptr0 = top_blob.row<unsigned short>(j);\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        outptr0[0] = *r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n                        outptr0[4] = *r4++;\n                        outptr0[5] = *r5++;\n                        outptr0[6] = *r6++;\n                        outptr0[7] = *r7++;\n\n                        outptr0 += 8;\n                    }\n\n                    ptr += w * 8;\n                }\n            }\n#endif // NCNN_ARM82\n            if (out_elempack == 1 && top_blob.elempack == 4)\n            {\n                for (int j = 0; j < top_blob.h; j++)\n                {\n                    const unsigned short* r0 = ptr;\n                    const unsigned short* r1 = ptr + w;\n                    const unsigned short* r2 = ptr + w * 2;\n                    const unsigned short* r3 = ptr + w * 3;\n\n                    unsigned short* outptr0 = top_blob.row<unsigned short>(j);\n\n                    for (int j = 0; j < w; j++)\n                    {\n                       
 outptr0[0] = *r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n\n                        outptr0 += 4;\n                    }\n\n                    ptr += w * 4;\n                }\n            }\n            if (out_elempack == top_blob.elempack) // 1-1 4-4 8-8\n            {\n                int size = w * top_blob.h;\n\n                unsigned short* outptr = top_blob;\n                memcpy(outptr, ptr, size * top_blob.elemsize);\n\n                ptr += size * top_blob.elempack;\n            }\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // slice image width\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice, h, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n            const unsigned short* ptr = bottom_blob.row<const unsigned short>(j);\n            for (size_t i = 0; i < top_blobs.size(); i++)\n            {\n                Mat& top_blob = top_blobs[i];\n\n                unsigned short* outptr = top_blob.row<unsigned short>(j);\n                memcpy(outptr, ptr, top_blob.w * elemsize);\n\n                ptr += top_blob.w * elempack;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // slice dim channel\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c * elempack;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = channels - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
channels + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((channels - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __ARM_NEON\n            if (opt.use_packing_layout)\n            {\n#if NCNN_ARM82\n                out_elempack = support_fp16_storage && opt.use_fp16_arithmetic && slice % 8 == 0 ? 8 : slice % 4 == 0 ? 4 : 1;\n#else\n                out_elempack = slice % 4 == 0 ? 4 : 1;\n#endif\n            }\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, h, d, slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        size_t out_elemsize = top_blobs[0].elemsize;\n        int out_elempack = top_blobs[0].elempack;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            out_elemsize = std::min(out_elemsize, top_blobs[i].elemsize);\n            out_elempack = std::min(out_elempack, top_blobs[i].elempack);\n        }\n\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > out_elempack)\n        {\n            convert_packing(bottom_blob, bottom_blob_unpacked, out_elempack, opt);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        int p = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            Mat& top_blob = top_blobs[i];\n\n#if NCNN_ARM82\n            if (out_elempack == 4 && top_blob.elempack == 8)\n            {\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                for (int 
q = 0; q < top_blob.c; q++)\n                {\n                    const unsigned short* r0 = bottom_blob_unpacked.channel(p);\n                    const unsigned short* r1 = bottom_blob_unpacked.channel(p + 1);\n\n                    unsigned short* outptr0 = top_blob.channel(q);\n\n                    for (int j = 0; j < size; j++)\n                    {\n                        outptr0[0] = r0[0];\n                        outptr0[1] = r0[1];\n                        outptr0[2] = r0[2];\n                        outptr0[3] = r0[3];\n                        outptr0[4] = r1[0];\n                        outptr0[5] = r1[1];\n                        outptr0[6] = r1[2];\n                        outptr0[7] = r1[3];\n\n                        r0 += 4;\n                        r1 += 4;\n                        outptr0 += 8;\n                    }\n\n                    p += 2;\n                }\n            }\n            if (out_elempack == 1 && top_blob.elempack == 8)\n            {\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const unsigned short* r0 = bottom_blob_unpacked.channel(p);\n                    const unsigned short* r1 = bottom_blob_unpacked.channel(p + 1);\n                    const unsigned short* r2 = bottom_blob_unpacked.channel(p + 2);\n                    const unsigned short* r3 = bottom_blob_unpacked.channel(p + 3);\n                    const unsigned short* r4 = bottom_blob_unpacked.channel(p + 4);\n                    const unsigned short* r5 = bottom_blob_unpacked.channel(p + 5);\n                    const unsigned short* r6 = bottom_blob_unpacked.channel(p + 6);\n                    const unsigned short* r7 = bottom_blob_unpacked.channel(p + 7);\n\n                    unsigned short* outptr0 = top_blob.channel(q);\n\n                    for (int j = 0; j < size; j++)\n                    {\n                        outptr0[0] = 
*r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n                        outptr0[4] = *r4++;\n                        outptr0[5] = *r5++;\n                        outptr0[6] = *r6++;\n                        outptr0[7] = *r7++;\n\n                        outptr0 += 8;\n                    }\n\n                    p += 8;\n                }\n            }\n#endif // NCNN_ARM82\n            if (out_elempack == 1 && top_blob.elempack == 4)\n            {\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const unsigned short* r0 = bottom_blob_unpacked.channel(p);\n                    const unsigned short* r1 = bottom_blob_unpacked.channel(p + 1);\n                    const unsigned short* r2 = bottom_blob_unpacked.channel(p + 2);\n                    const unsigned short* r3 = bottom_blob_unpacked.channel(p + 3);\n\n                    unsigned short* outptr0 = top_blob.channel(q);\n\n                    for (int j = 0; j < size; j++)\n                    {\n                        outptr0[0] = *r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n\n                        outptr0 += 4;\n                    }\n\n                    p += 4;\n                }\n            }\n            if (out_elempack == top_blob.elempack) // 1-1 4-4 8-8\n            {\n                int size = top_blob.total();\n\n                const unsigned short* ptr = bottom_blob_unpacked.channel(p);\n                unsigned short* outptr = top_blob;\n                memcpy(outptr, ptr, size * top_blob.elemsize);\n\n                p += top_blob.c;\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // slice dim height\n    
    int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = h - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? h + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((h - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, slice, d, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(p);\n\n            for (int j = 0; j < d; j++)\n            {\n                for (size_t i = 0; i < top_blobs.size(); i++)\n                {\n                    Mat& top_blob = top_blobs[i];\n\n                    int size = top_blob.w * top_blob.h;\n\n                    unsigned short* outptr = top_blob.channel(p).depth(j);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    ptr += size * elempack;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // slice dim width\n        int w = 
bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice, h, d, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(p);\n\n            for (int j = 0; j < d; j++)\n            {\n                for (int k = 0; k < h; k++)\n                {\n                    for (size_t i = 0; i < top_blobs.size(); i++)\n                    {\n                        Mat& top_blob = top_blobs[i];\n\n                        unsigned short* outptr = top_blob.channel(p).depth(j).row<unsigned short>(k);\n                        memcpy(outptr, ptr, top_blob.w * elemsize);\n\n                        ptr += top_blob.w * elempack;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        int w = 
bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = d - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? d + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((d - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, h, slice, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(p);\n\n            for (size_t i = 0; i < top_blobs.size(); i++)\n            {\n                Mat& top_blob = top_blobs[i];\n\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                unsigned short* outptr = top_blob.channel(p);\n                memcpy(outptr, ptr, size * elemsize);\n\n                ptr += size * elempack;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/slice_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SLICE_ARM_H\n#define LAYER_SLICE_ARM_H\n\n#include \"slice.h\"\n\nnamespace ncnn {\n\nclass Slice_arm : public Slice\n{\npublic:\n    Slice_arm();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    int forward_bf16s_fp16s(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SLICE_ARM_H\n"
  },
  {
    "path": "src/layer/arm/softmax_arm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"softmax_arm.h\"\n\n#include <float.h>\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nSoftmax_arm::Softmax_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nstatic void softmax(float* _ptr, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n    // reduce max\n#if __ARM_NEON\n    float32x4_t _max = vdupq_n_f32(-FLT_MAX);\n#endif // __ARM_NEON\n    float max = -FLT_MAX;\n    {\n        const float* ptr = _ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _max = vmaxq_f32(_max, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            max = std::max(max, *ptr++);\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 1)\n    {\n#if __aarch64__\n        max = std::max(max, vmaxvq_f32(_max));\n#else\n        float32x2_t _max2 = vmax_f32(vget_low_f32(_max), vget_high_f32(_max));\n        float32x2_t _mm2 = vpmax_f32(_max2, _max2);\n        max = std::max(max, vget_lane_f32(_mm2, 0));\n#endif\n\n        _max = vdupq_n_f32(max);\n    }\n#endif // __ARM_NEON\n\n    // reduce exp(x - max)\n#if __ARM_NEON\n    float32x4_t _sum = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float sum = 0.f;\n    {\n        float* ptr = _ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vsubq_f32(_p, _max);\n            _p = exp_ps(_p);\n            vst1q_f32(ptr, _p);\n            _sum = vaddq_f32(_sum, _p);\n            ptr 
+= 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = expf(*ptr - max);\n            *ptr = v;\n            sum += v;\n            ptr++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 1)\n    {\n#if __aarch64__\n        sum += vaddvq_f32(_sum);\n#else\n        float32x2_t _sum2 = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n        float32x2_t _ss2 = vpadd_f32(_sum2, _sum2);\n        sum += vget_lane_f32(_ss2, 0);\n#endif\n\n        _sum = vdupq_n_f32(sum);\n    }\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n    _sum = div_ps(vdupq_n_f32(1.f), _sum);\n#endif // __ARM_NEON\n    sum = 1.f / sum;\n\n    // div sum\n    {\n        float* ptr = _ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = vmulq_f32(_p, _sum);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr++ *= sum;\n        }\n    }\n}\n\n#if __ARM_NEON\nstatic void softmax_pack4(float* _ptr, int elemcount, size_t stride, int size1, float* _maxptr, float* _sumptr)\n{\n    // reduce max\n    for (int i = 0; i < elemcount; i++)\n    {\n        const float* ptr = _ptr + i * stride;\n        float* maxptr = _maxptr;\n\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4x4_t _p = vld4q_f32(ptr);\n            float32x4_t _max = vld1q_f32(maxptr);\n            float32x4_t _max2 = vmaxq_f32(_p.val[0], _p.val[1]);\n            float32x4_t _max4 = vmaxq_f32(_p.val[2], _p.val[3]);\n            _max = vmaxq_f32(_max, vmaxq_f32(_max2, _max4));\n            vst1q_f32(maxptr, _max);\n            ptr += 16;\n            maxptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n#if __aarch64__\n            float max0 = vmaxvq_f32(_p);\n#else\n            float32x2_t 
_max2 = vmax_f32(vget_low_f32(_p), vget_high_f32(_p));\n            float32x2_t _mm2 = vpmax_f32(_max2, _max2);\n            float max0 = vget_lane_f32(_mm2, 0);\n#endif\n            *maxptr = std::max(*maxptr, max0);\n            ptr += 4;\n            maxptr++;\n        }\n    }\n\n    // reduce exp(x - max)\n    for (int i = 0; i < elemcount; i++)\n    {\n        float* ptr = _ptr + i * stride;\n        const float* maxptr = _maxptr;\n        float* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4x4_t _p = vld4q_f32(ptr);\n            float32x4_t _max = vld1q_f32(maxptr);\n            float32x4_t _p0 = vsubq_f32(_p.val[0], _max);\n            float32x4_t _p1 = vsubq_f32(_p.val[1], _max);\n            float32x4_t _p2 = vsubq_f32(_p.val[2], _max);\n            float32x4_t _p3 = vsubq_f32(_p.val[3], _max);\n            _p.val[0] = exp_ps(_p0);\n            _p.val[1] = exp_ps(_p1);\n            _p.val[2] = exp_ps(_p2);\n            _p.val[3] = exp_ps(_p3);\n            vst4q_f32(ptr, _p);\n            float32x4_t _sum = vld1q_f32(sumptr);\n            float32x4_t _ss2 = vaddq_f32(_p.val[0], _p.val[1]);\n            float32x4_t _ss4 = vaddq_f32(_p.val[2], _p.val[3]);\n            _sum = vaddq_f32(_sum, vaddq_f32(_ss2, _ss4));\n            vst1q_f32(sumptr, _sum);\n            ptr += 16;\n            maxptr += 4;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _max = vdupq_n_f32(*maxptr);\n            _p = exp_ps(vsubq_f32(_p, _max));\n            vst1q_f32(ptr, _p);\n#if __aarch64__\n            float sum0 = vaddvq_f32(_p);\n#else\n            float32x2_t _sum2 = vadd_f32(vget_low_f32(_p), vget_high_f32(_p));\n            float32x2_t _ss2 = vpadd_f32(_sum2, _sum2);\n            float sum0 = vget_lane_f32(_ss2, 0);\n#endif\n            *sumptr += sum0;\n            ptr += 4;\n            maxptr++;\n      
      sumptr++;\n        }\n    }\n\n    {\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float* sumptr = _sumptr;\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _sum = div_ps(_one, _sum);\n            vst1q_f32(sumptr, _sum);\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *sumptr = 1.f / *sumptr;\n            sumptr++;\n        }\n    }\n\n    // div sum\n    for (int i = 0; i < elemcount; i++)\n    {\n        float* ptr = _ptr + i * stride;\n        const float* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4x4_t _p = vld4q_f32(ptr);\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _p.val[0] = vmulq_f32(_p.val[0], _sum);\n            _p.val[1] = vmulq_f32(_p.val[1], _sum);\n            _p.val[2] = vmulq_f32(_p.val[2], _sum);\n            _p.val[3] = vmulq_f32(_p.val[3], _sum);\n            vst4q_f32(ptr, _p);\n            ptr += 16;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _sum = vld1q_dup_f32(sumptr);\n            _p = vmulq_f32(_p, _sum);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n            sumptr++;\n        }\n    }\n}\n#endif // __ARM_NEON\n\nstatic void softmax_pack1(float* _ptr, int elemcount, size_t stride, int size1, float* _maxptr, float* _sumptr)\n{\n    // reduce max\n    for (int i = 0; i < elemcount; i++)\n    {\n        const float* ptr = _ptr + i * stride;\n        float* maxptr = _maxptr;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _max = vld1q_f32(maxptr);\n            _max = vmaxq_f32(_max, _p);\n            vst1q_f32(maxptr, _max);\n            ptr += 4;\n            maxptr += 4;\n        
}\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *maxptr = std::max(*maxptr, *ptr);\n            ptr++;\n            maxptr++;\n        }\n    }\n\n    // reduce exp(x - max)\n    for (int i = 0; i < elemcount; i++)\n    {\n        float* ptr = _ptr + i * stride;\n        const float* maxptr = _maxptr;\n        float* sumptr = _sumptr;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _max = vld1q_f32(maxptr);\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _p = vsubq_f32(_p, _max);\n            _p = exp_ps(_p);\n            vst1q_f32(ptr, _p);\n            _sum = vaddq_f32(_sum, _p);\n            vst1q_f32(sumptr, _sum);\n            ptr += 4;\n            maxptr += 4;\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            float v = expf(*ptr - *maxptr);\n            *ptr = v;\n            *sumptr += v;\n            ptr++;\n            maxptr++;\n            sumptr++;\n        }\n    }\n\n    {\n        float* sumptr = _sumptr;\n        int j = 0;\n#if __ARM_NEON\n        float32x4_t _one = vdupq_n_f32(1.f);\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _sum = div_ps(_one, _sum);\n            vst1q_f32(sumptr, _sum);\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *sumptr = 1.f / *sumptr;\n            sumptr++;\n        }\n    }\n\n    // div sum\n    for (int i = 0; i < elemcount; i++)\n    {\n        float* ptr = _ptr + i * stride;\n        const float* sumptr = _sumptr;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _p = vmulq_f32(_p, _sum);\n            vst1q_f32(ptr, 
_p);\n            ptr += 4;\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *ptr *= *sumptr;\n            ptr++;\n            sumptr++;\n        }\n    }\n}\n\nstatic void softmax(float* _ptr, int elemcount, int elempack, size_t stride, int size1, float* _maxptr, float* _sumptr)\n{\n    // reduce max\n    {\n        float* maxptr = _maxptr;\n\n        int j = 0;\n#if __ARM_NEON\n        float32x4_t _negmax = vdupq_n_f32(-FLT_MAX);\n        for (; j + 3 < size1; j += 4)\n        {\n            vst1q_f32(maxptr, _negmax);\n            maxptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *maxptr++ = -FLT_MAX;\n        }\n    }\n\n    // reduce exp(x - max)\n    {\n        float* sumptr = _sumptr;\n\n        int j = 0;\n#if __ARM_NEON\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        for (; j + 3 < size1; j += 4)\n        {\n            vst1q_f32(sumptr, _zero);\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *sumptr++ = 0.f;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        softmax_pack4(_ptr, elemcount, stride, size1, _maxptr, _sumptr);\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        softmax_pack1(_ptr, elemcount, stride, size1, _maxptr, _sumptr);\n    }\n}\n\nint Softmax_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int d = 
bottom_top_blob.d;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n    const int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        float* ptr = bottom_top_blob;\n\n        const int size = w * elempack;\n\n        softmax(ptr, size, 1);\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        const int size = w;\n        const int sizen = (size + (opt.num_threads - 1)) / opt.num_threads;\n        const size_t stride = (size_t)w * elempack;\n\n        Mat maxsum(sizen, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        const int nn_size = (size + sizen - 1) / sizen;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            const int i = ii * sizen;\n            const int size1 = std::min(sizen, size - i);\n\n            float* maxsumptr = maxsum.channel(get_omp_thread_num());\n            float* maxptr = maxsumptr;\n            float* sumptr = maxptr + sizen;\n\n            float* ptr = (float*)bottom_top_blob + i * elempack;\n\n            softmax(ptr, h, elempack, stride, size1, maxptr, sumptr);\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n\n            softmax(ptr, w, elempack);\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        const int size = w * h * d;\n        const int sizen = (size + (opt.num_threads - 1)) / opt.num_threads;\n        const size_t stride = bottom_top_blob.cstep * elempack;\n\n        Mat maxsum(sizen, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        const int nn_size = (size + sizen - 1) / sizen;\n  
      #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            const int i = ii * sizen;\n            const int size1 = std::min(sizen, size - i);\n\n            float* maxsumptr = maxsum.channel(get_omp_thread_num());\n            float* maxptr = maxsumptr;\n            float* sumptr = maxptr + sizen;\n\n            float* ptr = (float*)bottom_top_blob + i * elempack;\n\n            softmax(ptr, channels, elempack, stride, size1, maxptr, sumptr);\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        const int size = w * elempack;\n\n        Mat maxsum(size, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            for (int i = 0; i < d; i++)\n            {\n                float* ptr = bottom_top_blob.channel(q).depth(i);\n\n                float* maxsumptr = maxsum.channel(get_omp_thread_num());\n                float* maxptr = maxsumptr;\n                float* sumptr = maxptr + size;\n\n                softmax(ptr, h, 1, size, size, maxptr, sumptr);\n            }\n        }\n    }\n\n    if (dims == 3 && positive_axis == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < h; i++)\n            {\n                softmax(ptr, w, elempack);\n                ptr += w * elempack;\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        const int size = w * h * elempack;\n\n        Mat maxsum(size, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        #pragma omp parallel for 
num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            float* maxsumptr = maxsum.channel(get_omp_thread_num());\n            float* maxptr = maxsumptr;\n            float* sumptr = maxptr + size;\n\n            softmax(ptr, d, 1, size, size, maxptr, sumptr);\n        }\n    }\n\n    if (dims == 4 && positive_axis == 3)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    softmax(ptr, w, elempack);\n                    ptr += w * elempack;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nstatic void softmax_bf16s(unsigned short* _ptr, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n    // reduce max\n#if __ARM_NEON\n    float32x4_t _max = vdupq_n_f32(-FLT_MAX);\n#endif // __ARM_NEON\n    float max = -FLT_MAX;\n    {\n        const unsigned short* ptr = _ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _max1 = vdupq_n_f32(-FLT_MAX);\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _max = vmaxq_f32(_max, _p0);\n            _max1 = vmaxq_f32(_max1, _p1);\n            ptr += 8;\n        }\n        _max = vmaxq_f32(_max, _max1);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _max = vmaxq_f32(_max, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            max = std::max(max, bfloat16_to_float32(*ptr++));\n        
}\n    }\n\n#if __ARM_NEON\n    if (elempack == 1)\n    {\n#if __aarch64__\n        max = std::max(max, vmaxvq_f32(_max));\n#else\n        float32x2_t _max2 = vmax_f32(vget_low_f32(_max), vget_high_f32(_max));\n        float32x2_t _mm2 = vpmax_f32(_max2, _max2);\n        max = std::max(max, vget_lane_f32(_mm2, 0));\n#endif\n\n        _max = vdupq_n_f32(max);\n    }\n#endif // __ARM_NEON\n\n    // reduce exp(x - max)\n#if __ARM_NEON\n    float32x4_t _sum = vdupq_n_f32(0.f);\n#endif // __ARM_NEON\n    float sum = 0.f;\n    {\n        unsigned short* ptr = _ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _sum1 = vdupq_n_f32(0.f);\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _p0 = vsubq_f32(_p0, _max);\n            _p1 = vsubq_f32(_p1, _max);\n            _p0 = exp_ps(_p0);\n            _p1 = exp_ps(_p1);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            _sum = vaddq_f32(_sum, _p0);\n            _sum1 = vaddq_f32(_sum1, _p1);\n            ptr += 8;\n        }\n        _sum = vaddq_f32(_sum, _sum1);\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = vsubq_f32(_p, _max);\n            _p = exp_ps(_p);\n            vst1_u16(ptr, float2bfloat(_p));\n            _sum = vaddq_f32(_sum, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = expf(bfloat16_to_float32(*ptr) - max);\n            *ptr = float32_to_bfloat16(v);\n            sum += v;\n            ptr++;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 1)\n    {\n#if __aarch64__\n        sum += vaddvq_f32(_sum);\n#else\n        float32x2_t _sum2 = vadd_f32(vget_low_f32(_sum), vget_high_f32(_sum));\n  
      float32x2_t _ss2 = vpadd_f32(_sum2, _sum2);\n        sum += vget_lane_f32(_ss2, 0);\n#endif\n\n        _sum = vdupq_n_f32(sum);\n    }\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n    _sum = div_ps(vdupq_n_f32(1.f), _sum);\n#endif // __ARM_NEON\n    sum = 1.f / sum;\n\n    // div sum\n    {\n        unsigned short* ptr = _ptr;\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _p0 = vmulq_f32(_p0, _sum);\n            _p1 = vmulq_f32(_p1, _sum);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = vmulq_f32(_p, _sum);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = float32_to_bfloat16(bfloat16_to_float32(*ptr) * sum);\n            ptr++;\n        }\n    }\n}\n\n#if __ARM_NEON\nstatic void softmax_bf16s_pack4(unsigned short* _ptr, int elemcount, size_t stride, int size1, float* _maxptr, float* _sumptr)\n{\n    // reduce max\n    for (int i = 0; i < elemcount; i++)\n    {\n        const unsigned short* ptr = _ptr + i * stride;\n        float* maxptr = _maxptr;\n\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            uint16x4x4_t _p = vld4_u16(ptr);\n            float32x4_t _p0 = bfloat2float(_p.val[0]);\n            float32x4_t _p1 = bfloat2float(_p.val[1]);\n            float32x4_t _p2 = bfloat2float(_p.val[2]);\n            float32x4_t _p3 = bfloat2float(_p.val[3]);\n            float32x4_t _max = vld1q_f32(maxptr);\n            float32x4_t _max2 = vmaxq_f32(_p0, _p1);\n            float32x4_t _max4 
= vmaxq_f32(_p2, _p3);\n            _max = vmaxq_f32(_max, vmaxq_f32(_max2, _max4));\n            vst1q_f32(maxptr, _max);\n            ptr += 16;\n            maxptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n#if __aarch64__\n            float max0 = vmaxvq_f32(_p);\n#else\n            float32x2_t _max2 = vmax_f32(vget_low_f32(_p), vget_high_f32(_p));\n            float32x2_t _mm2 = vpmax_f32(_max2, _max2);\n            float max0 = vget_lane_f32(_mm2, 0);\n#endif\n            *maxptr = std::max(*maxptr, max0);\n            ptr += 4;\n            maxptr++;\n        }\n    }\n\n    // reduce exp(x - max)\n    for (int i = 0; i < elemcount; i++)\n    {\n        unsigned short* ptr = _ptr + i * stride;\n        const float* maxptr = _maxptr;\n        float* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            uint16x4x4_t _p = vld4_u16(ptr);\n            float32x4_t _p0 = bfloat2float(_p.val[0]);\n            float32x4_t _p1 = bfloat2float(_p.val[1]);\n            float32x4_t _p2 = bfloat2float(_p.val[2]);\n            float32x4_t _p3 = bfloat2float(_p.val[3]);\n            float32x4_t _max = vld1q_f32(maxptr);\n            _p0 = vsubq_f32(_p0, _max);\n            _p1 = vsubq_f32(_p1, _max);\n            _p2 = vsubq_f32(_p2, _max);\n            _p3 = vsubq_f32(_p3, _max);\n            _p0 = exp_ps(_p0);\n            _p1 = exp_ps(_p1);\n            _p2 = exp_ps(_p2);\n            _p3 = exp_ps(_p3);\n            _p.val[0] = float2bfloat(_p0);\n            _p.val[1] = float2bfloat(_p1);\n            _p.val[2] = float2bfloat(_p2);\n            _p.val[3] = float2bfloat(_p3);\n            vst4_u16(ptr, _p);\n            float32x4_t _sum = vld1q_f32(sumptr);\n            float32x4_t _ss2 = vaddq_f32(_p0, _p1);\n            float32x4_t _ss4 = vaddq_f32(_p2, _p3);\n            _sum = vaddq_f32(_sum, vaddq_f32(_ss2, _ss4));\n            
vst1q_f32(sumptr, _sum);\n            ptr += 16;\n            maxptr += 4;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _max = vdupq_n_f32(*maxptr);\n            _p = exp_ps(vsubq_f32(_p, _max));\n            vst1_u16(ptr, float2bfloat(_p));\n#if __aarch64__\n            float sum0 = vaddvq_f32(_p);\n#else\n            float32x2_t _sum2 = vadd_f32(vget_low_f32(_p), vget_high_f32(_p));\n            float32x2_t _ss2 = vpadd_f32(_sum2, _sum2);\n            float sum0 = vget_lane_f32(_ss2, 0);\n#endif\n            *sumptr += sum0;\n            ptr += 4;\n            maxptr++;\n            sumptr++;\n        }\n    }\n\n    {\n        float32x4_t _one = vdupq_n_f32(1.f);\n        float* sumptr = _sumptr;\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _sum = div_ps(_one, _sum);\n            vst1q_f32(sumptr, _sum);\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *sumptr = 1.f / *sumptr;\n            sumptr++;\n        }\n    }\n\n    // div sum\n    for (int i = 0; i < elemcount; i++)\n    {\n        unsigned short* ptr = _ptr + i * stride;\n        const float* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 3 < size1; j += 4)\n        {\n            uint16x4x4_t _p = vld4_u16(ptr);\n            float32x4_t _p0 = bfloat2float(_p.val[0]);\n            float32x4_t _p1 = bfloat2float(_p.val[1]);\n            float32x4_t _p2 = bfloat2float(_p.val[2]);\n            float32x4_t _p3 = bfloat2float(_p.val[3]);\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _p0 = vmulq_f32(_p0, _sum);\n            _p1 = vmulq_f32(_p1, _sum);\n            _p2 = vmulq_f32(_p2, _sum);\n            _p3 = vmulq_f32(_p3, _sum);\n            _p.val[0] = float2bfloat(_p0);\n            _p.val[1] = float2bfloat(_p1);\n            
_p.val[2] = float2bfloat(_p2);\n            _p.val[3] = float2bfloat(_p3);\n            vst4_u16(ptr, _p);\n            ptr += 16;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _sum = vld1q_dup_f32(sumptr);\n            _p = vmulq_f32(_p, _sum);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n            sumptr++;\n        }\n    }\n}\n#endif // __ARM_NEON\n\nstatic void softmax_bf16s_pack1(unsigned short* _ptr, int elemcount, size_t stride, int size1, float* _maxptr, float* _sumptr)\n{\n    // reduce max\n    for (int i = 0; i < elemcount; i++)\n    {\n        const unsigned short* ptr = _ptr + i * stride;\n        float* maxptr = _maxptr;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 7 < size1; j += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            float32x4_t _max0 = vld1q_f32(maxptr);\n            float32x4_t _max1 = vld1q_f32(maxptr + 4);\n            _max0 = vmaxq_f32(_max0, _p0);\n            _max1 = vmaxq_f32(_max1, _p1);\n            vst1q_f32(maxptr, _max0);\n            vst1q_f32(maxptr + 4, _max1);\n            ptr += 8;\n            maxptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _max = vld1q_f32(maxptr);\n            _max = vmaxq_f32(_max, _p);\n            vst1q_f32(maxptr, _max);\n            ptr += 4;\n            maxptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *maxptr = std::max(*maxptr, bfloat16_to_float32(*ptr));\n            ptr++;\n            maxptr++;\n        }\n    }\n\n    // reduce exp(x - max)\n    for (int i = 0; i < elemcount; i++)\n    {\n        unsigned short* ptr = _ptr + i 
* stride;\n        const float* maxptr = _maxptr;\n        float* sumptr = _sumptr;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 7 < size1; j += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            float32x4_t _max0 = vld1q_f32(maxptr);\n            float32x4_t _max1 = vld1q_f32(maxptr + 4);\n            float32x4_t _sum0 = vld1q_f32(sumptr);\n            float32x4_t _sum1 = vld1q_f32(sumptr + 4);\n            _p0 = vsubq_f32(_p0, _max0);\n            _p1 = vsubq_f32(_p1, _max1);\n            _p0 = exp_ps(_p0);\n            _p1 = exp_ps(_p1);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            _sum0 = vaddq_f32(_sum0, _p0);\n            _sum1 = vaddq_f32(_sum1, _p1);\n            vst1q_f32(sumptr, _sum0);\n            vst1q_f32(sumptr + 4, _sum1);\n            ptr += 8;\n            maxptr += 8;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _max = vld1q_f32(maxptr);\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _p = vsubq_f32(_p, _max);\n            _p = exp_ps(_p);\n            vst1_u16(ptr, float2bfloat(_p));\n            _sum = vaddq_f32(_sum, _p);\n            vst1q_f32(sumptr, _sum);\n            ptr += 4;\n            maxptr += 4;\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            float v = expf(bfloat16_to_float32(*ptr) - *maxptr);\n            *ptr = float32_to_bfloat16(v);\n            *sumptr += v;\n            ptr++;\n            maxptr++;\n            sumptr++;\n        }\n    }\n\n    {\n        float* sumptr = _sumptr;\n        int j = 0;\n#if __ARM_NEON\n        float32x4_t _one = vdupq_n_f32(1.f);\n        for (; j + 3 < size1; 
j += 4)\n        {\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _sum = div_ps(_one, _sum);\n            vst1q_f32(sumptr, _sum);\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *sumptr = 1.f / *sumptr;\n            sumptr++;\n        }\n    }\n\n    // div sum\n    for (int i = 0; i < elemcount; i++)\n    {\n        unsigned short* ptr = _ptr + i * stride;\n        const float* sumptr = _sumptr;\n\n        int j = 0;\n#if __ARM_NEON\n        for (; j + 7 < size1; j += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            float32x4_t _sum0 = vld1q_f32(sumptr);\n            float32x4_t _sum1 = vld1q_f32(sumptr + 4);\n            _p0 = vmulq_f32(_p0, _sum0);\n            _p1 = vmulq_f32(_p1, _sum1);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            float32x4_t _sum = vld1q_f32(sumptr);\n            _p = vmulq_f32(_p, _sum);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *ptr = float32_to_bfloat16(bfloat16_to_float32(*ptr) * *sumptr);\n            ptr++;\n            sumptr++;\n        }\n    }\n}\n\nstatic void softmax_bf16s(unsigned short* _ptr, int elemcount, int elempack, size_t stride, int size1, float* _maxptr, float* _sumptr)\n{\n    // reduce max\n    {\n        float* maxptr = _maxptr;\n\n        int j = 0;\n#if __ARM_NEON\n        float32x4_t _negmax = vdupq_n_f32(-FLT_MAX);\n        for (; j + 3 < size1; j += 4)\n        {\n            vst1q_f32(maxptr, _negmax);\n   
         maxptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *maxptr++ = -FLT_MAX;\n        }\n    }\n\n    // zero-initialize the per-column sum accumulators\n    {\n        float* sumptr = _sumptr;\n\n        int j = 0;\n#if __ARM_NEON\n        float32x4_t _zero = vdupq_n_f32(0.f);\n        for (; j + 3 < size1; j += 4)\n        {\n            vst1q_f32(sumptr, _zero);\n            sumptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; j < size1; j++)\n        {\n            *sumptr++ = 0.f;\n        }\n    }\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        softmax_bf16s_pack4(_ptr, elemcount, stride, size1, _maxptr, _sumptr);\n    }\n#endif // __ARM_NEON\n    if (elempack == 1)\n    {\n        softmax_bf16s_pack1(_ptr, elemcount, stride, size1, _maxptr, _sumptr);\n    }\n}\n\nint Softmax_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int d = bottom_top_blob.d;\n    const int channels = bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n    const int positive_axis = axis < 0 ? 
dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        unsigned short* ptr = bottom_top_blob;\n\n        const int size = w * elempack;\n\n        softmax_bf16s(ptr, size, 1);\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        const int size = w;\n        const int sizen = (size + (opt.num_threads - 1)) / opt.num_threads;\n        const size_t stride = (size_t)w * elempack;\n\n        Mat maxsum(sizen, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        const int nn_size = (size + sizen - 1) / sizen;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            const int i = ii * sizen;\n            const int size1 = std::min(sizen, size - i);\n\n            float* maxsumptr = maxsum.channel(get_omp_thread_num());\n            float* maxptr = maxsumptr;\n            float* sumptr = maxptr + sizen;\n\n            unsigned short* ptr = (unsigned short*)bottom_top_blob + i * elempack;\n\n            softmax_bf16s(ptr, h, elempack, stride, size1, maxptr, sumptr);\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            unsigned short* ptr = bottom_top_blob.row<unsigned short>(i);\n\n            softmax_bf16s(ptr, w, elempack);\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        const int size = w * h * d;\n        const int sizen = (size + (opt.num_threads - 1)) / opt.num_threads;\n        const size_t stride = bottom_top_blob.cstep * elempack;\n\n        Mat maxsum(sizen, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        const int nn_size = (size + sizen - 1) / sizen;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; 
ii < nn_size; ii++)\n        {\n            const int i = ii * sizen;\n            const int size1 = std::min(sizen, size - i);\n\n            float* maxsumptr = maxsum.channel(get_omp_thread_num());\n            float* maxptr = maxsumptr;\n            float* sumptr = maxptr + sizen;\n\n            unsigned short* ptr = (unsigned short*)bottom_top_blob + i * elempack;\n\n            softmax_bf16s(ptr, channels, elempack, stride, size1, maxptr, sumptr);\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        const int size = w * elempack;\n\n        Mat maxsum(size, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            for (int i = 0; i < d; i++)\n            {\n                unsigned short* ptr = bottom_top_blob.channel(q).depth(i);\n\n                float* maxsumptr = maxsum.channel(get_omp_thread_num());\n                float* maxptr = maxsumptr;\n                float* sumptr = maxptr + size;\n\n                softmax_bf16s(ptr, h, 1, size, size, maxptr, sumptr);\n            }\n        }\n    }\n\n    if (dims == 3 && positive_axis == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < h; i++)\n            {\n                softmax_bf16s(ptr, w, elempack);\n                ptr += w * elempack;\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        const int size = w * h * elempack;\n\n        Mat maxsum(size, 2, opt.num_threads, 4u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < 
channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            float* maxsumptr = maxsum.channel(get_omp_thread_num());\n            float* maxptr = maxsumptr;\n            float* sumptr = maxptr + size;\n\n            softmax_bf16s(ptr, d, 1, size, size, maxptr, sumptr);\n        }\n    }\n\n    if (dims == 4 && positive_axis == 3)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    softmax_bf16s(ptr, w, elempack);\n                    ptr += w * elempack;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/softmax_arm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SOFTMAX_ARM_H\n#define LAYER_SOFTMAX_ARM_H\n\n#include \"softmax.h\"\n\nnamespace ncnn {\n\nclass Softmax_arm : public Softmax\n{\npublic:\n    Softmax_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SOFTMAX_ARM_H\n"
  },
  {
    "path": "src/layer/arm/softmax_arm_asimdhp.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"softmax_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#include \"neon_mathfun_fp16s.h\"\n#endif // __ARM_NEON\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nstatic void softmax_fp16s(__fp16* _ptr, int elemcount, int elempack)\n{\n    const int size = elemcount * elempack;\n\n    // reduce max\n    float16x8_t _max8 = vdupq_n_f16(-65504.f);\n    float16x4_t _max4 = vdup_n_f16(-65504.f);\n    __fp16 max = -65504.f;\n    {\n        const __fp16* ptr = _ptr;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _max8 = vmaxq_f16(_max8, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _max4 = vmax_f16(_max4, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            max = std::max(max, *ptr++);\n        }\n    }\n\n    if (elempack == 4)\n    {\n        _max4 = vmax_f16(_max4, vget_low_f16(_max8));\n        _max4 = vmax_f16(_max4, vget_high_f16(_max8));\n\n        _max8 = vcombine_f16(_max4, _max4);\n    }\n    if (elempack == 1)\n    {\n        max = std::max(max, vmaxvq_f16(_max8));\n        max = std::max(max, vmaxv_f16(_max4));\n\n        _max4 = vdup_n_f16(max);\n        _max8 = vdupq_n_f16(max);\n    }\n\n    // reduce exp(x - max)\n    float16x8_t _sum8 = vdupq_n_f16(0.f);\n    float16x4_t _sum4 = vdup_n_f16(0.f);\n    __fp16 sum = 0.f;\n    {\n        __fp16* ptr = _ptr;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _p = exp_ps_f16(vsubq_f16(_p, _max8));\n            vst1q_f16(ptr, _p);\n            _sum8 = vaddq_f16(_sum8, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n       
 {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = exp_ps_f16(vsub_f16(_p, _max4));\n            vst1_f16(ptr, _p);\n            _sum4 = vadd_f16(_sum4, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = (__fp16)expf(*ptr - max);\n            *ptr = v;\n            sum += v;\n            ptr++;\n        }\n    }\n\n    if (elempack == 4)\n    {\n        _sum4 = vadd_f16(_sum4, vget_low_f16(_sum8));\n        _sum4 = vadd_f16(_sum4, vget_high_f16(_sum8));\n\n        _sum8 = vcombine_f16(_sum4, _sum4);\n    }\n    if (elempack == 1)\n    {\n        _sum4 = vadd_f16(_sum4, vget_low_f16(_sum8));\n        _sum4 = vadd_f16(_sum4, vget_high_f16(_sum8));\n        _sum4 = vpadd_f16(_sum4, _sum4);\n        _sum4 = vpadd_f16(_sum4, _sum4);\n        sum += vget_lane_f16(_sum4, 0);\n\n        _sum4 = vdup_n_f16(sum);\n        _sum8 = vdupq_n_f16(sum);\n    }\n\n    _sum8 = vdivq_f16(vdupq_n_f16(1.f), _sum8);\n    _sum4 = vdiv_f16(vdup_n_f16(1.f), _sum4);\n    sum = (__fp16)1.f / sum;\n\n    // div sum\n    {\n        __fp16* ptr = _ptr;\n\n        int i = 0;\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _p = vmulq_f16(_p, _sum8);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = vmul_f16(_p, _sum4);\n            vst1_f16(ptr, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            *ptr++ *= sum;\n        }\n    }\n}\n\nstatic void softmax_fp16s_pack8(__fp16* _ptr, int elemcount, size_t stride, int size1, __fp16* _maxptr, __fp16* _sumptr)\n{\n    // reduce max\n    for (int i = 0; i < elemcount; i++)\n    {\n        const __fp16* ptr = _ptr + i * stride;\n        __fp16* maxptr = _maxptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p0 
= vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x8_t _p4 = vld1q_f16(ptr + 32);\n            float16x8_t _p5 = vld1q_f16(ptr + 40);\n            float16x8_t _p6 = vld1q_f16(ptr + 48);\n            float16x8_t _p7 = vld1q_f16(ptr + 56);\n            float16x8_t _max = vld1q_f16(maxptr);\n            float16x8_t _max01 = vpmaxq_f16(_p0, _p1);\n            float16x8_t _max23 = vpmaxq_f16(_p2, _p3);\n            float16x8_t _max45 = vpmaxq_f16(_p4, _p5);\n            float16x8_t _max67 = vpmaxq_f16(_p6, _p7);\n            float16x8_t _max2 = vpmaxq_f16(_max01, _max23);\n            float16x8_t _max4 = vpmaxq_f16(_max45, _max67);\n            _max = vmaxq_f16(_max, vpmaxq_f16(_max2, _max4));\n            vst1q_f16(maxptr, _max);\n            ptr += 64;\n            maxptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x4_t _max = vld1_f16(maxptr);\n            float16x8_t _max01 = vpmaxq_f16(_p0, _p1);\n            float16x8_t _max23 = vpmaxq_f16(_p2, _p3);\n            float16x8_t _max2 = vpmaxq_f16(_max01, _max23);\n            _max = vmax_f16(_max, vpmax_f16(vget_low_f16(_max2), vget_high_f16(_max2)));\n            vst1_f16(maxptr, _max);\n            ptr += 32;\n            maxptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            __fp16 max0 = vmaxvq_f16(_p);\n            *maxptr = std::max(*maxptr, max0);\n            ptr += 8;\n            maxptr++;\n        }\n    }\n\n    // reduce exp(x - max)\n    for (int i = 0; i < elemcount; i++)\n    {\n        __fp16* ptr = _ptr + i * stride;\n        const __fp16* maxptr = 
_maxptr;\n        __fp16* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x8_t _p4 = vld1q_f16(ptr + 32);\n            float16x8_t _p5 = vld1q_f16(ptr + 40);\n            float16x8_t _p6 = vld1q_f16(ptr + 48);\n            float16x8_t _p7 = vld1q_f16(ptr + 56);\n            float16x8_t _max = vld1q_f16(maxptr);\n            _p0 = exp_ps_f16(vsubq_f16(_p0, vdupq_laneq_f16(_max, 0)));\n            _p1 = exp_ps_f16(vsubq_f16(_p1, vdupq_laneq_f16(_max, 1)));\n            _p2 = exp_ps_f16(vsubq_f16(_p2, vdupq_laneq_f16(_max, 2)));\n            _p3 = exp_ps_f16(vsubq_f16(_p3, vdupq_laneq_f16(_max, 3)));\n            _p4 = exp_ps_f16(vsubq_f16(_p4, vdupq_laneq_f16(_max, 4)));\n            _p5 = exp_ps_f16(vsubq_f16(_p5, vdupq_laneq_f16(_max, 5)));\n            _p6 = exp_ps_f16(vsubq_f16(_p6, vdupq_laneq_f16(_max, 6)));\n            _p7 = exp_ps_f16(vsubq_f16(_p7, vdupq_laneq_f16(_max, 7)));\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            vst1q_f16(ptr + 32, _p4);\n            vst1q_f16(ptr + 40, _p5);\n            vst1q_f16(ptr + 48, _p6);\n            vst1q_f16(ptr + 56, _p7);\n            float16x8_t _sum = vld1q_f16(sumptr);\n            float16x8_t _ss01 = vpaddq_f16(_p0, _p1);\n            float16x8_t _ss23 = vpaddq_f16(_p2, _p3);\n            float16x8_t _ss45 = vpaddq_f16(_p4, _p5);\n            float16x8_t _ss67 = vpaddq_f16(_p6, _p7);\n            float16x8_t _ss2 = vpaddq_f16(_ss01, _ss23);\n            float16x8_t _ss4 = vpaddq_f16(_ss45, _ss67);\n            _sum = vaddq_f16(_sum, vpaddq_f16(_ss2, _ss4));\n            vst1q_f16(sumptr, _sum);\n            ptr += 64;\n            maxptr += 
8;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x4_t _max = vld1_f16(maxptr);\n            _p0 = exp_ps_f16(vsubq_f16(_p0, vdupq_lane_f16(_max, 0)));\n            _p1 = exp_ps_f16(vsubq_f16(_p1, vdupq_lane_f16(_max, 1)));\n            _p2 = exp_ps_f16(vsubq_f16(_p2, vdupq_lane_f16(_max, 2)));\n            _p3 = exp_ps_f16(vsubq_f16(_p3, vdupq_lane_f16(_max, 3)));\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            float16x4_t _sum = vld1_f16(sumptr);\n            float16x8_t _ss01 = vpaddq_f16(_p0, _p1);\n            float16x8_t _ss23 = vpaddq_f16(_p2, _p3);\n            float16x8_t _ss2 = vpaddq_f16(_ss01, _ss23);\n            _sum = vadd_f16(_sum, vpadd_f16(vget_low_f16(_ss2), vget_high_f16(_ss2)));\n            vst1_f16(sumptr, _sum);\n            ptr += 32;\n            maxptr += 4;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x8_t _max = vdupq_n_f16(*maxptr);\n            _p = exp_ps_f16(vsubq_f16(_p, _max));\n            vst1q_f16(ptr, _p);\n            float16x4_t _sum2 = vadd_f16(vget_low_f16(_p), vget_high_f16(_p));\n            float16x4_t _ss2 = vpadd_f16(_sum2, _sum2);\n            __fp16 sum0 = vget_lane_f16(_ss2, 0) + vget_lane_f16(_ss2, 1);\n            *sumptr += sum0;\n            ptr += 8;\n            maxptr++;\n            sumptr++;\n        }\n    }\n\n    {\n        float16x8_t _one = vdupq_n_f16(1.f);\n        __fp16* sumptr = _sumptr;\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _sum = vld1q_f16(sumptr);\n            _sum = 
vdivq_f16(_one, _sum);\n            vst1q_f16(sumptr, _sum);\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x4_t _sum = vld1_f16(sumptr);\n            _sum = vdiv_f16(vget_low_f16(_one), _sum);\n            vst1_f16(sumptr, _sum);\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *sumptr = (__fp16)1.f / *sumptr;\n            sumptr++;\n        }\n    }\n\n    // div sum\n    for (int i = 0; i < elemcount; i++)\n    {\n        __fp16* ptr = _ptr + i * stride;\n        const __fp16* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x8_t _p4 = vld1q_f16(ptr + 32);\n            float16x8_t _p5 = vld1q_f16(ptr + 40);\n            float16x8_t _p6 = vld1q_f16(ptr + 48);\n            float16x8_t _p7 = vld1q_f16(ptr + 56);\n            float16x8_t _sum = vld1q_f16(sumptr);\n            _p0 = vmulq_laneq_f16(_p0, _sum, 0);\n            _p1 = vmulq_laneq_f16(_p1, _sum, 1);\n            _p2 = vmulq_laneq_f16(_p2, _sum, 2);\n            _p3 = vmulq_laneq_f16(_p3, _sum, 3);\n            _p4 = vmulq_laneq_f16(_p4, _sum, 4);\n            _p5 = vmulq_laneq_f16(_p5, _sum, 5);\n            _p6 = vmulq_laneq_f16(_p6, _sum, 6);\n            _p7 = vmulq_laneq_f16(_p7, _sum, 7);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            vst1q_f16(ptr + 32, _p4);\n            vst1q_f16(ptr + 40, _p5);\n            vst1q_f16(ptr + 48, _p6);\n            vst1q_f16(ptr + 56, _p7);\n            ptr += 64;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x8_t _p0 = 
vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x4_t _sum = vld1_f16(sumptr);\n            _p0 = vmulq_lane_f16(_p0, _sum, 0);\n            _p1 = vmulq_lane_f16(_p1, _sum, 1);\n            _p2 = vmulq_lane_f16(_p2, _sum, 2);\n            _p3 = vmulq_lane_f16(_p3, _sum, 3);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            ptr += 32;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x8_t _sum = vld1q_dup_f16(sumptr);\n            _p = vmulq_f16(_p, _sum);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n            sumptr++;\n        }\n    }\n}\n\nstatic void softmax_fp16s_pack4(__fp16* _ptr, int elemcount, size_t stride, int size1, __fp16* _maxptr, __fp16* _sumptr)\n{\n    // reduce max\n    for (int i = 0; i < elemcount; i++)\n    {\n        const __fp16* ptr = _ptr + i * stride;\n        __fp16* maxptr = _maxptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x8_t _max = vld1q_f16(maxptr);\n            float16x8_t _max2 = vpmaxq_f16(_p0, _p1);\n            float16x8_t _max4 = vpmaxq_f16(_p2, _p3);\n            _max = vmaxq_f16(_max, vpmaxq_f16(_max2, _max4));\n            vst1q_f16(maxptr, _max);\n            ptr += 32;\n            maxptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x4_t _max = vld1_f16(maxptr);\n            
float16x8_t _max2 = vpmaxq_f16(_p0, _p1);\n            _max = vmax_f16(_max, vpmax_f16(vget_low_f16(_max2), vget_high_f16(_max2)));\n            vst1_f16(maxptr, _max);\n            ptr += 16;\n            maxptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            __fp16 max0 = vmaxv_f16(_p);\n            *maxptr = std::max(*maxptr, max0);\n            ptr += 4;\n            maxptr++;\n        }\n    }\n\n    // reduce exp(x - max)\n    for (int i = 0; i < elemcount; i++)\n    {\n        __fp16* ptr = _ptr + i * stride;\n        const __fp16* maxptr = _maxptr;\n        __fp16* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x8_t _max = vld1q_f16(maxptr);\n            float16x8_t _max0 = vcombine_f16(vdup_laneq_f16(_max, 0), vdup_laneq_f16(_max, 1));\n            float16x8_t _max1 = vcombine_f16(vdup_laneq_f16(_max, 2), vdup_laneq_f16(_max, 3));\n            float16x8_t _max2 = vcombine_f16(vdup_laneq_f16(_max, 4), vdup_laneq_f16(_max, 5));\n            float16x8_t _max3 = vcombine_f16(vdup_laneq_f16(_max, 6), vdup_laneq_f16(_max, 7));\n            _p0 = exp_ps_f16(vsubq_f16(_p0, _max0));\n            _p1 = exp_ps_f16(vsubq_f16(_p1, _max1));\n            _p2 = exp_ps_f16(vsubq_f16(_p2, _max2));\n            _p3 = exp_ps_f16(vsubq_f16(_p3, _max3));\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            float16x8_t _sum = vld1q_f16(sumptr);\n            float16x8_t _ss2 = vpaddq_f16(_p0, _p1);\n            float16x8_t _ss4 = vpaddq_f16(_p2, _p3);\n            _sum = vaddq_f16(_sum, vpaddq_f16(_ss2, _ss4));\n            vst1q_f16(sumptr, 
_sum);\n            ptr += 32;\n            maxptr += 8;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x4_t _max = vld1_f16(maxptr);\n            float16x8_t _max0 = vcombine_f16(vdup_lane_f16(_max, 0), vdup_lane_f16(_max, 1));\n            float16x8_t _max1 = vcombine_f16(vdup_lane_f16(_max, 2), vdup_lane_f16(_max, 3));\n            _p0 = exp_ps_f16(vsubq_f16(_p0, _max0));\n            _p1 = exp_ps_f16(vsubq_f16(_p1, _max1));\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            float16x4_t _sum = vld1_f16(sumptr);\n            float16x8_t _ss2 = vpaddq_f16(_p0, _p1);\n            _sum = vadd_f16(_sum, vpadd_f16(vget_low_f16(_ss2), vget_high_f16(_ss2)));\n            vst1_f16(sumptr, _sum);\n            ptr += 16;\n            maxptr += 4;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            float16x4_t _max = vdup_n_f16(*maxptr);\n            _p = exp_ps_f16(vsub_f16(_p, _max));\n            vst1_f16(ptr, _p);\n            float16x4_t _ss2 = vpadd_f16(_p, _p);\n            __fp16 sum0 = vget_lane_f16(_ss2, 0) + vget_lane_f16(_ss2, 1);\n            *sumptr += sum0;\n            ptr += 4;\n            maxptr++;\n            sumptr++;\n        }\n    }\n\n    {\n        float16x8_t _one = vdupq_n_f16(1.f);\n        __fp16* sumptr = _sumptr;\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _sum = vld1q_f16(sumptr);\n            _sum = vdivq_f16(_one, _sum);\n            vst1q_f16(sumptr, _sum);\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x4_t _sum = vld1_f16(sumptr);\n            _sum = vdiv_f16(vget_low_f16(_one), _sum);\n            vst1_f16(sumptr, _sum);\n            sumptr += 4;\n      
  }\n        for (; j < size1; j++)\n        {\n            *sumptr = (__fp16)1.f / *sumptr;\n            sumptr++;\n        }\n    }\n\n    // div sum\n    for (int i = 0; i < elemcount; i++)\n    {\n        __fp16* ptr = _ptr + i * stride;\n        const __fp16* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            float16x8_t _sum = vld1q_f16(sumptr);\n            float16x8_t _sum0 = vcombine_f16(vdup_laneq_f16(_sum, 0), vdup_laneq_f16(_sum, 1));\n            float16x8_t _sum1 = vcombine_f16(vdup_laneq_f16(_sum, 2), vdup_laneq_f16(_sum, 3));\n            float16x8_t _sum2 = vcombine_f16(vdup_laneq_f16(_sum, 4), vdup_laneq_f16(_sum, 5));\n            float16x8_t _sum3 = vcombine_f16(vdup_laneq_f16(_sum, 6), vdup_laneq_f16(_sum, 7));\n            _p0 = vmulq_f16(_p0, _sum0);\n            _p1 = vmulq_f16(_p1, _sum1);\n            _p2 = vmulq_f16(_p2, _sum2);\n            _p3 = vmulq_f16(_p3, _sum3);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            ptr += 32;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x4_t _sum = vld1_f16(sumptr);\n            float16x8_t _sum0 = vcombine_f16(vdup_lane_f16(_sum, 0), vdup_lane_f16(_sum, 1));\n            float16x8_t _sum1 = vcombine_f16(vdup_lane_f16(_sum, 2), vdup_lane_f16(_sum, 3));\n            _p0 = vmulq_f16(_p0, _sum0);\n            _p1 = vmulq_f16(_p1, _sum1);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            ptr += 16;\n            sumptr += 4;\n        }\n        for (; j < 
size1; j++)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            float16x4_t _sum = vld1_dup_f16(sumptr);\n            _p = vmul_f16(_p, _sum);\n            vst1_f16(ptr, _p);\n            ptr += 4;\n            sumptr++;\n        }\n    }\n}\n\nstatic void softmax_fp16s_pack1(__fp16* _ptr, int elemcount, size_t stride, int size1, __fp16* _maxptr, __fp16* _sumptr)\n{\n    // reduce max\n    for (int i = 0; i < elemcount; i++)\n    {\n        const __fp16* ptr = _ptr + i * stride;\n        __fp16* maxptr = _maxptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x8_t _max = vld1q_f16(maxptr);\n            _max = vmaxq_f16(_max, _p);\n            vst1q_f16(maxptr, _max);\n            ptr += 8;\n            maxptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            float16x4_t _max = vld1_f16(maxptr);\n            _max = vmax_f16(_max, _p);\n            vst1_f16(maxptr, _max);\n            ptr += 4;\n            maxptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *maxptr = std::max(*maxptr, *ptr);\n            ptr++;\n            maxptr++;\n        }\n    }\n\n    // reduce exp(x - max)\n    for (int i = 0; i < elemcount; i++)\n    {\n        __fp16* ptr = _ptr + i * stride;\n        const __fp16* maxptr = _maxptr;\n        __fp16* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x8_t _max = vld1q_f16(maxptr);\n            float16x8_t _sum = vld1q_f16(sumptr);\n            _p = vsubq_f16(_p, _max);\n            _p = exp_ps_f16(_p);\n            vst1q_f16(ptr, _p);\n            _sum = vaddq_f16(_sum, _p);\n            vst1q_f16(sumptr, _sum);\n            ptr += 8;\n            maxptr += 8;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j 
+= 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            float16x4_t _max = vld1_f16(maxptr);\n            float16x4_t _sum = vld1_f16(sumptr);\n            _p = vsub_f16(_p, _max);\n            _p = exp_ps_f16(_p);\n            vst1_f16(ptr, _p);\n            _sum = vadd_f16(_sum, _p);\n            vst1_f16(sumptr, _sum);\n            ptr += 4;\n            maxptr += 4;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            __fp16 v = expf(*ptr - *maxptr);\n            *ptr = v;\n            *sumptr += v;\n            ptr++;\n            maxptr++;\n            sumptr++;\n        }\n    }\n\n    {\n        float16x8_t _one = vdupq_n_f16(1.f);\n        __fp16* sumptr = _sumptr;\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _sum = vld1q_f16(sumptr);\n            _sum = vdivq_f16(_one, _sum);\n            vst1q_f16(sumptr, _sum);\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x4_t _sum = vld1_f16(sumptr);\n            _sum = vdiv_f16(vget_low_f16(_one), _sum);\n            vst1_f16(sumptr, _sum);\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *sumptr = (__fp16)1.f / *sumptr;\n            sumptr++;\n        }\n    }\n\n    // div sum\n    for (int i = 0; i < elemcount; i++)\n    {\n        __fp16* ptr = _ptr + i * stride;\n        const __fp16* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            float16x8_t _sum = vld1q_f16(sumptr);\n            _p = vmulq_f16(_p, _sum);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            float16x4_t _sum = vld1_f16(sumptr);\n            _p = vmul_f16(_p, _sum);\n            
vst1_f16(ptr, _p);\n            ptr += 4;\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *ptr *= *sumptr;\n            ptr++;\n            sumptr++;\n        }\n    }\n}\n\nstatic void softmax_fp16s(__fp16* _ptr, int elemcount, int elempack, size_t stride, int size1, __fp16* _maxptr, __fp16* _sumptr)\n{\n    // initialize the running max buffer to the lowest finite fp16 value\n    {\n        float16x8_t _negmax = vdupq_n_f16(-65504.f);\n\n        __fp16* maxptr = _maxptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            vst1q_f16(maxptr, _negmax);\n            maxptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            vst1_f16(maxptr, vget_low_f16(_negmax));\n            maxptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *maxptr++ = -65504.f;\n        }\n    }\n\n    // initialize the running sum buffer to zero\n    {\n        float16x8_t _zero = vdupq_n_f16(0.f);\n\n        __fp16* sumptr = _sumptr;\n\n        int j = 0;\n        for (; j + 7 < size1; j += 8)\n        {\n            vst1q_f16(sumptr, _zero);\n            sumptr += 8;\n        }\n        for (; j + 3 < size1; j += 4)\n        {\n            vst1_f16(sumptr, vget_low_f16(_zero));\n            sumptr += 4;\n        }\n        for (; j < size1; j++)\n        {\n            *sumptr++ = 0.f;\n        }\n    }\n\n    if (elempack == 8)\n    {\n        softmax_fp16s_pack8(_ptr, elemcount, stride, size1, _maxptr, _sumptr);\n    }\n    if (elempack == 4)\n    {\n        softmax_fp16s_pack4(_ptr, elemcount, stride, size1, _maxptr, _sumptr);\n    }\n    if (elempack == 1)\n    {\n        softmax_fp16s_pack1(_ptr, elemcount, stride, size1, _maxptr, _sumptr);\n    }\n}\n\nint Softmax_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int w = bottom_top_blob.w;\n    const int h = bottom_top_blob.h;\n    const int d = bottom_top_blob.d;\n    const int channels = 
bottom_top_blob.c;\n    const int elempack = bottom_top_blob.elempack;\n    const int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        __fp16* ptr = bottom_top_blob;\n\n        const int size = w * elempack;\n\n        softmax_fp16s(ptr, size, 1);\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        const int size = w;\n        const int sizen = (size + (opt.num_threads - 1)) / opt.num_threads;\n        const size_t stride = (size_t)w * elempack;\n\n        Mat maxsum(sizen, 2, opt.num_threads, 2u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        const int nn_size = (size + sizen - 1) / sizen;\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            const int i = ii * sizen;\n            const int size1 = std::min(sizen, size - i);\n\n            __fp16* maxsumptr = maxsum.channel(get_omp_thread_num());\n            __fp16* maxptr = maxsumptr;\n            __fp16* sumptr = maxptr + sizen;\n\n            __fp16* ptr = (__fp16*)bottom_top_blob + i * elempack;\n\n            softmax_fp16s(ptr, h, elempack, stride, size1, maxptr, sumptr);\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            __fp16* ptr = bottom_top_blob.row<__fp16>(i);\n\n            softmax_fp16s(ptr, w, elempack);\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        const int size = w * h * d;\n        const int sizen = (size + (opt.num_threads - 1)) / opt.num_threads;\n        const size_t stride = bottom_top_blob.cstep * elempack;\n\n        Mat maxsum(sizen, 2, opt.num_threads, 2u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        const int nn_size = (size + sizen - 1) / sizen;\n        
#pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            const int i = ii * sizen;\n            const int size1 = std::min(sizen, size - i);\n\n            __fp16* maxsumptr = maxsum.channel(get_omp_thread_num());\n            __fp16* maxptr = maxsumptr;\n            __fp16* sumptr = maxptr + sizen;\n\n            __fp16* ptr = (__fp16*)bottom_top_blob + i * elempack;\n\n            softmax_fp16s(ptr, channels, elempack, stride, size1, maxptr, sumptr);\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        const int size = w * elempack;\n\n        Mat maxsum(size, 2, opt.num_threads, 2u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            for (int i = 0; i < d; i++)\n            {\n                __fp16* ptr = bottom_top_blob.channel(q).depth(i);\n\n                __fp16* maxsumptr = maxsum.channel(get_omp_thread_num());\n                __fp16* maxptr = maxsumptr;\n                __fp16* sumptr = maxptr + size;\n\n                softmax_fp16s(ptr, h, 1, size, size, maxptr, sumptr);\n            }\n        }\n    }\n\n    if (dims == 3 && positive_axis == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < h; i++)\n            {\n                softmax_fp16s(ptr, w, elempack);\n                ptr += w * elempack;\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        const int size = w * h * elempack;\n\n        Mat maxsum(size, 2, opt.num_threads, 2u, opt.workspace_allocator);\n        if (maxsum.empty())\n            return -100;\n\n        #pragma omp parallel for 
num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            __fp16* maxsumptr = maxsum.channel(get_omp_thread_num());\n            __fp16* maxptr = maxsumptr;\n            __fp16* sumptr = maxptr + size;\n\n            softmax_fp16s(ptr, d, 1, size, size, maxptr, sumptr);\n        }\n    }\n\n    if (dims == 4 && positive_axis == 3)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    softmax_fp16s(ptr, w, elempack);\n                    ptr += w * elempack;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/swish_arm.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"swish_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nSwish_arm::Swish_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint Swish_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _one = vdupq_n_f32(1.f);\n#if __aarch64__\n        for (; i + 15 < size; i += 16)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            float32x4_t _p2 = vld1q_f32(ptr + 8);\n            float32x4_t _p3 = vld1q_f32(ptr + 12);\n            _p0 = div_ps(_p0, vaddq_f32(_one, exp_ps(vnegq_f32(_p0))));\n            _p1 = div_ps(_p1, vaddq_f32(_one, exp_ps(vnegq_f32(_p1))));\n            _p2 = div_ps(_p2, 
vaddq_f32(_one, exp_ps(vnegq_f32(_p2))));\n            _p3 = div_ps(_p3, vaddq_f32(_one, exp_ps(vnegq_f32(_p3))));\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            vst1q_f32(ptr + 8, _p2);\n            vst1q_f32(ptr + 12, _p3);\n            ptr += 16;\n        }\n#endif // __aarch64__\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            _p0 = div_ps(_p0, vaddq_f32(_one, exp_ps(vnegq_f32(_p0))));\n            _p1 = div_ps(_p1, vaddq_f32(_one, exp_ps(vnegq_f32(_p1))));\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = div_ps(_p, vaddq_f32(_one, exp_ps(vnegq_f32(_p))));\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = *ptr / (1.f + expf(-*ptr));\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint Swish_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        float32x4_t _one = vdupq_n_f32(1.f);\n#if __aarch64__\n        for (; i + 15 < size; i += 16)\n        {\n            uint16x8_t _p01 = vld1q_u16(ptr);\n            uint16x8_t _p23 = vld1q_u16(ptr + 8);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n  
          float32x4_t _p2 = bfloat2float(vget_low_u16(_p23));\n            float32x4_t _p3 = bfloat2float(vget_high_u16(_p23));\n            _p0 = div_ps(_p0, vaddq_f32(_one, exp_ps(vnegq_f32(_p0))));\n            _p1 = div_ps(_p1, vaddq_f32(_one, exp_ps(vnegq_f32(_p1))));\n            _p2 = div_ps(_p2, vaddq_f32(_one, exp_ps(vnegq_f32(_p2))));\n            _p3 = div_ps(_p3, vaddq_f32(_one, exp_ps(vnegq_f32(_p3))));\n            _p01 = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            _p23 = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n            vst1q_u16(ptr, _p01);\n            vst1q_u16(ptr + 8, _p23);\n            ptr += 16;\n        }\n#endif // __aarch64__\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _p0 = div_ps(_p0, vaddq_f32(_one, exp_ps(vnegq_f32(_p0))));\n            _p1 = div_ps(_p1, vaddq_f32(_one, exp_ps(vnegq_f32(_p1))));\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = div_ps(_p, vaddq_f32(_one, exp_ps(vnegq_f32(_p))));\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            v = v / (1.f + expf(-v));\n            *ptr = float32_to_bfloat16(v);\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/swish_arm.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SWISH_ARM_H\n#define LAYER_SWISH_ARM_H\n\n#include \"swish.h\"\n\nnamespace ncnn {\n\nclass Swish_arm : public Swish\n{\npublic:\n    Swish_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SWISH_ARM_H\n"
  },
  {
    "path": "src/layer/arm/swish_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"swish_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#include \"neon_mathfun.h\"\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"neon_mathfun_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint Swish_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        float32x4_t _one = vdupq_n_f32(1.f);\n\n        int i = 0;\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p01 = vld1q_f16(ptr);\n            float16x8_t _p23 = vld1q_f16(ptr + 8);\n            float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p01));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p01));\n            float32x4_t _p2 = vcvt_f32_f16(vget_low_f16(_p23));\n            float32x4_t _p3 = vcvt_f32_f16(vget_high_f16(_p23));\n            _p0 = vdivq_f32(_p0, vaddq_f32(_one, exp_ps(vnegq_f32(_p0))));\n            _p1 = vdivq_f32(_p1, vaddq_f32(_one, exp_ps(vnegq_f32(_p1))));\n            _p2 = vdivq_f32(_p2, vaddq_f32(_one, exp_ps(vnegq_f32(_p2))));\n            _p3 = vdivq_f32(_p3, vaddq_f32(_one, exp_ps(vnegq_f32(_p3))));\n            _p01 = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            _p23 = vcombine_f16(vcvt_f16_f32(_p2), vcvt_f16_f32(_p3));\n            vst1q_f16(ptr, _p01);\n            vst1q_f16(ptr + 8, _p23);\n            ptr += 16;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            
float32x4_t _p0 = vcvt_f32_f16(vget_low_f16(_p));\n            float32x4_t _p1 = vcvt_f32_f16(vget_high_f16(_p));\n            _p0 = vdivq_f32(_p0, vaddq_f32(_one, exp_ps(vnegq_f32(_p0))));\n            _p1 = vdivq_f32(_p1, vaddq_f32(_one, exp_ps(vnegq_f32(_p1))));\n            _p = vcombine_f16(vcvt_f16_f32(_p0), vcvt_f16_f32(_p1));\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _p = vdivq_f32(_p, vaddq_f32(_one, exp_ps(vnegq_f32(_p))));\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)*ptr;\n            v = v / (1.f + expf(-v));\n            *ptr = (__fp16)v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint Swish_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        float16x8_t _one = vdupq_n_f16(1.f);\n\n        int i = 0;\n        for (; i + 31 < size; i += 32)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            float16x8_t _p2 = vld1q_f16(ptr + 16);\n            float16x8_t _p3 = vld1q_f16(ptr + 24);\n            _p0 = vdivq_f16(_p0, vaddq_f16(_one, exp_ps_f16(vnegq_f16(_p0))));\n            _p1 = vdivq_f16(_p1, vaddq_f16(_one, exp_ps_f16(vnegq_f16(_p1))));\n            _p2 = vdivq_f16(_p2, vaddq_f16(_one, exp_ps_f16(vnegq_f16(_p2))));\n            _p3 = vdivq_f16(_p3, vaddq_f16(_one, exp_ps_f16(vnegq_f16(_p3))));\n            
vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            vst1q_f16(ptr + 16, _p2);\n            vst1q_f16(ptr + 24, _p3);\n            ptr += 32;\n        }\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            _p0 = vdivq_f16(_p0, vaddq_f16(_one, exp_ps_f16(vnegq_f16(_p0))));\n            _p1 = vdivq_f16(_p1, vaddq_f16(_one, exp_ps_f16(vnegq_f16(_p1))));\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            ptr += 16;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _p = vdivq_f16(_p, vaddq_f16(_one, exp_ps_f16(vnegq_f16(_p))));\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = vdiv_f16(_p, vadd_f16(vget_low_f16(_one), exp_ps_f16(vneg_f16(_p))));\n            vst1_f16(ptr, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = *ptr;\n            v = v / (__fp16)(1.f + expf(-v));\n            *ptr = v;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/tanh_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"tanh_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nTanH_arm::TanH_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\nint TanH_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n    {\n        if (opt.use_fp16_arithmetic)\n            return forward_inplace_fp16sa(bottom_top_blob, opt);\n        else\n            return forward_inplace_fp16s(bottom_top_blob, opt);\n    }\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vld1q_f32(ptr);\n                _p = tanh_ps(_p);\n                vst1q_f32(ptr, _p);\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n#if __ARM_NEON\n        int nn = size >> 2;\n     
   int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = tanh_ps(_p);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; remain > 0; remain--)\n        {\n            *ptr = tanhf(*ptr);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_BF16\nint TanH_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n#if __ARM_NEON\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned short* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = bfloat2float(vld1_u16(ptr));\n                _p = tanh_ps(_p);\n                vst1_u16(ptr, float2bfloat(_p));\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n#endif // __ARM_NEON\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = bottom_top_blob.channel(q);\n\n#if __ARM_NEON\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __ARM_NEON\n\n#if __ARM_NEON\n        for (; nn > 0; nn--)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = tanh_ps(_p);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; remain > 0; remain--)\n        {\n            float v = bfloat16_to_float32(*ptr);\n            v = tanhf(v);\n 
           *ptr = float32_to_bfloat16(v);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/tanh_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_TANH_ARM_H\n#define LAYER_TANH_ARM_H\n\n#include \"tanh.h\"\n\nnamespace ncnn {\n\nclass TanH_arm : public TanH\n{\npublic:\n    TanH_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n    int forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_TANH_ARM_H\n"
  },
  {
    "path": "src/layer/arm/tanh_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"tanh_arm.h\"\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#include \"neon_mathfun.h\"\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"neon_mathfun_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\nint TanH_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n                _p = tanh_ps(_p);\n                vst1_f16(ptr, vcvt_f16_f32(_p));\n\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vcvt_f32_f16(vld1_f16(ptr));\n            _p = tanh_ps(_p);\n            vst1_f16(ptr, vcvt_f16_f32(_p));\n\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            float v = (float)*ptr;\n            v = tanhf(v);\n            *ptr = (__fp16)v;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint TanH_arm::forward_inplace_fp16sa(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n   
 int size = w * h * d;\n    int elempack = bottom_top_blob.elempack;\n\n    if (elempack == 8)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float16x8_t _p = vld1q_f16(ptr);\n                _p = tanh_ps_f16(_p);\n                vst1q_f16(ptr, _p);\n\n                ptr += 8;\n            }\n        }\n\n        return 0;\n    }\n\n    if (elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            __fp16* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                float16x4_t _p = vld1_f16(ptr);\n                _p = tanh_ps_f16(_p);\n                vst1_f16(ptr, _p);\n\n                ptr += 4;\n            }\n        }\n\n        return 0;\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = tanh_ps_f16(_p);\n            vst1_f16(ptr, _p);\n\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            __fp16 v = *ptr;\n            v = tanhf(v);\n            *ptr = v;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/unaryop_arm.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"unaryop_arm.h\"\n\n// #include <fenv.h>\n#include <float.h>\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"neon_mathfun.h\"\n#endif // __ARM_NEON\n\n#include \"arm_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\nUnaryOp_arm::UnaryOp_arm()\n{\n#if __ARM_NEON\n    support_packing = true;\n#if NCNN_ARM82\n    support_fp16_storage = cpu_support_arm_asimdhp();\n#endif\n#endif // __ARM_NEON\n\n#if NCNN_BF16\n    support_bf16_storage = true;\n#endif\n}\n\ntemplate<typename Op>\nstatic int unary_op_inplace(Mat& a, const Option& opt)\n{\n    Op op;\n\n    int w = a.w;\n    int h = a.h;\n    int d = a.d;\n    int channels = a.c;\n    int elempack = a.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = a.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n        for (; i + 7 < size; i += 8)\n        {\n            float32x4_t _p0 = vld1q_f32(ptr);\n            float32x4_t _p1 = vld1q_f32(ptr + 4);\n            _p0 = op.func_pack4(_p0);\n            _p1 = op.func_pack4(_p1);\n            vst1q_f32(ptr, _p0);\n            vst1q_f32(ptr + 4, _p1);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = vld1q_f32(ptr);\n            _p = op.func_pack4(_p);\n            vst1q_f32(ptr, _p);\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = op.func(*ptr);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nnamespace UnaryOp_arm_functor {\n\nstruct unary_op_abs\n{\n    float func(const float& x) const\n    {\n        return (float)fabsf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return vabsq_f32(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_neg\n{\n    
float func(const float& x) const\n    {\n        return -x;\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return vnegq_f32(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_floor\n{\n    float func(const float& x) const\n    {\n        return (float)floorf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n#if __aarch64__\n        return vrndmq_f32(x);\n#else  // __aarch64__\n        int32x4_t _xi = vcvtq_s32_f32(x);\n        uint32x4_t _mask = vcgtq_f32(vcvtq_f32_s32(_xi), x);\n        return vcvtq_f32_s32(vaddq_s32(_xi, vreinterpretq_s32_u32(_mask)));\n#endif // __aarch64__\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_ceil\n{\n    float func(const float& x) const\n    {\n        return (float)ceilf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n#if __aarch64__\n        return vrndpq_f32(x);\n#else  // __aarch64__\n        int32x4_t _xi = vcvtq_s32_f32(x);\n        uint32x4_t _mask = vcgtq_f32(x, vcvtq_f32_s32(_xi));\n        return vcvtq_f32_s32(vsubq_s32(_xi, vreinterpretq_s32_u32(_mask)));\n#endif // __aarch64__\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_square\n{\n    float func(const float& x) const\n    {\n        return x * x;\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return vmulq_f32(x, x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_sqrt\n{\n    float func(const float& x) const\n    {\n        return (float)sqrtf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n#if __aarch64__\n        return vsqrtq_f32(x);\n#else\n        float32x4_t _reciprocal = vrsqrteq_f32(x);\n        _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(x, _reciprocal), _reciprocal), _reciprocal);\n        // _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(x, _reciprocal), _reciprocal), _reciprocal);\n        return vmulq_f32(x, 
_reciprocal);\n#endif\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_rsqrt\n{\n    float func(const float& x) const\n    {\n        return (float)(1.f / sqrtf(x));\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        float32x4_t _reciprocal = vrsqrteq_f32(x);\n        _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(x, _reciprocal), _reciprocal), _reciprocal);\n        // _reciprocal = vmulq_f32(vrsqrtsq_f32(vmulq_f32(x, _reciprocal), _reciprocal), _reciprocal);\n        return _reciprocal;\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_exp\n{\n    float func(const float& x) const\n    {\n        return (float)expf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return exp_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_log\n{\n    float func(const float& x) const\n    {\n        return (float)logf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return log_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_sin\n{\n    float func(const float& x) const\n    {\n        return (float)sinf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return sin_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_cos\n{\n    float func(const float& x) const\n    {\n        return (float)cosf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return cos_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_tan\n{\n    float func(const float& x) const\n    {\n        return (float)tanf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return tan_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_asin\n{\n    float func(const float& x) const\n    {\n        return (float)asinf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) 
const\n    {\n        return asin_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_acos\n{\n    float func(const float& x) const\n    {\n        return (float)acosf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return acos_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_atan\n{\n    float func(const float& x) const\n    {\n        return (float)atanf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        // TODO neon optimize\n        float tmp[4];\n        vst1q_f32(tmp, x);\n        tmp[0] = atanf(tmp[0]);\n        tmp[1] = atanf(tmp[1]);\n        tmp[2] = atanf(tmp[2]);\n        tmp[3] = atanf(tmp[3]);\n        return vld1q_f32(tmp);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_reciprocal\n{\n    float func(const float& x) const\n    {\n        return 1.f / x;\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        float32x4_t _reciprocal = vrecpeq_f32(x);\n        _reciprocal = vmulq_f32(vrecpsq_f32(x, _reciprocal), _reciprocal);\n        // _reciprocal = vmulq_f32(vrecpsq_f32(x, _reciprocal), _reciprocal);\n        return _reciprocal;\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_tanh\n{\n    float func(const float& x) const\n    {\n        return (float)tanhf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return tanh_ps(x);\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_log10\n{\n    float func(const float& x) const\n    {\n        return (float)log10f(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return vmulq_f32(log_ps(x), vdupq_n_f32(0.434294481903));\n    }\n#endif // __ARM_NEON\n};\n\nstruct unary_op_round\n{\n    float func(const float& x) const\n    {\n        // round to nearest even\n#if NCNN_GNU_INLINE_ASM && __ARM_NEON\n        // return (x + 12582912.f) - 12582912.f;\n     
   float y;\n        const float magic = 12582912.f;\n#if __aarch64__\n        asm volatile(\n            \"fadd   %s0, %s1, %s2   \\n\"\n            \"fsub   %s0, %s0, %s2   \\n\"\n            : \"=w\"(y)\n            : \"w\"(x), \"w\"(magic)\n            :);\n#else\n        asm volatile(\n            \"vadd.f32   %0, %1, %2  \\n\"\n            \"vsub.f32   %0, %0, %2  \\n\"\n            : \"=t\"(y)\n            : \"t\"(x), \"t\"(magic)\n            :);\n#endif\n        return y;\n#else\n#ifdef FE_TONEAREST\n        int old_rm = fegetround();\n        fesetround(FE_TONEAREST);\n#endif\n        float y = nearbyintf(x);\n#ifdef FE_TONEAREST\n        fesetround(old_rm);\n#endif\n        return y;\n#endif\n    }\n#if __ARM_NEON\n#if __aarch64__\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n        return vrndnq_f32(x);\n    }\n#else\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n#if NCNN_GNU_INLINE_ASM\n        float32x4_t y;\n        float32x4_t _magic = vdupq_n_f32(12582912.f); // 1.5 * 2^23\n        asm volatile(\n            \"vadd.f32   %q0, %q1, %q2   \\n\"\n            \"vsub.f32   %q0, %q0, %q2   \\n\"\n            : \"=w\"(y)\n            : \"w\"(x), \"w\"(_magic)\n            :);\n        return y;\n#else\n        float tmp[4];\n        vst1q_f32(tmp, x);\n#ifdef FE_TONEAREST\n        int old_rm = fegetround();\n        fesetround(FE_TONEAREST);\n#endif\n        tmp[0] = nearbyintf(tmp[0]);\n        tmp[1] = nearbyintf(tmp[1]);\n        tmp[2] = nearbyintf(tmp[2]);\n        tmp[3] = nearbyintf(tmp[3]);\n#ifdef FE_TONEAREST\n        fesetround(old_rm);\n#endif\n        float32x4_t y = vld1q_f32(tmp);\n        return y;\n#endif\n    }\n#endif\n#endif // __ARM_NEON\n};\n\nstruct unary_op_trunc\n{\n    float func(const float& x) const\n    {\n        return (float)truncf(x);\n    }\n#if __ARM_NEON\n    float32x4_t func_pack4(const float32x4_t& x) const\n    {\n#if __aarch64__\n        return vrndq_f32(x);\n#else\n        return 
vcvtq_f32_s32(vcvtq_s32_f32(x));\n#endif\n    }\n#endif // __ARM_NEON\n};\n\n} // namespace UnaryOp_arm_functor\n\nint UnaryOp_arm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int elembits = bottom_top_blob.elembits();\n\n#if NCNN_ARM82\n    if (support_fp16_storage && opt.use_fp16_storage && elembits == 16)\n        return forward_inplace_fp16s(bottom_top_blob, opt);\n#endif\n\n#if NCNN_BF16\n    if (opt.use_bf16_storage && elembits == 16)\n        return forward_inplace_bf16s(bottom_top_blob, opt);\n#endif\n\n    using namespace UnaryOp_arm_functor;\n\n    if (op_type == Operation_ABS)\n        return unary_op_inplace<unary_op_abs>(bottom_top_blob, opt);\n\n    if (op_type == Operation_NEG)\n        return unary_op_inplace<unary_op_neg>(bottom_top_blob, opt);\n\n    if (op_type == Operation_FLOOR)\n        return unary_op_inplace<unary_op_floor>(bottom_top_blob, opt);\n\n    if (op_type == Operation_CEIL)\n        return unary_op_inplace<unary_op_ceil>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQUARE)\n        return unary_op_inplace<unary_op_square>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQRT)\n        return unary_op_inplace<unary_op_sqrt>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RSQRT)\n        return unary_op_inplace<unary_op_rsqrt>(bottom_top_blob, opt);\n\n    if (op_type == Operation_EXP)\n        return unary_op_inplace<unary_op_exp>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG)\n        return unary_op_inplace<unary_op_log>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SIN)\n        return unary_op_inplace<unary_op_sin>(bottom_top_blob, opt);\n\n    if (op_type == Operation_COS)\n        return unary_op_inplace<unary_op_cos>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TAN)\n        return unary_op_inplace<unary_op_tan>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ASIN)\n        return unary_op_inplace<unary_op_asin>(bottom_top_blob, opt);\n\n 
   if (op_type == Operation_ACOS)\n        return unary_op_inplace<unary_op_acos>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ATAN)\n        return unary_op_inplace<unary_op_atan>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RECIPROCAL)\n        return unary_op_inplace<unary_op_reciprocal>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TANH)\n        return unary_op_inplace<unary_op_tanh>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG10)\n        return unary_op_inplace<unary_op_log10>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ROUND)\n        return unary_op_inplace<unary_op_round>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TRUNC)\n        return unary_op_inplace<unary_op_trunc>(bottom_top_blob, opt);\n\n    return 0;\n}\n\n#if NCNN_BF16\ntemplate<typename Op>\nstatic int unary_op_inplace_bf16s(Mat& a, const Option& opt)\n{\n    Op op;\n\n    int w = a.w;\n    int h = a.h;\n    int d = a.d;\n    int channels = a.c;\n    int elempack = a.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        unsigned short* ptr = a.channel(q);\n\n        int i = 0;\n#if __ARM_NEON\n#if __aarch64__\n        for (; i + 15 < size; i += 16)\n        {\n            uint16x8_t _p01 = vld1q_u16(ptr);\n            uint16x8_t _p23 = vld1q_u16(ptr + 8);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p01));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p01));\n            float32x4_t _p2 = bfloat2float(vget_low_u16(_p23));\n            float32x4_t _p3 = bfloat2float(vget_high_u16(_p23));\n            _p0 = op.func_pack4(_p0);\n            _p1 = op.func_pack4(_p1);\n            _p2 = op.func_pack4(_p2);\n            _p3 = op.func_pack4(_p3);\n            _p01 = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            _p23 = vcombine_u16(float2bfloat(_p2), float2bfloat(_p3));\n            
vst1q_u16(ptr, _p01);\n            vst1q_u16(ptr + 8, _p23);\n            ptr += 16;\n        }\n#endif // __aarch64__\n        for (; i + 7 < size; i += 8)\n        {\n            uint16x8_t _p = vld1q_u16(ptr);\n            float32x4_t _p0 = bfloat2float(vget_low_u16(_p));\n            float32x4_t _p1 = bfloat2float(vget_high_u16(_p));\n            _p0 = op.func_pack4(_p0);\n            _p1 = op.func_pack4(_p1);\n            _p = vcombine_u16(float2bfloat(_p0), float2bfloat(_p1));\n            vst1q_u16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float32x4_t _p = bfloat2float(vld1_u16(ptr));\n            _p = op.func_pack4(_p);\n            vst1_u16(ptr, float2bfloat(_p));\n            ptr += 4;\n        }\n#endif // __ARM_NEON\n        for (; i < size; i++)\n        {\n            *ptr = float32_to_bfloat16(op.func(bfloat16_to_float32(*ptr)));\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nint UnaryOp_arm::forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    using namespace UnaryOp_arm_functor;\n\n    if (op_type == Operation_ABS)\n        return unary_op_inplace_bf16s<unary_op_abs>(bottom_top_blob, opt);\n\n    if (op_type == Operation_NEG)\n        return unary_op_inplace_bf16s<unary_op_neg>(bottom_top_blob, opt);\n\n    if (op_type == Operation_FLOOR)\n        return unary_op_inplace_bf16s<unary_op_floor>(bottom_top_blob, opt);\n\n    if (op_type == Operation_CEIL)\n        return unary_op_inplace_bf16s<unary_op_ceil>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQUARE)\n        return unary_op_inplace_bf16s<unary_op_square>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQRT)\n        return unary_op_inplace_bf16s<unary_op_sqrt>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RSQRT)\n        return unary_op_inplace_bf16s<unary_op_rsqrt>(bottom_top_blob, opt);\n\n    if (op_type == Operation_EXP)\n        return 
unary_op_inplace_bf16s<unary_op_exp>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG)\n        return unary_op_inplace_bf16s<unary_op_log>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SIN)\n        return unary_op_inplace_bf16s<unary_op_sin>(bottom_top_blob, opt);\n\n    if (op_type == Operation_COS)\n        return unary_op_inplace_bf16s<unary_op_cos>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TAN)\n        return unary_op_inplace_bf16s<unary_op_tan>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ASIN)\n        return unary_op_inplace_bf16s<unary_op_asin>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ACOS)\n        return unary_op_inplace_bf16s<unary_op_acos>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ATAN)\n        return unary_op_inplace_bf16s<unary_op_atan>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RECIPROCAL)\n        return unary_op_inplace_bf16s<unary_op_reciprocal>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TANH)\n        return unary_op_inplace_bf16s<unary_op_tanh>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG10)\n        return unary_op_inplace_bf16s<unary_op_log10>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ROUND)\n        return unary_op_inplace_bf16s<unary_op_round>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TRUNC)\n        return unary_op_inplace_bf16s<unary_op_trunc>(bottom_top_blob, opt);\n\n    return 0;\n}\n#endif // NCNN_BF16\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/arm/unaryop_arm.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_UNARYOP_ARM_H\n#define LAYER_UNARYOP_ARM_H\n\n#include \"unaryop.h\"\n\nnamespace ncnn {\n\nclass UnaryOp_arm : public UnaryOp\n{\npublic:\n    UnaryOp_arm();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_ARM82\n    int forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n#if NCNN_BF16\n    int forward_inplace_bf16s(Mat& bottom_top_blob, const Option& opt) const;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_UNARYOP_ARM_H\n"
  },
  {
    "path": "src/layer/arm/unaryop_arm_asimdhp.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"unaryop_arm.h\"\n\n// #include <fenv.h>\n#include <float.h>\n\n#if __ARM_NEON\n#include <arm_neon.h>\n#include \"arm_usability.h\"\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n#include \"neon_mathfun_fp16s.h\"\n#endif\n#endif // __ARM_NEON\n\nnamespace ncnn {\n\n#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\ntemplate<typename Op>\nstatic int unary_op_inplace_fp16s(Mat& a, const Option& opt)\n{\n    Op op;\n\n    int w = a.w;\n    int h = a.h;\n    int d = a.d;\n    int channels = a.c;\n    int elempack = a.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        __fp16* ptr = a.channel(q);\n\n        int i = 0;\n        for (; i + 15 < size; i += 16)\n        {\n            float16x8_t _p0 = vld1q_f16(ptr);\n            float16x8_t _p1 = vld1q_f16(ptr + 8);\n            _p0 = op.func_pack8(_p0);\n            _p1 = op.func_pack8(_p1);\n            vst1q_f16(ptr, _p0);\n            vst1q_f16(ptr + 8, _p1);\n            ptr += 16;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            float16x8_t _p = vld1q_f16(ptr);\n            _p = op.func_pack8(_p);\n            vst1q_f16(ptr, _p);\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            float16x4_t _p = vld1_f16(ptr);\n            _p = op.func_pack4(_p);\n            vst1_f16(ptr, _p);\n            ptr += 4;\n        }\n        for (; i < size; i++)\n        {\n            *ptr = op.func(*ptr);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nnamespace UnaryOp_arm_functor {\n\nstruct unary_op_abs_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)fabsf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vabs_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    
{\n        return vabsq_f16(x);\n    }\n};\n\nstruct unary_op_neg_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return -x;\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vneg_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vnegq_f16(x);\n    }\n};\n\nstruct unary_op_floor_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)floorf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vcvt_f16_s16(vcvtm_s16_f16(x));\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vcvtq_f16_s16(vcvtmq_s16_f16(x));\n    }\n};\n\nstruct unary_op_ceil_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)ceilf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vcvt_f16_s16(vcvtp_s16_f16(x));\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vcvtq_f16_s16(vcvtpq_s16_f16(x));\n    }\n};\n\nstruct unary_op_square_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return x * x;\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vmul_f16(x, x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vmulq_f16(x, x);\n    }\n};\n\nstruct unary_op_sqrt_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)sqrtf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vsqrt_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vsqrtq_f16(x);\n    }\n};\n\nstruct unary_op_rsqrt_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)1.f / (__fp16)sqrtf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        float16x4_t _reciprocal = vrsqrte_f16(x);\n        _reciprocal = 
vmul_f16(vrsqrts_f16(vmul_f16(x, _reciprocal), _reciprocal), _reciprocal);\n        // _reciprocal = vmul_f16(vrsqrts_f16(vmul_f16(x, _reciprocal), _reciprocal), _reciprocal);\n        return _reciprocal;\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        float16x8_t _reciprocal = vrsqrteq_f16(x);\n        _reciprocal = vmulq_f16(vrsqrtsq_f16(vmulq_f16(x, _reciprocal), _reciprocal), _reciprocal);\n        // _reciprocal = vmulq_f16(vrsqrtsq_f16(vmulq_f16(x, _reciprocal), _reciprocal), _reciprocal);\n        return _reciprocal;\n    }\n};\n\nstruct unary_op_exp_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)expf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return exp_ps_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return exp_ps_f16(x);\n    }\n};\n\nstruct unary_op_log_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)logf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return log_ps_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return log_ps_f16(x);\n    }\n};\n\nstruct unary_op_sin_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)sinf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return sin_ps_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return sin_ps_f16(x);\n    }\n};\n\nstruct unary_op_cos_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)cosf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return cos_ps_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return cos_ps_f16(x);\n    }\n};\n\nstruct unary_op_tan_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)tanf(x);\n    }\n    
float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[4];\n        vst1_f16(tmp, x);\n        tmp[0] = tanf(tmp[0]);\n        tmp[1] = tanf(tmp[1]);\n        tmp[2] = tanf(tmp[2]);\n        tmp[3] = tanf(tmp[3]);\n        return vld1_f16(tmp);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[8];\n        vst1q_f16(tmp, x);\n        tmp[0] = tanf(tmp[0]);\n        tmp[1] = tanf(tmp[1]);\n        tmp[2] = tanf(tmp[2]);\n        tmp[3] = tanf(tmp[3]);\n        tmp[4] = tanf(tmp[4]);\n        tmp[5] = tanf(tmp[5]);\n        tmp[6] = tanf(tmp[6]);\n        tmp[7] = tanf(tmp[7]);\n        return vld1q_f16(tmp);\n    }\n};\n\nstruct unary_op_asin_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)asinf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[4];\n        vst1_f16(tmp, x);\n        tmp[0] = asinf(tmp[0]);\n        tmp[1] = asinf(tmp[1]);\n        tmp[2] = asinf(tmp[2]);\n        tmp[3] = asinf(tmp[3]);\n        return vld1_f16(tmp);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[8];\n        vst1q_f16(tmp, x);\n        tmp[0] = asinf(tmp[0]);\n        tmp[1] = asinf(tmp[1]);\n        tmp[2] = asinf(tmp[2]);\n        tmp[3] = asinf(tmp[3]);\n        tmp[4] = asinf(tmp[4]);\n        tmp[5] = asinf(tmp[5]);\n        tmp[6] = asinf(tmp[6]);\n        tmp[7] = asinf(tmp[7]);\n        return vld1q_f16(tmp);\n    }\n};\n\nstruct unary_op_acos_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)acosf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[4];\n        vst1_f16(tmp, x);\n        tmp[0] = acosf(tmp[0]);\n        tmp[1] = acosf(tmp[1]);\n        tmp[2] = 
acosf(tmp[2]);\n        tmp[3] = acosf(tmp[3]);\n        return vld1_f16(tmp);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[8];\n        vst1q_f16(tmp, x);\n        tmp[0] = acosf(tmp[0]);\n        tmp[1] = acosf(tmp[1]);\n        tmp[2] = acosf(tmp[2]);\n        tmp[3] = acosf(tmp[3]);\n        tmp[4] = acosf(tmp[4]);\n        tmp[5] = acosf(tmp[5]);\n        tmp[6] = acosf(tmp[6]);\n        tmp[7] = acosf(tmp[7]);\n        return vld1q_f16(tmp);\n    }\n};\n\nstruct unary_op_atan_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)atanf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[4];\n        vst1_f16(tmp, x);\n        tmp[0] = atanf(tmp[0]);\n        tmp[1] = atanf(tmp[1]);\n        tmp[2] = atanf(tmp[2]);\n        tmp[3] = atanf(tmp[3]);\n        return vld1_f16(tmp);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        // TODO neon optimize\n        __fp16 tmp[8];\n        vst1q_f16(tmp, x);\n        tmp[0] = atanf(tmp[0]);\n        tmp[1] = atanf(tmp[1]);\n        tmp[2] = atanf(tmp[2]);\n        tmp[3] = atanf(tmp[3]);\n        tmp[4] = atanf(tmp[4]);\n        tmp[5] = atanf(tmp[5]);\n        tmp[6] = atanf(tmp[6]);\n        tmp[7] = atanf(tmp[7]);\n        return vld1q_f16(tmp);\n    }\n};\n\nstruct unary_op_reciprocal_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)1.f / x;\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        float16x4_t _reciprocal = vrecpe_f16(x);\n        _reciprocal = vmul_f16(vrecps_f16(x, _reciprocal), _reciprocal);\n        // _reciprocal = vmul_f16(vrecps_f16(x, _reciprocal), _reciprocal);\n        return _reciprocal;\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        float16x8_t _reciprocal = vrecpeq_f16(x);\n        _reciprocal = 
vmulq_f16(vrecpsq_f16(x, _reciprocal), _reciprocal);\n        // _reciprocal = vmulq_f16(vrecpsq_f16(x, _reciprocal), _reciprocal);\n        return _reciprocal;\n    }\n};\n\nstruct unary_op_tanh_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)tanhf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return tanh_ps_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return tanh_ps_f16(x);\n    }\n};\n\nstruct unary_op_log10_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)log10f(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vmul_f16(log_ps_f16(x), vdup_n_f16(0.434294481903));\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vmulq_f16(log_ps_f16(x), vdupq_n_f16(0.434294481903));\n    }\n};\n\nstruct unary_op_round_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        // round to nearest even\n#if NCNN_GNU_INLINE_ASM\n        // return (x + 1536.f) - 1536.f;\n        __fp16 y;\n        const __fp16 magic = 1536.f;\n        asm volatile(\n            \"fadd   %h0, %h1, %h2  \\n\"\n            \"fsub   %h0, %h0, %h2  \\n\"\n            : \"=w\"(y)\n            : \"w\"(x), \"w\"(magic)\n            :);\n        return y;\n#else\n#ifdef FE_TONEAREST\n        int old_rm = fegetround();\n        fesetround(FE_TONEAREST);\n#endif\n        __fp16 y = (__fp16)nearbyintf(x);\n#ifdef FE_TONEAREST\n        fesetround(old_rm);\n#endif\n        return y;\n#endif\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        return vrndn_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vrndnq_f16(x);\n    }\n};\n\nstruct unary_op_trunc_fp16s\n{\n    __fp16 func(const __fp16& x) const\n    {\n        return (__fp16)truncf(x);\n    }\n    float16x4_t func_pack4(const float16x4_t& x) const\n    {\n        
return vrnd_f16(x);\n    }\n    float16x8_t func_pack8(const float16x8_t& x) const\n    {\n        return vrndq_f16(x);\n    }\n};\n\n} // namespace UnaryOp_arm_functor\n\nint UnaryOp_arm::forward_inplace_fp16s(Mat& bottom_top_blob, const Option& opt) const\n{\n    using namespace UnaryOp_arm_functor;\n\n    if (op_type == Operation_ABS)\n        return unary_op_inplace_fp16s<unary_op_abs_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_NEG)\n        return unary_op_inplace_fp16s<unary_op_neg_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_FLOOR)\n        return unary_op_inplace_fp16s<unary_op_floor_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_CEIL)\n        return unary_op_inplace_fp16s<unary_op_ceil_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQUARE)\n        return unary_op_inplace_fp16s<unary_op_square_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQRT)\n        return unary_op_inplace_fp16s<unary_op_sqrt_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RSQRT)\n        return unary_op_inplace_fp16s<unary_op_rsqrt_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_EXP)\n        return unary_op_inplace_fp16s<unary_op_exp_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG)\n        return unary_op_inplace_fp16s<unary_op_log_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SIN)\n        return unary_op_inplace_fp16s<unary_op_sin_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_COS)\n        return unary_op_inplace_fp16s<unary_op_cos_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TAN)\n        return unary_op_inplace_fp16s<unary_op_tan_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ASIN)\n        return unary_op_inplace_fp16s<unary_op_asin_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ACOS)\n        return unary_op_inplace_fp16s<unary_op_acos_fp16s>(bottom_top_blob, opt);\n\n    
if (op_type == Operation_ATAN)\n        return unary_op_inplace_fp16s<unary_op_atan_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RECIPROCAL)\n        return unary_op_inplace_fp16s<unary_op_reciprocal_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TANH)\n        return unary_op_inplace_fp16s<unary_op_tanh_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG10)\n        return unary_op_inplace_fp16s<unary_op_log10_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ROUND)\n        return unary_op_inplace_fp16s<unary_op_round_fp16s>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TRUNC)\n        return unary_op_inplace_fp16s<unary_op_trunc_fp16s>(bottom_top_blob, opt);\n\n    return 0;\n}\n#endif // __ARM_FEATURE_FP16_VECTOR_ARITHMETIC\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/batchnorm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"batchnorm.h\"\n\nnamespace ncnn {\n\nBatchNorm::BatchNorm()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint BatchNorm::load_param(const ParamDict& pd)\n{\n    channels = pd.get(0, 0);\n    eps = pd.get(1, 0.f);\n\n    return 0;\n}\n\nint BatchNorm::load_model(const ModelBin& mb)\n{\n    slope_data = mb.load(channels, 1);\n    if (slope_data.empty())\n        return -100;\n\n    mean_data = mb.load(channels, 1);\n    if (mean_data.empty())\n        return -100;\n\n    var_data = mb.load(channels, 1);\n    if (var_data.empty())\n        return -100;\n\n    bias_data = mb.load(channels, 1);\n    if (bias_data.empty())\n        return -100;\n\n    a_data.create(channels);\n    if (a_data.empty())\n        return -100;\n    b_data.create(channels);\n    if (b_data.empty())\n        return -100;\n\n    for (int i = 0; i < channels; i++)\n    {\n        float sqrt_var = sqrtf(var_data[i] + eps);\n        if (sqrt_var == 0.f)\n            sqrt_var = 0.0001f; // sanitize divide by zero\n        a_data[i] = bias_data[i] - slope_data[i] * mean_data[i] / sqrt_var;\n        b_data[i] = slope_data[i] / sqrt_var;\n    }\n\n    return 0;\n}\n\nint BatchNorm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    // a = bias - slope * mean / sqrt(var + eps)\n    // b = slope / sqrt(var + eps)\n    // value = b * value + a\n\n    int dims = bottom_top_blob.dims;\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n\n        float* ptr = bottom_top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < w; i++)\n        {\n            ptr[i] = b_data[i] * ptr[i] + a_data[i];\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n    
        float* ptr = bottom_top_blob.row(i);\n            float a = a_data[i];\n            float b = b_data[i];\n\n            for (int j = 0; j < w; j++)\n            {\n                ptr[j] = b * ptr[j] + a;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int d = bottom_top_blob.d;\n        int c = bottom_top_blob.c;\n        int size = w * h * d;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n            float a = a_data[q];\n            float b = b_data[q];\n\n            for (int i = 0; i < size; i++)\n            {\n                ptr[i] = b * ptr[i] + a;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/batchnorm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BATCHNORM_H\n#define LAYER_BATCHNORM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass BatchNorm : public Layer\n{\npublic:\n    BatchNorm();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int channels;\n    float eps;\n\n    // model\n    Mat slope_data;\n    Mat mean_data;\n    Mat var_data;\n    Mat bias_data;\n\n    Mat a_data;\n    Mat b_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BATCHNORM_H\n"
  },
  {
    "path": "src/layer/bias.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"bias.h\"\n\nnamespace ncnn {\n\nBias::Bias()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint Bias::load_param(const ParamDict& pd)\n{\n    bias_data_size = pd.get(0, 0);\n\n    return 0;\n}\n\nint Bias::load_model(const ModelBin& mb)\n{\n    bias_data = mb.load(bias_data_size, 1);\n    if (bias_data.empty())\n        return -100;\n\n    return 0;\n}\n\nint Bias::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        float bias = bias_data[q];\n\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] += bias;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/bias.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BIAS_H\n#define LAYER_BIAS_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Bias : public Layer\n{\npublic:\n    Bias();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int bias_data_size;\n\n    // model\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BIAS_H\n"
  },
  {
    "path": "src/layer/binaryop.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"binaryop.h\"\n\nnamespace ncnn {\n\nBinaryOp::BinaryOp()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint BinaryOp::load_param(const ParamDict& pd)\n{\n    op_type = pd.get(0, 0);\n    with_scalar = pd.get(1, 0);\n    b = pd.get(2, 0.f);\n\n    if (with_scalar != 0)\n    {\n        one_blob_only = true;\n        support_inplace = true;\n    }\n\n    return 0;\n}\n\n// broadcasting rule\n// https://github.com/Tencent/ncnn/wiki/binaryop-broadcasting\n\ntemplate<typename Op>\nstatic void binary_op_broadcast(const Mat& a, const Mat& b, Mat& c, const Option& opt)\n{\n    // general broadcast\n    const Op op;\n\n    const int dims = c.dims;\n    const int w = c.w;\n    const int h = c.h;\n    const int d = c.d;\n    const int channels = c.c;\n\n    if (dims == 1)\n    {\n        const float* ptr = a;\n        const float* ptr1 = b;\n        float* outptr = c;\n\n        const int ainc = a.w > 1 ? 1 : 0;\n        const int binc = b.w > 1 ? 1 : 0;\n\n        for (int x = 0; x < w; x++)\n        {\n            outptr[x] = op(*ptr, *ptr1);\n            ptr += ainc;\n            ptr1 += binc;\n        }\n    }\n\n    if (dims == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int y = 0; y < h; y++)\n        {\n            const float* ptr = a.row(std::min(y, a.h - 1));\n            const float* ptr1 = b.row(std::min(y, b.h - 1));\n            float* outptr = c.row(y);\n\n            const int ainc = a.w > 1 ? 1 : 0;\n            const int binc = b.w > 1 ? 
1 : 0;\n\n            for (int x = 0; x < w; x++)\n            {\n                outptr[x] = op(*ptr, *ptr1);\n                ptr += ainc;\n                ptr1 += binc;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = c.channel(q);\n\n            const int ainc = a.w > 1 ? 1 : 0;\n            const int binc = b.w > 1 ? 1 : 0;\n\n            for (int z = 0; z < d; z++)\n            {\n                for (int y = 0; y < h; y++)\n                {\n                    const float* ptr = a.channel(std::min(q, a.c - 1)).depth(std::min(z, a.d - 1)).row(std::min(y, a.h - 1));\n                    const float* ptr1 = b.channel(std::min(q, b.c - 1)).depth(std::min(z, b.d - 1)).row(std::min(y, b.h - 1));\n\n                    for (int x = 0; x < w; x++)\n                    {\n                        outptr[x] = op(*ptr, *ptr1);\n                        ptr += ainc;\n                        ptr1 += binc;\n                    }\n\n                    outptr += w;\n                }\n            }\n        }\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_scalar_inplace(Mat& a, float b, const Option& opt)\n{\n    const Op op;\n\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = a.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] = op(ptr[i], b);\n        }\n    }\n}\n\nstruct binary_op_add\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return x + y;\n    }\n};\n\nstruct binary_op_sub\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return x - y;\n    }\n};\n\nstruct binary_op_mul\n{\n    float operator()(const float& x, const float& y) const\n    {\n  
      return x * y;\n    }\n};\n\nstruct binary_op_div\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return x / y;\n    }\n};\n\nstruct binary_op_max\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return std::max(x, y);\n    }\n};\n\nstruct binary_op_min\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return std::min(x, y);\n    }\n};\n\nstruct binary_op_pow\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)powf(x, y);\n    }\n};\n\nstruct binary_op_rsub\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return y - x;\n    }\n};\n\nstruct binary_op_rdiv\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return y / x;\n    }\n};\n\nstruct binary_op_rpow\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)powf(y, x);\n    }\n};\n\nstruct binary_op_atan2\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)atan2f(x, y);\n    }\n};\n\nstruct binary_op_ratan2\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)atan2f(y, x);\n    }\n};\n\nstruct binary_op_fmod\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)fmodf(x, y);\n    }\n};\n\nstruct binary_op_logaddexp\n{\n    float operator()(const float& x, const float& y) const\n    {\n        float max_xy = std::max(x, y);\n        float min_xy = std::min(x, y);\n        return (float)(max_xy + log1pf(expf(min_xy - max_xy)));\n    }\n};\n\nstruct binary_op_floor_divide\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)floorf(x / y);\n    }\n};\n\nstruct binary_op_remainder\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)remainderf(x, y);\n    }\n};\n\nstruct binary_op_rfmod\n{\n    
float operator()(const float& x, const float& y) const\n    {\n        return (float)fmodf(y, x);\n    }\n};\n\nstruct binary_op_rfloor_divide\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)floorf(y / x);\n    }\n};\n\nstruct binary_op_rremainder\n{\n    float operator()(const float& x, const float& y) const\n    {\n        return (float)remainderf(y, x);\n    }\n};\n\nstatic void binary_op_broadcast(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    if (op_type == BinaryOp::Operation_ADD) return binary_op_broadcast<binary_op_add>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_broadcast<binary_op_sub>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_MUL) return binary_op_broadcast<binary_op_mul>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_broadcast<binary_op_div>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_broadcast<binary_op_max>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_MIN) return binary_op_broadcast<binary_op_min>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_broadcast<binary_op_pow>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_broadcast<binary_op_sub>(b, a, c, opt);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_broadcast<binary_op_div>(b, a, c, opt);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_broadcast<binary_op_pow>(b, a, c, opt);\n    if (op_type == BinaryOp::Operation_ATAN2) return binary_op_broadcast<binary_op_atan2>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_broadcast<binary_op_atan2>(b, a, c, opt);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_broadcast<binary_op_fmod>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_broadcast<binary_op_fmod>(b, a, c, opt);\n    if (op_type == 
BinaryOp::Operation_LOGADDEXP) return binary_op_broadcast<binary_op_logaddexp>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return binary_op_broadcast<binary_op_floor_divide>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return binary_op_broadcast<binary_op_floor_divide>(b, a, c, opt);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_broadcast<binary_op_remainder>(a, b, c, opt);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_broadcast<binary_op_remainder>(b, a, c, opt);\n\n    // should never reach here\n}\n\nstatic void binary_op_scalar_inplace(Mat& bottom_top_blob, float b, int op_type, const Option& opt)\n{\n    if (op_type == BinaryOp::Operation_ADD) return binary_op_scalar_inplace<binary_op_add>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_scalar_inplace<binary_op_sub>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_MUL) return binary_op_scalar_inplace<binary_op_mul>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_scalar_inplace<binary_op_div>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_scalar_inplace<binary_op_max>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_MIN) return binary_op_scalar_inplace<binary_op_min>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_scalar_inplace<binary_op_pow>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_scalar_inplace<binary_op_rsub>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_scalar_inplace<binary_op_rdiv>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_scalar_inplace<binary_op_rpow>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_ATAN2) return 
binary_op_scalar_inplace<binary_op_atan2>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_scalar_inplace<binary_op_ratan2>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_scalar_inplace<binary_op_fmod>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_scalar_inplace<binary_op_rfmod>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_LOGADDEXP) return binary_op_scalar_inplace<binary_op_logaddexp>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return binary_op_scalar_inplace<binary_op_floor_divide>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return binary_op_scalar_inplace<binary_op_rfloor_divide>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_scalar_inplace<binary_op_remainder>(bottom_top_blob, b, opt);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_scalar_inplace<binary_op_rremainder>(bottom_top_blob, b, opt);\n\n    // should never reach here\n}\n\nint BinaryOp::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    const int outdims = std::max(A.dims, B.dims);\n\n    Mat A2 = A;\n    Mat B2 = B;\n    if (A.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (A.w == B.h)\n                A2 = A.reshape(1, A.w);\n            else // if (A.w == B.w)\n                A2 = A.reshape(A.w, 1);\n        }\n        if (outdims == 3 && A.dims == 1)\n        {\n            if (A.w == B.c)\n                A2 = A.reshape(1, 1, A.w);\n            else // if (A.w == B.w)\n                A2 = A.reshape(A.w, 1, 1);\n        }\n        if (outdims == 3 && A.dims == 2)\n            A2 = A.reshape(1, A.w, A.h);\n        if 
(outdims == 4 && A.dims == 1)\n        {\n            if (A.w == B.c)\n                A2 = A.reshape(1, 1, 1, A.w);\n            else // if (A.w == B.w)\n                A2 = A.reshape(A.w, 1, 1, 1);\n        }\n        if (outdims == 4 && A.dims == 2)\n            A2 = A.reshape(1, 1, A.w, A.h);\n        if (outdims == 4 && A.dims == 3)\n            A2 = A.reshape(1, A.w, A.h, A.c);\n    }\n    if (B.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (B.w == A.h)\n                B2 = B.reshape(1, B.w);\n            else // if (B.w == A.w)\n                B2 = B.reshape(B.w, 1);\n        }\n        if (outdims == 3 && B.dims == 1)\n        {\n            if (B.w == A.c)\n                B2 = B.reshape(1, 1, B.w);\n            else // if (B.w == A.w)\n                B2 = B.reshape(B.w, 1, 1);\n        }\n        if (outdims == 3 && B.dims == 2)\n            B2 = B.reshape(1, B.w, B.h);\n        if (outdims == 4 && B.dims == 1)\n        {\n            if (B.w == A.c)\n                B2 = B.reshape(1, 1, 1, B.w);\n            else // if (B.w == A.w)\n                B2 = B.reshape(B.w, 1, 1, 1);\n        }\n        if (outdims == 4 && B.dims == 2)\n            B2 = B.reshape(1, 1, B.w, B.h);\n        if (outdims == 4 && B.dims == 3)\n            B2 = B.reshape(1, B.w, B.h, B.c);\n    }\n\n    const int outw = std::max(A2.w, B2.w);\n    const int outh = std::max(A2.h, B2.h);\n    const int outd = std::max(A2.d, B2.d);\n    const int outc = std::max(A2.c, B2.c);\n\n    Mat& top_blob = top_blobs[0];\n    if (outdims == 1)\n    {\n        top_blob.create(outw, 4u, opt.blob_allocator);\n    }\n    if (outdims == 2)\n    {\n        top_blob.create(outw, outh, 4u, opt.blob_allocator);\n    }\n    if (outdims == 3)\n    {\n        top_blob.create(outw, outh, outc, 4u, opt.blob_allocator);\n    }\n    if (outdims == 4)\n    {\n        top_blob.create(outw, outh, outd, outc, 4u, opt.blob_allocator);\n    }\n    if 
(top_blob.empty())\n        return -100;\n\n    binary_op_broadcast(A2, B2, top_blob, op_type, opt);\n\n    return 0;\n}\n\nint BinaryOp::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    binary_op_scalar_inplace(bottom_top_blob, b, op_type, opt);\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/binaryop.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BINARYOP_H\n#define LAYER_BINARYOP_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass BinaryOp : public Layer\n{\npublic:\n    BinaryOp();\n\n    virtual int load_param(const ParamDict& pd);\n\n    using Layer::forward;\n    using Layer::forward_inplace;\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\n    enum OperationType\n    {\n        Operation_ADD = 0,\n        Operation_SUB = 1,\n        Operation_MUL = 2,\n        Operation_DIV = 3,\n        Operation_MAX = 4,\n        Operation_MIN = 5,\n        Operation_POW = 6,\n        Operation_RSUB = 7,\n        Operation_RDIV = 8,\n        Operation_RPOW = 9,\n        Operation_ATAN2 = 10,\n        Operation_RATAN2 = 11,\n        Operation_FMOD = 12,\n        Operation_RFMOD = 13,\n        Operation_LOGADDEXP = 14,\n        Operation_FLOOR_DIVIDE = 15,\n        Operation_RFLOOR_DIVIDE = 16,\n        Operation_REMAINDER = 17,\n        Operation_RREMAINDER = 18\n    };\n\npublic:\n    // param\n    int op_type;\n    int with_scalar;\n    float b;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BINARYOP_H\n"
  },
  {
    "path": "src/layer/bnll.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"bnll.h\"\n\nnamespace ncnn {\n\nBNLL::BNLL()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint BNLL::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    // include depth so 4-dim blobs are fully covered (d is 1 for lower dims)\n    int size = w * h * d;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            // log(1 + exp(x)), split by sign so expf never overflows\n            if (ptr[i] > 0)\n                ptr[i] = ptr[i] + logf(1.f + expf(-ptr[i]));\n            else\n                ptr[i] = logf(1.f + expf(ptr[i]));\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/bnll.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BNLL_H\n#define LAYER_BNLL_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass BNLL : public Layer\n{\npublic:\n    BNLL();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BNLL_H\n"
  },
  {
    "path": "src/layer/cast.cpp",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cast.h\"\n\nnamespace ncnn {\n\nCast::Cast()\n{\n    one_blob_only = true;\n    support_inplace = false;\n    support_packing = true;\n}\n\nint Cast::load_param(const ParamDict& pd)\n{\n    type_from = pd.get(0, 0);\n    type_to = pd.get(1, 0);\n\n    return 0;\n}\n\n// round to nearest\nsigned char float32_to_int8(float value)\n{\n    float tmp;\n    if (value >= 0.f)\n        tmp = value + 0.5f;\n    else\n        tmp = value - 0.5f;\n\n    if (tmp > 127)\n        return 127;\n    if (tmp < -128)\n        return -128;\n\n    return static_cast<signed char>(tmp);\n}\n\nint Cast::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (type_from == type_to)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    size_t out_elemsize = elemsize;\n    if (type_to == 1)\n    {\n        // float32\n        out_elemsize = 4 * elempack;\n    }\n    else if (type_to == 2)\n    {\n        // float16\n        out_elemsize = 2 * elempack;\n    }\n    else if (type_to == 3)\n    {\n        // int8\n        out_elemsize = elempack;\n    }\n    else if (type_to == 4)\n    {\n        // bfloat16\n        out_elemsize = 2 * elempack;\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 4)\n    {\n        top_blob.create(w, h, d, channels, out_elemsize, elempack, opt.blob_allocator);\n  
  }\n    if (top_blob.empty())\n        return -100;\n\n    int size = w * h * d * elempack;\n\n    if (type_from == 1 && type_to == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            unsigned short* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = float32_to_float16(ptr[i]);\n            }\n        }\n    }\n\n    if (type_from == 2 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = float16_to_float32(ptr[i]);\n            }\n        }\n    }\n\n    if (type_from == 3 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const signed char* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = (float)ptr[i];\n            }\n        }\n    }\n\n    if (type_from == 1 && type_to == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            unsigned short* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = float32_to_bfloat16(ptr[i]);\n            }\n        }\n    }\n\n    if (type_from == 4 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            
const unsigned short* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = bfloat16_to_float32(ptr[i]);\n            }\n        }\n    }\n\n    // TODO more cast type\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/cast.h",
    "content": "// Copyright 2019 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CAST_H\n#define LAYER_CAST_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Cast : public Layer\n{\npublic:\n    Cast();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // element type\n    // 0 = auto\n    // 1 = float32\n    // 2 = float16\n    // 3 = int8\n    // 4 = bfloat16\n    int type_from;\n    int type_to;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CAST_H\n"
  },
  {
    "path": "src/layer/celu.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"celu.h\"\n\nnamespace ncnn {\n\nCELU::CELU()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint CELU::load_param(const ParamDict& pd)\n{\n    alpha = pd.get(0, 1.f);\n\n    return 0;\n}\n\nint CELU::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            if (ptr[i] < 0.f)\n                ptr[i] = (expf(ptr[i] / alpha) - 1.f) * alpha;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/celu.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CELU_H\n#define LAYER_CELU_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass CELU : public Layer\n{\npublic:\n    CELU();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    float alpha;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CELU_H\n"
  },
  {
    "path": "src/layer/clip.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"clip.h\"\n\n#include <float.h>\n\nnamespace ncnn {\n\nClip::Clip()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint Clip::load_param(const ParamDict& pd)\n{\n    min = pd.get(0, -FLT_MAX);\n    max = pd.get(1, FLT_MAX);\n\n    return 0;\n}\n\nint Clip::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            if (ptr[i] < min)\n                ptr[i] = min;\n            if (ptr[i] > max)\n                ptr[i] = max;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/clip.h",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CLIP_H\n#define LAYER_CLIP_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Clip : public Layer\n{\npublic:\n    Clip();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    float min;\n    float max;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CLIP_H\n"
  },
  {
    "path": "src/layer/concat.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"concat.h\"\n\nnamespace ncnn {\n\nConcat::Concat()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint Concat::load_param(const ParamDict& pd)\n{\n    axis = pd.get(0, 0);\n\n    return 0;\n}\n\nint Concat::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int dims = bottom_blobs[0].dims;\n    size_t elemsize = bottom_blobs[0].elemsize;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // concat vector\n        // total length\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        unsigned char* outptr = top_blob;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            int w = bottom_blob.w;\n\n            const unsigned char* ptr = bottom_blob;\n            memcpy(outptr, ptr, w * elemsize);\n\n            outptr += w * elemsize;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // concat image\n        int w = bottom_blobs[0].w;\n\n        // total height\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_h += bottom_blob.h;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        unsigned char* outptr = top_blob;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n 
       {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            int size = w * bottom_blob.h;\n\n            const unsigned char* ptr = bottom_blob;\n            memcpy(outptr, ptr, size * elemsize);\n\n            outptr += size * elemsize;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // interleave image row\n        int h = bottom_blobs[0].h;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            unsigned char* outptr = top_blob.row<unsigned char>(i);\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                const unsigned char* ptr = bottom_blob.row<const unsigned char>(i);\n                memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                outptr += bottom_blob.w * elemsize;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // concat dim\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int d = bottom_blobs[0].d;\n\n        // total channels\n        int top_channels = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_channels += bottom_blob.c;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, d, top_channels, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        int 
q = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            int channels = bottom_blob.c;\n            size_t size = bottom_blob.cstep * channels;\n\n            const unsigned char* ptr = bottom_blob;\n            unsigned char* outptr = top_blob.channel(q);\n            memcpy(outptr, ptr, size * elemsize);\n\n            q += channels;\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // interleave dim height\n        int w = bottom_blobs[0].w;\n        int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n\n        // total height\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_h += bottom_blob.h;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h, d, channels, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned char* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (size_t b = 0; b < bottom_blobs.size(); b++)\n                {\n                    const Mat& bottom_blob = bottom_blobs[b];\n\n                    int size = bottom_blob.w * bottom_blob.h;\n\n                    const unsigned char* ptr = bottom_blob.channel(q).depth(i);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    outptr += size * elemsize;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // interleave dim width\n        int h = bottom_blobs[0].h;\n        int d = 
bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, d, channels, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            unsigned char* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    for (size_t b = 0; b < bottom_blobs.size(); b++)\n                    {\n                        const Mat& bottom_blob = bottom_blobs[b];\n\n                        const unsigned char* ptr = bottom_blob.channel(q).depth(i).row<const unsigned char>(j);\n                        memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                        outptr += bottom_blob.w * elemsize;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        // interleave dim depth\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int channels = bottom_blobs[0].c;\n\n        // total depth\n        int top_d = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_d += bottom_blob.d;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, top_d, channels, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n  
          unsigned char* outptr = top_blob.channel(q);\n\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                const unsigned char* ptr = bottom_blob.channel(q);\n                memcpy(outptr, ptr, size * elemsize);\n\n                outptr += size * elemsize;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/concat.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONCAT_H\n#define LAYER_CONCAT_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Concat : public Layer\n{\npublic:\n    Concat();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    int axis;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONCAT_H\n"
  },
  {
    "path": "src/layer/convolution.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution.h\"\n\n#include \"layer_type.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nConvolution::Convolution()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Convolution::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    pad_value = pd.get(18, 0.f);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    int8_scale_term = pd.get(8, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(19, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    if (int8_scale_term)\n    {\n#if NCNN_INT8\n        support_int8_storage = true;\n#else\n        NCNN_LOGE(\"please build ncnn with NCNN_INT8 enabled for int8 inference\");\n        return -1;\n#endif\n    }\n\n    return 0;\n}\n\nint Convolution::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        weight_data_int8_scales = mb.load(num_output, 1);\n        bottom_blob_int8_scales = mb.load(1, 1);\n    }\n\n    if (int8_scale_term > 100)\n    {\n        top_blob_int8_scales = mb.load(1, 1);\n    }\n#endif // NCNN_INT8\n\n#if NCNN_INT8\n    // runtime quantize the weight data\n    if 
(weight_data.elemsize == (size_t)4u && int8_scale_term)\n    {\n        const int maxk = kernel_w * kernel_h;\n        const int num_input = weight_data_size / num_output / maxk;\n\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        Mat weight_data_int8;\n\n        Option opt_q;\n        opt_q.num_threads = 1;\n        opt_q.blob_allocator = weight_data.allocator;\n        opt_q.use_packing_layout = false;\n        quantize_to_int8(weight_data_r2, weight_data_int8, weight_data_int8_scales, opt_q);\n        if (weight_data_int8.empty())\n            return -100;\n\n        weight_data = weight_data_int8.reshape(weight_data_size);\n    }\n#endif // NCNN_INT8\n\n    return 0;\n}\n\nstatic int convolution(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int kernel_h, int stride_w, int stride_h, int dilation_w, int dilation_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int inch = bottom_blob.c;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int bias_term = bias_data.empty() ? 
0 : 1;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = 0.f;\n\n                if (bias_term)\n                    sum = bias_data[p];\n\n                const float* kptr = (const float*)weight_data + maxk * inch * p;\n\n                for (int q = 0; q < inch; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++) // 29.23\n                    {\n                        float val = sptr[space_ofs[k]]; // 20.72\n                        float wt = kptr[k];\n                        sum += val * wt; // 41.45\n                    }\n\n                    kptr += maxk;\n                }\n\n                outptr[j] = activation_ss(sum, activation_type, activation_params);\n            }\n\n            outptr += outw;\n        }\n    }\n\n    return 0;\n}\n\nint Convolution::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return forward_int8(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    // flattened blob, implement as 
InnerProduct\n    if (bottom_blob.dims == 1 && kernel_w == 1 && kernel_h == 1)\n    {\n        int num_input = weight_data_size / num_output;\n        if (bottom_blob.w * bottom_blob.elempack == num_input)\n        {\n            // call InnerProduct\n            ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::InnerProduct);\n\n            // set param\n            ncnn::ParamDict pd;\n            pd.set(0, num_output);\n            pd.set(1, bias_term);\n            pd.set(2, weight_data_size);\n            pd.set(8, int8_scale_term);\n            pd.set(9, activation_type);\n            pd.set(10, activation_params);\n\n            op->load_param(pd);\n\n            // set weights\n            ncnn::Mat weights[4];\n            weights[0] = weight_data;\n            weights[1] = bias_data;\n\n#if NCNN_INT8\n            if (int8_scale_term)\n            {\n                weights[2] = weight_data_int8_scales;\n                weights[3] = bottom_blob_int8_scales;\n            }\n#endif\n\n            op->load_model(ModelBinFromMatArray(weights));\n\n            op->create_pipeline(opt);\n\n            // forward\n            int ret = op->forward(bottom_blob, top_blob, opt);\n\n            op->destroy_pipeline(opt);\n\n            delete op;\n\n            return ret;\n        }\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const int h = bottom_blob_bordered.h;\n    const size_t elemsize = bottom_blob_bordered.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return 
-100;\n\n    int ret = convolution(bottom_blob_bordered, top_blob, weight_data, bias_data, kernel_w, kernel_h, stride_w, stride_h, dilation_w, dilation_h, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    return 0;\n}\n\nint Convolution::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.c;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, _kernel_w, _kernel_h, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const int h = bottom_blob_bordered.h;\n    const size_t elemsize = bottom_blob_bordered.elemsize;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (_kernel_h - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, _num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int ret = convolution(bottom_blob_bordered, top_blob, weight_data_flattened, bias_data_flattened, _kernel_w, _kernel_h, stride_w, stride_h, dilation_w, dilation_h, activation_type, activation_params, opt);\n    if (ret != 0)\n        return 
ret;\n\n    return 0;\n}\n\nvoid Convolution::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const\n{\n    make_padding(bottom_blob, bottom_blob_bordered, kernel_w, kernel_h, opt);\n}\n\nvoid Convolution::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int _kernel_w, int _kernel_h, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (_kernel_h - 1) + 1;\n\n    bottom_blob_bordered = bottom_blob;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0)\n    {\n        Option opt_b = opt;\n        opt_b.blob_allocator = opt.workspace_allocator;\n        copy_make_border(bottom_blob, bottom_blob_bordered, pad_top, pad_bottom, pad_left, pad_right, BORDER_CONSTANT, pad_value, opt_b);\n    }\n    else if (pad_left == -233 && pad_right == -233 && pad_top == -233 && pad_bottom == -233)\n    {\n        // tensorflow padding=SAME or onnx padding=SAME_UPPER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        if (wpad > 0 || hpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, bottom_blob_bordered, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n    else if (pad_left == -234 && pad_right == -234 && pad_top == -234 && pad_bottom == -234)\n    {\n        // onnx padding=SAME_LOWER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        if (wpad > 0 || hpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, 
bottom_blob_bordered, hpad - hpad / 2, hpad / 2, wpad - wpad / 2, wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n}\n\n#if NCNN_INT8\nstatic inline signed char float2int8(float v)\n{\n    int int32 = static_cast<int>(round(v));\n    if (int32 > 127) return 127;\n    if (int32 < -127) return -127;\n    return (signed char)int32;\n}\n\nint Convolution::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n\n    //     NCNN_LOGE(\"Convolution input %d x %d  ksize=%d %d  stride=%d %d\", w, h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_unbordered = bottom_blob;\n    if (elemsize != 1)\n    {\n        Option opt_g = opt;\n        opt_g.blob_allocator = opt.workspace_allocator;\n\n        quantize_to_int8(bottom_blob, bottom_blob_unbordered, bottom_blob_int8_scales, opt_g);\n        if (bottom_blob_unbordered.empty())\n            return -100;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob_unbordered, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += 
dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // int8\n    bool use_int8_requantize = int8_scale_term > 100;\n    size_t out_elemsize = use_int8_requantize ? 1u : 4u;\n\n    top_blob.create(outw, outh, num_output, out_elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < num_output; p++)\n    {\n        signed char* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                int sum = 0;\n\n                const signed char* kptr = (const signed char*)weight_data + maxk * channels * p;\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    const signed char* sptr = m.row<signed char>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        int val = sptr[space_ofs[k]];\n                        int wt = kptr[k];\n                        sum += val * wt;\n                    }\n\n                    kptr += maxk;\n                }\n\n                float scale_in;\n                if (weight_data_int8_scales[p] == 0)\n                    scale_in = 0;\n                else\n                    scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n                float sumfp32 = sum * scale_in;\n\n                if (bias_term)\n                    sumfp32 += bias_data[p];\n\n                sumfp32 = activation_ss(sumfp32, activation_type, activation_params);\n\n                if (use_int8_requantize)\n                {\n                    // requantize\n                    float scale_out = top_blob_int8_scales[0];\n                    signed char sums8 = float2int8(sumfp32 * 
scale_out);\n                    outptr[0] = sums8;\n                    outptr += 1;\n                }\n                else\n                {\n                    // dequantize\n                    ((float*)outptr)[0] = sumfp32;\n                    outptr += 4;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/convolution.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION_H\n#define LAYER_CONVOLUTION_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Convolution : public Layer\n{\npublic:\n    Convolution();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const;\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int kernel_w, int kernel_h, const Option& opt) const;\n\n#if NCNN_INT8\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int dilation_w;\n    int dilation_h;\n    int stride_w;\n    int stride_h;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    float pad_value;\n    int bias_term;\n\n    int weight_data_size;\n\n    int int8_scale_term;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n\n#if NCNN_INT8\n    Mat weight_data_int8_scales;\n    Mat bottom_blob_int8_scales;\n    Mat top_blob_int8_scales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION_H\n"
  },
  {
    "path": "src/layer/convolution1d.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution1d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nConvolution1D::Convolution1D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Convolution1D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    dilation_w = pd.get(2, 1);\n    stride_w = pd.get(3, 1);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_value = pd.get(18, 0.f);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(19, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    return 0;\n}\n\nint Convolution1D::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int convolution1d(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int stride_w, int dilation_w, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int h = bottom_blob.h;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int bias_term = bias_data.empty() ? 
0 : 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outh; p++)\n    {\n        float* outptr = top_blob.row(p);\n\n        for (int j = 0; j < outw; j++)\n        {\n            float sum = 0.f;\n\n            if (bias_term)\n                sum = bias_data[p];\n\n            const float* kptr = (const float*)weight_data + kernel_w * h * p;\n\n            for (int q = 0; q < h; q++)\n            {\n                const float* sptr = bottom_blob.row(q) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val = *sptr;\n                    float wt = kptr[k];\n                    sum += val * wt;\n\n                    sptr += dilation_w;\n                }\n\n                kptr += kernel_w;\n            }\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            outptr[j] = sum;\n        }\n    }\n\n    return 0;\n}\n\nint Convolution1D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const size_t elemsize = bottom_blob_bordered.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n\n    top_blob.create(outw, num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int ret = convolution1d(bottom_blob_bordered, top_blob, weight_data, bias_data, kernel_w, stride_w, dilation_w, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    return 0;\n}\n\nint Convolution1D::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = 
bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _num_output = _weight_data.c;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, _kernel_w, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const size_t elemsize = bottom_blob_bordered.elemsize;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n\n    top_blob.create(outw, _num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int ret = convolution1d(bottom_blob_bordered, top_blob, weight_data_flattened, bias_data_flattened, _kernel_w, stride_w, dilation_w, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    return 0;\n}\n\nvoid Convolution1D::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const\n{\n    make_padding(bottom_blob, bottom_blob_bordered, kernel_w, opt);\n}\n\nvoid Convolution1D::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int _kernel_w, const Option& opt) const\n{\n    int w = bottom_blob.w;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n\n    bottom_blob_bordered = bottom_blob;\n    if (pad_left > 0 || pad_right > 0)\n    {\n        Option opt_b = opt;\n        opt_b.blob_allocator = opt.workspace_allocator;\n        copy_make_border(bottom_blob, bottom_blob_bordered, 0, 0, pad_left, pad_right, BORDER_CONSTANT, 
pad_value, opt_b);\n    }\n    else if (pad_left == -233 && pad_right == -233)\n    {\n        // tensorflow padding=SAME or onnx padding=SAME_UPPER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        if (wpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, bottom_blob_bordered, 0, 0, wpad / 2, wpad - wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n    else if (pad_left == -234 && pad_right == -234)\n    {\n        // onnx padding=SAME_LOWER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        if (wpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, bottom_blob_bordered, 0, 0, wpad - wpad / 2, wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/convolution1d.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION1D_H\n#define LAYER_CONVOLUTION1D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Convolution1D : public Layer\n{\npublic:\n    Convolution1D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const;\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int kernel_w, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int dilation_w;\n    int stride_w;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    float pad_value;\n    int bias_term;\n\n    int weight_data_size;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION1D_H\n"
  },
  {
    "path": "src/layer/convolution3d.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution3d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nConvolution3D::Convolution3D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Convolution3D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    kernel_d = pd.get(21, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    dilation_d = pd.get(22, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    stride_d = pd.get(23, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    pad_front = pd.get(24, pad_left);\n    pad_behind = pd.get(17, pad_front);\n    pad_value = pd.get(18, 0.f);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    return 0;\n}\n\nint Convolution3D::load_model(const ModelBin& mb)\n{\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nint Convolution3D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extend_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extend_h = dilation_h * (kernel_h - 1) + 1;\n    const int kernel_extend_d = dilation_d * (kernel_d - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    Option opt_pad = opt;\n    opt_pad.use_packing_layout = false;\n    
make_padding(bottom_blob, bottom_blob_bordered, opt_pad);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n    d = bottom_blob_bordered.d;\n\n    int outw = (w - kernel_extend_w) / stride_w + 1;\n    int outh = (h - kernel_extend_h) / stride_h + 1;\n    int outd = (d - kernel_extend_d) / stride_d + 1;\n\n    const int maxk = kernel_w * kernel_h * kernel_d;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap0 = w * dilation_h - kernel_w * dilation_w;\n        int gap1 = h * w * dilation_d - w * kernel_h * dilation_h;\n        for (int z = 0; z < kernel_d; z++)\n        {\n            for (int i = 0; i < kernel_h; i++)\n            {\n                for (int j = 0; j < kernel_w; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2 += dilation_w;\n                }\n                p2 += gap0;\n            }\n            p2 += gap1;\n        }\n    }\n\n    top_blob.create(outw, outh, outd, num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < num_output; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int z = 0; z < outd; z++)\n        {\n            for (int i = 0; i < outh; i++)\n            {\n                for (int j = 0; j < outw; j++)\n                {\n                    float sum = 0.f;\n\n                    if (bias_term)\n                        sum = bias_data[p];\n\n                    const float* kptr = (const float*)weight_data + maxk * channels * p;\n\n                    for (int q = 0; q < channels; q++)\n                    {\n                        const Mat m = bottom_blob_bordered.channel(q);\n                        const float* sptr 
= m.depth(z * stride_d).row(i * stride_h) + j * stride_w;\n\n                        for (int l = 0; l < maxk; l++)\n                        {\n                            float val = sptr[space_ofs[l]];\n\n                            float wt = kptr[l];\n                            sum += val * wt;\n                        }\n\n                        kptr += maxk;\n                    }\n\n                    outptr[j] = activation_ss(sum, activation_type, activation_params);\n                }\n\n                outptr += outw;\n            }\n        }\n    }\n\n    return 0;\n}\n\nvoid Convolution3D::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n    const int kernel_extent_d = dilation_d * (kernel_d - 1) + 1;\n\n    bottom_blob_bordered = bottom_blob;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || pad_front > 0 || pad_behind > 0)\n    {\n        Option opt_b = opt;\n        opt_b.blob_allocator = opt.workspace_allocator;\n        copy_make_border_3d(bottom_blob, bottom_blob_bordered, pad_top, pad_bottom, pad_left, pad_right, pad_front, pad_behind, BORDER_CONSTANT, pad_value, opt_b);\n    }\n    else if (pad_left == -233 && pad_right == -233 && pad_top == -233 && pad_bottom == -233 && pad_front == -233 && pad_behind == -233)\n    {\n        // tensorflow padding=SAME or onnx padding=SAME_UPPER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        int dpad = kernel_extent_d + (d - 1) / stride_d * stride_d - d;\n        if (wpad > 0 || hpad > 0 || dpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            
copy_make_border_3d(bottom_blob, bottom_blob_bordered, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, dpad / 2, dpad - dpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n    else if (pad_left == -234 && pad_right == -234 && pad_top == -234 && pad_bottom == -234 && pad_front == -234 && pad_behind == -234)\n    {\n        // onnx padding=SAME_LOWER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        int dpad = kernel_extent_d + (d - 1) / stride_d * stride_d - d;\n        if (wpad > 0 || hpad > 0 || dpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border_3d(bottom_blob, bottom_blob_bordered, hpad - hpad / 2, hpad / 2, wpad - wpad / 2, wpad / 2, dpad - dpad / 2, dpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/convolution3d.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION3D_H\n#define LAYER_CONVOLUTION3D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Convolution3D : public Layer\n{\npublic:\n    Convolution3D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int kernel_d;\n    int dilation_w;\n    int dilation_h;\n    int dilation_d;\n    int stride_w;\n    int stride_h;\n    int stride_d;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int pad_front;\n    int pad_behind;\n    float pad_value;\n    int bias_term;\n\n    int weight_data_size;\n\n    int activation_type;\n    Mat activation_params;\n\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif //LAYER_CONVOLUTION3D_H\n"
  },
  {
    "path": "src/layer/convolutiondepthwise.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolutiondepthwise.h\"\n\n#include \"layer_type.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nConvolutionDepthWise::ConvolutionDepthWise()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint ConvolutionDepthWise::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    pad_value = pd.get(18, 0.f);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    group = pd.get(7, 1);\n    int8_scale_term = pd.get(8, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(19, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    if (num_output % group != 0)\n    {\n        // reject invalid group\n        return -100;\n    }\n\n    if (int8_scale_term)\n    {\n#if NCNN_INT8\n        support_int8_storage = true;\n#else\n        NCNN_LOGE(\"please build ncnn with NCNN_INT8 enabled for int8 inference\");\n        return -1;\n#endif\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n#if NCNN_INT8\n    if (int8_scale_term == 1 || int8_scale_term == 101)\n    {\n        weight_data_int8_scales = mb.load(group, 1);\n        bottom_blob_int8_scales = mb.load(1, 
1);\n\n        float bottom_blob_int8_scale = bottom_blob_int8_scales[0];\n        bottom_blob_int8_scales = Mat(group);\n        bottom_blob_int8_scales.fill(bottom_blob_int8_scale);\n    }\n    else if (int8_scale_term == 2 || int8_scale_term == 102)\n    {\n        weight_data_int8_scales = mb.load(1, 1);\n        bottom_blob_int8_scales = mb.load(1, 1);\n\n        // extend group if only one provided\n        float weight_data_int8_scale = weight_data_int8_scales[0];\n        weight_data_int8_scales = Mat(group);\n        weight_data_int8_scales.fill(weight_data_int8_scale);\n\n        float bottom_blob_int8_scale = bottom_blob_int8_scales[0];\n        bottom_blob_int8_scales = Mat(group);\n        bottom_blob_int8_scales.fill(bottom_blob_int8_scale);\n    }\n\n    if (int8_scale_term > 100)\n    {\n        top_blob_int8_scales = mb.load(1, 1);\n\n        float top_blob_int8_scale = top_blob_int8_scales[0];\n        top_blob_int8_scales = Mat(group);\n        top_blob_int8_scales.fill(top_blob_int8_scale);\n    }\n#endif // NCNN_INT8\n\n#if NCNN_INT8\n    // runtime quantize the weight data\n    if (weight_data.elemsize == (size_t)4u && int8_scale_term)\n    {\n        Mat int8_weight_data(weight_data_size, (size_t)1u);\n        if (int8_weight_data.empty())\n            return -100;\n\n        const int weight_data_size_g = weight_data_size / group;\n\n        for (int g = 0; g < group; g++)\n        {\n            Option opt_q;\n            opt_q.num_threads = 1;\n            opt_q.blob_allocator = int8_weight_data.allocator;\n            opt_q.use_packing_layout = false;\n\n            const Mat weight_data_g = weight_data.range(weight_data_size_g * g, weight_data_size_g);\n            Mat int8_weight_data_g = int8_weight_data.range(weight_data_size_g * g, weight_data_size_g);\n            const Mat weight_data_int8_scales_g = weight_data_int8_scales.range(g, 1);\n            quantize_to_int8(weight_data_g, int8_weight_data_g, weight_data_int8_scales_g, 
opt_q);\n        }\n\n        weight_data = int8_weight_data;\n    }\n#endif // NCNN_INT8\n\n    return 0;\n}\n\nstatic int convolutiondepthwise(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int kernel_h, int stride_w, int stride_h, int dilation_w, int dilation_h, int group, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int inch = bottom_blob.c;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int bias_term = bias_data.empty() ? 0 : 1;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // depth-wise\n    if (inch == group && group == outch)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            float* outptr = top_blob.channel(g);\n            const float* kptr = (const float*)weight_data + maxk * g;\n            const Mat m = bottom_blob.channel(g);\n\n            for (int i = 0; i < outh; i++)\n            {\n                for (int j = 0; j < outw; j++)\n                {\n                    float sum = 0.f;\n\n                    if (bias_term)\n                        sum = bias_data[g];\n\n                    const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val = sptr[space_ofs[k]];\n                    
    float w = kptr[k];\n                        sum += val * w;\n                    }\n\n                    outptr[j] = activation_ss(sum, activation_type, activation_params);\n                }\n\n                outptr += outw;\n            }\n        }\n    }\n    else\n    {\n        // group convolution\n        const int inch_g = inch / group;\n        const int outch_g = outch / group;\n\n#ifdef _WIN32\n        #pragma omp parallel for num_threads(opt.num_threads)\n#else\n        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)\n#endif\n        for (int g = 0; g < group; g++)\n        {\n            for (int p = 0; p < outch_g; p++)\n            {\n                float* outptr = top_blob.channel(g * outch_g + p);\n                const float* weight_data_ptr = (const float*)weight_data + maxk * inch_g * outch_g * g;\n\n                // shadowed variable for less openmp task args\n                const int outw = top_blob.w;\n                const int outh = top_blob.h;\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                            sum = bias_data[outch_g * g + p];\n\n                        const float* kptr = weight_data_ptr + maxk * inch_g * p;\n\n                        for (int q = 0; q < inch_g; q++)\n                        {\n                            const Mat m = bottom_blob.channel(inch_g * g + q);\n                            const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float val = sptr[space_ofs[k]];\n                                float w = kptr[k];\n                                sum += val * w;\n                            }\n\n                            kptr += maxk;\n                        
}\n\n                        outptr[j] = activation_ss(sum, activation_type, activation_params);\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // convolve with NxN kernel\n    // value = value + bias\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return forward_int8(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    //     NCNN_LOGE(\"ConvolutionDepthWise input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const int h = bottom_blob_bordered.h;\n    const size_t elemsize = bottom_blob_bordered.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int ret = convolutiondepthwise(bottom_blob_bordered, top_blob, weight_data, bias_data, kernel_w, kernel_h, stride_w, stride_h, dilation_w, dilation_h, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    return 0;\n}\n\nint ConvolutionDepthWise::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const 
int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.c;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, _kernel_w, _kernel_h, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const int h = bottom_blob_bordered.h;\n    const size_t elemsize = bottom_blob_bordered.elemsize;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (_kernel_h - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    top_blob.create(outw, outh, _num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int ret = convolutiondepthwise(bottom_blob_bordered, top_blob, weight_data_flattened, bias_data_flattened, _kernel_w, _kernel_h, stride_w, stride_h, dilation_w, dilation_h, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    return 0;\n}\n\nvoid ConvolutionDepthWise::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const\n{\n    make_padding(bottom_blob, bottom_blob_bordered, kernel_w, kernel_h, opt);\n}\n\nvoid ConvolutionDepthWise::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int _kernel_w, int _kernel_h, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * 
(_kernel_h - 1) + 1;\n\n    bottom_blob_bordered = bottom_blob;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0)\n    {\n        Option opt_b = opt;\n        opt_b.blob_allocator = opt.workspace_allocator;\n        copy_make_border(bottom_blob, bottom_blob_bordered, pad_top, pad_bottom, pad_left, pad_right, BORDER_CONSTANT, pad_value, opt_b);\n    }\n    else if (pad_left == -233 && pad_right == -233 && pad_top == -233 && pad_bottom == -233)\n    {\n        // tensorflow padding=SAME or onnx padding=SAME_UPPER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        if (wpad > 0 || hpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, bottom_blob_bordered, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n    else if (pad_left == -234 && pad_right == -234 && pad_top == -234 && pad_bottom == -234)\n    {\n        // onnx padding=SAME_LOWER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        if (wpad > 0 || hpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, bottom_blob_bordered, hpad - hpad / 2, hpad / 2, wpad - wpad / 2, wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n}\n\n#if NCNN_INT8\nstatic inline signed char float2int8(float v)\n{\n    int int32 = static_cast<int>(round(v));\n    if (int32 > 127) return 127;\n    if (int32 < -127) return -127;\n    return (signed char)int32;\n}\n\nint ConvolutionDepthWise::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // convolve with NxN kernel\n    // value = value + bias\n\n    int w = 
bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n\n    if (channels % group != 0 || num_output % group != 0)\n    {\n        // reject invalid group\n        return -100;\n    }\n\n    //     NCNN_LOGE(\"ConvolutionDepthWise input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elemsize != 1)\n    {\n        const int channels_g = channels / group;\n\n        Mat scales(channels);\n        {\n            float* ps = scales;\n            for (int g = 0; g < group; g++)\n            {\n                float scale = bottom_blob_int8_scales[g];\n                for (int q = 0; q < channels_g; q++)\n                {\n                    *ps++ = scale;\n                }\n            }\n        }\n\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, scales, opt_q);\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob_int8, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += 
dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // int8\n    bool use_int8_requantize = int8_scale_term > 100;\n    size_t out_elemsize = use_int8_requantize ? 1u : 4u;\n\n    top_blob.create(outw, outh, num_output, out_elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            signed char* outptr = top_blob.channel(g);\n            const signed char* kptr = (const signed char*)weight_data + maxk * g;\n            const Mat m = bottom_blob_bordered.channel(g);\n\n            for (int i = 0; i < outh; i++)\n            {\n                for (int j = 0; j < outw; j++)\n                {\n                    int sum = 0;\n\n                    const signed char* sptr = m.row<signed char>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        signed char val = sptr[space_ofs[k]];\n                        signed char w = kptr[k];\n                        sum += val * w;\n                    }\n\n                    float scale_in;\n                    if (weight_data_int8_scales[g] == 0)\n                        scale_in = 0;\n                    else\n                        scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                    float sumfp32 = sum * scale_in;\n\n                    if (bias_term)\n                        sumfp32 += bias_data[g];\n\n                    sumfp32 = activation_ss(sumfp32, activation_type, activation_params);\n\n                    if (use_int8_requantize)\n                    {\n                        // requantize\n                        float scale_out = top_blob_int8_scales[g];\n                        signed char sums8 = float2int8(sumfp32 * scale_out);\n         
               outptr[0] = sums8;\n                        outptr += 1;\n                    }\n                    else\n                    {\n                        // dequantize\n                        ((float*)outptr)[0] = sumfp32;\n                        outptr += 4;\n                    }\n                }\n            }\n        }\n    }\n    else\n    {\n        // group convolution\n        const int channels_g = channels / group;\n        const int num_output_g = num_output / group;\n\n#ifdef _WIN32\n        #pragma omp parallel for num_threads(opt.num_threads)\n#else // _WIN32\n        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)\n#endif // _WIN32\n        for (int g = 0; g < group; g++)\n        {\n            for (int p = 0; p < num_output_g; p++)\n            {\n                signed char* outptr = top_blob.channel(g * num_output_g + p);\n                const signed char* weight_data_ptr = (const signed char*)weight_data + maxk * channels_g * num_output_g * g;\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        int sum = 0;\n\n                        const signed char* kptr = weight_data_ptr + maxk * channels_g * p;\n\n                        // channels_g\n                        for (int q = 0; q < channels_g; q++)\n                        {\n                            const Mat m = bottom_blob_bordered.channel(channels_g * g + q);\n                            const signed char* sptr = m.row<signed char>(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                signed char val = sptr[space_ofs[k]];\n                                signed char w = kptr[k];\n                                sum += val * w;\n                            }\n\n                            kptr += maxk;\n                        }\n\n   
                     float scale_in;\n                        if (weight_data_int8_scales[g] == 0)\n                            scale_in = 0;\n                        else\n                            scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                        float sumfp32 = sum * scale_in;\n\n                        if (bias_term)\n                            sumfp32 += bias_data[g * num_output_g + p];\n\n                        sumfp32 = activation_ss(sumfp32, activation_type, activation_params);\n\n                        if (use_int8_requantize)\n                        {\n                            // requantize\n                            float scale_out = top_blob_int8_scales[g];\n                            signed char sums8 = float2int8(sumfp32 * scale_out);\n                            outptr[0] = sums8;\n                            outptr += 1;\n                        }\n                        else\n                        {\n                            // dequantize\n                            ((float*)outptr)[0] = sumfp32;\n                            outptr += 4;\n                        }\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/convolutiondepthwise.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTIONDEPTHWISE_H\n#define LAYER_CONVOLUTIONDEPTHWISE_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass ConvolutionDepthWise : public Layer\n{\npublic:\n    ConvolutionDepthWise();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const;\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int kernel_w, int kernel_h, const Option& opt) const;\n\n#if NCNN_INT8\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int dilation_w;\n    int dilation_h;\n    int stride_w;\n    int stride_h;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    float pad_value;\n    int bias_term;\n\n    int weight_data_size;\n    int group;\n\n    int int8_scale_term;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n\n#if NCNN_INT8\n    Mat weight_data_int8_scales;\n    Mat bottom_blob_int8_scales;\n    Mat top_blob_int8_scales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTIONDEPTHWISE_H\n"
  },
  {
    "path": "src/layer/convolutiondepthwise1d.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolutiondepthwise1d.h\"\n\n#include \"layer_type.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nConvolutionDepthWise1D::ConvolutionDepthWise1D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint ConvolutionDepthWise1D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    dilation_w = pd.get(2, 1);\n    stride_w = pd.get(3, 1);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_value = pd.get(18, 0.f);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    group = pd.get(7, 1);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(19, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    if (num_output % group != 0)\n    {\n        // reject invalid group\n        return -100;\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise1D::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int convolutiondepthwise1d(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int stride_w, int dilation_w, int group, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int h = bottom_blob.h;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int bias_term = bias_data.empty() ? 
0 : 1;\n\n    // depth-wise\n    if (h == group && group == outh)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            float* outptr = top_blob.row(g);\n            const float* kptr = (const float*)weight_data + kernel_w * g;\n\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = 0.f;\n\n                if (bias_term)\n                    sum = bias_data[g];\n\n                const float* sptr = bottom_blob.row(g) + j * stride_w;\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float val = *sptr;\n                    float w = kptr[k];\n                    sum += val * w;\n\n                    sptr += dilation_w;\n                }\n\n                outptr[j] = activation_ss(sum, activation_type, activation_params);\n            }\n        }\n    }\n    else\n    {\n        // group convolution\n        const int h_g = h / group;\n        const int outh_g = outh / group;\n\n#ifdef _WIN32\n        #pragma omp parallel for num_threads(opt.num_threads)\n#else\n        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)\n#endif\n        for (int g = 0; g < group; g++)\n        {\n            for (int p = 0; p < outh_g; p++)\n            {\n                float* outptr = top_blob.row(g * outh_g + p);\n                const float* weight_data_ptr = (const float*)weight_data + kernel_w * h_g * outh_g * g;\n\n                for (int j = 0; j < outw; j++)\n                {\n                    float sum = 0.f;\n\n                    if (bias_term)\n                        sum = bias_data[outh_g * g + p];\n\n                    const float* kptr = weight_data_ptr + kernel_w * h_g * p;\n\n                    for (int q = 0; q < h_g; q++)\n                    {\n                        const float* sptr = bottom_blob.row(h_g * g + q) + j * stride_w;\n\n                        for (int k = 0; k < 
kernel_w; k++)\n                        {\n                            float val = *sptr;\n                            float w = kptr[k];\n                            sum += val * w;\n\n                            sptr += dilation_w;\n                        }\n\n                        kptr += kernel_w;\n                    }\n\n                    outptr[j] = activation_ss(sum, activation_type, activation_params);\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise1D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n\n    top_blob.create(outw, num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int ret = convolutiondepthwise1d(bottom_blob_bordered, top_blob, weight_data, bias_data, kernel_w, stride_w, dilation_w, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    return 0;\n}\n\nint ConvolutionDepthWise1D::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _num_output = _weight_data.c;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, 
opt);\n        if (bias_data_flattened.empty())\n            return -100;\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, _kernel_w, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    const int w = bottom_blob_bordered.w;\n    const size_t elemsize = bottom_blob_bordered.elemsize;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n\n    top_blob.create(outw, _num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int ret = convolutiondepthwise1d(bottom_blob_bordered, top_blob, weight_data_flattened, bias_data_flattened, _kernel_w, stride_w, dilation_w, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    return 0;\n}\n\nvoid ConvolutionDepthWise1D::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const\n{\n    make_padding(bottom_blob, bottom_blob_bordered, kernel_w, opt);\n}\n\nvoid ConvolutionDepthWise1D::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int _kernel_w, const Option& opt) const\n{\n    int w = bottom_blob.w;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n\n    bottom_blob_bordered = bottom_blob;\n    if (pad_left > 0 || pad_right > 0)\n    {\n        Option opt_b = opt;\n        opt_b.blob_allocator = opt.workspace_allocator;\n        copy_make_border(bottom_blob, bottom_blob_bordered, 0, 0, pad_left, pad_right, BORDER_CONSTANT, pad_value, opt_b);\n    }\n    else if (pad_left == -233 && pad_right == -233)\n    {\n        // tensorflow padding=SAME or onnx padding=SAME_UPPER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        if (wpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, bottom_blob_bordered, 0, 0, wpad 
/ 2, wpad - wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n    else if (pad_left == -234 && pad_right == -234)\n    {\n        // onnx padding=SAME_LOWER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        if (wpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border(bottom_blob, bottom_blob_bordered, 0, 0, wpad - wpad / 2, wpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/convolutiondepthwise1d.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTIONDEPTHWISE1D_H\n#define LAYER_CONVOLUTIONDEPTHWISE1D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass ConvolutionDepthWise1D : public Layer\n{\npublic:\n    ConvolutionDepthWise1D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const;\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, int kernel_w, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int dilation_w;\n    int stride_w;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    float pad_value;\n    int bias_term;\n\n    int weight_data_size;\n    int group;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTIONDEPTHWISE1D_H\n"
  },
  {
    "path": "src/layer/convolutiondepthwise3d.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolutiondepthwise3d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nConvolutionDepthWise3D::ConvolutionDepthWise3D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint ConvolutionDepthWise3D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    kernel_d = pd.get(21, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    dilation_d = pd.get(22, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    stride_d = pd.get(23, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    pad_front = pd.get(24, pad_left);\n    pad_behind = pd.get(17, pad_front);\n    pad_value = pd.get(18, 0.f);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    group = pd.get(7, 1);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    return 0;\n}\n\nint ConvolutionDepthWise3D::load_model(const ModelBin& mb)\n{\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise3D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extend_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extend_h = dilation_h * (kernel_h - 1) + 1;\n    const int kernel_extend_d = dilation_d * (kernel_d - 1) + 1;\n\n    Mat 
bottom_blob_bordered;\n    Option opt_pad = opt;\n    opt_pad.use_packing_layout = false;\n    make_padding(bottom_blob, bottom_blob_bordered, opt_pad);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n    d = bottom_blob_bordered.d;\n\n    int outw = (w - kernel_extend_w) / stride_w + 1;\n    int outh = (h - kernel_extend_h) / stride_h + 1;\n    int outd = (d - kernel_extend_d) / stride_d + 1;\n\n    const int maxk = kernel_w * kernel_h * kernel_d;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap0 = w * dilation_h - kernel_w * dilation_w;\n        int gap1 = h * w * dilation_d - w * kernel_h * dilation_h;\n        for (int z = 0; z < kernel_d; z++)\n        {\n            for (int i = 0; i < kernel_h; i++)\n            {\n                for (int j = 0; j < kernel_w; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2 += dilation_w;\n                }\n                p2 += gap0;\n            }\n            p2 += gap1;\n        }\n    }\n\n    top_blob.create(outw, outh, outd, num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            float* outptr = top_blob.channel(g);\n            const float* kptr = (const float*)weight_data + maxk * g;\n            const Mat m = bottom_blob_bordered.channel(g);\n\n            for (int z = 0; z < outd; z++)\n            {\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if 
(bias_term)\n                            sum = bias_data[g];\n\n                        const float* sptr = m.depth(z * stride_d).row(i * stride_h) + j * stride_w;\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float val = sptr[space_ofs[k]];\n                            float w = kptr[k];\n                            sum += val * w;\n                        }\n\n                        outptr[j] = activation_ss(sum, activation_type, activation_params);\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n    else\n    {\n        // group convolution\n        const int channels_g = channels / group;\n        const int num_output_g = num_output / group;\n\n#ifdef _WIN32\n        #pragma omp parallel for num_threads(opt.num_threads)\n#else\n        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)\n#endif\n        for (int g = 0; g < group; g++)\n        {\n            for (int p = 0; p < num_output_g; p++)\n            {\n                float* outptr = top_blob.channel(g * num_output_g + p);\n                const float* weight_data_ptr = (const float*)weight_data + maxk * channels_g * num_output_g * g;\n\n                // shadowed variable for less openmp task args\n                const int outw = top_blob.w;\n                const int outh = top_blob.h;\n                const int outd = top_blob.d;\n\n                for (int z = 0; z < outd; z++)\n                {\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                                sum = bias_data[num_output_g * g + p];\n\n                            const float* kptr = weight_data_ptr + maxk * channels_g * p;\n\n                            for (int q = 0; 
q < channels_g; q++)\n                            {\n                                const Mat m = bottom_blob_bordered.channel(channels_g * g + q);\n                                const float* sptr = m.depth(z * stride_d).row(i * stride_h) + j * stride_w;\n\n                                for (int l = 0; l < maxk; l++)\n                                {\n                                    float val = sptr[space_ofs[l]];\n\n                                    float wt = kptr[l];\n                                    sum += val * wt;\n                                }\n\n                                kptr += maxk;\n                            }\n\n                            outptr[j] = activation_ss(sum, activation_type, activation_params);\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nvoid ConvolutionDepthWise3D::make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n    const int kernel_extent_d = dilation_d * (kernel_d - 1) + 1;\n\n    bottom_blob_bordered = bottom_blob;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || pad_front > 0 || pad_behind > 0)\n    {\n        Option opt_b = opt;\n        opt_b.blob_allocator = opt.workspace_allocator;\n        copy_make_border_3d(bottom_blob, bottom_blob_bordered, pad_top, pad_bottom, pad_left, pad_right, pad_front, pad_behind, BORDER_CONSTANT, pad_value, opt_b);\n    }\n    else if (pad_left == -233 && pad_right == -233 && pad_top == -233 && pad_bottom == -233 && pad_front == -233 && pad_behind == -233)\n    {\n        // tensorflow padding=SAME or onnx padding=SAME_UPPER\n        int wpad = kernel_extent_w + (w - 1) / 
stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        int dpad = kernel_extent_d + (d - 1) / stride_d * stride_d - d;\n        if (wpad > 0 || hpad > 0 || dpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border_3d(bottom_blob, bottom_blob_bordered, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, dpad / 2, dpad - dpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n    else if (pad_left == -234 && pad_right == -234 && pad_top == -234 && pad_bottom == -234 && pad_front == -234 && pad_behind == -234)\n    {\n        // onnx padding=SAME_LOWER\n        int wpad = kernel_extent_w + (w - 1) / stride_w * stride_w - w;\n        int hpad = kernel_extent_h + (h - 1) / stride_h * stride_h - h;\n        int dpad = kernel_extent_d + (d - 1) / stride_d * stride_d - d;\n        if (wpad > 0 || hpad > 0 || dpad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            copy_make_border_3d(bottom_blob, bottom_blob_bordered, hpad - hpad / 2, hpad / 2, wpad - wpad / 2, wpad / 2, dpad - dpad / 2, dpad / 2, BORDER_CONSTANT, pad_value, opt_b);\n        }\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/convolutiondepthwise3d.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTIONDEPTHWISE3D_H\n#define LAYER_CONVOLUTIONDEPTHWISE3D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass ConvolutionDepthWise3D : public Layer\n{\npublic:\n    ConvolutionDepthWise3D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    void make_padding(const Mat& bottom_blob, Mat& bottom_blob_bordered, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int kernel_d;\n    int dilation_w;\n    int dilation_h;\n    int dilation_d;\n    int stride_w;\n    int stride_h;\n    int stride_d;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int pad_front;\n    int pad_behind;\n    float pad_value;\n    int bias_term;\n\n    int weight_data_size;\n    int group;\n\n    int activation_type;\n    Mat activation_params;\n\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif //LAYER_CONVOLUTIONDEPTHWISE3D_H\n"
  },
  {
    "path": "src/layer/copyto.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"copyto.h\"\n\nnamespace ncnn {\n\nCopyTo::CopyTo()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint CopyTo::load_param(const ParamDict& pd)\n{\n    woffset = pd.get(0, 0);\n    hoffset = pd.get(1, 0);\n    doffset = pd.get(13, 0);\n    coffset = pd.get(2, 0);\n\n    starts = pd.get(9, Mat());\n    axes = pd.get(11, Mat());\n\n    return 0;\n}\n\ntemplate<typename T>\nstatic void copy_to_image(const Mat& src, Mat& self, int top, int left)\n{\n    int w = src.w;\n    int h = src.h;\n\n    const T* ptr = src;\n    T* outptr = self.row<T>(top) + left;\n\n    for (int y = 0; y < h; y++)\n    {\n        memcpy(outptr, ptr, w * sizeof(T));\n        ptr += w;\n        outptr += self.w;\n    }\n}\n\nint CopyTo::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& self_blob = bottom_blobs[0];\n    const Mat& src_blob = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int w = self_blob.w;\n    int h = self_blob.h;\n    int d = self_blob.d;\n    int channels = self_blob.c;\n    int dims = self_blob.dims;\n    size_t elemsize = self_blob.elemsize;\n\n    if (src_blob.dims == dims && src_blob.w == w && src_blob.h == h && src_blob.d == d && src_blob.c == channels)\n    {\n        top_blob = src_blob;\n        return 0;\n    }\n\n    top_blob = self_blob.clone(opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    int _woffset, _hoffset, _doffset, _coffset;\n    resolve_copyto_offset(self_blob.shape(), _woffset, _hoffset, _doffset, _coffset);\n\n    if (dims == 1)\n    {\n        if (elemsize == 1)\n            copy_to_image<signed char>(src_blob, top_blob, 0, _woffset);\n        if (elemsize == 2)\n            copy_to_image<unsigned short>(src_blob, top_blob, 0, _woffset);\n        if (elemsize == 4)\n            copy_to_image<float>(src_blob, top_blob, 0, 
_woffset);\n    }\n\n    if (dims == 2)\n    {\n        if (elemsize == 1)\n            copy_to_image<signed char>(src_blob, top_blob, _hoffset, _woffset);\n        if (elemsize == 2)\n            copy_to_image<unsigned short>(src_blob, top_blob, _hoffset, _woffset);\n        if (elemsize == 4)\n            copy_to_image<float>(src_blob, top_blob, _hoffset, _woffset);\n    }\n\n    if (dims == 3)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < src_blob.c; q++)\n        {\n            const Mat roim = src_blob.channel(q);\n            Mat m = top_blob.channel(q + _coffset);\n\n            if (elemsize == 1)\n                copy_to_image<signed char>(roim, m, _hoffset, _woffset);\n            if (elemsize == 2)\n                copy_to_image<unsigned short>(roim, m, _hoffset, _woffset);\n            if (elemsize == 4)\n                copy_to_image<float>(roim, m, _hoffset, _woffset);\n        }\n    }\n\n    if (dims == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < src_blob.c; q++)\n        {\n            for (int z = 0; z < src_blob.d; z++)\n            {\n                const Mat roim = src_blob.channel(q).depth(z);\n                Mat m = top_blob.channel(q + _coffset).depth(z + _doffset);\n\n                if (elemsize == 1)\n                    copy_to_image<signed char>(roim, m, _hoffset, _woffset);\n                if (elemsize == 2)\n                    copy_to_image<unsigned short>(roim, m, _hoffset, _woffset);\n                if (elemsize == 4)\n                    copy_to_image<float>(roim, m, _hoffset, _woffset);\n            }\n        }\n    }\n\n    return 0;\n}\n\nvoid CopyTo::resolve_copyto_offset(const Mat& self_blob, int& _woffset, int& _hoffset, int& _doffset, int& _coffset) const\n{\n    int w = self_blob.w;\n    int h = self_blob.h;\n    int d = self_blob.d;\n    int channels = self_blob.c;\n    int dims = self_blob.dims;\n\n    bool 
numpy_style_slice = !starts.empty();\n    if (numpy_style_slice)\n    {\n        _woffset = 0;\n        _hoffset = 0;\n        _doffset = 0;\n        _coffset = 0;\n\n        const int* starts_ptr = starts;\n        const int* axes_ptr = axes;\n\n        int _axes[4] = {0, 1, 2, 3};\n        int num_axis = axes.w;\n        if (num_axis == 0)\n        {\n            num_axis = dims;\n        }\n        else\n        {\n            for (int i = 0; i < num_axis; i++)\n            {\n                int axis = axes_ptr[i];\n                if (axis < 0)\n                    axis = dims + axis;\n                _axes[i] = axis;\n            }\n        }\n\n        for (int i = 0; i < num_axis; i++)\n        {\n            int axis = _axes[i];\n            int start = starts_ptr[i];\n\n            if (dims == 1) // axis == 0\n            {\n                if (start == -233) start = 0;\n                _woffset = start >= 0 ? start : w + start;\n            }\n            if (dims == 2)\n            {\n                if (axis == 0)\n                {\n                    if (start == -233) start = 0;\n                    _hoffset = start >= 0 ? start : h + start;\n                }\n                if (axis == 1)\n                {\n                    if (start == -233) start = 0;\n                    _woffset = start >= 0 ? start : w + start;\n                }\n            }\n            if (dims == 3)\n            {\n                if (axis == 0)\n                {\n                    if (start == -233) start = 0;\n                    _coffset = start >= 0 ? start : channels + start;\n                }\n                if (axis == 1)\n                {\n                    if (start == -233) start = 0;\n                    _hoffset = start >= 0 ? start : h + start;\n                }\n                if (axis == 2)\n                {\n                    if (start == -233) start = 0;\n                    _woffset = start >= 0 ? 
start : w + start;\n                }\n            }\n            if (dims == 4)\n            {\n                if (axis == 0)\n                {\n                    if (start == -233) start = 0;\n                    _coffset = start >= 0 ? start : channels + start;\n                }\n                if (axis == 1)\n                {\n                    if (start == -233) start = 0;\n                    _doffset = start >= 0 ? start : d + start;\n                }\n                if (axis == 2)\n                {\n                    if (start == -233) start = 0;\n                    _hoffset = start >= 0 ? start : h + start;\n                }\n                if (axis == 3)\n                {\n                    if (start == -233) start = 0;\n                    _woffset = start >= 0 ? start : w + start;\n                }\n            }\n        }\n    }\n    else\n    {\n        _woffset = woffset;\n        _hoffset = hoffset;\n        _doffset = doffset;\n        _coffset = coffset;\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/copyto.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_COPYTO_H\n#define LAYER_COPYTO_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass CopyTo : public Layer\n{\npublic:\n    CopyTo();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void resolve_copyto_offset(const Mat& self_blob, int& woffset, int& hoffset, int& doffset, int& coffset) const;\n\npublic:\n    int woffset;\n    int hoffset;\n    int doffset;\n    int coffset;\n\n    // numpy-style slice\n    // if provided, all the above attributes will be ignored\n    Mat starts;\n    Mat axes;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_COPYTO_H\n"
  },
  {
    "path": "src/layer/crop.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"crop.h\"\n\n#include \"expression.h\"\n\nnamespace ncnn {\n\nCrop::Crop()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Crop::load_param(const ParamDict& pd)\n{\n    woffset = pd.get(0, 0);\n    hoffset = pd.get(1, 0);\n    doffset = pd.get(13, 0);\n    coffset = pd.get(2, 0);\n    outw = pd.get(3, 0);\n    outh = pd.get(4, 0);\n    outd = pd.get(14, 0);\n    outc = pd.get(5, 0);\n    woffset2 = pd.get(6, 0);\n    hoffset2 = pd.get(7, 0);\n    doffset2 = pd.get(15, 0);\n    coffset2 = pd.get(8, 0);\n\n    starts = pd.get(9, Mat());\n    ends = pd.get(10, Mat());\n    axes = pd.get(11, Mat());\n\n    starts_expr = pd.get(19, \"\");\n    ends_expr = pd.get(20, \"\");\n    axes_expr = pd.get(21, \"\");\n\n    // NCNN_LOGE(\"%s %s %s\", starts_expr.c_str(), ends_expr.c_str(), axes_expr.c_str());\n\n    bool numpy_style_slice = !starts.empty() && !ends.empty();\n\n    if (!starts_expr.empty() && !ends_expr.empty())\n        numpy_style_slice = true;\n\n    if (outw == 0 && outh == 0 && outd == 0 && outc == 0 && woffset2 == 0 && hoffset2 == 0 && doffset2 == 0 && coffset2 == 0 && !numpy_style_slice)\n    {\n        one_blob_only = false;\n    }\n\n    // count reference blobs\n    if (!starts_expr.empty() || !ends_expr.empty() || !axes_expr.empty())\n    {\n        const int starts_blob_count = count_expression_blobs(starts_expr);\n        const int ends_blob_count = count_expression_blobs(ends_expr);\n        const int axes_blob_count = count_expression_blobs(axes_expr);\n\n        // NCNN_LOGE(\"%d %d %d\", starts_blob_count, ends_blob_count, axes_blob_count);\n        if (starts_blob_count > 1 || ends_blob_count > 1 || axes_blob_count > 1)\n            one_blob_only = false;\n    }\n\n    return 0;\n}\n\ntemplate<typename T>\nstatic void copy_cut_border_image(const Mat& src, Mat& dst, int top, int left)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n 
   const T* ptr = src.row<T>(top) + left;\n    T* outptr = dst; //.data;\n\n    for (int y = 0; y < h; y++)\n    {\n        if (w < 12)\n        {\n            for (int x = 0; x < w; x++)\n            {\n                outptr[x] = ptr[x];\n            }\n        }\n        else\n        {\n            memcpy(outptr, ptr, w * sizeof(T));\n        }\n        outptr += w;\n        ptr += src.w;\n    }\n}\n\nint Crop::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n\n    int _woffset, _hoffset, _doffset, _coffset;\n    int _outw = -1, _outh = -1, _outd = -1, _outc;\n\n    if (!starts_expr.empty() && !ends_expr.empty())\n    {\n        std::vector<Mat> bottom_blobs(1);\n        bottom_blobs[0] = bottom_blob;\n        eval_crop_expr(bottom_blobs, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else\n    {\n        resolve_crop_roi(bottom_blob.shape(), _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n\n    if (dims == 1)\n    {\n        if (_outw == w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(_outw, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elemsize == 1)\n            copy_cut_border_image<signed char>(bottom_blob, top_blob, 0, _woffset);\n        if (elemsize == 2)\n            copy_cut_border_image<unsigned short>(bottom_blob, top_blob, 0, _woffset);\n        if (elemsize == 4)\n            copy_cut_border_image<float>(bottom_blob, top_blob, 0, _woffset);\n    }\n\n    if (dims == 2)\n    {\n        if (_outw == w && _outh == h)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(_outw, _outh, elemsize, 
opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elemsize == 1)\n            copy_cut_border_image<signed char>(bottom_blob, top_blob, _hoffset, _woffset);\n        if (elemsize == 2)\n            copy_cut_border_image<unsigned short>(bottom_blob, top_blob, _hoffset, _woffset);\n        if (elemsize == 4)\n            copy_cut_border_image<float>(bottom_blob, top_blob, _hoffset, _woffset);\n    }\n\n    if (dims == 3)\n    {\n        if (_outw == w && _outh == h && _outc == channels)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset, _outc);\n\n        if (_outw == w && _outh == h)\n        {\n            top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            return 0;\n        }\n\n        top_blob.create(_outw, _outh, _outc, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < _outc; q++)\n        {\n            const Mat m = bottom_blob_sliced.channel(q);\n            Mat borderm = top_blob.channel(q);\n\n            if (elemsize == 1)\n                copy_cut_border_image<signed char>(m, borderm, _hoffset, _woffset);\n            if (elemsize == 2)\n                copy_cut_border_image<unsigned short>(m, borderm, _hoffset, _woffset);\n            if (elemsize == 4)\n                copy_cut_border_image<float>(m, borderm, _hoffset, _woffset);\n        }\n    }\n\n    if (dims == 4)\n    {\n        if (_outw == w && _outh == h && _outd == d && _outc == channels)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset, _outc);\n\n        if (_outw == w && _outh == h && _outd == d)\n        {\n           
 top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            return 0;\n        }\n\n        top_blob.create(_outw, _outh, _outd, _outc, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < _outc; q++)\n        {\n            for (int z = 0; z < _outd; z++)\n            {\n                const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                Mat borderm = top_blob.channel(q).depth(z);\n\n                if (elemsize == 1)\n                    copy_cut_border_image<signed char>(m, borderm, _hoffset, _woffset);\n                if (elemsize == 2)\n                    copy_cut_border_image<unsigned short>(m, borderm, _hoffset, _woffset);\n                if (elemsize == 4)\n                    copy_cut_border_image<float>(m, borderm, _hoffset, _woffset);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Crop::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n\n    Mat& top_blob = top_blobs[0];\n\n    int _woffset, _hoffset, _doffset, _coffset = -1;\n    int _outw = -1, _outh = -1, _outd = -1, _outc;\n\n    if (!starts_expr.empty() && !ends_expr.empty())\n    {\n        eval_crop_expr(bottom_blobs, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else if (woffset == -233)\n    {\n        resolve_crop_roi(bottom_blob.shape(), (const int*)reference_blob, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else\n    {\n        
resolve_crop_roi(bottom_blob.shape(), reference_blob.shape(), _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n\n    if (dims == 1)\n    {\n        if (_outw == w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(_outw, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elemsize == 1)\n            copy_cut_border_image<signed char>(bottom_blob, top_blob, 0, _woffset);\n        if (elemsize == 2)\n            copy_cut_border_image<unsigned short>(bottom_blob, top_blob, 0, _woffset);\n        if (elemsize == 4)\n            copy_cut_border_image<float>(bottom_blob, top_blob, 0, _woffset);\n    }\n\n    if (dims == 2)\n    {\n        if (_outw == w && _outh == h)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(_outw, _outh, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (elemsize == 1)\n            copy_cut_border_image<signed char>(bottom_blob, top_blob, _hoffset, _woffset);\n        if (elemsize == 2)\n            copy_cut_border_image<unsigned short>(bottom_blob, top_blob, _hoffset, _woffset);\n        if (elemsize == 4)\n            copy_cut_border_image<float>(bottom_blob, top_blob, _hoffset, _woffset);\n    }\n\n    if (dims == 3)\n    {\n        if (_outw == w && _outh == h && _outc == channels)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset, _outc);\n\n        if (_outw == w && _outh == h)\n        {\n            top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            return 0;\n        }\n\n        top_blob.create(_outw, _outh, _outc, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return 
-100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < _outc; q++)\n        {\n            const Mat m = bottom_blob_sliced.channel(q);\n            Mat borderm = top_blob.channel(q);\n\n            if (elemsize == 1)\n                copy_cut_border_image<signed char>(m, borderm, _hoffset, _woffset);\n            if (elemsize == 2)\n                copy_cut_border_image<unsigned short>(m, borderm, _hoffset, _woffset);\n            if (elemsize == 4)\n                copy_cut_border_image<float>(m, borderm, _hoffset, _woffset);\n        }\n    }\n\n    if (dims == 4)\n    {\n        if (_outw == w && _outh == h && _outd == d && _outc == channels)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset, _outc);\n\n        if (_outw == w && _outh == h && _outd == d)\n        {\n            top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            return 0;\n        }\n\n        top_blob.create(_outw, _outh, _outd, _outc, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < _outc; q++)\n        {\n            for (int z = 0; z < _outd; z++)\n            {\n                const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                Mat borderm = top_blob.channel(q).depth(z);\n\n                if (elemsize == 1)\n                    copy_cut_border_image<signed char>(m, borderm, _hoffset, _woffset);\n                if (elemsize == 2)\n                    copy_cut_border_image<unsigned short>(m, borderm, _hoffset, _woffset);\n                if (elemsize == 4)\n                    copy_cut_border_image<float>(m, borderm, _hoffset, _woffset);\n            }\n        }\n    }\n\n    return 
0;\n}\n\nvoid Crop::resolve_crop_roi(const Mat& bottom_blob, int& _woffset, int& _hoffset, int& _doffset, int& _coffset, int& _outw, int& _outh, int& _outd, int& _outc) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n\n    bool numpy_style_slice = !starts.empty() && !ends.empty();\n    if (numpy_style_slice)\n    {\n        _woffset = 0;\n        _hoffset = 0;\n        _doffset = 0;\n        _coffset = 0;\n        _outw = w;\n        _outh = h;\n        _outd = d;\n        _outc = channels;\n\n        const int* starts_ptr = starts;\n        const int* ends_ptr = ends;\n        const int* axes_ptr = axes;\n\n        int _axes[4] = {0, 1, 2, 3};\n        int num_axis = axes.w;\n        if (num_axis == 0)\n        {\n            num_axis = dims;\n        }\n        else\n        {\n            for (int i = 0; i < num_axis; i++)\n            {\n                int axis = axes_ptr[i];\n                if (axis < 0)\n                    axis = dims + axis;\n                _axes[i] = axis;\n            }\n        }\n\n        for (int i = 0; i < num_axis; i++)\n        {\n            int axis = _axes[i];\n            int start = starts_ptr[i];\n            int end = ends_ptr[i];\n\n            if (dims == 1) // axis == 0\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = w;\n                _woffset = start >= 0 ? start : w + start;\n                _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n            }\n            if (dims == 2)\n            {\n                if (axis == 0)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = h;\n                    _hoffset = start >= 0 ? start : h + start;\n                    _outh = std::min(h, end > 0 ? 
end : h + end) - _hoffset;\n                }\n                if (axis == 1)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = w;\n                    _woffset = start >= 0 ? start : w + start;\n                    _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n                }\n            }\n            if (dims == 3)\n            {\n                if (axis == 0)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = channels;\n                    _coffset = start >= 0 ? start : channels + start;\n                    _outc = std::min(channels, end > 0 ? end : channels + end) - _coffset;\n                }\n                if (axis == 1)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = h;\n                    _hoffset = start >= 0 ? start : h + start;\n                    _outh = std::min(h, end > 0 ? end : h + end) - _hoffset;\n                }\n                if (axis == 2)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = w;\n                    _woffset = start >= 0 ? start : w + start;\n                    _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n                }\n            }\n            if (dims == 4)\n            {\n                if (axis == 0)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = channels;\n                    _coffset = start >= 0 ? start : channels + start;\n                    _outc = std::min(channels, end > 0 ? end : channels + end) - _coffset;\n                }\n                if (axis == 1)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = d;\n                    _doffset = start >= 0 ? 
start : d + start;\n                    _outd = std::min(d, end > 0 ? end : d + end) - _doffset;\n                }\n                if (axis == 2)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = h;\n                    _hoffset = start >= 0 ? start : h + start;\n                    _outh = std::min(h, end > 0 ? end : h + end) - _hoffset;\n                }\n                if (axis == 3)\n                {\n                    if (start == -233) start = 0;\n                    if (end == -233) end = w;\n                    _woffset = start >= 0 ? start : w + start;\n                    _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n                }\n            }\n        }\n    }\n    else\n    {\n        _woffset = woffset;\n        _hoffset = hoffset;\n        _doffset = doffset;\n        _coffset = coffset;\n        _outw = w;\n        _outh = h;\n        _outd = d;\n        _outc = channels;\n\n        if (dims == 1)\n        {\n            _outw = w - woffset - woffset2;\n            if (outw != -233)\n                _outw = std::min(outw, _outw);\n        }\n        if (dims == 2)\n        {\n            _outw = w - woffset - woffset2;\n            if (outw != -233)\n                _outw = std::min(outw, _outw);\n\n            _outh = h - hoffset - hoffset2;\n            if (outh != -233)\n                _outh = std::min(outh, _outh);\n        }\n        if (dims == 3)\n        {\n            _outw = w - woffset - woffset2;\n            if (outw != -233)\n                _outw = std::min(outw, _outw);\n\n            _outh = h - hoffset - hoffset2;\n            if (outh != -233)\n                _outh = std::min(outh, _outh);\n\n            _outc = channels - coffset - coffset2;\n            if (outc != -233)\n                _outc = std::min(outc, _outc);\n        }\n        if (dims == 4)\n        {\n            _outw = w - woffset - woffset2;\n            if (outw != -233)\n  
              _outw = std::min(outw, _outw);\n\n            _outh = h - hoffset - hoffset2;\n            if (outh != -233)\n                _outh = std::min(outh, _outh);\n\n            _outd = d - doffset - doffset2;\n            if (outd != -233)\n                _outd = std::min(outd, _outd);\n\n            _outc = channels - coffset - coffset2;\n            if (outc != -233)\n                _outc = std::min(outc, _outc);\n        }\n    }\n}\n\nvoid Crop::resolve_crop_roi(const Mat& bottom_blob, const Mat& reference_blob, int& _woffset, int& _hoffset, int& _doffset, int& _coffset, int& _outw, int& _outh, int& _outd, int& _outc) const\n{\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n\n    int ref_w = reference_blob.w;\n    int ref_h = reference_blob.h;\n    int ref_d = reference_blob.d;\n    int ref_channels = reference_blob.c;\n    int ref_dims = reference_blob.dims;\n\n    if (dims == 1)\n    {\n        _woffset = woffset;\n        _outw = ref_w;\n    }\n    if (dims == 2)\n    {\n        _woffset = woffset;\n        _hoffset = hoffset;\n        _outw = ref_w;\n        _outh = ref_h;\n    }\n    if (dims == 3)\n    {\n        _woffset = woffset;\n        _hoffset = hoffset;\n        _coffset = coffset;\n        _outw = ref_w;\n        _outh = ref_h;\n        _outc = ref_dims == 3 ? ref_channels : channels;\n    }\n    if (dims == 4)\n    {\n        _woffset = woffset;\n        _hoffset = hoffset;\n        _doffset = doffset;\n        _coffset = coffset;\n        _outw = ref_w;\n        _outh = ref_h;\n        _outd = ref_d;\n        _outc = ref_dims == 4 ? 
ref_channels : channels;\n    }\n}\n\nvoid Crop::resolve_crop_roi(const Mat& bottom_blob, const int* param_data, int& _woffset, int& _hoffset, int& _doffset, int& _coffset, int& _outw, int& _outh, int& _outd, int& _outc) const\n{\n    int dims = bottom_blob.dims;\n\n    if (dims == 1)\n    {\n        _woffset = param_data[0];\n        _outw = param_data[3];\n    }\n    if (dims == 2)\n    {\n        _woffset = param_data[0];\n        _hoffset = param_data[1];\n        _outw = param_data[3];\n        _outh = param_data[4];\n    }\n    if (dims == 3)\n    {\n        _woffset = param_data[0];\n        _hoffset = param_data[1];\n        _coffset = param_data[2];\n        _outw = param_data[3];\n        _outh = param_data[4];\n        _outc = param_data[5];\n    }\n    if (dims == 4)\n    {\n        _woffset = param_data[0];\n        _hoffset = param_data[1];\n        _doffset = param_data[2];\n        _coffset = param_data[3];\n        _outw = param_data[4];\n        _outh = param_data[5];\n        _outd = param_data[6];\n        _outc = param_data[7];\n    }\n}\n\nint Crop::eval_crop_expr(const std::vector<Mat>& bottom_blobs, int& _woffset, int& _hoffset, int& _doffset, int& _coffset, int& _outw, int& _outh, int& _outd, int& _outc) const\n{\n    std::vector<int> _starts;\n    std::vector<int> _ends;\n    std::vector<int> _axes;\n    int er = eval_list_expression(starts_expr, bottom_blobs, _starts);\n    if (er != 0)\n        return -1;\n\n    er = eval_list_expression(ends_expr, bottom_blobs, _ends);\n    if (er != 0)\n        return -1;\n\n    er = eval_list_expression(axes_expr, bottom_blobs, _axes);\n    if (er != 0)\n        return -1;\n\n    // NCNN_LOGE(\"%d %d %d\", _starts[0], _ends[0], _axes[0]);\n\n    const Mat& bottom_blob = bottom_blobs[0];\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int d = bottom_blob.d;\n    const int channels = bottom_blob.c;\n    const int dims = bottom_blob.dims;\n\n    _woffset = 0;\n    _hoffset 
= 0;\n    _doffset = 0;\n    _coffset = 0;\n    _outw = w;\n    _outh = h;\n    _outd = d;\n    _outc = channels;\n\n    const int* starts_ptr = _starts.data();\n    const int* ends_ptr = _ends.data();\n    const int* axes_ptr = _axes.data();\n\n    int _axes4[4] = {0, 1, 2, 3};\n    int num_axis = (int)_axes.size();\n    if (num_axis == 0)\n    {\n        num_axis = dims;\n    }\n    else\n    {\n        for (int i = 0; i < num_axis; i++)\n        {\n            int axis = axes_ptr[i];\n            if (axis < 0)\n                axis = dims + axis;\n            _axes4[i] = axis;\n        }\n    }\n\n    for (int i = 0; i < num_axis; i++)\n    {\n        int axis = _axes4[i];\n        int start = starts_ptr[i];\n        int end = ends_ptr[i];\n\n        if (dims == 1) // axis == 0\n        {\n            if (start == -233) start = 0;\n            if (end == -233) end = w;\n            _woffset = start >= 0 ? start : w + start;\n            _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n        }\n        if (dims == 2)\n        {\n            if (axis == 0)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = h;\n                _hoffset = start >= 0 ? start : h + start;\n                _outh = std::min(h, end > 0 ? end : h + end) - _hoffset;\n            }\n            if (axis == 1)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = w;\n                _woffset = start >= 0 ? start : w + start;\n                _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n            }\n        }\n        if (dims == 3)\n        {\n            if (axis == 0)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = channels;\n                _coffset = start >= 0 ? start : channels + start;\n                _outc = std::min(channels, end > 0 ? 
end : channels + end) - _coffset;\n            }\n            if (axis == 1)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = h;\n                _hoffset = start >= 0 ? start : h + start;\n                _outh = std::min(h, end > 0 ? end : h + end) - _hoffset;\n            }\n            if (axis == 2)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = w;\n                _woffset = start >= 0 ? start : w + start;\n                _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n            }\n        }\n        if (dims == 4)\n        {\n            if (axis == 0)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = channels;\n                _coffset = start >= 0 ? start : channels + start;\n                _outc = std::min(channels, end > 0 ? end : channels + end) - _coffset;\n            }\n            if (axis == 1)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = d;\n                _doffset = start >= 0 ? start : d + start;\n                _outd = std::min(d, end > 0 ? end : d + end) - _doffset;\n            }\n            if (axis == 2)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = h;\n                _hoffset = start >= 0 ? start : h + start;\n                _outh = std::min(h, end > 0 ? end : h + end) - _hoffset;\n            }\n            if (axis == 3)\n            {\n                if (start == -233) start = 0;\n                if (end == -233) end = w;\n                _woffset = start >= 0 ? start : w + start;\n                _outw = std::min(w, end > 0 ? end : w + end) - _woffset;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/crop.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CROP_H\n#define LAYER_CROP_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Crop : public Layer\n{\npublic:\n    Crop();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void resolve_crop_roi(const Mat& bottom_blob, int& woffset, int& hoffset, int& doffset, int& coffset, int& outw, int& outh, int& outd, int& outc) const;\n    void resolve_crop_roi(const Mat& bottom_blob, const Mat& reference_blob, int& woffset, int& hoffset, int& doffset, int& coffset, int& outw, int& outh, int& outd, int& outc) const;\n    void resolve_crop_roi(const Mat& bottom_blob, const int* param_data, int& woffset, int& hoffset, int& doffset, int& coffset, int& outw, int& outh, int& outd, int& outc) const;\n    int eval_crop_expr(const std::vector<Mat>& bottom_blobs, int& woffset, int& hoffset, int& doffset, int& coffset, int& outw, int& outh, int& outd, int& outc) const;\n\npublic:\n    // -233 = dynamic offset from reference blob\n    int woffset;\n    int hoffset;\n    int doffset;\n    int coffset;\n\n    // -233 = remaining\n    int outw;\n    int outh;\n    int outd;\n    int outc;\n\n    // tail offset for cropping, ignored if reference blob is provided\n    // woffset is aka left, and woffset2 is aka right\n    int woffset2;\n    int hoffset2;\n    int doffset2;\n    int coffset2;\n\n    // numpy-style slice\n    // if provided, all the above attributes will be ignored\n    Mat starts;\n    Mat ends;\n    Mat axes;\n\n    // see docs/developer-guide/expression.md\n    std::string starts_expr;\n    std::string ends_expr;\n    std::string axes_expr;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CROP_H\n"
  },
  {
    "path": "src/layer/cumulativesum.cpp",
    "content": "// Copyright 2023 Xiaomi Corp.   (author: Fangjun Kuang)\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cumulativesum.h\"\n\nnamespace ncnn {\n\nCumulativeSum::CumulativeSum()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint CumulativeSum::load_param(const ParamDict& pd)\n{\n    axis = pd.get(0, 0);\n\n    return 0;\n}\n\nint CumulativeSum::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1)\n    {   // ignore axis\n        int w = bottom_top_blob.w;\n\n        float* ptr = bottom_top_blob;\n\n        for (int i = 1; i < w; ++i)\n        {\n            ptr[i] = ptr[i] + ptr[i - 1];\n        }\n\n        return 0;\n    } // if (dims == 1)\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // sum over rows\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        for (int i = 1; i < h; ++i)\n        {\n            const float* prev_row = bottom_top_blob.row(i - 1);\n            float* this_row = bottom_top_blob.row(i);\n\n            for (int k = 0; k < w; ++k)\n            {\n                this_row[k] = this_row[k] + prev_row[k];\n            }\n        }\n\n        return 0;\n    } // if (dims == 2 && positive_axis == 0)\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // sum over columns\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; ++i)\n        {\n            float* ptr = bottom_top_blob.row(i);\n\n            for (int k = 1; k < w; ++k)\n            {\n                ptr[k] = ptr[k] + ptr[k - 1];\n            }\n        }\n\n        return 0;\n    } // if (dims == 2 && positive_axis == 1)\n\n    if (dims == 3 && positive_axis == 0)\n    {\n        // sum over channels\n        int w = bottom_top_blob.w;\n        int h = 
bottom_top_blob.h;\n        int c = bottom_top_blob.c;\n\n        int size = w * h;\n\n        for (int i = 1; i < c; ++i)\n        {\n            const float* prev = bottom_top_blob.channel(i - 1);\n            float* cur = bottom_top_blob.channel(i);\n\n            for (int k = 0; k < size; ++k)\n            {\n                cur[k] = cur[k] + prev[k];\n            }\n        }\n\n        return 0;\n    } // if (dims == 3 && positive_axis == 0)\n\n    if (dims == 3 && positive_axis == 1)\n    {\n        // sum over rows within each channel\n\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int c = bottom_top_blob.c;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; ++q)\n        {\n            Mat this_channel = bottom_top_blob.channel(q);\n\n            for (int i = 1; i < h; ++i)\n            {\n                const float* prev_row = this_channel.row(i - 1);\n                float* this_row = this_channel.row(i);\n\n                for (int k = 0; k < w; ++k)\n                {\n                    this_row[k] = this_row[k] + prev_row[k];\n                }\n            }\n        }\n\n        return 0;\n    } // if (dims == 3 && positive_axis == 1)\n\n    if (dims == 3 && positive_axis == 2)\n    {\n        // sum over columns within each channel\n\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int c = bottom_top_blob.c;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; ++q)\n        {\n            Mat this_channel = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < h; ++i)\n            {\n                float* ptr = this_channel.row(i);\n                for (int k = 1; k < w; ++k)\n                {\n                    ptr[k] = ptr[k] + ptr[k - 1];\n                }\n            }\n        }\n\n        return 0;\n    } // if (dims == 3 && positive_axis == 2)\n\n    return 
-100;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/cumulativesum.h",
    "content": "// Copyright 2023 Xiaomi Corp.   (author: Fangjun Kuang)\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CUMULATIVESUM_H\n#define LAYER_CUMULATIVESUM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass CumulativeSum : public Layer\n{\npublic:\n    CumulativeSum();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    int axis;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CUMULATIVESUM_H\n"
  },
  {
    "path": "src/layer/deconvolution.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolution.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nDeconvolution::Deconvolution()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Deconvolution::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    output_pad_right = pd.get(18, 0);\n    output_pad_bottom = pd.get(19, output_pad_right);\n    output_w = pd.get(20, 0);\n    output_h = pd.get(21, output_w);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(28, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    return 0;\n}\n\nint Deconvolution::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int deconvolution(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int kernel_h, int stride_w, int stride_h, int dilation_w, int dilation_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = 
&_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = outw * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias = bias_data.empty() ? 0.f : bias_data[p];\n\n        out.fill(bias);\n\n        // shadowed variable for less openmp task args\n        const int w = bottom_blob.w;\n        const int h = bottom_blob.h;\n        const int inch = bottom_blob.c;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n\n        for (int i = 0; i < h; i++)\n        {\n            for (int j = 0; j < w; j++)\n            {\n                float* outptr = out.row(i * stride_h) + j * stride_w;\n\n                const float* kptr = (const float*)weight_data + maxk * inch * p;\n\n                for (int q = 0; q < inch; q++)\n                {\n                    const float val = bottom_blob.channel(q).row(i)[j];\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float w = kptr[k];\n                        outptr[space_ofs[k]] += val * w;\n                    }\n\n                    kptr += maxk;\n                }\n            }\n        }\n\n        {\n            float* outptr = out;\n            int size = outw * outh;\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Deconvolution::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int 
h = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output, elemsize, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output, elemsize, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolution(bottom_blob, top_blob_bordered, weight_data, bias_data, kernel_w, kernel_h, stride_w, stride_h, dilation_w, dilation_h, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint Deconvolution::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.c;\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.d * 1;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // transpose group-inch/group-outch/group-kh-kw to group-outch/group-inch/group-kh-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _kernel_h * _num_output * _num_input / 1, 4u, 
opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / 1;\n        const int inch_g = _num_input / 1;\n        const int maxk = _kernel_h * _kernel_w;\n\n        for (int g = 0; g < 1; g++)\n        {\n            // reorder weight from inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n    }\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (_kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, _num_output, 4u, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, _num_output, 4u, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolution(bottom_blob, 
top_blob_bordered, weight_data_transposed, bias_data_flattened, _kernel_w, _kernel_h, stride_w, stride_h, dilation_w, dilation_h, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nvoid Deconvolution::cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const\n{\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0)\n    {\n        copy_cut_border(top_blob_bordered, top_blob, pad_top, pad_bottom, pad_left, pad_right, opt);\n    }\n    else if (output_w > 0 && output_h > 0)\n    {\n        int wcut = top_blob_bordered.w - output_w;\n        int hcut = top_blob_bordered.h - output_h;\n\n        if (pad_left == -233 || pad_right == -233 || pad_top == -233 || pad_bottom == -233)\n        {\n            // onnx padding=SAME_UPPER\n            copy_cut_border(top_blob_bordered, top_blob, hcut / 2, hcut - hcut / 2, wcut / 2, wcut - wcut / 2, opt);\n        }\n        else if (pad_left == -234 || pad_right == -234 || pad_top == -234 || pad_bottom == -234)\n        {\n            // onnx padding=SAME_LOWER\n            copy_cut_border(top_blob_bordered, top_blob, hcut - hcut / 2, hcut / 2, wcut - wcut / 2, wcut / 2, opt);\n        }\n    }\n    else\n    {\n        top_blob = top_blob_bordered;\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deconvolution.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTION_H\n#define LAYER_DECONVOLUTION_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Deconvolution : public Layer\n{\npublic:\n    Deconvolution();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int dilation_w;\n    int dilation_h;\n    int stride_w;\n    int stride_h;\n    int pad_left;\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int output_pad_right;\n    int output_pad_bottom;\n    int output_w;\n    int output_h;\n    int bias_term;\n\n    int weight_data_size;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTION_H\n"
  },
  {
    "path": "src/layer/deconvolution1d.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolution1d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nDeconvolution1D::Deconvolution1D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Deconvolution1D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    dilation_w = pd.get(2, 1);\n    stride_w = pd.get(3, 1);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    output_pad_right = pd.get(18, 0);\n    output_w = pd.get(20, 0);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(28, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    return 0;\n}\n\nint Deconvolution1D::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int deconvolution1d(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int stride_w, int dilation_w, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int bias_term = bias_data.empty() ? 0 : 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outh; p++)\n    {\n        Mat out = top_blob.row_range(p, 1);\n\n        const float bias = bias_term ? 
bias_data[p] : 0.f;\n\n        out.fill(bias);\n\n        for (int j = 0; j < w; j++)\n        {\n            float* outptr = (float*)out + j * stride_w;\n\n            const float* kptr = (const float*)weight_data + kernel_w * h * p;\n\n            for (int q = 0; q < h; q++)\n            {\n                const float val = bottom_blob.row(q)[j];\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float w = kptr[k];\n                    outptr[k * dilation_w] += val * w;\n                }\n\n                kptr += kernel_w;\n            }\n        }\n\n        {\n            float* outptr = out;\n\n            for (int i = 0; i < outw; i++)\n            {\n                outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Deconvolution1D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || output_w > 0)\n    {\n        top_blob_bordered.create(outw, num_output, elemsize, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, num_output, elemsize, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolution1d(bottom_blob, top_blob_bordered, weight_data, bias_data, kernel_w, stride_w, dilation_w, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint Deconvolution1D::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& 
opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.h;\n    const int _kernel_w = _weight_data.w;\n    const int _num_output = _weight_data.h * 1;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // transpose group-inch/group-outch/group-kw to group-outch/group-inch/group-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _num_output * _num_input / 1, 4u, opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / 1;\n        const int inch_g = _num_input / 1;\n        const int maxk = _kernel_w;\n\n        for (int g = 0; g < 1; g++)\n        {\n            // reorder weight from inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n    }\n\n    const int w = bottom_blob.w;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n\n    Mat top_blob_bordered;\n    
if (pad_left > 0 || pad_right > 0 || output_w > 0)\n    {\n        top_blob_bordered.create(outw, _num_output, 4u, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, _num_output, 4u, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolution1d(bottom_blob, top_blob_bordered, weight_data_transposed, bias_data_flattened, _kernel_w, stride_w, dilation_w, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nvoid Deconvolution1D::cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const\n{\n    if (pad_left > 0 || pad_right > 0)\n    {\n        copy_cut_border(top_blob_bordered, top_blob, 0, 0, pad_left, pad_right, opt);\n    }\n    else if (output_w > 0)\n    {\n        int wcut = top_blob_bordered.w - output_w;\n\n        if (pad_left == -233 || pad_right == -233)\n        {\n            // onnx padding=SAME_UPPER\n            copy_cut_border(top_blob_bordered, top_blob, 0, 0, wcut / 2, wcut - wcut / 2, opt);\n        }\n        else if (pad_left == -234 || pad_right == -234)\n        {\n            // onnx padding=SAME_LOWER\n            copy_cut_border(top_blob_bordered, top_blob, 0, 0, wcut - wcut / 2, wcut / 2, opt);\n        }\n    }\n    else\n    {\n        top_blob = top_blob_bordered;\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deconvolution1d.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTION1D_H\n#define LAYER_DECONVOLUTION1D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Deconvolution1D : public Layer\n{\npublic:\n    Deconvolution1D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int dilation_w;\n    int stride_w;\n    int pad_left;\n    int pad_right;\n    int output_pad_right;\n    int output_w;\n    int bias_term;\n\n    int weight_data_size;\n\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTION1D_H\n"
  },
  {
    "path": "src/layer/deconvolution3d.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolution3d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nDeconvolution3D::Deconvolution3D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Deconvolution3D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    kernel_d = pd.get(21, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    dilation_d = pd.get(22, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    stride_d = pd.get(23, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    pad_front = pd.get(24, pad_left);\n    pad_behind = pd.get(17, pad_front);\n    output_pad_right = pd.get(18, 0);\n    output_pad_bottom = pd.get(19, output_pad_right);\n    output_pad_behind = pd.get(20, output_pad_right);\n    output_w = pd.get(25, 0);\n    output_h = pd.get(26, output_w);\n    output_d = pd.get(27, output_w);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    return 0;\n}\n\nint Deconvolution3D::load_model(const ModelBin& mb)\n{\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int deconvolution3d(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int kernel_h, int kernel_d, int stride_w, int stride_h, int stride_d, int dilation_w, int dilation_h, int dilation_d, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int 
outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h * kernel_d;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap0 = outw * dilation_h - kernel_w * dilation_w;\n        int gap1 = outh * outw * dilation_d - outw * kernel_h * dilation_h;\n        for (int z = 0; z < kernel_d; z++)\n        {\n            for (int i = 0; i < kernel_h; i++)\n            {\n                for (int j = 0; j < kernel_w; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2 += dilation_w;\n                }\n                p2 += gap0;\n            }\n            p2 += gap1;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out = top_blob.channel(p);\n\n        const float bias = bias_data.empty() ? 
0.f : bias_data[p];\n\n        out.fill(bias);\n\n        // shadowed variable for less openmp task args\n        const int w = bottom_blob.w;\n        const int h = bottom_blob.h;\n        const int d = bottom_blob.d;\n        const int inch = bottom_blob.c;\n        const int outw = top_blob.w;\n        const int outh = top_blob.h;\n        const int outd = top_blob.d;\n\n        for (int z = 0; z < d; z++)\n        {\n            for (int i = 0; i < h; i++)\n            {\n                for (int j = 0; j < w; j++)\n                {\n                    float* outptr = out.depth(z * stride_d).row(i * stride_h) + j * stride_w;\n\n                    const float* kptr = (const float*)weight_data + maxk * inch * p;\n\n                    for (int q = 0; q < inch; q++)\n                    {\n                        const float val = bottom_blob.channel(q).depth(z).row(i)[j];\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float w = kptr[k];\n                            outptr[space_ofs[k]] += val * w;\n                        }\n\n                        kptr += maxk;\n                    }\n                }\n            }\n        }\n\n        {\n            float* outptr = out;\n            int size = outw * outh * outd;\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Deconvolution3D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n    const int kernel_extent_d = dilation_d * (kernel_d - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w 
+ output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int outd = (d - 1) * stride_d + kernel_extent_d + output_pad_behind;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || pad_front > 0 || pad_behind > 0 || (output_w > 0 && output_h > 0 && output_d > 0))\n    {\n        top_blob_bordered.create(outw, outh, outd, num_output, elemsize, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, outd, num_output, elemsize, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolution3d(bottom_blob, top_blob_bordered, weight_data, bias_data, kernel_w, kernel_h, kernel_d, stride_w, stride_h, stride_d, dilation_w, dilation_h, dilation_d, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nvoid Deconvolution3D::cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const\n{\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || pad_front > 0 || pad_behind > 0)\n    {\n        copy_cut_border_3d(top_blob_bordered, top_blob, pad_top, pad_bottom, pad_left, pad_right, pad_front, pad_behind, opt);\n    }\n    else if (output_w > 0 && output_h > 0 && output_d > 0)\n    {\n        int wcut = top_blob_bordered.w - output_w;\n        int hcut = top_blob_bordered.h - output_h;\n        int dcut = top_blob_bordered.d - output_d;\n\n        if (pad_left == -233 || pad_right == -233 || pad_top == -233 || pad_bottom == -233 || pad_front == -233 || pad_behind == -233)\n        {\n            // onnx padding=SAME_UPPER\n            copy_cut_border_3d(top_blob_bordered, top_blob, hcut / 2, hcut - hcut / 2, wcut / 2, wcut - wcut / 2, dcut / 2, dcut - dcut / 2, opt);\n   
     }\n        else if (pad_left == -234 || pad_right == -234 || pad_top == -234 || pad_bottom == -234 || pad_front == -234 || pad_behind == -234)\n        {\n            // onnx padding=SAME_LOWER\n            copy_cut_border_3d(top_blob_bordered, top_blob, hcut - hcut / 2, hcut / 2, wcut - wcut / 2, wcut / 2, dcut - dcut / 2, dcut / 2, opt);\n        }\n    }\n    else\n    {\n        top_blob = top_blob_bordered;\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deconvolution3d.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTION3D_H\n#define LAYER_DECONVOLUTION3D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Deconvolution3D : public Layer\n{\npublic:\n    Deconvolution3D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    void cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int kernel_d;\n    int dilation_w;\n    int dilation_h;\n    int dilation_d;\n    int stride_w;\n    int stride_h;\n    int stride_d;\n    int pad_left;\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int pad_front;\n    int pad_behind;\n    int output_pad_right;\n    int output_pad_bottom;\n    int output_pad_behind;\n    int output_w;\n    int output_h;\n    int output_d;\n    int bias_term;\n\n    int weight_data_size;\n\n    int activation_type;\n    Mat activation_params;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTION3D_H\n"
  },
  {
    "path": "src/layer/deconvolutiondepthwise.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolutiondepthwise.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nDeconvolutionDepthWise::DeconvolutionDepthWise()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint DeconvolutionDepthWise::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    output_pad_right = pd.get(18, 0);\n    output_pad_bottom = pd.get(19, output_pad_right);\n    output_w = pd.get(20, 0);\n    output_h = pd.get(21, output_w);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    group = pd.get(7, 1);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(28, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    return 0;\n}\n\nint DeconvolutionDepthWise::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int deconvolutiondepthwise(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int kernel_h, int stride_w, int stride_h, int dilation_w, int dilation_h, int group, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int inch = bottom_blob.c;\n\n    const int outw = top_blob.w;\n    const int outch = top_blob.c;\n\n    
const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = outw * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // depth-wise\n    if (inch == group && group == outch)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            const float* inptr = bottom_blob.channel(g);\n            const float* kptr = (const float*)weight_data + maxk * g;\n            Mat out = top_blob.channel(g);\n\n            const float bias = bias_data.empty() ? 0.f : bias_data[g];\n\n            out.fill(bias);\n\n            // shadowed variable for less openmp task args\n            const int w = bottom_blob.w;\n            const int h = bottom_blob.h;\n            const int outw = top_blob.w;\n            const int outh = top_blob.h;\n\n            for (int i = 0; i < h; i++)\n            {\n                for (int j = 0; j < w; j++)\n                {\n                    float* outptr = out.row(i * stride_h) + j * stride_w;\n\n                    const float val = inptr[i * w + j];\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float w = kptr[k];\n                        outptr[space_ofs[k]] += val * w;\n                    }\n                }\n            }\n\n            {\n                float* outptr = out;\n                int size = outw * outh;\n\n                for (int i = 0; i < size; i++)\n                {\n                    outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n                }\n            
}\n        }\n    }\n    else\n    {\n        const int inch_g = inch / group;\n        const int outch_g = outch / group;\n\n#ifdef _WIN32\n        #pragma omp parallel for num_threads(opt.num_threads)\n#else\n        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)\n#endif\n        for (int g = 0; g < group; g++)\n        {\n            for (int p = 0; p < outch_g; p++)\n            {\n                Mat out = top_blob.channel(g * outch_g + p);\n\n                const float* weight_data_ptr = (const float*)weight_data + maxk * inch_g * outch_g * g;\n\n                const float bias = bias_data.empty() ? 0.f : bias_data[g * outch_g + p];\n\n                out.fill(bias);\n\n                // shadowed variable for less openmp task args\n                const int w = bottom_blob.w;\n                const int h = bottom_blob.h;\n                const int outw = top_blob.w;\n                const int outh = top_blob.h;\n\n                for (int i = 0; i < h; i++)\n                {\n                    for (int j = 0; j < w; j++)\n                    {\n                        float* outptr = out.row(i * stride_h) + j * stride_w;\n\n                        const float* kptr = weight_data_ptr + maxk * inch_g * p;\n\n                        for (int q = 0; q < inch_g; q++)\n                        {\n                            const float val = bottom_blob.channel(inch_g * g + q).row(i)[j];\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                outptr[space_ofs[k]] += val * kptr[k];\n                            }\n\n                            kptr += maxk;\n                        }\n                    }\n                }\n\n                {\n                    float* outptr = out;\n                    int size = outw * outh;\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        outptr[i] = activation_ss(outptr[i], 
activation_type, activation_params);\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint DeconvolutionDepthWise::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output, elemsize, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output, elemsize, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolutiondepthwise(bottom_blob, top_blob_bordered, weight_data, bias_data, kernel_w, kernel_h, stride_w, stride_h, dilation_w, dilation_h, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint DeconvolutionDepthWise::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.c;\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.d * group;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if 
(weight_data_flattened.empty())\n        return -100;\n\n    // transpose group-inch/group-outch/group-kh-kw to group-outch/group-inch/group-kh-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _kernel_h * _num_output * _num_input / group, 4u, opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / group;\n        const int inch_g = _num_input / group;\n        const int maxk = _kernel_h * _kernel_w;\n\n        for (int g = 0; g < group; g++)\n        {\n            // reorder weight from inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n    }\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (_kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, 
_num_output, 4u, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, _num_output, 4u, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolutiondepthwise(bottom_blob, top_blob_bordered, weight_data_transposed, bias_data_flattened, _kernel_w, _kernel_h, stride_w, stride_h, dilation_w, dilation_h, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nvoid DeconvolutionDepthWise::cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const\n{\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0)\n    {\n        copy_cut_border(top_blob_bordered, top_blob, pad_top, pad_bottom, pad_left, pad_right, opt);\n    }\n    else if (output_w > 0 && output_h > 0)\n    {\n        int wcut = top_blob_bordered.w - output_w;\n        int hcut = top_blob_bordered.h - output_h;\n\n        if (pad_left == -233 || pad_right == -233 || pad_top == -233 || pad_bottom == -233)\n        {\n            // onnx padding=SAME_UPPER\n            copy_cut_border(top_blob_bordered, top_blob, hcut / 2, hcut - hcut / 2, wcut / 2, wcut - wcut / 2, opt);\n        }\n        else if (pad_left == -234 || pad_right == -234 || pad_top == -234 || pad_bottom == -234)\n        {\n            // onnx padding=SAME_LOWER\n            copy_cut_border(top_blob_bordered, top_blob, hcut - hcut / 2, hcut / 2, wcut - wcut / 2, wcut / 2, opt);\n        }\n    }\n    else\n    {\n        top_blob = top_blob_bordered;\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deconvolutiondepthwise.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTIONDEPTHWISE_H\n#define LAYER_DECONVOLUTIONDEPTHWISE_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass DeconvolutionDepthWise : public Layer\n{\npublic:\n    DeconvolutionDepthWise();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int dilation_w;\n    int dilation_h;\n    int stride_w;\n    int stride_h;\n    int pad_left;\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int output_pad_right;\n    int output_pad_bottom;\n    int output_w;\n    int output_h;\n    int bias_term;\n\n    int weight_data_size;\n    int group;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTIONDEPTHWISE_H\n"
  },
  {
    "path": "src/layer/deconvolutiondepthwise1d.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolutiondepthwise1d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nDeconvolutionDepthWise1D::DeconvolutionDepthWise1D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint DeconvolutionDepthWise1D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    dilation_w = pd.get(2, 1);\n    stride_w = pd.get(3, 1);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    output_pad_right = pd.get(18, 0);\n    output_w = pd.get(20, 0);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    group = pd.get(7, 1);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    dynamic_weight = pd.get(28, 0);\n\n    if (dynamic_weight)\n    {\n        one_blob_only = false;\n    }\n\n    return 0;\n}\n\nint DeconvolutionDepthWise1D::load_model(const ModelBin& mb)\n{\n    if (dynamic_weight)\n        return 0;\n\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int deconvolutiondepthwise1d(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int stride_w, int dilation_w, int group, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n\n    const int bias_term = bias_data.empty() ? 
0 : 1;\n\n    // depth-wise\n    if (h == group && group == outh)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat out = top_blob.row_range(g, 1);\n\n            const float* inptr = bottom_blob.row(g);\n            const float* kptr = (const float*)weight_data + kernel_w * g;\n\n            const float bias = bias_term ? bias_data[g] : 0.f;\n\n            out.fill(bias);\n\n            for (int j = 0; j < w; j++)\n            {\n                float* outptr = (float*)out + j * stride_w;\n\n                const float val = inptr[j];\n\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    float w = kptr[k];\n                    outptr[k * dilation_w] += val * w;\n                }\n            }\n\n            {\n                float* outptr = out;\n\n                for (int i = 0; i < outw; i++)\n                {\n                    outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n                }\n            }\n        }\n    }\n    else\n    {\n        const int h_g = h / group;\n        const int outh_g = outh / group;\n\n#ifdef _WIN32\n        #pragma omp parallel for num_threads(opt.num_threads)\n#else\n        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)\n#endif\n        for (int g = 0; g < group; g++)\n        {\n            for (int p = 0; p < outh_g; p++)\n            {\n                Mat out = top_blob.row_range(g * outh_g + p, 1);\n\n                const float* weight_data_ptr = (const float*)weight_data + kernel_w * h_g * outh_g * g;\n                const float bias = bias_term ? 
bias_data[g * outh_g + p] : 0.f;\n\n                out.fill(bias);\n\n                for (int j = 0; j < w; j++)\n                {\n                    float* outptr = (float*)out + j * stride_w;\n\n                    const float* kptr = weight_data_ptr + kernel_w * h_g * p;\n\n                    for (int q = 0; q < h_g; q++)\n                    {\n                        const float val = bottom_blob.row(h_g * g + q)[j];\n\n                        for (int k = 0; k < kernel_w; k++)\n                        {\n                            outptr[k * dilation_w] += val * kptr[k];\n                        }\n\n                        kptr += kernel_w;\n                    }\n                }\n\n                {\n                    float* outptr = out;\n\n                    for (int i = 0; i < outw; i++)\n                    {\n                        outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint DeconvolutionDepthWise1D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || output_w > 0)\n    {\n        top_blob_bordered.create(outw, num_output, elemsize, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, num_output, elemsize, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolutiondepthwise1d(bottom_blob, top_blob_bordered, weight_data, bias_data, kernel_w, stride_w, dilation_w, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    
cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint DeconvolutionDepthWise1D::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.h;\n    const int _kernel_w = _weight_data.w;\n    const int _num_output = _weight_data.h * group;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // transpose group-inch/group-outch/group-kw to group-outch/group-inch/group-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _num_output * _num_input / group, 4u, opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / group;\n        const int inch_g = _num_input / group;\n        const int maxk = _kernel_w;\n\n        for (int g = 0; g < group; g++)\n        {\n            // reorder weight from inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if 
(bias_data_flattened.empty())\n            return -100;\n    }\n\n    const int w = bottom_blob.w;\n\n    const int kernel_extent_w = dilation_w * (_kernel_w - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || output_w > 0)\n    {\n        top_blob_bordered.create(outw, _num_output, 4u, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, _num_output, 4u, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolutiondepthwise1d(bottom_blob, top_blob_bordered, weight_data_transposed, bias_data_flattened, _kernel_w, stride_w, dilation_w, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nvoid DeconvolutionDepthWise1D::cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const\n{\n    if (pad_left > 0 || pad_right > 0)\n    {\n        copy_cut_border(top_blob_bordered, top_blob, 0, 0, pad_left, pad_right, opt);\n    }\n    else if (output_w > 0)\n    {\n        int wcut = top_blob_bordered.w - output_w;\n\n        if (pad_left == -233 || pad_right == -233)\n        {\n            // onnx padding=SAME_UPPER\n            copy_cut_border(top_blob_bordered, top_blob, 0, 0, wcut / 2, wcut - wcut / 2, opt);\n        }\n        else if (pad_left == -234 || pad_right == -234)\n        {\n            // onnx padding=SAME_LOWER\n            copy_cut_border(top_blob_bordered, top_blob, 0, 0, wcut - wcut / 2, wcut / 2, opt);\n        }\n    }\n    else\n    {\n        top_blob = top_blob_bordered;\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deconvolutiondepthwise1d.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTIONDEPTHWISE1D_H\n#define LAYER_DECONVOLUTIONDEPTHWISE1D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass DeconvolutionDepthWise1D : public Layer\n{\npublic:\n    DeconvolutionDepthWise1D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    void cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int dilation_w;\n    int stride_w;\n    int pad_left;\n    int pad_right;\n    int output_pad_right;\n    int output_w;\n    int bias_term;\n\n    int weight_data_size;\n    int group;\n\n    int activation_type;\n    Mat activation_params;\n\n    int dynamic_weight;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTIONDEPTHWISE1D_H\n"
  },
  {
    "path": "src/layer/deconvolutiondepthwise3d.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolutiondepthwise3d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nDeconvolutionDepthWise3D::DeconvolutionDepthWise3D()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint DeconvolutionDepthWise3D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    kernel_d = pd.get(21, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    dilation_d = pd.get(22, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    stride_d = pd.get(23, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    pad_front = pd.get(24, pad_left);\n    pad_behind = pd.get(17, pad_front);\n    output_pad_right = pd.get(18, 0);\n    output_pad_bottom = pd.get(19, output_pad_right);\n    output_pad_behind = pd.get(20, output_pad_right);\n    output_w = pd.get(25, 0);\n    output_h = pd.get(26, output_w);\n    output_d = pd.get(27, output_w);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    group = pd.get(7, 1);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    return 0;\n}\n\nint DeconvolutionDepthWise3D::load_model(const ModelBin& mb)\n{\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic int deconvolutiondepthwise3d(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data, const Mat& bias_data, int kernel_w, int kernel_h, int kernel_d, int stride_w, int stride_h, int stride_d, int dilation_w, int dilation_h, int dilation_d, int 
group, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    const int inch = bottom_blob.c;\n\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h * kernel_d;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap0 = outw * dilation_h - kernel_w * dilation_w;\n        int gap1 = outh * outw * dilation_d - outw * kernel_h * dilation_h;\n        for (int z = 0; z < kernel_d; z++)\n        {\n            for (int i = 0; i < kernel_h; i++)\n            {\n                for (int j = 0; j < kernel_w; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2 += dilation_w;\n                }\n                p2 += gap0;\n            }\n            p2 += gap1;\n        }\n    }\n\n    // depth-wise\n    if (inch == group && group == outch)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            const float* inptr = bottom_blob.channel(g);\n            const float* kptr = (const float*)weight_data + maxk * g;\n            Mat out = top_blob.channel(g);\n\n            const float bias = bias_data.empty() ? 
0.f : bias_data[g];\n\n            out.fill(bias);\n\n            // shadowed variable for less openmp task args\n            const int w = bottom_blob.w;\n            const int h = bottom_blob.h;\n            const int d = bottom_blob.d;\n            const int outw = top_blob.w;\n            const int outh = top_blob.h;\n            const int outd = top_blob.d;\n\n            for (int z = 0; z < d; z++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    for (int j = 0; j < w; j++)\n                    {\n                        float* outptr = out.depth(z * stride_d).row(i * stride_h) + j * stride_w;\n\n                        const float val = inptr[z * w * h + i * w + j];\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            float w = kptr[k];\n                            outptr[space_ofs[k]] += val * w;\n                        }\n                    }\n                }\n            }\n\n            {\n                float* outptr = out;\n                int size = outw * outh * outd;\n\n                for (int i = 0; i < size; i++)\n                {\n                    outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n                }\n            }\n        }\n    }\n    else\n    {\n        const int inch_g = inch / group;\n        const int outch_g = outch / group;\n\n#ifdef _WIN32\n        #pragma omp parallel for num_threads(opt.num_threads)\n#else\n        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)\n#endif\n        for (int g = 0; g < group; g++)\n        {\n            for (int p = 0; p < outch_g; p++)\n            {\n                Mat out = top_blob.channel(g * outch_g + p);\n\n                const float* weight_data_ptr = (const float*)weight_data + maxk * inch_g * outch_g * g;\n\n                const float bias = bias_data.empty() ? 
0.f : bias_data[g * outch_g + p];\n\n                out.fill(bias);\n\n                // shadowed variable for less openmp task args\n                const int w = bottom_blob.w;\n                const int h = bottom_blob.h;\n                const int d = bottom_blob.d;\n                const int outw = top_blob.w;\n                const int outh = top_blob.h;\n                const int outd = top_blob.d;\n\n                for (int z = 0; z < d; z++)\n                {\n                    for (int i = 0; i < h; i++)\n                    {\n                        for (int j = 0; j < w; j++)\n                        {\n                            float* outptr = out.depth(z * stride_d).row(i * stride_h) + j * stride_w;\n\n                            const float* kptr = weight_data_ptr + maxk * inch_g * p;\n\n                            for (int q = 0; q < inch_g; q++)\n                            {\n                                const float val = bottom_blob.channel(inch_g * g + q).depth(z).row(i)[j];\n\n                                for (int k = 0; k < maxk; k++)\n                                {\n                                    outptr[space_ofs[k]] += val * kptr[k];\n                                }\n\n                                kptr += maxk;\n                            }\n                        }\n                    }\n                }\n\n                {\n                    float* outptr = out;\n                    int size = outw * outh * outd;\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        outptr[i] = activation_ss(outptr[i], activation_type, activation_params);\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint DeconvolutionDepthWise3D::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    size_t elemsize = 
bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n    const int kernel_extent_d = dilation_d * (kernel_d - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int outd = (d - 1) * stride_d + kernel_extent_d + output_pad_behind;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || pad_front > 0 || pad_behind > 0 || (output_w > 0 && output_h > 0 && output_d > 0))\n    {\n        top_blob_bordered.create(outw, outh, outd, num_output, elemsize, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, outd, num_output, elemsize, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    int ret = deconvolutiondepthwise3d(bottom_blob, top_blob_bordered, weight_data, bias_data, kernel_w, kernel_h, kernel_d, stride_w, stride_h, stride_d, dilation_w, dilation_h, dilation_d, group, activation_type, activation_params, opt);\n    if (ret != 0)\n        return ret;\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nvoid DeconvolutionDepthWise3D::cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const\n{\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || pad_front > 0 || pad_behind > 0)\n    {\n        copy_cut_border_3d(top_blob_bordered, top_blob, pad_top, pad_bottom, pad_left, pad_right, pad_front, pad_behind, opt);\n    }\n    else if (output_w > 0 && output_h > 0 && output_d > 0)\n    {\n        int wcut = top_blob_bordered.w - output_w;\n        int hcut = top_blob_bordered.h - output_h;\n        int dcut = top_blob_bordered.d - output_d;\n\n        if (pad_left == -233 || 
pad_right == -233 || pad_top == -233 || pad_bottom == -233 || pad_front == -233 || pad_behind == -233)\n        {\n            // onnx padding=SAME_UPPER\n            copy_cut_border_3d(top_blob_bordered, top_blob, hcut / 2, hcut - hcut / 2, wcut / 2, wcut - wcut / 2, dcut / 2, dcut - dcut / 2, opt);\n        }\n        else if (pad_left == -234 || pad_right == -234 || pad_top == -234 || pad_bottom == -234 || pad_front == -234 || pad_behind == -234)\n        {\n            // onnx padding=SAME_LOWER\n            copy_cut_border_3d(top_blob_bordered, top_blob, hcut - hcut / 2, hcut / 2, wcut - wcut / 2, wcut / 2, dcut - dcut / 2, dcut / 2, opt);\n        }\n    }\n    else\n    {\n        top_blob = top_blob_bordered;\n    }\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deconvolutiondepthwise3d.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTIONDEPTHWISE3D_H\n#define LAYER_DECONVOLUTIONDEPTHWISE3D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass DeconvolutionDepthWise3D : public Layer\n{\npublic:\n    DeconvolutionDepthWise3D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    void cut_padding(const Mat& top_blob_bordered, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int kernel_d;\n    int dilation_w;\n    int dilation_h;\n    int dilation_d;\n    int stride_w;\n    int stride_h;\n    int stride_d;\n    int pad_left;\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int pad_front;\n    int pad_behind;\n    int output_pad_right;\n    int output_pad_bottom;\n    int output_pad_behind;\n    int output_w;\n    int output_h;\n    int output_d;\n    int bias_term;\n\n    int weight_data_size;\n    int group;\n\n    int activation_type;\n    Mat activation_params;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTIONDEPTHWISE3D_H\n"
  },
  {
    "path": "src/layer/deepcopy.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deepcopy.h\"\n\nnamespace ncnn {\n\nDeepCopy::DeepCopy()\n{\n    one_blob_only = true;\n    support_inplace = false;\n    support_packing = true;\n}\n\nint DeepCopy::forward(const Mat& bottom_blob, Mat& top_blob, const Option& /*opt*/) const\n{\n    top_blob = bottom_blob.clone();\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deepcopy.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DEEPCOPY_H\n#define LAYER_DEEPCOPY_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass DeepCopy : public Layer\n{\npublic:\n    DeepCopy();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DEEPCOPY_H\n"
  },
  {
    "path": "src/layer/deformableconv2d.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deformableconv2d.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nDeformableConv2D::DeformableConv2D()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint DeformableConv2D::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    bias_term = pd.get(5, 0);\n    weight_data_size = pd.get(6, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n    return 0;\n}\n\nint DeformableConv2D::load_model(const ModelBin& mb)\n{\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n    return 0;\n}\n\nint DeformableConv2D::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& offset = bottom_blobs[1];\n\n    const bool has_mask = (bottom_blobs.size() == 3);\n\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int in_c = bottom_blob.c;\n    const size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    const int out_w = (w + pad_left + pad_right - kernel_extent_w) / stride_w + 1;\n    const int out_h = (h + pad_top + pad_bottom - kernel_extent_h) / stride_h + 1;\n\n    // output.shape is [num_output, out_h, out_w] 
(in python).\n    Mat& output = top_blobs[0];\n    output.create(out_w, out_h, num_output, elemsize, opt.blob_allocator);\n    if (output.empty())\n        return -100;\n\n    const float* weight_ptr = weight_data;\n    const float* bias_ptr = weight_data;\n    if (bias_term)\n        bias_ptr = bias_data;\n\n    // deformable conv\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int h_col = 0; h_col < out_h; h_col++)\n    {\n        for (int w_col = 0; w_col < out_w; w_col++)\n        {\n            int h_in = h_col * stride_h - pad_top;\n            int w_in = w_col * stride_w - pad_left;\n            for (int oc = 0; oc < num_output; oc++)\n            {\n                float sum = 0.f;\n                if (bias_term)\n                    sum = bias_ptr[oc];\n                for (int i = 0; i < kernel_h; i++)\n                {\n                    for (int j = 0; j < kernel_w; j++)\n                    {\n                        const float offset_h = offset.channel((i * kernel_w + j) * 2).row(h_col)[w_col];\n                        const float offset_w = offset.channel((i * kernel_w + j) * 2 + 1).row(h_col)[w_col];\n                        const float mask_ = has_mask ? 
bottom_blobs[2].channel(i * kernel_w + j).row(h_col)[w_col] : 1.f;\n                        const float h_im = h_in + i * dilation_h + offset_h;\n                        const float w_im = w_in + j * dilation_w + offset_w;\n\n                        // Bilinear\n                        const bool cond = h_im > -1 && w_im > -1 && h_im < h && w_im < w;\n                        int h_low = 0;\n                        int w_low = 0;\n                        int h_high = 0;\n                        int w_high = 0;\n                        float w1 = 0.f;\n                        float w2 = 0.f;\n                        float w3 = 0.f;\n                        float w4 = 0.f;\n                        bool v1_cond = false;\n                        bool v2_cond = false;\n                        bool v3_cond = false;\n                        bool v4_cond = false;\n                        if (cond)\n                        {\n                            h_low = (int)floorf(h_im);\n                            w_low = (int)floorf(w_im);\n                            h_high = h_low + 1;\n                            w_high = w_low + 1;\n\n                            float lh = h_im - h_low;\n                            float lw = w_im - w_low;\n                            float hh = 1 - lh;\n                            float hw = 1 - lw;\n\n                            v1_cond = (h_low >= 0 && w_low >= 0);\n                            v2_cond = (h_low >= 0 && w_high <= w - 1);\n                            v3_cond = (h_high <= h - 1 && w_low >= 0);\n                            v4_cond = (h_high <= h - 1 && w_high <= w - 1);\n\n                            w1 = hh * hw;\n                            w2 = hh * lw;\n                            w3 = lh * hw;\n                            w4 = lh * lw;\n                        }\n\n                        for (int c_im = 0; c_im < in_c; c_im++)\n                        {\n                            float val = 0.f;\n                       
     if (cond)\n                            {\n                                float v1 = v1_cond ? bottom_blob.channel(c_im).row(h_low)[w_low] : 0.f;\n                                float v2 = v2_cond ? bottom_blob.channel(c_im).row(h_low)[w_high] : 0.f;\n                                float v3 = v3_cond ? bottom_blob.channel(c_im).row(h_high)[w_low] : 0.f;\n                                float v4 = v4_cond ? bottom_blob.channel(c_im).row(h_high)[w_high] : 0.f;\n                                val = w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4;\n                            }\n                            sum += val * mask_ * weight_ptr[((oc * in_c + c_im) * kernel_h + i) * kernel_w + j];\n                        }\n                    }\n                }\n                output.channel(oc).row(h_col)[w_col] = activation_ss(sum, activation_type, activation_params);\n            }\n        }\n    }\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/deformableconv2d.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DEFORMABLECONV2D_H\n#define LAYER_DEFORMABLECONV2D_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass DeformableConv2D : public Layer\n{\npublic:\n    DeformableConv2D();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int kernel_w;\n    int kernel_h;\n    int dilation_w;\n    int dilation_h;\n    int stride_w;\n    int stride_h;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int bias_term;\n\n    int weight_data_size;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DEFORMABLECONV2D_H\n"
  },
  {
    "path": "src/layer/dequantize.cpp",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"dequantize.h\"\n\nnamespace ncnn {\n\nDequantize::Dequantize()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Dequantize::load_param(const ParamDict& pd)\n{\n    scale_data_size = pd.get(0, 1);\n    bias_data_size = pd.get(1, 0);\n\n    return 0;\n}\n\nint Dequantize::load_model(const ModelBin& mb)\n{\n    scale_data = mb.load(scale_data_size, 1);\n    if (scale_data.empty())\n        return -100;\n\n    if (bias_data_size)\n    {\n        bias_data = mb.load(bias_data_size, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n    return 0;\n}\n\nstatic void dequantize(const int* intptr, float* ptr, float scale, float bias, int size)\n{\n    for (int i = 0; i < size; i++)\n    {\n        *ptr = *intptr * scale + bias;\n        intptr++;\n        ptr++;\n    }\n}\n\nint Dequantize::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 1)\n    {\n        // assert scale_data_size == 1\n        // assert bias_data_size == 0 || bias_data_size == 1\n\n        const int* intptr = bottom_blob;\n        float* ptr = top_blob;\n\n        const float scale = scale_data[0];\n        const float bias = bias_data_size == 0 ? 0.f : bias_data[0];\n\n        dequantize(intptr, ptr, scale, bias, w);\n    }\n\n    if (dims == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            const int* intptr = bottom_blob.row<const int>(i);\n            float* ptr = top_blob.row(i);\n\n            const float scale = scale_data_size == 1 ? 
scale_data[0] : scale_data[i];\n            const float bias = bias_data_size == 0 ? 0.f : bias_data_size == 1 ? bias_data[0] : bias_data[i];\n\n            dequantize(intptr, ptr, scale, bias, w);\n        }\n    }\n\n    if (dims == 3)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int* intptr = bottom_blob.channel(q);\n            float* ptr = top_blob.channel(q);\n\n            const float scale = scale_data_size == 1 ? scale_data[0] : scale_data[q];\n            const float bias = bias_data_size == 0 ? 0.f : bias_data_size == 1 ? bias_data[0] : bias_data[q];\n\n            dequantize(intptr, ptr, scale, bias, w * h);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/dequantize.h",
    "content": "// Copyright 2018 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DEQUANTIZE_H\n#define LAYER_DEQUANTIZE_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Dequantize : public Layer\n{\npublic:\n    Dequantize();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    int scale_data_size;\n    int bias_data_size;\n\n    Mat scale_data;\n    Mat bias_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DEQUANTIZE_H\n"
  },
  {
    "path": "src/layer/detectionoutput.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"detectionoutput.h\"\n\nnamespace ncnn {\n\nDetectionOutput::DetectionOutput()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint DetectionOutput::load_param(const ParamDict& pd)\n{\n    num_class = pd.get(0, 0);\n    nms_threshold = pd.get(1, 0.05f);\n    nms_top_k = pd.get(2, 300);\n    keep_top_k = pd.get(3, 100);\n    confidence_threshold = pd.get(4, 0.5f);\n    variances[0] = pd.get(5, 0.1f);\n    variances[1] = pd.get(6, 0.1f);\n    variances[2] = pd.get(7, 0.2f);\n    variances[3] = pd.get(8, 0.2f);\n\n    return 0;\n}\n\nstruct BBoxRect\n{\n    float xmin;\n    float ymin;\n    float xmax;\n    float ymax;\n    int label;\n};\n\nstatic inline float intersection_area(const BBoxRect& a, const BBoxRect& b)\n{\n    if (a.xmin > b.xmax || a.xmax < b.xmin || a.ymin > b.ymax || a.ymax < b.ymin)\n    {\n        // no intersection\n        return 0.f;\n    }\n\n    float inter_width = std::min(a.xmax, b.xmax) - std::max(a.xmin, b.xmin);\n    float inter_height = std::min(a.ymax, b.ymax) - std::max(a.ymin, b.ymin);\n\n    return inter_width * inter_height;\n}\n\ntemplate<typename T>\nstatic void qsort_descent_inplace(std::vector<T>& datas, std::vector<float>& scores, int left, int right)\n{\n    int i = left;\n    int j = right;\n    float p = scores[(left + right) / 2];\n\n    while (i <= j)\n    {\n        while (scores[i] > p)\n            i++;\n\n        while (scores[j] < p)\n            j--;\n\n        if (i <= j)\n        {\n            // swap\n            std::swap(datas[i], datas[j]);\n            std::swap(scores[i], scores[j]);\n\n            i++;\n            j--;\n        }\n    }\n\n    if (left < j)\n        qsort_descent_inplace(datas, scores, left, j);\n\n    if (i < right)\n        qsort_descent_inplace(datas, scores, i, right);\n}\n\ntemplate<typename T>\nstatic void qsort_descent_inplace(std::vector<T>& datas, std::vector<float>& 
scores)\n{\n    if (datas.empty() || scores.empty())\n        return;\n\n    qsort_descent_inplace(datas, scores, 0, static_cast<int>(scores.size() - 1));\n}\n\nstatic void nms_sorted_bboxes(const std::vector<BBoxRect>& bboxes, std::vector<size_t>& picked, float nms_threshold)\n{\n    picked.clear();\n\n    const size_t n = bboxes.size();\n\n    std::vector<float> areas(n);\n    for (size_t i = 0; i < n; i++)\n    {\n        const BBoxRect& r = bboxes[i];\n\n        float width = r.xmax - r.xmin;\n        float height = r.ymax - r.ymin;\n\n        areas[i] = width * height;\n    }\n\n    for (size_t i = 0; i < n; i++)\n    {\n        const BBoxRect& a = bboxes[i];\n\n        int keep = 1;\n        for (int j = 0; j < (int)picked.size(); j++)\n        {\n            const BBoxRect& b = bboxes[picked[j]];\n\n            // intersection over union\n            float inter_area = intersection_area(a, b);\n            float union_area = areas[i] + areas[picked[j]] - inter_area;\n            //             float IoU = inter_area / union_area\n            if (inter_area / union_area > nms_threshold)\n                keep = 0;\n        }\n\n        if (keep)\n            picked.push_back(i);\n    }\n}\n\nint DetectionOutput::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& location = bottom_blobs[0];\n    const Mat& confidence = bottom_blobs[1];\n    const Mat& priorbox = bottom_blobs[2];\n\n    bool mxnet_ssd_style = num_class == -233;\n\n    // mxnet-ssd _contrib_MultiBoxDetection\n    const int num_prior = mxnet_ssd_style ? priorbox.h : priorbox.w / 4;\n\n    int num_class_copy = mxnet_ssd_style ? 
confidence.h : num_class;\n\n    // apply location with priorbox\n    Mat bboxes;\n    bboxes.create(4, num_prior, 4u, opt.workspace_allocator);\n    if (bboxes.empty())\n        return -100;\n\n    const float* location_ptr = location;\n    const float* priorbox_ptr = priorbox.row(0);\n    const float* variance_ptr = mxnet_ssd_style ? 0 : priorbox.row(1);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int i = 0; i < num_prior; i++)\n    {\n        // if score of background class is larger than confidence threshold\n        float score = mxnet_ssd_style ? confidence[i] : confidence[static_cast<size_t>(i) * static_cast<size_t>(num_class_copy)];\n        if (score >= (1.0 - confidence_threshold))\n        {\n            continue;\n        }\n        const float* loc = location_ptr + i * 4;\n        const float* pb = priorbox_ptr + i * 4;\n        const float* var = variance_ptr ? variance_ptr + i * 4 : variances;\n\n        float* bbox = bboxes.row(i);\n\n        // CENTER_SIZE\n        float pb_w = pb[2] - pb[0];\n        float pb_h = pb[3] - pb[1];\n        float pb_cx = (pb[0] + pb[2]) * 0.5f;\n        float pb_cy = (pb[1] + pb[3]) * 0.5f;\n\n        float bbox_cx = var[0] * loc[0] * pb_w + pb_cx;\n        float bbox_cy = var[1] * loc[1] * pb_h + pb_cy;\n        float bbox_w = expf(var[2] * loc[2]) * pb_w;\n        float bbox_h = expf(var[3] * loc[3]) * pb_h;\n\n        bbox[0] = bbox_cx - bbox_w * 0.5f;\n        bbox[1] = bbox_cy - bbox_h * 0.5f;\n        bbox[2] = bbox_cx + bbox_w * 0.5f;\n        bbox[3] = bbox_cy + bbox_h * 0.5f;\n    }\n\n    // sort and nms for each class\n    std::vector<std::vector<BBoxRect> > all_class_bbox_rects;\n    std::vector<std::vector<float> > all_class_bbox_scores;\n    all_class_bbox_rects.resize(num_class_copy);\n    all_class_bbox_scores.resize(num_class_copy);\n\n    // start from 1 to ignore background class\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int i = 1; i < 
num_class_copy; i++)\n    {\n        // filter by confidence_threshold\n        std::vector<BBoxRect> class_bbox_rects;\n        std::vector<float> class_bbox_scores;\n\n        for (int j = 0; j < num_prior; j++)\n        {\n            // prob data layout\n            // caffe-ssd = num_class x num_prior\n            // mxnet-ssd = num_prior x num_class\n            float score = mxnet_ssd_style ? confidence[i * num_prior + j] : confidence[j * num_class_copy + i];\n\n            if (score > confidence_threshold)\n            {\n                const float* bbox = bboxes.row(j);\n                BBoxRect c = {bbox[0], bbox[1], bbox[2], bbox[3], i};\n                class_bbox_rects.push_back(c);\n                class_bbox_scores.push_back(score);\n            }\n        }\n\n        // sort inplace\n        qsort_descent_inplace(class_bbox_rects, class_bbox_scores);\n\n        // keep nms_top_k\n        if (nms_top_k < (int)class_bbox_rects.size())\n        {\n            class_bbox_rects.resize(nms_top_k);\n            class_bbox_scores.resize(nms_top_k);\n        }\n\n        // apply nms\n        std::vector<size_t> picked;\n        nms_sorted_bboxes(class_bbox_rects, picked, nms_threshold);\n\n        // select\n        for (size_t j = 0; j < picked.size(); j++)\n        {\n            size_t z = picked[j];\n            all_class_bbox_rects[i].push_back(class_bbox_rects[z]);\n            all_class_bbox_scores[i].push_back(class_bbox_scores[z]);\n        }\n    }\n\n    // gather all class\n    std::vector<BBoxRect> bbox_rects;\n    std::vector<float> bbox_scores;\n\n    for (int i = 1; i < num_class_copy; i++)\n    {\n        const std::vector<BBoxRect>& class_bbox_rects = all_class_bbox_rects[i];\n        const std::vector<float>& class_bbox_scores = all_class_bbox_scores[i];\n\n        bbox_rects.insert(bbox_rects.end(), class_bbox_rects.begin(), class_bbox_rects.end());\n        bbox_scores.insert(bbox_scores.end(), class_bbox_scores.begin(), 
class_bbox_scores.end());\n    }\n\n    // global sort inplace\n    qsort_descent_inplace(bbox_rects, bbox_scores);\n\n    // keep_top_k\n    if (keep_top_k < (int)bbox_rects.size())\n    {\n        bbox_rects.resize(keep_top_k);\n        bbox_scores.resize(keep_top_k);\n    }\n\n    // fill result\n    int num_detected = static_cast<int>(bbox_rects.size());\n    if (num_detected == 0)\n        return 0;\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(6, num_detected, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    for (int i = 0; i < num_detected; i++)\n    {\n        const BBoxRect& r = bbox_rects[i];\n        float score = bbox_scores[i];\n        float* outptr = top_blob.row(i);\n\n        outptr[0] = static_cast<float>(r.label);\n        outptr[1] = score;\n        outptr[2] = r.xmin;\n        outptr[3] = r.ymin;\n        outptr[4] = r.xmax;\n        outptr[5] = r.ymax;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/detectionoutput.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DETECTIONOUTPUT_H\n#define LAYER_DETECTIONOUTPUT_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass DetectionOutput : public Layer\n{\npublic:\n    DetectionOutput();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    int num_class;\n    float nms_threshold;\n    int nms_top_k;\n    int keep_top_k;\n    float confidence_threshold;\n    float variances[4];\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DETECTIONOUTPUT_H\n"
  },
  {
    "path": "src/layer/diag.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"diag.h\"\n\nnamespace ncnn {\n\nDiag::Diag()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Diag::load_param(const ParamDict& pd)\n{\n    diagonal = pd.get(0, 0);\n\n    return 0;\n}\n\nint Diag::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n\n    if (dims == 1)\n    {\n        int w = bottom_blob.w;\n        int top_w = w + ((diagonal >= 0) ? diagonal : -diagonal);\n\n        top_blob.create(top_w, top_w, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.fill(0.0f);\n\n        int bias_r = -std::min(diagonal, 0);\n        int bias_c = std::max(diagonal, 0);\n\n        for (int i = 0; i < w; i++)\n        {\n            top_blob.row(i + bias_r)[i + bias_c] = bottom_blob[i];\n        }\n    }\n    if (dims == 2)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n\n        int len = 0;\n        int minimum = std::min(w - h, 0);\n        int maximum = std::max(w - h, 0);\n        if (diagonal <= maximum && diagonal >= minimum)\n            len = std::min(w, h);\n        else if (diagonal > -h && diagonal < minimum)\n            len = diagonal + h;\n        else if (diagonal > maximum && diagonal < w)\n            len = -diagonal + w;\n\n        top_blob.create(len, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n        {\n            if (len == 0)\n                return 0;\n            return -100;\n        }\n\n        int bias_r = -std::min(diagonal, 0);\n        int bias_c = std::max(diagonal, 0);\n\n        for (int i = 0; i < len; i++)\n        {\n            top_blob[i] = bottom_blob.row(i + bias_r)[i + bias_c];\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/diag.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DIAG_H\n#define LAYER_DIAG_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Diag : public Layer\n{\npublic:\n    Diag();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    int diagonal;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DIAG_H\n"
  },
  {
    "path": "src/layer/dropout.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"dropout.h\"\n\nnamespace ncnn {\n\nDropout::Dropout()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint Dropout::load_param(const ParamDict& pd)\n{\n    scale = pd.get(0, 1.f);\n\n    return 0;\n}\n\nint Dropout::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    if (scale == 1.f)\n    {\n        return 0;\n    }\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] = ptr[i] * scale;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/dropout.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DROPOUT_H\n#define LAYER_DROPOUT_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Dropout : public Layer\n{\npublic:\n    Dropout();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    float scale;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DROPOUT_H\n"
  },
  {
    "path": "src/layer/einsum.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"einsum.h\"\n#include <string.h>\n\nnamespace ncnn {\n\nEinsum::Einsum()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint Einsum::load_param(const ParamDict& pd)\n{\n    Mat equation_mat = pd.get(0, Mat());\n\n    const int equation_len = equation_mat.w;\n\n    // restore to lexical equation string\n    std::string equation;\n    equation.resize(equation_len);\n    char* equation_ptr = (char*)equation.c_str();\n    {\n        const int* p = equation_mat;\n        for (int i = 0; i < equation_len; i++)\n        {\n            equation_ptr[i] = p[i];\n        }\n    }\n\n    if (equation == \"ii\")\n    {\n        // trace\n        rhs_token = \"ii\";\n\n        return 0;\n    }\n\n    // split into tokens\n    char* arrow = strstr(equation_ptr, \"->\");\n    if (!arrow)\n    {\n        NCNN_LOGE(\"invalid equation %s\", equation_ptr);\n        return -1;\n    }\n\n    arrow[0] = '\\0';\n    arrow[1] = '\\0';\n\n    char* lhs = equation_ptr;\n    char* rhs = arrow + 2;\n\n    {\n        char* t = strtok(lhs, \",\");\n        while (t)\n        {\n            lhs_tokens.push_back(std::string(t));\n            t = strtok(NULL, \",\");\n        }\n    }\n\n    rhs_token = std::string(rhs);\n\n    // check token always in ijkl\n    {\n        for (size_t i = 0; i < rhs_token.size(); i++)\n        {\n            if (rhs_token[i] < 'i' || rhs_token[i] > 'l')\n            {\n                NCNN_LOGE(\"invalid rhs_token %s\", rhs_token.c_str());\n                return -1;\n            }\n        }\n\n        for (size_t i = 0; i < lhs_tokens.size(); i++)\n        {\n            const std::string& lhs_token = lhs_tokens[i];\n            for (size_t j = 0; j < lhs_token.size(); j++)\n            {\n                if (lhs_token[j] < 'i' || lhs_token[j] > 'x')\n                {\n                    NCNN_LOGE(\"invalid lhs_token %s\", lhs_token.c_str());\n    
                return -1;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nstatic float get_indexed_value(const Mat& m, const std::string& token, std::vector<int>& indexes)\n{\n    const int dims = m.dims;\n\n    if (dims == 1)\n    {\n        int x = indexes[token[0] - 'i'];\n        return m[x];\n    }\n\n    if (dims == 2)\n    {\n        int y = indexes[token[0] - 'i'];\n        int x = indexes[token[1] - 'i'];\n        return m.row(y)[x];\n    }\n\n    if (dims == 3)\n    {\n        int c = indexes[token[0] - 'i'];\n        int y = indexes[token[1] - 'i'];\n        int x = indexes[token[2] - 'i'];\n        return m.channel(c).row(y)[x];\n    }\n\n    if (dims == 4)\n    {\n        int c = indexes[token[0] - 'i'];\n        int z = indexes[token[1] - 'i'];\n        int y = indexes[token[2] - 'i'];\n        int x = indexes[token[3] - 'i'];\n        return m.channel(c).depth(z).row(y)[x];\n    }\n\n    // should never reach here\n    return 0;\n}\n\nstatic float sum_dim(const std::vector<int>& dim_sizes, int d, const std::vector<Mat>& bottom_blobs, const std::vector<std::string>& tokens, std::vector<int>& indexes)\n{\n    if (d == (int)dim_sizes.size())\n    {\n        float v = 1.f;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            v *= get_indexed_value(bottom_blobs[b], tokens[b], indexes);\n        }\n\n        return v;\n    }\n\n    float sum = 0.f;\n\n    for (int i = 0; i < dim_sizes[d]; i++)\n    {\n        indexes[d] = i;\n\n        sum += sum_dim(dim_sizes, d + 1, bottom_blobs, tokens, indexes);\n    }\n\n    return sum;\n}\n\nint Einsum::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    // assert bottom_blobs.size() == lhs_tokens.size()\n    // assert top_blobs.size() == 1\n\n    size_t elemsize = bottom_blobs[0].elemsize;\n\n    if (lhs_tokens.empty() && rhs_token == \"ii\")\n    {\n        // assert bottom_blobs.size() == 1\n        // 
assert bottom_blob.dims == 2\n        // assert bottom_blob.w == bottom_blob.h\n\n        // trace\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(1, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const Mat& bottom_blob = bottom_blobs[0];\n\n        float sum = 0.f;\n\n        for (int i = 0; i < bottom_blob.h; i++)\n        {\n            sum += bottom_blob.row(i)[i];\n        }\n\n        top_blob[0] = sum;\n\n        return 0;\n    }\n\n    // resolve dimension sizes\n    std::vector<int> dim_sizes(16, 1); // map ijklmnopqrstuvwx -> dim_size\n    int dim_sizes_count = 0;\n\n    for (size_t b = 0; b < bottom_blobs.size(); b++)\n    {\n        const std::string& lhs_token = lhs_tokens[b];\n        const Mat& bottom_blob = bottom_blobs[b];\n        const int in_dims = bottom_blob.dims;\n\n        for (int s = 0; s < in_dims; s++)\n        {\n            int dim_size = 1;\n            if (in_dims == 1) dim_size = bottom_blob.w;\n            if (in_dims == 2 && s == 0) dim_size = bottom_blob.h;\n            if (in_dims == 2 && s == 1) dim_size = bottom_blob.w;\n            if (in_dims == 3 && s == 0) dim_size = bottom_blob.c;\n            if (in_dims == 3 && s == 1) dim_size = bottom_blob.h;\n            if (in_dims == 3 && s == 2) dim_size = bottom_blob.w;\n            if (in_dims == 4 && s == 0) dim_size = bottom_blob.c;\n            if (in_dims == 4 && s == 1) dim_size = bottom_blob.d;\n            if (in_dims == 4 && s == 2) dim_size = bottom_blob.h;\n            if (in_dims == 4 && s == 3) dim_size = bottom_blob.w;\n\n            int dim_sizes_index = lhs_token[s] - 'i';\n            dim_sizes[dim_sizes_index] = dim_size;\n            dim_sizes_count = std::max(dim_sizes_count, dim_sizes_index + 1);\n        }\n    }\n\n    dim_sizes.resize(dim_sizes_count);\n\n    const int out_dims = (int)rhs_token.size();\n\n    std::vector<int> indexes(dim_sizes_count);\n\n    if (out_dims == 1)\n    {\n     
   Mat& top_blob = top_blobs[0];\n        top_blob.create(dim_sizes[0], elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        for (int i = 0; i < top_blob.w; i++)\n        {\n            indexes[0] = i;\n\n            float sum = sum_dim(dim_sizes, 1, bottom_blobs, lhs_tokens, indexes);\n\n            top_blob[i] = sum;\n        }\n    }\n\n    if (out_dims == 2)\n    {\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(dim_sizes[1], dim_sizes[0], elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        for (int i = 0; i < top_blob.h; i++)\n        {\n            indexes[0] = i;\n\n            for (int j = 0; j < top_blob.w; j++)\n            {\n                indexes[1] = j;\n\n                float sum = sum_dim(dim_sizes, 2, bottom_blobs, lhs_tokens, indexes);\n\n                top_blob.row(i)[j] = sum;\n            }\n        }\n    }\n\n    if (out_dims == 3)\n    {\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(dim_sizes[2], dim_sizes[1], dim_sizes[0], elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        for (int i = 0; i < top_blob.c; i++)\n        {\n            indexes[0] = i;\n\n            for (int j = 0; j < top_blob.h; j++)\n            {\n                indexes[1] = j;\n\n                for (int k = 0; k < top_blob.w; k++)\n                {\n                    indexes[2] = k;\n\n                    float sum = sum_dim(dim_sizes, 3, bottom_blobs, lhs_tokens, indexes);\n\n                    top_blob.channel(i).row(j)[k] = sum;\n                }\n            }\n        }\n    }\n\n    if (out_dims == 4)\n    {\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(dim_sizes[3], dim_sizes[2], dim_sizes[1], dim_sizes[0], elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        for (int i = 0; i < top_blob.c; i++)\n        {\n            
indexes[0] = i;\n\n            for (int j = 0; j < top_blob.d; j++)\n            {\n                indexes[1] = j;\n\n                for (int k = 0; k < top_blob.h; k++)\n                {\n                    indexes[2] = k;\n\n                    for (int l = 0; l < top_blob.w; l++)\n                    {\n                        indexes[3] = l;\n\n                        float sum = sum_dim(dim_sizes, 4, bottom_blobs, lhs_tokens, indexes);\n\n                        top_blob.channel(i).depth(j).row(k)[l] = sum;\n                    }\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/einsum.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_EINSUM_H\n#define LAYER_EINSUM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Einsum : public Layer\n{\npublic:\n    Einsum();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    // equation tokens\n    std::vector<std::string> lhs_tokens;\n    std::string rhs_token;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_EINSUM_H\n"
  },
  {
    "path": "src/layer/eltwise.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"eltwise.h\"\n\nnamespace ncnn {\n\nEltwise::Eltwise()\n{\n    one_blob_only = false;\n    support_inplace = false; // TODO inplace reduction\n}\n\nint Eltwise::load_param(const ParamDict& pd)\n{\n    op_type = pd.get(0, 0);\n    coeffs = pd.get(1, Mat());\n\n    return 0;\n}\n\nint Eltwise::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int size = w * h * d;\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (op_type == Operation_PROD)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            const float* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = ptr[i] * ptr1[i];\n            }\n        }\n\n        for (size_t b = 2; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    outptr[i] *= ptr[i];\n                }\n            }\n        }\n    }\n    else if (op_type == Operation_SUM)\n    {\n        if (coeffs.w == 0)\n        {\n          
  // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    outptr[i] = ptr[i] + ptr1[i];\n                }\n            }\n\n            for (size_t b = 2; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        outptr[i] += ptr[i];\n                    }\n                }\n            }\n        }\n        else\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            float coeff0 = coeffs[0];\n            float coeff1 = coeffs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    outptr[i] = ptr[i] * coeff0 + ptr1[i] * coeff1;\n                }\n            }\n\n            for (size_t b = 2; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                float coeff = coeffs[b];\n                #pragma 
omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        outptr[i] += ptr[i] * coeff;\n                    }\n                }\n            }\n        }\n    }\n    else if (op_type == Operation_MAX)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            const float* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = std::max(ptr[i], ptr1[i]);\n            }\n        }\n\n        for (size_t b = 2; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < size; i++)\n                {\n                    outptr[i] = std::max(outptr[i], ptr[i]);\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/eltwise.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ELTWISE_H\n#define LAYER_ELTWISE_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Eltwise : public Layer\n{\npublic:\n    Eltwise();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\n    enum OperationType\n    {\n        Operation_PROD = 0,\n        Operation_SUM = 1,\n        Operation_MAX = 2\n    };\n\npublic:\n    // param\n    int op_type;\n    Mat coeffs;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ELTWISE_H\n"
  },
  {
    "path": "src/layer/elu.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"elu.h\"\n\nnamespace ncnn {\n\nELU::ELU()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint ELU::load_param(const ParamDict& pd)\n{\n    alpha = pd.get(0, 0.1f);\n\n    return 0;\n}\n\nint ELU::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            if (ptr[i] < 0.f)\n                ptr[i] = alpha * (expf(ptr[i]) - 1.f);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/elu.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ELU_H\n#define LAYER_ELU_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass ELU : public Layer\n{\npublic:\n    ELU();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    float alpha;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ELU_H\n"
  },
  {
    "path": "src/layer/embed.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"embed.h\"\n\n#include <string.h>\n\nnamespace ncnn {\n\nEmbed::Embed()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Embed::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    input_dim = pd.get(1, 0);\n    bias_term = pd.get(2, 0);\n    weight_data_size = pd.get(3, 0);\n    int8_scale_term = pd.get(18, 0);\n\n    return 0;\n}\n\nint Embed::load_model(const ModelBin& mb)\n{\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        weight_data_int8_scale = mb.load(1, 1)[0];\n    }\n#endif // NCNN_INT8\n\n    return 0;\n}\n\nstatic void embed(const Mat& bottom_blob, const Mat& weight_data, const Mat& bias_data, Mat& top_blob, int input_dim, const Option& opt)\n{\n    const int num_output = top_blob.w;\n    const int words = top_blob.h;\n\n    const float* bias_ptr = bias_data;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < words; q++)\n    {\n        float* outptr = top_blob.row(q);\n\n        int word_index = ((const int*)bottom_blob)[q];\n\n        if (word_index < 0)\n            word_index = 0;\n        if (word_index >= input_dim)\n            word_index = input_dim - 1;\n\n        const float* em = (const float*)weight_data + num_output * word_index;\n\n        if (bias_ptr)\n        {\n            for (int p = 0; p < num_output; p++)\n            {\n                outptr[p] = em[p] + bias_ptr[p];\n            }\n        }\n        else\n        {\n            memcpy(outptr, em, num_output * sizeof(float));\n        }\n    }\n}\n\n#if NCNN_INT8\nstatic void embed_int8(const Mat& bottom_blob, const Mat& weight_data, float weight_data_int8_scale, 
const Mat& bias_data, Mat& top_blob, int input_dim, const Option& opt)\n{\n    const int num_output = top_blob.w;\n    const int words = top_blob.h;\n\n    const float* bias_ptr = bias_data;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < words; q++)\n    {\n        float* outptr = top_blob.row(q);\n\n        int word_index = ((const int*)bottom_blob)[q];\n\n        if (word_index < 0)\n            word_index = 0;\n        if (word_index >= input_dim)\n            word_index = input_dim - 1;\n\n        const float descale_em = 1.f / weight_data_int8_scale;\n\n        const signed char* em = (const signed char*)weight_data + num_output * word_index;\n\n        if (bias_ptr)\n        {\n            for (int p = 0; p < num_output; p++)\n            {\n                outptr[p] = em[p] * descale_em + bias_ptr[p];\n            }\n        }\n        else\n        {\n            for (int p = 0; p < num_output; p++)\n            {\n                outptr[p] = em[p] * descale_em;\n            }\n        }\n    }\n}\n#endif // NCNN_INT8\n\nint Embed::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int words = bottom_blob.w;\n\n    top_blob.create(num_output, words, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        embed_int8(bottom_blob, weight_data, weight_data_int8_scale, bias_data, top_blob, input_dim, opt);\n    }\n    else\n#endif // NCNN_INT8\n    {\n        embed(bottom_blob, weight_data, bias_data, top_blob, input_dim, opt);\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/embed.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_EMBED_H\n#define LAYER_EMBED_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Embed : public Layer\n{\npublic:\n    Embed();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int num_output;\n    int input_dim;\n    int bias_term;\n\n    int weight_data_size;\n\n    int int8_scale_term;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n\n#if NCNN_INT8\n    float weight_data_int8_scale;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_EMBED_H\n"
  },
  {
    "path": "src/layer/erf.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"erf.h\"\n\nnamespace ncnn {\n\nErf::Erf()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint Erf::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] = erff(ptr[i]);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/erf.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ERF_H\n#define LAYER_ERF_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Erf : public Layer\n{\npublic:\n    Erf();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ERF_H\n"
  },
  {
    "path": "src/layer/exp.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"exp.h\"\n\nnamespace ncnn {\n\nExp::Exp()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint Exp::load_param(const ParamDict& pd)\n{\n    base = pd.get(0, -1.f);\n    scale = pd.get(1, 1.f);\n    shift = pd.get(2, 0.f);\n\n    return 0;\n}\n\nint Exp::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    if (base == -1.f)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                ptr[i] = expf(shift + ptr[i] * scale);\n            }\n        }\n    }\n    else\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                ptr[i] = powf(base, (shift + ptr[i] * scale));\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/exp.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_EXP_H\n#define LAYER_EXP_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Exp : public Layer\n{\npublic:\n    Exp();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    float base;\n    float scale;\n    float shift;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_EXP_H\n"
  },
  {
    "path": "src/layer/expanddims.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"expanddims.h\"\n\nnamespace ncnn {\n\nExpandDims::ExpandDims()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint ExpandDims::load_param(const ParamDict& pd)\n{\n    axes = pd.get(3, Mat());\n\n    return 0;\n}\n\nint ExpandDims::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int dims = bottom_blob.dims;\n\n    const int outdims = dims + axes.w;\n\n    bool expand_w = false;\n    bool expand_h = false;\n    bool expand_d = false;\n    bool expand_c = false;\n\n    {\n        const int* axes_ptr = axes;\n        for (int i = 0; i < axes.w; i++)\n        {\n            int axis = axes_ptr[i];\n            if (axis < 0)\n                axis = outdims + axis;\n\n            if (outdims == 2)\n            {\n                if (axis == 0) expand_h = true;\n                if (axis == 1) expand_w = true;\n            }\n            if (outdims == 3)\n            {\n                if (axis == 0) expand_c = true;\n                if (axis == 1) expand_h = true;\n                if (axis == 2) expand_w = true;\n            }\n            if (outdims == 4)\n            {\n                if (axis == 0) expand_c = true;\n                if (axis == 1) expand_d = true;\n                if (axis == 2) expand_h = true;\n                if (axis == 3) expand_w = true;\n            }\n        }\n    }\n\n    top_blob = bottom_blob;\n\n    if (outdims == 2)\n    {\n        if (expand_w)\n        {\n            top_blob = bottom_blob.reshape(1, w, opt.blob_allocator);\n        }\n        else if (expand_h)\n        {\n            top_blob = bottom_blob.reshape(w, 1, opt.blob_allocator);\n        }\n    }\n    if (outdims == 3)\n    {\n        if (expand_w && expand_h)\n        {\n            top_blob = 
bottom_blob.reshape(1, 1, w, opt.blob_allocator);\n        }\n        else if (expand_w && expand_c)\n        {\n            top_blob = bottom_blob.reshape(1, w, 1, opt.blob_allocator);\n        }\n        else if (expand_h && expand_c)\n        {\n            top_blob = bottom_blob.reshape(w, 1, 1, opt.blob_allocator);\n        }\n        else if (expand_w)\n        {\n            top_blob = bottom_blob.reshape(1, w, h, opt.blob_allocator);\n        }\n        else if (expand_h)\n        {\n            top_blob = bottom_blob.reshape(w, 1, h, opt.blob_allocator);\n        }\n        else if (expand_c)\n        {\n            top_blob = bottom_blob.reshape(w, h, 1, opt.blob_allocator);\n        }\n    }\n    if (outdims == 4)\n    {\n        if (expand_w && expand_h && expand_d)\n        {\n            top_blob = bottom_blob.reshape(1, 1, 1, w, opt.blob_allocator);\n        }\n        else if (expand_w && expand_h && expand_c)\n        {\n            top_blob = bottom_blob.reshape(1, 1, w, 1, opt.blob_allocator);\n        }\n        else if (expand_w && expand_d && expand_c)\n        {\n            top_blob = bottom_blob.reshape(1, w, 1, 1, opt.blob_allocator);\n        }\n        else if (expand_h && expand_d && expand_c)\n        {\n            top_blob = bottom_blob.reshape(w, 1, 1, 1, opt.blob_allocator);\n        }\n        else if (expand_w && expand_h)\n        {\n            top_blob = bottom_blob.reshape(1, 1, w, h, opt.blob_allocator);\n        }\n        else if (expand_w && expand_c)\n        {\n            top_blob = bottom_blob.reshape(1, w, h, 1, opt.blob_allocator);\n        }\n        else if (expand_d && expand_c)\n        {\n            top_blob = bottom_blob.reshape(w, h, 1, 1, opt.blob_allocator);\n        }\n        else if (expand_w && expand_d)\n        {\n            top_blob = bottom_blob.reshape(1, w, 1, h, opt.blob_allocator);\n        }\n        else if (expand_h && expand_c)\n        {\n            top_blob = bottom_blob.reshape(w, 1, 
h, 1, opt.blob_allocator);\n        }\n        else if (expand_h && expand_d)\n        {\n            top_blob = bottom_blob.reshape(w, 1, 1, h, opt.blob_allocator);\n        }\n        else if (expand_w)\n        {\n            top_blob = bottom_blob.reshape(1, w, h, channels, opt.blob_allocator);\n        }\n        else if (expand_h)\n        {\n            top_blob = bottom_blob.reshape(w, 1, h, channels, opt.blob_allocator);\n        }\n        else if (expand_d)\n        {\n            top_blob = bottom_blob.reshape(w, h, 1, channels, opt.blob_allocator);\n        }\n        else if (expand_c)\n        {\n            top_blob = bottom_blob.reshape(w, h, channels, 1, opt.blob_allocator);\n        }\n    }\n\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/expanddims.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_EXPANDDIMS_H\n#define LAYER_EXPANDDIMS_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass ExpandDims : public Layer\n{\npublic:\n    ExpandDims();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    Mat axes;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_EXPANDDIMS_H\n"
  },
  {
    "path": "src/layer/flatten.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"flatten.h\"\n\n#include <string.h>\n\nnamespace ncnn {\n\nFlatten::Flatten()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Flatten::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int size = w * h * d;\n\n    top_blob.create(size * channels, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const unsigned char* ptr = bottom_blob.channel(q);\n        unsigned char* outptr = (unsigned char*)top_blob + size * elemsize * q;\n\n        memcpy(outptr, ptr, size * elemsize);\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/flatten.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_FLATTEN_H\n#define LAYER_FLATTEN_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Flatten : public Layer\n{\npublic:\n    Flatten();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_FLATTEN_H\n"
  },
  {
    "path": "src/layer/flip.cpp",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"flip.h\"\n\nnamespace ncnn {\n\nFlip::Flip()\n{\n    one_blob_only = true;\n}\n\nint Flip::load_param(const ParamDict& pd)\n{\n    axes = pd.get(0, Mat());\n\n    if (axes.w > 4)\n    {\n        // only handle up to 4-dim\n        return -1;\n    }\n\n    return 0;\n}\n\nint Flip::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (axes.empty())\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int d = bottom_blob.d;\n    const int channels = bottom_blob.c;\n\n    int axes_flag[4] = {0};\n    bool flip_w = false;\n    bool flip_h = false;\n    bool flip_d = false;\n    bool flip_c = false;\n    {\n        const int* axes_ptr = axes;\n        for (int i = 0; i < axes.w; i++)\n        {\n            int axis = axes_ptr[i];\n            // handle negative axis\n            if (axis < 0)\n                axis += dims;\n            axes_flag[axis] = 1;\n        }\n\n        if (dims == 1)\n        {\n            flip_w = true;\n        }\n        else if (dims == 2)\n        {\n            if (axes_flag[0] == 1) flip_h = true;\n            if (axes_flag[1] == 1) flip_w = true;\n        }\n        else if (dims == 3)\n        {\n            if (axes_flag[0] == 1) flip_c = true;\n            if (axes_flag[1] == 1) flip_h = true;\n            if (axes_flag[2] == 1) flip_w = true;\n        }\n        else if (dims == 4)\n        {\n            if (axes_flag[0] == 1) flip_c = true;\n            if (axes_flag[1] == 1) flip_d = true;\n            if (axes_flag[2] == 1) flip_h = true;\n            if (axes_flag[3] == 1) flip_w = true;\n        }\n    }\n\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    #pragma omp parallel for 
num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        for (int z = 0; z < d; z++)\n        {\n            for (int i = 0; i < h; i++)\n            {\n                // source indices: mirrored along each flipped axis, unchanged otherwise\n                int q2 = flip_c ? channels - 1 - q : q;\n                int z2 = flip_d ? d - 1 - z : z;\n                int i2 = flip_h ? h - 1 - i : i;\n\n                const float* ptr = bottom_blob.channel(q2).depth(z2).row(i2);\n                float* outptr = top_blob.channel(q).depth(z).row(i);\n\n                if (flip_w)\n                {\n                    // read the source row back to front\n                    ptr += w - 1;\n                    for (int j = 0; j < w; j++)\n                    {\n                        *outptr++ = *ptr--;\n                    }\n                }\n                else\n                {\n                    // element order along w is unchanged, copy the row as-is\n                    memcpy(outptr, ptr, w * sizeof(float));\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/flip.h",
    "content": "// Copyright 2025 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_FLIP_H\n#define LAYER_FLIP_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Flip : public Layer\n{\npublic:\n    Flip();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    Mat axes;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_FLIP_H\n"
  },
  {
    "path": "src/layer/fold.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"fold.h\"\n\nnamespace ncnn {\n\nFold::Fold()\n{\n    one_blob_only = true;\n}\n\nint Fold::load_param(const ParamDict& pd)\n{\n    kernel_w = pd.get(1, 0);\n    kernel_h = pd.get(11, kernel_w);\n    dilation_w = pd.get(2, 1);\n    dilation_h = pd.get(12, dilation_w);\n    stride_w = pd.get(3, 1);\n    stride_h = pd.get(13, stride_w);\n    pad_left = pd.get(4, 0);\n    pad_right = pd.get(15, pad_left);\n    pad_top = pd.get(14, pad_left);\n    pad_bottom = pd.get(16, pad_top);\n    output_w = pd.get(20, 0);\n    output_h = pd.get(21, output_w);\n\n    return 0;\n}\n\nint Fold::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int max_channels = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    const int outw = output_w + pad_left + pad_right;\n    const int outh = output_h + pad_top + pad_bottom;\n\n    const int inw = (outw - kernel_extent_w) / stride_w + 1;\n    const int inh = (outh - kernel_extent_h) / stride_h + 1;\n\n    // assert inw * inh == size\n\n    const int maxk = kernel_w * kernel_h;\n    const int channels = max_channels / maxk;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0)\n    {\n        top_blob_bordered.create(outw, outh, channels, elemsize, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, channels, elemsize, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    // col2im\n    const int gap = outw * stride_h - inw * stride_w;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const float* sptr = bottom_blob.row(p * maxk);\n 
       Mat outm = top_blob_bordered.channel(p);\n\n        // start from zero: overlapping patches accumulate into the same output pixel\n        outm.fill(0.f);\n\n        for (int u = 0; u < kernel_h; u++)\n        {\n            for (int v = 0; v < kernel_w; v++)\n            {\n                // top-left output position touched by kernel tap (u, v)\n                float* ptr = outm.row(dilation_h * u) + dilation_w * v;\n\n                for (int i = 0; i < inh; i++)\n                {\n                    for (int j = 0; j < inw; j++)\n                    {\n                        ptr[0] += sptr[0];\n\n                        ptr += stride_w;\n                        sptr += 1;\n                    }\n\n                    // advance to the same column stride_h rows further down\n                    ptr += gap;\n                }\n            }\n        }\n    }\n\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0)\n    {\n        // crop the padding border to get the final output_w x output_h blob\n        Option opt_b = opt;\n        opt_b.use_packing_layout = false;\n        copy_cut_border(top_blob_bordered, top_blob, pad_top, pad_bottom, pad_left, pad_right, opt_b);\n        if (top_blob.empty())\n            return -100;\n    }\n    else\n    {\n        top_blob = top_blob_bordered;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/fold.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_FOLD_H\n#define LAYER_FOLD_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Fold : public Layer\n{\npublic:\n    Fold();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    int kernel_w;\n    int kernel_h;\n    int dilation_w;\n    int dilation_h;\n    int stride_w;\n    int stride_h;\n    int pad_left; // -233=SAME_UPPER -234=SAME_LOWER\n    int pad_right;\n    int pad_top;\n    int pad_bottom;\n    int output_w;\n    int output_h;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_FOLD_H\n"
  },
  {
    "path": "src/layer/fused_activation.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef FUSED_ACTIVATION_H\n#define FUSED_ACTIVATION_H\n\n#include \"mat.h\"\n#include \"layer_type.h\"\n\nstatic NCNN_FORCEINLINE float activation_ss(float v, int activation_type, const ncnn::Mat& activation_params)\n{\n    switch (activation_type)\n    {\n    case 1:\n    {\n        v = fmaxf(v, 0.f);\n        break;\n    }\n    case 2:\n    {\n        float slope = activation_params[0];\n        v = v > 0.f ? v : v * slope;\n        break;\n    }\n    case 3:\n    {\n        float min = activation_params[0];\n        float max = activation_params[1];\n        if (v < min)\n            v = min;\n        if (v > max)\n            v = max;\n        break;\n    }\n    case 4:\n    {\n        v = std::min(v, 88.3762626647949f);\n        v = std::max(v, -88.3762626647949f);\n        v = 1.f / (1.f + expf(-v));\n        break;\n    }\n    case 5:\n    {\n        v = v * tanhf(logf(expf(v) + 1.f));\n        break;\n    }\n    case 6:\n    {\n        float alpha = activation_params[0];\n        float beta = activation_params[1];\n        float lower = -beta / alpha;\n        float upper = (1.f / alpha) + lower;\n        if (v < lower)\n            v = 0.f;\n        else if (v > upper)\n            ;\n        else\n            v = v * (v * alpha + beta);\n        break;\n    }\n    }\n\n    return v;\n}\n\nstatic ncnn::Layer* create_activation_layer(int activation_type, const ncnn::Mat& activation_params, const ncnn::Option& opt)\n{\n    ncnn::Layer* activation = 0;\n\n    if (activation_type == 1)\n    {\n        activation = ncnn::create_layer_cpu(ncnn::LayerType::ReLU);\n\n        ncnn::ParamDict pd;\n        activation->load_param(pd);\n    }\n    else if (activation_type == 2)\n    {\n        activation = ncnn::create_layer_cpu(ncnn::LayerType::ReLU);\n\n        ncnn::ParamDict pd;\n        pd.set(0, activation_params[0]); // slope\n        activation->load_param(pd);\n    }\n    
else if (activation_type == 3)\n    {\n        activation = ncnn::create_layer_cpu(ncnn::LayerType::Clip);\n\n        ncnn::ParamDict pd;\n        pd.set(0, activation_params[0]); // min\n        pd.set(1, activation_params[1]); // max\n\n        activation->load_param(pd);\n    }\n    else if (activation_type == 4)\n    {\n        activation = ncnn::create_layer_cpu(ncnn::LayerType::Sigmoid);\n\n        ncnn::ParamDict pd;\n        activation->load_param(pd);\n    }\n    else if (activation_type == 5)\n    {\n        activation = ncnn::create_layer_cpu(ncnn::LayerType::Mish);\n\n        ncnn::ParamDict pd;\n        activation->load_param(pd);\n    }\n    else if (activation_type == 6)\n    {\n        activation = ncnn::create_layer_cpu(ncnn::LayerType::HardSwish);\n\n        ncnn::ParamDict pd;\n        pd.set(0, activation_params[0]); // alpha\n        pd.set(1, activation_params[1]); // beta\n\n        activation->load_param(pd);\n    }\n\n    if (activation)\n    {\n        activation->create_pipeline(opt);\n    }\n\n    return activation;\n}\n\n#endif // FUSED_ACTIVATION_H\n"
  },
  {
    "path": "src/layer/gelu.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gelu.h\"\n\nnamespace ncnn {\n\nGELU::GELU()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint GELU::load_param(const ParamDict& pd)\n{\n    fast_gelu = pd.get(0, 0);\n\n    return 0;\n}\n\nint GELU::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    if (fast_gelu)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                // y = 0.5x * (1 + tanh(sqrt(2/Pi) * (x + 0.044715x^3)))\n                ptr[i] = 0.5f * ptr[i] * (1.0f + tanhf(0.79788452f * (ptr[i] + 0.044715f * ptr[i] * ptr[i] * ptr[i])));\n            }\n        }\n    }\n    else\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                // y = x * P(X <= x) where X ~ N(0, 1)\n                ptr[i] = 0.5f * ptr[i] * erfcf(-0.70710678f * ptr[i]);\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/gelu.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GELU_H\n#define LAYER_GELU_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass GELU : public Layer\n{\npublic:\n    GELU();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    int fast_gelu;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GELU_H\n"
  },
  {
    "path": "src/layer/gemm.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gemm.h\"\n\nnamespace ncnn {\n\nGemm::Gemm()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint Gemm::load_param(const ParamDict& pd)\n{\n    alpha = pd.get(0, 1.f);\n    beta = pd.get(1, 1.f);\n    transA = pd.get(2, 0);\n    transB = pd.get(3, 0);\n    constantA = pd.get(4, 0);\n    constantB = pd.get(5, 0);\n    constantC = pd.get(6, 0);\n    constantM = pd.get(7, 0);\n    constantN = pd.get(8, 0);\n    constantK = pd.get(9, 0);\n    constant_broadcast_type_C = pd.get(10, 0);\n    output_N1M = pd.get(11, 0);\n    output_elempack = pd.get(12, 0);\n    output_elemtype = pd.get(13, 0);\n    output_transpose = pd.get(14, 0);\n    int8_scale_term = pd.get(18, 0);\n    constant_TILE_M = pd.get(20, 0);\n    constant_TILE_N = pd.get(21, 0);\n    constant_TILE_K = pd.get(22, 0);\n\n    if (int8_scale_term)\n    {\n#if !NCNN_INT8\n        NCNN_LOGE(\"please build ncnn with NCNN_INT8 enabled for int8 inference\");\n        return -1;\n#endif\n    }\n\n    if (constantA == 1 && (constantM == 0 || constantK == 0))\n    {\n        NCNN_LOGE(\"constantM and constantK must be non-zero when constantA enabled\");\n        return -1;\n    }\n\n    if (constantB == 1 && (constantN == 0 || constantK == 0))\n    {\n        NCNN_LOGE(\"constantN and constantK must be non-zero when constantB enabled\");\n        return -1;\n    }\n\n    if (constantC == 1 && (constant_broadcast_type_C < -1 || constant_broadcast_type_C > 4))\n    {\n        NCNN_LOGE(\"constant_broadcast_type_C must be -1 or 0~4 when constantC enabled\");\n        return -1;\n    }\n\n    if (constantA == 0 && constantB == 1 && constantC == 1)\n        one_blob_only = true;\n\n    if (constantA == 1 && constantB == 0 && constantC == 1)\n        one_blob_only = true;\n\n    if (constantA == 1 && constantB == 1 && constantC == 0)\n        one_blob_only = true;\n\n    return 0;\n}\n\nint 
Gemm::load_model(const ModelBin& mb)\n{\n    if (constantA == 1)\n    {\n        if (transA == 0)\n            A_data = mb.load(constantK, constantM, 0);\n        else\n            A_data = mb.load(constantM, constantK, 0);\n        if (A_data.empty())\n            return -100;\n    }\n\n    if (constantB == 1)\n    {\n        if (transB == 0)\n            B_data = mb.load(constantN, constantK, 0);\n        else\n            B_data = mb.load(constantK, constantN, 0);\n        if (B_data.empty())\n            return -100;\n    }\n\n    if (constantC == 1 && constant_broadcast_type_C != -1)\n    {\n        if (constant_broadcast_type_C == 0)\n            C_data = mb.load(1, 0);\n        if (constant_broadcast_type_C == 1)\n            C_data = mb.load(constantM, 0);\n        if (constant_broadcast_type_C == 2)\n            C_data = mb.load(1, constantM, 0);\n        if (constant_broadcast_type_C == 3)\n            C_data = mb.load(constantN, constantM, 0);\n        if (constant_broadcast_type_C == 4)\n            C_data = mb.load(constantN, 1, 0);\n        if (C_data.empty())\n            return -100;\n    }\n\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        if (constantA == 1)\n        {\n            A_data_int8_scales = mb.load(constantM, 1);\n        }\n\n        if (constantB == 1)\n        {\n            B_data_int8_scale = mb.load(1, 1)[0];\n        }\n    }\n#endif // NCNN_INT8\n\n    return 0;\n}\n\nstatic void gemm_transB(const Mat& A, const Mat& BT, const Mat& C, Mat& top_blob, float alpha, float beta, int broadcast_type_C, int output_transpose, const Option& opt)\n{\n    const int M = A.dims == 3 ? A.c : A.h;\n    const int N = BT.dims == 3 ? BT.c : BT.h;\n    const int K = A.w; // assert A.w == BT.w\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int i = 0; i < M; i++)\n    {\n        const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n        const size_t A_hstep = A.dims == 3 ? 
A.cstep : (size_t)A.w;\n        const size_t BT_hstep = BT.dims == 3 ? BT.cstep : (size_t)BT.w;\n\n        const float* ptrA = (const float*)A + i * A_hstep;\n        const float* ptrC = C;\n\n        for (int j = 0; j < N; j++)\n        {\n            const float* ptrBT = (const float*)BT + j * BT_hstep;\n\n            float sum = 0.f;\n            if (ptrC)\n            {\n                if (broadcast_type_C == 0)\n                {\n                    sum = ptrC[0];\n                }\n                if (broadcast_type_C == 1)\n                {\n                    sum = ptrC[i];\n                }\n                if (broadcast_type_C == 2)\n                {\n                    sum = ptrC[i];\n                }\n                if (broadcast_type_C == 3)\n                {\n                    sum = ptrC[i * N + j];\n                }\n                if (broadcast_type_C == 4)\n                {\n                    sum = ptrC[j];\n                }\n\n                sum *= beta;\n            }\n\n            for (int k = 0; k < K; k++)\n            {\n                sum += ptrA[k] * ptrBT[k];\n            }\n\n            sum *= alpha;\n\n            if (output_transpose)\n            {\n                top_blob[j * out_hstep + i] = sum;\n            }\n            else\n            {\n                top_blob[i * out_hstep + j] = sum;\n            }\n        }\n    }\n}\n\n#if NCNN_INT8\nstatic inline signed char float2int8(float v)\n{\n    int int32 = static_cast<int>(round(v));\n    if (int32 > 127) return 127;\n    if (int32 < -127) return -127;\n    return (signed char)int32;\n}\n\nstatic void gemm_transB_int8(const Mat& A_int8, const Mat& BT_int8, const Mat& A_int8_scales, float BT_int8_scale, const Mat& C, Mat& top_blob, float alpha, float beta, int broadcast_type_C, int output_transpose, const Option& opt)\n{\n    const int M = A_int8.h;\n    const int N = BT_int8.h;\n    const int K = A_int8.w; // assert A_int8.w == BT_int8.w\n\n    #pragma 
omp parallel for num_threads(opt.num_threads)\n    for (int i = 0; i < M; i++)\n    {\n        const size_t out_hstep = top_blob.dims == 3 ? top_blob.cstep : (size_t)top_blob.w;\n\n        const signed char* ptrA = A_int8.row<const signed char>(i);\n        const float* ptrC = C;\n\n        const float descale = 1.f / (A_int8_scales[i] * BT_int8_scale);\n\n        for (int j = 0; j < N; j++)\n        {\n            const signed char* ptrBT = BT_int8.row<const signed char>(j);\n\n            int sum = 0;\n            for (int k = 0; k < K; k++)\n            {\n                sum += ptrA[k] * ptrBT[k];\n            }\n\n            float sum_fp32 = sum * descale;\n\n            if (ptrC)\n            {\n                float c = 0.f;\n                if (broadcast_type_C == 0)\n                {\n                    c = ptrC[0];\n                }\n                if (broadcast_type_C == 1)\n                {\n                    c = ptrC[i];\n                }\n                if (broadcast_type_C == 2)\n                {\n                    c = ptrC[i];\n                }\n                if (broadcast_type_C == 3)\n                {\n                    c = ptrC[i * N + j];\n                }\n                if (broadcast_type_C == 4)\n                {\n                    c = ptrC[j];\n                }\n\n                sum_fp32 += c * beta;\n            }\n\n            sum_fp32 *= alpha;\n\n            if (output_transpose)\n            {\n                top_blob[j * out_hstep + i] = sum_fp32;\n            }\n            else\n            {\n                top_blob[i * out_hstep + j] = sum_fp32;\n            }\n        }\n    }\n}\n#endif // NCNN_INT8\n\nint Gemm::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    std::vector<Mat> bottom_blobs(1, bottom_blob);\n    std::vector<Mat> top_blobs(1, top_blob);\n    int ret = forward(bottom_blobs, top_blobs, opt);\n    top_blob = top_blobs[0];\n    return ret;\n}\n\nint 
Gemm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        return forward_int8(bottom_blobs, top_blobs, opt);\n    }\n#endif // NCNN_INT8\n\n    const Mat& A0 = constantA ? A_data : bottom_blobs[0];\n    const Mat& B0 = constantB ? B_data : constantA ? bottom_blobs[0] : bottom_blobs[1];\n\n    size_t elemsize = A0.elemsize;\n\n    Mat A;\n    if (transA == 0)\n    {\n        A = A0;\n    }\n    else\n    {\n        // transpose A to row-major\n        A.create((A0.dims == 3 ? A0.c : A0.h), A0.w, elemsize, opt.workspace_allocator);\n        if (A.empty())\n            return -100;\n\n        const size_t A0_hstep = A0.dims == 3 ? A0.cstep : (size_t)A0.w;\n\n        for (int i = 0; i < A.h; i++)\n        {\n            float* ptr = A.row(i);\n            for (int j = 0; j < A.w; j++)\n            {\n                ptr[j] = A0[j * A0_hstep + i];\n            }\n        }\n    }\n\n    Mat BT;\n    if (transB == 0)\n    {\n        // transpose B to col-major\n        BT.create((B0.dims == 3 ? B0.c : B0.h), B0.w, elemsize, opt.workspace_allocator);\n        if (BT.empty())\n            return -100;\n\n        const size_t B0_hstep = B0.dims == 3 ? B0.cstep : (size_t)B0.w;\n\n        for (int i = 0; i < BT.h; i++)\n        {\n            float* ptr = BT.row(i);\n            for (int j = 0; j < BT.w; j++)\n            {\n                ptr[j] = B0[j * B0_hstep + i];\n            }\n        }\n    }\n    else\n    {\n        BT = B0;\n    }\n\n    const int M = A.dims == 3 ? A.c : A.h;\n    const int N = BT.dims == 3 ? 
BT.c : BT.h;\n\n    Mat C;\n    int broadcast_type_C = 0;\n    if (constantC)\n    {\n        C = C_data;\n        broadcast_type_C = constant_broadcast_type_C;\n    }\n    else\n    {\n        if (constantA && constantB && bottom_blobs.size() == 1)\n        {\n            C = bottom_blobs[0];\n        }\n        else if ((constantA || constantB) && bottom_blobs.size() == 2)\n        {\n            C = bottom_blobs[1];\n        }\n        else if (bottom_blobs.size() == 3)\n        {\n            C = bottom_blobs[2];\n        }\n\n        if (!C.empty())\n        {\n            if (C.dims == 1 && C.w == 1)\n            {\n                // scalar\n                broadcast_type_C = 0;\n            }\n            if (C.dims == 1 && C.w == M)\n            {\n                // M\n                // auto broadcast from h to w is the ncnn-style convention\n                broadcast_type_C = 1;\n            }\n            if (C.dims == 1 && C.w == N)\n            {\n                // N\n                broadcast_type_C = 4;\n            }\n            if (C.dims == 2 && C.w == 1 && C.h == M)\n            {\n                // Mx1\n                broadcast_type_C = 2;\n            }\n            if (C.dims == 2 && C.w == N && C.h == M)\n            {\n                // MxN\n                broadcast_type_C = 3;\n            }\n            if (C.dims == 2 && C.w == N && C.h == 1)\n            {\n                // 1xN\n                broadcast_type_C = 4;\n            }\n        }\n    }\n\n    Mat& top_blob = top_blobs[0];\n    if (output_transpose)\n    {\n        if (output_N1M)\n            top_blob.create(M, 1, N, elemsize, opt.blob_allocator);\n        else\n            top_blob.create(M, N, elemsize, opt.blob_allocator);\n    }\n    else\n    {\n        if (output_N1M)\n            top_blob.create(N, 1, M, elemsize, opt.blob_allocator);\n        else\n            top_blob.create(N, M, elemsize, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        
return -100;\n\n    gemm_transB(A, BT, C, top_blob, alpha, beta, broadcast_type_C, output_transpose, opt);\n\n    return 0;\n}\n\n#if NCNN_INT8\nint Gemm::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A0 = constantA ? A_data : bottom_blobs[0];\n    const Mat& B0 = constantB ? B_data : constantA ? bottom_blobs[0] : bottom_blobs[1];\n\n    Mat A;\n    if (transA == 0)\n    {\n        A = A0;\n    }\n    else\n    {\n        // transpose A to row-major\n        if (A0.elemsize == 1)\n        {\n            A.create(A0.h, A0.w, (size_t)1u, 1, opt.workspace_allocator);\n            if (A.empty())\n                return -100;\n\n            for (int i = 0; i < A.h; i++)\n            {\n                signed char* ptr = A.row<signed char>(i);\n                for (int j = 0; j < A.w; j++)\n                {\n                    ptr[j] = A0.row<const signed char>(j)[i];\n                }\n            }\n        }\n        else\n        {\n            A.create(A0.dims == 3 ? A0.c : A0.h, A0.w, (size_t)4u, 1, opt.workspace_allocator);\n            if (A.empty())\n                return -100;\n\n            for (int i = 0; i < A.h; i++)\n            {\n                float* ptr = A.row(i);\n                for (int j = 0; j < A.w; j++)\n                {\n                    ptr[j] = A0.dims == 3 ? A0.channel(j)[i] : A0.row(j)[i];\n                }\n            }\n        }\n    }\n\n    // dynamic quantize A\n    Mat A_int8 = A;\n    Mat A_int8_scales = A_data_int8_scales;\n    if (A_int8.elemsize != 1)\n    {\n        A_int8.create(A.w, A.dims == 3 ? 
A.c : A.h, (size_t)1u, 1, opt.workspace_allocator);\n        if (A_int8.empty())\n            return -100;\n        A_int8_scales.create(A_int8.h, (size_t)4u, 1, opt.workspace_allocator);\n        if (A_int8_scales.empty())\n            return -100;\n\n        for (int i = 0; i < A_int8.h; i++)\n        {\n            const size_t A_hstep = A.dims == 3 ? A.cstep : (size_t)A.w;\n            const float* ptr = (const float*)A + i * A_hstep;\n\n            float absmax = 0.f;\n            for (int k = 0; k < A_int8.w; k++)\n            {\n                absmax = std::max(absmax, (float)fabs(ptr[k]));\n            }\n\n            float A_int8_scale = absmax == 0.f ? 1.f : 127.f / absmax;\n            A_int8_scales[i] = A_int8_scale;\n\n            signed char* ptrAi = A_int8.row<signed char>(i);\n\n            for (int k = 0; k < A_int8.w; k++)\n            {\n                ptrAi[k] = float2int8(ptr[k] * A_int8_scale);\n            }\n        }\n    }\n\n    // dynamic quantize B\n    Mat B0_int8 = B0;\n    float B_int8_scale = B_data_int8_scale;\n    if (B0_int8.elemsize != 1)\n    {\n        B0_int8.create(B0.w, B0.dims == 3 ? B0.c : B0.h, (size_t)1u, 1, opt.workspace_allocator);\n        if (B0_int8.empty())\n            return -100;\n\n        float absmax = 0.f;\n        for (int i = 0; i < B0_int8.h; i++)\n        {\n            const size_t B_hstep = B0.dims == 3 ? B0.cstep : (size_t)B0.w;\n            const float* ptr = (const float*)B0 + i * B_hstep;\n\n            for (int k = 0; k < B0_int8.w; k++)\n            {\n                absmax = std::max(absmax, (float)fabs(ptr[k]));\n            }\n        }\n\n        B_int8_scale = absmax == 0.f ? 1.f : 127.f / absmax;\n\n        for (int i = 0; i < B0_int8.h; i++)\n        {\n            const size_t B_hstep = B0.dims == 3 ? 
B0.cstep : (size_t)B0.w;\n            const float* ptr = (const float*)B0 + i * B_hstep;\n\n            signed char* ptrBi = B0_int8.row<signed char>(i);\n\n            for (int k = 0; k < B0_int8.w; k++)\n            {\n                ptrBi[k] = float2int8(ptr[k] * B_int8_scale);\n            }\n        }\n    }\n\n    Mat BT_int8;\n    if (transB == 0)\n    {\n        // transpose B to col-major\n        BT_int8.create(B0_int8.h, B0_int8.w, (size_t)1u, 1, opt.workspace_allocator);\n        if (BT_int8.empty())\n            return -100;\n\n        for (int i = 0; i < BT_int8.h; i++)\n        {\n            signed char* ptr = BT_int8.row<signed char>(i);\n            for (int j = 0; j < BT_int8.w; j++)\n            {\n                ptr[j] = B0_int8.row<const signed char>(j)[i];\n            }\n        }\n    }\n    else\n    {\n        BT_int8 = B0_int8;\n    }\n\n    const int M = A_int8.h;\n    const int N = BT_int8.h;\n\n    Mat C;\n    int broadcast_type_C = 0;\n    if (constantC)\n    {\n        C = C_data;\n        broadcast_type_C = constant_broadcast_type_C;\n    }\n    else\n    {\n        if (constantA && constantB && bottom_blobs.size() == 1)\n        {\n            C = bottom_blobs[0];\n        }\n        else if ((constantA || constantB) && bottom_blobs.size() == 2)\n        {\n            C = bottom_blobs[1];\n        }\n        else if (bottom_blobs.size() == 3)\n        {\n            C = bottom_blobs[2];\n        }\n\n        if (!C.empty())\n        {\n            if (C.dims == 1 && C.w == 1)\n            {\n                // scalar\n                broadcast_type_C = 0;\n            }\n            if (C.dims == 1 && C.w == M)\n            {\n                // M\n                // auto broadcast from h to w is the ncnn-style convention\n                broadcast_type_C = 1;\n            }\n            if (C.dims == 1 && C.w == N)\n            {\n                // N\n                broadcast_type_C = 4;\n            }\n            if 
(C.dims == 2 && C.w == 1 && C.h == M)\n            {\n                // Mx1\n                broadcast_type_C = 2;\n            }\n            if (C.dims == 2 && C.w == N && C.h == M)\n            {\n                // MxN\n                broadcast_type_C = 3;\n            }\n            if (C.dims == 2 && C.w == N && C.h == 1)\n            {\n                // 1xN\n                broadcast_type_C = 4;\n            }\n        }\n    }\n\n    Mat& top_blob = top_blobs[0];\n    if (output_transpose)\n    {\n        if (output_N1M)\n            top_blob.create(M, 1, N, 4u, opt.blob_allocator);\n        else\n            top_blob.create(M, N, 4u, opt.blob_allocator);\n    }\n    else\n    {\n        if (output_N1M)\n            top_blob.create(N, 1, M, 4u, opt.blob_allocator);\n        else\n            top_blob.create(N, M, 4u, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    gemm_transB_int8(A_int8, BT_int8, A_int8_scales, B_int8_scale, C, top_blob, alpha, beta, broadcast_type_C, output_transpose, opt);\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/gemm.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GEMM_H\n#define LAYER_GEMM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Gemm : public Layer\n{\npublic:\n    Gemm();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_INT8\n    int forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n#endif\n\npublic:\n    float alpha;\n    float beta;\n    int transA;\n    int transB;\n\n    int constantA;\n    int constantB;\n    int constantC;\n    int constantM;\n    int constantN;\n    int constantK;\n    int constant_broadcast_type_C;\n    int output_N1M;\n    int output_elempack;\n    int output_elemtype; // 0=auto 1=fp32\n    int output_transpose;\n\n    int int8_scale_term;\n\n    int constant_TILE_M;\n    int constant_TILE_N;\n    int constant_TILE_K;\n\n    // constant A / B / C\n    Mat A_data;\n    Mat B_data;\n    Mat C_data;\n\n#if NCNN_INT8\n    Mat A_data_int8_scales;\n    float B_data_int8_scale;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GEMM_H\n"
  },
  {
    "path": "src/layer/glu.cpp",
    "content": "// Copyright 2022 Xiaomi Corp.   (author: Fangjun Kuang)\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"glu.h\"\n\nnamespace ncnn {\n\nGLU::GLU()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint GLU::load_param(const ParamDict& pd)\n{\n    axis = pd.get(0, 0);\n\n    return 0;\n}\n\nint GLU::forward(const Mat& bottom_blob, Mat& top_blob,\n                 const Option& opt) const\n{\n    int dims = bottom_blob.dims;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1)\n    {   // ignore axis\n        int w = bottom_blob.w;\n        int out_w = w / 2;\n        top_blob.create(out_w, sizeof(float), opt.blob_allocator);\n\n        const float* in_ptr = bottom_blob;\n        float* out_ptr = top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int x = 0; x < out_w; ++x)\n        {\n            float sigmoid = 1.f / (1.f + expf(-in_ptr[x + out_w]));\n\n            out_ptr[x] = in_ptr[x] * sigmoid;\n        }\n\n        return 0;\n    } // if (dims == 1)\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int out_w = w;\n        int out_h = h / 2;\n        top_blob.create(out_w, out_h, sizeof(float), opt.blob_allocator);\n\n        int offset = out_w * out_h;\n\n#if 0\n        // this one is equivalent to the else branch. 
It is more readable\n        // but less efficient\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int y = 0; y < out_h; ++y) {\n            const float *in_ptr = bottom_blob.row(y);\n            float *out_ptr = top_blob.row(y);\n\n            for (int x = 0; x < w; ++x) {\n                float sigmoid =\n                    1.f / (1.f + expf(-in_ptr[x + offset]));\n\n                out_ptr[x] = in_ptr[x] * sigmoid;\n            }\n        }\n#else\n        int size = offset;\n        const float* in_ptr = bottom_blob;\n        float* out_ptr = top_blob;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < size; ++i)\n        {\n            float sigmoid = 1.f / (1.f + expf(-in_ptr[i + offset]));\n            out_ptr[i] = in_ptr[i] * sigmoid;\n        }\n#endif\n\n        return 0;\n    } // if (dims == 2 && positive_axis == 0)\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int out_w = w / 2;\n        int out_h = h;\n\n        top_blob.create(out_w, out_h, sizeof(float), opt.blob_allocator);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int y = 0; y < h; ++y)\n        {\n            const float* in_ptr = bottom_blob.row(y);\n            float* out_ptr = top_blob.row(y);\n\n            for (int x = 0; x < out_w; ++x)\n            {\n                float sigmoid = 1.f / (1.f + expf(-in_ptr[x + out_w]));\n                out_ptr[x] = in_ptr[x] * sigmoid;\n            }\n        }\n\n        return 0;\n    } // if (dims == 2 && positive_axis == 1)\n\n    if (dims == 3 && positive_axis == 0)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int c = bottom_blob.c;\n\n        int out_w = w;\n        int out_h = h;\n        int out_c = c / 2;\n\n        top_blob.create(out_w, out_h, out_c, sizeof(float), opt.blob_allocator);\n\n        size_t offset = out_c * 
bottom_blob.cstep;\n        int size = w * h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < out_c; ++q)\n        {\n            const float* in_ptr = bottom_blob.channel(q);\n            float* out_ptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; ++i)\n            {\n                float sigmoid = 1.f / (1.f + expf(-in_ptr[i + offset]));\n                out_ptr[i] = in_ptr[i] * sigmoid;\n            }\n        }\n        return 0;\n    } //   if (dims == 3 && positive_axis == 0) {\n\n    if (dims == 3 && positive_axis == 1)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int c = bottom_blob.c;\n\n        int out_w = w;\n        int out_h = h / 2;\n        int out_c = c;\n\n        top_blob.create(out_w, out_h, out_c, sizeof(float), opt.blob_allocator);\n\n        int offset = out_h * out_w;\n        int size = offset;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; ++q)\n        {\n            const float* in_ptr = bottom_blob.channel(q);\n            float* out_ptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; ++i)\n            {\n                float sigmoid = 1.f / (1.f + expf(-in_ptr[i + offset]));\n                out_ptr[i] = in_ptr[i] * sigmoid;\n            }\n        }\n        return 0;\n    } // if (dims == 3 && positive_axis == 1)\n\n    if (dims == 3 && positive_axis == 2)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int c = bottom_blob.c;\n\n        int out_w = w / 2;\n        int out_h = h;\n        int out_c = c;\n\n        top_blob.create(out_w, out_h, out_c, sizeof(float), opt.blob_allocator);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; ++q)\n        {\n            const float* in_ptr = bottom_blob.channel(q);\n            float* out_ptr = top_blob.channel(q);\n            for (int y = 0; y < h; 
++y)\n            {\n                for (int x = 0; x < out_w; ++x)\n                {\n                    float sigmoid = 1.f / (1.f + expf(-in_ptr[x + out_w]));\n                    out_ptr[x] = in_ptr[x] * sigmoid;\n                }\n                in_ptr += w;\n                out_ptr += out_w;\n            }\n        }\n        return 0;\n    } // if (dims == 3 && positive_axis == 2)\n\n    return -100;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/glu.h",
    "content": "// Copyright 2022 Xiaomi Corp.   (author: Fangjun Kuang)\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GLU_H\n#define LAYER_GLU_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass GLU : public Layer\n{\npublic:\n    GLU();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob,\n                        const Option& opt) const;\n\npublic:\n    int axis;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GLU_H\n"
  },
  {
    "path": "src/layer/gridsample.cpp",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gridsample.h\"\n\nnamespace ncnn {\n\nGridSample::GridSample()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint GridSample::load_param(const ParamDict& pd)\n{\n    sample_type = pd.get(0, 1);\n    padding_mode = pd.get(1, 1);\n    align_corner = pd.get(2, 0);\n    permute_fusion = pd.get(3, 0);\n\n    if (sample_type < 1 || sample_type > 3)\n    {\n        NCNN_LOGE(\"unsupported sample type %d\", sample_type);\n        return -1;\n    }\n\n    if (padding_mode < 1 || padding_mode > 3)\n    {\n        NCNN_LOGE(\"unsupported padding mode %d\", padding_mode);\n        return -1;\n    }\n\n    return 0;\n}\n\n// Restore normalized location to actual image location\n//   When align_corners is true:\n//     Normalized location (-1, -1) points to the top-left pixel.\n//     Normalized location (1, 1) points to the bottom-right pixel.\n//   When align_corners is false [default]:\n//     Normalized location (-1, -1) points to the top-left pixel minus half\n//     pixel coord both directions, i.e. (-0.5, -0.5) coord in actual image space.\n//     Normalized location (1, 1) points to the bottom-right pixel plus half\n//     pixel coord both directions, i.e. (H - 0.5, W - 0.5) coord in actual image space.\nstatic float grid_sample_unormalize(int w, float coordx, int align_corner)\n{\n    return align_corner ? 
(coordx + 1) / 2.f * (w - 1) : ((coordx + 1) * w - 1) / 2.f;\n}\n\nstatic float border_coord(float x, float border)\n{\n    return std::min(border, std::max(x, 0.0f));\n}\n\nstatic float reflect_coord(float x, int high)\n{\n    x = fabs(x);\n    x = high - fabs(x - high);\n    return x;\n}\n\nstatic float compute_coord(float sx, int w, int padding_mode, int align_corner)\n{\n    if (padding_mode == 2) // border\n    {\n        sx = border_coord(sx, w - 1);\n    }\n    else if (padding_mode == 3) // reflection\n    {\n        if (align_corner)\n        {\n            sx = reflect_coord(sx, w - 1);\n        }\n        else\n        {\n            sx = reflect_coord(sx + 0.5, w) - 0.5;\n            sx = border_coord(sx, w - 1);\n        }\n    }\n\n    return sx;\n}\n\nstatic bool in_bounds(const Mat& image, int x, int y)\n{\n    return x >= 0 && y >= 0 && x < image.w && y < image.h;\n}\n\nstatic bool in_bounds(const Mat& image, int x, int y, int z)\n{\n    return x >= 0 && y >= 0 && z >= 0 && x < image.w && y < image.h && z < image.c;\n}\n\nstatic float get_value_bounded(const Mat& image, int x, int y)\n{\n    return in_bounds(image, x, y) ? image.row(y)[x] : 0.f;\n}\n\nstatic float get_value_bounded(const Mat& image, int x, int y, int z)\n{\n    return in_bounds(image, x, y, z) ? 
image.depth(z).row(y)[x] : 0.f;\n}\n\nstatic float get_value_bounded(const Mat& image, int x, int y, int padding_mode, int align_corner)\n{\n    x = compute_coord(x, image.w, padding_mode, align_corner);\n    y = compute_coord(y, image.h, padding_mode, align_corner);\n\n    return get_value_bounded(image, x, y);\n}\n\nstatic inline void interpolate_cubic(float fx, float* coeffs)\n{\n    const float A = -0.75f;\n\n    float fx0 = fx + 1;\n    float fx1 = fx;\n    float fx2 = 1 - fx;\n    // float fx3 = 2 - fx;\n\n    coeffs[0] = A * fx0 * fx0 * fx0 - 5 * A * fx0 * fx0 + 8 * A * fx0 - 4 * A;\n    coeffs[1] = (A + 2) * fx1 * fx1 * fx1 - (A + 3) * fx1 * fx1 + 1;\n    coeffs[2] = (A + 2) * fx2 * fx2 * fx2 - (A + 3) * fx2 * fx2 + 1;\n    coeffs[3] = 1.f - coeffs[0] - coeffs[1] - coeffs[2];\n}\n\nint GridSample::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& grid = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n\n    if (dims == 3)\n    {\n        int outw = permute_fusion == 0 ? grid.h : grid.w;\n        int outh = permute_fusion == 0 ? 
grid.c : grid.h;\n\n        top_blob.create(outw, outh, channels, elemsize, opt.blob_allocator);\n\n        Mat offset_blob;\n        offset_blob.create(outw, outh, grid.c, elemsize, opt.workspace_allocator);\n\n        if (top_blob.empty() || offset_blob.empty())\n            return -100;\n\n        //pre-calculate all interpolation offsets for each x y, unpack grid on-the-fly\n        if (permute_fusion == 0)\n        {\n            float* offsetptr_x = offset_blob.channel(0);\n            float* offsetptr_y = offset_blob.channel(1);\n\n            for (int y = 0; y < outh; y++)\n            {\n                const float* gridptr = grid.channel(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    float sample_x = gridptr[0];\n                    float sample_y = gridptr[1];\n\n                    sample_x = grid_sample_unormalize(w, sample_x, align_corner);\n                    sample_y = grid_sample_unormalize(h, sample_y, align_corner);\n\n                    *offsetptr_x = sample_x;\n                    *offsetptr_y = sample_y;\n\n                    gridptr += 2;\n                    offsetptr_x++;\n                    offsetptr_y++;\n                }\n            }\n        }\n        else\n        {\n            const float* gridptr_x = grid.channel(0);\n            const float* gridptr_y = grid.channel(1);\n            float* offsetptr_x = offset_blob.channel(0);\n            float* offsetptr_y = offset_blob.channel(1);\n\n            for (int y = 0; y < outh; y++)\n            {\n                for (int x = 0; x < outw; x++)\n                {\n                    float sample_x = *gridptr_x;\n                    float sample_y = *gridptr_y;\n\n                    sample_x = grid_sample_unormalize(w, sample_x, align_corner);\n                    sample_y = grid_sample_unormalize(h, sample_y, align_corner);\n\n                    *offsetptr_x = sample_x;\n                    *offsetptr_y = sample_y;\n\n             
       gridptr_x++;\n                    gridptr_y++;\n                    offsetptr_x++;\n                    offsetptr_y++;\n                }\n            }\n        }\n\n        if (sample_type == Interpolation_BILINEAR) // bilinear\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat image = bottom_blob.channel(q);\n                float* outptr = top_blob.channel(q);\n                const float* offsetptr_x = offset_blob.channel(0);\n                const float* offsetptr_y = offset_blob.channel(1);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    for (int x = 0; x < outw; x++)\n                    {\n                        float sample_x = *offsetptr_x;\n                        float sample_y = *offsetptr_y;\n\n                        // bilinear interpolate\n                        float v;\n                        {\n                            sample_x = compute_coord(sample_x, w, padding_mode, align_corner);\n                            sample_y = compute_coord(sample_y, h, padding_mode, align_corner);\n                            int x0 = floor(sample_x);\n                            int y0 = floor(sample_y);\n                            int x1 = x0 + 1;\n                            int y1 = y0 + 1;\n\n                            float v00 = get_value_bounded(image, x0, y0);\n                            float v01 = get_value_bounded(image, x1, y0);\n                            float v10 = get_value_bounded(image, x0, y1);\n                            float v11 = get_value_bounded(image, x1, y1);\n\n                            float alpha = sample_x - x0;\n                            float beta = sample_y - y0;\n\n                            float v0 = v00 * (1 - alpha) + v01 * alpha;\n                            float v1 = v10 * (1 - alpha) + v11 * alpha;\n\n                            v = v0 * (1 
- beta) + v1 * beta;\n                        }\n\n                        outptr[0] = v;\n                        outptr += 1;\n\n                        offsetptr_x++;\n                        offsetptr_y++;\n                    }\n                }\n            }\n        }\n        else if (sample_type == Interpolation_NEAREST) // nearest\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat image = bottom_blob.channel(q);\n                float* outptr = top_blob.channel(q);\n                const float* offsetptr_x = offset_blob.channel(0);\n                const float* offsetptr_y = offset_blob.channel(1);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    for (int x = 0; x < outw; x++)\n                    {\n                        float sample_x = *offsetptr_x;\n                        float sample_y = *offsetptr_y;\n                        sample_x = compute_coord(sample_x, w, padding_mode, align_corner);\n                        sample_y = compute_coord(sample_y, h, padding_mode, align_corner);\n\n                        int x0 = static_cast<int>(floor(sample_x + 0.5f));\n                        int y0 = static_cast<int>(floor(sample_y + 0.5f));\n\n                        float v = get_value_bounded(image, x0, y0);\n\n                        outptr[0] = v;\n                        outptr += 1;\n\n                        offsetptr_x++;\n                        offsetptr_y++;\n                    }\n                }\n            }\n        }\n        else if (sample_type == Interpolation_BICUBIC) // bicubic\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat image = bottom_blob.channel(q);\n                float* outptr = top_blob.channel(q);\n                const float* offsetptr_x = 
offset_blob.channel(0);\n                const float* offsetptr_y = offset_blob.channel(1);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    for (int x = 0; x < outw; x++)\n                    {\n                        float sample_x = *offsetptr_x;\n                        float sample_y = *offsetptr_y;\n\n                        // bicubic interpolate\n                        float v;\n                        {\n                            int x1 = (int)floorf(sample_x);\n                            int y1 = (int)floorf(sample_y);\n                            int x0 = x1 - 1;\n                            int y0 = y1 - 1;\n                            int x2 = x1 + 1;\n                            int y2 = y1 + 1;\n                            int x3 = x1 + 2;\n                            int y3 = y1 + 2;\n\n                            float v00 = get_value_bounded(image, x0, y0, padding_mode, align_corner);\n                            float v01 = get_value_bounded(image, x1, y0, padding_mode, align_corner);\n                            float v02 = get_value_bounded(image, x2, y0, padding_mode, align_corner);\n                            float v03 = get_value_bounded(image, x3, y0, padding_mode, align_corner);\n                            float v10 = get_value_bounded(image, x0, y1, padding_mode, align_corner);\n                            float v11 = get_value_bounded(image, x1, y1, padding_mode, align_corner);\n                            float v12 = get_value_bounded(image, x2, y1, padding_mode, align_corner);\n                            float v13 = get_value_bounded(image, x3, y1, padding_mode, align_corner);\n                            float v20 = get_value_bounded(image, x0, y2, padding_mode, align_corner);\n                            float v21 = get_value_bounded(image, x1, y2, padding_mode, align_corner);\n                            float v22 = get_value_bounded(image, x2, y2, padding_mode, align_corner);\n       
                     float v23 = get_value_bounded(image, x3, y2, padding_mode, align_corner);\n                            float v30 = get_value_bounded(image, x0, y3, padding_mode, align_corner);\n                            float v31 = get_value_bounded(image, x1, y3, padding_mode, align_corner);\n                            float v32 = get_value_bounded(image, x2, y3, padding_mode, align_corner);\n                            float v33 = get_value_bounded(image, x3, y3, padding_mode, align_corner);\n\n                            float x_coeffs[4];\n                            float y_coeffs[4];\n                            interpolate_cubic(sample_x - x1, x_coeffs);\n                            interpolate_cubic(sample_y - y1, y_coeffs);\n\n                            float v0 = v00 * x_coeffs[0] + v01 * x_coeffs[1] + v02 * x_coeffs[2] + v03 * x_coeffs[3];\n                            float v1 = v10 * x_coeffs[0] + v11 * x_coeffs[1] + v12 * x_coeffs[2] + v13 * x_coeffs[3];\n                            float v2 = v20 * x_coeffs[0] + v21 * x_coeffs[1] + v22 * x_coeffs[2] + v23 * x_coeffs[3];\n                            float v3 = v30 * x_coeffs[0] + v31 * x_coeffs[1] + v32 * x_coeffs[2] + v33 * x_coeffs[3];\n\n                            v = v0 * y_coeffs[0] + v1 * y_coeffs[1] + v2 * y_coeffs[2] + v3 * y_coeffs[3];\n                        }\n\n                        outptr[0] = v;\n                        outptr += 1;\n\n                        offsetptr_x++;\n                        offsetptr_y++;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4)\n    {\n        int outw = permute_fusion == 0 ? grid.h : grid.w;\n        int outh = permute_fusion == 0 ? grid.d : grid.h;\n        int outd = permute_fusion == 0 ? 
grid.c : grid.d;\n\n        top_blob.create(outw, outh, outd, channels, elemsize, opt.blob_allocator);\n\n        Mat offset_blob;\n        offset_blob.create(outw, outh, outd, grid.c, elemsize, opt.workspace_allocator);\n\n        if (top_blob.empty() || offset_blob.empty())\n            return -100;\n\n        //pre-calculate all interpolation offsets for each x y, unpack grid on-the-fly\n        if (permute_fusion == 0)\n        {\n            float* offsetptr_x = offset_blob.channel(0);\n            float* offsetptr_y = offset_blob.channel(1);\n            float* offsetptr_z = offset_blob.channel(2);\n\n            for (int z = 0; z < outd; z++)\n            {\n                const float* gridptr = grid.channel(z);\n                for (int y = 0; y < outh; y++)\n                {\n                    for (int x = 0; x < outw; x++)\n                    {\n                        float sample_x = gridptr[0];\n                        float sample_y = gridptr[1];\n                        float sample_z = gridptr[2];\n\n                        sample_x = grid_sample_unormalize(w, sample_x, align_corner);\n                        sample_x = compute_coord(sample_x, w, padding_mode, align_corner);\n\n                        sample_y = grid_sample_unormalize(h, sample_y, align_corner);\n                        sample_y = compute_coord(sample_y, h, padding_mode, align_corner);\n\n                        sample_z = grid_sample_unormalize(d, sample_z, align_corner);\n                        sample_z = compute_coord(sample_z, d, padding_mode, align_corner);\n\n                        *offsetptr_x = sample_x;\n                        *offsetptr_y = sample_y;\n                        *offsetptr_z = sample_z;\n\n                        gridptr += 3;\n                        offsetptr_x++;\n                        offsetptr_y++;\n                        offsetptr_z++;\n                    }\n                }\n            }\n        }\n        else\n        {\n            
const float* gridptr_x = grid.channel(0);\n            const float* gridptr_y = grid.channel(1);\n            const float* gridptr_z = grid.channel(2);\n            float* offsetptr_x = offset_blob.channel(0);\n            float* offsetptr_y = offset_blob.channel(1);\n            float* offsetptr_z = offset_blob.channel(2);\n\n            for (int z = 0; z < outd; z++)\n            {\n                for (int y = 0; y < outh; y++)\n                {\n                    for (int x = 0; x < outw; x++)\n                    {\n                        float sample_x = *gridptr_x;\n                        float sample_y = *gridptr_y;\n                        float sample_z = *gridptr_z;\n\n                        sample_x = grid_sample_unormalize(w, sample_x, align_corner);\n                        sample_x = compute_coord(sample_x, w, padding_mode, align_corner);\n\n                        sample_y = grid_sample_unormalize(h, sample_y, align_corner);\n                        sample_y = compute_coord(sample_y, h, padding_mode, align_corner);\n\n                        sample_z = grid_sample_unormalize(d, sample_z, align_corner);\n                        sample_z = compute_coord(sample_z, d, padding_mode, align_corner);\n\n                        *offsetptr_x = sample_x;\n                        *offsetptr_y = sample_y;\n                        *offsetptr_z = sample_z;\n\n                        gridptr_x++;\n                        gridptr_y++;\n                        gridptr_z++;\n                        offsetptr_x++;\n                        offsetptr_y++;\n                        offsetptr_z++;\n                    }\n                }\n            }\n        }\n\n        if (sample_type == Interpolation_BILINEAR) // bilinear\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat image = bottom_blob.channel(q);\n                float* outptr = 
top_blob.channel(q);\n                const float* offsetptr_x = offset_blob.channel(0);\n                const float* offsetptr_y = offset_blob.channel(1);\n                const float* offsetptr_z = offset_blob.channel(2);\n\n                for (int z = 0; z < outd; z++)\n                {\n                    for (int y = 0; y < outh; y++)\n                    {\n                        for (int x = 0; x < outw; x++)\n                        {\n                            float sample_x = *offsetptr_x;\n                            float sample_y = *offsetptr_y;\n                            float sample_z = *offsetptr_z;\n\n                            // bilinear interpolate\n                            float v;\n                            {\n                                int x0 = (int)floor(sample_x);\n                                int y0 = (int)floor(sample_y);\n                                int z0 = (int)floor(sample_z);\n                                int x1 = x0 + 1;\n                                int y1 = y0 + 1;\n                                int z1 = z0 + 1;\n\n                                float v000 = get_value_bounded(image, x0, y0, z0);\n                                float v001 = get_value_bounded(image, x1, y0, z0);\n                                float v010 = get_value_bounded(image, x0, y1, z0);\n                                float v011 = get_value_bounded(image, x1, y1, z0);\n                                float v100 = get_value_bounded(image, x0, y0, z1);\n                                float v101 = get_value_bounded(image, x1, y0, z1);\n                                float v110 = get_value_bounded(image, x0, y1, z1);\n                                float v111 = get_value_bounded(image, x1, y1, z1);\n\n                                float alpha = sample_x - x0;\n                                float beta = sample_y - y0;\n                                float gamma = sample_z - z0;\n\n                                float 
v00 = v000 * (1 - alpha) + v001 * alpha;\n                                float v01 = v010 * (1 - alpha) + v011 * alpha;\n                                float v10 = v100 * (1 - alpha) + v101 * alpha;\n                                float v11 = v110 * (1 - alpha) + v111 * alpha;\n\n                                float v0 = v00 * (1 - beta) + v01 * beta;\n                                float v1 = v10 * (1 - beta) + v11 * beta;\n\n                                v = v0 * (1 - gamma) + v1 * gamma;\n                            }\n\n                            outptr[0] = v;\n                            outptr += 1;\n\n                            offsetptr_x++;\n                            offsetptr_y++;\n                            offsetptr_z++;\n                        }\n                    }\n                }\n            }\n        }\n        else if (sample_type == Interpolation_NEAREST) // nearest\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat image = bottom_blob.channel(q);\n                float* outptr = top_blob.channel(q);\n                const float* offsetptr_x = offset_blob.channel(0);\n                const float* offsetptr_y = offset_blob.channel(1);\n                const float* offsetptr_z = offset_blob.channel(2);\n\n                for (int z = 0; z < outd; z++)\n                {\n                    for (int y = 0; y < outh; y++)\n                    {\n                        for (int x = 0; x < outw; x++)\n                        {\n                            float sample_x = *offsetptr_x;\n                            float sample_y = *offsetptr_y;\n                            float sample_z = *offsetptr_z;\n\n                            int x0 = static_cast<int>(floor(sample_x + 0.5f));\n                            int y0 = static_cast<int>(floor(sample_y + 0.5f));\n                            int z0 = 
static_cast<int>(floor(sample_z + 0.5f));\n\n                            float v = get_value_bounded(image, x0, y0, z0);\n\n                            outptr[0] = v;\n                            outptr += 1;\n\n                            offsetptr_x++;\n                            offsetptr_y++;\n                            offsetptr_z++;\n                        }\n                    }\n                }\n            }\n        }\n        else if (sample_type == Interpolation_BICUBIC)\n        {\n            NCNN_LOGE(\"unsupported bicubic when dims == 4\");\n            return -1;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/gridsample.h",
    "content": "// Copyright 2023 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GRIDSAMPLE_H\n#define LAYER_GRIDSAMPLE_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass GridSample : public Layer\n{\npublic:\n    GridSample();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\n    enum InterpolationMode // 1=bilinear  2=nearest  3=bicubic\n    {\n        Interpolation_BILINEAR = 1,\n        Interpolation_NEAREST = 2,\n        Interpolation_BICUBIC = 3\n    };\n\n    enum PaddingMode // 1=zeros     2=border   3=reflection\n    {\n        Padding_ZEROS = 1,\n        Padding_BORDER = 2,\n        Padding_REFLECTION = 3\n    };\n\npublic:\n    // param\n    int sample_type;  // 1=bilinear  2=nearest  3=bicubic\n    int padding_mode; // 1=zeros     2=border   3=reflection\n    int align_corner;\n\n    int permute_fusion;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GRIDSAMPLE_H\n"
  },
  {
    "path": "src/layer/groupnorm.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"groupnorm.h\"\n\nnamespace ncnn {\n\nGroupNorm::GroupNorm()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint GroupNorm::load_param(const ParamDict& pd)\n{\n    group = pd.get(0, 1);\n    channels = pd.get(1, 0);\n    eps = pd.get(2, 0.001f);\n    affine = pd.get(3, 1);\n\n    return 0;\n}\n\nint GroupNorm::load_model(const ModelBin& mb)\n{\n    if (affine == 0)\n        return 0;\n\n    gamma_data = mb.load(channels, 1);\n    if (gamma_data.empty())\n        return -100;\n\n    beta_data = mb.load(channels, 1);\n    if (beta_data.empty())\n        return -100;\n\n    return 0;\n}\n\nstatic void groupnorm(float* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int channels, int size, size_t cstep)\n{\n    float sum = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr0 = ptr + cstep * q;\n        for (int i = 0; i < size; i++)\n        {\n            sum += ptr0[i];\n        }\n    }\n\n    float mean = sum / (channels * size);\n\n    float sqsum = 0.f;\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr0 = ptr + cstep * q;\n        for (int i = 0; i < size; i++)\n        {\n            float v = ptr0[i] - mean;\n            sqsum += v * v;\n        }\n    }\n\n    float var = sqsum / (channels * size);\n\n    float a = 1.f / sqrtf(var + eps);\n    float b = -mean * a;\n\n    if (gamma_ptr && beta_ptr)\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr0 = ptr + cstep * q;\n            const float gamma = gamma_ptr[q];\n            const float beta = beta_ptr[q];\n            for (int i = 0; i < size; i++)\n            {\n                ptr0[i] = (ptr0[i] * a + b) * gamma + beta;\n            }\n        }\n    }\n    else\n    {\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr0 = ptr + cstep * q;\n            for (int i = 0; i 
< size; i++)\n            {\n                ptr0[i] = ptr0[i] * a + b;\n            }\n        }\n    }\n}\n\nint GroupNorm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    const int dims = bottom_top_blob.dims;\n    const int channels_g = channels / group;\n\n    if (dims == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob.range(g * channels_g, channels_g);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g, 1, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        const int w = bottom_top_blob.w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob.row_range(g * channels_g, channels_g);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? (const float*)beta_data + g * channels_g : 0;\n            groupnorm(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g, w, w);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int size = bottom_top_blob.w * bottom_top_blob.h * bottom_top_blob.d;\n        const size_t cstep = bottom_top_blob.cstep;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int g = 0; g < group; g++)\n        {\n            Mat bottom_top_blob_g = bottom_top_blob.channel_range(g * channels_g, channels_g);\n            const float* gamma_ptr = affine ? (const float*)gamma_data + g * channels_g : 0;\n            const float* beta_ptr = affine ? 
(const float*)beta_data + g * channels_g : 0;\n            groupnorm(bottom_top_blob_g, gamma_ptr, beta_ptr, eps, channels_g, size, cstep);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/groupnorm.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GROUPNORM_H\n#define LAYER_GROUPNORM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass GroupNorm : public Layer\n{\npublic:\n    GroupNorm();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int group;\n    int channels;\n    float eps;\n    int affine;\n\n    // model\n    Mat gamma_data;\n    Mat beta_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GROUPNORM_H\n"
  },
  {
    "path": "src/layer/gru.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"gru.h\"\n\nnamespace ncnn {\n\nGRU::GRU()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint GRU::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    weight_data_size = pd.get(1, 0);\n    direction = pd.get(2, 0);\n    int8_scale_term = pd.get(8, 0);\n\n    if (int8_scale_term)\n    {\n#if !NCNN_INT8\n        NCNN_LOGE(\"please build ncnn with NCNN_INT8 enabled for int8 inference\");\n        return -1;\n#endif\n    }\n\n    return 0;\n}\n\nint GRU::load_model(const ModelBin& mb)\n{\n    int num_directions = direction == 2 ? 2 : 1;\n\n    int size = weight_data_size / num_directions / num_output / 3;\n\n    // raw weight data\n    weight_xc_data = mb.load(size, num_output * 3, num_directions, 0);\n    if (weight_xc_data.empty())\n        return -100;\n\n    bias_c_data = mb.load(num_output, 4, num_directions, 0);\n    if (bias_c_data.empty())\n        return -100;\n\n    weight_hc_data = mb.load(num_output, num_output * 3, num_directions, 0);\n    if (weight_hc_data.empty())\n        return -100;\n\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        weight_xc_data_int8_scales = mb.load(num_output * 3, num_directions, 1);\n        weight_hc_data_int8_scales = mb.load(num_output * 3, num_directions, 1);\n    }\n#endif // NCNN_INT8\n\n    return 0;\n}\n\nstatic int gru(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // 2 x num_output\n    Mat gates(2, num_output, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? 
T - 1 - t : t;\n\n        const float* x = bottom_blob.row(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < num_output; q++)\n        {\n            float* gates_data = gates.row(q);\n\n            // gate reset update\n            const float* bias_c_R = bias_c.row(0);\n            const float* bias_c_U = bias_c.row(1);\n\n            const float* weight_xc_R = weight_xc.row(num_output * 0 + q);\n            const float* weight_xc_U = weight_xc.row(num_output * 1 + q);\n            const float* weight_hc_R = weight_hc.row(num_output * 0 + q);\n            const float* weight_hc_U = weight_hc.row(num_output * 1 + q);\n\n            float R = bias_c_R[q];\n            float U = bias_c_U[q];\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = x[i];\n\n                R += weight_xc_R[i] * xi;\n                U += weight_xc_U[i] * xi;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                R += weight_hc_R[i] * h_cont;\n                U += weight_hc_U[i] * h_cont;\n            }\n\n            // sigmoid(R)\n            // sigmoid(U)\n            R = 1.f / (1.f + expf(-R));\n            U = 1.f / (1.f + expf(-U));\n\n            // gate new\n            const float* bias_c_WN = bias_c.row(2);\n            const float* bias_c_BN = bias_c.row(3);\n\n            const float* weight_xc_N = weight_xc.row(num_output * 2 + q);\n            const float* weight_hc_N = weight_hc.row(num_output * 2 + q);\n\n            float N = bias_c_BN[q];\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                N += weight_hc_N[i] * h_cont;\n            }\n\n            N = bias_c_WN[q] + R * N;\n\n            for (int i = 0; i < size; i++)\n            {\n                float xi = x[i];\n\n                N += weight_xc_N[i] * xi;\n    
        }\n\n            // tanh(N)\n            N = tanhf(N);\n\n            gates_data[0] = U;\n            gates_data[1] = N;\n        }\n\n        // h_t := (1 - update) .* new + update .* h_{t-1}\n        float* output_data = top_blob.row(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < num_output; q++)\n        {\n            const float* gates_data = gates.row(q);\n\n            float U = gates_data[0];\n            float N = gates_data[1];\n\n            float H = (1 - U) * N + U * hidden_state[q];\n\n            hidden_state[q] = H;\n            output_data[q] = H;\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_INT8\nstatic int gru_int8(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc_int8, const float* weight_xc_int8_scales, const Mat& bias_c, const Mat& weight_hc_int8, const float* weight_hc_int8_scales, Mat& hidden_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n\n    // 2 x num_output\n    Mat gates(2, num_output, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8(size, T, (size_t)1u, 1, opt.workspace_allocator);\n    Mat bottom_blob_int8_scales(T, (size_t)4u, 1, opt.workspace_allocator);\n    {\n        for (int t = 0; t < T; t++)\n        {\n            const float* x = bottom_blob.row(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(x[i]));\n            }\n\n            bottom_blob_int8_scales[t] = 127.f / absmax;\n        }\n\n        Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_quant);\n    }\n\n    Mat hidden_state_int8(num_output, 
(size_t)1u, 1, opt.workspace_allocator);\n    Mat hidden_state_int8_scales(1, (size_t)4u, 1, opt.workspace_allocator);\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        int ti = reverse ? T - 1 - t : t;\n\n        // dynamic quantize hidden_state\n        {\n            float absmax = 0.f;\n            for (int i = 0; i < num_output; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(hidden_state[i]));\n            }\n\n            if (absmax == 0.f)\n            {\n                hidden_state_int8_scales[0] = 1.f;\n                hidden_state_int8.fill<signed char>(0);\n            }\n            else\n            {\n                hidden_state_int8_scales[0] = 127.f / absmax;\n\n                Option opt_quant = opt;\n                opt_quant.blob_allocator = opt.workspace_allocator;\n                opt_quant.use_packing_layout = false;\n                quantize_to_int8(hidden_state, hidden_state_int8, hidden_state_int8_scales, opt_quant);\n            }\n        }\n\n        const signed char* x = bottom_blob_int8.row<const signed char>(ti);\n        const signed char* hs = hidden_state_int8;\n        const float descale_x = 1.f / bottom_blob_int8_scales[ti];\n        const float descale_h = 1.f / hidden_state_int8_scales[0];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < num_output; q++)\n        {\n            float* gates_data = gates.row(q);\n\n            // gate reset update\n            const float* bias_c_R = bias_c.row(0);\n            const float* bias_c_U = bias_c.row(1);\n\n            const signed char* weight_xc_int8_R = weight_xc_int8.row<const signed char>(num_output * 0 + q);\n            const signed char* weight_xc_int8_U = weight_xc_int8.row<const signed char>(num_output * 1 + q);\n            const signed char* weight_hc_int8_R = weight_hc_int8.row<const signed char>(num_output * 0 + q);\n            const signed char* weight_hc_int8_U = 
weight_hc_int8.row<const signed char>(num_output * 1 + q);\n\n            const float descale_xc_R = 1.f / weight_xc_int8_scales[num_output * 0 + q];\n            const float descale_xc_U = 1.f / weight_xc_int8_scales[num_output * 1 + q];\n            const float descale_hc_R = 1.f / weight_hc_int8_scales[num_output * 0 + q];\n            const float descale_hc_U = 1.f / weight_hc_int8_scales[num_output * 1 + q];\n\n            int Rx = 0;\n            int Ux = 0;\n            for (int i = 0; i < size; i++)\n            {\n                signed char xi = x[i];\n\n                Rx += weight_xc_int8_R[i] * xi;\n                Ux += weight_xc_int8_U[i] * xi;\n            }\n\n            int Rh = 0;\n            int Uh = 0;\n            for (int i = 0; i < num_output; i++)\n            {\n                signed char h_cont = hs[i];\n\n                Rh += weight_hc_int8_R[i] * h_cont;\n                Uh += weight_hc_int8_U[i] * h_cont;\n            }\n\n            float R = bias_c_R[q] + Rx * (descale_x * descale_xc_R) + Rh * (descale_h * descale_hc_R);\n            float U = bias_c_U[q] + Ux * (descale_x * descale_xc_U) + Uh * (descale_h * descale_hc_U);\n\n            // sigmoid(R)\n            // sigmoid(U)\n            R = 1.f / (1.f + expf(-R));\n            U = 1.f / (1.f + expf(-U));\n\n            // gate new\n            const float* bias_c_WN = bias_c.row(2);\n            const float* bias_c_BN = bias_c.row(3);\n\n            const signed char* weight_xc_int8_N = weight_xc_int8.row<const signed char>(num_output * 2 + q);\n            const signed char* weight_hc_int8_N = weight_hc_int8.row<const signed char>(num_output * 2 + q);\n\n            const float descale_xc_N = 1.f / weight_xc_int8_scales[num_output * 2 + q];\n            const float descale_hc_N = 1.f / weight_hc_int8_scales[num_output * 2 + q];\n\n            int Nh = 0;\n            for (int i = 0; i < num_output; i++)\n            {\n                signed char h_cont = hs[i];\n\n         
       Nh += weight_hc_int8_N[i] * h_cont;\n            }\n\n            int Nx = 0;\n            for (int i = 0; i < size; i++)\n            {\n                signed char xi = x[i];\n\n                Nx += weight_xc_int8_N[i] * xi;\n            }\n\n            float N = bias_c_BN[q] + Nh * (descale_h * descale_hc_N);\n            N = bias_c_WN[q] + R * N + Nx * (descale_x * descale_xc_N);\n\n            // tanh(N)\n            N = tanhf(N);\n\n            gates_data[0] = U;\n            gates_data[1] = N;\n        }\n\n        // h_t := (1 - update) .* new + update .* h_{t-1}\n        float* output_data = top_blob.row(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < num_output; q++)\n        {\n            const float* gates_data = gates.row(q);\n\n            float U = gates_data[0];\n            float N = gates_data[1];\n\n            float H = (1 - U) * N + U * hidden_state[q];\n\n            hidden_state[q] = H;\n            output_data[q] = H;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\nint GRU::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 
2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = gru_int8(bottom_blob, top_blob, direction, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = gru(bottom_blob, top_blob, direction, weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = gru_int8(bottom_blob, top_blob_forward, 0, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = gru(bottom_blob, top_blob_forward, 0, weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.0f);\n\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n       
     int ret = gru_int8(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), weight_xc_data_int8_scales.row(1), bias_c_data.channel(1), weight_hc_data.channel(1), weight_hc_data_int8_scales.row(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = gru(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), bias_c_data.channel(1), weight_hc_data.channel(1), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    return 0;\n}\n\nint GRU::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Allocator* hidden_allocator = top_blobs.size() == 2 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 2)\n    {\n        hidden = bottom_blobs[1].clone(hidden_allocator);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = gru_int8(bottom_blob, top_blob, direction, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = gru(bottom_blob, top_blob, direction, weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), hidden, opt);\n            if (ret != 0)\n                return ret;\n        }\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = gru_int8(bottom_blob, top_blob_forward, 0, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), hidden0, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = gru(bottom_blob, top_blob_forward, 0, 
weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), hidden0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = gru_int8(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), weight_xc_data_int8_scales.row(1), bias_c_data.channel(1), weight_hc_data.channel(1), weight_hc_data_int8_scales.row(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = gru(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), bias_c_data.channel(1), weight_hc_data.channel(1), hidden1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    if (top_blobs.size() == 2)\n    {\n        top_blobs[1] = hidden;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/gru.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_GRU_H\n#define LAYER_GRU_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass GRU : public Layer\n{\npublic:\n    GRU();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    int num_output;\n    int weight_data_size;\n    int direction; // 0=forward 1=reverse 2=bidirectional\n\n    int int8_scale_term;\n\n    Mat weight_hc_data;\n    Mat weight_xc_data;\n    Mat bias_c_data;\n\n#if NCNN_INT8\n    Mat weight_hc_data_int8_scales;\n    Mat weight_xc_data_int8_scales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_GRU_H\n"
  },
  {
    "path": "src/layer/hardsigmoid.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardsigmoid.h\"\n\nnamespace ncnn {\n\nHardSigmoid::HardSigmoid()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint HardSigmoid::load_param(const ParamDict& pd)\n{\n    // tensorflow uses alpha,beta = 0.2, 0.5\n    // pytorch uses alpha,beta = 1/6, 0.5\n    alpha = pd.get(0, 0.2f);\n    beta = pd.get(1, 0.5f);\n    lower = -beta / alpha;\n    upper = (1.f / alpha) + lower;\n\n    return 0;\n}\n\nint HardSigmoid::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            if (ptr[i] < lower)\n                ptr[i] = 0.f;\n            else if (ptr[i] > upper)\n                ptr[i] = 1.f;\n            else\n                ptr[i] = ptr[i] * alpha + beta;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/hardsigmoid.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_HARDSIGMOID_H\n#define LAYER_HARDSIGMOID_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass HardSigmoid : public Layer\n{\npublic:\n    HardSigmoid();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    float alpha, beta, lower, upper;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_HARDSIGMOID_H\n"
  },
  {
    "path": "src/layer/hardswish.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardswish.h\"\n\nnamespace ncnn {\n\nHardSwish::HardSwish()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint HardSwish::load_param(const ParamDict& pd)\n{\n    // Note that tensorflow/pytorch use alpha,beta = 1/6, 0.5, not the default values here.\n    // You can set them manually in the .param file.\n    alpha = pd.get(0, 0.2f);\n    beta = pd.get(1, 0.5f);\n    lower = -beta / alpha;\n    upper = (1.f / alpha) + lower;\n\n    return 0;\n}\n\nint HardSwish::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            if (ptr[i] < lower)\n                ptr[i] = 0.f;\n            else if (ptr[i] > upper)\n                ;\n            else\n                ptr[i] = ptr[i] * (ptr[i] * alpha + beta);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/hardswish.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_HARDSWISH_H\n#define LAYER_HARDSWISH_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass HardSwish : public Layer\n{\npublic:\n    HardSwish();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    float alpha, beta, lower, upper;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_HARDSWISH_H\n"
  },
  {
    "path": "src/layer/innerproduct.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"innerproduct.h\"\n\n#include \"layer_type.h\"\n\n#include \"fused_activation.h\"\n\nnamespace ncnn {\n\nInnerProduct::InnerProduct()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint InnerProduct::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    bias_term = pd.get(1, 0);\n    weight_data_size = pd.get(2, 0);\n    int8_scale_term = pd.get(8, 0);\n    activation_type = pd.get(9, 0);\n    activation_params = pd.get(10, Mat());\n\n    if (int8_scale_term)\n    {\n#if NCNN_INT8\n        support_int8_storage = true;\n#else\n        NCNN_LOGE(\"please build ncnn with NCNN_INT8 enabled for int8 inference\");\n        return -1;\n#endif\n    }\n\n    return 0;\n}\n\nint InnerProduct::load_model(const ModelBin& mb)\n{\n    weight_data = mb.load(weight_data_size, 0);\n    if (weight_data.empty())\n        return -100;\n\n    if (bias_term)\n    {\n        bias_data = mb.load(num_output, 1);\n        if (bias_data.empty())\n            return -100;\n    }\n\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        weight_data_int8_scales = mb.load(num_output, 1);\n        bottom_blob_int8_scales = mb.load(1, 1);\n    }\n#endif // NCNN_INT8\n\n#if NCNN_INT8\n    // runtime quantize the weight data\n    if (weight_data.elemsize == (size_t)4u && int8_scale_term)\n    {\n        const int num_input = weight_data_size / num_output;\n\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n        Mat weight_data_int8;\n        Option opt_q;\n        opt_q.num_threads = 1;\n        opt_q.use_packing_layout = false;\n        quantize_to_int8(weight_data_r2, weight_data_int8, weight_data_int8_scales, opt_q);\n        if (weight_data_int8.empty())\n            return -100;\n\n        weight_data = weight_data_int8.reshape(weight_data_size);\n    }\n#endif // NCNN_INT8\n\n    return 0;\n}\n\nint InnerProduct::forward(const Mat& 
bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return forward_int8(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    const int num_input = weight_data_size / num_output;\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int size = w * h;\n\n    if (bottom_blob.dims == 2 && w == num_input)\n    {\n        // gemm\n        top_blob.create(num_output, h, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n            const float* m = bottom_blob.row(j);\n            float* outptr = top_blob.row(j);\n\n            for (int p = 0; p < num_output; p++)\n            {\n                const float* kptr = (const float*)weight_data + w * p;\n\n                float sum = 0.f;\n\n                if (bias_term)\n                    sum = bias_data[p];\n\n                for (int i = 0; i < w; i++)\n                {\n                    sum += m[i] * kptr[i];\n                }\n\n                outptr[p] = activation_ss(sum, activation_type, activation_params);\n            }\n        }\n\n        return 0;\n    }\n\n    top_blob.create(num_output, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < num_output; p++)\n    {\n        float sum = 0.f;\n\n        if (bias_term)\n            sum = bias_data[p];\n\n        // channels\n        for (int q = 0; q < channels; q++)\n        {\n            const float* w = (const float*)weight_data + size * channels * p + size * q;\n            const float* m = bottom_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n           
     sum += m[i] * w[i];\n            }\n        }\n\n        top_blob[p] = activation_ss(sum, activation_type, activation_params);\n    }\n\n    return 0;\n}\n\n#if NCNN_INT8\nint InnerProduct::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int num_input = weight_data_size / num_output;\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int size = w * h;\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elemsize != 1)\n    {\n        Option opt_g = opt;\n        opt_g.blob_allocator = opt.workspace_allocator;\n        opt_g.use_packing_layout = false;\n\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_g);\n    }\n\n    if (bottom_blob.dims == 2 && w == num_input)\n    {\n        // gemm\n        top_blob.create(num_output, h, 4u, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n            const signed char* m = bottom_blob_int8.row<signed char>(j);\n            float* outptr = top_blob.row(j);\n\n            for (int p = 0; p < num_output; p++)\n            {\n                const signed char* kptr = (const signed char*)weight_data + w * p;\n                int sum = 0;\n\n                for (int i = 0; i < w; i++)\n                {\n                    sum += m[i] * kptr[i];\n                }\n                // dequantize and relu\n                float scale_in;\n                if (weight_data_int8_scales[p] == 0)\n                    scale_in = 0;\n                else\n                    scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n                float sumfp32 = sum * scale_in;\n\n                if (bias_term)\n                    sumfp32 += bias_data[p];\n\n                outptr[p] = 
activation_ss(sumfp32, activation_type, activation_params);\n            }\n        }\n\n        return 0;\n    }\n\n    top_blob.create(num_output, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < num_output; p++)\n    {\n        float* outptr = top_blob;\n\n        int sum = 0;\n\n        int offset = size * channels * p;\n        // channels\n        for (int q = 0; q < channels; q++)\n        {\n            const signed char* w = (const signed char*)weight_data + offset + size * q;\n            const signed char* m = bottom_blob_int8.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                sum += m[i] * w[i];\n            }\n        }\n\n        // dequantize and relu\n        float scale_in;\n        if (weight_data_int8_scales[p] == 0)\n            scale_in = 0;\n        else\n            scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n        float sumfp32 = sum * scale_in;\n\n        if (bias_term)\n            sumfp32 += bias_data[p];\n\n        outptr[p] = activation_ss(sumfp32, activation_type, activation_params);\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/innerproduct.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INNERPRODUCT_H\n#define LAYER_INNERPRODUCT_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass InnerProduct : public Layer\n{\npublic:\n    InnerProduct();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n#if NCNN_INT8\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    // param\n    int num_output;\n    int bias_term;\n\n    int weight_data_size;\n\n    int int8_scale_term;\n\n    // 0=none 1=relu 2=leakyrelu 3=clip 4=sigmoid\n    int activation_type;\n    Mat activation_params;\n\n    // model\n    Mat weight_data;\n    Mat bias_data;\n\n#if NCNN_INT8\n    Mat weight_data_int8_scales;\n    Mat bottom_blob_int8_scales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INNERPRODUCT_H\n"
  },
  {
    "path": "src/layer/input.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"input.h\"\n\nnamespace ncnn {\n\nInput::Input()\n{\n    one_blob_only = true;\n    support_inplace = true;\n    support_vulkan = true;\n    support_packing = true;\n    support_bf16_storage = true;\n}\n\nint Input::load_param(const ParamDict& pd)\n{\n    w = pd.get(0, 0);\n    h = pd.get(1, 0);\n    d = pd.get(11, 0);\n    c = pd.get(2, 0);\n    return 0;\n}\n\nint Input::forward_inplace(Mat& /*bottom_top_blob*/, const Option& /*opt*/) const\n{\n    return 0;\n}\n\n#if NCNN_VULKAN\nint Input::forward_inplace(VkMat& /*bottom_top_blob*/, VkCompute& /*cmd*/, const Option& /*opt*/) const\n{\n    return 0;\n}\n#endif // NCNN_VULKAN\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/input.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INPUT_H\n#define LAYER_INPUT_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Input : public Layer\n{\npublic:\n    Input();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\n#if NCNN_VULKAN\n    virtual int forward_inplace(VkMat& bottom_top_blob, VkCompute& cmd, const Option& opt) const;\n#endif // NCNN_VULKAN\n\npublic:\n    int w;\n    int h;\n    int d;\n    int c;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INPUT_H\n"
  },
  {
    "path": "src/layer/instancenorm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"instancenorm.h\"\n\nnamespace ncnn {\n\nInstanceNorm::InstanceNorm()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint InstanceNorm::load_param(const ParamDict& pd)\n{\n    channels = pd.get(0, 0);\n    eps = pd.get(1, 0.001f);\n    affine = pd.get(2, 1);\n\n    return 0;\n}\n\nint InstanceNorm::load_model(const ModelBin& mb)\n{\n    if (affine == 0)\n        return 0;\n\n    gamma_data = mb.load(channels, 1);\n    if (gamma_data.empty())\n        return -100;\n\n    beta_data = mb.load(channels, 1);\n    if (beta_data.empty())\n        return -100;\n\n    return 0;\n}\n\nint InstanceNorm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    // x = (x - mean) / (sqrt(var + eps)) * gamma + beta\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int c = bottom_top_blob.c;\n    int size = w * h;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < c; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        // mean and var\n        float sum = 0.f;\n        float sqsum = 0.f;\n        for (int i = 0; i < size; i++)\n        {\n            sum += ptr[i];\n            //sqsum += ptr[i] * ptr[i];\n        }\n        float mean = sum / size;\n        float tmp = 0.f;\n        for (int i = 0; i < size; i++)\n        {\n            tmp = ptr[i] - mean;\n            sqsum += tmp * tmp;\n        }\n        float var = sqsum / size;\n        // the one-pass form below can yield a slightly negative var due to floating-point rounding\n        //float var = sqsum / size - mean * mean;\n\n        float a;\n        float b;\n        if (affine)\n        {\n            float gamma = gamma_data[q];\n            float beta = beta_data[q];\n\n            a = gamma / (sqrtf(var + eps));\n            b = -mean * a + beta;\n        }\n        else\n        {\n            a = 1.f / (sqrtf(var + eps));\n            b = -mean * a;\n        
}\n\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] = ptr[i] * a + b;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/instancenorm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INSTANCENORM_H\n#define LAYER_INSTANCENORM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass InstanceNorm : public Layer\n{\npublic:\n    InstanceNorm();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int channels;\n    float eps;\n    int affine;\n\n    // model\n    Mat gamma_data;\n    Mat beta_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INSTANCENORM_H\n"
  },
  {
    "path": "src/layer/interp.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"interp.h\"\n\n#include \"expression.h\"\n\nnamespace ncnn {\n\nInterp::Interp()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint Interp::load_param(const ParamDict& pd)\n{\n    resize_type = pd.get(0, 0);\n    height_scale = pd.get(1, 1.f);\n    width_scale = pd.get(2, 1.f);\n    output_height = pd.get(3, 0);\n    output_width = pd.get(4, 0);\n    dynamic_target_size = pd.get(5, 0);\n    align_corner = pd.get(6, 0);\n\n    if (resize_type < 0 || resize_type > 3)\n    {\n        NCNN_LOGE(\"unsupported resize type %d\", resize_type);\n        return -1;\n    }\n\n    if (dynamic_target_size == 1)\n    {\n        one_blob_only = false;\n    }\n\n    size_expr = pd.get(9, \"\");\n\n    // count reference blobs\n    if (!size_expr.empty())\n    {\n        const int blob_count = count_expression_blobs(size_expr);\n        if (blob_count > 1)\n            one_blob_only = false;\n    }\n\n    return 0;\n}\n\n#if defined(__GNUC__) && defined(__powerpc__) && defined(__ALTIVEC__)\n// NOTE gcc altivec optimized version produce wrong result\n// so I have to disable vectorize here  --- nihui\n__attribute__((optimize(\"no-tree-vectorize\")))\n#endif\nstatic void\nlinear_coeffs(int w, int outw, int* xofs, float* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = static_cast<float>(dx * scale);\n        }\n\n        int sx = static_cast<int>(floor(fx));\n        fx -= sx;\n\n        if (sx < 0)\n        {\n            sx = 0;\n            fx = 0.f;\n        }\n        if (sx >= w - 1)\n        {\n            sx = w - 2;\n            fx = 1.f;\n        }\n\n        xofs[dx] = sx;\n\n        alpha[dx * 2] = 1.f - 
fx;\n        alpha[dx * 2 + 1] = fx;\n    }\n}\n\nstatic void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const float* S0 = src.row(sy);\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows0p[dx] = S0p[0] * a0 + S0p[1] * a1;\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n\n        float* rows0p = rows0;\n    
    float* rows1p = rows1;\n        float* Dp = dst.row(dy);\n        for (int dx = 0; dx < w; dx++)\n        {\n            //             D[x] = rows0[x]*b0 + rows1[x]*b1;\n            *Dp++ = *rows0p++ * b0 + *rows1p++ * b1;\n        }\n\n        beta += 2;\n    }\n}\n\nstatic inline void interpolate_cubic(float fx, float* coeffs)\n{\n    const float A = -0.75f;\n\n    float fx0 = fx + 1;\n    float fx1 = fx;\n    float fx2 = 1 - fx;\n    // float fx3 = 2 - fx;\n\n    coeffs[0] = A * fx0 * fx0 * fx0 - 5 * A * fx0 * fx0 + 8 * A * fx0 - 4 * A;\n    coeffs[1] = (A + 2) * fx1 * fx1 * fx1 - (A + 3) * fx1 * fx1 + 1;\n    coeffs[2] = (A + 2) * fx2 * fx2 * fx2 - (A + 3) * fx2 * fx2 + 1;\n    coeffs[3] = 1.f - coeffs[0] - coeffs[1] - coeffs[2];\n}\n\nstatic void cubic_coeffs(int w, int outw, int* xofs, float* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = static_cast<float>(dx * scale);\n        }\n\n        int sx = static_cast<int>(floor(fx));\n        fx -= sx;\n\n        interpolate_cubic(fx, alpha + dx * 4);\n\n        if (sx <= -1)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = 1.f - alpha[dx * 4 + 3];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 3];\n            alpha[dx * 4 + 2] = 0.f;\n            alpha[dx * 4 + 3] = 0.f;\n        }\n        if (sx == 0)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = alpha[dx * 4 + 0] + alpha[dx * 4 + 1];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 2];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 3];\n            alpha[dx * 4 + 3] = 0.f;\n        }\n        if (sx == w - 2)\n        {\n            sx = w - 3;\n            alpha[dx * 4 + 3] = alpha[dx * 4 + 2] + alpha[dx * 4 + 3];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 
1];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 0];\n            alpha[dx * 4 + 0] = 0.f;\n        }\n        if (sx >= w - 1)\n        {\n            sx = w - 3;\n            alpha[dx * 4 + 3] = 1.f - alpha[dx * 4 + 0];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 0];\n            alpha[dx * 4 + 1] = 0.f;\n            alpha[dx * 4 + 0] = 0.f;\n        }\n\n        xofs[dx] = sx;\n    }\n}\n\nstatic void resize_bicubic_image(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    Mat rowsbuf2(w);\n    Mat rowsbuf3(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            
rows2 = rows0_old;\n            rows3 = rows1_old;\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + 
S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const float* S0 = src.row(sy - 1);\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows0p[dx] = S0p[-1] * a0 + S0p[0] * a1 + S0p[1] * a2 + S0p[2] * a3;\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n        float b2 = beta[2];\n        float b3 = beta[3];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        float* Dp = dst.row(dy);\n        for (int dx = 0; dx < w; dx++)\n        {\n            //             D[x] = rows0[x]*b0 + rows1[x]*b1 + rows2[x]*b2 + rows3[x]*b3;\n            *Dp++ = *rows0p++ * b0 + *rows1p++ * b1 + *rows2p++ * b2 + *rows3p++ * b3;\n        }\n\n  
      beta += 4;\n    }\n}\n\nint Interp::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n\n    int outw = output_width;\n    int outh = output_height;\n    if (bottom_blob.dims == 1)\n    {\n        w = 1;\n        h = 1;\n    }\n    if (outw == 0 || outh == 0)\n    {\n        outw = static_cast<int>(w * width_scale);\n        outh = static_cast<int>(h * height_scale);\n    }\n\n    Mat reference_blob;\n    reference_blob.w = outw;\n    reference_blob.h = outh;\n\n    std::vector<Mat> bottom_blobs(2);\n    bottom_blobs[0] = bottom_blob;\n    bottom_blobs[1] = reference_blob;\n\n    std::vector<Mat> top_blobs(1);\n\n    int ret = forward(bottom_blobs, top_blobs, opt);\n\n    top_blob = top_blobs[0];\n\n    return ret;\n}\n\nint Interp::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n\n    int outw = reference_blob.w;\n    int outh = reference_blob.h;\n\n    if (!size_expr.empty())\n    {\n        int r = eval_size_expr(bottom_blobs, outw, outh);\n        if (r != 0)\n            return -1;\n    }\n\n    if (dims == 1)\n    {\n        // special case for 2d resize on flattened blob\n        top_blob.create(outw, outh, w, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < w; q++)\n        {\n            Mat top_blob_c = top_blob.channel(q);\n            const float v = bottom_blob[q];\n            top_blob_c.fill(v);\n        }\n\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        if (outw 
== w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(outw, h, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (resize_type == 1) // nearest\n        {\n            const float ws = (output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outw * 2];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const float* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    *outptr++ = Sp[0] * a0 + Sp[1] * a1;\n                    alphap += 2;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outw * 4];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            cubic_coeffs(w, outw, xofs, 
alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const float* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    float a2 = alphap[2];\n                    float a3 = alphap[3];\n                    *outptr++ = Sp[-1] * a0 + Sp[0] * a1 + Sp[1] * a2 + Sp[2] * a3;\n                    alphap += 4;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (outw == w && outh == h)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    top_blob.create(outw, outh, channels, elemsize, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (resize_type == 1) // nearest\n    {\n        const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n        const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n            for (int y = 0; y < outh; y++)\n            {\n                int in_y = std::min((int)(y * hs), (h - 1));\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_y * w + in_x];\n                }\n            }\n        }\n    }\n\n    if (resize_type == 2) // bilinear\n    {\n        int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n        float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n        linear_coeffs(w, outw, xofs, alpha, align_corner);\n        linear_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; ++q)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bilinear_image(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    if (resize_type == 3) // bicubic\n    {\n        int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n        float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n        cubic_coeffs(w, outw, xofs, alpha, align_corner);\n        cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp 
parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bicubic_image(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    return 0;\n}\n\nint Interp::eval_size_expr(const std::vector<Mat>& bottom_blobs, int& outw, int& outh) const\n{\n    // [size(@0,0),size(@0,1)]\n    std::vector<int> sizes;\n    int er = eval_list_expression(size_expr, bottom_blobs, sizes);\n    if (er != 0)\n        return -1;\n\n    if (sizes.empty() || sizes.size() > 2)\n        return -1;\n\n    if (sizes.size() == 1)\n    {\n        outw = sizes[0];\n        outh = bottom_blobs[0].h;\n    }\n    else\n    {\n        outw = sizes[0];\n        outh = sizes[1];\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/interp.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INTERP_H\n#define LAYER_INTERP_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Interp : public Layer\n{\npublic:\n    Interp();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    int eval_size_expr(const std::vector<Mat>& bottom_blobs, int& outw, int& outh) const;\n\npublic:\n    // param\n    int resize_type; // 1=nearest  2=bilinear  3=bicubic\n    float width_scale;\n    float height_scale;\n    int output_width;\n    int output_height;\n    int dynamic_target_size;\n    int align_corner;\n\n    // see docs/developer-guide/expression.md\n    std::string size_expr;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INTERP_H\n"
  },
  {
    "path": "src/layer/inversespectrogram.cpp",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"inversespectrogram.h\"\n\nnamespace ncnn {\n\nInverseSpectrogram::InverseSpectrogram()\n{\n    one_blob_only = true;\n    support_inplace = false;\n}\n\nint InverseSpectrogram::load_param(const ParamDict& pd)\n{\n    n_fft = pd.get(0, 0);\n    returns = pd.get(1, 0);\n    hoplen = pd.get(2, n_fft / 4);\n    winlen = pd.get(3, n_fft);\n    window_type = pd.get(4, 0);\n    center = pd.get(5, 1);\n    normalized = pd.get(7, 0);\n\n    // assert winlen <= n_fft\n    // generate window\n    window_data.create(normalized == 2 ? n_fft + 1 : n_fft);\n    {\n        float* p = window_data;\n        for (int i = 0; i < (n_fft - winlen) / 2; i++)\n        {\n            *p++ = 0.f;\n        }\n        if (window_type == 0)\n        {\n            // all ones\n            for (int i = 0; i < winlen; i++)\n            {\n                *p++ = 1.f;\n            }\n        }\n        if (window_type == 1)\n        {\n            // hann window\n            for (int i = 0; i < winlen; i++)\n            {\n                *p++ = 0.5f * (1 - cosf(2 * 3.14159265358979323846 * i / winlen));\n            }\n        }\n        if (window_type == 2)\n        {\n            // hamming window\n            for (int i = 0; i < winlen; i++)\n            {\n                *p++ = 0.54f - 0.46f * cosf(2 * 3.14159265358979323846 * i / winlen);\n            }\n        }\n        for (int i = 0; i < n_fft - winlen - (n_fft - winlen) / 2; i++)\n        {\n            *p++ = 0.f;\n        }\n\n        // pre-calculated window norm factor\n        if (normalized == 2)\n        {\n            float sqsum = 0.f;\n            for (int i = 0; i < n_fft; i++)\n            {\n                sqsum += window_data[i] * window_data[i];\n            }\n            window_data[n_fft] = sqrt(sqsum);\n        }\n    }\n\n    return 0;\n}\n\nint InverseSpectrogram::forward(const Mat& bottom_blob, Mat& top_blob, const 
Option& opt) const\n{\n    // https://github.com/librosa/librosa/blob/main/librosa/core/spectrum.py#L630\n\n    // TODO custom window\n    // TODO output length\n\n    const int frames = bottom_blob.h;\n    const int freqs = bottom_blob.c;\n    // assert freqs == n_fft or freqs == n_fft / 2 + 1\n\n    const int onesided = freqs == n_fft / 2 + 1 ? 1 : 0;\n\n    const int outsize = center ? (frames - 1) * hoplen + (n_fft - n_fft / 2 * 2) : (frames - 1) * hoplen + n_fft;\n\n    const size_t elemsize = bottom_blob.elemsize;\n\n    if (returns == 0)\n    {\n        top_blob.create(2, outsize, elemsize, opt.blob_allocator);\n    }\n    else\n    {\n        top_blob.create(outsize, elemsize, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    Mat window_sumsquare(outsize + n_fft, elemsize, opt.workspace_allocator);\n    if (window_sumsquare.empty())\n        return -100;\n\n    top_blob.fill(0.f);\n    window_sumsquare.fill(0.f);\n\n    for (int j = 0; j < frames; j++)\n    {\n        // collect complex\n        Mat sp(2, n_fft);\n        if (onesided == 1)\n        {\n            for (int k = 0; k < n_fft / 2 + 1; k++)\n            {\n                sp.row(k)[0] = bottom_blob.channel(k).row(j)[0];\n                sp.row(k)[1] = bottom_blob.channel(k).row(j)[1];\n            }\n            for (int k = n_fft / 2 + 1; k < n_fft; k++)\n            {\n                sp.row(k)[0] = bottom_blob.channel(n_fft - k).row(j)[0];\n                sp.row(k)[1] = -bottom_blob.channel(n_fft - k).row(j)[1];\n            }\n        }\n        else\n        {\n            for (int k = 0; k < n_fft; k++)\n            {\n                sp.row(k)[0] = bottom_blob.channel(k).row(j)[0];\n                sp.row(k)[1] = bottom_blob.channel(k).row(j)[1];\n            }\n        }\n\n        if (normalized == 1)\n        {\n            float norm = sqrt(n_fft);\n            for (int i = 0; i < 2 * n_fft; i++)\n            {\n                sp[i] *= norm;\n      
      }\n        }\n        if (normalized == 2)\n        {\n            float norm = window_data[n_fft];\n            for (int i = 0; i < 2 * n_fft; i++)\n            {\n                sp[i] *= norm;\n            }\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < n_fft; i++)\n        {\n            // inverse dft\n            float re = 0.f;\n            float im = 0.f;\n            for (int k = 0; k < n_fft; k++)\n            {\n                double angle = 2 * 3.14159265358979323846 * i * k / n_fft;\n\n                re += sp.row(k)[0] * cosf(angle) - sp.row(k)[1] * sinf(angle);\n                im += sp.row(k)[0] * sinf(angle) + sp.row(k)[1] * cosf(angle);\n            }\n\n            re /= n_fft;\n            im /= n_fft;\n\n            // apply window\n            re *= window_data[i];\n            im *= window_data[i];\n\n            int output_index = j * hoplen + i;\n            if (center == 1)\n            {\n                output_index -= n_fft / 2;\n            }\n            if (output_index >= 0 && output_index < outsize)\n            {\n                // square window\n                window_sumsquare[output_index] += window_data[i] * window_data[i];\n\n                if (returns == 0)\n                {\n                    top_blob.row(output_index)[0] += re;\n                    top_blob.row(output_index)[1] += im;\n                }\n                if (returns == 1)\n                {\n                    top_blob[output_index] += re;\n                }\n                if (returns == 2)\n                {\n                    top_blob[output_index] += im;\n                }\n            }\n        }\n    }\n\n    // square window norm\n    if (returns == 0)\n    {\n        for (int i = 0; i < outsize; i++)\n        {\n            if (window_sumsquare[i] != 0.f)\n            {\n                top_blob.row(i)[0] /= window_sumsquare[i];\n                top_blob.row(i)[1] /= 
window_sumsquare[i];\n            }\n        }\n    }\n    else\n    {\n        for (int i = 0; i < outsize; i++)\n        {\n            if (window_sumsquare[i] != 0.f)\n                top_blob[i] /= window_sumsquare[i];\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/inversespectrogram.h",
    "content": "// Copyright 2024 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INVERSESPECTROGRAM_H\n#define LAYER_INVERSESPECTROGRAM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass InverseSpectrogram : public Layer\n{\npublic:\n    InverseSpectrogram();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\npublic:\n    int n_fft;\n    int returns; // 0=complex 1=real 2=imag\n    int hoplen;\n    int winlen;\n    int window_type; // 0=ones 1=hann 2=hamming\n    int center;\n    int normalized; // 0=disabled 1=sqrt(n_fft) 2=window-l2-energy\n\n    Mat window_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INVERSESPECTROGRAM_H\n"
  },
  {
    "path": "src/layer/layernorm.cpp",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"layernorm.h\"\n\nnamespace ncnn {\n\nLayerNorm::LayerNorm()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint LayerNorm::load_param(const ParamDict& pd)\n{\n    affine_size = pd.get(0, 0);\n    eps = pd.get(1, 0.001f);\n    affine = pd.get(2, 1);\n\n    return 0;\n}\n\nint LayerNorm::load_model(const ModelBin& mb)\n{\n    if (affine == 0)\n        return 0;\n\n    gamma_data = mb.load(affine_size, 1);\n    if (gamma_data.empty())\n        return -100;\n\n    beta_data = mb.load(affine_size, 1);\n    if (beta_data.empty())\n        return -100;\n\n    return 0;\n}\n\nstatic void layernorm(float* ptr, const float* gamma_ptr, const float* beta_ptr, float eps, int size)\n{\n    float sum = 0.f;\n    for (int i = 0; i < size; i++)\n    {\n        sum += ptr[i];\n    }\n\n    float mean = sum / size;\n\n    float sqsum = 0.f;\n    for (int i = 0; i < size; i++)\n    {\n        float v = ptr[i] - mean;\n        sqsum += v * v;\n    }\n\n    float var = sqsum / size;\n\n    float a = 1.f / sqrtf(var + eps);\n    float b = -mean * a;\n\n    if (gamma_ptr && beta_ptr)\n    {\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] = (ptr[i] * a + b) * gamma_ptr[i] + beta_ptr[i];\n        }\n    }\n    else\n    {\n        for (int i = 0; i < size; i++)\n        {\n            ptr[i] = ptr[i] * a + b;\n        }\n    }\n}\n\nint LayerNorm::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    // x = (x - mean) / sqrt(var + eps) * gamma + beta\n\n    int dims = bottom_top_blob.dims;\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w;\n        // assert affine_size == w\n\n        float* ptr = bottom_top_blob;\n        layernorm(ptr, gamma_data, beta_data, eps, w);\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        // assert affine_size == w\n\n        
#pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n            layernorm(ptr, gamma_data, beta_data, eps, w);\n        }\n    }\n\n    if (dims == 3)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int channels = bottom_top_blob.c;\n\n        if (affine_size == w)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                for (int i = 0; i < h; i++)\n                {\n                    float* ptr = bottom_top_blob.channel(q).row(i);\n                    layernorm(ptr, gamma_data, beta_data, eps, w);\n                }\n            }\n        }\n        else // if (affine_size == size)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                float* ptr = bottom_top_blob.channel(q);\n                layernorm(ptr, gamma_data, beta_data, eps, w * h);\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/layernorm.h",
    "content": "// Copyright 2020 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_LAYERNORM_H\n#define LAYER_LAYERNORM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass LayerNorm : public Layer\n{\npublic:\n    LayerNorm();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    // param\n    int affine_size; // elements normalized together: w, or w * h for 3-dim blobs\n    float eps;       // added to the variance before sqrt\n    int affine;      // 0=no gamma/beta 1=apply gamma and beta\n\n    // model\n    Mat gamma_data;\n    Mat beta_data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_LAYERNORM_H\n"
  },
  {
    "path": "src/layer/log.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"log.h\"\n\nnamespace ncnn {\n\nLog::Log()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint Log::load_param(const ParamDict& pd)\n{\n    base = pd.get(0, -1.f);\n    scale = pd.get(1, 1.f);\n    shift = pd.get(2, 0.f);\n\n    return 0;\n}\n\nint Log::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    if (base == -1.f)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                ptr[i] = logf(shift + ptr[i] * scale);\n            }\n        }\n    }\n    else\n    {\n        float log_base_inv = 1.f / logf(base);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                ptr[i] = logf(shift + ptr[i] * scale) * log_base_inv;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/log.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_LOG_H\n#define LAYER_LOG_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass Log : public Layer\n{\npublic:\n    Log();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\npublic:\n    // y = log(shift + x * scale) / log(base)\n    float base;  // -1=natural log, otherwise the logarithm base\n    float scale; // multiplier applied to the input\n    float shift; // offset added after scaling\n};\n\n} // namespace ncnn\n\n#endif // LAYER_LOG_H\n"
  },
  {
    "path": "src/layer/loongarch/absval_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"absval_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nAbsVal_loongarch::AbsVal_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nint AbsVal_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128i _p = __lsx_vld(ptr, 0);\n            __m128i _outp = __lsx_vbitclri_w(_p, 31);\n            __lsx_vst(_outp, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = *ptr > 0 ? *ptr : -*ptr;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/absval_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ABSVAL_LOONGARCH_H\n#define LAYER_ABSVAL_LOONGARCH_H\n\n#include \"absval.h\"\n\nnamespace ncnn {\n\nclass AbsVal_loongarch : public AbsVal\n{\npublic:\n    AbsVal_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ABSVAL_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/batchnorm_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"batchnorm_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nBatchNorm_loongarch::BatchNorm_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint BatchNorm_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w * elempack;\n\n#if __loongarch_sx\n        int nn_w = w / 4;\n        int remain_w_start = nn_w * 4;\n#else\n        int remain_w_start = 0;\n#endif // __loongarch_sx\n\n        float* ptr = bottom_top_blob;\n\n#if __loongarch_sx\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < nn_w; i++)\n        {\n            float* ptr0 = ptr + i * 4;\n\n            __m128 _p = (__m128)__lsx_vld(ptr0, 0);\n            __m128 _a = (__m128)__lsx_vld((const float*)a_data + i * 4, 0);\n            __m128 _b = (__m128)__lsx_vld((const float*)b_data + i * 4, 0);\n            _p = __lsx_vfmadd_s(_b, _p, _a);\n            __lsx_vst(_p, ptr0, 0);\n        }\n#endif // __loongarch_sx\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_w_start; i < w; i++)\n        {\n            ptr[i] = b_data[i] * ptr[i] + a_data[i];\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w * elempack;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n            float a = a_data[i];\n            float b = b_data[i];\n\n            int j = 0;\n#if __loongarch_sx\n            __m128 _a = elempack == 
4 ? (__m128)__lsx_vld((const float*)a_data + i * 4, 0) : (__m128)__lsx_vreplfr2vr_s(a);\n            __m128 _b = elempack == 4 ? (__m128)__lsx_vld((const float*)b_data + i * 4, 0) : (__m128)__lsx_vreplfr2vr_s(b);\n            for (; j + 3 < w; j += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                _p = __lsx_vfmadd_s(_b, _p, _a);\n                __lsx_vst(_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; j < w; j++)\n            {\n                *ptr = b * *ptr + a;\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int d = bottom_top_blob.d;\n        int c = bottom_top_blob.c;\n        int size = w * h * d * elempack;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n            float a = a_data[q];\n            float b = b_data[q];\n\n            int i = 0;\n#if __loongarch_sx\n            __m128 _a = elempack == 4 ? (__m128)__lsx_vld((const float*)a_data + q * 4, 0) : (__m128)__lsx_vreplfr2vr_s(a);\n            __m128 _b = elempack == 4 ? (__m128)__lsx_vld((const float*)b_data + q * 4, 0) : (__m128)__lsx_vreplfr2vr_s(b);\n            for (; i + 3 < size; i += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                _p = __lsx_vfmadd_s(_b, _p, _a);\n                __lsx_vst(_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                *ptr = b * *ptr + a;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/batchnorm_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BATCHNORM_LOONGARCH_H\n#define LAYER_BATCHNORM_LOONGARCH_H\n\n#include \"batchnorm.h\"\n\nnamespace ncnn {\n\nclass BatchNorm_loongarch : public BatchNorm\n{\npublic:\n    BatchNorm_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BATCHNORM_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/bias_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"bias_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nint Bias_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    const float* bias_ptr = bias_data;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        float bias = bias_ptr[q];\n\n#if __loongarch_sx\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __loongarch_sx\n\n#if __loongarch_sx\n        __m128 _bias = (__m128)__lsx_vreplfr2vr_s(bias);\n        for (; nn > 0; nn--)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __m128 _outp = __lsx_vfadd_s(_p, _bias);\n            __lsx_vst(_outp, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n\n        for (; remain > 0; remain--)\n        {\n            *ptr = *ptr + bias;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/bias_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BIAS_LOONGARCH_H\n#define LAYER_BIAS_LOONGARCH_H\n\n#include \"bias.h\"\n\nnamespace ncnn {\n\nclass Bias_loongarch : public Bias\n{\npublic:\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BIAS_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/binaryop_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"binaryop_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nBinaryOp_loongarch::BinaryOp_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_no_broadcast(const float* ptr, const float* ptr1, float* outptr, int size)\n{\n    const Op op;\n\n    int i = 0;\n#if __loongarch_sx\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr + 16);\n        __builtin_prefetch(ptr1 + 16);\n        __m128 _p = (__m128)__lsx_vld(ptr, 0);\n        __m128 _b = (__m128)__lsx_vld(ptr1, 0);\n        __m128 _outp = op(_p, _b);\n        __lsx_vst(_outp, outptr, 0);\n        ptr += 4;\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __loongarch_sx\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, *ptr1);\n        ptr += 1;\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_b(const float* ptr, const float* ptr1, float* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float b = *ptr1;\n#if __loongarch_sx\n    __m128 _b_128 = (elempack == 4) ? 
(__m128)__lsx_vld(ptr1, 0) : __lsx_vreplfr2vr_s(b);\n#endif // __loongarch_sx\n\n    int i = 0;\n#if __loongarch_sx\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr + 16);\n        __m128 _p = (__m128)__lsx_vld(ptr, 0);\n        __m128 _outp = op(_p, _b_128);\n        __lsx_vst(_outp, outptr, 0);\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __loongarch_sx\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, b);\n        ptr += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_a(const float* ptr, const float* ptr1, float* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float a = *ptr;\n#if __loongarch_sx\n    __m128 _a_128 = (elempack == 4) ? (__m128)__lsx_vld(ptr, 0) : __lsx_vreplfr2vr_s(a);\n#endif // __loongarch_sx\n\n    int i = 0;\n#if __loongarch_sx\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr1 + 16);\n        __m128 _b = (__m128)__lsx_vld(ptr1, 0);\n        __m128 _outp = op(_a_128, _b);\n        __lsx_vst(_outp, outptr, 0);\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __loongarch_sx\n    for (; i < size; i++)\n    {\n        *outptr = op(a, *ptr1);\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb(const float* ptr, const float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __loongarch_sx\n    if (elempack == 4)\n    {\n        int i = 0;\n        for (; i < w; i++)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __m128 _b = __lsx_vreplfr2vr_s(*ptr1);\n            __m128 _outp = op(_p, _b);\n            __lsx_vst(_outp, outptr, 0);\n            ptr += 4;\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __loongarch_sx\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_b(const float* ptr, const 
float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n    const int size = w * elempack;\n\n    int i = 0;\n#if __loongarch_sx\n    __m128 _b = __lsx_vreplfr2vr_s(*ptr1);\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr + 16);\n        __m128 _p = (__m128)__lsx_vld(ptr, 0);\n        __m128 _outp = op(_p, _b);\n        __lsx_vst(_outp, outptr, 0);\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __loongarch_sx\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_a(const float* ptr, const float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __loongarch_sx\n    if (elempack == 4)\n    {\n        int i = 0;\n        __m128 _p = (__m128)__lsx_vld(ptr, 0);\n        for (; i < w; i++)\n        {\n            __m128 _b = __lsx_vreplfr2vr_s(*ptr1);\n            __m128 _outp = op(_p, _b);\n            __lsx_vst(_outp, outptr, 0);\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __loongarch_sx\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector(const float* ptr, const float* ptr1, float* outptr, int aw, int bw, int ap, int bp)\n{\n    const int w = std::max(aw, bw);\n    const int elempack = std::max(ap, bp);\n    const int size = w * elempack;\n\n    if (ap == bp)\n    {\n        if (aw == bw)\n        {\n            // no broadcast\n            return binary_op_vector_no_broadcast<Op>(ptr, ptr1, outptr, size);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast single b\n            return binary_op_vector_broadcast_b<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a\n            return binary_op_vector_broadcast_a<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n    }\n\n    if (bp == 1)\n    {\n        if (aw == bw)\n        {\n            // broadcast pack1 b\n            return binary_op_vector_broadcast_pb<Op>(ptr, ptr1, outptr, w, elempack);\n     
   }\n\n        if (bw == 1)\n        {\n            // broadcast pack1 single b\n            return binary_op_vector_broadcast_pb_b<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a and pack1 b\n            return binary_op_vector_broadcast_pb_a<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n    }\n\n    // shall never reach here\n}\n\ntemplate<typename Op>\nstatic int binary_op_scalar_inplace(Mat& a, float b, const Option& opt)\n{\n    Op op;\n\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = a.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        __m128 _b = __lsx_vreplfr2vr_s(b);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            _p = op(_p, _b);\n            __lsx_vst(_p, ptr, 0);\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = op(*ptr, b);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nnamespace BinaryOp_loongarch_functor {\n\n#if __loongarch_sx\n#define MAKE_FUNCTION(NAME, IMPL, IMPL4)                          \\\n    struct NAME                                                   \\\n    {                                                             \\\n        float operator()(const float& x, const float& y) const    \\\n        {                                                         \\\n            return IMPL;                                          \\\n        }                                                         \\\n        __m128 operator()(const __m128& x, const __m128& y) const \\\n        {                                                         \\\n            return IMPL4;                                         
\\\n        }                                                         \\\n    };\n#else\n#define MAKE_FUNCTION(NAME, IMPL, IMPL4)                       \\\n    struct NAME                                                \\\n    {                                                          \\\n        float operator()(const float& x, const float& y) const \\\n        {                                                      \\\n            return IMPL;                                       \\\n        }                                                      \\\n    };\n#endif // __loongarch_sx\n\n// clang-format off\n// *INDENT-OFF*\nMAKE_FUNCTION(binary_op_add, x + y, __lsx_vfadd_s(x, y))\nMAKE_FUNCTION(binary_op_sub, x - y, __lsx_vfsub_s(x, y))\nMAKE_FUNCTION(binary_op_mul, x * y, __lsx_vfmul_s(x, y))\nMAKE_FUNCTION(binary_op_div, x / y, __lsx_vfdiv_s(x, y))\nMAKE_FUNCTION(binary_op_max, std::max(x, y), __lsx_vfmax_s(x, y))\nMAKE_FUNCTION(binary_op_min, std::min(x, y), __lsx_vfmin_s(x, y))\nMAKE_FUNCTION(binary_op_pow, (float)powf(x, y), pow_ps(x, y))\nMAKE_FUNCTION(binary_op_rsub, y - x, __lsx_vfsub_s(y, x))\nMAKE_FUNCTION(binary_op_rdiv, y / x, __lsx_vfdiv_s(y, x))\nMAKE_FUNCTION(binary_op_rpow, (float)powf(y, x), pow_ps(y, x))\nMAKE_FUNCTION(binary_op_atan2, (float)atan2f(x, y), atan2_ps(x, y))\nMAKE_FUNCTION(binary_op_ratan2, (float)atan2f(y, x), atan2_ps(y, x))\nMAKE_FUNCTION(binary_op_fmod, (float)fmodf(x, y), fmod_ps(x, y))\nMAKE_FUNCTION(binary_op_rfmod, (float)fmodf(y, x), fmod_ps(y, x))\nMAKE_FUNCTION(binary_op_logaddexp, (float)(std::max(x, y) + log1pf(expf(std::min(x, y) - std::max(x, y)))), logaddexp_ps(x, y))\nMAKE_FUNCTION(binary_op_floor_divide, (float)floorf(x / y), floor_divide_ps(x, y))\nMAKE_FUNCTION(binary_op_rfloor_divide, (float)floorf(y / x), floor_divide_ps(y, x))\nMAKE_FUNCTION(binary_op_remainder, (float)remainderf(x, y), remainder_ps(x, y))\nMAKE_FUNCTION(binary_op_rremainder, (float)remainderf(y, x), remainder_ps(y, x))\n// *INDENT-ON*\n// 
clang-format on\n\n#undef MAKE_FUNCTION\n\n} // namespace BinaryOp_loongarch_functor\n\nstatic void binary_op_vector(const float* ptr, const float* ptr1, float* outptr, int aw, int bw, int ap, int bp, int op_type)\n{\n    using namespace BinaryOp_loongarch_functor;\n\n    if (op_type == BinaryOp::Operation_ADD) return binary_op_vector<binary_op_add>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_vector<binary_op_sub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MUL) return binary_op_vector<binary_op_mul>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_vector<binary_op_div>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_vector<binary_op_max>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MIN) return binary_op_vector<binary_op_min>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_vector<binary_op_pow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_vector<binary_op_rsub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_vector<binary_op_rdiv>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_vector<binary_op_rpow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_ATAN2) return binary_op_vector<binary_op_atan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_vector<binary_op_ratan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_vector<binary_op_fmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_vector<binary_op_rfmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == 
BinaryOp::Operation_LOGADDEXP) return binary_op_vector<binary_op_logaddexp>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return binary_op_vector<binary_op_floor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return binary_op_vector<binary_op_rfloor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_vector<binary_op_remainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_vector<binary_op_rremainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n\n    // should never reach here\n}\n\nstatic void binary_op_scalar(const Mat& a, float b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = a.channel(q);\n        float* outptr = c.channel(q);\n\n        binary_op_vector(ptr, &b, outptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_no_broadcast(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = a.channel(q);\n        const float* ptr1 = b.channel(q);\n        float* outptr = c.channel(q);\n\n        binary_op_vector(ptr, ptr1, outptr, size, size, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_broadcast(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    if (b.w * b.h * b.d * b.c * b.elempack == 1)\n    {\n        return binary_op_scalar(a, b[0], c, op_type, opt);\n    }\n\n    if (a.dims == b.dims && a.w == b.w && a.h == b.h && a.d == b.d && a.c == b.c && a.elempack == b.elempack)\n    
{\n        return binary_op_no_broadcast(a, b, c, op_type, opt);\n    }\n\n    const int dims = c.dims;\n\n    if (dims == 2)\n    {\n        const int h = c.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int y = 0; y < h; y++)\n        {\n            const int y0 = std::min(y, a.h - 1);\n            const int y1 = std::min(y, b.h - 1);\n\n            const float* ptr = a.row(y0);\n            const float* ptr1 = b.row(y1);\n            float* outptr = c.row(y);\n\n            binary_op_vector(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int channels = c.c;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int q0 = std::min(q, a.c - 1);\n            const int q1 = std::min(q, b.c - 1);\n\n            if (b.d * b.h * b.w == 1)\n            {\n                const float* ptr = a.channel(q0);\n                const float* ptr1 = b.channel(q1);\n                float* outptr = c.channel(q);\n\n                binary_op_vector(ptr, ptr1, outptr, a.w * a.h * a.d, 1, a.elempack, b.elempack, op_type);\n                continue;\n            }\n\n            if (b.h * b.w == 1)\n            {\n                for (int z = 0; z < c.d; z++)\n                {\n                    const int z0 = std::min(z, a.d - 1);\n                    const int z1 = std::min(z, b.d - 1);\n\n                    const float* ptr = a.channel(q0).depth(z0);\n                    const float* ptr1 = b.channel(q1).depth(z1);\n                    float* outptr = c.channel(q).depth(z);\n\n                    binary_op_vector(ptr, ptr1, outptr, a.w * a.h, 1, a.elempack, b.elempack, op_type);\n                }\n                continue;\n            }\n\n            for (int z = 0; z < c.d; z++)\n            {\n                const int z0 = std::min(z, a.d - 1);\n                const 
int z1 = std::min(z, b.d - 1);\n\n                for (int y = 0; y < c.h; y++)\n                {\n                    const int y0 = std::min(y, a.h - 1);\n                    const int y1 = std::min(y, b.h - 1);\n\n                    const float* ptr = a.channel(q0).depth(z0).row(y0);\n                    const float* ptr1 = b.channel(q1).depth(z1).row(y1);\n                    float* outptr = c.channel(q).depth(z).row(y);\n\n                    binary_op_vector(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n                }\n            }\n        }\n    }\n}\n\nstatic void binary_op_scalar_inplace(Mat& a, float b, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = a.channel(q);\n\n        binary_op_vector(ptr, &b, ptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic int get_reverse_op_type(int op_type)\n{\n    if (op_type == BinaryOp::Operation_SUB) return BinaryOp::Operation_RSUB;\n    if (op_type == BinaryOp::Operation_DIV) return BinaryOp::Operation_RDIV;\n    if (op_type == BinaryOp::Operation_POW) return BinaryOp::Operation_RPOW;\n    if (op_type == BinaryOp::Operation_ATAN2) return BinaryOp::Operation_RATAN2;\n    if (op_type == BinaryOp::Operation_FMOD) return BinaryOp::Operation_RFMOD;\n    if (op_type == BinaryOp::Operation_LOGADDEXP) return BinaryOp::Operation_LOGADDEXP;\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return BinaryOp::Operation_RFLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_REMAINDER) return BinaryOp::Operation_RREMAINDER;\n\n    if (op_type == BinaryOp::Operation_RSUB) return BinaryOp::Operation_SUB;\n    if (op_type == BinaryOp::Operation_RDIV) return BinaryOp::Operation_DIV;\n    if (op_type == BinaryOp::Operation_RPOW) return BinaryOp::Operation_POW;\n    if (op_type == BinaryOp::Operation_RATAN2) return 
BinaryOp::Operation_ATAN2;\n    if (op_type == BinaryOp::Operation_RFMOD) return BinaryOp::Operation_FMOD;\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return BinaryOp::Operation_FLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_RREMAINDER) return BinaryOp::Operation_REMAINDER;\n\n    return op_type;\n}\n\nint BinaryOp_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    const int outdims = std::max(A.dims, B.dims);\n\n    Mat A2 = A;\n    Mat B2 = B;\n    if (A.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (A.w * A.elempack == B.h * B.elempack)\n                A2 = A.reshape(1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 2;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 3;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 2)\n            A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 4;\n                A2.w = A.w * 
A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 4 && A.dims == 2)\n            A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 3)\n            A2 = A.reshape(1, A.w, A.h, A.c, opt.workspace_allocator);\n    }\n    if (B.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (B.w * B.elempack == A.h * A.elempack)\n                B2 = B.reshape(1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 2;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 3;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 2)\n            B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 4;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 4 && 
B.dims == 2)\n            B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 3)\n            B2 = B.reshape(1, B.w, B.h, B.c, opt.workspace_allocator);\n    }\n\n    const int outw = std::max(A2.w, B2.w);\n    const int outh = std::max(A2.h, B2.h);\n    const int outd = std::max(A2.d, B2.d);\n    const int outc = std::max(A2.c, B2.c);\n    const size_t out_elemsize = std::max(A2.elemsize, B2.elemsize);\n    const int out_elempack = std::max(A2.elempack, B2.elempack);\n\n    Mat& top_blob = top_blobs[0];\n    if (outdims == 1)\n    {\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 2)\n    {\n        top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 3)\n    {\n        top_blob.create(outw, outh, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 4)\n    {\n        top_blob.create(outw, outh, outd, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    const bool a_pack_is_lower = A2.elempack < B2.elempack;\n    const bool a_pack_is_equal = A2.elempack == B2.elempack;\n    const bool a_size_is_lower = A2.w * A2.h * A2.d * A2.c * A2.elempack < B2.w * B2.h * B2.d * B2.c * B2.elempack;\n    if (a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))\n    {\n        binary_op_broadcast(B2, A2, top_blob, get_reverse_op_type(op_type), opt);\n    }\n    else\n    {\n        binary_op_broadcast(A2, B2, top_blob, op_type, opt);\n    }\n\n    return 0;\n}\n\nint BinaryOp_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    binary_op_scalar_inplace(bottom_top_blob, b, op_type, opt);\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/binaryop_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BINARYOP_LOONGARCH_H\n#define LAYER_BINARYOP_LOONGARCH_H\n\n#include \"binaryop.h\"\n\nnamespace ncnn {\n\nclass BinaryOp_loongarch : public BinaryOp\n{\npublic:\n    BinaryOp_loongarch();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BINARYOP_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/cast_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cast_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nCast_loongarch::Cast_loongarch()\n{\n    support_packing = true;\n}\n\nint Cast_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (type_from == type_to)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    size_t out_elemsize = elemsize;\n    if (type_to == 1)\n    {\n        if (type_from == 3)\n        {\n            Cast::forward(bottom_blob, top_blob, opt);\n        }\n\n        // float32\n        out_elemsize = 4 * elempack;\n    }\n    else if (type_to == 2)\n    {\n        // float16\n        out_elemsize = 2 * elempack;\n    }\n    else if (type_to == 3)\n    {\n        // int8\n        out_elemsize = elempack;\n    }\n    else if (type_to == 4)\n    {\n        // bfloat16\n        out_elemsize = 2 * elempack;\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 4)\n    {\n        top_blob.create(w, h, d, channels, out_elemsize, elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int size = w * h * d * elempack;\n\n    if (type_from == 1 && type_to == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; 
q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            unsigned short* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __loongarch_sx\n            for (; i + 7 < size; i += 8)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128 _p0 = (__m128)__lsx_vld(ptr, 0);\n                __m128 _p1 = (__m128)__lsx_vld(ptr + 4, 0);\n                __m128i _p = __lsx_vfcvt_h_s(_p1, _p0);\n                __lsx_vst(_p, outptr, 0);\n\n                ptr += 8;\n                outptr += 8;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                *outptr = float32_to_float16(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    if (type_from == 2 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __loongarch_sx\n            for (; i + 7 < size; i += 8)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128i _p = __lsx_vld(ptr, 0);\n                __m128 _p0 = __lsx_vfcvtl_s_h(_p);\n                __m128 _p1 = __lsx_vfcvth_s_h(_p);\n                __lsx_vst(_p0, outptr, 0);\n                __lsx_vst(_p1, outptr + 4, 0);\n\n                ptr += 8;\n                outptr += 8;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                *outptr = float16_to_float32(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    if (type_from == 3 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const signed char* ptr = bottom_blob.channel(q);\n    
        float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < size; i++)\n            {\n                outptr[i] = (float)ptr[i];\n            }\n        }\n    }\n\n    if (type_from == 4 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n            for (; i < size; i++)\n            {\n                *outptr = bfloat16_to_float32(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    if (type_from == 1 && type_to == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            unsigned short* outptr = top_blob.channel(q);\n\n            int i = 0;\n            for (; i < size; i++)\n            {\n                *outptr = float32_to_bfloat16(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/cast_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CAST_LOONGARCH_H\n#define LAYER_CAST_LOONGARCH_H\n\n#include \"cast.h\"\n\nnamespace ncnn {\n\nclass Cast_loongarch : public Cast\n{\npublic:\n    Cast_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CAST_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/clip_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"clip_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nClip_loongarch::Clip_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nint Clip_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        __m128 _max = (__m128)__lsx_vreplfr2vr_s(max);\n        __m128 _min = (__m128)__lsx_vreplfr2vr_s(min);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            _p = __lsx_vfmax_s(_p, _min);\n            _p = __lsx_vfmin_s(_p, _max);\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            if (*ptr < min)\n                *ptr = min;\n\n            if (*ptr > max)\n                *ptr = max;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/clip_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CLIP_LOONGARCH_H\n#define LAYER_CLIP_LOONGARCH_H\n\n#include \"clip.h\"\n\nnamespace ncnn {\n\nclass Clip_loongarch : public Clip\n{\npublic:\n    Clip_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CLIP_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/concat_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"concat_loongarch.h\"\n\nnamespace ncnn {\n\nConcat_loongarch::Concat_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Concat_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int dims = bottom_blobs[0].dims;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // concat vector\n        // total length\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_w % 4 == 0 ? 
4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        float* outptr = top_blob;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            const float* ptr = bottom_blob;\n            memcpy(outptr, ptr, bottom_blob.w * bottom_blob.elemsize);\n\n            outptr += bottom_blob.w * bottom_blob.elempack;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // concat image\n        int w = bottom_blobs[0].w;\n\n        // total height\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = std::min(elemsize, bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_h += bottom_blob.h * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_h % 4 == 0 ? 
4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, top_h / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n        }\n\n        float* outptr = top_blob_unpacked;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                for (int i = 0; i < bottom_blob.h; i++)\n                {\n                    const float* r0 = bottom_blob.row(i);\n\n                    float* outptr0 = outptr;\n                    float* outptr1 = outptr + w;\n                    float* outptr2 = outptr + w * 2;\n                    float* outptr3 = outptr + w * 3;\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    outptr += w * 4;\n                }\n            }\n            else // if (bottom_blob.elempack == 1 && elempack == 1) if (bottom_blob.elempack == 4 && elempack == 4)\n            {\n                int size = w * bottom_blob.h;\n\n                const float* ptr = bottom_blob;\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                outptr += size * bottom_blob.elempack;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            
convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // interleave image row\n        int h = bottom_blobs[0].h;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* outptr = top_blob.row(i);\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                const float* ptr = bottom_blob.row(i);\n                memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                outptr += bottom_blob.w * elempack;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // concat dim\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int d = bottom_blobs[0].d;\n\n        // total channels\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_channels = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = std::min(elemsize, bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_channels += bottom_blob.c * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_channels % 4 == 0 ? 
4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, d, top_channels / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, h, d, top_channels / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n\n            top_blob_unpacked.dims = dims;\n        }\n\n        int p = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                for (int q = 0; q < bottom_blob.c; q++)\n                {\n                    const float* r0 = bottom_blob.channel(q);\n\n                    float* outptr0 = top_blob_unpacked.channel(p);\n                    float* outptr1 = top_blob_unpacked.channel(p + 1);\n                    float* outptr2 = top_blob_unpacked.channel(p + 2);\n                    float* outptr3 = top_blob_unpacked.channel(p + 3);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    p += 4;\n                }\n            }\n            else // if (bottom_blob.elempack == 1 && elempack == 1) if (bottom_blob.elempack == 4 && elempack == 4)\n            {\n                int size = bottom_blob.total();\n\n                const float* ptr = 
bottom_blob;\n                float* outptr = top_blob_unpacked.channel(p);\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                p += bottom_blob.c;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // interleave dim height\n        int w = bottom_blobs[0].w;\n        int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total height\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_h += bottom_blob.h;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (size_t b = 0; b < bottom_blobs.size(); b++)\n                {\n                    const Mat& bottom_blob = bottom_blobs[b];\n\n                    int size = bottom_blob.w * bottom_blob.h;\n\n                    const float* ptr = bottom_blob.channel(q).depth(i);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    outptr += size * elempack;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // interleave dim width\n        int h = bottom_blobs[0].h;\n   
     int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    for (size_t b = 0; b < bottom_blobs.size(); b++)\n                    {\n                        const Mat& bottom_blob = bottom_blobs[b];\n\n                        const float* ptr = bottom_blob.channel(q).depth(i).row(j);\n                        memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                        outptr += bottom_blob.w * elempack;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        // interleave dim depth\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total depth\n        int top_d = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_d += bottom_blob.d;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, top_d, channels, elemsize, 
elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                const float* ptr = bottom_blob.channel(q);\n                memcpy(outptr, ptr, size * elemsize);\n\n                outptr += size * elempack;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/concat_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONCAT_LOONGARCH_H\n#define LAYER_CONCAT_LOONGARCH_H\n\n#include \"concat.h\"\n\nnamespace ncnn {\n\nclass Concat_loongarch : public Concat\n{\npublic:\n    Concat_loongarch();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONCAT_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/convolution1d_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution1d_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_activation.h\"\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nConvolution1D_loongarch::Convolution1D_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Convolution1D_loongarch::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    const int num_input = weight_data_size / kernel_w / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n\n    // src = kw-inch-outch\n    // dst = pb-pa-kw-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(kernel_w, num_input, num_output);\n\n        weight_data_packed.create(kernel_w, num_input / elempack, num_output / out_elempack, (size_t)4u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            float* g00 = weight_data_packed.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n\n    if 
(opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution1D_loongarch::destroy_pipeline(const Option& /*opt*/)\n{\n    return 0;\n}\n\nint Convolution1D_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = num_output / out_elempack;\n\n    top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __loongarch_sx\n    if (elempack == 4 && out_elempack == 4)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum = (__m128)__lsx_vld((const float*)bias_data + p * 4, 0);\n                    }\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n                        const float* sptr = bottom_blob_bordered.row(q) + j * stride_w * 4;\n\n                        
for (int k = 0; k < kernel_w; k++)\n                        {\n                            __m128 _val0 = __lsx_vreplfr2vr_s(sptr[0]);\n                            __m128 _val1 = __lsx_vreplfr2vr_s(sptr[1]);\n                            __m128 _val2 = __lsx_vreplfr2vr_s(sptr[2]);\n                            __m128 _val3 = __lsx_vreplfr2vr_s(sptr[3]);\n\n                            __m128 _w0 = (__m128)__lsx_vld(kptr, 0);\n                            __m128 _w1 = (__m128)__lsx_vld(kptr + 4, 0);\n                            __m128 _w2 = (__m128)__lsx_vld(kptr + 8, 0);\n                            __m128 _w3 = (__m128)__lsx_vld(kptr + 12, 0);\n\n                            _sum = __lsx_vfmadd_s(_w0, _val0, _sum);\n                            _sum = __lsx_vfmadd_s(_w1, _val1, _sum);\n                            _sum = __lsx_vfmadd_s(_w2, _val2, _sum);\n                            _sum = __lsx_vfmadd_s(_w3, _val3, _sum);\n\n                            sptr += dilation_w * 4;\n                            kptr += 16;\n                        }\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    __lsx_vst(_sum, outptr, 0);\n                    outptr += 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum = (__m128)__lsx_vld((const float*)bias_data + p * 4, 0);\n                    }\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n   
                     const float* sptr = bottom_blob_bordered.row(q) + j * stride_w;\n\n                        for (int k = 0; k < kernel_w; k++)\n                        {\n                            __m128 _val = __lsx_vreplfr2vr_s(sptr[0]);\n                            __m128 _w = (__m128)__lsx_vld(kptr, 0);\n                            _sum = __lsx_vfmadd_s(_w, _val, _sum);\n\n                            sptr += dilation_w;\n                            kptr += 4;\n                        }\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    __lsx_vst(_sum, outptr, 0);\n                    outptr += 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n                        const float* sptr = bottom_blob_bordered.row(q) + j * stride_w * 4;\n\n                        for (int k = 0; k < kernel_w; k++)\n                        {\n                            __m128 _val = (__m128)__lsx_vld(sptr, 0);\n                            __m128 _w = (__m128)__lsx_vld(kptr, 0);\n                            _sum = __lsx_vfmadd_s(_w, _val, _sum);\n\n                            sptr += dilation_w * 4;\n                            kptr += 4;\n                        }\n                    }\n\n                    
sum += __lsx_reduce_fadd_s(_sum);\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n                    outptr[j] = sum;\n                }\n            }\n        }\n    }\n#endif // __loongarch_sx\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n                        const float* sptr = bottom_blob_bordered.row(q) + j * stride_w;\n\n                        for (int k = 0; k < kernel_w; k++)\n                        {\n                            float val = sptr[0];\n                            float wt = kptr[0];\n                            sum += val * wt;\n\n                            sptr += dilation_w;\n                            kptr += 1;\n                        }\n                    }\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n                    outptr[j] = sum;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Convolution1D_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if 
(weight_data_flattened.empty())\n        return -100;\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution1D);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(2, dilation_w);\n    pd.set(3, stride_w);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/convolution1d_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION1D_LOONGARCH_H\n#define LAYER_CONVOLUTION1D_LOONGARCH_H\n\n#include \"convolution1d.h\"\n\nnamespace ncnn {\n\nclass Convolution1D_loongarch : public Convolution1D\n{\npublic:\n    Convolution1D_loongarch();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    // packn\n    Mat weight_data_packed;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION1D_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/convolution_1x1.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_lsx(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_1x1_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const signed char* r0 = bottom_blob.channel(p);\n        signed char* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n                outptr[2] = r0[4];\n                outptr[3] = r0[6];\n\n                r0 += 8;\n                outptr += 4;\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n\n                r0 += 4;\n                outptr += 2;\n            }\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n  
  conv1x1s1_sgemm_int8_lsx(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_1x1_pack1to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack1to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack1to4_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack1to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const signed char* r0 = bottom_blob.channel(p);\n        signed char* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n                outptr[2] = r0[4];\n                outptr[3] = r0[6];\n\n                r0 += 8;\n                outptr += 4;\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n\n                r0 += 4;\n                outptr += 2;\n            }\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += 
tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack1to4_int8_lsx(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_1x1_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack4_lsx(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const float* r0 = bottom_blob.channel(p);\n        float* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128 _val = (__m128)__lsx_vld(r0, 0);\n                __lsx_vst(_val, outptr, 0);\n\n                r0 += 4 * 2;\n                outptr += 4;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack4_lsx(bottom_blob_shrinked, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_1x1_pack4to1.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack4to1_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack4to1_lsx(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack4to1_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const float* r0 = bottom_blob.channel(p);\n        float* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128 _val = (__m128)__lsx_vld(r0, 0);\n                __lsx_vst(_val, outptr, 0);\n\n                r0 += 4 * 2;\n                outptr += 4;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack4to1_lsx(bottom_blob_shrinked, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_1x1_pack8to1_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack8to1_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack8to1_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack8to1_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const int64_t* r0 = bottom_blob.channel(p);\n        int64_t* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack8to1_int8_lsx(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_1x1_pack8to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack8to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack8to4_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack8to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const int64_t* r0 = bottom_blob.channel(p);\n        int64_t* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack8to4_int8_lsx(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_3x3.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd23_transform_kernel_lsx(const Mat& kernel, Mat& kernel_tm2, int inch, int outch, const Option& opt)\n{\n    Mat kernel_tm(4 * 4, inch, outch);\n\n    // G\n    const float ktm[4][3] = {\n        {1.0f, 0.0f, 0.0f},\n        {1.0f / 2, 1.0f / 2, 1.0f / 2},\n        {1.0f / 2, -1.0f / 2, 1.0f / 2},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[4][3];\n            for (int i = 0; i < 4; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 4; j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 4; i++)\n                {\n                    kernel_tm0[j * 4 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 16-inch-outch\n    // dst = inch-16-outch\n#if __loongarch_sx\n    kernel_tm2.create(8 * inch, 16, outch / 8 + (outch % 8) / 4 + outch % 4);\n#else\n    kernel_tm2.create(2 * inch, 16, outch / 2 + outch % 2);\n#endif\n\n    int q = 0;\n#if __loongarch_sx\n    for (; q + 7 < 
outch; q += 8)\n    {\n        Mat g0 = kernel_tm2.channel(q / 8);\n\n        for (int k = 0; k < 16; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < inch; p++)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    const float* k00 = kernel_tm.channel(q + i).row(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n    for (; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm2.channel(q / 8 + (q % 8) / 4);\n\n        for (int k = 0; k < 16; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < inch; p++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    const float* k00 = kernel_tm.channel(q + i).row(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n#else  // __loongarch_sx\n    for (; q + 1 < outch; q += 2)\n    {\n        Mat g0 = kernel_tm2.channel(q / 2);\n\n        for (int k = 0; k < 16; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < inch; p++)\n            {\n                for (int i = 0; i < 2; i++)\n                {\n                    const float* k00 = kernel_tm.channel(q + i).row(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n#endif // __loongarch_sx\n    for (; q < outch; q++)\n    {\n#if __loongarch_sx\n        Mat g0 = kernel_tm2.channel(q / 8 + (q % 8) / 4 + q % 4);\n#else\n        Mat g0 = kernel_tm2.channel(q / 2 + q % 2);\n#endif\n\n        for (int k = 0; k < 16; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < inch; p++)\n            {\n                const float* k00 = kernel_tm.channel(q).row(p);\n                g00[0] = k00[k];\n                g00++;\n            }\n   
     }\n    }\n}\n\nstatic void conv3x3s1_winograd23_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 2n+2, winograd F(2,3)\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 1) / 2 * 2;\n    outh = (outh + 1) / 2 * 2;\n\n    w = outw + 2;\n    h = outh + 2;\n    Option opt_b = opt;\n    opt_b.blob_allocator = opt.workspace_allocator;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, 0, 0.f, opt_b);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 2;\n        int h_tiles = outh / 2;\n        int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 16, inch, 4u, opt.workspace_allocator);\n        conv3x3s1_winograd23_transform_input_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_lsx(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd23_transform_output_lsx(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n\nstatic void conv3x3s1_winograd43_transform_kernel_lsx(const Mat& kernel, Mat& kernel_tm2, int inch, int outch, const Option& opt)\n{\n    Mat kernel_tm(6 
* 6, inch, outch);\n\n    // G\n    const float ktm[6][3] = {\n        {1.0f / 4, 0.0f, 0.0f},\n        {-1.0f / 6, -1.0f / 6, -1.0f / 6},\n        {-1.0f / 6, 1.0f / 6, -1.0f / 6},\n        {1.0f / 24, 1.0f / 12, 1.0f / 6},\n        {1.0f / 24, -1.0f / 12, 1.0f / 6},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = inch-36-outch\n#if __loongarch_sx\n    kernel_tm2.create(8 * inch, 36, outch / 8 + (outch % 8) / 4 + outch % 4);\n#else\n    kernel_tm2.create(2 * inch, 36, outch / 2 + outch % 2);\n#endif\n\n    int q = 0;\n#if __loongarch_sx\n    for (; q + 7 < outch; q += 8)\n    {\n        Mat g0 = kernel_tm2.channel(q / 8);\n\n        for (int k = 0; k < 36; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < 
inch; p++)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    const float* k00 = kernel_tm.channel(q + i).row(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n    for (; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm2.channel(q / 8 + (q % 8) / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < inch; p++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    const float* k00 = kernel_tm.channel(q + i).row(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n#else  // __loongarch_sx\n    for (; q + 1 < outch; q += 2)\n    {\n        Mat g0 = kernel_tm2.channel(q / 2);\n\n        for (int k = 0; k < 36; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < inch; p++)\n            {\n                for (int i = 0; i < 2; i++)\n                {\n                    const float* k00 = kernel_tm.channel(q + i).row(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n#endif // __loongarch_sx\n    for (; q < outch; q++)\n    {\n#if __loongarch_sx\n        Mat g0 = kernel_tm2.channel(q / 8 + (q % 8) / 4 + q % 4);\n#else\n        Mat g0 = kernel_tm2.channel(q / 2 + q % 2);\n#endif\n\n        for (int k = 0; k < 36; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p < inch; p++)\n            {\n                const float* k00 = kernel_tm.channel(q).row(p);\n                g00[0] = k00[k];\n                g00++;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    
int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2, winograd F(4,3)\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n\n    Option opt_b = opt;\n    opt_b.blob_allocator = opt.workspace_allocator;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, 0, 0.f, opt_b);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, 4u, opt.workspace_allocator);\n        conv3x3s1_winograd43_transform_input_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_lsx(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_lsx(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_3x3_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_kernel_int8_lsx(const Mat& kernel, Mat& kernel_tm_packed, int inch, int outch, const Option& opt)\n{\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch, (size_t)2u);\n\n    const short ktm[6][3] = {\n        {6, 0, 0},\n        {-4, -4, -4},\n        {-4, 4, -4},\n        {1, 2, 4},\n        {1, -2, 4},\n        {0, 0, 6}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const signed char* kernel0 = (const signed char*)kernel + p * inch * 9 + q * 9;\n            short* kernel_tm0 = kernel_tm.channel(p).row<short>(q);\n\n            // transform kernel\n            const signed char* k0 = kernel0;\n            const signed char* k1 = kernel0 + 3;\n            const signed char* k2 = kernel0 + 6;\n\n            // h\n            short tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; j++)\n            {\n                short* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = 2b-inch-36-outch/2b\n#if __loongarch_sx\n    if (outch >= 4)\n    {\n        if (inch >= 4)\n            kernel_tm_packed.create(inch / 4 + inch % 4, 36, outch / 4 + outch % 4, 
(size_t)2u * 16, 16);\n        else\n            kernel_tm_packed.create(inch, 36, outch / 4 + outch % 4, (size_t)2u * 4, 4);\n    }\n#else  // __loongarch_sx\n    if (outch >= 2)\n    {\n        kernel_tm_packed.create(inch, 36, outch / 2 + outch % 2, (size_t)2u * 2, 2);\n    }\n#endif // __loongarch_sx\n    else\n    {\n#if __loongarch_sx\n        if (inch >= 4)\n            kernel_tm_packed.create(inch / 4 + inch % 4, 36, outch, (size_t)2u * 4, 4);\n        else\n#endif // __loongarch_sx\n        {\n            kernel_tm_packed.create(inch, 36, outch, (size_t)2u, 1);\n        }\n    }\n\n    int p = 0;\n#if __loongarch_sx\n    for (; p + 3 < outch; p += 4)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n        const Mat k1 = kernel_tm.channel(p + 1);\n        const Mat k2 = kernel_tm.channel(p + 2);\n        const Mat k3 = kernel_tm.channel(p + 3);\n\n        Mat g0 = kernel_tm_packed.channel(p / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            int q = 0;\n            for (; q + 3 < inch; q += 4)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k0.row<const short>(q + 1)[k];\n                g00[2] = k0.row<const short>(q + 2)[k];\n                g00[3] = k0.row<const short>(q + 3)[k];\n                g00[4] = k1.row<const short>(q)[k];\n                g00[5] = k1.row<const short>(q + 1)[k];\n                g00[6] = k1.row<const short>(q + 2)[k];\n                g00[7] = k1.row<const short>(q + 3)[k];\n                g00[8] = k2.row<const short>(q)[k];\n                g00[9] = k2.row<const short>(q + 1)[k];\n                g00[10] = k2.row<const short>(q + 2)[k];\n                g00[11] = k2.row<const short>(q + 3)[k];\n                g00[12] = k3.row<const short>(q)[k];\n                g00[13] = k3.row<const short>(q + 1)[k];\n                g00[14] = k3.row<const short>(q + 2)[k];\n                g00[15] = k3.row<const short>(q + 
3)[k];\n                g00 += 16;\n            }\n            for (; q < inch; q++)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k1.row<const short>(q)[k];\n                g00[2] = k2.row<const short>(q)[k];\n                g00[3] = k3.row<const short>(q)[k];\n                g00 += 4;\n            }\n        }\n    }\n#else  // __loongarch_sx\n    for (; p + 1 < outch; p += 2)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n        const Mat k1 = kernel_tm.channel(p + 1);\n\n        Mat g0 = kernel_tm_packed.channel(p / 2);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            int q = 0;\n            for (; q < inch; q++)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k1.row<const short>(q)[k];\n                g00 += 2;\n            }\n        }\n    }\n#endif // __loongarch_sx\n    for (; p < outch; p++)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n\n#if __loongarch_sx\n        Mat g0 = kernel_tm_packed.channel(p / 4 + p % 4);\n#else\n        Mat g0 = kernel_tm_packed.channel(p / 2 + p % 2);\n#endif\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            int q = 0;\n#if __loongarch_sx\n            for (; q + 3 < inch; q += 4)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k0.row<const short>(q + 1)[k];\n                g00[2] = k0.row<const short>(q + 2)[k];\n                g00[3] = k0.row<const short>(q + 3)[k];\n                g00 += 4;\n            }\n#endif // __loongarch_sx\n            for (; q < inch; q++)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00 += 1;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Option& opt)\n{\n    int w = 
bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    //     size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, 2u * elempack, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd43_transform_input_int8_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_int8_lsx(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u, 1, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_int8_lsx(top_blob_tm, top_blob_bordered, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_3x3_pack1to4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack1to4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        __m128 _bias0 = bias ? (__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n        out0.fill(_bias0);\n\n        const float* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            __m128 _k00 = (__m128)__lsx_vld(k0, 0);\n            __m128 _k01 = (__m128)__lsx_vld(k0 + 4, 0);\n            __m128 _k02 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n            __m128 _k10 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n            __m128 _k11 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n            __m128 _k12 = (__m128)__lsx_vld(k0 + 4 * 5, 0);\n            __m128 _k20 = (__m128)__lsx_vld(k0 + 4 * 6, 0);\n            __m128 _k21 = (__m128)__lsx_vld(k0 + 4 * 7, 0);\n            __m128 _k22 = (__m128)__lsx_vld(k0 + 4 * 8, 0);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 7 < outw; j += 8)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n                    __m128 _sum1 = (__m128)__lsx_vld(outptr0 + 4, 0);\n                    __m128 _sum2 = (__m128)__lsx_vld(outptr0 + 4 * 2, 0);\n                    __m128 _sum3 = 
(__m128)__lsx_vld(outptr0 + 4 * 3, 0);\n                    __m128 _sum4 = (__m128)__lsx_vld(outptr0 + 4 * 4, 0);\n                    __m128 _sum5 = (__m128)__lsx_vld(outptr0 + 4 * 5, 0);\n                    __m128 _sum6 = (__m128)__lsx_vld(outptr0 + 4 * 6, 0);\n                    __m128 _sum7 = (__m128)__lsx_vld(outptr0 + 4 * 7, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r0n = __lsx_vld(r0 + 4, 0);\n                    __m128i _r0nn = __lsx_vld(r0 + 8, 0);\n\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n                    __m128 _r03 = (__m128)__lsx_vreplvei_w(_r0, 3);\n                    __m128 _r04 = (__m128)__lsx_vreplvei_w(_r0n, 0);\n                    __m128 _r05 = (__m128)__lsx_vreplvei_w(_r0n, 1);\n                    __m128 _r06 = (__m128)__lsx_vreplvei_w(_r0n, 2);\n                    __m128 _r07 = (__m128)__lsx_vreplvei_w(_r0n, 3);\n                    __m128 _r08 = (__m128)__lsx_vreplvei_w(_r0nn, 0);\n                    __m128 _r09 = (__m128)__lsx_vreplvei_w(_r0nn, 1);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k00, _r01, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k00, _r02, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k00, _r03, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k00, _r04, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k00, _r05, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k00, _r06, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k00, _r07, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k01, _r02, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k01, _r03, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k01, _r04, _sum3);\n                  
  _sum4 = __lsx_vfmadd_s(_k01, _r05, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k01, _r06, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k01, _r07, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k01, _r08, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k02, _r03, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k02, _r04, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k02, _r05, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k02, _r06, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k02, _r07, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k02, _r08, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k02, _r09, _sum7);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r1n = __lsx_vld(r1 + 4, 0);\n                    __m128i _r1nn = __lsx_vld(r1 + 8, 0);\n\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n                    __m128 _r13 = (__m128)__lsx_vreplvei_w(_r1, 3);\n                    __m128 _r14 = (__m128)__lsx_vreplvei_w(_r1n, 0);\n                    __m128 _r15 = (__m128)__lsx_vreplvei_w(_r1n, 1);\n                    __m128 _r16 = (__m128)__lsx_vreplvei_w(_r1n, 2);\n                    __m128 _r17 = (__m128)__lsx_vreplvei_w(_r1n, 3);\n                    __m128 _r18 = (__m128)__lsx_vreplvei_w(_r1nn, 0);\n                    __m128 _r19 = (__m128)__lsx_vreplvei_w(_r1nn, 1);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k10, _r11, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k10, _r12, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k10, _r13, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k10, _r14, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k10, _r15, 
_sum5);\n                    _sum6 = __lsx_vfmadd_s(_k10, _r16, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k10, _r17, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k11, _r12, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k11, _r13, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k11, _r14, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k11, _r15, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k11, _r16, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k11, _r17, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k11, _r18, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k12, _r13, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k12, _r14, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k12, _r15, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k12, _r16, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k12, _r17, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k12, _r18, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k12, _r19, _sum7);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128i _r2n = __lsx_vld(r2 + 4, 0);\n                    __m128i _r2nn = __lsx_vld(r2 + 8, 0);\n\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n                    __m128 _r23 = (__m128)__lsx_vreplvei_w(_r2, 3);\n                    __m128 _r24 = (__m128)__lsx_vreplvei_w(_r2n, 0);\n                    __m128 _r25 = (__m128)__lsx_vreplvei_w(_r2n, 1);\n                    __m128 _r26 = (__m128)__lsx_vreplvei_w(_r2n, 2);\n                    __m128 _r27 = (__m128)__lsx_vreplvei_w(_r2n, 3);\n                    __m128 _r28 = (__m128)__lsx_vreplvei_w(_r2nn, 0);\n                    __m128 _r29 = 
(__m128)__lsx_vreplvei_w(_r2nn, 1);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k20, _r21, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k20, _r22, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k20, _r23, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k20, _r24, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k20, _r25, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k20, _r26, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k20, _r27, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k21, _r22, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k21, _r23, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k21, _r24, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k21, _r25, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k21, _r26, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k21, _r27, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k21, _r28, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k22, _r23, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k22, _r24, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k22, _r25, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k22, _r26, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k22, _r27, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k22, _r28, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k22, _r29, _sum7);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n                    __lsx_vst(_sum1, outptr0 + 4, 0);\n                    __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n                    __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n                    __lsx_vst(_sum4, outptr0 + 4 * 4, 0);\n                    __lsx_vst(_sum5, outptr0 + 4 * 5, 0);\n                    __lsx_vst(_sum6, outptr0 + 4 * 6, 0);\n                    
__lsx_vst(_sum7, outptr0 + 4 * 7, 0);\n\n                    outptr0 += 4 * 8;\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                }\n                for (; j + 3 < outw; j += 4)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n                    __m128 _sum1 = (__m128)__lsx_vld(outptr0 + 4, 0);\n                    __m128 _sum2 = (__m128)__lsx_vld(outptr0 + 4 * 2, 0);\n                    __m128 _sum3 = (__m128)__lsx_vld(outptr0 + 4 * 3, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r0n = __lsx_vld(r0 + 4, 0);\n\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n                    __m128 _r03 = (__m128)__lsx_vreplvei_w(_r0, 3);\n                    __m128 _r04 = (__m128)__lsx_vreplvei_w(_r0n, 0);\n                    __m128 _r05 = (__m128)__lsx_vreplvei_w(_r0n, 1);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k00, _r01, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k00, _r02, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k00, _r03, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k01, _r02, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k01, _r03, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k01, _r04, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k02, _r03, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k02, _r04, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k02, _r05, _sum3);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r1n = __lsx_vld(r1 + 4, 0);\n\n                    __m128 _r10 = 
(__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n                    __m128 _r13 = (__m128)__lsx_vreplvei_w(_r1, 3);\n                    __m128 _r14 = (__m128)__lsx_vreplvei_w(_r1n, 0);\n                    __m128 _r15 = (__m128)__lsx_vreplvei_w(_r1n, 1);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k10, _r11, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k10, _r12, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k10, _r13, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k11, _r12, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k11, _r13, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k11, _r14, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k12, _r13, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k12, _r14, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k12, _r15, _sum3);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128i _r2n = __lsx_vld(r2 + 4, 0);\n\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n                    __m128 _r23 = (__m128)__lsx_vreplvei_w(_r2, 3);\n                    __m128 _r24 = (__m128)__lsx_vreplvei_w(_r2n, 0);\n                    __m128 _r25 = (__m128)__lsx_vreplvei_w(_r2n, 1);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k20, _r21, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k20, _r22, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k20, _r23, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, 
_sum0);\n                    _sum1 = __lsx_vfmadd_s(_k21, _r22, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k21, _r23, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k21, _r24, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k22, _r23, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k22, _r24, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k22, _r25, _sum3);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n                    __lsx_vst(_sum1, outptr0 + 4, 0);\n                    __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n                    __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n\n                    outptr0 += 4 * 4;\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n                    __m128 _sum1 = (__m128)__lsx_vld(outptr0 + 4, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n                    __m128 _r03 = (__m128)__lsx_vreplvei_w(_r0, 3);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k00, _r01, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k01, _r02, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k02, _r03, _sum1);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = 
(__m128)__lsx_vreplvei_w(_r1, 2);\n                    __m128 _r13 = (__m128)__lsx_vreplvei_w(_r1, 3);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k10, _r11, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k11, _r12, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k12, _r13, _sum1);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n                    __m128 _r23 = (__m128)__lsx_vreplvei_w(_r2, 3);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k20, _r21, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k21, _r22, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k22, _r23, _sum1);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n                    __lsx_vst(_sum1, outptr0 + 4, 0);\n\n                    outptr0 += 4 * 2;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                }\n                for (; j < outw; j++)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                
    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n\n                    outptr0 += 4;\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n\nstatic void conv3x3s2_pack1to4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        __m128 _bias0 = bias ? 
(__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n        out0.fill(_bias0);\n\n        const float* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            __m128 _k00 = (__m128)__lsx_vld(k0, 0);\n            __m128 _k01 = (__m128)__lsx_vld(k0 + 4, 0);\n            __m128 _k02 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n            __m128 _k10 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n            __m128 _k11 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n            __m128 _k12 = (__m128)__lsx_vld(k0 + 4 * 5, 0);\n            __m128 _k20 = (__m128)__lsx_vld(k0 + 4 * 6, 0);\n            __m128 _k21 = (__m128)__lsx_vld(k0 + 4 * 7, 0);\n            __m128 _k22 = (__m128)__lsx_vld(k0 + 4 * 8, 0);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 7 < outw; j += 8)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n                    __m128 _sum1 = (__m128)__lsx_vld(outptr0 + 4, 0);\n                    __m128 _sum2 = (__m128)__lsx_vld(outptr0 + 4 * 2, 0);\n                    __m128 _sum3 = (__m128)__lsx_vld(outptr0 + 4 * 3, 0);\n                    __m128 _sum4 = (__m128)__lsx_vld(outptr0 + 4 * 4, 0);\n                    __m128 _sum5 = (__m128)__lsx_vld(outptr0 + 4 * 5, 0);\n                    __m128 _sum6 = (__m128)__lsx_vld(outptr0 + 4 * 6, 0);\n                    __m128 _sum7 = (__m128)__lsx_vld(outptr0 + 4 * 7, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r0n = __lsx_vld(r0 + 4, 0);\n                    __m128i _r0nn = __lsx_vld(r0 + 8, 0);\n                    __m128i _r0nnn = __lsx_vld(r0 + 12, 0);\n\n                    __m128 _r00 = 
(__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n                    __m128 _r03 = (__m128)__lsx_vreplvei_w(_r0, 3);\n                    __m128 _r04 = (__m128)__lsx_vreplvei_w(_r0n, 0);\n                    __m128 _r05 = (__m128)__lsx_vreplvei_w(_r0n, 1);\n                    __m128 _r06 = (__m128)__lsx_vreplvei_w(_r0n, 2);\n                    __m128 _r07 = (__m128)__lsx_vreplvei_w(_r0n, 3);\n                    __m128 _r08 = (__m128)__lsx_vreplvei_w(_r0nn, 0);\n                    __m128 _r09 = (__m128)__lsx_vreplvei_w(_r0nn, 1);\n                    __m128 _r0a = (__m128)__lsx_vreplvei_w(_r0nn, 2);\n                    __m128 _r0b = (__m128)__lsx_vreplvei_w(_r0nn, 3);\n                    __m128 _r0c = (__m128)__lsx_vreplvei_w(_r0nnn, 0);\n                    __m128 _r0d = (__m128)__lsx_vreplvei_w(_r0nnn, 1);\n                    __m128 _r0e = (__m128)__lsx_vreplvei_w(_r0nnn, 2);\n                    __m128 _r0f = (__m128)__lsx_vreplvei_w(_r0nnn, 3);\n                    __m128 _r0g = __lsx_vreplfr2vr_s(r0[16]);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k00, _r02, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k00, _r04, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k00, _r06, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k00, _r08, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k00, _r0a, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k00, _r0c, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k00, _r0e, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k01, _r03, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k01, _r05, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k01, _r07, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k01, _r09, _sum4);\n         
           _sum5 = __lsx_vfmadd_s(_k01, _r0b, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k01, _r0d, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k01, _r0f, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k02, _r04, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k02, _r06, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k02, _r08, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k02, _r0a, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k02, _r0c, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k02, _r0e, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k02, _r0g, _sum7);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r1n = __lsx_vld(r1 + 4, 0);\n                    __m128i _r1nn = __lsx_vld(r1 + 8, 0);\n                    __m128i _r1nnn = __lsx_vld(r1 + 12, 0);\n\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n                    __m128 _r13 = (__m128)__lsx_vreplvei_w(_r1, 3);\n                    __m128 _r14 = (__m128)__lsx_vreplvei_w(_r1n, 0);\n                    __m128 _r15 = (__m128)__lsx_vreplvei_w(_r1n, 1);\n                    __m128 _r16 = (__m128)__lsx_vreplvei_w(_r1n, 2);\n                    __m128 _r17 = (__m128)__lsx_vreplvei_w(_r1n, 3);\n                    __m128 _r18 = (__m128)__lsx_vreplvei_w(_r1nn, 0);\n                    __m128 _r19 = (__m128)__lsx_vreplvei_w(_r1nn, 1);\n                    __m128 _r1a = (__m128)__lsx_vreplvei_w(_r1nn, 2);\n                    __m128 _r1b = (__m128)__lsx_vreplvei_w(_r1nn, 3);\n                    __m128 _r1c = (__m128)__lsx_vreplvei_w(_r1nnn, 0);\n                    __m128 _r1d = (__m128)__lsx_vreplvei_w(_r1nnn, 1);\n                    __m128 _r1e = (__m128)__lsx_vreplvei_w(_r1nnn, 2);\n              
      __m128 _r1f = (__m128)__lsx_vreplvei_w(_r1nnn, 3);\n                    __m128 _r1g = __lsx_vreplfr2vr_s(r1[16]);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k10, _r12, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k10, _r14, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k10, _r16, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k10, _r18, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k10, _r1a, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k10, _r1c, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k10, _r1e, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k11, _r13, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k11, _r15, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k11, _r17, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k11, _r19, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k11, _r1b, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k11, _r1d, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k11, _r1f, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k12, _r14, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k12, _r16, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k12, _r18, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k12, _r1a, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k12, _r1c, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k12, _r1e, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k12, _r1g, _sum7);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128i _r2n = __lsx_vld(r2 + 4, 0);\n                    __m128i _r2nn = __lsx_vld(r2 + 8, 0);\n                    __m128i _r2nnn = __lsx_vld(r2 + 12, 0);\n\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = 
(__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n                    __m128 _r23 = (__m128)__lsx_vreplvei_w(_r2, 3);\n                    __m128 _r24 = (__m128)__lsx_vreplvei_w(_r2n, 0);\n                    __m128 _r25 = (__m128)__lsx_vreplvei_w(_r2n, 1);\n                    __m128 _r26 = (__m128)__lsx_vreplvei_w(_r2n, 2);\n                    __m128 _r27 = (__m128)__lsx_vreplvei_w(_r2n, 3);\n                    __m128 _r28 = (__m128)__lsx_vreplvei_w(_r2nn, 0);\n                    __m128 _r29 = (__m128)__lsx_vreplvei_w(_r2nn, 1);\n                    __m128 _r2a = (__m128)__lsx_vreplvei_w(_r2nn, 2);\n                    __m128 _r2b = (__m128)__lsx_vreplvei_w(_r2nn, 3);\n                    __m128 _r2c = (__m128)__lsx_vreplvei_w(_r2nnn, 0);\n                    __m128 _r2d = (__m128)__lsx_vreplvei_w(_r2nnn, 1);\n                    __m128 _r2e = (__m128)__lsx_vreplvei_w(_r2nnn, 2);\n                    __m128 _r2f = (__m128)__lsx_vreplvei_w(_r2nnn, 3);\n                    __m128 _r2g = __lsx_vreplfr2vr_s(r2[16]);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k20, _r22, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k20, _r24, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k20, _r26, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k20, _r28, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k20, _r2a, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k20, _r2c, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k20, _r2e, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k21, _r23, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k21, _r25, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k21, _r27, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k21, _r29, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k21, _r2b, _sum5);\n              
      _sum6 = __lsx_vfmadd_s(_k21, _r2d, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k21, _r2f, _sum7);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k22, _r24, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k22, _r26, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k22, _r28, _sum3);\n                    _sum4 = __lsx_vfmadd_s(_k22, _r2a, _sum4);\n                    _sum5 = __lsx_vfmadd_s(_k22, _r2c, _sum5);\n                    _sum6 = __lsx_vfmadd_s(_k22, _r2e, _sum6);\n                    _sum7 = __lsx_vfmadd_s(_k22, _r2g, _sum7);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n                    __lsx_vst(_sum1, outptr0 + 4, 0);\n                    __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n                    __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n                    __lsx_vst(_sum4, outptr0 + 4 * 4, 0);\n                    __lsx_vst(_sum5, outptr0 + 4 * 5, 0);\n                    __lsx_vst(_sum6, outptr0 + 4 * 6, 0);\n                    __lsx_vst(_sum7, outptr0 + 4 * 7, 0);\n\n                    outptr0 += 4 * 8;\n\n                    r0 += 16;\n                    r1 += 16;\n                    r2 += 16;\n                }\n                for (; j + 3 < outw; j += 4)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n                    __m128 _sum1 = (__m128)__lsx_vld(outptr0 + 4, 0);\n                    __m128 _sum2 = (__m128)__lsx_vld(outptr0 + 4 * 2, 0);\n                    __m128 _sum3 = (__m128)__lsx_vld(outptr0 + 4 * 3, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r0n = __lsx_vld(r0 + 4, 0);\n\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n                    __m128 _r03 = (__m128)__lsx_vreplvei_w(_r0, 3);\n             
       __m128 _r04 = (__m128)__lsx_vreplvei_w(_r0n, 0);\n                    __m128 _r05 = (__m128)__lsx_vreplvei_w(_r0n, 1);\n                    __m128 _r06 = (__m128)__lsx_vreplvei_w(_r0n, 2);\n                    __m128 _r07 = (__m128)__lsx_vreplvei_w(_r0n, 3);\n                    __m128 _r08 = __lsx_vreplfr2vr_s(r0[8]);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k00, _r02, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k00, _r04, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k00, _r06, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k01, _r03, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k01, _r05, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k01, _r07, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k02, _r04, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k02, _r06, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k02, _r08, _sum3);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r1n = __lsx_vld(r1 + 4, 0);\n\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n                    __m128 _r13 = (__m128)__lsx_vreplvei_w(_r1, 3);\n                    __m128 _r14 = (__m128)__lsx_vreplvei_w(_r1n, 0);\n                    __m128 _r15 = (__m128)__lsx_vreplvei_w(_r1n, 1);\n                    __m128 _r16 = (__m128)__lsx_vreplvei_w(_r1n, 2);\n                    __m128 _r17 = (__m128)__lsx_vreplvei_w(_r1n, 3);\n                    __m128 _r18 = __lsx_vreplfr2vr_s(r1[8]);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k10, _r12, _sum1);\n                    _sum2 = 
__lsx_vfmadd_s(_k10, _r14, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k10, _r16, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k11, _r13, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k11, _r15, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k11, _r17, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k12, _r14, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k12, _r16, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k12, _r18, _sum3);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128i _r2n = __lsx_vld(r2 + 4, 0);\n\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n                    __m128 _r23 = (__m128)__lsx_vreplvei_w(_r2, 3);\n                    __m128 _r24 = (__m128)__lsx_vreplvei_w(_r2n, 0);\n                    __m128 _r25 = (__m128)__lsx_vreplvei_w(_r2n, 1);\n                    __m128 _r26 = (__m128)__lsx_vreplvei_w(_r2n, 2);\n                    __m128 _r27 = (__m128)__lsx_vreplvei_w(_r2n, 3);\n                    __m128 _r28 = __lsx_vreplfr2vr_s(r2[8]);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k20, _r22, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k20, _r24, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k20, _r26, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k21, _r23, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k21, _r25, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k21, _r27, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k22, _r24, _sum1);\n             
       _sum2 = __lsx_vfmadd_s(_k22, _r26, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k22, _r28, _sum3);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n                    __lsx_vst(_sum1, outptr0 + 4, 0);\n                    __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n                    __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n\n                    outptr0 += 4 * 4;\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n                    __m128 _sum1 = (__m128)__lsx_vld(outptr0 + 4, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n                    __m128 _r03 = (__m128)__lsx_vreplvei_w(_r0, 3);\n                    __m128 _r04 = __lsx_vreplfr2vr_s(r0[4]);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k00, _r02, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k01, _r03, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k02, _r04, _sum1);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n                    __m128 _r13 = (__m128)__lsx_vreplvei_w(_r1, 3);\n                    __m128 _r14 = __lsx_vreplfr2vr_s(r1[4]);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k10, _r12, _sum1);\n  
                  _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k11, _r13, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k12, _r14, _sum1);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n                    __m128 _r23 = (__m128)__lsx_vreplvei_w(_r2, 3);\n                    __m128 _r24 = __lsx_vreplfr2vr_s(r2[4]);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k20, _r22, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k21, _r23, _sum1);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k22, _r24, _sum1);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n                    __lsx_vst(_sum1, outptr0 + 4, 0);\n\n                    outptr0 += 4 * 2;\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                }\n                for (; j < outw; j++)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n  
                  __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 _r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n\n                    outptr0 += 4;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_3x3_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd63_transform_kernel_pack4_lsx(const Mat& kernel, Mat& kernel_tm_pack4, int inch, int outch, const Option& opt)\n{\n    // winograd63 transform kernel\n    Mat kernel_tm;\n    kernel_tm.create(8 * 8, inch, outch);\n\n    const float ktm[8][3] = {\n        {1.0f, 0.0f, 0.0f},\n        {-2.0f / 9, -2.0f / 9, -2.0f / 9},\n        {-2.0f / 9, 2.0f / 9, -2.0f / 9},\n        {1.0f / 90, 1.0f / 45, 2.0f / 45},\n        {1.0f / 90, -1.0f / 45, 2.0f / 45},\n        {1.0f / 45, 1.0f / 90, 1.0f / 180},\n        {1.0f / 45, -1.0f / 90, 1.0f / 180},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel, transposed\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[8][3];\n            for (int i = 0; i < 8; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // v\n            for (int j = 0; j < 8; j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 8; i++)\n                {\n                    kernel_tm0[j * 8 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 64-inch-outch\n    // 
dst = pb-pa-inch/pa-64-outch/pb\n    kernel_tm_pack4.create(inch / 4, 64, outch / 4, (size_t)4u * 4 * 4, 4 * 4);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm_pack4.channel(q / 4);\n\n        for (int k = 0; k < 64; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p + 3 < inch; p += 4)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const float* k00 = kernel_tm.channel(q + j).row(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd63_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 6n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 5) / 6 * 6;\n    outh = (outh + 5) / 6 * 6;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 6;\n        int h_tiles = outh / 6;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 64, inch, elemsize, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd63_transform_input_pack4_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack4_lsx(bottom_blob_tm, outch, 
kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, elemsize, elempack, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd63_transform_output_pack4_lsx(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n\nstatic void conv3x3s1_winograd43_transform_kernel_pack4_lsx(const Mat& kernel, Mat& kernel_tm_pack4, int inch, int outch, const Option& opt)\n{\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch);\n\n    const float ktm[6][3] = {\n        {1.0f / 4, 0.0f, 0.0f},\n        {-1.0f / 6, -1.0f / 6, -1.0f / 6},\n        {-1.0f / 6, 1.0f / 6, -1.0f / 6},\n        {1.0f / 24, 1.0f / 12, 1.0f / 6},\n        {1.0f / 24, -1.0f / 12, 1.0f / 6},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // 
U\n            for (int j = 0; j < 6; j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = pb-pa-inch/pa-36-outch/pb\n    kernel_tm_pack4.create(inch / 4, 36, outch / 4, (size_t)4u * 4 * 4, 4 * 4);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm_pack4.channel(q / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p + 3 < inch; p += 4)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const float* k00 = kernel_tm.channel(q + j).row(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        const int tiles = 
w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, elemsize, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd43_transform_input_pack4_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack4_lsx(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, elemsize, elempack, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_pack4_lsx(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n\nstatic void conv3x3s1_winograd23_transform_kernel_pack4_lsx(const Mat& kernel, Mat& kernel_tm_pack4, int inch, int outch, const Option& opt)\n{\n    // winograd23 transform kernel\n    Mat kernel_tm(4 * 4, inch, outch);\n\n    const float ktm[4][3] = {\n        {1.0f, 0.0f, 0.0f},\n        {1.0f / 2, 1.0f / 2, 1.0f / 2},\n        {1.0f / 2, -1.0f / 2, 1.0f / 2},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[4][3];\n            for (int i = 0; i < 4; i++)\n            {\n 
               tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 4; j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 4; i++)\n                {\n                    kernel_tm0[j * 4 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 16-inch-outch\n    // dst = pb-pa-inch/pa-16-outch/pb\n    kernel_tm_pack4.create(inch / 4, 16, outch / 4, (size_t)4u * 4 * 4, 4 * 4);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm_pack4.channel(q / 4);\n\n        for (int k = 0; k < 16; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p + 3 < inch; p += 4)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const float* k00 = kernel_tm.channel(q + j).row(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 2n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 1) / 2 * 2;\n    outh = (outh + 1) / 2 * 2;\n\n    w = outw + 2;\n    h = outh + 
2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 2;\n        int h_tiles = outh / 2;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 16, inch, elemsize, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd23_transform_input_pack4_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack4_lsx(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, elemsize, elempack, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd23_transform_output_pack4_lsx(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_3x3_pack8to1_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_kernel_pack8to1_int8_lsx(const Mat& kernel, Mat& kernel_tm_pack8to1, int inch, int outch, const Option& opt)\n{\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch, (size_t)2u);\n\n    const short ktm[6][3] = {\n        {6, 0, 0},\n        {-4, -4, -4},\n        {-4, 4, -4},\n        {1, 2, 4},\n        {1, -2, 4},\n        {0, 0, 6}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const signed char* kernel0 = (const signed char*)kernel + p * inch * 9 + q * 9;\n            short* kernel_tm0 = kernel_tm.channel(p).row<short>(q);\n\n            // transform kernel\n            const signed char* k0 = kernel0;\n            const signed char* k1 = kernel0 + 3;\n            const signed char* k2 = kernel0 + 6;\n\n            // h\n            short tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; j++)\n            {\n                short* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = 4b-8a-inch/8a-36-outch/4b\n    kernel_tm_pack8to1.create(8 * inch / 8, 36, outch / 4 + outch % 4, (size_t)2u * 4, 4);\n\n    int p = 0;\n    for (; p + 3 < outch; p += 
4)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n        const Mat k1 = kernel_tm.channel(p + 1);\n        const Mat k2 = kernel_tm.channel(p + 2);\n        const Mat k3 = kernel_tm.channel(p + 3);\n\n        Mat g0 = kernel_tm_pack8to1.channel(p / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            for (int q = 0; q + 7 < inch; q += 8)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0.row<const short>(q + i)[k];\n                    g00[1] = k1.row<const short>(q + i)[k];\n                    g00[2] = k2.row<const short>(q + i)[k];\n                    g00[3] = k3.row<const short>(q + i)[k];\n\n                    g00 += 4;\n                }\n            }\n        }\n    }\n    for (; p < outch; p++)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n\n        Mat g0 = kernel_tm_pack8to1.channel(p / 4 + p % 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            for (int q = 0; q + 7 < inch; q += 8)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0.row<const short>(q + i)[k];\n\n                    g00 += 1;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_pack8to1_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    //     size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w 
- bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, 2u * elempack, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd43_transform_input_pack8_int8_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack8to1_int8_lsx(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u, 1, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_int8_lsx(top_blob_tm, top_blob_bordered, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_3x3_pack8to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_kernel_pack8to4_int8_lsx(const Mat& kernel, Mat& kernel_tm_pack8, int inch, int outch, const Option& opt)\n{\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch, (size_t)2u);\n\n    const short ktm[6][3] = {\n        {6, 0, 0},\n        {-4, -4, -4},\n        {-4, 4, -4},\n        {1, 2, 4},\n        {1, -2, 4},\n        {0, 0, 6}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const signed char* kernel0 = (const signed char*)kernel + p * inch * 9 + q * 9;\n            short* kernel_tm0 = kernel_tm.channel(p).row<short>(q);\n\n            // transform kernel\n            const signed char* k0 = kernel0;\n            const signed char* k1 = kernel0 + 3;\n            const signed char* k2 = kernel0 + 6;\n\n            // h\n            short tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; j++)\n            {\n                short* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = 4b-8a-inch/8a-36-outch/4b\n    kernel_tm_pack8.create(inch / 8, 36, outch / 4, (size_t)2u * 32, 32);\n\n    int q = 0;\n    for (; q + 3 < outch; q += 4)\n    {\n        
const Mat k0 = kernel_tm.channel(q);\n        const Mat k1 = kernel_tm.channel(q + 1);\n        const Mat k2 = kernel_tm.channel(q + 2);\n        const Mat k3 = kernel_tm.channel(q + 3);\n\n        Mat kernel_tm = kernel_tm_pack8.channel(q / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = kernel_tm.row<short>(k);\n\n            for (int p = 0; p + 7 < inch; p += 8)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    const short* k00 = k0.row<const short>(p + i);\n                    const short* k10 = k1.row<const short>(p + i);\n                    const short* k20 = k2.row<const short>(p + i);\n                    const short* k30 = k3.row<const short>(p + i);\n\n                    g00[0] = k00[k];\n                    g00[1] = k10[k];\n                    g00[2] = k20[k];\n                    g00[3] = k30[k];\n\n                    g00 += 4;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_pack8to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    //     size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, 2u * elempack, elempack, opt.workspace_allocator);\n        
conv3x3s1_winograd43_transform_input_pack8_int8_lsx(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack8to4_int8_lsx(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u * 4, 4, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_pack4_int8_lsx(top_blob_tm, top_blob_bordered, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_7x7_pack1to4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv7x7s2_pack1to4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        __m128 _bias0 = bias ? (__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n        out0.fill(_bias0);\n\n        for (int q = 0; q < inch; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n            const float* r3 = img0.row(3);\n            const float* r4 = img0.row(4);\n            const float* r5 = img0.row(5);\n            const float* r6 = img0.row(6);\n\n            const float* kptr = kernel.channel(p).row(q);\n\n            int i = 0;\n\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 3 < outw; j += 4)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n                    __m128 _sum1 = (__m128)__lsx_vld(outptr0 + 4, 0);\n                    __m128 _sum2 = (__m128)__lsx_vld(outptr0 + 4 * 2, 0);\n                    __m128 _sum3 = (__m128)__lsx_vld(outptr0 + 4 * 3, 0);\n\n                    __m128 _k00 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k01 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k02 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k03 = (__m128)__lsx_vld(kptr 
+ 4 * 3, 0);\n                    __m128 _k04 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k05 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k06 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r0n = __lsx_vld(r0 + 4, 0);\n                    __m128i _r0nn = __lsx_vld(r0 + 8, 0);\n\n                    __m128 _r00 = (__m128)__lsx_vreplvei_w(_r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vreplvei_w(_r0, 1);\n                    __m128 _r02 = (__m128)__lsx_vreplvei_w(_r0, 2);\n                    __m128 _r03 = (__m128)__lsx_vreplvei_w(_r0, 3);\n                    __m128 _r04 = (__m128)__lsx_vreplvei_w(_r0n, 0);\n                    __m128 _r05 = (__m128)__lsx_vreplvei_w(_r0n, 1);\n                    __m128 _r06 = (__m128)__lsx_vreplvei_w(_r0n, 2);\n                    __m128 _r07 = (__m128)__lsx_vreplvei_w(_r0n, 3);\n                    __m128 _r08 = (__m128)__lsx_vreplvei_w(_r0nn, 0);\n                    __m128 _r09 = (__m128)__lsx_vreplvei_w(_r0nn, 1);\n                    __m128 _r0a = (__m128)__lsx_vreplvei_w(_r0nn, 2);\n                    __m128 _r0b = (__m128)__lsx_vreplvei_w(_r0nn, 3);\n                    __m128 _r0c = __lsx_vreplfr2vr_s(r0[12]);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, _r00, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k00, _r02, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k00, _r04, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k00, _r06, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k01, _r01, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k01, _r03, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k01, _r05, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k01, _r07, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k02, _r02, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k02, _r04, _sum1);\n                
    _sum2 = __lsx_vfmadd_s(_k02, _r06, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k02, _r08, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k03, _r03, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k03, _r05, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k03, _r07, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k03, _r09, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k04, _r04, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k04, _r06, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k04, _r08, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k04, _r0a, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k05, _r05, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k05, _r07, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k05, _r09, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k05, _r0b, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k06, _r06, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k06, _r08, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k06, _r0a, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k06, _r0c, _sum3);\n\n                    __m128 _k10 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k11 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k12 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k13 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k14 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k15 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k16 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r1n = __lsx_vld(r1 + 4, 0);\n                    __m128i _r1nn = __lsx_vld(r1 + 8, 0);\n\n                    __m128 _r10 = (__m128)__lsx_vreplvei_w(_r1, 0);\n                    __m128 _r11 = (__m128)__lsx_vreplvei_w(_r1, 1);\n                    __m128 
_r12 = (__m128)__lsx_vreplvei_w(_r1, 2);\n                    __m128 _r13 = (__m128)__lsx_vreplvei_w(_r1, 3);\n                    __m128 _r14 = (__m128)__lsx_vreplvei_w(_r1n, 0);\n                    __m128 _r15 = (__m128)__lsx_vreplvei_w(_r1n, 1);\n                    __m128 _r16 = (__m128)__lsx_vreplvei_w(_r1n, 2);\n                    __m128 _r17 = (__m128)__lsx_vreplvei_w(_r1n, 3);\n                    __m128 _r18 = (__m128)__lsx_vreplvei_w(_r1nn, 0);\n                    __m128 _r19 = (__m128)__lsx_vreplvei_w(_r1nn, 1);\n                    __m128 _r1a = (__m128)__lsx_vreplvei_w(_r1nn, 2);\n                    __m128 _r1b = (__m128)__lsx_vreplvei_w(_r1nn, 3);\n                    __m128 _r1c = __lsx_vreplfr2vr_s(r1[12]);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, _r10, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k10, _r12, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k10, _r14, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k10, _r16, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k11, _r11, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k11, _r13, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k11, _r15, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k11, _r17, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k12, _r12, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k12, _r14, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k12, _r16, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k12, _r18, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k13, _r13, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k13, _r15, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k13, _r17, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k13, _r19, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k14, _r14, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k14, _r16, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k14, _r18, _sum2);\n                    _sum3 = 
__lsx_vfmadd_s(_k14, _r1a, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k15, _r15, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k15, _r17, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k15, _r19, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k15, _r1b, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k16, _r16, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k16, _r18, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k16, _r1a, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k16, _r1c, _sum3);\n\n                    __m128 _k20 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k21 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k22 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k23 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k24 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k25 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k26 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128i _r2n = __lsx_vld(r2 + 4, 0);\n                    __m128i _r2nn = __lsx_vld(r2 + 8, 0);\n\n                    __m128 _r20 = (__m128)__lsx_vreplvei_w(_r2, 0);\n                    __m128 _r21 = (__m128)__lsx_vreplvei_w(_r2, 1);\n                    __m128 _r22 = (__m128)__lsx_vreplvei_w(_r2, 2);\n                    __m128 _r23 = (__m128)__lsx_vreplvei_w(_r2, 3);\n                    __m128 _r24 = (__m128)__lsx_vreplvei_w(_r2n, 0);\n                    __m128 _r25 = (__m128)__lsx_vreplvei_w(_r2n, 1);\n                    __m128 _r26 = (__m128)__lsx_vreplvei_w(_r2n, 2);\n                    __m128 _r27 = (__m128)__lsx_vreplvei_w(_r2n, 3);\n                    __m128 _r28 = (__m128)__lsx_vreplvei_w(_r2nn, 0);\n                    __m128 _r29 = (__m128)__lsx_vreplvei_w(_r2nn, 1);\n                    __m128 _r2a = 
(__m128)__lsx_vreplvei_w(_r2nn, 2);\n                    __m128 _r2b = (__m128)__lsx_vreplvei_w(_r2nn, 3);\n                    __m128 _r2c = __lsx_vreplfr2vr_s(r2[12]);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, _r20, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k20, _r22, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k20, _r24, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k20, _r26, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k21, _r21, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k21, _r23, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k21, _r25, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k21, _r27, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k22, _r22, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k22, _r24, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k22, _r26, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k22, _r28, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k23, _r23, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k23, _r25, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k23, _r27, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k23, _r29, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k24, _r24, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k24, _r26, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k24, _r28, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k24, _r2a, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k25, _r25, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k25, _r27, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k25, _r29, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k25, _r2b, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k26, _r26, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k26, _r28, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k26, _r2a, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k26, _r2c, _sum3);\n\n                    __m128 _k30 
= (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k31 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k32 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k33 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k34 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k35 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k36 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r3 = __lsx_vld(r3, 0);\n                    __m128i _r3n = __lsx_vld(r3 + 4, 0);\n                    __m128i _r3nn = __lsx_vld(r3 + 8, 0);\n\n                    __m128 _r30 = (__m128)__lsx_vreplvei_w(_r3, 0);\n                    __m128 _r31 = (__m128)__lsx_vreplvei_w(_r3, 1);\n                    __m128 _r32 = (__m128)__lsx_vreplvei_w(_r3, 2);\n                    __m128 _r33 = (__m128)__lsx_vreplvei_w(_r3, 3);\n                    __m128 _r34 = (__m128)__lsx_vreplvei_w(_r3n, 0);\n                    __m128 _r35 = (__m128)__lsx_vreplvei_w(_r3n, 1);\n                    __m128 _r36 = (__m128)__lsx_vreplvei_w(_r3n, 2);\n                    __m128 _r37 = (__m128)__lsx_vreplvei_w(_r3n, 3);\n                    __m128 _r38 = (__m128)__lsx_vreplvei_w(_r3nn, 0);\n                    __m128 _r39 = (__m128)__lsx_vreplvei_w(_r3nn, 1);\n                    __m128 _r3a = (__m128)__lsx_vreplvei_w(_r3nn, 2);\n                    __m128 _r3b = (__m128)__lsx_vreplvei_w(_r3nn, 3);\n                    __m128 _r3c = __lsx_vreplfr2vr_s(r3[12]);\n\n                    _sum0 = __lsx_vfmadd_s(_k30, _r30, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k30, _r32, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k30, _r34, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k30, _r36, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k31, _r31, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k31, _r33, _sum1);\n                    _sum2 = 
__lsx_vfmadd_s(_k31, _r35, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k31, _r37, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k32, _r32, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k32, _r34, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k32, _r36, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k32, _r38, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k33, _r33, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k33, _r35, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k33, _r37, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k33, _r39, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k34, _r34, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k34, _r36, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k34, _r38, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k34, _r3a, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k35, _r35, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k35, _r37, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k35, _r39, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k35, _r3b, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k36, _r36, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k36, _r38, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k36, _r3a, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k36, _r3c, _sum3);\n\n                    __m128 _k40 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k41 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k42 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k43 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k44 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k45 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k46 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r4 = __lsx_vld(r4, 0);\n                    __m128i _r4n = 
__lsx_vld(r4 + 4, 0);\n                    __m128i _r4nn = __lsx_vld(r4 + 8, 0);\n\n                    __m128 _r40 = (__m128)__lsx_vreplvei_w(_r4, 0);\n                    __m128 _r41 = (__m128)__lsx_vreplvei_w(_r4, 1);\n                    __m128 _r42 = (__m128)__lsx_vreplvei_w(_r4, 2);\n                    __m128 _r43 = (__m128)__lsx_vreplvei_w(_r4, 3);\n                    __m128 _r44 = (__m128)__lsx_vreplvei_w(_r4n, 0);\n                    __m128 _r45 = (__m128)__lsx_vreplvei_w(_r4n, 1);\n                    __m128 _r46 = (__m128)__lsx_vreplvei_w(_r4n, 2);\n                    __m128 _r47 = (__m128)__lsx_vreplvei_w(_r4n, 3);\n                    __m128 _r48 = (__m128)__lsx_vreplvei_w(_r4nn, 0);\n                    __m128 _r49 = (__m128)__lsx_vreplvei_w(_r4nn, 1);\n                    __m128 _r4a = (__m128)__lsx_vreplvei_w(_r4nn, 2);\n                    __m128 _r4b = (__m128)__lsx_vreplvei_w(_r4nn, 3);\n                    __m128 _r4c = __lsx_vreplfr2vr_s(r4[12]);\n\n                    _sum0 = __lsx_vfmadd_s(_k40, _r40, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k40, _r42, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k40, _r44, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k40, _r46, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k41, _r41, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k41, _r43, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k41, _r45, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k41, _r47, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k42, _r42, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k42, _r44, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k42, _r46, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k42, _r48, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k43, _r43, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k43, _r45, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k43, _r47, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k43, 
_r49, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k44, _r44, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k44, _r46, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k44, _r48, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k44, _r4a, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k45, _r45, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k45, _r47, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k45, _r49, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k45, _r4b, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k46, _r46, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k46, _r48, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k46, _r4a, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k46, _r4c, _sum3);\n\n                    __m128 _k50 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k51 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k52 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k53 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k54 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k55 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k56 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r5 = __lsx_vld(r5, 0);\n                    __m128i _r5n = __lsx_vld(r5 + 4, 0);\n                    __m128i _r5nn = __lsx_vld(r5 + 8, 0);\n\n                    __m128 _r50 = (__m128)__lsx_vreplvei_w(_r5, 0);\n                    __m128 _r51 = (__m128)__lsx_vreplvei_w(_r5, 1);\n                    __m128 _r52 = (__m128)__lsx_vreplvei_w(_r5, 2);\n                    __m128 _r53 = (__m128)__lsx_vreplvei_w(_r5, 3);\n                    __m128 _r54 = (__m128)__lsx_vreplvei_w(_r5n, 0);\n                    __m128 _r55 = (__m128)__lsx_vreplvei_w(_r5n, 1);\n                    __m128 _r56 = (__m128)__lsx_vreplvei_w(_r5n, 2);\n                    __m128 _r57 
= (__m128)__lsx_vreplvei_w(_r5n, 3);\n                    __m128 _r58 = (__m128)__lsx_vreplvei_w(_r5nn, 0);\n                    __m128 _r59 = (__m128)__lsx_vreplvei_w(_r5nn, 1);\n                    __m128 _r5a = (__m128)__lsx_vreplvei_w(_r5nn, 2);\n                    __m128 _r5b = (__m128)__lsx_vreplvei_w(_r5nn, 3);\n                    __m128 _r5c = __lsx_vreplfr2vr_s(r5[12]);\n\n                    _sum0 = __lsx_vfmadd_s(_k50, _r50, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k50, _r52, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k50, _r54, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k50, _r56, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k51, _r51, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k51, _r53, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k51, _r55, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k51, _r57, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k52, _r52, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k52, _r54, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k52, _r56, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k52, _r58, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k53, _r53, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k53, _r55, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k53, _r57, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k53, _r59, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k54, _r54, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k54, _r56, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k54, _r58, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k54, _r5a, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k55, _r55, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k55, _r57, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k55, _r59, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k55, _r5b, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k56, _r56, _sum0);\n             
       _sum1 = __lsx_vfmadd_s(_k56, _r58, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k56, _r5a, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k56, _r5c, _sum3);\n\n                    __m128 _k60 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k61 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k62 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k63 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k64 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k65 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k66 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr -= 4 * 42;\n\n                    __m128i _r6 = __lsx_vld(r6, 0);\n                    __m128i _r6n = __lsx_vld(r6 + 4, 0);\n                    __m128i _r6nn = __lsx_vld(r6 + 8, 0);\n\n                    __m128 _r60 = (__m128)__lsx_vreplvei_w(_r6, 0);\n                    __m128 _r61 = (__m128)__lsx_vreplvei_w(_r6, 1);\n                    __m128 _r62 = (__m128)__lsx_vreplvei_w(_r6, 2);\n                    __m128 _r63 = (__m128)__lsx_vreplvei_w(_r6, 3);\n                    __m128 _r64 = (__m128)__lsx_vreplvei_w(_r6n, 0);\n                    __m128 _r65 = (__m128)__lsx_vreplvei_w(_r6n, 1);\n                    __m128 _r66 = (__m128)__lsx_vreplvei_w(_r6n, 2);\n                    __m128 _r67 = (__m128)__lsx_vreplvei_w(_r6n, 3);\n                    __m128 _r68 = (__m128)__lsx_vreplvei_w(_r6nn, 0);\n                    __m128 _r69 = (__m128)__lsx_vreplvei_w(_r6nn, 1);\n                    __m128 _r6a = (__m128)__lsx_vreplvei_w(_r6nn, 2);\n                    __m128 _r6b = (__m128)__lsx_vreplvei_w(_r6nn, 3);\n                    __m128 _r6c = __lsx_vreplfr2vr_s(r6[12]);\n\n                    _sum0 = __lsx_vfmadd_s(_k60, _r60, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k60, _r62, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k60, _r64, _sum2);\n                    
_sum3 = __lsx_vfmadd_s(_k60, _r66, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k61, _r61, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k61, _r63, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k61, _r65, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k61, _r67, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k62, _r62, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k62, _r64, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k62, _r66, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k62, _r68, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k63, _r63, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k63, _r65, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k63, _r67, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k63, _r69, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k64, _r64, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k64, _r66, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k64, _r68, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k64, _r6a, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k65, _r65, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k65, _r67, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k65, _r69, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k65, _r6b, _sum3);\n                    _sum0 = __lsx_vfmadd_s(_k66, _r66, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_k66, _r68, _sum1);\n                    _sum2 = __lsx_vfmadd_s(_k66, _r6a, _sum2);\n                    _sum3 = __lsx_vfmadd_s(_k66, _r6c, _sum3);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n                    __lsx_vst(_sum1, outptr0 + 4, 0);\n                    __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n                    __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n\n                    outptr0 += 4 * 4;\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                    r3 += 8;\n                    r4 += 8;\n 
                   r5 += 8;\n                    r6 += 8;\n                }\n                for (; j < outw; j++)\n                {\n                    __m128 _sum0 = (__m128)__lsx_vld(outptr0, 0);\n\n                    __m128 _k00 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k01 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k02 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k03 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k04 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k05 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k06 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r0n = __lsx_vld(r0 + 4, 0);\n\n                    _sum0 = __lsx_vfmadd_s(_k00, (__m128)__lsx_vreplvei_w(_r0, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k01, (__m128)__lsx_vreplvei_w(_r0, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k02, (__m128)__lsx_vreplvei_w(_r0, 2), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k03, (__m128)__lsx_vreplvei_w(_r0, 3), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k04, (__m128)__lsx_vreplvei_w(_r0n, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k05, (__m128)__lsx_vreplvei_w(_r0n, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k06, (__m128)__lsx_vreplvei_w(_r0n, 2), _sum0);\n\n                    __m128 _k10 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k11 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k12 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k13 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k14 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k15 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k16 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    
kptr += 4 * 7;\n\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r1n = __lsx_vld(r1 + 4, 0);\n\n                    _sum0 = __lsx_vfmadd_s(_k10, (__m128)__lsx_vreplvei_w(_r1, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k11, (__m128)__lsx_vreplvei_w(_r1, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k12, (__m128)__lsx_vreplvei_w(_r1, 2), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k13, (__m128)__lsx_vreplvei_w(_r1, 3), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k14, (__m128)__lsx_vreplvei_w(_r1n, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k15, (__m128)__lsx_vreplvei_w(_r1n, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k16, (__m128)__lsx_vreplvei_w(_r1n, 2), _sum0);\n\n                    __m128 _k20 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k21 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k22 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k23 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k24 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k25 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k26 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128i _r2n = __lsx_vld(r2 + 4, 0);\n\n                    _sum0 = __lsx_vfmadd_s(_k20, (__m128)__lsx_vreplvei_w(_r2, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k21, (__m128)__lsx_vreplvei_w(_r2, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k22, (__m128)__lsx_vreplvei_w(_r2, 2), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k23, (__m128)__lsx_vreplvei_w(_r2, 3), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k24, (__m128)__lsx_vreplvei_w(_r2n, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k25, (__m128)__lsx_vreplvei_w(_r2n, 1), _sum0);\n                    _sum0 = 
__lsx_vfmadd_s(_k26, (__m128)__lsx_vreplvei_w(_r2n, 2), _sum0);\n\n                    __m128 _k30 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k31 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k32 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k33 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k34 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k35 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k36 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r3 = __lsx_vld(r3, 0);\n                    __m128i _r3n = __lsx_vld(r3 + 4, 0);\n\n                    _sum0 = __lsx_vfmadd_s(_k30, (__m128)__lsx_vreplvei_w(_r3, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k31, (__m128)__lsx_vreplvei_w(_r3, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k32, (__m128)__lsx_vreplvei_w(_r3, 2), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k33, (__m128)__lsx_vreplvei_w(_r3, 3), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k34, (__m128)__lsx_vreplvei_w(_r3n, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k35, (__m128)__lsx_vreplvei_w(_r3n, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k36, (__m128)__lsx_vreplvei_w(_r3n, 2), _sum0);\n\n                    __m128 _k40 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k41 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k42 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k43 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k44 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k45 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k46 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r4 = __lsx_vld(r4, 0);\n                    __m128i _r4n = __lsx_vld(r4 + 4, 0);\n\n        
            _sum0 = __lsx_vfmadd_s(_k40, (__m128)__lsx_vreplvei_w(_r4, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k41, (__m128)__lsx_vreplvei_w(_r4, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k42, (__m128)__lsx_vreplvei_w(_r4, 2), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k43, (__m128)__lsx_vreplvei_w(_r4, 3), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k44, (__m128)__lsx_vreplvei_w(_r4n, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k45, (__m128)__lsx_vreplvei_w(_r4n, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k46, (__m128)__lsx_vreplvei_w(_r4n, 2), _sum0);\n\n                    __m128 _k50 = (__m128)__lsx_vld(kptr, 0);\n                    __m128 _k51 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k52 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k53 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k54 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k55 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k56 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr += 4 * 7;\n\n                    __m128i _r5 = __lsx_vld(r5, 0);\n                    __m128i _r5n = __lsx_vld(r5 + 4, 0);\n\n                    _sum0 = __lsx_vfmadd_s(_k50, (__m128)__lsx_vreplvei_w(_r5, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k51, (__m128)__lsx_vreplvei_w(_r5, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k52, (__m128)__lsx_vreplvei_w(_r5, 2), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k53, (__m128)__lsx_vreplvei_w(_r5, 3), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k54, (__m128)__lsx_vreplvei_w(_r5n, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k55, (__m128)__lsx_vreplvei_w(_r5n, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k56, (__m128)__lsx_vreplvei_w(_r5n, 2), _sum0);\n\n                    __m128 _k60 = (__m128)__lsx_vld(kptr, 0);\n         
           __m128 _k61 = (__m128)__lsx_vld(kptr + 4, 0);\n                    __m128 _k62 = (__m128)__lsx_vld(kptr + 4 * 2, 0);\n                    __m128 _k63 = (__m128)__lsx_vld(kptr + 4 * 3, 0);\n                    __m128 _k64 = (__m128)__lsx_vld(kptr + 4 * 4, 0);\n                    __m128 _k65 = (__m128)__lsx_vld(kptr + 4 * 5, 0);\n                    __m128 _k66 = (__m128)__lsx_vld(kptr + 4 * 6, 0);\n\n                    kptr -= 4 * 42;\n\n                    __m128i _r6 = __lsx_vld(r6, 0);\n                    __m128i _r6n = __lsx_vld(r6 + 4, 0);\n\n                    _sum0 = __lsx_vfmadd_s(_k60, (__m128)__lsx_vreplvei_w(_r6, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k61, (__m128)__lsx_vreplvei_w(_r6, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k62, (__m128)__lsx_vreplvei_w(_r6, 2), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k63, (__m128)__lsx_vreplvei_w(_r6, 3), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k64, (__m128)__lsx_vreplvei_w(_r6n, 0), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k65, (__m128)__lsx_vreplvei_w(_r6n, 1), _sum0);\n                    _sum0 = __lsx_vfmadd_s(_k66, (__m128)__lsx_vreplvei_w(_r6n, 2), _sum0);\n\n                    __lsx_vst(_sum0, outptr0, 0);\n\n                    outptr0 += 4;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                    r3 += 2;\n                    r4 += 2;\n                    r5 += 2;\n                    r6 += 2;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n                r3 += tailstep;\n                r4 += tailstep;\n                r5 += tailstep;\n                r6 += tailstep;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_int8(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_int8, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                int sum = 0;\n\n                //                 const signed char* kptr = weight_data_int8.channel(p);\n                const signed char* kptr = (const signed char*)weight_data_int8 + maxk * channels * p;\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const signed char* sptr = m.row<signed char>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        signed char val = sptr[space_ofs[k]];\n                        signed char w = kptr[k];\n                        sum += 
val * w;\n                    }\n\n                    kptr += maxk;\n                }\n\n                outptr[j] = sum;\n            }\n\n            outptr += outw;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution_loongarch.h\"\n\n#include \"benchmark.h\"\n#include \"cpu.h\"\n#include \"layer_type.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_activation.h\"\n#include \"loongarch_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"convolution_sgemm.h\"\n#include \"convolution_winograd_transform.h\"\n#include \"convolution_winograd_dot.h\"\n#include \"convolution_1x1.h\"\n#include \"convolution_3x3.h\"\n\n#if NCNN_INT8\n#include \"convolution_sgemm_int8.h\"\n#include \"convolution_winograd_transform_int8.h\"\n#include \"convolution_winograd_dot_int8.h\"\n#include \"convolution_1x1_int8.h\"\n#include \"convolution_3x3_int8.h\"\n#include \"convolution_int8.h\"\n#endif // NCNN_INT8\n\n#if __loongarch_sx\n#include \"convolution_pack4.h\"\n#include \"convolution_pack1to4.h\"\n#include \"convolution_pack4to1.h\"\n\n#include \"convolution_sgemm_pack4.h\"\n#include \"convolution_sgemm_pack4to1.h\"\n#include \"convolution_winograd_transform_pack4.h\"\n#include \"convolution_winograd_dot_pack4.h\"\n#include \"convolution_1x1_pack4.h\"\n#include \"convolution_1x1_pack4to1.h\"\n#include \"convolution_3x3_pack4.h\"\n#include \"convolution_3x3_pack1to4.h\"\n#include \"convolution_7x7_pack1to4.h\"\n\n#if NCNN_INT8\n#include \"convolution_pack8to4_int8.h\"\n#include \"convolution_pack1to4_int8.h\"\n#include \"convolution_pack8to1_int8.h\"\n#include \"convolution_sgemm_pack8to4_int8.h\"\n#include \"convolution_sgemm_pack1to4_int8.h\"\n#include \"convolution_sgemm_pack8to1_int8.h\"\n#include \"convolution_winograd_transform_pack4_int8.h\"\n#include \"convolution_winograd_transform_pack8_int8.h\"\n#include \"convolution_winograd_dot_pack8to4_int8.h\"\n#include \"convolution_winograd_dot_pack8to1_int8.h\"\n#include \"convolution_1x1_pack8to4_int8.h\"\n#include 
\"convolution_1x1_pack1to4_int8.h\"\n#include \"convolution_1x1_pack8to1_int8.h\"\n#include \"convolution_3x3_pack8to4_int8.h\"\n#include \"convolution_3x3_pack8to1_int8.h\"\n#endif // NCNN_INT8\n#endif // __loongarch_sx\n\nConvolution_loongarch::Convolution_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n\n    activation = 0;\n}\n\nstatic void convolution_transform_kernel_packed_lsx(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, int kernel_w, int kernel_h, int elempack, int out_elempack)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)4u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            float* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n\nint Convolution_loongarch::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    activation = create_activation_layer(activation_type, activation_params, opt);\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n      
  return create_pipeline_int8_loongarch(opt);\n    }\n#endif\n\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n\n#if __loongarch_sx\n    // pack4\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd63_convolution && num_input >= 8 && num_output >= 8 && num_input <= 64 && num_output <= 64) || (!opt.use_winograd43_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd63_transform_kernel_pack4_lsx(weight_data, weight_winograd63_data, num_input, num_output, opt);\n            else if ((opt.use_winograd43_convolution && num_input >= 8 && num_output >= 8) || (!opt.use_winograd63_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd43_transform_kernel_pack4_lsx(weight_data, weight_winograd43_data, num_input, num_output, opt);\n            else // if (opt.use_winograd23_convolution)\n                conv3x3s1_winograd23_transform_kernel_pack4_lsx(weight_data, weight_winograd23_data, num_input, num_output, opt);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_lsx(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n\n    // pack1ton\n    if (elempack == 1 && out_elempack == 4)\n    {\n        convolution_transform_kernel_packed_lsx(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n    }\n\n    
// pack4to1\n    if (elempack == 4 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack4to1_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack4to1_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack4to1_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_lsx(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n#endif // __loongarch_sx\n\n    // pack1\n    if (elempack == 1 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd43_convolution && num_input >= 16 && num_output >= 16) || !opt.use_winograd23_convolution)\n            {\n                conv3x3s1_winograd43_transform_kernel_lsx(weight_data, weight_winograd43_data, num_input, num_output, opt);\n            }\n            else if (opt.use_winograd23_convolution)\n            {\n             
   conv3x3s1_winograd23_transform_kernel_lsx(weight_data, weight_winograd23_data, num_input, num_output, opt);\n            }\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            weight_data_tm = weight_data;\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_loongarch::destroy_pipeline(const Option& opt)\n{\n    if (activation)\n    {\n        activation->destroy_pipeline(opt);\n        delete activation;\n        activation = 0;\n    }\n\n    return 0;\n}\n\nint Convolution_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && int8_scale_term)\n    {\n        return forward_int8_loongarch(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    // flattened blob, implement as InnerProduct\n    if (bottom_blob.dims == 1 && kernel_w == 1 && kernel_h == 1)\n    {\n        Mat bottom_blob_3d;\n        if (bottom_blob.elemsize % 16 == 0)\n        {\n            bottom_blob_3d = bottom_blob;\n            bottom_blob_3d.dims = 3;\n            bottom_blob_3d.w = 1;\n            bottom_blob_3d.h = 1;\n            bottom_blob_3d.c = bottom_blob.w;\n            bottom_blob_3d.cstep = 1;\n        }\n        else\n        {\n            bottom_blob_3d = bottom_blob.reshape(1, 1, bottom_blob.w, opt.workspace_allocator);\n        }\n\n        Mat top_blob_3d;\n        int ret = forward(bottom_blob_3d, top_blob_3d, opt);\n        if (ret != 0)\n            return ret;\n\n        if (top_blob_3d.elemsize % 16 == 0)\n        {\n            top_blob = top_blob_3d;\n            top_blob.dims = 1;\n            top_blob.w = top_blob_3d.c;\n            top_blob.h = 1;\n            top_blob.c = 1;\n            top_blob.cstep = top_blob_3d.c;\n        }\n  
      else\n        {\n            top_blob = top_blob_3d.reshape(top_blob_3d.c, opt.blob_allocator);\n        }\n\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Convolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const int num_input = channels * elempack;\n\n#if __loongarch_sx\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd63_convolution && num_input >= 8 && num_output >= 8 && num_input <= 64 && num_output <= 64) || (!opt.use_winograd43_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd63_pack4_lsx(bottom_blob_bordered, top_blob, weight_winograd63_data, bias_data, opt);\n            else if ((opt.use_winograd43_convolution && num_input >= 8 && num_output >= 8) || (!opt.use_winograd63_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd43_pack4_lsx(bottom_blob_bordered, top_blob, weight_winograd43_data, bias_data, opt);\n            else // 
if (opt.use_winograd23_convolution)\n                conv3x3s1_winograd23_pack4_lsx(bottom_blob_bordered, top_blob, weight_winograd23_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_pack1to4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack1to4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv7x7s2_pack1to4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                
activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_pack1to4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack4to1_lsx(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack4to1_lsx(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_pack4to1_lsx(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_pack4to1_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_lsx(bottom_blob_bordered, 
top_blob, weight_sgemm_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd43_convolution && num_input >= 16 && num_output >= 16) || !opt.use_winograd23_convolution)\n            {\n                conv3x3s1_winograd43_lsx(bottom_blob_bordered, top_blob, weight_winograd43_data, bias_data, opt);\n            }\n            else if (opt.use_winograd23_convolution)\n            {\n                conv3x3s1_winograd23_lsx(bottom_blob_bordered, top_blob, weight_winograd23_data, bias_data, opt);\n            }\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_lsx(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            const int maxk = kernel_w * kernel_h;\n\n            // kernel offsets\n            std::vector<int> _space_ofs(maxk);\n            int* space_ofs = &_space_ofs[0];\n            {\n                int p1 = 0;\n                int p2 = 0;\n                int gap = w * dilation_h - kernel_w * dilation_w;\n                for (int i = 0; i < kernel_h; i++)\n                {\n                    for (int j = 0; j < kernel_w; j++)\n                    {\n                        space_ofs[p1] = p2;\n                        p1++;\n                        p2 += dilation_w;\n        
            }\n                    p2 += gap;\n                }\n            }\n\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output; p++)\n            {\n                float* outptr = top_blob.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const float* kptr = (const float*)weight_data_tm + maxk * channels * p;\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob_bordered.channel(q);\n                            const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float val = sptr[space_ofs[k]];\n                                float wt = kptr[k];\n                                sum += val * wt;\n                            }\n\n                            kptr += maxk;\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Convolution_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    
const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(8, int8_scale_term);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n#if NCNN_INT8\nstatic void convolution_transform_kernel_packed_int8_lsx(const Mat& weight_data, Mat& weight_data_tm, int num_input, int 
num_output, int kernel_w, int kernel_h, int elempack, int out_elempack)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pa-pb-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            signed char* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < out_elempack; i++)\n                    {\n                        for (int j = 0; j < elempack; j++)\n                        {\n                            const signed char* k00 = weight_data_r2.channel(q + i).row<const signed char>(p + j);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n\nint Convolution_loongarch::create_pipeline_int8_loongarch(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 8 == 0 ? 8 : 1;\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif // __loongarch_sx\n\n#if __loongarch_sx\n    if (elempack == 8 && out_elempack == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to4_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to4_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_transform_kernel_pack8to4_int8_lsx(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to4_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_int8_lsx(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack1to4_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            
convolution_im2col_sgemm_transform_kernel_pack1to4_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack1to4_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_int8_lsx(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n\n    if (elempack == 8 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to1_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to1_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_transform_kernel_pack8to1_int8_lsx(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to1_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_int8_lsx(weight_data, weight_data_tm, 
num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_transform_kernel_int8_lsx(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_int8_lsx(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            weight_data_tm = weight_data;\n        }\n    }\n\n    scale_in_data.create(num_output);\n    for (int p = 0; p < num_output; p++)\n    {\n        // requantize and relu\n        float scale_in;\n        if (weight_data_int8_scales[p] == 0)\n            scale_in = 0;\n        else\n            scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n        scale_in_data[p] = scale_in;\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_loongarch::forward_int8_loongarch(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n    Mat bottom_blob_int8 = 
bottom_blob;\n    if (elembits != 8)\n    {\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_q);\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob_int8, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    int w = bottom_blob_bordered.w;\n    int h = bottom_blob_bordered.h;\n    int channels = bottom_blob_bordered.c;\n    int elempack = bottom_blob_bordered.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    bool use_int8_requantize = int8_scale_term > 100;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        if (use_int8_requantize)\n            out_elempack = num_output % 8 == 0 ? 8 : 1;\n        else\n            out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __loongarch_sx\n    size_t out_elemsize = use_int8_requantize ? 1u * out_elempack : 4u * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const int num_input = channels * elempack;\n\n    int out_elempack_int32 = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack_int32 = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif // __loongarch_sx\n\n    Mat top_blob_int32;\n    top_blob_int32.create(outw, outh, num_output / out_elempack_int32, (size_t)(4u * out_elempack_int32), out_elempack_int32, opt.workspace_allocator);\n    if (top_blob_int32.empty())\n        return -100;\n\n#if __loongarch_sx\n    if (elempack == 8 && out_elempack_int32 == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack8to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack8to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_pack8to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_winograd43_data, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_pack8to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            convolution_pack8to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack_int32 == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack1to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && 
kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack1to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_pack1to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            convolution_pack1to4_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n    }\n\n    if (elempack == 8 && out_elempack_int32 == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack8to1_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack8to1_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_pack8to1_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_winograd43_data, opt);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_pack8to1_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            
convolution_pack8to1_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (elempack == 1 && out_elempack_int32 == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_winograd43_data, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_int8_lsx(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            convolution_int8(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n    }\n\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        // NCNN_LOGE(\"top_blob_int32  %d  %d\", top_blob_int32.c, top_blob_int32.elempack);\n        if (use_int8_requantize)\n        {\n            // TODO implement winograd sgemm packed int8 pack1 output\n            if (top_blob_int32.elempack == 4 && top_blob_int32.c % 2 == 1)\n            {\n                Mat tmp;\n                convert_packing(top_blob_int32, tmp, 1, opt);\n                top_blob_int32 = tmp;\n      
      }\n            if (top_blob_int32.elempack == 4 && top_blob_int32.c % 2 == 0)\n            {\n                Mat tmp;\n                convert_packing(top_blob_int32, tmp, 8, opt);\n                top_blob_int32 = tmp;\n            }\n        }\n    }\n#endif\n\n    if (use_int8_requantize)\n    {\n        requantize_from_int32_to_int8(top_blob_int32, top_blob, scale_in_data, top_blob_int8_scales, bias_data, activation_type, activation_params, opt);\n    }\n    else\n    {\n        dequantize_from_int32(top_blob_int32, top_blob, scale_in_data, bias_data, opt);\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/convolution_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION_LOONGARCH_H\n#define LAYER_CONVOLUTION_LOONGARCH_H\n\n#include \"convolution.h\"\n\nnamespace ncnn {\n\nclass Convolution_loongarch : public Convolution\n{\npublic:\n    Convolution_loongarch();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_INT8\n    int create_pipeline_int8_loongarch(const Option& opt);\n    int forward_int8_loongarch(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Layer* activation;\n\n    Mat weight_data_tm;\n    Mat weight_sgemm_data;\n    Mat weight_winograd23_data;\n    Mat weight_winograd43_data;\n    Mat weight_winograd63_data;\n\n#if NCNN_INT8\n    Mat scale_in_data;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/convolution_pack1to4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack1to4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_pack1ton, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const float* bias_data_ptr = bias_data;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                if (bias_data_ptr)\n                {\n                    _sum = (__m128)__lsx_vld(bias_data_ptr + p * 4, 0);\n                }\n\n                const float* kptr = (const float*)weight_data_pack1ton + maxk * channels * p * 4;\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                    
for (int k = 0; k < maxk; k++) // 29.23\n                    {\n                        __m128 _val = __lsx_vreplfr2vr_s(sptr[space_ofs[k]]);\n                        __m128 _w = (__m128)__lsx_vld(kptr, 0);\n                        _sum = __lsx_vfmadd_s(_w, _val, _sum);\n\n                        kptr += 4;\n                    }\n                }\n\n                _sum = activation_ps(_sum, activation_type, activation_params);\n\n                __lsx_vst(_sum, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_pack1to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack1to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_int8, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128i _sum = __lsx_vreplgr2vr_w(0);\n\n                const signed char* kptr = weight_data_int8.channel(p);\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const signed char* sptr = m.row<const signed char>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        __m128i _val = __lsx_vreplgr2vr_h((short)sptr[space_ofs[k]]);\n\n                        __m128i _w = __lsx_vld(kptr, 0);\n                        __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), 
_w);\n\n                        __m128i _s0 = __lsx_vmul_h(_val, _w16);\n                        __m128i _s032 = __lsx_vilvl_h(__lsx_vslti_h(_s0, 0), _s0);\n\n                        _sum = __lsx_vadd_w(_sum, _s032);\n\n                        kptr += 4;\n                    }\n                }\n\n                __lsx_vst(_sum, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_pack4, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const float* bias_data_ptr = bias_data;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                if (bias_data_ptr)\n                {\n                    _sum = (__m128)__lsx_vld(bias_data_ptr + p * 4, 0);\n                }\n\n                const float* kptr = (const float*)weight_data_pack4.channel(p);\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                    for (int k = 0; k < maxk; k++) // 
29.23\n                    {\n                        const float* slptr = sptr + space_ofs[k] * 4;\n\n                        __m128 _val0 = __lsx_vreplfr2vr_s(slptr[0]);\n                        __m128 _val1 = __lsx_vreplfr2vr_s(slptr[1]);\n                        __m128 _val2 = __lsx_vreplfr2vr_s(slptr[2]);\n                        __m128 _val3 = __lsx_vreplfr2vr_s(slptr[3]);\n\n                        __m128 _w0 = (__m128)__lsx_vld(kptr, 0);\n                        __m128 _w1 = (__m128)__lsx_vld(kptr + 4, 0);\n                        __m128 _w2 = (__m128)__lsx_vld(kptr + 8, 0);\n                        __m128 _w3 = (__m128)__lsx_vld(kptr + 12, 0);\n\n                        _sum = __lsx_vfmadd_s(_w0, _val0, _sum);\n                        _sum = __lsx_vfmadd_s(_w1, _val1, _sum);\n                        _sum = __lsx_vfmadd_s(_w2, _val2, _sum);\n                        _sum = __lsx_vfmadd_s(_w3, _val3, _sum);\n\n                        kptr += 16;\n                    }\n                }\n\n                _sum = activation_ps(_sum, activation_type, activation_params);\n\n                __lsx_vst(_sum, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_pack4to1.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack4to1_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_pack4to1, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const float* bias_data_ptr = bias_data;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum = bias_data_ptr[p];\n                }\n\n                __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                const float* kptr = (const float*)weight_data_pack4to1.channel(p);\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                    for 
(int k = 0; k < maxk; k++)\n                    {\n                        __m128 _val = (__m128)__lsx_vld(sptr + space_ofs[k] * 4, 0);\n                        __m128 _w = (__m128)__lsx_vld(kptr, 0);\n                        _sum = __lsx_vfmadd_s(_w, _val, _sum);\n\n                        kptr += 4;\n                    }\n                }\n\n                sum += __lsx_reduce_fadd_s(_sum);\n\n                sum = activation_ss(sum, activation_type, activation_params);\n\n                outptr[j] = sum;\n            }\n\n            outptr += outw;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_pack8to1_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack8to1_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_int8, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128i _sum = __lsx_vreplgr2vr_w(0);\n\n                const signed char* kptr = weight_data_int8.channel(p);\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const signed char* sptr = m.row<const signed char>(i * stride_h) + j * stride_w * 8;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        __m128i _val = __lsx_vld(sptr + space_ofs[k] * 8, 0);\n                        __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                        __m128i _w = __lsx_vld(kptr, 
0);\n                        __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                        __m128i _s0 = __lsx_vmul_h(_val16, _w16);\n\n                        _sum = __lsx_vadd_w(_sum, __lsx_vhaddw_w_h(_s0, _s0));\n\n                        kptr += 8;\n                    }\n                }\n\n                outptr[j] = __lsx_reduce_add_w(_sum);\n            }\n\n            outptr += outw;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_pack8to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack8to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_int8, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n                const signed char* kptr = weight_data_int8.channel(p);\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const signed char* sptr = m.row<signed char>(i * stride_h) + j * stride_w * 8;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        __m128i _val = __lsx_vld(sptr + 
space_ofs[k] * 8, 0);\n                        __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                        __m128i _w01 = __lsx_vld(kptr, 0);\n                        __m128i _w23 = __lsx_vld(kptr + 16, 0);\n                        __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                        __m128i _extw23 = __lsx_vslti_b(_w23, 0);\n                        __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                        __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n                        __m128i _w2 = __lsx_vilvl_b(_extw23, _w23);\n                        __m128i _w3 = __lsx_vilvh_b(_extw23, _w23);\n\n                        __m128i _s0 = __lsx_vmul_h(_val16, _w0);\n                        __m128i _s1 = __lsx_vmul_h(_val16, _w1);\n                        __m128i _s2 = __lsx_vmul_h(_val16, _w2);\n                        __m128i _s3 = __lsx_vmul_h(_val16, _w3);\n\n                        _sum0 = __lsx_vadd_w(_sum0, __lsx_vhaddw_w_h(_s0, _s0));\n                        _sum1 = __lsx_vadd_w(_sum1, __lsx_vhaddw_w_h(_s1, _s1));\n                        _sum2 = __lsx_vadd_w(_sum2, __lsx_vhaddw_w_h(_s2, _s2));\n                        _sum3 = __lsx_vadd_w(_sum3, __lsx_vhaddw_w_h(_s3, _s3));\n\n                        kptr += 32;\n                    }\n                }\n\n                // transpose 4x4\n                {\n                    __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __lsx_vilvl_w(_sum1, _sum0);\n                    _tmp1 = __lsx_vilvl_w(_sum3, _sum2);\n                    _tmp2 = __lsx_vilvh_w(_sum1, _sum0);\n                    _tmp3 = __lsx_vilvh_w(_sum3, _sum2);\n                    _sum0 = __lsx_vilvl_d(_tmp1, _tmp0);\n                    _sum1 = __lsx_vilvh_d(_tmp1, _tmp0);\n                    _sum2 = __lsx_vilvl_d(_tmp3, _tmp2);\n                    _sum3 = __lsx_vilvh_d(_tmp3, _tmp2);\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n                _sum2 = 
__lsx_vadd_w(_sum2, _sum3);\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum2);\n\n                __lsx_vst(_sum0, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_sgemm.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_lsx(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 4u, 1, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    // permute\n    Mat tmp;\n    if (size >= 4)\n        tmp.create(4 * maxk, inch, size / 4 + size % 4, 4u, 1, opt.workspace_allocator);\n    else\n        tmp.create(maxk, inch, size, 4u, 1, opt.workspace_allocator);\n    {\n        int nn_size = size / 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = ii * 4;\n\n            float* tmpptr = tmp.channel(i / 4);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n#if __loongarch_sx\n                    __lsx_vst(__lsx_vld(img0, 0), tmpptr, 0);\n#else\n                    tmpptr[0] = img0[0];\n                    tmpptr[1] = img0[1];\n                    tmpptr[2] = img0[2];\n                    tmpptr[3] = img0[3];\n#endif\n                    img0 += size;\n                    tmpptr += 4;\n                }\n            }\n        }\n\n        int remain_size_start = nn_size * 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            float* tmpptr = tmp.channel(i / 4 + i % 4);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < 
maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    img0 += size;\n                    tmpptr += 1;\n                }\n            }\n        }\n    }\n\n#if __loongarch_sx\n    int nn_outch = outch >> 3;\n    int remain_outch_start = nn_outch << 3;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 8;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n        float* outptr2 = top_blob.channel(p + 2);\n        float* outptr3 = top_blob.channel(p + 3);\n        float* outptr4 = top_blob.channel(p + 4);\n        float* outptr5 = top_blob.channel(p + 5);\n        float* outptr6 = top_blob.channel(p + 6);\n        float* outptr7 = top_blob.channel(p + 7);\n\n        const float zeros[8] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};\n        const float* biasptr = bias ? bias + p : zeros;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n            const float* kptr = kernel.channel(p / 8);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128 _sum0 = __lsx_vreplfr2vr_s(biasptr[0]);\n            __m128 _sum1 = __lsx_vreplfr2vr_s(biasptr[1]);\n            __m128 _sum2 = __lsx_vreplfr2vr_s(biasptr[2]);\n            __m128 _sum3 = __lsx_vreplfr2vr_s(biasptr[3]);\n            __m128 _sum4 = __lsx_vreplfr2vr_s(biasptr[4]);\n            __m128 _sum5 = __lsx_vreplfr2vr_s(biasptr[5]);\n            __m128 _sum6 = __lsx_vreplfr2vr_s(biasptr[6]);\n            __m128 _sum7 = __lsx_vreplfr2vr_s(biasptr[7]);\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 32);\n                __m128 _val = (__m128)__lsx_vld(tmpptr, 0);\n                __m128i _w0123 = __lsx_vld(kptr, 0);\n                __m128i _w4567 = 
__lsx_vld(kptr + 4, 0);\n                _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val, _sum0);\n                _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val, _sum1);\n                _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val, _sum2);\n                _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val, _sum3);\n                _sum4 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 0), _val, _sum4);\n                _sum5 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 1), _val, _sum5);\n                _sum6 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 2), _val, _sum6);\n                _sum7 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 3), _val, _sum7);\n\n                tmpptr += 4;\n                kptr += 8;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr1, 0);\n            __lsx_vst(_sum2, outptr2, 0);\n            __lsx_vst(_sum3, outptr3, 0);\n            __lsx_vst(_sum4, outptr4, 0);\n            __lsx_vst(_sum5, outptr5, 0);\n            __lsx_vst(_sum6, outptr6, 0);\n            __lsx_vst(_sum7, outptr7, 0);\n\n            outptr0 += 4;\n            outptr1 += 4;\n            outptr2 += 4;\n            outptr3 += 4;\n            outptr4 += 4;\n            outptr5 += 4;\n            outptr6 += 4;\n            outptr7 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n            const float* kptr = kernel.channel(p / 8);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = biasptr[0];\n            float sum1 = biasptr[1];\n            float sum2 = biasptr[2];\n            float sum3 = biasptr[3];\n            float sum4 = biasptr[4];\n            float sum5 = biasptr[5];\n            float sum6 = biasptr[6];\n            float sum7 = biasptr[7];\n\n            for (int q = 0; q < nn; q++)\n            {\n               
 sum0 += tmpptr[0] * kptr[0];\n                sum1 += tmpptr[0] * kptr[1];\n                sum2 += tmpptr[0] * kptr[2];\n                sum3 += tmpptr[0] * kptr[3];\n                sum4 += tmpptr[0] * kptr[4];\n                sum5 += tmpptr[0] * kptr[5];\n                sum6 += tmpptr[0] * kptr[6];\n                sum7 += tmpptr[0] * kptr[7];\n                tmpptr++;\n                kptr += 8;\n            }\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n            outptr2[0] = sum2;\n            outptr3[0] = sum3;\n            outptr4[0] = sum4;\n            outptr5[0] = sum5;\n            outptr6[0] = sum6;\n            outptr7[0] = sum7;\n\n            outptr0++;\n            outptr1++;\n            outptr2++;\n            outptr3++;\n            outptr4++;\n            outptr5++;\n            outptr6++;\n            outptr7++;\n        }\n    }\n\n    nn_outch = (outch - remain_outch_start) >> 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = remain_outch_start + pp * 4;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n        float* outptr2 = top_blob.channel(p + 2);\n        float* outptr3 = top_blob.channel(p + 3);\n\n        const float zeros[4] = {0.f, 0.f, 0.f, 0.f};\n        const float* biasptr = bias ? 
bias + p : zeros;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128 _sum0 = __lsx_vreplfr2vr_s(biasptr[0]);\n            __m128 _sum1 = __lsx_vreplfr2vr_s(biasptr[1]);\n            __m128 _sum2 = __lsx_vreplfr2vr_s(biasptr[2]);\n            __m128 _sum3 = __lsx_vreplfr2vr_s(biasptr[3]);\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 16);\n                __m128 _val = (__m128)__lsx_vld(tmpptr, 0);\n                __m128i _w0123 = __lsx_vld(kptr, 0);\n                _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val, _sum0);\n                _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val, _sum1);\n                _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val, _sum2);\n                _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val, _sum3);\n\n                tmpptr += 4;\n                kptr += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr1, 0);\n            __lsx_vst(_sum2, outptr2, 0);\n            __lsx_vst(_sum3, outptr3, 0);\n\n            outptr0 += 4;\n            outptr1 += 4;\n            outptr2 += 4;\n            outptr3 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = biasptr[0];\n            float sum1 = biasptr[1];\n            float sum2 = biasptr[2];\n            float sum3 = biasptr[3];\n\n            for (int q = 0; q < nn; q++)\n            {\n                sum0 += tmpptr[0] * 
kptr[0];\n                sum1 += tmpptr[0] * kptr[1];\n                sum2 += tmpptr[0] * kptr[2];\n                sum3 += tmpptr[0] * kptr[3];\n                tmpptr++;\n                kptr += 4;\n            }\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n            outptr2[0] = sum2;\n            outptr3[0] = sum3;\n\n            outptr0++;\n            outptr1++;\n            outptr2++;\n            outptr3++;\n        }\n    }\n\n    remain_outch_start += nn_outch << 2;\n#else // __loongarch_sx\n    int nn_outch = outch >> 1;\n    int remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n\n        const float zeros[2] = {0.f, 0.f};\n        const float* biasptr = bias ? bias + p : zeros;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n            const float* kptr = kernel.channel(p / 2);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum00 = biasptr[0];\n            float sum01 = biasptr[0];\n            float sum02 = biasptr[0];\n            float sum03 = biasptr[0];\n            float sum10 = biasptr[1];\n            float sum11 = biasptr[1];\n            float sum12 = biasptr[1];\n            float sum13 = biasptr[1];\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 8);\n                float k0 = kptr[0];\n                float k1 = kptr[1];\n                sum00 += tmpptr[0] * k0;\n                sum01 += tmpptr[1] * k0;\n                sum02 += tmpptr[2] * k0;\n                sum03 += tmpptr[3] * k0;\n                sum10 += tmpptr[0] * k1;\n                sum11 += tmpptr[1] * k1;\n                
sum12 += tmpptr[2] * k1;\n                sum13 += tmpptr[3] * k1;\n                tmpptr += 4;\n                kptr += 2;\n            }\n\n            outptr0[0] = sum00;\n            outptr0[1] = sum01;\n            outptr0[2] = sum02;\n            outptr0[3] = sum03;\n            outptr1[0] = sum10;\n            outptr1[1] = sum11;\n            outptr1[2] = sum12;\n            outptr1[3] = sum13;\n\n            outptr0 += 4;\n            outptr1 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n            const float* kptr = kernel.channel(p / 2);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = biasptr[0];\n            float sum1 = biasptr[1];\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 4);\n                __builtin_prefetch(kptr + 8);\n                sum0 += tmpptr[0] * kptr[0];\n                sum1 += tmpptr[0] * kptr[1];\n                tmpptr++;\n                kptr += 2;\n            }\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n\n            outptr0++;\n            outptr1++;\n        }\n    }\n#endif // __loongarch_sx\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        float* outptr0 = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n#if __loongarch_sx\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4 + p % 4);\n#else\n            const float* kptr = kernel.channel(p / 2 + p % 2);\n#endif\n\n            int nn = inch * maxk; // inch always > 0\n\n#if __loongarch_sx\n            __m128 _sum0 = __lsx_vreplfr2vr_s(bias0);\n\n            for (int q = 0; q < nn; q++)\n            {\n                _sum0 = __lsx_vfmadd_s((__m128)__lsx_vld(tmpptr, 0), __lsx_vreplfr2vr_s(kptr[0]), _sum0);\n                tmpptr += 4;\n                kptr++;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n\n            outptr0 += 4;\n#else\n            float sum0 = bias0;\n            float sum1 = bias0;\n            float sum2 = bias0;\n            float sum3 = bias0;\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 4);\n                sum0 += tmpptr[0] * kptr[0];\n                sum1 += tmpptr[1] * kptr[0];\n                sum2 += tmpptr[2] * kptr[0];\n                sum3 += tmpptr[3] * kptr[0];\n                tmpptr += 4;\n                kptr++;\n            }\n\n            outptr0[0] = sum0;\n            outptr0[1] = sum1;\n            outptr0[2] = sum2;\n            outptr0[3] = sum3;\n\n            outptr0 += 4;\n#endif // __loongarch_sx\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n#if __loongarch_sx\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4 + p % 4);\n#else\n            const float* kptr = kernel.channel(p / 2 + p % 2);\n#endif\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = bias0;\n\n            for (int q = 0; q < nn; q++)\n            {\n                sum0 += tmpptr[0] * kptr[0];\n               
 tmpptr++;\n                kptr++;\n            }\n\n            outptr0[0] = sum0;\n\n            outptr0++;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_transform_kernel_lsx(const Mat& _kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // interleave\n    // src = maxk-inch-outch\n    // dst = 8b-maxk-inch-outch/8b\n    Mat kernel = _kernel.reshape(maxk, inch, outch);\n#if __loongarch_sx\n    kernel_tm.create(8 * maxk, inch, outch / 8 + (outch % 8) / 4 + outch % 4);\n#else\n    kernel_tm.create(2 * maxk, inch, outch / 2 + outch % 2);\n#endif\n\n    int q = 0;\n#if __loongarch_sx\n    for (; q + 7 < outch; q += 8)\n    {\n        const Mat k0 = kernel.channel(q);\n        const Mat k1 = kernel.channel(q + 1);\n        const Mat k2 = kernel.channel(q + 2);\n        const Mat k3 = kernel.channel(q + 3);\n        const Mat k4 = kernel.channel(q + 4);\n        const Mat k5 = kernel.channel(q + 5);\n        const Mat k6 = kernel.channel(q + 6);\n        const Mat k7 = kernel.channel(q + 7);\n\n        float* g00 = kernel_tm.channel(q / 8);\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n            const float* k10 = k1.row(p);\n            const float* k20 = k2.row(p);\n            const float* k30 = k3.row(p);\n            const float* k40 = k4.row(p);\n            const float* k50 = k5.row(p);\n            const float* k60 = k6.row(p);\n            const float* k70 = k7.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n                g00[1] = k10[k];\n                g00[2] = k20[k];\n                g00[3] = k30[k];\n                g00[4] = k40[k];\n                g00[5] = k50[k];\n                g00[6] = k60[k];\n                g00[7] = k70[k];\n\n                g00 += 8;\n            }\n        }\n    }\n    for (; q + 3 < outch; q += 4)\n    {\n        const Mat k0 
= kernel.channel(q);\n        const Mat k1 = kernel.channel(q + 1);\n        const Mat k2 = kernel.channel(q + 2);\n        const Mat k3 = kernel.channel(q + 3);\n\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n            const float* k10 = k1.row(p);\n            const float* k20 = k2.row(p);\n            const float* k30 = k3.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n                g00[1] = k10[k];\n                g00[2] = k20[k];\n                g00[3] = k30[k];\n\n                g00 += 4;\n            }\n        }\n    }\n#else\n    for (; q + 1 < outch; q += 2)\n    {\n        const Mat k0 = kernel.channel(q);\n        const Mat k1 = kernel.channel(q + 1);\n\n        float* g00 = kernel_tm.channel(q / 2);\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n            const float* k10 = k1.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n                g00[1] = k10[k];\n\n                g00 += 2;\n            }\n        }\n    }\n#endif // __loongarch_sx\n    for (; q < outch; q++)\n    {\n        const Mat k0 = kernel.channel(q);\n\n#if __loongarch_sx\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + q % 4);\n#else\n        float* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n\n                g00 += 1;\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = 
bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // im2col\n    Mat bottom_im2col(size, maxk, inch, 4u, 1, opt.workspace_allocator);\n    {\n        const int gap = w * stride_h - outw * stride_w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            float* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; v++)\n                {\n                    const float* sptr = img.row<const float>(dilation_h * u) + dilation_w * v;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j < outw; j++)\n                        {\n                            ptr[0] = sptr[0];\n\n                            sptr += stride_w;\n                            ptr += 1;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_lsx(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_sgemm_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_int8_lsx(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 8u, 8, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    // permute\n    Mat tmp;\n#if __loongarch_sx\n    if (inch >= 4)\n    {\n        if (size >= 2)\n            tmp.create(2 * maxk, inch / 4 + inch % 4, size / 2 + size % 2, 4u, 4, opt.workspace_allocator);\n        else\n            tmp.create(maxk, inch / 4 + inch % 4, size, 4u, 4, opt.workspace_allocator);\n    }\n    else\n#endif // __loongarch_sx\n    {\n        if (size >= 2)\n            tmp.create(2 * maxk, inch, size / 2 + size % 2, 1u, 1, opt.workspace_allocator);\n        else\n            tmp.create(maxk, inch, size, 1u, 1, opt.workspace_allocator);\n    }\n    {\n        int remain_size_start = 0;\n        int nn_size = (size - remain_size_start) >> 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 2;\n\n            signed char* tmpptr = tmp.channel(i / 2);\n\n            int q = 0;\n#if __loongarch_sx\n            for (; q + 3 < inch; q += 4)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n                const signed char* img1 = (const signed char*)bottom_im2col.channel(q + 1) + i;\n                const signed char* img2 = (const signed char*)bottom_im2col.channel(q + 2) + i;\n                const signed char* img3 = (const signed char*)bottom_im2col.channel(q + 3) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                
    tmpptr[1] = img1[0];\n                    tmpptr[2] = img2[0];\n                    tmpptr[3] = img3[0];\n                    tmpptr[4] = img0[1];\n                    tmpptr[5] = img1[1];\n                    tmpptr[6] = img2[1];\n                    tmpptr[7] = img3[1];\n                    tmpptr += 8;\n\n                    img0 += size;\n                    img1 += size;\n                    img2 += size;\n                    img3 += size;\n                }\n            }\n#endif // __loongarch_sx\n            for (; q < inch; q++)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    tmpptr[1] = img0[1];\n\n                    tmpptr += 2;\n\n                    img0 += size;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n\n            int q = 0;\n#if __loongarch_sx\n            for (; q + 3 < inch; q += 4)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n                const signed char* img1 = (const signed char*)bottom_im2col.channel(q + 1) + i;\n                const signed char* img2 = (const signed char*)bottom_im2col.channel(q + 2) + i;\n                const signed char* img3 = (const signed char*)bottom_im2col.channel(q + 3) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    tmpptr[1] = img1[0];\n                    tmpptr[2] = img2[0];\n                    tmpptr[3] = img3[0];\n                    tmpptr += 4;\n\n                    img0 += size;\n                    img1 += size;\n 
                   img2 += size;\n                    img3 += size;\n                }\n            }\n#endif // __loongarch_sx\n            for (; q < inch; q++)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n\n                    tmpptr += 1;\n\n                    img0 += size;\n                }\n            }\n        }\n    }\n\n#if __loongarch_sx\n    int nn_outch = outch >> 2;\n    int remain_outch_start = nn_outch << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 4;\n\n        int* outptr0 = top_blob.channel(p);\n        int* outptr1 = top_blob.channel(p + 1);\n        int* outptr2 = top_blob.channel(p + 2);\n        int* outptr3 = top_blob.channel(p + 3);\n\n        int i = 0;\n        for (; i + 1 < size; i += 2)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2);\n            const signed char* kptr = kernel.channel(p / 4);\n\n            int nn4 = (inch / 4) * maxk;\n            int nn1 = (inch % 4) * maxk;\n\n            __m128i _sum00 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum10 = __lsx_vreplgr2vr_w(0);\n\n            if (nn4 > 0)\n            {\n                __m128i _sum01 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum02 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum03 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum11 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum12 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum13 = __lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn4; j++)\n                {\n                    __m128i _val = __lsx_vld(tmpptr, 0);\n                    __m128i _val01 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                    __m128i _val0 = __lsx_vilvl_d(_val01, 
_val01);\n                    __m128i _val1 = __lsx_vilvh_d(_val01, _val01);\n\n                    __m128i _w01 = __lsx_vld(kptr, 0);\n                    __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                    __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                    __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n\n                    __m128i _s00 = __lsx_vmul_h(_val0, _w0);\n                    __m128i _s01 = __lsx_vmul_h(_val0, _w1);\n                    __m128i _s10 = __lsx_vmul_h(_val1, _w0);\n                    __m128i _s11 = __lsx_vmul_h(_val1, _w1);\n\n                    __m128i _exts00 = __lsx_vslti_h(_s00, 0);\n                    __m128i _exts01 = __lsx_vslti_h(_s01, 0);\n                    __m128i _exts10 = __lsx_vslti_h(_s10, 0);\n                    __m128i _exts11 = __lsx_vslti_h(_s11, 0);\n                    __m128i _s00l = __lsx_vilvl_h(_exts00, _s00);\n                    __m128i _s00h = __lsx_vilvh_h(_exts00, _s00);\n                    __m128i _s01l = __lsx_vilvl_h(_exts01, _s01);\n                    __m128i _s01h = __lsx_vilvh_h(_exts01, _s01);\n                    __m128i _s10l = __lsx_vilvl_h(_exts10, _s10);\n                    __m128i _s10h = __lsx_vilvh_h(_exts10, _s10);\n                    __m128i _s11l = __lsx_vilvl_h(_exts11, _s11);\n                    __m128i _s11h = __lsx_vilvh_h(_exts11, _s11);\n\n                    _sum00 = __lsx_vadd_w(_sum00, _s00l);\n                    _sum01 = __lsx_vadd_w(_sum01, _s00h);\n                    _sum02 = __lsx_vadd_w(_sum02, _s01l);\n                    _sum03 = __lsx_vadd_w(_sum03, _s01h);\n                    _sum10 = __lsx_vadd_w(_sum10, _s10l);\n                    _sum11 = __lsx_vadd_w(_sum11, _s10h);\n                    _sum12 = __lsx_vadd_w(_sum12, _s11l);\n                    _sum13 = __lsx_vadd_w(_sum13, _s11h);\n\n                    tmpptr += 8;\n                    kptr += 16;\n                }\n\n                // transpose 4x4\n                {\n              
      __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __lsx_vilvl_w(_sum01, _sum00);\n                    _tmp1 = __lsx_vilvl_w(_sum03, _sum02);\n                    _tmp2 = __lsx_vilvh_w(_sum01, _sum00);\n                    _tmp3 = __lsx_vilvh_w(_sum03, _sum02);\n                    _sum00 = __lsx_vilvl_d(_tmp1, _tmp0);\n                    _sum01 = __lsx_vilvh_d(_tmp1, _tmp0);\n                    _sum02 = __lsx_vilvl_d(_tmp3, _tmp2);\n                    _sum03 = __lsx_vilvh_d(_tmp3, _tmp2);\n                }\n                {\n                    __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __lsx_vilvl_w(_sum11, _sum10);\n                    _tmp1 = __lsx_vilvl_w(_sum13, _sum12);\n                    _tmp2 = __lsx_vilvh_w(_sum11, _sum10);\n                    _tmp3 = __lsx_vilvh_w(_sum13, _sum12);\n                    _sum10 = __lsx_vilvl_d(_tmp1, _tmp0);\n                    _sum11 = __lsx_vilvh_d(_tmp1, _tmp0);\n                    _sum12 = __lsx_vilvl_d(_tmp3, _tmp2);\n                    _sum13 = __lsx_vilvh_d(_tmp3, _tmp2);\n                }\n\n                _sum00 = __lsx_vadd_w(_sum00, _sum01);\n                _sum02 = __lsx_vadd_w(_sum02, _sum03);\n                _sum10 = __lsx_vadd_w(_sum10, _sum11);\n                _sum12 = __lsx_vadd_w(_sum12, _sum13);\n\n                _sum00 = __lsx_vadd_w(_sum00, _sum02);\n                _sum10 = __lsx_vadd_w(_sum10, _sum12);\n            }\n\n            int j = 0;\n            for (; j < nn1; j++)\n            {\n                __m128i _val0 = __lsx_vreplgr2vr_h(tmpptr[0]);\n                __m128i _val1 = __lsx_vreplgr2vr_h(tmpptr[1]);\n                __m128i _val = __lsx_vilvl_d(_val1, _val0);\n\n                __m128i _w = __lsx_vld(kptr, 0);\n                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                _w16 = __lsx_vilvl_d(_w16, _w16);\n\n                __m128i _s0 = __lsx_vmul_h(_val, _w16);\n                __m128i _exts0 = 
__lsx_vslti_h(_s0, 0);\n                __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n\n                _sum00 = __lsx_vadd_w(_sum00, _s0l);\n                _sum10 = __lsx_vadd_w(_sum10, _s0h);\n\n                tmpptr += 2;\n                kptr += 4;\n            }\n\n            int sum[8];\n            __lsx_vst(_sum00, sum, 0);\n            __lsx_vst(_sum10, sum + 4, 0);\n\n            outptr0[0] = sum[0];\n            outptr1[0] = sum[1];\n            outptr2[0] = sum[2];\n            outptr3[0] = sum[3];\n            outptr0[1] = sum[4];\n            outptr1[1] = sum[5];\n            outptr2[1] = sum[6];\n            outptr3[1] = sum[7];\n            outptr0 += 2;\n            outptr1 += 2;\n            outptr2 += 2;\n            outptr3 += 2;\n        }\n        for (; i < size; i++)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n            const signed char* kptr = kernel.channel(p / 4);\n\n            int nn4 = (inch / 4) * maxk;\n            int nn1 = (inch % 4) * maxk;\n\n            __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n\n            if (nn4 > 0)\n            {\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn4; j++)\n                {\n                    __m128i _val = __lsx_vld(tmpptr, 0);\n                    __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                    _val16 = __lsx_vilvl_d(_val16, _val16);\n\n                    __m128i _w01 = __lsx_vld(kptr, 0);\n                    __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                    __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                    __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n\n                    __m128i _s0 = __lsx_vmul_h(_val16, _w0);\n                    __m128i _s1 = 
__lsx_vmul_h(_val16, _w1);\n\n                    __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                    __m128i _exts1 = __lsx_vslti_h(_s1, 0);\n                    __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                    __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n                    __m128i _s1l = __lsx_vilvl_h(_exts1, _s1);\n                    __m128i _s1h = __lsx_vilvh_h(_exts1, _s1);\n\n                    _sum0 = __lsx_vadd_w(_sum0, _s0l);\n                    _sum1 = __lsx_vadd_w(_sum1, _s0h);\n                    _sum2 = __lsx_vadd_w(_sum2, _s1l);\n                    _sum3 = __lsx_vadd_w(_sum3, _s1h);\n\n                    tmpptr += 4;\n                    kptr += 16;\n                }\n\n                // transpose 4x4\n                {\n                    __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __lsx_vilvl_w(_sum1, _sum0);\n                    _tmp1 = __lsx_vilvl_w(_sum3, _sum2);\n                    _tmp2 = __lsx_vilvh_w(_sum1, _sum0);\n                    _tmp3 = __lsx_vilvh_w(_sum3, _sum2);\n                    _sum0 = __lsx_vilvl_d(_tmp1, _tmp0);\n                    _sum1 = __lsx_vilvh_d(_tmp1, _tmp0);\n                    _sum2 = __lsx_vilvl_d(_tmp3, _tmp2);\n                    _sum3 = __lsx_vilvh_d(_tmp3, _tmp2);\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n                _sum2 = __lsx_vadd_w(_sum2, _sum3);\n                _sum0 = __lsx_vadd_w(_sum0, _sum2);\n            }\n            int j = 0;\n            for (; j < nn1; j++)\n            {\n                __m128i _val = __lsx_vreplgr2vr_h(tmpptr[0]);\n\n                __m128i _w = __lsx_vld(kptr, 0);\n                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                __m128i _s0 = __lsx_vmul_h(_val, _w16);\n                __m128i _s032 = __lsx_vilvl_h(__lsx_vslti_h(_s0, 0), _s0);\n\n                _sum0 = __lsx_vadd_w(_sum0, _s032);\n\n                tmpptr += 1;\n                kptr += 
4;\n            }\n\n            int sum[4];\n            __lsx_vst(_sum0, sum, 0);\n\n            outptr0[0] = sum[0];\n            outptr1[0] = sum[1];\n            outptr2[0] = sum[2];\n            outptr3[0] = sum[3];\n            outptr0 += 1;\n            outptr1 += 1;\n            outptr2 += 1;\n            outptr3 += 1;\n        }\n    }\n#else // __loongarch_sx\n    int nn_outch = outch >> 1;\n    int remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        int* outptr0 = top_blob.channel(p);\n        int* outptr1 = top_blob.channel(p + 1);\n\n        int i = 0;\n        for (; i + 1 < size; i += 2)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2);\n            const signed char* kptr = kernel.channel(p / 2);\n\n            int sum00 = 0;\n            int sum01 = 0;\n            int sum10 = 0;\n            int sum11 = 0;\n\n            int nn1 = inch * maxk;\n\n            int j = 0;\n            for (; j < nn1; j++)\n            {\n                signed char val0 = tmpptr[0];\n                signed char val1 = tmpptr[1];\n                signed char w0 = kptr[0];\n                signed char w1 = kptr[1];\n\n                sum00 += val0 * w0;\n                sum01 += val1 * w0;\n                sum10 += val0 * w1;\n                sum11 += val1 * w1;\n\n                tmpptr += 2;\n                kptr += 2;\n            }\n\n            outptr0[0] = sum00;\n            outptr0[1] = sum01;\n            outptr1[0] = sum10;\n            outptr1[1] = sum11;\n            outptr0 += 2;\n            outptr1 += 2;\n        }\n        for (; i < size; i++)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n            const signed char* kptr = kernel.channel(p / 2);\n\n            int sum00 = 0;\n            int sum10 = 0;\n\n            int nn1 = inch * maxk;\n\n            int j = 
0;\n            for (; j < nn1; j++)\n            {\n                signed char val0 = tmpptr[0];\n                signed char w0 = kptr[0];\n                signed char w1 = kptr[1];\n\n                sum00 += val0 * w0;\n                sum10 += val0 * w1;\n\n                tmpptr += 1;\n                kptr += 2;\n            }\n\n            outptr0[0] = sum00;\n            outptr1[0] = sum10;\n            outptr0 += 1;\n            outptr1 += 1;\n        }\n    }\n#endif // __loongarch_sx\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        int* outptr0 = top_blob.channel(p);\n\n        int i = 0;\n        for (; i + 1 < size; i += 2)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2);\n#if __loongarch_sx\n            const signed char* kptr = kernel.channel(p / 4 + p % 4);\n#else\n            const signed char* kptr = kernel.channel(p / 2 + p % 2);\n#endif\n\n            int sum0 = 0;\n            int sum1 = 0;\n\n#if __loongarch_sx\n            int nn4 = (inch / 4) * maxk;\n            int nn1 = (inch % 4) * maxk;\n\n            if (nn4 > 0)\n            {\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn4; j++)\n                {\n                    __m128i _val = __lsx_vld(tmpptr, 0);\n                    __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                    __m128i _w = __lsx_vld(kptr, 0);\n                    __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                    _w16 = __lsx_vilvl_d(_w16, _w16);\n\n                    __m128i _s0 = __lsx_vmul_h(_val16, _w16);\n                    __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                    __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                    __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n\n                    _sum0 = 
__lsx_vadd_w(_sum0, _s0l);\n                    _sum1 = __lsx_vadd_w(_sum1, _s0h);\n\n                    tmpptr += 8;\n                    kptr += 4;\n                }\n\n                sum0 = __lsx_reduce_add_w(_sum0);\n                sum1 = __lsx_reduce_add_w(_sum1);\n            }\n#else\n            int nn1 = inch * maxk;\n#endif // __loongarch_sx\n\n            int j = 0;\n            for (; j < nn1; j++)\n            {\n                signed char val0 = tmpptr[0];\n                signed char val1 = tmpptr[1];\n                signed char w = kptr[0];\n\n                sum0 += val0 * w;\n                sum1 += val1 * w;\n\n                tmpptr += 2;\n                kptr += 1;\n            }\n\n            outptr0[0] = sum0;\n            outptr0[1] = sum1;\n            outptr0 += 2;\n        }\n        for (; i < size; i++)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n#if __loongarch_sx\n            const signed char* kptr = kernel.channel(p / 4 + p % 4);\n#else\n            const signed char* kptr = kernel.channel(p / 2 + p % 2);\n#endif\n\n            int sum = 0;\n\n#if __loongarch_sx\n            int nn4 = (inch / 4) * maxk;\n            int nn1 = (inch % 4) * maxk;\n\n            if (nn4 > 0)\n            {\n                __m128i _sum = __lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn4; j++)\n                {\n                    __m128i _val = __lsx_vld(tmpptr, 0);\n                    __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                    __m128i _w = __lsx_vld(kptr, 0);\n                    __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                    __m128i _s0 = __lsx_vmul_h(_val16, _w16);\n                    __m128i _s032 = __lsx_vilvl_h(__lsx_vslti_h(_s0, 0), _s0);\n\n                    _sum = __lsx_vadd_w(_sum, _s032);\n\n                    tmpptr += 4;\n                    kptr += 4;\n                }\n\n         
       sum = __lsx_reduce_add_w(_sum);\n            }\n#else\n            int nn1 = inch * maxk;\n#endif // __loongarch_sx\n\n            int j = 0;\n            for (; j < nn1; j++)\n            {\n                signed char val = tmpptr[0];\n                signed char w = kptr[0];\n\n                sum += val * w;\n\n                tmpptr += 1;\n                kptr += 1;\n            }\n\n            outptr0[0] = sum;\n            outptr0 += 1;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_transform_kernel_int8_lsx(const Mat& _kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // interleave\n    // src = maxk-inch-outch\n    // dst = 4a-4b-maxk-inch/4a-outch/4b\n    Mat kernel = _kernel.reshape(maxk, inch, outch);\n#if __loongarch_sx\n    if (outch >= 4)\n    {\n        if (inch >= 4)\n            kernel_tm.create(16 * maxk, inch / 4 + inch % 4, outch / 4 + outch % 4, (size_t)1u);\n        else\n            kernel_tm.create(4 * maxk, inch, outch / 4 + outch % 4, (size_t)1u);\n    }\n#else\n    if (outch >= 2)\n    {\n        kernel_tm.create(2 * maxk, inch, outch / 2 + outch % 2, (size_t)1u);\n    }\n#endif // __loongarch_sx\n    else\n    {\n#if __loongarch_sx\n        if (inch >= 4)\n            kernel_tm.create(4 * maxk, inch / 4 + inch % 4, outch, (size_t)1u);\n        else\n#endif // __loongarch_sx\n        {\n            kernel_tm.create(1 * maxk, inch, outch, (size_t)1u);\n        }\n    }\n\n    int q = 0;\n#if __loongarch_sx\n    for (; q + 3 < outch; q += 4)\n    {\n        signed char* g00 = kernel_tm.channel(q / 4);\n\n        int p = 0;\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const signed char* k00 = kernel.channel(q + i).row<const 
signed char>(p + j);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    const signed char* k00 = kernel.channel(q + i).row<const signed char>(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n#else  // __loongarch_sx\n    for (; q + 1 < outch; q += 2)\n    {\n        signed char* g00 = kernel_tm.channel(q / 2);\n\n        int p = 0;\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 2; i++)\n                {\n                    const signed char* k00 = kernel.channel(q + i).row<const signed char>(p);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n    }\n#endif // __loongarch_sx\n    for (; q < outch; q++)\n    {\n#if __loongarch_sx\n        signed char* g00 = kernel_tm.channel(q / 4 + q % 4);\n#else\n        signed char* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        int p = 0;\n#if __loongarch_sx\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int j = 0; j < 4; j++)\n                {\n                    const signed char* k00 = kernel.channel(q).row<const signed char>(p + j);\n                    g00[0] = k00[k];\n                    g00++;\n                }\n            }\n        }\n#endif // __loongarch_sx\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                const signed char* k00 = kernel.channel(q).row<const signed char>(p);\n                g00[0] = k00[k];\n                g00++;\n            }\n        
}\n    }\n}\n\nstatic void convolution_im2col_sgemm_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // im2col\n    Mat bottom_im2col(size, maxk, inch, 1u, 1, opt.workspace_allocator);\n    {\n        const int gap = w * stride_h - outw * stride_w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            signed char* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; v++)\n                {\n                    const signed char* sptr = img.row<const signed char>(dilation_h * u) + dilation_w * v;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j + 3 < outw; j += 4)\n                        {\n                            ptr[0] = sptr[0];\n                            ptr[1] = sptr[stride_w];\n                            ptr[2] = sptr[stride_w * 2];\n                            ptr[3] = sptr[stride_w * 3];\n\n                            sptr += stride_w * 4;\n                            ptr += 4;\n                        }\n                        for (; j + 1 < outw; j += 2)\n                        {\n                            ptr[0] = sptr[0];\n                            ptr[1] = sptr[stride_w];\n\n                            sptr += stride_w * 2;\n                            ptr += 2;\n                        }\n                        for (; j < outw; j++)\n                        {\n                            ptr[0] = 
sptr[0];\n\n                            sptr += stride_w;\n                            ptr += 1;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_sgemm_pack1to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_pack1to4_int8_lsx(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 8u, 8, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    // permute\n    Mat tmp;\n    if (inch >= 4)\n    {\n        if (size >= 2)\n            tmp.create(2 * maxk, inch / 4 + inch % 4, size / 2 + size % 2, 4u, 4, opt.workspace_allocator);\n        else\n            tmp.create(maxk, inch / 4 + inch % 4, size, 4u, 4, opt.workspace_allocator);\n    }\n    else\n    {\n        if (size >= 2)\n            tmp.create(2 * maxk, inch, size / 2 + size % 2, 1u, 1, opt.workspace_allocator);\n        else\n            tmp.create(maxk, inch, size, 1u, 1, opt.workspace_allocator);\n    }\n    {\n        int remain_size_start = 0;\n        int nn_size = (size - remain_size_start) >> 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 2;\n\n            signed char* tmpptr = tmp.channel(i / 2);\n\n            int q = 0;\n            for (; q + 3 < inch; q += 4)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n                const signed char* img1 = (const signed char*)bottom_im2col.channel(q + 1) + i;\n                const signed char* img2 = (const signed char*)bottom_im2col.channel(q + 2) + i;\n                const signed char* img3 = (const signed char*)bottom_im2col.channel(q + 3) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    tmpptr[1] = img1[0];\n                    tmpptr[2] 
= img2[0];\n                    tmpptr[3] = img3[0];\n                    tmpptr[4] = img0[1];\n                    tmpptr[5] = img1[1];\n                    tmpptr[6] = img2[1];\n                    tmpptr[7] = img3[1];\n                    tmpptr += 8;\n\n                    img0 += size;\n                    img1 += size;\n                    img2 += size;\n                    img3 += size;\n                }\n            }\n            for (; q < inch; q++)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    tmpptr[1] = img0[1];\n\n                    tmpptr += 2;\n\n                    img0 += size;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n\n            int q = 0;\n            for (; q + 3 < inch; q += 4)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n                const signed char* img1 = (const signed char*)bottom_im2col.channel(q + 1) + i;\n                const signed char* img2 = (const signed char*)bottom_im2col.channel(q + 2) + i;\n                const signed char* img3 = (const signed char*)bottom_im2col.channel(q + 3) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    tmpptr[1] = img1[0];\n                    tmpptr[2] = img2[0];\n                    tmpptr[3] = img3[0];\n                    tmpptr += 4;\n\n                    img0 += size;\n                    img1 += size;\n                    img2 += size;\n                    img3 += size;\n                }\n            
}\n            for (; q < inch; q++)\n            {\n                const signed char* img0 = (const signed char*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n\n                    tmpptr += 1;\n\n                    img0 += size;\n                }\n            }\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr0 = top_blob.channel(p);\n\n        int i = 0;\n        for (; i + 1 < size; i += 2)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2);\n            const signed char* kptr = kernel.channel(p);\n\n            int nn4 = (inch / 4) * maxk;\n            int nn1 = (inch % 4) * maxk;\n\n            __m128i _sum00 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum10 = __lsx_vreplgr2vr_w(0);\n\n            if (nn4 > 0)\n            {\n                __m128i _sum01 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum02 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum03 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum11 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum12 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum13 = __lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn4; j++)\n                {\n                    __builtin_prefetch(tmpptr + 32);\n                    __builtin_prefetch(kptr + 64);\n                    __m128i _val = __lsx_vld(tmpptr, 0);\n                    __m128i _val01 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                    __m128i _val0 = __lsx_vilvl_d(_val01, _val01);\n                    __m128i _val1 = __lsx_vilvh_d(_val01, _val01);\n\n                    __m128i _w01 = __lsx_vld(kptr, 0);\n                    __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                    __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                    __m128i _w1 = 
__lsx_vilvh_b(_extw01, _w01);\n\n                    __m128i _s00 = __lsx_vmul_h(_val0, _w0);\n                    __m128i _s01 = __lsx_vmul_h(_val0, _w1);\n                    __m128i _s10 = __lsx_vmul_h(_val1, _w0);\n                    __m128i _s11 = __lsx_vmul_h(_val1, _w1);\n\n                    __m128i _exts00 = __lsx_vslti_h(_s00, 0);\n                    __m128i _exts01 = __lsx_vslti_h(_s01, 0);\n                    __m128i _exts10 = __lsx_vslti_h(_s10, 0);\n                    __m128i _exts11 = __lsx_vslti_h(_s11, 0);\n                    __m128i _s00l = __lsx_vilvl_h(_exts00, _s00);\n                    __m128i _s00h = __lsx_vilvh_h(_exts00, _s00);\n                    __m128i _s01l = __lsx_vilvl_h(_exts01, _s01);\n                    __m128i _s01h = __lsx_vilvh_h(_exts01, _s01);\n                    __m128i _s10l = __lsx_vilvl_h(_exts10, _s10);\n                    __m128i _s10h = __lsx_vilvh_h(_exts10, _s10);\n                    __m128i _s11l = __lsx_vilvl_h(_exts11, _s11);\n                    __m128i _s11h = __lsx_vilvh_h(_exts11, _s11);\n\n                    _sum00 = __lsx_vadd_w(_sum00, _s00l);\n                    _sum01 = __lsx_vadd_w(_sum01, _s00h);\n                    _sum02 = __lsx_vadd_w(_sum02, _s01l);\n                    _sum03 = __lsx_vadd_w(_sum03, _s01h);\n                    _sum10 = __lsx_vadd_w(_sum10, _s10l);\n                    _sum11 = __lsx_vadd_w(_sum11, _s10h);\n                    _sum12 = __lsx_vadd_w(_sum12, _s11l);\n                    _sum13 = __lsx_vadd_w(_sum13, _s11h);\n\n                    tmpptr += 8;\n                    kptr += 16;\n                }\n\n                // transpose 4x4\n                {\n                    __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __lsx_vilvl_w(_sum01, _sum00);\n                    _tmp1 = __lsx_vilvl_w(_sum03, _sum02);\n                    _tmp2 = __lsx_vilvh_w(_sum01, _sum00);\n                    _tmp3 = __lsx_vilvh_w(_sum03, _sum02);\n               
     _sum00 = __lsx_vilvl_d(_tmp1, _tmp0);\n                    _sum01 = __lsx_vilvh_d(_tmp1, _tmp0);\n                    _sum02 = __lsx_vilvl_d(_tmp3, _tmp2);\n                    _sum03 = __lsx_vilvh_d(_tmp3, _tmp2);\n                }\n                {\n                    __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __lsx_vilvl_w(_sum11, _sum10);\n                    _tmp1 = __lsx_vilvl_w(_sum13, _sum12);\n                    _tmp2 = __lsx_vilvh_w(_sum11, _sum10);\n                    _tmp3 = __lsx_vilvh_w(_sum13, _sum12);\n                    _sum10 = __lsx_vilvl_d(_tmp1, _tmp0);\n                    _sum11 = __lsx_vilvh_d(_tmp1, _tmp0);\n                    _sum12 = __lsx_vilvl_d(_tmp3, _tmp2);\n                    _sum13 = __lsx_vilvh_d(_tmp3, _tmp2);\n                }\n\n                _sum00 = __lsx_vadd_w(_sum00, _sum01);\n                _sum02 = __lsx_vadd_w(_sum02, _sum03);\n                _sum10 = __lsx_vadd_w(_sum10, _sum11);\n                _sum12 = __lsx_vadd_w(_sum12, _sum13);\n\n                _sum00 = __lsx_vadd_w(_sum00, _sum02);\n                _sum10 = __lsx_vadd_w(_sum10, _sum12);\n            }\n\n            int j = 0;\n            for (; j < nn1; j++)\n            {\n                __m128i _val0 = __lsx_vreplgr2vr_h(tmpptr[0]);\n                __m128i _val1 = __lsx_vreplgr2vr_h(tmpptr[1]);\n                __m128i _val = __lsx_vilvl_d(_val1, _val0);\n\n                __m128i _w = __lsx_vld(kptr, 0);\n                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                _w16 = __lsx_vilvl_d(_w16, _w16);\n\n                __m128i _s0 = __lsx_vmul_h(_val, _w16);\n                __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n\n                _sum00 = __lsx_vadd_w(_sum00, _s0l);\n                _sum10 = __lsx_vadd_w(_sum10, _s0h);\n\n                tmpptr += 2;\n              
  kptr += 4;\n            }\n\n            __lsx_vst(_sum00, outptr0, 0);\n            __lsx_vst(_sum10, outptr0 + 4, 0);\n            outptr0 += 8;\n        }\n        for (; i < size; i++)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n            const signed char* kptr = kernel.channel(p);\n\n            int nn4 = (inch / 4) * maxk;\n            int nn1 = (inch % 4) * maxk;\n\n            __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n\n            if (nn4 > 0)\n            {\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn4; j++)\n                {\n                    __builtin_prefetch(tmpptr + 16);\n                    __builtin_prefetch(kptr + 64);\n                    __m128i _val = __lsx_vld(tmpptr, 0);\n                    __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                    _val16 = __lsx_vilvl_d(_val16, _val16);\n\n                    __m128i _w01 = __lsx_vld(kptr, 0);\n                    __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                    __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                    __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n\n                    __m128i _s0 = __lsx_vmul_h(_val16, _w0);\n                    __m128i _s1 = __lsx_vmul_h(_val16, _w1);\n\n                    __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                    __m128i _exts1 = __lsx_vslti_h(_s1, 0);\n                    __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                    __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n                    __m128i _s1l = __lsx_vilvl_h(_exts1, _s1);\n                    __m128i _s1h = __lsx_vilvh_h(_exts1, _s1);\n\n                    _sum0 = __lsx_vadd_w(_sum0, _s0l);\n                    _sum1 = __lsx_vadd_w(_sum1, _s0h);\n                    _sum2 = __lsx_vadd_w(_sum2, 
_s1l);\n                    _sum3 = __lsx_vadd_w(_sum3, _s1h);\n\n                    tmpptr += 4;\n                    kptr += 16;\n                }\n\n                // transpose 4x4\n                {\n                    __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __lsx_vilvl_w(_sum1, _sum0);\n                    _tmp1 = __lsx_vilvl_w(_sum3, _sum2);\n                    _tmp2 = __lsx_vilvh_w(_sum1, _sum0);\n                    _tmp3 = __lsx_vilvh_w(_sum3, _sum2);\n                    _sum0 = __lsx_vilvl_d(_tmp1, _tmp0);\n                    _sum1 = __lsx_vilvh_d(_tmp1, _tmp0);\n                    _sum2 = __lsx_vilvl_d(_tmp3, _tmp2);\n                    _sum3 = __lsx_vilvh_d(_tmp3, _tmp2);\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n                _sum2 = __lsx_vadd_w(_sum2, _sum3);\n                _sum0 = __lsx_vadd_w(_sum0, _sum2);\n            }\n\n            int j = 0;\n            for (; j < nn1; j++)\n            {\n                __m128i _val = __lsx_vreplgr2vr_h(tmpptr[0]);\n\n                __m128i _w = __lsx_vld(kptr, 0);\n                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                __m128i _s0 = __lsx_vmul_h(_val, _w16);\n                __m128i _s032 = __lsx_vilvl_h(__lsx_vslti_h(_s0, 0), _s0);\n\n                _sum0 = __lsx_vadd_w(_sum0, _s032);\n\n                tmpptr += 1;\n                kptr += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            outptr0 += 4;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_transform_kernel_pack1to4_int8_lsx(const Mat& _kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // interleave\n    // src = maxk-inch-outch\n    // dst = 4a-4b-maxk-inch/4a-outch/4b\n    Mat kernel = _kernel.reshape(maxk, inch, outch);\n    if (inch >= 4)\n        kernel_tm.create(16 * maxk, inch / 4 + inch % 4, outch / 4, (size_t)1u);\n 
   else\n        kernel_tm.create(4 * maxk, inch, outch / 4, (size_t)1u);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        signed char* g00 = kernel_tm.channel(q / 4);\n\n        int p = 0;\n        for (; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const signed char* k00 = kernel.channel(q + i).row<const signed char>(p + j);\n\n                        g00[0] = k00[k];\n\n                        g00++;\n                    }\n                }\n            }\n        }\n        for (; p < inch; p++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    const signed char* k00 = kernel.channel(q + i).row<const signed char>(p);\n\n                    g00[0] = k00[k];\n\n                    g00++;\n                }\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_pack1to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // im2col\n    Mat bottom_im2col(size, maxk, inch, 1u, 1, opt.workspace_allocator);\n    {\n        const int gap = w * stride_h - outw * stride_w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            signed char* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; 
v++)\n                {\n                    const signed char* sptr = img.row<const signed char>(dilation_h * u) + dilation_w * v;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j + 3 < outw; j += 4)\n                        {\n                            ptr[0] = sptr[0];\n                            ptr[1] = sptr[stride_w];\n                            ptr[2] = sptr[stride_w * 2];\n                            ptr[3] = sptr[stride_w * 3];\n\n                            sptr += stride_w * 4;\n                            ptr += 4;\n                        }\n                        for (; j + 1 < outw; j += 2)\n                        {\n                            ptr[0] = sptr[0];\n                            ptr[1] = sptr[stride_w];\n\n                            sptr += stride_w * 2;\n                            ptr += 2;\n                        }\n                        for (; j < outw; j++)\n                        {\n                            ptr[0] = sptr[0];\n\n                            sptr += stride_w;\n                            ptr += 1;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_pack1to4_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_sgemm_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_pack4_lsx(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 4u * 4, 4, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    // permute\n    Mat tmp;\n    if (size >= 12)\n        tmp.create(12 * maxk, inch, size / 12 + (size % 12) / 8 + (size % 12 % 8) / 4 + (size % 12 % 4) / 2 + size % 12 % 2, 4u * 4, 4, opt.workspace_allocator);\n    else if (size >= 8)\n        tmp.create(8 * maxk, inch, size / 8 + (size % 8) / 4 + (size % 4) / 2 + size % 2, 4u * 4, 4, opt.workspace_allocator);\n    else if (size >= 4)\n        tmp.create(4 * maxk, inch, size / 4 + (size % 4) / 2 + size % 2, 4u * 4, 4, opt.workspace_allocator);\n    else if (size >= 2)\n        tmp.create(2 * maxk, inch, size / 2 + size % 2, 4u * 4, 4, opt.workspace_allocator);\n    else\n        tmp.create(maxk, inch, size, 4u * 4, 4, opt.workspace_allocator);\n    {\n        int remain_size_start = 0;\n        int nn_size = size / 12;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 12;\n\n            float* tmpptr = tmp.channel(i / 12);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    // transpose 4x12\n                    __m128i _r0 = __lsx_vld(img0, 0);\n                    __m128i _r1 = __lsx_vld(img0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(img0 + 4 * 2, 0);\n                    
__m128i _r3 = __lsx_vld(img0 + 4 * 3, 0);\n                    __m128i _r4 = __lsx_vld(img0 + 4 * 4, 0);\n                    __m128i _r5 = __lsx_vld(img0 + 4 * 5, 0);\n                    __m128i _r6 = __lsx_vld(img0 + 4 * 6, 0);\n                    __m128i _r7 = __lsx_vld(img0 + 4 * 7, 0);\n                    __m128i _r8 = __lsx_vld(img0 + 4 * 8, 0);\n                    __m128i _r9 = __lsx_vld(img0 + 4 * 9, 0);\n                    __m128i _ra = __lsx_vld(img0 + 4 * 10, 0);\n                    __m128i _rb = __lsx_vld(img0 + 4 * 11, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r45r = __lsx_vilvl_w(_r5, _r4);\n                    __m128i _r45l = __lsx_vilvh_w(_r5, _r4);\n                    __m128i _r67r = __lsx_vilvl_w(_r7, _r6);\n                    __m128i _r67l = __lsx_vilvh_w(_r7, _r6);\n                    __m128i _r89r = __lsx_vilvl_w(_r9, _r8);\n                    __m128i _r89l = __lsx_vilvh_w(_r9, _r8);\n                    __m128i _rabr = __lsx_vilvl_w(_rb, _ra);\n                    __m128i _rabl = __lsx_vilvh_w(_rb, _ra);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n                    __m128i _r4567_0 = __lsx_vilvl_d(_r67r, _r45r);\n                    __m128i _r4567_1 = __lsx_vilvh_d(_r67r, _r45r);\n                    __m128i _r4567_2 = __lsx_vilvl_d(_r67l, _r45l);\n                    __m128i _r4567_3 = __lsx_vilvh_d(_r67l, _r45l);\n                    __m128i _r89ab_0 = __lsx_vilvl_d(_rabr, _r89r);\n                    __m128i _r89ab_1 = __lsx_vilvh_d(_rabr, _r89r);\n               
     __m128i _r89ab_2 = __lsx_vilvl_d(_rabl, _r89l);\n                    __m128i _r89ab_3 = __lsx_vilvh_d(_rabl, _r89l);\n\n                    __lsx_vst(_r0123_0, tmpptr, 0);\n                    __lsx_vst(_r4567_0, tmpptr + 4, 0);\n                    __lsx_vst(_r89ab_0, tmpptr + 4 * 2, 0);\n                    __lsx_vst(_r0123_1, tmpptr + 4 * 3, 0);\n                    __lsx_vst(_r4567_1, tmpptr + 4 * 4, 0);\n                    __lsx_vst(_r89ab_1, tmpptr + 4 * 5, 0);\n                    __lsx_vst(_r0123_2, tmpptr + 4 * 6, 0);\n                    __lsx_vst(_r4567_2, tmpptr + 4 * 7, 0);\n                    __lsx_vst(_r89ab_2, tmpptr + 4 * 8, 0);\n                    __lsx_vst(_r0123_3, tmpptr + 4 * 9, 0);\n                    __lsx_vst(_r4567_3, tmpptr + 4 * 10, 0);\n                    __lsx_vst(_r89ab_3, tmpptr + 4 * 11, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 48;\n                }\n            }\n        }\n\n        remain_size_start += nn_size * 12;\n        nn_size = (size - remain_size_start) >> 3;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 8;\n\n            float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    // transpose 4x8\n                    __m128i _r0 = __lsx_vld(img0, 0);\n                    __m128i _r1 = __lsx_vld(img0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(img0 + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(img0 + 4 * 3, 0);\n                    __m128i _r4 = __lsx_vld(img0 + 4 * 4, 0);\n                    __m128i _r5 = __lsx_vld(img0 + 4 * 5, 0);\n                    __m128i _r6 = __lsx_vld(img0 + 4 * 6, 0);\n              
      __m128i _r7 = __lsx_vld(img0 + 4 * 7, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r45r = __lsx_vilvl_w(_r5, _r4);\n                    __m128i _r45l = __lsx_vilvh_w(_r5, _r4);\n                    __m128i _r67r = __lsx_vilvl_w(_r7, _r6);\n                    __m128i _r67l = __lsx_vilvh_w(_r7, _r6);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n                    __m128i _r4567_0 = __lsx_vilvl_d(_r67r, _r45r);\n                    __m128i _r4567_1 = __lsx_vilvh_d(_r67r, _r45r);\n                    __m128i _r4567_2 = __lsx_vilvl_d(_r67l, _r45l);\n                    __m128i _r4567_3 = __lsx_vilvh_d(_r67l, _r45l);\n\n                    __lsx_vst(_r0123_0, tmpptr, 0);\n                    __lsx_vst(_r4567_0, tmpptr + 4, 0);\n                    __lsx_vst(_r0123_1, tmpptr + 4 * 2, 0);\n                    __lsx_vst(_r4567_1, tmpptr + 4 * 3, 0);\n                    __lsx_vst(_r0123_2, tmpptr + 4 * 4, 0);\n                    __lsx_vst(_r4567_2, tmpptr + 4 * 5, 0);\n                    __lsx_vst(_r0123_3, tmpptr + 4 * 6, 0);\n                    __lsx_vst(_r4567_3, tmpptr + 4 * 7, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 32;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 3;\n        nn_size = (size - remain_size_start) >> 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 4;\n\n            float* tmpptr = 
tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = __lsx_vld(img0, 0);\n                    __m128i _r1 = __lsx_vld(img0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(img0 + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(img0 + 4 * 3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, tmpptr, 0);\n                    __lsx_vst(_r0123_1, tmpptr + 4, 0);\n                    __lsx_vst(_r0123_2, tmpptr + 4 * 2, 0);\n                    __lsx_vst(_r0123_3, tmpptr + 4 * 3, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 16;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 2;\n        nn_size = (size - remain_size_start) >> 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 2;\n\n            float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; 
k++)\n                {\n                    // transpose 4x2\n                    __m128i _r0 = __lsx_vld(img0, 0);\n                    __m128i _r1 = __lsx_vld(img0 + 4, 0);\n\n                    __m128i _r01_0 = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01_1 = __lsx_vilvh_w(_r1, _r0);\n\n                    __lsx_vst(_r01_0, tmpptr, 0);\n                    __lsx_vst(_r01_1, tmpptr + 4, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 8;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2 + i % 12 % 2);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    __m128i _val = __lsx_vld(img0, 0);\n                    __lsx_vst(_val, tmpptr, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 4;\n                }\n            }\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr0 = top_blob.channel(p);\n\n        int i = 0;\n        for (; i + 11 < size; i += 12)\n        {\n            const float* tmpptr = tmp.channel(i / 12);\n            const float* kptr0 = kernel.channel(p);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum0 = bias ? 
(__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = _sum0;\n            __m128 _sum2 = _sum0;\n            __m128 _sum3 = _sum0;\n            __m128 _sum4 = _sum0;\n            __m128 _sum5 = _sum0;\n            __m128 _sum6 = _sum0;\n            __m128 _sum7 = _sum0;\n            __m128 _sum8 = _sum0;\n            __m128 _sum9 = _sum0;\n            __m128 _suma = _sum0;\n            __m128 _sumb = _sum0;\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 48);\n                __builtin_prefetch(kptr0 + 16);\n                __m128i _val0123 = __lsx_vld(tmpptr, 0);\n                __m128i _val4567 = __lsx_vld(tmpptr + 4, 0);\n                __m128i _val89ab = __lsx_vld(tmpptr + 8, 0);\n                __m128 _w0 = (__m128)__lsx_vld(kptr0, 0);\n                _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 0), _sum0);\n                _sum1 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 1), _sum1);\n                _sum2 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 2), _sum2);\n                _sum3 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 3), _sum3);\n                _sum4 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 0), _sum4);\n                _sum5 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 1), _sum5);\n                _sum6 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 2), _sum6);\n                _sum7 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 3), _sum7);\n                _sum8 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 0), _sum8);\n                _sum9 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 1), _sum9);\n                _suma = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 2), _suma);\n                _sumb = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 3), _sumb);\n\n                tmpptr += 12;\n          
      kptr0 += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n            __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n            __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n            __lsx_vst(_sum4, outptr0 + 4 * 4, 0);\n            __lsx_vst(_sum5, outptr0 + 4 * 5, 0);\n            __lsx_vst(_sum6, outptr0 + 4 * 6, 0);\n            __lsx_vst(_sum7, outptr0 + 4 * 7, 0);\n            __lsx_vst(_sum8, outptr0 + 4 * 8, 0);\n            __lsx_vst(_sum9, outptr0 + 4 * 9, 0);\n            __lsx_vst(_suma, outptr0 + 4 * 10, 0);\n            __lsx_vst(_sumb, outptr0 + 4 * 11, 0);\n\n            outptr0 += 4 * 12;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8);\n            const float* kptr0 = kernel.channel(p);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum0 = bias ? (__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = _sum0;\n            __m128 _sum2 = _sum0;\n            __m128 _sum3 = _sum0;\n            __m128 _sum4 = _sum0;\n            __m128 _sum5 = _sum0;\n            __m128 _sum6 = _sum0;\n            __m128 _sum7 = _sum0;\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 32);\n                __builtin_prefetch(kptr0 + 16);\n                __m128i _val0123 = __lsx_vld(tmpptr, 0);\n                __m128i _val4567 = __lsx_vld(tmpptr + 4, 0);\n                __m128 _w0 = (__m128)__lsx_vld(kptr0, 0);\n                _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 0), _sum0);\n                _sum1 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 1), _sum1);\n                _sum2 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 2), _sum2);\n                _sum3 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 3), _sum3);\n                _sum4 = 
__lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 0), _sum4);\n                _sum5 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 1), _sum5);\n                _sum6 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 2), _sum6);\n                _sum7 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 3), _sum7);\n\n                tmpptr += 8;\n                kptr0 += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n            __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n            __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n            __lsx_vst(_sum4, outptr0 + 4 * 4, 0);\n            __lsx_vst(_sum5, outptr0 + 4 * 5, 0);\n            __lsx_vst(_sum6, outptr0 + 4 * 6, 0);\n            __lsx_vst(_sum7, outptr0 + 4 * 7, 0);\n\n            outptr0 += 4 * 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4);\n            const float* kptr0 = kernel.channel(p);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum0 = bias ? 
(__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = _sum0;\n            __m128 _sum2 = _sum0;\n            __m128 _sum3 = _sum0;\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr0 + 16);\n                __m128i _val0123 = __lsx_vld(tmpptr, 0);\n                __m128 _w0 = (__m128)__lsx_vld(kptr0, 0);\n                _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 0), _sum0);\n                _sum1 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 1), _sum1);\n                _sum2 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 2), _sum2);\n                _sum3 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 3), _sum3);\n\n                tmpptr += 4;\n                kptr0 += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n            __lsx_vst(_sum2, outptr0 + 4 * 2, 0);\n            __lsx_vst(_sum3, outptr0 + 4 * 3, 0);\n\n            outptr0 += 4 * 4;\n        }\n        for (; i + 1 < size; i += 2)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2);\n            const float* kptr0 = kernel.channel(p);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum0 = bias ? 
(__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = _sum0;\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 8);\n                __builtin_prefetch(kptr0 + 16);\n                __m128 _val0 = __lsx_vreplfr2vr_s(*tmpptr++);\n                __m128 _val1 = __lsx_vreplfr2vr_s(*tmpptr++);\n                __m128 _w0 = (__m128)__lsx_vld(kptr0, 0);\n                _sum0 = __lsx_vfmadd_s(_w0, _val0, _sum0);\n                _sum1 = __lsx_vfmadd_s(_w0, _val1, _sum1);\n\n                kptr0 += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n\n            outptr0 += 4 * 2;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2 + i % 12 % 2);\n            const float* kptr0 = kernel.channel(p);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum = bias ? 
(__m128)__lsx_vld(bias + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 4);\n                __builtin_prefetch(kptr0 + 16);\n                __m128 _val0 = __lsx_vreplfr2vr_s(*tmpptr++);\n                __m128 _w0 = (__m128)__lsx_vld(kptr0, 0);\n                _sum = __lsx_vfmadd_s(_w0, _val0, _sum);\n\n                kptr0 += 4;\n            }\n\n            __lsx_vst(_sum, outptr0, 0);\n\n            outptr0 += 4;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // im2col\n    Mat bottom_im2col(size, maxk, inch, 4u * 4, 4, opt.workspace_allocator);\n    {\n        const int gap = (w * stride_h - outw * stride_w) * 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            float* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; v++)\n                {\n                    const float* sptr = img.row<const float>(dilation_h * u) + dilation_w * v * 4;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j < outw; j++)\n                        {\n                            __m128 _val = (__m128)__lsx_vld(sptr, 0);\n                            __lsx_vst(_val, ptr, 0);\n\n                            sptr += stride_w * 4;\n                            ptr 
+= 4;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_pack4_lsx(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_sgemm_pack4to1.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_pack4to1_lsx(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 4u * 4, 4, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    Mat tmp;\n    if (size >= 12)\n        tmp.create(12 * maxk, inch, size / 12 + (size % 12) / 8 + (size % 12 % 8) / 4 + size % 12 % 4, 4u * 4, 4, opt.workspace_allocator);\n    else if (size >= 8)\n        tmp.create(8 * maxk, inch, size / 8 + (size % 8) / 4 + size % 4, 4u * 4, 4, opt.workspace_allocator);\n    else if (size >= 4)\n        tmp.create(4 * maxk, inch, size / 4 + size % 4, 4u * 4, 4, opt.workspace_allocator);\n    else\n        tmp.create(maxk, inch, size, 4u * 4, 4, opt.workspace_allocator);\n    {\n        int remain_size_start = 0;\n        int nn_size = size / 12;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 12;\n\n            float* tmpptr = tmp.channel(i / 12);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    // transpose 4x12\n                    __m128i _r0 = __lsx_vld(img0, 0);\n                    __m128i _r1 = __lsx_vld(img0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(img0 + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(img0 + 4 * 3, 0);\n                    __m128i _r4 = __lsx_vld(img0 + 4 * 4, 0);\n                    __m128i _r5 = __lsx_vld(img0 + 4 * 5, 0);\n                    
__m128i _r6 = __lsx_vld(img0 + 4 * 6, 0);\n                    __m128i _r7 = __lsx_vld(img0 + 4 * 7, 0);\n                    __m128i _r8 = __lsx_vld(img0 + 4 * 8, 0);\n                    __m128i _r9 = __lsx_vld(img0 + 4 * 9, 0);\n                    __m128i _ra = __lsx_vld(img0 + 4 * 10, 0);\n                    __m128i _rb = __lsx_vld(img0 + 4 * 11, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r45r = __lsx_vilvl_w(_r5, _r4);\n                    __m128i _r45l = __lsx_vilvh_w(_r5, _r4);\n                    __m128i _r67r = __lsx_vilvl_w(_r7, _r6);\n                    __m128i _r67l = __lsx_vilvh_w(_r7, _r6);\n                    __m128i _r89r = __lsx_vilvl_w(_r9, _r8);\n                    __m128i _r89l = __lsx_vilvh_w(_r9, _r8);\n                    __m128i _rabr = __lsx_vilvl_w(_rb, _ra);\n                    __m128i _rabl = __lsx_vilvh_w(_rb, _ra);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n                    __m128i _r4567_0 = __lsx_vilvl_d(_r67r, _r45r);\n                    __m128i _r4567_1 = __lsx_vilvh_d(_r67r, _r45r);\n                    __m128i _r4567_2 = __lsx_vilvl_d(_r67l, _r45l);\n                    __m128i _r4567_3 = __lsx_vilvh_d(_r67l, _r45l);\n                    __m128i _r89ab_0 = __lsx_vilvl_d(_rabr, _r89r);\n                    __m128i _r89ab_1 = __lsx_vilvh_d(_rabr, _r89r);\n                    __m128i _r89ab_2 = __lsx_vilvl_d(_rabl, _r89l);\n                    __m128i _r89ab_3 = __lsx_vilvh_d(_rabl, _r89l);\n\n                    __lsx_vst(_r0123_0, tmpptr, 0);\n           
         __lsx_vst(_r4567_0, tmpptr + 4, 0);\n                    __lsx_vst(_r89ab_0, tmpptr + 4 * 2, 0);\n                    __lsx_vst(_r0123_1, tmpptr + 4 * 3, 0);\n                    __lsx_vst(_r4567_1, tmpptr + 4 * 4, 0);\n                    __lsx_vst(_r89ab_1, tmpptr + 4 * 5, 0);\n                    __lsx_vst(_r0123_2, tmpptr + 4 * 6, 0);\n                    __lsx_vst(_r4567_2, tmpptr + 4 * 7, 0);\n                    __lsx_vst(_r89ab_2, tmpptr + 4 * 8, 0);\n                    __lsx_vst(_r0123_3, tmpptr + 4 * 9, 0);\n                    __lsx_vst(_r4567_3, tmpptr + 4 * 10, 0);\n                    __lsx_vst(_r89ab_3, tmpptr + 4 * 11, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 48;\n                }\n            }\n        }\n\n        remain_size_start += nn_size * 12;\n        nn_size = (size - remain_size_start) >> 3;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 8;\n\n            float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    // transpose 4x8\n                    __m128i _r0 = __lsx_vld(img0, 0);\n                    __m128i _r1 = __lsx_vld(img0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(img0 + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(img0 + 4 * 3, 0);\n                    __m128i _r4 = __lsx_vld(img0 + 4 * 4, 0);\n                    __m128i _r5 = __lsx_vld(img0 + 4 * 5, 0);\n                    __m128i _r6 = __lsx_vld(img0 + 4 * 6, 0);\n                    __m128i _r7 = __lsx_vld(img0 + 4 * 7, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n              
      __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r45r = __lsx_vilvl_w(_r5, _r4);\n                    __m128i _r45l = __lsx_vilvh_w(_r5, _r4);\n                    __m128i _r67r = __lsx_vilvl_w(_r7, _r6);\n                    __m128i _r67l = __lsx_vilvh_w(_r7, _r6);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n                    __m128i _r4567_0 = __lsx_vilvl_d(_r67r, _r45r);\n                    __m128i _r4567_1 = __lsx_vilvh_d(_r67r, _r45r);\n                    __m128i _r4567_2 = __lsx_vilvl_d(_r67l, _r45l);\n                    __m128i _r4567_3 = __lsx_vilvh_d(_r67l, _r45l);\n\n                    __lsx_vst(_r0123_0, tmpptr, 0);\n                    __lsx_vst(_r4567_0, tmpptr + 4, 0);\n                    __lsx_vst(_r0123_1, tmpptr + 4 * 2, 0);\n                    __lsx_vst(_r4567_1, tmpptr + 4 * 3, 0);\n                    __lsx_vst(_r0123_2, tmpptr + 4 * 4, 0);\n                    __lsx_vst(_r4567_2, tmpptr + 4 * 5, 0);\n                    __lsx_vst(_r0123_3, tmpptr + 4 * 6, 0);\n                    __lsx_vst(_r4567_3, tmpptr + 4 * 7, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 32;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 3;\n        nn_size = (size - remain_size_start) >> 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 4;\n\n            float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) 
+ i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = __lsx_vld(img0, 0);\n                    __m128i _r1 = __lsx_vld(img0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(img0 + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(img0 + 4 * 3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, tmpptr, 0);\n                    __lsx_vst(_r0123_1, tmpptr + 4, 0);\n                    __lsx_vst(_r0123_2, tmpptr + 4 * 2, 0);\n                    __lsx_vst(_r0123_3, tmpptr + 4 * 3, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 16;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 2;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + i % 12 % 4);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i * 4;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    __m128 _val = (__m128)__lsx_vld(img0, 0);\n                    __lsx_vst(_val, tmpptr, 0);\n\n                    img0 += size * 4;\n                    tmpptr += 4;\n                }\n            }\n        }\n    }\n\n    int nn_outch = outch / 
4;\n    int remain_outch_start = nn_outch * 4;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 4;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n        float* outptr2 = top_blob.channel(p + 2);\n        float* outptr3 = top_blob.channel(p + 3);\n\n        const float zeros[4] = {0.f};\n        const float* biasptr = bias ? bias + p : zeros;\n\n        int i = 0;\n        for (; i + 11 < size; i += 12)\n        {\n            const float* tmpptr = tmp.channel(i / 12);\n            const float* kptr0 = kernel.channel(p / 4);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128i _bias = __lsx_vld(biasptr, 0);\n            __m128 _sum0 = (__m128)__lsx_vreplvei_w(_bias, 0);\n            __m128 _sum1 = (__m128)__lsx_vreplvei_w(_bias, 0);\n            __m128 _sum2 = (__m128)__lsx_vreplvei_w(_bias, 0);\n            __m128 _sum3 = (__m128)__lsx_vreplvei_w(_bias, 1);\n            __m128 _sum4 = (__m128)__lsx_vreplvei_w(_bias, 1);\n            __m128 _sum5 = (__m128)__lsx_vreplvei_w(_bias, 1);\n            __m128 _sum6 = (__m128)__lsx_vreplvei_w(_bias, 2);\n            __m128 _sum7 = (__m128)__lsx_vreplvei_w(_bias, 2);\n            __m128 _sum8 = (__m128)__lsx_vreplvei_w(_bias, 2);\n            __m128 _sum9 = (__m128)__lsx_vreplvei_w(_bias, 3);\n            __m128 _suma = (__m128)__lsx_vreplvei_w(_bias, 3);\n            __m128 _sumb = (__m128)__lsx_vreplvei_w(_bias, 3);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 48);\n                __builtin_prefetch(kptr0 + 16);\n                __m128 _val0 = (__m128)__lsx_vld(tmpptr, 0);\n                __m128 _val1 = (__m128)__lsx_vld(tmpptr + 4, 0);\n                __m128 _val2 = (__m128)__lsx_vld(tmpptr + 8, 0);\n                __m128i _w0123 = __lsx_vld(kptr0, 0);\n                _sum0 = 
__lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val0, _sum0);\n                _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val1, _sum1);\n                _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val2, _sum2);\n                _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val0, _sum3);\n                _sum4 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val1, _sum4);\n                _sum5 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val2, _sum5);\n                _sum6 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val0, _sum6);\n                _sum7 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val1, _sum7);\n                _sum8 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val2, _sum8);\n                _sum9 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val0, _sum9);\n                _suma = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val1, _suma);\n                _sumb = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val2, _sumb);\n\n                tmpptr += 12;\n                kptr0 += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n            __lsx_vst(_sum2, outptr0 + 8, 0);\n            __lsx_vst(_sum3, outptr1, 0);\n            __lsx_vst(_sum4, outptr1 + 4, 0);\n            __lsx_vst(_sum5, outptr1 + 8, 0);\n            __lsx_vst(_sum6, outptr2, 0);\n            __lsx_vst(_sum7, outptr2 + 4, 0);\n            __lsx_vst(_sum8, outptr2 + 8, 0);\n            __lsx_vst(_sum9, outptr3, 0);\n            __lsx_vst(_suma, outptr3 + 4, 0);\n            __lsx_vst(_sumb, outptr3 + 8, 0);\n\n            outptr0 += 12;\n            outptr1 += 12;\n            outptr2 += 12;\n            outptr3 += 12;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8);\n            const float* kptr0 = 
kernel.channel(p / 4);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128i _bias = __lsx_vld(biasptr, 0);\n            __m128 _sum0 = (__m128)__lsx_vreplvei_w(_bias, 0);\n            __m128 _sum1 = (__m128)__lsx_vreplvei_w(_bias, 0);\n            __m128 _sum2 = (__m128)__lsx_vreplvei_w(_bias, 1);\n            __m128 _sum3 = (__m128)__lsx_vreplvei_w(_bias, 1);\n            __m128 _sum4 = (__m128)__lsx_vreplvei_w(_bias, 2);\n            __m128 _sum5 = (__m128)__lsx_vreplvei_w(_bias, 2);\n            __m128 _sum6 = (__m128)__lsx_vreplvei_w(_bias, 3);\n            __m128 _sum7 = (__m128)__lsx_vreplvei_w(_bias, 3);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 32);\n                __builtin_prefetch(kptr0 + 16);\n                __m128 _val0 = (__m128)__lsx_vld(tmpptr, 0);\n                __m128 _val1 = (__m128)__lsx_vld(tmpptr + 4, 0);\n                __m128i _w0123 = __lsx_vld(kptr0, 0);\n                _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val0, _sum0);\n                _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val1, _sum1);\n                _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val0, _sum2);\n                _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val1, _sum3);\n                _sum4 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val0, _sum4);\n                _sum5 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val1, _sum5);\n                _sum6 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val0, _sum6);\n                _sum7 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val1, _sum7);\n\n                tmpptr += 8;\n                kptr0 += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n            __lsx_vst(_sum2, outptr1, 0);\n            __lsx_vst(_sum3, outptr1 + 4, 0);\n            
__lsx_vst(_sum4, outptr2, 0);\n            __lsx_vst(_sum5, outptr2 + 4, 0);\n            __lsx_vst(_sum6, outptr3, 0);\n            __lsx_vst(_sum7, outptr3 + 4, 0);\n\n            outptr0 += 8;\n            outptr1 += 8;\n            outptr2 += 8;\n            outptr3 += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4);\n            const float* kptr0 = kernel.channel(p / 4);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128i _bias = __lsx_vld(biasptr, 0);\n            __m128 _sum0 = (__m128)__lsx_vreplvei_w(_bias, 0);\n            __m128 _sum1 = (__m128)__lsx_vreplvei_w(_bias, 1);\n            __m128 _sum2 = (__m128)__lsx_vreplvei_w(_bias, 2);\n            __m128 _sum3 = (__m128)__lsx_vreplvei_w(_bias, 3);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr0 + 16);\n                __m128 _val0 = (__m128)__lsx_vld(tmpptr, 0);\n                __m128i _w0123 = __lsx_vld(kptr0, 0);\n                _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val0, _sum0);\n                _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val0, _sum1);\n                _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val0, _sum2);\n                _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val0, _sum3);\n\n                tmpptr += 4;\n                kptr0 += 4;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr1, 0);\n            __lsx_vst(_sum2, outptr2, 0);\n            __lsx_vst(_sum3, outptr3, 0);\n\n            outptr0 += 4;\n            outptr1 += 4;\n            outptr2 += 4;\n            outptr3 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + 
i % 12 % 4);\n            const float* kptr0 = kernel.channel(p / 4);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum = (__m128)__lsx_vld(biasptr, 0);\n            float* _sum_p = (float*)&_sum;\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 4);\n                __builtin_prefetch(kptr0 + 16);\n                __m128 _val0 = __lsx_vreplfr2vr_s(*tmpptr++);\n                __m128 _w0 = (__m128)__lsx_vld(kptr0, 0);\n                _sum = __lsx_vfmadd_s(_w0, _val0, _sum);\n\n                kptr0 += 4;\n            }\n\n            outptr0[0] = _sum_p[0];\n            outptr1[0] = _sum_p[1];\n            outptr2[0] = _sum_p[2];\n            outptr3[0] = _sum_p[3];\n\n            outptr0 += 1;\n            outptr1 += 1;\n            outptr2 += 1;\n            outptr3 += 1;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        float* outptr0 = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n\n        int i = 0;\n        for (; i + 11 < size; i += 12)\n        {\n            const float* tmpptr = tmp.channel(i / 12);\n            const float* kptr0 = kernel.channel(p / 4 + p % 4);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum0 = __lsx_vreplfr2vr_s(bias0);\n            __m128 _sum1 = __lsx_vreplfr2vr_s(bias0);\n            __m128 _sum2 = __lsx_vreplfr2vr_s(bias0);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 48);\n                __builtin_prefetch(kptr0 + 4);\n                __m128 _val0 = (__m128)__lsx_vld(tmpptr, 0);\n                __m128 _val1 = (__m128)__lsx_vld(tmpptr + 4, 0);\n                __m128 _val2 = (__m128)__lsx_vld(tmpptr + 8, 0);\n                __m128 _w0 = __lsx_vreplfr2vr_s(*kptr0);\n                _sum0 = __lsx_vfmadd_s(_val0, _w0, _sum0);\n                _sum1 = __lsx_vfmadd_s(_val1, _w0, _sum1);\n                _sum2 = __lsx_vfmadd_s(_val2, _w0, _sum2);\n\n                tmpptr += 12;\n                kptr0 += 1;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n            __lsx_vst(_sum2, outptr0 + 8, 0);\n\n            outptr0 += 12;\n        }\n        for (; i + 7 < size; i += 8)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8);\n            const float* kptr0 = kernel.channel(p / 4 + p % 4);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum0 = __lsx_vreplfr2vr_s(bias0);\n            __m128 _sum1 = __lsx_vreplfr2vr_s(bias0);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 32);\n                __builtin_prefetch(kptr0 + 4);\n                __m128 _val0 = (__m128)__lsx_vld(tmpptr, 0);\n                __m128 _val1 = (__m128)__lsx_vld(tmpptr + 4, 0);\n                __m128 _w0 = __lsx_vreplfr2vr_s(*kptr0);\n                
_sum0 = __lsx_vfmadd_s(_val0, _w0, _sum0);\n                _sum1 = __lsx_vfmadd_s(_val1, _w0, _sum1);\n\n                tmpptr += 8;\n                kptr0 += 1;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n            __lsx_vst(_sum1, outptr0 + 4, 0);\n\n            outptr0 += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4);\n            const float* kptr0 = kernel.channel(p / 4 + p % 4);\n\n            int nn = inch * maxk * 4; // inch always > 0\n\n            __m128 _sum0 = __lsx_vreplfr2vr_s(bias0);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr0 + 4);\n                __m128 _val0 = (__m128)__lsx_vld(tmpptr, 0);\n                __m128 _w0 = __lsx_vreplfr2vr_s(*kptr0);\n                _sum0 = __lsx_vfmadd_s(_val0, _w0, _sum0);\n\n                tmpptr += 4;\n                kptr0 += 1;\n            }\n\n            __lsx_vst(_sum0, outptr0, 0);\n\n            outptr0 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + i % 12 % 4);\n            const float* kptr0 = kernel.channel(p / 4 + p % 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = bias0;\n\n            __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n\n            for (int j = 0; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr0 + 16);\n                __m128 _val0 = (__m128)__lsx_vld(tmpptr, 0);\n                __m128 _w0 = (__m128)__lsx_vld(kptr0, 0);\n                _sum0 = __lsx_vfmadd_s(_w0, _val0, _sum0);\n                tmpptr += 4;\n                kptr0 += 4;\n            }\n\n            sum0 += __lsx_reduce_fadd_s(_sum0);\n\n            outptr0[0] = sum0;\n\n       
     outptr0 += 1;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_transform_kernel_pack4to1_lsx(const Mat& _kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // interleave\n    // src = maxk-inch-outch\n    // dst = pb-pa-maxk-inch/pa-outch/pb\n    Mat kernel = _kernel.reshape(maxk, inch, outch);\n    kernel_tm.create(4 * 4 * maxk, inch / 4, outch / 4 + outch % 4);\n\n    int q = 0;\n    for (; q + 3 < outch; q += 4)\n    {\n        float* g00 = kernel_tm.channel(q / 4);\n\n        for (int p = 0; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const float* k00 = kernel.channel(q + j).row(p + i);\n\n                        g00[0] = k00[k];\n\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n    for (; q < outch; q++)\n    {\n        const Mat k0 = kernel.channel(q);\n\n        float* g00 = kernel_tm.channel(q / 4 + q % 4);\n\n        for (int p = 0; p + 3 < inch; p += 4)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int j = 0; j < 4; j++)\n                {\n                    const float* k00 = k0.row(p + j);\n\n                    g00[0] = k00[k];\n\n                    g00++;\n                }\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_pack4to1_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // 
im2col\n    Mat bottom_im2col(size, maxk, inch, 4u * 4, 4, opt.workspace_allocator);\n    {\n        const int gap = (w * stride_h - outw * stride_w) * 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            float* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; v++)\n                {\n                    const float* sptr = img.row(dilation_h * u) + dilation_w * v * 4;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j < outw; j++)\n                        {\n                            __m128 _val = (__m128)__lsx_vld(sptr, 0);\n                            __lsx_vst(_val, ptr, 0);\n\n                            sptr += stride_w * 4;\n                            ptr += 4;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_pack4to1_lsx(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_sgemm_pack8to1_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_pack8to1_int8_lsx(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 8u, 8, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    // permute\n    Mat tmp;\n    if (size >= 2)\n        tmp.create(2 * maxk, inch, size / 2 + size % 2, 8u, 8, opt.workspace_allocator);\n    else\n        tmp.create(maxk, inch, size, 8u, 8, opt.workspace_allocator);\n    {\n        int remain_size_start = 0;\n        int nn_size = (size - remain_size_start) >> 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 2;\n\n            int64_t* tmpptr = tmp.channel(i / 2);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const int64_t* img0 = (const int64_t*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    __m128i _v = __lsx_vld(img0, 0);\n                    __lsx_vst(_v, tmpptr, 0);\n                    tmpptr += 2;\n                    img0 += size;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            int64_t* tmpptr = tmp.channel(i / 2 + i % 2);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const int64_t* img0 = (const int64_t*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    tmpptr += 1;\n        
            img0 += size;\n                }\n            }\n        }\n    }\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n\n    nn_outch = outch >> 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 4;\n\n        int* outptr0 = top_blob.channel(p);\n        int* outptr1 = top_blob.channel(p + 1);\n        int* outptr2 = top_blob.channel(p + 2);\n        int* outptr3 = top_blob.channel(p + 3);\n\n        int i = 0;\n        for (; i + 1 < size; i += 2)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2);\n            const signed char* kptr = kernel.channel(p / 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128i _sum00 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum01 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum02 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum03 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum10 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum11 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum12 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum13 = __lsx_vreplgr2vr_w(0);\n\n            int j = 0;\n            for (; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 64);\n                __builtin_prefetch(kptr + 128);\n                __m128i _val01 = __lsx_vld(tmpptr, 0);\n                __m128i _extval01 = __lsx_vslti_b(_val01, 0);\n                __m128i _val0 = __lsx_vilvl_b(_extval01, _val01);\n                __m128i _val1 = __lsx_vilvh_b(_extval01, _val01);\n\n                __m128i _w01 = __lsx_vld(kptr, 0);\n                __m128i _w23 = __lsx_vld(kptr + 16, 0);\n                __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                __m128i _extw23 = __lsx_vslti_b(_w23, 0);\n                __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n                __m128i _w2 = 
__lsx_vilvl_b(_extw23, _w23);\n                __m128i _w3 = __lsx_vilvh_b(_extw23, _w23);\n\n                __m128i _s00 = __lsx_vmul_h(_val0, _w0);\n                __m128i _s01 = __lsx_vmul_h(_val0, _w1);\n                __m128i _s02 = __lsx_vmul_h(_val0, _w2);\n                __m128i _s03 = __lsx_vmul_h(_val0, _w3);\n                __m128i _s10 = __lsx_vmul_h(_val1, _w0);\n                __m128i _s11 = __lsx_vmul_h(_val1, _w1);\n                __m128i _s12 = __lsx_vmul_h(_val1, _w2);\n                __m128i _s13 = __lsx_vmul_h(_val1, _w3);\n\n                _sum00 = __lsx_vadd_w(_sum00, __lsx_vhaddw_w_h(_s00, _s00));\n                _sum01 = __lsx_vadd_w(_sum01, __lsx_vhaddw_w_h(_s01, _s01));\n                _sum02 = __lsx_vadd_w(_sum02, __lsx_vhaddw_w_h(_s02, _s02));\n                _sum03 = __lsx_vadd_w(_sum03, __lsx_vhaddw_w_h(_s03, _s03));\n                _sum10 = __lsx_vadd_w(_sum10, __lsx_vhaddw_w_h(_s10, _s10));\n                _sum11 = __lsx_vadd_w(_sum11, __lsx_vhaddw_w_h(_s11, _s11));\n                _sum12 = __lsx_vadd_w(_sum12, __lsx_vhaddw_w_h(_s12, _s12));\n                _sum13 = __lsx_vadd_w(_sum13, __lsx_vhaddw_w_h(_s13, _s13));\n\n                tmpptr += 16;\n                kptr += 32;\n            }\n\n            // transpose 4x4\n            {\n                __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                _tmp0 = __lsx_vilvl_w(_sum01, _sum00);\n                _tmp1 = __lsx_vilvl_w(_sum03, _sum02);\n                _tmp2 = __lsx_vilvh_w(_sum01, _sum00);\n                _tmp3 = __lsx_vilvh_w(_sum03, _sum02);\n                _sum00 = __lsx_vilvl_d(_tmp1, _tmp0);\n                _sum01 = __lsx_vilvh_d(_tmp1, _tmp0);\n                _sum02 = __lsx_vilvl_d(_tmp3, _tmp2);\n                _sum03 = __lsx_vilvh_d(_tmp3, _tmp2);\n            }\n            {\n                __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                _tmp0 = __lsx_vilvl_w(_sum11, _sum10);\n                _tmp1 = __lsx_vilvl_w(_sum13, 
_sum12);\n                _tmp2 = __lsx_vilvh_w(_sum11, _sum10);\n                _tmp3 = __lsx_vilvh_w(_sum13, _sum12);\n                _sum10 = __lsx_vilvl_d(_tmp1, _tmp0);\n                _sum11 = __lsx_vilvh_d(_tmp1, _tmp0);\n                _sum12 = __lsx_vilvl_d(_tmp3, _tmp2);\n                _sum13 = __lsx_vilvh_d(_tmp3, _tmp2);\n            }\n\n            _sum00 = __lsx_vadd_w(_sum00, _sum01);\n            _sum02 = __lsx_vadd_w(_sum02, _sum03);\n            _sum10 = __lsx_vadd_w(_sum10, _sum11);\n            _sum12 = __lsx_vadd_w(_sum12, _sum13);\n\n            _sum00 = __lsx_vadd_w(_sum00, _sum02);\n            _sum10 = __lsx_vadd_w(_sum10, _sum12);\n\n            int sum[8];\n            __lsx_vst(_sum00, sum, 0);\n            __lsx_vst(_sum10, sum + 4, 0);\n\n            outptr0[0] = sum[0];\n            outptr1[0] = sum[1];\n            outptr2[0] = sum[2];\n            outptr3[0] = sum[3];\n            outptr0[1] = sum[4];\n            outptr1[1] = sum[5];\n            outptr2[1] = sum[6];\n            outptr3[1] = sum[7];\n            outptr0 += 2;\n            outptr1 += 2;\n            outptr2 += 2;\n            outptr3 += 2;\n        }\n        for (; i < size; i++)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n            const signed char* kptr = kernel.channel(p / 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n            int j = 0;\n            for (; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 32);\n                __builtin_prefetch(kptr + 128);\n                __m128i _val = __lsx_vld(tmpptr, 0);\n                __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                __m128i _w01 = __lsx_vld(kptr, 0);\n                __m128i 
_w23 = __lsx_vld(kptr + 16, 0);\n                __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                __m128i _extw23 = __lsx_vslti_b(_w23, 0);\n                __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n                __m128i _w2 = __lsx_vilvl_b(_extw23, _w23);\n                __m128i _w3 = __lsx_vilvh_b(_extw23, _w23);\n\n                __m128i _s0 = __lsx_vmul_h(_val16, _w0);\n                __m128i _s1 = __lsx_vmul_h(_val16, _w1);\n                __m128i _s2 = __lsx_vmul_h(_val16, _w2);\n                __m128i _s3 = __lsx_vmul_h(_val16, _w3);\n\n                _sum0 = __lsx_vadd_w(_sum0, __lsx_vhaddw_w_h(_s0, _s0));\n                _sum1 = __lsx_vadd_w(_sum1, __lsx_vhaddw_w_h(_s1, _s1));\n                _sum2 = __lsx_vadd_w(_sum2, __lsx_vhaddw_w_h(_s2, _s2));\n                _sum3 = __lsx_vadd_w(_sum3, __lsx_vhaddw_w_h(_s3, _s3));\n\n                tmpptr += 8;\n                kptr += 32;\n            }\n\n            // transpose 4x4\n            {\n                __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                _tmp0 = __lsx_vilvl_w(_sum1, _sum0);\n                _tmp1 = __lsx_vilvl_w(_sum3, _sum2);\n                _tmp2 = __lsx_vilvh_w(_sum1, _sum0);\n                _tmp3 = __lsx_vilvh_w(_sum3, _sum2);\n                _sum0 = __lsx_vilvl_d(_tmp1, _tmp0);\n                _sum1 = __lsx_vilvh_d(_tmp1, _tmp0);\n                _sum2 = __lsx_vilvl_d(_tmp3, _tmp2);\n                _sum3 = __lsx_vilvh_d(_tmp3, _tmp2);\n            }\n\n            _sum0 = __lsx_vadd_w(_sum0, _sum1);\n            _sum2 = __lsx_vadd_w(_sum2, _sum3);\n\n            _sum0 = __lsx_vadd_w(_sum0, _sum2);\n\n            int sum[4];\n            __lsx_vst(_sum0, sum, 0);\n\n            outptr0[0] = sum[0];\n            outptr1[0] = sum[1];\n            outptr2[0] = sum[2];\n            outptr3[0] = sum[3];\n            outptr0 += 1;\n            outptr1 += 1;\n            outptr2 += 1;\n            
outptr3 += 1;\n        }\n    }\n\n    remain_outch_start += nn_outch << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        int* outptr0 = top_blob.channel(p);\n\n        int i = 0;\n        for (; i + 1 < size; i += 2)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2);\n            const signed char* kptr = kernel.channel(p / 4 + p % 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n            int j = 0;\n            for (; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 64);\n                __builtin_prefetch(kptr + 32);\n                __m128i _val01 = __lsx_vld(tmpptr, 0);\n                __m128i _extval01 = __lsx_vslti_b(_val01, 0);\n                __m128i _val0 = __lsx_vilvl_b(_extval01, _val01);\n                __m128i _val1 = __lsx_vilvh_b(_extval01, _val01);\n\n                __m128i _w = __lsx_vld(kptr, 0);\n                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                __m128i _s0 = __lsx_vmul_h(_val0, _w16);\n                __m128i _s1 = __lsx_vmul_h(_val1, _w16);\n\n                _sum0 = __lsx_vadd_w(_sum0, __lsx_vhaddw_w_h(_s0, _s0));\n                _sum1 = __lsx_vadd_w(_sum1, __lsx_vhaddw_w_h(_s1, _s1));\n\n                tmpptr += 16;\n                kptr += 8;\n            }\n\n            outptr0[0] = __lsx_reduce_add_w(_sum0);\n            outptr0[1] = __lsx_reduce_add_w(_sum1);\n            outptr0 += 2;\n        }\n        for (; i < size; i++)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n            const signed char* kptr = kernel.channel(p / 4 + p % 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128i _sum = __lsx_vreplgr2vr_w(0);\n\n            int j = 0;\n            for (; j < nn; j++)\n   
         {\n                __builtin_prefetch(tmpptr + 32);\n                __builtin_prefetch(kptr + 32);\n                __m128i _val = __lsx_vld(tmpptr, 0);\n                __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                __m128i _w = __lsx_vld(kptr, 0);\n                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                __m128i _s0 = __lsx_vmul_h(_val16, _w16);\n\n                _sum = __lsx_vadd_w(_sum, __lsx_vhaddw_w_h(_s0, _s0));\n\n                tmpptr += 8;\n                kptr += 8;\n            }\n\n            outptr0[0] = __lsx_reduce_add_w(_sum);\n            outptr0 += 1;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_transform_kernel_pack8to1_int8_lsx(const Mat& _kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // interleave\n    // src = maxk-inch-outch\n    // dst = 8a-4b-maxk-inch/8a-outch/4b\n    Mat kernel = _kernel.reshape(maxk, inch, outch);\n    if (outch >= 4)\n        kernel_tm.create(32 * maxk, inch / 8, outch / 4 + outch % 4, (size_t)1u);\n    else\n        kernel_tm.create(8 * maxk, inch / 8, outch, (size_t)1u);\n\n    int q = 0;\n    for (; q + 3 < outch; q += 4)\n    {\n        signed char* g00 = kernel_tm.channel(q / 4);\n\n        for (int p = 0; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 8; j++)\n                    {\n                        const signed char* k00 = kernel.channel(q + i).row<const signed char>(p + j);\n\n                        g00[0] = k00[k];\n\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n    // TODO unroll 2\n    for (; q < outch; q++)\n    {\n        signed char* g00 = kernel_tm.channel(q / 4 + q % 4);\n\n        for (int p = 0; p + 7 < inch; p 
+= 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int j = 0; j < 8; j++)\n                {\n                    const signed char* k00 = kernel.channel(q).row<const signed char>(p + j);\n\n                    g00[0] = k00[k];\n\n                    g00++;\n                }\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_pack8to1_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // im2col\n    Mat bottom_im2col(size, maxk, inch, 8u, 8, opt.workspace_allocator);\n    {\n        const int gap = w * stride_h - outw * stride_w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            int64_t* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; v++)\n                {\n                    const int64_t* sptr = img.row<const int64_t>(dilation_h * u) + dilation_w * v;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j < outw; j++)\n                        {\n                            ptr[0] = sptr[0];\n\n                            sptr += stride_w;\n                            ptr += 1;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_pack8to1_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_sgemm_pack8to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_pack8to4_int8_lsx(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 8u, 8, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    // permute\n    Mat tmp;\n    if (size >= 2)\n        tmp.create(2 * maxk, inch, size / 2 + size % 2, 8u, 8, opt.workspace_allocator);\n    else\n        tmp.create(maxk, inch, size, 8u, 8, opt.workspace_allocator);\n    {\n        int remain_size_start = 0;\n        int nn_size = size >> 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = remain_size_start + ii * 2;\n\n            int64_t* tmpptr = tmp.channel(i / 2);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const int64_t* img0 = (const int64_t*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    __m128i _v = __lsx_vld(img0, 0);\n                    __lsx_vst(_v, tmpptr, 0);\n                    tmpptr += 2;\n                    img0 += size;\n                }\n            }\n        }\n\n        remain_size_start += nn_size << 1;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            int64_t* tmpptr = tmp.channel(i / 2 + i % 2);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const int64_t* img0 = (const int64_t*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n                    tmpptr[0] = img0[0];\n                    tmpptr += 1;\n                    img0 += 
size;\n                }\n            }\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr0 = top_blob.channel(p);\n\n        int i = 0;\n        for (; i + 1 < size; i += 2)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2);\n            const signed char* kptr = kernel.channel(p);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128i _sum00 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum01 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum02 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum03 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum10 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum11 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum12 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum13 = __lsx_vreplgr2vr_w(0);\n\n            int j = 0;\n            for (; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 64);\n                __builtin_prefetch(kptr + 128);\n                __m128i _val01 = __lsx_vld(tmpptr, 0);\n                __m128i _extval01 = __lsx_vslti_b(_val01, 0);\n                __m128i _val0 = __lsx_vilvl_b(_extval01, _val01);\n                __m128i _val1 = __lsx_vilvh_b(_extval01, _val01);\n\n                __m128i _w01 = __lsx_vld(kptr, 0);\n                __m128i _w23 = __lsx_vld(kptr + 16, 0);\n                __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                __m128i _extw23 = __lsx_vslti_b(_w23, 0);\n                __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n                __m128i _w2 = __lsx_vilvl_b(_extw23, _w23);\n                __m128i _w3 = __lsx_vilvh_b(_extw23, _w23);\n\n                __m128i _s00 = __lsx_vmul_h(_val0, _w0);\n                __m128i _s01 = __lsx_vmul_h(_val0, _w1);\n                __m128i _s02 = __lsx_vmul_h(_val0, _w2);\n                __m128i _s03 = 
__lsx_vmul_h(_val0, _w3);\n                __m128i _s10 = __lsx_vmul_h(_val1, _w0);\n                __m128i _s11 = __lsx_vmul_h(_val1, _w1);\n                __m128i _s12 = __lsx_vmul_h(_val1, _w2);\n                __m128i _s13 = __lsx_vmul_h(_val1, _w3);\n\n                _sum00 = __lsx_vadd_w(_sum00, __lsx_vhaddw_w_h(_s00, _s00));\n                _sum01 = __lsx_vadd_w(_sum01, __lsx_vhaddw_w_h(_s01, _s01));\n                _sum02 = __lsx_vadd_w(_sum02, __lsx_vhaddw_w_h(_s02, _s02));\n                _sum03 = __lsx_vadd_w(_sum03, __lsx_vhaddw_w_h(_s03, _s03));\n                _sum10 = __lsx_vadd_w(_sum10, __lsx_vhaddw_w_h(_s10, _s10));\n                _sum11 = __lsx_vadd_w(_sum11, __lsx_vhaddw_w_h(_s11, _s11));\n                _sum12 = __lsx_vadd_w(_sum12, __lsx_vhaddw_w_h(_s12, _s12));\n                _sum13 = __lsx_vadd_w(_sum13, __lsx_vhaddw_w_h(_s13, _s13));\n\n                tmpptr += 16;\n                kptr += 32;\n            }\n\n            // transpose 4x4\n            {\n                __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                _tmp0 = __lsx_vilvl_w(_sum01, _sum00);\n                _tmp1 = __lsx_vilvl_w(_sum03, _sum02);\n                _tmp2 = __lsx_vilvh_w(_sum01, _sum00);\n                _tmp3 = __lsx_vilvh_w(_sum03, _sum02);\n                _sum00 = __lsx_vilvl_d(_tmp1, _tmp0);\n                _sum01 = __lsx_vilvh_d(_tmp1, _tmp0);\n                _sum02 = __lsx_vilvl_d(_tmp3, _tmp2);\n                _sum03 = __lsx_vilvh_d(_tmp3, _tmp2);\n            }\n            {\n                __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                _tmp0 = __lsx_vilvl_w(_sum11, _sum10);\n                _tmp1 = __lsx_vilvl_w(_sum13, _sum12);\n                _tmp2 = __lsx_vilvh_w(_sum11, _sum10);\n                _tmp3 = __lsx_vilvh_w(_sum13, _sum12);\n                _sum10 = __lsx_vilvl_d(_tmp1, _tmp0);\n                _sum11 = __lsx_vilvh_d(_tmp1, _tmp0);\n                _sum12 = __lsx_vilvl_d(_tmp3, _tmp2);\n                
_sum13 = __lsx_vilvh_d(_tmp3, _tmp2);\n            }\n\n            _sum00 = __lsx_vadd_w(_sum00, _sum01);\n            _sum02 = __lsx_vadd_w(_sum02, _sum03);\n            _sum10 = __lsx_vadd_w(_sum10, _sum11);\n            _sum12 = __lsx_vadd_w(_sum12, _sum13);\n\n            _sum00 = __lsx_vadd_w(_sum00, _sum02);\n            _sum10 = __lsx_vadd_w(_sum10, _sum12);\n\n            __lsx_vst(_sum00, outptr0, 0);\n            __lsx_vst(_sum10, outptr0 + 4, 0);\n            outptr0 += 8;\n        }\n        for (; i < size; i++)\n        {\n            const signed char* tmpptr = tmp.channel(i / 2 + i % 2);\n            const signed char* kptr = kernel.channel(p);\n\n            int nn = inch * maxk; // inch always > 0\n\n            __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n            int j = 0;\n            for (; j < nn; j++)\n            {\n                __builtin_prefetch(tmpptr + 32);\n                __builtin_prefetch(kptr + 128);\n                __m128i _val = __lsx_vld(tmpptr, 0);\n                __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                __m128i _w01 = __lsx_vld(kptr, 0);\n                __m128i _w23 = __lsx_vld(kptr + 16, 0);\n                __m128i _extw01 = __lsx_vslti_b(_w01, 0);\n                __m128i _extw23 = __lsx_vslti_b(_w23, 0);\n                __m128i _w0 = __lsx_vilvl_b(_extw01, _w01);\n                __m128i _w1 = __lsx_vilvh_b(_extw01, _w01);\n                __m128i _w2 = __lsx_vilvl_b(_extw23, _w23);\n                __m128i _w3 = __lsx_vilvh_b(_extw23, _w23);\n\n                __m128i _s0 = __lsx_vmul_h(_val16, _w0);\n                __m128i _s1 = __lsx_vmul_h(_val16, _w1);\n                __m128i _s2 = __lsx_vmul_h(_val16, _w2);\n                __m128i _s3 = __lsx_vmul_h(_val16, _w3);\n\n                _sum0 = __lsx_vadd_w(_sum0, 
__lsx_vhaddw_w_h(_s0, _s0));\n                _sum1 = __lsx_vadd_w(_sum1, __lsx_vhaddw_w_h(_s1, _s1));\n                _sum2 = __lsx_vadd_w(_sum2, __lsx_vhaddw_w_h(_s2, _s2));\n                _sum3 = __lsx_vadd_w(_sum3, __lsx_vhaddw_w_h(_s3, _s3));\n\n                tmpptr += 8;\n                kptr += 32;\n            }\n\n            // transpose 4x4\n            {\n                __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                _tmp0 = __lsx_vilvl_w(_sum1, _sum0);\n                _tmp1 = __lsx_vilvl_w(_sum3, _sum2);\n                _tmp2 = __lsx_vilvh_w(_sum1, _sum0);\n                _tmp3 = __lsx_vilvh_w(_sum3, _sum2);\n                _sum0 = __lsx_vilvl_d(_tmp1, _tmp0);\n                _sum1 = __lsx_vilvh_d(_tmp1, _tmp0);\n                _sum2 = __lsx_vilvl_d(_tmp3, _tmp2);\n                _sum3 = __lsx_vilvh_d(_tmp3, _tmp2);\n            }\n\n            _sum0 = __lsx_vadd_w(_sum0, _sum1);\n            _sum2 = __lsx_vadd_w(_sum2, _sum3);\n\n            _sum0 = __lsx_vadd_w(_sum0, _sum2);\n\n            __lsx_vst(_sum0, outptr0, 0);\n            outptr0 += 4;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_transform_kernel_pack8to4_int8_lsx(const Mat& _kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // interleave\n    // src = maxk-inch-outch\n    // dst = 8a-4b-maxk-inch/8a-outch/4b\n    Mat kernel = _kernel.reshape(maxk, inch, outch);\n    kernel_tm.create(32 * maxk, inch / 8, outch / 4, (size_t)1u);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        signed char* g00 = kernel_tm.channel(q / 4);\n\n        for (int p = 0; p + 7 < inch; p += 8)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 8; j++)\n                    {\n                        const signed char* k00 = kernel.channel(q + i).row<const signed char>(p + 
j);\n\n                        g00[0] = k00[k];\n\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_pack8to4_int8_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // im2col\n    Mat bottom_im2col(size, maxk, inch, 8u, 8, opt.workspace_allocator);\n    {\n        const int gap = w * stride_h - outw * stride_w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            int64_t* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; v++)\n                {\n                    const int64_t* sptr = img.row<const int64_t>(dilation_h * u) + dilation_w * v;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j < outw; j++)\n                        {\n                            ptr[0] = sptr[0];\n\n                            sptr += stride_w;\n                            ptr += 1;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_pack8to4_int8_lsx(bottom_im2col, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_dot.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_winograd_dot_lsx(Mat& bottom_blob_tm, int outch, const Mat& kernel_tm, Mat& top_blob_tm, const Option& opt)\n{\n    // Mat bottom_blob_tm(tiles, 16/36/64, inch, 4u, opt.workspace_allocator);\n\n    const int tiles = bottom_blob_tm.w;\n    const int batch = bottom_blob_tm.h;\n    const int inch = bottom_blob_tm.c;\n\n    // permute\n    Mat bottom_blob_tm2;\n    if (tiles >= 4)\n        bottom_blob_tm2.create(4 * inch, tiles / 4 + tiles % 4, batch, 4u, opt.workspace_allocator);\n    else\n        bottom_blob_tm2.create(1 * inch, tiles, batch, 4u, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int r = 0; r < batch; r++)\n    {\n        Mat tm2 = bottom_blob_tm2.channel(r);\n\n        // tile\n        int i = 0;\n        for (; i + 3 < tiles; i += 4)\n        {\n            float* tmpptr = tm2.row(i / 4);\n\n            const float* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i);\n\n            for (int q = 0; q < inch; q++)\n            {\n#if __loongarch_sx\n                __lsx_vst(__lsx_vld(r0, 0), tmpptr, 0);\n#else\n                tmpptr[0] = r0[0];\n                tmpptr[1] = r0[1];\n                tmpptr[2] = r0[2];\n                tmpptr[3] = r0[3];\n#endif\n\n                r0 += bottom_blob_tm.cstep;\n                tmpptr += 4;\n            }\n        }\n        for (; i < tiles; i++)\n        {\n            float* tmpptr = tm2.row(i / 4 + i % 4);\n\n            const float* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i);\n\n            for (int q = 0; q < inch; q++)\n            {\n                tmpptr[0] = r0[0];\n\n                r0 += bottom_blob_tm.cstep;\n                tmpptr += 1;\n            }\n        }\n    }\n\n    bottom_blob_tm = Mat();\n    // permute end\n\n    top_blob_tm.create(tiles, batch, outch, 4u, 
opt.workspace_allocator);\n\n#if __loongarch_sx\n    int nn_outch = outch >> 3;\n    int remain_outch_start = nn_outch << 3;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 8;\n\n        float* output0_tm = top_blob_tm.channel(p);\n        float* output1_tm = top_blob_tm.channel(p + 1);\n        float* output2_tm = top_blob_tm.channel(p + 2);\n        float* output3_tm = top_blob_tm.channel(p + 3);\n        float* output4_tm = top_blob_tm.channel(p + 4);\n        float* output5_tm = top_blob_tm.channel(p + 5);\n        float* output6_tm = top_blob_tm.channel(p + 6);\n        float* output7_tm = top_blob_tm.channel(p + 7);\n\n        const Mat kernel0_tm = kernel_tm.channel(p / 8);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 3 < tiles; i += 4)\n            {\n                const float* r0 = bb2.row(i / 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch always > 0\n\n                __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum4 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum5 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum6 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum7 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 16);\n                    __builtin_prefetch(k0 + 32);\n                    __m128 _val = (__m128)__lsx_vld(r0, 0);\n                    __m128i _w0123 = __lsx_vld(k0, 0);\n                    __m128i _w4567 = __lsx_vld(k0 + 4, 0);\n  
                  _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val, _sum0);\n                    _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val, _sum1);\n                    _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val, _sum2);\n                    _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val, _sum3);\n                    _sum4 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 0), _val, _sum4);\n                    _sum5 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 1), _val, _sum5);\n                    _sum6 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 2), _val, _sum6);\n                    _sum7 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w4567, 3), _val, _sum7);\n\n                    r0 += 4;\n                    k0 += 8;\n                }\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                __lsx_vst(_sum1, output1_tm, 0);\n                __lsx_vst(_sum2, output2_tm, 0);\n                __lsx_vst(_sum3, output3_tm, 0);\n                __lsx_vst(_sum4, output4_tm, 0);\n                __lsx_vst(_sum5, output5_tm, 0);\n                __lsx_vst(_sum6, output6_tm, 0);\n                __lsx_vst(_sum7, output7_tm, 0);\n\n                output0_tm += 4;\n                output1_tm += 4;\n                output2_tm += 4;\n                output3_tm += 4;\n                output4_tm += 4;\n                output5_tm += 4;\n                output6_tm += 4;\n                output7_tm += 4;\n            }\n            for (; i < tiles; i++)\n            {\n                const float* r0 = bb2.row(i / 4 + i % 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch always > 0\n\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n                float sum2 = 0.f;\n                float sum3 = 0.f;\n                float sum4 = 0.f;\n                float sum5 = 0.f;\n                float sum6 = 0.f;\n      
          float sum7 = 0.f;\n\n                int j = 0;\n                for (; j < nn; j++)\n                {\n                    sum0 += r0[0] * k0[0];\n                    sum1 += r0[0] * k0[1];\n                    sum2 += r0[0] * k0[2];\n                    sum3 += r0[0] * k0[3];\n                    sum4 += r0[0] * k0[4];\n                    sum5 += r0[0] * k0[5];\n                    sum6 += r0[0] * k0[6];\n                    sum7 += r0[0] * k0[7];\n\n                    r0 += 1;\n                    k0 += 8;\n                }\n\n                output0_tm[0] = sum0;\n                output1_tm[0] = sum1;\n                output2_tm[0] = sum2;\n                output3_tm[0] = sum3;\n                output4_tm[0] = sum4;\n                output5_tm[0] = sum5;\n                output6_tm[0] = sum6;\n                output7_tm[0] = sum7;\n\n                output0_tm++;\n                output1_tm++;\n                output2_tm++;\n                output3_tm++;\n                output4_tm++;\n                output5_tm++;\n                output6_tm++;\n                output7_tm++;\n            }\n        }\n    }\n\n    nn_outch = (outch - remain_outch_start) >> 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = remain_outch_start + pp * 4;\n\n        float* output0_tm = top_blob_tm.channel(p);\n        float* output1_tm = top_blob_tm.channel(p + 1);\n        float* output2_tm = top_blob_tm.channel(p + 2);\n        float* output3_tm = top_blob_tm.channel(p + 3);\n\n        const Mat kernel0_tm = kernel_tm.channel(p / 8 + (p % 8) / 4);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 3 < tiles; i += 4)\n            {\n                const float* r0 = bb2.row(i / 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch 
always > 0\n\n                __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                int j = 0;\n                for (; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 16);\n                    __builtin_prefetch(k0 + 16);\n                    __m128 _val = (__m128)__lsx_vld(r0, 0);\n                    __m128i _w0123 = __lsx_vld(k0, 0);\n                    _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 0), _val, _sum0);\n                    _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 1), _val, _sum1);\n                    _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 2), _val, _sum2);\n                    _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w0123, 3), _val, _sum3);\n\n                    r0 += 4;\n                    k0 += 4;\n                }\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                __lsx_vst(_sum1, output1_tm, 0);\n                __lsx_vst(_sum2, output2_tm, 0);\n                __lsx_vst(_sum3, output3_tm, 0);\n\n                output0_tm += 4;\n                output1_tm += 4;\n                output2_tm += 4;\n                output3_tm += 4;\n            }\n            for (; i < tiles; i++)\n            {\n                const float* r0 = bb2.row(i / 4 + i % 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch always > 0\n\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n                float sum2 = 0.f;\n                float sum3 = 0.f;\n\n                int j = 0;\n                for (; j < nn; j++)\n                {\n                    sum0 += r0[0] * k0[0];\n                    sum1 += r0[0] * k0[1];\n                    sum2 += r0[0] * k0[2];\n                    sum3 += r0[0] * 
k0[3];\n\n                    r0 += 1;\n                    k0 += 4;\n                }\n\n                output0_tm[0] = sum0;\n                output1_tm[0] = sum1;\n                output2_tm[0] = sum2;\n                output3_tm[0] = sum3;\n\n                output0_tm++;\n                output1_tm++;\n                output2_tm++;\n                output3_tm++;\n            }\n        }\n    }\n\n    remain_outch_start += nn_outch << 2;\n#else\n    int nn_outch = outch >> 1;\n    int remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        float* output0_tm = top_blob_tm.channel(p);\n        float* output1_tm = top_blob_tm.channel(p + 1);\n\n        const Mat kernel0_tm = kernel_tm.channel(p / 2);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 3 < tiles; i += 4)\n            {\n                const float* r0 = bb2.row(i / 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch always > 0\n\n                float sum00 = 0.f;\n                float sum01 = 0.f;\n                float sum02 = 0.f;\n                float sum03 = 0.f;\n                float sum10 = 0.f;\n                float sum11 = 0.f;\n                float sum12 = 0.f;\n                float sum13 = 0.f;\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 16);\n                    __builtin_prefetch(k0 + 8);\n                    float w0 = k0[0];\n                    float w1 = k0[1];\n                    sum00 += r0[0] * w0;\n                    sum01 += r0[1] * w0;\n                    sum02 += r0[2] * w0;\n                    sum03 += r0[3] * w0;\n                    sum10 += r0[0] * w1;\n                    sum11 += r0[1] * w1;\n                   
 sum12 += r0[2] * w1;\n                    sum13 += r0[3] * w1;\n\n                    r0 += 4;\n                    k0 += 2;\n                }\n\n                output0_tm[0] = sum00;\n                output0_tm[1] = sum01;\n                output0_tm[2] = sum02;\n                output0_tm[3] = sum03;\n                output1_tm[0] = sum10;\n                output1_tm[1] = sum11;\n                output1_tm[2] = sum12;\n                output1_tm[3] = sum13;\n\n                output0_tm += 4;\n                output1_tm += 4;\n            }\n            for (; i < tiles; i++)\n            {\n                const float* r0 = bb2.row(i / 4 + i % 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch always > 0\n\n                float sum00 = 0.f;\n                float sum10 = 0.f;\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 4);\n                    __builtin_prefetch(k0 + 8);\n                    float val0 = r0[0];\n                    sum00 += val0 * k0[0];\n                    sum10 += val0 * k0[1];\n\n                    r0 += 1;\n                    k0 += 2;\n                }\n\n                output0_tm[0] = sum00;\n                output1_tm[0] = sum10;\n                output0_tm++;\n                output1_tm++;\n            }\n        }\n    }\n#endif\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        float* output0_tm = top_blob_tm.channel(p);\n\n#if __loongarch_sx\n        const Mat kernel0_tm = kernel_tm.channel(p / 8 + (p % 8) / 4 + p % 4);\n#else\n        const Mat kernel0_tm = kernel_tm.channel(p / 2 + p % 2);\n#endif\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 3 < tiles; i += 4)\n            {\n                const float* r0 = 
bb2.row(i / 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch always > 0\n\n                int j = 0;\n#if __loongarch_sx\n                __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                for (; j < nn; j++)\n                {\n                    _sum0 = __lsx_vfmadd_s((__m128)__lsx_vld(r0, 0), __lsx_vreplfr2vr_s(k0[0]), _sum0);\n                    r0 += 4;\n                    k0++;\n                }\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                output0_tm += 4;\n#else  // __loongarch_sx\n                float sum0 = 0.f;\n                float sum1 = 0.f;\n                float sum2 = 0.f;\n                float sum3 = 0.f;\n\n                for (; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 16);\n                    __builtin_prefetch(k0 + 4);\n                    float w0 = k0[0];\n                    sum0 += r0[0] * w0;\n                    sum1 += r0[1] * w0;\n                    sum2 += r0[2] * w0;\n                    sum3 += r0[3] * w0;\n\n                    r0 += 4;\n                    k0++;\n                }\n\n                output0_tm[0] = sum0;\n                output0_tm[1] = sum1;\n                output0_tm[2] = sum2;\n                output0_tm[3] = sum3;\n                output0_tm += 4;\n#endif // __loongarch_sx\n            }\n            for (; i < tiles; i++)\n            {\n                const float* r0 = bb2.row(i / 4 + i % 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch; // inch always > 0\n\n                float sum = 0.f;\n\n                for (int j = 0; j < nn; j++)\n                {\n                    float w0 = k0[0];\n                    float val0 = r0[0];\n                    sum += val0 * w0;\n\n                    r0 += 1;\n                    k0 += 1;\n                }\n\n                output0_tm[0] = sum;\n                output0_tm += 1;\n       
     }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_dot_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_winograd_dot_int8_lsx(Mat& bottom_blob_tm, int outch, const Mat& kernel_tm, Mat& top_blob_tm, const Option& opt)\n{\n    // Mat bottom_blob_tm(tiles, 16/36/64, inch, 2u, 1, opt.workspace_allocator);\n\n    const int tiles = bottom_blob_tm.w;\n    const int batch = bottom_blob_tm.h;\n    const int inch = bottom_blob_tm.c;\n\n    // permute\n    Mat bottom_blob_tm2;\n#if __loongarch_sx\n    if (inch >= 4)\n    {\n        if (tiles >= 2)\n            bottom_blob_tm2.create(inch / 4 + inch % 4, tiles / 2 + tiles % 2, batch, 16u, 8, opt.workspace_allocator);\n        else // if (tiles >= 1)\n            bottom_blob_tm2.create(inch / 4 + inch % 4, tiles, batch, 8u, 4, opt.workspace_allocator);\n    }\n    else\n#endif // __loongarch_sx\n    {\n        if (tiles >= 2)\n            bottom_blob_tm2.create(inch, tiles / 2 + tiles % 2, batch, 4u, 2, opt.workspace_allocator);\n        else // if (tiles >= 1)\n            bottom_blob_tm2.create(inch, tiles, batch, 2u, 1, opt.workspace_allocator);\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int r = 0; r < batch; r++)\n    {\n        Mat tm2 = bottom_blob_tm2.channel(r);\n\n        // tile\n        int i = 0;\n        for (; i + 1 < tiles; i += 2)\n        {\n            short* tmpptr = tm2.row<short>(i / 2);\n\n            const short* r0 = (const short*)bottom_blob_tm + r * tiles + i;\n\n            int q = 0;\n#if __loongarch_sx\n            const short* r1 = (const short*)bottom_blob_tm.channel(1) + r * tiles + i;\n            const short* r2 = (const short*)bottom_blob_tm.channel(2) + r * tiles + i;\n            const short* r3 = (const short*)bottom_blob_tm.channel(3) + r * tiles + i;\n            for (; q + 3 < inch; q += 4)\n            {\n                tmpptr[0] = r0[0];\n                tmpptr[1] = r1[0];\n                tmpptr[2] = 
r2[0];\n                tmpptr[3] = r3[0];\n                tmpptr[4] = r0[1];\n                tmpptr[5] = r1[1];\n                tmpptr[6] = r2[1];\n                tmpptr[7] = r3[1];\n                r0 += bottom_blob_tm.cstep * 4;\n                r1 += bottom_blob_tm.cstep * 4;\n                r2 += bottom_blob_tm.cstep * 4;\n                r3 += bottom_blob_tm.cstep * 4;\n                tmpptr += 8;\n            }\n#endif // __loongarch_sx\n            for (; q < inch; q++)\n            {\n                tmpptr[0] = r0[0];\n                tmpptr[1] = r0[1];\n                r0 += bottom_blob_tm.cstep;\n                tmpptr += 2;\n            }\n        }\n        for (; i < tiles; i++)\n        {\n            short* tmpptr = tm2.row<short>(i / 2 + i % 2);\n\n            const short* r0 = (const short*)bottom_blob_tm + r * tiles + i;\n\n            int q = 0;\n#if __loongarch_sx\n            const short* r1 = (const short*)bottom_blob_tm.channel(1) + r * tiles + i;\n            const short* r2 = (const short*)bottom_blob_tm.channel(2) + r * tiles + i;\n            const short* r3 = (const short*)bottom_blob_tm.channel(3) + r * tiles + i;\n            for (; q + 3 < inch; q += 4)\n            {\n                tmpptr[0] = r0[0];\n                tmpptr[1] = r1[0];\n                tmpptr[2] = r2[0];\n                tmpptr[3] = r3[0];\n                r0 += bottom_blob_tm.cstep * 4;\n                r1 += bottom_blob_tm.cstep * 4;\n                r2 += bottom_blob_tm.cstep * 4;\n                r3 += bottom_blob_tm.cstep * 4;\n                tmpptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; q < inch; q++)\n            {\n                tmpptr[0] = r0[0];\n                r0 += bottom_blob_tm.cstep;\n                tmpptr += 1;\n            }\n        }\n    }\n\n    bottom_blob_tm = Mat();\n    // permute end\n\n    top_blob_tm.create(tiles, batch, outch, 4u, 1, opt.workspace_allocator);\n\n#if __loongarch_sx\n    int 
nn_outch = outch >> 2;\n    int remain_outch_start = nn_outch << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 4;\n\n        int* output0_tm = top_blob_tm.channel(p);\n        int* output1_tm = top_blob_tm.channel(p + 1);\n        int* output2_tm = top_blob_tm.channel(p + 2);\n        int* output3_tm = top_blob_tm.channel(p + 3);\n\n        const Mat kernel0_tm = kernel_tm.channel(p / 4);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 1 < tiles; i += 2)\n            {\n                const short* r0 = bb2.row<const short>(i / 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int nn4 = inch / 4;\n                int nn1 = inch % 4;\n\n                __m128i _sum00 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum10 = __lsx_vreplgr2vr_w(0);\n\n                if (nn4 > 0)\n                {\n                    __m128i _sum01 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum02 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum03 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum11 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum12 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum13 = __lsx_vreplgr2vr_w(0);\n\n                    int j = 0;\n                    for (; j < nn4; j++)\n                    {\n                        __m128i _val01 = __lsx_vld(r0, 0);\n\n                        __m128i _val0 = __lsx_vilvl_d(_val01, _val01);\n                        __m128i _val1 = __lsx_vilvh_d(_val01, _val01);\n\n                        __m128i _w0 = __lsx_vld(k0, 0);\n                        __m128i _w1 = __lsx_vld(k0 + 8, 0);\n\n                        __m128i _extval0 = __lsx_vslti_h(_val0, 0);\n                        __m128i _extval1 = __lsx_vslti_h(_val1, 0);\n                 
       __m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                        __m128i _extw1 = __lsx_vslti_h(_w1, 0);\n\n                        __m128i _val0l = __lsx_vilvl_h(_extval0, _val0);\n                        __m128i _val0h = __lsx_vilvh_h(_extval0, _val0);\n                        __m128i _val1l = __lsx_vilvl_h(_extval1, _val1);\n                        __m128i _val1h = __lsx_vilvh_h(_extval1, _val1);\n\n                        __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                        __m128i _w0h = __lsx_vilvh_h(_extw0, _w0);\n                        __m128i _w1l = __lsx_vilvl_h(_extw1, _w1);\n                        __m128i _w1h = __lsx_vilvh_h(_extw1, _w1);\n\n                        _sum00 = __lsx_vmadd_w(_sum00, _val0l, _w0l);\n                        _sum01 = __lsx_vmadd_w(_sum01, _val0h, _w0h);\n                        _sum02 = __lsx_vmadd_w(_sum02, _val0l, _w1l);\n                        _sum03 = __lsx_vmadd_w(_sum03, _val0h, _w1h);\n                        _sum10 = __lsx_vmadd_w(_sum10, _val1l, _w0l);\n                        _sum11 = __lsx_vmadd_w(_sum11, _val1h, _w0h);\n                        _sum12 = __lsx_vmadd_w(_sum12, _val1l, _w1l);\n                        _sum13 = __lsx_vmadd_w(_sum13, _val1h, _w1h);\n\n                        r0 += 8;\n                        k0 += 16;\n                    }\n\n                    // transpose 4x4\n                    {\n                        __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                        _tmp0 = __lsx_vilvl_w(_sum01, _sum00);\n                        _tmp1 = __lsx_vilvl_w(_sum03, _sum02);\n                        _tmp2 = __lsx_vilvh_w(_sum01, _sum00);\n                        _tmp3 = __lsx_vilvh_w(_sum03, _sum02);\n                        _sum00 = __lsx_vilvl_d(_tmp1, _tmp0);\n                        _sum01 = __lsx_vilvh_d(_tmp1, _tmp0);\n                        _sum02 = __lsx_vilvl_d(_tmp3, _tmp2);\n                        _sum03 = __lsx_vilvh_d(_tmp3, _tmp2);\n                   
 }\n                    {\n                        __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                        _tmp0 = __lsx_vilvl_w(_sum11, _sum10);\n                        _tmp1 = __lsx_vilvl_w(_sum13, _sum12);\n                        _tmp2 = __lsx_vilvh_w(_sum11, _sum10);\n                        _tmp3 = __lsx_vilvh_w(_sum13, _sum12);\n                        _sum10 = __lsx_vilvl_d(_tmp1, _tmp0);\n                        _sum11 = __lsx_vilvh_d(_tmp1, _tmp0);\n                        _sum12 = __lsx_vilvl_d(_tmp3, _tmp2);\n                        _sum13 = __lsx_vilvh_d(_tmp3, _tmp2);\n                    }\n\n                    _sum00 = __lsx_vadd_w(_sum00, _sum01);\n                    _sum02 = __lsx_vadd_w(_sum02, _sum03);\n                    _sum10 = __lsx_vadd_w(_sum10, _sum11);\n                    _sum12 = __lsx_vadd_w(_sum12, _sum13);\n\n                    _sum00 = __lsx_vadd_w(_sum00, _sum02);\n                    _sum10 = __lsx_vadd_w(_sum10, _sum12);\n                }\n\n                for (int j = 0; j < nn1; j++)\n                {\n                    __m128i _val0 = __lsx_vreplgr2vr_h(r0[0]);\n                    __m128i _val1 = __lsx_vreplgr2vr_h(r0[1]);\n                    __m128i _val = __lsx_vilvl_d(_val1, _val0);\n\n                    __m128i _w16 = __lsx_vld(k0, 0);\n\n                    _w16 = __lsx_vilvl_d(_w16, _w16);\n\n                    __m128i _extval = __lsx_vslti_h(_val, 0);\n                    __m128i _extw16 = __lsx_vslti_h(_w16, 0);\n\n                    __m128i _vall = __lsx_vilvl_h(_extval, _val);\n                    __m128i _valh = __lsx_vilvh_h(_extval, _val);\n                    __m128i _w0l = __lsx_vilvl_h(_extw16, _w16);\n                    __m128i _w0h = __lsx_vilvh_h(_extw16, _w16);\n\n                    _sum00 = __lsx_vmadd_w(_sum00, _vall, _w0l);\n                    _sum10 = __lsx_vmadd_w(_sum10, _valh, _w0h);\n\n                    r0 += 2;\n                    k0 += 4;\n                }\n\n           
     int sum[8];\n                __lsx_vst(_sum00, sum, 0);\n                __lsx_vst(_sum10, sum + 4, 0);\n\n                output0_tm[0] = sum[0];\n                output1_tm[0] = sum[1];\n                output2_tm[0] = sum[2];\n                output3_tm[0] = sum[3];\n                output0_tm[1] = sum[4];\n                output1_tm[1] = sum[5];\n                output2_tm[1] = sum[6];\n                output3_tm[1] = sum[7];\n                output0_tm += 2;\n                output1_tm += 2;\n                output2_tm += 2;\n                output3_tm += 2;\n            }\n            for (; i < tiles; i++)\n            {\n                const short* r0 = bb2.row<const short>(i / 2 + i % 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int nn4 = inch / 4;\n                int nn1 = inch % 4;\n\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n\n                if (nn4 > 0)\n                {\n                    __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n                    int j = 0;\n                    for (; j < nn4; j++)\n                    {\n                        __m128i _val16 = __lsx_vld(r0, 0);\n\n                        _val16 = __lsx_vilvl_d(_val16, _val16);\n\n                        __m128i _w0 = __lsx_vld(k0, 0);\n                        __m128i _w1 = __lsx_vld(k0 + 8, 0);\n\n                        __m128i _extval16 = __lsx_vslti_h(_val16, 0);\n                        __m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                        __m128i _extw1 = __lsx_vslti_h(_w1, 0);\n\n                        __m128i _val0l = __lsx_vilvl_h(_extval16, _val16);\n                        __m128i _val0h = __lsx_vilvh_h(_extval16, _val16);\n\n                        __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                        __m128i _w0h = __lsx_vilvh_h(_extw0, _w0);\n             
           __m128i _w1l = __lsx_vilvl_h(_extw1, _w1);\n                        __m128i _w1h = __lsx_vilvh_h(_extw1, _w1);\n\n                        _sum0 = __lsx_vmadd_w(_sum0, _val0l, _w0l);\n                        _sum1 = __lsx_vmadd_w(_sum1, _val0h, _w0h);\n                        _sum2 = __lsx_vmadd_w(_sum2, _val0l, _w1l);\n                        _sum3 = __lsx_vmadd_w(_sum3, _val0h, _w1h);\n\n                        r0 += 4;\n                        k0 += 16;\n                    }\n\n                    // transpose 4x4\n                    {\n                        __m128i _tmp0, _tmp1, _tmp2, _tmp3;\n                        _tmp0 = __lsx_vilvl_w(_sum1, _sum0);\n                        _tmp1 = __lsx_vilvl_w(_sum3, _sum2);\n                        _tmp2 = __lsx_vilvh_w(_sum1, _sum0);\n                        _tmp3 = __lsx_vilvh_w(_sum3, _sum2);\n                        _sum0 = __lsx_vilvl_d(_tmp1, _tmp0);\n                        _sum1 = __lsx_vilvh_d(_tmp1, _tmp0);\n                        _sum2 = __lsx_vilvl_d(_tmp3, _tmp2);\n                        _sum3 = __lsx_vilvh_d(_tmp3, _tmp2);\n                    }\n\n                    _sum0 = __lsx_vadd_w(_sum0, _sum1);\n                    _sum2 = __lsx_vadd_w(_sum2, _sum3);\n                    _sum0 = __lsx_vadd_w(_sum0, _sum2);\n                }\n\n                for (int j = 0; j < nn1; j++)\n                {\n                    __m128i _val = __lsx_vreplgr2vr_w(r0[0]);\n                    __m128i _w16 = __lsx_vld(k0, 0);\n\n                    __m128i _extw16 = __lsx_vslti_h(_w16, 0);\n                    __m128i _w0l = __lsx_vilvl_h(_extw16, _w16);\n\n                    _sum0 = __lsx_vmadd_w(_sum0, _val, _w0l);\n\n                    r0 += 1;\n                    k0 += 4;\n                }\n\n                int sum[4];\n                __lsx_vst(_sum0, sum, 0);\n\n                output0_tm[0] = sum[0];\n                output1_tm[0] = sum[1];\n                output2_tm[0] = sum[2];\n         
       output3_tm[0] = sum[3];\n                output0_tm += 1;\n                output1_tm += 1;\n                output2_tm += 1;\n                output3_tm += 1;\n            }\n        }\n    }\n#else // __loongarch_sx\n    int nn_outch = outch >> 1;\n    int remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        int* output0_tm = top_blob_tm.channel(p);\n        int* output1_tm = top_blob_tm.channel(p + 1);\n\n        const Mat kernel0_tm = kernel_tm.channel(p / 2);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 1 < tiles; i += 2)\n            {\n                const short* r0 = bb2.row<const short>(i / 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int sum00 = 0;\n                int sum01 = 0;\n                int sum10 = 0;\n                int sum11 = 0;\n\n                int nn1 = inch;\n\n                for (int j = 0; j < nn1; j++)\n                {\n                    signed short val0 = r0[0];\n                    signed short val1 = r0[1];\n                    signed short w0 = k0[0];\n                    signed short w1 = k0[1];\n\n                    sum00 += val0 * w0;\n                    sum01 += val1 * w0;\n                    sum10 += val0 * w1;\n                    sum11 += val1 * w1;\n\n                    r0 += 2;\n                    k0 += 2;\n                }\n\n                output0_tm[0] = sum00;\n                output0_tm[1] = sum01;\n                output1_tm[0] = sum10;\n                output1_tm[1] = sum11;\n                output0_tm += 2;\n                output1_tm += 2;\n            }\n            for (; i < tiles; i++)\n            {\n                const short* r0 = bb2.row<const short>(i / 2 + i % 2);\n                const short* 
k0 = kernel0_tm.row<const short>(r);\n\n                int sum0 = 0;\n                int sum1 = 0;\n\n                int nn1 = inch;\n\n                for (int j = 0; j < nn1; j++)\n                {\n                    signed short val0 = r0[0];\n                    signed short w0 = k0[0];\n                    signed short w1 = k0[1];\n\n                    sum0 += val0 * w0;\n                    sum1 += val0 * w1;\n\n                    r0 += 1;\n                    k0 += 2;\n                }\n\n                output0_tm[0] = sum0;\n                output1_tm[0] = sum1;\n                output0_tm += 1;\n                output1_tm += 1;\n            }\n        }\n    }\n#endif // __loongarch_sx\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        int* output0_tm = top_blob_tm.channel(p);\n\n#if __loongarch_sx\n        const Mat kernel0_tm = kernel_tm.channel(p / 4 + p % 4);\n#else\n        const Mat kernel0_tm = kernel_tm.channel(p / 2 + p % 2);\n#endif\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 1 < tiles; i += 2)\n            {\n                const short* r0 = bb2.row<const short>(i / 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int sum0 = 0;\n                int sum1 = 0;\n\n#if __loongarch_sx\n                int nn4 = inch / 4;\n                int nn1 = inch % 4;\n\n                if (nn4 > 0)\n                {\n                    __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n                    int j = 0;\n                    for (; j < nn4; j++)\n                    {\n                        __m128i _val16 = __lsx_vld(r0, 0);\n\n                        __m128i _w16 = __lsx_vld(k0, 0);\n\n                        _w16 = __lsx_vilvl_d(_w16, _w16);\n\n    
                    __m128i _extval16 = __lsx_vslti_h(_val16, 0);\n                        __m128i _extw16 = __lsx_vslti_h(_w16, 0);\n\n                        __m128i _val0l = __lsx_vilvl_h(_extval16, _val16);\n                        __m128i _val0h = __lsx_vilvh_h(_extval16, _val16);\n\n                        __m128i _w0l = __lsx_vilvl_h(_extw16, _w16);\n                        __m128i _w0h = __lsx_vilvh_h(_extw16, _w16);\n\n                        _sum0 = __lsx_vmadd_w(_sum0, _val0l, _w0l);\n                        _sum1 = __lsx_vmadd_w(_sum1, _val0h, _w0h);\n\n                        r0 += 8;\n                        k0 += 4;\n                    }\n\n                    sum0 = __lsx_reduce_add_w(_sum0);\n                    sum1 = __lsx_reduce_add_w(_sum1);\n                }\n#else  // __loongarch_sx\n                int nn1 = inch;\n#endif // __loongarch_sx\n\n                for (int q = 0; q < nn1; q++)\n                {\n                    signed short val0 = r0[0];\n                    signed short val1 = r0[1];\n                    signed short w = k0[0];\n\n                    sum0 += val0 * w;\n                    sum1 += val1 * w;\n\n                    k0 += 1;\n                    r0 += 2;\n                }\n\n                output0_tm[0] = sum0;\n                output0_tm[1] = sum1;\n                output0_tm += 2;\n            }\n            for (; i < tiles; i++)\n            {\n                const short* r0 = bb2.row<const short>(i / 2 + i % 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int sum = 0;\n\n#if __loongarch_sx\n                int nn4 = inch / 4;\n                int nn1 = inch % 4;\n\n                if (nn4 > 0)\n                {\n                    __m128i _sum = __lsx_vreplgr2vr_w(0);\n\n                    int j = 0;\n                    for (; j < nn4; j++)\n                    {\n                        __m128i _val16 = __lsx_vld(r0, 0);\n                        __m128i 
_w16 = __lsx_vld(k0, 0);\n\n                        __m128i _extval16 = __lsx_vslti_h(_val16, 0);\n                        __m128i _extw16 = __lsx_vslti_h(_w16, 0);\n\n                        __m128i _val0l = __lsx_vilvl_h(_extval16, _val16);\n                        __m128i _w0l = __lsx_vilvl_h(_extw16, _w16);\n\n                        _sum = __lsx_vmadd_w(_sum, _val0l, _w0l);\n\n                        r0 += 4;\n                        k0 += 4;\n                    }\n\n                    sum = __lsx_reduce_add_w(_sum);\n                }\n#else  // __loongarch_sx\n                int nn1 = inch;\n#endif // __loongarch_sx\n\n                for (int q = 0; q < nn1; q++)\n                {\n                    signed short val = r0[0];\n                    signed short w = k0[0];\n\n                    sum += val * w;\n\n                    k0 += 1;\n                    r0 += 1;\n                }\n\n                output0_tm[0] = sum;\n                output0_tm++;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_dot_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_winograd_dot_pack4_lsx(Mat& bottom_blob_tm, int outch, const Mat& kernel_tm, Mat& top_blob_tm, const Option& opt)\n{\n    // Mat bottom_blob_tm(tiles, 16/36/64, inch, 16u, 4, opt.workspace_allocator);\n\n    const int tiles = bottom_blob_tm.w;\n    const int batch = bottom_blob_tm.h;\n    const int inch = bottom_blob_tm.c;\n\n    // permute\n    Mat bottom_blob_tm2;\n    if (tiles >= 12)\n        bottom_blob_tm2.create(12 * inch, tiles / 12 + (tiles % 12) / 8 + (tiles % 12 % 8) / 4 + (tiles % 12 % 4) / 2 + tiles % 12 % 2, batch, 16u, 4, opt.workspace_allocator);\n    else if (tiles >= 8)\n        bottom_blob_tm2.create(8 * inch, tiles / 8 + (tiles % 8) / 4 + (tiles % 4) / 2 + tiles % 2, batch, 16u, 4, opt.workspace_allocator);\n    else if (tiles >= 4)\n        bottom_blob_tm2.create(4 * inch, tiles / 4 + (tiles % 4) / 2 + tiles % 2, batch, 16u, 4, opt.workspace_allocator);\n    else if (tiles >= 2)\n        bottom_blob_tm2.create(2 * inch, tiles / 2 + tiles % 2, batch, 16u, 4, opt.workspace_allocator);\n    else // if (tiles >= 1)\n        bottom_blob_tm2.create(1 * inch, tiles, batch, 16u, 4, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int r = 0; r < batch; r++)\n    {\n        Mat tm2 = bottom_blob_tm2.channel(r);\n\n        // tile\n        int i = 0;\n        for (; i + 11 < tiles; i += 12)\n        {\n            float* tmpptr = tm2.row(i / 12);\n\n            const float* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 4;\n\n            for (int q = 0; q < inch; q++)\n            {\n                // transpose 4x12\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __m128i _r1 = __lsx_vld(r0 + 4, 0);\n                __m128i _r2 = __lsx_vld(r0 + 4 * 2, 0);\n                __m128i _r3 = __lsx_vld(r0 + 4 * 3, 0);\n                
__m128i _r4 = __lsx_vld(r0 + 4 * 4, 0);\n                __m128i _r5 = __lsx_vld(r0 + 4 * 5, 0);\n                __m128i _r6 = __lsx_vld(r0 + 4 * 6, 0);\n                __m128i _r7 = __lsx_vld(r0 + 4 * 7, 0);\n                __m128i _r8 = __lsx_vld(r0 + 4 * 8, 0);\n                __m128i _r9 = __lsx_vld(r0 + 4 * 9, 0);\n                __m128i _ra = __lsx_vld(r0 + 4 * 10, 0);\n                __m128i _rb = __lsx_vld(r0 + 4 * 11, 0);\n\n                __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                __m128i _r45r = __lsx_vilvl_w(_r5, _r4);\n                __m128i _r45l = __lsx_vilvh_w(_r5, _r4);\n                __m128i _r67r = __lsx_vilvl_w(_r7, _r6);\n                __m128i _r67l = __lsx_vilvh_w(_r7, _r6);\n                __m128i _r89r = __lsx_vilvl_w(_r9, _r8);\n                __m128i _r89l = __lsx_vilvh_w(_r9, _r8);\n                __m128i _rabr = __lsx_vilvl_w(_rb, _ra);\n                __m128i _rabl = __lsx_vilvh_w(_rb, _ra);\n                __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n                __m128i _r4567_0 = __lsx_vilvl_d(_r67r, _r45r);\n                __m128i _r4567_1 = __lsx_vilvh_d(_r67r, _r45r);\n                __m128i _r4567_2 = __lsx_vilvl_d(_r67l, _r45l);\n                __m128i _r4567_3 = __lsx_vilvh_d(_r67l, _r45l);\n                __m128i _r89ab_0 = __lsx_vilvl_d(_rabr, _r89r);\n                __m128i _r89ab_1 = __lsx_vilvh_d(_rabr, _r89r);\n                __m128i _r89ab_2 = __lsx_vilvl_d(_rabl, _r89l);\n                __m128i _r89ab_3 = __lsx_vilvh_d(_rabl, _r89l);\n\n                __lsx_vst(_r0123_0, tmpptr, 0);\n                
__lsx_vst(_r4567_0, tmpptr + 4, 0);\n                __lsx_vst(_r89ab_0, tmpptr + 4 * 2, 0);\n                __lsx_vst(_r0123_1, tmpptr + 4 * 3, 0);\n                __lsx_vst(_r4567_1, tmpptr + 4 * 4, 0);\n                __lsx_vst(_r89ab_1, tmpptr + 4 * 5, 0);\n                __lsx_vst(_r0123_2, tmpptr + 4 * 6, 0);\n                __lsx_vst(_r4567_2, tmpptr + 4 * 7, 0);\n                __lsx_vst(_r89ab_2, tmpptr + 4 * 8, 0);\n                __lsx_vst(_r0123_3, tmpptr + 4 * 9, 0);\n                __lsx_vst(_r4567_3, tmpptr + 4 * 10, 0);\n                __lsx_vst(_r89ab_3, tmpptr + 4 * 11, 0);\n\n                r0 += bottom_blob_tm.cstep * 4;\n                tmpptr += 48;\n            }\n        }\n        for (; i + 7 < tiles; i += 8)\n        {\n            float* tmpptr = tm2.row(i / 12 + (i % 12) / 8);\n\n            const float* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 4;\n\n            for (int q = 0; q < inch; q++)\n            {\n                // transpose 4x8\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __m128i _r1 = __lsx_vld(r0 + 4, 0);\n                __m128i _r2 = __lsx_vld(r0 + 4 * 2, 0);\n                __m128i _r3 = __lsx_vld(r0 + 4 * 3, 0);\n                __m128i _r4 = __lsx_vld(r0 + 4 * 4, 0);\n                __m128i _r5 = __lsx_vld(r0 + 4 * 5, 0);\n                __m128i _r6 = __lsx_vld(r0 + 4 * 6, 0);\n                __m128i _r7 = __lsx_vld(r0 + 4 * 7, 0);\n\n                __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                __m128i _r45r = __lsx_vilvl_w(_r5, _r4);\n                __m128i _r45l = __lsx_vilvh_w(_r5, _r4);\n                __m128i _r67r = __lsx_vilvl_w(_r7, _r6);\n                __m128i _r67l = __lsx_vilvh_w(_r7, _r6);\n                __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n           
     __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n                __m128i _r4567_0 = __lsx_vilvl_d(_r67r, _r45r);\n                __m128i _r4567_1 = __lsx_vilvh_d(_r67r, _r45r);\n                __m128i _r4567_2 = __lsx_vilvl_d(_r67l, _r45l);\n                __m128i _r4567_3 = __lsx_vilvh_d(_r67l, _r45l);\n\n                __lsx_vst(_r0123_0, tmpptr, 0);\n                __lsx_vst(_r4567_0, tmpptr + 4, 0);\n                __lsx_vst(_r0123_1, tmpptr + 4 * 2, 0);\n                __lsx_vst(_r4567_1, tmpptr + 4 * 3, 0);\n                __lsx_vst(_r0123_2, tmpptr + 4 * 4, 0);\n                __lsx_vst(_r4567_2, tmpptr + 4 * 5, 0);\n                __lsx_vst(_r0123_3, tmpptr + 4 * 6, 0);\n                __lsx_vst(_r4567_3, tmpptr + 4 * 7, 0);\n\n                r0 += bottom_blob_tm.cstep * 4;\n                tmpptr += 32;\n            }\n        }\n        for (; i + 3 < tiles; i += 4)\n        {\n            float* tmpptr = tm2.row(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4);\n\n            const float* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 4;\n\n            for (int q = 0; q < inch; q++)\n            {\n                // transpose 4x4\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __m128i _r1 = __lsx_vld(r0 + 4, 0);\n                __m128i _r2 = __lsx_vld(r0 + 4 * 2, 0);\n                __m128i _r3 = __lsx_vld(r0 + 4 * 3, 0);\n\n                __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                __m128i _r0123_3 = 
__lsx_vilvh_d(_r23l, _r01l);\n\n                __lsx_vst(_r0123_0, tmpptr, 0);\n                __lsx_vst(_r0123_1, tmpptr + 4, 0);\n                __lsx_vst(_r0123_2, tmpptr + 4 * 2, 0);\n                __lsx_vst(_r0123_3, tmpptr + 4 * 3, 0);\n\n                r0 += bottom_blob_tm.cstep * 4;\n                tmpptr += 16;\n            }\n        }\n        for (; i + 1 < tiles; i += 2)\n        {\n            float* tmpptr = tm2.row(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2);\n\n            const float* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 4;\n\n            for (int q = 0; q < inch; q++)\n            {\n                // transpose 4x2\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __m128i _r1 = __lsx_vld(r0 + 4, 0);\n\n                __m128i _r01_0 = __lsx_vilvl_w(_r1, _r0);\n                __m128i _r01_1 = __lsx_vilvh_w(_r1, _r0);\n\n                __lsx_vst(_r01_0, tmpptr, 0);\n                __lsx_vst(_r01_1, tmpptr + 4, 0);\n\n                r0 += bottom_blob_tm.cstep * 4;\n                tmpptr += 8;\n            }\n        }\n        for (; i < tiles; i++)\n        {\n            float* tmpptr = tm2.row(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2 + i % 12 % 2);\n\n            const float* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 4;\n\n            for (int q = 0; q < inch; q++)\n            {\n                __m128i _val = __lsx_vld(r0, 0);\n                __lsx_vst(_val, tmpptr, 0);\n\n                r0 += bottom_blob_tm.cstep * 4;\n                tmpptr += 4;\n            }\n        }\n    }\n\n    bottom_blob_tm = Mat();\n    // permute end\n\n    top_blob_tm.create(tiles, batch, outch, 16u, 4, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* output0_tm = top_blob_tm.channel(p);\n\n        const Mat kernel0_tm = kernel_tm.channel(p);\n\n        for 
(int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 11 < tiles; i += 12)\n            {\n                const float* r0 = bb2.row(i / 12);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch * 4; // inch always > 0\n\n                __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum4 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum5 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum6 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum7 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum8 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum9 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _suma = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sumb = (__m128)__lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 48);\n                    __builtin_prefetch(k0 + 16);\n                    __m128i _val0123 = __lsx_vld(r0, 0);\n                    __m128i _val4567 = __lsx_vld(r0 + 4, 0);\n                    __m128i _val89ab = __lsx_vld(r0 + 8, 0);\n                    __m128 _w0 = (__m128)__lsx_vld(k0, 0);\n                    _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 0), _sum0);\n                    _sum1 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 1), _sum1);\n                    _sum2 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 2), _sum2);\n                    _sum3 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 3), _sum3);\n                    _sum4 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 0), _sum4);\n       
             _sum5 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 1), _sum5);\n                    _sum6 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 2), _sum6);\n                    _sum7 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 3), _sum7);\n                    _sum8 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 0), _sum8);\n                    _sum9 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 1), _sum9);\n                    _suma = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 2), _suma);\n                    _sumb = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val89ab, 3), _sumb);\n\n                    r0 += 12;\n                    k0 += 4;\n                }\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                __lsx_vst(_sum1, output0_tm + 4, 0);\n                __lsx_vst(_sum2, output0_tm + 4 * 2, 0);\n                __lsx_vst(_sum3, output0_tm + 4 * 3, 0);\n                __lsx_vst(_sum4, output0_tm + 4 * 4, 0);\n                __lsx_vst(_sum5, output0_tm + 4 * 5, 0);\n                __lsx_vst(_sum6, output0_tm + 4 * 6, 0);\n                __lsx_vst(_sum7, output0_tm + 4 * 7, 0);\n                __lsx_vst(_sum8, output0_tm + 4 * 8, 0);\n                __lsx_vst(_sum9, output0_tm + 4 * 9, 0);\n                __lsx_vst(_suma, output0_tm + 4 * 10, 0);\n                __lsx_vst(_sumb, output0_tm + 4 * 11, 0);\n\n                output0_tm += 4 * 12;\n            }\n            for (; i + 7 < tiles; i += 8)\n            {\n                const float* r0 = bb2.row(i / 12 + (i % 12) / 8);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch * 4; // inch always > 0\n\n                __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n                
__m128 _sum4 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum5 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum6 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum7 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 32);\n                    __builtin_prefetch(k0 + 16);\n                    __m128i _val0123 = __lsx_vld(r0, 0);\n                    __m128i _val4567 = __lsx_vld(r0 + 4, 0);\n                    __m128 _w0 = (__m128)__lsx_vld(k0, 0);\n                    _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 0), _sum0);\n                    _sum1 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 1), _sum1);\n                    _sum2 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 2), _sum2);\n                    _sum3 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 3), _sum3);\n                    _sum4 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 0), _sum4);\n                    _sum5 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 1), _sum5);\n                    _sum6 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 2), _sum6);\n                    _sum7 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val4567, 3), _sum7);\n\n                    r0 += 8;\n                    k0 += 4;\n                }\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                __lsx_vst(_sum1, output0_tm + 4, 0);\n                __lsx_vst(_sum2, output0_tm + 4 * 2, 0);\n                __lsx_vst(_sum3, output0_tm + 4 * 3, 0);\n                __lsx_vst(_sum4, output0_tm + 4 * 4, 0);\n                __lsx_vst(_sum5, output0_tm + 4 * 5, 0);\n                __lsx_vst(_sum6, output0_tm + 4 * 6, 0);\n                __lsx_vst(_sum7, output0_tm + 4 * 7, 0);\n\n                output0_tm += 4 * 8;\n            }\n            for (; i + 3 < tiles; i += 4)\n            {\n 
               const float* r0 = bb2.row(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch * 4; // inch always > 0\n\n                __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 16);\n                    __builtin_prefetch(k0 + 16);\n                    __m128i _val0123 = __lsx_vld(r0, 0);\n                    __m128 _w0 = (__m128)__lsx_vld(k0, 0);\n                    _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 0), _sum0);\n                    _sum1 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 1), _sum1);\n                    _sum2 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 2), _sum2);\n                    _sum3 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val0123, 3), _sum3);\n\n                    r0 += 4;\n                    k0 += 4;\n                }\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                __lsx_vst(_sum1, output0_tm + 4, 0);\n                __lsx_vst(_sum2, output0_tm + 4 * 2, 0);\n                __lsx_vst(_sum3, output0_tm + 4 * 3, 0);\n\n                output0_tm += 4 * 4;\n            }\n            for (; i + 1 < tiles; i += 2)\n            {\n                const float* r0 = bb2.row(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch * 4; // inch always > 0\n\n                __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 8);\n     
               __builtin_prefetch(k0 + 16);\n                    __m128 _val0 = __lsx_vreplfr2vr_s(*r0++);\n                    __m128 _val1 = __lsx_vreplfr2vr_s(*r0++);\n                    __m128 _w0 = (__m128)__lsx_vld(k0, 0);\n                    _sum0 = __lsx_vfmadd_s(_w0, _val0, _sum0);\n                    _sum1 = __lsx_vfmadd_s(_w0, _val1, _sum1);\n\n                    k0 += 4;\n                }\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                __lsx_vst(_sum1, output0_tm + 4, 0);\n\n                output0_tm += 4 * 2;\n            }\n            for (; i < tiles; i++)\n            {\n                const float* r0 = bb2.row(i / 12 + (i % 12) / 8 + (i % 12 % 8) / 4 + (i % 12 % 4) / 2 + i % 12 % 2);\n                const float* k0 = kernel0_tm.row(r);\n\n                int nn = inch * 4; // inch always > 0\n\n                __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 4);\n                    __builtin_prefetch(k0 + 16);\n                    __m128 _val0 = __lsx_vreplfr2vr_s(*r0++);\n                    __m128 _w0 = (__m128)__lsx_vld(k0, 0);\n                    _sum = __lsx_vfmadd_s(_w0, _val0, _sum);\n\n                    k0 += 4;\n                }\n\n                __lsx_vst(_sum, output0_tm, 0);\n\n                output0_tm += 4;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_dot_pack8to1_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_winograd_dot_pack8to1_int8_lsx(Mat& bottom_blob_tm, int outch, const Mat& kernel_tm, Mat& top_blob_tm, const Option& opt)\n{\n    // Mat bottom_blob_tm(tiles, 16/36/64, inch, 16u, 8, opt.workspace_allocator);\n\n    const int tiles = bottom_blob_tm.w;\n    const int batch = bottom_blob_tm.h;\n    const int inch = bottom_blob_tm.c;\n\n    // permute\n    Mat bottom_blob_tm2;\n    if (tiles >= 2)\n        bottom_blob_tm2.create(2 * inch, tiles / 2 + tiles % 2, batch, 16u, 8, opt.workspace_allocator);\n    else // if (tiles >= 1)\n        bottom_blob_tm2.create(1 * inch, tiles, batch, 16u, 8, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int r = 0; r < batch; r++)\n    {\n        Mat tm2 = bottom_blob_tm2.channel(r);\n\n        // tile\n        int i = 0;\n        for (; i + 1 < tiles; i += 2)\n        {\n            short* tmpptr = tm2.row<short>(i / 2);\n\n            const short* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 8;\n\n            for (int q = 0; q < inch; q++)\n            {\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __m128i _r1 = __lsx_vld(r0 + 8, 0);\n                __lsx_vst(_r0, tmpptr, 0);\n                __lsx_vst(_r1, tmpptr + 8, 0);\n                r0 += bottom_blob_tm.cstep * 8;\n                tmpptr += 16;\n            }\n        }\n        for (; i < tiles; i++)\n        {\n            short* tmpptr = tm2.row<short>(i / 2 + i % 2);\n\n            const short* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 8;\n\n            for (int q = 0; q < inch; q++)\n            {\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __lsx_vst(_r0, tmpptr, 0);\n                r0 += bottom_blob_tm.cstep * 8;\n                tmpptr += 8;\n            }\n        }\n    }\n\n    
bottom_blob_tm = Mat();\n    // permute end\n\n    top_blob_tm.create(tiles, batch, outch, 4u, 1, opt.workspace_allocator);\n\n    int nn_outch = 0;\n    int remain_outch_start = 0;\n\n    nn_outch = outch >> 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 4;\n\n        int* output0_tm = top_blob_tm.channel(p);\n        int* output1_tm = top_blob_tm.channel(p + 1);\n        int* output2_tm = top_blob_tm.channel(p + 2);\n        int* output3_tm = top_blob_tm.channel(p + 3);\n\n        const Mat kernel0_tm = kernel_tm.channel(p / 4);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 1 < tiles; i += 2)\n            {\n                const short* r0 = bb2.row<const short>(i / 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int nn = inch; // inch always > 0\n\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 64);\n                    __builtin_prefetch(k0 + 128);\n                    __m128i _w0 = __lsx_vld(k0, 0);\n                    __m128i _w1 = __lsx_vld(k0 + 8, 0);\n                    __m128i _w2 = __lsx_vld(k0 + 16, 0);\n                    __m128i _w3 = __lsx_vld(k0 + 24, 0);\n\n                    __m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                    __m128i _extw1 = __lsx_vslti_h(_w1, 0);\n                    __m128i _extw2 = __lsx_vslti_h(_w2, 0);\n                    __m128i _extw3 = __lsx_vslti_h(_w3, 0);\n\n                    __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                    __m128i _w0h = __lsx_vilvh_h(_extw0, 
_w0);\n                    __m128i _w1l = __lsx_vilvl_h(_extw1, _w1);\n                    __m128i _w1h = __lsx_vilvh_h(_extw1, _w1);\n                    __m128i _w2l = __lsx_vilvl_h(_extw2, _w2);\n                    __m128i _w2h = __lsx_vilvh_h(_extw2, _w2);\n                    __m128i _w3l = __lsx_vilvl_h(_extw3, _w3);\n                    __m128i _w3h = __lsx_vilvh_h(_extw3, _w3);\n\n                    __m128i _val0_0 = __lsx_vreplgr2vr_w(r0[0]);\n                    __m128i _val0_1 = __lsx_vreplgr2vr_w(r0[1]);\n                    __m128i _val0_2 = __lsx_vreplgr2vr_w(r0[2]);\n                    __m128i _val0_3 = __lsx_vreplgr2vr_w(r0[3]);\n                    __m128i _val0_4 = __lsx_vreplgr2vr_w(r0[4]);\n                    __m128i _val0_5 = __lsx_vreplgr2vr_w(r0[5]);\n                    __m128i _val0_6 = __lsx_vreplgr2vr_w(r0[6]);\n                    __m128i _val0_7 = __lsx_vreplgr2vr_w(r0[7]);\n                    __m128i _val1_0 = __lsx_vreplgr2vr_w(r0[8]);\n                    __m128i _val1_1 = __lsx_vreplgr2vr_w(r0[9]);\n                    __m128i _val1_2 = __lsx_vreplgr2vr_w(r0[10]);\n                    __m128i _val1_3 = __lsx_vreplgr2vr_w(r0[11]);\n                    __m128i _val1_4 = __lsx_vreplgr2vr_w(r0[12]);\n                    __m128i _val1_5 = __lsx_vreplgr2vr_w(r0[13]);\n                    __m128i _val1_6 = __lsx_vreplgr2vr_w(r0[14]);\n                    __m128i _val1_7 = __lsx_vreplgr2vr_w(r0[15]);\n\n                    _sum0 = __lsx_vmadd_w(_sum0, _w0l, _val0_0);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w0h, _val0_1);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w0l, _val1_0);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w0h, _val1_1);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w1l, _val0_2);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w1h, _val0_3);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w1l, _val1_2);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w1h, _val1_3);\n               
     _sum0 = __lsx_vmadd_w(_sum0, _w2l, _val0_4);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w2h, _val0_5);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w2l, _val1_4);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w2h, _val1_5);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w3l, _val0_6);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w3h, _val0_7);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w3l, _val1_6);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w3h, _val1_7);\n\n                    r0 += 16;\n                    k0 += 32;\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n                _sum2 = __lsx_vadd_w(_sum2, _sum3);\n\n                int sum[8];\n                __lsx_vst(_sum0, sum, 0);\n                __lsx_vst(_sum2, sum + 4, 0);\n\n                output0_tm[0] = sum[0];\n                output1_tm[0] = sum[1];\n                output2_tm[0] = sum[2];\n                output3_tm[0] = sum[3];\n                output0_tm[1] = sum[4];\n                output1_tm[1] = sum[5];\n                output2_tm[1] = sum[6];\n                output3_tm[1] = sum[7];\n                output0_tm += 2;\n                output1_tm += 2;\n                output2_tm += 2;\n                output3_tm += 2;\n            }\n            for (; i < tiles; i++)\n            {\n                const short* r0 = bb2.row<const short>(i / 2 + i % 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int nn = inch; // inch always > 0\n\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 32);\n                    __builtin_prefetch(k0 + 128);\n                    __m128i _w0 = __lsx_vld(k0, 0);\n                    __m128i _w1 = __lsx_vld(k0 + 8, 0);\n                    __m128i _w2 = __lsx_vld(k0 + 16, 0);\n 
                   __m128i _w3 = __lsx_vld(k0 + 24, 0);\n\n                    __m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                    __m128i _extw1 = __lsx_vslti_h(_w1, 0);\n                    __m128i _extw2 = __lsx_vslti_h(_w2, 0);\n                    __m128i _extw3 = __lsx_vslti_h(_w3, 0);\n\n                    __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                    __m128i _w0h = __lsx_vilvh_h(_extw0, _w0);\n                    __m128i _w1l = __lsx_vilvl_h(_extw1, _w1);\n                    __m128i _w1h = __lsx_vilvh_h(_extw1, _w1);\n                    __m128i _w2l = __lsx_vilvl_h(_extw2, _w2);\n                    __m128i _w2h = __lsx_vilvh_h(_extw2, _w2);\n                    __m128i _w3l = __lsx_vilvl_h(_extw3, _w3);\n                    __m128i _w3h = __lsx_vilvh_h(_extw3, _w3);\n\n                    __m128i _val0 = __lsx_vreplgr2vr_w(r0[0]);\n                    __m128i _val1 = __lsx_vreplgr2vr_w(r0[1]);\n                    __m128i _val2 = __lsx_vreplgr2vr_w(r0[2]);\n                    __m128i _val3 = __lsx_vreplgr2vr_w(r0[3]);\n                    __m128i _val4 = __lsx_vreplgr2vr_w(r0[4]);\n                    __m128i _val5 = __lsx_vreplgr2vr_w(r0[5]);\n                    __m128i _val6 = __lsx_vreplgr2vr_w(r0[6]);\n                    __m128i _val7 = __lsx_vreplgr2vr_w(r0[7]);\n\n                    _sum0 = __lsx_vmadd_w(_sum0, _w0l, _val0);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w0h, _val1);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w1l, _val2);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w1h, _val3);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w2l, _val4);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w2h, _val5);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w3l, _val6);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w3h, _val7);\n\n                    r0 += 8;\n                    k0 += 32;\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n\n                
int sum[4];\n                __lsx_vst(_sum0, sum, 0);\n\n                output0_tm[0] = sum[0];\n                output1_tm[0] = sum[1];\n                output2_tm[0] = sum[2];\n                output3_tm[0] = sum[3];\n                output0_tm += 1;\n                output1_tm += 1;\n                output2_tm += 1;\n                output3_tm += 1;\n            }\n        }\n    }\n\n    remain_outch_start += nn_outch << 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        int* output0_tm = top_blob_tm.channel(p);\n\n        const Mat kernel0_tm = kernel_tm.channel(p / 4 + p % 4);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 1 < tiles; i += 2)\n            {\n                const short* r0 = bb2.row<const short>(i / 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n                for (int q = 0; q < inch; q++)\n                {\n                    __builtin_prefetch(r0 + 32);\n                    __builtin_prefetch(k0 + 64);\n                    __m128i _val0 = __lsx_vld(r0, 0);\n                    __m128i _val1 = __lsx_vld(r0 + 8, 0);\n\n                    __m128i _extval0 = __lsx_vslti_h(_val0, 0);\n                    __m128i _extval1 = __lsx_vslti_h(_val1, 0);\n                    __m128i _val0l = __lsx_vilvl_h(_extval0, _val0);\n                    __m128i _val0h = __lsx_vilvh_h(_extval0, _val0);\n                    __m128i _val1l = __lsx_vilvl_h(_extval1, _val1);\n                    __m128i _val1h = __lsx_vilvh_h(_extval1, _val1);\n\n                    __m128i _w0 = __lsx_vld(k0, 0);\n\n                    
__m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                    __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                    __m128i _w0h = __lsx_vilvh_h(_extw0, _w0);\n\n                    _sum0 = __lsx_vmadd_w(_sum0, _w0l, _val0l);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w0h, _val0h);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w0l, _val1l);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w0h, _val1h);\n\n                    k0 += 8;\n                    r0 += 16;\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n                _sum2 = __lsx_vadd_w(_sum2, _sum3);\n\n                output0_tm[0] = __lsx_reduce_add_w(_sum0);\n                output0_tm[1] = __lsx_reduce_add_w(_sum2);\n                output0_tm += 2;\n            }\n            for (; i < tiles; i++)\n            {\n                const short* r0 = bb2.row<const short>(i / 2 + i % 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n                for (int q = 0; q < inch; q++)\n                {\n                    __builtin_prefetch(r0 + 32);\n                    __builtin_prefetch(k0 + 32);\n                    __m128i _val = __lsx_vld(r0, 0);\n\n                    __m128i _extval = __lsx_vslti_h(_val, 0);\n                    __m128i _vall = __lsx_vilvl_h(_extval, _val);\n                    __m128i _valh = __lsx_vilvh_h(_extval, _val);\n\n                    __m128i _w0 = __lsx_vld(k0, 0);\n\n                    __m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                    __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                    __m128i _w0h = __lsx_vilvh_h(_extw0, _w0);\n\n                    _sum0 = __lsx_vmadd_w(_sum0, _w0l, _vall);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w0h, _valh);\n\n                    k0 += 8;\n                    r0 += 8;\n                }\n\n                _sum0 
= __lsx_vadd_w(_sum0, _sum1);\n\n                output0_tm[0] = __lsx_reduce_add_w(_sum0);\n                output0_tm++;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_dot_pack8to4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_winograd_dot_pack8to4_int8_lsx(Mat& bottom_blob_tm, int outch, const Mat& kernel_tm, Mat& top_blob_tm, const Option& opt)\n{\n    // Mat bottom_blob_tm(tiles, 16/36/64, inch, 16u, 8, opt.workspace_allocator);\n\n    const int tiles = bottom_blob_tm.w;\n    const int batch = bottom_blob_tm.h;\n    const int inch = bottom_blob_tm.c;\n\n    // permute\n    Mat bottom_blob_tm2;\n    if (tiles >= 2)\n        bottom_blob_tm2.create(2 * inch, tiles / 2 + tiles % 2, batch, 16u, 8, opt.workspace_allocator);\n    else // if (tiles >= 1)\n        bottom_blob_tm2.create(1 * inch, tiles, batch, 16u, 8, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int r = 0; r < batch; r++)\n    {\n        Mat tm2 = bottom_blob_tm2.channel(r);\n\n        // tile\n        int i = 0;\n        for (; i + 1 < tiles; i += 2)\n        {\n            short* tmpptr = tm2.row<short>(i / 2);\n\n            const short* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 8;\n\n            for (int q = 0; q < inch; q++)\n            {\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __m128i _r1 = __lsx_vld(r0 + 8, 0);\n                __lsx_vst(_r0, tmpptr, 0);\n                __lsx_vst(_r1, tmpptr + 8, 0);\n                r0 += bottom_blob_tm.cstep * 8;\n                tmpptr += 16;\n            }\n        }\n        for (; i < tiles; i++)\n        {\n            short* tmpptr = tm2.row<short>(i / 2 + i % 2);\n\n            const short* r0 = bottom_blob_tm;\n\n            r0 += (r * tiles + i) * 8;\n\n            for (int q = 0; q < inch; q++)\n            {\n                __m128i _r0 = __lsx_vld(r0, 0);\n                __lsx_vst(_r0, tmpptr, 0);\n                r0 += bottom_blob_tm.cstep * 8;\n                tmpptr += 8;\n            }\n        }\n    }\n\n    
bottom_blob_tm = Mat();\n    // permute end\n\n    top_blob_tm.create(tiles, batch, outch, 16u, 4, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* output0_tm = top_blob_tm.channel(p);\n\n        const Mat kernel0_tm = kernel_tm.channel(p);\n\n        for (int r = 0; r < batch; r++)\n        {\n            const Mat bb2 = bottom_blob_tm2.channel(r);\n\n            int i = 0;\n            for (; i + 1 < tiles; i += 2)\n            {\n                const short* r0 = bb2.row<const short>(i / 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int nn = inch; // inch always > 0\n\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum2 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum3 = __lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 64);\n                    __builtin_prefetch(k0 + 128);\n                    __m128i _w0 = __lsx_vld(k0, 0);\n                    __m128i _w1 = __lsx_vld(k0 + 8, 0);\n                    __m128i _w2 = __lsx_vld(k0 + 16, 0);\n                    __m128i _w3 = __lsx_vld(k0 + 24, 0);\n\n                    __m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                    __m128i _extw1 = __lsx_vslti_h(_w1, 0);\n                    __m128i _extw2 = __lsx_vslti_h(_w2, 0);\n                    __m128i _extw3 = __lsx_vslti_h(_w3, 0);\n\n                    __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                    __m128i _w0h = __lsx_vilvh_h(_extw0, _w0);\n                    __m128i _w1l = __lsx_vilvl_h(_extw1, _w1);\n                    __m128i _w1h = __lsx_vilvh_h(_extw1, _w1);\n                    __m128i _w2l = __lsx_vilvl_h(_extw2, _w2);\n                    __m128i _w2h = __lsx_vilvh_h(_extw2, _w2);\n                    __m128i 
_w3l = __lsx_vilvl_h(_extw3, _w3);\n                    __m128i _w3h = __lsx_vilvh_h(_extw3, _w3);\n\n                    __m128i _val0_0 = __lsx_vreplgr2vr_w(r0[0]);\n                    __m128i _val0_1 = __lsx_vreplgr2vr_w(r0[1]);\n                    __m128i _val0_2 = __lsx_vreplgr2vr_w(r0[2]);\n                    __m128i _val0_3 = __lsx_vreplgr2vr_w(r0[3]);\n                    __m128i _val0_4 = __lsx_vreplgr2vr_w(r0[4]);\n                    __m128i _val0_5 = __lsx_vreplgr2vr_w(r0[5]);\n                    __m128i _val0_6 = __lsx_vreplgr2vr_w(r0[6]);\n                    __m128i _val0_7 = __lsx_vreplgr2vr_w(r0[7]);\n                    __m128i _val1_0 = __lsx_vreplgr2vr_w(r0[8]);\n                    __m128i _val1_1 = __lsx_vreplgr2vr_w(r0[9]);\n                    __m128i _val1_2 = __lsx_vreplgr2vr_w(r0[10]);\n                    __m128i _val1_3 = __lsx_vreplgr2vr_w(r0[11]);\n                    __m128i _val1_4 = __lsx_vreplgr2vr_w(r0[12]);\n                    __m128i _val1_5 = __lsx_vreplgr2vr_w(r0[13]);\n                    __m128i _val1_6 = __lsx_vreplgr2vr_w(r0[14]);\n                    __m128i _val1_7 = __lsx_vreplgr2vr_w(r0[15]);\n\n                    _sum0 = __lsx_vmadd_w(_sum0, _w0l, _val0_0);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w0h, _val0_1);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w0l, _val1_0);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w0h, _val1_1);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w1l, _val0_2);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w1h, _val0_3);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w1l, _val1_2);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w1h, _val1_3);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w2l, _val0_4);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w2h, _val0_5);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w2l, _val1_4);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w2h, _val1_5);\n                    _sum0 = 
__lsx_vmadd_w(_sum0, _w3l, _val0_6);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w3h, _val0_7);\n                    _sum2 = __lsx_vmadd_w(_sum2, _w3l, _val1_6);\n                    _sum3 = __lsx_vmadd_w(_sum3, _w3h, _val1_7);\n\n                    r0 += 16;\n                    k0 += 32;\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n                _sum2 = __lsx_vadd_w(_sum2, _sum3);\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                __lsx_vst(_sum2, output0_tm + 4, 0);\n\n                output0_tm += 8;\n            }\n            for (; i < tiles; i++)\n            {\n                const short* r0 = bb2.row<const short>(i / 2 + i % 2);\n                const short* k0 = kernel0_tm.row<const short>(r);\n\n                int nn = inch; // inch always > 0\n\n                __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n                for (int j = 0; j < nn; j++)\n                {\n                    __builtin_prefetch(r0 + 32);\n                    __builtin_prefetch(k0 + 128);\n                    __m128i _w0 = __lsx_vld(k0, 0);\n                    __m128i _w1 = __lsx_vld(k0 + 8, 0);\n                    __m128i _w2 = __lsx_vld(k0 + 16, 0);\n                    __m128i _w3 = __lsx_vld(k0 + 24, 0);\n\n                    __m128i _extw0 = __lsx_vslti_h(_w0, 0);\n                    __m128i _extw1 = __lsx_vslti_h(_w1, 0);\n                    __m128i _extw2 = __lsx_vslti_h(_w2, 0);\n                    __m128i _extw3 = __lsx_vslti_h(_w3, 0);\n\n                    __m128i _w0l = __lsx_vilvl_h(_extw0, _w0);\n                    __m128i _w0h = __lsx_vilvh_h(_extw0, _w0);\n                    __m128i _w1l = __lsx_vilvl_h(_extw1, _w1);\n                    __m128i _w1h = __lsx_vilvh_h(_extw1, _w1);\n                    __m128i _w2l = __lsx_vilvl_h(_extw2, _w2);\n                    __m128i _w2h = __lsx_vilvh_h(_extw2, _w2);\n                    __m128i _w3l 
= __lsx_vilvl_h(_extw3, _w3);\n                    __m128i _w3h = __lsx_vilvh_h(_extw3, _w3);\n\n                    __m128i _val0 = __lsx_vreplgr2vr_w(r0[0]);\n                    __m128i _val1 = __lsx_vreplgr2vr_w(r0[1]);\n                    __m128i _val2 = __lsx_vreplgr2vr_w(r0[2]);\n                    __m128i _val3 = __lsx_vreplgr2vr_w(r0[3]);\n                    __m128i _val4 = __lsx_vreplgr2vr_w(r0[4]);\n                    __m128i _val5 = __lsx_vreplgr2vr_w(r0[5]);\n                    __m128i _val6 = __lsx_vreplgr2vr_w(r0[6]);\n                    __m128i _val7 = __lsx_vreplgr2vr_w(r0[7]);\n\n                    _sum0 = __lsx_vmadd_w(_sum0, _w0l, _val0);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w0h, _val1);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w1l, _val2);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w1h, _val3);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w2l, _val4);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w2h, _val5);\n                    _sum0 = __lsx_vmadd_w(_sum0, _w3l, _val6);\n                    _sum1 = __lsx_vmadd_w(_sum1, _w3h, _val7);\n\n                    r0 += 8;\n                    k0 += 32;\n                }\n\n                _sum0 = __lsx_vadd_w(_sum0, _sum1);\n\n                __lsx_vst(_sum0, output0_tm, 0);\n                output0_tm += 4;\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_transform.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_input_lsx(const Mat& bottom_blob, Mat& bottom_blob_tm, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int inch = bottom_blob.c;\n\n    const int w_tiles = (w - 2) / 4;\n    const int h_tiles = (h - 2) / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float itm[6][6] = {\n    //     {4.0f, 0.0f, -5.0f, 0.0f, 1.0f, 0.0f},\n    //     {0.0f,-4.0f, -4.0f, 1.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f, -4.0f,-1.0f, 1.0f, 0.0f},\n    //     {0.0f,-2.0f, -1.0f, 2.0f, 1.0f, 0.0f},\n    //     {0.0f, 2.0f, -1.0f,-2.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f,  0.0f,-5.0f, 0.0f, 1.0f}\n    // };\n\n    // 0 =  4 * r00 - 5 * r02 + r04\n    // 1 = -4 * (r01 + r02) + r04 + r03\n    // 2 =  4 * (r01 - r02) + r04 - r03\n    // 3 = -2 * (r01 - r03) + r04 - r02\n    // 4 =  2 * (r01 - r03) + r04 - r02\n    // 5 =  4 * r01 - 5 * r03 + r05\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        Mat img0_tm = bottom_blob_tm.channel(q);\n\n        float tmp[6][6];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* r0 = img0.row(i * 4) + (j * 4);\n\n                for (int m = 0; m < 6; m++)\n                {\n                    float r00 = r0[0];\n                    float r01 = r0[1];\n                    float r02 = r0[2];\n                    float r03 = r0[3];\n                    float r04 = r0[4];\n                    float r05 = r0[5];\n\n                    float tmp0m = 4 * r00 - 5 * r02 + r04;\n                    float tmp1m = -4 * (r01 + r02) + r04 + r03;\n                    float tmp2m = 4 * (r01 - r02) + r04 - r03;\n       
             float tmp3m = -2 * (r01 - r03) + r04 - r02;\n                    float tmp4m = 2 * (r01 - r03) + r04 - r02;\n                    float tmp5m = 4 * r01 - 5 * r03 + r05;\n\n                    tmp[0][m] = tmp0m;\n                    tmp[1][m] = tmp1m;\n                    tmp[2][m] = tmp2m;\n                    tmp[3][m] = tmp3m;\n                    tmp[4][m] = tmp4m;\n                    tmp[5][m] = tmp5m;\n\n                    r0 += w;\n                }\n\n                float* r0_tm_0 = (float*)img0_tm + (i * w_tiles + j);\n                float* r0_tm_1 = r0_tm_0 + tiles;\n                float* r0_tm_2 = r0_tm_0 + tiles * 2;\n                float* r0_tm_3 = r0_tm_0 + tiles * 3;\n                float* r0_tm_4 = r0_tm_0 + tiles * 4;\n                float* r0_tm_5 = r0_tm_0 + tiles * 5;\n\n                for (int m = 0; m < 6; m++)\n                {\n                    float tmp00 = tmp[m][0];\n                    float tmp01 = tmp[m][1];\n                    float tmp02 = tmp[m][2];\n                    float tmp03 = tmp[m][3];\n                    float tmp04 = tmp[m][4];\n                    float tmp05 = tmp[m][5];\n\n                    float r0tm0 = 4 * tmp00 - 5 * tmp02 + tmp04;\n                    float r0tm1 = -4 * (tmp01 + tmp02) + tmp04 + tmp03;\n                    float r0tm2 = 4 * (tmp01 - tmp02) + tmp04 - tmp03;\n                    float r0tm3 = -2 * (tmp01 - tmp03) + tmp04 - tmp02;\n                    float r0tm4 = 2 * (tmp01 - tmp03) + tmp04 - tmp02;\n                    float r0tm5 = 4 * tmp01 - 5 * tmp03 + tmp05;\n\n                    r0_tm_0[0] = r0tm0;\n                    r0_tm_1[0] = r0tm1;\n                    r0_tm_2[0] = r0tm2;\n                    r0_tm_3[0] = r0tm3;\n                    r0_tm_4[0] = r0tm4;\n                    r0_tm_5[0] = r0tm5;\n\n                    r0_tm_0 += tiles * 6;\n                    r0_tm_1 += tiles * 6;\n                    r0_tm_2 += tiles * 6;\n                    r0_tm_3 += 
tiles * 6;\n                    r0_tm_4 += tiles * 6;\n                    r0_tm_5 += tiles * 6;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_transform_output_lsx(const Mat& top_blob_tm, Mat& top_blob, const Mat& bias, const Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int w_tiles = outw / 4;\n    const int h_tiles = outh / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    const float* biasptr = bias;\n\n    // const float otm[4][6] = {\n    //     {1.0f, 1.0f,  1.0f, 1.0f,  1.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 2.0f, -2.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f, 4.0f,  4.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 8.0f, -8.0f, 1.0f}\n    // };\n\n    // 0 = r00 + (r01 + r02) + (r03 + r04)\n    // 1 =       (r01 - r02) + (r03 - r04) * 2\n    // 2 =       (r01 + r02) + (r03 + r04) * 4\n    // 3 = r05 + (r01 - r02) + (r03 - r04) * 8\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        const Mat out0_tm = top_blob_tm.channel(p);\n        Mat out0 = top_blob.channel(p);\n\n        float bias0 = biasptr ? 
biasptr[p] : 0.f;\n\n        float tmp[4][6];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* output0_tm_0 = (const float*)out0_tm + (i * w_tiles + j);\n                const float* output0_tm_1 = output0_tm_0 + tiles;\n                const float* output0_tm_2 = output0_tm_0 + tiles * 2;\n                const float* output0_tm_3 = output0_tm_0 + tiles * 3;\n                const float* output0_tm_4 = output0_tm_0 + tiles * 4;\n                const float* output0_tm_5 = output0_tm_0 + tiles * 5;\n\n                float* output0 = out0.row(i * 4) + (j * 4);\n\n                for (int m = 0; m < 6; m++)\n                {\n                    float out0tm0 = output0_tm_0[0];\n                    float out0tm1 = output0_tm_1[0];\n                    float out0tm2 = output0_tm_2[0];\n                    float out0tm3 = output0_tm_3[0];\n                    float out0tm4 = output0_tm_4[0];\n                    float out0tm5 = output0_tm_5[0];\n\n                    float tmp02a = out0tm1 + out0tm2;\n                    float tmp13a = out0tm1 - out0tm2;\n\n                    float tmp02b = out0tm3 + out0tm4;\n                    float tmp13b = out0tm3 - out0tm4;\n\n                    float tmp0m = out0tm0 + tmp02a + tmp02b;\n                    float tmp1m = tmp13a + tmp13b * 2;\n                    float tmp2m = tmp02a + tmp02b * 4;\n                    float tmp3m = out0tm5 + tmp13a + tmp13b * 8;\n\n                    tmp[0][m] = tmp0m;\n                    tmp[1][m] = tmp1m;\n                    tmp[2][m] = tmp2m;\n                    tmp[3][m] = tmp3m;\n\n                    output0_tm_0 += tiles * 6;\n                    output0_tm_1 += tiles * 6;\n                    output0_tm_2 += tiles * 6;\n                    output0_tm_3 += tiles * 6;\n                    output0_tm_4 += tiles * 6;\n                    output0_tm_5 += tiles * 6;\n        
        }\n\n                for (int m = 0; m < 4; m++)\n                {\n                    float tmp00 = tmp[m][0];\n                    float tmp01 = tmp[m][1];\n                    float tmp02 = tmp[m][2];\n                    float tmp03 = tmp[m][3];\n                    float tmp04 = tmp[m][4];\n                    float tmp05 = tmp[m][5];\n\n                    float tmp02a = tmp01 + tmp02;\n                    float tmp13a = tmp01 - tmp02;\n\n                    float tmp02b = tmp03 + tmp04;\n                    float tmp13b = tmp03 - tmp04;\n\n                    float out00 = bias0 + tmp00 + tmp02a + tmp02b;\n                    float out01 = bias0 + tmp13a + tmp13b * 2;\n                    float out02 = bias0 + tmp02a + tmp02b * 4;\n                    float out03 = bias0 + tmp05 + tmp13a + tmp13b * 8;\n\n                    output0[0] = out00;\n                    output0[1] = out01;\n                    output0[2] = out02;\n                    output0[3] = out03;\n\n                    output0 += outw;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_transform_input_lsx(const Mat& bottom_blob, Mat& bottom_blob_tm, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int inch = bottom_blob.c;\n\n    const int w_tiles = (w - 2) / 2;\n    const int h_tiles = (h - 2) / 2;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float itm[4][4] = {\n    //     {1.0f,  0.0f, -1.0f,  0.0f},\n    //     {0.0f,  1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  0.00f, 1.0f}\n    // };\n\n    // 0 = r00 - r02\n    // 1 = r01 + r02\n    // 2 = r02 - r01\n    // 3 = r03 - r01\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        Mat img0_tm = bottom_blob_tm.channel(q);\n\n        float tmp[4][4];\n\n        // tile\n        
for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* r0 = img0.row(i * 2) + (j * 2);\n\n                for (int m = 0; m < 4; m++)\n                {\n                    float r00 = r0[0];\n                    float r01 = r0[1];\n                    float r02 = r0[2];\n                    float r03 = r0[3];\n\n                    float tmp0m = r00 - r02;\n                    float tmp1m = r01 + r02;\n                    float tmp2m = r02 - r01;\n                    float tmp3m = r03 - r01;\n\n                    tmp[0][m] = tmp0m;\n                    tmp[1][m] = tmp1m;\n                    tmp[2][m] = tmp2m;\n                    tmp[3][m] = tmp3m;\n\n                    r0 += w;\n                }\n\n                float* r0_tm_0 = (float*)img0_tm + (i * w_tiles + j);\n                float* r0_tm_1 = r0_tm_0 + tiles;\n                float* r0_tm_2 = r0_tm_0 + tiles * 2;\n                float* r0_tm_3 = r0_tm_0 + tiles * 3;\n\n                for (int m = 0; m < 4; m++)\n                {\n                    float tmp00 = tmp[m][0];\n                    float tmp01 = tmp[m][1];\n                    float tmp02 = tmp[m][2];\n                    float tmp03 = tmp[m][3];\n\n                    float r0tm0 = tmp00 - tmp02;\n                    float r0tm1 = tmp01 + tmp02;\n                    float r0tm2 = tmp02 - tmp01;\n                    float r0tm3 = tmp03 - tmp01;\n\n                    r0_tm_0[0] = r0tm0;\n                    r0_tm_1[0] = r0tm1;\n                    r0_tm_2[0] = r0tm2;\n                    r0_tm_3[0] = r0tm3;\n\n                    r0_tm_0 += tiles * 4;\n                    r0_tm_1 += tiles * 4;\n                    r0_tm_2 += tiles * 4;\n                    r0_tm_3 += tiles * 4;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_transform_output_lsx(const Mat& top_blob_tm, Mat& top_blob, const Mat& bias, const 
Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int w_tiles = outw / 2;\n    const int h_tiles = outh / 2;\n    const int tiles = w_tiles * h_tiles;\n\n    const float* biasptr = bias;\n\n    // const float otm[2][4] = {\n    //     {1.0f,  1.0f,  1.0f,  0.0f},\n    //     {0.0f,  1.0f, -1.0f,  1.0f}\n    // };\n\n    // 0 = r00 + r01 + r02\n    // 1 = r01 - r02 + r03\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        const Mat out0_tm = top_blob_tm.channel(p);\n        Mat out0 = top_blob.channel(p);\n\n        float bias0 = biasptr ? biasptr[p] : 0.f;\n\n        float tmp[2][4];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* output0_tm_0 = (const float*)out0_tm + (i * w_tiles + j);\n                const float* output0_tm_1 = output0_tm_0 + tiles;\n                const float* output0_tm_2 = output0_tm_0 + tiles * 2;\n                const float* output0_tm_3 = output0_tm_0 + tiles * 3;\n\n                float* output0 = out0.row(i * 2) + (j * 2);\n\n                for (int m = 0; m < 4; m++)\n                {\n                    float out0tm0 = output0_tm_0[0];\n                    float out0tm1 = output0_tm_1[0];\n                    float out0tm2 = output0_tm_2[0];\n                    float out0tm3 = output0_tm_3[0];\n\n                    float tmp0m = out0tm0 + out0tm1 + out0tm2;\n                    float tmp1m = out0tm1 - out0tm2 + out0tm3;\n\n                    tmp[0][m] = tmp0m;\n                    tmp[1][m] = tmp1m;\n\n                    output0_tm_0 += tiles * 4;\n                    output0_tm_1 += tiles * 4;\n                    output0_tm_2 += tiles * 4;\n                    output0_tm_3 += tiles * 4;\n                }\n\n                for (int m = 0; m < 2; m++)\n               
 {\n                    float tmp00 = tmp[m][0];\n                    float tmp01 = tmp[m][1];\n                    float tmp02 = tmp[m][2];\n                    float tmp03 = tmp[m][3];\n\n                    float out00 = bias0 + tmp00 + tmp01 + tmp02;\n                    float out01 = bias0 + tmp01 - tmp02 + tmp03;\n\n                    output0[0] = out00;\n                    output0[1] = out01;\n\n                    output0 += outw;\n                }\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_transform_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_input_int8_lsx(const Mat& bottom_blob, Mat& bottom_blob_tm, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int inch = bottom_blob.c;\n\n    const int w_tiles = (w - 2) / 4;\n    const int h_tiles = (h - 2) / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float itm[6][6] = {\n    //     {4.0f, 0.0f, -5.0f, 0.0f, 1.0f, 0.0f},\n    //     {0.0f,-4.0f, -4.0f, 1.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f, -4.0f,-1.0f, 1.0f, 0.0f},\n    //     {0.0f,-2.0f, -1.0f, 2.0f, 1.0f, 0.0f},\n    //     {0.0f, 2.0f, -1.0f,-2.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f,  0.0f,-5.0f, 0.0f, 1.0f}\n    // };\n\n    // 0 =  4 * r00 - 5 * r02 + r04\n    // 1 = -4 * (r01 + r02) + r04 + r03\n    // 2 =  4 * (r01 - r02) + r04 - r03\n    // 3 = -2 * (r01 - r03) + r04 - r02\n    // 4 =  2 * (r01 - r03) + r04 - r02\n    // 5 =  4 * r01 - 5 * r03 + r05\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        Mat img0_tm = bottom_blob_tm.channel(q);\n\n        short tmp[6][6];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const signed char* r0 = img0.row<const signed char>(i * 4) + (j * 4);\n\n                for (int m = 0; m < 6; m++)\n                {\n                    signed char r00 = r0[0];\n                    signed char r01 = r0[1];\n                    signed char r02 = r0[2];\n                    signed char r03 = r0[3];\n                    signed char r04 = r0[4];\n                    signed char r05 = r0[5];\n\n                    short tmp0m = 4 * r00 - 5 * r02 + r04;\n                    short tmp1m = -4 * (r01 + r02) + r04 + r03;\n     
               short tmp2m = 4 * (r01 - r02) + r04 - r03;\n                    short tmp3m = -2 * (r01 - r03) + r04 - r02;\n                    short tmp4m = 2 * (r01 - r03) + r04 - r02;\n                    short tmp5m = 4 * r01 - 5 * r03 + r05;\n\n                    tmp[0][m] = tmp0m;\n                    tmp[1][m] = tmp1m;\n                    tmp[2][m] = tmp2m;\n                    tmp[3][m] = tmp3m;\n                    tmp[4][m] = tmp4m;\n                    tmp[5][m] = tmp5m;\n\n                    r0 += w;\n                }\n\n                short* r0_tm_0 = (short*)img0_tm + (i * w_tiles + j);\n                short* r0_tm_1 = r0_tm_0 + tiles;\n                short* r0_tm_2 = r0_tm_0 + tiles * 2;\n                short* r0_tm_3 = r0_tm_0 + tiles * 3;\n                short* r0_tm_4 = r0_tm_0 + tiles * 4;\n                short* r0_tm_5 = r0_tm_0 + tiles * 5;\n\n                for (int m = 0; m < 6; m++)\n                {\n                    short tmp00 = tmp[m][0];\n                    short tmp01 = tmp[m][1];\n                    short tmp02 = tmp[m][2];\n                    short tmp03 = tmp[m][3];\n                    short tmp04 = tmp[m][4];\n                    short tmp05 = tmp[m][5];\n\n                    short r0tm0 = 4 * tmp00 - 5 * tmp02 + tmp04;\n                    short r0tm1 = -4 * (tmp01 + tmp02) + tmp04 + tmp03;\n                    short r0tm2 = 4 * (tmp01 - tmp02) + tmp04 - tmp03;\n                    short r0tm3 = -2 * (tmp01 - tmp03) + tmp04 - tmp02;\n                    short r0tm4 = 2 * (tmp01 - tmp03) + tmp04 - tmp02;\n                    short r0tm5 = 4 * tmp01 - 5 * tmp03 + tmp05;\n\n                    r0_tm_0[0] = r0tm0;\n                    r0_tm_1[0] = r0tm1;\n                    r0_tm_2[0] = r0tm2;\n                    r0_tm_3[0] = r0tm3;\n                    r0_tm_4[0] = r0tm4;\n                    r0_tm_5[0] = r0tm5;\n\n                    r0_tm_0 += tiles * 6;\n                    r0_tm_1 += tiles * 6;\n             
       r0_tm_2 += tiles * 6;\n                    r0_tm_3 += tiles * 6;\n                    r0_tm_4 += tiles * 6;\n                    r0_tm_5 += tiles * 6;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_transform_output_int8_lsx(const Mat& top_blob_tm, Mat& top_blob, const Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int w_tiles = outw / 4;\n    const int h_tiles = outh / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float otm[4][6] = {\n    //     {1.0f, 1.0f,  1.0f, 1.0f,  1.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 2.0f, -2.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f, 4.0f,  4.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 8.0f, -8.0f, 1.0f}\n    // };\n\n    // 0 = r00 + (r01 + r02) + (r03 + r04)\n    // 1 =       (r01 - r02) + (r03 - r04) * 2\n    // 2 =       (r01 + r02) + (r03 + r04) * 4\n    // 3 = r05 + (r01 - r02) + (r03 - r04) * 8\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        const Mat out0_tm = top_blob_tm.channel(p);\n        Mat out0 = top_blob.channel(p);\n\n        int tmp[4][6];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const int* output0_tm_0 = (const int*)out0_tm + (i * w_tiles + j) * 1;\n                const int* output0_tm_1 = output0_tm_0 + tiles * 1;\n                const int* output0_tm_2 = output0_tm_0 + tiles * 2;\n                const int* output0_tm_3 = output0_tm_0 + tiles * 3;\n                const int* output0_tm_4 = output0_tm_0 + tiles * 4;\n                const int* output0_tm_5 = output0_tm_0 + tiles * 5;\n\n                int* output0 = out0.row<int>(i * 4) + j * 4;\n\n                for (int m = 0; m < 5; m++)\n                {\n                    int tmp02a = output0_tm_1[0] + output0_tm_2[0];\n          
          int tmp13a = output0_tm_1[0] - output0_tm_2[0];\n\n                    int tmp02b = output0_tm_3[0] + output0_tm_4[0];\n                    int tmp13b = output0_tm_3[0] - output0_tm_4[0];\n\n                    tmp[0][m] = output0_tm_0[0] + tmp02a + tmp02b;\n                    tmp[1][m] = tmp13a + tmp13b * 2;\n                    tmp[2][m] = tmp02a + tmp02b * 4;\n                    tmp[3][m] = output0_tm_5[0] * 4 + tmp13a + tmp13b * 8;\n\n                    output0_tm_0 += tiles * 6;\n                    output0_tm_1 += tiles * 6;\n                    output0_tm_2 += tiles * 6;\n                    output0_tm_3 += tiles * 6;\n                    output0_tm_4 += tiles * 6;\n                    output0_tm_5 += tiles * 6;\n                }\n                for (int m = 5; m < 6; m++)\n                {\n                    int tmp02a = output0_tm_1[0] + output0_tm_2[0];\n                    int tmp13a = output0_tm_1[0] - output0_tm_2[0];\n\n                    int tmp02b = output0_tm_3[0] + output0_tm_4[0];\n                    int tmp13b = output0_tm_3[0] - output0_tm_4[0];\n\n                    tmp[0][m] = (output0_tm_0[0] + tmp02a + tmp02b) * 4;\n                    tmp[1][m] = (tmp13a + tmp13b * 2) * 4;\n                    tmp[2][m] = (tmp02a + tmp02b * 4) * 4;\n                    tmp[3][m] = (output0_tm_5[0] * 4 + tmp13a + tmp13b * 8) * 4;\n\n                    output0_tm_0 += tiles * 6;\n                    output0_tm_1 += tiles * 6;\n                    output0_tm_2 += tiles * 6;\n                    output0_tm_3 += tiles * 6;\n                    output0_tm_4 += tiles * 6;\n                    output0_tm_5 += tiles * 6;\n                }\n\n                for (int m = 0; m < 4; m++)\n                {\n                    const int* tmp0 = tmp[m];\n\n                    int tmp02a = tmp0[1] + tmp0[2];\n                    int tmp13a = tmp0[1] - tmp0[2];\n\n                    int tmp02b = tmp0[3] + tmp0[4];\n                    int tmp13b = 
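tmp0[3] - tmp0[4];\n\n                    // 576 = 24 * 24; presumably this undoes the integer scaling applied by the int8 kernel transform (assumption, that transform is not shown in this file)\n                    output0[0] = (tmp0[0] + tmp02a + tmp02b) / 576;\n                    output0[1] = (tmp13a + tmp13b * 2) / 576;\n                    output0[2] = (tmp02a + tmp02b * 4) / 576;\n                    output0[3] = (tmp0[5] + tmp13a + tmp13b * 8) / 576;\n\n                    output0 += outw;\n                }\n            }\n        }\n    }\n}\n"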
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_transform_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd63_transform_input_pack4_lsx(const Mat& bottom_blob, Mat& bottom_blob_tm, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int inch = bottom_blob.c;\n\n    const int w_tiles = (w - 2) / 6;\n    const int h_tiles = (h - 2) / 6;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float itm[8][8] = {\n    //     {1.0f,  0.0f, -5.25f,  0.00f,  5.25f,  0.00f, -1.0f, 0.0f},\n    //\n    //     {0.0f,  1.0f,  1.00f, -4.25f, -4.25f,  1.00f,  1.0f, 0.0f},\n    //     {0.0f, -1.0f,  1.00f,  4.25f, -4.25f, -1.00f,  1.0f, 0.0f},\n    //\n    //     {0.0f,  0.5f,  0.25f, -2.50f, -1.25f,  2.00f,  1.0f, 0.0f},\n    //     {0.0f, -0.5f,  0.25f,  2.50f, -1.25f, -2.00f,  1.0f, 0.0f},\n    //\n    //     {0.0f,  2.0f,  4.00f, -2.50f, -5.00f,  0.50f,  1.0f, 0.0f},\n    //     {0.0f, -2.0f,  4.00f,  2.50f, -5.00f, -0.50f,  1.0f, 0.0f},\n    //\n    //     {0.0f, -1.0f,  0.00f,  5.25f,  0.00f, -5.25f,  0.0f, 1.0f}\n    // };\n\n    // 0 = r00 - r06 + (r04 - r02) * 5.25\n    // 7 = r07 - r01 + (r03 - r05) * 5.25\n\n    // 1 = (r02 + r06 - r04 * 4.25) + (r01 - r03 * 4.25 + r05)\n    // 2 = (r02 + r06 - r04 * 4.25) - (r01 - r03 * 4.25 + r05)\n\n    // 3 = (r06 + r02 * 0.25 - r04 * 1.25) + (r01 * 0.5 - r03 * 2.5 + r05 * 2)\n    // 4 = (r06 + r02 * 0.25 - r04 * 1.25) - (r01 * 0.5 - r03 * 2.5 + r05 * 2)\n\n    // reuse r04 * 1.25\n    // reuse r03 * 2.5\n    // 5 = (r06 + (r02 - r04 * 1.25) * 4) + (r01 * 2 - r03 * 2.5 + r05 * 0.5)\n    // 6 = (r06 + (r02 - r04 * 1.25) * 4) - (r01 * 2 - r03 * 2.5 + r05 * 0.5)\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        Mat img0_tm = bottom_blob_tm.channel(q);\n\n        float tmp[8][8][4];\n\n        __m128 _v5_25 = 
__lsx_vreplfr2vr_s(5.25f);\n        __m128 _vm4_25 = __lsx_vreplfr2vr_s(-4.25f);\n        __m128 _vm1_25 = __lsx_vreplfr2vr_s(-1.25f);\n        __m128 _v0_25 = __lsx_vreplfr2vr_s(0.25f);\n        __m128 _vm2_5 = __lsx_vreplfr2vr_s(-2.5f);\n        __m128 _v0_5 = __lsx_vreplfr2vr_s(0.5f);\n        __m128 _v2 = __lsx_vreplfr2vr_s(2.f);\n        __m128 _v4 = __lsx_vreplfr2vr_s(4.f);\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* r0 = img0.row(i * 6) + (j * 6) * 4;\n\n                for (int m = 0; m < 8; m++)\n                {\n                    __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                    __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                    __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n                    __m128 _r04 = (__m128)__lsx_vld(r0 + 4 * 4, 0);\n                    __m128 _r05 = (__m128)__lsx_vld(r0 + 4 * 5, 0);\n                    __m128 _r06 = (__m128)__lsx_vld(r0 + 4 * 6, 0);\n                    __m128 _r07 = (__m128)__lsx_vld(r0 + 4 * 7, 0);\n\n                    __m128 _tmp0m = __lsx_vfmadd_s(__lsx_vfsub_s(_r04, _r02), _v5_25, __lsx_vfsub_s(_r00, _r06));\n                    __m128 _tmp7m = __lsx_vfmadd_s(__lsx_vfsub_s(_r03, _r05), _v5_25, __lsx_vfsub_s(_r07, _r01));\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp7m, tmp[7][m], 0);\n\n                    __m128 _tmp12a = __lsx_vfmadd_s(_r04, _vm4_25, __lsx_vfadd_s(_r02, _r06));\n                    __m128 _tmp12b = __lsx_vfmadd_s(_r03, _vm4_25, __lsx_vfadd_s(_r01, _r05));\n\n                    __m128 _tmp1m = __lsx_vfadd_s(_tmp12a, _tmp12b);\n                    __m128 _tmp2m = __lsx_vfsub_s(_tmp12a, _tmp12b);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n\n                    
__m128 _tmp34a = __lsx_vfmadd_s(_r04, _vm1_25, __lsx_vfmadd_s(_r02, _v0_25, _r06));\n                    __m128 _tmp34b = __lsx_vfmadd_s(_r05, _v2, __lsx_vfmadd_s(_r03, _vm2_5, __lsx_vfmul_s(_r01, _v0_5)));\n\n                    __m128 _tmp3m = __lsx_vfadd_s(_tmp34a, _tmp34b);\n                    __m128 _tmp4m = __lsx_vfsub_s(_tmp34a, _tmp34b);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n                    __lsx_vst(_tmp4m, tmp[4][m], 0);\n\n                    __m128 _tmp56a = __lsx_vfmadd_s(__lsx_vfmadd_s(_r04, _vm1_25, _r02), _v4, _r06);\n                    __m128 _tmp56b = __lsx_vfmadd_s(_r05, _v0_5, __lsx_vfmadd_s(_r03, _vm2_5, __lsx_vfmul_s(_r01, _v2)));\n\n                    __m128 _tmp5m = __lsx_vfadd_s(_tmp56a, _tmp56b);\n                    __m128 _tmp6m = __lsx_vfsub_s(_tmp56a, _tmp56b);\n                    __lsx_vst(_tmp5m, tmp[5][m], 0);\n                    __lsx_vst(_tmp6m, tmp[6][m], 0);\n\n                    r0 += w * 4;\n                }\n\n                float* r0_tm_0 = (float*)img0_tm + (i * w_tiles + j) * 4;\n                float* r0_tm_1 = r0_tm_0 + tiles * 4;\n                float* r0_tm_2 = r0_tm_0 + tiles * 4 * 2;\n                float* r0_tm_3 = r0_tm_0 + tiles * 4 * 3;\n                float* r0_tm_4 = r0_tm_0 + tiles * 4 * 4;\n                float* r0_tm_5 = r0_tm_0 + tiles * 4 * 5;\n                float* r0_tm_6 = r0_tm_0 + tiles * 4 * 6;\n                float* r0_tm_7 = r0_tm_0 + tiles * 4 * 7;\n\n                for (int m = 0; m < 8; m++)\n                {\n                    __m128 _tmp00 = (__m128)__lsx_vld(tmp[m][0], 0);\n                    __m128 _tmp01 = (__m128)__lsx_vld(tmp[m][1], 0);\n                    __m128 _tmp02 = (__m128)__lsx_vld(tmp[m][2], 0);\n                    __m128 _tmp03 = (__m128)__lsx_vld(tmp[m][3], 0);\n                    __m128 _tmp04 = (__m128)__lsx_vld(tmp[m][4], 0);\n                    __m128 _tmp05 = (__m128)__lsx_vld(tmp[m][5], 0);\n                    __m128 _tmp06 = 
(__m128)__lsx_vld(tmp[m][6], 0);\n                    __m128 _tmp07 = (__m128)__lsx_vld(tmp[m][7], 0);\n\n                    __m128 _r0tm0 = __lsx_vfmadd_s(__lsx_vfsub_s(_tmp04, _tmp02), _v5_25, __lsx_vfsub_s(_tmp00, _tmp06));\n                    __m128 _r0tm7 = __lsx_vfmadd_s(__lsx_vfsub_s(_tmp03, _tmp05), _v5_25, __lsx_vfsub_s(_tmp07, _tmp01));\n\n                    __m128 _tmp12a = __lsx_vfmadd_s(_tmp04, _vm4_25, __lsx_vfadd_s(_tmp02, _tmp06));\n                    __m128 _tmp12b = __lsx_vfmadd_s(_tmp03, _vm4_25, __lsx_vfadd_s(_tmp01, _tmp05));\n\n                    __m128 _r0tm1 = __lsx_vfadd_s(_tmp12a, _tmp12b);\n                    __m128 _r0tm2 = __lsx_vfsub_s(_tmp12a, _tmp12b);\n\n                    __m128 _tmp34a = __lsx_vfmadd_s(_tmp04, _vm1_25, __lsx_vfmadd_s(_tmp02, _v0_25, _tmp06));\n                    __m128 _tmp34b = __lsx_vfmadd_s(_tmp05, _v2, __lsx_vfmadd_s(_tmp03, _vm2_5, __lsx_vfmul_s(_tmp01, _v0_5)));\n\n                    __m128 _r0tm3 = __lsx_vfadd_s(_tmp34a, _tmp34b);\n                    __m128 _r0tm4 = __lsx_vfsub_s(_tmp34a, _tmp34b);\n\n                    __m128 _tmp56a = __lsx_vfmadd_s(__lsx_vfmadd_s(_tmp04, _vm1_25, _tmp02), _v4, _tmp06);\n                    __m128 _tmp56b = __lsx_vfmadd_s(_tmp05, _v0_5, __lsx_vfmadd_s(_tmp03, _vm2_5, __lsx_vfmul_s(_tmp01, _v2)));\n\n                    __m128 _r0tm5 = __lsx_vfadd_s(_tmp56a, _tmp56b);\n                    __m128 _r0tm6 = __lsx_vfsub_s(_tmp56a, _tmp56b);\n\n                    __lsx_vst(_r0tm0, r0_tm_0, 0);\n                    __lsx_vst(_r0tm1, r0_tm_1, 0);\n                    __lsx_vst(_r0tm2, r0_tm_2, 0);\n                    __lsx_vst(_r0tm3, r0_tm_3, 0);\n                    __lsx_vst(_r0tm4, r0_tm_4, 0);\n                    __lsx_vst(_r0tm5, r0_tm_5, 0);\n                    __lsx_vst(_r0tm6, r0_tm_6, 0);\n                    __lsx_vst(_r0tm7, r0_tm_7, 0);\n\n                    r0_tm_0 += tiles * 4 * 8;\n                    r0_tm_1 += tiles * 4 * 8;\n                    
r0_tm_2 += tiles * 4 * 8;\n                    r0_tm_3 += tiles * 4 * 8;\n                    r0_tm_4 += tiles * 4 * 8;\n                    r0_tm_5 += tiles * 4 * 8;\n                    r0_tm_6 += tiles * 4 * 8;\n                    r0_tm_7 += tiles * 4 * 8;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd63_transform_output_pack4_lsx(const Mat& top_blob_tm, Mat& top_blob, const Mat& bias, const Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int w_tiles = outw / 6;\n    const int h_tiles = outh / 6;\n    const int tiles = w_tiles * h_tiles;\n\n    const float* biasptr = bias;\n\n    // const float otm[6][8] = {\n    //     {1.0f,  1.0f,   1.0f,   1.0f,   1.0f,  32.0f, 32.0f, 0.0f},\n    //     {0.0f,  1.0f,  -1.0f,   2.0f,  -2.0f,  16.0f,-16.0f, 0.0f},\n    //     {0.0f,  1.0f,   1.0f,   4.0f,   4.0f,   8.0f,  8.0f, 0.0f},\n    //     {0.0f,  1.0f,  -1.0f,   8.0f,  -8.0f,   4.0f, -4.0f, 0.0f},\n    //     {0.0f,  1.0f,   1.0f,  16.0f,  16.0f,   2.0f,  2.0f, 0.0f},\n    //     {0.0f,  1.0f,  -1.0f,  32.0f, -32.0f,   1.0f, -1.0f, 1.0f}\n    // };\n\n    // 0 = r0 + (r1 + r2) + (r3 + r4)     + (r5 + r6) * 32\n    // 1 =      (r1 - r2) + (r3 - r4) * 2 + (r5 - r6) * 16\n    // 2 =      (r1 + r2) + (r3 + r4) * 4 + (r5 + r6) * 8\n    // 3 =      (r1 - r2) + (r3 - r4) * 8 + (r5 - r6) * 4\n    // 4 =      (r1 + r2) + (r3 + r4) * 16+ (r5 + r6) * 2\n    // 5 = r7 + (r1 - r2) + (r3 - r4) * 32+ (r5 - r6)\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        const Mat out0_tm = top_blob_tm.channel(p);\n        Mat out0 = top_blob.channel(p);\n\n        __m128 _bias0 = biasptr ? 
(__m128)__lsx_vld(biasptr + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n        float tmp[6][8][4];\n\n        __m128 _v32 = __lsx_vreplfr2vr_s(32.f);\n        __m128 _v16 = __lsx_vreplfr2vr_s(16.f);\n        __m128 _v8 = __lsx_vreplfr2vr_s(8.f);\n        __m128 _v4 = __lsx_vreplfr2vr_s(4.f);\n        __m128 _v2 = __lsx_vreplfr2vr_s(2.f);\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* output0_tm_0 = (const float*)out0_tm + (i * w_tiles + j) * 4;\n                const float* output0_tm_1 = output0_tm_0 + tiles * 4;\n                const float* output0_tm_2 = output0_tm_0 + tiles * 4 * 2;\n                const float* output0_tm_3 = output0_tm_0 + tiles * 4 * 3;\n                const float* output0_tm_4 = output0_tm_0 + tiles * 4 * 4;\n                const float* output0_tm_5 = output0_tm_0 + tiles * 4 * 5;\n                const float* output0_tm_6 = output0_tm_0 + tiles * 4 * 6;\n                const float* output0_tm_7 = output0_tm_0 + tiles * 4 * 7;\n\n                float* output0 = out0.row<float>(i * 6) + (j * 6) * 4;\n\n                for (int m = 0; m < 8; m++)\n                {\n                    __m128 _out0tm0 = (__m128)__lsx_vld(output0_tm_0, 0);\n                    __m128 _out0tm1 = (__m128)__lsx_vld(output0_tm_1, 0);\n                    __m128 _out0tm2 = (__m128)__lsx_vld(output0_tm_2, 0);\n                    __m128 _out0tm3 = (__m128)__lsx_vld(output0_tm_3, 0);\n                    __m128 _out0tm4 = (__m128)__lsx_vld(output0_tm_4, 0);\n                    __m128 _out0tm5 = (__m128)__lsx_vld(output0_tm_5, 0);\n                    __m128 _out0tm6 = (__m128)__lsx_vld(output0_tm_6, 0);\n                    __m128 _out0tm7 = (__m128)__lsx_vld(output0_tm_7, 0);\n\n                    __m128 _tmp024a = __lsx_vfadd_s(_out0tm1, _out0tm2);\n                    __m128 _tmp135a = __lsx_vfsub_s(_out0tm1, _out0tm2);\n\n                
    __m128 _tmp024b = __lsx_vfadd_s(_out0tm3, _out0tm4);\n                    __m128 _tmp135b = __lsx_vfsub_s(_out0tm3, _out0tm4);\n\n                    __m128 _tmp024c = __lsx_vfadd_s(_out0tm5, _out0tm6);\n                    __m128 _tmp135c = __lsx_vfsub_s(_out0tm5, _out0tm6);\n\n                    __m128 _tmp0m = __lsx_vfadd_s(__lsx_vfadd_s(_out0tm0, _tmp024a), __lsx_vfmadd_s(_tmp024c, _v32, _tmp024b));\n                    __m128 _tmp2m = __lsx_vfmadd_s(_tmp024c, _v8, __lsx_vfmadd_s(_tmp024b, _v4, _tmp024a));\n                    __m128 _tmp4m = __lsx_vfmadd_s(_tmp024c, _v2, __lsx_vfmadd_s(_tmp024b, _v16, _tmp024a));\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n                    __lsx_vst(_tmp4m, tmp[4][m], 0);\n\n                    __m128 _tmp1m = __lsx_vfmadd_s(_tmp135c, _v16, __lsx_vfmadd_s(_tmp135b, _v2, _tmp135a));\n                    __m128 _tmp3m = __lsx_vfmadd_s(_tmp135c, _v4, __lsx_vfmadd_s(_tmp135b, _v8, _tmp135a));\n                    __m128 _tmp5m = __lsx_vfadd_s(__lsx_vfadd_s(_out0tm7, _tmp135a), __lsx_vfmadd_s(_tmp135b, _v32, _tmp135c));\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n                    __lsx_vst(_tmp5m, tmp[5][m], 0);\n\n                    output0_tm_0 += tiles * 4 * 8;\n                    output0_tm_1 += tiles * 4 * 8;\n                    output0_tm_2 += tiles * 4 * 8;\n                    output0_tm_3 += tiles * 4 * 8;\n                    output0_tm_4 += tiles * 4 * 8;\n                    output0_tm_5 += tiles * 4 * 8;\n                    output0_tm_6 += tiles * 4 * 8;\n                    output0_tm_7 += tiles * 4 * 8;\n                }\n\n                for (int m = 0; m < 6; m++)\n                {\n                    __m128 _tmp00 = (__m128)__lsx_vld(tmp[m][0], 0);\n                    __m128 _tmp01 = (__m128)__lsx_vld(tmp[m][1], 0);\n                    __m128 _tmp02 = 
(__m128)__lsx_vld(tmp[m][2], 0);\n                    __m128 _tmp03 = (__m128)__lsx_vld(tmp[m][3], 0);\n                    __m128 _tmp04 = (__m128)__lsx_vld(tmp[m][4], 0);\n                    __m128 _tmp05 = (__m128)__lsx_vld(tmp[m][5], 0);\n                    __m128 _tmp06 = (__m128)__lsx_vld(tmp[m][6], 0);\n                    __m128 _tmp07 = (__m128)__lsx_vld(tmp[m][7], 0);\n\n                    __m128 _tmp024a = __lsx_vfadd_s(_tmp01, _tmp02);\n                    __m128 _tmp135a = __lsx_vfsub_s(_tmp01, _tmp02);\n\n                    __m128 _tmp024b = __lsx_vfadd_s(_tmp03, _tmp04);\n                    __m128 _tmp135b = __lsx_vfsub_s(_tmp03, _tmp04);\n\n                    __m128 _tmp024c = __lsx_vfadd_s(_tmp05, _tmp06);\n                    __m128 _tmp135c = __lsx_vfsub_s(_tmp05, _tmp06);\n\n                    __m128 _out00 = __lsx_vfadd_s(_bias0, __lsx_vfadd_s(__lsx_vfadd_s(_tmp00, _tmp024a), __lsx_vfmadd_s(_tmp024c, _v32, _tmp024b)));\n                    __m128 _out02 = __lsx_vfadd_s(_bias0, __lsx_vfmadd_s(_tmp024c, _v8, __lsx_vfmadd_s(_tmp024b, _v4, _tmp024a)));\n                    __m128 _out04 = __lsx_vfadd_s(_bias0, __lsx_vfmadd_s(_tmp024c, _v2, __lsx_vfmadd_s(_tmp024b, _v16, _tmp024a)));\n                    __lsx_vst(_out00, output0, 0);\n                    __lsx_vst(_out02, output0 + 4 * 2, 0);\n                    __lsx_vst(_out04, output0 + 4 * 4, 0);\n\n                    __m128 _out01 = __lsx_vfadd_s(_bias0, __lsx_vfmadd_s(_tmp135c, _v16, __lsx_vfmadd_s(_tmp135b, _v2, _tmp135a)));\n                    __m128 _out03 = __lsx_vfadd_s(_bias0, __lsx_vfmadd_s(_tmp135c, _v4, __lsx_vfmadd_s(_tmp135b, _v8, _tmp135a)));\n                    __m128 _out05 = __lsx_vfadd_s(_bias0, __lsx_vfadd_s(__lsx_vfadd_s(_tmp07, _tmp135a), __lsx_vfmadd_s(_tmp135b, _v32, _tmp135c)));\n                    __lsx_vst(_out01, output0 + 4, 0);\n                    __lsx_vst(_out03, output0 + 4 * 3, 0);\n                    __lsx_vst(_out05, output0 + 4 * 5, 0);\n\n      
              output0 += outw * 4;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_transform_input_pack4_lsx(const Mat& bottom_blob, Mat& bottom_blob_tm, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int inch = bottom_blob.c;\n\n    const int w_tiles = (w - 2) / 4;\n    const int h_tiles = (h - 2) / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float itm[6][6] = {\n    //     {4.0f, 0.0f, -5.0f, 0.0f, 1.0f, 0.0f},\n    //     {0.0f,-4.0f, -4.0f, 1.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f, -4.0f,-1.0f, 1.0f, 0.0f},\n    //     {0.0f,-2.0f, -1.0f, 2.0f, 1.0f, 0.0f},\n    //     {0.0f, 2.0f, -1.0f,-2.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f,  0.0f,-5.0f, 0.0f, 1.0f}\n    // };\n\n    // 0 =  4 * r00 - 5 * r02 + r04\n    // 1 = -4 * (r01 + r02) + r04 + r03\n    // 2 =  4 * (r01 - r02) + r04 - r03\n    // 3 = -2 * (r01 - r03) + r04 - r02\n    // 4 =  2 * (r01 - r03) + r04 - r02\n    // 5 =  4 * r01 - 5 * r03 + r05\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        Mat img0_tm = bottom_blob_tm.channel(q);\n\n        float tmp[6][6][4];\n\n        __m128 _vm5 = __lsx_vreplfr2vr_s(-5.f);\n        __m128 _vm4 = __lsx_vreplfr2vr_s(-4.f);\n        __m128 _v4 = __lsx_vreplfr2vr_s(4.f);\n        __m128 _vm2 = __lsx_vreplfr2vr_s(-2.f);\n        __m128 _v2 = __lsx_vreplfr2vr_s(2.f);\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* r0 = img0.row(i * 4) + (j * 4) * 4;\n\n                for (int m = 0; m < 6; m++)\n                {\n                    __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                    __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                    
__m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n                    __m128 _r04 = (__m128)__lsx_vld(r0 + 4 * 4, 0);\n                    __m128 _r05 = (__m128)__lsx_vld(r0 + 4 * 5, 0);\n\n                    __m128 _tmp0m = __lsx_vfmadd_s(_r02, _vm5, __lsx_vfmadd_s(_r00, _v4, _r04));\n                    __m128 _tmp1m = __lsx_vfmadd_s(__lsx_vfadd_s(_r01, _r02), _vm4, __lsx_vfadd_s(_r04, _r03));\n                    __m128 _tmp2m = __lsx_vfmadd_s(__lsx_vfsub_s(_r01, _r02), _v4, __lsx_vfsub_s(_r04, _r03));\n                    __m128 _tmp3m = __lsx_vfmadd_s(__lsx_vfsub_s(_r01, _r03), _vm2, __lsx_vfsub_s(_r04, _r02));\n                    __m128 _tmp4m = __lsx_vfmadd_s(__lsx_vfsub_s(_r01, _r03), _v2, __lsx_vfsub_s(_r04, _r02));\n                    __m128 _tmp5m = __lsx_vfmadd_s(_r03, _vm5, __lsx_vfmadd_s(_r01, _v4, _r05));\n\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n                    __lsx_vst(_tmp4m, tmp[4][m], 0);\n                    __lsx_vst(_tmp5m, tmp[5][m], 0);\n\n                    r0 += w * 4;\n                }\n\n                float* r0_tm_0 = (float*)img0_tm + (i * w_tiles + j) * 4;\n                float* r0_tm_1 = r0_tm_0 + tiles * 4;\n                float* r0_tm_2 = r0_tm_0 + tiles * 4 * 2;\n                float* r0_tm_3 = r0_tm_0 + tiles * 4 * 3;\n                float* r0_tm_4 = r0_tm_0 + tiles * 4 * 4;\n                float* r0_tm_5 = r0_tm_0 + tiles * 4 * 5;\n\n                for (int m = 0; m < 6; m++)\n                {\n                    __m128 _tmp00 = (__m128)__lsx_vld(tmp[m][0], 0);\n                    __m128 _tmp01 = (__m128)__lsx_vld(tmp[m][1], 0);\n                    __m128 _tmp02 = (__m128)__lsx_vld(tmp[m][2], 0);\n                    __m128 _tmp03 = (__m128)__lsx_vld(tmp[m][3], 0);\n                    __m128 _tmp04 = (__m128)__lsx_vld(tmp[m][4], 
0);\n                    __m128 _tmp05 = (__m128)__lsx_vld(tmp[m][5], 0);\n\n                    __m128 _r0tm0 = __lsx_vfmadd_s(_tmp02, _vm5, __lsx_vfmadd_s(_tmp00, _v4, _tmp04));\n                    __m128 _r0tm1 = __lsx_vfmadd_s(__lsx_vfadd_s(_tmp01, _tmp02), _vm4, __lsx_vfadd_s(_tmp04, _tmp03));\n                    __m128 _r0tm2 = __lsx_vfmadd_s(__lsx_vfsub_s(_tmp01, _tmp02), _v4, __lsx_vfsub_s(_tmp04, _tmp03));\n                    __m128 _r0tm3 = __lsx_vfmadd_s(__lsx_vfsub_s(_tmp01, _tmp03), _vm2, __lsx_vfsub_s(_tmp04, _tmp02));\n                    __m128 _r0tm4 = __lsx_vfmadd_s(__lsx_vfsub_s(_tmp01, _tmp03), _v2, __lsx_vfsub_s(_tmp04, _tmp02));\n                    __m128 _r0tm5 = __lsx_vfmadd_s(_tmp03, _vm5, __lsx_vfmadd_s(_tmp01, _v4, _tmp05));\n\n                    __lsx_vst(_r0tm0, r0_tm_0, 0);\n                    __lsx_vst(_r0tm1, r0_tm_1, 0);\n                    __lsx_vst(_r0tm2, r0_tm_2, 0);\n                    __lsx_vst(_r0tm3, r0_tm_3, 0);\n                    __lsx_vst(_r0tm4, r0_tm_4, 0);\n                    __lsx_vst(_r0tm5, r0_tm_5, 0);\n\n                    r0_tm_0 += tiles * 4 * 6;\n                    r0_tm_1 += tiles * 4 * 6;\n                    r0_tm_2 += tiles * 4 * 6;\n                    r0_tm_3 += tiles * 4 * 6;\n                    r0_tm_4 += tiles * 4 * 6;\n                    r0_tm_5 += tiles * 4 * 6;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_transform_output_pack4_lsx(const Mat& top_blob_tm, Mat& top_blob, const Mat& bias, const Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int w_tiles = outw / 4;\n    const int h_tiles = outh / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    const float* biasptr = bias;\n\n    // const float otm[4][6] = {\n    //     {1.0f, 1.0f,  1.0f, 1.0f,  1.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 2.0f, -2.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f, 4.0f,  4.0f, 
0.0f},\n    //     {0.0f, 1.0f, -1.0f, 8.0f, -8.0f, 1.0f}\n    // };\n\n    // 0 = r00 + (r01 + r02) + (r03 + r04)\n    // 1 =       (r01 - r02) + (r03 - r04) * 2\n    // 2 =       (r01 + r02) + (r03 + r04) * 4\n    // 3 = r05 + (r01 - r02) + (r03 - r04) * 8\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        const Mat out0_tm = top_blob_tm.channel(p);\n        Mat out0 = top_blob.channel(p);\n\n        __m128 _bias0 = biasptr ? (__m128)__lsx_vld(biasptr + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n        float tmp[4][6][4];\n\n        __m128 _v2 = __lsx_vreplfr2vr_s(2.f);\n        __m128 _v4 = __lsx_vreplfr2vr_s(4.f);\n        __m128 _v8 = __lsx_vreplfr2vr_s(8.f);\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* output0_tm_0 = (const float*)out0_tm + (i * w_tiles + j) * 4;\n                const float* output0_tm_1 = output0_tm_0 + tiles * 4;\n                const float* output0_tm_2 = output0_tm_0 + tiles * 4 * 2;\n                const float* output0_tm_3 = output0_tm_0 + tiles * 4 * 3;\n                const float* output0_tm_4 = output0_tm_0 + tiles * 4 * 4;\n                const float* output0_tm_5 = output0_tm_0 + tiles * 4 * 5;\n\n                float* output0 = out0.row<float>(i * 4) + (j * 4) * 4;\n\n                for (int m = 0; m < 6; m++)\n                {\n                    __m128 _out0tm0 = (__m128)__lsx_vld(output0_tm_0, 0);\n                    __m128 _out0tm1 = (__m128)__lsx_vld(output0_tm_1, 0);\n                    __m128 _out0tm2 = (__m128)__lsx_vld(output0_tm_2, 0);\n                    __m128 _out0tm3 = (__m128)__lsx_vld(output0_tm_3, 0);\n                    __m128 _out0tm4 = (__m128)__lsx_vld(output0_tm_4, 0);\n                    __m128 _out0tm5 = (__m128)__lsx_vld(output0_tm_5, 0);\n\n                    __m128 _tmp02a = __lsx_vfadd_s(_out0tm1, 
_out0tm2);\n                    __m128 _tmp13a = __lsx_vfsub_s(_out0tm1, _out0tm2);\n\n                    __m128 _tmp02b = __lsx_vfadd_s(_out0tm3, _out0tm4);\n                    __m128 _tmp13b = __lsx_vfsub_s(_out0tm3, _out0tm4);\n\n                    __m128 _tmp0m = __lsx_vfadd_s(__lsx_vfadd_s(_out0tm0, _tmp02a), _tmp02b);\n                    __m128 _tmp1m = __lsx_vfmadd_s(_tmp13b, _v2, _tmp13a);\n                    __m128 _tmp2m = __lsx_vfmadd_s(_tmp02b, _v4, _tmp02a);\n                    __m128 _tmp3m = __lsx_vfmadd_s(_tmp13b, _v8, __lsx_vfadd_s(_out0tm5, _tmp13a));\n\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n\n                    output0_tm_0 += tiles * 4 * 6;\n                    output0_tm_1 += tiles * 4 * 6;\n                    output0_tm_2 += tiles * 4 * 6;\n                    output0_tm_3 += tiles * 4 * 6;\n                    output0_tm_4 += tiles * 4 * 6;\n                    output0_tm_5 += tiles * 4 * 6;\n                }\n\n                for (int m = 0; m < 4; m++)\n                {\n                    __m128 _tmp00 = (__m128)__lsx_vld(tmp[m][0], 0);\n                    __m128 _tmp01 = (__m128)__lsx_vld(tmp[m][1], 0);\n                    __m128 _tmp02 = (__m128)__lsx_vld(tmp[m][2], 0);\n                    __m128 _tmp03 = (__m128)__lsx_vld(tmp[m][3], 0);\n                    __m128 _tmp04 = (__m128)__lsx_vld(tmp[m][4], 0);\n                    __m128 _tmp05 = (__m128)__lsx_vld(tmp[m][5], 0);\n\n                    __m128 _tmp02a = __lsx_vfadd_s(_tmp01, _tmp02);\n                    __m128 _tmp13a = __lsx_vfsub_s(_tmp01, _tmp02);\n\n                    __m128 _tmp02b = __lsx_vfadd_s(_tmp03, _tmp04);\n                    __m128 _tmp13b = __lsx_vfsub_s(_tmp03, _tmp04);\n\n                    __m128 _out00 = __lsx_vfadd_s(_bias0, __lsx_vfadd_s(__lsx_vfadd_s(_tmp00, 
_tmp02a), _tmp02b));\n                    __m128 _out01 = __lsx_vfadd_s(_bias0, __lsx_vfmadd_s(_tmp13b, _v2, _tmp13a));\n                    __m128 _out02 = __lsx_vfadd_s(_bias0, __lsx_vfmadd_s(_tmp02b, _v4, _tmp02a));\n                    __m128 _out03 = __lsx_vfadd_s(_bias0, __lsx_vfmadd_s(_tmp13b, _v8, __lsx_vfadd_s(_tmp05, _tmp13a)));\n\n                    __lsx_vst(_out00, output0, 0);\n                    __lsx_vst(_out01, output0 + 4, 0);\n                    __lsx_vst(_out02, output0 + 4 * 2, 0);\n                    __lsx_vst(_out03, output0 + 4 * 3, 0);\n\n                    output0 += outw * 4;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_transform_input_pack4_lsx(const Mat& bottom_blob, Mat& bottom_blob_tm, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int inch = bottom_blob.c;\n\n    const int w_tiles = (w - 2) / 2;\n    const int h_tiles = (h - 2) / 2;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float itm[4][4] = {\n    //     {1.0f,  0.0f, -1.0f,  0.0f},\n    //     {0.0f,  1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  1.00f, 0.0f},\n    //     {0.0f, -1.0f,  0.00f, 1.0f}\n    // };\n\n    // 0 = r00 - r02\n    // 1 = r01 + r02\n    // 2 = r02 - r01\n    // 3 = r03 - r01\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        Mat img0_tm = bottom_blob_tm.channel(q);\n\n        float tmp[4][4][4];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* r0 = img0.row(i * 2) + (j * 2) * 4;\n\n                for (int m = 0; m < 4; m++)\n                {\n                    __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                    __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                    __m128 _r02 = 
(__m128)__lsx_vld(r0 + 4 * 2, 0);\n                    __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n\n                    __m128 _tmp0m = __lsx_vfsub_s(_r00, _r02);\n                    __m128 _tmp1m = __lsx_vfadd_s(_r01, _r02);\n                    __m128 _tmp2m = __lsx_vfsub_s(_r02, _r01);\n                    __m128 _tmp3m = __lsx_vfsub_s(_r03, _r01);\n\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n\n                    r0 += w * 4;\n                }\n\n                float* r0_tm_0 = (float*)img0_tm + (i * w_tiles + j) * 4;\n                float* r0_tm_1 = r0_tm_0 + tiles * 4;\n                float* r0_tm_2 = r0_tm_0 + tiles * 4 * 2;\n                float* r0_tm_3 = r0_tm_0 + tiles * 4 * 3;\n\n                for (int m = 0; m < 4; m++)\n                {\n                    __m128 _tmp00 = (__m128)__lsx_vld(tmp[m][0], 0);\n                    __m128 _tmp01 = (__m128)__lsx_vld(tmp[m][1], 0);\n                    __m128 _tmp02 = (__m128)__lsx_vld(tmp[m][2], 0);\n                    __m128 _tmp03 = (__m128)__lsx_vld(tmp[m][3], 0);\n\n                    __m128 _r0tm0 = __lsx_vfsub_s(_tmp00, _tmp02);\n                    __m128 _r0tm1 = __lsx_vfadd_s(_tmp01, _tmp02);\n                    __m128 _r0tm2 = __lsx_vfsub_s(_tmp02, _tmp01);\n                    __m128 _r0tm3 = __lsx_vfsub_s(_tmp03, _tmp01);\n\n                    __lsx_vst(_r0tm0, r0_tm_0, 0);\n                    __lsx_vst(_r0tm1, r0_tm_1, 0);\n                    __lsx_vst(_r0tm2, r0_tm_2, 0);\n                    __lsx_vst(_r0tm3, r0_tm_3, 0);\n\n                    r0_tm_0 += tiles * 4 * 4;\n                    r0_tm_1 += tiles * 4 * 4;\n                    r0_tm_2 += tiles * 4 * 4;\n                    r0_tm_3 += tiles * 4 * 4;\n                }\n            }\n        }\n    }\n}\n\nstatic void 
conv3x3s1_winograd23_transform_output_pack4_lsx(const Mat& top_blob_tm, Mat& top_blob, const Mat& bias, const Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int w_tiles = outw / 2;\n    const int h_tiles = outh / 2;\n    const int tiles = w_tiles * h_tiles;\n\n    const float* biasptr = bias;\n\n    // const float otm[2][4] = {\n    //     {1.0f,  1.0f,  1.0f,  0.0f},\n    //     {0.0f,  1.0f, -1.0f,  1.0f}\n    // };\n\n    // 0 = r00 + r01 + r02\n    // 1 = r01 - r02 + r03\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        const Mat out0_tm = top_blob_tm.channel(p);\n        Mat out0 = top_blob.channel(p);\n\n        __m128 _bias0 = biasptr ? (__m128)__lsx_vld(biasptr + p * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n        float tmp[2][4][4];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const float* output0_tm_0 = (const float*)out0_tm + (i * w_tiles + j) * 4;\n                const float* output0_tm_1 = output0_tm_0 + tiles * 4;\n                const float* output0_tm_2 = output0_tm_0 + tiles * 4 * 2;\n                const float* output0_tm_3 = output0_tm_0 + tiles * 4 * 3;\n\n                float* output0 = out0.row<float>(i * 2) + (j * 2) * 4;\n\n                for (int m = 0; m < 4; m++)\n                {\n                    __m128 _out0tm0 = (__m128)__lsx_vld(output0_tm_0, 0);\n                    __m128 _out0tm1 = (__m128)__lsx_vld(output0_tm_1, 0);\n                    __m128 _out0tm2 = (__m128)__lsx_vld(output0_tm_2, 0);\n                    __m128 _out0tm3 = (__m128)__lsx_vld(output0_tm_3, 0);\n\n                    __m128 _tmp0m = __lsx_vfadd_s(__lsx_vfadd_s(_out0tm0, _out0tm1), _out0tm2);\n                    __m128 _tmp1m = __lsx_vfadd_s(__lsx_vfsub_s(_out0tm1, _out0tm2), _out0tm3);\n\n          
          __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n\n                    output0_tm_0 += tiles * 4 * 4;\n                    output0_tm_1 += tiles * 4 * 4;\n                    output0_tm_2 += tiles * 4 * 4;\n                    output0_tm_3 += tiles * 4 * 4;\n                }\n\n                for (int m = 0; m < 2; m++)\n                {\n                    __m128 _tmp00 = (__m128)__lsx_vld(tmp[m][0], 0);\n                    __m128 _tmp01 = (__m128)__lsx_vld(tmp[m][1], 0);\n                    __m128 _tmp02 = (__m128)__lsx_vld(tmp[m][2], 0);\n                    __m128 _tmp03 = (__m128)__lsx_vld(tmp[m][3], 0);\n\n                    __m128 _out00 = __lsx_vfadd_s(_bias0, __lsx_vfadd_s(__lsx_vfadd_s(_tmp00, _tmp01), _tmp02));\n                    __m128 _out01 = __lsx_vfadd_s(_bias0, __lsx_vfadd_s(__lsx_vfsub_s(_tmp01, _tmp02), _tmp03));\n\n                    __lsx_vst(_out00, output0, 0);\n                    __lsx_vst(_out01, output0 + 4, 0);\n\n                    output0 += outw * 4;\n                }\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_transform_pack4_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_output_pack4_int8_lsx(const Mat& top_blob_tm, Mat& top_blob, const Option& opt)\n{\n    const int outw = top_blob.w;\n    const int outh = top_blob.h;\n    const int outch = top_blob.c;\n\n    const int w_tiles = outw / 4;\n    const int h_tiles = outh / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float otm[4][6] = {\n    //     {1.0f, 1.0f,  1.0f, 1.0f,  1.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 2.0f, -2.0f, 0.0f},\n    //     {0.0f, 1.0f,  1.0f, 4.0f,  4.0f, 0.0f},\n    //     {0.0f, 1.0f, -1.0f, 8.0f, -8.0f, 1.0f}\n    // };\n\n    // 0 = r00 + (r01 + r02) + (r03 + r04)\n    // 1 =       (r01 - r02) + (r03 - r04) * 2\n    // 2 =       (r01 + r02) + (r03 + r04) * 4\n    // 3 = r05 + (r01 - r02) + (r03 - r04) * 8\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        const Mat out0_tm = top_blob_tm.channel(p);\n        Mat out0 = top_blob.channel(p);\n\n        int tmp[4][6][4];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const int* output0_tm_0 = (const int*)out0_tm + (i * w_tiles + j) * 4;\n                const int* output0_tm_1 = output0_tm_0 + tiles * 4;\n                const int* output0_tm_2 = output0_tm_0 + tiles * 8;\n                const int* output0_tm_3 = output0_tm_0 + tiles * 12;\n                const int* output0_tm_4 = output0_tm_0 + tiles * 16;\n                const int* output0_tm_5 = output0_tm_0 + tiles * 20;\n\n                int* output0 = out0.row<int>(i * 4) + (j * 4) * 4;\n\n                for (int m = 0; m < 5; m++)\n                {\n                    __m128i _out0tm0 = __lsx_vld(output0_tm_0, 0);\n                    __m128i _out0tm1 = __lsx_vld(output0_tm_1, 0);\n            
        __m128i _out0tm2 = __lsx_vld(output0_tm_2, 0);\n                    __m128i _out0tm3 = __lsx_vld(output0_tm_3, 0);\n                    __m128i _out0tm4 = __lsx_vld(output0_tm_4, 0);\n                    __m128i _out0tm5 = __lsx_vld(output0_tm_5, 0);\n\n                    __m128i _tmp02a = __lsx_vadd_w(_out0tm1, _out0tm2);\n                    __m128i _tmp13a = __lsx_vsub_w(_out0tm1, _out0tm2);\n\n                    __m128i _tmp02b = __lsx_vadd_w(_out0tm3, _out0tm4);\n                    __m128i _tmp13b = __lsx_vsub_w(_out0tm3, _out0tm4);\n\n                    __m128i _tmp0m = __lsx_vadd_w(__lsx_vadd_w(_out0tm0, _tmp02a), _tmp02b);\n                    __m128i _tmp1m = __lsx_vadd_w(_tmp13a, __lsx_vslli_w(_tmp13b, 1));\n                    __m128i _tmp2m = __lsx_vadd_w(_tmp02a, __lsx_vslli_w(_tmp02b, 2));\n                    __m128i _tmp3m = __lsx_vadd_w(__lsx_vadd_w(_tmp13a, __lsx_vslli_w(_out0tm5, 2)), __lsx_vslli_w(_tmp13b, 3));\n\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n\n                    output0_tm_0 += tiles * 24;\n                    output0_tm_1 += tiles * 24;\n                    output0_tm_2 += tiles * 24;\n                    output0_tm_3 += tiles * 24;\n                    output0_tm_4 += tiles * 24;\n                    output0_tm_5 += tiles * 24;\n                }\n                for (int m = 5; m < 6; m++)\n                {\n                    __m128i _out0tm0 = __lsx_vld(output0_tm_0, 0);\n                    __m128i _out0tm1 = __lsx_vld(output0_tm_1, 0);\n                    __m128i _out0tm2 = __lsx_vld(output0_tm_2, 0);\n                    __m128i _out0tm3 = __lsx_vld(output0_tm_3, 0);\n                    __m128i _out0tm4 = __lsx_vld(output0_tm_4, 0);\n                    __m128i _out0tm5 = __lsx_vld(output0_tm_5, 0);\n\n                    __m128i 
_tmp02a = __lsx_vadd_w(_out0tm1, _out0tm2);\n                    __m128i _tmp13a = __lsx_vsub_w(_out0tm1, _out0tm2);\n\n                    __m128i _tmp02b = __lsx_vadd_w(_out0tm3, _out0tm4);\n                    __m128i _tmp13b = __lsx_vsub_w(_out0tm3, _out0tm4);\n\n                    __m128i _tmp0m = __lsx_vadd_w(__lsx_vadd_w(_out0tm0, _tmp02a), _tmp02b);\n                    __m128i _tmp1m = __lsx_vadd_w(_tmp13a, __lsx_vslli_w(_tmp13b, 1));\n                    __m128i _tmp2m = __lsx_vadd_w(_tmp02a, __lsx_vslli_w(_tmp02b, 2));\n                    __m128i _tmp3m = __lsx_vadd_w(__lsx_vadd_w(_tmp13a, __lsx_vslli_w(_out0tm5, 2)), __lsx_vslli_w(_tmp13b, 3));\n\n                    _tmp0m = __lsx_vslli_w(_tmp0m, 2);\n                    _tmp1m = __lsx_vslli_w(_tmp1m, 2);\n                    _tmp2m = __lsx_vslli_w(_tmp2m, 2);\n                    _tmp3m = __lsx_vslli_w(_tmp3m, 2);\n\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n\n                    output0_tm_0 += tiles * 24;\n                    output0_tm_1 += tiles * 24;\n                    output0_tm_2 += tiles * 24;\n                    output0_tm_3 += tiles * 24;\n                    output0_tm_4 += tiles * 24;\n                    output0_tm_5 += tiles * 24;\n                }\n\n                for (int m = 0; m < 4; m++)\n                {\n                    __m128i _tmp00 = __lsx_vld(tmp[m][0], 0);\n                    __m128i _tmp01 = __lsx_vld(tmp[m][1], 0);\n                    __m128i _tmp02 = __lsx_vld(tmp[m][2], 0);\n                    __m128i _tmp03 = __lsx_vld(tmp[m][3], 0);\n                    __m128i _tmp04 = __lsx_vld(tmp[m][4], 0);\n                    __m128i _tmp05 = __lsx_vld(tmp[m][5], 0);\n\n                    __m128i _tmp02a = __lsx_vadd_w(_tmp01, _tmp02);\n                    __m128i _tmp13a = 
__lsx_vsub_w(_tmp01, _tmp02);\n\n                    __m128i _tmp02b = __lsx_vadd_w(_tmp03, _tmp04);\n                    __m128i _tmp13b = __lsx_vsub_w(_tmp03, _tmp04);\n\n                    __m128i _out00 = __lsx_vadd_w(__lsx_vadd_w(_tmp00, _tmp02a), _tmp02b);\n                    __m128i _out01 = __lsx_vadd_w(_tmp13a, __lsx_vslli_w(_tmp13b, 1));\n                    __m128i _out02 = __lsx_vadd_w(_tmp02a, __lsx_vslli_w(_tmp02b, 2));\n                    __m128i _out03 = __lsx_vadd_w(__lsx_vadd_w(_tmp05, _tmp13a), __lsx_vslli_w(_tmp13b, 3));\n\n                    // TODO use integer trick for division by 576\n                    __m128 _v576 = __lsx_vreplfr2vr_s(1.0 / 576);\n                    _out00 = __lsx_vftint_w_s(__lsx_vfmul_s(__lsx_vffint_s_w(_out00), _v576));\n                    _out01 = __lsx_vftint_w_s(__lsx_vfmul_s(__lsx_vffint_s_w(_out01), _v576));\n                    _out02 = __lsx_vftint_w_s(__lsx_vfmul_s(__lsx_vffint_s_w(_out02), _v576));\n                    _out03 = __lsx_vftint_w_s(__lsx_vfmul_s(__lsx_vffint_s_w(_out03), _v576));\n\n                    __lsx_vst(_out00, output0, 0);\n                    __lsx_vst(_out01, output0 + 4, 0);\n                    __lsx_vst(_out02, output0 + 8, 0);\n                    __lsx_vst(_out03, output0 + 12, 0);\n\n                    output0 += outw * 4;\n                }\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolution_winograd_transform_pack8_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_input_pack8_int8_lsx(const Mat& bottom_blob, Mat& bottom_blob_tm, const Option& opt)\n{\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int inch = bottom_blob.c;\n\n    const int w_tiles = (w - 2) / 4;\n    const int h_tiles = (h - 2) / 4;\n    const int tiles = w_tiles * h_tiles;\n\n    // const float itm[6][6] = {\n    //     {4.0f, 0.0f, -5.0f, 0.0f, 1.0f, 0.0f},\n    //     {0.0f,-4.0f, -4.0f, 1.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f, -4.0f,-1.0f, 1.0f, 0.0f},\n    //     {0.0f,-2.0f, -1.0f, 2.0f, 1.0f, 0.0f},\n    //     {0.0f, 2.0f, -1.0f,-2.0f, 1.0f, 0.0f},\n    //     {0.0f, 4.0f,  0.0f,-5.0f, 0.0f, 1.0f}\n    // };\n\n    // 0 =  4 * r00 - 5 * r02 + r04\n    // 1 = -4 * (r01 + r02) + r04 + r03\n    // 2 =  4 * (r01 - r02) + r04 - r03\n    // 3 = -2 * (r01 - r03) + r04 - r02\n    // 4 =  2 * (r01 - r03) + r04 - r02\n    // 5 =  4 * r01 - 5 * r03 + r05\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < inch; q++)\n    {\n        const Mat img0 = bottom_blob.channel(q);\n        Mat img0_tm = bottom_blob_tm.channel(q);\n\n        short tmp[6][6][8];\n\n        // tile\n        for (int i = 0; i < h_tiles; i++)\n        {\n            for (int j = 0; j < w_tiles; j++)\n            {\n                const signed char* r0 = img0.row<const signed char>(i * 4) + (j * 4) * 8;\n\n                for (int m = 0; m < 6; m++)\n                {\n                    __m128i _r00_01 = __lsx_vld(r0, 0);\n                    __m128i _r02_03 = __lsx_vld(r0 + 16, 0);\n                    __m128i _r04_05 = __lsx_vld(r0 + 32, 0);\n                    __m128i _extr0001 = __lsx_vslti_b(_r00_01, 0);\n                    __m128i _extr0203 = __lsx_vslti_b(_r02_03, 0);\n                    __m128i _extr0405 = __lsx_vslti_b(_r04_05, 0);\n          
          __m128i _r00 = __lsx_vilvl_b(_extr0001, _r00_01);\n                    __m128i _r01 = __lsx_vilvh_b(_extr0001, _r00_01);\n                    __m128i _r02 = __lsx_vilvl_b(_extr0203, _r02_03);\n                    __m128i _r03 = __lsx_vilvh_b(_extr0203, _r02_03);\n                    __m128i _r04 = __lsx_vilvl_b(_extr0405, _r04_05);\n                    __m128i _r05 = __lsx_vilvh_b(_extr0405, _r04_05);\n\n                    __m128i _v5 = __lsx_vreplgr2vr_h(5);\n\n                    __m128i _tmp0m = __lsx_vsub_h(__lsx_vadd_h(__lsx_vslli_h(_r00, 2), _r04), __lsx_vmul_h(_r02, _v5));\n                    __m128i _tmp1m = __lsx_vsub_h(__lsx_vadd_h(_r04, _r03), __lsx_vslli_h(__lsx_vadd_h(_r01, _r02), 2));\n                    __m128i _tmp2m = __lsx_vadd_h(__lsx_vsub_h(_r04, _r03), __lsx_vslli_h(__lsx_vsub_h(_r01, _r02), 2));\n                    __m128i _tmp3m = __lsx_vsub_h(__lsx_vsub_h(_r04, _r02), __lsx_vslli_h(__lsx_vsub_h(_r01, _r03), 1));\n                    __m128i _tmp4m = __lsx_vadd_h(__lsx_vsub_h(_r04, _r02), __lsx_vslli_h(__lsx_vsub_h(_r01, _r03), 1));\n                    __m128i _tmp5m = __lsx_vsub_h(__lsx_vadd_h(__lsx_vslli_h(_r01, 2), _r05), __lsx_vmul_h(_r03, _v5));\n\n                    __lsx_vst(_tmp0m, tmp[0][m], 0);\n                    __lsx_vst(_tmp1m, tmp[1][m], 0);\n                    __lsx_vst(_tmp2m, tmp[2][m], 0);\n                    __lsx_vst(_tmp3m, tmp[3][m], 0);\n                    __lsx_vst(_tmp4m, tmp[4][m], 0);\n                    __lsx_vst(_tmp5m, tmp[5][m], 0);\n\n                    r0 += w * 8;\n                }\n\n                short* r0_tm_0 = (short*)img0_tm + (i * w_tiles + j) * 8;\n                short* r0_tm_1 = r0_tm_0 + tiles * 8;\n                short* r0_tm_2 = r0_tm_0 + tiles * 16;\n                short* r0_tm_3 = r0_tm_0 + tiles * 24;\n                short* r0_tm_4 = r0_tm_0 + tiles * 32;\n                short* r0_tm_5 = r0_tm_0 + tiles * 40;\n\n                for (int m = 0; m < 6; m++)\n        
        {\n                    __m128i _tmp00 = __lsx_vld(tmp[m][0], 0);\n                    __m128i _tmp01 = __lsx_vld(tmp[m][1], 0);\n                    __m128i _tmp02 = __lsx_vld(tmp[m][2], 0);\n                    __m128i _tmp03 = __lsx_vld(tmp[m][3], 0);\n                    __m128i _tmp04 = __lsx_vld(tmp[m][4], 0);\n                    __m128i _tmp05 = __lsx_vld(tmp[m][5], 0);\n\n                    __m128i _v5 = __lsx_vreplgr2vr_h(5);\n\n                    __m128i _r0tm0 = __lsx_vsub_h(__lsx_vadd_h(__lsx_vslli_h(_tmp00, 2), _tmp04), __lsx_vmul_h(_tmp02, _v5));\n                    __m128i _r0tm1 = __lsx_vsub_h(__lsx_vadd_h(_tmp04, _tmp03), __lsx_vslli_h(__lsx_vadd_h(_tmp01, _tmp02), 2));\n                    __m128i _r0tm2 = __lsx_vadd_h(__lsx_vsub_h(_tmp04, _tmp03), __lsx_vslli_h(__lsx_vsub_h(_tmp01, _tmp02), 2));\n                    __m128i _r0tm3 = __lsx_vsub_h(__lsx_vsub_h(_tmp04, _tmp02), __lsx_vslli_h(__lsx_vsub_h(_tmp01, _tmp03), 1));\n                    __m128i _r0tm4 = __lsx_vadd_h(__lsx_vsub_h(_tmp04, _tmp02), __lsx_vslli_h(__lsx_vsub_h(_tmp01, _tmp03), 1));\n                    __m128i _r0tm5 = __lsx_vsub_h(__lsx_vadd_h(__lsx_vslli_h(_tmp01, 2), _tmp05), __lsx_vmul_h(_tmp03, _v5));\n\n                    __lsx_vst(_r0tm0, r0_tm_0, 0);\n                    __lsx_vst(_r0tm1, r0_tm_1, 0);\n                    __lsx_vst(_r0tm2, r0_tm_2, 0);\n                    __lsx_vst(_r0tm3, r0_tm_3, 0);\n                    __lsx_vst(_r0tm4, r0_tm_4, 0);\n                    __lsx_vst(_r0tm5, r0_tm_5, 0);\n\n                    r0_tm_0 += tiles * 48;\n                    r0_tm_1 += tiles * 48;\n                    r0_tm_2 += tiles * 48;\n                    r0_tm_3 += tiles * 48;\n                    r0_tm_4 += tiles * 48;\n                    r0_tm_5 += tiles * 48;\n                }\n            }\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolutiondepthwise_3x3.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const float bias0 = bias ? bias[g] : 0.f;\n\n        const float* kernel0 = kernel + g * 9;\n\n        float* outptr0 = out;\n        float* outptr1 = outptr0 + outw;\n\n        const float* img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n        const float* r2 = img0 + w * 2;\n        const float* r3 = img0 + w * 3;\n\n        const float* k0 = kernel0;\n        const float* k1 = kernel0 + 3;\n        const float* k2 = kernel0 + 6;\n\n        int i = 0;\n\n        for (; i + 1 < outh; i += 2)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = bias0;\n                float sum2 = bias0;\n\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum2 += r1[0] * k0[0];\n                sum2 += r1[1] * k0[1];\n                sum2 += r1[2] * k0[2];\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum2 += r2[0] * k1[0];\n                sum2 += r2[1] * k1[1];\n                sum2 += r2[2] * k1[2];\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n                sum2 += r3[0] * k2[0];\n                sum2 += r3[1] * k2[1];\n                
sum2 += r3[2] * k2[2];\n\n                *outptr0 = sum;\n                *outptr1 = sum2;\n\n                r0++;\n                r1++;\n                r2++;\n                r3++;\n                outptr0++;\n                outptr1++;\n            }\n\n            r0 += 2 + w;\n            r1 += 2 + w;\n            r2 += 2 + w;\n            r3 += 2 + w;\n\n            outptr0 += outw;\n            outptr1 += outw;\n        }\n\n        for (; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = bias0;\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n\n                *outptr0 = sum;\n\n                r0++;\n                r1++;\n                r2++;\n                outptr0++;\n            }\n\n            r0 += 2;\n            r1 += 2;\n            r2 += 2;\n        }\n    }\n}\n\nstatic void convdw3x3s2_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* kernel = _kernel;\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        const float bias0 = bias ? 
bias[g] : 0.f;\n\n        const float* kernel0 = kernel + g * 9;\n\n        float* outptr = out;\n\n        const float* img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0;\n        const float* r1 = img0 + w;\n        const float* r2 = img0 + w * 2;\n\n        const float* k0 = kernel0;\n        const float* k1 = kernel0 + 3;\n        const float* k2 = kernel0 + 6;\n\n        int i = 0;\n\n        for (; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = bias0;\n                sum += r0[0] * k0[0];\n                sum += r0[1] * k0[1];\n                sum += r0[2] * k0[2];\n                sum += r1[0] * k1[0];\n                sum += r1[1] * k1[1];\n                sum += r1[2] * k1[2];\n                sum += r2[0] * k2[0];\n                sum += r2[1] * k2[1];\n                sum += r2[2] * k2[2];\n\n                *outptr = sum;\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n                outptr++;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolutiondepthwise_3x3_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw3x3s1_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        __m128 _bias0 = bias ? (__m128)__lsx_vld(bias + g * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out.row(0);\n        float* outptr1 = out.row(1);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n        const float* r3 = img0.row(3);\n\n        __m128 _k00 = (__m128)__lsx_vld(k0, 0);\n        __m128 _k01 = (__m128)__lsx_vld(k0 + 4, 0);\n        __m128 _k02 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n        __m128 _k10 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n        __m128 _k11 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n        __m128 _k12 = (__m128)__lsx_vld(k0 + 4 * 5, 0);\n        __m128 _k20 = (__m128)__lsx_vld(k0 + 4 * 6, 0);\n        __m128 _k21 = (__m128)__lsx_vld(k0 + 4 * 7, 0);\n        __m128 _k22 = (__m128)__lsx_vld(k0 + 4 * 8, 0);\n\n        int i = 0;\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n            for (; j + 1 < outw; j += 2)\n            {\n                __builtin_prefetch(r0 + 32);\n                __builtin_prefetch(r1 + 32);\n                __builtin_prefetch(r2 + 32);\n                __builtin_prefetch(r3 + 32);\n\n                __m128 _sum00 = _bias0;\n                __m128 _sum01 = _bias0;\n                __m128 _sum10 = _bias0;\n                __m128 _sum11 
= _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r00, _k00, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r01, _k01, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r02, _k02, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r01, _k00, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r02, _k01, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r03, _k02, _sum01);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n                __m128 _r13 = (__m128)__lsx_vld(r1 + 4 * 3, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r10, _k10, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r11, _k11, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r12, _k12, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r11, _k10, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r12, _k11, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r13, _k12, _sum01);\n                _sum10 = __lsx_vfmadd_s(_r10, _k00, _sum10);\n                _sum10 = __lsx_vfmadd_s(_r11, _k01, _sum10);\n                _sum10 = __lsx_vfmadd_s(_r12, _k02, _sum10);\n                _sum11 = __lsx_vfmadd_s(_r11, _k00, _sum11);\n                _sum11 = __lsx_vfmadd_s(_r12, _k01, _sum11);\n                _sum11 = __lsx_vfmadd_s(_r13, _k02, _sum11);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n                __m128 _r23 = (__m128)__lsx_vld(r2 + 4 * 3, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r20, _k20, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r21, _k21, 
_sum00);\n                _sum00 = __lsx_vfmadd_s(_r22, _k22, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r21, _k20, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r22, _k21, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r23, _k22, _sum01);\n                _sum10 = __lsx_vfmadd_s(_r20, _k10, _sum10);\n                _sum10 = __lsx_vfmadd_s(_r21, _k11, _sum10);\n                _sum10 = __lsx_vfmadd_s(_r22, _k12, _sum10);\n                _sum11 = __lsx_vfmadd_s(_r21, _k10, _sum11);\n                _sum11 = __lsx_vfmadd_s(_r22, _k11, _sum11);\n                _sum11 = __lsx_vfmadd_s(_r23, _k12, _sum11);\n\n                __m128 _r30 = (__m128)__lsx_vld(r3, 0);\n                __m128 _r31 = (__m128)__lsx_vld(r3 + 4, 0);\n                __m128 _r32 = (__m128)__lsx_vld(r3 + 4 * 2, 0);\n                __m128 _r33 = (__m128)__lsx_vld(r3 + 4 * 3, 0);\n\n                _sum10 = __lsx_vfmadd_s(_r30, _k20, _sum10);\n                _sum10 = __lsx_vfmadd_s(_r31, _k21, _sum10);\n                _sum10 = __lsx_vfmadd_s(_r32, _k22, _sum10);\n                _sum11 = __lsx_vfmadd_s(_r31, _k20, _sum11);\n                _sum11 = __lsx_vfmadd_s(_r32, _k21, _sum11);\n                _sum11 = __lsx_vfmadd_s(_r33, _k22, _sum11);\n\n                __lsx_vst(_sum00, outptr0, 0);\n                __lsx_vst(_sum01, outptr0 + 4, 0);\n                __lsx_vst(_sum10, outptr1, 0);\n                __lsx_vst(_sum11, outptr1 + 4, 0);\n\n                outptr0 += 4 * 2;\n                outptr1 += 4 * 2;\n\n                r0 += 4 * 2;\n                r1 += 4 * 2;\n                r2 += 4 * 2;\n                r3 += 4 * 2;\n            }\n            for (; j < outw; j++)\n            {\n                __builtin_prefetch(r0 + 16);\n                __builtin_prefetch(r1 + 16);\n                __builtin_prefetch(r2 + 16);\n                __builtin_prefetch(r3 + 16);\n\n                __m128 _sum0 = _bias0;\n                __m128 _sum1 = _bias0;\n\n                
__m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r00, _k00, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r01, _k01, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r02, _k02, _sum0);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r10, _k10, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r11, _k11, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r12, _k12, _sum0);\n                _sum1 = __lsx_vfmadd_s(_r10, _k00, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r11, _k01, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r12, _k02, _sum1);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r20, _k20, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r21, _k21, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r22, _k22, _sum0);\n                _sum1 = __lsx_vfmadd_s(_r20, _k10, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r21, _k11, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r22, _k12, _sum1);\n\n                __m128 _r30 = (__m128)__lsx_vld(r3, 0);\n                __m128 _r31 = (__m128)__lsx_vld(r3 + 4, 0);\n                __m128 _r32 = (__m128)__lsx_vld(r3 + 4 * 2, 0);\n\n                _sum1 = __lsx_vfmadd_s(_r30, _k20, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r31, _k21, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r32, _k22, _sum1);\n\n                __lsx_vst(_sum0, outptr0, 0);\n                __lsx_vst(_sum1, outptr1, 0);\n\n                outptr0 += 4;\n                outptr1 += 4;\n\n                r0 += 4;\n  
              r1 += 4;\n                r2 += 4;\n                r3 += 4;\n            }\n\n            r0 += 2 * 4 + w * 4;\n            r1 += 2 * 4 + w * 4;\n            r2 += 2 * 4 + w * 4;\n            r3 += 2 * 4 + w * 4;\n\n            outptr0 += outw * 4;\n            outptr1 += outw * 4;\n        }\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 1 < outw; j += 2)\n            {\n                __builtin_prefetch(r0 + 32);\n                __builtin_prefetch(r1 + 32);\n                __builtin_prefetch(r2 + 32);\n\n                __m128 _sum00 = _bias0;\n                __m128 _sum01 = _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r00, _k00, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r01, _k01, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r02, _k02, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r01, _k00, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r02, _k01, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r03, _k02, _sum01);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n                __m128 _r13 = (__m128)__lsx_vld(r1 + 4 * 3, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r10, _k10, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r11, _k11, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r12, _k12, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r11, _k10, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r12, _k11, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r13, _k12, _sum01);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = 
(__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n                __m128 _r23 = (__m128)__lsx_vld(r2 + 4 * 3, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r20, _k20, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r21, _k21, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r22, _k22, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r21, _k20, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r22, _k21, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r23, _k22, _sum01);\n\n                __lsx_vst(_sum00, outptr0, 0);\n                __lsx_vst(_sum01, outptr0 + 4, 0);\n\n                outptr0 += 4 * 2;\n\n                r0 += 4 * 2;\n                r1 += 4 * 2;\n                r2 += 4 * 2;\n            }\n            for (; j < outw; j++)\n            {\n                __builtin_prefetch(r0 + 16);\n                __builtin_prefetch(r1 + 16);\n                __builtin_prefetch(r2 + 16);\n\n                __m128 _sum0 = _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r00, _k00, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r01, _k01, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r02, _k02, _sum0);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r10, _k10, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r11, _k11, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r12, _k12, _sum0);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r20, _k20, _sum0);\n        
        _sum0 = __lsx_vfmadd_s(_r21, _k21, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r22, _k22, _sum0);\n\n                __lsx_vst(_sum0, outptr0, 0);\n\n                outptr0 += 4;\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n            }\n\n            r0 += 2 * 4;\n            r1 += 2 * 4;\n            r2 += 2 * 4;\n        }\n    }\n}\n\nstatic void convdw3x3s2_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        __m128 _bias0 = bias ? (__m128)__lsx_vld(bias + g * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n\n        __m128 _k00 = (__m128)__lsx_vld(k0, 0);\n        __m128 _k01 = (__m128)__lsx_vld(k0 + 4, 0);\n        __m128 _k02 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n        __m128 _k10 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n        __m128 _k11 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n        __m128 _k12 = (__m128)__lsx_vld(k0 + 4 * 5, 0);\n        __m128 _k20 = (__m128)__lsx_vld(k0 + 4 * 6, 0);\n        __m128 _k21 = (__m128)__lsx_vld(k0 + 4 * 7, 0);\n        __m128 _k22 = (__m128)__lsx_vld(k0 + 4 * 8, 0);\n\n        int i = 0;\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 1 < outw; j += 2)\n            {\n                __builtin_prefetch(r0 + 64);\n                __builtin_prefetch(r1 + 64);\n                
__builtin_prefetch(r2 + 64);\n\n                __m128 _sum00 = _bias0;\n                __m128 _sum01 = _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n                __m128 _r04 = (__m128)__lsx_vld(r0 + 4 * 4, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r00, _k00, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r01, _k01, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r02, _k02, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r02, _k00, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r03, _k01, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r04, _k02, _sum01);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n                __m128 _r13 = (__m128)__lsx_vld(r1 + 4 * 3, 0);\n                __m128 _r14 = (__m128)__lsx_vld(r1 + 4 * 4, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r10, _k10, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r11, _k11, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r12, _k12, _sum00);\n                _sum01 = __lsx_vfmadd_s(_r12, _k10, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r13, _k11, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r14, _k12, _sum01);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n                __m128 _r23 = (__m128)__lsx_vld(r2 + 4 * 3, 0);\n                __m128 _r24 = (__m128)__lsx_vld(r2 + 4 * 4, 0);\n\n                _sum00 = __lsx_vfmadd_s(_r20, _k20, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r21, _k21, _sum00);\n                _sum00 = __lsx_vfmadd_s(_r22, _k22, _sum00);\n     
           _sum01 = __lsx_vfmadd_s(_r22, _k20, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r23, _k21, _sum01);\n                _sum01 = __lsx_vfmadd_s(_r24, _k22, _sum01);\n\n                __lsx_vst(_sum00, outptr0, 0);\n                __lsx_vst(_sum01, outptr0 + 4, 0);\n\n                outptr0 += 4 * 2;\n\n                r0 += 4 * 4;\n                r1 += 4 * 4;\n                r2 += 4 * 4;\n            }\n            for (; j < outw; j++)\n            {\n                __builtin_prefetch(r0 + 32);\n                __builtin_prefetch(r1 + 32);\n                __builtin_prefetch(r2 + 32);\n\n                __m128 _sum0 = _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r00, _k00, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r01, _k01, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r02, _k02, _sum0);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r10, _k10, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r11, _k11, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r12, _k12, _sum0);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n\n                _sum0 = __lsx_vfmadd_s(_r20, _k20, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r21, _k21, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r22, _k22, _sum0);\n\n                __lsx_vst(_sum0, outptr0, 0);\n\n                outptr0 += 4;\n\n                r0 += 4 * 2;\n                r1 += 4 * 2;\n                r2 += 4 * 2;\n            }\n\n            r0 += tailstep;\n            r1 += 
tailstep;\n            r2 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolutiondepthwise_5x5_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convdw5x5s1_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        __m128 _bias0 = bias ? (__m128)__lsx_vld(bias + g * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out.row(0);\n        float* outptr1 = out.row(1);\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n        const float* r3 = img0.row(3);\n        const float* r4 = img0.row(4);\n        const float* r5 = img0.row(5);\n\n        int i = 0;\n        for (; i + 1 < outh; i += 2)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                __builtin_prefetch(r0 + 16);\n                __builtin_prefetch(r1 + 16);\n                __builtin_prefetch(r2 + 16);\n                __builtin_prefetch(r3 + 16);\n                __builtin_prefetch(r4 + 16);\n                __builtin_prefetch(r5 + 16);\n\n                __builtin_prefetch(k0 + 400);\n\n                __m128 _sum0 = _bias0;\n                __m128 _sum1 = _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n                __m128 _r04 = (__m128)__lsx_vld(r0 + 4 * 4, 0);\n\n                __m128 _k00 = 
(__m128)__lsx_vld(k0, 0);\n                __m128 _k01 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k02 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k03 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k04 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r00, _k00, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r01, _k01, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r02, _k02, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r03, _k03, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r04, _k04, _sum0);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n                __m128 _r13 = (__m128)__lsx_vld(r1 + 4 * 3, 0);\n                __m128 _r14 = (__m128)__lsx_vld(r1 + 4 * 4, 0);\n\n                _sum1 = __lsx_vfmadd_s(_r10, _k00, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r11, _k01, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r12, _k02, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r13, _k03, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r14, _k04, _sum1);\n\n                __m128 _k10 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k11 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k12 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k13 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k14 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r10, _k10, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r11, _k11, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r12, _k12, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r13, _k13, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r14, _k14, _sum0);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n 
               __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n                __m128 _r23 = (__m128)__lsx_vld(r2 + 4 * 3, 0);\n                __m128 _r24 = (__m128)__lsx_vld(r2 + 4 * 4, 0);\n\n                _sum1 = __lsx_vfmadd_s(_r20, _k10, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r21, _k11, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r22, _k12, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r23, _k13, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r24, _k14, _sum1);\n\n                __m128 _k20 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k21 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k22 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k23 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k24 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r20, _k20, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r21, _k21, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r22, _k22, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r23, _k23, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r24, _k24, _sum0);\n\n                __m128 _r30 = (__m128)__lsx_vld(r3, 0);\n                __m128 _r31 = (__m128)__lsx_vld(r3 + 4, 0);\n                __m128 _r32 = (__m128)__lsx_vld(r3 + 4 * 2, 0);\n                __m128 _r33 = (__m128)__lsx_vld(r3 + 4 * 3, 0);\n                __m128 _r34 = (__m128)__lsx_vld(r3 + 4 * 4, 0);\n\n                _sum1 = __lsx_vfmadd_s(_r30, _k20, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r31, _k21, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r32, _k22, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r33, _k23, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r34, _k24, _sum1);\n\n                __m128 _k30 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k31 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k32 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k33 = (__m128)__lsx_vld(k0 + 4 
* 3, 0);\n                __m128 _k34 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r30, _k30, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r31, _k31, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r32, _k32, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r33, _k33, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r34, _k34, _sum0);\n\n                __m128 _r40 = (__m128)__lsx_vld(r4, 0);\n                __m128 _r41 = (__m128)__lsx_vld(r4 + 4, 0);\n                __m128 _r42 = (__m128)__lsx_vld(r4 + 4 * 2, 0);\n                __m128 _r43 = (__m128)__lsx_vld(r4 + 4 * 3, 0);\n                __m128 _r44 = (__m128)__lsx_vld(r4 + 4 * 4, 0);\n\n                _sum1 = __lsx_vfmadd_s(_r40, _k30, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r41, _k31, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r42, _k32, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r43, _k33, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r44, _k34, _sum1);\n\n                __m128 _k40 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k41 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k42 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k43 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k44 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 -= 4 * 20;\n\n                _sum0 = __lsx_vfmadd_s(_r40, _k40, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r41, _k41, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r42, _k42, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r43, _k43, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r44, _k44, _sum0);\n\n                __m128 _r50 = (__m128)__lsx_vld(r5, 0);\n                __m128 _r51 = (__m128)__lsx_vld(r5 + 4, 0);\n                __m128 _r52 = (__m128)__lsx_vld(r5 + 4 * 2, 0);\n                __m128 _r53 = (__m128)__lsx_vld(r5 + 4 * 3, 0);\n                __m128 _r54 = (__m128)__lsx_vld(r5 + 4 * 4, 0);\n\n           
     _sum1 = __lsx_vfmadd_s(_r50, _k40, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r51, _k41, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r52, _k42, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r53, _k43, _sum1);\n                _sum1 = __lsx_vfmadd_s(_r54, _k44, _sum1);\n\n                __lsx_vst(_sum0, outptr0, 0);\n                __lsx_vst(_sum1, outptr1, 0);\n\n                outptr0 += 4;\n                outptr1 += 4;\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                r3 += 4;\n                r4 += 4;\n                r5 += 4;\n            }\n\n            r0 += 4 * 4 + w * 4;\n            r1 += 4 * 4 + w * 4;\n            r2 += 4 * 4 + w * 4;\n            r3 += 4 * 4 + w * 4;\n            r4 += 4 * 4 + w * 4;\n            r5 += 4 * 4 + w * 4;\n\n            outptr0 += outw * 4;\n            outptr1 += outw * 4;\n        }\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                __builtin_prefetch(r0 + 16);\n                __builtin_prefetch(r1 + 16);\n                __builtin_prefetch(r2 + 16);\n                __builtin_prefetch(r3 + 16);\n                __builtin_prefetch(r4 + 16);\n\n                __builtin_prefetch(k0 + 400);\n\n                __m128 _sum0 = _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n                __m128 _r04 = (__m128)__lsx_vld(r0 + 4 * 4, 0);\n\n                __m128 _k00 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k01 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k02 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k03 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k04 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 
* 5;\n\n                _sum0 = __lsx_vfmadd_s(_r00, _k00, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r01, _k01, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r02, _k02, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r03, _k03, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r04, _k04, _sum0);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 2, 0);\n                __m128 _r13 = (__m128)__lsx_vld(r1 + 4 * 3, 0);\n                __m128 _r14 = (__m128)__lsx_vld(r1 + 4 * 4, 0);\n\n                __m128 _k10 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k11 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k12 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k13 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k14 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r10, _k10, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r11, _k11, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r12, _k12, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r13, _k13, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r14, _k14, _sum0);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n                __m128 _r23 = (__m128)__lsx_vld(r2 + 4 * 3, 0);\n                __m128 _r24 = (__m128)__lsx_vld(r2 + 4 * 4, 0);\n\n                __m128 _k20 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k21 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k22 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k23 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k24 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r20, _k20, _sum0);\n        
        _sum0 = __lsx_vfmadd_s(_r21, _k21, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r22, _k22, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r23, _k23, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r24, _k24, _sum0);\n\n                __m128 _r30 = (__m128)__lsx_vld(r3, 0);\n                __m128 _r31 = (__m128)__lsx_vld(r3 + 4, 0);\n                __m128 _r32 = (__m128)__lsx_vld(r3 + 4 * 2, 0);\n                __m128 _r33 = (__m128)__lsx_vld(r3 + 4 * 3, 0);\n                __m128 _r34 = (__m128)__lsx_vld(r3 + 4 * 4, 0);\n\n                __m128 _k30 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k31 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k32 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k33 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k34 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r30, _k30, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r31, _k31, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r32, _k32, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r33, _k33, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r34, _k34, _sum0);\n\n                __m128 _r40 = (__m128)__lsx_vld(r4, 0);\n                __m128 _r41 = (__m128)__lsx_vld(r4 + 4, 0);\n                __m128 _r42 = (__m128)__lsx_vld(r4 + 4 * 2, 0);\n                __m128 _r43 = (__m128)__lsx_vld(r4 + 4 * 3, 0);\n                __m128 _r44 = (__m128)__lsx_vld(r4 + 4 * 4, 0);\n\n                __m128 _k40 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k41 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k42 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k43 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k44 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 -= 4 * 20;\n\n                _sum0 = __lsx_vfmadd_s(_r40, _k40, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r41, _k41, _sum0);\n                _sum0 
= __lsx_vfmadd_s(_r42, _k42, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r43, _k43, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r44, _k44, _sum0);\n\n                __lsx_vst(_sum0, outptr0, 0);\n\n                outptr0 += 4;\n\n                r0 += 4;\n                r1 += 4;\n                r2 += 4;\n                r3 += 4;\n                r4 += 4;\n            }\n\n            r0 += 4 * 4;\n            r1 += 4 * 4;\n            r2 += 4 * 4;\n            r3 += 4 * 4;\n            r4 += 4 * 4;\n        }\n    }\n}\n\nstatic void convdw5x5s2_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int group = bottom_blob.c;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        Mat out = top_blob.channel(g);\n\n        __m128 _bias0 = bias ? 
(__m128)__lsx_vld(bias + g * 4, 0) : (__m128)__lsx_vreplgr2vr_w(0);\n\n        const float* k0 = kernel.row(g);\n\n        float* outptr0 = out;\n\n        const Mat img0 = bottom_blob.channel(g);\n\n        const float* r0 = img0.row(0);\n        const float* r1 = img0.row(1);\n        const float* r2 = img0.row(2);\n        const float* r3 = img0.row(3);\n        const float* r4 = img0.row(4);\n\n        int i = 0;\n        for (; i < outh; i++)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                __builtin_prefetch(r0 + 32);\n                __builtin_prefetch(r1 + 32);\n                __builtin_prefetch(r2 + 32);\n                __builtin_prefetch(r3 + 32);\n                __builtin_prefetch(r4 + 32);\n\n                __builtin_prefetch(k0 + 400);\n\n                __m128 _sum0 = _bias0;\n\n                __m128 _r00 = (__m128)__lsx_vld(r0, 0);\n                __m128 _r01 = (__m128)__lsx_vld(r0 + 4, 0);\n                __m128 _r02 = (__m128)__lsx_vld(r0 + 4 * 2, 0);\n                __m128 _r03 = (__m128)__lsx_vld(r0 + 4 * 3, 0);\n                __m128 _r04 = (__m128)__lsx_vld(r0 + 4 * 4, 0);\n\n                __m128 _k00 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k01 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k02 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k03 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k04 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r00, _k00, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r01, _k01, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r02, _k02, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r03, _k03, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r04, _k04, _sum0);\n\n                __m128 _r10 = (__m128)__lsx_vld(r1, 0);\n                __m128 _r11 = (__m128)__lsx_vld(r1 + 4, 0);\n                __m128 _r12 = (__m128)__lsx_vld(r1 + 4 * 
2, 0);\n                __m128 _r13 = (__m128)__lsx_vld(r1 + 4 * 3, 0);\n                __m128 _r14 = (__m128)__lsx_vld(r1 + 4 * 4, 0);\n\n                __m128 _k10 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k11 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k12 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k13 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k14 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r10, _k10, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r11, _k11, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r12, _k12, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r13, _k13, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r14, _k14, _sum0);\n\n                __m128 _r20 = (__m128)__lsx_vld(r2, 0);\n                __m128 _r21 = (__m128)__lsx_vld(r2 + 4, 0);\n                __m128 _r22 = (__m128)__lsx_vld(r2 + 4 * 2, 0);\n                __m128 _r23 = (__m128)__lsx_vld(r2 + 4 * 3, 0);\n                __m128 _r24 = (__m128)__lsx_vld(r2 + 4 * 4, 0);\n\n                __m128 _k20 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k21 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k22 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k23 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k24 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r20, _k20, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r21, _k21, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r22, _k22, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r23, _k23, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r24, _k24, _sum0);\n\n                __m128 _r30 = (__m128)__lsx_vld(r3, 0);\n                __m128 _r31 = (__m128)__lsx_vld(r3 + 4, 0);\n                __m128 _r32 = (__m128)__lsx_vld(r3 + 4 * 2, 0);\n                __m128 _r33 = (__m128)__lsx_vld(r3 + 4 * 3, 0);\n   
             __m128 _r34 = (__m128)__lsx_vld(r3 + 4 * 4, 0);\n\n                __m128 _k30 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k31 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k32 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k33 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k34 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 += 4 * 5;\n\n                _sum0 = __lsx_vfmadd_s(_r30, _k30, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r31, _k31, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r32, _k32, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r33, _k33, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r34, _k34, _sum0);\n\n                __m128 _r40 = (__m128)__lsx_vld(r4, 0);\n                __m128 _r41 = (__m128)__lsx_vld(r4 + 4, 0);\n                __m128 _r42 = (__m128)__lsx_vld(r4 + 4 * 2, 0);\n                __m128 _r43 = (__m128)__lsx_vld(r4 + 4 * 3, 0);\n                __m128 _r44 = (__m128)__lsx_vld(r4 + 4 * 4, 0);\n\n                __m128 _k40 = (__m128)__lsx_vld(k0, 0);\n                __m128 _k41 = (__m128)__lsx_vld(k0 + 4, 0);\n                __m128 _k42 = (__m128)__lsx_vld(k0 + 4 * 2, 0);\n                __m128 _k43 = (__m128)__lsx_vld(k0 + 4 * 3, 0);\n                __m128 _k44 = (__m128)__lsx_vld(k0 + 4 * 4, 0);\n                k0 -= 4 * 20;\n\n                _sum0 = __lsx_vfmadd_s(_r40, _k40, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r41, _k41, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r42, _k42, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r43, _k43, _sum0);\n                _sum0 = __lsx_vfmadd_s(_r44, _k44, _sum0);\n\n                __lsx_vst(_sum0, outptr0, 0);\n\n                outptr0 += 4;\n\n                r0 += 4 * 2;\n                r1 += 4 * 2;\n                r2 += 4 * 2;\n                r3 += 4 * 2;\n                r4 += 4 * 2;\n            }\n\n            r0 += tailstep;\n            r1 += tailstep;\n            r2 += 
tailstep;\n            r3 += tailstep;\n            r4 += tailstep;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/convolutiondepthwise_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolutiondepthwise_loongarch.h\"\n\n#include \"layer_type.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_activation.h\"\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\n#include \"convolutiondepthwise_3x3.h\"\n\n#if __loongarch_sx\n#include \"convolutiondepthwise_3x3_pack4.h\"\n#include \"convolutiondepthwise_5x5_pack4.h\"\n#endif // __loongarch_sx\n\nConvolutionDepthWise_loongarch::ConvolutionDepthWise_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n\n    activation = 0;\n}\n\nint ConvolutionDepthWise_loongarch::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    activation = create_activation_layer(activation_type, activation_params, opt);\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return create_pipeline_int8_loongarch(opt);\n    }\n#endif\n\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        int elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            elempack = channels % 4 == 0 ? 
4 : 1;\n        }\n#endif\n\n#if __loongarch_sx\n        // pack4\n        if (elempack == 4)\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, group);\n            convert_packing(weight_data_r2, weight_data_tm, 4, opt);\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 1)\n        {\n            weight_data_tm = weight_data;\n        }\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    // group convolution\n    create_group_ops(opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint ConvolutionDepthWise_loongarch::create_group_ops(const Option& opt)\n{\n    // create Convolution op for each group\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    for (int i = 0; i < (int)group_ops.size(); i++)\n        delete group_ops[i];\n\n    group_ops.clear();\n\n    const int channels_g = channels / group;\n    const int num_output_g = num_output / group;\n\n    group_ops.resize(group);\n\n    for (int g = 0; g < group; g++)\n    {\n        Mat weight_data_g = weight_data.range(maxk * channels_g * num_output_g * g, maxk * channels_g * num_output_g).clone();\n        Mat bias_data_g;\n        if (bias_term)\n            bias_data_g = bias_data.range(num_output_g * g, num_output_g);\n\n        ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution);\n\n        // set param\n        ncnn::ParamDict pd;\n        pd.set(0, num_output_g); // num_output\n        pd.set(1, kernel_w);\n        pd.set(11, kernel_h);\n        pd.set(2, dilation_w);\n        pd.set(12, dilation_h);\n        pd.set(3, stride_w);\n        pd.set(13, stride_h);\n        pd.set(4, 0);  // pad_w\n        pd.set(14, 0); // pad_h\n        pd.set(5, bias_term);\n        pd.set(6, maxk * channels_g * num_output_g); // weight_data_size\n        pd.set(8, int8_scale_term);\n        pd.set(9, 
activation_type);\n        pd.set(10, activation_params);\n\n        op->load_param(pd);\n\n        // set weights\n        if (bias_term)\n        {\n            ncnn::Mat weights[5];\n            weights[0] = weight_data_g;\n            weights[1] = bias_data_g;\n\n#if NCNN_INT8\n            if (int8_scale_term)\n            {\n                Mat weight_data_int8_scales_g(num_output_g);\n                weight_data_int8_scales_g.fill(weight_data_int8_scales[g]);\n                weights[2] = weight_data_int8_scales_g;\n                weights[3] = bottom_blob_int8_scales.range(g, 1);\n            }\n            if (int8_scale_term > 100)\n            {\n                weights[4] = top_blob_int8_scales.range(g, 1);\n            }\n#endif\n\n            op->load_model(ModelBinFromMatArray(weights));\n        }\n        else\n        {\n            ncnn::Mat weights[4];\n            weights[0] = weight_data_g;\n\n#if NCNN_INT8\n            if (int8_scale_term)\n            {\n                Mat weight_data_int8_scales_g(num_output_g);\n                weight_data_int8_scales_g.fill(weight_data_int8_scales[g]);\n                weights[1] = weight_data_int8_scales_g;\n                weights[2] = bottom_blob_int8_scales.range(g, 1);\n            }\n            if (int8_scale_term > 100)\n            {\n                weights[3] = top_blob_int8_scales.range(g, 1);\n            }\n#endif\n\n            op->load_model(ModelBinFromMatArray(weights));\n        }\n\n        op->create_pipeline(opt);\n\n        group_ops[g] = op;\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise_loongarch::destroy_pipeline(const Option& opt)\n{\n    if (activation)\n    {\n        activation->destroy_pipeline(opt);\n        delete activation;\n        activation = 0;\n    }\n\n    for (int i = 0; i < (int)group_ops.size(); i++)\n    {\n        group_ops[i]->destroy_pipeline(opt);\n        delete group_ops[i];\n    }\n    group_ops.clear();\n\n    return 0;\n}\n\nint 
ConvolutionDepthWise_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && int8_scale_term)\n    {\n        return forward_int8_loongarch(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n#if __loongarch_sx\n        if (elempack == 4)\n        {\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw3x3s1_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw3x3s2_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw5x5s1_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 5 && kernel_h == 5 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw5x5s2_pack4_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            
else\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    float* outptr = top_blob.channel(g);\n                    const float* kptr = (const float*)weight_data_tm + maxk * g * 4;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                            if (bias_term)\n                            {\n                                _sum = (__m128)__lsx_vld((const float*)bias_data + g * 4, 0);\n                            }\n\n                            const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                __m128 _val = (__m128)__lsx_vld(sptr + space_ofs[k] * 4, 0);\n                                __m128 _w = (__m128)__lsx_vld(kptr + k * 4, 0);\n                                _sum = __lsx_vfmadd_s(_w, _val, _sum);\n   
                         }\n\n                            _sum = activation_ps(_sum, activation_type, activation_params);\n\n                            __lsx_vst(_sum, outptr + j * 4, 0);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 1)\n        {\n            if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n            {\n                convdw3x3s1_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n            {\n                convdw3x3s2_lsx(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n                if (activation)\n                {\n                    activation->forward_inplace(top_blob, opt);\n                }\n            }\n            else\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp 
parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < group; g++)\n                {\n                    float* outptr = top_blob.channel(g);\n                    const float* kptr = (const float*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            float sum = 0.f;\n\n                            if (bias_term)\n                                sum = bias_data[g];\n\n                            const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                float val = (float)sptr[space_ofs[k]];\n                                float w = (float)kptr[k];\n                                sum += val * w;\n                            }\n\n                            sum = activation_ss(sum, activation_type, activation_params);\n\n                            outptr[j] = sum;\n                        }\n\n                        outptr += outw;\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // group convolution\n    const int channels_g = channels * elempack / group;\n    const int num_output_g = num_output / group;\n\n    int g_elempack = 1;\n    int out_g_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        g_elempack = channels_g % 4 == 0 ? 4 : 1;\n        out_g_elempack = num_output_g % 4 == 0 ? 
4 : 1;\n    }\n#endif\n\n    // unpacking\n    Mat bottom_blob_bordered_unpacked = bottom_blob_bordered;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_bordered, bottom_blob_bordered_unpacked, 1, opt_p);\n    }\n\n    Mat top_blob_unpacked = top_blob;\n    if (out_g_elempack < out_elempack)\n    {\n        top_blob_unpacked.create(outw, outh, num_output, out_elemsize / out_elempack, 1, opt.workspace_allocator);\n        if (top_blob_unpacked.empty())\n            return -100;\n    }\n\n    for (int g = 0; g < group; g++)\n    {\n        const Mat bottom_blob_bordered_g = bottom_blob_bordered_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n        Mat top_blob_g = top_blob_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n        const ncnn::Layer* op = group_ops[g];\n\n        Option opt_g = opt;\n        opt_g.blob_allocator = top_blob_unpacked.allocator;\n\n        // forward\n        op->forward(bottom_blob_bordered_g, top_blob_g, opt_g);\n    }\n\n    // packing\n    if (out_g_elempack < out_elempack)\n    {\n        convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n    }\n    else\n    {\n        top_blob = top_blob_unpacked;\n    }\n\n    return 0;\n}\n\nint ConvolutionDepthWise_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // weight_data_flattened as 
pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::ConvolutionDepthWise);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(7, group);\n    pd.set(8, int8_scale_term);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n#if NCNN_INT8\nint ConvolutionDepthWise_loongarch::create_pipeline_int8_loongarch(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        int elempack = 1;\n#if __loongarch_sx\n      
  if (opt.use_packing_layout)\n        {\n            elempack = channels % 8 == 0 ? 8 : 1;\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 8)\n        {\n            Mat weight_data_r2 = weight_data.reshape(maxk, group);\n            convert_packing(weight_data_r2, weight_data_tm, 8, opt);\n        }\n\n        if (elempack == 1)\n        {\n            weight_data_tm = weight_data;\n        }\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    // group convolution\n    create_group_ops(opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint ConvolutionDepthWise_loongarch::forward_int8_loongarch(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n\n    int elembits = bottom_blob.elembits();\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elembits != 8)\n    {\n        const int channels_g = channels * elempack / group;\n\n        Mat scales(channels * elempack);\n        {\n            float* ps = scales;\n            for (int g = 0; g < group; g++)\n            {\n                float scale = bottom_blob_int8_scales[g];\n                for (int q = 0; q < channels_g; q++)\n                {\n                    *ps++ = scale;\n                }\n            }\n        }\n\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, scales, opt_q);\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob_int8, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n    channels = 
bottom_blob_bordered.c;\n    elempack = bottom_blob_bordered.elempack;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n        int out_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            out_elempack = num_output % 8 == 0 ? 8 : 1;\n        }\n#endif // __loongarch_sx\n        bool use_int8_requantize = int8_scale_term > 100;\n        size_t out_elemsize = use_int8_requantize ? 1u * out_elempack : 4u * out_elempack;\n\n        top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __loongarch_sx\n        if (elempack == 8)\n        {\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    signed char* outptr_s8 = top_blob.channel(g);\n                    float* outptr_f32 = top_blob.channel(g);\n                    const signed char* kptr = (const signed char*)weight_data_tm + maxk * g * 8;\n             
       const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                            __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n                            const signed char* sptr = m.row<const signed char>(i * stride_h) + j * stride_w * 8;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                __m128i _val = __lsx_vld(sptr + space_ofs[k] * 8, 0);\n                                __m128i _val16 = __lsx_vilvl_b(__lsx_vslti_b(_val, 0), _val);\n\n                                __m128i _w = __lsx_vld(kptr + k * 8, 0);\n                                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                                __m128i _s0 = __lsx_vmul_h(_val16, _w16);\n                                __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                                __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                                __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n\n                                _sum0 = __lsx_vadd_w(_sum0, _s0l);\n                                _sum1 = __lsx_vadd_w(_sum1, _s0h);\n                            }\n\n                            __m128 _scale_in0;\n                            __m128 _scale_in1;\n                            {\n                                __m128 _bottom_blob_int8_scales0 = (__m128)__lsx_vld((const float*)bottom_blob_int8_scales + g * 8, 0);\n                                __m128 _bottom_blob_int8_scales1 = (__m128)__lsx_vld((const float*)bottom_blob_int8_scales + g * 8 + 4, 0);\n                                __m128 _weight_data_int8_scales0 = (__m128)__lsx_vld((const float*)weight_data_int8_scales + g * 8, 0);\n                                __m128 _weight_data_int8_scales1 = 
(__m128)__lsx_vld((const float*)weight_data_int8_scales + g * 8 + 4, 0);\n                                _scale_in0 = __lsx_vfrecip_s(__lsx_vfmul_s(_bottom_blob_int8_scales0, _weight_data_int8_scales0));\n                                _scale_in1 = __lsx_vfrecip_s(__lsx_vfmul_s(_bottom_blob_int8_scales1, _weight_data_int8_scales1));\n\n                                __m128i _m0 = __lsx_vfcmp_cne_s(_weight_data_int8_scales0, __lsx_vreplfr2vr_s(0.f));\n                                __m128i _m1 = __lsx_vfcmp_cne_s(_weight_data_int8_scales1, __lsx_vreplfr2vr_s(0.f));\n                                _scale_in0 = (__m128)__lsx_vand_v((__m128i)_scale_in0, (__m128i)_m0);\n                                _scale_in1 = (__m128)__lsx_vand_v((__m128i)_scale_in1, (__m128i)_m1);\n                            }\n\n                            __m128 _sumfp32_0 = __lsx_vfmul_s(__lsx_vffint_s_w(_sum0), _scale_in0);\n                            __m128 _sumfp32_1 = __lsx_vfmul_s(__lsx_vffint_s_w(_sum1), _scale_in1);\n\n                            if (bias_term)\n                            {\n                                __m128 _bias0 = (__m128)__lsx_vld((const float*)bias_data + g * 8, 0);\n                                __m128 _bias1 = (__m128)__lsx_vld((const float*)bias_data + g * 8 + 4, 0);\n                                _sumfp32_0 = __lsx_vfadd_s(_sumfp32_0, _bias0);\n                                _sumfp32_1 = __lsx_vfadd_s(_sumfp32_1, _bias1);\n                            }\n\n                            _sumfp32_0 = activation_ps(_sumfp32_0, activation_type, activation_params);\n                            _sumfp32_1 = activation_ps(_sumfp32_1, activation_type, activation_params);\n\n                            if (use_int8_requantize)\n                            {\n                                // requantize and relu\n                                __m128 _scale_out0 = (__m128)__lsx_vld((const float*)top_blob_int8_scales + g * 8, 0);\n                          
      __m128 _scale_out1 = (__m128)__lsx_vld((const float*)top_blob_int8_scales + g * 8 + 4, 0);\n                                _sumfp32_0 = __lsx_vfmul_s(_sumfp32_0, _scale_out0);\n                                _sumfp32_1 = __lsx_vfmul_s(_sumfp32_1, _scale_out1);\n                                int64_t _sum8 = float2int8(_sumfp32_0, _sumfp32_1);\n\n                                *(int64_t*)outptr_s8 = _sum8;\n                                outptr_s8 += 8;\n                            }\n                            else\n                            {\n                                // dequantize and relu\n                                __lsx_vst(_sumfp32_0, outptr_f32, 0);\n                                __lsx_vst(_sumfp32_1, outptr_f32 + 4, 0);\n                                outptr_f32 += 8;\n                            }\n                        }\n                    }\n                }\n            }\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 1)\n        {\n            {\n                const int maxk = kernel_w * kernel_h;\n\n                // kernel offsets\n                std::vector<int> _space_ofs(maxk);\n                int* space_ofs = &_space_ofs[0];\n                {\n                    int p1 = 0;\n                    int p2 = 0;\n                    int gap = w * dilation_h - kernel_w * dilation_w;\n                    for (int i = 0; i < kernel_h; i++)\n                    {\n                        for (int j = 0; j < kernel_w; j++)\n                        {\n                            space_ofs[p1] = p2;\n                            p1++;\n                            p2 += dilation_w;\n                        }\n                        p2 += gap;\n                    }\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < group; g++)\n                {\n                    signed char* outptr_s8 = top_blob.channel(g);\n               
     float* outptr_f32 = top_blob.channel(g);\n                    const signed char* kptr = (const signed char*)weight_data_tm + maxk * g;\n                    const Mat m = bottom_blob_bordered.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sum = 0;\n\n                            const signed char* sptr = m.row<const signed char>(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                signed char val = sptr[space_ofs[k]];\n                                signed char w = kptr[k];\n                                sum += val * w;\n                            }\n\n                            float scale_in;\n                            if (weight_data_int8_scales[g] == 0)\n                                scale_in = 0;\n                            else\n                                scale_in = 1.f / (bottom_blob_int8_scales[g] * weight_data_int8_scales[g]);\n\n                            float sumfp32 = sum * scale_in;\n\n                            if (bias_term)\n                                sumfp32 += bias_data[g];\n\n                            sumfp32 = activation_ss(sumfp32, activation_type, activation_params);\n\n                            if (use_int8_requantize)\n                            {\n                                // requantize\n                                float scale_out = top_blob_int8_scales[g];\n                                signed char sums8 = float2int8(sumfp32 * scale_out);\n                                outptr_s8[0] = sums8;\n                                outptr_s8 += 1;\n                            }\n                            else\n                            {\n                                // dequantize\n                                outptr_f32[0] = sumfp32;\n   
                             outptr_f32 += 1;\n                            }\n                        }\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    bool use_int8_requantize = int8_scale_term > 100;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        if (use_int8_requantize)\n            out_elempack = num_output % 8 == 0 ? 8 : 1;\n        else\n            out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __loongarch_sx\n    size_t out_elemsize = use_int8_requantize ? 1u * out_elempack : 4u * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // group convolution\n    const int channels_g = channels * elempack / group;\n    const int num_output_g = num_output / group;\n\n    int g_elempack = 1;\n    int out_g_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        g_elempack = channels_g % 8 == 0 ? 8 : 1;\n        if (use_int8_requantize)\n            out_g_elempack = num_output_g % 8 == 0 ? 8 : 1;\n        else\n            out_g_elempack = num_output_g % 4 == 0 ? 
4 : 1;\n    }\n#endif // __loongarch_sx\n\n    // unpacking\n    Mat bottom_blob_bordered_unpacked = bottom_blob_bordered;\n    if (elempack > g_elempack)\n    {\n        Option opt_p = opt;\n        opt_p.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_bordered, bottom_blob_bordered_unpacked, g_elempack, opt_p);\n    }\n\n    Mat top_blob_unpacked = top_blob;\n    if (out_g_elempack < out_elempack)\n    {\n        top_blob_unpacked.create(outw, outh, num_output / out_g_elempack, out_elemsize / out_elempack * out_g_elempack, out_g_elempack, opt.workspace_allocator);\n        if (top_blob_unpacked.empty())\n            return -100;\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int g = 0; g < group; g++)\n    {\n        const Mat bottom_blob_bordered_g = bottom_blob_bordered_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n        Mat top_blob_g = top_blob_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n        const ncnn::Layer* op = group_ops[g];\n\n        Option opt_g = opt;\n        opt_g.blob_allocator = top_blob_unpacked.allocator;\n\n        // forward\n        op->forward(bottom_blob_bordered_g, top_blob_g, opt_g);\n    }\n\n    // packing\n    if (out_g_elempack < out_elempack)\n    {\n        convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n    }\n    else\n    {\n        top_blob = top_blob_unpacked;\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/convolutiondepthwise_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTIONDEPTHWISE_LOONGARCH_H\n#define LAYER_CONVOLUTIONDEPTHWISE_LOONGARCH_H\n\n#include \"convolutiondepthwise.h\"\n\nnamespace ncnn {\n\nclass ConvolutionDepthWise_loongarch : public ConvolutionDepthWise\n{\npublic:\n    ConvolutionDepthWise_loongarch();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    int create_group_ops(const Option& opt);\n#if NCNN_INT8\n    int create_pipeline_int8_loongarch(const Option& opt);\n    int forward_int8_loongarch(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Layer* activation;\n    std::vector<ncnn::Layer*> group_ops;\n\n    Mat weight_data_tm;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTIONDEPTHWISE_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/crop_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"crop_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nCrop_loongarch::Crop_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\n#if __loongarch_sx\nstatic void crop_pack4_lsx(const Mat& src, Mat& dst, int top, int left)\n{\n    int w = dst.w;\n    int h = dst.h;\n    int right = src.w - dst.w - left;\n\n    const float* ptr = src.row(top) + left * 4;\n    float* outptr = dst;\n\n    for (int y = 0; y < h; y++)\n    {\n        for (int x = 0; x < w; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __lsx_vst(_p, outptr, 0);\n\n            ptr += 4;\n            outptr += 4;\n        }\n\n        ptr += (left + right) * 4;\n    }\n}\n#endif // __loongarch_sx\n\nint Crop_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __loongarch_sx\n    int _woffset, _hoffset, _doffset, _coffset;\n    int _outw, _outh, _outd, _outc;\n    if (!starts_expr.empty() && !ends_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(1);\n        bottom_blob_shapes[0] = bottom_blob.shape();\n        eval_crop_expr(bottom_blob_shapes, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else\n    {\n        resolve_crop_roi(bottom_blob.shape(), _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int out_elempack = _outw % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw / out_elempack == w && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_woffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                crop_pack4_lsx(bottom_blob, top_blob, 0, _woffset / elempack);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int out_elempack = _outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh / out_elempack == h && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_hoffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw, _outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                crop_pack4_lsx(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const Mat m = bottom_blob_sliced.channel(q);\n                    Mat borderm = top_blob.channel(q);\n\n                    crop_pack4_lsx(m, borderm, _hoffset, _woffset);\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outd == d && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h && _outd == d)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outd, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    for (int z = 0; z < _outd; z++)\n                    {\n                        const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        crop_pack4_lsx(m, borderm, _hoffset, _woffset);\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __loongarch_sx\n\n    Mat bottom_blob_unpacked = bottom_blob;\n    if (elempack != 1)\n    {\n        Option opt_pack1 = opt;\n        opt_pack1.blob_allocator = opt.workspace_allocator;\n\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack1);\n        if (bottom_blob_unpacked.empty())\n            return -100;\n    }\n\n    return Crop::forward(bottom_blob_unpacked, top_blob, opt);\n}\n\nint Crop_loongarch::forward(const std::vector<Mat>& bottom_blobs, 
std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int ref_elempack = reference_blob.elempack;\n\n    Mat& top_blob = top_blobs[0];\n\n#if __loongarch_sx\n    int _woffset, _hoffset, _doffset, _coffset;\n    int _outw, _outh, _outd, _outc;\n    if (!starts_expr.empty() && !ends_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(bottom_blobs.size());\n        for (size_t i = 0; i < bottom_blobs.size(); i++)\n        {\n            bottom_blob_shapes[i] = bottom_blobs[i].shape();\n        }\n        eval_crop_expr(bottom_blob_shapes, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else if (woffset == -233)\n    {\n        resolve_crop_roi(bottom_blob.shape(), (const int*)reference_blob, _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n    else\n    {\n        resolve_crop_roi(bottom_blob.shape(), reference_blob.shape(), _woffset, _hoffset, _doffset, _coffset, _outw, _outh, _outd, _outc);\n    }\n\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int out_elempack = _outw % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw / out_elempack == w && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_woffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                crop_pack4_lsx(bottom_blob, top_blob, 0, _woffset / elempack);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int out_elempack = _outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh / out_elempack == h && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_hoffset % 4 == 0 && out_elempack == 4)\n            {\n                top_blob.create(_outw, _outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                crop_pack4_lsx(bottom_blob, top_blob, _hoffset / elempack, _woffset);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const Mat m = bottom_blob_sliced.channel(q);\n                    Mat borderm = top_blob.channel(q);\n\n                    crop_pack4_lsx(m, borderm, _hoffset, _woffset);\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int out_elempack = _outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (_outw == w && _outh == h && _outd == d && _outc / out_elempack == channels && out_elempack == 4)\n            {\n                top_blob = bottom_blob;\n                return 0;\n            }\n\n            if (_coffset % 4 == 0 && out_elempack == 4)\n            {\n                const Mat bottom_blob_sliced = bottom_blob.channel_range(_coffset / out_elempack, _outc / out_elempack);\n\n                if (_outw == w && _outh == h && _outd == d)\n                {\n                    top_blob = bottom_blob_sliced.clone(opt.blob_allocator);\n                    if (top_blob.empty())\n                        return -100;\n                }\n\n                top_blob.create(_outw, _outh, _outd, _outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    for (int z = 0; z < _outd; z++)\n                    {\n                        const Mat m = bottom_blob_sliced.channel(q).depth(z + _doffset);\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        crop_pack4_lsx(m, borderm, _hoffset, _woffset);\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __loongarch_sx\n\n    std::vector<Mat> bottom_blobs_unpacked(bottom_blobs.size());\n    for (size_t i = 0; i < bottom_blobs.size(); i++)\n    {\n        Mat bottom_blob_unpacked = bottom_blobs[i];\n        if (elempack != 1)\n        {\n            Option opt_pack1 = opt;\n            opt_pack1.blob_allocator = opt.workspace_allocator;\n\n            convert_packing(bottom_blobs[i], bottom_blob_unpacked, 1, opt_pack1);\n            if (bottom_blob_unpacked.empty())\n          
      return -100;\n        }\n\n        bottom_blobs_unpacked[i] = bottom_blob_unpacked;\n    }\n\n    return Crop::forward(bottom_blobs_unpacked, top_blobs, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/crop_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CROP_LOONGARCH_H\n#define LAYER_CROP_LOONGARCH_H\n\n#include \"crop.h\"\n\nnamespace ncnn {\n\nclass Crop_loongarch : public Crop\n{\npublic:\n    Crop_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CROP_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/deconvolution_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolution_loongarch.h\"\n\n#include \"layer_type.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_activation.h\"\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\n#if __loongarch_sx\n#include \"deconvolution_pack4.h\"\n#include \"deconvolution_pack1to4.h\"\n#include \"deconvolution_pack4to1.h\"\n#endif // __loongarch_sx\n\nDeconvolution_loongarch::Deconvolution_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Deconvolution_loongarch::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    const int maxk = kernel_w * kernel_h;\n    int num_input = weight_data_size / maxk / num_output;\n\n    Mat weight_data_transposed(weight_data.w);\n    {\n        float* pt = weight_data_transposed;\n        const float* p = weight_data;\n\n        for (int i = 0; i < num_input * num_output; i++)\n        {\n            for (int k = 0; k < maxk; k++)\n            {\n                pt[maxk - 1 - k] = p[k];\n            }\n\n            p += maxk;\n            pt += maxk;\n        }\n    }\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data_transposed.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)4u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            float* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n\n#if __loongarch_sx\n    // pack4\n    if (elempack == 4 && out_elempack == 4)\n    {\n    }\n\n    // pack1to4\n    if (elempack == 1 && out_elempack == 4)\n    {\n    }\n\n    // pack4to1\n    if (elempack == 4 && out_elempack == 1)\n    {\n    }\n#endif // __loongarch_sx\n\n    // pack1\n    if (elempack == 1 && out_elempack == 1)\n    {\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Deconvolution_loongarch::destroy_pipeline(const Option& opt)\n{\n    return 0;\n}\n\nint Deconvolution_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // deconvolve with NxN kernel\n    // value = value + bias\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //    
 NCNN_LOGE(\"Deconvolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n#if __loongarch_sx\n    if (elempack == 4 && out_elempack == 4)\n    {\n        {\n            deconvolution_pack4_lsx(bottom_blob, top_blob_bordered, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        {\n            deconvolution_pack1to4_lsx(bottom_blob, top_blob_bordered, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            deconvolution_pack4to1_lsx(bottom_blob, top_blob_bordered, weight_data_tm, bias_data, kernel_w, kernel_h, 
dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        {\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output; p++)\n            {\n                float* outptr = top_blob_bordered.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const float* kptr = (const float*)weight_data_tm.channel(p);\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob.channel(q);\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                                if (sy >= h)\n                                    continue;\n\n                                const float* sptr = m.row(sy);\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n             
                           continue;\n\n                                    float val = sptr[sx];\n\n                                    int k = y * kernel_w + x;\n\n                                    float w = kptr[k];\n\n                                    sum += val * w;\n                                }\n                            }\n\n                            kptr += maxk;\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint Deconvolution_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.c * bottom_blob.elempack;\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.d * 1;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    // transpose group-inch/group-outch/group-kh-kw to group-outch/group-inch/group-kh-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _kernel_h * _num_output * _num_input / 1, 4u, opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / 1;\n   
     const int inch_g = _num_input / 1;\n        const int maxk = _kernel_h * _kernel_w;\n\n        for (int g = 0; g < 1; g++)\n        {\n            // reorder weight from inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Deconvolution);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, output_pad_right);\n    pd.set(19, output_pad_bottom);\n    pd.set(20, output_w);\n    pd.set(21, output_h);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_transposed.w);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_transposed;\n    
weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    // propagate the inner layer's status instead of always reporting success\n    int ret = op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return ret;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/deconvolution_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTION_LOONGARCH_H\n#define LAYER_DECONVOLUTION_LOONGARCH_H\n\n#include \"deconvolution.h\"\n\nnamespace ncnn {\n\nclass Deconvolution_loongarch : public Deconvolution\n{\npublic:\n    Deconvolution_loongarch();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    Mat weight_data_tm;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTION_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/deconvolution_pack1to4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void deconvolution_pack1to4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_pack1ton, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    const int maxk = kernel_w * kernel_h;\n\n    const float* bias_data_ptr = bias_data;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                if (bias_data_ptr)\n                {\n                    _sum = (__m128)__lsx_vld((const float*)bias_data_ptr + p * 4, 0);\n                }\n\n                const float* kptr = (const float*)weight_data_pack1ton + maxk * channels * p * 4;\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n\n                    for (int y = 0; y < kernel_h; y++)\n                    {\n                        int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                        if (sys < 0 || sys % stride_h != 0)\n                            continue;\n\n                        int sy = sys / stride_h;\n                        if (sy >= h)\n                            continue;\n\n          
              const float* sptr = m.row(sy);\n\n                        for (int x = 0; x < kernel_w; x++)\n                        {\n                            int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                            if (sxs < 0 || sxs % stride_w != 0)\n                                continue;\n\n                            int sx = sxs / stride_w;\n                            if (sx >= w)\n                                continue;\n\n                            float val = sptr[sx];\n\n                            int k = y * kernel_w + x;\n\n                            __m128 _val = (__m128)__lsx_vreplfr2vr_s(val);\n                            __m128 _w = (__m128)__lsx_vld(kptr + k * 4, 0);\n                            _sum = __lsx_vfmadd_s(_w, _val, _sum);\n                        }\n                    }\n\n                    kptr += maxk * 4;\n                }\n\n                _sum = activation_ps(_sum, activation_type, activation_params);\n\n                __lsx_vst(_sum, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/deconvolution_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void deconvolution_pack4_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_pack4, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    const int maxk = kernel_w * kernel_h;\n\n    const float* bias_data_ptr = bias_data;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                if (bias_data_ptr)\n                {\n                    _sum = (__m128)__lsx_vld((const float*)bias_data_ptr + p * 4, 0);\n                }\n\n                const float* kptr = (const float*)weight_data_pack4.channel(p);\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n\n                    for (int y = 0; y < kernel_h; y++)\n                    {\n                        int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                        if (sys < 0 || sys % stride_h != 0)\n                            continue;\n\n                        int sy = sys / stride_h;\n                        if (sy >= h)\n                            continue;\n\n                        for (int 
x = 0; x < kernel_w; x++)\n                        {\n                            int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                            if (sxs < 0 || sxs % stride_w != 0)\n                                continue;\n\n                            int sx = sxs / stride_w;\n                            if (sx >= w)\n                                continue;\n\n                            const float* sptr = m.row(sy) + sx * 4;\n\n                            int k = (y * kernel_w + x) * 16;\n\n                            __m128 _val0 = (__m128)__lsx_vreplfr2vr_s(*sptr++);\n                            __m128 _val1 = (__m128)__lsx_vreplfr2vr_s(*sptr++);\n                            __m128 _val2 = (__m128)__lsx_vreplfr2vr_s(*sptr++);\n                            __m128 _val3 = (__m128)__lsx_vreplfr2vr_s(*sptr++);\n                            __m128 _w0 = (__m128)__lsx_vld(kptr + k, 0);\n                            __m128 _w1 = (__m128)__lsx_vld(kptr + k + 4, 0);\n                            __m128 _w2 = (__m128)__lsx_vld(kptr + k + 8, 0);\n                            __m128 _w3 = (__m128)__lsx_vld(kptr + k + 12, 0);\n                            _sum = __lsx_vfmadd_s(_w0, _val0, _sum);\n                            _sum = __lsx_vfmadd_s(_w1, _val1, _sum);\n                            _sum = __lsx_vfmadd_s(_w2, _val2, _sum);\n                            _sum = __lsx_vfmadd_s(_w3, _val3, _sum);\n                        }\n                    }\n\n                    kptr += maxk * 16;\n                }\n\n                _sum = activation_ps(_sum, activation_type, activation_params);\n\n                __lsx_vst(_sum, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/deconvolution_pack4to1.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void deconvolution_pack4to1_lsx(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_pack4to1, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    const int maxk = kernel_w * kernel_h;\n\n    const float* bias_data_ptr = bias_data;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                float sum = 0.f;\n\n                if (bias_data_ptr)\n                {\n                    sum = bias_data_ptr[p];\n                }\n\n                __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                const float* kptr = (const float*)weight_data_pack4to1 + maxk * channels * p * 4;\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n\n                    for (int y = 0; y < kernel_h; y++)\n                    {\n                        int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                        if (sys < 0 || sys % stride_h != 0)\n                            continue;\n\n                        int sy = sys / stride_h;\n                        if (sy >= h)\n                            continue;\n\n                
        for (int x = 0; x < kernel_w; x++)\n                        {\n                            int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                            if (sxs < 0 || sxs % stride_w != 0)\n                                continue;\n\n                            int sx = sxs / stride_w;\n                            if (sx >= w)\n                                continue;\n\n                            const float* sptr = m.row(sy) + sx * 4;\n\n                            int k = y * kernel_w + x;\n\n                            __m128 _val = (__m128)__lsx_vld(sptr, 0);\n                            __m128 _w = (__m128)__lsx_vld(kptr + k * 4, 0);\n                            _sum = __lsx_vfmadd_s(_w, _val, _sum);\n                        }\n                    }\n\n                    kptr += maxk * 4;\n                }\n\n                sum += __lsx_reduce_fadd_s(_sum);\n\n                sum = activation_ss(sum, activation_type, activation_params);\n\n                outptr[j] = sum;\n            }\n\n            outptr += outw;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/deconvolutiondepthwise_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"deconvolutiondepthwise_loongarch.h\"\n\n#include \"layer_type.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_activation.h\"\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nDeconvolutionDepthWise_loongarch::DeconvolutionDepthWise_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint DeconvolutionDepthWise_loongarch::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    // depth-wise\n    if (channels == group && group == num_output)\n    {\n        int elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            elempack = channels % 4 == 0 ? 4 : 1;\n        }\n#endif\n\n        Mat weight_data_transposed(weight_data.w);\n        {\n            float* pt = weight_data_transposed;\n            const float* p = weight_data;\n\n            for (int i = 0; i < (channels / group) * (num_output / group) * group; i++)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    pt[maxk - 1 - k] = p[k];\n                }\n\n                p += maxk;\n                pt += maxk;\n            }\n        }\n\n#if __loongarch_sx\n        // pack4\n        if (elempack == 4)\n        {\n            Mat weight_data_r2 = weight_data_transposed.reshape(maxk, group);\n            convert_packing(weight_data_r2, weight_data_tm, 4, opt);\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 1)\n        {\n            weight_data_tm = weight_data_transposed;\n        }\n\n        if (opt.lightmode)\n            weight_data.release();\n\n        return 0;\n    }\n\n    // group convolution\n  
  create_group_ops(opt);\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint DeconvolutionDepthWise_loongarch::create_group_ops(const Option& opt)\n{\n    // create Deconvolution op for each group\n    const int maxk = kernel_w * kernel_h;\n    int channels = (weight_data_size / group) / maxk / (num_output / group) * group;\n\n    for (int i = 0; i < (int)group_ops.size(); i++)\n        delete group_ops[i];\n\n    group_ops.clear();\n\n    const int channels_g = channels / group;\n    const int num_output_g = num_output / group;\n\n    group_ops.resize(group);\n\n    for (int g = 0; g < group; g++)\n    {\n        Mat weight_data_g = weight_data.range(maxk * channels_g * num_output_g * g, maxk * channels_g * num_output_g).clone();\n        Mat bias_data_g;\n        if (bias_term)\n            bias_data_g = bias_data.range(num_output_g * g, num_output_g);\n\n        ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Deconvolution);\n\n        // set param\n        ncnn::ParamDict pd;\n        pd.set(0, num_output_g); // num_output\n        pd.set(1, kernel_w);\n        pd.set(11, kernel_h);\n        pd.set(2, dilation_w);\n        pd.set(12, dilation_h);\n        pd.set(3, stride_w);\n        pd.set(13, stride_h);\n        pd.set(4, 0);  // pad_w\n        pd.set(14, 0); // pad_h\n        pd.set(18, output_pad_right);\n        pd.set(19, output_pad_bottom);\n        pd.set(5, bias_term);\n        pd.set(6, maxk * channels_g * num_output_g); // weight_data_size\n        pd.set(9, activation_type);\n        pd.set(10, activation_params);\n\n        op->load_param(pd);\n\n        // set weights\n        if (bias_term)\n        {\n            ncnn::Mat weights[2];\n            weights[0] = weight_data_g;\n            weights[1] = bias_data_g;\n\n            op->load_model(ModelBinFromMatArray(weights));\n        }\n        else\n        {\n            ncnn::Mat weights[1];\n            weights[0] = weight_data_g;\n\n            
op->load_model(ModelBinFromMatArray(weights));\n        }\n\n        op->create_pipeline(opt);\n\n        group_ops[g] = op;\n    }\n\n    return 0;\n}\n\nint DeconvolutionDepthWise_loongarch::destroy_pipeline(const Option& opt)\n{\n    for (int i = 0; i < (int)group_ops.size(); i++)\n    {\n        group_ops[i]->destroy_pipeline(opt);\n        delete group_ops[i];\n    }\n    group_ops.clear();\n\n    return 0;\n}\n\nint DeconvolutionDepthWise_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    // convolve with NxN kernel\n    // value = value + bias\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - 1) * stride_w + kernel_extent_w + output_pad_right;\n    int outh = (h - 1) * stride_h + kernel_extent_h + output_pad_bottom;\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    Mat top_blob_bordered;\n    if (pad_left > 0 || pad_right > 0 || pad_top > 0 || pad_bottom > 0 || (output_w > 0 && output_h > 0))\n    {\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.workspace_allocator);\n    }\n    else\n    {\n        top_blob_bordered = top_blob;\n        top_blob_bordered.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob_bordered.empty())\n        return -100;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // depth-wise\n    if (channels * elempack == group && group == num_output)\n    {\n#if __loongarch_sx\n        if (elempack == 4)\n        {\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int g = 0; g < channels; g++)\n                {\n                    float* outptr = top_blob_bordered.channel(g);\n                    const float* kptr = (const float*)weight_data_tm + maxk * g * 4;\n                    const Mat m = bottom_blob.channel(g);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                            if (bias_term)\n                            {\n                                _sum = (__m128)__lsx_vld((const float*)bias_data + g * 4, 0);\n                            }\n\n                            for (int y = 0; y < kernel_h; y++)\n                            {\n                                int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                                if (sys < 0 || sys % stride_h != 0)\n                                    continue;\n\n                                int sy = sys / stride_h;\n                               
 if (sy >= h)\n                                    continue;\n\n                                for (int x = 0; x < kernel_w; x++)\n                                {\n                                    int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                    if (sxs < 0 || sxs % stride_w != 0)\n                                        continue;\n\n                                    int sx = sxs / stride_w;\n                                    if (sx >= w)\n                                        continue;\n\n                                    const float* sptr = m.row(sy) + sx * 4;\n\n                                    int k = y * kernel_w + x;\n\n                                    __m128 _val = (__m128)__lsx_vld(sptr, 0);\n                                    __m128 _w = (__m128)__lsx_vld(kptr + k * 4, 0);\n                                    _sum = __lsx_vfmadd_s(_w, _val, _sum);\n                                }\n                            }\n\n                            _sum = activation_ps(_sum, activation_type, activation_params);\n\n                            __lsx_vst(_sum, outptr + j * 4, 0);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int g = 0; g < channels; g++)\n            {\n                float* outptr = top_blob_bordered.channel(g);\n                const float* kptr = (const float*)weight_data_tm + maxk * g;\n                const Mat m = bottom_blob.channel(g);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = 
bias_data[g];\n                        }\n\n                        for (int y = 0; y < kernel_h; y++)\n                        {\n                            int sys = (i + y * dilation_h - (kernel_extent_h - 1));\n                            if (sys < 0 || sys % stride_h != 0)\n                                continue;\n\n                            int sy = sys / stride_h;\n                            if (sy >= h)\n                                continue;\n\n                            const float* sptr = m.row(sy);\n\n                            for (int x = 0; x < kernel_w; x++)\n                            {\n                                int sxs = (j + x * dilation_w - (kernel_extent_w - 1));\n                                if (sxs < 0 || sxs % stride_w != 0)\n                                    continue;\n\n                                int sx = sxs / stride_w;\n                                if (sx >= w)\n                                    continue;\n\n                                float val = sptr[sx];\n\n                                int k = y * kernel_w + x;\n\n                                float w = kptr[k];\n\n                                sum += val * w;\n                            }\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n    else\n    {\n        // group deconvolution\n        const int channels_g = channels * elempack / group;\n        const int num_output_g = num_output / group;\n\n        int g_elempack = 1;\n        int out_g_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            g_elempack = channels_g % 4 == 0 ? 4 : 1;\n            out_g_elempack = num_output_g % 4 == 0 ? 
4 : 1;\n        }\n#endif\n\n        // unpacking\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > g_elempack)\n        {\n            Option opt_p = opt;\n            opt_p.blob_allocator = opt.workspace_allocator;\n            convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_p);\n        }\n\n        Mat top_blob_bordered_unpacked = top_blob_bordered;\n        if (out_g_elempack < out_elempack)\n        {\n            top_blob_bordered_unpacked.create(outw, outh, num_output, out_elemsize / out_elempack, 1, opt.workspace_allocator);\n            if (top_blob_bordered_unpacked.empty())\n                return -100;\n        }\n\n        for (int g = 0; g < group; g++)\n        {\n            const Mat bottom_blob_g = bottom_blob_unpacked.channel_range(channels_g * g / g_elempack, channels_g / g_elempack);\n            Mat top_blob_bordered_g = top_blob_bordered_unpacked.channel_range(num_output_g * g / out_g_elempack, num_output_g / out_g_elempack);\n\n            const ncnn::Layer* op = group_ops[g];\n\n            Option opt_g = opt;\n            opt_g.blob_allocator = top_blob_bordered_unpacked.allocator;\n\n            // forward\n            op->forward(bottom_blob_g, top_blob_bordered_g, opt_g);\n        }\n\n        // packing\n        if (out_g_elempack < out_elempack)\n        {\n            convert_packing(top_blob_bordered_unpacked, top_blob_bordered, 4, opt);\n        }\n        else\n        {\n            top_blob_bordered = top_blob_bordered_unpacked;\n        }\n    }\n\n    cut_padding(top_blob_bordered, top_blob, opt);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\nint DeconvolutionDepthWise_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _num_input = bottom_blob.c * 
bottom_blob.elempack;\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.d * group;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    // transpose group-inch/group-outch/group-kh-kw to group-outch/group-inch/group-kh-kw\n    Mat weight_data_transposed;\n    {\n        weight_data_transposed.create(_kernel_w * _kernel_h * _num_output * _num_input / group, 4u, opt.workspace_allocator);\n        if (weight_data_transposed.empty())\n            return -100;\n\n        const int outch_g = _num_output / group;\n        const int inch_g = _num_input / group;\n        const int maxk = _kernel_h * _kernel_w;\n\n        for (int g = 0; g < group; g++)\n        {\n            // reorder weight from inch-outch to outch-inch\n            float* wg2 = (float*)weight_data_transposed + g * outch_g * inch_g * maxk;\n            const float* wg = (const float*)weight_data_flattened + g * inch_g * outch_g * maxk;\n            for (int i = 0; i < outch_g; i++)\n            {\n                for (int j = 0; j < inch_g; j++)\n                {\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        wg2[(i * inch_g + j) * maxk + k] = wg[(j * outch_g + i) * maxk + k];\n                    }\n                }\n            }\n        }\n    }\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= 
bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::DeconvolutionDepthWise);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, output_pad_right);\n    pd.set(19, output_pad_bottom);\n    pd.set(20, output_w);\n    pd.set(21, output_h);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_transposed.w);\n    pd.set(7, group);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_transposed;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    // propagate the inner layer's status instead of always reporting success\n    int ret = op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return ret;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/deconvolutiondepthwise_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DECONVOLUTIONDEPTHWISE_LOONGARCH_H\n#define LAYER_DECONVOLUTIONDEPTHWISE_LOONGARCH_H\n\n#include \"deconvolutiondepthwise.h\"\n\nnamespace ncnn {\n\nclass DeconvolutionDepthWise_loongarch : public DeconvolutionDepthWise\n{\npublic:\n    DeconvolutionDepthWise_loongarch();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n    int create_group_ops(const Option& opt);\n\npublic:\n    std::vector<ncnn::Layer*> group_ops;\n\n    Mat weight_data_tm;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DECONVOLUTIONDEPTHWISE_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/dequantize_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"dequantize_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nDequantize_loongarch::Dequantize_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nstatic void dequantize(const int* intptr, float* ptr, const Mat& scale_data, const Mat& bias_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int bias_data_size = bias_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"dequantize %d %d   %d %d\", scale_data_size, bias_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n#if __loongarch_sx\n    __m128 _scale0 = (__m128)__lsx_vreplfr2vr_s(scale);\n    __m128 _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        if (elempack == 4)\n        {\n            _scale0 = (__m128)__lsx_vld((const float*)scale_data, 0);\n            _scale1 = _scale0;\n        }\n        if (elempack == 8)\n        {\n            _scale0 = (__m128)__lsx_vld((const float*)scale_data, 0);\n            _scale1 = (__m128)__lsx_vld((const float*)scale_data + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmul_s(_v0, _scale0);\n            _v1 = __lsx_vfmul_s(_v1, _scale1);\n            __lsx_vst(_v0, ptr, 0);\n            __lsx_vst(_v1, ptr + 4, 0);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 
0));\n            _v = __lsx_vfmul_s(_v, _scale0);\n            __lsx_vst(_v, ptr, 0);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = *intptr * scale;\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __loongarch_sx\n        __m128 _bias0 = (__m128)__lsx_vreplfr2vr_s(bias);\n        __m128 _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 4)\n            {\n                _bias0 = (__m128)__lsx_vld((const float*)bias_data, 0);\n                _bias1 = _bias0;\n            }\n            if (elempack == 8)\n            {\n                _bias0 = (__m128)__lsx_vld((const float*)bias_data, 0);\n                _bias1 = (__m128)__lsx_vld((const float*)bias_data + 4, 0);\n            }\n        }\n#endif // __loongarch_sx\n\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmadd_s(_scale0, _v0, _bias0);\n            _v1 = __lsx_vfmadd_s(_scale1, _v1, _bias1);\n            __lsx_vst(_v0, ptr, 0);\n            __lsx_vst(_v1, ptr + 4, 0);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            _v = __lsx_vfmadd_s(_scale0, _v, _bias0);\n            __lsx_vst(_v, ptr, 0);\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = *intptr * scale + bias;\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nint Dequantize_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& 
opt) const\n{\n    // assert bottom_blob.elembits() == 32\n\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 1)\n    {\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const int* intptr = (const int*)bottom_blob + i * elempack;\n            float* ptr = (float*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n            // assert bias_data_size == 0 || bias_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            dequantize(intptr, ptr, scale_data, bias_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            const int* intptr = bottom_blob.row<const int>(i);\n            float* ptr = top_blob.row(i);\n\n            const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n            const Mat bias_data_i = bias_data_size > 1 ? bias_data.range(i * elempack, elempack) : bias_data;\n\n            dequantize(intptr, ptr, scale_data_i, bias_data_i, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int* intptr = bottom_blob.channel(q);\n            float* ptr = top_blob.channel(q);\n\n            const Mat scale_data_q = scale_data_size > 1 ? 
scale_data.range(q * elempack, elempack) : scale_data;\n            const Mat bias_data_q = bias_data_size > 1 ? bias_data.range(q * elempack, elempack) : bias_data;\n\n            dequantize(intptr, ptr, scale_data_q, bias_data_q, w * h, elempack);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/dequantize_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DEQUANTIZE_LOONGARCH_H\n#define LAYER_DEQUANTIZE_LOONGARCH_H\n\n#include \"dequantize.h\"\n\nnamespace ncnn {\n\nclass Dequantize_loongarch : public Dequantize\n{\npublic:\n    Dequantize_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DEQUANTIZE_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/dropout_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"dropout_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nDropout_loongarch::Dropout_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Dropout_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    if (scale == 1.f)\n    {\n        return 0;\n    }\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        __m128 _scale = (__m128)__lsx_vreplfr2vr_s(scale);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            _p = __lsx_vfmul_s(_p, _scale);\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = *ptr * scale;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/dropout_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_DROPOUT_LOONGARCH_H\n#define LAYER_DROPOUT_LOONGARCH_H\n\n#include \"dropout.h\"\n\nnamespace ncnn {\n\nclass Dropout_loongarch : public Dropout\n{\npublic:\n    Dropout_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_DROPOUT_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/eltwise_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"eltwise_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nEltwise_loongarch::Eltwise_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Eltwise_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d * elempack;\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create_like(bottom_blob, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (op_type == Operation_PROD)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            const float* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __loongarch_sx\n            for (; i + 3 < size; i += 4)\n            {\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                __m128 _p1 = (__m128)__lsx_vld(ptr1, 0);\n                _p = __lsx_vfmul_s(_p, _p1);\n                __lsx_vst(_p, outptr, 0);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                *outptr = *ptr * *ptr1;\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        for (size_t b = 2; b 
< bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __loongarch_sx\n                for (; i + 3 < size; i += 4)\n                {\n                    __m128 _p = (__m128)__lsx_vld(outptr, 0);\n                    __m128 _p1 = (__m128)__lsx_vld(ptr, 0);\n                    _p = __lsx_vfmul_s(_p, _p1);\n                    __lsx_vst(_p, outptr, 0);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif // __loongarch_sx\n                for (; i < size; i++)\n                {\n                    *outptr *= *ptr;\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n    }\n    if (op_type == Operation_SUM)\n    {\n        if (coeffs.w == 0)\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __loongarch_sx\n                for (; i + 3 < size; i += 4)\n                {\n                    __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                    __m128 _p1 = (__m128)__lsx_vld(ptr1, 0);\n                    _p = __lsx_vfadd_s(_p, _p1);\n                    __lsx_vst(_p, outptr, 0);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __loongarch_sx\n                for (; i < size; i++)\n               
 {\n                    *outptr = *ptr + *ptr1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            for (size_t b = 2; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    int i = 0;\n#if __loongarch_sx\n                    for (; i + 3 < size; i += 4)\n                    {\n                        __m128 _p = (__m128)__lsx_vld(outptr, 0);\n                        __m128 _p1 = (__m128)__lsx_vld(ptr, 0);\n                        _p = __lsx_vfadd_s(_p, _p1);\n                        __lsx_vst(_p, outptr, 0);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n#endif // __loongarch_sx\n                    for (; i < size; i++)\n                    {\n                        *outptr += *ptr;\n\n                        ptr++;\n                        outptr++;\n                    }\n                }\n            }\n        }\n        else\n        {\n            // first blob\n            const Mat& bottom_blob1 = bottom_blobs[1];\n            float coeff0 = coeffs[0];\n            float coeff1 = coeffs[1];\n#if __loongarch_sx\n            __m128 _coeff0 = (__m128)__lsx_vreplfr2vr_s(coeff0);\n            __m128 _coeff1 = (__m128)__lsx_vreplfr2vr_s(coeff1);\n#endif // __loongarch_sx\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                const float* ptr1 = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                
int i = 0;\n#if __loongarch_sx\n                for (; i + 3 < size; i += 4)\n                {\n                    __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                    __m128 _p1 = (__m128)__lsx_vld(ptr1, 0);\n                    _p = __lsx_vfmul_s(_p, _coeff0);\n                    _p = __lsx_vfmadd_s(_coeff1, _p1, _p);\n                    __lsx_vst(_p, outptr, 0);\n\n                    ptr += 4;\n                    ptr1 += 4;\n                    outptr += 4;\n                }\n#endif // __loongarch_sx\n                for (; i < size; i++)\n                {\n                    *outptr = *ptr * coeff0 + *ptr1 * coeff1;\n\n                    ptr++;\n                    ptr1++;\n                    outptr++;\n                }\n            }\n\n            for (size_t b = 2; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob1 = bottom_blobs[b];\n                float coeff = coeffs[b];\n#if __loongarch_sx\n                __m128 _coeff = (__m128)__lsx_vreplfr2vr_s(coeff);\n#endif // __loongarch_sx\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob1.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    int i = 0;\n#if __loongarch_sx\n                    for (; i + 3 < size; i += 4)\n                    {\n                        __m128 _p = (__m128)__lsx_vld(outptr, 0);\n                        __m128 _p1 = (__m128)__lsx_vld(ptr, 0);\n                        _p = __lsx_vfmadd_s(_coeff, _p1, _p);\n                        __lsx_vst(_p, outptr, 0);\n\n                        ptr += 4;\n                        outptr += 4;\n                    }\n#endif // __loongarch_sx\n                    for (; i < size; i++)\n                    {\n                        *outptr += *ptr * coeff;\n\n                        ptr++;\n                        
outptr++;\n                    }\n                }\n            }\n        }\n    }\n    if (op_type == Operation_MAX)\n    {\n        // first blob\n        const Mat& bottom_blob1 = bottom_blobs[1];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            const float* ptr1 = bottom_blob1.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __loongarch_sx\n            for (; i + 3 < size; i += 4)\n            {\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                __m128 _p1 = (__m128)__lsx_vld(ptr1, 0);\n                _p = __lsx_vfmax_s(_p, _p1);\n                __lsx_vst(_p, outptr, 0);\n\n                ptr += 4;\n                ptr1 += 4;\n                outptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                *outptr = std::max(*ptr, *ptr1);\n\n                ptr++;\n                ptr1++;\n                outptr++;\n            }\n        }\n\n        for (size_t b = 2; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob1 = bottom_blobs[b];\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob1.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __loongarch_sx\n                for (; i + 3 < size; i += 4)\n                {\n                    __m128 _p = (__m128)__lsx_vld(outptr, 0);\n                    __m128 _p1 = (__m128)__lsx_vld(ptr, 0);\n                    _p = __lsx_vfmax_s(_p, _p1);\n                    __lsx_vst(_p, outptr, 0);\n\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif // __loongarch_sx\n                for (; i < size; i++)\n                
{\n                    *outptr = std::max(*ptr, *outptr);\n\n                    ptr++;\n                    outptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/eltwise_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ELTWISE_LOONGARCH_H\n#define LAYER_ELTWISE_LOONGARCH_H\n\n#include \"eltwise.h\"\n\nnamespace ncnn {\n\nclass Eltwise_loongarch : public Eltwise\n{\npublic:\n    Eltwise_loongarch();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ELTWISE_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/flatten_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"flatten_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nFlatten_loongarch::Flatten_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Flatten_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n    if (elembits == 8)\n        return forward_int8(bottom_blob, top_blob, opt);\n\n    int dims = bottom_blob.dims;\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d;\n\n    int total = size * channels * elempack;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = total % 4 == 0 ? 
4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    if (out_elempack == 1)\n    {\n        return Flatten::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (dims == 2 && elempack == 1) // out_elempack == 4\n    {\n        top_blob = bottom_blob;\n        top_blob.dims = 1;\n        top_blob.w = total / out_elempack;\n        top_blob.h = 1;\n        top_blob.cstep = bottom_blob.cstep / out_elempack;\n        top_blob.elemsize = out_elemsize;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    top_blob.create(total / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 2)\n    {\n#if __loongarch_sx\n        if (elempack == 4) // out_elempack == 4\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* ptr = bottom_blob.row(i);\n                float* outptr0 = (float*)top_blob + w * i * 4;\n                float* outptr1 = (float*)top_blob + w * (i * 4 + 1);\n                float* outptr2 = (float*)top_blob + w * (i * 4 + 2);\n                float* outptr3 = (float*)top_blob + w * (i * 4 + 3);\n\n                int j = 0;\n                for (; j + 3 < w; j += 4)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = __lsx_vld(ptr, 0);\n                    __m128i _r1 = __lsx_vld(ptr + 4, 0);\n                    __m128i _r2 = __lsx_vld(ptr + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(ptr + 4 * 3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = 
__lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, outptr0, 0);\n                    __lsx_vst(_r0123_1, outptr1, 0);\n                    __lsx_vst(_r0123_2, outptr2, 0);\n                    __lsx_vst(_r0123_3, outptr3, 0);\n\n                    ptr += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n                for (; j < w; j++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n\n                    ptr += 4;\n                }\n            }\n        }\n#endif // __loongarch_sx\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n#if __loongarch_sx\n        if (elempack == 4) // out_elempack == 4\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                float* outptr0 = (float*)top_blob + size * q * 4;\n                float* outptr1 = (float*)top_blob + size * (q * 4 + 1);\n                float* outptr2 = (float*)top_blob + size * (q * 4 + 2);\n                float* outptr3 = (float*)top_blob + size * (q * 4 + 3);\n\n                int i = 0;\n                for (; i + 3 < size; i += 4)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = __lsx_vld(ptr, 0);\n                    __m128i _r1 = __lsx_vld(ptr + 4, 0);\n                    __m128i _r2 = __lsx_vld(ptr + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(ptr + 4 * 3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, 
_r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, outptr0, 0);\n                    __lsx_vst(_r0123_1, outptr1, 0);\n                    __lsx_vst(_r0123_2, outptr2, 0);\n                    __lsx_vst(_r0123_3, outptr3, 0);\n\n                    ptr += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n                for (; i < size; i++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n\n                    ptr += 4;\n                }\n            }\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 1) // out_elempack == 4\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                float* outptr = (float*)top_blob + size * q;\n\n                int i = 0;\n#if __loongarch_sx\n                for (; i + 3 < size; i += 4)\n                {\n                    __lsx_vst(__lsx_vld(ptr, 0), outptr, 0);\n                    ptr += 4;\n                    outptr += 4;\n                }\n#endif // __loongarch_sx\n                for (; i < size; i++)\n                {\n                    *outptr++ = *ptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Flatten_loongarch::forward_int8(const Mat& bottom_blob, Mat& top_blob, 
const Option& opt) const\n{\n    int dims = bottom_blob.dims;\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    int size = w * h * d;\n\n    int total = size * channels * elempack;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = total % 8 == 0 ? 8 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    if (out_elempack == 1)\n    {\n        return Flatten::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (dims == 2 && elempack == 1) // out_elempack == 8\n    {\n        top_blob = bottom_blob;\n        top_blob.dims = 1;\n        top_blob.w = total / out_elempack;\n        top_blob.h = 1;\n        top_blob.cstep = bottom_blob.cstep / out_elempack;\n        top_blob.elemsize = out_elemsize;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    top_blob.create(total / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (dims == 2)\n    {\n#if __loongarch_sx\n        if (elempack == 8) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const signed char* ptr = bottom_blob.row<signed char>(i);\n                signed char* outptr0 = (signed char*)top_blob + w * i * 8;\n                signed char* outptr1 = (signed char*)top_blob + w * (i * 8 + 1);\n                signed char* outptr2 = (signed char*)top_blob + w * (i * 8 + 2);\n                signed char* outptr3 = (signed char*)top_blob + w * (i * 8 + 3);\n                signed char* outptr4 = (signed char*)top_blob + w * (i * 8 + 4);\n                signed 
char* outptr5 = (signed char*)top_blob + w * (i * 8 + 5);\n                signed char* outptr6 = (signed char*)top_blob + w * (i * 8 + 6);\n                signed char* outptr7 = (signed char*)top_blob + w * (i * 8 + 7);\n\n                int j = 0;\n                for (; j < w; j++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = ptr[3];\n                    *outptr4++ = ptr[4];\n                    *outptr5++ = ptr[5];\n                    *outptr6++ = ptr[6];\n                    *outptr7++ = ptr[7];\n\n                    ptr += 8;\n                }\n            }\n        }\n#endif // __loongarch_sx\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n#if __loongarch_sx\n        if (elempack == 8) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const signed char* ptr = bottom_blob.channel(q);\n                signed char* outptr0 = (signed char*)top_blob + size * q * 8;\n                signed char* outptr1 = (signed char*)top_blob + size * (q * 8 + 1);\n                signed char* outptr2 = (signed char*)top_blob + size * (q * 8 + 2);\n                signed char* outptr3 = (signed char*)top_blob + size * (q * 8 + 3);\n                signed char* outptr4 = (signed char*)top_blob + size * (q * 8 + 4);\n                signed char* outptr5 = (signed char*)top_blob + size * (q * 8 + 5);\n                signed char* outptr6 = (signed char*)top_blob + size * (q * 8 + 6);\n                signed char* outptr7 = (signed char*)top_blob + size * (q * 8 + 7);\n\n                int i = 0;\n                for (; i < size; i++)\n                {\n                    *outptr0++ = ptr[0];\n                    *outptr1++ = ptr[1];\n                    *outptr2++ = ptr[2];\n                    *outptr3++ = 
ptr[3];\n                    *outptr4++ = ptr[4];\n                    *outptr5++ = ptr[5];\n                    *outptr6++ = ptr[6];\n                    *outptr7++ = ptr[7];\n\n                    ptr += 8;\n                }\n            }\n        }\n#endif // __loongarch_sx\n\n        if (elempack == 1) // out_elempack == 8\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const signed char* ptr = bottom_blob.channel(q);\n                signed char* outptr = (signed char*)top_blob + size * q;\n\n                // channel(q) is addressed via cstep and may be padded; the flattened output is contiguous,\n                // so copy exactly size bytes of each channel back to back\n                int i = 0;\n                for (; i < size; i++)\n                {\n                    *outptr++ = *ptr++;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/flatten_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_FLATTEN_LOONGARCH_H\n#define LAYER_FLATTEN_LOONGARCH_H\n\n#include \"flatten.h\"\n\nnamespace ncnn {\n\nclass Flatten_loongarch : public Flatten\n{\npublic:\n    Flatten_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_FLATTEN_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/hardsigmoid_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardsigmoid_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nHardSigmoid_loongarch::HardSigmoid_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint HardSigmoid_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n        __m128 _one = (__m128)__lsx_vreplfr2vr_s(1.f);\n        __m128 _alpha = (__m128)__lsx_vreplfr2vr_s(alpha);\n        __m128 _beta = (__m128)__lsx_vreplfr2vr_s(beta);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            _p = __lsx_vfmadd_s(_alpha, _p, _beta);\n            _p = __lsx_vfmax_s(_p, _zero);\n            _p = __lsx_vfmin_s(_p, _one);\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            if (*ptr < lower)\n                *ptr = 0.f;\n            else if (*ptr > upper)\n                *ptr = 1.f;\n            else\n                *ptr = *ptr * alpha + beta;\n            ++ptr;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/hardsigmoid_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_HARDSIGMOID_LOONGARCH_H\n#define LAYER_HARDSIGMOID_LOONGARCH_H\n\n#include \"hardsigmoid.h\"\n\nnamespace ncnn {\n\nclass HardSigmoid_loongarch : public HardSigmoid\n{\npublic:\n    HardSigmoid_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_HARDSIGMOID_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/hardswish_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"hardswish_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nHardSwish_loongarch::HardSwish_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint HardSwish_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n        __m128 _one = (__m128)__lsx_vreplfr2vr_s(1.f);\n        __m128 _alpha = (__m128)__lsx_vreplfr2vr_s(alpha);\n        __m128 _beta = (__m128)__lsx_vreplfr2vr_s(beta);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __m128 _outp = __lsx_vfmadd_s(_alpha, _p, _beta);\n            _outp = __lsx_vfmax_s(_outp, _zero);\n            _outp = __lsx_vfmin_s(_outp, _one);\n            _outp = __lsx_vfmul_s(_outp, _p);\n            __lsx_vst(_outp, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            if (*ptr < lower)\n                *ptr = 0.f;\n            else if (*ptr > upper)\n                ;\n            else\n                *ptr = *ptr * (*ptr * alpha + beta);\n            ++ptr;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/hardswish_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_HARDSWISH_LOONGARCH_H\n#define LAYER_HARDSWISH_LOONGARCH_H\n\n#include \"hardswish.h\"\n\nnamespace ncnn {\n\nclass HardSwish_loongarch : public HardSwish\n{\npublic:\n    HardSwish_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_HARDSWISH_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/innerproduct_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"innerproduct_loongarch.h\"\n\n#include \"layer_type.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\n#include \"loongarch_activation.h\"\n\nnamespace ncnn {\n\nInnerProduct_loongarch::InnerProduct_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n\n    flatten = 0;\n}\n\nint InnerProduct_loongarch::create_pipeline(const Option& opt)\n{\n    {\n        flatten = ncnn::create_layer_cpu(ncnn::LayerType::Flatten);\n\n        ncnn::ParamDict pd;\n\n        flatten->load_param(pd);\n\n        flatten->create_pipeline(opt);\n    }\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return create_pipeline_int8_loongarch(opt);\n    }\n#endif\n\n#if __loongarch_sx\n    if (opt.use_fp16_storage)\n    {\n        return create_pipeline_fp16s(opt);\n    }\n#endif\n\n    const int num_input = weight_data_size / num_output;\n\n    int out_elempack = 1;\n\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 
4 : 1;\n    }\n#endif // __loongarch_sx\n\n    if (out_elempack == 4)\n    {\n        // src = inch-outch\n        // dst = 4-inch-outch/4\n        {\n            Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n            weight_data_tm.create(num_input, num_output / 4, (size_t)4u * 4, 4);\n\n            for (int q = 0; q + 3 < num_output; q += 4)\n            {\n                float* g0 = weight_data_tm.row(q / 4);\n\n                for (int p = 0; p < num_input; p++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        *g0++ = weight_data_r2.row(q + j)[p];\n                    }\n                }\n            }\n        }\n    }\n    else\n    {\n        weight_data_tm = weight_data;\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint InnerProduct_loongarch::destroy_pipeline(const Option& opt)\n{\n    if (flatten)\n    {\n        flatten->destroy_pipeline(opt);\n        delete flatten;\n        flatten = 0;\n    }\n\n    return 0;\n}\n\nint InnerProduct_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && int8_scale_term)\n    {\n        return forward_int8_loongarch(bottom_blob, top_blob, opt);\n    }\n#endif\n\n#if __loongarch_sx\n    if (opt.use_fp16_storage)\n    {\n        return forward_fp16s(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    const int num_input = weight_data_size / num_output;\n\n    if (bottom_blob.dims == 2 && bottom_blob.w == num_input)\n    {\n        // gemm\n        int h = bottom_blob.h;\n        size_t elemsize = bottom_blob.elemsize;\n        int elempack = bottom_blob.elempack;\n\n        top_blob.create(num_output, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int num_output_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n  
          num_output_elempack = num_output % 4 == 0 ? 4 : 1;\n        }\n#endif\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n#if __loongarch_sx\n            if (elempack == 4 && num_output_elempack == 4)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const float* kptr = weight_data_tm.row(p);\n                    const float* m = bottom_blob.row(j);\n\n                    __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum0 = __lsx_vreplfr2vr_s(bias_data[p * 4 + 0]);\n                        _sum1 = __lsx_vreplfr2vr_s(bias_data[p * 4 + 1]);\n                        _sum2 = __lsx_vreplfr2vr_s(bias_data[p * 4 + 2]);\n                        _sum3 = __lsx_vreplfr2vr_s(bias_data[p * 4 + 3]);\n                    }\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 16);\n                        __m128 _val = (__m128)__lsx_vld(m, 0);\n                        __m128i _w = __lsx_vld(kptr, 0);\n                        _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 0), _val, _sum0);\n                        _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 1), _val, _sum1);\n                        _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 2), _val, _sum2);\n                        _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 3), _val, _sum3);\n\n                        m += 4;\n                        kptr += 4;\n 
                   }\n\n                    _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps(_sum3, activation_type, activation_params);\n\n                    __lsx_vst(_sum0, outptr, 0);\n                    __lsx_vst(_sum1, outptr + 4, 0);\n                    __lsx_vst(_sum2, outptr + 8, 0);\n                    __lsx_vst(_sum3, outptr + 12, 0);\n                    outptr += 16;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 4)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const float* kptr = weight_data_tm.row(p);\n                    const float* m = bottom_blob.row(j);\n\n                    __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum0 = (__m128)__lsx_vld((const float*)bias_data + p * 4, 0);\n                    }\n\n                    int i = 0;\n                    for (; i + 3 < num_input; i += 4)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 64);\n                        __m128i _val = __lsx_vld(m, 0);\n                        __m128 _w0 = (__m128)__lsx_vld(kptr, 0);\n                        __m128 _w1 = (__m128)__lsx_vld(kptr + 4, 0);\n                        __m128 _w2 = (__m128)__lsx_vld(kptr + 8, 0);\n                        __m128 _w3 = (__m128)__lsx_vld(kptr + 12, 0);\n 
                       _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val, 0), _sum0);\n                        _sum1 = __lsx_vfmadd_s(_w1, (__m128)__lsx_vreplvei_w(_val, 1), _sum1);\n                        _sum2 = __lsx_vfmadd_s(_w2, (__m128)__lsx_vreplvei_w(_val, 2), _sum2);\n                        _sum3 = __lsx_vfmadd_s(_w3, (__m128)__lsx_vreplvei_w(_val, 3), _sum3);\n\n                        m += 4;\n                        kptr += 16;\n                    }\n                    for (; i < num_input; i++)\n                    {\n                        __m128 _val = __lsx_vreplfr2vr_s(m[0]);\n                        __m128 _w = (__m128)__lsx_vld(kptr, 0);\n                        _sum0 = __lsx_vfmadd_s(_w, _val, _sum0);\n\n                        m += 1;\n                        kptr += 4;\n                    }\n\n                    _sum0 = __lsx_vfadd_s(_sum0, _sum1);\n                    _sum2 = __lsx_vfadd_s(_sum2, _sum3);\n                    _sum0 = __lsx_vfadd_s(_sum0, _sum2);\n\n                    _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n                    __lsx_vst(_sum0, outptr, 0);\n                    outptr += 4;\n                }\n            }\n\n            if (elempack == 4 && num_output_elempack == 1)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const float* kptr = (const float*)weight_data_tm + num_input * p;\n                    const float* m = bottom_blob.row(j);\n\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum = __lsx_vreplfr2vr_s(bias_data[p]);\n                    }\n\n                    for (int i = 0; i < num_input; i++)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 4);\n                     
   __m128 _val = (__m128)__lsx_vld(m, 0);\n                        __m128 _k = __lsx_vreplfr2vr_s(kptr[0]);\n                        _sum = __lsx_vfmadd_s(_k, _val, _sum);\n\n                        m += 4;\n                        kptr += 1;\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    __lsx_vst(_sum, outptr, 0);\n                    outptr += 4;\n                }\n            }\n#endif // __loongarch_sx\n\n            if (elempack == 1 && num_output_elempack == 1)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const float* kptr = (const float*)weight_data_tm + num_input * p;\n                    const float* m = bottom_blob.row(j);\n\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    int i = 0;\n#if __loongarch_sx\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n                    for (; i + 3 < num_input; i += 4)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 16);\n                        __m128 _m = (__m128)__lsx_vld(m, 0);\n                        __m128 _w = (__m128)__lsx_vld(kptr, 0);\n                        _sum = __lsx_vfmadd_s(_w, _m, _sum);\n\n                        m += 4;\n                        kptr += 4;\n                    }\n                    sum += __lsx_reduce_fadd_s(_sum);\n#endif // __loongarch_sx\n                    for (; i < num_input; i++)\n                    {\n                        sum += *m * *kptr;\n\n                        m += 1;\n                        kptr += 1;\n                    }\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n                  
  outptr[0] = sum;\n                    outptr += 1;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // flatten\n    Mat bottom_blob_flattened = bottom_blob;\n    if (bottom_blob.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n\n        flatten->forward(bottom_blob, bottom_blob_flattened, opt_flatten);\n    }\n\n    size_t elemsize = bottom_blob_flattened.elemsize;\n    int elempack = bottom_blob_flattened.elempack;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __loongarch_sx\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __loongarch_sx\n    if (out_elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n            if (bias_term)\n            {\n                _sum0 = (__m128)__lsx_vld((const float*)bias_data + p * 4, 0);\n            }\n\n            const float* kptr = weight_data_tm.row(p);\n\n            const float* sptr = bottom_blob_flattened;\n\n            int i = 0;\n            for (; i + 3 < num_input; i += 4)\n            {\n                __builtin_prefetch(sptr + 16);\n                __builtin_prefetch(kptr + 64);\n                __m128i _val = __lsx_vld(sptr, 0);\n                __m128 _w0 = (__m128)__lsx_vld(kptr, 0);\n                __m128 _w1 = (__m128)__lsx_vld(kptr + 4, 0);\n                __m128 _w2 = (__m128)__lsx_vld(kptr 
+ 8, 0);\n                __m128 _w3 = (__m128)__lsx_vld(kptr + 12, 0);\n                _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val, 0), _sum0);\n                _sum1 = __lsx_vfmadd_s(_w1, (__m128)__lsx_vreplvei_w(_val, 1), _sum1);\n                _sum2 = __lsx_vfmadd_s(_w2, (__m128)__lsx_vreplvei_w(_val, 2), _sum2);\n                _sum3 = __lsx_vfmadd_s(_w3, (__m128)__lsx_vreplvei_w(_val, 3), _sum3);\n\n                sptr += 4;\n                kptr += 16;\n            }\n            for (; i < num_input; i++)\n            {\n                __m128 _val = __lsx_vreplfr2vr_s(sptr[0]);\n                __m128 _w = (__m128)__lsx_vld(kptr, 0);\n                _sum0 = __lsx_vfmadd_s(_w, _val, _sum0);\n\n                sptr += 1;\n                kptr += 4;\n            }\n\n            _sum0 = __lsx_vfadd_s(_sum0, _sum1);\n            _sum2 = __lsx_vfadd_s(_sum2, _sum3);\n            _sum0 = __lsx_vfadd_s(_sum0, _sum2);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n            float* outptr = top_blob;\n            __lsx_vst(_sum0, outptr + p * 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (out_elempack == 1)\n    {\n        int nn_num_output = num_output / 4;\n        int remain_num_output_start = nn_num_output * 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int pp = 0; pp < nn_num_output; pp++)\n        {\n            int p = pp * 4;\n\n            float sum0 = 0.f;\n            float sum1 = 0.f;\n            float sum2 = 0.f;\n            float sum3 = 0.f;\n\n            if (bias_term)\n            {\n                sum0 = bias_data[p];\n                sum1 = bias_data[p + 1];\n                sum2 = bias_data[p + 2];\n                sum3 = bias_data[p + 3];\n            }\n\n            const float* w0 = (const float*)weight_data_tm + num_input * p;\n            const float* w1 = (const float*)weight_data_tm + num_input * (p + 1);\n            const 
float* w2 = (const float*)weight_data_tm + num_input * (p + 2);\n            const float* w3 = (const float*)weight_data_tm + num_input * (p + 3);\n\n            const float* m = bottom_blob_flattened;\n\n            int i = 0;\n#if __loongarch_sx\n            __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n            for (; i + 3 < num_input; i += 4)\n            {\n                __builtin_prefetch(m + 16);\n                __builtin_prefetch(w0 + 16);\n                __builtin_prefetch(w1 + 16);\n                __builtin_prefetch(w2 + 16);\n                __builtin_prefetch(w3 + 16);\n                __m128 _m = (__m128)__lsx_vld(m, 0);\n                __m128 _w0 = (__m128)__lsx_vld(w0, 0);\n                __m128 _w1 = (__m128)__lsx_vld(w1, 0);\n                __m128 _w2 = (__m128)__lsx_vld(w2, 0);\n                __m128 _w3 = (__m128)__lsx_vld(w3, 0);\n                _sum0 = __lsx_vfmadd_s(_w0, _m, _sum0);\n                _sum1 = __lsx_vfmadd_s(_w1, _m, _sum1);\n                _sum2 = __lsx_vfmadd_s(_w2, _m, _sum2);\n                _sum3 = __lsx_vfmadd_s(_w3, _m, _sum3);\n\n                m += 4;\n                w0 += 4;\n                w1 += 4;\n                w2 += 4;\n                w3 += 4;\n            }\n#endif // __loongarch_sx\n            for (; i < num_input; i++)\n            {\n                sum0 += *m * *w0;\n                sum1 += *m * *w1;\n                sum2 += *m * *w2;\n                sum3 += *m * *w3;\n\n                m++;\n                w0++;\n                w1++;\n                w2++;\n                w3++;\n            }\n\n#if __loongarch_sx\n            sum0 += __lsx_reduce_fadd_s(_sum0);\n            sum1 += __lsx_reduce_fadd_s(_sum1);\n            sum2 += __lsx_reduce_fadd_s(_sum2);\n            sum3 += 
__lsx_reduce_fadd_s(_sum3);\n#endif // __loongarch_sx\n\n            sum0 = activation_ss(sum0, activation_type, activation_params);\n            sum1 = activation_ss(sum1, activation_type, activation_params);\n            sum2 = activation_ss(sum2, activation_type, activation_params);\n            sum3 = activation_ss(sum3, activation_type, activation_params);\n\n            top_blob[p] = sum0;\n            top_blob[p + 1] = sum1;\n            top_blob[p + 2] = sum2;\n            top_blob[p + 3] = sum3;\n        }\n\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = remain_num_output_start; p < num_output; p++)\n        {\n            float sum = 0.f;\n\n            if (bias_term)\n                sum = bias_data[p];\n\n            const float* w = (const float*)weight_data_tm + num_input * p;\n\n            const float* m = bottom_blob_flattened;\n\n            int i = 0;\n#if __loongarch_sx\n            __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n            for (; i + 3 < num_input; i += 4)\n            {\n                __builtin_prefetch(m + 16);\n                __builtin_prefetch(w + 16);\n                __m128 _m = (__m128)__lsx_vld(m, 0);\n                __m128 _w = (__m128)__lsx_vld(w, 0);\n                _sum0 = __lsx_vfmadd_s(_w, _m, _sum0);\n\n                m += 4;\n                w += 4;\n            }\n            sum += __lsx_reduce_fadd_s(_sum0);\n#endif // __loongarch_sx\n            for (; i < num_input; i++)\n            {\n                sum += *m * *w;\n\n                m++;\n                w++;\n            }\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            top_blob[p] = sum;\n        }\n    }\n\n    return 0;\n}\n\n#if __loongarch_sx\nint InnerProduct_loongarch::create_pipeline_fp16s(const Option& opt)\n{\n    const int num_input = weight_data_size / num_output;\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    
{\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n\n    // src = inch-outch\n    // dst = pb-inch-outch/pb\n    if (out_elempack == 4)\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n        weight_data_tm.create(num_input, num_output / 4, (size_t)8u, 4);\n\n        for (int q = 0; q + 3 < num_output; q += 4)\n        {\n            unsigned short* g0 = weight_data_tm.row<unsigned short>(q / 4);\n\n            const float* k0 = weight_data_r2.row(q);\n            const float* k1 = weight_data_r2.row(q + 1);\n            const float* k2 = weight_data_r2.row(q + 2);\n            const float* k3 = weight_data_r2.row(q + 3);\n\n            int p = 0;\n            for (; p + 3 < num_input; p += 4)\n            {\n                // transpose 4x4\n                __m128i _r0 = __lsx_vld(k0, 0);\n                __m128i _r1 = __lsx_vld(k1, 0);\n                __m128i _r2 = __lsx_vld(k2, 0);\n                __m128i _r3 = __lsx_vld(k3, 0);\n\n                __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                __m128i _p0 = __lsx_vfcvt_h_s((__m128)_r0123_1, (__m128)_r0123_0);\n                __m128i _p1 = __lsx_vfcvt_h_s((__m128)_r0123_3, (__m128)_r0123_2);\n\n                __lsx_vst(_p0, g0, 0);\n                __lsx_vst(_p1, g0 + 8, 0);\n\n                k0 += 4;\n                k1 += 4;\n                k2 += 4;\n                k3 += 4;\n                g0 += 16;\n            }\n            for (; p < num_input; p++)\n            {\n                g0[0] = 
float32_to_float16(*k0++);\n                g0[1] = float32_to_float16(*k1++);\n                g0[2] = float32_to_float16(*k2++);\n                g0[3] = float32_to_float16(*k3++);\n                g0 += 4;\n            }\n        }\n    }\n\n    if (out_elempack == 1)\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n        ncnn::cast_float32_to_float16(weight_data_r2, weight_data_tm, opt);\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint InnerProduct_loongarch::forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int num_input = weight_data_size / num_output;\n\n    if (bottom_blob.dims == 2 && bottom_blob.w == num_input)\n    {\n        // gemm\n        int h = bottom_blob.h;\n        size_t elemsize = bottom_blob.elemsize;\n        int elempack = bottom_blob.elempack;\n\n        top_blob.create(num_output, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int num_output_elempack = 1;\n        if (opt.use_packing_layout)\n        {\n            num_output_elempack = num_output % 4 == 0 ? 
4 : 1;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n            if (elempack == 4 && num_output_elempack == 4)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const unsigned short* kptr = weight_data_tm.row<const unsigned short>(p);\n                    const float* m = bottom_blob.row(j);\n\n                    __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum0 = (__m128)__lsx_vreplfr2vr_s(bias_data[p * 4 + 0]);\n                        _sum1 = (__m128)__lsx_vreplfr2vr_s(bias_data[p * 4 + 1]);\n                        _sum2 = (__m128)__lsx_vreplfr2vr_s(bias_data[p * 4 + 2]);\n                        _sum3 = (__m128)__lsx_vreplfr2vr_s(bias_data[p * 4 + 3]);\n                    }\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 16);\n                        __m128 _val = (__m128)__lsx_vld(m, 0);\n                        __m128i _w = (__m128i)__lsx_vfcvtl_s_h(__lsx_vld(kptr, 0));\n                        _sum0 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 0), _val, _sum0);\n                        _sum1 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 1), _val, _sum1);\n                        _sum2 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 2), _val, _sum2);\n                        _sum3 = __lsx_vfmadd_s((__m128)__lsx_vreplvei_w(_w, 3), _val, _sum3);\n\n                        m += 4;\n                        kptr 
+= 4;\n                    }\n\n                    _sum0 = activation_ps(_sum0, activation_type, activation_params);\n                    _sum1 = activation_ps(_sum1, activation_type, activation_params);\n                    _sum2 = activation_ps(_sum2, activation_type, activation_params);\n                    _sum3 = activation_ps(_sum3, activation_type, activation_params);\n\n                    __lsx_vst(_sum0, outptr, 0);\n                    __lsx_vst(_sum1, outptr + 4, 0);\n                    __lsx_vst(_sum2, outptr + 8, 0);\n                    __lsx_vst(_sum3, outptr + 12, 0);\n                    outptr += 16;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 4)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const unsigned short* kptr = weight_data_tm.row<const unsigned short>(p);\n                    const float* m = bottom_blob.row(j);\n\n                    __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n                    __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum0 = (__m128)__lsx_vld((const float*)bias_data + p * 4, 0);\n                    }\n\n                    int i = 0;\n                    for (; i + 3 < num_input; i += 4)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 64);\n                        __m128i _val = __lsx_vld(m, 0);\n                        __m128i _w01 = __lsx_vld(kptr, 0);\n                        __m128i _w23 = __lsx_vld(kptr + 8, 0);\n                        __m128 _w0 = __lsx_vfcvtl_s_h(_w01);\n                        __m128 _w1 = 
__lsx_vfcvth_s_h(_w01);\n                        __m128 _w2 = __lsx_vfcvtl_s_h(_w23);\n                        __m128 _w3 = __lsx_vfcvth_s_h(_w23);\n                        _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val, 0), _sum0);\n                        _sum1 = __lsx_vfmadd_s(_w1, (__m128)__lsx_vreplvei_w(_val, 1), _sum1);\n                        _sum2 = __lsx_vfmadd_s(_w2, (__m128)__lsx_vreplvei_w(_val, 2), _sum2);\n                        _sum3 = __lsx_vfmadd_s(_w3, (__m128)__lsx_vreplvei_w(_val, 3), _sum3);\n\n                        m += 4;\n                        kptr += 16;\n                    }\n                    for (; i < num_input; i++)\n                    {\n                        __m128 _val = __lsx_vreplfr2vr_s(m[0]);\n                        __m128 _w = __lsx_vfcvtl_s_h(__lsx_vld(kptr, 0));\n                        _sum0 = __lsx_vfmadd_s(_w, _val, _sum0);\n\n                        m += 1;\n                        kptr += 4;\n                    }\n\n                    _sum0 = __lsx_vfadd_s(_sum0, _sum1);\n                    _sum2 = __lsx_vfadd_s(_sum2, _sum3);\n                    _sum0 = __lsx_vfadd_s(_sum0, _sum2);\n\n                    _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n                    __lsx_vst(_sum0, outptr, 0);\n                    outptr += 4;\n                }\n            }\n\n            if (elempack == 4 && num_output_elempack == 1)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const unsigned short* kptr = weight_data_tm.row<const unsigned short>(p);\n                    const float* m = bottom_blob.row(j);\n\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum = __lsx_vreplfr2vr_s(bias_data[p]);\n                    }\n\n                    for (int i = 0; i < 
num_input; i++)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 4);\n                        __m128 _val = (__m128)__lsx_vld(m, 0);\n                        __m128 _k = __lsx_vreplfr2vr_s(float16_to_float32(kptr[0]));\n                        _sum = __lsx_vfmadd_s(_k, _val, _sum);\n\n                        m += 4;\n                        kptr += 1;\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    __lsx_vst(_sum, outptr, 0);\n                    outptr += 4;\n                }\n            }\n\n            if (elempack == 1 && num_output_elempack == 1)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const unsigned short* kptr = weight_data_tm.row<const unsigned short>(p);\n                    const float* m = bottom_blob.row(j);\n\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    int i = 0;\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n                    for (; i + 3 < num_input; i += 4)\n                    {\n                        __builtin_prefetch(m + 16);\n                        __builtin_prefetch(kptr + 16);\n                        __m128 _m = (__m128)__lsx_vld(m, 0);\n                        __m128 _w = __lsx_vfcvtl_s_h(__lsx_vld(kptr, 0));\n                        _sum = __lsx_vfmadd_s(_w, _m, _sum);\n\n                        m += 4;\n                        kptr += 4;\n                    }\n                    sum += __lsx_reduce_fadd_s(_sum);\n                    for (; i < num_input; i++)\n                    {\n                        sum += *m * float16_to_float32(*kptr);\n\n                        m += 1;\n         
               kptr += 1;\n                    }\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n                    outptr[0] = sum;\n                    outptr += 1;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    // flatten\n    Mat bottom_blob_flattened = bottom_blob;\n    if (bottom_blob.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n\n        flatten->forward(bottom_blob, bottom_blob_flattened, opt_flatten);\n    }\n\n    size_t elemsize = bottom_blob_flattened.elemsize;\n    int elempack = bottom_blob_flattened.elempack;\n\n    int out_elempack = 1;\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    if (out_elempack == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n\n            if (bias_term)\n            {\n                _sum0 = (__m128)__lsx_vld((const float*)bias_data + p * 4, 0);\n            }\n\n            const unsigned short* kptr = weight_data_tm.row<const unsigned short>(p);\n\n            const float* sptr = bottom_blob_flattened;\n\n            int i = 0;\n            for (; i + 3 < num_input; i += 4)\n            {\n                __builtin_prefetch(sptr + 16);\n                __builtin_prefetch(kptr + 64);\n                __m128i _val = __lsx_vld(sptr, 0);\n                __m128i _w01 = __lsx_vld(kptr, 
0);\n                __m128i _w23 = __lsx_vld(kptr + 8, 0);\n                __m128 _w0 = __lsx_vfcvtl_s_h(_w01);\n                __m128 _w1 = __lsx_vfcvth_s_h(_w01);\n                __m128 _w2 = __lsx_vfcvtl_s_h(_w23);\n                __m128 _w3 = __lsx_vfcvth_s_h(_w23);\n                _sum0 = __lsx_vfmadd_s(_w0, (__m128)__lsx_vreplvei_w(_val, 0), _sum0);\n                _sum1 = __lsx_vfmadd_s(_w1, (__m128)__lsx_vreplvei_w(_val, 1), _sum1);\n                _sum2 = __lsx_vfmadd_s(_w2, (__m128)__lsx_vreplvei_w(_val, 2), _sum2);\n                _sum3 = __lsx_vfmadd_s(_w3, (__m128)__lsx_vreplvei_w(_val, 3), _sum3);\n\n                sptr += 4;\n                kptr += 16;\n            }\n            for (; i < num_input; i++)\n            {\n                __m128 _val = __lsx_vreplfr2vr_s(sptr[0]);\n                __m128 _w = __lsx_vfcvtl_s_h(__lsx_vld(kptr, 0));\n                _sum0 = __lsx_vfmadd_s(_w, _val, _sum0);\n\n                sptr += 1;\n                kptr += 4;\n            }\n\n            _sum0 = __lsx_vfadd_s(_sum0, _sum1);\n            _sum2 = __lsx_vfadd_s(_sum2, _sum3);\n            _sum0 = __lsx_vfadd_s(_sum0, _sum2);\n\n            _sum0 = activation_ps(_sum0, activation_type, activation_params);\n\n            float* outptr = top_blob;\n            __lsx_vst(_sum0, outptr + p * 4, 0);\n        }\n    }\n\n    if (out_elempack == 1)\n    {\n        int nn_num_output = num_output / 4;\n        int remain_num_output_start = nn_num_output * 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int pp = 0; pp < nn_num_output; pp++)\n        {\n            int p = pp * 4;\n\n            float sum0 = 0.f;\n            float sum1 = 0.f;\n            float sum2 = 0.f;\n            float sum3 = 0.f;\n\n            if (bias_term)\n            {\n                sum0 = bias_data[p];\n                sum1 = bias_data[p + 1];\n                sum2 = bias_data[p + 2];\n                sum3 = bias_data[p + 3];\n          
  }\n\n            const unsigned short* w0 = weight_data_tm.row<const unsigned short>(p);\n            const unsigned short* w1 = weight_data_tm.row<const unsigned short>(p + 1);\n            const unsigned short* w2 = weight_data_tm.row<const unsigned short>(p + 2);\n            const unsigned short* w3 = weight_data_tm.row<const unsigned short>(p + 3);\n\n            const float* m = bottom_blob_flattened;\n\n            int i = 0;\n            __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum1 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum2 = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _sum3 = (__m128)__lsx_vreplgr2vr_w(0);\n            for (; i + 3 < num_input; i += 4)\n            {\n                __builtin_prefetch(m + 16);\n                __builtin_prefetch(w0 + 16);\n                __builtin_prefetch(w1 + 16);\n                __builtin_prefetch(w2 + 16);\n                __builtin_prefetch(w3 + 16);\n                __m128 _m = (__m128)__lsx_vld(m, 0);\n                __m128 _w0 = __lsx_vfcvtl_s_h(__lsx_vld(w0, 0));\n                __m128 _w1 = __lsx_vfcvtl_s_h(__lsx_vld(w1, 0));\n                __m128 _w2 = __lsx_vfcvtl_s_h(__lsx_vld(w2, 0));\n                __m128 _w3 = __lsx_vfcvtl_s_h(__lsx_vld(w3, 0));\n                _sum0 = __lsx_vfmadd_s(_w0, _m, _sum0);\n                _sum1 = __lsx_vfmadd_s(_w1, _m, _sum1);\n                _sum2 = __lsx_vfmadd_s(_w2, _m, _sum2);\n                _sum3 = __lsx_vfmadd_s(_w3, _m, _sum3);\n\n                m += 4;\n                w0 += 4;\n                w1 += 4;\n                w2 += 4;\n                w3 += 4;\n            }\n            for (; i < num_input; i++)\n            {\n                sum0 += *m * float16_to_float32(*w0);\n                sum1 += *m * float16_to_float32(*w1);\n                sum2 += *m * float16_to_float32(*w2);\n                sum3 += *m * float16_to_float32(*w3);\n\n                m++;\n                w0++;\n                
w1++;\n                w2++;\n                w3++;\n            }\n\n            sum0 += __lsx_reduce_fadd_s(_sum0);\n            sum1 += __lsx_reduce_fadd_s(_sum1);\n            sum2 += __lsx_reduce_fadd_s(_sum2);\n            sum3 += __lsx_reduce_fadd_s(_sum3);\n\n            sum0 = activation_ss(sum0, activation_type, activation_params);\n            sum1 = activation_ss(sum1, activation_type, activation_params);\n            sum2 = activation_ss(sum2, activation_type, activation_params);\n            sum3 = activation_ss(sum3, activation_type, activation_params);\n\n            top_blob[p] = sum0;\n            top_blob[p + 1] = sum1;\n            top_blob[p + 2] = sum2;\n            top_blob[p + 3] = sum3;\n        }\n\n        // num_output\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = remain_num_output_start; p < num_output; p++)\n        {\n            float sum = 0.f;\n\n            if (bias_term)\n                sum = bias_data[p];\n\n            const unsigned short* w = weight_data_tm.row<const unsigned short>(p);\n\n            const float* m = bottom_blob_flattened;\n\n            int i = 0;\n            __m128 _sum0 = (__m128)__lsx_vreplgr2vr_w(0);\n            for (; i + 3 < num_input; i += 4)\n            {\n                __builtin_prefetch(m + 16);\n                __builtin_prefetch(w + 16);\n                __m128 _m = (__m128)__lsx_vld(m, 0);\n                __m128 _w = __lsx_vfcvtl_s_h(__lsx_vld(w, 0));\n                _sum0 = __lsx_vfmadd_s(_w, _m, _sum0);\n\n                m += 4;\n                w += 4;\n            }\n            sum += __lsx_reduce_fadd_s(_sum0);\n            for (; i < num_input; i++)\n            {\n                sum += *m * float16_to_float32(*w);\n\n                m++;\n                w++;\n            }\n\n            sum = activation_ss(sum, activation_type, activation_params);\n\n            top_blob[p] = sum;\n        }\n    }\n\n    return 0;\n}\n#endif // 
__loongarch_sx\n\n#if NCNN_INT8\nint InnerProduct_loongarch::create_pipeline_int8_loongarch(const Option& opt)\n{\n    const int num_input = weight_data_size / num_output;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 8 == 0 ? 8 : 1;\n    }\n#endif // __loongarch_sx\n\n    // src = inch-outch\n    // dst = pb-inch-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(num_input, num_output);\n\n        weight_data_tm.create(num_input, num_output / out_elempack, (size_t)out_elempack, out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            signed char* g0 = weight_data_tm.row<signed char>(q / out_elempack);\n\n            for (int p = 0; p < num_input; p++)\n            {\n                for (int j = 0; j < out_elempack; j++)\n                {\n                    *g0++ = weight_data_r2.row<signed char>(q + j)[p];\n                }\n            }\n        }\n    }\n\n    scale_in_data.create(num_output);\n    for (int p = 0; p < num_output; p++)\n    {\n        // dequantize\n        float scale_in;\n        if (weight_data_int8_scales[p] == 0)\n            scale_in = 0;\n        else\n            scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n        scale_in_data[p] = scale_in;\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint InnerProduct_loongarch::forward_int8_loongarch(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int num_input = weight_data_size / num_output;\n\n    int elembits = bottom_blob.elembits();\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elembits != 8)\n    {\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_q);\n    }\n\n    if (bottom_blob_int8.dims == 2 && 
bottom_blob_int8.w == num_input)\n    {\n        // gemm\n        Mat bottom_blob_int8_unpacked;\n        Option opt_unpack = opt;\n        opt_unpack.blob_allocator = opt.workspace_allocator;\n        convert_packing(bottom_blob_int8, bottom_blob_int8_unpacked, 1, opt_unpack);\n\n        int h = bottom_blob_int8_unpacked.h;\n\n        int out_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            out_elempack = h % 4 == 0 ? 4 : 1;\n        }\n#endif\n\n        int outh = h / out_elempack;\n\n        top_blob.create(num_output, outh, (size_t)(4u * out_elempack), out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        int num_output_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            num_output_elempack = num_output % 8 == 0 ? 8 : 1;\n        }\n#endif\n\n#if __loongarch_sx\n        if (num_output_elempack == 8 && out_elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m0 = bottom_blob_int8_unpacked.row<const signed char>(j * 4);\n                    const signed char* m1 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 1);\n                    const signed char* m2 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 2);\n                    const signed char* m3 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 3);\n\n                    __m128i _sum00 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum01 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum10 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum11 = 
__lsx_vreplgr2vr_w(0);\n                    __m128i _sum20 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum21 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum30 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum31 = __lsx_vreplgr2vr_w(0);\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        __builtin_prefetch(m0 + 4);\n                        __builtin_prefetch(m1 + 4);\n                        __builtin_prefetch(m2 + 4);\n                        __builtin_prefetch(m3 + 4);\n                        __builtin_prefetch(kptr + 32);\n                        __m128i _val0 = __lsx_vreplgr2vr_h((short)m0[0]);\n                        __m128i _val1 = __lsx_vreplgr2vr_h((short)m1[0]);\n                        __m128i _val2 = __lsx_vreplgr2vr_h((short)m2[0]);\n                        __m128i _val3 = __lsx_vreplgr2vr_h((short)m3[0]);\n\n                        __m128i _w = __lsx_vld(kptr, 0);\n                        __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                        __m128i _s0 = __lsx_vmul_h(_val0, _w16);\n                        __m128i _s1 = __lsx_vmul_h(_val1, _w16);\n                        __m128i _s2 = __lsx_vmul_h(_val2, _w16);\n                        __m128i _s3 = __lsx_vmul_h(_val3, _w16);\n                        __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                        __m128i _exts1 = __lsx_vslti_h(_s1, 0);\n                        __m128i _exts2 = __lsx_vslti_h(_s2, 0);\n                        __m128i _exts3 = __lsx_vslti_h(_s3, 0);\n                        __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                        __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n                        __m128i _s1l = __lsx_vilvl_h(_exts1, _s1);\n                        __m128i _s1h = __lsx_vilvh_h(_exts1, _s1);\n                        __m128i _s2l = __lsx_vilvl_h(_exts2, _s2);\n                        __m128i _s2h = 
__lsx_vilvh_h(_exts2, _s2);\n                        __m128i _s3l = __lsx_vilvl_h(_exts3, _s3);\n                        __m128i _s3h = __lsx_vilvh_h(_exts3, _s3);\n\n                        _sum00 = __lsx_vadd_w(_sum00, _s0l);\n                        _sum01 = __lsx_vadd_w(_sum01, _s0h);\n                        _sum10 = __lsx_vadd_w(_sum10, _s1l);\n                        _sum11 = __lsx_vadd_w(_sum11, _s1h);\n                        _sum20 = __lsx_vadd_w(_sum20, _s2l);\n                        _sum21 = __lsx_vadd_w(_sum21, _s2h);\n                        _sum30 = __lsx_vadd_w(_sum30, _s3l);\n                        _sum31 = __lsx_vadd_w(_sum31, _s3h);\n\n                        m0++;\n                        m1++;\n                        m2++;\n                        m3++;\n                        kptr += 8;\n                    }\n\n                    // dequantize and relu\n                    __m128 _scale_in0 = (__m128)__lsx_vld((const float*)scale_in_data + p * 8, 0);\n                    __m128 _scale_in1 = (__m128)__lsx_vld((const float*)scale_in_data + p * 8 + 4, 0);\n\n                    __m128 _sumfp32_00 = __lsx_vffint_s_w(_sum00);\n                    __m128 _sumfp32_01 = __lsx_vffint_s_w(_sum01);\n                    __m128 _sumfp32_10 = __lsx_vffint_s_w(_sum10);\n                    __m128 _sumfp32_11 = __lsx_vffint_s_w(_sum11);\n                    __m128 _sumfp32_20 = __lsx_vffint_s_w(_sum20);\n                    __m128 _sumfp32_21 = __lsx_vffint_s_w(_sum21);\n                    __m128 _sumfp32_30 = __lsx_vffint_s_w(_sum30);\n                    __m128 _sumfp32_31 = __lsx_vffint_s_w(_sum31);\n                    if (bias_term)\n                    {\n                        __m128 _bias0 = (__m128)__lsx_vld((const float*)bias_data + p * 8, 0);\n                        __m128 _bias1 = (__m128)__lsx_vld((const float*)bias_data + p * 8 + 4, 0);\n                        _sumfp32_00 = __lsx_vfmadd_s(_scale_in0, _sumfp32_00, _bias0);\n             
           _sumfp32_01 = __lsx_vfmadd_s(_scale_in1, _sumfp32_01, _bias1);\n                        _sumfp32_10 = __lsx_vfmadd_s(_scale_in0, _sumfp32_10, _bias0);\n                        _sumfp32_11 = __lsx_vfmadd_s(_scale_in1, _sumfp32_11, _bias1);\n                        _sumfp32_20 = __lsx_vfmadd_s(_scale_in0, _sumfp32_20, _bias0);\n                        _sumfp32_21 = __lsx_vfmadd_s(_scale_in1, _sumfp32_21, _bias1);\n                        _sumfp32_30 = __lsx_vfmadd_s(_scale_in0, _sumfp32_30, _bias0);\n                        _sumfp32_31 = __lsx_vfmadd_s(_scale_in1, _sumfp32_31, _bias1);\n                    }\n                    else\n                    {\n                        _sumfp32_00 = __lsx_vfmul_s(_sumfp32_00, _scale_in0);\n                        _sumfp32_01 = __lsx_vfmul_s(_sumfp32_01, _scale_in1);\n                        _sumfp32_10 = __lsx_vfmul_s(_sumfp32_10, _scale_in0);\n                        _sumfp32_11 = __lsx_vfmul_s(_sumfp32_11, _scale_in1);\n                        _sumfp32_20 = __lsx_vfmul_s(_sumfp32_20, _scale_in0);\n                        _sumfp32_21 = __lsx_vfmul_s(_sumfp32_21, _scale_in1);\n                        _sumfp32_30 = __lsx_vfmul_s(_sumfp32_30, _scale_in0);\n                        _sumfp32_31 = __lsx_vfmul_s(_sumfp32_31, _scale_in1);\n                    }\n\n                    _sumfp32_00 = activation_ps(_sumfp32_00, activation_type, activation_params);\n                    _sumfp32_01 = activation_ps(_sumfp32_01, activation_type, activation_params);\n                    _sumfp32_10 = activation_ps(_sumfp32_10, activation_type, activation_params);\n                    _sumfp32_11 = activation_ps(_sumfp32_11, activation_type, activation_params);\n                    _sumfp32_20 = activation_ps(_sumfp32_20, activation_type, activation_params);\n                    _sumfp32_21 = activation_ps(_sumfp32_21, activation_type, activation_params);\n                    _sumfp32_30 = activation_ps(_sumfp32_30, 
activation_type, activation_params);\n                    _sumfp32_31 = activation_ps(_sumfp32_31, activation_type, activation_params);\n\n                    // transpose 4x8\n                    __m128i _r01r = __lsx_vilvl_w((__m128i)_sumfp32_10, (__m128i)_sumfp32_00);\n                    __m128i _r01l = __lsx_vilvh_w((__m128i)_sumfp32_10, (__m128i)_sumfp32_00);\n                    __m128i _r23r = __lsx_vilvl_w((__m128i)_sumfp32_30, (__m128i)_sumfp32_20);\n                    __m128i _r23l = __lsx_vilvh_w((__m128i)_sumfp32_30, (__m128i)_sumfp32_20);\n                    __m128i _r45r = __lsx_vilvl_w((__m128i)_sumfp32_11, (__m128i)_sumfp32_01);\n                    __m128i _r45l = __lsx_vilvh_w((__m128i)_sumfp32_11, (__m128i)_sumfp32_01);\n                    __m128i _r67r = __lsx_vilvl_w((__m128i)_sumfp32_31, (__m128i)_sumfp32_21);\n                    __m128i _r67l = __lsx_vilvh_w((__m128i)_sumfp32_31, (__m128i)_sumfp32_21);\n                    _sumfp32_00 = (__m128)__lsx_vilvl_d(_r23r, _r01r);\n                    _sumfp32_10 = (__m128)__lsx_vilvh_d(_r23r, _r01r);\n                    _sumfp32_20 = (__m128)__lsx_vilvl_d(_r23l, _r01l);\n                    _sumfp32_30 = (__m128)__lsx_vilvh_d(_r23l, _r01l);\n                    _sumfp32_01 = (__m128)__lsx_vilvl_d(_r67r, _r45r);\n                    _sumfp32_11 = (__m128)__lsx_vilvh_d(_r67r, _r45r);\n                    _sumfp32_21 = (__m128)__lsx_vilvl_d(_r67l, _r45l);\n                    _sumfp32_31 = (__m128)__lsx_vilvh_d(_r67l, _r45l);\n\n                    __lsx_vst(_sumfp32_00, outptr, 0);\n                    __lsx_vst(_sumfp32_10, outptr + 4, 0);\n                    __lsx_vst(_sumfp32_20, outptr + 8, 0);\n                    __lsx_vst(_sumfp32_30, outptr + 12, 0);\n                    __lsx_vst(_sumfp32_01, outptr + 16, 0);\n                    __lsx_vst(_sumfp32_11, outptr + 20, 0);\n                    __lsx_vst(_sumfp32_21, outptr + 24, 0);\n                    __lsx_vst(_sumfp32_31, outptr + 28, 
0);\n\n                    outptr += 32;\n                }\n            }\n        }\n\n        if (num_output_elempack == 1 && out_elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m0 = bottom_blob_int8_unpacked.row<const signed char>(j * 4);\n                    const signed char* m1 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 1);\n                    const signed char* m2 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 2);\n                    const signed char* m3 = bottom_blob_int8_unpacked.row<const signed char>(j * 4 + 3);\n\n                    int sum0 = 0;\n                    int sum1 = 0;\n                    int sum2 = 0;\n                    int sum3 = 0;\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        sum0 += *m0++ * kptr[0];\n                        sum1 += *m1++ * kptr[0];\n                        sum2 += *m2++ * kptr[0];\n                        sum3 += *m3++ * kptr[0];\n                        kptr += 1;\n                    }\n\n                    // dequantize and relu\n                    float sumfp32_0 = sum0 * scale_in_data[p];\n                    float sumfp32_1 = sum1 * scale_in_data[p];\n                    float sumfp32_2 = sum2 * scale_in_data[p];\n                    float sumfp32_3 = sum3 * scale_in_data[p];\n\n                    if (bias_term)\n                    {\n                        sumfp32_0 += bias_data[p];\n                        sumfp32_1 += bias_data[p];\n                        sumfp32_2 += bias_data[p];\n                        sumfp32_3 += bias_data[p];\n   
                 }\n\n                    outptr[0] = activation_ss(sumfp32_0, activation_type, activation_params);\n                    outptr[1] = activation_ss(sumfp32_1, activation_type, activation_params);\n                    outptr[2] = activation_ss(sumfp32_2, activation_type, activation_params);\n                    outptr[3] = activation_ss(sumfp32_3, activation_type, activation_params);\n                    outptr += 4;\n                }\n            }\n        }\n\n        if (num_output_elempack == 8 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output / num_output_elempack; p++)\n                {\n                    const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m = bottom_blob_int8_unpacked.row<const signed char>(j);\n\n                    __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n                    __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        __builtin_prefetch(m + 4);\n                        __builtin_prefetch(kptr + 32);\n                        __m128i _val = __lsx_vreplgr2vr_h((short)m[0]);\n\n                        __m128i _w = __lsx_vld(kptr, 0);\n                        __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                        __m128i _s0 = __lsx_vmul_h(_val, _w16);\n                        __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                        __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                        __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n\n                        _sum0 = __lsx_vadd_w(_sum0, _s0l);\n                        _sum1 = __lsx_vadd_w(_sum1, _s0h);\n\n                        m++;\n                
        kptr += 8;\n                    }\n\n                    // dequantize and relu\n                    __m128 _scale_in0 = (__m128)__lsx_vld((const float*)scale_in_data + p * 8, 0);\n                    __m128 _scale_in1 = (__m128)__lsx_vld((const float*)scale_in_data + p * 8 + 4, 0);\n\n                    __m128 _sumfp32_0 = __lsx_vffint_s_w(_sum0);\n                    __m128 _sumfp32_1 = __lsx_vffint_s_w(_sum1);\n\n                    if (bias_term)\n                    {\n                        __m128 _bias0 = (__m128)__lsx_vld((const float*)bias_data + p * 8, 0);\n                        __m128 _bias1 = (__m128)__lsx_vld((const float*)bias_data + p * 8 + 4, 0);\n                        _sumfp32_0 = __lsx_vfmadd_s(_scale_in0, _sumfp32_0, _bias0);\n                        _sumfp32_1 = __lsx_vfmadd_s(_scale_in1, _sumfp32_1, _bias1);\n                    }\n                    else\n                    {\n                        _sumfp32_0 = __lsx_vfmul_s(_sumfp32_0, _scale_in0);\n                        _sumfp32_1 = __lsx_vfmul_s(_sumfp32_1, _scale_in1);\n                    }\n\n                    _sumfp32_0 = activation_ps(_sumfp32_0, activation_type, activation_params);\n                    _sumfp32_1 = activation_ps(_sumfp32_1, activation_type, activation_params);\n\n                    __lsx_vst(_sumfp32_0, outptr, 0);\n                    __lsx_vst(_sumfp32_1, outptr + 4, 0);\n                    outptr += 8;\n                }\n            }\n        }\n#endif // __loongarch_sx\n\n        if (num_output_elempack == 1 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int j = 0; j < outh; j++)\n            {\n                float* outptr = top_blob.row(j);\n\n                for (int p = 0; p < num_output; p++)\n                {\n                    const signed char* kptr = weight_data_tm.row<const signed char>(p);\n                    const signed char* m = 
bottom_blob_int8_unpacked.row<const signed char>(j);\n\n                    int sum = 0;\n\n                    int i = 0;\n                    for (; i < num_input; i++)\n                    {\n                        sum += *m++ * *kptr++;\n                    }\n\n                    // dequantize, add bias and apply activation\n                    float sumfp32 = sum * scale_in_data[p];\n\n                    if (bias_term)\n                        sumfp32 += bias_data[p];\n\n                    outptr[0] = activation_ss(sumfp32, activation_type, activation_params);\n                    outptr += 1;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    Mat bottom_blob_int8_flattened = bottom_blob_int8;\n    if (bottom_blob_int8.dims != 1)\n    {\n        Option opt_flatten = opt;\n        opt_flatten.blob_allocator = opt.workspace_allocator;\n        flatten->forward(bottom_blob_int8, bottom_blob_int8_flattened, opt_flatten);\n    }\n\n    //     int elempack = bottom_blob_int8_flattened.elempack;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 8 == 0 ? 
8 : 1;\n    }\n#endif // __loongarch_sx\n    //     size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(num_output / out_elempack, (size_t)(4u * out_elempack), out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __loongarch_sx\n    if (out_elempack == 8)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            __m128i _sum0 = __lsx_vreplgr2vr_w(0);\n            __m128i _sum1 = __lsx_vreplgr2vr_w(0);\n\n            const signed char* kptr = weight_data_tm.row<const signed char>(p);\n            const signed char* sptr = bottom_blob_int8_flattened;\n\n            int i = 0;\n            for (; i < num_input; i++)\n            {\n                __builtin_prefetch(sptr + 4);\n                __builtin_prefetch(kptr + 32);\n                __m128i _val = __lsx_vreplgr2vr_h((short)sptr[0]);\n\n                __m128i _w = __lsx_vld(kptr, 0);\n                __m128i _w16 = __lsx_vilvl_b(__lsx_vslti_b(_w, 0), _w);\n\n                __m128i _s0 = __lsx_vmul_h(_val, _w16);\n                __m128i _exts0 = __lsx_vslti_h(_s0, 0);\n                __m128i _s0l = __lsx_vilvl_h(_exts0, _s0);\n                __m128i _s0h = __lsx_vilvh_h(_exts0, _s0);\n\n                _sum0 = __lsx_vadd_w(_sum0, _s0l);\n                _sum1 = __lsx_vadd_w(_sum1, _s0h);\n\n                sptr += 1;\n                kptr += 8;\n            }\n\n            // dequantize and relu\n            __m128 _scale_in0 = (__m128)__lsx_vld((const float*)scale_in_data + p * 8, 0);\n            __m128 _scale_in1 = (__m128)__lsx_vld((const float*)scale_in_data + p * 8 + 4, 0);\n\n            __m128 _sumfp32_0 = __lsx_vffint_s_w(_sum0);\n            __m128 _sumfp32_1 = __lsx_vffint_s_w(_sum1);\n\n            if (bias_term)\n            {\n                __m128 _bias0 = (__m128)__lsx_vld((const float*)bias_data + p * 8, 0);\n                
__m128 _bias1 = (__m128)__lsx_vld((const float*)bias_data + p * 8 + 4, 0);\n                _sumfp32_0 = __lsx_vfmadd_s(_scale_in0, _sumfp32_0, _bias0);\n                _sumfp32_1 = __lsx_vfmadd_s(_scale_in1, _sumfp32_1, _bias1);\n            }\n            else\n            {\n                _sumfp32_0 = __lsx_vfmul_s(_sumfp32_0, _scale_in0);\n                _sumfp32_1 = __lsx_vfmul_s(_sumfp32_1, _scale_in1);\n            }\n\n            _sumfp32_0 = activation_ps(_sumfp32_0, activation_type, activation_params);\n            _sumfp32_1 = activation_ps(_sumfp32_1, activation_type, activation_params);\n\n            float* outptr = (float*)top_blob + p * 8;\n            __lsx_vst(_sumfp32_0, outptr, 0);\n            __lsx_vst(_sumfp32_1, outptr + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (out_elempack == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < num_output / out_elempack; p++)\n        {\n            int sum = 0;\n\n            const signed char* kptr = weight_data_tm.row<const signed char>(p);\n            const signed char* sptr = bottom_blob_int8_flattened;\n\n            int i = 0;\n            for (; i < num_input; i++)\n            {\n                signed char val = sptr[0];\n\n                signed char w = kptr[0];\n\n                sum += val * w;\n\n                sptr += 1;\n                kptr += 1;\n            }\n\n            // dequantize, add bias and apply activation\n            float sumfp32 = sum * scale_in_data[p];\n\n            if (bias_term)\n                sumfp32 += bias_data[p];\n\n            sumfp32 = activation_ss(sumfp32, activation_type, activation_params);\n\n            top_blob[p] = sumfp32;\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/innerproduct_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INNERPRODUCT_LOONGARCH_H\n#define LAYER_INNERPRODUCT_LOONGARCH_H\n\n#include \"innerproduct.h\"\n\nnamespace ncnn {\n\nclass InnerProduct_loongarch : public InnerProduct\n{\npublic:\n    InnerProduct_loongarch();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n#if __loongarch_sx\n    int create_pipeline_fp16s(const Option& opt);\n    int forward_fp16s(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n#if NCNN_INT8\n    int create_pipeline_int8_loongarch(const Option& opt);\n    int forward_int8_loongarch(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Layer* flatten;\n\n    Mat weight_data_tm;\n\n#if NCNN_INT8\n    Mat scale_in_data;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INNERPRODUCT_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/interp_bicubic.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic inline void interpolate_cubic(float fx, float* coeffs)\n{\n    const float A = -0.75f;\n\n    float fx0 = fx + 1;\n    float fx1 = fx;\n    float fx2 = 1 - fx;\n    // float fx3 = 2 - fx;\n\n    coeffs[0] = A * fx0 * fx0 * fx0 - 5 * A * fx0 * fx0 + 8 * A * fx0 - 4 * A;\n    coeffs[1] = (A + 2) * fx1 * fx1 * fx1 - (A + 3) * fx1 * fx1 + 1;\n    coeffs[2] = (A + 2) * fx2 * fx2 * fx2 - (A + 3) * fx2 * fx2 + 1;\n    coeffs[3] = 1.f - coeffs[0] - coeffs[1] - coeffs[2];\n}\n\nstatic void cubic_coeffs(int w, int outw, int* xofs, float* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = (float)(dx * scale);\n        }\n\n        int sx = static_cast<int>(floor(fx));\n        fx -= sx;\n\n        interpolate_cubic(fx, alpha + dx * 4);\n\n        if (sx <= -1)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = 1.f - alpha[dx * 4 + 3];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 3];\n            alpha[dx * 4 + 2] = 0.f;\n            alpha[dx * 4 + 3] = 0.f;\n        }\n        if (sx == 0)\n        {\n            sx = 1;\n            alpha[dx * 4 + 0] = alpha[dx * 4 + 0] + alpha[dx * 4 + 1];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 2];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 3];\n            alpha[dx * 4 + 3] = 0.f;\n        }\n        if (sx == w - 2)\n        {\n            sx = w - 3;\n            alpha[dx * 4 + 3] = alpha[dx * 4 + 2] + alpha[dx * 4 + 3];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 1];\n            alpha[dx * 4 + 1] = alpha[dx * 4 + 0];\n            alpha[dx * 4 + 0] = 0.f;\n        }\n        if (sx >= w - 1)\n        {\n            
sx = w - 3;\n            alpha[dx * 4 + 3] = 1.f - alpha[dx * 4 + 0];\n            alpha[dx * 4 + 2] = alpha[dx * 4 + 0];\n            alpha[dx * 4 + 1] = 0.f;\n            alpha[dx * 4 + 0] = 0.f;\n        }\n\n        xofs[dx] = sx;\n    }\n}\n\nstatic void resize_bicubic_image(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    Mat rowsbuf2(w);\n    Mat rowsbuf3(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n     
       const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        
}\n        else\n        {\n            // hresize four rows\n            const float* S0 = src.row(sy - 1);\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                float a2 = alphap[2];\n                float a3 = alphap[3];\n                rows0p[dx] = S0p[-1] * a0 + S0p[0] * a1 + S0p[1] * a2 + S0p[2] * a3;\n                rows1p[dx] = S1p[-1] * a0 + S1p[0] * a1 + S1p[1] * a2 + S1p[2] * a3;\n                rows2p[dx] = S2p[-1] * a0 + S2p[0] * a1 + S2p[1] * a2 + S2p[2] * a3;\n                rows3p[dx] = S3p[-1] * a0 + S3p[0] * a1 + S3p[1] * a2 + S3p[2] * a3;\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n        float b2 = beta[2];\n        float b3 = beta[3];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        float* Dp = dst.row(dy);\n        for (int dx = 0; dx < w; dx++)\n        {\n            //             D[x] = rows0[x]*b0 + rows1[x]*b1 + rows2[x]*b2 + rows3[x]*b3;\n            *Dp++ = *rows0p++ * b0 + *rows1p++ * b1 + *rows2p++ * b2 + *rows3p++ * b3;\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/interp_bicubic_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bicubic_image_pack4(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf2(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf3(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n    float* rows2 = rowsbuf2;\n    float* rows3 = rowsbuf3;\n\n    int prev_sy1 = -3;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows2;\n            rows2 = rows3;\n            rows3 = rows0_old;\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S3p = S3 + sx;\n\n                __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n                __m128 _a2 = __lsx_vreplfr2vr_s(alphap[2]);\n                __m128 _a3 = __lsx_vreplfr2vr_s(alphap[3]);\n\n                __m128 _S30 = (__m128)__lsx_vld(S3p - 4, 0);\n                __m128 _S31 = (__m128)__lsx_vld(S3p + 0, 0);\n                __m128 _S32 = (__m128)__lsx_vld(S3p + 4, 0);\n                __m128 _S33 = (__m128)__lsx_vld(S3p + 8, 0);\n                __m128 _rows3 = __lsx_vfmul_s(_S30, _a0);\n                _rows3 = __lsx_vfmadd_s(_a1, _S31, _rows3);\n                _rows3 = __lsx_vfmadd_s(_a2, _S32, _rows3);\n                _rows3 = 
__lsx_vfmadd_s(_a3, _S33, _rows3);\n                __lsx_vst(_rows3, rows3p + dx * 4, 0);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 2)\n        {\n            // hresize two rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            rows0 = rows2;\n            rows1 = rows3;\n            rows2 = rows0_old;\n            rows3 = rows1_old;\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n                __m128 _a2 = __lsx_vreplfr2vr_s(alphap[2]);\n                __m128 _a3 = __lsx_vreplfr2vr_s(alphap[3]);\n\n                __m128 _S20 = (__m128)__lsx_vld(S2p - 4, 0);\n                __m128 _S21 = (__m128)__lsx_vld(S2p + 0, 0);\n                __m128 _S22 = (__m128)__lsx_vld(S2p + 4, 0);\n                __m128 _S23 = (__m128)__lsx_vld(S2p + 8, 0);\n                __m128 _S30 = (__m128)__lsx_vld(S3p - 4, 0);\n                __m128 _S31 = (__m128)__lsx_vld(S3p + 0, 0);\n                __m128 _S32 = (__m128)__lsx_vld(S3p + 4, 0);\n                __m128 _S33 = (__m128)__lsx_vld(S3p + 8, 0);\n                __m128 _rows2 = __lsx_vfmul_s(_S20, _a0);\n                __m128 _rows3 = __lsx_vfmul_s(_S30, _a0);\n                _rows2 = __lsx_vfmadd_s(_a1, _S21, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a1, _S31, _rows3);\n                _rows2 = __lsx_vfmadd_s(_a2, _S22, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a2, _S32, _rows3);\n                _rows2 = __lsx_vfmadd_s(_a3, _S23, _rows2);\n  
              _rows3 = __lsx_vfmadd_s(_a3, _S33, _rows3);\n                __lsx_vst(_rows2, rows2p + dx * 4, 0);\n                __lsx_vst(_rows3, rows3p + dx * 4, 0);\n\n                alphap += 4;\n            }\n        }\n        else if (sy == prev_sy1 + 3)\n        {\n            // hresize three rows\n            float* rows0_old = rows0;\n            float* rows1_old = rows1;\n            float* rows2_old = rows2;\n            rows0 = rows3;\n            rows1 = rows0_old;\n            rows2 = rows1_old;\n            rows3 = rows2_old;\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n                __m128 _a2 = __lsx_vreplfr2vr_s(alphap[2]);\n                __m128 _a3 = __lsx_vreplfr2vr_s(alphap[3]);\n\n                __m128 _S10 = (__m128)__lsx_vld(S1p - 4, 0);\n                __m128 _S11 = (__m128)__lsx_vld(S1p + 0, 0);\n                __m128 _S12 = (__m128)__lsx_vld(S1p + 4, 0);\n                __m128 _S13 = (__m128)__lsx_vld(S1p + 8, 0);\n                __m128 _S20 = (__m128)__lsx_vld(S2p - 4, 0);\n                __m128 _S21 = (__m128)__lsx_vld(S2p + 0, 0);\n                __m128 _S22 = (__m128)__lsx_vld(S2p + 4, 0);\n                __m128 _S23 = (__m128)__lsx_vld(S2p + 8, 0);\n                __m128 _S30 = (__m128)__lsx_vld(S3p - 4, 0);\n                __m128 _S31 = (__m128)__lsx_vld(S3p + 0, 0);\n                __m128 _S32 = (__m128)__lsx_vld(S3p 
+ 4, 0);\n                __m128 _S33 = (__m128)__lsx_vld(S3p + 8, 0);\n                __m128 _rows1 = __lsx_vfmul_s(_S10, _a0);\n                __m128 _rows2 = __lsx_vfmul_s(_S20, _a0);\n                __m128 _rows3 = __lsx_vfmul_s(_S30, _a0);\n                _rows1 = __lsx_vfmadd_s(_a1, _S11, _rows1);\n                _rows2 = __lsx_vfmadd_s(_a1, _S21, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a1, _S31, _rows3);\n                _rows1 = __lsx_vfmadd_s(_a2, _S12, _rows1);\n                _rows2 = __lsx_vfmadd_s(_a2, _S22, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a2, _S32, _rows3);\n                _rows1 = __lsx_vfmadd_s(_a3, _S13, _rows1);\n                _rows2 = __lsx_vfmadd_s(_a3, _S23, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a3, _S33, _rows3);\n                __lsx_vst(_rows1, rows1p + dx * 4, 0);\n                __lsx_vst(_rows2, rows2p + dx * 4, 0);\n                __lsx_vst(_rows3, rows3p + dx * 4, 0);\n\n                alphap += 4;\n            }\n        }\n        else\n        {\n            // hresize four rows\n            const float* S0 = src.row(sy - 1);\n            const float* S1 = src.row(sy);\n            const float* S2 = src.row(sy + 1);\n            const float* S3 = src.row(sy + 2);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            float* rows2p = rows2;\n            float* rows3p = rows3;\n            for (int dx = 0; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n                const float* S2p = S2 + sx;\n                const float* S3p = S3 + sx;\n\n                __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n                __m128 _a2 = __lsx_vreplfr2vr_s(alphap[2]);\n                __m128 _a3 = __lsx_vreplfr2vr_s(alphap[3]);\n\n              
  __m128 _S00 = (__m128)__lsx_vld(S0p - 4, 0);\n                __m128 _S01 = (__m128)__lsx_vld(S0p + 0, 0);\n                __m128 _S02 = (__m128)__lsx_vld(S0p + 4, 0);\n                __m128 _S03 = (__m128)__lsx_vld(S0p + 8, 0);\n                __m128 _S10 = (__m128)__lsx_vld(S1p - 4, 0);\n                __m128 _S11 = (__m128)__lsx_vld(S1p + 0, 0);\n                __m128 _S12 = (__m128)__lsx_vld(S1p + 4, 0);\n                __m128 _S13 = (__m128)__lsx_vld(S1p + 8, 0);\n                __m128 _S20 = (__m128)__lsx_vld(S2p - 4, 0);\n                __m128 _S21 = (__m128)__lsx_vld(S2p + 0, 0);\n                __m128 _S22 = (__m128)__lsx_vld(S2p + 4, 0);\n                __m128 _S23 = (__m128)__lsx_vld(S2p + 8, 0);\n                __m128 _S30 = (__m128)__lsx_vld(S3p - 4, 0);\n                __m128 _S31 = (__m128)__lsx_vld(S3p + 0, 0);\n                __m128 _S32 = (__m128)__lsx_vld(S3p + 4, 0);\n                __m128 _S33 = (__m128)__lsx_vld(S3p + 8, 0);\n                __m128 _rows0 = __lsx_vfmul_s(_S00, _a0);\n                __m128 _rows1 = __lsx_vfmul_s(_S10, _a0);\n                __m128 _rows2 = __lsx_vfmul_s(_S20, _a0);\n                __m128 _rows3 = __lsx_vfmul_s(_S30, _a0);\n                _rows0 = __lsx_vfmadd_s(_a1, _S01, _rows0);\n                _rows1 = __lsx_vfmadd_s(_a1, _S11, _rows1);\n                _rows2 = __lsx_vfmadd_s(_a1, _S21, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a1, _S31, _rows3);\n                _rows0 = __lsx_vfmadd_s(_a2, _S02, _rows0);\n                _rows1 = __lsx_vfmadd_s(_a2, _S12, _rows1);\n                _rows2 = __lsx_vfmadd_s(_a2, _S22, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a2, _S32, _rows3);\n                _rows0 = __lsx_vfmadd_s(_a3, _S03, _rows0);\n                _rows1 = __lsx_vfmadd_s(_a3, _S13, _rows1);\n                _rows2 = __lsx_vfmadd_s(_a3, _S23, _rows2);\n                _rows3 = __lsx_vfmadd_s(_a3, _S33, _rows3);\n                __lsx_vst(_rows0, rows0p + dx * 4, 
0);\n                __lsx_vst(_rows1, rows1p + dx * 4, 0);\n                __lsx_vst(_rows2, rows2p + dx * 4, 0);\n                __lsx_vst(_rows3, rows3p + dx * 4, 0);\n\n                alphap += 4;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        __m128 _b0 = __lsx_vreplfr2vr_s(beta[0]);\n        __m128 _b1 = __lsx_vreplfr2vr_s(beta[1]);\n        __m128 _b2 = __lsx_vreplfr2vr_s(beta[2]);\n        __m128 _b3 = __lsx_vreplfr2vr_s(beta[3]);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* rows2p = rows2;\n        float* rows3p = rows3;\n        float* Dp = dst.row(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            __m128 _rows0 = (__m128)__lsx_vld(rows0p, 0);\n            __m128 _rows1 = (__m128)__lsx_vld(rows1p, 0);\n            __m128 _rows2 = (__m128)__lsx_vld(rows2p, 0);\n            __m128 _rows3 = (__m128)__lsx_vld(rows3p, 0);\n            __m128 _Dp = __lsx_vfmul_s(_rows0, _b0);\n            _Dp = __lsx_vfmadd_s(_b1, _rows1, _Dp);\n            _Dp = __lsx_vfmadd_s(_b2, _rows2, _Dp);\n            _Dp = __lsx_vfmadd_s(_b3, _rows3, _Dp);\n            __lsx_vst(_Dp, Dp, 0);\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n            rows2p += 4;\n            rows3p += 4;\n        }\n\n        beta += 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/interp_bilinear.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void linear_coeffs(int w, int outw, int* xofs, float* alpha, int align_corner)\n{\n    double scale = (double)w / outw;\n    if (align_corner)\n    {\n        scale = (double)(w - 1) / (outw - 1);\n    }\n\n    for (int dx = 0; dx < outw; dx++)\n    {\n        float fx = (float)((dx + 0.5) * scale - 0.5);\n        if (align_corner)\n        {\n            fx = (float)(dx * scale);\n        }\n\n        int sx = floor(fx);\n        fx -= sx;\n\n        if (sx < 0)\n        {\n            sx = 0;\n            fx = 0.f;\n        }\n        if (sx >= w - 1)\n        {\n            sx = w - 2;\n            fx = 1.f;\n        }\n\n        xofs[dx] = sx;\n\n        alpha[dx * 2] = 1.f - fx;\n        alpha[dx * 2 + 1] = fx;\n    }\n}\n\nstatic void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w);\n    Mat rowsbuf1(w);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n       
 else\n        {\n            // hresize two rows\n            const float* S0 = src.row(sy);\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx];\n                const float* S0p = S0 + sx;\n                const float* S1p = S1 + sx;\n\n                float a0 = alphap[0];\n                float a1 = alphap[1];\n                rows0p[dx] = S0p[0] * a0 + S0p[1] * a1;\n                rows1p[dx] = S1p[0] * a0 + S1p[1] * a1;\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        float b0 = beta[0];\n        float b1 = beta[1];\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* Dp = dst.row(dy);\n\n#if __loongarch_sx\n        int nn = w >> 3;\n#else\n        int nn = 0;\n#endif\n        int remain = w - (nn << 3);\n\n#if __loongarch_sx\n        __m128 _b0 = __lsx_vreplfr2vr_s(b0);\n        __m128 _b1 = __lsx_vreplfr2vr_s(b1);\n        for (; nn > 0; nn--)\n        {\n            __m128 _rows0 = (__m128)__lsx_vld(rows0p, 0);\n            __m128 _rows1 = (__m128)__lsx_vld(rows1p, 0);\n\n            __m128 _Dp = __lsx_vfmul_s(_rows0, _b0);\n            _Dp = __lsx_vfmadd_s(_b1, _rows1, _Dp);\n\n            __lsx_vst(_Dp, Dp, 0);\n\n            __m128 _rows0n = (__m128)__lsx_vld(rows0p + 4, 0);\n            __m128 _rows1n = (__m128)__lsx_vld(rows1p + 4, 0);\n\n            __m128 _Dpn = __lsx_vfmul_s(_rows0n, _b0);\n            _Dpn = __lsx_vfmadd_s(_b1, _rows1n, _Dpn);\n\n            __lsx_vst(_Dpn, Dp + 4, 0);\n\n            Dp += 8;\n            rows0p += 8;\n            rows1p += 8;\n        }\n#endif // __loongarch_sx\n        for (; remain; --remain)\n        {\n            //             D[x] = rows0[x]*b0 + rows1[x]*b1;\n            *Dp++ = *rows0p++ * b0 
+ *rows1p++ * b1;\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/interp_bilinear_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void resize_bilinear_image_pack4(const Mat& src, Mat& dst, float* alpha, int* xofs, float* beta, int* yofs)\n{\n    int w = dst.w;\n    int h = dst.h;\n\n    // loop body\n    Mat rowsbuf0(w, (size_t)4 * 4u, 4);\n    Mat rowsbuf1(w, (size_t)4 * 4u, 4);\n    float* rows0 = rowsbuf0;\n    float* rows1 = rowsbuf1;\n\n    int prev_sy1 = -2;\n\n    for (int dy = 0; dy < h; dy++)\n    {\n        int sy = yofs[dy];\n\n        if (sy == prev_sy1)\n        {\n            // reuse all rows\n        }\n        else if (sy == prev_sy1 + 1)\n        {\n            // hresize one row\n            float* rows0_old = rows0;\n            rows0 = rows1;\n            rows1 = rows0_old;\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S1p = S1 + sx;\n\n                __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n\n                __m128 _S10 = (__m128)__lsx_vld(S1p, 0);\n                __m128 _S11 = (__m128)__lsx_vld(S1p + 4, 0);\n                __m128 _rows1 = __lsx_vfmul_s(_S10, _a0);\n                _rows1 = __lsx_vfmadd_s(_a1, _S11, _rows1);\n                __lsx_vst(_rows1, rows1p + dx * 4, 0);\n\n                alphap += 2;\n            }\n        }\n        else\n        {\n            // hresize two rows\n            const float* S0 = src.row(sy);\n            const float* S1 = src.row(sy + 1);\n\n            const float* alphap = alpha;\n            float* rows0p = rows0;\n            float* rows1p = rows1;\n            int dx = 0;\n            for (; dx < w; dx++)\n            {\n                int sx = xofs[dx] * 4;\n                const float* S0p = S0 + 
sx;\n                const float* S1p = S1 + sx;\n\n                __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n\n                __m128 _S00 = (__m128)__lsx_vld(S0p, 0);\n                __m128 _S01 = (__m128)__lsx_vld(S0p + 4, 0);\n                __m128 _S10 = (__m128)__lsx_vld(S1p, 0);\n                __m128 _S11 = (__m128)__lsx_vld(S1p + 4, 0);\n                __m128 _rows0 = __lsx_vfmul_s(_S00, _a0);\n                __m128 _rows1 = __lsx_vfmul_s(_S10, _a0);\n                _rows0 = __lsx_vfmadd_s(_a1, _S01, _rows0);\n                _rows1 = __lsx_vfmadd_s(_a1, _S11, _rows1);\n                __lsx_vst(_rows0, rows0p + dx * 4, 0);\n                __lsx_vst(_rows1, rows1p + dx * 4, 0);\n\n                alphap += 2;\n            }\n        }\n\n        prev_sy1 = sy;\n\n        // vresize\n        __m128 _b0 = __lsx_vreplfr2vr_s(beta[0]);\n        __m128 _b1 = __lsx_vreplfr2vr_s(beta[1]);\n\n        float* rows0p = rows0;\n        float* rows1p = rows1;\n        float* Dp = dst.row(dy);\n\n        for (int dx = 0; dx < w; dx++)\n        {\n            __m128 _rows0 = (__m128)__lsx_vld(rows0p, 0);\n            __m128 _rows1 = (__m128)__lsx_vld(rows1p, 0);\n            __m128 _Dp = __lsx_vfmul_s(_rows0, _b0);\n            _Dp = __lsx_vfmadd_s(_b1, _rows1, _Dp);\n            __lsx_vst(_Dp, Dp, 0);\n\n            Dp += 4;\n            rows0p += 4;\n            rows1p += 4;\n        }\n\n        beta += 2;\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/interp_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"interp_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\n#include \"interp_bicubic.h\"\n#include \"interp_bilinear.h\"\n\n#if __loongarch_sx\n#include \"interp_bicubic_pack4.h\"\n#include \"interp_bilinear_pack4.h\"\n#endif\n\nInterp_loongarch::Interp_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Interp_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& reference_blob = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    int h = bottom_blob.h;\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = reference_blob.w;\n    int outh = reference_blob.h;\n\n    if (!size_expr.empty())\n    {\n        std::vector<Mat> bottom_blob_shapes(bottom_blobs.size());\n        for (size_t i = 0; i < bottom_blobs.size(); i++)\n        {\n            bottom_blob_shapes[i] = bottom_blobs[i].shape();\n        }\n        eval_size_expr(bottom_blob_shapes, outw, outh);\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(outw, outh, w, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __loongarch_sx\n        if (elempack == 4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < w; q++)\n            {\n                Mat top_blob_c = top_blob.channel(q);\n                __m128 _v = (__m128)__lsx_vld((const float*)bottom_blob + q * 4, 0);\n                top_blob_c.fill(_v);\n            }\n\n            
return 0;\n        }\n#endif // __loongarch_sx\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < w; q++)\n        {\n            Mat top_blob_c = top_blob.channel(q);\n            const float v = bottom_blob[q];\n            top_blob_c.fill(v);\n        }\n\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        if (outw == w)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n\n        top_blob.create(outw, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __loongarch_sx\n        if (elempack == 4)\n        {\n            if (resize_type == 1) // nearest\n            {\n                const float ws = (output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const float* ptr = bottom_blob.row(y);\n                    float* outptr = top_blob.row(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        __m128 _p = (__m128)__lsx_vld(ptr + in_x * 4, 0);\n                        __lsx_vst(_p, outptr, 0);\n\n                        outptr += 4;\n                    }\n                }\n            }\n\n            if (resize_type == 2) // bilinear\n            {\n                int* buf = new int[outw + outw * 2];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const float* ptr = bottom_blob.row(y);\n                    float* outptr = top_blob.row(y);\n                    
const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const float* Sp = ptr + sx;\n\n                        __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                        __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n\n                        __m128 _S0 = (__m128)__lsx_vld(Sp, 0);\n                        __m128 _S1 = (__m128)__lsx_vld(Sp + 4, 0);\n                        __m128 _p = __lsx_vfmul_s(_S0, _a0);\n                        _p = __lsx_vfmadd_s(_a1, _S1, _p);\n                        __lsx_vst(_p, outptr, 0);\n\n                        alphap += 2;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            if (resize_type == 3) // bicubic\n            {\n                int* buf = new int[outw + outw * 4];\n\n                int* xofs = buf;\n                float* alpha = (float*)(buf + outw);\n\n                cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int y = 0; y < h; y++)\n                {\n                    const float* ptr = bottom_blob.row(y);\n                    float* outptr = top_blob.row(y);\n                    const float* alphap = alpha;\n\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int sx = xofs[x] * 4;\n                        const float* Sp = ptr + sx;\n\n                        __m128 _a0 = __lsx_vreplfr2vr_s(alphap[0]);\n                        __m128 _a1 = __lsx_vreplfr2vr_s(alphap[1]);\n                        __m128 _a2 = __lsx_vreplfr2vr_s(alphap[2]);\n                        __m128 _a3 = __lsx_vreplfr2vr_s(alphap[3]);\n\n                        __m128 _S0 = (__m128)__lsx_vld(Sp - 4, 0);\n                        __m128 _S1 = (__m128)__lsx_vld(Sp + 0, 0);\n        
                __m128 _S2 = (__m128)__lsx_vld(Sp + 4, 0);\n                        __m128 _S3 = (__m128)__lsx_vld(Sp + 8, 0);\n                        __m128 _p = __lsx_vfmul_s(_S0, _a0);\n                        _p = __lsx_vfmadd_s(_a1, _S1, _p);\n                        _p = __lsx_vfmadd_s(_a2, _S2, _p);\n                        _p = __lsx_vfmadd_s(_a3, _S3, _p);\n                        __lsx_vst(_p, outptr, 0);\n\n                        alphap += 4;\n                        outptr += 4;\n                    }\n                }\n\n                delete[] buf;\n            }\n\n            return 0;\n        }\n#endif // __loongarch_sx\n\n        if (resize_type == 1) // nearest\n        {\n            const float ws = (output_width || !size_expr.empty()) ? w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outw * 2];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const float* Sp = ptr + sx;\n                    float a0 = alphap[0];\n    
                float a1 = alphap[1];\n                    *outptr++ = Sp[0] * a0 + Sp[1] * a1;\n                    alphap += 2;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outw * 4];\n\n            int* xofs = buf;\n            float* alpha = (float*)(buf + outw);\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int y = 0; y < h; y++)\n            {\n                const float* ptr = bottom_blob.row(y);\n                float* outptr = top_blob.row(y);\n                const float* alphap = alpha;\n\n                for (int x = 0; x < outw; x++)\n                {\n                    int sx = xofs[x];\n                    const float* Sp = ptr + sx;\n                    float a0 = alphap[0];\n                    float a1 = alphap[1];\n                    float a2 = alphap[2];\n                    float a3 = alphap[3];\n                    *outptr++ = Sp[-1] * a0 + Sp[0] * a1 + Sp[1] * a2 + Sp[2] * a3;\n                    alphap += 4;\n                }\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n\n    if (outw == w && outh == h)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __loongarch_sx\n    if (elempack == 4)\n    {\n        if (resize_type == 1) // nearest\n        {\n            const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n            const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                for (int y = 0; y < outh; y++)\n                {\n                    int in_y = std::min((int)(y * hs), (h - 1));\n\n                    const float* ptr = src.row(in_y);\n                    float* outptr = dst.row(y);\n                    for (int x = 0; x < outw; x++)\n                    {\n                        int in_x = std::min((int)(x * ws), (w - 1));\n\n                        __m128 _p = (__m128)__lsx_vld(ptr + in_x * 4, 0);\n                        __lsx_vst(_p, outptr, 0);\n\n                        outptr += 4;\n                    }\n                }\n            }\n        }\n\n        if (resize_type == 2) // bilinear\n        {\n            int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n            int* xofs = buf;        //new int[outw];\n            int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n            float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n            linear_coeffs(w, outw, xofs, alpha, align_corner);\n            linear_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bilinear_image_pack4(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        if (resize_type == 3) // bicubic\n        {\n            int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n            int* xofs = buf;        //new int[outw];\n    
        int* yofs = buf + outw; //new int[outh];\n\n            float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n            float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n            cubic_coeffs(w, outw, xofs, alpha, align_corner);\n            cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat src = bottom_blob.channel(q);\n                Mat dst = top_blob.channel(q);\n\n                resize_bicubic_image_pack4(src, dst, alpha, xofs, beta, yofs);\n            }\n\n            delete[] buf;\n        }\n\n        return 0;\n    }\n#endif // __loongarch_sx\n\n    if (resize_type == 1) // nearest\n    {\n        const float hs = (output_height || !size_expr.empty()) ? h / (float)outh : 1.f / height_scale;\n        const float ws = (output_width || !size_expr.empty()) ? 
w / (float)outw : 1.f / width_scale;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            for (int y = 0; y < outh; y++)\n            {\n                int in_y = std::min((int)(y * hs), (h - 1));\n\n                const float* ptr = src.row(in_y);\n                float* outptr = dst.row(y);\n                for (int x = 0; x < outw; x++)\n                {\n                    int in_x = std::min((int)(x * ws), (w - 1));\n                    *outptr++ = ptr[in_x];\n                }\n            }\n        }\n    }\n\n    if (resize_type == 2) // bilinear\n    {\n        int* buf = new int[outw + outh + outw * 2 + outh * 2];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 2];\n        float* beta = (float*)(buf + outw + outh + outw * 2); //new float[outh * 2];\n\n        linear_coeffs(w, outw, xofs, alpha, align_corner);\n        linear_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bilinear_image(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    if (resize_type == 3) // bicubic\n    {\n        int* buf = new int[outw + outh + outw * 4 + outh * 4];\n\n        int* xofs = buf;        //new int[outw];\n        int* yofs = buf + outw; //new int[outh];\n\n        float* alpha = (float*)(buf + outw + outh);           //new float[outw * 4];\n        float* beta = (float*)(buf + outw + outh + outw * 4); //new float[outh * 4];\n\n        cubic_coeffs(w, outw, xofs, alpha, align_corner);\n     
   cubic_coeffs(h, outh, yofs, beta, align_corner);\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const Mat src = bottom_blob.channel(q);\n            Mat dst = top_blob.channel(q);\n\n            resize_bicubic_image(src, dst, alpha, xofs, beta, yofs);\n        }\n\n        delete[] buf;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/interp_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_INTERP_LOONGARCH_H\n#define LAYER_INTERP_LOONGARCH_H\n\n#include \"interp.h\"\n\nnamespace ncnn {\n\nclass Interp_loongarch : public Interp\n{\npublic:\n    Interp_loongarch();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_INTERP_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/lasx_mathfun.h",
    "content": "// Copyright 2025 AtomAlpaca <atal@anche.no>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LASX_MATHFUN_H\n#define LASX_MATHFUN_H\n\n#include \"loongarch_usability.h\"\n\n#include <lasxintrin.h>\n\n_LOONGARCH_FLOAT_CONST_PS256(c_0, 0.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_1, 1.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_2, 2.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_3, 3.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_4, 4.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_n1, -1.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_n3, -3.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_0p5, 0.5f);\n_LOONGARCH_FLOAT_CONST_PS256(c_eps, 1E-8f);\n\n#define c_inv_mant_mask ~0x7f800000u\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_SQRTHF, 0.707106781186547524);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p0, 7.0376836292E-2);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p1, -1.1514610310E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p2, 1.1676998740E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p3, -1.2420140846E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p4, +1.4249322787E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p5, -1.6668057665E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p6, +2.0000714765E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p7, -2.4999993993E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_p8, +3.3333331174E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_q1, -2.12194440e-4);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_log_q2, 0.693359375);\n\n/* natural logarithm computed for 4 simultaneous float\n *   return NaN for x <= 0\n */\nstatic inline __m256 log256_ps(__m256 x)\n{\n    __m256 one = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i);\n\n    x = __lasx_xvfmax_s(x, (__m256)__lasx_xvreplgr2vr_w(0)); /* force flush to zero on denormal values */\n    __m256i invalid_mask = __lasx_xvfcmp_cle_s(x, (__m256)__lasx_xvreplgr2vr_w(0));\n\n    __m256i ux = (__m256i)(x);\n\n    __m256i emm0 = __lasx_xvsrl_w(ux, __lasx_xvreplgr2vr_w(23));\n\n    /* keep only the fractional part */\n    ux = __lasx_xvand_v(ux, 
__lasx_xvreplgr2vr_w(c_inv_mant_mask));\n    ux = __lasx_xvor_v(ux, __lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n    x = (__m256)(ux);\n\n    emm0 = __lasx_xvsub_w(emm0, __lasx_xvreplgr2vr_w(0x7f));\n    __m256 e = __lasx_xvffint_s_w(emm0);\n\n    e = __lasx_xvfadd_s(e, one);\n\n    /* part2:\n     *     if( x < SQRTHF ) {\n     *       e -= 1;\n     *       x = x + x - 1.0;\n     *     } else { x = x - 1.0; }\n     */\n    __m256i mask = __lasx_xvfcmp_clt_s((__m256)x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_SQRTHF.i));\n    __m256 tmp = (__m256)(__lasx_xvand_v((__m256i)(x), (__m256i)mask));\n    x = __lasx_xvfsub_s(x, one);\n    e = __lasx_xvfsub_s(e, (__m256)(__lasx_xvand_v((__m256i)(one), (__m256i)mask)));\n    x = __lasx_xvfadd_s(x, tmp);\n\n    __m256 z = __lasx_xvfmul_s(x, x);\n\n    __m256 y = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p0.i);\n\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p1.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p2.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p3.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p4.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p5.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p6.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p7.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_p8.i));\n    y = __lasx_xvfmul_s(y, x);\n\n    y = __lasx_xvfmul_s(y, z);\n\n    tmp = __lasx_xvfmul_s(e, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_q1.i));\n    y = __lasx_xvfadd_s(y, tmp);\n\n    tmp = __lasx_xvfmul_s(z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n    y = __lasx_xvfsub_s(y, tmp);\n\n    tmp = __lasx_xvfmul_s(e, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_log_q2.i));\n    x = 
__lasx_xvfadd_s(x, y);\n    x = __lasx_xvfadd_s(x, tmp);\n    x = (__m256)(__lasx_xvor_v((__m256i)(x), (__m256i)invalid_mask)); // negative arg will be NAN\n    return x;\n}\n\n_LOONGARCH_FLOAT_CONST_PS256(c_exp_hi, 88.3762626647949f);\n_LOONGARCH_FLOAT_CONST_PS256(c_exp_lo, -88.3762626647949f);\n\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_LOG2EF, 1.44269504088896341);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_C1, 0.693359375);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_C2, -2.12194440e-4);\n\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_p0, 1.9875691500E-4);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_p1, 1.3981999507E-3);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_p2, 8.3334519073E-3);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_p3, 4.1665795894E-2);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_p4, 1.6666665459E-1);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_exp_p5, 5.0000001201E-1);\n\n/* exp() computed for 8 floats at once */\nstatic inline __m256 exp256_ps(__m256 x)\n{\n    __m256 tmp, fx;\n\n    __m256 one = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i);\n    x = __lasx_xvfmin_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_exp_hi.i));\n    x = __lasx_xvfmax_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_exp_lo.i));\n\n    /* express exp(x) as exp(g + n*log(2)) */\n    fx = __lasx_xvfmul_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_LOG2EF.i));\n    fx = __lasx_xvfadd_s(fx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n\n    /* perform a floorf */\n    tmp = __lasx_xvffint_s_w(__lasx_xvftint_w_s(fx));\n\n    /* if greater, subtract 1 */\n    __m256i mask = __lasx_xvfcmp_clt_s(fx, tmp);\n    mask = __lasx_xvand_v(mask, (__m256i)one);\n\n    fx = __lasx_xvfsub_s(tmp, (__m256)mask);\n\n    tmp = __lasx_xvfmul_s(fx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_C1.i));\n    __m256 z = __lasx_xvfmul_s(fx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_C2.i));\n    x = __lasx_xvfsub_s(x, tmp);\n    x = __lasx_xvfsub_s(x, z);\n\n    __m256 y = 
(__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_p0.i);\n\n    z = __lasx_xvfmul_s(x, x);\n\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_p1.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_p2.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_p3.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_p4.i));\n    y = __lasx_xvfmadd_s(x, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_exp_p5.i));\n\n    y = __lasx_xvfmul_s(y, z);\n    y = __lasx_xvfadd_s(y, x);\n    y = __lasx_xvfadd_s(y, one);\n\n    /* build 2^n */\n    __m256i mm;\n    mm = __lasx_xvftintrz_w_s(fx);\n    mm = __lasx_xvadd_w(mm, __lasx_xvreplgr2vr_w(0x7f));\n    mm = __lasx_xvsll_w(mm, __lasx_xvreplgr2vr_w(23));\n\n    y = __lasx_xvfmul_s(y, (__m256)mm);\n    return y;\n}\n\n_LOONGARCH_FLOAT_CONST_PS256(c_minus_cephes_DP1, -0.78515625f);\n_LOONGARCH_FLOAT_CONST_PS256(c_minus_cephes_DP2, -2.4187564849853515625e-4f);\n_LOONGARCH_FLOAT_CONST_PS256(c_minus_cephes_DP3, -3.77489497744594108e-8f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_sin_p0, -1.9515295891E-4f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_sin_p1, 8.3321608736E-3f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_sin_p2, -1.6666654611E-1f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_cos_p0, 2.443315711809948E-005f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_cos_p1, -1.388731625493765E-003f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_cos_p2, 4.166664568298827E-002f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_FOPI, 1.27323954473516f); // 4/PI\n\nstatic inline __m256 sin256_ps(__m256 x)\n{\n    __m256 y;\n    __m256i swap_sign_bit, poly_mask, sign_bit;\n    __m256 n0p5 = __lasx_xvfmul_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_n1.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n\n    sign_bit = __lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x80000000));\n    x = (__m256)__lasx_xvand_v((__m256i)x, 
__lasx_xvreplgr2vr_w(0x7fffffff));\n\n    y = __lasx_xvfmul_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_FOPI.i));\n\n    poly_mask = __lasx_xvftintrz_w_s(y);\n    poly_mask = __lasx_xvadd_w(poly_mask, __lasx_xvreplgr2vr_w(1));\n    poly_mask = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(~1));\n    y = __lasx_xvffint_s_w(poly_mask);\n\n    swap_sign_bit = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(4));\n    swap_sign_bit = __lasx_xvslli_w(swap_sign_bit, 29);\n\n    poly_mask = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(2));\n    poly_mask = __lasx_xvseq_w(poly_mask, __lasx_xvreplgr2vr_w(0));\n\n    sign_bit = __lasx_xvxor_v(sign_bit, swap_sign_bit);\n\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP1.i), x);\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP2.i), x);\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP3.i), x);\n\n    y = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p0.i);\n    __m256 z = __lasx_xvfmul_s(x, x);\n    y = __lasx_xvfmadd_s(y, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p1.i));\n    y = __lasx_xvfmadd_s(y, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p2.i));\n    y = __lasx_xvfmul_s(y, z);\n    y = __lasx_xvfmul_s(y, z);\n    y = __lasx_xvfmadd_s(z, n0p5, y);\n    y = __lasx_xvfadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i));\n\n    __m256 y2 = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p0.i);\n    y2 = __lasx_xvfmadd_s(y2, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p1.i));\n    y2 = __lasx_xvfmadd_s(y2, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p2.i));\n    y2 = __lasx_xvfmul_s(y2, z);\n    y2 = __lasx_xvfmadd_s(y2, x, x);\n\n    y2 = (__m256)__lasx_xvand_v((__m256i)y2, poly_mask);\n    y = (__m256)__lasx_xvand_v(__lasx_xvxor_v(poly_mask, __lasx_xvreplgr2vr_w(0xffffffff)), (__m256i)y);\n    y = __lasx_xvfadd_s(y, y2);\n    y = (__m256)__lasx_xvxor_v((__m256i)y, sign_bit);\n\n  
  return y;\n}\n\nstatic inline __m256 cos256_ps(__m256 x)\n{\n    __m256 y;\n    __m256i swap_sign_bit, poly_mask, sign_bit;\n    __m256 n0p5 = __lasx_xvfmul_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_n1.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n\n    x = (__m256)__lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x7fffffff));\n\n    y = __lasx_xvfmul_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_FOPI.i));\n\n    poly_mask = __lasx_xvftintrz_w_s(y);\n    poly_mask = __lasx_xvadd_w(poly_mask, __lasx_xvreplgr2vr_w(1));\n    poly_mask = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(~1));\n    y = __lasx_xvffint_s_w(poly_mask);\n    poly_mask = __lasx_xvsub_w(poly_mask, __lasx_xvreplgr2vr_w(2));\n\n    swap_sign_bit = __lasx_xvandn_v(poly_mask, __lasx_xvreplgr2vr_w(4));\n    swap_sign_bit = __lasx_xvslli_w(swap_sign_bit, 29);\n\n    poly_mask = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(2));\n    poly_mask = __lasx_xvseq_w(poly_mask, __lasx_xvreplgr2vr_w(0));\n\n    sign_bit = swap_sign_bit;\n\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP1.i), x);\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP2.i), x);\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP3.i), x);\n\n    y = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p0.i);\n    __m256 z = __lasx_xvfmul_s(x, x);\n    y = __lasx_xvfmadd_s(y, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p1.i));\n    y = __lasx_xvfmadd_s(y, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p2.i));\n    y = __lasx_xvfmul_s(y, z);\n    y = __lasx_xvfmul_s(y, z);\n    y = __lasx_xvfmadd_s(z, n0p5, y);\n    y = __lasx_xvfadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i));\n\n    __m256 y2 = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p0.i);\n    y2 = __lasx_xvfmadd_s(y2, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p1.i));\n    y2 = __lasx_xvfmadd_s(y2, z, 
(__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p2.i));\n    y2 = __lasx_xvfmul_s(y2, z);\n    y2 = __lasx_xvfmadd_s(y2, x, x);\n\n    y2 = (__m256)__lasx_xvand_v((__m256i)y2, poly_mask);\n    y = (__m256)__lasx_xvandn_v(poly_mask, (__m256i)y);\n    y = __lasx_xvfadd_s(y, y2);\n    y = (__m256)__lasx_xvxor_v((__m256i)y, sign_bit);\n\n    return y;\n}\n\nstatic inline void sincos256_ps(__m256 x, __m256* s, __m256* c)\n{\n    __m256 y;\n    __m256i swap_sign_bit_cos, swap_sign_bit_sin, poly_mask, sign_bit_sin, sign_bit_cos;\n    __m256 n0p5 = __lasx_xvfmul_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_n1.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n\n    sign_bit_sin = __lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x80000000));\n    x = (__m256)__lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x7fffffff));\n\n    y = __lasx_xvfmul_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_FOPI.i));\n\n    poly_mask = __lasx_xvftintrz_w_s(y);\n    poly_mask = __lasx_xvadd_w(poly_mask, __lasx_xvreplgr2vr_w(1));\n    poly_mask = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(~1));\n    y = __lasx_xvffint_s_w(poly_mask);\n\n    swap_sign_bit_cos = __lasx_xvsub_w(poly_mask, __lasx_xvreplgr2vr_w(2));\n    swap_sign_bit_cos = __lasx_xvandn_v(swap_sign_bit_cos, __lasx_xvreplgr2vr_w(4));\n    swap_sign_bit_cos = __lasx_xvslli_w(swap_sign_bit_cos, 29);\n\n    swap_sign_bit_sin = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(4));\n    swap_sign_bit_sin = __lasx_xvslli_w(swap_sign_bit_sin, 29);\n\n    poly_mask = __lasx_xvand_v(poly_mask, __lasx_xvreplgr2vr_w(2));\n    poly_mask = __lasx_xvseq_w(poly_mask, __lasx_xvreplgr2vr_w(0));\n\n    sign_bit_sin = __lasx_xvxor_v(sign_bit_sin, swap_sign_bit_sin);\n    sign_bit_cos = swap_sign_bit_cos;\n\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP1.i), x);\n    x = __lasx_xvfmadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP2.i), x);\n    x = __lasx_xvfmadd_s(y, 
(__m256)__lasx_xvreplgr2vr_w(_ps256_c_minus_cephes_DP3.i), x);\n\n    __m256 z = __lasx_xvfmul_s(x, x);\n    y = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p0.i);\n    y = __lasx_xvfmadd_s(y, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p1.i));\n    y = __lasx_xvfmadd_s(y, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_cos_p2.i));\n    y = __lasx_xvfmul_s(y, z);\n    y = __lasx_xvfmul_s(y, z);\n    y = __lasx_xvfmadd_s(z, n0p5, y);\n    y = __lasx_xvfadd_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i));\n\n    __m256 y2 = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p0.i);\n    y2 = __lasx_xvfmadd_s(y2, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p1.i));\n    y2 = __lasx_xvfmadd_s(y2, z, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_sin_p2.i));\n    y2 = __lasx_xvfmul_s(y2, z);\n    y2 = __lasx_xvfmadd_s(y2, x, x);\n\n    __m256 ysin1 = (__m256)__lasx_xvandn_v(poly_mask, (__m256i)y);\n    __m256 ysin2 = (__m256)__lasx_xvand_v(poly_mask, (__m256i)y2);\n    y2 = __lasx_xvfsub_s(y2, ysin2);\n    y = __lasx_xvfsub_s(y, ysin1);\n\n    ysin1 = __lasx_xvfadd_s(ysin1, ysin2);\n    y = __lasx_xvfadd_s(y, y2);\n\n    *s = (__m256)__lasx_xvxor_v((__m256i)ysin1, sign_bit_sin);\n    *c = (__m256)__lasx_xvxor_v((__m256i)y, sign_bit_cos);\n}\n\nstatic inline __m256 tan256_ps(__m256 x)\n{\n    __m256 ysin, ycos;\n    __m256 eps = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_eps.i);\n    __m256 zero = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0.i);\n    sincos256_ps(x, &ysin, &ycos);\n    __m256i mask = __lasx_xvfcmp_ceq_s(ycos, zero);\n    mask = __lasx_xvand_v(mask, (__m256i)eps);\n    ycos = __lasx_xvfadd_s(ycos, (__m256)mask);\n    __m256 ytan = __lasx_xvfdiv_s(ysin, ycos);\n    return ytan;\n}\n\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_a4, 0.023994016f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_a5, 0.042417344f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_a2, 0.07494697f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_a3, 
0.045520633f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_a0, 1.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_a1, 0.166667819f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_half_pi, 1.5707964f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_pi, 3.1415927f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_asin_npi, -3.1415927f);\n\nstatic inline __m256 asin256_ps(__m256 x)\n{\n    __m256 big_input_approx, input_approx, square_of_input_approx, fourth_power_of_input_approx;\n    __m256 is_big_input_one, output_approx, final_approx;\n    __m256 tmp1, tmp2, tmp3, tmp4;\n    __m256i mask, is_small_input, is_big_input;\n\n    mask = __lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x80000000));\n    x = (__m256)__lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x7fffffff));\n\n    is_small_input = __lasx_xvfcmp_cle_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n    is_big_input = __lasx_xvxor_v(is_small_input, __lasx_xvreplgr2vr_w(0xffffffff));\n    is_big_input_one = (__m256)__lasx_xvand_v(__lasx_xvreplgr2vr_w(_ps256_c_1.i), is_big_input);\n\n    big_input_approx = __lasx_xvfsub_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i), x);\n    big_input_approx = __lasx_xvfmul_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i), big_input_approx);\n    big_input_approx = __lasx_xvfsqrt_s(big_input_approx);\n\n    input_approx = (__m256)__lasx_xvand_v(is_small_input, (__m256i)x);\n    input_approx = (__m256)__lasx_xvor_v((__m256i)input_approx, __lasx_xvand_v(is_big_input, (__m256i)big_input_approx));\n\n    square_of_input_approx = __lasx_xvfmul_s(input_approx, input_approx);\n    fourth_power_of_input_approx = __lasx_xvfmul_s(square_of_input_approx, square_of_input_approx);\n\n    tmp1 = __lasx_xvfmadd_s(fourth_power_of_input_approx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a4.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a2.i));\n    tmp2 = __lasx_xvfmadd_s(fourth_power_of_input_approx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a5.i), 
(__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a3.i));\n    tmp3 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp1, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a0.i));\n    tmp4 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp2, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a1.i));\n    output_approx = __lasx_xvfmadd_s(square_of_input_approx, tmp4, tmp3);\n\n    tmp1 = __lasx_xvfmul_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_half_pi.i), is_big_input_one);\n    tmp2 = __lasx_xvfmul_s(output_approx, input_approx);\n    tmp3 = __lasx_xvfmadd_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_n3.i), is_big_input_one, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i));\n\n    final_approx = __lasx_xvfmadd_s(tmp2, tmp3, tmp1);\n    final_approx = (__m256)__lasx_xvor_v((__m256i)final_approx, mask);\n\n    return final_approx;\n}\n\nstatic inline __m256 acos256_ps(__m256 x)\n{\n    __m256 big_input_approx, input_approx, square_of_input_approx, fourth_power_of_input_approx;\n    __m256 output_approx, final_approx, small_final_approx, big_final_approx;\n    __m256 tmp1, tmp2, tmp3, tmp4;\n    __m256i mask, mask2, is_small_input, is_big_input, lt_zero;\n\n    lt_zero = __lasx_xvfcmp_clt_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0.i));\n    mask = __lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x80000000));\n    x = (__m256)__lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x7fffffff));\n\n    is_small_input = __lasx_xvfcmp_cle_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i));\n    is_big_input = __lasx_xvxor_v(is_small_input, __lasx_xvreplgr2vr_w(0xffffffff));\n\n    big_input_approx = __lasx_xvfsub_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i), x);\n    big_input_approx = __lasx_xvfmul_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_0p5.i), big_input_approx);\n    big_input_approx = __lasx_xvfsqrt_s(big_input_approx);\n\n    input_approx = (__m256)__lasx_xvand_v(is_small_input, (__m256i)x);\n    input_approx = (__m256)__lasx_xvor_v((__m256i)input_approx, 
__lasx_xvand_v(is_big_input, (__m256i)big_input_approx));\n\n    square_of_input_approx = __lasx_xvfmul_s(input_approx, input_approx);\n    fourth_power_of_input_approx = __lasx_xvfmul_s(square_of_input_approx, square_of_input_approx);\n\n    tmp1 = __lasx_xvfmadd_s(fourth_power_of_input_approx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a4.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a2.i));\n    tmp2 = __lasx_xvfmadd_s(fourth_power_of_input_approx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a5.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a3.i));\n    tmp3 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp1, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a0.i));\n    tmp4 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp2, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_a1.i));\n    output_approx = __lasx_xvfmadd_s(square_of_input_approx, tmp4, tmp3);\n\n    tmp1 = __lasx_xvfmul_s(input_approx, output_approx);\n\n    small_final_approx = (__m256)__lasx_xvor_v((__m256i)tmp1, mask);\n    small_final_approx = __lasx_xvfsub_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_half_pi.i), small_final_approx);\n\n    big_final_approx = (__m256)__lasx_xvand_v(lt_zero, __lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_pi.i));\n    tmp1 = __lasx_xvfadd_s(tmp1, tmp1);\n    tmp1 = (__m256)__lasx_xvor_v((__m256i)tmp1, mask);\n    big_final_approx = __lasx_xvfadd_s(big_final_approx, tmp1);\n\n    final_approx = (__m256)__lasx_xvand_v(is_small_input, (__m256i)small_final_approx);\n    final_approx = (__m256)__lasx_xvor_v((__m256i)final_approx, __lasx_xvand_v(is_big_input, (__m256i)big_final_approx));\n\n    return final_approx;\n}\n\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x0, 1.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x1, -0.33333072f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x2, 0.1999262f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x3, -0.14203644f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x4, 
0.10640934f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x5, -0.07504295f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x6, 0.04269152f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x7, -0.01606863f);\n_LOONGARCH_FLOAT_CONST_PS256(c_cephes_atan_x8, 0.0028498897f);\n\nstatic inline __m256 atan256_ps(__m256 x)\n{\n    __m256i mask, is_small_input, is_big_input;\n    __m256 tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, input_approx, output_approx;\n    __m256 square_of_input_approx, fourth_power_of_input_approx;\n\n    mask = __lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x80000000));\n    x = (__m256)__lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x7fffffff));\n\n    is_small_input = __lasx_xvfcmp_clt_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_1.i), x);\n    is_big_input = __lasx_xvxor_v(is_small_input, __lasx_xvreplgr2vr_w(0xffffffff));\n\n    tmp1 = (__m256)__lasx_xvand_v(is_small_input, __lasx_xvreplgr2vr_w(_ps256_c_n1.i));\n    tmp1 = (__m256)__lasx_xvor_v(__lasx_xvand_v(is_big_input, (__m256i)x), (__m256i)tmp1);\n\n    tmp2 = (__m256)__lasx_xvand_v(is_small_input, (__m256i)x);\n    tmp2 = (__m256)__lasx_xvor_v(__lasx_xvand_v((__m256i)is_big_input, __lasx_xvreplgr2vr_w(_ps256_c_1.i)), (__m256i)tmp2);\n\n    input_approx = __lasx_xvfdiv_s(tmp1, tmp2);\n    square_of_input_approx = __lasx_xvfmul_s(input_approx, input_approx);\n    fourth_power_of_input_approx = __lasx_xvfmul_s(square_of_input_approx, square_of_input_approx);\n\n    tmp1 = __lasx_xvfmadd_s(fourth_power_of_input_approx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x7.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x5.i));\n    tmp2 = __lasx_xvfmadd_s(fourth_power_of_input_approx, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x8.i), (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x6.i));\n    tmp3 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp1, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x3.i));\n    tmp4 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp2, 
(__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x4.i));\n    tmp5 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp3, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x1.i));\n    tmp6 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp4, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x2.i));\n    tmp7 = __lasx_xvfmadd_s(fourth_power_of_input_approx, tmp6, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_cephes_atan_x0.i));\n    output_approx = __lasx_xvfmadd_s(square_of_input_approx, tmp5, tmp7);\n\n    tmp1 = __lasx_xvfmul_s(input_approx, output_approx);\n    tmp2 = (__m256)__lasx_xvand_v(is_small_input, __lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_half_pi.i));\n    tmp1 = __lasx_xvfadd_s(tmp1, tmp2);\n    tmp1 = (__m256)__lasx_xvxor_v(mask, (__m256i)tmp1);\n    return tmp1;\n}\n\nstatic inline __m256 atan2256_ps(__m256 y, __m256 x)\n{\n    __m256i not_eq_zero_x, not_eq_zero_y, normal_mode, negative_mask_x, negative_mask_y;\n    __m256i lt_zero_mask_x, lt_zero_mask_y, ge_zero_mask_y, eq_zero_y;\n    __m256 pi_additions, tmp1, tmp2, normal_result, special_result, final_result;\n\n    not_eq_zero_x = __lasx_xvfcmp_cne_s(x, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0.i));\n    not_eq_zero_y = __lasx_xvfcmp_cne_s(y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_0.i));\n    eq_zero_y = __lasx_xvxor_v(not_eq_zero_y, __lasx_xvreplgr2vr_w(0xffffffff));\n    normal_mode = __lasx_xvand_v(not_eq_zero_x, not_eq_zero_y);\n    negative_mask_x = __lasx_xvand_v((__m256i)x, __lasx_xvreplgr2vr_w(0x80000000));\n    negative_mask_y = __lasx_xvand_v((__m256i)y, __lasx_xvreplgr2vr_w(0x80000000));\n\n    lt_zero_mask_x = __lasx_xvfcmp_clt_s(x, (__m256)__lasx_xvreplgr2vr_w(0));\n    lt_zero_mask_y = __lasx_xvfcmp_clt_s(y, (__m256)__lasx_xvreplgr2vr_w(0));\n    ge_zero_mask_y = __lasx_xvxor_v(lt_zero_mask_y, __lasx_xvreplgr2vr_w(0xffffffff));\n\n    pi_additions = (__m256)__lasx_xvand_v(lt_zero_mask_y, __lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_npi.i));\n    pi_additions = 
(__m256)__lasx_xvor_v(__lasx_xvand_v(ge_zero_mask_y, __lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_pi.i)), (__m256i)pi_additions);\n    pi_additions = (__m256)__lasx_xvand_v(lt_zero_mask_x, (__m256i)pi_additions);\n\n    normal_result = __lasx_xvfdiv_s(y, x);\n    normal_result = __lasx_xvfadd_s(atan256_ps(normal_result), pi_additions);\n\n    tmp1 = (__m256)__lasx_xvand_v(negative_mask_y, __lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_half_pi.i));\n    tmp2 = (__m256)__lasx_xvand_v(negative_mask_x, __lasx_xvreplgr2vr_w(_ps256_c_cephes_asin_pi.i));\n    special_result = (__m256)__lasx_xvand_v(not_eq_zero_y, (__m256i)tmp1);\n    special_result = (__m256)__lasx_xvor_v(__lasx_xvand_v(eq_zero_y, (__m256i)tmp2), (__m256i)special_result);\n\n    final_result = (__m256)__lasx_xvand_v(normal_mode, (__m256i)normal_result);\n    normal_mode = __lasx_xvxor_v(normal_mode, __lasx_xvreplgr2vr_w(0xffffffff));\n    final_result = (__m256)__lasx_xvor_v(__lasx_xvand_v(normal_mode, (__m256i)special_result), (__m256i)final_result);\n\n    return final_result;\n}\n\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_tiny, 1e-4f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_hi, 9.0f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_alpha_1, 4.89352455891786e-3f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_alpha_3, 6.37261928875436e-4f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_alpha_5, 1.48572235717979e-5f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_alpha_7, 5.12229709037114e-8f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_alpha_9, -8.60467152213735e-11f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_alpha_11, 2.00018790482477e-13f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_alpha_13, -2.76076847742355e-16f);\n// The monomial coefficients of the denominator polynomial (even).\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_beta_0, 4.89352518554385e-3f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_beta_2, 2.26843463243900e-3f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_beta_4, 1.18534705686654e-4f);\n_LOONGARCH_FLOAT_CONST_PS256(c_tanh_beta_6, 1.19825839466702e-6f);\n\n/* tanh() 
computed for 8 floats at once */\nstatic inline __m256 tanh256_ps(__m256 x)\n{\n    __m256 x2 = (__m256)__lasx_xvbitclri_w((__m256i)x, 31);\n    __m256i tiny_mask = __lasx_xvfcmp_clt_s((__m256)x2, (__m256)(__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_tiny.i));\n    __m256i sig_mask = __lasx_xvreplgr2vr_w(1 << 31);\n    __m256i sig_save = __lasx_xvand_v((__m256i)x, sig_mask);\n\n    // clamp the inputs to the range [-9, 9] since anything outside\n    // this range is -/+1.0f in single-precision.\n    x2 = (__m256)__lasx_xvbitsel_v((__m256i)x2, (__m256i)__lasx_xvreplgr2vr_w(_ps256_c_tanh_hi.i), (__m256i)__lasx_xvfcmp_clt_s((__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_hi.i), (__m256)x2));\n\n    // since the polynomials are odd/even, we need x**2.\n    __m256 z = __lasx_xvfmul_s(x2, x2);\n\n    // evaluate the numerator polynomial y.\n    __m256 y = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_alpha_13.i);\n    y = __lasx_xvfmadd_s(z, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_alpha_11.i));\n    y = __lasx_xvfmadd_s(z, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_alpha_9.i));\n    y = __lasx_xvfmadd_s(z, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_alpha_7.i));\n    y = __lasx_xvfmadd_s(z, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_alpha_5.i));\n    y = __lasx_xvfmadd_s(z, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_alpha_3.i));\n    y = __lasx_xvfmadd_s(z, y, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_alpha_1.i));\n    y = __lasx_xvfmul_s(y, x2);\n\n    // evaluate the denominator polynomial w.\n    __m256 w = (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_beta_6.i);\n    w = __lasx_xvfmadd_s(z, w, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_beta_4.i));\n    w = __lasx_xvfmadd_s(z, w, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_beta_2.i));\n    w = __lasx_xvfmadd_s(z, w, (__m256)__lasx_xvreplgr2vr_w(_ps256_c_tanh_beta_0.i));\n\n    // divide the numerator by the denominator.\n    y = __lasx_xvfdiv_s(y, w);\n\n    // reinstate the sign.\n    y = 
(__m256)__lasx_xvor_v((__m256i)y, sig_save);\n\n    // when the argument is very small in magnitude it's more accurate to just return it.\n    y = (__m256)__lasx_xvbitsel_v((__m256i)y, (__m256i)x, (__m256i)tiny_mask);\n\n    return y;\n}\n\nstatic inline __m256 pow256_ps(__m256 a, __m256 b)\n{\n    // pow(x, m) = exp(m * log(x))\n    return exp256_ps(__lasx_xvfmul_s(b, log256_ps(a)));\n}\n\nstatic inline __m256 sigmoid256_ps(__m256 _v)\n{\n    __m256 _one = __lasx_xvreplfr2vr_s(1.f);\n    _v = (__m256)__lasx_xvbitrevi_w((__m256i)_v, 31);\n    _v = exp256_ps(_v);\n    _v = __lasx_xvfadd_s(_v, _one);\n    return __lasx_xvfdiv_s(_one, _v);\n}\n\n#endif // LASX_MATHFUN_H\n"
  },
  {
    "path": "src/layer/loongarch/loongarch_activation.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LOONGARCH_ACTIVATION_H\n#define LOONGARCH_ACTIVATION_H\n\n#include \"fused_activation.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n\nstatic inline __m128 activation_ps(__m128 _v, int activation_type, const ncnn::Mat& activation_params)\n{\n    if (activation_type == 1)\n    {\n        __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n        _v = __lsx_vfmax_s(_v, _zero);\n    }\n    else if (activation_type == 2)\n    {\n        __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n        __m128 _slope = (__m128)__lsx_vreplfr2vr_s(activation_params[0]);\n        __m128i _lemask = __lsx_vfcmp_cle_s(_v, _zero);\n        __m128 _ps = __lsx_vfmul_s(_v, _slope);\n        _v = (__m128)__lsx_vbitsel_v((__m128i)_v, (__m128i)_ps, (__m128i)_lemask);\n    }\n    else if (activation_type == 3)\n    {\n        __m128 _min = (__m128)__lsx_vreplfr2vr_s(activation_params[0]);\n        __m128 _max = (__m128)__lsx_vreplfr2vr_s(activation_params[1]);\n        _v = __lsx_vfmax_s(_v, _min);\n        _v = __lsx_vfmin_s(_v, _max);\n    }\n    else if (activation_type == 4)\n    {\n        _v = sigmoid_ps(_v);\n    }\n    else if (activation_type == 5)\n    {\n        _v = __lsx_vfmul_s(_v, tanh_ps(log_ps(__lsx_vfadd_s(exp_ps(_v), (__m128)__lsx_vreplfr2vr_s(1.f)))));\n    }\n    else if (activation_type == 6)\n    {\n        __m128 _alpha = (__m128)__lsx_vreplfr2vr_s(activation_params[0]);\n        __m128 _beta = (__m128)__lsx_vreplfr2vr_s(activation_params[1]);\n        __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n        __m128 _one = (__m128)__lsx_vreplfr2vr_s(1.f);\n        __m128 _outp = __lsx_vfmadd_s(_alpha, _v, _beta);\n        _outp = __lsx_vfmax_s(_outp, _zero);\n        _outp = __lsx_vfmin_s(_outp, _one);\n        _v = __lsx_vfmul_s(_outp, _v);\n    }\n\n    return _v;\n}\n#endif // __loongarch_sx\n\n#endif // 
LOONGARCH_ACTIVATION_H\n"
  },
  {
    "path": "src/layer/loongarch/loongarch_usability.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LOONGARCH_USABILITY_H\n#define LOONGARCH_USABILITY_H\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#if __loongarch_asx\n#include <lasxintrin.h>\n#endif // __loongarch_asx\n#endif // __loongarch_sx\n\n#include <stdint.h>\n\nnamespace ncnn {\n\ntypedef union\n{\n    int32_t i;\n    float f;\n} FloatInt;\n\n} // namespace ncnn\n\n#if __loongarch_sx\n/* declare some loongarch constants with union */\n#define _LOONGARCH_FLOAT_CONST(Name, Val) \\\n    static const ncnn::FloatInt Name = {.f = Val}\n#endif\n\n#if __loongarch_asx\n/* declare some loongarch constants with union */\n#define _LOONGARCH_FLOAT_CONST_PS256(Name, Val) \\\n    static const ncnn::FloatInt _ps256_##Name = {.f = Val}\n#endif\n\n#if __loongarch_sx\n/* float type data load instructions */\nstatic NCNN_FORCEINLINE __m128 __lsx_vreplfr2vr_s(float val)\n{\n    ncnn::FloatInt fi_tmpval = {.f = val};\n    return (__m128)__lsx_vreplgr2vr_w(fi_tmpval.i);\n}\n\nstatic NCNN_FORCEINLINE float __lsx_reduce_fadd_s(__m128 _v)\n{\n    // TODO find a more efficient way\n    float* _v_p = (float*)&_v;\n    return _v_p[0] + _v_p[1] + _v_p[2] + _v_p[3];\n}\n\nstatic NCNN_FORCEINLINE int __lsx_reduce_add_w(__m128i _v)\n{\n    // TODO find a more efficient way\n    int* _v_p = (int*)&_v;\n    return _v_p[0] + _v_p[1] + _v_p[2] + _v_p[3];\n}\n\n#endif // __loongarch_sx\n\n#if __loongarch_asx\n/* float type data load instructions */\nstatic NCNN_FORCEINLINE __m256 __lasx_xvreplfr2vr_s(float val)\n{\n    ncnn::FloatInt fi_tmpval = {.f = val};\n    return (__m256)__lasx_xvreplgr2vr_w(fi_tmpval.i);\n}\n\nstatic NCNN_FORCEINLINE float __lasx_reduce_fadd_s(__m256 _v)\n{\n    // TODO find a more efficient way\n    float* _v_p = (float*)&_v;\n    return _v_p[0] + _v_p[1] + _v_p[2] + _v_p[3] + _v_p[4] + _v_p[5] + _v_p[6] + _v_p[7];\n}\n\nstatic NCNN_FORCEINLINE int __lasx_reduce_add_w(__m256i 
_v)\n{\n    // TODO find a more efficient way\n    int* _v_p = (int*)&_v;\n    return _v_p[0] + _v_p[1] + _v_p[2] + _v_p[3] + _v_p[4] + _v_p[5] + _v_p[6] + _v_p[7];\n}\n#endif // __loongarch_asx\n\nstatic NCNN_FORCEINLINE signed char float2int8(float v)\n{\n    int int32 = round(v);\n    if (int32 > 127) return 127;\n    if (int32 < -127) return -127;\n    return (signed char)int32;\n}\n\n#if __loongarch_sx\nstatic NCNN_FORCEINLINE __m128i round(__m128 _v)\n{\n    __m128 _p5 = (__m128)__lsx_vreplfr2vr_s(0.5f);\n    __m128i _signmask = __lsx_vreplgr2vr_w(1 << 31);\n\n    __m128i _sign = __lsx_vand_v((__m128i)_v, _signmask);\n    __m128 _p5s = (__m128)__lsx_vor_v((__m128i)_p5, (__m128i)_sign);\n    __m128 _v5 = __lsx_vfadd_s(_v, _p5s);\n    __m128i _v32 = __lsx_vftintrz_w_s(_v5);\n\n    return _v32;\n}\n\nstatic NCNN_FORCEINLINE __m128i float2int8(__m128 _v)\n{\n    __m128i _v32 = round(_v);\n\n    __m128i _v32_16 = __lsx_vsat_w(_v32, 15);\n    __m128i _v16 = __lsx_vpickev_h(_v32_16, _v32_16);\n    _v16 = __lsx_vmax_h(_v16, __lsx_vreplgr2vr_h(-127));\n    __m128i _v16_8 = __lsx_vsat_h(_v16, 7);\n    __m128i _v8 = __lsx_vpickev_b(_v16_8, _v16_8);\n\n    return _v8;\n}\n\nstatic NCNN_FORCEINLINE int64_t float2int8(__m128 _vlow, __m128 _vhigh)\n{\n    // simulate round to nearest via +/-0.5\n    __m128 _p5 = (__m128)__lsx_vreplfr2vr_s(0.5f);\n    __m128i _signmask = __lsx_vreplgr2vr_w(1 << 31);\n\n    __m128i _signlow = __lsx_vand_v((__m128i)_vlow, _signmask);\n    __m128i _signhigh = __lsx_vand_v((__m128i)_vhigh, _signmask);\n    __m128 _p5low = (__m128)__lsx_vor_v((__m128i)_p5, _signlow);\n    __m128 _p5high = (__m128)__lsx_vor_v((__m128i)_p5, _signhigh);\n    __m128 _vlow5 = __lsx_vfadd_s(_vlow, _p5low);\n    __m128 _vhigh5 = __lsx_vfadd_s(_vhigh, _p5high);\n    __m128i _vlow32 = __lsx_vftintrz_w_s(_vlow5);\n    __m128i _vhigh32 = __lsx_vftintrz_w_s(_vhigh5);\n\n    __m128i _vlow32_16 = __lsx_vsat_w(_vlow32, 15);\n    __m128i _vhigh32_16 = __lsx_vsat_w(_vhigh32, 
15);\n    __m128i _v16 = __lsx_vpickev_h(_vhigh32_16, _vlow32_16);\n    _v16 = __lsx_vmax_h(_v16, __lsx_vreplgr2vr_h(-127));\n    __m128i _v16_8 = __lsx_vsat_h(_v16, 7);\n    __m128i _v8 = __lsx_vpickev_b(_v16_8, _v16_8);\n\n    return _v8[0];\n}\n\nstatic NCNN_FORCEINLINE __m128i float2int8relu(__m128 _v)\n{\n    __m128i _v32 = round(_v);\n\n    __m128i _v32_16 = __lsx_vsat_w(_v32, 15);\n    __m128i _v16 = __lsx_vpickev_h(_v32_16, _v32_16);\n    _v16 = __lsx_vmaxi_h(_v16, 0);\n    __m128i _v16_8 = __lsx_vsat_h(_v16, 7);\n    __m128i _v8 = __lsx_vpickev_b(_v16_8, _v16_8);\n\n    return _v8;\n}\n\nstatic NCNN_FORCEINLINE int64_t float2int8relu(__m128 _vlow, __m128 _vhigh)\n{\n    // simulate round to nearest via +/-0.5\n    __m128 _p5 = (__m128)__lsx_vreplfr2vr_s(0.5f);\n    __m128i _signmask = __lsx_vreplgr2vr_w(1 << 31);\n\n    __m128i _signlow = __lsx_vand_v((__m128i)_vlow, _signmask);\n    __m128i _signhigh = __lsx_vand_v((__m128i)_vhigh, _signmask);\n    __m128 _p5low = (__m128)__lsx_vor_v((__m128i)_p5, _signlow);\n    __m128 _p5high = (__m128)__lsx_vor_v((__m128i)_p5, _signhigh);\n    __m128 _vlow5 = __lsx_vfadd_s(_vlow, _p5low);\n    __m128 _vhigh5 = __lsx_vfadd_s(_vhigh, _p5high);\n    __m128i _vlow32 = __lsx_vftintrz_w_s(_vlow5);\n    __m128i _vhigh32 = __lsx_vftintrz_w_s(_vhigh5);\n\n    __m128i _vlow32_16 = __lsx_vsat_w(_vlow32, 15);\n    __m128i _vhigh32_16 = __lsx_vsat_w(_vhigh32, 15);\n    __m128i _v16 = __lsx_vpickev_h(_vhigh32_16, _vlow32_16);\n    _v16 = __lsx_vmaxi_h(_v16, 0);\n    __m128i _v16_8 = __lsx_vsat_h(_v16, 7);\n    __m128i _v8 = __lsx_vpickev_b(_v16_8, _v16_8);\n\n    return _v8[0];\n}\n\nstatic NCNN_FORCEINLINE __m128i float2int8leakyrelu(__m128 _v, __m128 _slope)\n{\n    __m128 _v_leaky = __lsx_vfmul_s(_v, _slope);\n\n    // simulate round to nearest via +/-0.5\n    __m128 _p5 = (__m128)__lsx_vreplfr2vr_s(0.5f);\n    __m128i _signmask = __lsx_vreplgr2vr_w(1 << 31);\n\n    __m128i _sign = __lsx_vand_v((__m128i)_v, _signmask);\n    
__m128 _p5s = (__m128)__lsx_vor_v((__m128i)_p5, _sign);\n    __m128 _v5 = __lsx_vfadd_s(_v, _p5s);\n    __m128i _v32 = __lsx_vftintrz_w_s(_v5);\n\n    __m128i _sign_leaky = __lsx_vand_v((__m128i)_v_leaky, _signmask);\n    __m128 _p5_leaky = (__m128)__lsx_vor_v((__m128i)_p5, _sign_leaky);\n    __m128 _v5_leaky = __lsx_vfadd_s(_v_leaky, _p5_leaky);\n    __m128i _v32_leaky = __lsx_vftintrz_w_s(_v5_leaky);\n\n    __m128i _v32_16 = __lsx_vsat_w(_v32, 15);\n    __m128i _v16 = __lsx_vpickev_h(_v32_16, _v32_16);\n\n    __m128i _v32_16_leaky = __lsx_vsat_w(_v32_leaky, 15);\n    __m128i _v16_leaky = __lsx_vpickev_h(_v32_16_leaky, _v32_16_leaky);\n\n    _v16 = __lsx_vmax_h(_v16, _v16_leaky);\n    __m128i _v16_8 = __lsx_vsat_h(_v16, 7);\n    __m128i _v8 = __lsx_vpickev_b(_v16_8, _v16_8);\n\n    return _v8;\n}\n\nstatic NCNN_FORCEINLINE int64_t float2int8leakyrelu(__m128 _vlow, __m128 _vhigh, __m128 _slope)\n{\n    __m128 _vlow_leaky = __lsx_vfmul_s(_vlow, _slope);\n    __m128 _vhigh_leaky = __lsx_vfmul_s(_vhigh, _slope);\n\n    // simulate round to nearest via +/-0.5\n    __m128i _p5 = (__m128i)__lsx_vreplfr2vr_s(0.5f);\n    __m128i _signmask = __lsx_vreplgr2vr_w(1 << 31);\n\n    __m128i _signlow = __lsx_vand_v((__m128i)_vlow, _signmask);\n    __m128i _signhigh = __lsx_vand_v((__m128i)_vhigh, _signmask);\n    __m128 _p5low = (__m128)__lsx_vor_v(_p5, _signlow);\n    __m128 _p5high = (__m128)__lsx_vor_v(_p5, _signhigh);\n    __m128 _vlow5 = __lsx_vfadd_s(_vlow, _p5low);\n    __m128 _vhigh5 = __lsx_vfadd_s(_vhigh, _p5high);\n    __m128i _vlow32 = __lsx_vftintrz_w_s(_vlow5);\n    __m128i _vhigh32 = __lsx_vftintrz_w_s(_vhigh5);\n\n    __m128i _signlow_leaky = __lsx_vand_v((__m128i)_vlow_leaky, _signmask);\n    __m128i _signhigh_leaky = __lsx_vand_v((__m128i)_vhigh_leaky, _signmask);\n    __m128 _p5low_leaky = (__m128)__lsx_vor_v(_p5, _signlow_leaky);\n    __m128 _p5high_leaky = (__m128)__lsx_vor_v(_p5, _signhigh_leaky);\n    __m128 _vlow5_leaky = __lsx_vfadd_s(_vlow_leaky, 
_p5low_leaky);\n    __m128 _vhigh5_leaky = __lsx_vfadd_s(_vhigh_leaky, _p5high_leaky);\n    __m128i _vlow32_leaky = __lsx_vftintrz_w_s(_vlow5_leaky);\n    __m128i _vhigh32_leaky = __lsx_vftintrz_w_s(_vhigh5_leaky);\n\n    __m128i _vlow32_16 = __lsx_vsat_w(_vlow32, 15);\n    __m128i _vhigh32_16 = __lsx_vsat_w(_vhigh32, 15);\n    __m128i _v16 = __lsx_vpickev_h(_vhigh32_16, _vlow32_16);\n\n    __m128i _vlow32_16_leaky = __lsx_vsat_w(_vlow32_leaky, 15);\n    __m128i _vhigh32_16_leaky = __lsx_vsat_w(_vhigh32_leaky, 15);\n    __m128i _v16_leaky = __lsx_vpickev_h(_vhigh32_16_leaky, _vlow32_16_leaky);\n\n    _v16 = __lsx_vmax_h(_v16, _v16_leaky);\n    __m128i _v16_8 = __lsx_vsat_h(_v16, 7);\n    __m128i _v8 = __lsx_vpickev_b(_v16_8, _v16_8);\n\n    return _v8[0];\n}\n#endif // __loongarch_sx\n\n#if __loongarch_asx\nstatic NCNN_FORCEINLINE __m256i round(__m256 _v)\n{\n    __m256 _p5 = (__m256)__lasx_xvreplfr2vr_s(0.5f);\n    __m256i _signmask = __lasx_xvreplgr2vr_w(1 << 31);\n\n    __m256i _sign = __lasx_xvand_v((__m256i)_v, _signmask);\n    __m256 _p5s = (__m256)__lasx_xvor_v((__m256i)_p5, (__m256i)_sign);\n    __m256 _v5 = __lasx_xvfadd_s(_v, _p5s);\n    __m256i _v32 = __lasx_xvftintrz_w_s(_v5);\n\n    return _v32;\n}\n\nstatic NCNN_FORCEINLINE __m256i float2int8(__m256 _v)\n{\n    __m256i _v32 = round(_v);\n\n    __m256i _v32_16 = __lasx_xvsat_w(_v32, 15);\n    __m256i _v16 = __lasx_xvpickev_h(_v32_16, _v32_16);\n    _v16 = __lasx_xvmax_h(_v16, __lasx_xvreplgr2vr_h(-127));\n    __m256i _v16_8 = __lasx_xvsat_h(_v16, 7);\n    __m256i _v8 = __lasx_xvpickev_b(_v16_8, _v16_8);\n\n    return _v8;\n}\n\nstatic NCNN_FORCEINLINE int64_t float2int8(__m256 _vlow, __m256 _vhigh)\n{\n    // simulate round to nearest via +/-0.5\n    __m256 _p5 = (__m256)__lasx_xvreplfr2vr_s(0.5f);\n    __m256i _signmask = __lasx_xvreplgr2vr_w(1 << 31);\n\n    __m256i _signlow = __lasx_xvand_v((__m256i)_vlow, _signmask);\n    __m256i _signhigh = __lasx_xvand_v((__m256i)_vhigh, _signmask);\n    __m256 
_p5low = (__m256)__lasx_xvor_v((__m256i)_p5, _signlow);\n    __m256 _p5high = (__m256)__lasx_xvor_v((__m256i)_p5, _signhigh);\n    __m256 _vlow5 = __lasx_xvfadd_s(_vlow, _p5low);\n    __m256 _vhigh5 = __lasx_xvfadd_s(_vhigh, _p5high);\n    __m256i _vlow32 = __lasx_xvftintrz_w_s(_vlow5);\n    __m256i _vhigh32 = __lasx_xvftintrz_w_s(_vhigh5);\n\n    __m256i _vlow32_16 = __lasx_xvsat_w(_vlow32, 15);\n    __m256i _vhigh32_16 = __lasx_xvsat_w(_vhigh32, 15);\n    __m256i _v16 = __lasx_xvpickev_h(_vhigh32_16, _vlow32_16);\n    _v16 = __lasx_xvmax_h(_v16, __lasx_xvreplgr2vr_h(-127));\n    __m256i _v16_8 = __lasx_xvsat_h(_v16, 7);\n    __m256i _v8 = __lasx_xvpickev_b(_v16_8, _v16_8);\n\n    return _v8[0];\n}\n\nstatic NCNN_FORCEINLINE __m256i float2int8relu(__m256 _v)\n{\n    __m256i _v32 = round(_v);\n\n    __m256i _v32_16 = __lasx_xvsat_w(_v32, 15);\n    __m256i _v16 = __lasx_xvpickev_h(_v32_16, _v32_16);\n    _v16 = __lasx_xvmaxi_h(_v16, 0);\n    __m256i _v16_8 = __lasx_xvsat_h(_v16, 7);\n    __m256i _v8 = __lasx_xvpickev_b(_v16_8, _v16_8);\n\n    return _v8;\n}\n\nstatic NCNN_FORCEINLINE int64_t float2int8relu(__m256 _vlow, __m256 _vhigh)\n{\n    // simulate round to nearest via +/-0.5\n    __m256 _p5 = (__m256)__lasx_xvreplfr2vr_s(0.5f);\n    __m256i _signmask = __lasx_xvreplgr2vr_w(1 << 31);\n\n    __m256i _signlow = __lasx_xvand_v((__m256i)_vlow, _signmask);\n    __m256i _signhigh = __lasx_xvand_v((__m256i)_vhigh, _signmask);\n    __m256 _p5low = (__m256)__lasx_xvor_v((__m256i)_p5, _signlow);\n    __m256 _p5high = (__m256)__lasx_xvor_v((__m256i)_p5, _signhigh);\n    __m256 _vlow5 = __lasx_xvfadd_s(_vlow, _p5low);\n    __m256 _vhigh5 = __lasx_xvfadd_s(_vhigh, _p5high);\n    __m256i _vlow32 = __lasx_xvftintrz_w_s(_vlow5);\n    __m256i _vhigh32 = __lasx_xvftintrz_w_s(_vhigh5);\n\n    __m256i _vlow32_16 = __lasx_xvsat_w(_vlow32, 15);\n    __m256i _vhigh32_16 = __lasx_xvsat_w(_vhigh32, 15);\n    __m256i _v16 = __lasx_xvpickev_h(_vhigh32_16, _vlow32_16);\n    _v16 = 
__lasx_xvmaxi_h(_v16, 0);\n    __m256i _v16_8 = __lasx_xvsat_h(_v16, 7);\n    __m256i _v8 = __lasx_xvpickev_b(_v16_8, _v16_8);\n\n    return _v8[0];\n}\n\nstatic NCNN_FORCEINLINE __m256i float2int8leakyrelu(__m256 _v, __m256 _slope)\n{\n    __m256 _v_leaky = __lasx_xvfmul_s(_v, _slope);\n\n    // simulate round to nearest via +/-0.5\n    __m256 _p5 = (__m256)__lasx_xvreplfr2vr_s(0.5f);\n    __m256i _signmask = __lasx_xvreplgr2vr_w(1 << 31);\n\n    __m256i _sign = __lasx_xvand_v((__m256i)_v, _signmask);\n    __m256 _p5s = (__m256)__lasx_xvor_v((__m256i)_p5, _sign);\n    __m256 _v5 = __lasx_xvfadd_s(_v, _p5s);\n    __m256i _v32 = __lasx_xvftintrz_w_s(_v5);\n\n    __m256i _sign_leaky = __lasx_xvand_v((__m256i)_v_leaky, _signmask);\n    __m256 _p5_leaky = (__m256)__lasx_xvor_v((__m256i)_p5, _sign_leaky);\n    __m256 _v5_leaky = __lasx_xvfadd_s(_v_leaky, _p5_leaky);\n    __m256i _v32_leaky = __lasx_xvftintrz_w_s(_v5_leaky);\n\n    __m256i _v32_16 = __lasx_xvsat_w(_v32, 15);\n    __m256i _v16 = __lasx_xvpickev_h(_v32_16, _v32_16);\n\n    __m256i _v32_16_leaky = __lasx_xvsat_w(_v32_leaky, 15);\n    __m256i _v16_leaky = __lasx_xvpickev_h(_v32_16_leaky, _v32_16_leaky);\n\n    _v16 = __lasx_xvmax_h(_v16, _v16_leaky);\n    __m256i _v16_8 = __lasx_xvsat_h(_v16, 7);\n    __m256i _v8 = __lasx_xvpickev_b(_v16_8, _v16_8);\n\n    return _v8;\n}\n\nstatic NCNN_FORCEINLINE int64_t float2int8leakyrelu(__m256 _vlow, __m256 _vhigh, __m256 _slope)\n{\n    __m256 _vlow_leaky = __lasx_xvfmul_s(_vlow, _slope);\n    __m256 _vhigh_leaky = __lasx_xvfmul_s(_vhigh, _slope);\n\n    // simulate round to nearest via +/-0.5\n    __m256i _p5 = (__m256i)__lasx_xvreplfr2vr_s(0.5f);\n    __m256i _signmask = __lasx_xvreplgr2vr_w(1 << 31);\n\n    __m256i _signlow = __lasx_xvand_v((__m256i)_vlow, _signmask);\n    __m256i _signhigh = __lasx_xvand_v((__m256i)_vhigh, _signmask);\n    __m256 _p5low = (__m256)__lasx_xvor_v(_p5, _signlow);\n    __m256 _p5high = (__m256)__lasx_xvor_v(_p5, _signhigh);\n    __m256 
_vlow5 = __lasx_xvfadd_s(_vlow, _p5low);\n    __m256 _vhigh5 = __lasx_xvfadd_s(_vhigh, _p5high);\n    __m256i _vlow32 = __lasx_xvftintrz_w_s(_vlow5);\n    __m256i _vhigh32 = __lasx_xvftintrz_w_s(_vhigh5);\n\n    __m256i _signlow_leaky = __lasx_xvand_v((__m256i)_vlow_leaky, _signmask);\n    __m256i _signhigh_leaky = __lasx_xvand_v((__m256i)_vhigh_leaky, _signmask);\n    __m256 _p5low_leaky = (__m256)__lasx_xvor_v(_p5, _signlow_leaky);\n    __m256 _p5high_leaky = (__m256)__lasx_xvor_v(_p5, _signhigh_leaky);\n    __m256 _vlow5_leaky = __lasx_xvfadd_s(_vlow_leaky, _p5low_leaky);\n    __m256 _vhigh5_leaky = __lasx_xvfadd_s(_vhigh_leaky, _p5high_leaky);\n    __m256i _vlow32_leaky = __lasx_xvftintrz_w_s(_vlow5_leaky);\n    __m256i _vhigh32_leaky = __lasx_xvftintrz_w_s(_vhigh5_leaky);\n\n    __m256i _vlow32_16 = __lasx_xvsat_w(_vlow32, 15);\n    __m256i _vhigh32_16 = __lasx_xvsat_w(_vhigh32, 15);\n    __m256i _v16 = __lasx_xvpickev_h(_vhigh32_16, _vlow32_16);\n\n    __m256i _vlow32_16_leaky = __lasx_xvsat_w(_vlow32_leaky, 15);\n    __m256i _vhigh32_16_leaky = __lasx_xvsat_w(_vhigh32_leaky, 15);\n    __m256i _v16_leaky = __lasx_xvpickev_h(_vhigh32_16_leaky, _vlow32_16_leaky);\n\n    _v16 = __lasx_xvmax_h(_v16, _v16_leaky);\n    __m256i _v16_8 = __lasx_xvsat_h(_v16, 7);\n    __m256i _v8 = __lasx_xvpickev_b(_v16_8, _v16_8);\n\n    return _v8[0];\n}\n#endif // __loongarch_asx\n\n#endif // LOONGARCH_USABILITY_H\n"
  },
  {
    "path": "src/layer/loongarch/lsx_mathfun.h",
    "content": "/* LOONGARCH implementation of mathfun\n *\n *   Inspired by Intel Approximate Math library, and based on the\n *   corresponding algorithms of the cephes math library\n *   Copyright (C) 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>. All rights reserved.\n */\n\n/*\n *  This software is provided 'as-is', without any express or implied\n *  warranty.  In no event will the authors be held liable for any damages\n *  arising from the use of this software.\n *\n *  Permission is granted to anyone to use this software for any purpose,\n *  including commercial applications, and to alter it and redistribute it\n *  freely, subject to the following restrictions:\n *\n *  1. The origin of this software must not be misrepresented; you must not\n *     claim that you wrote the original software. If you use this software\n *     in a product, an acknowledgment in the product documentation would be\n *     appreciated but is not required.\n *  2. Altered source versions must be plainly marked as such, and must not be\n *     misrepresented as being the original software.\n *  3. 
This notice may not be removed or altered from any source distribution.\n *\n *  (this is the zlib license)\n */\n\n#ifndef LSX_MATHFUN_H\n#define LSX_MATHFUN_H\n\n#include \"loongarch_usability.h\"\n\n#include <lsxintrin.h>\n\n_LOONGARCH_FLOAT_CONST(c_0, 0.0f);\n_LOONGARCH_FLOAT_CONST(c_1, 1.0f);\n_LOONGARCH_FLOAT_CONST(c_2, 2.0f);\n_LOONGARCH_FLOAT_CONST(c_3, 3.0f);\n_LOONGARCH_FLOAT_CONST(c_4, 4.0f);\n_LOONGARCH_FLOAT_CONST(c_n1, -1.0f);\n_LOONGARCH_FLOAT_CONST(c_n3, -3.0f);\n_LOONGARCH_FLOAT_CONST(c_0p5, 0.5f);\n_LOONGARCH_FLOAT_CONST(c_eps, 1E-8f);\n\n#define c_inv_mant_mask ~0x7f800000u\n_LOONGARCH_FLOAT_CONST(c_cephes_SQRTHF, 0.707106781186547524);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p0, 7.0376836292E-2);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p1, -1.1514610310E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p2, 1.1676998740E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p3, -1.2420140846E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p4, +1.4249322787E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p5, -1.6668057665E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p6, +2.0000714765E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p7, -2.4999993993E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_p8, +3.3333331174E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_q1, -2.12194440e-4);\n_LOONGARCH_FLOAT_CONST(c_cephes_log_q2, 0.693359375);\n\n/* natural logarithm computed for 4 simultaneous float\n *   return NaN for x <= 0\n */\nstatic inline __m128 log_ps(__m128 x)\n{\n    __m128 one = (__m128)__lsx_vreplgr2vr_w(c_1.i);\n\n    x = __lsx_vfmax_s(x, (__m128)__lsx_vreplgr2vr_w(0)); /* force flush to zero on denormal values */\n    __m128i invalid_mask = __lsx_vfcmp_cle_s(x, (__m128)__lsx_vreplgr2vr_w(0));\n\n    __m128i ux = (__m128i)(x);\n\n    __m128i emm0 = __lsx_vsrl_w(ux, __lsx_vreplgr2vr_w(23));\n\n    /* keep only the fractional part */\n    ux = __lsx_vand_v(ux, __lsx_vreplgr2vr_w(c_inv_mant_mask));\n    ux = __lsx_vor_v(ux, __lsx_vreplgr2vr_w(c_0p5.i));\n    x = (__m128)(ux);\n\n    emm0 = 
__lsx_vsub_w(emm0, __lsx_vreplgr2vr_w(0x7f));\n    __m128 e = __lsx_vffint_s_w(emm0);\n\n    e = __lsx_vfadd_s(e, one);\n\n    /* part2:\n     *     if( x < SQRTHF ) {\n     *       e -= 1;\n     *       x = x + x - 1.0;\n     *     } else { x = x - 1.0; }\n     */\n    __m128i mask = __lsx_vfcmp_clt_s((__m128)x, (__m128)__lsx_vreplgr2vr_w(c_cephes_SQRTHF.i));\n    __m128 tmp = (__m128)(__lsx_vand_v((__m128i)(x), (__m128i)mask));\n    x = __lsx_vfsub_s(x, one);\n    e = __lsx_vfsub_s(e, (__m128)(__lsx_vand_v((__m128i)(one), (__m128i)mask)));\n    x = __lsx_vfadd_s(x, tmp);\n\n    __m128 z = __lsx_vfmul_s(x, x);\n\n    __m128 y = (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p0.i);\n\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p1.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p2.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p3.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p4.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p5.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p6.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p7.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_p8.i));\n    y = __lsx_vfmul_s(y, x);\n\n    y = __lsx_vfmul_s(y, z);\n\n    tmp = __lsx_vfmul_s(e, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_q1.i));\n    y = __lsx_vfadd_s(y, tmp);\n\n    tmp = __lsx_vfmul_s(z, (__m128)__lsx_vreplgr2vr_w(c_0p5.i));\n    y = __lsx_vfsub_s(y, tmp);\n\n    tmp = __lsx_vfmul_s(e, (__m128)__lsx_vreplgr2vr_w(c_cephes_log_q2.i));\n    x = __lsx_vfadd_s(x, y);\n    x = __lsx_vfadd_s(x, tmp);\n    x = (__m128)(__lsx_vor_v((__m128i)(x), (__m128i)invalid_mask)); // negative arg will be NAN\n    return x;\n}\n\n_LOONGARCH_FLOAT_CONST(c_exp_hi, 88.3762626647949f);\n_LOONGARCH_FLOAT_CONST(c_exp_lo, -88.3762626647949f);\n\n_LOONGARCH_FLOAT_CONST(c_cephes_LOG2EF, 
1.44269504088896341);\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_C1, 0.693359375);\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_C2, -2.12194440e-4);\n\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_p0, 1.9875691500E-4);\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_p1, 1.3981999507E-3);\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_p2, 8.3334519073E-3);\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_p3, 4.1665795894E-2);\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_p4, 1.6666665459E-1);\n_LOONGARCH_FLOAT_CONST(c_cephes_exp_p5, 5.0000001201E-1);\n\n/* exp() computed for 4 floats at once */\nstatic inline __m128 exp_ps(__m128 x)\n{\n    __m128 tmp, fx;\n\n    __m128 one = (__m128)__lsx_vreplgr2vr_w(c_1.i);\n    x = __lsx_vfmin_s(x, (__m128)__lsx_vreplgr2vr_w(c_exp_hi.i));\n    x = __lsx_vfmax_s(x, (__m128)__lsx_vreplgr2vr_w(c_exp_lo.i));\n\n    /* express exp(x) as exp(g + n*log(2)) */\n    fx = __lsx_vfmul_s(x, (__m128)__lsx_vreplgr2vr_w(c_cephes_LOG2EF.i));\n    fx = __lsx_vfadd_s(fx, (__m128)__lsx_vreplgr2vr_w(c_0p5.i));\n\n    /* perform a floorf */\n    tmp = __lsx_vffint_s_w(__lsx_vftint_w_s(fx));\n\n    /* if greater, subtract 1 */\n    __m128i mask = __lsx_vfcmp_clt_s(fx, tmp);\n    mask = __lsx_vand_v(mask, (__m128i)one);\n\n    fx = __lsx_vfsub_s(tmp, (__m128)mask);\n\n    tmp = __lsx_vfmul_s(fx, (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_C1.i));\n    __m128 z = __lsx_vfmul_s(fx, (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_C2.i));\n    x = __lsx_vfsub_s(x, tmp);\n    x = __lsx_vfsub_s(x, z);\n\n    __m128 y = (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_p0.i);\n\n    z = __lsx_vfmul_s(x, x);\n\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_p1.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_p2.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_p3.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_p4.i));\n    y = __lsx_vfmadd_s(x, y, (__m128)__lsx_vreplgr2vr_w(c_cephes_exp_p5.i));\n\n    y = __lsx_vfmul_s(y, z);\n    y = 
__lsx_vfadd_s(y, x);\n    y = __lsx_vfadd_s(y, one);\n\n    /* build 2^n */\n    __m128i mm;\n    mm = __lsx_vftintrz_w_s(fx);\n    mm = __lsx_vadd_w(mm, __lsx_vreplgr2vr_w(0x7f));\n    mm = __lsx_vsll_w(mm, __lsx_vreplgr2vr_w(23));\n\n    y = __lsx_vfmul_s(y, (__m128)mm);\n    return y;\n}\n\n_LOONGARCH_FLOAT_CONST(c_tanh_tiny, 1e-4f);\n_LOONGARCH_FLOAT_CONST(c_tanh_hi, 9.0f);\n// The monomial coefficients of the numerator polynomial (odd).\n_LOONGARCH_FLOAT_CONST(c_tanh_alpha_1, 4.89352455891786e-3f);\n_LOONGARCH_FLOAT_CONST(c_tanh_alpha_3, 6.37261928875436e-4f);\n_LOONGARCH_FLOAT_CONST(c_tanh_alpha_5, 1.48572235717979e-5f);\n_LOONGARCH_FLOAT_CONST(c_tanh_alpha_7, 5.12229709037114e-8f);\n_LOONGARCH_FLOAT_CONST(c_tanh_alpha_9, -8.60467152213735e-11f);\n_LOONGARCH_FLOAT_CONST(c_tanh_alpha_11, 2.00018790482477e-13f);\n_LOONGARCH_FLOAT_CONST(c_tanh_alpha_13, -2.76076847742355e-16f);\n// The monomial coefficients of the denominator polynomial (even).\n_LOONGARCH_FLOAT_CONST(c_tanh_beta_0, 4.89352518554385e-3f);\n_LOONGARCH_FLOAT_CONST(c_tanh_beta_2, 2.26843463243900e-3f);\n_LOONGARCH_FLOAT_CONST(c_tanh_beta_4, 1.18534705686654e-4f);\n_LOONGARCH_FLOAT_CONST(c_tanh_beta_6, 1.19825839466702e-6f);\n\n/* tanh() computed for 4 float at once */\nstatic inline __m128 tanh_ps(__m128 x)\n{\n    __m128 x2 = (__m128)__lsx_vbitclri_w((__m128i)x, 31);\n    __m128i tiny_mask = __lsx_vfcmp_clt_s((__m128)x2, (__m128)(__m128)__lsx_vreplgr2vr_w(c_tanh_tiny.i));\n    __m128i sig_mask = __lsx_vreplgr2vr_w(1 << 31);\n    __m128i sig_save = __lsx_vand_v((__m128i)x, sig_mask);\n\n    // clamp the inputs to the range [-9, 9] since anything outside\n    // this range is -/+1.0f in single-precision.\n    x2 = (__m128)__lsx_vbitsel_v((__m128i)x2, (__m128i)__lsx_vreplgr2vr_w(c_tanh_hi.i), (__m128i)__lsx_vfcmp_clt_s((__m128)__lsx_vreplgr2vr_w(c_tanh_hi.i), (__m128)x2));\n\n    // since the polynomials are odd/even, we need x**2.\n    __m128 z = __lsx_vfmul_s(x2, x2);\n\n    // evaluate the 
numerator polynomial y.\n    __m128 y = (__m128)__lsx_vreplgr2vr_w(c_tanh_alpha_13.i);\n    y = __lsx_vfmadd_s(z, y, (__m128)__lsx_vreplgr2vr_w(c_tanh_alpha_11.i));\n    y = __lsx_vfmadd_s(z, y, (__m128)__lsx_vreplgr2vr_w(c_tanh_alpha_9.i));\n    y = __lsx_vfmadd_s(z, y, (__m128)__lsx_vreplgr2vr_w(c_tanh_alpha_7.i));\n    y = __lsx_vfmadd_s(z, y, (__m128)__lsx_vreplgr2vr_w(c_tanh_alpha_5.i));\n    y = __lsx_vfmadd_s(z, y, (__m128)__lsx_vreplgr2vr_w(c_tanh_alpha_3.i));\n    y = __lsx_vfmadd_s(z, y, (__m128)__lsx_vreplgr2vr_w(c_tanh_alpha_1.i));\n    y = __lsx_vfmul_s(y, x2);\n\n    // evaluate the denominator polynomial w.\n    __m128 w = (__m128)__lsx_vreplgr2vr_w(c_tanh_beta_6.i);\n    w = __lsx_vfmadd_s(z, w, (__m128)__lsx_vreplgr2vr_w(c_tanh_beta_4.i));\n    w = __lsx_vfmadd_s(z, w, (__m128)__lsx_vreplgr2vr_w(c_tanh_beta_2.i));\n    w = __lsx_vfmadd_s(z, w, (__m128)__lsx_vreplgr2vr_w(c_tanh_beta_0.i));\n\n    // divide the numerator by the denominator.\n    y = __lsx_vfdiv_s(y, w);\n\n    // reinstate the sign.\n    y = (__m128)__lsx_vor_v((__m128i)y, sig_save);\n\n    // when the argument is very small in magnitude it's more accurate to just return it.\n    y = (__m128)__lsx_vbitsel_v((__m128i)y, (__m128i)x, (__m128i)tiny_mask);\n\n    return y;\n}\n\n_LOONGARCH_FLOAT_CONST(c_minus_cephes_DP1, -0.78515625f);\n_LOONGARCH_FLOAT_CONST(c_minus_cephes_DP2, -2.4187564849853515625e-4f);\n_LOONGARCH_FLOAT_CONST(c_minus_cephes_DP3, -3.77489497744594108e-8f);\n_LOONGARCH_FLOAT_CONST(c_cephes_sin_p0, -1.9515295891E-4f);\n_LOONGARCH_FLOAT_CONST(c_cephes_sin_p1, 8.3321608736E-3f);\n_LOONGARCH_FLOAT_CONST(c_cephes_sin_p2, -1.6666654611E-1f);\n_LOONGARCH_FLOAT_CONST(c_cephes_cos_p0, 2.443315711809948E-005f);\n_LOONGARCH_FLOAT_CONST(c_cephes_cos_p1, -1.388731625493765E-003f);\n_LOONGARCH_FLOAT_CONST(c_cephes_cos_p2, 4.166664568298827E-002f);\n_LOONGARCH_FLOAT_CONST(c_cephes_FOPI, 1.27323954473516f); // 4/PI\n\nstatic inline __m128 sin_ps(__m128 x)\n{\n    __m128 y;\n    
__m128i swap_sign_bit, poly_mask, sign_bit;\n    __m128 n0p5 = __lsx_vfmul_s((__m128)__lsx_vreplgr2vr_w(c_n1.i), (__m128)__lsx_vreplgr2vr_w(c_0p5.i));\n\n    sign_bit = __lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x80000000));\n    x = (__m128)__lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x7fffffff));\n\n    y = __lsx_vfmul_s(x, (__m128)__lsx_vreplgr2vr_w(c_cephes_FOPI.i));\n\n    poly_mask = __lsx_vftintrz_w_s(y);\n    poly_mask = __lsx_vadd_w(poly_mask, __lsx_vreplgr2vr_w(1));\n    poly_mask = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(~1));\n    y = __lsx_vffint_s_w(poly_mask);\n\n    swap_sign_bit = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(4));\n    swap_sign_bit = __lsx_vslli_w(swap_sign_bit, 29);\n\n    poly_mask = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(2));\n    poly_mask = __lsx_vseq_w(poly_mask, __lsx_vreplgr2vr_w(0));\n\n    sign_bit = __lsx_vxor_v(sign_bit, swap_sign_bit);\n\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP1.i), x);\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP2.i), x);\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP3.i), x);\n\n    y = (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p0.i);\n    __m128 z = __lsx_vfmul_s(x, x);\n    y = __lsx_vfmadd_s(y, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p1.i));\n    y = __lsx_vfmadd_s(y, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p2.i));\n    y = __lsx_vfmul_s(y, z);\n    y = __lsx_vfmul_s(y, z);\n    y = __lsx_vfmadd_s(z, n0p5, y);\n    y = __lsx_vfadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_1.i));\n\n    __m128 y2 = (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p0.i);\n    y2 = __lsx_vfmadd_s(y2, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p1.i));\n    y2 = __lsx_vfmadd_s(y2, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p2.i));\n    y2 = __lsx_vfmul_s(y2, z);\n    y2 = __lsx_vfmadd_s(y2, x, x);\n\n    y2 = (__m128)__lsx_vand_v((__m128i)y2, poly_mask);\n    y = (__m128)__lsx_vand_v(__lsx_vxor_v(poly_mask, __lsx_vreplgr2vr_w(0xffffffff)), 
(__m128i)y);\n    y = __lsx_vfadd_s(y, y2);\n    y = (__m128)__lsx_vxor_v((__m128i)y, sign_bit);\n\n    return y;\n}\n\nstatic inline __m128 cos_ps(__m128 x)\n{\n    __m128 y;\n    __m128i swap_sign_bit, poly_mask, sign_bit;\n    __m128 n0p5 = __lsx_vfmul_s((__m128)__lsx_vreplgr2vr_w(c_n1.i), (__m128)__lsx_vreplgr2vr_w(c_0p5.i));\n\n    x = (__m128)__lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x7fffffff));\n\n    y = __lsx_vfmul_s(x, (__m128)__lsx_vreplgr2vr_w(c_cephes_FOPI.i));\n\n    poly_mask = __lsx_vftintrz_w_s(y);\n    poly_mask = __lsx_vadd_w(poly_mask, __lsx_vreplgr2vr_w(1));\n    poly_mask = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(~1));\n    y = __lsx_vffint_s_w(poly_mask);\n    poly_mask = __lsx_vsub_w(poly_mask, __lsx_vreplgr2vr_w(2));\n\n    swap_sign_bit = __lsx_vandn_v(poly_mask, __lsx_vreplgr2vr_w(4));\n    swap_sign_bit = __lsx_vslli_w(swap_sign_bit, 29);\n\n    poly_mask = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(2));\n    poly_mask = __lsx_vseq_w(poly_mask, __lsx_vreplgr2vr_w(0));\n\n    sign_bit = swap_sign_bit;\n\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP1.i), x);\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP2.i), x);\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP3.i), x);\n\n    y = (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p0.i);\n    __m128 z = __lsx_vfmul_s(x, x);\n    y = __lsx_vfmadd_s(y, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p1.i));\n    y = __lsx_vfmadd_s(y, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p2.i));\n    y = __lsx_vfmul_s(y, z);\n    y = __lsx_vfmul_s(y, z);\n    y = __lsx_vfmadd_s(z, n0p5, y);\n    y = __lsx_vfadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_1.i));\n\n    __m128 y2 = (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p0.i);\n    y2 = __lsx_vfmadd_s(y2, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p1.i));\n    y2 = __lsx_vfmadd_s(y2, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p2.i));\n    y2 = __lsx_vfmul_s(y2, z);\n    y2 = __lsx_vfmadd_s(y2, x, 
x);\n\n    y2 = (__m128)__lsx_vand_v((__m128i)y2, poly_mask);\n    y = (__m128)__lsx_vandn_v(poly_mask, (__m128i)y);\n    y = __lsx_vfadd_s(y, y2);\n    y = (__m128)__lsx_vxor_v((__m128i)y, sign_bit);\n\n    return y;\n}\n\nstatic inline void sincos_ps(__m128 x, __m128* s, __m128* c)\n{\n    __m128 y;\n    __m128i swap_sign_bit_cos, swap_sign_bit_sin, poly_mask, sign_bit_sin, sign_bit_cos;\n    __m128 n0p5 = __lsx_vfmul_s((__m128)__lsx_vreplgr2vr_w(c_n1.i), (__m128)__lsx_vreplgr2vr_w(c_0p5.i));\n\n    sign_bit_sin = __lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x80000000));\n    x = (__m128)__lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x7fffffff));\n\n    y = __lsx_vfmul_s(x, (__m128)__lsx_vreplgr2vr_w(c_cephes_FOPI.i));\n\n    poly_mask = __lsx_vftintrz_w_s(y);\n    poly_mask = __lsx_vadd_w(poly_mask, __lsx_vreplgr2vr_w(1));\n    poly_mask = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(~1));\n    y = __lsx_vffint_s_w(poly_mask);\n\n    swap_sign_bit_cos = __lsx_vsub_w(poly_mask, __lsx_vreplgr2vr_w(2));\n    swap_sign_bit_cos = __lsx_vandn_v(swap_sign_bit_cos, __lsx_vreplgr2vr_w(4));\n    swap_sign_bit_cos = __lsx_vslli_w(swap_sign_bit_cos, 29);\n\n    swap_sign_bit_sin = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(4));\n    swap_sign_bit_sin = __lsx_vslli_w(swap_sign_bit_sin, 29);\n\n    poly_mask = __lsx_vand_v(poly_mask, __lsx_vreplgr2vr_w(2));\n    poly_mask = __lsx_vseq_w(poly_mask, __lsx_vreplgr2vr_w(0));\n\n    sign_bit_sin = __lsx_vxor_v(sign_bit_sin, swap_sign_bit_sin);\n    sign_bit_cos = swap_sign_bit_cos;\n\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP1.i), x);\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP2.i), x);\n    x = __lsx_vfmadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_minus_cephes_DP3.i), x);\n\n    __m128 z = __lsx_vfmul_s(x, x);\n    y = (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p0.i);\n    y = __lsx_vfmadd_s(y, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p1.i));\n    y = __lsx_vfmadd_s(y, z, 
(__m128)__lsx_vreplgr2vr_w(c_cephes_cos_p2.i));\n    y = __lsx_vfmul_s(y, z);\n    y = __lsx_vfmul_s(y, z);\n    y = __lsx_vfmadd_s(z, n0p5, y);\n    y = __lsx_vfadd_s(y, (__m128)__lsx_vreplgr2vr_w(c_1.i));\n\n    __m128 y2 = (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p0.i);\n    y2 = __lsx_vfmadd_s(y2, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p1.i));\n    y2 = __lsx_vfmadd_s(y2, z, (__m128)__lsx_vreplgr2vr_w(c_cephes_sin_p2.i));\n    y2 = __lsx_vfmul_s(y2, z);\n    y2 = __lsx_vfmadd_s(y2, x, x);\n\n    __m128 ysin1 = (__m128)__lsx_vandn_v(poly_mask, (__m128i)y);\n    __m128 ysin2 = (__m128)__lsx_vand_v(poly_mask, (__m128i)y2);\n    y2 = __lsx_vfsub_s(y2, ysin2);\n    y = __lsx_vfsub_s(y, ysin1);\n\n    ysin1 = __lsx_vfadd_s(ysin1, ysin2);\n    y = __lsx_vfadd_s(y, y2);\n\n    *s = (__m128)__lsx_vxor_v((__m128i)ysin1, sign_bit_sin);\n    *c = (__m128)__lsx_vxor_v((__m128i)y, sign_bit_cos);\n}\n\nstatic inline __m128 tan_ps(__m128 x)\n{\n    __m128 ysin, ycos;\n    __m128 eps = (__m128)__lsx_vreplgr2vr_w(c_eps.i);\n    __m128 zero = (__m128)__lsx_vreplgr2vr_w(c_0.i);\n    sincos_ps(x, &ysin, &ycos);\n    // nudge lanes whose cosine is exactly zero by eps to avoid dividing by zero\n    __m128i mask = __lsx_vfcmp_ceq_s(ycos, zero);\n    mask = __lsx_vand_v(mask, (__m128i)eps);\n    ycos = __lsx_vfadd_s(ycos, (__m128)mask);\n    __m128 ytan = __lsx_vfdiv_s(ysin, ycos);\n    return ytan;\n}\n\nstatic inline __m128 pow_ps(__m128 a, __m128 b)\n{\n    // pow(x, m) = exp(m * log(x))\n    return exp_ps(__lsx_vfmul_s(b, log_ps(a)));\n}\n\nstatic inline __m128 sigmoid_ps(__m128 _v)\n{\n    __m128 _one = __lsx_vreplfr2vr_s(1.f);\n    _v = (__m128)__lsx_vbitrevi_w((__m128i)_v, 31);\n    _v = exp_ps(_v);\n    _v = __lsx_vfadd_s(_v, _one);\n    return __lsx_vfdiv_s(_one, _v);\n}\n\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_a4, 0.023994016f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_a5, 0.042417344f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_a2, 0.07494697f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_a3, 0.045520633f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_a0, 
1.0f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_a1, 0.166667819f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_half_pi, 1.5707964f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_pi, 3.1415927f);\n_LOONGARCH_FLOAT_CONST(c_cephes_asin_npi, -3.1415927f);\n\nstatic inline __m128 asin_ps(__m128 x)\n{\n    __m128 big_input_approx, input_approx, square_of_input_approx, fourth_power_of_input_approx;\n    __m128 is_big_input_one, output_approx, final_approx;\n    __m128 tmp1, tmp2, tmp3, tmp4;\n    __m128i mask, is_small_input, is_big_input;\n\n    mask = __lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x80000000));\n    x = (__m128)__lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x7fffffff));\n\n    is_small_input = __lsx_vfcmp_cle_s(x, (__m128)__lsx_vreplgr2vr_w(c_0p5.i));\n    is_big_input = __lsx_vxor_v(is_small_input, __lsx_vreplgr2vr_w(0xffffffff));\n    is_big_input_one = (__m128)__lsx_vand_v(__lsx_vreplgr2vr_w(c_1.i), is_big_input);\n\n    big_input_approx = __lsx_vfsub_s((__m128)__lsx_vreplgr2vr_w(c_1.i), x);\n    big_input_approx = __lsx_vfmul_s((__m128)__lsx_vreplgr2vr_w(c_0p5.i), big_input_approx);\n    big_input_approx = __lsx_vfsqrt_s(big_input_approx);\n\n    input_approx = (__m128)__lsx_vand_v(is_small_input, (__m128i)x);\n    input_approx = (__m128)__lsx_vor_v((__m128i)input_approx, __lsx_vand_v(is_big_input, (__m128i)big_input_approx));\n\n    square_of_input_approx = __lsx_vfmul_s(input_approx, input_approx);\n    fourth_power_of_input_approx = __lsx_vfmul_s(square_of_input_approx, square_of_input_approx);\n\n    tmp1 = __lsx_vfmadd_s(fourth_power_of_input_approx, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a4.i), (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a2.i));\n    tmp2 = __lsx_vfmadd_s(fourth_power_of_input_approx, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a5.i), (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a3.i));\n    tmp3 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp1, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a0.i));\n    tmp4 = __lsx_vfmadd_s(fourth_power_of_input_approx, 
tmp2, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a1.i));\n    output_approx = __lsx_vfmadd_s(square_of_input_approx, tmp4, tmp3);\n\n    tmp1 = __lsx_vfmul_s((__m128)__lsx_vreplgr2vr_w(c_cephes_asin_half_pi.i), is_big_input_one);\n    tmp2 = __lsx_vfmul_s(output_approx, input_approx);\n    tmp3 = __lsx_vfmadd_s((__m128)__lsx_vreplgr2vr_w(c_n3.i), is_big_input_one, (__m128)__lsx_vreplgr2vr_w(c_1.i));\n\n    final_approx = __lsx_vfmadd_s(tmp2, tmp3, tmp1);\n    final_approx = (__m128)__lsx_vor_v((__m128i)final_approx, mask);\n\n    return final_approx;\n}\n\nstatic inline __m128 acos_ps(__m128 x)\n{\n    __m128 big_input_approx, input_approx, square_of_input_approx, fourth_power_of_input_approx;\n    __m128 output_approx, final_approx, small_final_approx, big_final_approx;\n    __m128 tmp1, tmp2, tmp3, tmp4;\n    __m128i mask, mask2, is_small_input, is_big_input, lt_zero;\n\n    lt_zero = __lsx_vfcmp_clt_s(x, (__m128)__lsx_vreplgr2vr_w(c_0.i));\n    mask = __lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x80000000));\n    x = (__m128)__lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x7fffffff));\n\n    is_small_input = __lsx_vfcmp_cle_s(x, (__m128)__lsx_vreplgr2vr_w(c_0p5.i));\n    is_big_input = __lsx_vxor_v(is_small_input, __lsx_vreplgr2vr_w(0xffffffff));\n\n    big_input_approx = __lsx_vfsub_s((__m128)__lsx_vreplgr2vr_w(c_1.i), x);\n    big_input_approx = __lsx_vfmul_s((__m128)__lsx_vreplgr2vr_w(c_0p5.i), big_input_approx);\n    big_input_approx = __lsx_vfsqrt_s(big_input_approx);\n\n    input_approx = (__m128)__lsx_vand_v(is_small_input, (__m128i)x);\n    input_approx = (__m128)__lsx_vor_v((__m128i)input_approx, __lsx_vand_v(is_big_input, (__m128i)big_input_approx));\n\n    square_of_input_approx = __lsx_vfmul_s(input_approx, input_approx);\n    fourth_power_of_input_approx = __lsx_vfmul_s(square_of_input_approx, square_of_input_approx);\n\n    tmp1 = __lsx_vfmadd_s(fourth_power_of_input_approx, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a4.i), 
(__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a2.i));\n    tmp2 = __lsx_vfmadd_s(fourth_power_of_input_approx, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a5.i), (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a3.i));\n    tmp3 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp1, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a0.i));\n    tmp4 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp2, (__m128)__lsx_vreplgr2vr_w(c_cephes_asin_a1.i));\n    output_approx = __lsx_vfmadd_s(square_of_input_approx, tmp4, tmp3);\n\n    tmp1 = __lsx_vfmul_s(input_approx, output_approx);\n\n    small_final_approx = (__m128)__lsx_vor_v((__m128i)tmp1, mask);\n    small_final_approx = __lsx_vfsub_s((__m128)__lsx_vreplgr2vr_w(c_cephes_asin_half_pi.i), small_final_approx);\n\n    big_final_approx = (__m128)__lsx_vand_v(lt_zero, __lsx_vreplgr2vr_w(c_cephes_asin_pi.i));\n    tmp1 = __lsx_vfadd_s(tmp1, tmp1);\n    tmp1 = (__m128)__lsx_vor_v((__m128i)tmp1, mask);\n    big_final_approx = __lsx_vfadd_s(big_final_approx, tmp1);\n\n    final_approx = (__m128)__lsx_vand_v(is_small_input, (__m128i)small_final_approx);\n    final_approx = (__m128)__lsx_vor_v((__m128i)final_approx, __lsx_vand_v(is_big_input, (__m128i)big_final_approx));\n\n    return final_approx;\n}\n\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x0, 1.0f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x1, -0.33333072f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x2, 0.1999262f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x3, -0.14203644f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x4, 0.10640934f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x5, -0.07504295f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x6, 0.04269152f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x7, -0.01606863f);\n_LOONGARCH_FLOAT_CONST(c_cephes_atan_x8, 0.0028498897f);\n\nstatic inline __m128 atan_ps(__m128 x)\n{\n    __m128i mask, is_small_input, is_big_input;\n    __m128 tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, input_approx, output_approx;\n    __m128 square_of_input_approx, fourth_power_of_input_approx;\n\n    mask = 
__lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x80000000));\n    x = (__m128)__lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x7fffffff));\n\n    is_small_input = __lsx_vfcmp_clt_s((__m128)__lsx_vreplgr2vr_w(c_1.i), x);\n    is_big_input = __lsx_vxor_v(is_small_input, __lsx_vreplgr2vr_w(0xffffffff));\n\n    tmp1 = (__m128)__lsx_vand_v(is_small_input, __lsx_vreplgr2vr_w(c_n1.i));\n    tmp1 = (__m128)__lsx_vor_v(__lsx_vand_v(is_big_input, (__m128i)x), (__m128i)tmp1);\n\n    tmp2 = (__m128)__lsx_vand_v(is_small_input, (__m128i)x);\n    tmp2 = (__m128)__lsx_vor_v(__lsx_vand_v((__m128i)is_big_input, __lsx_vreplgr2vr_w(c_1.i)), (__m128i)tmp2);\n\n    input_approx = __lsx_vfdiv_s(tmp1, tmp2);\n    square_of_input_approx = __lsx_vfmul_s(input_approx, input_approx);\n    fourth_power_of_input_approx = __lsx_vfmul_s(square_of_input_approx, square_of_input_approx);\n\n    tmp1 = __lsx_vfmadd_s(fourth_power_of_input_approx, (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x7.i), (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x5.i));\n    tmp2 = __lsx_vfmadd_s(fourth_power_of_input_approx, (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x8.i), (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x6.i));\n    tmp3 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp1, (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x3.i));\n    tmp4 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp2, (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x4.i));\n    tmp5 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp3, (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x1.i));\n    tmp6 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp4, (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x2.i));\n    tmp7 = __lsx_vfmadd_s(fourth_power_of_input_approx, tmp6, (__m128)__lsx_vreplgr2vr_w(c_cephes_atan_x0.i));\n    output_approx = __lsx_vfmadd_s(square_of_input_approx, tmp5, tmp7);\n\n    tmp1 = __lsx_vfmul_s(input_approx, output_approx);\n    tmp2 = (__m128)__lsx_vand_v(is_small_input, __lsx_vreplgr2vr_w(c_cephes_asin_half_pi.i));\n    tmp1 = __lsx_vfadd_s(tmp1, tmp2);\n   
 tmp1 = (__m128)__lsx_vxor_v(mask, (__m128i)tmp1);\n    return tmp1;\n}\n\nstatic inline __m128 atan2_ps(__m128 y, __m128 x)\n{\n    __m128i not_eq_zero_x, not_eq_zero_y, normal_mode, negative_mask_x, negative_mask_y;\n    __m128i lt_zero_mask_x, lt_zero_mask_y, ge_zero_mask_y, eq_zero_y;\n    __m128 pi_additions, tmp1, tmp2, normal_result, special_result, final_result;\n\n    not_eq_zero_x = __lsx_vfcmp_cne_s(x, (__m128)__lsx_vreplgr2vr_w(c_0.i));\n    not_eq_zero_y = __lsx_vfcmp_cne_s(y, (__m128)__lsx_vreplgr2vr_w(c_0.i));\n    eq_zero_y = __lsx_vxor_v(not_eq_zero_y, __lsx_vreplgr2vr_w(0xffffffff));\n    normal_mode = __lsx_vand_v(not_eq_zero_x, not_eq_zero_y);\n    negative_mask_x = __lsx_vand_v((__m128i)x, __lsx_vreplgr2vr_w(0x80000000));\n    negative_mask_y = __lsx_vand_v((__m128i)y, __lsx_vreplgr2vr_w(0x80000000));\n\n    lt_zero_mask_x = __lsx_vfcmp_clt_s(x, (__m128)__lsx_vreplgr2vr_w(0));\n    lt_zero_mask_y = __lsx_vfcmp_clt_s(y, (__m128)__lsx_vreplgr2vr_w(0));\n    ge_zero_mask_y = __lsx_vxor_v(lt_zero_mask_y, __lsx_vreplgr2vr_w(0xffffffff));\n\n    pi_additions = (__m128)__lsx_vand_v(lt_zero_mask_y, __lsx_vreplgr2vr_w(c_cephes_asin_npi.i));\n    pi_additions = (__m128)__lsx_vor_v(__lsx_vand_v(ge_zero_mask_y, __lsx_vreplgr2vr_w(c_cephes_asin_pi.i)), (__m128i)pi_additions);\n    pi_additions = (__m128)__lsx_vand_v(lt_zero_mask_x, (__m128i)pi_additions);\n\n    normal_result = __lsx_vfdiv_s(y, x);\n    normal_result = __lsx_vfadd_s(atan_ps(normal_result), pi_additions);\n\n    tmp1 = (__m128)__lsx_vand_v(negative_mask_y, __lsx_vreplgr2vr_w(c_cephes_asin_half_pi.i));\n    tmp2 = (__m128)__lsx_vand_v(negative_mask_x, __lsx_vreplgr2vr_w(c_cephes_asin_pi.i));\n    special_result = (__m128)__lsx_vand_v(not_eq_zero_y, (__m128i)tmp1);\n    special_result = (__m128)__lsx_vor_v(__lsx_vand_v(eq_zero_y, (__m128i)tmp2), (__m128i)special_result);\n\n    final_result = (__m128)__lsx_vand_v(normal_mode, (__m128i)normal_result);\n    normal_mode = 
__lsx_vxor_v(normal_mode, __lsx_vreplgr2vr_w(0xffffffff));\n    final_result = (__m128)__lsx_vor_v(__lsx_vand_v(normal_mode, (__m128i)special_result), (__m128i)final_result);\n\n    return final_result;\n}\n\nstatic inline __m128 fmod_ps(__m128 a, __m128 b)\n{\n    // fmod(a,b) = a - trunc(a/b)*b  (trunc toward 0)\n    __m128 q = __lsx_vfdiv_s(a, b);\n    __m128i qi = __lsx_vftintrz_w_s(q); // float -> int32 trunc toward zero\n    __m128 qf = __lsx_vffint_s_w(qi);   // int32 -> float\n    return __lsx_vfsub_s(a, __lsx_vfmul_s(qf, b));\n}\n\nstatic inline __m128 round_ps(__m128 x)\n{\n    __m128 half = (__m128)__lsx_vreplgr2vr_w(c_0p5.i);\n    __m128 one = (__m128)__lsx_vreplgr2vr_w(c_1.i);\n    __m128i sign_mask = __lsx_vfcmp_clt_s(x, (__m128)__lsx_vreplgr2vr_w(0));\n    __m128 abs_x = (__m128)__lsx_vbitclri_w((__m128i)x, 31);\n    __m128i xi = __lsx_vftintrz_w_s(abs_x);\n    __m128 xf = __lsx_vffint_s_w(xi);\n    __m128 diff = __lsx_vfsub_s(abs_x, xf);\n    __m128i diff_gt_half = __lsx_vfcmp_clt_s(half, diff);\n    __m128i diff_eq_half = __lsx_vfcmp_ceq_s(diff, half);\n    __m128i xi_and_1 = __lsx_vand_v(xi, __lsx_vreplgr2vr_w(1));\n    __m128i is_odd = __lsx_vseq_w(xi_and_1, __lsx_vreplgr2vr_w(1));\n    __m128i round_up = __lsx_vor_v(diff_gt_half, __lsx_vand_v(diff_eq_half, is_odd));\n    __m128 rounded = __lsx_vfadd_s(xf, (__m128)__lsx_vand_v(round_up, (__m128i)one));\n    return (__m128)__lsx_vbitsel_v((__m128i)rounded, (__m128i)__lsx_vbitrevi_w((__m128i)rounded, 31), sign_mask);\n}\n\nstatic inline __m128 logaddexp_ps(__m128 a, __m128 b)\n{\n    __m128 one = (__m128)__lsx_vreplgr2vr_w(c_1.i);\n    __m128 max_xy = __lsx_vfmax_s(a, b);\n    __m128 min_xy = __lsx_vfmin_s(a, b);\n    __m128 diff = __lsx_vfsub_s(min_xy, max_xy);\n    __m128 exp_diff = exp_ps(diff);\n    __m128 one_plus_exp = __lsx_vfadd_s(one, exp_diff);\n    __m128 log_result = log_ps(one_plus_exp);\n    return __lsx_vfadd_s(max_xy, log_result);\n}\n\nstatic inline __m128 floor_ps(__m128 x)\n{\n   
 __m128i xi = __lsx_vftintrz_w_s(x);\n    __m128 xf = __lsx_vffint_s_w(xi);\n    __m128i need_adjust = __lsx_vfcmp_clt_s(x, xf);\n    __m128 one = (__m128)__lsx_vreplgr2vr_w(c_1.i);\n    return __lsx_vfsub_s(xf, (__m128)__lsx_vand_v(need_adjust, (__m128i)one));\n}\n\nstatic inline __m128 floor_divide_ps(__m128 a, __m128 b)\n{\n    __m128 q = __lsx_vfdiv_s(a, b);\n    return floor_ps(q);\n}\n\nstatic inline __m128 remainder_ps(__m128 a, __m128 b)\n{\n    __m128 q = __lsx_vfdiv_s(a, b);\n    __m128 rq = round_ps(q);\n    return __lsx_vfsub_s(a, __lsx_vfmul_s(rq, b));\n}\n\n#endif // LSX_MATHFUN_H\n"
  },
  {
    "path": "src/layer/loongarch/mish_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"mish_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nMish_loongarch::Mish_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nint Mish_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        __m128 _one = (__m128)__lsx_vreplfr2vr_s(1.f);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            // mish(x) = x * tanh(log(1 + exp(x))), four lanes at a time\n            _p = __lsx_vfmul_s(_p, tanh_ps(log_ps(__lsx_vfadd_s(exp_ps(_p), _one))));\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        // scalar tail for the remaining elements\n        for (; i < size; i++)\n        {\n            *ptr = *ptr * tanhf(logf(expf(*ptr) + 1.f));\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/mish_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_MISH_LOONGARCH_H\n#define LAYER_MISH_LOONGARCH_H\n\n#include \"mish.h\"\n\nnamespace ncnn {\n\nclass Mish_loongarch : public Mish\n{\npublic:\n    Mish_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_MISH_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/packing_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"packing_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nPacking_loongarch::Packing_loongarch()\n{\n    support_packing = true;\n}\n\nint Packing_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n    if (elembits == 8)\n        return forward_int8(bottom_blob, top_blob, opt);\n\n    if (use_padding)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    if (elembits != 32)\n    {\n        // non-fp32 type\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    if (elempack == out_elempack)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    bool pack1to4 = elempack == 1 && out_elempack == 4;\n    bool pack4to1 = elempack == 4 && out_elempack == 1;\n\n    if (!pack1to4 && !pack4to1)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n\n    if (!use_padding)\n    {\n        // identity if use_padding not allowed\n        if (dims == 1 && w * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if (dims == 2 && h * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if ((dims == 3 || dims == 4) && channels * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n    }\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        top_blob.w = w * elempack / 
out_elempack;\n        top_blob.cstep = bottom_blob.cstep * elempack / out_elempack;\n        top_blob.elemsize = elemsize / elempack * out_elempack;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        int outh = h * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const float* r0 = bottom_blob.row(i * 4);\n                const float* r1 = bottom_blob.row(i * 4 + 1);\n                const float* r2 = bottom_blob.row(i * 4 + 2);\n                const float* r3 = bottom_blob.row(i * 4 + 3);\n\n                float* outptr = top_blob.row(i);\n\n                int j = 0;\n#if __loongarch_sx\n                for (; j + 3 < w; j += 4)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r2 = __lsx_vld(r2, 0);\n                    __m128i _r3 = __lsx_vld(r3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, outptr, 0);\n                    __lsx_vst(_r0123_1, outptr + 4, 0);\n                    
__lsx_vst(_r0123_2, outptr + 4 * 2, 0);\n                    __lsx_vst(_r0123_3, outptr + 4 * 3, 0);\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                    r3 += 4;\n                    outptr += 16;\n                }\n#endif // __loongarch_sx\n                for (; j < w; j++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n        if (pack4to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* r0 = bottom_blob.row(i);\n\n                float* outptr0 = top_blob.row(i * 4);\n                float* outptr1 = top_blob.row(i * 4 + 1);\n                float* outptr2 = top_blob.row(i * 4 + 2);\n                float* outptr3 = top_blob.row(i * 4 + 3);\n\n                int j = 0;\n#if __loongarch_sx\n                for (; j + 3 < w; j += 4)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r1 = __lsx_vld(r0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(r0 + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(r0 + 4 * 3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, 
outptr0, 0);\n                    __lsx_vst(_r0123_1, outptr1, 0);\n                    __lsx_vst(_r0123_2, outptr2, 0);\n                    __lsx_vst(_r0123_3, outptr3, 0);\n\n                    r0 += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif // __loongarch_sx\n                for (; j < w; j++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n\n                    r0 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int size = w * h * d;\n        int outc = channels * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if (dims == 3)\n            top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        else // if (dims == 4)\n            top_blob.create(w, h, d, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to4)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const float* r0 = bottom_blob.channel(q * 4);\n                const float* r1 = bottom_blob.channel(q * 4 + 1);\n                const float* r2 = bottom_blob.channel(q * 4 + 2);\n                const float* r3 = bottom_blob.channel(q * 4 + 3);\n\n                float* outptr = top_blob.channel(q);\n\n                int i = 0;\n#if __loongarch_sx\n                for (; i + 3 < size; i += 4)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = __lsx_vld(r0, 0);\n                    __m128i _r1 = __lsx_vld(r1, 0);\n                    __m128i _r2 = 
__lsx_vld(r2, 0);\n                    __m128i _r3 = __lsx_vld(r3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, outptr, 0);\n                    __lsx_vst(_r0123_1, outptr + 4, 0);\n                    __lsx_vst(_r0123_2, outptr + 4 * 2, 0);\n                    __lsx_vst(_r0123_3, outptr + 4 * 3, 0);\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                    r3 += 4;\n                    outptr += 16;\n                }\n#endif // __loongarch_sx\n                for (; i < size; i++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n\n                    outptr += 4;\n                }\n            }\n        }\n        if (pack4to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* r0 = bottom_blob.channel(q);\n\n                float* outptr0 = top_blob.channel(q * 4);\n                float* outptr1 = top_blob.channel(q * 4 + 1);\n                float* outptr2 = top_blob.channel(q * 4 + 2);\n                float* outptr3 = top_blob.channel(q * 4 + 3);\n\n                int i = 0;\n#if __loongarch_sx\n                for (; i + 3 < size; i += 4)\n                {\n                    // transpose 4x4\n                    __m128i _r0 = 
__lsx_vld(r0, 0);\n                    __m128i _r1 = __lsx_vld(r0 + 4, 0);\n                    __m128i _r2 = __lsx_vld(r0 + 4 * 2, 0);\n                    __m128i _r3 = __lsx_vld(r0 + 4 * 3, 0);\n\n                    __m128i _r01r = __lsx_vilvl_w(_r1, _r0);\n                    __m128i _r01l = __lsx_vilvh_w(_r1, _r0);\n                    __m128i _r23r = __lsx_vilvl_w(_r3, _r2);\n                    __m128i _r23l = __lsx_vilvh_w(_r3, _r2);\n                    __m128i _r0123_0 = __lsx_vilvl_d(_r23r, _r01r);\n                    __m128i _r0123_1 = __lsx_vilvh_d(_r23r, _r01r);\n                    __m128i _r0123_2 = __lsx_vilvl_d(_r23l, _r01l);\n                    __m128i _r0123_3 = __lsx_vilvh_d(_r23l, _r01l);\n\n                    __lsx_vst(_r0123_0, outptr0, 0);\n                    __lsx_vst(_r0123_1, outptr1, 0);\n                    __lsx_vst(_r0123_2, outptr2, 0);\n                    __lsx_vst(_r0123_3, outptr3, 0);\n\n                    r0 += 16;\n                    outptr0 += 4;\n                    outptr1 += 4;\n                    outptr2 += 4;\n                    outptr3 += 4;\n                }\n#endif // __loongarch_sx\n                for (; i < size; i++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n\n                    r0 += 4;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    return 0;\n}\n\nint Packing_loongarch::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (use_padding)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    if (elempack == out_elempack)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    bool pack1to8 = elempack == 1 && out_elempack == 8;\n    bool pack8to1 = elempack == 8 && 
out_elempack == 1;\n\n    if (!pack1to8 && !pack8to1)\n    {\n        return Packing::forward(bottom_blob, top_blob, opt);\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n\n    if (!use_padding)\n    {\n        // identity if use_padding not allowed\n        if (dims == 1 && w * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if (dims == 2 && h * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n        if ((dims == 3 || dims == 4) && channels * elempack % out_elempack != 0)\n        {\n            top_blob = bottom_blob;\n            return 0;\n        }\n    }\n\n    if (dims == 1)\n    {\n        top_blob = bottom_blob;\n        top_blob.w = w * elempack / out_elempack;\n        top_blob.cstep = bottom_blob.cstep * elempack / out_elempack;\n        top_blob.elemsize = elemsize / elempack * out_elempack;\n        top_blob.elempack = out_elempack;\n        return 0;\n    }\n\n    if (dims == 2)\n    {\n        int outh = h * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const signed char* r0 = bottom_blob.row<const signed char>(i * 8);\n                const signed char* r1 = bottom_blob.row<const signed char>(i * 8 + 1);\n                const signed char* r2 = bottom_blob.row<const signed char>(i * 8 + 2);\n                const signed char* r3 = bottom_blob.row<const signed char>(i * 8 + 3);\n                const signed char* r4 = bottom_blob.row<const 
signed char>(i * 8 + 4);\n                const signed char* r5 = bottom_blob.row<const signed char>(i * 8 + 5);\n                const signed char* r6 = bottom_blob.row<const signed char>(i * 8 + 6);\n                const signed char* r7 = bottom_blob.row<const signed char>(i * 8 + 7);\n\n                signed char* outptr = top_blob.row<signed char>(i);\n\n                int j = 0;\n                for (; j < w; j++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n                    outptr[4] = *r4++;\n                    outptr[5] = *r5++;\n                    outptr[6] = *r6++;\n                    outptr[7] = *r7++;\n\n                    outptr += 8;\n                }\n            }\n        }\n        if (pack8to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const signed char* r0 = bottom_blob.row<const signed char>(i);\n\n                signed char* outptr0 = top_blob.row<signed char>(i * 8);\n                signed char* outptr1 = top_blob.row<signed char>(i * 8 + 1);\n                signed char* outptr2 = top_blob.row<signed char>(i * 8 + 2);\n                signed char* outptr3 = top_blob.row<signed char>(i * 8 + 3);\n                signed char* outptr4 = top_blob.row<signed char>(i * 8 + 4);\n                signed char* outptr5 = top_blob.row<signed char>(i * 8 + 5);\n                signed char* outptr6 = top_blob.row<signed char>(i * 8 + 6);\n                signed char* outptr7 = top_blob.row<signed char>(i * 8 + 7);\n\n                int j = 0;\n                for (; j < w; j++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n                    *outptr4++ = r0[4];\n    
                *outptr5++ = r0[5];\n                    *outptr6++ = r0[6];\n                    *outptr7++ = r0[7];\n\n                    r0 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int size = w * h * d;\n        int outc = channels * elempack / out_elempack;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        if (dims == 3)\n            top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        else // if (dims == 4)\n            top_blob.create(w, h, d, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        if (pack1to8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q * 8);\n                const signed char* r1 = bottom_blob.channel(q * 8 + 1);\n                const signed char* r2 = bottom_blob.channel(q * 8 + 2);\n                const signed char* r3 = bottom_blob.channel(q * 8 + 3);\n                const signed char* r4 = bottom_blob.channel(q * 8 + 4);\n                const signed char* r5 = bottom_blob.channel(q * 8 + 5);\n                const signed char* r6 = bottom_blob.channel(q * 8 + 6);\n                const signed char* r7 = bottom_blob.channel(q * 8 + 7);\n\n                signed char* outptr = top_blob.channel(q);\n\n                int i = 0;\n                for (; i < size; i++)\n                {\n                    outptr[0] = *r0++;\n                    outptr[1] = *r1++;\n                    outptr[2] = *r2++;\n                    outptr[3] = *r3++;\n                    outptr[4] = *r4++;\n                    outptr[5] = *r5++;\n                    outptr[6] = *r6++;\n                    outptr[7] = *r7++;\n\n                    outptr += 8;\n                }\n            
}\n        }\n        if (pack8to1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const signed char* r0 = bottom_blob.channel(q);\n\n                signed char* outptr0 = top_blob.channel(q * 8);\n                signed char* outptr1 = top_blob.channel(q * 8 + 1);\n                signed char* outptr2 = top_blob.channel(q * 8 + 2);\n                signed char* outptr3 = top_blob.channel(q * 8 + 3);\n                signed char* outptr4 = top_blob.channel(q * 8 + 4);\n                signed char* outptr5 = top_blob.channel(q * 8 + 5);\n                signed char* outptr6 = top_blob.channel(q * 8 + 6);\n                signed char* outptr7 = top_blob.channel(q * 8 + 7);\n\n                int i = 0;\n                for (; i < size; i++)\n                {\n                    *outptr0++ = r0[0];\n                    *outptr1++ = r0[1];\n                    *outptr2++ = r0[2];\n                    *outptr3++ = r0[3];\n                    *outptr4++ = r0[4];\n                    *outptr5++ = r0[5];\n                    *outptr6++ = r0[6];\n                    *outptr7++ = r0[7];\n\n                    r0 += 8;\n                }\n            }\n        }\n\n        return 0;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/packing_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_PACKING_LOONGARCH_H\n#define LAYER_PACKING_LOONGARCH_H\n\n#include \"packing.h\"\n\nnamespace ncnn {\n\nclass Packing_loongarch : public Packing\n{\npublic:\n    Packing_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_PACKING_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/padding_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"padding_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\n#if __loongarch_sx\n#include \"padding_pack4.h\"\n#include \"padding_pack8_int8.h\"\n#endif // __loongarch_sx\n\nPadding_loongarch::Padding_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Padding_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (top == 0 && bottom == 0 && left == 0 && right == 0 && front == 0 && behind == 0)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int elembits = bottom_blob.elembits();\n\n    if (elembits == 8)\n        return forward_int8(bottom_blob, top_blob, opt);\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __loongarch_sx\n    if (elempack == 4)\n    {\n        if (dims == 1)\n        {\n            int outw = w * elempack + left + right;\n\n            int out_elempack = outw % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (left % 4 == 0 && out_elempack == 4 && type == 0)\n            {\n                top_blob.create(outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                __m128 pad_value = __lsx_vreplfr2vr_s(value);\n                padding_constant_pack4_lsx(bottom_blob, top_blob, 0, 0, left / 4, right / 4, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int outw = w + left + right;\n            int outh = h * elempack + top + bottom;\n\n            int out_elempack = outh % 4 == 0 ? 4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (top % 4 == 0 && out_elempack == 4 && type == 0)\n            {\n                top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                __m128 pad_value = __lsx_vreplfr2vr_s(value);\n                padding_constant_pack4_lsx(bottom_blob, top_blob, top / 4, bottom / 4, left, right, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outc = channels * elempack + front + behind;\n\n            int out_elempack = outc % 4 == 0 ? 
4 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (front % 4 == 0 && out_elempack == 4 && !(outc != channels * elempack && type != 0))\n            {\n                top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int front_ = front / elempack;\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < outc / out_elempack; q++)\n                {\n                    Mat borderm = top_blob.channel(q);\n\n                    __m128 pad_value = per_channel_pad_data_size ? (__m128)__lsx_vld((const float*)per_channel_pad_data + q * 4, 0) : __lsx_vreplfr2vr_s(value);\n                    //Channel padding\n                    if ((q - front_) < 0 || (q - front_) >= channels)\n                    {\n                        borderm.fill(pad_value);\n                    }\n                    else\n                    {\n                        const Mat m = bottom_blob.channel(q - front_);\n                        if (type == 0)\n                            padding_constant_pack4_lsx(m, borderm, top, bottom, left, right, pad_value);\n                        if (type == 1)\n                            padding_replicate_pack4_lsx(m, borderm, top, bottom, left, right);\n                        if (type == 2)\n                            padding_reflect_pack4_lsx(m, borderm, top, bottom, left, right);\n                    }\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outd = d + front + behind;\n\n            if (type == 0)\n            {\n                top_blob.create(outw, outh, outd, channels, elemsize, elempack, opt.blob_allocator);\n                if (top_blob.empty())\n 
                   return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    __m128 pad_value = per_channel_pad_data_size ? (__m128)__lsx_vld((const float*)per_channel_pad_data + q * 4, 0) : __lsx_vreplfr2vr_s(value);\n\n                    for (int z = 0; z < outd; z++)\n                    {\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        // depth padding\n                        if ((z - front) < 0 || (z - front) >= d)\n                        {\n                            borderm.fill(pad_value);\n                        }\n                        else\n                        {\n                            const Mat m = bottom_blob.channel(q).depth(z - front);\n                            padding_constant_pack4_lsx(m, borderm, top, bottom, left, right, pad_value);\n                        }\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __loongarch_sx\n\n    Mat bottom_blob_unpacked = bottom_blob;\n    if (elempack != 1)\n    {\n        Option opt_pack1 = opt;\n        opt_pack1.blob_allocator = opt.workspace_allocator;\n\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack1);\n    }\n\n    Mat top_blob_unpacked;\n    int ret = Padding::forward(bottom_blob_unpacked, top_blob_unpacked, opt);\n    if (ret != 0)\n        return ret;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = top_blob_unpacked.c % 4 == 0 ? 
4 : 1;\n    }\n#endif\n\n    convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n\n    return 0;\n}\n\nint Padding_loongarch::forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __loongarch_sx\n    if (elempack == 8)\n    {\n        if (dims == 1)\n        {\n            int outw = w * elempack + left + right;\n\n            int out_elempack = outw % 8 == 0 ? 8 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (left % 8 == 0 && out_elempack == 8 && type == 0)\n            {\n                top_blob.create(outw / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int64_t v8 = (int64_t)value;\n                int64_t pad_value = v8 | (v8 << 8) | (v8 << 16) | (v8 << 24) | (v8 << 32) | (v8 << 40) | (v8 << 48) | (v8 << 56);\n                padding_constant_pack8_int8_lsx(bottom_blob, top_blob, 0, 0, left / 8, right / 8, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 2)\n        {\n            int outw = w + left + right;\n            int outh = h * elempack + top + bottom;\n\n            int out_elempack = outh % 8 == 0 ? 
8 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (top % 8 == 0 && out_elempack == 8 && type == 0)\n            {\n                top_blob.create(outw, outh / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int64_t v8 = (int64_t)value;\n                int64_t pad_value = v8 | (v8 << 8) | (v8 << 16) | (v8 << 24) | (v8 << 32) | (v8 << 40) | (v8 << 48) | (v8 << 56);\n                padding_constant_pack8_int8_lsx(bottom_blob, top_blob, top / 8, bottom / 8, left, right, pad_value);\n\n                return 0;\n            }\n        }\n\n        if (dims == 3)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outc = channels * elempack + front + behind;\n\n            int out_elempack = outc % 8 == 0 ? 8 : 1;\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            if (front % 8 == 0 && out_elempack == 8 && !(outc != channels * elempack && type != 0))\n            {\n                top_blob.create(outw, outh, outc / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                int front_ = front / elempack;\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < outc / out_elempack; q++)\n                {\n                    Mat borderm = top_blob.channel(q);\n\n                    // TODO perchannel\n                    //                     int64_t pad_value = per_channel_pad_data_size ? 
vld1_s8(per_channel_pad_data + q * 8) : vdup_n_s8((signed char)value);\n                    int64_t v8 = (int64_t)value;\n                    int64_t pad_value = v8 | (v8 << 8) | (v8 << 16) | (v8 << 24) | (v8 << 32) | (v8 << 40) | (v8 << 48) | (v8 << 56);\n\n                    //Channel padding\n                    if ((q - front_) < 0 || (q - front_) >= channels)\n                    {\n                        borderm.fill(pad_value);\n                    }\n                    else\n                    {\n                        const Mat m = bottom_blob.channel(q - front_);\n                        if (type == 0)\n                            padding_constant_pack8_int8_lsx(m, borderm, top, bottom, left, right, pad_value);\n                        if (type == 1)\n                            padding_replicate_pack8_int8_lsx(m, borderm, top, bottom, left, right);\n                        if (type == 2)\n                            padding_reflect_pack8_int8_lsx(m, borderm, top, bottom, left, right);\n                    }\n                }\n\n                return 0;\n            }\n        }\n\n        if (dims == 4)\n        {\n            int outw = w + left + right;\n            int outh = h + top + bottom;\n            int outd = d + front + behind;\n\n            if (type == 0)\n            {\n                top_blob.create(outw, outh, outd, channels, elemsize, elempack, opt.blob_allocator);\n                if (top_blob.empty())\n                    return -100;\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    // TODO perchannel\n                    //                     int64_t pad_value = per_channel_pad_data_size ? 
vld1_s8(per_channel_pad_data + q * 8) : vdup_n_s8((signed char)value);\n                    int64_t v8 = (int64_t)value;\n                    int64_t pad_value = v8 | (v8 << 8) | (v8 << 16) | (v8 << 24) | (v8 << 32) | (v8 << 40) | (v8 << 48) | (v8 << 56);\n\n                    for (int z = 0; z < outd; z++)\n                    {\n                        Mat borderm = top_blob.channel(q).depth(z);\n\n                        // depth padding\n                        if ((z - front) < 0 || (z - front) >= d)\n                        {\n                            borderm.fill(pad_value);\n                        }\n                        else\n                        {\n                            const Mat m = bottom_blob.channel(q).depth(z - front);\n                            padding_constant_pack8_int8_lsx(m, borderm, top, bottom, left, right, pad_value);\n                        }\n                    }\n                }\n\n                return 0;\n            }\n        }\n    }\n#endif // __loongarch_sx\n\n    Mat bottom_blob_unpacked = bottom_blob;\n    if (elempack != 1)\n    {\n        Option opt_pack1 = opt;\n        opt_pack1.blob_allocator = opt.workspace_allocator;\n\n        convert_packing(bottom_blob, bottom_blob_unpacked, 1, opt_pack1);\n    }\n\n    Mat top_blob_unpacked;\n    int ret = Padding::forward(bottom_blob_unpacked, top_blob_unpacked, opt);\n    if (ret != 0)\n        return ret;\n\n    int out_elempack = 1;\n#if __loongarch_sx\n    if (opt.use_packing_layout)\n    {\n        out_elempack = top_blob_unpacked.c % 8 == 0 ? 8 : 1;\n    }\n#endif\n\n    convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/padding_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_PADDING_LOONGARCH_H\n#define LAYER_PADDING_LOONGARCH_H\n\n#include \"padding.h\"\n\nnamespace ncnn {\n\nclass Padding_loongarch : public Padding\n{\npublic:\n    Padding_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\nprotected:\n    int forward_int8(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_PADDING_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/padding_pack4.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void padding_constant_pack4_lsx(const Mat& src, Mat& dst, int top, int bottom, int left, int right, __m128 v)\n{\n    const float* ptr = src;\n    float* outptr = dst;\n    int top_size = top * dst.w;\n    int bottom_size = bottom * dst.w;\n\n    // fill top\n    for (int y = 0; y < top_size; y++)\n    {\n        __lsx_vst(v, outptr, 0);\n        outptr += 4;\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            __lsx_vst(v, outptr, 0);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            __builtin_prefetch(ptr + 32);\n            __lsx_vst(__lsx_vld(ptr, 0), outptr, 0);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            __lsx_vst(v, outptr, 0);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    for (int y = 0; y < bottom_size; y++)\n    {\n        __lsx_vst(v, outptr, 0);\n        outptr += 4;\n    }\n}\n\nstatic void padding_replicate_pack4_lsx(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const float* ptr = src;\n    float* outptr = dst;\n\n    // fill top\n    for (int y = 0; y < top; y++)\n    {\n        const float* ptr0 = ptr;\n        __m128 _p = (__m128)__lsx_vld(ptr0, 0);\n        for (int x = 0; x < left; x++)\n        {\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = (__m128)__lsx_vld(ptr0, 0);\n            __lsx_vst(_p, outptr, 0);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; 
y++)\n    {\n        __m128 _p = (__m128)__lsx_vld(ptr, 0);\n        for (int x = 0; x < left; x++)\n        {\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = (__m128)__lsx_vld(ptr, 0);\n            __lsx_vst(_p, outptr, 0);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    ptr -= src.w * 4;\n    for (int y = 0; y < bottom; y++)\n    {\n        const float* ptr0 = ptr;\n        __m128 _p = (__m128)__lsx_vld(ptr0, 0);\n        for (int x = 0; x < left; x++)\n        {\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            _p = (__m128)__lsx_vld(ptr0, 0);\n            __lsx_vst(_p, outptr, 0);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n    }\n}\n\nstatic void padding_reflect_pack4_lsx(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const float* ptr = src;\n    float* outptr = dst;\n\n    // fill top\n    ptr += top * src.w * 4;\n    for (int y = 0; y < top; y++)\n    {\n        const float* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr0 + (left - x) * 4, 0);\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr0, 0);\n            __lsx_vst(_p, outptr, 0);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr0 - 8 - x * 4, 0);\n            __lsx_vst(_p, outptr, 
0);\n            outptr += 4;\n        }\n        ptr -= src.w * 4;\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr + (left - x) * 4, 0);\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __lsx_vst(_p, outptr, 0);\n            ptr += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr - 8 - x * 4, 0);\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n    }\n    // fill bottom\n    ptr -= 2 * src.w * 4;\n    for (int y = 0; y < bottom; y++)\n    {\n        const float* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr0 + (left - x) * 4, 0);\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr0, 0);\n            __lsx_vst(_p, outptr, 0);\n            ptr0 += 4;\n            outptr += 4;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr0 - 8 - x * 4, 0);\n            __lsx_vst(_p, outptr, 0);\n            outptr += 4;\n        }\n        ptr -= src.w * 4;\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/padding_pack8_int8.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void padding_constant_pack8_int8_lsx(const Mat& src, Mat& dst, int top, int bottom, int left, int right, int64_t _v)\n{\n    const int64_t* ptr = src;\n    int64_t* outptr = dst;\n\n    // fill top\n    for (int y = 0; y < top; y++)\n    {\n        for (int x = 0; x < dst.w; x++)\n        {\n            *outptr++ = _v;\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            *outptr++ = _v;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            *outptr++ = *ptr++;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            *outptr++ = _v;\n        }\n    }\n    // fill bottom\n    for (int y = 0; y < bottom; y++)\n    {\n        for (int x = 0; x < dst.w; x++)\n        {\n            *outptr++ = _v;\n        }\n    }\n}\n\nstatic void padding_replicate_pack8_int8_lsx(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const int64_t* ptr = src;\n    int64_t* outptr = dst;\n\n    // fill top\n    for (int y = 0; y < top; y++)\n    {\n        const int64_t* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            *outptr++ = *ptr0;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            *outptr++ = *ptr0++;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            *outptr++ = ptr0[-1];\n        }\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            *outptr++ = *ptr;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            *outptr++ = *ptr++;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            *outptr++ = ptr[-1];\n        }\n    }\n    // fill bottom\n    ptr -= src.w;\n    for (int y = 0; y < bottom; y++)\n  
  {\n        const int64_t* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            *outptr++ = *ptr0;\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            *outptr++ = *ptr0++;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            *outptr++ = ptr0[-1];\n        }\n    }\n}\n\nstatic void padding_reflect_pack8_int8_lsx(const Mat& src, Mat& dst, int top, int bottom, int left, int right)\n{\n    const int64_t* ptr = src;\n    int64_t* outptr = dst;\n\n    // fill top\n    ptr += top * src.w;\n    for (int y = 0; y < top; y++)\n    {\n        const int64_t* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            *outptr++ = ptr0[left - x];\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            *outptr++ = *ptr0++;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            *outptr++ = ptr0[-2 - x];\n        }\n        ptr -= src.w;\n    }\n    // fill center\n    for (int y = 0; y < src.h; y++)\n    {\n        for (int x = 0; x < left; x++)\n        {\n            *outptr++ = ptr[left - x];\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            *outptr++ = *ptr++;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            *outptr++ = ptr[-2 - x];\n        }\n    }\n    // fill bottom\n    ptr -= 2 * src.w;\n    for (int y = 0; y < bottom; y++)\n    {\n        const int64_t* ptr0 = ptr;\n        for (int x = 0; x < left; x++)\n        {\n            *outptr++ = ptr0[left - x];\n        }\n        for (int x = 0; x < src.w; x++)\n        {\n            *outptr++ = *ptr0++;\n        }\n        for (int x = 0; x < right; x++)\n        {\n            *outptr++ = ptr0[-2 - x];\n        }\n        ptr -= src.w;\n    }\n}\n"
  },
  {
    "path": "src/layer/loongarch/pooling_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"pooling_loongarch.h\"\n\n#include <float.h>\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nPooling_loongarch::Pooling_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Pooling_loongarch::create_pipeline(const Option& /*opt*/)\n{\n    if (adaptive_pooling)\n    {\n        support_packing = false;\n\n        support_bf16_storage = false;\n        support_fp16_storage = false;\n        support_int8_storage = false;\n        support_tensor_storage = false;\n    }\n    return 0;\n}\n\nint Pooling_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (adaptive_pooling)\n    {\n        return Pooling::forward(bottom_blob, top_blob, opt);\n    }\n\n    // max value in NxN window\n    // avg value in NxN window\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n#if __loongarch_sx\n    //     NCNN_LOGE(\"Pooling     input %d x %d  pad = %d %d %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_left, pad_right, pad_top, pad_bottom, kernel_w, kernel_h, stride_w, stride_h);\n\n    if (elempack == 4)\n    {\n        if (global_pooling)\n        {\n            top_blob.create(channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            int size = w * h;\n\n            if (pooling_type == PoolMethod_MAX)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob.channel(q);\n\n                    __m128 _max = 
(__m128)__lsx_vld(ptr, 0);\n                    for (int i = 0; i < size; i++)\n                    {\n                        __m128 _val = (__m128)__lsx_vld(ptr, 0);\n                        _max = __lsx_vfmax_s(_max, _val);\n                        ptr += 4;\n                    }\n\n                    float* outptr = top_blob;\n                    __lsx_vst(_max, outptr + q * 4, 0);\n                }\n            }\n            else if (pooling_type == PoolMethod_AVE)\n            {\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const float* ptr = bottom_blob.channel(q);\n\n                    __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n                    for (int i = 0; i < size; i++)\n                    {\n                        __m128 _val = (__m128)__lsx_vld(ptr, 0);\n                        _sum = __lsx_vfadd_s(_sum, _val);\n                        ptr += 4;\n                    }\n\n                    __m128 _avg = __lsx_vfmul_s(_sum, __lsx_vreplfr2vr_s(1.f / size));\n\n                    float* outptr = top_blob;\n                    __lsx_vst(_avg, outptr + q * 4, 0);\n                }\n            }\n\n            return 0;\n        }\n\n        Mat bottom_blob_bordered;\n        make_padding(bottom_blob, bottom_blob_bordered, opt);\n        if (bottom_blob_bordered.empty())\n            return -100;\n\n        w = bottom_blob_bordered.w;\n        h = bottom_blob_bordered.h;\n\n        int outw = (w - kernel_w) / stride_w + 1;\n        int outh = (h - kernel_h) / stride_h + 1;\n\n        top_blob.create(outw, outh, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int maxk = kernel_w * kernel_h;\n\n        // kernel offsets\n        std::vector<int> _space_ofs(maxk);\n        int* space_ofs = &_space_ofs[0];\n        {\n            int p1 = 0;\n           
 int p2 = 0;\n            int gap = w - kernel_w;\n            for (int i = 0; i < kernel_h; i++)\n            {\n                for (int j = 0; j < kernel_w; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2++;\n                }\n                p2 += gap;\n            }\n        }\n\n        if (pooling_type == PoolMethod_MAX)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const Mat m = bottom_blob_bordered.channel(q);\n                float* outptr = top_blob.channel(q);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                        __m128 _max = (__m128)__lsx_vld(sptr, 0);\n\n                        for (int k = 0; k < maxk; k++)\n                        {\n                            __m128 _val = (__m128)__lsx_vld(sptr + space_ofs[k] * 4, 0);\n                            _max = __lsx_vfmax_s(_max, _val);\n                        }\n\n                        __lsx_vst(_max, outptr + j * 4, 0);\n                    }\n\n                    outptr += outw * 4;\n                }\n            }\n        }\n        else if (pooling_type == PoolMethod_AVE)\n        {\n            if (avgpool_count_include_pad == 0)\n            {\n                int wtailpad = 0;\n                int htailpad = 0;\n\n                if (pad_mode == 0) // full padding\n                {\n                    wtailpad = bottom_blob_bordered.w - bottom_blob.w - pad_left - pad_right;\n                    htailpad = bottom_blob_bordered.h - bottom_blob.h - pad_top - pad_bottom;\n                }\n\n                #pragma omp parallel for num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n  
              {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int sy0 = i * stride_h;\n\n                        for (int j = 0; j < outw; j++)\n                        {\n                            int sx0 = j * stride_w;\n\n                            __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n                            int area = 0;\n\n                            for (int ki = 0; ki < kernel_h; ki++)\n                            {\n                                int sy = sy0 + ki;\n\n                                if (sy < pad_top)\n                                    continue;\n\n                                if (sy >= h - pad_bottom - htailpad)\n                                    break;\n\n                                for (int kj = 0; kj < kernel_w; kj++)\n                                {\n                                    int sx = sx0 + kj;\n\n                                    if (sx < pad_left)\n                                        continue;\n\n                                    if (sx >= w - pad_right - wtailpad)\n                                        break;\n\n                                    __m128 _val = (__m128)__lsx_vld(m.row(sy) + sx * 4, 0);\n                                    _sum = __lsx_vfadd_s(_sum, _val);\n                                    area += 1;\n                                }\n                            }\n\n                            __m128 _avg = __lsx_vfmul_s(_sum, __lsx_vreplfr2vr_s(1.f / area));\n                            __lsx_vst(_avg, outptr + j * 4, 0);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n            else // if (avgpool_count_include_pad == 1)\n            {\n                #pragma omp parallel for 
num_threads(opt.num_threads)\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob_bordered.channel(q);\n                    float* outptr = top_blob.channel(q);\n\n                    const float inv_maxk = 1.f / maxk;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        for (int j = 0; j < outw; j++)\n                        {\n                            const float* sptr = m.row(i * stride_h) + j * stride_w * 4;\n\n                            __m128 _sum = (__m128)__lsx_vreplgr2vr_w(0);\n\n                            for (int k = 0; k < maxk; k++)\n                            {\n                                __m128 _val = (__m128)__lsx_vld(sptr + space_ofs[k] * 4, 0);\n                                _sum = __lsx_vfadd_s(_sum, _val);\n                            }\n\n                            __m128 _avg = __lsx_vfmul_s(_sum, __lsx_vreplfr2vr_s(inv_maxk));\n                            __lsx_vst(_avg, outptr + j * 4, 0);\n                        }\n\n                        outptr += outw * 4;\n                    }\n                }\n            }\n        }\n\n        return 0;\n    }\n#endif // __loongarch_sx\n\n    return Pooling::forward(bottom_blob, top_blob, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/pooling_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_POOLING_LOONGARCH_H\n#define LAYER_POOLING_LOONGARCH_H\n\n#include \"pooling.h\"\n\nnamespace ncnn {\n\nclass Pooling_loongarch : public Pooling\n{\npublic:\n    Pooling_loongarch();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_POOLING_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/prelu_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"prelu_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nPReLU_loongarch::PReLU_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint PReLU_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w * elempack;\n\n#if __loongarch_sx\n        int nn_w = w / 4;\n        int remain_w_start = nn_w * 4;\n#else\n        int remain_w_start = 0;\n#endif // __loongarch_sx\n\n        float* ptr = bottom_top_blob;\n\n        if (num_slope > 1)\n        {\n            const float* slope = slope_data;\n\n#if __loongarch_sx\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < nn_w; i++)\n            {\n                float* ptr0 = ptr + i * 4;\n\n                __m128 _p = (__m128)__lsx_vld(ptr0, 0);\n                __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _slope = (__m128)__lsx_vld(slope + i * 4, 0);\n                __m128i _lemask = __lsx_vfcmp_cle_s(_p, _zero);\n                __m128 _ps = __lsx_vfmul_s(_p, _slope);\n                _p = (__m128)__lsx_vbitsel_v((__m128i)_p, (__m128i)_ps, (__m128i)_lemask);\n                __lsx_vst(_p, ptr0, 0);\n            }\n#endif // __loongarch_sx\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = remain_w_start; i < w; i++)\n            {\n                float v = ptr[i];\n                if (v < 0.f)\n                    ptr[i] = v * slope[i];\n            }\n        }\n        else\n        {\n            const float slope = slope_data[0];\n\n#if __loongarch_sx\n         
   #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < nn_w; i++)\n            {\n                float* ptr0 = ptr + i * 4;\n\n                __m128 _p = (__m128)__lsx_vld(ptr0, 0);\n                __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n                __m128 _slope = (__m128)__lsx_vreplfr2vr_s(slope);\n                __m128i _lemask = __lsx_vfcmp_cle_s(_p, _zero);\n                __m128 _ps = __lsx_vfmul_s(_p, _slope);\n                _p = (__m128)__lsx_vbitsel_v((__m128i)_p, (__m128i)_ps, (__m128i)_lemask);\n                __lsx_vst(_p, ptr0, 0);\n            }\n#endif // __loongarch_sx\n\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = remain_w_start; i < w; i++)\n            {\n                float v = ptr[i];\n                if (v < 0.f)\n                    ptr[i] = v * slope;\n            }\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w * elempack;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n\n            const float slope = num_slope > 1 ? slope_data[i] : slope_data[0];\n\n            int j = 0;\n#if __loongarch_sx\n            __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _slope = (elempack == 4 && num_slope > 1) ? 
(__m128)__lsx_vld((const float*)slope_data + i * 4, 0) : (__m128)__lsx_vreplfr2vr_s(slope);\n\n            for (; j + 3 < w; j += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                __m128i _lemask = __lsx_vfcmp_cle_s(_p, _zero);\n                __m128 _ps = __lsx_vfmul_s(_p, _slope);\n                _p = (__m128)__lsx_vbitsel_v((__m128i)_p, (__m128i)_ps, (__m128i)_lemask);\n                __lsx_vst(_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; j < w; j++)\n            {\n                float v = *ptr;\n                if (v < 0.f)\n                    *ptr = v * slope;\n\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int channels = bottom_top_blob.c;\n        int size = w * h * elempack;\n\n        const float* slope_data_ptr = slope_data;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n            float slope = num_slope > 1 ? slope_data_ptr[q] : slope_data_ptr[0];\n\n            int i = 0;\n#if __loongarch_sx\n            __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _slope = (elempack == 4 && num_slope > 1) ? 
(__m128)__lsx_vld((const float*)slope_data + q * 4, 0) : (__m128)__lsx_vreplfr2vr_s(slope);\n\n            for (; i + 3 < size; i += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                __m128i _lemask = __lsx_vfcmp_cle_s(_p, _zero);\n                __m128 _ps = __lsx_vfmul_s(_p, _slope);\n                _p = (__m128)__lsx_vbitsel_v((__m128i)_p, (__m128i)_ps, (__m128i)_lemask);\n                __lsx_vst(_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                if (*ptr < 0)\n                    *ptr *= slope;\n\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/prelu_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_PRELU_LOONGARCH_H\n#define LAYER_PRELU_LOONGARCH_H\n\n#include \"prelu.h\"\n\nnamespace ncnn {\n\nclass PReLU_loongarch : public PReLU\n{\npublic:\n    PReLU_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_PRELU_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/quantize_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"quantize_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nQuantize_loongarch::Quantize_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nstatic void quantize(const float* ptr, signed char* s8ptr, const Mat& scale_data, int elemcount, int elempack)\n{\n    const int scale_data_size = scale_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"quantize %d   %d %d\", scale_data_size, elemcount, elempack);\n\n    float scale = scale_data[0];\n#if __loongarch_sx\n    __m128 _scale = (__m128)__lsx_vreplfr2vr_s(scale);\n    if (scale_data_size > 1)\n    {\n        if (elempack == 4)\n        {\n            _scale = (__m128)__lsx_vld((const float*)scale_data, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    int i = 0;\n#if __loongarch_sx\n    for (; i + 7 < size; i += 8)\n    {\n        __builtin_prefetch(ptr + 32);\n        __m128 _v0 = (__m128)__lsx_vld(ptr, 0);\n        __m128 _v1 = (__m128)__lsx_vld(ptr + 4, 0);\n        _v0 = __lsx_vfmul_s(_v0, _scale);\n        _v1 = __lsx_vfmul_s(_v1, _scale);\n        *((int64_t*)s8ptr) = float2int8(_v0, _v1);\n        ptr += 8;\n        s8ptr += 8;\n    }\n    for (; i + 3 < size; i += 4)\n    {\n        __m128 _v = (__m128)__lsx_vld(ptr, 0);\n        _v = __lsx_vfmul_s(_v, _scale);\n        v16i8 v = (v16i8)float2int8(_v);\n        s8ptr[0] = v[0];\n        s8ptr[1] = v[1];\n        s8ptr[2] = v[2];\n        s8ptr[3] = v[3];\n        ptr += 4;\n        s8ptr += 4;\n    }\n#endif // __loongarch_sx\n    for (; i < size; i++)\n    {\n        float v = *ptr * scale;\n        *s8ptr = float2int8(v);\n        ptr++;\n        s8ptr++;\n    }\n}\n\n#if __loongarch_sx\nstatic void quantize_pack4to8(const float* ptr0, const float* ptr1, signed char* 
s8ptr, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to8 %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    __m128 _scale0 = (__m128)__lsx_vreplfr2vr_s(scale);\n    __m128 _scale1 = _scale0;\n    if (scale_data_size > 1)\n    {\n        _scale0 = (__m128)__lsx_vld((const float*)scale_data, 0);\n        _scale1 = (__m128)__lsx_vld((const float*)scale_data + 4, 0);\n    }\n\n    int i = 0;\n    for (; i < elemcount; i++)\n    {\n        __m128 _v0 = (__m128)__lsx_vld(ptr0, 0);\n        __m128 _v1 = (__m128)__lsx_vld(ptr1, 0);\n        _v0 = __lsx_vfmul_s(_v0, _scale0);\n        _v1 = __lsx_vfmul_s(_v1, _scale1);\n        *((int64_t*)s8ptr) = float2int8(_v0, _v1);\n        ptr0 += 4;\n        ptr1 += 4;\n        s8ptr += 8;\n    }\n}\n\nstatic void quantize_pack4to1(const float* ptr, signed char* s8ptr0, signed char* s8ptr1, signed char* s8ptr2, signed char* s8ptr3, const Mat& scale_data, int elemcount)\n{\n    const int scale_data_size = scale_data.w;\n\n    // NCNN_LOGE(\"quantize_pack4to1 %d   %d\", scale_data_size, elemcount);\n\n    float scale = scale_data[0];\n    __m128 _scale = (__m128)__lsx_vreplfr2vr_s(scale);\n    if (scale_data_size > 1)\n    {\n        _scale = (__m128)__lsx_vld((const float*)scale_data, 0);\n    }\n\n    int i = 0;\n    for (; i < elemcount; i++)\n    {\n        __m128 _v = (__m128)__lsx_vld(ptr, 0);\n        _v = __lsx_vfmul_s(_v, _scale);\n        v16i8 v = (v16i8)float2int8(_v);\n        s8ptr0[0] = v[0];\n        s8ptr1[0] = v[1];\n        s8ptr2[0] = v[2];\n        s8ptr3[0] = v[3];\n        ptr += 4;\n        s8ptr0 += 1;\n        s8ptr1 += 1;\n        s8ptr2 += 1;\n        s8ptr3 += 1;\n    }\n}\n#endif // __loongarch_sx\n\nint Quantize_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = 
bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n\n    if (dims == 1)\n    {\n        int out_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            out_elempack = w * elempack % 8 == 0 ? 8 : 1;\n        }\n#endif\n        const int outw = w * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const float* ptr = (const float*)bottom_blob + i * elempack;\n            signed char* s8ptr = (signed char*)top_blob + i * elempack;\n\n            // assert scale_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            quantize(ptr, s8ptr, scale_data, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        int out_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            out_elempack = h * elempack % 8 == 0 ? 
8 : 1;\n        }\n#endif\n        const int outh = h * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, outh, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __loongarch_sx\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < outh; i++)\n            {\n                const float* ptr0 = bottom_blob.row(i * 2);\n                const float* ptr1 = bottom_blob.row(i * 2 + 1);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8(ptr0, ptr1, s8ptr, scale_data_i, w);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* ptr = bottom_blob.row(i);\n                signed char* s8ptr0 = top_blob.row<signed char>(i * 4);\n                signed char* s8ptr1 = top_blob.row<signed char>(i * 4 + 1);\n                signed char* s8ptr2 = top_blob.row<signed char>(i * 4 + 2);\n                signed char* s8ptr3 = top_blob.row<signed char>(i * 4 + 3);\n\n                const Mat scale_data_i = scale_data_size > 1 ? 
scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize_pack4to1(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_i, w);\n            }\n        }\n#endif // __loongarch_sx\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int i = 0; i < h; i++)\n            {\n                const float* ptr = bottom_blob.row(i);\n                signed char* s8ptr = top_blob.row<signed char>(i);\n\n                const Mat scale_data_i = scale_data_size > 1 ? scale_data.range(i * elempack, elempack) : scale_data;\n\n                quantize(ptr, s8ptr, scale_data_i, w, elempack);\n            }\n        }\n    }\n\n    if (dims == 3)\n    {\n        int out_elempack = 1;\n#if __loongarch_sx\n        if (opt.use_packing_layout)\n        {\n            out_elempack = channels * elempack % 8 == 0 ? 8 : 1;\n        }\n#endif\n        const int outc = channels * elempack / out_elempack;\n        const size_t out_elemsize = out_elempack * 1u;\n\n        top_blob.create(w, h, outc, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n#if __loongarch_sx\n        if (elempack == 4 && out_elempack == 8)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < outc; q++)\n            {\n                const float* ptr0 = bottom_blob.channel(q * 2);\n                const float* ptr1 = bottom_blob.channel(q * 2 + 1);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? 
scale_data.range(q * out_elempack, out_elempack) : scale_data;\n\n                quantize_pack4to8(ptr0, ptr1, s8ptr, scale_data_q, w * h);\n            }\n        }\n        if (elempack == 4 && out_elempack == 1)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                signed char* s8ptr0 = top_blob.channel(q * 4);\n                signed char* s8ptr1 = top_blob.channel(q * 4 + 1);\n                signed char* s8ptr2 = top_blob.channel(q * 4 + 2);\n                signed char* s8ptr3 = top_blob.channel(q * 4 + 3);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize_pack4to1(ptr, s8ptr0, s8ptr1, s8ptr2, s8ptr3, scale_data_q, w * h);\n            }\n        }\n#endif // __loongarch_sx\n        if (elempack == out_elempack)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < channels; q++)\n            {\n                const float* ptr = bottom_blob.channel(q);\n                signed char* s8ptr = top_blob.channel(q);\n\n                const Mat scale_data_q = scale_data_size > 1 ? scale_data.range(q * elempack, elempack) : scale_data;\n\n                quantize(ptr, s8ptr, scale_data_q, w * h, elempack);\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/quantize_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_QUANTIZE_LOONGARCH_H\n#define LAYER_QUANTIZE_LOONGARCH_H\n\n#include \"quantize.h\"\n\nnamespace ncnn {\n\nclass Quantize_loongarch : public Quantize\n{\npublic:\n    Quantize_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_QUANTIZE_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/relu_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"relu_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nReLU_loongarch::ReLU_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nint ReLU_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        if (slope == 0.f)\n        {\n            int i = 0;\n#if __loongarch_sx\n            __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n            for (; i + 3 < size; i += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                _p = __lsx_vfmax_s(_p, _zero);\n                __lsx_vst(_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                if (*ptr < 0)\n                    *ptr = 0;\n                ptr++;\n            }\n        }\n        else\n        {\n            int i = 0;\n#if __loongarch_sx\n            __m128 _zero = (__m128)__lsx_vreplgr2vr_w(0);\n            __m128 _slope = (__m128)__lsx_vreplfr2vr_s(slope);\n            for (; i + 3 < size; i += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                __m128 _p = (__m128)__lsx_vld(ptr, 0);\n                __m128i _lemask = __lsx_vfcmp_cle_s(_p, _zero);\n                __m128 _ps = __lsx_vfmul_s(_p, _slope);\n                _p = 
(__m128)__lsx_vbitsel_v((__m128i)_p, (__m128i)_ps, (__m128i)_lemask);\n                __lsx_vst(_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __loongarch_sx\n            for (; i < size; i++)\n            {\n                if (*ptr < 0)\n                    *ptr *= slope;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/relu_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_RELU_LOONGARCH_H\n#define LAYER_RELU_LOONGARCH_H\n\n#include \"relu.h\"\n\nnamespace ncnn {\n\nclass ReLU_loongarch : public ReLU\n{\npublic:\n    ReLU_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_RELU_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/requantize_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"requantize_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#endif // __loongarch_sx\n\n#include \"loongarch_activation.h\"\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nRequantize_loongarch::Requantize_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nstatic void requantize_relu(const int* intptr, signed char* ptr, const Mat& scale_in_data, const Mat& bias_data, const Mat& scale_out_data, int elemcount, int elempack)\n{\n    const int scale_in_data_size = scale_in_data.w;\n    const int bias_data_size = bias_data.w;\n    const int scale_out_data_size = scale_out_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"requantize_relu %d %d %d   %d %d\", scale_in_data_size, bias_data_size, scale_out_data_size, elemcount, elempack);\n\n    // int8(relu(v * scale_in) * scale_out)\n    // int8_relu(v * (scale_in * scale_out))\n\n    // int8(relu(v * scale_in + bias) * scale_out)\n    // int8_relu(v * (scale_in * scale_out) + (bias * scale_out))\n\n    float scale_in = scale_in_data[0];\n#if __loongarch_sx\n    __m128 _scale_in0 = (__m128)__lsx_vreplfr2vr_s(scale_in);\n    __m128 _scale_in1 = _scale_in0;\n    if (scale_in_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_in0 = (__m128)__lsx_vld((const float*)scale_in_data, 0);\n            _scale_in1 = (__m128)__lsx_vld((const float*)scale_in_data + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    float scale_out = scale_out_data[0];\n#if __loongarch_sx\n    __m128 _scale_out0 = (__m128)__lsx_vreplfr2vr_s(scale_out);\n    __m128 _scale_out1 = _scale_out0;\n    if (scale_out_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_out0 = (__m128)__lsx_vld((const float*)scale_out_data, 0);\n            _scale_out1 = (__m128)__lsx_vld((const 
float*)scale_out_data + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    float scale = scale_in * scale_out;\n#if __loongarch_sx\n    __m128 _scale0 = __lsx_vfmul_s(_scale_in0, _scale_out0);\n    __m128 _scale1 = __lsx_vfmul_s(_scale_in1, _scale_out1);\n#endif // __loongarch_sx\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmul_s(_v0, _scale0);\n            _v1 = __lsx_vfmul_s(_v1, _scale1);\n            *((int64_t*)ptr) = float2int8relu(_v0, _v1);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            _v = __lsx_vfmul_s(_v, _scale0);\n            v16i8 v = (v16i8)float2int8relu(_v);\n            ptr[0] = v[0];\n            ptr[1] = v[1];\n            ptr[2] = v[2];\n            ptr[3] = v[3];\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale;\n            if (v < 0) v = 0;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __loongarch_sx\n        __m128 _bias0 = (__m128)__lsx_vreplfr2vr_s(bias);\n        __m128 _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 8)\n            {\n                _bias0 = (__m128)__lsx_vld((const float*)bias_data, 0);\n                _bias1 = (__m128)__lsx_vld((const float*)bias_data + 4, 0);\n            }\n        }\n#endif // __loongarch_sx\n\n        bias = bias * scale_out;\n#if __loongarch_sx\n        _bias0 = __lsx_vfmul_s(_bias0, _scale_out0);\n  
      _bias1 = __lsx_vfmul_s(_bias1, _scale_out1);\n#endif // __loongarch_sx\n\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmadd_s(_v0, _scale0, _bias0);\n            _v1 = __lsx_vfmadd_s(_v1, _scale1, _bias1);\n            *((int64_t*)ptr) = float2int8relu(_v0, _v1);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            _v = __lsx_vfmadd_s(_v, _scale0, _bias0);\n            v16i8 v = (v16i8)float2int8relu(_v);\n            ptr[0] = v[0];\n            ptr[1] = v[1];\n            ptr[2] = v[2];\n            ptr[3] = v[3];\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale + bias;\n            if (v < 0) v = 0;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nstatic void requantize_leakyrelu(const int* intptr, signed char* ptr, const Mat& scale_in_data, const Mat& bias_data, const Mat& scale_out_data, float slope, int elemcount, int elempack)\n{\n    const int scale_in_data_size = scale_in_data.w;\n    const int bias_data_size = bias_data.w;\n    const int scale_out_data_size = scale_out_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"requantize_leakyrelu %d %d %d   %d %d\", scale_in_data_size, bias_data_size, scale_out_data_size, elemcount, elempack);\n\n    // int8(leakyrelu(v * scale_in, slope) * scale_out)\n    // int8_leakyrelu(v * (scale_in * scale_out), slope)\n\n    // int8(leakyrelu(v * scale_in + bias, slope) * scale_out)\n    // int8_leakyrelu(v * (scale_in * scale_out) + (bias * 
scale_out), slope)\n\n    float scale_in = scale_in_data[0];\n#if __loongarch_sx\n    __m128 _scale_in0 = (__m128)__lsx_vreplfr2vr_s(scale_in);\n    __m128 _scale_in1 = _scale_in0;\n    if (scale_in_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_in0 = (__m128)__lsx_vld((const float*)scale_in_data, 0);\n            _scale_in1 = (__m128)__lsx_vld((const float*)scale_in_data + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    float scale_out = scale_out_data[0];\n#if __loongarch_sx\n    __m128 _scale_out0 = (__m128)__lsx_vreplfr2vr_s(scale_out);\n    __m128 _scale_out1 = _scale_out0;\n    if (scale_out_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_out0 = (__m128)__lsx_vld((const float*)scale_out_data, 0);\n            _scale_out1 = (__m128)__lsx_vld((const float*)scale_out_data + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    float scale = scale_in * scale_out;\n#if __loongarch_sx\n    __m128 _scale0 = __lsx_vfmul_s(_scale_in0, _scale_out0);\n    __m128 _scale1 = __lsx_vfmul_s(_scale_in1, _scale_out1);\n    __m128 _slope = (__m128)__lsx_vreplfr2vr_s(slope);\n#endif // __loongarch_sx\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmul_s(_v0, _scale0);\n            _v1 = __lsx_vfmul_s(_v1, _scale1);\n            *((int64_t*)ptr) = float2int8leakyrelu(_v0, _v1, _slope);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            _v = __lsx_vfmul_s(_v, _scale0);\n            v16i8 v = (v16i8)float2int8leakyrelu(_v, _slope);\n            ptr[0] = v[0];\n            ptr[1] = 
v[1];\n            ptr[2] = v[2];\n            ptr[3] = v[3];\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale;\n            if (v < 0) v *= slope;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __loongarch_sx\n        __m128 _bias0 = (__m128)__lsx_vreplfr2vr_s(bias);\n        __m128 _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 8)\n            {\n                _bias0 = (__m128)__lsx_vld((const float*)bias_data, 0);\n                _bias1 = (__m128)__lsx_vld((const float*)bias_data + 4, 0);\n            }\n        }\n#endif // __loongarch_sx\n\n        bias = bias * scale_out;\n#if __loongarch_sx\n        _bias0 = __lsx_vfmul_s(_bias0, _scale_out0);\n        _bias1 = __lsx_vfmul_s(_bias1, _scale_out1);\n#endif // __loongarch_sx\n\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmadd_s(_v0, _scale0, _bias0);\n            _v1 = __lsx_vfmadd_s(_v1, _scale1, _bias1);\n            *((int64_t*)ptr) = float2int8leakyrelu(_v0, _v1, _slope);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            _v = __lsx_vfmadd_s(_v, _scale0, _bias0);\n            v16i8 v = (v16i8)float2int8leakyrelu(_v, _slope);\n            ptr[0] = v[0];\n            ptr[1] = v[1];\n            ptr[2] = v[2];\n            ptr[3] = v[3];\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        
{\n            float v = *intptr * scale + bias;\n            if (v < 0) v *= slope;\n            *ptr = float2int8(v);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nstatic void requantize(const int* intptr, signed char* ptr, const Mat& scale_in_data, const Mat& bias_data, const Mat& scale_out_data, int activation_type, const Mat& activation_params, int elemcount, int elempack)\n{\n    if (activation_type == 1)\n    {\n        requantize_relu(intptr, ptr, scale_in_data, bias_data, scale_out_data, elemcount, elempack);\n        return;\n    }\n\n    if (activation_type == 2 && activation_params[0] > 0.f)\n    {\n        const float slope = activation_params[0];\n        requantize_leakyrelu(intptr, ptr, scale_in_data, bias_data, scale_out_data, slope, elemcount, elempack);\n        return;\n    }\n\n    const int scale_in_data_size = scale_in_data.w;\n    const int bias_data_size = bias_data.w;\n    const int scale_out_data_size = scale_out_data.w;\n    const int size = elemcount * elempack;\n\n    // NCNN_LOGE(\"requantize %d %d %d   %d %d\", scale_in_data_size, bias_data_size, scale_out_data_size, elemcount, elempack);\n\n    float scale_in = scale_in_data[0];\n#if __loongarch_sx\n    __m128 _scale_in0 = (__m128)__lsx_vreplfr2vr_s(scale_in);\n    __m128 _scale_in1 = _scale_in0;\n    if (scale_in_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_in0 = (__m128)__lsx_vld((const float*)scale_in_data, 0);\n            _scale_in1 = (__m128)__lsx_vld((const float*)scale_in_data + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    float scale_out = scale_out_data[0];\n#if __loongarch_sx\n    __m128 _scale_out0 = (__m128)__lsx_vreplfr2vr_s(scale_out);\n    __m128 _scale_out1 = _scale_out0;\n    if (scale_out_data_size > 1)\n    {\n        if (elempack == 8)\n        {\n            _scale_out0 = (__m128)__lsx_vld((const float*)scale_out_data, 0);\n            _scale_out1 = (__m128)__lsx_vld((const 
float*)scale_out_data + 4, 0);\n        }\n    }\n#endif // __loongarch_sx\n\n    if (bias_data_size == 0)\n    {\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmul_s(_v0, _scale_in0);\n            _v1 = __lsx_vfmul_s(_v1, _scale_in1);\n            _v0 = activation_ps(_v0, activation_type, activation_params);\n            _v1 = activation_ps(_v1, activation_type, activation_params);\n            _v0 = __lsx_vfmul_s(_v0, _scale_out0);\n            _v1 = __lsx_vfmul_s(_v1, _scale_out1);\n            *((int64_t*)ptr) = float2int8(_v0, _v1);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            _v = __lsx_vfmul_s(_v, _scale_in0);\n            _v = activation_ps(_v, activation_type, activation_params);\n            _v = __lsx_vfmul_s(_v, _scale_out0);\n            v16i8 v = (v16i8)float2int8(_v);\n            ptr[0] = v[0];\n            ptr[1] = v[1];\n            ptr[2] = v[2];\n            ptr[3] = v[3];\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale_in;\n            v = activation_ss(v, activation_type, activation_params);\n            *ptr = float2int8(v * scale_out);\n            intptr++;\n            ptr++;\n        }\n    }\n    else\n    {\n        float bias = bias_data[0];\n#if __loongarch_sx\n        __m128 _bias0 = (__m128)__lsx_vreplfr2vr_s(bias);\n        __m128 _bias1 = _bias0;\n        if (bias_data_size > 1)\n        {\n            if (elempack == 8)\n            {\n                _bias0 = (__m128)__lsx_vld((const float*)bias_data, 0);\n                
_bias1 = (__m128)__lsx_vld((const float*)bias_data + 4, 0);\n            }\n        }\n#endif // __loongarch_sx\n\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(intptr + 32);\n            __m128 _v0 = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            __m128 _v1 = __lsx_vffint_s_w(__lsx_vld(intptr + 4, 0));\n            _v0 = __lsx_vfmadd_s(_v0, _scale_in0, _bias0);\n            _v1 = __lsx_vfmadd_s(_v1, _scale_in1, _bias1);\n            _v0 = activation_ps(_v0, activation_type, activation_params);\n            _v1 = activation_ps(_v1, activation_type, activation_params);\n            _v0 = __lsx_vfmul_s(_v0, _scale_out0);\n            _v1 = __lsx_vfmul_s(_v1, _scale_out1);\n            *((int64_t*)ptr) = float2int8(_v0, _v1);\n            intptr += 8;\n            ptr += 8;\n        }\n        for (; i + 3 < size; i += 4)\n        {\n            __m128 _v = __lsx_vffint_s_w(__lsx_vld(intptr, 0));\n            _v = __lsx_vfmadd_s(_v, _scale_in0, _bias0);\n            _v = activation_ps(_v, activation_type, activation_params);\n            _v = __lsx_vfmul_s(_v, _scale_out0);\n            v16i8 v = (v16i8)float2int8(_v);\n            ptr[0] = v[0];\n            ptr[1] = v[1];\n            ptr[2] = v[2];\n            ptr[3] = v[3];\n            intptr += 4;\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            float v = *intptr * scale_in + bias;\n            v = activation_ss(v, activation_type, activation_params);\n            *ptr = float2int8(v * scale_out);\n            intptr++;\n            ptr++;\n        }\n    }\n}\n\nint Requantize_loongarch::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    const int dims = bottom_blob.dims;\n    const int w = bottom_blob.w;\n    const int h = bottom_blob.h;\n    const int channels = bottom_blob.c;\n    const int elempack = bottom_blob.elempack;\n    const 
size_t out_elemsize = elempack * 1u;\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        const int wp = std::max(1, w / opt.num_threads);\n        const int nn_w = (w + wp - 1) / wp;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_w; ii++)\n        {\n            const int i = ii * wp;\n\n            const int* intptr = (const int*)bottom_blob + i * elempack;\n            signed char* ptr = (signed char*)top_blob + i * elempack;\n\n            // assert scale_in_data_size == 1\n            // assert bias_data_size == 0 || bias_data_size == 1\n            // assert scale_out_data_size == 1\n\n            const int size = std::min(w - i, wp) * elempack;\n\n            requantize(intptr, ptr, scale_in_data, bias_data, scale_out_data, activation_type, activation_params, size, 1);\n        }\n    }\n\n    if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            const int* intptr = bottom_blob.row<const int>(i);\n            signed char* ptr = top_blob.row<signed char>(i);\n\n            const Mat scale_in_data_i = scale_in_data_size > 1 ? scale_in_data.range(i * elempack, elempack) : scale_in_data;\n            const Mat bias_data_i = bias_data_size > 1 ? bias_data.range(i * elempack, elempack) : bias_data;\n            const Mat scale_out_data_i = scale_out_data_size > 1 ? 
scale_out_data.range(i * elempack, elempack) : scale_out_data;\n\n            requantize(intptr, ptr, scale_in_data_i, bias_data_i, scale_out_data_i, activation_type, activation_params, w, elempack);\n        }\n    }\n\n    if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int* intptr = bottom_blob.channel(q);\n            signed char* ptr = top_blob.channel(q);\n\n            const Mat scale_in_data_q = scale_in_data_size > 1 ? scale_in_data.range(q * elempack, elempack) : scale_in_data;\n            const Mat bias_data_q = bias_data_size > 1 ? bias_data.range(q * elempack, elempack) : bias_data;\n            const Mat scale_out_data_q = scale_out_data_size > 1 ? scale_out_data.range(q * elempack, elempack) : scale_out_data;\n\n            requantize(intptr, ptr, scale_in_data_q, bias_data_q, scale_out_data_q, activation_type, activation_params, w * h, elempack);\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/requantize_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_REQUANTIZE_LOONGARCH_H\n#define LAYER_REQUANTIZE_LOONGARCH_H\n\n#include \"requantize.h\"\n\nnamespace ncnn {\n\nclass Requantize_loongarch : public Requantize\n{\npublic:\n    Requantize_loongarch();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_REQUANTIZE_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/sigmoid_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"sigmoid_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#if __loongarch_asx\n#include <lasxintrin.h>\n#include \"lasx_mathfun.h\"\n#endif // __loongarch_asx\n#endif // __loongarch_sx\n\n#include \"loongarch_usability.h\"\n\nnamespace ncnn {\n\nSigmoid_loongarch::Sigmoid_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nint Sigmoid_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n#if __loongarch_asx\n        __m256 _one_lasx = (__m256)__lasx_xvreplfr2vr_s(1.f);\n        for (; i + 7 < size; i += 8)\n        {\n            __builtin_prefetch(ptr + 32);\n            __m256 _p = (__m256)__lasx_xvld(ptr, 0);\n            _p = (__m256)__lasx_xvbitrevi_w((__m256i)_p, 31);\n            _p = exp256_ps(_p);\n            _p = __lasx_xvfadd_s(_p, _one_lasx);\n            __m256 _outp = __lasx_xvfdiv_s(_one_lasx, _p);\n            __lasx_xvst(_outp, ptr, 0);\n\n            ptr += 8;\n        }\n#endif // __loongarch_lasx\n        __m128 _one_lsx = (__m128)__lsx_vreplfr2vr_s(1.f);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            _p = (__m128)__lsx_vbitrevi_w((__m128i)_p, 31);\n            _p = exp_ps(_p);\n            _p = __lsx_vfadd_s(_p, _one_lsx);\n            __m128 _outp = __lsx_vfdiv_s(_one_lsx, _p);\n            
__lsx_vst(_outp, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = 1.f / (1.f + expf(-*ptr));\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/sigmoid_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SIGMOID_LOONGARCH_H\n#define LAYER_SIGMOID_LOONGARCH_H\n\n#include \"sigmoid.h\"\n\nnamespace ncnn {\n\nclass Sigmoid_loongarch : public Sigmoid\n{\npublic:\n    Sigmoid_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SIGMOID_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/slice_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"slice_loongarch.h\"\n\nnamespace ncnn {\n\nSlice_loongarch::Slice_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Slice_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n    const int* slices_ptr = slices;\n    const int* indices_ptr = indices;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // slice vector\n        int w = bottom_blob.w * elempack;\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __loongarch_sx\n            if (opt.use_packing_layout)\n                out_elempack = slice % 4 == 0 ? 
4 : 1;\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            const float* ptr = (const float*)bottom_blob + q;\n            float* outptr = top_blob;\n            memcpy(outptr, ptr, top_blob.w * top_blob.elemsize);\n\n            q += slice;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // slice image height\n        int w = bottom_blob.w;\n        int h = bottom_blob.h * elempack;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = h - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? h + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((h - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __loongarch_sx\n            if (opt.use_packing_layout)\n                out_elempack = slice % 4 == 0 ? 
4 : 1;\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        size_t out_elemsize = top_blobs[0].elemsize;\n        int out_elempack = top_blobs[0].elempack;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            out_elemsize = std::min(out_elemsize, top_blobs[i].elemsize);\n            out_elempack = std::min(out_elempack, top_blobs[i].elempack);\n        }\n\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > out_elempack)\n        {\n            convert_packing(bottom_blob, bottom_blob_unpacked, out_elempack, opt);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        const float* ptr = bottom_blob_unpacked;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            Mat& top_blob = top_blobs[i];\n\n            if (out_elempack == 1 && top_blob.elempack == 4)\n            {\n                for (int j = 0; j < top_blob.h; j++)\n                {\n                    const float* r0 = ptr;\n                    const float* r1 = ptr + w;\n                    const float* r2 = ptr + w * 2;\n                    const float* r3 = ptr + w * 3;\n\n                    float* outptr0 = top_blob.row(j);\n\n                    for (int k = 0; k < w; k++)\n                    {\n                        outptr0[0] = *r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n\n                        outptr0 += 4;\n                    }\n\n                    ptr += w * 4;\n                }\n            }\n            else // if (out_elempack == 1 && top_blob.elempack == 1) if (out_elempack == 4 && 
top_blob.elempack == 4)\n            {\n                int size = w * top_blob.h;\n\n                float* outptr = top_blob;\n                memcpy(outptr, ptr, size * top_blob.elemsize);\n\n                ptr += size * top_blob.elempack;\n            }\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // slice image width\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice, h, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int j = 0; j < h; j++)\n        {\n            const float* ptr = bottom_blob.row(j);\n            for (size_t i = 0; i < top_blobs.size(); i++)\n            {\n                Mat& top_blob = top_blobs[i];\n\n                float* outptr = top_blob.row(j);\n                memcpy(outptr, ptr, top_blob.w * elemsize);\n\n                ptr += top_blob.w * elempack;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // slice dim 
channel\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c * elempack;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = channels - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? channels + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((channels - q) / (top_blobs.size() - i));\n                }\n            }\n\n            int out_elempack = 1;\n#if __loongarch_sx\n            if (opt.use_packing_layout)\n                out_elempack = slice % 4 == 0 ? 
4 : 1;\n#endif\n            size_t out_elemsize = elemsize / elempack * out_elempack;\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, h, d, slice / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        size_t out_elemsize = top_blobs[0].elemsize;\n        int out_elempack = top_blobs[0].elempack;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            out_elemsize = std::min(out_elemsize, top_blobs[i].elemsize);\n            out_elempack = std::min(out_elempack, top_blobs[i].elempack);\n        }\n\n        Mat bottom_blob_unpacked = bottom_blob;\n        if (elempack > out_elempack)\n        {\n            convert_packing(bottom_blob, bottom_blob_unpacked, out_elempack, opt);\n            if (bottom_blob_unpacked.empty())\n                return -100;\n        }\n\n        int p = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            Mat& top_blob = top_blobs[i];\n\n            if (out_elempack == 1 && top_blob.elempack == 4)\n            {\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                for (int q = 0; q < top_blob.c; q++)\n                {\n                    const float* r0 = bottom_blob_unpacked.channel(p);\n                    const float* r1 = bottom_blob_unpacked.channel(p + 1);\n                    const float* r2 = bottom_blob_unpacked.channel(p + 2);\n                    const float* r3 = bottom_blob_unpacked.channel(p + 3);\n\n                    float* outptr0 = top_blob.channel(q);\n\n                    for (int j = 0; j < size; j++)\n                    {\n                        outptr0[0] = *r0++;\n                        outptr0[1] = *r1++;\n                        outptr0[2] = *r2++;\n                        outptr0[3] = *r3++;\n\n                        outptr0 += 4;\n   
                 }\n\n                    p += 4;\n                }\n            }\n            else // if (out_elempack == 1 && top_blob.elempack == 1) if (out_elempack == 4 && top_blob.elempack == 4)\n            {\n                int size = top_blob.total();\n\n                const float* ptr = bottom_blob_unpacked.channel(p);\n                float* outptr = top_blob;\n                memcpy(outptr, ptr, size * top_blob.elemsize);\n\n                p += top_blob.c;\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // slice dim height\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = h - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
h + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((h - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, slice, d, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const float* ptr = bottom_blob.channel(p);\n\n            for (int j = 0; j < d; j++)\n            {\n                for (size_t i = 0; i < top_blobs.size(); i++)\n                {\n                    Mat& top_blob = top_blobs[i];\n\n                    int size = top_blob.w * top_blob.h;\n\n                    float* outptr = top_blob.channel(p).depth(j);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    ptr += size * elempack;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // slice dim width\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = w - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
w + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((w - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(slice, h, d, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            top_blob.dims = dims;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const float* ptr = bottom_blob.channel(p);\n\n            for (int j = 0; j < d; j++)\n            {\n                for (int k = 0; k < h; k++)\n                {\n                    for (size_t i = 0; i < top_blobs.size(); i++)\n                    {\n                        Mat& top_blob = top_blobs[i];\n\n                        float* outptr = top_blob.channel(p).depth(j).row(k);\n                        memcpy(outptr, ptr, top_blob.w * elemsize);\n\n                        ptr += top_blob.w * elempack;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        int w = bottom_blob.w;\n        int h = bottom_blob.h;\n        int d = bottom_blob.d;\n        int channels = bottom_blob.c;\n\n        int q = 0;\n        for (size_t i = 0; i < top_blobs.size(); i++)\n        {\n            int slice;\n            if (indices_ptr)\n            {\n                if (i == top_blobs.size() - 1)\n                {\n                    slice = d - q;\n                }\n                else\n                {\n                    int indice = indices_ptr[i];\n                    int positive_indice = indice < 0 ? 
d + indice : indice;\n                    slice = positive_indice - q;\n                }\n            }\n            else\n            {\n                slice = slices_ptr[i];\n                if (slice == -233)\n                {\n                    slice = static_cast<int>((d - q) / (top_blobs.size() - i));\n                }\n            }\n\n            Mat& top_blob = top_blobs[i];\n            top_blob.create(w, h, slice, channels, elemsize, elempack, opt.blob_allocator);\n            if (top_blob.empty())\n                return -100;\n\n            q += slice;\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < channels; p++)\n        {\n            const float* ptr = bottom_blob.channel(p);\n\n            for (size_t i = 0; i < top_blobs.size(); i++)\n            {\n                Mat& top_blob = top_blobs[i];\n\n                int size = top_blob.w * top_blob.h * top_blob.d;\n\n                float* outptr = top_blob.channel(p);\n                memcpy(outptr, ptr, size * elemsize);\n\n                ptr += size * elempack;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/slice_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SLICE_LOONGARCH_H\n#define LAYER_SLICE_LOONGARCH_H\n\n#include \"slice.h\"\n\nnamespace ncnn {\n\nclass Slice_loongarch : public Slice\n{\npublic:\n    Slice_loongarch();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SLICE_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/softmax_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"softmax_loongarch.h\"\n\n#include <float.h>\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nint Softmax_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    size_t elemsize = bottom_top_blob.elemsize;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims != 3 || positive_axis != 0)\n        return Softmax::forward_inplace(bottom_top_blob, opt);\n\n    // value = exp( value - global max value )\n    // sum all value\n    // value = value / sum\n\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    int size = w * h;\n\n    Mat max;\n    max.create(w, h, elemsize, opt.workspace_allocator);\n    if (max.empty())\n        return -100;\n    max.fill(-FLT_MAX);\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n        float* maxptr = max;\n\n        for (int i = 0; i < size; i++)\n        {\n            maxptr[i] = std::max(maxptr[i], ptr[i]);\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n        float* maxptr = max;\n\n#if __loongarch_sx\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __loongarch_sx\n\n#if __loongarch_sx\n        for (; nn > 0; nn--)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __m128 _max = (__m128)__lsx_vld(maxptr, 0);\n\n            _p = exp_ps(__lsx_vfsub_s(_p, _max));\n\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n            maxptr += 4;\n        }\n#endif // __loongarch_sx\n\n        for (; remain > 0; 
remain--)\n        {\n            *ptr = expf(*ptr - *maxptr);\n\n            ptr++;\n            maxptr++;\n        }\n    }\n\n    Mat sum;\n    sum.create(w, h, elemsize, opt.workspace_allocator);\n    if (sum.empty())\n        return -100;\n    sum.fill(0.f);\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n        float* sumptr = sum;\n\n#if __loongarch_sx\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __loongarch_sx\n\n#if __loongarch_sx\n        for (; nn > 0; nn--)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __m128 _sum = (__m128)__lsx_vld(sumptr, 0);\n            _sum = __lsx_vfadd_s(_sum, _p);\n            __lsx_vst(_sum, sumptr, 0);\n\n            ptr += 4;\n            sumptr += 4;\n        }\n#endif // __loongarch_sx\n\n        for (; remain > 0; remain--)\n        {\n            *sumptr += *ptr;\n\n            ptr++;\n            sumptr++;\n        }\n    }\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n        float* sumptr = sum;\n\n#if __loongarch_sx\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __loongarch_sx\n\n#if __loongarch_sx\n        for (; nn > 0; nn--)\n        {\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            __m128 _sum = (__m128)__lsx_vld(sumptr, 0);\n            _p = __lsx_vfdiv_s(_p, _sum);\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n            sumptr += 4;\n        }\n#endif // __loongarch_sx\n\n        for (; remain > 0; remain--)\n        {\n            *ptr /= *sumptr;\n\n            ptr++;\n            sumptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/softmax_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SOFTMAX_LOONGARCH_H\n#define LAYER_SOFTMAX_LOONGARCH_H\n\n#include \"softmax.h\"\n\nnamespace ncnn {\n\nclass Softmax_loongarch : public Softmax\n{\npublic:\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SOFTMAX_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/swish_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"swish_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nSwish_loongarch::Swish_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\nint Swish_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        __m128 _one = (__m128)__lsx_vreplfr2vr_s(1.f);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128i _p = __lsx_vld(ptr, 0);\n            _p = (__m128i)__lsx_vfdiv_s((__m128)_p, __lsx_vfadd_s(_one, exp_ps((__m128)__lsx_vbitrevi_w(_p, 31))));\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = *ptr / (1.f + expf(-*ptr));\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/swish_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_SWISH_LOONGARCH_H\n#define LAYER_SWISH_LOONGARCH_H\n\n#include \"swish.h\"\n\nnamespace ncnn {\n\nclass Swish_loongarch : public Swish\n{\npublic:\n    Swish_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_SWISH_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/tanh_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"tanh_loongarch.h\"\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nTanH_loongarch::TanH_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif\n}\n\nint TanH_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            _p = tanh_ps(_p);\n            __lsx_vst(_p, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = tanhf(*ptr);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/tanh_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_TANH_LOONGARCH_H\n#define LAYER_TANH_LOONGARCH_H\n\n#include \"tanh.h\"\n\nnamespace ncnn {\n\nclass TanH_loongarch : public TanH\n{\npublic:\n    TanH_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_TANH_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/loongarch/unaryop_loongarch.cpp",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"unaryop_loongarch.h\"\n\n// #include <fenv.h>\n#include <float.h>\n\n#if __loongarch_sx\n#include <lsxintrin.h>\n#include \"lsx_mathfun.h\"\n#endif // __loongarch_sx\n\nnamespace ncnn {\n\nUnaryOp_loongarch::UnaryOp_loongarch()\n{\n#if __loongarch_sx\n    support_packing = true;\n#endif // __loongarch_sx\n}\n\ntemplate<typename Op>\nstatic int unary_op_inplace(Mat& a, const Option& opt)\n{\n    Op op;\n\n    int w = a.w;\n    int h = a.h;\n    int d = a.d;\n    int channels = a.c;\n    int elempack = a.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = a.channel(q);\n\n        int i = 0;\n#if __loongarch_sx\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            __m128 _p = (__m128)__lsx_vld(ptr, 0);\n            _p = op.func_pack4(_p);\n            __lsx_vst(_p, ptr, 0);\n            ptr += 4;\n        }\n#endif // __loongarch_sx\n        for (; i < size; i++)\n        {\n            *ptr = op.func(*ptr);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nnamespace UnaryOp_loongarch_functor {\n\nstruct unary_op_abs\n{\n    float func(const float& x) const\n    {\n        return (float)fabsf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return (__m128)__lsx_vbitclri_w((__m128i)x, 31);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_neg\n{\n    float func(const float& x) const\n    {\n        return -x;\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return (__m128)__lsx_vbitrevi_w((__m128i)x, 31);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_floor\n{\n    float func(const float& x) const\n    {\n        return (float)floorf(x);\n    }\n#if 
__loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return (__m128)__lsx_vfrintrm_s(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_ceil\n{\n    float func(const float& x) const\n    {\n        return (float)ceilf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return (__m128)__lsx_vfrintrp_s(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_square\n{\n    float func(const float& x) const\n    {\n        return x * x;\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return __lsx_vfmul_s(x, x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_sqrt\n{\n    float func(const float& x) const\n    {\n        return (float)sqrtf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return __lsx_vfsqrt_s(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_rsqrt\n{\n    float func(const float& x) const\n    {\n        return (float)(1.f / sqrtf(x));\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return __lsx_vfrsqrt_s(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_exp\n{\n    float func(const float& x) const\n    {\n        return (float)expf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return exp_ps(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_log\n{\n    float func(const float& x) const\n    {\n        return (float)logf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return log_ps(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_sin\n{\n    float func(const float& x) const\n    {\n        return (float)sinf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        // TODO lsx optimize\n        float tmp[4];\n        __lsx_vst(x, tmp, 0);\n        tmp[0] = sinf(tmp[0]);\n        
tmp[1] = sinf(tmp[1]);\n        tmp[2] = sinf(tmp[2]);\n        tmp[3] = sinf(tmp[3]);\n        return (__m128)__lsx_vld(tmp, 0);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_cos\n{\n    float func(const float& x) const\n    {\n        return (float)cosf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        // TODO lsx optimize\n        float tmp[4];\n        __lsx_vst(x, tmp, 0);\n        tmp[0] = cosf(tmp[0]);\n        tmp[1] = cosf(tmp[1]);\n        tmp[2] = cosf(tmp[2]);\n        tmp[3] = cosf(tmp[3]);\n        return (__m128)__lsx_vld(tmp, 0);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_tan\n{\n    float func(const float& x) const\n    {\n        return (float)tanf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        // TODO lsx optimize\n        float tmp[4];\n        __lsx_vst(x, tmp, 0);\n        tmp[0] = tanf(tmp[0]);\n        tmp[1] = tanf(tmp[1]);\n        tmp[2] = tanf(tmp[2]);\n        tmp[3] = tanf(tmp[3]);\n        return (__m128)__lsx_vld(tmp, 0);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_asin\n{\n    float func(const float& x) const\n    {\n        return (float)asinf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        // TODO lsx optimize\n        float tmp[4];\n        __lsx_vst(x, tmp, 0);\n        tmp[0] = asinf(tmp[0]);\n        tmp[1] = asinf(tmp[1]);\n        tmp[2] = asinf(tmp[2]);\n        tmp[3] = asinf(tmp[3]);\n        return (__m128)__lsx_vld(tmp, 0);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_acos\n{\n    float func(const float& x) const\n    {\n        return (float)acosf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        // TODO lsx optimize\n        float tmp[4];\n        __lsx_vst(x, tmp, 0);\n        tmp[0] = acosf(tmp[0]);\n        tmp[1] = acosf(tmp[1]);\n        tmp[2] = acosf(tmp[2]);\n        tmp[3] = 
acosf(tmp[3]);\n        return (__m128)__lsx_vld(tmp, 0);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_atan\n{\n    float func(const float& x) const\n    {\n        return (float)atanf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        // TODO lsx optimize\n        float tmp[4];\n        __lsx_vst(x, tmp, 0);\n        tmp[0] = atanf(tmp[0]);\n        tmp[1] = atanf(tmp[1]);\n        tmp[2] = atanf(tmp[2]);\n        tmp[3] = atanf(tmp[3]);\n        return (__m128)__lsx_vld(tmp, 0);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_reciprocal\n{\n    float func(const float& x) const\n    {\n        return 1.f / x;\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return __lsx_vfrecip_s(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_tanh\n{\n    float func(const float& x) const\n    {\n        return (float)tanhf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return tanh_ps(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_log10\n{\n    float func(const float& x) const\n    {\n        return (float)log10f(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return __lsx_vfmul_s(log_ps(x), __lsx_vreplfr2vr_s(0.434294481903));\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_round\n{\n    float func(const float& x) const\n    {\n        // round to nearest even\n#if NCNN_GNU_INLINE_ASM\n        // return (x + 12582912.f) - 12582912.f;\n        float y;\n        const float magic = 12582912.f;\n        asm volatile(\n            \"fadd.s     %0, %1, %2  \\n\"\n            \"fsub.s     %0, %0, %2  \\n\"\n            : \"=f\"(y)\n            : \"f\"(x), \"f\"(magic)\n            :);\n        return y;\n#else\n#ifdef FE_TONEAREST\n        int old_rm = fegetround();\n        fesetround(FE_TONEAREST);\n#endif\n        float y = nearbyintf(x);\n#ifdef FE_TONEAREST\n  
      fesetround(old_rm);\n#endif\n        return y;\n#endif\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return (__m128)__lsx_vfrintrne_s(x);\n    }\n#endif // __loongarch_sx\n};\n\nstruct unary_op_trunc\n{\n    float func(const float& x) const\n    {\n        return (float)truncf(x);\n    }\n#if __loongarch_sx\n    __m128 func_pack4(const __m128& x) const\n    {\n        return (__m128)__lsx_vfrintrz_s(x);\n    }\n#endif // __loongarch_sx\n};\n\n} // namespace UnaryOp_loongarch_functor\n\nint UnaryOp_loongarch::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    using namespace UnaryOp_loongarch_functor;\n\n    if (op_type == Operation_ABS)\n        return unary_op_inplace<unary_op_abs>(bottom_top_blob, opt);\n\n    if (op_type == Operation_NEG)\n        return unary_op_inplace<unary_op_neg>(bottom_top_blob, opt);\n\n    if (op_type == Operation_FLOOR)\n        return unary_op_inplace<unary_op_floor>(bottom_top_blob, opt);\n\n    if (op_type == Operation_CEIL)\n        return unary_op_inplace<unary_op_ceil>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQUARE)\n        return unary_op_inplace<unary_op_square>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SQRT)\n        return unary_op_inplace<unary_op_sqrt>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RSQRT)\n        return unary_op_inplace<unary_op_rsqrt>(bottom_top_blob, opt);\n\n    if (op_type == Operation_EXP)\n        return unary_op_inplace<unary_op_exp>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG)\n        return unary_op_inplace<unary_op_log>(bottom_top_blob, opt);\n\n    if (op_type == Operation_SIN)\n        return unary_op_inplace<unary_op_sin>(bottom_top_blob, opt);\n\n    if (op_type == Operation_COS)\n        return unary_op_inplace<unary_op_cos>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TAN)\n        return unary_op_inplace<unary_op_tan>(bottom_top_blob, opt);\n\n    if (op_type 
== Operation_ASIN)\n        return unary_op_inplace<unary_op_asin>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ACOS)\n        return unary_op_inplace<unary_op_acos>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ATAN)\n        return unary_op_inplace<unary_op_atan>(bottom_top_blob, opt);\n\n    if (op_type == Operation_RECIPROCAL)\n        return unary_op_inplace<unary_op_reciprocal>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TANH)\n        return unary_op_inplace<unary_op_tanh>(bottom_top_blob, opt);\n\n    if (op_type == Operation_LOG10)\n        return unary_op_inplace<unary_op_log10>(bottom_top_blob, opt);\n\n    if (op_type == Operation_ROUND)\n        return unary_op_inplace<unary_op_round>(bottom_top_blob, opt);\n\n    if (op_type == Operation_TRUNC)\n        return unary_op_inplace<unary_op_trunc>(bottom_top_blob, opt);\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/loongarch/unaryop_loongarch.h",
    "content": "// Copyright 2022 yala <zhaojunchao@loongson.cn>;<junchao82@qq.com>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_UNARYOP_LOONGARCH_H\n#define LAYER_UNARYOP_LOONGARCH_H\n\n#include \"unaryop.h\"\n\nnamespace ncnn {\n\nclass UnaryOp_loongarch : public UnaryOp\n{\npublic:\n    UnaryOp_loongarch();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_UNARYOP_LOONGARCH_H\n"
  },
  {
    "path": "src/layer/lrn.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"lrn.h\"\n\nnamespace ncnn {\n\nLRN::LRN()\n{\n    one_blob_only = true;\n    support_inplace = true;\n}\n\nint LRN::load_param(const ParamDict& pd)\n{\n    region_type = pd.get(0, 0);\n    local_size = pd.get(1, 5);\n    alpha = pd.get(2, 1.f);\n    beta = pd.get(3, 0.75f);\n    bias = pd.get(4, 1.f);\n\n    return 0;\n}\n\nint LRN::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int channels = bottom_top_blob.c;\n    size_t elemsize = bottom_top_blob.elemsize;\n    int size = w * h;\n\n    // squared values with local_size padding\n    Mat square_blob;\n    square_blob.create(w, h, channels, elemsize, opt.workspace_allocator);\n    if (square_blob.empty())\n        return -100;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = bottom_top_blob.channel(q);\n        float* outptr = square_blob.channel(q);\n\n        for (int i = 0; i < size; i++)\n        {\n            outptr[i] = ptr[i] * ptr[i];\n        }\n    }\n\n    if (region_type == NormRegion_ACROSS_CHANNELS)\n    {\n        Mat square_sum;\n        square_sum.create(w, h, channels, elemsize, opt.workspace_allocator);\n        if (square_sum.empty())\n            return -100;\n        square_sum.fill(0.f);\n\n        const float alpha_div_size = alpha / local_size;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            // square sum\n            float* ssptr = square_sum.channel(q);\n            for (int p = q - local_size / 2; p <= q + local_size / 2; p++)\n            {\n                if (p < 0 || p >= channels)\n                    continue;\n\n                const float* sptr = square_blob.channel(p);\n                for (int i = 0; i < size; i++)\n         
       {\n                    ssptr[i] += sptr[i];\n                }\n            }\n\n            float* ptr = bottom_top_blob.channel(q);\n            for (int i = 0; i < size; i++)\n            {\n                ptr[i] = ptr[i] * powf(bias + alpha_div_size * ssptr[i], -beta);\n            }\n        }\n    }\n    else if (region_type == NormRegion_WITHIN_CHANNEL)\n    {\n        int outw = w;\n        int outh = h;\n\n        Mat square_blob_bordered = square_blob;\n        int pad = local_size / 2;\n        if (pad > 0)\n        {\n            Option opt_b = opt;\n            opt_b.blob_allocator = opt.workspace_allocator;\n            opt_b.use_packing_layout = false;\n            copy_make_border(square_blob, square_blob_bordered, pad, local_size - pad - 1, pad, local_size - pad - 1, BORDER_CONSTANT, 0.f, opt_b);\n            if (square_blob_bordered.empty())\n                return -100;\n\n            w = square_blob_bordered.w;\n            h = square_blob_bordered.h;\n        }\n\n        const int maxk = local_size * local_size;\n\n        const float alpha_div_size = alpha / maxk;\n\n        // norm window offsets\n        std::vector<int> _space_ofs(maxk);\n        int* space_ofs = &_space_ofs[0];\n        {\n            int p1 = 0;\n            int p2 = 0;\n            int gap = w - local_size;\n            for (int i = 0; i < local_size; i++)\n            {\n                for (int j = 0; j < local_size; j++)\n                {\n                    space_ofs[p1] = p2;\n                    p1++;\n                    p2++;\n                }\n                p2 += gap;\n            }\n        }\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n            const Mat m = square_blob_bordered.channel(q);\n\n            for (int i = 0; i < outh; i++)\n            {\n                for (int j = 0; j < outw; j++)\n                
{\n                    const float* sptr = m.row(i) + j;\n\n                    float ss = 0.f;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        float val = sptr[space_ofs[k]];\n                        ss += val;\n                    }\n\n                    ptr[j] = ptr[j] * powf(bias + alpha_div_size * ss, -beta);\n                }\n\n                ptr += outw;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/lrn.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_LRN_H\n#define LAYER_LRN_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass LRN : public Layer\n{\npublic:\n    LRN();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n\n    enum NormRegionType\n    {\n        NormRegion_ACROSS_CHANNELS = 0,\n        NormRegion_WITHIN_CHANNEL = 1\n    };\n\npublic:\n    // param\n    int region_type;\n    int local_size;\n    float alpha;\n    float beta;\n    float bias;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_LRN_H\n"
  },
  {
    "path": "src/layer/lstm.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"lstm.h\"\n\nnamespace ncnn {\n\nLSTM::LSTM()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint LSTM::load_param(const ParamDict& pd)\n{\n    num_output = pd.get(0, 0);\n    weight_data_size = pd.get(1, 0);\n    direction = pd.get(2, 0);\n    hidden_size = pd.get(3, num_output);\n    int8_scale_term = pd.get(8, 0);\n\n    if (int8_scale_term)\n    {\n#if !NCNN_INT8\n        NCNN_LOGE(\"please build ncnn with NCNN_INT8 enabled for int8 inference\");\n        return -1;\n#endif\n    }\n\n    return 0;\n}\n\nint LSTM::load_model(const ModelBin& mb)\n{\n    int num_directions = direction == 2 ? 2 : 1;\n\n    int size = weight_data_size / num_directions / hidden_size / 4;\n\n    // raw weight data\n    weight_xc_data = mb.load(size, hidden_size * 4, num_directions, 0);\n    if (weight_xc_data.empty())\n        return -100;\n\n    bias_c_data = mb.load(hidden_size, 4, num_directions, 0);\n    if (bias_c_data.empty())\n        return -100;\n\n    weight_hc_data = mb.load(num_output, hidden_size * 4, num_directions, 0);\n    if (weight_hc_data.empty())\n        return -100;\n\n    if (num_output != hidden_size)\n    {\n        weight_hr_data = mb.load(hidden_size, num_output, num_directions, 0);\n        if (weight_hr_data.empty())\n            return -100;\n    }\n\n#if NCNN_INT8\n    if (int8_scale_term)\n    {\n        weight_xc_data_int8_scales = mb.load(hidden_size * 4, num_directions, 1);\n        weight_hc_data_int8_scales = mb.load(hidden_size * 4, num_directions, 1);\n    }\n#endif // NCNN_INT8\n\n    return 0;\n}\n\nstatic int lstm(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc, const Mat& bias_c, const Mat& weight_hc, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n    int hidden_size = 
cell_state.w;\n\n    // 4 x hidden_size\n    Mat gates(4, hidden_size, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    Mat tmp_hidden_state;\n    if (num_output != hidden_size)\n    {\n        tmp_hidden_state.create(hidden_size, 4u, opt.workspace_allocator);\n        if (tmp_hidden_state.empty())\n            return -100;\n    }\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        // clip hidden by continuation indicator\n        // h_cont_{t-1} = cont_t * h_{t-1}\n        // h_cont_{t-1} = h_{t-1} if cont_t == 1\n        //                0       otherwise\n        // calculate hidden\n        // gate_input_t := W_hc * h_conted_{t-1} + W_xc * x_t + b_c\n\n        int ti = reverse ? T - 1 - t : t;\n\n        const float* x = bottom_blob.row(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const float* bias_c_I = bias_c.row(0);\n            const float* bias_c_F = bias_c.row(1);\n            const float* bias_c_O = bias_c.row(2);\n            const float* bias_c_G = bias_c.row(3);\n\n            float* gates_data = gates.row(q);\n\n            // gate I F O G\n            const float* weight_xc_I = weight_xc.row(hidden_size * 0 + q);\n            const float* weight_xc_F = weight_xc.row(hidden_size * 1 + q);\n            const float* weight_xc_O = weight_xc.row(hidden_size * 2 + q);\n            const float* weight_xc_G = weight_xc.row(hidden_size * 3 + q);\n\n            const float* weight_hc_I = weight_hc.row(hidden_size * 0 + q);\n            const float* weight_hc_F = weight_hc.row(hidden_size * 1 + q);\n            const float* weight_hc_O = weight_hc.row(hidden_size * 2 + q);\n            const float* weight_hc_G = weight_hc.row(hidden_size * 3 + q);\n\n            float I = bias_c_I[q];\n            float F = bias_c_F[q];\n            float O = bias_c_O[q];\n            float G = bias_c_G[q];\n\n            for (int i = 
0; i < size; i++)\n            {\n                float xi = x[i];\n\n                I += weight_xc_I[i] * xi;\n                F += weight_xc_F[i] * xi;\n                O += weight_xc_O[i] * xi;\n                G += weight_xc_G[i] * xi;\n            }\n\n            for (int i = 0; i < num_output; i++)\n            {\n                float h_cont = hidden_state[i];\n\n                I += weight_hc_I[i] * h_cont;\n                F += weight_hc_F[i] * h_cont;\n                O += weight_hc_O[i] * h_cont;\n                G += weight_hc_G[i] * h_cont;\n            }\n\n            gates_data[0] = I;\n            gates_data[1] = F;\n            gates_data[2] = O;\n            gates_data[3] = G;\n        }\n\n        // lstm unit\n        // sigmoid(I)\n        // sigmoid(F)\n        // sigmoid(O)\n        // tanh(G)\n        // c_t := f_t .* c_{t-1} + i_t .* g_t\n        // h_t := o_t .* tanh[c_t]\n        float* output_data = top_blob.row(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const float* gates_data = gates.row(q);\n\n            float I = gates_data[0];\n            float F = gates_data[1];\n            float O = gates_data[2];\n            float G = gates_data[3];\n\n            I = 1.f / (1.f + expf(-I));\n            F = 1.f / (1.f + expf(-F));\n            O = 1.f / (1.f + expf(-O));\n            G = tanhf(G);\n\n            float cell2 = F * cell_state[q] + I * G;\n            float H = O * tanhf(cell2);\n            cell_state[q] = cell2;\n\n            if (num_output == hidden_size)\n            {\n                hidden_state[q] = H;\n                output_data[q] = H;\n            }\n            else\n            {\n                tmp_hidden_state[q] = H;\n            }\n        }\n\n        if (num_output != hidden_size)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < num_output; q++)\n  
          {\n                const float* hr = weight_hr.row(q);\n\n                float H = 0;\n                for (int i = 0; i < hidden_size; i++)\n                {\n                    H += tmp_hidden_state[i] * hr[i];\n                }\n\n                hidden_state[q] = H;\n                output_data[q] = H;\n            }\n        }\n    }\n\n    return 0;\n}\n\n#if NCNN_INT8\nstatic int lstm_int8(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& weight_xc_int8, const float* weight_xc_int8_scales, const Mat& bias_c, const Mat& weight_hc_int8, const float* weight_hc_int8_scales, const Mat& weight_hr, Mat& hidden_state, Mat& cell_state, const Option& opt)\n{\n    int size = bottom_blob.w;\n    int T = bottom_blob.h;\n\n    int num_output = top_blob.w;\n    int hidden_size = cell_state.w;\n\n    // 4 x hidden_size\n    Mat gates(4, hidden_size, 4u, opt.workspace_allocator);\n    if (gates.empty())\n        return -100;\n\n    Mat tmp_hidden_state;\n    if (num_output != hidden_size)\n    {\n        tmp_hidden_state.create(hidden_size, 4u, opt.workspace_allocator);\n        if (tmp_hidden_state.empty())\n            return -100;\n    }\n\n    // dynamic quantize bottom_blob\n    Mat bottom_blob_int8(size, T, (size_t)1u, 1, opt.workspace_allocator);\n    Mat bottom_blob_int8_scales(T, (size_t)4u, 1, opt.workspace_allocator);\n    {\n        for (int t = 0; t < T; t++)\n        {\n            const float* x = bottom_blob.row(t);\n\n            float absmax = 0.f;\n            for (int i = 0; i < size; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(x[i]));\n            }\n\n            // an all-zero row would divide by zero; use a neutral scale as done for hidden_state\n            bottom_blob_int8_scales[t] = absmax == 0.f ? 1.f : 127.f / absmax;\n        }\n\n        Option opt_quant = opt;\n        opt_quant.blob_allocator = opt.workspace_allocator;\n        opt_quant.use_packing_layout = false;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_quant);\n    }\n\n    Mat 
hidden_state_int8(num_output, (size_t)1u, 1, opt.workspace_allocator);\n    Mat hidden_state_int8_scales(1, (size_t)4u, 1, opt.workspace_allocator);\n\n    // unroll\n    for (int t = 0; t < T; t++)\n    {\n        // clip hidden by continuation indicator\n        // h_cont_{t-1} = cont_t * h_{t-1}\n        // h_cont_{t-1} = h_{t-1} if cont_t == 1\n        //                0       otherwise\n        // calculate hidden\n        // gate_input_t := W_hc * h_conted_{t-1} + W_xc * x_t + b_c\n\n        int ti = reverse ? T - 1 - t : t;\n\n        // dynamic quantize hidden_state\n        {\n            float absmax = 0.f;\n            for (int i = 0; i < num_output; i++)\n            {\n                absmax = std::max(absmax, (float)fabs(hidden_state[i]));\n            }\n\n            if (absmax == 0.f)\n            {\n                hidden_state_int8_scales[0] = 1.f;\n                hidden_state_int8.fill<signed char>(0);\n            }\n            else\n            {\n                hidden_state_int8_scales[0] = 127.f / absmax;\n\n                Option opt_quant = opt;\n                opt_quant.blob_allocator = opt.workspace_allocator;\n                opt_quant.use_packing_layout = false;\n                quantize_to_int8(hidden_state, hidden_state_int8, hidden_state_int8_scales, opt_quant);\n            }\n        }\n\n        const signed char* x = bottom_blob_int8.row<const signed char>(ti);\n        const signed char* hs = hidden_state_int8;\n        const float descale_x = 1.f / bottom_blob_int8_scales[ti];\n        const float descale_h = 1.f / hidden_state_int8_scales[0];\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const float* bias_c_I = bias_c.row(0);\n            const float* bias_c_F = bias_c.row(1);\n            const float* bias_c_O = bias_c.row(2);\n            const float* bias_c_G = bias_c.row(3);\n\n            float* gates_data = gates.row(q);\n\n     
       // gate I F O G\n            const signed char* weight_xc_int8_I = weight_xc_int8.row<const signed char>(hidden_size * 0 + q);\n            const signed char* weight_xc_int8_F = weight_xc_int8.row<const signed char>(hidden_size * 1 + q);\n            const signed char* weight_xc_int8_O = weight_xc_int8.row<const signed char>(hidden_size * 2 + q);\n            const signed char* weight_xc_int8_G = weight_xc_int8.row<const signed char>(hidden_size * 3 + q);\n\n            const signed char* weight_hc_int8_I = weight_hc_int8.row<const signed char>(hidden_size * 0 + q);\n            const signed char* weight_hc_int8_F = weight_hc_int8.row<const signed char>(hidden_size * 1 + q);\n            const signed char* weight_hc_int8_O = weight_hc_int8.row<const signed char>(hidden_size * 2 + q);\n            const signed char* weight_hc_int8_G = weight_hc_int8.row<const signed char>(hidden_size * 3 + q);\n\n            const float descale_xc_I = 1.f / weight_xc_int8_scales[hidden_size * 0 + q];\n            const float descale_xc_F = 1.f / weight_xc_int8_scales[hidden_size * 1 + q];\n            const float descale_xc_O = 1.f / weight_xc_int8_scales[hidden_size * 2 + q];\n            const float descale_xc_G = 1.f / weight_xc_int8_scales[hidden_size * 3 + q];\n            const float descale_hc_I = 1.f / weight_hc_int8_scales[hidden_size * 0 + q];\n            const float descale_hc_F = 1.f / weight_hc_int8_scales[hidden_size * 1 + q];\n            const float descale_hc_O = 1.f / weight_hc_int8_scales[hidden_size * 2 + q];\n            const float descale_hc_G = 1.f / weight_hc_int8_scales[hidden_size * 3 + q];\n\n            int Ix = 0;\n            int Fx = 0;\n            int Ox = 0;\n            int Gx = 0;\n            for (int i = 0; i < size; i++)\n            {\n                signed char xi = x[i];\n\n                Ix += weight_xc_int8_I[i] * xi;\n                Fx += weight_xc_int8_F[i] * xi;\n                Ox += weight_xc_int8_O[i] * xi;\n              
  Gx += weight_xc_int8_G[i] * xi;\n            }\n\n            int Ih = 0;\n            int Fh = 0;\n            int Oh = 0;\n            int Gh = 0;\n            for (int i = 0; i < num_output; i++)\n            {\n                signed char h_cont = hs[i];\n\n                Ih += weight_hc_int8_I[i] * h_cont;\n                Fh += weight_hc_int8_F[i] * h_cont;\n                Oh += weight_hc_int8_O[i] * h_cont;\n                Gh += weight_hc_int8_G[i] * h_cont;\n            }\n\n            float I = bias_c_I[q] + Ix * (descale_x * descale_xc_I) + Ih * (descale_h * descale_hc_I);\n            float F = bias_c_F[q] + Fx * (descale_x * descale_xc_F) + Fh * (descale_h * descale_hc_F);\n            float O = bias_c_O[q] + Ox * (descale_x * descale_xc_O) + Oh * (descale_h * descale_hc_O);\n            float G = bias_c_G[q] + Gx * (descale_x * descale_xc_G) + Gh * (descale_h * descale_hc_G);\n\n            gates_data[0] = I;\n            gates_data[1] = F;\n            gates_data[2] = O;\n            gates_data[3] = G;\n        }\n\n        // lstm unit\n        // sigmoid(I)\n        // sigmoid(F)\n        // sigmoid(O)\n        // tanh(G)\n        // c_t := f_t .* c_{t-1} + i_t .* g_t\n        // h_t := o_t .* tanh[c_t]\n        float* output_data = top_blob.row(ti);\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < hidden_size; q++)\n        {\n            const float* gates_data = gates.row(q);\n\n            float I = gates_data[0];\n            float F = gates_data[1];\n            float O = gates_data[2];\n            float G = gates_data[3];\n\n            I = 1.f / (1.f + expf(-I));\n            F = 1.f / (1.f + expf(-F));\n            O = 1.f / (1.f + expf(-O));\n            G = tanhf(G);\n\n            float cell2 = F * cell_state[q] + I * G;\n            float H = O * tanhf(cell2);\n            cell_state[q] = cell2;\n\n            if (num_output == hidden_size)\n            {\n                hidden_state[q] 
= H;\n                output_data[q] = H;\n            }\n            else\n            {\n                tmp_hidden_state[q] = H;\n            }\n        }\n\n        if (num_output != hidden_size)\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int q = 0; q < num_output; q++)\n            {\n                const float* hr = weight_hr.row(q);\n\n                float H = 0;\n                for (int i = 0; i < hidden_size; i++)\n                {\n                    H += tmp_hidden_state[i] * hr[i];\n                }\n\n                hidden_state[q] = H;\n                output_data[q] = H;\n            }\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\nint LSTM::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int T = bottom_blob.h;\n\n    int num_directions = direction == 2 ? 2 : 1;\n\n    // initial hidden state\n    Mat hidden(num_output, 4u, opt.workspace_allocator);\n    if (hidden.empty())\n        return -100;\n    hidden.fill(0.f);\n\n    Mat cell(hidden_size, 4u, opt.workspace_allocator);\n    if (cell.empty())\n        return -100;\n    cell.fill(0.f);\n\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = lstm_int8(bottom_blob, top_blob, direction, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = lstm(bottom_blob, top_blob, direction, weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = lstm_int8(bottom_blob, top_blob_forward, 0, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = lstm(bottom_blob, top_blob_forward, 0, weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        hidden.fill(0.0f);\n        cell.fill(0.0f);\n\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = lstm_int8(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), weight_xc_data_int8_scales.row(1), bias_c_data.channel(1), weight_hc_data.channel(1), weight_hc_data_int8_scales.row(1), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(1), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = lstm(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), bias_c_data.channel(1), weight_hc_data.channel(1), num_output == hidden_size ? Mat() : weight_hr_data.channel(1), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    return 0;\n}\n\nint LSTM::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    int T = bottom_blob.h;\n    int num_directions = direction == 2 ? 2 : 1;\n\n    Mat hidden;\n    Mat cell;\n    Allocator* hidden_cell_allocator = top_blobs.size() == 3 ? 
opt.blob_allocator : opt.workspace_allocator;\n    if (bottom_blobs.size() == 3)\n    {\n        hidden = bottom_blobs[1].clone(hidden_cell_allocator);\n        cell = bottom_blobs[2].clone(hidden_cell_allocator);\n    }\n    else\n    {\n        hidden.create(num_output, num_directions, 4u, hidden_cell_allocator);\n        if (hidden.empty())\n            return -100;\n        hidden.fill(0.f);\n\n        cell.create(hidden_size, num_directions, 4u, hidden_cell_allocator);\n        if (cell.empty())\n            return -100;\n        cell.fill(0.f);\n    }\n\n    Mat& top_blob = top_blobs[0];\n    top_blob.create(num_output * num_directions, T, 4u, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    // Uni directional\n    if (direction == 0 || direction == 1)\n    {\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = lstm_int8(bottom_blob, top_blob, direction, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = lstm(bottom_blob, top_blob, direction, weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(0), hidden, cell, opt);\n            if (ret != 0)\n                return ret;\n        }\n    }\n\n    if (direction == 2)\n    {\n        Mat top_blob_forward(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_forward.empty())\n            return -100;\n\n        Mat top_blob_reverse(num_output, T, 4u, opt.workspace_allocator);\n        if (top_blob_reverse.empty())\n            return -100;\n\n        Mat hidden0 = hidden.row_range(0, 1);\n        Mat cell0 = cell.row_range(0, 1);\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = lstm_int8(bottom_blob, top_blob_forward, 0, weight_xc_data.channel(0), weight_xc_data_int8_scales.row(0), bias_c_data.channel(0), weight_hc_data.channel(0), weight_hc_data_int8_scales.row(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden0, cell0, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = lstm(bottom_blob, top_blob_forward, 0, weight_xc_data.channel(0), bias_c_data.channel(0), weight_hc_data.channel(0), num_output == hidden_size ? Mat() : weight_hr_data.channel(0), hidden0, cell0, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        Mat hidden1 = hidden.row_range(1, 1);\n        Mat cell1 = cell.row_range(1, 1);\n#if NCNN_INT8\n        if (int8_scale_term)\n        {\n            int ret = lstm_int8(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), weight_xc_data_int8_scales.row(1), bias_c_data.channel(1), weight_hc_data.channel(1), weight_hc_data_int8_scales.row(1), num_output == hidden_size ? 
Mat() : weight_hr_data.channel(1), hidden1, cell1, opt);\n            if (ret != 0)\n                return ret;\n        }\n        else\n#endif\n        {\n            int ret = lstm(bottom_blob, top_blob_reverse, 1, weight_xc_data.channel(1), bias_c_data.channel(1), weight_hc_data.channel(1), num_output == hidden_size ? Mat() : weight_hr_data.channel(1), hidden1, cell1, opt);\n            if (ret != 0)\n                return ret;\n        }\n\n        // concat w\n        for (int i = 0; i < T; i++)\n        {\n            const float* pf = top_blob_forward.row(i);\n            const float* pr = top_blob_reverse.row(i);\n            float* ptr = top_blob.row(i);\n\n            memcpy(ptr, pf, num_output * sizeof(float));\n            memcpy(ptr + num_output, pr, num_output * sizeof(float));\n        }\n    }\n\n    if (top_blobs.size() == 3)\n    {\n        top_blobs[1] = hidden;\n        top_blobs[2] = cell;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/lstm.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_LSTM_H\n#define LAYER_LSTM_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass LSTM : public Layer\n{\npublic:\n    LSTM();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    int num_output;\n    int weight_data_size;\n    int direction; // 0=forward 1=reverse 2=bidirectional\n    int hidden_size;\n\n    int int8_scale_term;\n\n    Mat weight_hc_data;\n    Mat weight_xc_data;\n    Mat bias_c_data;\n    Mat weight_hr_data;\n\n#if NCNN_INT8\n    Mat weight_hc_data_int8_scales;\n    Mat weight_xc_data_int8_scales;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_LSTM_H\n"
  },
  {
    "path": "src/layer/matmul.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"matmul.h\"\n\nnamespace ncnn {\n\nMatMul::MatMul()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint MatMul::load_param(const ParamDict& pd)\n{\n    transB = pd.get(0, 0);\n\n    return 0;\n}\n\nstatic void transpose(const Mat& X, Mat& XT, const Option& opt)\n{\n    const int w = X.w;\n    const int h = X.h;\n\n    const float* pX = X;\n    float* pXT = XT;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int i = 0; i < w; i++)\n    {\n        float* ptr = pXT + i * h;\n        for (int j = 0; j < h; j++)\n        {\n            ptr[j] = pX[j * w + i];\n        }\n    }\n}\n\nstatic void matmul_transb(const Mat& A, const Mat& B, Mat& top_blob, const Option& opt)\n{\n    const int M = A.h;\n    const int K = A.w; // assert A.w == B.w\n    const int N = B.h;\n\n    const float* pA = A;\n    const float* pB = B;\n    float* pOut = top_blob;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int i = 0; i < M; i++)\n    {\n        const float* ptrA = pA + i * K;\n        float* outptr = pOut + i * N;\n\n        for (int j = 0; j < N; j++)\n        {\n            const float* ptrB = pB + j * K;\n\n            float sum = 0.f;\n            for (int k = 0; k < K; k++)\n            {\n                sum += ptrA[k] * ptrB[k];\n            }\n\n            *outptr++ = sum;\n        }\n    }\n}\n\nint MatMul::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int Adims = A.dims;\n    const int Bdims = B.dims;\n    const int max_ABdims = std::max(Adims, Bdims);\n    const size_t elemsize = A.elemsize;\n\n    if (Adims == 1 && Bdims == 1)\n    {\n        // dot product\n        top_blob.create(1, elemsize, opt.blob_allocator);\n        if 
(top_blob.empty())\n            return -100;\n\n        const int K = A.w; // assert A.w == B.w\n        const float* ptrA = A;\n        const float* ptrB = B;\n\n        float sum = 0.f;\n        for (int k = 0; k < K; k++)\n        {\n            sum += ptrA[k] * ptrB[k];\n        }\n\n        top_blob[0] = sum;\n    }\n    else if (Adims == 2 && Bdims == 2)\n    {\n        // matrix multiply\n        const int M = A.h;\n        const int N = transB == 0 ? B.w : B.h;\n\n        top_blob.create(N, M, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        Mat BT;\n        if (transB == 0)\n        {\n            BT.create(B.h, B.w, elemsize, opt.workspace_allocator);\n            if (BT.empty())\n                return -100;\n\n            transpose(B, BT, opt);\n        }\n        else\n        {\n            BT = B;\n        }\n\n        matmul_transb(A, BT, top_blob, opt);\n    }\n    else if (Adims == 1 && Bdims == 2)\n    {\n        // matrix multiply\n        const int N = transB == 0 ? 
B.w : B.h;\n\n        Mat top_blob1(N, 1, elemsize, opt.blob_allocator);\n        if (top_blob1.empty())\n            return -100;\n\n        Mat A1 = A.reshape(A.w, 1);\n\n        Mat BT;\n        if (transB == 0)\n        {\n            BT.create(B.h, B.w, elemsize, opt.workspace_allocator);\n            if (BT.empty())\n                return -100;\n\n            transpose(B, BT, opt);\n        }\n        else\n        {\n            BT = B;\n        }\n\n        matmul_transb(A1, BT, top_blob1, opt);\n\n        top_blob = top_blob1.reshape(N);\n    }\n    else if (Adims == 2 && Bdims == 1)\n    {\n        // matrix multiply\n        const int M = A.h;\n\n        Mat top_blob1(1, M, elemsize, opt.blob_allocator);\n        if (top_blob1.empty())\n            return -100;\n\n        Mat BT = B.reshape(B.w, 1);\n\n        matmul_transb(A, BT, top_blob1, opt);\n\n        top_blob = top_blob1.reshape(M);\n    }\n    else if (Adims == 1 && Bdims > 2)\n    {\n        // batched matrix multiply\n        const int N = transB == 0 ? 
B.w : B.h;\n        const int batch_size = B.d * B.c;\n\n        Mat top_blob1(N, 1, batch_size, elemsize, opt.blob_allocator);\n        if (top_blob1.empty())\n            return -100;\n\n        Mat A1 = A.reshape(A.w, 1);\n        Mat B1 = B.reshape(B.w, B.h, batch_size);\n\n        for (int p = 0; p < batch_size; p++)\n        {\n            Mat BT;\n            if (transB == 0)\n            {\n                BT.create(B.h, B.w, elemsize, opt.workspace_allocator);\n                if (BT.empty())\n                    return -100;\n\n                transpose(B1.channel(p), BT, opt);\n            }\n            else\n            {\n                BT = B1.channel(p);\n            }\n\n            Mat top_blob1_p = top_blob1.channel(p);\n            matmul_transb(A1, BT, top_blob1_p, opt);\n        }\n\n        if (Bdims == 3)\n            top_blob = top_blob1.reshape(N, B.d * B.c);\n        else\n            top_blob = top_blob1.reshape(N, B.d, B.c);\n    }\n    else if (Adims > 2 && Bdims == 1)\n    {\n        // batched matrix multiply\n        const int M = A.h;\n        const int batch_size = A.d * A.c;\n\n        Mat top_blob1(1, M, batch_size, elemsize, opt.blob_allocator);\n        if (top_blob1.empty())\n            return -100;\n\n        Mat A1 = A.reshape(A.w, A.h, batch_size);\n        Mat BT = B.reshape(B.w, 1);\n\n        for (int p = 0; p < batch_size; p++)\n        {\n            Mat top_blob1_p = top_blob1.channel(p);\n            matmul_transb(A1.channel(p), BT, top_blob1_p, opt);\n        }\n\n        if (Adims == 3)\n            top_blob = top_blob1.reshape(M, A.d * A.c);\n        else\n            top_blob = top_blob1.reshape(M, A.d, A.c);\n    }\n    else if (max_ABdims == 3)\n    {\n        Mat A1 = Adims == 2 ? A.reshape(A.w, A.h, 1) : A;\n        Mat B1 = Bdims == 2 ? B.reshape(B.w, B.h, 1) : B;\n\n        const int M = A1.h;\n        const int N = transB == 0 ? 
B1.w : B1.h;\n        const int batch_size = std::max(A1.c, B1.c);\n\n        top_blob.create(N, M, batch_size, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        Mat BT0;\n        if (B1.c == 1)\n        {\n            if (transB == 0)\n            {\n                BT0.create(B1.h, B1.w, elemsize, opt.workspace_allocator);\n                if (BT0.empty())\n                    return -100;\n\n                transpose(B1.channel(0), BT0, opt);\n            }\n            else\n            {\n                BT0 = B1.channel(0);\n            }\n        }\n\n        for (int p = 0; p < batch_size; p++)\n        {\n            int Ap = A1.c == 1 ? 0 : p;\n            int Bp = B1.c == 1 ? 0 : p;\n\n            Mat BT;\n            if (B1.c == 1)\n            {\n                BT = BT0;\n            }\n            else\n            {\n                if (transB == 0)\n                {\n                    BT.create(B1.h, B1.w, elemsize, opt.workspace_allocator);\n                    if (BT.empty())\n                        return -100;\n\n                    transpose(B1.channel(Bp), BT, opt);\n                }\n                else\n                {\n                    BT = B1.channel(Bp);\n                }\n            }\n\n            Mat top_blob_p = top_blob.channel(p);\n            matmul_transb(A1.channel(Ap), BT, top_blob_p, opt);\n        }\n    }\n    else if (max_ABdims == 4)\n    {\n        Mat A1 = Adims == 3 ? A.reshape(A.w, A.h, A.c, 1) : A;\n        Mat B1 = Bdims == 3 ? B.reshape(B.w, B.h, B.c, 1) : B;\n\n        const int M = A1.h;\n        const int N = transB == 0 ? 
B1.w : B1.h;\n        const int batch_size_d = std::max(A1.d, B1.d);\n        const int batch_size_c = std::max(A1.c, B1.c);\n\n        top_blob.create(N, M, batch_size_d, batch_size_c, elemsize, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        Mat BT00;\n        if (B1.d == 1 && B1.c == 1)\n        {\n            if (transB == 0)\n            {\n                BT00.create(B1.h, B1.w, elemsize, opt.workspace_allocator);\n                if (BT00.empty())\n                    return -100;\n\n                transpose(B1.channel(0).depth(0), BT00, opt);\n            }\n            else\n            {\n                BT00 = B1.channel(0).depth(0);\n            }\n        }\n\n        for (int p = 0; p < batch_size_c; p++)\n        {\n            int Ap = A1.c == 1 ? 0 : p;\n            int Bp = B1.c == 1 ? 0 : p;\n\n            Mat BT0x;\n            if (B1.d == 1 && B1.c != 1)\n            {\n                if (transB == 0)\n                {\n                    BT0x.create(B1.h, B1.w, elemsize, opt.workspace_allocator);\n                    if (BT0x.empty())\n                        return -100;\n\n                    transpose(B1.channel(Bp).depth(0), BT0x, opt);\n                }\n                else\n                {\n                    BT0x = B1.channel(Bp).depth(0);\n                }\n            }\n\n            for (int q = 0; q < batch_size_d; q++)\n            {\n                int Ad = A1.d == 1 ? 0 : q;\n                int Bd = B1.d == 1 ? 
0 : q;\n\n                Mat BT;\n                if (B1.d == 1 && B1.c == 1)\n                {\n                    BT = BT00;\n                }\n                else if (B1.d == 1 && B1.c != 1)\n                {\n                    BT = BT0x;\n                }\n                else\n                {\n                    if (transB == 0)\n                    {\n                        BT.create(B1.h, B1.w, elemsize, opt.workspace_allocator);\n                        if (BT.empty())\n                            return -100;\n\n                        transpose(B1.channel(Bp).depth(Bd), BT, opt);\n                    }\n                    else\n                    {\n                        BT = B1.channel(Bp).depth(Bd);\n                    }\n                }\n\n                Mat top_blob_p_q = top_blob.channel(p).depth(q);\n                matmul_transb(A1.channel(Ap).depth(Ad), BT, top_blob_p_q, opt);\n            }\n        }\n    }\n    else\n    {\n        NCNN_LOGE(\"impossible matmul %d %d\", Adims, Bdims);\n        return -1;\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/matmul.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_MATMUL_H\n#define LAYER_MATMUL_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass MatMul : public Layer\n{\npublic:\n    MatMul();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    int transB;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_MATMUL_H\n"
  },
  {
    "path": "src/layer/memorydata.cpp",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"memorydata.h\"\n\nnamespace ncnn {\n\nMemoryData::MemoryData()\n{\n    one_blob_only = false;\n    support_inplace = false;\n}\n\nint MemoryData::load_param(const ParamDict& pd)\n{\n    w = pd.get(0, 0);\n    h = pd.get(1, 0);\n    d = pd.get(11, 0);\n    c = pd.get(2, 0);\n    load_type = pd.get(21, 1);\n\n    return 0;\n}\n\nint MemoryData::load_model(const ModelBin& mb)\n{\n    if (d != 0)\n    {\n        data = mb.load(w, h, d, c, load_type);\n    }\n    else if (c != 0)\n    {\n        data = mb.load(w, h, c, load_type);\n    }\n    else if (h != 0)\n    {\n        data = mb.load(w, h, load_type);\n    }\n    else if (w != 0)\n    {\n        data = mb.load(w, load_type);\n    }\n    else // 0 0 0\n    {\n        data.create(1);\n    }\n    if (data.empty())\n        return -100;\n\n    return 0;\n}\n\nint MemoryData::forward(const std::vector<Mat>& /*bottom_blobs*/, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    Mat& top_blob = top_blobs[0];\n\n    top_blob = data.clone(opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/memorydata.h",
    "content": "// Copyright 2017 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_MEMORYDATA_H\n#define LAYER_MEMORYDATA_H\n\n#include \"layer.h\"\n\nnamespace ncnn {\n\nclass MemoryData : public Layer\n{\npublic:\n    MemoryData();\n\n    virtual int load_param(const ParamDict& pd);\n\n    virtual int load_model(const ModelBin& mb);\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    int w;\n    int h;\n    int d;\n    int c;\n    int load_type;\n\n    Mat data;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_MEMORYDATA_H\n"
  },
  {
    "path": "src/layer/mips/absval_mips.h",
    "content": "// Copyright 2019 Leo <leo@nullptr.com.cn>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_ABSVAL_MIPS_H\n#define LAYER_ABSVAL_MIPS_H\n\n#include \"absval.h\"\n\nnamespace ncnn {\n\nclass AbsVal_mips : public AbsVal\n{\npublic:\n    AbsVal_mips();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_ABSVAL_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/batchnorm_mips.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"batchnorm_mips.h\"\n\n#if __mips_msa\n#include <msa.h>\n#endif // __mips_msa\n\n#include \"mips_usability.h\"\n\nnamespace ncnn {\n\nBatchNorm_mips::BatchNorm_mips()\n{\n#if __mips_msa\n    support_packing = true;\n#endif // __mips_msa\n}\n\nint BatchNorm_mips::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int dims = bottom_top_blob.dims;\n    int elempack = bottom_top_blob.elempack;\n\n    if (dims == 1)\n    {\n        int w = bottom_top_blob.w * elempack;\n\n#if __mips_msa\n        int nn_w = w / 4;\n        int remain_w_start = nn_w * 4;\n#else\n        int remain_w_start = 0;\n#endif // __mips_msa\n\n        float* ptr = bottom_top_blob;\n\n#if __mips_msa\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < nn_w; i++)\n        {\n            float* ptr0 = ptr + i * 4;\n\n            v4f32 _p = (v4f32)__msa_ld_w(ptr0, 0);\n            v4f32 _a = (v4f32)__msa_ld_w((const float*)a_data + i * 4, 0);\n            v4f32 _b = (v4f32)__msa_ld_w((const float*)b_data + i * 4, 0);\n            _p = __msa_fmadd_w(_a, _p, _b);\n            __msa_st_w((v4i32)_p, ptr0, 0);\n        }\n#endif // __mips_msa\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_w_start; i < w; i++)\n        {\n            ptr[i] = b_data[i] * ptr[i] + a_data[i];\n        }\n    }\n\n    if (dims == 2)\n    {\n        int w = bottom_top_blob.w * elempack;\n        int h = bottom_top_blob.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* ptr = bottom_top_blob.row(i);\n            float a = a_data[i];\n            float b = b_data[i];\n\n            int j = 0;\n#if __mips_msa\n            v4f32 _a = elempack == 4 ? 
(v4f32)__msa_ld_w((const float*)a_data + i * 4, 0) : (v4f32)__msa_fill_w_f32(a);\n            v4f32 _b = elempack == 4 ? (v4f32)__msa_ld_w((const float*)b_data + i * 4, 0) : (v4f32)__msa_fill_w_f32(b);\n            for (; j + 3 < w; j += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n                _p = __msa_fmadd_w(_a, _p, _b);\n                __msa_st_w((v4i32)_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __mips_msa\n            for (; j < w; j++)\n            {\n                *ptr = b * *ptr + a;\n                ptr++;\n            }\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        int w = bottom_top_blob.w;\n        int h = bottom_top_blob.h;\n        int d = bottom_top_blob.d;\n        int c = bottom_top_blob.c;\n        int size = w * h * d * elempack;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < c; q++)\n        {\n            float* ptr = bottom_top_blob.channel(q);\n            float a = a_data[q];\n            float b = b_data[q];\n\n            int i = 0;\n#if __mips_msa\n            v4f32 _a = elempack == 4 ? (v4f32)__msa_ld_w((const float*)a_data + q * 4, 0) : (v4f32)__msa_fill_w_f32(a);\n            v4f32 _b = elempack == 4 ? (v4f32)__msa_ld_w((const float*)b_data + q * 4, 0) : (v4f32)__msa_fill_w_f32(b);\n            for (; i + 3 < size; i += 4)\n            {\n                __builtin_prefetch(ptr + 16);\n                v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n                _p = __msa_fmadd_w(_a, _p, _b);\n                __msa_st_w((v4i32)_p, ptr, 0);\n\n                ptr += 4;\n            }\n#endif // __mips_msa\n            for (; i < size; i++)\n            {\n                *ptr = b * *ptr + a;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/batchnorm_mips.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BATCHNORM_MIPS_H\n#define LAYER_BATCHNORM_MIPS_H\n\n#include \"batchnorm.h\"\n\nnamespace ncnn {\n\nclass BatchNorm_mips : public BatchNorm\n{\npublic:\n    BatchNorm_mips();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BATCHNORM_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/bias_mips.cpp",
    "content": "// Copyright 2019 Leo <leo@nullptr.com.cn>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"bias_mips.h\"\n\n#if __mips_msa\n#include <msa.h>\n#endif // __mips_msa\n\n#include \"mips_usability.h\"\n\nnamespace ncnn {\n\nint Bias_mips::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int size = w * h * d;\n\n    const float* bias_ptr = bias_data;\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        float bias = bias_ptr[q];\n\n#if __mips_msa\n        int nn = size >> 2;\n        int remain = size - (nn << 2);\n#else\n        int remain = size;\n#endif // __mips_msa\n\n#if __mips_msa\n        v4f32 _bias = (v4f32)__msa_fill_w_f32(bias);\n        for (; nn > 0; nn--)\n        {\n            v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n            v4f32 _outp = __msa_fadd_w(_p, _bias);\n            __msa_st_w((v4i32)_outp, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __mips_msa\n\n        for (; remain > 0; remain--)\n        {\n            *ptr = *ptr + bias;\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/bias_mips.h",
    "content": "// Copyright 2019 Leo <leo@nullptr.com.cn>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BIAS_MIPS_H\n#define LAYER_BIAS_MIPS_H\n\n#include \"bias.h\"\n\nnamespace ncnn {\n\nclass Bias_mips : public Bias\n{\npublic:\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BIAS_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/binaryop_mips.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"binaryop_mips.h\"\n\n#if __mips_msa\n#include <msa.h>\n#include \"msa_mathfun.h\"\n#endif // __mips_msa\n\nnamespace ncnn {\n\nBinaryOp_mips::BinaryOp_mips()\n{\n#if __mips_msa\n    support_packing = true;\n#endif // __mips_msa\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_no_broadcast(const float* ptr, const float* ptr1, float* outptr, int size)\n{\n    const Op op;\n\n    int i = 0;\n#if __mips_msa\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr + 16);\n        __builtin_prefetch(ptr1 + 16);\n        v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n        v4f32 _b = (v4f32)__msa_ld_w(ptr1, 0);\n        v4f32 _outp = op(_p, _b);\n        __msa_st_w((v4i32)_outp, outptr, 0);\n        ptr += 4;\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __mips_msa\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, *ptr1);\n        ptr += 1;\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_b(const float* ptr, const float* ptr1, float* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float b = *ptr1;\n#if __mips_msa\n    v4f32 _b_128 = (elempack == 4) ? 
(v4f32)__msa_ld_w(ptr1, 0) : __msa_fill_w_f32(b);\n#endif // __mips_msa\n\n    int i = 0;\n#if __mips_msa\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr + 16);\n        v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n        v4f32 _outp = op(_p, _b_128);\n        __msa_st_w((v4i32)_outp, outptr, 0);\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __mips_msa\n    for (; i < size; i++)\n    {\n        *outptr = op(*ptr, b);\n        ptr += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_a(const float* ptr, const float* ptr1, float* outptr, int size, int elempack)\n{\n    const Op op;\n\n    const float a = *ptr;\n#if __mips_msa\n    v4f32 _a_128 = (elempack == 4) ? (v4f32)__msa_ld_w(ptr, 0) : __msa_fill_w_f32(a);\n#endif // __mips_msa\n\n    int i = 0;\n#if __mips_msa\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr1 + 16);\n        v4f32 _b = (v4f32)__msa_ld_w(ptr1, 0);\n        v4f32 _outp = op(_a_128, _b);\n        __msa_st_w((v4i32)_outp, outptr, 0);\n        ptr1 += 4;\n        outptr += 4;\n    }\n#endif // __mips_msa\n    for (; i < size; i++)\n    {\n        *outptr = op(a, *ptr1);\n        ptr1 += 1;\n        outptr += 1;\n    }\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb(const float* ptr, const float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __mips_msa\n    if (elempack == 4)\n    {\n        int i = 0;\n        for (; i < w; i++)\n        {\n            __builtin_prefetch(ptr + 16);\n            v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n            v4f32 _b = __msa_fill_w_f32(*ptr1);\n            v4f32 _outp = op(_p, _b);\n            __msa_st_w((v4i32)_outp, outptr, 0);\n            ptr += 4;\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __mips_msa\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_b(const float* ptr, const float* ptr1, float* 
outptr, int w, int elempack)\n{\n    const Op op;\n\n    const int size = w * elempack;\n\n    int i = 0;\n#if __mips_msa\n    v4f32 _b = __msa_fill_w_f32(*ptr1);\n    for (; i + 3 < size; i += 4)\n    {\n        __builtin_prefetch(ptr + 16);\n        v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n        v4f32 _outp = op(_p, _b);\n        __msa_st_w((v4i32)_outp, outptr, 0);\n        ptr += 4;\n        outptr += 4;\n    }\n#endif // __mips_msa\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector_broadcast_pb_a(const float* ptr, const float* ptr1, float* outptr, int w, int elempack)\n{\n    const Op op;\n\n#if __mips_msa\n    if (elempack == 4)\n    {\n        int i = 0;\n        v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n        for (; i < w; i++)\n        {\n            v4f32 _b = __msa_fill_w_f32(*ptr1);\n            v4f32 _outp = op(_p, _b);\n            __msa_st_w((v4i32)_outp, outptr, 0);\n            ptr1 += 1;\n            outptr += 4;\n        }\n    }\n#endif // __mips_msa\n}\n\ntemplate<typename Op>\nstatic void binary_op_vector(const float* ptr, const float* ptr1, float* outptr, int aw, int bw, int ap, int bp)\n{\n    const int w = std::max(aw, bw);\n    const int elempack = std::max(ap, bp);\n    const int size = w * elempack;\n\n    if (ap == bp)\n    {\n        if (aw == bw)\n        {\n            // no broadcast\n            return binary_op_vector_no_broadcast<Op>(ptr, ptr1, outptr, size);\n        }\n\n        if (bw == 1)\n        {\n            // broadcast single b\n            return binary_op_vector_broadcast_b<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a\n            return binary_op_vector_broadcast_a<Op>(ptr, ptr1, outptr, size, elempack);\n        }\n    }\n\n    if (bp == 1)\n    {\n        if (aw == bw)\n        {\n            // broadcast pack1 b\n            return binary_op_vector_broadcast_pb<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (bw == 
1)\n        {\n            // broadcast pack1 single b\n            return binary_op_vector_broadcast_pb_b<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n\n        if (aw == 1)\n        {\n            // broadcast single a and pack1 b\n            return binary_op_vector_broadcast_pb_a<Op>(ptr, ptr1, outptr, w, elempack);\n        }\n    }\n\n    // shall never reach here\n}\n\ntemplate<typename Op>\nstatic int binary_op_scalar_inplace(Mat& a, float b, const Option& opt)\n{\n    Op op;\n\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = a.channel(q);\n\n        int i = 0;\n#if __mips_msa\n        v4f32 _b = __msa_fill_w_f32(b);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n            _p = op(_p, _b);\n            __msa_st_w((v4i32)_p, ptr, 0);\n            ptr += 4;\n        }\n#endif // __mips_msa\n        for (; i < size; i++)\n        {\n            *ptr = op(*ptr, b);\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\nnamespace BinaryOp_mips_functor {\n\n#if __mips_msa\n#define MAKE_FUNCTION(NAME, IMPL, IMPL4)                       \\\n    struct NAME                                                \\\n    {                                                          \\\n        float operator()(const float& x, const float& y) const \\\n        {                                                      \\\n            return IMPL;                                       \\\n        }                                                      \\\n        v4f32 operator()(const v4f32& x, const v4f32& y) const \\\n        {                                                      \\\n            return IMPL4;                                      \\\n        }                                                      
\\\n    };\n#else\n#define MAKE_FUNCTION(NAME, IMPL, IMPL4)                       \\\n    struct NAME                                                \\\n    {                                                          \\\n        float operator()(const float& x, const float& y) const \\\n        {                                                      \\\n            return IMPL;                                       \\\n        }                                                      \\\n    };\n#endif // __mips_msa\n\n// clang-format off\n// *INDENT-OFF*\nMAKE_FUNCTION(binary_op_add, x + y, __msa_fadd_w(x, y))\nMAKE_FUNCTION(binary_op_sub, x - y, __msa_fsub_w(x, y))\nMAKE_FUNCTION(binary_op_mul, x * y, __msa_fmul_w(x, y))\nMAKE_FUNCTION(binary_op_div, x / y, __msa_fdiv_w(x, y))\nMAKE_FUNCTION(binary_op_max, std::max(x, y), __msa_fmax_w(x, y))\nMAKE_FUNCTION(binary_op_min, std::min(x, y), __msa_fmin_w(x, y))\nMAKE_FUNCTION(binary_op_pow, (float)powf(x, y), pow_ps(x, y))\nMAKE_FUNCTION(binary_op_rsub, y - x, __msa_fsub_w(y, x))\nMAKE_FUNCTION(binary_op_rdiv, y / x, __msa_fdiv_w(y, x))\nMAKE_FUNCTION(binary_op_rpow, (float)powf(y, x), pow_ps(y, x))\nMAKE_FUNCTION(binary_op_atan2, (float)atan2f(x, y), atan2_ps(x, y))\nMAKE_FUNCTION(binary_op_ratan2, (float)atan2f(y, x), atan2_ps(y, x))\nMAKE_FUNCTION(binary_op_fmod, (float)fmodf(x, y), fmod_ps(x, y))\nMAKE_FUNCTION(binary_op_rfmod, (float)fmodf(y, x), fmod_ps(y, x))\nMAKE_FUNCTION(binary_op_logaddexp, (float)(std::max(x, y) + log1pf(expf(std::min(x, y) - std::max(x, y)))), logaddexp_ps(x, y))\nMAKE_FUNCTION(binary_op_floor_divide, (float)floorf(x / y), floor_divide_ps(x, y))\nMAKE_FUNCTION(binary_op_rfloor_divide, (float)floorf(y / x), floor_divide_ps(y, x))\nMAKE_FUNCTION(binary_op_remainder, (float)remainderf(x, y), remainder_ps(x, y))\nMAKE_FUNCTION(binary_op_rremainder, (float)remainderf(y, x), remainder_ps(y, x))\n// *INDENT-ON*\n// clang-format on\n\n#undef MAKE_FUNCTION\n\n} // namespace 
BinaryOp_mips_functor\n\nstatic void binary_op_vector(const float* ptr, const float* ptr1, float* outptr, int aw, int bw, int ap, int bp, int op_type)\n{\n    using namespace BinaryOp_mips_functor;\n\n    if (op_type == BinaryOp::Operation_ADD) return binary_op_vector<binary_op_add>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_SUB) return binary_op_vector<binary_op_sub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MUL) return binary_op_vector<binary_op_mul>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_DIV) return binary_op_vector<binary_op_div>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MAX) return binary_op_vector<binary_op_max>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_MIN) return binary_op_vector<binary_op_min>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_POW) return binary_op_vector<binary_op_pow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RSUB) return binary_op_vector<binary_op_rsub>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RDIV) return binary_op_vector<binary_op_rdiv>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RPOW) return binary_op_vector<binary_op_rpow>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_ATAN2) return binary_op_vector<binary_op_atan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RATAN2) return binary_op_vector<binary_op_ratan2>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FMOD) return binary_op_vector<binary_op_fmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFMOD) return binary_op_vector<binary_op_rfmod>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_LOGADDEXP) return 
binary_op_vector<binary_op_logaddexp>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return binary_op_vector<binary_op_floor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return binary_op_vector<binary_op_rfloor_divide>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_REMAINDER) return binary_op_vector<binary_op_remainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n    if (op_type == BinaryOp::Operation_RREMAINDER) return binary_op_vector<binary_op_rremainder>(ptr, ptr1, outptr, aw, bw, ap, bp);\n\n    // should never reach here\n}\n\nstatic void binary_op_scalar(const Mat& a, float b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = a.channel(q);\n        float* outptr = c.channel(q);\n\n        binary_op_vector(ptr, &b, outptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_no_broadcast(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        const float* ptr = a.channel(q);\n        const float* ptr1 = b.channel(q);\n        float* outptr = c.channel(q);\n\n        binary_op_vector(ptr, ptr1, outptr, size, size, 1, 1, op_type);\n    }\n}\n\nstatic void binary_op_broadcast(const Mat& a, const Mat& b, Mat& c, int op_type, const Option& opt)\n{\n    if (b.w * b.h * b.d * b.c * b.elempack == 1)\n    {\n        return binary_op_scalar(a, b[0], c, op_type, opt);\n    }\n\n    if (a.dims == b.dims && a.w == b.w && a.h == b.h && a.d == b.d && a.c == b.c && a.elempack == b.elempack)\n    {\n        return 
binary_op_no_broadcast(a, b, c, op_type, opt);\n    }\n\n    const int dims = c.dims;\n\n    if (dims == 2)\n    {\n        const int h = c.h;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int y = 0; y < h; y++)\n        {\n            const int y0 = std::min(y, a.h - 1);\n            const int y1 = std::min(y, b.h - 1);\n\n            const float* ptr = a.row(y0);\n            const float* ptr1 = b.row(y1);\n            float* outptr = c.row(y);\n\n            binary_op_vector(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n        }\n    }\n\n    if (dims == 3 || dims == 4)\n    {\n        const int channels = c.c;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const int q0 = std::min(q, a.c - 1);\n            const int q1 = std::min(q, b.c - 1);\n\n            if (b.d * b.h * b.w == 1)\n            {\n                const float* ptr = a.channel(q0);\n                const float* ptr1 = b.channel(q1);\n                float* outptr = c.channel(q);\n\n                binary_op_vector(ptr, ptr1, outptr, a.w * a.h * a.d, 1, a.elempack, b.elempack, op_type);\n                continue;\n            }\n\n            if (b.h * b.w == 1)\n            {\n                for (int z = 0; z < c.d; z++)\n                {\n                    const int z0 = std::min(z, a.d - 1);\n                    const int z1 = std::min(z, b.d - 1);\n\n                    const float* ptr = a.channel(q0).depth(z0);\n                    const float* ptr1 = b.channel(q1).depth(z1);\n                    float* outptr = c.channel(q).depth(z);\n\n                    binary_op_vector(ptr, ptr1, outptr, a.w * a.h, 1, a.elempack, b.elempack, op_type);\n                }\n                continue;\n            }\n\n            for (int z = 0; z < c.d; z++)\n            {\n                const int z0 = std::min(z, a.d - 1);\n                const int z1 = 
std::min(z, b.d - 1);\n\n                for (int y = 0; y < c.h; y++)\n                {\n                    const int y0 = std::min(y, a.h - 1);\n                    const int y1 = std::min(y, b.h - 1);\n\n                    const float* ptr = a.channel(q0).depth(z0).row(y0);\n                    const float* ptr1 = b.channel(q1).depth(z1).row(y1);\n                    float* outptr = c.channel(q).depth(z).row(y);\n\n                    binary_op_vector(ptr, ptr1, outptr, a.w, b.w, a.elempack, b.elempack, op_type);\n                }\n            }\n        }\n    }\n}\n\nstatic void binary_op_scalar_inplace(Mat& a, float b, int op_type, const Option& opt)\n{\n    const int channels = a.c;\n    const int size = a.w * a.h * a.d * a.elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = a.channel(q);\n\n        binary_op_vector(ptr, &b, ptr, size, 1, 1, 1, op_type);\n    }\n}\n\nstatic int get_reverse_op_type(int op_type)\n{\n    if (op_type == BinaryOp::Operation_SUB) return BinaryOp::Operation_RSUB;\n    if (op_type == BinaryOp::Operation_DIV) return BinaryOp::Operation_RDIV;\n    if (op_type == BinaryOp::Operation_POW) return BinaryOp::Operation_RPOW;\n    if (op_type == BinaryOp::Operation_ATAN2) return BinaryOp::Operation_RATAN2;\n    if (op_type == BinaryOp::Operation_FMOD) return BinaryOp::Operation_RFMOD;\n    if (op_type == BinaryOp::Operation_FLOOR_DIVIDE) return BinaryOp::Operation_RFLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_REMAINDER) return BinaryOp::Operation_RREMAINDER;\n\n    if (op_type == BinaryOp::Operation_RSUB) return BinaryOp::Operation_SUB;\n    if (op_type == BinaryOp::Operation_RDIV) return BinaryOp::Operation_DIV;\n    if (op_type == BinaryOp::Operation_RPOW) return BinaryOp::Operation_POW;\n    if (op_type == BinaryOp::Operation_RATAN2) return BinaryOp::Operation_ATAN2;\n    if (op_type == BinaryOp::Operation_RFMOD) return 
BinaryOp::Operation_FMOD;\n    if (op_type == BinaryOp::Operation_RFLOOR_DIVIDE) return BinaryOp::Operation_FLOOR_DIVIDE;\n    if (op_type == BinaryOp::Operation_RREMAINDER) return BinaryOp::Operation_REMAINDER;\n    return op_type;\n}\n\nint BinaryOp_mips::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& A = bottom_blobs[0];\n    const Mat& B = bottom_blobs[1];\n    const int outdims = std::max(A.dims, B.dims);\n\n    Mat A2 = A;\n    Mat B2 = B;\n    if (A.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (A.w * A.elempack == B.h * B.elempack)\n                A2 = A.reshape(1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 2;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 3;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 3 && A.dims == 2)\n            A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 1)\n        {\n            if (A.w * A.elempack == B.c * B.elempack)\n                A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);\n            else // if (A.w == B.w)\n            {\n                A2.dims = 4;\n                A2.w = A.w * A.elempack;\n                A2.elempack = 1;\n                A2.elemsize = A.elemsize / 
A.elempack;\n                A2.cstep = A.cstep * A.elempack;\n            }\n        }\n        if (outdims == 4 && A.dims == 2)\n            A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);\n        if (outdims == 4 && A.dims == 3)\n            A2 = A.reshape(1, A.w, A.h, A.c, opt.workspace_allocator);\n    }\n    if (B.dims < outdims)\n    {\n        // expand inner axes\n        if (outdims == 2)\n        {\n            if (B.w * B.elempack == A.h * A.elempack)\n                B2 = B.reshape(1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 2;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 3;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 3 && B.dims == 2)\n            B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);\n        if (outdims == 4 && B.dims == 1)\n        {\n            if (B.w * B.elempack == A.c * A.elempack)\n                B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);\n            else // if (B.w == A.w)\n            {\n                B2.dims = 4;\n                B2.w = B.w * B.elempack;\n                B2.elempack = 1;\n                B2.elemsize = B.elemsize / B.elempack;\n                B2.cstep = B.cstep * B.elempack;\n            }\n        }\n        if (outdims == 4 && B.dims == 2)\n            B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);\n       
 if (outdims == 4 && B.dims == 3)\n            B2 = B.reshape(1, B.w, B.h, B.c, opt.workspace_allocator);\n    }\n\n    const int outw = std::max(A2.w, B2.w);\n    const int outh = std::max(A2.h, B2.h);\n    const int outd = std::max(A2.d, B2.d);\n    const int outc = std::max(A2.c, B2.c);\n    const size_t out_elemsize = std::max(A2.elemsize, B2.elemsize);\n    const int out_elempack = std::max(A2.elempack, B2.elempack);\n\n    Mat& top_blob = top_blobs[0];\n    if (outdims == 1)\n    {\n        top_blob.create(outw, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 2)\n    {\n        top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 3)\n    {\n        top_blob.create(outw, outh, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (outdims == 4)\n    {\n        top_blob.create(outw, outh, outd, outc, out_elemsize, out_elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    const bool a_pack_is_lower = A2.elempack < B2.elempack;\n    const bool a_pack_is_equal = A2.elempack == B2.elempack;\n    const bool a_size_is_lower = A2.w * A2.h * A2.d * A2.c * A2.elempack < B2.w * B2.h * B2.d * B2.c * B2.elempack;\n    if (a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))\n    {\n        binary_op_broadcast(B2, A2, top_blob, get_reverse_op_type(op_type), opt);\n    }\n    else\n    {\n        binary_op_broadcast(A2, B2, top_blob, op_type, opt);\n    }\n\n    return 0;\n}\n\nint BinaryOp_mips::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    binary_op_scalar_inplace(bottom_top_blob, b, op_type, opt);\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/binaryop_mips.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_BINARYOP_MIPS_H\n#define LAYER_BINARYOP_MIPS_H\n\n#include \"binaryop.h\"\n\nnamespace ncnn {\n\nclass BinaryOp_mips : public BinaryOp\n{\npublic:\n    BinaryOp_mips();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_BINARYOP_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/cast_mips.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cast_mips.h\"\n\n#if __mips_msa\n#include <msa.h>\n#endif // __mips_msa\n\nnamespace ncnn {\n\nCast_mips::Cast_mips()\n{\n    support_packing = true;\n}\n\nint Cast_mips::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    if (type_from == type_to)\n    {\n        top_blob = bottom_blob;\n        return 0;\n    }\n\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int d = bottom_blob.d;\n    int channels = bottom_blob.c;\n    int dims = bottom_blob.dims;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    size_t out_elemsize = elemsize;\n    if (type_to == 1)\n    {\n        if (type_from == 3)\n        {\n            Cast::forward(bottom_blob, top_blob, opt);\n        }\n\n        // float32\n        out_elemsize = 4 * elempack;\n    }\n    else if (type_to == 2)\n    {\n        // float16\n        out_elemsize = 2 * elempack;\n    }\n    else if (type_to == 3)\n    {\n        // int8\n        out_elemsize = elempack;\n    }\n    else if (type_to == 4)\n    {\n        // bfloat16\n        out_elemsize = 2 * elempack;\n    }\n\n    if (dims == 1)\n    {\n        top_blob.create(w, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 2)\n    {\n        top_blob.create(w, h, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 3)\n    {\n        top_blob.create(w, h, channels, out_elemsize, elempack, opt.blob_allocator);\n    }\n    else if (dims == 4)\n    {\n        top_blob.create(w, h, d, channels, out_elemsize, elempack, opt.blob_allocator);\n    }\n    if (top_blob.empty())\n        return -100;\n\n    int size = w * h * d * elempack;\n\n    if (type_from == 1 && type_to == 2)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = 
bottom_blob.channel(q);\n            unsigned short* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __mips_msa\n            for (; i + 7 < size; i += 8)\n            {\n                __builtin_prefetch(ptr + 16);\n                v4f32 _p0 = (v4f32)__msa_ld_w(ptr, 0);\n                v4f32 _p1 = (v4f32)__msa_ld_w(ptr + 4, 0);\n                v8i16 _p = __msa_fexdo_h(_p1, _p0);\n                __msa_st_h(_p, outptr, 0);\n\n                ptr += 8;\n                outptr += 8;\n            }\n#endif // __mips_msa\n            for (; i < size; i++)\n            {\n                *outptr = float32_to_float16(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    if (type_from == 2 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n#if __mips_msa\n            for (; i + 7 < size; i += 8)\n            {\n                __builtin_prefetch(ptr + 16);\n                v8i16 _p = __msa_ld_h(ptr, 0);\n                v4f32 _p0 = __msa_fexupr_w(_p);\n                v4f32 _p1 = __msa_fexupl_w(_p);\n                __msa_st_w((v4i32)_p0, outptr, 0);\n                __msa_st_w((v4i32)_p1, outptr + 4, 0);\n\n                ptr += 8;\n                outptr += 8;\n            }\n#endif // __mips_msa\n            for (; i < size; i++)\n            {\n                *outptr = float16_to_float32(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    if (type_from == 3 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const signed char* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 
0; i < size; i++)\n            {\n                outptr[i] = (float)ptr[i];\n            }\n        }\n    }\n\n    if (type_from == 4 && type_to == 1)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const unsigned short* ptr = bottom_blob.channel(q);\n            float* outptr = top_blob.channel(q);\n\n            int i = 0;\n            for (; i < size; i++)\n            {\n                *outptr = bfloat16_to_float32(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    if (type_from == 1 && type_to == 4)\n    {\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            const float* ptr = bottom_blob.channel(q);\n            unsigned short* outptr = top_blob.channel(q);\n\n            int i = 0;\n            for (; i < size; i++)\n            {\n                *outptr = float32_to_bfloat16(*ptr);\n                outptr++;\n                ptr++;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/cast_mips.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CAST_MIPS_H\n#define LAYER_CAST_MIPS_H\n\n#include \"cast.h\"\n\nnamespace ncnn {\n\nclass Cast_mips : public Cast\n{\npublic:\n    Cast_mips();\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CAST_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/clip_mips.cpp",
    "content": "// Copyright 2019 Leo <leo@nullptr.com.cn>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"clip_mips.h\"\n\n#if __mips_msa\n#include <msa.h>\n#endif // __mips_msa\n\n#include \"mips_usability.h\"\n\nnamespace ncnn {\n\nClip_mips::Clip_mips()\n{\n#if __mips_msa\n    support_packing = true;\n#endif\n}\n\nint Clip_mips::forward_inplace(Mat& bottom_top_blob, const Option& opt) const\n{\n    int w = bottom_top_blob.w;\n    int h = bottom_top_blob.h;\n    int d = bottom_top_blob.d;\n    int channels = bottom_top_blob.c;\n    int elempack = bottom_top_blob.elempack;\n    int size = w * h * d * elempack;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int q = 0; q < channels; q++)\n    {\n        float* ptr = bottom_top_blob.channel(q);\n\n        int i = 0;\n#if __mips_msa\n        v4f32 _max = (v4f32)__msa_fill_w_f32(max);\n        v4f32 _min = (v4f32)__msa_fill_w_f32(min);\n        for (; i + 3 < size; i += 4)\n        {\n            __builtin_prefetch(ptr + 16);\n            v4f32 _p = (v4f32)__msa_ld_w(ptr, 0);\n            _p = __msa_fmax_w(_p, _min);\n            _p = __msa_fmin_w(_p, _max);\n            __msa_st_w((v4i32)_p, ptr, 0);\n\n            ptr += 4;\n        }\n#endif // __mips_msa\n        for (; i < size; i++)\n        {\n            if (*ptr < min)\n                *ptr = min;\n\n            if (*ptr > max)\n                *ptr = max;\n\n            ptr++;\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/clip_mips.h",
    "content": "// Copyright 2019 Leo <leo@nullptr.com.cn>\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CLIP_MIPS_H\n#define LAYER_CLIP_MIPS_H\n\n#include \"clip.h\"\n\nnamespace ncnn {\n\nclass Clip_mips : public Clip\n{\npublic:\n    Clip_mips();\n\n    virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CLIP_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/concat_mips.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"concat_mips.h\"\n\nnamespace ncnn {\n\nConcat_mips::Concat_mips()\n{\n#if __mips_msa\n    support_packing = true;\n#endif // __mips_msa\n}\n\nint Concat_mips::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    int dims = bottom_blobs[0].dims;\n    int positive_axis = axis < 0 ? dims + axis : axis;\n\n    if (dims == 1) // positive_axis == 0\n    {\n        // concat vector\n        // total length\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_w % 4 == 0 ? 4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        float* outptr = top_blob;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            const float* ptr = bottom_blob;\n            memcpy(outptr, ptr, bottom_blob.w * bottom_blob.elemsize);\n\n            outptr += bottom_blob.w * bottom_blob.elempack;\n        }\n    }\n\n    if (dims == 2 && positive_axis == 0)\n    {\n        // concat image\n        int w = bottom_blobs[0].w;\n\n        // total height\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = 
std::min(elemsize, bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_h += bottom_blob.h * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_h % 4 == 0 ? 4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, top_h / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n        }\n\n        float* outptr = top_blob_unpacked;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                for (int i = 0; i < bottom_blob.h; i++)\n                {\n                    const float* r0 = bottom_blob.row(i);\n\n                    float* outptr0 = outptr;\n                    float* outptr1 = outptr + w;\n                    float* outptr2 = outptr + w * 2;\n                    float* outptr3 = outptr + w * 3;\n\n                    for (int j = 0; j < w; j++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    outptr += w * 4;\n                }\n            }\n            else // if (bottom_blob.elempack == 1 && elempack == 1) if (bottom_blob.elempack == 4 && elempack == 4)\n            {\n                int size = w * bottom_blob.h;\n\n                const float* ptr = 
bottom_blob;\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                outptr += size * bottom_blob.elempack;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if (dims == 2 && positive_axis == 1)\n    {\n        // interleave image row\n        int h = bottom_blobs[0].h;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = 0; i < h; i++)\n        {\n            float* outptr = top_blob.row(i);\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                const float* ptr = bottom_blob.row(i);\n                memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                outptr += bottom_blob.w * elempack;\n            }\n        }\n    }\n\n    if ((dims == 3 || dims == 4) && positive_axis == 0)\n    {\n        // concat dim\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int d = bottom_blobs[0].d;\n\n        // total channels\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n        int top_channels = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            elemsize = std::min(elemsize, 
bottom_blob.elemsize);\n            elempack = std::min(elempack, bottom_blob.elempack);\n            top_channels += bottom_blob.c * bottom_blob.elempack;\n        }\n\n        int out_elempack = opt.use_packing_layout && top_channels % 4 == 0 ? 4 : 1;\n        size_t out_elemsize = elemsize / elempack * out_elempack;\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, d, top_channels / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        Mat top_blob_unpacked = top_blob;\n        if (elempack < out_elempack)\n        {\n            top_blob_unpacked.create(w, h, d, top_channels / elempack, elemsize, elempack, opt.workspace_allocator);\n            if (top_blob_unpacked.empty())\n                return -100;\n\n            top_blob_unpacked.dims = dims;\n        }\n\n        int p = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n\n            if (bottom_blob.elempack == 4 && elempack == 1)\n            {\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                for (int q = 0; q < bottom_blob.c; q++)\n                {\n                    const float* r0 = bottom_blob.channel(q);\n\n                    float* outptr0 = top_blob_unpacked.channel(p);\n                    float* outptr1 = top_blob_unpacked.channel(p + 1);\n                    float* outptr2 = top_blob_unpacked.channel(p + 2);\n                    float* outptr3 = top_blob_unpacked.channel(p + 3);\n\n                    for (int i = 0; i < size; i++)\n                    {\n                        *outptr0++ = r0[0];\n                        *outptr1++ = r0[1];\n                        *outptr2++ = r0[2];\n                        *outptr3++ = r0[3];\n\n                        r0 += 4;\n                    }\n\n                    p += 4;\n                
}\n            }\n            else // if (bottom_blob.elempack == 1 && elempack == 1) if (bottom_blob.elempack == 4 && elempack == 4)\n            {\n                int size = bottom_blob.total();\n\n                const float* ptr = bottom_blob;\n                float* outptr = top_blob_unpacked.channel(p);\n                memcpy(outptr, ptr, size * bottom_blob.elemsize);\n\n                p += bottom_blob.c;\n            }\n        }\n\n        // packing\n        if (elempack < out_elempack)\n        {\n            convert_packing(top_blob_unpacked, top_blob, out_elempack, opt);\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 1) || (dims == 4 && positive_axis == 2))\n    {\n        // interleave dim height\n        int w = bottom_blobs[0].w;\n        int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total height\n        int top_h = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_h += bottom_blob.h;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, top_h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (size_t b = 0; b < bottom_blobs.size(); b++)\n                {\n                    const Mat& bottom_blob = bottom_blobs[b];\n\n                    int size = bottom_blob.w * bottom_blob.h;\n\n                    const float* ptr = bottom_blob.channel(q).depth(i);\n                    memcpy(outptr, ptr, size * elemsize);\n\n                    
outptr += size * elempack;\n                }\n            }\n        }\n    }\n\n    if ((dims == 3 && positive_axis == 2) || (dims == 4 && positive_axis == 3))\n    {\n        // interleave dim width\n        int h = bottom_blobs[0].h;\n        int d = bottom_blobs[0].d;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total width\n        int top_w = 0;\n        for (size_t b = 0; b < bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_w += bottom_blob.w;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(top_w, h, d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        top_blob.dims = dims;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (int i = 0; i < d; i++)\n            {\n                for (int j = 0; j < h; j++)\n                {\n                    for (size_t b = 0; b < bottom_blobs.size(); b++)\n                    {\n                        const Mat& bottom_blob = bottom_blobs[b];\n\n                        const float* ptr = bottom_blob.channel(q).depth(i).row(j);\n                        memcpy(outptr, ptr, bottom_blob.w * elemsize);\n\n                        outptr += bottom_blob.w * elempack;\n                    }\n                }\n            }\n        }\n    }\n\n    if (dims == 4 && positive_axis == 1)\n    {\n        // interleave dim depth\n        int w = bottom_blobs[0].w;\n        int h = bottom_blobs[0].h;\n        int channels = bottom_blobs[0].c;\n        size_t elemsize = bottom_blobs[0].elemsize;\n        int elempack = bottom_blobs[0].elempack;\n\n        // total depth\n        int top_d = 0;\n        for (size_t b = 0; b < 
bottom_blobs.size(); b++)\n        {\n            const Mat& bottom_blob = bottom_blobs[b];\n            top_d += bottom_blob.d;\n        }\n\n        Mat& top_blob = top_blobs[0];\n        top_blob.create(w, h, top_d, channels, elemsize, elempack, opt.blob_allocator);\n        if (top_blob.empty())\n            return -100;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int q = 0; q < channels; q++)\n        {\n            float* outptr = top_blob.channel(q);\n\n            for (size_t b = 0; b < bottom_blobs.size(); b++)\n            {\n                const Mat& bottom_blob = bottom_blobs[b];\n\n                int size = bottom_blob.w * bottom_blob.h * bottom_blob.d;\n\n                const float* ptr = bottom_blob.channel(q);\n                memcpy(outptr, ptr, size * elemsize);\n\n                outptr += size * elempack;\n            }\n        }\n    }\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/concat_mips.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONCAT_MIPS_H\n#define LAYER_CONCAT_MIPS_H\n\n#include \"concat.h\"\n\nnamespace ncnn {\n\nclass Concat_mips : public Concat\n{\npublic:\n    Concat_mips();\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONCAT_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/convolution1d_mips.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution1d_mips.h\"\n\n#if __mips_msa\n#include <msa.h>\n#endif // __mips_msa\n\n#include \"mips_activation.h\"\n#include \"mips_usability.h\"\n\nnamespace ncnn {\n\nConvolution1D_mips::Convolution1D_mips()\n{\n#if __mips_msa\n    support_packing = true;\n#endif // __mips_msa\n}\n\nint Convolution1D_mips::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    const int num_input = weight_data_size / kernel_w / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n\n    // src = kw-inch-outch\n    // dst = pb-pa-kw-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(kernel_w, num_input, num_output);\n\n        weight_data_packed.create(kernel_w, num_input / elempack, num_output / out_elempack, (size_t)4u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            float* g00 = weight_data_packed.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < kernel_w; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint 
Convolution1D_mips::destroy_pipeline(const Option& /*opt*/)\n{\n    return 0;\n}\n\nint Convolution1D_mips::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int out_elempack = 1;\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    const int outw = (w - kernel_extent_w) / stride_w + 1;\n    const int outh = num_output / out_elempack;\n\n    top_blob.create(outw, outh, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n#if __mips_msa\n    if (elempack == 4 && out_elempack == 4)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    v4f32 _sum = (v4f32)__msa_fill_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum = (v4f32)__msa_ld_w((const float*)bias_data + p * 4, 0);\n                    }\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n                        const float* sptr = bottom_blob_bordered.row(q) + j * stride_w * 4;\n\n                        for (int k = 0; k < kernel_w; k++)\n                        {\n                            v4f32 
_val0 = __msa_fill_w_f32(sptr[0]);\n                            v4f32 _val1 = __msa_fill_w_f32(sptr[1]);\n                            v4f32 _val2 = __msa_fill_w_f32(sptr[2]);\n                            v4f32 _val3 = __msa_fill_w_f32(sptr[3]);\n\n                            v4f32 _w0 = (v4f32)__msa_ld_w(kptr, 0);\n                            v4f32 _w1 = (v4f32)__msa_ld_w(kptr + 4, 0);\n                            v4f32 _w2 = (v4f32)__msa_ld_w(kptr + 8, 0);\n                            v4f32 _w3 = (v4f32)__msa_ld_w(kptr + 12, 0);\n\n                            _sum = __msa_fmadd_w(_sum, _val0, _w0);\n                            _sum = __msa_fmadd_w(_sum, _val1, _w1);\n                            _sum = __msa_fmadd_w(_sum, _val2, _w2);\n                            _sum = __msa_fmadd_w(_sum, _val3, _w3);\n\n                            sptr += dilation_w * 4;\n                            kptr += 16;\n                        }\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    __msa_st_w((v4i32)_sum, outptr, 0);\n                    outptr += 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    v4f32 _sum = (v4f32)__msa_fill_w(0);\n\n                    if (bias_term)\n                    {\n                        _sum = (v4f32)__msa_ld_w((const float*)bias_data + p * 4, 0);\n                    }\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n                        const float* sptr = bottom_blob_bordered.row(q) + j * stride_w;\n\n                        for 
(int k = 0; k < kernel_w; k++)\n                        {\n                            v4f32 _val = __msa_fill_w_f32(sptr[0]);\n                            v4f32 _w = (v4f32)__msa_ld_w(kptr, 0);\n                            _sum = __msa_fmadd_w(_sum, _val, _w);\n\n                            sptr += dilation_w;\n                            kptr += 4;\n                        }\n                    }\n\n                    _sum = activation_ps(_sum, activation_type, activation_params);\n\n                    __msa_st_w((v4i32)_sum, outptr, 0);\n                    outptr += 4;\n                }\n            }\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    v4f32 _sum = (v4f32)__msa_fill_w(0);\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n                        const float* sptr = bottom_blob_bordered.row(q) + j * stride_w * 4;\n\n                        for (int k = 0; k < kernel_w; k++)\n                        {\n                            v4f32 _val = (v4f32)__msa_ld_w(sptr, 0);\n                            v4f32 _w = (v4f32)__msa_ld_w(kptr, 0);\n                            _sum = __msa_fmadd_w(_sum, _val, _w);\n\n                            sptr += dilation_w * 4;\n                            kptr += 4;\n                        }\n                    }\n\n                    sum += __msa_reduce_fadd_w(_sum);\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n   
                 outptr[j] = sum;\n                }\n            }\n        }\n    }\n#endif // __mips_msa\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        {\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < outh; p++)\n            {\n                float* outptr = top_blob.row(p);\n\n                for (int j = 0; j < outw; j++)\n                {\n                    float sum = 0.f;\n\n                    if (bias_term)\n                    {\n                        sum = bias_data[p];\n                    }\n\n                    const float* kptr = weight_data_packed.channel(p);\n\n                    for (int q = 0; q < h; q++)\n                    {\n                        const float* sptr = bottom_blob_bordered.row(q) + j * stride_w;\n\n                        for (int k = 0; k < kernel_w; k++)\n                        {\n                            float val = sptr[0];\n                            float wt = kptr[0];\n                            sum += val * wt;\n\n                            sptr += dilation_w;\n                            kptr += 1;\n                        }\n                    }\n\n                    sum = activation_ss(sum, activation_type, activation_params);\n\n                    outptr[j] = sum;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Convolution1D_mips::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= 
weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution1D);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(2, dilation_w);\n    pd.set(3, stride_w);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/convolution1d_mips.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION1D_MIPS_H\n#define LAYER_CONVOLUTION1D_MIPS_H\n\n#include \"convolution1d.h\"\n\nnamespace ncnn {\n\nclass Convolution1D_mips : public Convolution1D\n{\npublic:\n    Convolution1D_mips();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\npublic:\n    // packn\n    Mat weight_data_packed;\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION1D_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/convolution_1x1.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_msa(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_1x1_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_int8_msa(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const signed char* r0 = bottom_blob.channel(p);\n        signed char* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n                outptr[2] = r0[4];\n                outptr[3] = r0[6];\n\n                r0 += 8;\n                outptr += 4;\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n\n                r0 += 4;\n                outptr += 2;\n            }\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    
// the input is now subsampled by 2 in both dimensions, so the stride-1 1x1 path applies\n    conv1x1s1_sgemm_int8_msa(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_1x1_pack1to4_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack1to4_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack1to4_int8_msa(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack1to4_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const signed char* r0 = bottom_blob.channel(p);\n        signed char* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j + 3 < outw; j += 4)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n                outptr[2] = r0[4];\n                outptr[3] = r0[6];\n\n                r0 += 8;\n                outptr += 4;\n            }\n            for (; j + 1 < outw; j += 2)\n            {\n                outptr[0] = r0[0];\n                outptr[1] = r0[2];\n\n                r0 += 4;\n                outptr += 2;\n            }\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    
// the input is now subsampled by 2 in both dimensions, so the stride-1 1x1 path applies\n    conv1x1s1_sgemm_pack1to4_int8_msa(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_1x1_pack4.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack4_msa(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = (w - 2 * outw + w) * 4;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const float* r0 = bottom_blob.channel(p);\n        float* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                v4f32 _val = (v4f32)__msa_ld_w(r0, 0);\n                __msa_st_w((v4i32)_val, outptr, 0);\n\n                r0 += 4 * 2;\n                outptr += 4;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack4_msa(bottom_blob_shrinked, top_blob, kernel, _bias, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_1x1_pack8to1_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack8to1_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack8to1_int8_msa(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack8to1_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const int64_t* r0 = bottom_blob.channel(p);\n        int64_t* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack8to1_int8_msa(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_1x1_pack8to4_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv1x1s1_sgemm_pack8to4_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    const int size = w * h;\n\n    Mat bottom_im2col = bottom_blob;\n    bottom_im2col.w = size;\n    bottom_im2col.h = 1;\n\n    im2col_sgemm_pack8to4_int8_msa(bottom_im2col, top_blob, kernel, opt);\n}\n\nstatic void conv1x1s2_sgemm_pack8to4_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n\n    const int tailstep = w - 2 * outw + w;\n\n    Mat bottom_blob_shrinked;\n    bottom_blob_shrinked.create(outw, outh, channels, elemsize, elempack, opt.workspace_allocator);\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < channels; p++)\n    {\n        const int64_t* r0 = bottom_blob.channel(p);\n        int64_t* outptr = bottom_blob_shrinked.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            int j = 0;\n            for (; j < outw; j++)\n            {\n                outptr[0] = r0[0];\n\n                r0 += 2;\n                outptr += 1;\n            }\n\n            r0 += tailstep;\n        }\n    }\n\n    conv1x1s1_sgemm_pack8to4_int8_msa(bottom_blob_shrinked, top_blob, kernel, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_3x3_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#if NCNN_RUNTIME_CPU && NCNN_MMI && !__mips_msa && !__mips_loongson_mmi\nvoid conv3x3s1_winograd43_transform_kernel_int8_loongson_mmi(const Mat& kernel, Mat& kernel_tm_packed, int inch, int outch, const Option& opt);\n#endif\n\nstatic void conv3x3s1_winograd43_transform_kernel_int8_msa(const Mat& kernel, Mat& kernel_tm_packed, int inch, int outch, const Option& opt)\n{\n#if NCNN_RUNTIME_CPU && NCNN_MMI && !__mips_msa && !__mips_loongson_mmi\n    if (ncnn::cpu_support_loongson_mmi())\n    {\n        conv3x3s1_winograd43_transform_kernel_int8_loongson_mmi(kernel, kernel_tm_packed, inch, outch, opt);\n        return;\n    }\n#endif\n\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch, (size_t)2u);\n\n    const short ktm[6][3] = {\n        {6, 0, 0},\n        {-4, -4, -4},\n        {-4, 4, -4},\n        {1, 2, 4},\n        {1, -2, 4},\n        {0, 0, 6}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const signed char* kernel0 = (const signed char*)kernel + p * inch * 9 + q * 9;\n            short* kernel_tm0 = kernel_tm.channel(p).row<short>(q);\n\n            // transform kernel\n            const signed char* k0 = kernel0;\n            const signed char* k1 = kernel0 + 3;\n            const signed char* k2 = kernel0 + 6;\n\n            // h\n            short tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; j++)\n            {\n                short* tmpp = &tmp[j][0];\n\n              
  for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = 2b-inch-36-outch/2b\n#if __mips_msa\n    if (outch >= 4)\n    {\n        if (inch >= 4)\n            kernel_tm_packed.create(inch / 4 + inch % 4, 36, outch / 4 + outch % 4, (size_t)2u * 16, 16);\n        else\n            kernel_tm_packed.create(inch, 36, outch / 4 + outch % 4, (size_t)2u * 4, 4);\n    }\n#else // __mips_msa\n    if (outch >= 2)\n    {\n#if __mips_loongson_mmi\n        if (inch >= 4)\n            kernel_tm_packed.create(inch / 4 + inch % 4, 36, outch / 2 + outch % 2, (size_t)2u * 8, 8);\n        else\n#endif // __mips_loongson_mmi\n        {\n            kernel_tm_packed.create(inch, 36, outch / 2 + outch % 2, (size_t)2u * 2, 2);\n        }\n    }\n#endif // __mips_msa\n    else\n    {\n#if __mips_msa || __mips_loongson_mmi\n        if (inch >= 4)\n            kernel_tm_packed.create(inch / 4 + inch % 4, 36, outch, (size_t)2u * 4, 4);\n        else\n#endif // __mips_msa || __mips_loongson_mmi\n        {\n            kernel_tm_packed.create(inch, 36, outch, (size_t)2u, 1);\n        }\n    }\n\n    int p = 0;\n#if __mips_msa\n    for (; p + 3 < outch; p += 4)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n        const Mat k1 = kernel_tm.channel(p + 1);\n        const Mat k2 = kernel_tm.channel(p + 2);\n        const Mat k3 = kernel_tm.channel(p + 3);\n\n        Mat g0 = kernel_tm_packed.channel(p / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            int q = 0;\n            for (; q + 3 < inch; q += 4)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k0.row<const short>(q + 1)[k];\n                g00[2] = k0.row<const short>(q + 2)[k];\n                g00[3] = 
k0.row<const short>(q + 3)[k];\n                g00[4] = k1.row<const short>(q)[k];\n                g00[5] = k1.row<const short>(q + 1)[k];\n                g00[6] = k1.row<const short>(q + 2)[k];\n                g00[7] = k1.row<const short>(q + 3)[k];\n                g00[8] = k2.row<const short>(q)[k];\n                g00[9] = k2.row<const short>(q + 1)[k];\n                g00[10] = k2.row<const short>(q + 2)[k];\n                g00[11] = k2.row<const short>(q + 3)[k];\n                g00[12] = k3.row<const short>(q)[k];\n                g00[13] = k3.row<const short>(q + 1)[k];\n                g00[14] = k3.row<const short>(q + 2)[k];\n                g00[15] = k3.row<const short>(q + 3)[k];\n                g00 += 16;\n            }\n            for (; q < inch; q++)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k1.row<const short>(q)[k];\n                g00[2] = k2.row<const short>(q)[k];\n                g00[3] = k3.row<const short>(q)[k];\n                g00 += 4;\n            }\n        }\n    }\n#else // __mips_msa\n    for (; p + 1 < outch; p += 2)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n        const Mat k1 = kernel_tm.channel(p + 1);\n\n        Mat g0 = kernel_tm_packed.channel(p / 2);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            int q = 0;\n#if __mips_loongson_mmi\n            for (; q + 3 < inch; q += 4)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k0.row<const short>(q + 1)[k];\n                g00[2] = k1.row<const short>(q)[k];\n                g00[3] = k1.row<const short>(q + 1)[k];\n                g00[4] = k0.row<const short>(q + 2)[k];\n                g00[5] = k0.row<const short>(q + 3)[k];\n                g00[6] = k1.row<const short>(q + 2)[k];\n                g00[7] = k1.row<const short>(q + 3)[k];\n                g00 += 8;\n            }\n#endif // 
__mips_loongson_mmi\n            for (; q < inch; q++)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k1.row<const short>(q)[k];\n                g00 += 2;\n            }\n        }\n    }\n#endif // __mips_msa\n    for (; p < outch; p++)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n\n#if __mips_msa\n        Mat g0 = kernel_tm_packed.channel(p / 4 + p % 4);\n#else\n        Mat g0 = kernel_tm_packed.channel(p / 2 + p % 2);\n#endif\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            int q = 0;\n#if __mips_msa || __mips_loongson_mmi\n            for (; q + 3 < inch; q += 4)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00[1] = k0.row<const short>(q + 1)[k];\n                g00[2] = k0.row<const short>(q + 2)[k];\n                g00[3] = k0.row<const short>(q + 3)[k];\n                g00 += 4;\n            }\n#endif // __mips_msa || __mips_loongson_mmi\n            for (; q < inch; q++)\n            {\n                g00[0] = k0.row<const short>(q)[k];\n                g00 += 1;\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    //     size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles 
= outh / 4;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, 2u * elempack, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd43_transform_input_int8_msa(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_int8_msa(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u, 1, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_int8_msa(top_blob_tm, top_blob_bordered, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_3x3_pack1to4.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_pack1to4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        v4f32 _bias0 = bias ? (v4f32)__msa_ld_w(bias + p * 4, 0) : (v4f32)__msa_fill_w(0);\n        out0.fill(_bias0);\n\n        const float* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            v4f32 _k00 = (v4f32)__msa_ld_w(k0, 0);\n            v4f32 _k01 = (v4f32)__msa_ld_w(k0 + 4, 0);\n            v4f32 _k02 = (v4f32)__msa_ld_w(k0 + 4 * 2, 0);\n            v4f32 _k10 = (v4f32)__msa_ld_w(k0 + 4 * 3, 0);\n            v4f32 _k11 = (v4f32)__msa_ld_w(k0 + 4 * 4, 0);\n            v4f32 _k12 = (v4f32)__msa_ld_w(k0 + 4 * 5, 0);\n            v4f32 _k20 = (v4f32)__msa_ld_w(k0 + 4 * 6, 0);\n            v4f32 _k21 = (v4f32)__msa_ld_w(k0 + 4 * 7, 0);\n            v4f32 _k22 = (v4f32)__msa_ld_w(k0 + 4 * 8, 0);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 7 < outw; j += 8)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n                    v4f32 _sum1 = (v4f32)__msa_ld_w(outptr0 + 4, 0);\n                    v4f32 _sum2 = (v4f32)__msa_ld_w(outptr0 + 4 * 2, 0);\n                    v4f32 _sum3 = (v4f32)__msa_ld_w(outptr0 + 4 * 3, 0);\n                    v4f32 _sum4 = 
(v4f32)__msa_ld_w(outptr0 + 4 * 4, 0);\n                    v4f32 _sum5 = (v4f32)__msa_ld_w(outptr0 + 4 * 5, 0);\n                    v4f32 _sum6 = (v4f32)__msa_ld_w(outptr0 + 4 * 6, 0);\n                    v4f32 _sum7 = (v4f32)__msa_ld_w(outptr0 + 4 * 7, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4i32 _r0n = __msa_ld_w(r0 + 4, 0);\n                    v4i32 _r0nn = __msa_ld_w(r0 + 8, 0);\n\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n                    v4f32 _r03 = (v4f32)__msa_splati_w(_r0, 3);\n                    v4f32 _r04 = (v4f32)__msa_splati_w(_r0n, 0);\n                    v4f32 _r05 = (v4f32)__msa_splati_w(_r0n, 1);\n                    v4f32 _r06 = (v4f32)__msa_splati_w(_r0n, 2);\n                    v4f32 _r07 = (v4f32)__msa_splati_w(_r0n, 3);\n                    v4f32 _r08 = (v4f32)__msa_splati_w(_r0nn, 0);\n                    v4f32 _r09 = (v4f32)__msa_splati_w(_r0nn, 1);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum1 = __msa_fmadd_w(_sum1, _r01, _k00);\n                    _sum2 = __msa_fmadd_w(_sum2, _r02, _k00);\n                    _sum3 = __msa_fmadd_w(_sum3, _r03, _k00);\n                    _sum4 = __msa_fmadd_w(_sum4, _r04, _k00);\n                    _sum5 = __msa_fmadd_w(_sum5, _r05, _k00);\n                    _sum6 = __msa_fmadd_w(_sum6, _r06, _k00);\n                    _sum7 = __msa_fmadd_w(_sum7, _r07, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum1 = __msa_fmadd_w(_sum1, _r02, _k01);\n                    _sum2 = __msa_fmadd_w(_sum2, _r03, _k01);\n                    _sum3 = __msa_fmadd_w(_sum3, _r04, _k01);\n                    _sum4 = __msa_fmadd_w(_sum4, _r05, _k01);\n                    _sum5 = __msa_fmadd_w(_sum5, _r06, _k01);\n                    
_sum6 = __msa_fmadd_w(_sum6, _r07, _k01);\n                    _sum7 = __msa_fmadd_w(_sum7, _r08, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n                    _sum1 = __msa_fmadd_w(_sum1, _r03, _k02);\n                    _sum2 = __msa_fmadd_w(_sum2, _r04, _k02);\n                    _sum3 = __msa_fmadd_w(_sum3, _r05, _k02);\n                    _sum4 = __msa_fmadd_w(_sum4, _r06, _k02);\n                    _sum5 = __msa_fmadd_w(_sum5, _r07, _k02);\n                    _sum6 = __msa_fmadd_w(_sum6, _r08, _k02);\n                    _sum7 = __msa_fmadd_w(_sum7, _r09, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4i32 _r1n = __msa_ld_w(r1 + 4, 0);\n                    v4i32 _r1nn = __msa_ld_w(r1 + 8, 0);\n\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n                    v4f32 _r13 = (v4f32)__msa_splati_w(_r1, 3);\n                    v4f32 _r14 = (v4f32)__msa_splati_w(_r1n, 0);\n                    v4f32 _r15 = (v4f32)__msa_splati_w(_r1n, 1);\n                    v4f32 _r16 = (v4f32)__msa_splati_w(_r1n, 2);\n                    v4f32 _r17 = (v4f32)__msa_splati_w(_r1n, 3);\n                    v4f32 _r18 = (v4f32)__msa_splati_w(_r1nn, 0);\n                    v4f32 _r19 = (v4f32)__msa_splati_w(_r1nn, 1);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, _k10);\n                    _sum1 = __msa_fmadd_w(_sum1, _r11, _k10);\n                    _sum2 = __msa_fmadd_w(_sum2, _r12, _k10);\n                    _sum3 = __msa_fmadd_w(_sum3, _r13, _k10);\n                    _sum4 = __msa_fmadd_w(_sum4, _r14, _k10);\n                    _sum5 = __msa_fmadd_w(_sum5, _r15, _k10);\n                    _sum6 = __msa_fmadd_w(_sum6, _r16, _k10);\n                    _sum7 = __msa_fmadd_w(_sum7, _r17, _k10);\n                    _sum0 = __msa_fmadd_w(_sum0, _r11, 
_k11);\n                    _sum1 = __msa_fmadd_w(_sum1, _r12, _k11);\n                    _sum2 = __msa_fmadd_w(_sum2, _r13, _k11);\n                    _sum3 = __msa_fmadd_w(_sum3, _r14, _k11);\n                    _sum4 = __msa_fmadd_w(_sum4, _r15, _k11);\n                    _sum5 = __msa_fmadd_w(_sum5, _r16, _k11);\n                    _sum6 = __msa_fmadd_w(_sum6, _r17, _k11);\n                    _sum7 = __msa_fmadd_w(_sum7, _r18, _k11);\n                    _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n                    _sum1 = __msa_fmadd_w(_sum1, _r13, _k12);\n                    _sum2 = __msa_fmadd_w(_sum2, _r14, _k12);\n                    _sum3 = __msa_fmadd_w(_sum3, _r15, _k12);\n                    _sum4 = __msa_fmadd_w(_sum4, _r16, _k12);\n                    _sum5 = __msa_fmadd_w(_sum5, _r17, _k12);\n                    _sum6 = __msa_fmadd_w(_sum6, _r18, _k12);\n                    _sum7 = __msa_fmadd_w(_sum7, _r19, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4i32 _r2n = __msa_ld_w(r2 + 4, 0);\n                    v4i32 _r2nn = __msa_ld_w(r2 + 8, 0);\n\n                    v4f32 _r20 = (v4f32)__msa_splati_w(_r2, 0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n                    v4f32 _r23 = (v4f32)__msa_splati_w(_r2, 3);\n                    v4f32 _r24 = (v4f32)__msa_splati_w(_r2n, 0);\n                    v4f32 _r25 = (v4f32)__msa_splati_w(_r2n, 1);\n                    v4f32 _r26 = (v4f32)__msa_splati_w(_r2n, 2);\n                    v4f32 _r27 = (v4f32)__msa_splati_w(_r2n, 3);\n                    v4f32 _r28 = (v4f32)__msa_splati_w(_r2nn, 0);\n                    v4f32 _r29 = (v4f32)__msa_splati_w(_r2nn, 1);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum1 = __msa_fmadd_w(_sum1, _r21, _k20);\n                    _sum2 = __msa_fmadd_w(_sum2, _r22, _k20);\n                    _sum3 
= __msa_fmadd_w(_sum3, _r23, _k20);\n                    _sum4 = __msa_fmadd_w(_sum4, _r24, _k20);\n                    _sum5 = __msa_fmadd_w(_sum5, _r25, _k20);\n                    _sum6 = __msa_fmadd_w(_sum6, _r26, _k20);\n                    _sum7 = __msa_fmadd_w(_sum7, _r27, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum1 = __msa_fmadd_w(_sum1, _r22, _k21);\n                    _sum2 = __msa_fmadd_w(_sum2, _r23, _k21);\n                    _sum3 = __msa_fmadd_w(_sum3, _r24, _k21);\n                    _sum4 = __msa_fmadd_w(_sum4, _r25, _k21);\n                    _sum5 = __msa_fmadd_w(_sum5, _r26, _k21);\n                    _sum6 = __msa_fmadd_w(_sum6, _r27, _k21);\n                    _sum7 = __msa_fmadd_w(_sum7, _r28, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n                    _sum1 = __msa_fmadd_w(_sum1, _r23, _k22);\n                    _sum2 = __msa_fmadd_w(_sum2, _r24, _k22);\n                    _sum3 = __msa_fmadd_w(_sum3, _r25, _k22);\n                    _sum4 = __msa_fmadd_w(_sum4, _r26, _k22);\n                    _sum5 = __msa_fmadd_w(_sum5, _r27, _k22);\n                    _sum6 = __msa_fmadd_w(_sum6, _r28, _k22);\n                    _sum7 = __msa_fmadd_w(_sum7, _r29, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n                    __msa_st_w((v4i32)_sum1, outptr0 + 4, 0);\n                    __msa_st_w((v4i32)_sum2, outptr0 + 4 * 2, 0);\n                    __msa_st_w((v4i32)_sum3, outptr0 + 4 * 3, 0);\n                    __msa_st_w((v4i32)_sum4, outptr0 + 4 * 4, 0);\n                    __msa_st_w((v4i32)_sum5, outptr0 + 4 * 5, 0);\n                    __msa_st_w((v4i32)_sum6, outptr0 + 4 * 6, 0);\n                    __msa_st_w((v4i32)_sum7, outptr0 + 4 * 7, 0);\n\n                    outptr0 += 4 * 8;\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                }\n                for (; j 
+ 3 < outw; j += 4)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n                    v4f32 _sum1 = (v4f32)__msa_ld_w(outptr0 + 4, 0);\n                    v4f32 _sum2 = (v4f32)__msa_ld_w(outptr0 + 4 * 2, 0);\n                    v4f32 _sum3 = (v4f32)__msa_ld_w(outptr0 + 4 * 3, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4i32 _r0n = __msa_ld_w(r0 + 4, 0);\n\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n                    v4f32 _r03 = (v4f32)__msa_splati_w(_r0, 3);\n                    v4f32 _r04 = (v4f32)__msa_splati_w(_r0n, 0);\n                    v4f32 _r05 = (v4f32)__msa_splati_w(_r0n, 1);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum1 = __msa_fmadd_w(_sum1, _r01, _k00);\n                    _sum2 = __msa_fmadd_w(_sum2, _r02, _k00);\n                    _sum3 = __msa_fmadd_w(_sum3, _r03, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum1 = __msa_fmadd_w(_sum1, _r02, _k01);\n                    _sum2 = __msa_fmadd_w(_sum2, _r03, _k01);\n                    _sum3 = __msa_fmadd_w(_sum3, _r04, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n                    _sum1 = __msa_fmadd_w(_sum1, _r03, _k02);\n                    _sum2 = __msa_fmadd_w(_sum2, _r04, _k02);\n                    _sum3 = __msa_fmadd_w(_sum3, _r05, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4i32 _r1n = __msa_ld_w(r1 + 4, 0);\n\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n                    v4f32 _r13 = (v4f32)__msa_splati_w(_r1, 3);\n                    v4f32 _r14 = 
(v4f32)__msa_splati_w(_r1n, 0);\n                    v4f32 _r15 = (v4f32)__msa_splati_w(_r1n, 1);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, _k10);\n                    _sum1 = __msa_fmadd_w(_sum1, _r11, _k10);\n                    _sum2 = __msa_fmadd_w(_sum2, _r12, _k10);\n                    _sum3 = __msa_fmadd_w(_sum3, _r13, _k10);\n                    _sum0 = __msa_fmadd_w(_sum0, _r11, _k11);\n                    _sum1 = __msa_fmadd_w(_sum1, _r12, _k11);\n                    _sum2 = __msa_fmadd_w(_sum2, _r13, _k11);\n                    _sum3 = __msa_fmadd_w(_sum3, _r14, _k11);\n                    _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n                    _sum1 = __msa_fmadd_w(_sum1, _r13, _k12);\n                    _sum2 = __msa_fmadd_w(_sum2, _r14, _k12);\n                    _sum3 = __msa_fmadd_w(_sum3, _r15, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4i32 _r2n = __msa_ld_w(r2 + 4, 0);\n\n                    v4f32 _r20 = (v4f32)__msa_splati_w(_r2, 0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n                    v4f32 _r23 = (v4f32)__msa_splati_w(_r2, 3);\n                    v4f32 _r24 = (v4f32)__msa_splati_w(_r2n, 0);\n                    v4f32 _r25 = (v4f32)__msa_splati_w(_r2n, 1);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum1 = __msa_fmadd_w(_sum1, _r21, _k20);\n                    _sum2 = __msa_fmadd_w(_sum2, _r22, _k20);\n                    _sum3 = __msa_fmadd_w(_sum3, _r23, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum1 = __msa_fmadd_w(_sum1, _r22, _k21);\n                    _sum2 = __msa_fmadd_w(_sum2, _r23, _k21);\n                    _sum3 = __msa_fmadd_w(_sum3, _r24, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n                    _sum1 = __msa_fmadd_w(_sum1, _r23, _k22);\n      
              _sum2 = __msa_fmadd_w(_sum2, _r24, _k22);\n                    _sum3 = __msa_fmadd_w(_sum3, _r25, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n                    __msa_st_w((v4i32)_sum1, outptr0 + 4, 0);\n                    __msa_st_w((v4i32)_sum2, outptr0 + 4 * 2, 0);\n                    __msa_st_w((v4i32)_sum3, outptr0 + 4 * 3, 0);\n\n                    outptr0 += 4 * 4;\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n                    v4f32 _sum1 = (v4f32)__msa_ld_w(outptr0 + 4, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n                    v4f32 _r03 = (v4f32)__msa_splati_w(_r0, 3);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum1 = __msa_fmadd_w(_sum1, _r01, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum1 = __msa_fmadd_w(_sum1, _r02, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n                    _sum1 = __msa_fmadd_w(_sum1, _r03, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n                    v4f32 _r13 = (v4f32)__msa_splati_w(_r1, 3);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, _k10);\n                    _sum1 = __msa_fmadd_w(_sum1, _r11, _k10);\n                    _sum0 = __msa_fmadd_w(_sum0, _r11, _k11);\n                    _sum1 = __msa_fmadd_w(_sum1, _r12, _k11);\n       
             _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n                    _sum1 = __msa_fmadd_w(_sum1, _r13, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4f32 _r20 = (v4f32)__msa_splati_w(_r2, 0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n                    v4f32 _r23 = (v4f32)__msa_splati_w(_r2, 3);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum1 = __msa_fmadd_w(_sum1, _r21, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum1 = __msa_fmadd_w(_sum1, _r22, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n                    _sum1 = __msa_fmadd_w(_sum1, _r23, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n                    __msa_st_w((v4i32)_sum1, outptr0 + 4, 0);\n\n                    outptr0 += 4 * 2;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                }\n                for (; j < outw; j++)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, _k10);\n                    _sum0 = 
__msa_fmadd_w(_sum0, _r11, _k11);\n                    _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4f32 _r20 = (v4f32)__msa_splati_w(_r2, 0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n\n                    outptr0 += 4;\n\n                    r0 += 1;\n                    r1 += 1;\n                    r2 += 1;\n                }\n\n                r0 += 2;\n                r1 += 2;\n                r2 += 2;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n\nstatic void conv3x3s2_pack1to4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int inch = bottom_blob.c;\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int tailstep = w - 2 * outw + w;\n\n    const float* bias = _bias;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        Mat out0 = top_blob.channel(p);\n\n        v4f32 _bias0 = bias ? 
(v4f32)__msa_ld_w(bias + p * 4, 0) : (v4f32)__msa_fill_w(0);\n        out0.fill(_bias0);\n\n        const float* k0 = kernel.channel(p);\n\n        int q = 0;\n        for (; q < inch; q++)\n        {\n            float* outptr0 = out0;\n\n            const Mat img0 = bottom_blob.channel(q);\n\n            const float* r0 = img0.row(0);\n            const float* r1 = img0.row(1);\n            const float* r2 = img0.row(2);\n\n            v4f32 _k00 = (v4f32)__msa_ld_w(k0, 0);\n            v4f32 _k01 = (v4f32)__msa_ld_w(k0 + 4, 0);\n            v4f32 _k02 = (v4f32)__msa_ld_w(k0 + 4 * 2, 0);\n            v4f32 _k10 = (v4f32)__msa_ld_w(k0 + 4 * 3, 0);\n            v4f32 _k11 = (v4f32)__msa_ld_w(k0 + 4 * 4, 0);\n            v4f32 _k12 = (v4f32)__msa_ld_w(k0 + 4 * 5, 0);\n            v4f32 _k20 = (v4f32)__msa_ld_w(k0 + 4 * 6, 0);\n            v4f32 _k21 = (v4f32)__msa_ld_w(k0 + 4 * 7, 0);\n            v4f32 _k22 = (v4f32)__msa_ld_w(k0 + 4 * 8, 0);\n\n            int i = 0;\n            for (; i < outh; i++)\n            {\n                int j = 0;\n                for (; j + 7 < outw; j += 8)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n                    v4f32 _sum1 = (v4f32)__msa_ld_w(outptr0 + 4, 0);\n                    v4f32 _sum2 = (v4f32)__msa_ld_w(outptr0 + 4 * 2, 0);\n                    v4f32 _sum3 = (v4f32)__msa_ld_w(outptr0 + 4 * 3, 0);\n                    v4f32 _sum4 = (v4f32)__msa_ld_w(outptr0 + 4 * 4, 0);\n                    v4f32 _sum5 = (v4f32)__msa_ld_w(outptr0 + 4 * 5, 0);\n                    v4f32 _sum6 = (v4f32)__msa_ld_w(outptr0 + 4 * 6, 0);\n                    v4f32 _sum7 = (v4f32)__msa_ld_w(outptr0 + 4 * 7, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4i32 _r0n = __msa_ld_w(r0 + 4, 0);\n                    v4i32 _r0nn = __msa_ld_w(r0 + 8, 0);\n                    v4i32 _r0nnn = __msa_ld_w(r0 + 12, 0);\n\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 
0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n                    v4f32 _r03 = (v4f32)__msa_splati_w(_r0, 3);\n                    v4f32 _r04 = (v4f32)__msa_splati_w(_r0n, 0);\n                    v4f32 _r05 = (v4f32)__msa_splati_w(_r0n, 1);\n                    v4f32 _r06 = (v4f32)__msa_splati_w(_r0n, 2);\n                    v4f32 _r07 = (v4f32)__msa_splati_w(_r0n, 3);\n                    v4f32 _r08 = (v4f32)__msa_splati_w(_r0nn, 0);\n                    v4f32 _r09 = (v4f32)__msa_splati_w(_r0nn, 1);\n                    v4f32 _r0a = (v4f32)__msa_splati_w(_r0nn, 2);\n                    v4f32 _r0b = (v4f32)__msa_splati_w(_r0nn, 3);\n                    v4f32 _r0c = (v4f32)__msa_splati_w(_r0nnn, 0);\n                    v4f32 _r0d = (v4f32)__msa_splati_w(_r0nnn, 1);\n                    v4f32 _r0e = (v4f32)__msa_splati_w(_r0nnn, 2);\n                    v4f32 _r0f = (v4f32)__msa_splati_w(_r0nnn, 3);\n                    v4f32 _r0g = __msa_fill_w_f32(r0[16]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum1 = __msa_fmadd_w(_sum1, _r02, _k00);\n                    _sum2 = __msa_fmadd_w(_sum2, _r04, _k00);\n                    _sum3 = __msa_fmadd_w(_sum3, _r06, _k00);\n                    _sum4 = __msa_fmadd_w(_sum4, _r08, _k00);\n                    _sum5 = __msa_fmadd_w(_sum5, _r0a, _k00);\n                    _sum6 = __msa_fmadd_w(_sum6, _r0c, _k00);\n                    _sum7 = __msa_fmadd_w(_sum7, _r0e, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum1 = __msa_fmadd_w(_sum1, _r03, _k01);\n                    _sum2 = __msa_fmadd_w(_sum2, _r05, _k01);\n                    _sum3 = __msa_fmadd_w(_sum3, _r07, _k01);\n                    _sum4 = __msa_fmadd_w(_sum4, _r09, _k01);\n                    _sum5 = __msa_fmadd_w(_sum5, _r0b, _k01);\n                    _sum6 = __msa_fmadd_w(_sum6, 
_r0d, _k01);\n                    _sum7 = __msa_fmadd_w(_sum7, _r0f, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n                    _sum1 = __msa_fmadd_w(_sum1, _r04, _k02);\n                    _sum2 = __msa_fmadd_w(_sum2, _r06, _k02);\n                    _sum3 = __msa_fmadd_w(_sum3, _r08, _k02);\n                    _sum4 = __msa_fmadd_w(_sum4, _r0a, _k02);\n                    _sum5 = __msa_fmadd_w(_sum5, _r0c, _k02);\n                    _sum6 = __msa_fmadd_w(_sum6, _r0e, _k02);\n                    _sum7 = __msa_fmadd_w(_sum7, _r0g, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4i32 _r1n = __msa_ld_w(r1 + 4, 0);\n                    v4i32 _r1nn = __msa_ld_w(r1 + 8, 0);\n                    v4i32 _r1nnn = __msa_ld_w(r1 + 12, 0);\n\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n                    v4f32 _r13 = (v4f32)__msa_splati_w(_r1, 3);\n                    v4f32 _r14 = (v4f32)__msa_splati_w(_r1n, 0);\n                    v4f32 _r15 = (v4f32)__msa_splati_w(_r1n, 1);\n                    v4f32 _r16 = (v4f32)__msa_splati_w(_r1n, 2);\n                    v4f32 _r17 = (v4f32)__msa_splati_w(_r1n, 3);\n                    v4f32 _r18 = (v4f32)__msa_splati_w(_r1nn, 0);\n                    v4f32 _r19 = (v4f32)__msa_splati_w(_r1nn, 1);\n                    v4f32 _r1a = (v4f32)__msa_splati_w(_r1nn, 2);\n                    v4f32 _r1b = (v4f32)__msa_splati_w(_r1nn, 3);\n                    v4f32 _r1c = (v4f32)__msa_splati_w(_r1nnn, 0);\n                    v4f32 _r1d = (v4f32)__msa_splati_w(_r1nnn, 1);\n                    v4f32 _r1e = (v4f32)__msa_splati_w(_r1nnn, 2);\n                    v4f32 _r1f = (v4f32)__msa_splati_w(_r1nnn, 3);\n                    v4f32 _r1g = __msa_fill_w_f32(r1[16]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, 
_k10);\n                    _sum1 = __msa_fmadd_w(_sum1, _r12, _k10);\n                    _sum2 = __msa_fmadd_w(_sum2, _r14, _k10);\n                    _sum3 = __msa_fmadd_w(_sum3, _r16, _k10);\n                    _sum4 = __msa_fmadd_w(_sum4, _r18, _k10);\n                    _sum5 = __msa_fmadd_w(_sum5, _r1a, _k10);\n                    _sum6 = __msa_fmadd_w(_sum6, _r1c, _k10);\n                    _sum7 = __msa_fmadd_w(_sum7, _r1e, _k10);\n                    _sum0 = __msa_fmadd_w(_sum0, _r11, _k11);\n                    _sum1 = __msa_fmadd_w(_sum1, _r13, _k11);\n                    _sum2 = __msa_fmadd_w(_sum2, _r15, _k11);\n                    _sum3 = __msa_fmadd_w(_sum3, _r17, _k11);\n                    _sum4 = __msa_fmadd_w(_sum4, _r19, _k11);\n                    _sum5 = __msa_fmadd_w(_sum5, _r1b, _k11);\n                    _sum6 = __msa_fmadd_w(_sum6, _r1d, _k11);\n                    _sum7 = __msa_fmadd_w(_sum7, _r1f, _k11);\n                    _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n                    _sum1 = __msa_fmadd_w(_sum1, _r14, _k12);\n                    _sum2 = __msa_fmadd_w(_sum2, _r16, _k12);\n                    _sum3 = __msa_fmadd_w(_sum3, _r18, _k12);\n                    _sum4 = __msa_fmadd_w(_sum4, _r1a, _k12);\n                    _sum5 = __msa_fmadd_w(_sum5, _r1c, _k12);\n                    _sum6 = __msa_fmadd_w(_sum6, _r1e, _k12);\n                    _sum7 = __msa_fmadd_w(_sum7, _r1g, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4i32 _r2n = __msa_ld_w(r2 + 4, 0);\n                    v4i32 _r2nn = __msa_ld_w(r2 + 8, 0);\n                    v4i32 _r2nnn = __msa_ld_w(r2 + 12, 0);\n\n                    v4f32 _r20 = (v4f32)__msa_splati_w(_r2, 0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n                    v4f32 _r23 = (v4f32)__msa_splati_w(_r2, 3);\n                    v4f32 _r24 = 
(v4f32)__msa_splati_w(_r2n, 0);\n                    v4f32 _r25 = (v4f32)__msa_splati_w(_r2n, 1);\n                    v4f32 _r26 = (v4f32)__msa_splati_w(_r2n, 2);\n                    v4f32 _r27 = (v4f32)__msa_splati_w(_r2n, 3);\n                    v4f32 _r28 = (v4f32)__msa_splati_w(_r2nn, 0);\n                    v4f32 _r29 = (v4f32)__msa_splati_w(_r2nn, 1);\n                    v4f32 _r2a = (v4f32)__msa_splati_w(_r2nn, 2);\n                    v4f32 _r2b = (v4f32)__msa_splati_w(_r2nn, 3);\n                    v4f32 _r2c = (v4f32)__msa_splati_w(_r2nnn, 0);\n                    v4f32 _r2d = (v4f32)__msa_splati_w(_r2nnn, 1);\n                    v4f32 _r2e = (v4f32)__msa_splati_w(_r2nnn, 2);\n                    v4f32 _r2f = (v4f32)__msa_splati_w(_r2nnn, 3);\n                    v4f32 _r2g = __msa_fill_w_f32(r2[16]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum1 = __msa_fmadd_w(_sum1, _r22, _k20);\n                    _sum2 = __msa_fmadd_w(_sum2, _r24, _k20);\n                    _sum3 = __msa_fmadd_w(_sum3, _r26, _k20);\n                    _sum4 = __msa_fmadd_w(_sum4, _r28, _k20);\n                    _sum5 = __msa_fmadd_w(_sum5, _r2a, _k20);\n                    _sum6 = __msa_fmadd_w(_sum6, _r2c, _k20);\n                    _sum7 = __msa_fmadd_w(_sum7, _r2e, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum1 = __msa_fmadd_w(_sum1, _r23, _k21);\n                    _sum2 = __msa_fmadd_w(_sum2, _r25, _k21);\n                    _sum3 = __msa_fmadd_w(_sum3, _r27, _k21);\n                    _sum4 = __msa_fmadd_w(_sum4, _r29, _k21);\n                    _sum5 = __msa_fmadd_w(_sum5, _r2b, _k21);\n                    _sum6 = __msa_fmadd_w(_sum6, _r2d, _k21);\n                    _sum7 = __msa_fmadd_w(_sum7, _r2f, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n                    _sum1 = __msa_fmadd_w(_sum1, _r24, _k22);\n                    _sum2 = 
__msa_fmadd_w(_sum2, _r26, _k22);\n                    _sum3 = __msa_fmadd_w(_sum3, _r28, _k22);\n                    _sum4 = __msa_fmadd_w(_sum4, _r2a, _k22);\n                    _sum5 = __msa_fmadd_w(_sum5, _r2c, _k22);\n                    _sum6 = __msa_fmadd_w(_sum6, _r2e, _k22);\n                    _sum7 = __msa_fmadd_w(_sum7, _r2g, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n                    __msa_st_w((v4i32)_sum1, outptr0 + 4, 0);\n                    __msa_st_w((v4i32)_sum2, outptr0 + 4 * 2, 0);\n                    __msa_st_w((v4i32)_sum3, outptr0 + 4 * 3, 0);\n                    __msa_st_w((v4i32)_sum4, outptr0 + 4 * 4, 0);\n                    __msa_st_w((v4i32)_sum5, outptr0 + 4 * 5, 0);\n                    __msa_st_w((v4i32)_sum6, outptr0 + 4 * 6, 0);\n                    __msa_st_w((v4i32)_sum7, outptr0 + 4 * 7, 0);\n\n                    outptr0 += 4 * 8;\n\n                    r0 += 16;\n                    r1 += 16;\n                    r2 += 16;\n                }\n                for (; j + 3 < outw; j += 4)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n                    v4f32 _sum1 = (v4f32)__msa_ld_w(outptr0 + 4, 0);\n                    v4f32 _sum2 = (v4f32)__msa_ld_w(outptr0 + 4 * 2, 0);\n                    v4f32 _sum3 = (v4f32)__msa_ld_w(outptr0 + 4 * 3, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4i32 _r0n = __msa_ld_w(r0 + 4, 0);\n\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n                    v4f32 _r03 = (v4f32)__msa_splati_w(_r0, 3);\n                    v4f32 _r04 = (v4f32)__msa_splati_w(_r0n, 0);\n                    v4f32 _r05 = (v4f32)__msa_splati_w(_r0n, 1);\n                    v4f32 _r06 = (v4f32)__msa_splati_w(_r0n, 2);\n                    v4f32 _r07 = 
(v4f32)__msa_splati_w(_r0n, 3);\n                    v4f32 _r08 = __msa_fill_w_f32(r0[8]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum1 = __msa_fmadd_w(_sum1, _r02, _k00);\n                    _sum2 = __msa_fmadd_w(_sum2, _r04, _k00);\n                    _sum3 = __msa_fmadd_w(_sum3, _r06, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum1 = __msa_fmadd_w(_sum1, _r03, _k01);\n                    _sum2 = __msa_fmadd_w(_sum2, _r05, _k01);\n                    _sum3 = __msa_fmadd_w(_sum3, _r07, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n                    _sum1 = __msa_fmadd_w(_sum1, _r04, _k02);\n                    _sum2 = __msa_fmadd_w(_sum2, _r06, _k02);\n                    _sum3 = __msa_fmadd_w(_sum3, _r08, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4i32 _r1n = __msa_ld_w(r1 + 4, 0);\n\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n                    v4f32 _r13 = (v4f32)__msa_splati_w(_r1, 3);\n                    v4f32 _r14 = (v4f32)__msa_splati_w(_r1n, 0);\n                    v4f32 _r15 = (v4f32)__msa_splati_w(_r1n, 1);\n                    v4f32 _r16 = (v4f32)__msa_splati_w(_r1n, 2);\n                    v4f32 _r17 = (v4f32)__msa_splati_w(_r1n, 3);\n                    v4f32 _r18 = __msa_fill_w_f32(r1[8]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, _k10);\n                    _sum1 = __msa_fmadd_w(_sum1, _r12, _k10);\n                    _sum2 = __msa_fmadd_w(_sum2, _r14, _k10);\n                    _sum3 = __msa_fmadd_w(_sum3, _r16, _k10);\n                    _sum0 = __msa_fmadd_w(_sum0, _r11, _k11);\n                    _sum1 = __msa_fmadd_w(_sum1, _r13, _k11);\n                    _sum2 = __msa_fmadd_w(_sum2, _r15, _k11);\n           
         _sum3 = __msa_fmadd_w(_sum3, _r17, _k11);\n                    _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n                    _sum1 = __msa_fmadd_w(_sum1, _r14, _k12);\n                    _sum2 = __msa_fmadd_w(_sum2, _r16, _k12);\n                    _sum3 = __msa_fmadd_w(_sum3, _r18, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4i32 _r2n = __msa_ld_w(r2 + 4, 0);\n\n                    v4f32 _r20 = (v4f32)__msa_splati_w(_r2, 0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n                    v4f32 _r23 = (v4f32)__msa_splati_w(_r2, 3);\n                    v4f32 _r24 = (v4f32)__msa_splati_w(_r2n, 0);\n                    v4f32 _r25 = (v4f32)__msa_splati_w(_r2n, 1);\n                    v4f32 _r26 = (v4f32)__msa_splati_w(_r2n, 2);\n                    v4f32 _r27 = (v4f32)__msa_splati_w(_r2n, 3);\n                    v4f32 _r28 = __msa_fill_w_f32(r2[8]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum1 = __msa_fmadd_w(_sum1, _r22, _k20);\n                    _sum2 = __msa_fmadd_w(_sum2, _r24, _k20);\n                    _sum3 = __msa_fmadd_w(_sum3, _r26, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum1 = __msa_fmadd_w(_sum1, _r23, _k21);\n                    _sum2 = __msa_fmadd_w(_sum2, _r25, _k21);\n                    _sum3 = __msa_fmadd_w(_sum3, _r27, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n                    _sum1 = __msa_fmadd_w(_sum1, _r24, _k22);\n                    _sum2 = __msa_fmadd_w(_sum2, _r26, _k22);\n                    _sum3 = __msa_fmadd_w(_sum3, _r28, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n                    __msa_st_w((v4i32)_sum1, outptr0 + 4, 0);\n                    __msa_st_w((v4i32)_sum2, outptr0 + 4 * 2, 0);\n                    __msa_st_w((v4i32)_sum3, 
outptr0 + 4 * 3, 0);\n\n                    outptr0 += 4 * 4;\n\n                    r0 += 8;\n                    r1 += 8;\n                    r2 += 8;\n                }\n                for (; j + 1 < outw; j += 2)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n                    v4f32 _sum1 = (v4f32)__msa_ld_w(outptr0 + 4, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n                    v4f32 _r03 = (v4f32)__msa_splati_w(_r0, 3);\n                    v4f32 _r04 = __msa_fill_w_f32(r0[4]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum1 = __msa_fmadd_w(_sum1, _r02, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum1 = __msa_fmadd_w(_sum1, _r03, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n                    _sum1 = __msa_fmadd_w(_sum1, _r04, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n                    v4f32 _r13 = (v4f32)__msa_splati_w(_r1, 3);\n                    v4f32 _r14 = __msa_fill_w_f32(r1[4]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, _k10);\n                    _sum1 = __msa_fmadd_w(_sum1, _r12, _k10);\n                    _sum0 = __msa_fmadd_w(_sum0, _r11, _k11);\n                    _sum1 = __msa_fmadd_w(_sum1, _r13, _k11);\n                    _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n                    _sum1 = __msa_fmadd_w(_sum1, _r14, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4f32 _r20 = (v4f32)__msa_splati_w(_r2, 
0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n                    v4f32 _r23 = (v4f32)__msa_splati_w(_r2, 3);\n                    v4f32 _r24 = __msa_fill_w_f32(r2[4]);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum1 = __msa_fmadd_w(_sum1, _r22, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum1 = __msa_fmadd_w(_sum1, _r23, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n                    _sum1 = __msa_fmadd_w(_sum1, _r24, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n                    __msa_st_w((v4i32)_sum1, outptr0 + 4, 0);\n\n                    outptr0 += 4 * 2;\n\n                    r0 += 4;\n                    r1 += 4;\n                    r2 += 4;\n                }\n                for (; j < outw; j++)\n                {\n                    v4f32 _sum0 = (v4f32)__msa_ld_w(outptr0, 0);\n\n                    v4i32 _r0 = __msa_ld_w(r0, 0);\n                    v4f32 _r00 = (v4f32)__msa_splati_w(_r0, 0);\n                    v4f32 _r01 = (v4f32)__msa_splati_w(_r0, 1);\n                    v4f32 _r02 = (v4f32)__msa_splati_w(_r0, 2);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r00, _k00);\n                    _sum0 = __msa_fmadd_w(_sum0, _r01, _k01);\n                    _sum0 = __msa_fmadd_w(_sum0, _r02, _k02);\n\n                    v4i32 _r1 = __msa_ld_w(r1, 0);\n                    v4f32 _r10 = (v4f32)__msa_splati_w(_r1, 0);\n                    v4f32 _r11 = (v4f32)__msa_splati_w(_r1, 1);\n                    v4f32 _r12 = (v4f32)__msa_splati_w(_r1, 2);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r10, _k10);\n                    _sum0 = __msa_fmadd_w(_sum0, _r11, _k11);\n                    _sum0 = __msa_fmadd_w(_sum0, _r12, _k12);\n\n                    v4i32 _r2 = __msa_ld_w(r2, 0);\n                    v4f32 
_r20 = (v4f32)__msa_splati_w(_r2, 0);\n                    v4f32 _r21 = (v4f32)__msa_splati_w(_r2, 1);\n                    v4f32 _r22 = (v4f32)__msa_splati_w(_r2, 2);\n\n                    _sum0 = __msa_fmadd_w(_sum0, _r20, _k20);\n                    _sum0 = __msa_fmadd_w(_sum0, _r21, _k21);\n                    _sum0 = __msa_fmadd_w(_sum0, _r22, _k22);\n\n                    __msa_st_w((v4i32)_sum0, outptr0, 0);\n\n                    outptr0 += 4;\n\n                    r0 += 2;\n                    r1 += 2;\n                    r2 += 2;\n                }\n\n                r0 += tailstep;\n                r1 += tailstep;\n                r2 += tailstep;\n            }\n\n            k0 += 9 * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_3x3_pack4.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd63_transform_kernel_pack4_msa(const Mat& kernel, Mat& kernel_tm_pack4, int inch, int outch, const Option& opt)\n{\n    // winograd63 transform kernel\n    Mat kernel_tm;\n    kernel_tm.create(8 * 8, inch, outch);\n\n    const float ktm[8][3] = {\n        {1.0f, 0.0f, 0.0f},\n        {-2.0f / 9, -2.0f / 9, -2.0f / 9},\n        {-2.0f / 9, 2.0f / 9, -2.0f / 9},\n        {1.0f / 90, 1.0f / 45, 2.0f / 45},\n        {1.0f / 90, -1.0f / 45, 2.0f / 45},\n        {1.0f / 45, 1.0f / 90, 1.0f / 180},\n        {1.0f / 45, -1.0f / 90, 1.0f / 180},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel, transposed\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[8][3];\n            for (int i = 0; i < 8; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // v\n            for (int j = 0; j < 8; j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 8; i++)\n                {\n                    kernel_tm0[j * 8 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 64-inch-outch\n    // dst = pb-pa-inch/pa-64-outch/pb\n    
kernel_tm_pack4.create(inch / 4, 64, outch / 4, (size_t)4u * 4 * 4, 4 * 4);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm_pack4.channel(q / 4);\n\n        for (int k = 0; k < 64; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p + 3 < inch; p += 4)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const float* k00 = kernel_tm.channel(q + j).row(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd63_pack4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 6n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 5) / 6 * 6;\n    outh = (outh + 5) / 6 * 6;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 6;\n        int h_tiles = outh / 6;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 64, inch, elemsize, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd63_transform_input_pack4_msa(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack4_msa(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // 
END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, elemsize, elempack, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd63_transform_output_pack4_msa(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n\nstatic void conv3x3s1_winograd43_transform_kernel_pack4_msa(const Mat& kernel, Mat& kernel_tm_pack4, int inch, int outch, const Option& opt)\n{\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch);\n\n    const float ktm[6][3] = {\n        {1.0f / 4, 0.0f, 0.0f},\n        {-1.0f / 6, -1.0f / 6, -1.0f / 6},\n        {-1.0f / 6, 1.0f / 6, -1.0f / 6},\n        {1.0f / 24, 1.0f / 12, 1.0f / 6},\n        {1.0f / 24, -1.0f / 12, 1.0f / 6},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; 
j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = pb-pa-inch/pa-36-outch/pb\n    kernel_tm_pack4.create(inch / 4, 36, outch / 4, (size_t)4u * 4 * 4, 4 * 4);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm_pack4.channel(q / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p + 3 < inch; p += 4)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const float* k00 = kernel_tm.channel(q + j).row(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_pack4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        const int tiles = w_tiles * h_tiles;\n\n        
bottom_blob_tm.create(tiles, 36, inch, elemsize, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd43_transform_input_pack4_msa(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack4_msa(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, elemsize, elempack, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_pack4_msa(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n\nstatic void conv3x3s1_winograd23_transform_kernel_pack4_msa(const Mat& kernel, Mat& kernel_tm_pack4, int inch, int outch, const Option& opt)\n{\n    // winograd23 transform kernel\n    Mat kernel_tm(4 * 4, inch, outch);\n\n    const float ktm[4][3] = {\n        {1.0f, 0.0f, 0.0f},\n        {1.0f / 2, 1.0f / 2, 1.0f / 2},\n        {1.0f / 2, -1.0f / 2, 1.0f / 2},\n        {0.0f, 0.0f, 1.0f}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const float* kernel0 = (const float*)kernel + p * inch * 9 + q * 9;\n            float* kernel_tm0 = kernel_tm.channel(p).row(q);\n\n            // transform kernel\n            const float* k0 = kernel0;\n            const float* k1 = kernel0 + 3;\n            const float* k2 = kernel0 + 6;\n\n            // h\n            float tmp[4][3];\n            for (int i = 0; i < 4; i++)\n            {\n                tmp[i][0] = 
k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 4; j++)\n            {\n                float* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 4; i++)\n                {\n                    kernel_tm0[j * 4 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 16-inch-outch\n    // dst = pb-pa-inch/pa-16-outch/pb\n    kernel_tm_pack4.create(inch / 4, 16, outch / 4, (size_t)4u * 4 * 4, 4 * 4);\n\n    for (int q = 0; q + 3 < outch; q += 4)\n    {\n        Mat g0 = kernel_tm_pack4.channel(q / 4);\n\n        for (int k = 0; k < 16; k++)\n        {\n            float* g00 = g0.row(k);\n\n            for (int p = 0; p + 3 < inch; p += 4)\n            {\n                for (int i = 0; i < 4; i++)\n                {\n                    for (int j = 0; j < 4; j++)\n                    {\n                        const float* k00 = kernel_tm.channel(q + j).row(p + i);\n                        g00[0] = k00[k];\n                        g00++;\n                    }\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd23_pack4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Mat& bias, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 2n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 1) / 2 * 2;\n    outh = (outh + 1) / 2 * 2;\n\n    w = outw + 2;\n    h = outh + 2;\n    
copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 2;\n        int h_tiles = outh / 2;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 16, inch, elemsize, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd23_transform_input_pack4_msa(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack4_msa(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, elemsize, elempack, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd23_transform_output_pack4_msa(top_blob_tm, top_blob_bordered, bias, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_3x3_pack8to1_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_kernel_pack8to1_int8_msa(const Mat& kernel, Mat& kernel_tm_pack8to1, int inch, int outch, const Option& opt)\n{\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch, (size_t)2u);\n\n    const short ktm[6][3] = {\n        {6, 0, 0},\n        {-4, -4, -4},\n        {-4, 4, -4},\n        {1, 2, 4},\n        {1, -2, 4},\n        {0, 0, 6}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const signed char* kernel0 = (const signed char*)kernel + p * inch * 9 + q * 9;\n            short* kernel_tm0 = kernel_tm.channel(p).row<short>(q);\n\n            // transform kernel\n            const signed char* k0 = kernel0;\n            const signed char* k1 = kernel0 + 3;\n            const signed char* k2 = kernel0 + 6;\n\n            // h\n            short tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; j++)\n            {\n                short* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = 4b-8a-inch/8a-36-outch/4b\n    kernel_tm_pack8to1.create(8 * inch / 8, 36, outch / 4 + outch % 4, (size_t)2u * 4, 4);\n\n    int p = 0;\n    for (; p + 3 < outch; p += 4)\n    {\n        const Mat k0 = 
kernel_tm.channel(p);\n        const Mat k1 = kernel_tm.channel(p + 1);\n        const Mat k2 = kernel_tm.channel(p + 2);\n        const Mat k3 = kernel_tm.channel(p + 3);\n\n        Mat g0 = kernel_tm_pack8to1.channel(p / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            for (int q = 0; q + 7 < inch; q += 8)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0.row<const short>(q + i)[k];\n                    g00[1] = k1.row<const short>(q + i)[k];\n                    g00[2] = k2.row<const short>(q + i)[k];\n                    g00[3] = k3.row<const short>(q + i)[k];\n\n                    g00 += 4;\n                }\n            }\n        }\n    }\n    for (; p < outch; p++)\n    {\n        const Mat k0 = kernel_tm.channel(p);\n\n        Mat g0 = kernel_tm_pack8to1.channel(p / 4 + p % 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            for (int q = 0; q + 7 < inch; q += 8)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    g00[0] = k0.row<const short>(q + i)[k];\n\n                    g00 += 1;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_pack8to1_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    //     size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 
0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, 2u * elempack, elempack, opt.workspace_allocator);\n        conv3x3s1_winograd43_transform_input_pack8_int8_msa(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack8to1_int8_msa(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u, 1, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_int8_msa(top_blob_tm, top_blob_bordered, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_3x3_pack8to4_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void conv3x3s1_winograd43_transform_kernel_pack8to4_int8_msa(const Mat& kernel, Mat& kernel_tm_pack8, int inch, int outch, const Option& opt)\n{\n    // winograd43 transform kernel\n    Mat kernel_tm(6 * 6, inch, outch, (size_t)2u);\n\n    const short ktm[6][3] = {\n        {6, 0, 0},\n        {-4, -4, -4},\n        {-4, 4, -4},\n        {1, 2, 4},\n        {1, -2, 4},\n        {0, 0, 6}\n    };\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        for (int q = 0; q < inch; q++)\n        {\n            const signed char* kernel0 = (const signed char*)kernel + p * inch * 9 + q * 9;\n            short* kernel_tm0 = kernel_tm.channel(p).row<short>(q);\n\n            // transform kernel\n            const signed char* k0 = kernel0;\n            const signed char* k1 = kernel0 + 3;\n            const signed char* k2 = kernel0 + 6;\n\n            // h\n            short tmp[6][3];\n            for (int i = 0; i < 6; i++)\n            {\n                tmp[i][0] = k0[0] * ktm[i][0] + k0[1] * ktm[i][1] + k0[2] * ktm[i][2];\n                tmp[i][1] = k1[0] * ktm[i][0] + k1[1] * ktm[i][1] + k1[2] * ktm[i][2];\n                tmp[i][2] = k2[0] * ktm[i][0] + k2[1] * ktm[i][1] + k2[2] * ktm[i][2];\n            }\n\n            // U\n            for (int j = 0; j < 6; j++)\n            {\n                short* tmpp = &tmp[j][0];\n\n                for (int i = 0; i < 6; i++)\n                {\n                    kernel_tm0[j * 6 + i] = tmpp[0] * ktm[i][0] + tmpp[1] * ktm[i][1] + tmpp[2] * ktm[i][2];\n                }\n            }\n        }\n    }\n\n    // interleave\n    // src = 36-inch-outch\n    // dst = 4b-8a-inch/8a-36-outch/4b\n    kernel_tm_pack8.create(inch / 8, 36, outch / 4, (size_t)2u * 32, 32);\n\n    int q = 0;\n    for (; q + 3 < outch; q += 4)\n    {\n        const Mat k0 = kernel_tm.channel(q);\n     
   const Mat k1 = kernel_tm.channel(q + 1);\n        const Mat k2 = kernel_tm.channel(q + 2);\n        const Mat k3 = kernel_tm.channel(q + 3);\n\n        Mat g0 = kernel_tm_pack8.channel(q / 4);\n\n        for (int k = 0; k < 36; k++)\n        {\n            short* g00 = g0.row<short>(k);\n\n            for (int p = 0; p + 7 < inch; p += 8)\n            {\n                for (int i = 0; i < 8; i++)\n                {\n                    const short* k00 = k0.row<const short>(p + i);\n                    const short* k10 = k1.row<const short>(p + i);\n                    const short* k20 = k2.row<const short>(p + i);\n                    const short* k30 = k3.row<const short>(p + i);\n\n                    g00[0] = k00[k];\n                    g00[1] = k10[k];\n                    g00[2] = k20[k];\n                    g00[3] = k30[k];\n\n                    g00 += 4;\n                }\n            }\n        }\n    }\n}\n\nstatic void conv3x3s1_winograd43_pack8to4_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel_tm, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int inch = bottom_blob.c;\n    //     size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    // pad to 4n+2\n    Mat bottom_blob_bordered = bottom_blob;\n\n    outw = (outw + 3) / 4 * 4;\n    outh = (outh + 3) / 4 * 4;\n\n    w = outw + 2;\n    h = outh + 2;\n    copy_make_border(bottom_blob, bottom_blob_bordered, 0, h - bottom_blob.h, 0, w - bottom_blob.w, BORDER_CONSTANT, 0.f, opt);\n\n    // BEGIN transform input\n    Mat bottom_blob_tm;\n    {\n        int w_tiles = outw / 4;\n        int h_tiles = outh / 4;\n        const int tiles = w_tiles * h_tiles;\n\n        bottom_blob_tm.create(tiles, 36, inch, 2u * elempack, elempack, opt.workspace_allocator);\n        
conv3x3s1_winograd43_transform_input_pack8_int8_msa(bottom_blob_bordered, bottom_blob_tm, opt);\n    }\n    bottom_blob_bordered = Mat();\n    // END transform input\n\n    // BEGIN dot\n    Mat top_blob_tm;\n    convolution_winograd_dot_pack8to4_int8_msa(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n    // END dot\n\n    // BEGIN transform output\n    Mat top_blob_bordered;\n    if (outw == top_blob.w && outh == top_blob.h)\n    {\n        top_blob_bordered = top_blob;\n    }\n    else\n    {\n        top_blob_bordered.create(outw, outh, outch, 4u * 4, 4, opt.workspace_allocator);\n    }\n    {\n        conv3x3s1_winograd43_transform_output_pack4_int8_msa(top_blob_tm, top_blob_bordered, opt);\n    }\n    // END transform output\n\n    // cut result pad\n    copy_cut_border(top_blob_bordered, top_blob, 0, top_blob_bordered.h - top_blob.h, 0, top_blob_bordered.w - top_blob.w, opt);\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_int8(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_int8, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                int sum = 0;\n\n                //                 const signed char* kptr = weight_data_int8.channel(p);\n                const signed char* kptr = (const signed char*)weight_data_int8 + maxk * channels * p;\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const signed char* sptr = m.row<signed char>(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        signed char val = sptr[space_ofs[k]];\n                        signed char w = kptr[k];\n                        sum += val * w;\n                    }\n\n          
          kptr += maxk;\n                }\n\n                outptr[j] = sum;\n            }\n\n            outptr += outw;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_mips.cpp",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"convolution_mips.h\"\n\n#include \"benchmark.h\"\n#include \"cpu.h\"\n#include \"layer_type.h\"\n\n#if __mips_msa\n#include <msa.h>\n#endif // __mips_msa\n\n#include \"mips_activation.h\"\n#include \"mips_usability.h\"\n\n#include \"cpu.h\"\n\nnamespace ncnn {\n\n#include \"convolution_sgemm.h\"\n#include \"convolution_winograd_transform.h\"\n#include \"convolution_winograd_dot.h\"\n#include \"convolution_1x1.h\"\n#include \"convolution_3x3.h\"\n\n#if NCNN_INT8\n#include \"convolution_sgemm_int8.h\"\n#include \"convolution_winograd_transform_int8.h\"\n#include \"convolution_winograd_dot_int8.h\"\n#include \"convolution_1x1_int8.h\"\n#include \"convolution_3x3_int8.h\"\n#include \"convolution_int8.h\"\n#endif // NCNN_INT8\n\n#if __mips_msa\n#include \"convolution_pack4.h\"\n#include \"convolution_pack1to4.h\"\n#include \"convolution_pack4to1.h\"\n\n#include \"convolution_sgemm_pack4.h\"\n#include \"convolution_sgemm_pack4to1.h\"\n#include \"convolution_winograd_transform_pack4.h\"\n#include \"convolution_winograd_dot_pack4.h\"\n#include \"convolution_1x1_pack4.h\"\n#include \"convolution_1x1_pack4to1.h\"\n#include \"convolution_3x3_pack4.h\"\n#include \"convolution_3x3_pack1to4.h\"\n#include \"convolution_7x7_pack1to4.h\"\n\n#if NCNN_INT8\n#include \"convolution_pack8to4_int8.h\"\n#include \"convolution_pack1to4_int8.h\"\n#include \"convolution_pack8to1_int8.h\"\n#include \"convolution_sgemm_pack8to4_int8.h\"\n#include \"convolution_sgemm_pack1to4_int8.h\"\n#include \"convolution_sgemm_pack8to1_int8.h\"\n#include \"convolution_winograd_transform_pack4_int8.h\"\n#include \"convolution_winograd_transform_pack8_int8.h\"\n#include \"convolution_winograd_dot_pack8to4_int8.h\"\n#include \"convolution_winograd_dot_pack8to1_int8.h\"\n#include \"convolution_1x1_pack8to4_int8.h\"\n#include \"convolution_1x1_pack1to4_int8.h\"\n#include \"convolution_1x1_pack8to1_int8.h\"\n#include 
\"convolution_3x3_pack8to4_int8.h\"\n#include \"convolution_3x3_pack8to1_int8.h\"\n#endif // NCNN_INT8\n#endif // __mips_msa\n\nConvolution_mips::Convolution_mips()\n{\n#if __mips_msa\n    support_packing = true;\n#endif // __mips_msa\n\n    activation = 0;\n}\n\nstatic void convolution_transform_kernel_packed_msa(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, int kernel_w, int kernel_h, int elempack, int out_elempack)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pb-pa-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)4u * elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            float* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < elempack; i++)\n                    {\n                        for (int j = 0; j < out_elempack; j++)\n                        {\n                            const float* k00 = weight_data_r2.channel(q + j).row(p + i);\n\n                            g00[0] = k00[k];\n\n                            g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n\nint Convolution_mips::create_pipeline(const Option& opt)\n{\n    if (dynamic_weight)\n        return 0;\n\n    activation = create_activation_layer(activation_type, activation_params, opt);\n\n#if NCNN_INT8\n    if (opt.use_int8_inference && weight_data.elemsize == (size_t)1u)\n    {\n        return create_pipeline_int8_mips(opt);\n    }\n#endif\n\n    const int maxk = kernel_w * kernel_h;\n    const int 
num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 4 == 0 ? 4 : 1;\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n\n#if __mips_msa\n    // pack4\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd63_convolution && num_input >= 8 && num_output >= 8 && num_input <= 64 && num_output <= 64) || (!opt.use_winograd43_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd63_transform_kernel_pack4_msa(weight_data, weight_winograd63_data, num_input, num_output, opt);\n            else if ((opt.use_winograd43_convolution && num_input >= 8 && num_output >= 8) || (!opt.use_winograd63_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd43_transform_kernel_pack4_msa(weight_data, weight_winograd43_data, num_input, num_output, opt);\n            else // if (opt.use_winograd23_convolution)\n                conv3x3s1_winograd23_transform_kernel_pack4_msa(weight_data, weight_winograd23_data, num_input, num_output, opt);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_msa(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n\n    // pack1to4\n    if (elempack == 1 && out_elempack == 4)\n    {\n        convolution_transform_kernel_packed_msa(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n    }\n\n    // pack4to1\n    if (elempack == 4 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 
&& dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack4to1_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack4to1_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack4to1_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_msa(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n#endif // __mips_msa\n\n    // pack1\n    if (elempack == 1 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd43_convolution && num_input >= 16 && num_output >= 16) || !opt.use_winograd23_convolution)\n            {\n                conv3x3s1_winograd43_transform_kernel_msa(weight_data, weight_winograd43_data, num_input, num_output, opt);\n            }\n            else if (opt.use_winograd23_convolution)\n            {\n                conv3x3s1_winograd23_transform_kernel_msa(weight_data, weight_winograd23_data, num_input, num_output, opt);\n            }\n    
    }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            weight_data_tm = weight_data;\n        }\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_mips::destroy_pipeline(const Option& opt)\n{\n    if (activation)\n    {\n        activation->destroy_pipeline(opt);\n        delete activation;\n        activation = 0;\n    }\n\n    return 0;\n}\n\nint Convolution_mips::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n#if NCNN_INT8\n    if (opt.use_int8_inference && int8_scale_term)\n    {\n        return forward_int8_mips(bottom_blob, top_blob, opt);\n    }\n#endif\n\n    // flattened blob, implement as InnerProduct\n    if (bottom_blob.dims == 1 && kernel_w == 1 && kernel_h == 1)\n    {\n        Mat bottom_blob_3d;\n        if (bottom_blob.elemsize % 16 == 0)\n        {\n            bottom_blob_3d = bottom_blob;\n            bottom_blob_3d.dims = 3;\n            bottom_blob_3d.w = 1;\n            bottom_blob_3d.h = 1;\n            bottom_blob_3d.c = bottom_blob.w;\n            bottom_blob_3d.cstep = 1;\n        }\n        else\n        {\n            bottom_blob_3d = bottom_blob.reshape(1, 1, bottom_blob.w, opt.workspace_allocator);\n        }\n\n        Mat top_blob_3d;\n        int ret = forward(bottom_blob_3d, top_blob_3d, opt);\n        if (ret != 0)\n            return ret;\n\n        if (top_blob_3d.elemsize % 16 == 0)\n        {\n            top_blob = top_blob_3d;\n            top_blob.dims = 1;\n            top_blob.w = top_blob_3d.c;\n            top_blob.h = 1;\n            top_blob.c = 1;\n            top_blob.cstep = top_blob_3d.c;\n        }\n        else\n        {\n            top_blob = top_blob_3d.reshape(top_blob_3d.c, opt.blob_allocator);\n        }\n\n        return 0;\n    }\n\n  
  int w = bottom_blob.w;\n    int h = bottom_blob.h;\n    int channels = bottom_blob.c;\n    size_t elemsize = bottom_blob.elemsize;\n    int elempack = bottom_blob.elempack;\n\n    //     NCNN_LOGE(\"Convolution input %d x %d  pad = %d %d  ksize=%d %d  stride=%d %d\", w, h, pad_w, pad_h, kernel_w, kernel_h, stride_w, stride_h);\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    w = bottom_blob_bordered.w;\n    h = bottom_blob_bordered.h;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n    int out_elempack = 1;\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif\n    size_t out_elemsize = elemsize / elempack * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const int num_input = channels * elempack;\n\n#if __mips_msa\n    if (elempack == 4 && out_elempack == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, 
opt);\n            }\n        }\n        else if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution || opt.use_winograd63_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd63_convolution && num_input >= 8 && num_output >= 8 && num_input <= 64 && num_output <= 64) || (!opt.use_winograd43_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd63_pack4_msa(bottom_blob_bordered, top_blob, weight_winograd63_data, bias_data, opt);\n            else if ((opt.use_winograd43_convolution && num_input >= 8 && num_output >= 8) || (!opt.use_winograd63_convolution && !opt.use_winograd23_convolution))\n                conv3x3s1_winograd43_pack4_msa(bottom_blob_bordered, top_blob, weight_winograd43_data, bias_data, opt);\n            else // if (opt.use_winograd23_convolution)\n                conv3x3s1_winograd23_pack4_msa(bottom_blob_bordered, top_blob, weight_winograd23_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_pack4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_pack4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack == 4)\n    {\n        if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n 
       {\n            conv3x3s1_pack1to4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv3x3s2_pack1to4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 7 && kernel_h == 7 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv7x7s2_pack1to4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_pack1to4_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n\n    if (elempack == 4 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack4to1_msa(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack4to1_msa(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n 
           }\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_pack4to1_msa(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            convolution_pack4to1_msa(bottom_blob_bordered, top_blob, weight_data_tm, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, activation_type, activation_params, opt);\n        }\n    }\n#endif // __mips_msa\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_msa(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_winograd_convolution && (opt.use_winograd23_convolution || opt.use_winograd43_convolution) && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            if ((opt.use_winograd43_convolution && num_input >= 16 && num_output >= 16) || !opt.use_winograd23_convolution)\n            {\n                conv3x3s1_winograd43_msa(bottom_blob_bordered, top_blob, weight_winograd43_data, bias_data, opt);\n            }\n            else if (opt.use_winograd23_convolution)\n            {\n                conv3x3s1_winograd23_msa(bottom_blob_bordered, top_blob, weight_winograd23_data, bias_data, opt);\n            }\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            
convolution_im2col_sgemm_msa(bottom_blob_bordered, top_blob, weight_sgemm_data, bias_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n\n            if (activation)\n            {\n                activation->forward_inplace(top_blob, opt);\n            }\n        }\n        else\n        {\n            const int maxk = kernel_w * kernel_h;\n\n            // kernel offsets\n            std::vector<int> _space_ofs(maxk);\n            int* space_ofs = &_space_ofs[0];\n            {\n                int p1 = 0;\n                int p2 = 0;\n                int gap = w * dilation_h - kernel_w * dilation_w;\n                for (int i = 0; i < kernel_h; i++)\n                {\n                    for (int j = 0; j < kernel_w; j++)\n                    {\n                        space_ofs[p1] = p2;\n                        p1++;\n                        p2 += dilation_w;\n                    }\n                    p2 += gap;\n                }\n            }\n\n            // num_output\n            #pragma omp parallel for num_threads(opt.num_threads)\n            for (int p = 0; p < num_output; p++)\n            {\n                float* outptr = top_blob.channel(p);\n\n                for (int i = 0; i < outh; i++)\n                {\n                    for (int j = 0; j < outw; j++)\n                    {\n                        float sum = 0.f;\n\n                        if (bias_term)\n                        {\n                            sum = bias_data[p];\n                        }\n\n                        const float* kptr = (const float*)weight_data_tm + maxk * channels * p;\n\n                        // channels\n                        for (int q = 0; q < channels; q++)\n                        {\n                            const Mat m = bottom_blob_bordered.channel(q);\n                            const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                            for (int k = 0; k < maxk; k++)\n          
                  {\n                                float val = sptr[space_ofs[k]];\n                                float wt = kptr[k];\n                                sum += val * wt;\n                            }\n\n                            kptr += maxk;\n                        }\n\n                        sum = activation_ss(sum, activation_type, activation_params);\n\n                        outptr[j] = sum;\n                    }\n\n                    outptr += outw;\n                }\n            }\n        }\n    }\n\n    return 0;\n}\n\nint Convolution_mips::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const\n{\n    const Mat& bottom_blob = bottom_blobs[0];\n    const Mat& _weight_data = bottom_blobs[1];\n    Mat& top_blob = top_blobs[0];\n\n    const int _kernel_w = _weight_data.w;\n    const int _kernel_h = _weight_data.h;\n    const int _num_output = _weight_data.c * _weight_data.elempack;\n\n    Mat weight_data_flattened;\n    flatten(_weight_data, weight_data_flattened, opt);\n    if (weight_data_flattened.empty())\n        return -100;\n\n    // weight_data_flattened as pack1\n    weight_data_flattened.w *= weight_data_flattened.elempack;\n    weight_data_flattened.elemsize /= weight_data_flattened.elempack;\n    weight_data_flattened.elempack = 1;\n\n    Mat bias_data_flattened;\n    if (bias_term)\n    {\n        const Mat& _bias_data = bottom_blobs[2];\n        flatten(_bias_data, bias_data_flattened, opt);\n        if (bias_data_flattened.empty())\n            return -100;\n\n        // bias_data_flattened as pack1\n        bias_data_flattened.w *= bias_data_flattened.elempack;\n        bias_data_flattened.elemsize /= bias_data_flattened.elempack;\n        bias_data_flattened.elempack = 1;\n    }\n\n    ncnn::Layer* op = ncnn::create_layer_cpu(ncnn::LayerType::Convolution);\n\n    ncnn::ParamDict pd;\n    pd.set(0, _num_output);\n    pd.set(1, _kernel_w);\n    pd.set(11, _kernel_h);\n    
pd.set(2, dilation_w);\n    pd.set(12, dilation_h);\n    pd.set(3, stride_w);\n    pd.set(13, stride_h);\n    pd.set(4, pad_left);\n    pd.set(15, pad_right);\n    pd.set(14, pad_top);\n    pd.set(16, pad_bottom);\n    pd.set(18, pad_value);\n    pd.set(5, bias_term);\n    pd.set(6, weight_data_flattened.w);\n    pd.set(8, int8_scale_term);\n    pd.set(9, activation_type);\n    pd.set(10, activation_params);\n\n    op->load_param(pd);\n\n    ncnn::Mat weights[2];\n    weights[0] = weight_data_flattened;\n    weights[1] = bias_data_flattened;\n\n    op->load_model(ncnn::ModelBinFromMatArray(weights));\n\n    op->create_pipeline(opt);\n\n    op->forward(bottom_blob, top_blob, opt);\n\n    op->destroy_pipeline(opt);\n\n    delete op;\n\n    return 0;\n}\n\n#if NCNN_INT8\nstatic void convolution_transform_kernel_packed_int8_msa(const Mat& weight_data, Mat& weight_data_tm, int num_input, int num_output, int kernel_w, int kernel_h, int elempack, int out_elempack)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // src = kw-kh-inch-outch\n    // dst = pa-pb-kw-kh-inch/pa-outch/pb\n    {\n        Mat weight_data_r2 = weight_data.reshape(maxk, num_input, num_output);\n\n        weight_data_tm.create(maxk, num_input / elempack, num_output / out_elempack, (size_t)elempack * out_elempack, elempack * out_elempack);\n\n        for (int q = 0; q + (out_elempack - 1) < num_output; q += out_elempack)\n        {\n            signed char* g00 = weight_data_tm.channel(q / out_elempack);\n\n            for (int p = 0; p + (elempack - 1) < num_input; p += elempack)\n            {\n                for (int k = 0; k < maxk; k++)\n                {\n                    for (int i = 0; i < out_elempack; i++)\n                    {\n                        for (int j = 0; j < elempack; j++)\n                        {\n                            const signed char* k00 = weight_data_r2.channel(q + i).row<const signed char>(p + j);\n\n                            g00[0] = k00[k];\n\n        
                    g00++;\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n\nint Convolution_mips::create_pipeline_int8_mips(const Option& opt)\n{\n    const int maxk = kernel_w * kernel_h;\n    const int num_input = weight_data_size / maxk / num_output;\n\n    int elempack = 1;\n    int out_elempack = 1;\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        elempack = num_input % 8 == 0 ? 8 : 1;\n        out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __mips_msa\n\n#if __mips_msa\n    if (elempack == 8 && out_elempack == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to4_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to4_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_transform_kernel_pack8to4_int8_msa(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to4_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_int8_msa(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n\n    if (elempack == 1 && 
out_elempack == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack1to4_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack1to4_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack1to4_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_int8_msa(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n\n    if (elempack == 8 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to1_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to1_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_transform_kernel_pack8to1_int8_msa(weight_data, 
weight_winograd43_data, num_input, num_output, opt);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_transform_kernel_pack8to1_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            convolution_transform_kernel_packed_int8_msa(weight_data, weight_data_tm, num_input, num_output, kernel_w, kernel_h, elempack, out_elempack);\n        }\n    }\n#endif // __mips_msa\n\n    if (elempack == 1 && out_elempack == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            convolution_im2col_sgemm_transform_kernel_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            convolution_im2col_sgemm_transform_kernel_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_transform_kernel_int8_msa(weight_data, weight_winograd43_data, num_input, num_output, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_transform_kernel_int8_msa(weight_data, weight_sgemm_data, num_input, num_output, kernel_w, kernel_h);\n        }\n        else\n        {\n            weight_data_tm = weight_data;\n        }\n    }\n\n    scale_in_data.create(num_output);\n    for (int p = 0; p < num_output; p++)\n    {\n        // requantize and relu\n        float scale_in;\n        if (weight_data_int8_scales[p] == 
0)\n            scale_in = 0;\n        else\n            scale_in = 1.f / (bottom_blob_int8_scales[0] * weight_data_int8_scales[p]);\n\n        scale_in_data[p] = scale_in;\n    }\n\n    if (opt.lightmode)\n        weight_data.release();\n\n    return 0;\n}\n\nint Convolution_mips::forward_int8_mips(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const\n{\n    int elembits = bottom_blob.elembits();\n\n    Mat bottom_blob_int8 = bottom_blob;\n    if (elembits != 8)\n    {\n        Option opt_q = opt;\n        opt_q.blob_allocator = opt.workspace_allocator;\n        quantize_to_int8(bottom_blob, bottom_blob_int8, bottom_blob_int8_scales, opt_q);\n    }\n\n    Mat bottom_blob_bordered;\n    make_padding(bottom_blob_int8, bottom_blob_bordered, opt);\n    if (bottom_blob_bordered.empty())\n        return -100;\n\n    int w = bottom_blob_bordered.w;\n    int h = bottom_blob_bordered.h;\n    int channels = bottom_blob_bordered.c;\n    int elempack = bottom_blob_bordered.elempack;\n\n    const int kernel_extent_w = dilation_w * (kernel_w - 1) + 1;\n    const int kernel_extent_h = dilation_h * (kernel_h - 1) + 1;\n\n    int outw = (w - kernel_extent_w) / stride_w + 1;\n    int outh = (h - kernel_extent_h) / stride_h + 1;\n\n    bool use_int8_requantize = int8_scale_term > 100;\n    int out_elempack = 1;\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        if (use_int8_requantize)\n            out_elempack = num_output % 8 == 0 ? 8 : 1;\n        else\n            out_elempack = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __mips_msa\n    size_t out_elemsize = use_int8_requantize ? 
1u * out_elempack : 4u * out_elempack;\n\n    top_blob.create(outw, outh, num_output / out_elempack, out_elemsize, out_elempack, opt.blob_allocator);\n    if (top_blob.empty())\n        return -100;\n\n    const int num_input = channels * elempack;\n\n    int out_elempack_int32 = 1;\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        out_elempack_int32 = num_output % 4 == 0 ? 4 : 1;\n    }\n#endif // __mips_msa\n\n    Mat top_blob_int32;\n    top_blob_int32.create(outw, outh, num_output / out_elempack_int32, (size_t)(4u * out_elempack_int32), out_elempack_int32, opt.workspace_allocator);\n    if (top_blob_int32.empty())\n        return -100;\n\n#if __mips_msa\n    if (elempack == 8 && out_elempack_int32 == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack8to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack8to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_pack8to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_winograd43_data, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_pack8to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            convolution_pack8to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, 
stride_h, opt);\n        }\n    }\n\n    if (elempack == 1 && out_elempack_int32 == 4)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack1to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack1to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_pack1to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            convolution_pack1to4_int8_msa(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n    }\n\n    if (elempack == 8 && out_elempack_int32 == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_pack8to1_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_pack8to1_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_pack8to1_int8_msa(bottom_blob_bordered, top_blob_int32, weight_winograd43_data, 
opt);\n        }\n        else if (opt.use_sgemm_convolution) // TODO better condition && num_input >= 8 && num_output >= 8)\n        {\n            convolution_im2col_sgemm_pack8to1_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            convolution_pack8to1_int8_msa(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n    }\n#endif // __mips_msa\n\n    if (elempack == 1 && out_elempack_int32 == 1)\n    {\n        if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv1x1s1_sgemm_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (kernel_w == 1 && kernel_h == 1 && dilation_w == 1 && dilation_h == 1 && stride_w == 2 && stride_h == 2)\n        {\n            conv1x1s2_sgemm_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, opt);\n        }\n        else if (opt.use_winograd_convolution && opt.use_winograd43_convolution && kernel_w == 3 && kernel_h == 3 && dilation_w == 1 && dilation_h == 1 && stride_w == 1 && stride_h == 1)\n        {\n            conv3x3s1_winograd43_int8_msa(bottom_blob_bordered, top_blob_int32, weight_winograd43_data, opt);\n        }\n        else if (opt.use_sgemm_convolution)\n        {\n            convolution_im2col_sgemm_int8_msa(bottom_blob_bordered, top_blob_int32, weight_sgemm_data, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n        else\n        {\n            convolution_int8(bottom_blob_bordered, top_blob_int32, weight_data_tm, kernel_w, kernel_h, dilation_w, dilation_h, stride_w, stride_h, opt);\n        }\n    }\n\n#if __mips_msa\n    if (opt.use_packing_layout)\n    {\n        // NCNN_LOGE(\"top_blob_int32  %d  %d\", top_blob_int32.c, 
top_blob_int32.elempack);\n        if (use_int8_requantize)\n        {\n            // TODO implement winograd sgemm packed int8 pack1 output\n            if (top_blob_int32.elempack == 4 && top_blob_int32.c % 2 == 1)\n            {\n                Mat tmp;\n                convert_packing(top_blob_int32, tmp, 1, opt);\n                top_blob_int32 = tmp;\n            }\n            if (top_blob_int32.elempack == 4 && top_blob_int32.c % 2 == 0)\n            {\n                Mat tmp;\n                convert_packing(top_blob_int32, tmp, 8, opt);\n                top_blob_int32 = tmp;\n            }\n        }\n    }\n#endif\n\n    if (use_int8_requantize)\n    {\n        requantize_from_int32_to_int8(top_blob_int32, top_blob, scale_in_data, top_blob_int8_scales, bias_data, activation_type, activation_params, opt);\n    }\n    else\n    {\n        dequantize_from_int32(top_blob_int32, top_blob, scale_in_data, bias_data, opt);\n\n        if (activation)\n        {\n            activation->forward_inplace(top_blob, opt);\n        }\n    }\n\n    return 0;\n}\n#endif // NCNN_INT8\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/convolution_mips.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#ifndef LAYER_CONVOLUTION_MIPS_H\n#define LAYER_CONVOLUTION_MIPS_H\n\n#include \"convolution.h\"\n\nnamespace ncnn {\n\nclass Convolution_mips : public Convolution\n{\npublic:\n    Convolution_mips();\n\n    virtual int create_pipeline(const Option& opt);\n    virtual int destroy_pipeline(const Option& opt);\n\n    virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n\n    virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;\n\nprotected:\n#if NCNN_INT8\n    int create_pipeline_int8_mips(const Option& opt);\n    int forward_int8_mips(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;\n#endif\n\npublic:\n    Layer* activation;\n\n    Mat weight_data_tm;\n    Mat weight_sgemm_data;\n    Mat weight_winograd23_data;\n    Mat weight_winograd43_data;\n    Mat weight_winograd63_data;\n\n#if NCNN_INT8\n    Mat scale_in_data;\n#endif\n};\n\n} // namespace ncnn\n\n#endif // LAYER_CONVOLUTION_MIPS_H\n"
  },
  {
    "path": "src/layer/mips/convolution_mips_mmi.cpp",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\n#include \"cpu.h\"\n#include \"mat.h\"\n#if __mips_loongson_mmi\n#include \"loongson_mmi.h\"\n#endif // __mips_loongson_mmi\n\nnamespace ncnn {\n\n#include \"convolution_sgemm_int8.h\"\n#include \"convolution_winograd_transform_int8.h\"\n#include \"convolution_winograd_dot_int8.h\"\n#include \"convolution_3x3_int8.h\"\n\n// pack1\nvoid im2col_sgemm_int8_loongson_mmi(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Option& opt)\n{\n    im2col_sgemm_int8_msa(bottom_im2col, top_blob, kernel, opt);\n}\n\nvoid convolution_im2col_sgemm_transform_kernel_int8_loongson_mmi(const Mat& kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    convolution_im2col_sgemm_transform_kernel_int8_msa(kernel, kernel_tm, inch, outch, kernel_w, kernel_h);\n}\n\nvoid conv3x3s1_winograd43_transform_kernel_int8_loongson_mmi(const Mat& kernel, Mat& kernel_tm_packed, int inch, int outch, const Option& opt)\n{\n    conv3x3s1_winograd43_transform_kernel_int8_msa(kernel, kernel_tm_packed, inch, outch, opt);\n}\n\nvoid convolution_winograd_dot_int8_loongson_mmi(Mat& bottom_blob_tm, int outch, const Mat& kernel_tm, Mat& top_blob_tm, const Option& opt)\n{\n    convolution_winograd_dot_int8_msa(bottom_blob_tm, outch, kernel_tm, top_blob_tm, opt);\n}\n\n} // namespace ncnn\n"
  },
  {
    "path": "src/layer/mips/convolution_pack1to4.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack1to4_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_pack1ton, const Mat& bias_data, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, int activation_type, const Mat& activation_params, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    const float* bias_data_ptr = bias_data;\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        float* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                v4f32 _sum = (v4f32)__msa_fill_w(0);\n\n                if (bias_data_ptr)\n                {\n                    _sum = (v4f32)__msa_ld_w(bias_data_ptr + p * 4, 0);\n                }\n\n                const float* kptr = (const float*)weight_data_pack1ton + maxk * channels * p * 4;\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const float* sptr = m.row(i * stride_h) + j * stride_w;\n\n                    for (int k = 0; k < maxk; k++) // 29.23\n          
          {\n                        v4f32 _val = __msa_fill_w_f32(sptr[space_ofs[k]]);\n                        v4f32 _w = (v4f32)__msa_ld_w(kptr, 0);\n                        _sum = __msa_fmadd_w(_sum, _val, _w);\n\n                        kptr += 4;\n                    }\n                }\n\n                _sum = activation_ps(_sum, activation_type, activation_params);\n\n                __msa_st_w((v4i32)_sum, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_pack8to4_int8.h",
    "content": "// Copyright 2022 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void convolution_pack8to4_int8_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& weight_data_int8, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = bottom_blob.w;\n    int channels = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    int outch = top_blob.c;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // kernel offsets\n    std::vector<int> _space_ofs(maxk);\n    int* space_ofs = &_space_ofs[0];\n    {\n        int p1 = 0;\n        int p2 = 0;\n        int gap = w * dilation_h - kernel_w * dilation_w;\n        for (int i = 0; i < kernel_h; i++)\n        {\n            for (int j = 0; j < kernel_w; j++)\n            {\n                space_ofs[p1] = p2;\n                p1++;\n                p2 += dilation_w;\n            }\n            p2 += gap;\n        }\n    }\n\n    // num_output\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = 0; p < outch; p++)\n    {\n        int* outptr = top_blob.channel(p);\n\n        for (int i = 0; i < outh; i++)\n        {\n            for (int j = 0; j < outw; j++)\n            {\n                v4i32 _sum0 = __msa_fill_w(0);\n                v4i32 _sum1 = __msa_fill_w(0);\n                v4i32 _sum2 = __msa_fill_w(0);\n                v4i32 _sum3 = __msa_fill_w(0);\n\n                const signed char* kptr = weight_data_int8.channel(p);\n\n                // channels\n                for (int q = 0; q < channels; q++)\n                {\n                    const Mat m = bottom_blob.channel(q);\n                    const signed char* sptr = m.row<signed char>(i * stride_h) + j * stride_w * 8;\n\n                    for (int k = 0; k < maxk; k++)\n                    {\n                        v16i8 _val = __msa_ld_b(sptr + space_ofs[k] * 8, 0);\n                        v8i16 _val16 = 
(v8i16)__msa_ilvr_b(__msa_clti_s_b(_val, 0), _val);\n\n                        v16i8 _w01 = __msa_ld_b(kptr, 0);\n                        v16i8 _w23 = __msa_ld_b(kptr + 16, 0);\n                        v16i8 _extw01 = __msa_clti_s_b(_w01, 0);\n                        v16i8 _extw23 = __msa_clti_s_b(_w23, 0);\n                        v8i16 _w0 = (v8i16)__msa_ilvr_b(_extw01, _w01);\n                        v8i16 _w1 = (v8i16)__msa_ilvl_b(_extw01, _w01);\n                        v8i16 _w2 = (v8i16)__msa_ilvr_b(_extw23, _w23);\n                        v8i16 _w3 = (v8i16)__msa_ilvl_b(_extw23, _w23);\n\n                        v8i16 _s0 = __msa_mulv_h(_val16, _w0);\n                        v8i16 _s1 = __msa_mulv_h(_val16, _w1);\n                        v8i16 _s2 = __msa_mulv_h(_val16, _w2);\n                        v8i16 _s3 = __msa_mulv_h(_val16, _w3);\n\n                        _sum0 = __msa_addv_w(_sum0, __msa_hadd_s_w(_s0, _s0));\n                        _sum1 = __msa_addv_w(_sum1, __msa_hadd_s_w(_s1, _s1));\n                        _sum2 = __msa_addv_w(_sum2, __msa_hadd_s_w(_s2, _s2));\n                        _sum3 = __msa_addv_w(_sum3, __msa_hadd_s_w(_s3, _s3));\n\n                        kptr += 32;\n                    }\n                }\n\n                // transpose 4x4\n                {\n                    v4i32 _tmp0, _tmp1, _tmp2, _tmp3;\n                    _tmp0 = __msa_ilvr_w(_sum1, _sum0);\n                    _tmp1 = __msa_ilvr_w(_sum3, _sum2);\n                    _tmp2 = __msa_ilvl_w(_sum1, _sum0);\n                    _tmp3 = __msa_ilvl_w(_sum3, _sum2);\n                    _sum0 = (v4i32)__msa_ilvr_d((v2i64)_tmp1, (v2i64)_tmp0);\n                    _sum1 = (v4i32)__msa_ilvl_d((v2i64)_tmp1, (v2i64)_tmp0);\n                    _sum2 = (v4i32)__msa_ilvr_d((v2i64)_tmp3, (v2i64)_tmp2);\n                    _sum3 = (v4i32)__msa_ilvl_d((v2i64)_tmp3, (v2i64)_tmp2);\n                }\n\n                _sum0 = __msa_addv_w(_sum0, _sum1);\n             
   _sum2 = __msa_addv_w(_sum2, _sum3);\n\n                _sum0 = __msa_addv_w(_sum0, _sum2);\n\n                __msa_st_w(_sum0, outptr + j * 4, 0);\n            }\n\n            outptr += outw * 4;\n        }\n    }\n}\n"
  },
  {
    "path": "src/layer/mips/convolution_sgemm.h",
    "content": "// Copyright 2021 Tencent\n// SPDX-License-Identifier: BSD-3-Clause\n\nstatic void im2col_sgemm_msa(const Mat& bottom_im2col, Mat& top_blob, const Mat& kernel, const Mat& _bias, const Option& opt)\n{\n    // Mat bottom_im2col(size, maxk, inch, 4u, 1, opt.workspace_allocator);\n\n    const int size = bottom_im2col.w;\n    const int maxk = bottom_im2col.h;\n    const int inch = bottom_im2col.c;\n\n    const int outch = top_blob.c;\n\n    const float* bias = _bias;\n\n    // permute\n    Mat tmp;\n    if (size >= 4)\n        tmp.create(4 * maxk, inch, size / 4 + size % 4, 4u, 1, opt.workspace_allocator);\n    else\n        tmp.create(maxk, inch, size, 4u, 1, opt.workspace_allocator);\n    {\n        int nn_size = size / 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int ii = 0; ii < nn_size; ii++)\n        {\n            int i = ii * 4;\n\n            float* tmpptr = tmp.channel(i / 4);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n#if __mips_msa\n                    __msa_st_w(__msa_ld_w(img0, 0), tmpptr, 0);\n#else\n                    tmpptr[0] = img0[0];\n                    tmpptr[1] = img0[1];\n                    tmpptr[2] = img0[2];\n                    tmpptr[3] = img0[3];\n#endif\n                    img0 += size;\n                    tmpptr += 4;\n                }\n            }\n        }\n\n        int remain_size_start = nn_size * 4;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int i = remain_size_start; i < size; i++)\n        {\n            float* tmpptr = tmp.channel(i / 4 + i % 4);\n\n            for (int q = 0; q < inch; q++)\n            {\n                const float* img0 = (const float*)bottom_im2col.channel(q) + i;\n\n                for (int k = 0; k < maxk; k++)\n                {\n              
      tmpptr[0] = img0[0];\n                    img0 += size;\n                    tmpptr += 1;\n                }\n            }\n        }\n    }\n\n#if __mips_msa\n    int nn_outch = outch >> 3;\n    int remain_outch_start = nn_outch << 3;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 8;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n        float* outptr2 = top_blob.channel(p + 2);\n        float* outptr3 = top_blob.channel(p + 3);\n        float* outptr4 = top_blob.channel(p + 4);\n        float* outptr5 = top_blob.channel(p + 5);\n        float* outptr6 = top_blob.channel(p + 6);\n        float* outptr7 = top_blob.channel(p + 7);\n\n        const float zeros[8] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};\n        const float* biasptr = bias ? bias + p : zeros;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n            const float* kptr = kernel.channel(p / 8);\n\n            int nn = inch * maxk; // inch always > 0\n\n            v4f32 _sum0 = __msa_fill_w_f32(biasptr[0]);\n            v4f32 _sum1 = __msa_fill_w_f32(biasptr[1]);\n            v4f32 _sum2 = __msa_fill_w_f32(biasptr[2]);\n            v4f32 _sum3 = __msa_fill_w_f32(biasptr[3]);\n            v4f32 _sum4 = __msa_fill_w_f32(biasptr[4]);\n            v4f32 _sum5 = __msa_fill_w_f32(biasptr[5]);\n            v4f32 _sum6 = __msa_fill_w_f32(biasptr[6]);\n            v4f32 _sum7 = __msa_fill_w_f32(biasptr[7]);\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 32);\n                v4f32 _val = (v4f32)__msa_ld_w(tmpptr, 0);\n                v4i32 _w0123 = __msa_ld_w(kptr, 0);\n                v4i32 _w4567 = __msa_ld_w(kptr + 4, 0);\n                _sum0 = __msa_fmadd_w(_sum0, _val, 
(v4f32)__msa_splati_w(_w0123, 0));\n                _sum1 = __msa_fmadd_w(_sum1, _val, (v4f32)__msa_splati_w(_w0123, 1));\n                _sum2 = __msa_fmadd_w(_sum2, _val, (v4f32)__msa_splati_w(_w0123, 2));\n                _sum3 = __msa_fmadd_w(_sum3, _val, (v4f32)__msa_splati_w(_w0123, 3));\n                _sum4 = __msa_fmadd_w(_sum4, _val, (v4f32)__msa_splati_w(_w4567, 0));\n                _sum5 = __msa_fmadd_w(_sum5, _val, (v4f32)__msa_splati_w(_w4567, 1));\n                _sum6 = __msa_fmadd_w(_sum6, _val, (v4f32)__msa_splati_w(_w4567, 2));\n                _sum7 = __msa_fmadd_w(_sum7, _val, (v4f32)__msa_splati_w(_w4567, 3));\n\n                tmpptr += 4;\n                kptr += 8;\n            }\n\n            __msa_st_w((v4i32)_sum0, outptr0, 0);\n            __msa_st_w((v4i32)_sum1, outptr1, 0);\n            __msa_st_w((v4i32)_sum2, outptr2, 0);\n            __msa_st_w((v4i32)_sum3, outptr3, 0);\n            __msa_st_w((v4i32)_sum4, outptr4, 0);\n            __msa_st_w((v4i32)_sum5, outptr5, 0);\n            __msa_st_w((v4i32)_sum6, outptr6, 0);\n            __msa_st_w((v4i32)_sum7, outptr7, 0);\n\n            outptr0 += 4;\n            outptr1 += 4;\n            outptr2 += 4;\n            outptr3 += 4;\n            outptr4 += 4;\n            outptr5 += 4;\n            outptr6 += 4;\n            outptr7 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n            const float* kptr = kernel.channel(p / 8);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = biasptr[0];\n            float sum1 = biasptr[1];\n            float sum2 = biasptr[2];\n            float sum3 = biasptr[3];\n            float sum4 = biasptr[4];\n            float sum5 = biasptr[5];\n            float sum6 = biasptr[6];\n            float sum7 = biasptr[7];\n\n            for (int q = 0; q < nn; q++)\n            {\n                sum0 += tmpptr[0] * kptr[0];\n             
   sum1 += tmpptr[0] * kptr[1];\n                sum2 += tmpptr[0] * kptr[2];\n                sum3 += tmpptr[0] * kptr[3];\n                sum4 += tmpptr[0] * kptr[4];\n                sum5 += tmpptr[0] * kptr[5];\n                sum6 += tmpptr[0] * kptr[6];\n                sum7 += tmpptr[0] * kptr[7];\n                tmpptr++;\n                kptr += 8;\n            }\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n            outptr2[0] = sum2;\n            outptr3[0] = sum3;\n            outptr4[0] = sum4;\n            outptr5[0] = sum5;\n            outptr6[0] = sum6;\n            outptr7[0] = sum7;\n\n            outptr0++;\n            outptr1++;\n            outptr2++;\n            outptr3++;\n            outptr4++;\n            outptr5++;\n            outptr6++;\n            outptr7++;\n        }\n    }\n\n    nn_outch = (outch - remain_outch_start) >> 2;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = remain_outch_start + pp * 4;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n        float* outptr2 = top_blob.channel(p + 2);\n        float* outptr3 = top_blob.channel(p + 3);\n\n        const float zeros[4] = {0.f, 0.f, 0.f, 0.f};\n        const float* biasptr = bias ? 
bias + p : zeros;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            v4f32 _sum0 = __msa_fill_w_f32(biasptr[0]);\n            v4f32 _sum1 = __msa_fill_w_f32(biasptr[1]);\n            v4f32 _sum2 = __msa_fill_w_f32(biasptr[2]);\n            v4f32 _sum3 = __msa_fill_w_f32(biasptr[3]);\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 16);\n                v4f32 _val = (v4f32)__msa_ld_w(tmpptr, 0);\n                v4i32 _w0123 = __msa_ld_w(kptr, 0);\n                _sum0 = __msa_fmadd_w(_sum0, _val, (v4f32)__msa_splati_w(_w0123, 0));\n                _sum1 = __msa_fmadd_w(_sum1, _val, (v4f32)__msa_splati_w(_w0123, 1));\n                _sum2 = __msa_fmadd_w(_sum2, _val, (v4f32)__msa_splati_w(_w0123, 2));\n                _sum3 = __msa_fmadd_w(_sum3, _val, (v4f32)__msa_splati_w(_w0123, 3));\n\n                tmpptr += 4;\n                kptr += 4;\n            }\n\n            __msa_st_w((v4i32)_sum0, outptr0, 0);\n            __msa_st_w((v4i32)_sum1, outptr1, 0);\n            __msa_st_w((v4i32)_sum2, outptr2, 0);\n            __msa_st_w((v4i32)_sum3, outptr3, 0);\n\n            outptr0 += 4;\n            outptr1 += 4;\n            outptr2 += 4;\n            outptr3 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = biasptr[0];\n            float sum1 = biasptr[1];\n            float sum2 = biasptr[2];\n            float sum3 = biasptr[3];\n\n            for (int q = 0; q < nn; q++)\n            {\n                sum0 += tmpptr[0] 
* kptr[0];\n                sum1 += tmpptr[0] * kptr[1];\n                sum2 += tmpptr[0] * kptr[2];\n                sum3 += tmpptr[0] * kptr[3];\n                tmpptr++;\n                kptr += 4;\n            }\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n            outptr2[0] = sum2;\n            outptr3[0] = sum3;\n\n            outptr0++;\n            outptr1++;\n            outptr2++;\n            outptr3++;\n        }\n    }\n\n    remain_outch_start += nn_outch << 2;\n#else // __mips_msa\n    int nn_outch = outch >> 1;\n    int remain_outch_start = nn_outch << 1;\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int pp = 0; pp < nn_outch; pp++)\n    {\n        int p = pp * 2;\n\n        float* outptr0 = top_blob.channel(p);\n        float* outptr1 = top_blob.channel(p + 1);\n\n        const float zeros[2] = {0.f, 0.f};\n        const float* biasptr = bias ? bias + p : zeros;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n            const float* kptr = kernel.channel(p / 2);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum00 = biasptr[0];\n            float sum01 = biasptr[0];\n            float sum02 = biasptr[0];\n            float sum03 = biasptr[0];\n            float sum10 = biasptr[1];\n            float sum11 = biasptr[1];\n            float sum12 = biasptr[1];\n            float sum13 = biasptr[1];\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 8);\n                float k0 = kptr[0];\n                float k1 = kptr[1];\n                sum00 += tmpptr[0] * k0;\n                sum01 += tmpptr[1] * k0;\n                sum02 += tmpptr[2] * k0;\n                sum03 += tmpptr[3] * k0;\n                sum10 += tmpptr[0] * k1;\n                sum11 += tmpptr[1] * k1;\n                
sum12 += tmpptr[2] * k1;\n                sum13 += tmpptr[3] * k1;\n                tmpptr += 4;\n                kptr += 2;\n            }\n\n            outptr0[0] = sum00;\n            outptr0[1] = sum01;\n            outptr0[2] = sum02;\n            outptr0[3] = sum03;\n            outptr1[0] = sum10;\n            outptr1[1] = sum11;\n            outptr1[2] = sum12;\n            outptr1[3] = sum13;\n\n            outptr0 += 4;\n            outptr1 += 4;\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n            const float* kptr = kernel.channel(p / 2);\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = biasptr[0];\n            float sum1 = biasptr[1];\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 4);\n                __builtin_prefetch(kptr + 8);\n                sum0 += tmpptr[0] * kptr[0];\n                sum1 += tmpptr[0] * kptr[1];\n                tmpptr++;\n                kptr += 2;\n            }\n\n            outptr0[0] = sum0;\n            outptr1[0] = sum1;\n\n            outptr0++;\n            outptr1++;\n        }\n    }\n#endif // __mips_msa\n\n    #pragma omp parallel for num_threads(opt.num_threads)\n    for (int p = remain_outch_start; p < outch; p++)\n    {\n        float* outptr0 = top_blob.channel(p);\n\n        const float bias0 = bias ? 
bias[p] : 0.f;\n\n        int i = 0;\n        for (; i + 3 < size; i += 4)\n        {\n            const float* tmpptr = tmp.channel(i / 4);\n#if __mips_msa\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4 + p % 4);\n#else\n            const float* kptr = kernel.channel(p / 2 + p % 2);\n#endif\n\n            int nn = inch * maxk; // inch always > 0\n\n#if __mips_msa\n            v4f32 _sum0 = __msa_fill_w_f32(bias0);\n\n            for (int q = 0; q < nn; q++)\n            {\n                _sum0 = __msa_fmadd_w(_sum0, __msa_fill_w_f32(kptr[0]), (v4f32)__msa_ld_w(tmpptr, 0));\n                tmpptr += 4;\n                kptr++;\n            }\n\n            __msa_st_w((v4i32)_sum0, outptr0, 0);\n\n            outptr0 += 4;\n#else\n            float sum0 = bias0;\n            float sum1 = bias0;\n            float sum2 = bias0;\n            float sum3 = bias0;\n\n            for (int q = 0; q < nn; q++)\n            {\n                __builtin_prefetch(tmpptr + 16);\n                __builtin_prefetch(kptr + 4);\n                sum0 += tmpptr[0] * kptr[0];\n                sum1 += tmpptr[1] * kptr[0];\n                sum2 += tmpptr[2] * kptr[0];\n                sum3 += tmpptr[3] * kptr[0];\n                tmpptr += 4;\n                kptr++;\n            }\n\n            outptr0[0] = sum0;\n            outptr0[1] = sum1;\n            outptr0[2] = sum2;\n            outptr0[3] = sum3;\n\n            outptr0 += 4;\n#endif // __mips_msa\n        }\n        for (; i < size; i++)\n        {\n            const float* tmpptr = tmp.channel(i / 4 + i % 4);\n#if __mips_msa\n            const float* kptr = kernel.channel(p / 8 + (p % 8) / 4 + p % 4);\n#else\n            const float* kptr = kernel.channel(p / 2 + p % 2);\n#endif\n\n            int nn = inch * maxk; // inch always > 0\n\n            float sum0 = bias0;\n\n            for (int q = 0; q < nn; q++)\n            {\n                sum0 += tmpptr[0] * kptr[0];\n                tmpptr++;\n  
              kptr++;\n            }\n\n            outptr0[0] = sum0;\n\n            outptr0++;\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_transform_kernel_msa(const Mat& _kernel, Mat& kernel_tm, int inch, int outch, int kernel_w, int kernel_h)\n{\n    const int maxk = kernel_w * kernel_h;\n\n    // interleave\n    // src = maxk-inch-outch\n    // dst = 8b-maxk-inch-outch/8b\n    Mat kernel = _kernel.reshape(maxk, inch, outch);\n#if __mips_msa\n    kernel_tm.create(8 * maxk, inch, outch / 8 + (outch % 8) / 4 + outch % 4);\n#else\n    kernel_tm.create(2 * maxk, inch, outch / 2 + outch % 2);\n#endif\n\n    int q = 0;\n#if __mips_msa\n    for (; q + 7 < outch; q += 8)\n    {\n        const Mat k0 = kernel.channel(q);\n        const Mat k1 = kernel.channel(q + 1);\n        const Mat k2 = kernel.channel(q + 2);\n        const Mat k3 = kernel.channel(q + 3);\n        const Mat k4 = kernel.channel(q + 4);\n        const Mat k5 = kernel.channel(q + 5);\n        const Mat k6 = kernel.channel(q + 6);\n        const Mat k7 = kernel.channel(q + 7);\n\n        float* g00 = kernel_tm.channel(q / 8);\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n            const float* k10 = k1.row(p);\n            const float* k20 = k2.row(p);\n            const float* k30 = k3.row(p);\n            const float* k40 = k4.row(p);\n            const float* k50 = k5.row(p);\n            const float* k60 = k6.row(p);\n            const float* k70 = k7.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n                g00[1] = k10[k];\n                g00[2] = k20[k];\n                g00[3] = k30[k];\n                g00[4] = k40[k];\n                g00[5] = k50[k];\n                g00[6] = k60[k];\n                g00[7] = k70[k];\n\n                g00 += 8;\n            }\n        }\n    }\n    for (; q + 3 < outch; q += 4)\n    {\n        const Mat k0 = 
kernel.channel(q);\n        const Mat k1 = kernel.channel(q + 1);\n        const Mat k2 = kernel.channel(q + 2);\n        const Mat k3 = kernel.channel(q + 3);\n\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4);\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n            const float* k10 = k1.row(p);\n            const float* k20 = k2.row(p);\n            const float* k30 = k3.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n                g00[1] = k10[k];\n                g00[2] = k20[k];\n                g00[3] = k30[k];\n\n                g00 += 4;\n            }\n        }\n    }\n#else\n    for (; q + 1 < outch; q += 2)\n    {\n        const Mat k0 = kernel.channel(q);\n        const Mat k1 = kernel.channel(q + 1);\n\n        float* g00 = kernel_tm.channel(q / 2);\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n            const float* k10 = k1.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n                g00[1] = k10[k];\n\n                g00 += 2;\n            }\n        }\n    }\n#endif // __mips_msa\n    for (; q < outch; q++)\n    {\n        const Mat k0 = kernel.channel(q);\n\n#if __mips_msa\n        float* g00 = kernel_tm.channel(q / 8 + (q % 8) / 4 + q % 4);\n#else\n        float* g00 = kernel_tm.channel(q / 2 + q % 2);\n#endif\n\n        for (int p = 0; p < inch; p++)\n        {\n            const float* k00 = k0.row(p);\n\n            for (int k = 0; k < maxk; k++)\n            {\n                g00[0] = k00[k];\n\n                g00 += 1;\n            }\n        }\n    }\n}\n\nstatic void convolution_im2col_sgemm_msa(const Mat& bottom_blob, Mat& top_blob, const Mat& kernel, const Mat& _bias, int kernel_w, int kernel_h, int dilation_w, int dilation_h, int stride_w, int stride_h, const Option& opt)\n{\n    int w = 
bottom_blob.w;\n    int inch = bottom_blob.c;\n\n    int outw = top_blob.w;\n    int outh = top_blob.h;\n    const int size = outw * outh;\n\n    const int maxk = kernel_w * kernel_h;\n\n    // im2col\n    Mat bottom_im2col(size, maxk, inch, 4u, 1, opt.workspace_allocator);\n    {\n        const int gap = w * stride_h - outw * stride_w;\n\n        #pragma omp parallel for num_threads(opt.num_threads)\n        for (int p = 0; p < inch; p++)\n        {\n            const Mat img = bottom_blob.channel(p);\n            float* ptr = bottom_im2col.channel(p);\n\n            for (int u = 0; u < kernel_h; u++)\n            {\n                for (int v = 0; v < kernel_w; v++)\n                {\n                    const float* sptr = img.row<const float>(dilation_h * u) + dilation_w * v;\n\n                    for (int i = 0; i < outh; i++)\n                    {\n                        int j = 0;\n                        for (; j < outw; j++)\n                        {\n                            ptr[0] = sptr[0];\n\n                            sptr += stride_w;\n                            ptr += 1;\n                        }\n\n                        sptr += gap;\n                    }\n                }\n            }\n        }\n    }\n\n    im2col_sgemm_msa(bottom_im2col, top_blob, kernel, _bias, opt);\n}\n"
  }
]